Test of Dictation Programs: DragonDictate for Windows 1.0 and Kurzweil Voice for Windows 1.1

John B. Armstrong, University of Ottawa (jbarm@acadvm1.uottawa.ca)

Dictation programs, as they exist today, can be very useful for enhancing the productivity of someone who has poor typing skills or who suffers from repetitive stress injury. They are invaluable to someone who is disabled and unable to use a keyboard. But they are not 100% accurate, and thus an investment in training is required. I personally find any dictation program preferable to the alternatives for someone who is unable to type, but you have to be patient with them. With training, they now achieve good recognition, but only if you maintain separate voice files for each user. This is the way they were designed, and one should not expect them to perform efficiently otherwise.

Some readers may have encountered reviews of DragonDictate for Windows and Kurzweil Voice for Windows in computer magazines. While many of the comments are certainly valid, I suspect that these reviews have been based on rather limited experience with the programs, and that they may give a somewhat inaccurate impression of their relative capabilities.

Interface: The interfaces of Dictate and Voice are fairly similar. At start-up, what you see on the screen is normally a control bar with buttons for turning the microphone on and off and for accessing options menus. When you are actually dictating, a choice list appears. In Voice, you are presented with 4 alternatives to the word the program believes you said. In Dictate, you can select up to 8 alternatives (9, if you count "none of the above"). [However, when we were running recognition accuracy comparisons, we selected 4 alternatives.] Dictate, in addition to allowing flexibility in the number of alternatives displayed, also allows you to pick a font and font size for the list. We found this very useful, even for someone without a serious visual impairment.
Running with an 18 point font resulted in a choice list that was easy to check but still did not take up excessive space. Alternatively, the application window may be narrowed and the choice list positioned alongside, so as not to obscure the text. (There is a handy program for setting this up automatically; see the end of this article.) As a third alternative, someone with a moderate visual impairment could select a font as large as 48 points, placing the choice list in the center of the screen. They could follow their dictation by watching the choice list rather than the application itself. The present version of Voice for Windows allows no customization of the choice list, and the font size is very small (about 10 points), making it difficult to read even for someone with normal vision, unless they have a large monitor.

Correcting errors: The approach to correcting recognition errors is essentially the same in both programs as long as the correct alternative appears on the choice list, but the commands differ in two ways. First, in Voice you choose an alternative by saying "take 3", while in Dictate it would be "choose 3". Second, in Voice the word the program thinks you said is unnumbered (the first alternative being 1), while in Dictate the first alternative is numbered 2.

When the correct choice does not appear on the list, you have to say "correct that" or "spell mode". You can then begin to spell the correct word using the international voice alphabet (alpha, bravo, charlie, etc.). Alternatively, you can type in your correction. With Voice, saying "correct that" (or clicking on the appropriate button) is obligatory. Otherwise you will find yourself typing directly into the application after the incorrectly chosen word. With Dictate, however, you simply start typing, and then say "choose n" or click on the choice when it appears in the choice list.

Some of Voice's commands seem counter-intuitive, but that may simply be because I became familiar with Dictate first.
For example, I find it more natural to say "shift key" before a word I want capitalized, because that's the way I'd type. With Voice you say the word first and then "initial-caps-that". Mind you, such a command would be a useful option in Dictate for those times when you forget to capitalize a word. On the other hand, when I want to shut off the microphone, I find Voice's "stop listening" both more natural and less trouble-prone than Dragon's "go to sleep". In AmiPro, "go to sleep" inevitably brought up the GO TO dialog box, which I then had to cancel before trying again. I tried retraining both "go to" and "go to sleep" umpteen times with no improvement. I even tried training Dictate to recognize "stop listening", which seemed to work fine until I tried saving my vocabulary, at which point I got a General Protection Fault!

Commands: In DragonDictate for DOS, commands are given as two or more words spoken without a break. Voice uses essentially the same format. However, in the Windows version of Dictate, instead of saying "file-menu" to pull down the file menu, you now have to say "command mode" and then "file". Not only is this awkward, but you can easily forget which mode you are in. [Certain commands, such as "choose-3" and "go-to-sleep", are still available in the "Dictate Mode".]

Dictate is supposed to be able to recognize the command structure of even an unsupported application, through its Vocabulary Manager and "Tracking Phrases" feature. I found the Vocabulary Manager to be distinctly unfriendly. If you are interested in ease of use, take a look at Creative Labs' VoiceAssist. Its tracking of menus and its interface for training commands -- even ones that involve pushing buttons -- are, in my opinion, much superior. It is, however, only a navigation program, and is not adaptive. We found that if we used the multimedia sound card for input, we could have both programs loaded simultaneously, BUT you cannot have both listening simultaneously.
Incidentally, VoiceAssist is very sensitive to the microphone employed. If you have previously trained it with its own microphone, you will have to retrain with your Dragon microphone.

Where Dictate does have a distinct advantage over Voice is in its ability to control the mouse by voice, a feature that actually does work fairly well. The major problem is that mouse movement is too rapid, even at the lowest speed, for precise control, and it is easy to overshoot your target. Furthermore, the drag feature does not work with all programs (e.g., the vector graphics program Canvas).

Tutorial: Neither tutorial is particularly useful unless you are totally unfamiliar with such products. Both are highly graphical but not highly informative. In contrast, the DOS version of Dictate has a very good tutorial that not only teaches you how to use the program, but also makes a good start on training the program to recognize your voice.

Manuals: Voice tries (and largely succeeds) to keep its manual short and simple, though occasionally one wishes for more information. Dictate provides two manuals, a short "Getting Started Guide" and a more detailed "User's Guide" running 200+ pages. Since both primary testers were already familiar with the DOS version of Dictate, we found it difficult to judge the usefulness of the manuals for a novice user. However, the shorter Guide probably provides all that is necessary to get started.

Recognition accuracy: Voice's untrained recognition is about 20% better than Dictate's, but even with Voice we have never achieved better than 83% and have averaged around 80% with a number of test users. At this level, you can use it out of the box, as long as you are patient and correct errors conscientiously. With Dictate you are in for major frustration unless you first run through the so-called "Quick Training", which involves running through about 750 words and commands -- an exercise that will take an hour or so, but is well worth it.
After you get going, Dictate learns very quickly and soon achieves recognition accuracy comparable to Voice. After you have used Voice for a while, it will ask you to go through "enrollment", a procedure that involves dictating 400 words, followed by an hour or more of computation by the computer in an attempt to improve recognition. However, as far as I could tell, "enrollment" did not result in any noticeable improvement.

Unfortunately, neither program has a built-in statistics utility (unlike the DOS version of Dictate), so it is difficult to assess accuracy on a routine basis. However, before writing this I tested all three with a standard piece of text we have been using for all our testing with untrained users. I achieved 94% accuracy with Dictate (DOS and Windows, though not the same errors), and 90% with Voice for Windows. I feel that the two versions of DragonDictate are fully trained, but that Voice may not be. Your guess is probably as good as mine as to whether it would potentially achieve the same or higher accuracy, but I think it is unlikely that it would be more than 1-2% different.

User-independence: Some would like a program that is "user-independent" -- in other words, a program that you would not have to train and that would still recognize what you are saying with acceptable accuracy. There are certainly uses to which such a program could be put, in addition to saving the individual user's time otherwise spent in training and/or correcting dictated text. Such uses might include having patients dictate medical history to a computer, or conducting a literature search through a library computer.

The programs we have been evaluating cannot, in my opinion, be put to such use. Recognition accuracy, without training, is simply not good enough. At the very least, you would have to instruct each user on how to make corrections using the keyboard and/or mouse.
Aside from the time actually spent instructing each user on how to make corrections, our experience with untrained users indicates that most find such a correction procedure tedious.

Obviously, what constitutes acceptable accuracy depends to some extent on the situation, and on whether the error made is in a word critical to obtaining the sense of what has been dictated. Having said that, can we nevertheless give a figure that would be acceptable for routine use: 85%, 90%, 95%, 98%? As a test, we took a passage from an article in the business section of our local newspaper and dictated it without correction. We were able to catch both Voice and Dictate for Windows at a similar recognition accuracy -- about 85%. Both programs made 18 errors in 114 words, though with Voice this count includes 2 words that were not recognized at all in five tries. In addition, it took 2-3 attempts to get Voice to recognize three other words. This illustrates one difference between the two programs. DragonDictate almost never fails to come up with a "guess" at what you said, no matter how inappropriate. [For example, Dictate came up with "hydrocarbon" for "underproductive", whereas Voice refused to come up with a choice in five attempts.]

The uncorrected passage was then given to a number of volunteers who attempted to determine what the subject was. The two programs made different errors, but in both cases, most were minor (present instead of past tense, missed plurals, etc.) and the general sense of the article came through. However, everyone was slightly uncomfortable and felt that they were possibly missing some critical detail. For example, how would you interpret "The employee in question, cold technical competence as an incubator..."? [It actually reads, "The employee in question, although technically competent as an engineer..."]

I suppose, given the rapid progress in technology, that we will eventually see a user-independent dictation program.
Given the wide variety of accents just within English-speaking North America, I don't think this will be easy, but in light of what has been accomplished just in the last decade, I would certainly not say it can't be done.

Summary: (5 point scale)

                          Dictate   Voice
Ease of dictation           5         5
Ease of correction          5         4
Navigation commands         3         4
Mouse commands              4         NA
Customizable interface      5         1
Initial recognition         2.5       4
Trained recognition         4.5       4.5
Tutorial                    2         2
Manuals                     4.5       4
Final grade                 A-        B+? (see below)

Conclusions: Voice, in my opinion, suffers a fatal flaw in not allowing the user to customize the choice list to a readable size. Unless you are young and have good eyesight, using Voice with a monitor smaller than 21" is a certain route to more, rather than less, stress. Correct that, and you would be left choosing between a program that does not allow true hands-free operation, but is otherwise reasonably user-friendly, and one that has the potential for completely hands-free operation but, at times, can frustrate you beyond words trying to get it to do so.

Both are slower than DragonDictate for DOS. I find it annoying having to wait on screen updates, and I can outrun the ability of either Windows program to accept input (the maximum appears to be about 45-50 words/min.). Outrunning the DOS version of Dictate is much more difficult. I have gone as high as 70 words/min., but find it difficult unless I am working with a memorized passage. Most often I am composing as I dictate, and therefore must slow down to collect my thoughts. Who, these days, is going to use a dictation program to dictate something already written? That's what scanners are for!

After about 5 months of fairly regular use, I am finally beginning to like DragonDictate for Windows. Still, if I have to make a recommendation, I would tell anyone to look seriously at the DOS version of Dictate unless there is some overriding need to operate in the Windows environment.

A word of caution: These days we see increased awareness of the risks of repetitive stress injury. More attention is being paid to ergonomic design, from keyboards to entire workstations, and to taking frequent breaks. Dictation programs are offered as a solution to the hazards of RSI, and as a way for someone who is suffering from RSI to continue working without further aggravating the problem. There is an implication that you can sit back in a comfortable chair, close your eyes, and simply talk to your computer.

It's not quite that simple. Even at 90-95% recognition, errors are made. Some errors are minor, but others, discovered later, leave you wondering what you were trying to say. Others can hang you up completely if a word has been recognized as a command. Watching your screen like a hawk is not an ideal solution either. Not moving can be as hazardous as repetitive movement. Neck muscles get tense from maintaining the same head position for a prolonged period, and that can lead to stiffness and tension headaches.

Each person will likely have to come up with their own solution for the best way to dictate. Kurzweil's manual says that "you may find it inefficient to verify every word as you say it. Instead, speak a series of words with brief pauses between each word, and verify the document at the end of a thought group." This is not bad advice, but personally, I find it more disruptive to my thought process than trying to catch each error as it occurs. This is partly why I operate with a fairly large font; I find it much easier to pick out the errors.

Sermon: I can understand why good touch typists, not suffering from RSI, feel that dictation programs are inefficient. When I could still type, I really didn't have to pay very close attention to what was appearing on the screen. Errors could generally be found later with a spell-checker. But dictation programs never make spelling errors; they make recognition errors.
This raises the question, addressed above, of how accurate speech/voice recognition has to be to be useful as a user-independent system. I have often heard that when you listen to someone speaking, you really catch only about 70% of what they say, and your mind fills in the gaps according to the context. Is the same true for printed text? I think not. We have been trained from an early age to expect printed material to have the correct words 100% of the time -- or very close to it, allowing for a few typos. But speech recognition programs do not make typos, and errors often result in a completely different word. [If the computer interprets "pepperoni" as "anchovy", you may simply end up with an angry customer, and no great damage is done. However, if the computer interprets "glaucoma" as "sarcoma", you could be in real trouble.]

Part of the problem is that you don't expect to find the wrong word in printed text, and part, I believe, is that uncorrected dictation ends up being subject to a two-step guessing process. In the first step, the computer is making a "guess" at what it thinks you said, and in the second step, the reader is put in the position, not of guessing what you said, but of trying to figure out what the computer's guess was.

We are going to find ourselves using ever more efficient grammar-checkers. It makes me wonder if, somewhere down the road, we won't simply tell the computer to "write an article on...", and it will do all the work -- and we will be content!

Wish list (both programs):
1. Faster response
2. Better tutorials
3. Critical command confirmation, so as to prevent losing your work if a command is misrecognized
4. A longer memory of the choices for previous words, now 10 for Voice and 12 for Dictate. The current limit necessitates keeping "thought groups" short.
5. An option for audible feedback of the first choice. While this might prove annoying to some, it would benefit users with poor vision.

Dictate for Windows
1. A more obvious indication you're in "Spell Mode"
2. A different command for turning off the microphone
3. A better vocabulary manager
4. A lower speed range for the mouse
5. Consistency in commands between the DOS and Windows versions

Voice for Windows
1. The ability to increase the font size of the choice list (should be the highest priority for the next release)
2. The ability to run a mouse by voice command
3. More intuitive commands

NOTES:
1. Test beds: a 486/33 PCI with 16 MB RAM, 128 kB cache, and IBM ACPA and AudioWave Platinum 16 sound cards (both work equally well with Dictate for Windows); and a 486/50 ISA with 20 MB RAM, 256 kB cache, and ACPA and Kurzweil sound cards.
2. Test passage for recognition accuracy: a 168 "word" passage (counting spoken punctuation; 152 actual words) from "Pride and Prejudice". Special thanks to student volunteers who acted as untrained users.
3. Test programs: AmiPro (for dictation), and the graphics programs PC Paintbrush, Corel DRAW, and Canvas (to test navigation and mouse control).

[My current strategy for using AmiPro with Dictate is to narrow AmiPro's window to about 2/3 the screen width and run Dictate's choice list in the other third with 24 pt text. AmiPro will still display the entire page width if you customize the size to 65%. Then increase the font to something you can see comfortably; you can always change it back to something smaller come print time. To save having to narrow the window every time you start AmiPro, install RunHere, a shareware utility available by ftp from oak.oakland.edu (it's in the SimTel/win3/desktop directory as rnhere16.zip).]

More thanks: To Kurzweil and Dragon Systems for agreeing to the tests and providing the software. To Patrick at Applied AI Systems, Kathleen at Kaitlin Computer Consultants, Al at Micro A/L Computers, and Christof at the U. of Ottawa for technical assistance.

This review was done for EASI (Equal Access to Software and Information) and originally appeared in the EASI Digest in 4 parts between Nov. 1994 and Feb. 1995.
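
A note on the arithmetic: the accuracy percentages quoted in this review are simply the share of dictated words recognized correctly. The short Python sketch below was not part of the original tests; it only illustrates the calculation. The function names are hypothetical. It assumes that every misrecognized or unrecognized word counts as one error, and (for the second function) that the 94% and 90% figures were measured on the 168-"word" test passage, which the review does not state explicitly.

```python
def recognition_accuracy(total_words: int, errors: int) -> float:
    """Return the percentage of dictated words recognized correctly."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 0 <= errors <= total_words:
        raise ValueError("errors must be between 0 and total_words")
    return 100.0 * (total_words - errors) / total_words


def implied_errors(total_words: int, accuracy_percent: float) -> int:
    """Estimate how many errors a quoted accuracy implies (rounded)."""
    return round(total_words * (100 - accuracy_percent) / 100)


# Newspaper passage from the review: 18 errors in 114 words.
print(f"{recognition_accuracy(114, 18):.1f}%")

# Assumption: the 94% and 90% figures refer to the 168-"word" passage.
print(implied_errors(168, 94), implied_errors(168, 90))
```

For the newspaper passage this works out to 96 of 114 words correct, or roughly 84%, in line with the "about 85%" quoted in the text. Under the stated assumption, 94% and 90% on the 168-"word" passage would correspond to roughly 10 and 17 errors respectively; these are estimates, not counts reported in the review.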