Date: Fri, 29 Oct 1993 01:13:05 -0400
From: Jeff DelPapa
Subject: voice recognition systems
To: Multiple recipients of list SOREHAND
In-reply-to: Rob Liebschutz's message of Thu, 28 Oct 1993 20:45:38 PDT <199310290346.AA28970@world.std.com>
X-To: SOREHAND%UCSFVM.BITNET@cmsa.Berkeley.EDU
X-Cc: rob@rjl.com

Rob,

I have been using voice recognition with Sun boxes for about a year now. I have DragonDictate 30K, and couldn't envision using any of the smaller-vocabulary systems. Things like IN3 and the other small-vocabulary systems aren't much use to someone who is seriously injured. I will disagree with the IN3 author (who posts here) that a small-vocabulary system is particularly useful. The theory with the small systems is that if you restrict yourself to qwerty, and don't hit any of the funky shift keys, it will be enough. Many of us are too injured for that, and in any case the small systems don't have enough words to seriously replace my use of funny key combinations. I am an emacs user, and have devoted over 500 words to it. My private vocabulary was 1400 words when I upgraded to V2. (I know this because the upgrade required rebuilding my voice models, so I got a very good count of my vocabulary.)

Having said that, you should seriously consider using one of the large-vocabulary systems, preferably with a2x. a2x (and a lot of other stuff, like the FAQ) can be found in the /pub/typing-injury directories on soda.berkeley.edu.

Dragon and Kurzweil provide large-vocabulary, discrete-utterance, speaker-adapting, full-text recognition systems. What this means is that you get lots of words (30,000 active from a 100,000-word dictionary plus 5,000 user-defined from Dragon; 50,000 active from a 200,000-word dictionary plus 10,000 user-defined from Kurzweil); you have to put short pauses between each word so it can find the boundaries (it is particularly annoying to share an office with someone using a discrete-utterance system, though there is a bonus that will be mentioned below); and there is some training involved. Full text means that the voice system doesn't improve its chances by limiting its choices to the commands available at that moment -- you can use words in any order, an obvious necessity when you need real dictation.

Both systems come with various sets of voice data -- this is an average taken from many people saying all the words in the dictionary. The cost of amassing such a database is one of the real barriers to entry if you want to get into the game. Anyhow, you identify yourself and go thru an introduction phase, reading it some small fraction of the words. The system will choose the database you are closest to, and start building a set of models specific to you. The base models are good for 80% (or better) accuracy right then, but as you correct its errors, it refines your model and gets to an accuracy of about 95%. (Much better than that is almost impossible -- English is too homophonic, and grammar-based word prediction can only go so far. You can play games with pronunciation, inventing a different way to say "there" and "their", but I haven't myself.)

Discrete utterance is a mixed blessing -- it is annoying to have to talk that way (and it is very speed limiting -- 35-40 wpm is about top speed), but you can use short phrases as an "out-of-band" macro: thus I say lispworkssupportaddress and it emits lispworks-support@harlequin.com (one case where saying is faster than typing). The phrases can be up to 5 seconds in length, and generate up to 1,000 keystrokes.
I have a few that emit whole blocks of boilerplate.

The systems are fairly pricey -- Dragon wants $4995, and Kurzweil wants $6,000 (about $0.15/word). [note: since this was originally written, Kurzweil slashed their price to approx $3000 -- dwallach] They also want an IBM PC to live in, and that needs a fair amount of power. Dragon claims that it will run on a 386/25, but I have never tried it on less than a 486/33. Kurzweil wants a 486/33DX as a minimum, and suggests a /66. These are the world's largest DOS TSRs -- Dragon needs 12 meg of memory at a minimum; Kurzweil wants at least 24 (or perhaps 32). You can buy the DOS box for under $2,000 if you will take a clone on the desk; transportability (one of the concrete-block-sized "lunchboxes") adds $500 or so, and the best you can do is the recently discontinued but still slightly available Toshiba 6400 for about $5,000.

You mentioned use with unix. The best way we have is by using a2x, a program developed by Bob Scheifler, head of the MIT X Consortium and himself a typing injury victim. a2x itself is free, but you can spend as much as $1,000 for network software and hardware to give the best result. (It is operable without any of it, but it is least cumbersome when used with a network -- this is described in the a2x manual.) a2x provides a fairly complex way of getting keystrokes to an X client. You run the program on the unix machine; it takes ascii from the recognition program and turns it into special events that it sends to your X server, which are transmuted by the XTEST protocol extension (X11R5 patch 18, the DEC X server, or NCD X terminals) into keystroke events and delivered to whichever client owns input focus. (A rough sketch of the XTEST trick is appended at the end of this message.) It can also replace the mouse. Crufty (especially when you remote the DOS window so you see the menus on the same screen as the rest of your X stuff), but there are a number of us using it daily. We have a fairly quiet mailing list (send to a2x-users-request@x.org to join), and we are starting to pass around voice macros, etc. All of us are currently using DragonDictate, partly because Kurzweil's Voice was not available when a lot of us bought our systems (it first shipped in January), and partly because we all have doubts that it could easily be made to work with DESQview/X, which is how we get the menus on the X server screen. (I would like to be proved wrong on that one, because it has a larger vocabulary than Dragon has.)

Success as a programmer very much depends on the language you use, how much of your code is original to you, etc. If you use a language with a small number of reserved words (like C) and do mostly original work, so you can choose dictation-friendly variable and function names (composed of correctly spelled words in monocase), you can do pretty well. If you are working in an enormous language (in my case Common Lisp), on an existing body of code, you will have some problems coping with all the funny words (defmumble) and all the variables like *Rndom-all-strungtogether-with-MixeD-caseandsome-vwls-rmvd*, especially since there are far more of them than you have total words available. I personally get frustrated when I can't generate code quickly and with "grace", so I now run a software support organization, and most of my day is spent producing regular text, which the system is well suited to doing. (As it can spell and I can't, there was an actual improvement in quality when I switched to voice.)

I will be happy to chat with you (saving your hands); I can be reached during the day at 617 252 0052.
I have no connections with any of the voice recognition companies, save as a grateful client. (I mean that quite literally: it was so liberating to be able to communicate with my friends again that, the day I got it installed, I was up until 5 AM sending email. For the first time in over a year, I could interact without pain.)
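
P.S. For the curious, here is a rough sketch of the XTEST trick that a2x relies on. This is NOT a2x (get the real thing from soda.berkeley.edu); it is just a toy to show the core idea -- fake key press/release events with XTestFakeKeyEvent and let the server deliver them to whichever client owns input focus. The keysym lookup and the hard-wired "hello" string are only there for the example, and it ignores shifted characters entirely.

    /* fakekeys.c -- toy illustration of injecting keystrokes via XTEST.
       Build (roughly): cc fakekeys.c -lXtst -lX11 */
    #include <stdio.h>
    #include <string.h>
    #include <X11/Xlib.h>
    #include <X11/extensions/XTest.h>

    /* Fake a press and release of the key that produces character 'c'.
       (Unshifted Latin letters only -- a real tool has to handle shift,
       control, and all the funky keys.) */
    static void send_char(Display *dpy, char c)
    {
        char name[2];
        KeySym sym;
        KeyCode code;

        name[0] = c;
        name[1] = '\0';
        sym = XStringToKeysym(name);            /* e.g. "a" -> XK_a */
        code = XKeysymToKeycode(dpy, sym);
        if (code == 0)
            return;                             /* no such key on this server */
        XTestFakeKeyEvent(dpy, code, True, 0);  /* press */
        XTestFakeKeyEvent(dpy, code, False, 0); /* release */
    }

    int main(void)
    {
        Display *dpy;
        int ev, err, major, minor;
        const char *text = "hello";   /* pretend this came from the recognizer */
        size_t i;

        dpy = XOpenDisplay(NULL);
        if (dpy == NULL) {
            fprintf(stderr, "cannot open display\n");
            return 1;
        }
        /* The server must support the XTEST extension for this to work. */
        if (!XTestQueryExtension(dpy, &ev, &err, &major, &minor)) {
            fprintf(stderr, "server lacks the XTEST extension\n");
            return 1;
        }
        for (i = 0; i < strlen(text); i++)
            send_char(dpy, text[i]);
        XFlush(dpy);
        XCloseDisplay(dpy);
        return 0;
    }

The faked keystrokes land in whatever window currently owns input focus, which is exactly why a2x can drive an unmodified X client.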