REVIEW OF LITERATURES ON SPEECH RECOGNITION TECHNOLOGY
BY
JASMINE ALJARALLAH
Abstract
This paper presents a brief history of Human-Computer Interaction (HCI) and its effect on the desktop environment, together with a review of the literature on speech recognition (SR). The areas covered are: the development of speech recognition, its concept, SR usability and applications, types, techniques commonly used in speech recognition, problems associated with designing speech recognition systems, some methods to enhance the accuracy of contemporary speech recognition software, and potential future research in this field.
A brief history of Human Computer Interaction
Understanding the early days of the human-computer interaction (HCI) field can help greatly in realizing where modern computing comes from, and to what extent computers may yet be developed for successful interaction with human beings. A great deal of research has been carried out in HCI and has permanently shaped computing in general. I will concisely summarise the most significant systems and technologies that have had the largest impact on HCI, such as Sketchpad, WIMP interfaces, hypertext and the first internet.
The 1960s was the decade that witnessed many improvements in personalizing the computer, as a great deal of research from different organisations was funded by governments. As a consequence, a large number of inventions played a profound part in the field of HCI. For instance, Sketchpad, invented by Ivan Sutherland in 1963, was the first graphical image manipulation interface (Sutherland, 1963). Using a light pen, objects could be grabbed, moved, expanded and copied (Myers, 1998) [23]. It was a revolution in HCI at that time and helped people interact with computers in a better way. Sketchpad is now widely considered to be the grandfather of computer-aided drafting as well as of the graphical user interface (Bissell, 1990) [24]. The most user-friendly input device, the mouse, was invented in 1964 by Douglas Engelbart, initially created as an X-Y position pointer for a display (Engelbart, 1970). As it was cheap and easy to use, it took over from the light pen (Goldberg, 1988) [21]. No one can deny that Engelbart has had a powerful impact on the major graphical desktop environments.
By the 1970s, the invention of the WIMP (windows, icons, menus, pointing) interface was credited to the Xerox Palo Alto Research Center (Swinehart et al., 1986) [22]. Later, in the mid-1980s, the Apple Macintosh popularized the WIMP model (van Dam, 1997) [23], and it is still the universal style of user interface on modern desktop computers.
In terms of windows: a split screen (tiled windows) was established by Engelbart's NLS in 1968 (Engelbart, 1994) [25]. This began with two tiled windows and moved to multiple tiled windows, which afterwards led to the idea of overlapping windows, proposed by Alan Kay in 1969 (Kay, 1977) [25]. By 1974 overlapping windows had been demonstrated in his Smalltalk system, and they are still used in graphical user interfaces today. In terms of icons, see (Kay, 1996) [26].
ARPANET, the first internet: the ARPANET was born in October 1969, developed by the Information Processing Techniques Office, and was initially limited to the team's members. In 1972 it started to expand to about thirteen sites; however, it was unknown to the public at that time (Kirstein, 1998) [28].
Hypertext, 1945: “tying two items together,” “two items to be joined” (English, 1967) [27]. The idea of connecting information stored in the machine through associative links was presented by Dr Vannevar Bush in his article “As We May Think”. In 1965 Ted Nelson coined the term “hypertext” (Nelson, 1965) [29]. Afterwards Nelson and two students from Brown University designed the Hypertext Editing System (van Dam et al., 1969) [30]. The hypertext idea led to Hyperties, followed by HyperCard from Apple and then the World Wide Web by Berners-Lee.
Today's technology has significantly changed the way users interact with the computer, through three-dimensional interfaces, gesture recognition, multimodal interfaces and speech recognition, the last of which is the subject of this review.
Literature Review
Speech is one of the most natural ways in which information can be exchanged between human beings. As humans, we learn to speak fluently from childhood without effort, and we continue performing spoken interaction fluently throughout our lives. Since it comes naturally, we do not notice the complexity of speech (Zhang, 2001) [17]. “Human speech is inherently a multi modal process that involves the analysis of the uttered acoustic signal and includes higher level knowledge sources such as grammar semantics and pragmatics” (Dupont, 2000) [3]. For centuries, attempts have been made to build machines able to produce and understand speech like a human being (Pinker, 1994 [12]; Deshmukh et al., 1999 [2]). Obviously, this would be a very valuable interface (Kandasamy, 1995) [6].
Speech recognition is the technology that makes the computer capable of recognising spoken words through an input device such as a telephone or a microphone. In some cases the spoken words are translated into commands to perform functions in the computer, for instance “open file” or “save a document”; alternatively, they can be translated into text, as in word-processing documents (Kirriemuir, 2003) [7].
It can also be defined as the transformation of an acoustic signal into a set of words (Zue et al., 1996 [18]; Mengjie, 2001 [11]). As reviewed by Deng and Yu (2005) [1], speech recognition technology enables the user to do without the standard desktop interface (i.e. menu, icon, window and mouse) by speaking to the computer, after which the acoustic signal is automatically converted into textual words.
As computers and technology have become inevitable in our daily lives, there has been large demand for vocal interaction with the machine. Since 1950 a great deal of research has been carried out, with great success in the speech recognition field, especially since the 1970s (Zhang, 2001) [17]. After the high performance of text-to-speech systems in the mid-1960s, a series of techniques and approaches were introduced one after the other in order to produce a system with a high level of spoken-word recognition. This resulted in the first positive attempt in the 1970s, when the general pattern-matching technique was introduced (Jackson, 2005) [5]. In 1972, word-processing systems and dictation were joined to devise the early speech recognition systems (Lange, 1993 [8]; Meisel, 1993 [10]). SR systems were expensive at that time and could only handle isolated speech, with pauses required between every word; additionally, these applications were limited and could not be extended, as reviewed by Jackson (2005) [5]. Thus a new technique began to be considered, based on a statistical approach, which helps especially in representing the variations in speech (Zhang, 2001) [17]. The 1980s and 1990s saw significant progress in automatic speech recognition products with high-performance algorithms, to the extent that such software for desktop dictation became affordable for almost everyone (Zue et al., 1996) [18]. On the other hand, these systems were discrete and required a gap between each word (i.e. it ... is ... very ... time ... consuming), as the computer needed time to process each individual word (Kirriemuir, 2003) [7]. Such systems were obviously doomed to failure.
Not surprisingly, after 30 years of continued growth in speech recognition algorithms and technology, continuous speech recognition systems have progressively advanced, allowing the user to speak at nearly spontaneous speed with high accuracy and freeing the user from a command-like manner of speaking. One disadvantage is that the user's speech in this kind of system does not undergo the ongoing adaptation that discrete systems provide (De La Paz, in press) [19]. However, until now automatic speech recognition (ASR) has not been widely accepted in general use, due to some drawbacks (Lai et al., 2000) [20].
ASR benefits and current applications:
Since some users cannot, for various reasons, use their hands to interact with the computer through the traditional devices (i.e. mouse and keyboard) (Rudnicky et al., 1993) [15], the computer needs to be adaptable to their particular needs. As Kirriemuir (2003) [7] pointed out, an effective automatic speech recognition system can provide hands-free interaction by relying almost entirely on the human voice rather than the usual interaction tools. This can especially help people with poor keyboard skills and people with spelling difficulties, including dyslexic people; most usefully of all, some disabled people can benefit from these latest technologies (Kirriemuir, 2003) [7].
In addition, this kind of system can potentially relieve people with repetitive stress problems associated with prolonged typing, and for those who do not suffer from any pain it can significantly enhance their performance (Renee, 2001) [14]. Furthermore, ASR can help drivers concentrate on driving; this is strongly supported by the study of Marvin et al. (2004) [9], who stated that providing hands-off and head-up operation of in-car devices can remarkably reduce the distraction that comes from drivers operating such devices (i.e. cell phone, radio and GPS), with a consequent improvement in driving performance. From my point of view, this is a great benefit, as the dramatic increase in the functionality of the latest mobile phones, as well as the spread of navigation systems, can both aggravate the distraction problem while driving. According to Kirriemuir (2003) [7], given the high competition in the car market it is not surprising that the car industry is investigating the use of SR technology for in-car steering systems, as modern cars are marketed by virtue of innovative technical features.
Automatic speech recognition, as a new innovation in man-machine interaction, is used more and more in many different applications. The most common use is telephone-based information retrieval, such as voice dialling on some mobile phones, where the user can say the contact name and the number is then dialled automatically, which is faster than the touch screen. Another telephone application where SR is making great strides is automated telephone-based interactive services, which make the service easier for both customer and supplier, since they are free of human operators and can run 24 hours a day, 7 days a week; an example is stock market quotes (Kirriemuir, 2003) [7]. The gambling industry has applied speech recognition in games such as online multiplayer poker, where the players can hear the vocal commands and the host computer can interpret them whenever appropriate (Kirriemuir, 2003) [7].
In terms of contemporary speech recognition software packages, there are several commercial systems available on the market, such as Dragon NaturallySpeaking, IBM's ViaVoice, L&H's Voice Xpress, Microsoft SAPI and Philips' FreeSpeech (Huang et al., 2004) [4].
Automatic Speech Recognition Types
Automatic speech recognition technology has two main technical types: direct voice input (DVI) and large vocabulary continuous speech recognition (LVCSR). The former is mainly designed for vocal command and control; such systems usually respond directly and are configured only for small to medium-sized vocabularies (e.g. voice dialling on cell phones).
The latter is aimed at voice-based document creation and form filling. As this type of ASR is typically designed to transcribe continuous speech, it involves very large vocabularies (up to hundreds of thousands of words) (Jackson, 2005) [5].
Zhang classified ASR systems, in terms of speaker dependency, into two types: speaker dependent and speaker independent. The former is built mainly with the speech patterns of the one speaker who will operate it. The advantages of such systems are that they are more accurate and easier to develop; however, they are not flexible, as in many applications it is not possible to train the system. For example, automated telephone operator systems do not require the speaker to train the system before using it (Zhang, 2001) [17].
On the other hand, Zhang described the speaker independent system as a multi-speaker system with training flexibility, though it is not as easy to develop and its accuracy level is somewhat lower.
In contrast, Manasse (1999) [] reported that the speaker dependent system needs to be trained to build a word recognition template, and this template can be accessed whenever the system is operated, whereas the speaker independent system comes with accessible prerecorded words provided by the system producer. In support of Manasse's view, dictation systems require the speaker to perform an hour or more of training to build the word recognition template (Zhang, 2001) [17].
Techniques and Matching Techniques
Speech recognition techniques can be categorised into three main approaches:
Template-based approaches: in this approach, unknown speech is matched against a group of templates in order to find the best match. Accuracy is relatively high when ideal word templates are used; on the other hand, the approach is judged impractical with respect to variations in speech, as the pre-recorded templates are fixed (Rabiner et al., 1981) [13].
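As a rough illustration of the template-based idea, the sketch below matches an unknown feature sequence against stored word templates using dynamic time warping (DTW), a classic alignment method for this approach. The one-dimensional "feature" vectors and the two-word vocabulary are invented for illustration; a real system would use acoustic features such as cepstral coefficients.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    (rows = frames). Smaller means a better match."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Allow stretching/compressing the time axis.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognise(unknown, templates):
    """Return the vocabulary word whose stored template is closest."""
    return min(templates, key=lambda w: dtw_distance(unknown, templates[w]))

# Toy 1-D "feature" sequences standing in for real acoustic frames.
templates = {
    "yes": np.array([[0.0], [1.0], [2.0], [1.0]]),
    "no":  np.array([[2.0], [2.0], [0.0]]),
}
utterance = np.array([[0.1], [0.9], [2.1], [2.0], [1.1]])  # a stretched "yes"
print(recognise(utterance, templates))  # → yes
```

The warping step is what lets a fixed template absorb some timing variation, but, as noted above, it cannot absorb variation in the templates themselves.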
Knowledge-based approaches: “An expert knowledge about variations in speech is hand coded into a system” (Jackson, 2005) [5]. The advantage of this approach is that explicit modelling of variations in speech is achieved. However, as it is not easy to obtain and use expert knowledge successfully, this technique has proved impractical (Jackson, 2005) [5].
Statistical-based approaches: these model the variations in speech statistically, using a statistical learning procedure. Such models require a priori modelling assumptions, which limits system performance. An example of a statistical technique is the Hidden Markov Model (HMM) (Jackson, 2005) [5]. The advantage of HMMs is that the use of a probabilistic acoustic model eliminates the need for building reference templates while still representing the variations in speech (Zhang, 2001) [17]. Lai et al. (2000) [20] and Melnikoff et al. (2001) [21] pointed out that the most successful and useful techniques for speech recognition systems are based on Hidden Markov Models. The HMM technique has existed since the 1970s and still dominates the most widespread speech recognition systems, from basic to highly advanced applications (Melnikoff et al., 2001) [21].
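To make the HMM idea concrete, the minimal sketch below runs the Viterbi algorithm, the standard decoding step in HMM-based recognisers, on a toy model with two hidden states and three discrete acoustic symbols. Every probability here is invented for illustration; real recognisers use many states per phoneme and continuous acoustic observations.

```python
import numpy as np

# Toy HMM: two hidden "phoneme" states emitting three discrete
# acoustic symbols. All probabilities are invented for illustration.
states = ["s1", "s2"]
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])        # P(next state | current state)
emit = np.array([[0.5, 0.4, 0.1],    # P(symbol | s1)
                 [0.1, 0.3, 0.6]])   # P(symbol | s2)

def viterbi(obs):
    """Most likely hidden state sequence for observed symbol indices."""
    T = len(obs)
    V = np.zeros((T, len(states)))            # best path probabilities
    back = np.zeros((T, len(states)), dtype=int)
    V[0] = start * emit[:, obs[0]]
    for t in range(1, T):
        for j in range(len(states)):
            scores = V[t - 1] * trans[:, j]
            back[t, j] = np.argmax(scores)
            V[t, j] = scores.max() * emit[j, obs[t]]
    # Trace the best path backwards from the most probable final state.
    path = [int(np.argmax(V[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi([0, 1, 2]))  # → ['s1', 's1', 's2']
```

Note how no reference template appears anywhere: the model scores whole state sequences probabilistically, which is exactly the property the paragraph above credits to HMMs.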
Matching technique
Speech recognition machines usually match a detected word to a recognized word using either whole-word matching or sub-word matching (Svendsen et al., 1989) [16]. The first technique compares the received digital acoustic signal against a pre-recorded word, whereas sub-word matching searches for sub-word units such as phonemes and then carries out further pattern recognition on those. Sub-word matching requires much less processing and much less storage than whole-word matching, which needs large amounts of storage and long processing (Svendsen et al., 1989) [16].
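The storage contrast can be sketched as follows: instead of keeping a full acoustic recording per vocabulary word, a sub-word system stores each word once as a short phoneme sequence and matches a recognised phoneme string against that lexicon. The three-word mini-lexicon and its ARPAbet-style phoneme symbols are hypothetical, and the edit-distance match stands in for the fuller pattern recognition a real system performs.

```python
# Hypothetical mini-lexicon: each word is stored as a phoneme sequence,
# far more compact than a per-word acoustic template.
lexicon = {
    "open":  ["OW", "P", "AH", "N"],
    "save":  ["S", "EY", "V"],
    "close": ["K", "L", "OW", "Z"],
}

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

def match(phonemes):
    """Return the lexicon word closest to the recognised phoneme string."""
    return min(lexicon, key=lambda w: edit_distance(phonemes, lexicon[w]))

print(match(["S", "EY", "F"]))  # a slightly misrecognised "save" → save
```

Adding a new word to this lexicon costs a handful of phoneme symbols, whereas a whole-word system would need a new stored acoustic template.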
Problems associated with designing Speech Recognition systems
In recent years, automatic speech recognition systems have achieved major improvements; at the same time, there are many difficulties which affect their ability to perform at high quality, for instance speaker variability, adverse environments, large vocabularies and the number of speakers.
The most important factors in speaker variability are sex and accent (Huang et al., 2004) [4]. The sex factor has been managed recently with gender-dependent models (Huang et al., 2004) [4]. With regard to accent, however, it was found that people with heavy accents tend to make many errors relative to the standard pronunciation (Huang et al., 2004) [4].
In one experiment related to the accent factor, it was observed that pronunciation errors constituted a significant proportion of total errors (Huang et al., 2004) [4]. Nevertheless, there is some research on accented speech recognition, especially for people with the same mother tongue (Huang et al., 2004) [4].
Environmental conditions: ASR performance can be affected by different adverse conditions, such as noise in a car or a factory, distorted signals and so on (Zhang, 2001) [17].
Vocabulary size: “In general, increasing the size of the vocabulary decrease the recognition scores” (Jackson, 2005) [5].
Number of speakers: an ASR system must deal with the problem of speech variability from one person to another (Jackson, 2005) [5]. Large speech databases have been used as training data to overcome this problem (Huang et al., 2004) [4].
In addition, a number of studies consider other issues that can affect the functionality of SR systems, including the nature of the utterance and language complexity (Jackson, 2005) [5].
Accuracy
The level of accuracy in speech recognition systems is affected by many extraneous factors, either human or technical. In order to reduce these factors, the following methods are considered practical:
For optimum performance of contemporary speech recognition software, the computer should be equipped with a fast processor (a Pentium III/650 should be the minimum) and a large amount of RAM (a minimum of 128 MB is good, but 256 MB is more efficient) (Renee, 2001) [14]. Additionally, high-quality microphones can minimize background noise and maximize the quality of the speech signal, as some are provided with active noise reduction or adaptive filters; a number of projects also recommend installing high-quality duplex (input and output) sound cards (Kirriemuir, 2003) [7], and Renee (2001) [14] recommends using a PCI sound card. Finally, proper training of the system is required, as the accuracy level ranges from 90% to 95% depending on training (Renee, 2001) [14].
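Accuracy figures such as the 90-95% quoted above are usually derived from the word error rate (WER), computed by aligning the system's transcript against a reference with an edit distance. A minimal sketch of that computation follows; the two transcripts are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard edit-distance dynamic programme over words."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))  # substitution
    return d[len(r)][len(h)] / len(r)

# Made-up example: one substitution in a five-word reference = 20% WER,
# i.e. 80% word accuracy.
wer = word_error_rate("open the file and save", "open the file and shave")
print(f"{wer:.0%}")  # → 20%
```

Reporting accuracy as 1 − WER in this way is what allows the before/after comparison when measuring the effect of additional user training.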
Future implementation of ASR
Currently, most speech recognition software runs locally on a single computer; however, configurations that could be used over a network would be helpful (Kirriemuir, 2003) [7].
References
1- Deng, L. and Yu, D. (2005), A Speech-Centric Perspective for Human-Computer Interface: A Case Study, Journal of VLSI Signal Processing, 41.
2- Deshmukh, N., Ganapathiraju, A, Picone J., (1999), Hierarchical Search for Large Vocabulary Conversational Speech Recognition. IEEE Signal Processing Magazine, 1(5):84-107.
3- Dupont,S., (2000), Audio-Visual Speech Modeling for Continuous Speech Recognition, IEEE Transactions on multimedia, 2(3):141-151
4- Huang, C., Chen, T. and Chang, E. (2004), Accent Issues in Large Vocabulary Continuous Speech Recognition, Microsoft Research Asia, 5F, Sigma Center, No. 49, Zhichun Road, Beijing 100080, China, International Journal of Speech Technology (2004).
5- Jackson, M. (2005), Automatic Speech Recognition: Human Computer Interface for Kinyarwanda Language, Master's thesis, Makerere.
6- Kandasamy, S. (1995), Speech recognition systems, SURPRISE Journal, 1(1).
7- Kirriemuir, J. (2003), Speech Recognition Technologies, retrieved March 30, 2003 from http://www.jisc.ac.uk/uploaded_documents/tsw_03-03.pdf.
8- Lange, H. (1993), Speech synthesis and speech recognition: Tomorrow's human-computer interfaces? Annual Review of Information Science and Technology (ARIST), 28, 153-185.
9- Marvin, C., John, L., Joel, B. and James, L. (2004), Speech Recognition and In-Vehicle Telematics Devices: Potential Reductions in Driver Distraction, Department of Mechanical and Industrial Engineering, University of Iowa, Iowa City, IA, USA, International Journal of Speech Technology (2004).
10- Meisel, W. (1993). Talk to your computer: voice technology lets you verbally command your computer or convert speech to text. Byte, 18,113-120.
11- Mengjie, Z., (2001) Overview of speech recognition and related machine learning techniques, Technical report. retrieved December 10, 2004 from http://www.mcs.vuw.ac.nz/comp/Publications/archive/CS-TR-01/CS-TR-01-15.pdf
12- Pinker, S. (1994), The Language Instinct, Harper Collins, New York City, New York.
13- Rabiner, L.R. and Levinson, S.E. (1981), “Isolated and connected word recognition – Theory and selected applications”, IEEE Transactions on Communications, COM-29, pp. 621-629.
14- Renee, L., (2001), Successful Implementation of Speech Recognition Technology, Zephyr-TEC Corp, San Mateo, CA.
15- Rudnicky, A.I., Lee, K.F., and Hauptmann, A.G. (1992) Survey of current speech technology. Communications of the ACM,37(3):52-57.
16- Svendsen, T., Paliwal, K.K., Harborg, E. and Husøy, P.O. (1989), Proc. ICASSP'89, Glasgow.
17- Zhang, M. (2001), 'Overview of Speech Recognition and Related Machine Learning Techniques', Technical Report CS-TR-01/15, Victoria University of Wellington.
18- Zue, V., Cole, R. and Ward, W. (1996), Speech Recognition, in Survey of the State of the Art in Human Language Technology.
19- De La Paz, (in press). Composing via dictation and speech recognition systems: compensatory technology for students with learning disabilities. Learning Disabilities Quarterly.
20- Lai, J. (Ed.) (2000), “Conversational Interfaces: Special section,” Communications of the ACM, vol. 43, no. 9.
21- Melnikoff, S.J. et al. (2001), Implementing a Hidden Markov Model Speech Recognition System in Programmable Logic, School of Electronic and Electrical Engineering, University of Birmingham, Edgbaston, Birmingham, B15 2TT, United Kingdom.
21- Goldberg, A. (ed.) (1988), A History of Personal Workstations, Addison-Wesley Publishing Company, New York.
22- Swinehart, D. et al. (1986), “A structural view of the Cedar programming environment,” ACM Transactions on Programming Languages and Systems, 8(4), pp. 419-490.
23- van Dam, A. and Rice, D.E. (1971), “On-line text editing: A survey,” Computing Surveys, 3(3), pp. 93-114.
24- Myers, B.A. (1998), “A Brief History of Human-Computer Interaction Technology.” Available at http://www.cc.gatech.edu/classes/AY2002/cs4470_fall/CMU-CS-96-163.pdf.
24- Bissell, D. (1990), The father of computer graphics, Byte, 1990.
25- Engelbart, D. and English, W. (1994), “A Research Center for Augmenting Human Intellect,” reprinted in ACM SIGGRAPH Video Review 106; video made in 1968.
25- Kay, A. (1969), The Reactive Engine, doctoral dissertation, Electrical Engineering and Computer Science, University of Utah.
25- Kay, A. (1977), “Personal dynamic media,” IEEE Computer, 10(3), pp. 31-42.
26- Kay, A. (1996), The Early History of Smalltalk, in History of Programming Languages II, ACM Press.
27- English, W.K., Engelbart, D.C. and Berman, M.L. (1967), “Display selection techniques for text manipulation,” IEEE Transactions on Human Factors in Electronics, HFE-8(1).
28- Kirstein, P.T. (1998), 'Early Experiences with the ARPANET and INTERNET in the UK', Department of Computer Science, University College London.
29- Nelson, T. (1965), “A File Structure for the Complex, the Changing, and the Indeterminate,” in Proceedings of the ACM National Conference, pp. 84-100.
30- van Dam, A. et al. (1969), “A Hypertext Editing System for the 360,” in Proceedings of the Conference in Computer Graphics, University of Illinois.
