In the first week of May 2010, Google announced the worldwide release of its YouTube video transcription service. Although released in mid-2009, the beta version of YouTube video transcription had been available only to a select few universities, news broadcasters and government agencies.
The history of speech recognition technology dates back to the late 1930s, when AT&T Bell Laboratories developed a primitive device that could recognize speech. Researchers knew that the widespread use of speech recognition would depend on the ability to perceive subtle and complex verbal input accurately and consistently. But because the computing technology of the time was not good enough, the development of speech recognition proceeded at a snail's pace.
Fifty years down the line, the capabilities of many digital electronic devices had surpassed even the best and costliest technologies of the 1930s. This was made possible by breakthroughs in chip and semiconductor fabrication. The largest barriers to the speed and accuracy of speech recognition, computer speed and power, were no longer an issue.
With more computing power (measured in FLOPS) than the computer scientists of the 1930s could have imagined, programmers could now develop algorithms to encode and decode a multitude of voice patterns. In practice, they could build a database of thousands of different voice patterns, convert them into digital waveforms and analyze words based on the mathematics of voice pattern signals. Over time, as speech-to-text technologies became usable, many companies started offering voice recognition to their customers: Dragon Dictation, Microsoft (XP, Vista), Google Voice and other niche companies.
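The idea of matching a spoken word against a database of stored voice patterns can be illustrated with a toy sketch. It assumes each "word" is just a sum of two sine waves, compares frequency spectra computed with a naive discrete Fourier transform, and picks the closest stored pattern. The function names, the word labels and the frequencies are invented for illustration; real speech is far more complex, and real recognizers rely on statistical models rather than this simple nearest-match lookup.

```python
import math

SAMPLE_RATE = 8000  # samples per second (illustrative value)
N = 256             # samples per toy pattern


def synth(freqs):
    """Build a toy 'voice pattern' as a sum of sine waves (frequencies in Hz)."""
    return [sum(math.sin(2 * math.pi * f * n / SAMPLE_RATE) for f in freqs)
            for n in range(N)]


def spectrum(signal):
    """Magnitudes of a naive discrete Fourier transform (lower half of the bins)."""
    size = len(signal)
    mags = []
    for k in range(size // 2):
        re = sum(x * math.cos(2 * math.pi * k * n / size) for n, x in enumerate(signal))
        im = -sum(x * math.sin(2 * math.pi * k * n / size) for n, x in enumerate(signal))
        mags.append(math.hypot(re, im))
    return mags


def distance(a, b):
    """Euclidean distance between two spectra of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(signal, templates):
    """Return the label of the stored pattern whose spectrum is closest."""
    spec = spectrum(signal)
    return min(templates, key=lambda word: distance(spec, templates[word]))


# Hypothetical pattern database: the labels and frequencies are made up.
TEMPLATES = {
    "sky": spectrum(synth([500, 1500])),
    "high": spectrum(synth([500, 1250])),
    "yes": spectrum(synth([750, 2000])),
}

print(recognize(synth([500, 1250]), TEMPLATES))  # prints "high"
```

Even in this toy version, "sky" and "high" share one of their two frequencies, which hints at why similar-sounding words are easy to confuse once noise and fast speech blur the differences.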
So now the question arises: how reliable are these technologies, particularly Google's YouTube transcription, and will they ever match, if not surpass, the accuracy of human transcription?
If you like to watch YouTube videos with captions turned on, you may have noticed that the accuracy of the captions has increased severalfold over the past few months. The accuracy is going up day by day and is only going to improve as more people use the service. As Eric Schmidt, CEO of Google Inc., says: “Our Google YouTube transcriptions will improve over a period of time as more and more users use it; it’s a self-learning technology.”
But there are still a few major flaws that can be foreseen, even though it is a self-learning technology:
1. Accurate captioning is possible only when the speaker speaks very clearly and distinctly.
2. The environment has to be free from any sort of disturbance.
3. Errors creep in because of similar-sounding words, such as “sky” and “high”. When these are spoken quickly, the system is not able to differentiate between the two.
4. Interjections – People often pause or make thinking sounds during speeches, such as “uh”, “hmm” and “ahh”. The recognition software tries to transcribe these as well, at times with hilarious results. (Search YouTube for “Hilarious Google voice transcription”.)
And finally comes the biggest downside of them all:
5. Psychological Satisfaction – After the captioning has been done by the Google robots, can the uploader be sure of its accuracy? Quite obviously, the transcribed captions would need to be thoroughly checked for errors and proofread several times. This means going through the whole video several times, manually correcting the words, fixing the grammar and punctuation (commas, hyphens, quotes and so on) and then uploading the corrected captions. It is a very time-consuming process.
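
One way an uploader could put a number on caption accuracy is the word error rate (WER), a standard metric in speech recognition: the number of word substitutions, insertions and deletions needed to turn the automatic caption into a human-checked transcript, divided by the length of that transcript. The sketch below is a minimal illustration, not a description of how Google measures accuracy. The function name and the sample sentences are invented, and it still needs a human-corrected reference for at least part of the video.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two texts, divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from the caption
                            curr[j - 1] + 1,      # extra word in the caption
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Invented example: a human-checked sentence and an automatic caption of it.
human = "the sky is clear today"
automatic = "the high is clear to day"
print(f"{word_error_rate(human, automatic):.0%}")  # prints "60%"
```

Here the caption needs three fixes against a five-word reference ("sky" became "high", and "today" was split into "to day"), giving a word error rate of 60%. Checking a short sample this way can indicate how much proofreading the rest of the video is likely to need.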