
considering purchase, but please explain

posted Mar 05, 2016 11:59:08 by StephenKramer
Hello, I see the MOVI shield will go on sale shortly and I'm considering a purchase. However, as a new Arduino user, I have a lot of questions. My only experience with voice recognition is using "Ok Google" on my phone.

I read the user manual, and after reading it I am more confused about how to use the shield, what it can do, and how to do it. I am concerned the MOVI shield will be too difficult to use and an unsuccessful and expensive experience.

I would really like to see more examples of the Arduino code required to use it, and examples of the functions it has. I am mainly interested in using the shield for voice interaction with and control of a robot I am building.

Could someone please explain the following:
1. Will it work well if I use it in a robot, indoors, from ~5-10 feet away? How accurate do you expect it to be?
2. Does it actually recognize individual spoken words and convert them to text?
3. Can Arduino code then access the text output from the recognizer and use it with the voice synthesizer to speak some of the speech that has just been recognized along with other words from the code?
4. Do all of the words to be recognized need to be "voice trained" for it to recognize them?
5. If it needs to be trained, can it only be trained for one person?
6. Can it recognize the speech of any person?
7. Can you train it to recognize new words?
8. Will the shield interfere in any way with other functions of the Arduino and its sketches, such as moving servos or other motors, PWM outputs, or data acquisition? Will it slow the loop, affect its timing, or add delays to it? For example, can you acquire speech while moving a servo motor, controlling addressable LEDs, and reading sensor values?

I would really appreciate your official responses. I think other potential buyers may appreciate your responses also.

Thanks
fractor@audeme.com said Mar 05, 2016 19:23:17
Hi,

Thank you for your precise and very valid questions!

First of all: sorry that you were confused by our user manual. MOVI follows Arduino's philosophy of "learning by doing", so the user manual is not meant to be your first source of information. It's more of a reference for once you have a board, and it provides some more advanced explanations. By the way, I don't even know if Arduino has a user manual. Everything is taught in the examples and the API reference. So, yes, I totally agree that examples are the most important teacher!

MOVI's Arduino library comes with 7 examples. If you already have an Arduino, feel free to download the library and take a look at the examples even without the board. Even easier: if you go to "Example Videos" on this website, you'll see the examples in action. Each video is accompanied by a link to the commented source code on GitHub that is executed in that video. Hopefully this gives you a better idea of how MOVI works.

Regarding your questions:
1) It will most likely work with a robot indoors 5-10 feet away. We've tested this scenario during our Kickstarter campaign with a Romibo robot. The video can be found here.
2) Yes. There are two ways to work with the results: One way is you only get the sentence number (according to how you trained the sentences) and the second one is you work with the words as text directly. The "LightSwitch" example uses the first way, the "KeyWordSpotter" example uses the second way.
3) Yes.
4) No. There is no voice training ever. The recognition works based on you providing textual sentences of what you want MOVI to recognize. The videos of our examples (except the first) include the full training procedure (e.g. this one).
5) No voice training. MOVI is speaker independent as long as the speakers speak English and are at least about 12 years old. MOVI does not work with small children.
6) Yes, as long as it's decent English. Heavy accents make it harder.
7) Any word that can be phonetized using the rules of canonical American English phonetization can be recognized. This includes proper names and even artificial words as long as they are pronounceable.
8) We tried to design MOVI so that it takes the least resources possible away from the Arduino. All the computation is done on the shield and not on the Arduino. However, the communication between Arduino and MOVI uses two header pins, by default 10 and 11. These can be changed with a software setting and jumpers. So as long as your servo motors, LEDs and sensors do not collide with the communication pins, you should be fine. Also, MOVI is compatible with a wide range of Arduinos. I am saying that because, for an application like this, I speculate you probably want to go with an Arduino MEGA so you have more pins and more compute on the Arduino compared to an Arduino UNO.
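
To make answers 2 and 3 a bit more concrete, here is a rough sketch in plain C++ (not a real Arduino sketch, so it can be tried on any computer). It only illustrates the idea of acting on a sentence number versus building a spoken reply from recognized text plus words from the code. The function names, the sentence numbers and the reply wording are all made up for illustration; on the actual shield the sentence number and the recognized text would come from the MOVI library, whose calls are not shown here.

```cpp
#include <string>

// Hypothetical sentence numbers, assumed to follow the order in which
// the sample sentences were given during training.
enum Command { kNone = 0, kLightOn = 1, kLightOff = 2 };

// First way (answer 2): act only on the number of the matched sentence.
std::string actionForSentence(int sentenceNumber) {
    switch (sentenceNumber) {
        case kLightOn:  return "LED on";
        case kLightOff: return "LED off";
        default:        return "ignore";  // unknown or no match
    }
}

// Second way plus answer 3: take the recognized words as text and wrap
// them in extra words from the code before handing them to a synthesizer.
std::string buildReply(const std::string& recognizedText) {
    if (recognizedText.empty()) {
        return "Sorry, I did not catch that";
    }
    return "You said " + recognizedText + ". Is that right?";
}
```

For example, buildReply("turn left") gives back the string "You said turn left. Is that right?", which is the kind of text you would then pass on to be spoken.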

Again thank you for bringing these very valid questions up and hopefully this gives you a better idea about MOVI. One more thing: We never edit any of the example videos (except for trimming beginning and ending). So what you see is what you get.
StephenKramer said Mar 12, 2016 18:56:25
Thanks for responding to my post. I am confused, though, about the voice training; it seems you said some contradictory things about it in your response. Can you explain the voice training requirement for the MOVI?

Can you please also explain the role of the 'sentences' you mentioned? Can it recognize individual words and correlate them with text?

Basically, how would your system be different compared to using an app that uses Google speech to text?

Thanks

Bertrand said Mar 12, 2016 20:05:23
Stephen,

I do recommend you take a look at the code examples and the accompanying videos. It's really that simple.

There is no voice training on your side because we did it for you. MOVI comes with pre-trained voice models of many, many people. However, as I said, they do not include children under 12 and they only include English speakers. The only thing you have to train is a grammar model (in speech recognition terms: a language model). We simplified that process, though, so that all you have to do is specify a couple of sample sentences.

So now there are two ways of working with the recognizer:
1) You let MOVI match against the sample sentences because you expect only these sentences and nothing else (see the LightSwitch example). This gives you the highest accuracy and also it's the easiest way of operating MOVI.
2) You take the recognized words and process them yourself. The big caveat is that MOVI will only recognize words that were in the sample sentences and will favor sequences of words according to the sample sentences. This is explained in Chapter 4 of our user manual. Our Hunt the Wumpus example does that.
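
As a rough illustration of the second way and its caveat, here is a small sketch in plain C++ (again not an Arduino sketch and not MOVI's own code). It assumes the recognized words arrive as one space-separated string, builds the set of words that appear in the sample sentences, and checks a recognized string for a keyword. All function names are hypothetical, and punctuation is not handled.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split on whitespace and upper-case each word so comparisons ignore case.
std::vector<std::string> splitWords(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) {
        std::transform(w.begin(), w.end(), w.begin(), [](unsigned char c) {
            return static_cast<char>(std::toupper(c));
        });
        words.push_back(w);
    }
    return words;
}

// The caveat from above: only words that occur in the sample sentences
// can ever show up in a result, so this set is the whole vocabulary.
std::set<std::string> vocabularyOf(const std::vector<std::string>& sentences) {
    std::set<std::string> vocab;
    for (const auto& sentence : sentences) {
        for (const auto& word : splitWords(sentence)) {
            vocab.insert(word);
        }
    }
    return vocab;
}

// Process the recognized words yourself: is a single keyword among them?
bool heardKeyword(const std::string& recognizedText, const std::string& keyword) {
    const auto words = splitWords(recognizedText);
    const auto key = splitWords(keyword);
    return key.size() == 1 &&
           std::find(words.begin(), words.end(), key[0]) != words.end();
}
```

With the sample sentences "go forward", "go back" and "stop", the vocabulary holds four words (GO, FORWARD, BACK, STOP), so a keyword such as "left" could never be heard unless a sample sentence containing it is added.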

I don't know Google's API that well, but as far as I know, Google's speech to text does not allow you to create your own language model. This means the recognizer will try to match whatever you say against a corpus that contains the whole English language (somebody once told me that Google trained on about 300 years of speech). For the typical use case of controlling a little robot using an Arduino or creating a speech interface for an Internet-of-Things gadget, this will most likely yield a lot of falsely recognized words. Now, of course, if you are trying to recognize messages on a voice mailbox or trying to do dictation, Google speech to text is the better choice for you. Mind you, however, that since MOVI is a hardware solution, it comes with its own audio front end and is therefore able to detect noisy environments and work from a distance. And, of course, we don't need an Internet connection, as MOVI operates cloudlessly (and therefore privately). Also, Google might decide to charge for that API at any time, whereas once you own a MOVI, you can use it forever. So the two are quite distinct!

Hopefully this clarifies things a bit better...