I have an idea to build a program that can interact with the user's voice in Arabic. A year ago I started with Sphinx-4, but I need to make an Arabic acoustic model, grammar, and dictionary, and I can't find the road forward. Could you tell me, in a detailed description, how to create those things, and which IDE or program is needed?
Please help me.
Ok, let me start at the very beginning, because I think you are not aware of the dimensions of your project, and you are mixing up things (ASR and TTS). First, I would like to explain what the different things are that you were talking about:
Acoustic Model: Every speech recognition system requires an acoustic model. Language, in particular its words, is made up of phonemes. Phonemes describe how something sounds. To give you an example, the letter a is not always pronounced the same way, as you can see from the two words below:
to bark <=> to take
Now your ASR system needs to detect these phonemes. To do this, it performs a spectral analysis of many short frames of the audio signal and computes features, like MFCCs. What happens with these features? They are fed into a classifier (I could write a whole new chapter about the classifier here, but that would be too much information). The classifier has to learn how to actually perform classification; in simple words, it maps a set of features to a phoneme.
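To make the framing step concrete, here is a minimal sketch in plain Java (a naive DFT with no libraries; the frame length, hop size, and lack of windowing are simplifications, and a real front-end would compute MFCCs instead of raw magnitudes):
// Minimal sketch: split a signal into short frames and compute the
// magnitude spectrum of each frame with a naive DFT. A real ASR
// front-end would window each frame, use an FFT, and derive MFCCs.
public class FrameSpectra {
    static double[][] frameSpectra(double[] signal, int frameLen, int hop) {
        int nFrames = (signal.length - frameLen) / hop + 1;
        double[][] spectra = new double[nFrames][frameLen / 2];
        for (int f = 0; f < nFrames; f++) {
            int offset = f * hop;
            for (int k = 0; k < frameLen / 2; k++) {      // DFT bin k
                double re = 0, im = 0;
                for (int n = 0; n < frameLen; n++) {
                    double a = -2 * Math.PI * k * n / frameLen;
                    re += signal[offset + n] * Math.cos(a);
                    im += signal[offset + n] * Math.sin(a);
                }
                spectra[f][k] = Math.sqrt(re * re + im * im); // magnitude
            }
        }
        return spectra;
    }
}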
Dictionary: In your dictionary, you define every word that can be recognized by your ASR system. It tells the ASR the phoneme composition of a word. A short example for this is:
hello H EH L OW
world W ER L D
With this small dictionary, your system would be able to recognize the words hello and world.
Language Model (or Grammar): The language model holds information about the assembly of words for a given language. What does this mean? Think of the virtual keyboard of your smartphone. When you type in the words 'Will you marry', your keyboard might guess the next word to be 'me'. That is no magic. The model was learned from huge amounts of text files. Your LM does the same. It adds the knowledge about meaningful word compositions (what everybody calls a sentence) into the ASR system to further improve detection.
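As a toy illustration of how such a model is learned, here is a minimal Java sketch that counts word bigrams in a corpus; a real LM would smooth these counts and store probabilities (e.g. in ARPA format), but the idea is the same:
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: learn bigram counts from raw text. P(next | prev)
// can then be estimated as count(prev next) / count(prev).
public class BigramCounts {
    public static void main(String[] args) {
        String corpus = "will you marry me . will you come with me .";
        Map<String, Integer> bigrams = new HashMap<>();
        String[] words = corpus.split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            bigrams.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
        }
        System.out.println(bigrams);   // {will you=2, you marry=1, ...}
    }
}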
Now back to your problem: You need transcribed audio data for the following reasons:
You want to train your acoustic model if you have none.
You want to create a large enough dictionary.
You want to generate a language model from the text.
Long story short: you are wrong if you think you could solve all these tasks on your own. Reliable transcription alone is already a large amount of work. You should seriously reconsider your idea.
I'm using OCR to recognize (German) text in an image. It works well but not perfectly. Sometimes a word gets messed-up. Therefore, I want to implement some sort of validation. Of course, I can just use a word list and find words that are similar to the messed-up word, but is there a way to check if the sentence is plausible with these words?
After all, my smartphone can give me good suggestions on how to complete a sentence.
You need to look at Natural Language Processing (NLP) solutions. With them, you can validate the text syntactically and lexically (either the whole text at once, which may be better since some tools take the context into consideration, or phrase by phrase).
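As a rough, non-production illustration of "plausible in context": you could score each OCR candidate for a messed-up word by bigram counts learned from a German corpus, and keep the candidate that fits its neighbour best. A sketch (the bigramCount map is assumed to be built from a corpus beforehand):
import java.util.Map;

// Sketch: pick the OCR candidate that is most plausible after the
// previous word, judged by bigram counts learned from a text corpus.
public class CandidatePicker {
    static String bestCandidate(String prevWord, String[] candidates,
                                Map<String, Integer> bigramCount) {
        String best = candidates[0];
        int bestScore = -1;
        for (String c : candidates) {
            int score = bigramCount.getOrDefault(prevWord + " " + c, 0);
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;   // most plausible continuation of prevWord
    }
}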
I am not an expert in the area, but this article can help you choose a tool to start with.
Also, please note: the keyboard on your cellphone is developed and maintained by specialized teams, whether at Apple, Google, or whichever other company whose app you use. So please don't underestimate this task: there are dozens of research areas involved, bringing together both software engineers and linguists to achieve proper results.
Edit: well, two days later, I just came across this link: https://medium.com/quick-code/12-best-natural-language-processing-courses-2019-updated-2a6c28aebd48
I am a musician/singer/songwriter,
I was hoping someone might know of something already out there that does some, if not all, of what I'm trying to achieve.
I record song ideas into raw digital WAV files, using only my voice to emulate instruments (vocal melody, bass, guitar, drums, etc.) in a song structure (verse, chorus, bridge).
I was hoping that Java/FFT could be used to slice each millisecond into an array that could be broken down into the notes and riffs that I am singing.
Here is a list of some of the steps I see that need to be done with my wav files.
Find out the note that I'm singing. The software would take each note and nudge it to the nearest "true note" (A4 = 440 Hz).
It would take the notes and find out which key or possible keys the song may be in.
From a very large database of real songs, the software would make chord suggestions and placement suggestions depending on the genre the song is in.
It would take the riffs (any sequence of more than 3 notes repeated more than 3 times in a song) and create loops, with a drop-down box of alternative voicings and randomization.
There is much more, but this should show you the basics of what I’m trying to do.
If there aren’t already programs that do all or part of this, would it be possible for me to write one that uses Java and an FFT to slice every millisecond into an array to determine notes?
I have read some Java/FFT material and it is mostly way over my head (I have studied a little Java), but I was hoping someone might be able to lead me in the right direction.
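To show what I mean by nudging to the nearest true note, here is my rough sketch of the idea (it assumes I already have one frame of mono samples; the naive DFT peak-picking is just a placeholder for real pitch detection):
// Naive sketch: find the dominant frequency of one frame of samples
// with a plain DFT, then snap it to the nearest equal-tempered
// semitone relative to A4 = 440 Hz.
public class NotePitch {
    static double dominantFreq(double[] frame, double sampleRate) {
        int n = frame.length;
        double bestMag = 0;
        int bestBin = 1;
        for (int k = 1; k < n / 2; k++) {
            double re = 0, im = 0;
            for (int i = 0; i < n; i++) {
                double a = -2 * Math.PI * k * i / n;
                re += frame[i] * Math.cos(a);
                im += frame[i] * Math.sin(a);
            }
            double mag = re * re + im * im;
            if (mag > bestMag) { bestMag = mag; bestBin = k; }
        }
        return bestBin * sampleRate / n;   // bin index -> Hz
    }

    static double snapToNote(double freqHz) {
        // MIDI note number: 69 + 12 * log2(f / 440)
        long midi = Math.round(69 + 12 * Math.log(freqHz / 440.0) / Math.log(2));
        return 440.0 * Math.pow(2, (midi - 69) / 12.0);   // back to Hz
    }
}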
I want to implement object detection on a license plate (the city name). I have an image:
and I want to detect if the image contains the word "بابل":
I have tried a template matching method using OpenCV and also using MATLAB, but the results are poor when tested with other images.
I have also read this page, but I was not able to get a good understanding of what to do from that.
Can anyone help me or give me a step-by-step way to solve this?
I have a project to recognize license plates; we can already detect and recognize the numbers, but I also need to detect and recognize the words (the same words appear across many cars).
Your question is very broad, but I will do my best to explain optical character recognition (OCR) in a programmatic context and give you a general project workflow followed by successful OCR algorithms.
The problem you face is easier than most, because instead of having to recognize/differentiate between different characters, you only have to recognize a single image (assuming this is the only city you want to recognize). You are, however, subject to many of the limitations of any image recognition algorithm (quality, lighting, image variation).
Things you need to do:
1) Image isolation
You'll have to isolate your image from a noisy background:
I think that the best isolation technique would be to first isolate the license plate, and then isolate the specific characters you're looking for. Important things to keep in mind during this step:
Does the license plate always appear in the same place on the car?
Are cars always in the same position when the image is taken?
Is the word you are looking for always in the same spot on the license plate?
The difficulty/implementation of the task depends greatly on the answers to these three questions.
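Whatever the answers, here is a rough sketch of the isolation step using OpenCV's Java bindings (the Canny thresholds, aspect-ratio limits, and minimum area are guesses you would tune for your camera setup):
import org.opencv.core.*;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;
import java.util.ArrayList;
import java.util.List;

// Rough sketch: find a plate-like region by looking for contours whose
// bounding box has a plausible aspect ratio and size, then crop it.
public class PlateIsolation {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        Mat gray = Imgcodecs.imread("car.jpg", Imgcodecs.IMREAD_GRAYSCALE);
        Mat edges = new Mat();
        Imgproc.Canny(gray, edges, 100, 200);

        List<MatOfPoint> contours = new ArrayList<>();
        Imgproc.findContours(edges, contours, new Mat(),
                Imgproc.RETR_EXTERNAL, Imgproc.CHAIN_APPROX_SIMPLE);

        for (MatOfPoint contour : contours) {
            Rect box = Imgproc.boundingRect(contour);
            double aspect = (double) box.width / box.height;
            // Plates are wide and not tiny; tune these limits.
            if (aspect > 2 && aspect < 6 && box.area() > 2000) {
                Mat plate = gray.submat(box);   // candidate plate crop
                Imgcodecs.imwrite("plate.png", plate);
            }
        }
    }
}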
2) Image capture/preprocessing
This is a very important step for your particular implementation. Although possible, it is highly unlikely that your image will look like this:
as your camera would have to be directly in front of the license plate. More likely, your image may look like one of these:
depending on the perspective the image is taken from. Ideally, all of your images will be taken from the same vantage point, and you'll simply be able to apply a single transform so that they all look similar (or not apply one at all). If you have photos taken from different vantage points, you need to manipulate them, or else you will be comparing two different images. Also, especially if you are taking images from only one vantage point and decide not to do a transform, make sure that the text your algorithm is looking for is transformed to be from the same point of view. If you don't, you'll have a not-so-great success rate that's difficult to debug/figure out.
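Such a transform is a few lines with OpenCV's Java bindings; a sketch, assuming the four corner points of the plate are already known (for example from the isolation step):
import org.opencv.core.*;
import org.opencv.imgproc.Imgproc;

// Sketch: warp a skewed plate to a fronto-parallel view, given the
// four detected corner points in order TL, TR, BR, BL.
public class PlateWarp {
    static Mat rectify(Mat image, Point[] corners, int w, int h) {
        MatOfPoint2f src = new MatOfPoint2f(corners);
        MatOfPoint2f dst = new MatOfPoint2f(
                new Point(0, 0), new Point(w, 0),
                new Point(w, h), new Point(0, h));
        Mat transform = Imgproc.getPerspectiveTransform(src, dst);
        Mat out = new Mat();
        Imgproc.warpPerspective(image, out, transform, new Size(w, h));
        return out;   // all plates now share one vantage point
    }
}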
3) Image optimization
You'll probably want to (a) convert your images to black-and-white and (b) reduce the noise in your images. These two processes are called binarization and despeckling, respectively. There are many implementations of these algorithms available in many different languages, most of them accessible through a Google search. You can batch process your images using any language/free tool you want, or find an implementation that works with whatever language you decide to work in.
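In OpenCV's Java bindings, for instance, each step is one call; this is only one possible choice of algorithms (Otsu thresholding for binarization, a median filter for despeckling):
import org.opencv.core.Mat;
import org.opencv.imgproc.Imgproc;

// Sketch: despeckle with a small median filter, then binarize with
// Otsu's method (threshold chosen automatically from the histogram).
public class ImageCleanup {
    static Mat clean(Mat gray) {
        Mat despeckled = new Mat();
        Imgproc.medianBlur(gray, despeckled, 3);   // 3x3 median filter
        Mat binary = new Mat();
        Imgproc.threshold(despeckled, binary, 0, 255,
                Imgproc.THRESH_BINARY | Imgproc.THRESH_OTSU);
        return binary;
    }
}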
4) Pattern recognition
If you only want to search for the name of this one city (only one word ever), you'll most likely want to implement a matrix matching strategy. Many people also refer to matrix matching as pattern recognition so you may have heard it in this context before. Here is an excellent paper detailing an algorithmic implementation that should help you immensely should you choose to use matrix matching. The other algorithm available is feature extraction, which attempts to identify words based on patterns within letters (i.e. loops, curves, lines). You might use this if the font style of the word on the license plate ever changes, but if the same font will always be used, I think matrix matching will have the best results.
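If you do choose matrix matching, OpenCV's template matching is the simplest concrete starting point; a sketch (the 0.8 correlation threshold is a guess to tune on real plates):
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgproc.Imgproc;

// Sketch: slide a template of the city name over the plate image and
// report a match if the best normalized correlation is high enough.
public class CityMatch {
    static boolean containsCity(Mat plate, Mat cityTemplate) {
        Mat result = new Mat();
        Imgproc.matchTemplate(plate, cityTemplate, result,
                Imgproc.TM_CCOEFF_NORMED);
        Core.MinMaxLocResult mm = Core.minMaxLoc(result);
        return mm.maxVal > 0.8;   // threshold is a guess; tune it
    }
}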
5) Algorithm training
Depending on the approach you take (if you use a learning algorithm), you may need to train your algorithm with tagged data. This means that you have a series of images that you've labeled as True (contains the city name) or False (does not). Here's a pseudocode example of how this works:
train = [(img1, True), (img2, True), (img3, False), (img4, False)]
img_recognizer = algorithm(train)
Then, you apply your trained algorithm to identify untagged images.
test_untagged = [img5, img6, img7]
for image in test_untagged:
img_recognizer(image)
Your training sets should be much larger than four data points; in general, the bigger the better. Just make sure, as I said before, that all the images have had the same transformation applied.
Here is a very, very high-level code flow that may be helpful in implementing your algorithm:
img_in = capture_image()
cropped_img = isolate(img_in)
scaled_img = normalize_scale(cropped_img)
img_desp = despeckle(scaled_img)
img_final = binarize(img_desp)
#train
match = train_match(training_set)
boolCity = match(img_final)
The processes above have been implemented many times and are thoroughly documented in many languages. Below are some implementations in the languages tagged in your question.
Pure Java
cvBlob in OpenCV (check out this tutorial and this blog post too)
tesseract-ocr in C++
Matlab OCR
Good luck!
If you are asking "I want to detect if the image contains the word "بابل"" - this is a classic problem, which is solved with a cascade classifier like the face detection one described at http://code.opencv.org/projects/opencv/wiki/FaceDetection.
But I assume you still want more. Years ago I tried to solve similar problems, and I provide an example image to show how good/bad it was:
To detect the licence plate, I used the very basic rectangle detection that is included in every OpenCV samples folder. Then I used a perspective transform to fix the layout and size. It was important to implement multiple checks to see whether a rectangle looked plausible enough to be a licence plate. For example, if a rectangle is 500 px tall and 2 px wide, it is probably not what I want, and it was rejected.
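Those plausibility checks come down to a few lines; a sketch of the kind of filter I mean (the limits are guesses to tune per camera):
import org.opencv.core.Rect;

// Sketch: reject rectangles whose shape cannot plausibly be a plate,
// e.g. a 2 x 500 px sliver. Limits are guesses to tune per setup.
public class PlateFilter {
    static boolean looksLikePlate(Rect r) {
        double aspect = (double) r.width / r.height;
        return aspect > 2 && aspect < 6 && r.width > 60 && r.height > 15;
    }
}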
Use https://code.google.com/p/cvblob/ to extract the Arabic text and other components on the detected plate. I had a similar need just yesterday on another project, where I had to extract Japanese kanji symbols from a page:
CvBlob does a lot of work for you.
As the next step, use the technique explained at http://blog.damiles.com/2008/11/basic-ocr-in-opencv/ to match the city name. Just train the algorithm with example images of the different city names, and soon it will identify 99% of them out of the box. I have used similar approaches on different projects and am quite sure they work.
I'm looking for speech recognition software for Java that acts more like the Android version: instead of needing .gram files and such, it just returns a string of what was said, and I can act on it. I've tried using Sphinx-4, but using .gram files makes my program a lot harder to write.
The point of a grammar file is to improve the accuracy of what you're getting back. Instead of trying to come up with random strings of English words, you tell the recognizer to expect specific input.
That said, Sphinx-4 can do ordinary large-dictionary ASR as well. Read the N-Gram part of this tutorial and look at the Transcriber sample that comes with the Sphinx source code.
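With the newer edu.cmu.sphinx.api surface (sphinx4-5prealpha; older versions are configured through XML instead), large-dictionary recognition that simply hands you back a string looks roughly like this, using the stock US English models shipped with the library:
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;

// Sketch: large-dictionary recognition from the microphone; no .gram
// file, just an acoustic model, dictionary, and n-gram language model.
public class PlainStringASR {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        config.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        config.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        config.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(config);
        recognizer.startRecognition(true);          // true = clear buffered audio
        SpeechResult result = recognizer.getResult();
        System.out.println(result.getHypothesis()); // plain string of what was said
        recognizer.stopRecognition();
    }
}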
In addition, you can train your own trigram model that will enhance the results you get. (E.g., place more probability on the word "weather" being detected.) This is certainly what Siri does. Apple/Google have a huge corpus of pieces of audio that people speak into their phones, part of which is human transcribed, from which they train both acoustic and linguistic models (so their engines detect things people typically say instead of nonsense).
I am developing a desktop application using Java. This application is for school kids, to teach them English: the user can upload English audio (it can be in any format), which needs to be converted into a text file so that they can read the text.
I've found some APIs, but I am not sure about them.
http://cmusphinx.sourceforge.net/wiki/
I've seen many questions on Stack Overflow regarding this, but none was helpful. If someone can help with this, I will be very grateful.
Thank you.
There are many technologies and services available to perform speech recognition. For an intro to some of the choices see https://stackoverflow.com/a/6351055/90236.
I'm not sure that the results will be acceptable for teaching children English as a second language, but it is worth trying.
What you seek is currently bleeding-edge technology. Tools like CMUSphinx can detect words from a dedicated, limited dictionary (so you can teach it to understand, say, 15 words and that's it - you can't teach it to understand English).
Basically, those tools try to find patterns in the sound waves that you feed them. They don't understand anything; they just apply the same algorithm to everything and then try to find the closest match. This works well for small sets of words, but as the number of words increases, the difference between them shrinks and the job gets ever harder (without even starting on words like whether and weather, or C and see).
What you might consider is "repeat after me" software. Here, you need to record all the words for the test as templates. Then you can record the words from the pupils and compute the difference. If the difference is not too large, the word is correct. But again: this is simple repetition to improve pronunciation - not English.
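A crude way to "compute the difference" is dynamic time warping over per-frame features of the template and the pupil's recording; a sketch, assuming you already extract one feature value (e.g. frame energy) per audio frame:
import java.util.Arrays;

// Sketch: dynamic time warping distance between two feature sequences
// (one value per audio frame). A small distance means the pupil's word
// is close to the teacher's template despite differences in timing.
public class RepeatAfterMe {
    static double dtw(double[] template, double[] attempt) {
        int n = template.length, m = attempt.length;
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) Arrays.fill(row, Double.POSITIVE_INFINITY);
        d[0][0] = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double cost = Math.abs(template[i - 1] - attempt[j - 1]);
                d[i][j] = cost + Math.min(d[i - 1][j - 1],
                        Math.min(d[i - 1][j], d[i][j - 1]));
            }
        }
        return d[n][m];
    }
}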
There is desktop software which can understand a lot of English (for example the products from Nuance, Dragon NaturallySpeaking being one of the most prominent). They do offer server solutions, but that software isn't free, and it isn't cheap if you're on a tight budget.