I'm looking for speech recognition software for Java that acts more like the Android version: instead of using .gram files and the like, it just returns a string of what was said, and I can act on it. I've tried using sphinx-4, but using .gram files makes my program a lot harder to write.
The point of a grammar file is to improve the accuracy of what you get back. Instead of trying to match arbitrary strings of English words, you tell the recognizer to expect only specific input.
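For reference, a .gram file is just a JSGF grammar. A minimal, made-up example restricting the recognizer to a handful of commands could look like this:

#JSGF V1.0;
grammar commands;
public <command> = (turn on | turn off) (the light | the fan);

With a grammar like that, the recognizer only ever returns one of those phrases, which is exactly why accuracy goes up.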
That said, sphinx-4 can do ordinary large-dictionary ASR as well. Read the N-Gram part of this tutorial and look at the Transcriber sample that comes with the sphinx source code.
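For example, large-vocabulary decoding that just hands you back a string, much like on Android, looks roughly like this (a minimal sketch assuming the sphinx4-5prealpha API and its bundled en-us model; paths and class names differ in other versions):

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;

public class PlainDictation {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        // Acoustic model, dictionary, and a statistical language model; no .gram file involved.
        config.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        config.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        config.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(config);
        recognizer.startRecognition(true); // start capturing from the system microphone
        SpeechResult result = recognizer.getResult();
        recognizer.stopRecognition();

        // A plain string of what was said, much like the Android API returns.
        System.out.println(result.getHypothesis());
    }
}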
In addition, you can train your own trigram model to improve the results you get (e.g., placing more probability on the word "weather" being detected). This is essentially what Siri does: Apple and Google have a huge corpus of audio that people speak into their phones, part of which is human-transcribed, and from it they train both acoustic and language models, so their engines detect things people typically say instead of nonsense.
I'm using OCR to recognize (German) text in an image. It works well, but not perfectly: sometimes a word gets messed up. Therefore, I want to implement some sort of validation. Of course, I could just use a word list and find words similar to the messed-up word, but is there a way to check whether the sentence is plausible with those words?
After all, my smartphone can give me good suggestions on how to complete a sentence.
You need to look at Natural Language Processing (NLP) solutions. With them, you can validate the text lexically and syntactically (either the whole text at once, which may work better since some tools take the context into account, or sentence by sentence).
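As a starting point before reaching for a full NLP pipeline, the word-list idea from the question can be made concrete. A minimal sketch (the lexicon is a plain German word list you would have to supply; an NLP language model would then rank the candidates by how plausible each one is in context):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class OcrCandidates {
    // Classic Levenshtein edit distance between two strings.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // All dictionary words within maxDist edits of the OCR output.
    static List<String> candidates(String ocrWord, Set<String> lexicon, int maxDist) {
        List<String> result = new ArrayList<>();
        for (String w : lexicon)
            if (editDistance(ocrWord.toLowerCase(), w.toLowerCase()) <= maxDist)
                result.add(w);
        return result;
    }
}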
I am not an expert in the area, but this article can help you choose a tool to start with.
Also, please note: the keyboard on your phone is developed and maintained by specialized teams, whether at Apple, Google, or whichever other company's app you use. So please don't underestimate this task: there are dozens of research areas involved, bringing together both software engineers and linguists to achieve proper results.
Edit: well, two days later, I came across this link: https://medium.com/quick-code/12-best-natural-language-processing-courses-2019-updated-2a6c28aebd48
I am a college student and obviously a newbie in machine learning, so please bear with me.
I am implementing a Java application that recognizes and classifies road/traffic signs, and my main problem is creating and training an SVM with SURF descriptors.
I have read a lot and came across many different things about SVMs, which made me even more confused, but I will try to clarify what I understood.
FIRST: I know that I must have a dataset that includes positive images (images that contain my objects) and negative images (images that don't) to train the SVM. I tried to look at how it is done in Python, due to the lack of documentation in Java, and came across this code:
import numpy as np

# Loads a matrix of numeric feature vectors (one row per sample) from a CSV file.
dataset = np.loadtxt('./datasetExample.csv', delimiter=",")
And it was as simple as that. What is the CSV doing here? Where are the images of the dataset? I know the data has to be represented as numbers, like those inside the CSV file, but where did they come from and what do they have to do with the SVM?
SECOND: I found that in almost all resources the SVM is trained in one of two ways, HOG descriptors or Bag of Words, and I didn't find the SURF descriptor method (actually, I am not sure it is even possible, but my Dr. said it can be done).
THIRD: Since I am classifying traffic signs, I need more than one class (e.g., one for warning signs, one for regulatory signs, etc.), and each class of course has sub-classes; the speed limit class, for instance, includes different types of signs. I came across something called a multi-class SVM, and I really don't know what that is!
Currently I have managed to extract SURF descriptors from a given image using this code:
// featureDetector and descriptorExtractor are created elsewhere with
// FeatureDetector.create(FeatureDetector.SURF) and DescriptorExtractor.create(DescriptorExtractor.SURF).
Mat objectImage = Highgui.imread(signObject, Highgui.CV_LOAD_IMAGE_COLOR);
featureDetector.detect(objectImage, objectKeyPoints);                         // find SURF keypoints
descriptorExtractor.compute(objectImage, objectKeyPoints, objectDescriptors); // one descriptor row per keypoint
datasetObjImage.add(objectImage);
datasetKeyPoints.add(objectKeyPoints);
datasetDescriptors.add(objectDescriptors);
What I was planning to do is loop over all images of the dataset and extract their descriptor features to train the SVM, but I got stuck there, since I found that the dataset doesn't actually contain images at all... Roughly, what I had in mind is sketched below.
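This is just a sketch of that loop (assuming the OpenCV 2.4 Java bindings used above; classIdOf(i) is a placeholder for however the labels are stored). It reduces each image's variable-size descriptor matrix to one fixed-length row by averaging, so every image becomes one numeric training sample, just like a CSV row; a Bag of Words histogram would be a better representation:

// needs: org.opencv.core.Core, CvType, Mat, Scalar and org.opencv.ml.CvSVM

Mat trainData = new Mat(0, 64, CvType.CV_32FC1); // 64 = SURF descriptor length
Mat labels = new Mat(0, 1, CvType.CV_32FC1);     // one class id per image

for (int i = 0; i < datasetDescriptors.size(); i++) {
    Mat mean = new Mat();
    Core.reduce(datasetDescriptors.get(i), mean, 0, 1); // 1 = CV_REDUCE_AVG
    trainData.push_back(mean);                          // append the 1 x 64 row
    labels.push_back(new Mat(1, 1, CvType.CV_32FC1, new Scalar(classIdOf(i))));
}

CvSVM svm = new CvSVM();
svm.train(trainData, labels); // multi-class is handled internally (one-vs-one)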
So please, I would appreciate any sort of help, descriptive steps to achieve this, or even good resources I can look at.
Thanks
I have an idea to build a program that can interact with the user's voice in the Arabic language. I started with sphinx-4 a year ago, but I need to make an Arabic acoustic model, grammar, and dictionary.
But I can't find the way. I want you to tell me, in a detailed description, how to create those things, and which IDE or programs are needed.
Please help me.
OK, let me start at the very beginning, because I think you are not aware of the dimensions of your project, and you are mixing things up (ASR and TTS). First, I would like to explain the different things you were talking about:
Acoustic Model: Every speech recognition system requires an acoustic model. Language, in particular words, is made up of phonemes. Phonemes describe how something sounds. To give you an example, the letter a is not always pronounced the same way, as you can see from the two words below:
to bark <=> to take
Now your ASR system needs to detect these phonemes. To do this, it performs a spectral analysis of many short frames of the audio signal and computes features, like MFCCs. What does it do with these features? It feeds them into a classifier (I could write a whole chapter about the classifier here, but that would be too much information). The classifier has to learn how to actually perform the classification. In simple words, it maps a set of features to a phoneme.
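To make "maps a set of features to a phoneme" concrete, here is a toy nearest-mean classifier (purely illustrative; real recognizers use GMM-HMMs or neural networks, and the per-phoneme mean features would come from training data):

import java.util.HashMap;
import java.util.Map;

public class ToyPhonemeClassifier {
    // Mean feature vector per phoneme, learned from labeled training frames.
    private final Map<String, double[]> phonemeMeans = new HashMap<>();

    public void learn(String phoneme, double[] meanFeatures) {
        phonemeMeans.put(phoneme, meanFeatures);
    }

    // Classify one frame's features as the phoneme with the closest mean.
    public String classify(double[] features) {
        String best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : phonemeMeans.entrySet()) {
            double dist = 0;
            for (int i = 0; i < features.length; i++) {
                double diff = features[i] - e.getValue()[i];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = e.getKey(); }
        }
        return best;
    }
}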
Dictionary: In your dictionary, you define every word that can be recognized by your ASR system. It tells the ASR the phoneme composition of a word. A short example for this is:
hello H EH L OW
world W ER L D
With this small dictionary, your system would be able to recognize the words hello and world.
Language Model (or Grammar): The language model holds information about how words are assembled in a given language. What does this mean? Think of the virtual keyboard of your smartphone. When you type the words 'Will you marry', your keyboard might guess the next word to be 'me'. That is no magic: the model was learned from huge amounts of text. Your LM does the same. It adds knowledge about meaningful word compositions (what everybody calls a sentence) to the ASR system to further improve detection.
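To make the keyboard analogy concrete, here is a toy next-word predictor (purely illustrative; real language models use smoothed n-gram probabilities estimated from large corpora, not raw counts):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ToyBigramModel {
    // How often each word follows each other word in the training text.
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    public void train(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++)
            counts.computeIfAbsent(words[i], w -> new HashMap<>())
                  .merge(words[i + 1], 1, Integer::sum);
    }

    // The most frequent continuation seen after the given word.
    public String predictNext(String word) {
        Map<String, Integer> next = counts.get(word.toLowerCase());
        if (next == null) return null;
        return Collections.max(next.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        ToyBigramModel lm = new ToyBigramModel();
        lm.train("will you marry me");
        System.out.println(lm.predictNext("marry")); // prints "me"
    }
}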
Now back to your problem: You need transcribed audio data for the following reasons:
You want to train your acoustic model if you have none.
You want to create a large enough dictionary.
You want to generate a language model from the text.
Long story short: you are wrong if you think you can solve all these tasks on your own. A reliable transcription alone is already a large amount of work. You should seriously reconsider your idea.
I have a Java application where I'm looking to determine in real time whether a given piece of text is talking about a topic supplied as a query.
Some techniques I've looked into for this are coreference detection with packages like OpenNLP and Stanford NLP coref detection, but these models take extremely long to load and don't seem practical in a production application environment. Is it possible to perform coreference analysis such that, given a piece of text and a topic, I can get a boolean answer as to whether the text is discussing the topic?
Other than document classification which requires a trained corpus, are there any other techniques that can help me achieve such a thing?
I suggest having a look at Weka. It is written in Java, so it will gel well with your environment, will be faster for your kind of requirement, has lots of tools, and comes with a UI as well as an API. If you are looking at an unsupervised approach (that is, one without any learning from a pre-classified corpus), here is an interesting paper: http://www.newdesign.aclweb.org/anthology/C/C00/C00-1066.pdf
You can also search for "unsupervised text classification/ information retrieval" on Google. You will get lots of approaches. You can choose the one you find easiest.
For each topic (if they are predefined) you can create a list of terms, then for each sentence compute the cosine similarity between the sentence and each topic's term list, and show the user the nearest topic; a sketch is below.
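A minimal sketch of that idea, treating the sentence and each topic's term list as bag-of-words count vectors (the topic term lists are whatever you predefine):

import java.util.HashMap;
import java.util.Map;

public class TopicMatcher {
    // Bag-of-words counts for a piece of text.
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> bag = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) bag.merge(w, 1, Integer::sum);
        return bag;
    }

    // Cosine similarity between two count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer bv = b.get(e.getKey());
            if (bv != null) dot += e.getValue() * bv;
            na += e.getValue() * e.getValue();
        }
        for (int v : b.values()) nb += (double) v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // The topic whose term list is most similar to the sentence.
    static String nearestTopic(String sentence, Map<String, String> topicTerms) {
        Map<String, Integer> s = bagOfWords(sentence);
        String best = null;
        double bestSim = -1;
        for (Map.Entry<String, String> t : topicTerms.entrySet()) {
            double sim = cosine(s, bagOfWords(t.getValue()));
            if (sim > bestSim) { bestSim = sim; best = t.getKey(); }
        }
        return best;
    }
}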
I am developing a desktop application using Java. This application is for school kids, to teach them English: the user can upload some English audio, which can be in any format, and it needs to be converted into a text file so they can read the text.
I've found some APIs, but I am not sure about them.
http://cmusphinx.sourceforge.net/wiki/
I've seen many questions on Stack Overflow regarding this, but none were helpful. If someone can help with this, I will be very grateful.
Thank you
There are many technologies and services available to perform speech recognition. For an intro to some of the choices see https://stackoverflow.com/a/6351055/90236.
I'm not sure that the results will be acceptable for teaching children English as a second language, but it is worth trying.
What you seek is currently bleeding-edge technology. Tools like CMU Sphinx can detect words from a dedicated, limited dictionary (so you can teach them to understand, say, 15 words and that's it; you can't teach them to understand English).
Basically, those tools try to find patterns in the sound waves that you feed them. They don't understand anything; they just apply the same algorithm to everything and try to find the closest match. This works well for small sets of words, but as the number of words increases, the differences between them shrink and the job gets ever harder (even before you get to words like whether and weather, or C and see).
What you might consider is "repeat after me" software. Here, you need to record all the words for the test as templates. Then you can record the words from the pupils and compute the difference; if the difference is not too large, the word is correct. But again: this is simple repetition to improve pronunciation, not English. A sketch of the comparison step is below.
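The "compute the difference" step could, for instance, be dynamic time warping, assuming both recordings have already been turned into per-frame feature vectors (e.g., MFCCs) by some other component:

import java.util.Arrays;

public class RepeatAfterMe {
    // DTW distance between two feature sequences. The warping aligns the
    // sequences so different speaking speeds don't count as errors.
    static double dtwDistance(double[][] a, double[][] b) {
        int n = a.length, m = b.length;
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) Arrays.fill(row, Double.POSITIVE_INFINITY);
        d[0][0] = 0;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                double cost = euclidean(a[i - 1], b[j - 1]);
                d[i][j] = cost + Math.min(d[i - 1][j - 1], Math.min(d[i - 1][j], d[i][j - 1]));
            }
        return d[n][m];
    }

    static double euclidean(double[] x, double[] y) {
        double s = 0;
        for (int k = 0; k < x.length; k++) s += (x[k] - y[k]) * (x[k] - y[k]);
        return Math.sqrt(s);
    }
}

If dtwDistance(template, pupil) stays below a threshold you tune per word, you accept the pronunciation.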
There is desktop software which can understand a lot of English (for example, the products from Nuance, Dragon NaturallySpeaking being one of the most prominent). They do offer server solutions, but that software isn't free or cheap if you're on a tight budget.