I am using the open source Sphinx SDK to do some voice recognition. I am currently running the HelloWorld example. However, the response is very sluggish: it takes several attempts to recognize a word, and sometimes it recognizes a word but takes a while to output what I have said. Any ideas how to improve this? Also, when I change the grammar file, it doesn't pick up and recognize my new words.
Thanks
Basically, you can use Sphinx in several configurations. If you know the pattern of the speech you have to recognize, you can use the configuration with a custom grammar.
That configuration responds faster than the normal configuration, since it only listens for predefined words in a predefined pattern (a grammar).
You can define your own grammar file by following the JSGF standard.
Sample Configuration
Configuration configuration = new Configuration();
configuration.setAcousticModelPath(ACOUSTIC_MODEL);
configuration.setDictionaryPath(DICTIONARY_PATH);
configuration.setGrammarPath(GRAMMAR_PATH);
configuration.setUseGrammar(true);
configuration.setGrammarName("mygrammar");
LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
Sample Grammar File
#JSGF V1.0;
grammar mygrammar;
public <COMMON_COMMAND> = [please] turn (on | off) lights;
Related
I'm trying to implement a German command-and-control application with CMUSphinx and Java. So far, the application should recognize only a few words (numbers from 1 to 9, yes/no).
Unfortunately, the accuracy is very bad. It seems that when a word is recognized correctly, it is only by chance.
Here is my Java code so far (adapted from the tutorial):
public static void main(String[] args) throws IOException {
    // Configuration object
    Configuration configuration = new Configuration();

    // Set path to the acoustic model.
    configuration.setAcousticModelPath("resource:/cmusphinx-de-voxforge-5.2");
    // Set path to the dictionary.
    configuration.setDictionaryPath("resource:/cmusphinx-voxforge-de.dic");
    // Use grammar.
    configuration.setGrammarPath("resource:/");
    configuration.setGrammarName("dialog");
    configuration.setUseGrammar(true);

    LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
    recognizer.startRecognition(true);
    SpeechResult result;
    while ((result = recognizer.getResult()) != null) {
        System.out.format("Hypothesis: %s\n", result.getHypothesis());
    }
    recognizer.stopRecognition();
}
Here is my grammar file:
#JSGF V1.0;
grammar dialog;
public <digit> = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ja | nein;
I've downloaded the German acoustic model and dictionary from here: https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/German/
Is there something obvious I'm missing here? Where is the problem?
Thanks in advance and kind regards.
I have tried to use pocketsphinx with the English and German models, and accuracy is good with a predefined/limited set of phrases! You can forget about general things like "could you please find me a restaurant in the downtown".
To achieve good accuracy with pocketsphinx:
Check that your mic, audio device, and files are all at 16 kHz, since the general model is trained on acoustic data like that (see the Java sketch below).
You should create your own limited dictionary; you cannot use cmusphinx-voxforge-de.dic, as accuracy drops dramatically with it.
You should create your own language model.
You can try to modify pronunciation files to fit your accent.
You can look at the Jasper project on GitHub to see how it's implemented.
You can also check the documentation.
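On the first point, a quick way to check from Java whether the machine can actually capture audio at 16 kHz is to ask the Java Sound API. A minimal sketch (the format parameters are the ones the general models expect):
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.TargetDataLine;

public class CheckCaptureFormat {
    public static void main(String[] args) {
        // 16 kHz, 16-bit, mono, signed, little-endian: what the general models are trained on
        AudioFormat format = new AudioFormat(16000f, 16, 1, true, false);
        DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
        System.out.println("16 kHz capture supported: " + AudioSystem.isLineSupported(info));
    }
}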
Well, the accuracy is not great; probably the original database didn't have many examples like yours. Your dialect also contributes: Germans say 7 with a z sound, not with an s. Echo in your room contributes too. I am not sure how you recorded your audio; if you used some compression or a codec in between, it might also contribute to bad accuracy.
You might want to collect a few hundred samples and perform MAP adaptation to improve the accuracy.
How can I do the following?
What I want to do is load Stanford NLP ONCE, then interact with it via an HTTP or other endpoint. The reason is that it takes a long time to load, and loading for every string to analyze is out of the question.
For example, here is Stanford NLP loading in a simple C# program that loads the jars... I'm looking to do what I did below, but in Java:
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [9.3 sec].
Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.all.3class.distsim.crf.ser.gz ... done [12.8 sec].
Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.muc.7class.distsim.crf.ser.gz ... done [5.9 sec].
Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.conll.4class.distsim.crf.ser.gz ... done [4.1 sec].
done [8.8 sec].
Sentence #1 ...
This is over 30 seconds. If these all have to load each time, yikes. To show what I want to do in Java, I wrote a working example in C#, and this complete example may help someone some day:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using java.io;
using java.util;
using edu.stanford.nlp;
using edu.stanford.nlp.pipeline;
using Console = System.Console;

namespace NLPConsoleApplication
{
    class Program
    {
        static void Main(string[] args)
        {
            // Path to the folder with models extracted from `stanford-corenlp-3.6.0-models.jar`
            var jarRoot = @"..\..\..\..\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models";

            // Text for initial run processing
            var text = "Kosgi Santosh sent an email to Stanford University. He didn't get a reply.";

            // Annotation pipeline configuration
            var props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment");
            props.setProperty("ner.useSUTime", "0");

            // We change the current directory so StanfordCoreNLP can find all the model files automatically
            var curDir = Environment.CurrentDirectory;
            Directory.SetCurrentDirectory(jarRoot);
            var pipeline = new StanfordCoreNLP(props);
            Directory.SetCurrentDirectory(curDir);

            // Loop
            while (text != "quit")
            {
                // Annotation
                var annotation = new Annotation(text);
                pipeline.annotate(annotation);

                // Result - pretty print
                using (var stream = new ByteArrayOutputStream())
                {
                    pipeline.prettyPrint(annotation, new PrintWriter(stream));
                    Console.WriteLine(stream.toString());
                    stream.close();
                }

                edu.stanford.nlp.trees.TreePrint tprint = new edu.stanford.nlp.trees.TreePrint("words");

                Console.WriteLine();
                Console.WriteLine("Enter a sentence to evaluate, and hit ENTER (enter \"quit\" to quit)");
                text = Console.ReadLine();
            } // end while
        }
    }
}
So it takes 30 seconds to load, but each time you give it a string on the console, it takes only a fraction of a second to parse and tokenize that string.
You can see that I loaded the jar files prior to the while loop.
This may end up being a socket service, an HTTP endpoint, or something else that will accept requests (in the form of strings) and send back the parse.
My ultimate goal is to use a mechanism in NiFi, via a processor that can send strings to be parsed and have them returned in less than a second, versus the 30+ seconds a traditional threaded web server example (for instance) would take if every request loaded the whole thing for 30 seconds before getting down to business. I hope I made this clear!
How to do this?
Any of the mechanisms you list are perfectly reasonable routes forward for leveraging that service with Apache NiFi. Depending on your needs, some of the processors and extensions that are bundled with the standard release of NiFi may be sufficient to interact with your proposed web service or similar offering.
If you are striving to perform all of this within NiFi itself, a custom Controller Service might be a great path to provide this resource to NiFi, falling within the lifecycle of the application itself.
NiFi can be extended with items like controller services and custom processors, and we have some documentation to get you started down that path.
Additional details could certainly help us provide more specific guidance. Feel free to follow up here with additional comments and/or reach out to the community via our mailing lists.
One item I did want to call out, in case it was unclear: NiFi is JVM driven, so this work would be done in Java or other JVM-friendly languages.
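To make the "within NiFi itself" option concrete, here is a skeletal sketch of a custom processor that loads the CoreNLP pipeline once when the processor is scheduled and reuses it for every FlowFile. The processor class and lifecycle annotation come from the NiFi API; the annotator list is copied from the example above, and the empty onTrigger body is a placeholder:
import java.util.Properties;

import org.apache.nifi.annotation.lifecycle.OnScheduled;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.exception.ProcessException;

import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CoreNlpProcessor extends AbstractProcessor {

    // Loaded once per processor lifecycle, not once per request
    private volatile StanfordCoreNLP pipeline;

    @OnScheduled
    public void loadPipeline(final ProcessContext context) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment");
        pipeline = new StanfordCoreNLP(props); // the slow 30-second load happens here, once
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        // Read the incoming FlowFile's content as text, annotate it with the
        // already-loaded pipeline, and write the result back out. Omitted here;
        // see the NiFi developer guide for session.read/write patterns.
    }
}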
You should look at the new CoreNLP Server which Stanford NLP introduced in version 3.6.0. It seems like it does just what you want. Some other people, such as ETS, have done similar things.
Fine point: if you are using this heavily, you might (at present) want to grab the latest CoreNLP code from GitHub HEAD, since it contains a few fixes to the server which will be in the next release.
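If you would rather roll your own endpoint than run the CoreNLP Server, the load-once pattern from your C# loop translates directly to Java. A minimal sketch using the JDK's built-in com.sun.net.httpserver (the port, path, and plain-text protocol are arbitrary choices for illustration):
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import com.sun.net.httpserver.HttpServer;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class NlpHttpService {
    public static void main(String[] args) throws IOException {
        // Pay the ~30-second model-loading cost exactly once, at startup
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        HttpServer server = HttpServer.create(new InetSocketAddress(8085), 0);
        server.createContext("/nlp", exchange -> {
            // The request body is the raw text to annotate (readAllBytes needs Java 9+)
            String text = new String(exchange.getRequestBody().readAllBytes(),
                    StandardCharsets.UTF_8);
            Annotation annotation = new Annotation(text);
            pipeline.annotate(annotation); // fast: the models are already in memory

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            PrintWriter writer = new PrintWriter(out);
            pipeline.prettyPrint(annotation, writer);
            writer.flush();

            byte[] reply = out.toByteArray();
            exchange.sendResponseHeaders(200, reply.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(reply);
            }
        });
        server.start();
        System.out.println("NLP service listening on http://localhost:8085/nlp");
    }
}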
I want to create a plugin that adds a video to the current slide in an open instance of OpenOffice Impress, specifying the location of the video automatically. I have successfully added shapes to the slide, but I cannot find a way to embed a video.
Using .uno:InsertAVMedia I can take user input to choose a file, and that works. But how do I specify the location of the file programmatically?
CONCLUSION:
This is not supported by the API. Images and audio can be inserted without user intervention, but videos cannot be handled this way. I hope this feature is released in a subsequent version.
You asked for information about an extension, even though the code you are using is quite different: it uses a file stream reader and POI.
If you really do want to develop an extension, then start with one of the Java samples. An example that uses Impress is https://wiki.openoffice.org/wiki/File:SDraw.zip.
Inserting videos into an Impress presentation can be difficult. First be sure you can get it to work manually. The most obvious way to do that seems to be Insert -> Media -> Audio or Video. However, many people use links to files instead of actually embedding the file. See also https://ask.libreoffice.org/en/question/1898/how-to-embed-video-into-impress-presentation/.
If embedding is working for your needs and you want to automate the embedding by using an extension (which seems to be what your question is asking), then there is a dispatcher method called InsertAVMedia that does this.
I do not know offhand what the parameters are for the call. See https://forum.openoffice.org/en/forum/viewtopic.php?f=20&t=61127 for how to look up parameters for dispatcher calls.
EDIT
Here is some Basic code that inserts a video.
sub insert_video
dim document as object
dim dispatcher as object
document = ThisComponent.CurrentController.Frame
dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")
dispatcher.executeDispatch(document, ".uno:InsertAVMedia", "", 0, Array())
end sub
From looking at InsertAVMedia in sfx.sdi, it seems that this call does not take any parameters.
EDIT 2
Sorry but InsertVideo and InsertImage do not take parameters either. From svx.sdi it looks like the following calls take parameters of some sort: InsertGalleryPic, InsertGraphic, InsertObject, InsertPlugin, AVMediaToolBox.
However according to https://wiki.openoffice.org/wiki/Documentation/OOoAuthors_User_Manual/Getting_Started/Sometimes_the_macro_recorder_fails, it is not possible to specify a file for InsertObject. That documentation also mentions that you never know what will work until you try it.
InsertGraphic takes a FileName parameter, so I would think that should work.
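For what it's worth, the same dispatch can be issued from Java through the UNO API instead of Basic. A sketch, assuming a running office instance reachable via the standard bootstrap, and assuming InsertGraphic really does accept a FileName argument as svx.sdi suggests (untested, especially for video content):
import com.sun.star.beans.PropertyValue;
import com.sun.star.comp.helper.Bootstrap;
import com.sun.star.frame.XDesktop;
import com.sun.star.frame.XDispatchHelper;
import com.sun.star.frame.XDispatchProvider;
import com.sun.star.lang.XMultiComponentFactory;
import com.sun.star.uno.UnoRuntime;
import com.sun.star.uno.XComponentContext;

public class InsertGraphicDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a running (or freshly started) office instance
        XComponentContext ctx = Bootstrap.bootstrap();
        XMultiComponentFactory mcf = ctx.getServiceManager();

        XDesktop desktop = UnoRuntime.queryInterface(XDesktop.class,
                mcf.createInstanceWithContext("com.sun.star.frame.Desktop", ctx));
        XDispatchProvider frame = UnoRuntime.queryInterface(XDispatchProvider.class,
                desktop.getCurrentFrame());

        XDispatchHelper dispatcher = UnoRuntime.queryInterface(XDispatchHelper.class,
                mcf.createInstanceWithContext("com.sun.star.frame.DispatchHelper", ctx));

        // Hypothetical file location; note the file:// URL form that UNO expects
        PropertyValue fileName = new PropertyValue();
        fileName.Name = "FileName";
        fileName.Value = "file:///C:/videos/clip.avi";

        dispatcher.executeDispatch(frame, ".uno:InsertGraphic", "", 0,
                new PropertyValue[] { fileName });
    }
}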
It is possible to add an XPlayer on the current slide. It looks like this will allow you to play a video, and you can specify the file's URL automatically.
Here is an example using createPlayer: https://forum.openoffice.org/en/forum/viewtopic.php?f=20&t=57699.
EDIT:
This Basic code works on my system. To play the video, simply call the routine.
sub play_video
    If Video_flag = 0 Then
        video = converttoURL( _
            "C:\Users\JimStandard\Downloads\H264_test1_Talkinghead_avi_480x360.avi")
        Video_flag = 1
        ' For Windows:
        oManager = CreateUnoService("com.sun.star.media.Manager_DirectX")
        ' For Linux:
        ' oManager = CreateUnoService("com.sun.star.media.Manager_GStreamer")
        oPlayer = oManager.createPlayer( video )
        ' oPlayer.CreatePlayerwindow(array()) ' crashes?
        ' oPlayer.setRate(1.1)
        oPlayer.setPlaybackLoop(False)
        oPlayer.setMediaTime(0.0)
        oPlayer.setVolumeDB(GetSoundVolume())
        oPlayer.start() ' playback
        Player_flag = 1
    Else
        oPlayer.start() ' playback
        Player_flag = 1
    End If
End Sub
I am using JMeter to test HLS playback from a streaming server. So, the first HTTP request is for a master manifest file (m3u8). Say,
http://myserver/application1/subpath1/file1.m3u8
The reply to this will result in a playlist something like,
subsubFolder/360p/file1.m3u8
subsubFolder/480p/file1.m3u8
subsubFolder/720p/file1.m3u8
So the next set of URLs becomes
http://myserver/application1/subpath1/subsubFolder/360p/file1.m3u8
http://myserver/application1/subpath1/subsubFolder/480p/file1.m3u8
http://myserver/application1/subpath1/subsubFolder/720p/file1.m3u8
Now, the individual replies to these will each be an index of chunks, like
0/file1.ts
1/file1.ts
2/file2.ts
3/file3.ts
Again, we have the next set of URLs:
http://myserver/application1/subpath1/subsubFolder/360p/0/file1.ts
http://myserver/application1/subpath1/subsubFolder/360p/1/file1.ts
http://myserver/application1/subpath1/subsubFolder/360p/2/file1.ts
http://myserver/application1/subpath1/subsubFolder/360p/3/file1.ts
This is just the case for one set (360p). There will be two more sets like this (for 480p and 720p).
I hope the requirement is clear up to this point.
Now, the problem statement.
Using http://myserver/application1 as the static part, the regex (.+?).m3u8 is applied to the first reply, which gives subpath1/subsubFolder/360p/file1. This is then appended to the static part again to get http://myserver/application1/subpath1/subsubFolder/360p/file1 + .m3u8
The problem comes at the next stage. As you can see, with the parts extracted previously, all I'm getting is
http://myserver/application1/subpath1/subsubFolder/360p/file1/0/file1.ts
The problem is obvious: an extra file1, i.e. 360p/file1 in place of 360p/0.
Any suggestions, inputs or alternate approaches appreciated.
If I understood the problem correctly, all you need is the file name, as the other URLs can be constructed from it. Rather than using http://myserver/application1 as the static part of your regex, I would try to get the filename directly:
([^\/.]+)\.m3u8$
# match one or more characters that are not a forward slash or a period
# followed by a period
# followed by the file extension (m3u8)
# anchor the whole match to the end
Now consider your URLs, e.g. http://myserver/application1/subpath1/subsubFolder/360p/file1.m3u8: the above regex will capture file1 (see a working demo here). Now you can construct the other URLs, e.g. (pseudo code):
http://myserver/application1/subpath1/subsubFolder/360p/ + filename + .m3u8
http://myserver/application1/subpath1/subsubFolder/360p/ + filename + /0/ + filename + .ts
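To sanity-check the capture outside JMeter, a couple of lines of Java reproduce it (the sample URL is taken from the question; note that Java regex does not need the forward slash escaped):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCheck {
    public static void main(String[] args) {
        String url = "http://myserver/application1/subpath1/subsubFolder/360p/file1.m3u8";
        // Capture the file name: no slash or dot, right before ".m3u8" at the end
        Matcher m = Pattern.compile("([^/.]+)\\.m3u8$").matcher(url);
        if (m.find()) {
            System.out.println(m.group(1)); // prints "file1"
        }
    }
}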
Is this what you were after?
Make sure you use:
(.*?) - as Regular Expression (change plus to asterisk in your regex)
-1 - as Match No.
$1$ - as Template
See the How to Load Test HTTP Live Media Streaming (HLS) with JMeter article for detailed instructions.
If you are ready to pay for a commercial plugin, there is an easy and much more realistic solution: a plugin for Apache JMeter provided by UbikLoadPack.
Besides doing this job for you, it will simulate the way a player would read the file. It will also scale much better than any custom script or player solution.
It supports VOD and Live, which are quite difficult to script.
See:
http://www.ubik-ingenierie.com/blog/easy-and-realistic-load-testing-of-http-live-streaming-hls-with-apache-jmeter/
http://www.ubik-ingenierie.com/blog/ubikloadpack-http-live-streaming-plugin-jmeter-videostreaming-mpegdash/
Disclaimer: we are the providers of this solution.
I have MP3 audio files that contain voicemails that are left by a computer.
The message content is always in the same format and left by the same computer voice, with only a slight variation in content:
"You sold 4 cars today" (where the 4 can be anything from 0 to 9).
I have been trying to set up Sphinx, but the out-of-the-box models did not work very well.
I then tried to build my own acoustic model and haven't had much better success yet (30% unrecognized is my best).
I am wondering if voice recognition might be overkill for this task, since I have exactly ONE voice, an expected audio pattern, and a very limited dictionary that would need to be recognized.
I have access to each of the ten sounds (spoken numbers) that I would need to search for in the message.
Is there a non-VR approach to finding sounds in an audio file? (I can convert MP3 to another format if necessary.)
Update: My solution to this task follows
After working with Nikolay directly, I learned that the answer to my original question is irrelevant, since the desired results can be achieved (with 100% accuracy) using Sphinx4 and a JSGF grammar.
1: Since the speech in my audio files is very limited, I created a JSGF grammar (salesreport.gram) to describe it. All of the information I needed to create the following grammar was available on the JSpeech Grammar Format page.
#JSGF V1.0;
grammar salesreport;
public <salesreport> = (<intro> | <sales> | <closing>)+;
<intro> = this is your automated automobile sales report;
<sales> = you sold <digit> cars today;
<closing> = thank you for using this system;
<digit> = zero | one | two | three | four | five | six | seven | eight | nine;
NOTE: Sphinx does not support JSGF tags in the grammar. If necessary, a regular expression may be used to extract specific information (the number of sales in my case).
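For example, a hypothesis produced by the <sales> rule can be post-processed like this (a minimal sketch; mapping the captured word back to a number is left as a lookup):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractSales {
    public static void main(String[] args) {
        // A hypothesis as produced by the <sales> rule in salesreport.gram
        String hypothesis = "you sold four cars today";
        Matcher m = Pattern.compile("you sold (\\w+) cars today").matcher(hypothesis);
        if (m.find()) {
            System.out.println("Cars sold: " + m.group(1)); // prints "four"
        }
    }
}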
2: It is very important that your audio files are properly formatted. The default sample rate for Sphinx is 16 kHz (16 kHz means there are 16,000 samples collected every second). I converted my MP3 audio files to WAV format using FFmpeg.
ffmpeg -i input.mp3 -acodec pcm_s16le -ac 1 -ar 16000 output.wav
Unfortunately, FFmpeg makes this solution OS dependent. I am still looking for a way to convert the files using Java and will update this post if/when I find it.
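One possible pure-Java route, which I have not verified, is to go through the Java Sound API with an MP3 decoder SPI (for example JLayer/mp3spi plus Tritonus) on the classpath; core Java Sound cannot read MP3 on its own, and the resampling step needs a sample-rate-converting SPI as well:
import java.io.File;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

public class Mp3ToWav {
    public static void main(String[] args) throws Exception {
        // Requires an MP3 SPI (e.g., mp3spi + jlayer + tritonus-share) on the classpath
        AudioInputStream mp3 = AudioSystem.getAudioInputStream(new File("input.mp3"));
        AudioFormat base = mp3.getFormat();

        // Step 1: decode MP3 to PCM at the source sample rate
        AudioFormat decoded = new AudioFormat(AudioFormat.Encoding.PCM_SIGNED,
                base.getSampleRate(), 16, base.getChannels(),
                base.getChannels() * 2, base.getSampleRate(), false);
        AudioInputStream pcm = AudioSystem.getAudioInputStream(decoded, mp3);

        // Step 2: resample to 16 kHz, 16-bit mono (what Sphinx expects)
        AudioFormat target = new AudioFormat(16000f, 16, 1, true, false);
        AudioInputStream resampled = AudioSystem.getAudioInputStream(target, pcm);

        AudioSystem.write(resampled, AudioFileFormat.Type.WAVE, new File("output.wav"));
    }
}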
Although it was not required to complete this task, I found Audacity helpful for working with audio files. It includes many utilities (checking sample rate and bandwidth, file format conversion, etc.).
3: Since telephone audio is sampled at 8 kHz (and therefore has a limited bandwidth, the range of frequencies included in the audio), I used the Sphinx en-us-8khz acoustic model.
4: I generated my dictionary, salesreport.dic, using lmtool.
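For reference, lmtool produces a plain-text dictionary mapping each word to its phonemes; the digit entries in salesreport.dic should look roughly like this (standard CMUdict pronunciations, with the grammar's other words omitted):
EIGHT    EY T
FIVE     F AY V
FOUR     F AO R
NINE     N AY N
ONE      W AH N
SEVEN    S EH V AH N
SIX      S IH K S
THREE    TH R IY
TWO      T UW
ZERO     Z IH R OW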
5: Using the files mentioned in the previous steps and the following code (a modified version of Nikolay's example), my speech is recognized with 100% accuracy every time.
public String parseAudio(File voiceFile) throws FileNotFoundException, IOException
{
    StringBuilder resultSB = new StringBuilder();

    Configuration configuration = new Configuration();
    configuration.setAcousticModelPath("file:acoustic_models/en-us-8khz");
    configuration.setDictionaryPath("file:salesreport.dic");
    configuration.setGrammarPath("file:salesreportResources/");
    configuration.setGrammarName("salesreport");
    configuration.setUseGrammar(true);

    StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
    try (InputStream stream = new FileInputStream(voiceFile))
    {
        recognizer.startRecognition(stream);
        SpeechResult result;
        while ((result = recognizer.getResult()) != null)
        {
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
            resultSB.append(result.getHypothesis()).append(" ");
        }
        recognizer.stopRecognition();
    }
    return resultSB.toString().trim();
}
The accuracy on such a task should be 100%. Here is a code sample to use with the grammar:
public class TranscriberDemoGrammar {
    public static void main(String[] args) throws Exception {
        System.out.println("Loading models...");

        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("file:en-us-8khz");
        configuration.setDictionaryPath("cmu07a.dic");
        configuration.setGrammarPath("file:./");
        configuration.setGrammarName("digits");
        configuration.setUseGrammar(true);

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);

        InputStream stream = new FileInputStream(new File("file.wav"));
        recognizer.startRecognition(stream);
        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}
You also need to make sure that both the sample rate and the audio bandwidth match the decoder configuration:
http://cmusphinx.sourceforge.net/wiki/faq#qwhat_is_sample_rate_and_how_does_it_affect_accuracy
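A quick way to verify what a WAV file actually contains, before feeding it to the decoder, is to ask Java Sound for its format (a small sketch):
import java.io.File;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;

public class ShowWavFormat {
    public static void main(String[] args) throws Exception {
        AudioFormat fmt = AudioSystem.getAudioFileFormat(new File("file.wav")).getFormat();
        // For the en-us-8khz model the sample rate here should be 8000 Hz
        System.out.println("Sample rate: " + fmt.getSampleRate() + " Hz, "
                + fmt.getSampleSizeInBits() + "-bit, channels: " + fmt.getChannels());
    }
}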
First of all, Sphinx only works with WAVE files. For a very limited vocabulary, Sphinx should give good results when using a JSGF grammar file (but not so good in dictation mode). The main issue I found is that it does not provide a confidence score (that feature is currently bugged). You might want to check three other alternatives:
SpeechRecognizer from the Windows platform. It provides easy-to-use recognition with confidence scores and supports grammars. It is C#, but you could build a native wrapper or a custom server.
The Google Speech API is an online speech recognition engine, free for up to 50 requests per day. There are several APIs for this, but I like JARVIS. Be careful though: there is no official support or documentation for this, and Google might (and already has in the past) shut this engine down whenever they want. Of course, you will have a privacy issue (is it okay to send this audio data to a third party?).
I recently came across ISpeech and got good results with it. It provides its own Java wrapper API and is free for mobile apps. Same privacy issue as the Google API.
I myself chose to go with the first option and built a speech recognition service in a custom HTTP server. I found it to be the most effective way to tackle speech recognition from Java until the Sphinx scoring issue is fixed.