CMUSphinx German command & control app, bad accuracy

I'm trying to implement a German command and control application with CMUSphinx and Java. So far, the application should recognize only a few words (numbers from 1 to 9, yes/no).
Unfortunately, the accuracy is very bad. When a word is recognized correctly, it seems to be only by chance.
Here is my Java code so far (adapted from the tutorial):
public static void main(String[] args) throws IOException {
// Configuration Object
Configuration configuration = new Configuration();
// Set path to the acoustic model.
configuration.setAcousticModelPath("resource:/cmusphinx-de-voxforge-5.2");
// Set path to the dictionary.
configuration.setDictionaryPath("resource:/cmusphinx-voxforge-de.dic");
// use grammar
configuration.setGrammarPath("resource:/");
configuration.setGrammarName("dialog");
configuration.setUseGrammar(true);
LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
recognizer.startRecognition(true);
SpeechResult result;
while ((result = recognizer.getResult()) != null) {
System.out.format("Hypothesis: %s\n", result.getHypothesis());
}
recognizer.stopRecognition();
}
Here is my grammar file:
#JSGF V1.0;
grammar dialog;
public <digit> = 1 | 2 | 3 | 4 |5 | 6 | 7 | 8 | 9 | ja | nein;
I've downloaded the German acoustic model and dictionary from here: https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/German/
Is there something obvious I'm missing here? Where is the problem?
Thanks in advance and kind regards.

I have tried pocketsphinx with the English and German models, and accuracy is good when it comes to a predefined, limited set of phrases. You can forget about general requests like "could you please find me a restaurant in the downtown".
To achieve good accuracy with pocketsphinx:
Check that your mic, audio device, files and everything else use 16 kHz, because the general model was trained on audio at that sample rate.
Create your own limited dictionary. You cannot use the full cmusphinx-voxforge-de.dic, because accuracy drops dramatically with it.
Create your own language model.
Try modifying the pronunciation files to fit your accent.
Search for the Jasper project on GitHub to see how it's implemented.
Also check the documentation.
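As a rough illustration of the 16 kHz check above, here is a minimal Java sketch using only the standard javax.sound.sampled API. The class and method names (AudioFormatCheck, matchesModel) are made up for this example, and the target format (16-bit, mono, signed little-endian PCM at the model's sample rate) is my assumption about what the stock models expect, so verify it against the documentation of the model you use.

```java
import java.io.File;
import java.io.IOException;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;

// Hypothetical helper, not part of CMUSphinx.
class AudioFormatCheck {

    // Assumption: the model wants 16-bit mono signed little-endian PCM
    // at the given sample rate (16000 for the general models).
    static boolean matchesModel(AudioFormat f, float expectedRate) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(f.getEncoding())
                && f.getSampleRate() == expectedRate
                && f.getSampleSizeInBits() == 16
                && f.getChannels() == 1
                && !f.isBigEndian();
    }

    // Reads only the header of a WAV file and checks its format.
    static boolean matchesModel(File wav, float expectedRate)
            throws IOException, UnsupportedAudioFileException {
        try (AudioInputStream in = AudioSystem.getAudioInputStream(wav)) {
            return matchesModel(in.getFormat(), expectedRate);
        }
    }

    public static void main(String[] args) throws Exception {
        for (String path : args) {
            boolean ok = matchesModel(new File(path), 16000f);
            System.out.println(path + ": " + (ok ? "ok" : "format mismatch"));
        }
    }
}
```

Run it with the paths of your recordings as arguments. A mismatch means the file should be converted before it is passed to the recognizer.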

Well, accuracy is not great; probably the original database didn't have many examples like yours. Your dialect contributes partly: Germans say 7 with z, not with s. Echo in your room contributes partly too. I am not sure how you recorded your audio; if you used some compression or codec in between, that might also contribute to bad accuracy.
You might want to collect a few hundred samples and perform MAP adaptation to improve the accuracy.

Related

How to specify phonetic keywords for IBM Watson speech2text service?

While we have had good success with Bluemix Java SDK in the general case, we've bumped into problems while trying to recognize occasional non-English words (e.g., foreign last names). Our hope was that one could specify the keyword list using SPR phonetic notation (which works great for text2speech), but that does not seem to be supported for speech2text. Any suggestions/workarounds?
SpeechToText service = new SpeechToText();
service.setUsernameAndPassword("USERNAME", "PASSWORD");
File audio = new File("C:\\Users\\AudioFiles\\euler.wav");
RecognizeOptions options = new RecognizeOptions.Builder()
.contentType(HttpMediaType.AUDIO_WAV)
.continuous(true)
.inactivityTimeout(500)
.keywords(new String[] {"Agarwal", "Euler", "Qin"})
.keywordsThreshold(0.5)
.build();
SpeechResults transcript = service.recognize(audio, options);
System.out.println(transcript);
The objective is to be able to say "My name is John Euler." and for the transcript not to return something like "My name is John Oyler." (which is what it does currently).
Thx.

Hmm, the three words that you pass are actually in the vocabulary, but maybe they are not found because they have very little weight in the language model. Have you tried relaxing the threshold? You can also try to use the Watson STT customization service to boost probabilities of names if the task is name focused

Naive Bayes Text Classification Algorithm

Hi there! I need help implementing the Naive Bayes text classification algorithm in Java to test my data set for research purposes. It is compulsory to implement the algorithm in Java rather than using tools such as Weka or RapidMiner to get the results.
My Data Set has the following type of Data:
Doc Words Category
That is, the training words and the category of each training string are known in advance. Part of the data set is given below:
Doc Words Category
Training
1 Integration Communities Process Oriented Structures...(more string) A
2 Integration Communities Process Oriented Structures...(more string) A
3 Theory Upper Bound Routing Estimate global routing...(more string) B
4 Hardware Design Functional Programming Perfect Match...(more string) C
.
.
.
Test
5 Methodology Toolkit Integrate Technological Organisational
6 This test contain string naive bayes test text text test
So the data set comes from a MySQL database, and it may contain multiple training strings and test strings. I just need to implement the Naive Bayes text classification algorithm in Java.
The algorithm should follow the example mentioned here: Table 13.1
Source: Read here
I can implement the algorithm in Java myself, but I would like to know whether there is a Java library, with source code and documentation available, that would let me just test the results.
I need the results only once; it is just a one-time test.
So, to come to the point: can somebody tell me about a good Java library that helps me code this algorithm and process my data set, or give me any good ideas on how to do it easily?
I will be thankful for your help.
Thanks in advance

As per your requirement, you can use MLlib, the machine learning library from Apache Spark. MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities. There is also a Java code template for implementing the algorithm with the library. So to begin with, you can:
Implement the Java skeleton for Naive Bayes provided on their site, as given below.
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import scala.Tuple2;
JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ... // test set
final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
JavaPairRDD<Double, Double> predictionAndLabel =
test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
@Override public Tuple2<Double, Double> call(LabeledPoint p) {
return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
}
});
double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
@Override public Boolean call(Tuple2<Double, Double> pl) {
return pl._1().equals(pl._2());
}
}).count() / (double) test.count();
For testing your data sets, I know of no better solution here than Spark SQL. MLlib fits into Spark's APIs perfectly. To start using it, I would recommend going through the MLlib API first and implementing the algorithm according to your needs. This is pretty easy using the library.
For the next step, processing your data sets, just use Spark SQL.
I recommend you stick to this. I too hunted down multiple options before settling on this easy-to-use library and its seamless interoperation with some other technologies. I would have posted the complete code here to fit your question exactly, but I think you are good to go.

You can use the Weka Java API and include it in your project if you do not want to use the GUI.
Here's a link to the documentation to incorporate a classifier in your code:
https://weka.wikispaces.com/Use+WEKA+in+your+Java+code

Please take a look at the Bow toolkit.
It has a GNU license and comes with source code. Some of its features include:
Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.
Performing test/train splits, and automatic classification tests.
It's not a Java library, but you could compile the C code to check that your Java code produces similar results for a given corpus.
I also spotted a decent Dr. Dobb's article with an implementation in Perl. Once again, not the desired Java, but it will give you the one-time results that you are asking for.

Hi, I think Spark would help you a lot:
http://spark.apache.org/docs/1.2.0/mllib-naive-bayes.html
You can even choose the language you think is most appropriate for your needs: Java, Python, or Scala!

You may want to take a look at this.
https://mahout.apache.org/users/classification/bayesian.html

Please use scikit-learn from Python. There is already an implementation of what you need:
class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

You can use an algorithm platform like KNIME; it has a variety of classification algorithms (Naive Bayes included). You can run it with a GUI or through the Java API.

If you want to implement the Naive Bayes text classification algorithm in Java, the WEKA Java API is a good solution. The data set has to be in .arff format. Creating an .arff file from a MySQL database is very easy. Below are the Java code for the classifier and a link to a sample .arff file.
Create a new text document and open it with Notepad. Copy and paste all the text from the following link. Save it as DataSet.arff. http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.arff
Download Weka Java API: http://www.java2s.com/Code/Jar/w/weka.htm
Code for the classifier:
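import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
public static void main(String[] args) {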
try {
StringBuilder txtAreaShow = new StringBuilder();
//reads the arff file
BufferedReader breader = null;
breader = new BufferedReader(new FileReader("DataSet.arff"));
//the last attribute is the class attribute (yes/no): with 40 attributes, the class index is 39
Instances train = new Instances(breader);
train.setClassIndex(train.numAttributes() - 1);
breader.close();
//
NaiveBayes nB = new NaiveBayes();
nB.buildClassifier(train);
Evaluation eval = new Evaluation(train);
eval.crossValidateModel(nB, train, 10, new Random(1));
System.out.println("Run Information\n=====================");
System.out.println("Scheme: " + nB.getClass().getName());
System.out.println("Relation: ");
System.out.println("\nClassifier Model(full training set)\n===============================");
System.out.println(nB);
System.out.println(eval.toSummaryString("\nSummary Results\n==================", true));
System.out.println(eval.toClassDetailsString());
System.out.println(eval.toMatrixString());
//txtArea output
txtAreaShow.append("\n\n\n");
txtAreaShow.append("Run Information\n===================\n");
txtAreaShow.append("Scheme: " + nB.getClass().getName());
txtAreaShow.append("\n\nClassifier Model(full training set)"
+ "\n======================================\n");
txtAreaShow.append("" + nB);
txtAreaShow.append(eval.toSummaryString("\n\nSummary Results\n==================\n", true));
txtAreaShow.append(eval.toClassDetailsString());
txtAreaShow.append(eval.toMatrixString());
txtAreaShow.append("\n\n\n");
System.out.println(txtAreaShow.toString());
} catch (FileNotFoundException ex) {
System.err.println("File not found");
System.exit(1);
} catch (IOException ex) {
System.err.println("Invalid input or output.");
System.exit(1);
} catch (Exception ex) {
System.err.println("Exception occurred!");
System.exit(1);
}
}

You can take a look at Blayze - It's a pretty minimal Naive Bayes library for the JVM written in Kotlin. Should be easy to follow.
Full disclosure: I'm one of the authors of Blayze
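For a one-time test without any of the libraries mentioned in this thread, a multinomial Naive Bayes classifier with add-one smoothing fits in a few dozen lines of plain Java. The following is only a minimal sketch: the class name NaiveBayesSketch, the whitespace tokenizer, and the toy documents are illustrative choices of mine, not the asker's data and not Blayze's API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical minimal multinomial Naive Bayes with add-one (Laplace) smoothing.
class NaiveBayesSketch {
    private final Map<String, Integer> docsPerCategory = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> termsPerCategory = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Adds one labelled training document.
    void train(String text, String category) {
        totalDocs++;
        docsPerCategory.merge(category, 1, Integer::sum);
        Map<String, Integer> counts =
                termCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            counts.merge(word, 1, Integer::sum);
            termsPerCategory.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // log P(category) + sum of log P(word | category) over the known words.
    double logScore(String text, String category) {
        double score = Math.log(docsPerCategory.get(category) / (double) totalDocs);
        Map<String, Integer> counts = termCounts.get(category);
        double denominator = termsPerCategory.getOrDefault(category, 0) + vocabulary.size();
        for (String word : tokenize(text)) {
            if (!vocabulary.contains(word)) {
                continue; // words never seen in training carry no evidence
            }
            score += Math.log((counts.getOrDefault(word, 0) + 1) / denominator);
        }
        return score;
    }

    // Returns the category with the highest log score.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docsPerCategory.keySet()) {
            double score = logScore(text, category);
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    private static String[] tokenize(String text) {
        String trimmed = text.toLowerCase().trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("Chinese Beijing Chinese", "china");
        nb.train("Chinese Chinese Shanghai", "china");
        nb.train("Chinese Macao", "china");
        nb.train("Tokyo Japan Chinese", "other");
        System.out.println(nb.classify("Chinese Chinese Chinese Tokyo Japan")); // prints "china"
    }
}
```

To use it on the data set from the question, call train once per row of the training table (words, category) and classify once per test row. Scores are kept in log space so that long documents do not underflow.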

Java voice recognition for very small dictionary

I have MP3 audio files that contain voicemails that are left by a computer.
The message content is always in same format and left by the same computer voice with only a slight variation in content:
"You sold 4 cars today" (where the 4 can be anything from 0 to 9).
I have been trying to set up Sphinx, but the out-of-the-box models did not work too well.
I then tried to write my own acoustic model and haven't had much better success yet (30% unrecognized is my best).
I am wondering if voice recognition might be overkill for this task since I have exactly ONE voice, an expected audio pattern and a very limited dictionary that would need to be recognized.
I have access to each of the ten sounds (spoken numbers) that I would need to search for in the message.
Is there a non-VR approach to finding sounds in an audio file? (I can convert MP3 to another format if necessary.)
Update: My solution to this task follows
After working with Nikolay directly, I learned that the answer to my original question is irrelevant since the desired results may be achieved (with 100% accuracy) using Sphinx4 and a JSGF grammar.
1: Since the speech in my audio files is very limited, I created a JSGF grammar (salesreport.gram) to describe it. All of the information I needed to create the following grammar was available on this JSpeech Grammar Format page.
#JSGF V1.0;
grammar salesreport;
public <salesreport> = (<intro> | <sales> | <closing>)+;
<intro> = this is your automated automobile sales report;
<sales> = you sold <digit> cars today;
<closing> = thank you for using this system;
<digit> = zero | one | two | three | four | five | six | seven | eight | nine;
NOTE: Sphinx does not support JSGF tags in the grammar. If necessary, a regular expression may be used to extract specific information (the number of sales in my case).
2: It is very important that your audio files are properly formatted. The default sample rate for Sphinx is 16 kHz (16 kHz means that 16000 samples are collected every second). I converted my MP3 audio files to WAV format using FFmpeg.
ffmpeg -i input.mp3 -acodec pcm_s16le -ac 1 -ar 16000 output.wav
Unfortunately, FFmpeg renders this solution OS dependent. I am still looking for a way to convert the files using Java and will update this post if/when I find it.
Although it was not required to complete this task, I found Audacity helpful for working with audio files. It includes many utilities for working with the audio files (checking sample rate and bandwidth, file format conversion, etc).
3: Since telephone audio is band-limited (it is typically sampled at 8 kHz, so it carries frequencies only up to about 4 kHz), I used the Sphinx en-us-8khz acoustic model.
4: I generated my dictionary, salesreport.dic, using lmtool
5: Using the files mentioned in the previous steps and the following code (modified version of Nikolay's example), my speech is recognized with 100% accuracy every time.
public String parseAudio(File voiceFile) throws FileNotFoundException, IOException
{
String retVal = null;
StringBuilder resultSB = new StringBuilder();
Configuration configuration = new Configuration();
configuration.setAcousticModelPath("file:acoustic_models/en-us-8khz");
configuration.setDictionaryPath("file:salesreport.dic");
configuration.setGrammarPath("file:salesreportResources/");
configuration.setGrammarName("salesreport");
configuration.setUseGrammar(true);
StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
try (InputStream stream = new FileInputStream(voiceFile))
{
recognizer.startRecognition(stream);
SpeechResult result;
while ((result = recognizer.getResult()) != null)
{
System.out.format("Hypothesis: %s\n", result.getHypothesis());
resultSB.append(result.getHypothesis()
+ " ");
}
recognizer.stopRecognition();
}
return resultSB.toString().trim();
}
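The regular-expression extraction mentioned in the NOTE under step 1 could look roughly like the sketch below. It assumes the hypothesis string contains the phrase from the <sales> rule of the grammar above. The class and method names (SalesExtractor, extractSales) are hypothetical and not part of Sphinx.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that pulls the sales figure out of a recognized hypothesis.
class SalesExtractor {
    // Same words, in the same order, as the <digit> rule of the grammar.
    private static final List<String> DIGITS = Arrays.asList(
            "zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "nine");

    // Mirrors the <sales> rule: "you sold <digit> cars today".
    private static final Pattern SALES = Pattern.compile(
            "you sold (" + String.join("|", DIGITS) + ") cars today");

    // Returns the number of cars sold, or empty if the phrase is not present.
    static OptionalInt extractSales(String hypothesis) {
        Matcher m = SALES.matcher(hypothesis.toLowerCase());
        if (!m.find()) {
            return OptionalInt.empty();
        }
        // The index of the word in the list is its numeric value.
        return OptionalInt.of(DIGITS.indexOf(m.group(1)));
    }

    public static void main(String[] args) {
        String hypothesis = "this is your automated automobile sales report "
                + "you sold four cars today thank you for using this system";
        System.out.println(extractSales(hypothesis)); // prints "OptionalInt[4]"
    }
}
```

The string returned by parseAudio can be passed to extractSales directly. An empty result then signals that the sales sentence was not recognized.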

The accuracy on such a task should be 100%. Here is a code sample to use with the grammar:
public class TranscriberDemoGrammar {
public static void main(String[] args) throws Exception {
System.out.println("Loading models...");
Configuration configuration = new Configuration();
configuration.setAcousticModelPath("file:en-us-8khz");
configuration.setDictionaryPath("cmu07a.dic");
configuration.setGrammarPath("file:./");
configuration.setGrammarName("digits");
configuration.setUseGrammar(true);
StreamSpeechRecognizer recognizer =
new StreamSpeechRecognizer(configuration);
InputStream stream = new FileInputStream(new File("file.wav"));
recognizer.startRecognition(stream);
SpeechResult result;
while ((result = recognizer.getResult()) != null) {
System.out.format("Hypothesis: %s\n",
result.getHypothesis());
}
recognizer.stopRecognition();
}
}
You also need to make sure that both the sample rate and the audio bandwidth match the decoder configuration:
http://cmusphinx.sourceforge.net/wiki/faq#qwhat_is_sample_rate_and_how_does_it_affect_accuracy

First of all, Sphinx only works with WAVE files. For a very limited vocabulary, Sphinx should give good results when using a JSGF grammar file (but not that good in dictation mode). The main issue I found is that it does not provide a confidence score (this is currently bugged). You might want to check three other alternatives:
SpeechRecognizer from the Windows platform. It provides easy-to-use recognition with confidence scores and supports grammars. This is C#, but you could build a native wrapper or a custom server.
Google Speech API is an online speech recognition engine, free up to 50 requests per day. There are several APIs for this, but I like JARVIS. Be careful though, since there is no official support or documentation for this, and Google might close this engine whenever they want (and has already done so in the past). Of course, you will have some privacy issues (is it okay to send this audio data to a third party?).
I recently came across iSpeech and got good results with it. It provides its own Java wrapper API, free for mobile apps. It has the same privacy issues as the Google API.
I myself chose to go with the first option and built a speech recognition service in a custom HTTP server. I found it to be the most effective way to tackle speech recognition from Java until the Sphinx scoring issue is fixed.

sphinx speech recognition delay

I am using the open source Sphinx SDK to do some voice recognition. I am currently running the HelloWorld example. However, the response is very sluggish: it takes several attempts to recognize a word, and sometimes it recognizes the word but takes a while to output what I have said. Any ideas how to improve this? Also, when I change the grammar file, it doesn't update and recognize my new words.
Thanks

Basically, you can use Sphinx in several configurations. If you know the pattern of the speech that you have to recognize, you can use the configuration with a custom grammar.
That configuration responds faster than the normal configuration, since it only listens for predefined words in a predefined pattern (a grammar).
You can define your own grammar file by following the JSGF standard.
Sample Configuration
Configuration configuration = new Configuration();
configuration.setAcousticModelPath(ACOUSTIC_MODEL);
configuration.setDictionaryPath(DICTIONARY_PATH);
configuration.setGrammarPath(GRAMMAR_PATH);
configuration.setUseGrammar(true);
configuration.setGrammarName("mygrammar");
LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
Sample Grammar File
#JSGF V1.0;
grammar mygrammar;
public <COMMON_COMMAND> = [please] turn (on | off) lights;

A tool to add and complete PHP source code documentation [closed]

I have several finished, older PHP projects with a lot of includes that I would like to document in javadoc/phpDocumentor style.
While working through each file manually and being forced to do a code review alongside the documenting would be the best thing, I am, simply out of time constraints, interested in tools to help me automate the task as much as possible.
The tool I am thinking about would ideally have the following features:
Parse a PHP project tree and tell me where there are undocumented files, classes, and functions/methods (i.e. elements missing the appropriate docblock comment)
Provide a method to half-way easily add the missing docblocks by creating the empty structures and, ideally, opening the file in an editor (internal or external I don't care) so I can put in the description.
Optional:
Automatic recognition of parameter types, return values and such. But that's not really required.
The language in question is PHP, though I could imagine that a C/Java tool might be able to handle PHP files after some tweaking.
Thanks for your great input!

I think PHP_Codesniffer can indicate when there is no docblock -- see the examples of reports on this page (quoting one of those) :
--------------------------------------------------------------------------------
FOUND 5 ERROR(S) AND 1 WARNING(S) AFFECTING 5 LINE(S)
--------------------------------------------------------------------------------
2 | ERROR | Missing file doc comment
20 | ERROR | PHP keywords must be lowercase; expected "false" but found
| | "FALSE"
47 | ERROR | Line not indented correctly; expected 4 spaces but found 1
47 | WARNING | Equals sign not aligned with surrounding assignments
51 | ERROR | Missing function doc comment
88 | ERROR | Line not indented correctly; expected 9 spaces but found 6
--------------------------------------------------------------------------------
I suppose you could use PHP_Codesniffer to at least get a list of all files/classes/methods that lack documentation; from what I remember, it can generate XML as output, which would be easier to parse using some automated tool -- that could be the first step of some docblock-generator ;-)
Also, if you are using phpDocumentor to generate the documentation, can this one not report errors for missing blocks ?
After a couple of tests, it can -- for instance, running it on a class-file with not much documentation, with the --undocumentedelements option, such as this :
phpdoc --filename MyClass.php --target doc --undocumentedelements
Gives this in the middle of the output :
Reading file /home/squale/developpement/tests/temp/test-phpdoc/MyClass.php -- Parsing file
WARNING in MyClass.php on line 2: Class "MyClass" has no Class-level DocBlock.
WARNING in MyClass.php on line 2: no @package tag was used in a DocBlock for class MyClass
WARNING in MyClass.php on line 5: Method "__construct" has no method-level DocBlock.
WARNING in MyClass.php on line 16: File "/home/squale/developpement/tests/temp/test-phpdoc/MyClass.php" has no page-level DocBlock, use @package in the first DocBlock to create one
done
But, here, too, even if it's useful as a reporting tool, it's not that helpful when it comes to generating the missing docblocks...
Now, I don't know of any tool that will pre-generate the missing docblocks for you : I generally use PHP_Codesniffer and/or phpDocumentor in my continuous integration mechanism, it reports missing docblocks, and, then, each developer adds what is missing, from his IDE...
... Which works pretty fine : there is generally not more than a couple of missing docblocks every day, so the task can be done by hand (and Eclipse PDT provides a feature to pre-generate the docblock for a method, when you are editing a specific file/method).
Apart from that, I don't really know any fully-automated tool to generate docblocks... But I'm pretty sure we could manage to create an interesting tool, using either :
The Reflection API
token_get_all to parse the source of a PHP file.
After a bit more searching, though, I found this blog-post (it's in French -- maybe some people here will be able to understand) : Ajout automatique de Tags phpDoc à l'aide de PHP_Beautifier.
Possible translation of the title : "Automatically adding phpDoc tags, using PHP_Beautifier"
The idea is actually not bad :
The PHP_Beautifier tool is pretty nice and powerful, when it comes to formatting some PHP code that's not well formatted
I've used it many times for code that I couldn't even read ^^
And it can be extended, using what it calls "filters".
see PHP_Beautifier_Filter for a list of provided filters
The idea that's used in the blog-post I linked to is to :
create a new PHP_Beautifier filter, that will detect the following tokens :
T_CLASS
T_FUNCTION
T_INTERFACE
And add a "draft" doc-block just before them, if there is not already one
To run the tool on some MyClass.php file, I've had to first install PHP_Beautifier :
pear install --alldeps Php_Beautifier-beta
Then, download the filter to the directory I was working in (could have put it in the default directory, of course) :
wget http://fxnion.free.fr/downloads/phpDoc.filter.phpcs
cp phpDoc.filter.phpcs phpDoc.filter.php
And, after that, I created a new beautifier-1.php script (Based on what's proposed in the blog-post I linked to, once again), which will :
Load the content of my MyClass.php file
Instantiate PHP_Beautifier
Add some filters to beautify the code
Add the phpDoc filter we just downloaded
Beautify the source of our file, and echo it to the standard output.
The code of the beautifier-1.php script will look like this :
(Once again, the biggest part is a copy-paste from the blog-post ; I only translated the comments, and changed a couple of small things)
require_once 'PHP/Beautifier.php';
// Load the content of my source-file, with missing docblocks
$sourcecode = file_get_contents('MyClass.php');
$oToken = new PHP_Beautifier();
// The phpDoc.filter.php file is not in the default directory,
// but in the "current" one => we need to add it to the list of
// directories that PHP_Beautifier will search in for filters
$oToken->addFilterDirectory(dirname(__FILE__));
// Adding some nice filters, to format the code
$oToken->addFilter('ArrayNested');
$oToken->addFilter('Lowercase');
$oToken->addFilter('IndentStyles', array('style'=>'k&r'));
// Adding the phpDoc filter, asking it to add a license
// at the beginning of the file
$oToken->addFilter('phpDoc', array('license'=>'php'));
// The code is in $sourcecode
// We could also have used the setInputFile method,
// instead of having the code in a variable
$oToken->setInputString($sourcecode);
$oToken->process();
// And here we get the result, all clean !
echo $oToken->get();
Note that I also had to patch two small things in phpDoc.filter.php, to avoid a warning and a notice...
The corresponding patch can be downloaded there : http://extern.pascal-martin.fr/so/phpDoc.filter-pmn.patch
Now, if we run that beautifier-1.php script :
$ php ./beautifier-1.php
With a MyClass.php file that initially contains this code :
class MyClass {
public function __construct($myString, $myInt) {
//
}
/**
* Method with some comment
* @param array $params blah blah
*/
public function doSomething(array $params = array()) {
// ...
}
protected $_myVar;
}
Here's the kind of result we get -- once our file is Beautified :
<?php
/**
*
* PHP version 5
*
* LICENSE: This source file is subject to version 3.0 of the PHP license
* that is available through the world-wide-web at the following URI:
* http://www.php.net/license/3_0.txt. If you did not receive a copy of
* the PHP License and are unable to obtain it through the web, please
* send a note to license@php.net so we can mail you a copy immediately.
* @category PHP
* @package
* @subpackage Filter
* @author FirstName LastName <mail>
* @copyright 2009 FirstName LastName
* @link
* @license http://www.php.net/license/3_0.txt PHP License 3.0
* @version CVS: $Id:$
*/
/**
* @todo Description of class MyClass
* @author
* @version
* @package
* @subpackage
* @category
* @link
*/
class MyClass {
/**
* @todo Description of function __construct
* @param $myString
* @param $myInt
* @return
*/
public function __construct($myString, $myInt) {
//
}
/**
* Method with some comment
* @param array $params blah blah
*/
public function doSomething(array $params = array()) {
// ...
}
protected $_myVar;
}
We can note :
The license block at the beginning of the file
The docblock that's been added on the MyClass class
The docblock that's been added on the __construct method
The docblock on the doSomething was already present in our code : it's not been removed.
There are some @todo tags ^^
Now, it's not perfect, of course :
It doesn't document all the stuff we could want it to
For instance, here, it didn't document the protected $_myVar
It doesn't enhance existing docblocks
And it doesn't open the file in any graphical editor
But that would be much harder, I guess...
But I'm pretty sure that this idea could be used as a starting point to something a lot more interesting :
About the stuff that doesn't get documented : adding new tags that will be recognized should not be too hard
You just have to add them to a list at the beginning of the filter
Enhancing existing docblocks might be harder, I have to admit
A nice thing is this could be fully-automated
Using Eclipse PDT, maybe this could be set as an External Tool, so we can at least launch it from our IDE ?

Since PHPCS was already mentioned, I throw in the Reflection API to check for missing DocBlocks. The article linked below is a short tutorial on how you could approach your problem:
http://www.phpriot.com/articles/reflection-api
There is also a PEAR package, PHP_DocBlockGenerator, that can create the file page block and the DocBlocks for includes, global variables, functions, parameters, classes, constants, properties and methods (and other things).

php-tracer-weaver can instrument code and generate docblocks with the parameter types, deduced through runtime analysis.

You can use the Code Sniffer for PHP to test your code against a predefined set of coding guidelines. It will also check for missing docblocks and generate a report you can use to identify the files.
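To make the idea of such a report concrete, and since the question allows for a Java tool that handles PHP files, here is a deliberately naive Java sketch of the check. It is not how PHP_CodeSniffer works internally. It scans the source line by line and flags class, interface and function declarations whose closest preceding non-blank line does not end a comment. The class name DocblockScanner and the line-based heuristic are assumptions for illustration, and a real tool should tokenize the source instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical line-based scanner, a heuristic rather than a PHP parser.
class DocblockScanner {
    // Optional modifiers, then class/interface/function and the declared name.
    private static final Pattern DECLARATION = Pattern.compile(
            "^\\s*(?:(?:abstract|final|public|protected|private|static)\\s+)*"
            + "(class|interface|function)\\s+&?(\\w+)");

    // Returns entries such as "function foo (line 12)" for undocumented elements.
    static List<String> findUndocumented(String phpSource) {
        List<String> missing = new ArrayList<>();
        String[] lines = phpSource.split("\\R");
        for (int i = 0; i < lines.length; i++) {
            Matcher m = DECLARATION.matcher(lines[i]);
            if (!m.find()) {
                continue;
            }
            // Walk back over blank lines to the closest preceding content.
            int j = i - 1;
            while (j >= 0 && lines[j].trim().isEmpty()) {
                j--;
            }
            // Heuristic: a docblock ends with "*/" right above the declaration.
            boolean documented = j >= 0 && lines[j].trim().endsWith("*/");
            if (!documented) {
                missing.add(m.group(1) + " " + m.group(2) + " (line " + (i + 1) + ")");
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String source = "<?php\n"
                + "class MyClass {\n"
                + "    public function __construct($myString, $myInt) {\n"
                + "    }\n"
                + "    /**\n"
                + "     * Method with some comment\n"
                + "     */\n"
                + "    public function doSomething(array $params = array()) {\n"
                + "    }\n"
                + "}\n";
        // Reports the class and the constructor, but not doSomething.
        findUndocumented(source).forEach(System.out::println);
    }
}
```

The heuristic also accepts a plain /* ... */ comment as documentation and misses declarations that follow other code on the same line, so treat its output as a rough to-do list only.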

The 1.4.x versions of phpDocumentor have the -ue option (--undocumentedelements) [1], which will cause undocumented elements to be listed as warnings on the errors.html page that it generates during its doc run.
Further, PHP_DocBlockGenerator [2] from PEAR looks like it can generate missing docblocks for you.
[1] -- http://manual.phpdoc.org/HTMLSmartyConverter/HandS/phpDocumentor/tutorial_phpDocumentor.howto.pkg.html#using.command-line.undocumentedelements
[2] -- http://pear.php.net/package/PHP_DocBlockGenerator

We use codesniffer for this functionality at work, using standard PEAR or Zend standards. It will not allow you to edit the files on the fly, but will definitely give you a list, with lines and description of what kind of docblock is missing.
HTH,
Jc

No idea if it's any help, but if Codesniffer can point out the functions/methods, then a decent PHP IDE (I offer PHPEd) can easily inspect and scaffold the PHPDoc comments for each function.
Simply type /** above each function and press ENTER, and PHPEd will auto-complete the comment with @param, @return, etc. filled out correctly, ready for your extra descriptions. Here's the first one I tried in order to provide an example:
/**
* put your comment here...
*
* @param mixed $url
* @param mixed $method
* @param mixed $timeout
* @param mixed $vars
* @param mixed $allow_redirects
* @return mixed
*/
public static function curl_get_file_contents($url, $method = 'get', $timeout = 30, $vars = array(), $allow_redirects = true)
This is easily tweaked to:
/**
* Retrieves a file using the cURL extension
*
* @param string $url
* @param string $method
* @param int $timeout
* @param array $vars parameters to pass to cURL
* @param bool $allow_redirects whether to follow any redirects $url serves up
* @return mixed
*/
public static function curl_get_file_contents($url, $method = 'get', $timeout = 30, $vars = array(), $allow_redirects = true)
Not exactly an automated solution, but quick enough for me as a lazy developer :)

You want to actually automate the problem of filling in the "javadoc" type data?
The DMS Software Reengineering Toolkit could be configured to do this.
It parses source text just like compilers do, builds internal compiler structures, lets you implement arbitrary analyses, make modifications to those structures, and then regenerate ("prettyprint") the source text changed according to the structure changes. It even preserves comments and formatting of the original text; you can of course insert additional comments and they will appear, and this seems to be your primary goal. DMS does this for many languages, including PHP.
What you would want to do is parse each PHP file, locate every class/method, generate the "javadoc" comments that should be on that entity (different for classes and methods, right?) and then check whether corresponding comments are actually present in the compiler structures. If not, simply insert them. PrettyPrint the final result.
Because it has access to the compiler structures that represent the code, it shouldn't be difficult to generate parameter and return info, as you suggested. What it can't do, of course, is generate comments about intended purpose; but it could generate a placeholder for you to fill in later.

I had to do a large batch of automated docblock fixing recently, mostly based on the correct answer above, with some context-specific changes. It's a hack, but I'm linking back here in case it's useful to anyone else in the future. Essentially, it does basic parsing on comment block tokens within PHP Beautifier.
https://gist.github.com/israelshirk/408f2656100196e73367
