Creating Weka classifier model without evaluation - java

I am trying to use java to feed a training dataset to Weka and get the model as output.
Found this instruction in Weka wiki:
You save a trained classifier with the -d option (dumping), e.g.:
java weka.classifiers.trees.J48 -t /some/where/train.arff -d /other/place/j48.model
The problem is that when I use the mentioned command, it first builds the model (which takes seconds) and then evaluates it using 10-fold cross-validation, which takes minutes and is not needed.
The question is: how can I use Weka to build the model for me without evaluating it?

java weka.classifiers.trees.J48 -no-cv -t /some/where/train.arff -d /other/place/j48.model
How I got there:
java weka.classifiers.trees.J48 --help
lists the available options, among others:
-no-cv Do not perform any cross validation.
So using your command with the -no-cv flag added seems to do what you want.
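Since the question mentions doing this from Java, here is a minimal sketch using the Weka API. The file paths are just the placeholders from the question, and the class attribute is assumed to be the last one; it builds the tree and serializes it with no evaluation step, equivalent to running with -d and -no-cv:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainJ48 {
    public static void main(String[] args) throws Exception {
        // Load the training data
        Instances train = DataSource.read("/some/where/train.arff");
        // Assume the class attribute is the last one
        train.setClassIndex(train.numAttributes() - 1);

        // Build the classifier; no cross-validation happens here
        J48 tree = new J48();
        tree.buildClassifier(train);

        // Save the trained model, equivalent to the -d option
        SerializationHelper.write("/other/place/j48.model", tree);
    }
}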

Related

Mahout: where can I find the java class executed by a bash shell script?

I'm trying to write a Java program using some functions from Mahout. I know that I can execute some Mahout functions from the command line, but I also want to know where I can find those functions in the .java files.
https://chimpler.wordpress.com/2013/04/17/generating-eigenfaces-with-mahout-svd-to-recognize-person-faces/
It seems like I can execute a java class with this command: $ mahout cleansvd -ci covariance.seq -ei output -o output2
So I checked the bash file and found this:
exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"
However I cannot find any definition or assignment of $CLASS, and I don't know where the "cleansvd" class is.
Also, I can execute this command to perform a Singular Value Decomposition with 5 arguments:
$ mahout svd --input covariance.seq --numRows 150 --numCols 150 --rank 50 --output output
And I did find class SingularValueDecomposition in the source file, which takes only one argument and cannot reduce rank.
I really want to know what happened and how shell scripts locate java classes.
First of all, that's a very old blog post.
I wrote this one to use with the "new" Mahout:
https://rawkintrevo.org/2016/11/10/deep-magic-volume-3-eigenfaces/
It uses Scala, not Java, but the code is very simple and straightforward. You could easily make a jar and import it into a Java program.
The blog also shows how the whole eigenfaces approach works: you basically just need to do SVD / DS-SVD on a matrix of faces-as-vectors.
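On the in-memory class mentioned in the question: as far as I can tell, org.apache.mahout.math.SingularValueDecomposition (in mahout-math) always computes the full decomposition, and rank reduction is something you do yourself by truncating the factors; the distributed "mahout svd" shell command is backed by a separate Hadoop job class, not by this one. A rough sketch, assuming mahout-math is on the classpath:

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.SingularValueDecomposition;

public class InMemorySvd {
    public static void main(String[] args) {
        // Tiny matrix standing in for the faces-as-vectors matrix
        Matrix m = new DenseMatrix(new double[][] {
            {1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}
        });

        // Full SVD; the constructor takes just the matrix
        SingularValueDecomposition svd = new SingularValueDecomposition(m);

        // "Reduce rank" manually by keeping only the first k singular triplets
        int k = 2;
        Matrix uk = svd.getU().viewPart(0, m.numRows(), 0, k);
        Matrix sk = svd.getS().viewPart(0, k, 0, k);
        Matrix vk = svd.getV().viewPart(0, m.numCols(), 0, k);

        // Rank-k approximation of the original matrix
        Matrix approx = uk.times(sk).times(vk.transpose());
        System.out.println(approx);
    }
}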

Which WEKA output option from the command line works with the WEKA Experiment Environment Analyse?

I am running the following command from the command line:
java -cp "./weka.jar" weka.classifiers.trees.J48 -t ./WEKA_reference_test_set.arff -i > test.arff
I want to be able to take my output results file, test.arff, and analyse the data in WEKA's GUI environment. Why? My experiments will eventually be run on a cluster, which needs to output to a format that gives me access to the same amount of results data as I would get from the GUI environment (the GUI, of course, not being available on the cluster). Specifically, I am trying to get at the results for each fold of the classifier when doing cross-validation on the data. WEKA runs fine and performs the classification as instructed (I have checked the results by storing them in a csv file), so there is no problem there.
Currently I am getting the following error when trying to load my results file in the WEKA GUI environment:
Any ideas?
This is running on OS X with WEKA version 3.6.10 (latest stable release).
This will give you limited results, not as extensive as running your experiment through the GUI, but it works.
java -cp "./weka.jar" weka.classifiers.trees.J48 -t ./WEKA_reference_test_set.arff -threshold-file test.arff
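If you need the per-fold results themselves on a headless cluster, another option is to drive the cross-validation from Java and write out whatever you want per fold. A minimal sketch using the Weka API (file name taken from the question; the class attribute is assumed to be the last one, and the stratification and seed are meant to mimic Weka's default CV):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PerFoldResults {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("WEKA_reference_test_set.arff");
        data.setClassIndex(data.numAttributes() - 1);

        int folds = 10;
        Instances randomized = new Instances(data);
        randomized.randomize(new Random(1));
        randomized.stratify(folds);

        for (int i = 0; i < folds; i++) {
            Instances train = randomized.trainCV(folds, i, new Random(1));
            Instances test = randomized.testCV(folds, i);

            J48 tree = new J48();
            tree.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);

            // Per-fold summary; write this to a file in whatever format you need
            System.out.println("=== Fold " + (i + 1) + " ===");
            System.out.println(eval.toSummaryString());
        }
    }
}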

MSword to XML/HTML using Apache Tika

I happened to know Tika, very useful in text extraction from word:
curl www.vit.org/downloads/doc/tariff.doc \
| java -jar tika-app-1.3.jar --text
But is there a way to use it to convert the Ms Word file into XML/HTML?
Yes, it involves changing a whopping 4 characters in your command!
If you run java -jar tika-app-1.3.jar --help you'll get something that starts with:
usage: java -jar tika-app.jar [option...] [file|port...]
Options:
-? or --help Print this usage message
-v or --verbose Print debug level messages
-V or --version Print the Apache Tika version number
-g or --gui Start the Apache Tika GUI
-s or --server Start the Apache Tika server
-f or --fork Use Fork Mode for out-of-process extraction
-x or --xml Output XHTML content (default)
-h or --html Output HTML content
-t or --text Output plain text content
-T or --text-main Output plain text content (main content only)
-m or --metadata Output only metadata
.....
From that, you'll see that if you change your --text option to --html or --xml you'll get nicely formatted XML instead of just the plain text.
Although this has already been answered, since the OP tagged the question with the java tag, for completeness I'll add a reference showing how to do this easily in Java.
The TikaTest.java superclass from Tika's unit tests is the easiest reference for converting Word to HTML, via its getXML method. It's a pity that they saw the usefulness of such an API when writing their unit tests but chose not to expose it as a handy tool, forcing everyone to deal with handlers and other boilerplate for this common use case.
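For reference, here is a minimal self-contained sketch of the same idea using the public Tika API (the file name tariff.doc is just the example from the question; it assumes the Tika app/parsers jars are on the classpath):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToXMLContentHandler;

public class WordToXhtml {
    public static void main(String[] args) throws Exception {
        ToXMLContentHandler handler = new ToXMLContentHandler();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();

        try (InputStream stream = Files.newInputStream(Paths.get("tariff.doc"))) {
            // Tika detects the format and streams XHTML SAX events into the handler
            parser.parse(stream, handler, metadata);
        }

        // The handler accumulates the XHTML document as a string
        System.out.println(handler.toString());
    }
}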

Parsing javadoc with Python-Sphinx

I use a shared repository containing both Java and Python code. The code base is mainly Python, but some libraries are written in Java.
Is there a possibility to parse or preprocess Java documentation in order to use
it later in Python-Sphinx or even a plugin?
javasphinx (Github) (Documentation)
It took me way too long to find all the important details needed to set this up, so here's a brief write-up for all my trouble.
Installation
# Recommend working in virtual environments with latest pip:
mkdir docs; cd docs
python3 -m venv env
source ./env/bin/activate
pip install --upgrade pip
# Recommend installing from source:
pip install git+https://github.com/bronto/javasphinx.git
The PyPI version seemed to have broken imports; these issues did not seem to exist in the latest checkout.
Setup & Configuration
Assuming you've got a working sphinx setup already:
Important: add the Java "domain" to Sphinx. It is embedded in the javasphinx package and does not follow the common .ext. extension-namespace format. (This is the detail I missed for hours.)
# docs/sources/conf.py
extensions = ['javasphinx']
Optional: If you want external javadoc linking:
# docs/sources/conf.py
javadoc_url_map = {
'<namespace_here>' : ('<base_url_here>', 'javadoc'),
}
Generating Documentation
The javasphinx package adds the shell tool javasphinx-apidoc. If your virtual environment is active you can call it as just javasphinx-apidoc, or use its full path ./env/bin/javasphinx-apidoc:
$ javasphinx-apidoc -o docs/source/ --title='<name_here>' ../path/to/java_dirtoscan
This tool takes arguments nearly identical to sphinx-apidoc:
$ javasphinx-apidoc --help
Usage: javasphinx-apidoc [options] -o <output_path> <input_path> [exclude_paths, ...]
Options:
-h, --help show this help message and exit
-o DESTDIR, --output-dir=DESTDIR
Directory to place all output
-f, --force Overwrite all files
-c CACHE_DIR, --cache-dir=CACHE_DIR
Directory to stored cachable output
-u, --update Overwrite new and changed files
-T, --no-toc Don't create a table of contents file
-t TOC_TITLE, --title=TOC_TITLE
Title to use on table of contents
--no-member-headers Don't generate headers for class members
-s SUFFIX, --suffix=SUFFIX
file suffix (default: rst)
-I INCLUDES, --include=INCLUDES
Additional input paths to scan
-p PARSER_LIB, --parser=PARSER_LIB
Beautiful Soup---html parser library option.
-v, --verbose verbose output
Include Generated Docs in Index
In the output directory of the javasphinx-apidoc command there will be a generated packages.rst table-of-contents file; you will likely want to include it in your index.rst's table of contents like:
# docs/sources/index.rst
Contents:

.. toctree::
   :maxdepth: 2

   packages
Compile Documentation (html)
With either your python environment active or your path modified:
$ cd docs
$ make html
or
$ PATH=$PATH:./env/bin/ make html
The javadoc command allows you to write and use your own doclet classes to generate documentation in whatever form you choose. The output doesn't need to be directly human-readable ... so there's nothing stopping you from outputting in a Sphinx-compatible format.
However, I couldn't find any existing doclet that does this specific job.
References:
Oracle's Doclet Overview
UPDATE
The javasphinx extension may be a better alternative. It allows you to generate Sphinx documentation from javadoc comments embedded in Java source code.
Sphinx does not provide a built-in way to parse JavaDoc, and I do not know of any 3rd party extension for this task.
You'll likely have to write your own documenter for the Sphinx autodoc extension. There are different approaches you may follow:
Parse JavaDoc manually. I do not think that there is a JavaDoc parser for Python, though.
Use Doxygen to parse JavaDoc into XML, and parse that XML. The Sphinx extension breathe does this, though for C++.
Write a doclet for Java to turn JavaDoc into whatever output format you can handle, and parse that output.
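To make the third option concrete, here is a bare-bones sketch of a doclet using the legacy com.sun.javadoc API (available up to JDK 12). It only dumps plain text; the println calls would need to be replaced with whatever reST or intermediate format you want Sphinx to consume:

import com.sun.javadoc.ClassDoc;
import com.sun.javadoc.MethodDoc;
import com.sun.javadoc.RootDoc;

// Invoke with the legacy javadoc tool, e.g.:
//   javadoc -doclet TextDoclet -docletpath . -sourcepath src com.example
public class TextDoclet {
    public static boolean start(RootDoc root) {
        for (ClassDoc cls : root.classes()) {
            System.out.println("Class: " + cls.qualifiedName());
            System.out.println(cls.commentText());
            for (MethodDoc method : cls.methods()) {
                System.out.println("  Method: " + method.name() + method.flatSignature());
                System.out.println("  " + method.commentText());
            }
        }
        return true;
    }
}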

Weka: Batch filtering command is showing error "Input file formats differ", using Mac OS X

I have been trying to run one simple example to check the Weka GUI interface, as I am planning to develop a Support Vector Machine (SVM) using the Weka API/WLSVM in my Java code. There are three steps I am following to make arff files from the text datasets (training & testing). Can you assist me in running this from Java code?
text file to .arff file converter.
Applied StringToWordVector Filter.
Applied Batch Filter on training and test datasets.
1. text file to .arff file converter.
This step works fine in the Simple CLI using the following command:
java weka.core.converters.TextDirectoryLoader -dir Testing_Text > Testing.arff
but when I run it in the Mac bash shell it gives the following error; how can I resolve this issue?
Could not find or load main class weka.core.converters.TextDirectoryLoader
2. Applied StringToWordVector Filter
I applied this filter using the Weka GUI interface separately, first on the training and then on the testing dataset.
3. Applied Batch Filter on training and testing dataset.
When I try to apply the batch filter in the Simple CLI using the following command, it fails with the error shown below.
java weka.filters.unsupervised.attribute.Standardize -b -i Training_STWV.arff -o train_std.arff -r TestingDiff_STWV.arff -s test_std.arff
Input file formats differ.
Kindly guide me; I am stuck trying to run a Support Vector Machine (SVM) classifier using Weka.
The batch filtering command (-b) is working now with the following command:
java weka.filters.unsupervised.attribute.StringToWordVector -b -i Training.arff -o train_std.arff -r Testing.arff -s test_std.arff
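Since the question also asks how to do this from Java code, here is a minimal sketch of the same batch filtering via the Weka API (file names are taken from the question; the class attribute is assumed to be the last one). Using a single filter whose input format is set from the training data is what keeps the two output files compatible:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSink;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchStringToWordVector {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("Training.arff");
        Instances test = DataSource.read("Testing.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        StringToWordVector filter = new StringToWordVector();
        // The input format (and therefore the dictionary) is defined by the
        // training data only; applying the same filter to the test set keeps
        // both outputs compatible, which avoids "Input file formats differ".
        filter.setInputFormat(train);
        Instances trainVec = Filter.useFilter(train, filter);
        Instances testVec = Filter.useFilter(test, filter);

        DataSink.write("train_std.arff", trainVec);
        DataSink.write("test_std.arff", testVec);
    }
}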
The standard procedure on the Mac is to change into the directory (e.g. weka-3.6.8/) and run
java -Xmx1000M -jar weka.jar
Check if that works.
If it does, check that in your own example you have the classpath properly set (-jar weka.jar).
