I have successfully integrated the LibSVM API into my Java code. I need to convert a large document collection into a numerical representation and feed it to the LibSVM classifier. As far as I know, Weka can convert documents into feature vectors. Can anyone please tell me how to do that?
You can do it like this:
DataSource source = new DataSource("mycsvinputfile");
System.out.println(source.getStructure());
Instances data = source.getDataSet();
// setting class attribute if the data format does not provide this information
// For example, the XRFF format saves the class attribute information as well
if (data.classIndex() == -1)
    data.setClassIndex(data.numAttributes() - 1);
//initialize svm classifier
LibSVM svm = new LibSVM();
svm.buildClassifier(data);
Don't forget to use weka.jar, libsvm.jar, and wlsvm.jar (the LibSVM wrapper) in your project, i.e. include all three JARs in your build path or classpath.
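For the document-to-feature-vector part of the question, a common route in Weka is TextDirectoryLoader followed by the StringToWordVector filter. The sketch below is only an illustration: the directory layout (one sub-folder per class, one text file per document) and all filter settings are assumptions you should adapt to your collection.

import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class DocumentsToVectors {
    public static void main(String[] args) throws Exception {
        // Load the raw documents; this yields a string "text" attribute plus a class attribute.
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("path/to/document_folders")); // hypothetical location
        Instances raw = loader.getDataSet();
        raw.setClassIndex(raw.numAttributes() - 1);

        // Turn the string attribute into (TF-IDF weighted) word features.
        StringToWordVector filter = new StringToWordVector();
        filter.setTFTransform(true);
        filter.setIDFTransform(true);
        filter.setLowerCaseTokens(true);
        filter.setWordsToKeep(10000);
        filter.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, filter);

        System.out.println(vectors.numInstances() + " documents, "
                + vectors.numAttributes() + " features");
        // 'vectors' can now be passed to LibSVM.buildClassifier(...) as in the snippet above.
    }
}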
I am new to the semantic web field, and I'm trying to create a Java model using Jena to extract classes, subclasses, and/or comments from an OWL file.
Any help/guidance on how to do this would be appreciated.
Thank you
You can do so with the Jena Ontology API. This API allows you to create an ontology model from an OWL file and then gives you access to all the information stored in the ontology as Java objects.
Here is a quick introduction to the Jena ontology API; it contains useful information on getting started.
The code generally looks like this:
String owlFile = "path_to_owl_file"; // the file can be in RDF/XML or TTL format
/* We create the OntModel and specify what kind of reasoner we want to use.
Depending on the reasoner, you can access different kinds of information, so please read the introduction. */
OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM);
/* Now we read the ontology file
The second parameter is the ontology base URI.
The third parameter is the file format; it can also be "TTL" or "N3". */
model.read(owlFile, null, "RDF/XML");
/* Then you can access the information using the OntModel methods.
Let's access the ontology properties */
System.out.println("Listing the properties");
model.listOntProperties().forEachRemaining(System.out::println);
// Let's access the classes' local names and their subclasses
try {
    model.listClasses().toSet().forEach(c -> {
        System.out.println(c.getLocalName());
        System.out.println("Listing subclasses of " + c.getLocalName());
        c.listSubClasses().forEachRemaining(System.out::println);
    });
} catch (Exception e) {
    e.printStackTrace();
}
// Note that depending on the class types, accessing some information might throw an exception.
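Since the question also asks about comments, a small addition (my own sketch, not part of the code above) is to read the rdfs:comment values attached to each class:

// List each class together with any rdfs:comment attached to it
model.listClasses().forEachRemaining(c -> {
    c.listComments(null).forEachRemaining(comment ->
            System.out.println(c.getLocalName() + ": " + comment));
});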
Here is the Jena Ontology API JavaDoc.
I hope it was useful!
I am wondering if it is possible to get load speed information when using the Java API.
The code I have to load "large" files (a few GB) is this:
try (InputStream in = new FileInputStream(arguments.input)) {
RDFParser.create()
.source(in)
.lang(lang)
.errorHandler(ErrorHandlerFactory.errorHandlerStrict)
.base("http://example.com/")
.parse(model);
}
The loading seems to work, but I have no clue about the speed, the number of triples parsed, etc. Is there a way to get such statistics every n triples or n seconds? I am not using tdbloader2 as this code is part of a bigger program.
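One possible approach (a sketch of my own, not an official Jena utility): RDFParser can parse into a StreamRDF destination, so a thin wrapper around that stream can count triples on their way to the model and report throughput. The 100,000-triple reporting interval, the file path argument, and the N-Triples language below are assumptions; adjust them to your setup.

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.jena.graph.Triple;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFLib;
import org.apache.jena.riot.system.StreamRDFWrapper;

public class CountingLoad {
    public static void main(String[] args) throws Exception {
        Model model = ModelFactory.createDefaultModel();
        long start = System.currentTimeMillis();

        // Send parsed triples to the model, but count them on the way through.
        StreamRDF toModel = StreamRDFLib.graph(model.getGraph());
        StreamRDF counting = new StreamRDFWrapper(toModel) {
            long count = 0;
            @Override public void triple(Triple triple) {
                super.triple(triple);
                if (++count % 100_000 == 0) {
                    double secs = (System.currentTimeMillis() - start) / 1000.0;
                    System.out.printf("%,d triples in %.1fs (%.0f triples/s)%n",
                            count, secs, count / secs);
                }
            }
        };

        try (InputStream in = new FileInputStream(args[0])) {
            RDFParser.create()
                     .source(in)
                     .lang(Lang.NTRIPLES)
                     .base("http://example.com/")
                     .parse(counting);
        }
    }
}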
I am using the Weka API in my Java code and have a dataset with a string ID attribute to keep track of instances. Weka mentions on this page that there is a p option that can print the ID of each instance in the prediction results even if the attribute has been removed. But how can this be done in Java code, since none of the options listed for the RemoveType filter is p?
Thank you
The p option on the Weka page you mentioned is a parameter that you can set through some of the classes available in the package weka.classifiers.evaluation.output.prediction.
With these classes you can control what goes into the prediction output file, e.g. the output distribution, the attribute indices (the p option) you want included in the output file, the number of decimal places in prediction probabilities, etc.
You can use any of the below classes depending on the output file format you want.
PlainText
HTML
XML
CSV
Setting the parameters through code:
Evaluation eval = new Evaluation(data);
StringBuffer forPredictionsPrinting = new StringBuffer();
PlainText classifierOutput = new PlainText();
classifierOutput.setBuffer(forPredictionsPrinting);
classifierOutput.setOutputDistribution(true);
You can find detailed usage of this class at
https://www.programcreek.com/java-api-examples/?api=weka.classifiers.evaluation.output.prediction.PlainText
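Putting the pieces together, here is a sketch of how the whole thing might look. The ARFF file name, the attribute index "1" for the ID column, and the choice of J48 are assumptions for illustration only. The classifier is wrapped in a FilteredClassifier with RemoveType so the string ID is stripped before training but is still present in the evaluation data, and the output object is told which attribute to echo with each prediction.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.evaluation.output.prediction.PlainText;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.RemoveType;

public class PredictionsWithIds {
    public static void main(String[] args) throws Exception {
        // Load the data; "mydata.arff" is a placeholder.
        Instances data = new DataSource("mydata.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // The classifier never sees the string ID, but the evaluation data keeps it.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new RemoveType());   // removes string attributes by default
        fc.setClassifier(new J48());      // any classifier could be used here

        // Configure what goes into the prediction output (the Java-side equivalent of -p).
        StringBuffer buffer = new StringBuffer();
        PlainText output = new PlainText();
        output.setBuffer(buffer);
        output.setHeader(data);
        output.setAttributes("1");        // echo attribute 1 (the ID) with each prediction
        output.setOutputDistribution(true);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1), output);
        System.out.println(buffer);
    }
}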
How can I use SparkContext (to create SparkSession or Cassandra Sessions) on executors?
If I pass it as a parameter to the foreach or foreachPartition, then it will have a null value. Shall I create a new SparkContext in each executor?
What I'm trying to do is as follows:
Read a dump directory with millions of XML files:
dumpFiles = Directory.listFiles(dumpDirectory)
dumpFilesRDD = sparkContext.parallelize(dumpFiles, numOfSlices)
dumpFilesRDD.foreachPartition(dumpFilePath->parse(dumpFilePath))
In parse(), every XML file is validated, parsed, and inserted into several tables using Spark SQL. Only valid XML files produce objects of the same type that can be saved. A portion of the data needs to be replaced by other keys before being inserted into one of the tables.
In order to do that, SparkContext is needed in the function parse to use sparkContext.sql().
If I'd rephrase your question, what you want is to:
Read a directory with millions of XML files
Parse them
Insert them into a database
That's a typical Extract, Transform and Load (ETL) process, and it is remarkably easy in Spark SQL.
Loading XML files can be done using a separate package spark-xml:
spark-xml A library for parsing and querying XML data with Apache Spark, for Spark SQL and DataFrames. The structure and test tools are mostly copied from CSV Data Source for Spark.
You can "install" the package using --packages command-line option:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.11:0.4.1
Quoting spark-xml's Scala API (with some changes to use SparkSession instead):
// Step 1. Loading XML files
val path = "the/path/to/millions/files/*.xml"
val spark: SparkSession = ???
val files = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "book")
.load(path)
That makes the first requirement almost a no-brainer. Your millions of XML files are taken care of by Spark SQL.
Step 2 is about parsing the lines (from the XML files) and marking rows to be saved to appropriate tables.
// Step 2. Transform them (using parse)
def parse(line: String) = ???
val parseDF = files.map { line => parse(line) }
Your parse function could return something (as the main result) and the table that something should be saved to.
With the table markers, you split the parseDF into DataFrames per table.
val table1DF = parseDF.filter($"table" === "table1")
And so on (per table).
// Step 3. Insert into DB
table1DF.write.option(...).jdbc(...)
That's just a sketch of what you may really be after, but that's the general pattern to follow. Decompose your pipeline into digestible chunks and tackle one chunk at a time.
It is important to keep in mind that in Spark we are not supposed to program in terms of executors.
In the Spark programming model, your driver program is mostly a self-contained program in which certain sections are automatically converted into a physical execution plan, ultimately a bunch of tasks distributed across workers/executors.
When you need to execute something for each partition, you can use something like mapPartitions(). Refer to Spark : DB connection per Spark RDD partition and do mapPartition for further details. Pay attention to how the dbConnection object is enclosed in the function body.
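As an illustration of that pattern in Java (the JDBC URL, the table name, and the "parse and insert" step are placeholders, not working values): the connection is opened inside the function passed to foreachPartition, so it is created on the executor, once per partition, and never serialized from the driver.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PerPartitionConnection {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "per-partition-demo");
        List<String> dumpFiles = Arrays.asList("a.xml", "b.xml", "c.xml"); // placeholder paths
        JavaRDD<String> dumpFilesRDD = sc.parallelize(dumpFiles, 2);

        dumpFilesRDD.foreachPartition(paths -> {
            // The connection is created inside the function body, i.e. on the executor.
            try (Connection conn = DriverManager.getConnection("jdbc:postgresql://host/db");
                 PreparedStatement stmt = conn.prepareStatement("INSERT INTO table1 VALUES (?)")) {
                while (paths.hasNext()) {
                    stmt.setString(1, paths.next());  // stand-in for "parse the file and insert rows"
                    stmt.executeUpdate();
                }
            }
        });
        sc.close();
    }
}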
It is not clear what you mean by a parameter. If it is just data (not a DB connection or similar), I think you need to use a broadcast variable.
I have been working recently with Jena TDB. My goal is to store an RDF file, which is a representation of an RDF graph.
Everything works fine with my code, and I am able to query what I have stored as well. But I am still not sure whether my data was completely stored or not!
I know that Jena TDB indexes the content of the file and that several indexes are built for one file, which will be stored in a specified folder. But how do I check that the database has been created and that all the RDF files I provide to TDB are stored together with the previous ones?
Is there any way to do so, maybe online or in Java? And is my code enough to work with a big amount of data or not?
public static void main(String[] args) {
String directory = "/*location*/ ";
Dataset dataset = TDBFactory.createDataset(directory);
Model tdb = dataset.getNamedModel("RDFData");
// read the input file
String source = "/*location*/rdfstorage.rdf";
FileManager.get().readModel( tdb, source);
tdb.close();
dataset.close();
}
Check the location and see if the files have been updated.
It is better to use a transaction. Your code is OK, but if it is interrupted, the store may be corrupted.
https://jena.apache.org/documentation/tdb/tdb_transactions.html
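A minimal sketch of the same load wrapped in a write transaction (this assumes the current org.apache.jena packages and uses RDFDataMgr in place of the deprecated FileManager; the paths are placeholders):

import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.tdb.TDBFactory;

public class TdbTransactionalLoad {
    public static void main(String[] args) {
        Dataset dataset = TDBFactory.createDataset("/path/to/tdb/directory"); // placeholder
        dataset.begin(ReadWrite.WRITE);
        try {
            Model tdb = dataset.getNamedModel("RDFData");
            RDFDataMgr.read(tdb, "/path/to/rdfstorage.rdf");                  // placeholder
            dataset.commit();   // nothing becomes durable until this succeeds
        } finally {
            dataset.end();      // releases the transaction even if the load failed
        }
        dataset.close();
    }
}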
If the source is large, use the bulkloader from the command line.
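For example (directory and file name are placeholders), something along the lines of:
tdbloader --loc /path/to/tdb/directory rdfstorage.rdf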