Jena TDB to store and query using API

Jena TDB to store and query using API - java

I am new to both Jena-TDB and SPARQL, so it might be a silly question. I am using tdb-0.9.0, on Windows XP.
I am creating the TDB model for my trail_1.rdf file. My understanding here(correct me if I am wrong) is that the following code will read the given rdf file in TDB model and also stores/load (not sure what's the better word) the model in the given directory D:\Project\Store_DB\data1\tdb:
// open TDB dataset
String directory = "D:\\Project\\Store_DB\\data1\\tdb";
Dataset dataset = TDBFactory.createDataset(directory);
Model tdb = dataset.getDefaultModel();
// read the input file
String source = "D:\\Project\\Store_DB\\tmp\\trail_1.rdf";
FileManager.get().readModel( tdb, source);
tdb.close();
dataset.close();
Is this understanding correct?
As per my understanding since now the model is stored at D:\Project\Store_DB\data1\tdb directory, I should be able to run query on it at some later point of time.
So to query the TDB Store at D:\Project\Store_DB\data1\tdb I tried following, but it prints nothing:
String directory = "D:\\Project\\Store_DB\\data1\\tdb" ;
Dataset dataset = TDBFactory.createDataset(directory) ;
Iterator<String> graphNames = dataset.listNames();
while (graphNames.hasNext()) {
String graphName = graphNames.next();
System.out.println(graphName);
}
I also tried this, which also did not print anything:
String directory = "D:\\Project\\Store_DB\\data1\\tdb" ;
Dataset dataset = TDBFactory.createDataset(directory) ;
String sparqlQueryString = "SELECT (count(*) AS ?count) { ?s ?p ?o }" ;
Query query = QueryFactory.create(sparqlQueryString) ;
QueryExecution qexec = QueryExecutionFactory.create(query, dataset) ;
ResultSet results = qexec.execSelect() ;
ResultSetFormatter.out(results) ;
What am I doing incorrect? Is there anything wrong with my understanding that I have mentioned above?

For part (i) of your question, yes, your understanding is correct.
For part (ii), the reason that listNames does not return any results is because you have not put your data into a named graph. In particular,
Model tdb = dataset.getDefaultModel();
means that you are storing data into TDB's default graph, i.e. the graph with no name. If you wish listNames to return something, change that line to:
Model tdb = dataset.getNamedGraph( "graph42" );
or something similar. You will, of course, then need to refer to that graph by name when you query the data.
If your goal is simply to test whether or not you have successfully loaded data into the store, try the command line tools bin/tdbdump (Linux) or bat\tdbdump.bat (Windows).
For part (iii), I tried your code on my system, pointing at one of my TDB images, and it works just as one would expect. So: either the TDB image you're using doesn't have any data in it (test with tdbdump), or the code you actually ran was different to the sample above.

The problem in your part 1 code is, I think, you are not committing the data .
Try with this version of your part 1 code:
String directory = "D:\\Project\\Store_DB\\data1\\tdb";
Dataset dataset = TDBFactory.createDataset(directory);
Model tdb = dataset.getDefaultModel();
// read the input file
String source = "D:\\Project\\Store_DB\\tmp\\trail_1.rdf";
FileManager.get().readModel( tdb, source);
dataset.commit();//INCLUDE THIS STAMEMENT
tdb.close();
dataset.close();
and then try with your part 3 code :) ....

Related

How to predict out of sample values using Spark (JAVA API)

I'm quite new to Spark and I need to use the JAVA api. Our goal is to serve predictions on the fly, where the user is going to provide a few of the variables, but not the label or the goal variable, of course.
But the model seems to need the data to be split in training data and test data for training and validation.
How can I get the prediction and the RMSE for the out of the sample data, that the user will query on the fly?
Dataset<Row>[] splits = df.randomSplit(new double[] {0.99, 0.1});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = df_p;
My out of sample data has the following format (where 0s is data the user cannot provide)
IMO,PORT_ID,DWT,TERMINAL_ID,BERTH_ID,TIMESTAMP,label,OP_ID
0000000,1864,80000.00,5689,6060,2020-08-29 00:00:00.000,1,2
'label' is the result I want to predict.
This is how I used the models:
// Train a GBT model.
GBTRegressor gbt = new GBTRegressor()
.setLabelCol("label")
.setFeaturesCol("features")
.setMaxIter(10);
// Chain indexer and GBT in a Pipeline.
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {assembler, gbt, discretizer});
// Train model. This also runs the indexer.
PipelineModel model = pipeline.fit(trainingData);
// Make predictions.
Dataset<Row> predictions = model.transform(testData);
// Select example rows to display.
predictions.select("prediction", "label", "weekofyear", "dayofmonth", "month", "year", "features").show(150);
// Select (prediction, true label) and compute test error.
RegressionEvaluator evaluator = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("rmse");
double rmse = evaluator.evaluate(predictions);
System.out.println("Root Mean Squared Error (RMSE) on test data = " + rmse);

testing OpenNLP classifier model

I'm currently training a model for a classifier. yesterday I found out that it will be more accurate if you also test the created classify model. I tried searching on the internet how to test a model : testing openNLP model. But I cant get it to work. I think the reason is because i'm using OpenNLP version 1.83 instead of 1.5. Could anyone explain me how to properly test my model in this version of OpenNLP?
Thanks in advance.
Below is the way im training my model:
public static DoccatModel trainClassifier() throws IOException
{
// read the training data
final int iterations = 100;
InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("src/main/resources/trainingSets/trainingssetTest.txt"));
ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
// define the training parameters
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, iterations+"");
params.put(TrainingParameters.CUTOFF_PARAM, 0+"");
params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);
// create a model from traning data
DoccatModel model = DocumentCategorizerME.train("NL", sampleStream, params, new DoccatFactory());
return model;
}

I can think of two ways to test your model. Either way, you will need to have annotated documents (an by annotated I really mean expert-classified).
The first way involves using the opennlp DocCatEvaluator. The syntax would be something akin to
opennlp DoccatEvaluator -model model -data sampleData
The format of your sampleData should be
OUTCOME <document text....>
documents are separated by the new line character.
The second way involves creating an DocumentCategorizer. Something like:
(the model is the DocCat model from your question)
DocumentCategorizer categorizer = new DocumentCategorizerME(model);
// could also use: Tokenizer tokenizer = new TokenizerME(tokenizerModel)
Tokenizer tokenizer = WhitespaceTokenizer.INSTANCE();
// linesample is like in your question...
for(String sample=linesample.read(); sample != null; sample=linesample.read()){
String[] tokens = tokenizer.tokenize(sample);
double[] outcomeProb = categorizer.categorize(tokens);
String sampleOutcome = categorizer.getBestCategory(outcomeProb);
// check if the outcome is right...
// keep track of # right and wrong...
}
// calculate agreement metric of your choice
Since I typed the code here there may be a syntax error or two (either I or the SO community can fix), but the idea for running through your data, tokenizing, running it through the document categorizer and keeping track of the results is how you want to evaluate your model.
Hope it helps...

Inserting data with UCanAccess from big text files is very slow

I'm trying to read text files .txt with more than 10.000 lines per file, splitting them and inserting the data in Access database using Java and UCanAccess. The problem is that it becomes slower and slower every time (as the database gets bigger).
Now after reading 7 text files and inserting them into database, it would take the project more than 20 minutes to read another file.
I tried to do just the reading and it works fine, so the problem is the actual inserting into database.
N.B: This is my first time using UCanAccess with Java because I found that the JDBC-ODBC Bridge is no longer available. Any suggestions for an alternative solution would also be appreciated.

If your current task is simply to import a large amount of data from text files straight into the database, and it does not require any sophisticated SQL manipulations, then you might consider using the Jackcess API directly. For example, to import a CSV file you could do something like this:
String csvFileSpec = "C:/Users/Gord/Desktop/BookData.csv";
String dbFileSpec = "C:/Users/Public/JackcessTest.accdb";
String tableName = "Book";
try (Database db = new DatabaseBuilder()
.setFile(new File(dbFileSpec))
.setAutoSync(false)
.open()) {
new ImportUtil.Builder(db, tableName)
.setDelimiter(",")
.setUseExistingTable(true)
.setHeader(false)
.importFile(new File(csvFileSpec));
// this is a try-with-resources block,
// so db.close() happens automatically
}
Or, if you need to manually parse each line of input, insert a row, and retrieve the AutoNumber value for the new row, then the code would be more like this:
String dbFileSpec = "C:/Users/Public/JackcessTest.accdb";
String tableName = "Book";
try (Database db = new DatabaseBuilder()
.setFile(new File(dbFileSpec))
.setAutoSync(false)
.open()) {
// sample data (e.g., from parsing of an input line)
String title = "So, Anyway";
String author = "Cleese, John";
Table tbl = db.getTable(tableName);
Object[] rowData = tbl.addRow(Column.AUTO_NUMBER, title, author);
int newId = (int)rowData[0]; // retrieve generated AutoNumber
System.out.printf("row inserted with ID = %d%n", newId);
// this is a try-with-resources block,
// so db.close() happens automatically
}
To update an existing row based on its primary key, the code would be
Table tbl = db.getTable(tableName);
Row row = CursorBuilder.findRowByPrimaryKey(tbl, 3); // i.e., ID = 3
if (row != null) {
// Note: column names are case-sensitive
row.put("Title", "The New Title For This Book");
tbl.updateRow(row);
}
Note that for maximum speed I used .setAutoSync(false) when opening the Database, but bear in mind that disabling AutoSync does increase the chance of leaving the Access database file in a damaged (and possibly unusable) state if the application terminates abnormally while performing the updates.

Also, if you need to use slq/ucanaccess, you have to call setAutocommit(false) on the connection at the begin, and do a commit each 200/300 record. The performances will improve drammatically (about 99%).

Spark: Running multiple queries on multiple files, optimization

I am using spark 1.5.0.
I have a set of files on s3 containing json data in sequence file format, worth around 60GB. I have to fire around 40 queries on this dataset and store results back to s3.
All queries are select statements with a condition on same field. Eg. select a,b,c from t where event_type='alpha', select x,y,z from t where event_type='beta' etc.
I am using an AWS EMR 5 node cluster with 2 core nodes and 2 task nodes.
There could be some fields missing in the input. Eg. a could be missing. So, the first query, which selects a would fail. To avoid this I have defined schemas for each event_type. So, for event_type alpha, the schema would be like {"a": "", "b": "", c:"", event_type=""}
Based on the schemas defined for each event, I'm creating a dataframe from input RDD for each event with the corresponding schema.
I'm using the following code:
JavaPairRDD<LongWritable,BytesWritable> inputRDD = jsc.sequenceFile(bucket, LongWritable.class, BytesWritable.class);
JavaRDD<String> events = inputRDD.map(
new Function<Tuple2<LongWritable,BytesWritable>, String>() {
public String call(Tuple2<LongWritable,BytesWritable> tuple) throws JSONException, UnsupportedEncodingException {
String valueAsString = new String(tuple._2.getBytes(), "UTF-8");
JSONObject data = new JSONObject(valueAsString);
JSONObject payload = new JSONObject(data.getString("payload"));
return payload.toString();
}
}
);
events.cache();
for (String event_type: events_list) {
String query = //read query from another s3 file event_type.query
String jsonSchemaString = //read schema from another s3 file event_type.json
List<String> jsonSchema = Arrays.asList(jsonSchemaString);
JavaRDD<String> jsonSchemaRDD = jsc.parallelize(jsonSchema);
DataFrame df_schema = sqlContext.read().option("header", "true").json(jsonSchemaRDD);
StructType schema = df_schema.schema();
DataFrame df_query = sqlContext.read().schema(schema).option("header", "true").json(events);
df_query.registerTempTable(tableName);
DataFrame df_results = sqlContext.sql(query);
df_results.write().format("com.databricks.spark.csv").save("s3n://some_location);
}
This code is very inefficient, it takes around 6-8 hours to run. How can I optimize my code?
Should I try using HiveContext.
I think the current code is taking multipe passes at the data, not sure though as I have cached the RDD? How can I do it in a single pass if that is so.

Neo4j ExecutionEngine does not return valid results

Trying to use a similar example from the sample code found here
My sample function is:
void query()
{
String nodeResult = "";
String rows = "";
String resultString;
String columnsString;
System.out.println("In query");
// START SNIPPET: execute
ExecutionEngine engine = new ExecutionEngine( graphDb );
ExecutionResult result;
try ( Transaction ignored = graphDb.beginTx() )
{
result = engine.execute( "start n=node(*) where n.Name =~ '.*79.*' return n, n.Name" );
// END SNIPPET: execute
// START SNIPPET: items
Iterator<Node> n_column = result.columnAs( "n" );
for ( Node node : IteratorUtil.asIterable( n_column ) )
{
// note: we're grabbing the name property from the node,
// not from the n.name in this case.
nodeResult = node + ": " + node.getProperty( "Name" );
System.out.println("In for loop");
System.out.println(nodeResult);
}
// END SNIPPET: items
// START SNIPPET: columns
List<String> columns = result.columns();
// END SNIPPET: columns
// the result is now empty, get a new one
result = engine.execute( "start n=node(*) where n.Name =~ '.*79.*' return n, n.Name" );
// START SNIPPET: rows
for ( Map<String, Object> row : result )
{
for ( Entry<String, Object> column : row.entrySet() )
{
rows += column.getKey() + ": " + column.getValue() + "; ";
System.out.println("nested");
}
rows += "\n";
}
// END SNIPPET: rows
resultString = engine.execute( "start n=node(*) where n.Name =~ '.*79.*' return n.Name" ).dumpToString();
columnsString = columns.toString();
System.out.println(rows);
System.out.println(resultString);
System.out.println(columnsString);
System.out.println("leaving");
}
}
When I run this in the web console I get many results (as there are multiple nodes that have an attribute of Name that contains the pattern 79. Yet running this code returns no results. The debug print statements 'in loop' and 'nested' never print either. Thus this must mean there are not results found in the Iterator, yet that doesn't make sense.
And yes, I already checked and made sure that the graphDb variable is the same as the path for the web console. I have other code earlier that uses the same variable to write to the database.
EDIT - More info
If I place the contents of query in the same function that creates my data, I get the correct results. If I run the query by itself it returns nothing. It's almost as the query works only in the instance where I add the data and not if I come back to the database cold in a separate instance.
EDIT2 -
Here is a snippet of code that shows the bigger context of how it is being called and sharing the same DBHandle
package ContextEngine;
import ContextEngine.NeoHandle;
import java.util.LinkedList;
/*
* Class to handle streaming data from any coded source
*/
public class Streamer {
private NeoHandle myHandle;
private String contextType;
Streamer()
{
}
public void openStream(String contextType)
{
myHandle = new NeoHandle();
myHandle.createDb();
}
public void streamInput(String dataLine)
{
Context context = new Context();
/*
* get database instance
* write to database
* check for errors
* report errors & success
*/
System.out.println(dataLine);
//apply rules to data (make ContextRules do this, send type and string of data)
ContextRules contextRules = new ContextRules();
context = contextRules.processContextRules("Calls", dataLine);
//write data (using linked list from contextRules)
NeoProcessor processor = new NeoProcessor(myHandle);
processor.processContextData(context);
}
public void runQuery()
{
NeoProcessor processor = new NeoProcessor(myHandle);
processor.query();
}
public void closeStream()
{
/*
* close database instance
*/
myHandle.shutDown();
}
}
Now, if I call streamInput AND query in in the same instance (parent calls) the query returns results. If I only call query and do not enter ANY data in that instance (yet web console shows data for same query) I get nothing. Why would I have to create the Nodes and enter them into the database at runtime just to return a valid query. Shouldn't I ALWAYS get the same results with such a query?

You mention that you are using the Neo4j Browser, which comes with Neo4j. However, the example you posted is for Neo4j Embedded, which is the in-process version of Neo4j. Are you sure you are talking to the same database when you try your query in the Browser?
In order to talk to Neo4j Server from Java, I'd recommend looking at the Neo4j JDBC driver, which has good support for connecting to the Neo4j server from Java.
http://www.neo4j.org/develop/tools/jdbc
You can set up a simple connection by adding the Neo4j JDBC jar to your classpath, available here: https://github.com/neo4j-contrib/neo4j-jdbc/releases Then just use Neo4j as any JDBC driver:
Connection conn = DriverManager.getConnection("jdbc:neo4j://localhost:7474/");
ResultSet rs = conn.executeQuery("start n=node({id}) return id(n) as id", map("id", id));
while(rs.next()) {
System.out.println(rs.getLong("id"));
}
Refer to the JDBC documentation for more advanced usage.
To answer your question on why the data is not durably stored, it may be one of many reasons. I would attempt to incrementally scale back the complexity of the code to try and locate the culprit. For instance, until you've found your problem, do these one at a time:
Instead of looping through the result, print it using System.out.println(result.dumpToString());
Instead of the regex query, try just MATCH (n) RETURN n, to return all data in the database
Make sure the data you are seeing in the browser is not "old" data inserted earlier on, but really is an insert from your latest run of the Java program. You can verify this by deleting the data via the browser before running the Java program using MATCH (n) OPTIONAL MATCH (n)-[r]->() DELETE n,r;
Make sure you are actually working against the same database directories. You can verify this by leaving the server running. If you can still start your java program, unless your Java program is using the Neo4j REST Bindings, you are not using the same directory. Two Neo4j databases cannot run against the same database directory simultaneously.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Jena TDB to store and query using API - java

Related

How to predict out of sample values using Spark (JAVA API)

testing OpenNLP classifier model

Inserting data with UCanAccess from big text files is very slow

Spark: Running multiple queries on multiple files, optimization

Neo4j ExecutionEngine does not return valid results

Categories

Resources