Jena Rule Engine with TDB - Java

I have my data loaded in a TDB model and have written some rules using Jena to apply to it. I then store the inferred data in a new TDB store.
I applied the approach above to a small dataset (~200 KB) and it worked just fine. HOWEVER, my actual TDB store is 2.7 GB, and the computer has now been running for about a week and is in fact still running.
Is that normal, or am I doing something wrong? What alternatives to the Jena rule engine could I use?
Here is a small piece of the code:
import java.util.List;

import org.apache.jena.rdf.model.InfModel;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;

public class Ruleset {

    private List<Rule> rules = null;
    private GenericRuleReasoner reasoner = null;

    public Ruleset(String rulesSource) {
        this.rules = Rule.rulesFromURL(rulesSource);
        this.reasoner = new GenericRuleReasoner(rules);
        reasoner.setOWLTranslation(true);
        reasoner.setTransitiveClosureCaching(true);
    }

    public InfModel applyTo(Model model) {
        return ModelFactory.createInfModel(reasoner, model);
    }

    public static void main(String[] args) {
        System.out.println(" ... Running the Rule Engine ...");
        String rulePath = "src/schemaRules.osr";
        Ruleset ruleset = new Ruleset(rulePath);
        InfModel infModel = ruleset.applyTo(data.tdb); // data.tdb: the TDB-backed model
        infModel.close();
    }
}

A large dataset in a persistent store is not a good match with Jena's rule system. The basic problem is that the RETE engine will make many small queries into the graph during rule propagation. The overhead in making these queries to any persistent store, including TDB, tends to make the execution times unacceptably long, as you have found.
Depending on your goals for employing inference, you may have some alternatives:
Load your data into a large enough memory graph, then save the inference closure (the base graph plus the entailments) to a TDB store in a single transaction. Thereafter, you can query the store without incurring the overhead of the rules system. Updates, obviously, can be an issue with this approach.
Keep your data in TDB, as now, but load a subset dynamically into a memory model to use live with inference. This makes updates easier (as long as you update both the memory copy and the persistent store), but requires you to partition your data.
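The first alternative might be sketched like this (the store paths and rules file are placeholders, and current Apache Jena package names and use of the default graph are my assumptions, not code from the question):

```java
import java.util.List;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.rdf.model.InfModel;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;
import org.apache.jena.tdb.TDBFactory;

public class ClosureToTDB {
    public static void main(String[] args) {
        // 1. Copy the persistent data into a plain memory model.
        Dataset source = TDBFactory.createDataset("/path/to/source-tdb");
        Model mem = ModelFactory.createDefaultModel();
        source.begin(ReadWrite.READ);
        mem.add(source.getDefaultModel());
        source.end();

        // 2. Run the rules entirely in memory, avoiding per-query TDB overhead.
        List<Rule> rules = Rule.rulesFromURL("file:src/schemaRules.osr");
        InfModel inf = ModelFactory.createInfModel(new GenericRuleReasoner(rules), mem);

        // 3. Persist the closure (base triples plus entailments) in one transaction.
        Dataset target = TDBFactory.createDataset("/path/to/target-tdb");
        target.begin(ReadWrite.WRITE);
        target.getDefaultModel().add(inf);
        target.commit();
        target.end();
    }
}
```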
If you only want some basic inferences, such as closure of the rdfs:subClassOf hierarchy, you can use the infer command line tool to generate an inference closure which you can load into TDB:
$ infer -h
infer --rdfs=vocab FILE ...
General
-v --verbose Verbose
-q --quiet Run with minimal output
--debug Output information for debugging
--help
--version Version information
Infer can be more efficient, because it doesn't require a large memory model. However, it is restricted in the inferences that it will compute.
If none of these work for you, you may want to consider commercial inference engines such as OWLIM or Stardog.

Thanks Ian.
I was actually able to do it via SPARQL Update, as Dave advised me to, and it took only 10 minutes to finish the job.
Here is an example of the code:
System.out.println(" ... Load rules ...");
data.startQuery();
String query = data.loadQuery("src/sparqlUpdatesRules.tql");
data.endQuery();
System.out.println(" ... Inserting rules ...");
UpdateAction.parseExecute(query, inferredData.tdb);
System.out.println(" ... Printing RDF ...");
inferredData.exportRDF();
System.out.println(" ... closing ...");
inferredData.close();
and here is an example of the SPARQL update:
INSERT {
?w ddids:carries ?p .
} WHERE {
?p ddids:is_in ?w .
};
Thanks for your answers.

Related

Drools Rule Engine taking too long to update

I have implemented a rules engine based on drools (v7.25). The users dynamically add, update, and delete the rules.
The application could be expanded further and currently includes over 15000 rules. These rules are distributed evenly throughout several files even though they are part of the same package.
The application was initially functioning properly, but as the number of rules increased, the add and update performance of the rules has decreased noticeably.
For instance, the KieBuilder takes only 300 ms to build the KieBases when there are up to 1,000 rules. But now, the same process can take up to 38,000 ms.
Here is a sample of my code for your reference:
List<CustomerRule> customerRules = rulesBuilder.getCustomerRules(customerId);
if (!customerRules.isEmpty()) {
    kieFileSystem.write(rulesBuilder.getDRLFilePath(customerId),
            rulesBuilder.getRulesFileData(customerId, customerRules));
}
KieBuilder kieBuilder = kieServices.newKieBuilder(kieFileSystem).buildAll();
Results results = kieBuilder.getResults();
if (results.hasMessages(Message.Level.ERROR)) {
    log.error(results.getMessages());
    return;
}
results = ((KieContainerImpl) kieContainer).updateToKieModule((InternalKieModule) kieBuilder.getKieModule());
if (results.hasMessages(Message.Level.ERROR)) {
    log.error(results.getMessages());
    return;
}
kieContainer.dispose();
Is there a way to improve the rebuilding process of the KieBases? What best practices should I take into account?

XML node replace failure

I am trying to use CoRB to search for and update a node in a large number of documents:
Sample input:
<hcmt xmlns="http://horn.thoery">
<susceptible>X</susceptible>
<reponsible>foresee–intervention</reponsible>
<intend>Benefit Protagonist</intend>
<justified>Goal Outwiegen</justified>
</hcmt>
XQuery:
declare namespace horn = "http://horn.thoery";

(: let $resp := "foresee–intervention" :)
let $docs :=
  cts:search(doc(),
    cts:and-query((
      cts:collection-query("hcmt"),
      cts:path-range-query("/horn:hcmt/horn:responsible", "=", $resp)
    ))
  )
return
  for $doc in $docs
  return
    xdmp:node-replace($doc/horn:hcmt/horn:responsible, "Foresee Intervention")
Expected output:
<hcmt xmlns="http://horn.thoery">
<susceptible>X</susceptible>
<reponsible>Foresee Intervention</reponsible>
<intend>Benefit Protagonist</intend>
<justified>Goal Outwiegen</justified>
</hcmt>
But the node-replace doesn't happen in CoRB, and no error is returned. Other queries work fine in CoRB. How can I make the node-replace work correctly in CoRB?
Thanks in advance for any help.
I created functions to reconcile the encoding issues. This not only mitigates potential API transaction failures but is also a prerequisite for validating and encoding parameter and element/property/URI names.
That said, a sample MarkLogic Java API implementation is:
Create a dynamic query construct in the filesystem, in my case product-query-option.xml (using the query value directly: Chooser–Option):
<search xmlns="http://marklogic.com/appservices/search">
<query>
<and-query>
<collection-constraint-query>
<constraint-name>Collection</constraint-name>
<uri>proto</uri>
</collection-constraint-query>
<range-constraint-query>
<constraint-name>ProductType</constraint-name>
<value>Chooser–Option</value>
</range-constraint-query>
</and-query>
</query>
</search>
Deploy the persistent query options to the modules database, in my case search-lexis.xml. The options file looks like:
<options xmlns="http://marklogic.com/appservices/search">
<constraint name="Collection">
<collection prefix=""/>
</constraint>
<constraint name="ProductType">
<range type="xs:string" collation="http://marklogic.com/collation/en/S1">
<path-index xmlns:prod="schema://fc.fasset/product">/prod:requestProduct/prod:_metaData/prod:productType</path-index>
</range>
</constraint>
</options>
Then follow on with a dynamic Java search:
File file = new File("src/main/resources/queryoption/product-query-option.xml");
FileHandle fileHandle = new FileHandle(file);
RawCombinedQueryDefinition rcqDef = queryMgr.newRawCombinedQueryDefinition(fileHandle, queryOption);
You can, of course, combine the query and the options into one handle in the QueryDefinition.
Your original node-replace translates to a Java partial update; make sure the DocumentPatchBuilder calls setNamespaces with the correct NamespaceContext.
For batch data operations, the performant approach is MarkLogic Data Movement: instantiate the QueryBatcher with the searched URIs, supply the replacement value or data fragment to PatchBuilder.replaceValue, and complete the batch with
dbClient.newXMLDocumentManager().patch(uri, patchHandle);
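A minimal sketch of that partial update, assuming the horn namespace from the question; the host, credentials, and document URI are hypothetical placeholders, not values from the original answer:

```java
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.document.DocumentPatchBuilder;
import com.marklogic.client.document.XMLDocumentManager;
import com.marklogic.client.io.marker.DocumentPatchHandle;
import com.marklogic.client.util.EditableNamespaceContext;

public class PatchExample {
    public static void main(String[] args) {
        DatabaseClient dbClient = DatabaseClientFactory.newClient(
                "localhost", 8000,
                new DatabaseClientFactory.DigestAuthContext("user", "password"));

        XMLDocumentManager docMgr = dbClient.newXMLDocumentManager();
        DocumentPatchBuilder patchBuilder = docMgr.newPatchBuilder();

        // Bind the horn prefix so the XPath below resolves correctly.
        EditableNamespaceContext namespaces = new EditableNamespaceContext();
        namespaces.put("horn", "http://horn.thoery");
        patchBuilder.setNamespaces(namespaces);

        // Equivalent of xdmp:node-replace on the responsible element.
        patchBuilder.replaceValue("/horn:hcmt/horn:responsible", "Foresee Intervention");

        DocumentPatchHandle patchHandle = patchBuilder.build();
        docMgr.patch("/hcmt/doc-1.xml", patchHandle); // hypothetical URI

        dbClient.release();
    }
}
```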
MarkLogic Data Services: once the above works, you can move to a more robust and scalable enterprise SOA approach; please review Data Services.
The implementation with Gradle is like:
(Note: all of the transformation metrics should be parameters, including path/element/property name, namespace, value, etc. Nothing is hardcoded.) One proxy service declared in service.json can serve multiple endpoints (under /root/df-ds/fxd) with different types of modules, which gives you free rein to develop pure Java or extend the development platform to handle complex data operations.
If these operations are persistent node updates, you should consider in-memory node transformation before ingestion. Besides the MarkLogic data transformation tools, you can harness the power of XSLT 2.0+.
Saxon's XPathFactory could be a serviceable vehicle to query and transform nodes. I am not sure whether the reverse holds; the MarkLogic Java API implements XPath compilation to split large paths and stream transactions. XSLT/Saxon is not my forte, so I can't comment on how it compares for this encode/decode particularity or how it handles transaction (insert, update, etc.) streaming.

Wikidata Toolkit: Is it possible to access properties of entities?

First of all, I want to clarify that my experience working with wikidata is very limited, so feel free to correct if any of my terminology is wrong.
I've been playing with wikidata toolkit, more specifically their wdtk-wikibaseapi. This allows you to get entity information and their different properties as such:
WikibaseDataFetcher wbdf = WikibaseDataFetcher.getWikidataDataFetcher();
EntityDocument q42 = wbdf.getEntityDocument("Q42");
List<StatementGroup> groups = ((ItemDocument) q42).getStatementGroups();
for (StatementGroup g : groups) {
    List<Statement> statements = g.getStatements();
    for (Statement s : statements) {
        System.out.println(s.getMainSnak().getPropertyId().getId());
        System.out.println(s.getValue());
    }
}
The above gets me the entity Douglas Adams and all the properties on his page: https://www.wikidata.org/wiki/Q42
Now wikidata toolkit has the ability to load and process dump files, meaning you can download a dump to your local and process it using their DumpProcessingController class under the wdtk-dumpfiles library. I'm just not sure what is meant by processing.
Can anyone explain to me what processing means in this context?
Can you do something similar to what was done with wdtk-wikibaseapi in the example above, but using a local dump file and wdtk-dumpfiles, i.e. get an entity and its respective properties? I don't want to get the info from an online source, only from the dump (offline).
If this is not possible using Wikidata Toolkit, could you point me to something that can get me started on getting entities and their properties from a Wikidata dump file? I am using Java.

Getting 100 percent CPU when trying to download a CSV in Spring

I am seeing a CPU performance issue on the server when I try to download a CSV in my project: CPU goes to 100%, although SQL returns the response within 1 minute. We write around 600K records to the CSV. For one user it works fine, but for concurrent users we get this issue.
Environment
Spring 4.2.5
Tomcat 7/8 (RAM 2GB Allocated)
MySQL 5.0.5
Java 1.7
Here is the Spring controller code:
@RequestMapping(value="csvData")
public void getCSVData(HttpServletRequest request,
        HttpServletResponse response,
        @RequestParam(value="param1", required=false) String param1,
        @RequestParam(value="param2", required=false) String param2,
        @RequestParam(value="param3", required=false) String param3) throws IOException {
    List<Log> logs = service.getCSVData(param1, param2, param3);
    response.setHeader("Content-type", "application/csv");
    response.setHeader("Content-disposition", "inline; filename=logData.csv");
    PrintWriter out = response.getWriter();
    out.println("Field1,Field2,Field3,.......,Field16");
    for (Log row : logs) {
        out.println(row.getField1() + "," + row.getField2() + "," + row.getField3() + "......" + row.getField16());
    }
    out.flush();
    out.close();
}
Persistence code (I am using Spring JdbcTemplate):
@Override
public List<Log> getCSVLog(String param1, String param2, String param3) {
    String sql = SqlConstants.CSV_ACTIVITY.toString();
    List<Log> csvLog = jdbcTemplate.query(sql, new Object[]{param1, param2, param3},
        new RowMapper<Log>() {
            @Override
            public Log mapRow(ResultSet rs, int rowNum) throws SQLException {
                Log log = new Log();
                log.setField1(rs.getInt("field1"));
                log.setField2(rs.getString("field2"));
                log.setField3(rs.getString("field3"));
                .
                .
                .
                log.setField16(rs.getString("field16"));
                return log;
            }
        });
    return csvLog;
}
I think you need to be specific about what you mean by "100% CPU usage": whether it's the Java process or the MySQL server. As you have got 600K records, trying to load everything into memory would easily end in an OutOfMemoryError. Given that this works for one user, you've got enough heap space to process this number of records for just one user, and the symptoms surface when multiple users try to use the same service.
The first issue I can see in your posted code is that you load everything into one big list, whose size varies based on the content of the Log class. Using a list like this also means you have to have enough memory to process the JDBC result set and generate a new list of Log instances; this can be a major problem with a growing number of users. Such short-lived objects cause frequent GC, and once GC cannot keep up with the amount of garbage being created, it obviously fails. To solve this major issue, my suggestion is to use a scrollable result set, which you can additionally make read-only. For example, below is a code fragment for creating a scrollable result set; take a look at the documentation for how to use it.
Statement st = conn.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_READ_ONLY);
The above option is suitable if you're using pure JDBC or the Spring JDBC template. If Hibernate is already used in your project, you can still achieve the same thing with the code fragment below. Again, please check the documentation for more information if you have a different JPA provider.
StatelessSession session = sessionFactory.openStatelessSession();
Query query = session.createSQLQuery(queryStr).setCacheable(false).setFetchSize(Integer.MIN_VALUE).setReadOnly(true);
query.setParameter(query_param_key, query_paramter_value);
ScrollableResults resultSet = query.scroll(ScrollMode.FORWARD_ONLY);
This way you're not loading all the records into the Java process in one go; instead they're loaded on demand, and the process has a small memory footprint at any given time. Note that the JDBC connection will stay open until you're done processing the entire record set. This also means your DB connection pool can be exhausted if many users download CSV files from this endpoint. You need to take measures to overcome this problem (e.g. use an API manager to rate-limit calls to this endpoint, read from a read replica, or whatever option is viable).
My other suggestion is to stream the data, which you have already done, so that records fetched from the DB are processed and sent to the client before the next set of records is processed. I would also suggest using a CSV library such as Super CSV to handle this, as these libraries are designed to handle a good load of data.
Please note that this answer may not exactly answer your question as you haven't provided necessary parts of your source such as how to retrieve data from DB but will give the right direction to solve this issue
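To illustrate one thing such a CSV library handles for you (and which the hand-rolled string concatenation above does not), here is a minimal RFC 4180-style field escaper; this is a sketch of the idea, not Super CSV's actual API:

```java
public class CsvUtil {
    // Quote a field if it contains a comma, quote, or line break,
    // doubling any embedded quotes (RFC 4180 style).
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        System.out.println(escape("plain"));      // plain
        System.out.println(escape("a,b"));        // "a,b"
        System.out.println(escape("say \"hi\"")); // "say ""hi"""
    }
}
```

Without escaping, a single log field containing a comma shifts every following column in that row, which corrupts the generated CSV.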
Your problem is loading all the data from the database into the application server at once. Try running the query with limit and offset parameters (with a mandatory ORDER BY), push the loaded records to the client, then load the next part of the data with a different offset. This helps you decrease the memory footprint and doesn't require keeping a database connection open the whole time. Of course, the database will be loaded a bit more, but the overall situation may be better. Try different limit values, for example 5K-50K, and monitor CPU usage on both the app server and the database.
If you can afford to keep many open connections to the database, @Bunti's answer is very good.
http://dev.mysql.com/doc/refman/5.7/en/select.html
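The paging idea above can be sketched as follows; the column name id, the base query, and the executeAndStream helper are hypothetical placeholders, not part of the question's code:

```java
public class PagedQuery {
    // Append paging to a base query. A stable ORDER BY is mandatory,
    // otherwise pages may overlap or skip rows between requests.
    static String pagedQuery(String baseSql, int limit, int offset) {
        return baseSql + " ORDER BY id LIMIT " + limit + " OFFSET " + offset;
    }

    public static void main(String[] args) {
        // Fetch 600K records in 5K chunks; each chunk is written to the
        // response and discarded before the next chunk is loaded.
        int limit = 5000;
        for (int offset = 0; offset < 600000; offset += limit) {
            String sql = pagedQuery("SELECT * FROM log", limit, offset);
            // executeAndStream(sql); // hypothetical: run query, write rows to response
        }
    }
}
```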

Get prediction percentage in WEKA using own Java code and a model

Overview
I know that one can get the percentages of each prediction in a trained WEKA model through the GUI and command line options as conveniently explained and demonstrated in the documentation article "Making predictions".
Predictions
I know that there are three ways documented to get these predictions:
command line
GUI
Java code/using the WEKA API, which I was able to do in the answer to "Get risk predictions in WEKA using own Java code"
The fourth one, which I am after, requires a generated WEKA .MODEL file
I have a trained .MODEL file and now I want to classify new instances using this together with the prediction percentages similar to the one below (an output of the GUI's Explorer, in CSV format):
inst#,actual,predicted,error,distribution,
1,1:0,2:1,+,0.399409,*0.7811
2,1:0,2:1,+,0.3932409,*0.8191
3,1:0,2:1,+,0.399409,*0.600591
4,1:0,2:1,+,0.139409,*0.64
5,1:0,2:1,+,0.399409,*0.600593
6,1:0,2:1,+,0.3993209,*0.600594
7,1:0,2:1,+,0.500129,*0.600594
8,1:0,2:1,+,0.399409,*0.90011
9,1:0,2:1,+,0.211409,*0.60182
10,1:0,2:1,+,0.21909,*0.11101
The predicted column is what I want to get from a .MODEL file.
What I know
Based on my experience with the WEKA API approach, one can get these predictions using the following code (a PlainText object inserted into an Evaluation object), BUT I do not want to do the k-fold cross-validation that the Evaluation object provides.
StringBuffer predictionSB = new StringBuffer();
Range attributesToShow = null;
Boolean outputDistributions = Boolean.TRUE;
PlainText predictionOutput = new PlainText();
predictionOutput.setBuffer(predictionSB);
predictionOutput.setOutputDistribution(true);
Evaluation evaluation = new Evaluation(data);
evaluation.crossValidateModel(j48Model, data, numberOfFolds,
        randomNumber, predictionOutput, attributesToShow,
        outputDistributions);
System.out.println(predictionOutput.getBuffer());
From the WEKA documentation
Note that how a .MODEL file classifies data from an .ARFF or related input is discussed in "Use Weka in your Java code" and "Serialization", a.k.a. "How to use a .MODEL file in your own Java code to classify new instances" (why the vague title, smfh).
Using own Java code to classify
Loading a .MODEL file is through "Deserialization" and the following is for versions > 3.5.5:
// deserialize model
Classifier cls = (Classifier) weka.core.SerializationHelper.read("/some/where/j48.model");
An Instance object is the data and it is fed to the classifyInstance. An output is provided here (depending on the data type of the outcome attribute):
// classify an Instance object (testData)
cls.classifyInstance(testData.instance(0));
The question "How to reuse saved classifier created from explorer(in weka) in eclipse java" has a great answer too!
Javadocs
I have already checked the Javadocs for Classifier (the trained model) and Evaluation (just in case) but none directly and explicitly addresses this issue.
The thing closest to what I want is the classifyInstance method of the Classifier:
Classifies the given test instance. The instance has to belong to a dataset when it's being classified. Note that a classifier MUST implement either this or distributionForInstance().
How can I simultaneously use a WEKA .MODEL file to classify and get predictions of a new instance using my own Java code (aka using the WEKA API)?
This answer simply updates my answer from How to reuse saved classifier created from explorer(in weka) in eclipse java.
I will show how to obtain the predicted instance value and the prediction percentage (or distribution). The example model is a J48 decision tree created and saved in the Weka Explorer. It was built from the nominal weather data provided with Weka. It is called "tree.model".
import weka.classifiers.Classifier;
import weka.core.Instances;

public class Main {
    public static void main(String[] args) throws Exception {
        String rootPath = "/some/where/";

        // load model
        Classifier cls = (Classifier) weka.core.SerializationHelper.read(rootPath + "tree.model");

        // load or create the Instances to predict
        Instances originalTrain = //instances here

        // which instance to predict the class value of
        int s1 = 0;

        // perform the prediction
        double value = cls.classifyInstance(originalTrain.instance(s1));

        // get the prediction percentage or distribution
        double[] percentage = cls.distributionForInstance(originalTrain.instance(s1));

        // get the name of the class value
        String prediction = originalTrain.classAttribute().value((int) value);

        System.out.println("The predicted value of instance " +
                Integer.toString(s1) + ": " + prediction);

        // format the distribution, marking the predicted class with *
        String distribution = "";
        for (int i = 0; i < percentage.length; i = i + 1) {
            if (i == value) {
                distribution = distribution + "*" + Double.toString(percentage[i]) + ",";
            } else {
                distribution = distribution + Double.toString(percentage[i]) + ",";
            }
        }
        distribution = distribution.substring(0, distribution.length() - 1);
        System.out.println("Distribution: " + distribution);
    }
}
The output from this is:
The predicted value of instance 0: no
Distribution: *1.0,0.0
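As a side note on how classifyInstance and distributionForInstance relate: for most classifiers, the value returned by classifyInstance is simply the index of the largest entry in the distribution. A small pure-Java illustration (this helper is mine, not part of the WEKA API):

```java
public class Argmax {
    // Index of the largest probability, i.e. the predicted class index.
    static int argmax(double[] distribution) {
        int best = 0;
        for (int i = 1; i < distribution.length; i++) {
            if (distribution[i] > distribution[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] distribution = {1.0, 0.0}; // the example's distribution for class "no"
        System.out.println(argmax(distribution)); // prints 0, the index of "no"
    }
}
```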
