Drools Rule Engine taking too long to update - java

I have implemented a rules engine based on Drools (v7.25). The users dynamically add, update, and delete rules.
The application may grow further and currently includes over 15,000 rules. These rules are distributed evenly across several files, even though they all belong to the same package.
The application was initially performing well, but as the number of rules increased, the performance of adding and updating rules has degraded noticeably.
For instance, with up to 1,000 rules the KieBuilder built the KieBases in about 300 ms. Now the same process can take up to 38,000 ms.
Here is a sample of my code for your reference:
List<CustomerRule> customerRules = rulesBuilder.getCustomerRules(customerId);
if (!customerRules.isEmpty()) {
    kieFileSystem.write(rulesBuilder.getDRLFilePath(customerId),
            rulesBuilder.getRulesFileData(customerId, customerRules));
}

KieBuilder kieBuilder = kieServices.newKieBuilder(kieFileSystem).buildAll();
Results results = kieBuilder.getResults();
if (results.hasMessages(Message.Level.ERROR)) {
    log.error(results.getMessages());
    return;
}

results = ((KieContainerImpl) kieContainer).updateToKieModule((InternalKieModule) kieBuilder.getKieModule());
if (results.hasMessages(Message.Level.ERROR)) {
    log.error(results.getMessages());
    return;
}
kieContainer.dispose();
Is there a way to improve the re-building process of the KieBases? What best practices should I take into account?

Related

Thread-Safe way to access EclipsePreferences (Project)

I am currently developing an Eclipse-RCP application that stores per-project preferences using the EclipsePreferences mechanism through ProjectScope. At first this seemed to work very well, but we have run into trouble when (read-)accessing these preferences in multithreaded scenarios while changes are being made to the workspace at the same time. What appears to be particularly problematic is accessing such a preference node (ProjectScope.getNode()) while the project is being deleted by an asynchronous user action (right click on Project -> Delete Project). In such cases we get a colorful mix of
org.osgi.service.prefs.BackingStoreException
java.io.FileNotFoundException
org.eclipse.core.runtime.CoreException
Essentially they all complain that the underlying file is no longer there.
Initial attempts to fix this using checks like IProject.exists() or isAccessible() and even going so far as checking the presence of the actual .prefs file were as futile as expected: They only make the exceptions less likely but do not really prevent them.
So my question is: how are you supposed to safely access things like ProjectScope.getNode()? Do you need to go so far as to put every read into a WorkspaceJob, or is there some other, clever way to prevent the above problems, like putting the read access in Display.asyncExec()?
Although I tried, I did not really find answers to the above question in the Eclipse documentation.
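For reference, the kind of read access in question looks roughly like this (the node qualifier "com.example.myplugin" and the key are placeholders, not taken from the actual application):
IEclipsePreferences node = new ProjectScope(project).getNode("com.example.myplugin");
String value = node.get("someKey", "someDefault"); // can fail if the project is deleted concurrently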
Usually scheduling rules are used to coordinate concurrent access to resources in the workspace.
I've never worked with ProjectScope'd preferences, but if they are stored within a project or its metadata, then a scheduling rule should help to coordinate access. If you are running the preference-access code in a Job, then setting an appropriate scheduling rule should do:
For example:
IProject project = getProjectForPreferences( projectPreferences );
ISchedulingRule rule = project.getWorkspace().getRuleFactory().modifyRule( project );
Job job = new Job( "Access Project Preferences" ) {
  @Override
  protected IStatus run( IProgressMonitor monitor ) {
    if( project.exists() ) {
      // read or write project preferences
    }
    return Status.OK_STATUS;
  }
};
job.setRule( rule );
job.schedule();
The code acquires a rule to modify the project, and the Job is guaranteed to run only while no other job with a conflicting rule is running.
If your code isn't running within a job, you can also manually acquire a lock with IJobManager.beginRule() and endRule().
For example:
ISchedulingRule rule = ...;
IJobManager jobManager = Job.getJobManager(); // the platform job manager
try {
  jobManager.beginRule( rule, monitor );
  if( project.exists() ) {
    // read or write project preferences
  }
} finally {
  jobManager.endRule( rule );
}
As awkward as it looks, the call to beginRule must be within the try block, see the JavaDoc for more details.

Google App Engine Objectify - load single objects or list of keys?

I am trying to get a grasp on Google App Engine programming and wonder what the difference between these two methods is - if there even is a practical difference.
Method A)
public Collection<Conference> getConferencesToAttend(Profile profile)
{
    List<String> keyStringsToAttend = profile.getConferenceKeysToAttend();
    List<Conference> conferences = new ArrayList<Conference>();
    for (String conferenceString : keyStringsToAttend)
    {
        conferences.add(ofy().load().key(Key.create(Conference.class, conferenceString)).now());
    }
    return conferences;
}
Method B)
public Collection<Conference> getConferencesToAttend(Profile profile)
{
    List<String> keyStringsToAttend = profile.getConferenceKeysToAttend();
    List<Key<Conference>> keysToAttend = new ArrayList<>();
    for (String keyString : keyStringsToAttend) {
        keysToAttend.add(Key.<Conference>create(keyString));
    }
    return ofy().load().keys(keysToAttend).values();
}
the "conferenceKeysToAttend" list is guaranteed to only have unique Conferences - does it even matter then which of the two alternatives I choose? And if so, why?
Method A loads entities one by one while method B does a bulk load, which is cheaper, since you are making just one network round trip to Google's datacenter. You can observe this by measuring the time taken by both methods while loading a bunch of keys multiple times.
When doing a bulk load, you need to be careful about which entities were actually loaded if the datastore operation throws an exception: the operation might partially succeed, with some of the entities not loaded.
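A rough way to measure that difference (the wrapper method names and iteration count below are hypothetical, not part of the original code):
long start = System.nanoTime();
for (int i = 0; i < 100; i++) {
    getConferencesToAttendOneByOne(profile);   // Method A, hypothetical name
}
long oneByOneMs = (System.nanoTime() - start) / 1_000_000;

start = System.nanoTime();
for (int i = 0; i < 100; i++) {
    getConferencesToAttendBatched(profile);    // Method B, hypothetical name
}
long batchedMs = (System.nanoTime() - start) / 1_000_000;

System.out.println("one-by-one: " + oneByOneMs + " ms, batched: " + batchedMs + " ms");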
The answer depends on the size of the list. If we are talking about hundreds of keys or more, you should not make a single batch; I couldn't find documentation on what the limit is, but there is a limit. If it is not that many, definitely go with loading one by one. But you should make the calls asynchronous by not calling the now() function right away:
List<LoadResult<Conference>> conferences = new ArrayList<>();
conferences.add(ofy().load().key(Key.create(Conference.class, conferenceString)));
And when you need the actual data:
for (LoadResult<Conference> result : conferences) {
    Conference c = result.now(); // on older Objectify versions load().key() returns a Ref, materialized with get()
    // ...
}

Elasticsearch Performance Analysis

We are currently evaluating Elasticsearch as our solution for Analytics. The main driver is the fact that once the data is populated into Elasticsearch, the reporting comes for free with Kibana.
Before adopting it, I am tasked to do a performance analysis of the tool.
The main requirement is supporting a PUT rate of 500 events/sec.
I am currently starting with a small setup as follows, just to get a sense of the API before I move it to a more serious lab.
My strategy is basically to go over CSVs of analytics events that match the format I need and put them into Elasticsearch. I am not using the bulk API because in reality the events will not arrive in batches.
Following is the main code that does this:
// Created once, used for creating JSON from a bean
ObjectMapper mapper = new ObjectMapper();

// Measurement for checking the count of sent events vs. ES-stored events
AnalyticsMetrics metrics = new AnalyticsMetrics();
metrics.startRecording();

File dir = new File(mFolder);
for (File file : dir.listFiles()) {
    try (CSVReader reader = new CSVReader(new FileReader(file.getAbsolutePath()), '|')) {
        String[] nextLine;
        while ((nextLine = reader.readNext()) != null) {
            AnalyticRecord record = new AnalyticRecord();
            record.serializeLine(nextLine);

            // Generate JSON
            String json = mapper.writeValueAsString(record);

            IndexResponse response = mClient.getClient().prepareIndex("sdk_sync_log", "sdk_sync")
                    .setSource(json)
                    .execute()
                    .actionGet();

            // Record metrics
            metrics.sent();
        }
    }
}

metrics.stopRecording();
return metrics;
I have the following questions:
How do I know through the API when all the requests have completed and the data is saved into Elasticsearch? I could query Elasticsearch for the object count in my particular index, but doing that would become a performance factor of its own, so I am ruling that option out.
Is the above the fastest way to insert objects into Elasticsearch, or are there other optimizations I could make? Keep in mind that the bulk API is not an option for now.
Thanks in advance.
P.S.: the Elasticsearch version I am using on both client and server is 1.0.0.
The Elasticsearch index response has an isCreated() method that returns true if the document is new and false if an existing document was updated, so it can be used to check whether each document was successfully inserted or updated.
If bulk indexing is not an option, there are other areas that could be tweaked to improve indexing performance, for example:
increasing the index refresh interval via index.refresh_interval
disabling replicas by setting index.number_of_replicas to 0
disabling the _source and _all fields if they are not needed
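A minimal sketch of applying the first two settings through the ES 1.x Java admin client (the index name is taken from the question; the 30s interval is only an illustrative value, and client is assumed to be the same Client used for indexing):
// Relax refresh and replication while doing the initial load; restore them afterwards.
client.admin().indices().prepareUpdateSettings("sdk_sync_log")
        .setSettings(ImmutableSettings.settingsBuilder()
                .put("index.refresh_interval", "30s")
                .put("index.number_of_replicas", 0)
                .build())
        .execute()
        .actionGet();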

Jena Rule Engine with TDB

I have my data loaded into a TDB model and have written some rules with Jena in order to apply them to that TDB store; I then store the inferred data into a new TDB store.
I tried this on a small dataset (~200 KB) and it worked just fine. However, my actual TDB store is 2.7 GB, and the computer has now been running for about a week and is in fact still running.
Is that normal, or am I doing something wrong? And what alternatives to the Jena rule engine could I use?
Here is a small piece of the code:
public class Ruleset {

    private List<Rule> rules = null;
    private GenericRuleReasoner reasoner = null;

    public Ruleset(String rulesSource) {
        this.rules = Rule.rulesFromURL(rulesSource);
        this.reasoner = new GenericRuleReasoner(rules);
        reasoner.setOWLTranslation(true);
        reasoner.setTransitiveClosureCaching(true);
    }

    public InfModel applyto(Model mode) {
        return ModelFactory.createInfModel(reasoner, mode);
    }

    public static void main(String[] args) {
        System.out.println(" ... Running the Rule Engine ...");
        String rulepath = "src/schemaRules.osr";
        Ruleset rule = new Ruleset(rulepath);
        InfModel infedModel = rule.applyto(data.tdb);
        infdata.close();
    }
}
A large dataset in a persistent store is not a good match with Jena's rule system. The basic problem is that the RETE engine will make many small queries into the graph during rule propagation. The overhead in making these queries to any persistent store, including TDB, tends to make the execution times unacceptably long, as you have found.
Depending on your goals for employing inference, you may have some alternatives:
Load your data into a large enough memory graph, then save the inference closure (the base graph plus the entailments) to a TDB store in a single transaction. Thereafter, you can query the store without incurring the overhead of the rules system. Updates, obviously, can be an issue with this approach. (A minimal sketch of this approach follows at the end of this answer.)
Have your data in TDB, as now, but load a subset dynamically into a memory model to use live with inference. Makes updates easier (as long as you update both the memory copy and the persistent store), but requires you to partition your data.
If you only want some basic inferences, such as closure of the rdfs:subClassOf hierarchy, you can use the infer command line tool to generate an inference closure which you can load into TDB:
$ infer -h
infer --rdfs=vocab FILE ...
General
-v --verbose Verbose
-q --quiet Run with minimal output
--debug Output information for debugging
--help
--version Version information
Infer can be more efficient, because it doesn't require a large memory model. However, it is restricted in the inferences that it will compute.
If none of these work for you, you may want to consider commercial inference engines such as OWLIM or Stardog.
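As mentioned in the first option above, here is a minimal sketch of computing the closure in memory and persisting it to TDB (Jena 3 package names are assumed - under Jena 2 they live under com.hp.hpl.jena.* - and the store paths are placeholders):
import org.apache.jena.rdf.model.InfModel;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;
import org.apache.jena.tdb.TDBFactory;

public class ClosureToTdb {
    public static void main(String[] args) {
        // Copy the persistent data into a plain in-memory model first.
        Model tdbData = TDBFactory.createDataset("/path/to/source-tdb").getDefaultModel();
        Model memory = ModelFactory.createDefaultModel().add(tdbData);

        // Run the rules against the in-memory copy only, so the RETE engine
        // never has to query the persistent store.
        GenericRuleReasoner reasoner =
                new GenericRuleReasoner(Rule.rulesFromURL("src/schemaRules.osr"));
        InfModel closure = ModelFactory.createInfModel(reasoner, memory);

        // Persist the base triples plus the entailments into the target store.
        Model target = TDBFactory.createDataset("/path/to/target-tdb").getDefaultModel();
        target.add(closure);
        target.close();
    }
}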
Thanks Ian.
I was actually able to do it via SPARQL Update, as Dave advised me to, and it took only 10 minutes to finish the job.
Here is an example of the code:
System.out.println(" ... Load rules ...");
data.startQuery();
String query = data.loadQuery("src/sparqlUpdatesRules.tql");
data.endQuery();
System.out.println(" ... Inserting rules ...");
UpdateAction.parseExecute(query, inferredData.tdb);
System.out.println(" ... Printing RDF ...");
inferredData.exportRDF();
System.out.println(" ... closeing ...");
inferredData.close();
and here is an example of the SPARQL update:
INSERT {
?w ddids:carries ?p .
} WHERE {
?p ddids:is_in ?w .
};
Thanks for your answers.

OrientDB slow write

OrientDB official site says:
On common hardware stores up to 150.000 documents per second, 10 billions of documents per day. Big Graphs are loaded in few milliseconds without executing costly JOIN such as the Relational DBMSs.
But executing the following code shows that it takes ~17000 ms to insert 150000 simple documents.
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;

public final class OrientDBTrial {

    public static void main(String[] args) {
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("remote:localhost/foo");
        try {
            db.open("admin", "admin");

            long a = System.currentTimeMillis();
            for (int i = 1; i < 150000; ++i) {
                final ODocument foo = new ODocument("Foo");
                foo.field("code", i);
                foo.save();
            }
            long b = System.currentTimeMillis();
            System.out.println(b - a + "ms");

            for (ODocument doc : db.browseClass("Foo")) {
                doc.delete();
            }
        } finally {
            db.close();
        }
    }
}
My hardware:
Dell Optiplex 780
Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz
8GB RAM
Windows 7 64bits
What am I doing wrong?
Splitting the saves across 10 concurrent threads to minimize Java's overhead made it run in ~13000 ms, which is still far slower than what the OrientDB front page claims.
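Not from the original post, but a rough sketch of that threaded variant (java.util.concurrent imports omitted); each worker opens its own connection because an ODatabaseDocumentTx instance must not be shared across threads:
static void saveConcurrently() throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(10);
    int total = 150000, chunk = total / 10;
    for (int t = 0; t < 10; t++) {
        final int from = t * chunk, to = from + chunk;
        pool.submit(() -> {
            // each thread gets its own database connection
            ODatabaseDocumentTx db = new ODatabaseDocumentTx("remote:localhost/foo").open("admin", "admin");
            try {
                for (int i = from; i < to; ++i) {
                    new ODocument("Foo").field("code", i).save();
                }
            } finally {
                db.close();
            }
        });
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.MINUTES); // stop the timer only after all workers finish
}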
You can achieve that by using a 'Flat Database' and OrientDB as an embedded library in Java;
see more explained here:
http://code.google.com/p/orient/wiki/JavaAPI
What you are using is server mode, which sends many requests to the OrientDB server.
Judging by your benchmark, you got ~10,000 inserts per second, which is not bad;
for comparison, I think 10,000 requests/s is very good performance for any web server
(and the OrientDB server actually is a web server - you can query it over HTTP, though I think the Java client uses the binary protocol).
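For illustration only (not from the original answer): opening the database through an embedded storage URL instead of the remote protocol might look roughly like this; the path is a placeholder and the exact engine prefix (local: vs. plocal:) depends on the OrientDB version:
ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/path/to/databases/foo");
if (db.exists()) {
    db.open("admin", "admin"); // no network round trip per operation
} else {
    db.create();
}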
The numbers from the OrientDB site are benchmarked for a local database (with no network overhead), so if you use a remote protocol, expect some delays.
As Krisztian pointed out, reuse objects if possible.
Read the documentation first on how to achieve the best performance!
A few tips:
-> Do NOT instantiate a new ODocument for every record:
final ODocument doc = new ODocument();
for (...) {
    doc.reset();
    doc.setClassName("Class");
    // Put data into fields
    doc.save();
}
-> Do NOT rely on System.currentTimeMillis() - use perf4j or a similar tool to measure times, because the former measures global system time and therefore includes the execution time of every other program running on your system!
