Performance issue loading dataset/model from Apache TDB

Performance issue loading dataset/model from Apache TDB - java

I have a RDF file that has 7MB and ~ 80k statements.
When starting the application, I have the following code, that retrieves a list of itens I need to show to the user:
NodeIterator iterator = technologyModel.listObjectsOfProperty(subject);
while (iterator.hasNext()) {
RDFNode node = iterator.nextNode();
myCollection.add(node.asLiteral().getString().trim());
}
Note: This code works just fine and returns something about 3k results, and is the first time the "technologyModel" is accessed.
Obviously, before doing that, I have to load the dataset/model, and here is the problem.
Case (1) When I load the dataset/model from a RDF file, doing this:
InputStream in = FileManager.get().open(ParamsHelper.sourceRDF);
technologyModel.read(in, "RDF/XML-ABBREV");
the technologyModel seems instantly loaded and the first code posted runs in less than a second.
Case (2) However, when I try to load the model from a TDB database (previously loaded with the same RDF file used on first case), with this code:
dataset = TDBFactory.createDataset(ParamsHelper.tdbBaseDir);
dataset.begin(ReadWrite.READ) ;
technologyModel = dataset.getNamedModel("http://a.example.biz/technology");
dataset.end();
the technologyModel doesn´t seem to be instantly loaded, and even though the first code posted returns as expected, it runs in about 30 seconds at the first call.
If I call that same code after the first time, or, for example, insert another operation like technologyModel.listSubjects() before calling this code for the first time, it will run immediately, as expected.
It seems to me that on the second case, the model is really loaded only afthe the first operation it suffers. Does it make any sense?
I don´t want to keep my data in a RDF file, but rather have a TDB database storing the triples. That´s why the second option seems to fit me better.
Can anyone help me on this? I hope I could expose the problem correctly.
Thanks in advance.

There are two effects here:
TDBFactory.createDataset doesn't loaded any data - it connects to the database. Data is loaded into memory (cached) as it is used so when you are doing listObjectsOfProperty the first time, all caches are cold and the database may well be slow. It will be quite sensitive to the hardware you are running on at this point.
The second is that Model API calls can have access patterns that are databse-unfriendly. It is better to use SPARQL on the dataset.
By the way: listObjectsOfProperty does not take a subject - it takes a property and can access a lotof the database. If myCollection is a set, then you may be adding a lot more than 3K items.

Related

WebApp: suggestion about how save a lot of information from a Sphinx search

I have an issue about my webapp: it's a intranet search webapp that asks Sphinx http://sphinxsearch.com/ (the real search engine) for a query typed by the user. The problem is that the result could be very big (also for a intranet network) so I want to save the result on the server to handle a sort of lazy load of the data.
I suppose to use Hibernate but...if the result is too big and I save, for example, 40.000 items...will it be too effort for hibernate? And retrieving them?!
Any suggestions?
Thanks in advance

You can use a limit and offset in the sphinxsearch itself: http://sphinxsearch.com/docs/2.1.3/api-func-setlimits.html. As from this doc a word about limiting the result from the sphinx server (which is 1000 by default):
One thousand records is enough to present to the end user. And if you're thinking about pulling the results to application for further sorting or filtering, that would be much more efficient if performed on Sphinx side.

maybe I'm missing something, but why not just get it piecemeal direct out of sphinx? You can jsut get small pages worth of results at a time. with setLimits.
That way you dont download the data as you need it.

Search optimization when data owner is someone else

In my project, we have 2 REST calls which take too much time, so we are planning to optimize that. Here is how it works currently - we make 1st call to system A and then pass the response to system B for further processing. Once we get the response from system B, we have to manipulate it further before passing it to UI layer and this entire process takes lot of time. We planned on using Solr/Lucene but since we are not the data owners, we can't implement that. Can someone please shed some light on how best this can be handled? We are using Spring MVC and Spring webflow. Thanks in advance!!
[EDIT:] This is not the actual scenario and I am writing this as an example for better understanding. Think of this as making a store locator call for a particular zip to get a list of 100 stores and then sending those 100 stores to another call to get a list of inventory etc. So, this list of stores would change for every zip code and also the inventory there.

If your queries parameters to System A / System B are frequently the same you can add a cache framework to your code. If you use Spring3, you can use the cache easily with an #Cacheable annotation on your code calling SystemA. See :
http://static.springsource.org/spring/docs/3.1.0.M1/spring-framework-reference/html/cache.html
The cache subsystem will cache the result including processing code.

Check if files exists without too much garbage

I'm trying to do the following: I've a database filled with file names located under a directory. This directory is changing constantly (downloaded files are being added and removed). My application is supposed to scan this directory for the first time and add the files into the database. The second time the application will run, it needs to check if the filenames in the database are still available in the directory.
For the check I use the following pseudo code:
get the filename from the database
check if exists (file f = new File(filename))
if (f.exists()){
mark as existing;
} else {
mark is as deleted
}
if it does, then mark it as existing, else mark it as removed (later will clean the database up)
The question is: How can I check all the files on the database if they exists without producing much garbage? Files can be more than 1000. Running the loop with "new File(...)" more than 1000 times will cause too much garbage.
Any help is appreciated.

The File() object is really tiny. It has only path string in it and reference to the FileSystem object. It just look like a wasting resources, but it's not.
Think about File object as a path String with few helper methods to deal with file paths.
It has nothing to do with file descriptor or other heavy resources.
Never do optimization before profiling. You will end up with non optimal difficult to maintain code.

Files can be more than 1000. Running the loop with "new File(...)"
more than 1000 times will cause too much garbage.
Really? Have you tested this? I can't see this being a significant concern under modern systems. (What are you most worried about? The JVM garbage collection?)
Otherwise, get the current directory, then call .list() or .listFiles(), load into a Set for performance (a HashSet would probably do nicely), then just query against the Set. (You'll still be creating Strings and entries within the Set that could be a similar GC concern.) The potential problem here is that you're now loading a potentially "large" number of elements into memory within the JVM - rather than checking on-demand as you read each row out of the database.
I'd stick with the code that you have outlined. +1 for Michal's answer - please review for additional details as to why doing this should be of no concern.

Do it the other way--you add a set of rows to a database table. You then scan the directory the files are in and just get a list of filenames and compare that list to a 'select names from filesTable' type of query.

read/write to a large size file in java

i have a binary file with following format :
[N bytes identifier & record length] [n1 bytes data]
[N bytes identifier & record length] [n2 bytes data]
[N bytes identifier & record length] [n3 bytes data]
as you see i have records with different lengths. in each record i have N bytes fixed which contains and id and the length of data in record.
this file is very big and can contains 3 millions records.
I want to open this file by an application and let user to browse and edit the records.
( Insert / Update / Delete records)
my initial plan is to create and index file from original file and for each record, keep next and previous record address to navigate forward and backward easily. (some sort of linked list but in file not in memory)
is there library (java library) to help me to implement this requirement ?
any recommendation or experience that you think is useful?
----------------- EDIT ----------------------------------------------
Thanks for guides and suggestions,
some more info:
the original file and its format is out of my control (it's a third party file) and i can't change the file format. but i have to read it, let user to navigate over records and edit some of them (insert new record/ update an existing record/ delete a record) and at the end save it back to original file format.
do u still recommend DataBase instead of a normal index file ?
----------------- SECOND EDIT ----------------------------------------------
record size in update mode is fixed. it means updated (edited) record has same length as original record's, unless user delete the record and create another record with different format.
Many Thanks

Seriously, you should NOT be using a binary file for this. You should use a database.
The problems with trying to implement this as a regular file stem from the fact that operating systems do not allow you to insert extra bytes into the middle of an existing file. So if you need to insert a record (anywhere but the end), update a record (with a different size) or remove a record, you would need to:
rewrite other records (after the insertion/update/deletion point) to make or reclaim space, or
implement some kind of free space management within the file.
All of this is complicated and / or expensive.
Fortunately, there is a class of software that implements this kind of thing. It is called database software. There are a wide range of options, ranging from using a full-scale RDBMS to light-weight solutions like BerkeleyDB files.
In response to your 1st and 2nd edits, a database will still be simpler.
However, here's an alternative that might perform better for this use-case than using a DB... without doing complicated free-space management.
Read the file and build an in-memory index that maps ids to file locations.
Create a second file to hold new and updated records.
Perform the record adds/updates/deletes:
An addition is handled by writing the new record to the end of the second file, and adding an index entry for it.
An update is handled by writing the updated record to the end of the second file, and changing the existing index entry to point to it.
A delete is handled by deleting the index entry for the record's key.
Compact the file as follows:
Create a new file.
Read each record in the old file in order, and check the index for the record's key. If the entry still points to the location of the record, copy the record to the new file. Otherwise skip it.
Repeat the step 4.2 for the second file.
If we completed all of the above successfully, delete the old file and second file.
Note this relies on being able to keep the index in memory. If that is not feasible, then the implementation is going to be more complicated ... and more like a database.

Having a data file and an index file would be the general base idea for such an implementation, but you'd pretty much find yourself dealing with data fragmentation upon repeated data updates/deletion, etc. This kind of project, in itself, should be a separate project and should not be part of your main application. However, essentially, a database is what you need as it is specifically designed for such operations and use cases and will also allow you to search, sort, and extend (alter) your data structure without having to refactor an in-house (custom) solution.
May I suggest you to download Apache Derby and create a local embedded database (derby does it for you want you create a new embedded connection at run-time). It will not only be faster than anything you'll write yourself, but will make your application easier to maintain.
Apache Derby is a single jar file that you can simply include and distribute with your project (check the license if any legal issue may apply in your app). There is no need for a database server or third party software; it's all pure Java.
Bottom line as that it all depends on how large is your application, if you need to share the data across many clients, if speed is a critical aspect of your app, etc.
For a stand-alone, single user project, I recommend Apache Derby. For a n-tier application, you might want to look into MySQL, PostgreSQL or (hrm) even Oracle. Using already made and tested solutions is not only smart, but will cut down your development time (and maintenance efforts).
Cheers.

Generally you are better off letting a library or database do the work for you.
You may not want to have an SQL database and there are plenty of simple databases which don't use SQL. http://nosql-database.org/ lists 122 of them.
At a minimum, if you are going to write this I suggest you read the source for one of these databases to see how they work.
Depending on the size of the records, 3 million isn't that much and I would suggest you keep as much in memory as possible.
The problem you are likely to have is ensuring the data is consistent and recovering the data when a corruption occurs. The second problem is dealing with fragmentation efficiently (some thing the brightest minds working on the GC deal with) The third problem is likely to be maintain the index in a transaction fashion with the source data to ensure there are no inconsistencies.
While this may appear simple at first, there are significant complexities in making sure there data is reliable, maintainable and can be accessed efficiently. This is why most developers use an existing database/datastore library and concentrate on the features which are unqiue to their application.

(Note: My answer is about the problem in general, not considering any Java libraries or - like the other answers also proposed - using a database (library), which might be better than reinventing the wheel)
The idea to create an index is good and will be very helpful performance-wise (although you wrote "index file", I think it should be kept in memory). Generating the index should be quite fast if you read the ID and record length for each entry and then just skip the data with a file seek.
You should also think about the edit functionality. Especially inserting and deleting can be very slow on such a big file if you do it wrong (f.e. deleting and then moving all the following entries to close the gap).
The best option would be to only mark deleted entries as deleted. When inserting, you can overwrite one of those or append to the end of the file.

Insert / Update / Delete records
Inserting (rather than merely appending) and deleting records to a file is expensive because you have to move all the following content of the file to create space for the new record or to remove the space it used. Updating is similarly expensive if the update changes the length of the record (you say they are variable length).
The file format you propose is fundamentally unsuitable for the kinds of operations you want to perform. Others have suggested using a data-base. If you don't want to go that far, adding an index file (as you suggest) is the way to go. I recommend making the index records all the same length.

As others have stated a database would seem a better solution. The following are Java SQL DB's that could be used: H2, Derby or HSQLDB
If you want to use an index file look at Berkley DB or No Sql
If there is some reason for using a file, look at JRecord . It has
Several Classes for reading/writing files with variable length binary records (they where written for Cobol VB files). Any of Mainframe / Fujitsu / Open Cobol VB file structures should do the job.
An Editor for editing JRecord files. The latest version of the Editor can handle large files (it uses Compression / spill file). The editor suffers from having to download the whole file and only one user can edit the file at one time.
The JRecord solution will only work if
There is a limited number (preferably one) users all located in the one location
Fast infostructure

Retrieving Large Lists of Objects Using Java EE

Is there a generally-accepted way to return a large list of objects using Java EE?
For example, if you had a database ResultSet that had millions of objects how would you return those objects to a (remote) client application?
Another example -- that is closer to what I'm actually doing -- would be to aggregate data from hundreds of sources, normalize it, and incrementally transfer it to a client system as a single "list".
Since all the data cannot fit in memory, I was thinking that a combination of a stateful SessionBean and some sort of custom Iterator that called back to the server would do the trick.
So, in other words, if I have an API like Iterator<Data> getData() then what's a good way to implement getData() and Iterator<Data>?
How have you successfully solved this problem in the past?

Definitely don't duplicate the entire DB into Java's memory. This makes no sense and only makes things unnecessarily slow and memory-hogging. Rather introduce pagination at database level. You should query only the data you actually need to display on the current page, like as Google does.
If you actually have a hard time in implementing this properly and/or figuring the SQL query for the specific database, then have a look at this answer. For JPA/Hibernate equivalent, have a look at this answer.
Update as per the comments (which actually changes the entire question subject...), here's a basic (pseudo) kickoff example:
List<Source> inputSources = createItSomehow();
Source outputSource = createItSomehow();
for (Source inputSource : inputSources) {
while (inputSource.next()) {
outputSource.write(inputSource.read());
}
}
This way you effectively end up with a single entry in Java's memory instead of the entire collection as in the following (inefficient) example:
List<Source> inputSources = createItSomehow();
List<Entry> entries = new ArrayList<Entry>();
for (Source inputSource : inputSources) {
while (inputSource.next()) {
entries.add(inputSource.read());
}
}
Source outputSource = createItSomehow();
for (Entry entry : entries) {
outputSource.write(entry);
}

Pagination is a good solution when working with a web based ui. sometimes, however, it is much more efficient to stream everything in one call. the rmiio library was written explicitly for this purpose, and is already known to work in a variety of app servers.

If your list is huge, you must assume that it can't fit in memory. Or at least that if your server need to handle that on many concurrent access then you have high risk of OutOfMemoryException.
So basically, what you do is paging and using batch reading. let say you load 1 thousand objects from your database, you send them to the client request response. And you loop until you have processed all objects. (See response from BalusC)
Problem is same on client side, and you'll likely to need to stream the data to the file system to prevent OutOfMemory errors.
Please also note : It is okay to load millions of object from a database as an administrative task : like for performing a backup, and export of some 'exceptional' case. But you should not use it as a request any user could do. It will be slow and drain server resources.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.