I want to ask about a good triplestore to use for large datasets. It should:
Scale well (millions of triples)
Have a Java interface
You should consider using the OpenLink Virtuoso store. It is available under an open-source license and scales to billions of triples. You can use it via the Sesame and Jena APIs.
See here for an overview of large-scale triple stores. Virtuoso is definitely easier to set up than BigData. Besides that, I have used the Sesame NativeStore, which doesn't scale too well.
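To give a feel for the Java side, here is a minimal Jena ARQ sketch that runs a SELECT against a store's SPARQL endpoint (the endpoint URL below assumes Virtuoso's default HTTP port 8890; adjust for your setup):

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class VirtuosoSparqlExample {
    public static void main(String[] args) {
        // Virtuoso exposes a SPARQL endpoint, by default on port 8890.
        String endpoint = "http://localhost:8890/sparql";
        String query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";

        // Jena ARQ speaks the SPARQL protocol, so the same code works
        // against any store that exposes a SPARQL endpoint.
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("s") + " " + row.get("p") + " " + row.get("o"));
            }
        }
    }
}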
4Store is also a good choice, although I haven't used it. One benefit of Virtuoso over 4Store is that you can easily mix standard relational models with RDF, since Virtuoso is, under the hood, a relational database.
4store: Scalable RDF storage
Quoting the 4store Web site:
4store's main strengths are its performance, scalability and stability. It does not provide many features over and above RDF storage and SPARQL queries, but if you are looking for a scalable, secure, fast and efficient RDF store, then 4store should be on your shortlist.
Personally, I have tested 4store with very large databases (up to 2 billion triples) with very good results. 4store is written in C, runs on 64-bit Linux/Unix platforms, and the current version 1.1.1 has partially implemented SPARQL 1.1.
4store can be deployed on a cluster of commodity servers, which may boost query performance; assertion throughput can reach up to 100 KTriples/second. But even on a single server you will get quite decent performance.
Here at the University of Southampton it is our choice for very big datasets in research projects and also for our Webmaster team; see Data Stores for Southampton and ECS Open Data.
There is also a list of all the libraries you can use to query and administer 4store: Client Libraries. Also, 4store's IRC channel has an active community of users who will help if you run into any issues.
If you are a Linux/Unix user, 4store is definitely a good choice.
I would also recommend 4store, but in the spirit of full disclosure, I was the lead architect :)
If you want to take advantage of the standardisation of RDF stores, then you should look to use a Java library that implements SPARQL, rather than one that exposes a Java API natively.
Otherwise you could end up stuck with whichever store you choose first, due to the effort of moving between them: typical SQL migration hell.
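To illustrate, here is a minimal sketch using Sesame's SPARQLRepository, which talks to any SPARQL-compliant endpoint; the endpoint URL is an assumption, and moving to another store should only mean changing it:

import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.sparql.SPARQLRepository;

public class PortableSparqlClient {
    public static void main(String[] args) throws Exception {
        // Any SPARQL endpoint works here; no store-specific API is used.
        SPARQLRepository repo = new SPARQLRepository("http://localhost:8890/sparql");
        repo.initialize();
        RepositoryConnection con = repo.getConnection();
        try {
            TupleQueryResult result = con
                    .prepareTupleQuery(QueryLanguage.SPARQL,
                            "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10")
                    .evaluate();
            while (result.hasNext()) {
                BindingSet row = result.next();
                System.out.println(row.getValue("s"));
            }
            result.close();
        } finally {
            con.close();
            repo.shutDown();
        }
    }
}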
I am personally quite happy with GraphDB, which runs quite well on medium hardware (a 256 GB RAM server) with 15 billion triples. It is accessible via both the Sesame and Jena interfaces (although the Jena one is beta-ish).
If you can afford it, an Oracle 12c instance is not bad, and it might fit in with an existing Oracle infrastructure (back-ups etc.).
Virtuoso 7.1 scales very well and can deal with humongous data volumes at reasonable cost. Unfortunately, its SPARQL standards compliance is spotty.
@Steve: I don't know how to comment yet, so I guess I am going to answer two questions at once.
A JDBC driver for SPARQL is below:
http://code.google.com/p/jdbc4sparql/
It supports the SPARQL Protocol and SPARUL (over the SPARQL protocol as an update, not over the SPARUL protocol).
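As a sketch of what the JDBC route could look like (the URL format below is purely illustrative; check the jdbc4sparql documentation for the actual driver class and URL syntax):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SparqlOverJdbc {
    public static void main(String[] args) throws Exception {
        // Illustrative URL only -- the real syntax is driver-specific.
        String url = "jdbc:sparql://localhost:8890/sparql";
        try (Connection con = DriverManager.getConnection(url);
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT ?s WHERE { ?s ?p ?o } LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("s"));
            }
        }
    }
}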
@myahya:
4Store is highly recommended, so worth appraising as a candidate.
Virtuoso also has native JDBC drivers and supports large datasets (up to 12 billion triples)
www.openlinksw.com/wiki/main/Main/
Also, Oracle have something, but be prepared to pay big bucks:
http://www.oracle.com/technetwork/database/options/semantic-tech/index.html
In addition to 4Store, Virtuoso, and Owlim, Bigdata is also worth looking at.
I am looking for a performant and readable way to implement a full-text search. I have a lot of requirements for the search; see the list below.
Requirements
Performance
My database is growing very fast. Loading all data into the heap and doing some .stream() magic is not an option. The search should be performed by the DBMS.
Readability
I need an easy solution. A complex query like the one in How to implement simple full text search in JPA (Spring Data JPA)? (see option #2) is not a solution either: I would need several JOINs, and the resulting query is too complex.
The overhead of an "index field" is also not feasible (too much joined data).
Concurrency
The application needs to be scalable (with n instances), so a solution with Lucene is not a good fit; here is an example.
no mixing of technologies
I don't want to spread the logic across different systems. This means the whole search logic should be defined in Java; a combination of the Java logic with views or SQL functions should be avoided.
Options discovered so far
QueryDsl
This is my old solution, but it's very complex and produced a lot of problems with the automatically generated classes.
Lucene
I like this, but there is one big problem: the index. Keeping the index up to date on all instances is a bit too much overhead.
Very long @Query
The resulting query gets too complex to handle.
Java.stream()...
// kinda
getAllUsers().stream()
.filter(user -> user.getName().contains(searchTerm)
|| user.getSex().contains(searchTerm)
|| user.getAge().toString().equals(searchTerm)
|| ...)
I have too much data to do that, so this solution will also not scale well.
Specification Interface
My preferred solution so far (see the sketch below). But maybe there are other (and better) solutions?
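For reference, a minimal sketch of the Specification approach, assuming a User entity with the fields from the stream example above (entity, repository and column names are illustrative):

import javax.persistence.Entity;
import javax.persistence.Id;

import org.springframework.data.jpa.domain.Specification;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.JpaSpecificationExecutor;

// Hypothetical entity mirroring the stream example above.
@Entity
class User {
    @Id Long id;
    String name;
    String sex;
}

// The repository only needs to extend JpaSpecificationExecutor as well.
interface UserRepository extends JpaRepository<User, Long>, JpaSpecificationExecutor<User> {}

final class UserSearch {
    // Case-insensitive "contains" match over several columns; the DBMS
    // executes the resulting single SELECT ... WHERE ... LIKE ... query.
    static Specification<User> matches(String term) {
        String pattern = "%" + term.toLowerCase() + "%";
        return (root, query, cb) -> cb.or(
                cb.like(cb.lower(root.<String>get("name")), pattern),
                cb.like(cb.lower(root.<String>get("sex")), pattern));
    }
}

// Usage: List<User> hits = userRepository.findAll(UserSearch.matches(searchTerm));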
SearchField or similar
Too many JOINs. Too much data.
?
Question
What are your experiences with full-text search in a Spring Boot application? Do you know a solution that meets my requirements?
If you have got as far as Lucene, then a step further is Solr. I haven't used the options you mentioned above, but I have certainly worked with Solr and can safely say that it is worth a try, for speed and ease of use.
Out of the four constraints you have put forward, I feel the first three are taken care of with Solr.
Performance: Solr is a proven candidate in this area.
Readability: I assume you mean readability of code. Though this depends on how the code and design are done, the Solr part is quite friendly to code, understand and maintain, because of the lack of JOINs and other RDBMS concepts.
Concurrency: From the official documentation at lucene.apache.org/solr:
Both Lucene and Solr were designed to scale to support large implementations with minimal custom coding.
and that Solr can do the following in this regard:
distributing an index across multiple servers
replicating an index on multiple servers
merging indexes
no mixing of technologies: With the option of using Solr, you have at least two technologies: Java and Solr. I am not sure whether you wanted to keep your solution pure Java/JEE. If that is the case, then this may not satisfy that need.
However, this requirement:
The search should be performed by the DBMS.
is surely not taken care of.
Also, I can't think of a way other than a custom design for this:
Keeping the index up to date on all instances is a bit too much overhead.
A warning: It may take some time to get a good grasp on Solr if you are new to it.
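To give a feel for the Java side, here is a minimal SolrJ sketch (the core name and field are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearchExample {
    public static void main(String[] args) throws Exception {
        // "users" is an illustrative core name.
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/users").build()) {
            SolrQuery query = new SolrQuery("name:john*");
            query.setRows(20);

            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("name"));
            }
        }
    }
}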
You may consider Apache Solr for searching.
I'm fairly new to data science, and I am just starting to develop a system that requires me to analyze large data (e.g. 5-6 million records in each DB).
In the bigger picture: I have multiple DBs containing various kinds of data which need to be integrated. After integrating the data, I also need to perform some data analysis. And lastly, I need to visualize the data for many clients.
Overall, I want to know what the current technology/trend for handling big data is (i.e. with a Java framework).
The answer is: it depends on your non-functional requirements. Your use cases will be critical in deciding which technology to use.
Let me share one of my experiences, in order to clarify what I mean:
In 2012 I needed to deal with ~2 million non-structured records per month, and to run entropy (information theory) and similarity algorithms for ~600 requests per minute.
Our scenario was composed of:
Non-structured records, but already in JSON format.
Entropy and similarity algorithms based on all the content of the DB vs. the records to be matched (take a look at the [Shannon entropy formula][1], and you will understand the complexity I'm talking about).
More than 100 different web applications as clients of this solution.
Given those requirements (and many others), and after performing PoCs with [Cassandra][2], [Hadoop][3], [Voldemort][4], [neo4j][5], plus tests of stress, resiliency, scalability, and robustness, we arrived at the best solution for that moment (2012):
Java EE 7 (with the new Garbage-First (G1) collector activated)
JBoss AS 7 ([WildFly][6]) + [Infinispan][7] for the MapReduce race condition, among other cluster-control and distributed-cache needs.
Servlet 3.0 (because of its non-blocking I/O)
[Nginx][8] (at that time it was in beta, but unlike httpd2, it already handled multiple connections in a non-blocking fashion)
[MongoDB][9] (since our raw content was already in JSON document style)
[Apache Mahout][10] for all the algorithm implementations, including the MapReduce strategy
among other things.
So, it all depends on your requirements. There's no silver bullet; each situation demands an architectural analysis.
I remember NASA at that time was processing ~1 TB per hour on AWS with Hadoop for the [Mars project with Curiosity][11].
In your case, I would recommend paying attention to your requirements; maybe a Java framework is not what you need (or not just what you need):
If you are just going to implement algorithms for data analysis, statistics and data mining (for example), the [R programming language][12] is probably going to be the best choice.
If you need really fast I/O (aircraft stuff, for example): any natively compiled language like [Go][13], [C++][14], etc.
But if you are actually going to create web applications that will just be clients of, or feeders for, the big data solution, I'd recommend something softer and more scalable like [Node.js][15], or even a just-in-time compiled technology like those based on the JVM ([Scala][16], [Jython][17], Java) in [dockerized][18] [microservices][19]...
Good luck! (Sorry, Stack Overflow didn't allow me to add the reference links yet, but everything mentioned here can easily be googled.)
I'm looking for an embeddable Java library suitable for collecting real-time streams of sensor data in a general-purpose way. I plan to use it to develop a "hub" application for reporting on multiple disparate sensor streams, running on a JVM-based server (I will also be using Clojure for this).
Key things it needs to have:
Interfaces for various common sensor types/APIs. I'm happy to build what I need myself, but it would be nice if some standard stuff came out of the box.
Suitable for "soft real time" usage, i.e. fairly low latency and low overhead.
Ability to monitor and manage streams at runtime, gather statistics etc.
Open source under a reasonably permissive license so that I can integrate it with other code (Apache, EPL, BSD, LGPL all fine)
A reasonably active community / developer ecosystem
Is there something out there that fits this profile that you can recommend?
1. Round-robin database (wikipedia)
RRDtool (acronym for round-robin database tool) aims to handle time-series data like network bandwidth, temperatures, CPU load, etc. The data are stored in a round-robin database (circular buffer), thus the system storage footprint remains constant over time.
This approach/DB format is widely used, stable and simple enough. Out of the box it allows you to generate nice plots.
There is a Java implementation, RRD4J:
RRD4J is a high performance data logging and graphing system for time series data, implementing RRDTool's functionality in Java. It follows much of the same logic and uses the same data sources, archive types and definitions as RRDTool does. Open Source under Apache 2.0 License.
Update
Forgot to mention there is a Clojure RRD API (examples).
2. For some experiments with real-time data, I would suggest considering Perst
It is small, fast and reliable enough, but distributed under GPLv3. Perst provides several indexing algorithms:
B-Tree
T-Tree (optimized for in-memory database)
R-Tree (spatial index)
Patricia Trie (prefix search)
KD-Tree (multidimensional index)
Time series (large number of fixed size objects with timestamp)
The last one suits your needs very well.
3. Neo4J with Relationship indexes
A good example where this approach pays dividends is in time series data, where we have readings represented as a relationship per occurrence.
4. Oracle Berkeley DB Java Edition
Oracle Berkeley DB Java Edition is an open source, embeddable, transactional storage engine written entirely in Java. It takes full advantage of the Java environment to simplify development and deployment. The architecture of Oracle Berkeley DB Java Edition supports very high performance and concurrency for both read-intensive and write-intensive workloads.
Suggestion
Give RRD4J a try (a small usage sketch follows this list):
It is simple enough
It does provide quite nice plots
It has a Clojure API
It supports several back-ends including Oracle Berkeley DB Java Edition
It can store/visualize detailed data sets
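To make the suggestion concrete, here is a minimal RRD4J sketch of logging one sensor reading (file name, data source name and archive settings are illustrative):

import org.rrd4j.ConsolFun;
import org.rrd4j.DsType;
import org.rrd4j.core.RrdDb;
import org.rrd4j.core.RrdDef;
import org.rrd4j.core.Sample;
import org.rrd4j.core.Util;

public class SensorLog {
    public static void main(String[] args) throws Exception {
        // One expected sample every 60 s; keep 1440 averaged steps (24 h).
        RrdDef def = new RrdDef("sensor.rrd", 60);
        def.addDatasource("temperature", DsType.GAUGE, 120, Double.NaN, Double.NaN);
        def.addArchive(ConsolFun.AVERAGE, 0.5, 1, 1440);

        RrdDb db = new RrdDb(def);
        try {
            Sample sample = db.createSample();
            sample.setTime(Util.getTime());       // current time, in seconds
            sample.setValue("temperature", 21.5); // one sensor reading
            sample.update();
        } finally {
            db.close();
        }
    }
}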
For collecting real-time streams of sensor data, the following might be of help:
Have you checked the LeJOS APIs? See http://lejos.sourceforge.net/nxt/nxj/api/index.html
It is also worth checking Oracle Java ME Embedded and the target markets it addresses: http://www.unitask.com/oracledaily/2012/10/04/at-the-java-demogrounds-oracle-java-me-embedded-enables-the-internet-of-things/
It can be downloaded from http://www.oracle.com/technetwork/java/embedded/downloads/javame/index.html
For storing time-series data nothing beats Cassandra (http://cassandra.apache.org/); to answer why Cassandra, refer to http://www.datastax.com/why-cassandra
For accessing Cassandra from Java, refer to https://github.com/jmctee/Cassandra-Client-Tutorial, which is quite helpful. For applying the time-series concept in the Cassandra DB, refer to http://www.datastax.com/wp-content/uploads/2012/08/C2012-ColumnsandEnoughTime-JohnAkred.pdf
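For a concrete feel, a sketch of the classic time-series layout using the DataStax Java driver (3.x); keyspace, table and sensor names are illustrative:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SensorTimeSeries {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // One partition per sensor, rows clustered by timestamp.
            session.execute("CREATE KEYSPACE IF NOT EXISTS sensors WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS sensors.readings ("
                    + "sensor_id text, ts timestamp, value double, "
                    + "PRIMARY KEY (sensor_id, ts))");

            session.execute("INSERT INTO sensors.readings (sensor_id, ts, value) "
                    + "VALUES (?, toTimestamp(now()), ?)", "temp-1", 21.5);

            ResultSet rs = session.execute(
                    "SELECT ts, value FROM sensors.readings WHERE sensor_id = ?", "temp-1");
            for (Row row : rs) {
                System.out.println(row.getTimestamp("ts") + " -> " + row.getDouble("value"));
            }
        }
    }
}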
I currently have an application whose primary performance issue is its use of a file-based database consisting of JSON responses.
I'd like to rewrite my application to use an SQLite database instead.
Since I'm lazy, I'd like to use some kind of ORM.
So far I have found these big ORM libraries:
ORMLite
GreenDAO ORM
DB4O
ActiveAndroid
My primary goal is to raise the performance of working with data as much as possible.
But I've found some possible issues with those libraries:
ORMLite uses annotations, which is a big performance issue pre-Honeycomb due to this bug.
GreenDAO uses a code generator, and that would slow down my development, as I would have to write the generator and then use the generated code. I don't much like this idea.
DB4O is JPA, which I've always considered slow and heavy on memory usage, and therefore unsuitable for low-end devices (remember Android API v7).
@ChenKinnrot:
The estimated load should be sufficient to think about using an ORM.
In my case it is about 25-30 unique tables, and at least 10 table joins (2-4 tables at a time), with about 300-500 unique fields (columns).
So my questions are:
Should I use an ORM/JPA layer in an Android application?
If so, what library would you recommend me to use? (And please add some arguments too.)
I've used ORMLite and found it straightforward once you get the hang of it (a few hours), quite powerful, and it didn't cause any performance problems (app tested on Gingerbread on an HTC Desire and an HTC Hero).
I will be using it again in any projects I need to use a DB for.
An ORM layer is appealing. However, in practice I either write a simple ORM myself or use the Content Provider paradigm, which does not cooperate well with ORMs.
I have looked into some existing ORM libraries (mainly ORMLite and ActiveAndroid), but they all scared me away, as they seem not so easy to get started with.
"We're talking about 25-30 unique tables, and at least 10 table joins.
About 300-500 unique fields (columns)"
If you have fixed and limited patterns for how the data will be queried, I would recommend writing the ORM/SQL yourself.
My 2 cents.
If you are worried about your app's performance, I'd recommend greenDAO. It will save you from writing lots of boring code, so code generation should not be an issue. In return, it will also generate entities and DB unit tests for you.
I have some knowledge to share, so:
An ORM is by definition slower than writing your own SQL; it is supposed to simplify the coding of data access and provide a generic solution, and generic = runs slower than queries you write yourself, if you know SQL well.
The real question is how much performance you want. If you want the best possible, don't consider any data-mapping framework, only a SQL-generation framework that will help you write stuff faster but gives you full control of everything.
If you don't need to get the most out of the SQL DB, use an ORM. I have no experience with the ORMs you mentioned, so I can't say which to choose.
Also, your DB is not so big and complex, so the time you'll save with an ORM is not a big factor.
In my experience, I have had a lot of benefits from using ORM engines. However, there was a case when I had to deal with performance problems.
I had to load about 10,000 rows from the database, and with a standard implementation (I was using ORMLite), it took about 1 minute to complete (depending on the device CPU).
When you need to read a lot of data from the database, you can execute plain SQL and parse the results yourself (in my case, I only needed to query 3 columns from the table). ORMLite also allows you to retrieve raw results; with this, performance increased tenfold, and all 10,000 rows were loaded in 5 seconds or less!
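A sketch of what that raw-results route can look like with ORMLite's queryRaw (table, column and entity names are illustrative):

import com.j256.ormlite.dao.Dao;
import com.j256.ormlite.dao.GenericRawResults;

public class RawQueryExample {
    // "dao" is an already-created ORMLite DAO for a hypothetical User entity.
    static void loadQuickly(Dao<User, Integer> dao) throws Exception {
        GenericRawResults<String[]> rows =
                dao.queryRaw("SELECT col1, col2, col3 FROM my_table");
        try {
            for (String[] row : rows) {
                // Each row comes back as plain strings -- no per-row entity
                // instantiation or annotation reflection.
                System.out.println(row[0] + ", " + row[1] + ", " + row[2]);
            }
        } finally {
            rows.close();
        }
    }
}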
Most of the time I build simple applications that share data through an MS Access DB on a network drive.
However, MS Access is damn slow, does not support common SQL features, is a pain to automate, and lets users "explore" my application's data directly.
I would like to migrate to a different DB with a similar profile. It should:
need no installation, because installations are not permitted by the IT department
be file-based (same as above)
be safely placeable on a network drive to give multi-user support
be open source
(preferably) have a pure Java driver
Do you know anything out there that matches at least some of my criteria?
I have reviewed SQLite, Derby, and HSQLDB. They all seem to support all the requirements except shareability through a network drive.
But this is the most important feature.
I would appreciate any answers.
I'm pretty sure the requirement "needing no installation" is a show-stopper for you. The alternative is to learn to make better use of Access. With careful design, it will surprise you. In the olden days, when lawyers in the US were expected to design, build, and maintain their own litigation support databases, I used to demonstrate design and queries against a 2,000,000 row Access database. Response times were less than 250 msec. I know that sounds awfully slow nowadays, but back then--early 1990s--that was ripping fast.
Not kidding about the lawyers. The canonical reference for litigation support databases at the time was Wilmer, Cutler, and Pickering Manual on Litigation Support Databases. In my experience, most lawyers believe their expertise at the bar transfers to all other fields. Including database design.
Try HSQLDB.
From their homepage:
It offers a small, fast multithreaded and transactional database engine which offers in-memory and disk-based tables and supports embedded and server modes.
It's also used by JBoss AS as its internal database.
I have reviewed SQLite, Derby, and HSQLDB. They all seem to support all the requirements except shareability through a network drive.
You might need to put a small server on the network, though :)
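For example, with HSQLDB running in server mode on one machine, every client connects over plain JDBC (the database name and server command line are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HsqldbClient {
    public static void main(String[] args) throws Exception {
        // Connects to a server started elsewhere with, e.g.:
        //   java -cp hsqldb.jar org.hsqldb.server.Server --database.0 file:mydb --dbname.0 mydb
        String url = "jdbc:hsqldb:hsql://localhost/mydb";
        try (Connection con = DriverManager.getConnection(url, "SA", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.SYSTEM_TABLES")) {
            // Multiple clients on the network can connect concurrently.
            while (rs.next()) {
                System.out.println(rs.getString("TABLE_NAME"));
            }
        }
    }
}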
I have worked with Firebird Embedded with great results. The only problem: no multi-user access is possible.
It provides you with full modern RDBMS features.
No need for installation: just copy a dll.
The database is a single file.
Here is some info: http://www.firebirdsql.org/manual/ufb-cs-embedded.html