I am working on a project where tons of graph operations are performed in near real-time. We are currently using Hibernate, MySQL and EhCache, but are considering moving all the graph-related persistence to a graph database like Neo4j or Titan.
Can graph databases perform better than Hibernate+relational? I just want to make sure we are not going to replace six of one with half a dozen of the other.
The deeper the object graph, the more the performance advantage swings to object/graph databases.
As a rule of thumb, relational database performance drops off markedly once queries need more than about seven JOINs.
Geometric systems such as CAD/CAM, whose bills of materials form deep object graphs, are a classic case where object/graph databases outperform their relational counterparts.
Relational databases have one huge advantage: relational algebra and a clear separation between the data and the "how" of accessing and manipulating it. But they are not perfect for every problem.
The advantage you have when moving to Neo4j (or another graph DB) is that query time remains (almost) constant, and hence predictable, regardless of growth in data volume. It is always better to do a proof of concept on your own data domain, as generalized answers are generally not applicable to NoSQL DBs.
Taken from here.
Both graph and relational databases rely on caches to improve query performance. However, an edge traversal in a graph database is usually a constant time operation, and the edge is typically cached if the vertex is cached. With an RDBMS, a foreign key traversal requires a B-Tree index lookup on the target table which takes O(log n) time. When the index doesn't fit in the cache, the database would have to perform disk-seek operations which are slow.
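To make those asymptotics concrete, here is a toy Java sketch (not a real database engine): neighbour references stand in for graph edges, and a `TreeMap` stands in for a B-Tree index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model, not a real database engine: a vertex holds direct
// references to its neighbours, so following an edge is one pointer hop.
class Vertex {
    final long id;
    final List<Vertex> neighbours = new ArrayList<>();
    Vertex(long id) { this.id = id; }
}

public class TraversalCost {
    public static void main(String[] args) {
        Vertex a = new Vertex(1);
        Vertex b = new Vertex(2);
        a.neighbours.add(b);

        // Graph-style traversal: O(1) per edge, no index involved.
        Vertex viaEdge = a.neighbours.get(0);

        // Relational-style traversal: resolving a foreign key goes
        // through an index; a TreeMap (a balanced tree, O(log n) per
        // lookup) stands in for the B-Tree here.
        TreeMap<Long, Vertex> index = new TreeMap<>();
        index.put(b.id, b);
        Vertex viaIndex = index.get(2L);

        System.out.println(viaEdge == viaIndex); // true
    }
}
```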
Check out Bitsy. If your graph fits in memory, it is very fast for queries and updates. Or you can go with another Blueprints implementation, such as Neo4j or Titan, which can handle larger datasets.
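For a feel of the Blueprints (TinkerPop 2) API, here is a minimal sketch against the in-memory TinkerGraph reference implementation; Bitsy, Neo4j and Titan plug in behind the same `Graph` interface (the property names are made up):

```java
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.tg.TinkerGraph;

public class BlueprintsDemo {
    public static void main(String[] args) {
        // TinkerGraph is the in-memory reference implementation; a Bitsy,
        // Neo4j or Titan graph could be swapped in behind this interface.
        Graph graph = new TinkerGraph();

        Vertex alice = graph.addVertex(null);
        alice.setProperty("name", "Alice");
        Vertex bob = graph.addVertex(null);
        bob.setProperty("name", "Bob");
        graph.addEdge(null, alice, bob, "knows");

        // Traverse outgoing "knows" edges directly from the vertex.
        for (Vertex friend : alice.getVertices(Direction.OUT, "knows")) {
            System.out.println(friend.getProperty("name")); // Bob
        }
        graph.shutdown();
    }
}
```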
If you're using Hibernate then you're persisting domain objects, which by their nature ARE object graphs.
Relational databases are tabular structures; they do OK with shallow relationships but break down fast as the graph deepens. In addition, Hibernate has a nasty habit of pulling in far more of the database than you need through eager joins (the usual mitigation is sketched below).
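A minimal JPA-style sketch of that mitigation: marking collections as lazily fetched so Hibernate only loads what you touch (the entity names here are made up):

```java
import javax.persistence.*;
import java.util.ArrayList;
import java.util.List;

@Entity
class Author {
    @Id @GeneratedValue
    Long id;

    // LAZY (the JPA default for @OneToMany) means the books are fetched
    // only when the collection is first accessed, rather than being
    // join-fetched with every Author query.
    @OneToMany(mappedBy = "author", fetch = FetchType.LAZY)
    List<Book> books = new ArrayList<>();
}

@Entity
class Book {
    @Id @GeneratedValue
    Long id;

    @ManyToOne(fetch = FetchType.LAZY)
    Author author;
}
```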
Given that Neo4j was designed with object relations as its core function, and you're doing domain persistence, this natural design fit is sure to serve you better.
Also, Neo4j does its index lookups using Lucene (a stupid-fast search index) and can jump straight to your node to start the traversal.
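For illustration, this is roughly what that looks like with the Neo4j 2.x embedded Java API, where the legacy node indexes are backed by Lucene (the store path, index name and property are made up):

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.graphdb.index.Index;

public class LuceneLookup {
    public static void main(String[] args) {
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase("data/graph.db");
        try (Transaction tx = db.beginTx()) {
            // Legacy node indexes in Neo4j are backed by Lucene: look the
            // node up by property value, then traverse from it directly.
            Index<Node> people = db.index().forNodes("people");
            Node alice = people.get("name", "Alice").getSingle();
            System.out.println(alice != null ? alice.getId() : "not found");
            tx.success();
        }
        db.shutdown();
    }
}
```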
Bottom line: Neo4j was designed for mind-blowing scale and exactly this kind of graph-related data. You won't go wrong on scaling, but you will find the tools/libraries aren't as mature for the job as they are for a classic DB connection.
TL;DR: What is the best way to store spatial data in an SQL database so it can be indexed with R-Trees?
Long question:
I am writing a feature that incorporates spatial data. The goal is to store POIs and be able to retrieve the data quickly, perform clustering, etc.
My understanding is that R*-Trees are a good solution for this kind of task. I am planning on using: https://github.com/davidmoten/rtree.
SQLite seems to offer R-Trees, but I can only use SQL to access them. What would be the most efficient way to store this data?
Get a database that has R-trees.
For example SQLite, PostgreSQL, Oracle, ...
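A hedged sketch of the SQLite route through JDBC (using the sqlite-jdbc driver; table and column names are made up, and the R*Tree module has to be compiled in, which it is in most stock builds):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RtreeDemo {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:poi.db");
             Statement s = c.createStatement()) {

            // An R-tree virtual table indexes bounding boxes; a point is
            // stored as a degenerate box (min == max).
            s.execute("CREATE VIRTUAL TABLE IF NOT EXISTS poi_index " +
                      "USING rtree(id, minLon, maxLon, minLat, maxLat)");
            s.execute("INSERT INTO poi_index VALUES (1, -0.13, -0.13, 51.5, 51.5)");

            // Window query: all POIs inside a bounding box.
            try (ResultSet rs = s.executeQuery(
                    "SELECT id FROM poi_index " +
                    "WHERE minLon >= -1 AND maxLon <= 1 " +
                    "AND minLat >= 51 AND maxLat <= 52")) {
                while (rs.next()) System.out.println(rs.getLong("id"));
            }
        }
    }
}
```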
But beware that the query performance of these databases will usually be pretty bad compared to an in-memory index such as ELKI's, in particular for nearest-neighbor queries with haversine distance, which is what I mostly need.
Often, their R-tree index is an ugly hack. They usually seem to create a table to store the pages of the tree, so querying means repeatedly selecting rows from that table.
I'm working on Java EE projects using Hibernate as the ORM, and I have come to a phase where I have to perform some mathematical calculations on my classes, like SUM, COUNT, addition and division.
I have 2 solutions:
Select my classes and apply those operations programmatically in my code
Do the calculations in my named queries
In terms of performance and speed, which one is better?
Thank you!
If you are going to load the same entities that you want to do the aggregation on from the database in the same transaction, then the performance will be better if you do the calculation in Java.
It saves you one round-trip to the database, because in that case you already have the entities in memory.
Other benefits are:
Easier to unit-test the calculation because you can stick to a Java-based unit testing framework
Keeps the logic in one language
Will also work for collections of entities that haven't been persisted yet
But if you're not going to load the same set of entities that you want to do the calculation on, then you will get a performance improvement in almost any situation if you let the database do the calculation. The more entities are involved, the bigger the performance benefit.
Imagine doing a summation over all line items in this year's orders, perhaps several million of them.
It should be clear that first loading all these entities into the memory of the Java process across a TCP connection (even if it is within the same machine) will take more time, and more memory, than letting the database perform the calculation.
And if your mapping requires additional queries per entity, then Hibernate would have at least one extra round-trip to the database for every entity, in which case the performance benefits of calculating things in SQL on the database would be even bigger.
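A minimal sketch of the two approaches side by side, assuming a hypothetical `OrderItem` entity with `year` and `amount` fields:

```java
import javax.persistence.*;
import java.math.BigDecimal;
import java.util.List;

@Entity
class OrderItem {
    @Id @GeneratedValue
    Long id;
    int year;
    BigDecimal amount;

    BigDecimal getAmount() { return amount; }
}

class OrderTotals {

    // Database-side aggregation: one round-trip, only the result crosses
    // the wire, and no entities are materialised in the Java process.
    static BigDecimal totalInDatabase(EntityManager em, int year) {
        return em.createQuery(
                "SELECT SUM(i.amount) FROM OrderItem i WHERE i.year = :year",
                BigDecimal.class)
            .setParameter("year", year)
            .getSingleResult();
    }

    // Java-side aggregation: preferable when the items are already loaded
    // in this transaction anyway, or haven't been persisted yet.
    static BigDecimal totalInMemory(List<OrderItem> items) {
        BigDecimal total = BigDecimal.ZERO;
        for (OrderItem i : items) {
            total = total.add(i.getAmount());
        }
        return total;
    }
}
```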
Are these calculations on the entities (or data)? If yes, then you can indeed go with queries (or, even faster, use SQL queries instead of HQL). From a performance perspective, IMO, stored procedures shine, but people don't use them that often with Hibernate.
Also, if you have some frequent, repetitive calculation, try using caching in your application, as sketched below.
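A minimal sketch with Hibernate's query cache (the `Order` entity is hypothetical, and the query cache must be enabled in configuration and backed by a second-level cache provider such as EhCache):

```java
import java.util.List;
import org.hibernate.Session;

public class CachedTotals {
    // Requires hibernate.cache.use_query_cache=true plus a configured
    // second-level cache provider (e.g. EhCache). 'Order' is made up.
    @SuppressWarnings("unchecked")
    static List<Object[]> totalsPerYear(Session session) {
        return session.createQuery(
                "select o.year, sum(o.amount) from Order o group by o.year")
            .setCacheable(true)  // repeated runs are served from the cache
            .list();
    }
}
```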
I'm thinking about a possible solution (tool) for my issue.
There is a collection of locations with a huge number of elements (more than 600,000). Locations have names (in different languages) and are represented in a tree structure: region->country->admin division->city->zip. Users can add custom locations, but I expect that to happen rarely. The application should be able to search efficiently by location name and type, build hierarchical names (e.g. "London->England->United Kingdom"), and build subtrees of locations (e.g. all countries of Europe and the cities in those countries).
I've considered three solutions.
Plain database: locations are held in tables and the main tree-building logic is implemented in Java code. With this solution I am worried about performance, because searching, building trees and creating custom locations can involve additional table joins.
SOLR: at first glance this task is exactly what Solr is for: the data set changes rarely, and we need search by names. But I'm not sure whether Solr's pivot feature will satisfy the tree-building needs. I'm also not sure Solr's searching will be much better than a plain DB's, because the search is not that difficult (just matching names, which are short strings).
Graph DB (Neo4j): it seems useful for building trees and subtrees. But I'm not sure about search performance (it seems I would have to use the Community edition, which lacks some useful performance features such as caching).
A plain database is a big NO, as an RDBMS is not optimized for relationship-based queries. For example: show me the people who eat in the same restaurant I do and who also live in the same region I do. Or, to make it more complex, a DB query can be a killer where levels of relationship have to be calculated, e.g. I am your second-level friend if one or more of your friends is/are my friend(s).
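To make the contrast concrete, that second-level-friend query reads naturally in a graph database. A hedged Cypher sketch via Neo4j's embedded Java API, 2.2-era (the Person label and FRIEND relationship type are made-up names):

```java
import java.util.HashMap;
import java.util.Map;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;

public class FriendOfFriend {
    // Second-level friends: two FRIEND hops away from me, excluding
    // people I am already directly connected to.
    static Result friendsOfFriends(GraphDatabaseService db, String name) {
        Map<String, Object> params = new HashMap<>();
        params.put("name", name);
        return db.execute(
            "MATCH (me:Person {name: {name}})-[:FRIEND]-()-[:FRIEND]-(fof) " +
            "WHERE fof <> me AND NOT (me)-[:FRIEND]-(fof) " +
            "RETURN DISTINCT fof.name AS friendOfFriend", params);
    }
}
```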
SOLR: Solr is a good option, but you have to consider its performance impact. With so many rows to index it can be a memory killer. Go through these first before implementing Solr:
http://wiki.apache.org/solr/SolrPerformanceProblems
http://wiki.apache.org/solr/SolrPerformanceFactors
Solr is also not a good fit for more complex, logic-heavy searches, and you have to learn it thoroughly before going for it.
Neo4j (or any other graph DB) is the perfect solution here. I have implemented all three of these technologies myself, and in my experience I found Neo4j best for such a requirement.
However, you must look into how to back up the database and how to recover it in case of a crash.
All the best.
I've currently designed a schema in Cassandra, but I'm wondering if there's a better way to do things. Basically, the issue is that most, if not all, of the reads are dynamic. I've built a segmentation system as an application service that parses a dynamic custom query (completely unrelated to Cassandra; the query language is strict and limited to the application), queries Cassandra, and merges the results.
I've made most of the column families as wide as seemed sensible, and because the data is extremely write-intensive, I used composite keys to partition the load.
This basically implements an application-specific query layer on top of Cassandra, including some sort of join or merge operation.
Are there any limitations to this layout or process?
If you are trying to do some kind of OLAP using Cassandra as the back-end, I think you will have problems. The advice I've seen on designing Cassandra tables is to start with the queries you expect to run, then design denormalised tables that make your queries fast. So you need to know what the queries are; it sounds like that is not the case for your application. Perhaps a RDBMS would be better?
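For contrast, query-first design looks roughly like this with the DataStax Java driver (keyspace, table and column names are made up): pick a known query, then build a denormalised table whose partition key answers it directly.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.UUID;

public class QueryFirstDesign {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("shop")) {

            // One denormalised table per known query. The known query here
            // is "latest orders for a user", so the table is partitioned
            // by user and clustered by time, newest first.
            session.execute(
                "CREATE TABLE IF NOT EXISTS orders_by_user (" +
                "  user_id uuid, order_time timeuuid, total decimal," +
                "  PRIMARY KEY (user_id, order_time)" +
                ") WITH CLUSTERING ORDER BY (order_time DESC)");

            UUID userId = UUID.randomUUID(); // stand-in for a real user id
            for (Row row : session.execute(
                    "SELECT order_time, total FROM orders_by_user " +
                    "WHERE user_id = ? LIMIT 10", userId)) {
                System.out.println(row.getDecimal("total"));
            }
        }
    }
}
```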
One option is PlayOrm for Cassandra (really an object-NoSQL mapping, not relational, as it follows many NoSQL patterns). It has its own S-SQL language that does joins of partitions. It is not going to join your billion-row table with another billion rows, but if your partitions are, say, under a million rows, it can help you out there.
NoSQL occasionally calls for client-side joins, depending on context, and PlayOrm saves you most of that work on the rare occasions when you do need a join in NoSQL; many times denormalization is better.
The patterns in PlayOrm are also different from Hibernate's: in a one-to-many, for example, the FKs for the many side are embedded in the row, as this is how you do it in NoSQL.
later,
Dean
I was wondering about the following two options when one is not using SQL tables but ORM-based DBs (for example, when you are using GAE).
Would the second option be less efficient?
Requirement:
There is an object. The object has a collection of similar items. I need to store this object. For example, say the object is a tree and it has a collection of leaves.
Option 1:
Traditional SQL type structure:
A table for the Tree (with TreeId as the identifier for a row in the table).
A table for the Leaves (where each leaf has a TreeId; to show the leaves of a tree, I query all leaves whose TreeId is the Id of the tree).
Here, the Tree structure DOES NOT have a field with leaves.
Option 2:
ORM / GAE Tables:
Using the same example above,
I have an object for Tree where the object has a collection (Set/List in Java/C++) of leaves.
I store and retrieve the Tree together with the leaves (as the leaves are implemented as a Set in the Tree object), roughly as sketched below.
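For concreteness, here is roughly what I mean in JPA-style terms (the names are just for illustration):

```java
import javax.persistence.*;
import java.util.ArrayList;
import java.util.List;

// Option 1: leaves live in their own table and point back at the tree;
// the Tree object carries no leaf collection, and leaves are fetched
// with a query on treeId.
@Entity
class Leaf {
    @Id @GeneratedValue
    Long id;
    Long treeId; // plain foreign key
}

// Option 2: the collection is part of the Tree object and is stored and
// loaded together with it (on GAE this would typically sit in the same
// entity group).
@Entity
class Tree {
    @Id @GeneratedValue
    Long id;

    @ElementCollection
    List<String> leaves = new ArrayList<>();
}
```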
My question is, will the second one be less efficient than the first option?
If so, why? Are there other alternatives?
Thank you!
It would be better to use Hibernate (for Java) or another ORM framework than an ORM DB.
1. ORM DBs are mostly amateur efforts.
2. No one appreciates them. You will be a much better specialist if you know PostgreSQL with an ORM framework than just some ORM DB.
3. There are many standards in the world of RDBMSs; there are no standards in ORM DBs.
4. RDBMS support and community make this choice safer in the long term.
5. Efficiency is a tricky question. I'm almost 80% sure that if you want to find a row with "name = 'Alex'", it will be faster in an RDBMS than in an ORM DB, because the ORM DB will need to unpack objects for this operation (see the sketch below).
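A minimal sketch of point 5, assuming PostgreSQL via JDBC (connection details and table name are made up): the lookup is answered straight from a B-Tree index, with no object unpacking.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IndexedLookup {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection; IF NOT EXISTS needs PostgreSQL 9.5+.
        try (Connection c = DriverManager.getConnection(
                "jdbc:postgresql://localhost/app", "app", "secret")) {
            c.createStatement().execute(
                "CREATE INDEX IF NOT EXISTS idx_users_name ON users(name)");

            // The planner answers this from the B-Tree index on name.
            try (PreparedStatement ps = c.prepareStatement(
                    "SELECT id FROM users WHERE name = ?")) {
                ps.setString(1, "Alex");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) System.out.println(rs.getLong("id"));
                }
            }
        }
    }
}
```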
PS: I understand my post is almost off-topic, but I think it contains some good stuff to think about.