I've currently designed a schema in Cassandra, but I'm wondering if there's a better way to do things. Basically, the issue is that most, if not all, of the reads are dynamic. I've built a segmentation system as an application service that parses a dynamic custom query (the query language is completely unrelated to Cassandra, and is strict and limited to the application), queries Cassandra, and merges the results.
I've made most of the column families as wide as seemed sensible, and because the data is extremely write-intensive, I used composite keys to spread the write load across partitions.
This basically amounts to an application-specific query layer on top of Cassandra, including some sort of join or merge operation.
Are there any limitations to this layout or process?
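For concreteness, the write-heavy tables look roughly like the sketch below (a minimal illustration only; the table, column and keyspace names are made up, and it assumes the DataStax Java driver):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class EventSchemaSketch {
        public static void main(String[] args) {
            // Hypothetical contact point and keyspace.
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("segmentation");

            // Wide rows, with a bucket in the composite partition key so that a single
            // entity's heavy write stream is spread across several partitions.
            session.execute(
                "CREATE TABLE IF NOT EXISTS events ("
              + "  entity_id text, bucket int, event_time timeuuid, payload text,"
              + "  PRIMARY KEY ((entity_id, bucket), event_time))");

            cluster.close();
        }
    }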
If you are trying to do some kind of OLAP using Cassandra as the back-end, I think you will have problems. The advice I've seen on designing Cassandra tables is to start with the queries you expect to run, then design denormalised tables that make your queries fast. So you need to know what the queries are; it sounds like that is not the case for your application. Perhaps an RDBMS would be better?
One option is PlayOrm for Cassandra (really an object-NoSQL mapping rather than a relational one, as it follows many NoSQL patterns). It has its own S-SQL language that can join partitions. It is not going to join your billion-row table with another billion-row table, but if your partitions are, say, under a million rows, it can help you out there.
NoSQL occasionally calls for client-side joins, depending on context, and PlayOrm saves you most of that work when you do need a join in NoSQL; that should be pretty rare though, and many times denormalization is better.
The patterns in PlayOrm are also different from Hibernate's. For a one-to-many, for example, the foreign keys for the 'many' side are embedded in the row, because that is how you do it in NoSQL.
later,
Dean
I need to fetch records from a read-only table based on user input. The number of requests from users to retrieve these records is approximately 50,000 - 60,000 per hour, and the table has millions of records.
I considered using plain JDBC, but people suggested using Hibernate because it optimizes queries for better performance. Am I going to get better performance by using Hibernate over JDBC?
I thought using an ORM solution was unnecessary here since it is just a single read-only table. But if it is going to help with performance, which is what I want, I would certainly go for it. Any suggestions are greatly appreciated.
JDBC will almost always give better performance than Hibernate, for most database vendors. You can check the comparison in the link below; the author concludes that Hibernate is fast when querying tables with few rows, while JDBC is considerably faster otherwise.
http://phpdao.com/hibernate_vs_jdbc/
The choice of Hibernate over JDBC and raw SQL queries is made not for performance but mainly for object persistence and database independence, i.e. not having to write database-specific queries. You can read the PDF guide below to get a better view.
http://www.mindfiresolutions.com/mindfire/Java_Hibernate_JDBC.pdf
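If you do stay with plain JDBC for this single read-only table, the data-access code stays very small. Below is a minimal sketch; the connection settings, table name and column names are made up, and in practice you would obtain connections from a pool rather than from DriverManager on every request:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class RecordLookup {

        // Hypothetical connection settings.
        private static final String URL = "jdbc:postgresql://localhost/appdb";

        public static String findValue(String userInput) throws Exception {
            String sql = "SELECT value_column FROM lookup_table WHERE key_column = ?";
            try (Connection con = DriverManager.getConnection(URL, "app_user", "secret");
                 PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, userInput);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString("value_column") : null;
                }
            }
        }
    }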
In my current project we have multiple search pages in the system where we fetch a lot of data from the database to be shown in a large table element in the UI. We're using JPA for data access (our provider is Hibernate). The data for most of the pages is gathered from multiple database tables - around 10 in many cases - including some aggregate data from OneToMany relationships (e.g. "number of associated entities of type X"). In order to improve performance, we're using result set pagination with TypedQuery.setFirstResult() and TypedQuery.setMaxResults() to lazy-load additional rows from the database as the user scrolls the table. As the searches are very dynamic, we're using the JPA CriteriaQuery API to build the queries.
However, we're currently somewhat suffering from the N+1 SELECT problem. It's pretty bad in some cases actually, as we might be iterating through 3 levels of nested OneToMany relationships, where on each level the data is lazy-loaded. We can't really declare those collections as eager loaded in the entity mappings, as we're only interested in them in some of our pages. I.e. we might fetch data from the same table in several different pages, but we're showing different data from the table and from different associated tables in different pages.
In order to alleviate this, we started experimenting with JPA entity graphs, and they seem to help a lot with the N+1 SELECT problem. However, when you use entity graphs, Hibernate apparently applies the pagination in-memory. I can somewhat understand why it does that, but this behavior negates a lot (if not all) of the benefits of the entity graphs in many cases. When we didn't use entity graphs, we could load data without applying any WHERE restrictions (i.e. considering the whole table as the result set), no matter how many millions of rows the table had, as only a very limited amount of rows were actually fetched due to the pagination. Now that the pagination is done in-memory, Hibernate basically fetches the whole database table (plus all relationships defined in the entity graph), and then applies the pagination in-memory, throwing the rest of the rows away. Not good.
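For reference, the combination that triggers the in-memory pagination looks roughly like the sketch below (the Customer entity and its lazy orders collection are hypothetical names used only for illustration):

    import java.util.List;
    import javax.persistence.EntityGraph;
    import javax.persistence.EntityManager;
    import javax.persistence.TypedQuery;
    import javax.persistence.criteria.CriteriaBuilder;
    import javax.persistence.criteria.CriteriaQuery;
    import javax.persistence.criteria.Root;

    public class SearchPage {

        // Customer is a hypothetical entity with a lazy @OneToMany "orders" collection.
        public List<Customer> loadPage(EntityManager em, int first, int pageSize) {
            CriteriaBuilder cb = em.getCriteriaBuilder();
            CriteriaQuery<Customer> cq = cb.createQuery(Customer.class);
            Root<Customer> root = cq.from(Customer.class);
            cq.select(root);

            EntityGraph<Customer> graph = em.createEntityGraph(Customer.class);
            graph.addSubgraph("orders"); // fetch the association in the same query

            TypedQuery<Customer> query = em.createQuery(cq)
                    .setHint("javax.persistence.fetchgraph", graph)
                    .setFirstResult(first)
                    .setMaxResults(pageSize);

            // With a collection in the fetch graph, Hibernate warns (HHH000104) that
            // firstResult/maxResults are applied in memory rather than in the SQL.
            return query.getResultList();
        }
    }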
So the question is, is there an efficient way to apply both pagination and entity graphs with JPA (Hibernate)? If JPA does not offer a solution to this, Hibernate-specific extensions are also acceptable. If that's not possible either, what are the other alternatives? Using database Views? Views would be a bit cumbersome, as we support several database vendors. Creating all of the necessary views for different vendors would increase development effort quite a bit.
Another idea I've had would be to apply both the entity graphs and pagination as we currently do, and simply not trigger any queries if they would return too many rows. I already need to do COUNT queries to get the lazy-loading of rows to work properly in the UI.
I'm not sure I fully understand your problem, but we faced something similar: we have paged lists of entities that may contain data from multiple joined entities. Those lists might be sorted and filtered (some of those sorts/filters have to be applied in memory due to missing capabilities in the DBMS, but that's just a side note), and the paging should be applied afterwards.
Keeping all that data in memory doesn't work well, so we took the following approach (there might be better/more standard ones):
Step 1: Use a query to load only the primary keys (simple longs in our case) of the main entities. Join only what is needed for sorting and filtering, to keep the query as simple as possible.
(In our case that query actually loads a bit more data so we can apply sorts and filters in memory where necessary, but that data is released as soon as possible and only the primary keys are kept.)
Step 2: When displaying a specific page, we extract the corresponding primary keys and use a second query to load everything that is to be displayed on that page. This second query might contain more joins, and thus be more complex and slower, than the one in step 1, but since we only load the data for that page the actual burden on the system is quite low.
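In JPA terms, the two steps look roughly like the sketch below (the Customer entity, its orders collection and the sort property are hypothetical names):

    import java.util.List;
    import javax.persistence.EntityManager;
    import javax.persistence.criteria.CriteriaBuilder;
    import javax.persistence.criteria.CriteriaQuery;
    import javax.persistence.criteria.Root;

    public class TwoStepPaging {

        // Step 1: page over primary keys only; join just what sorting/filtering needs.
        public List<Long> loadPageOfIds(EntityManager em, int first, int pageSize) {
            CriteriaBuilder cb = em.getCriteriaBuilder();
            CriteriaQuery<Long> cq = cb.createQuery(Long.class);
            Root<Customer> root = cq.from(Customer.class);
            cq.select(root.<Long>get("id")).orderBy(cb.asc(root.get("name")));
            return em.createQuery(cq)
                     .setFirstResult(first)
                     .setMaxResults(pageSize)
                     .getResultList();
        }

        // Step 2: load the full rows for just those keys, with whatever fetch joins
        // (or entity graph) the page needs; no pagination is applied here, so nothing
        // is paginated in memory. Re-apply the ordering afterwards, because the IN
        // clause does not preserve the order of the id list.
        public List<Customer> loadPageData(EntityManager em, List<Long> ids) {
            return em.createQuery(
                    "select distinct c from Customer c left join fetch c.orders where c.id in (:ids)",
                    Customer.class)
                .setParameter("ids", ids)
                .getResultList();
        }
    }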
I'm working on Java EE projects using Hibernate as the ORM, and I have reached a phase where I have to perform some mathematical calculations on my classes, like SUM, COUNT, addition and division.
I have 2 solutions:
Select my entities and apply those operations programmatically in my code
Do the calculations in my named queries
In terms of performance and speed, which one is better, please?
Thank you.
If you are going to load the same entities that you want to do the aggregation on from the database in the same transaction, then the performance will be better if you do the calculation in Java.
It saves you one round-trip to the database, because in that case you already have the entities in memory.
Other benefits are:
Easier to unit-test the calculation because you can stick to a Java-based unit testing framework
Keeps the logic in one language
Will also work for collections of entities that haven't been persisted yet
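For example, if the order and its line items are already in memory, the calculation is plain Java; a minimal self-contained sketch with made-up entity shapes:

    import java.math.BigDecimal;
    import java.util.List;

    public class OrderTotals {

        // Hypothetical entity shapes, reduced to what the example needs.
        static class LineItem {
            BigDecimal amount;
        }

        static class Order {
            List<LineItem> lineItems;
        }

        // Sum the line items of an order that is already loaded:
        // no extra round-trip to the database is needed.
        static BigDecimal total(Order order) {
            BigDecimal sum = BigDecimal.ZERO;
            for (LineItem item : order.lineItems) {
                sum = sum.add(item.amount);
            }
            return sum;
        }
    }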
But if you're not going to load the same set of entities that you want to do the calculation on, then you will get a performance improvement in almost any situation if you let the database do the calculation. The more entities are involved, the bigger the performance benefit.
Imagine doing a summation over all line items in this year's orders, perhaps several million of them.
It should be clear that first loading all of these entities into the memory of the Java process across a TCP connection (even if it is on the same machine) will take more time and more memory than letting the database perform the calculation.
And if your mapping requires additional queries per entity, then Hibernate would have at least one extra round-trip to the database for every entity, in which case the performance benefits of calculating things in SQL on the database would be even bigger.
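In that case, pushing the aggregation into a JPQL query keeps the work in the database and only the result crosses the wire. A rough sketch, again with hypothetical Order/LineItem entities and an amount property:

    import java.math.BigDecimal;
    import java.util.Date;
    import javax.persistence.EntityManager;
    import javax.persistence.TemporalType;

    public class OrderReports {

        // Sum the amounts of all line items of orders placed in a date range;
        // only a single BigDecimal comes back from the database.
        public static BigDecimal totalBetween(EntityManager em, Date from, Date to) {
            return em.createQuery(
                    "select sum(li.amount) from Order o join o.lineItems li "
                  + "where o.orderDate >= :from and o.orderDate < :to",
                    BigDecimal.class)
                .setParameter("from", from, TemporalType.TIMESTAMP)
                .setParameter("to", to, TemporalType.TIMESTAMP)
                .getSingleResult(); // may be null if no orders match
        }
    }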
Are these calculations on the entities (or data)? If yes, then you can indeed go for queries (or, even faster, use SQL queries instead of HQL). From a performance perspective, IMO, stored procedures shine, but people don't use them very often with Hibernate.
Also, if you have some frequently repeated calculation, try using caching in your application.
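For instance, Hibernate lets you mark individual queries as cacheable, so a repeated aggregate is served from the cache instead of hitting the database every time. A sketch with made-up entity and property names; it assumes the second-level and query caches are enabled in your Hibernate configuration (hibernate.cache.use_second_level_cache, hibernate.cache.use_query_cache, plus a cache provider suited to your Hibernate version):

    import java.math.BigDecimal;
    import javax.persistence.EntityManager;

    public class CachedTotals {

        public static BigDecimal cachedTotal(EntityManager em, long orderId) {
            return em.createQuery(
                    "select sum(li.amount) from LineItem li where li.orderId = :orderId",
                    BigDecimal.class)
                .setParameter("orderId", orderId)
                .setHint("org.hibernate.cacheable", true) // cache this query's result
                .getSingleResult();
        }
    }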
I am working on a project where tons of graph operations are performed in near real-time. We are currently using Hibernate, MySQL and EhCache, but we are considering moving all the graph-related persistence to a graph database like Neo4j or Titan.
Can graph databases perform better than Hibernate+relational? I just want to make sure we are not going to replace six of one with half a dozen of the other.
The deeper the object graph, the more the performance advantage swings to object/graph databases.
Relational database performance drops off markedly with more than seven JOINs.
Geometric systems such as CAD/CAM, with deep object graphs for bills of materials, perform much better on object/graph databases than on their relational counterparts.
Relational databases have one huge advantage: relational algebra and a clear separation between the data and the "how" of accessing and manipulating it. But they are not perfect for every problem.
The advantage you have when moving to Neo4j (or some other graph DB) is that query time remains (almost) constant, and hence predictable, irrespective of the growth in data volume. It is always better to do a proof of concept based on your data domain, as generalized answers are generally not applicable to NoSQL DBs.
Taken from here.
Both graph and relational databases rely on caches to improve query performance. However, an edge traversal in a graph database is usually a constant time operation, and the edge is typically cached if the vertex is cached. With an RDBMS, a foreign key traversal requires a B-Tree index lookup on the target table which takes O(log n) time. When the index doesn't fit in the cache, the database would have to perform disk-seek operations which are slow.
Check out Bitsy. If your graph fits in memory, it is very fast for queries and updates. Or you can go with another Blueprints implementation, like Neo4j or Titan, which can handle larger datasets.
If you're using Hibernate then you're persisting domain objects, which by their nature ARE object graphs.
Relational databases are tabular structures and do OK with these relationships up to a point, but they break down fast. In addition, Hibernate has a nasty habit of pulling in the entire database with joins.
Given that Neo4j was designed with object relations as its core function, and you're doing domain persistence, this natural design fit is sure to be better.
Also, Neo4j backs its index lookups with Lucene (a stupidly fast search index) and can jump straight to your node to start a traversal.
Bottom line: Neo4j was designed for mind-blowing scale and exactly this kind of graph-related data. You're not going wrong on scaling, but you will find the tools/libraries aren't as mature for the job as they are for a classic DB connection.
Facts
Database: PostgreSQL (latest)
Programming language: Java
Problem statement (simplified)
We have 2 tables - overview and details. There could be millions of rows in "overview", and each row of "overview" can have millions of rows associated with it in "details". The foreign key details.overview_id refers to overview.id. Most queries are of the general form:
    SELECT * FROM details WHERE overview_id = xxx AND details.id > yyy AND details.id < zzz;
If we keep a single table for details, the queries will be too slow (although the queries on details are almost always on primary keys). More on the nature of the DB activity: INSERT and UPDATE on overview happen infrequently. INSERTs on details happen at a rapid pace, UPDATEs on the same table almost never happen, and bulk DELETEs happen sometimes.
What we already have
In the past we used raw SQL to partition the table "details" against each row in "overview". (In practice we did not actually use PostgreSQL partitioning; instead we created new tables based on a template. These tables did not have any column called overview_id, saving storage space; instead a separate table mapped each overview.id to the table name of its specific partition table.) So, as you can understand, partitions had to be generated on the fly as new rows were inserted into overview, and dropped as rows were deleted from overview. All of this was managed inside the application. The application-database interaction has been blazing fast, but the application code is fairly complex, which makes it hard to maintain. Also, with raw SQL lying around everywhere, it is hard to scale the DB horizontally - we would have to reinvent what most JPA providers have already done.
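In code, the pattern looked roughly like the sketch below (the mapping table, column names and SQL are made up for illustration; the real code also handled creating and dropping the per-overview tables):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    public class DetailsDao {

        // Look up which per-overview table holds the detail rows, then query it.
        public List<Long> findDetailIds(Connection con, long overviewId,
                                        long fromId, long toId) throws Exception {
            String tableName;
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT table_name FROM overview_partition_map WHERE overview_id = ?")) {
                ps.setLong(1, overviewId);
                try (ResultSet rs = ps.executeQuery()) {
                    if (!rs.next()) {
                        throw new IllegalStateException("No partition for overview " + overviewId);
                    }
                    tableName = rs.getString(1);
                }
            }

            // The table name comes from our own mapping table, never from user input,
            // so concatenating it into the SQL is acceptable in this narrow context.
            List<Long> ids = new ArrayList<>();
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT id FROM " + tableName + " WHERE id > ? AND id < ?")) {
                ps.setLong(1, fromId);
                ps.setLong(2, toId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        ids.add(rs.getLong(1));
                    }
                }
            }
            return ids;
        }
    }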
Current goal
Currently we are exploring options for a mechanism by which this partitioning can happen behind the scenes - possibly handled by a JPA provider (I understand that this is not part of the JPA spec) - so that we can focus on the application while the underlying framework/layer takes care of the scalability issues.
I looked at OpenJPA Slice and EclipseLink. Both of them provide partition (shard) management across hosts, which we certainly need, but we also need partition management within a single host. However, if there is a better or more elegant solution to this, or a totally different angle from which to look at it, I would be really glad to know about that.
I would appreciate any insight you can provide.
Thanks.
Prajesh
Have you looked into using Postgres's table partitioning?
http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
Thank you all for your comments/answers so far. We decided to stick with what we already have (see the section named "What we already have"), with minor modifications.