TLDR: What is the best way to store spatial data in an SQL database to be used in R-Trees?
Long question:
I am writing a feature that incorporates spatial data. The goal is to store POIs (points of interest) and to be able to retrieve the data quickly, perform clustering, etc.
My understanding is that R*-Trees are a good solution for this kind of task. I am planning on using: https://github.com/davidmoten/rtree.
SQLite seems to offer R-Trees, but then I can only interact with them through SQL. What would be the most efficient way to store this data?
Get a database that has R-trees.
For example SQLite, PostgreSQL, Oracle, ...
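For instance, SQLite's R*Tree module is exposed as a virtual table that you query with plain SQL. A minimal sketch via JDBC (assuming the xerial sqlite-jdbc driver and an SQLite build with the rtree module compiled in; all names here are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqliteRtreeSketch {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection("jdbc:sqlite:pois.db");
        Statement st = con.createStatement();
        // An R*Tree virtual table: an id column plus min/max pairs per dimension.
        st.execute("CREATE VIRTUAL TABLE poi_index USING rtree("
                + "id, minLon, maxLon, minLat, maxLat)");
        // A point is stored as a degenerate box (min == max).
        st.execute("INSERT INTO poi_index VALUES (1, 13.40, 13.40, 52.52, 52.52)");
        // Bounding-box query, answered from the R-tree rather than a full scan.
        ResultSet rs = st.executeQuery("SELECT id FROM poi_index"
                + " WHERE minLon >= 13.0 AND maxLon <= 14.0"
                + " AND minLat >= 52.0 AND maxLat <= 53.0");
        while (rs.next()) {
            System.out.println("POI id: " + rs.getLong(1));
        }
        con.close();
    }
}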
But beware that the query performance of these databases will usually be pretty bad compared to an in-memory index such as ELKI's, in particular if you want nearest-neighbor search with haversine distance (which is what I mostly need).
Often, their R-tree index is an ugly hack: they tend to store the pages of the tree in a regular table, so querying means repeatedly selecting rows from that table.
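If the data fits in memory, the library linked in the question avoids that indirection entirely. A sketch based on its README (exact signatures may vary between versions; the coordinates are made up):

import com.github.davidmoten.rtree.Entry;
import com.github.davidmoten.rtree.RTree;
import com.github.davidmoten.rtree.geometry.Geometries;
import com.github.davidmoten.rtree.geometry.Point;

public class InMemoryRtreeSketch {
    public static void main(String[] args) {
        // RTree.star() selects the R*-tree split heuristic; the tree is
        // immutable, so add() returns a new instance.
        RTree<String, Point> tree = RTree.star().create();
        tree = tree.add("brandenburg-gate", Geometries.point(13.3777, 52.5163))
                   .add("museum-island", Geometries.point(13.3967, 52.5169));
        // Nearest-neighbor search: up to 2 entries within 0.1 degrees.
        Iterable<Entry<String, Point>> hits =
                tree.nearest(Geometries.point(13.39, 52.52), 0.1, 2)
                    .toBlocking().toIterable();
        for (Entry<String, Point> e : hits) {
            System.out.println(e.value());
        }
    }
}

Note that the distance here is plain Euclidean on lon/lat; for true haversine nearest-neighbor you would over-fetch candidates and re-rank them with the haversine formula.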
Related
I am working on a project where tons of graph operations are performed in near real-time. We are currently using Hibernate, MySQL and EhCache, but are considering moving all the graph-related persistence to a graph database like Neo4j or Titan.
Can graph databases perform better than Hibernate+relational? I just want to make sure we are not going to replace six of one with half a dozen of the other.
The deeper the object graph, the more the performance advantage swings to object/graph databases.
Relational database performance drops off markedly with more than seven JOINs.
Geometric systems such as CAD/CAM, with deep object graphs for bills of materials, outperform their relational counterparts.
Relational databases have one huge advantage: relational algebra and a clear separation between the data and the "how" of accessing and manipulating it. But they are not perfect for every problem.
The advantage you have when moving to Neo4j (or some other graph DB) is that query time remains (almost) constant, and hence predictable, irrespective of the increase in data volume. It is always better to do a proof of concept based on your data domain, as generalized answers rarely apply to NoSQL databases.
Taken from here.
Both graph and relational databases rely on caches to improve query performance. However, an edge traversal in a graph database is usually a constant time operation, and the edge is typically cached if the vertex is cached. With an RDBMS, a foreign key traversal requires a B-Tree index lookup on the target table which takes O(log n) time. When the index doesn't fit in the cache, the database would have to perform disk-seek operations which are slow.
Check out Bitsy. If your graph fits in memory, it is very fast for queries and updates. Or you can go with another Blueprints implementation, like Neo4J and Titan, which can handle larger datasets.
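For a flavor of what that looks like, here is a small sketch against the Blueprints (TinkerPop 2) interfaces those implementations share, using the in-memory TinkerGraph reference implementation (treat the details as illustrative):

import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.tg.TinkerGraph;

public class BlueprintsSketch {
    public static void main(String[] args) {
        Graph graph = new TinkerGraph();
        Vertex alice = graph.addVertex(null);   // let the graph assign ids
        Vertex bob = graph.addVertex(null);
        alice.setProperty("name", "alice");
        bob.setProperty("name", "bob");
        graph.addEdge(null, alice, bob, "knows");
        // The traversal below is the constant-time edge hop described above:
        // no join, no B-tree index lookup.
        for (Vertex friend : alice.getVertices(Direction.OUT, "knows")) {
            System.out.println(friend.getProperty("name"));
        }
    }
}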
If you're using Hibernate then you're persisting domain objects which by their nature ARE object graphs.
Relational databases are tabular structures that handle shallow relationships OK but break down fast as the graph deepens. In addition, Hibernate has a nasty habit of pulling in the entire database with joins.
Given that Neo4j was designed with object relations as its core function, and you're doing domain persistence, this natural design fit is sure to be better.
Also, Neo4j does its queries using Lucene (a stupid fast search index) and can jump straight to your node for traversal.
Bottom line: Neo4j was designed for mind-blowing scale and exactly this kind of graph-related data. You won't go wrong on scaling, but you will find the tools/libraries aren't as mature for the job as they are for a classic DB connection.
I have a requirement to run 'n' SELECT queries at fixed time intervals and store the resulting data. These results need to be pulled later on a client's demand.
My question is:
1) Is it okay to store the results as CSV files, or could you suggest another format?
2) Or should they be stored as a CLOB column in a DB?
Please suggest any compression techniques to store these query results; also, is it possible to store only revisions of previous resultsets instead of storing the whole resultset?
Note:
The minimum time interval is hourly.
The number of queries (n) varies (currently 10 to 200 queries).
The resultset size of each query also varies (say 10 to 1,000,000 rows, but mostly around 10k).
The resultset data fetched between successive intervals doesn't differ much (row values are not updated frequently).
I am new to computer science and programming, and not very familiar with storage or database design.
It sounds like you should be building a data warehouse.
Performance-wise, I suppose it would be better to have a table whose purpose is to store the query results.
I think you need to store the data in a database; an SQL database can serve you best.
Regarding storing the data at fixed time intervals: record only the changes to the data set instead of storing the whole data again and again. I don't know your exact requirements or how much infrastructure you can afford, but if your queries are that large, I recommend working in a distributed system and using a NoSQL database for better performance.
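A minimal sketch of the "store only the changes" idea, assuming each row can be serialized to a line of text and identified by a key (all names here are hypothetical); gzip on top addresses the compression question:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

public class SnapshotDelta {
    // previous/current map a row's key to its serialized content.
    public static void writeDelta(Map<String, String> previous,
                                  Map<String, String> current,
                                  File out) throws IOException {
        Writer w = new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(out)), "UTF-8");
        try {
            // Rows that are new or whose content changed since the last run.
            for (Map.Entry<String, String> e : current.entrySet()) {
                if (!e.getValue().equals(previous.get(e.getKey()))) {
                    w.write("U\t" + e.getKey() + "\t" + e.getValue() + "\n");
                }
            }
            // Rows that disappeared since the last run.
            for (String key : previous.keySet()) {
                if (!current.containsKey(key)) {
                    w.write("D\t" + key + "\n");
                }
            }
        } finally {
            w.close();
        }
    }
}

Rebuilding a given hour's resultset then means replaying deltas on top of the last full snapshot, so it is worth writing a full snapshot every so often to bound replay time.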
We are working on an internship project for a company. The project itself consists of data mining. The database we have to work with is huge (gigabytes).
Sad to say, the DB itself is very poorly structured, with inconsistent values and, most importantly, no primary or foreign keys. So in our simple servlet modules that extract and show the inconsistent data, queries take forever to run and show up in the servlet.
As n00b programmers we do not know about JOINs and such things in the DB. Also, we are using MySQL as our DB server. The DB is composed of real-time data from telecom towers.
To find inconsistencies in table values we use a combination of multiple queries, the output of one query serving as input to the next, like:
"SELECT DISTINCT tow_id FROM tower_data WHERE time_stamp LIKE ?";
// query for finding tower ids.
"SELECT time_stamp FROM tower_data WHERE time_stamp LIKE ? AND param_code = ? AND tow_id = ? GROUP BY time_stamp HAVING COUNT(*) > 1";
// query for finding time stamps with duplicate data.
And so on.
Also there are some 10 tables in the database. We need to combine 2-3 tables to get values for custom queries.
After finding all the inconsistent values for multiple factors, we have to do data cleansing, removal of noise, data prediction and such tasks in the next stage.
So we thought we could apply some Java data-mining tools, which would in turn apply algorithms to speed up data retrieval.
Please guide us towards some good datamining tools. Any guidance towards optimizing/rewriting the queries would also be highly appreciated.
I'm not 100% sure it will help in your case, but have a look at google-refine...
Since you seem to have a lot of badly structured data, I do not think data-mining will help.
You may consider using Apache Hadoop for going over all this data and finding inconsistencies. You can use Amazon EC2 for a simple and relatively cheap way to run Hadoop. You can also use Hadoop to port the databases to a better schema, provided that you can build one.
EDIT: I guess you can also do some things within MySQL. Use EXPLAIN to find the slow parts of your query; I believe 'LIKE' is usually slow (especially with a leading wildcard, since it can't use an index), and maybe you can reformulate the query into something faster. Maybe you can first sort your table by timestamp and then look at sub-ranges. Again, you first have to have an efficient way to get the data, and then you can try to mine it. Good luck.
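A sketch of both suggestions over JDBC, reusing the table and column names from the question (the output columns are MySQL's classic EXPLAIN columns; treat the rest as illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class TowerQueries {
    // Run MySQL's EXPLAIN on a query: type=ALL with key=NULL means a full
    // table scan, the usual culprit when a LIKE pattern starts with '%'.
    public static void explain(Connection con, String query) throws Exception {
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("EXPLAIN " + query);
        while (rs.next()) {
            System.out.println(rs.getString("table")
                    + " | type=" + rs.getString("type")
                    + " | key=" + rs.getString("key")
                    + " | rows=" + rs.getLong("rows"));
        }
    }

    // One reformulated query instead of two: grouping by tower AND timestamp
    // makes the separate per-tower pre-query unnecessary.
    public static void findDuplicates(Connection con, String timePattern,
                                      String paramCode) throws Exception {
        PreparedStatement ps = con.prepareStatement(
                "SELECT tow_id, time_stamp, COUNT(*) AS cnt"
                + " FROM tower_data"
                + " WHERE time_stamp LIKE ? AND param_code = ?"
                + " GROUP BY tow_id, time_stamp"
                + " HAVING COUNT(*) > 1");
        ps.setString(1, timePattern);
        ps.setString(2, paramCode);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println("tower " + rs.getString("tow_id") + " has "
                    + rs.getInt("cnt") + " rows at " + rs.getString("time_stamp"));
        }
    }
}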
I have a database I access via JDBC that returns around 200k records, which I have to consolidate pivot-table style for further processing. If that would lead to better performance, I could send a few SELECT COUNT(...) statements upfront.
What is a good algorithm in Java to compute such a pivot table?
Clarification
I am specifically looking for a Java solution, not an SQL approach (and the back end isn't Oracle).
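Here is a sketch of a single-pass pivot over a JDBC ResultSet; the table and column names are invented for illustration, and SUM stands in for whatever aggregate you need:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Map;
import java.util.TreeMap;

public class PivotSketch {
    // Builds row -> (column -> aggregated value) in one pass over the data.
    public static Map<String, Map<String, Double>> pivot(Connection con)
            throws SQLException {
        Map<String, Map<String, Double>> table =
                new TreeMap<String, Map<String, Double>>();
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery(
                "SELECT row_key, col_key, val FROM source_table");
        while (rs.next()) {
            String row = rs.getString("row_key");
            String col = rs.getString("col_key");
            double val = rs.getDouble("val");
            Map<String, Double> cells = table.get(row);
            if (cells == null) {
                cells = new TreeMap<String, Double>();
                table.put(row, cells);
            }
            Double sum = cells.get(col);        // aggregate: SUM
            cells.put(col, sum == null ? val : sum + val);
        }
        return table;
    }
}

With 200k rows this stays comfortably in memory, and the upfront SELECT COUNT(...) statements become unnecessary because the maps grow as values are encountered.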
I want to write the "random" (unordered) output from my result set (about 1.5 million rows) to a file in sorted order. I know I can use ORDER BY in my query, but that clause is "expensive".
Can you tell me: is there an algorithm for writing result set rows to a file so that the content ends up sorted, and can I gain performance this way?
I'm using Java 1.6, and the query has multiple joins.
Define an index on the sort criteria in your table; then you can use the ORDER BY clause without problems and write the file as it comes from the resultset.
If your query has multiple joins, create the proper indexes for the joins and for the sort criteria. You could sort the data in your program, but you'd be wasting time. That time will be far more valuable when spent learning how to properly tune and use your database rather than reinventing sorting algorithms already present in the database engine.
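A sketch of that advice (index, table and column names are hypothetical):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class SortedExportSketch {
    public static void export(Connection con) throws Exception {
        Statement st = con.createStatement();
        // One-time setup: an index matching the sort criteria lets the
        // database stream the rows already ordered.
        st.execute("CREATE INDEX idx_orders_created ON orders (created_at)");
        ResultSet rs = st.executeQuery(
                "SELECT id, created_at FROM orders ORDER BY created_at");
        while (rs.next()) {
            // Write each row out as it arrives; it is already sorted.
            System.out.println(rs.getLong("id") + "," + rs.getString("created_at"));
        }
    }
}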
Grab your database's profiler and check the query's execution plan.
In my experience, sorting on the database side is usually as fast or faster, certainly if the column you sort on is indexed.
If you're reading from a database, getting sorted output shouldn't be so 'expensive' if you have appropriate indexes.
But, sometimes with complex queries it's very hard for the SQL optimiser to apply indexes. In that case, the DB simply accumulates the results in a temporary table and sorts it for you, transparently.
It's very unlikely that you could match the level of optimisations put into your DB engine; but if your problem arises because you're doing some postprocessing of the data that negates any sorting done by the DB, then you have no alternative other than sorting it yourself.
Again, the easiest would be to use the DB: simply write to a temporary table with an appropriate index and dump from there.
If you're certain that the data will always fit in RAM, you can sort it in memory. It's the only case in which you might be able to beat the DB engine, just because you know you won't need HD access.
But that's a lot of 'ifs'. Better to stay with your DB.
If you need the data sorted, someone has to do it: either you or the database. It's certainly easier effort-wise to add the ORDER BY to the query. But there's no reason you can't sort in memory on your side. The easiest way is to put the data in a sorted collection (a TreeMap, or a TreeSet if rows never compare equal, since a TreeSet silently drops duplicates) using a Comparator on the column you need, then write out the sorted data.
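A sketch of that approach; rows are modeled as String arrays and the sort key as one column index, both assumptions for illustration (Java 1.6 compatible, per the question):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class InMemorySortSketch {
    // A TreeMap keyed on the sort column keeps rows ordered as they are
    // inserted; a List per key preserves rows with equal sort values,
    // which a TreeSet would silently drop.
    public static List<String[]> sortRows(Iterable<String[]> rows, int sortCol) {
        TreeMap<String, List<String[]>> sorted =
                new TreeMap<String, List<String[]>>();
        for (String[] row : rows) {
            List<String[]> bucket = sorted.get(row[sortCol]);
            if (bucket == null) {
                bucket = new ArrayList<String[]>();
                sorted.put(row[sortCol], bucket);
            }
            bucket.add(row);
        }
        List<String[]> result = new ArrayList<String[]>();
        for (List<String[]> bucket : sorted.values()) {
            result.addAll(bucket);
        }
        return result;
    }
}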