I have a Java application that processes data of this kind:
class MyData
{
    Date date;
    double one;
    double two;
    String comment;
}
All data are stored in CSV format on disk; the maximum size of such a data sequence is ~150 MB, and for the moment I just load it fully into memory and work with it.
Now I have the task of increasing the maximum data sequence to hundreds of gigabytes. I guess I need to use a DB, but I have not worked with one before.
My questions:
Which DB is better to choose for my purposes (there will be only one table, with data as above)?
Which library is better to use to connect Java <-> DB?
I guess something like a cursor will be used? If so, is there a cursor implementation with good record caching for fast access?
Any other tips & tricks about Java <-> DB are welcome!
Your question is pretty unspecific. There isn't a best of breed - it depends on how much money you have and what kind of hardware you're running.
Since your mapping between Java and the DB is pretty simple, JDBC should be enough. JDBC will create a cursor for you as necessary; just loop over the rows in the ResultSet. Depending on the database, you may need to configure it to use cursors, though.
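For illustration, here is a minimal sketch of that pattern, assuming a hypothetical my_data table with the four fields above and a PostgreSQL JDBC URL (all placeholders); setFetchSize() asks the driver to stream rows in batches rather than materialize the whole result set:

import java.sql.*;

public class StreamRows {
    public static void main(String[] args) throws SQLException {
        // URL, credentials, table and column names are placeholders for illustration
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "password")) {
            con.setAutoCommit(false);      // some drivers (e.g. PostgreSQL) only use a cursor outside auto-commit
            try (Statement st = con.createStatement()) {
                st.setFetchSize(1000);     // hint: fetch rows in batches instead of all at once
                try (ResultSet rs = st.executeQuery(
                        "SELECT date, one, two, comment FROM my_data")) {
                    while (rs.next()) {    // the driver advances the cursor as you iterate
                        Date date = rs.getDate("date");
                        double one = rs.getDouble("one");
                        double two = rs.getDouble("two");
                        String comment = rs.getString("comment");
                        // process one record at a time; memory use stays bounded
                    }
                }
            }
        }
    }
}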
Since you mention "hundreds of gigabytes", that rules out most of the "simple" databases. If you have money, try Oracle. If you don't have money, try MySQL or Postgres.
You can also try JavaDB (also known as Derby). But I'm not sure the performance will be what you need.
Note that they all have their quirks and "features", so expect to spend a couple of weeks to find your way with them.
Depends entirely on what you will be doing with the data. Do you need to index it to retrieve specific records, or are you stream processing the entire data set to generate some statistics (for example)? Does the database need to be accessed concurrently by multiple clients/processes?
Don't rush immediately towards SQL/JDBC, relational databases are powerful, but they add a lot of complexity and are often entirely unnecessary for the task at hand.
Again, depending on what you actually need to do, something like BerkeleyDB may fit the bill, or you may just need a more compact binary message format: check out Protocol Buffers and Kryo.
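As a rough sketch of the binary-format route, this is how Kryo could stream the MyData records from the question to and from a file one at a time (the file name and class names are placeholders; depending on the Kryo version, explicit registration as shown may or may not be required):

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Date;
import java.util.List;

public class KryoStore {
    public static void writeAll(List<MyData> records, String file) throws IOException {
        Kryo kryo = new Kryo();
        kryo.register(MyData.class);               // registration may be required by newer Kryo versions
        kryo.register(Date.class);
        try (Output out = new Output(new FileOutputStream(file))) {
            out.writeInt(records.size());          // record count first, then the records
            for (MyData d : records) {
                kryo.writeObject(out, d);
            }
        }
    }

    public static void readAll(String file) throws IOException {
        Kryo kryo = new Kryo();
        kryo.register(MyData.class);
        kryo.register(Date.class);
        try (Input in = new Input(new FileInputStream(file))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                MyData d = kryo.readObject(in, MyData.class);
                // process one record at a time without holding the full set in memory
            }
        }
    }
}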
If you really need to scale things up, look at Hadoop/HDFS for distributed processing (but that's getting rather complicated).
Oh, and generally speaking, JavaDB/Derby tends to suck somewhat.
I would recommend JavaDB. I have used it in a point-of-sale system and it works very well. It is very easy to integrate into your Java application, and you can even package it into the same .jar file if you want.
Using Java DB in Desktop Applications may be a useful article. You will use JDBC for interfacing with the database from Java, which makes it easy to switch to another database if you don't want to use JavaDB.
You'll want to evaluate several databases (you can get trials of just about any of them if they're not open source/free already). I'd recommend trying Oracle and MySQL/Postgres, and with the size of your data (and its lack of apparent complexity) you might want to consider a data grid as well (GridGain or similar).
Definitely prototype though.
I'd just like to add that the "fastest" database is not necessarily the best.
You also need to take into account:
reliability,
software license cost,
ease of use,
ease of administration,
availability of support,
and so on.
I have about a year of experience in coding in Java. To hone my skills I'm trying to write a calendar/journal entry desktop app in Java. I've realized that I still have no experience with data persistence and still don't really understand what the data persistence options would be for this program -- so perhaps I'm jumping the gun, and the design choices that I'm hoping to implement aren't even applicable once I get into the nitty gritty.
I mainly want to write a calendar app that allows you to log daily journal entries with associated activity logs for time spent on daily tasks. In terms of adding, editing and viewing the journal entries, using a hash table with the dates of the entries as keys and the entries themselves as the values seems most Big-Oh efficient (O(1) average case for each operation).
However, I'm also hoping to implement a feature that could, given a certain range of dates, provide a simple analysis of average amount of time spent on certain tasks per day. If this is one of the main features I'm interested in, am I wrong in thinking that perhaps a sorted array would be more Big-Oh efficient? Especially considering that the data entries are generally expected to already be added date by date.
Or perhaps there's another option I'm unaware of?
The reason I'm asking is because of the answer provided by this following question: Why not use hashing/hash tables for everything?
And the reason I'm unsure if I'm even asking the right question is because of the answer to the following question: Whats the best data structure for a calendar / day planner?
If so, I would really appreciate being directed to other resources on data persistence in Java.
Thank you for the help!
Use the NavigableMap interface (implemented by TreeMap, a red-black tree).
This allows you to easily and efficiently select date ranges and traverse over events in key order.
As an aside, if you consider time or date intervals to be "half-open" it will make many problems easier. That is, when selecting events, include the lower bound in results, but exclude the upper. The methods of NavigableMap, like subMap(), are designed to work this way, and it's a good practice when you are working with intervals of any quantity, as it's easy to define a sequence of intervals without overlap or gaps.
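A small sketch of that approach, assuming journal entries keyed by LocalDate with a plain String standing in for the entry type:

import java.time.LocalDate;
import java.util.NavigableMap;
import java.util.TreeMap;

public class Journal {
    // key: entry date, value: the entry itself (String used as a stand-in type)
    private final NavigableMap<LocalDate, String> entries = new TreeMap<>();

    public void add(LocalDate date, String text) {
        entries.put(date, text);
    }

    // Half-open range [from, to): 'from' is included, 'to' is excluded, so
    // consecutive ranges never overlap and never leave gaps.
    public NavigableMap<LocalDate, String> between(LocalDate fromInclusive, LocalDate toExclusive) {
        return entries.subMap(fromInclusive, true, toExclusive, false);
    }
}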
Depends on how serious you want your project to be. In all cases, be careful of premature optimization. That is when you try too hard to make your code "efficient" and sacrifice readability/maintainability in the process. For example, there is likely a way of doing manual memory management with native code to make a more efficient implementation of a data structure for your calendar, but it likely does not outweigh the benefits of using familiar APIs, etc. It might, but you only know once you run your code.
Write readable code
Run it, test for performance issues
Use a profiler (e.g. JProfiler) to identify the code that is responsible for poor performance
Optimise that code
Repeat
For code that will "work" but will not be very scalable, a simple List will usually do fine. You can use JSON to store your objects, and a library such as Jackson Databind to map between the List and JSON. You could then simply save it to a file for persistence.
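A minimal sketch of that approach, assuming a made-up Entry bean and Jackson Databind on the classpath:

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.io.IOException;
import java.util.List;

public class JsonStore {
    // A made-up journal entry bean; public fields keep the example short.
    public static class Entry {
        public String date;   // e.g. "2024-01-31"
        public String text;
    }

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void save(List<Entry> entries, File file) throws IOException {
        MAPPER.writeValue(file, entries);   // writes the whole list as a JSON array
    }

    public static List<Entry> load(File file) throws IOException {
        return MAPPER.readValue(file, new TypeReference<List<Entry>>() {});
    }
}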
For an application that you want to be more robust and protected against data corruption, a database is probably better. With this, you can guarantee that, for example, data is not partially written, concurrent access to the same data will not result in corruption, and a whole host of other benefits. However, you will need to have a database server running alongside your application. You can use JDBC and suitable drivers for your database vendor (e.g. MySQL) to connect to, read from and write to the database.
For a serious application, you will probably want to create an API for your persistence. A framework like Spring is very helpful for this, as it allows you to declare REST endpoints using annotations, and introduces useful programming concepts, such as containers, IoC/Dependency Injection, Testing (unit tests and integration tests), JPA/ORM systems and more.
Like I say, this is all context dependent, but above all else, avoid premature optimization.
This thread might give you some ideas about what data structure to use for range queries.
Data structure for range query
And it might even be easier to use a database and an API to query for the desired range.
If you are using (or are able to use) Guava, you might consider using RangeMap (*).
This would allow you to use, say, a RangeMap<Instant, Event>, which you could then query to say "what event is occurring at time T".
One drawback is that you wouldn't be able to model concurrent events (e.g. when you are double-booked in two meetings).
(*) I work for Google, Guava is Google's open-sourced Java library. This is the library I would use, but others with similar range map offerings are available.
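A small sketch of that usage (the EventSchedule class and event titles are invented for illustration):

import com.google.common.collect.Range;
import com.google.common.collect.RangeMap;
import com.google.common.collect.TreeRangeMap;
import java.time.Instant;

public class EventSchedule {
    // each half-open interval [start, end) maps to the single event occupying it
    private final RangeMap<Instant, String> events = TreeRangeMap.create();

    public void add(Instant start, Instant end, String title) {
        events.put(Range.closedOpen(start, end), title);
    }

    // "what event is occurring at time t?" (null if none)
    public String eventAt(Instant t) {
        return events.get(t);
    }
}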
I have some entries with dates in my database. Which is better:
Fetch them with an SQL statement that also applies ORDER BY.
Get the list with SQL, and order the entries within the application with Collections.sort or similar?
Thanks
This is a very broad question that is very difficult to answer, and it depends a lot on what you mean by "best".
From a performance perspective, you will simply have to measure to determine what part of your system is the bottleneck. Databases are usually very efficient, but it could still be relevant to off-load that work to the client.
From a separation of concern perspective, it depends on how the sorting matters in the application and how the application is layered.
Ask yourself: "Where does the knowledge that the data is sorted belong?" and "What would happen if I were to change from relational database storage to something different?"
To some extent, it depends on how many values are in the complete collection. If it is, say, 20-30 values then you can sort anywhere — even a relatively poor sorting algorithm can do that quickly (avoid Stooge Sort though; that's terrible) — as that is the sort of size of data chunk which you might expect to actually fetch in one service response.
But once you get into larger datasets you need to plan much more carefully. In particular, you want to avoid moving data around if you don't have to. If the data is currently only present in the database, you really don't want to fetch it all into the client just to sort it (a relatively expensive operation) and then throw virtually all of it away. It's far better to actually keep the data sorted in the database to start with, so that picking it up in order is trivial; in relational database terms, keeping the data sorted is functionally identical to maintaining an index on the data. Indeed, you can have multiple indices on the data, which can make even rather complex queries quick. (NoSQL DBs are more varied; some even don't support the concept of keeping data sorted.) The downside of maintaining indices is that they take up more space and they take time to maintain, particularly when the data is being created in the first place.
So… to return to your question, you probably want to try to not sort the data in the application: for most data, an appropriate index can be much more efficient as it lets your code not even look at unwanted data. But if you have to fetch it all into your application for some other reason and you can't bring it in pre-sorted, there's no reason to avoid sorting it yourself: Java's sorting algorithms are efficient and stable. But you should measure whether fetching it from the DB in the new order is faster. (The question is whether the DB overheads exceed the super-linear costs of re-sorting; lots of problems are in the domain where “maybe; hard to tell” is the answer.)
The other thing to balance is whether it is simpler for your code to not do sorting itself and instead always delegate that to the DB. Keeping your code simpler (and more bug-free) is a good goal to have…
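To make the two options concrete, here is a sketch of both, assuming a hypothetical entries table with an entry_date column:

import java.sql.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortingExample {

    // Option 1: let the database sort; an index on entry_date makes this close to free.
    static List<Date> fetchSortedByDb(Connection con) throws SQLException {
        List<Date> dates = new ArrayList<>();
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT entry_date FROM entries ORDER BY entry_date")) {
            while (rs.next()) {
                dates.add(rs.getDate(1));
            }
        }
        return dates;
    }

    // Option 2: fetch unordered, then sort in the application (stable, O(n log n)).
    static List<Date> fetchThenSortInApp(Connection con) throws SQLException {
        List<Date> dates = new ArrayList<>();
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT entry_date FROM entries")) {
            while (rs.next()) {
                dates.add(rs.getDate(1));
            }
        }
        Collections.sort(dates);   // java.sql.Date is Comparable, so this just works
        return dates;
    }
}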
Database management systems (DBMS) are optimized for these tasks, so I think you should stick with them. Especially if you are accessing the database from a script written in PHP (or another scripting language), it might be slower to perform that task in the script. You might also hit the memory limit allowed for PHP if you sort the array in a script.
I don't mean to raise a question of the performance of different programming languages, just to point out that it is a very good practice to rely on the DBMS whenever you can.
This is a very interesting question to me, and I want to present the other side of the accepted answer, which BTW is a very good answer with which I don't necessarily *dis*agree. Just want to present the other side.
When I started in my career, I was working on mainframe DB2, and the old-timers that taught me were VERY INSISTENT that sorting be done OUTSIDE of the DB. Their rationale for this is that it's work that CAN be offloaded, and this leaves the DB free to service other requests.
Of course, it's far more nuanced than this. In general, I'd say the factors you're weighing are:
A) How busy, or central to your system, is your database? If your DB is very busy with a lot of OLTP processing, and your client or application servers have lots of excess capacity, why not sort on the app server or client? Even if it's less efficient, it spreads the work through the system and gets you more throughput from a whole-systems perspective.
B) How big is the sort? It would be silly to, say, blow your call stack or Java heap because you sorted a gazillion MB of data.
C) Will sorting in your app or app server cause pauses, latency, etc? In other words, if your particular programming language has REALLY bad sorting libraries, and you don't want to write your own, maybe letting the DB take 0.5 seconds is better than making your application take 5.0 seconds.
So, as with all things, "it depends" ;-). But, I think these are the things upon which it depends.
I am planning to develop an application for connecting with friends of friends of friends. It may look like Facebook or Twitter, but initially I am planning to implement it to learn more about NoSQL databases.
There are a number of database tools in the NoSQL space. I have gone through many database types like document stores, key-value stores, column stores and graph databases, and finally I came up with two database tools: Cassandra & Neo4j. Is it right to choose either one? If not, correct me and give me your valuable opinions.
One more thing: the language binding I chose is Java.
My question is:
Which database tool suits my application?
Awaiting your valuable opinions. Thanks for spending your valuable time.
Tim, you really should have posted your question separately, rather than as an answer to the OP, which it wasn't.
But to answer, first, go read Ben Black's slides at http://www.slideshare.net/benjaminblack/introduction-to-cassandra-replication-and-consistency.
Done? Okay, now for the specific questions:
"How would differences in [replica] data-state be reconciled on a subsequent read?"
The highest timestamp wins.
"Do all zones work off the same system clock?"
Timestamps are provided by clients (i.e., your app server). They should be synchronized with e.g. ntpd (which is good practice anyway), but high precision is not required because if ordering matters you should be avoiding conflict either by using unique column names or by using external locking.
For example: if you have a list of users following you in a Twitter clone, you should give each follower its own column and there will be no way to lose data no matter how out of sync the clocks are.
If you have an admin tool for your website and two admins upload a new favicon "simultaneously," one update is going to win and it doesn't really matter which. Here, you do want your clocks synchronized but "within a few ms" is close enough.
If you are managing user registration and you want to allow creating account "jbellis" only if it doesn't already exist, you need a lock manager no matter how closely synchronized your clocks are.
"Would stale data get returned?"
A node (a better unit to think about than a "zone") will not have data it missed during its downtime until it is sent that data by read repair, hinted handoff, or anti-entropy repair. In the meantime, it will reply to read requests with stale data; if you use a high enough consistencylevel read requests will wait for enough other replies to make sure you always see the most recent version anyway, which may mean not being able to fulfil requests if enough other replicas are down.
Otherwise, a low consistencylevel (e.g. ONE) implicitly means "I understand that the higher availability and lower latency I get with this lower consistencylevel means I'm okay with seeing stale data temporarily after downtime."
I'm not sure I understand all of the implications of the Cassandra consistency model with respect to data agreement across multiple availability zones.
Given multiple zones, and given that the coordinator node in Cassandra has used a consistency level that does not require all zones to report back, but only a quorum, how would differences in zone data-state be reconciled on a subsequent read?
Do all zones work off the same system clock? Or does each zone have its own clock? If they don't work off the same clock, how are they synchronized so that timestamps can be compared during the "healing" process when differences are reconciled?
Let's say that a zone that does have accurate, up-to-date data is now offline, and a zone that was offline during a previous write (so it didn't get updated and contains stale data) is now back online. Would stale data get returned? Would the coordinator have any way to know the data were stale?
If you don't need to scale in the short term I'd go with Neo4j because it is designed to store networks like the one you described. (If you eventually do need to scale, maybe you can throw Gizzard in front of it or something. Good luck!)
Have you looked at the Riak database? It has the same background as Cassandra, but you don't need to care about timestamp synchronization (it uses a different method for resolving data status).
My first application was built on a Cassandra database, but I am now trying Riak because it is more suitable. It is not only the difference in keys (keys-values vs. super column - keys - values); it goes further with the document-store feature.
It has a method for creating complex queries using MapReduce. Cassandra does have this option using Hadoop, but that sounds difficult.
Furthermore, it uses a well-known and well-defined access protocol, HTTP(S), so it's easy to manage the server when you have a lot of traffic.
The only bad point is that it is slower than Cassandra. But usually you will read records more often than you write them (and Cassandra is optimised for writes, not reads), so the end result should be OK.
My need is to aggregate real-time statistics of a web application server.
For example:
How many requests of content type X have been done
How long it takes to process a request of type Y
And so on.
This data has to be completely in memory, not in a file, for best performance. It doesn't log each and every request but instead only stores counters of various aspects.
The easiest way I know is to store the values in an SQL-like table and do SQL-like queries. The benefit is that the indexing comes off the shelf without development effort. I guess some embedded Java database like Apache Derby would do the job.
The other way to go is to implement a collection (say, a list) and a hash table for each "index column". This way it's all done with the Java/Scala collections API, but I actually have to implement the indexing mechanism myself, test it, maintain it, etc.
So my question is what way do you think is preferred, and if there are other ways to easily and quickly implement this feature?
Thanks.
I would choose the H2 database; I have very positive experiences with it, and its performance is great as well.
Are you sure that an SQL database is well suited to your needs? Have a look at javamelody to see if it covers them; if it does not, take a look at JRobin for a rolling database implementation.
I would imagine you only need one collection per type of information you need to collect. To improve performance and simplify the code, I would use Trove's TObjectIntHashMap, e.g.
How many requests of content type X have been done:
TObjectIntHashMap<ContentType> contentTypeCount
        = new TObjectIntHashMap<ContentType>();

// note: increment() only bumps an existing entry; use
// adjustOrPutValue(contentType, 1, 1) to also handle the first occurrence of a key
contentTypeCount.increment(contentType);
How long it takes to process a request of type Y:
TObjectLongHashMap<ProcessType> contentTypeTime
        = new TObjectLongHashMap<ProcessType>();

// adds processTime to the running total for processType (again,
// adjustOrPutValue covers the case where no entry exists yet)
contentTypeTime.adjustValue(processType, processTime);
I don't see how you can make it any shorter/simpler/faster by using the other approaches you mentioned.
The average time to perform increment(key) on my machine is 15 ns (billionths of a second).
I have also heard about Twitter's Ostrich, which is a statistics library for Scala.
It contains counters, gauges and timing meters.
Data is accessible via an HTTP REST API.
I'm new to databases, but I think I finally have a situation where flat files won't work.
I'm writing a program to analyze the outcomes of multiplayer games, where each game could have any number of players grouped into any number of teams. I want to allow players to win, tie, or leave partway through the game (and win/lose based on team performance).
I also might want to store historical player ratings (unless it's faster to just recompute them from the game history), so I don't know if that means storing each player's rating alongside each game played, or having a separate table for each player, or what.
I don't see any criteria that impact database choice, but I'll list the free ones:
PostgreSQL
MySQL
SQL Server Express
Oracle Express
I don't recommend an embedded database like SQLite, because embedded databases make trade-offs in features to accommodate space & size concerns. I don't agree with their belief that data typing should be relaxed - it's led to numerous questions on SO about how to deal with date/time filtering, among others...
You'll want to learn about normalization, getting data to Third Normal Form (3NF), because it enforces referential integrity, which also minimizes data redundancy. For example, your player stats would not be stored in the database - they'd be calculated at the time of the request based on the data on hand.
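As a sketch of what a normalized layout might look like for this problem (table and column names are invented for illustration; each fact is stored once, and per-player outcomes live in a single table that references games, teams and players by key):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class SchemaSetup {
    public static void createTables(Connection con) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.executeUpdate("CREATE TABLE player (player_id INT PRIMARY KEY, name VARCHAR(100))");
            st.executeUpdate("CREATE TABLE game (game_id INT PRIMARY KEY, played_on DATE)");
            st.executeUpdate("CREATE TABLE team (team_id INT PRIMARY KEY, "
                    + "game_id INT REFERENCES game(game_id))");
            // one row per player per game; the outcome (e.g. WIN, TIE, LEFT) lives here only
            st.executeUpdate("CREATE TABLE participation ("
                    + "game_id INT REFERENCES game(game_id), "
                    + "team_id INT REFERENCES team(team_id), "
                    + "player_id INT REFERENCES player(player_id), "
                    + "outcome VARCHAR(10), "
                    + "PRIMARY KEY (game_id, player_id))");
        }
    }
}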
You didn't mention any need for locking mechanisms where multiple users may be competing to write the same data to the same resource (a database record, or a file in the case of flat files) simultaneously. What I would suggest is to get a good book on database design and try to understand the normalization rules in depth. Distributing data across separate tables has a performance impact, but it also affects the ease of query construction. This is a very involved topic, and there's no simple answer to it. That's why companies hire database administrators to keep their data structures optimized.
You might want to look at SQLite, if you need a lightweight database engine.
Some good options were mentioned already, but I really think that on the Java platform, H2 is a very good choice. It is perfect for testing (as an in-memory test database), but it also works very well for embedded use cases and as a stand-alone "real database". Plus it is easy to export to a dump file, import from it, and move around.
It is developed by a very good Java DB guy, and it is not his first take; you can see this from the maturity of the project. On top of that, it is still actively developed and supported.
A word on why nobody even mentions any of the "NoSQL" databases, even though you have used it as a tag:
Non-SQL databases are getting a lot of attention (or even outright hype) recently because of some high-profile use cases, because they're new (and therefore interesting), and because of their promise of incredible scalability (which is "sexy" to programmers). However, only a very few very big players actually need that kind of scalability - and you certainly don't.
Another factor is that SQL databases require you to define your DB schema (the structure of tables and columns) beforehand, and changing it is somewhat problematic (especially if you already have a very large database). Non-SQL databases are more flexible in that regard, but you pay for it with more complex code (e.g. after you introduce a new field, your code needs to be able to deal with elements where it's not yet present). It doesn't sound like you need this kind of flexibility either.
Also try OrientDB. It's free (Apache 2 license), runs everywhere, supports SQL and is really fast. It can insert 1,000,000 records in 6 seconds on common hardware.