I am planning to develop an application for connecting with friends of friends of friends. It may end up looking like Facebook or Twitter, but initially I am building it to learn more about NoSQL databases.
There are a number of database tools in the NoSQL space. I have gone through many database types: document stores, key-value stores, column stores, and graph databases. I have finally narrowed it down to two tools: Cassandra and Neo4j. Is it right to choose either one? If not, please correct me and share your opinions.
One more thing: the language binding I have chosen is Java.
My question is: which database tool suits my application?
I look forward to your opinions. Thanks for spending your valuable time.
Tim, you really should have posted your question separately, rather than as an answer to the OP, which it wasn't.
But to answer, first, go read Ben Black's slides at http://www.slideshare.net/benjaminblack/introduction-to-cassandra-replication-and-consistency.
Done? Okay, now for the specific questions:
"How would differences in [replica] data-state be reconciled on a subsequent read?"
The highest timestamp wins.
"Do all zones work off the same system clock?"
Timestamps are provided by clients (i.e., your app server). They should be synchronized with e.g. ntpd (which is good practice anyway), but high precision is not required because if ordering matters you should be avoiding conflict either by using unique column names or by using external locking.
For example: if you have a list of users following you in a Twitter clone, you should give each follower its own column and there will be no way to lose data no matter how out of sync the clocks are.
If you have an admin tool for your website and two admins upload a new favicon "simultaneously," one update is going to win and it doesn't really matter which. Here, you do want your clocks synchronized but "within a few ms" is close enough.
If you are managing user registration and you want to allow creating account "jbellis" only if it doesn't already exist, you need a lock manager no matter how closely synchronized your clocks are.
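To make the follower example concrete, here is a minimal sketch using the DataStax Java driver and CQL (an assumption on my part: the answer above predates CQL, and the keyspace/table names are hypothetical). Each follower gets its own cell, so concurrent writes cannot clobber each other no matter how skewed the clocks are:

import com.datastax.oss.driver.api.core.CqlSession;

public class FollowerModel {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            // One row per (user, follower) pair: the clustering column plays
            // the role of the per-follower column names described above.
            session.execute("CREATE TABLE IF NOT EXISTS demo.followers "
                    + "(user text, follower text, PRIMARY KEY (user, follower))");
            // Two "simultaneous" writes touch different cells; neither can be lost.
            session.execute("INSERT INTO demo.followers (user, follower) VALUES ('jbellis', 'alice')");
            session.execute("INSERT INTO demo.followers (user, follower) VALUES ('jbellis', 'bob')");
        }
    }
}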
"Would stale data get returned?"
A node (a better unit to think about than a "zone") will not have data it missed during its downtime until it is sent that data by read repair, hinted handoff, or anti-entropy repair. In the meantime, it will reply to read requests with stale data; if you use a high enough ConsistencyLevel, read requests will wait for enough other replies to make sure you always see the most recent version anyway, which may mean not being able to fulfil requests if enough other replicas are down.
Otherwise, a low ConsistencyLevel (e.g., ONE) implicitly means "I understand that the higher availability and lower latency I get with this lower ConsistencyLevel means I'm okay with seeing stale data temporarily after downtime."
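For illustration, a sketch of picking a per-request consistency level with the DataStax Java driver (again an assumption on my part; the schema is the hypothetical one from the previous sketch):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ConsistencyDemo {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // QUORUM read: waits for a majority of replicas, so combined with
            // QUORUM writes you always see the most recent version.
            SimpleStatement quorumRead = SimpleStatement
                    .newInstance("SELECT follower FROM demo.followers WHERE user = 'jbellis'")
                    .setConsistencyLevel(DefaultConsistencyLevel.QUORUM);
            for (Row row : session.execute(quorumRead)) {
                System.out.println(row.getString("follower"));
            }
            // ONE read: lowest latency and highest availability, but it may
            // return stale data shortly after a replica was down.
            SimpleStatement fastRead = SimpleStatement
                    .newInstance("SELECT follower FROM demo.followers WHERE user = 'jbellis'")
                    .setConsistencyLevel(DefaultConsistencyLevel.ONE);
            session.execute(fastRead);
        }
    }
}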
I'm not sure I understand all of the implications of the Cassandra consistency model with respect to data agreement across multiple availability zones.
Given multiple zones, and given that the coordinator node in Cassandra has used a consistency level that does not require all zones to report back, but only a quorum, how would differences in zone data-state be reconciled on a subsequent read?
Do all zones work off the same system clock? Or does each zone have its own clock? If they don't work off the same clock, how are they synchronized so that timestamps can be compared during the "healing" process when differences are reconciled?
Let's say that a zone that does have accurate, up-to-date data is now offline, and a zone that was offline during a previous write (so it didn't get updated and contains stale data) is now back online. Would stale data get returned? Would the coordinator have any way to know the data were stale?
If you don't need to scale in the short term, I'd go with Neo4j because it is designed to store networks like the one you described. (If you eventually do need to scale, maybe you can throw Gizzard in front of it or something. Good luck!)
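As a rough illustration of why a graph database fits, here is a sketch of the friends-of-friends-of-friends query with the modern Neo4j Java driver (an assumption on my part; the URI, credentials, and label/relationship names are hypothetical):

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class FriendsOfFriends {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            // Variable-length pattern: friends up to three hops away, one query.
            session.run("MATCH (me:Person {name: $name})-[:FRIEND*1..3]-(other:Person) "
                            + "RETURN DISTINCT other.name AS name",
                        Values.parameters("name", "Alice"))
                   .forEachRemaining(r -> System.out.println(r.get("name").asString()));
        }
    }
}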
Have you looked at the Riak database? It has the same background as Cassandra, but you don't need to care about timestamp synchronization (it uses a different method, vector clocks, for resolving conflicting data).
My first application was built on a Cassandra database, but I am now trying Riak because it is more suitable. It is not only the difference in keys (keys-values versus super column - keys - values); it goes further with its document store features.
It has a method for creating complex queries using MapReduce. Cassandra does have this option via Hadoop, but that sounds more difficult.
Furthermore, it uses a well-known, well-defined access protocol, HTTP/S, so it's easy to manage the server when you have a lot of traffic.
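For example, a minimal sketch of storing and fetching a value over Riak's HTTP interface with nothing but the JDK (assuming the classic /riak/<bucket>/<key> endpoint on Riak's default port 8098; the bucket and key are hypothetical):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class RiakHttpDemo {
    public static void main(String[] args) throws Exception {
        // Store a value with PUT.
        HttpURLConnection put = (HttpURLConnection)
                new URL("http://localhost:8098/riak/users/alice").openConnection();
        put.setRequestMethod("PUT");
        put.setRequestProperty("Content-Type", "application/json");
        put.setDoOutput(true);
        try (OutputStream out = put.getOutputStream()) {
            out.write("{\"name\":\"alice\"}".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("PUT status: " + put.getResponseCode());

        // Read it back with GET.
        HttpURLConnection get = (HttpURLConnection)
                new URL("http://localhost:8098/riak/users/alice").openConnection();
        try (Scanner s = new Scanner(get.getInputStream(), StandardCharsets.UTF_8.name())) {
            System.out.println(s.useDelimiter("\\A").next());
        }
    }
}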
The only bad point is that it is slower than Cassandra. But usually you will read records more often than you write them (and Cassandra is optimised for writes, not reads), so the end result should be OK.
I have about a year of experience coding in Java. To hone my skills I'm trying to write a calendar/journal entry desktop app in Java. I've realized that I still have no experience with data persistence and still don't really understand what the data persistence options for this program would be -- so perhaps I'm jumping the gun, and the design choices that I'm hoping to implement aren't even applicable once I get into the nitty gritty.
I mainly want to write a calendar app that allows you to log daily journal entries with associated activity logs for time spent on daily tasks. In terms of adding, editing and viewing the journal entries, using a hash table with the dates of the entries as keys and the entries themselves as the values seems most Big-Oh efficient (O(1) average case for each using a hash table).
However, I'm also hoping to implement a feature that could, given a certain range of dates, provide a simple analysis of average amount of time spent on certain tasks per day. If this is one of the main features I'm interested in, am I wrong in thinking that perhaps a sorted array would be more Big-Oh efficient? Especially considering that the data entries are generally expected to already be added date by date.
Or perhaps there's another option I'm unaware of?
The reason I'm asking is because of the answer provided by this following question: Why not use hashing/hash tables for everything?
And the reason I'm unsure if I'm even asking the right question is because of the answer to the following question: What's the best data structure for a calendar / day planner?
If so, I would really appreciate being directed to other resources on data persistence in Java.
Thank you for the help!
Use the NavigableMap interface (implemented by TreeMap, a red-black tree).
This allows you to easily and efficiently select date ranges and traverse over events in key order.
As an aside, if you consider time or date intervals to be "half-open" it will make many problems easier. That is, when selecting events, include the lower bound in results, but exclude the upper. The methods of NavigableMap, like subMap(), are designed to work this way, and it's a good practice when you are working with intervals of any quantity, as it's easy to define a sequence of intervals without overlap or gaps.
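A minimal sketch of that approach (the entry type and dates are just for illustration):

import java.time.LocalDate;
import java.util.NavigableMap;
import java.util.TreeMap;

public class Journal {
    private final NavigableMap<LocalDate, String> entries = new TreeMap<>();

    void add(LocalDate date, String entry) {
        entries.put(date, entry);
    }

    // Half-open range [from, to): lower bound included, upper excluded.
    NavigableMap<LocalDate, String> range(LocalDate from, LocalDate to) {
        return entries.subMap(from, true, to, false);
    }

    public static void main(String[] args) {
        Journal j = new Journal();
        j.add(LocalDate.of(2023, 1, 1), "New year");
        j.add(LocalDate.of(2023, 1, 15), "Mid-month");
        j.add(LocalDate.of(2023, 2, 1), "February");
        // Prints only the two January entries, in date order.
        j.range(LocalDate.of(2023, 1, 1), LocalDate.of(2023, 2, 1))
         .forEach((d, e) -> System.out.println(d + ": " + e));
    }
}

Averaging time spent over a date range is then a single pass over subMap() rather than a scan of every key, which is the efficiency the question was after.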
Depends on how serious you want your project to be. In all cases, be careful of premature optimization. This is when you try too hard to make your code "efficient" and sacrifice readability/maintainability in the process. For example, there is likely a way of doing manual memory management with native code to make a more efficient implementation of a data structure for your calendar, but it likely does not outweigh the benefits of using familiar APIs, etc. It might, but you only know once you run and measure your code.
Write readable code
Run it, test for performance issues
Use a profiler (e.g. JProfiler) to identify the code that is responsible for poor performance
Optimise that code
Repeat
For code that will "work" but will not be very scalable, a simple List will usually do fine. You can use JSON to store your objects, and a library such as Jackson Databind to map between the List and JSON. You could then simply save it to a file for persistence.
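A sketch of that simple route, assuming Jackson Databind as a dependency (the Entry class is hypothetical):

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class JsonPersistence {
    public static class Entry {
        public String date;   // kept as a String to avoid extra date modules
        public String text;
        public Entry() {}     // Jackson needs a no-arg constructor
        public Entry(String date, String text) { this.date = date; this.text = text; }
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        File file = new File("journal.json");

        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("2023-01-01", "New year"));
        mapper.writeValue(file, entries);                     // save

        List<Entry> loaded = mapper.readValue(file,           // load
                new TypeReference<List<Entry>>() {});
        System.out.println(loaded.get(0).date);
    }
}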
For an application that you want to be more robust and protected against data corruption, a database is probably better. With this, you can guarantee that, for example, data is not partially written, concurrent access to the same data will not result in corruption, and a whole host of other benefits. However, you will need to have a database server running alongside your application. You can use JDBC with a suitable driver for your database vendor (e.g. MySQL) to connect to, read from, and write to the database.
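A sketch of the JDBC route (the URL, credentials, and table are hypothetical, and the vendor's driver jar must be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/journal", "user", "password")) {
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO entries (entry_date, text) VALUES (?, ?)")) {
                insert.setString(1, "2023-01-01");
                insert.setString(2, "New year");
                insert.executeUpdate();
            }
            try (PreparedStatement query = conn.prepareStatement(
                    "SELECT text FROM entries WHERE entry_date BETWEEN ? AND ?")) {
                query.setString(1, "2023-01-01");
                query.setString(2, "2023-01-31");
                try (ResultSet rs = query.executeQuery()) {
                    while (rs.next()) System.out.println(rs.getString("text"));
                }
            }
        }
    }
}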
For a serious application, you will probably want to create an API for your persistence. A framework like Spring is very helpful for this, as it allows you to declare REST endpoints using annotations, and introduces useful programming concepts, such as containers, IoC/Dependency Injection, Testing (unit tests and integration tests), JPA/ORM systems and more.
Like I say, this is all context dependent, but above all else, avoid premature optimization.
This thread might give you some ideas about what data structure to use for range queries.
Data structure for range query
And it might even be easier to use a database and an API to query for the desired range.
If you are using (or are able to use) Guava, you might consider using RangeMap (*).
This would allow you to use, say, a RangeMap<Instant, Event>, which you could then query to say "what event is occurring at time T".
One drawback is that you wouldn't be able to model concurrent events (e.g. when you are double-booked in two meetings).
(*) I work for Google, Guava is Google's open-sourced Java library. This is the library I would use, but others with similar range map offerings are available.
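A minimal sketch of the idea (Guava assumed as a dependency; the example events are hypothetical). Half-open ranges avoid overlap at the boundaries:

import com.google.common.collect.Range;
import com.google.common.collect.RangeMap;
import com.google.common.collect.TreeRangeMap;
import java.time.Instant;

public class RangeMapDemo {
    public static void main(String[] args) {
        RangeMap<Instant, String> schedule = TreeRangeMap.create();
        Instant nine = Instant.parse("2023-01-02T09:00:00Z");
        Instant ten = Instant.parse("2023-01-02T10:00:00Z");
        Instant eleven = Instant.parse("2023-01-02T11:00:00Z");

        schedule.put(Range.closedOpen(nine, ten), "Standup");
        schedule.put(Range.closedOpen(ten, eleven), "Design review");

        // "What event is occurring at time T?"
        System.out.println(schedule.get(Instant.parse("2023-01-02T09:30:00Z"))); // Standup
        System.out.println(schedule.get(eleven)); // null: nothing at 11:00
    }
}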
We have market data handlers which publish quotes to a KDB ticker plant. We use the exxeleron qJava library for this purpose. Unfortunately latency is quite high: hundreds of milliseconds when we try to insert a batch of records. Can you suggest some latency tips for the KDB + Java binding, as we need to publish quite fast?
There's not enough information in this message to give a fully qualified response, but having done the same with Java+KDB it really comes down to eliminating the possibilities. This is common sense, really, nothing super technical.
Make sure you're inserting asynchronously
Verify it's exxeleron qJava that is causing the latency. I don't think there's hundreds of milliseconds of overhead there.
Verify the CPU that your tickerplant is on isn't overloaded. Consider re-nicing, core binding, etc
Analyse your network latencies. Also, if you're using Linux, there are a few TCP tweaks you can try, e.g. TCP_QUICKACK
As you're using Java, be smarter about garbage collection. It's highly configurable, although not directly controllable.
If you find out the tickerplant is the source of latency, you could either recode it to not write to disk, or get a faster local disk.
There's so many more suggestions, but the question is a bit too ambiguous.
EDIT
Back in 2007, with old(ish) servers and a very old version of KDB+ we were managing an insertion rate of 90k rows per second using the vanilla c.java. That was after many rounds of the above points. I'm sure you can achieve way more now, it's a matter of finding where the bottlenecks are and fixing them one by one.
Make sure the data published to the tickerplant is batched: wait a little and insert, say, a few hundred rows at once, rather than inserting row by row as new records arrive.
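A sketch of that batching idea in plain Java; the flush callback is where your qJava insert would go (omitted here, since the exact library call depends on your setup). Rows are buffered and flushed either when the batch is full or when a timer fires:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class BatchingPublisher<T> {
    private final List<T> buffer = new ArrayList<>();
    private final int maxBatch;
    private final Consumer<List<T>> flusher;
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public BatchingPublisher(int maxBatch, long flushMillis, Consumer<List<T>> flusher) {
        this.maxBatch = maxBatch;
        this.flusher = flusher;
        timer.scheduleAtFixedRate(this::flush, flushMillis, flushMillis, TimeUnit.MILLISECONDS);
    }

    public synchronized void publish(T row) {
        buffer.add(row);
        if (buffer.size() >= maxBatch) flush();
    }

    public synchronized void flush() {
        if (buffer.isEmpty()) return;
        flusher.accept(new ArrayList<>(buffer)); // hand one whole batch to the TP insert
        buffer.clear();
    }
}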
We have a Java-based product which keeps a Calculation object in the database as a blob. At runtime we keep this object in memory for performance. Now there is another process which updates this Calculation object in the database at regular intervals. What would be the best strategy so that when this object gets updated in the database, the cache evicts the stored object and fetches it again from the database?
I'd prefer not to use a caching framework unless it is absolutely necessary.
I'd appreciate any responses on this.
It is very difficult to give you a good answer to your question without any knowledge of your system architecture, design constraints, your IT strategy, etc.
Personally, I would use the Messaging pattern to solve this issue. A few advantages of that pattern are as follows:
Your system components (the Calculation process, the update process) can be loosely coupled
Depending on the implementation of the Messaging pattern, you can "connect" many Calculation processes (scaling out) and many update processes (with a master-slave approach).
However, implementing the Messaging pattern yourself might be a very challenging task, and I would recommend taking one of the existing frameworks or products. A sketch of the subscriber side follows.
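For illustration, a minimal sketch of the subscriber with plain JMS (the provider wiring, topic, and cache key are all hypothetical; a framework such as ActiveMQ or Spring JMS would supply the connection):

import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CalculationCacheInvalidator implements MessageListener {
    private final ConcurrentMap<String, Object> cache = new ConcurrentHashMap<>();

    @Override
    public void onMessage(Message message) {
        try {
            // The update process publishes the id of the changed Calculation.
            String id = ((TextMessage) message).getText();
            cache.remove(id); // the next read misses and reloads from the DB
        } catch (JMSException e) {
            throw new RuntimeException(e);
        }
    }
}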
I hope that will help at least a bit.
I did some work similar to your scenario before; generally there are two ways.
One: the cache holder polls the database regularly, fetches the data it needs, and keeps it in memory. The data can be stored in a HashMap or some other collection. This approach is simple and easy to implement, with no extra framework or library needed, but users will have to endure dirty data from time to time. Besides, polling puts a lot of pressure on the DB if the number of pollers is huge or the query is not fast enough. However, it is generally not a bad choice if your real-time requirements are not that high and the scale of your system is relatively small.
The other approach is that the cache holder subscribes to notifications from the data updater and updates its data after being notified. It provides a better user experience, but it brings more complexity to your system because you have to get some messaging infrastructure, such as JMS, involved. Development and tuning are more time-consuming.
I know I am quite late responding to this, but it might help somebody searching for the same issue.
Here was my problem: I was storing requests-per-minute information in a HashMap in a Java filter which gets loaded when the application starts. The problem: if somebody updates the DB with new information, the map doesn't know about it.
Solution: I added an updateTime variable to my Java filter which stores when the HashMap was last updated, and on every request it checks whether more than 24 hours have passed; if so, it reloads the HashMap from the database. So every 24 hours it just refreshes the whole map.
My use case did not require real-time updates, so this fits it well.
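A minimal sketch of that approach (names are hypothetical). The whole map is rebuilt once the configured interval has elapsed:

import java.util.Map;
import java.util.concurrent.TimeUnit;

public class RefreshingCache {
    private static final long MAX_AGE_MILLIS = TimeUnit.HOURS.toMillis(24);
    private volatile Map<String, Integer> requestsPerMinute;
    private volatile long updateTime;

    public synchronized Map<String, Integer> get() {
        if (requestsPerMinute == null
                || System.currentTimeMillis() - updateTime > MAX_AGE_MILLIS) {
            requestsPerMinute = loadFromDatabase(); // hypothetical DB call
            updateTime = System.currentTimeMillis();
        }
        return requestsPerMinute;
    }

    private Map<String, Integer> loadFromDatabase() {
        // Placeholder for the real query; returns fresh data from the DB.
        return Map.of();
    }
}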
I have a Java application that processes this kind of data:
class MyData
{
Date date;
double one;
double two;
String comment;
}
All data is stored in CSV format on disk; the maximum size of such a data sequence is ~150 MB, and for the moment I just load it fully into memory and work with it.
Now I have the task of growing the maximum data sequence to hundreds of gigabytes. I guess I need to use a DB, but I have not worked with one before.
My questions:
Which DB is better to choose for my needs (there will be only one table, with data as above)?
Which library is better to use to connect Java <-> DB?
I guess something like a cursor will be used?! If so, is there any cursor implementation with good record caching for fast access?
Any other tips & tricks about Java <-> DB are welcome!
Your question is pretty unspecific. There isn't a best of breed; it depends on how much money you have and what kind of hardware.
Since your mapping between Java and the DB is pretty simple, JDBC should be enough. JDBC will create a cursor for you as necessary; just loop over the rows in the ResultSet. Depending on the database, you may need to configure it to use cursors, though.
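A sketch of that cursor-style streaming (URL, credentials, and table are hypothetical; setFetchSize() asks the driver to stream rows in chunks instead of materializing the whole result, and PostgreSQL additionally needs autocommit off for it to take effect):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydata", "user", "password")) {
            conn.setAutoCommit(false);          // needed for cursors in Postgres
            try (Statement stmt = conn.createStatement()) {
                stmt.setFetchSize(10_000);      // rows per round trip
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT date, one, two, comment FROM my_data")) {
                    while (rs.next()) {
                        process(rs.getDate(1), rs.getDouble(2),
                                rs.getDouble(3), rs.getString(4));
                    }
                }
            }
        }
    }

    static void process(java.sql.Date d, double one, double two, String comment) {
        // ... work on one row at a time, without holding 100s of GB in memory
    }
}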
Since you mention "hundreds of gigabytes", that rules out most of the "simple" databases. If you have money, try Oracle. If you don't have money, try MySQL or Postgres.
You can also try JavaDB (also known as Derby). But I'm not sure the performance will be what you need.
Note that they all have their quirks and "features", so expect to spend a couple of weeks to find your way with them.
Depends entirely on what you will be doing with the data. Do you need to index it to retrieve specific records, or are you stream processing the entire data set to generate some statistics (for example)? Does the database need to be accessed concurrently by multiple clients/processes?
Don't rush immediately towards SQL/JDBC; relational databases are powerful, but they add a lot of complexity and are often entirely unnecessary for the task at hand.
Again, depending on what you actually need to do, something like BerkeleyDB may fit the bill, or you may just need a more compact binary message format: check out Protocol Buffers and Kryo.
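For instance, a sketch of compact binary serialization with Kryo (assumed as a dependency, Kryo 5.x API; Protocol Buffers works similarly but needs a schema first):

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Date;

public class KryoDemo {
    public static class MyData {
        Date date;
        double one;
        double two;
        String comment;
    }

    public static void main(String[] args) throws Exception {
        Kryo kryo = new Kryo();
        kryo.register(MyData.class);  // Kryo 5 requires registration by default
        kryo.register(Date.class);

        MyData d = new MyData();
        d.date = new Date();
        d.one = 1.0; d.two = 2.0; d.comment = "hello";

        try (Output out = new Output(new FileOutputStream("data.bin"))) {
            kryo.writeObject(out, d);   // far smaller than CSV or JSON
        }
        try (Input in = new Input(new FileInputStream("data.bin"))) {
            MyData back = kryo.readObject(in, MyData.class);
            System.out.println(back.comment);
        }
    }
}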
If you really need to scale things up, look at Hadoop/HDFS for distributed processing (but that's getting rather complicated).
Oh, and generally speaking, JavaDB/Derby tends to suck somewhat.
I would recommend JavaDB. I have used it in a point-of-sale system and it works very well. It is very easy to integrate into your Java application, and you can even bundle it into the same .jar file if you want.
Using Java DB in Desktop Applications may be a useful article. You will use JDBC to interface with the database from Java, which makes it easy to switch to another database if you don't want to use JavaDB.
You'll want to evaluate several databases (you can get trials of just about any of them if they're not open source/free already). I'd recommend trying Oracle and MySQL/Postgres, and with the size of your data (and its lack of apparent complexity) you might want to consider a data grid as well (GridGain or similar).
Definitely prototype though.
I'd just like to add that the "fastest" database is not necessarily the best.
You also need to take into account:
reliability,
software license cost,
ease of use,
ease of administration,
availability of support,
and so on.
I'm new to databases, but I think I finally have a situation where flat files won't work.
I'm writing a program to analyze the outcomes of multiplayer games, where each game could have any number of players grouped into any number of teams. I want to allow players to win, tie, or leave partway through the game (and win/lose based on team performance).
I also might want to store historical player ratings (unless it's faster to just recompute them from the game history), so I don't know if that means storing each player's rating alongside each game played, or having a separate table for each player, or what.
I don't see any criteria that impacts database choice, but I'll list the free ones:
PostgreSQL
MySQL
SQL Server Express
Oracle Express
I don't recommend an embedded database like SQLite, because embedded databases make trade-offs in features to accommodate space and size concerns. I don't agree with their belief that data typing should be relaxed; it has led to numerous questions on SO about how to deal with date/time filtering, among others...
You'll want to learn about normalization, getting data to Third Normal Form (3NF), because it enforces referential integrity, which also minimizes data redundancy. For example, your player stats would not be stored in the database; they'd be calculated at the time of the request based on the data on hand.
You didn't mention any need for locking mechanisms, where multiple users may be competing to write the same data to the same resource (a database record, or a file in the case of flat files) simultaneously. What I would suggest is to get a good book on database design and try to understand normalization rules in depth. Distributing data across separate tables has a performance impact, but it also affects the ease of query construction. This is a very involved topic, and there's no simple answer to it. That's why companies hire database administrators to keep their data structures optimized.
You might want to look at SQLite, if you need a lightweight database engine.
Some good options were mentioned already, but I really think that on the Java platform H2 is a very good choice. It is perfect for testing (as an in-memory test database), but it also works very well for embedded use cases and as a stand-alone "real database". Plus it is easy to export as a dump file, import from one, and move around. And it works efficiently too.
It is developed by a very good Java DB guy, and it is not his first attempt; you can see this in the maturity of the project. On top of this, it is still being actively developed as well as supported.
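A minimal sketch of H2 in embedded mode (the h2 jar on the classpath is assumed; the schema is a hypothetical normalized layout of the kind the earlier answer recommends: games, players, and a join table for participation):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class H2Demo {
    public static void main(String[] args) throws Exception {
        // "jdbc:h2:./games" would persist to a file; mem: is throwaway.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:games");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE player (id IDENTITY PRIMARY KEY, name VARCHAR(64))");
            st.execute("CREATE TABLE game (id IDENTITY PRIMARY KEY, played_at TIMESTAMP)");
            st.execute("CREATE TABLE participation ("
                    + "game_id BIGINT REFERENCES game(id), "
                    + "player_id BIGINT REFERENCES player(id), "
                    + "team INT, outcome VARCHAR(8))");
            System.out.println("schema created");
        }
    }
}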
A word on why nobody has even mentioned any of the "NoSQL" databases, even though you used it as a tag:
Non-SQL databases are getting a lot of attention (or even outright hype) recently because of some high-profile use cases, because they're new (and therefore interesting), and because of their promise of incredible scalability (which is "sexy" to programmers). However, only a very few very big players actually need that kind of scalability - and you certainly don't.
Another factor is that SQL databases require you to define your DB schema (the structure of tables and columns) beforehand, and changing it is somewhat problematic (especially if you already have a very large database). Non-SQL databases are more flexible in that regard, but you pay for it with more complex code (e.g. after you introduce a new field, your code needs to be able to deal with elements where it's not yet present). It doesn't sound like you need this kind of flexibility either.
Also try OrientDB. It's free (Apache 2 license), runs everywhere, supports SQL, and it's really fast: it can insert 1,000,000 records in 6 seconds on commodity hardware.