Tools to do data processing from Java

I've got a legacy system that uses SAS to ingest raw data from the database, cleanse and consolidate it, and then score the output documents.
I want to move to Java or a similar object-oriented solution so I can implement unit testing and generally get better control of the code. (I'm not talking about overhauling the whole system, just injecting Java where I can.)
In terms of data size, we're talking about around 1 TB of data being both ingested and created. In terms of scaling, this might increase by a factor of around 10, but it isn't likely to grow on the massive scale a worldwide web project might.
The question is - what tools would be most appropriate for this kind of project?
Where would I find this information - what search terms should be used?
Is doing processing on an SQL database (creating and dropping tables, adding columns, as needed) an appropriate, or awful, solution?
I've had a quick look at Hadoop - but due to the small scale of this project, would Hadoop be an unnecessary complication?
Are there any Java packages that do similar functionality as SAS or SQL in terms of merging, joining, sorting, grouping datasets, as well as modifying data?

It's hard for me to prescribe exactly what you need given your problem statement.
It sounds like a good database API is what you need; plain JDBC with a solid open-source database backend might be all it takes.
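For example, here is a minimal JDBC sketch (the driver URL, credentials, and table/column names are invented for illustration) that pushes the merging, joining, grouping and sorting down into the database:

    import java.sql.*;

    public class ScoreConsolidator {
        public static void main(String[] args) throws SQLException {
            // Connection details are placeholders; point them at your own database.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/warehouse", "etl_user", "secret");
                 Statement stmt = conn.createStatement();
                 // Let the database do the join/group/sort work it was built for.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT c.customer_id, SUM(s.amount) AS total " +
                     "FROM customers c JOIN sales s ON s.customer_id = c.customer_id " +
                     "GROUP BY c.customer_id ORDER BY total DESC")) {
                while (rs.next()) {
                    System.out.printf("%d -> %.2f%n",
                            rs.getLong("customer_id"), rs.getDouble("total"));
                }
            }
        }
    }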
However, I think you should take some time to check out Lucene. It's a fantastic tool and may meet your scoring needs very well. Taking a search engine indexing approach to your problem may be fruitful.

I think the questions you need to ask yourself are:
What is the nature of your data set, and how often will it be updated?
What workload will you run against this 1 TB (or more) of data in the future? Will it be mainly offline read and analysis operations, or will there also be a lot of random writes?
Here is an article, which I think is worth reading, on whether or not to choose Hadoop.
Hadoop is a better choice if you only have daily or weekly updates to your data set and the major operations on the data are read-only, combined with further analysis. For the merging, joining, sorting and grouping operations you mentioned, Cascading is a Java library that runs on top of Hadoop and supports them well.
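As a rough sketch of the flavour of the Cascading API (this assumes 2.x-era class names; the field names and paths are invented, so check the exact signatures against the version you actually use), a group-and-sum over tab-delimited files looks roughly like this:

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Sum;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class ScoreTotals {
        public static void main(String[] args) {
            // Source and sink taps over HDFS paths (paths are illustrative).
            Tap source = new Hfs(new TextDelimited(new Fields("customer", "score"), "\t"), "in/scores");
            Tap sink = new Hfs(new TextDelimited(), "out/totals", SinkMode.REPLACE);

            Pipe pipe = new Pipe("totals");
            pipe = new GroupBy(pipe, new Fields("customer"));   // group rows by customer
            pipe = new Every(pipe, new Fields("score"),
                    new Sum(new Fields("total")), Fields.ALL);  // sum scores per group

            FlowDef flowDef = FlowDef.flowDef()
                    .addSource(pipe, source)
                    .addTailSink(pipe, sink);
            new HadoopFlowConnector().connect(flowDef).complete();
        }
    }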

Related

Choosing a database service - MongoHQ vs DynamoDB

Currently I am gathering information on what database service we should use.
I am still very new to web development, but we think we want a NoSQL database.
We are using Java with Play! 2.
We only need a database for user registration.
I am already familiar with GAE ndb, which is a key-value store like DynamoDB; MongoDB is a document DB.
I am not sure what advantages each solution has.
I also know that DynamoDB runs on SSDs and that MongoDB is in-memory.
An advantage of MongoDB would be that Play! already "supports" MongoDB.
We don't expect too much database usage at first, but we would need to scale pretty fast if our app grows.
What alternatives do I have? What pros/cons do they have?
Considering:
Pricing
Scaling
Ease of use
Play! support?
(Disclosure: I'm a founder of MongoHQ, and would obviously prefer you choose us)
The biggest difference from a developer perspective is the querying capability. On DynamoDB, you need the exact key for a given document, or you need to build your keys in such a way that you can use them for range based queries. In Mongo, you can query on the structure of the document, add secondary indexes, do aggregations, etc.
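For example, with the (2.x-era) MongoDB Java driver - the collection, field names and index here are invented for illustration - an ad-hoc query and a secondary index are each a one-liner, where Dynamo would want the key designed up front:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.MongoClient;

    public class UserQueries {
        public static void main(String[] args) throws Exception {
            MongoClient client = new MongoClient("localhost", 27017);
            DB db = client.getDB("myapp");
            DBCollection users = db.getCollection("users");

            // Secondary index on a field we never planned as a key.
            users.createIndex(new BasicDBObject("email", 1));

            // Query on document structure rather than a predesigned key.
            DBCursor cursor = users.find(new BasicDBObject("signupSource", "play"));
            try {
                while (cursor.hasNext()) {
                    System.out.println(cursor.next());
                }
            } finally {
                cursor.close();
            }
            client.close();
        }
    }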
The advantage of doing it with k/v only is that it forces you to build your application in a way that DynamoDB can scale. The advantage of Mongo's flexible queries against your docs is that you can develop much faster, even if you discount what the Play framework includes. It's always going to be quicker to do new development with something like Mongo, because you don't have to make your scaling decisions from the get-go.
Implementation-wise, both Mongo and DynamoDB can grow basically unbounded. Dynamo abstracts most of the decisions on storage, RAM and processor power. Mongo requires that you (or someone like us) make decisions on how much RAM to have, what kind of disks to use, how to manage bottlenecks, etc. The operational hurdles are different, but the end result is very similar. We run multiple Mongo DBs on top of very fast SSDs and it works phenomenally well.
Pricing is incredibly difficult to compare, unfortunately. DynamoDB pricing is based on a nominal per GB fee, but you pay for data access. You need to be sure you understand how your costs are going to grow as your database gets more active. I'm not sure I can predict DynamoDB pricing effectively, but I know we've had customers who've been surprised (to say the least) at how expensive Dynamo ended up being for the stuff they wanted to do.
Running Mongo is much more predictable cost-wise. You likely need 1GB of RAM for every 10GB of data, running a redundant setup doubles your price, etc. It's a much easier equation to wrap your head around and you're not in for quite as nasty of a shock if you have a huge amount of traffic one day.
By far the biggest advantage of Mongo (and MongoHQ) is this: you can leave your provider at any time. If you get irked at your Mongo provider, it's only a little painful to migrate away. If you get irked at Amazon, you're going to have to rewrite your app to work with an entirely different engine. This has huge implications on the support you should expect to receive, hosting Mongo is competitive enough that you get very good support from just about any Mongo specific company you choose (or we'd die).
I addressed scaling a little bit above, but the simplest answer is this: if you define your data model well, either option will scale out just about as far as you can imagine you'd need to go. You are likely to not do this right with Mongo at first, though, since you'll probably be developing quickly. This means that once you can't scale vertically any more (by adding RAM, disk speed, etc to a single server) you will have to be careful about how you choose to shard. The biggest difference between Mongo and Dynamo scaling is when you choose to make your "how do I scale my data?" decisions, not overall scaling ability.
So I'd choose Mongo (duh!). I think you can build a fantastic app on top of DynamoDB, though.
As you said, MongoDB is one step ahead of the other options, because you can use the Morphia plugin to simplify DB interactions (you have JPA support as well). The Play framework provides a CRUD module (admin console) and a secure module (for your overall login system), so I strongly suggest you have a look at them.

Loading facebook's big text file to memory (39MB) for autocompletion

I'm trying to implement part of the Facebook Ads API, the auto-complete function ads.getAutoCompleteData.
Basically, Facebook supplies this 39 MB file, which is updated weekly and contains targeting ads data including colleges, college majors, workplaces, locales, countries, regions and cities.
Our application needs to access all of those objects and supply auto-completion using this file's data.
I'm thinking about the preferred way to solve this. I was considering one of the following options:
Loading it into memory using a trie (Patricia trie); the disadvantage, of course, is that it will take too much memory on the server.
Using a dedicated search platform such as Solr on a different machine; the disadvantage is perhaps over-engineering (though the file size will probably increase greatly in the future).
(Fill here cool, easy and speed of light option) ?
Well, what do you think?
I would stick with a service-oriented architecture (especially if the product is supposed to handle high volumes) and go with Solr. That being said, 39 MB is not a lot to hold in memory if it's going to be a singleton. With indexes and all, this will get up to what, 400 MB? This of course depends on what your product does and what kind of hardware you wish to run it on.
I would go with Solr, or write your own service that reads the file into a fast DB like a MySQL MyISAM table (or even an in-memory table) and uses MySQL's text-search feature to serve up results. Barring that, I would try to use Solr as a service.
The benefit of writing my own service is that I know what is going on; the downside is that it'll be nowhere near as powerful as Solr. However, I suspect writing my own service would take less time to implement.
Consider writing your own service that serves up requests in an async manner (if your product is a website, then using Ajax). The trouble with Solr or Lucene is that if you get stuck, there is not a lot of help out there.
Just my 2 cents.
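If you do end up rolling your own in-memory service (option 1 from the question), here's a minimal sketch using the PatriciaTrie from Apache commons-collections4; the tab-separated line format is an assumption, so adapt the parsing to the real feed:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.SortedMap;

    import org.apache.commons.collections4.trie.PatriciaTrie;

    public class AutoCompleteIndex {
        private final PatriciaTrie<String> trie = new PatriciaTrie<String>();

        // Assumes one "name<TAB>type" entry per line; the real feed will differ.
        public void load(String path) throws IOException {
            BufferedReader reader = new BufferedReader(new FileReader(path));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    trie.put(parts[0].toLowerCase(), parts.length > 1 ? parts[1] : "");
                }
            } finally {
                reader.close();
            }
        }

        // All entries whose key starts with the typed prefix.
        public SortedMap<String, String> complete(String prefix) {
            return trie.prefixMap(prefix.toLowerCase());
        }
    }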

Need for Hibernate in the legacy world [closed]

I have several questions about Hibernate.
In many questions here on Stack Overflow, several people say that Hibernate is not a good choice for very complex databases: it better suits greenfield projects, but is not so good for a complex legacy database.
Is this true?
Also, Hibernate generates queries.
Every project manager would like to have optimized queries (Hibernate cannot generate queries more optimized than an SQL specialist can!), so for a big project it is not a problem to hire an SQL specialist. The SQL specialist will optimize the queries (use EXPLAIN, use joins, ...).
My question is: how come a huge and expensive project does not care about SQL optimization?
(You will say that you can write HQL, but from what I have seen in a lot of posts, HQL is not as powerful as SQL, and it gives a lot of programmers headaches and several hours of tuning.) (You like all the organs in your body to work ideally, don't you?)
Also, the second-level cache helps Hibernate a lot, because Hibernate knows how to generate a lot of queries instead of one complex join.
My question is: is a complex DB really modified by only one system (for example, the web site)? If we are talking about an enterprise system, the DB can be accessed by several processes, across different programming languages and platforms.
So in that case the second-level cache does not help very much.
What kinds of projects is Hibernate suitable for?
Is it for back-office projects where nobody cares about the SQL?
What happens when your administrator says: please use memcached for caching, and please use these optimized queries instead of yours?
If you are using an Oracle database: Oracle has the most advanced SQL syntax, and they have spent a lot of time and money on a syntax that is very powerful. What is this syntax for if it is not used?
Software is written only once (and then maintained) and used for a long time.
If I am a company that orders software, I will say: I will use the software for a couple of years, and I want it to be fast. If you spend one month writing the software with Hibernate, I will pay for one more month of work on software that uses, for example, iBATIS, knowing that it will work better for years.
(When you are buying a car you are interested in the car's fuel economy, not in how quickly or easily the manufacturer produced it!) So as a software consumer I am not interested in your productivity, just in how fast the software is. Of course the price is also relevant, but if we are speaking about price the mathematics gets more complex.
Can we call something engineering when we really cannot predict some part of the system?
(Can an electrical engineer really be called an engineer if he cannot predict the current?)
Please share your opinion.
Regards
1) (...) Is this true?
No it isn't. Hibernate can deal with pretty complex databases, including existing ones. However, it might not deal very well with a heavily denormalized database or an exotic schema. That is a different matter.
2) (...) My question is how come a huge and expensive project does not care about sql optimization?
This is nonsense; using Hibernate doesn't mean you don't care about optimization. I have worked on a huge and complex STP system (several hundred million € budget) where performance was definitely an important concern, and we actually introduced Hibernate to benefit from things like lazy loading and the second-level cache (and to speed up development).
Here is the deal when using an ORM like Hibernate (when suitable):
You'll be done faster with an ORM than without one (or there wouldn't be any point in using them).
The vast majority of the generated queries will behave correctly (and the fact is that Hibernate generates better SQL than the average developer).
You can (and have to) tune queries and Hibernate to a certain degree.
Even if you spend some time on performance optimization (including falling back to native SQL for really problematic queries), you'll still be done faster.
3) (...) So in this case the second level cache does not help very much.
Well, you are right that using the second-level cache ideally means using Hibernate APIs (although you can still evict the cache "manually", and although I tend to prefer using it for "mostly read" entities). But, more importantly, in my experience sharing data between many applications through the database just leads to unmaintainable applications (changing a single bit becomes impossible, as it may impact several applications) and should be avoided. Use an EAI/ESB and expose services of the main system through it. This way, you can reuse the business logic, the second-level cache, etc.
4) (...) For what kind of projects hibernate is suitable for? Is it for back office projects where nobody cares about the sql ?
Hibernate is indeed very nice for CRUD applications, but not only for those (see above), and your question shows some misconceptions, as I already said. However, it isn't suitable for every project:
I would probably not use it for a data warehouse or a big reporting application.
I might not use it with a heavily denormalized or exotic legacy database (a data mapper like mybatis might be a better choice in this case).
I might not use it with an existing system using stored procedure for everything.
I would not use it with a non RDBMS datastore :)
5) (...) What happens when your administrator says: please use memcached for caching and please use this optimized queries instead of yours?
I tell him that memcached is maybe not the best solution in our context (no, I don't want to always send my data over the wire and I don't care that Facebook/LiveJournal/Twitter/whatever are using it, our app might have different needs), there are other better cache implementations when working with Hibernate, I ask him to discuss problems with me and we discuss the various solutions, etc. We work as a team, not against each other.
To sum up, ORM solutions are not always suitable but I think that you currently have a biased opinion and my experience is different from the opinions (misbeliefs?) expressed in your question.
See also
When NOT to use O/R mapping in Java
It's good for green field projects, but it's also good for legacy projects. You may need to do some mapping tricks, but it offers reasonably flexible mapping.
Since you can use native queries, and since you can integrate it with your favorite caching solution, you don't need to suffer any performance problems just because you're using Hibernate. When your DB administrator says that you should use memcached, you can use this memcached/Hibernate integration. You can write a caching implementation using your favorite cache and plug it into Hibernate. When she says you should use this optimized query, you say "great! Hibernate has a native SQL facility that will let me use that query". You can use native Oracle syntax, you can use the native syntax of whatever RDBMS you've chosen.
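For instance - a minimal sketch; the Order entity and the hand-tuned SQL are invented - the classic Session API will run the DBA's exact query and still hand back mapped objects:

    import java.util.List;

    import org.hibernate.Session;

    public class TunedQueries {
        // The Order entity is assumed to be mapped elsewhere.
        @SuppressWarnings("unchecked")
        public List<Order> hotOrders(Session session) {
            // The hand-optimized SQL from your DBA, used verbatim.
            return session.createSQLQuery(
                    "SELECT o.* FROM orders o "
                    + "WHERE o.status = 'HOT' ORDER BY o.created_at DESC")
                .addEntity(Order.class)
                .list();
        }
    }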
A multiple-application environment poses the same challenges to Hibernate as it does to any solution. If you want your application to perform well, you will use what amounts to a second-level cache. Hibernate happens to offer an ORM that is integrated with the cache. It doesn't solve the problem of coordinating a cache across multiple applications, but you'll have to solve that problem even if you don't use Hibernate.
Your question is probably too broad. I can tell you about my experience.
I worked on a project that adopted the .NET version (NHibernate). A naive implementation of loading a single row from a single table was almost two orders of magnitude slower than a raw ADO query. After much optimization I believe they got it down to merely one order of magnitude slower.
In Java, startup time is probably less of a factor: the web server loads Java and Hibernate at server start, instead of while a user waits for a desktop app to start.
Personally I really dislike it. It hides implementation details that are necessary to efficiently manage your data. I've found no real world application that could perform acceptably with a vanilla implementation of a data layer that hides database details.
But that may be sour grapes on my part since I was forced to use it and blamed for
not being able to put enough lipstick on the pig.
It doesn't matter how complex the database is; the most important question is how complex the application's domain model is.
Is the query select * from anytable where anycol = #anyvalue optimized? I have no idea, and nobody does, because there is only one true criterion of optimization: the actual performance of such queries. You can save a lot of time with Hibernate or another ORM, then use that time to find the queries that are actually slow. As far as I know, Hibernate has ways to use optimized queries.
Your third question is good, but there is also no single answer to the question 'Is dirty data acceptable anytime, anywhere?'. Strictly speaking, until it is locked, any data read from the database is dirty, no matter how it was read or where it is stored. Locking is not good for performance, so usually you have to find a compromise between data freshness and performance.
There is no silver bullet. ORM has a lot of advantages, but there is one serious case where it is not suitable: dynamic result sets that depend on parameters (when different parameters return data with different column sets). Because object structure is static at compile time (in statically typed languages), ORM can't help in this case.
Every other case can be solved. Entity services (change tracking, etc.) can be turned off, the second-level cache can be disabled, and an optimized query can be used instead of a generated one. I have no idea how to do all those things in Hibernate, but I'm sure it is possible.
ORM has a great advantage: it concentrates all data access logic in a manageable form and puts it in a specific place. It also supports a few things that are not so easy or direct to implement in your own data access library, such as transaction management (including nested transactions, etc.), identity mapping (one row, one object), complex hierarchy persistence (if you use objects and object hierarchies), and optimistic locking, and an ORM can help you greatly with them.
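To make that last point concrete, here is a minimal sketch of optimistic locking with standard JPA annotations (the Account entity is invented): one annotated field, and the ORM does the rest.

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Version;

    @Entity
    public class Account {
        @Id
        private Long id;

        private long balance;

        // The provider increments this on every update and fails the commit
        // with an optimistic-lock exception if another transaction won the race.
        @Version
        private int version;

        protected Account() { } // required by JPA

        public Account(Long id, long balance) {
            this.id = id;
            this.balance = balance;
        }
    }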

Recommendations for an in memory database vs thread safe data structures

TLDR: What are the pros/cons of using an in-memory database vs locks and concurrent data structures?
I am currently working on an application that has many (possibly remote) displays that collect live data from multiple data sources and render it on screen in real time. One of the other developers has suggested the use of an in-memory database instead of doing it the standard way our other systems behave, which is to use concurrent hashmaps, queues, arrays, and other objects to store the graphical objects and handle them safely with locks where necessary. His argument is that the DB will lessen the need to worry about concurrency, since it will handle read/write locks automatically, and also that the DB offers an easier way to structure the data into as many tables as we need, instead of having to create hashmaps of hashmaps of lists, etc. and keeping track of it all.
I do not have much DB experience myself so I am asking fellow SO users what experiences they have had and what are the pros & cons of inserting the DB into the system?
Well a major con would be the mismatch between Java and a DB. That's a big headache if you don't need it. It would also be a lot slower for really simple access. On the other hand, the benefits would be transactions and persistence to the file system in case of a crash. Also, depending on your needs, it allows for querying in a way that might be difficult to do with a regular Java data structure.
For something in between, I would take a look at Neo4j. It is a pure Java graph database. This means that it is easily embeddable, handles concurrency and transactions, scales well, and does not have all of the mismatch problems that relational DBs have.
Update: If your data structure is simple enough - a map of lists, a map of maps, something like that - you can probably get away with either the concurrent collections in the JDK or Google Collections, but much beyond that and you will likely find yourself recreating an in-memory database. And if your query constraints are even remotely difficult, you're going to have to implement all of those facilities yourself. And then you'll have to make sure they work concurrently, etc. If this requires any serious complexity or scale (large datasets), I would definitely not roll your own unless you really want to commit to it.
If you do decide to go with an embedded DB, there are quite a few choices. You might want to start by considering whether you want to go the SQL or the NoSQL route. Unless you see real benefits in going SQL, I think it would also greatly add to the complexity of your app. Hibernate is probably your easiest route with the least actual SQL, but it's still kind of a headache. I've done it with Derby without serious issues, but it's still not straightforward. You could try db4o, which is an object database that can be embedded and doesn't require mapping. This is a good overview. Like I said before, if it were me, I would likely try Neo4j, but that could just be me wanting to play with new and shiny things ;) I just see it as being a very transparent library that makes sense. Hibernate/SQL and db4o just seem like too much hand-waving to feel lightweight.
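To give a feel for the embedded-SQL route, a minimal Derby sketch (the in-memory JDBC URL and table are illustrative; the memory: subprotocol needs a reasonably recent Derby):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class EmbeddedDerbyDemo {
        public static void main(String[] args) throws Exception {
            // "memory:" keeps the whole database in RAM; drop it for on-disk storage.
            Connection conn = DriverManager.getConnection("jdbc:derby:memory:live;create=true");
            Statement stmt = conn.createStatement();
            stmt.executeUpdate("CREATE TABLE readings (source VARCHAR(32), value DOUBLE)");
            stmt.executeUpdate("INSERT INTO readings VALUES ('sensor-a', 41.5)");
            ResultSet rs = stmt.executeQuery(
                    "SELECT source, AVG(value) FROM readings GROUP BY source");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }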
You could use something like Space4J and get the benefits of both a collections-like interface and an in-memory database. In practical terms, something as basic as a Collection is an in-memory database with no index. A List is an in-memory database with a single int index. A Map is an in-memory database with a single index of type T, and no concurrency unless it is synchronized or a java.util.concurrent.* implementation.
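To make the analogy concrete (a sketch with invented types), even one extra "index" over a concurrent map means maintaining two structures in lock-step by hand, which is exactly where home-grown stores start turning into databases:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class ReadingStore {
        public static final class Reading {
            final String id;
            final String source;
            final double value;
            Reading(String id, String source, double value) {
                this.id = id; this.source = source; this.value = value;
            }
        }

        // Primary "table": id -> reading.
        private final Map<String, Reading> byId = new ConcurrentHashMap<String, Reading>();
        // Secondary "index": source -> ids. Kept in step with byId by hand; a lookup
        // here plus a fetch from byId is not atomic without extra locking.
        private final Map<String, Set<String>> idsBySource = new ConcurrentHashMap<String, Set<String>>();

        public void put(Reading r) {
            byId.put(r.id, r);
            Set<String> ids = idsBySource.get(r.source);
            if (ids == null) {
                Set<String> fresh = Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
                Set<String> prev = idsBySource.putIfAbsent(r.source, fresh);
                ids = (prev != null) ? prev : fresh;
            }
            ids.add(r.id);
        }
    }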
I once worked on a project that used Oracle TimesTen. This was back in early 2006, when Java 5 was just released and the java.util.concurrent classes were barely known. The system we developed had reasonably big scalability and throughput requirements (it was one of the core telco boxes for SMS/MMS messaging).
Briefly speaking, reasoning for TimesTen was fair: "let's outsource our concurrency/scalability problems to somebody else and focus on our business domain" and made perfect sense then. But this was back in 2006. I don't think such a decision would be made today.
Concurrency is hard, but so is handling an in-memory database. To free yourself of concurrency problems, you have to become an expert in the in-memory database world. Fine-tuning TimesTen for replication is hard (we had to hire a professional consultant from Oracle to do it). Licenses don't come for free. You also need to worry about an additional layer which is not open source and/or might be written in a different language than the one you understand.
But it is really hard to make any judgement without knowing your experience, budget, time requirements, etc. Do some shopping around, spend some time looking into decent concurrency frameworks (such as http://akkasource.org/) ...and let us know what you have decided ;)
Below are a few questions that could facilitate a decision.
Queries - do you need to query/reproject/aggregate your data in different forms?
Transactions - do you ever need to rollback added data?
Persistence - do you only need to present the gathered data or do you also need to store it in some way?
Scalability - will your data always fit in the memory?
Performance - how fast should it be?
It is unclear to me why you feel that an in memory database cannot be thread safe.
Why don't you look at JDO and DataNucleus? They support a lot of different datastores and let you plug in your back-end persistence provider at run time as a configuration step. Your application code depends on an ORM, but that ORM might be plugged into an RDBMS, DB4O, NeoDatis, LDAP, etc. If one backend doesn't work for you, switch to another.

Easy way to store and retrieve objects in Java without using a relational DB? [closed]

Do you know of an "easy" way to store and retrieve objects in Java without using a relational DB / ORM like Hibernate?
[Note that I am not considering serialization as-is for this purpose, as it won't allow to retrieve arbitrary objects in the middle of an object graph. Neither am I considering DB4O because of its restrictive license. Thanks.]
"Easy" meaning: not having to handle low-level details such as key/value pairs to rebuild an object graph (as with BerkeleyDB or traditional caches). The same applies for rebuilding objects from a document- or column-oriented DB (CouchDB, HBase, ..., even Lucene).
Perhaps there are interesting projects out there that provide a layer of integration between the mentioned storage systems and the object model (like ORM would be for RDBMSs) that I am not aware of.
Anyone successfully using those in production, or experimenting with persistence strategies other than relational DBs? How about RDF stores?
Update: I came across a very interesting article: A list of distributed key-value stores
Object Serialization (aka storing things to a file)
Hibernate (uses a relational database but it is fairly transparent to the developer)
I would suggest Hibernate because it will deal with most of the ugly details that bog developers down when using a database while still allowing for the optimizations that have been made to database software over the years.
NeoDatis looks interesting. It is licensed under the LGPL, so not quite as restrictive as the GPL proper.
Check out their 1 minute tutorial to see if it will work for your needs.
I would like to recommend XStream, which simply takes your POJOs and creates XML out of them so you can store it on disk. It is very easy to use and is also open source.
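A minimal sketch of that round trip (the Person class is invented for the example):

    import com.thoughtworks.xstream.XStream;

    public class XStreamDemo {
        public static class Person {
            String name;
            int age;
            Person(String name, int age) { this.name = name; this.age = age; }
        }

        public static void main(String[] args) {
            XStream xstream = new XStream();
            xstream.alias("person", Person.class); // nicer tag than the full class name

            String xml = xstream.toXML(new Person("Ada", 36));
            System.out.println(xml); // <person><name>Ada</name><age>36</age></person>

            Person back = (Person) xstream.fromXML(xml);
            System.out.println(back.name + ", " + back.age);
        }
    }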
I'd recommend Hibernate (or, more generally, OR mapping) like Matt, but there is also an RDBMS at the backend, and I'm not so sure about what you mean by
...without using a relational DB?...
It would also be interesting to know more about the application, because OR mapping is not always a good idea (development performance vs. runtime performance).
Edit: I recently learned about Terracotta, and there is a good Stack Overflow discussion here about replacing DBs with that tool. Still experimental, but worth reading.
I still think you should consider paying for db4o.
If you want something else, add "with an MIT-style license" to the title.
Check out the comments on Prevayler in this question. Prevayler is a transactional wrapper around object serialization - roughly, you use objects in plain Java and persist to disk through a Java API without SQL, a bit neater than writing your own serialization.
Caveats: with serialization as a persistence mechanism, you run the risk of invalidating your saved data when you update the class. Even with a wrapper library you'll probably want to customize the serialization/deserialization handling. It also helps to include a serialVersionUID in the class so you override the JVM's idea of when the class is updated (and therefore can't reload your saved serialized data).
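A minimal sketch of pinning that version (the class is invented for illustration):

    import java.io.Serializable;

    public class UserProfile implements Serializable {
        // Pin the version explicitly: without this, the JVM derives an ID from the
        // class shape, and almost any edit to the class invalidates saved data.
        private static final long serialVersionUID = 1L;

        private String name;
        private transient String cachedDisplayName; // excluded from persistence

        public UserProfile(String name) {
            this.name = name;
        }
    }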
Hmm... without serialization, and without an ORM solution, I would fall back to some sort of XML-based implementation. You'd still have to design it carefully if you want to pull out only some of the objects from the object graph - perhaps a different file for each object, where object relationships are referenced by a URI to another file?
I would have said that wasn't "easy" because I've always found designing the mapping of XML to objects to be somewhat time consuming, but I was really inspired by a conversation on Apache Betwixt that has me feeling hopeful that I'm just out of date, and easier solutions are now available.
Terracotta provides a highly available, highly scalable persistent-to-disk object store. You can use it for just this feature alone - or you can use its breadth of features to implement a fully clustered application - your choice.
Terracotta:
does not break object identity giving you the most natural programming interface
does not require Serialization
clusters (and persists) nearly all Java classes (Maps, Locks, Queues, FutureTask, CyclicBarrier, and more)
persists objects to disk at memory speeds
moves only object deltas, giving very high performance
Here's a case study about how Gnip uses Terracotta for in-memory persistence - no database. Gnip takes in all of the events on Facebook, Twitter, and the like and produces them for consumers in a normalized fashion. Their current solution is processing in excess of 50,000 messages/second.
It's OSS and has a high degree of integration with many other 3rd party frameworks including Spring and Hibernate.
I guess I have found a sort of answer to my question.
Getting into the document-oriented paradigm's mindset is no easy task when you have always thought of your data in terms of relationships, normalization and joins.
CouchDB seems to fit the bill. It could still act as a key-value store, but its great querying capabilities (map/reduce, view collation), concurrency readiness and language-agnostic HTTP access make it my choice.
The only glitch is having to correctly define and map JSON structures to objects, but I'm confident I will come up with a simple solution for use with relational models from Java and Scala (and worry about caching later on, as contention moves away from the database). Terracotta could still be useful, but certainly not as in an RDBMS scenario.
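For that JSON-to-object mapping, a minimal sketch with Jackson (the College shape is invented, and the com.fasterxml coordinates are those of later Jackson versions):

    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JsonMappingDemo {
        // An invented document shape for illustration.
        public static class College {
            public String name;
            public String city;
        }

        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            College c = mapper.readValue(
                    "{\"name\":\"MIT\",\"city\":\"Cambridge\"}", College.class);
            System.out.println(c.name + " / " + c.city);
            System.out.println(mapper.writeValueAsString(c)); // back to JSON
        }
    }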
Thank you all for your input.
