I want to create a Java application for handling and analyzing live streaming logs, and I also have to implement some complex filter functionality. I have been researching which database is best suited for this.
I came across many portable databases like MongoDB, HBase, H2, and so on. Among them, MongoDB seems to be the best candidate. But for my requirements, insertions and selections may happen at the same time, and somewhere I read that MongoDB is not the best at handling concurrency.
I'm sure that, moving forward, the performance of the database is going to play a crucial role in the overall performance of the application.
I came across many Stack Overflow links about this, but all of them were asked two or more years ago.
Can MongoDB handle concurrency? Is there any other portable database that is better than MongoDB for this use case?
Please help.
Have you looked at a solution like Elasticsearch coupled with Kibana and td-agent?
It provides asynchronous logging. I've used it to store and analyze 30 million events per day from several servers, but it depends on what you want to do in the end.
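As a side note on the concurrency question itself: the MongoDB Java driver's `MongoClient` is documented to be thread-safe and meant to be shared, so a small smoke test against your own workload is easy to put together. A minimal sketch (the connection string, database/collection names, and thread counts are all placeholders, not anything from the question):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MongoConcurrencySmokeTest {
    public static void main(String[] args) throws InterruptedException {
        // MongoClient instances are thread-safe and intended to be shared.
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> logs =
                client.getDatabase("logdb").getCollection("logs");

        ExecutorService pool = Executors.newFixedThreadPool(8);
        // Half the threads insert log entries, the other half query them.
        for (int i = 0; i < 4; i++) {
            final int writer = i;
            pool.submit(() -> {
                for (int n = 0; n < 10_000; n++) {
                    logs.insertOne(new Document("source", "writer-" + writer)
                            .append("seq", n));
                }
            });
            pool.submit(() -> {
                for (int n = 0; n < 1_000; n++) {
                    logs.find(Filters.eq("source", "writer-0")).first();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
        client.close();
    }
}
```

A test like this won't settle the locking question on its own, but it gives you numbers for your own insert/select mix, which matters more than benchmarks from two-year-old posts.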
How do I test Pivotal GemFire with more than 4000 concurrent users inserting data into a GemFire region and the same number of concurrent users reading data from that region?
Reading data from the region can happen after the insertion operation or in parallel with it.
Can you please suggest a good solution for this?
The question is a bit ambiguous.
If you're purely looking at benchmarking GemFire, then the YCSB framework would be a good place to start, as it provides standardized tests across various IMDG and RDBMS systems.
If you are looking for tools for your own app, then I'd suggest looking at JMeter. You'll obviously need to provide some custom code in order to do puts and gets, but it will give you many other capabilities, such as being able to scale your test and quantify the results.
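For instance, a JMeter Java Request sampler wrapping GemFire puts and gets might look roughly like this sketch (the class, region name, and locator details are illustrative; recent GemFire/Geode releases use the org.apache.geode packages, while older ones use com.gemstone.gemfire):

```java
import org.apache.geode.cache.Region;
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;
import org.apache.geode.cache.client.ClientRegionShortcut;
import org.apache.jmeter.protocol.java.sampler.AbstractJavaSamplerClient;
import org.apache.jmeter.protocol.java.sampler.JavaSamplerContext;
import org.apache.jmeter.samplers.SampleResult;

public class GemFirePutGetSampler extends AbstractJavaSamplerClient {
    private ClientCache cache;
    private Region<String, String> region;

    @Override
    public void setupTest(JavaSamplerContext context) {
        synchronized (GemFirePutGetSampler.class) {
            // create() returns the existing JVM-wide cache on later calls,
            // so all JMeter threads end up sharing one client cache.
            cache = new ClientCacheFactory()
                    .addPoolLocator("localhost", 10334)
                    .create();
            region = cache.getRegion("testRegion");
            if (region == null) {
                region = cache.<String, String>createClientRegionFactory(
                        ClientRegionShortcut.PROXY).create("testRegion");
            }
        }
    }

    @Override
    public SampleResult runTest(JavaSamplerContext context) {
        SampleResult result = new SampleResult();
        result.sampleStart();
        String key = "key-" + Thread.currentThread().getId();
        region.put(key, "value");  // timed write
        region.get(key);           // timed read
        result.sampleEnd();
        result.setSuccessful(true);
        return result;
    }

    @Override
    public void teardownTest(JavaSamplerContext context) {
        cache.close(); // a no-op if another thread already closed it
    }
}
```

You would package this on JMeter's classpath, pick it from the Java Request sampler, and let JMeter's thread groups drive the 4000 readers and writers.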
If you're looking for suggestions on a GemFire architecture to support the scale of your test then you'll need to provide more details as to the functional and non-functional requirements of your application.
The team I work on is fortunate enough to have management that recognizes the need to enhance our skills and learn new technologies. As a result, whenever we have a little downtime between major projects, we are encouraged to use that time to stretch our minds a bit and learn something new. We often tackle a large research project as a team so that everyone benefits from the knowledge. For example, we built a spec-compliant Kerberos authentication server to get familiar with the ins and outs of the protocol. We wrote our own webserver to learn about efficient design strategies for networked applications.
Recently, we've been very curious about MapReduce, specifically Hadoop and the various supporting components (HBase, HDFS, Pig, Hive, etc.). To learn a bit more about it, we would like to write a web analytics service. It will use JavaScript page tagging to gather the metrics, and Hadoop, plus something else, to make analytics and reports available via a web interface.
The non-Hadoop side of the architecture is easy. A Java servlet will parse the parameters from the JavaScript tag (easy enough -- we're a Java shop). The servlet will then send out a JMS message for asynchronous processing (again, easy).
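For illustration, the servlet side could be as small as this sketch (the queue name and factory lookup are assumptions; a real implementation would pool JMS connections rather than open one per request):

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class MetricsTagServlet extends HttpServlet {

    // In a real app this comes from JNDI or injection, and connections
    // are pooled instead of opened per request.
    private ConnectionFactory connectionFactory;

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        Connection conn = null;
        try {
            conn = connectionFactory.createConnection();
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer =
                    session.createProducer(session.createQueue("metrics.raw"));
            // The JS tag passes metrics as query parameters; forward them raw.
            producer.send(session.createTextMessage(req.getQueryString()));
        } catch (JMSException e) {
            // A lost metric shouldn't break the page view; log and move on.
        } finally {
            if (conn != null) {
                try { conn.close(); } catch (JMSException ignored) { }
            }
        }
        resp.setStatus(HttpServletResponse.SC_NO_CONTENT);
    }
}
```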
My question is... What next? We've researched things like Hive a bit, and it sounds like a great fit for querying the datastore for the various metrics we're looking for. But, it's high latency. We're fortunate enough to be able to drop this onto a website that gets a few million hits per month. We'd really like to get relatively quick metrics using the web interface for our analytics tool. Latency is not our friend. So, what is the best way to accomplish this? Would it be to run the queries as a scheduled job and then store the results somewhere with lower latency (PostgreSQL, etc.) and retrieve them from there? If that's the case, where should the component listening for the JMS messages store the data? Can Hive get its data from HBase directly? Should we store it in HDFS somewhere and read it in Hive?
Like I said, we're a very technical team and love learning new technologies. This, though, is way different from anything we've learned before, so we'd like to get a sense of what the "best practices" would be here. Any advice or opinions you can give are GREATLY appreciated!
EDIT: I thought I'd add some clarification about what I'm looking for. I'm seeking advice on the architecture and design of a solution such as this. We'll collect 20-30 different metrics on a site that gets several million page views per month. This will be a lot of data, and we'd like to get metrics in as close to real time as possible. I'm looking for best practices and advice on the architecture of such a solution, because I don't want us to come up with something really bad on our own and be left thinking we're "Hadoop experts" just because it works.
Hive, as you mentioned, has high latency for queries. It can be pointed at HBase (see https://cwiki.apache.org/Hive/hbaseintegration.html), but the integration forces HBase tables into a mostly rectangular, relational-like schema that is not optimal for HBase. The overhead of doing it is also extremely costly: on my cluster, Hive queries against HBase are at least an order of magnitude slower than against plain HDFS files.
One good strategy is to store the raw metrics in HBase or on plain HDFS (you might want to look at Flume if these metrics are coming from log files) and run periodic MapReduce jobs (even every 5 minutes) to create pre-aggregated results that you can store in plain rectangular files and query through Hive. When you are just reading a file and Hive doesn't have to do anything fancy (e.g. sorting, joining, etc.), Hive is actually reasonably low latency: it doesn't run MapReduce, it just streams the file's contents out to you.
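As a rough sketch of that pre-aggregation job (the tab-separated input layout and field positions are assumptions), it can be little more than a word count keyed on metric name plus a five-minute time bucket, whose output files you then expose as a Hive table:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MetricRollup {

    // Assumes tab-separated raw events: epochMillis \t metricName \t ...
    public static class BucketMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            long bucket = Long.parseLong(fields[0]) / 300_000 * 300_000; // 5 min
            ctx.write(new Text(fields[1] + "\t" + bucket), ONE);
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            ctx.write(key, new LongWritable(sum)); // metric \t bucket \t count
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "metric-rollup");
        job.setJarByClass(MetricRollup.class);
        job.setMapperClass(BucketMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The resulting tab-separated files can be mounted as an external Hive table, so the web tier's queries become simple scans over small pre-aggregated data.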
Finally, another option is to use something like Storm (which runs on Hadoop) to collect and analyze data in real time, then either store the results for querying as mentioned above, or store them in HBase for display through a custom user interface that queries HBase directly.
I am currently testing a few different relational database management systems (MySQL, PostgreSQL, Oracle, and Firebird SQL), using a Java application to do so.
I was debating what tests I should run to distinguish the performance of each.
The obvious ones would be insert, select, delete and drop.
I would love to hear your opinions, and to make this fit the question-and-answer format I shall ask for the five most appropriate tests to indicate performance differences. In an ideal world I would like to mimic real-world use.
Thanks to all who answer.
I think that any of them would probably be fine. However, how you configure the different databases for what you are trying to do will differ based on your application.
Suggested place to start: look for apps similar to yours. See what they are using, if you can. Then start testing the different databases with similar configurations and see what works for you.
Personally, I've used Oracle, MySQL, and Postgres over the last 11 years, and they've all worked well. It's really all in your configuration, which is where a good DBA comes in handy.
Here are the results of a fairly extensive benchmark of JPA providers and RDBMSs. You can either use the data they provide, or you can download their code and run it yourself.
Test concurrency. In other words, what happens under various locking scenarios? Ideally you would like to test under conditions as close to the real world as possible, with multiple users using the system as it was meant to be used. See my answer to this SO question.
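A bare-bones harness for that kind of concurrency test might look like the following sketch (the JDBC URL, credentials, table, and thread counts are placeholders; swap the URL and driver for each RDBMS under test):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrencyBenchmark {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(16);
        long start = System.nanoTime();
        for (int t = 0; t < 16; t++) {
            final boolean writer = t % 2 == 0; // half writers, half readers
            pool.submit(() -> {
                // One connection per thread; real tests would use a pool.
                try (Connection c = DriverManager.getConnection(
                        "jdbc:postgresql://localhost/bench", "bench", "bench")) {
                    for (int i = 0; i < 10_000; i++) {
                        if (writer) {
                            try (PreparedStatement ps = c.prepareStatement(
                                    "INSERT INTO events(payload) VALUES (?)")) {
                                ps.setString(1, "event-" + i);
                                ps.executeUpdate();
                            }
                        } else {
                            try (PreparedStatement ps = c.prepareStatement(
                                    "SELECT count(*) FROM events");
                                 ResultSet rs = ps.executeQuery()) {
                                rs.next();
                            }
                        }
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        System.out.printf("elapsed: %d ms%n",
                (System.nanoTime() - start) / 1_000_000);
    }
}
```

The interesting numbers are how throughput degrades as you skew the reader/writer ratio, which is exactly where the locking behavior of each system shows up.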
I have several questions about Hibernate.
In many questions here on Stack Overflow, several people say that Hibernate is not a good choice for very complex databases: it suits green-field projects well, but it is not so good for complex legacy databases.
Is this true?
Also, Hibernate generates queries.
Every project manager would like to have optimized queries (Hibernate cannot generate more optimized queries than an SQL specialist!), and for a big project it is not a problem to hire an SQL specialist who will optimize the queries (use EXPLAIN, use joins, ...).
My question is: how come a huge and expensive project does not care about SQL optimization?
(You will say that you can write HQL, but as I have seen in a lot of posts, HQL is not as powerful as SQL, and a lot of programmers get headaches and several hours of tuning from it.) (You would like all the organs in your body to work ideally, wouldn't you?)
Also, the second-level cache helps Hibernate a lot, because Hibernate tends to generate many queries instead of one complex join.
My question is: is a complex database really modified by only one system (for example, the web site)? If we are talking about an enterprise system, the database can be accessed by several processes, across different programming languages and platforms.
So in this case the second-level cache does not help very much.
What kind of projects is Hibernate suitable for?
Is it for back-office projects where nobody cares about the SQL?
What happens when your administrator says: please use memcached for caching, and please use these optimized queries instead of yours?
If you are using an Oracle database, Oracle has the most advanced SQL syntax. They have spent a lot of time and money on syntax that is very powerful. What is this syntax for, if it is not used?
The software is written only once (and then maintained) and used for a long time.
If I am a company that orders software, I will say: I will use the software for a couple of years and I want it to be fast, and if you spend one month writing the software with Hibernate, I will pay for one more month for software that uses, for example, iBATIS, knowing that it will work better for years.
(When you are buying a car, you are interested in the car's fuel economy, not in how quickly and easily the manufacturer produced it!) So as a software consumer I am not interested in your productivity, just in how fast the software is. Of course the price is relevant too, but when we are speaking about price there is more complex mathematics involved.
Can we call something engineering when we really cannot predict some part of the system?
(Can an electrical engineer really be called an engineer if he cannot predict the current?)
Please share your opinion.
Regards
1) (...) Is this true?
No it isn't. Hibernate can deal with pretty complex databases, including existing ones. However, it might not deal very well with a heavily denormalized database or an exotic schema, which is a different matter.
2) (...) My question is: how come a huge and expensive project does not care about SQL optimization?
This is nonsense; using Hibernate doesn't mean you don't care about optimization. I have worked on a huge and complex STP system (several hundred million € budget) where performance was definitely an important concern, and we actually introduced Hibernate to benefit from things like lazy loading and the second-level cache (and to speed up development).
Here is the deal when using an ORM like Hibernate (when suitable):
You'll be done faster with an ORM than without one (or there wouldn't be any point in using them).
The vast majority of the generated queries will behave correctly (and the fact is that Hibernate generates better SQL than the average developer writes).
You can (and have to) tune queries and Hibernate to a certain degree.
Even if you spend some time on performance optimization (including falling back to native SQL for really problematic queries, as sketched below), you'll still be done faster.
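To make that last point concrete, dropping down to native SQL for one hot spot while keeping Hibernate everywhere else is straightforward. A sketch (the entity, SQL, and parameters are made up for illustration; older Hibernate versions use createSQLQuery instead of createNativeQuery):

```java
import java.util.List;
import org.hibernate.Session;

public class OrderReports {
    // Assumes an open Session and a mapped Order entity.
    public List<Order> bigOrders(Session session) {
        return session
                .createNativeQuery(
                        "SELECT o.* FROM orders o "
                      + "JOIN customers c ON c.id = o.customer_id "
                      + "WHERE c.region = :region AND o.total > :minTotal",
                        Order.class)
                .setParameter("region", "EMEA")
                .setParameter("minTotal", 1000)
                .list();
    }
}
```

The hand-tuned SQL can come straight from your SQL specialist; Hibernate just maps the rows back onto entities.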
3) (...) So in this case the second level cache does not help very much.
Well, you are right that using the second-level cache ideally means going through the Hibernate APIs (although you can still evict cache entries "manually", and although I tend to prefer using it for "mostly read" entities). But, more importantly, in my experience sharing data between many applications through the database just leads to unmaintainable applications (changing a single bit becomes impossible as it may impact several applications) and should be avoided. Use an EAI/ESB and expose services of the main system through it. This way, you can reuse the business logic, the second-level cache, etc.
4) (...) What kind of projects is Hibernate suitable for? Is it for back-office projects where nobody cares about the SQL?
Hibernate is indeed very nice for CRUD applications, but not only for those (see above), and your question shows some ignorance, as I already said. However, it isn't suitable for every project:
I would probably not use it for a data warehouse or a big reporting application.
I might not use it with a heavily denormalized or exotic legacy database (a data mapper like mybatis might be a better choice in this case).
I might not use it with an existing system using stored procedure for everything.
I would not use it with a non RDBMS datastore :)
5) (...) What happens when your administrator says: please use memcached for caching, and please use these optimized queries instead of yours?
I tell him that memcached is maybe not the best solution in our context (no, I don't want to always send my data over the wire, and I don't care that Facebook/LiveJournal/Twitter/whatever are using it; our app might have different needs), that there are better cache implementations when working with Hibernate, and I ask him to discuss the problems with me so we can weigh the various solutions, etc. We work as a team, not against each other.
To sum up, ORM solutions are not always suitable, but I think that you currently have a biased opinion, and my experience differs from the opinions (misbeliefs?) expressed in your question.
See also
When NOT to use O/R mapping in Java
It's good for green-field projects, but it's also good for legacy projects. You may need to do some mapping tricks, but it offers reasonably flexible mapping.
Since you can use native queries, and since you can integrate it with your favorite caching solution, you don't need to suffer any performance problems just because you're using Hibernate. When your DB administrator says that you should use memcached, you can use a memcached/Hibernate integration, or you can write a caching implementation using your favorite cache and plug it into Hibernate. When she says you should use this optimized query, you say: "Great! Hibernate has a native SQL facility that will let me use that query." You can use native Oracle syntax, or the native syntax of whatever RDBMS you've chosen.
A multiple-application environment poses the same challenges to Hibernate as it does to any solution. If you want your application to perform well, you will use what amounts to a second-level cache. Hibernate happens to offer an ORM that is integrated with the cache. It doesn't solve the problem of coordinating a cache across multiple applications, but you'll have to solve that problem even if you don't use Hibernate.
Your question is probably too broad. I can tell you about my experience.
I worked on a project that adopted the .NET version (NHibernate). A naive implementation of loading a single row from a single table was almost two orders of magnitude slower than a raw ADO query. After much optimization I believe they got it down to merely one order of magnitude slower.
In Java, startup time is probably less of a factor: the web server loads Java and Hibernate at server start, instead of while a user waits for a desktop app to start.
Personally I really dislike it. It hides implementation details that are necessary to efficiently manage your data. I've found no real world application that could perform acceptably with a vanilla implementation of a data layer that hides database details.
But that may be sour grapes on my part, since I was forced to use it and blamed for not being able to put enough lipstick on the pig.
No matter how complex the database is, the more important question is how complex the application's domain model is.
Is the query select * from anytable where anycol = #anyvalue optimized? I have no idea, and nobody does, because there is only one true criterion of optimization: the actual performance of such queries. You can save a lot of time with Hibernate or another ORM, then use that time to find the actually slow queries. As far as I know, Hibernate has ways to substitute hand-optimized queries.
Your third question is good too, but there is also no single answer to the question "is dirty data acceptable, every time and everywhere?". Strictly speaking, until it is locked, any data read from the database is dirty, no matter how it was read or where it was stored. Locking is not good for performance, so usually you have to find a compromise between data freshness and performance.
There is no silver bullet. ORM has a lot of advantages, but there is one serious case where it is not suitable: dynamic result sets that depend on parameters (when different parameters return data with different column sets). Because object structure is static at compile time (in statically typed languages), an ORM can't help in this case.
Every other case can be solved. Entity services (change tracking, etc.) can be turned off, the second-level cache can be disabled, and optimized queries can be used instead of generated ones. I have no idea how to do all those things in Hibernate, but I'm sure it is possible.
ORM has a great advantage: it concentrates all data access logic in a manageable form and puts it in a specific place. It also supports a few things that are not so easy or direct to implement in your own data access library, like transaction management (including nested transactions, etc.), identity mapping (one row, one object), persisting complex object hierarchies, optimistic locking, and so on, and an ORM can greatly help you with them.
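As an example of how little ceremony some of these features need, optimistic locking in JPA/Hibernate is essentially one annotation (the entity is illustrative):

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class Account {
    @Id
    private Long id;

    private long balance;

    // The provider increments this on every update and raises an
    // optimistic-lock failure if a concurrent transaction got there first.
    @Version
    private int version;
}
```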
TLDR: What are the pros/cons of using an in-memory database vs locks and concurrent data structures?
I am currently working on an application that has many (possibly remote) displays that collect live data from multiple data sources and render it on screen in real time. One of the other developers has suggested using an in-memory database instead of doing it the standard way our other systems behave, which is to use concurrent hashmaps, queues, arrays, and other objects to store the graphical objects and handle them safely with locks where necessary. His argument is that the DB will lessen the need to worry about concurrency, since it will handle read/write locks automatically, and also that the DB will offer an easier way to structure the data into as many tables as we need, instead of having to create hashmaps of hashmaps of lists, etc., and keeping track of it all.
I do not have much DB experience myself, so I am asking fellow SO users what experiences they have had, and what the pros and cons are of introducing a DB into the system.
Well, a major con would be the mismatch between Java and a DB: that's a big headache if you don't need it. It would also be a lot slower for really simple access. On the other hand, the benefits would be transactions and persistence to the file system in case of a crash. Also, depending on your needs, it allows querying in a way that might be difficult to do with a regular Java data structure.
For something in between, I would take a look at Neo4j. It is a pure Java graph database. This means that it is easily embeddable, handles concurrency and transactions, scales well, and does not have all of the mismatch problems that relational DBs have.
Updated: If your data structure is simple enough (a map of lists, a map of maps, something like that), you can probably get away with either the concurrent collections in the JDK or Google Collections. But much beyond that, and you will likely find yourself recreating an in-memory database; and if your query constraints are even remotely difficult, you're going to have to implement all of those facilities yourself, and then make sure they work concurrently, etc. If this requires any serious complexity or scale (large datasets), I would definitely not roll your own unless you really want to commit to it.
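For the simple shapes, a hedged sketch of the "just use the JDK" approach (names are illustrative; computeIfAbsent needs Java 8+):

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class LiveDataStore {
    // source name -> queue of readings; both structures are thread-safe.
    private final Map<String, Queue<Double>> readings = new ConcurrentHashMap<>();

    public void record(String source, double value) {
        // computeIfAbsent atomically creates the per-source queue on first use.
        readings.computeIfAbsent(source, s -> new ConcurrentLinkedQueue<>())
                .add(value);
    }

    public Double oldest(String source) {
        Queue<Double> q = readings.get(source);
        return (q == null) ? null : q.peek(); // head of the queue
    }
}
```

This is fine until you need cross-source queries or aggregation, which is exactly the point where the in-memory database starts to earn its keep.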
If you do decide to go with an embedded DB, there are quite a few choices. You might want to start by considering whether you want to go the SQL or the NoSQL route. Unless you see real benefits to going SQL, I think it would also greatly add to the complexity of your app. Hibernate is probably your easiest route with the least actual SQL, but it's still kind of a headache. I've done it with Derby without serious issues, but it's still not straightforward. You could try db4o, which is an object database that can be embedded and doesn't require mapping. This is a good overview. Like I said before, if it were me, I would likely try Neo4j, but that could just be me wanting to play with new and shiny things ;) I just see it as a very transparent library that makes sense. Hibernate/SQL and db4o just seem like too much hand-waving to feel lightweight.
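For reference, embedded Derby can even run fully in memory via the connection URL alone (the table and names are illustrative; this assumes derby.jar on the classpath, where JDBC 4 auto-loads the driver):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedDerbyDemo {
    public static void main(String[] args) throws Exception {
        // "memory:" keeps the database entirely in RAM; drop it for on-disk.
        try (Connection c = DriverManager.getConnection(
                "jdbc:derby:memory:displaydb;create=true")) {
            try (Statement s = c.createStatement()) {
                s.executeUpdate(
                    "CREATE TABLE readings(source VARCHAR(32), val DOUBLE)");
            }
            try (PreparedStatement ps = c.prepareStatement(
                    "INSERT INTO readings VALUES (?, ?)")) {
                ps.setString(1, "sensor-1");
                ps.setDouble(2, 42.0);
                ps.executeUpdate();
            }
            try (Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery(
                     "SELECT val FROM readings WHERE source = 'sensor-1'")) {
                while (rs.next()) System.out.println(rs.getDouble(1));
            }
        }
    }
}
```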
You could use something like Space4J and get the benefits of both a collections-like interface and an in-memory database. In practical terms, something as basic as a Collection is an in-memory database with no index; a List is an in-memory database with a single int index; and a Map is an in-memory database with a single index of type T, and with no concurrency unless it is synchronized or a java.util.concurrent implementation.
I was once working on a project that used Oracle TimesTen. This was back in early 2006, when Java 5 had just been released and the java.util.concurrent classes were barely known. The system we developed had reasonably big scalability and throughput requirements (it was one of the core telco boxes for SMS/MMS messaging).
Briefly speaking, the reasoning for TimesTen was fair: "let's outsource our concurrency/scalability problems to somebody else and focus on our business domain", and it made perfect sense then. But that was back in 2006; I don't think such a decision would be made today.
Concurrency is hard, but so is handling an in-memory database. To free yourself of concurrency problems, you'd have to become an expert in the in-memory database world. Fine-tuning TimesTen for replication is hard (we had to hire a professional consultant from Oracle to do it). Licenses don't come for free. You also need to worry about an additional layer that is not open source and/or might be written in a different language than the one you understand.
But it is really hard to make any judgement without knowing your experience, budget, time requirements, etc. Do some shopping around, spend some time looking into decent concurrency frameworks (such as http://akkasource.org/)... and let us know what you have decided ;)
Below are a few questions which could facilitate a decision.
Queries - do you need to query/reproject/aggregate your data in different forms?
Transactions - do you ever need to rollback added data?
Persistence - do you only need to present the gathered data or do you also need to store it in some way?
Scalability - will your data always fit in the memory?
Performance - how fast should it be?
It is unclear to me why you feel that an in memory database cannot be thread safe.
Why don't you look at JDO and DataNucleus? It supports many different datastores, and you get to plug in your back-end persistence provider at runtime as a configuration step. Your application code depends on an ORM, but that ORM might be plugged into an RDBMS, db4o, NeoDatis, LDAP, etc. If one backend doesn't work for you, then switch to another.