How to implement Java/Scala in-memory statistics database?

How to implement Java/Scala in-memory statistics database? - java

My need is to aggregate real time statistics of a web application server.
For example:
How many requests of content type X have been done
How long it takes to process request of type Y
And so on.
This data has to be completely in memory, not in a file, for best performance. It doesn't log each and every request but instead only stores counters of various aspects.
The most easy way I know is to store the values in a SQL-like table and do SQL-like queries. The benefit is that the indexing is coming off-the-shelf without development effort. I guess some embedded Java databases like Apache Derby would do the work.
The other way to go is to implement collection (say a list) and hash table for each "index column". This way it's all done with Java/Scala collections API, but I actually have to implement indexing mechanism myself, test it, maintain it, etc.
So my question is what way do you think is preferred, and if there are other ways to easily and quickly implement this feature?
Thanks.

I would choose H2 database, I have very positive experiences with it, performance is great as well.
Are you sure that SQL database is well suited for your needs, and have you looked at javamelody, to see if it suits your needs, or if it does not suit you take a look at JRobin for a rolling database implementation.

I would imagine you only need one collection per type of information you need to collection. To improve performance, simplify code I would use TObjectIntHashMap. e.g.
How many requests of content type X have been done
TObjectIntHashMap<ContentType> contentTypeCount
= new TObjectIntHashMap<ContentType>();
contentTypeCount.increment(contentType);
How long it takes to process request of type Y
TObjectLongHashMap<ProcessType> contentTypeTime
= new TObjectLongHashMap<ProcessType>();
contentTypeTime.adjustValue(processType, processTime);
I don't see how you can make it any shorter/simpler/faster by using the other approaches you mentioned.
The average time to perform increment(key) on my machines takes 15 ns (billionths of a second)

I also been noticed about Twitter Ostrich that is statistics library for Scala.
It contains counters, gauges and timing meters.
Data is accessible from HTTP REST API.

Related

Sorted Array vs Hashtable: Which data structure would be more efficient in searching over a range of dates in a calendar app?

I have about a year of experience in coding in Java. To hone my skills I'm trying to write a Calendar/journal entry desktop app in Java. I've realized that I still have no experience in data persistence and still don't really understand what the data persistence options would be for this program -- So perhaps I'm jumping the gun, and the design choices that I'm hoping to implement aren't even applicable once I get into the nitty gritty.
I mainly want to write a calendar app that allows you to log daily journal entries with associated activity logs for time spent on daily tasks. In terms of adding, editing and viewing the journal entries, using a hash table with the dates of the entries as keys and the entries themselves as the values seems most Big-Oh efficient (O(1) average case for each using a hash table).
However, I'm also hoping to implement a feature that could, given a certain range of dates, provide a simple analysis of average amount of time spent on certain tasks per day. If this is one of the main features I'm interested in, am I wrong in thinking that perhaps a sorted array would be more Big-Oh efficient? Especially considering that the data entries are generally expected to already be added date by date.
Or perhaps there's another option I'm unaware of?
The reason I'm asking is because of the answer provided by this following question: Why not use hashing/hash tables for everything?
And the reason I'm unsure if I'm even asking the right question is because of the answer to the following question: Whats the best data structure for a calendar / day planner?
If so, I would really appreciate being directed other resources on data persistence in java.
Thank you for the help!

Use a NavigableMap interface (implemented by TreeMap, a red-black tree).
This allows you to easily and efficiently select date ranges and traverse over events in key order.
As an aside, if you consider time or date intervals to be "half-open" it will make many problems easier. That is, when selecting events, include the lower bound in results, but exclude the upper. The methods of NavigableMap, like subMap(), are designed to work this way, and it's a good practice when you are working with intervals of any quantity, as it's easy to define a sequence of intervals without overlap or gaps.

Depends on how serious you want your project to be. In all cases, be careful of premature optimization. This is when you try too hard to make your code "efficient", and sacrifice readability/maintainability in the process. For example, there is likely a way of doing manual memory management with native code to make a more efficient implementation of a data structure for your calendar, but it likely does not outweigh the beneits of using familiar APIs etc. It might do, but you only know when you run your code.
Write readable code
Run it, test for performance issues
Use a profiler (e.g. JProfiler) to identify the code that is responsible for poor performance
Optimise that code
Repeat
For code that will "work", but will not be very scalable, a simple List will usually do fine. You can use JSONs to store your objects, and a library such as Jackson Databind to map between List and JSON. You could then simply save it to a file for persistence.
For an application that you want to be more robust and protected against data corruption, a database is probably better. With this, you can guarantee that, for example, data is not partially written, concurrent access to the same data will not result in corruption, and a whole host of other benefits. However, you will need to have a database server running alongside your application. You can use JDBC and suitable drivers for your database vendor (e.g. Mysql) to connect to, read from and write to the database.
For a serious application, you will probably want to create an API for your persistence. A framework like Spring is very helpful for this, as it allows you to declare REST endpoints using annotations, and introduces useful programming concepts, such as containers, IoC/Dependency Injection, Testing (unit tests and integration tests), JPA/ORM systems and more.
Like I say, this is all context dependent, but above all else, avoid premature optimization.

This thread might give you some ideas what data structure to use for Range Queries.
Data structure for range query
And it even might be easier to use a database and using an API to query for the desired range.

If you are using (or are able to use) Guava, you might consider using RangeMap (*).
This would allow you to use, say, a RangeMap<Instant, Event>, which you could then query to say "what event is occurring at time T".
One drawback is that you wouldn't be able to model concurrent events (e.g. when you are double-booked in two meetings).
(*) I work for Google, Guava is Google's open-sourced Java library. This is the library I would use, but others with similar range map offerings are available.

Cache update with db changes

We have a java based product which keeps Calculation object in database as blob. During runtime we keep this in memory for fast performance. Now there is another process which updates this Calculation object in database at regular interval. Now, what could be the best strategy to implement so that when this object get updated in database, the cache removes the stored object and fetch it again from database.
I won't prefer any caching framework until it is must to use.
I appreciate response on this.

It is very difficult to give you good answer to your question without any knowledge of your system architecture, design constraints, your IT strategy etc.
Personally I would use Messaging pattern to solve this issue. A few advantages of that pattern are as follows:
Your system components (Calculation process, update process) can be loosely coupled
Depending on implementation of Messaging pattern you can "connect" many Calculation processes (out-scaling) and many update processes (with master-slave approach).
However, implementing Messaging pattern might be very challenging task and I would recommend taking one of the existing frameworks or products.
I hope that will help at least a bit.

I did some work similar to your scenario before, generally there are 2 ways.
One, the cache holder poll the database regularly, fetch the data it needs and keep it in the memory. The data can be stored in a HashMap or some other collections. This approach is simple and easy to implement, no extra framework or library needed. But users will have to endure dirty data from time to time. Besides, polling will cause a lot of pressure on DB if the number of pollers is huge or the query is not fast enough. However, it is generally not a bad one if your requirement for real-time is not that high and the scale of your system is relatively small.
The other approach is that the cache holder subscribes the notification of the data updater and update its data after being notified. It provides better user experience, but this will bring more complexity to your system because you have to get some MS infrastructure, such as JMS, involved. Developing and tuning is more time-consuming.

I know I am quite late resonding this but it might help somebody searching for the same issue.
Here was my problem, I was storing requestPerMinute information in a Hashmap in a Java filter which gets loaded during the start of the application. The problem if somebody updates the DB with new information ,the map doesn't know about this.
Solution: I took one variable updateTime in my Java filter which just stored when was my hashmap last got updated and with every request it checks if the current time is time more than 24 hours , if yes then it updates the hashmap from the database.So every 24 hours it just refreshes the whole hashmap.
Although my usecase was not to update at real time so it fits the use case.

Sort a list with SQL or as a collection?

I have some entries with dates in my database. What is best?:
Fetch them with a sql statement and also apply order by.
Get the list with sql, and order them within the application with collection.sort or so?
Thanks

This a very broad question that is very difficult to answer, and it depends a lot on what you mean by best?
From a performance perspective, you will simply have to measure to determine what part of your system is the bottleneck. Databases are usually very efficient, but it could still be relevant to off-load that work to the client.
From a separation of concern perspective, it depends on how the sorting matters in the application and how the application is layered.
Ask your self: "where does the knowledge that the data is sorted belong?" and "What would happen if I where to change from a relational database storage to something different".

To some extent, it depends on how many values are in the complete collection. If it is, say, 20-30 values then you can sort anywhere — even a relatively poor sorting algorithm can do that quickly (avoid Stooge Sort though; that's terrible) — as that is the sort of size of data chunk which you might expect to actually fetch in one service response.
But once you get into larger datasets you need to plan much more carefully. In particular, you want to avoid moving data around if you don't have to. If the data is currently only present in the database, you really don't want to fetch it all into the client just to sort it (a relatively expensive operation) and then throw virtually all of it away. It's far better to actually keep the data sorted in the database to start with, so that picking it up in order is trivial; in relational database terms, keeping the data sorted is functionally identical to maintaining an index on the data. Indeed, you can have multiple indices on the data, which can make even rather complex queries quick. (NoSQL DBs are more varied; some even don't support the concept of keeping data sorted.) The downside of maintaining indices is that they take up more space and they take time to maintain, particularly when the data is being created in the first place.
So… to return to your question, you probably want to try to not sort the data in the application: for most data, an appropriate index can be much more efficient as it lets your code not even look at unwanted data. But if you have to fetch it all into your application for some other reason and you can't bring it in pre-sorted, there's no reason to avoid sorting it yourself: Java's sorting algorithms are efficient and stable. But you should measure whether fetching it from the DB in the new order is faster. (The question is whether the DB overheads exceed the super-linear costs of re-sorting; lots of problems are in the domain where “maybe; hard to tell” is the answer.)
The other thing to balance is whether it is simpler for your code to not do sorting itself and instead always delegate that to the DB. Keeping your code simpler (and more bug-free) is a good goal to have…

Database management systems (DMBS) are optimized for these tasks, so I think you should stick with them. Especially if you are accessing the database from a script written in PHP or (other scripting language), it might be slower to perform that task using a script. You might also reach a memory limit allowed to be used by PHP if you sort the array using a script.
I don't mean to raise a question of performance of different programming languages, just want to point out that it is a very good practice to rely on the DMBS whenever you can.

This is a very interesting question to me, and I want to present the other side of the accepted answer, which BTW is a very good answer with which I don't necessarily *dis*agree. Just want to present the other side.
When I started in my career, I was working on mainframe DB2, and the old-timers that taught me were VERY INSISTENT that sorting be done OUTSIDE of the db. Their rational for this is that it's work that CAN be offloaded, and this leaves the DB free to service other requests.
Of course, it's far more nuanced than this. In general, I'd say the factors you're weighing are:
A) How busy, or central to your system, is your database? If your db is very busy, if you have a lot of OLTP processing on clients or app servers, and your client or application servers have lots of excess capacity, why not sort on the app server or client? Even if it's less efficient, it spreads the work through the system and gets you more throughput from a whole-systems perspective.
B) How big is the sort? It would be silly to, say, blow your call stack or java heap because you sorted a gazillion MB of data.
C) Will sorting in your app or app server cause pauses, latency, etc? In other words, if your particular programming language has REALLY bad sorting libraries, and you don't want to write your own, maybe letting the DB take 0.5 seconds is better than making your application take 5.0 seconds.
So, as with all things, "it depends" ;-). But, I think these are the things upon which it depends.

Java DB choose for better perfomance

I have java application that process such kind of data:
class MyData
{
Date date;
double one;
double two;
String comment;
}
All data are stored in csv format on hard disk, maximum size of such data sequence is ~ 150 mb, and for this moment I just load it fully to memory and work with it.
Now I have the task to increase maximum data sequence for hundreds of gigabyte. guess I need to use DB, but I did not work with them before.
My questions:
Which DB better to choose for my
reasons(there will be only 1 table
with data as abowe) ?
Which library
better to use to connect Java <-> DB
I guess there will be used something
like cursor?!? if so, is there any
cursor realization with good record
caching for fast access?
Any other tips&tricks about java <-> DB are welcome!

Your question is pretty unspecific. There isn't a best of breed - it depends on how much money you have and what kind of hardware.
Since your mapping between Java and the DB is pretty simple, JDBC should be enough. JDBC will create a cursor for you as necessary; lost loop over the rows in the ResultSet. Depending on the database, you may need to configure it to use cursors, though.
Since you mention "hundreds of gigabytes", that rules out most of the "simple" databases. If you have money, try Oracle. If you don't have money, try MySQL or Postgres.
You can also try JavaDB (also known as Derby). But I'm not sure the performance will be what you need.
Note that they all have their quirks and "features", so expect to spend a couple of weeks to find your way with them.

Depends entirely on what you will be doing with the data. Do you need to index it to retrieve specific records, or are you stream processing the entire data set to generate some statistics (for example)? Does the database need to be accessed concurrently by multiple clients/processes?
Don't rush immediately towards SQL/JDBC, relational databases are powerful, but they add a lot of complexity and are often entirely unnecessary for the task at hand.
Again, depending on what you actually need to do, something like BerkeleyDB may fit the bill, or you may just need a more compact binary message format: check out Protocol Buffers and Kryo.
If you really need to scale things up, look at Hadoop/HDFS for distributed processing (but that's getting rather complicated).
Oh, and generally speaking, JavaDB/Derby tends to suck somewhat.

I would recommend JavaDB. I have used it in a Point of Sale system and it works very good. It is very easy to integrate into your Java Application, and you can integrate it to the same .jar file if you want.
Using Java DB in Desktop Applications may be a useful article. You will use JDBC for interfacing the database from Java, this makes it easy to switch to another database if you don't want to use JavaDB.

You'll want to evaluate several databases (you can get trials of just about any of them if they're not open source/free already). I'd recommend trying Oracle, Mysql/Postgres and with the size of your data (and its lack of apparent complexity) you might want to consider a datagrid as well (gridgain or similar).
Definitely prototype though.

I'd just like to add that the "fastest" database is not necessarily the best.
You also need to take into account:
reliability,
software license cost,
ease of use,
ease of administration,
availability of support,
and so on.

Techniques for querying a set of object in-memory in a Java application

We have a system which performs a 'coarse search' by invoking an interface on another system which returns a set of Java objects. Once we have received the search results I need to be able to further filter the resulting Java objects based on certain criteria describing the state of the attributes (e.g. from the initial objects return all objects where x.y > z && a.b == c).
The criteria used to filter the set of objects each time is partially user configurable, by this I mean that users will be able to select the values and ranges to match on but the attributes they can pick from will be a fixed set.
The data sets are likely to contain <= 10,000 objects for each search. The search will be executed manually by the application user base probably no more than 2000 times a day (approx). It's probably worth mentioning that all the objects in the result set are known domain object classes which have Hibernate and JPA annotations describing their structure and relationship.
Possible Solutions
Off the top of my head I can think of 3 ways of doing this:
For each search persist the initial result set objects in our database, then use Hibernate to re-query them using the finer grained criteria.
Use an in-memory Database (such as hsqldb?) to query and refine the initial result set.
Write some custom code which iterates the initial result set and pulls out the desired records.
Option 1
Option 1 seems to involve a lot of toing and froing across a network to a physical Database (Oracle 10g) which might result in a lot of network and disk activity. It would also require the results from each search to be isolated from other result sets to ensure that different searches don't interfere with each other.
Option 2
Option 2 seems like a good idea in principle as it would allow me to do the finer query in memory and would not require the persistence of result data which would only be discarded after the search was complete. Gut feeling is that this could be pretty performant too but might result in larger memory overheads (which is fine as we can be pretty flexible on the amount of memory our JVM gets).
Option 3
Option 3 could be very performant but is something I would like to avoid as any code we write would require such careful testing that the time taken to acheive something flexible and robust enough would probably be prohibitive.
I don't have time to prototype all 3 ideas so I am looking for comments people may have on the 3 options above, plus any further ideas I have not considered, to help me decide which idea might be most suitable. I'm currently leaning toward option 2 (in memory database) so would be keen to hear from people with experience of querying POJOs in memory too.
Hopefully I have described the situation in enough detail but don't hesitate to ask if any further information is required to better understand the scenario.
Cheers,
Edd

Options 1 and 2 are quite compatible: by implementing one you can replace it with the other with simple reconfiguration of persistence.xml (given that in-memory database is JPA compatible, e.g. JavaDB, Derby, etc.).
Option 3 is re-implementing both third-party software (database) and your own code (existing JPA entities). You also listed its advantages as concerns. It's clearly a less feasible option in your case. I can't think of anything else to promote Option 3 either.
It seems that in-memory database is more suitable given use cases and their time span. If requirements evolve into less transient ones then you can switch to Oracle.

If your expressions are not too complex, you can use an expression language for evaluating string queries on your Java objects (POJOs). I can recommend MVEL http://mvel.codehaus.org .
The idea is that you put your objects into MVEL context. Then you provide string query written according to MVEL simple notation, and finally evaluate expression.
Example taken from MVEL site:
Map vars = new HashMap();
vars.put("x", new Integer(5));
vars.put("y", new Integer(10));
Integer result = (Integer) MVEL.eval("x * y", vars);
assert result.intValue() == 50; // Mind the JDK 1.4 compatible code :)
Usually expression languages support traversing your object graph (collections) and
accessing members in JSP EL style (dot notation).
Also, I can suggest looking at OGNL (google it, I can't add more than one link)

How complex are the refining criteria? If the majority are quite simple, I'd be tempted to go for option (3) to start with, but make sure it's encapsulated behind a suitable interface so that if you come across something that is too complex or inefficient to code up yourself you can switch to the in-memory DB at that point (either wholesale for all queries, or just for the complex ones if there's an overhead in setting up the temporary tables).

Option 2 seems to be good - since you can toggle between 1 & 2 as per need. 3 is restricted in terms of future data sizing issue as well. Querying objects would imply greater dependency on the code structure for storage and querying.
Probably it would be good idea to include some caching mechanism (ehcache/memcache) along with usage of Option 2 and then profiling to check the performance difference.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.