My Spring-based web app is deployed to production in a Tomcat cluster (4+ nodes) with sticky sessions. The number of nodes will not exceed 8-10 within the next few years.
I need to cache some data (mostly configuration) to avoid hitting Oracle. Given that this data is mostly configuration, I would say the ratio of reads to writes is 999999/1.
I don't want to use a full-blown caching solution such as Infinispan/Hazelcast/Redis, as it adds operational complexity to the product, and the requirement is to cache some small, mostly read-only data (let's say a few hundred kilobytes at most).
At first I wanted to implement a simple replicating map myself; then I saw that [JGroups][1] ships with a [ReplicatedHashMap][1]. I think it suits my needs, but I'm not sure whether I'm missing something.
What else should I consider?
Has anyone used it in production?
ReplicatedHashMap is a single class of ~700 lines, so it isn't particularly complex, and it uses JGroups, which has been used in production for decade(s).
If you need something simple, without transactions, overflow-store, etc., then it might be right for the job. Note that you could also modify it and/or write your own, using RHM as a template.
RHM replicates all data to all nodes, so if you have many nodes (you don't), or your data is large, then ReplCache may be the better choice.
In our project we're using an ELK stack for storing logs in a centralized place. However, I've noticed that recent versions of Elasticsearch support various aggregations. In addition, Kibana 4 supports nice graphical ways to build graphs, and even recent versions of Grafana can now work with an Elasticsearch 2 data source.
So, does all this mean that the ELK stack can now be used for storing metering information gathered inside the system, or can it still not be considered a serious competitor to existing solutions such as Graphite, InfluxDB, and so forth?
If so, does anyone use ELK for metering in production? Could you please share your experience?
Just to clarify the notions: I consider metering data to be something that can be aggregated and shown in a graph 'over time', as opposed to a regular log message, where the main use case is searching.
Thanks a lot in advance
Yes, you can use Elasticsearch to store and analyze time-series data.
To be more precise, it depends on your use case. For example, in my use case (financial instrument price tick history data, in development) I am able to get 40,000 documents inserted/sec (~125-byte documents with 11 fields each - 1 timestamp, strings and decimals - meaning 5 MB/s of useful data) for 14 hrs/day, on a single node (a big modern server with 192 GB RAM) backed by a corporate SAN (which is backed by spinning disks, not SSDs!). I want to store up to 1 TB of data, but I predict that having 2-4 TB could also work on a single node.
All this is with default config file settings, except for ES_HEAP_SIZE of 30 GB. I suspect it would be possible to get significantly better write performance on that hardware with some tuning (e.g. I find it strange that iostat reports device utilization at 25-30%, as if Elasticsearch were capping it / conserving I/O bandwidth for reads or merges... but it could also be that %util is an unreliable metric for SAN devices).
Query performance is also fine: queries and Kibana graphs return quickly, as long as you restrict the result set with time and/or other fields.
In this case you would not be using Logstash to load your data, but would instead do bulk inserts of big batches directly into Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
You also need to define a mapping (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html) to make sure Elasticsearch parses your data the way you want it (numbers, dates, etc.), creates the desired level of indexing, and so on.
Other recommended practices for this use case are to use a separate index for each day (or week/month, depending on your insert rate), and to make sure each index is created with just enough shards to hold one day of data. By default, new indexes are created with 5 shards, and shard performance starts degrading once a shard grows over a certain size - usually a few tens of GB, but it may differ for your use case; you need to measure and experiment.
Using Elasticsearch aliases https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html helps with dealing with multiple indexes, and is a generally recommended best practice.
We have market data handlers which publish quotes to a KDB ticker plant. We use the exxeleron q Java library for this purpose. Unfortunately, latency is quite high: hundreds of milliseconds when we try to insert a batch of records. Can you suggest some latency tips for KDB + the Java binding, as we need to publish quite fast?
There's not enough information in this message to give a fully qualified response, but having done the same with Java+KDB it really comes down to eliminating the possibilities. This is common sense, really, nothing super technical.
- Make sure you're inserting asynchronously.
- Verify that it's exxeleron q java that is causing the latency; I don't think there's hundreds of millis of overhead there.
- Verify that the CPU your tickerplant is on isn't overloaded. Consider re-nicing, core binding, etc.
- Analyse your network latencies. Also, if you're using Linux, there are a few TCP tweaks you can try, e.g. TCP_QUICKACK.
- As you're using Java, be smarter about garbage collection. It's highly configurable, although not directly controllable.
- If you find the tickerplant is the source of latency, you could either recode it to not write to disk, or get a faster local disk.
There are many more suggestions, but the question is a bit too ambiguous.
EDIT
Back in 2007, with old(ish) servers and a very old version of KDB+ we were managing an insertion rate of 90k rows per second using the vanilla c.java. That was after many rounds of the above points. I'm sure you can achieve way more now, it's a matter of finding where the bottlenecks are and fixing them one by one.
Make sure the data published to the ticker plant is batched: wait a little while and insert, say, a few rows of data in one batch, rather than inserting row by row as new records arrive.
The problem is how to store (and search) a set of items a user likes and dislikes. Although each user may have 2-100 items in their set, the possible values for the items number in the tens of thousands (and the list is expanding).
Associated with each item is a value, say from 10 (like) to 0 (neutral) to -10 (dislike).
So given a user with a particular set, how do you find users with similar sets (say, by percentage overlap on the intersection)? Ideally the set of matches could be reduced via a filter that includes only items with like/dislike values within a certain percentage.
I don't see how to use a key/value or column store for this, and walking a relational table of items for each user would seem to consume too many resources. Making the sets into documents would seem to lose clarity.
The web app is in Java. I've searched ORMs, NoSQL, Elasticsearch, and related tools and databases. Any suggestions?
OK, it seems the actual storage isn't the problem; rather, you want to build a suggestion system based on the likes/dislikes.
The point is that you can store things however you want, even in SQL; most SQL RDBMSs will be good enough for your data store, and you can of course use anything else as well. But no SQL solution (that I know of) will give you good results for this on its own. What you are looking for is a suggestion system based on machine learning, and the best-known one for distributed systems, with many algorithms already implemented, is Apache Mahout.
From what I've learned about it so far, it can do what you need basically out of the box. I know that it's based on Hadoop and YARN, but I'm not sure whether you can import data from anywhere you want or whether it needs to live in HDFS.
The other option would be to implement a machine learning algorithm on your own, which would run on only one machine, but you just won't get the results you want from a simple query in any SQL system.
The reason you need machine learning algorithms (and why a query with some numbers won't be enough in most cases) is the diversity of the users you are facing. What if user B liked/disliked everything he has in common with user A in exactly the same way, but the overlap is only 15%? On the other hand, user C is pretty similar to A (not at 100%, but the directions are pretty much the same), and C has rated over 90% of the things A also rated. In this scenario C is much closer to A than B is, even though B agrees 100% on the small part they share. There are many other scenarios where simple percentages won't be enough, and that's why many companies with suggestion systems (Amazon, Netflix, Spotify, ...) use Apache Mahout and similar systems to get this done.
I have some entries with dates in my database. Which is best:
1. Fetch them with a SQL statement that also applies ORDER BY?
2. Get the list with SQL, and order it within the application with Collections.sort or the like?
Thanks
This is a very broad question that is very difficult to answer, and it depends a lot on what you mean by "best".
From a performance perspective, you will simply have to measure to determine which part of your system is the bottleneck. Databases are usually very efficient, but it could still be relevant to offload that work to the client.
From a separation of concern perspective, it depends on how the sorting matters in the application and how the application is layered.
Ask yourself: "Where does the knowledge that the data is sorted belong?" and "What would happen if I were to change from relational database storage to something different?"
To some extent, it depends on how many values are in the complete collection. If it is, say, 20-30 values then you can sort anywhere — even a relatively poor sorting algorithm can do that quickly (avoid Stooge Sort though; that's terrible) — as that is the sort of size of data chunk which you might expect to actually fetch in one service response.
But once you get into larger datasets you need to plan much more carefully. In particular, you want to avoid moving data around if you don't have to. If the data is currently only present in the database, you really don't want to fetch it all into the client just to sort it (a relatively expensive operation) and then throw virtually all of it away. It's far better to actually keep the data sorted in the database to start with, so that picking it up in order is trivial; in relational database terms, keeping the data sorted is functionally identical to maintaining an index on the data. Indeed, you can have multiple indices on the data, which can make even rather complex queries quick. (NoSQL DBs are more varied; some don't even support the concept of keeping data sorted.) The downside of maintaining indices is that they take up more space and they take time to maintain, particularly when the data is being created in the first place.
So… to return to your question, you probably want to try to not sort the data in the application: for most data, an appropriate index can be much more efficient as it lets your code not even look at unwanted data. But if you have to fetch it all into your application for some other reason and you can't bring it in pre-sorted, there's no reason to avoid sorting it yourself: Java's sorting algorithms are efficient and stable. But you should measure whether fetching it from the DB in the new order is faster. (The question is whether the DB overheads exceed the super-linear costs of re-sorting; lots of problems are in the domain where “maybe; hard to tell” is the answer.)
The other thing to balance is whether it is simpler for your code to not do sorting itself and instead always delegate that to the DB. Keeping your code simpler (and more bug-free) is a good goal to have…
Database management systems (DBMSs) are optimized for these tasks, so I think you should stick with them. Especially if you are accessing the database from a script written in PHP (or another scripting language), it might be slower to perform that task in the script. You might also hit the memory limit allowed for PHP if you sort the array in a script.
I don't mean to raise the question of the performance of different programming languages; I just want to point out that it is very good practice to rely on the DBMS whenever you can.
This is a very interesting question to me, and I want to present the other side of the accepted answer - which, BTW, is a very good answer with which I don't necessarily *dis*agree.
When I started my career, I was working on mainframe DB2, and the old-timers who taught me were VERY INSISTENT that sorting be done OUTSIDE of the DB. Their rationale was that it's work that CAN be offloaded, and this leaves the DB free to service other requests.
Of course, it's far more nuanced than this. In general, I'd say the factors you're weighing are:
A) How busy, or central to your system, is your database? If your db is very busy, if you have a lot of OLTP processing on clients or app servers, and your client or application servers have lots of excess capacity, why not sort on the app server or client? Even if it's less efficient, it spreads the work through the system and gets you more throughput from a whole-systems perspective.
B) How big is the sort? It would be silly to, say, blow your call stack or Java heap because you sorted a gazillion MB of data.
C) Will sorting in your app or app server cause pauses, latency, etc? In other words, if your particular programming language has REALLY bad sorting libraries, and you don't want to write your own, maybe letting the DB take 0.5 seconds is better than making your application take 5.0 seconds.
So, as with all things, "it depends" ;-). But, I think these are the things upon which it depends.
I'm new to databases, but I think I finally have a situation where flat files won't work.
I'm writing a program to analyze the outcomes of multiplayer games, where each game could have any number of players grouped into any number of teams. I want to allow players to win, tie, or leave partway through the game (and win/lose based on team performance).
I also might want to store historical player ratings (unless it's faster to just recompute them from the game history), so I don't know whether that means storing each player's rating alongside each game played, or having a separate table for each player, or what.
I don't see any criteria that impact database choice, but I'll list the free ones:
- PostgreSQL
- MySQL
- SQL Server Express
- Oracle Express
I don't recommend an embedded database like SQLite, because embedded databases make trade-offs in features to accommodate space and size concerns. I don't agree with their belief that data typing should be relaxed; it has led to numerous questions on SO about how to deal with date/time filtering, among others...
You'll want to learn about normalization and getting data to Third Normal Form (3NF), because it enforces referential integrity, which also minimizes data redundancy. For example, your player stats would not be stored in the database - they'd be calculated at the time of the request based on the data on hand.
You didn't mention any need for locking mechanisms where multiple users may be competing to write the same data to the same resource (a database record, or a file in the case of flat files) simultaneously. What I would suggest is to get a good book on database design and try to understand normalization rules in depth. Distributing data across separate tables has a performance impact, but it also affects the ease of constructing queries. This is a very involved topic, and there's no simple answer to it. That's why companies hire database administrators to keep their data structures optimized.
You might want to look at SQLite, if you need a lightweight database engine.
Some good options were mentioned already, but I really think that on the Java platform, H2 is a very good choice. It is perfect for testing (as an in-memory test database), but it works very well for embedded use cases too, and as a stand-alone "real database". Plus it is easy to export as a dump file, import from that, and move around. And it works efficiently too.
It is developed by a very good Java DB guy, it is not his first take on the problem, and you can see this in the maturity of the project. On top of that, it is still being actively developed and supported.
A word on why nobody has even mentioned any of the "NoSQL" databases, even though you used it as a tag:
Non-SQL databases are getting a lot of attention (or even outright hype) recently, because of some high-profile use cases, because they're new (and therefore interesting), and because of their promise of incredible scalability (which is "sexy" to programmers). However, only a very few, very big players actually need that kind of scalability - and you certainly don't.
Another factor is that SQL databases require you to define your DB schema (the structure of tables and columns) beforehand, and changing it is somewhat problematic (especially if you already have a very large database). Non-SQL databases are more flexible in that regard, but you pay for it with more complex code (e.g. after you introduce a new field, your code needs to be able to deal with elements where it's not yet present). It doesn't sound like you need this kind of flexibility either.
Also try OrientDB. It's free (Apache 2 license), runs everywhere, supports SQL, and is really fast: it can insert 1,000,000 records in 6 seconds on common hardware.