ELK stack for storing metering data - java

In our project we're using an ELK stack to store logs in a centralized place. However, I've noticed that recent versions of Elasticsearch support various aggregations. In addition, Kibana 4 offers nice graphical ways to build graphs, and even recent versions of Grafana can now work with an Elasticsearch 2 datasource.
So, does all this mean that the ELK stack can now be used for storing metering information gathered inside the system, or can it still not be considered a serious competitor to existing solutions such as Graphite, InfluxDB, and so forth?
If so, does anyone use ELK for metering in production? Could you please share your experience?
Just to clarify the notions: I consider metering data to be something that can be aggregated and shown in a graph over time, as opposed to a regular log message whose main use case is searching.
Thanks a lot in advance

Yes, you can use Elasticsearch to store and analyze time-series data.
To be more precise, it depends on your use case. For example, in my use case (financial instrument price-tick history data, in development) I am able to get 40,000 documents inserted per second (~125-byte documents with 11 fields each: 1 timestamp, plus strings and decimals, meaning 5 MB/s of useful data) for 14 hrs/day, on a single node (a big modern server with 192 GB RAM) backed by a corporate SAN (which is backed by spinning disks, not SSDs!). I want to store up to 1 TB of data, but I predict 2-4 TB could also work on a single node.
All this is with default config file settings, except for ES_HEAP_SIZE set to 30 GB. I suspect it would be possible to get significantly better write performance on that hardware with some tuning (e.g. I find it strange that iostat reports device utilization at 25-30%, as if Elasticsearch were capping it / conserving I/O bandwidth for reads or merges... but it could also be that %util is an unreliable metric for SAN devices).
Query performance is also fine: queries / Kibana graphs return quickly as long as you restrict the result set by time and/or other fields.
In this case you would not be using Logstash to load your data, but would bulk-insert big batches directly into Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
You also need to define a mapping (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html) to make sure Elasticsearch parses your data the way you want it (numbers, dates, etc.), creates the desired level of indexing, and so on.
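As a rough sketch of both steps against the plain REST endpoints (the index, type, and field names below are my own placeholders, with an ES 2.x-style mapping), not a definitive implementation:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EsBulkSketch {

    // Hypothetical endpoint and index/type/field names - adjust to your own schema.
    static final String ES = "http://localhost:9200";

    public static void main(String[] args) throws Exception {
        // 1) Create the index with an explicit mapping so the timestamp and the
        //    numeric fields are parsed as date/double instead of analyzed strings.
        String mapping = "{\"mappings\":{\"tick\":{\"properties\":{"
                + "\"ts\":{\"type\":\"date\"},"
                + "\"symbol\":{\"type\":\"string\",\"index\":\"not_analyzed\"},"
                + "\"price\":{\"type\":\"double\"}}}}}";
        send("PUT", ES + "/ticks-2016.01.01", mapping);

        // 2) Bulk-insert one batch: alternating action and source lines,
        //    newline-delimited, with a trailing newline.
        StringBuilder bulk = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            bulk.append("{\"index\":{\"_index\":\"ticks-2016.01.01\",\"_type\":\"tick\"}}\n");
            bulk.append("{\"ts\":").append(System.currentTimeMillis())
                .append(",\"symbol\":\"ABC\",\"price\":").append(100 + i * 0.01).append("}\n");
        }
        send("POST", ES + "/_bulk", bulk.toString());
    }

    static void send(String method, String url, String body) throws Exception {
        HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
        c.setRequestMethod(method);
        c.setDoOutput(true);
        try (OutputStream out = c.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(method + " " + url + " -> " + c.getResponseCode());
    }
}

In practice you would batch a few thousand documents per bulk request and measure which batch size gives the best throughput.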
Other recommended practices for this use case are to use a separate index for each day (or month/week, depending on your insert rate), and to make sure each index is created with just enough shards to hold one day of data (by default new indexes are created with 5 shards, and shard performance starts degrading once a shard grows over a certain size - usually a few tens of GB, but it might differ for your use case, so you need to measure/experiment).
Using Elasticsearch aliases (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html) helps with dealing with multiple indexes, and is a generally recommended best practice.
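For instance, continuing the hypothetical sketch above (same ES constant and send() helper), an index template can pin the shard count for every daily index, and an alias swap keeps queries pointed at one logical name; the template and index names are again my own placeholders:

// Continuing the EsBulkSketch above (same ES constant and send() helper).
static void setupDailyIndexes() throws Exception {
    // Index template: every index matching ticks-* is created with 2 shards
    // (instead of the default 5) - pick whatever comfortably holds one day.
    String template = "{\"template\":\"ticks-*\","
            + "\"settings\":{\"number_of_shards\":2,\"number_of_replicas\":0}}";
    send("PUT", ES + "/_template/ticks_template", template);

    // Alias swap at day rollover: queries and Kibana keep pointing at "ticks"
    // while the underlying daily index changes.
    String actions = "{\"actions\":["
            + "{\"remove\":{\"index\":\"ticks-2016.01.01\",\"alias\":\"ticks\"}},"
            + "{\"add\":{\"index\":\"ticks-2016.01.02\",\"alias\":\"ticks\"}}]}";
    send("POST", ES + "/_aliases", actions);
}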


Performance: Google Dataflow writes avro file to GCS

I am working on a Google Dataflow job which pulls data from Pub/Sub, converts it to Avro, and writes the files to GCS.
According to the monitoring page, the bottleneck is writing the Avro files to GCS (it spends 70-80% of the total execution time).
I use
10 workers of n1-standard-8
numShards = 10
a 5-second fixed window
The GCS bucket and the Dataflow endpoint are in the same region.
With this setup, the throughput is around 200,000 elements per second.
Is that fast for this situation, or is there anything I can do to make it faster? (I really want to!)
Thanks
Have you considered naming your files following a specific convention in order to optimize read and write access?
In order to maintain a high request rate, avoid using sequential names. Using completely random object names will give you the best load distribution. If you want to use sequential numbers or timestamps as part of your object names, introduce randomness to the object names by adding a hash value before the sequence number or timestamp.
Basically, you need to follow the same rules as when choosing a row key in Bigtable.
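As a rough illustration of that advice (the path and window names below are made up), the idea is just to put a short hash of the otherwise sequential, time-based name in front of it:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class ShardedGcsName {

    // Turns a sequential, window-based name into one with a short hash prefix,
    // so writes spread over the GCS key range instead of hot-spotting.
    static String objectName(String window, int shard) throws Exception {
        String base = "output/avro/" + window + "-" + shard + ".avro";     // sequential part
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(base.getBytes(StandardCharsets.UTF_8));
        String prefix = String.format("%02x%02x", digest[0], digest[1]);   // e.g. "3f9a"
        return prefix + "/" + base;                                        // "3f9a/output/avro/..."
    }

    public static void main(String[] args) throws Exception {
        System.out.println(objectName("2016-06-01T12-00-00", 3));
    }
}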

JGroups ReplicatedHashMap in a cluster

My Spring-based web app is deployed to production on a Tomcat cluster (4+ nodes) with sticky sessions. The number of nodes will not exceed 8-10 in a few years' time.
I need to cache some data (mostly configuration) to avoid hitting Oracle. Since the nature of this data is mostly configuration, I would say the ratio of reads to writes is 999999 to 1.
I don't want to use a full-blown caching solution such as Infinispan/Hazelcast/Redis, as it adds operational complexity to the product, and the requirement is to cache some small, mostly read-only data (let's say a few hundred kilobytes at most).
At first I wanted to implement a simple replicating map myself; then I saw that JGroups ships with a ReplicatedHashMap. I think it suits my needs, but I'm not sure whether I'm missing something.
What else should I consider?
Has anyone used it in production?
ReplicatedHashMap is a single class of ~700 lines, so it isn't particularly complex, and it uses JGroups, which has been used in production for decades.
If you need something simple, without transactions, an overflow store, etc., then it might be right for your job. Note that you could also modify it and/or write your own, using RHM as a template.
RHM replicates all data to all nodes, so if you have many nodes (you don't) or your data is large, then ReplCache may be the better choice.
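For reference, a rough sketch of what using it can look like (the cluster name is made up, and this assumes a JGroups stack that includes a state-transfer protocol, as the default configuration does), not a drop-in implementation:

import org.jgroups.JChannel;
import org.jgroups.blocks.ReplicatedHashMap;

public class ConfigCache {
    public static void main(String[] args) throws Exception {
        JChannel channel = new JChannel();                 // default protocol stack
        channel.connect("config-cache");                   // hypothetical cluster name

        // ReplicatedHashMap implements ConcurrentMap; every put/remove is
        // multicast to all members, and start() pulls the current contents
        // from an existing member via state transfer.
        ReplicatedHashMap<String, String> cache = new ReplicatedHashMap<>(channel);
        cache.start(10_000);                               // state-transfer timeout in ms

        cache.put("feature.flag.x", "true");               // visible on every node
        System.out.println(cache.get("feature.flag.x"));

        channel.close();
    }
}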

Java - Custom Hash Map/Table Some Points

In some previous posts I asked questions about coding a custom hash map/table in Java. Since I couldn't solve it, and maybe I failed to properly explain what I really want, I am summarizing all of it here to make it clear and precise.
What I am going to do:
I am trying to write code for our server, in which I have to look up users' access type by URL.
Now, I have about 1,110 million URLs.
So, what we did:
1) Divided the database into 10 parts, each of 110 million URLs.
2) Built a HashMap using parallel arrays, whose keys are one part of the URL (represented as a long) and whose values are the other part of the URL (represented as an int); a key can have multiple values.
3) Then, every day when the system starts, we search the HashMap for other URLs (the millions of URLs saved that day).
What have you tried:
1) I have tried many NoSQL databases; however, we found them not good enough for our purpose.
2) I have built our own custom hash map (using two parallel arrays) for that purpose.
So, what the issue is:
When the system starts, we have to load the hash table of each database part and perform searches for millions of URLs.
Now, the issue is:
1) Although the hash table's search performance is quite nice, the code takes much more time to load the hash table (we are using a FileChannel and a memory-mapped buffer to load it, which takes 20 seconds for a 220-million-entry hash table; with a load factor of 0.5, we found this the fastest approach).
So we are spending: (hash table load + hash table search) * number of DB parts = (20 + 5) * 10 = 250 seconds, which is quite expensive for us, and most of the time (200 out of 250 seconds) goes to loading the hash tables.
Have you thought of any other way:
One way could be:
Without worrying about loading and storing, leave the caching to the operating system by using a memory-mapped buffer. But since I have to search for millions of keys, this gives worse performance than the above.
As we found the hash table's search performance is nice but the loading time is high, we thought to cut it down another way:
1) Create an array of linked lists of size Integer.MAX_VALUE (my own custom linked list).
2) Insert values (ints) into the linked list whose index is the key number (we reduce the key size to an int).
3) So we have to store only the linked lists to disk.
Now the issue is that it takes a lot of time to create that many linked lists, and creating such a large number of linked lists makes no sense if the data is not well distributed.
So, what are your requirements:
Simply, my requirements are:
1) Insertion and searching of keys with multiple values, with good search performance.
2) A fast way to load it (especially) into memory.
(Keys are 64-bit ints and values are 32-bit ints; one key can have at most 2-3 values. We could also make our keys 32-bit, but that would give more collisions, which is acceptable for us if it makes things better.)
Can anyone help me solve this, or offer any comments on how to approach this issue?
Thanks.
NB:
1) As per previous suggestions on Stack Overflow, pre-reading the data to warm the disk cache is not possible, because our application starts working as soon as the system starts, and the same happens the next day when the system starts again.
2) We have not found that NoSQL DBs scale well for us, and our requirements are simple (just insert hash table key/value pairs, then load and search, i.e. retrieve values).
3) As our application is part of a small project to be deployed on a small campus, I don't think anybody will buy me an SSD for it. That is my limitation.
4) We also tried Guava/Trove, but they are not able to store such a large amount of data in 16 GB either (we are using a 32 GB Ubuntu server).
If you need quick access to 1,110 million data items then hashing is the way to go. But don't reinvent the wheel; use something like:
memcacheDB: http://memcachedb.org
MongoDB: http://www.mongodb.org
Cassandra: http://cassandra.apache.org
It seems to me (if I understand your problem correctly) that you are approaching the problem in a convoluted manner.
I mean, the data you are trying to pre-load is huge to begin with (say 220 million * 64 bytes ≈ 14 GB), and you are trying to memory-map it, etc.
I think this is a typical problem that is solved by distributing the load across different machines. That is, instead of trying to locate the linked-list index, you should be trying to figure out which machine holds the appropriate part of the map and get the value from that machine (each machine has loaded part of this database map, and you get the data from the appropriate part of the map, i.e. machine, each time).
Maybe I am way off here, but I also suspect you are using a 32-bit machine.
So if you have to stay with a single-machine architecture, and it is not economically possible to improve your hardware (a 64-bit machine and more RAM, or an SSD as you point out), I don't think you can make any dramatic improvement.
I don't really understand in what form you are storing the data on disk. If what you are storing consists of URLs and some numbers, you might be able to speed up loading from disk quite a bit by compressing the data (unless you are already doing that).
Creating a multithreaded loader that decompresses while loading might give you quite a big boost.
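Roughly what that could look like, as a sketch under my own assumptions (the file name and layout are made up: the hash table dumped as flat key/value arrays and gzipped; each of the 10 parts could then be loaded by its own thread):

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedTableStore {

    // Simple holder for the two parallel arrays (keys[i] -> values[i]).
    static class Table {
        final long[] keys;
        final int[] values;
        Table(long[] k, int[] v) { keys = k; values = v; }
    }

    // Write the table once as a gzipped binary blob.
    static void save(String file, Table t) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(
                new BufferedOutputStream(new FileOutputStream(file), 1 << 20)))) {
            out.writeInt(t.keys.length);
            for (long k : t.keys)  out.writeLong(k);
            for (int v : t.values) out.writeInt(v);
        }
    }

    // Stream it back; decompression happens while reading, so the disk I/O is
    // roughly the compressed size rather than the raw array size.
    static Table load(String file) throws IOException {
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(
                new BufferedInputStream(new FileInputStream(file), 1 << 20)))) {
            int n = in.readInt();
            long[] keys = new long[n];
            int[] values = new int[n];
            for (int i = 0; i < n; i++) keys[i] = in.readLong();
            for (int i = 0; i < n; i++) values[i] = in.readInt();
            return new Table(keys, values);
        }
    }
}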

which NOSQL database tool is better to choose for my application?

I am planning to develop an application for connecting with friends of friends of friends. It may look like Facebook or Twitter, but initially I am planning to build it to learn more about NoSQL databases.
There are a number of NoSQL database tools. I have gone through many database types, such as document stores, key-value stores, column stores, and graph databases, and I have finally come up with two: Cassandra and Neo4j. Is it right to choose either one? If not, correct me and share your opinions.
One more thing: the language binding I have chosen is Java.
My question is,
Which database tool suits my application?
Awaiting your valuable opinions. Thanks for spending your valuable time.
Tim, you really should have posted your question separately, rather than as an answer to the OP, which it wasn't.
But to answer, first, go read Ben Black's slides at http://www.slideshare.net/benjaminblack/introduction-to-cassandra-replication-and-consistency.
Done? Okay, now for the specific questions:
"How would differences in [replica] data-state be reconciled on a subsequent read?"
The highest timestamp wins.
"Do all zones work off the same system clock?"
Timestamps are provided by clients (i.e., your app server). They should be synchronized with e.g. ntpd (which is good practice anyway), but high precision is not required because if ordering matters you should be avoiding conflict either by using unique column names or by using external locking.
For example: if you have a list of users following you in a Twitter clone, you should give each follower its own column and there will be no way to lose data no matter how out of sync the clocks are.
If you have an admin tool for your website and two admins upload a new favicon "simultaneously," one update is going to win and it doesn't really matter which. Here, you do want your clocks synchronized but "within a few ms" is close enough.
If you are managing user registration and you want to allow creating the account "jbellis" only if it doesn't already exist, you need a lock manager no matter how closely synchronized your clocks are.
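To make the "one column per follower" idea concrete, here is a rough sketch in today's CQL terms, where clustering columns play the role the dynamic column names played in the Thrift API (the keyspace and table names are made up, and the DataStax Java driver is used purely for illustration):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class FollowersSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // One partition per user, one clustering "column" per follower: two
        // clients adding different followers concurrently can never overwrite
        // each other, no matter how far apart their clocks are.
        session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class':'SimpleStrategy','replication_factor':1}");
        session.execute("CREATE TABLE IF NOT EXISTS demo.followers ("
                + "user text, follower text, PRIMARY KEY (user, follower))");

        session.execute("INSERT INTO demo.followers (user, follower) VALUES ('jbellis', 'alice')");
        session.execute("INSERT INTO demo.followers (user, follower) VALUES ('jbellis', 'bob')");

        cluster.close();
    }
}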
"Would stale data get returned?"
A node (a better unit to think about than a "zone") will not have the data it missed during its downtime until it is sent that data by read repair, hinted handoff, or anti-entropy repair. In the meantime, it will reply to read requests with stale data; if you use a high enough consistency level, read requests will wait for enough other replies to make sure you always see the most recent version anyway, which may mean not being able to fulfil requests if enough other replicas are down.
Otherwise, a low consistency level (e.g. ONE) implicitly means "I understand that the higher availability and lower latency I get with this lower consistency level mean I'm okay with seeing stale data temporarily after downtime."
I'm not sure I understand all of the implications of the Cassandra consistency model with respect to data agreement across multiple availability zones.
Given multiple zones, and given that the coordinator node in Cassandra has used a consistency level that does not require all zones to report back, but only a quorum, how would differences in zone data-state be reconciled on a subsequent read?
Do all zones work off the same system clock? Or does each zone have its own clock? If they don't work off the same clock, how are they synchronized so that timestamps can be compared during the "healing" process when differences are reconciled?
Let's say that a zone that does have accurate, up-to-date data is now offline, and a zone that was offline during a previous write (so it didn't get updated and contains stale data) is now back online. Would stale data get returned? Would the coordinator have any way to know the data were stale?
If you don't need to scale in the short term I'd go with Neo4j because it is designed to store networks like the one you described. (If you eventually do need to scale, maybe you can throw Gizzard in front of it or something. Good luck!)
Have you looked at the Riak database? It has the same background as Cassandra, but you don't need to care about timestamp synchronization (it uses a different method for resolving data state).
My first application was built on a Cassandra database, but I am now trying Riak because it is more suitable. It is not only the difference in keys (keys - values versus super column - keys - values); it goes further with its document-store features.
It has a way to create complex queries using MapReduce. Cassandra does have this option via Hadoop, but it sounds difficult.
Furthermore, it uses a well-known, well-defined access protocol (HTTP/S), so it's easy to manage the server when you have a lot of traffic.
The only bad point is that it is slower than Cassandra. But usually you will read records more than you write (and Cassandra is optimized for writes, not reads), so the end result should be OK.

Java DB choice for better performance

I have a Java application that processes this kind of data:
class MyData
{
Date date;
double one;
double two;
String comment;
}
All the data is stored in CSV format on the hard disk; the maximum size of such a data sequence is ~150 MB, and at the moment I just load it fully into memory and work with it.
Now I have the task of increasing the maximum data sequence to hundreds of gigabytes. I guess I need to use a DB, but I have not worked with them before.
My questions:
Which DB is better to choose for my needs (there will be only one table, with data as above)?
Which library is better to use to connect Java <-> DB?
I guess something like a cursor will be used. If so, is there any cursor implementation with good record caching for fast access?
Any other tips & tricks about Java <-> DB are welcome!
Your question is pretty unspecific. There isn't a best of breed; it depends on how much money you have and what kind of hardware.
Since your mapping between Java and the DB is pretty simple, JDBC should be enough. JDBC will create a cursor for you as necessary; just loop over the rows in the ResultSet. Depending on the database, you may need to configure it to use cursors, though.
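A sketch of what that streaming read could look like (the JDBC URL, credentials, and table name are placeholders; the fetch-size hint is what keeps the driver from pulling the whole result set into memory, and on PostgreSQL it only takes effect with auto-commit turned off):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StreamRows {
    public static void main(String[] args) throws Exception {
        // Placeholder URL/credentials - substitute your own database.
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret")) {
            con.setAutoCommit(false);                       // needed for cursor-based fetching on PostgreSQL
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT ts, one, two, comment FROM my_data ORDER BY ts")) {
                ps.setFetchSize(10_000);                    // hint: stream in chunks, not all at once
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // process one row at a time instead of loading everything into memory
                        process(rs.getTimestamp("ts"), rs.getDouble("one"),
                                rs.getDouble("two"), rs.getString("comment"));
                    }
                }
            }
        }
    }

    static void process(java.sql.Timestamp ts, double one, double two, String comment) {
        // application logic goes here
    }
}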
Since you mention "hundreds of gigabytes", that rules out most of the "simple" databases. If you have money, try Oracle. If you don't have money, try MySQL or Postgres.
You can also try JavaDB (also known as Derby). But I'm not sure the performance will be what you need.
Note that they all have their quirks and "features", so expect to spend a couple of weeks to find your way with them.
Depends entirely on what you will be doing with the data. Do you need to index it to retrieve specific records, or are you stream processing the entire data set to generate some statistics (for example)? Does the database need to be accessed concurrently by multiple clients/processes?
Don't rush immediately towards SQL/JDBC, relational databases are powerful, but they add a lot of complexity and are often entirely unnecessary for the task at hand.
Again, depending on what you actually need to do, something like BerkeleyDB may fit the bill, or you may just need a more compact binary message format: check out Protocol Buffers and Kryo.
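For instance, a minimal Kryo sketch for the data class from the question (the file name is made up, and the exact API varies a little between Kryo versions):

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Date;

public class KryoSketch {
    public static void main(String[] args) throws Exception {
        Kryo kryo = new Kryo();
        kryo.register(MyData.class);       // registration keeps records small and loading fast

        try (Output out = new Output(new FileOutputStream("records.bin"))) {
            MyData d = new MyData();
            d.date = new Date(); d.one = 1.0; d.two = 2.0; d.comment = "example";
            kryo.writeObject(out, d);      // append one record; repeat for the whole data set
        }

        try (Input in = new Input(new FileInputStream("records.bin"))) {
            MyData back = kryo.readObject(in, MyData.class);
            System.out.println(back.comment);
        }
    }
}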
If you really need to scale things up, look at Hadoop/HDFS for distributed processing (but that's getting rather complicated).
Oh, and generally speaking, JavaDB/Derby tends to suck somewhat.
I would recommend JavaDB. I have used it in a point-of-sale system and it works very well. It is very easy to integrate into your Java application, and you can even package it in the same .jar file if you want.
Using Java DB in Desktop Applications may be a useful article. You will use JDBC for interfacing with the database from Java; this makes it easy to switch to another database if you don't want to use JavaDB.
You'll want to evaluate several databases (you can get trial versions of just about any of them if they're not open source/free already). I'd recommend trying Oracle and MySQL/Postgres, and with the size of your data (and its lack of apparent complexity) you might want to consider a data grid as well (GridGain or similar).
Definitely prototype though.
I'd just like to add that the "fastest" database is not necessarily the best.
You also need to take into account:
reliability,
software license cost,
ease of use,
ease of administration,
availability of support,
and so on.
