I've recently started exploring the Big Data world, experimenting with Apache Storm. I have faced the following problem; I have thought a lot about how to solve it, but all my approaches seem naïve.
Technologies
Apache Storm 0.9.3, Java 1.8.0_20
Context
There is a big XML file (~400 MB) that needs to be read line by line (xml-file-spout). Each line read is then emitted and processed by a chain of bolts.
Guaranteed message processing is required (emitting with anchoring...).
Problem
Since the file is pretty big (it contains about 20 billion lines), I read it with a Scanner based on a buffered stream so as not to load the whole file into memory. So far so good. The problem emerges when there is an error somewhere in the middle of processing: the xml-file-spout itself dies, or some internal issue occurs...
1. Nimbus will restart the spout, but the whole processing starts from the very beginning;
2. This approach does not scale at all.
Solution Thoughts
An initial idea for solving the first problem was to save the current state somewhere: a distributed cache, a JMS queue, a local disk file. When a spout opens, it should find such storage, read the state, and proceed from the specified file line. Here I also thought about storing the state in Storm's Zookeeper, but I don't know whether it is possible to access Zookeeper from the spout (is there such an ability)? Could you please suggest the best practice for this?
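For illustration, here is a minimal sketch of that checkpointing idea using a local file as the state store (with Zookeeper, e.g. via Apache Curator, the open/read/update logic would be the same). All class and method names here are hypothetical, not Storm API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch: persist the number of lines already processed, and skip them on restart.
public class CheckpointedLineReader {

    // Reads the last processed line count, or 0 if no checkpoint exists yet.
    static long readCheckpoint(Path checkpoint) throws IOException {
        if (!Files.exists(checkpoint)) return 0L;
        return Long.parseLong(
            new String(Files.readAllBytes(checkpoint), StandardCharsets.UTF_8).trim());
    }

    // Persists the counter via a temp file + atomic rename, so a crash
    // mid-write cannot corrupt the checkpoint.
    static void writeCheckpoint(Path checkpoint, long linesDone) throws IOException {
        Path tmp = checkpoint.resolveSibling(checkpoint.getFileName() + ".tmp");
        Files.write(tmp, Long.toString(linesDone).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, checkpoint, StandardCopyOption.REPLACE_EXISTING,
                   StandardCopyOption.ATOMIC_MOVE);
    }

    // Streams the file line by line, fast-forwarding past already-processed lines.
    static long process(Path data, Path checkpoint) throws IOException {
        long skip = readCheckpoint(checkpoint);
        long done = skip;
        try (BufferedReader in = Files.newBufferedReader(data, StandardCharsets.UTF_8)) {
            for (long i = 0; i < skip; i++) in.readLine();   // resume point
            String line;
            while ((line = in.readLine()) != null) {
                // emit(line) to the bolt chain would go here
                done++;
                writeCheckpoint(checkpoint, done);           // after ack, in real life
            }
        }
        return done;
    }
}
```

Note that updating the checkpoint only after the line has been fully acked still leaves a small at-least-once window, so downstream processing should stay idempotent.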
For the second problem I thought about breaking the initial file into a set of subfiles and processing them in parallel. This could be done by introducing a new 'breaking' spout, with each subfile processed by a dedicated bolt. In this case a big problem arises with guaranteed processing: in case of an error, the subfile that contains the failed line has to be fully reprocessed (the ack/fail methods of the spout)... Could you please suggest the best practice for solving this problem?
Update
Ok, what I did so far.
Prerequisites
The following topology works because all its parts (spouts and bolts) are idempotent.
Introduced a separate spout that reads file lines (one by one) and sends them to an intermediate ActiveMQ queue ('file-line-queue') so that failed file lines can be replayed easily (see the next step);
Created a separate spout for the 'file-line-queue' queue that receives each file line and emits it to the subsequent bolts. Since I use guaranteed message processing, in case of any bolt's failure the message is reprocessed, and if the bolt chain succeeds the corresponding message is acknowledged (CLIENT_ACKNOWLEDGE mode).
In case of the first (file-reading) spout's failure, a RuntimeException is thrown, which kills the spout. Later on, a dedicated supervisor restarts the spout, causing the input file to be re-read. This will produce duplicated messages, but since everything is idempotent, it is not a problem. It is also worth thinking about a state repository here to produce fewer duplicates...
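As a minimal sketch of the idempotency assumption above: a bolt can skip replayed messages by remembering the ids it has already applied. Here an in-memory set stands in for the shared store (DB, Redis, ...) you would use in a real topology, and all names are illustrative:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an idempotent consumer: duplicates produced by replays are detected
// by message id and skipped, so re-reading the input file does no harm.
public class IdempotentProcessor {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Returns true if the message was applied, false if it was a duplicate.
    public boolean processOnce(String messageId, Runnable sideEffect) {
        if (!seen.add(messageId)) {
            return false;        // replayed message: already applied, skip it
        }
        sideEffect.run();        // the actual work for this file line
        return true;
    }
}
```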
New Issue
In order to make the intermediate JMS layer more reliable, I've added an on-exception listener that restores the connection and session for both the consumer and the producer. The problem is with the consumer: if the session is restored while I have a JMS message unacked in the middle of bolt processing, then after successful processing I need to ack it, but since the session is new, I get a 'can't find correlation id' error.
Could somebody please suggest how to deal with it?
To answer your questions first:
Yes, you can store state somewhere like Zookeeper and use a library like Apache Curator to handle that.
Breaking the files up might help but still doesn't solve your problem of having to manage state.
Let's talk a bit about design here. Storm is built for streaming, not for batch. It seems to me that a Hadoop-ecosystem technology built for batch would work better here: MapReduce, Hive, Spark, etc.
If you are intent on using Storm, then it will help to stream the data somewhere that is easier to work with. You could write the file to Kafka or a queue to help with your problems of managing state, ack/fail, and retry.
Related
I'm using a Spring Boot back-end to provide some restful API and need to log all of my request-response logs into ElasticSearch.
Which of the following two methods has better performance?
Using Spring Boot ResponseBodyAdvice to log every request and response that is sent to the client directly to ElasticSearch.
Log every request and response into a log file and using filebeat and/or logstash to send them to ElasticSearch.
First off, I assume that you have a distributed application; otherwise just write your stuff to a log file and that's it.
I also assume that you have quite a lot of logs to manage; otherwise, if you're planning to log a couple of messages an hour, it doesn't really matter which way you go - both will do the job.
Technically both ways can be implemented, although for the first path I would suggest a different approach - at least, I did something similar ~5 years ago in one of my projects:
Create a custom log appender that throws everything into a queue (for async processing), and from that queue use the Apache Flume project, which can write to the DB of your choice in a transactional manner with batch support, "all-or-nothing" semantics, etc.
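A minimal sketch of that appender-to-queue design, with a hypothetical BatchSink interface standing in for Flume (or an ES bulk writer) and all names illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: the appender only enqueues (cheap for the request thread); a
// dedicated consumer thread drains the queue and hands whole batches to a sink.
public class AsyncBatchLogger {
    public interface BatchSink { void write(List<String> batch); }

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
    private final BatchSink sink;
    private final int maxBatch;

    public AsyncBatchLogger(BatchSink sink, int maxBatch) {
        this.sink = sink;
        this.maxBatch = maxBatch;
    }

    // Called from the business/request thread: enqueue, drop if the queue is
    // full rather than block the business flow.
    public boolean append(String logEvent) { return queue.offer(logEvent); }

    // Called in a loop from a dedicated consumer thread.
    public void drainOnce() throws InterruptedException {
        List<String> batch = new ArrayList<>(maxBatch);
        batch.add(queue.take());               // block until something arrives
        queue.drainTo(batch, maxBatch - 1);    // then grab whatever else is queued
        sink.write(batch);                     // one bulk write instead of N inserts
    }
}
```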
This approach solves issues that might appear in the "first" option that you've presented, while some other issues will be left unsolved.
If I compare the first and the second options that you've presented, I think you are better off with filebeat/logstash, or even both, writing to ES. Here is why:
When you log in the advice, you "eat" the resources of your JVM: memory and CPU to maintain the ES connection pool, plus a thread pool for doing the actual logging (otherwise the business flow might slow down because of logging requests to ES).
In addition, you won't be able to write to Elasticsearch "in batch" without custom code, and instead will have to create an "insert" per log message, which can be wasteful.
One more "technicality": what happens if the application gets restarted for some reason? Will you be able to write all the logs produced prior to the restart if everything gets logged in the advice?
Yet another issue: what happens if you want to "rotate" the indexes in ES, namely create an index with a TTL and produce a new index every day?
filebeat/logstash potentially can solve all these issues, however they might require a more complicated setup.
Besides, obviously you'll have more services to deploy and maintain:
logstash is much heavier than filebeat from the resource-consumption standpoint, and usually you should parse the log message (typically with a grok filter) in logstash.
filebeat is much more "humble" when it comes to resource consumption. If you have many instances to log from (really distributed logging, which I've assumed you have anyway), consider putting a filebeat service (a DaemonSet if you have k8s) on each node from which you gather logs, so that a single filebeat process can handle different instances; then deploy a cluster of logstash instances on a separate machine so that they do the heavy log-crunching all the time and stream the data to ES.
How does logstash/filebeat help?
Out of my head:
It runs at its own pace, so even if your process goes down, the messages produced by that process will still be written to ES eventually.
It can even survive short outages of ES itself, I think (worth verifying).
It can handle processes written in different technologies. What if tomorrow you want to gather logs from, say, a database server that doesn't run Spring and isn't written in Java at all?
It can handle index rotation and batch writing internally, so you end up with efficient ES management that you would otherwise have to write yourself.
What are the drawbacks of the logstash/filebeat approach?
Again, out of my head, not a full list or something:
Well, much more data will go over the network overall.
If you use a structured "LogEvent", you don't need to parse the string, so this conversion is redundant.
As for performance implications: it basically depends on what you measure, what exactly your application looks like, and what hardware you have, so I'm afraid I can't give you a clear answer on that. You should measure your concrete case and find the way that works better for you.
Not sure if you can expect a clear answer to that. It really depends on your infrastructure and used hardware.
And do you mean by performance the performance of your spring boot backend application or performance in terms of how long it takes for your logs to arrive at ElasticSearch?
I just assume the first one.
When sending the logs directly to ElasticSearch, your bottleneck will be the network, while when logging requests and responses to a log file first, your bottleneck will probably be the hard disk and its maximum I/O throughput.
Normally I would say that sending the logs directly to ElasticSearch over the network should be the faster option when you are operating inside your own company/network, because writing to disk is always quite slow in comparison. But if you are using fast SSDs, the effect should be negligible. And if you need to send your network packets to a different location/country, this can also change quickly.
So in summary:
If you have a fast network connection to your ElasticSearch and HDDs/slower SSDs the performance might be better using the network.
If your ElasticSearch is not at your location and you can use fast SSD, writing the logs into a file first might be the faster option.
But in the end you maybe have to try out both approaches, implement some timers and check for yourself.
We are using both solutions. The first approach has less complexity.
We choose the second approach when we don't want to touch the code and have too many instances of the app.
About performance: writing directly to ElasticSearch gives better performance because you are not occupying disk I/O. But suppose the connection between your app and the ElasticSearch server drops: you would lose logs after some retry attempts.
Using rsyslog and logstash is more reliable for big clusters.
We have to process large CSV files. We use Apache Camel for reading the files from an SFTP location (But we are open to Java based solutions if there are better approaches).
One of the requirements is to resume processing from the point of failure. That is, if an exception happens while processing line number 1000, we should start processing from line 1000 rather than from the beginning. We should not process a record twice, either.
We are using Apache ActiveMQ to save the records in the queues and for managing the pipeline. But initial loading of the file from the location can also cause failures.
To track the state, we are using a database which will get updated at every step using Apache Camel.
We are open to ideas and suggestions. Thanks in advance.
As far as I know, Camel File component cannot resume from the point of failure.
It depends on your configuration (see moveFailed option) if a failed file is moved away or reprocessed on the next attempt (but from the beginning).
To read a CSV file, you need to split the single lines. Because your files are big, you should use the streaming option of the Splitter. Otherwise the whole file is read before splitting!
To decrease the probability of failures and reprocessing of the whole file, you can simply send every single CSV line to ActiveMQ (without parsing it). The simpler the splitter, the lower the probability that you need to reprocess the whole file because of problems in a single record.
The decoupled consumer of the queue can parse and process the CSV records without affecting the file import. Like this, you can handle errors for every single record.
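A minimal sketch of that split-without-parsing idea, with a BlockingQueue standing in for ActiveMQ and all names hypothetical:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;

// Sketch: the importer pushes raw CSV lines onto the queue without parsing
// them, so a malformed record can only fail on the decoupled consumer side,
// not during the file import itself.
public class RawLineSplitter {

    // Import side: stream lines into the queue without touching their content.
    static int importLines(BufferedReader reader, BlockingQueue<String> queue)
            throws IOException, InterruptedException {
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            queue.put(line);    // no parsing here: nothing record-specific can fail
            count++;
        }
        return count;
    }

    // Consumer side: parse one queued line; errors affect only this record.
    static String[] parseNext(BlockingQueue<String> queue) throws InterruptedException {
        return queue.take().split(",", -1);
    }
}
```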
If you nevertheless have file import failures, the file is reprocessed from the beginning. Therefore you should design your processing pipeline to be idempotent. For example, check for an existing record and, if there already is one, update it instead of just inserting every record.
In a messaging environment you have to deal with at-least-once delivery semantics. The only solution is to have idempotent components. Even if Camel would try to resume at the point of failure, it could not guarantee that every record is read only once.
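To make the idempotency point concrete, here is a tiny sketch where a map stands in for the database table; applying the same record twice leaves the same end state, so at-least-once redelivery is harmless:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an idempotent "insert or update" step: under at-least-once
// delivery a record may arrive twice, so the write must converge to the same
// state either way. A ConcurrentHashMap stands in for the real table.
public class IdempotentUpsert {
    private final Map<String, String> table = new ConcurrentHashMap<>();

    // "Update if present, insert otherwise" - no duplicate rows on redelivery.
    public void upsert(String key, String value) {
        table.put(key, value);
    }

    public String get(String key) { return table.get(key); }
    public int size() { return table.size(); }
}
```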
We are building a Java application that will use embedded Neo4j for graph traversal. Below are the reasons why we want to use the embedded version instead of a centralized server:
This app is not the data owner. Data will be ingested into it through another app. Keeping data locally helps us do quick calculations and hence improves our API SLA.
Since the data footprint is small, we don't want to maintain a centralized server, which would incur additional cost and maintenance.
No need for additional cache
Now this architecture brings two challenges: first, how to update the data in all instances of the embedded Neo4j application at the same time; second, how to make sure that all instances are in sync, i.e. using the same version of the data.
We thought of using Kafka to solve the first problem. The idea is to have a Kafka listener with a different group id in each instance (to ensure they all get updates). Whenever there is an update, an event is posted to Kafka; all instances listen for the event and perform the update operation.
However, we still don't have a solid design for the second problem. For various reasons one of the instances can miss an event (its consumer is down). One way is to keep checking the latest version by calling an API of the data-owner app, and if the version is behind, replay the events. But this brings the additional complexity of maintaining an event log of all updates. Do you think this can be done in a better and simpler way?
Kafka consumers are extremely consistent and reliable once you have them configured properly, so there shouldn't be any reason for them to miss messages unless there's an infrastructure problem, in which case any solution you architect will have problems. If the Kafka cluster is healthy (e.g. at least one copy of the data is available, and a quorum of ZooKeeper nodes is up and running), then your consumers should receive every single message from the topics they're subscribed to. The consumer will handle retries/reconnecting itself, as long as your timeout/retry configurations are sane. The default configs in the latest Kafka versions are adequate 99% of the time.
Separately, you can add a separate thread, for example, that is constantly checking what the latest offset is per topic/partitions, and compare it to what the consumer has last received, and maybe issue an alert/warning if there is a discrepancy. In my experience, and with Kafka's reliability, it should be unnecessary, but it can give you peace of mind, and shouldn't be too difficult to add.
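A minimal sketch of that lag check, with plain maps standing in for what KafkaConsumer.endOffsets() and position() would return (all names here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: compare the latest offset per partition with the offset the consumer
// last processed, and flag any partition whose lag exceeds a threshold.
public class LagMonitor {

    // Returns partition -> lag for every partition whose lag exceeds maxLag,
    // i.e. the candidates for an alert/warning.
    static Map<String, Long> findLagging(Map<String, Long> endOffsets,
                                         Map<String, Long> consumed,
                                         long maxLag) {
        Map<String, Long> lagging = new HashMap<>();
        for (Map.Entry<String, Long> e : endOffsets.entrySet()) {
            long position = consumed.getOrDefault(e.getKey(), 0L);
            long lag = e.getValue() - position;
            if (lag > maxLag) {
                lagging.put(e.getKey(), lag);
            }
        }
        return lagging;
    }
}
```

In the monitoring thread you would populate both maps from the live consumer and raise an alert for whatever findLagging returns.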
Apologies for the wording of my question.
I am using TomEE.
I have an ActiveMQ queue set up and receiving messages from a producer (the TomEE-provided example).
It is persisted in MySQL (in case that matters).
My scenario is this...
A message comes into the queue
A consumer/monitor reads the message and starts a thread to run a process (backup, copying, processing, etc.) that could take some time to complete.
At any one time I could have 5 messages to process or 500+ (and anything in between).
Ideally, I would like some Java/Apache library that is designed to monitor the queue and read 10 messages (for example) and then start the threads and then wait for one to finish before starting any more. For all intents and purposes I am trying to create a 'thread pool' or 'work queue' that prevents too many processes from starting up at any one time.
OR
Does this need to be thread pooled outside of ActiveMQ ?
I'm new to JMS and am beginning to understand it but still a long way to go.
Any help is appreciated.
Trevor
What you are looking to do sounds like something that could easily be solved using Apache Camel. Take a look at the Camel documentation for the competing consumers EIP which sounds like an ideal fit for your case.
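If you would rather stay with plain JDK classes, the "at most N jobs in flight" part can also be sketched with a fixed thread pool plus a semaphore. Names here are illustrative; you would call submit() from your JMS MessageListener:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Sketch: the consumer thread submits each message's job; once maxConcurrent
// jobs are running, submit() blocks, so no further messages are pulled from
// the queue until a worker slot frees up.
public class BoundedWorkQueue {
    private final ExecutorService pool;
    private final Semaphore permits;

    public BoundedWorkQueue(int maxConcurrent) {
        this.pool = Executors.newFixedThreadPool(maxConcurrent);
        this.permits = new Semaphore(maxConcurrent);
    }

    // Blocks the calling (consumer) thread until a worker slot is free.
    public void submit(Runnable job) throws InterruptedException {
        permits.acquire();
        pool.execute(() -> {
            try {
                job.run();            // backup, copying, processing, ...
            } finally {
                permits.release();    // free the slot for the next message
            }
        });
    }

    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```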
Of course there is the obvious way of using "synchronized".
But I'm creating a system designed to run on several cores that writes to that file multiple times within the same millisecond, so I believe using synchronized will hurt performance badly.
I was thinking of using Java's Pipe class (though I'm not sure it would help), or having each thread write to a different file and an additional thread collecting those writes to create the final result.
I should mention that the order of the writes isn't important; each write is timestamped in nanotime anyway.
Which of those two is the better idea? Do you have any other suggestions?
Thanks.
Using some sort of synchronization (e.g. a single mutex) is quite easy to implement.
If I had enough RAM I would create a queue of some sort for each log-producer thread, and a log-consumer thread to read from all the queues in a round-robin fashion (or something like that).
Not a direct answer to your question, but the logback project has synchronization facilities built in for writing to the same file from different threads, so you might try it if it suits your needs, or at least take a look at its source code. Since it's built for speed, I'm pretty sure the implementation is not trivial.
You are right to be concerned, you are not going to be able to have all the threads write to the same file without a performance problem.
When I had this problem (writing my own logging, way back before Log4j) I created two fixed-size buffers in memory and had all the producer threads write to one buffer while a dedicated consumer thread read from the other buffer and wrote to a file. That way the writer threads had to synchronize only on getting and incrementing the index to the buffer and when the buffers were being swapped, and it only blocked when the current buffer was full. It was memory-intensive but fast.
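Here is a much-simplified sketch of that two-buffer scheme (growable lists instead of fixed-size arrays, so this version never blocks when a buffer fills; all names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: producers append to the "front" buffer under a very short lock; the
// consumer swaps the buffers under the same lock, then writes the old buffer
// to the file outside any lock, so writers never wait on disk I/O.
public class DoubleBufferLog {
    private List<String> front = new ArrayList<>();
    private List<String> back = new ArrayList<>();
    private final Object lock = new Object();

    // Producer side: only the list append happens inside the lock.
    public void log(String line) {
        synchronized (lock) { front.add(line); }
    }

    // Consumer side: swap buffers quickly, then flush the old one at leisure.
    public List<String> swapAndDrain() {
        List<String> toWrite;
        synchronized (lock) {
            toWrite = front;
            front = back;
            back = toWrite;
        }
        List<String> copy = new ArrayList<>(toWrite);  // write these to the file
        toWrite.clear();                               // ready for the next swap
        return copy;
    }
}
```

The consumer thread would call swapAndDrain() in a loop and append each returned batch to the log file.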
For other ideas you could check out how loggers like Log4j and Logback work, they would have had to solve this problem.
Try using JMS. All your processes running on different machines can send JMS messages instead of writing to the file. Create a single queue receiver that receives the messages and writes them to the file. Log4j already has this functionality: see JMSAppender.