Need help deciding what frameworks I can use in this scenario. I'm exploring ZooKeeper, but I'm not completely sure how to approach this use case.
Background:
Say there is an application that connects to a streaming source (Kafka, ActiveMQ, etc.) and writes the messages it processed from the stream to a file.
This application is deployed as 4 instances. Each instance processes messages and writes the ones it processed in the last hour to its own file; for example, the file servername_8.00 holds the messages processed from 8 to 9.
The requirement is to transfer all the files that were created in the last hour, but only if every instance created a file in that window, and to also send a single consolidated file that lists all 4 file names and the number of records in each.
What I'm looking for:
1. How do I make sure the application instances know whether the other instances also created files, and transmit the files only if every instance created one?
2. Whichever instance sends the files, the consolidated file should record what was transmitted.
What frameworks can I use to solve this?
You can definitely use ZooKeeper for this. I would use Apache Curator as well (note: I'm the main author of Curator).
Do all the instances share a file server? i.e. can each instance see all of the created files? If so, you can nominate a leader using ZooKeeper/Curator and only the leader does all of the work. You can see sample leader election code here: https://github.com/apache/curator/tree/master/curator-examples/src/main/java/leader
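For illustration, here is a minimal leader-election sketch using Curator's LeaderLatch recipe; the ZooKeeper connection string, latch path, and instance id are assumptions for the example:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class HourlyTransferLeader {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            // all 4 instances race for the same latch path; exactly one wins
            LeaderLatch latch = new LeaderLatch(client, "/app/transfer-leader", "instance-1");
            latch.start();
            latch.await(); // blocks until this instance becomes leader

            // only the leader reaches this point: collect the 4 hourly files,
            // build the consolidated file, and transmit everything
        }
    }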
If the instances do not share a file server, you could still use ZooKeeper to coordinate the writing of the shared file. You'd again nominate a leader who exposes an API of some kind that all the instances can write to and the leader creates the shared file.
You also might find the Curator barrier recipes useful: http://curator.apache.org/curator-recipes/double-barrier.html and http://curator.apache.org/curator-recipes/barrier.html
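As a hedged sketch of how the double barrier maps onto this use case (the barrier path is an assumption): each instance enters the barrier after writing its hourly file, and nobody proceeds to the transmit step until all 4 have arrived.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.recipes.barriers.DistributedDoubleBarrier;

    public class HourlyBarrier {
        // client is a started CuratorFramework, as in the leader-election sketch above
        public void transmitWhenAllReady(CuratorFramework client) throws Exception {
            DistributedDoubleBarrier barrier =
                    new DistributedDoubleBarrier(client, "/app/hourly-barrier", 4);
            barrier.enter();  // returns only once all 4 instances have written their file
            // ... transmit this instance's file (and, on the leader, the consolidated file) ...
            barrier.leave();  // wait for all instances to finish before the next hour
        }
    }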
You'd have to give a lot more detail on your use case if you want a more detailed design.
I am a newbie to Spring Boot and microservice development and have a question related to designing a Spring Boot service.
Requirement -
We have a requirement where a Spring Boot service needs to listen to 3 or 4 different Kafka topics individually and create 3 or 4 CSV files respectively (after filtering a few of the attributes from each event message), then upload the files to an FTP server at different times of day.
I am thinking of a solution like the below -
#1 I am thinking of having a Kafka consumer which can read from one Kafka topic, apply filters, and keep creating files throughout the day. Once a file reaches 100 MB, it gets rotated, e.g. A1.csv, A2.csv, etc.
#2 Also to have a job manager which can create cron jobs that stitch the files together once a day and upload them to FTP.
https://spring.io/guides/gs/scheduling-tasks/
#3 I would like jobs to be created on the basis of configuration, so that if we want to add new jobs tomorrow it can be done quickly.
#4 How should this be designed so that it scales? The number of incoming events will be huge.
#5 Would it be recommended to use an elastic cache instead of creating multiple files and then stitching them together into a single file?
#6 I also want fail-safe logic, so that if the service fails I can continue from where I left off.
Please point me to any existing solutions etc. that I can refer to, and any API which can help with job/batch scheduling and with managing configurations.
Regards,
Dan
Kafka supports checkpointing via consumer offsets, so if the application fails you can resume consuming events from the same point. With manual commits, the offset only advances once your job is done with the records.
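A minimal sketch of that pattern with the plain Kafka consumer API, assuming a broker on localhost:9092 and a hypothetical topic name topic-a; auto-commit is disabled so the offset only moves after the records are durably written:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class CsvWriterConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "csv-writer");
            props.put("enable.auto.commit", "false"); // commit manually, only after the batch is persisted
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("topic-a"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // filter the attributes and append the row to the current CSV file
                    }
                    consumer.commitSync(); // checkpoint: on restart, consumption resumes from here
                }
            }
        }
    }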
At this scale, it is good to have Spark jobs etc. to speed up the computation.
You can have a distributed cache cluster like Redis or Aerospike to cache some of the files for fast reads.
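If you go the Redis route, a tiny sketch with the Jedis client (the host, port, and key naming are assumptions):

    import redis.clients.jedis.Jedis;

    public class FileCache {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // cache a rotated CSV part under a predictable key for fast reads
                jedis.set("csv:topic-a:A1", "id,name\n1,alice\n");
                String cached = jedis.get("csv:topic-a:A1");
                System.out.println(cached);
            }
        }
    }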
It's good to make the job trigger time configurable, so it can be changed later.
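Following the scheduling guide linked above, a sketch of a configurable trigger time with Spring's @Scheduled; the property name upload.cron is an assumption:

    import org.springframework.scheduling.annotation.EnableScheduling;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;

    @Component
    @EnableScheduling // typically placed on the main @SpringBootApplication class instead
    public class FtpUploadJob {

        // cron expression read from configuration, e.g. upload.cron=0 0 2 * * * in application.properties
        @Scheduled(cron = "${upload.cron}")
        public void stitchAndUpload() {
            // stitch the day's rotated CSV parts into one file and push it to the FTP server
        }
    }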
Let me know what else you would need.
Is there any possibility to store logs from my different applications, possibly written in different languages, in a single file, ordered by timestamp?
You could retrieve and aggregate every application's logs with something like Logstash.
Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favorite “stash.”
If you can ensure that every one of your applications outputs logs with the same pattern, Logstash (plus Elasticsearch or anything of its kind behind it) should answer your needs exactly.
I have two applications. The first one creates files, and the second application uses these files. When the first application changes a file, the second application should be notified.
I tried to do this with ServerSocket and it does work: the first application is a client (java.net.Socket) and the second is a server (java.net.ServerSocket).
But it should also work for multiple instances of the applications: if we have multiple instances of application two, the first should alert each one.
Both applications are desktop applications running on the same machine, without any databases. The question is how to design this, not about the actual code; the code I have runs OK, it just doesn't fit the specification.
To understand the problem, let's take an example.
There is one application producing something, let's call it prodApp, and there are many other applications which should get notified, let's call them consApp1, consApp2, ... consAppN.
A solution to this problem can be designed using JMS (Java Message Service).
JMS provides a way for multiple consApps to register at one place (called a TOPIC in JMS) and get notified as soon as something is put on the TOPIC (which in this case will be done by prodApp).
So it works like this: prodApp does its processing and writes its status to the JMS TOPIC; as a result, all the consApps get notified and start their own processing.
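A minimal sketch of the producer side of that topic-based notification, assuming an ActiveMQ broker on tcp://localhost:61616 and a hypothetical topic name file.changed:

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageProducer;
    import javax.jms.Session;
    import javax.jms.Topic;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class ProdAppNotifier {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection connection = factory.createConnection();
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = session.createTopic("file.changed");

            // every consApp subscribed to this topic receives the message
            MessageProducer producer = session.createProducer(topic);
            producer.send(session.createTextMessage("/path/to/changed/file"));
            connection.close();
        }
    }

On the other side, each consApp creates a MessageConsumer on the same topic and registers a MessageListener, so all running instances get the notification without the producer knowing about them.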
In case the number of files is small and they are known to be saved in a single place, the second application(s) could check the files periodically (say, every minute) for the last modification time of each file.
This could even be faster than sockets, RMI, or other network communication.
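A minimal polling sketch of that idea; the directory path and the one-minute interval are assumptions:

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;

    public class FileChangePoller {
        public static void main(String[] args) throws InterruptedException {
            Map<String, Long> lastSeen = new HashMap<>();
            File dir = new File("/shared/files"); // the single place the files are saved
            while (true) {
                File[] files = dir.listFiles();
                if (files != null) {
                    for (File f : files) {
                        long modified = f.lastModified();
                        Long previous = lastSeen.put(f.getName(), modified);
                        if (previous != null && previous != modified) {
                            // the file changed since the last pass: trigger processing here
                        }
                    }
                }
                Thread.sleep(60_000); // check once per minute
            }
        }
    }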
We have an existing Spring Batch application that we want to make scalable so it can run on multiple nodes.
The scalability docs for Spring Batch involve code changes and configuration changes.
I am just wondering if this can be achieved with configuration changes alone (adding new classes and wiring them in configuration is fine; I just want to avoid code changes to existing classes).
Thanks a lot for the help in advance.
It really depends on your situation. Specifically, why do you want to run on multiple nodes? What is the bottleneck you're attempting to overcome? The two typical scenarios that Spring Batch handles out of the box for scaling across multiple nodes are remote chunking and remote partitioning. Both are master/slave configurations, but each has a different use case.
Remote chunking is used when the processor in a step is the bottleneck. In this case, the master node reads the input and sends it via a Spring Integration channel to remote nodes for processing. Once an item has been processed, the result is returned to the master for writing. Reading and writing are done locally on the master. While this helps parallelize processing, it takes an I/O hit because every item is sent over the wire (and requires guaranteed delivery, via JMS for example).
Remote partitioning is the other scenario. In this case, the master generates a description of the input to be processed by each slave, and only that description is sent over the wire. For example, if you're processing records in a database, the master may send a range of row ids to each slave (1-100, 101-200, etc.). Reading and writing occur locally on the slaves, and guaranteed delivery is not required (although it is useful in certain situations).
Both of these options can be done with minimal (or no) new classes depending on your use case. There are a couple different places to look for information on these capabilities:
Spring Batch Integration Github repository - Spring Batch Integration is the project that supports the above use cases. You can read more about it here: https://github.com/spring-projects/spring-batch-admin/tree/master/spring-batch-integration
My remote partitioning example - This talk walks through remote partitioning and provides a working example to run on Cloud Foundry (currently only works on CF v1, but updates for CF2 are coming in a couple of days). The configuration is almost the same; only the connection pool for Rabbit is different: https://github.com/mminella/Spring-Batch-Talk-2.0 The video for this presentation can be found on YouTube here: http://www.youtube.com/watch?v=CYTj5YT7CZU
Gunnar Hillert's presentation on Spring Batch and Spring Integration: This was presented at SpringOne2GX 2013 and contains a number of examples: https://github.com/ghillert/spring-batch-integration-sample
In any of these cases, remote chunking should be achievable with zero new classes. Remote partitioning typically requires you to implement one new class (the Partitioner).
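For reference, a sketch of what that one new class might look like: a range-based Partitioner along the lines of the row-id example above (the total row count and the key names minId/maxId are assumptions):

    import java.util.HashMap;
    import java.util.Map;
    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.item.ExecutionContext;

    // each slave gets a contiguous block of row ids via its ExecutionContext
    public class RowRangePartitioner implements Partitioner {

        private final long totalRows; // e.g. looked up with a COUNT(*) before the step runs

        public RowRangePartitioner(long totalRows) {
            this.totalRows = totalRows;
        }

        @Override
        public Map<String, ExecutionContext> partition(int gridSize) {
            Map<String, ExecutionContext> partitions = new HashMap<>();
            long rangeSize = (totalRows + gridSize - 1) / gridSize;
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext context = new ExecutionContext();
                context.putLong("minId", i * rangeSize + 1);                        // 1, 101, 201, ...
                context.putLong("maxId", Math.min((i + 1) * rangeSize, totalRows)); // 100, 200, ...
                partitions.put("partition" + i, context);
            }
            return partitions;
        }
    }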
We have a goal of moving database table rows, represented as text files, from several machines to a single machine. Our current solution is file based:
- Zip the files, then send them over the wire.
- The server receives the zip files from those machines and unzips them to the appropriate folder.
There are lots of other file-moving operations happening in between, which is really error-prone.
I'm thinking of using Hazelcast to move each "row" String to the server. Is Hazelcast up to this kind of job?
The text files are generated on many machines at a rate of 200K to 300K per day, and these files must be sent to the server. So I want to migrate this to Hazelcast.
You can do this with Hazelcast, but it is the wrong use case for it. Hazelcast synchronizes in all directions: if you add an entry on client1, it will be transferred to the server but also to client2. Even if this doesn't scare you, it shows Hazelcast is misused here.
You will be better off implementing a simple web service on the server to which the clients push the "rows".
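As a minimal sketch of that push model using only the JDK's built-in HTTP server (the port, endpoint path, and output file are assumptions):

    import com.sun.net.httpserver.HttpServer;
    import java.io.InputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class RowReceiver {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/rows", exchange -> {
                try (InputStream in = exchange.getRequestBody()) {
                    String row = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    // append each pushed row to the consolidated file on the server
                    Files.write(Paths.get("rows.txt"),
                            (row + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                }
                exchange.sendResponseHeaders(204, -1); // no response body
                exchange.close();
            });
            server.start();
        }
    }

Each client machine then simply POSTs its rows to http://server:8080/rows with any HTTP client, which keeps the data flow strictly one-directional.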