I have read many sites about designing and setting up Storm topologies, but I still don't have a clear picture.
In my project, I am going to process more than a million records, so I plan to create topologies dynamically based on internal modules. The count might exceed a thousand. My question is: what is the best way to manage topologies? How many topologies can be created in a single cluster? Are there any problems with maintaining multiple topologies?
I would say that this really depends on the machines in your cluster, so it is hard to answer in general. This is especially true if the cluster has heterogeneous instances.
Basically, Storm can handle many topologies, which you can control via the CLI or the GUI.
I am currently managing mine with the storm list and storm kill commands. The limits should be the RAM, storage and network connections of the individual machines. To be precise, I would predict that the bottleneck is the memory available to the JVMs on a supervisor instance. A supervisor can hold multiple workers (each running components like bolts and spouts, and each initially configured with a JVM of 256 MB), but if there are too many workers, the total memory they consume will exceed what the supervisor machine can provide.
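As a rough back-of-the-envelope check of the memory bottleneck described above, the sketch below estimates how many worker JVMs fit on one supervisor machine and how many machines a given number of topologies would need. The 256 MB worker heap is the figure from this answer; the machine size, the reserved overhead, and the class and method names are assumptions for illustration and are not part of Storm. It also assumes every topology needs at least one worker of its own, so a thousand topologies need at least a thousand workers.

```java
final class WorkerCapacity {
    /** Workers that fit in memory after reserving some for the OS and the supervisor daemon. */
    static long maxWorkers(long machineMb, long reservedMb, long workerHeapMb) {
        if (workerHeapMb <= 0) {
            throw new IllegalArgumentException("workerHeapMb must be positive");
        }
        long usableMb = Math.max(0, machineMb - reservedMb);
        return usableMb / workerHeapMb;
    }

    /** Machines needed if each topology takes at least one worker (rounded up). */
    static long machinesNeeded(long topologies, long workersPerMachine) {
        if (workersPerMachine <= 0) {
            throw new IllegalArgumentException("workersPerMachine must be positive");
        }
        return (topologies + workersPerMachine - 1) / workersPerMachine;
    }

    public static void main(String[] args) {
        // Assumed figures: a 16 GB machine, 2 GB reserved, 256 MB per worker.
        long perMachine = maxWorkers(16_384, 2_048, 256);
        System.out.println("workers per machine: " + perMachine);
        System.out.println("machines for 1000 topologies: " + machinesNeeded(1_000, perMachine));
    }
}
```

This ignores CPU, network and off-heap memory, so it only gives an upper bound on what memory alone allows.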
Related
I have a program which spins up thousands of threads. I am currently using one host for all the threads which takes a lot of time. If I want to use multiple hosts (say 10 hosts, each running 100 different threads), how should I proceed ?
Having thousands of threads on a single JVM sounds like a bad idea - you may spend most time context-switching instead of doing the actual work.
To split your work across multiple hosts, you cannot use threads managed by a single JVM. You'll need each host to expose an API that can receive part of the work and return the result once it is done.
One approach would be to use Java RMI (remote method invocation) to complete this task, but really, your question lacks so many details important for the decision of what architecture to choose.
Creating 1000 threads in one JVM is a very bad design; you need to minimise the count.
A high thread count will not give you the benefit of multi-threading, because context switching will be very frequent and will hurt performance.
If you are thinking of dividing the work across multiple hosts, then you need a parallel processing system like Hadoop or Spark.
These handle task allocation internally and provide a central system for synchronising all the hosts on which threads/tasks are running.
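To illustrate the point about keeping the thread count low, here is a minimal sketch that runs a thousand small tasks on a fixed pool sized to the number of CPU cores instead of starting one thread per task. The class name, the method name and the placeholder work (squaring a number) are assumptions for illustration; real tasks that block on I/O may justify a larger pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BoundedPoolExample {
    /** Runs taskCount tasks on a pool with one thread per core and sums their results. */
    static long runTasks(int taskCount) {
        int poolSize = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 1; i <= taskCount; i++) {
                final long n = i;
                tasks.add(() -> n * n); // placeholder for the real unit of work
            }
            long sum = 0;
            // invokeAll blocks until every task has finished.
            for (Future<Long> result : pool.invokeAll(tasks)) {
                sum += result.get();
            }
            return sum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runTasks(1_000));
    }
}
```

The tasks queue up inside the executor, so the number of live threads stays at the pool size no matter how many tasks are submitted.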
I am working as a developer on a batch processing solution. It works by splitting a big file and processing it across JVMs. We have 4 processor JVMs, each of which takes a chunk of the file and processes it, and 1 gateway JVM. The job of the gateway JVM is to split the file into as many chunks as there are processor JVMs (i.e. 4) and send a REST request to each processor JVM. The REST request carries all the details, such as the location from which the file has to be picked up.
Now, if I want to add another processor JVM without any downtime, is there any way to do it? Currently we maintain the URLs for the 4 JVMs in a property file. Is there a better approach that would let me add more JVMs without restarting any component?
You can consider setting up a load balancer and putting your JVM(s) behind it. The load balancer would be responsible for distributing the incoming requests to the JVMs.
This way you can scale your JVMs up or down depending on the workload. Also, if one of the JVMs is not working, the other parts of your system no longer need to care about it.
I am not sure what your use case is or which tech stack you are using, but it seems that you need a distributed system with auto-scaling and dynamic provisioning capabilities. Have you considered Hadoop or Spark clusters, or Akka?
If you cannot use any of these, then the solution is to maintain the list of JVMs in some datastore (let's say in a table). It is dynamic data, meaning one can add, remove or update JVMs. Then you need a resource manager that can decide whether to spin up a new JVM based on load or any other conditional logic. This resource manager needs to monitor the entire system. Also, whenever you create a task, chunk or slice of data, distribute it using a message queue such as Apache ActiveMQ. You can also consider Kafka for complex use cases. Nowadays, application servers such as WebSphere (Liberty profile) and WebLogic also provide auto-scaling capability, so if you are already using such an application server, you can think about making use of that capability. I hope this helps.
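The idea of keeping the JVM list as dynamic data rather than in a property file can be sketched as a small registry that the gateway consults for every chunk. This is only an in-memory illustration: the class name, the method names and the URLs are hypothetical, and a real setup would back the list with the datastore or service registry mentioned above so that it survives a gateway restart.

```java
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

final class WorkerRegistry {
    // Thread-safe list, so processors can be added or removed while requests are dispatched.
    private final CopyOnWriteArrayList<String> urls = new CopyOnWriteArrayList<>();
    private final AtomicInteger counter = new AtomicInteger();

    /** Adds a processor JVM at runtime; duplicates are ignored. */
    void register(String url) {
        urls.addIfAbsent(url);
    }

    /** Removes a processor JVM, for example after a failed health check. */
    void unregister(String url) {
        urls.remove(url);
    }

    /** Picks the next processor in round-robin order from the current list. */
    String nextWorker() {
        String[] snapshot = urls.toArray(new String[0]);
        if (snapshot.length == 0) {
            throw new IllegalStateException("no processor JVMs registered");
        }
        int index = Math.floorMod(counter.getAndIncrement(), snapshot.length);
        return snapshot[index];
    }

    public static void main(String[] args) {
        WorkerRegistry registry = new WorkerRegistry();
        registry.register("http://processor-1:8080"); // hypothetical URLs
        registry.register("http://processor-2:8080");
        System.out.println(registry.nextWorker());
        registry.register("http://processor-3:8080"); // added without a restart
        System.out.println(registry.nextWorker());
        System.out.println(registry.nextWorker());
    }
}
```

Because the gateway asks the registry for a target on every request, a newly registered processor starts receiving chunks immediately.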
We are running the same Jetty service on two servers but are seeing a different number of threads created by each service (50 vs ~100 threads).
Both servers are running identical Java code on RedHat5 (they do have slightly different kernels). Yet Jetty on one of the servers creates more threads than the other one. How is it possible?
Thread counts are dynamic and depend on many factors.
The number of threads that you see at any one point can vary greatly, based on hardware differences (number of cpu cores, number of network interfaces, etc), kernel differences, java differences, load differences, active user counts, active connection counts, transactions per second, if there are external dependencies (like databases), how async processing is done, how async I/O is done, use of http/2 vs http/1, use of websocket, and even ${jetty.base} configuration differences.
As for the counts you are seeing, 50 vs 100, that's positively tiny for a production server. Many production servers on moderately busy systems can use 500 (Java) threads, and on very busy commodity systems it can be in the 5,000+ range. Even on specialized hardware (like Azul Systems devices) it's not unheard of to be in the 90,000+ thread range with multiple active network interfaces.
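To see where a difference like 50 vs 100 comes from, it can help to group the live threads of each JVM by name and compare the two breakdowns. The sketch below does that with standard JDK calls only. The class and method names are assumptions, and the code would have to run inside the server JVM being inspected (for example from a diagnostic endpoint); a thread dump taken with jstack gives the same information from outside.

```java
import java.util.Map;
import java.util.TreeMap;

final class ThreadCensus {
    /** Counts live threads in this JVM, grouped by name with trailing digits and dashes removed. */
    static Map<String, Integer> countByPrefix() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            // Pool threads usually end in a counter, e.g. "pool-1-thread-7" becomes "pool-1-thread".
            String prefix = thread.getName().replaceAll("[-\\d]+$", "");
            counts.merge(prefix, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        countByPrefix().forEach((prefix, count) -> System.out.println(count + "\t" + prefix));
    }
}
```

Comparing the output from both servers shows which pool or subsystem accounts for the extra threads.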
In a simplified manner my Java application can be described as follows:
It is a web application running on a Tomcat server with a SOAP interface. The application uses JPA/Hibernate to store data in a MySQL database. The data stored consists of list of users, a list of hosts, and a list of URIs pointing to huge files (10GB) in the filesystem.
The whole system consists of a central server, on which my application runs, and a bunch of worker hosts. A user can connect to the SOAP interface and ask the system to copy the files that belong to him to a specific worker host, where he can then analyze the data in some way (we cannot use NFS; we need to copy the data to the local disk storage of a worker host). The database then stores, for each user, the worker host on which his files are stored.
At the moment the system is running with one central server hosting the Tomcat application and the MySQL database, 10 worker hosts, and about 30 users who have 100 files (10GB each on average) distributed over the worker hosts.
But in the future I have to scale the system by a factor of 100-1000, so I might have to deal with 10000 users, 100000 files and 10000 hosts. The system should also become fault tolerant, so that I don't have a single central server (which is the single point of failure in the system now), but maybe several. Also, if one of the worker hosts fails, the system should be notified so that it doesn't try to copy files to that host.
My question is now: Which Java technologies could I use to make my application scalable and fault tolerant? What kind of architecture would you recommend? Should I still have a huge database storing all the information about all files, hosts and users in the system in one place, or should I better distribute my database on several hosts and synchronize them somehow?
The technology you need is called Architecture.
No matter which technology you use, you need to have a well-architected system for scalability and redundancy. Make a diagram of the entire architecture of the system as it currently works. Mark each component with its limitations for users, jobs, bandwidth, hard drive space, memory, or whatever parts are limiting for your application. This will give you the baseline design.
Now draw that same diagram as it would need to be to meet your scalability and redundancy requirements. You might have to break apart pieces to make it work, or develop entirely new pieces. This diagram will make it very clear what you need.
One specific thing I want to address is the database. If you can split the database along logical lines so that no query joins data from one part to another, then you should have separate databases. Beyond that, the best configuration for a database is to have each database on one fast machine with lots of storage and very fast access times. If you do this, the only things that will slow down your database are bad queries or poorly-indexed tables. In my experience, synchronizing databases is to be avoided unless you have one master database that has write access and replicates to other databases which are read-only. Regardless, this can be a last step after you've profiled all of your queries and you literally need additional hardware.
I have a situation where I need to distribute work over multiple Java processes running in different JVMs, probably on different machines.
Let's say I have a table with records 1 to 1000. I am looking for work to be collected and distributed in sets of 10. Let's say records 1-10 go to workerOne, then records 11-20 go to workerThree, and so on. Needless to say, workerOne never does the work of workerTwo unless workerTwo couldn't do it.
This example is based purely on a database, but I believe it could be extended to any system, be it file processing, email processing and so forth.
I have a small feeling that the immediate response would be to go for a Master/Worker approach. However here we are talking about different JVMs. Even if one JVM were to come down the other JVM should just keep doing its work.
Now the million dollar question: are there any good (production-ready) frameworks that would give me the facility to do this? Even concrete implementations for specific needs, such as database records, file processing, email processing and the like, would help.
I have seen the Java Parallel Execution Framework, but I am not sure whether it can be used across different JVMs, and whether the others would keep going if one were to come down. I believe workers could be on multiple JVMs, but what about the master?
More Info 1: Hadoop would be a problem because of the JDK 1.6 requirement. That's a bit too much.
Thanks,
Franklin
Might want to look into MapReduce and Hadoop
You could also use message queues. Have one process that generates the list of work and packages it in nice little chunks. It then plops those chunks on a queue. Each of the workers just keeps waiting on the queue for something to show up. When it does, the worker pulls a chunk off the queue and processes it. If one process goes down, some other process will pick up the slack. It's simple, and people have been doing it that way for a long time, so there's a lot of information about it on the net.
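The queue pattern described above can be sketched inside a single JVM, with a BlockingQueue standing in for the message broker and threads standing in for the worker processes. The class, the Chunk type and the method names are assumptions for illustration. With real separate JVMs the queue would be an external broker, and a chunk should only be acknowledged after it has been processed so that another worker can pick it up if the first one dies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

final class QueueWorkers {
    /** A chunk of consecutive record ids, first to last inclusive. */
    static final class Chunk {
        final int first;
        final int last;

        Chunk(int first, int last) {
            this.first = first;
            this.last = last;
        }
    }

    /** Producer side: packages records 1..totalRecords into chunks of at most chunkSize. */
    static List<Chunk> split(int totalRecords, int chunkSize) {
        List<Chunk> chunks = new ArrayList<>();
        for (int first = 1; first <= totalRecords; first += chunkSize) {
            chunks.add(new Chunk(first, Math.min(first + chunkSize - 1, totalRecords)));
        }
        return chunks;
    }

    /** Starts the workers, lets them drain the queue, and returns how many records were handled. */
    static long process(int totalRecords, int chunkSize, int workers) {
        BlockingQueue<Chunk> queue = new LinkedBlockingQueue<>(split(totalRecords, chunkSize));
        AtomicLong processed = new AtomicLong();
        List<Thread> threads = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            Thread worker = new Thread(() -> {
                Chunk chunk;
                // Every chunk is queued up front, so an empty queue means no work is left.
                while ((chunk = queue.poll()) != null) {
                    processed.addAndGet(chunk.last - chunk.first + 1); // placeholder for real work
                }
            });
            threads.add(worker);
            worker.start();
        }
        for (Thread worker : threads) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(split(1_000, 10).size() + " chunks");
        System.out.println(process(1_000, 10, 4) + " records processed");
    }
}
```

No worker is assigned a fixed range; whichever worker is free takes the next chunk, which is what lets the others pick up the slack.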
Check out Hadoop
I believe Terracotta can do this. If you are dealing with web pages, JBoss can be clustered.
If you want to do this yourself, you will need a work manager which keeps track of jobs to do, jobs in progress, and jobs that were never finished and need to be rescheduled. The workers then ask for something to do, do it, and send the result back, asking for more.
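A hand-rolled work manager along those lines could look roughly like the sketch below. It tracks pending jobs, jobs in progress and completed jobs, and puts a failed worker's jobs back on the queue. All names are hypothetical, the jobs are plain string ids, and how the manager learns that a worker has failed (heartbeats, timeouts) is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Optional;

final class WorkManager {
    private final Deque<String> pending = new ArrayDeque<>();
    private final Map<String, String> inProgress = new HashMap<>(); // job id -> worker id
    private int completed;

    /** Adds a job to the end of the to-do queue. */
    synchronized void submit(String jobId) {
        pending.addLast(jobId);
    }

    /** A worker asks for something to do; empty if nothing is pending. */
    synchronized Optional<String> requestJob(String workerId) {
        String jobId = pending.pollFirst();
        if (jobId != null) {
            inProgress.put(jobId, workerId);
        }
        return Optional.ofNullable(jobId);
    }

    /** A worker reports a finished job. */
    synchronized void complete(String jobId) {
        if (inProgress.remove(jobId) != null) {
            completed++;
        }
    }

    /** Reschedules every job the failed worker was holding; returns how many were requeued. */
    synchronized int workerFailed(String workerId) {
        int requeued = 0;
        Iterator<Map.Entry<String, String>> it = inProgress.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            if (entry.getValue().equals(workerId)) {
                pending.addFirst(entry.getKey()); // retry before newer jobs
                it.remove();
                requeued++;
            }
        }
        return requeued;
    }

    synchronized int completedCount() {
        return completed;
    }

    public static void main(String[] args) {
        WorkManager manager = new WorkManager();
        manager.submit("records-1-10");
        manager.submit("records-11-20");
        manager.requestJob("workerOne");
        System.out.println("requeued: " + manager.workerFailed("workerOne"));
        System.out.println("next for workerTwo: " + manager.requestJob("workerTwo").orElse("none"));
    }
}
```

The manager itself is still a single point of failure here, which is one reason the answers in this thread point to existing frameworks.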
You may want to elaborate on what kind of work you want to do.
The problem you've described is definitely best solved using the master/worker pattern.
You should have a look at JavaSpaces (part of the Jini framework); it's really well suited to this kind of thing. Basically you just want to encapsulate each task to be carried out inside a Command object, subclassing as necessary. Dump these into the JavaSpace, let your workers grab and process one at a time, then reassemble the results when done.
Of course your performance gains will totally depend on how long it takes you to process each set of records, but JavaSpaces won't cause any problems if distributed across several machines.
If you work on records in a single database, consider performing the work within the database itself using stored procedures. The gain from processing the records on different machines might be negated by the cost of retrieving and transmitting the work between the database and the computing nodes.
For file processing it could be a similar case. Working on files in a (shared) filesystem might put heavy I/O pressure on the OS.
And the cost of maintaining multiple JVMs on multiple machines might be overkill too.
As for the question: I once used JADE (Java Agent Development Environment) for a distributed simulation. Its multi-machine support and message-passing nature might help you.
I would consider using JGroups for that. You can cluster your JVMs, and one of your nodes can be selected as master and then distribute the work to the other nodes by sending messages over the network. Or you can partition your work items up front and have the master node manage the distribution of the partitions, e.g. partition-1 goes to JVM-4, partition-2 goes to JVM-3, partition-3 goes to JVM-2 and so on. If JVM-4 goes down, the master node will notice and will tell one of the other nodes to pick up partition-1 as well.
Another alternative, which is easier to use, is Redis pub/sub support: http://redis.io/topics/pubsub . But then you will have to maintain Redis servers, which I don't like.
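The bookkeeping the master node would do for this partition scheme can be sketched without any clustering library. The class and method names below are hypothetical and no JGroups API is used; in a real cluster, nodeLeft would be called from whatever membership-change notification the clustering layer provides. The choice of the least-loaded survivor is an assumption, since the answer only says that one of the other nodes takes over.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartitionMaster {
    private final Map<Integer, String> owner = new TreeMap<>(); // partition -> node
    private final List<String> nodes = new ArrayList<>();

    /** Spreads partitions 1..partitions over the initial nodes in round-robin order. */
    PartitionMaster(List<String> initialNodes, int partitions) {
        if (initialNodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        nodes.addAll(initialNodes);
        for (int p = 1; p <= partitions; p++) {
            owner.put(p, nodes.get((p - 1) % nodes.size()));
        }
    }

    /** Moves each partition of a departed node to the surviving node with the fewest partitions. */
    synchronized void nodeLeft(String node) {
        if (!nodes.remove(node)) {
            return; // unknown node, nothing to reassign
        }
        if (nodes.isEmpty()) {
            throw new IllegalStateException("no surviving nodes to take over");
        }
        for (Map.Entry<Integer, String> entry : owner.entrySet()) {
            if (entry.getValue().equals(node)) {
                entry.setValue(leastLoaded());
            }
        }
    }

    private String leastLoaded() {
        String best = nodes.get(0);
        for (String candidate : nodes) {
            if (load(candidate) < load(best)) {
                best = candidate;
            }
        }
        return best;
    }

    /** Number of partitions currently assigned to the node. */
    synchronized long load(String node) {
        return owner.values().stream().filter(node::equals).count();
    }

    synchronized String ownerOf(int partition) {
        return owner.get(partition);
    }

    public static void main(String[] args) {
        List<String> cluster = new ArrayList<>();
        cluster.add("JVM-1");
        cluster.add("JVM-2");
        cluster.add("JVM-3");
        PartitionMaster master = new PartitionMaster(cluster, 6);
        master.nodeLeft("JVM-3");
        System.out.println("partition-3 is now on " + master.ownerOf(3));
        System.out.println("partition-6 is now on " + master.ownerOf(6));
    }
}
```

If the master itself goes down, another node has to be elected and rebuild this table, which this sketch does not cover.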