nested iterations with Apache Spark?

nested iterations with Apache Spark? - java

I'm considering Apache Spark (in java) for a project, but this project requires the data processing framework to support nested iterations. I haven't been able to find any confirmation on that, does it support it?
In addition, is there any example of the use of nested iterations?
Thanks!

Just about anything can be done, but the question is what fits the execution model well enough to bother. Spark's operations are inherently parallel, not iterative. That is, some operations happens in parallel to a bunch of pieces of the data, rather than, something happens to each piece sequentially (and then happens again).
However a Spark (driver) program is just a program and can do whatever you want, locally. Of course, nested loops or whatever you like are entirely fine just as in any scala program.
I think you might use Spark operations for the bucketing process and to compute summary stats for each bucket, but otherwise run the simple remainder of the logic locally on the driver.
So the process is:
Broadcast a bucketing scheme
Bucket according to that scheme in a distributed operation
Pull small summary stats to the driver
Update bucketing scheme and send again
repeat...

Related

Initialising your app using multithreading in Java

I'm new to the whole multithreading in Java thing and I'm unable to get my head around something.
I'm trying to initialise my app properly using multithreading.
For example, I'm using a database (mongodb to be exact) and need to initialise a connection to it, then connect and check a collection exists and then read from it.
Once I have that, I will eventually have a list view (JavaFX) that will display the information taken from the database.
Ideally, whilst this is going on, I'd like other things to be done (in true mutlithreading style).
Would I need to put each submitted task into a queue of sorts and then iterate through, wait if they're not ready and then remove them once they're finished?
I've always done this singlethreaded and it's always been slow
Cheers

Use an asynchronous MongoDB client like the high level ones mentioned here
MongoDB RxJava Driver (An RxJava implementation of the MongoDB Driver)
MongoDB Reactive Streams Java Driver (A Reactive Streams implementation for the JVM)

For such purpose you could add connection pool for your application.
Everything depends on configuration for your project.
The best is to make it configurable. When load is low just have ~4 connections (min) at pool. If load is increased it could go up till 20 (max).

You need to coordinate partial tasks. If you represent tasks with threads, coordination can be done using semaphores and/or blocking queues.
More effective way is represent tasks as dataflow actors - they consume less memory, and you can generate all the tasks at the very start.

Database query vs java processing

Is it better to make two database calls or one database call with some java processing?
One database call gets only the relevant data which is to be separated into two different list which requires few lines of java.

Database dips are always an expensive operation. If you can manage with one db fetch and do some java processing, it should be a better and faster choice for you.
But you may have to analyze in your scenarion, which one is turning to be a more efficent choice. I assume singly DB fetch and java processing should be better.

Testing is key. Some questions you may want to ask yourself:
How big is each database call
How much bigger/smaller would the calls be if I combined them
Should I push the procesing to the client?
Timing
How time critical is processing?
Do you need to swarm the DB or is it okay to piggy back on the client?
Is the difference negligible?

Java Processing is much faster than SQL Fetch, As I had the same problem so I recommend you to fetch single data with some processing, because maybe the time both options take has a minor difference but some Computers take a lot of time to fetch data from DB so I suggest you to just Get single data with some Java Processing.

Generally Javaprocessing is better if its not some simple DB query that you are doing.
I Would recomend you for trying them both, measure some time and load and see what fits your application the best.

It all depends on how intensive your processing is and how your database is setup. For instance an Oracle running on a native file system will most likely be more performant then doing the java processing code on your own for complex operations. Note that most build in operations on well known databases are highly optimized and usually very performant.

Is there a Java local queue library I can use that keeps memory usage low by dumping to the hard drive?

This maybe not possible but I thought I might just give it a try. I have some work that process some data, it makes 3 decisions with each data it proceses: keep, discard or modify/reprocess(because its unsure to keep/discard). This generates a very large amount of data because the reprocess may break the data into many different parts.
My initial method was to send it to my executionservice that was processing the data but because the number of items to process was large I would run out of memory very quickly. Then I decided to maybe offload the queue off to a messaging server(rabbitmq) which works fine but now I'm bound by network IO. What I like about rabbitmq is it keeps messages in memory up to a certain level and then dumps old messages to the local drive so if I have 8 gigs of memory on my server I can still have a 100 gig message queue.
So my question is, is there any library that has a similar feature in Java? Something that I can use as a nonblocking queue that keeps only X items in queue(either by number of items or size) and writes the rest to the local drive.
note: Right now I'm only asking for this to be used on one server. In the future I might add more servers but because each server is self-generating data I would try to take messages from one queue and push them to another if one server's queue is empty. The library would not need to have network access but I would need to access the queue from another Java process. I know this is a long shot but thought if anyone knew it would be SO.

Not sure if it id the approach you are looking for, but why not using a lightweight database like hsqldb and a persistence layer like hibernate? You can have your messages in memory, then commit to db to save on disk, and later query them, with a convenient SQL query.

Actually, as Cuevas wrote, HSQLDB could be a solution. If you use the "cached table" provided, you can specify the maximum amount of memory used, exceeding data will be sent to the hard drive.

Use the filesystem. It's old-school, yet so many engineers get bitten with libraries because they are lazy. True that HSQLDB provides lots of value-add features, but in the context of being light weight....

Workload Distribution / Parallel Execution in JAVA

I have a situation here where I need to distribute work over to multiple JAVA processes running in different JVMs, probably different machines.
Lets say I have a table with records 1 to 1000. I am looking for work to be collected and distributed is sets of 10. Lets say records 1-10 to workerOne. Then records 11-20 to workerThree. And so on and so forth. Needless to say workerOne never does the work of workerTwo unless and until workerTwo couldnt do it.
This example was purely based on database but could be extended to any system, I believe be it File processing, email processing and so forth.
I have a small feeling that the immediate response would be to go for a Master/Worker approach. However here we are talking about different JVMs. Even if one JVM were to come down the other JVM should just keep doing its work.
Now the million dollar question would be: Are there any good frameworks(production ready) that would give me facility to do this. Even if there are concrete implementations of specific needs like Database records, File processing, Email processing and their likes.
I have seen the Java Parallel Execution Framework, but am not sure if it can be used for different JVMs and if one were to come down would the other keep going.I believe Workers could be on multiple JVMs, but what about the Master?
More Info 1: Hadoop would be a problem because of the JDK 1.6 requirement. Thats bit too much.
Thanks,
Franklin

Might want to look into MapReduce and Hadoop

You could also use message queues. Have one process that generates the list of work and packages it in nice little chunks. It then plops those chunks on a queue. Each one of the workers just keeps waiting on the queue for something to show up. When it does, the worker pulls a chunk off the queue and processes it. If one process goes down, some other process will pick up the slack. Simple and people have been doing it that way for a long time so there's a lot information about it on the net.

Check out Hadoop

I believe Terracotta can do this. If you are dealing with web pages, JBoss can be clustered.
If you want to do this yourself you will need a work manager which keeps track of jobs to do, jobs in progress and jobs never done which needs to be rescheduled. The workers then ask for something to do, do it, and send the result back, asking for more.
You may want to elaborate on what kind of work you want to do.

The problem you've described is definitely best solved using the master/worker pattern.
You should have a look into JavaSpaces (part of the Jini framework), it's really well suited to this kind of thing. Basically you just want to encapsulate each task to be carried out inside a Command object, subclassing as necesssary. Dump these into the JavaSpace, let your workers grab and process one at a time, then reassemble when done.
Of course your performance gains will totally depend on how long it takes you to process each set of records, but JavaSpaces won't cause any problems if distributed across several machines.

If you work on records in a single database, consider performing the work within the database itself using stored procedures. The gain for processing the records on different machine might be negated by the cost of retrieving and transmitting the work between the database and the computing nodes.
For file processing it could be a similar case. Working on files in (shared) filesystem might introduce large I/O pressure for OS.
And the cost for maintaining multiple JVM's on multiple machines might be an overkill too.
And for the question: I used the JADE (Java Agent Development Environment) for some distributed simulation once. Its multi-machine suppord and message passing nature might help you.

I would consider using Jgroups for that. You can cluster your jvms and one of your nodes can be selected as master and then can distribute the work to the other nodes by sending message over network. Or you can already partition your work items and then manage in master node the distribution of the partitions like partion-1 one goes to JVM-4 , partion-2 goes to JVM-3, partion-3 goes to JVM-2 and so on. And if JVM-4 goes down it will be realized by the master node and then master node will tell to one of the other nodes to start pick up partition-1 as well.
One other alternative which is easier to use is redis pub sub support. http://redis.io/topics/pubsub . But then you will have to maintain redis servers which i dont like.

How to create a Linux cluster for running physics simulations in java?

I am developing a scientific application used to perform physical simulations. The algorithms used are O(n3), so for a large set of data it takes a very long time to process. The application runs a simulation in around 17 minutes, and I have to run around 25,000 simulations. That is around one year of processing time.
The good news is that the simulations are completely independent from each other, so I can easily change the program to distribute the work among multiple computers.
There are multiple solutions I can see to implement this:
Get a multi-core computer and distribute the work among all the cores. Not enough for what I need to do.
Write an application that connects to multiple "processing" servers and distribute the load among them.
Get a cluster of cheap linux computers, and have the program treat everything as a single entity.
Option number 2 is relatively easy to implement, so I don't look so much for suggestions for how to implement this (Can be done just by writing a program that waits on a given port for the parameters, processes the values and returns the result as a serialized file). That would be a good example of Grid Computing.
However, I wonder at the possibilities of the last option, a traditional cluster. How difficult is to run a Java program in a linux grid? Will all the separate computers be treated as a single computer with multiple cores, making it thus easy to adapt the program? Is there any good pointers to resources that would allow me to get started? Or I am making this over-complicated and I am better off with option number 2?
EDIT: As extra info, I am interested on how to implement something like described in this article from Wired Magazine: Scientific replaced a supercomputer with a Playstation 3 linux cluster. Definitively number two sounds like the way to go... but the coolness factor.
EDIT 2: The calculation is very CPU-Bound. Basically there is a lot of operations on large matrixes, such as inverse and multiplication. I tried to look for better algorithms for these operations but so far I've found that the operations I need are 0(n3) (In libraries that are normally available). The data set is large (for such operations), but it is created on the client based on the input parameters.
I see now that I had a misunderstanding on how a computer cluster under linux worked. I had the assumption that it would work in such a way that it would just appear that you had all the processors in all computers available, just as if you had a computer with multiple cores, but that doesn't seem to be the case. It seems that all these supercomputers work by having nodes that execute tasks distributed by some central entity, and that there is several different libraries and software packages that allow to perform this distribution easily.
So the question really becomes, as there is no such thing as number 3, into: What is the best way to create a clustered java application?

I would very highly recommend the Java Parallel Processing Framework especially since your computations are already independant. I did a good bit of work with this undergraduate and it works very well. The work of doing the implementation is already done for you so I think this is a good way to achieve the goal in "number 2."
http://www.jppf.org/

Number 3 isn't difficult to do. It requires developing two distinct applications, the client and the supervisor. The client is pretty much what you have already, an application that runs a simulation. However, it needs altering so that it connects to the supervisor using TCP/IP or whatever and requests a set of simulation parameters. It then runs the simulation and sends the results back to the supervisor. The supervisor listens for requests from the clients and for each request, gets an unallocated simulation from a database and updates the database to indicate the item is allocated but unfinished. When the simulation is finished, the supervisor updates the database with the result. If the supervisor stores the data in an actual database (MySql, etc) then the database can be easily queried for the current state of the simulations. This should scale well up to the point where the time taken to provide the simulation data to all the clients is equal to the time required to perform the simulation.

Simplest way to distribute computing on a Linux cluster is to use MPI. I'd suggest you download and look at MPICH2. It's free. their home page is here
If your simulations are completely independent, you don't need most of the features of MPI. You might have to write a few lines of C to interface with MPI and kick off execution of your script or Java program.

You should check out Hazelcast, simplest peer2peer (no centralized server) clustering solution for Java. Try Hazelcast Distributed ExecutorService for executing your code on the cluster.
Regards,
-talip

You already suggested it, but disqualified it: Multi cores. You could go for multi core, if you had enough cores. One hot topic atm is GPGPU computing. Esp. NVIDIAs CUDA is a very priomising approach if you have many independent task which have to do the same computation. A GTX 280 delivers you 280 cores, which can compute up to 1120 - 15360 threads simultanously . A pair of them could solve your problem. If its really implementable depends on your algorithm (data flow vs. control flow), because all scalar processors operate in a SIMD fashion.
Drawback: it would be C/C++, not java

How optimized are your algorithms? Are you using native BLAS libraries? You can get about an order of magnitude performance gain by switching from naive libraries to optimized ones. Some, like ATLAS will also automatically spread the calculations over multiple CPUs on a system, so that covers bullet 1 automatically.
AFAIK clusters usually aren't treated as a single entity. They are usually treated as separate nodes and programmed with stuff like MPI and SCALAPACK to distribute the elements of matrices onto multiple nodes. This doesn't really help you all that much if your data set fits in memory on one node anyways.

Have you looked at Terracotta?
For work distribution you'll want to use the Master/Worker framework.

Ten years ago, the company I worked for looked at a similar virtualization solution, and Sun, Digital and HP all supported it at the time, but only with state-of-the-art supercomputers with hardware hotswap and the like. Since then, I heard Linux supports the type of virtualization you're looking for for solution #3, but I've never used it myself.
Java primitives and performance
However, if you do matrix calculations you'd want to do them in native code, not in Java (assuming you're using Java primitives). Especially cache misses are very costly, and interleaving in your arrays will kill performance. Non-interleaved chunks of memory in your matrices and native code will get you most of the speedup without additional hardware.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.