I am developing a Java-based application and have decided to use the machine learning algorithms implemented in the Mahout library. My application will run on a single machine, without Hadoop.
I would like to ask whether single-node Mahout also has overhead, like the distributed version. I read in the book Mahout in Action that Mahout on a multi-node cluster has some overhead (initializing, transferring data, etc.). But if we use Mahout algorithms without the MapReduce paradigm, there should be no overhead, right?
It does not make a difference whether you run it on a single machine or a 1000-node cluster. Hadoop serializes all the intermediate data (the map phase's key-value output) and persists it on disk. In the reduce phase, it loads the key-value pairs back into memory. Therefore, it has huge processing and disk-access overheads.
Basically, if you have only a few machines (e.g. fewer than 7), Hadoop is possibly not a good choice, especially for speedup analysis. In that case, you may just use the small cluster to check your code's logic before deploying it to a larger environment.
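For completeness: Mahout's non-distributed algorithms run entirely inside your own JVM, so none of the Hadoop serialization and disk overhead described above applies. Here is a minimal sketch using the in-memory (Taste) recommender API covered in Mahout in Action; the ratings file name, its assumed format (userID,itemID,rating per line), and the choice of similarity and neighborhood are illustrative placeholders.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class InMemoryRecommenderExample {
    public static void main(String[] args) throws Exception {
        // Ratings are loaded entirely into memory; no Hadoop, no MapReduce.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        GenericUserBasedRecommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend 3 items for user 1; everything runs inside this one JVM.
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item);
        }
    }
}
```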
I've been digging into the depths of IBM's research on JavaSplit and cJVM because I want to run a JVM program across a cluster of 4 Raspberry Pi 3 Model B's, like this.
I know nearly nothing about clusters and distributed computing, so I'm starting my dive into the depths by trying to get a Minecraft server running across them.
My question is: is there a relatively simple way to get a Java program running on a JVM to split across a cluster without source code access?
Notes:
The main problem is that most Java programs (toy program included) were not built to run across a cluster, but I'm hoping I can find a way to hack the JVM to make it work.
I've seen some possible solutions, but due to the nature of Minecraft and Java, updates come so frequently and the landscape changes so much that I don't even know what is possible.
As far as I know, FastCraft implements multithreading support, or it used to and it's now built in.
Purpose:
This is both a toy program and a practical problem for me. I'm doing it to learn how clusters work, to learn more about Linux administration and distributed computing, and because it's fun. I'm not doing it just to set up a Minecraft server. The server is a cherry on top, but if it doesn't work out I'll shove it onto a Dell tower.
Minecraft can be scaled using what is effectively a partitioning service. The tool usually used is BungeeCord. It allows a client to connect to a service which passes the session to one of multiple backend servers, which run largely without change. This limits the number of users who can be in any one server, but across them you can have any number of servers.
I can only reiterate that such a generic solution, if one exists, is not commonly applied. There are inherent challenges to try and distribute a JVM, such as translating a shared memory execution model, where all memory access is cheap, to a distributed model, where non-local memory access is orders of magnitude more expensive, without degrading performance. This requires smart partitioning of data, and finding such partitions in an automated way is a very complex optimization problem.
In the particular example of Minecraft, one would additionally have to transform a single-threaded program into a multi-threaded one, which is a rather complex program transformation by itself.
In a nutshell, solving the clustering problem in such generality is a research-level topic for which, to the best of my knowledge, no algorithms competitive with manual code changes currently exist. In addition, if such an algorithm were to exist, it would be very unlikely to be offered free of charge, because it would represent a significant achievement and could be licensed for a lot of money.
How can load balancing be handled in Hadoop MapReduce? I am writing a distributed application in which the server distributes jobs to worker nodes based on a benchmark test, available memory, number of CPU cores, CPU usage, and number of GPUs available / GPU usage. I am not very experienced with MapReduce and have read some documentation on Apache's website, but I am still not sure how to go about solving this problem. Can I run the benchmark calculation, gather all of this information, and then use an algorithm to dynamically split up the input?
Thank you!
"MapReduce is a programming model and an associated implementation for processing and generating large data sets" extract of the abstract of MapReduce paper.
As you said it in comments, it seems your project is not data intensive but computing intensive, thus I think MapReduce is not the tool you need to use.
Performance of MapReduce systems strongly depends on an even data distribution.
Apache MapReduce frameworks use a simplistic approach to distribute the work load and assign the same number of clusters to each reducer.
The load imbalance, which raises the processing time, is even amplified by the high runtime complexities of the reducer tasks. An adaptive load balancing strategy is required to address the problem of estimating the cost of the tasks that are distributed to the reducers based on a given cost model.
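One concrete lever for this in Hadoop is the partitioner, which decides which reducer receives each key. Below is a minimal sketch of a cost-aware Partitioner; the length-based cost heuristic is purely illustrative and would be replaced by whatever cost model your benchmarking produces.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// A cost-aware partitioner sketch. The default HashPartitioner only balances
// the number of keys per reducer, not their cost; here, "heavy" keys
// (identified by a purely illustrative heuristic) are routed to a dedicated
// reducer so they do not queue up behind cheap ones.
public class CostAwarePartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) {
            return 0;
        }
        // Hypothetical cost model: treat very long keys as expensive and
        // reserve the last reducer for them.
        if (key.getLength() > 100) {
            return numPartitions - 1;
        }
        // Everything else is hash-partitioned over the remaining reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}
```

You would register it on the job with job.setPartitionerClass(CostAwarePartitioner.class).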
Does it help to use Redis with Java to develop data-intensive applications (e.g. data mining)?
Does it work faster or consume less memory compared to plain Java for similar operations on high volumes of data?
Edit: My question is mostly about running on a single machine, for example working with a large number of lists/sets/maps and querying and sorting them.
Redis will definitely not be faster than native Java on a single machine. It would allow you to distribute processing, but if the chunks of data really are large, they're not likely to fit into memory anyway. Without knowing more about what you're doing, I would suggest storing the data on disk. When you get multiple machines, you can network-mount the partition and share the data that way. Alternatively, Hadoop with MapReduce sounds like the right sort of thing for what you're doing.
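To make the trade-off concrete, here is a minimal sketch (assuming the Jedis client and a Redis server on localhost; key names and scores are made up) of the kind of sorted query you describe. Every call here pays a network and serialization round trip that an in-process java.util.TreeMap or sorted collection would not.

```java
import redis.clients.jedis.Jedis;

// Sketch of a server-side sorted query using a Redis sorted set; the key name
// and scores are made up for illustration.
public class RedisSortedSetExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Score each member; Redis keeps the set ordered by score.
            jedis.zadd("word-counts", 42, "hadoop");
            jedis.zadd("word-counts", 17, "mahout");
            jedis.zadd("word-counts", 99, "redis");

            // Fetch the top two members by score, highest first.
            for (String word : jedis.zrevrange("word-counts", 0, 1)) {
                System.out.println(word);
            }
        }
    }
}
```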
I'm running a streaming job in Hadoop (on Amazon's EMR) with the mapper and reducer written in Python. I want to know about the speed gains I would experience if I implement the same mapper and reducer in Java (or use Pig).
In particular, I'm looking for people's experiences on migrating from streaming to custom jar deployments and/or Pig and also documents containing benchmark comparisons of these options. I found this question, but the answers are not specific enough for me. I'm not looking for comparisons between Java and Python, but comparisons between custom jar deployment in Hadoop and Python-based streaming.
My job reads n-gram counts from the Google Books Ngram dataset and computes aggregate measures. It seems like CPU utilization on the compute nodes is close to 100%. (I would also like to hear your opinions about the differences between having a CPU-bound and an IO-bound job.)
Thanks!
Amaç
Why consider deploying custom jars ?
The ability to use more powerful custom input formats. For streaming jobs, even if you use pluggable input/output as mentioned here, the keys and values reaching your mapper/reducer are limited to text/strings. You would need to expend some amount of CPU cycles converting them to your required types (see the mapper sketch after these points).
I've also heard that Hadoop can be smart about reusing JVMs across multiple jobs, which isn't possible when streaming (I can't confirm this).
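To illustrate the first point, here is a minimal sketch of what a custom-jar mapper over the n-gram data might look like; it emits typed writables directly instead of formatting and re-parsing tab-separated text the way a streaming script must. The assumed column layout (ngram, year, match_count, ...) is only illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Custom-jar mapper sketch: keys and values are typed writables, so there is
// no string formatting/parsing across a pipe as in streaming. The assumed
// input layout (ngram \t year \t match_count \t ...) is illustrative only.
public class NgramCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text ngram = new Text();
    private final IntWritable count = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t");
        if (fields.length < 3) {
            return; // skip malformed lines
        }
        ngram.set(fields[0]);
        count.set(Integer.parseInt(fields[2]));
        context.write(ngram, count);
    }
}
```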
When to use Pig?
Pig Latin is pretty cool and is a much higher-level data-flow language than Java, Python, or Perl. Your Pig scripts WILL tend to be much smaller than an equivalent task written in any of the other languages.
When NOT to use Pig?
Even though Pig is pretty good at figuring out by itself how many maps/reduces to use, when to spawn a map or reduce, and a myriad of such things, if you are dead sure how many maps/reduces you need, you have some very specific computation to do within your map/reduce functions, and you are very particular about performance, then you should consider deploying your own jars. This link shows that Pig can lag native Hadoop M/R in performance. You could also take a look at writing your own Pig UDFs, which isolate some compute-intensive function (and possibly even use JNI to call some native C/C++ code within the UDF).
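To illustrate that last point, a Pig UDF is just a Java class extending EvalFunc; the "normalization" below is a made-up stand-in for whatever compute-intensive function you would isolate.

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Sketch of a Pig UDF. The calculation is a made-up placeholder for a
// compute-intensive function; a real UDF could also call native code via JNI.
public class NormalizeCount extends EvalFunc<Double> {

    @Override
    public Double exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2
                || input.get(0) == null || input.get(1) == null) {
            return null;
        }
        long count = ((Number) input.get(0)).longValue();
        long total = ((Number) input.get(1)).longValue();
        return total == 0 ? null : (double) count / total;
    }
}
```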
A note on IO-bound and CPU-bound jobs:
Technically speaking, the whole point of Hadoop and MapReduce is to parallelize compute-intensive functions, so I'd presume your map and reduce jobs are compute-intensive. The main time the Hadoop subsystem is busy doing IO is between the map and reduce phases, when data is sent across the network. IO also becomes an issue if you have a large amount of data and have manually configured too few maps and reduces, resulting in spills to disk (although too many tasks will result in too much time spent starting/stopping JVMs and too many small files). A streaming job would also have the additional overhead of starting a Python/Perl VM and of data being copied back and forth between the JVM and the scripting VM.
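If you do go the custom-jar route, the knobs mentioned above are set on the job driver. A minimal sketch follows; the JVM-reuse property is an MRv1-era setting, the reducer count is arbitrary, and NgramCountMapper refers to the mapper sketched earlier.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver sketch for a custom-jar job. The reducer count is arbitrary and the
// JVM-reuse property is an MRv1-era setting (later Hadoop versions changed it).
public class NgramJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Reuse each task JVM instead of starting a fresh one per task.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

        Job job = Job.getInstance(conf, "ngram-aggregate");
        job.setJarByClass(NgramJobDriver.class);
        job.setMapperClass(NgramCountMapper.class); // the mapper sketched above
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Choose the number of reducers explicitly rather than relying on defaults.
        job.setNumReduceTasks(8);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```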
I am developing a scientific application used to perform physical simulations. The algorithms used are O(n³), so for a large data set it takes a very long time to process. The application runs a simulation in around 17 minutes, and I have to run around 25,000 simulations. That is around one year of processing time.
The good news is that the simulations are completely independent from each other, so I can easily change the program to distribute the work among multiple computers.
There are multiple solutions I can see to implement this:
Get a multi-core computer and distribute the work among all the cores. Not enough for what I need to do.
Write an application that connects to multiple "processing" servers and distributes the load among them.
Get a cluster of cheap Linux computers, and have the program treat everything as a single entity.
Option number 2 is relatively easy to implement, so I'm not really looking for suggestions on how to implement it (it can be done just by writing a program that waits on a given port for the parameters, processes the values, and returns the result as a serialized file). That would be a good example of grid computing.
However, I wonder about the possibilities of the last option, a traditional cluster. How difficult is it to run a Java program on a Linux grid? Will all the separate computers be treated as a single computer with multiple cores, making it easy to adapt the program? Are there any good pointers to resources that would let me get started? Or am I making this over-complicated and better off with option number 2?
EDIT: As extra info, I am interested in how to implement something like what is described in this article from Wired Magazine: a scientist replaced a supercomputer with a PlayStation 3 Linux cluster. Definitely number two sounds like the way to go... but the coolness factor.
EDIT 2: The calculation is very CPU-bound. Basically there are a lot of operations on large matrices, such as inversion and multiplication. I tried to look for better algorithms for these operations, but so far I've found that the operations I need are O(n³) (in the libraries that are normally available). The data set is large (for such operations), but it is created on the client based on the input parameters.
I see now that I had a misunderstanding of how a computer cluster under Linux works. I had assumed it would work in such a way that it would just appear that you had all the processors of all the computers available, just as if you had a single computer with multiple cores, but that doesn't seem to be the case. It seems that all these supercomputers work by having nodes that execute tasks distributed by some central entity, and that there are several different libraries and software packages that make it easy to perform this distribution.
So, since there is no such thing as number 3, the question really becomes: what is the best way to create a clustered Java application?
I would very highly recommend the Java Parallel Processing Framework (JPPF), especially since your computations are already independent. I did a good bit of work with this as an undergraduate and it works very well. The implementation work is already done for you, so I think this is a good way to achieve the goal in "number 2."
http://www.jppf.org/
Number 3 isn't difficult to do. It requires developing two distinct applications, the client and the supervisor. The client is pretty much what you have already, an application that runs a simulation. However, it needs altering so that it connects to the supervisor using TCP/IP or whatever and requests a set of simulation parameters. It then runs the simulation and sends the results back to the supervisor. The supervisor listens for requests from the clients and, for each request, gets an unallocated simulation from a database and updates the database to indicate the item is allocated but unfinished. When the simulation is finished, the supervisor updates the database with the result. If the supervisor stores the data in an actual database (MySQL, etc.), then the database can be easily queried for the current state of the simulations. This should scale well up to the point where the time taken to provide the simulation data to all the clients equals the time required to perform a simulation.
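For illustration, the client side of that protocol can be as small as the loop below; the supervisor host and port, the "NEXT" request token, and the SimulationParameters/SimulationResult classes are all hypothetical placeholders, not a prescription.

```java
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.Socket;

// Client-side sketch of the supervisor/client scheme described above.
// SimulationParameters, SimulationResult, the "NEXT" token, and the
// supervisor address are hypothetical placeholders.
public class SimulationClient {

    static class SimulationParameters implements Serializable { double value; }
    static class SimulationResult implements Serializable { double output; }

    public static void main(String[] args) throws Exception {
        while (true) {
            try (Socket socket = new Socket("supervisor.example.org", 9000);
                 ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream());
                 ObjectInputStream in = new ObjectInputStream(socket.getInputStream())) {

                // Ask the supervisor for the next unallocated simulation.
                out.writeObject("NEXT");
                out.flush();
                SimulationParameters params = (SimulationParameters) in.readObject();
                if (params == null) {
                    return; // nothing left to simulate
                }

                // Run the simulation (placeholder) and report the result back.
                SimulationResult result = simulate(params);
                out.writeObject(result);
                out.flush();
            }
        }
    }

    private static SimulationResult simulate(SimulationParameters params) {
        SimulationResult result = new SimulationResult();
        result.output = params.value * params.value; // stand-in for the real O(n^3) work
        return result;
    }
}
```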
The simplest way to distribute computing on a Linux cluster is to use MPI. I'd suggest you download and look at MPICH2. It's free; their home page is here.
If your simulations are completely independent, you don't need most of the features of MPI. You might have to write a few lines of C to interface with MPI and kick off execution of your script or Java program.
You should check out Hazelcast, the simplest peer-to-peer (no centralized server) clustering solution for Java. Try Hazelcast's distributed ExecutorService for executing your code on the cluster.
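A minimal sketch of what that looks like, assuming Hazelcast 3.x: every node that runs this joins the cluster, and submitted tasks execute on whichever member the executor chooses. The task body is a placeholder for your real simulation.

```java
import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;

// Sketch assuming Hazelcast 3.x: every node running this joins the cluster,
// and submitted tasks execute on whichever member the executor picks.
public class HazelcastSimulationExample {

    // Tasks must be Serializable so they can be shipped to other members.
    static class SimulationTask implements Callable<Double>, Serializable {
        private final double parameter;
        SimulationTask(double parameter) { this.parameter = parameter; }

        @Override
        public Double call() {
            return parameter * parameter; // placeholder for the real simulation
        }
    }

    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IExecutorService executor = hz.getExecutorService("simulations");

        Future<Double> result = executor.submit(new SimulationTask(42.0));
        System.out.println("Result: " + result.get());

        hz.shutdown();
    }
}
```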
Regards,
-talip
You already suggested it but disqualified it: multiple cores. You could go for multi-core if you had enough cores. One hot topic at the moment is GPGPU computing. In particular, NVIDIA's CUDA is a very promising approach if you have many independent tasks which have to do the same computation. A GTX 280 gives you 240 cores, which can compute up to 1120 - 15360 threads simultaneously. A pair of them could solve your problem. Whether it is really implementable depends on your algorithm (data flow vs. control flow), because all the scalar processors operate in a SIMD fashion.
Drawback: it would be C/C++, not Java.
How optimized are your algorithms? Are you using native BLAS libraries? You can get about an order-of-magnitude performance gain by switching from naive libraries to optimized ones. Some, like ATLAS, will also automatically spread the calculations over multiple CPUs on a system, so that covers bullet 1 automatically.
AFAIK, clusters usually aren't treated as a single entity. They are usually treated as separate nodes and programmed with things like MPI and ScaLAPACK to distribute the elements of matrices onto multiple nodes. This doesn't really help you all that much if your data set fits in memory on one node anyway.
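To make the first point concrete, here is a minimal sketch using jblas, one of several Java wrappers around native BLAS/LAPACK (the library choice and the matrix size are assumptions for illustration):

```java
import org.jblas.DoubleMatrix;
import org.jblas.Solve;

// Sketch using jblas, a Java wrapper around native BLAS/LAPACK; the matrix
// size is arbitrary. Multiplication and inversion are delegated to the
// native routines, which are usually far faster than naive Java loops.
public class NativeBlasExample {
    public static void main(String[] args) {
        int n = 500;
        DoubleMatrix a = DoubleMatrix.rand(n, n);
        DoubleMatrix b = DoubleMatrix.rand(n, n);

        DoubleMatrix product = a.mmul(b);                          // A * B
        DoubleMatrix inverse = Solve.solve(a, DoubleMatrix.eye(n)); // A^-1

        System.out.println("product(0,0) = " + product.get(0, 0));
        System.out.println("inverse(0,0) = " + inverse.get(0, 0));
    }
}
```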
Have you looked at Terracotta?
For work distribution you'll want to use the Master/Worker framework.
Ten years ago, the company I worked for looked at a similar virtualization solution, and Sun, Digital, and HP all supported it at the time, but only with state-of-the-art supercomputers with hardware hot-swap and the like. Since then, I've heard that Linux supports the type of virtualization you're looking for in solution #3, but I've never used it myself.
Java primitives and performance
However, if you do matrix calculations you'd want to do them in native code, not in Java (assuming you're using Java primitives). Cache misses in particular are very costly, and interleaving in your arrays will kill performance. Non-interleaved chunks of memory for your matrices plus native code will get you most of the speedup without additional hardware.
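As a concrete illustration of the memory-layout point (staying in Java for brevity), here is a sketch contrasting a flat, row-major array with the usual nested double[][]; the flat layout keeps each matrix in one contiguous block, which is much friendlier to the cache.

```java
// Sketch contrasting a flat, row-major double[] layout with the usual nested
// double[][]: the flat array keeps the whole matrix in one contiguous block,
// which avoids the pointer-chasing and cache misses of nested arrays.
public class FlatMatrixExample {

    // C += A * B, with all n x n matrices stored row-major in flat arrays.
    static void multiply(double[] a, double[] b, double[] c, int n) {
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
    }

    public static void main(String[] args) {
        int n = 3;
        double[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9};
        double[] b = {9, 8, 7, 6, 5, 4, 3, 2, 1};
        double[] c = new double[n * n];
        multiply(a, b, c, n);
        System.out.println("C[0][0] = " + c[0]); // 1*9 + 2*6 + 3*3 = 30
    }
}
```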