Streaming or custom Jar in Hadoop - java

I'm running a streaming job in Hadoop (on Amazon's EMR) with the mapper and reducer written in Python. I want to know about the speed gains I would experience if I implement the same mapper and reducer in Java (or use Pig).
In particular, I'm looking for people's experiences on migrating from streaming to custom jar deployments and/or Pig and also documents containing benchmark comparisons of these options. I found this question, but the answers are not specific enough for me. I'm not looking for comparisons between Java and Python, but comparisons between custom jar deployment in Hadoop and Python-based streaming.
My job reads n-gram counts from the Google Books Ngram dataset and computes aggregate measures. It seems that CPU utilization on the compute nodes is close to 100%. (I would also like to hear your opinions on the differences between having a CPU-bound and an IO-bound job.)
Thanks!
Amaç

Why consider deploying custom jars?
Ability to use more powerful custom input formats. For streaming jobs, even if you use pluggable input/output as mentioned here, the keys and values going into your mapper/reducer are limited to being text/strings. You would need to expend some amount of CPU cycles converting them to your required types (see the sketch after these points).
I've also heard that Hadoop can be smart about reusing JVMs across multiple jobs, which won't be possible when streaming (I can't confirm this).
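As an illustration of the first point, a custom Writable lets Hadoop hand your mapper/reducer typed, binary-serialized records instead of strings. A minimal sketch; the NgramCount record and its fields are hypothetical:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical record for a row of the Ngram dataset. With streaming you
// would instead re-parse a tab-separated text line in every map/reduce call.
public class NgramCount implements Writable {
    private int year;
    private long matchCount;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeLong(matchCount);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        matchCount = in.readLong();
    }
}
```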
When to use Pig?
Pig Latin is pretty cool and is a much higher-level data-flow language than Java, Python, or Perl. Your Pig scripts WILL tend to be much smaller than an equivalent task written in any of the other languages.
When NOT to use Pig?
Even though Pig is pretty good at figuring out by itself how many maps/reduces to use, when to spawn a map or reduce, and a myriad of such things, if you are dead sure how many maps/reduces you need, you have some very specific computation to do within your map/reduce functions, and you are very particular about performance, then you should consider deploying your own jars. This link shows that Pig can lag native Hadoop M/R in performance. You could also take a look at writing your own Pig UDFs, which isolate some compute-intensive function (and possibly even use JNI to call native C/C++ code within the UDF); a sketch follows.
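A minimal sketch of such a UDF using Pig's Java EvalFunc API; the LogScore name and the stand-in computation are hypothetical:

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF isolating a compute-intensive scoring function; the body
// could just as well delegate to native C/C++ code through JNI.
public class LogScore extends EvalFunc<Double> {
    @Override
    public Double exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        long count = (Long) input.get(0);
        return Math.log1p(count); // stand-in for the expensive computation
    }
}
```

In your Pig script you would then REGISTER the jar and call LogScore(...) like any built-in function.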
A note on IO-bound and CPU-bound jobs:
Technically speaking, the whole point of Hadoop and MapReduce is to parallelize compute-intensive functions, so I'd presume your map and reduce jobs are compute-intensive. The only times the Hadoop subsystem is busy doing IO are in between the map and reduce phases, when data is sent across the network, and when you have a large amount of data but have manually configured too few maps and reduces, resulting in spills to disk (although too many tasks will result in too much time spent starting/stopping JVMs and too many small files). A streaming job would also have the additional overhead of starting a Python/Perl VM and having data copied back and forth between the JVM and the scripting VM.

Related

Mahout single-machine performance

I am developing a Java-based application and I decided to use machine learning algorithms implemented in the Mahout library. My application will run on a single machine, without Hadoop.
I would like to ask: does single-node Mahout also have overhead, like the distributed one? I read in the book Mahout in Action that multi-node cluster Mahout has some overhead (initializing, transferring data, etc.). But if we use Mahout algorithms without the MapReduce paradigm, there should be no overhead, right?
It does not make a difference whether you run it on a single machine or a 1000-node cluster. Hadoop serializes all the intermediate data (the map's key-value output) and persists it to disk. In the reduce phase, it loads the key-value pairs back into memory. Therefore, it has huge processing and disk-access overheads.
Basically, if you have few machines (e.g. < 7), Hadoop is possibly not a good choice, especially for speedup analysis. In that case, you may just use the small cluster to check your code's logic before deploying it in a larger environment.
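To make the non-Hadoop path concrete, here is a minimal sketch of an in-process Mahout Taste recommender, the kind of single-JVM usage Mahout in Action covers; the prefs.csv file (userID,itemID,value rows) and the parameter choices are assumptions:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class InProcessRecommender {
    public static void main(String[] args) throws Exception {
        // Everything below stays in this JVM's memory: no serialization
        // to disk between phases, no job startup, no cluster.
        DataModel model = new FileDataModel(new File("prefs.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1.
        List<RecommendedItem> top = recommender.recommend(1, 3);
        for (RecommendedItem item : top) {
            System.out.println(item);
        }
    }
}
```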

How can load balancing be handled in Hadoop mapreduce?

How can load balancing be handled in Hadoop MapReduce? I am writing a distributed application in which the server distributes jobs to worker nodes based on a benchmark test, memory available, number of CPU cores, CPU usage, and number of GPUs available / their usage. I am not very experienced with MapReduce and have read some documentation on Apache's website, but am still not sure how to go about solving this problem. Can I run the benchmark, gather all of this information, and then use an algorithm to dynamically split up the input?
Thank you!
"MapReduce is a programming model and an associated implementation for processing and generating large data sets" extract of the abstract of MapReduce paper.
As you said in the comments, it seems your project is not data-intensive but compute-intensive, thus I think MapReduce is not the tool you need to use.
Performance of MapReduce systems strongly depends on an even data distribution.
Apache MapReduce frameworks use a simplistic approach to distribute the workload, assigning roughly the same number of key groups to each reducer.
The load imbalance, which increases the processing time, is amplified further by the high runtime complexity of the reducer tasks. An adaptive load-balancing strategy is required to address this: estimate the cost of the tasks distributed to the reducers based on a given cost model. A sketch of the Hadoop hook for this follows.
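Within plain Hadoop MapReduce, the main hook for influencing that distribution is a custom Partitioner. A minimal sketch under a big assumption, namely that you can recognize expensive keys up front (the HEAVY_ prefix is hypothetical); real skew handling usually needs a sampling pass to estimate key costs first:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes known-expensive keys to a dedicated reducer so that cheap keys
// don't queue behind them. Assumes the job runs with numPartitions > 1.
public class CostAwarePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.toString().startsWith("HEAVY_")) {
            return 0; // reducer 0 handles only the expensive keys
        }
        // Hash the remaining keys over the other reducers.
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}
```

You would enable it with job.setPartitionerClass(CostAwarePartitioner.class).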

Resource usage of google Go vs Python and Java on Appengine

Will google Go use less resources than Python and Java on Appengine? Are the instance startup times for go faster than Java's and Python's startup times?
Is the Go program uploaded as a binary or as source code, and if it is uploaded as source code, is it then compiled once or at each instance startup?
In other words: Will I benefit from using Go in App Engine from a cost perspective? (Only taking into account the cost of the App Engine resources, not development time.)
Will google Go use less resources than Python and Java on Appengine? Are the instance startup times for go faster than Java's and Python's startup times?
Yes, Go instances have a lower memory footprint than Python and Java ones (< 10 MB).
Yes, Go instances start faster than their Java and Python equivalents because the runtime only needs to read a single executable file to start an application.
Also, even though the runtime is at the moment single-threaded, Go instances handle incoming requests concurrently using goroutines, meaning that if one goroutine is waiting for I/O, another can process an incoming request.
Is the go program uploaded as binaries or source code and if it is uploaded as source code is it then compiled once or at each instance startup?
The Go program is uploaded as source code and compiled (once) to a binary when deploying a new version of your application using the SDK.
In other words: Will I benefit from using Go in app engine from a cost perspective?
The Go runtime definitely has an edge when it comes to the performance/price ratio; however, it doesn't affect the pricing of other API quotas, as described in Peter's answer.
The cost of instances is only part of the cost of your app. I only use the Java runtime right now, so I don't know how much more or less efficient things would be with Python or Go, but I don't imagine it will be orders of magnitude different. I do know that instances are not the only cost you need to consider. Depending on what your app does, you may find API or storage costs are more significant than any minor differences between runtimes. All of the API costs will be the same with whatever runtime you use.
Language "might" affect these costs:
On-demand Frontend Instances
Reserved Frontend Instances
Backend Instances
Language Independent Costs:
High Replication Datastore (per gig stored)
Outgoing Bandwidth (per gig)
Datastore API (per ops)
Blobstore API storage (per gig)
Email API (per email)
XMPP API (per stanza)
Channel API (per channel)
The question is mostly irrelevant.
The minimum memory footprint of a Go app is less than that of a Python app, which is less than that of a Java app. They all cost the same per instance, so unless your application performs better with extra heap space, this issue is irrelevant.
Go startup time is less than Python startup time which is less than Java startup time. Unless your application has a particular reason to churn through lots of instance startup/shutdown cycles, this is irrelevant from a cost perspective. On the other hand, if you have an app that is exceptionally bursty in very short time periods, the startup time may be an advantage.
As mentioned by other answers, many costs are identical among all platforms - in particular, datastore operations. To the extent that Go vs Python vs Java will have an effect on the instance-hours bill, it is related to:
Does your app generate a lot of garbage? For many applications, the biggest computational cost is the garbage collector. Java has by far the most mature GC and basic operations like serialization are dramatically faster than with Python. Go's garbage collector seems to be an ongoing subject of development, but from cursory web searches, doesn't seem to be a matter of pride (yet).
Is your app computationally intensive? Java (JIT-compiled) and Go are probably better than Python for mathematical operations.
All three languages have their virtues and curses. For the most part, you're better off letting other issues dominate - which language do you enjoy working with most?
It's probably more about how you allocate the resources than your language choice. I read that GAE was built to be language-agnostic, so there is probably no built-in advantage for any language, but you can get an advantage from choosing the language you are comfortable and motivated with. I use Python, and what made my deployment much more cost-effective was the upgrade to Python 2.7; you can only make that upgrade if you use the correct subset of 2.6, which is good. So if you choose a language you're comfortable with, it's likely that you will gain an advantage from your ability using the language rather than from the language + environment combination itself.
In short, I'd recommend Python, but that's the only App Engine language I've tried, and it remains my choice even though I know Java rather well: the code for a project will be much more compact using my favorite language, Python.
My apps are small to medium-sized and they cost almost nothing.
I haven't used Go, but I would strongly suspect it would load and execute instances much faster, and use less memory purely because it is compiled. Anecdotally from the group, I believe that Python is more responsive than Java, at least in instance startup time.
Instance load/startup times are important because when your instance is hit by more requests than it can handle, it spins up another instance. This makes that request take much longer, possibly giving the impression that the site is generally slow. Both Java and Python have to start up their virtual machine/interpreter, so I would expect Go to be an order of magnitude faster here.
There is one other issue: now that Python 2.7 is available, Go is the only option that is single-threaded (ironically, given that Go is designed as a modern concurrent language). So although Go requests should be handled faster, an instance can only handle requests serially. I'd be very surprised if this limitation lasts long, though.

hadoop beginners question

I've read some documentation about Hadoop and seen the impressive results. I get the bigger picture but am finding it hard to tell whether it would fit our setup. The question isn't programming-related, but I'm eager to get the opinion of people who currently work with Hadoop on how it would fit our setup:
We use Oracle for backend
Java (Struts2/Servlets/iBatis) for frontend
Nightly we get data which needs to be summarized. This runs as a batch process (takes 5 hours).
We are looking for a way to cut those 5 hours to a shorter time.
Where would Hadoop fit into this picture? Can we still continue to use Oracle even after Hadoop?
The chances are you can dramatically reduce the elapsed time of that batch process with some straightforward tuning. I offer this analysis on the simple basis of past experience. Batch processes tend to be written very poorly, precisely because they are autonomous and so don't have irate users demanding better response times.
Certainly I don't think it makes any sense at all to invest a lot of time and energy re-implementing your application in a new technology - no matter how fresh and cool it may be - until you have exhausted the capabilities of your current architecture.
If you want some specific advice on how to tune your batch query, well that would be a new question.
Hadoop is designed to parallelize a job across multiple machines. To determine whether it will be a good candidate for your setup, ask yourself these questions:
Do I have many machines on which I can run Hadoop, or am I willing to spend money on something like EC2?
Is my job parallelizable? (If your 5 hour batch process consists of 30 10-minute tasks that have to be run in sequence, Hadoop will not help you).
Does my data require random access? (This is actually pretty significant - Hadoop is great at sequential access and terrible at random access. In the latter case, you won't see enough speedup to justify the extra work / cost).
As far as where it "fits in" - you give Hadoop a bunch of data, and it gives you back output. One way to think of it is like a giant Unix process - data goes in, data comes out. What you do with it is your business. (This is of course an overly simplified view, but you get the idea.) So yes, you will still be able to write data to your Oracle database.
The Hadoop distributed filesystem supports highly parallel batch processing of data using MapReduce.
So your current process takes 5 hours to summarize the data. Off the bat, general summarization tasks are one of the 'types' of job MapReduce excels at. However, you need to understand whether your processing requirements will translate into a MapReduce job. By this I mean: can you achieve the summaries you need using the key/value pairs MapReduce limits you to? (A sketch of the shape follows.)
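To make 'translating into key/value pairs' concrete, here is a minimal sketch of the simplest summarization shape, a per-key sum; the tab-separated layout and field positions are hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NightlySummary {

    // Emits (groupKey, amount) for every input line; the layout with the
    // key in column 0 and a number in column 1 is an assumption.
    public static class SumMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            ctx.write(new Text(fields[0]),
                      new LongWritable(Long.parseLong(fields[1])));
        }
    }

    // Sums all amounts seen for one key; this is the "summary" step.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            ctx.write(key, new LongWritable(total));
        }
    }
}
```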
Hadoop requires a cluster of machines to run. Do you have hardware to support a cluster? This usually comes down to how much data you are storing in HDFS and also how fast you want to process the data. Generally, when running MapReduce on a Hadoop cluster, the more machines you have, the more data you can store or the faster a job runs. Having an idea of the amount of data you process each night would help a lot here.
You can still use Oracle. You can use Hadoop/MapReduce to do the data crunching and then use custom code to insert the summary data into an Oracle DB.

How to create a Linux cluster for running physics simulations in java?

I am developing a scientific application used to perform physical simulations. The algorithms used are O(n³), so for a large data set it takes a very long time to process. The application runs a simulation in around 17 minutes, and I have to run around 25,000 simulations. That is around one year of processing time.
The good news is that the simulations are completely independent from each other, so I can easily change the program to distribute the work among multiple computers.
There are multiple solutions I can see to implement this:
Get a multi-core computer and distribute the work among all the cores. Not enough for what I need to do.
Write an application that connects to multiple "processing" servers and distribute the load among them.
Get a cluster of cheap linux computers, and have the program treat everything as a single entity.
Option number 2 is relatively easy to implement, so I'm not looking so much for suggestions on how to implement it (it can be done just by writing a program that waits on a given port for the parameters, processes the values, and returns the result as a serialized file). That would be a good example of grid computing.
However, I wonder about the possibilities of the last option, a traditional cluster. How difficult is it to run a Java program on a Linux grid? Will all the separate computers be treated as a single computer with multiple cores, making it easy to adapt the program? Are there any good pointers to resources that would allow me to get started? Or am I making this over-complicated and better off with option number 2?
EDIT: As extra info, I am interested in how to implement something like what is described in this article from Wired Magazine, where a scientist replaced a supercomputer with a PlayStation 3 Linux cluster. Definitely number two sounds like the way to go... but the coolness factor.
EDIT 2: The calculation is very CPU-bound. Basically there are a lot of operations on large matrices, such as inversion and multiplication. I tried to look for better algorithms for these operations, but so far I've found that the operations I need are O(n³) (in libraries that are normally available). The data set is large (for such operations), but it is created on the client based on the input parameters.
I see now that I had a misunderstanding of how a computer cluster under Linux works. I had assumed it would work in such a way that it would just appear that you had all the processors of all the computers available, just as if you had a computer with multiple cores, but that doesn't seem to be the case. It seems that all these supercomputers work by having nodes that execute tasks distributed by some central entity, and that there are several different libraries and software packages that allow you to perform this distribution easily.
So, since there is no such thing as number 3, the question really becomes: what is the best way to create a clustered Java application?
I would very highly recommend the Java Parallel Processing Framework, especially since your computations are already independent. I did a good bit of work with this as an undergraduate and it works very well. The work of doing the implementation is already done for you, so I think this is a good way to achieve the goal in "number 2".
http://www.jppf.org/
Number 3 isn't difficult to do. It requires developing two distinct applications, the client and the supervisor. The client is pretty much what you have already: an application that runs a simulation. However, it needs altering so that it connects to the supervisor using TCP/IP or whatever and requests a set of simulation parameters. It then runs the simulation and sends the results back to the supervisor. The supervisor listens for requests from the clients and, for each request, gets an unallocated simulation from a database and updates the database to indicate the item is allocated but unfinished. When the simulation is finished, the supervisor updates the database with the result. If the supervisor stores the data in an actual database (MySQL, etc.) then the database can easily be queried for the current state of the simulations. This should scale well up to the point where the time taken to provide the simulation data to all the clients equals the time required to perform a simulation. A minimal client sketch follows.
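A minimal sketch of the client side of that scheme; the host, port, and one-line protocol are all hypothetical:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class SimulationClient {
    public static void main(String[] args) throws Exception {
        while (true) {
            try (Socket socket = new Socket("supervisor.example.com", 9000);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(socket.getInputStream()));
                 PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
                out.println("REQUEST");            // ask the supervisor for work
                String params = in.readLine();     // one parameter set per line
                if (params == null || params.equals("DONE")) {
                    break;                         // no unallocated simulations left
                }
                String result = runSimulation(params);
                out.println("RESULT " + result);   // report back to the supervisor
            }
        }
    }

    private static String runSimulation(String params) {
        return "42"; // placeholder for the existing 17-minute simulation code
    }
}
```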
The simplest way to distribute computing on a Linux cluster is to use MPI. I'd suggest you download and look at MPICH2. It's free; their home page is here.
If your simulations are completely independent, you don't need most of the features of MPI. You might have to write a few lines of C to interface with MPI and kick off execution of your script or Java program.
You should check out Hazelcast, the simplest peer-to-peer (no centralized server) clustering solution for Java. Try the Hazelcast Distributed ExecutorService for executing your code on the cluster (see the sketch after this answer).
Regards,
-talip
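A minimal sketch of the distributed-executor idea from the answer above, assuming a Hazelcast 3.x-style API; the SimTask class and its computation are hypothetical:

```java
import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;

public class ClusterRunner {

    // Tasks must be serializable so Hazelcast can ship them to other members.
    static class SimTask implements Callable<Double>, Serializable {
        private final double parameter;
        SimTask(double parameter) { this.parameter = parameter; }

        @Override
        public Double call() {
            return parameter * parameter; // stand-in for the real simulation
        }
    }

    public static void main(String[] args) throws Exception {
        // Starting an instance on each machine forms the peer-to-peer cluster.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IExecutorService executor = hz.getExecutorService("simulations");

        Future<Double> result = executor.submit(new SimTask(42.0));
        System.out.println(result.get());

        hz.shutdown();
    }
}
```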
You already suggested it, but disqualified it: multi-core. You could go for multi-core, if you had enough cores. One hot topic at the moment is GPGPU computing. NVIDIA's CUDA in particular is a very promising approach if you have many independent tasks which have to do the same computation. A GTX 280 gives you 240 cores, which can compute up to 1120-15360 threads simultaneously. A pair of them could solve your problem. Whether it is really implementable depends on your algorithm (data flow vs. control flow), because all scalar processors operate in a SIMD fashion.
Drawback: it would be C/C++, not Java.
How optimized are your algorithms? Are you using native BLAS libraries? You can get about an order-of-magnitude performance gain by switching from naive libraries to optimized ones. Some, like ATLAS, will also automatically spread the calculations over multiple CPUs on a system, so that covers bullet 1 automatically.
AFAIK, clusters usually aren't treated as a single entity. They are usually treated as separate nodes and programmed with tools like MPI and ScaLAPACK to distribute the elements of matrices onto multiple nodes. This doesn't really help you all that much if your data set fits in memory on one node anyway.
Have you looked at Terracotta?
For work distribution you'll want to use the Master/Worker framework.
Ten years ago, the company I worked for looked at a similar virtualization solution, and Sun, Digital, and HP all supported it at the time, but only with state-of-the-art supercomputers with hardware hot-swap and the like. Since then, I heard Linux supports the type of virtualization you're looking for in solution #3, but I've never used it myself.
Java primitives and performance
However, if you do matrix calculations, you'd want to do them in native code, not in Java (assuming you're using Java primitives). Cache misses especially are very costly, and interleaving in your arrays will kill performance. Non-interleaved chunks of memory in your matrices and native code will get you most of the speedup without additional hardware. (The sketch below shows how much loop order alone matters, even in pure Java.)
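A small pure-Java illustration of the cache-miss point, a sketch rather than a benchmark: reordering the classic triple loop from ijk to ikj makes the inner loop walk both matrices row-by-row (sequentially in memory), which typically speeds up large multiplications several-fold with no other changes.

```java
// Computes C += A * B on row-major double[][] arrays.
public class MatMul {
    static void multiplyIkj(double[][] a, double[][] b, double[][] c) {
        int n = a.length;
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i][k];         // loaded once, reused across j
                for (int j = 0; j < n; j++) {
                    c[i][j] += aik * b[k][j]; // b and c accessed sequentially
                }
            }
        }
    }
}
```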
