Hadoop beginner's question - Java

I've read some documentation about Hadoop and seen the impressive results. I get the bigger picture but am finding it hard to tell whether it would fit our setup. The question isn't strictly programming related, but I'm eager to get the opinion of people who currently work with Hadoop on how it would fit our setup:
We use Oracle for backend
Java (Struts2/Servlets/iBatis) for frontend
Nightly we get data which needs to be summarized. This runs as a batch process (takes 5 hours)
We are looking for a way to cut those 5 hours to a shorter time.
Where would hadoop fit into this picture? Can we still continue to use Oracle even after hadoop?

The chances are you can dramatically reduce the elapsed time of that batch process with some straightforward tuning. I offer this analysis on the simple basis of past experience. Batch processes tend to be written very poorly, precisely because they are autonomous and so don't have irate users demanding better response times.
Certainly I don't think it makes any sense at all to invest a lot of time and energy re-implementing your application in a new technology - no matter how fresh and cool it may be - until you have exhausted the capabilities of your current architecture.
If you want some specific advice on how to tune your batch query, well that would be a new question.

Hadoop is designed to parallelize a job across multiple machines. To determine whether it will be a good candidate for your setup, ask yourself these questions:
Do I have many machines on which I can run Hadoop, or am I willing to spend money on something like EC2?
Is my job parallelizable? (If your 5 hour batch process consists of 30 10-minute tasks that have to be run in sequence, Hadoop will not help you).
Does my data require random access? (This is actually pretty significant - Hadoop is great at sequential access and terrible at random access. In the latter case, you won't see enough speedup to justify the extra work / cost).
As far as where it "fits in" - you give Hadoop a bunch of data, and it gives you back output. One way to think of it is like a giant Unix process - data goes in, data comes out. What you do with it is your business. (This is of course an overly simplified view, but you get the idea.) So yes, you will still be able to write data to your Oracle database.

The Hadoop distributed filesystem (HDFS) supports highly parallel batch processing of data using MapReduce.
So your current process takes 5 hours to summarize the data. Off the bat, general summarization tasks are one of the 'types' of job MapReduce excels at. However, you need to understand whether your processing requirements will translate into a MapReduce job. By this I mean, can you achieve the summaries you need using the key/value pairs MapReduce limits you to?
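For illustration only, here is a minimal sketch of what a key/value summarization could look like with the classic Hadoop MapReduce Java API; the input layout, field positions and class names are assumptions, not anything from your actual schema. The mapper emits (groupKey, amount) per input line and the reducer sums per key.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NightlySummary {
        // Assumes CSV-ish lines like "customerId,amount,..."; emits (customerId, amount).
        public static class SumMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                ctx.write(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[1])));
            }
        }

        // Sums all amounts seen for one key.
        public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                double total = 0;
                for (DoubleWritable v : values) total += v.get();
                ctx.write(key, new DoubleWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "nightly-summary");
            job.setJarByClass(NightlySummary.class);
            job.setMapperClass(SumMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

If your summaries fit that "group by a key, then aggregate" shape, MapReduce is a natural match; if they need lots of cross-record joins or random lookups, it gets harder.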
Hadoop requires a cluster of machines to run. Do you have hardware to support a cluster? This usually comes down to how much data you are storing on HDFS and also how fast you want to process the data. Generally, when running MapReduce on a Hadoop cluster, the more machines you have, the more data you can store or the faster you can run a job. Having an idea of the amount of data you process each night would help a lot here.
You can still use Oracle. You can use Hadoop/MapReduce to do the data crunching and then use custom code to insert the summary data into an Oracle DB.
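As a rough sketch of that "custom code" loading step (the connection URL, table and column names below are placeholders, and an Oracle JDBC driver on the classpath is assumed), you could read the MapReduce text output and batch-insert it with plain JDBC:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class LoadSummaries {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details -- adjust to your Oracle instance.
            try (Connection con = DriverManager.getConnection(
                         "jdbc:oracle:thin:@//dbhost:1521/ORCL", "app_user", "secret");
                 PreparedStatement ps = con.prepareStatement(
                         "INSERT INTO nightly_summary (group_key, total) VALUES (?, ?)");
                 BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t");   // MapReduce text output is tab-separated
                    ps.setString(1, parts[0]);
                    ps.setDouble(2, Double.parseDouble(parts[1]));
                    ps.addBatch();
                }
                ps.executeBatch();
            }
        }
    }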

Related

Is there a best practice to query huge CSVs in a Spring Boot context?

I'm working for a well-known company on a project that should bring integration with another system that produces one 27 GB CSV per hour. The target is to query these files without importing them (the main problem is bureaucracy; nobody wants responsibility if some data changes).
The main filters on these files are by date: the end user must insert a start-end date range. After that, the results can be filtered by a few strings.
Context: spring boot microservices
Server: Xeon processor, 24 cores, 256 GB RAM
Filesystem: NFS mounted from external server
Test data: 1000 files, 1 GB each
For a performance improvement I'm indexing files by date, writing into each file name the range it contains and making a folder structure like yyyy/mm/dd. For each of the following tests, the first step was to build a raw list of the file paths to be read.
The research (query) will read all the files:
Spring batch - buffered reader and parse into object: 12,097 sec
Plain java - threadpool, buffered reader and parse into object: 10,882 sec
Linux egrep with regex and parallel, run from Java, and parse into object: 7,701 sec
The dirtiest is also the fastest. I want to avoid it because the security department warned me about all the checks to make on input data to prevent shell injection.
Googling, I found the MariaDB CONNECT engine, which can also point at huge CSVs, so now I'm going down this road, creating a temporary table with the files the research is interested in. The bad part is that I have to create one table for each query, since the dates can be different.
For the first year we're expecting no more than 5 parallel researches at the same time, with an average range of 3 weeks. These queries will be done asynchronously.
Do you know something that can help me with this? Not only for speed, but also a good practice to apply.
Thanks a lot folks.
To answer your question:
No. There are no best practices. And, AFAIK there are no generally applicable "good" practices.
But I do have some general advice. If you allow considerations such as bureaucracy and (to a lesser extent) security edicts to dictate your technical solutions, then you are liable to end up with substandard solutions; i.e. solutions that are slow or costly to run and keep running. (If "they" want it to be fast, then "they" shouldn't put impediments in your way.)
I don't think we can give you an easy solution to your problem, but I can say some things about your analysis.
You said this about the grep solution:
"I want avoid it because security department warned me about all checks to make on input data to prevent shell injection."
The solution to that concern is simple: don't use an intermediate shell. The dangerous injection attacks come via shell trickery rather than grep itself. Java's ProcessBuilder doesn't use a shell unless you explicitly invoke one. The grep program itself can only read the files specified in its arguments and write to standard output and standard error.
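A minimal sketch of that idea (the pattern and the file path are just placeholders): ProcessBuilder hands each argument to grep directly, so no shell ever parses the user's input.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class GrepRunner {
        public static void main(String[] args) throws Exception {
            String pattern = "2021-06-0[1-7]";   // user-supplied input, passed as a single argv element
            ProcessBuilder pb = new ProcessBuilder(
                    "grep", "-E", "--", pattern, "/mnt/nfs/csv/2021/06/01/data.csv");
            pb.redirectErrorStream(true);
            Process p = pb.start();
            try (BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = out.readLine()) != null) {
                    // parse the matching CSV line into your domain object here
                }
            }
            p.waitFor();
        }
    }

The "--" tells grep to stop option parsing, so even input starting with "-" is treated as a pattern rather than a flag.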
You said about the general architecture:
"The target is query these files without import them (the main problem is bureaucracy, nobody want responsibility if some data change)."
I don't understand the objection here. We know that the CSV files are going to change. You are getting a new 27GB CSV file every hour!
If the objection is that the format of the CSV files is going to change, well, that affects your ability to effectively query them. But with a little ingenuity, you could detect the change in format and adjust the ingestion process on the fly.
"We're expecting not more than 5 parallel researches in same time, with an average of 3 weeks of range."
If you haven't done this already, you need to do some analysis to see whether your proposed solution is going to be viable. Estimate how much CSV data needs to be scanned to satisfy a typical query. Multiply that by the number of queries to be performed in (say) 24 hours. Then compare that against your NFS server's ability to satisfy bulk reads. Then redo the calculation assuming a given number of queries running in parallel.
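To make that concrete with purely illustrative numbers: at one 27 GB file per hour, a 3-week range is roughly 21 × 24 × 27 GB ≈ 13.6 TB of CSV to scan for a single query, and five such queries at once is around 68 TB. Even if the NFS server could sustain 1 GB/s, that is on the order of 19 hours of raw reading before any parsing happens.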
Consider what happens if your (above) expectations are wrong. You only need a couple of "idiot" users doing unreasonable things ...
Having a 24 core server for doing the queries is one thing, but the NFS server also needs to be able to supply the data fast enough. You can improve things with NFS tuning (e.g. by tuning block sizes, the number of NFS daemons, using FS-Cache), but the ultimate bottlenecks will be getting the data off the NFS server's disks and across the network to your server. Bear in mind that there could be other servers "hammering" the NFS server while your application is doing its thing.

Streaming or custom Jar in Hadoop

I'm running a streaming job in Hadoop (on Amazon's EMR) with the mapper and reducer written in Python. I want to know about the speed gains I would experience if I implement the same mapper and reducer in Java (or use Pig).
In particular, I'm looking for people's experiences on migrating from streaming to custom jar deployments and/or Pig and also documents containing benchmark comparisons of these options. I found this question, but the answers are not specific enough for me. I'm not looking for comparisons between Java and Python, but comparisons between custom jar deployment in Hadoop and Python-based streaming.
My job reads n-gram counts from the Google Books Ngram dataset and computes aggregate measures. It seems like CPU utilization on the compute nodes is close to 100%. (I would also like to hear your opinions about the differences between having a CPU-bound and an IO-bound job.)
Thanks!
Amaç
Why consider deploying custom jars?
Ability to use more powerful custom input formats. For streaming jobs, even if you use pluggable input/output as mentioned here, the keys and values passed to your mapper/reducer are limited to text/strings. You would need to expend some CPU cycles to convert them to your required types.
I've also heard that Hadoop can be smart about reusing JVMs across multiple jobs, which won't be possible when streaming (can't confirm this).
When to use Pig?
Pig Latin is pretty cool and is a much higher-level data flow language than Java/Python or Perl. Your Pig scripts WILL tend to be much smaller than an equivalent task written in any of the other languages.
When NOT to use Pig?
Even though Pig is pretty good at figuring out by itself how many maps/reduces to use, when to spawn a map or reduce, and a myriad of such things, if you are dead sure how many maps/reduces you need, you have some very specific computation to do within your map/reduce functions, and you are very particular about performance, then you should consider deploying your own jars. This link shows that Pig can lag native Hadoop M/R in performance. You could also take a look at writing your own Pig UDFs, which isolate some compute-intensive function (and possibly even use JNI to call some native C/C++ code within the UDF).
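For example, a compute-heavy step could be isolated in a small Java UDF along these lines (the class name and the two-field input are made up for illustration, not taken from your job):

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Hypothetical UDF: turns an n-gram count and a yearly total into a relative frequency.
    public class RelativeFrequency extends EvalFunc<Double> {
        @Override
        public Double exec(Tuple input) throws IOException {
            if (input == null || input.size() < 2) return null;
            long count = (Long) input.get(0);
            long total = (Long) input.get(1);
            return total == 0 ? null : (double) count / total;
        }
    }

You would REGISTER the jar in the Pig script and then call the UDF like any built-in function.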
A note on IO- and CPU-bound jobs:
Technically speaking, the whole point of Hadoop and MapReduce is to parallelize compute-intensive functions, so I'd presume your map and reduce tasks are compute-intensive. The main time the Hadoop subsystem is busy doing IO is between the map and reduce phases, when data is sent across the network, or when you have a large amount of data and have manually configured too few maps and reduces, resulting in spills to disk (although too many tasks will result in too much time spent starting/stopping JVMs and too many small files). A streaming job would also have the additional overhead of starting a Python/Perl VM and copying data back and forth between the JVM and the scripting VM.

Long-running stats process - thoughts on language choice?

I am on a LAMP stack for a website I am managing. There is a need to roll up usage statistics (a variety of things related to our desktop product).
I initially tackled the problem with PHP (being that I had a bunch of classes to work with the data already). All worked well on my dev box which was using 5.3.
Long story short, 5.1 memory management seems to suck a lot worse, and I've had to do a lot of fooling to get the long-term roll up scripts to run in a fixed memory space. Our server guys are unwilling to upgrade PHP at this time. I've since moved my dev server back to 5.1 so I don't run into this problem again.
For mining of MySQL databases to roll up statistics for different periods and resolutions, potentially running a process that does this all the time in the future (as opposed to on a cron schedule), what language choice do you recommend? I was looking at Python (I know it more or less), Java (don't know it that well), or sticking it out with PHP (know it quite well).
Edit: design clarification for commenter
Resolutions: The way the rollup script currently works is that I have some classes for defining resolutions and buckets. I have year, month, week, day -- given a "bucket number", each class gives a start and end timestamp that defines the time range for that bucket -- this is based on an arbitrary epoch date. The system maintains "complete" records, i.e. it will complete its rolled-up dataset for each resolution since the last time it was run.
SQL strategy: The base stats are located in many dissimilar schemas and tables. I do individual queries for each rolled-up stat for the most part, then fill one record for insert. You are suggesting nested subqueries such as:
INSERT INTO rolled_up_stats (someval, someval, someval, ...) VALUES ((SELECT SUM(somestat) FROM someschema), (SELECT AVG(somestat2) FROM someschema2), ...)
Those subqueries will generate temporary tables, right? My experience is that this has been slow as molasses in the past. Is it a better approach?
Edit 2: Adding some inline responses to the question
Language was a bottleneck in the case of PHP 5.1 -- I was essentially told I made the wrong language choice (though the scripts worked fine on 5.3). You mention Python, which I am checking out for this task. To be clear, what I am doing is providing a management tool for usage statistics of a desktop product (the logs are actually written by an EJB server to MySQL tables). I do Apache log file analysis, as well as more custom web reporting on the web side, but this project is separate. The approach I've taken so far is aggregate tables. I'm not sure what these message queue products could do for me; I'll take a look.
To go a bit further -- the data is being used to chart activity over time at the service and the customer level, to allow management to understand how the product is being used. You might select a time period (April 1 to April 10) and retrieve a graph of total minutes of usage of a certain feature at different granularities (hours, days, months, etc.) depending on the time period selected. It's essentially an after-the-fact analysis of usage. The need seems to be tending towards real-time, however (look at the last hour of usage).
There are a lot of different approaches to this problem, some of which are mentioned here, but what you're doing with the data post-rollup is unclear...?
If you want to utilize this data to provide digg-like 'X diggs' buttons on your site, or summary graphs or something like that which needs to be available on some kind of ongoing basis, you can actually utilize memcache for this, and have your code keep the cache key for the particular statistic up to date by incrementing it at the appropriate times.
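As a sketch of that counter pattern, using the spymemcached client (the key scheme is invented, and the three-argument incr-with-default is assumed to be available in your client version):

    import java.net.InetSocketAddress;
    import net.spy.memcached.MemcachedClient;

    public class StatCounter {
        private final MemcachedClient cache;

        public StatCounter() throws Exception {
            cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));
        }

        // Bump the counter for a feature/hour bucket, creating it at 1 if it doesn't exist yet.
        public void recordUsage(String feature, String hourBucket) {
            cache.incr("stats:" + feature + ":" + hourBucket, 1, 1);
        }
    }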
You could also keep aggregation tables in the database, which can work well for more complex reporting. In this case, depending on how much data you have and what your needs are, you might be able to get away with having an hourly table, and then just creating views based on that base table to represent days, weeks, etc.
If you have tons and tons of data, and you need aggregate tables, you should look into offloading statistics collection (and perhaps the database queries themselves) to a queue like RabbitMQ or ActiveMQ. On the other side of the queue put a consumer daemon that just sits and runs all the time, updating things in the database (and perhaps the cache) as needed.
One thing you might also consider is your web server's logs. I've seen instances where I was able to get a somewhat large portion of the required statistics from the web server logs themselves after just minor tweaks to the log format rules in the config. You can roll the logs at a regular interval, and then start processing them offline, recording the results in a reporting database.
I've done all of these things with Python (I released loghetti for dealing with Apache combined format logs, specifically), though I don't think language is a limiting factor or bottleneck here. Ruby, Perl, Java, Scala, or even awk (in some instances) would work.
I have worked on a project to do a similar thing in the past, so I have actual experience with performance. You would be hard pressed to beat the performance of "INSERT ... SELECT" (not "INSERT ... VALUES (SELECT ...)"). Please see http://dev.mysql.com/doc/refman/5.1/en/insert-select.html
The advantage of doing that, especially if you keep the roll-up code in MySQL procedures, is that all you need from the outside is a cron job to poke the DB into performing the right roll-ups at the right times -- as simple as a shell script with 'mysql <correct DB arguments etc.> "CALL RollupProcedure"'.
This way, you are guaranteeing yourself zero memory allocation bugs, as well as having decent performance when the MySQL DB is on a separate machine (no moving of data across machine boundary...)
EDIT: Hourly resolution is fine -- just run an hourly cron-job...
If you are running mostly SQL commands, why not just use MySQL etc on the command line? You could create a simple table that lists aggregate data then run a command like mysql -u[user] -p[pass] < commands.sql to pass SQL in from a file.
Or, split the work into smaller chunks and run them sequentially (as PHP files if that's easiest).
If you really need it to be a continual long-running process then a programming language like python or java would be better, since you can create a loop and keep it running indefinitely. PHP is not suited for that kind of thing. It would be pretty easy to convert any PHP classes to Java.

Workload Distribution / Parallel Execution in JAVA

I have a situation here where I need to distribute work over to multiple JAVA processes running in different JVMs, probably different machines.
Let's say I have a table with records 1 to 1000. I am looking for work to be collected and distributed in sets of 10. Let's say records 1-10 go to workerOne, records 11-20 to workerTwo, and so on and so forth. Needless to say, workerOne never does the work of workerTwo unless and until workerTwo couldn't do it.
This example was purely based on database but could be extended to any system, I believe be it File processing, email processing and so forth.
I have a small feeling that the immediate response would be to go for a Master/Worker approach. However here we are talking about different JVMs. Even if one JVM were to come down the other JVM should just keep doing its work.
Now the million dollar question would be: Are there any good frameworks(production ready) that would give me facility to do this. Even if there are concrete implementations of specific needs like Database records, File processing, Email processing and their likes.
I have seen the Java Parallel Execution Framework, but am not sure if it can be used across different JVMs, and if one were to come down, would the other keep going? I believe workers could be on multiple JVMs, but what about the master?
More Info 1: Hadoop would be a problem because of the JDK 1.6 requirement. That's a bit too much.
Thanks,
Franklin
Might want to look into MapReduce and Hadoop
You could also use message queues. Have one process that generates the list of work and packages it in nice little chunks. It then plops those chunks on a queue. Each one of the workers just keeps waiting on the queue for something to show up. When it does, the worker pulls a chunk off the queue and processes it. If one process goes down, some other process will pick up the slack. Simple, and people have been doing it that way for a long time, so there's a lot of information about it on the net.
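A bare-bones sketch of such a worker, assuming an ActiveMQ broker and a queue named "work.chunks" (both the broker URL and the queue name are placeholders):

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.Destination;
    import javax.jms.MessageConsumer;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class ChunkWorker {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://broker-host:61616");
            Connection connection = factory.createConnection();
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Destination queue = session.createQueue("work.chunks");
            MessageConsumer consumer = session.createConsumer(queue);
            while (true) {
                TextMessage chunk = (TextMessage) consumer.receive();  // blocks until work arrives
                process(chunk.getText());                              // e.g. "records 11-20"
            }
        }

        private static void process(String chunk) {
            System.out.println("processing " + chunk);
        }
    }

Because every worker competes for messages on the same queue, a dead JVM simply stops consuming and the remaining workers absorb its share.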
Check out Hadoop
I believe Terracotta can do this. If you are dealing with web pages, JBoss can be clustered.
If you want to do this yourself, you will need a work manager which keeps track of jobs to do, jobs in progress, and jobs that were never done and need to be rescheduled. The workers then ask for something to do, do it, and send the result back, asking for more.
You may want to elaborate on what kind of work you want to do.
The problem you've described is definitely best solved using the master/worker pattern.
You should have a look into JavaSpaces (part of the Jini framework); it's really well suited to this kind of thing. Basically you just want to encapsulate each task to be carried out inside a Command object, subclassing as necessary. Dump these into the JavaSpace, let your workers grab and process one at a time, then reassemble when done.
Of course your performance gains will totally depend on how long it takes you to process each set of records, but JavaSpaces won't cause any problems if distributed across several machines.
If you work on records in a single database, consider performing the work within the database itself using stored procedures. The gain from processing the records on different machines might be negated by the cost of retrieving and transmitting the work between the database and the computing nodes.
For file processing it could be a similar case. Working on files in a (shared) filesystem might introduce large I/O pressure on the OS.
And the cost of maintaining multiple JVMs on multiple machines might be overkill too.
And for the question: I used JADE (Java Agent Development Environment) for some distributed simulation once. Its multi-machine support and message-passing nature might help you.
I would consider using JGroups for that. You can cluster your JVMs, and one of your nodes can be selected as master and then distribute the work to the other nodes by sending messages over the network. Or you can partition your work items up front and then manage the distribution of the partitions in the master node, e.g. partition-1 goes to JVM-4, partition-2 goes to JVM-3, partition-3 goes to JVM-2, and so on. And if JVM-4 goes down, the master node will notice and tell one of the other nodes to pick up partition-1 as well.
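A rough sketch of the receiving side of that, using the JGroups 3.x-style API (the cluster name and the string payload are made up for illustration):

    import org.jgroups.JChannel;
    import org.jgroups.Message;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.View;

    public class PartitionNode extends ReceiverAdapter {
        private JChannel channel;

        public void start() throws Exception {
            channel = new JChannel();           // default UDP stack
            channel.setReceiver(this);
            channel.connect("work-cluster");    // all JVMs joining this name form one cluster
        }

        @Override
        public void receive(Message msg) {
            String assignment = (String) msg.getObject();   // e.g. "partition-1"
            // ... process the assigned partition ...
        }

        @Override
        public void viewAccepted(View view) {
            // The first member of the view can act as master and re-assign partitions
            // when a node drops out of the cluster view.
            System.out.println("cluster members: " + view.getMembers());
        }
    }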
One other alternative which is easier to use is Redis pub/sub support: http://redis.io/topics/pubsub. But then you will have to maintain Redis servers, which I don't like.

How to create a Linux cluster for running physics simulations in Java?

I am developing a scientific application used to perform physical simulations. The algorithms used are O(n³), so for a large set of data it takes a very long time to process. The application runs a simulation in around 17 minutes, and I have to run around 25,000 simulations. That is around one year of processing time.
The good news is that the simulations are completely independent from each other, so I can easily change the program to distribute the work among multiple computers.
There are multiple solutions I can see to implement this:
Get a multi-core computer and distribute the work among all the cores. Not enough for what I need to do.
Write an application that connects to multiple "processing" servers and distribute the load among them.
Get a cluster of cheap linux computers, and have the program treat everything as a single entity.
Option number 2 is relatively easy to implement, so I don't look so much for suggestions for how to implement this (Can be done just by writing a program that waits on a given port for the parameters, processes the values and returns the result as a serialized file). That would be a good example of Grid Computing.
However, I wonder about the possibilities of the last option, a traditional cluster. How difficult is it to run a Java program on a Linux grid? Will all the separate computers be treated as a single computer with multiple cores, thus making it easy to adapt the program? Are there any good pointers to resources that would allow me to get started? Or am I making this over-complicated and better off with option number 2?
EDIT: As extra info, I am interested in how to implement something like what is described in this article from Wired Magazine: scientists replaced a supercomputer with a PlayStation 3 Linux cluster. Definitely number two sounds like the way to go... but the coolness factor.
EDIT 2: The calculation is very CPU-bound. Basically there are a lot of operations on large matrices, such as inversion and multiplication. I tried to look for better algorithms for these operations, but so far I've found that the operations I need are O(n³) (in the libraries that are normally available). The data set is large (for such operations), but it is created on the client based on the input parameters.
I see now that I had a misunderstanding about how a computer cluster under Linux works. I had assumed it would just appear as if you had all the processors in all the computers available, just as with a computer with multiple cores, but that doesn't seem to be the case. It seems that all these supercomputers work by having nodes that execute tasks distributed by some central entity, and that there are several different libraries and software packages that make it easy to perform this distribution.
So, as there is no such thing as number 3, the question really becomes: what is the best way to create a clustered Java application?
I would very highly recommend the Java Parallel Processing Framework (JPPF), especially since your computations are already independent. I did a good bit of work with it as an undergraduate and it works very well. The work of doing the implementation is already done for you, so I think this is a good way to achieve the goal in "number 2".
http://www.jppf.org/
Number 3 isn't difficult to do. It requires developing two distinct applications, the client and the supervisor. The client is pretty much what you have already: an application that runs a simulation. However, it needs altering so that it connects to the supervisor using TCP/IP or whatever and requests a set of simulation parameters. It then runs the simulation and sends the results back to the supervisor. The supervisor listens for requests from the clients and, for each request, gets an unallocated simulation from a database and updates the database to indicate the item is allocated but unfinished. When the simulation is finished, the supervisor updates the database with the result. If the supervisor stores the data in an actual database (MySQL, etc.), then the database can be easily queried for the current state of the simulations. This should scale well up to the point where the time taken to provide the simulation data to all the clients is equal to the time required to perform the simulation.
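A stripped-down sketch of the client side of that arrangement (the host, port and line-based wire format are placeholders; the supervisor would hold the matching ServerSocket and the database of unallocated simulations):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;

    public class SimulationClient {
        public static void main(String[] args) throws Exception {
            while (true) {
                try (Socket socket = new Socket("supervisor-host", 9000);
                     BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
                     PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
                    out.println("REQUEST");                 // ask for a set of simulation parameters
                    String params = in.readLine();          // supervisor replies with one line of parameters
                    if (params == null || params.equals("DONE")) break;
                    String result = runSimulation(params);  // the existing ~17-minute computation
                    out.println("RESULT " + result);        // report back; supervisor marks the item finished
                }
            }
        }

        private static String runSimulation(String params) {
            return "42";   // placeholder for the real physics code
        }
    }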
The simplest way to distribute computing on a Linux cluster is to use MPI. I'd suggest you download and look at MPICH2. It's free; their home page is here.
If your simulations are completely independent, you don't need most of the features of MPI. You might have to write a few lines of C to interface with MPI and kick off execution of your script or Java program.
You should check out Hazelcast, the simplest peer-to-peer (no centralized server) clustering solution for Java. Try the Hazelcast distributed ExecutorService for executing your code on the cluster.
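A minimal sketch of that approach, using the Hazelcast 3.x-style API (the task is a trivial stand-in for the real simulation; the submitted Callable just has to be Serializable):

    import java.io.Serializable;
    import java.util.concurrent.Callable;
    import java.util.concurrent.Future;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IExecutorService;

    public class ClusterRunner {
        static class Simulation implements Callable<Double>, Serializable {
            private final double parameter;
            Simulation(double parameter) { this.parameter = parameter; }
            @Override
            public Double call() {
                return parameter * parameter;   // placeholder for the real O(n^3) computation
            }
        }

        public static void main(String[] args) throws Exception {
            // Every machine that starts an instance with the same config joins the same cluster.
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IExecutorService executor = hz.getExecutorService("simulations");
            Future<Double> result = executor.submit(new Simulation(3.0));
            System.out.println(result.get());
            hz.shutdown();
        }
    }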
Regards,
-talip
You already suggested it, but disqualified it: multiple cores. You could go for multi-core if you had enough cores. One hot topic at the moment is GPGPU computing. Especially NVIDIA's CUDA is a very promising approach if you have many independent tasks which have to do the same computation. A GTX 280 delivers 280 cores, which can compute up to 1120 - 15360 threads simultaneously. A pair of them could solve your problem. Whether it's really implementable depends on your algorithm (data flow vs. control flow), because all the scalar processors operate in a SIMD fashion.
Drawback: it would be C/C++, not Java.
How optimized are your algorithms? Are you using native BLAS libraries? You can get about an order of magnitude performance gain by switching from naive libraries to optimized ones. Some, like ATLAS, will also automatically spread the calculations over multiple CPUs on a system, so that covers bullet 1 automatically.
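For instance, with a wrapper such as jblas (which calls native BLAS/LAPACK underneath), the matrix work stays in Java while the heavy lifting runs in optimized native routines. A sketch, assuming the jblas API:

    import org.jblas.DoubleMatrix;
    import org.jblas.Solve;

    public class MatrixDemo {
        public static void main(String[] args) {
            int n = 2000;
            DoubleMatrix a = DoubleMatrix.randn(n, n);
            DoubleMatrix b = DoubleMatrix.randn(n, n);

            DoubleMatrix product = a.mmul(b);                           // native matrix multiply (GEMM)
            DoubleMatrix inverse = Solve.solve(a, DoubleMatrix.eye(n)); // A * X = I, i.e. X = A^-1

            System.out.println(product.get(0, 0) + " " + inverse.get(0, 0));
        }
    }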
AFAIK clusters usually aren't treated as a single entity. They are usually treated as separate nodes and programmed with stuff like MPI and ScaLAPACK to distribute the elements of matrices onto multiple nodes. This doesn't really help you all that much if your data set fits in memory on one node anyway.
Have you looked at Terracotta?
For work distribution you'll want to use the Master/Worker framework.
Ten years ago, the company I worked for looked at a similar virtualization solution, and Sun, Digital and HP all supported it at the time, but only with state-of-the-art supercomputers with hardware hotswap and the like. Since then, I heard Linux supports the type of virtualization you're looking for in solution #3, but I've never used it myself.
Java primitives and performance
However, if you do matrix calculations you'd want to do them in native code, not in Java (assuming you're using Java primitives). Especially cache misses are very costly, and interleaving in your arrays will kill performance. Non-interleaved chunks of memory in your matrices and native code will get you most of the speedup without additional hardware.
