How to optimize a MapReduce job - Java

So I have a job that does in-mapper computation. With each task taking about 0.08 seconds, a 360,026-line file takes about 8 hours if it is processed on one node. File sizes will generally be around 1-2 block sizes (often 200 MB or less).
Assuming the in-mapper code is optimized, is there any way to tweak the settings? Should I be using a smaller block size, for example? I am currently using AWS EMR with c4.large instances and autoscaling on YARN, but it only scaled up to 4 extra task nodes, as the load wasn't very high. Even though YARN memory usage wasn't very high, it still took over 7 hours to complete (which is way too long).
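For concreteness, the kind of setting I have in mind is something like the following (a rough sketch using the standard Hadoop split-size helper; the 32 MB figure and the job name are made up, not something I've tested):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: cap the split size so a ~200 MB input fans out over more map tasks
// (roughly 200 MB / 32 MB ~= 7 mappers) instead of 1-2 block-sized splits.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "in-mapper-computation"); // job name is made up
FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024); // 32 MB per split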

Related

M1 Max: very inconsistent CPU usage running the same Java code

So I have the same set of JSON files (over 7,000 files) that I'm reading with the same Java code I wrote.
On my M1 Max Mac, using 9 threads to read the files, the reading time varies between 3 and 16 seconds and CPU usage ranges from 10% to 85%.
On my i7-10875H XPS 15, using 15 threads to read the exact same files, the reading time stays within 4-5 seconds and CPU usage holds steady at 80-90%.
The code on both machines is identical apart from the number of threads used. I've run the test thousands of times, and the M1 Max Mac's reading time is wildly inconsistent. According to the VisualVM sampler, it's java.io.FileReader that takes the extra time, and the Mac's CPU usage is very low whenever the read time is long.
I don't understand why this is happening. Does anyone know how to avoid it? Why does the same code have no such issue on the XPS?

MapReduce: Increase number of concurrent mapper tasks

I am using AWS EMR to run a MapReduce job. My input set contains 1 million files of around 15 KB each. Since the input files are very small, this leads to a huge number of mappers. So I changed the S3 block size to 20 KB and used 5 r3.2xlarge instances, but the number of concurrent tasks running is still just 30. Shouldn't the job run more concurrent mappers after reducing the block size, or does each mapper still take the same amount of memory even after the block size is reduced?
How can I limit the memory usage of each mapper or increase the number of concurrent mapper tasks? The current expected completion time is 100 hours. Will combining these files into a smaller number of bigger files, say 400 MB each, increase the processing time?
Reducing the block size can increase the number of mappers required for a particular job, but it will not increase the number of mappers your cluster can run in parallel at a given point, nor the memory used by those mappers.
used 5 r3.2xlarge instances but number of concurrent tasks running is still just 30
To find the number of parallel mappers/reducers that a Hadoop 2 EMR cluster can support, please see this article: AWS EMR Parallel Mappers?
Ex: r3.2xlarge x 5 core nodes:
mapreduce.map.memory.mb = 3392
yarn.scheduler.maximum-allocation-mb = 54272
yarn.nodemanager.resource.memory-mb = 54272
One core node can run 54272 / 3392 = 16 mappers.
So the cluster can run a total of 16 * 5 = 80 mappers in parallel.
So if your job spins up, say, 1000 mappers, the cluster can launch 80 of them in parallel with the preconfigured memory and heap on your nodes, and the remaining mappers will simply be queued up.
If you want more parallel mappers, you might want to configure less memory (based on that math) and a smaller heap per mapper.
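To make that concrete, here is a rough sketch of requesting smaller map containers per job (the numbers are only illustrative, derived from the math above; I'm assuming the standard mapreduce.map.memory.mb / mapreduce.map.java.opts properties):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: halve the container size so roughly twice as many mappers fit per node.
// 54272 MB per r3.2xlarge core node / 1696 MB per mapper ~= 32 parallel mappers
// (versus 16 with the 3392 MB default). Values are illustrative, not tuned.
Configuration conf = new Configuration();
conf.set("mapreduce.map.memory.mb", "1696");        // YARN container size per map task
conf.set("mapreduce.map.java.opts", "-Xmx1360m");   // JVM heap, ~80% of the container
Job job = Job.getInstance(conf, "small-files-job"); // job name is made up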
What you are looking for is CombineFileInputFormat.
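A minimal sketch of wiring that in, using the concrete CombineTextInputFormat subclass (the 400 MB cap is just the figure from the question, and the job name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

// Sketch: pack many 15 KB files into each split so one mapper handles
// thousands of files instead of a single tiny file per mapper.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "combine-small-files"); // job name is made up
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 400L * 1024 * 1024); // ~400 MB per split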
Do remember that the map split size defaults to the HDFS block size; changing one will not affect the other.
Please follow this link: http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/

Java process on Linux needs initial warmup

We have a legacy multithreaded Java process on RHEL 6.5 which is very time critical (low latency) and processes hundreds of thousands of messages a day. It runs on a powerful Linux machine with 40 CPUs. What we found is that the process has high latency while it processes the first 50k messages, averaging 10 ms/msg; after this 'warmup' period the latency starts to drop to about 7 ms, then 5 ms, eventually settling at about 3-4 ms/msg by the end of the day.
This puzzles me, and one possibility I can think of is that maps are being resized at the beginning until they reach a very large capacity, after which they never exceed the load factor again. From what I can see, the maps are not initialized with an initial capacity, which is why I suspect this may be the case. I ran the process through a profiler and pumped millions of messages through it, hoping to see some 'resize' method from the Java collections, but I was unable to find any. It could be that I am searching for the wrong thing or looking in the wrong direction. As a new joiner, with the previous team member gone, I am trying to see if there are other causes I haven't thought of.
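For reference, what I mean by an initial capacity is something like the following (a hypothetical sketch; the names and sizes are made up, not from our code):

import java.util.HashMap;
import java.util.Map;

// With the default constructor the map starts at 16 buckets and doubles
// (rehashing all entries) every time the load factor is exceeded.
Map<String, String> byId = new HashMap<>();

// Pre-sizing for the expected volume avoids those incremental resizes:
// 1,000,000 expected entries / 0.75 default load factor ~= 1,333,334 buckets.
Map<String, String> byIdPreSized = new HashMap<>(1_333_334);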
Another possibility I can think of is something related to kernel settings, but I am not sure what it could be.
I don't think it is a programming logic issue, because it runs at an acceptable speed after the first 30k-50k messages.
Any suggestions?
It sounds like it takes some time for the operating system to realize that your application is a big resource consumer. After a few seconds it sees that there is a lot of activity around your application's files, and only then does it respond to that activity, for example by populating the file cache.

Multiple image operations in a single process with gm4java

This question is about using the gm4java library to interact with GraphicsMagick (from Scala).
I've been testing the PooledGMService as demonstrated here with Scala, and it's working well.
However, I noticed that it does not behave like batch mode in the gm command line interface (gm batch batchfile.gm). When I run a gm batch file from the command line with any number of images, it launches a single gm process. However, if I:
val config = new GMConnectionPoolConfig()
val service = new PooledGMService(config)
and then share the instance of service across 4 threads, where I perform some operation on one image per thread like:
service.execute(
  "convert",
  srcPath.toString(),
  "-resize", percent + "%",
  outPath.toString()
)
I see that 4 separate gm processes are created.
I believe this has performance implications: a test with 100 images using the code above takes the same wall-clock time as the gm CLI with a batch file, but my Scala code uses 4x as much CPU.
My question is: how do I use gm4java so that a single gm process works on several images (or at least several kinds of conversions of the same image), just like the CLI batch mode? I've made a few attempts (some desperately silly) with no luck.
My exact Scala code can be found here, if you are curious.
Update 05/27/14
With the guidance of a comment by gm4java's author I realized that I was benchmarking two different gm commands. The updated benchmarking results are:
100 x 30 MB images (3.09 GB total)
on an i7 quad-core (8 logical CPUs with hyper-threading)
Criteria               Time
gm cli batch file      106s
my code, 1 thread      112s
my code, 4 threads      40s
my code, 6 threads      31s
my code, 7 threads      31s
my code, 8 threads      28s
Upon closer inspection, I also saw that while my code ran, the same gm processes with the same process IDs stayed up the whole time. This alleviated my worry that I was losing performance to overhead from starting and terminating gm processes.
Rephrasing
I guess the heart of my question is: what can I do to make gm4java as fast as possible? The tip about matching the gm process count to the machine's core count is useful. Is there anything else that comes to mind?
My particular use case is resizing input images (30 MB on average, 50-60 MB occasionally, and 100-500 MB very rarely) to a few set sizes (thumbnails being the most important and highest priority). Deployment will probably be on Amazon EC2 with 7 or 14 "compute units".
PooledGMService is designed to make maximal use of your machine's power by starting multiple GM processes and handling your image-manipulation requests with a high degree of concurrency. 100 images is too small a sample size for a performance test. If your goal is to make the best use of a multi-CPU server to convert images, you need to test with a large number of samples (at least a few thousand) and tweak the configuration to find the best number of concurrent GM processes. See the documentation of GMConnectionPoolConfig for all the configuration options.
If you have only 8 CPUs, don't start more than 7 GM processes. If you are testing on a 2-CPU laptop, don't run more than 2 GM processes. In the example, you accepted all the default configuration settings, which will start up to 8 GM processes on demand. That won't be the right configuration for processing just 100 images on a 2-CPU laptop.
If all you want is to mimic the command-line batch mode, then SimpleGMService is your best friend. Look at the usage pattern here.
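For reference, that usage pattern looks roughly like this (a Java sketch from memory of the gm4java docs; package and method names should be double-checked against the project's javadoc):

import org.gm4java.engine.GMConnection;
import org.gm4java.engine.GMService;
import org.gm4java.engine.support.SimpleGMService;

// Sketch: one connection is backed by a single long-lived gm process, so every
// execute() below reuses that same process, much like the CLI batch mode.
// Exception handling is omitted for brevity.
GMService service = new SimpleGMService();
GMConnection connection = service.getConnection();
try {
    connection.execute("convert", "a.jpg", "-resize", "50%", "a_thumb.jpg");
    connection.execute("convert", "b.jpg", "-resize", "50%", "b_thumb.jpg");
} finally {
    connection.close();
}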
The right solution very much depends on your real use case. If you can tell us more about what exactly you are trying to achieve, your hardware environment, and so on, we will be better equipped to help you.

Performance tuning: detecting page faults

I am trying to tune one of my Java applications.
I am using a Java profiler and have gotten some reports from it.
I saw that the number of page faults for the application ranges from 30,000 to 35,000.
How can I decide whether this number is too high or normal?
I am getting the same figures during the first minute and after half an hour as well.
My machine has 2 GB of RAM and the application runs a single thread.
The thread only tries to read messages from a queue every 3 seconds, and the queue is empty.
Since no processing is being done, I would think page faults should not occur at all.
Please guide me here.
When you start your JVM, it reserves the maximum heap size as one contiguous block of virtual memory. However, that virtual memory is only turned into resident main memory as you touch its pages, i.e. every time your heap grows by 4 KB you get one page fault. You will also get page faults from thread stacks in the same way.
Your 35K page faults suggest you are using about 140 MB of heap (35,000 x 4 KB).
BTW, you can buy 8 GB for £25. You might consider an upgrade.
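If those lazily committed heap pages are what you want to avoid, one option (a sketch using HotSpot-specific flags; the heap size and jar name are made up) is to fix the heap size and touch every page at startup, so the faults happen once at launch rather than while messages are flowing:

java -Xms512m -Xmx512m -XX:+AlwaysPreTouch -jar your-app.jar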
What's your JVM? If it's HotSpot, you can use JVM options like -XX:LargePageSizeInBytes or -XX:+UseMPSS to force the desired page sizes and minimize paging overhead. I think there should be similar options for other JVMs too.
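For example, with HotSpot those options go on the launch command line; a sketch (on Linux you also need -XX:+UseLargePages plus huge pages enabled in the OS, -XX:+UseMPSS applies to Solaris, and the jar name is made up):

java -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -jar your-app.jar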
Take a look at this:
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
