M1 Max VERY inconsistent CPU usage running the same Java code - java

So I have the same set of JSON files (over 7,000 of them) that I'm reading with the same Java code I wrote.
On my M1 Max Mac, I'm using 9 threads to read the files; the reading time varies between 3 and 16 seconds, and the CPU usage ranges from 10% to 85%.
On my i7-10875H XPS 15, I'm using 15 threads to read the exact same files; the reading time stays within 4-5 seconds, and the CPU usage holds steady at 80-90%.
The code on both machines is identical apart from the number of threads used. I've run the test thousands of times, and the M1 Max's reading time is wildly inconsistent. According to the VisualVM sampler, it's java.io.FileReader that takes the extra time, and the Mac's CPU usage is very low whenever the read time is long.
I don't understand why this is happening. Does anyone know how to avoid it? Why does the same code have no such issue on the XPS?
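For concreteness, here is a minimal sketch of the kind of reader setup described above; the pool size matches the 9 threads mentioned, but the directory name, the awaitTermination timeout, and the way each file is consumed are assumptions, not the asker's actual code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

public class JsonDirectoryReader {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(9); // 9 threads, as on the M1 Max
        try (Stream<Path> paths = Files.list(Paths.get("json-dir"))) {
            paths.forEach(file -> pool.submit(() -> readFile(file)));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static String readFile(Path file) {
        StringBuilder sb = new StringBuilder();
        // Wrapping FileReader in a BufferedReader keeps each read() from going to the
        // OS in tiny chunks, which is where the VisualVM sampler pointed.
        try (BufferedReader in = new BufferedReader(new FileReader(file.toFile()))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return sb.toString();
    }
}

If per-read overhead on macOS is what the sampler is pointing at, buffering the FileReader as above is a cheap thing to rule out.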

Related

How to optimize a MapReduce job

So I have a job that does in-mapper computation. With each task taking about 0.08 seconds, a 360,026-line file would take about 8 hours just for this if it were done on one node. File sizes will generally be about 1-2 block sizes (often 200 MB or less).
Assuming the mapper code itself is optimized, is there any way to tune the settings? Should I be using a smaller block size, for example? I'm currently using AWS EMR with c4.large instances and autoscaling on YARN, but it only scaled up to 4 extra task nodes, as the load wasn't high. Even though YARN memory usage wasn't high, the job still took over 7 hours to complete, which is way too long.
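One setting worth experimenting with, rather than the HDFS block size itself, is the maximum input split size, which controls how many map tasks a single file fans out into. A minimal sketch, assuming a plain Hadoop MapReduce driver; the 16 MB cap and the class name are illustrative, not values tuned to this workload:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap each input split at 16 MB so a ~200 MB file yields roughly 13 map tasks
        // instead of 1-2, giving YARN more work to spread across task nodes.
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 16L * 1024 * 1024);
        Job job = Job.getInstance(conf, "in-mapper computation");
        // ... set the mapper class, input/output formats and paths as in the existing job ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}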

How to analyze why a multithreaded program throttles at a high number of threads

I'm writing a simple program that calculates the number Pi according to this formula. Before I elaborate on the problem, let me say that I'm testing my program (written in Java 8) on a 12-core CPU with 24 threads. According to htop, there is no other load on the server while the tests run, so that is out of the question.
I expected near-linear speedup, but the program starts to choke at a high number of threads (say >8, where it drops off the y=x line). Beyond that point, the execution time for the same parameters stays roughly constant regardless of the number of threads, and the speedup tops out at about 10.
Without giving too much concrete information, I would like to know how I can analyze where my program chokes. In other words, what are the must-dos when it comes to measuring a parallel program's speedup?
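A reasonable first step is to measure speedup and efficiency systematically before reaching for a profiler. A minimal timing-harness sketch, assuming computePi(iterations, threads) stands in for the actual series computation and the iteration count is illustrative:

public class SpeedupHarness {
    public static void main(String[] args) {
        long iterations = 1_000_000_000L;
        computePi(iterations, 1); // warm up the JIT before timing anything
        double baseline = time(() -> computePi(iterations, 1));
        for (int threads = 2; threads <= 24; threads++) {
            final int t = threads;
            double elapsed = time(() -> computePi(iterations, t));
            System.out.printf("threads=%2d time=%.2fs speedup=%.2f efficiency=%.2f%n",
                    t, elapsed, baseline / elapsed, baseline / elapsed / t);
        }
    }

    static double time(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1e9;
    }

    static void computePi(long iterations, int threads) {
        // placeholder: split the series across 'threads' workers and sum the partial results
    }
}

If efficiency falls off right around the 12 physical cores, hyper-threading alone explains much of it; if it collapses earlier, look for contention on shared accumulators, false sharing, or memory-bandwidth limits with a sampling profiler.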

Java multithreaded code does not fully consume the cores on some procs

I have some intensive processing code that reads chunks from a file, processes the data, and writes to the output file IN THE SAME ORDER. Some numbers: the input file is about 29 MB, the output file is about 39 MB, and there are 39,461 chunks.
The single-threaded version takes 100% of one processor (on a multicore machine only one core is used).
Timings in seconds for each processor:
Pentium 4 (1 core) 2.8GHz 4302.407
Intel Xeon (1 core) 2.8GHz 3805.281
Intel E8300 (2 cores) 1773.062
Intel Q6600 (4 cores) 2202.231
Intel i5-4440 (4 cores) 1300.127
i7-3632QM (4 cores, 8 threads) 1412.191
Interesting to see that the better-rated i7 is not far ahead of the old E8300 in single-threaded mode and is slower than the i5-4440.
To take advantage of the multicore structure I modified the code. I launch a number of threads equal to the number of cores (hardware threads), i.e. Runtime.getRuntime().availableProcessors(). Each thread reads a chunk from the file and picks a sequence number (in a synchronized block), does the intense processing, then waits in line until its number comes up for writing to the output file, does the writing, and increments the counter that shows whose turn it is to write (also synchronized).
The code works fine, the output file is generated correctly.
For the above processors(multicore) I got this :
Intel E8300 (2 cores) 937.766
Intel Q6600 (4 cores) 657.515
Intel i5-4440 (4 cores) 345.244
i7-3632QM (4 cores, 8 threads) 584.346
A clear improvement over the single-threaded version, but: Task Manager (all systems run Windows) shows, to my satisfaction, 100% busy on all cores of every processor except the i7. There it uses all 8 hardware threads but only at about 40% each, and the results reflect this behaviour: the i7 falls between the old Q6600 and the newer (but lower-rated) i5-4440, and closer to the former.
Some remarks:
The way threads wait for their turn to write to the output file is:
// spin until it is this chunk's turn to write
while (ai.intValue() != outSeed.intValue()) {
    Thread.sleep(10); // InterruptedException handling omitted
}
ai is the number picked when the chunk was read from the input file; the thread now waits its turn to write. outSeed is incremented by each thread that has succeeded in writing.
In intensive testing on the Q6600, the 10 ms sleep proved to give the best times, and the i5 also improved very well. The i7 did not do so well, so I tried sleep(3), sleep(1), and sleep(0). With 3 ms the i7 ran in 529.782 seconds. sleep(0) raised the busy percentage to approximately 60% on all 8 threads and the time was 440.897 seconds. That is better, but not enough: I would expect less than 200 seconds, and I think that is possible if I can keep the processor busier.
Again, the resulting file is as expected, and the behavior is as expected on most processors (100% busy) except for the i7-3632QM. What are your suggestions? I tried setting the priority to Realtime from Task Manager, with no effect. Is it possible that the operating system limits the processor usage?
Later I may have access to a six-core Xeon and will try on that too.
Thanks for reading.
Hyperthreading will obviously not scale performance linearly. Your i7 has 4 cores, not 8, with only a bit of logic in front of those cores that makes context switching faster. You can expect at most a 20-30% increase over the performance of a 4-core system without hyperthreading.
What you see in the Task Manager does not directly reflect the efficiency of individual Java threads because threads get reassigned between cores. The same readout may be accomplished with less than 8 threads, each running at full speed.
The mere fact that you have 8 threads instead of 4 may cause some blockage issues as you are unable to feed all 8 threads with work. That explicit sleep may influence this.
You should try replacing your polling loop with a design that relies on Phaser. That class seems to be a perfect match for your use case.
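A minimal sketch of how that could look, assuming chunk numbers start at 0 and writeChunk stands in for the existing output code; the phase number is used as "the chunk whose turn it is":

import java.util.concurrent.Phaser;

class OrderedWriter {
    // One registered party; each arrive() advances the phase by one.
    private final Phaser turn = new Phaser(1);

    // Called by a worker after it finishes processing chunk n.
    void writeInOrder(int n, byte[] data) {
        int phase;
        while ((phase = turn.getPhase()) < n) {
            turn.awaitAdvance(phase); // block until the phase advances
        }
        writeChunk(n, data);          // it is now this chunk's turn
        turn.arrive();                // advance the phase, releasing the writer of chunk n+1
    }

    private void writeChunk(int n, byte[] data) {
        // placeholder for the existing output-file code
    }
}

This replaces the fixed 10 ms sleep with a blocking wait that wakes the next writer as soon as the previous one has arrived.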
What you are coding has already been provided in Java 8 with the Streams API. I have recently written a post about exactly this subject, which explains how to use the Streams API to parallelize any I/O-based source. You may try that avenue as well.
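As a rough illustration of that shape (not the code from the post), assuming the chunks can be exposed as lines of text; forEachOrdered keeps the output in input order while the map step runs on the common fork-join pool:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

public class OrderedParallelPipeline {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("input.dat"));
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("output.dat")))) {
            in.lines()
              .parallel()                              // fan the heavy processing out
              .map(OrderedParallelPipeline::process)
              .forEachOrdered(out::println);           // results come back in input order
        }
    }

    static String process(String chunk) {
        return chunk.toUpperCase(); // placeholder for the intensive processing
    }
}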
You are reading and writing files.
There will be times when your application will be blocked on I/O and therefore will not be consuming CPU.
After the comments of Marko and Claudio I went back to testing:
- #Claudio: I disabled HT on the i7 and ran the multicore version again: about 70% busy per core and slightly better results (with sleep(0)): 431.099 seconds.
- #Marko: I ran the single-threaded version again on the i7 with HT enabled and observed that the java process was at 13% of the total, but 2 of the 8 hardware threads were busy at about 40% each instead of one. This is OS intervention; it seems the scheduler somehow spreads the job over two threads. Your observation was right, I hadn't noticed there was a second busy thread on the i7.
- Ran again on the E8300: the single-threaded version is 50% busy of the total, but both cores are loaded at about 50%. Again an OS decision. The multithreaded load is 90%.
- Single-threaded on the Q6600, the java process has 25% CPU, but in the graph view all the cores show some load (there are no other intensive processes), so the OS spreads the single thread across all cores somehow. Multithreaded is 90%.
#Steve C: the faster-per-thread i5 should have the same problem. The I/O operations on the i7 system should not be slower than on the i5 system, as it is fairly new.
The impression that all processors run at 100% except the i7 at 40% is now fading, but I would still like to get more out of the i7. Perhaps what Marko suggested will help. Right now I'm reasonably satisfied with the results, but if smaller times are required I will change the code and see the difference.

Why is a multithreaded program in Java slow and yet doesn't use much CPU time?

My Java program uses java.util.concurrent.Executor to run multiple threads. Each one starts a Runnable; in that Runnable it reads from a comma-delimited text file on the C: drive and loops through the lines to split and parse the text into floats, after which the data is stored into:
static Vector
static ConcurrentSkipListMap
My PC is a Windows 7 64-bit machine with an Intel Core i7 (six cores x 2 hardware threads) and 24 GB of RAM. I have noticed the program runs for 2 minutes to finish all 1,700 files, but the CPU usage is only around 10% to 15%, no matter how many threads I assign using:
Executor executor = Executors.newFixedThreadPool(50);
Executors.newFixedThreadPool(500) doesn't give better CPU usage or a shorter time to finish the tasks. There is no network traffic and everything is on the local C: drive. There is enough RAM for more threads to use, though I get an "OutOfMemoryError" when I increase the thread count to 1000.
How come more threads don't translate into more CPU usage and less processing time?
Edit: My hard drive is a 200 GB SSD.
Edit: Finally found where the problem was. Each thread writes its results to a log file shared by all threads; the more times I run the app, the larger the log file and the slower it gets, and since it's shared, this definitely slows down the process. After I stopped writing to the log file, it finishes all the tasks in 10 seconds!
The OutOfMemoryError is probably coming from Java's own limits on its memory usage. Try using some of the arguments here to increase the maximum memory.
For speed, Adam Bliss starts with a good suggestion. If this is the same file over and over, then I imagine having multiple threads try to read it at the same time could result in a lot of contention over locks on the file. More threads would mean more contention, which could result in even worse overall performance. So avoid that and simply load the file once if possible. Even if it's a large file, you have 24 GB of RAM: you can hold quite a large file, but you may need to increase the JVM's allowed memory to let the whole file be loaded.
If there are multiple files being used, then consider this fact: your disk can only read one file at a time. So having multiple threads trying to use the disk all at the same time probably won't be too effective if the threads aren't spending much time processing. Since you have so little CPU usage, it could be that the thread loads part of the file, then runs very quickly on the part that got buffered, and then spends a lot of time waiting for the rest of the file to load. If you're loading the file over and over, that could even still apply.
In short: Disk IO probably is your culprit. You need to work to reduce it so that the threads aren't contending for file content so much.
Edit:
After further consideration, it's more likely a synchronization issue. Threads are probably getting held up trying to add to the result list. If access is frequent, this will result in huge amounts of contention for the lock on that object. Consider having each thread save its results in a local list (such as an ArrayList, which is not thread-safe) and then copying all the values into the final, shared list in chunks to reduce contention.
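A minimal sketch of that local-buffer idea; the parsing is elided and sharedResults stands in for whatever thread-safe collection the program already uses:

import java.util.ArrayList;
import java.util.List;

class FileTask implements Runnable {
    private final String path;
    private final List<Float> sharedResults; // the existing thread-safe collection

    FileTask(String path, List<Float> sharedResults) {
        this.path = path;
        this.sharedResults = sharedResults;
    }

    @Override
    public void run() {
        List<Float> local = new ArrayList<>();
        // ... read 'path', split each line, parse the floats into 'local' (no locking here) ...
        sharedResults.addAll(local); // one bulk, synchronized add per file instead of per value
    }
}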
You're probably being limited by IO, not CPU.
Can you reduce the number of times you open the file to read it? Maybe open it once, read all the lines, keep them in memory, and then iterate on that.
Otherwise, you'll have to look at getting a faster hard drive. SSDs can be quite speedy.
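A small sketch of the read-once approach; the path and the parsing step are placeholders:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ReadOnce {
    public static void main(String[] args) throws IOException {
        // One disk pass; afterwards the worker threads iterate over the in-memory lines.
        List<String> lines = Files.readAllLines(Paths.get("C:/data/input.csv"));
        for (String line : lines) {
            String[] fields = line.split(",");
            // ... parse 'fields' into floats and hand them to the worker threads ...
        }
    }
}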
Is it possible that your threads are somehow given low priority on the system? In that case, increasing the number of threads wouldn't correspond to an increase in CPU usage, since the amount of CPU time allotted to your program may be throttled somewhere else.
Are there any configuration files/ initialization steps where something like this could possibly occur?

Methods of limiting emulated CPU speed

I'm writing a MOS 6502 processor emulator as part of a larger project I've undertaken in my spare time. The emulator is written in Java, and before you say it, I know it's not going to be as efficient or optimized as if it were written in C or assembly, but the goal is to make it run on various platforms, and it's pulling 2.5 MHz on a 1 GHz processor, which is pretty good for an interpreted emulator. My problem is quite the contrary: I need to limit the number of cycles to 1 MHz. I've looked around but haven't seen many strategies for doing this. I've tried a few things, including checking the time after a number of cycles and sleeping for the difference between the expected time and the actual time elapsed, but checking the time slows down the emulation by a factor of 8. Does anyone have better suggestions, or perhaps ways to optimize time polling in Java to reduce the slowdown?
The problem with using sleep() is that you generally only get a granularity of 1 ms, and the actual sleep you get isn't necessarily accurate even to the nearest 1 ms, as it depends on what the rest of the system is doing. A couple of suggestions to try (off the top of my head; I've not actually written a CPU emulator in Java):
- Stick to your idea, but check the time between a largish number of emulated instructions (execution is going to be a bit "lumpy" anyway, especially on a uniprocessor machine, because the OS can potentially take the CPU away from your thread for several milliseconds at a time).
- As you want to execute on the order of 1000 emulated instructions per millisecond, you could also try just hanging on to the CPU between "instructions": have your program periodically work out, by trial and error, how many runs through a loop it needs between instructions to "waste" enough CPU to make the timing work out at 1 million emulated instructions per second on average (you may want to see if setting your thread to low priority helps system performance in this case).
I would use System.nanoTime() in a busy wait as #pst suggested earlier.
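As a rough illustration of that idea (not anyone's actual emulator code): execute a batch of emulated cycles, then busy-wait on System.nanoTime() until the wall clock catches up with the 1 MHz budget. The batch size and the executeNextInstruction placeholder are assumptions:

public class CycleThrottle {
    private static final long NANOS_PER_CYCLE = 1_000; // 1 MHz -> 1000 ns per cycle
    private static final long BATCH_CYCLES = 10_000;   // check the clock only every 10k cycles

    public void run() {
        long start = System.nanoTime();
        long executedCycles = 0;
        while (true) {
            for (long i = 0; i < BATCH_CYCLES; i++) {
                executedCycles += executeNextInstruction(); // returns cycles consumed
            }
            long targetNanos = executedCycles * NANOS_PER_CYCLE;
            while (System.nanoTime() - start < targetNanos) {
                // busy wait until real time catches up with emulated time
            }
        }
    }

    private int executeNextInstruction() {
        return 2; // placeholder: a real 6502 step returns the opcode's cycle count
    }
}

Checking the clock once per batch rather than once per instruction is what keeps the timing overhead from dominating, as the question observed.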
You can speed up the emulation by generating byte code. Most instructions should translate quite well and you can add a busy wait call so each instruction takes the amount of time the original instruction would have done. You have an option to increase the delay so you can watch each instruction being executed.
To make it really cool you could generate 6502 assembly code as text with matching line numbers in the byte code. This would allow you to use the debugger to step through the code, breakpoint it and see what the application is doing. ;)
A simple way to emulate the memory is to use direct ByteBuffer or native memory with the Unsafe class to access it. This will give you a block of memory you can access as any data type in any order.
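A brief sketch of the direct-ByteBuffer variant of that idea; the class and method names are assumptions:

import java.nio.ByteBuffer;

class Memory6502 {
    private final ByteBuffer ram = ByteBuffer.allocateDirect(0x10000); // 64 KB address space

    int readByte(int address) {
        return ram.get(address & 0xFFFF) & 0xFF;
    }

    int readWord(int address) {
        // the 6502 is little-endian: low byte first
        return readByte(address) | (readByte(address + 1) << 8);
    }

    void writeByte(int address, int value) {
        ram.put(address & 0xFFFF, (byte) (value & 0xFF));
    }
}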
You might be interested in examining the Java Apple Computer Emulator (JACE), which incorporates 6502 emulation. It uses Thread.sleep() in its TimedDevice class.
Have you looked into creating a Timer object that fires at the cycle length you need? You could have the timer itself initiate the next loop.
Here is the documentation for the Java 6 version:
http://download.oracle.com/javase/6/docs/api/java/util/Timer.html
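A small sketch of what such a Timer-driven main loop might look like, assuming a 1 ms tick and a step() placeholder for one emulated instruction:

import java.util.Timer;
import java.util.TimerTask;

class TimerDrivenEmulator {
    private static final int CYCLES_PER_TICK = 1_000; // roughly 1 MHz at a 1 ms tick

    void start() {
        Timer timer = new Timer("cpu-clock", true);
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                for (int i = 0; i < CYCLES_PER_TICK; i++) {
                    step(); // placeholder for one emulated instruction/cycle
                }
            }
        }, 0, 1); // period of 1 ms
    }

    void step() {
        // the emulator's fetch-decode-execute step would go here
    }
}

The 1 ms granularity caveat mentioned earlier still applies, so each tick has to run a whole batch of cycles rather than a single one.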
