Profile a Java Application Running on Google Dataflow

Do you have any idea how to profile a Java application running on a Dataflow worker?
Do you know of any tools that would allow me to discover memory leaks in my application?

For time profiling, you can try the instructions described in issue 72, but there may be difficulties with workers being torn down or auto-scaled away before you can get the profiles off the worker. Unfortunately, it doesn't provide memory profiling, so it won't help with memory leaks.
You can also run with the DirectPipelineRunner, which will execute the pipeline locally on your machine. This lets you profile the code in your pipeline without needing to deal with Dataflow workers. Depending on the scale of the pipeline, you will likely need to reduce the input to something that can be handled on one machine.
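For example, a minimal sketch of forcing local execution (package and class names assume the pre-Beam Dataflow Java SDK 1.x; in Apache Beam the local runner is called DirectRunner instead):
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;

public class LocalProfilingMain {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        // Run everything in this JVM so a profiler (e.g. VisualVM) can attach to it.
        options.setRunner(DirectPipelineRunner.class);
        Pipeline pipeline = Pipeline.create(options);
        // ... build the same pipeline here, pointed at a small sample of the input ...
        pipeline.run();
    }
}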
It may also be helpful to distinguish between the code that runs on the worker -- e.g., the code within a single DoFn -- and the structure of the pipeline and the data. For instance, out-of-memory problems can be caused by a GroupByKey that has too many values associated with a single key, followed by code that reads all of those values into a list.
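As a hedged illustration of that last point (the DoFn signature follows the pre-Beam Dataflow Java SDK, and the names are made up): after a GroupByKey, iterate over the grouped values lazily instead of copying them into a list, so that a hot key does not blow up the worker's heap.
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.values.KV;

class SumValuesPerKeyFn extends DoFn<KV<String, Iterable<Long>>, KV<String, Long>> {
    @Override
    public void processElement(ProcessContext c) {
        long sum = 0;
        // Streaming over the Iterable keeps memory bounded even for keys with
        // millions of values; copying it into an ArrayList would pull every
        // value for the key onto the heap at once.
        for (Long value : c.element().getValue()) {
            sum += value;
        }
        c.output(KV.of(c.element().getKey(), sum));
    }
}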

Related

OOM crash - Hadoop Filesystem Cache Growth

I have a Java program that submits thousands of Hadoop DistCp jobs, and this is causing an OOM error on the client Java process side (not on the cluster).
The DistCp jobs are submitted internally within the same Java process. I do the following:
import org.apache.hadoop.tools.{DistCp, DistCpOptions}
val distCp = new DistCp(<config>, new DistCpOptions(<foo>, <bar>))
distCp.createAndSubmitJob()
In other words, the Java program is not spawning external OS processes to run the distcp CLI tool separately.
The problem is that the Java program, after a few thousand DistCp job submissions, eventually crashes with an OutOfMemory error. This can take some days.
This is a nuisance because restarting the Java program is not a reasonable solution for us in the medium term.
After analysing a few heap dumps, the issue became clear: almost all of the heap is being used to hold entries in the Map<Key, FileSystem> of the FileSystem.Cache of org.apache.hadoop.fs.FileSystem. This cache is global.
A few debugging sessions later, I found that each instantiation of new DistCp(<config>, new DistCpOptions(<foo>, <bar>)) eventually leads to a call to FileSystem.newInstance(<uri>, <conf>) via the org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter of the library org.apache.hadoop:hadoop-yarn-common.
The problem with these FileSystem.newInstance(<uri>, <conf>) calls is that each one creates a unique entry in the cache, and there doesn't seem to be any mechanism to clear these entries. Setting the configuration flag fs.hdfs.impl.disable.cache to true would have been an option, but FileSystem.newInstance bypasses the check of that flag, so it doesn't work.
Alternatively, I could call FileSystem.closeAll(), but that would close all file systems of the program, including legitimate uses in other parts of the application.
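To make the distinction concrete, here is a minimal sketch of the cached versus uncached lookup paths, using the standard org.apache.hadoop.fs.FileSystem API (the URI is a placeholder):
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsCacheDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        URI uri = URI.create("hdfs://namenode:8020/");  // placeholder

        // get() returns the shared, cached instance -- repeated calls reuse one entry.
        FileSystem shared = FileSystem.get(uri, conf);

        // newInstance() adds a brand-new entry to FileSystem.Cache on every call;
        // unless each instance is closed explicitly, the cache only ever grows.
        FileSystem fresh = FileSystem.newInstance(uri, conf);
        fresh.close();  // close() removes this instance's entry from the cache
    }
}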
In essence:
My java program launches, over time, thousands of distCp jobs within the same java process.
Each distCp job seems to add one entry to the global FileSystem.Cache.
The cache grows, and eventually the program crashes with OOM.
Does anyone see a solution to this?
I am surprised not to have found similar issues on the internet; I would have thought that a Java process launching thousands of DistCp jobs would be quite a common usage.

JAVA Lightest Thread Framework

I have a project where I have to send emails using the Amazon SES REST API. Amazon limits the number of concurrent connections per account; in my case Amazon allows me to open 50 connections at the same time, which means I can send 50 emails/sec. To achieve this, I am currently using Java executor threads, where I control the rate to be 50/sec. I have also implemented this with the Hibernate framework because I need to execute some SQL queries before sending the emails.
This Java program runs continuously in the background (it's a jar file). It takes around 512 MB of RAM, so my question is: can I use some other framework or a better threading approach to make it lighter? The SQL query I execute is only a select; update/delete/create queries are not used.
I am not good at Java, so maybe this sounds stupid.
I guess the smallest possible framework to use would be plain JDBC.
This would limit your libraries to those in the JRE plus the DB driver and maybe libraries for AWS / email. Depending on what else you need, selecting a compact profile might be worth investigating.
Also check your memory settings:
If you set -Xms512m it's really not surprising your app uses 512m, is it?
Edit due to rephrased question
At your level of parallelism, most of your memory is consumed by objects, not by threads (well, threads are objects, but small ones). Threads are fine the way they are in Java; you can run hundreds of them without them consuming 500 MB of heap or more, as you claim.
So the issue of 50 threads consuming 512 MB of your memory is more likely rooted in your code and your objects, not (only) in your threads.
In order to reduce the memory footprint, try the following:
Remove Hibernate. As you say, you only run a simple select, so you don't need the overhead and the additional libraries; plain JDBC is enough (see the sketch after this list).
Take a memory dump of your running app and analyse it (MAT, the Eclipse Memory Analyzer tool, comes to mind).
Check other objects and how you use them. When you say "sending emails" -- how large are your emails? Might there be duplicate buffers due to a bad coding choice? Share your code for how you do it, then we can have a look.
Try running without any memory options and see how the program runs on the defaults.
Add garbage collector logging output and check it.
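As a rough illustration of the plain-JDBC point above (the table, column, credentials and JDBC URL are made up), the select could be as simple as:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class PendingEmailDao {
    // Hypothetical connection details -- replace with your own.
    private static final String JDBC_URL = "jdbc:mysql://localhost:3306/mailer";

    public List<String> loadPendingRecipients() throws Exception {
        List<String> recipients = new ArrayList<>();
        // try-with-resources closes the connection, statement and result set.
        try (Connection con = DriverManager.getConnection(JDBC_URL, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT email FROM pending_emails WHERE status = ?")) {
            ps.setString(1, "PENDING");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    recipients.add(rs.getString("email"));
                }
            }
        }
        return recipients;
    }
}
For the GC output, flags such as -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log (pre-Java 9) will write collection details to a log file you can inspect.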

How to handle varying input batch size in mapreduce jobs

Issue -
I am running a series of mapreduce jobs wrapped in an Oozie workflow. Input data consists of a bunch of text files, most of which are fairly small (KBs), but every now and then I get files over 1-2 MB which cause my jobs to fail. I am seeing two reasons why jobs are failing: one, inside one or two MR jobs the file is parsed into a graph in memory, and for a bigger file the MR job runs out of memory; and two, the jobs are timing out.
Questions -
1) I believe I can just disable the timeout by setting mapreduce.task.timeout to 0, but I am not able to find any documentation that mentions the risks of doing this.
2) For the OOM error, what are the various configs I can mess around with? Any links to potential solutions and risks would be very helpful.
3) I see a lot of "container preempted by scheduler" messages before I finally get the OOM. Is this a separate issue or related? How do I get around it?
Thanks in advance.
About the timeout: no need to set it to "unlimited", a reasonably large value can do (e.g. in our Prod cluster it is set to 300000).
About requiring a non-standard RAM quota in Oozie: the properties you are looking for are probably mapreduce.map.memory.mb for global YARN container quota, oozie.launcher.mapreduce.map.java.opts to instruct the JVM about that quota (i.e. fail gracefully with OOM exception instead of crashing the container with no useful error message), and the .reduce. counterparts.
See also that post for the (very poorly documented) oozie.launcher. prefix, in case you want to set properties for a non-MR action -- e.g. a Shell action, or a Java program that indirectly spawns a series of Map and Reduce steps.
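For illustration, a hedged sketch of how those properties might appear in a map-reduce action of an Oozie workflow (the action name, the numbers, and the ok/error targets are placeholders):
<action name="my-mr-step">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- YARN container quota for each map task -->
            <property>
                <name>mapreduce.map.memory.mb</name>
                <value>4096</value>
            </property>
            <!-- JVM heap kept below the container quota so an OOM fails with a useful message -->
            <property>
                <name>mapreduce.map.java.opts</name>
                <value>-Xmx3276m</value>
            </property>
            <!-- generous but finite task timeout, in milliseconds -->
            <property>
                <name>mapreduce.task.timeout</name>
                <value>300000</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="next-step"/>
    <error to="fail"/>
</action>
For a non-MR action launched by Oozie, the same property names would carry the oozie.launcher. prefix, as described above.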

Taking dump of hashmap from memory, for a java process

I have a Java process which has been running for over a week. In the process, I have been processing some data and storing intermediate results in a HashMap in memory.
Now I need to stop the process due to some bugs in the code. But if I kill the process, I lose the data in the HashMap and will have to reprocess it the next time I run the code.
Is there a way I can take a dump of the HashMap present in memory?
You can trigger a memory dump which you can read with various tools such as profilers. This is very difficult to read, and I have never heard of someone using this to restart a program.
Restarting a program is not something you can do as an afterthought; it needs to be designed into your application and thoroughly tested. One way people get around having to worry about this issue is to use a database, because databases are professionally designed and tested to be restarted without losing data.
You can use JConsole, which is included with the standard JDK, or VisualVM, to attach to an already-running process and trigger a heap dump.
However, as Peter said, actually reading this dump is not going to be fun unless you can find a tool to assist you.
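If you'd rather grab the dump from the command line (assuming a HotSpot/OpenJDK JVM; <pid> is the target process id and the file path is a placeholder), either of these works:
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
jcmd <pid> GC.heap_dump /tmp/heap.hprof
The resulting HPROF file can then be opened in a tool such as the Eclipse Memory Analyzer.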

How to detect Out Of Memory condition?

I have an application running on WebSphere Application Server 6.0, and it crashes nearly every day because of an Out-Of-Memory error. From verbose GC it is certain that there are memory leaks (many of them).
Unfortunately the application is provided by an external vendor, and getting things fixed is a slow and painful process. As part of that process I need to gather the logs and heap dumps each time the OOM occurs.
Now I'm looking for some way to automate this. The fundamental problem is how to detect the OOM condition. One way would be to create a shell script which periodically searches for new heap dumps; this approach seems kind of dirty to me. Another approach might be to leverage JMX somehow, but I have little to no experience in this area and don't have much idea how to do it.
Or does WAS have some kind of trigger/hook for this? Thank you very much for any advice!
You can pass the following arguments to the JVM on startup and a heap dump will be automatically generated on an OutOfMemoryError. The second argument lets you specify the path for the heap dump file. By using this at least you could check for the existence of a specific file to see if a heap dump has occurred.
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=<value>
I see two options if you want heap dumping automated but Mark's solution of a heap dump on OOM isn't satisfactory.
You can use the MemoryMXBean to detect high memory pressure, and then programmatically create a heap dump if the usage (or usage delta) seems high.
You can periodically get memory usage info and generate heap dumps with a cron'd shell script using jmap (works both locally and remotely).
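A rough sketch of the first option (assumes a HotSpot/OpenJDK JVM and Java 8+; the 80% threshold and the dump path are arbitrary choices, and the IBM JVM shipped with WAS exposes a different dump API):
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import javax.management.NotificationEmitter;
import com.sun.management.HotSpotDiagnosticMXBean;

public class OomWatcher {
    public static void install(final String dumpFile) {
        // Arm a usage threshold (80% of max) on every heap pool that supports it.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            long max = pool.getUsage().getMax();
            if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported() && max > 0) {
                pool.setUsageThreshold((long) (max * 0.8));
            }
        }
        // The MemoryMXBean emits a notification when an armed threshold is crossed.
        NotificationEmitter emitter = (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener((notification, handback) -> {
            if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(notification.getType())) {
                try {
                    // HotSpot-specific diagnostic bean; writes an HPROF heap dump.
                    HotSpotDiagnosticMXBean diag =
                            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
                    diag.dumpHeap(dumpFile, true);  // true = dump live objects only
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, null, null);
    }
}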
It would be nice if you could have a callback on OOM, but, uhm, that callback probably would just crash with an OOM error. :)
Have you looked at JConsole? It uses JMX to give you visibility into a variety of JVM metrics, including memory info. It would probably be worth monitoring your application with it to begin with, to get a feel for how and when the memory is consumed. You may find the memory is consumed uniformly over the day, or when using certain features.
Take a look at the detecting low memory section of the above link.
If you need you can then write a JMX client to watch the application automatically and trigger whatever actions required. JConsole will indicate which JMX methods you need to poll.
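If you do write a small JMX client, a minimal polling sketch could look like this (the service URL, host and port are placeholders; the target JVM must have remote JMX enabled, and WebSphere's own connectors may need different settings):
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class HeapPoller {
    public static void main(String[] args) throws Exception {
        // Placeholder address -- adjust host/port to the monitored JVM's JMX settings.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    connection, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            MemoryUsage heap = memory.getHeapMemoryUsage();
            System.out.println("Heap used: " + heap.getUsed() + " of max " + heap.getMax());
            // A real watcher would poll in a loop and trigger log/heap-dump
            // collection when usage stays close to the maximum.
        } finally {
            connector.close();
        }
    }
}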
An alternative to waiting until the application has crashed may be to script a controlled restart, say every night, if you're optimistic that it can survive for twelve hours.
Maybe WebSphere can even do that for you!?
You could add a listener class (a session-scoped or application-scoped attribute listener) that is called each time a new object is added in session/application scope.
In this listener you can check the total memory used by the app (and log it) as well as request a GC run (note that invoking it does not mean the GC will always run).
(The above covers the logging part and GC requests based on usage growth.)
For scheduled GC:
In addition, you can keep a timer task class that runs every few hours and requests a GC.
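A rough sketch of such a timer task (the interval and the logging are arbitrary; System.gc() is only a hint that the JVM may ignore):
import java.util.Timer;
import java.util.TimerTask;

public class GcRequestTask extends TimerTask {
    @Override
    public void run() {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        System.out.println("Heap in use: " + usedMb + " MB");
        // Only a request -- the JVM is free to ignore it.
        System.gc();
    }

    public static void main(String[] args) {
        // Non-daemon timer thread keeps running; fires immediately, then every 4 hours.
        new Timer("gc-request-timer").schedule(new GcRequestTask(), 0, 4 * 60 * 60 * 1000L);
    }
}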
Our experience with ITCAM has been less than stellar from the monitoring perspective. We dumped it in favor of CA Wily Introscope.
Have you had a look at the jvisualvm tool in the latest Java 6 JDKs?
It is great for inspecting running code.
I'd dispute that you need the heap dumps at the exact moment the OOM occurs. Periodically gathering the information over time should give a picture of what's going on.
As has been observed, various tools exist for analysing these problems. I have had success with ITCAM for WebSphere; as an IBMer I have ready access to that. We were very quickly able to identify the exact lines of code in our problem situation.
If there's any way you can get a tool of that nature then that's the way to go.
It should be possible to write a simple program to get the process list from the kernel and scan it to see if your WAS process is still running. On a Unix box you could probably whip up something in Perl in a few minutes (if you know Perl), not sure how difficult it would be under Windows. Run it as a scheduled task every five minutes or so, and if the process doesn't show up you could have it fork off another process that would deal with the heap dump and re-start WAS.
