How to check why job gets killed on Google Dataflow ( possible OOM )


I have a simple task. I have a bunch of files (~100 GB in total), and each line represents one entity. I have to send each entity to a JanusGraph server.
Job ID: 2018-07-07_05_10_46-8497016571919684639
After a while, I get an OOM; the logs say that the Java process gets killed.
In the Dataflow view, I can see the following logs:
Workflow failed. Causes: S01:TextIO.Read/Read+ParDo(Anonymous)+ParDo(JanusVertexConsumer) failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
In the Stackdriver view, I can see: https://www.dropbox.com/s/zvny7qwhl7hbwyw/Screenshot%202018-07-08%2010.05.33.png?dl=0
The logs say:
E Out of memory: Kill process 1180 (java) score 1100 or sacrifice child
E Killed process 1180 (java) total-vm:4838044kB, anon-rss:383132kB, file-rss:0kB
More here: https://pastebin.com/raw/MftBwUxs
How can I debug what's going on?

There is too little information to debug the issue right now, so I am providing general information about Dataflow.
The most intuitive way for me to find the logs is to go to Google Cloud Console -> Dataflow -> select the job name of interest -> upper right corner (errors + logs).
More detailed information about monitoring is described here (in beta phase).
Some basic clues to troubleshoot the pipeline, as well as the most common error messages, are described here.
If you are not able to fix the issue, please update the post with the error information.
UPDATE
Based on the deadline exceeded error and the information you shared, I think your job is "shuffle-bound", which leads to memory exhaustion. According to this guide:
Consider one of, or a combination of, the following courses of action:
Add more workers. Try setting --numWorkers with a higher value when you run your pipeline.
Increase the size of the attached disk for workers. Try setting --diskSizeGb with a higher value when you run your pipeline.
Use an SSD-backed persistent disk. Try setting --workerDiskType="compute.googleapis.com/projects//zones//diskTypes/pd-ssd" when you run your pipeline.
UPDATE 2
For specific OOM errors you can use:
--dumpHeapOnOOM will cause a heap dump to be saved locally when the JVM crashes due to OOM.
--saveHeapDumpsToGcsPath=gs://<path_to_a_gcs_bucket> will cause the heap dump to be uploaded to the configured GCS path on next worker restart. This makes it easy to download the dump file for inspection. Make sure that the account the job is running under has write permissions on the bucket.
Please take into account that heap dump support has some overhead cost and dumps can be very large. These flags should only be used for debugging purposes and always disabled for production jobs.
You can find other options among the DataflowPipelineDebugOptions methods.
UPDATE 3
I did not find public documentation about this, but I tested and confirmed that Dataflow scales the JVM heap size with the machine type (workerMachineType), so choosing a larger machine type could also fix your issue. I am with GCP Support, so I filed two documentation requests (one for a description page and another for a Dataflow troubleshooting page) to have this information added.
On the other hand, there is this related feature request which you might find useful. Star it to make it more visible.

Related

Profile a Java Application Running on Google Dataflow

Do you have any idea how to profile a Java application running on a Dataflow worker?
Do you know of any tools that would allow me to discover memory leaks in my application?

For time profiling, you can try the instructions described in this issue 72, but there may be difficulties with workers being torn down or autoscaled away before you can get the profiles off the worker. Unfortunately, it doesn't provide memory profiling, so it won't help with memory leaks.
You can also run with the DirectPipelineRunner, which will execute the pipeline locally on your machine. This will allow you to profile the code in your pipeline without needing to deal with Dataflow workers. Depending on the scale of the pipeline you'll likely need to adjust the input size to be something that can be handled on one machine.
It may also be helpful to distinguish between the code that runs on the worker (e.g., the code within a single DoFn) and the structure of the pipeline and the data. For instance, out-of-memory problems can be caused by having a GroupByKey with too many values associated with a single key and reading all of them into a list.
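As a rough illustration of that last point, here is a plain-Java sketch (no Beam or Dataflow SDK involved; the class and method names are made up for this example). It contrasts copying all values of one key into a list with folding them one at a time. In a real pipeline the `Iterable` would be the grouped values handed to your DoFn, which is an assumption about how your code is structured.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Illustrative only: two ways of consuming the values grouped under a single key. */
final class GroupedValuesSketch {

    /** Risky pattern: copies every value into a list before doing any work. */
    static long sumMaterialized(Iterable<Long> values) {
        List<Long> all = new ArrayList<>();
        for (Long v : values) {
            all.add(v); // memory grows with the number of values for this key
        }
        long sum = 0;
        for (Long v : all) {
            sum += v;
        }
        return sum;
    }

    /** Safer pattern: folds values one at a time, keeping only the running result. */
    static long sumStreaming(Iterable<Long> values) {
        long sum = 0;
        for (Long v : values) {
            sum += v;
        }
        return sum;
    }

    /** Lazy Iterable producing 1..n without storing them; stands in for a "hot" key. */
    static Iterable<Long> range(long n) {
        return () -> new Iterator<Long>() {
            private long next = 1;

            @Override
            public boolean hasNext() {
                return next <= n;
            }

            @Override
            public Long next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return next++;
            }
        };
    }

    public static void main(String[] args) {
        // Streams over a million values without ever holding them all in memory.
        System.out.println(sumStreaming(range(1_000_000)));
    }
}
```

Both methods return the same result; only the first one needs heap proportional to the number of values under the key.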

How to handle varying input batch size in mapreduce jobs

Issue -
I am running a series of MapReduce jobs wrapped in an Oozie workflow. The input data consists of a bunch of text files, most of which are fairly small (KBs), but every now and then I get files over 1-2 MB, which cause my jobs to fail. I see two reasons why the jobs are failing: first, inside one or two MR jobs the file is parsed into a graph in memory, and for a bigger file the job runs out of memory; second, the jobs are timing out.
Questions -
1) I believe I can just disable the timeout by setting mapreduce.task.timeout to 0, but I am not able to find any documentation that mentions the risks of doing this.
2) For the OOM error, what are the various configs I can adjust? Any links on potential solutions and risks would be very helpful.
3) I see a lot of "container preempted by scheduler" messages before I finally get the OOM. Is this a separate issue, or is it related? How do I get around it?
Thanks in advance.

About the timeout: there is no need to set it to "unlimited"; a reasonably large value will do (e.g. in our Prod cluster it is set to 300000, i.e. 5 minutes, since the value is in milliseconds).
About requiring a non-standard RAM quota in Oozie: the properties you are looking for are probably mapreduce.map.memory.mb for global YARN container quota, oozie.launcher.mapreduce.map.java.opts to instruct the JVM about that quota (i.e. fail gracefully with OOM exception instead of crashing the container with no useful error message), and the .reduce. counterparts.
See also that post for the (very poorly documented) oozie.launcher. prefix in case you want to set properties for a non-MR Action, e.g. a Shell, or a Java program that indirectly spawns a series of Map and Reduce steps.

Sonar Qube UI long to update

We have recently installed a SonarQube instance to check our source code.
The codebase is pretty large, with more than 1 million lines of code.
We run sonar-runner automatically via Jenkins.
I understand that the UI is updated only after sonar-runner stores its results in the database.
But it sometimes seems to take ages: up to an hour after sonar-runner succeeds before we can see anything in the UI.
So I have a couple of questions, all related:
Is there a way to see analyses that are still 'in the pipes'?
Where can I see whether the conversion from database to the UI has failed?
Is there a way to speed up the process?
To summarize: how can I reduce the latency between sonar-runner finishing and the SonarQube UI showing the results?
I went through all the docs but couldn't find much about this yet.
Thanks for the info,

Is there a way to see analyses that are still 'in the pipes'?
Yes, log in as admin and go to Settings > System > Analysis report
Where can I see whether the conversion from database to the UI has failed?
Have a look at the content of the "Current Activity" and "Past Reports" tabs.
Is there a way to speed the process?
This is a very broad question with tons of different answers. It all depends on where the time is spent. You may be CPU bound, memory bound, database bound, ...
Having a look at the queue of report processing might give you a hint.

1 MLOC is not that large. I run SonarQube through sonar-runner + Jenkins, and when Jenkins indicates in the log that the analysis has been successful, I am able to see it in SonarQube's dashboard. So I would say your 'latency' is not normal.
Could you please describe your environment in more detail? Physical/virtual? OS? DB? SQ release? etc.

After loads of searching around, I realized that for some reason SonarQube didn't correctly handle the fact that I was running several sonar-runner analyses right after each other.
After the 'Store results in database' message, there are a couple of seconds during which starting a new analysis will cause the SonarQube GUI to not see the analysis.
Running the analyses with a bit more time between them reduced the latency by a great deal.
Since Seb gave a lot of insight into SonarQube itself, I will accept his answer. It is also probably a better fit for a general audience and less specific to my situation.

How to speed up frequent writing

We created a Java agent that runs a check on our application suite to see if, for instance, the parent/child structure is still correct. To do so, it needs to check 8000+ documents across several applications.
The check itself goes very fast. We use a navigator to retrieve data from views and only read data from those entries. The problem is in our logging mechanism. Whenever we report a log entry with level SEVERE (i.e. a really big issue), the backend document is updated immediately. This is because we don't want to lose any info about these issues.
In our test runs we see that everything runs smoothly, but as soon as we 'create' a lot of severe issues, the performance drops enormously because of all the writes. I would like to know if there are any Notes developers facing the same challenge. How could we speed up the writing without losing any data?
-- added more info after comment from simon --
It's a scheduled agent which runs every night to check for inconsistencies. The goal is of course to find inconsistencies, fix the cause, and eventually have no inconsistencies reported at all.

Its a scheduled agent which runs every night to check for inconsistencies.
OK. So there are a number of factors to take into account.
Are there any embedded Jars? When an agent has embedded jars the server has to detach them from the agent to the disk before they can run the code. This is done every time the agent executes. This can be a performance hit. If your agent spawns a number of times, remove the embedded jars and put them into the lib\ext folder on the server instead (requires server restart).
You mention it runs at night. By default, general housekeeping processes run at night. Check the notes.ini for scheduled server tasks and assess what impact they have on the server/agent when running. For example:
ServerTasksAt1=Catalog,Design
ServerTasksAt2=Updall
ServerTasksAt5=Statlog
In this case, if the agent runs between 2 and 5, then Updall could have an impact on it. Also check program documents for scheduled executions.
In what way are you writing? If you are creating a document for each incident and the document contents are small, then the write time should be reasonable. What is liable to hurt performance is one of the following.
If you are multi-threading those writes.
Pulling a log document, appending a line, saving, and then repeating.
One last thing to think about. If you are getting 3000 errors, there must be a point where X errors means that there is no point in continuing, and the agent should instead alert the admin via SNMP/email/etc. It might be worth coding that in as well.
Other than that, you should probably post some sample code for the write.

Hmm, difficult or general question.
As far as I understand, you update the documents in the view you are walking through. I would set view.AutoUpdate to false. This ensures that the view is not reloaded while you are running your code. This should speed up your code.
This is an extract from the Designer help:
Avoid automatically updating the parent view by explicitly setting
AutoUpdate to False. Automatic updates degrade performance and may
invalidate entries in the navigator ("Entry not found in index"). You
can update the view as needed with Refresh.
Hope that helps.
If that does not help you might want to post a code fragment or more details.

Create separate documents for each error rather than one huge document.
or
Write to a text file directly rather than to a database, and then pull it into a document if necessary. This should speed things up considerably.
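A minimal sketch of the text-file option, using only the JDK. The class name `SevereLogFile` and the one-line-per-entry format are assumptions made for this example, not anything from the original agent. Each entry is flushed right away, which hands it to the operating system but does not by itself force it onto the physical disk.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: appends SEVERE entries to a text file, one line per entry. */
final class SevereLogFile implements AutoCloseable {
    private final BufferedWriter writer;

    SevereLogFile(Path file) {
        try {
            // CREATE + APPEND: keep entries from earlier runs, add new ones at the end.
            writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes one entry and flushes immediately so it is not left in the Java-side buffer. */
    void severe(String message) {
        try {
            writer.write(message);
            writer.newLine();
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            writer.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The agent would open one `SevereLogFile` at the start of the run, call `severe(...)` for each issue, and close it at the end; importing the file into a document can then happen once, after the check has finished.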

How to detect Out Of Memory condition?

I have an application running on WebSphere Application Server 6.0, and it crashes nearly every day because of Out-Of-Memory. From the verbose GC output it is certain that there are memory leaks (many of them).
Unfortunately, the application is provided by an external vendor, and getting things fixed is a slow and painful process. As part of the process, I need to gather the logs and heap dumps each time the OOM occurs.
Now I'm looking for a way to automate this. The fundamental problem is how to detect the OOM condition. One way would be to create a shell script that periodically searches for new heap dumps. This approach seems kind of dirty to me. Another approach might be to leverage JMX somehow, but I have little or no experience in this area and not much idea how to do it.
Or does WAS have some kind of trigger/hook for this? Thank you very much for any advice!

You can pass the following arguments to the JVM on startup and a heap dump will be automatically generated on an OutOfMemoryError. The second argument lets you specify the path for the heap dump file. By using this at least you could check for the existence of a specific file to see if a heap dump has occurred.
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=<value>
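The "check for the existence of a file" part could be sketched in Java roughly as follows. The class and method names are hypothetical, and the file suffix is passed in because it depends on the JVM: HotSpot writes `.hprof` files, while other JVMs may use a different format and naming scheme.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: finds heap dump files that appeared after a given instant. */
final class HeapDumpWatcher {

    /** Returns files directly inside dir whose name ends with suffix and that were modified after since. */
    static List<Path> dumpsNewerThan(Path dir, String suffix, Instant since) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(p -> p.getFileName().toString().endsWith(suffix))
                    .filter(p -> lastModified(p).isAfter(since))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static Instant lastModified(Path p) {
        try {
            return Files.getLastModifiedTime(p).toInstant();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Usage: java HeapDumpWatcher <dump-directory>; lists dumps from the last hour.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(dumpsNewerThan(dir, ".hprof", Instant.now().minusSeconds(3600)));
    }
}
```

Run from a scheduled job, a non-empty result means a dump was written since the last check, and the logs can be collected at that point.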

I see two options if you want heap dumping automated but Mark's solution with a heap dump on OOM isn't satisfactory.
You can use the MemoryMXBean to detect high memory pressure, and then programmatically create a heap dump if the usage (or usage delta) seems high.
You can periodically get memory usage info and generate heap dumps with a cron'd shell script using jmap (works both locally and remote).
It would be nice if you could have a callback on OOM, but, uhm, that callback probably would just crash with an OOM error. :)
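A minimal sketch of the first option, assuming the code runs inside the JVM being watched and that the JVM is HotSpot-based: `com.sun.management.HotSpotDiagnosticMXBean` is not available on every JVM, so on a different vendor's JVM the dump call would have to be replaced. The class name, method names, and threshold are made up for this example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper: dumps the heap when usage crosses a chosen fraction of the maximum. */
final class HeapPressureMonitor {

    /** Fraction of the maximum heap currently in use, or -1 if the maximum is undefined. */
    static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the JVM reports no maximum
        return max <= 0 ? -1.0 : (double) heap.getUsed() / max;
    }

    /**
     * Writes a heap dump to outputFile if usage is at or above threshold.
     * The file must not exist yet; recent JDKs also expect a ".hprof" extension.
     */
    static boolean dumpIfAbove(double threshold, String outputFile) throws IOException {
        if (heapUsedFraction() < threshold) {
            return false;
        }
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diagnostics.dumpHeap(outputFile, true); // true = only live (reachable) objects
        return true;
    }

    public static void main(String[] args) {
        System.out.println("heap used fraction: " + heapUsedFraction());
    }
}
```

Calling `dumpIfAbove(0.9, ...)` from a periodic task gives you a dump shortly before the crash rather than after it, at the cost of a pause while the dump is written.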

Have you looked at JConsole? It uses JMX to give you visibility of a variety of JVM metrics, including memory info. It would probably be worth monitoring your application using this to begin with, to get a feel for how/when the memory is consumed. You may find the memory is consumed uniformly over the day, or when using certain features.
Take a look at the detecting low memory section of the above link.
If you need to, you can then write a JMX client to watch the application automatically and trigger whatever actions are required. JConsole will indicate which JMX methods you need to poll.
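As a sketch of the low-memory detection idea, the standard `java.lang.management` API lets you set a usage threshold on a memory pool and receive a notification when it is crossed. The example below runs inside the monitored JVM for simplicity; the class name and callback are hypothetical. A remote client would obtain the same beans over a JMX connection instead (for instance via `ManagementFactory.newPlatformMXBeanProxy`), which is not shown here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

/** Hypothetical helper: runs a callback when a heap pool crosses a usage threshold. */
final class LowMemoryWatcher {

    /**
     * Sets a usage threshold of fraction * max on every heap pool that supports one.
     * Returns the number of pools armed (may be 0, depending on the JVM and collector).
     */
    static int arm(double fraction, Runnable onLowMemory) {
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        NotificationListener listener = (notification, handback) -> {
            if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(notification.getType())) {
                onLowMemory.run();
            }
        };
        // The platform MemoryMXBean is documented to be a NotificationEmitter.
        NotificationEmitter emitter = (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener(listener, null, null);

        int armed = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null || usage.getMax() <= 0) {
                continue; // pool invalid or without a defined maximum
            }
            if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported()) {
                pool.setUsageThreshold((long) (usage.getMax() * fraction));
                armed++;
            }
        }
        return armed;
    }

    public static void main(String[] args) {
        int armed = arm(0.8, () -> System.err.println("heap pool above 80% of max"));
        System.out.println("pools armed: " + armed);
    }
}
```

The callback is where log collection or a heap dump request would go; keep it small, since it runs when memory is already tight.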

An alternative to waiting until the application has crashed may be to script a controlled restart, say every night, if you're optimistic that it can survive for twelve hours.
Maybe WebSphere can even do that for you?

You could add a listener class (a session-scoped or application-scoped attribute listener) that is called each time a new object is added to the session or application scope.
In it, you can check the total memory used by the app (and log it) as well as request a GC run (note that requesting it does not guarantee that GC will actually run).
(The above covers the logging part and GC based on usage growth.)
For scheduled GC:
In addition, you can keep a timer task class that runs every few hours and requests a GC.

Our experience with ITCAM has been less than stellar from the monitoring perspective. We dumped it in favor of CA Wily Introscope.

Have you had a look at the jvisualvm tool in the latest Java 6 JDKs?
It is great for inspecting running code.

I'd dispute that you need the heap dumps at the moment the OOM occurs. Periodically gathering the information over time should give a picture of what's going on.
As has been observed, various tools exist for analysing these problems. I have had success with ITCAM for WebSphere; as an IBMer I have ready access to it. We were very quickly able to identify the exact lines of code in our problem situation.
If there's any way you can get a tool of that nature, then that's the way to go.

It should be possible to write a simple program to get the process list from the kernel and scan it to see if your WAS process is still running. On a Unix box you could probably whip up something in Perl in a few minutes (if you know Perl), not sure how difficult it would be under Windows. Run it as a scheduled task every five minutes or so, and if the process doesn't show up you could have it fork off another process that would deal with the heap dump and re-start WAS.
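If you would rather stay in Java than Perl, the process scan could look roughly like this on Java 9 or later, using the standard `ProcessHandle` API. This swaps in a different tool than the answer suggests, the class name is hypothetical, and matching on a fragment of the executable path is an assumption: on some systems the command is not visible for processes owned by other users.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: lists live processes whose executable path contains a fragment. */
final class ProcessCheck {

    static List<Long> pidsMatching(String commandFragment) {
        return ProcessHandle.allProcesses()
                .filter(ProcessHandle::isAlive)
                // info().command() is empty when the OS does not expose the executable path
                .filter(p -> p.info().command()
                        .map(c -> c.contains(commandFragment))
                        .orElse(false))
                .map(ProcessHandle::pid)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String fragment = args.length > 0 ? args[0] : "java";
        List<Long> pids = pidsMatching(fragment);
        System.out.println(pids.isEmpty() ? "not running" : "running, pids: " + pids);
    }
}
```

An empty result from a scheduled run would be the trigger for collecting the heap dump and restarting the server.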
