Update: we found that the situation below occurs when we encounter
"com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard"
https://github.com/awslabs/amazon-kinesis-client/issues/108
We use an S3 directory (and DynamoDB) to store checkpoints, but even when this exception occurs, blocks should not get stuck; they should continue to be evicted gracefully from memory. The Kinesis client library race condition is obviously a problem in itself...
I ran into a problem (I have also submitted a Spark JIRA ticket on the subject, linked below) where stream blocks (some, not all) persist in memory, eventually leading to OOM.
The app is a standard Kinesis/Spark Streaming app written in Java (Spark version 2.0.2).
The run initially starts well, and the automated ContextCleaner does its job nicely, recycling data from completed streaming jobs (verified by looking at the Storage tab in the admin UI).
Then, after some time, some blocks get stuck in memory, such as this block on one of the executor nodes:
input-0-1485362233945 1 ip-<>:34245 Memory Serialized 1442.5 KB
As more time passes, more blocks get stuck and are never freed.
It is my understanding that the ContextCleaner will trigger removal of older blocks as well as trigger a System.gc() at the configured interval, which is 30 minutes by default.
Thanks for any feedback on this, as this issue prevents 100% uptime of the application.
In case it is of value: we use StorageLevel.MEMORY_AND_DISK_SER().
Spark Jira
You can try to perform the eviction manually:
get KinesisInputDStream.generatedRDDs (using reflection)
for each entry in generatedRDDs (they should contain BlockRDDs), perform something like DStream.clearMetadata.
I have already used a similar hack for mapWithState and for other Spark features that hold on to memory for nothing.
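To make that concrete, here is a rough Java sketch of the hack (my own helper, not a supported API: generatedRDDs and clearMetadata are private[streaming] internals, so the accessor name and behaviour may vary between Spark versions):

import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.rdd.RDD;
import org.apache.spark.streaming.Time;
import org.apache.spark.streaming.dstream.DStream;
import scala.Tuple2;
import scala.collection.Iterator;
import scala.collection.mutable.HashMap;

public class StuckBlockEvictor {

    /**
     * Unpersist and forget generated RDDs older than the given cutoff.
     * Pass the underlying DStream, e.g. javaDStream.dstream().
     */
    @SuppressWarnings("unchecked")
    public static void evictOlderThan(DStream<?> stream, long cutoffMs) throws Exception {
        // generatedRDDs is private[streaming]; its accessor is public in bytecode,
        // but verify the name against your Spark version before relying on this.
        Method accessor = stream.getClass().getMethod("generatedRDDs");
        HashMap<Time, RDD<?>> generated = (HashMap<Time, RDD<?>>) accessor.invoke(stream);

        // Collect stale batch times first, then evict, to avoid mutating while iterating.
        List<Time> stale = new ArrayList<>();
        Iterator<Tuple2<Time, RDD<?>>> it = generated.iterator();
        while (it.hasNext()) {
            Tuple2<Time, RDD<?>> entry = it.next();
            if (entry._1().milliseconds() < cutoffMs) {
                stale.add(entry._1());
            }
        }
        for (Time t : stale) {
            generated.apply(t).unpersist(true);  // drop the cached blocks for this batch
            generated.remove(t);                 // forget the RDD so it can be GC'd
        }
    }
}

You would call this periodically, passing a cutoff a few batch intervals in the past.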
Related
I have a Java program that submits thousands of Hadoop DistCp jobs, and this is causing an OOM error on the client Java process side (not on the cluster).
The DistCp jobs are submitted internally within the same Java process. I do:
import org.apache.hadoop.tools.{DistCp, DistCpOptions}
val distCp = new DistCp(<config>, new DistCpOptions(<foo>, <bar>))
distCp.createAndSubmitJob()
In other words, the Java program is not spawning external OS processes and running the DistCp CLI tool separately.
The problem is that the Java program, after a few thousand DistCp job submissions, eventually crashes with an OutOfMemory error. This can take a few days.
This is a nuisance because restarting the Java program is not a reasonable solution for us in the medium term.
After analysing a few heap dumps, the issue became clear: almost all of the heap is being used to hold entries in the Map<Key, FileSystem> of the FileSystem.Cache inside org.apache.hadoop.fs.FileSystem. This cache is global.
A few debugging sessions later, I found that upon each instantiation of new DistCp(<config>, new DistCpOptions(<foo>, <bar>)) there is eventually a call to FileSystem.newInstance(<uri>, <conf>) via the org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter from the library org.apache.hadoop:hadoop-yarn-common.
The problem with these FileSystem.newInstance(<uri>, <conf>) calls is that each one creates a unique entry in the cache, and there doesn't seem to be any mechanism to clear those entries. I would have tried setting the configuration flag fs.hdfs.impl.disable.cache to true, but FileSystem.newInstance bypasses that flag check, so it doesn't work.
Alternatively, I could call FileSystem.closeAll(), but this would close all file systems of the program, including legitimate uses in other parts of the application.
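The only other idea I have is a brittle reflection hack along the following lines (sketch only: the CACHE and map names come from the heap dumps, while the Key.unique check is my assumption about how newInstance() entries are keyed, so this may well break on another Hadoop version):

import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;

public class FileSystemCacheDrainer {

    /**
     * Close (and thereby evict) cached FileSystem instances that appear to have been
     * created through FileSystem.newInstance(), leaving FileSystem.get() entries alone.
     * Run only when no DistCp submission is in flight.
     */
    @SuppressWarnings("unchecked")
    public static void drainNewInstanceEntries() throws Exception {
        // CACHE, "map" and "unique" are private Hadoop internals; names are assumptions.
        Field cacheField = FileSystem.class.getDeclaredField("CACHE");
        cacheField.setAccessible(true);
        Object cache = cacheField.get(null);

        Field mapField = cache.getClass().getDeclaredField("map");
        mapField.setAccessible(true);
        Map<Object, FileSystem> map = (Map<Object, FileSystem>) mapField.get(cache);

        List<FileSystem> toClose = new ArrayList<>();
        synchronized (cache) {  // the Cache guards its map with synchronized methods
            for (Map.Entry<Object, FileSystem> e : map.entrySet()) {
                Object key = e.getKey();
                Field uniqueField = key.getClass().getDeclaredField("unique");
                uniqueField.setAccessible(true);
                // Keys created via newInstance() should carry a non-zero "unique" discriminator.
                if (uniqueField.getLong(key) != 0L) {
                    toClose.add(e.getValue());
                }
            }
        }
        for (FileSystem fs : toClose) {
            fs.close();  // close() also removes the entry from the cache
        }
    }
}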
In essence,
My Java program launches, over time, thousands of DistCp jobs within the same Java process.
Each distCp job seems to add one entry to the global FileSystem.Cache.
The cache grows, and eventually the program crashes with OOM.
Does anyone see a solution to this?
I am surprised not to have found similar issues on the internet; I would have thought that a Java process launching thousands of DistCp jobs would be a fairly common usage.
For the past month we have been facing a problem where the App Engine backend randomly stops processing any jobs at all.
The closest other questions I have seen are this one and this one, but neither was of use.
My push queue configuration is
<queue>
<name>MyFetcherQueue</name>
<target>mybackend</target>
<rate>30/m</rate>
<max-concurrent-requests>1</max-concurrent-requests>
<bucket-size>30</bucket-size>
<retry-parameters>
<task-retry-limit>10</task-retry-limit>
<min-backoff-seconds>10</min-backoff-seconds>
<max-backoff-seconds>200</max-backoff-seconds>
<max-doublings>2</max-doublings>
</retry-parameters>
</queue>
It is a standard Java App Engine app with multiple modules, one of which is a backend with basic scaling on a B4 instance class:
<runtime>java8</runtime>
<threadsafe>true</threadsafe>
<instance-class>B4</instance-class>
<basic-scaling>
<max-instances>2</max-instances>
</basic-scaling>
Note, just to fix this problem, I have tried the following but to no avail:
Updated Java from 7 to 8 - didn't work.
Changed from the earlier B1 to B4 (thinking it could be a memory issue, but there is nothing like that in the logs) - didn't work.
Changed the queue processing rate to as low as 15/m with bucket size 15.
Changed max-instances from the earlier 1 to 2, hoping that if at least one instance hangs the other should still be able to process, but to no avail. I noticed today that when this issue occurred, the second instance had received only 2 requests (the other had 1400+), yet the new instance did not pick up the remaining tasks from the queue. The task ETA just kept increasing.
Updated the Java App Engine SDK from the earlier 1.9.57 to the latest 1.9.63.
Behavior: after a while (almost once per day now), the backend just stops responding and the tasks in this queue are left as is. The only way to continue is to kill the backend instance from the console, after which the new instance starts processing the tasks. These tasks are simple HTTP calls to fetch data. The queue usually has anywhere from 1 task up to 15-20 at any time; I have noticed that it sometimes stalls even with as few as 3-4.
The logs don't show anything at all; nothing scrolls. Only when the backend instance is deleted does the log show that the /task was terminated via the console. No out of memory, no crashes, no 404s.
Earlier logs and screenshots that I had captured:
I am unable to add more images, but I have screenshots of the utilization, the memory usage, and the most recent memory usage, after which I deleted the instance and the tasks resumed.
This and other recent experiences have shaken my faith in Google App Engine. What am I doing wrong? (This exact backend setup with this queue config previously worked on B1 for many years!)
Issue -
I am running a series of MapReduce jobs wrapped in an Oozie workflow. The input data consists of a bunch of text files, most of which are fairly small (KBs), but every now and then I get files over 1-2 MB which cause my jobs to fail. I am seeing two reasons why the jobs fail: one, inside one or two of the MR jobs the file is parsed into an in-memory graph, and for a bigger file the MR job runs out of memory; and two, the jobs time out.
Questions -
1) I believe I can just disable the timeout by setting mapreduce.task.timeout to 0, but I am not able to find any documentation that mentions the risks of doing this.
2) For the OOM error, what are the various configs I can tune? Any links to potential solutions and their risks would be very helpful.
3) I see a lot of "container preempted by scheduler" messages before I finally get the OOM. Is this a separate issue or related? How do I get around it?
Thanks in advance.
About the timeout: there is no need to set it to "unlimited"; a reasonably large value will do (e.g. in our prod cluster it is set to 300000, i.e. 5 minutes).
About requiring a non-standard RAM quota in Oozie: the properties you are looking for are probably mapreduce.map.memory.mb for global YARN container quota, oozie.launcher.mapreduce.map.java.opts to instruct the JVM about that quota (i.e. fail gracefully with OOM exception instead of crashing the container with no useful error message), and the .reduce. counterparts.
See also that post for the (very poorly documented) oozie.launcher. prefix, in case you want to set properties for a non-MR action -- e.g. a Shell action, or a Java program that indirectly spawns a series of Map and Reduce steps.
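For illustration, those properties could look roughly like this inside a <map-reduce> action of the workflow XML (the action name and memory values are placeholders -- size them to your largest files and your cluster):

<action name="parse-files">
  <map-reduce>
    <!-- job-tracker / name-node / prepare elements omitted -->
    <configuration>
      <!-- fail a hung task after 5 minutes of inactivity instead of disabling the timeout -->
      <property>
        <name>mapreduce.task.timeout</name>
        <value>300000</value>
      </property>
      <!-- YARN container quota for each mapper -->
      <property>
        <name>mapreduce.map.memory.mb</name>
        <value>4096</value>
      </property>
      <!-- keep the JVM heap below the container quota so you get an OOM exception
           instead of the container being killed with no useful message -->
      <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx3276m</value>
      </property>
      <!-- .reduce. counterparts -->
      <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>4096</value>
      </property>
      <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx3276m</value>
      </property>
    </configuration>
  </map-reduce>
  <ok to="next-step"/>
  <error to="fail"/>
</action>

For a non-MR action (Shell, Java) you would prefix the same memory properties with oozie.launcher., as discussed above.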
We created a Java agent which checks our application suite to see whether, for instance, the parent/child structure is still correct. To do this it needs to check 8000+ documents across several applications.
The check itself is very fast. We use a navigator to retrieve data from views and only read data from those entries. The problem is in our logging mechanism: whenever we report a log entry with level SEVERE (i.e. a really big issue), the backend document is updated immediately. This is because we don't want to lose any info about these issues.
In our test runs everything runs smoothly, but as soon as we 'create' a lot of severe issues the performance drops enormously because of all the writes. I would like to hear from any Notes developers facing the same challenge: how could we speed up the writing without losing any data?
-- added more info after a comment from Simon --
It's a scheduled agent which runs every night to check for inconsistencies. The goal is of course to find inconsistencies, fix the cause, and eventually have no inconsistencies reported at all.
It's a scheduled agent which runs every night to check for inconsistencies.
OK. So there are a number of factors to take into account.
Are there any embedded JARs? When an agent has embedded JARs, the server has to detach them from the agent to disk before it can run the code. This is done every time the agent executes and can be a performance hit. If your agent runs a number of times, remove the embedded JARs and put them into the lib\ext folder on the server instead (this requires a server restart).
You mention it runs at night. By default, general housekeeping processes run at night. Check the notes.ini for scheduled server tasks and assess what impact they have on the server/agent while running. For example:
ServerTasksAt1=Catalog,Design
ServerTasksAt2=Updall
ServerTasksAt5=Statlog
In this case, if the agent runs between 2 and 5, then UPDALL could have an impact on it. Also check Program documents for scheduled executions.
In what way are you writing? If you are creating a document for each incident and the document content is small, then the write time should be reasonable. What is liable to hurt performance is one of the following:
Multi-threading those writes.
Pulling a log document, appending a line, saving, and then repeating.
One last thing to think about: if you are getting 3000 errors, there must be a point at which X errors means there is no point continuing and you should instead alert the admin via SNMP/email/etc. It might be worth coding that in as well.
Other than that, you should probably post some sample code relating to the writes.
Hmm, a difficult, or rather general, question.
As far as I understand it, you update documents in the view you are walking through. I would set View.AutoUpdate to false (view.setAutoUpdate(false) in Java). This ensures that the view is not reloaded while your code is running, which should speed it up. A small sketch of the pattern follows the help excerpt below.
This is an extract from the Designer help:
Avoid automatically updating the parent view by explicitly setting
AutoUpdate to False. Automatic updates degrade performance and may
invalidate entries in the navigator ("Entry not found in index"). You
can update the view as needed with Refresh.
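For illustration, a minimal sketch of that pattern in a Java agent (the view name is a placeholder, and the consistency check itself is left out):

import lotus.domino.AgentBase;
import lotus.domino.AgentContext;
import lotus.domino.Database;
import lotus.domino.Session;
import lotus.domino.View;
import lotus.domino.ViewEntry;
import lotus.domino.ViewNavigator;

public class ConsistencyCheckAgent extends AgentBase {

    public void NotesMain() {
        try {
            Session session = getSession();
            AgentContext ctx = session.getAgentContext();
            Database db = ctx.getCurrentDatabase();

            View view = db.getView("Documents");  // placeholder view name
            view.setAutoUpdate(false);            // do not refresh the view while walking it

            ViewNavigator nav = view.createViewNav();
            ViewEntry entry = nav.getFirst();
            while (entry != null) {
                // read only the column values, never the backend document, during the scan
                java.util.Vector<?> cols = entry.getColumnValues();
                // ... consistency check on cols, queue any SEVERE findings ...

                ViewEntry next = nav.getNext(entry);
                entry.recycle();                  // free the backend handle promptly
                entry = next;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}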
Hope that helps.
If that does not help, you might want to post a code fragment or more details.
Create separate documents for each error rather than one huge document.
or
Write directly to a text file rather than to the database, and then pull it into a document afterwards if necessary. This should speed things up considerably.
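For example, a minimal sketch of that idea (the path and line format are placeholders; it flushes per entry so a crash does not lose data):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class SevereFileLogger implements AutoCloseable {

    private final BufferedWriter out;

    public SevereFileLogger(String path) throws IOException {
        // append mode, so reruns of the agent keep adding to the same file
        this.out = new BufferedWriter(new FileWriter(path, true));
    }

    public void severe(String docUnid, String message) throws IOException {
        out.write(System.currentTimeMillis() + "\t" + docUnid + "\t" + message);
        out.newLine();
        out.flush(); // flush per entry so nothing is lost if the agent dies mid-run
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}

At the end of the run (or in a follow-up agent) the file can be imported into a single log document if you still want the data in the NSF.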
We have a 64-bit Linux machine and we make multiple HTTP connections to other services; the Drools Guvnor website (a rule engine, if you don't know it) is one of them. In Drools, we create a knowledge base per rule being fired, and creating a knowledge base makes an HTTP connection to the Guvnor website.
All other threads are blocked and CPU utilization goes up to ~100%, eventually resulting in OOM. We could make changes to compile the rules only every 15-20 minutes, but I want to be sure of the problem in case someone has already faced it.
I checked "cat /proc/sys/kernel/threads-max" and it shows 27000 threads. Could that be a reason?
I have a couple of questions:
When do we know that we are running over capacity?
How many threads can be spawned internally (any rough estimate or formula relating diff parameters will work)?
Has anyone else seen similar issues with Drools? Concurrent access to the Guvnor website is basically what causes the issue.
Thanks,
I am basing my answer on the assumption that you are creating a knowledge base for each request, and that this knowledge base creation includes downloading the latest rule sources from Guvnor; please correct me if I am mistaken.
I suspect that the build/compilation of the packages is taking time and hogging your system.
Instead of compiling the packages on each and every request, you can download pre-built (binary) packages from Guvnor, and you can also cache these packages locally if your rules do not change much. The only restriction is that you need to use the same version of Drools both on Guvnor and in your application.
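A rough sketch of what that could look like with the Drools 5 / Guvnor-era API (the package URL and name are placeholders, and the exact binary-package path depends on your Guvnor version):

import org.drools.KnowledgeBase;
import org.drools.KnowledgeBaseFactory;
import org.drools.builder.KnowledgeBuilder;
import org.drools.builder.KnowledgeBuilderFactory;
import org.drools.builder.ResourceType;
import org.drools.io.ResourceFactory;
import org.drools.runtime.StatelessKnowledgeSession;

public final class RuleBaseHolder {

    // Placeholder: point this at the compiled (binary) package your Guvnor exposes.
    private static final String PACKAGE_URL = "http://guvnor-host/path/to/compiled/package";

    private static volatile KnowledgeBase kbase;

    /** Build the knowledge base once and reuse it for every request. */
    public static KnowledgeBase getKnowledgeBase() {
        if (kbase == null) {
            synchronized (RuleBaseHolder.class) {
                if (kbase == null) {
                    KnowledgeBuilder kbuilder = KnowledgeBuilderFactory.newKnowledgeBuilder();
                    // ResourceType.PKG = a package already compiled by Guvnor,
                    // so no expensive compilation happens in this JVM.
                    kbuilder.add(ResourceFactory.newUrlResource(PACKAGE_URL), ResourceType.PKG);
                    if (kbuilder.hasErrors()) {
                        throw new IllegalStateException(kbuilder.getErrors().toString());
                    }
                    KnowledgeBase built = KnowledgeBaseFactory.newKnowledgeBase();
                    built.addKnowledgePackages(kbuilder.getKnowledgePackages());
                    kbase = built;
                }
            }
        }
        return kbase;
    }

    /** Sessions are cheap; create one per request from the shared knowledge base. */
    public static StatelessKnowledgeSession newSession() {
        return getKnowledgeBase().newStatelessKnowledgeSession();
    }
}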
I checked for "cat /proc/sys/kernel/threads-max" and it shows 27000
threads, Can it be a reason?
That number does look large, but we don't know whether the majority of those threads belong to your Java app. Create a Java thread dump to confirm this; the thread dump will also show the CPU time taken by each thread.
When do we know that we are running over capacity?
You have 100% CPU and an OOM error. You are over capacity :) Jokes aside, you should monitor your HTTP connection queue to determine what you are doing wrong. Your post says nothing about how you are handling the HTTP connections (presumably through some sort of pooling mechanism backed by a queue?). I've seen containers and programs queue requests indefinitely, causing them to crash with a big bang. Plot the following graphs to isolate your problem (a small in-JVM sampling sketch follows the list):
The number of blocking threads over time
Time taken for each thread
Number of threads per thread pool and how they increase / decrease with time (pool size)
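As a starting point, the standard ThreadMXBean can sample those numbers from inside the JVM; a minimal sketch (log the output once a minute and plot it):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadSampler {

    /** Print a one-line sample of thread activity; call it periodically. */
    public static void sample() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        int total = mx.getThreadCount();
        int peak = mx.getPeakThreadCount();

        // Count threads currently blocked on a monitor (waiting for a lock).
        int blocked = 0;
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked++;
            }
        }
        System.out.printf("threads=%d peak=%d blocked=%d%n", total, peak, blocked);
    }
}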
How many threads can be spawned internally (any rough estimate or
formula relating diff parameters will work)?
Only a load test can answer this question. Load your server and determine the number of concurrent users it can support at 60-70% capacity, and note the number of threads spawned internally at that point. That is your peak capacity (leaving room for unexpected traffic).
Has anyone else seen similar issues with Drools? Concurrent access to the Guvnor website is basically what causes the issue.
I can't help there since I've not accessed Drools this way. Sorry.