I have a batch process, written in java, that analyzes extremely long sequences of tokens (maybe billions or even trillions of them!) and observes bi-gram patterns (aka, word-pairs).
In this code, bi-grams are represented as Pairs of Strings, using the ImmutablePair class from Apache commons. I won't know in advance the cardinality of the tokens. They might be very repetitive, or each token might be totally unique.
The more data I can fit into memory, the better the analysis will be!
But I definitely can't process the whole job at once. So I need to load as much data as possible into a buffer, perform a partial analysis, flush my partial results to a file (or to an API, or whatever), then clear my caches and start over.
One way I'm optimizing memory usage is by using Guava interners to de-duplicate my String instances.
Right now, my code looks essentially like this:
int BUFFER_SIZE = 100_000_000;
Map<Pair<String, String>, LongAdder> bigramCounts = new HashMap<>(BUFFER_SIZE);
Interner<String> interner = Interners.newStrongInterner();

String prevToken = null;
Iterator<String> tokens = getTokensFromSomewhere();
while (tokens.hasNext()) {
    String token = interner.intern(tokens.next());
    if (prevToken != null) {
        Pair<String, String> bigram = new ImmutablePair<>(prevToken, token);
        LongAdder bigramCount = bigramCounts.computeIfAbsent(
                bigram,
                (c) -> new LongAdder()
        );
        bigramCount.increment();

        // If our buffer is full, we need to flush!
        boolean tooMuchMemoryPressure = bigramCounts.size() > BUFFER_SIZE;
        if (tooMuchMemoryPressure) {
            // Analyze the data, and write the partial results somewhere
            doSomeFancyAnalysis(bigramCounts);
            // Clear the buffer and start over
            bigramCounts.clear();
        }
    }
    prevToken = token;
}
The trouble with this code is that counting map entries is a very crude way of determining whether there is tooMuchMemoryPressure.
I want to run this job on many different kinds of hardware, with varying amounts of memory. No matter the instance, I want this code to automatically adjust to maximize the memory consumption.
Rather than using some hard-coded constant like BUFFER_SIZE (derived through experimentation, heuristics, and guesswork), I actually just want to ask the JVM whether the heap is almost full. But that's a complicated question, considering the complexities of mark/sweep algorithms and all the different generational collectors.
What would be a good general-purpose approach for accomplishing something like this, assuming this batch-job might run on a variety of different machines, with different amounts of available memory? I don't need this to be extremely precise... I'm just looking for a rough signal to know that I need to flush the buffer soon, based on the state of the actual heap.
The simplest way to get a first glimpse of what is going on with the process's heap space is Runtime.freeMemory(), together with .maxMemory() and .totalMemory(). However, freeMemory() does not factor in uncollected garbage, so it is an underestimate at best and may be completely misleading just before the GC kicks in.
Assuming that for your application "memory pressure" basically means "(soon) not enough", the interesting value is free memory right after a GC.
This is available via GarbageCollectorMXBean, which provides GcInfo with the memory usage after the GC.
The bean can be watched right after each GC because it is also a NotificationEmitter, although this is not advertised in the Javadoc. Some minimal code, patterned after a longer example, is:
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;

import com.sun.management.GarbageCollectionNotificationInfo;
import com.sun.management.GcInfo;

void registerCallback() {
    List<GarbageCollectorMXBean> gcbeans = ManagementFactory.getGarbageCollectorMXBeans();
    for (GarbageCollectorMXBean gcbean : gcbeans) {
        System.out.println(gcbean.getName());
        // Every GarbageCollectorMXBean is also a NotificationEmitter
        NotificationEmitter emitter = (NotificationEmitter) gcbean;
        emitter.addNotificationListener(this::handle, null, null);
    }
}

private void handle(Notification notification, Object handback) {
    if (!notification.getType()
            .equals(GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION)) {
        return;
    }
    GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
            .from((CompositeData) notification.getUserData());
    GcInfo gcInfo = info.getGcInfo();
    gcInfo.getMemoryUsageAfterGc().forEach((name, memUsage) -> {
        System.err.println(name + " -> " + memUsage);
    });
}
There will be several memUsage entries (one per memory pool), and they will differ depending on the GC in use. But from the values provided, used, committed, and max, we can derive an upper limit on free memory, which should give the "rough signal" the OP is asking for.
doSomeFancyAnalysis will certainly also need its share of fresh memory, so with a very rough estimate of how much that will be per bigram to analyze, this could be the limit to watch for.
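For example, here is a minimal sketch of a handler variant that turns the notification into a rough signal the batch loop can poll instead of comparing bigramCounts.size() to a constant. The 0.8 threshold, the volatile flag, and the decision to treat all pools reported by the collector alike (skipping pools that report no max) are assumptions, not part of the code above; it also needs java.lang.management.MemoryUsage imported.

// Rough "heap is still mostly full even after a GC" signal.
private volatile boolean memoryPressure = false;

private void handle(Notification notification, Object handback) {
    if (!notification.getType()
            .equals(GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION)) {
        return;
    }
    GcInfo gcInfo = GarbageCollectionNotificationInfo
            .from((CompositeData) notification.getUserData())
            .getGcInfo();
    long used = 0;
    long max = 0;
    for (MemoryUsage usage : gcInfo.getMemoryUsageAfterGc().values()) {
        if (usage.getMax() > 0) {          // pools with no defined max report -1
            used += usage.getUsed();
            max += usage.getMax();
        }
    }
    // If more than ~80% of the capped pools is still used right after a GC, flush soon.
    memoryPressure = max > 0 && (double) used / max > 0.8;
}

The batch loop would then check memoryPressure (and reset it after flushing) instead of bigramCounts.size() > BUFFER_SIZE.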
Related
I have to update an org.geotools.feature.DefaultFeatureCollection every second for as long as the app is running (more than an hour).
I have created DefaultFeatureCollection lineCollection = new DefaultFeatureCollection(); as a class member and add a feature to it every second with lineCollection.add(feature);
public void addLines(Coordinate[] coords) {
    try {
        line = geometryFactory.createLineString(coords);
        featureBuilder.add(line);
        feature = featureBuilder.buildFeature(null);
        lineCollection.add(feature);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
However, the collection gets huge and heap memory usage increases gradually, resulting in high CPU usage and a lagging app.
Is there a way to free the memory once a line has been displayed on the map?
You've tagged your question [memory-leaks] - is there any evidence you are leaking memory? You can use jmap to check. If so the developers would love to hear about it with the evidence.
It seems more likely that you are just drawing 3600 lines (60*60) after an hour without having any indexing. If your dataset were constant I'd recommend a SpatialIndexFeatureCollection, but as yours changes I would suggest you use a GeoPackage or another database-backed store (if you already have one) that will manage a spatial index for your lines.
I have a big text file that contains source and target nodes and a threshold. I store all the distinct nodes in a HashSet, then filter the edges based on the user threshold and store the filtered nodes in a separate HashSet. I want to find a way to do the processing as fast as possible.
public class Simulator {

    static HashSet<Integer> Alledgecount = new HashSet<>();
    static HashSet<Integer> FilteredEdges = new HashSet<>();

    static void process(BufferedReader reader, double userThres) throws IOException {
        String line = null;
        int l = 0;
        BufferedWriter writer = new BufferedWriter(new FileWriter("C:/users/mario/desktop/edgeList.txt"));
        while ((line = reader.readLine()) != null && l < 50_000_000) {
            String[] intArr = line.split("\\s+");
            checkDuplicate(Integer.parseInt(intArr[1]), Integer.parseInt(intArr[2]), Alledgecount);
            double threshold = Double.parseDouble(intArr[3]);
            if (threshold > userThres) {
                writeToFile(intArr[1], intArr[2], writer);
                checkDuplicate(Integer.parseInt(intArr[1]), Integer.parseInt(intArr[2]), FilteredEdges);
            }
            l++;
        }
        writer.close();
    }

    static void writeToFile(String param1, String param2, Writer writer) throws IOException {
        writer.write(param1 + "," + param2);
        writer.write("\r\n");
    }
The graph class does a BFS and writes the nodes to a separate file. I ran the processing with some functionality excluded, and the timings are below.
Timings with 50 million lines read in process()
without calling BFS(),checkDuplicates,writeAllEdgesToFile() -> 54s
without calling BFS(),writeAllEdgesToFile() -> 50s
without calling writeAllEdgesToFile() -> 1min
Timings with 300 million lines read in process()
without calling writeAllEdges() 5 min
Reading a file doesn't depend only on CPU cores.
IO operations on a file are limited by the physical constraints of classic disks, which, unlike CPU cores, cannot perform operations in parallel.
What you could do is have one thread for IO operations and another (or several) for data processing, as in the sketch below, but it only makes sense if the data processing is expensive enough to justify a dedicated thread, as threads have a cost in terms of CPU scheduling.
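A minimal sketch of that split, using an ArrayBlockingQueue to hand lines from a reader thread to a processing thread. The queue size, the poison-pill marker, and the processLine method are assumptions for illustration, not part of your code.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedReader {
    private static final String POISON_PILL = "\u0000EOF";   // assumed end-of-stream marker

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        // IO thread: only reads lines and hands them off.
        Thread reader = new Thread(() -> {
            try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = br.readLine()) != null) {
                    queue.put(line);
                }
                queue.put(POISON_PILL);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });

        // CPU thread: parses and filters lines.
        Thread processor = new Thread(() -> {
            try {
                String line;
                while (!(line = queue.take()).equals(POISON_PILL)) {
                    processLine(line);                        // hypothetical processing step
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        processor.start();
        reader.join();
        processor.join();
    }

    private static void processLine(String line) {
        // split, checkDuplicate, threshold filtering, etc. would go here
    }
}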
Getting a multi-threaded Java program to run correctly can be very tricky. It needs some deep understanding of things like synchronization issues etc. Without the knowledge/experience necessary, you'll have a hard time searching for bugs that occur sometimes but aren't reliably reproducible.
So, before trying multi-threading, find out if there are easier ways to achieve acceptable performance:
Find the part of your program that takes the time!
First question: is it I/O or CPU? Have a look at Task Manager. Does your single-threaded program occupy one core (e.g. CPU close to 25% on a 4-core machine)? If it's far below that, then I/O must be the limiting factor, and changing your program probably won't help much - buy a faster HD. (In some situations, the software style of doing I/O might influence the hardware performance, but that's rare.)
If it's CPU, use a profiler, e.g. the JVisualVM contained in the JDK, to find the method that takes most of the runtime and think about alternatives. One candidate might be the line.split("\\s+") call, which uses a regular expression. Regular expressions are slow, especially if the expression isn't compiled to a Pattern beforehand (see the sketch below), but that's nothing more than a guess, and the profiler will most probably point you to some very different place.
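For illustration, a minimal sketch of pre-compiling the regular expression once instead of letting String.split("\\s+") recompile it on every line; the class and method names are made up.

import java.util.regex.Pattern;

class LineSplitter {
    // Compiled once and reused for every line, instead of String.split("\\s+")
    // compiling the pattern on each call.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String[] split(String line) {
        return WHITESPACE.split(line);
    }
}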
I have two large CSV files which contain data that is required for users of a web application to validate some info. I defined an ArrayList< String[] > and intended to keep the contents of both files in memory so I wouldn't have to read them each time a user logged in and used the application.
I'm getting a java.lang.OutOfMemoryError: Java heap space, though, when initializing the application and trying to read the second file. (It finishes reading the first file just fine but hangs when reading the second file and after a while I get that exception)
The code for reading the files is pretty straightforward:
ArrayList<String[]> tokenizedLines = new ArrayList<String[]>();

public void parseTokensFile() throws Exception {
    BufferedReader bRead = null;
    FileReader fRead = null;
    try {
        fRead = new FileReader(this.tokensFile);
        bRead = new BufferedReader(fRead);
        String line;
        while ((line = bRead.readLine()) != null) {
            tokenizedLines.add(StringUtils.split(line, fieldSeparator));
        }
    } catch (Exception e) {
        throw new Exception("Error parsing file.");
    } finally {
        bRead.close();
        fRead.close();
    }
}
I read that Java's split function can use up a lot of memory when processing large amounts of data, since substring keeps a reference to the original string, so a substring of some String can use up the same amount of memory as the original even though we only want a few chars. So I made a simple split function to try to avoid this:
public String[] split(String inputString, String separator) {
    ArrayList<String> storage = new ArrayList<String>();
    String remainder = new String(inputString);
    int separatorLength = separator.length();
    while (remainder.length() > 0) {
        int nextOccurance = remainder.indexOf(separator);
        if (nextOccurance != -1) {
            storage.add(new String(remainder.substring(0, nextOccurance)));
            remainder = new String(remainder.substring(nextOccurance + separatorLength));
        } else {
            break;
        }
    }
    storage.add(remainder);
    String[] tokenizedFields = storage.toArray(new String[storage.size()]);
    storage = null;
    return tokenizedFields;
}
This gives me the same error though, so I'm wondering if it's not a memory leak but simply that I can't have structures with so many objects in memory. One file is about 600'000 lines long, with 5 fields per line, and the other is around 900'000 lines long with about the same amount of fields per line.
The full stacktrace is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at xxx.xxx.xxx.StringUtils.split(StringUtils.java:16)
at xxx.xxx.xxx.GFTokensFile.parseTokensFile(GFTokensFile.java:36)
So, after the long post (sorry :P), is this a restriction of the amount of memory assigned to my JVM or am I missing something obvious and wasting resources somewhere?
Your JVM won't get more than 2GB on a 32-bit operating system with 4GB of RAM. That's one upper limit.
The second is the max heap size you specify when you start the JVM. Look at that -Xmx parameter.
The third is the fact of life that you cannot fit X units of anything into a Y sized container where X > Y. You know the size of your files. Try parsing each one individually and seeing what kind of heap they're consuming.
I'd recommend that you download Visual VM, install all the available plugins, and have it monitor your application while it's running. You'll be able to see the entire heap, perm gen space, GC collection, what objects are taking up the most memory, etc.
Getting data is invaluable for all problems, but especially ones like this. Without it, you're just guessing.
I cannot see a storage leak in the original version of the program.
The scenarios where split and similar methods can leak significant storage are rather limited:
You have to NOT be retaining a reference to the original string that you split.
You need to be retaining references to a subset of the strings produced by the string splitting.
What happens when String.substring() is called is that it creates a new String object that shares the original String's backing array. If the original String reference is then garbage collected, then the substring String is now holding onto an array of characters that includes characters that are not "in" the substring. This can be a storage leak, depending on how long the substring is kept.
In your example, you are keeping strings that contain all characters apart from the field separator characters. There is a good chance that this is actually saving space... compared to the space used if each substring were an independent String. Certainly, it is no surprise that your version of split doesn't solve the problem.
I think you need to either increase the heap size, or change your application so that it doesn't need to keep all of the data in memory at the same time.
Try improving your code or leave data processing to a database.
The memory usage is larger than your file sizes, since the code makes redundant copies of the data: there is the data still to be processed, the processed data, and some partial data.
String is immutable (see here); there is no need to use new String(...) to store the result, as split already makes that copy.
If you can, delegate the whole data storage and searching to a database. CSV files are easily imported/exported to databases and they do all the hard work.
While I wouldn't recommend actual string interning for what you are doing, how about using the idea behind that technique? You could use a HashSet or HashMap to make sure you only use a single String instance whenever your data contains the same sequence of characters. I mean, there must be some kind of overlap in the data, right?
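A minimal sketch of that idea, using a HashMap as a manual deduplication pool; the class and method names are made up for illustration.

import java.util.HashMap;
import java.util.Map;

class StringPool {
    private final Map<String, String> pool = new HashMap<String, String>();

    // Returns one canonical instance per distinct character sequence,
    // so repeated field values share a single String object.
    String canonicalize(String s) {
        String existing = pool.get(s);
        if (existing == null) {
            pool.put(s, s);
            return s;
        }
        return existing;
    }
}

Each token of a parsed line would then be passed through canonicalize before being stored in the ArrayList.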
On the other hand, what you might be seeing here could be a bad case of heap fragmentation. I'm not sure how the JVM handles these cases, but in the Microsoft CLR larger objects (especially arrays) will be allocated on a separate heap. Growth strategies, such as those of the ArrayList will create a larger array, then copy over the content of the previous array before releasing the reference to it. The Large Object Heap (LOH) isn't compacted in the CLR, so this growth strategy will leave huge areas of free memory that the ArrayList can no longer use.
I don't know how much of that applies to the Java VM, but you could try building the list using a LinkedList first, then dump the list content into an ArrayList or directly into an array. That way the large array of lines would be created only once, without causing any fragmentation.
Be sure that the total length of both files is lower than your heap size. You can set the max heap size using the JVM option -Xmx.
Then, if you have that much content, maybe you shouldn't load it entirely into memory. I once had a similar problem and fixed it using an index file that stores the offsets of records in the large file; then I just had to read one line at the right offset.
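A rough sketch of that index-file idea (the class name and the assumption that the index fits in a long[] are mine): record each line's byte offset once, then seek to the offset when a record is needed instead of keeping every line in memory.

import java.io.IOException;
import java.io.RandomAccessFile;

class OffsetIndexedFile {
    private final long[] offsets;            // byte offset of each line
    private final RandomAccessFile file;

    OffsetIndexedFile(String path, int lineCount) throws IOException {
        file = new RandomAccessFile(path, "r");
        offsets = new long[lineCount];
        for (int i = 0; i < lineCount; i++) {
            offsets[i] = file.getFilePointer();
            if (file.readLine() == null) {
                break;                       // fewer lines than expected
            }
        }
    }

    // Read a single record on demand instead of holding the whole file in memory.
    String readLine(int lineNumber) throws IOException {
        file.seek(offsets[lineNumber]);
        return file.readLine();
    }
}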
Also, there are some strange things in your split method.
String remainder = new String(inputString);
You don't have to preserve inputString by making a copy; Strings are immutable, so reassignment only applies within the scope of the split method.
I'm trying to build a map from the contents of a file, and my code is as below:
System.out.println("begin to build the sns map....");
String basePath = PropertyReader.getProp("oldbasepath");
String pathname = basePath + "\\user_sns.txt";
FileReader fr;
Map<Integer, List<Integer>> snsMap =
        new HashMap<Integer, List<Integer>>(2000000);
try {
    fr = new FileReader(pathname);
    BufferedReader br = new BufferedReader(fr);
    String line;
    int i = 1;
    while ((line = br.readLine()) != null) {
        System.out.println("line number: " + i);
        i++;
        String[] strs = line.split("\t");
        int key = Integer.parseInt(strs[0]);
        int value = Integer.parseInt(strs[1]);
        List<Integer> list = snsMap.get(key);
        // if the follower is not in the map
        if (snsMap.get(key) == null)
            list = new LinkedList<Integer>();
        list.add(value);
        snsMap.put(key, list);
        System.out.println("map size: " + snsMap.size());
    }
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println("finish building the sns map....");
return snsMap;
The program is very fast at first, but gets much slower by the time the printed information looks like this:
map size: 1138338
line number: 30923602
map size: 1138338
line number: 30923603
....
I tried to find the reason using the two System.out.println() calls, to judge the performance of the BufferedReader and the HashMap instead of using a Java profiler.
Sometimes it takes a while to print the map size after the line number has been printed, and sometimes it takes a while to print the line number after the map size. My question is: what makes my program slow? The BufferedReader reading a big file, or the HashMap holding a big map?
If you are testing this from inside Eclipse, you should be aware of the huge performance penalty of writing to stdout/stderr, due to Eclipse's capturing that output in the Console view. Printing inside a tight loop is always a performance issue, even outside of Eclipse.
But, if what you are complaining about is the slowdown experienced after processing 30 million lines, then I bet it's a memory issue. First it slows down due to intense GC'ing and then it breaks with OutOfMemoryError.
You will have to check your program with some profiling tools to understand why it is slow.
In general, file access is much slower than in-memory operations (unless you are constrained in memory and doing excessive GC), so the guess would be that reading the file is the slower part here.
Until you have profiled, you will not know what is slow and what isn't.
Most likely, the System.out will show up as being the bottleneck, and you'll then have to profile without them again. System.out is the worst thing you can do for finding performance bottlenecks, because in doing so you usually add an even worse bottleneck.
An obvious optimization for your code is to move the line
snsMap.put(key, list);
into the if statement. You only need to put the list into the map when you have created a new list; otherwise, the put just replaces the current value with itself. A sketch of the change follows below.
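A minimal sketch of the loop body with that change applied (it also reuses the list already fetched with get instead of calling get a second time; names are taken from the question, the wrapper class is just for illustration):

import java.util.LinkedList;
import java.util.List;
import java.util.Map;

class SnsMapHelper {
    // One map lookup; the list is created and put only when the key is new.
    static void addEdge(Map<Integer, List<Integer>> snsMap, int key, int value) {
        List<Integer> list = snsMap.get(key);
        if (list == null) {
            list = new LinkedList<Integer>();
            snsMap.put(key, list);
        }
        list.add(value);
    }
}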
The Java cost associated with Integer objects (and in particular the use of Integers in the Java Collections API) is largely a memory (and thus garbage collection!) issue. You can sometimes get significant gains by using primitive collections such as GNU Trove, depending on how well you can adjust your code to use them efficiently. Most of the gains of Trove are in memory usage. Definitely try rewriting your code to use TIntArrayList and TIntObjectMap from GNU Trove. I'd avoid linked lists, too, in particular for primitive types.
Roughly estimated, a HashMap<Integer, List<Integer>> needs at least 3*16 bytes per entry. The doubly linked list again needs at least 2*16 bytes per entry stored. 1m keys + 30m values ~ 1 GB. No overhead included yet. With GNU Trove's TIntObjectHashMap<TIntArrayList> that should be 4+4+16 bytes per key and 4 bytes per value, so 144 MB. The overhead is probably similar for both.
The reason that Trove uses less memory is because the types are specialized for primitive values such as int. They will store the int values directly, thus using 4 bytes to store each.
A Java collections HashMap consists of many objects. It roughly looks like this: there are Entry objects that point to a key and a value object each. These must be objects, because of the way generics are handled in Java. In your case, the key will be an Integer object, which uses 16 bytes (4 bytes mark, 4 bytes type, 4 bytes actual int value, 4 bytes padding) AFAIK. These are all 32 bit system estimates. So a single entry in the HashMap will probably need some 16 (entry) + 16 (Integer key) + 32 (yet empty LinkedList) bytes of memory that all need to be considered for garbage collection.
If you have lots of Integer objects, it just will take 4 times as much memory as if you could store everything using int primitives. This is the cost you pay for the clean OOP principles realized in Java.
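For illustration, a rough sketch of the same map built with Trove's primitive collections; the package and class names assume Trove 3.x, so check your version's imports.

import gnu.trove.list.array.TIntArrayList;
import gnu.trove.map.hash.TIntObjectHashMap;

class PrimitiveSnsMap {
    // Keys are stored as raw ints and values as growable int arrays:
    // no Integer boxing and no per-element linked-list nodes.
    private final TIntObjectHashMap<TIntArrayList> snsMap =
            new TIntObjectHashMap<TIntArrayList>(2000000);

    void addEdge(int key, int value) {
        TIntArrayList list = snsMap.get(key);
        if (list == null) {
            list = new TIntArrayList();
            snsMap.put(key, list);
        }
        list.add(value);
    }
}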
The best way is to run your program with a profiler (for example, JProfiler) and see what parts are slow. Also, debug output can slow your program down, for example.
HashMap is not slow; in reality it's the fastest of the maps. Hashtable is the only thread-safe one among the maps, and can be slow sometimes.
Important note: close the BufferedReader and the file after you read the data... this might help.
e.g. br.close()
file.close()
Please check your system processes in Task Manager; there may be too many processes running in the background.
Sometimes Eclipse is really resource-heavy, so try running the program from the console to check it.
Below is a small test I've coded to educate myself on the References API. I thought this would never throw an OOME, but it is throwing one, and I am unable to figure out why. I'd appreciate any help on this.
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

public class Referencestest
{
    public static void main(String[] args)
    {
        Map<WeakReference<Long>, WeakReference<Double>> weak = new HashMap<WeakReference<Long>, WeakReference<Double>>(500000, 1.0f);
        ReferenceQueue<Long> keyRefQ = new ReferenceQueue<Long>();
        ReferenceQueue<Double> valueRefQ = new ReferenceQueue<Double>();
        int totalClearedKeys = 0;
        int totalClearedValues = 0;
        for (long putCount = 0; putCount <= Long.MAX_VALUE; putCount += 100000)
        {
            weak(weak, keyRefQ, valueRefQ, 100000);
            totalClearedKeys += poll(keyRefQ);
            totalClearedValues += poll(valueRefQ);
            System.out.println("Total PUTs so far = " + putCount);
            System.out.println("Total KEYs CLEARED so far = " + totalClearedKeys);
            System.out.println("Total VALUESs CLEARED so far = " + totalClearedValues);
        }
    }

    public static void weak(Map<WeakReference<Long>, WeakReference<Double>> m, ReferenceQueue<Long> keyRefQ,
            ReferenceQueue<Double> valueRefQ, long limit)
    {
        for (long i = 1; i <= limit; i++)
        {
            m.put(new WeakReference<Long>(new Long(i), keyRefQ), new WeakReference<Double>(new Double(i), valueRefQ));
            long heapFreeSize = Runtime.getRuntime().freeMemory();
            if (i % 100000 == 0)
            {
                System.out.println(i);
                System.out.println(heapFreeSize / 131072 + "MB");
                System.out.println();
            }
        }
    }

    private static int poll(ReferenceQueue<?> keyRefQ)
    {
        Reference<?> poll = keyRefQ.poll();
        int i = 0;
        while (poll != null)
        {
            poll.clear();
            poll = keyRefQ.poll();
            i++;
        }
        return i;
    }
}
And below is the log when run with 64 MB of heap:
Total PUTs so far = 0
Total KEYs CLEARED so far = 77982
Total VALUESs CLEARED so far = 77980
100000
24MB
Total PUTs so far = 100000
Total KEYs CLEARED so far = 134616
Total VALUESs CLEARED so far = 134614
100000
53MB
Total PUTs so far = 200000
Total KEYs CLEARED so far = 221489
Total VALUESs CLEARED so far = 221488
100000
157MB
Total PUTs so far = 300000
Total KEYs CLEARED so far = 366966
Total VALUESs CLEARED so far = 366966
100000
77MB
Total PUTs so far = 400000
Total KEYs CLEARED so far = 366968
Total VALUESs CLEARED so far = 366967
100000
129MB
Total PUTs so far = 500000
Total KEYs CLEARED so far = 533883
Total VALUESs CLEARED so far = 533881
100000
50MB
Total PUTs so far = 600000
Total KEYs CLEARED so far = 533886
Total VALUESs CLEARED so far = 533883
100000
6MB
Total PUTs so far = 700000
Total KEYs CLEARED so far = 775763
Total VALUESs CLEARED so far = 775762
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at Referencestest.weak(Referencestest.java:38)
at Referencestest.main(Referencestest.java:21)
from http://weblogs.java.net/blog/2006/05/04/understanding-weak-references
I think your use of HashMap is likely to be the issue. You might want to use WeakHashMap.
To solve the "widget serial number" problem above, the easiest thing to do is use the built-in WeakHashMap class. WeakHashMap works exactly like HashMap, except that the keys (not the values!) are referred to using weak references. If a WeakHashMap key becomes garbage, its entry is removed automatically. This avoids the pitfalls I described and requires no changes other than the switch from HashMap to a WeakHashMap. If you're following the standard convention of referring to your maps via the Map interface, no other code needs to even be aware of the change.
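A minimal sketch of the switch, keeping the spirit of the question's test but dropping the reference queues, since WeakHashMap expunges stale entries itself:

import java.util.Map;
import java.util.WeakHashMap;

public class WeakHashMapTest
{
    public static void main(String[] args)
    {
        // Keys are held weakly by the map itself; once a Long key is no longer
        // strongly reachable, its entry is expunged during later map operations.
        Map<Long, Double> weak = new WeakHashMap<Long, Double>(500000, 1.0f);
        for (long i = 1; i <= Long.MAX_VALUE; i++)
        {
            weak.put(new Long(i), new Double(i));
            if (i % 100000 == 0)
            {
                System.out.println("size after " + i + " puts: " + weak.size());
            }
        }
    }
}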
The heart of the problem is probably that you're filling your heap with WeakReference objects. The weak references' referents are cleared when you're getting low on memory, but the reference objects themselves are not collected, so your HashMap is filling up with a boatload of WeakReference objects (not to mention the entry array the HashMap uses, which will grow indefinitely), all pointing to null.
The solution, as already pointed out, is a WeakHashMap, which will clear out those entries once the keys are no longer in use (this is done during put).
EDIT:
As Kevin pointed out, you already have your reference-queue logic worked out (I didn't pay close enough attention); a solution using your own code is to simply remove the entry from the map at the point where the key has been collected, as in the sketch below. This is exactly how WeakHashMap works (where the poll is simply triggered on insert).
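A minimal sketch of that change applied to the question's poll method (the extra map parameter and the class name are added for illustration): since the polled Reference is the very key object that was put into the map, and WeakReference uses identity equality, map.remove(ref) finds the entry.

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;

class StaleEntryCleaner
{
    // Drain the key queue and remove the dead entries from the map,
    // so the map no longer accumulates cleared WeakReference keys.
    static int pollAndRemove(ReferenceQueue<Long> keyRefQ,
                             Map<WeakReference<Long>, WeakReference<Double>> map)
    {
        int removed = 0;
        Reference<? extends Long> ref;
        while ((ref = keyRefQ.poll()) != null)
        {
            map.remove(ref);    // ref is the original WeakReference key instance
            removed++;
        }
        return removed;
    }
}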
Even when your weak references let go of the things they are referencing, they still do not get recycled themselves.
So eventually your hash will fill up with references to nothing and crash.
What you would need (if you wanted to do it this way) would be an event triggered by object collection that goes in and removes the reference from the hash (which would cause threading issues you need to be aware of as well).
I'm not a java expert at all, but I know in .NET when doing a lot of large object memory allocation you can get heap fragmentation to the point where only small pieces of contiguous memory are available for allocation even though much more memory appears as "free".
A quick google search on "java heap fragmentation" brings up some seemingly relevant results, although I haven't taken a good look at them.
Others have correctly pointed out what the problem is; e.g. #roe, #Bill K.
But another way to solve this kind of problem (apart from scratching your head, asking on SO, etc.) is to look at how the Sun-recommended approach works. In this case, you can find it in the source code for the WeakHashMap class.
There are a few ways to find Java source code:
If you have a decent Java IDE running, it should be able to show you the source code of any class in the class library.
Most J2SE JDK downloads include source JAR files for (at least) the public API classes.
You can specifically download full source distributions for the OpenJDK-based releases of Java.
But the ZERO EFFORT approach is to do a Google search, using the fully qualified name of the class with ".java.html" tacked on the end. For example, searching for "java.util.WeakHashMap.java.html" gives this link in the first page of search results.
And the source will tell you that the standard WeakHashMap implementation explicitly polls its reference queue to expunge stale (i.e. broken) weak references from the map's key set. In fact, it does this every time you access or update the map, or even just ask for its size.
Another problem might be that Java, for some reason, doesn't always activate its garbage collector when running out of memory, so you might need to insert explicit calls to activate the collector. Try something like
if ((putCount % 1000) == 0)
Runtime.getRuntime().gc();
in your loop.
Edit: It seems that the newer Java implementations from Sun do call the garbage collector before throwing an OutOfMemoryError, but I am pretty sure that the following program would throw one with JRE 1.3 or 1.4:
public class Test {
    public static void main(String args[]) {
        while (true) {
            byte[] data = new byte[1000000];
        }
    }
}