I have code which JSON-parses the whole file with Jackson:
ConcurrentHashMap<String, ValueClass<V>> space = new ConcurrentHashMap<String, ValueClass<V>>();
Map<String, ValueClass<V>> map = mapper.readValue(new FileReader(fileName), ...);
for (Map.Entry<String, ValueClass<V>> entry : map.entrySet()) {
    space.put(entry.getKey(), entry.getValue());
}
The memory used to parse the file is large, much more than the size of the file itself. To save memory, I decided to replace this code with the Jackson streaming API as follows:
ConcurrentHashMap<String, ValueClass<V>> space = new ConcurrentHashMap<String, ValueClass<V>>();
JsonFactory f = new MappingJsonFactory();
JsonParser jp = f.createJsonParser(new File(fileName));
JsonToken current;
current = jp.nextToken();
if (current != JsonToken.START_OBJECT) {
    // show error and return
    return;
}
ObjectMapper mapper = new ObjectMapper();
while (jp.nextToken() != JsonToken.END_OBJECT) {
    String key = jp.getCurrentName();
    current = jp.nextToken();
    if (key != null) {
        if (current == JsonToken.START_OBJECT) {
            long mem_before = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
            ValueClass<V> value = mapper.readValue(jp, ValueClass.class); // (1)
            long mem_after = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
            long mem_diff = mem_after - mem_before;                       // (2)
            space.put(key, value);                                        // (3)
        } else {
            jp.skipChildren();
        }
    } else {
        jp.skipChildren();
    }
}
However, it uses even more memory than parsing the whole file at once, and the measurements show the increase is due to (1) (I measure allocated memory as Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory() before and after (1) and take the difference).
Shouldn't using the streaming API save memory, as http://www.ngdata.com/parsing-a-large-json-file-efficiently-and-easily and other articles say?
EDIT
@StaxMan: Thanks for answering. It looks like you are very familiar with the memory consumption we should expect.
My purpose is to load the file's content into memory so that I do not need to search the file itself, so I do need all the content in the map at all times.
The reason I think streaming may help is this: when loading the entire file as one object, besides the memory consumed by the map variable space, I guess we need a large amount of additional memory to parse the whole file, which is a big burden. With streaming, although we still need the same memory for space, we should only need a small amount of additional memory to parse each entry, because that memory is reused per entry instead of being allocated for the whole file. Doesn't it therefore save memory? Please correct me if I am wrong.
I understand that the in-memory size may exceed the file size. But first, I do not understand why it needs so much more than the file (2.7 times the file size in my case); second, I do not understand why streaming uses even more memory (double the non-streaming figure). A memory leak? I do not see any problem in my code; it should have been no more than 2.7 times at worst.
Besides, do you know how to estimate the memory consumption of an object, such as each entry of my map variable space?
It would help if you explained what you are trying to achieve. For example, do you absolutely need all the values in memory at all times? Or would it be possible to process values one at a time and avoid building that Map with all keys and values?
Your second attempt does not make much sense without an explanation of what you are trying to do there; but essentially the amount of memory used should be the space for the value POJOs, the keys, and the Map that contains them. This may be more or less than the input JSON, depending on the kind of content you are dealing with. Strings, for example, consume more memory in the JVM than in a JSON file: most characters in UTF-8 are a single byte, whereas each char in the JVM is a 16-bit value (UCS-2 encoded). Further, whereas a JSON String has 3 or 4 bytes of overhead (quotes, separators), the memory overhead of java.lang.String is more like 16 or 24 bytes per String.
Similarly, Maps consume quite a bit of memory for structure that the JSON file does not need. JSON has a couple of separators per entry, maybe 6-8 bytes (depending on whether indentation is used). Java Maps, on the other hand, have a bigger per-entry overhead (maybe 16 bytes?) in addition to the String keys (with the above-mentioned overhead) and the POJO values.
So, all in all, it is not uncommon for objects in memory to consume a bit more memory than what the JSON file itself contains.
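To put rough numbers on that (using the approximate per-object figures above, so treat this as an estimate only): a 10-character ASCII key costs about 13 bytes in the JSON file (10 bytes of UTF-8 plus quotes and a separator), but in the JVM it becomes 20 bytes of char data plus 16-24 bytes of String overhead, plus roughly 16 bytes of Map entry overhead, i.e. 50-60 bytes before the POJO value is even counted. With overheads like that, an overall in-memory footprint of 2-3 times the file size is entirely plausible.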
Related
I'm trying to load in a CSV file with a huge number of lines (>5 million), but it slows down massively when trying to process them all into an ArrayList of each value.
I've tried a few different variations of reading and removing from the input list I loaded from the file, but it still ends up running out of heap space, even when I allocate 14 GB to the process, while the file is only 2 GB.
I know I need to be removing values so that I don't end up with duplicate references in memory, i.e. an ArrayList of lines and also an ArrayList of the individual comma-separated values, but I have no idea how to do that.
Edit: For reference, in this particular situation, data should end up containing 16 * 5 million values.
If there's a more elegant solution, I'm all for it.
The intention when loading this file is to process it as a database, with the appropriate methods like select and select where, all handled by a sheet class. It worked just fine with my smaller sample file of 36k lines, but I guess it doesn't scale very well.
Current code :
//Load method to load it from file
private static CSV loadCSV(String filename, boolean absolute)
{
String fullname = "";
if (!absolute)
{
fullname = baseDirectory + filename;
if (!Load.exists(fullname,false))
return null;
}
else if (absolute)
{
fullname = filename;
if (!Load.exists(fullname,false))
return null;
}
ArrayList<String> output = new ArrayList<String>();
AtomicInteger atomicInteger = new AtomicInteger(0);
try (Stream<String> stream = Files.lines(Paths.get(fullname)))
{
stream.forEach(t -> {
output.add(t);
atomicInteger.getAndIncrement();
if (atomicInteger.get() % 10000 == 0)
{
Log.log("Lines done " + output.size());
}
});
CSV c = new CSV(output);
return c;
}
catch (IOException e)
{
Log.log("Error reading file " + fullname,3,"FileIO");
e.printStackTrace();
}
return null;
}
//Process method inside CSV class
public CSV(List<String> output)
{
Log.log("Inside csv " + output.size());
ListIterator<String> iterator = output.listIterator();
while (iterator.hasNext())
{
ArrayList<String> d = new ArrayList<String>(Arrays.asList(iterator.next().split(splitter,-1)));
data.add(d);
iterator.remove();
}
}
You should use a database, which provides the required functionality for your task (select, group by).
Any database can efficiently read and aggregate 5 million rows.
Don't try to do these operations on an ArrayList; that works well only on small datasets.
I think some key concepts are missing here:
You said the file size is 2 GB. That does not mean that when you load that file's data into an ArrayList, the size in memory will also be 2 GB. Why? Usually files store data using the UTF-8 character encoding, whereas the JVM internally stores String values in UTF-16. So, assuming your file contains only ASCII characters, each character occupies 1 byte on disk but 2 bytes in memory. Assuming (for the sake of simplicity) all String values are unique, there will also be space required to store the String references, which are 32 bits each (assuming a 64-bit system with compressed oops). How much is your heap (excluding other memory areas)? How much is your eden space and old space? I'll come back to this again shortly.
In your code, you don't specify the ArrayList size. This is a blunder in this case. Why? The JVM creates a small ArrayList. After some time the JVM sees that this guy keeps pumping in data, so it creates a bigger ArrayList and copies the data of the old ArrayList into the new one. This has some deeper implications when you are dealing with such a huge volume of data: firstly, both the old and new arrays (with millions of entries) are in memory simultaneously, occupying space; secondly, the data is copied unnecessarily from one array to another, not once or twice but repeatedly, every time the array runs out of space. What happens to the old array? It is discarded and needs to be garbage collected. So these repeated array copies and garbage collections slow down the process; the CPU is really working hard here. What happens when your data no longer fits into the young generation (which is smaller than the heap)? Maybe you need to watch the behaviour with something like JVisualVM.
All in all, what I mean to say is that there are a good number of reasons why a 2 GB file fills up your much larger heap and why your process performance is poor.
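If keeping everything in memory is still the goal, a minimal sketch of the pre-sizing idea looks like this (the capacity of 5 million and the comma separator are assumptions based on the question; adjust them to the real data):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.stream.Stream;

static ArrayList<String[]> loadRows(String fullname) throws IOException {
    // Pre-size the list so the backing array is allocated once instead of being
    // grown and copied repeatedly; 5_000_000 matches the line count in the question.
    ArrayList<String[]> rows = new ArrayList<>(5_000_000);
    try (Stream<String> lines = Files.lines(Paths.get(fullname))) {
        // Split each line straight into a String[] so only one large collection is built,
        // not one list of raw lines plus another of split values.
        lines.forEach(line -> rows.add(line.split(",", -1)));
    }
    return rows;
}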
I would have a method that took a line read from the file as a parameter, split it into a list of strings, and then returned that list. I would then add that list to the CSV object in the file-reading loop. That would mean only one large collection instead of two, and the read lines could be freed from memory sooner.
Something like this
CSV csv = new CSV();
try (Stream<String> stream = Files.lines(Paths.get(fullname))) {
    stream.forEach(t -> {
        List<String> splittedString = splitFileRow(t);
        csv.add(splittedString);
    });
}
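A possible shape for splitFileRow and CSV.add (both names come from the snippet above; the bodies and the comma separator are a sketch, not the poster's actual code):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Splits one line from the file into its fields; the "," separator is an assumption.
static List<String> splitFileRow(String line) {
    return new ArrayList<>(Arrays.asList(line.split(",", -1)));
}

// Inside the CSV class: append one already-split row to the single large collection.
public void add(List<String> row) {
    data.add(row);
}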
Trying to solve this problem in pure Java is overwhelming. I suggest using a processing engine like Apache Spark, which can process the file in a distributed way by increasing the level of parallelism.
Apache Spark has a specific API for loading CSV files:
spark.read.format("csv").option("header", "true").load("../Downloads/*.csv")
You can transform it into an RDD or a DataFrame and perform operations on it.
You can find more details online.
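For reference, a minimal sketch of the same thing through Spark's Java API (the app name, local master, file path, and column name are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("csv-loader")   // placeholder app name
        .master("local[*]")      // run locally on one machine
        .getOrCreate();

// Spark reads and parses the CSV lazily and can spill to disk,
// so the whole file never has to fit into the JVM heap at once.
Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", "true")
        .load("../Downloads/file.csv"); // placeholder path

df.groupBy("someColumn").count().show(); // example "select ... group by" style query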
I am wondering what to do about large files of JSON.
I don't want to store them in memory, so I don't want to use JsonNode, because I think that stores the entire tree in memory.
My other idea was to use TokenBuffer. However, I'm wondering how this works. Does the TokenBuffer store the entire document as well? Is there a max limit? I know that it's a performance best practice, but if I do:
TokenBuffer buff = jParser.readValueAs(TokenBuffer.class);
It seems like it reads the whole document at once (which I don't want).
The purpose of TokenBuffer is to store an extensible array of JSON tokens in memory. It does that by first creating one Segment object holding 16 JsonToken objects and then adding new Segment objects as needed.
You are correct to guess that the entire document will be loaded into memory. The only difference is that instead of storing an array of chars, it stores the tokens. The performance advantages, according to the docs:
You can reprocess JSON tokens without re-parsing JSON content from textual representation.
It's faster if you want to iterate over all tokens in the order they were appended in the buffer.
TokenBuffer is not a low level buffer of a disk file in memory.
If you only want to parse a file once, without loading all of it into memory at once, skip the TokenBuffer. Just create a JsonParser from a JsonFactory or MappingJsonFactory and pull tokens with nextToken; a sketch is shown below.
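A minimal sketch of that token-by-token loop (the file name and the printing are placeholders; createParser is the Jackson 2.x name, older versions call it createJsonParser):

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.File;
import java.io.IOException;

public class StreamingRead {
    public static void main(String[] args) throws IOException {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(new File("data.json"))) { // placeholder file
            JsonToken token;
            while ((token = parser.nextToken()) != null) {
                // Only the current field name or scalar value is touched here; nothing is buffered.
                if (token == JsonToken.FIELD_NAME) {
                    System.out.println("field: " + parser.getCurrentName());
                } else if (token.isScalarValue()) {
                    System.out.println("value: " + parser.getText());
                }
            }
        }
    }
}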
I want to create a JSON string from an object.
ObjectMapper om = new ObjectMapper();
String str = om.writeValueAsString(obj);
Some objects are large, and it takes a long time to create the JSON string.
To create an 8 MB JSON string, it takes about 15 seconds.
How can I improve this?
Make sure you have enough memory: a Java String storing 8 MB of serialized JSON needs about 16 megabytes of contiguous memory in the heap.
But more importantly: why are you creating a java.lang.String in memory?
What possible use is there for such a huge String?
If you need to write JSON content to a file, there are different methods for that; similarly for writing to a network socket. At the very least you could produce the output as a byte[] (which takes 50% less memory), but in most cases incremental writing to an external stream requires very little memory.
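To illustrate the stream-writing point, a small sketch reusing the om and obj from the question (the output file name is a placeholder, and java.io.File is assumed to be imported):

// Stream the JSON straight into a file; only Jackson's small internal buffers
// are needed, never a full in-memory copy of the whole document.
om.writeValue(new File("output.json"), obj);

// If bytes really are needed in memory, UTF-8 takes about half the space of a String:
byte[] json = om.writeValueAsBytes(obj);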
15 seconds is definitely very slow. Without GC problems, and after initial warmup, Jackson should write 8 megabytes in a fraction of a second, something like 10-20 milliseconds for a simple object consisting of standard Java types.
EDIT:
Just realized that during construction of the result String, temporary memory usage will be doubled as well, since the buffered content is not yet cleared when the String is constructed. So an 8 MB result would need at least 32 MB to construct the String. With a default heap of 64 MB this would not work well.
I have two large CSV files which contain data that is required for users of a web application to validate some info. I defined an ArrayList<String[]> and intended to keep the contents of both files in memory so I wouldn't have to read them each time a user logged in and used the application.
I'm getting a java.lang.OutOfMemoryError: Java heap space, though, when initializing the application and trying to read the second file. (It finishes reading the first file just fine but hangs when reading the second file, and after a while I get that exception.)
The code for reading the files is pretty straightforward:
ArrayList<String[]> tokenizedLines = new ArrayList<String[]>();

public void parseTokensFile() throws Exception {
    BufferedReader bRead = null;
    FileReader fRead = null;
    try {
        fRead = new FileReader(this.tokensFile);
        bRead = new BufferedReader(fRead);
        String line;
        while ((line = bRead.readLine()) != null) {
            tokenizedLines.add(StringUtils.split(line, fieldSeparator));
        }
    } catch (Exception e) {
        throw new Exception("Error parsing file.", e); // keep the original cause
    } finally {
        if (bRead != null) bRead.close(); // guard against NPE if the reader was never opened
        if (fRead != null) fRead.close();
    }
}
I read that Java's split function can use up a lot of memory when reading large amounts of data, since the substring function keeps a reference to the original string, so a substring of some String will use as much memory as the original even though we only want a few chars. To try to avoid this, I made a simple split function of my own:
public String[] split(String inputString, String separator) {
ArrayList<String> storage = new ArrayList<String>();
String remainder = new String(inputString);
int separatorLength = separator.length();
while (remainder.length() > 0) {
int nextOccurance = remainder.indexOf(separator);
if (nextOccurance != -1) {
storage.add(new String(remainder.substring(0, nextOccurance)));
remainder = new String(remainder.substring(nextOccurance + separatorLength));
} else {
break;
}
}
storage.add(remainder);
String[] tokenizedFields = storage.toArray(new String[storage.size()]);
storage = null;
return tokenizedFields;
}
This gives me the same error though, so I'm wondering if it's not a memory leak but simply that I can't have structures with so many objects in memory. One file is about 600'000 lines long, with 5 fields per line, and the other is around 900'000 lines long with about the same number of fields per line.
The full stacktrace is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at xxx.xxx.xxx.StringUtils.split(StringUtils.java:16)
at xxx.xxx.xxx.GFTokensFile.parseTokensFile(GFTokensFile.java:36)
So, after the long post (sorry :P), is this a restriction of the amount of memory assigned to my JVM or am I missing something obvious and wasting resources somewhere?
Your JVM won't get more than 2GB on a 32-bit operating system with 4GB of RAM. That's one upper limit.
The second is the max heap size you specify when you start the JVM. Look at that -Xmx parameter.
The third is the fact of life that you cannot fit X units of anything into a Y sized container where X > Y. You know the size of your files. Try parsing each one individually and seeing what kind of heap they're consuming.
I'd recommend that you download Visual VM, install all the available plugins, and have it monitor your application while it's running. You'll be able to see the entire heap, perm gen space, GC collection, what objects are taking up the most memory, etc.
Getting data is invaluable for all problems, but especially ones like this. Without it, you're just guessing.
I cannot see a storage leak in the original version of the program.
The scenarios where split and similar methods can leak significant storage are rather limited:
You have to NOT be retaining a reference to the original string that you split.
You need to be retaining references to a subset of the strings produced by the string splitting.
What happens when String.substring() is called is that it creates a new String object that shares the original String's backing array. If the original String reference is then garbage collected, the substring is left holding onto an array of characters that includes characters that are not "in" the substring. This can be a storage leak, depending on how long the substring is kept.
In your example, you are keeping strings that contain all characters apart from the field separator character. There is a good chance that this is actually saving space compared to the space used if each substring were an independent String. Certainly, it is no surprise that your version of split doesn't solve the problem.
I think you need to either increase the heap size, or change your application so that it doesn't need to keep all of the data in memory at the same time.
Try improving your code, or leave the data processing to a database.
The memory usage is larger than your file sizes because the code makes redundant copies of the processed data: there is the data still to be processed, the processed data, and some partial data.
String is immutable; there is no need to use new String(...) to store the result, split already makes that copy.
If you can, delegate the whole data storage and searching to a database. CSV files are easily imported into and exported from databases, and they do all the hard work.
While I wouldn't recommend actual string interning for what you are doing, how about using the idea behind that technique? You could use a HashSet or HashMap to make sure you only use a single String instance whenever your data contains the same sequence of characters. I mean, there must be some kind of overlap in the data, right?
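A small sketch of that deduplication idea, applied to the split fields (the pool map and the line variable are illustrative, not from the question's code):

import java.util.HashMap;
import java.util.Map;

// Canonicalizing map: the first time a field value is seen it is stored;
// afterwards the existing instance is reused instead of keeping a fresh String.
Map<String, String> pool = new HashMap<String, String>();

String[] fields = line.split(",", -1);
for (int i = 0; i < fields.length; i++) {
    String canonical = pool.get(fields[i]);
    if (canonical == null) {
        pool.put(fields[i], fields[i]);
    } else {
        fields[i] = canonical; // reuse the instance already in the pool
    }
}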
On the other hand, what you might be seeing here could be a bad case of heap fragmentation. I'm not sure how the JVM handles these cases, but in the Microsoft CLR larger objects (especially arrays) will be allocated on a separate heap. Growth strategies, such as those of the ArrayList will create a larger array, then copy over the content of the previous array before releasing the reference to it. The Large Object Heap (LOH) isn't compacted in the CLR, so this growth strategy will leave huge areas of free memory that the ArrayList can no longer use.
I don't know how much of that applies to the Java VM, but you could try building the list using a LinkedList first and then dumping the list contents into an ArrayList or directly into an array. That way the large array of lines would be created only once, without causing any fragmentation.
Be sure that the total length of both files is lower than your heap size. You can set the max heap size using the JVM option -Xmx.
Then, if you have so much content, maybe you shouldn't load it entirely into memory. I once had a similar problem and fixed it using an index file that stored the offsets of the information in the large file; then I just had to read one line at the right offset.
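A rough sketch of that index-file idea (the file name and the looked-up line number are placeholders, and exception handling is omitted):

import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Pass 1: remember the byte offset at which each line starts.
List<Long> lineOffsets = new ArrayList<Long>();
try (RandomAccessFile raf = new RandomAccessFile("big.csv", "r")) { // placeholder file
    long offset = raf.getFilePointer();
    while (raf.readLine() != null) {
        lineOffsets.add(offset);
        offset = raf.getFilePointer();
    }
}

// Later: fetch a single line on demand instead of keeping them all in memory.
try (RandomAccessFile raf = new RandomAccessFile("big.csv", "r")) {
    raf.seek(lineOffsets.get(42)); // any line number
    String line = raf.readLine();
}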
Also, in your split method there are some strange things.
String remainder = new String(inputString);
You don't have to take care to preserve inputString by making a copy; String is immutable, so changes only apply within the scope of the split method.
I am trying to build a map from the content of a file, and my code is as below:
System.out.println("begin to build the sns map....");
String basePath = PropertyReader.getProp("oldbasepath");
String pathname = basePath + "\\user_sns.txt";
FileReader fr;
Map<Integer, List<Integer>> snsMap =
new HashMap<Integer, List<Integer>>(2000000);
try {
fr = new FileReader(pathname);
BufferedReader br = new BufferedReader(fr);
String line;
int i = 1;
while ((line = br.readLine()) != null) {
System.out.println("line number: " + i);
i++;
String[] strs = line.split("\t");
int key = Integer.parseInt(strs[0]);
int value = Integer.parseInt(strs[1]);
List<Integer> list = snsMap.get(key);
//if the follower is not in the map
if(snsMap.get(key) == null)
list = new LinkedList<Integer>();
list.add(value);
snsMap.put(key, list);
System.out.println("map size: " + snsMap.size());
}
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("finish building the sns map....");
return snsMap;
The program is very fast at first but slows down considerably when the printed information looks like this:
map size: 1138338
line number: 30923602
map size: 1138338
line number: 30923603
....
I tried to find the reason with the two System.out.println() calls, to judge the performance of BufferedReader and HashMap, instead of using a Java profiler.
Sometimes it takes a while to get the map-size line after the line-number line, and sometimes it takes a while to get the line-number line after the map-size line. My question is: what makes my program slow? The BufferedReader for a big file, or the HashMap for a big map?
If you are testing this from inside Eclipse, you should be aware of the huge performance penalty of writing to stdout/stderr, due to Eclipse capturing that output in the Console view. Printing inside a tight loop is always a performance issue, even outside of Eclipse.
But if what you are complaining about is the slowdown experienced after processing 30 million lines, then I bet it's a memory issue. First it slows down due to intense GC'ing, and then it breaks with an OutOfMemoryError.
You will have to check your program with some profiling tool to understand why it is slow.
In general, file access is much slower than in-memory operations (unless you are constrained on memory and doing excessive GC), so the guess would be that reading the file is the slower part here.
Until you have profiled, you will not know what is slow and what isn't.
Most likely, System.out will show up as the bottleneck, and you'll then have to profile without it again. System.out is the worst thing you can do for finding performance bottlenecks, because in doing so you usually add an even worse bottleneck.
An obvious optimization for your code is to move the line
snsMap.put(key, list);
inside the if statement. You only need to call put when you have created a new list; otherwise, the put just replaces the current value with itself.
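Applied to the loop from the question, the change looks roughly like this:

List<Integer> list = snsMap.get(key);
if (list == null) {
    // follower not yet in the map: create the list and register it once
    list = new LinkedList<Integer>();
    snsMap.put(key, list);
}
list.add(value);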
The Java cost associated with Integer objects (and in particular the use of Integer in the Java Collections API) is largely a memory (and thus garbage collection!) issue. You can sometimes get significant gains by using primitive collections such as GNU Trove, depending on how well you can adjust your code to use them efficiently. Most of the gains of Trove are in memory usage. Definitely try rewriting your code to use TIntArrayList and TIntObjectMap from GNU Trove. I'd avoid linked lists, too, in particular for primitive types.
Roughly estimated, a HashMap<Integer, List<Integer>> needs at least 3*16 bytes per entry. The doubly linked list again needs at least 2*16 bytes per entry stored. 1M keys + 30M values ~ 1 GB, with no overhead included yet. With GNU Trove's TIntObjectHash<TIntArrayList> that should be 4+4+16 bytes per key and 4 bytes per value, so roughly 144 MB. The overhead is probably similar for both.
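A sketch of what the Trove version of the map-building loop might look like (package and class names assume Trove 3.x; key and value are the ints parsed from each line, as in the question):

import gnu.trove.list.array.TIntArrayList;
import gnu.trove.map.hash.TIntObjectHashMap;

TIntObjectHashMap<TIntArrayList> snsMap = new TIntObjectHashMap<TIntArrayList>(2000000);

// key and value stay primitive ints: no Integer boxing, no per-node list objects
TIntArrayList list = snsMap.get(key);
if (list == null) {
    list = new TIntArrayList();
    snsMap.put(key, list);
}
list.add(value);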
The reason Trove uses less memory is that its types are specialized for primitive values such as int. They store the int values directly, using 4 bytes each.
A Java Collections HashMap consists of many objects. Roughly, it looks like this: there are Entry objects that each point to a key object and a value object. These must be objects, because of the way generics are handled in Java. In your case, the key will be an Integer object, which uses 16 bytes (4 bytes mark, 4 bytes type, 4 bytes actual int value, 4 bytes padding) AFAIK. These are all 32-bit system estimates. So a single entry in the HashMap will probably need some 16 (Entry) + 16 (Integer key) + 32 (still-empty LinkedList) bytes of memory, all of which also need to be considered for garbage collection.
If you have lots of Integer objects, they will take 4 times as much memory as if you could store everything using int primitives. This is the cost you pay for the clean OOP principles realized in Java.
The best way is to run your program with a profiler (for example, JProfiler) and see which parts are slow. Also, debug output can slow your program down, for example.
HashMap is not slow; in fact it's the fastest among the maps. Hashtable is the only thread-safe one among the maps, and can be slow sometimes.
Important note: close the BufferedReader and the file after you read the data; this might help.
eg: br.close()
file.close()
Please check your system processes in Task Manager; there may be too many processes running in the background.
Sometimes Eclipse is really resource-heavy, so try running it from the console to check.