This is to read a file faster not write it.
I have a 150MB file which has a JSON object inside it. I currently use the following code to read it:
String filename ="/tmp/fileToRead";
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filename), Charset.forName("UTF-8")));
decompressedString = reader.readLine();
reader.close();
JSONObject obj = new JSONObject(decompressedString);
JSONArray profileData = obj.getJSONObject("profileData").getJSONArray("children");
....
It is a single-line file, and since it is JSON I can't split it (or at least I think so). Reading the file gives me an OutOfMemoryError or a TLE (time limit exceeded). The file takes more than 7 seconds to read, which causes the TLE, since the execution of the whole code cannot exceed 7 seconds. I get the OOM on decompressedString = reader.readLine();.
Is there a way I can reduce the memory used or the time it takes to be read completely?
You have several problems at hand:
You're preemptively parsing too much.
The error you get happens already when you read the line since you said "I get the OOM on decompressedString = reader.readLine();".
You should never try to read arbitrary data line by line. BufferedReader.readLine() blocks until it has read the character \r or \n or the sequence \r\n (or has reached the end of the stream). When processing data of any length, you're never sure you'll get one of those characters. You're also never sure you'll get one of those characters outside of the data itself. So your string may be too long or malformed. Don't ever presume to know the format: BufferedReader.readLine() should be used when parsing, not when acquiring data.
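To illustrate acquiring data without depending on line terminators, here is a minimal sketch that reads fixed-size chunks from a Reader. The class name ChunkedRead, the method countChars and the 8192-character buffer size are assumptions made for illustration; a real consumer would pass each chunk on to a streaming parser instead of just counting characters.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ChunkedRead {
    // Reads the whole input in fixed-size chunks and returns the number of
    // characters seen. Memory use is bounded by the buffer size, no matter
    // how long a single "line" of the input is.
    static long countChars(Reader in) {
        char[] buffer = new char[8192]; // chunk size is an arbitrary choice
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer, 0, buffer.length)) != -1) {
                // A real consumer would hand buffer[0..n) to a streaming parser here.
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Input with no line terminator at all is handled like any other input.
        System.out.println(countChars(new StringReader("{\"profileData\":{}}")));
    }
}
```

The point of the sketch is only that the loop never waits for \r or \n, so a 150MB single-line file costs no more memory than the buffer.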
You're not using an appropriate library for your use-case
Reading your JSON is important, yes, but you're reading too much at once. When creating your JSON object, you might want to build it from a stream (an InputStream, a Reader, or one of NIO's Channel/Buffer types).
Currently you're building your JSON object from a String. A huge one. So I can safely assume that at some point you're going to need twice the memory: once for the String and once for the finished object.
To reduce that, use an appropriate library to which you can pass one of the streams mentioned above. I mentioned the following in my comments: Gson, JSON.simple and Jackson.
Your file may be too big anyways.
You get your data, but you only want a subset of it (here, everything under {"profileData":{"children": <DATA>}}), so you probably have way too much. How many elements exist at the same level as profileData? How many elements exist at the same level as children? Do you know? Probably way too many. Everything that is not under profileData.children is useless to you. What percentage of your total data is that? 50%? 90%? 99%?
To solve this, you probably want one of two things: you want less data or you want to be able to focus your request.
If you want less data, ask your data provider to give you less: only what you need. Why get more than that? It makes no sense. Tell him so and say "I want less".
If you want focused data, use a library that allows you to both parse and reduce the amount of data. You might want a library that lets you say: "parse this JSON and return only the profileData.children element". Unfortunately I know of no library that does exactly that. If others do, please add a comment or answer. Apparently, Gson is able to do so if you use the JsonReader yourself and selectively call skipValue().
I have a text file with entries like below.
{"id":"event1","state":"start","timestamp":"11025373"}
{"id":"event1","state":"end","timestamp":"11025373"}
{"id":"event2","state":"start","timestamp":"11025387"}
{"id":"event3","state":"start","timestamp":"11025388"}
{"id":"event3","state":"end","timestamp":"11025391"}
{"id":"event2","state":"end","timestamp":"11025397"}
I want to read the file as input and compute the time consumed by each event (end - start) using Java. For example:
event1 has taken (11025373 - 11025373) = 0ms time.
event2 has taken (11025397 - 11025387) = 10ms time.
I initially thought to read line by line.
File file = new File("C:\\Users\\xyz\\inputfile.txt");
BufferedReader br = new BufferedReader(new FileReader(file));
String line;
while ((line = br.readLine()) != null)
LOGGER.info(line);
Considering the input file size can be very large, is this the right approach?
Any suggestion for the best approach would be helpful. Also, how do I compare the objects in the file, i.e. compare the "start" of event1 to the "end" of event1, if I go line by line?
Considering the input file size can be very large, this is not suitable, I feel.
This is bizarre. It is, in fact, precisely the right approach. The wrong approach would be to read the entire thing in.
The only exception is if a single line can itself be truly humongous (let's say 128MB or up - that's.. a heck of a long line).
That is JSON format, you need a JSON reader. I suggest Jackson.
Make a class with the structure of that line, presumably something like:
enum State {
start, end;
}
class Event {
    // public so that Jackson's default field detection can populate them
    public String id;
    public State state;
    public long timestamp;
}
Then, read a single line, ask Jackson to turn that line into an instance of Event, process it, and repeat until you're done with the file. This will let you process a file that is many GBs in size if you want, as long as any given line is not ridiculously long.
If a single line is ridiculously long: Well, JSON is not really designed for 'streaming', and most JSON libraries therefore don't do it, or at least don't make it easy. I therefore strongly suggest you don't attempt to write something that can 'stream' a single line unless you're sure you really need to do this.
The only slightly complicated thing here is that you need to remember the last read entry, so that you can update its 'time taken' property at that point, as you can only know that once you read the line after the right entry. This is basic programming though.
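As a rough sketch of that bookkeeping: in the sample input event2 and event3 overlap, so this version remembers every start that has not yet been matched (in a map keyed by id) rather than only the last entry read. To stay dependency-free it uses a regex as a stand-in for a real JSON library such as Jackson, which only works for the exact flat format shown in the question. The names EventDurations, durations and pendingStarts are made up for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EventDurations {
    // Matches the flat one-object-per-line format from the question.
    // The regex stands in for a JSON parser only to keep the sketch self-contained.
    private static final Pattern LINE = Pattern.compile(
        "\\{\"id\":\"([^\"]+)\",\"state\":\"(start|end)\",\"timestamp\":\"(\\d+)\"\\}");

    // Returns event id -> (end - start), in the order the events finished.
    static Map<String, Long> durations(Iterable<String> lines) {
        Map<String, Long> pendingStarts = new HashMap<>(); // "start" seen, "end" not yet
        Map<String, Long> result = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (!m.matches()) {
                continue; // skip lines that are not in the expected format
            }
            String id = m.group(1);
            long timestamp = Long.parseLong(m.group(3));
            if (m.group(2).equals("start")) {
                pendingStarts.put(id, timestamp);
            } else {
                Long start = pendingStarts.remove(id); // forget it once matched
                if (start != null) {
                    result.put(id, timestamp - start);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "{\"id\":\"event2\",\"state\":\"start\",\"timestamp\":\"11025387\"}",
            "{\"id\":\"event3\",\"state\":\"start\",\"timestamp\":\"11025388\"}",
            "{\"id\":\"event3\",\"state\":\"end\",\"timestamp\":\"11025391\"}",
            "{\"id\":\"event2\",\"state\":\"end\",\"timestamp\":\"11025397\"}");
        System.out.println(durations(sample)); // {event3=3, event2=10}
    }
}
```

For a file of many GBs you would feed the lines from a BufferedReader and print or store each duration as soon as its "end" line arrives, instead of collecting them all in the result map.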
I need to build an application which scans through a large number of files. These files contain blocks with some data about a session, in which each line has a different value, e.g. "=ID: 39487".
At that point I have that line, but the problem I now face is that I need the value n lines above that ID. I was thinking about an Iterator but it only has forward methods. I also thought about saving the results in a List but that defies the reason to use Stream and some files are huge so that would cause memory problems.
I was wondering if something like this is possible using the Stream API (Files)? Or perhaps a better question, is there a better way to approach this?
Stream<String> lines = Files.lines(Paths.get(file.getName()));
Iterator<String> search = lines.iterator();
You can't arbitrarily read backwards and forwards through the file with the same reader (no matter if you're using streams, iterators, or a plain BufferedReader.)
If you need:
m lines before a given line
n lines after the given line
You don't know the value of m and n in advance, until you reach that line
...then you essentially have three options:
Read the whole file once, keep it in memory, and then your task is trivial (but this uses the most memory.)
Read the whole file once, mark the line numbers that you need, then do a second pass where you extract the lines you require.
Read the whole file once, storing some form of metadata about line lengths as you go, then use a RandomAccessFile to extract the specific bits you need without having to read the whole file again.
I'd suggest given the files are huge, the second option here is probably the most realistic. The third will probably give you better performance, but will require much more in the way of development effort.
As an alternative if you can guarantee that both n and m are below a certain value, and that value is a reasonable size - you could also just keep a certain number of lines in a buffer as you're processing the file, and read through that buffer when you need to read lines "backwards".
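A minimal sketch of that buffer idea, assuming the distance you need to look back is known and small. The names LookBehind, linesBefore, marker and back are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class LookBehind {
    // For every line containing `marker`, returns the line that appears
    // `back` lines above it. Only the most recent `back` lines are buffered.
    static List<String> linesBefore(BufferedReader reader, String marker, int back) {
        if (back < 1) {
            throw new IllegalArgumentException("back must be at least 1");
        }
        Deque<String> recent = new ArrayDeque<>(back + 1);
        List<String> found = new ArrayList<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(marker) && recent.size() == back) {
                    found.add(recent.peekFirst()); // oldest buffered line is `back` lines up
                }
                recent.addLast(line);
                if (recent.size() > back) {
                    recent.removeFirst(); // forget lines that are too far behind
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String data = "user: alice\ntime: 12:00\n=ID: 39487\n";
        BufferedReader reader = new BufferedReader(new StringReader(data));
        System.out.println(linesBefore(reader, "=ID: 39487", 2)); // [user: alice]
    }
}
```

With a real file you would pass Files.newBufferedReader(path) instead of the StringReader; memory use stays proportional to `back`, not to the file size.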
Try my library, abacus-util:
try(Reader reader = new FileReader(yourFile)) {
StreamEx.of(reader)
.sliding(n, n, ArrayList::new)
.filter(l -> l.get(l.size() - 1).contains("=ID: 39487"))
./* then do your work */
}
This works no matter how big your file is, as long as n is a small number (not millions).
I need to parse files that may be quite large, possibly hundreds of megabytes and millions of lines. I have been trying to do this using FlatPack. I would think the way to do this would be to use the buffered parsers and the new stream methods. But although dataset.next() returns true for the correct number of records, the Optional returned by dataset.getRecord() never contains a value.
I have looked at this example/test, but it only counts the number of records and does not actually do anything with the content.
You can use the class BuffReaderParseFactory instead of DefaultParserFactory.
It will read one record from the input file only when you call "next()".
The explanations for both DefaultParserFactory and BuffReaderParseFactory are not exactly helpful. Both factories are said to return a PZParser (from newDelimitedParser), but only one of them returns an actual value from a record. Based on the examples I've seen, I think BuffReaderParseFactory is just for checking performance (hence it should be faster), whereas DefaultParserFactory contains all the records.
I'm running a small program that processes around 215K of records in the database. These records contain xml that is used by JaxB to marshal and unmarshal to objects.
The program I was running was trying to find XML documents that, for legacy reasons, couldn't be unmarshalled anymore. Each time I got the unmarshal exception, I saved the exception message, which contains the XML, in an ArrayList. In the end I wanted to send out a mail with all failed records and their cause exception messages. So I used the messages in the ArrayList together with a StringBuilder to compose the email body.
However, there were around 75K failures, and when I was building the body the StringBuilder just stopped appending at a certain point in the for loop and the thread was blocked. I have since changed my approach to no longer append the XML from the exception message, but I'm still not clear why it didn't work.
Could it be that the VM ran out of memory, or can Strings only be of a certain size (doubtful, I believe, certainly in the 64-bit era)? Is there a better way I could have solved this? I contemplated sending the StringBuilder to my service instead of saving the strings in an ArrayList first, but that would be such a dirty interface :(
Any architectural insights would be appreciated.
EDIT
As requested, here is the code; it's no rocket science. Assume that the failures list contains around 75K entries, and each entry contains an XML document of 500 to 1000 lines on average.
private String createBodyMessage(List<String> failures) {
StringBuilder builder = new StringBuilder();
builder.append("Failed operations\n");
builder.append("=================\n\n");
for (String failure : failures) {
builder.append(failure);
builder.append("\n");
}
return builder.toString();
}
You might just succeed with:
int sizeEstimate = failures.size() * 20;
StringBuilder builder = new StringBuilder(sizeEstimate);
builder.append("Failed operations\n");
builder.append("=================\n\n");
while (!failures.isEmpty()) {
builder.append(failures.remove(0));
builder.append('\n');
}
This resizes the internal buffer of the StringBuilder less often, and it consumes failures as it goes, which frees that memory.
It might not solve the problem if the text is too huge.
Sending a compressed attachment, however, is standard procedure.
StringBuffer is backed by an array, and the maximum number of elements in an array is 2^31-1.
Reaching this size will normally throw an error on Java 7, but I'm not very sure.
The solution is to spill your data to a file before your StringBuffer reaches a fixed size.
Could it be that the VM went out of memory,
If you filled up the heap, you would get an OutOfMemoryError exception.
or can Strings only be of a certain size (doubtful I believe certainly in the 64 bit era).
Actually, yes. A Java String or StringBuilder can contain at most 2^31-1 characters [1].
Is there a better way I could have solved this ? I contemplated sending the StringBuilder to my service instead of saving the strings in an arraylist first ...
That won't help if the real problem is that the concatenation of the strings is too large to hold in a StringBuilder.
Actually, a better approach would be to stream the strings into a PipedOutputStream, and use the corresponding PipedInputStream to construct a MimeBodyPart that you then attach to the email. You could include a compressor in the stream stack too.
But an even better approach would be not to attempt to send gigabytes of erroneous data as email attachments. Save them as files that can be fetched (or whatever) if the email recipient wants them.
1 - Surprisingly, the javadocs don't seem to state this explicitly. However, String.length() returns an int, and various string manipulation methods take int arguments to specify offsets and lengths. And certainly, the standard implementations of String and StringBuilder use a single char[] as backing store, and arrays are limited to 2^31-1 elements by the JLS and the JVM spec.
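To make the streaming idea concrete, here is a minimal sketch that writes the report to an OutputStream one failure at a time, with a compressor in the stream stack. It does not build the MimeBodyPart or use piped streams; the destination is simply whatever OutputStream you pass in (a file, a pipe, ...). The names FailureReport and write are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.GZIPOutputStream;

final class FailureReport {
    // Writes the report straight to `out`, gzip-compressed, one failure at a
    // time, so the full text never has to exist as a single String in memory.
    // Closing the writer also finishes the gzip stream and closes `out`.
    static void write(Iterable<String> failures, OutputStream out) {
        try (Writer w = new OutputStreamWriter(new GZIPOutputStream(out), StandardCharsets.UTF_8)) {
            w.write("Failed operations\n");
            w.write("=================\n\n");
            for (String failure : failures) {
                w.write(failure);
                w.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for a file
        write(List.of("<a/>", "<b/>"), sink);
        System.out.println("compressed bytes: " + sink.size());
    }
}
```

Better still, the failures themselves would be written as they are found rather than collected in a list first, so that neither the list nor the report has to fit in the heap.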
I am writing a utility in Java that reads a stream which may contain both text and binary data. I want to avoid I/O waits. To do that, I create a thread that keeps reading the data (and waiting for it) and puts it into a buffer, so the clients can check availability and terminate the waiting whenever they want (by closing the input stream, which will generate an IOException and stop the waiting). This works very well as far as reading bytes out of it is concerned, i.e. for binary data.
Now, I also want to make it easy for the client to read lines out of it, with something like '.hasNextLine()' and '.readLine()'. Without using an I/O-waiting stream like a buffered stream, (Q1) how can I check whether a binary (byte[]) contains a valid Unicode line (in the form of the length of the first line)? I looked around the String/Charset API but could not find it (or did I miss it?). (NOTE: If possible, I don't want to use a non-built-in library.)
Since I could not find one, I try to create one. Without being so complicated, here is my algorithm.
1). I look from the start of the byte array until I find '\n' or '\r' without '\n'.
2). Then, I cut the byte array from the start to that point and using it to create a string (with CharSet if specified) using 'new String(byte[])' or 'new String(byte[], CharSet)'.
3). If that success without exception, we found the first valid line and return it.
4). Otherwise, these bytes may not be a string, so I look further to another '\n' or '\r' w/o '\n'. and this process repeat.
5). If the search reaches the end of the available bytes, I stop and return null (no valid line found).
My question is (Q2): Is the above algorithm adequate?
Just when I was about to implement it, I searched on Google and found that there are many other code points for a new line, for example U+2424, U+0085, U+000C, U+2028 and U+2029.
So my last question is (Q3): Do I really need to detect these codes? If I do, will it increase the chance of false alarms?
I am well aware that recognizing text in binary data is not absolute. I am just trying to find the best balance.
To sum up, I have an array of bytes and I want to extract the first valid string line from it, with or without a specific Charset. This must be done in Java and must avoid any non-built-in library.
Thank you all in advance.
I am afraid your problem is not well-defined. You write that you want to extract the "first valid string line" from your data. But whether some byte sequence is a "valid string" depends on the encoding. So you must decide which encoding(s) you want to use in testing.
Sensible choices would be:
the platform default encoding (Java property "file.encoding")
UTF-8 (as it is most common)
a list of encodings you know your clients will use (such as several Russian or Chinese encodings)
What makes sense will depend on the data, there's no general answer.
Once you have your encodings, the answer to the line termination problem should follow, as most encodings have rules on what terminates a line. In ASCII or Latin-1, LF, CR-LF and LF-CR would suffice. In Unicode, you need all the ones you listed above.
But again, there's no general answer, as new line codes are not strictly regulated. Again, it would depend on your data.
First of all let me ask you a question, is the data you are trying to process a legacy data? In other words, are you responsible for the input stream format that you are trying to consume here?
If you do indeed control the input format, then you probably want to take the binary-vs-text decision out of the Q1 algorithm. For me this algorithm has one troubling part.
`4). Otherwise, these bytes may not be a string, so I look further to
another '\n' or '\r' w/o '\n'. and this process repeat.`
Are you discarding the input before the line terminator and taking the bytes that start immediately after it, or are you re-evaluating the string, which now contains two line terminators? If the former, you may have broken the binary data interface; if the latter, you may still not parse the text correctly.
I think having well defined markers for binary data and text data in your stream will simplify your algorithm a lot.
A couple of words on the String constructor: new String(byte[], Charset) will not throw any exception if the byte array is not valid in that particular Charset. Instead it replaces malformed input with the charset's default replacement string, which typically shows up as question marks or U+FFFD (probably not what you want). If you want an exception, you should use CharsetDecoder.
Also note that in Java 6 there are 2 constructors that take a charset:
String(byte[] bytes, String charsetName) and String(byte[] bytes, Charset charset). I did a simple performance test a while ago, and the constructor with String charsetName was orders of magnitude faster than the one that takes a Charset object (Question to Sun: bug, feature?).
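Here is a minimal sketch of strict decoding with CharsetDecoder, combined with the asker's "try the next line terminator" loop. It assumes an ASCII-compatible charset such as UTF-8 or Latin-1 (in UTF-16 the bytes 0x0A and 0x0D can occur inside other characters), and it only looks for \n and \r. The names FirstLine, decodeStrict and firstValidLine are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class FirstLine {
    // Decodes bytes[0..length) strictly: malformed or unmappable input gives
    // Optional.empty() instead of silently substituted characters.
    static Optional<String> decodeStrict(byte[] bytes, int length, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return Optional.of(decoder.decode(ByteBuffer.wrap(bytes, 0, length)).toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    // Tries each '\n' or '\r' position in turn and returns the first prefix
    // (terminator excluded) that decodes cleanly, or empty if there is none.
    // Assumes an ASCII-compatible charset.
    static Optional<String> firstValidLine(byte[] data, Charset charset) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n' || data[i] == '\r') {
                Optional<String> line = decodeStrict(data, i, charset);
                if (line.isPresent()) {
                    return line;
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] data = "first line\nsecond".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstValidLine(data, StandardCharsets.UTF_8)); // Optional[first line]
    }
}
```

Note that a successful decode only shows the bytes are valid in that charset, not that they were meant as text; in Latin-1 every byte sequence decodes, so this check is only meaningful for stricter encodings such as UTF-8.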
I would try this:
make the IO reader put strings/lines into a thread safe collection (for example some implementation of BlockingQueue)
the main code has only reference to the synced collection and checks for new data when needed, like queue.peek(). It doesn't need to know about the io thread nor the stream.
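Some Java code (the stream source and the rest of main are placeholders):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class IORunner extends Thread {
    private final BufferedReader reader;
    private final BlockingQueue<String> outputQueue;

    IORunner(InputStream in, BlockingQueue<String> outputQueue) {
        this.reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        this.outputQueue = outputQueue;
    }

    @Override
    public void run() {
        // Only this thread blocks on I/O; consumers just see the queue.
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                outputQueue.put(line);
            }
        } catch (IOException e) {
            // Stream closed or failed: stop reading.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

class Main {
    public static void main(String[] args) throws InterruptedException {
        ...
        BlockingQueue<String> dataQueue = new LinkedBlockingQueue<>();
        new IORunner(myStreamFromSomewhere, dataQueue).start();
        while (true) {
            if (!dataQueue.isEmpty()) { // can also use .peek() != null
                System.out.println(dataQueue.take());
            }
            Thread.sleep(1000);
        }
    }
}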
The collection decouples the input(stream) more from the main code. You can also limit the number of lines stored/mem used by creating the queue with a limited capacity (see blockingqueue doc).
The BufferedReader handles the checking of new lines for you :) The InputStreamReader handles the charset (recommend setting one yourself since the default one changes depending on OS etc.).
The java.text namespace is designed for this sort of natural language operation. The BreakIterator.getLineInstance() static method returns an iterator that detects line breaks. You do need to know the locale and encoding for best results, though.
Q2: The method you use seems reasonable enough to work.
Q1: Can't think of something better than the algorithm that you are using
Q3: I believe it will be enough to test for \r and \n. The others are too exotic for usual text files.
I just solved this to get a test stub working for Datagram. I did byte[] varName = String.getBytes(); then final int len = varName.length; then wrote the int through a DataOutputStream, followed by the byte array. On the receiving end, just do readInt() and then read that many bytes.
Not a lib, not hard to do either. Just read up on readUTF and do what they did for the bytes.
The string should construct from the byte array recovered that way, if not you have other problems. If the string can be reconstructed, it can be buffered ... no?
You may be able to just use readUTF() / writeUTF() on DataInputStream / DataOutputStream. Why not?
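A minimal sketch of the writeUTF() / readUTF() round trip, using in-memory streams as a stand-in for the real connection. The class and method names are made up for illustration. Keep in mind that writeUTF() uses a 2-byte length prefix and modified UTF-8, so a single string is limited to 65535 encoded bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class UtfRoundTrip {
    // Encodes the text as a 2-byte length prefix followed by modified UTF-8.
    static byte[] encode(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(text); // throws UTFDataFormatException above 65535 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the length prefix and then exactly that many bytes.
    static String decode(byte[] frame) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] frame = encode("hello");
        System.out.println(frame.length);  // 7: 2-byte prefix + 5 payload bytes
        System.out.println(decode(frame)); // hello
    }
}
```

Because the receiver knows the length up front, it never has to search the bytes for a line terminator.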
{ edit: per OP's request }
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.nio.charset.StandardCharsets;

// Sending end. "sink" is a placeholder for the real OutputStream (socket, file, ...).
String data = "fdsfjal;sajssaafe8e88e88aa"; // fingers pounding keyboard
byte[] payload = data.getBytes(StandardCharsets.UTF_8);
DataOutputStream dataOutputStream = new DataOutputStream(sink);
dataOutputStream.writeInt(payload.length); // length in bytes, not in chars
dataOutputStream.write(payload);
dataOutputStream.flush();
dataOutputStream.close();
// Receiving end. "source" is a placeholder for the matching InputStream.
DataInputStream dataInputStream = new DataInputStream(source);
final int sizeToRead = dataInputStream.readInt();
byte[] datasink = new byte[sizeToRead];
dataInputStream.readFully(datasink); // plain read() may return fewer bytes than requested
dataInputStream.close();
// constructor
// String(byte[] bytes, Charset charset)
final String result = new String(datasink, StandardCharsets.UTF_8);
// continue coding here
Do me a favor, keep the heat off of me. This was typed very fast right in the posting tool, so the code probably contains substantial errors. It's faster for me to explain it by writing Java; others can translate it to other languages, which you can too if you want it in another codebase. You will need exception trapping and so on; just do a compile and start fixing errors. When you get a clean compile, start over from the beginning and look for blunders. (That's what such a mistake is called in engineering: a blunder.)