I'm running a small program that processes around 215K records in a database. These records contain XML that is used by JAXB to marshal and unmarshal objects.
The program I was running was trying to find XML documents that, due to legacy changes, could no longer be unmarshalled. Each time I got an unmarshal exception I saved the exception message, which contains the XML, in an ArrayList. In the end I wanted to send out a mail with all failed records and the causing exception messages, so I used the messages in the ArrayList together with a StringBuilder to compose the email body.
However, there were around 75K failures, and while I was building the body the StringBuilder just stopped appending at a certain point in the for loop and the thread was blocked. I have since changed my approach to no longer append the XML from the exception message, but I'm still not clear on why it didn't work.
Could it be that the VM went out of memory, or can Strings only be of a certain size (doubtful, I believe, certainly in the 64-bit era)? Is there a better way I could have solved this? I contemplated sending the StringBuilder to my service instead of saving the strings in an ArrayList first, but that would make for such a dirty interface :(
Any architectural insights would be appreciated.
EDIT
As requested, here is the code; it's no rocket science. Assume that the failures list contains around 75K entries and that each entry contains an XML document of, on average, 500 to 1000 lines.
private String createBodyMessage(List<String> failures) {
    StringBuilder builder = new StringBuilder();
    builder.append("Failed operations\n");
    builder.append("=================\n\n");
    for (String failure : failures) {
        builder.append(failure);
        builder.append("\n");
    }
    return builder.toString();
}
You might have more success with:
int sizeEstimate = failures.size() * 20;
StringBuilder builder = new StringBuilder(sizeEstimate);
builder.append("Failed operations\n");
builder.append("=================\n\n");
while (!failures.isEmpty()) {
    builder.append(failures.remove(0));
    builder.append('\n');
}
This does less resizing of the StringBuilder's internal buffer, and it consumes the failures list as it goes to free that memory.
It might not solve the problem if the text is simply too huge.
Sending a compressed attachment, however, is standard procedure.
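For instance, a minimal sketch of that approach (the helper name and the UTF-8 charset are assumptions, not from the question): compress the failure list entry by entry, so the full body never has to exist as one giant String:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Gzip the failure messages into a byte[] suitable for a mail attachment.
static byte[] compressFailures(List<String> failures) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    Writer writer = new OutputStreamWriter(new GZIPOutputStream(bytes), "UTF-8");
    writer.write("Failed operations\n=================\n\n");
    for (String failure : failures) {
        writer.write(failure);
        writer.write('\n');
    }
    writer.close(); // close() also finishes the GZIP trailer
    return bytes.toByteArray();
}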
StringBuffer (like StringBuilder) is backed by an array, and the maximum number of elements in a Java array is 2^31-1.
Reaching this size will normally throw an error on Java 7, but I'm not entirely sure.
One solution is to swap your data out to a file before your StringBuffer reaches a fixed size.
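A rough sketch of that swap-to-file idea, with an arbitrary 1M-character threshold and a made-up file name:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.List;

// Flush the builder to a file whenever it grows past the threshold.
static void spillToFile(List<String> failures) throws IOException {
    final int threshold = 1 << 20; // 1M chars per flush; an illustrative cap
    StringBuilder builder = new StringBuilder();
    Writer out = new BufferedWriter(new FileWriter("failures.txt"));
    try {
        for (String failure : failures) {
            builder.append(failure).append('\n');
            if (builder.length() > threshold) {
                out.write(builder.toString());
                builder.setLength(0); // reuse the builder
            }
        }
        out.write(builder.toString()); // write the remainder
    } finally {
        out.close();
    }
}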
Could it be that the VM went out of memory,
If you filled up the heap, you would get an OutOfMemoryError exception.
or can Strings only be of a certain size (doubtful I believe certainly in the 64 bit era).
Actually, yes. A Java String or StringBuilder can contain at most 2^31-1 characters¹.
Is there a better way I could have solved this ? I contemplated sending the StringBuilder to my service instead of saving the strings in an arraylist first ...
That won't help if the real problem is that the concatenation of the strings is too large to hold in a StringBuilder.
Actually, a better approach would be to stream the strings into a PipedOutputStream, and use the corresponding PipedInputStream to construct a MimeBodyPart that you then attach to the email. You could include a compressor in the stream stack too.
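A rough sketch of what that stream stack could look like, assuming JavaMail (javax.mail) is on the classpath; the method name and file name are invented, and the writer runs on its own thread because piped streams deadlock if one thread drives both ends:

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Writer;
import java.util.List;
import java.util.zip.GZIPOutputStream;
import javax.activation.DataHandler;
import javax.mail.internet.MimeBodyPart;
import javax.mail.util.ByteArrayDataSource;

MimeBodyPart buildAttachment(final List<String> failures) throws Exception {
    final PipedOutputStream pipeOut = new PipedOutputStream();
    PipedInputStream pipeIn = new PipedInputStream(pipeOut);

    // Producer side: compress and write on a separate thread.
    new Thread(new Runnable() {
        public void run() {
            try {
                Writer w = new OutputStreamWriter(new GZIPOutputStream(pipeOut), "UTF-8");
                for (String failure : failures) {
                    w.write(failure);
                    w.write('\n');
                }
                w.close();
            } catch (IOException e) {
                e.printStackTrace(); // sketch only: handle properly in real code
            }
        }
    }).start();

    // Consumer side: drain the pipe into the attachment body.
    MimeBodyPart attachment = new MimeBodyPart();
    attachment.setDataHandler(new DataHandler(
            new ByteArrayDataSource(pipeIn, "application/gzip")));
    attachment.setFileName("failures.txt.gz");
    return attachment;
}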
But an even better approach would be not to attempt to send gigabytes of erroneous data as email attachments. Save them as files that can be fetched (or whatever) if the email recipient wants them.
1 - Surprisingly, the javadocs don't seem to state this explicitly. However, String.length() returns an int, and various string manipulation methods take int arguments to specify offsets and lengths. And certainly, the standard implementations of String and StringBuilder use a single char[] as backing store, and arrays are limited to 2^31-1 elements by the JLS and the JVM spec.
Related
How to join a list of millions of values into a single String by appending '\n' at the end of each line
Input data is in a List:
list[0] = And the good south wind still blew behind,
list[1] = But no sweet bird did follow,
list[2] = Nor any day for food or play
list[3] = Came to the mariners' hollo!
The code below joins the list into a string, appending a newline character at the end of each element:
String joinedStr = list.stream().collect(Collectors.joining("\n", "{", "}"));
The problem is that if the list has millions of entries, the joining fails. My guess is that the String object can't handle millions of lines due to the large size.
Please give a suggestion.
The problem with trying to compose a gigantic string is that you have to keep the entire thing in memory before you do anything further with it.
If the string is too big to fit in memory, you have only two options:
increase the available memory, or
avoid keeping a huge string in memory in the first place
This string is presumably destined for some further processing - maybe it's being written to a blob in a database, or maybe it is the body of an HTTP response. It's not being constructed just for fun.
It is probably much more preferable to write to some kind of stream (maybe an implementation of OutputStream) that can be read one character at a time. The consumer can optionally buffer based on the delimiter if they are aware of the context of what you're sending, or they can wait until they have the entire thing.
Preferably you would use something which supports back pressure so that you can pause writing if the consumer is too slow.
Exactly how this looks will depend on what you're trying to accomplish.
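For illustration, a minimal sketch of that streaming approach (the method name is invented); it reproduces the "{", "\n", "}" format of Collectors.joining while holding only one element in flight at a time:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.List;

// Stream the joined output to any Writer instead of building one huge String.
static void writeJoined(List<String> list, Writer out) throws IOException {
    BufferedWriter w = new BufferedWriter(out);
    w.write("{");
    boolean first = true;
    for (String s : list) {
        if (!first) {
            w.write("\n"); // same delimiter as Collectors.joining("\n", "{", "}")
        }
        w.write(s);
        first = false;
    }
    w.write("}");
    w.flush();
}

// Usage: writeJoined(list, new FileWriter("out.txt"));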
Maybe you can do it with a StringBuilder, which is designed specifically for handling large strings. Here's how I'd do it:
StringBuilder sb = new StringBuilder();
for (String s : list) {
    sb.append(s).append("\n");
}
return sb.toString();
I haven't tested this code, but it should work.
I am using Klocwork to review my code.
For the given line of code:
byte[] sigToVerify = new byte[sigFileInputStream.available()];
I am getting the following error report:
SV.DOS.ARRSIZE: Unvalidated user input
sigFileInputStream.available() used for array size - attacker can
specify a large number leading to high resource usage on the server
and a DOS attack
Please help me resolve this issue.
Without more of your code snippet to go on, I would think that Klocwork is reporting a valid issue here. You should review the documentation provided for the SV.DOS.ARRSIZE checker, which explains why this is reported. On the Vulnerability and risk:
The use of data from outside the application must be validated before
use by the application. If this data is used to allocate arrays of
objects in the application, the content of the data must be closely
checked. Attackers can exploit this vulnerability to force the
application to allocate very large numbers of objects, leading to high
resource usage on the application server and the potential for a
denial-of-service (DoS) condition.
On the Mitigation and prevention:
The prevention of DoS attacks from user input can be achieved by
validating any and all input from outside the application (user input,
file input, system parameters, etc.). Validation should include length
and content. ... Data used for allocation should also be checked for
reasonable values, assuming that user input could contain very small
or very large values.
Even the API docs for Java's InputStream (of which FileInputStream is a subclass) warn that using the return value of the available() method is a bad idea:
Note that while some implementations of InputStream will return the
total number of bytes in the stream, many will not. It is never
correct to use the return value of this method to allocate a buffer
intended to hold all data in this stream.
An example of how to fix your code to avoid this would be to, as suggested above, validate the value returned by available() before using it to allocate the array:
int buffSize = sigFileInputStream.available();
if (buffSize > 0 && buffSize < 100000000) { // 100MB
    byte[] sigToVerify = new byte[buffSize];
    // do something with sigToVerify ...
} else {
    // error
}
Note that 100000000 or 100MB for sigToVerify may still be way too large for your purposes, or it could be too small. You should determine the most sane value to use here based on what your code is trying to accomplish.
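An alternative sketch that sidesteps available() entirely (the helper name and the cap value are assumptions): read the stream incrementally into a growing buffer with a hard limit, so attacker-influenced sizes are never used for allocation:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Read everything from the stream, failing fast once maxBytes is exceeded.
static byte[] readAllCapped(InputStream in, int maxBytes) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    int n;
    while ((n = in.read(chunk)) != -1) {
        if (out.size() + n > maxBytes) {
            throw new IOException("Input exceeds cap of " + maxBytes + " bytes");
        }
        out.write(chunk, 0, n);
    }
    return out.toByteArray();
}

// Usage: byte[] sigToVerify = readAllCapped(sigFileInputStream, 10000000);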
I am developing a Java-based downloader for binary data. This data is transferred via a text-based protocol (UU-encoded). For the networking task the netty library is used. The binary data is split by the server into many thousands of small packets and sent to the client (i.e. the Java application).
From netty I receive a ChannelBuffer object every time a new message (data) is received. Now I need to process that data, beside other tasks I need to check the header of the package coming from the server (like the HTTP status line). To do so I call ChannelBuffer.array() to receive a byte[] array. This array I can then convert into a string via new String(byte[]) and easily check (e.g. compare) its content (again, like comparison to the "200" status message in HTTP).
The software I am writing is using multiple threads/connections, so that I receive multiple packets from netty in parallel.
This usually works fine; however, while profiling the application I noticed that when the connection to the server is good and data comes in very fast, this conversion to the String object seems to be a bottleneck. The CPU usage is close to 100% in such cases, and according to the profiler a lot of time is spent calling the String(byte[]) constructor.
I searched for a better way to get from the ChannelBuffer to a String, and noticed the former also has a toString() method. However, that method is even slower than the String(byte[]) constructor.
So my question is: Does anyone of you know a better alternative to achieve what I am doing?
Perhaps you could skip the String conversion entirely? You could have constants holding byte arrays for your comparison values and check array-to-array instead of String-to-String.
Here's some quick code to illustrate. Currently you're doing something like this:
String http200 = "200";
// byte[] -> String conversion happens every time
String input = new String(channelBuffer.array());
return input.equals(http200);
Maybe this is faster:
// Ideally only convert String->byte[] once. Store these
// arrays somewhere and look them up instead of recalculating.
final byte[] http200 = "200".getBytes("UTF-8"); // Select the correct charset!
// Input doesn't have to be converted!
byte[] input = channelBuffer.array();
return Arrays.equals(input, http200);
Some of the checking you are doing might just look at part of the buffer. If you could use the alternate form of the String constructor:
new String(byteArray, offset, length)
that might mean a lot fewer bytes get converted to a string. Your example of looking for "200" within the message would be one case.
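A tiny sketch of that, assuming an HTTP-style status line where the code starts at byte offset 9, and a channelBuffer variable as in the question:

byte[] data = channelBuffer.array();
// Decode only the 3 status bytes instead of the whole packet.
String status = new String(data, 9, 3);
boolean ok = "200".equals(status);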
You might find that you can use the length of the byte array as a clue. If some messages are long and you are looking for a short one, ignore the long ones and don't convert to characters. Or something like that.
Along with what #EricGrunzke said, partially inspecting the byte buffer may let you filter out some messages and find that you don't need to convert them from bytes to characters at all.
If your bytes are ASCII characters, the conversion to characters might be quicker if you use the charset "US-ASCII" instead of whatever the default is for your server:
new String(bytes, "US-ASCII")
might be faster in that case.
In fact, you might be able to pick and choose the charset for conversion byte-character in some organized fashion that speeds up things.
Depending on what you are trying to do there are a few options:
If you are just trying to get the response status, then can't you just call getStatus()? This would probably be faster than getting the string out.
If you are trying to convert the buffer, then, assuming you know it will be ASCII, which it sounds like you do, then just leave the data as byte[] and convert your UUDecode method to work on a byte[] instead of a String.
The biggest cost of the string conversion is most likely the copying of the data from the byte array into the internal char array of the String; combined with the decoding itself, that is a bunch of work you may not need to do at all.
I have two large CSV files which contain data that is required for users of a web application to validate some info. I defined an ArrayList<String[]> and intended to keep the contents of both files in memory so I wouldn't have to read them each time a user logged in and used the application.
I'm getting a java.lang.OutOfMemoryError: Java heap space, though, when initializing the application and trying to read the second file. (It finishes reading the first file just fine but hangs when reading the second file and after a while I get that exception)
The code for reading the files is pretty straightforward:
ArrayList<String[]> tokenizedLines = new ArrayList<String[]>();

public void parseTokensFile() throws Exception {
    BufferedReader bRead = null;
    FileReader fRead = null;
    try {
        fRead = new FileReader(this.tokensFile);
        bRead = new BufferedReader(fRead);
        String line;
        while ((line = bRead.readLine()) != null) {
            tokenizedLines.add(StringUtils.split(line, fieldSeparator));
        }
    } catch (Exception e) {
        throw new Exception("Error parsing file.");
    } finally {
        bRead.close();
        fRead.close();
    }
}
I read that Java's split function can use up a lot of memory when processing large amounts of data, since the substring function keeps a reference to the original string: a substring of some String will use up the same amount of memory as the original, even though we only want a few chars. To try to avoid this, I made a simple split function of my own:
public String[] split(String inputString, String separator) {
    ArrayList<String> storage = new ArrayList<String>();
    String remainder = new String(inputString);
    int separatorLength = separator.length();
    while (remainder.length() > 0) {
        int nextOccurance = remainder.indexOf(separator);
        if (nextOccurance != -1) {
            storage.add(new String(remainder.substring(0, nextOccurance)));
            remainder = new String(remainder.substring(nextOccurance + separatorLength));
        } else {
            break;
        }
    }
    storage.add(remainder);
    String[] tokenizedFields = storage.toArray(new String[storage.size()]);
    storage = null;
    return tokenizedFields;
}
This gives me the same error, though, so I'm wondering if it's not a memory leak but simply that I can't have structures with so many objects in memory. One file is about 600,000 lines long, with 5 fields per line, and the other is around 900,000 lines long with about the same number of fields per line.
The full stacktrace is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at xxx.xxx.xxx.StringUtils.split(StringUtils.java:16)
at xxx.xxx.xxx.GFTokensFile.parseTokensFile(GFTokensFile.java:36)
So, after the long post (sorry :P): is this a restriction of the amount of memory assigned to my JVM, or am I missing something obvious and wasting resources somewhere?
Your JVM won't get more than 2GB on a 32-bit operating system with 4GB of RAM. That's one upper limit.
The second is the max heap size you specify when you start the JVM. Look at that -Xmx parameter.
The third is the fact of life that you cannot fit X units of anything into a Y sized container where X > Y. You know the size of your files. Try parsing each one individually and seeing what kind of heap they're consuming.
I'd recommend that you download Visual VM, install all the available plugins, and have it monitor your application while it's running. You'll be able to see the entire heap, perm gen space, GC collection, what objects are taking up the most memory, etc.
Getting data is invaluable for all problems, but especially ones like this. Without it, you're just guessing.
I cannot see a storage leak in the original version of the program.
The scenarios where split and similar methods can leak significant storage are rather limited:
You have to NOT be retaining a reference to the original string that you split.
You need to be retaining references to a subset of the strings produced by the string splitting.
What happens when String.substring() is called is that it creates a new String object that shares the original String's backing array. If the original String reference is then garbage collected, then the substring String is now holding onto an array of characters that includes characters that are not "in" the substring. This can be a storage leak, depending on how long the substring is kept.
In your example, you are keeping strings that contain all characters apart from the field separator character. There is a good chance that this is actually saving space ... compared to the space used if each substring was an independent String. Certainly, it is no surprise that your version of split doesn't solve the problem.
I think you need to either increase the heap size, or change your application so that it doesn't need to keep all of the data in memory at the same time.
Try improving your code or leave data processing to a database.
The memory usage is larger than your file sizes because the code makes redundant copies of the processed data: at any moment there is the yet-to-be-processed data, the processed data, and some partial data.
String is immutable, so there is no need to use new String(...) to store the result; substring already makes that copy.
If you can, delegate the whole data storage and searching to a database. CSV files are easily imported/exported to databases and they do all the hard work.
While I wouldn't recommend actual string interning for what you are doing, how about using the idea behind that technique? You could use a HashSet or HashMap to make sure you only use a single String instance whenever your data contains the same sequence of characters. I mean, there must be some kind of overlap in the data, right?
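A minimal sketch of that idea (the class name is invented): a HashMap hands back one canonical instance per distinct token, so duplicate field values share a single String object:

import java.util.HashMap;
import java.util.Map;

class TokenPool {
    private final Map<String, String> pool = new HashMap<String, String>();

    // Returns one shared instance for every equal string.
    String canonical(String s) {
        String existing = pool.get(s);
        if (existing == null) {
            pool.put(s, s);
            return s;
        }
        return existing;
    }
}

// Usage: store pool.canonical(token) instead of token when filling tokenizedLines.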
On the other hand, what you might be seeing here could be a bad case of heap fragmentation. I'm not sure how the JVM handles these cases, but in the Microsoft CLR larger objects (especially arrays) will be allocated on a separate heap. Growth strategies, such as those of the ArrayList will create a larger array, then copy over the content of the previous array before releasing the reference to it. The Large Object Heap (LOH) isn't compacted in the CLR, so this growth strategy will leave huge areas of free memory that the ArrayList can no longer use.
I don't know how much of that applies to the Java VM, but you could try building the list using LinkedList first, then dump the list content into an ArrayList or directly into an array. That way the large array of lines would be created only once, without causing any fragmentation.
Be sure that the total length of both files is lower than your heap size. You can set the max heap size using the JVM option -Xmx.
Then, if you have that much content, maybe you shouldn't load it entirely into memory. I once had a similar problem and fixed it using an index file that stored the offsets of entries in the large file; I then just had to read one line at the right offset, as in the sketch below.
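That index-file idea might look roughly like this (a sketch; note that RandomAccessFile.readLine decodes each byte as a single character, which suits ASCII-style CSV data):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

class LineIndex {
    private final List<Long> offsets = new ArrayList<Long>();
    private final RandomAccessFile file;

    // Scan the file once, recording the start offset of every line.
    LineIndex(File f) throws IOException {
        file = new RandomAccessFile(f, "r");
        long pos = file.getFilePointer();
        while (file.readLine() != null) {
            offsets.add(pos);
            pos = file.getFilePointer();
        }
    }

    // Fetch a single line on demand instead of holding them all in memory.
    String line(int n) throws IOException {
        file.seek(offsets.get(n));
        return file.readLine();
    }
}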
Also, in your split method there are some strange things.
String remainder = new String(inputString);
You don't have to preserve inputString by making a copy; Strings are immutable, so reassignments only apply within the scope of the split method.
I have a Java socket connection that is receiving data intermittently. The number of bytes of data received with each burst varies. The data may or may not be terminated by a well-known character (such as CR or LF). The length of each burst of data is variable.
I'm attempting to build a string out of each burst of data. What is the fastest way (speed, not memory), to build a string that would later need to be parsed?
I began by using a byte array to store the incoming bytes, then converting them to a String with each burst, like so:
byte[] message = new byte[1024];
...
message[i] = //byte from socket
i++;
...
String messageStr = new String(message);
...
//parse the string here
The obvious disadvantage of this is that some bursts may be longer than 1024 bytes. I don't want to arbitrarily create a larger byte array (what if my burst is larger still?).
What is the best way of doing this? Should I create a StringBuilder object and append() to it? That way I don't have to convert from StringBuilder to String (since the former has most of the methods I need).
Again, speed of execution is my biggest concern.
TIA.
I would probably use an InputStreamReader wrapped around a BufferedInputStream, which in turn wraps the socket. And write code that processes a message at a time, potentially blocking for input. If the input is bursty, I might run on a background thread and use a concurrent queue to hold the messages.
Reading a buffer at a time and trying to convert it to characters is exactly what BufferedInputStream/InputStreamReader does. And it does so while paying attention to encoding, something that (as other people have noted) your solution does not.
I don't know why you're focused on speed, but you'll find that the time to process data coming off a socket is far less than the time it takes to transmit over that socket.
Note that as you're transmitting across network layers, your speed of conversion may not be the bottleneck. It would be worth measuring, if you believe this to be important.
Note (also) that you're not specifying a character encoding in your conversion from bytes to String (via characters). I would enforce that somehow, otherwise your client/server communication can become corrupted if/when your client/server run in different environments. You can enforce that via JVM runtime args, but it's not a particularly safe option.
Given the above, you may want to consider StringBuilder(int capacity) to configure it in advance with an appropriate size, such that it doesn't have to reallocate on the fly.
First of all, you are making a lot of assumptions about the character encoding of the data you receive from your client. Is it US-ASCII, ISO-8859-1, UTF-8?
Because a Java String is not a sequence of bytes, when it comes to building portable String serialization code you should make explicit decisions about character encoding. For this reason you should NEVER use StringBuilder to convert bytes to a String. If you look at the StringBuilder interface you will notice that it does not even have an append( byte ) method, and that's not because the designers just overlooked it.
In your case you should definitely use a ByteArrayOutputStream. The only drawback of using the straight implementation of ByteArrayOutputStream is that its toByteArray() method returns a copy of the array held by the object internally. For this reason you may create your own subclass of ByteArrayOutputStream and provide direct access to the protected buf member, as sketched below.
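Such a subclass might be as small as this (a sketch; because buf is protected, code in the same package as the subclass can read message.buf directly, as the snippet further below does):

import java.io.ByteArrayOutputStream;

// Exposes the inherited protected buf array to same-package callers, so the
// String constructor can read it without the defensive copy of toByteArray().
class MyByteArrayOutputStream extends ByteArrayOutputStream {
    MyByteArrayOutputStream(int size) {
        super(size);
    }
    // buf and count (via size()) are inherited protected members.
}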
Note that if you don't use the default implementation, remember to specify the byte array bounds in your String constructor. Your code should look something like this:
MyByteArrayOutputStream message = new MyByteArrayOutputStream( 1024 );
...
message.write( //byte from socket );
...
String messageStr = new String(message.buf, 0, message.size(), "ISO-8859-1");
Replace ISO-8859-1 with whatever character set suits your needs.
StringBuilder is your friend. Add as many characters as needed, then call toString() to obtain the String.
I would create a "small" array of characters and append characters to it.
When the array is full (or the transmission ends), use the StringBuilder.append(char[] str) method to append the content of the array to your string, as sketched below.
Now for the "small" size of the array: you will need to try various sizes and see which one is fastest for your production environment (performance "may" depend on the JVM, OS, processor type and speed, and so on).
EDIT: Other people mentioned ByteArrayOutputStream, I agree it is another option as well.
You may wish to look at ByteArrayOutputStream, depending on whether you are dealing with bytes instead of characters.
I generally use a ByteArrayOutputStream to assemble a message, then use toString/toByteArray to retrieve it when the message is finished.
Edit: ByteArrayOutputStream can handle various character set encodings through the toString(String charsetName) call.
Personally, independent of language, I would send all characters to an in-memory data stream and once I need the string, I would read all characters from this stream into a string.
As an alternative, you could use a dynamic array, making it bigger whenever you need to add more characters. Even better, keep track of the actual length and grow the array in additional blocks instead of single characters. Thus, you would start with 1 character in an array of 1000 chars; once you get to 1,001, the array is resized to 2000, then 3000, 4000, etc.
Fortunately, several languages, including Java, have special built-in classes that specialize in this: the StringBuilder classes. Whatever technique they use isn't that important; they have been created to boost performance, so they should be your fastest option.
Have a look at the Text class. It's faster (for the operations you perform) and more deterministic than StringBuilder.
Note: the project containing the class is aimed at RTSJ VMs. It is perfectly usable in standard SE/EE environments though.