How to read large files (a single continuous string) in Java?

I am trying to read a very large file (~2 GB). The content is one continuous string made of sentences (I would like to split them on '.'). No matter how I try, I end up with an OutOfMemoryError.
BufferedReader in = new BufferedReader(new FileReader("a.txt"));
String read = null;
int i = 0;
while ((read = in.readLine()) != null) {
    String[] splitted = read.split("\\.");
    for (String part : splitted) {
        i += 1;
        users.add(new User(i, part));
        repository.saveAll(users);
    }
}
I also tried:
inputStream = new FileInputStream(path);
sc = new Scanner(inputStream, "UTF-8");
while (sc.hasNextLine()) {
    String line = sc.nextLine();
    // System.out.println(line);
}
// note that Scanner suppresses exceptions
if (sc.ioException() != null) {
    throw sc.ioException();
}
Content of the file (composed of random words with a full stop after 10 words):
fmfbqi .xcdqnjqln kvjhw pexrbunnr cgvrqlr fpaczdegnb puqzjdbp gcfxne jawml aaiwwmo ugzoxn .opjc fmfbqi .xcdqnjqln kvjhw pexrbunnr cgvrqlr fpaczdegnb puqzjdbp gcfxne jawml aaiwwmo ugzoxn .opjc (so on)
Please help!

So first and foremost, based on comments on your question, as Joachim Sauer stated:
If there are no newlines, then there is only a single line and thus only one line number.
So your use case is faulty, at best.
Let's move past that, and assume maybe there are newline characters - or better yet, assume that the . character you're splitting on is intended to be a pseudo-replacement for a newline.
Scanner is not a bad approach here, though there are others. Since you provided a Scanner, let's continue with that, but you want to make sure you're wrapping it around a BufferedReader. You clearly don't have a lot of memory, and a BufferedReader lets you read the file in 'chunks'; the buffering happens inside the BufferedReader and is completely invisible to the Scanner that calls it:
Scanner sc = new Scanner(new BufferedReader(new FileReader(new File("a.txt")), 10 * 1024 * 1024));
What this basically does is let the Scanner function as you expect, while buffering roughly 10 MB worth of characters at a time and keeping your memory footprint small. Now you just keep calling:
sc.useDelimiter("\\.");
for (int i = 0; sc.hasNext(); i++) {
    String pseudoLine = sc.next();
    // store pseudo-line 'i' in your database
    // DO NOT store pseudoLine anywhere else - you don't have the memory for it
}
Since you don't have enough memory, the key point to state (and restate) is: don't keep any part of the file in your JVM's heap space after reading it. Read it, use it however you need to, and let it become eligible for garbage collection. In your case, you mention you want to store the pseudo-lines in a database, so read a pseudo-line, store it in the database, and discard it.
There are other things to point out here, such as configuring your JVM arguments, but I hesitate to even mention it because just setting your JVM memory high is a bad idea too - another brute-force approach. There's nothing wrong with raising your JVM's maximum heap size, but learning memory management is better if you're still learning how to write software. You'll get in less trouble later when you get into professional development.
Also, I mentioned Scanner and BufferedReader because you mentioned them in your question, but I think checking out java.nio.file.Files.lines(Path), as pointed out by deHaar, is also a good idea. It basically does the same thing as the code I've explicitly laid out, with the caveat that it still only gives you one line at a time and you cannot change what you're 'splitting' on. So if your text file has one single line in it, this will still cause you a problem and you will still need something like a Scanner to fragment the line.
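For reference, a minimal sketch of that Files.lines approach, assuming the file really does contain newline-separated lines (the file name is a placeholder):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LinesExample {
    public static void main(String[] args) throws IOException {
        // Files.lines streams the file lazily, so only one line needs to be in memory at a time
        try (Stream<String> lines = Files.lines(Paths.get("a.txt"))) {
            lines.forEach(line -> {
                // process each line here (e.g. save it to your database), then let it go
            });
        }
    }
}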

Related

Using BufferedReader to read big files in Java

I understand there are two ways to read big text files in Java: one using Scanner and one using BufferedReader.
Scanner reader = new Scanner(new FileInputStream(path));
while (reader.hasNextLine()) {
    String tempString = reader.nextLine();
    System.out.println(java.lang.Runtime.getRuntime().totalMemory() / (1024 * 1024.0));
}
The printed number is always stable around some value.
However, when I use a BufferedReader (as per the edit below) the number is not stable: it may jump suddenly (by about 20 MB) on one line and then remain the same for many lines (around 8000), and the process repeats.
Does anyone know why?
UPDATE
I typed the second method using BufferedReader wrong; here is what it should be:
BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(path)), 5 * 1024 * 1024);
for (String s = null; (s = reader.readLine()) != null; ) {
    System.out.println(java.lang.Runtime.getRuntime().totalMemory() / (1024 * 1024.0));
}
or using a while loop:
String s;
while ((s=reader.readLine())!=null ){
System.out.println(java.lang.Runtime.getRuntime().totalMemory()/(1024*1024.0));
}
To be more specific, here is the result of a test case reading a 250 MB file.
Scanner case:
linenumber---totalMemory
5000---117.0
10000---112.5
15000---109.5
20000---109.5
25000---109.5
30000---109.5
35000---109.5
40000---109.5
45000---109.5
50000---109.5
BufferedReader case:
linenumber---totalMemory
5000---123.0
10000---155.5
15000---155.5
20000---220.5
25000---220.5
30000---220.5
35000---220.5
40000---220.5
45000---220.5
50000---211.0
However, the Scanner is slow and that's why I try to avoid it.
And in the BufferedReader case, I checked that the total memory increases suddenly at a single random line.
Just by itself, a Scanner is not particularly good for big text files.
Scanner and BufferedReader are not comparable. You can use a BufferedInputStream in a Scanner - then you'll have the same thing, with the Scanner adding a lot more "stream" reading functionality than just lines.
Looking at totalMemory isn't particularly useful. To cite the Javadoc: "Returns the total amount of memory in the Java virtual machine. The value returned by this method may vary over time, depending on the host environment."
Try freeMemory, which is a little more interesting, reflecting the phases of GC that occur every now and then.
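A small sketch of that kind of monitoring, wrapping the FileInputStream in a BufferedInputStream inside the Scanner (the path is a placeholder); freeMemory() rises and falls as garbage collection runs, while totalMemory() only changes when the heap itself grows:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Scanner;

public class MemoryProbe {
    public static void main(String[] args) throws IOException {
        try (Scanner reader = new Scanner(
                new BufferedInputStream(new FileInputStream("big.txt")), "UTF-8")) {
            Runtime rt = Runtime.getRuntime();
            long lineNumber = 0;
            while (reader.hasNextLine()) {
                reader.nextLine();
                if (++lineNumber % 5000 == 0) {
                    // used heap = currently allocated heap minus its free part
                    long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
                    System.out.println(lineNumber + " --- " + usedMb + " MB used");
                }
            }
        }
    }
}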
Later
Comment on Scanner being slow: reading a line merely requires scanning bytes for the line separator, and that's how BufferedReader does it. The Scanner, however, cranks up java.util.regex.Matcher for this task (as it fits better into its overall design). Using the Scanner just for reading lines is breaking a butterfly on a wheel.

BufferedReader using an input stream gives OutOfMemoryError (Android/Java)

I am getting an error while reading a stream. The stream also contains images along with string data. My code is:
static String convertStreamToString(java.io.InputStream is) {
    Reader reader = new InputStreamReader(is);
    BufferedReader r = new BufferedReader(reader);
    StringBuilder total = new StringBuilder();
    String line = null;
    try {
        while ((line = r.readLine()) != null) {
            total.append(line);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println(total);
    return line;
}
Inside the while loop the console shows the heap memory increasing, and at last it gives an "out of memory" error.
I'm hoping for your suggestions.
You are simply reading your whole file and storing it in the StringBuilder. Your file is just too big to fit in RAM, which results in an OutOfMemoryError when the JVM is unable to allocate any more space for your StringBuilder.
Either find a way not to store the whole file in memory, or use a smaller file (the former solution clearly being the better one).
Also, note that you are not closing your buffered reader after use. The correct way to do this is to declare your resources using try-with-resources, for example:
try (Reader reader = new InputStreamReader(is);
     BufferedReader r = new BufferedReader(reader)) {
    // your code here
}
You do need to close your stream, as others have noted. But you have a couple of other things to think about:
What do you mean when you say that the stream also contains images along with string data? You're not doing anything to deal with this; you're reading it all as Strings. If the file really does contain images and strings, then it's quite likely that the images are much bigger than the strings, and if you just want the strings then you need a way to filter out the images. That will probably solve your out-of-memory problem, and also prevent you from ending up with nonsense in your output. Unfortunately, you haven't given us enough information to help with the detail of this: it will depend on the format of the file/stream.
If the file is huge, and contains more string data than you can store in memory, then no amount of careful closing of streams and the like will help. You'll need to process the file as you go, rather than storing it in RAM. But how you do this will rather depend on what you're trying to do with the data.
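As a sketch of that "process as you go" idea (handleLine is a placeholder for whatever you actually do with each piece of text):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

class StreamProcessor {
    static void processStream(InputStream is) throws IOException {
        // try-with-resources closes the reader even if handleLine throws
        try (BufferedReader r = new BufferedReader(new InputStreamReader(is))) {
            String line;
            while ((line = r.readLine()) != null) {
                handleLine(line); // use the line, then let it become garbage
            }
        }
    }

    private static void handleLine(String line) {
        // placeholder: write to a file, parse it, send it somewhere, etc.
    }
}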
I'd start by attacking the first problem if I were you. At the moment, it sounds as though even if you solve the memory problem, you'll still end up with nonsensical output because the images will get decoded as Strings.

How to deal with reading and processing huge text files without getting an OutOfMemoryError

I wrote some straightforward code to read text files (>1 GB) and do some processing on Strings.
However, I have to deal with Java heap space problems, since I append Strings (using a StringBuilder) that at some point get too big for memory. I know that I can increase my heap space with e.g. '-Xmx1024m', but I would like to work with as little memory as possible here. How could I change my code below to manage my operations?
I am still a Java novice and maybe I made some mistakes in my code which seem obvious to you.
Here's the code snippet:
private void setInputData() {
    Pattern pat = Pattern.compile("regex");
    BufferedReader br = null;
    Matcher mat = null;
    try {
        File myFile = new File("myFile");
        FileReader fr = new FileReader(myFile);
        br = new BufferedReader(fr);
        String line = null;
        String appendThisString = null;
        String processThisString = null;
        StringBuilder stringBuilder = new StringBuilder();
        while ((line = br.readLine()) != null) {
            mat = pat.matcher(line);
            if (mat.find()) {
                appendThisString = mat.group(1);
            }
            if (line.contains("|")) {
                processThisString = line.replace(" ", "").replace("|", "\t");
                stringBuilder.append(processThisString).append("\t").append(appendThisString);
                stringBuilder.append("\n");
            }
        }
        // doSomethingWithTheString(stringBuilder.toString());
    } catch (Exception ex) {
        ex.printStackTrace();
    } finally {
        try {
            if (br != null) br.close();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
Here's the error message:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
at java.lang.StringBuilder.append(StringBuilder.java:132)
at Test.setInputData(Test.java:47)
at Test.go(Test.java:18)
at Test.main(Test.java:13)
You could do a dry run, without appending, but counting the total string length.
If doSomethingWithTheString is sequential there would be other solutions.
You could tokenize the string, reducing its size. For instance, Huffman compression looks for sequences already seen: it reads a char, possibly extends the table, and then yields a table index. (The open source OmegaT translation tool uses such a strategy in one spot for tokens.) So it depends on the processing you want to do. Since you seem to be reading a kind of CSV, a dictionary seems feasible.
In general I would use a database.
P.S.: you can save half the memory by writing everything to a file and then re-reading the file into one string. Or use a java.nio ByteBuffer on the file, i.e. a memory-mapped file.
You can't use a StringBuilder in this case. It holds the data in memory.
I think you should consider writing the result to a file line by line,
i.e. use a FileWriter instead of a StringBuilder.
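A minimal sketch of that idea applied to the question's loop, using a BufferedWriter around a FileWriter (file names are placeholders, and only the write-as-you-go part is shown):
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

class WriteAsYouGo {
    static void transform(String inFile, String outFile) throws IOException {
        try (BufferedReader br = new BufferedReader(new FileReader(inFile));
             BufferedWriter bw = new BufferedWriter(new FileWriter(outFile))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.contains("|")) {
                    // write each transformed line immediately instead of appending it to a StringBuilder
                    bw.write(line.replace(" ", "").replace("|", "\t"));
                    bw.newLine();
                }
            }
        }
    }
}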
The method doSomethingWithTheString() would probably need to change so that it accepts an InputStream instead. While reading the original file content and transforming it line by line, you should write the transformed content to a temporary file line by line. Then an input stream over that temporary file can be passed to the method. It probably needs to be renamed doSomethingWithInputStream().
From your example it is not clear what you are going to do with your enormous string once you have built it. However, since your modifications do not appear to span multiple lines, I'd just write the modified data to a new file.
In order to do that, create and open a new FileWriter object before your while loop, move your stringBuilder declaration to the beginning of the loop body, and write stringBuilder to your new file at the end of each iteration.
If, on the other hand, you do need to combine data coming from different lines, consider using a database. Which kind depends on the nature of your data. If it has a record-like organization you might adopt a relational database, such as Apache Derby or MySQL; otherwise you might check out so-called NoSQL databases, such as Cassandra or MongoDB.
The general strategy is to design your application so that it doesn't need to hold the entire file (or too large a proportion of it) in memory.
Depending on what your application does:
You could write the intermediate data to a file and read it back again a line at a time to process it.
You could pass each line read to the processing algorithm; e.g. by calling doSomethingWithTheString(...) on each line individually rather than all of them.
But if you need to have the entire file in memory, you are between a rock and a hard place.
The other thing to note is that using a StringBuilder like that may require up to 6 times as much memory as the file size. It goes like this.
When the StringBuilder needs to expand its internal buffer, it does this by making a char array twice the size of the current buffer and copying from the old to the new. At that point you have 3 times as much buffer space allocated as you had before the expansion started, and the worst case is when the expansion was triggered by just one more character being appended.
If the file is in ASCII (or another 8 bit charset), the StringBuilder's buffer needs twice that amount of memory ... because it consists of char not byte values.
If you have a good estimate of the number of characters that will be in the final string (e.g. from the file size), you can avoid the x3 multiplier by giving a capacity hint when you create the StringBuilder. However, you mustn't underestimate, because if you underestimate even slightly you are back to doubling and copying.
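A tiny sketch of that capacity hint, assuming a roughly 8-bit charset so the character count is close to the file size in bytes:
File myFile = new File("myFile");
// pre-size the builder so it never has to double and copy its internal buffer;
// the int cast assumes the file is smaller than 2 GB
StringBuilder stringBuilder = new StringBuilder((int) myFile.length());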
You could also use a byte-oriented buffer (e.g. a ByteArrayOutputStream) instead of a StringBuilder ... and then read it back with a ByteArrayInputStream / InputStreamReader / BufferedReader pipeline.
But ultimately, holding a large file in memory doesn't scale as the file size increases.
Are you sure there is a line terminator in the file? If not, your while loop will just keep looping and lead to your error. If so, it might be worth reading a fixed number of characters at a time so that what you read cannot grow indefinitely.
I suggest using Guava's FileBackedOutputStream. You gain the advantage of having an OutputStream that eats up disk I/O instead of main memory. Of course access will be slower due to the disk I/O, but if you are dealing with such a large stream and you are unable to chunk it into a more manageable size, it is a good option.
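A rough sketch of what that could look like (the 1 MB threshold is arbitrary, and the asByteSource()/reset() calls are from recent Guava versions, so check the version you depend on):
import com.google.common.io.ByteSource;
import com.google.common.io.FileBackedOutputStream;
import java.io.IOException;
import java.io.InputStream;

class SpillToDisk {
    static void consume(InputStream in) throws IOException {
        // keeps up to 1 MB in memory; anything beyond that spills to a temporary file
        FileBackedOutputStream out = new FileBackedOutputStream(1024 * 1024);
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        out.close();
        ByteSource source = out.asByteSource();
        // read the data back with source.openStream() without holding it all in RAM,
        // then release the temporary file:
        out.reset();
    }
}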

Faster implementation of more than one input in a single line (Java)

Well, this might be a silly problem.
I just want a faster implementation of the following problem.
I want to take three integer inputs on a single line, e.g.:
10 34 54
One way is to make a BufferedReader and then use readLine(),
which will read the whole line as a string;
then we can use a StringTokenizer to separate the three integers. (Slow implementation)
Another way is to use Scanner and take the input with the nextInt() method. (Slower than the previous method)
I want a fast way to take this kind of input, since I have to read more than 2,000,000 lines and these implementations are very slow.
My implementation:
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
for (i = 0; i < n; i++) {
    str = br.readLine();
    st = new StringTokenizer(str);
    t1 = Integer.parseInt(st.nextElement().toString());
    t2 = Integer.parseInt(st.nextElement().toString());
    z = Long.parseLong(st.nextElement().toString());
}
This is looped n times (n is the number of entries).
Since I know each line will contain exactly three integers, there is no need to check hasMoreElements().
I just want a faster implementation of the following problem.
The chances are that you DON'T NEED a faster implementation. Seriously. Not even with a 2 million line input file.
The chances are that:
more time is spent processing the file than reading it, and
most of the "read time" is spent doing things at the operating system level, or simply waiting for the next disk block to be read.
My advice is to not bother optimizing this unless the application as a whole takes too long to run. And when you find that this is the case, profile the application, and use the profile stats to tell you where it could be worthwhile spending effort on optimization.
(My gut feeling is that there is not much to be gained by optimizing this part of your application. But don't rely on that. Profile it!)
Here's a basic example that will be pretty fast:
public static void main(String[] args) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader("myfile.txt"));
    String line;
    while ((line = reader.readLine()) != null) {
        for (String s : line.split(" ")) {
            final int i = Integer.parseInt(s);
            // do something with i...
        }
    }
    reader.close();
}
However your task is fundamentally going to take time.
If you are doing this on a website and reaching a timeout, you should consider doing it in a background thread, and send a response to the user saying that the data is being processed. You'll probably need to add a way for the user to check on the progress.
Here is what I mean when I say "specialized scanner". Depending upon the parser's (or split's) efficiency, this might be a bit faster (it probably is not):
BufferedReader br = new BufferedReader(...);
for (i = 0; i < n; i++) {
    String str = br.readLine();
    long[] resultLongs = {-1, -1, -1};
    int startPos = 0;
    int nextLongIndex = 0;
    for (int p = 0; p < str.length(); p++) {
        if (str.charAt(p) == ' ') {
            String longAsStr = str.substring(startPos, p); // end index is exclusive
            resultLongs[nextLongIndex++] = Long.parseLong(longAsStr);
            startPos = p + 1;
        }
    }
    // the last number has no trailing blank, so pick it up after the loop
    resultLongs[nextLongIndex] = Long.parseLong(str.substring(startPos));
    // t1, t2 and z are in resultLongs[0] through resultLongs[2]
}
Hths.
And of course this fails miserably if the input file contains garbage, i.e. anything else but longs separated by blanks.
And in addition, to minimize the "roundtrips" to the OS, it is a good idea to supply the buffered reader with a nonstandard (bigger-than-standard) buffer.
The other hint I gave in the comment, refined: if you have to read such a huge text file more than once, i.e. more than once after it has been updated, you could read all the longs into a data structure (maybe a list of elements that each hold three longs) and stream that into a "cache" file. Next time, compare the text file's timestamp to the "cache" file's. If the text file is older, read the cache file. Since binary stream I/O does not serialize longs into their string representation, you will see much, much better reading times.
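A rough sketch of that cache idea, assuming a simple binary format via DataOutputStream/DataInputStream (the class and file handling here are illustrative):
import java.io.*;

class LongCache {
    // write the parsed longs in raw binary so re-reading skips all string parsing
    static void writeCache(File cache, long[] values) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(cache)))) {
            out.writeInt(values.length);
            for (long v : values) {
                out.writeLong(v);
            }
        }
    }

    static long[] readCache(File cache) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(cache)))) {
            long[] values = new long[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readLong();
            }
            return values;
        }
    }

    static boolean cacheIsFresh(File textFile, File cache) {
        // the cache is usable if it exists and is at least as new as the text file
        return cache.exists() && cache.lastModified() >= textFile.lastModified();
    }
}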
EDIT: Missed the startPos reassignment.
EDIT2: Added the cache idea explanation.

Search a string in a file and write the matched lines to another file in Java

Searching for a string in a file and writing the lines that contain it to another
file takes 15-20 minutes for a single zip file of 70 MB (compressed).
Is there any way to reduce this?
My source code:
Getting the zip file entries:
zipFile = new ZipFile(source_file_name);
entries = zipFile.entries();
while (entries.hasMoreElements()) {
    ZipEntry entry = (ZipEntry) entries.nextElement();
    if (entry.isDirectory()) {
        continue;
    }
    searchString(Thread.currentThread(), entry.getName(),
            new BufferedInputStream(zipFile.getInputStream(entry)),
            Out_File, search_string, stats);
}
zipFile.close();
Searching for the string:
public void searchString(Thread CThread, String Source_File, BufferedInputStream in,
        File outfile, String search, String stats) throws IOException {
    int count = 0;
    int countw = 0;
    int countl = 0;
    String s;
    String[] str;
    BufferedReader br2 = new BufferedReader(new InputStreamReader(in));
    System.out.println(CThread.currentThread());
    while ((s = br2.readLine()) != null) {
        str = s.split(search);
        count = str.length - 1;
        countw += count; // word count
        if (s.contains(search)) {
            countl++; // line count
            WriteFile(CThread, s, outfile.toString(), search);
        }
    }
    br2.close();
    in.close();
}
--------------------------------------------------------------------------------
public void WriteFile(Thread CThread, String line, String out, String search) throws IOException {
    BufferedWriter bufferedWriter = null;
    System.out.println("write thread " + CThread.currentThread());
    bufferedWriter = new BufferedWriter(new FileWriter(out, true));
    bufferedWriter.write(line);
    bufferedWriter.newLine();
    bufferedWriter.flush();
}
Please help me. It's really taking 40 minutes for 10 files using threads, and 15-20 minutes for a single 70 MB file in its compressed state. Are there any ways to reduce the time?
You are reopening the file output handle for every single line you write.
This is likely to have a massive performance impact, far outweighing other performance issues. Instead I would recommend creating the BufferedWriter once (e.g. upon the first match) and then keeping it open, writing each matching line and then closing the Writer upon completion.
Also, remove the call to flush(); there is no need to flush each line as the call to Writer.close() will automatically flush any unwritten data to disk.
Finally, as a side note your variable and method naming style does not follow the Java camel case convention; you might want to consider changing it.
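A sketch of that change applied to the question's searchString method (keeping the question's names; try-with-resources is one convenient way to make sure the writer is closed, and therefore flushed, exactly once at the end):
public void searchString(Thread CThread, String Source_File, BufferedInputStream in,
        File outfile, String search, String stats) throws IOException {
    int countw = 0;
    int countl = 0;
    String s;
    // open the output writer once per entry instead of once per matching line
    try (BufferedReader br2 = new BufferedReader(new InputStreamReader(in));
         BufferedWriter writer = new BufferedWriter(new FileWriter(outfile, true))) {
        while ((s = br2.readLine()) != null) {
            countw += s.split(search).length - 1; // word count
            if (s.contains(search)) {
                countl++; // line count
                writer.write(s);
                writer.newLine();
            }
        }
    } // closing the writer flushes any remaining buffered output
}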
I'm not sure if the cost you are seeing is from disk operations or from string manipulations. I'll assume for now that the problem is the strings, you can check that by writing a test driver that runs your code with the same line over and over.
I can tell you that split() is going to be very expensive in your case because you are producing strings you don't need and then recycling them, creating much overhead. You may want to increase the amount of space available to your JVM with -Xmx.
If you merely separate words by the presence of whitespace, then you would do much better by using a regular expression Matcher that you create before the loop and apply to each string. The number of matches when applied to a given string will be your word count, and that does not create an array of strings (which is very wasteful and which you don't use). You will see in the JavaDocs that split() does work via regular expressions; that is true, but split does the extra step of creating separate strings, and that's where your waste might be.
You can also use a regular expression to search for the match instead of contains though that may not be significantly faster.
You could make things parallel by using multiple threads. However, if split() is the cause of your grief, your problem is the overhead and running out of heap space, so you won't necessarily benefit from it.
More generally, if you need to do this a lot, you may want to write a script in a language more "friendly" to string manipulation. A 10-line script in Python can do this much faster.
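A sketch of the Matcher-based counting described above (the MatchCounter class and its method names are illustrative; Pattern.quote is used on the assumption that the search text is a literal string rather than a regex):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchCounter {
    private final Pattern searchPattern;

    MatchCounter(String search) {
        // compile once, before the read loop; quote() treats the search text literally
        this.searchPattern = Pattern.compile(Pattern.quote(search));
    }

    // counts occurrences of the search text in a line without allocating
    // the String[] that split() would create
    int countOccurrences(String line) {
        Matcher m = searchPattern.matcher(line);
        int occurrences = 0;
        while (m.find()) {
            occurrences++;
        }
        return occurrences;
    }
}
In the read loop, countOccurrences(s) gives the per-line word count, and any value greater than zero also tells you the line matches, so the separate contains() call can go.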
Wow, what are you doing in this method?
WriteFile(CThread, s, outfile.toString(), search);
Every time you get a line containing your text, you are creating a new BufferedWriter(new FileWriter(out, true));.
Just create a BufferedWriter once in your searchString method and use that to write the lines. There is no need to open the file again and again. It will drastically improve the performance.
One problem here might be that you stop reading while you write. I would probably use one thread for reading and another thread for writing the file. As an extra optimization, the thread writing the results could buffer them in memory and write them to the file in batches, say every ten entries or so.
In the writing thread you should queue the incoming entries before handling them.
Of course, you should first figure out where that time is spent: is it the I/O or something else?
There are too many potential bottlenecks in this code for anyone to be sure which ones are critical. Therefore you should profile the application to determine what is causing it to be slow.
Armed with that information, decide whether the problem is in reading the ZIP file, doing the searching, or writing the matches to the output file.
(Repeatedly opening and closing the output file is a bad idea, but if you only get a tiny number of search hits it won't make much difference to the overall performance.)
