I know similar questions have been asked before, but I couldn't find one that answers my exact question.
I need a way to read a file into a String with as little code as possible, and as simply and optimally as possible.
I'm not looking for:
final BufferedReader br = new BufferedReader(new FileReader(file));
String line = null;
while ((line = br.readLine()) != null) {
// logic
}
And I know I can write my own helper class that does this.
I'm looking for something more along the lines of:
final String wholeFileAsStr = Something.load(file);
Where Something.load() is super optimized and buffers the file properly while reading it, taking file size into account for instance.
Can anyone recommend something from Guava or Apache, maybe, that I'm not aware of?
Thanks in advance.
Perhaps IOUtils.toString, from Commons IO.
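For example, a minimal sketch, assuming commons-io is on the classpath (the file name is hypothetical):

import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.IOUtils;

public class WholeFile {
    public static void main(String[] args) throws Exception {
        // Reads the entire stream into a single String
        try (InputStream in = new FileInputStream("someFile.txt")) {
            String wholeFileAsStr = IOUtils.toString(in, StandardCharsets.UTF_8);
            System.out.println(wholeFileAsStr.length());
        }
    }
}

Commons IO also has FileUtils.readFileToString(file, charset) if you'd rather pass a File directly, and Guava's Files.toString(file, charset) works along the same lines.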
For a detailed look at the various methods of reading a single file in a JVM, try the following article:
Java tip: How to read files quickly
I'm trying to read a bz2 file using Apache Commons Compress.
The following code works for a small file.
However, for a large file (over 500MB), it ends after reading a few thousand lines without any error.
try {
InputStream fin = new FileInputStream("/data/file.bz2");
BufferedInputStream bis = new BufferedInputStream(fin);
CompressorInputStream input = new CompressorStreamFactory()
.createCompressorInputStream(bis);
BufferedReader br = new BufferedReader(new InputStreamReader(input,
"UTF-8"));
String line = "";
while ((line = br.readLine()) != null) {
System.out.println(line);
}
} catch (Exception e) {
e.printStackTrace();
}
Is there another good way to read a large compressed file?
I was having the same problem with a large file, until I noticed that CompressorStreamFactory has a couple of overloaded constructors that take a boolean decompressUntilEOF parameter.
Simply changing to the following may be all that's missing...
CompressorInputStream input = new CompressorStreamFactory(true)
.createCompressorInputStream(bis);
Clearly, whoever wrote this factory seems to think it's better to create new compressor input streams at certain points, with the same underlying buffered input stream so that the new one picks up where the last one left off. They seem to think that's a better default, or preferred way of doing it over allowing one stream to decompress data all the way to the end of the file. I've no doubt they are cleverer than me, and I haven't worked out what trap I'm setting for future me by setting this parameter to true. Maybe someone will tell me in the comments! :-)
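Putting that together with the code from the question, a rough sketch might look like this (same path as in the question; the true flag tells the factory to keep decompressing concatenated stream members until end of file):

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.commons.compress.compressors.CompressorInputStream;
import org.apache.commons.compress.compressors.CompressorStreamFactory;

public class ReadBz2 {
    public static void main(String[] args) throws Exception {
        try (InputStream fin = new FileInputStream("/data/file.bz2");
             BufferedInputStream bis = new BufferedInputStream(fin);
             // true = decompress concatenated .bz2 members until EOF
             CompressorInputStream input = new CompressorStreamFactory(true)
                     .createCompressorInputStream(bis);
             BufferedReader br = new BufferedReader(new InputStreamReader(input, "UTF-8"))) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}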
I came across the following Java code snippet (source), that is used to read the last line in a file. It uses two Apache Commons IO classes - ReversedLinesFileReader and FileUtils. Here it is:
ReversedLinesFileReader reader = new ReversedLinesFileReader(FileUtils.getFile(file));
String s = "";
while ((s = reader.readLine()) != null) {
System.out.println(s);
}
What I don't understand, however, is why we don't simply pass a file as the argument to ReversedLinesFileReader, and instead use the FileUtils class.
Why can't I simply say this instead:
File someFile = new File("someFile.txt");
ReversedLinesFileReader reader = new ReversedLinesFileReader(someFile);
thanks
... what is the advantage of using the Apache FileUtils class?
If file is just one string, there is no advantage. The call to getFile is going to call new File(file) ... after messing around to cope with the "var-args-ness" of the file parameter. So in fact, there is a performance penalty for using getFile in this case, for zero concrete1 benefit.
If file is a String[], then FileUtils.getFile(file) turns the strings into a file path.
Why Can't I simply say instead ...
Assuming that file is a String, you can, and (IMO) you should do what you proposed.
1 - I have seen people argue in cases like this that it is an advantage to avoid using the java.io classes directly, but frankly I don't buy it.
Using File is fine. ReversedLinesFileReader has a constructor that accepts a File as an input.
The reason you would use FileUtils is if you could benefit from what the FileUtils.getFile() overloaded methods offer, like passing a String[] of path elements, etc.
Other than that, no difference.
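For illustration, a rough sketch of both forms (file names are hypothetical): ReversedLinesFileReader takes a File directly, while FileUtils.getFile is mainly handy when the path arrives as separate elements.

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.input.ReversedLinesFileReader;

public class LastLine {
    public static void main(String[] args) throws Exception {
        // Plain java.io.File works fine
        ReversedLinesFileReader reader = new ReversedLinesFileReader(new File("someFile.txt"));
        System.out.println(reader.readLine()); // last line of the file
        reader.close();

        // FileUtils.getFile just assembles a File from path elements
        File assembled = FileUtils.getFile("some", "nested", "dir", "someFile.txt");
        System.out.println(assembled.getPath());
    }
}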
I'm trying to find the best way to read in data from a file similar to an Excel document. It doesn't necessarily need to be an actual excel document, just any file that allows you to enter data in a grid format.
Something where I would be able to do manipulation similar to this:
String val = file.readString(column,row);
float val2 = file.readFloat(column,row);
I'm sorry, I usually try to do more research before I post a question here but I was having a hard time finding much info. A lot of what I saw was 3rd party libraries that read excel files. I'm really hoping if possible I can avoid downloading libraries and hopefully use built in ones.
So I guess my questions in short are:
What's the most appropriate file format for this?
What's the best way to read data from that file?
The first thing that comes to my mind is CSV. CSV files are just regular text files with the .csv filename extension. Data is stored in this format:
cell,anothercell,athirdcell
anotherrow,anothercellonthenewrow,thirdcellofsecondrow
For more specifics, read the CSV specs here.
Option 1
Store your data in a CSV and read with any kind of reader (e.g. BufferedReader). This might be the easiest and fastest solution, if you want to use Excel/LibreOffice for entering data.
Please check out the answers in these threads for various solutions.
String csvFile = path;
BufferedReader br = null;
String line;
String csvSplitBy = ";";
try {
    br = new BufferedReader(new FileReader(csvFile));
    while ((line = br.readLine()) != null) {
        String[] cells = line.split(csvSplitBy);
        // do stuff with the cells of this row
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (br != null) {
        try {
            br.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Hope I didn't miss anything important.
Option 2
Use Apache POI.
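If the data should stay in a real .xlsx file, a minimal sketch with Apache POI could look like the following (this does require the poi-ooxml library on the classpath; the file name and cell positions are hypothetical):

import java.io.FileInputStream;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class GridReader {
    public static void main(String[] args) throws Exception {
        try (FileInputStream in = new FileInputStream("data.xlsx");
             Workbook wb = new XSSFWorkbook(in)) {
            Sheet sheet = wb.getSheetAt(0);
            Row row = sheet.getRow(0);      // row index
            Cell cell = row.getCell(1);     // column index
            String val = cell.getStringCellValue();
            double val2 = sheet.getRow(1).getCell(1).getNumericCellValue();
            System.out.println(val + " / " + val2);
        }
    }
}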
Option 3
I've had some decent experience with JXL, but I understand that you don't want to include too many external libs. (I just saw that it hasn't been updated in a while. Consider the other options!)
I'm currently writing an assignment that takes multiple text files(File objects) with lines, and then combines the lines together and separates them by commas, like:
File1Line1, File2Line1
File1Line2, File2Line2
I guess I'm just confused with how to use the files. How would I get the first(second, third, etc.) line from each file, while also dealing with files having different numbers of lines? Any help just on the concept of this is appreciated.
As far as reading a file line by line goes, it's easy to do in most languages. Here's an example in Java: How to read a large text file line by line using Java?.
Conceptually, you should start with thinking of an algorithm and then write some pseudocode to further explore and understand it.
For this assignment, one option would be to alternate reading each file one line at a time and immediately write the lines to the CSV. A second option would be to store each line in a data structure, such as an array, and write everything at the end, but that could be expensive for large files. You can handle files of different lengths in many ways, for instance by writing lines that have no counterpart in the other file on their own. Here's some pseudocode, based on Java:
FileReader reader1 = FileReader("file1.text")
FileReader reader2 = FileReader("file2.text")
while(reader1.hasNextLine() || reader2.hasNextLine())
{
if(reader1.hasNextLine()) {
writeToCSV(reader1.nextLine());
}
if(reader2.hasNextLine()) {
writeToCSV(reader2.nextLine());
}
writeToCSV("\r\n");
}
You can find plenty of examples on the actual method calls, but it's important to understand the algorithm first.
If you are sure the lines of the two files map one-to-one, then it is easy.
You can use two BufferedReaders to read the two files, and you only need to iterate over one of them.
Some code like this:
BufferedReader reader1 = new BufferedReader(new FileReader(new File(pathOfFile1)));
BufferedReader reader2 = new BufferedReader(new FileReader(new File(pathOfFile2)));
BufferedWriter writer = new BufferedWriter(new FileWriter(new File(pathOfOutputCsvFile)));
String lineOfFile1 = null;
while((lineOfFile1 = reader1.readLine()) != null){
String lineOfFile2 = reader2.readLine();
//here, lineOfFile1 and lineOfFile2 are the same line number
//then some codes for combination
//...
}
//finally don't forget to close the readers and writer.
If you can't be sure the lines in these two files map one-to-one, then you should read them all into memory, pair them up there, and then output them as a CSV file.
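For example, a rough sketch of that read-everything-into-memory approach (file names are hypothetical):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class CombineInMemory {
    public static void main(String[] args) throws IOException {
        // Read both files completely into memory
        List<String> lines1 = Files.readAllLines(Paths.get("file1.txt"), StandardCharsets.UTF_8);
        List<String> lines2 = Files.readAllLines(Paths.get("file2.txt"), StandardCharsets.UTF_8);

        List<String> csvLines = new ArrayList<>();
        int max = Math.max(lines1.size(), lines2.size());
        for (int i = 0; i < max; i++) {
            // Use an empty string when one file has fewer lines than the other
            String left = i < lines1.size() ? lines1.get(i) : "";
            String right = i < lines2.size() ? lines2.get(i) : "";
            csvLines.add(left + ", " + right);
        }
        Files.write(Paths.get("output.csv"), csvLines, StandardCharsets.UTF_8);
    }
}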
This code only directly references 1 line from each file in RAM at a time, meaning it should work with huge files without memory exceptions. Behind the scenes more memory may be occupied than what you see, but it will still not crash with huge files.
Code works by reading one line at a time from each of the files till all files are empty. As files run out of lines output an empty string instead.
void assignment(String outputFile, String... filenames) throws IOException {
    PrintWriter writer = new PrintWriter(outputFile, "UTF-8");
    Scanner[] scanners = new Scanner[filenames.length];
    for (int i = 0; i < filenames.length; i++) {
        scanners[i] = new Scanner(new File(filenames[i]));
    }
    boolean running = true;
    while (running) {
        boolean allEmpty = true;
        StringBuilder csvLine = new StringBuilder();
        for (int i = 0; i < scanners.length; i++) {
            if (scanners[i].hasNextLine()) {
                String line = scanners[i].nextLine();
                csvLine.append(line);
                allEmpty = false;
            }
            // comma between columns, even when a file has run out of lines
            if (i != scanners.length - 1) csvLine.append(",");
        }
        if (allEmpty)
            running = false;
        else
            writer.println(csvLine.toString());
    }
    writer.close();
    for (Scanner s : scanners) s.close();
}
Usage:
assignment("output.txt","file1.txt","file2.txt","file3.txt","file4.txt");
Or:
String[] args = new String[]{"helloWorld.txt","fun.bin"};
assignment("output2.txt",args);
This code is untested and doesn't handle exceptions (it just declares them). This code will let you read in lines from files whose lines don't match up, and combine them into a single CSV file. As files run out of lines, only empty strings will be shown.
This should give you an idea of how to do precisely what you've asked.
I've a problem which requires me to parse a text file from local machine. There are a few complications:
The files can be quite large (700mb+)
The pattern occurs in multiple lines
I need store line information after the pattern
I've created a simple bit of code using BufferedReader, String.indexOf and String.substring (to get item 3).
Inside the file there is a key (pattern) named code= that occurs many times in different blocks. The program reads each line from this file using BufferedReader.readLine. It uses indexOf to check whether the pattern appears, and then extracts the text after the pattern and stores it in a common string.
When I ran my program with a 600MB file, I noticed that performance got worse as it processed the file. I read an article on CodeRanch saying that the Scanner class doesn't perform well on large files.
Are there some techniques or a library that could improve my performance ?
Thanks in advance.
Here's my source code:
String codeC = "code=[";
String source = "";
try {
FileInputStream f1 = new FileInputStream("c:\\Temp\\fo1.txt");
DataInputStream in = new DataInputStream(f1);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
boolean bPrnt = false;
int ln = 0;
// Read File Line By Line
while ((strLine = br.readLine()) != null) {
// Print the content on the console
if (strLine.indexOf(codeC) != -1) {
ln++;
System.out.println(strLine + " ---- register : " + ln);
strLine = strLine.substring(codeC.length(), strLine.length());
source = source + "\n" + strLine;
}
}
System.out.println("");
System.out.println("Lines :" + ln);
f1.close();
} catch ( ... ) {
...
}
This code of yours is highly suspicious and may well account for at least a part of your performance issues:
FileInputStream f1 = new FileInputStream("c:\\Temp\\fo1.txt");
DataInputStream in = new DataInputStream(f1);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
You are involving DataInputStream for no good reason, and in fact using it as an input to a Reader can be considered a case of broken code. Write this instead:
InputStream f1 = new FileInputStream("c:\\Temp\\fo1.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(f1));
A huge detriment to performance is the System.out you are using, especially if you measure the performance when running in Eclipse, but even if running from the command line. My guess is, this is the major cause of your bottleneck. By all means ensure you don't print anything in the main loop when you aim for top performance.
In addition to what Marko answered, I suggest to close the br, not the f1:
br.close()
This will not affect the performance, but is cleaner. (closing the outermost stream)
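A rough sketch of the same idea with try-with-resources (Java 7+), which closes the outermost reader, and through it the underlying stream, automatically; the path is the one from the question:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class CloseOutermost {
    public static void main(String[] args) {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream("c:\\Temp\\fo1.txt"), StandardCharsets.UTF_8))) {
            String strLine;
            while ((strLine = br.readLine()) != null) {
                // process the line
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}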
Have a look at java.util.regex
An excellent tutorial from oracle.
A copy-paste from the Javadoc:
Classes for matching character sequences against patterns specified by regular expressions.
An instance of the Pattern class represents a regular expression that is specified in string form in a syntax similar to that used by Perl.
Instances of the Matcher class are used to match character sequences against a given pattern. Input is provided to matchers via the CharSequence interface in order to support matching against characters from a wide variety of input sources.
Unless otherwise noted, passing a null argument to a method in any class or interface in this package will cause a NullPointerException to be thrown.
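Tying that back to the question, a rough sketch of using Pattern/Matcher to capture whatever follows the code=[ marker on a line (the assumption that everything after the marker is wanted comes from the question's substring call):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CodeExtractor {
    // Matches the literal "code=[" and captures the rest of the line
    private static final Pattern CODE_PATTERN = Pattern.compile(Pattern.quote("code=[") + "(.*)");

    public static String extract(String line) {
        Matcher m = CODE_PATTERN.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("foo code=[bar] baz")); // prints: bar] baz
    }
}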
It works perfectly !!
I followed OldCurmudgeon's, Marko Topolnik's and AlexWien's advice and my performance improved 1000%. Before, the program took 2 hours to complete the described operation and write its response to a file.
Now it takes 5 minutes!! And the SYSO remains in the source code!!
I think the reason for the great improvement is changing the String "source" to a HashSet "source", as OldCurmudgeon suggests. But I removed the DataInputStream and used "br.close" too.
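For reference, a rough sketch of the accumulation change described above (a HashSet instead of repeated String concatenation, no DataInputStream, and the reader closed at the end); the names and marker handling are illustrative only:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class CollectCodes {
    public static Set<String> collect(String path, String marker) throws IOException {
        Set<String> source = new HashSet<>();
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = br.readLine()) != null) {
                int idx = line.indexOf(marker);
                if (idx != -1) {
                    // HashSet.add is cheap; no re-copying of all previously collected text
                    source.add(line.substring(idx + marker.length()));
                }
            }
        }
        return source;
    }
}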
Thanks guys !!