Take a look at the following link:
http://snippetsofjosh.wordpress.com/tag/advantages-and-disadvantages-of-arraylist/
This is one of the reasons why I always prefer to use arrays instead of (Array)Lists.
Still, this got me thinking about memory management and speed.
Hence I arrived at the following question:
What is the best way to store data from a file when you don't know the size of the file (i.e. the number of entries), where 'best' is defined as 'the least amount of computation time'?
Below, I will present three different methods and I would like to know which one of them is best and why. For the clarity of the question, let's assume I must end up with an array, and that every line of our .txt file contains exactly one entry (one string). Also, to limit the scope, this question is about Java only.
Let's say we want to retrieve the following info from a file called words.txt:
Hello
I
am
a
test
file
Method 1 - Double and dangerous
File read = new File("words.txt");
Scanner in = new Scanner(read);
// First pass: count the lines
int counter = 0;
while (in.hasNextLine())
{
    in.nextLine();
    counter++;
}
// Second pass: fill the array
String[] data = new String[counter];
in = new Scanner(read);
int i = 0;
while (in.hasNextLine())
{
    data[i] = in.nextLine();
    i++;
}
Method 2 - Clear but redundant
File read = new File("words.txt");
Scanner in = new Scanner(read);
ArrayList<String> temporary = new ArrayList<String>();
while (in.hasNextLine())
{
    temporary.add(in.nextLine());
}
String[] data = new String[temporary.size()];
for (int i = 0; i < temporary.size(); i++)
{
    data[i] = temporary.get(i);
}
Method 3 - Short but rigid
File read = new File("words.txt");
FileReader reader = new FileReader(read);
String content = null;
char[] chars = new char[(int) read.length()];
reader.read(chars);
content = new String(chars);
String[] data = content.split(System.getProperty("line.separator"));
reader.close();
If you have an alternative way (which is even better) then please supply it below.
Also, feel free to adjust my code where necessary.
Answer:
The fastest method for storing data in an array is the following method:
File read = new File("words.txt");
Scanner in = new Scanner(read);
ArrayList<String> temporary = new ArrayList<String>();
while (in.hasNextLine()) {
    temporary.add(in.nextLine());
}
String[] data = temporary.toArray(new String[temporary.size()]);
And for Java 7+:
Path loc = Paths.get(URI.create("file:///Users/joe/FileTest.txt"));
List<String> lines = Files.readAllLines(loc, Charset.defaultCharset());
String[] array = lines.toArray(new String[lines.size()]);
I assume that best means faster here.
I would use method 2, but create the array with the methods provided by the Collection interface:
String[] array = temporary.toArray(new String[temporary.size()]);
Or even simpler (Java 7+):
List<String> lines = Files.readAllLines(file, charset);
String[] array = lines.toArray(new String[lines.size()]);
Other methods:
Method 1 does two passes over the file; it is very unlikely that reading the file twice is more efficient than occasionally resizing an ArrayList.
I am not sure whether method 3 is faster or not.
Update:
For the sake of completeness, I have run a microbenchmark with method 2 modified as above, and including an additional method (method4) that reads all bytes at once, creates a string, and splits it on newlines. The results (in microseconds):
Benchmark      Mean
method1     126.178
method2      59.679
method3      76.622
method4      75.293
Edit:
with a larger 3MB file (LesMiserables.txt), the results are consistent:
Benchmark         Mean
method1     608649.322
method2      34167.101
method3      63410.496
method4      65552.79
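For reference, method4 might have looked something like this (a minimal sketch; I am assuming Java 7+'s Files.readAllBytes and the platform default charset, so the actual benchmark code may differ):
byte[] bytes = Files.readAllBytes(Paths.get("words.txt")); // read all bytes at once
String content = new String(bytes); // platform default charset
String[] data = content.split("\\r?\\n"); // split on \n or \r\n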
A very good comparison with all the source code is given here: java_tip_how_read_files_quickly ("Java tip: How to read files quickly").
Summary:
For the best Java read performance, there are four things to remember:
1. Minimize I/O operations by reading an array at a time, not a byte at a time. An 8Kbyte array is a good size.
2. Minimize method calls by getting data an array at a time, not a byte at a time. Use array indexing to get at bytes in the array.
3. Minimize thread synchronization locks if you don't need thread safety. Either make fewer method calls to a thread-safe class, or use a non-thread-safe class like FileChannel and MappedByteBuffer.
4. Minimize data copying between the JVM/OS, internal buffers, and application arrays. Use FileChannel with memory mapping, or a direct or wrapped array ByteBuffer.
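As an illustration of the last two points, a memory-mapped read might look like this (a minimal sketch; the file name is an assumption, and the classes come from java.io and java.nio):
RandomAccessFile raf = new RandomAccessFile("words.txt", "r");
FileChannel channel = raf.getChannel();
// Map the whole file into memory instead of copying it through intermediate buffers
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
byte[] bytes = new byte[buffer.remaining()];
buffer.get(bytes); // one bulk copy out of the mapped region
String[] data = new String(bytes).split("\\r?\\n"); // platform default charset assumed
channel.close();
raf.close();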
Hope that helps.
EDIT
I would do something like this:
File read = new File("words.txt");
Scanner in = new Scanner(read);
List<String> temporary = new LinkedList<String>();
while (in.hasNextLine()) {
    temporary.add(in.nextLine());
}
String[] data = temporary.toArray(new String[temporary.size()]);
The main difference is reading the data only once (as opposed to the other two methods), and appending to a LinkedList is very cheap; no extra operation on the lines (like splitting) is needed. Don't use an ArrayList here.
If you are reading data from a file, the bottleneck will be the file reading (IO) stage. The time spent processing it will be insignificant in almost all cases. So do what is correct and safe. First you make it right; then you make it fast.
If you don't know the size of the file, you must have some kind of dynamically expanding data structure. That is exactly what ArrayList is. Code you write yourself is unlikely to be more efficient or more correct than such an important part of the Java API. So just use ArrayList: option 2.
I would use Guava:
File file = new File("words.txt");
List<String> lines = Files.readLines(file, Charset.defaultCharset());
// If it really has to be an array:
String[] array = lines.toArray(new String[0]);
List<String> lines = Files.readAllLines(yourFile, charset);
String[] arr = lines.toArray(new String[lines.size()]);
Related
Let's consider this scenario: I am reading a file, tweaking each line a bit, and then storing the data in a new file. I tried two ways to do it:
Storing the data in a String and then writing it all to the target file at the end, like this:
InputStream ips = new FileInputStream(file);
InputStreamReader ipsr = new InputStreamReader(ips);
BufferedReader br = new BufferedReader(ipsr);
PrintWriter desFile = new PrintWriter(targetFilePath);
String line; // declaration was missing in the original snippet
String data = "";
while ((line = br.readLine()) != null) {
    if (line.contains("_Stop_"))
        continue;
    String[] s = line.split(";");
    String newLine = s[2];
    for (int i = 3; i < s.length; i++) {
        newLine += "," + s[i];
    }
    data += newLine + "\n";
}
desFile.write(data);
desFile.close();
br.close();
Directly using the println() method of the PrintWriter inside the while loop, as below:
while ((line = br.readLine()) != null) {
    if (line.contains("_Stop_"))
        continue;
    String[] s = line.split(";");
    String newLine = s[2];
    for (int i = 3; i < s.length; i++) {
        newLine += "," + s[i];
    }
    desFile.println(newLine);
}
desFile.close();
br.close();
The second process is way faster than the first one. Now, my question is: what is happening so differently in these two processes that the execution time differs so much?
Appending to your string will:
Allocate memory for a new string
Copy all data previously appended.
Copy the data from your new string.
You repeat this process for every single line, meaning that for N lines of output, you copy O(N^2) bytes around.
Meanwhile, writing to your PrintWriter will:
Copy data to the buffer.
Occasionally flush the buffer.
Meaning that for N lines of output, you copy only O(N) bytes around.
For one, you're creating an awful lot of new String objects by appending using +=. I think that'll definitely slow things down.
Try appending using a StringBuilder sb declared outside of the loop and then calling desFile.write(sb.toString()); and see how that performs.
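For instance, the first snippet rewritten with a StringBuilder might look like this (a sketch reusing the br and desFile from the question):
StringBuilder sb = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
    if (line.contains("_Stop_"))
        continue;
    String[] s = line.split(";");
    sb.append(s[2]);
    for (int i = 3; i < s.length; i++) {
        sb.append(',').append(s[i]);
    }
    sb.append('\n');
}
desFile.write(sb.toString()); // one write instead of O(N^2) string copying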
First of all, the two processes aren't producing the same data, since the one that calls println will have line separator characters between the lines whereas the one that builds all the data up in a buffer and writes it all at once will not.
But the reason for the performance difference is probably the enormous number of String and StringBuilder objects you are generating and throwing away, the memory that needs to be allocated to hold the complete file contents in memory, and the time taken by the garbage collector.
If you're going to be doing a significant amount of string concatenation, especially in a loop, it is better to create a StringBuilder before the loop and use it to accumulate the results in the loop.
However, if you're going to be processing large files, it is probably better to write the output as you go. The memory requirements of your application will be lower, whereas if you build up the entire result in memory, the memory required will be equal to the size of the output file.
It's fixed! Thanks to Edgar Boda.
I created a class that should read a text file and put that into an array:
private static String[] parts;
public static void Start() throws IOException {
    InputStream instream = new FileInputStream("Storyline.txt");
    InputStreamReader inputreader = new InputStreamReader(instream);
    BufferedReader buffreader = new BufferedReader(inputreader);
    int numberOfLines = 0, numberOfActions;
    String line = null, input = "";
    while ((line = buffreader.readLine()) != null) {
        line = buffreader.readLine();
        input += line;
    }
    parts = input.split(";");
}
But when I try to output the array, it only contains one string: the last one from the file.
Here's the file I read from:
0;0;
Hello!;
Welcome!To this.;
56;56;
So;
I think it's something in the loop; but trying to put parts[number] in there doesn't work... Any suggestions?
You want to read the whole file into a String first, maybe:
String line = null;
String input = "";
while ((line = buffreader.readLine()) != null) {
    input += line;
}
parts = input.split(";");
You are overwriting the string array parts in every iteration of your while loop, so that's why it only contains the last line.
To store the entire file contents, with fields split, you'll need a 2-dimensional array, not a 1-dimensional array. Assuming there are 5 lines in the file:
private static String[][] parts = new String[5][];
Then assign each split array to an element of parts each loop:
parts[i++]=line.split(";"); // Assuming you define "i" for the line number
Also, split by default discards trailing empty tokens. To retain them, use the two-arg overload of split that takes a limit parameter. Pass a negative number to retain all tokens.
parts[i++] = line.split(";", -1);
It will only contain the last line; you are reassigning parts every time:
parts = line.split(";");
This trashes the previous reference and reassigns a reference to a new array to it. A better way might be to use a StringBuilder and append the lines and then split later:
StringBuilder stringBuilder = new StringBuilder();
while ((line = buffreader.readLine()) != null) {
    stringBuilder.append(line);
}
parts = stringBuilder.toString().split(";");
This way you will get everything you want in one array. If you want one array per line, parts will need to be a two-dimensional array, but the drawback is that you would need to know in advance how many lines the file has. Instead, you can use a List<String[]> to keep track of the arrays:
List<String[]> lineParts = new ArrayList<String[]>();
while ((line = buffreader.readLine()) != null) {
    lineParts.add(line.split(";"));
}
So I have a collection of phrases that are separated by newlines and I would like to populate an array with these phrases. This is what I have so far:
Scanner s;
s = new Scanner(new BufferedReader(new FileReader("Phrases.txt")));
for (i = 0; i < array.length; i++)
{
    s.nextLine() = array[i];
}
Is there a fast and simple way to just populate an array with phrases separated by newlines?
The assignment should be reversed:
array[i] = s.nextLine();
And I think you should fill your array based on the input received from the file. Here you are receiving input based on the length of your pre-declared array. I mean, since you are using an array, its size is fixed, so you can only populate it with a fixed number of phrases.
A better way would be to use an ArrayList.
List<String> phrases = new ArrayList<String>();
Now, you can populate your arraylist, based on the phrases you get from your file. You don't need to pre-define the size. It increases in size dynamically.
And to add phrases, you would do:
phrases.add(s.nextLine());
With a while loop, to read until EOF:
while (s.hasNextLine()) {
    phrases.add(s.nextLine());
}
Since you don't know how many phrases you're likely to have (I suspect), I would populate an ArrayList<String> and convert it to an array using ArrayList.toArray() once you're done. I'd perhaps keep it as a Java collection, however, for greater flexibility.
You have the assignment operation inverted (array[i] should be set to s.nextLine(), not the other way around). Also, it would be best to modify the for loop to terminate when no more lines exist:
for (i = 0; i < array.length && s.hasNextLine(); i++) {
    array[i] = s.nextLine();
}
It can be done as a one-liner with Apache Commons, specifically FileUtils.readLines():
String[] phrases = FileUtils.readLines(myFile).toArray(new String[0]);
Don't waste your time with Scanner. BufferedReader is just fine. Try this:
BufferedReader br = new BufferedReader(new FileReader("Phrases.txt"));
LinkedList<String> phrases = new LinkedList<String>();
String line;
// readLine() returning null is the reliable end-of-file test; ready() is not
while ((line = br.readLine()) != null) {
    phrases.add(line);
}
String[] phraseArray = phrases.toArray(new String[0]);
By the way it's important to use LinkedList not ArrayList if the file is large. That way you only create one array at the end. Otherwise you will have a lot of large array creation and wasted memory.
You are doing it wrong. It has to be:
for (i = 0; i < array.length; i++)
{
    array[i] = s.nextLine();
}
array[i] = value; // the value would be assigned into the array at index i.
However, a better option would be to use a List implementing class such as ArrayList, which gives you the advantage of dynamic size.
List<String> list = new ArrayList<>();
list.add(s.nextLine());
I'm doing a frequency dictionary, in which I read 1000 files, each one with about 1000 lines. The approach I'm following is:
use a BufferedReader to read file by file
read the first file, get the first sentence, split the sentence into a string array, then fill a hashmap with the values from the string array
do this for all the sentences in that file
do this for all 1000 files
My problem is that this is not a very efficient way to do it; it takes about 4 minutes to do all this. I've increased the heap size and refactored the code to make sure I'm not doing something wrong. For this approach, I'm completely sure there's nothing I can improve in the code.
My bet is that each time a sentence is read, a split is applied, which, multiplied by 1000 sentences per file and by 1000 files, is a huge amount of splits to process.
My idea is that, instead of reading and processing file by file, I could read each file into a char array and then make the split only once per file. That would cut down the amount of time consumed by splitting. Any suggestions on implementation would be appreciated.
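For illustration, the read-once idea might look something like this (a minimal sketch; file stands in for one of the 1000 files):
StringBuilder sb = new StringBuilder();
BufferedReader r = new BufferedReader(new FileReader(file));
char[] buf = new char[8192];
int n;
while ((n = r.read(buf)) >= 0) {
    sb.append(buf, 0, n); // accumulate the whole file
}
r.close();
String[] words = sb.toString().split("\\s+"); // one split per file instead of one per sentence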
OK, I have just implemented a POC of your dictionary. Fast and dirty. My files contained 868 lines each, but I created 1024 copies of the same file. (It is the table of contents of the Spring Framework documentation.)
I ran my test and it took 14020 ms (14 seconds!). BTW, I ran it from Eclipse, which could decrease the speed a little bit.
So, I do not know where your problem is. Please try my code on your machine and, if it runs faster, try to compare it with your code and understand where the root problem is.
Anyway, my code is not the fastest I can write. I could create the Pattern before the loop and then use it instead of String.split(); String.split() calls Pattern.compile() every time, and compiling a pattern is very expensive.
Here is the code:
public static void main(String[] args) throws IOException {
    Map<String, Integer> words = new HashMap<String, Integer>();
    long before = System.currentTimeMillis();
    File dir = new File("c:/temp/files");
    for (File file : dir.listFiles()) {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
        for (String line = reader.readLine(); line != null; line = reader.readLine()) {
            String[] lineWords = line.split("\\s+");
            for (String word : lineWords) {
                int count = 1;
                Integer currentCount = words.get(word);
                if (currentCount != null) {
                    count = currentCount + 1;
                }
                words.put(word, count);
            }
        }
        reader.close(); // close each file; the original snippet leaked one file handle per file
    }
    long after = System.currentTimeMillis();
    System.out.println("run took " + (after - before) + " ms");
    System.out.println(words);
}
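For illustration, the Pattern optimization mentioned above is a two-line change against the code above (a sketch; Pattern is java.util.regex.Pattern):
// Before the loop over dir.listFiles(): compile the pattern once
Pattern whitespace = Pattern.compile("\\s+");
// Inside the line loop, replacing line.split("\\s+"):
String[] lineWords = whitespace.split(line);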
If you don't care that the contents are in different files, I would take the approach you are recommending: read all files and all lines into memory (a string, or a char array, whatever) and then do the one split and hash populate based on the single string/dataset.
If I understand what you're doing, I don't think you want to use strings except when you access your map.
You want to:
loop through the files
read each file into a buffer of something like 1024 chars
process the buffer looking for word-end characters
create a String from the character array
check your map
if found, update your count; if not, create a new entry
when you reach the end of the buffer, get the next buffer from the file
at the end, loop to the next file
Split is probably pretty expensive since it has to interpret the expression each time.
Reading the file as one big string and then splitting that sounds like a good idea. String splitting/modifying can be surprisingly 'heavy' when it comes to garbage collection. Multiple lines/sentences means multiple Strings, and with all the splits it means a huge number of Strings (Strings are immutable, so any change to them will actually create a new String or multiple Strings)... this produces a lot of garbage to be collected, and the garbage collection could become a bottleneck (with a smaller heap, the maximum amount of memory is reached all the time, kicking off a garbage collection, which potentially needs to clean up hundreds of thousands or millions of separate String objects).
Of course, without knowing your code this is just a wild guess, but back in the day I got an old command-line Java program's running time (it was a graph algorithm producing a huge SVG file) to drop from about 18 seconds to less than 0.5 seconds just by modifying the string handling to use StringBuffers/StringBuilders.
Another thing that springs to mind is using multiple threads (or a thread pool) to handle different files concurrently and then combining the results at the end. Once you get the program to run "as fast as possible", the remaining bottleneck will be the disk access, and the only way (AFAIK) past that is faster disks (SSDs, etc.).
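A sketch of the thread-pool idea (assuming Java 8+; countWords is a hypothetical per-file routine that returns the word counts for a single file):
ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
List<Future<Map<String, Integer>>> futures = new ArrayList<>();
for (File file : dir.listFiles()) {
    futures.add(pool.submit(() -> countWords(file))); // countWords(File) is hypothetical
}
Map<String, Integer> total = new HashMap<>();
for (Future<Map<String, Integer>> f : futures) {
    // get() blocks until that file is done; it can throw InterruptedException/ExecutionException
    for (Map.Entry<String, Integer> e : f.get().entrySet()) {
        total.merge(e.getKey(), e.getValue(), Integer::sum);
    }
}
pool.shutdown();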
Since you're using a BufferedReader, why do you need to read in a whole file explicitly? I definitely wouldn't use split if you're after speed; remember, it has to evaluate a regular expression each time you run it.
Try something like this for your inner loop (note, I have not compiled this or tried to run it):
StringBuilder sb = null;
String delimiters = " .,\t"; // build out all your word delimiters in a string here
for (int nextChar = br.read(); nextChar >= 0; nextChar = br.read()) {
    if (delimiters.indexOf(nextChar) < 0) {
        if (sb == null) sb = new StringBuilder();
        sb.append((char) nextChar);
    } else {
        if (sb != null) {
            // Add sb.toString() to your map or increment its count
            sb = null;
        }
    }
}
if (sb != null) {
    // Don't forget the final word if the file doesn't end with a delimiter
}
You could try using different sized buffers explicitly, but you probably won't get a performance improvement over this.
One very simple approach which uses minimum heap space and should be (almost) as fast as anything else would be:
int c;
final String SEPARATORS = " \t,.\n"; // extend as needed
final StringBuilder word = new StringBuilder();
while ((c = fileInputStream.read()) >= 0) {
    final char letter = (char) c;
    if (SEPARATORS.indexOf(letter) < 0) {
        word.append(letter);
    } else if (word.length() > 0) { // skip the empty "words" between consecutive separators
        processWord(word.toString());
        word.setLength(0);
    }
}
if (word.length() > 0) {
    processWord(word.toString()); // flush the final word if the file doesn't end with a separator
}
Extend it for more separator characters as needed, and possibly use multi-threading to process multiple files concurrently until disk IO becomes the bottleneck...
I'm making a dictionary app on Android. During its startup, the app will load the content of an .index file (~2MB, 100,000+ lines).
However, when I use BufferedReader.readLine() and do something with the returned string, the app causes an OutOfMemoryError.
// Read file snippet
Set<String> indexes = new HashSet<String>();
FileInputStream is = new FileInputStream(indexPath);
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
String readLine;
while ((readLine = reader.readLine()) != null) {
    indexes.add(extractHeadWord(readLine));
}
// And the extractHeadWord method
private String extractHeadWord(String string) {
    String[] splitted = string.split("\\t");
    return splitted[0];
}
When reading the log, I found that during execution it causes the GC to explicitly clean objects many times (GC_EXPLICIT freed xxx objects, in which xxx is a big number such as 15000 or 20000).
And I tried another way:
final int BUFFER = 50;
char[] readChar = new char[BUFFER];
//.. construct BufferedReader
while (reader.read(readChar) != -1) {
    indexes.add(new String(readChar));
    readChar = new char[BUFFER];
}
...and it ran very fast. But it was not exactly what I wanted.
Is there any solution that runs as fast as the second snippet and is as easy to use as the first?
Regards.
extractHeadWord uses the String.split method. This method does not create new strings but relies on the underlying string (in your case the line object) and uses indexes to point out the "new" string.
Since you are not interested in the rest of the string, you need to discard it so it gets garbage collected; otherwise the whole string will stay in memory even though you are only using a part of it.
Calling the constructor String(String) (the "copy constructor") discards the rest of the string:
private String extractHeadWord(String string) {
    String[] splitted = string.split("\\t");
    return new String(splitted[0]);
}
What happens if your extractHeadWord does return new String(splitted[0]);?
It will not reduce the number of temporary objects, but it might reduce the footprint of the application. I don't know if split does about the same as substring, but I guess that it does. substring creates a new view over the original data, which means that the full character array is kept in memory. Explicitly invoking new String(string) will truncate the data.