How to speed up reading in from a massive file (Java)

So basically, for this assignment I'm working on, we have to read in from a huge file of about a million lines, store the keys and values in a data structure of our choice (I'm using hash tables), offer functionality to change values for keys, and then save the key-value pairs back into a file. I'm using the cuckoo hashing method along with a technique I found in a Harvard paper called "stashing" to accomplish this, and I'm fine with all of that. My only concern is the amount of time the program takes just to read in the data from the file.
The file is formatted so that each line has a key (integer) and a value (String) written like this:
12345 'abcdef'
23456 'bcdefg'
and so on. The method I have come up with to read this in is this:
private static void readData() throws IOException {
    try {
        BufferedReader inStream = new BufferedReader(new FileReader("input/data.db"));
        StreamTokenizer st = new StreamTokenizer(inStream);
        String line = inStream.readLine();
        do {
            String[] arr = line.split(" ");
            line = inStream.readLine();
            Long n = Long.parseLong(arr[0]);
            String s = arr[1];
            //HashNode<Long, String> node = HashNode.create(n, s);
            //table = HashTable.empty();
            //table.add(n, s);
        } while (line != null);
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
The method works fine for actually getting the data; however, when I tested it with our test file of a million lines, it took about 20 minutes to read everything in. Surely that isn't a fast time for reading data from a file, and I am positive there must be a better way of doing it.
I have tried several different methods for input: a BufferedInputStream with a FileInputStream, and Scanner (though the file extension is .db, so Scanner didn't work). I initially didn't have the tokenizer but added it in the hope it would help. I don't know if the computer I'm running it on makes much of a difference; I'm currently doing the run on a MacBook Air, but I'm having a mate run it on his laptop in a bit to see if that might help it along. Any input on how to fix this, or on what I might be doing to slow things down SO much, would be sincerely and greatly appreciated.
P.S. please don't hate me for programming on a Mac :-)

You can use java.nio.file.*; the following code is written in Java 8 style but can easily be modified to earlier versions of Java if needed:
Map<Long, String> map = new HashMap<>();
Files.lines(Paths.get("full-path-to-your-file")).forEach(line -> {
    String[] arr = line.split(" ");
    Long number = Long.parseLong(arr[0]);
    String string = arr[1];
    map.put(number, string);
});
There is an additional performance gain available if you turn the stream parallel with Files.lines(..).parallel().forEach(...); Files.lines(..) on its own gives you a sequential stream. With a parallel stream the lines will not be processed in order (and in our case you don't need them to be); if you did need them in order you could call forEachOrdered() instead. Note that with parallel processing the map also has to be thread-safe, as shown in the sketch below.
On my MacBook it took less than 5 seconds to write 2 million such records to a file and then read it and populate the map.
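If you do go the parallel route, a minimal sketch might look like the following; the path is the same placeholder as above, and the map is swapped for a ConcurrentHashMap because multiple threads call put concurrently:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

public class ParallelLoad {
    public static void main(String[] args) throws IOException {
        Map<Long, String> map = new ConcurrentHashMap<>();
        // Files.lines() alone is sequential; .parallel() is what enables parallel processing.
        try (Stream<String> lines = Files.lines(Paths.get("full-path-to-your-file"))) {
            lines.parallel().forEach(line -> {
                String[] arr = line.split(" ");
                map.put(Long.parseLong(arr[0]), arr[1]);
            });
        }
        System.out.println("loaded " + map.size() + " entries");
    }
}

Whether the parallel version is actually faster depends on how much work is done per line; for a simple split-and-put it may not be.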

Get rid of the StreamTokenizer. You can read millions of lines per second with BufferedReader.readLine(), and that's all you're really doing: no tokenization.
But I strongly suspect the time isn't being spent in I/O but in processing each line.
NB Your do/while loop is normally written as a while loop:
while ((line = in.readLine()) != null)
Much clearer that way, and no risk of NPEs.
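Putting both points together, the read loop from the question might look roughly like this. This is only a sketch: the assignment's cuckoo/stash table is replaced by a plain HashMap, and the method is given a path parameter for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class ReadData {
    static Map<Long, String> readData(String path) throws IOException {
        Map<Long, String> table = new HashMap<>();           // stand-in for the cuckoo/stash table
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {          // plain while loop, no NPE risk
                int space = line.indexOf(' ');
                long key = Long.parseLong(line.substring(0, space));
                String value = line.substring(space + 1);     // keeps the quotes, as in the file
                table.put(key, value);
            }
        }
        return table;
    }
}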

Related

How to read large files (a single continuous string) in Java?

I am trying to read a very large file (~2 GB). The content is one continuous string containing sentences (I would like to split them on '.'). No matter what I try, I end up with an OutOfMemoryError.
BufferedReader in = new BufferedReader(new FileReader("a.txt"));
String read = null;
int i = 0;
while ((read = in.readLine()) != null) {
    String[] splitted = read.split("\\.");
    for (String part : splitted) {
        i += 1;
        users.add(new User(i, part));
        repository.saveAll(users);
    }
}
also,
inputStream = new FileInputStream(path);
sc = new Scanner(inputStream, "UTF-8");
while (sc.hasNextLine()) {
    String line = sc.nextLine();
    // System.out.println(line);
}
// note that Scanner suppresses exceptions
if (sc.ioException() != null) {
    throw sc.ioException();
}
Content of the file (composed of random words with a full stop after 10 words):
fmfbqi .xcdqnjqln kvjhw pexrbunnr cgvrqlr fpaczdegnb puqzjdbp gcfxne jawml aaiwwmo ugzoxn .opjc fmfbqi .xcdqnjqln kvjhw pexrbunnr cgvrqlr fpaczdegnb puqzjdbp gcfxne jawml aaiwwmo ugzoxn .opjc (so on)
Please help!
So first and foremost, based on comments on your question, as Joachim Sauer stated:
If there are no newlines, then there is only a single line and thus only one line number.
So your use case is faulty, at best.
Let's move past that, and assume maybe there are newline characters; or, better yet, assume that the . character you're splitting on is intended as a pseudo-newline.
Scanner is not a bad approach here, though there are others. Since you provided a Scanner, let's continue with that, but you want to make sure you're wrapping it around a BufferedReader. You clearly don't have a lot of memory, and a BufferedReader lets you read the file in chunks while you keep using the Scanner's functionality; as the caller, the buffering is completely invisible to you:
Scanner sc = new Scanner(new BufferedReader(new FileReader(new File("a.txt")), 10 * 1024 * 1024));
What this is basically doing is letting the Scanner function as you expect, but buffering a sizeable chunk (here ten million characters) at a time, minimizing your memory footprint. Now, you just keep calling
sc.useDelimiter("\\.");
for (int i = 0; sc.hasNext(); i++) {
    String pseudoLine = sc.next();
    // store line 'i' in your database for this pseudo-line
    // DO NOT store pseudoLine anywhere else - you don't have memory for it
}
Since you don't have enough memory, the point to iterate (and re-iterate) is: don't keep any part of the file in your JVM's heap space after reading it. Read it, use it however you need to, and allow it to be marked for JVM garbage collection. In your case, you mention you want to store the pseudo-lines in a database, so you want to read a pseudo-line, store it in the database, and just discard it.
There are other things to point out here, such as configuring your JVM arguments, but I hesitate to even mention it because just setting your JVM memory high is a bad idea too - another brute force approach. There's nothing wrong with setting your JVM memory max heap size higher, but learning memory management is better if you're still learning how to write software. You'll get in less trouble later when you get into professional development.
Also, I mentioned Scanner and BufferedReader because you mentioned them in your question, but I think checking out java.nio.file.Files.lines(...) as pointed out by deHaar is also a good idea. It basically does the same thing as the code I've explicitly laid out, with the caveat that it still only does one line at a time, without the ability to change what you're 'splitting' on. So if your text file really has a single line in it, this will still cause you a problem and you will still need something like a Scanner to fragment the line out.

Too much memory while reading a dictionary file in Java

I read a dictionary that might be 100 MB or so in size (sometimes it gets bigger, up to a max of 500 MB). It is a simple dictionary of two columns, the first column words, the second column a float value. I read the dictionary file in this way:
BufferedReader br = new BufferedReader(new FileReader(file));
String line;
while ((line = br.readLine()) != null) {
    String[] cols = line.split("\t");
    setIt(cols[0], cols[1]);
}
and for the setIt function:
public void setIt(String term, String value) {
    all.put(term, new Double(value));
}
When I have a big file, it takes a long time to load and it often runs out of memory. Even with a reasonably sized file (100 MB) it needs about 4 GB of memory in Java to run.
Any clue how to improve it while not changing the structure of the whole package?
EDIT: I'm using a 50MB file with -Xmx1g and I still get the error.
UPDATE: There were some repeated iterations over the file which I have now fixed, and the memory problem is partially solved. I have yet to try the Properties approach and the other solutions and will report back.
You are allocating a new String for every line. There is some overhead associated with a String. See here for a calculation. This article also addresses the subject of object memory use in Java.
There is a stack overflow question on the subject of more memory efficient replacements for strings here.
Is there something you can do to avoid all those allocations? For example, are there a limited number of strings that you could represent as an integer in your data structure, and then use a smaller lookup table to translate?
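If the same strings do occur many times, one way to act on that idea is a small interning pool that hands out compact int ids; a minimal sketch (the class name and API are my own illustration, not the asker's code):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StringPool {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> byId = new ArrayList<>();

    /** Returns a small int id for the string, assigning a new one if it is unseen. */
    int idOf(String s) {
        Integer id = ids.get(s);
        if (id == null) {
            id = byId.size();
            ids.put(s, id);
            byId.add(s);
        }
        return id;
    }

    /** Translates an id back to its string. */
    String stringOf(int id) {
        return byId.get(id);
    }
}

You could then key your main data structure by the int id and keep only one copy of each distinct string; this only pays off if the strings genuinely repeat.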
You can do a lot of things to reduce memory usage. For example:
1- replace String[] cols = line.split("\t"); with:
static final Pattern PATTERN = Pattern.compile("\t");
//...
String[] cols = PATTERN.split(line);
2- use a .properties file to store your dictionary and simply load it this way:
Properties properties = new Properties();
//...
try (FileInputStream fileInputStream = new FileInputStream("D:/dictionary.properties")) {
    properties.load(fileInputStream);
}
Map<String, Double> map = new HashMap<>();
Enumeration<?> enumeration = properties.propertyNames();
while (enumeration.hasMoreElements()) {
    String key = (String) enumeration.nextElement();
    map.put(key, new Double(properties.getProperty(key)));
}
//...
//...
dictionary.properties :
A = 1
B = 2
C = 3
//...
3- use StringTokenizer:
StringTokenizer tokenizer = new StringTokenizer(line, "\t");
setIt(tokenizer.nextToken(), tokenizer.nextToken());
Well, my solution deviates a little bit from your code ...
Use Lucene, or more specifically the Lucene dictionary, or even more specifically the Lucene spell checker, depending on what you want.
Lucene handles any amount of data with efficient memory usage.
Your problem is that you are storing the whole dictionary in memory ... Lucene stores it in a file with hashing and then takes the search results from the file at runtime, but efficiently. This saves a lot of memory. You can customize the search depending on your needs.
Small Demo of Lucene
A few causes for this problem would be:
1). The String array cols is using up too much memory.
2). The String line might also be using too much memory, though that's unlikely.
3). While Java is opening and reading the file it's also using memory, so that's also a possibility.
4). Your map.put() will also be taking up a small amount of memory.
It might also be all these things combined, so maybe try commenting some lines out and see if it works then.
The most likely cause is that all these things added up are eating your memory. So a 10 megabyte file could end up taking 50 megabytes in memory. Also make sure to .close() all input streams, and try to free memory by splitting up your methods so variables can be garbage collected sooner.
As for doing this without changing the package structure or the Java heap size arguments, I'm not sure it will be very easy, if possible at all.
Hope this helps.

Need a better method of reading and storing values from text file

Goal: to get values from a text file and store it into values to load into my sqlite database.
Problem: My method is not efficient, and I need help coming up with an easier way.
As of right now I am parsing my textfile that looks like this.
agency_id,agency_name,agency_url,agency_timezone,agency_lang,agency_phone
1,"NJ TRANSIT BUS","http://www.njtransit.com/",America/New_York,en,""
2,"NJ TRANSIT RAIL","http://www.njtransit.com/",America/New_York,en,""
I am parsing every time I read a comma, then storing that value into a variable which I will then use as my database value.
This method works but is time consuming. The next text file I have to read in has over 200 lines, and I need to find an easier way.
AgencyString = readText();
tv = (TextView) findViewById(R.id.letter);
tv.setText(readText());
StringTokenizer st = new StringTokenizer(AgencyString, ",");
for (int i = 0; i < AgencyArray.length; i++) {
    size = i; // which value I am targeting in the text file
              // e.g. 1 would be agency_id, 2 would be agency_name
    AgencyArray[i] = st.nextToken();
}
tv.setText(AgencyArray[size]); // the value I'm going to store as the database value
}
private String readText() {
    InputStream inputStream = getResources().openRawResource(R.raw.agency);
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    int i;
    try {
        i = inputStream.read();
        while (i != -1) {
            byteArrayOutputStream.write(i);
            i = inputStream.read();
        }
        inputStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return byteArrayOutputStream.toString();
}
First, why is this a problem? I don't mean to answer your question with a question, so to speak, but more context is required to understand in what way you need to improve the efficiency of what you're doing. Is there a perceived delay in the application due to the parsing of the file, or do you have a more serious ANR problem because you're running on the UI thread? Unless there is some bottleneck in other code not shown, I honestly doubt you'd read and tokenise it much faster than you're presently doing. Well, actually, no doubt you probably could; however, I believe it's more a case of designing your application so that delays involved in fetching and parsing large data aren't perceived by, or cause irritation to, the user.
My own application parses massive files like this and it does take a fraction of a second, but it doesn't present a problem due to the design of the overall application and UI. Also, have you used the profiler to see what's taking the time? And have you run this on a real device, without the debugger attached? Having the debugger attached to a real device, or using the emulator, increases execution time by several orders of magnitude.
I am making the assumption that you need to parse this file type after receiving it over a network, as opposed to being something that is bundled with the application and only needs parsing once.
You could just bundle the SQLite database with your application instead of representing it in a text file. Look at the answer to this question
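If you do keep the bundled text file, a lighter-weight sketch than reading it byte by byte into a ByteArrayOutputStream is to stream it line by line and split each record on commas. The class and method names below are illustrative, and the plain split(",") is only safe because the quoted fields in this sample never contain commas themselves:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class AgencyParser {
    /** Reads a CSV-style resource one line at a time, returning one String[] per record. */
    public static List<String[]> readAgencies(InputStream inputStream) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream))) {
            reader.readLine();                 // skip the header line with the column names
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line.split(","));     // index 0 = agency_id, 1 = agency_name, ...
            }
        }
        return rows;
    }
}

You would pass in getResources().openRawResource(R.raw.agency), as in the question, and then insert each row into the database.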

Faster implementation of more than one input in a single line (Java)

Well, this might be a silly problem.
I just want a faster implementation of the following problem.
I want to take three integer inputs on a single line, e.g.:
10 34 54
One way is to make a BufferedReader and then use readLine(),
which will read the whole line as a string;
then we can use StringTokenizer to separate the three integers. (Slow implementation)
Another way is to use Scanner and take input with the nextInt() method. (Slower than the previous method)
I want a fast implementation to take such type of inputs since I have to read more than 2,000,000 lines and these implementations are very slow.
My implementation:
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
for (i = 0; i < n; i++) {
    str = br.readLine();
    st = new StringTokenizer(str);
    t1 = Integer.parseInt(st.nextElement().toString());
    t2 = Integer.parseInt(st.nextElement().toString());
    z = Long.parseLong(st.nextElement().toString());
}
This one is looped for n times. ( n is number of entries)
Since I know each line will contain only three integers, there is no need to check hasMoreElements().
I just want a faster implementation of the following problem.
The chances are that you DON'T NEED a faster implementation. Seriously. Not even with a 2 million line input file.
The chances are that:
more time is spent processing the file than reading it, and
most of the "read time" is spent doing things at the operating system level, or simply waiting for the next disk block to be read.
My advice is to not bother optimizing this unless the application as a whole takes too long to run. And when you find that this is the case, profile the application, and use the profile stats to tell you where it could be worthwhile spending effort on optimization.
(My gut feeling is that there is not much to be gained by optimizing this part of your application. But don't rely on that. Profile it!)
Here's a basic example that will be pretty fast:
public static void main(String[] args) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader("myfile.txt"));
    String line;
    while ((line = reader.readLine()) != null) {
        for (String s : line.split(" ")) {
            final int i = Integer.parseInt(s);
            // do something with i...
        }
    }
    reader.close();
}
However your task is fundamentally going to take time.
If you are doing this on a website and reaching a timeout, you should consider doing it in a background thread, and send a response to the user saying that the data is being processed. You'll probably need to add a way for the user to check on the progress.
Here is what I mean when I say "specialized scanner". Depending on the parser's (or split's) efficiency, this might be a bit faster (it probably is not):
BufferedReader br = new BufferedReader(...);
for (int i = 0; i < n; i++)
{
    String str = br.readLine();
    long[] resultLongs = {-1, -1, -1};
    int startPos = 0;
    int nextLongIndex = 0;
    for (int p = 0; p < str.length(); p++)
    {
        if (str.charAt(p) == ' ')
        {
            String numberAsStr = str.substring(startPos, p);   // end index is exclusive
            resultLongs[nextLongIndex++] = Long.parseLong(numberAsStr);
            startPos = p + 1;
        }
    }
    // the last number has no trailing blank, so pick it up after the loop
    resultLongs[nextLongIndex] = Long.parseLong(str.substring(startPos));
    // t1, t2 and z are in resultLongs[0] through resultLongs[2]
}
Hths.
And of course this fails miserably if the input file contains garbage, i.e. anything else but longs separated by blanks.
And in addition, to minimize the "roundtrips" to the OS, it is a good idea to supply the buffered reader with a nonstandard (bigger-than-standard) buffer.
The other hint I gave in the comment, refined: if you have to read such a huge text file more than once, i.e. more than once since it was last updated, you could read all the longs into a data structure (maybe a List of elements that each hold three longs) and stream that into a "cache" file. Next time, compare the text file's timestamp to the "cache" file's: if the text file is older, read the cache file. Since stream I/O does not serialize longs into their string representation, you will see much, much better reading times.
EDIT: Missed the startPos reassignment.
EDIT2: Added the cache idea explanation.
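A minimal sketch of that cache idea, under the assumption of the same three-numbers-per-line format and with made-up file names; it also uses a larger-than-default reader buffer, as suggested above:

import java.io.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class LongTripleCache {

    /** Loads the triples, rebuilding the binary cache only when the text file is newer. */
    static List<long[]> load(Path textFile, Path cacheFile) throws IOException {
        if (Files.exists(cacheFile)
                && Files.getLastModifiedTime(cacheFile).compareTo(Files.getLastModifiedTime(textFile)) >= 0) {
            return readCache(cacheFile);
        }
        List<long[]> triples = readText(textFile);
        writeCache(cacheFile, triples);
        return triples;
    }

    private static List<long[]> readText(Path textFile) throws IOException {
        List<long[]> triples = new ArrayList<>();
        // a bigger-than-default buffer (1M chars here) reduces round-trips to the OS
        try (BufferedReader br = new BufferedReader(new FileReader(textFile.toFile()), 1 << 20)) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split(" ");
                triples.add(new long[] {
                        Long.parseLong(parts[0]), Long.parseLong(parts[1]), Long.parseLong(parts[2]) });
            }
        }
        return triples;
    }

    private static void writeCache(Path cacheFile, List<long[]> triples) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(cacheFile)))) {
            out.writeInt(triples.size());
            for (long[] t : triples) {
                out.writeLong(t[0]);
                out.writeLong(t[1]);
                out.writeLong(t[2]);
            }
        }
    }

    private static List<long[]> readCache(Path cacheFile) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(cacheFile)))) {
            int n = in.readInt();
            List<long[]> triples = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                triples.add(new long[] { in.readLong(), in.readLong(), in.readLong() });
            }
            return triples;
        }
    }
}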

Search a string in a file and write the matched lines to another file in Java

For searching a string in a file and writing the lines with the matched string to another file, it takes 15-20 minutes for a single zip file of 70 MB (compressed).
Is there any way to minimise it?
my source code:
getting Zip file entries
zipFile = new ZipFile(source_file_name);
entries = zipFile.entries();
while (entries.hasMoreElements())
{
    ZipEntry entry = (ZipEntry) entries.nextElement();
    if (entry.isDirectory())
    {
        continue;
    }
    searchString(Thread.currentThread(), entry.getName(),
            new BufferedInputStream(zipFile.getInputStream(entry)),
            Out_File, search_string, stats);
}
zipFile.close();
Searching String
public void searchString(Thread CThread, String Source_File, BufferedInputStream in, File outfile, String search, String stats) throws IOException
{
    int count = 0;
    int countw = 0;
    int countl = 0;
    String s;
    String[] str;
    BufferedReader br2 = new BufferedReader(new InputStreamReader(in));
    System.out.println(CThread.currentThread());
    while ((s = br2.readLine()) != null)
    {
        str = s.split(search);
        count = str.length - 1;
        countw += count; //word count
        if (s.contains(search))
        {
            countl++; //line count
            WriteFile(CThread, s, outfile.toString(), search);
        }
    }
    br2.close();
    in.close();
}
--------------------------------------------------------------------------------
public void WriteFile(Thread CThread, String line, String out, String search) throws IOException
{
    BufferedWriter bufferedWriter = null;
    System.out.println("write thread" + CThread.currentThread());
    bufferedWriter = new BufferedWriter(new FileWriter(out, true));
    bufferedWriter.write(line);
    bufferedWriter.newLine();
    bufferedWriter.flush();
}
Please help me. It's really taking 40 minutes for 10 files using threads and 15-20 minutes for a single 70 MB compressed file. Any way to minimise the time?
You are reopening the file output handle for every single line you write.
This is likely to have a massive performance impact, far outweighing other performance issues. Instead I would recommend creating the BufferedWriter once (e.g. upon the first match) and then keeping it open, writing each matching line and then closing the Writer upon completion.
Also, remove the call to flush(); there is no need to flush each line as the call to Writer.close() will automatically flush any unwritten data to disk.
Finally, as a side note your variable and method naming style does not follow the Java camel case convention; you might want to consider changing it.
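A sketch of what that restructuring might look like; it is simplified (the thread and stats parameters from the question are dropped, and try-with-resources handles the closing), not a drop-in replacement:

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;

public class SearchHelper {
    public void searchString(BufferedInputStream in, File outfile, String search) throws IOException {
        int countw = 0;   // occurrences of the search string
        int countl = 0;   // matching lines
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in));
             BufferedWriter writer = new BufferedWriter(new FileWriter(outfile, true))) {
            String s;
            while ((s = br.readLine()) != null) {
                countw += s.split(search).length - 1;   // same counting trick as the original
                if (s.contains(search)) {
                    countl++;
                    writer.write(s);                    // single writer, opened once
                    writer.newLine();                   // no per-line flush needed
                }
            }
        }   // try-with-resources closes and flushes both streams here
        System.out.println(countl + " matching lines, " + countw + " occurrences");
    }
}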
I'm not sure if the cost you are seeing is from disk operations or from string manipulations. I'll assume for now that the problem is the strings; you can check that by writing a test driver that runs your code with the same line over and over.
I can tell you that split() is going to be very expensive in your case because you are producing strings you don't need and then recycling them, creating much overhead. You may want to increase the amount of space available to your JVM with -Xmx.
If you merely separate words by the presence of whitespace, then you would do much better by using a regular expression matcher that you create before the loop and apply to each string. The number of matches when applied to a given string will be your word count, and that should not create an array of strings (which is very wasteful and which you don't use). You will see in the JavaDocs that split does work via regular expressions; that is true, but split does the extra step of creating separate strings and that's where your waste might be.
You can also use a regular expression to search for the match instead of contains though that may not be significantly faster.
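For instance, a rough sketch of counting occurrences with a precompiled Pattern, assuming the search string should be treated as a literal (the class name is just an illustration):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OccurrenceCounter {
    // Compile once (e.g. before the read loop); Pattern.quote() treats the
    // search string as a literal rather than as a regular expression.
    private final Pattern needle;

    public OccurrenceCounter(String search) {
        this.needle = Pattern.compile(Pattern.quote(search));
    }

    /** Counts occurrences of the search string in one line without allocating a String[]. */
    public int countIn(String line) {
        Matcher m = needle.matcher(line);
        int occurrences = 0;
        while (m.find()) {
            occurrences++;
        }
        return occurrences;
    }
}

Inside the read loop, int c = counter.countIn(s) would replace the split()-based word count, and c > 0 would replace the contains() check.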
You could make things parallel by using multiple threads. However, if split() is the cause of your grief, your problem is the overhead and running out of heap space, so you won't necessarily benefit from it.
More generally, if you need to do this a lot, you may want to write a script in a language more "friendly" to string manipulation. A 10-line script in Python can do this much faster.
wow, what are you doing in this method
WriteFile(CThread,s, outfile.toString(), search);
every time you get a line containing your text, you are creating a new BufferedWriter(new FileWriter(out, true));
Just create a BufferedWriter once in your searchString method and use that to insert lines. There is no need to open it again and again. It will drastically improve the performance.
One problem here might be that you stop reading when you write. I would probably use one thread for reading and another thread for writing the file. As an extra optimization the thread writing the results could buffer them into memory and write them to the file as a batch, say every ten entries or something.
In the writing thread you should queue the incoming entries before handling them.
Of course, you should maybe first debug where that time is spent, is it the IO or something else.
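A minimal sketch of that reader/writer split using a BlockingQueue; this is purely an illustration, and the batch size and poison-pill marker are arbitrary choices:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MatchWriter implements Runnable {
    private static final String POISON_PILL = "\u0000__END__";   // arbitrary end-of-input marker
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final String outPath;

    public MatchWriter(String outPath) {
        this.outPath = outPath;
    }

    /** Called by the reading thread for every matching line. */
    public void submit(String line) throws InterruptedException {
        queue.put(line);
    }

    /** Called by the reading thread when it is done. */
    public void finish() throws InterruptedException {
        queue.put(POISON_PILL);
    }

    @Override
    public void run() {
        List<String> batch = new ArrayList<>();
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(outPath, true))) {
            while (true) {
                String line = queue.take();
                if (line.equals(POISON_PILL)) {
                    break;
                }
                batch.add(line);
                if (batch.size() >= 10) {          // write in batches of ten, as suggested
                    flushBatch(writer, batch);
                }
            }
            flushBatch(writer, batch);             // write whatever is left
        } catch (IOException | InterruptedException e) {
            Thread.currentThread().interrupt();
            e.printStackTrace();
        }
    }

    private static void flushBatch(BufferedWriter writer, List<String> batch) throws IOException {
        for (String line : batch) {
            writer.write(line);
            writer.newLine();
        }
        batch.clear();
    }
}

Usage would be along the lines of: MatchWriter writer = new MatchWriter(outPath); new Thread(writer).start(); then the reading thread calls writer.submit(line) for each match and writer.finish() when done.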
There are too many potential bottlenecks in this code for anyone to be sure which are the critical ones. Therefore you should profile the application to determine what is causing it to be slow.
Armed with that information, decide whether the problem is in reading the ZIP file, doing the searching, or writing the matches to the output file.
(Repeatedly opening and closing the output file is a bad idea, but if you only get a tiny number of search hits it won't make much difference to the overall performance.)
