Multithreaded string processing blows up with #threads - java

I'm working on a multithreaded project where we have to parse some text from a file into a magic object, do some processing on the object, and aggregate the output. The old version of the code parsed the text in one thread and did the object processing in a thread pool using Java's ExecutorService. We weren't getting the performance boost that we wanted, and it turned out that parsing takes longer than we thought relative to the processing time for each object, so I tried moving the parsing into the worker threads.
This should have worked, but what actually happens is that the time-per-object blows up as a function of the number of threads in the pool. It's worse than linear, but not quite as bad as exponential.
I've whittled it down to a small example that (on my machine anyhow) shows the behavior. The example doesn't even create the magic object; it's just doing string manipulation. There are no inter-thread dependencies that I can see; I know split() isn't terribly efficient, but I can't imagine why it would sh*t the bed in a multithreaded context. Have I missed something?
I'm running in Java 7 on a 24-core machine. Lines are long, ~1MB each. There can be dozens of items in features, and 100k+ items in edges.
Sample input:
1 1 156 24 230 1350 id(foo):id(bar):w(house,pos):w(house,neg) 1->2:1#1.0 16->121:2#1.0,3#0.5
Sample command line for running with 16 worker threads:
$ java -Xmx10G Foo 16 myfile.txt
Example code:
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Foo implements Runnable {
    String line;
    int id;

    public Foo(String line, int id) {
        this.line = line;
        this.id = id;
    }

    public void run() {
        System.out.println(System.currentTimeMillis() + " Job start " + this.id);
        // line format: tab delimited
        //   x[4]
        //   graph[2]
        //   features[m] <-- ':' delimited
        //   edges[n]
        String[] x = this.line.split("\t", 5);
        String[] graph = x[4].split("\t", 4);
        String[] features = graph[2].split(":");
        String[] edges = graph[3].split("\t");
        for (String e : edges) {
            String[] ee = e.split(":", 2);
            ee[0].split("->", 2);
            for (String f : ee[1].split(",")) {
                f.split("#", 2);
            }
        }
        System.out.println(System.currentTimeMillis() + " Job done " + this.id);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        System.err.println("Reading from " + args[1] + " in " + args[0] + " threads...");
        LineNumberReader reader = new LineNumberReader(new FileReader(args[1]));
        ExecutorService pool = Executors.newFixedThreadPool(Integer.parseInt(args[0]));
        for (String line; (line = reader.readLine()) != null; ) {
            pool.submit(new Foo(line, reader.getLineNumber()));
        }
        pool.shutdown();
        pool.awaitTermination(7, TimeUnit.DAYS);
    }
}
Updates:
Reading the whole file into memory first has no effect. To be more specific, I read the whole file, adding each line to an ArrayList<String>. Then I iterated over the list to create the jobs for the pool. This makes the substrings-eating-the-heap hypothesis unlikely, no?
Compiling one copy of the delimiter pattern to be used by all worker threads has no effect. :(
Resolution:
I've converted the parsing code to use a custom splitting routine based on indexOf(), like so:
private String[] split(String string, char delim) {
    if (string.length() == 0) return new String[0];
    // first pass: count delimiters to size the result array
    int nitems = 1;
    for (int i = 0; i < string.length(); i++) {
        if (string.charAt(i) == delim) nitems++;
    }
    // second pass: cut out each piece with indexOf()/substring()
    String[] items = new String[nitems];
    int last = 0;
    for (int next = last, i = 0; i < items.length && next != -1; last = next + 1, i++) {
        next = string.indexOf(delim, last);
        items[i] = next < 0 ? string.substring(last) : string.substring(last, next);
    }
    return items;
}
Oddly enough this does not blow up as the number of threads increases, and I have no idea why. It's a functional workaround though, so I'll live with it...

In Java 7, String.split() uses String.substring() internally, which for "optimization" reasons does not create real new Strings, but thin String shells that point to sub-sections of the original one.
So when you split() a String into small pieces, the original one (maybe huge) is still in memory and may end up eating all your heap. Since you parse big files, this might be a risk (this has been changed in Java 8).
Given that your format is well known, I would recommend parsing each line "by hand" rather than using String.split() (regexes are really bad for performance anyway), and creating real new Strings for the sub-parts.
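For illustration, a minimal sketch of that by-hand approach (the class and method names are mine, not from the question): it walks the line with indexOf() and wraps each piece in new String(...) so the copy no longer points at the big line's backing array.
import java.util.ArrayList;
import java.util.List;

class DetachingSplitter {
    // Hypothetical helper: split on a single character and detach each piece from the
    // original line's char[] by forcing a copy with new String(...).
    static List<String> splitDetached(String line, char delim) {
        List<String> parts = new ArrayList<String>();
        int start = 0;
        while (true) {
            int next = line.indexOf(delim, start);
            String piece = (next < 0) ? line.substring(start) : line.substring(start, next);
            parts.add(new String(piece)); // real copy; the big line can be collected
            if (next < 0) break;
            start = next + 1;
        }
        return parts;
    }
}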

String.split actually uses regular expressions to split; see: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/lang/String.java#String.split%28java.lang.String%2Cint%29
This means that you're compiling a whole bunch of regexes every iteration.
It'd probably be best to compile a pattern for the whole line once and then apply it to each line as it's read. Either that, or write your own parser that looks for character breaks instead of regexing it.
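A rough sketch of that suggestion (class and field names are mine, untested): compile the delimiters once and share the Pattern objects, which are immutable and thread-safe, across all worker threads. (The asker's update above notes that precompiling alone didn't cure the blow-up in their case, but it does avoid recompiling the regex on every call.)
import java.util.regex.Pattern;

class LineParser {
    // Compiled once; Pattern instances are safe to share between threads.
    private static final Pattern TAB = Pattern.compile("\t");
    private static final Pattern COLON = Pattern.compile(":");

    static String[] splitLine(String line) {
        return TAB.split(line, 5);      // instead of line.split("\t", 5)
    }

    static String[] splitFeatures(String field) {
        return COLON.split(field);      // instead of field.split(":")
    }
}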

Related

Method count words in a file

Hi guys I'm writing a method which counts words in a file, but apparently there is a mistake somewhere in the code and the method does not work. Here's my code:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.StringTokenizer;

public class Main2 {
    public static void main(String[] args) {
        count("/home/bruno/Desktop/WAR_JEE_S_09_Podstawy/MojPlik");
    }

    static int count(String fileName) {
        Path path = Paths.get(fileName);
        int ilosc = 0;
        String wyjscie = "";
        try {
            for (String charakter : Files.readAllLines(path)) {
                wyjscie += charakter;
            }
            StringTokenizer token = new StringTokenizer(wyjscie, " \n");
        } catch (IOException e) {
            e.printStackTrace();
        }
        return ilosc;
    }
}
The file path is correct, here is the file content
test test
test
test
After I call the method in main it displays nothing. Where is the mistake?
Your code would count lines in a file ... well, if you followed up on that thought.
Right now your code is simply reading lines and putting them into one large string, then doing nothing with the result of that operation. You have a single int counter ... which is initialized to 0 and then just returned without ever being used or increased! And unless I am mistaken, readAllLines() will automatically remove the newline char at the end, so overall your code is nothing but useless.
To count words you have to take each line and (for example) split that one-line string on spaces. That gives you a number per line. Then add up those numbers.
Long story short: the real answer here is that you should step back. Don't just write code, assuming that this will magically solve the problem. Instead: first think up a strategy (algorithm) that solves the problem. Write down the algorithm ideas using a pen and paper. Then "manually" run the algorithm on some sample data. Then, in the end, turn the algorithm into code.
Also, besides the fact that you do not output anything, there is a slight error in your logic. I have made a few changes here and there to get your code working.
s.trim() removes any leading and trailing whitespace, and trimmed.split("\\s+") splits the string at any whitespace character, including spaces.
static int count(String fileName) throws IOException {
    Path path = Paths.get(fileName);
    int count = 0;
    List<String> lines = Files.readAllLines(path);
    for (String s : lines) {
        String trimmed = s.trim();
        count += trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
    return count;
}
Here is the code using functional-style programming in Java 8. This is also a common example of using Stream's flatMap; it can be used for counting or printing the words in a file.
long n = Files.lines(Paths.get("test.txt"))
              .flatMap(s -> Stream.of(s.split("\\s+")))
              .count();
System.out.println("No. of words: " + n);
Note that Files.lines(Path) returns a Stream<String> containing the lines of the input file. This method is similar to readAllLines, but returns a stream instead of a List.
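One small addition of mine: Files.lines keeps the underlying file open, so in real code it is worth closing the stream, for example with try-with-resources (same imports as above plus java.util.stream.Stream):
try (Stream<String> lines = Files.lines(Paths.get("test.txt"))) {
    long n = lines.flatMap(s -> Stream.of(s.split("\\s+"))).count();
    System.out.println("No. of words: " + n);
}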

What is the quickest / most efficient way to append a char to a file loaded into memory?

read_data = new BufferedReader(new FileReader(args[0]));
data_buffer = new StringBuffer();
int i;
while (read_data.ready()) {
    while ((i = read_data.read()) != -1) {
        data_buffer.append((char) i);
    }
}
data_buffer.append(System.getProperty("line.separator"));
What I'm trying to do is, read an entire .txt file into a string and append a newline to the string. And then be able to process this string later on by creating a new Scanner by passing data_buffer.toString(). Obviously on really large files this process takes up a lot of time, and all I want to do is just append a newline to the .txt file I've read into memory.
I'm aware the whole idea seems a bit hacky or weird, but are there any quicker methods?
Cheers :)
The fastest way to do something is often to not do it at all.
Why don't you modify the parsing code in such way that the newline at the end is not required? If you are appending it each time, you could as well change the code to behave as if it were there while it really isn't.
The next thing I would try would be to avoid creating a huge String char by char, as this is indeed rather costly. You can create a Scanner based on an InputStream and it will probably be much faster than reading the data into a String and parsing that. You can override your FileInputStream to return a virtual newline character at the end of the file, thus avoiding the instantiation of the pasted-together string.
And if you absolutely positively did have to read the data into a buffer, you would probably be better off by reading into a byte array using the array-based read() methods of the stream - much faster than byte by byte. Since you can know the size of the file in advance, you could allocate your buffer with space for the extra end-of-line marker and insert it into the array. In contrast to creating a StringBuffer and making a String out of it, this does not require a full copy of the buffer.
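A rough sketch of that array-based idea (untested; the class and method names are mine): size the buffer for the file plus the separator, read in bulk, paste the separator bytes in at the end, and hand the whole array to a Scanner without ever building a String.
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Scanner;

class ScannerWithTrailingNewline {
    static Scanner open(String path) throws IOException {
        File f = new File(path);
        byte[] sep = System.getProperty("line.separator").getBytes();
        byte[] buf = new byte[(int) f.length() + sep.length];
        FileInputStream in = new FileInputStream(f);
        try {
            int off = 0, n;
            while ((n = in.read(buf, off, (int) f.length() - off)) > 0) {
                off += n; // bulk reads until the whole file is in the buffer
            }
            System.arraycopy(sep, 0, buf, off, sep.length); // paste the separator in
            return new Scanner(new ByteArrayInputStream(buf, 0, off + sep.length));
        } finally {
            in.close();
        }
    }
}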
From what I can tell, what you are actually trying to do is to read a file in such a way that it always appears to have a line separator at the end of the last line.
If that is the case, then you could do this by implementing a subtype of FilterReader, and have it "insert" an extra character or two if required when it reaches the end of the character stream.
The code to do this won't be trivial, but it will avoid the time and space overhead of buffering the entire file in memory.
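Something along those lines could look like this (an untested sketch of mine, essentially the same trick as the NLReader in the next answer, but as a FilterReader). Only read(char[], int, int) is overridden here; a complete version would override read() as well.
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

class TrailingNewlineReader extends FilterReader {
    private String pending = System.getProperty("line.separator");
    private boolean eofSeen = false;

    TrailingNewlineReader(Reader in) {
        super(in);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (!eofSeen) {
            int n = super.read(cbuf, off, len);
            if (n != -1) return n;
            eofSeen = true;                    // underlying stream is exhausted
        }
        if (pending.isEmpty()) return -1;      // separator already delivered
        int n = Math.min(len, pending.length());
        pending.getChars(0, n, cbuf, off);     // hand out (part of) the separator
        pending = pending.substring(n);
        return n;
    }
}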
If all you're doing is passing the resulting file in to a Scanner, you should create a Readable for the file and send that to Scanner.
Here's an example (untested):
public class NLReader implements Readable {
    Reader r;
    boolean atEndOfReader = false;
    boolean atEnd = false;

    public NLReader(Reader r) {
        this.r = r;
    }

    public int read(CharBuffer cb) throws IOException {
        if (!atEndOfReader) {
            int result = r.read(cb);
            if (result == -1) {
                atEndOfReader = true;  // underlying Reader is exhausted
            } else {
                return result;
            }
        }
        if (!atEnd) {
            // hand out the line separator exactly once, after the real data
            String nl = System.getProperty("line.separator");
            cb.append(nl);
            atEnd = true;
            return nl.length();
        }
        return -1;
    }
}
This only reads the file once, and never copies it (unlike your StringBuffer -- and you should be using StringBuilder instead unless you really need the synchronization of StringBuffer).
This also doesn't load the actual file into memory, so it can reduce memory pressure as well.

java efficient way to process big text files

I'm doing a frequency dictionary, in which I read 1000 files, each with about 1000 lines. The approach I'm following is:
BufferedReader to read file by file
read the first file, get the first sentence, split the sentence into a string array, then fill a hashmap with the values from the string array.
do this for all the sentences in that file
do this for all 1000 files
My problem is that this is not a very efficient way to do it; it takes about 4 minutes to do all of this. I've increased the heap size and refactored the code to make sure I'm not doing something wrong. For this approach, I'm completely sure there's nothing I can improve in the code.
My bet is that each time a sentence is read, a split is applied, which, multiplied by 1000 sentences per file and by 1000 files, is a huge amount of splits to process.
My idea is that instead of reading and processing file by file, I could read each file into a char array and then do the split only once per file. That would cut down the processing time spent on splits. Any suggestions for implementation would be appreciated.
OK, I have just implemented a quick and dirty POC of your dictionary. My files contained 868 lines each, but I created 1024 copies of the same file. (It is the table of contents of the Spring Framework documentation.)
I ran my test and it took 14020 ms (14 seconds!). BTW, I ran it from Eclipse, which could decrease the speed a little bit.
So, I do not know where your problem is. Please try my code on your machine, and if it runs faster, try to compare it with your code and understand where the root problem is.
Anyway, my code is not the fastest I could write.
I could create the Pattern before the loop and then use it instead of String.split(); String.split() calls Pattern.compile() every time, and compiling a pattern is very expensive. (A sketch of that variant follows the code below.)
Here is the code:
public static void main(String[] args) throws IOException {
    Map<String, Integer> words = new HashMap<String, Integer>();
    long before = System.currentTimeMillis();
    File dir = new File("c:/temp/files");
    for (File file : dir.listFiles()) {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
        for (String line = reader.readLine(); line != null; line = reader.readLine()) {
            String[] lineWords = line.split("\\s+");
            for (String word : lineWords) {
                int count = 1;
                Integer currentCount = words.get(word);
                if (currentCount != null) {
                    count = currentCount + 1;
                }
                words.put(word, count);
            }
        }
    }
    long after = System.currentTimeMillis();
    System.out.println("run took " + (after - before) + " ms");
    System.out.println(words);
}
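As mentioned above, the Pattern can be compiled once outside the loops; a minimal sketch of that variant (the class and method names are mine) that replaces the split call in the inner loop:
import java.util.Map;
import java.util.regex.Pattern;

class WordCounter {
    // Compiled once and reused for every line, instead of calling line.split("\\s+"),
    // which recompiles the regex on each call.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static void countLine(String line, Map<String, Integer> words) {
        for (String word : WHITESPACE.split(line)) {
            Integer current = words.get(word);
            words.put(word, current == null ? 1 : current + 1);
        }
    }
}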
If you don't care that the contents are in different files, I would take the approach you are recommending: read all files and all lines into memory (a string or char array, whatever) and then do one split and populate the hash from that single string/dataset.
If I understand what you're doing, I don't think you want to use strings except when you access your map.
You want to:
loop through files
read each file into a buffer of something like 1024
process the buffer looking for word end characters
create a String from the character array
check your map
if found, update your count, if not, create a new entry
when you reach end of buffer, get the next buffer from the file
at end, loop to next file
Split is probably pretty expensive since it has to interpret the expression each time.
Reading the file as one big string and then splitting that sounds like a good idea. String splitting/modifying can be surprisingly 'heavy' when it comes to garbage collection. Multiple lines/sentences mean multiple Strings, and with all the splits that means a huge number of Strings (Strings are immutable, so any change to them will actually create a new String or multiple Strings)... this produces a lot of garbage to be collected, and the garbage collection could become a bottleneck (with a smaller heap, the maximum amount of memory is reached all the time, kicking off a garbage collection that potentially needs to clean up hundreds of thousands or millions of separate String objects).
Of course, without knowing your code this is just a wild guess, but back in the day I got an old command-line Java program's running time (it was a graph algorithm producing a huge SVG file) to drop from about 18 seconds to less than 0.5 seconds just by modifying the string handling to use StringBuffers/StringBuilders.
Another thing that springs to mind is using multiple threads (or a threadpool) to handle different files concurrently, and then combine the results at the end. Once you get the program to run "as fast as possible", the remaining bottleneck will be the disk access, and the only way (afaik) to get past that is faster disks (SSDs etc.).
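A rough sketch of that thread-per-file idea (untested; the class name is mine, and countWordsInFile is just a hypothetical stand-in for the per-file counting loop shown earlier): each task builds its own map, and the maps are merged at the end so the workers never share mutable state.
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelCounter {
    static Map<String, Integer> countAll(File[] files, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Map<String, Integer>>> results = new ArrayList<Future<Map<String, Integer>>>();
        for (final File f : files) {
            // one task per file, each with its own private word-count map
            results.add(pool.submit(new Callable<Map<String, Integer>>() {
                public Map<String, Integer> call() throws Exception {
                    return countWordsInFile(f);
                }
            }));
        }
        // merge the per-file maps into one total
        Map<String, Integer> total = new HashMap<String, Integer>();
        for (Future<Map<String, Integer>> r : results) {
            for (Map.Entry<String, Integer> e : r.get().entrySet()) {
                Integer cur = total.get(e.getKey());
                total.put(e.getKey(), cur == null ? e.getValue() : cur + e.getValue());
            }
        }
        pool.shutdown();
        return total;
    }

    // stand-in for the per-file counting loop shown in the earlier answer
    static Map<String, Integer> countWordsInFile(File f) throws Exception {
        return new HashMap<String, Integer>();
    }
}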
Since you're using a BufferedReader, why do you need to read in the whole file explicitly? I definitely wouldn't use split if you're after speed; remember, it has to evaluate a regular expression each time you run it.
Try something like this for your inner loop (note, I have not compiled this or tried to run it):
StringBuilder sb = null;
String delimiters = " .,\t"; // build out all your word delimiters in a string here
for (int nextChar = br.read(); nextChar >= 0; nextChar = br.read()) {
    if (delimiters.indexOf(nextChar) < 0) {
        if (sb == null) sb = new StringBuilder();
        sb.append((char) nextChar);
    } else {
        if (sb != null) {
            // add sb.toString() to your map or increment its count
            sb = null;
        }
    }
}
// note: a word that runs up to end-of-file still needs to be added here
You could try using different sized buffers explicitly, but you probably won't get a performance improvement over this.
One very simple approach which uses minimal heap space and should be (almost) as fast as anything else would be something like:
int c;
final String SEPARATORS = " \t,.\n"; // extend as needed
final StringBuilder word = new StringBuilder();
while ((c = fileInputStream.read()) >= 0) {
    final char letter = (char) c;
    if (SEPARATORS.indexOf(letter) < 0) {
        word.append(letter);
    } else if (word.length() > 0) { // skip empty words between consecutive separators
        processWord(word.toString());
        word.setLength(0);
    }
}
if (word.length() > 0) processWord(word.toString()); // don't drop the last word at EOF
Extend it for more separator characters as needed, and possibly use multi-threading to process multiple files concurrently until disk IO becomes the bottleneck...

Android - OutOfMemory when reading text file

I'm making a dictionary app on Android. During its startup, the app will load the content of a .index file (~2MB, 100,000+ lines).
However, when I use BufferedReader.readLine() and do something with the returned string, the app causes an OutOfMemory error.
// Read file snippet
Set<String> indexes = new HashSet<String>();
FileInputStream is = new FileInputStream(indexPath);
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
String readLine;
while ((readLine = reader.readLine()) != null) {
    indexes.add(extractHeadWord(readLine));
}

// And the extractHeadWord method
private String extractHeadWord(String string) {
    String[] splitted = string.split("\\t");
    return splitted[0];
}
When reading the log, I found that while executing, it causes the GC to explicitly clean objects many times (GC_EXPLICIT freed xxx objects, in which xxx is a big number such as 15000 or 20000).
And I tried another way:
final int BUFFER = 50;
char[] readChar = new char[BUFFER];
// .. construct BufferedReader
while (reader.read(readChar) != -1) {
    indexes.add(new String(readChar));
    readChar = new char[BUFFER];
}
...and it ran very fast. But it was not exactly what I wanted.
Is there any solution that runs as fast as the second snippet and is as easy to use as the first?
Regards.
The extractHeadWord method uses String.split. This method does not create new strings but relies on the underlying string (in your case the line object) and uses indexes to point out the "new" string.
Since you are not interested in the rest of the string, you need to discard it so it gets garbage collected; otherwise the whole string stays in memory even though you are only using a part of it.
Calling the constructor String(String) (the "copy constructor") discards the rest of the string:
private String extractHeadWord(String string) {
    String[] splitted = string.split("\\t");
    return new String(splitted[0]);
}
What happens if your extractHeadWord does this: return new String(splitted[0]);?
It will not reduce the number of temporary objects, but it might reduce the memory footprint of the application. I don't know if split does about the same as substring, but I guess that it does. substring creates a new view over the original data, which means that the full character array will be kept in memory. Explicitly invoking new String(string) will truncate the data to just the part you need.

Android Garbage Collector Slow Down

I'm a semi-experienced programmer, just not so much within Java. To help learn Java/Android I started working on a word-builder application, something that takes 2-7 characters and finds all common words out of them. Currently I have about 10,000 words split between 26 .txt files that are loaded based on what characters are inputted by the user. Together it's ~10kb of data.
The logic was the easy part, but now the GC seems to be slowing everything down and I'm struggling to find ways to optimize due to my lack of Java experience. Below is the code that I'm almost positive the GC is constantly running on. I'd like to point out that with 2-4 characters the code below runs pretty quickly. Anything larger than that gets really slow.
public void readFile() throws IOException, NotFoundException
{
    String dictionaryLine = new String(); //Current string from the .txt file
    String currentString = new String();  //Current scrambled string
    String comboStr = new String();       //Current combo string
    int inputLength = myText.getText().length(); //Length of the user input

    //Loop through every "letter" dictionary
    for (int z = 0; z < neededFiles.length - 1; z++)
    {
        if (neededFiles[z] == null)
            break;

        InputStream input = neededFiles[z];
        InputStreamReader inputReader = new InputStreamReader(input);
        BufferedReader reader = new BufferedReader(inputReader, inputLength);

        //Loop through every line in the dictionary
        while ((dictionaryLine = reader.readLine()) != null)
        {
            Log.i(TAG, "dictionary: " + dictionaryLine);
            //For every scrambled string...
            for (int i = 0; i < scrambled.size(); i++)
            {
                currentString = scrambled.get(i).toString();
                //Generate all possible combos from the scrambled string and populate 'combos'
                generate(currentString);
                //...lets find every possible combo from that current scramble
                for (int j = 0; j < combos.size(); j++)
                {
                    try
                    {
                        comboStr = combos.get(j).toString();
                        //If the combo length doesn't match the current line's length, don't even compare
                        if (comboStr.length() < dictionaryLine.length() || comboStr.length() > dictionaryLine.length())
                            break;
                        //Add our match
                        if (dictionaryLine.equalsIgnoreCase(comboStr))
                        {
                            output.add(comboStr);
                            break;
                        }
                    }
                    catch (Exception error)
                    {
                        Log.d(TAG, error.getMessage());
                    }
                }
                combos.clear();
            }
        }
    }
}
To help clarify, this code generates many, many lines of the following:
GC_FOR_MALLOC freed 14000 objects / 510000 bytes in 100ms
I appreciate any help you can give, even if it's just Java best practices.
In general, you reduce garbage collection activity by creating and discarding fewer objects. There are a lot of places where objects can be generated:
Each line you are reading produces a String.
Strings are immutable, so likely more objects are being spawned in your generate() function.
If you are dealing with a lot of strings, consider a StringBuilder, which is a mutable string builder which reduces garbage.
However, 100ms for garbage collection is not bad, especially on a phone.
Basically, you're in a bad way because for each dictionary word you're generating all possible combinations of all scrambled strings, yikes! If you have enough memory, just generate all the combos for all words once and compare each one to every dictionary value.
However, it must be assumed that there isn't enough memory for this, in which case this is going to get more complicated. What you can do is use a char[] to produce one scramble possibility, test it, rearrange the characters in the buffer, test, repeat, etc., until all possibilities are exhausted.
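A sketch of that buffer-based idea (untested; the class and method names are mine): permute one scramble in place in a char[] and test each arrangement against the dictionary held in a Set, so the only Strings created are the candidates actually checked (case handling and duplicate letters are left out for brevity).
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class Permuter {
    static List<String> matches(String scramble, Set<String> dictionary) {
        List<String> found = new ArrayList<String>();
        permute(scramble.toCharArray(), 0, dictionary, found);
        return found;
    }

    private static void permute(char[] buf, int start, Set<String> dict, List<String> found) {
        if (start == buf.length - 1) {
            String candidate = new String(buf);
            if (dict.contains(candidate)) found.add(candidate); // exact-length match
            return;
        }
        for (int i = start; i < buf.length; i++) {
            swap(buf, start, i);
            permute(buf, start + 1, dict, found); // rearrange in the same buffer
            swap(buf, start, i);                  // undo the swap (backtrack)
        }
    }

    private static void swap(char[] buf, int a, int b) {
        char t = buf[a];
        buf[a] = buf[b];
        buf[b] = t;
    }
}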
