I'm a semi-experienced programmer, just not so much within Java. To help learn Java/Android I started working on a word builder application, something that takes 2-7 characters and finds all common words that can be made from them. Currently I have about 10,000 words split between 26 .txt files that are loaded based on which characters the user inputs. Together it's ~10kb of data.
The logic was the easy part, but now the GC seems to be slowing everything down and I'm struggling to find ways to optimize due to my lack of Java experience. Below is the code that I'm almost positive the GC is constantly running on. I'd like to point out that with 2-4 characters the code below runs pretty quickly; anything larger than that gets really slow.
public void readFile() throws IOException, NotFoundException
{
    String dictionaryLine = new String(); //Current string from the .txt file
    String currentString = new String(); //Current scrambled string
    String comboStr = new String(); //Current combo string
    int inputLength = myText.getText().length(); //Length of the user input

    //Loop through every "letter" dictionary
    for(int z = 0; z < neededFiles.length - 1; z++)
    {
        if(neededFiles[z] == null)
            break;

        InputStream input = neededFiles[z];
        InputStreamReader inputReader = new InputStreamReader(input);
        BufferedReader reader = new BufferedReader(inputReader, inputLength);

        //Loop through every line in the dictionary
        while((dictionaryLine = reader.readLine()) != null)
        {
            Log.i(TAG, "dictionary: " + dictionaryLine);
            //For every scrambled string...
            for(int i = 0; i < scrambled.size(); i++)
            {
                currentString = scrambled.get(i).toString();
                //Generate all possible combos from the scrambled string and populate 'combos'
                generate(currentString);
                //...let's find every possible combo from that current scramble
                for(int j = 0; j < combos.size(); j++)
                {
                    try
                    {
                        comboStr = combos.get(j).toString();
                        //If the combo length differs from the current line, don't even compare
                        if(comboStr.length() < dictionaryLine.length() || comboStr.length() > dictionaryLine.length())
                            break;
                        //Add our match
                        if(dictionaryLine.equalsIgnoreCase(comboStr))
                        {
                            output.add(comboStr);
                            break;
                        }
                    }
                    catch(Exception error)
                    {
                        Log.d(TAG, error.getMessage());
                    }
                }
                combos.clear();
            }
        }
    }
}
To help clarify, this code generates many, many lines of the following:
GC_FOR_MALLOC freed 14000 objects / 510000 bytes in 100ms
I appreciate any help you can give, even if it's just Java best practices.
In general, you reduce garbage-collection activity by creating and discarding fewer objects. There are a lot of places here where objects can be generated:
Each line you are reading produces a String.
Strings are immutable, so likely more objects are being spawned in your generate() function.
If you are dealing with a lot of strings, consider a StringBuilder, which is a mutable string builder and produces far less garbage.
However, 100ms for garbage collection is not bad, especially on a phone device.
Basically, you're in a bad way because for each dictionary word you're generating all possible combinations of all scrambled strings, yikes! If you have enough memory, just generate all the combos for all words once and compare each one to every dictionary value.
However, it must be assumed that there isn't enough memory for this, in which case things get more complicated. What you can do is use a char[] to produce one scramble possibility, test it, rearrange the characters in the buffer, test again, and repeat until all possibilities are exhausted.
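A rough sketch of that in-place idea (not the poster's code; isWord() stands in for whatever dictionary lookup you end up with) permutes the buffer by swapping characters and only materialises a String when a candidate is actually tested:

void permute(char[] buf, int start) {
    if (start == buf.length - 1) {
        String candidate = new String(buf);   // only now is a String created
        if (isWord(candidate)) {              // hypothetical dictionary lookup
            output.add(candidate);
        }
        return;
    }
    for (int i = start; i < buf.length; i++) {
        char tmp = buf[start]; buf[start] = buf[i]; buf[i] = tmp;  // swap i into position
        permute(buf, start + 1);
        tmp = buf[start]; buf[start] = buf[i]; buf[i] = tmp;       // swap back
    }
}

Calling permute(input.toCharArray(), 0) walks every arrangement of the input without building intermediate combo lists, which should cut down most of the churn the posted code creates in combos.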
Related
I'm working on a multithreaded project where we have to parse some text from a file into a magic object, do some processing on the object, and aggregate the output. The old version of the code parsed the text in one thread and did the object processing in a thread pool using Java's ExecutorService. We weren't getting the performance boost that we wanted, and it turned out that parsing takes longer than we thought relative to the processing time for each object, so I tried moving the parsing into the worker threads.
This should have worked, but what actually happens is that the time-per-object blows up as a function of the number of threads in the pool. It's worse than linear, but not quite as bad as exponential.
I've whittled it down to a small example that (on my machine anyhow) shows the behavior. The example doesn't even create the magic object; it's just doing string manipulation. There are no inter-thread dependencies that I can see; I know split() isn't terribly efficient but I can't imagine why it would sh*t the bed in a multithreaded context. Have I missed something?
I'm running in Java 7 on a 24-core machine. Lines are long, ~1MB each. There can be dozens of items in features, and 100k+ items in edges.
Sample input:
1 1 156 24 230 1350 id(foo):id(bar):w(house,pos):w(house,neg) 1->2:1#1.0 16->121:2#1.0,3#0.5
Sample command line for running with 16 worker threads:
$ java -Xmx10G Foo 16 myfile.txt
Example code:
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Foo implements Runnable {
    String line;
    int id;

    public Foo(String line, int id) {
        this.line = line;
        this.id = id;
    }

    public void run() {
        System.out.println(System.currentTimeMillis()+" Job start "+this.id);
        // line format: tab delimited
        //   x[4]
        //   graph[2]
        //   features[m]  <-- ':' delimited
        //   edges[n]
        String[] x = this.line.split("\t",5);
        String[] graph = x[4].split("\t",4);
        String[] features = graph[2].split(":");
        String[] edges = graph[3].split("\t");
        for (String e : edges) {
            String[] ee = e.split(":",2);
            ee[0].split("->",2);
            for (String f : ee[1].split(",")) {
                f.split("#",2);
            }
        }
        System.out.println(System.currentTimeMillis()+" Job done "+this.id);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        System.err.println("Reading from "+args[1]+" in "+args[0]+" threads...");
        LineNumberReader reader = new LineNumberReader(new FileReader(args[1]));
        ExecutorService pool = Executors.newFixedThreadPool(Integer.parseInt(args[0]));
        for (String line; (line=reader.readLine()) != null;) {
            pool.submit(new Foo(line, reader.getLineNumber()));
        }
        pool.shutdown();
        pool.awaitTermination(7, TimeUnit.DAYS);
    }
}
Updates:
Reading the whole file into memory first has no effect. To be more specific, I read the whole file, adding each line to an ArrayList<String>. Then I iterated over the list to create the jobs for the pool. This makes the substrings-eating-the-heap hypothesis unlikely, no?
Compiling one copy of the delimiter pattern to be used by all worker threads has no effect. :(
Resolution:
I've converted the parsing code to use a custom splitting routine based on indexOf(), like so:
private String[] split(String string, char delim) {
    if (string.length() == 0) return new String[0];

    int nitems = 1;
    for (int i = 0; i < string.length(); i++) {
        if (string.charAt(i) == delim) nitems++;
    }

    String[] items = new String[nitems];
    int last = 0;
    for (int next = last, i = 0; i < items.length && next != -1; last = next + 1, i++) {
        next = string.indexOf(delim, last);
        items[i] = next < 0 ? string.substring(last) : string.substring(last, next);
    }
    return items;
}
Oddly enough this does not blow up as the number of threads increases, and I have no idea why. It's a functional workaround though, so I'll live with it...
In older JVMs (before Java 7 update 6), String.split() uses String.substring() internally, which for "optimization" reasons does not create real new Strings, but String shells that point to sub-sections of the original one's character array.
So when you split() a String into small pieces, the original one (maybe huge) is still referenced by those pieces and may end up eating all your heap. I see you parse big files, so this could be a risk (substring() was changed to make a real copy in Java 7 update 6).
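On those older runtimes, the usual workaround is to copy the piece you actually want to keep, so the reference to the original (possibly huge) line can be dropped; something along these lines (variable names are illustrative only):

String[] parts = hugeLine.split("\t");
String keeper = new String(parts[2]);  // forces a copy of just these characters,
                                       // letting the huge backing array be collected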
Given that your format is well known, I would recommend parsing each line "by hand" rather than using String.split() (regexes are really bad for performance anyway), and creating real new Strings for the sub-parts.
String.split is actually using regular expressions to split; see: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/lang/String.java#String.split%28java.lang.String%2Cint%29
This means that you're compiling a whole bunch of regexes every iteration.
It'd probably be best to compile a pattern for the whole line once and then apply it to each line as it's read to parse it. Either that, or write your own parser that looks for character breaks instead of regexing it.
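One hedged sketch of that suggestion (precompiling the individual delimiter patterns rather than one whole-line regex; field names are mine, and note the update above reports this made no measurable difference in this particular case) applied to the Foo parser:

import java.util.regex.Pattern;

// Compiled once and shared by all worker threads; Pattern itself is thread-safe.
private static final Pattern TAB   = Pattern.compile("\t");
private static final Pattern COLON = Pattern.compile(":");

// ...inside run():
String[] x        = TAB.split(this.line, 5);
String[] graph    = TAB.split(x[4], 4);
String[] features = COLON.split(graph[2]);
String[] edges    = TAB.split(graph[3]);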
Let's consider this scenario: I am reading a file, tweaking each line a bit, and then storing the data in a new file. I tried two ways to do it:
storing the data in a String and then writing it all to the target file at the end, like this:
InputStream ips = new FileInputStream(file);
InputStreamReader ipsr = new InputStreamReader(ips);
BufferedReader br = new BufferedReader(ipsr);
PrintWriter desFile = new PrintWriter(targetFilePath);
String line;
String data = "";
while ((line = br.readLine()) != null) {
    if (line.contains("_Stop_"))
        continue;
    String[] s = line.split(";");
    String newLine = s[2];
    for (int i = 3; i < s.length; i++) {
        newLine += "," + s[i];
    }
    data += newLine + "\n";
}
desFile.write(data);
desFile.close();
br.close();
using the println() method of the PrintWriter directly in the while loop, as below:
while ((line = br.readLine()) != null) {
    if (line.contains("_Stop_"))
        continue;
    String[] s = line.split(";");
    String newLine = s[2];
    for (int i = 3; i < s.length; i++) {
        newLine += "," + s[i];
    }
    desFile.println(newLine);
}
desFile.close();
br.close();
The 2nd process is way faster than the 1st one. Now, my question is: what is so different about these two processes that the execution time differs so much?
Appending to your string will:
Allocate memory for a new string
Copy all the data previously appended.
Copy the data from your new string.
You repeat this process for every single line, meaning that for N lines of output, you copy O(N^2) bytes around.
Meanwhile, writing to your PrintWriter will:
Copy data to the buffer.
Occasionally flush the buffer.
Meaning that for N lines of output, you copy only O(N) bytes around.
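To put rough numbers on that: for 10,000 output lines of 100 characters each, the += approach copies on the order of N²·L/2 = 10,000² × 100 / 2 = 5 billion characters (roughly 10 GB of char data shuffled through memory), while the buffered writer only ever moves the roughly 1 million characters of actual output.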
For one, you're creating an awful lot of new String objects by appending using +=. I think that'll definitely slow things down.
Try appending using a StringBuilder sb declared outside of the loop and then calling desFile.write(sb.toString()); and see how that performs.
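A rough sketch of that change, keeping everything else in the posted loop the same:

StringBuilder sb = new StringBuilder();
while ((line = br.readLine()) != null) {
    if (line.contains("_Stop_"))
        continue;
    String[] s = line.split(";");
    sb.append(s[2]);                  // accumulate into the builder instead of +=
    for (int i = 3; i < s.length; i++) {
        sb.append(",").append(s[i]);
    }
    sb.append("\n");
}
desFile.write(sb.toString());         // still a single write at the end

This keeps the single write at the end but avoids the quadratic copying, since StringBuilder grows its internal buffer amortised rather than copying everything on every append.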
First of all, the two processes aren't producing the same data, since the one that calls println will have line separator characters between the lines whereas the one that builds all the data up in a buffer and writes it all at once will not.
But the reason for the performance difference is probably the enormous number of String and StringBuilder objects you are generating and throwing away, the memory that needs to be allocated to hold the complete file contents in memory, and the time taken by the garbage collector.
If you're going to be doing a significant amount of string concatenation, especially in a loop, it is better to create a StringBuilder before the loop and use it to accumulate the results in the loop.
However, if you're going to be processing large files, it is probably better to write the output as you go. The memory requirements of your application will be lower, whereas if you build up the entire result in memory, the memory required will be equal to the size of the output file.
I am trying to take a file full of strings, read it, then print out a few things:
The string
The string backwards AND uppercase
The string length
There are a few more things; however, I haven't even gotten to that point and do not want to ask anyone to write the code entirely for me. After messing around with it for a while, I have it almost completed (I believe), save for a few areas.
The piece that is tripping me up is the backwards word. We are required to put our output neatly into columns using printf, but I cannot do this if I print one char at a time. So I tried setting a String backwardsWord = ""; and adding each character to it.
This is the piece that is tripping me up:
for (int i = upperCaseWord.length() - 1; i >= 0; i--)
{
    backwardsWord += (upperCaseWord.charAt(i) + "");
}
My issue is that when I print it, the first word works properly. However, each word after that is added to the previous word.
For example: if I am printing cat, dog, and rat backwards, it shows
TAC
TACGOD
TACGODTAR
I obviously want it to read
TAC
GOD
TAR
Any help would be appreciated.
It looks like your variable backwardsWord keeps appending characters without ever being reset between words. The simplest fix is to clear backwardsWord just before your loop by setting it to the empty string.
backwardsWord = ""; //Clear any existing characters from backwardsWord
for(int i = upperCaseWord.length() - 1; i >= 0; i--)
{
backwardsWord += (upperCaseWord.charAt(i) + "");
}
If you are building up a String one character at a time you will be using a lot of memory, because Java Strings are immutable.
To do this more efficiently, use a StringBuilder instead. It is made for building up a string from characters, exactly like you are doing. Once you have finished, you can use the toString() method to get the String out.
StringBuilder builder = new StringBuilder(); //Creates the StringBuilder for storing the characters
for (int i = upperCaseWord.length() - 1; i >= 0; i--)
{
    builder.append(upperCaseWord.charAt(i)); //Append the characters one at a time
}
backwardsWord = builder.toString(); //Store the finished string in your existing variable
This has the added benefit of resetting the backwardsWord each time.
Finally, since your goal is to get the String in reverse, we can actually do it without a loop at all, as shown in this answer:
backwardsWord = new StringBuilder(upperCaseWord).reverse().toString();
This creates a new StringBuilder with the characters from upperCaseWord, reverses them, and then stores the final string in backwardsWord.
Where are you declaring the String backwardsWord?
If you don't clear it between words, the variable will still hold all of the previously appended characters.
Make sure you are tossing in a backwardsWord = ""; between words to reset its value, and that should fix your problem.
Without seeing more of your code I can't tell you exactly where to put it.
This should do the job ->
class ReverseWordsInString {
    public static String reverse(String s1) {
        int l = s1.length();
        if (l > 1)
            return s1.substring(l - 1) + reverse(s1.substring(0, l - 1));
        else
            return s1.substring(0);
    }

    public static void main(String[] args) {
        String st = "Cat Dog Rat";
        String r = "";
        for (String word : st.split(" "))
            r += " " + reverse(word.toUpperCase());
        System.out.println("Reversed words in the given string: " + r.trim());
    }
}
I am reading a file to parse later on. The file is not likely to exceed an MB in size, so this is perhaps not a crucial question for me at this stage. But for best practise reasons, I'd like to know when is the optimum time to perform an operation.
Example:
Using a method I've pasted from http://www.dzone.com/snippets/java-read-file-string, I am reading a buffer into a string. I would now like to remove all whitespace. My method is currently this:
private String listRaw;

public boolean readList(String filePath) throws java.io.IOException {
    StringBuffer fileData = new StringBuffer(1024);
    BufferedReader reader = new BufferedReader(new FileReader(filePath));
    char[] buf = new char[1024];
    int numRead = 0;
    while ((numRead = reader.read(buf)) != -1) {
        String readData = String.valueOf(buf, 0, numRead);
        fileData.append(readData);
        buf = new char[1024];
    }
    reader.close();

    listRaw = fileData.toString().replaceAll("\\s", "");
    return true;
}
So, I remove all whitespace from the string at the time I store it - in its entirety - to a class variable.
To me, this means less processing but more memory usage. Would I be better off applying the replaceAll() operation to the readData variable as I append it to fileData, for best-practice reasons? That means more processing but avoids passing superfluous whitespace around.
I imagine this has little impact for a small file like the one I am working on, but what if it's a 200MB log file?
Is it entirely case-dependant, or is there a consensus I'd do better to follow?
Thanks for the input everybody. I'm sure you've helped to aim my mindset in the right direction for writing Java.
I've updated my code to take into consideration the points raised. Including the suggestion by Don Roby that at some point, I may want to keep spaces. Hopefully things read better now!
private String listRaw;

public boolean readList(String filePath) throws java.io.IOException {
    StringBuilder fileData = new StringBuilder(51200);
    BufferedReader reader = new BufferedReader(new FileReader(filePath));
    char[] buf = new char[51200];
    boolean spaced = false;
    int numRead;
    while ((numRead = reader.read(buf)) != -1) {
        for (int i = 0; i < numRead; i++) { // only look at the chars actually read
            char c = buf[i];
            if (c != '\t' && c != '\r' && c != '\n') {
                if (c == ' ') {
                    if (spaced) {
                        continue;
                    }
                    spaced = true;
                } else {
                    spaced = false;
                }
                fileData.append(c);
            }
        }
    }
    reader.close();

    listRaw = fileData.toString().trim();
    return true;
}
You'd better create and apply the regexp replacement only once, at the end. But you would gain much more by:
initializing the StringBuilder with a reasonable size
avoiding the creation of a String inside the loop, and append the read characters directly to the StringBuilder
avoiding the instantiation of a new char buffer, for nothing, at each iteration.
To avoid an unnecessary long temporary String creation, you could read char by char, and only append the char to the StringBuilder if it's not a whitespace. In the end, the StringBuilder would contain only the good characters, and you wouldn't need any replaceAll() call.
There are actually several very significant inefficiencies in this code, and you'd have to fix them before worrying about the relatively less important issue you've raised.
First, don't create a new buf object on each iteration of the loop -- use the same one! There's no problem with doing so -- the new data overwrites the old, and you save on object allocation (which is one of the more expensive operations you can do.)
Second, similarly, don't create a String to call append() -- use the form of append that takes a char array and an offset (0, in this case) and length (numRead, in this case.) Again, you create one less object per loop iteration.
Finally, to come to the question you actually asked: doing the replacement in the loop would create a String object per iteration, but with the tuning we've just done you're creating zero objects per iteration -- so removing the whitespace at the end of the loop is the clear winner!
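Putting those fixes together, the read loop might look roughly like this (same behaviour as the original, minus the per-iteration allocations):

char[] buf = new char[1024];               // allocated once, reused on every read
int numRead;
while ((numRead = reader.read(buf)) != -1) {
    fileData.append(buf, 0, numRead);      // append the chars directly, no temporary String
}
reader.close();
listRaw = fileData.toString().replaceAll("\\s", "");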
Depending somewhat on the parse you're going to do, you may well be better off not removing the spaces in a separate step at all, and just ignore them during the parse.
It's also reasonably rare to want to remove all whitespace. Are you sure you don't want to just replace multiple spaces with single spaces?
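If collapsing runs of whitespace into single spaces is what's actually wanted, a single pass at the end would do it, for example:

listRaw = fileData.toString().replaceAll("\\s+", " ").trim();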
I'm building a frequency dictionary, for which I read 1000 files, each with about 1000 lines. The approach I'm following is:
use a BufferedReader to read file by file
read the first file, get the first sentence, split the sentence into a String array, then fill a HashMap with the values from the array
do this for all the sentences in that file
do this for all 1000 files
My problem is, this is not a very efficient way to do it; it takes about 4 minutes to do all this. I've increased the heap size and refactored the code to make sure I'm not doing something wrong. For this approach, I'm completely sure there's nothing more I can improve in the code.
My bet is, each time a sentence is read a split is applied, which, multiplied by 1000 sentences per file and by 1000 files, is a huge amount of splits to process.
My idea is, instead of reading and processing file by file, I could read each file into a char array and then do the split only once per file. That would reduce the amount of time spent splitting. Any implementation suggestions would be appreciated.
OK, I have just implemented a POC of your dictionary. Fast and dirty. My files contained 868 lines each, but I created 1024 copies of the same file. (It's the table of contents of the Spring Framework documentation.)
I ran my test and it took 14020 ms (14 seconds!). BTW, I ran it from Eclipse, which could decrease the speed a little bit.
So, I do not know where your problem is. Please try my code on your machine, and if it runs faster, try to compare it with your code and understand where the root problem is.
Anyway, my code is not the fastest I can write.
I could create the Pattern before the loop and then use it instead of String.split(); String.split() calls Pattern.compile() every time, and creating a Pattern is very expensive.
Here is the code:
public static void main(String[] args) throws IOException {
    Map<String, Integer> words = new HashMap<String, Integer>();
    long before = System.currentTimeMillis();

    File dir = new File("c:/temp/files");
    for (File file : dir.listFiles()) {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
        for (String line = reader.readLine(); line != null; line = reader.readLine()) {
            String[] lineWords = line.split("\\s+");
            for (String word : lineWords) {
                int count = 1;
                Integer currentCount = words.get(word);
                if (currentCount != null) {
                    count = currentCount + 1;
                }
                words.put(word, count);
            }
        }
    }

    long after = System.currentTimeMillis();
    System.out.println("run took " + (after - before) + " ms");
    System.out.println(words);
}
If you don't care that the contents are in different files, I would take the approach you're recommending: read all files and all lines into memory (one String, or a char array, whatever) and then do a single split and populate the hash from that one string/dataset.
If I understand what you're doing, I don't think you want to use strings except when you access your map.
You want to:
loop through files
read each file into a buffer of something like 1024
process the buffer looking for word end characters
create a String from the character array
check your map
if found, update your count, if not, create a new entry
when you reach end of buffer, get the next buffer from the file
at end, loop to next file
Split is probably pretty expensive since it has to interpret the expression each time.
Reading the file as one big string and then splitting that sounds like a good idea. String splitting/modifying can be surprisingly 'heavy' when it comes to garbage collection. Multiple lines/sentences means multiple Strings, and with all the splits that means a huge number of Strings (Strings are immutable, so any change to them actually creates a new String, or several)... this produces a lot of garbage to be collected, and the garbage collection can become a bottleneck (with a smaller heap, the maximum amount of memory is reached all the time, kicking off a garbage collection that potentially needs to clean up hundreds of thousands or millions of separate String objects).
Of course, without knowing your code this is just a wild guess, but back in the day I got an old command-line Java program's running time (it was a graph algorithm producing a huge SVG file) to drop from about 18 seconds to less than 0.5 seconds just by modifying the string handling to use StringBuffers/Builders.
Another thing that springs to mind is using multiple threads (or a thread pool) to handle different files concurrently and then combining the results at the end. Once you get the program to run "as fast as possible", the remaining bottleneck will be the disk access, and the only way (afaik) to get past that is faster disks (SSDs etc.).
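A hedged sketch of that thread-pool idea, assuming the per-file counting already exists as a method (countWords() here is a placeholder, and imports from java.io, java.util, and java.util.concurrent are assumed); each task builds its own map and the maps are merged at the end, so the worker threads never contend on shared state:

static Map<String, Integer> countAllFiles(File dir) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    List<Future<Map<String, Integer>>> futures = new ArrayList<Future<Map<String, Integer>>>();
    for (final File file : dir.listFiles()) {
        futures.add(pool.submit(new Callable<Map<String, Integer>>() {
            public Map<String, Integer> call() throws IOException {
                return countWords(file);        // placeholder: per-file frequency count
            }
        }));
    }
    Map<String, Integer> total = new HashMap<String, Integer>();
    for (Future<Map<String, Integer>> f : futures) {
        for (Map.Entry<String, Integer> e : f.get().entrySet()) {  // get() blocks until the task is done
            Integer old = total.get(e.getKey());
            total.put(e.getKey(), old == null ? e.getValue() : old + e.getValue());
        }
    }
    pool.shutdown();
    return total;
}

As the answer notes, once the CPU work is spread out like this, disk access usually becomes the limiting factor, so the speed-up tops out well before one thread per file.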
Since you're using a BufferedReader, why do you need to read in a whole file explicitly? I definitely wouldn't use split if you're after speed; remember, it has to evaluate a regular expression each time you run it.
Try something like this for your inner loop (note, I have not compiled this or tried to run it):
StringBuilder sb = null;
String delimiters = " .,\t"; //Build out all your word delimiters in a string here
for (int nextChar = br.read(); nextChar >= 0; nextChar = br.read()) {
    if (delimiters.indexOf(nextChar) < 0) {
        if (sb == null) sb = new StringBuilder();
        sb.append((char) nextChar);
    } else {
        if (sb != null) {
            //Add sb.toString() to your map or increment it
            sb = null;
        }
    }
}
You could try using different sized buffers explicitly, but you probably won't get a performance improvement over this.
One very simple approach which uses minimal heap space and should be (almost) as fast as anything else would be something like:
int c;
final String SEPARATORS = " \t,.\n"; // extend as needed
final StringBuilder word = new StringBuilder();
while ((c = fileInputStream.read()) >= 0) {
    final char letter = (char) c;
    if (SEPARATORS.indexOf(letter) < 0) {
        word.append(letter);
    } else if (word.length() > 0) {    // skip empty words from consecutive separators
        processWord(word.toString());
        word.setLength(0);
    }
}
if (word.length() > 0) {               // don't lose the last word at end of file
    processWord(word.toString());
}
Extend it with more separator characters as needed, and possibly use multi-threading to process multiple files concurrently until disk IO becomes the bottleneck...