Why does my code become slow after processing a large dataset? - java

I have a Java program that basically reads a file line by line and stores the lines in a set. The file contains more than 30,000,000 lines. My program runs fast at the beginning but slows down after processing about 20,000,000 lines, eventually becoming too slow to wait for. Can somebody explain why this happens and how I can speed the program up again?
Thanks.
public void returnTop100Phases() {
    Set<Phase> phaseTreeSet = new TreeSet<>(new Comparator<Phase>() {
        @Override
        public int compare(Phase o1, Phase o2) {
            int diff = o2.count - o1.count;
            if (diff == 0) {
                return o1.phase.compareTo(o2.phase);
            } else {
                return diff > 0 ? 1 : -1;
            }
        }
    });
    try {
        int lineCount = 0;
        BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(new File("output")), StandardCharsets.UTF_8));
        String line = null;
        while ((line = br.readLine()) != null) {
            lineCount++;
            if (lineCount % 10000 == 0) {
                System.out.println(lineCount);
            }
            String[] tokens = line.split("\\t");
            phaseTreeSet.add(new Phase(tokens[0], Integer.parseInt(tokens[1])));
        }
        br.close();
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        Iterator<Phase> iterator = phaseTreeSet.iterator();
        int n = 100;
        while (n > 0 && iterator.hasNext()) {
            Phase phase = iterator.next();
            out.print(phase.phase + "\t" + phase.count + "\n");
            n--;
        }
        out.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Looking at the runtime behaviour, this is clearly a memory issue. In my tests it even broke after around 5 million lines with 'GC overhead limit exceeded' on Java 8. If I limit the size of the phaseTreeSet by adding
if (phaseTreeSet.size() > 100) { phaseTreeSet.pollLast(); }
it runs through quickly. The reason it gets so slow is that it uses more and more memory, so garbage collection takes longer, and before it can claim more memory it has to run another big collection. With that much live data, every collection gets a bit slower than the last...
To stay fast you need to keep the data out of memory, either by keeping only the top phases like I did, or by using some kind of database.
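A minimal sketch of the bounded-set version, assuming the same tab-separated "output" file and a small stand-in for the Phase class from the question (its full definition isn't shown above); the set never holds more than 100 entries, so the heap stays flat no matter how many lines the file has:
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class Top100Phases {          // placeholder class name, not from the question
    // Minimal stand-in for the Phase class used in the question.
    static class Phase {
        final String phase;
        final int count;
        Phase(String phase, int count) { this.phase = phase; this.count = count; }
    }

    public static void main(String[] args) throws IOException {
        TreeSet<Phase> top = new TreeSet<>((o1, o2) -> {
            int byCount = Integer.compare(o2.count, o1.count);   // highest count first
            return byCount != 0 ? byCount : o1.phase.compareTo(o2.phase);
        });
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream("output"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] tokens = line.split("\t");
                top.add(new Phase(tokens[0], Integer.parseInt(tokens[1])));
                if (top.size() > 100) {
                    top.pollLast();          // evict the smallest entry; memory stays bounded
                }
            }
        }
        for (Phase p : top) {
            System.out.println(p.phase + "\t" + p.count);
        }
    }
}
Each insertion costs only O(log 100), and garbage collection never has millions of live Phase objects to trace.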

Related

Is there a faster way to use csv reader in Java?

I need to open a CSV file in parts of 5,000 samples each and then plot them. To go back and forward on the signal, each time I click a button I have to instantiate a new reader and then skip to the point I need. My signal is big, about 135,000 samples, so the csvReader.skip() method is very slow when I work with the last samples. But to go back I can't delete lines, so each time my iterator needs to be re-instantiated. I noticed that skip() uses a for loop. Is there a better way to get around this problem? Here is my code:
public void updateSign(int segmento) {
    Log.d("segmento", Integer.toString(segmento));
    // check that I am within the signal length
    if (segmento > 0 && (float) (segmento - 1) <= (float) TOTAL / normaLen) {
        try {
            reader = new CSVReader(new FileReader(new File(patty)));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        List<Integer> sign = new ArrayList<>();
        // this is the point of the signal where I finish
        int len = segmento * normaLen;
        // check if I am at the end of the signal
        if (len >= TOTAL) {
            len = TOTAL;
            segmento = 0;
            avanti.setValue(false);
            System.out.println(avanti.getValue());
        } else {
            lines = TOTAL - len;
            avanti.setValue(true);
            System.out.println(avanti.getValue());
        }
        // the number of lines I need to skip
        int skipper = (segmento - 1) * normaLen;
        try {
            System.out.println("pre skip");
            reader.skip(skipper);
            System.out.println("post skip");
        } catch (IOException e) {
            e.printStackTrace();
        }
        // my iterator
        it = reader.iterator();
        System.out.println("iteratore fatto");                    // "iterator done"
        // loop to build my mini-signal to plot;
        // having only 5,000 samples it is fast enough
        for (int i = skipper; i < len - 1; i++) {
            if (i >= (segmento - 1) * normaLen) {
                sign.add(Integer.parseInt(it.next()[0]));
            } else {
                it.next();
                System.out.println("non ha funzionato lo skip");  // "the skip did not work"
            }
        }
        System.out.println("ciclo for: too much fatica?");        // "for loop: too much effort?"
        // set sign to be plotted by my fragment
        liveSign.setValue(sign);
    }
}
Thanks in advance!
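One way around the linear skip(), sketched below on the assumption that the samples are plain single-byte-encoded text: make one initial pass that records the byte offset where each block of lines starts, then seek straight to a block with RandomAccessFile instead of re-reading everything before it. The class and method names here are hypothetical, not part of the CSVReader API used above:
import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any CSV library.
public class BlockIndexedCsv {
    private final List<Long> blockOffsets = new ArrayList<>();  // byte offset of each block start
    private final File file;
    private final int linesPerBlock;

    public BlockIndexedCsv(File file, int linesPerBlock) throws IOException {
        this.file = file;
        this.linesPerBlock = linesPerBlock;
        // One pass over the file to remember where every block begins.
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long lineNo = 0;
            blockOffsets.add(0L);
            while (raf.readLine() != null) {
                lineNo++;
                if (lineNo % linesPerBlock == 0) {
                    blockOffsets.add(raf.getFilePointer());
                }
            }
        }
    }

    // Reads one block of lines without re-reading everything before it.
    public List<String> readBlock(int blockIndex) throws IOException {
        List<String> lines = new ArrayList<>(linesPerBlock);
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(blockOffsets.get(blockIndex));   // jump directly to the block
            String line;
            for (int i = 0; i < linesPerBlock && (line = raf.readLine()) != null; i++) {
                lines.add(line);
            }
        }
        return lines;
    }
}
The index for 135,000 lines is tiny, and after the one-time pass both forward and backward jumps become constant-time seeks instead of repeated skips.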

How to get HashSet limit Size?

I want to find the HashSet size limit in bytes on my development system, so I wrote some code that just keeps adding dump data. Take a look at my source code:
String DUMP = "llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll";
void testSetLimitByte() {
    File f = new File("d:/test.txt");
    BufferedWriter bw = null;
    HashSet<String> set = new HashSet<String>();
    int cnt = 0;
    try {
        bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("d:/test.txt", false), "UTF-8"));
        for (int i = 0; i < 100000000; i++) {
            String dumpData = DUMP + i;
            bw.write(dumpData);
            bw.newLine();
            if (i == 0)
                continue;
            set.add(dumpData);
            if (i % 10000 == 0)
                System.out.print(".");
            if (i % 100000 == 0)
                System.out.print(" ");
            if (i % 1000000 == 0) {
                cnt++;
                System.out.println(cnt + " (size 1billion)");
            }
        }
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (bw != null)
                bw.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("HashSet Limit Memory : " + f.length() + "bytes");
    }
}
Does this code measure something like the HashSet limit in bytes?
The HashSet is not limited by bytes; only the total heap size is limited.
Note: to reach the Integer.MAX_VALUE size of a HashSet you need a heap of ~64 GB (and trivial keys/values).
How many bytes can I add to a HashSet on a given system before an OutOfMemoryError occurs?
In this case, you will find that the JVM just runs slower and slower as it tries to use the last portions of memory. Newer JVMs detect that you are approaching this condition and die a little earlier, but it can take a long time to reach that point.
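If the goal is to see roughly how much heap the set consumes before the JVM gives up, sampling the used heap with Runtime is more direct than measuring the output file length. A rough sketch (hypothetical class; it measures the whole heap, not the set alone, and the answer depends on -Xmx):
import java.util.HashSet;
import java.util.Set;

public class HashSetHeapProbe {        // placeholder class name
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        Set<String> set = new HashSet<>();
        try {
            for (long i = 0; ; i++) {
                set.add("dump-data-" + i);
                if (i % 1_000_000 == 0) {
                    long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
                    System.out.println(i + " entries, ~" + usedMb + " MB heap used");
                }
            }
        } catch (OutOfMemoryError oom) {
            // The set itself has no byte limit; we simply ran out of heap.
            int entries = set.size();
            set.clear();               // free the heap so the final print can allocate
            System.out.println("OutOfMemoryError after roughly " + entries + " entries");
        }
    }
}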

randomAccessFile.readLine() returns null after many uses even though not reaching EOF?

I have a file with 10K lines.
I read it in chunks of 200 lines.
I have a problem that after 5600 lines (chunk 28), randomAccessFile.readLine() returns null.
However, if I start reading from chunk 29, it reads one more chunk and stops (returns null).
If I force reading from chunk 30, again it reads one chunk and stops.
This is my code:
private void addRequestsToBuffer(int fromChunkId, List<String> requests) {
    String line;
    while (requests.size() < chunkSizeInLines) {
        if ((line = readNextLine()) == null) {
            return;
        }
        int httpPosition = line.indexOf("http");
        int index = fromChunkId * chunkSizeInLines + requests.size();
        requests.add(index + ") " + line.substring(httpPosition));
    }
}

private String readNextLine() {
    String line;
    try {
        line = randomAccessFile.readLine();
        if (line == null) {
            System.out.println("randomAccessFile.readLine() returned null");
        }
    } catch (IOException ex) {
        ex.printStackTrace();
        throw new RuntimeException(ex);
    }
    return line;
}

@Override
public List<String> getNextRequestsChunkStartingChunkId(int fromChunkId) {
    List<String> requests = new ArrayList<>();
    int linesNum = 0;
    try {
        for (int i = 0; i < fromChunkId; i++) {
            while ((linesNum < chunkSizeInLines) && (randomAccessFile.readLine()) != null) {
                linesNum++;
            }
            linesNum = 0;
        }
        addRequestsToBuffer(fromChunkId, requests);
    } catch (IOException ex) {
        ex.printStackTrace();
        throw new RuntimeException(ex);
    }
    return requests;
}
What could cause this? A RandomAccessFile timeout?
Each time you call getNextRequestsChunkStartingChunkId you're skipping the specified number of chunks, without "rewinding" the RandomAccessFile to the start. So for example, if you call:
getNextRequestsChunkStartingChunkId(0);
getNextRequestsChunkStartingChunkId(1);
getNextRequestsChunkStartingChunkId(2);
you'll actually read:
Chunk 0 (leaving the stream at the start of chunk 1)
Chunk 2 (leaving the stream at the start of chunk 3)
Chunk 5 (leaving the stream at the start of chunk 6)
Options:
Read the chunks sequentially, without skipping anything
Rewind at the start of the method (sketched below)
Unfortunately you can't use seek for this, because your chunks aren't equally sized, in terms of bytes.
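A sketch of the rewind option, reusing the fields and helper from the question (randomAccessFile, chunkSizeInLines, addRequestsToBuffer), so it is meant as a drop-in variant rather than a standalone program:
// Sketch of option 2: rewind to the start of the file before skipping ahead,
// so that fromChunkId is always counted from the beginning.
@Override
public List<String> getNextRequestsChunkStartingChunkId(int fromChunkId) {
    List<String> requests = new ArrayList<>();
    try {
        randomAccessFile.seek(0);                          // rewind to the first byte
        long linesToSkip = (long) fromChunkId * chunkSizeInLines;
        for (long i = 0; i < linesToSkip; i++) {
            if (randomAccessFile.readLine() == null) {
                return requests;                           // fewer chunks in the file than requested
            }
        }
        addRequestsToBuffer(fromChunkId, requests);
    } catch (IOException ex) {
        throw new RuntimeException(ex);
    }
    return requests;
}
This keeps the sequential-read behaviour but makes each call independent of the previous ones, at the cost of re-reading the skipped lines every time.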

Java- Basic IO trouble

I'm trying to code a sieve of eratosthenes which I intend to use to find the largest prime factor of 13195. If this works, I intend to use it on the number: 600851475143.
Since creating a list of numbers ranging from 2-600851475143 would be nearly impossible due to memory issues, I have decided to store the numbers in a text file instead.
The problem I'm running into, though, is that instead of getting a text file filled with numbers, the code only produces a file with one number (this is my first time working with IO-related stuff in Java):
long number = 13195;
long limit = (long) Math.sqrt(number);
for (long i = 2; i < limit + 1; i++)
{
    try
    {
        Writer output = null;
        File file = new File("Primes.txt");
        output = new BufferedWriter(new FileWriter(file));
        output.write(Long.toString(i) + "\n");
        output.close();
    }
    catch (IOException e)
    {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
Here's the output contained in the text file:
114
What am I doing wrong?
Don't use Erathostenes - it's too slow unless you need all the primes in the range.
Here is a better way to factorize a given number. The function returns a map where the keys are the prime factors of n and the values are their powers; e.g. for 13195 it will be {5:1, 7:1, 13:1, 29:1}.
Its complexity is O(sqrt(n)):
public static Map<Integer, Integer> Factorize(int n) {
    HashMap<Integer, Integer> ret = new HashMap<Integer, Integer>();
    int origN = n;
    for (int p = 2; p * p <= origN && n > 1; p += (p == 2 ? 1 : 2)) {
        int power = 0;
        while (n % p == 0) {
            ++power;
            n /= p;
        }
        if (power > 0)
            ret.put(p, power);
    }
    return ret;
}
Of course, if you need just the largest prime factor, you can return only the last p instead of the whole map; the complexity is the same.
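For just the largest factor, and since 600851475143 does not fit in an int, a long-based variant of the same O(sqrt(n)) trial division might look like this (a sketch along the lines of the method above, not the asker's code):
public static long largestPrimeFactor(long n) {
    long largest = 1;
    for (long p = 2; p * p <= n; p += (p == 2 ? 1 : 2)) {
        while (n % p == 0) {   // strip out every factor p
            largest = p;
            n /= p;
        }
    }
    // Whatever is left (if > 1) is itself a prime factor larger than sqrt of the remaining n.
    return n > 1 ? n : largest;
}
For example, largestPrimeFactor(13195) returns 29 and largestPrimeFactor(600851475143L) returns 6857.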
Your code keeps re-opening, writing, and closing the same file. You should do something like this:
long number = 13195;
long limit = (long) Math.sqrt(number);
try
{
    File file = new File("Primes.txt");
    Writer output = new BufferedWriter(new FileWriter(file));
    for (long i = 2; i < limit + 1; i++)
    {
        output.write(Long.toString(i) + "\n");
    }
    output.close();
}
catch (IOException e)
{
    // TODO Auto-generated catch block
    e.printStackTrace();
}
You need to take the file instantiation out of the loop.
You are overwriting your file on every pass through the loop.
You need to open your file outside the main loop.
long number = 13195;
long limit = (long) Math.sqrt(number);
Writer output = null;
try
{
    File file = new File("Primes.txt");
    output = new BufferedWriter(new FileWriter(file));
}
catch (IOException e)
{
    // Cannot open file
    e.printStackTrace();
}
for (long i = 2; i < limit + 1; i++)
{
    try
    {
        output.write(Long.toString(i) + "\n");
    }
    catch (IOException e)
    {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
try
{
    output.close();
}
catch (IOException e)
{
    e.printStackTrace();
}
You are recreating your FileWriter in every iteration of your for loop, and you are not opening it in append mode, so you overwrite the file on every iteration.
Create the FileWriter before the for loop and close it after the loop. Something like this:
long number = 13195;
long limit = (long) Math.sqrt(number);
Writer output = null;
try {
    File file = new File("/var/tmp/Primes.txt");
    output = new BufferedWriter(new FileWriter(file));
    for (long i = 2; i < limit + 1; i++) {
        output.write(Long.toString(i) + "\n");
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
} finally {
    if (output != null) {
        try {
            output.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Writing directly to disk is slower; have you considered doing the work in pieces in memory and then saving to disk? That would also give you a smaller file, since you can write only the primes you have found instead of every composite number.
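A sketch of that idea: since only primes up to sqrt(600851475143) (about 775,146) are needed for this kind of trial division, a boolean sieve of that size fits easily in memory, and only the primes are written out, through a single writer (class name is a placeholder):
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class SievePrimesToFile {       // placeholder class name
    public static void main(String[] args) throws IOException {
        int limit = (int) Math.sqrt(600851475143L);      // ~775146, small enough for memory
        boolean[] composite = new boolean[limit + 1];
        try (BufferedWriter out = new BufferedWriter(new FileWriter("Primes.txt"))) {
            for (int i = 2; i <= limit; i++) {
                if (!composite[i]) {
                    out.write(Integer.toString(i));
                    out.newLine();
                    for (long j = (long) i * i; j <= limit; j += i) {
                        composite[(int) j] = true;       // mark multiples of i as composite
                    }
                }
            }
        }
    }
}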

java: need to increase performance of checksum calculation

I'm using the following function to calculate checksums on files:
public static void generateChecksums(String strInputFile, String strCSVFile) {
    ArrayList<String[]> outputList = new ArrayList<String[]>();
    try {
        MessageDigest m = MessageDigest.getInstance("MD5");
        File aFile = new File(strInputFile);
        InputStream is = new FileInputStream(aFile);
        System.out.println(Calendar.getInstance().getTime().toString() +
                " Processing Checksum: " + strInputFile);
        double dLength = aFile.length();
        try {
            is = new DigestInputStream(is, m);
            // read stream to EOF as normal...
            int nTmp;
            double dCount = 0;
            String returned_content = "";
            while ((nTmp = is.read()) != -1) {
                dCount++;
                if (dCount % 600000000 == 0) {
                    System.out.println(". ");
                } else if (dCount % 20000000 == 0) {
                    System.out.print(". ");
                }
            }
            System.out.println();
        } finally {
            is.close();
        }
        byte[] digest = m.digest();
        m.reset();
        BigInteger bigInt = new BigInteger(1, digest);
        String hashtext = bigInt.toString(16);
        // Now we need to zero pad it if you actually want the full 32 chars.
        while (hashtext.length() < 32) {
            hashtext = "0" + hashtext;
        }
        String[] arrayTmp = new String[2];
        arrayTmp[0] = aFile.getName();
        arrayTmp[1] = hashtext;
        outputList.add(arrayTmp);
        System.out.println("Hash Code: " + hashtext);
        UtilityFunctions.createCSV(outputList, strCSVFile, true);
    } catch (NoSuchAlgorithmException nsae) {
        System.out.println(nsae.getMessage());
    } catch (FileNotFoundException fnfe) {
        System.out.println(fnfe.getMessage());
    } catch (IOException ioe) {
        System.out.println(ioe.getMessage());
    }
}
The problem is that the loop to read in the file is really slow:
while ((nTmp = is.read()) != -1) {
    dCount++;
    if (dCount % 600000000 == 0) {
        System.out.println(". ");
    } else if (dCount % 20000000 == 0) {
        System.out.print(". ");
    }
}
A 3 GB file that takes less than a minute to copy from one location to another, takes over an hour to calculate. Is there something I can do to speed this up or should I try to go in a different direction like using a shell command?
Update: Thanks to ratchet freak's suggestion I changed the code to this which is ridiculously faster (I would guess 2048X faster...):
byte[] buff = new byte[2048];
while ((nTmp = is.read(buff)) != -1) {
    dCount += 2048;
    if (dCount % 614400000 == 0) {
        System.out.println(". ");
    } else if (dCount % 20480000 == 0) {
        System.out.print(". ");
    }
}
Use a buffer:
byte[] buff = new byte[2048];
while ((nTmp = is.read(buff)) != -1)
{
    dCount += nTmp;
    // this logic won't work anymore though
    /*
    if (dCount % 600000000 == 0)
    {
        System.out.println(". ");
    }
    else if (dCount % 20000000 == 0)
    {
        System.out.print(". ");
    }
    */
}
edit: or if you don't need the values do
while(is.read(buff)!=-1)is.skip(600000000);
nvm apparently the implementers of DigestInputStream were stupid and didn't test everything properly before release
Have you tried removing the println's? I imagine all that string manipulation could be consuming most of the processing!
Edit: I didn't read it carefully; I now realise how infrequently they'd be printed. I'd retract my answer, but I guess it wasn't totally without value :-p (Sorry!)
The problem is that System.out.print is used too often. Every time it is called, new String objects have to be created, and that is expensive.
Use StringBuilder class instead or its thread safe analog StringBuffer.
StringBuilder sb = new StringBuilder();
And every time you need to add something call this:
sb.append("text to be added");
Later, when you are ready to print it:
System.out.println(sb.toString());
Frankly, there are several problems with your code that make it slow:
Like ratchet freak said, disk reads must be buffered, because Java read() calls are probably translated into operating system I/O calls without any automatic buffering, so one read() is one system call!
The operating system will normally perform much better if you use an array as a buffer or a BufferedInputStream. Better yet, you can use nio to map the file into memory and read it as fast as the OS can handle it.
You may not believe it, but the dCount++ counter may have used a lot of cycles: even on the latest Intel Core processors, a 64-bit floating-point add takes several clock cycles to complete. You would be much better off using a long for this counter.
If the sole purpose of this counter is to display progress, you can make use of the fact that Java integers overflow without raising an error and just advance your progress display whenever a char counter wraps to 0 (that is, once per 65536 reads).
The following string padding is also inefficient. You should use a StringBuilder or a Formatter.
while (hashtext.length() < 32) {
    hashtext = "0" + hashtext;
}
Try using a profiler to find further efficiency problems in your code.
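Putting the buffering and padding suggestions together, a sketch of the digest loop using a BufferedInputStream, block-sized reads, a long counter, and String.format for the zero padding (class name and buffer size are placeholders, not from the question):
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Example {              // placeholder class name
    public static String md5Hex(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[64 * 1024];                 // 64 KB reads instead of 1-byte reads
        long totalBytes = 0;
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);                  // feed the digest directly
                totalBytes += read;                          // long counter, not a double
            }
        }
        System.out.println("Read " + totalBytes + " bytes");
        // %032x pads the hex string to 32 characters without a loop.
        return String.format("%032x", new BigInteger(1, md.digest()));
    }
}
With block reads the cost is dominated by disk throughput and the MD5 computation itself rather than per-byte call overhead.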
