Java: need to increase performance of checksum calculation

I'm using the following function to calculate checksums on files:
public static void generateChecksums(String strInputFile, String strCSVFile) {
    ArrayList<String[]> outputList = new ArrayList<String[]>();
    try {
        MessageDigest m = MessageDigest.getInstance("MD5");
        File aFile = new File(strInputFile);
        InputStream is = new FileInputStream(aFile);
        System.out.println(Calendar.getInstance().getTime().toString() +
                " Processing Checksum: " + strInputFile);
        double dLength = aFile.length();
        try {
            is = new DigestInputStream(is, m);
            // read stream to EOF as normal...
            int nTmp;
            double dCount = 0;
            String returned_content = "";
            while ((nTmp = is.read()) != -1) {
                dCount++;
                if (dCount % 600000000 == 0) {
                    System.out.println(". ");
                } else if (dCount % 20000000 == 0) {
                    System.out.print(". ");
                }
            }
            System.out.println();
        } finally {
            is.close();
        }
        byte[] digest = m.digest();
        m.reset();
        BigInteger bigInt = new BigInteger(1, digest);
        String hashtext = bigInt.toString(16);
        // Now we need to zero pad it if you actually want the full 32 chars.
        while (hashtext.length() < 32) {
            hashtext = "0" + hashtext;
        }
        String[] arrayTmp = new String[2];
        arrayTmp[0] = aFile.getName();
        arrayTmp[1] = hashtext;
        outputList.add(arrayTmp);
        System.out.println("Hash Code: " + hashtext);
        UtilityFunctions.createCSV(outputList, strCSVFile, true);
    } catch (NoSuchAlgorithmException nsae) {
        System.out.println(nsae.getMessage());
    } catch (FileNotFoundException fnfe) {
        System.out.println(fnfe.getMessage());
    } catch (IOException ioe) {
        System.out.println(ioe.getMessage());
    }
}
The problem is that the loop to read in the file is really slow:
while ((nTmp = is.read()) != -1) {
    dCount++;
    if (dCount % 600000000 == 0) {
        System.out.println(". ");
    } else if (dCount % 20000000 == 0) {
        System.out.print(". ");
    }
}
A 3 GB file that takes less than a minute to copy from one location to another, takes over an hour to calculate. Is there something I can do to speed this up or should I try to go in a different direction like using a shell command?
Update: thanks to ratchet freak's suggestion, I changed the code to the following, which is ridiculously faster (I would guess about 2048x faster):
byte[] buff = new byte[2048];
while ((nTmp = is.read(buff)) != -1) {
    dCount += 2048;
    if (dCount % 614400000 == 0) {
        System.out.println(". ");
    } else if (dCount % 20480000 == 0) {
        System.out.print(". ");
    }
}

Use a buffer:
byte[] buff = new byte[2048];
while ((nTmp = is.read(buff)) != -1)
{
    dCount += nTmp; // nTmp is the number of bytes actually read
    // this logic won't work anymore though
    /*
    if (dCount % 600000000 == 0)
    {
        System.out.println(". ");
    }
    else if (dCount % 20000000 == 0)
    {
        System.out.print(". ");
    }
    */
}
Edit: or, if you don't need the values, do
while (is.read(buff) != -1) is.skip(600000000);
Never mind: apparently DigestInputStream does not feed skipped bytes through the digest, so skipping would produce a wrong checksum.

Have you tried removing the println's? I imagine all that string manipulation could be consuming most of the processing!
Edit: I didn't read it clearly; I now realise how infrequently they'd be output. I'd retract my answer, but I guess it wasn't totally worthless :-p (Sorry!)

The problem is that System.out.print is used too often. Every time it is called, new String objects have to be created, which is expensive.
Use the StringBuilder class instead, or its thread-safe analog StringBuffer.
StringBuilder sb = new StringBuilder();
And every time you need to add something, call this:
sb.append("text to be added");
Later, when you are ready to print it:
System.out.println(sb.toString());

Frankly, there are several problems with your code that make it slow:
Like ratchet freak said, disk reads must be buffered: each Java read() is likely translated into an operating system I/O call with no automatic buffering, so one read() is one system call!
The operating system will normally perform much better if you use an array as a buffer or a BufferedInputStream. Better yet, you can use NIO to map the file into memory and read it as fast as the OS can handle it; a sketch follows.
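As an illustration of the NIO route, here is a minimal sketch (my own, not code from the thread; it maps the file in chunks of at most 2 GB, since a single MappedByteBuffer cannot exceed Integer.MAX_VALUE bytes):
import java.io.IOException;
import java.math.BigInteger;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class NioDigest {
    public static String md5(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (FileChannel ch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            long pos = 0, size = ch.size();
            while (pos < size) {
                long chunk = Math.min(size - pos, Integer.MAX_VALUE); // map <= 2GB at a time
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, chunk);
                md.update(buf); // digest the mapped region directly, no copy loop
                pos += chunk;
            }
        }
        return String.format("%032x", new BigInteger(1, md.digest()));
    }
}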
You may not believe it, but the dCount++ counter may have cost a lot of cycles: dCount is a double, and even on the latest Intel Core processors a 64-bit floating-point add takes several clock cycles to complete. You will be much better off using a long for this counter.
If the sole purpose of this counter is to display progress, you can make use of the fact that Java integer types overflow without raising an error, and just advance your progress display whenever a char counter wraps to 0 (that is, once per 65536 reads).
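A minimal sketch of that wrap-around trick (my own fragment, reusing is and buff from the buffered loop in the question's update):
char progress = 0; // 16-bit unsigned counter; wraps from 65535 to 0
int nTmp;
byte[] buff = new byte[2048];
while ((nTmp = is.read(buff)) != -1) {
    if (++progress == 0) {      // true once every 65536 reads
        System.out.print(". ");
    }
}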
The string padding that follows is also inefficient. You should use a StringBuilder or a Formatter; a one-liner is shown after the snippet.
while (hashtext.length() < 32) {
    hashtext = "0" + hashtext;
}
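For instance, the whole pad-to-32 loop can be replaced with a format string (here digest is the byte[] returned by MessageDigest.digest()):
import java.math.BigInteger;
// ...
String hashtext = String.format("%032x", new BigInteger(1, digest)); // zero-padded 32-char hex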
Try using a profiler to find further efficiency problems in your code.

Related

Why does my code become slow after processing a large dataset?

I have a Java program which basically reads a file line by line and stores the lines in a set. The file contains more than 30,000,000 lines. My program runs fast at the beginning, but slows down after processing about 20,000,000 lines, eventually becoming too slow to wait for. Can somebody explain why this happens and how I can speed the program up again?
Thanks.
public void returnTop100Phases() {
    Set<Phase> phaseTreeSet = new TreeSet<>(new Comparator<Phase>() {
        @Override
        public int compare(Phase o1, Phase o2) {
            int diff = o2.count - o1.count;
            if (diff == 0) {
                return o1.phase.compareTo(o2.phase);
            } else {
                return diff > 0 ? 1 : -1;
            }
        }
    });
    try {
        int lineCount = 0;
        BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(new File("output")), StandardCharsets.UTF_8));
        String line = null;
        while ((line = br.readLine()) != null) {
            lineCount++;
            if (lineCount % 10000 == 0) {
                System.out.println(lineCount);
            }
            String[] tokens = line.split("\\t");
            phaseTreeSet.add(new Phase(tokens[0], Integer.parseInt(tokens[1])));
        }
        br.close();
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        Iterator<Phase> iterator = phaseTreeSet.iterator();
        int n = 100;
        while (n > 0 && iterator.hasNext()) {
            Phase phase = iterator.next();
            out.print(phase.phase + "\t" + phase.count + "\n");
            n--;
        }
        out.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Looking at the runtime behaviour, this is clearly a memory issue. My tests even broke after around 5M lines with 'GC overhead limit exceeded' on Java 8. If I limit the size of the phaseTreeSet by adding
if (phaseTreeSet.size() > 100) { phaseTreeSet.pollLast(); }
it runs through quickly. The reason it gets so slow is memory: the more the program holds, the longer each garbage collection takes, and every time the heap has to grow, a big collection runs first. With that much to collect, each round gets a bit slower...
To get faster you need to get the data out of memory, either by keeping only the top Phases as I did (a self-contained sketch follows), or by using some kind of database.
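Here is a self-contained sketch of that bounded-set idea (my own illustration, not the poster's exact code; note the set must be declared as TreeSet rather than Set so pollLast() is visible):
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.TreeSet;

public class Top100Phases {
    static class Phase {
        final String phase;
        final int count;
        Phase(String phase, int count) { this.phase = phase; this.count = count; }
    }

    public static TreeSet<Phase> top100(String file) throws IOException {
        // Highest count first, ties broken by phase text, as in the question
        TreeSet<Phase> top = new TreeSet<>(Comparator
                .comparingInt((Phase p) -> -p.count)
                .thenComparing(p -> p.phase));
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] tokens = line.split("\t");
                top.add(new Phase(tokens[0], Integer.parseInt(tokens[1])));
                if (top.size() > 100) {
                    top.pollLast(); // evict the smallest, so at most 100 entries live in memory
                }
            }
        }
        return top;
    }
}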

Java Reading big file java heap space

I have written this code:
try (BufferedReader file = new BufferedReader(new FileReader("C:\\Users\\User\\Desktop\\big50m.txt"))) {
    String line;
    StringTokenizer st;
    while ((line = file.readLine()) != null) {
        st = new StringTokenizer(line); // separate the integers on this file line
        while (st.hasMoreTokens())
            numbers.add(Integer.parseInt(st.nextToken())); // convert and add to the list of numbers
    }
} catch (Exception e) {
    System.out.println("Can't read the file...");
}
The big50m file has 50,000,000 integers, and I get this runtime error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
at java.lang.StringBuffer.append(StringBuffer.java:367)
at java.io.BufferedReader.readLine(BufferedReader.java:370)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at unsortedfilesapp.UnsortedFilesApp.main(UnsortedFilesApp.java:37)
C:\Users\User\AppData\Local\NetBeans\Cache\8.2\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 5 seconds)
I think the problem is the String variable named line. Can you tell me how to fix it? I use StringTokenizer because I want fast reading.
Create a BufferedReader from the file and read() char by char. Put digit chars into a String, then call Integer.parseInt(); skip any non-digit chars and continue parsing at the next digit, and so on. A sketch of that approach follows.
The readLine() method reads the whole line at once, eating up a lot of memory. That is highly inefficient and does not scale to arbitrarily big files.
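A minimal sketch of that char-by-char approach (my own illustration, assuming whitespace-separated non-negative integers):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CharByChar {
    public static List<Integer> parse(String file) throws IOException {
        List<Integer> numbers = new ArrayList<>();
        StringBuilder digits = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new FileReader(file))) {
            int c;
            while ((c = r.read()) != -1) {
                if (c >= '0' && c <= '9') {
                    digits.append((char) c);       // collect the digits of the current number
                } else if (digits.length() > 0) {
                    numbers.add(Integer.parseInt(digits.toString()));
                    digits.setLength(0);           // reset for the next number
                }
            }
            if (digits.length() > 0) numbers.add(Integer.parseInt(digits.toString()));
        }
        return numbers;
    }
}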
You can use a StreamTokenizer
like this:
StreamTokenizer tokenizer = new StreamTokenizer(new FileReader("bigfile.txt"));
tokenizer.parseNumbers(); // default behaviour
while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
    if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
        numbers.add((int) Math.round(tokenizer.nval));
    }
}
I have not tested this code but it gives you the general idea.
Here is a version that minimizes memory usage: no byte-to-char conversion and no String operations. This version does not handle negative numbers, though; a variant that does is sketched after the code.
public static void main(final String[] a) {
    final Set<Integer> number = new HashSet<>();
    int v = 0;
    boolean use = false;
    int c;
    // InputStream avoids char conversion
    try (InputStream s = new FileInputStream("C:\\Users\\User\\Desktop\\big50m.txt")) {
        // No allocation in the loop
        do {
            if ((c = s.read()) == -1) break;
            if (c >= '0' && c <= '9') { v = v * 10 + c - '0'; use = true; continue; }
            if (use) number.add(v);
            use = false;
            v = 0;
        } while (true);
        if (use) number.add(v);
    } catch (final Exception e) {
        System.out.println("Can't read the file...");
    }
}
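A variant of the above handling a leading minus sign (my own sketch; it assumes a '-' appears only immediately before a number's digits):
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;

public class ParseWithSign {
    public static void main(final String[] a) {
        final Set<Integer> number = new HashSet<>();
        int v = 0, c;
        boolean use = false, neg = false;
        try (InputStream s = new FileInputStream("C:\\Users\\User\\Desktop\\big50m.txt")) {
            while ((c = s.read()) != -1) {
                if (c == '-' && !use) { neg = true; continue; }
                if (c >= '0' && c <= '9') { v = v * 10 + c - '0'; use = true; continue; }
                if (use) number.add(neg ? -v : v); // flush the finished number
                use = false; neg = false; v = 0;
            }
            if (use) number.add(neg ? -v : v);
        } catch (final Exception e) {
            System.out.println("Can't read the file...");
        }
    }
}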
Running the program with -Xmx2048m, the provided snippet worked (with one adjustment: numbers declared as List<Integer> numbers = new ArrayList<>(50000000);).
Since all the numbers are on one line, the BufferedReader approach does not work or scale well: the complete file will be read into memory. Therefore the streaming approach (e.g. from @whbogado) is indeed the way to go.
StreamTokenizer tokenizer = new StreamTokenizer(new FileReader("bigfile.txt"));
tokenizer.parseNumbers(); // default behaviour
while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
    if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
        numbers.add((int) Math.round(tokenizer.nval));
    }
}
As you write that you are getting a heap space error as well, I assume the streaming is no longer the problem. Unfortunately you are storing all values in a List, and I think that is the problem now. You say in a comment that you do not know the actual count of numbers, so you should avoid storing them in a list and process them in a streaming fashion as well.
For all who are interested, here is my little test code (Java 8) that produces a test file of the needed size USED_INT_VALUES. I limited it for now to 5,000,000 integers. As you can see when running it, the memory use increases steadily while reading through the file; the only place that holds that much memory is the numbers List.
Be aware that initializing an ArrayList with an initial capacity does not allocate the memory needed by the stored objects, in your case the Integers.
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.logging.Level;
import java.util.logging.Logger;

public class TestBigFiles {

    public static void main(String args[]) throws IOException {
        heapStatistics("program start");
        final int USED_INT_VALUES = 5000000;
        File tempFile = File.createTempFile("testdata_big_50m", ".txt");
        System.out.println("using file " + tempFile.getAbsolutePath());
        tempFile.deleteOnExit();
        Random rand = new Random();
        FileWriter writer = new FileWriter(tempFile);
        rand.ints(USED_INT_VALUES).forEach(i -> {
            try {
                writer.write(i + " ");
            } catch (IOException ex) {
                Logger.getLogger(TestBigFiles.class.getName()).log(Level.SEVERE, null, ex);
            }
        });
        writer.close();
        heapStatistics("large file generated - size=" + tempFile.length() + "Bytes");
        List<Integer> numbers = new ArrayList<>(USED_INT_VALUES);
        heapStatistics("large array allocated (to avoid array copy)");
        int c = 0;
        try (FileReader fileReader = new FileReader(tempFile)) {
            StreamTokenizer tokenizer = new StreamTokenizer(fileReader);
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    numbers.add((int) tokenizer.nval);
                    c++;
                }
                if (c % 100000 == 0) {
                    heapStatistics("within loop count " + c);
                }
            }
        }
        heapStatistics("large file parsed number list size is " + numbers.size());
    }

    private static void heapStatistics(String message) {
        int MEGABYTE = 1024 * 1024;
        // clean up unused stuff
        System.gc();
        Runtime runtime = Runtime.getRuntime();
        System.out.println("##### " + message + " #####");
        System.out.println("Used Memory:" + (runtime.totalMemory() - runtime.freeMemory()) / MEGABYTE + "MB"
                + " Free Memory:" + runtime.freeMemory() / MEGABYTE + "MB"
                + " Total Memory:" + runtime.totalMemory() / MEGABYTE + "MB"
                + " Max Memory:" + runtime.maxMemory() / MEGABYTE + "MB");
    }
}

Stream of short[]

Hi, I need to calculate the order-m entropy of a file, where m is a number of bits (m <= 16).
So:
H_m(X) = -\sum_{i=0}^{2^m - 1} p_{i,m} \log_2(p_{i,m})
So, I thought to create an input stream to read the file and then calculate the probability of each sequence of m bits.
For m = 8 it's easy because I consider one byte at a time.
Since m <= 16, I thought to use the primitive type short, save each short of the file into a short[] array, and then manipulate bits using bitwise operators to obtain all the m-bit sequences in the file.
Is this a good idea?
Anyway, I'm not able to create a stream of short. This is what I've done:
public static void main(String[] args) {
    readFile(FILE_NAME_INPUT);
}

public static void readFile(String filename) {
    short[] buffer = null;
    File a_file = new File(filename);
    try {
        File file = new File(filename);
        FileInputStream fis = new FileInputStream(filename);
        DataInputStream dis = new DataInputStream(fis);
        int length = (int) file.length() / 2;
        buffer = new short[length];
        int count = 0;
        while (dis.available() > 0 && count < length) {
            buffer[count] = dis.readShort();
            count++;
        }
        System.out.println("length=" + length);
        System.out.println("count=" + count);
        for (int i = 0; i < buffer.length; i++) {
            System.out.println("buffer[" + i + "]: " + buffer[i]);
        }
        fis.close();
    } catch (EOFException eof) {
        System.out.println("EOFException: " + eof);
    } catch (FileNotFoundException fe) {
        System.out.println("FileNotFoundException: " + fe);
    } catch (IOException ioe) {
        System.out.println("IOException: " + ioe);
    }
}
But I lose a byte when the file length is odd, and I don't think this is the best way to proceed.
This is what I'm thinking of doing with bitwise operators (pseudocode, assuming shorts are used and m divides 16):
List<Integer> list = new ArrayList<>();
for (short n : buffer) {
    for (int i = 16 - m; i >= 0; i -= m) {
        list.add((n >> i) & ((1 << m) - 1)); // the m-bit group ending at bit i
    }
}
If I use bytes instead, how can I write a loop like that for m > 8?
That loop doesn't work in that case, because I have to concatenate multiple bytes, joining a varying number of bits each time...
Any ideas?
Thanks
I think you just need to have a byte array:
public static void readFile(String filename) {
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    try {
        FileInputStream fis = new FileInputStream(filename);
        int b; // read() returns an int: the next byte, or -1 at EOF
        while ((b = fis.read()) != -1) {
            outputStream.write(b);
        }
        byte[] byteData = outputStream.toByteArray();
        fis.close();
    } catch (IOException ioe) {
        System.out.println("IOException: " + ioe);
    }
}
Then you can manipulate byteData as per your bitwise operations.
--
If you want to work with shorts, you can combine the bytes read this way:
short[] buffer = new short[(int) (byteData.length / 2.) + 1];
int j = 0;
for (int i = 0; i < byteData.length - 1; i += 2) {
    // mask the low byte to avoid sign extension when combining
    buffer[j] = (short) ((byteData[i] << 8) | (byteData[i + 1] & 0xFF));
    j++;
}
To handle an odd trailing byte, do this:
if ((byteData.length % 2) == 1) last = (short) (byteData[byteData.length - 1] & 0xFF);
last is a short, so it could be placed in buffer[buffer.length - 1]. I'm not sure whether that last position in buffer is free or occupied; I think it is free, but you need to check j after exiting the loop: if j's value is buffer.length - 1, the slot is free, otherwise there might be a problem.
Then manipulate buffer.
The second approach, working with the bytes directly, is more involved; it's a question of its own. So try the above first.
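For the m > 8 case, where groups straddle byte boundaries, here is a sketch of my own (not from the thread) that streams m-bit groups out of a file with a bit accumulator and counts occurrences, which is exactly what the entropy formula needs:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BitGroups {
    // Count how often each m-bit pattern (m <= 16) occurs in the file.
    public static int[] histogram(String file, int m) throws IOException {
        int[] counts = new int[1 << m];
        int acc = 0, bits = 0, b;
        try (InputStream in = new FileInputStream(file)) {
            while ((b = in.read()) != -1) {
                acc = (acc << 8) | b; // append 8 new bits; only the low 'bits' bits matter
                bits += 8;
                while (bits >= m) {   // emit every complete m-bit group
                    counts[(acc >>> (bits - m)) & ((1 << m) - 1)]++;
                    bits -= m;
                }
            }
        }
        return counts; // p_{i,m} = counts[i] / total, for H_m(X) = -sum p log2 p
    }
}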

How would I make text dynamic from events?

Hello, well, I made code that downloads an app, and that's all working fine.
My question is: I want to display on the frame "Please wait until the jar is done downloading...50% done", where the 50% changes with the percentage of the file left to download.
I have that all set up, but the part where the text changes is not working.
Here's my code:
while ((length = inputstream.read(buffer)) > -1)
{
    down += length;
    bufferedoutputstream.write(buffer, 0, length);
    String text = clientDL.label1.getText();
    text += getPerc() + "%";
}
And here is my getPerc() method:
private int getPerc() {
    return (down / (sizeOfClient + 1)) * 100;
}
Thanks.
Answer for:
"Please wait until the jar is done downloading...50% done" I want the 50% to change with the % left of the file being downloaded.
while ((length = inputstream.read(buffer)) > -1)
{
    down += length;
    bufferedoutputstream.write(buffer, 0, length);
    String text = clientDL.label1.getText();
    int perc = getPerc();
    if (perc <= 50)
    {
        text += perc + "% done";
    }
    else
    {
        text = "Please wait until the jar is downloading...";
        text = text + (100 - perc) + "% remaining";
    }
}
Your main problem here is that Java doesn't work with that pointer-like logic: using a getter to fetch a string and then assigning a new value to the local variable will not change the label at all.
What you are looking for is
clientDL.label1.setText("Whatever text you want to put in the label");
(For the record, you probably want to define a getter for that label rather than accessing label1 directly, which is bad practice.)
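For a fuller picture, here is a sketch of the idiomatic Swing approach (my own illustration; DownloadWorker and progressLabel are made-up names): a SwingWorker downloads in the background, publishes progress, and updates the label on the Event Dispatch Thread. Note it multiplies before dividing, since the integer division in the question's getPerc() truncates to 0:
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import javax.swing.JLabel;
import javax.swing.SwingWorker;

class DownloadWorker extends SwingWorker<Void, Integer> {
    private final JLabel progressLabel;
    private final InputStream in;
    private final OutputStream out;
    private final long totalBytes;

    DownloadWorker(JLabel progressLabel, InputStream in, OutputStream out, long totalBytes) {
        this.progressLabel = progressLabel;
        this.in = in;
        this.out = out;
        this.totalBytes = totalBytes;
    }

    @Override
    protected Void doInBackground() throws IOException {
        byte[] buffer = new byte[8192];
        long down = 0;
        int length;
        while ((length = in.read(buffer)) > -1) {
            out.write(buffer, 0, length);
            down += length;
            publish((int) (down * 100 / totalBytes)); // multiply first to avoid truncating to 0
        }
        return null;
    }

    @Override
    protected void process(List<Integer> chunks) {
        // Runs on the EDT; show the most recent percentage
        int latest = chunks.get(chunks.size() - 1);
        progressLabel.setText("Please wait until the jar is done downloading..." + latest + "% done");
    }
}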
You want to use the carriage return character:
System.out.print("10%\r");
EDIT: You're not using stdout, my bad. See Silverlord's answer.
I think it should be more or less something like this:
final String text = clientDL.label1.getText(); // effectively final, so the Runnable can use it
while ((length = inputstream.read(buffer)) > -1) {
    down += length;
    bufferedoutputstream.write(buffer, 0, length);
    SwingUtilities.invokeLater(new Runnable() {
        public void run() {
            clientDL.label1.setText(text + getPerc() + "%");
        }
    });
}
(I assume you are downloading the file in a thread other than the AWT event queue.)
I don't have enough reputation to comment, so I can't add this as a comment on your question. Anyway, I tried a similar scenario.
I think there is a problem with your clientDL.label1.getText(), just like the solutions suggested by Silverlord describe. Please refer to Silverlord's answer.
public class Checker {

    int length = 0;
    InputStreamReader inputstream;
    FileInputStream fis = null;

    public void display() throws IOException {
        fis = new FileInputStream("C:\\Users\\298610\\Documents\\test.txt");
        inputstream = new InputStreamReader(fis);
        int i = 0;
        while ((length = inputstream.read()) > -1) {
            if (i < getPerc().length) {
                String text = null;
                text = getPerc()[i] + "%";
                System.out.println("Hi man " + text);
            }
            i++;
        }
    }

    private int[] getPerc() {
        return new int[]{10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
    }

    public static void main(String a[]) throws IOException {
        Checker w = new Checker();
        w.display();
    }
}
I am getting output like:
Hi man 10%
Hi man 20%
Hi man 30%
Hi man 40%
Hi man 50%
Hi man 60%
Hi man 70%
Hi man 80%
Hi man 90%
Hi man 100%

Getting MD5 Hash of File from URL

The result I'm getting is that files of the same type return the same MD5 hash value. For example, two different jpgs give me the same result. However, a jpg vs. an apk give different results.
Here is my code...
public static String checkHashURL(String input) {
    try {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream is = new URL(input).openStream();
        try {
            is = new DigestInputStream(is, md);
            int b;
            while ((b = is.read()) > 0) {
                ;
            }
        } finally {
            is.close();
        }
        byte[] digest = md.digest();
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < digest.length; i++) {
            sb.append(
                    Integer.toString((digest[i] & 0xff) + 0x100, 16).substring(1));
        }
        return sb.toString();
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
}
This is broken:
while ((b = is.read()) > 0)
Your code will stop at the first byte of the stream which is 0. If the two files have the same values before the first 0 byte, you'll fail. If you really want to call the byte-at-a-time version of read, you want:
while (is.read() != -1) {}
The parameterless InputStream.read() method returns -1 when it reaches the end of the stream.
(There's no need to assign a value to b, as you're not using it.)
Better would be to read a buffer at a time:
byte[] ignoredBuffer = new byte[8 * 1024]; // Up to 8K per read
while (is.read(ignoredBuffer) > 0) {}
This time the condition is valid, because InputStream.read(byte[]) would only ever return 0 if you pass in an empty buffer. Otherwise, it will try to read at least one byte, returning the length of data read or -1 if the end of the stream has been reached.
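Putting the pieces together, here is a sketch of the whole method with the fixed loop (my rewrite, not the answerer's code):
import java.io.InputStream;
import java.net.URL;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class UrlMd5 {
    public static String checkHashURL(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            try (InputStream is = new DigestInputStream(new URL(input).openStream(), md)) {
                byte[] ignoredBuffer = new byte[8 * 1024];
                while (is.read(ignoredBuffer) > 0) {
                    // reading through the DigestInputStream feeds md as a side effect
                }
            }
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest()) {
                sb.append(String.format("%02x", b)); // two lowercase hex digits per byte
            }
            return sb.toString();
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }
}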
