I am coding a little Java-based tool to process mysqldump files, which can become quite large (up to a gigabyte for now). I am using this code to read and process the file:
BufferedReader reader = getReader();
BufferedWriter writer = getWriter();
char[] charBuffer = new char[CHAR_BUFFER_SIZE];
int readCharCount;
StringBuffer buffer = new StringBuffer();
while( ( readCharCount = reader.read( charBuffer ) ) > 0 )
{
    buffer.append( charBuffer, 0, readCharCount );
    //processing goes here
}
What is a good size for the charBuffer? At the moment it is set to 1000, but my code will run with an arbitrary size, so what is best practice, or can this size be calculated based on the file size?
Thanks in advance,
greetings Philipp
It should always be a power of 2. The optimal value depends on the OS and the disk format. In code I've seen, 4096 is often used, but the bigger the better.
Also, there are better ways to load a file into memory.
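For example, if you really do want the whole dump in memory, a minimal sketch using java.nio.file (assuming Java 7+, a UTF-8 dump, and a heap large enough to hold it; the file name is just a placeholder) is simpler than managing the char buffer yourself:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LoadDump {
    public static void main(String[] args) throws IOException {
        // Reads the entire file in one call; for gigabyte-sized dumps make sure
        // the JVM heap is big enough, or fall back to line-by-line processing.
        String dump = new String(
                Files.readAllBytes(Paths.get("dump.sql")), StandardCharsets.UTF_8);
        System.out.println("Loaded " + dump.length() + " characters");
    }
}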
I have an app that needs to access a large number of images very quickly, so I need to load those images into memory in some way. Doing so as bitmaps used over 100MB of RAM, which was completely out of the question, so I opted to read jpg files into memory, storing them inside a byteArray. Then I decode them and write them to the canvas as each is needed. This works pretty well, cutting out the slow disk access, while also respecting memory limits.
However, memory usage seems 'off' to me. I'm storing 450 jpgs with a file size of approximately 33kb each. This totals around 15MB of data. However, the app continually runs at between 35MB and 40MB of RAM as reported by both Eclipse DDMS and Android (on a physical device). I've tried modifying how many jpgs are loaded, and the RAM used by the app tends to decrease by around 60-70kb per jpg, indicating that each image is stored twice in RAM. Memory usage does not fluctuate, which implies that there is not an actual 'leak' involved.
Here is the relevant loading code:
private byte[][] bitmapArray = new byte[totalFrames][];

for (int x = 0; x < totalFrames; x++) {
    File file = null;
    if (cWidth <= cHeight) {
        file = new File(directory + "/f" + x + ".jpg");
    } else {
        file = new File(directory + "/f" + x + "-land.jpg");
    }
    bitmapArray[x] = getBytesFromFile(file);
    imagesLoaded = x + 1;
}
public byte[] getBytesFromFile(File file) {
    byte[] bytes = null;
    try {
        InputStream is = new FileInputStream(file);
        long length = file.length();
        bytes = new byte[(int) length];
        int offset = 0;
        int numRead = 0;
        while (offset < bytes.length
                && (numRead = is.read(bytes, offset, bytes.length - offset)) >= 0) {
            offset += numRead;
        }
        if (offset < bytes.length) {
            throw new IOException("Could not completely read file " + file.getName());
        }
        is.close();
    } catch (IOException e) {
        //TODO Write your catch method here
    }
    return bytes;
}
Eventually, they get written to screen like so:
SurfaceHolder holder = getSurfaceHolder();
Canvas c = null;
try {
    c = holder.lockCanvas();
    if (c != null) {
        int canvasWidth = c.getWidth();
        int canvasHeight = c.getHeight();
        Rect destinationRect = new Rect();
        destinationRect.set(0, 0, canvasWidth, canvasHeight);
        c.drawBitmap(BitmapFactory.decodeByteArray(bitmapArray[bgcycle], 0,
                bitmapArray[bgcycle].length), null, destinationRect, null);
    }
} finally {
    if (c != null)
        holder.unlockCanvasAndPost(c);
}
Am I correct that there is some sort of duplication going on here? Or is there just that much overhead involved in storing jpgs in a byteArray like this?
Storing bytes in RAM is very different from storing data on a hard drive; there is a lot more overhead to it. The references to the objects, as well as the byte array structures themselves, all take up additional memory. There isn't really a single source of all the additional memory, but just remember that loading a file into RAM normally takes up 2-3x more space (from experience; I'm afraid I can't quote any documentation here).
Consider this:
File F = //Some file here (Less than 2 GB please)
FileInputStream fIn = new FileInputStream(F);
ByteArrayOutputStream bOut = new ByteArrayOutputStream(((int) F.length()) + 1);
int r;
byte[] buf = new byte[32 * 1000];
while ((r = fIn.read(buf)) != -1) {
    bOut.write(buf, 0, r);
}
//Do a memory measurement at this point. You'll see you're using nearly 3x the memory in RAM compared to the file.
//If you're actually going to try this, remember to surround with try-catch and close the streams as appropriate.
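One rough way to take that measurement (just a sketch; Runtime figures depend heavily on GC timing, so treat the numbers as approximate):
// Suggest a GC first so the reading is less noisy; this is only a hint to the JVM.
Runtime rt = Runtime.getRuntime();
rt.gc();
long usedBytes = rt.totalMemory() - rt.freeMemory();
System.out.println("Approx. heap in use: " + (usedBytes / 1024) + " KB");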
Also remember that unused memory is not instantly cleared up. The method getBytesFromFile() may be returning a copy of the byte array, causing memory duplication that may not be garbage collected immediately. If you want to be safe, check that getBytesFromFile(file) is not leaking any references that should be cleaned up. It won't show up as a memory leak, since you only call it a finite number of times.
It might be because your byte array is two-dimensional. You only need one dimension for loading an image into a byte array, and the second dimension could potentially double the RAM needed, because for each byte you would have an empty but still existing byte that you don't use.
Using the following code as a benchmark, the system can write 10,000 rows to disk in a fraction of a second:
void withSync() {
    int f = open("/tmp/t8", O_RDWR | O_CREAT);
    lseek(f, 0, SEEK_SET);
    int records = 10 * 1000;
    clock_t ustart = clock();
    for (int i = 0; i < records; i++) {
        write(f, "012345678901234567890123456789", 30);
        fsync(f);
    }
    clock_t uend = clock();
    close(f);
    printf(" sync() seconds:%lf writes per second:%lf\n",
           ((double)(uend - ustart)) / (CLOCKS_PER_SEC),
           ((double)records) / ((double)(uend - ustart)) / (CLOCKS_PER_SEC));
}
In the above code, 10,000 records can be written and flushed out to disk in a fraction of a second, output below:
sync() seconds:0.006268 writes per second:0.000002
In the Java version, it takes over 4 seconds to write 10,000 records. Is this just a limitation of Java, or am I missing something?
public void testFileChannel() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(new File("/tmp/t5"), "rw");
    FileChannel c = raf.getChannel();
    c.force(true);
    ByteBuffer b = ByteBuffer.allocateDirect(64 * 1024);
    long s = System.currentTimeMillis();
    for (int i = 0; i < 10000; i++) {
        b.clear();
        b.put("012345678901234567890123456789".getBytes());
        b.flip();
        c.write(b);
        c.force(false);
    }
    long e = System.currentTimeMillis();
    raf.close();
    System.out.println("With flush " + (e - s));
}
Returns this:
With flush 4263
Please help me understand what is the correct/fastest way to write records to disk in Java.
Note: I am using the RandomAccessFile class in combination with a ByteBuffer as ultimately we need random read/write access on this file.
Actually, I am surprised that test is not slower. The behavior of force is OS dependent, but broadly it forces the data to disk. If you have an SSD you might achieve 40K writes per second, but with an HDD you won't. In the C example it clearly isn't committing the data to disk, as even the fastest SSD cannot perform more than 235K IOPS (that is, the manufacturers guarantee it won't go faster than that :D).
If you need the data committed to disk every time, you can expect it to be slow and entirely dependent on the speed of your hardware. If you just need the data flushed to the OS, so that no data is lost if the program crashes but the OS does not, you can write the data without force. A faster option is to use memory mapped files. This will give you random access without a system call for each record.
I have a library, Java Chronicle, which can read/write 5-20 million records per second with a latency of 80 ns, in text or binary formats, with random access, and it can be shared between processes. It is only this fast because it does not commit the data to disk on every record, but you can test that if the JVM crashes at any point, no data written to the chronicle is lost.
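As a rough illustration of the memory-mapped approach (a minimal sketch only, not how Chronicle itself works; the file name, record count and mapping size are made up for the example):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedWriteSketch {
    public static void main(String[] args) throws IOException {
        byte[] record = "012345678901234567890123456789".getBytes();
        int records = 10000;
        try (RandomAccessFile raf = new RandomAccessFile("/tmp/t6", "rw");
             FileChannel ch = raf.getChannel()) {
            // Map a region big enough for all the records; each put() writes to the
            // page cache without a system call per record, and the OS flushes later.
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0,
                    (long) records * record.length);
            long start = System.currentTimeMillis();
            for (int i = 0; i < records; i++) {
                map.put(record);
            }
            long end = System.currentTimeMillis();
            System.out.println("Mapped writes took " + (end - start) + " ms");
        }
    }
}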
This code is more similar to what you wrote in C. Takes only 5 msec on my machine. If you really need to flush after every write, it takes about 60 msec. Your original code took about 11 seconds on this machine. BTW, closing the output stream also flushes.
public static void testFileOutputStream() throws IOException {
    OutputStream os = new BufferedOutputStream( new FileOutputStream( "/tmp/fos" ) );
    byte[] bytes = "012345678901234567890123456789".getBytes();
    long s = System.nanoTime();
    for ( int i = 0; i < 10000; i++ ) {
        os.write( bytes );
    }
    long e = System.nanoTime();
    os.close();
    System.out.println( "outputstream " + ( e - s ) / 1e6 );
}
The Java equivalent of fputs is file.write("012345678901234567890123456789"); you are calling 4 functions per record versus just 1 in C, so the delay seems obvious.
I think this is most similar to your C version. I think the direct buffers in your Java example are causing many more buffer copies than the C version. This takes about 2.2s on my (old) box.
public static void testFileChannelSimple() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(new File("/tmp/t5"), "rw");
    FileChannel c = raf.getChannel();
    c.force(true);
    byte[] bytes = "012345678901234567890123456789".getBytes();
    long s = System.currentTimeMillis();
    for (int i = 0; i < 10000; i++) {
        raf.write(bytes);
        c.force(true);
    }
    long e = System.currentTimeMillis();
    raf.close();
    System.out.println("With flush " + (e - s));
}
My String is supposed to contain the text of a 50 MB file.
I got my String like this:
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, 1024*50);
byte[] b = new byte[1024*50];
buffer.get(b);
String wiki = new String(b);
I get a String expression that can contain multiple words, and I need to return an answer saying whether this expression is in my wiki String (the big String) or not.
The search works well for about the first 1% of the String (from the beginning of the String), but when the phrase I'm looking for is in the middle or at the end of the String, the answer I get from the following code is false:
System.out.println(wiki.contains(strToCheck));
System.out.println(wiki.indexOf(strToCheck, 0));
System.out.println(wiki.matches("(?i).*"+strToCheck+".*"));
Does anyone know why this happens?
Or what am I doing wrong?
Thank you.
I am sorry to say it, but 1024*50 is not 50 MB. It is 50 KB.
It seems that you are reading only about 0.1% of your file and then searching in it.
You should try
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, 1024*1024*50);
because 50 MB = 1024*1024*50, 50 KB = 1024*50, and 1 MB = 1024 KB.
Occam's razor: strToCheck is NOT in wiki.
If you are going to be performing searches in the String, you could consider implementing the Knuth–Morris–Pratt algorithm and buffering your reads of the original file so that the entire string is never loaded into memory.
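A minimal sketch of that idea (assuming a plain text encoding readable char by char; the search phrase is made up, and the code is not tuned for production use):
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamingSearch {

    // Standard KMP failure function: lps[i] is the length of the longest
    // proper prefix of pattern[0..i] that is also a suffix of it.
    static int[] buildFailureTable(String pattern) {
        int[] lps = new int[pattern.length()];
        int len = 0;
        for (int i = 1; i < pattern.length(); ) {
            if (pattern.charAt(i) == pattern.charAt(len)) {
                lps[i++] = ++len;
            } else if (len > 0) {
                len = lps[len - 1];
            } else {
                lps[i++] = 0;
            }
        }
        return lps;
    }

    // Scans the reader one character at a time; only the pattern and the
    // failure table are held in memory, never the whole file.
    static boolean contains(BufferedReader reader, String pattern) throws IOException {
        int[] lps = buildFailureTable(pattern);
        int matched = 0;
        int ch;
        while ((ch = reader.read()) != -1) {
            while (matched > 0 && pattern.charAt(matched) != ch) {
                matched = lps[matched - 1];
            }
            if (pattern.charAt(matched) == ch) {
                matched++;
            }
            if (matched == pattern.length()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("wiki.txt"), StandardCharsets.UTF_8)) {
            System.out.println(contains(reader, "some phrase to find"));
        }
    }
}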
So I have large (around 4 GB each) txt files in pairs, and I need to create a third file which would consist of the two files in shuffle mode. The following describes it best:
3rdfile = (4 lines from file 1) + (4 lines from file 2), and this is repeated until I hit the end of file 1 (both input files will have the same length - this is by definition). Here is the code I'm using now, but it doesn't scale very well on large files. I was wondering if there is a more efficient way to do this - would working with memory mapped files help? All ideas are welcome.
public static void mergeFastq(String forwardFile, String reverseFile, String outputFile) {
    try {
        BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile));
        BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile));
        PrintWriter outputWriter = new PrintWriter(new FileWriter(outputFile, true));
        String forwardLine = null;
        System.out.println("Begin merging Fastq files");
        int readsMerge = 0;
        while ((forwardLine = inputReaderForward.readLine()) != null) {
            //append the forward file
            outputWriter.println(forwardLine);
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            //append the reverse file
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            readsMerge++;
            if (readsMerge % 10000 == 0) {
                System.out.println("[" + now() + "] Merged 10000");
                readsMerge = 0;
            }
        }
        inputReaderForward.close();
        inputReaderReverse.close();
        outputWriter.close();
    } catch (IOException ex) {
        Logger.getLogger(Utilities.class.getName()).log(Level.SEVERE, "Error while merging FastQ files", ex);
    }
}
Maybe you also want to try to use a BufferedWriter to cut down your file IO operations.
http://download.oracle.com/javase/6/docs/api/java/io/BufferedWriter.html
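For example, a minimal change (the buffer size here is arbitrary) is to wrap the FileWriter in a BufferedWriter before handing it to the PrintWriter:
PrintWriter outputWriter = new PrintWriter(
        new BufferedWriter(new FileWriter(outputFile, true), 1 << 16));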
A simple answer is to use a bigger buffer, which helps to reduce the total number of I/O calls being made.
Usually, memory mapped I/O with FileChannel (see Java NIO) is used for handling large data file I/O. Here, however, it does not apply directly, as you need to inspect the file content in order to determine the boundary of every 4 lines.
If performance was the main requirement, then I would code this function in C or C++ instead of Java.
But regardless of the language used, what I would do is try to manage memory myself. I would create two large buffers, say 128MB or more each, and fill them with data from the two text files. Then you need a third buffer that is twice as big as the previous two. The algorithm starts moving characters one by one from input buffer #1 to the destination buffer, counting EOLs as it goes. Once you reach the 4th line you note the current position in that buffer and repeat the same process with the 2nd input buffer. You continue alternating between the two input buffers, replenishing them when you consume all the data in them. Each time you have to refill the input buffers you can also write out the destination buffer and empty it.
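A rough Java approximation of that idea (a sketch only: it relies on large, explicitly sized reader/writer buffers instead of the hand-managed byte buffers described above, the buffer sizes are arbitrary, and it assumes both inputs are well-formed with records of exactly 4 lines):
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class InterleaveSketch {
    public static void interleave(String forwardFile, String reverseFile,
                                  String outputFile) throws IOException {
        int bufSize = 8 * 1024 * 1024; // large buffers cut down on actual disk I/O
        try (BufferedReader fwd = new BufferedReader(new FileReader(forwardFile), bufSize);
             BufferedReader rev = new BufferedReader(new FileReader(reverseFile), bufSize);
             BufferedWriter out = new BufferedWriter(new FileWriter(outputFile), 2 * bufSize)) {
            String line;
            while ((line = fwd.readLine()) != null) {
                // 4 lines from the forward file...
                out.write(line);
                out.newLine();
                for (int i = 0; i < 3; i++) {
                    out.write(fwd.readLine());
                    out.newLine();
                }
                // ...then 4 lines from the reverse file
                for (int i = 0; i < 4; i++) {
                    out.write(rev.readLine());
                    out.newLine();
                }
            }
        }
    }
}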
Buffer your read and write operations. The buffer needs to be large enough to minimize the number of read/write operations while still being memory efficient. This is really simple and it works.
void write(InputStream is, OutputStream os) throws IOException {
    byte[] buf = new byte[102400]; //optimize the size of the buffer to your needs
    int num;
    while ((num = is.read(buf)) != -1) {
        os.write(buf, 0, num);
    }
}
EDIT:
I just realized that you need to shuffle the lines, so this code will not work for you as is, but the concept still remains the same.
I was originally using RIM's native XML parser methods to parse a 150k text file, approximately 5000 lines of XML; however, it was taking about 2 minutes to complete, so I tried a line-based format:
Title: Book Title
Line 1
Line 2
Line 3
I should be able to read the file in less time than it takes to blink, but it is still slow.
The identifier books is a Vector of Book objects, and lines are stored in a Vector of Strings inside each Book object.
Class classs = Class.forName("com.Gui.FileLoader");
InputStream is = classs.getResourceAsStream( fileName );

//Used to determine file size and then show in progress bar, app is threaded.
int totalFileSize = IOUtilities.streamToBytes( is ).length;
int totalRead = 0;

//Thought that maybe a shared input stream would be faster, in this case it's not.
SharedInputStream sis = SharedInputStream.getSharedInputStream( classs.getResourceAsStream( fileName ) );
LineReader lr = new LineReader( sis );
String strLine = new String( lr.readLine() );
totalRead += strLine.length();
Book book = null;

//Loop over the file until EOF is reached, catch the EOF error and move on with life after that.
while (1 == 1) {
    //If the line starts with "Title:" then we've got a new book; add the old book to our books vector.
    if (strLine.startsWith("Title:")) {
        if (book != null) {
            books.addElement( book );
        }
        book = new Book();
        book.setTitle( strLine.substring( strLine.indexOf(':') + 1 ).trim() );
        strLine = new String( lr.readLine() );
        totalRead += strLine.length();
        continue;
    }
    int totalComplete = (int) ( ( (double) totalRead / (double) totalFileSize ) * 100.00 );
    _observer.processStatusUpdate( totalComplete, book.getTitle() );
    book.addLine( strLine );
    strLine = new String( lr.readLine(), "ascii" );
    totalRead += strLine.length();
}
For one thing, you're reading in the file twice - once for determining the size and then again for parsing it. Since you're already reading it into a byte array for determining the size, why not pass that byte array into a ByteArrayInputStream constructor? For example:
//Used to determine file size and then show in progress bar, app is threaded.
byte[] fileBytes = IOUtilities.streamToBytes( is );
int totalFileSize = fileBytes.length;
int totalRead = 0;
ByteArrayInputStream bais = new ByteArrayInputStream( fileBytes );
LineReader lr = new LineReader( bais);
This way it won't matter if the rest of the classes reading from the stream are reading a byte at a time - it's all in-memory.
It is easy to assume that all the operations you've elided from the code sample finish in constant time. I am guessing that one of them is doing something inefficiently, such as book.addLine( strLine ); or perhaps _observer.processStatusUpdate( totalComplete , book.getTitle() ); If those operations are not able to complete in constant time, then you could easily have a quadratic parsing algorithm.
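For instance (purely hypothetical, since the Book class is not shown in the question), an addLine that appends to a growing String is quadratic over the whole file, while one that just adds to a Vector is not:
import java.util.Vector;

// Hypothetical Book internals, for illustration only.
class Book {
    private final Vector lines = new Vector();
    private String joined = "";

    // Amortized constant time per line: fine even for thousands of lines.
    void addLine(String line) {
        lines.addElement(line);
    }

    // Quadratic overall: every call copies everything appended so far.
    void addLineSlow(String line) {
        joined = joined + line + "\n";
    }
}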
Just thinking about the operations is the best way to figure it out, but if you're stumped, try using the BlackBerry profiler. Run your program in the Eclipse debugger and get it to stop at a breakpoint just before parsing. Then, in Eclipse, select 'window .. show view .. other .. BlackBerry .. BlackBerry Profiler View'.
Select the 'setup options' button from the profiler view toolbar (it has a blue triangle in the icon). Set 'method attribution' to cumulative, and 'what to profile' to 'time including native methods'.
Then continue your program. Once parsing is finished, you'll need to pause program execution, then click on the 'method' tab of the profiler view. You should be able to determine your pain point from there.
Where does the profiler say you spend your time?
If you do not have a preferred profiler, there is jvisualvm in the Java 6 JDK.
(My guess is that you will find all the time being spent on the way down to "read a character from the file". If so, you need to buffer.)
Try using new BufferedInputStream(classs.getResourceAsStream(fileName));
EDIT:
Apparently the documentation that says they have BufferedInputStream is wrong.
I am going to leave this wrong answer here just so people have that info (doc being wrong).