Parsing a text file on BlackBerry takes forever - java

I was originally using RIM's native XML parser methods to parse a 150 KB text file (approximately 5000 lines of XML), but it was taking about 2 minutes to complete, so I tried a line-based format instead:
Title: Book Title
Line 1
Line 2
Line 3
I should be able to read the file in less time than it takes to blink, but it is still slow.
The identifier books is a Vector of Book objects, and each Book stores its lines in a Vector of Strings.
Class classs = Class.forName("com.Gui.FileLoader");
InputStream is = classs.getResourceAsStream( fileName );
int totalFileSize = IOUtilities.streamToBytes( is ).length;
int totalRead = 0;

//Thought that maybe a shared input stream would be faster; in this case it's not.
SharedInputStream sis = SharedInputStream.getSharedInputStream( classs.getResourceAsStream( fileName ) );
LineReader lr = new LineReader( sis );
String strLine = new String( lr.readLine() );
totalRead += strLine.length();
Book book = null;

//Loop over the file until EOF is reached; catch the EOF error and move on with life after that.
while (true) {
    //If the line starts with "Title:" we've got a new book; add the old book to our books vector.
    if (strLine.startsWith("Title:")) {
        if (book != null) {
            books.addElement( book );
        }
        book = new Book();
        book.setTitle( strLine.substring( strLine.indexOf(':') + 1 ).trim() );
        strLine = new String( lr.readLine() );
        totalRead += strLine.length();
        continue;
    }
    int totalComplete = (int) ( ( (double) totalRead / (double) totalFileSize ) * 100.00 );
    _observer.processStatusUpdate( totalComplete, book.getTitle() );

    book.addLine( strLine );
    strLine = new String( lr.readLine(), "ascii" );
    totalRead += strLine.length();
}

For one thing, you're reading in the file twice - once for determining the size and then again for parsing it. Since you're already reading it into a byte array for determining the size, why not pass that byte array into a ByteArrayInputStream constructor? For example:
//Used to determine file size and then show in progress bar, app is threaded.
byte[] fileBytes = IOUtilities.streamToBytes( is );
int totalFileSize = fileBytes.length;
int totalRead = 0;
ByteArrayInputStream bais = new ByteArrayInputStream( fileBytes );
LineReader lr = new LineReader( bais);
This way it won't matter if the rest of the classes reading from the stream are reading a byte at a time - it's all in-memory.

It is easy to assume that all the operations you've elided from the code sample finish in constant time. I am guessing that one of them is doing something inefficient, such as book.addLine( strLine ) or perhaps _observer.processStatusUpdate( totalComplete, book.getTitle() ). If those operations do not complete in constant time, you could easily end up with a quadratic parsing algorithm.
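For illustration, here is a hypothetical Book class (the question does not show the real one): storing each line in a Vector is amortized constant time per call, while rebuilding a single String on every call re-copies everything appended so far and turns the whole parse quadratic.

import java.util.Vector;

// Hypothetical Book class, only to illustrate the constant-time vs quadratic difference.
class Book {
    private final Vector lines = new Vector();
    private String cachedText = "";

    // Amortized O(1) per call: just store the reference.
    void addLine(String line) {
        lines.addElement(line);
    }

    // O(total text so far) per call: makes the whole parse O(n^2).
    void addLineSlow(String line) {
        cachedText = cachedText + line + "\n";
    }
}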
Just thinking about the operations is the best way to figure it out, but if you're stumped, try using the BlackBerry profiler. Run your program in the Eclipse debugger and get it to stop at a breakpoint just before parsing. Then, in Eclipse, select 'Window .. Show View .. Other .. BlackBerry .. BlackBerry Profiler View'.
Select the 'setup options' button from the profiler view toolbar (it has a blue triangle in the icon). Set 'method attribution' to cumulative, and 'what to profile' to 'time including native methods'.
Then continue your program. Once parsing is finished, pause program execution, then click on the 'method' tab of the profiler view. You should be able to determine your pain point from there.

Where does the profiler say you spend your time?
If you do not have a preferred profiler, there is jvisualvm in the Java 6 JDK.
(My guess is that you will find all the time being spent on the way down to "read a character from the file". If so, you need to buffer.)

Try using new BufferedInputStream(classs.getResourceAsStream(fileName));
EDIT:
Apparently the documentation that says BlackBerry has BufferedInputStream is wrong.
I am going to leave this wrong answer here just so people have that information (the documentation being wrong).

Related

Export multiple images in one byte array (BLOB IBM DB2) to disk

I have a column "Content" (BLOB data) in an IBM DB2 database, and the data of one record looks like this (https://drive.google.com/file/d/12d1g5jtomJS-ingCn_n0GKMsM4RkdYzB/view?usp=sharing).
I have opened it in an editor and I think it contains more than one image (https://i.stack.imgur.com/2biLN.png, https://i.stack.imgur.com/ZwBOs.png).
I can export a single image from a byte array (using C#) to my disk, but with multiple images I don't know how to do it.
Please help me! Thanks!
Edit 1:
I have tried to export it as a single image with this code:
private void readBLOB(DB2Connection conn, DB2Transaction trans)
{
    try
    {
        string SavePath = @"D:\MyBLOB";
        long CurrentIndex = 0;
        //the number of bytes to store in the array
        int BufferSize = 413454;
        //the number of bytes returned from the GetBytes() method
        long BytesReturned;
        //a byte array to hold the buffer
        byte[] Blob = new byte[BufferSize];

        DB2Command cmd = conn.CreateCommand();
        cmd.CommandText = "SELECT ATTR0102500126 " +
                          " FROM JCR.ICMUT01278001 " +
                          " WHERE COMPKEY = 'N21E26B04900FC6B1F00000'";
        cmd.Transaction = trans;

        DB2DataReader reader;
        reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess);
        if (reader.Read())
        {
            FileStream fs = new FileStream(SavePath + "\\" + "quang canh.jpg", FileMode.OpenOrCreate, FileAccess.Write);
            BinaryWriter writer = new BinaryWriter(fs);
            //reset the index to the beginning of the file
            CurrentIndex = 0;

            BytesReturned = reader.GetBytes(
                0,            //the BlobsTable column index
                CurrentIndex, //the current index of the field from which to begin the read operation
                Blob,         //array to write the buffer to
                0,            //the start index of the array
                BufferSize    //the maximum length to copy into the buffer
            );
            while (BytesReturned == BufferSize)
            {
                writer.Write(Blob);
                writer.Flush();
                CurrentIndex += BufferSize;
                BytesReturned = reader.GetBytes(0, CurrentIndex, Blob, 0, BufferSize);
            }
            writer.Write(Blob, 0, (int)BytesReturned);
            writer.Flush();
            writer.Close();
            fs.Close();
        }
        reader.Close();
    }
    catch (Exception e)
    {
        Console.WriteLine(e.Message);
    }
}
But I cannot view the resulting image; it shows a format error => https://i.stack.imgur.com/PNS9Q.png
You are currently assuming all BLOBs in that DB are JPEG images. But that is clearly not the case.
Option 1: This is faulty data
Programs that save to databases can fail.
Databases themselves might fail, especially if transactions are turned off. Transactions are most likely turned off for BLOBs.
The physical disk the data was stored on might have degraded. And again, you will not get a lot of redundancy and error correction with BLOBs (plus getting use of the error correction requires going through the proper DBMS in the first place).
Option 2: This is not a JPEG
I know an article about Unicode that says "[...]problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends."
This applies doubly, triply and quadruply to images:
this could be any number of formats that use interlacing.
this could be a professional graphics program's image/project file, like TIFF, which can contain multiple images - up to one per layer you are working with.
this could even be an .SVG file (XML text that contains drawing orders) that was run through .ZIP compression, the way a Word document is.
this could even be a PDF, where the images are usually appended at the back (allowing you to read the text with a partial file, similar to interleaving).

Java - good size for a char buffer

I am coding a little Java-based tool to process mysqldump files, which can become quite large (up to a gigabyte for now). I am using this code to read and process the file:
BufferedReader reader = getReader();
BufferedWriter writer = getWriter();
char[] charBuffer = new char[CHAR_BUFFER_SIZE];
int readCharCount;
StringBuffer buffer = new StringBuffer();
while( ( readCharCount = reader.read( charBuffer ) ) > 0 )
{
    buffer.append( charBuffer, 0, readCharCount );
    //processing goes here
}
What is a good size for the charBuffer? At the moment it is set to 1000, but my code will run with an arbitrary size, so what is best practice or can this size be calculated depending on the file size?
Thanks in advance,
greetings Philipp
It should always be a power of 2. The optimal value depends on the OS and disk format. In code I've seen, 4096 is often used, but in general the bigger the better.
Also, there are better ways to load a file into memory.
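For instance, on Java 7+ you can read the whole file in one call instead of looping over a small char buffer. A sketch, assuming the dump fits in the heap; the file name is just a placeholder:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LoadWholeFile {
    public static void main(String[] args) throws Exception {
        // Reads the entire file into memory in one call; needs a heap large enough for the dump.
        byte[] bytes = Files.readAllBytes(Paths.get("dump.sql"));
        String content = new String(bytes, StandardCharsets.UTF_8);
        System.out.println("Read " + content.length() + " characters");
    }
}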

Splitting a .gz file into specified file sizes in Java using byte[] array

I have written code to split a .gz file into user-specified parts using a byte[] array. But the for loop is not reading/writing the last part of the parent file, which is smaller than the array size. Can you please help me fix this?
package com.bitsighttech.collection.packaging;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.log4j.Logger;

public class FileSplitterBytewise
{
    private static Logger logger = Logger.getLogger(FileSplitterBytewise.class);
    private static final long KB = 1024;
    private static final long MB = KB * KB;
    private FileInputStream fis;
    private FileOutputStream fos;
    private DataInputStream dis;
    private DataOutputStream dos;

    public boolean split(File inputFile, String splitSize)
    {
        int expectedNoOfFiles = 0;
        try
        {
            double parentFileSizeInB = inputFile.length();
            Pattern p = Pattern.compile("(\\d+)\\s([MmGgKk][Bb])");
            Matcher m = p.matcher(splitSize);
            m.matches();
            String FileSizeString = m.group(1);
            String unit = m.group(2);
            double FileSizeInMB = 0;
            try {
                if (unit.toLowerCase().equals("kb"))
                    FileSizeInMB = Double.parseDouble(FileSizeString) / KB;
                else if (unit.toLowerCase().equals("mb"))
                    FileSizeInMB = Double.parseDouble(FileSizeString);
                else if (unit.toLowerCase().equals("gb"))
                    FileSizeInMB = Double.parseDouble(FileSizeString) * KB;
            } catch (NumberFormatException e) {
                logger.error("invalid number [" + FileSizeInMB + "] for expected file size");
            }

            double fileSize = FileSizeInMB * MB;
            int fileSizeInByte = (int) Math.ceil(fileSize);
            double noOFFiles = parentFileSizeInB / fileSizeInByte;
            expectedNoOfFiles = (int) Math.ceil(noOFFiles);
            int splinterCount = 1;

            fis = new FileInputStream(inputFile);
            dis = new DataInputStream(new BufferedInputStream(fis));
            fos = new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles);
            dos = new DataOutputStream(new BufferedOutputStream(fos));

            byte[] data = new byte[(int) fileSizeInByte];
            while ( splinterCount <= expectedNoOfFiles ) {
                int i;
                for (i = 0; i < data.length - 1; i++)
                {
                    data[i] = dis.readByte();
                }
                dos.write(data);
                splinterCount++;
            }
        }
        catch(Exception e)
        {
            logger.error("Unable to split the file " + inputFile.getName() + " in to " + expectedNoOfFiles);
            return false;
        }
        logger.debug("Successfully split the file [" + inputFile.getName() + "] in to " + expectedNoOfFiles + " files");
        return true;
    }

    public static void main(String args[])
    {
        String FilePath1 = "F:\\az.gz";
        File file = new File(FilePath1);
        FileSplitterBytewise fileSplitter = new FileSplitterBytewise();
        String splitlen = "1 MB";
        fileSplitter.split(file, splitlen);
    }
}
I'd suggest making more methods. You've got a complicated string-handling section of code in split(); it would be best to make a method that takes the human-friendly string as input and returns the number you're looking for. (It would also make it far easier for you to test this section of the routine; there's no way you can test it now.)
Once it is split off and you're writing test cases, you'll probably find that the error message you generate if the string doesn't contain kb, mb, or gb is extremely confusing -- it blames the number 0 for the mistake rather than pointing out the string does not have the expected units.
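Something along these lines, as a sketch - the class name, method name, regex and exception messages are illustrative, not from the original code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitSizes {
    private static final Pattern SIZE = Pattern.compile("(\\d+)\\s*([KkMmGg][Bb])");

    // Parses human-friendly sizes such as "1 MB", "512 kb" or "2 GB" into a byte count.
    static long parseSizeInBytes(String text) {
        Matcher m = SIZE.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("expected something like '1 MB', got: " + text);
        }
        long value = Long.parseLong(m.group(1));
        String unit = m.group(2).toLowerCase();
        if (unit.equals("kb")) return value * 1024L;
        if (unit.equals("mb")) return value * 1024L * 1024L;
        if (unit.equals("gb")) return value * 1024L * 1024L * 1024L;
        throw new IllegalArgumentException("unknown unit: " + unit);
    }
}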
Using an int to store the file size means your program will never handle files larger than two gigabytes. You should stick with long or double. (double feels wrong for something that is actually confined to integer values but I can't quickly think why it would fail.)
byte[] data = new byte[(int) fileSizeInByte];
Allocating several gigabytes like this is going to destroy your performance -- that's a potentially huge memory allocation (and one that might be considered under control of an adversary; depending upon your security model, this might or might not be a big deal). Don't try to work with the entire file in one piece.
You appear to be reading and writing the files one byte at a time. That's a guarantee to very slow performance. Doing some performance testing for another question earlier today, I found that my machine could read (from a hot cache) 2000 times faster using 131kb blocks than two-byte blocks. One-byte blocks would be even worse. A cold cache would be significantly worse for such small sizes.
fos = new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles);
You only appear to ever open one file output stream. Your post probably should have said "only the first works", because it looks like you've not yet tried it on a file that creates three or more pieces.
catch(Exception e)
At this point, you've got the ability to discover errors in your program; you choose to ignore them completely. Sure, you log an error message, but you cannot actually debug your program with the data you log. You should log at a minimum the exception type, message, and maybe even full stack-trace. This combination of data is immensely useful when trying to solve problems, especially in a few months when you've forgotten the details of how it works.
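With log4j that can be as simple as passing the exception itself to the logger. A drop-in sketch for the catch block above:

catch (Exception e)
{
    // Logs the message plus the exception class, message and full stack trace.
    logger.error("Unable to split the file " + inputFile.getName(), e);
    return false;
}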
Can you please help me in fixing this?
I would:
drop the DataInput/OutputStreams; you don't need them.
use in.read(data) to read a whole block instead of one byte at a time. Reading one byte at a time is so much slower!
read the whole of the data array; at the moment you are reading one byte less than its length.
stop when you reach the end of the file; it might not be a whole multiple of the block size.
only write as much as you have read: if your blocks are 1 MB and there are 100 KB left, you should only read/write 100 KB at the end.
close your files when you have finished, especially as you have buffered streams.
note that your split() writes everything to the same file (so it's not actually splitting); you need to create, write to and close output files in a loop.
not use fields where you could/should be using local variables.
use the length as a long in bytes.
note that the pattern ignores incorrect input and doesn't match the check you perform: e.g. your pattern allows "1 G" or "1 k", but these will be treated as 1 MB.
(A short sketch pulling these points together is below.)
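A minimal sketch of that loop, with illustrative class, method and file naming: it reads in 64 KB blocks, opens one output file per part, writes only what was actually read, and closes everything (parts can exceed partSize by at most one block).

import java.io.*;

final class SplitSketch {
    // Splits inputFile into parts of roughly partSize bytes each, written into outDir.
    static void split(File inputFile, long partSize, File outDir) throws IOException {
        byte[] buf = new byte[64 * 1024]; // read in blocks, never one byte at a time
        try (InputStream in = new BufferedInputStream(new FileInputStream(inputFile))) {
            int part = 0;
            int n = in.read(buf);
            while (n > 0) {
                part++;
                File outFile = new File(outDir, inputFile.getName() + ".part" + part);
                long written = 0;
                try (OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile))) {
                    while (n > 0) {
                        out.write(buf, 0, n);            // write only as much as was actually read
                        written += n;
                        if (written >= partSize) break;  // this part is full; move to the next file
                        n = in.read(buf);
                    }
                } // try-with-resources flushes and closes each part
                n = (written >= partSize) ? in.read(buf) : -1; // prime the next part, or stop at EOF
            }
        }
    }
}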

Searching for an expression in a very long String in Java

My String contains a text file of 50 MB.
I got my String like this:
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, 1024*50);
byte[] b = new byte[1024*50];
buffer.get(b);
String wiki = new String(b);
I get a String expression that can contain multiple words, and I need to return an answer saying whether this expression is in my wiki String (the big String) or not.
This works fine for about the first 1% of the String, but when the phrase I'm looking for is in the middle or at the end of the String, the answer I get from the following code is false:
System.out.println(wiki.contains(strToCheck));
System.out.println(wiki.indexOf(strToCheck, 0));
System.out.println(wiki.matches("(?i).*"+strToCheck+".*"));
Does anyone know why this happens?
Or what am I doing wrong?
Thank you.
I am sorry to say it, but 1024*50 is not 50 MB. It is 50 KB.
It seems that you are reading 0.1% of your file and then searching in it.
You should try
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, 1024*1024*50);
because 50 MB = 1024*1024*50, 50 KB = 1024*50, and 1 MB = 1024 KB.
Occam's razor: strToCheck is NOT in wiki.
If you are going to be performing searches in the String, you can consider implementing the Knuth–Morris–Pratt algorithm and buffering your reads of the original String so that the entire string is not loaded into memory.
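A minimal sketch of that idea (the method name is illustrative): it searches any CharSequence in time proportional to text length plus pattern length, so it could also be fed characters from a buffered reader rather than one giant String.

// Knuth-Morris-Pratt substring search; returns the index of the first match or -1.
static int kmpIndexOf(CharSequence text, String pattern) {
    if (pattern.isEmpty()) return 0;
    // failure[i] = length of the longest proper prefix of pattern[0..i] that is also its suffix
    int[] failure = new int[pattern.length()];
    int k = 0;
    for (int i = 1; i < pattern.length(); i++) {
        while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = failure[k - 1];
        if (pattern.charAt(i) == pattern.charAt(k)) k++;
        failure[i] = k;
    }
    // single left-to-right scan of the text, falling back via the table on mismatches
    int j = 0;
    for (int i = 0; i < text.length(); i++) {
        while (j > 0 && text.charAt(i) != pattern.charAt(j)) j = failure[j - 1];
        if (text.charAt(i) == pattern.charAt(j)) j++;
        if (j == pattern.length()) return i - j + 1;
    }
    return -1;
}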

Most efficient merging of 2 text files.

So I have large (around 4 gigs each) txt files in pairs, and I need to create a 3rd file which consists of the 2 files in shuffle mode. The following equation describes it best:
3rdfile = (4 lines from file 1) + (4 lines from file 2), and this is repeated until I hit the end of file 1 (both input files will have the same length - this is by definition). Here is the code I'm using now, but it doesn't scale very well on large files. I was wondering if there is a more efficient way to do this - would working with memory-mapped files help? All ideas are welcome.
public static void mergeFastq(String forwardFile, String reverseFile, String outputFile) {
    try {
        BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile));
        BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile));
        PrintWriter outputWriter = new PrintWriter(new FileWriter(outputFile, true));
        String forwardLine = null;
        System.out.println("Begin merging Fastq files");
        int readsMerge = 0;
        while ((forwardLine = inputReaderForward.readLine()) != null) {
            //append the forward file
            outputWriter.println(forwardLine);
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            //append the reverse file
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            readsMerge++;
            if (readsMerge % 10000 == 0) {
                System.out.println("[" + now() + "] Merged 10000");
                readsMerge = 0;
            }
        }
        inputReaderForward.close();
        inputReaderReverse.close();
        outputWriter.close();
    } catch (IOException ex) {
        Logger.getLogger(Utilities.class.getName()).log(Level.SEVERE, "Error while merging FastQ files", ex);
    }
}
Maybe you also want to try using a BufferedWriter to cut down on your file I/O operations.
http://download.oracle.com/javase/6/docs/api/java/io/BufferedWriter.html
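For example, reusing the names from the question's code (a sketch; the 1 MB buffer size is just a starting point to tune):

// Wrap the FileWriter in a BufferedWriter so each println() hits an in-memory
// buffer instead of going straight to the OS.
PrintWriter outputWriter = new PrintWriter(
        new BufferedWriter(new FileWriter(outputFile, true), 1 << 20)); // ~1 MB buffer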
A simple answer is to use a bigger buffer, which helps to reduce the total number of I/O calls being made.
Usually, memory-mapped IO with FileChannel (see Java NIO) is used for handling large data file IO. In this case, however, it does not apply, as you need to inspect the file content in order to determine the boundary of every 4 lines.
If performance were the main requirement, I would code this function in C or C++ instead of Java.
But regardless of the language used, what I would do is try to manage memory myself. I would create two large buffers, say 128 MB or more each, and fill them with data from the two text files. Then you need a 3rd buffer that is twice as big as the previous two. The algorithm starts moving characters one by one from input buffer #1 to the destination buffer, counting EOLs as it goes. Once you reach the 4th line you store the current position on that buffer and repeat the same process with the 2nd input buffer. You continue alternating between the two input buffers, replenishing each buffer when you consume all the data in it. Each time you have to refill the input buffers you can also write the destination buffer and empty it.
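A rough Java sketch of the same idea: rather than managing raw char arrays by hand, it leans on large BufferedReader/BufferedWriter buffers and copies four lines at a time from each input by counting newline characters. Buffer sizes, class and method names are illustrative.

import java.io.*;

final class AlternatingMerge {
    static void merge(File forward, File reverse, File out) throws IOException {
        int bufSize = 8 * 1024 * 1024; // 8 MB per input stream; tune to taste
        try (Reader a = new BufferedReader(new FileReader(forward), bufSize);
             Reader b = new BufferedReader(new FileReader(reverse), bufSize);
             Writer w = new BufferedWriter(new FileWriter(out), 2 * bufSize)) {
            // Both files are the same length by definition, so stop when the first runs out.
            while (copyLines(a, w, 4)) {
                copyLines(b, w, 4);
            }
        }
    }

    // Copies up to n lines from in to out, counting '\n'; returns false once EOF is hit.
    private static boolean copyLines(Reader in, Writer out, int n) throws IOException {
        int seen = 0;
        int c;
        while (seen < n && (c = in.read()) != -1) {
            out.write(c);
            if (c == '\n') seen++;
        }
        return seen == n;
    }
}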
Buffer your read and write operations. The buffer needs to be large enough to minimize the read/write operations and still be memory efficient. This is really simple and it works.
void write(InputStream is, OutputStream os) throws IOException {
    byte[] buf = new byte[102400]; //optimize the size of the buffer to your needs
    int n;
    while ((n = is.read(buf)) != -1) {
        os.write(buf, 0, n);
    }
}
EDIT:
I just realized that you need to shuffle the lines, so this code will not work for you as is, but the concept remains the same.
