I have a String containing the contents of a 50 MB text file.
I built the String like this:
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, 1024*50);
byte[] b = new byte[1024*50];
buffer.get(b);
String wiki = new String(b);
I am given a String expression that can contain multiple words, and I need to return whether this expression is in my wiki String (the big String) or not.
The search works fine for about the first 1% of the String (from the beginning), but when the phrase I'm looking for is in the middle or at the end of the String, the answer I get from the following code is false:
System.out.println(wiki.contains(strToCheck));
System.out.println(wiki.indexOf(strToCheck, 0));
System.out.println(wiki.matches("(?i).*"+strToCheck+".*"));
Does anyone know why this happens?
Or what am I doing wrong?
Thank you.
I am sorry to say it, but 1024*50 is not 50 MB. It is 50 KB.
It seems that you are reading about 0.1% of your file and then searching in it.
You should try
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, 1024*1024*50);
because 50 MB = 1024*1024*50 bytes, 50 KB = 1024*50 bytes, and 1 MB = 1024 KB.
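If you don't want to hard-code the size, a minimal sketch (assuming the file fits in a single mapping and a single byte array, i.e. it is well under 2 GB) is to ask the channel for its size:
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
// Map the whole file instead of a hard-coded 50 KB.
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
byte[] b = new byte[(int) channel.size()];
buffer.get(b);
String wiki = new String(b);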
Occam's razor: strToCheck is NOT in wiki.
If you are going to be performing searches in the String, you can consider implementing the Knuth–Morris–Pratt algorithm and buffering your reads of the original String so that the entire string is not loaded into memory.
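As a rough illustration of the buffered-read idea, here is a sketch that reads the file in chunks and searches with plain indexOf on overlapping windows rather than a full KMP implementation; the chunk size and method name are illustrative:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Returns true if 'phrase' occurs anywhere in the file, without ever
// holding the whole file in memory. The 1 MB chunk size is arbitrary.
static boolean containsPhrase(String fileName, String phrase) throws IOException {
    char[] chunk = new char[1024 * 1024];
    StringBuilder window = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
        int read;
        while ((read = reader.read(chunk)) != -1) {
            window.append(chunk, 0, read);
            if (window.indexOf(phrase) >= 0) {
                return true;
            }
            // Keep only the last phrase.length() - 1 chars so a match that
            // spans two chunks is not missed.
            int drop = Math.max(0, window.length() - (phrase.length() - 1));
            window.delete(0, drop);
        }
    }
    return false;
}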
I am trying to do a performance test in Java between several serialization formats, including Avro, Protobuf, Thrift, etc.
The test is based on deserializing a byte-array message with 30 long-type fields 1,000,000 times.
The result for Avro is not good: Protobuf/Thrift take around 2000 milliseconds on average, but Avro takes 9000 milliseconds.
The documentation advises reusing the decoder, so my code is as follows:
byte[] bytes = readFromFile("market.avro");
long begin = System.nanoTime();
DatumReader<Market> userDatumReader = new ReflectDatumReader<>(Market.class);
InputStream inputStream = new SeekableByteArrayInput(bytes);
BinaryDecoder reuse = DecoderFactory.get().binaryDecoder(inputStream, null);
Market marketReuse = new Market();
for (int i = 0; i < loopCount; i++) {
    inputStream = new SeekableByteArrayInput(bytes);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(inputStream, reuse);
    userDatumReader.read(marketReuse, decoder);
}
long end = System.nanoTime() - begin;
System.out.println("avro loop " + loopCount + " times: " + (end * 1d / 1000 / 1000));
I don't think Avro should be that slow, so I suspect I am doing something wrong, but I am not sure where. Am I using 'reuse' in the wrong way?
Is there any advice for Avro performance testing? Thanks in advance.
It took me a while to figure this one out, but apparently
DecoderFactory.get().binaryDecoder is the culprit: it creates an 8 KB buffer every time it is invoked, and this buffer is not re-used but reallocated on every invocation. I don't see any reason why there is a buffer involved in the first place.
The saner alternative is to use DecoderFactory.get().directBinaryDecoder.
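Applied to the loop from the question, that looks roughly like this (a sketch; it keeps the question's Market/loopCount names and assumes directBinaryDecoder accepts the previous decoder as its reuse argument, which is worth verifying against your Avro version):
DatumReader<Market> userDatumReader = new ReflectDatumReader<>(Market.class);
Market marketReuse = new Market();
BinaryDecoder reuse = null;

long begin = System.nanoTime();
for (int i = 0; i < loopCount; i++) {
    InputStream inputStream = new SeekableByteArrayInput(bytes);
    // directBinaryDecoder reads straight from the stream and does not
    // allocate an internal read-ahead buffer on each call.
    reuse = DecoderFactory.get().directBinaryDecoder(inputStream, reuse);
    userDatumReader.read(marketReuse, reuse);
}
long elapsed = System.nanoTime() - begin;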
I am coding a little Java-based tool to process mysqldump files, which can become quite large (up to a gigabyte for now). I am using this code to read and process the file:
BufferedReader reader = getReader();
BufferedWriter writer = getWriter();
char[] charBuffer = new char[CHAR_BUFFER_SIZE];
int readCharCount;
StringBuffer buffer = new StringBuffer();
while( ( readCharCount = reader.read( charBuffer ) ) > 0 )
{
    buffer.append( charBuffer, 0, readCharCount );
    //processing goes here
}
What is a good size for the charBuffer? At the moment it is set to 1000, but my code will run with an arbitrary size, so what is best practice, or can this size be calculated depending on the file size?
Thanks in advance,
greetings Philipp
It should always be a power of 2. The optimal value depends on the OS and disk format. In code I've seen, 4096 is often used, but the bigger the better.
Also, there are better ways to load a file into memory.
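For example, one such way (a sketch, assuming Java 7+ and that the whole dump fits in memory; "dump.sql" is a placeholder file name) is to let the JDK do the sizing and buffering:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Reads the whole file in one call; the JDK picks the I/O buffer sizes.
String dump = new String(Files.readAllBytes(Paths.get("dump.sql")), StandardCharsets.UTF_8);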
I have an application that needs to read a String buffer that is semicolon (';') delimited.
String buff = foo.getBuff(); // returns the buffer
However, the buffer can get pretty large, and I just need to get the last String token and then flush the temporary String variable buff so my app won't accumulate much memory.
Update:
I tried this code:
String lastToken = buff.substring(buff.lastIndexOf(";") + 1);
However, I am not getting the result I expect with the code above, unlike with this:
List<String> slist = Arrays.asList(buff.split(";"));
String lastToken = slist.get(slist.size() - 1);
However, using a List is very slow. My web app is almost unresponsive when processing this.
This will help you:
String lastToken = buff.substring(buff.lastIndexOf(";") + 1);
I don't know Java, but after googling two functions, I think something like this should work:
String buff = foo.getBuff(); // returns the buffer
String lastToken = buff.substring(buff.lastIndexOf(";")+1);
Check this:
String lastToken = foo.getBuff().substring(foo.getBuff().lastIndexOf(';')+1);
Edit: Dan gave a better answer, and faster than me.
So I have large (around 4 GB each) txt files in pairs, and I need to create a third file consisting of the two files in shuffle mode. The following equation presents it best:
3rdfile = (4 lines from file 1) + (4 lines from file 2), repeated until I hit the end of file 1 (both input files will have the same length - this is by definition).
Here is the code I'm using now, but it doesn't scale very well on large files. I was wondering if there is a more efficient way to do this - would working with memory-mapped files help? All ideas are welcome.
public static void mergeFastq(String forwardFile, String reverseFile, String outputFile) {
    try {
        BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile));
        BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile));
        PrintWriter outputWriter = new PrintWriter(new FileWriter(outputFile, true));
        String forwardLine = null;
        System.out.println("Begin merging Fastq files");
        int readsMerge = 0;
        while ((forwardLine = inputReaderForward.readLine()) != null) {
            //append the forward file
            outputWriter.println(forwardLine);
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            //append the reverse file
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            readsMerge++;
            if (readsMerge % 10000 == 0) {
                System.out.println("[" + now() + "] Merged 10000");
                readsMerge = 0;
            }
        }
        inputReaderForward.close();
        inputReaderReverse.close();
        outputWriter.close();
    } catch (IOException ex) {
        Logger.getLogger(Utilities.class.getName()).log(Level.SEVERE, "Error while merging FastQ files", ex);
    }
}
Maybe you also want to try using a BufferedWriter to cut down on your file I/O operations.
http://download.oracle.com/javase/6/docs/api/java/io/BufferedWriter.html
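In the code from the question, that would just mean wrapping the FileWriter, e.g. (a one-line sketch):
PrintWriter outputWriter = new PrintWriter(new BufferedWriter(new FileWriter(outputFile, true)));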
A simple answer is to use a bigger buffer, which helps to reduce the total number of I/O calls being made.
Usually, memory-mapped I/O with FileChannel (see Java NIO) would be used for handling large data file I/O. In this case, however, it does not really apply, as you need to inspect the file content in order to determine the boundary of every 4 lines.
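Concretely, the bigger-buffer advice can be applied to the question's code directly, since BufferedReader and BufferedWriter both accept an explicit buffer size (a sketch; the 1 MB value is arbitrary):
BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile), 1024 * 1024);
BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile), 1024 * 1024);
PrintWriter outputWriter = new PrintWriter(new BufferedWriter(new FileWriter(outputFile, true), 1024 * 1024));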
If performance was the main requirement, then I would code this function in C or C++ instead of Java.
But regardless of the language used, what I would do is manage memory myself. I would create two large input buffers, say 128 MB or more each, and fill them with data from the two text files. Then you need a third buffer that is twice as big as the previous two. The algorithm starts moving characters one by one from input buffer #1 to the destination buffer, counting EOLs as it goes. Once you reach the 4th line, you store the current position in that buffer and repeat the same process with the 2nd input buffer. You continue alternating between the two input buffers, replenishing each one when you consume all the data in it. Each time you have to refill an input buffer, you can also write out the destination buffer and empty it.
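A rough Java sketch of that idea, using BufferedInputStream/BufferedOutputStream with large buffers in place of hand-managed ones (the sizes, the method names, and the assumption of '\n' line endings with complete 4-line records are all illustrative):
// 128 MB per input buffer; adjust to taste.
static final int BUF_SIZE = 128 * 1024 * 1024;

public static void interleaveFourLines(String file1, String file2, String outFile) throws IOException {
    try (InputStream in1 = new BufferedInputStream(new FileInputStream(file1), BUF_SIZE);
         InputStream in2 = new BufferedInputStream(new FileInputStream(file2), BUF_SIZE);
         OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile), 2 * BUF_SIZE)) {
        while (copyLines(in1, out, 4)) {   // 4 lines from file 1...
            copyLines(in2, out, 4);        // ...then 4 lines from file 2
        }
    }
}

// Copies up to 'lines' newline-terminated lines from in to out;
// returns false once the input is exhausted.
static boolean copyLines(InputStream in, OutputStream out, int lines) throws IOException {
    int seen = 0;
    int b;
    while (seen < lines && (b = in.read()) != -1) {
        out.write(b);
        if (b == '\n') {
            seen++;
        }
    }
    return seen == lines;
}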
Buffer your read and write operations. The buffer needs to be large enough to minimize the read/write operations and still be memory efficient. This is really simple and it works.
void write(InputStream is, OutputStream os) throws IOException {
    byte[] buf = new byte[102400]; //optimize the size of the buffer to your needs
    int num;
    while ((num = is.read(buf)) != -1) {
        os.write(buf, 0, num);
    }
}
EDIT:
I just realized that you need to shuffle the lines, so this code will not work for you as-is, but the concept still remains the same.
I was originally using RIM's native XML parser methods to parse a 150 KB text file (approximately 5000 lines of XML); however, it was taking about 2 minutes to complete, so I tried a line-based format:
Title: Book Title
Line 1
Line 2
Line 3
I should be able to read the file in less time than it takes to blink, but it is still slow.
The identifier books is a Vector of Book objects, and lines are stored in a Vector of Strings in each Book object.
Class classs = Class.forName("com.Gui.FileLoader");
InputStream is = classs.getResourceAsStream( fileName );
int totalFileSize = IOUtilities.streamToBytes( is ).length;
int totalRead = 0;
//Thought that maybe a shared input stream would be faster; in this case it's not.
SharedInputStream sis = SharedInputStream.getSharedInputStream( classs.getResourceAsStream( fileName ) );
LineReader lr = new LineReader( sis );
String strLine = new String( lr.readLine() );
totalRead += strLine.length();
Book book = null;
//Loop over the file until EOF is reached, catch EOF error move on with life after that.
while (1 == 1) {
    //If Line = Title: then we've got a new book add the old book to our books vector.
    if (strLine.startsWith("Title:")) {
        if (book != null) {
            books.addElement( book );
        }
        book = new Book();
        book.setTitle( strLine.substring( strLine.indexOf(':') + 1).trim() );
        strLine = new String( lr.readLine() );
        totalRead += strLine.length();
        continue;
    }
    int totalComplete = (int) ( ( (double) totalRead / (double) totalFileSize ) * 100.00);
    _observer.processStatusUpdate( totalComplete , book.getTitle() );
    book.addLine( strLine );
    strLine = new String( lr.readLine(), "ascii" );
    totalRead += strLine.length();
}
For one thing, you're reading in the file twice - once for determining the size and then again for parsing it. Since you're already reading it into a byte array for determining the size, why not pass that byte array into a ByteArrayInputStream constructor? For example:
//Used to determine file size and then show in progress bar, app is threaded.
byte[] fileBytes = IOUtilities.streamToBytes( is );
int totalFileSize = fileBytes.length;
int totalRead = 0;
ByteArrayInputStream bais = new ByteArrayInputStream( fileBytes );
LineReader lr = new LineReader( bais);
This way it won't matter if the rest of the classes reading from the stream are reading a byte at a time - it's all in-memory.
It is easy to assume that all the operations you've elided from the code sample finish in constant time. I am guessing that one of them is doing something inefficiently, such as book.addLine(strLine) or perhaps _observer.processStatusUpdate(totalComplete, book.getTitle()). If those operations are not able to complete in constant time, then you could easily have a quadratic parsing algorithm.
Just thinking about the operations is the best way to figure it out, but if you're stumped, try using the BlackBerry profiler. Run your program in the Eclipse debugger and get it to stop at a breakpoint just before parsing. Then, in Eclipse, select 'Window .. Show View .. Other .. BlackBerry .. BlackBerry Profiler View'.
Select the 'setup options' button from the profiler view toolbar (it has a blue triangle in the icon). Set 'method attribution' to cumulative, and 'what to profile' to 'time including native methods'.
Then continue your program. Once parsing is finished, pause program execution and click on the 'method' tab of the profiler view. You should be able to determine your pain point from there.
Where does the profiler say you spend your time?
If you do not have a preferred profiler, there is jvisualvm in the Java 6 JDK.
(My guess is that you will find all the time being spent on the way down to "read a character from the file". If so, you need to buffer.)
Try using new BufferedInputStream(classs.getResourceAsStream(fileName));
EDIT:
Apparently the documentation that says they have BufferedInputStream is wrong.
I am going to leave this wrong answer here just so people have that info (doc being wrong).