Unable to create a torrent's info hash - java

I'm having trouble finding the issue with how I'm generating the corresponding info hash for a torrent file. This is the code I have so far:
InputStream input = null;
try {
    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    input = new FileInputStream(file);
    StringBuilder builder = new StringBuilder();
    while (!builder.toString().endsWith("4:info")) {
        builder.append((char) input.read()); // It's ASCII anyway.
    }
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    for (int data; (data = input.read()) > -1; output.write(data));
    sha1.update(output.toByteArray(), 0, output.size() - 1);
    this.infoHash = sha1.digest();
    System.out.println(new String(Hex.encodeHex(infoHash)));
} catch (NoSuchAlgorithmException | IOException e) {
    e.printStackTrace();
} finally {
    if (input != null) try { input.close(); } catch (IOException ignore) {}
}
Below is my expected and actual hash:
Expected: d4d44272ee5f5bf887a9c85ad09ae957bc55f89d
Actual: 4d753474429d817b80ff9e0c441ca660ec5d2450
The torrent I'm trying to generate an info hash for can be found here (Ubuntu 14.04 Desktop amd64).
Let me know if I can provide any more info, thanks!

Exceptions contain 4 useful bits of info: type, message, trace, and cause. You're tossing away 3 of those 4 bits. Also, code is part of a process, and when an error occurs, generally that process cannot be finished at all; yet on an exception your process just carries on. Stop doing this; you've written code that only hurts you. Remove the try and the catch, and add a throws clause to your method signature. If you can't, the go-to default is throw new RuntimeException("Unhandled", e); (and if your IDE generated the catch-and-print block, update its template). This is shorter, does not destroy any of the 4 interesting bits of info, and actually ends the process.
Separately, the notion that the right way to handle an IOException from an InputStream's close() method is to just ignore it is also false. It is highly unlikely to throw, but if it does, you should assume you did not read every byte. Since that would be one explanation for a mismatched hash, swallowing it is misguided.
Finally, use the proper language constructs: There is a try-with-resources statement that would work far better here.
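As a minimal sketch of the shape that advice suggests (the method name and elided body are placeholders, not your actual code):
void computeInfoHash(File file) throws IOException, NoSuchAlgorithmException {
    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    try (InputStream input = new FileInputStream(file)) {
        // ... read and hash as before ...
    } // the stream is closed here whether or not an exception is thrown
}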
You're calling update with output.size() - 1; unless you want to intentionally ignore the last byte, this is a mistake; you're lopping off the last byte read.
Appending each byte to a builder and then, for every single byte, converting the entire builder to a string just to check how it ends is incredibly inefficient; for a file as small as 1MB that'll cause quite a grind.
Reading a single byte at a time from a raw FileInputStream is similarly inefficient, because every read causes file access (reading 1 byte is as expensive as reading a whole buffer full, so it's about 50000 times slower than it needs to be).
Here's how to do this with somewhat newer API, and look how much nicer this code reads. It also acts better under erroneous conditions:
byte[] data = Files.readAllBytes(Paths.get(fileName));
var search = "4:info".getBytes(StandardCharsets.US_ASCII);
int searchIdx = -1;
for (int i = 0; searchIdx == -1 && i <= data.length - search.length; i++) {
    for (int j = 0; j < search.length; j++) {
        if (data[i + j] != search[j]) break;
        if (j == search.length - 1) searchIdx = i + j + 1; // first byte after the "4:info" marker
    }
}
if (searchIdx == -1) throw new IOException("Input torrent file does not contain marker");
var sha1 = MessageDigest.getInstance("SHA-1");
sha1.update(data, searchIdx, data.length - searchIdx);
byte[] hash = sha1.digest();
StringBuilder hex = new StringBuilder();
for (byte h : hash) hex.append(String.format("%02x", h));
System.out.println(hex);

While rzwitserloot's answer covers some general Java coding practices, there are also correctness issues at the BitTorrent level.
You are using string processing for a structured data format; this is pretty much the same mistake as attempting to parse HTML with regex. In this case you're assuming that the only place the data can contain the string 4:info is the top-level dictionary key for the info dict, and that the info dictionary is the last entry of the top-level dictionary.
Instead you should use a proper bencoding decoder-encoder to extract the info dict and then re-encode it for hashing, or a tokenizer to find the exact byte range covering the info value. Note that you need a validating parser for the former, while the latter can also handle some out-of-spec edge cases. Unless you want to implement them yourself, you may want to find a library that handles this for you.
Additionally, you're assuming that the data is ASCII. bencoding is in fact a binary format that merely tends to use ASCII by convention in some places. You should operate on byte arrays directly: your input is already binary and the hasher expects binary, so it is quite circuitous to go through strings.
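To make the tokenizer approach concrete, here is a minimal, non-validating sketch (class and method names are mine, not from any library; imports omitted as in the other snippets here) that walks the top-level bencoded dictionary and returns the exact byte range of the info value, so the hash covers only that slice no matter where it sits in the file:
class InfoSlice {
    /** Returns {start, end} (end exclusive) of the bencoded value of the
     *  top-level "info" key, or null if there is no such key. */
    static int[] findInfoValue(byte[] d) {
        if (d.length == 0 || d[0] != 'd') return null; // a torrent is a dictionary
        int pos = 1;
        while (d[pos] != 'e') {
            // every dictionary key is a byte string: <len>:<bytes>
            int colon = indexOf(d, pos, (byte) ':');
            int len = Integer.parseInt(new String(d, pos, colon - pos, java.nio.charset.StandardCharsets.US_ASCII));
            int keyStart = colon + 1;
            int keyEnd = keyStart + len;
            int valueEnd = skip(d, keyEnd);
            if (len == 4 && d[keyStart] == 'i' && d[keyStart + 1] == 'n'
                    && d[keyStart + 2] == 'f' && d[keyStart + 3] == 'o') {
                return new int[] { keyEnd, valueEnd };
            }
            pos = valueEnd;
        }
        return null;
    }
    /** Returns the index just past the bencoded element starting at pos. */
    static int skip(byte[] d, int pos) {
        switch (d[pos]) {
            case 'i':                 // integer: i<digits>e
                while (d[pos] != 'e') pos++;
                return pos + 1;
            case 'l':
            case 'd':                 // list / dict: children until 'e'
                pos++;
                while (d[pos] != 'e') pos = skip(d, pos);
                return pos + 1;
            default: {                // byte string: <len>:<bytes>
                int colon = indexOf(d, pos, (byte) ':');
                int len = Integer.parseInt(new String(d, pos, colon - pos, java.nio.charset.StandardCharsets.US_ASCII));
                return colon + 1 + len;
            }
        }
    }
    static int indexOf(byte[] d, int from, byte b) {
        while (d[from] != b) from++;
        return from;
    }
}
Hashing then covers exactly that slice:
byte[] torrent = Files.readAllBytes(Paths.get("ubuntu.torrent"));
int[] range = InfoSlice.findInfoValue(torrent);
MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
sha1.update(torrent, range[0], range[1] - range[0]);
byte[] infoHash = sha1.digest();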

Related

Is reading/writing in an array more efficient than reading/writing a char/byte one by one?

try (FileReader reader = new FileReader("input.txt")) {
    int c;
    while ((c = reader.read()) != -1)
        System.out.print((char) c);
} catch (Exception ignored) { }
In this code, I read char by char. Is it more efficient in some way to read into an array of chars at once? In other words, is there any kind of optimization that happens when reading into arrays?
For example, in this code I have an array of chars called arr and I read into it until there is nothing left to read. Is that more efficient?
try (FileReader reader = new FileReader("input.txt")) {
    int size;
    char[] arr = new char[100];
    while ((size = reader.read(arr)) != -1)
        for (int i = 0; i < size; i++)
            System.out.print(arr[i]);
} catch (Exception ignored) { }
The question applies to both reading and writing, and to both chars and bytes.
Depends on the reader. The answer can be yes, though. Whatever Reader or InputStream is the actual 'raw' driver (the one that isn't just wrapping another reader or inputstream, but the one that is actually talking to the OS to get the data) - it may well implement the single-character read() method by asking the OS to read a single character.
In the end, you have a disk, and disks return data in blocks. So if you ask for 1 byte, you have 2 options as a computer:
Ask the disk for the block that contains the byte that is to be read. Store the block in memory someplace for a while. Return one byte; for the next few moments, if more requests for bytes come in from the same block, return from the stored data in memory and don't bother asking the disk at all. NOTE: This requires memory! Who allocates it? How much memory is okay? Tricky questions. OSes tend to give low level tools and don't like just picking values for any of these questions.
Ask the disk for the block that contains the byte that is to be read. Find the 1 byte needed from within this block. Ignore the rest of the data, return just that one byte. If in a few moments another byte from that block is asked for... ask the disk, again, for the whole block, and repeat this routine.
Which of the two models you get depends on many factors: For example: What kind of disk is it, what OS do you have, what underlying java reader are you using. But it is plausible you end up in this second mode and that is, as you can probably tell, usually incredibly slow, because you end up reading the same block 4000+ times instead of only once.
So, how to fix this? Well, java doesn't really know what the OS is doing either, so the safest bet is to let java do the caching. Then you have no dependencies on whatever the OS is doing.
You could write it yourself, so instead of:
for (int i = in.read(); i != -1; i = in.read()) {
    processOneChar((char) i);
}
you could do:
char[] buffer = new char[4096];
while (true) {
    int r = in.read(buffer);
    if (r == -1) break;
    for (int i = 0; i < r; i++) processOneChar(buffer[i]);
}
more code, but now the second scenario (the same block is read off the disk a ton of times) can no longer occur; you have given the OS the freedom to return to you up to 4096 chars worth of data.
Or, use a java builtin: BufferedX:
BufferedReader br = new BufferedReader(in);
for (int i = br.read(); i != -1; i = br.read()) {
    processOneChar((char) i);
}
The implementation of BufferedReader guarantees that java will take care of making some reasonably sized buffer to avoid rereads of the same block off of disk.
NB: Note that the FileReader constructor you are using should not be used. It uses platform default encoding (anytime you convert bytes to characters, encoding is involved), and platform default is a recipe for untestable bugs, which are very bad. Use new FileReader(file, StandardCharsets.UTF_8) instead, or better yet, use the new API:
Path p = Paths.get("C:/file.txt");
try (BufferedReader br = Files.newBufferedReader(p)) {
    for (int i = br.read(); i != -1; i = br.read()) {
        processOneChar((char) i);
    }
}
Note that this:
Defaults to UTF-8, because the Files API defaults to UTF-8 unlike most places in the VM.
Makes a bufferedreader immediately, no need to make it yourself.
Properly manages the resource (ensures it is closed regardless of how this code exits, be it normally or be exception), by using an ARM block.
Because a BufferedX is involved, no risk of the 'read the same block a lot' performance hole.
NB: The same logic applies when writing; disks such as SSDs can only write a whole block at a time. Now it's not just slow as molasses to write, you're also ruining your disk, as they get a limited number of writes.
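As a minimal sketch of the same idea on the write side (the file name is just an example), wrapping the raw stream means bytes reach the OS a block at a time instead of one write() call per byte:
try (OutputStream out = new BufferedOutputStream(new FileOutputStream("out.bin"))) {
    for (int i = 0; i < 1_000_000; i++) {
        out.write(i & 0xFF); // buffered: does not hit the disk for every byte
    }
} // close() flushes whatever is left in the buffer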

Can I use java.nio for Console Input?

Consider the scenario of competitive programming: I have to read 2*10^5 (or even more) numbers from the console. Then I use BufferedReader or, for even faster performance, a custom reader class that uses DataInputStream under the hood.
A quick Internet search gave me this:
We can use java.io for smaller streaming of data, and for large streaming we can use java.nio.
So I want to try java.nio console input and test it against the java.io performance.
Is it possible to read console input using java.nio ?
Can I read data from System.in using java.nio ?
Will it be faster than input methods that I currently have ?
Any relevant information will be appreciated.
Thanks ✌️
You can open a channel to stdin like this:
FileInputStream stdin = new FileInputStream(FileDescriptor.in);
FileChannel stdinChannel = stdin.getChannel();
When stdin has been redirected to a file, operations like querying the size, performing fast transfers to other channels and even memory mapping may work. But when the input is a real console or a pipe or you are reading character data, the performance is unlikely to differ significantly.
The performance depends on the way you read it, not the class you are using.
An example of code directly operating on a channel, to process white-space separated decimal numbers, is
CharsetDecoder cs = Charset.defaultCharset().newDecoder();
ByteBuffer bb = ByteBuffer.allocate(1024);
CharBuffer cb = CharBuffer.allocate(1024);
while (stdinChannel.read(bb) >= 0) {
    bb.flip();
    cs.decode(bb, cb, false);
    bb.compact();
    cb.flip();
    extractDoubles(cb);
    cb.compact();
}
bb.flip();
cs.decode(bb, cb, true);
if (cb.position() > 0) {
    cb.flip();
    extractDoubles(cb);
}
private static void extractDoubles(CharBuffer cb) {
    doubles: for (int p = cb.position(); p < cb.limit(); ) {
        while (p < cb.limit() && Character.isWhitespace(cb.get(p))) p++;
        cb.position(p);
        if (cb.hasRemaining()) {
            for (; p < cb.limit(); p++) {
                if (Character.isWhitespace(cb.get(p))) {
                    int oldLimit = cb.limit();
                    double d = Double.parseDouble(cb.limit(p).toString());
                    cb.limit(oldLimit);
                    processDouble(d);
                    continue doubles;
                }
            }
        }
    }
}
This is more complicated than using java.util.Scanner or a BufferedReader's readLine() followed by split("\\s"), but has the advantage of avoiding the complexity of the regex engine, as well as not creating String objects for the lines. When there is more than one number per line, or there are empty lines, i.e. the line strings would not match the number strings, this can save the copying overhead intrinsic to string construction.
This code is still handling arbitrary charsets. When you know the expected charset and it is ASCII based, using a lightweight transformation instead of the CharsetDecoder, like shown in this answer, can gain an additional performance increase.
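As a rough sketch of such a lightweight transformation (assuming pure ASCII input, and reusing stdinChannel and extractDoubles from above), the CharsetDecoder in the loop can be replaced by a plain cast per byte:
ByteBuffer bb = ByteBuffer.allocate(1024);
CharBuffer cb = CharBuffer.allocate(1024);
while (stdinChannel.read(bb) >= 0) {
    bb.flip();
    while (bb.hasRemaining() && cb.hasRemaining())
        cb.put((char) (bb.get() & 0x7F)); // ASCII bytes map 1:1 to chars
    bb.compact();
    cb.flip();
    extractDoubles(cb);
    cb.compact();
}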

Wav comparison, same file

I'm currently stumped. I've been looking around and experimenting with audio comparison. I've found quite a bit of material, and a ton of references to different libraries and methods to do it.
As of now I've taken Audacity and exported a 3-minute wav file called "long.wav" and then split the first 30 seconds of that into a file called "short.wav". I figured somewhere along the line I could visually log (log.txt) the data through Java for each, and should be able to see at least some visual similarities among the values.... here's some code:
Main method:
int totalFramesRead = 0;
File fileIn = new File(filePath);
BufferedWriter writer = new BufferedWriter(new FileWriter(outPath));
writer.flush();
writer.write("");
try {
    AudioInputStream audioInputStream =
        AudioSystem.getAudioInputStream(fileIn);
    int bytesPerFrame =
        audioInputStream.getFormat().getFrameSize();
    if (bytesPerFrame == AudioSystem.NOT_SPECIFIED) {
        // some audio formats may have unspecified frame size
        // in that case we may read any amount of bytes
        bytesPerFrame = 1;
    }
    // Set an arbitrary buffer size of 1024 frames.
    int numBytes = 1024 * bytesPerFrame;
    byte[] audioBytes = new byte[numBytes];
    try {
        int numBytesRead = 0;
        int numFramesRead = 0;
        // Try to read numBytes bytes from the file.
        while ((numBytesRead =
                audioInputStream.read(audioBytes)) != -1) {
            // Calculate the number of frames actually read.
            numFramesRead = numBytesRead / bytesPerFrame;
            totalFramesRead += numFramesRead;
            // Here, do something useful with the audio data that's
            // now in the audioBytes array...
            if (totalFramesRead <= 4096 * 100) {
                Complex[][] results = PerformFFT(audioBytes);
                int[][] lines = GetKeyPoints(results);
                DumpToFile(lines, writer);
            }
        }
    } catch (Exception ex) {
        // Handle the error...
    }
    audioInputStream.close();
} catch (Exception e) {
    // Handle the error...
}
writer.close();
Then PerformFFT:
public static Complex[][] PerformFFT(byte[] data) throws IOException
{
    final int totalSize = data.length;
    int amountPossible = totalSize / Harvester.CHUNK_SIZE;
    // When turning into frequency domain we'll need complex numbers:
    Complex[][] results = new Complex[amountPossible][];
    // For all the chunks:
    for (int times = 0; times < amountPossible; times++) {
        Complex[] complex = new Complex[Harvester.CHUNK_SIZE];
        for (int i = 0; i < Harvester.CHUNK_SIZE; i++) {
            // Put the time domain data into a complex number with imaginary part as 0:
            complex[i] = new Complex(data[(times * Harvester.CHUNK_SIZE) + i], 0);
        }
        // Perform FFT analysis on the chunk:
        results[times] = FFT.fft(complex);
    }
    return results;
}
At this point I've tried logging everywhere: audioBytes before transforms, Complex values, and FFT results.
The problem: No matter what values I log, the log.txt of each wav file is completely different. I'm not understanding it. Given that I took short.wav from long.wav (and they have all the same properties), there should be a very heavy similarity among either the raw wav byte[] data... or the Complex[][] fft data... or something thus far.
How can I possibly compare these files if the data isn't even close to similar at any point of these calculations?
I know I'm missing quite a bit of knowledge with regards to audio analysis, and this is why I come to the board for help! Thanks for any info, help, or fixes you can offer!!
Have you looked at MARF? It is a well-documented Java library used for audio recognition.
It is used to recognize speakers (for transcription or securing software) but the same features should be able to be used to classify audio samples. I'm not familiar with it but it looks like you'd want to use the FeatureExtraction class to extract an array of features from each audio sample and then create a unique id.
For 16-bit audio, 3e-05 isn't really that different from zero. So a file of values that small is pretty much the same as a file of zeros (maybe missing equality by some tiny rounding errors).
ADDED:
For your comparison, read in and plot, using some Java plotting library, a portion of each of the two waveforms when they get past the portion that's mostly (close to) zero.
I think for debugging you'd better try using MATLAB to plot things out, since MATLAB is much more powerful for dealing with this problem.
Use "wavread" to read the file, and "stft" to get the short-time Fourier transform, which is a matrix of complex numbers. Then simply abs(Matrix) to get the magnitude of each complex number, and show the image with imshow(abs(Matrix),[]).
I don't know how you are comparing the whole file and the 30s clip (by looking at the STFT image?).
I don't know how you are comparing both audio files, but looking at services that offer music recognition (like TrackID or MotoID), they take a small sample of the music you're hearing (10-20 secs) and then process it on their servers. I theorize that they keep samples that long or shorter and have a database of (or compute on the fly) patterns for those samples (in your case Fourier transforms). You may need to break your long audio file into chunks of the same size as, or smaller than, your sample data: in the first case you may find a specific chunk that most closely resembles the pattern in your sample data; in the second case your smaller chunks may resemble a part of your sample data and you can calculate the probability that the sample belongs to a given audio file.
I think you are looking at Acoustic Fingerprinting
It's hard, and there are libraries to do it.
If you want to implement it yourself, this is a whitepaper on the shazam algorithm.

java nio socketChannel read always return same data

On the client side, the read code:
byte[] bytes = new byte[50]; // TODO should reuse buffer, for test only
ByteBuffer dst = ByteBuffer.wrap(bytes);
int ret = 0;
int readBytes = 0;
boolean fail = false;
try {
    while ((ret = socketChannel.read(dst)) > 0) {
        readBytes += ret;
        System.out.println("read " + ret + " bytes from socket " + dst);
        if (!dst.hasRemaining()) {
            break;
        }
    }
    int pos = dst.position();
    byte[] data = new byte[pos];
    dst.flip();
    dst.get(data);
    System.out.println("read data: " + StringUtil.toHexString(data));
} catch (Exception e) {
    fail = true;
    handler.onException(e);
}
The problem is that socketChannel.read() always returns a positive value. I checked the returned buffer and the data is duplicated N times; it looks like the low-level socket buffer's position is not moving forward. Any ideas?
If the server only returned 48 bytes, your code must have blocked in the read() method trying to get the 49th and 50th bytes. So either your '50' is wrong or you will have to restructure your code to read and process whatever you get as you get it rather than trying to fill buffers first. And this can't possibly be the code where you think you always got the same data. The explanation for that would be failure to compact the buffer after the get, if you reuse the same buffer for the next read, which you should do, but your posted code doesn't do.
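As a minimal sketch of the read/flip/process/compact cycle described above when one buffer is reused across reads (process() is a hypothetical handler):
ByteBuffer buf = ByteBuffer.allocate(50);
while (socketChannel.read(buf) > 0) {
    buf.flip();                       // switch to draining mode
    byte[] chunk = new byte[buf.remaining()];
    buf.get(chunk);                   // consume what has arrived so far
    process(chunk);                   // hypothetical handler
    buf.compact();                    // keep any unread bytes, ready for the next read
}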
1 : This might not be a bug !
[assuming that there is readable data in the buffer]...
You would expect a -1 at the end of the stream... See http://docs.oracle.com/javase/1.4.2/docs/api/java/nio/channels/SocketChannel.html#read%28java.nio.ByteBuffer%29
If you are continually recieving a positive value from the read() call, then you will need to determine why data is being read continually.
Of course, the mystery herein ultimately lies in the source data (i.e. the SocketChannel which you are read data from).
2: Explanation of your possible problems
If your socket channel is coming from a REAL file, which is finite, then your file is really big and the read() operation will return -1... eventually...
If, on the other hand, your socket channel is listening to a source of data which you EXPECT to be finite (i.e. a serialized object stream, for example), I would double check the source --- maybe your finite stream is simply producing more and more data... and you are correctly consuming it.
3: Finally some advice
A trick for debugging this type of error is playing with the ByteBuffer input to your read method: the nice thing about java.nio's ByteBuffers is that, since they are more object-oriented than the older byte[] writers, you can get very fine-grained debugging of their operations.
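For example, a rough debugging sketch (names as in the question) that logs the buffer's cursor state around each read makes it obvious whether data is actually being consumed between reads:
ByteBuffer dst = ByteBuffer.allocate(50);
int ret;
while ((ret = socketChannel.read(dst)) > 0) {
    System.out.printf("read=%d position=%d limit=%d remaining=%d%n",
            ret, dst.position(), dst.limit(), dst.remaining());
    // if position never advances between iterations, the buffer is being
    // reset (or re-wrapped) somewhere before the next read
}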

Efficient way to search a stream for a string

Let's suppose that I have a stream of text (or a Reader in Java) that I'd like to check for a particular string. The stream of text might be very large, so as soon as the search string is found I'd like to return true, and I also want to avoid storing the entire input in memory.
Naively, I might try to do something like this (in Java):
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
    char[] buffer = new char[1024];
    int numCharsRead;
    while ((numCharsRead = reader.read(buffer)) > 0) {
        if ((new String(buffer, 0, numCharsRead)).indexOf(searchString) >= 0)
            return true;
    }
    return false;
}
Of course this fails to detect the given search string if it occurs on the boundary of the 1k buffer:
Search text: "stackoverflow"
Stream buffer 1: "abc.........stack"
Stream buffer 2: "overflow.......xyz"
How can I modify this code so that it correctly finds the given search string across the boundary of the buffer but without loading the entire stream into memory?
Edit: Note when searching a stream for a string, we're trying to minimise the number of reads from the stream (to avoid latency in a network/disk) and to keep memory usage constant regardless of the amount of data in the stream. Actual efficiency of the string matching algorithm is secondary but obviously, it would be nice to find a solution that used one of the more efficient of those algorithms.
There are three good solutions here:
If you want something that is easy and reasonably fast, go with no buffer, and instead implement a simple nondeterministic finite-state machine. Your state will be a list of indices into the string you are searching for, and your logic looks something like this (pseudocode):
String needle;
n = needle.length();
for every input character c do
    add index 0 to the list
    for every index i in the list do
        if c == needle[i] then
            if i + 1 == n then
                return true
            else
                replace i in the list with i + 1
            end
        else
            remove i from the list
        end
    end
end
This will find the string if it exists and you will never need a buffer.
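A minimal Java sketch of that list-of-indices approach (my own translation, reading from a Reader one buffered chunk at a time) might look like:
public boolean streamContainsString(Reader reader, String needle) throws IOException {
    int n = needle.length();
    if (n == 0) return true;
    List<Integer> live = new ArrayList<>();   // one entry per in-progress match attempt
    char[] buf = new char[1024];
    int read;
    while ((read = reader.read(buf)) > 0) {
        for (int k = 0; k < read; k++) {
            char c = buf[k];
            live.add(0);                      // every position may start a new match
            List<Integer> next = new ArrayList<>();
            for (int i : live) {
                if (c == needle.charAt(i)) {
                    if (i + 1 == n) return true; // matched the whole needle
                    next.add(i + 1);
                }                                 // otherwise this attempt is dropped
            }
            live = next;
        }
    }
    return false;
}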
Slightly more work but also faster: do an NFA-to-DFA conversion that figures out in advance what lists of indices are possible, and assign each one to a small integer. (If you read about string search on Wikipedia, this is called the powerset construction.) Then you have a single state and you make a state-to-state transition on each incoming character. The NFA you want is just the DFA for the string preceded with a state that nondeterministically either drops a character or tries to consume the current character. You'll want an explicit error state as well.
If you want something faster, create a buffer whose size is at least twice n, and use Boyer-Moore to compile a state machine from the needle. You'll have a lot of extra hassle because Boyer-Moore is not trivial to implement (although you'll find code online) and because you'll have to arrange to slide the string through the buffer. You'll have to build or find a circular buffer that can 'slide' without copying; otherwise you're likely to give back any performance gains you might get from Boyer-Moore.
I made a few changes to the Knuth-Morris-Pratt algorithm for partial searches. Since the actual comparison position is always less than or equal to the next one, there is no need for extra memory. The code, with a Makefile, is also available on GitHub, and it is written in Haxe to target multiple programming languages at once, including Java.
I also wrote a related article: searching for substrings in streams: a slight modification of the Knuth-Morris-Pratt algorithm in Haxe. The article mentions Jakarta RegExp, now retired and resting in the Apache Attic. The Jakarta Regexp library's "match" method in the RE class takes a CharacterIterator as a parameter.
class StreamOrientedKnuthMorrisPratt {
    var m: Int;
    var i: Int;
    var ss: String;
    var table: Array<Int>;

    public function new(ss: String) {
        this.ss = ss;
        this.buildTable(this.ss);
    }

    public function begin() : Void {
        this.m = 0;
        this.i = 0;
    }

    public function partialSearch(s: String) : Int {
        var offset = this.m + this.i;
        while(this.m + this.i - offset < s.length) {
            if(this.ss.substr(this.i, 1) == s.substr(this.m + this.i - offset, 1)) {
                if(this.i == this.ss.length - 1) {
                    return this.m;
                }
                this.i += 1;
            } else {
                this.m += this.i - this.table[this.i];
                if(this.table[this.i] > -1)
                    this.i = this.table[this.i];
                else
                    this.i = 0;
            }
        }
        return -1;
    }

    private function buildTable(ss: String) : Void {
        var pos = 2;
        var cnd = 0;
        this.table = new Array<Int>();
        if(ss.length > 2)
            this.table.insert(ss.length, 0);
        else
            this.table.insert(2, 0);
        this.table[0] = -1;
        this.table[1] = 0;
        while(pos < ss.length) {
            if(ss.substr(pos - 1, 1) == ss.substr(cnd, 1)) {
                cnd += 1;
                this.table[pos] = cnd;
                pos += 1;
            } else if(cnd > 0) {
                cnd = this.table[cnd];
            } else {
                this.table[pos] = 0;
                pos += 1;
            }
        }
    }

    public static function main() {
        var KMP = new StreamOrientedKnuthMorrisPratt("aa");
        KMP.begin();
        trace(KMP.partialSearch("ccaabb"));

        KMP.begin();
        trace(KMP.partialSearch("ccarbb"));
        trace(KMP.partialSearch("fgaabb"));
    }
}
The Knuth-Morris-Pratt search algorithm never backs up; this is just the property you want for your stream search. I've used it before for this problem, though there may be easier ways using available Java libraries. (When this came up for me I was working in C in the 90s.)
KMP in essence is a fast way to build a string-matching DFA, like Norman Ramsey's suggestion #2.
This answer applied to the initial version of the question where the key was to read the stream only as far as necessary to match on a String, if that String was present. This solution would not meet the requirement to guarantee fixed memory utilisation, but may be worth considering if you have found this question and are not bound by that constraint.
If you are bound by the constant memory usage constraint, Java stores arrays of any type on the heap, and as such nulling the reference does not deallocate memory in any way; I think any solution involving arrays in a loop will consume memory on the heap and require GC.
For a simple implementation, maybe Java 5's Scanner, which can accept an InputStream and use a java.util.regex.Pattern to search the input, might save you worrying about the implementation details.
Here's an example of a potential implementation:
public boolean streamContainsString(Reader reader, String searchString)
        throws IOException {
    Scanner streamScanner = new Scanner(reader);
    // findWithinHorizon treats its argument as a regex, so quote the literal search string
    if (streamScanner.findWithinHorizon(Pattern.quote(searchString), 0) != null) {
        return true;
    } else {
        return false;
    }
}
I'm thinking regex because it sounds like a job for a Finite State Automaton, something that starts in an initial state, changing state character by character until it either rejects the string (no match) or gets to an accept state.
I think this is probably the most efficient matching logic you could use, and how you organize the reading of the information can be divorced from the matching logic for performance tuning.
It's also how regexes work.
Instead of having your buffer be an array, use an abstraction that implements a circular buffer. Your index calculation will be buf[(next + i) % sizeof(buf)], and you'll have to be careful to fill the buffer one half at a time. But as long as the search string fits in half the buffer, you'll find it.
I believe the best solution to this problem is to try to keep it simple. Remember, because I'm reading from a stream, I want to keep the number of reads from the stream to a minimum (as network or disk latency may be an issue) while keeping the amount of memory used constant (as the stream may be very large in size). Actual efficiency of the string matching is not the number one goal (as that has been studied to death already).
Based on AlbertoPL's suggestion, here's a simple solution that compares the buffer against the search string character by character. The key is that because the search is only done one character at a time, no back tracking is needed and therefore no circular buffers, or buffers of a particular size are needed.
Now, if someone can come up with a similar implementation based on Knuth-Morris-Pratt search algorithm then we'd have a nice efficient solution ;)
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
    char[] buffer = new char[1024];
    int numCharsRead;
    int count = 0;
    while ((numCharsRead = reader.read(buffer)) > 0) {
        for (int c = 0; c < numCharsRead; c++) {
            if (buffer[c] == searchString.charAt(count))
                count++;
            else
                count = 0;
            if (count == searchString.length()) return true;
        }
    }
    return false;
}
If you're not tied to using a Reader, then you can use Java's NIO API to efficiently load the file. For example (untested, but should be close to working):
public boolean streamContainsString(File input, String searchString) throws IOException {
    Pattern pattern = Pattern.compile(Pattern.quote(searchString));
    FileInputStream fis = new FileInputStream(input);
    FileChannel fc = fis.getChannel();
    int sz = (int) fc.size();
    MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, sz);
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
    CharBuffer cb = decoder.decode(bb);
    Matcher matcher = pattern.matcher(cb);
    return matcher.find(); // find() looks for the string anywhere; matches() would require the whole file to match
}
This basically mmap()'s the file to search and relies on the operating system to do the right thing regarding cache and memory usage. Note however that map() is more expensive than just reading the file into a large buffer for files less than around 10 KiB.
A very fast stream search is implemented in the RingBuffer class from the Ujorm framework. See the sample:
Reader reader = RingBuffer.createReader("xxx ${abc} ${def} zzz");
String word1 = RingBuffer.findWord(reader, "${", "}");
assertEquals("abc", word1);
String word2 = RingBuffer.findWord(reader, "${", "}");
assertEquals("def", word2);
String word3 = RingBuffer.findWord(reader, "${", "}");
assertEquals("", word3);
The single-class implementation is available on SourceForge:
For more information see the link.
Implement a sliding window. Have your buffer around, move all elements in the buffer one forward and enter a single new character in the buffer at the end. If the buffer is equal to your searched word, it is contained.
Of course, if you want to make this more efficient, you can look at a way to prevent moving all elements in the buffer around, for example by having a cyclic buffer and a representation of the strings which 'cycles' the same way the buffer does, so you only need to check for content-equality. This saves moving all elements in the buffer.
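A rough sketch of that cyclic-buffer variant in Java (assuming a buffered Reader so the per-character reads stay cheap) could look like:
static boolean slidingWindowContains(Reader reader, String needle) throws IOException {
    int n = needle.length();
    if (n == 0) return true;
    char[] window = new char[n];       // cyclic buffer holding the last n characters
    int count = 0;                     // total characters read so far
    int c;
    while ((c = reader.read()) != -1) {
        window[count % n] = (char) c;  // overwrite the oldest slot instead of shifting
        count++;
        if (count >= n) {
            boolean match = true;
            for (int i = 0; i < n && match; i++) {
                // the oldest character currently sits at index count % n
                match = needle.charAt(i) == window[(count + i) % n];
            }
            if (match) return true;
        }
    }
    return false;
}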
I think you need to buffer a small amount at the boundary between buffers.
For example if your buffer size is 1024 and the length of the SearchString is 10, then as well as searching each 1024-byte buffer you also need to search each 18-byte transition between two buffers (9 bytes from the end of the previous buffer concatenated with 9 bytes from the start of the next buffer).
I'd say switch to a character by character solution, in which case you'd scan for the first character in your target text, then when you find that character increment a counter and look for the next character. Every time you don't find the next consecutive character restart the counter. It would work like this:
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
    char[] buffer = new char[1024];
    int numCharsRead;
    int count = 0;
    while ((numCharsRead = reader.read(buffer)) > 0) {
        if (buffer[numCharsRead - 1] == searchString.charAt(count))
            count++;
        else
            count = 0;
        if (count == searchString.length())
            return true;
    }
    return false;
}
The only problem is when you're in the middle of looking through characters... in which case there needs to be a way of remembering your count variable. I don't see an easy way of doing so except as a private variable for the whole class. In which case you would not instantiate count inside this method.
You might be able to implement a very fast solution using Fast Fourier Transforms, which, if implemented properly, allow you to do string matching in time O(n log(m)), where n is the length of the longer string to be matched and m is the length of the shorter string. You could, for example, perform an FFT as soon as you receive a stream input of length m; if it matches, you can return, and if it doesn't match, you can throw away the first character in the stream input, wait for a new character to appear through the stream, and then perform the FFT again.
You can increase the speed of search for very large strings by using some string search algorithm
If you're looking for a constant substring rather than a regex, I'd recommend Boyer-Moore. There's plenty of source code on the internet.
Also, use a circular buffer, to avoid thinking too hard about buffer boundaries.
Mike.
I also had a similar problem: skipping bytes from an InputStream until a specified string (or byte array) is found. This is simple code based on a circular buffer. It is not very efficient, but it works for my needs:
private static boolean matches(int[] buffer, int offset, byte[] search) {
    final int len = buffer.length;
    for (int i = 0; i < len; ++i) {
        if (search[i] != buffer[(offset + i) % len]) {
            return false;
        }
    }
    return true;
}

public static void skipBytes(InputStream stream, byte[] search) throws IOException {
    final int[] buffer = new int[search.length];
    for (int i = 0; i < search.length; ++i) {
        buffer[i] = stream.read();
    }
    int offset = 0;
    while (true) {
        if (matches(buffer, offset, search)) {
            break;
        }
        buffer[offset] = stream.read();
        offset = (offset + 1) % buffer.length;
    }
}
Here is my implementation:
static boolean containsKeywordInStream(Reader ir, String keyword, int bufferSize) throws IOException {
    SlidingContainsBuffer sb = new SlidingContainsBuffer(keyword);
    char[] buffer = new char[bufferSize];
    int read;
    while ((read = ir.read(buffer)) != -1) {
        if (sb.checkIfContains(buffer, read)) {
            return true;
        }
    }
    return false;
}
SlidingContainsBuffer class:
class SlidingContainsBuffer {
    private final char[] keyword;
    private int keywordIndexToCheck = 0;
    private boolean keywordFound = false;

    SlidingContainsBuffer(String keyword) {
        this.keyword = keyword.toCharArray();
    }

    boolean checkIfContains(char[] buffer, int read) {
        for (int i = 0; i < read; i++) {
            if (keywordFound == false) {
                if (keyword[keywordIndexToCheck] == buffer[i]) {
                    keywordIndexToCheck++;
                    if (keywordIndexToCheck == keyword.length) {
                        keywordFound = true;
                    }
                } else {
                    keywordIndexToCheck = 0;
                }
            } else {
                break;
            }
        }
        return keywordFound;
    }
}
This answer fully satisfies the task:
The implementation is able to find the searched keyword even if it is split between buffers
Memory usage is bounded by the buffer size
The number of reads is minimized by using a bigger buffer
