JVM String methods implementation - java

String class has some methods that i cannot understand why they were implemented like this... replace is one of them.
public String replace(CharSequence target, CharSequence replacement) {
return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
this).replaceAll(Matcher.quoteReplacement(replacement.toString()));
}
Are there some significant advantages over a simpler and more efficient (fast!) method?
public static String replace(String string, String searchFor, String replaceWith) {
StringBuilder result=new StringBuilder();
int index=0;
int beginIndex=0;
while((index=string.indexOf(searchFor, index))!=-1){
result.append(string.substring(beginIndex, index)+replaceWith);
index+=searchFor.length();
beginIndex=index;
}
result.append(string.substring(beginIndex, string.length()));
return result.toString();
}
Stats with Java 7:
1,000,000 iterations
replace "b" with "x" in "a.b.c"
result: "a.x.c"
Times:
string.replace: 485ms
string.replaceAll: 490ms
optimized replace = 180ms
Code like the Java 7 split method is heavily optimized to avoid pattern compile / regex processing when possible:
public String[] split(String regex, int limit) {
/* fastpath if the regex is a
(1)one-char String and this character is not one of the
RegEx's meta characters ".$|()[{^?*+\\", or
(2)two-char String and the first char is the backslash and
the second is not the ascii digit or ascii letter.
*/
char ch = 0;
if (((regex.value.length == 1 &&
".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
(regex.length() == 2 &&
regex.charAt(0) == '\\' &&
(((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
((ch-'a')|('z'-ch)) < 0 &&
((ch-'A')|('Z'-ch)) < 0)) &&
(ch < Character.MIN_HIGH_SURROGATE ||
ch > Character.MAX_LOW_SURROGATE))
{
int off = 0;
int next = 0;
boolean limited = limit > 0;
ArrayList<String> list = new ArrayList<>();
while ((next = indexOf(ch, off)) != -1) {
if (!limited || list.size() < limit - 1) {
list.add(substring(off, next));
off = next + 1;
} else { // last one
//assert (list.size() == limit - 1);
list.add(substring(off, value.length));
off = value.length;
break;
}
}
// If no match was found, return this
if (off == 0)
return new String[]{this};
// Add remaining segment
if (!limited || list.size() < limit)
list.add(substring(off, value.length));
// Construct result
int resultSize = list.size();
if (limit == 0)
while (resultSize > 0 && list.get(resultSize - 1).length() == 0)
resultSize--;
String[] result = new String[resultSize];
return list.subList(0, resultSize).toArray(result);
}
return Pattern.compile(regex).split(this, limit);
}
Following the logic of the replace method:
public String replaceAll(String regex, String replacement) {
return Pattern.compile(regex).matcher(this).replaceAll(replacement);
}
The split implementation should be:
public String[] split(String regex, int limit) {
return Pattern.compile(regex).split(this, limit);
}
The performance losses are not far from the ones found on the replace methods. For some reason Oracle gives a fastpath approach on some methods and not others.

Are you sure your proposed method is indeed faster than the regex-based one used by the String class - not just for your own test input, but for every possible input that a program might throw at it? It relies on String.indexOf to do substring matching, which is itself a naive implementation that is subject to bad worst-case performance. It's entirely possible that Pattern implements a more sophisticated matching algorithm such as KMP to avoid redundant comparisons.
In general, the Java team takes performance of the core libraries very seriously, and maintains lots of internal benchmarks using a wide range of real-world data. I've never encountered a situation where regex processing was a bottleneck. My standing advice is to start by writing the simplest possible code that works correctly, and don't even begin to think about rewriting the Java built-ins until profiling proves that it's a bottleneck, and you've exhausted all other avenues of optimization.
Regarding your latest edit - first, I would not describe the split method as heavily optimized. It handles one special case that happens to be extremely common and is guaranteed not to suffer from the poor worst-case complexity described above for the naive string matching algorithm - that of splitting on a single-character, literal token.
It may very well be that the same special case could be optimized for replace, and would provide some measurable improvement. But look what it took to achieve that simple optimization - about 50 lines of code. Those lines of code come at a cost, especially when they're a part of what's probably the most widely-used class in the Java library. Cost comes in many forms:
Resources - That's 50 lines of code that some developer must spend time writing, testing, documenting, and maintaining for the lifetime of the Java language.
Risk - That's 50 opportunities for subtle bugs that slip past the initial testing.
Complexity - That's 50 extra lines of code that any developer who wants to understand how the method works must now take time to read and understand.
Your question now boils down to "why was this one method optimized to handle a special case, but not the other one?" or even more generally "why was this particular feature not implemented?" Nobody but the original author can answer that definitively, but the answer is almost always that either there is not sufficient demand for that feature, or that the benefit derived from having the feature is deemed not worth the cost of adding it.

Related

How does jvm optimize loop code?

There is a method that can search substring from a text(use brute force algorithm, please ignore null pointer)
public static int forceSearch(String text, String pattern) {
int patternLength = pattern.length();
int textLength = text.length();
for (int i = 0, n = textLength - patternLength; i <= n; i++) {
int j = 0;
for (; j < patternLength && text.charAt(i + j) == pattern.charAt(j); j++) {
;
}
if (j == patternLength) {
return i;
}
}
return -1;
}
Strangely! Use the same algorithm, but the following code is more faster!!!
public static int forceSearch(String text, String pattern) {
int patternLength = pattern.length();
int textLength = text.length();
char first = pattern.charAt(0);
for (int i = 0, n = textLength - patternLength; i <= n; i++) {
if (text.charAt(i) != first) {
while (++i <= n && text.charAt(i) != first)
;
}
int j = 0;
for (; j < patternLength && text.charAt(i + j) == pattern.charAt(j); j++) {
;
}
if (j == patternLength) {
return i;
}
}
return -1;
}
I found the second code is obviously faster than first if I run it by jvm. Howere, when I write it in c and run, the two functions take almost the same time. So I think the reason is that jvm optimize loop code
if (text.charAt(i) != first) {
while (++i <= max && text.charAt(i) != first)
;
}
Am I right? If I'm right, how should we use the jvm optimization strategy to
optimize our code?
Hope somebody help, thankyou:)
If you really want to get to the bottom of this, you'll probably need to instruct the JVM to print the assembly. In my experience, minor tweaks to loops can cause surprising performance differences. But it's not necessarily due to optimizations of the loop itself.
There are plenty of factors that can affect how your code gets JIT compiled.
For example, tweaking the size of a method can affect your inlining tree, which could mean better or worse performance depending on what your call stack looks like. If a method gets inlined further up the call stack, it could prevent nested call sites from being inlined into the same frame. If those nested call sites are especially 'hot', the added call overhead could be substantial. I'm not saying that's the cause here; I'm merely pointing out that there are many thresholds that govern how the JIT arranges your code, and the reasons for performance differences are not always obvious.
One nice thing about using JMH for benchmarks is that you can reduce the influence of such changes by explicitly setting inlining boundaries. But you can use -XX:CompileCommand to achieve the same effects manually.
There are, of course, other factors like cache friendliness that require more intuitive analysis. Given that your benchmark probably doesn't have a particularly deep call tree, I'm inclined to lean towards cache behavior as a more likely explanation. I would guess that your second version performs better because your comparand (the first chunk of pattern) is usually in your L1 cache, while your first version causes more cache churn. If your inputs are long (and it sounds like they are), then this is a likely explanation. If not, the reasons could be more subtle, e.g., your first version could be 'tricking' the CPU into employing more aggressive cache prefetching, but in a way that actually hurts performance (at least for the inputs you are benchmarking). Regardless, if cache behavior is to explain, then I wonder why you do not see a similar difference in the C versions. What optimization flags are you compiling the C version with?
Dead code elimination might also be a factor. I would have to see what your inputs are, but it's possible that your hand-optimized version causes certain instruction blocks to never be hit during the instrumented interpretation phase, leading the JIT to exclude them from the final assembly.
I reiterate: if you want to get to the bottom of this, you'll want to force the JIT to dump the assembly for each version (and compare to the C versions as well).
This if statement simplify a lot of work (especially when the pattern is found at the end of the input string.
if (text.charAt(i) != first) {
while (++i <= n && text.charAt(i) != first)
;
}
In the first version, you have to check j < patternLength for every i before comparing the first character.
In the second version you don't need to.
But strangely I think for small input it does not make much different.
Could you share the length of items you used to benchmark?
If you search for JVM compiler optimization on the internet, the
"loop unwinding" or "loop unrolling"
should jump out. Again benchmarking is tricky. You will find plenty of SO answer for the same.

Common Substring of two strings

This particular interview-question stumped me:
Given two Strings S1 and S2. Find the longest Substring which is a Prefix of S1 and suffix of S2.
Through Google, I came across the following solution, but didnt quite understand what it was doing.
public String findLongestSubstring(String s1, String s2) {
List<Integer> occurs = new ArrayList<>();
for (int i = 0; i < s1.length(); i++) {
if (s1.charAt(i) == s2.charAt(s2.length()-1)) {
occurs.add(i);
}
}
Collections.reverse(occurs);
for(int index : occurs) {
boolean equals = true;
for(int i = index; i >= 0; i--) {
if (s1.charAt(index-i) != s2.charAt(s2.length() - i - 1)) {
equals = false;
break;
}
}
if(equals) {
return s1.substring(0,index+1);
}
}
return null;
}
My questions:
How does this solution work?
And how do you get to discovering this solution?
Is there a more intuitive / easier solution?
Part 2 of your question
Here is a shorter variant:
public String findLongestPrefixSuffix(String s1, String s2) {
for( int i = Math.min(s1.length(), s2.length()); ; i--) {
if(s2.endsWith(s1.substring(0, i))) {
return s1.substring(0, i);
}
}
}
I am using Math.min to find the length of the shortest String, as I don't need to and cannot compare more than that.
someString.substring(x,y) returns you the String you get when reading someString beginning from character x and stopping at character y. I go backwards from the biggest possible substring (s1 or s2) to the smallest possible substring, the empty string. This way the first time my condition is true it will be biggest possible substring the fulfills it.
If you prefer you can go the other way round, but you have to introduce a variable saving the length of the longest found substring fulfilling the condition so far:
public static String findLongestPrefixSuffix(String s1, String s2) {
if (s1.equals(s2)) { // this part is optional and will
return s1; // speed things up if s1 is equal to s2
} //
int max = 0;
for (int i = 0; i < Math.min(s1.length(), s2.length()); i++) {
if (s2.endsWith(s1.substring(0, i))) {
max = i;
}
}
return s1.substring(0, max);
}
For the record: You could start with i = 1 in the latter example for a tiny bit of extra performance. On top of this you can use i to specify how long the suffix has at least to be you want to get. ;) If you writ Math.min(s1.length(), s2.length()) - x you can use x to specify how long the found substring may be at most. Both of these things are possible with the first solution, too, but the min length is a bit more involving. ;)
Part 1 of your question
In the part above the Collections.reverse the author of the code searches for all positions in s1 where the last letter of s2 is and saves this position.
What follows is essentially what my algorithm does, the difference is, that he doesn't check every substring but only those that end with the last letter of s2.
This is some sort of optimization to speed things up. If speed is not that important my naive implementation should suffice. ;)
Where did you find that solution? Was it written by a credible, well-respected coder? If you're not sure of that, then it might not be worth reading it. One could write really complex and inefficient code to accomplish something really simple, and it will not be worth understanding the algorithm.
Rather than trying to understand somebody else's solution, it might be easier to come up with it on your own. I think you understand the problem much better that way, and the logic becomes your own. Over time and practice the thought process will start to come more naturally. Practice makes perfect.
Anyway, I put a more simple implementation in Python here (spoiler alert!). I suggest you first figure out the solution on your own, and compare it to mine later.
Apache commons lang3, StringUtils.getCommonPrefix()
Java is really bad in providing useful stuff via stdlib. On the plus side there's almost always some reasonable tool from Apache.
I converted the #TheMorph's answer to javascript. Hope this helps js developer
if (typeof String.prototype.endsWith !== 'function') {
String.prototype.endsWith = function(suffix) {
return this.indexOf(suffix, this.length - suffix.length) !== -1;
};
}
function findLongestPrefixSuffix(s2, s1) {
for( var i = Math.min(s1.length, s2.length); ; i--) {
if(s2.endsWith(s1.substring(0, i))) {
return s1.substring(0, i);
}
}
}
console.log(findLongestPrefixSuffix('abc', 'bcd')); // result: 'bc'

Recursive method to determine if a string is a hex number - Java

This is a homework question that I am having a bit of trouble with.
Write a recursive method that determines if a String is a hex number.
Write javadocs for your method.
A String is a hex number if each and every character is either
0 or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9
or a or A or b or B or c or C or d or D or e or E or f or f.
At the moment all I can see to test this is if the character at 0 of the string is one of these values he gave me then that part of it is a hex.
Any hints or suggestions to help me out?
This is what I have so far: `
public boolean isHex(String string){
if (string.charAt(0)==//what goes here?){
//call isHex again on s.substring(1)
}else return false;
}
`
If you're looking for a good hex digit method:
boolean isHexDigit(char c) {
return Character.isDigit(c) || (Character.toUpperCase(c) >= 'A' && Character.toUpperCase(c) <= 'F');
}
Hints or suggestions, as requested:
All recursive methods call themselves with a different input (well, hopefully a different input!)
All recursive methods have a stop condition.
Your method signature should look something like this
boolean isHexString(String s) {
// base case here - an if condition
// logic and recursion - a return value
}
Also, don't forget that hex strings can start with "0x". This might be (more) difficult to do, so I would get the regular function working first. If you tackle it later, try to keep in mind that 0xABCD0x123 shouldn't pass. :-)
About substring: Straight from the String class source:
public String substring(int beginIndex, int endIndex) {
if (beginIndex < 0) {
throw new StringIndexOutOfBoundsException(beginIndex);
}
if (endIndex > count) {
throw new StringIndexOutOfBoundsException(endIndex);
}
if (beginIndex > endIndex) {
throw new StringIndexOutOfBoundsException(endIndex - beginIndex);
}
return ((beginIndex == 0) && (endIndex == count)) ? this :
new String(offset + beginIndex, endIndex - beginIndex, value);
}
offset is a member variable of type int
value is a member variable of type char[]
and the constructor it calls is
String(int offset, int count, char value[]) {
this.value = value;
this.offset = offset;
this.count = count;
}
It's clearly an O(1) method, calling an O(1) constructor. It can do this because String is immutable. You can't change the value of a String, only create a new one. (Let's leave out things like reflection and sun.misc.unsafe since they are extraneous solutions!) Since it can't be changed, you also don't have to worry about some other Thread messing with it, so it's perfectly fine to pass around like the village bicycle.
Since this is homework, I only give some hints instead of code:
Write a method that always tests the first character of a String if it fulfills the requirements. If not, return false, if yes, call the method again with the same String, but the first character missing. If it is only 1 character left and it is also a hex character then return true.
Pseudocode:
public boolean isHex(String testString) {
If String has 0 characters -> return true;
Else
If first character is a hex character -> call isHex with the remaining characters
Else if the first character is not a hex character -> return false;
}
When solving problems recursively, you generally want to solve a small part (the 'base case'), and then recurse on the rest.
You've figured out the base case - checking if a single character is hex or not.
Now you need to 'recurse on the rest'.
Here's some pseudocode (Python-ish) for reversing a string - hopefully you will see how similar methods can be applied to your problem (indeed, all recursive problems)
def ReverseString(str):
# base case (simple)
if len(str) <= 1:
return str
# recurse on the rest...
return last_char(str) + ReverseString(all_but_last_char(str))
Sounds like you should recursively iterate the characters in string and return the boolean AND of whether or not the current character is in [0-9A-Fa-f] with the recursive call...
You have already received lots of useful answers. In case you want to train your recursive skills (and Java skills in general) a bit more I can recommend you to visit Coding Bat. You will find a lot of exercises together with automated tests.

Suggestion on equalsIgnoreCase in Java to C implementation

All,
In a bid to improve my C skills, I decided to start implementing various Java libraries/library functions to C code. This would ensure that everyone knows the functionality of my implementation at least. Here is the link to the C source code that simulates the equalsIgnoreCase() of String class in Java : C source code. I have tested the code and it looks fine as per my testing skills are concerned. My aim was to use as much basic operations and datatypes as possible. Though, it would be great if the gurus here can:
1 > Give me any suggestion to improve the code quality
2 > Enlighten me with any missing coding standard/practices
3 > Locate bugs in my logic.
100 lines of code is not too long to post here.
You calculate the string length twice. In C, the procedure to calculate the string length starts at the beginning of the string and runs along all of it (not necessarily in steps of 1 byte) until it finds the terminating null byte. If your strings are 2Mbyte long, you "walk" along 4Mbyte unnecessarily.
in <ctype.h> there are the two functions tolower() and toupper() declared. You can use one of them (tolower) instead of extractFirstCharacterASCIIVal(). The advantage of using the library function is that it is not locked in to ASCII and may even work with foreign characters when you go 'international'.
You use awkward (very long) names for your variables (and functions too). eg: ch1 and ch2 do very well for characters in file 1 and file 2 respectively :-)
return 1; at the end of main usually means something went wrong with the program. return 0; is idiomatic for successful termination.
Edit: for comparison with tcrosley version
#include <ctype.h>
int cmpnocase(const char *s1, const char *s2) {
while (*s1 && *s2) {
if (tolower((unsigned char)*s1) != tolower((unsigned char)*s2)) break;
s1++;
s2++;
}
return (*s1 != *s2);
}
With C++, you could replace performComparison(char* string1, char * string2) with stricmp.
However stricmp is not part of the standard C library. Here is a version adapted to your example. Note you don't need the extractFirstCharacterASCIIVal function, use tolower instead. Also note there is no need to explicitly calculate the string length ahead of time, as strings in C are terminated by the NULL character '\0'.
int performComparison(char* string1, char * string2)
{
char c1, c2;
int v;
do {
c1 = *string1++;
c2 = *string2++;
v = (UINT) tolower(c1) - (UINT) tolower(c2);
} while ((v == 0) && (c1 != '\0') && (c2 != '\0') );
return v != 0;
}
If you do want to use your own extractFirstCharacterASCIIVal function instead of the tolower macro, to make the code more transparent then you should code it like so:
if ((str >= 'a') && (str <= 'z'))
{
returnVal = str - ('a' - 'A');
}
else
{
returnVal = str;
}
to make it more obvious what you are doing. Also you should include a comment that this assumes the characters a..z and A..Z are contiguous. (They are in ASCII, but not always in other encodings.)

Efficient way to search a stream for a string

Let's suppose that have a stream of text (or Reader in Java) that I'd like to check for a particular string. The stream of text might be very large so as soon as the search string is found I'd like to return true and also try to avoid storing the entire input in memory.
Naively, I might try to do something like this (in Java):
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
char[] buffer = new char[1024];
int numCharsRead;
while((numCharsRead = reader.read(buffer)) > 0) {
if ((new String(buffer, 0, numCharsRead)).indexOf(searchString) >= 0)
return true;
}
return false;
}
Of course this fails to detect the given search string if it occurs on the boundary of the 1k buffer:
Search text: "stackoverflow"
Stream buffer 1: "abc.........stack"
Stream buffer 2: "overflow.......xyz"
How can I modify this code so that it correctly finds the given search string across the boundary of the buffer but without loading the entire stream into memory?
Edit: Note when searching a stream for a string, we're trying to minimise the number of reads from the stream (to avoid latency in a network/disk) and to keep memory usage constant regardless of the amount of data in the stream. Actual efficiency of the string matching algorithm is secondary but obviously, it would be nice to find a solution that used one of the more efficient of those algorithms.
There are three good solutions here:
If you want something that is easy and reasonably fast, go with no buffer, and instead implement a simple nondeterminstic finite-state machine. Your state will be a list of indices into the string you are searching, and your logic looks something like this (pseudocode):
String needle;
n = needle.length();
for every input character c do
add index 0 to the list
for every index i in the list do
if c == needle[i] then
if i + 1 == n then
return true
else
replace i in the list with i + 1
end
else
remove i from the list
end
end
end
This will find the string if it exists and you will never need a
buffer.
Slightly more work but also faster: do an NFA-to-DFA conversion that figures out in advance what lists of indices are possible, and assign each one to a small integer. (If you read about string search on Wikipedia, this is called the powerset construction.) Then you have a single state and you make a state-to-state transition on each incoming character. The NFA you want is just the DFA for the string preceded with a state that nondeterministically either drops a character or tries to consume the current character. You'll want an explicit error state as well.
If you want something faster, create a buffer whose size is at least twice n, and user Boyer-Moore to compile a state machine from needle. You'll have a lot of extra hassle because Boyer-Moore is not trivial to implement (although you'll find code online) and because you'll have to arrange to slide the string through the buffer. You'll have to build or find a circular buffer that can 'slide' without copying; otherwise you're likely to give back any performance gains you might get from Boyer-Moore.
I did a few changes to the Knuth Morris Pratt algorithm for partial searches. Since the actual comparison position is always less or equal than the next one there is no need for extra memory. The code with a Makefile is also available on github and it is written in Haxe to target multiple programming languages at once, including Java.
I also wrote a related article: searching for substrings in streams: a slight modification of the Knuth-Morris-Pratt algorithm in Haxe. The article mentions the Jakarta RegExp, now retired and resting in the Apache Attic. The Jakarta Regexp library “match” method in the RE class uses a CharacterIterator as a parameter.
class StreamOrientedKnuthMorrisPratt {
var m: Int;
var i: Int;
var ss:
var table: Array<Int>;
public function new(ss: String) {
this.ss = ss;
this.buildTable(this.ss);
}
public function begin() : Void {
this.m = 0;
this.i = 0;
}
public function partialSearch(s: String) : Int {
var offset = this.m + this.i;
while(this.m + this.i - offset < s.length) {
if(this.ss.substr(this.i, 1) == s.substr(this.m + this.i - offset,1)) {
if(this.i == this.ss.length - 1) {
return this.m;
}
this.i += 1;
} else {
this.m += this.i - this.table[this.i];
if(this.table[this.i] > -1)
this.i = this.table[this.i];
else
this.i = 0;
}
}
return -1;
}
private function buildTable(ss: String) : Void {
var pos = 2;
var cnd = 0;
this.table = new Array<Int>();
if(ss.length > 2)
this.table.insert(ss.length, 0);
else
this.table.insert(2, 0);
this.table[0] = -1;
this.table[1] = 0;
while(pos < ss.length) {
if(ss.substr(pos-1,1) == ss.substr(cnd, 1))
{
cnd += 1;
this.table[pos] = cnd;
pos += 1;
} else if(cnd > 0) {
cnd = this.table[cnd];
} else {
this.table[pos] = 0;
pos += 1;
}
}
}
public static function main() {
var KMP = new StreamOrientedKnuthMorrisPratt("aa");
KMP.begin();
trace(KMP.partialSearch("ccaabb"));
KMP.begin();
trace(KMP.partialSearch("ccarbb"));
trace(KMP.partialSearch("fgaabb"));
}
}
The Knuth-Morris-Pratt search algorithm never backs up; this is just the property you want for your stream search. I've used it before for this problem, though there may be easier ways using available Java libraries. (When this came up for me I was working in C in the 90s.)
KMP in essence is a fast way to build a string-matching DFA, like Norman Ramsey's suggestion #2.
This answer applied to the initial version of the question where the key was to read the stream only as far as necessary to match on a String, if that String was present. This solution would not meet the requirement to guarantee fixed memory utilisation, but may be worth considering if you have found this question and are not bound by that constraint.
If you are bound by the constant memory usage constraint, Java stores arrays of any type on the heap, and as such nulling the reference does not deallocate memory in any way; I think any solution involving arrays in a loop will consume memory on the heap and require GC.
For simple implementation, maybe Java 5's Scanner which can accept an InputStream and use a java.util.regex.Pattern to search the input for might save you worrying about the implementation details.
Here's an example of a potential implementation:
public boolean streamContainsString(Reader reader, String searchString)
throws IOException {
Scanner streamScanner = new Scanner(reader);
if (streamScanner.findWithinHorizon(searchString, 0) != null) {
return true;
} else {
return false;
}
}
I'm thinking regex because it sounds like a job for a Finite State Automaton, something that starts in an initial state, changing state character by character until it either rejects the string (no match) or gets to an accept state.
I think this is probably the most efficient matching logic you could use, and how you organize the reading of the information can be divorced from the matching logic for performance tuning.
It's also how regexes work.
Instead of having your buffer be an array, use an abstraction that implements a circular buffer. Your index calculation will be buf[(next+i) % sizeof(buf)], and you'll have to be careful to full the buffer one-half at a time. But as long as the search string fits in half the buffer, you'll find it.
I believe the best solution to this problem is to try to keep it simple. Remember, beacause I'm reading from a stream, I want to keep the number of reads from the stream to a minimum (as network or disk latency may be an issue) while keeping the amount of memory used constant (as the stream may be very large in size). Actual efficiency of the string matching is not the number one goal (as that has been studied to death already).
Based on AlbertoPL's suggestion, here's a simple solution that compares the buffer against the search string character by character. The key is that because the search is only done one character at a time, no back tracking is needed and therefore no circular buffers, or buffers of a particular size are needed.
Now, if someone can come up with a similar implementation based on Knuth-Morris-Pratt search algorithm then we'd have a nice efficient solution ;)
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
char[] buffer = new char[1024];
int numCharsRead;
int count = 0;
while((numCharsRead = reader.read(buffer)) > 0) {
for (int c = 0; c < numCharsRead; c++) {
if (buffer[c] == searchString.charAt(count))
count++;
else
count = 0;
if (count == searchString.length()) return true;
}
}
return false;
}
If you're not tied to using a Reader, then you can use Java's NIO API to efficiently load the file. For example (untested, but should be close to working):
public boolean streamContainsString(File input, String searchString) throws IOException {
Pattern pattern = Pattern.compile(Pattern.quote(searchString));
FileInputStream fis = new FileInputStream(input);
FileChannel fc = fis.getChannel();
int sz = (int) fc.size();
MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, sz);
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
CharBuffer cb = decoder.decode(bb);
Matcher matcher = pattern.matcher(cb);
return matcher.matches();
}
This basically mmap()'s the file to search and relies on the operating system to do the right thing regarding cache and memory usage. Note however that map() is more expensive the just reading the file in to a large buffer for files less than around 10 KiB.
A very fast searching of a stream is implemented in the RingBuffer class from the Ujorm framework. See the sample:
Reader reader = RingBuffer.createReader("xxx ${abc} ${def} zzz");
String word1 = RingBuffer.findWord(reader, "${", "}");
assertEquals("abc", word1);
String word2 = RingBuffer.findWord(reader, "${", "}");
assertEquals("def", word2);
String word3 = RingBuffer.findWord(reader, "${", "}");
assertEquals("", word3);
The single class implementation is available on the SourceForge:
For more information see the link.
Implement a sliding window. Have your buffer around, move all elements in the buffer one forward and enter a single new character in the buffer at the end. If the buffer is equal to your searched word, it is contained.
Of course, if you want to make this more efficient, you can look at a way to prevent moving all elements in the buffer around, for example by having a cyclic buffer and a representation of the strings which 'cycles' the same way the buffer does, so you only need to check for content-equality. This saves moving all elements in the buffer.
I think you need to buffer a small amount at the boundary between buffers.
For example if your buffer size is 1024 and the length of the SearchString is 10, then as well as searching each 1024-byte buffer you also need to search each 18-byte transition between two buffers (9 bytes from the end of the previous buffer concatenated with 9 bytes from the start of the next buffer).
I'd say switch to a character by character solution, in which case you'd scan for the first character in your target text, then when you find that character increment a counter and look for the next character. Every time you don't find the next consecutive character restart the counter. It would work like this:
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
char[] buffer = new char[1024];
int numCharsRead;
int count = 0;
while((numCharsRead = reader.read(buffer)) > 0) {
if (buffer[numCharsRead -1] == searchString.charAt(count))
count++;
else
count = 0;
if (count == searchString.size())
return true;
}
return false;
}
The only problem is when you're in the middle of looking through characters... in which case there needs to be a way of remembering your count variable. I don't see an easy way of doing so except as a private variable for the whole class. In which case you would not instantiate count inside this method.
You might be able to implement a very fast solution using Fast Fourier Transforms, which, if implemented properly, allow you to do string matching in times O(nlog(m)), where n is the length of the longer string to be matched, and m is the length of the shorter string. You could, for example, perform FFT as soon as you receive an stream input of length m, and if it matches, you can return, and if it doesn't match, you can throw away the first character in the stream input, wait for a new character to appear through the stream, and then perform FFT again.
You can increase the speed of search for very large strings by using some string search algorithm
If you're looking for a constant substring rather than a regex, I'd recommend Boyer-Moore. There's plenty of source code on the internet.
Also, use a circular buffer, to avoid think too hard about buffer boundaries.
Mike.
I also had a similar problem: skip bytes from the InputStream until specified string (or byte array). This is the simple code based on circular buffer. It is not very efficient but works for my needs:
private static boolean matches(int[] buffer, int offset, byte[] search) {
final int len = buffer.length;
for (int i = 0; i < len; ++i) {
if (search[i] != buffer[(offset + i) % len]) {
return false;
}
}
return true;
}
public static void skipBytes(InputStream stream, byte[] search) throws IOException {
final int[] buffer = new int[search.length];
for (int i = 0; i < search.length; ++i) {
buffer[i] = stream.read();
}
int offset = 0;
while (true) {
if (matches(buffer, offset, search)) {
break;
}
buffer[offset] = stream.read();
offset = (offset + 1) % buffer.length;
}
}
Here is my implementation:
static boolean containsKeywordInStream( Reader ir, String keyword, int bufferSize ) throws IOException{
SlidingContainsBuffer sb = new SlidingContainsBuffer( keyword );
char[] buffer = new char[ bufferSize ];
int read;
while( ( read = ir.read( buffer ) ) != -1 ){
if( sb.checkIfContains( buffer, read ) ){
return true;
}
}
return false;
}
SlidingContainsBuffer class:
class SlidingContainsBuffer{
private final char[] keyword;
private int keywordIndexToCheck = 0;
private boolean keywordFound = false;
SlidingContainsBuffer( String keyword ){
this.keyword = keyword.toCharArray();
}
boolean checkIfContains( char[] buffer, int read ){
for( int i = 0; i < read; i++ ){
if( keywordFound == false ){
if( keyword[ keywordIndexToCheck ] == buffer[ i ] ){
keywordIndexToCheck++;
if( keywordIndexToCheck == keyword.length ){
keywordFound = true;
}
} else {
keywordIndexToCheck = 0;
}
} else {
break;
}
}
return keywordFound;
}
}
This answer fully qualifies the task:
The implementation is able to find the searched keyword even if it was split between buffers
Minimum memory usage defined by the buffer size
Number of reads will be minimized by using bigger buffer

Categories

Resources