Highly effect find a byte[] is contained by another byte[] [duplicate]

Highly effect find a byte[] is contained by another byte[] [duplicate] - java

Given a byte array, how can I find within it, the position of a (smaller) byte array?
This documentation looked promising, using ArrayUtils, but if I'm correct it would only let me find an individual byte within the array to be searched.
(I can't see it mattering, but just in case: sometimes the search byte array will be regular ASCII characters, other times it will be control characters or extended ASCII characters. So using String operations would not always be appropriate)
The large array could be between 10 and about 10000 bytes, and the smaller array around 10. In some cases I will have several smaller arrays that I want found within the larger array in a single search. And I will at times want to find the last index of an instance rather than the first.

The simpelst way would be to compare each element:
public int indexOf(byte[] outerArray, byte[] smallerArray) {
for(int i = 0; i < outerArray.length - smallerArray.length+1; ++i) {
boolean found = true;
for(int j = 0; j < smallerArray.length; ++j) {
if (outerArray[i+j] != smallerArray[j]) {
found = false;
break;
}
}
if (found) return i;
}
return -1;
}
Some tests:
#Test
public void testIndexOf() {
byte[] outer = {1, 2, 3, 4};
assertEquals(0, indexOf(outer, new byte[]{1, 2}));
assertEquals(1, indexOf(outer, new byte[]{2, 3}));
assertEquals(2, indexOf(outer, new byte[]{3, 4}));
assertEquals(-1, indexOf(outer, new byte[]{4, 4}));
assertEquals(-1, indexOf(outer, new byte[]{4, 5}));
assertEquals(-1, indexOf(outer, new byte[]{4, 5, 6, 7, 8}));
}
As you updated your question: Java Strings are UTF-16 Strings, they do not care about the extended ASCII set, so you could use string.indexOf()

Google's Guava provides a Bytes.indexOf(byte[] array, byte[] target).

Using the Knuth–Morris–Pratt algorithm is the most efficient way.
StreamSearcher.java is an implementation of it and is part of Twitter's elephant-bird project.
It is not recommended to include this library since it is rather sizable for using just a single class.
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
/**
* An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
* For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
*/
public class StreamSearcher
{
private byte[] pattern_;
private int[] borders_;
// An upper bound on pattern length for searching. Results are undefined for longer patterns.
#SuppressWarnings("unused")
public static final int MAX_PATTERN_LENGTH = 1024;
StreamSearcher(byte[] pattern)
{
setPattern(pattern);
}
/**
* Sets a new pattern for this StreamSearcher to use.
*
* #param pattern the pattern the StreamSearcher will look for in future calls to search(...)
*/
public void setPattern(byte[] pattern)
{
pattern_ = Arrays.copyOf(pattern, pattern.length);
borders_ = new int[pattern_.length + 1];
preProcess();
}
/**
* Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
* that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
* byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
* another reasonable default, i.e. leave the stream unchanged.
*
* #return bytes consumed if found, -1 otherwise.
*/
long search(InputStream stream) throws IOException
{
long bytesRead = 0;
int b;
int j = 0;
while ((b = stream.read()) != -1)
{
bytesRead++;
while (j >= 0 && (byte) b != pattern_[j])
{
j = borders_[j];
}
// Move to the next character in the pattern.
++j;
// If we've matched up to the full pattern length, we found it. Return,
// which will automatically save our position in the InputStream at the point immediately
// following the pattern match.
if (j == pattern_.length)
{
return bytesRead;
}
}
// No dice, Note that the stream is now completely consumed.
return -1;
}
/**
* Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
* and aids in implementation of the Knuth-Moore-Pratt string search.
* <p>
* For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
*/
private void preProcess()
{
int i = 0;
int j = -1;
borders_[i] = j;
while (i < pattern_.length)
{
while (j >= 0 && pattern_[i] != pattern_[j])
{
j = borders_[j];
}
borders_[++i] = ++j;
}
}
}

Is this what you are looking for?
public class KPM {
/**
* Search the data byte array for the first occurrence of the byte array pattern within given boundaries.
* #param data
* #param start First index in data
* #param stop Last index in data so that stop-start = length
* #param pattern What is being searched. '*' can be used as wildcard for "ANY character"
* #return
*/
public static int indexOf( byte[] data, int start, int stop, byte[] pattern) {
if( data == null || pattern == null) return -1;
int[] failure = computeFailure(pattern);
int j = 0;
for( int i = start; i < stop; i++) {
while (j > 0 && ( pattern[j] != '*' && pattern[j] != data[i])) {
j = failure[j - 1];
}
if (pattern[j] == '*' || pattern[j] == data[i]) {
j++;
}
if (j == pattern.length) {
return i - pattern.length + 1;
}
}
return -1;
}
/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private static int[] computeFailure(byte[] pattern) {
int[] failure = new int[pattern.length];
int j = 0;
for (int i = 1; i < pattern.length; i++) {
while (j>0 && pattern[j] != pattern[i]) {
j = failure[j - 1];
}
if (pattern[j] == pattern[i]) {
j++;
}
failure[i] = j;
}
return failure;
}
}

To save your time in testing:
http://helpdesk.objects.com.au/java/search-a-byte-array-for-a-byte-sequence
gives you code that works if you make computeFailure() static:
public class KPM {
/**
* Search the data byte array for the first occurrence
* of the byte array pattern.
*/
public static int indexOf(byte[] data, byte[] pattern) {
int[] failure = computeFailure(pattern);
int j = 0;
for (int i = 0; i < data.length; i++) {
while (j > 0 && pattern[j] != data[i]) {
j = failure[j - 1];
}
if (pattern[j] == data[i]) {
j++;
}
if (j == pattern.length) {
return i - pattern.length + 1;
}
}
return -1;
}
/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private static int[] computeFailure(byte[] pattern) {
int[] failure = new int[pattern.length];
int j = 0;
for (int i = 1; i < pattern.length; i++) {
while (j>0 && pattern[j] != pattern[i]) {
j = failure[j - 1];
}
if (pattern[j] == pattern[i]) {
j++;
}
failure[i] = j;
}
return failure;
}
}
Since it is always wise to test the code that you borrow, you may start with:
public class Test {
public static void main(String[] args) {
do_test1();
}
static void do_test1() {
String[] ss = { "",
"\r\n\r\n",
"\n\n",
"\r\n\r\nthis is a test",
"this is a test\r\n\r\n",
"this is a test\r\n\r\nthis si a test",
"this is a test\r\n\r\nthis si a test\r\n\r\n",
"this is a test\n\r\nthis si a test",
"this is a test\r\nthis si a test\r\n\r\n",
"this is a test"
};
for (String s: ss) {
System.out.println(""+KPM.indexOf(s.getBytes(), "\r\n\r\n".getBytes())+"in ["+s+"]");
}
}
}

Copied almost identical from java.lang.String.
indexOf(char[],int,int,char[]int,int,int)
static int indexOf(byte[] source, int sourceOffset, int sourceCount, byte[] target, int targetOffset, int targetCount, int fromIndex) {
if (fromIndex >= sourceCount) {
return (targetCount == 0 ? sourceCount : -1);
}
if (fromIndex < 0) {
fromIndex = 0;
}
if (targetCount == 0) {
return fromIndex;
}
byte first = target[targetOffset];
int max = sourceOffset + (sourceCount - targetCount);
for (int i = sourceOffset + fromIndex; i <= max; i++) {
/* Look for first character. */
if (source[i] != first) {
while (++i <= max && source[i] != first)
;
}
/* Found first character, now look at the rest of v2 */
if (i <= max) {
int j = i + 1;
int end = j + targetCount - 1;
for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
;
if (j == end) {
/* Found whole string. */
return i - sourceOffset;
}
}
}
return -1;
}

package org.example;
import java.util.List;
import org.riversun.finbin.BinarySearcher;
public class Sample2 {
public static void main(String[] args) throws Exception {
BinarySearcher bs = new BinarySearcher();
// UTF-8 without BOM
byte[] srcBytes = "Hello world.It's a small world.".getBytes("utf-8");
byte[] searchBytes = "world".getBytes("utf-8");
List<Integer> indexList = bs.searchBytes(srcBytes, searchBytes);
System.out.println("indexList=" + indexList);
}
}
so it results in
indexList=[6, 25]
So,u can find the index of byte[] in byte[]
Example here on Github at: https://github.com/riversun/finbin

Several (or all?) of the examples posted here failed some Unit tests so I am posting my version along with the aforementioned tests over here. All of the Unit tests are BASED upon the requirement that Java's String.indexOf() always gives us the right answer!
// The Knuth, Morris, and Pratt string searching algorithm remembers information about
// the past matched characters instead of matching a character with a different pattern
// character over and over again. It can search for a pattern in O(n) time as it never
// re-compares a text symbol that has matched a pattern symbol. But, it does use a partial
// match table to analyze the pattern structure. Construction of a partial match table
// takes O(m) time. Therefore, the overall time complexity of the KMP algorithm is O(m + n).
public class KMPSearch {
public static int indexOf(byte[] haystack, byte[] needle)
{
// needle is null or empty
if (needle == null || needle.length == 0)
return 0;
// haystack is null, or haystack's length is less than that of needle
if (haystack == null || needle.length > haystack.length)
return -1;
// pre construct failure array for needle pattern
int[] failure = new int[needle.length];
int n = needle.length;
failure[0] = -1;
for (int j = 1; j < n; j++)
{
int i = failure[j - 1];
while ((needle[j] != needle[i + 1]) && i >= 0)
i = failure[i];
if (needle[j] == needle[i + 1])
failure[j] = i + 1;
else
failure[j] = -1;
}
// find match
int i = 0, j = 0;
int haystackLen = haystack.length;
int needleLen = needle.length;
while (i < haystackLen && j < needleLen)
{
if (haystack[i] == needle[j])
{
i++;
j++;
}
else if (j == 0)
i++;
else
j = failure[j - 1] + 1;
}
return ((j == needleLen) ? (i - needleLen) : -1);
}
}
import java.util.Random;
class KMPSearchTest {
private static Random random = new Random();
private static String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
#Test
public void testEmpty() {
test("", "");
test("", "ab");
}
#Test
public void testOneChar() {
test("a", "a");
test("a", "b");
}
#Test
public void testRepeat() {
test("aaa", "aaaaa");
test("aaa", "abaaba");
test("abab", "abacababc");
test("abab", "babacaba");
}
#Test
public void testPartialRepeat() {
test("aaacaaaaac", "aaacacaacaaacaaaacaaaaac");
test("ababcababdabababcababdaba", "ababcababdabababcababdaba");
}
#Test
public void testRandomly() {
for (int i = 0; i < 1000; i++) {
String pattern = randomPattern();
for (int j = 0; j < 100; j++)
test(pattern, randomText(pattern));
}
}
/* Helper functions */
private static String randomPattern() {
StringBuilder sb = new StringBuilder();
int steps = random.nextInt(10) + 1;
for (int i = 0; i < steps; i++) {
if (sb.length() == 0 || random.nextBoolean()) { // Add literal
int len = random.nextInt(5) + 1;
for (int j = 0; j < len; j++)
sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
} else { // Repeat prefix
int len = random.nextInt(sb.length()) + 1;
int reps = random.nextInt(3) + 1;
if (sb.length() + len * reps > 1000)
break;
for (int j = 0; j < reps; j++)
sb.append(sb.substring(0, len));
}
}
return sb.toString();
}
private static String randomText(String pattern) {
StringBuilder sb = new StringBuilder();
int steps = random.nextInt(100);
for (int i = 0; i < steps && sb.length() < 10000; i++) {
if (random.nextDouble() < 0.7) { // Add prefix of pattern
int len = random.nextInt(pattern.length()) + 1;
sb.append(pattern.substring(0, len));
} else { // Add literal
int len = random.nextInt(30) + 1;
for (int j = 0; j < len; j++)
sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
}
}
return sb.toString();
}
private static void test(String pattern, String text) {
try {
assertEquals(text.indexOf(pattern), KMPSearch.indexOf(text.getBytes(), pattern.getBytes()));
} catch (AssertionError e) {
System.out.println("FAILED -> Unable to find '" + pattern + "' in '" + text + "'");
}
}
}

Java strings are composed of 16-bit chars, not of 8-bit bytes. A char can hold a byte, so you can always make your byte arrays into strings, and use indexOf: ASCII characters, control characters, and even zero characters will work fine.
Here is a demo:
byte[] big = new byte[] {1,2,3,0,4,5,6,7,0,8,9,0,0,1,2,3,4};
byte[] small = new byte[] {7,0,8,9,0,0,1};
String bigStr = new String(big, StandardCharsets.UTF_8);
String smallStr = new String(small, StandardCharsets.UTF_8);
System.out.println(bigStr.indexOf(smallStr));
This prints 7.
However, considering that your large array could be up to 10,000 bytes, and the small array is only ten bytes, this solution may not be the most efficient, for two reasons:
It requires copying your big array into an array that is twice as large (same capacity, but with char instead of byte). This triples your memory requirements.
String search algorithm of Java is not the fastest one available. You may get sufficiently faster if you implement one of the advanced algorithms, for example, the Knuth–Morris–Pratt one. This could potentially bring the execution speed down by a factor of up to ten (the length of the small string), and will require additional memory that is proportional to the length of the small string, not the big string.

For a little HTTP server I am currently working on, I came up with the following code to find boundaries in a multipart/form-data request. Hoped to find a better solution here, but i guess I'll stick with it. I think it is as efficent as it can get (quite fast and uses not much ram). It uses the input bytes as ring buffer, reads the next byte as soon as it does not match the boundary and writes the data after the first full cycle into the output stream. Of course can it be changed for byte arrays instead of streams, as asked in the question.
private boolean multipartUploadParseOutput(InputStream is, OutputStream os, String boundary)
{
try
{
String n = "--"+boundary;
byte[] bc = n.getBytes("UTF-8");
int s = bc.length;
byte[] b = new byte[s];
int p = 0;
long l = 0;
int c;
boolean r;
while ((c = is.read()) != -1)
{
b[p] = (byte) c;
l += 1;
p = (int) (l % s);
if (l>p)
{
r = true;
for (int i = 0; i < s; i++)
{
if (b[(p + i) % s] != bc[i])
{
r = false;
break;
}
}
if (r)
break;
os.write(b[p]);
}
}
os.flush();
return true;
} catch(IOException e) {e.printStackTrace();}
return false;
}

Related

How to count all the matching sub-string using regex?

I am trying to count all instances of a substring from .PBAAP.B with P A B in that sequence and can have 1-3 symbols in between them (inclusive).
The output should be 2
.P.A...B
.P..A..B
What I've tried so far is
return (int) Pattern
.compile("P.{0,2}A.{0,2}B")
.matcher(C)
.results()
.count();
But I only get output 1. My guess is that in both cases, the group is PBAAP.B. So instead of 2, I get 1.
I could write an elaborate function to achieve what I am trying to do, but I was wondering if there was a way to do it with regex.
int count = 0;
for (int i = 0; i < C.length(); i++) {
String p = Character.toString(C.charAt(i));
if (p.equalsIgnoreCase("P")) {
for (int j = X; j <= Y; j++) {
if (i + j < C.length()) {
String a = Character.toString(C.charAt(i + j));
if (a.equals("A")) {
for (int k = X; k <= Y; k++) {
if (i + j + k < C.length()) {
String b = Character.toString(C.charAt(i + j + k));
if (b.equalsIgnoreCase("B")) {
count++;
}
}
}
}
}
return count;

To my knowledge, you will only get a boolean response out of a regex match - either it is a match or it isn't. Thus, I can't think of a solution solving your problem using regex.
count() is used, if you want to check if there are multiple matches at different indices - which is not what you want. For instance, the following snippet will return 2 as th is found at 11-12 and at 20-21:
Pattern
.compile("(th)")
.matcher("let's test this together")
.results()
.count();
In order to keep the solution without regex readable and extensible, you may want to use regression.
Class to keep track of the latest match of each letter:
public class Letter {
private char letter;
private int latestMatch;
// constructor
// getters + setters
}
Class to detect different matches:
public class MatchFinder {
final static String C = "......PAA.BBBadsfjksPeAkBB";
final static Letter[] LETTERS = new Letter[3];
final static int SYMBOLS_THRESHOLD = 4;
public static void main(String[] args) {
LETTERS[0] = new Letter('P', -1);
LETTERS[1] = new Letter('A', -1);
LETTERS[2] = new Letter('B', -1);
int count = countMatches(0, C.length(), 0, 0, -1);
System.out.println(count);
}
public static int countMatches(int start, int end, int letterIndex, int currentCount, int latestMatch) {
for (int i = start; i < end; i++) {
if (i < C.length()) {
Character c = Character.toLowerCase(C.charAt(i));
if (c.equals(LETTERS[letterIndex].getLowercaseLetter()) && i > latestMatch) {
LETTERS[letterIndex].setLatestMatch(i);
if (letterIndex + 1 < LETTERS.length) {
int childStart = LETTERS[letterIndex].getLatestMatch() + 1;
return countMatches(childStart, childStart + SYMBOLS_THRESHOLD, letterIndex + 1, currentCount, -1);
}
currentCount++;
}
}
}
if (letterIndex > 0) {
int parentLetterIndex = letterIndex - 1;
int latestParentMatch = LETTERS[parentLetterIndex].getLatestMatch();
if (letterIndex > 1) {
int parentStart = LETTERS[letterIndex - 2].getLatestMatch() + 1;
return countMatches(parentStart, parentStart + SYMBOLS_THRESHOLD, parentLetterIndex, currentCount, latestParentMatch);
} else {
return countMatches(LETTERS[parentLetterIndex].getLatestMatch() + 1, C.length(), 0, currentCount, latestParentMatch);
}
}
return currentCount;
}
}

is there any way to find all the occurrence of a pattern in a text instead of finding the first occurrence in Boyer Moore algorithm

I'm working on Boyer Moore string matching algorithm in java so it could work on Arabic text sufficiently, I have the code that finds the first occurrence of a pattern instead of all occurrence.
I tried to modify to it but it doesn't work, I got arrayIndexoutofbounds exception.
the exception I got is :
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException:
-1 at BM.findPattern(BM.java:27) at BM.main(BM.java:110)
the purpose is to modify the algorithm so it can handle the vocalized Arabic Text.
here is a sample of input and out put:
Enter Text
anananananana
Enter Pattern
ana
found at postion 0
then I got the exception above
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
/** Class BoyerMoore **/
public class BM
{
/** function findPattern **/
public void findPattern(String t, String p)
{
char[] text = t.toCharArray();
char[] pattern = p.toCharArray();
int pos = indexOf(text, pattern);
if (pos == -1)
System.out.println("\nNo Match\n");
else
System.out.println("Pattern found at position : "+ pos);
}
/** Function to calculate index of pattern substring **/
public int indexOf(char[] text, char[] pattern)
{
if (pattern.length == 0)
return 0;
int charTable[] = makeCharTable(pattern);
int offsetTable[] = makeOffsetTable(pattern);
for (int i = pattern.length - 1, j; i < text.length;)
{
for (j = pattern.length - 1; pattern[j] == text[i]; --i, --j)
if (j == 0)
return i;
// i += pattern.length - j; // For naive method
i += Math.max(offsetTable[pattern.length - 1 - j],
charTable[text[i]]);
}
return -1;
}
/** Makes the jump table based on the mismatched character
information **/
private int[] makeCharTable(char[] pattern)
{
final int ALPHABET_SIZE = 256;
int[] table = new int[ALPHABET_SIZE];
for (int i = 0; i < table.length; ++i)
table[i] = pattern.length;
for (int i = 0; i < pattern.length - 1; ++i)
table[pattern[i]] = pattern.length - 1 - i;
return table;
}
/** Makes the jump table based on the scan offset which mismatch
occurs. **/
private static int[] makeOffsetTable(char[] pattern)
{
int[] table = new int[pattern.length];
int lastPrefixPosition = pattern.length;
for (int i = pattern.length - 1; i >= 0; --i)
{
if (isPrefix(pattern, i + 1))
lastPrefixPosition = i + 1;
table[pattern.length - 1 - i] = lastPrefixPosition - i +
pattern.length - 1;
}
for (int i = 0; i < pattern.length - 1; ++i)
{
int slen = suffixLength(pattern, i);
table[slen] = pattern.length - 1 - i + slen;
}
return table;
}
/** function to check if needle[p:end] a prefix of pattern **/
private static boolean isPrefix(char[] pattern, int p)
{
for (int i = p, j = 0; i < pattern.length; ++i, ++j)
if (pattern[i] != pattern[j])
return false;
return true;
}
/** function to returns the maximum length of the substring ends at p and is a suffix **/
private static int suffixLength(char[] pattern, int p)
{
int len = 0;
for (int i = p, j = pattern.length - 1; i >= 0 && pattern[i] == pattern[j]; --i, --j)
len += 1;
return len;
}
/** Main Function **/
public static void main(String[] args) throws IOException
{
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
System.out.println("Boyer Moore Algorithm Test\n");
System.out.println("\nEnter Text\n");
String text = br.readLine();
System.out.println("\nEnter Pattern\n");
String pattern = br.readLine();
BM bm = new BM();
bm.findPattern(text, pattern);
}
}

How could I improve the speed/performance for this problem, Java

I saw this challenge on https://www.topcoder.com/ for Beginners. And I really wanted to complete it. I've got so close after so many failures. But I got stuck and don't know what to do no more. Here is what I mean
Question:
Read the input one line at a time and output the current line if and only if you have already read at least 1000 lines greater than the current line and at least 1000 lines less than the current line. (Again, greater than and less than are with respect to the ordering defined by String.compareTo().)
Link to the Challenge
My Solution:
public static void doIt(BufferedReader r, PrintWriter w) throws IOException {
SortedSet<String> linesThatHaveBeenRead = new TreeSet<>();
int lessThan =0;
int greaterThan =0;
Iterator<String> itr;
for (String currentLine = r.readLine(); currentLine != null; currentLine = r.readLine()){
itr = linesThatHaveBeenRead.iterator();
while(itr.hasNext()){
String theCurrentLineInTheSet = itr.next();
if(theCurrentLineInTheSet.compareTo(currentLine) == -1)++lessThan;
else if(theCurrentLineInTheSet.compareTo(currentLine) == 1)++greaterThan;
}
if(lessThan >= 1000 && greaterThan >= 1000){
w.println(currentLine);
lessThan = 0;
greaterThan =0;
}
linesThatHaveBeenRead.add(currentLine);
}
}
PROBLEM
I think the problem with my solution, is because I'm using nested loops which is making it a lot slower, but I've tried other ways and none worked. At this point I'm stuck. The whole point of this challenge is to make use of the most correct data-structure for this problem.
GOAL:
The goal is to use the most efficient data-structure for this problem.

Let me try to present just an accessible refinement of what to do.
public static void
doIt(java.io.BufferedReader r, java.io.PrintWriter w)
throws java.io.IOException {
feedNonExtremes(r, (line) -> { w.println(line);}, 1000, 1000);
}
/** Read <code>r</code> one line at a time and
* output the current line if and only there already were<br/>
* at least <code>nHigh</code> lines greater than the current line <br/>
* and at least <code>nLow</code> lines less than the current line.<br/>
* #param r to read lines from
* #param sink to feed lines to
* #param nLow number of lines comparing too small to process
* #param nHigh number of lines comparing too great to process
*/
static void feedNonExtremes(java.io.BufferedReader r,
Consumer<String> sink, int nLow, int nHigh) {
// collect nLow+nHigh lines into firstLowHigh; instantiate
// - a PriorityQueue(firstLowHigh) highest
// - a PriorityQueue(nLow, (a, b) -> String.compareTo(b, a)) lowest
// remove() nLow elements from highest and insert each into lowest
// for each remaining line
// if greater than the head of highest
// add to highest and remove head
// else if smaller than the head of lowest
// add to lowest and remove head
// else feed to sink
}

Made you a little example with Binary search, now in Java code. It will only use Binary search when newLine is within limits of the sorting.
public static void main(String[] args) {
// Create random lines
ArrayList<String> lines = new ArrayList<String>();
Random rn = new Random();
for (int i = 0; i < 50000; i++) {
int lenght = rn.nextInt(100);
char[] newString = new char[lenght];
for (int j = 0; j < lenght; j++) {
newString[j] = (char) rn.nextInt(255);
}
lines.add(new String(newString));
}
// Here starts logic
ArrayList<String> lowerCompared = new ArrayList<String>();
ArrayList<String> higherCompared = new ArrayList<String>();
int lowBoundry = 1000, highBoundry = 1000;
int k = 0;
int firstLimit = Math.min(lowBoundry, highBoundry);
// first x lines sorter equal
for (; k < firstLimit; k++) {
int index = Collections.binarySearch(lowerCompared, lines.get(k));
if (index < 0)
index = ~index;
lowerCompared.add(index, lines.get(k));
higherCompared.add(index, lines.get(k));
}
for (; k < lines.size(); k++) {
String newLine = lines.get(k);
boolean lowBS = newLine.compareTo(lowerCompared.get(lowBoundry - 1)) < 0;
boolean highBS = newLine.compareTo(higherCompared.get(0)) > 0;
if (lowerCompared.size() == lowBoundry && higherCompared.size() == highBoundry && !lowBS && !highBS) {
System.out.println("Time to print: " + newLine);
continue;
}
if (lowBS) {
int lowerIndex = Collections.binarySearch(lowerCompared, newLine);
if (lowerIndex < 0)
lowerIndex = ~lowerIndex;
lowerCompared.add(lowerIndex, newLine);
if (lowerCompared.size() > lowBoundry)
lowerCompared.remove(lowBoundry);
}
if (highBS) {
int higherIndex = Collections.binarySearch(higherCompared, newLine);
if (higherIndex < 0)
higherIndex = ~higherIndex;
higherCompared.add(higherIndex, newLine);
if (higherCompared.size() > highBoundry)
higherCompared.remove(0);
}
}
}

You need to implement binary search and also need to handle duplicates.
I've done some code sample here which does what you want ( may contains bugs).
public class CheckRead1000 {
public static void main(String[] args) {
// generate strings in revert order to get the worse case
List<String> aaa = new ArrayList<String>();
for (int i = 50000; i > 0; i--) {
aaa.add("some string 123456789" + i);
}
// fast solution
ArrayList<String> sortedLines = new ArrayList<>();
long st1 = System.currentTimeMillis();
for (String a : aaa) {
checkIfRead1000MoreAndLess(sortedLines, a);
}
System.out.println(System.currentTimeMillis() - st1);
// doIt solution
TreeSet<String> linesThatHaveBeenRead = new TreeSet<>();
long st2 = System.currentTimeMillis();
for (String a : aaa) {
doIt(linesThatHaveBeenRead, a);
}
System.out.println(System.currentTimeMillis() - st2);
}
// solution doIt
public static void doIt(SortedSet<String> linesThatHaveBeenRead, String currentLine) {
int lessThan = 0;
int greaterThan = 0;
Iterator<String> itr = linesThatHaveBeenRead.iterator();
while (itr.hasNext()) {
String theCurrentLineInTheSet = itr.next();
if (theCurrentLineInTheSet.compareTo(currentLine) == -1) ++lessThan;
else if (theCurrentLineInTheSet.compareTo(currentLine) == 1) ++greaterThan;
}
if (lessThan >= 1000 && greaterThan >= 1000) {
// System.out.println(currentLine);
lessThan = 0;
greaterThan = 0;
}
linesThatHaveBeenRead.add(currentLine);
}
// will return if we have read more at least 1000 string more and less then our string
private static boolean checkIfRead1000MoreAndLess(List<String> sortedLines, String newLine) {
//adding string to list and calculating its index and the last search range
int indexes[] = addNewString(sortedLines, newLine);
int index = indexes[0]; // index of element
int low = indexes[1];
int high = indexes[2];
//we need to check if this string already was in list for instance
// 1,2,3,4,5,5,5,5,5,6,7 for 5 we need to count 'less' as 4 and 'more' is 2
int highIndex = index;
for (int i = highIndex + 1; i < high; i++) {
if (sortedLines.get(i).equals(newLine)) {
highIndex++;
} else {
//no more duplicates
break;
}
}
int lowIndex = index;
for (int i = lowIndex - 1; i > low; i--) {
if (sortedLines.get(i).equals(newLine)) {
lowIndex--;
} else {
//no more duplicates
break;
}
}
// just calculating how many we did read more and less
if (sortedLines.size() - highIndex - 1 > 1000 && lowIndex > 1000) {
return true;
}
return false;
}
// simple binary search will insert string and return its index and ranges in sorted list
// first int is index,
// second int is start of range - will be used to find duplicates,
// third int is end of range - will be used to find duplicates,
private static int[] addNewString(List<String> sortedLines, String newLine) {
if (sortedLines.isEmpty()) {
sortedLines.add(newLine);
return new int[]{0, 0, 0};
}
// int index = Integer.MAX_VALUE;
int low = 0;
int high = sortedLines.size() - 1;
int mid = 0;
while (low <= high) {
mid = (low + high) / 2;
if (sortedLines.get(mid).compareTo(newLine) < 0) {
low = mid + 1;
} else if (sortedLines.get(mid).compareTo(newLine) > 0) {
high = mid - 1;
} else if (sortedLines.get(mid).compareTo(newLine) == 0) {
// index = mid;
break;
}
if (low > high) {
mid = low;
}
}
if (mid == sortedLines.size()) {
sortedLines.add(newLine);
} else {
sortedLines.add(mid, newLine);
}
return new int[]{mid, low, high};
}
}

Why Java String.indexOf() outperforms the same logic implemented in user defined class?

I have a question about the performance of Java's String.indexOf(String subString).
I've written a class to compare the performance of calling String.indexOf(String subString) compare to copying the source from inside String's source and calling the internal indexOf() using exactly the same arguments.
The performance appears to be about 4x better when calling String.indexOf() directly, despite the call stack would be 2 frames deeper.
My JVM is JDK1.7.0_40 64bit (windows hotspot).
My machine is running Windows with i7-4600U CPU with 16GB ram.
Here is the code:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
public class TestIndexOf implements Runnable {
final static String s0 = "This is my search string, it is pretty long so can test the speed of the search";
final static String s1 = "speed of the search";
final static char[] c0 = s0.toCharArray();
final static char[] c1 = s1.toCharArray();
final static byte[] b0 = s0.getBytes();
final static byte[] b1 = s1.getBytes();
static AtomicBoolean EXIT = new AtomicBoolean(false);
static AtomicLong TOTAL = new AtomicLong(0);
#Override
public void run() {
long count = 0;
try {
for (;;) {
// Case 1, search as byte[]
int idx = indexOf(b0, 0, b0.length, b1, 0, b1.length, 0);
// Case 2, search as char[]
// int idx = indexOf(c0, 0, c0.length, c1, 0, c1.length, 0);
// Case 3, search as String (using String.indexOf())
// int idx = s0.indexOf(s1);
if (idx >= 0) {
count ++;
}
if (EXIT.get()) {
break;
}
}
TOTAL.addAndGet(count);
} catch(Exception e) {
e.printStackTrace();
}
}
/* byte version of indexOf, modified from Java JDK source */
static int indexOf(byte[] source, int sourceOffset, int sourceCount,
byte[] target, int targetOffset, int targetCount,
int fromIndex) {
if (fromIndex >= sourceCount) {
return (targetCount == 0 ? sourceCount : -1);
}
if (fromIndex < 0) {
fromIndex = 0;
}
if (targetCount == 0) {
return fromIndex;
}
byte first = target[targetOffset];
int max = sourceOffset + (sourceCount - targetCount);
for (int i = sourceOffset + fromIndex; i <= max; i++) {
/* Look for first character. */
if (source[i] != first) {
while (++i <= max && source[i] != first) {
;
}
}
/* Found first character, now look at the rest of v2 */
if (i <= max) {
int j = i + 1;
int end = j + targetCount - 1;
for (int k = targetOffset + 1; j < end && source[j] ==
target[k]; j++, k++) {
;
}
if (j == end) {
/* Found whole string. */
return i - sourceOffset;
}
}
}
return -1;
}
/* char version of indexOf, directly copied from JDK's String class */
static int indexOf(char[] source, int sourceOffset, int sourceCount,
char[] target, int targetOffset, int targetCount,
int fromIndex) {
if (fromIndex >= sourceCount) {
return (targetCount == 0 ? sourceCount : -1);
}
if (fromIndex < 0) {
fromIndex = 0;
}
if (targetCount == 0) {
return fromIndex;
}
char first = target[targetOffset];
int max = sourceOffset + (sourceCount - targetCount);
for (int i = sourceOffset + fromIndex; i <= max; i++) {
/* Look for first character. */
if (source[i] != first) {
while (++i <= max && source[i] != first) {
;
}
}
/* Found first character, now look at the rest of v2 */
if (i <= max) {
int j = i + 1;
int end = j + targetCount - 1;
for (int k = targetOffset + 1; j < end && source[j] ==
target[k]; j++, k++) {
;
}
if (j == end) {
/* Found whole string. */
return i - sourceOffset;
}
}
}
return -1;
}
public static void main(String[] args) throws Exception {
int threads = 4;
ExecutorService executorService = Executors.newFixedThreadPool(threads);
for(int i=0; i<threads; i++) {
executorService.execute(new TestIndexOf());
}
Thread.sleep(10000);
EXIT.set(true);
System.out.println("STOPPED");
Thread.sleep(1000);
System.out.println("Count = " + TOTAL.get());
System.exit(0);
}
}
The results I got was: (2 samples, ran for 10 seconds, with 4 threads)
byte[]
224848726
225011695
char[]
224707442
224707442
String
898161092
897897572
What is the magic with String.indexOf()? Does this get hardware acceleration? :P

JVM has specific optimizations for some methods from standard library.
One of them will replace calls to String.indexOf with efficient inline assembly. It can even take advantage of SSE4.2 instructions. This is, most likely, causing this difference.
For details see: src/share/vm/opto/library_call.cpp

Finding the intersection between two list of string candidates

I wrote the following Java code, to find the intersection between the prefix and the suffix of a String in Java.
// you can also use imports, for example:
// import java.math.*;
import java.util.*;
class Solution {
public int max_prefix_suffix(String S) {
if (S.length() == 0) {
return 1;
}
// prefix candidates
Vector<String> prefix = new Vector<String>();
// suffix candidates
Vector<String> suffix = new Vector<String>();
// will tell me the difference
Set<String> set = new HashSet<String>();
int size = S.length();
for (int i = 0; i < size; i++) {
String candidate = getPrefix(S, i);
// System.out.println( candidate );
prefix.add(candidate);
}
for (int i = size; i >= 0; i--) {
String candidate = getSuffix(S, i);
// System.out.println( candidate );
suffix.add(candidate);
}
int p = prefix.size();
int s = suffix.size();
for (int i = 0; i < p; i++) {
set.add(prefix.get(i));
}
for (int i = 0; i < s; i++) {
set.add(suffix.get(i));
}
System.out.println("set: " + set.size());
System.out.println("P: " + p + " S: " + s);
int max = (p + s) - set.size();
return max;
}
// codility
// y t i l i d o c
public String getSuffix(String S, int index) {
String suffix = "";
int size = S.length();
for (int i = size - 1; i >= index; i--) {
suffix += S.charAt(i);
}
return suffix;
}
public String getPrefix(String S, int index) {
String prefix = "";
for (int i = 0; i <= index; i++) {
prefix += S.charAt(i);
}
return prefix;
}
public static void main(String[] args) {
Solution sol = new Solution();
String t1 = "";
String t2 = "abbabba";
String t3 = "codility";
System.out.println(sol.max_prefix_suffix(t1));
System.out.println(sol.max_prefix_suffix(t2));
System.out.println(sol.max_prefix_suffix(t3));
System.exit(0);
}
}
Some test cases are:
String t1 = "";
String t2 = "abbabba";
String t3 = "codility";
and the expected values are:
1, 4, 0
My idea was to produce the prefix candidates and push them into a vector, then find the suffix candidates and push them into a vector, finally push both vectors into a Set and then calculate the difference. However, I'm getting 1, 7, and 0. Could someone please help me figure it out what I'm doing wrong?

I'd write your method as follows:
public int max_prefix_suffix(String s) {
final int len = s.length();
if (len == 0) {
return 1; // there's some dispute about this in the comments to your post
}
int count = 0;
for (int i = 1; i <= len; ++i) {
final String prefix = s.substring(0, i);
final String suffix = s.substring(len - i, len);
if (prefix.equals(suffix)) {
++count;
}
}
return count;
}
If you need to compare the prefix to the reverse of the suffix, I'd do it like this:
final String suffix = new StringBuilder(s.substring(len - i, len))
.reverse().toString();

I see that the code by #ted Hop is good..
The question specify to return the max number of matching characters in Suffix and Prefix of a given String, which is a proper subset. Hence the entire string is not taken into consideration for this max number.
Ex. "abbabba", prefix and suffix can have abba(first 4 char) - abba (last 4 char),, hence the length 4
codility,, prefix(c, co,cod,codi,co...),, sufix (y, ty, ity, lity....), none of them are same.
hence length here is 0.
By modifying the count here from
if (prefix.equals(suffix)) {
++count;
}
with
if (prefix.equals(suffix)) {
count = prefix.length();// or suffix.length()
}
we get the max length.
But could this be done in O(n).. The inbuilt function of string equals, i believe would take O(n), and hence overall complexity is made O(n2).....

i would use this code.
public static int max_prefix_suffix(String S)
{
if (S == null)
return -1;
Set<String> set = new HashSet<String>();
int len = S.length();
StringBuilder builder = new StringBuilder();
for (int i = 0; i < len - 1; i++)
{
builder.append(S.charAt(i));
set.add(builder.toString());
}
int max = 0;
for (int i = 1; i < len; i++)
{
String suffix = S.substring(i, len);
if (set.contains(suffix))
{
int suffLen = suffix.length();
if (suffLen > max)
max = suffLen;
}
}
return max;
}

#ravi.zombie
If you need the length in O(n) then you just need to change Ted's code as below:
int max =0;
for (int i = 1; i <= len-1; ++i) {
final String prefix = s.substring(0, i);
final String suffix = s.substring(len - i, len);
if (prefix.equals(suffix) && max < i) {
max =i;
}
return max;
}
I also left out the entire string comparison to get proper prefix and suffixes so this should return 4 and not 7 for an input string abbabba.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Highly effect find a byte[] is contained by another byte[] [duplicate] - java

Google's Guava provides a Bytes.indexOf(byte[] array, byte[] target).

Related

How to count all the matching sub-string using regex?

is there any way to find all the occurrence of a pattern in a text instead of finding the first occurrence in Boyer Moore algorithm

How could I improve the speed/performance for this problem, Java

Why Java String.indexOf() outperforms the same logic implemented in user defined class?

Finding the intersection between two list of string candidates

Categories

Resources