What substring search algorithm is used by different JREs?

The java.lang.String JavaDoc says nothing about the default indexOf(String) substring search algorithm. So my question is: which substring search algorithm is used by different JREs?

The JDK ships with src.zip, which shows the implementation:
/**
* Code shared by String and StringBuffer to do searches. The
* source is the character array being searched, and the target
* is the string being searched for.
*
* @param source the characters being searched.
* @param sourceOffset offset of the source string.
* @param sourceCount count of the source string.
* @param target the characters being searched for.
* @param targetOffset offset of the target string.
* @param targetCount count of the target string.
* @param fromIndex the index to begin searching from.
*/
static int indexOf(char[] source, int sourceOffset, int sourceCount,
char[] target, int targetOffset, int targetCount,
int fromIndex) {
if (fromIndex >= sourceCount) {
return (targetCount == 0 ? sourceCount : -1);
}
if (fromIndex < 0) {
fromIndex = 0;
}
if (targetCount == 0) {
return fromIndex;
}
char first = target[targetOffset];
int max = sourceOffset + (sourceCount - targetCount);
for (int i = sourceOffset + fromIndex; i <= max; i++) {
/* Look for first character. */
if (source[i] != first) {
while (++i <= max && source[i] != first);
}
/* Found first character, now look at the rest of v2 */
if (i <= max) {
int j = i + 1;
int end = j + targetCount - 1;
for (int k = targetOffset + 1; j < end && source[j] ==
target[k]; j++, k++);
if (j == end) {
/* Found whole string. */
return i - sourceOffset;
}
}
}
return -1;
}

FWIW (in case this question is about the performance of different algorithms): on appropriate hardware and with a sufficiently recent Oracle JVM (6u21 and later, as detailed in the bug report), String.indexOf is implemented via the relevant SSE 4.2 intrinsics. See chapter 2.3 in this Intel reference doc.

Here is what I have found so far:
Oracle JDK 1.6/1.7, OpenJDK 6/7
static int indexOf(char[] source, int sourceOffset, int sourceCount,
char[] target, int targetOffset, int targetCount,
int fromIndex) {
if (fromIndex >= sourceCount) {
return (targetCount == 0 ? sourceCount : -1);
}
if (fromIndex < 0) {
fromIndex = 0;
}
if (targetCount == 0) {
return fromIndex;
}
char first = target[targetOffset];
int max = sourceOffset + (sourceCount - targetCount);
for (int i = sourceOffset + fromIndex; i <= max; i++) {
/* Look for first character. */
if (source[i] != first) {
while (++i <= max && source[i] != first);
}
/* Found first character, now look at the rest of v2 */
if (i <= max) {
int j = i + 1;
int end = j + targetCount - 1;
for (int k = targetOffset + 1; j < end && source[j] ==
target[k]; j++, k++);
if (j == end) {
/* Found whole string. */
return i - sourceOffset;
}
}
}
return -1;
}
IBM JDK 5.0
public int indexOf(String subString, int start) {
if (start < 0) start = 0;
int subCount = subString.count;
if (subCount > 0) {
if (subCount + start > count) return -1;
char[] target = subString.value;
int subOffset = subString.offset;
char firstChar = target[subOffset];
int end = subOffset + subCount;
while (true) {
int i = indexOf(firstChar, start);
if (i == -1 || subCount + i > count) return -1; // handles subCount > count || start >= count
int o1 = offset + i, o2 = subOffset;
while (++o2 < end && value[++o1] == target[o2]);
if (o2 == end) return i;
start = i + 1;
}
} else return start < count ? start : count;
}
Sabre SDK
public int indexOf(String str, int fromIndex)
{
if (fromIndex < 0)
fromIndex = 0;
int limit = count - str.count;
for ( ; fromIndex <= limit; fromIndex++)
if (regionMatches(fromIndex, str, 0, str.count))
return fromIndex;
return -1;
}
Feel free to update this post.

Since indexOf is mostly used to find short substrings in reasonably short strings, I believe it is safe to assume that a fairly straightforward algorithm like the one shown by Victor is used. There are more advanced algorithms that work better for large strings, but AFAIK they all perform worse on relatively short strings.
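For illustration only, here is a minimal sketch of the kind of straightforward scan described above. The class and method names (NaiveSearch, naiveIndexOf) are made up for this example and this is not the JDK source; edge cases such as an empty target are not guaranteed to match String.indexOf. It tries each start position and compares character by character, which is O(n*m) in the worst case but cheap for short inputs.

```java
public class NaiveSearch {
    // Hypothetical helper, not JDK code: brute-force substring search.
    static int naiveIndexOf(String source, String target, int fromIndex) {
        if (fromIndex < 0) {
            fromIndex = 0;
        }
        // Last start position at which the target can still fit.
        int max = source.length() - target.length();
        for (int i = fromIndex; i <= max; i++) {
            int j = 0;
            // Compare the target against the source starting at position i.
            while (j < target.length() && source.charAt(i + j) == target.charAt(j)) {
                j++;
            }
            if (j == target.length()) {
                return i; // every character of the target matched
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox";
        // Both lines should print the same index for ordinary inputs.
        System.out.println(naiveIndexOf(text, "brown", 0));
        System.out.println(text.indexOf("brown"));
    }
}
```

For ordinary inputs like the one in main, the sketch and String.indexOf should agree; a production implementation may additionally be replaced by a JVM intrinsic, as mentioned in another answer.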

Related

Why is Beautiful Arrangement not working in C

void check(int start, int* count, int size, int * set)
{
if(start == size) {
(*count) += 1;
return;
}
for(int i = start; i < size ; i++)
{
if((set[start] == 0) && (((i+1) % (start +1) == 0) || (start + 1) % (i+1) == 0 ))
{
set[start] = 1;
check(start +1, count, size, set);
set[start] = 0;
}
}
}
int countArrangement(int n){
int* set = (int *)malloc(sizeof(int) * n);
memset(set, 0, sizeof(int) * n);
int count = 0;
check(0, &count, n, set);
return count;
}
This is the code translated from Java to C, but when n is greater than 6 the result is wrong: for 7 it is too high by one, and after that the result is always smaller than the expected value. I am not able to understand what I am missing.
Output for n = 1 to 15:
n     Your answer    Expected answer
1     1              1
2     2              2
3     3              3
4     8              8
5     10             10
6     36             36
7     42             41
8     128            132
9     216            250
10    600            700
11    660            750
12    3456           4010
13    3744           4237
14    9408           10680
15    18900          24679
The Java code:
public class Solution {
int count = 0;
public int countArrangement(int N) {
boolean[] visited = new boolean[N + 1];
calculate(N, 1, visited);
return count;
}
public void calculate(int N, int pos, boolean[] visited) {
if (pos > N)
count++;
for (int i = 1; i <= N; i++) {
if (!visited[i] && (pos % i == 0 || i % pos == 0)) {
visited[i] = true;
calculate(N, pos + 1, visited);
visited[i] = false;
}
}
}
}
Can you just point out the missing part?

"this is the code translated from java to c"
Well, not really.
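You have not made a one-to-one translation. The C code is (perhaps by mistake) using a completely different algorithm: it marks the position (set[start]) instead of the chosen value, and it only tries values i >= start, so it never tracks which numbers have already been used.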
Start by making a one-to-one translation. Once you have that working, you can start playing with algorithm changes. But don't do both in the same step.
A one-to-one translation would be more like:
void calculate(int N, int pos, int * visited, int* count)
{
if (pos > N)
(*count)++;
for (int i = 1; i <= N; i++) {
if (!visited[i] && (pos % i == 0 || i % pos == 0)) {
visited[i] = 1;
calculate(N, pos + 1, visited, count);
visited[i] = 0;
}
}
}
int countArrangement(int n)
{
int* set = calloc(n+1, sizeof *set);
int count = 0;
calculate(n, 1, set, &count);
free(set);
return count;
}
Notice how the C code for calculate is almost identical to the Java version. There is no change of algorithm, only a few changes required by language differences.

Java regex not picking up "+"

I will show you my problem. This is using leetcode and I'm trying to create an atoi method.
public int myAtoi(String s) {
System.out.println(s.matches("^[^ -0123456789].*")); //this is the regex I am debugging
if(s.matches("^[^ -0123456789].*")){
return 0;
}
int solution = 0;
s = s.replaceAll("[^-0123456789.]","");
solution = 0;
boolean negative = false;
if(s.charAt(0) == '-'){
s = s.replaceAll("-","");
negative = true;
}
if(s.matches("^[0-9]?[.][0-9]+")){
s = s.substring(0, s.indexOf('.'));
System.out.println(s);
}
for(int i = s.length(); i > 0; i--){
solution = solution + (s.charAt(s.length() - i) - 48) * (int)Math.pow(10,i - 1);
}
if(negative) solution = solution * -1;
if(negative && solution > 0) return (int) Math.pow(-2,31);
if(!negative && solution < 0) return (int) Math.pow(2,31) - 1;
return solution;
}
Here is a text description of the output (the original post also included a screenshot of the output section).
When the input is "+-12" the output is supposed to be (int) 0. This is due to the requirement being that "if the string does not start with a number, a space, or a negative sign" we return 0.
The line of code which is supposed to handle this starts at line 4 and looks like
if(s.matches("^[^ -0123456789].*")){
return 0;
}
What is wrong with my regex?

We don't really need regular expressions to solve this problem; a single pass over the characters is enough.
For instance, if(s.matches("^[0-9]?[.][0-9]+")){ adds another regex pass on top of the replaceAll calls (the ? there is an optional quantifier, not a lazy one).
We can just loop through once (order of N) and define some statements:
class Solution {
public static final int myAtoi(
String s
) {
s = s.trim();
char[] characters = s.toCharArray();
int sign = 1;
int index = 0;
if (
index < characters.length &&
(characters[index] == '-' || characters[index] == '+')
) {
if (characters[index] == '-') {
sign = -1;
}
++index;
}
int num = 0;
int bound = Integer.MAX_VALUE / 10;
while (
index < characters.length &&
characters[index] >= '0' &&
characters[index] <= '9'
) {
final int digit = characters[index] - '0';
if (num > bound || (num == bound && digit > 7)) {
return sign == 1 ? Integer.MAX_VALUE : Integer.MIN_VALUE;
}
num *= 10;
num += digit;
++index;
}
return sign * num;
}
}
Here is a C++ version, if you might be interested:
// Most of headers are already included;
// Can be removed;
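#include <cctype>   // std::isdigit
#include <climits>  // INT_MAX, INT_MIN
#include <cstdint>
#include <iostream>
#include <iterator> // std::size (C++17)
#include <string>
#include <vector>
// The following block might trivially improve the exec time;
// Can be removed;
static const auto improve_runtime = []() {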
std::ios::sync_with_stdio(false);
std::cin.tie(NULL);
std::cout.tie(NULL);
return 0;
}();
#define MAX INT_MAX
#define MIN INT_MIN
using ValueType = std::int_fast32_t;
struct Solution {
static const int myAtoi(
const std::string str
) {
const ValueType len = std::size(str);
ValueType sign = 1;
ValueType index = 0;
while (index < len && str[index] == ' ') {
index++;
}
if (index == len) {
return 0;
}
if (str[index] == '-') {
sign = -1;
++index;
} else if (str[index] == '+') {
++index;
}
std::int_fast64_t num = 0;
while (index < len && num < MAX && std::isdigit(str[index])) {
ValueType digit = str[index] - '0';
num *= 10;
num += digit;
index++;
}
if (num > MAX) {
return sign == 1 ? MAX : MIN;
}
return sign * num;
}
};
// int main() {
// std::cout << Solution().myAtoi("words and 987") << "\n";
// std::cout << Solution().myAtoi("4193 with words") << "\n";
// std::cout << Solution().myAtoi(" -42") << "\n";
// }
Regarding your question
What is wrong with my regex?
If you'd like to see how a regular expression solution works, maybe this concise Python version would help:
import re
class Solution:
def myAtoi(self, s: str) -> int:
MAX, MIN = 2147483647, -2147483648
DIGIT_PATTERN = re.compile(r'^\s*[+-]?\d+')
s = re.findall(DIGIT_PATTERN, s)
try:
res = int(''.join(s))
except ValueError:
return 0
if res > MAX:
return MAX
if res < MIN:
return MIN
return res
We could work around the expression ^\s*[+-]?\d+ by dividing it into two subexpressions, which would get rid of the optional quantifier (?), yet that would be unnecessary (and is also against the KISS principle).
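On the original question of why the pattern does not reject "+": inside a character class, " -0" is read as a range from the space character (U+0020) to '0' (U+0030), and '+' (U+002B) falls inside that range. The negated class [^ -0123456789] therefore treats '+' as an allowed first character. The sketch below demonstrates this; the class name, the helper method and the escaped pattern are illustrative assumptions, not taken from the post.

```java
public class RegexRangeDemo {
    // Pattern from the question: " -0" inside [...] is a range (space..'0') that contains '+'.
    static final String ORIGINAL = "^[^ -0123456789].*";
    // Assumed fix: escaping the hyphen makes it a literal '-', so '+' is no longer in the class.
    static final String ESCAPED = "^[^ \\-0-9].*";

    // True when the input starts with a character the pattern considers invalid.
    static boolean rejects(String pattern, String input) {
        return input.matches(pattern);
    }

    public static void main(String[] args) {
        System.out.println(rejects(ORIGINAL, "+-12")); // '+' is inside the accidental range
        System.out.println(rejects(ESCAPED, "+-12"));  // '+' is now outside the class
        System.out.println(rejects(ESCAPED, "-12"));   // a leading '-' is still allowed
    }
}
```

Note that the escaped pattern would also reject "+12", which atoi implementations normally accept, so the sign handling may still need separate treatment.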

minimum operations required to make the longest character interval equal to K

I was asked this question in a contest.
Given a string containing only M and L, we can change any "M" to "L" or any "L" to "M". The objective of this function is to calculate the minimum number of changes we have to make in order to achieve the desired longest M-interval length K.
For example, given S = "MLMMLLM" and K = 3, the function should return 1. We can change the letter at position 4 (counting from 0) to obtain "MLMMMLM", in which the longest interval of letters "M" is exactly three characters long.
For another example, given S = "MLMMMLMMMM" and K = 2, the function should return 2. We can, for example, modify the letters at positions 2 and 7 to get the string "MLLMMLMLMM", which satisfies the desired property.
Here's what I have tried till now, but I am not getting correct output:
I am traversing the string and, whenever the longest char count exceeds K, I'm replacing M with L at that point.
public static int solution(String S, int K) {
StringBuilder Str = new StringBuilder(S);
int longest=0;int minCount=0;
for(int i=0;i<Str.length();i++){
char curr=S.charAt(i);
if(curr=='M'){
longest=longest+1;
if(longest>K){
Str.setCharAt(i, 'L');
minCount=minCount+1;
}
}
if(curr=='L')
longest=0;
}
if(longest < K){
longest=0;int indexoflongest=0;minCount=0;
for(int i=0;i<Str.length();i++){
char curr=S.charAt(i);
if(curr=='M'){
longest=longest+1;
indexoflongest=i;
}
if(curr=='L')
longest=0;
}
Str.setCharAt(indexoflongest, 'M');
minCount=minCount+1;
}
return minCount;
}

There are 2 parts to this algorithm, as we want the longest character interval to be exactly K.
Case 1: some interval already has length >= K. We then greedily change every (K + 1)th character of a run to L and start counting from 0 again.
Case 2: every interval is shorter than K. I then run a sliding window of length K over the string and consider converting all L's in this window to M's. This comes with a side effect: M's immediately outside the window would extend the interval beyond K, so the variable int nec keeps track of that. That is, I also have to count converting the (up to 2) M's adjacent to the window to L's.
Here's the complete runnable code in C++. Have a good day.
#include <bits/stdc++.h>
using namespace std;
typedef long long ll;
typedef vector <int> vi;
typedef pair<int, int> ii;
int change(string s, int k) {
// handling interval >= k
bool flag = false;
int ans = 0;
int cnt = 0;
for(int i=0; i<s.size(); i++) {
if(s[i] == 'M') cnt++;
else cnt = 0;
if(cnt == k) flag = true;
if(cnt > k) s[i] = 'L', ans++, cnt = 0;
}
if(flag) return ans;
// handling max interval < k
// If the interval is too big.
if(k > s.size()) {
cerr << "Can't do it.\n"; exit(0);
}
// Sliding window
cnt = 0;
for(int i=0; i<k; i++) {
if(s[i] == 'L') cnt++;
}
ans = cnt + (s[k] == 'M'); // new edit
int nec = 0; // new edit
for(int i=k; i<s.size(); i++) {
if(s[i-k] == 'L') cnt--;
if(s[i] == 'L') cnt++;
nec = 0;
if(s[i-k] == 'M') // left neighbour of the window [i-k+1, i]
nec++;
if(i < s.size()-1 && s[i+1] == 'M')
nec++;
ans = min(ans, cnt + nec);
}
return ans;
}
int main() {
ios_base::sync_with_stdio(false);
cin.tie(nullptr);
freopen("in.txt", "r", stdin);
freopen("out.txt", "w", stdout);
string s;
int k;
cin >> s >> k;
int ans = change(s, k);
cout << ans << "\n";
return 0;
}
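To sanity-check any of the approaches in this thread on short inputs, a brute-force reference can help. The sketch below is an assumption-laden illustration, not part of the original answer: the names IntervalBruteForce, longestRun and minChanges are hypothetical. It enumerates every M/L string of the same length, keeps those whose longest M-interval is exactly K, and returns the smallest number of differing positions.

```java
public class IntervalBruteForce {
    // Longest run of 'M' in the candidate encoded by mask (bit i set means 'M' at index i).
    static int longestRun(int mask, int n) {
        int best = 0;
        int run = 0;
        for (int i = 0; i < n; i++) {
            if (((mask >> i) & 1) == 1) {
                run++;
                best = Math.max(best, run);
            } else {
                run = 0;
            }
        }
        return best;
    }

    // Hypothetical reference solver: exponential, so only meant for short strings.
    static int minChanges(String s, int k) {
        int n = s.length();
        int original = 0;
        for (int i = 0; i < n; i++) {
            if (s.charAt(i) == 'M') {
                original |= 1 << i;
            }
        }
        int best = -1; // stays -1 if no string of this length has a longest run of exactly k
        for (int mask = 0; mask < (1 << n); mask++) {
            if (longestRun(mask, n) == k) {
                // Number of positions where the candidate differs from the input.
                int changes = Integer.bitCount(mask ^ original);
                if (best == -1 || changes < best) {
                    best = changes;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The two examples from the question.
        System.out.println(minChanges("MLMMLLM", 3));
        System.out.println(minChanges("MLMMMLMMMM", 2));
    }
}
```

Comparing a faster solution against this reference on random short strings is a cheap way to catch off-by-one mistakes in the window boundaries.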

int
process_data(const char *m, int k)
{
int m_cnt = 0, c_cnt = 0;
char ch;
const char *st = m;
int inc_cnt = -1;
int dec_cnt = -1;
while((ch = *m++) != 0) {
if (m_cnt++ < k) {
c_cnt += ch == 'M' ? 0 : 1;
if ((m_cnt == k) && (
(inc_cnt == -1) || (inc_cnt > c_cnt))) {
inc_cnt = c_cnt;
}
}
else if (ch == 'M') {
if (*st++ == 'M') {
/*
* losing & gaining M carries no change provided
* there is at least one L in the chunk. (c_cnt != 0)
* Else it implies stretch of Ms
*/
if (c_cnt <= 0) {
int t;
c_cnt--;
/*
* compute min inserts needed to break the
* stretch to meet max of k.
*/
t = (k - c_cnt) / (k+1);
dec_cnt += t;
}
}
else {
ASSERT(c_cnt > 0, "expect c_cnt(%d) > 0", c_cnt);
ASSERT(inc_cnt != -1, "expect inc_cnt(%d) != -1", inc_cnt);
/* Losing L and gaining M */
if (--c_cnt < inc_cnt) {
inc_cnt = c_cnt;
}
}
}
else {
if (c_cnt <= 0) {
/*
* take this as a first break and restart
* as any further addition of M should not
* happen. Ignore this L
*/
st = m;
c_cnt = 0;
m_cnt = 0;
}
else if (*st++ == 'M') {
/* losing m & gaining l */
c_cnt++;
}
else {
// losing & gaining L; no change
}
}
}
return dec_cnt != -1 ? dec_cnt : inc_cnt;
}

Corrected code:
int
process_data(const char *m, int k)
{
int m_cnt = 0, c_cnt = 0;
char ch;
const char *st = m;
int inc_cnt = -1;
int dec_cnt = -1;
while((ch = *m++) != 0) {
if (m_cnt++ < k) {
c_cnt += ch == 'M' ? 0 : 1;
if ((m_cnt == k) && (
(inc_cnt == -1) || (inc_cnt > c_cnt))) {
inc_cnt = c_cnt;
}
}
else if (ch == 'M') {
if (*st++ == 'M') {
/*
* losing & gaining M carries no change provided
* there is at least one L in the chunk. (c_cnt != 0)
* Else it implies stretch of Ms
*/
if (c_cnt <= 0) {
c_cnt--;
}
}
else {
ASSERT(c_cnt > 0, "expect c_cnt(%d) > 0", c_cnt);
ASSERT(inc_cnt != -1, "expect inc_cnt(%d) != -1", inc_cnt);
/* Losing L and gaining M */
if (--c_cnt < inc_cnt) {
inc_cnt = c_cnt;
}
}
}
else {
if (c_cnt <= 0) {
/*
* compute min inserts needed to break the
* stretch to meet max of k.
*/
dec_cnt += (dec_cnt == -1 ? 1 : 0) + ((k - c_cnt) / (k+1));
/*
* take this as a first break and restart
* as any further addition of M should not
* happen. Ignore this L
*/
st = m;
c_cnt = 0;
m_cnt = 0;
}
else if (*st++ == 'M') {
/* losing m & gaining l */
c_cnt++;
}
else {
// losing & gaining L; no change
}
}
}
if (c_cnt <= 0) {
/*
* compute min inserts needed to break the
* stretch to meet max of k.
*/
dec_cnt += (dec_cnt == -1 ? 1 : 0) + ((k - c_cnt) / (k+1));
}
return dec_cnt != -1 ? dec_cnt : inc_cnt;
}

Custom Binary Search Function is not Working Properly

I am creating a custom binary search function and am running into issues. I have looked through the code for a good while now, however, I cannot figure out why nothing is returning. Please let me know what you think. Thank you!
a is the array, b is the result that is returned in the end, and t is the target value. Pos is the current position and min and max are the minimum and maximum positions.
public static int binarySearch(int a[], int t){
int min = 0;
int max = a.length;
if (a[0] == t){
return 0;
}
int b = -1;
for (int pos = min; a[pos] != t;){
pos = (max - min) / 2;
if (a[pos] == t){
b = pos;
} else {
if(t > a[pos]){
min = pos + 1;
} else {
min = pos - 1;
}
}
}
return b;
}

Two small issues:
pos = (max - min) / 2;
It looks like you're trying to find the midpoint of min and max, but this computes half of their difference instead.
To find the midpoint, use pos = (max - min) / 2 + min;
Also, when moving the upper bound down, you accidentally assign to min instead:
min = pos - 1; should be max = pos - 1;

I see three issues with your code:
Your loop condition is not enough to guarantee that the process terminates. The condition is a[pos] != t and pos only moves up or down, which will eventually cause an ArrayIndexOutOfBoundsException (or an endless loop) if the element we search for is not in the array.
Your if-else is not correct, because you only ever update min and never max.
Instead of setting pos to the midpoint of min and max, you set it to half of their difference, which is not correct.
Combine all the above and you will get this result :
public static int binarySearch(int a[], int t) {
    int min = 0;
    int max = a.length - 1; // last valid index
    // Stop as soon as the search range [min, max] is empty.
    while (min <= max) {
        int pos = min + (max - min) / 2; // midpoint of the current range
        if (a[pos] == t) {
            return pos;
        } else if (t > a[pos]) {
            min = pos + 1; // target can only be in the upper half
        } else {
            max = pos - 1; // target can only be in the lower half
        }
    }
    return -1; // not found
}

How to calculate possible missing characters?

I'm trying to count the possible ways to fill in missing characters, e.g.:
for the input aa??bb the possible strings are aaaabb, aaabbb and aabbbb, so the result would be 3. Also, ?a? would be 1.
Note:
aababb would be wrong, because the letters are not in alphabetical order.
I've written some code here, but I couldn't get the correct result yet.
Could someone help me?
Scanner input = new Scanner(System.in);
String s = input.nextLine();
int possibleAlphabet = 1, oldPossibleAlphabet = 0;
for (int i = 0; i < s.length(); i++) {
oldPossibleAlphabet = 0;
System.out.print(s.charAt(i));
if (s.charAt(i) >= 'a' && s.charAt(i) <= 'z' || s.contains("?")) {
if (s.charAt(i) == '?'){
for (int j = 0; j < i; j++) {
if (s.charAt(i - 1) == '?' && s.charAt(i + 1) == '?')
oldPossibleAlphabet++;
}
}
}else {
System.out.print( " ");
System.exit(0);
}
possibleAlphabet += oldPossibleAlphabet;
}
System.out.println(possibleAlphabet);

Check my code
public class Solution {
public static void main(String[] args) {
String str = "abc??cde???g?"; // Example
char[] arr = str.toCharArray();
int length = arr.length;
// Init value for count, start and end
int count = 0;
char start = 'a';
char end = 'a';
for (int i = 0; i < length; i++) {
if (arr[i] == '?') { // We found a question mark
boolean foundEnd = false;
int total = 1; // Currently the total of question mark is 1
for (int j = i + 1; j < length; j++) { // Count the total question mark for our method and the end character
if (arr[j] != '?') { // Not question mark
end = arr[j]; // Update end;
i = j -1;
foundEnd = true;
break;
} else {
total++;
}
}
if (!foundEnd) { // Change end to start in the case our question mark continue to the end of string
end = start;
}
// Start to counting and reset end to 'z'
int result = countPossibleCharacters(total, start, end);
if (count > 0) {
count *= result;
} else {
count += result;
}
end = 'z';
} else {
start = arr[i];
}
}
System.out.println("The total is : " + count);
}
/**
* Count the possible characters
* @param total the total number of question marks
* @param start the character on the left side of the question marks
* @param end the character on the right side of the question marks
* @return the number of possible fillings
*/
static int countPossibleCharacters(int total, char start, char end) {
if (total == 0) {
return 0;
}
if (total == 1) {
return end - start + 1;
}
if (total >= 2) {
int count = 0;
/**
* We have a range of characters from start to end
* and for each character we have 2 options: use or don't use it
*/
// We use it, so the total of question mark will be decrement by 1
count += countPossibleCharacters(total - 1, start, end);
// We don't use it, so the range of characters will be decrement by 1
if (start < end) {
count += countPossibleCharacters(total, ++start, end);
}
return count;
}
return 0;
}
}
Rules applied in my code:
All characters in the string are lowercase
A character on the left side must not be greater than a character on the right side
If we have a question mark at the beginning of the string, we'll loop from 'a' to the next non question mark character
If we have a question mark at the end of the string, we'll replace it with the previous non question mark character
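As a cross-check for the recursive helper, the number of ways to fill a gap can also be written in closed form. Assuming the goal is to count non-decreasing sequences of length total over the letters start..end (r = end - start + 1 letters), that count is the multiset coefficient C(total + r - 1, total). The sketch below uses hypothetical names (GapCount, countNonDecreasing) and returns 0 for an empty gap or an empty letter range, which is an assumption made for this example.

```java
public class GapCount {
    // Number of non-decreasing sequences of length `total` over the letters start..end,
    // i.e. C(total + r - 1, total) with r = end - start + 1.
    static long countNonDecreasing(int total, char start, char end) {
        int r = end - start + 1;
        if (total <= 0 || r <= 0) {
            return 0; // assumption: nothing to count for an empty gap or empty range
        }
        long result = 1;
        // Multiplicative binomial: after step i, result == C(r - 1 + i, i),
        // so the division is always exact.
        for (int i = 1; i <= total; i++) {
            result = result * (r - 1 + i) / i;
        }
        return result;
    }

    public static void main(String[] args) {
        // The gap in "aa??bb": two question marks between 'a' and 'b'.
        System.out.println(countNonDecreasing(2, 'a', 'b'));
        // A single question mark between 'a' and 'a'.
        System.out.println(countNonDecreasing(1, 'a', 'a'));
    }
}
```

This avoids the exponential recursion for long runs of question marks, although very long gaps could still overflow a long.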
