Check for multiple occurrence of certain character in string

Edit: To those who downvoted, this question is different from the duplicate question that was linked. The other question is about returning the indexes of the matches. In my case I do not need the indexes; I just want to check whether there is more than one occurrence.
This is my code:
String word = "ABCDE<br>XYZABC";
String[] keywords = word.split("<br>");
for (int index = 0; index < keywords.length; index++) {
if (keywords[index].toLowerCase().contains(word.toLowerCase())) {
if (index != (keywords.length - 1)) {
endText = keywords[index];
definition.setText(endText);
}
}
My problem is that if the keyword is "ABC", then the string endText will only show "ABCDE", even though "XYZABC" contains "ABC" as well. How can I check whether the string contains multiple occurrences? If it does, I would like to set the definition TextView with definition.setText(endText + "More");.
I tried the following. The code works, but it makes my app very slow. I guess the reason is that I get the String word through a TextWatcher.
String[] keywords = word.split("<br>");
for (int index = 0; index < keywords.length; index++) {
if (keywords[index].toLowerCase().contains(word.toLowerCase())) {
if (index != (keywords.length - 1)) {
int i = 0;
Pattern p = Pattern.compile(search.toLowerCase());
Matcher m = p.matcher( word.toLowerCase() );
while (m.find()) {
i++;
}
if (i > 1) {
endText = keywords[index];
definition.setText(endText + " More");
} else {
endText = keywords[index];
definition.setText(endText);
}
}
}
}
Is there any faster way?

It's a little hard for me to understand your question, but it sounds like:
You have some string (e.g. "ABCDE<br>XYZABC"). You also have some target text (e.g. "ABC"). You want to split that string on a delimiter (e.g. "<br>"), and then:
If exactly one substring contains the target, display that substring.
If more than one substring contains the target, display the last substring that contains it plus the suffix "More".
In your posted code, the performance is really slow because of the Pattern.compile() call. Re-compiling the Pattern on every loop iteration is very costly. Luckily, there's no need for regular expressions here, so you can avoid that problem entirely.
String search = "ABC".toLowerCase();
String word = "ABCDE<br>XYZABC";
String[] keywords = word.split("<br>");
int count = 0;
for (String keyword : keywords) {
if (keyword.toLowerCase().contains(search)) {
++count;
endText = keyword;
}
}
if (count > 1) {
definition.setText(endText + " More");
}
else if (count == 1) {
definition.setText(endText);
}
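The snippet above relies on endText and definition from the asker's surrounding Android code. As a minimal, self-contained sketch of the same counting idea, the lookup could be isolated in a helper that returns the text to display. The class name SegmentMatcher and the method describeMatches are hypothetical, and returning null when nothing matches is an assumption, not something the question specifies.

```java
import java.util.Locale;

final class SegmentMatcher {

    /**
     * Splits text on the literal "<br>" delimiter and returns the last segment
     * that contains search (case-insensitively), with " More" appended when
     * more than one segment matches. Returns null when no segment matches
     * (an assumption made for this sketch).
     */
    static String describeMatches(String text, String search) {
        // Lower-case the search term once, outside the loop.
        String needle = search.toLowerCase(Locale.ROOT);
        String lastMatch = null;
        int count = 0;
        for (String segment : text.split("<br>")) {
            if (segment.toLowerCase(Locale.ROOT).contains(needle)) {
                count++;
                lastMatch = segment;
            }
        }
        if (count == 0) {
            return null;
        }
        return count > 1 ? lastMatch + " More" : lastMatch;
    }

    public static void main(String[] args) {
        System.out.println(describeMatches("ABCDE<br>XYZABC", "ABC")); // XYZABC More
        System.out.println(describeMatches("ABCDE<br>XYZ", "abc"));    // ABCDE
    }
}
```

In the Android code the result would then be passed to definition.setText(...) after a null check.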

Your approach is correct, but it contains an unnecessary check: if (index != (keywords.length - 1)). This check ignores a match in the last element of the keywords array. I am not sure whether that is part of your requirement.
To improve performance, break out of the loop as soon as you find the second match. There is no need to check any further.
public static void main(String[] args) {
String word = "ABCDE<br>XYZABC";
String pattern = "ABC";
String[] keywords = word.split("<br>");
String endText = "";
int count = 0;
for (int index = 0; index < keywords.length; index++) {
if (keywords[index].toLowerCase().contains(pattern.toLowerCase())) {
// Reaching this point means a match was found.
if(count == 1) {
// This is the second match, so break out of the loop; no need to check any further.
// Keep the first matching String and append the " more" suffix to it.
endText += " more";
break;
}
endText = keywords[index];
count++;
}
}
System.out.println(endText);
}
This will print ABCDE more

Hi. You have to write your condition like this:
if (word.toLowerCase().contains(keywords[index].toLowerCase()))

You can use this:
String word = "ABCDE<br>XYZABC";
String[] keywords = word.split("<br>");
for (int i = 0; i < keywords.length - 1; i++) {
int c = 0;
Pattern p = Pattern.compile(keywords[i].toLowerCase());
Matcher m = p.matcher(word.toLowerCase());
while (m.find()) {
c++;
}
if (c > 1) {
definition.setText(keywords[i] + " More");
} else {
definition.setText(keywords[i]);
}
}
But as I mentioned in the comments, no segment occurs twice in "ABCDE<br>XYZABC" when you split it on <br>.
If you use the word "ABCDE<br>XYZABCDE" instead, there are two occurrences of "ABCDE".

void test() {
String word = "ABCDE<br>XYZABC";
String sequence = "ABC";
if(word.replaceFirst(sequence,"{---}").contains(sequence)){
int startIndex = word.indexOf(sequence);
int endIndex = word.indexOf("<br>");
Log.v("test",word.substring(startIndex,endIndex)+" More");
}
else{
//your code
}
}
Try this

Related

Tokenize method: Split string into array

I've been really struggling with a programming assignment. Basically, we have to write a program that translates a sentence in English into one in Pig Latin. The first method we need is one to tokenize the string, and we are not allowed to use the Split method usually used in Java. I've been trying to do this for the past 2 days with no luck, here is what I have so far:
public class PigLatin
{
public static void main(String[] args)
{
String s = "Hello there my name is John";
Tokenize(s);
}
public static String[] Tokenize(String english)
{
String[] tokenized = new String[english.length()];
for (int i = 0; i < english.length(); i++)
{
int j= 0;
while (english.charAt(i) != ' ')
{
String m = "";
m = m + english.charAt(i);
if (english.charAt(i) == ' ')
{
j++;
}
else
{
break;
}
}
for (int l = 0; l < tokenized.length; l++) {
System.out.print(tokenized[l] + ", ");
}
}
return tokenized;
}
}
All this does is print an enormously long array of "null"s. If anyone can offer any input at all, I would reallllyyyy appreciate it!
Thank you in advance
Update: We are supposed to assume that there will be no punctuation or extra spaces, so basically whenever there is a space, it's a new word

If I understand your question, and what your Tokenize was intended to do; then I would start by writing a function to split the String
static String[] splitOnWhiteSpace(String str) {
List<String> al = new ArrayList<>();
StringBuilder sb = new StringBuilder();
for (char ch : str.toCharArray()) {
if (Character.isWhitespace(ch)) {
if (sb.length() > 0) {
al.add(sb.toString());
sb.setLength(0);
}
} else {
sb.append(ch);
}
}
if (sb.length() > 0) {
al.add(sb.toString());
}
String[] ret = new String[al.size()];
return al.toArray(ret);
}
and then print using Arrays.toString(Object[]) like
public static void main(String[] args) {
String s = "Hello there my name is John";
String[] words = splitOnWhiteSpace(s);
System.out.println(Arrays.toString(words));
}

If you're allowed to use the StringTokenizer object (which I think is what the assignment is asking for), it would look something like this:
StringTokenizer st = new StringTokenizer("this is a test");
while (st.hasMoreTokens()) {
System.out.println(st.nextToken());
}
which will produce the output:
this
is
a
test
StringTokenizer splits the string into tokens and hands them out one at a time through nextToken(). The body of the while loop is where you can apply the Pig Latin logic to each token.

Some hints for you to do the "manual splitting" work.
There is a method String#indexOf(int ch, int fromIndex) to help you to find next occurrence of a character
There is a method String#substring(int beginIndex, int endIndex) to extract certain part of a string.
Here is some pseudo-code that show you how to split it (there are more safety handling that you need, I will leave that to you)
List<String> results = ...;
int startIndex = 0;
int endIndex = 0;
while (startIndex < inputString.length) {
endIndex = get next index of space after startIndex
if no space found {
endIndex = inputString.length
}
String result = get substring of inputString from startIndex to endIndex-1
results.add(result)
startIndex = endIndex + 1 // move startIndex to next position after space
}
// here, results contains all splitted words
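The pseudo-code above could be turned into runnable Java roughly as follows. This is a sketch under the question's stated assumption that words are separated by single spaces; the class name ManualSplit and the method splitOnSpaces are made up for illustration, and the extra guard that skips empty tokens is an addition not present in the pseudo-code.

```java
import java.util.ArrayList;
import java.util.List;

final class ManualSplit {

    /** Splits input on spaces using only indexOf and substring. */
    static List<String> splitOnSpaces(String input) {
        List<String> results = new ArrayList<>();
        int startIndex = 0;
        while (startIndex < input.length()) {
            int endIndex = input.indexOf(' ', startIndex);
            if (endIndex == -1) {
                endIndex = input.length(); // no more spaces: take the rest
            }
            // substring's end index is exclusive, so this stops just before the space.
            if (endIndex > startIndex) {
                results.add(input.substring(startIndex, endIndex));
            }
            startIndex = endIndex + 1; // move to the position after the space
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(splitOnSpaces("Hello there my name is John"));
        // prints [Hello, there, my, name, is, John]
    }
}
```

If the assignment requires a String[] rather than a List, results.toArray(new String[0]) converts it.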

String english = "hello my fellow friend";
ArrayList<String> tokenized = new ArrayList<>();
String m = "";
int j = 0; //index for tokenised array list.
for (int i = 0; i < english.length(); i++)
{
//the order of the two conditions matters here: if you
//swap them, english.charAt(i) can throw an index
//out of bounds exception
while( i < english.length() && english.charAt(i) != ' ')
{
m = m + english.charAt(i);
i++;
}
//add to array list if there is some string
//if its only ' ', array will be empty so we are OK.
if(m.length() > 0 )
{
tokenized.add(m);
j++;
m = "";
}
}
//print the array list
for (int l = 0; l < tokenized.size(); l++) {
System.out.print(tokenized.get(l) + ", ");
}
This prints "hello, my, fellow, friend, ".
I used an ArrayList because the number of words, and therefore the length of the array, is not known up front.

Substring between two same or different delimiters (when delimiters occur multiple times)

I need to fetch a substring that lies between two delimiters, which may be the same or different. The delimiters occur multiple times in the string, so I need to extract the substring that lies between the mth occurrence of delimiter1 and the nth occurrence of delimiter2.
For example:
myString : Ron_CR7_MU^RM^_SAF_34^
What should I do here if I need to extract the substring that lies between the 3rd occurrence of '_' and the 3rd occurrence of '^'?
Substring = SAF_34
Or I could look for the substring that lies between the 2nd '^' and the 4th '_', i.e.:
Substring = _SAF
An SQL equivalent would be :
substr(myString, instr(myString, '_',1,3)+1, instr(myString, '^',1,3)-1-instr(myString, '_',1,3))

I would use,
public static int findNth(String text, String toFind, int count) {
int pos = -1;
do {
pos = text.indexOf(toFind, pos+1);
} while(--count > 0 && pos >= 0);
return pos;
}
int from = findNth(text, "_", 3);
int to = findNth(text, "^", 3);
String found = text.substring(from+1, to);
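For completeness, here is a self-contained sketch that wraps findNth with the bounds checks the fragment above leaves out. The class name NthDelimiter and the helper between are hypothetical, and returning null for a missing or misordered delimiter is an assumption rather than part of the original answer.

```java
final class NthDelimiter {

    /** Returns the index of the count-th occurrence of toFind in text, or -1. */
    static int findNth(String text, String toFind, int count) {
        int pos = -1;
        do {
            pos = text.indexOf(toFind, pos + 1);
        } while (--count > 0 && pos >= 0);
        return pos;
    }

    /**
     * Returns the text between the m-th occurrence of d1 and the n-th
     * occurrence of d2, or null if either occurrence is missing or the
     * second one does not come after the first (assumed behavior).
     */
    static String between(String text, String d1, int m, String d2, int n) {
        int from = findNth(text, d1, m);
        int to = findNth(text, d2, n);
        if (from < 0 || to < 0) {
            return null;
        }
        int start = from + d1.length(); // skip past the first delimiter
        if (to < start) {
            return null;
        }
        return text.substring(start, to);
    }

    public static void main(String[] args) {
        String s = "Ron_CR7_MU^RM^_SAF_34^";
        System.out.println(between(s, "_", 3, "^", 3)); // SAF_34
        System.out.println(between(s, "^", 2, "_", 4)); // _SAF
    }
}
```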

If you can use a solution without regex, you can find the indexes in your string where the result needs to start and where it needs to end. Then simply call myString.substring(start, end) to get your result.
The main problem is finding start and end. To do that, you can repeat this N (or M) times:
int pos = myString.indexOf(delimiterX);
myString = myString.substring(pos + delimiterX.length()); // skip past the delimiter; work on a copy of myString and add up the skipped lengths to recover the absolute index
I hope you get the idea.

You could create a little method that hunts for such substrings between delimiters sequentially, using (as noted) String.indexOf(String). You do need to decide whether you want all substrings, overlapping or not (which is what your question suggests), or only non-overlapping ones. Here is a first attempt at such code:
import java.util.Vector;
public class FindDelimitedStrings {
public static void main(String[] args) {
String[] test = getDelimitedStrings("Ron_CR7_MU'RM'_SAF_34'", "_", "'");
if (test != null) {
for (int i = 0; i < test.length; i++) {
System.out.println(" " + (i + 1) + ". |" + test[i] + "|");
}
}
}
public static String[] getDelimitedStrings(String source,
String leftDelimiter, String rightDelimiter) {
String[] answer = null;
Vector<String> results = new Vector<String>();
if (source == null || leftDelimiter == null || rightDelimiter == null) {
return null;
}
int loc = 0;
int begin = source.indexOf(leftDelimiter, loc);
int end;
while (begin > -1) {
end = source
.indexOf(rightDelimiter, begin + leftDelimiter.length());
if (end > -1) {
results.add(source.substring(begin, end));
// loc = end + rightDelimiter.length(); if strings must be
// returned as pairs
loc = begin + 1;
if (loc < source.length()) {
begin = source.indexOf(leftDelimiter, loc);
} else {
begin = -1;
}
} else {
begin = -1;
}
}
if (results.size() > 0) {
answer = new String[results.size()];
results.toArray(answer);
}
return answer;
}
}

Algorithm for duplicated but overlapping strings

I need to write a method where I'm given a string s and I need to return the shortest string which contains s as a contiguous substring twice.
However two occurrences of s may overlap. For example,
aba returns ababa
xxxxx returns xxxxxx
abracadabra returns abracadabracadabra
My code so far is this:
import java.util.Scanner;
public class TwiceString {
public static String getShortest(String s) {
int index = -1, i, j = s.length() - 1;
char[] arr = s.toCharArray();
String res = s;
for (i = 0; i < j; i++, j--) {
if (arr[i] == arr[j]) {
index = i;
} else {
break;
}
}
if (index != -1) {
for (i = index + 1; i <= j; i++) {
String tmp = new String(arr, i, i);
res = res + tmp;
}
} else {
res = res + res;
}
return res;
}
public static void main(String args[]) {
Scanner inp = new Scanner(System.in);
System.out.println("Enter the string: ");
String word = inp.next();
System.out.println("The requires shortest string is " + getShortest(word));
}
}
I know I'm probably wrong at the algorithmic level rather than at the coding level. What should be my algorithm?

Use a suffix tree. In particular, after you've constructed the tree for s, go to the leaf representing the whole string and walk up until you see another end-of-string marker. This will be the leaf of the longest suffix that is also a prefix of s.

As @phs already said, part of the problem can be translated to "find the longest proper prefix of s that is also a suffix of s", and a solution without a tree may be this:
public static String getShortest(String s) {
int i = s.length();
while(i > 0 && !s.endsWith(s.substring(0, --i)))
;
return s + s.substring(i);
}

Once you've found your index, and even if it's -1, you just need to append to the original string the substring going from index + 1 (since index is the last matching character index) to the end of the string. There's a method in String to get this substring.

I think you should have a look at the Knuth-Morris-Pratt algorithm; the partial match table it uses is pretty much what you need (and by the way, it's a very nice algorithm ;)
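As a rough sketch of how that partial match table could be used here: its last entry is the length of the longest proper prefix of s that is also a suffix of s, so appending everything after that prefix yields the shortest string containing s twice. The class name OverlapDoubler and the method longestBorder are made up for this illustration.

```java
final class OverlapDoubler {

    /** Length of the longest proper prefix of s that is also a suffix of s. */
    static int longestBorder(String s) {
        int n = s.length();
        if (n == 0) {
            return 0;
        }
        // pi[i] = length of the longest proper prefix of s[0..i] that is also its suffix
        int[] pi = new int[n];
        for (int i = 1; i < n; i++) {
            int k = pi[i - 1];
            while (k > 0 && s.charAt(i) != s.charAt(k)) {
                k = pi[k - 1]; // fall back to the next shorter candidate
            }
            if (s.charAt(i) == s.charAt(k)) {
                k++;
            }
            pi[i] = k;
        }
        return pi[n - 1];
    }

    /** Shortest string that contains s twice as a contiguous substring. */
    static String getShortest(String s) {
        return s + s.substring(longestBorder(s));
    }

    public static void main(String[] args) {
        System.out.println(getShortest("aba"));         // ababa
        System.out.println(getShortest("xxxxx"));       // xxxxxx
        System.out.println(getShortest("abracadabra")); // abracadabracadabra
    }
}
```

The table is built in time linear in the length of s, whereas repeatedly calling endsWith on shrinking prefixes can take quadratic time in the worst case.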

If your input string s is, say, "abcde" you can easily build a regex like the following (notice that the last character "e" is missing!):
a(b(c(d)?)?)?$
and run it on the string s. This will return the starting position of the trailing repeated substring. You would then just append the missing part (i.e. the last N-M characters of s, where N is the length of s and M is the length of the match), e.g.
aba
  ^ match "a"; append the missing "ba"
xxxxx
 ^ match "xxxx"; append the missing "x"
abracadabra
       ^ match "abra"; append the missing "cadabra"
nooverlap
--> no match; append "nooverlap"

From my understanding you want to do this:
input: dog
output: dogdog
--------------
input: racecar
output: racecaracecar
So this is how i would do that:
public String change(String input)
{
StringBuilder outputBuilder = new StringBuilder(input);
int patternLocation = input.length();
for(int x = 1;x < input.length();x++)
{
StringBuilder check = new StringBuilder(input);
for(int y = 0; y < x;y++)
check.deleteCharAt(check.length() - 1);
if(input.endsWith(check.toString()))
{
patternLocation = x;
break;
}
}
outputBuilder.delete(0, input.length() - patternLocation);
return input + outputBuilder.toString();
}
Hope this helped!

Extract sub-string between two certain words using regex in java

I would like to extract sub-string between certain two words using java.
For example:
This is an important example about regex for my work.
I would like to extract everything between "an" and "for".
What I did so far is:
String sentence = "This is an important example about regex for my work and for me";
Pattern pattern = Pattern.compile("(?<=an).*.(?=for)");
Matcher matcher = pattern.matcher(sentence);
boolean found = false;
while (matcher.find()) {
System.out.println("I found the text: " + matcher.group().toString());
found = true;
}
if (!found) {
System.out.println("I didn't found the text");
}
It works well.
But I want to do two additional things
If the sentence is: This is an important example about regex for my work and for me.
I want to extract till the first "for" i.e. important example about regex
Some times I want to limit the number of words between the pattern to 3 words i.e. important example about
Any ideas please?

For your first question, make it lazy. You can put a question mark after the quantifier, and then the quantifier will match as little as possible.
(?<=an).*?(?=for)
I have no idea what the additional . at the end of .*. is good for; it's unnecessary.
For your second question you have to define what a "word" is. I would say here probably just a sequence of non whitespace followed by a whitespace. Something like this
\S+\s
and repeat this 3 times like this
(?<=an)\s(\S+\s){3}(?=for)
To ensure that the pattern matches on whole words, use word boundaries
(?<=\ban\b)\s(\S+\s){1,5}(?=\bfor\b)
{3} matches exactly 3 repetitions. For a minimum of 1 and a maximum of 3, use {1,3} instead.
Alternative:
As dma_k correctly stated, in your case it's not necessary to use lookbehind and lookahead. See the Matcher documentation about groups.
You can use capturing groups instead. Just put the part you want to extract in parentheses and it will be put into a capturing group.
\ban\b(.*?)\bfor\b
You can then access this group like this
System.out.println("I found the text: " + matcher.group(1).toString());
You have only one pair of parentheses, so it's simple: just pass 1 to matcher.group(1) to access the first capturing group.
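Putting the pieces together, a small sketch of both requests might look like this. The class name BetweenWords, the two pattern constants, and the helper firstGroup are hypothetical. The second pattern reflects one reading of the question, namely that "at most three words after an" should be returned whether or not "for" follows immediately.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BetweenWords {

    // Lazy capture of everything between the whole words "an" and "for".
    static final Pattern UP_TO_FIRST_FOR =
            Pattern.compile("\\ban\\b\\s+(.*?)\\s+\\bfor\\b");

    // Up to three whitespace-separated words following the whole word "an"
    // (assumed interpretation of the second requirement).
    static final Pattern FIRST_THREE_WORDS =
            Pattern.compile("\\ban\\b\\s+(\\S+(?:\\s+\\S+){0,2})");

    /** Returns capturing group 1 of the first match, or null if there is none. */
    static String firstGroup(Pattern pattern, String sentence) {
        Matcher matcher = pattern.matcher(sentence);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";
        System.out.println(firstGroup(UP_TO_FIRST_FOR, sentence));   // important example about regex
        System.out.println(firstGroup(FIRST_THREE_WORDS, sentence)); // important example about
    }
}
```

Compiling the patterns once as constants avoids recompiling them on every call.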

Your regex is "an\\s+(.*?)\\s+for". It extracts all characters between an and for, ignoring the surrounding whitespace (\s+). The question mark makes the quantifier lazy (reluctant). It is needed to prevent the pattern .* from matching everything up to the last word "for".

public class SubStringBetween {
public static String subStringBetween(String sentence, String before, String after) {
int startSub = SubStringBetween.subStringStartIndex(sentence, before);
int stopSub = SubStringBetween.subStringEndIndex(sentence, after);
String newWord = sentence.substring(startSub, stopSub);
return newWord;
}
public static int subStringStartIndex(String sentence, String delimiterBeforeWord) {
int startIndex = 0;
String newWord = "";
int x = 0, y = 0;
for (int i = 0; i < sentence.length(); i++) {
newWord = "";
if (sentence.charAt(i) == delimiterBeforeWord.charAt(0)) {
startIndex = i;
for (int j = 0; j < delimiterBeforeWord.length(); j++) {
try {
if (sentence.charAt(startIndex) == delimiterBeforeWord.charAt(j)) {
newWord = newWord + sentence.charAt(startIndex);
}
startIndex++;
} catch (Exception e) {
}
}
if (newWord.equals(delimiterBeforeWord)) {
x = startIndex;
}
}
}
return x;
}
public static int subStringEndIndex(String sentence, String delimiterAfterWord) {
int startIndex = 0;
String newWord = "";
int x = 0;
for (int i = 0; i < sentence.length(); i++) {
newWord = "";
if (sentence.charAt(i) == delimiterAfterWord.charAt(0)) {
startIndex = i;
for (int j = 0; j < delimiterAfterWord.length(); j++) {
try {
if (sentence.charAt(startIndex) == delimiterAfterWord.charAt(j)) {
newWord = newWord + sentence.charAt(startIndex);
}
startIndex++;
} catch (Exception e) {
}
}
if (newWord.equals(delimiterAfterWord)) {
x = startIndex;
x = x - delimiterAfterWord.length();
}
}
}
return x;
}
}

Count Occurence of Needle String in Haystack String, most optimally?

The problem is simple: find "ABC" in "ABCDSGDABCSAGAABCCCCAAABAABC" without using String.split("ABC").
Here is the solution I propose, I'm looking for any solutions that might be better than this one.
public static void main(String[] args) {
String haystack = "ABCDSGDABCSAGAABCCCCAAABAABC";
String needle = "ABC";
char [] needl = needle.toCharArray();
int needleLen = needle.length();
int found=0;
char hay[] = haystack.toCharArray();
int index =0;
int chMatched =0;
for (int i=0; i<hay.length; i++){
if (index >= needleLen || chMatched==0)
index=0;
System.out.print("\nchar-->"+hay[i] + ", with->"+needl[index]);
if(hay[i] == needl[index]){
chMatched++;
System.out.println(", matched");
}else {
chMatched=0;
index=0;
if(hay[i] == needl[index]){
chMatched++;
System.out.print("\nchar->"+hay[i] + ", with->"+needl[index]);
System.out.print(", matched");
}else
continue;
}
if(chMatched == needleLen){
found++;
System.out.println("found. Total ->"+found);
}
index++;
}
System.out.println("Result Found-->"+found);
}
It took me a while creating this one. Can someone suggest a better solution (if any)
P.S. Drop the sysouts if they look messy to you.

How about:
boolean found = haystack.indexOf("ABC") >= 0;
Edit: The question asks for the number of occurrences, so here's a modified version of the above:
public static void main(String[] args)
{
String needle = "ABC";
String haystack = "ABCDSGDABCSAGAABCCCCAAABAABC";
int numberOfOccurences = 0;
int index = haystack.indexOf(needle);
while (index != -1)
{
numberOfOccurences++;
haystack = haystack.substring(index+needle.length());
index = haystack.indexOf(needle);
}
System.out.println("" + numberOfOccurences);
}
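The loop above creates a new substring on every match. A possible variation, sketched here with a hypothetical OccurrenceCounter class, keeps the haystack intact and uses the two-argument String.indexOf(String, int) to resume the search after each match. Treating an empty needle as zero matches is an assumption of this sketch.

```java
final class OccurrenceCounter {

    /** Counts non-overlapping occurrences of needle in haystack. */
    static int countOccurrences(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0; // assumption: an empty needle counts as zero matches
        }
        int count = 0;
        int index = haystack.indexOf(needle);
        while (index != -1) {
            count++;
            // Resume just past the match; use index + 1 here to count overlapping matches instead.
            index = haystack.indexOf(needle, index + needle.length());
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("ABCDSGDABCSAGAABCCCCAAABAABC", "ABC")); // 4
    }
}
```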

If you're looking for an algorithm, google for "Boyer-Moore". You can do this in sub-linear time.
Edit, to clarify and hopefully make all the purists happy: the time bound on Boyer-Moore is, formally speaking, linear. However, the effective performance is often such that you do many fewer comparisons than you would with a simpler approach, and in particular you can often skip through the "haystack" string without having to check each character.

You say your challenge is to find ABC within a string. If all you need is to know if ABC exists within the string, a simple indexOf() test will suffice.
If you need to know the number of occurrences, as your posted code tries to find, a simple approach would be to use a regex:
public static int countOccurrences(String haystack, String regexToFind) {
Pattern p = Pattern.compile(regexToFind);
Matcher m = p.matcher(haystack); // get a matcher object
int count = 0;
while(m.find()) {
count++;
}
return count;
}
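One caveat with the regex approach is that the needle is interpreted as a pattern, so characters such as "." or "(" change its meaning. If the needle should be matched literally, it can be wrapped with Pattern.quote. The following is a minimal sketch; the class name LiteralRegexCounter and the method countLiteral are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralRegexCounter {

    /** Counts non-overlapping literal occurrences of needle in haystack. */
    static int countLiteral(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0; // assumption: an empty needle counts as zero matches
        }
        // Pattern.quote makes regex metacharacters in needle match themselves.
        Matcher m = Pattern.compile(Pattern.quote(needle)).matcher(haystack);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countLiteral("a.b.c", "."));                          // 2
        System.out.println(countLiteral("ABCDSGDABCSAGAABCCCCAAABAABC", "ABC")); // 4
    }
}
```

Without the quoting, the pattern "." would match every character of "a.b.c" and the count would be 5.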

Have a look at http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

public class NeedleCount
{
public static void main(String[] args)
{
String s="AVBVDABCHJHDFABCJKHKHF",ned="ABC";
int nedIndex=-1,count=0,totalNed=0;
for(int i=0;i<s.length();i++)
{
if(i>ned.length()-1)
nedIndex++;
else
nedIndex=i;
if(s.charAt(i)==ned.charAt(nedIndex))
count++;
else
{
nedIndex=0;
count=0;
if(s.charAt(i)==ned.charAt(nedIndex))
count++;
else
nedIndex=-1;
}
if(count==ned.length())
{
nedIndex=-1;
count=0;
totalNed++;
System.out.println(totalNed+" needle found at index="+(i-(ned.length()-1)));
}
}
System.out.print("Total Ned="+totalNed);
}
}

As others have asked: better in what sense? A regexp-based solution will be the most concise and readable (:-) ). Boyer-Moore (http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algorithm) will be the most efficient in terms of time (O(N)).

If you don't mind implementing a new data structure as a replacement for strings, have a look at tries: http://c2.com/cgi/wiki?StringTrie or http://en.wikipedia.org/wiki/Trie
If you are looking for an exact match rather than a regular expression, they should provide the fastest solution (proportional to the length of the search string).

public class FindNeedleInHaystack {
String hayStack="ASDVKDBGKBCDGFLBJADLBCNFVKVBCDXKBXCVJXBCVKFALDKBJAFFXBCD";
String needle="BCD";
boolean flag=false;
public void findNeedle() {
//The for loop below visits each starting position at which the needle can still fit
for(int i=0;i<=hayStack.length()-needle.length();i++) {
//When i=n (0,1,2...) we are at the nth character of hayStack. Start by comparing the nth char of hayStack with the first char of needle
if(hayStack.charAt(i)==needle.charAt(0)) {
//If the condition is true, we reach the for loop that iterates over the needle's length.
//Now the first char of needle (BCD) is 'B' and the nth char of hayStack is 'B', so compare the remaining characters of needle with hayStack using the loop below.
for(int j=0;j<needle.length();j++) {
//for example, at i=9 the char is 'B'; i+j is i+0,i+1,i+2...
//if the condition is true the loop continues; otherwise it breaks and moves on to i+1
if(hayStack.charAt(i+j)==needle.charAt(j)) {
flag=true;
} else {
flag=false;
break;
}
}
if(flag) {
System.out.print(i+" ");
}
}
}
}
}

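The code below runs in O(n) time because it makes a single pass over the n characters of the haystack. If you want to capture the start and end indexes of each match, uncomment the commented-out lines. The solution works directly on characters and uses no Java String functions (pattern matching, indexOf, substring, etc.), as they may add extra space/time complexity. Note that after a partial match the scan resumes past the characters already compared, so an occurrence that begins inside a failed partial match is missed (for example, "AAB" in "AAAB").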
char[] needleArray = needle.toCharArray();
char[] hayStackArray = hayStack.toCharArray();
//java.util.LinkedList<Pair<Integer,Integer>> indexList = new LinkedList<>();
int head;
int tail = 0;
int needleCount = 0;
while(tail<hayStackArray.length){
head = tail;
boolean proceed = false;
for(int j=0;j<needleArray.length;j++){
if(head+j<hayStackArray.length && hayStackArray[head+j]==needleArray[j]){
tail = head+j;
proceed = true;
}else{
proceed = false;
break;
}
}
if(proceed){
// indexList.add(new Pair<>(head,tail));
needleCount++;
}
++tail;
}
System.out.println(needleCount);
//System.out.println(indexList);
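If occurrences must not be missed when the needle overlaps with itself (for example "AAB" in "AAAB"), a Knuth-Morris-Pratt scan, as mentioned in another answer, keeps the linear running time while remaining correct. The following is a sketch rather than a tuned implementation; the class name KmpCounter is hypothetical, and it counts overlapping occurrences.

```java
final class KmpCounter {

    /** Counts occurrences of needle in haystack, overlapping ones included. */
    static int count(String haystack, String needle) {
        int m = needle.length();
        if (m == 0) {
            return 0; // assumption: an empty needle counts as zero matches
        }
        // failure[i] = length of the longest proper prefix of needle[0..i]
        // that is also a suffix of needle[0..i]
        int[] failure = new int[m];
        for (int i = 1, k = 0; i < m; i++) {
            while (k > 0 && needle.charAt(i) != needle.charAt(k)) {
                k = failure[k - 1];
            }
            if (needle.charAt(i) == needle.charAt(k)) {
                k++;
            }
            failure[i] = k;
        }
        int found = 0;
        // k is the number of needle characters currently matched.
        for (int i = 0, k = 0; i < haystack.length(); i++) {
            while (k > 0 && haystack.charAt(i) != needle.charAt(k)) {
                k = failure[k - 1]; // reuse the part of the match that is still valid
            }
            if (haystack.charAt(i) == needle.charAt(k)) {
                k++;
            }
            if (k == m) {
                found++;
                k = failure[k - 1]; // continue, allowing overlapping matches
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(count("ABCDSGDABCSAGAABCCCCAAABAABC", "ABC")); // 4
        System.out.println(count("AAAB", "AAB"));                         // 1
    }
}
```

The haystack index never moves backwards, so the scan does O(n + m) character comparisons in total.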
