Extract sub-string between two certain words using regex in java


I would like to extract the substring between two given words using Java.
For example:
This is an important example about regex for my work.
I would like to extract everything between "an" and "for".
What I have done so far is:
String sentence = "This is an important example about regex for my work and for me";
Pattern pattern = Pattern.compile("(?<=an).*.(?=for)");
Matcher matcher = pattern.matcher(sentence);
boolean found = false;
while (matcher.find()) {
System.out.println("I found the text: " + matcher.group().toString());
found = true;
}
if (!found) {
System.out.println("I didn't find the text");
}
It works well, but I want to do two additional things:
If the sentence is "This is an important example about regex for my work and for me", I want to extract only up to the first "for", i.e. "important example about regex".
Sometimes I want to limit the number of words between the two patterns to 3, i.e. "important example about".
Any ideas, please?

For your first question, make the quantifier lazy. Put a question mark after the quantifier and it will match as little as possible.
(?<=an).*?(?=for)
The additional . after .* in your pattern is unnecessary.
For your second question, you have to define what a "word" is. Here I would probably say it is just a sequence of non-whitespace characters followed by a whitespace character. Something like this:
\S+\s
and repeat it 3 times like this:
(?<=an)\s(\S+\s){3}(?=for)
To ensure that the pattern matches on whole words, use word boundaries:
(?<=\ban\b)\s(\S+\s){1,5}(?=\bfor\b)
See it online here on Regexr
{3} matches exactly 3 repetitions. For a minimum of 1 and a maximum of 3, use {1,3}. Note that because of the trailing lookahead, these patterns only match when "for" directly follows the counted words; to take the first few words regardless of where "for" appears, drop the lookahead.
Alternative:
As dma_k correctly stated, in your case it is not necessary to use lookbehind and lookahead. See the Matcher documentation about groups.
You can use capturing groups instead. Just put the part you want to extract in parentheses and it will be put into a capturing group.
\ban\b(.*?)\bfor\b
See it online here on Regexr
You can then access this group like this:
System.out.println("I found the text: " + matcher.group(1).toString());
You have only one pair of parentheses, so it's simple: pass 1 to matcher.group(1) to access the first capturing group.
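As a rough sketch, the lazy capturing group and the word limit could be put together as follows. The class name BetweenWords and its method names are made up for illustration and do not come from the answer. The second pattern is an assumed variant that takes at most three words after "an" and does not look for "for" at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BetweenWords {
    // Lazy capture: text between the first whole word "an" and the nearest following whole word "for".
    private static final Pattern UP_TO_FIRST_FOR =
            Pattern.compile("\\ban\\b\\s+(.*?)\\s+\\bfor\\b");

    // Up to three whitespace-separated words directly after the whole word "an".
    // Assumption: this variant ignores "for" completely.
    private static final Pattern FIRST_THREE_WORDS =
            Pattern.compile("\\ban\\b\\s+((?:\\S+\\s+){0,2}\\S+)");

    // Returns capturing group 1 of the first match, or null when nothing matches.
    private static String firstGroup(Pattern pattern, String sentence) {
        Matcher matcher = pattern.matcher(sentence);
        return matcher.find() ? matcher.group(1) : null;
    }

    static String upToFirstFor(String sentence) {
        return firstGroup(UP_TO_FIRST_FOR, sentence);
    }

    static String firstThreeWords(String sentence) {
        return firstGroup(FIRST_THREE_WORDS, sentence);
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";
        System.out.println(upToFirstFor(sentence));    // important example about regex
        System.out.println(firstThreeWords(sentence)); // important example about
    }
}
```

Compiling the patterns once as constants avoids recompiling them on every call.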

Your regex is "an\\s+(.*?)\\s+for". It extracts all characters between an and for, ignoring the surrounding whitespace (\s+). The question mark makes the quantifier lazy (non-greedy). It is needed to prevent .* from consuming everything up to the last "for".

public class SubStringBetween {
public static String subStringBetween(String sentence, String before, String after) {
int startSub = SubStringBetween.subStringStartIndex(sentence, before);
int stopSub = SubStringBetween.subStringEndIndex(sentence, after);
String newWord = sentence.substring(startSub, stopSub);
return newWord;
}
public static int subStringStartIndex(String sentence, String delimiterBeforeWord) {
int startIndex = 0;
String newWord = "";
int x = 0, y = 0;
for (int i = 0; i < sentence.length(); i++) {
newWord = "";
if (sentence.charAt(i) == delimiterBeforeWord.charAt(0)) {
startIndex = i;
for (int j = 0; j < delimiterBeforeWord.length(); j++) {
try {
if (sentence.charAt(startIndex) == delimiterBeforeWord.charAt(j)) {
newWord = newWord + sentence.charAt(startIndex);
}
startIndex++;
} catch (Exception e) {
}
}
if (newWord.equals(delimiterBeforeWord)) {
x = startIndex;
}
}
}
return x;
}
public static int subStringEndIndex(String sentence, String delimiterAfterWord) {
int startIndex = 0;
String newWord = "";
int x = 0;
for (int i = 0; i < sentence.length(); i++) {
newWord = "";
if (sentence.charAt(i) == delimiterAfterWord.charAt(0)) {
startIndex = i;
for (int j = 0; j < delimiterAfterWord.length(); j++) {
try {
if (sentence.charAt(startIndex) == delimiterAfterWord.charAt(j)) {
newWord = newWord + sentence.charAt(startIndex);
}
startIndex++;
} catch (Exception e) {
}
}
if (newWord.equals(delimiterAfterWord)) {
x = startIndex;
x = x - delimiterAfterWord.length();
}
}
}
return x;
}
}

Related

Pig it method that I am trying to make trouble checking punctuation at the end java

I am trying to answer this question.
Move the first letter of each word to the end of it, then add "ay" to the end of the word. Leave punctuation marks untouched.
This is what I did so far:
public static String pigIt(String str) {
//Populating the String argument into the String Array after splitting them by spaces
String[] strArray = str.split(" ");
System.out.println("\nPrinting strArray: " + Arrays.toString(strArray));
String toReturn = "";
for (int i = 0; i < strArray.length; i++) {
String word = strArray[i];
for (int j = 1; j < word.length(); j++) {
toReturn += Character.toString(word.charAt(j));
}
//Outside of inner for loop
if (!(word.contains("',.!?:;")) && (i != strArray.length - 1)) {
toReturn += Character.toString(word.charAt(0)) + "ay" + " ";
} else if (word.contains("',.!?:;")) {
toReturn += Character.toString(word.charAt(0)) + "ay" + " " + strArray[strArray.length - 1];
}
}
return toReturn;
}
It is supposed to return the punctuation mark without adding "ay" to it. I think I am overthinking this, but please help.

One of the problems here is that your else if branch is never executed. The contains method does not work with multiple characters like that unless you are trying to match them all in sequence. Your conditions essentially ask whether the word contains the entire string "',.!?:;". If you keep only the exclamation point in there, the branch will be reached for words containing "!". To use contains you would need a condition for each character, such as word.contains("!") || word.contains(",") || word.contains("'"), and so on. You can also use regex for this problem.
Alternatively, you can use something like
if(!Character.isAlphabetic(yourString.charAt(i))) {
to determine whether a character is not alphabetic, i.e. a digit, symbol, or punctuation mark.
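A minimal sketch of both checks, assuming a hypothetical helper class PunctuationCheck (the names are illustrative only). It tests each character of the word against a set of marks instead of passing the whole set to contains:

```java
final class PunctuationCheck {
    // Returns true if the word contains at least one character from marks.
    static boolean containsAnyOf(String word, String marks) {
        for (int i = 0; i < word.length(); i++) {
            if (marks.indexOf(word.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    // Returns true if any character of the word is not alphabetic.
    static boolean hasNonAlphabetic(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (!Character.isAlphabetic(word.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String marks = "',.!?:;";
        System.out.println("World!".contains(marks));       // false: the whole string is searched for
        System.out.println(containsAnyOf("World!", marks)); // true
        System.out.println(hasNonAlphabetic("World"));      // false
    }
}
```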

I think the best way is not to rely on str.split("\\s++"), because punctuation can appear in any place. It is better to scan the string and find all characters that are not letters or digits. After that you can determine the word boundaries and translate each word.
public static String pigIt(String str) {
StringBuilder buf = new StringBuilder();
for (int i = 0, j = 0; j <= str.length(); j++) {
char ch = j < str.length() ? str.charAt(j) : '\0';
if (Character.isLetterOrDigit(ch))
continue;
if (i < j) {
buf.append(str.substring(i + 1, j));
buf.append(str.charAt(i));
buf.append("ay");
}
if (ch != '\0')
buf.append(ch);
i = j + 1;
}
return buf.toString();
}
Output:
System.out.println(pigIt(",Hello, !World")); // ,elloHay, !orldWay

Regex may be difficult to start with but is very powerful:
public static String pigIt(String str) {
return str.replaceAll("([a-zA-Z])([a-zA-Z]*)", "$2$1ay");
}
The () specify groups. So I have one group with the first alphabetic character and a second group with the remaining alphabetic characters.
In the replacement parameter you can refer to these groups ($1, $2).
String.replaceAll searches for all matching parts of the string and applies the replacement. Non-matching characters such as punctuation marks are left untouched.
public static void main(String[] args) {
System.out.println("Hello, World, ! -->"+ pigIt("Hello, World, !"));
System.out.println("Hello?, Wo$, F, ! -->"+ pigIt("Hello?, Wo$, F, !"));
}
The output of this method is:
Hello, World, ! -->elloHay, orldWay, !
Hello?, Wo$, F, ! -->elloHay?, oWay$, Fay, !

How to return only the strings ending in a punctuation mark?

I want to build a method, which returns only the strings that end with a punctuation mark. The problem is that when I compile it says that it couldn't find the 'i', so what should I do?
public static String ktheFjalite(String[] s){
int nrFjaleve = 0;
int nrZanoreve = 0;
for (int j = 0; j <s.length; j++){
if (s[j].charAt(j) == ' '){
nrFjaleve++;
}
}
for (int k = 0; k < s.length; k++){
if (s[k].contains("a")||s[k].contains("e")||s[k].contains("i")||s[k].contains("o")||s[k].contains("u")||s[k].contains("y")){
nrZanoreve++;
}
}
if(s[i].endsWith(".")||s[i].endsWith("!")||s[i].endsWith("?")||s[i].endsWith("...")){
if(nrFjaleve<=6){
if(nrZanoreve<=8) {
return (Arrays.toString(s));
}
}
}
}

Another way, not using regex or patterns:
Consider String marks = "..."; where the ellipsis represents all of the characters that you consider to be punctuation marks.
Then note that the final character of a String s is char c = s.charAt(s.length() - 1); (minus one, otherwise you get a StringIndexOutOfBoundsException).
Then marks.indexOf(c) >= 0 will be true if the last character of the String is a punctuation mark. (marks.contains(c) does not compile, because contains expects a CharSequence, not a char.)
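The idea above might look like this in full. This is only a sketch: the class name EndsWithMark, the method names, and the chosen set of marks (".!?") are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class EndsWithMark {
    // Assumed set of punctuation marks; adjust it to whatever you count as punctuation.
    private static final String MARKS = ".!?";

    // Returns true if the last character of s is one of MARKS.
    static boolean endsWithPunctuation(String s) {
        if (s == null || s.isEmpty()) {
            return false; // no last character to inspect
        }
        char last = s.charAt(s.length() - 1);
        return MARKS.indexOf(last) >= 0;
    }

    // Keeps only the strings that end with a punctuation mark.
    static List<String> keepEndingInPunctuation(String[] items) {
        List<String> result = new ArrayList<>();
        for (String item : items) {
            if (endsWithPunctuation(item)) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] items = {"Hello.", "world", "Why?", ""};
        System.out.println(keepEndingInPunctuation(items)); // [Hello., Why?]
    }
}
```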

I would use regex to find all words that end with punctuation and return them in a list.
private static List<String> findPunctuation(String sentence) {
List<String> neededWords = new ArrayList<String>();
String[] words = sentence.split(" ");
for (String word : words)
if (word.matches(".*[\\?!\\.]"))
neededWords.add(word);
return neededWords;
}
This may not be a perfect solution for you but it should give you some direction.
EDIT
If you only want to return the word when the sentence contains a certain number of words, you can achieve that with this:
private static String findPunctuation(String sentence) {
Pattern p = Pattern.compile("(?:.* ){6,8}(.*[\\?!\\.])$");
Matcher m = p.matcher(sentence);
while(m.find())
return m.group(1);
return null;
}
Hope this helps.

How do I reverse words in string that has line feed (\n or \r)?

I have a string as follows:
String sentence = "I have bananas\r" +
"He has apples\r" +
"I own 3 cars\n" +
"*!";
I'd like to reverse this string so as to have an output like this:
"*!" +
"\ncars 3 own I" +
"\rapples has He" +
"\rbananas have I"
Here is a program I wrote.
public static String reverseWords(String sentence) {
StringBuilder str = new StringBuilder();
String[] arr = sentence.split(" ");
for (int i = arr.length -1; i>=0; i--){
str.append(arr[i]).append(" ");
}
return str.toString();
}
But I don't get the output as expected. What is wrong?

The problem is you are only splitting on spaces, but that is not the only type of whitespace in your sentence. You can use the pattern \s to match all whitespace. However, then you don't know what to put back in that position after the split. So instead we will split on the zero-width position in front of or behind a whitespace character.
Change your split to this:
String[] arr = sentence.split("(?<=\\s)|(?=\\s)");
Also, now that you are preserving the whitespace characters, you no longer need to append them. So change your append to this:
str.append(arr[i]);
The final problem is that your output will be garbled due to the presence of \r. So, if you want to see the result clearly, you should replace those characters. For example:
System.out.println(reverseWords(sentence).replaceAll("\\r","\\\\r").replaceAll("\\n","\\\\n"));
This modified code now gives the desired output.
Output:
*!\ncars 3 own I\rapples has He\rbananas have I
Note:
Since you are freely mixing \r and \n, I did not add any code to treat \r\n as a special case, which means that it will be reversed to become \n\r. If that is a problem, then you will need to prevent or undo that reversal.
For example, this slightly more complex regex will prevent us from reversing any consecutive whitespace characters:
String[] arr = sentence.split("(?<=\\s)(?!\\s)|(?<!\\s)(?=\\s)");
The above regex will match the zero-width position where there is whitespace behind but not ahead OR where there is whitespace ahead but not behind. So it won't split in the middle of consecutive whitespaces, and the order of sequences such as \r\n will be preserved.
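Put together, the changes described in this answer could look like the sketch below. The class name ReverseWords is chosen for illustration, and it uses the second split pattern so that runs of consecutive whitespace such as \r\n stay in their original order.

```java
final class ReverseWords {
    // Splits at the boundaries between whitespace runs and non-whitespace runs,
    // so the whitespace itself is kept as separate tokens.
    static String reverseWords(String sentence) {
        String[] tokens = sentence.split("(?<=\\s)(?!\\s)|(?<!\\s)(?=\\s)");
        StringBuilder out = new StringBuilder();
        for (int i = tokens.length - 1; i >= 0; i--) {
            out.append(tokens[i]); // whitespace tokens are re-emitted unchanged
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sentence = "I have bananas\r" +
                "He has apples\r" +
                "I own 3 cars\n" +
                "*!";
        // Show \r and \n as visible escape sequences so the console does not garble the result.
        System.out.println(reverseWords(sentence).replace("\r", "\\r").replace("\n", "\\n"));
        // *!\ncars 3 own I\rapples has He\rbananas have I
    }
}
```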

The logic behind this question is simple. There are two steps to achieve the OP's target:
reverse the whole string;
reverse each word in place (words being separated by whitespace);
Instead of using StringBuilder, I prefer char[] for this, which is easy to understand.
The local test code is:
public class WordReverse {
public static void main(String... args) {
String s = "I have bananas\r" +
"He has apples\r" +
"I own 3 cars\n" +
"*!";
System.out.println(reverseSentenceThenWord(s));
}
/**
* Reverses the order of the words (runs of non-whitespace characters) in s.
* @param s the input string; returned unchanged if it is null or empty
* @return the string with its words in reverse order
*/
private static String reverseSentenceThenWord(String s) {
if (s == null || s.length() == 0) return s;
char[] arr = s.toCharArray();
int len = arr.length;
reverse(arr, 0, len - 1);
boolean inWord = !isSpace(arr[0]); // used to track the start and end of a word;
int start = inWord ? 0 : -1; // is the start valid?
for (int i = 0; i < len; ++i) {
if (!isSpace(arr[i])) {
if (!inWord) {
inWord = true;
start = i; // just set the start index of the new word;
}
} else {
if (inWord) { // from word to space, we do the reverse for the traversed word;
reverse(arr, start, i - 1);
}
inWord = false;
}
}
if (inWord) reverse(arr, start, len - 1); // reverse the last word if it ends the sentence;
String ret = new String(arr);
// ret = showWhiteSpaces(ret);
// uncomment the line above to show whitespace characters as escape sequences;
return ret;
}
private static void reverse(char[] arr, int i, int j) {
while (i < j) {
char c = arr[i];
arr[i] = arr[j];
arr[j] = c;
i++;
j--;
}
}
private static boolean isSpace(char c) {
return String.valueOf(c).matches("\\s");
}
private static String showWhiteSpaces(String s) {
String[] hidden = {"\t", "\n", "\f", "\r"};
String[] show = {"\\\\t", "\\\\n", "\\\\f", "\\\\r"};
for (int i = hidden.length - 1; i >= 0; i--) {
s = s.replaceAll(hidden[i], show[i]);
}
return s;
}
}
On my PC the console output does not look like what the OP provided, because the terminal interprets the \r characters:
*!
bananas have I
However, if you set a breakpoint and inspect the returned string in the debugger, it is the right answer.
UPDATE
Now, if you would like to show the escaped whitespaces, you can just uncomment this line before returning the result:
// ret = showWhiteSpaces(ret);
And the final output will be exactly the same as expected in the OP's question:
*!\ncars 3 own I\rapples has He\rbananas have I

Take a look at the output you're after carefully. You actually need two iteration steps here - you first need to iterate over all the lines backwards, then all the words in each line backwards. At present you're just splitting once by space (not by new line) and iterating over everything returned in that backwards, which won't do what you want!
Take a look at the example below - I've kept closely to your style and just added a second loop. It first iterates over new lines (either by \n or by \r, since split() takes a regex), then by words in each of those lines.
Note, however, that this comes with a caveat: it won't preserve the \r and the \n. For that you'd need to use lookahead / lookbehind in your split to preserve the delimiters.
public static String reverseWords(String sentence) {
StringBuilder str = new StringBuilder();
String[] lines = sentence.split("[\n\r]");
for (int i = lines.length - 1; i >= 0; i--) {
String[] words = lines[i].split(" ");
for (int j = words.length - 1; j >= 0; j--) {
str.append(words[j]).append(" ");
}
str.append("\n");
}
return str.toString();
}

Check for multiple occurrence of certain character in string

Edit: To those who downvoted, this question is different from the linked duplicate. The other question is about returning the indexes. In my case, I do not need the index; I just want to check whether there is more than one occurrence.
This is my code:
String word = "ABCDE<br>XYZABC";
String[] keywords = word.split("<br>");
for (int index = 0; index < keywords.length; index++) {
if (keywords[index].toLowerCase().contains(word.toLowerCase())) {
if (index != (keywords.length - 1)) {
endText = keywords[index];
definition.setText(endText);
}
}
}
My problem is that if the keyword is "ABC", the string endText will only show "ABCDE". However, "XYZABC" contains "ABC" as well. How can I check whether the string has multiple occurrences? I would like the definition TextView to show definition.setText(endText + "More"); if there are multiple occurrences.
I tried this. The code works, but it makes my app very slow. I guess the reason is that I get the String word through a TextWatcher.
String[] keywords = word.split("<br>");
for (int index = 0; index < keywords.length; index++) {
if (keywords[index].toLowerCase().contains(word.toLowerCase())) {
if (index != (keywords.length - 1)) {
int i = 0;
Pattern p = Pattern.compile(search.toLowerCase());
Matcher m = p.matcher( word.toLowerCase() );
while (m.find()) {
i++;
}
if (i > 1) {
endText = keywords[index];
definition.setText(endText + " More");
} else {
endText = keywords[index];
definition.setText(endText);
}
}
}
}
Is there any faster way?

It's a little hard for me to understand your question, but it sounds like:
You have some string (e.g. "ABCDE<br>XYZABC"). You also have some target text (e.g. "ABC"). You want to split that string on a delimiter (e.g. "<br>"), and then:
If exactly one substring contains the target, display that substring.
If more than one substring contains the target, display the last substring that contains it plus the suffix "More"
In your posted code, the performance is really slow because of the Pattern.compile() call. Re-compiling the Pattern on every loop iteration is very costly. Luckily, there's no need for regular expressions here, so you can avoid that problem entirely.
String search = "ABC".toLowerCase();
String word = "ABCDE<br>XYZABC";
String[] keywords = word.split("<br>");
int count = 0;
for (String keyword : keywords) {
if (keyword.toLowerCase().contains(search)) {
++count;
endText = keyword;
}
}
if (count > 1) {
definition.setText(endText + " More");
}
else if (count == 1) {
definition.setText(endText);
}

You are doing it correctly, but there is an unnecessary check: if (index != (keywords.length - 1)). It ignores a match in the last element of the keywords array. I am not sure whether that is part of your requirement.
To improve performance, break out of the loop as soon as you find the second match. You don't need to check any further.
public static void main(String[] args) {
String word = "ABCDE<br>XYZABC";
String pattern = "ABC";
String[] keywords = word.split("<br>");
String endText = "";
int count = 0;
for (int index = 0; index < keywords.length; index++) {
if (keywords[index].toLowerCase().contains(pattern.toLowerCase())) {
// Reaching this point means a match was found.
if(count == 1) {
// This is the second match, so break out of the loop; no need to check any further.
// Keep the first matching string and append " more" to it.
endText += " more";
break;
}
endText = keywords[index];
count++;
}
}
System.out.println(endText);
}
This will print ABCDE more

You have to write your condition like this:
if (word.toLowerCase().contains(keywords[index].toLowerCase()))

You can use this:
String word = "ABCDE<br>XYZABC";
String[] keywords = word.split("<br>");
for (int i = 0; i < keywords.length - 1; i++) {
int c = 0;
Pattern p = Pattern.compile(keywords[i].toLowerCase());
Matcher m = p.matcher(word.toLowerCase());
while (m.find()) {
c++;
}
if (c > 1) {
definition.setText(keywords[i] + " More");
} else {
definition.setText(keywords[i]);
}
}
But as I mentioned in a comment, there is no double occurrence in the word "ABCDE<br>XYZABC" when you split it by <br>.
If you use the word "ABCDE<br>XYZABCDE" instead, there are two occurrences of "ABCDE".

void test() {
String word = "ABCDE<br>XYZABC";
String sequence = "ABC";
if(word.replaceFirst(sequence,"{---}").contains(sequence)){
int startIndex = word.indexOf(sequence);
int endIndex = word.indexOf("<br>");
Log.v("test",word.substring(startIndex,endIndex)+" More");
}
else{
//your code
}
}
Try this

How to find and replace 3 or more chars with 3 or more chars in a String Java?

I need to check whether the line contains strings that must be eliminated and indicate which symbols would be eliminated.
A character sequence is replaced by underscores ("_"), as many as the sequence is long, if there are three or more contiguous characters with the same symbol. For example, the line ", _, #, #, #, #, $, $, , #, #,!" would be transformed into ", _, _, _, _, _, _, $, $, _, #, #,!" after the process of elimination.
I need to do this only with String or StringBuilder, regex, etc. (only basic Java).
I also can't use arrays.
Thanks in advance.
This is what I tried:
public static void main(String[] args) {
String linha = "##,$$$$,%%%%,#%###,!!!!", validos = "$#%!#";
for (int i = 0; i < validos.length(); i++) {
linha = linha.replaceAll("\\" + validos.charAt(i) + "{3,}", "_");
}
System.out.println (linha);
}
The problem here is that it replaces a sequence with just one "_", and I don't know which chars are replaced.

Surely you can do this in many ways, and probably this is a good exercise to do by yourself. Here you have a basic implementation using just basic loop structures and nothing fancy like StringUtils libraries... Note that your previous loop implementation would have missed several occurrences of the same character repeated in different locations of linha.
static int index(String lookInStr, char lookUpChr) {
return lookInStr.indexOf(new String(new char[] { lookUpChr, lookUpChr, lookUpChr }));
}
public static void main(String[] args) {
String linha = "####,########,$$$$,%%%%,#%###,!!!!", validos = "$#%!#";
for (int i = 0; i < validos.length(); i++) {
char currentSearchChar = validos.charAt(i);
do {
int index = index(linha, currentSearchChar);
if (index >= 0) {
int count = -1;
do {
count++;
} while (linha.charAt(count + index) == currentSearchChar && count + index < linha.length() - 1);
String replacementSeq = "";
for (int j = 0; j < count; j++) {
replacementSeq += "-";
}
linha = linha.replaceAll("\\" + validos.charAt(i) + "{" + count + ",}", replacementSeq);
}
} while (index(linha, currentSearchChar) >= 0);
}
System.out.println(linha);
}

If you are trying to replace three characters at once, and you want three underscores instead, you are just missing this:
linha = linha.replaceAll("\\" + validos.charAt(i) + "{3,}", "___");
If you want them separated by commas:
linha = linha.replaceAll("\\" + validos.charAt(i) + "{3,}", "_,_,_");

Basically, this splits the string into separate blocks, then checks the length of the blocks and either returns the original block, or replaces it with underscores.
static String convert(String s) {
StringBuilder sb = new StringBuilder();
for(int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
StringBuilder tempSb = new StringBuilder();
for(; i < s.length(); i++) {
char d = s.charAt(i);
if(d != c) {
i--;
break;
} else {
tempSb.append(d);
}
}
String t = tempSb.toString();
if(t.length() < 3) {
sb.append(t);
} else {
sb.append(repeat("_", t.length()));
}
}
return sb.toString();
}
public static void main(String[] args) {
String x = convert("##,$$$$,%%%%,#%###,!!!!");
System.out.println(x); // ##,____,____,#%___,____
}
And here's the simple repeat method:
static String repeat(String s, int repeatCount) {
StringBuilder sb = new StringBuilder();
for(int i = 0; i < repeatCount; i++) {
sb.append(s);
}
return sb.toString();
}

I haven't actually implemented this, but here is something you may look at:
Matcher has find(int start), start() and end().
Build a pattern for the "3 or more repeated characters" case (you may refer to the comment on your question).
Pseudocode is something like this:
int lastEndingPosition = 0;
StringBuilder sb;
while (matcher can find next group) {
// add the unmatched part
sb.append( substring of input string from lastEndingPosition to matcher.start() );
// add the matched part
sb.append( "-" for matcher.end() - matcher.start() times);
lastEndingPosition = matcher.end();
}
sb.append( substring of input string from lastEndingPosition to the end);
There are probably more elegant ways to do this. This is just one alternative.
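The pseudocode can be turned into runnable Java roughly as follows. This is a sketch under assumptions: the class name RunMasker is made up, the pattern ([$#%!])\1{2,} is one possible way to express "three or more of the same valid symbol", and it writes underscores (as the question asks) rather than the "-" used in the pseudocode.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RunMasker {
    // One of the valid symbols followed by at least two more copies of itself.
    private static final Pattern RUN = Pattern.compile("([$#%!])\\1{2,}");

    static String maskRuns(String line) {
        Matcher matcher = RUN.matcher(line);
        StringBuilder sb = new StringBuilder();
        int lastEnd = 0;
        while (matcher.find()) {
            // Copy the unmatched part between the previous run and this one.
            sb.append(line, lastEnd, matcher.start());
            // Emit one underscore per character of the matched run.
            for (int i = matcher.start(); i < matcher.end(); i++) {
                sb.append('_');
            }
            lastEnd = matcher.end();
        }
        // Copy whatever follows the last run.
        sb.append(line, lastEnd, line.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(maskRuns("##,$$$$,%%%%,#%###,!!!!")); // ##,____,____,#%___,____
    }
}
```

Inside the loop, matcher.group(1) holds the symbol that is being replaced, which could be used to report which characters were eliminated.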
