Regular Expression to find words separated with space, backtracking - java

I have to find words separated by spaces. What is the best practice for doing this with the least backtracking?
I found this solution:
Regex: \d+\s([a-zA-Z]+\\s{0,1}){1,} in a sentence
Input: 1234 this is words in a sentence
So "this is words" has to be matched by the regex ([a-zA-Z]+\\s{0,1}){1,}, and "in a sentence" has to be matched by the constant words in the regex.
But in this case the regex101.com debugger reports 4156 steps and flags it as catastrophic backtracking. Is there any way to avoid it?
I have another, more complicated example where it takes 86000 steps and does not validate.
The main problem is that I have to find all words separated by spaces, but at the same time the regex itself contains constant words separated by spaces. This is where I get catastrophic backtracking.
I have to do this in Java.

You want to find words separated by spaces, so you should require at least one space after each word. You can use this instead, which takes just 37 steps:
\d+\s([a-zA-Z]+\s+)+in a sentence
See demo.
https://regex101.com/r/tD0dU9/4
For Java, double every backslash, e.g. \d becomes \\d.
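To make the escaping rule concrete, here is a minimal sketch of the pattern above written as a Java string literal. The class and method names are made up for illustration. The possessive variant (`++`) is an optional addition that is not part of the answer: it assumes you never need the engine to retry a shorter run of letters or spaces.

```java
import java.util.regex.Pattern;

class WordsBeforeConstant {
    // The answer's regex, with every backslash doubled for a Java string literal.
    static final Pattern GREEDY =
            Pattern.compile("\\d+\\s([a-zA-Z]+\\s+)+in a sentence");

    // Assumed variant: possessive quantifiers never give characters back,
    // so only the outer group repetition can backtrack.
    static final Pattern POSSESSIVE =
            Pattern.compile("\\d++\\s(?:[a-zA-Z]++\\s++)+in a sentence");

    static boolean matchesGreedy(String input) {
        return GREEDY.matcher(input).matches();
    }

    static boolean matchesPossessive(String input) {
        return POSSESSIVE.matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "1234 this is words in a sentence";
        System.out.println(matchesGreedy(input));     // true
        System.out.println(matchesPossessive(input)); // true
    }
}
```

Both patterns still let the outer group hand back "in " and "a " so that the constant phrase can match; the mandatory whitespace is what removes the ambiguity that caused the step count to explode.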

You could try splitting the String into a String array, then count the members of the array that match your definition of a word, skipping any that do not (e.g. whitespace or punctuation):
String[] mySplitString = myOriginalString.split(" ");
int words = 0; // running word count
for (int x = 0; x < mySplitString.length; x++) {
    // Replace "\\w.*" with your own regex for a word.
    if (mySplitString[x].matches("\\w.*")) words++;
}
mySplitString is an array of the substrings of the original string that were separated by a single space. The for loop walks the array, checks that each element is a word (here, that it starts with a letter, digit, or underscore), and adds it to the total word count. Empty elements, which appear when two spaces are adjacent, do not match and are skipped.

If I understood correctly, you want to match any word separated by spaces, plus the phrase "in a sentence".
You can try the following solution:
(in a sentence)|(\S+)
As seen in this example on regex101, the regex matches in 61 steps.
You might have problems with punctuation after the "in a sentence" phrase, so run some tests.
I hope this was helpful.
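A minimal sketch of how the alternation above could be used from Java, assuming you want to collect the free words and skip the constant phrase. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConstantOrWord {
    // Group 1 is the constant phrase, group 2 is any other run of non-space characters.
    static final Pattern TOKEN = Pattern.compile("(in a sentence)|(\\S+)");

    // Returns only the free words, skipping the constant phrase.
    static List<String> freeWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(2) != null) {
                words.add(m.group(2));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(freeWords("1234 this is words in a sentence"));
        // prints [1234, this, is, words]
    }
}
```

Because the constant phrase is the first alternative, it wins whenever both alternatives could match at the same position.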

Related

RegEx for matching two words with two capital letters

I am creating a business card parser and am running into issues with the regex. I have a line that I am reading from my file - String s.
I need to be able to grab a line that contains two words and only two capital letters, and that does not contain certain words. Below is the regex I have used in the past that works, but I want to build this if/else statement with .matches and !.matches
else if ((!s.matches(".*\\b(Technologies|Engineer|Systems|Developer|Company|INC|Analyst|Computers|Technology|#)\\b.*") && (s.matches("^(?!(.*[A-Z]){3,})[a-zA-Z]+ [a-zA-Z]+$"))))
{
getName();
}
I'm not sure whether this RegEx is what you are looking for.
Input
Technologies Word Word word
Engineer Word Word word
Systems Word word word
Developer Word word word
Company Word word word
INC Word Word Word
Analyst Word word word
Computers Word word word
Technology Word word word
If not, you can use that same tool to design a RegEx; you only need to add {2} at the end to repeat the group twice.
To exclude certain words you may not need a second match. You can add the list of words you wish to exclude at the beginning of the same RegEx:
^(?!Technologies|Engineer|Anything|Else|You|Wish)([A-Z][a-z]+\s){2}
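A possible single-pattern check in Java, sketched under some assumptions: the blocked words are taken from the question, the lookahead is widened so a blocked word is rejected anywhere in the line (not only at the start), and `$` replaces the trailing `\s` so a line such as "Jane Doe" without a trailing space can match. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class NameLineCheck {
    // Assumed variation of the regex above: the lookahead rejects a blocked
    // word anywhere in the line, and "$" replaces the trailing "\s".
    static final Pattern NAME = Pattern.compile(
            "^(?!.*\\b(?:Technologies|Engineer|Systems|Developer|Company|INC"
            + "|Analyst|Computers|Technology)\\b)[A-Z][a-z]+ [A-Z][a-z]+$");

    static boolean looksLikeName(String line) {
        return NAME.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeName("Jane Doe"));         // true
        System.out.println(looksLikeName("Systems Engineer")); // false
        System.out.println(looksLikeName("Jane doe"));         // false
    }
}
```

Requiring each word to be one capital followed by lowercase letters also enforces the "only two capital letters" rule, at the cost of rejecting names such as "McDonald".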

Splitting a sentence

I'm trying to split a string. Repeated characters such as !!!, ??, or ... denote the end of a sentence, so I want anything after them to go on a new line. For example, the sentence hey.. hello split !!! example me. should be turned into:
hey..
hello split !!!
example me.
What I tried:
String myStr= "hey.. hello split !!! example me.";
String [] split = myStr.split("(?<=\\.{2,})");
This works fine when I have multiple dots but doesn't work for anything else. I can't add exclamation marks to this expression either: "(?<=[\\.{2,}!{2,}])" splits after each dot and exclamation mark. Is there any way to combine those?
Ideally I wanted the app to split after a SINGLE dot too (anything that denotes the end of the sentence), but I don't think this is possible in a single pass... Thanks
Just do it like this:
String [] split = myStr.split("(?<=([?!.])\\1+)");
or
String [] split = myStr.split("(?<=([?!.])\\1{1,99})");
It captures the first character from the list [?.!] and expects the same character to be present one or more times. If it is, the split should occur right after that run.
or
String[] split = s.split("(?<=\\.{2,}+)|(?<=\\?{2,}+)|(?<=!{2,}+)");
Ideally I wanted the app to split after a SINGLE dot too (anything that denotes the end of the sentence)
To do this, you first have to decide which cases you consider the end of a sentence. Multiple special symbols are not a standard way of ending a sentence (as far as I know).
But if you are accounting for nefarious users or casual mistakes that make special symbols look like the end of a sentence, then at least make a list of such cases before you proceed.
In your situation, where you want to split the string on multiple special symbols, a lookbehind alone won't be of much help because, as Wiktor noted:
The problem is in the backreference whose length is not known from the start.
So we need to find the zero-width position where the split has to happen. The following regexes do that.
Regex:
(?<=[.!?])(?=[^.!?])
(?<=[.!?]) (?=[^.!?])
Note the space between the two assertions in the second regex. Use it if you want to consume the space that would otherwise start the next line.
Explanation:
This splits at the zero-width position that is preceded by a special symbol and not followed by one.
hey..¦ hello split !!!¦ example me. ( ¦ denotes the zero-width)
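The two lookaround regexes can be tried out with a small sketch like the one below. The class and method names are only for illustration.

```java
import java.util.Arrays;

class SentenceSplit {
    // Zero-width split: preceded by an end mark, followed by something else.
    static String[] splitKeepSpace(String text) {
        return text.split("(?<=[.!?])(?=[^.!?])");
    }

    // Same idea, but the single space between the assertions is consumed.
    static String[] splitDropSpace(String text) {
        return text.split("(?<=[.!?]) (?=[^.!?])");
    }

    public static void main(String[] args) {
        String text = "hey.. hello split !!! example me.";
        System.out.println(Arrays.toString(splitKeepSpace(text)));
        // prints [hey..,  hello split !!!,  example me.]
        System.out.println(Arrays.toString(splitDropSpace(text)));
        // prints [hey.., hello split !!!, example me.]
    }
}
```

Since the lookbehind only needs one end mark, both versions also split after a single dot, which covers the "SINGLE dot" wish from the question.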
A lookbehind, combined with a negative lookahead to prevent a split inside the group:
String[] lines = s.split("(?<=[?!.]{2,3})(?![?!.])");
Some test code:
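import java.util.Arrays;

class SplitDemo {
    public static void main(String[] args) {
        String s = "hey..hello split !!!example me.";
        // Split after a run of 2-3 end marks, but not in the middle of the run.
        String[] lines = s.split("(?<=[?!.]{2,3})(?![?!.])");
        Arrays.stream(lines).forEach(System.out::println);
    }
}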
Output:
hey..
hello split !!!
example me.

Java Regex for counting syllables

I am writing a regex pattern to count all the syllables in a word but I'm having trouble ignoring the case when an "e" is alone at the end of the word.
My pattern right now is:
[aeiouy]+[^$e]
I have been given certain rules that are not completely precise, but I need to follow them for the exercise. The rules are the following:
A syllable is a contiguous sequence of vowels, except for a lonely vowel "e" at the end of a word. The vowels are "aeiouy". For example, the word "sentence" should have only 2 syllables but my pattern counts 3, and the word "there" should have only one syllable but my pattern counts 2.
Thanks in advance for any help!
Edit: With Yassin's example I've noticed that the main issue arises when the "e" is followed by another character (question mark, comma, etc.): the regex counts another syllable.
Since you're having problems with words that end in "e" and are followed by periods, commas, etc., here is a solution, shown on a 12-syllable sentence.
We exclude an "e" that is followed by any of the characters below.
Solution
Pattern p = Pattern.compile("[aeiouy]+[^$e(,.:;!?)]");
Matcher m = p.matcher("This is a sentence:this is another sentence.");
int syllables = 0;
while (m.find()){
syllables++;
}
System.out.println(syllables);
Output
12
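An alternative sketch, not the pattern from the answer above: inside a character class `$` is just a literal dollar sign, so a trailing "e" at the very end of the input is not covered by `[^$e...]`. Assuming the rule is exactly "a run of vowels, except a lone e at the end of a word", a negative lookahead with a word boundary expresses it without listing punctuation. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SyllableCounter {
    // A vowel run, unless it is a single "e" sitting right before a word boundary.
    static final Pattern SYLLABLE =
            Pattern.compile("(?!e\\b)[aeiouy]+", Pattern.CASE_INSENSITIVE);

    static int count(String text) {
        Matcher m = SYLLABLE.matcher(text);
        int syllables = 0;
        while (m.find()) {
            syllables++;
        }
        return syllables;
    }

    public static void main(String[] args) {
        System.out.println(count("sentence")); // 2
        System.out.println(count("there"));    // 1
        System.out.println(count("This is a sentence:this is another sentence.")); // 12
    }
}
```

Under this rule a word such as "the" counts as zero syllables, so you may want to enforce a minimum of one per word if the exercise requires it.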

Java String delete tokens contains numbers

I have a string like this and I would like to eliminate all the tokens that contain a number:
String[] s="In the 1980s".split(" ");
Is there a way to remove the tokens that contain numbers - in this case 1980s, but also, for example 784th or s787?
Use a \w*\d\w* regex matcher for that. It will match all words with at least one digit in them. Although I generally despise regexes, they are particularly well suited for your problem.
String[] s = input.replaceAll("\\w*\\d\\w* *", "").split(" +");
See Java lib docs for Pattern/Matcher (RegEx) for more reference how to work with regexes in general.
Test code:
http://ideone.com/LrHDsT
Remove the unwanted words first, then split:
String[] s = str.replaceAll("\\w*\\d\\w*", "").trim().split(" +");
Some test code:
String str = "666 In the 1980s 784th s787 foo BAR";
String[] s = str.replaceAll("\\w*\\d\\w*", "").trim().split(" +");
System.out.println(Arrays.toString(s));
Output:
[In, the, foo, BAR]
You could use a regex as suggested by @vaxquis or, alternatively, split the string on the delimiter first.
You could then parse the tokens, check whether each one contains a number using NumberUtils.isNumber, and remove those tokens.
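A minimal sketch of the split-then-filter idea, using only the JDK rather than NumberUtils. It assumes a token should be dropped as soon as it contains at least one digit; the class and method names are hypothetical.

```java
import java.util.Arrays;

class DigitTokenFilter {
    // Split on runs of spaces, then keep only non-empty tokens without any digit.
    static String[] withoutDigitTokens(String input) {
        return Arrays.stream(input.trim().split(" +"))
                .filter(token -> !token.isEmpty() && !token.matches(".*\\d.*"))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String input = "666 In the 1980s 784th s787 foo BAR";
        System.out.println(Arrays.toString(withoutDigitTokens(input)));
        // prints [In, the, foo, BAR]
    }
}
```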
split alone doesn't seem to be what you are looking for. Even if you remove the words that contain a digit, as in the case of
"1foo f2oo bar whatever baz2"
you will end up with
" bar whatever "
and if you split on spaces now you will end up with ["", "bar", "whatever"].
To solve this problem you may also want to remove the spaces after each word you removed, so now
"1foo f2oo bar whatever baz2"
would become
"bar whatever "
so it can be split correctly (the space at the end is not a problem, since split by default removes trailing empty strings from the result array).
But instead of doing two iterations (removing words, then splitting the string) you can achieve the same thing with only one. All you need is the opposite approach: instead of focusing on removing the wrong elements, let's try to find the correct ones.
Correct tokens seem to be words which contain any non-space characters but no digits. You can represent such words with the regex \b[\S&&\D]+\b where:
\b represents a word boundary,
\S is any non-whitespace character,
\D is any non-digit character,
[\S&&\D] is the intersection of non-whitespace and non-digit characters, in other words non-whitespace characters which are also non-digits.
Demo:
String input = "1foo f2oo bar whatever baz2";
Pattern p = Pattern.compile("\\b[\\S&&\\D]+\\b");
Matcher m = p.matcher(input);
while(m.find())
System.out.println(m.group());
Output:
bar
whatever
BTW, to avoid a possible empty element at the start of the results you can use Scanner, which doesn't return an empty element if a delimiter is found at the start of the string. We can simply set the delimiter to a series of spaces or words which contain a digit, so your code can also look like this:
Scanner sc = new Scanner(input);
sc.useDelimiter("(\\s|\\w*\\d\\w*)+");
while (sc.hasNext())
System.out.println(sc.next());
sc.close();

Java regex split with any number of asterisks

I am learning regex (with this site) and trying to figure out how to parse the string 1***2 to give me [1,2] (without using a specific case for 3 asterisks). There can be any number of asterisks, which I need to treat as one delimiter, so I am looking for the * char followed by the * wildcard. The delimiters could be letters as well.
The output should only be numbers, so I use ^-^0-9 to split by everything else.
So far I have tried:
input.split("[^-^0-9]"); // Gives me [1, , ,2]
input.split("[^-^0-9\\**]"); // Gives me [1***2]
input.split("[^-^0-9+\\**]"); // Gives me [1***2]
\* does not work as it is not recognized as a valid escape character.
Thanks!
You are looking for
input.split("[*]+");
This splits the string on one or more consecutive asterisks.
To allow other characters (e.g. letters) within delimiters, add them to the [*] character class.
If the delimiters could be letters as well, you can use
\D+
or
[^\d]+
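Both suggestions can be checked with a short sketch like this one. The class and method names are made up for illustration, and the second input is an assumed example of letters appearing among the delimiters.

```java
import java.util.Arrays;

class DelimiterSplit {
    // One or more consecutive asterisks form a single delimiter.
    static String[] splitOnAsterisks(String input) {
        return input.split("[*]+");
    }

    // Any run of non-digit characters forms a single delimiter.
    static String[] splitOnNonDigits(String input) {
        return input.split("\\D+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnAsterisks("1***2")));    // [1, 2]
        System.out.println(Arrays.toString(splitOnNonDigits("1***2ab3"))); // [1, 2, 3]
    }
}
```

Putting the asterisk inside a character class avoids the need to escape it, which sidesteps the `\*` escaping problem from the question.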
