I have a problem with a regular expression. I need to print and count the words that start with "ge" and end with either "ne" or "me". When I run the code, it matches every word that starts with "ge", regardless of its ending. Can anyone help me improve my source code?
Pattern p = Pattern.compile("ge\\s*(\\w+)");
Matcher m = p.matcher(input);
int count=0;
List<String> outputs = new ArrayList<String>();
while (m.find()) {
count++;
outputs.add(m.group());
System.out.println(m.group());
}
System.out.println("The count is " + count);
\bge\w*[nm]e\b
This should do it for you.
In Java, written as a string literal, this would be
\\bge\\w*[nm]e\\b
Use \b to denote a word boundary.
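Dropped into the loop from the question, that pattern might look like the sketch below. The class name GeWords, the helper findGeWords and the sample sentence are made up for illustration; only the regex comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GeWords {
    // Whole words that start with "ge" and end with "ne" or "me".
    private static final Pattern GE_WORD = Pattern.compile("\\bge\\w*[nm]e\\b");

    // Hypothetical helper: collects every matching word in order of appearance.
    static List<String> findGeWords(String input) {
        List<String> outputs = new ArrayList<>();
        Matcher m = GE_WORD.matcher(input);
        while (m.find()) {
            outputs.add(m.group());
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not taken from the question.
        String input = "A gene in the genome, a gemstone, a general and a game.";
        List<String> words = findGeWords(input);
        for (String word : words) {
            System.out.println(word); // gene, genome, gemstone (one per line)
        }
        System.out.println("The count is " + words.size()); // The count is 3
    }
}
```

"general" is skipped because no "ne" or "me" in it is followed by a word boundary, and "game" is skipped because it does not start with "ge".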
Well you're kinda missing the second half of your regex
ge\s*(\w+)[nm]e
This will work.
I'm searching for words in a paragraph, but it takes ages with long paragraphs. Hence, I want to remove each word from the paragraph once I find it, to reduce the amount of text I have to go through. If there's a better way to make this efficient, do tell!
List<String> list = new ArrayList<>();
for (String word : wordList) {
String regex = ".*\\b" + Pattern.quote(word) + "\\b.*";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(paragraph);
if (m.find()) {
System.out.println("Found: " + word);
list.add(word);
}
}
For example, let's say my wordList has the following values: "apple", "hungry", "pie"
And my paragraph is "I ate an apple, but I am still hungry, so I will eat pie"
I want to find the words from wordList in the paragraph and eliminate them, in the hope of making the above code faster.
You may use
String paragraph = "I ate an apple, but I am still hungry, so I will eat pie";
List<String> wordList = Arrays.asList("apple","hungry","pie");
Pattern p = Pattern.compile("\\b(?:" + String.join("|", wordList) + ")\\b");
Matcher m = p.matcher(paragraph);
if (m.find()) { // To find all matches, replace "if" with "while"
System.out.println("Found " + m.group()); // => Found apple
}
The regex will look like \b(?:word1|word2|wordN)\b and will match:
\b - a word boundary
(?:word1|word2|wordN) - any of the alternatives inside the non-capturing group
\b - a word boundary
Since you say the characters in the words can only be uppercase letters, digits and hyphens with slashes, none of them need escaping, so Pattern.quote is not important here. Also, since the slashes and hyphens will never appear at the start/end of the string, you won't have issues usually caused by \b word boundary. Otherwise, replace the first "\\b" with "(?<!\\w)" and the last one with "(?!\\w)".
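If the words may contain regex metacharacters or start/end with non-word characters, a more defensive variant could look like this sketch. It applies Pattern.quote to each word and uses the lookarounds mentioned above in place of \b. The class name WordListSearch and the helpers buildPattern and findAll are hypothetical, and a non-empty wordList is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WordListSearch {
    // Builds a single alternation so the paragraph is scanned once,
    // instead of once per word. Assumes wordList is not empty.
    static Pattern buildPattern(List<String> wordList) {
        String alternation = wordList.stream()
                .map(Pattern::quote) // treat every word as a literal
                .collect(Collectors.joining("|"));
        // (?<!\w) and (?!\w) replace \b so words such as "C++" still work.
        return Pattern.compile("(?<!\\w)(?:" + alternation + ")(?!\\w)");
    }

    // Returns every occurrence of any listed word, in order of appearance.
    static List<String> findAll(String paragraph, List<String> wordList) {
        List<String> found = new ArrayList<>();
        Matcher m = buildPattern(wordList).matcher(paragraph);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String paragraph = "I ate an apple, but I am still hungry, so I will eat pie";
        List<String> wordList = Arrays.asList("apple", "hungry", "pie");
        System.out.println(findAll(paragraph, wordList)); // [apple, hungry, pie]
    }
}
```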
In the following code:
public static void main(String[] args) {
List<String> allMatches = new ArrayList<String>();
Matcher m = Pattern.compile("\\d+\\D+\\d+").matcher("2abc3abc4abc5");
while (m.find()) {
allMatches.add(m.group());
}
String[] res = allMatches.toArray(new String[0]);
System.out.println(Arrays.toString(res));
}
The result is:
[2abc3, 4abc5]
I'd like it to be
[2abc3, 3abc4, 4abc5]
How can it be achieved?
Make the matcher attempt to start its next scan from the latter \d+.
Matcher m = Pattern.compile("\\d+\\D+(\\d+)").matcher("2abc3abc4abc5");
if (m.find()) {
do {
allMatches.add(m.group());
} while (m.find(m.start(1)));
}
Not sure if this is possible in Java, but in PCRE you could do the following:
(?=(\d+\D+\d+)).
Explanation
The technique is to use a matching group in a lookahead, and then "eat" one character to move forward.
(?= : start of positive lookahead
( : start matching group 1
\d+ : match a digit one or more times
\D+ : match a non-digit character one or more times
\d+ : match a digit one or more times
) : end of group 1
) : end of lookahead
. : match anything, this is to "move forward".
Thanks to Casimir et Hippolyte, it really does seem to work in Java. You just need to double the backslashes and read the first capturing group. As a Java string literal the pattern is "(?=(\\d+\\D+\\d+))."
Tested on www.regexplanet.com.
HamZa's solution above works perfectly in Java. If you want to find overlapping matches of a specific pattern in a text, all you have to do is:
String regex = "\\d+\\D+\\d+";
String updatedRegex = "(?=(" + regex + ")).";
Here regex is the pattern you are looking for. To make the matches overlap, wrap it in "(?=(" at the start and "))." at the end, then read each match from group 1.
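A complete, runnable version of the lookahead approach could look like the following sketch. The class name OverlappingMatches and the helper findOverlapping are hypothetical; the wrapping of the regex is the one described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingMatches {
    // Wraps the regex in a capturing lookahead and consumes one character,
    // so a match is attempted at every position of the input.
    static List<String> findOverlapping(String regex, String input) {
        Pattern p = Pattern.compile("(?=(" + regex + ")).");
        Matcher m = p.matcher(input);
        List<String> allMatches = new ArrayList<>();
        while (m.find()) {
            allMatches.add(m.group(1)); // group(1) holds the text seen by the lookahead
        }
        return allMatches;
    }

    public static void main(String[] args) {
        System.out.println(findOverlapping("\\d+\\D+\\d+", "2abc3abc4abc5"));
        // [2abc3, 3abc4, 4abc5]
    }
}
```

One difference from the m.find(m.start(1)) loop: because a match is tried at every position, multi-digit numbers also yield matches that start in the middle of a number. For example, "12ab3" gives both 12ab3 and 2ab3.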
I need to count the number of words in a string. However, this is not an ordinary string: it contains a lot of markup such as <, /em, /p and many more, so most of the methods suggested on Stack Overflow do not work. As a result, I need to define a regular expression myself.
What I intend to do is define what a word is using a regular expression, and count the number of times a word appears.
This is how I define a word:
It must start with a letter and end with one of these: : or , or ! or ? or ' or - or ) or . or "
This is how I defined my regular expression:
pattern = Pattern.compile("^[a-zA-Z](:|,|!|?|'|-|)|.|")$");
matcher = pattern.matcher(line);
while (matcher.find())
wordCount++;
However, there is an error with the first line
pattern = Pattern.compile("^[a-zA-Z](:|,|!|?|'|-|)|.|")$");
How can I fix this problem?
In fact you also want to remove tags like <em> (HTML emphasis), which would otherwise count as words. If you also allow for full tags with attributes, such as <span font="Consolas">, it is easiest to remove the tags first:
public static int wordCount(String s) {
    return s.replaceAll("<[A-Za-z/][^>]*>", " ") // Replace tags with a space
            .replaceAll("[^\\p{L}\\p{M}\\d]+", " ") // Collapse anything but letters, accents and digits into one blank
            .trim() // No blank before or after (would give empty words)
            .split(" ").length; // Note: gives 1, not 0, when no words are left
}
This is fairly inefficient because of the replaceAll and trim calls. Precompiling the expressions with Pattern would be nicer, but it is probably not worth it.
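For reference, a precompiled variant might look like the sketch below. The class name WordCounter and the constant names TAG and NON_WORD are made up; the two expressions are the ones from the answer. It also returns 0 for input that contains no words, which is an addition to the original logic.

```java
import java.util.regex.Pattern;

class WordCounter {
    // Compiled once and reused on every call.
    private static final Pattern TAG = Pattern.compile("<[A-Za-z/][^>]*>");
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{M}\\d]+");

    static int wordCount(String s) {
        String withoutTags = TAG.matcher(s).replaceAll(" ");
        String cleaned = NON_WORD.matcher(withoutTags).replaceAll(" ").trim();
        if (cleaned.isEmpty()) {
            return 0; // "".split(" ") would report one (empty) word
        }
        return cleaned.split(" ").length;
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        System.out.println(wordCount("<p>Hello, <em>big</em> world!</p>")); // 3
        System.out.println(wordCount("<br/>")); // 0
    }
}
```

With this definition of a word an apostrophe is a separator, so "It's" counts as two words.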
Does this help?
String line = "so.this:is,what)you!wanted?";
int wordCount = 0;
Pattern pattern = Pattern.compile("[a-zA-Z]++[:,!?').\"-]"); // '-' goes last so it is a literal hyphen, not a range
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
wordCount++;
}
System.out.println(wordCount); // Prints 6
I have a string phahahahoto and I need to find how many times the string haha appears in it. If you look closely, it appears 2 times.
My code is below, and I get the output 1 instead of 2.
The code is written in Java.
Pattern pattern = Pattern.compile("haha");
Matcher matcher = pattern.matcher("phahahahoto");
int count = 0;
while (matcher.find()) {
count++;
}
System.out.println(count);
Use a lookahead to get overlapping matches. The two occurrences of haha in this string overlap. If you pass haha as the regex, it matches the first haha and consumes it, which leaves only hahoto to search, so the second occurrence is never found. Lookarounds do not consume any characters, so (?=haha) matches only the empty position in front of each haha.
Pattern pattern = Pattern.compile("(?=haha)");
Matcher matcher = pattern.matcher("phahahahoto");
int count = 0;
while (matcher.find()) {
count++;
}
System.out.println(count);
Here it matches the position that exists before each haha.
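You can also get the count in one line like this:
int count = "phahahahoto".split("(?=haha)").length - 1;
//=> 2
Note that this undercounts by one when the string itself starts with haha, because split (Java 8 and later) does not produce a leading empty string for a zero-width match at index 0.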
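For a literal search string like this one, a regex is not strictly needed. As an alternative to the lookahead (not what the answers above use), here is a minimal sketch based on String.indexOf; the class name OverlapCount and the method countOverlapping are hypothetical.

```java
class OverlapCount {
    // Counts occurrences of needle in text, including overlapping ones,
    // by restarting the search one character after each hit.
    static int countOverlapping(String text, String needle) {
        if (needle.isEmpty()) {
            return 0; // assumption: an empty needle is treated as "no occurrences"
        }
        int count = 0;
        int from = text.indexOf(needle);
        while (from >= 0) {
            count++;
            from = text.indexOf(needle, from + 1);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOverlapping("phahahahoto", "haha")); // 2
    }
}
```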
I want to find every instance of a number, followed by a comma (no space), followed by any number of characters in a string. I was able to get a regex to find all the instances of what I was looking for, but I want to print them individually rather than all together. I'm new to regex in general, so maybe my pattern is wrong?
This is my code:
String test = "1 2,A 3,B 4,23";
Pattern p = Pattern.compile("\\d+,.+");
Matcher m = p.matcher(test);
while(m.find()) {
System.out.println("found: " + m.group());
}
This is what it prints:
found: 2,A 3,B 4,23
This is what I want it to print:
found: 2,A
found: 3,B
found: 4,23
Thanks in advance!
Try this regex:
Pattern p = Pattern.compile("\\d+,.+?(?= |$)");
You could take an easier route and split by space, then ignore anything without a comma:
String[] values = test.split(" "); // split returns an array and takes a String regex
for (String value : values) {
    if (value.contains(",")) {
        System.out.println("found: " + value);
    }
}
What you apparently left out of your requirements statement is where "any number of characters" is supposed to end. As it stands, it ends at the end of the string; from your sample output, it seems you want it to end at the first space.
Try this pattern: "\\d+,[^\\s]*"
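To see the difference side by side, here is a small sketch that runs the greedy pattern from the question and the \S* variant on the same input (\S is equivalent to [^\s]). The class name CommaPairs and the helper findAll are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommaPairs {
    // Hypothetical helper: returns every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String test = "1 2,A 3,B 4,23";
        // Greedy ".+" runs to the end of the string: one single match.
        for (String s : findAll("\\d+,.+", test)) {
            System.out.println("greedy: " + s); // greedy: 2,A 3,B 4,23
        }
        // "\\S*" stops at the first whitespace: three separate matches.
        for (String s : findAll("\\d+,\\S*", test)) {
            System.out.println("found: " + s); // found: 2,A / found: 3,B / found: 4,23
        }
    }
}
```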