count substring occurrence in every combination - java

I have the string phahahahoto and I need to find how many times the string haha appears in it. If you look closely, it appears 2 times.
My code is below, and it prints 1 instead of 2.
The code is written in Java.
Pattern pattern = Pattern.compile("haha");
Matcher matcher = pattern.matcher("phahahahoto");
int count = 0;
while (matcher.find()) {
count++;
}
System.out.println(count);

Use a lookahead in order to find overlapping matches. As you may have noticed, the two occurrences of haha overlap. If you pass haha as the regex, it won't find overlapping matches: the pattern consumes the first haha substring, which leaves only hahoto for the next search, and that no longer contains haha. Lookarounds don't consume any characters, so (?=haha) matches only the zero-width position before each occurrence.
Pattern pattern = Pattern.compile("(?=haha)");
Matcher matcher = pattern.matcher("phahahahoto");
int count = 0;
while (matcher.find()) {
count++;
}
System.out.println(count);
Here it matches the zero-width boundary that exists before each haha.
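The lookahead loop can be wrapped in a small helper, and a regex-free variant based on indexOf gives the same count. This is only a minimal sketch: the class and method names (OverlapCount, countWithLookahead, countWithIndexOf) are invented for illustration, and rejecting an empty needle is an assumption made here to keep both loops well defined.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapCount {

    // Counts overlapping occurrences with a zero-width lookahead.
    // Pattern.quote keeps regex metacharacters in the needle literal.
    static int countWithLookahead(String haystack, String needle) {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        Matcher matcher = Pattern.compile("(?=" + Pattern.quote(needle) + ")").matcher(haystack);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Same count without regex: restart the search one character after each hit.
    static int countWithIndexOf(String haystack, String needle) {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        int count = 0;
        int from = haystack.indexOf(needle);
        while (from >= 0) {
            count++;
            from = haystack.indexOf(needle, from + 1);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWithLookahead("phahahahoto", "haha")); // 2
        System.out.println(countWithIndexOf("phahahahoto", "haha"));   // 2
    }
}
```

The indexOf variant avoids compiling a pattern at all, which may be preferable when the search text is a plain literal.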

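You can also get the count in one line like this:
int count = "phahahahoto".split("(?=haha)").length - 1;
//=> 2
Note that since Java 8 a zero-width match at the very start of the input does not produce a leading empty substring, so this one-liner undercounts by one when the string itself starts with haha.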

Related

Java string split with regular expressions

I am far from mastering regular expressions, but I would like to use one to split a string on its first and last underscore. For example, I want to split
"hello_5_9_2018_world"
into
"hello"
"5_9_2018"
"world"
I can split it on the last underscore with
String[] splitArray = subjectString.split("_(?=[^_]*$)");
but I am not able to figure out how to split on first underscore.
Could anyone show me how I can do this?
Thanks
David
You can achieve this without regex by finding the first and last index of _ and taking substrings based on them.
String s = "hello_5_9_2018_world";
int firstIndex = s.indexOf("_");
int lastIndex = s.lastIndexOf("_");
System.out.println(s.substring(0, firstIndex));
System.out.println(s.substring(firstIndex + 1, lastIndex));
System.out.println(s.substring(lastIndex + 1));
The above prints
hello
5_9_2018
world
Note:
If the string does not contain at least two underscores, you will get a StringIndexOutOfBoundsException.
To safeguard against it, check that the extracted indices are valid:
If firstIndex and lastIndex are both -1, the string does not have any underscores.
If firstIndex == lastIndex (and is not -1), the string has just one underscore.
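The index checks from the note can be folded into a helper that never throws on short input. The following is a sketch under assumptions: the names FirstLastSplit and splitFirstLast are hypothetical, and returning an empty Optional for strings with fewer than two underscores is just one possible way to report the problem.

```java
import java.util.Optional;

class FirstLastSplit {

    // Splits on the first and last '_' and returns the three parts,
    // or an empty Optional when the input has fewer than two underscores.
    static Optional<String[]> splitFirstLast(String s) {
        int firstIndex = s.indexOf('_');
        int lastIndex = s.lastIndexOf('_');
        // firstIndex == -1: no underscore; firstIndex == lastIndex: exactly one.
        if (firstIndex == -1 || firstIndex == lastIndex) {
            return Optional.empty();
        }
        return Optional.of(new String[] {
            s.substring(0, firstIndex),
            s.substring(firstIndex + 1, lastIndex),
            s.substring(lastIndex + 1)
        });
    }

    public static void main(String[] args) {
        splitFirstLast("hello_5_9_2018_world")
            .ifPresent(parts -> System.out.println(String.join(" | ", parts))); // hello | 5_9_2018 | world
        System.out.println(splitFirstLast("hello_world").isPresent()); // false
    }
}
```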
If you always have three parts as above, you can use
([^_]*)_(.*)_([^_]*)
and get the individual elements as groups.
Regular Expression
(?<first>[^_]+)_(?<middle>.+)_(?<last>[^_]+)
Java Code
final String str = "hello_5_9_2018_world";
Pattern pattern = Pattern.compile("(?<first>[^_]+)_(?<middle>.+)_(?<last>[^_]+)");
Matcher matcher = pattern.matcher(str);
if (matcher.matches()) {
    String first = matcher.group("first");   // "hello"
    String middle = matcher.group("middle"); // "5_9_2018"
    String last = matcher.group("last");     // "world"
}
Several people have already posted solutions, but here is another regex pattern for your question.
You can achieve your goal with this pattern:
"([a-zA-Z]+)_(.*)_([a-zA-Z]+)"
The whole code looks like this:
String subjectString= "hello_5_9_2018_world";
Pattern pattern = Pattern.compile("([a-zA-Z]+)_(.*)_([a-zA-Z]+)");
Matcher matcher = pattern.matcher(subjectString);
if(matcher.matches()){
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
System.out.println(matcher.group(3));
}
It outputs:
hello
5_9_2018
world
While the other answers are actually nicer and better, if you really want to use split, this is the way to go:
"hello_5_9_2018_world".split("((?<=^[^_]*)_)|(_(?=[^_]*$))")
==> String[3] { "hello", "5_9_2018", "world" }
This is a combination of your lookahead pattern (_(?=[^_]*$))
and the symmetrical lookbehind pattern ((?<=^[^_]*)_),
which matches the _ preceded by ^ (start of the string) and [^_]* (0..n non-underscore characters).

Find out number of words in a string with a lot of special character

I need to find the number of words in a string. However, this is not a normal string: it contains a lot of special characters such as <, /em, /p and many more, so most of the methods suggested on StackOverflow do not work. As a result, I need to define a regular expression myself.
What I intend to do is define what a word is using a regular expression and count the number of times a word appears.
This is how I define a word:
It must start with a letter and end with one of these characters: : or , or ! or ? or ' or - or ) or . or "
This is how I define my regular expression
pattern = Pattern.compile("^[a-zA-Z](:|,|!|?|'|-|)|.|")$");
matcher = pattern.matcher(line);
while (matcher.find())
wordCount++;
However, there is an error with the first line
pattern = Pattern.compile("^[a-zA-Z](:|,|!|?|'|-|)|.|")$");
How can I fix this problem?
In fact you also want to remove tags, like <em> (HTML emphasis), which would otherwise count as words. If you also consider full tags with attributes, such as
<span font="Consolas">, then it is easier to remove the tags first:
public static int wordCount(String s) {
    return s.replaceAll("<[A-Za-z/][^>]*>", " ") // Replace tags with a space
        .replaceAll("[^\\p{L}\\p{M}\\d]+", " ") // Replace runs of non-letters, non-accents, non-digits with a space
        .trim() // No blank before or after (avoids empty words)
        .split(" ").length;
}
It is quite inefficient because of the repeated replaceAll and trim calls. Precompiling the regular expressions with Pattern would be nicer, but it is probably not worth it.
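For completeness, here is roughly what the precompiled variant could look like. It is a sketch rather than a tuned implementation: the class name WordCounter and the constant names are made up, and returning 0 for input without any words is an added assumption (a plain split(" ") on an empty string would yield one empty element).

```java
import java.util.regex.Pattern;

class WordCounter {

    // Compiled once and reused across calls.
    private static final Pattern TAG = Pattern.compile("<[A-Za-z/][^>]*>");
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{M}\\d]+");

    static int wordCount(String s) {
        String withoutTags = TAG.matcher(s).replaceAll(" ");
        String cleaned = NON_WORD.matcher(withoutTags).replaceAll(" ").trim();
        // "".split(" ") yields one empty element, so handle empty input explicitly.
        return cleaned.isEmpty() ? 0 : cleaned.split(" ").length;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("<p>Hello, <em>big</em> world!</p>")); // 3
    }
}
```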
Does this help?
String line = "so.this:is,what)you!wanted?";
int wordCount = 0;
// The hyphen goes last in the character class so it is a literal, not a range
Pattern pattern = Pattern.compile("[a-zA-Z]++[:',.!?\")-]");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    wordCount++;
}
System.out.println(wordCount); // Prints 6

Regex in Java - Regular Expression

I have a problem with a regular expression. I need to print and count the words that start with "ge" and end with either "ne" or "me". When I run the code, words that merely start with "ge" appear. Can anyone help me improve my source code?
Pattern p = Pattern.compile("ge\\s*(\\w+)");
Matcher m = p.matcher(input);
int count=0;
List<String> outputs = new ArrayList<String>();
while (m.find()) {
count++;
outputs.add(m.group());
System.out.println(m.group());
}
System.out.println("The count is " + count);
\bge\w*[nm]e\b
This should do it for you.
In Java this would be
\\bge\\w*[nm]e\\b
Use \b to denote a word boundary.
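Plugged into the counting loop from the question, the word-boundary pattern might be used as follows. This is an illustrative sketch: the class name GeWords, the method findGeWords and the sample sentence are all invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GeWords {

    // Whole words that start with "ge" and end with "ne" or "me".
    private static final Pattern GE_WORD = Pattern.compile("\\bge\\w*[nm]e\\b");

    static List<String> findGeWords(String input) {
        List<String> outputs = new ArrayList<>();
        Matcher m = GE_WORD.matcher(input);
        while (m.find()) {
            outputs.add(m.group());
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Sample sentence invented for this example.
        List<String> words = findGeWords("a gene and a genome, not a gem or a general game");
        System.out.println(words);                          // [gene, genome]
        System.out.println("The count is " + words.size()); // The count is 2
    }
}
```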
Well you're kinda missing the second half of your regex
ge\s*(\w+)[nm]e
This will work.

Pattern Matcher Vs String Split, which should I use?

First time posting.
Firstly, I know how to use both Pattern/Matcher and String.split.
My question is: which is best for my example, and why?
Suggestions for better alternatives are also welcome.
Task:
I need to extract an unknown NOUN between two known regexp in an unknown string.
My Solution:
get the Start and End of the noun (from Regexp 1&2) and substring to extract the noun.
String line = "unknownXoooXNOUNXccccccXunknown";
int goal = 12 ;
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
I need to locate the index position AFTER the first regex.
I need to locate the index position BEFORE the second regex.
A) I can use pattern matcher
Pattern p = Pattern.compile(regexp1);
Matcher m = p.matcher(line);
if (m.find()) {
int afterRegex1 = m.end();
} else {
throw new IllegalArgumentException();
//TODO Exception Management;
}
B) I can use String Split
String[] split = line.split(regexp1, 2);
if (split.length != 2) {
throw new UnsupportedOperationException();
//TODO Exception Management;
}
int afterRegex1 = line.indexOf(split[1]);
Which Approach should I use and why?
I don't know which is more efficient on time and memory.
Both are near enough as readable to myself.
I'd do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regex = "Xo+X(.*?)Xc+X";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(line);
if (m.find()) {
String noun = m.group(1);
}
The (.*?) is used to make the inner match on the NOUN reluctant. This protects us from a case where our ending pattern appears again in the unknown portion of the string.
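The difference between the reluctant and the greedy group shows up as soon as the ending pattern occurs twice. A small sketch, assuming a made-up input line and a hypothetical helper firstGroup:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input: the ending pattern "XccX" shows up again in the trailing part.
        String line = "unknownXoooXNOUNXccccccXunknownXccXtail";
        System.out.println(firstGroup("Xo+X(.*?)Xc+X", line)); // NOUN
        System.out.println(firstGroup("Xo+X(.*)Xc+X", line));  // NOUNXccccccXunknown
    }
}
```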
EDIT
This works because the (.*?) defines a capture group. There's only one such group defined in the pattern, so it gets index 1 (the parameter to m.group(1)). These groups are indexed from left to right starting at 1. If the pattern were defined like this
String regex = "(Xo+X)(.*?)(Xc+X)";
Then there would be three capture groups, such that
m.group(1); // yields "XoooX"
m.group(2); // yields "NOUN"
m.group(3); // yields "XccccccX"
There is a group 0, but that matches the whole pattern, and it's equivalent to this
m.group(); // yields "XoooXNOUNXccccccX"
For more information about what you can do with the Matcher, including ways to get the start and end positions of your pattern within the source string, see the Matcher JavaDocs
You should use String.split() for readability unless you're in a tight loop.
Per split()'s javadoc, split() does the equivalent of Pattern.compile(), which you can optimize away if you're in a tight loop.
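If the split does end up in a tight loop, the compile step can be hoisted out by calling Pattern.split on a precompiled pattern. A minimal sketch, with PrecompiledSplit, START_MARKER and afterStartMarker as illustrative names only:

```java
import java.util.regex.Pattern;

class PrecompiledSplit {

    // Compiled once, instead of once per String.split call.
    private static final Pattern START_MARKER = Pattern.compile("Xo+X");

    // Returns the text after the first start marker, or null if there is none.
    static String afterStartMarker(String line) {
        String[] parts = START_MARKER.split(line, 2);
        return parts.length == 2 ? parts[1] : null;
    }

    public static void main(String[] args) {
        String[] lines = { "unknownXoooXNOUNXccccccXunknown", "no marker here" };
        for (String line : lines) {
            System.out.println(afterStartMarker(line));
        }
    }
}
```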
It looks like you want to get a unique occurrence. For this, simply do
input.replaceAll(".*Xo+X(.*)Xc+X.*", "$1")
For efficiency, precompile the Pattern and use pattern.matcher(input).replaceAll instead.
In case your input contains line breaks, use Pattern.DOTALL or the s modifier.
In case you want to use split, consider using Guava's Splitter. It behaves more sanely and also accepts a Pattern, which is good for speed.
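The precompiled, DOTALL-enabled form of the replaceAll approach could look like this. It is a sketch with invented names (ExtractWithReplace, AROUND_NOUN, extract); note that when the pattern does not match, replaceAll simply returns the input unchanged, so callers may want to check for that.

```java
import java.util.regex.Pattern;

class ExtractWithReplace {

    // DOTALL lets '.' match line terminators too; compiled once for reuse.
    private static final Pattern AROUND_NOUN =
            Pattern.compile(".*Xo+X(.*)Xc+X.*", Pattern.DOTALL);

    static String extract(String input) {
        // Replaces the whole matched input with the captured group.
        return AROUND_NOUN.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(extract("first line\nXoooXNOUNXccccccX\nlast line")); // NOUN
    }
}
```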
If you really need the locations you can do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
Matcher m=Pattern.compile(regexp1).matcher(line);
if(m.find())
{
int start=m.end();
if(m.usePattern(Pattern.compile(regexp2)).find())
{
final int end = m.start();
System.out.println("from "+start+" to "+end+" is "+line.substring(start, end));
}
}
But if you just need the word in between, I recommend the way Ian McLaird has shown.

Extracting a word containing a symbol from a string in Java

The basic idea is that I want to pull out any part of the string with the form "text1.text2". Some examples of the input and output of what I'd like to do would be:
"employee.first_name" ==> "employee.first_name"
"2 * employee.salary AS double_salary" ==> "employee.salary"
Thus far I have just .split(" ") and then found what I needed and .split("."). Is there any cleaner way?
I would go with an actual Pattern and an iterative find, instead of splitting the String.
For instance:
String test = "employee.first_name 2 * ... employee.salary AS double_salary blabla e.s blablabla";
// searching for a number of word characters or punctuation, followed by a dot,
// followed by a number of word characters or punctuation
// note also we're avoiding the "..." pitfall
Pattern p = Pattern.compile("[\\w\\p{Punct}&&[^\\.]]+\\.[\\w\\p{Punct}&&[^\\.]]+");
Matcher m = p.matcher(test);
while (m.find()) {
System.out.println(m.group());
}
Output:
employee.first_name
employee.salary
e.s
Note: to simplify the Pattern, you could list only the allowed punctuation for your "."-separated words in the character classes.
For instance:
Pattern p = Pattern.compile("[\\w_]+\\.[\\w_]+");
This way, foo.bar*2 would be matched as foo.bar
You need to use split to break the string into fragments. Then search for . in each fragment using the contains method to get the desired fragments.
Here you go:
public static void main(String args[]) {
String str = "2 * employee.salary AS double_salary";
String arr[] = str.split("\\s");
for (int i = 0; i < arr.length; i++) {
if (arr[i].contains(".")) {
System.out.println(arr[i]);
}
}
}
String mydata = "2 * employee.salary AS double_salary";
Pattern pattern = Pattern.compile("(\\w+\\.\\w+)");
Matcher matcher = pattern.matcher(mydata);
if (matcher.find()) {
    System.out.println(matcher.group(1));
}
I'm not an expert in Java, but I have used regex in Python and, based on internet tutorials, I suggest using r'(\S*)\.(\S*)' as the pattern. I tried it in Python and it worked well on your example.
But it has a bug when several dot-separated names are chained. If you try to match something like first.second.third, this pattern returns ('first.second', 'third') as the matched groups; I think this is because the first \S* is greedy.
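The same behaviour can be reproduced in Java. The sketch below first shows the greedy grouping and then tries an alternative pattern that takes the whole dotted chain as one match; that second pattern is an assumption offered for illustration, not something from the answers above, and DottedNames and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DottedNames {

    // Returns the first match of regex in input, or null when there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Java form of the Python pattern; the first \S* is greedy,
        // so group 1 swallows everything up to the last dot.
        Matcher m = Pattern.compile("(\\S*)\\.(\\S*)").matcher("first.second.third");
        if (m.find()) {
            System.out.println(m.group(1) + " | " + m.group(2)); // first.second | third
        }

        // Assumed alternative: two or more dot-separated runs of non-space, non-dot characters.
        String chain = "[^\\s.]+(?:\\.[^\\s.]+)+";
        System.out.println(firstMatch(chain, "first.second.third"));                   // first.second.third
        System.out.println(firstMatch(chain, "2 * employee.salary AS double_salary")); // employee.salary
    }
}
```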
