Pattern/Matcher in Java?

I have a certain text in Java, and I want to use pattern and matcher to extract something from it. This is my program:
public String getItemsByType(String text, String start, String end) {
    String patternHolder;
    StringBuffer itemLines = new StringBuffer();
    patternHolder = start + ".*" + end;
    Pattern pattern = Pattern.compile(patternHolder);
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        itemLines.append(text.substring(matcher.start(), matcher.end())
                + "\n");
    }
    return itemLines.toString();
}
This code works when the searched text is all on one line, for instance:
String text = "My name is John and I am 18 years Old";
getItemsByType(text, "My", "John");
extracts "My name is John" from the text. However, when my text looks like this:
String text = "My name\nis John\nand I'm\n18 years\nold";
getItemsByType(text, "My", "John");
it doesn't grab anything, since "My" and "John" are on different lines. How do I solve this?

Use this instead:
Pattern.compile(patternHolder, Pattern.DOTALL);
From the javadoc, the DOTALL flag means:
Enables dotall mode.
In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.

Use Pattern.compile(patternHolder, Pattern.DOTALL) to compile the pattern. This way the dot also matches newlines. By default, the dot does not match line terminators.
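A minimal sketch of the fix applied to the method from the question. Besides the DOTALL flag, it makes two changes that are assumptions and not part of the answers above: the delimiters are wrapped in Pattern.quote so they are treated as literal text, and the quantifier is reluctant (.*?) so each match stops at the nearest end delimiter. The class name DotallDemo is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {
    // Collects every span from `start` to the nearest `end`, one per line.
    static String getItemsByType(String text, String start, String end) {
        // DOTALL lets '.' match line terminators as well.
        Pattern pattern = Pattern.compile(
                Pattern.quote(start) + ".*?" + Pattern.quote(end), Pattern.DOTALL);
        Matcher matcher = pattern.matcher(text);
        StringBuilder itemLines = new StringBuilder();
        while (matcher.find()) {
            itemLines.append(matcher.group()).append("\n");
        }
        return itemLines.toString();
    }

    public static void main(String[] args) {
        String text = "My name\nis John\nand I'm\n18 years\nold";
        // The match now spans the first two lines of the text.
        System.out.print(getItemsByType(text, "My", "John"));
    }
}
```

The same effect can be had without the flag argument by starting the regex with the embedded flag (?s).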

Related

How to parse a string to get array of #tags out of the string?

I have a string like
"#tag1 #tag2 #tag3 not_tag1 not_tag2 #tag4" (there can be many spaces between tags). From this string I want to extract just tag1, tag2 and so on. They are similar to the #tags seen on LinkedIn or other social media. Is there an easy way to do this using a regex or some other function in Java, or should I do it the hard way (i.e. using loops and conditions)?
A tag starts with "#" and ends at a space " ". In between there can be letters or digits, but the first character must be a letter.
Example:
input : "#tag1 #tag2 #tag3 not_tag1 not_tag2 #12tag #tag4"
output : ["tag1", "tag2", "tag3", "tag4"]
split by regex: "#\w+"
EDIT: this is the correct regex, but split is not the right method.
Same regex as javadev suggested, but use a Matcher instead of split:
String input = "#tag1 #tag2 #tag3 not_tag1 not_tag2 #12tag #tag4";
Matcher matcher = Pattern.compile("#\\w+").matcher(input);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
This prints each tag with its leading #. Note that it also prints #12tag, because \w matches digits as well.
Maybe something like:
public static void main(String[] args) {
    String input = "#tag1 #tag2 #tag3 not_tag1 not_tag2 #12tag #tag4";
    // [A-Za-z] and not [A-z]: the range A-z also covers [ \ ] ^ _ and `
    Pattern pattern = Pattern.compile("#([A-Za-z][A-Za-z0-9]*) *");
    Matcher matcher = pattern.matcher(input);
    while (matcher.find()) {
        System.out.println(matcher.group(1));
    }
}
worked for me :)
Output:
tag1
tag2
tag3
tag4
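To get the tags as a collection instead of printing them, the same idea can be wrapped in a helper. This is a sketch under the tag rules stated in the question (a letter first, then letters or digits); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagExtractor {
    // '#' followed by a letter, then any number of letters or digits.
    private static final Pattern TAG = Pattern.compile("#([A-Za-z][A-Za-z0-9]*)");

    static List<String> extractTags(String input) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(input);
        while (matcher.find()) {
            tags.add(matcher.group(1)); // group 1 is the tag without '#'
        }
        return tags;
    }

    public static void main(String[] args) {
        String input = "#tag1 #tag2   #tag3 not_tag1 not_tag2 #12tag #tag4";
        System.out.println(extractTags(input)); // [tag1, tag2, tag3, tag4]
    }
}
```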

extract a set of a characters between some characters

I have a string email = John.Mcgee.r2d2#hitachi.com
How can I write Java code using a regex to extract just the r2d2?
I tried this but got an error in Eclipse:
String email = John.Mcgee.r2d2#hitachi.com
Pattern pattern = Pattern.compile(".(.*)\#");
Matcher matcher = patter.matcher
for (Strimatcher.find()){
System.out.println(matcher.group(1));
}
To match after the last dot in a potential sequence of multiple dots, require that the sequence you capture does not contain a dot:
(?<=[.])([^.]*)(?=#)
(?<=[.]) means "preceded by a single dot"
(?=#) means "followed by # sign"
Note that since dot . is a metacharacter, it needs to be escaped either with \ (doubled for Java string literal) or with square brackets around it.
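A short sketch of how the lookaround pattern above could be used from Java. The class and method names are hypothetical; only the regex comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastSegmentDemo {
    // Preceded by a dot, contains no dot, followed by '#'.
    private static final Pattern BEFORE_HASH = Pattern.compile("(?<=[.])([^.]*)(?=#)");

    // Returns the dot-free segment right before '#', or null if there is none.
    static String lastSegment(String email) {
        Matcher matcher = BEFORE_HASH.matcher(email);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(lastSegment("John.Mcgee.r2d2#hitachi.com")); // r2d2
    }
}
```

Because the lookbehind and lookahead consume no characters, matcher.group() without an argument returns the same text as group 1 here.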
Not sure if you're posting the right code. Here is what it should look like so that it compiles:
String email = "John.Mcgee.r2d2#hitachi.com";
Pattern pattern = Pattern.compile(".(.*)#");
Matcher matcher = pattern.matcher(email);
while (matcher.find()) {
    // the pattern has a single capturing group, so it is always group 1
    System.out.println(matcher.group(1));
}
but I think you just want something like this:
String email = "John.Mcgee.r2d2#hitachi.com";
Pattern pattern = Pattern.compile(".(.*)#");
Matcher matcher = pattern.matcher(email);
if (matcher.find()) {
    System.out.println(matcher.group(1));
}
There is no need for Pattern here; replaceAll with the regex .*\.([^\.]+)#.* is enough. It captures the group ([^\.]+) (one or more characters other than a dot) that sits between a dot \. and #, and replaces the whole string with that group:
email = email.replaceAll(".*\\.([^\\.]+)#.*", "$1");
Output
r2d2
If you want to go with Pattern then you have to use this regex \\.([^\\.]+)# :
String email = "John.Mcgee.r2d2#hitachi.com";
Pattern pattern = Pattern.compile("\\.([^\\.]+)#");
Matcher matcher = pattern.matcher(email);
if (matcher.find()) {
System.out.println(matcher.group(1));// Output : r2d2
}
Another solution is to use split:
String[] split = email.replaceAll("#.*", "").split("\\.");
email = split[split.length - 1];// Output : r2d2
Notes:
String literals in Java must be enclosed in double quotes: "John.Mcgee.r2d2#hitachi.com"
You don't need to escape # in Java, but you have to escape the dot with a double backslash: \\.
There is no for-loop syntax like for (Strimatcher.find()){; you probably meant while

Find pattern text in Java and replace it with another pattern

I have a paragraph of text containing numbers in a specific format,
e.g. "123-21-1234 this is another text - some text 222-34-2244 another text"
I need to select only those numbers (123-21-1234 and 222-34-2244) and convert the text to "123/21/1234 this is another text - some text 222/34/2244 another text"
You can try something like below using Matcher.appendReplacement
public static void main(String[] args) {
    String str = "123-21-1234 this is another text - some text 222-34-2244 another text";
    Pattern p = Pattern.compile("(\\d{3})-(\\d{2})-(\\d{4})");
    Matcher m = p.matcher(str);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        String num = m.group();
        m.appendReplacement(sb, num.replace('-', '/'));
    }
    m.appendTail(sb);
    System.out.println(sb.toString());
}
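Using .replaceAll("-", "/") on the whole string has an annoying side effect: it also replaces hyphens that are not part of a number, such as the one in "text - some text".
Instead, you can replace the exact string literals, or craft your own regex: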
string.replaceAll("123-21-1234", "123/21/1234").replaceAll("222-34-2244", "222/34/2244");
If you wish to match any XXX-XX-XXXX pattern:
string.replaceAll("(\\d{3})-(\\d{2})-(\\d{4})", "$1/$2/$3");
This works by capturing each run of digits in a group and referring to the groups in the replacement ($0 is the whole match, $1 is the first pair of parentheses, $2 is the second, and so on).
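The difference between the group-based replacement and a blanket hyphen replacement can be checked with a small sketch. The class and method names are hypothetical; the regex and the sample text come from the question and answer above.

```java
class DashToSlashDemo {
    // Rewrites only XXX-XX-XXXX digit groups; other hyphens are left alone.
    static String convert(String text) {
        return text.replaceAll("(\\d{3})-(\\d{2})-(\\d{4})", "$1/$2/$3");
    }

    public static void main(String[] args) {
        String str = "123-21-1234 this is another text - some text 222-34-2244 another text";
        // Group-based: the standalone " - " in the middle is preserved.
        System.out.println(convert(str));
        // Blanket replacement: the standalone hyphen becomes a slash too.
        System.out.println(str.replace("-", "/"));
    }
}
```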

Regular expression to match any characters including line breaks in a string using java

I have a string in Java which includes line breaks. I need a regular expression to match any characters, including line breaks.
Here is the string:
String s= "Hello World".(line break)
Note= amount"
I am using this (.*?) but it won't match line breaks.
You can try this pattern, which matches runs of letters and line feeds:
([a-zA-Z\n])+
Tested on regexr. This is an example of how to implement it in Java
String pattern = "([a-zA-Z\n])+";
String input = "String with \n line break.";
Pattern p = Pattern.compile(pattern);
java.util.regex.Matcher m = p.matcher(input);
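If the goal is to let (.*?) itself run across line breaks, an alternative to a character class is the DOTALL flag. The sketch below assumes a sample string modelled on the question ("Hello World", a line break, then "Note= amount") and an arbitrary pair of delimiters; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnyCharDemo {
    // Returns the text captured between "Hello" and "amount", or null if nothing matches.
    static String between(String input, int flags) {
        Matcher matcher = Pattern.compile("Hello(.*?)amount", flags).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "Hello World\nNote= amount"; // assumed sample input

        // Without DOTALL the dot stops at the line break, so there is no match.
        System.out.println(between(s, 0));

        // With DOTALL the captured group includes the line break.
        System.out.println(between(s, Pattern.DOTALL));
    }
}
```

Prefixing the regex with (?s) turns on the same mode inline, which is handy when the pattern is passed to methods such as String.replaceAll that take no flags argument.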

Unescaped java not matching in regex matcher.find()

I have the following code that basically matches "Match this:" and keeps the first sentence. However, Unicode characters sometimes get passed into the text and cause backtracking problems in other, more complicated regexes. Escaping seems to alleviate the backtracking index-out-of-range exceptions. However, now the regex isn't matching.
What I would like to know is why this regex isn't matching when escaped. If you comment out the escape/unescape Java lines, everything works.
String text = "Keep this\n\n"
        + "Match this:\n\nDelete 📱 this";
text = org.apache.commons.lang.StringEscapeUtils.escapeJava(text);
Pattern PATTERN = Pattern.compile("^Match this:$",
        Pattern.MULTILINE);
Matcher m = PATTERN.matcher(text);
if (m.find()) {
    text = text.substring(0, m.start()).replaceAll("[\\n]+$", "");
}
text = org.apache.commons.lang.StringEscapeUtils.unescapeJava(text);
System.out.println(text);
What i would like to know is why this regex isn't matching when escaped?
When you escape a string like "foo\nbar", which is printed as
foo
bar
you get "foo\\nbar", which is printed as
foo\nbar
This happens because StringEscapeUtils.escapeJava also escapes the line feed, replacing it with the two characters \ and n (written "\\n" in Java source). The text then no longer contains a line separator, only literal characters, so ^ and $ have no line boundary to match at.
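The effect can be reproduced without Commons Lang. In the sketch below, a plain String.replace stands in for the part of escapeJava that rewrites line feeds; this stand-in is an assumption for illustration, not the library's implementation, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class EscapedNewlineDemo {
    private static final Pattern LINE =
            Pattern.compile("^Match this:$", Pattern.MULTILINE);

    static boolean hasMatchLine(String text) {
        return LINE.matcher(text).find();
    }

    public static void main(String[] args) {
        String raw = "Keep this\n\nMatch this:\n\nDelete this";
        // Stand-in for escaping: each line feed becomes backslash + 'n'.
        String escaped = raw.replace("\n", "\\n");

        System.out.println(hasMatchLine(raw));     // true: real line breaks exist
        System.out.println(hasMatchLine(escaped)); // false: the text is now one long line
    }
}
```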
A possible solution is to replace "\\n" back with "\n" after calling StringEscapeUtils.escapeJava. Be careful not to unescape a real "\\n" (a backslash followed by n in the original text): escaping turns it into "\\\\n", which is printed as \\n, and it has to stay that way. So maybe use
text = org.apache.commons.lang3.StringEscapeUtils.escapeJava(text);
// turn an escaped `\n` back into a real line feed,
// but only if it is not preceded by another `\`
text = text.replaceAll("(?<!\\\\)\\\\n", "\n");
// do your job
// and now you can unescape your text (real line feeds stay as they are)
text = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(text);
Another option could be creating your own implementation similar to StringEscapeUtils.escapeJava. If you take a look at this method body you will see
return ESCAPE_JAVA.translate(input);
Where ESCAPE_JAVA is
CharSequenceTranslator ESCAPE_JAVA =
    new LookupTranslator(
        new String[][] {
            {"\"", "\\\""},
            {"\\", "\\\\"},
        }).with(
            new LookupTranslator(EntityArrays.JAVA_CTRL_CHARS_ESCAPE())
        ).with(
            UnicodeEscaper.outsideOf(32, 0x7f)
        );
and EntityArrays.JAVA_CTRL_CHARS_ESCAPE() returns clone of
String[][] JAVA_CTRL_CHARS_ESCAPE = {
    {"\b", "\\b"},
    {"\n", "\\n"},
    {"\t", "\\t"},
    {"\f", "\\f"},
    {"\r", "\\r"}
};
array. So if you provide your own table here, one that explicitly maps \n to itself, line feeds will be left unchanged.
So this is what your own implementation could look like:
private static CharSequenceTranslator translatorIgnoringLineSeparators =
    new LookupTranslator(
        new String[][] {
            { "\"", "\\\"" },
            { "\\", "\\\\" },
        }).with(
            new LookupTranslator(new String[][] {
                { "\b", "\\b" },
                { "\n", "\n" }, // this will handle `\n` and will not change it
                { "\r", "\r" }, // this will handle `\r` and will not change it
                { "\t", "\\t" },
                { "\f", "\\f" },
            })).with(UnicodeEscaper.outsideOf(32, 0x7f));

public static String myJavaEscaper(CharSequence input) {
    return translatorIgnoringLineSeparators.translate(input);
}
This method leaves \r and \n unescaped.
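For readers who want to try the idea without the Commons Lang dependency, here is a rough, dependency-free sketch of the same behaviour. It is not the library's implementation: the class name is hypothetical, it escapes per UTF-16 code unit, and control characters other than tab are written in \uXXXX form instead of short escapes such as \b.

```java
class LineFriendlyEscaper {
    // Escapes quotes, backslashes, tabs and everything outside 32..0x7f,
    // but leaves \n and \r untouched so ^ and $ keep working with MULTILINE.
    static String escape(CharSequence input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\n' || c == '\r') {
                out.append(c);              // keep line separators as they are
            } else if (c == '\\') {
                out.append("\\\\");
            } else if (c == '"') {
                out.append("\\\"");
            } else if (c == '\t') {
                out.append("\\t");
            } else if (c < 32 || c > 0x7f) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \uD83D\uDCF1 is the surrogate pair for the phone emoji from the question.
        String text = "Keep this\n\nMatch this:\n\nDelete \uD83D\uDCF1 this";
        System.out.println(escape(text));
    }
}
```

With this kind of escaper the non-ASCII characters are neutralized while the MULTILINE pattern from the question can still find its line.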
