Regex based string split in Java


String delimiterRegexp = "(;|:|[^<]/)";
String value = "get/time/pick me <i>Jack</i>";
String[] splitedTexts = value.split(delimiterRegexp);
for (String text : splitedTexts) {
    System.out.println(text);
}
Output:
ge
tim
pick me <i>Jack</i>
Expected Result:
get
time
pick me <i>Jack</i>
The character before each / is being treated as part of the delimiter. Could anyone help me write a regex that splits the text on the delimiter "/" but ignores the "/" of an XML end tag?

Your regex should be like this:
(;|:|(?<!<)/)
with a negative lookbehind, demo: https://regex101.com/r/2k1WI5/1/
Your current regex [^<]/ matches any single character other than < together with the / that follows it. That character can be a letter, a space, a newline, or even a Japanese character.
That's why you are losing some letters: they are consumed as part of the separator.
Following The fourth bird recommendation, you can even simplify the regex into: ([;:]|(?<!<)/)
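As a rough sketch of the lookbehind approach in Java, the simplified pattern can be passed straight to String.split. The class and method names below are made up for illustration; only the regex comes from the answer above.

```java
class SplitOnDelimiters {
    // Split on ';', ':' or a '/' that is not immediately preceded by '<'.
    // The lookbehind (?<!<) is zero-width, so no neighbouring character is consumed.
    static String[] splitText(String value) {
        return value.split("[;:]|(?<!<)/");
    }

    public static void main(String[] args) {
        String value = "get/time/pick me <i>Jack</i>";
        for (String text : splitText(value)) {
            System.out.println(text);
        }
    }
}
```

Because the lookbehind only inspects the previous character, the "/" inside </i> is left alone while the other slashes act as separators.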

[^<]/ will match e/ and t/, so the letter is consumed along with the slash.
Use a lookbehind instead; it has the wanted behaviour of only treating / as a separator when it is not part of a closing tag.
On regex101.com
(?<!<)/
The whole regex
(;|:|(?<!<)/)

Related

Regex to check if String is one word in Java

I need regex to check if String has only one word (e.g. "This", "Country", "Boston ", " Programming ").
So far I used an alternative way of doing it which is to check if String contains spaces. However, I am sure that this can be done using regex.
One possible way in my opinion is "^\w{2,}\s". Does this work properly? Are there any other possible answers?

The pattern ^\w{2,}\s matches 2 or more word characters from the start of the string, followed by a mandatory whitespace char (which can also be a newline). Because the whitespace is mandatory, it does not match a bare word such as This.
As the pattern has no anchor at the end, it can also match Boston followed by the space in Boston test.
If you want to match a single word with at least 2 characters surrounded by optional horizontal whitespace characters, use \h* on both sides and add an anchor $ to assert the end of the string.
^\h*\w{2,}\h*$
Regex demo
In Java
String regex = "^\\h*\\w{2,}\\h*$";
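A minimal sketch of how that pattern might be wrapped in a check, assuming Java 8 or later (where \h is available). The class name OneWordCheck and the method isOneWord are hypothetical, not part of any library.

```java
import java.util.regex.Pattern;

class OneWordCheck {
    // Optional horizontal whitespace, one run of 2+ word characters,
    // optional horizontal whitespace, then the end of the string.
    private static final Pattern ONE_WORD = Pattern.compile("^\\h*\\w{2,}\\h*$");

    // Hypothetical helper: true when the input holds exactly one such word.
    static boolean isOneWord(String s) {
        return ONE_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"This", "Boston ", " Programming ", "Boston test"};
        for (String sample : samples) {
            System.out.println("\"" + sample + "\" -> " + isOneWord(sample));
        }
    }
}
```

The first three samples are accepted and "Boston test" is rejected, since a second word cannot fit between the two anchors.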

How to split a string by triple dots or horizontal ellipsis

I am facing problems when trying to split a String by "..."
String text ="Here…It is safer.";
I tried:
String [] output = text.split("[\\...]");
String [] output = text.split("\\.");
and many others, but I haven't found the solution yet.
I know that the question is very simple, but I will be happy If somebody can explain how should I make it work.

Regex for matching three dots is \\.{3} or \\.\\.\\. or [.][.][.] or \\Q...\\E.
Both [\\...] and \\. match a single dot, because repeated characters inside a character class are treated as a single character.
Horizontal ellipsis is a different character. It is not a metacharacter in regex language, so it can be matched directly with no escaping:
String [] output = text.split("…");
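To make the difference concrete, here is a small sketch comparing the patterns discussed above. The class name and the splitBy helper are made up for illustration, and the ellipsis is written as the escape \u2026 only to sidestep source-encoding problems.

```java
import java.util.Arrays;

class EllipsisSplit {
    // Thin wrapper so each pattern can be tried on the same input.
    static String[] splitBy(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        String threeDots = "Here...It is safer.";
        String ellipsis = "Here\u2026It is safer."; // U+2026 HORIZONTAL ELLIPSIS

        // Both of these split on every single dot, leaving empty strings in between.
        System.out.println(Arrays.toString(splitBy(threeDots, "\\.")));
        System.out.println(Arrays.toString(splitBy(threeDots, "[\\...]")));

        // This one treats exactly three consecutive dots as one separator.
        System.out.println(Arrays.toString(splitBy(threeDots, "\\.{3}")));

        // The ellipsis character needs no escaping in the regex.
        System.out.println(Arrays.toString(splitBy(ellipsis, "\u2026")));
    }
}
```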

In general, you can use
String[] chunks = text.split("…|\\.{3}");
To also remove the enclosing whitespace:
String[] chunks = text.split("\\s*(?:…|\\.{3})\\s*");
See this regex demo.
If you need to make sure the triple dots are NOT enclosed with other dot chars, you can add lookarounds:
String[] chunks = text.split("\\s*(?:…|(?<!\\.)\\.{3}(?!\\.))\\s*");
Details:
\s* - zero or more whitespaces
(?:...) - a non-capturing group
… - an ellipsis
| - or
(?<!\.) - a negative lookbehind that fails the match if there is a dot char immediately to the left of the current location
\.{3} - triple dots
(?!\.) - a negative lookahead that fails the match if there is a dot char immediately to the right of the current location.
See a Java demo:
String text = "Here…It is safer... The end.";
String[] chunks = text.split("\\s*(?:…|\\.{3})\\s*");
System.out.println(Arrays.toString(chunks));
// => [Here, It is safer, The end.]
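The lookaround variant can be sketched the same way. This is only an illustration of the pattern given above; the class name GuardedEllipsisSplit and the sample sentence are invented.

```java
import java.util.Arrays;

class GuardedEllipsisSplit {
    // Split on an ellipsis character, or on exactly three dots that are not
    // adjacent to a further dot, trimming whitespace around the separator.
    static String[] split(String text) {
        return text.split("\\s*(?:\u2026|(?<!\\.)\\.{3}(?!\\.))\\s*");
    }

    public static void main(String[] args) {
        // The run of four dots is left alone; only the run of three splits.
        System.out.println(Arrays.toString(split("Wait.... what... now")));
    }
}
```

With the lookarounds in place, a run of four dots is not treated as a separator, so "Wait.... what" stays in one chunk.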

Regex for multiple dots would be:
(\.)*
Java would require something like this, if I remember correctly:
(\\.)*
Edit: Just noticed you asked for triple dot only. Since there is a correct answer already I'm going to leave this here just in case.

Regular expression to remove unwanted characters from the String

I have a requirement where I need to remove unwanted characters for String in java.
For example,
Input String is
Income ......................4,456
liability........................56,445.99
I want the output as
Income 4,456
liability 56,445.99
What is the best approach to write this in java. I am parsing large documents
for this hence it should be performance optimized.

You can do this replace with this line of code:
System.out.println("asdfadf ..........34,4234.34".replaceAll("[ ]*\\.{2,}"," "));

For this particular example, I might use the following replacement:
String input = "Income ......................4,456";
input = input.replaceAll("(\\w+)\\s*\\.+(.*)", "$1 $2");
System.out.println(input);
Here is an explanation of the pattern being used:
(\\w+) match AND capture one or more word characters
\\s* match zero or more whitespace characters
\\.+ match one or more literal dots
(.*) match AND capture the rest of the line
The two quantities in parentheses are known as capture groups. The regex engine remembers what these were while matching, and makes them available, in order, as $1 and $2 to use in the replacement string.
Output:
Income 4,456
Demo
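Since the question mentions large documents, one option worth considering is compiling the pattern once and reusing it, because String.replaceAll compiles its regex on every call. The following is a sketch under that assumption; the class name, the clean method, and the choice of "two or more dots" as the definition of a dot leader are illustrative, not taken from the answers above.

```java
import java.util.regex.Pattern;

class DotLeaderCleaner {
    // Two or more dots, plus any whitespace around them, count as a dot leader.
    // Requiring at least two dots keeps the decimal point in 56,445.99 intact.
    private static final Pattern DOT_LEADERS = Pattern.compile("\\s*\\.{2,}\\s*");

    // Replace each dot leader with a single space.
    static String clean(String line) {
        return DOT_LEADERS.matcher(line).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(clean("Income ......................4,456"));
        System.out.println(clean("liability........................56,445.99"));
    }
}
```

Whether this makes a measurable difference depends on the workload, so it is worth profiling before relying on it.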

Best way to do that is like:
String result = yourString.replaceAll("[-+.^:,]","");
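That will replace each of the characters -, +, ., ^, : and , with nothing. Note that it also strips the commas and decimal points inside the numbers, so it only fits if those are not needed.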

Correct existing regular expression / create a new one

I am trying to learn regular expressions and am trying to replace values in a string with whitespace using regular expressions, to feed it into a tokenizer. The string might contain many punctuation marks. However, I do not want to replace punctuation in words that contain an apostrophe or hyphen within them.
For example,
six-pack => six-pack
He's => He's
This,that => This That
I tried to replace all the punctuations with whitespace initially but that would not work.
I tried to replace only those punctuations by specifying the wordboundaries as in
\B[^\p{L}\p{N}\s]+\B|\b[^\p{L}\p{N}\s]+\B|\B[^\p{L}\p{N}\s]+\b
But, I am not able to exclude the hyphen and apostrophe from them.
My guess is that the above regex is also very cumbersome and there should be a better way. Is there any?
So, all I am trying to do is:
Replace all punctuations with whitespace
Do not do the above if they are hyphen/apostrophe
Do replace if the hyphen/apostrophe does occur at start/end of a word.
Any help is appreciated.

You can probably work out a set of punctuation characters that are ok between words, and another set that isn't, then define your regular expression based on that.
For instance:
String[] input = {
    "six-pack",  // => six-pack
    "He's",      // => He's
    "This,that"  // => This that
};
for (String s : input) {
    System.out.println(s.replaceAll("(?<=\\w)[\\p{Punct}&&[^'-]](?=\\w)", " "));
}
Output
six-pack
He's
This that
Note
Here I'm defining the Pattern with a character class that takes the POSIX punctuation class \p{Punct} and intersects it with a negated class excluding ' and -, and I require a word character immediately before and after the match.

You can use this lookahead based regex:
(?!((?!^)['-].))\\p{Punct}
RegEx Demo

You could use negative lookahead assertion like below,
String s = "six-pack\n"
+ "He's\n"
+ "This,that";
System.out.println(s.replaceAll("(?m)^['-]|['-]$|(?!['-])\\p{Punct}", " "));
Output:
six-pack
He's
This that
Explanation:
(?m) Multiline Mode
^['-] Matches ' or - which are at the start.
| OR
['-]$ Matches ' or - which are at the end of the line.
| OR
(?!['-])\\p{Punct} Matches all the punctuations except these two ' or - . It won't touch the matched [-'] symbols (ie, at the start and end).
RegEx Demo
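Putting the three requirements from the question together, one possible sketch is shown below. It is not taken from any of the answers: the class name, the clean method, and the decision to treat "inside a word" as "between letters or digits" are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class PunctuationCleaner {
    private static final Pattern UNWANTED = Pattern.compile(
            "(?<![\\p{L}\\p{N}])['-]"      // ' or - not preceded by a letter or digit
          + "|['-](?![\\p{L}\\p{N}])"      // ' or - not followed by a letter or digit
          + "|[\\p{Punct}&&[^'-]]");       // any other ASCII punctuation character

    // Replace every unwanted punctuation character with a single space.
    static String clean(String s) {
        return UNWANTED.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        String[] input = {"six-pack", "He's", "This,that", "'quoted'"};
        for (String s : input) {
            System.out.println("[" + clean(s) + "]");
        }
    }
}
```

A hyphen or apostrophe survives only when it has a letter or digit on both sides; at the start or end of a word it is replaced like any other punctuation.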

Removing all standalone occurences of a word from a string with regular expressions in Java

Need advice on how to replace a sub-string like: #sometext, but not replace "#someothertext#somemail.com" sub-string.
For example, when I've got a string something like:
An example with #sometext and also with "#someothertext#somemail.com" sometextafter
And the result, after replacing sub-strings in string above should look like:
An example with and also with "#someothertext#somemail.com" sometextafter
After getting string from a field, I'm using:
String textMod = someText.replaceAll("( |^)[^\"]#[^#]+?( |$)","");
someText = textMod + "#\"" + someone.getEmail() + "\" ";
And then I'm setting this string into field.

You can do a regex on a standalone occurrence this way
\b#sometext\b
Putting the \b after #sometext makes sure that it is not followed by more word characters. Be careful with the \b in front, though: # is not a word character, so \b before it only matches when the preceding character is a word character; a lookbehind such as (?<!\w) may be closer to what is wanted. Then if it's found the result will be put inside $match, and you can do whatever you want with $match
Hope this helps
From https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
The \b in the pattern indicates a word boundary, so only the distinct word "web" is matched, and not a word partial like "webbing" or "cobweb"
if (preg_match("/\bweb\b/i", "PHP is the web scripting language of choice.")) {
    echo "A match was found.";
}
^ PHP example but you get the point

If there is always a space before and behind the tags to replace, this might suffice.
/\s(#\w+)\s/g

Try this
(?<!\w)#[^#\s]+(?!\S)
See it here on Regexr
Match on a # but only if there is no word character \w before it (?<!\w). Then match a sequence of characters that are not # and not whitespace \s, but only if it is not followed by a non-whitespace character \S
(?<!\w) is called a negative lookbehind assertion
[^#\s] is called a negated character class, means match anything that is not part of the class
(?!\S) is a negative lookahead assertion
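In Java, the lookaround pattern above could be applied roughly as follows. The class name, the removeTags method, and the extra step that collapses leftover double spaces are assumptions added for illustration.

```java
import java.util.regex.Pattern;

class StandaloneTagRemover {
    // A '#' not preceded by a word character, then characters that are neither
    // '#' nor whitespace, and nothing but whitespace (or the end) right after.
    private static final Pattern STANDALONE_TAG =
            Pattern.compile("(?<!\\w)#[^#\\s]+(?!\\S)");

    static String removeTags(String text) {
        String withoutTags = STANDALONE_TAG.matcher(text).replaceAll("");
        // Removing a tag leaves its surrounding spaces behind; tidy them up.
        return withoutTags.replaceAll(" {2,}", " ").trim();
    }

    public static void main(String[] args) {
        String text = "An example with #sometext and also with "
                + "\"#someothertext#somemail.com\" sometextafter";
        System.out.println(removeTags(text));
    }
}
```

The quoted "#someothertext#somemail.com" is left untouched because its first part is followed by another # and its second # is preceded by a word character.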

This should correspond to your needs:
str = str.replaceAll("#\\w+[^#]", "");

(C#, regex based)
// Match #xxx sequences, but only if I can look back and NOT see a #xxx immediately preceding me, and if I don't end with a #
string input = @"[An example with #hello and also with ""##hello#somemail.com"" sometext #lastone";
var pattern = @"(?<!#\w+)(?>#\w+)(?!#)";
var matches = Regex.Matches(input, pattern);

Simply adding spaces before and after "#sometext" would not work if "#sometext" is at the start or end of a sentence. However, just adding a pattern checking for start or end of sentence would not work either, as when you match "#sometext " at the start of a sentence and leave a space " ", this will make the resulting string look strange. Same goes for the end of a sentence.
We need to split the regex replace into two actions, and perform two separate regex replaces:
str = str.replaceAll(" #sometext ", " ");
str = str.replaceAll("^#sometext | #sometext$|(?:#sometext ){2,}", "");
^ means start of line, $ means end of line.
EDIT: Added corner case handling of when several #sometext's are after each other.

myString = myString.replaceAll(" #hello ", " ");
If #hello is a single word, then it has spaces before and after, right? So you should find all #hellos with space before and after and replace it with a space.
If you need to remove not only #hellos and all words which are starting with # and not containing other #, use this:
myString = myString.replaceAll(" #[^#]+? ", " ");
[^#] is any symbol except #. +? means match at least one character until reaching the first space.
If you want to remove words with only alphanumeric characters, use \\w instead of [^#]
EDIT:
Yeah, ohaal's right. To make it match at the start and the end of string use this pattern:
( |^)#[^#]+?( |$)
myString = myString.replaceAll("( |^)#hello( |$)", " ");
