Regular expression to remove unwanted characters from the String - java

I have a requirement where I need to remove unwanted characters from a String in Java.
For example,
Input String is
Income ......................4,456
liability........................56,445.99
I want the output as
Income 4,456
liability 56,445.99
What is the best approach to write this in Java? I am parsing large documents,
so it should be optimized for performance.

You can do the replacement with this line of code:
System.out.println("asdfadf ..........34,4234.34".replaceAll("[ ]*\\.{2,}"," "));

For this particular example, I might use the following replacement:
String input = "Income ......................4,456";
input = input.replaceAll("(\\w+)\\s*\\.+(.*)", "$1 $2");
System.out.println(input);
Here is an explanation of the pattern being used:
(\\w+) match AND capture one or more word characters
\\s* match zero or more whitespace characters
\\.+ match one or more literal dots
(.*) match AND capture the rest of the line
The two quantities in parentheses are known as capture groups. The regex engine remembers what these were while matching, and makes them available, in order, as $1 and $2 to use in the replacement string.
Output:
Income 4,456

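One way to strip a fixed set of characters is:
String result = yourString.replaceAll("[-+.^:,]","");
This replaces every occurrence of the listed special characters with nothing. Note that it also strips the commas and decimal points inside numbers such as 56,445.99, so it does not produce the output asked for in the question.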
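Since the question mentions large documents, one option is to compile the pattern once and reuse it, because String.replaceAll compiles its regex again on every call. The following is a minimal sketch, assuming the input is processed line by line. The class and method names are illustrative, and the pattern (optional spaces followed by two or more dots) is the one from the first answer, not a benchmarked choice.

```java
import java.util.regex.Pattern;

class DotLeaderCleaner {
    // Compiled once and reused: optional spaces followed by a run of two or more dots.
    private static final Pattern DOT_LEADER = Pattern.compile(" *\\.{2,}");

    // Replaces each dot leader (and any spaces before it) with a single space.
    static String clean(String line) {
        return DOT_LEADER.matcher(line).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(clean("Income ......................4,456"));          // Income 4,456
        System.out.println(clean("liability........................56,445.99")); // liability 56,445.99
    }
}
```

A single dot, such as the decimal point in 56,445.99, is left alone because the pattern requires at least two consecutive dots.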

Related

Regex to check if String is one word in Java

I need regex to check if String has only one word (e.g. "This", "Country", "Boston ", " Programming ").
So far I used an alternative way of doing it which is to check if String contains spaces. However, I am sure that this can be done using regex.
One possible way in my opinion is "^\w{2,}\s". Does this work properly? Are there any other possible answers?
The pattern ^\w{2,}\s matches 2 or more word characters from the start of the string, followed by a mandatory whitespace char (that can also match a newline).
As the pattern is not anchored at the end, it can also match Boston in Boston test.
To match a single word with at least 2 characters surrounded by optional horizontal whitespace characters, use \h* on both sides and add an anchor $ to assert the end of the string.
^\h*\w{2,}\h*$
In Java
String regex = "^\\h*\\w{2,}\\h*$";
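As a rough illustration of how that regex could be used, here is a small sketch. It assumes Java 8 or later (where \h is available in java.util.regex); the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

class SingleWordCheck {
    // Optional horizontal whitespace, a word of at least 2 characters, optional horizontal whitespace.
    private static final Pattern SINGLE_WORD = Pattern.compile("^\\h*\\w{2,}\\h*$");

    // True when the whole string is exactly one word, ignoring surrounding spaces or tabs.
    static boolean isSingleWord(String s) {
        return SINGLE_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSingleWord("This"));          // true
        System.out.println(isSingleWord("Boston "));       // true
        System.out.println(isSingleWord(" Programming ")); // true
        System.out.println(isSingleWord("Boston test"));   // false
    }
}
```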

Regex based string split in Java

String delimiterRegexp = "(;|:|[^<]/)";
String value = "get/time/pick me <i>Jack</i>";
String[] splitedTexts = value.split(delimiterRegexp);
for (String text : splitedTexts) {
System.out.println(text);
}
Output:
ge
tim
pick me <i>Jack</i>
Expected Result:
get
time
pick me <i>Jack</i>
A character is being consumed as part of the delimiter along with /. Could anyone help me write a regex that splits the text on the delimiter "/" but ignores the "/" in an XML end tag?
Your regex should be like this:
(;|:|(?<!<)/)
with a negative lookbehind, demo: https://regex101.com/r/2k1WI5/1/
Your current regex [^<]/ will match basically any character that is not < followed by / even \n, space, and Japanese characters.
That's why you are losing some letters as they are considered as part of the separator.
Following The fourth bird's recommendation, you can even simplify the regex to: ([;:]|(?<!<)/)
[^<]/ will match e/ and t/.
Use a lookbehind instead; it treats / as a separator only when it is not part of a closing tag.
On regex101.com
(?<!<)/
The whole regex
(;|:|(?<!<)/)
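Applied to the code from the question, the lookbehind version might look like the sketch below. The class and method names are illustrative; the regex is the simplified form suggested in the answers.

```java
import java.util.Arrays;
import java.util.List;

class TagAwareSplit {
    // Splits on ';', ':' or '/', except when the '/' directly follows '<' (a closing tag).
    static List<String> split(String value) {
        return Arrays.asList(value.split("[;:]|(?<!<)/"));
    }

    public static void main(String[] args) {
        for (String text : split("get/time/pick me <i>Jack</i>")) {
            System.out.println(text);
        }
        // get
        // time
        // pick me <i>Jack</i>
    }
}
```

Because the lookbehind is zero-width, only the / itself is consumed as the separator, so no letters are lost.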

Regex: Match group if present otherwise ignore and proceed with other matches

I have been trying to match a regex pattern within the following data:
String:
TestData to 1colon delimiter list has 1 rows.Set...value is: 1 Save Error: 267467374736437-TestInfo send Error
Words to match:
TestData
267467374736437-TestInfo
Regex pattern I'm using:
(.+?\s)?.*(\s\d+-.*?\s)?
The scenario here is that the 2nd match (267467374736437-TestInfo) can be absent from the string to be matched. So I want it to be matched if it exists, and otherwise to proceed with the other matches. Because of this I added the zero-or-one quantifier ? to the group above. But then it ignores the 2nd group altogether.
If I use the pattern below:
(.+?\s)?.*(\s\d+-.*?\s)
It matches just fine, but fails if the string "267467374736437-TestInfo" is missing from the input, since the group no longer has the "?" quantifier.
Please help me understand where it is going wrong.
I would rather not use a complex regex, which will be ugly and a maintenance nightmare. Instead, one simple way would be to just split the string and grab the first term, and then use a smart regex to pinpoint the second term.
String input = "TestData to 1colon delimiter list has 1 rows.Set...value is: 1 Save Error: 267467374736437-TestInfo send Error";
String first = input.split(" ")[0];
String second = input.replaceAll(".*Save Error:\\s(\\S+)\\s.*", "$1");
An optional pattern at the end will almost never be matched if a more generic pattern precedes it. In your case, the greedy dot .* grabs the whole rest of the line up to the end, and since the last pattern is optional, the regex engine calls it a day and does not try to accommodate any text for it.
If you had a lazy dot .*?, the only position where it would work is right after the preceding subpattern, which is rarely the case.
Thus, you can only rely on a tempered greedy token:
^(\S+)(?:(?!\d+-\S).)*(\d+-\S+)?
Or an unrolled version:
^(\S+)\D*(?:\d(?!\d*-\S)\D*)*(\d+-\S+)?
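To show how the tempered greedy token behaves when the optional part is present or absent, here is a small sketch using the first pattern above. The class and method names are hypothetical, and the second test string is a shortened variant of the question's input made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupMatch {
    // Group 1: first word. Group 2 (optional): a "digits-text" token such as 267467374736437-TestInfo.
    private static final Pattern P =
            Pattern.compile("^(\\S+)(?:(?!\\d+-\\S).)*(\\d+-\\S+)?");

    // Returns {first word, optional token}; the second element is null when the token is absent.
    static String[] extract(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String full = "TestData to 1colon delimiter list has 1 rows.Set...value is: 1 "
                + "Save Error: 267467374736437-TestInfo send Error";
        String[] a = extract(full);
        System.out.println(a[0] + " / " + a[1]); // TestData / 267467374736437-TestInfo

        // Hypothetical input without the optional token.
        String[] b = extract("TestData to 1colon delimiter list has 1 rows.");
        System.out.println(b[0] + " / " + b[1]); // TestData / null
    }
}
```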

Merge multiple regex in Java

I have written a regex to omit the characters after the first occurrence of some characters (, and #)
String number = "(123) (456) (7890)#123";
number = number.replaceAll("[,#](.*)", ""); //This is the 1st regex
Then a second regex to get only numbers (remove spaces and other non numeric characters)
number = number.replaceAll("[^0-9]+", ""); //This is the 2nd regex
Output: 1234567890
How can I merge the two regexes into one, like piping the output of the first regex into the second?
You can combine both regexes by chaining the calls:
String number = "(123) (456) (7890)#123";
number = number.replaceAll("[,#](.*)", "").replaceAll("[^0-9]+", "");
So you need to remove all symbols other than digits, plus the whole rest of the string after the first hash symbol or comma.
You cannot just concatenate the patterns with the | operator, because one of the patterns is anchored implicitly at the end of the string.
The alternative that removes non-digits must also exclude hashes and commas ([^#,0-9]+), since the regex engine processes the string from left to right; otherwise it would consume the # or , before the other alternative could match it together with the text after it. Use the DOTALL modifier ((?s)) in case you have newline symbols in your input.
Use
(?s)[,#].*$|[^#,0-9]+
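As a quick check of the combined pattern against the input from the question, a minimal sketch (the class and method names are made up for the example):

```java
class PhoneDigits {
    // One pass: drop everything from the first ',' or '#' onward, and drop any other non-digit run.
    static String digitsBeforeExtension(String s) {
        return s.replaceAll("(?s)[,#].*$|[^#,0-9]+", "");
    }

    public static void main(String[] args) {
        System.out.println(digitsBeforeExtension("(123) (456) (7890)#123")); // 1234567890
    }
}
```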

Replacing first occurence of two asterisks in a String in java

In Java, I need to replace a double asterisk, only the first occurrence. How?
I want that:
the first "**" --> "<u>"
and the second "**" --> "<\u>"
Example:
String a = "John **Doe** is a bad boy"
should become:
String a = "John <u>Doe<\u> is a bad boy"
using something like:
a = a.replaceFirst("**","<u>").replaceFirst("**","<\u>")
How?
You need to escape the asterisks to avoid them being interpreted as part of a regular expression:
a = a.replaceFirst(Pattern.quote("**"), "<u>");
Or:
a = a.replaceFirst("\\Q**\\E", "<u>");
Or:
a = a.replaceFirst("\\*\\*", "<u>");
To perform your translation you could do this:
a = a.replaceAll("\\*\\*(.*?)\\*\\*", "<u>$1</u>");
The advantage of a single replaceAll over a pair of replaceFirst calls is that replaceAll would work for strings containing multiple asterisked words, e.g. "John **Doe** is a **bad** boy".
Essentially the matching expression means:
\\*\\* -- literal "**"
( -- start a capturing group
. -- match any character (except LF, CR)
* -- zero or more of them
? -- not greedily (i.e. find the shortest match possible)
) -- end the group
\\*\\* -- literal "**"
The replacement:
<u> -- literal <u>
$1 -- the contents of the captured group (i.e. text inside the asterisks)
</u> -- literal </u>
By the way, I've changed your end tag to </u> instead of <\u> :-)
Depending on your requirements, you might be able to use a Markdown parser, e.g. Txtmark and save yourself reinventing the wheel.
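The point about multiple asterisked words can be checked with a short sketch like the one below. The class and method names are illustrative; the regex and replacement are the ones explained above.

```java
class UnderlineMarkup {
    // Wraps the text between each pair of "**" markers in <u>...</u>.
    static String underline(String s) {
        return s.replaceAll("\\*\\*(.*?)\\*\\*", "<u>$1</u>");
    }

    public static void main(String[] args) {
        System.out.println(underline("John **Doe** is a bad boy"));
        // John <u>Doe</u> is a bad boy
        System.out.println(underline("John **Doe** is a **bad** boy"));
        // John <u>Doe</u> is a <u>bad</u> boy
    }
}
```

The lazy quantifier is what keeps each match limited to one pair of markers; with a greedy .* the second example would be wrapped as a single span from Doe to bad.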
You can use:
String a = "John **Doe** is a bad boy";
a = a.replaceFirst("\\Q**\\E", "<u>").replaceFirst("\\Q**\\E", "</u>");
//=> John <u>Doe</u> is a bad boy
As mentioned above by aetheria and going with what you already are trying:
a = a.replaceFirst("\\*\\*", "<u>").replaceFirst("\\*\\*", "</u>");
When you want to try something else, I recommend using the online regex tester below which will show the results of different patterns using replaceFirst, replaceAll, etc on different input strings. It will also provide in the top left the correctly escaped string that should be used in your Java code.
http://www.regexplanet.com/advanced/java/index.html
I would do this:
String a = "John **Doe** is a bad boy";
String b = a.replaceAll("\\*\\*(.*?)\\*\\*", "<u>$1</u>");
//John <u>Doe</u> is a bad boy
REGEX EXPLANATION
\*\*(.*?)\*\*
Match the character “*” literally «\*»
Match the character “*” literally «\*»
Match the regex below and capture its match into backreference number 1 «(.*?)»
Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “*” literally «\*»
Match the character “*” literally «\*»
<u>$1</u>
Insert the character string “<u>” literally «<u>»
Insert the text that was last matched by capturing group number 1 «$1»
Insert the character string “</u>” literally «</u>»
