Parsing special character using Java Regex - java

I have a requirement where I need to remove special characters from a string unless they appear in an allowed list. The current code removes every special character it finds:
String Modified_remark = final_remark.replaceAll("[^\\x00-\\x7F]", "");
This code removes all special characters from the string, but I want to retain certain ones, such as the Angstrom symbol (Å) and the micron symbol (μ).
For example, if I place the allowed special characters in an array, I want the code to skip them and replace every other special character with "" (an empty string).
String[] allowedChar = {"Å", "μ"};
More characters will be added when users request them. Can anyone help with this logic?

Just add all the allowedChars to the exception list in your regex:
final_remark.replaceAll("[^\\x00-\\x7F" + String.join("", allowedChar) + "]", "");
Demo: https://ideone.com/iQWvHI
Update
As Wiktor Stribiżew rightly pointed out, this simple code breaks if allowedChar contains regex special characters. Since the requirements imply that allowedChar contains only non-ASCII characters, we can filter allowedChar as follows:
String[] allowedChar = {"Å", "μ", "]"};
String allowedChars = "";
for (String ch : allowedChar)
    if (ch.matches("^[^\\x00-\\x7F]$"))
        allowedChars += ch;
String Modified_remark = final_remark.replaceAll("[^\\x00-\\x7F" + allowedChars + "]", "");
Demo: https://ideone.com/94513e
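For completeness, here is a minimal, self-contained sketch of the filtering approach above. The class and method names (AllowedCharFilter, removeDisallowed) are made up for illustration, and the special characters are written as Unicode escapes only to sidestep source-encoding problems.

```java
class AllowedCharFilter {
    // Removes every non-ASCII character from input except those listed in allowed.
    // Entries that are not a single non-ASCII character (such as "]") are ignored,
    // so they cannot break the generated character class.
    static String removeDisallowed(String input, String[] allowed) {
        StringBuilder extra = new StringBuilder();
        for (String ch : allowed) {
            if (ch.matches("[^\\x00-\\x7F]")) {
                extra.append(ch);
            }
        }
        return input.replaceAll("[^\\x00-\\x7F" + extra + "]", "");
    }

    public static void main(String[] args) {
        // \u00C5 is the A-with-ring letter, \u03BC is Greek mu; "]" is skipped by the filter
        String[] allowed = {"\u00C5", "\u03BC", "]"};
        // The remark ends with a trademark sign (\u2122), which is not allowed
        String remark = "5 \u00C5 thick, 3 \u03BCm wide \u2122";
        System.out.println(removeDisallowed(remark, allowed));
    }
}
```

ASCII characters are always kept by this pattern, so only the non-ASCII entries of the allowed list have any effect.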

Related

String split regex [duplicate]

I'm new to regular expressions and would appreciate your help. I'm trying to put together an expression that will split the example string using all spaces that are not surrounded by single or double quotes. My last attempt looks like this: (?!") and isn't quite working. It's splitting on the space before the quote.
Example input:
This is a string that "will be" highlighted when your 'regular expression' matches something.
Desired output:
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something.
Note that "will be" and 'regular expression' retain the space between the words.
I don't understand why all the others are proposing such complex regular expressions or such long code. Essentially, you want to grab two kinds of things from your string: sequences of characters that aren't spaces or quotes, and sequences of characters that begin and end with a quote, with no quotes in between, for two kinds of quotes. You can easily match those things with this regular expression:
[^\s"']+|"([^"]*)"|'([^']*)'
I added the capturing groups because you don't want the quotes in the list.
This Java code builds the list, adding the capturing group if it matched to exclude the quotes, and adding the overall regex match if the capturing group didn't match (an unquoted word was matched).
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    if (regexMatcher.group(1) != null) {
        // Add double-quoted string without the quotes
        matchList.add(regexMatcher.group(1));
    } else if (regexMatcher.group(2) != null) {
        // Add single-quoted string without the quotes
        matchList.add(regexMatcher.group(2));
    } else {
        // Add unquoted word
        matchList.add(regexMatcher.group());
    }
}
If you don't mind having the quotes in the returned list, you can use much simpler code:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group());
}
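As a rough sketch, the first variant above can be wrapped into a reusable method like this. The class and method names (QuoteAwareTokenizer, tokenize) are illustrative only; the pattern is the one from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareTokenizer {
    // Unquoted word | "double-quoted text" | 'single-quoted text'
    private static final Pattern TOKEN =
            Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                tokens.add(m.group(1));      // double-quoted, quotes stripped
            } else if (m.group(2) != null) {
                tokens.add(m.group(2));      // single-quoted, quotes stripped
            } else {
                tokens.add(m.group());       // plain word
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "This is a string that \"will be\" highlighted when your "
                 + "'regular expression' matches something.";
        for (String token : tokenize(s)) {
            System.out.println(token);
        }
    }
}
```

Compiling the pattern once as a constant avoids recompiling it on every call.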
There are several questions on StackOverflow that cover this same question in various contexts using regular expressions. For instance:
parsings strings: extracting words and phrases
Best way to parse Space Separated Text
UPDATE: Sample regex to handle single and double quoted strings. Ref: How can I split on a string except when inside quotes?
m/('.*?'|".*?"|\S+)/g
Tested this with a quick Perl snippet and the output was as reproduced below. Also works for empty strings or whitespace-only strings if they are between quotes (not sure if that's desired or not).
This
is
a
string
that
"will be"
highlighted
when
your
'regular expression'
matches
something.
Note that this does include the quote characters themselves in the matched values, though you can remove that with a string replace, or modify the regex to not include them. I'll leave that as an exercise for the reader or another poster for now, as 2am is way too late to be messing with regular expressions anymore ;)
If you want to allow escaped quotes inside the string, you can use something like this:
(?:(['"])(.*?)(?<!\\)(?>\\\\)*\1|([^\s]+))
Quoted strings will be group 2, single unquoted words will be group 3.
You can try it on various strings here: http://www.fileformat.info/tool/regex.htm or http://gskinner.com/RegExr/
The regex from Jan Goyvaerts is the best solution I have found so far, but it also creates empty (null) matches, which he excludes in his program. These empty matches also appear in regex testers (e.g. rubular.com).
If you turn the search around (first look for the quoted parts and then for the space-separated words), you can do it in one pass with:
("[^"]*"|'[^']*'|[\S]+)+
(?<!\G".{0,99999})\s|(?<=\G".{0,99999}")\s
This will match the spaces not surrounded by double quotes.
I have to use min,max {0,99999} because Java doesn't support * and + in lookbehind.
It'll probably be easier to search the string, grabbing each part, vs. split it.
The reason is that you can have it split at the spaces before and after "will be", but I can't think of any way to tell split to ignore the space inside the quotes.
(not actual Java)
string = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
regex = "\"(\\\"|(?!\\\").)+\"|[^ ]+"; // search for a quoted or non-spaced group
final = new Array();
while (string.length > 0) {
string = string.trim();
if (Regex(regex).test(string)) {
final.push(Regex(regex).match(string)[0]);
string = string.replace(regex, ""); // progress to next "word"
}
}
Also, capturing single quotes could lead to issues:
"Foo's Bar 'n Grill"
//=>
"Foo"
"s Bar "
"n"
"Grill"
String.split() is not helpful here because there is no way to distinguish between spaces within quotes (don't split) and those outside (split). Matcher.lookingAt() is probably what you need:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
str = str + " "; // add trailing space
int len = str.length();
Matcher m = Pattern.compile("((\"[^\"]+?\")|('[^']+?')|([^\\s]+?))\\s++").matcher(str);
for (int i = 0; i < len; i++)
{
    m.region(i, len);
    if (m.lookingAt())
    {
        String s = m.group(1);
        if ((s.startsWith("\"") && s.endsWith("\"")) ||
            (s.startsWith("'") && s.endsWith("'")))
        {
            s = s.substring(1, s.length() - 1);
        }
        System.out.println(i + ": \"" + s + "\"");
        i += (m.group(0).length() - 1);
    }
}
which produces the following output:
0: "This"
5: "is"
8: "a"
10: "string"
17: "that"
22: "will be"
32: "highlighted"
44: "when"
49: "your"
54: "regular expression"
75: "matches"
83: "something."
I liked Marcus's approach; however, I modified it so that I could allow text next to the quotes and support both " and ' quote characters. For example, I needed a="some value" to not be split into [a=, "some value"].
"(?<!\\G\\S{0,99999}[\"'].{0,99999})\\s|(?<=\\G\\S{0,99999}\".{0,99999}\"\\S{0,99999})\\s|(?<=\\G\\S{0,99999}'.{0,99999}'\\S{0,99999})\\s"
Jan's approach is great but here's another one for the record.
If you actually wanted to split as mentioned in the title, keeping the quotes in "will be" and 'regular expression', then you could use this method which is straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc
The regex:
'[^']*'|\"[^\"]*\"|( )
The two left alternations match complete 'quoted strings' and "double-quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expressions on the left. We replace those with SplitHere then split on SplitHere. Again, this is for a true split case where you want "will be", not will be.
Here is a full working implementation (see the results on the online demo).
import java.util.*;
import java.io.*;
import java.util.regex.*;
class Program {
    public static void main (String[] args) throws java.lang.Exception {
        String subject = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
        Pattern regex = Pattern.compile("\'[^']*'|\"[^\"]*\"|( )");
        Matcher m = regex.matcher(subject);
        StringBuffer b = new StringBuffer();
        while (m.find()) {
            if (m.group(1) != null) m.appendReplacement(b, "SplitHere");
            // quoteReplacement keeps a $ or \ inside a quoted string from being treated as a group reference
            else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
        }
        m.appendTail(b);
        String replaced = b.toString();
        String[] splits = replaced.split("SplitHere");
        for (String split : splits) System.out.println(split);
    } // end main
} // end Program
If you are using C#, you can use
string input = "This is a string that \"will be\" highlighted when your 'regular expression' matches <something random>";
List<string> list1 =
    Regex.Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""|'(?<match>[\w\s]*)'|<(?<match>[\w\s]*)>").Cast<Match>().Select(m => m.Groups["match"].Value).ToList();
foreach (var v in list1)
    Console.WriteLine(v);
I have specifically added "|<(?<match>[\w\s]*)>" to show that you can use any characters to group phrases. (In this case I am using < > to group.)
Output is :
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something random
1st one-liner using String.split()
String s = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
String[] split = s.split( "(?<!(\"|').{0,255}) | (?!.*\\1.*)" );
[This, is, a, string, that, "will be", highlighted, when, your, 'regular expression', matches, something.]
don't split at the blank, if the blank is surrounded by single or double quotes
split at the blank when the 255 characters to the left and all characters to the right of the blank are neither single nor double quotes
adapted from original post (handles only double quotes)
I'm reasonably certain this is not possible using regular expressions alone. Checking whether something is contained inside some other tag is a parsing operation. This seems like the same problem as trying to parse XML with a regex -- it can't be done correctly. You may be able to get your desired outcome by repeatedly applying a non-greedy, non-global regex that matches the quoted strings, then once you can't find anything else, split it at the spaces... that has a number of problems, including keeping track of the original order of all the substrings. Your best bet is to just write a really simple function that iterates over the string and pulls out the tokens you want.
A couple hopefully helpful tweaks on Jan's accepted answer:
(['"])((?:\\\1|.)+?)\1|([^\s"']+)
Allows escaped quotes within quoted strings
Avoids repeating the pattern for the single and double quote; this also simplifies adding more quoting symbols if needed (at the expense of one more capturing group)
You can also try this:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something";
String ss[] = str.split("\"|\'");
for (int i = 0; i < ss.length; i++) {
    if ((i % 2) == 0) { // even index: text outside quotes
        String[] part1 = ss[i].split(" ");
        for (String pp1 : part1) {
            System.out.println("" + pp1);
        }
    } else { // odd index: text that was inside quotes
        System.out.println("" + ss[i]);
    }
}
The following returns an array of arguments. Arguments are the variable 'command' split on spaces, unless included in single or double quotes. The matches are then modified to remove the single and double quotes.
using System.Text.RegularExpressions;
var args = Regex.Matches(command, "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'").Cast<Match>()
    .Select(iMatch => iMatch.Value.Replace("\"", "").Replace("'", "")).ToArray();
When you come across a pattern like this:
String str = "2022-11-10 08:35:00,470 RAV=REQ YIP=02.8.5.1 CMID=caonaustr CMN=\"Some Value Pyt Ltd\"";
//this helped
String[] str1= str.split("\\s(?=(([^\"]*\"){2})*[^\"]*$)\\s*");
System.out.println("Value of split string is "+ Arrays.toString(str1));
This results in: [2022-11-10, 08:35:00,470, RAV=REQ, YIP=02.8.5.1, CMID=caonaustr, CMN="Some Value Pyt Ltd"]
This regex matches whitespace ONLY if it is followed by an even number of double quotes.
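The even-number-of-quotes idea can be sketched on its own as follows. This is a simplified variant of the regex above, not the poster's exact pattern; the names QuoteAwareSplit and splitOutsideQuotes are hypothetical, and the sketch assumes the double quotes in the input are balanced.

```java
import java.util.Arrays;

class QuoteAwareSplit {
    // Splits on runs of whitespace that are followed by an even number of
    // double quotes, i.e. whitespace that lies outside any "..." pair.
    static String[] splitOutsideQuotes(String line) {
        return line.split("\\s+(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
    }

    public static void main(String[] args) {
        String line = "RAV=REQ CMN=\"Some Value\" END";
        System.out.println(Arrays.toString(splitOutsideQuotes(line)));
    }
}
```

Because the lookahead rescans the rest of the line for every whitespace run, this is best kept to short inputs such as single log lines.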

Splitting a string with a certain pattern in Java

I am writing a parser for a file containing the following string pattern:
Key : value
Key : value
Key : value
etc...
I am able to retrieve those lines one by one into a list. What I would like to do is to separate the key from the value for each one of those strings. I know there is the split() method that can take a Regex and do this for me, but I am very unfamiliar with them so I don't know what Regex to give as a parameter to the split() function.
Also, while not in the specifications of the file I am parsing, I would like for that Regex to be able to recognize the following patterns as well (if possible):
Key: value
Key :value
Key:value
etc...
So basically, whether there's a space or not after/before/after AND before the : character, I would like for that Regex to be able to detect it. What is the Regex that can achieve this?
In other words, the split method should look for : with zero or more whitespace characters before and after it.
Key: value
^^
Key :value
^^
Key:value
^
Key : value
^^^
In that case split("\\s*:\\s*") should do the trick.
Explanation:
\\s represents any whitespace character
* means zero or more occurrences of the element described before it
\\s* means zero or more whitespace characters.
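A short sketch of this split applied to the four spacing variants from the question. The trim() call and the limit argument 2 are additions that are not part of the answer: the limit makes split stop at the first colon, so a value may itself contain ":". The names KeyValueSplitDemo and splitKeyValue are made up.

```java
import java.util.Arrays;

class KeyValueSplitDemo {
    static String[] splitKeyValue(String line) {
        // Split only at the first colon; whitespace around it belongs to the delimiter
        return line.trim().split("\\s*:\\s*", 2);
    }

    public static void main(String[] args) {
        String[] samples = {"Key : value", "Key: value", "Key :value", "Key:value"};
        for (String sample : samples) {
            System.out.println(Arrays.toString(splitKeyValue(sample)));
        }
    }
}
```

A line without any colon comes back as a one-element array, so the array length should be checked before reading index 1.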
On the other hand, you may also want to find the entire key:value pair and place the parts matching the key and the value in separate groups (you can even name the groups using (?<groupName>regex)). In that case you can use
Pattern p = Pattern.compile("(?<key>\\w+)\\s*:\\s*(?<value>\\w+)");
Matcher m = p.matcher(yourData);
while (m.find()) {
    System.out.println("key = " + m.group("key"));
    System.out.println("value = " + m.group("value"));
    System.out.println("--------");
}
If you want to use String.split(), you could use this:
String input = "key : value";
String[] s = input.split("\\s*:\\s*");
String key = s[0];
String value = s[1];
This splits the String at the ":" and treats any whitespace directly before and after the ":" as part of the delimiter, so the key and the value come back without those surrounding spaces.
Explanation:
\\s* matches zero or more whitespace characters; by default \\s is equal to [ \\t\\n\\x0B\\f\\r]
The : between the two \\s* means that a : has to be there
Note that this solution will cause an ArrayIndexOutOfBoundsException if your input line does not contain the key-value-format as you defined it.
If you are not sure whether the line really contains a key-value pair, for example because the file ends with an empty line as files normally do, you could do it like this:
String input = "key : value";
Matcher m = Pattern.compile("(\\S+)\\s*:\\s*(.+)").matcher(input);
if (m.matches())
{
    String key = m.group(1); // note that group numbering starts at 1
    String value = m.group(2);
}
Explanation:
\\S+ matches one or more non-whitespace characters, so the key itself cannot contain whitespace. The () around it form a capturing group, so you can get its value with m.group(1).
\\s* matches zero or more whitespace characters; by default \\s is equal to [ \\t\\n\\x0B\\f\\r]
The : between the two \\s* means that a : has to be there
The last group, .+, matches the rest of the line (one or more characters), including whitespace.
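Putting the pieces together, a minimal sketch that turns a list of lines into a map might look like this. The KeyValueParser class, the parse method, the trim() call and the choice of LinkedHashMap (to keep file order) are assumptions for illustration; lines that do not match the pattern are skipped silently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueParser {
    // Same pattern as above: key, optional spaces, colon, optional spaces, value
    private static final Pattern LINE = Pattern.compile("(\\S+)\\s*:\\s*(.+)");

    static Map<String, String> parse(List<String> lines) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (m.matches()) {               // empty or malformed lines are ignored
                result.put(m.group(1), m.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Key : value", "Other:thing", "", "no separator here");
        System.out.println(parse(lines));
    }
}
```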
You can use the split method and pass ":" as the delimiter.
This splits the string when it sees ':', then you can trim the values to get the key and value.
String s = " keys : value ";
String keyValuePairs[] = s.split(":");
String key = keyValuePairs[0].trim();
String value = keyValuePairs[1].trim();
You can also make use of regex to simplify it.
String keyValuePairs[] = s.trim().split("[ ]*:[ ]*");
s.trim() removes the spaces before and after the string (if you have any in your case), so the string becomes "keys : value", and
[ ]*:[ ]*
splits the string using spaces (zero or more), a colon, and spaces (zero or more) as the delimiter.
For a pure regex solution, you can use the following pattern (note the space at the beginning, shown here inside quotes so that it stays visible):
" ?: ?"
See http://regexr.com/39evh
String[] tokensVal = str.split(":");
String key = tokensVal[0].trim();
String value = tokensVal[1].trim();

Error when splitting a string in java

I am trying to split a string according to a certain set of delimiters.
My delimiters are: ,"():;.!? single spaces or multiple spaces.
This is the code I'm currently using,
String[] arrayOfWords= inputString.split("[\\s{2,}\\,\"\\(\\)\\:\\;\\.\\!\\?-]+");
which works fine for most cases, but I have a problem when the first word is surrounded by quotation marks. For example
String inputString = "\"Word\" some more text.";
Is giving me this output
arrayOfWords[0] = ""
arrayOfWords[1] = "Word"
arrayOfWords[2] = "some"
arrayOfWords[3] = "more"
arrayOfWords[4] = "text"
I want the output to give me an array with
arrayOfWords[0] = "Word"
arrayOfWords[1] = "some"
arrayOfWords[2] = "more"
arrayOfWords[3] = "text"
This code has been working fine when quotation marks are used in the middle of the sentence, I'm not sure what the trouble is when it's at the beginning.
EDIT: I just realized I have the same problem when any of the delimiters is the first character of the string.
Unfortunately, you won't be able to remove this empty first element using only split. You should probably first remove any leading delimiters from your string and then split it. Also, your regex seems to be incorrect because
by adding {2,} inside [...] you are making the characters {, 2, , and } delimiters,
you don't need to escape the rest of your delimiters (note that you don't have to escape - only because it is at the end of the character class [], so it can't be interpreted as a range operator).
Try it this way:
String regexDelimiters = "[\\s,\"():;.!?\\-]+";
String inputString = "\"Word\" some more text.";
String[] arrayOfWords = inputString.replaceAll(
        "^" + regexDelimiters, "").split(regexDelimiters);
for (String s : arrayOfWords)
    System.out.println("'" + s + "'");
output:
'Word'
'some'
'more'
'text'
A delimiter is interpreted as separating the strings on either side of it, thus the empty string on its left is added to the result as well as the string to its right ("Word"). To prevent this, you should first strip any leading delimiters, as described here:
How to prevent java.lang.String.split() from creating a leading empty string?
So in short form you would have:
String delim = "[\\s,\"():;.!?\\-]+";
String[] arrayOfWords = inputString.replaceFirst("^" + delim, "").split(delim);
Edit: Looking at Pshemo's answer, I realize he is correct regarding your regex. Inside the brackets it's unnecessary to specify the number of space characters, as they will be caught by the + operator.
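To make the difference visible, here is a small sketch that runs both the plain split and the strip-then-split version on the input from the question. LeadingDelimiterDemo and splitWords are illustrative names only.

```java
import java.util.Arrays;

class LeadingDelimiterDemo {
    static final String DELIM = "[\\s,\"():;.!?\\-]+";

    // Strips leading delimiters first so split() has no empty string to report
    static String[] splitWords(String input) {
        return input.replaceFirst("^" + DELIM, "").split(DELIM);
    }

    public static void main(String[] args) {
        String input = "\"Word\" some more text.";
        // Plain split: the leading quote produces an empty first element
        System.out.println(Arrays.toString(input.split(DELIM)));
        // Strip first, then split: only the four words remain
        System.out.println(Arrays.toString(splitWords(input)));
    }
}
```

Trailing empty strings are already dropped by split itself, which is why the final "." causes no trouble. An input consisting only of delimiters still yields a single empty string, so callers may want to check for that case.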

Split string by array of characters

I want to split a string by an array of characters,
so I have this code:
String target = "hello,any|body here?";
char[] delim = {'|',',',' '};
String regex = "(" + new String(delim).replaceAll("(.)", "\\\\$1|").replaceAll("\\|$", ")");
String[] result = target.split(regex);
Everything works fine except when I add a character like 'Q' to the delim[] array;
it throws an exception:
java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence near index 11
(\ |\,|\||\Q)
So how can I fix that to work with non-special characters as well?
Thanks in advance.
how can i fix that to work with non-special characters as well
Put square brackets around your characters instead of escaping them. If ^ is included in your list of characters, make sure it's not the first character, or escape it separately if it's the only character in the list.
Dashes also need special treatment: they need to go at the beginning or at the end of the character class.
String delimStr = new String(delim);
String regex;
if (delimStr.equals("^")) {
    regex = "\\^";
} else if (delimStr.charAt(0) == '^') {
    // This assumes that all characters are distinct.
    // You may need a stricter check to make this work in general case.
    regex = "[" + delimStr.charAt(1) + delimStr + "]";
} else {
    regex = "[" + delimStr + "]";
}
Using Pattern.quote and putting it in square brackets seems to work:
String regex = "[" + Pattern.quote(new String(delim)) + "]";
Tested with possible problem characters.
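A quick sketch of the Pattern.quote variant, exercised with 'Q' and with the characters that are normally awkward inside a character class. The names CharArraySplit and splitByChars are hypothetical, and the sketch assumes the delimiter array is not empty (an empty array would produce an invalid "[]" class).

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class CharArraySplit {
    // Builds a character class from the delimiters; Pattern.quote wraps them
    // in \Q...\E so none of them is interpreted as regex syntax.
    static String[] splitByChars(String target, char[] delim) {
        String regex = "[" + Pattern.quote(new String(delim)) + "]";
        return target.split(regex);
    }

    public static void main(String[] args) {
        char[] delim = {'|', ',', ' ', 'Q'};
        System.out.println(Arrays.toString(splitByChars("hello,any|body here?Qend", delim)));

        char[] tricky = {'^', '-', ']'};
        System.out.println(Arrays.toString(splitByChars("a^b-c]d", tricky)));
    }
}
```

Since the class has no + quantifier, two adjacent delimiters produce an empty string between them, the same as in the original code.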
Q is not a control character in a regex, so you do not have to put the \\ before it (it only serves to mark that you must interpret the following character as a literal, and not as a control character).
Example
`\\.` in a regex means "a dot"
`.` in a regex means "any character"
\\Q fails because Q is not a special character in a regex, so it does not need to be quoted.
I would make delim a String array and add the quotes to these values that need it.
delim = {"\\|", ..... "Q"};

Removing a substring between two characters (java)

I have a java string such as this:
String string = "I <strong>really</strong> want to get rid of the strong-tags!";
And I want to remove the tags. I have some other strings where the tags are way longer, so I'd like to find a way to remove everything between "<>" characters, including those characters.
One way would be to use the built-in string method that compares the string to a regEx, but I have no idea how to write those.
Caution is advised when using regex to parse HTML (due to its allowable complexity); however, for "simple" HTML and simple text (text without a literal < or > in it), this will work:
String stripped = html.replaceAll("<.*?>", "");
To avoid Regex:
String toRemove = StringUtils.substringBetween(string, "<", ">");
String result = StringUtils.remove(string, "<" + toRemove + ">");
For multiple instances:
String[] allToRemove = StringUtils.substringsBetween(string, "<", ">");
String result = string;
for (String toRemove : allToRemove) {
    result = StringUtils.remove(result, "<" + toRemove + ">");
}
Apache StringUtils functions are null-, empty-, and no match- safe
You should use
String stripped = html.replaceAll("<[^>]*>", "");
or
String stripped = html.replaceAll("<[^<>]*>", "");
where <[^>]*> matches substrings starting with <, then zero or more chars other than > (or the chars other than < and > if you choose the second version) and then a > char.
Note that <.*?>
is less efficient than a negated character class (see Which would be better non-greedy regex or negated character class?)
does not find substrings spanning across multiple lines (see How do I match any character across multiple lines in a regular expression?), but it can be solved with (?s)<.*?>, <(?s:.)*?>, <[\w\W]*?>, and many other not-so-efficient variations.
See the regex demo.
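The multi-line difference can be checked with a small sketch like the one below. TagStripper and stripTags are illustrative names, and the same caution about parsing real-world HTML with regex applies.

```java
class TagStripper {
    // Removes anything that looks like <...>; [^>] also matches line breaks,
    // so tags that span several lines are removed as well.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String simple = "I <strong>really</strong> want to get rid of the strong-tags!";
        System.out.println(stripTags(simple));

        String multiLine = "<a\nhref=\"x\">link</a>";
        System.out.println(stripTags(multiLine));              // both tags removed
        System.out.println(multiLine.replaceAll("<.*?>", "")); // opening tag survives
    }
}
```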
