I have a text file that uses | (pipe) as the separator. If a column value itself contains |, splitting the line creates an extra column.
Example :
name|date|age
zzz|20-03-22|23
"xx|zz"|23-23-33|32
How can I escape the character within the double quotes ""
how to escape the regular expression used in the split, so that it works for user-specified delimiters
I have tried:
String[] cols = line.split("\\|");
System.out.println("lets see column only=="+cols[1]);
How can I escape the character within the double quotes ""
Here's one approach:
String str = "\"xx|zz\"|23-23-33|32";
Matcher m = Pattern.compile("\"[^\"]*\"").matcher(str);
StringBuffer sb = new StringBuffer();
while (m.find())
m.appendReplacement(sb, m.group().replace("|", "\\\\|"));
m.appendTail(sb);
System.out.println(sb); // prints "xx\|zz"|23-23-33|32
In order to get the columns back you'd do something like this:
String str = "\"xx\\|zz\"|23-23-33|32";
String[] cols = str.split("(?<!\\\\)\\|");
for (String col : cols)
System.out.println(col.replace("\\|", "|"));
Regarding your edit:
how to escape the regular expression used in the split, so that it works for user-specified delimiters
You should use Pattern.quote on the string you want to split on:
String[] cols = line.split(Pattern.quote(delimiter));
This will ensure that the split works as intended even if delimiter contains special regex-symbols such as . or |.
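As a minimal sketch of the Pattern.quote approach, the helper below splits on whatever delimiter the user supplies. The class and method names (DelimiterSplit, splitLiteral) are made up for illustration, and the -1 limit is an extra choice so that trailing empty columns are kept.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class DelimiterSplit {
    // Hypothetical helper: treats the delimiter as literal text, not as a regex.
    // The -1 limit keeps trailing empty columns instead of dropping them.
    static String[] splitLiteral(String line, String delimiter) {
        return line.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        // Both "|" and "." are regex metacharacters, yet they split literally here.
        System.out.println(Arrays.toString(splitLiteral("zzz|20-03-22|23", "|")));
        System.out.println(Arrays.toString(splitLiteral("a.b.c", ".")));
    }
}
```

Note that this only addresses regex metacharacters in the delimiter; a delimiter that appears inside a quoted column is still split on.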
You can use a CSV parser like OpenCSV or Commons CSV:
http://opencsv.sourceforge.net
http://commons.apache.org/sandbox/csv
You can replace the embedded pipe with its Unicode escape sequence (\u007C) before splitting on the pipe.
But what you should really do is adjust your parser to take quoted columns into account, rather than changing the files.
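One way to make the parser quote-aware is a small hand-written splitter that tracks whether it is currently inside double quotes. This is only a sketch: the class name QuotedSplitter is invented, and it assumes quoted columns never contain an escaped double quote.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedSplitter {
    // Splits on the delimiter, except where it appears between double quotes.
    // Assumption: no escaped quotes (such as "") occur inside a quoted column.
    static List<String> split(String line, char delimiter) {
        List<String> cols = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;          // toggle state and drop the quote itself
            } else if (c == delimiter && !inQuotes) {
                cols.add(current.toString());  // column boundary outside quotes
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        cols.add(current.toString());          // last column has no trailing delimiter
        return cols;
    }

    public static void main(String[] args) {
        System.out.println(split("\"xx|zz\"|23-23-33|32", '|'));
    }
}
```

For files with escaped quotes or multi-line values, a real CSV library is likely the safer choice.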
Here is one way to parse it
String str = "zzz|20-03-22|23 \"xx|zz\"|23-23-33|32";
String regex = "(?<=^|\\|)(([^\"]*?)|([^\"]+\"[^\"]+\".*?))(?=\\||$)";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(str);
while(m.find()) {
System.out.println(m.group());
}
Output:
zzz
20-03-22
23 "xx|zz"
23-23-33
32
Related
Let's say I have the string:
String toTokenize = "prop1=value1;prop2=String test='1234';int i=4;;prop3=value3";
I want the tokens:
prop1=value1
prop2=String test='1234';int i=4;
prop3=value3
For backwards compatibility, I have to use the semicolon as a delimiter. I have tried wrapping code in something like CDATA:
String toTokenize = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";
But I can't figure out a regular expression to ignore the semicolons that are within the cdata tags.
I've tried escaping the non-delimiter:
String toTokenize = "prop1=value1;prop2=String test='1234'\\;int i=4\\;;prop3=value3";
But then there is an ugly mess of removing the escape characters.
Do you have any suggestions?
You may match either <![CDATA...]]> or any char other than ;, 1 or more times, to match the values. To match the keys, you may use a regular \w+ pattern:
(\w+)=((?:<!\[CDATA\[.*?]]>|[^;])+)
See the regex demo.
Details
(\w+) - Group 1: one or more word chars
= - a = sign
((?:<!\[CDATA\[.*?]]>|[^;])+) - Group 2: one or more sequences of
<!\[CDATA\[.*?]]> - a <![CDATA[...]]> substring
| - or
[^;] - any char but ;
See a Java demo:
String rx = "(\\w+)=((?:<!\\[CDATA\\[.*?]]>|[^;])+)";
String s = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";
Pattern pattern = Pattern.compile(rx);
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
System.out.println(matcher.group(1) + " => " + matcher.group(2));
}
Results:
prop1 => value1
prop2 => <![CDATA[String test='1234';int i=4;]]>
prop3 => value3
Prerequisite:
All your tokens start with prop
There is no prop in the file other than the beginning of a token
I'd just do a replace of all ;prop by ~prop
Then your string becomes:
"prop1=value1~prop2=String test='1234';int i=4;~prop3=value3";
You can then tokenize using the ~ delimiter
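The replace-then-split idea can be sketched as follows. The class name PropTokenizer is invented, and the code relies on the stated prerequisites plus one more assumption: the character ~ never occurs in the input.

```java
import java.util.Arrays;

class PropTokenizer {
    // Assumptions: every token starts with "prop", "prop" appears nowhere else,
    // and '~' never occurs in the input.
    static String[] tokenize(String input) {
        return input.replace(";prop", "~prop").split("~");
    }

    public static void main(String[] args) {
        String toTokenize = "prop1=value1;prop2=String test='1234';int i=4;;prop3=value3";
        for (String token : tokenize(toTokenize)) {
            System.out.println(token);
        }
    }
}
```

The semicolons that belong to the embedded code survive because only a semicolon directly followed by prop is rewritten.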
I have a string email = John.Mcgee.r2d2#hitachi.com
How can I write Java code that uses a regex to extract just r2d2?
I used this but got an error in Eclipse:
String email = John.Mcgee.r2d2#hitachi.com
Pattern pattern = Pattern.compile(".(.*)\#");
Matcher matcher = patter.matcher
for (Strimatcher.find()){
System.out.println(matcher.group(1));
}
To match after the last dot in a potential sequence of multiple dots, require that the sequence you capture does not contain a dot:
(?<=[.])([^.]*)(?=#)
(?<=[.]) means "preceded by a single dot"
(?=#) means "followed by # sign"
Note that since dot . is a metacharacter, it needs to be escaped either with \ (doubled for Java string literal) or with square brackets around it.
Demo.
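A short sketch of how this lookaround pattern might be used from Java; the class and method names (LastSegment, extract) are placeholders, and returning null when nothing matches is an arbitrary choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastSegment {
    // Preceded by a dot, contains no dot, followed by '#'.
    private static final Pattern SEGMENT = Pattern.compile("(?<=[.])([^.]*)(?=#)");

    // Returns the dot-free text right before '#', or null if there is no match.
    static String extract(String input) {
        Matcher m = SEGMENT.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("John.Mcgee.r2d2#hitachi.com")); // r2d2
    }
}
```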
Not sure if you're posting the right code. I'll rewrite it based on what it should look like, though:
String email = "John.Mcgee.r2d2#hitachi.com";
Pattern pattern = Pattern.compile(".(.*)#"); // # needs no escape; "\#" is not a valid Java escape
Matcher matcher = pattern.matcher(email);
while (matcher.find()) {
    // the pattern has only one group; the unescaped . matches any character,
    // so this prints ohn.Mcgee.r2d2
    System.out.println(matcher.group(1));
}
but I think you just want something like this:
String email = "John.Mcgee.r2d2#hitachi.com";
Pattern pattern = Pattern.compile("\\.([^.]*)#"); // a literal dot, then dot-free text up to #
Matcher matcher = pattern.matcher(email);
if (matcher.find()) {
    System.out.println(matcher.group(1)); // r2d2
}
There is no need for Pattern; replaceAll with the regex .*\.([^\.]+)#.* is enough. It keeps only the group ([^\.]+) (one or more characters other than a dot) that sits between a dot \. and #:
email = email.replaceAll(".*\\.([^\\.]+)#.*", "$1");
Output
r2d2
regex demo
If you want to go with Pattern, use the regex \\.([^\\.]+)# :
String email = "John.Mcgee.r2d2#hitachi.com";
Pattern pattern = Pattern.compile("\\.([^\\.]+)#");
Matcher matcher = pattern.matcher(email);
if (matcher.find()) {
System.out.println(matcher.group(1));// Output : r2d2
}
Another solution is to use split:
String[] split = email.replaceAll("#.*", "").split("\\.");
email = split[split.length - 1];// Output : r2d2
Notes:
Strings in Java must be enclosed in double quotes: "John.Mcgee.r2d2#hitachi.com"
You don't need to escape # in Java, but you have to escape the dot with a double backslash: \\.
There is no for loop syntax like for (Strimatcher.find()){; you probably meant while.
I have a paragraph of text containing numbers in a specific format,
e.g. "123-21-1234 this is another text - some text 222-34-2244 another text"
I need to select those specific numbers (123-21-1234 and 222-34-2244) and convert the text to "123/21/1234 this is another text - some text 222/34/2244 another text"
You can try something like below using Matcher.appendReplacement
public static void main(String[] args) {
String str = "123-21-1234 this is another text - some text 222-34-2244 another text";
Pattern p = Pattern.compile("(\\d{3})-(\\d{2})-(\\d{4})");
Matcher m = p.matcher(str);
StringBuffer sb = new StringBuffer();
while (m.find()) {
String num = m.group();
m.appendReplacement(sb, num.replace('-', '/'));
}
m.appendTail(sb);
System.out.println(sb.toString());
}
A blanket .replaceAll("-", "/") has an annoying side effect: it also replaces the standalone " - " in the surrounding text.
Instead, you can replace the exact string literals, or craft your own regex:
string.replaceAll("123-21-1234", "123/21/1234").replaceAll("222-34-2244", "222/34/2244");
If you wish to match any XXX-XX-XXXX patterns
string.replaceAll("(\\d{3})-(\\d{2})-(\\d{4})", "$1/$2/$3");
This works by looking for the digit sequence, putting the digits into groups ($0 is the whole match, $1 is the first ()s, $2 is second ()s...)
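To illustrate the group-based replacement on the sample text, here is a small runnable sketch; the class and method names (DashToSlash, convert) are invented for the example.

```java
class DashToSlash {
    // Rewrites every XXX-XX-XXXX number to XXX/XX/XXXX using the three captured groups.
    static String convert(String text) {
        return text.replaceAll("(\\d{3})-(\\d{2})-(\\d{4})", "$1/$2/$3");
    }

    public static void main(String[] args) {
        String str = "123-21-1234 this is another text - some text 222-34-2244 another text";
        System.out.println(convert(str));
    }
}
```

The lone hyphen between "text" and "some" is left alone because it is not surrounded by the digit groups the pattern requires.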
I am new to regular expressions. I have a String:
String span = "some text, param1:'1123',some text, param2:'3444';";
Now I want to use split to get the values of param1 and param2.
So I think that if I split on single quotes, I will get an array of length 2.
My problem is that my split on single quotes doesn't work.
I think I need to put some regex inside the quotes:
String[] elements = span.split("''");
param1 = elements[elements.length-2];
param2 = elements[elements.length-1];
So my output will be:
1123
3444
You can do this using regular expression:
'(.*?)'
' is not a regex metacharacter, so it already matches literally; wrapping it in Pattern.quote(), as below, is harmless but optional.
Try:
String span = "some text, param1:'1123',some text, param2:'3444';";
Pattern p = Pattern.compile(Pattern.quote("'") + "(.*?)" + Pattern.quote("'"));
Matcher m = p.matcher(span);
while (m.find()) {
System.out.println(m.group(1));
}
This outputs:
1123
3444
My solution is :
Pattern pattern = Pattern.compile("'(?:[^']|'')+'");
ArrayList<String> values = new ArrayList<>();
Matcher matcher = pattern.matcher(span);
while (matcher.find()) {
values.add(matcher.group());
}
nom = values.get(1).replace("'","");
I'm looking for a built-in Java functions which for example can convert "\\n" into "\n".
Something like this:
assert parseFunc("\\n").equals("\n");
Or do I have to manually search-and-replace all the escaped characters?
You can use StringEscapeUtils.unescapeJava(s) from Apache Commons Lang. It works for all escape sequences, including Unicode characters (e.g. \u1234).
https://commons.apache.org/lang/apidocs/org/apache/commons/lang3/StringEscapeUtils.html#unescapeJava-java.lang.String-
Anthony is 99% right -- since backslash is also a reserved character in regular expressions, it needs to be escaped a second time:
result = myString.replaceAll("\\\\n", "\n");
Just use the string's own replaceAll method.
result = myString.replaceAll("\\n", "\n");
However if you want match all escape sequences then you could use a Matcher. See http://www.regular-expressions.info/java.html for a very basic example of using Matcher.
Pattern p = Pattern.compile("\\\\(.)"); // a literal backslash followed by any one character
Matcher m = p.matcher("This is tab \\t and \\n this is on a new line");
StringBuffer sb = new StringBuffer();
while (m.find()) {
    String s = m.group(1);
    if (s.equals("n")) { s = "\n"; } // compare strings with equals(), not ==
    else if (s.equals("t")) { s = "\t"; }
    // quoteReplacement keeps any \ or $ in s from being read as replacement syntax
    m.appendReplacement(sb, Matcher.quoteReplacement(s));
}
m.appendTail(sb);
System.out.println(sb.toString());
You just need to make the assignment to s more sophisticated, depending on the number and type of escapes you want to handle. (Warning: this is air code; I'm not a Java developer.)
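As a sketch of what a more complete manual version could look like, the loop below handles a handful of escapes without any regex. The class name Unescaper and the set of supported escapes (\n, \t, \r, \\) are choices made for this example; unknown escapes are kept verbatim.

```java
class Unescaper {
    // Converts the two-character sequences \n, \t, \r and \\ into the characters
    // they denote. Unknown escapes and a lone trailing backslash are left as-is.
    static String unescape(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c != '\\' || i + 1 == in.length()) {
                out.append(c);              // ordinary char, or trailing backslash
                continue;
            }
            char next = in.charAt(++i);     // consume the character after the backslash
            switch (next) {
                case 'n':  out.append('\n'); break;
                case 't':  out.append('\t'); break;
                case 'r':  out.append('\r'); break;
                case '\\': out.append('\\'); break;
                default:   out.append('\\').append(next); // unknown escape kept verbatim
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("This is tab \\t and \\n this is on a new line"));
    }
}
```

Adding further escapes, such as \uXXXX, would mean extending the switch; a library method may be less work at that point.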
If you don't want to list all possible escaped characters you can delegate this to Properties behaviour
String escapedText="This is tab \\t and \\rthis is on a new line";
Properties prop = new Properties();
prop.load(new StringReader("x=" + escapedText + "\n"));
String decoded = prop.getProperty("x");
System.out.println(decoded);
This handles the escapes that the Properties format understands: \t, \n, \r, \f and \uXXXX.