Hi, I have a CSV file with an error in it, and I want to correct it with a regular expression. Some of the fields contain a line break, as in the example below:
"AHLR150","CDS","-1","MDCPBusinessRelationshipID",,,"Investigating","1600 Amphitheatre Pkwy
California",,"Mountain View",,"United States",,"California",,,"94043-1351","9958"
The two lines above should be a single line:
"AHLR150","CDS","-1","MDCPBusinessRelationshipID",,,"Investigating","1600 Amphitheatre PkwyCalifornia",,"Mountain View",,"United States",,"California",,,"94043-1351","9958"
I tried the regex below, but it didn't help:
%s/\\([^\"]\\)\\n/\\1/
Try this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String input = "\"AHLR150\",\"CDS\",\"-1\",\"MDCPBusinessRelationshipID\","
                + ",,\"Investigating\",\"1600 Amphitheatre Pkwy\n"
                + "California\",,\"Mountain View\",,\"United\n"
                + "States\",,\"California\",,,\"94043-1351\",\"9958\"\n";
        // Match the content of a quoted field that contains at least one line break
        Matcher matcher = Pattern.compile("\"([^\"]*[\n\r].*?)\"").matcher(input);
        Pattern patternRemoveLineBreak = Pattern.compile("[\n\r]");
        String result = input;
        while (matcher.find()) {
            String quoteWithLineBreak = matcher.group(1);
            String quoteNoLineBreaks = patternRemoveLineBreak.matcher(quoteWithLineBreak).replaceAll(" ");
            // Literal replace: replaceFirst would interpret the field text as a regex
            result = result.replace(quoteWithLineBreak, quoteNoLineBreaks);
        }
        // Output
        System.out.println(result);
    }
}
Output:
"AHLR150","CDS","-1","MDCPBusinessRelationshipID",,,"Investigating","1600 Amphitheatre Pkwy California",,"Mountain View",,"United States",,"California",,,"94043-1351","9958"
Wrap the parts of the match you want to keep in parentheses to create capture groups, then build the replacement string from the group references.
String test = "\"AHLR150\",\"CDS\",\"-1\",\"MDCPBusinessRelationshipID\","
+ ",,\"Investigating\",\"1600 Amphitheatre Pkwy\n"
+ "California\",,\"Mountain View\",,\"United\n"
+ "States\",,\"California\",,,\"94043-1351\",\"9958\"\n";
System.out.println(test.replaceAll("(\"[^\"]*)\n([^\"]*\")", "$1$2"));
So when we replace the matched string ("United\nStates") with $1$2, the line break is removed because it does not belong to either group:
$1 => the first group (\"[^\"]*) that will match "United
$2 => the second group ([^\"]*\") that will match States"
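One thing to watch with a file that has several rows: a closing quote at the end of one row, the row's line break, and the opening quote of the next row also fit the pattern above. A minimal sketch of an alternative is below, assuming fields never contain escaped quotes (""). It matches each complete quoted field and strips line breaks only inside it. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedFieldJoiner {
    // A complete quoted field: opening quote, any non-quote characters, closing quote.
    // Assumption: fields do not contain escaped quotes ("").
    private static final Pattern QUOTED_FIELD = Pattern.compile("\"[^\"]*\"");

    static String joinBrokenFields(String csv) {
        Matcher m = QUOTED_FIELD.matcher(csv);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Drop every CR/LF inside this field; line breaks between rows are left alone.
            String fixed = m.group().replaceAll("[\\r\\n]+", "");
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String csv = "\"a\",\"1600 Amphitheatre Pkwy\nCalifornia\",,\"United\nStates\"\n"
                   + "\"b\",\"x\"\n";
        System.out.print(joinBrokenFields(csv));
    }
}
```

Because the matcher consumes fields pair by pair from left to right, the line break that separates two rows always falls outside a match and is preserved.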
Based on this you can try with:
/\r?\n|\r/
I tested it and it seems to work fine.
Let's say I have the string:
String toTokenize = "prop1=value1;prop2=String test='1234';int i=4;;prop3=value3";
I want the tokens:
prop1=value1
prop2=String test='1234';int i=4;
prop3=value3
For backwards compatibility, I have to use the semicolon as a delimiter. I have tried wrapping code in something like CDATA:
String toTokenize = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";
But I can't figure out a regular expression to ignore the semicolons that are within the cdata tags.
I've tried escaping the non-delimiter:
String toTokenize = "prop1=value1;prop2=String test='1234'\\;int i=4\\;;prop3=value3";
But then there is an ugly mess of removing the escape characters.
Do you have any suggestions?
You may match either <![CDATA[...]]> or any char other than ;, one or more times, to match the values. To match the keys, you may use a regular \w+ pattern:
(\w+)=((?:<!\[CDATA\[.*?]]>|[^;])+)
See the regex demo.
Details
(\w+) - Group 1: one or more word chars
= - a = sign
((?:<!\[CDATA\[.*?]]>|[^;])+) - Group 2: one or more sequences of
<!\[CDATA\[.*?]]> - a <![CDATA[...]]> substring
| - or
[^;] - any char but ;
See a Java demo:
String rx = "(\\w+)=((?:<!\\[CDATA\\[.*?]]>|[^;])+)";
String s = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";
Pattern pattern = Pattern.compile(rx);
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1) + " => " + matcher.group(2));
}
Results:
prop1 => value1
prop2 => <![CDATA[String test='1234';int i=4;]]>
prop3 => value3
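If the values are needed without the CDATA wrapper, the same pattern can feed a map. This is only a sketch: the class name, the method name, and the choice of a LinkedHashMap are assumptions for illustration, not part of the original answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PropertyParser {
    private static final String OPEN = "<![CDATA[";
    private static final String CLOSE = "]]>";
    private static final Pattern PAIR =
        Pattern.compile("(\\w+)=((?:<!\\[CDATA\\[.*?]]>|[^;])+)");

    static Map<String, String> parse(String s) {
        Map<String, String> props = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(s);
        while (m.find()) {
            String value = m.group(2);
            // Strip the wrapper when the value starts with <![CDATA[ and ends with ]]>.
            if (value.startsWith(OPEN) && value.endsWith(CLOSE)) {
                value = value.substring(OPEN.length(), value.length() - CLOSE.length());
            }
            props.put(m.group(1), value);
        }
        return props;
    }

    public static void main(String[] args) {
        String s = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";
        parse(s).forEach((k, v) -> System.out.println(k + " => " + v));
    }
}
```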
Prerequisites:
All your tokens start with prop
prop does not occur anywhere other than at the beginning of a token
I'd just replace every ;prop with ~prop
Then your string becomes:
"prop1=value1~prop2=String test='1234';int i=4;~prop3=value3";
You can then tokenize using the ~ delimiter (assuming ~ never occurs in the values)
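Under the same prerequisites, the intermediate replacement can be skipped by splitting on a semicolon only when a key follows it. The sketch below assumes keys have the form prop followed by digits and an equals sign; the class and method names are hypothetical.

```java
import java.util.Arrays;

class PropSplitter {
    static String[] split(String s) {
        // The lookahead (?=prop\d+=) checks for a following key without consuming it,
        // so semicolons inside a value are not treated as delimiters.
        return s.split(";(?=prop\\d+=)");
    }

    public static void main(String[] args) {
        String toTokenize = "prop1=value1;prop2=String test='1234';int i=4;;prop3=value3";
        System.out.println(Arrays.toString(split(toTokenize)));
    }
}
```

This keeps the trailing semicolon of the prop2 value, since only the semicolon directly in front of prop3= is used as a delimiter.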
Can you suggest an approach for splitting a String like this:
:31C:150318
:31D:150425 IN BANGLADESH
:20:314015040086
I tried to parse that string with
:[A-za-z]|\\d:
but this regular expression does not work. Please suggest a regular expression that splits the string with 20, 31C, 31D, etc. as keys and 150318, 150425 IN BANGLADESH, etc. as values.
Using string.split(":") would not serve my purpose.
If a string looks like this:
:20: MY VALUES : ARE HERE
then it will be split into 3 strings: key 20 will be associated with "MY VALUES", and "ARE HERE" will not be associated with key 20.
You can match instead of splitting, since you need to find one specific colon in the string.
A regex that captures the text between the first and second colon in one group, and everything after the second colon in another, looks like this:
^:([^:]*):(.*)$
See demo. The ^ will assert the beginning of the string, ([^:]*) will match and capture into Group 1 zero or more characters other than :, and (.*) will match and capture into Group 2 the rest of the string. $ will assert the position at the end of a single line string (as . matches any symbol but a newline without Pattern.DOTALL modifier).
String s = ":20:AND:HERE";
Pattern pattern = Pattern.compile("^:([^:]*):(.*)$");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println("Key: " + matcher.group(1) + ", Value: " + matcher.group(2) + "\n");
}
Result for this demo: Key: 20, Value: AND:HERE
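The demo handles a single line. For input that holds several such lines in one string, a sketch along the following lines could work. It assumes one key/value pair per line and unique keys (a repeated key would overwrite the earlier value), and the class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TaggedLineParser {
    // MULTILINE makes ^ and $ match at every line boundary, not only at the ends of the input.
    // The key class also excludes line breaks so a malformed line cannot swallow the next one.
    private static final Pattern LINE =
        Pattern.compile("^:([^:\\r\\n]*):(.*)$", Pattern.MULTILINE);

    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = LINE.matcher(text);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = ":31C:150318\n:31D:150425 IN BANGLADESH\n:20: MY VALUES : ARE HERE";
        parse(text).forEach((k, v) -> System.out.println("Key: " + k + ", Value: " + v));
    }
}
```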
You can use the following to split:
^[:]+([^:]+):
Try the split method of the String class:
String[] splited = string.split(":");
For your requirements:
String c = ":31D:150425 IN BANGLADESH:todasdsa";
c=c.substring(1);
System.out.println("C="+c);
String key= c.substring(0,c.indexOf(":"));
String value = c.substring(c.indexOf(":")+1);
System.out.println("key="+key+" value="+value);
Result:
C=31D:150425 IN BANGLADESH:todasdsa
key=31D value=150425 IN BANGLADESH:todasdsa
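The same result can be had from split itself by passing a limit, which stops splitting after the first delimiter. A small sketch, assuming every line starts with a colon and contains at least one more; the class and method names are hypothetical.

```java
class KeyValueSplit {
    static String[] keyValue(String line) {
        // Drop the leading colon, then split at most once:
        // a limit of 2 leaves any later colons inside the value.
        return line.substring(1).split(":", 2);
    }

    public static void main(String[] args) {
        String[] kv = keyValue(":31D:150425 IN BANGLADESH:todasdsa");
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```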
I want to surround all tokens in a text with tags in the following manner:
Input: " abc fg asd "
Output: " <token>abc</token> <token>fg</token> <token>asd</token> "
This is the code I tried so far:
String regex = "(\\s)([a-zA-Z]+)(\\s)";
String text = " abc fg asd ";
text = text.replaceAll(regex, "$1<token>$2</token>$3");
System.out.println(text);
Output: " <token>abc</token> fg <token>asd</token> "
Note: for simplicity we can assume that the input starts and ends with whitespace.
Use lookaround:
String regex = "(?<=\\s)([a-zA-Z]+)(?=\\s)";
...
text = text.replaceAll(regex, "<token>$1</token>");
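The original attempt skips fg because the trailing (\\s) of the first match consumes the space that the next match would need as its leading (\\s). Lookarounds only assert that whitespace is present, so nothing is consumed. A complete, runnable version of the snippet might look like this (the class and method names are made up for illustration):

```java
class TokenTagger {
    static String tag(String text) {
        // (?<=\s) and (?=\s) require whitespace on both sides of the word
        // but do not make it part of the match.
        return text.replaceAll("(?<=\\s)([a-zA-Z]+)(?=\\s)", "<token>$1</token>");
    }

    public static void main(String[] args) {
        System.out.println("\"" + tag(" abc fg asd ") + "\"");
    }
}
```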
If your tokens are defined only by a character class, you don't need to describe the surrounding characters. This should suffice, since the regex engine walks from left to right and the quantifier is greedy:
String regex = "[a-zA-Z]+";
text = text.replaceAll(regex, "<token>$0</token>");
// meaning not a space, 1+ times
String result = input.replaceAll("([^\\s]+)", "<token>$1</token>");
This matches every run of characters that are not whitespace, which is probably the best fit for what you need. The quantifier is also greedy, so a match always extends as far as it can: it will never match just "as" inside "asd" while another matching character follows.
I have a regex like the one below:
"\\t'AUR +(username) .*? /ROLE=\"(my_role)\".*$"
The username and my_role parts are passed in as arguments, so they change every time the script starts. How can I pass parameters into those parts of the regex?
Thanks for your help.
Define regex like this:
String fmt = "\\t'AUR +(%s) .*? /ROLE=\"(%s)\".*$";
// assuming userName and myRole are your arguments
String regex = String.format(fmt, userName, myRole);
You should escape special characters in dynamic strings using Pattern.quote. To put the regex parts together you can simply use string concatenation like this:
String quotedUsername = Pattern.quote(username);
String quotedRole = Pattern.quote(my_role);
String regexString = "\\t'AUR +(" + quotedUsername +
") .*? /ROLE=\"(" + quotedRole + ")\".*$";
I think mixing regular expressions with format strings when using String.format can make the regex harder to understand.
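To see why quoting matters, here is a self-contained sketch of the concatenation approach. The class name, method name, and sample input lines are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

class RolePatternBuilder {
    static Pattern build(String username, String role) {
        // Pattern.quote wraps each argument so that characters such as '.' or '('
        // are matched literally instead of being read as regex syntax.
        return Pattern.compile("\\t'AUR +(" + Pattern.quote(username)
                + ") .*? /ROLE=\"(" + Pattern.quote(role) + ")\".*$");
    }

    public static void main(String[] args) {
        Pattern p = build("j.doe", "admin");
        // The dot in "j.doe" is literal, so only the first line matches.
        System.out.println(p.matcher("\t'AUR j.doe foo /ROLE=\"admin\" tail").find());
        System.out.println(p.matcher("\t'AUR jxdoe foo /ROLE=\"admin\" tail").find());
    }
}
```

Without Pattern.quote, the unescaped dot would match any character, so a user name such as jxdoe would match as well.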
Use String.format or plain string concatenation to construct the regex before passing it to Pattern.compile.
Try this for an example:
String patternString = "\\t'AUR +(%s) .*? /ROLE=\"(%s)\".*$";
String formatted = String.format(patternString, username,my_role);
System.out.println(formatted);
// Compile the formatted string, not the template that still contains %s
Pattern pattern = Pattern.compile(formatted);
You can run a working example here: http://ideone.com/93YeNg