Java Split String by colon on both side - java

Can you suggest an approach for splitting a String that looks like this:
:31C:150318
:31D:150425 IN BANGLADESH
:20:314015040086
So I tried to parse that string with
:[A-za-z]|\\d:
but this regular expression does not work. Please suggest a regular expression that splits the string with 20, 31C, 31D, etc. as keys and 150318, 150425 IN BANGLADESH, etc. as values.
Using string.split(":") would not serve my purpose.
If a string is like:
:20: MY VALUES : ARE HERE
then it will be split into 3 strings, so key 20 will be associated with "MY VALUES", and "ARE HERE" will not be associated with key 20.

You may use a matching mechanism instead of splitting, since you need to match a specific colon in the string.
The regex that captures the text between the first and second colon in one group, and everything after the second colon in another, looks like this:
^:([^:]*):(.*)$
The ^ asserts the beginning of the string, ([^:]*) matches and captures into Group 1 zero or more characters other than :, and (.*) matches and captures into Group 2 the rest of the string. $ asserts the position at the end of a single-line string (as . matches any character but a line break unless the Pattern.DOTALL modifier is used).
String s = ":20:AND:HERE";
Pattern pattern = Pattern.compile("^:([^:]*):(.*)$");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println("Key: " + matcher.group(1) + ", Value: " + matcher.group(2) + "\n");
}
Result for this demo: Key: 20, Value: AND:HERE
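If the whole message arrives as one multi-line string, the same pattern can be applied per line by compiling it with Pattern.MULTILINE. The following is only a sketch under a few assumptions: the class and method names are made up for illustration, values are trimmed, and a repeated key simply overwrites the earlier entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColonFieldParser {
    // Same regex as above; MULTILINE lets ^ and $ match at every line boundary.
    private static final Pattern FIELD =
            Pattern.compile("^:([^:]*):(.*)$", Pattern.MULTILINE);

    // Hypothetical helper: collects key/value pairs in input order.
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            // Group 1 is the key, Group 2 is everything after the second colon.
            // Trimming the value is an assumption, not part of the original answer.
            fields.put(m.group(1), m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = ":31C:150318\n"
                + ":31D:150425 IN BANGLADESH\n"
                + ":20: MY VALUES : ARE HERE";
        parse(text).forEach((k, v) -> System.out.println(k + " => " + v));
    }
}
```

Because only the first two colons of each line are treated as delimiters, a value such as "MY VALUES : ARE HERE" stays in one piece.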

You can use the following to split:
^[:]+([^:]+):

Try the split method of the String class:
String[] splited = string.split(":");
For your requirements:
String c = ":31D:150425 IN BANGLADESH:todasdsa";
c=c.substring(1);
System.out.println("C="+c);
String key= c.substring(0,c.indexOf(":"));
String value = c.substring(c.indexOf(":")+1);
System.out.println("key="+key+" value="+value);
Result:
C=31D:150425 IN BANGLADESH:todasdsa
key=31D value=150425 IN BANGLADESH:todasdsa

Related

Need help in regex matching

It may be very simple, but I am extremely new to regex and have a requirement where I need to do some regex matches in a string and extract the number in it. Below is my code with sample i/p and required o/p. I tried to construct the Pattern by referring to https://www.freeformatter.com/java-regex-tester.html, but my regex match itself is returning false.
Pattern pattern = Pattern.compile(".*/(a-b|c-d|e-f)/([0-9])+(#[0-9]?)");
String str = "foo/bar/Samsung-Galaxy/a-b/1"; // need to extract 1.
String str1 = "foo/bar/Samsung-Galaxy/c-d/1#P2";// need to extract 2.
String str2 = "foo.com/Samsung-Galaxy/9090/c-d/69"; // need to extract 69
System.out.println("result " + pattern.matcher(str).matches());
System.out.println("result " + pattern.matcher(str1).matches());
System.out.println("result " + pattern.matcher(str2).matches());
All of the above print statements return false. I am using Java 8. Is there any way to match the pattern and extract the digits from the string in a single statement?
It would be great if somebody could point me to how to debug/develop the regex. Please feel free to let me know if something is not clear in my question.
You may use
Pattern pattern = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");
When used with matches(), the pattern above does not require explicit anchors, ^ and $.
Details
.* - any 0+ chars other than line break chars, as many as possible
/ - the rightmost / that is followed with the subsequent subpatterns
(?:a-b|c-d|e-f) - a non-capturing group matching any of the alternatives inside: a-b, c-d or e-f
/ - a / char
[^/]*? - any chars other than /, as few as possible
([0-9]+) - Group 1: one or more digits.
Java demo:
List<String> strs = Arrays.asList("foo/bar/Samsung-Galaxy/a-b/1","foo/bar/Samsung-Galaxy/c-d/1#P2","foo.com/Samsung-Galaxy/9090/c-d/69");
Pattern pattern = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");
for (String s : strs) {
Matcher m = pattern.matcher(s);
if (m.matches()) {
System.out.println(s + ": \"" + m.group(1) + "\"");
}
}
A replacing approach using the same regex with anchors added:
List<String> strs = Arrays.asList("foo/bar/Samsung-Galaxy/a-b/1","foo/bar/Samsung-Galaxy/c-d/1#P2","foo.com/Samsung-Galaxy/9090/c-d/69");
String pattern = "^.*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)$";
for (String s : strs) {
System.out.println(s + ": \"" + s.replaceFirst(pattern, "$1") + "\"");
}
Output:
foo/bar/Samsung-Galaxy/a-b/1: "1"
foo/bar/Samsung-Galaxy/c-d/1#P2: "2"
foo.com/Samsung-Galaxy/9090/c-d/69: "69"
Because your regex always matches the last number, I would just use replaceAll with the regex .*?(\d+)$ :
String regex = ".*?(\\d+)$";
String strResult1 = str.replaceAll(regex, "$1");
System.out.println(!strResult1.isEmpty() ? "result " + strResult1 : "no result");
String strResult2 = str1.replaceAll(regex, "$1");
System.out.println(!strResult2.isEmpty() ? "result " + strResult2 : "no result");
String strResult3 = str2.replaceAll(regex, "$1");
System.out.println(!strResult3.isEmpty() ? "result " + strResult3 : "no result");
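Note that if the string does not end with digits, the regex does not match and replaceAll returns the input unchanged, so the isEmpty() check only reports "no result" for an empty input.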
Outputs
result 1
result 2
result 69
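To report a missing number explicitly, one option is to use a Matcher instead of replaceAll. This is a minimal sketch, not the answerer's code: the class name, the method name and the choice of Optional as the return type are assumptions made for illustration (Optional is available in Java 8).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingNumber {
    // One or more digits that run up to the end of the input.
    private static final Pattern TRAILING_DIGITS = Pattern.compile("(\\d+)$");

    // Hypothetical helper: empty Optional when the input does not end with digits.
    static Optional<String> extract(String input) {
        Matcher m = TRAILING_DIGITS.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "foo/bar/Samsung-Galaxy/a-b/1",
            "foo/bar/Samsung-Galaxy/c-d/1#P2",
            "foo.com/Samsung-Galaxy/9090/c-d/69",
            "foo/bar/Samsung-Galaxy/a-b/69ace"
        };
        for (String s : inputs) {
            System.out.println(s + ": " + extract(s).orElse("no result"));
        }
    }
}
```

The last input has no trailing digits, so it prints "no result" rather than echoing the input back.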
Here is a one-liner using String#replaceAll:
public String getDigits(String input) {
String number = input.replaceAll(".*/(?:a-b|c-d|e-f)/[^/]*?(\\d+)$", "$1");
return number.matches("\\d+") ? number : "no match";
}
System.out.println(getDigits("foo.com/Samsung-Galaxy/9090/c-d/69"));
System.out.println(getDigits("foo/bar/Samsung-Galaxy/a-b/some other text/1"));
System.out.println(getDigits("foo/bar/Samsung-Galaxy/9090/a-b/69ace"));
69
no match
no match
This works on the sample inputs you provided. Note that I added logic which will display no match for the case where ending digits could not be matched fitting your pattern. In the case of a non-match, we would typically be left with the original input string, which would not be all digits.

In Java, how do you tokenize a string that contains the delimiter in the tokens?

Let's say I have the string:
String toTokenize = "prop1=value1;prop2=String test='1234';int i=4;;prop3=value3";
I want the tokens:
prop1=value1
prop2=String test='1234';int i=4;
prop3=value3
For backwards compatibility, I have to use the semicolon as a delimiter. I have tried wrapping code in something like CDATA:
String toTokenize = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";
But I can't figure out a regular expression to ignore the semicolons that are within the cdata tags.
I've tried escaping the non-delimiter:
String toTokenize = "prop1=value1;prop2=String test='1234'\\;int i=4\\;;prop3=value3";
But then there is an ugly mess of removing the escape characters.
Do you have any suggestions?
You may match either a <![CDATA[...]]> section or any char other than ;, 1 or more times, to match the values. To match the keys, you may use a regular \w+ pattern:
(\w+)=((?:<!\[CDATA\[.*?]]>|[^;])+)
Details
(\w+) - Group 1: one or more word chars
= - a = sign
((?:<!\[CDATA\[.*?]]>|[^;])+) - Group 2: one or more sequences of
<!\[CDATA\[.*?]]> - a <![CDATA[...]]> substring
| - or
[^;] - any char but ;
See a Java demo:
String rx = "(\\w+)=((?:<!\\[CDATA\\[.*?]]>|[^;])+)";
String s = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";
Pattern pattern = Pattern.compile(rx);
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
System.out.println(matcher.group(1) + " => " + matcher.group(2));
}
Results:
prop1 => value1
prop2 => <![CDATA[String test='1234';int i=4;]]>
prop3 => value3
Prerequisites:
All your tokens start with prop.
The text ;prop does not occur anywhere other than at the start of a token.
The ~ character does not occur in the input.
I'd just replace every ;prop with ~prop.
Then your string becomes:
"prop1=value1~prop2=String test='1234';int i=4;~prop3=value3"
You can then tokenize using the ~ delimiter.
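The replace-then-split idea could be sketched as follows. The class and method names are hypothetical, and the code assumes that ~ never occurs in the input and that ;prop only appears at the start of a token. The second method shows an equivalent variant that splits on a semicolon followed by prop using a lookahead, which avoids the placeholder character.

```java
import java.util.Arrays;
import java.util.List;

class PropTokenizer {
    // Variant 1: mark token boundaries with '~' (assumed absent from the input), then split.
    static List<String> tokenize(String input) {
        String marked = input.replace(";prop", "~prop");
        return Arrays.asList(marked.split("~"));
    }

    // Variant 2: split directly on every ';' that is immediately followed by "prop".
    static List<String> tokenizeWithLookahead(String input) {
        return Arrays.asList(input.split(";(?=prop)"));
    }

    public static void main(String[] args) {
        String toTokenize = "prop1=value1;prop2=String test='1234';int i=4;;prop3=value3";
        tokenize(toTokenize).forEach(System.out::println);
        tokenizeWithLookahead(toTokenize).forEach(System.out::println);
    }
}
```

Both variants keep the semicolons inside the prop2 value, including the one that ends the embedded statement.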

Extract a particular number from a string using regex in java

Here is my string
INPUT:
22 TIRES (2 defs)
1 AP(PEAR + ANC)E (CAN anag)
6 CHIC ("SHEIK" hom)
EXPECTED OUTPUT:
22 TIRES
1 APPEARANCE
6 CHIC
ACTUAL OUTPUT :
TIRES
APPEARANCE
CHIC
I tried using below code and got the above output.
String firstnames = a.split(" \\(.*")[0].replace("(", "").replace(")", "").replace(" + ", "");
Any idea how to extract the text along with the leading numbers? I don't want the numbers that come after the parentheses, as in the input "22 TIRES (2 defs)"; I need the output "22 TIRES". Any help would be great!
I would do it a bit differently:
String line = "22 TIRES (2 defs)\n\n1 AP(PEAR + ANC)E (CAN anag)\n\n6 CHIC (\"SHEIK\" hom)";
String pattern = "(\\d+\\s+)(.*)\\(";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
while (m.find()) {
String tmp = m.group(1) + m.group(2).replaceAll("[^\\w]", "");
System.out.println(tmp);
}
I would use a single replaceAll function.
str.replaceAll("\\s+\\(.*|\\s*\\+\\s*|[()]", "");
\\s+\\(.* matches one or more whitespace characters, the ( character that follows them, and all the remaining characters on the line. So the (CAN anag) part in your example is matched.
\\s*\\+\\s* matches + along with the preceding and following spaces.
[()] matches opening or closing brackets.
Finally, all the matched characters are replaced with an empty string.
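Applied to the three sample lines from the question, the single replaceAll call could be wrapped like this. The class and method names are placeholders chosen for the example; the regex is the one from the answer.

```java
class ClueCleaner {
    // Removes the trailing parenthesised note, any " + " joiners, and stray brackets.
    static String clean(String line) {
        return line.replaceAll("\\s+\\(.*|\\s*\\+\\s*|[()]", "");
    }

    public static void main(String[] args) {
        String[] lines = {
            "22 TIRES (2 defs)",
            "1 AP(PEAR + ANC)E (CAN anag)",
            "6 CHIC (\"SHEIK\" hom)"
        };
        for (String line : lines) {
            System.out.println(clean(line));
        }
        // Prints:
        // 22 TIRES
        // 1 APPEARANCE
        // 6 CHIC
    }
}
```

The bracket after AP is not preceded by whitespace, so the first alternative does not fire there and only the bracket itself is removed.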

Regex expression to get the file name

I want to extract only the file name from a complete file name that includes a timestamp. Below is the input.
String filePath = "fileName1_20150108.csv";
expected output should be: "fileName1"
String filePath2 = "fileName1_filedesc1_20150108_002_20150109013841.csv";
And expected output should be: "fileName1_filedesc1"
I wrote the code below in Java to get the file name. It works for the first input (filePath) but not for filePath2.
Pattern pattern = Pattern.compile(".*.(?=_)");
String filePath = "fileName1_20150108.csv";
String filePath2 = "fileName1_filedesc1_20150108_002_20150109013841.csv";
Matcher matcher = pattern.matcher(filePath);
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end() + " ");
System.out.println(matcher.group());
}
Can somebody please help me correct the regex so I can parse both file paths with the same regex?
Thanks
Anchor the start, and make the .* non-greedy:
^.*?(_\D.*?)?(?=[_.])
Update: change the second group (for fileDesc) to optional, and enforce that it starts with a non-digit character. This will work as long as your fileDesc strings never start with numbers.
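A short sketch of how this pattern behaves on the two inputs from the question. The class and method names are illustrative, and falling back to the full file name when nothing matches is an assumption rather than part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileBaseName {
    // Lazy prefix, optional "_<non-digit>..." description, then a lookahead for '_' or '.'.
    private static final Pattern BASE = Pattern.compile("^.*?(_\\D.*?)?(?=[_.])");

    // Hypothetical helper: returns the matched prefix, or the input if there is no match.
    static String baseName(String fileName) {
        Matcher m = BASE.matcher(fileName);
        return m.find() ? m.group() : fileName;
    }

    public static void main(String[] args) {
        System.out.println(baseName("fileName1_20150108.csv"));
        // fileName1
        System.out.println(baseName("fileName1_filedesc1_20150108_002_20150109013841.csv"));
        // fileName1_filedesc1
    }
}
```

In the first input the text after the underscore starts with a digit, so the optional group is skipped and the match stops right before the underscore.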
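You can get the characters before the first underscore, the first underscore, and then the characters until the next underscore:
^[^_]*_[^_]*
Note that for fileName1_20150108.csv this also matches the timestamp and the extension, so it only fits names that contain a description part.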
This should work: "^(.*?)_([0-9_]*)\\.([^.]*)$"
It will return you 3 groups:
the base name (assuming no underscore-separated part of it consists only of digits)
the timestamp info
the extension.
You can test here: http://fiddle.re/v0hne6 (RegexPlanet)
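For reference, the three groups can be read out in Java roughly like this. It is a sketch only: the class name, the method name and the decision to throw IllegalArgumentException on a non-matching name are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileNameParts {
    private static final Pattern PARTS =
            Pattern.compile("^(.*?)_([0-9_]*)\\.([^.]*)$");

    // Hypothetical helper: returns { base name, timestamp info, extension }.
    static String[] split(String fileName) {
        Matcher m = PARTS.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        for (String part : split("fileName1_filedesc1_20150108_002_20150109013841.csv")) {
            System.out.println(part);
        }
        // Prints:
        // fileName1_filedesc1
        // 20150108_002_20150109013841
        // csv
    }
}
```

The lazy first group grows until everything between the next underscore and the final dot consists only of digits and underscores, which is why filedesc1 ends up in the base name.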

Java Regex ReplaceAll with grouping

I want to surround all tokens in a text with tags in the following manner:
Input: " abc fg asd "
Output:" <token>abc</token> <token>fg</token> <token>asd</token> "
This is the code I tried so far:
String regex = "(\\s)([a-zA-Z]+)(\\s)";
String text = " abc fg asd ";
text = text.replaceAll(regex, "$1<token>$2</token>$3");
System.out.println(text);
Output:" <token>abc</token> fg <token>asd</token> "
Note: for simplicity we can assume that the input starts and ends with whitespaces
Use lookaround:
String regex = "(?<=\\s)([a-zA-Z]+)(?=\\s)";
...
text = text.replaceAll(regex, "<token>$1</token>");
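The original attempt skips fg because the trailing (\\s) of one match consumes the space that the next match would need as its leading (\\s). Lookarounds only check for the whitespace without consuming it. A complete, runnable version of the lookaround approach might look like this (class and method names are placeholders):

```java
class TokenTagger {
    // The lookbehind and lookahead assert whitespace on both sides without consuming it,
    // so adjacent tokens can share the same separating space.
    static String tag(String text) {
        return text.replaceAll("(?<=\\s)([a-zA-Z]+)(?=\\s)", "<token>$1</token>");
    }

    public static void main(String[] args) {
        System.out.println("\"" + tag(" abc fg asd ") + "\"");
        // Prints: " <token>abc</token> <token>fg</token> <token>asd</token> "
    }
}
```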
If your tokens are only defined with a character class you don't need to describe what characters are around. So this should suffice since the regex engine walks from left to right and since the quantifier is greedy:
String regex = "[a-zA-Z]+";
text = text.replaceAll(regex, "<token>$0</token>");
// meaning not a space, 1+ times
String result = input.replaceAll("([^\\s]+)", "<token>$1</token>");
This matches everything that isn't whitespace and is probably the best fit for what you need. It is also greedy, meaning it never leaves out a character it should include (it will never match just "as" in "asd" when the next character also matches).
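The two simpler patterns give the same result on the sample input but differ once a token contains something other than letters. The sketch below compares them on a made-up input containing a digit; the class and method names are illustrative only.

```java
class TokenPatternComparison {
    // Tags every run of ASCII letters; $0 refers to the whole match.
    static String tagLetters(String text) {
        return text.replaceAll("[a-zA-Z]+", "<token>$0</token>");
    }

    // Tags every run of non-whitespace characters.
    static String tagNonWhitespace(String text) {
        return text.replaceAll("([^\\s]+)", "<token>$1</token>");
    }

    public static void main(String[] args) {
        String input = " abc f1g asd ";
        System.out.println(tagLetters(input));
        // " <token>abc</token> <token>f</token>1<token>g</token> <token>asd</token> "
        System.out.println(tagNonWhitespace(input));
        // " <token>abc</token> <token>f1g</token> <token>asd</token> "
    }
}
```

Which one fits depends on whether a token is defined as a run of letters or as anything between whitespace.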
