Splitting strings & Pattern matching in Java

I have the following String:
MYLMFILLAAGCSKMYLLFINNAARPFASSTKAASTVVTPHHSYTSKPHHSTTSHCKSSD
I want to split such a string every time a K or R is encountered, except when followed by a P.
Therefore, I want the following output:
MYLMFILLAAGCSK
MYLLFINNAARPFASSTK
AASTVVTPHHSYTSKPHHSTTSHCK
SSD
At first I tried the plain .split() method in Java, but I couldn't get the desired result, because I don't know how to tell .split() not to split when a P comes right after the K or R.
I've looked at other similar questions, and they suggest using Pattern matching, but I don't know how to use it in this context.

You can use split:
String[] parts = str.split("(?<=[KR])(?!P)");
Because you want to keep the characters you're splitting on, you must use look arounds, which assert without consuming. There are two of them:
(?<=[KR]) means "the previous char is either K or R"
(?!P) means "the next char is not a P"
This regex matches the zero-width positions between characters where you want to split.
Some test code:
String str = "MYLMFILLAAGCSKMYLLFINNAARPFASSTKAASTVVTPHHSYTSKPHHSTTSHCKSSD";
Arrays.stream(str.split("(?<=[KR])(?!P)")).forEach(System.out::println);
Output:
MYLMFILLAAGCSK
MYLLFINNAARPFASSTK
AASTVVTPHHSYTSKPHHSTTSHCK
SSD
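Since the question also asks about Pattern matching, here is a minimal sketch of the same result obtained with Pattern and Matcher instead of split(). The class name SplitWithMatcher, the method name pieces, and the regex .*?[KR](?!P)|.+$ are illustrative choices and not part of the answer above; the sketch assumes the input contains no line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class SplitWithMatcher {
    // First alternative: lazily take characters up to and including the first
    // K or R that is not followed by P.
    // Second alternative: whatever non-empty text is left at the end of the input.
    private static final Pattern PIECE = Pattern.compile(".*?[KR](?!P)|.+$");

    static List<String> pieces(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PIECE.matcher(input);
        while (m.find()) {
            result.add(m.group()); // each match is one piece of the split
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "MYLMFILLAAGCSKMYLLFINNAARPFASSTKAASTVVTPHHSYTSKPHHSTTSHCKSSD";
        pieces(str).forEach(System.out::println);
    }
}
```

For the sample string this yields the same four pieces as the split() call. The matching form is mostly useful when you also need the match positions (m.start(), m.end()) of each piece.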

Just try this regexp:
([KR])([^P]|$)
and replace each match with
$1\n$2
No negative lookahead is needed. You cannot use this regexp with split directly, though, because split would also remove the non-P character that follows the K or R.
Instead, do the substitution above first, and then .split("\n");
so it should be:
"MYLMFILLAAGCSKMYLLFINNAARPFASSTKAASTVVTPHHSYTSKPHHSTTSHCKSSDK"
.replaceAll("([KR])([^P]|$)", "$1\n$2").split("\n");
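A runnable sketch of this substitute-then-split idea, assuming Java's String.replaceAll (which writes group references as $1 and $2) and the character class [KR] so that R is handled as well as K. The class and method names are made up for illustration. One caveat worth knowing: because the second group consumes the character after the K or R, two adjacent split letters produce only one split, whereas the lookaround regex from the first answer splits after each of them.

```java
import java.util.Arrays;

// Hypothetical comparison of the two approaches, for illustration only.
final class SubstituteThenSplit {
    // Insert a newline after each K or R that is followed by a non-P character
    // (or by the end of the input), then split on the newlines.
    static String[] viaSubstitution(String input) {
        return input.replaceAll("([KR])([^P]|$)", "$1\n$2").split("\n");
    }

    // Zero-width split: nothing is consumed, so adjacent K/R are each handled.
    static String[] viaLookaround(String input) {
        return input.split("(?<=[KR])(?!P)");
    }

    public static void main(String[] args) {
        String str = "MYLMFILLAAGCSKMYLLFINNAARPFASSTKAASTVVTPHHSYTSKPHHSTTSHCKSSD";
        System.out.println(Arrays.toString(viaSubstitution(str)));
        System.out.println(Arrays.toString(viaSubstitution("AKKB"))); // [AK, KB]
        System.out.println(Arrays.toString(viaLookaround("AKKB")));   // [AK, K, B]
    }
}
```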

Related

Get all possible matches in a regex match (In Java)?

I am using a regex to match a few possible values in a string that comes with my objects, and I need to get all the values in my string that match, as below.
Say my string value is "This is the code ABC : xyz use for something".
Here is the code that I am using to extract the matches:
String my_regex = "(ABC|ABC :).*";
List<String> matchers = Pattern.compile(my_regex, Pattern.CASE_INSENSITIVE)
.matcher(my_string)
.results()
.map(MatchResult::group)
.collect(Collectors.toList());
I am expecting 2 list items as the output, {"ABC", "ABC :"}, but I am only getting one. Help would be highly appreciated.

What you describe just isn't how regex engines work. They do not find all possible variant matches; they scan forward through the input, consume each match, and give you the matches in order. In other words, had you written:
String my_regex = "(ABC :|ABC)"; // note: get rid of the .*, and list the longer alternative first
String myString = "This is the code ABC : xyz use for something ABC again";
Then you'd get 2 results back - ABC : and ABC.
Yes, the regex could just as easily match just the ABC part instead of the ABC : part and it would still be valid. However, Java tries the alternatives of a | from left to right and keeps the first one that lets the overall match succeed, which is why the longer alternative is listed first here; with (ABC|ABC :) both results would be plain ABC. Quantifiers, on the other hand, are 'greedy' by default - they match as much as they can. For quantifiers such as * and + you can use the non-greedy variants *? and +?, which match as little as possible.
In other words, given:
String regex = "(a*?)(a+)";
String myString = "aaaaa";
Then group 1 would match 0 a (that's the shortest string that can match (a*?) whilst still being able to match the entire regex to the input), and group 2 would be aaaaa.
If, on the other hand, you wrote (a*)(a+), then group 1 would be aaaa and group 2 would be a. It is not possible to ask the regexp engine to give you the combinatorial explosion, with every possible length of 'a' - which appears to be what you want. The regexp API that ships with Java does not have any option to do this, nor does any other regexp API I know of, so you'd have to write that yourself. I admit I haven't scoured the web for every alternative regex engine implementation for Java; there are a bunch of third-party libraries, and perhaps one of them can do it.
NB: I said at the start: Get rid of the .*. That's because otherwise it's still just the one match: ABC : xyz use for something ABC again is the longest possible match and given that regex engines are greedy, that's what you will get: It is a valid 'interpretation' of your string (1 match), consuming the most - that's how it works.
NB2: Greediness can never change whether a regex even matches or not. It just changes which part of the input is assigned to which group and, when find()ing more than once (which .results() does - it find()s until no more matches are found), which matches you get.
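The behaviour described in this answer can be checked with a small sketch. The class GreedyDemo and its two helper methods are hypothetical names introduced here; the regexes and the sample string are the ones discussed above. It also shows that Java tries alternatives left to right, so the order inside (...|...) decides whether ABC : or plain ABC is reported.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class GreedyDemo {
    // Collect every match that repeated find() calls report.
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Return the text captured by groups 1 and 2 of the first match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            throw new IllegalArgumentException("no match");
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String text = "This is the code ABC : xyz use for something ABC again";
        System.out.println(allMatches("(ABC :|ABC)", text));   // [ABC :, ABC]
        System.out.println(allMatches("(ABC|ABC :)", text));   // [ABC, ABC]
        System.out.println(allMatches("(ABC|ABC :).*", text)); // one match running to the end
        System.out.println(Arrays.toString(groups("(a*?)(a+)", "aaaaa"))); // [, aaaaa]
        System.out.println(Arrays.toString(groups("(a*)(a+)", "aaaaa")));  // [aaaa, a]
    }
}
```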

Java String Regex replacement

Sample Input:
a:b
a.in:b
asds.sdsd:b
a:b___a.sds:bc___ab:bd
Sample Output:
a:replaced
a.in:replaced
asds.sdsd:replaced
a:replaced___a.sds:replaced___ab:replaced
The string that comes after : should be replaced using a custom function.
I have done this without regex. I feel it can be done with regex, since we are extracting a string that follows a specific pattern.
For the first three cases, it's simple enough to extract the string after :, but I couldn't find a way to deal with the fourth case, unless I split the string on ___, apply the approach for the first type of pattern to each part, and concatenate the parts again.

Just replace the letters that come right after : with the string replaced.
string.replaceAll("(?<=:)[A-Za-z]+", "replaced");
If you also want to deal with digits, then add \d inside the char class.
string.replaceAll("(?<=:)[A-Za-z\\d]+", "replaced");
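A minimal sketch of the lookbehind replacement applied to the sample input. Because the question mentions a custom function, it also shows Matcher.replaceAll with a function argument, which assumes Java 9 or later. The class name and the upper-casing transformation are placeholders invented for this example, not something the question specifies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class ReplaceAfterColon {
    // Letters that are immediately preceded by a colon; the colon itself is not consumed.
    private static final Pattern AFTER_COLON = Pattern.compile("(?<=:)[A-Za-z]+");

    // Fixed replacement text, as in the answer above.
    static String replaceFixed(String input) {
        return AFTER_COLON.matcher(input).replaceAll("replaced");
    }

    // Placeholder "custom function": upper-cases whatever follows the colon.
    // quoteReplacement keeps any '$' or '\' in the result from being read as a group reference.
    static String replaceWithFunction(String input) {
        return AFTER_COLON.matcher(input)
                .replaceAll(match -> Matcher.quoteReplacement(match.group().toUpperCase()));
    }

    public static void main(String[] args) {
        String sample = "a:b___a.sds:bc___ab:bd";
        System.out.println(replaceFixed(sample));        // a:replaced___a.sds:replaced___ab:replaced
        System.out.println(replaceWithFunction(sample)); // a:B___a.sds:BC___ab:BD
    }
}
```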

(:)[a-zA-Z]+
You can simply do this with string.replaceAll. Replace by $1replaced. See demo.
https://regex101.com/r/fX3oF6/18

Tokenizing a string using negations

So I have the following problem:
I have to tokenize a string using String.split(), and the tokens must be in the form 07dd ddd ddd, where d is a digit. I thought of using the regex ^(07\\d{2}\\s\\d{3}\\d{3}) and passing it as an argument to String.split(). But for some reason, although I do have substrings in that form, it outputs the whole initial string and doesn't tokenize it.
I initially thought that it was using an empty string as a splitter, as an empty string indeed matches that regex, but even after I added & (.)+ to the regex to ensure that the splitter doesn't have length 0, it still outputs the whole initial string.
I know that I could have used Pattern and Matcher to solve it much faster, but I have to use String.split(). Any ideas why this happens?

A Few Pointers
Your pattern ^(07\d{2}\s\d{3}\d{3}) is missing a space between the two last groups of digits
The reason you get the whole string back is that this pattern was never found in the first place: there is no split
If you split on this pattern (once fixed), the resulting array will be strings that are in-between this pattern (these tokens are actually removed)
If you want to use this pattern (once fixed), you need a Match All, not a Split. In JavaScript notation this looks like arrayOfMatches = yourString.match(/pattern/g); in Java, the equivalent is a loop over Matcher.find()
If you want to split, you need to use a delimiter that is present between the tokens (this delimiter could in fact just be a zero-width position asserted by the 07 about to follow)
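The last pointer can be sketched with String.split() and a lookahead. This assumes, hypothetically, that the input consists only of 07dd ddd ddd tokens separated by single whitespace characters; the question does not show its actual input, so the sample string and the class name TokenSplit are made up.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class TokenSplit {
    // Split on a whitespace character that is followed by a complete
    // "07dd ddd ddd" token. The lookahead (?=...) does not consume the token,
    // so only the separating whitespace is removed.
    static String[] tokenize(String input) {
        return input.split("\\s(?=07\\d{2}\\s\\d{3}\\s\\d{3})");
    }

    public static void main(String[] args) {
        String sample = "0712 345 678 0798 765 432 0700 000 001"; // made-up input
        System.out.println(Arrays.toString(tokenize(sample)));
        // [0712 345 678, 0798 765 432, 0700 000 001]
    }
}
```

The spaces inside a token are left alone because they are not followed by a full token starting with 07.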
Further Reading
Match All and Split are Two Sides of the Same Coin

Need regex to separate comma separated values (Interface list from a router query output)

I have an input like this
RX Only : Gi1/0/15,Gi1/0/20,Gi1/0/17
I want to capture 1/0/15, 1/0/20, 1/0/17 from this. But this input changes: sometimes there are only 2 comma-separated values, sometimes 1, sometimes more than 3.
The regex I came up with only captures the first group. If I use the non-greedy operator, then it captures the last one. What regex should I use to capture all these groups separately?
The language used would be Java.

It's often easier to just write the regex for the substrings you are interested in, then repeatedly use Matcher.find(), as opposed to trying to write a regex that matches the entire string and pulling what you want from a complex arrangement of groups.
Assuming what you are looking for are triples of numbers separated by "/", then:
Pattern p = Pattern.compile("\\d+/\\d+/\\d+");
Matcher m = p.matcher(inputString);
while (m.find()) {
// your triple is in group 0
System.out.println(m.group(0));
}

Give a man a fish ... or
http://gskinner.com/RegExr/

Do you really have to use regex here? If the data formats are quite similar, you can just use the indexOf function combined with substring. You will have to find the : character and start looking for commas from the next character. Then you check the position of \n and use the smaller index in order to retrieve a substring.
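A rough sketch of this indexOf and substring approach, under the assumption that every value carries the Gi prefix seen in the sample and that the values end at the first line break after the colon. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class InterfaceList {
    static List<String> parse(String text) {
        List<String> result = new ArrayList<>();
        int start = text.indexOf(':') + 1;     // first character after the colon
        int end = text.indexOf('\n', start);   // the values stop at the end of the line
        if (end < 0) {
            end = text.length();               // no line break: read to the end of the text
        }
        while (start < end) {
            int comma = text.indexOf(',', start);
            // Use whichever comes first: the next comma or the end of the line.
            int stop = (comma < 0 || comma > end) ? end : comma;
            String item = text.substring(start, stop).trim();
            if (item.startsWith("Gi")) {
                item = item.substring(2);      // assumed prefix, dropped to leave e.g. 1/0/15
            }
            if (!item.isEmpty()) {
                result.add(item);
            }
            start = stop + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("RX Only : Gi1/0/15,Gi1/0/20,Gi1/0/17"));
        // [1/0/15, 1/0/20, 1/0/17]
    }
}
```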

java regex tricky pattern

I've been stuck for a while on a regex that does the following:
split my sentences with this: "[\W+]"
but if it finds a word like this: "aaa-aa" (not "aaa - aa" or "aaa--aaa-aa"), the word isn't split, and is kept as a whole word.
Basically, I want to split a sentence into words, while also treating "aaa-aa" as a word. I have successfully done that by creating two separate functions, one for splitting with \W, and another to find words like "aaa-aa". Finally, I add both results together and subtract the parts of each compound word.
For example, the sentence:
"Hello my-name is Richard"
First I collect {Hello, my, name, is, Richard}
then I collect {my-name}
then I add {my-name} to {Hello, my, name, is, Richard}
then I take out {my} and {name} from {Hello, my, name, is, Richard}.
result: {Hello, my-name, is, Richard}
This approach does what I need, but for parsing large files it becomes too heavy, because each sentence needs too many copies. So my question is: is there anything I can do to include everything in one pattern? Like:
"split the text using this pattern "[\W+]", but if you find a word like "aaa-aa", consider it one word and not two words."

If you want to use a split() rather than explicitly matching the words you are interested in, the following should do what you want:
[\s-]{2,}|\s
To break that down, you first split on two or more whitespaces and/or hyphens - so a single '-' won't match, so 'one-two' will be left alone, but something like 'one--two', 'one - two' or even 'one - --- - two' will be split into 'one' and 'two'. That still leaves the 'normal' case of a single whitespace - 'one two' - unmatched, so we add an or ('|') followed by a single whitespace (\s).
Note that the order of the alternatives is important - RE subexpressions separated by '|' are evaluated left-to-right, so we need to put the spaces-and-hyphens alternative first. If we did it the other way around, when presented with something like 'one -two' we'd match on the first whitespace and return 'one', '-two'.
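A short sketch of this split in Java, with the reversed alternative order included only to show the difference described above. The class and method names are illustrative.

```java
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
final class HyphenAwareSplit {
    // Runs of two or more whitespace/hyphen characters first, then a single whitespace.
    static String[] words(String sentence) {
        return sentence.split("[\\s-]{2,}|\\s");
    }

    // Same alternatives in the opposite order, kept only for comparison.
    static String[] wordsReversedOrder(String sentence) {
        return sentence.split("\\s|[\\s-]{2,}");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("Hello my-name is Richard"))); // [Hello, my-name, is, Richard]
        System.out.println(Arrays.toString(words("one - --- - two")));          // [one, two]
        System.out.println(Arrays.toString(words("one -two")));                 // [one, two]
        System.out.println(Arrays.toString(wordsReversedOrder("one -two")));    // [one, -two]
    }
}
```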
If you want to interactively play around with Java REs I can thoroughly recommend http://myregexp.com/signedJar.html which allows you to edit the RE and see it matching against a sample string as you edit the RE.

Why not use the pattern \\s+? This does exactly what you want without any tricks: it splits the text into words separated by whitespace.

Your description isn't clear enough, but why not just split it up by spaces?

I am not sure whether this pattern would work, because I don't have developer tools for Java, but you might try it. It uses character class subtraction (written as an intersection with a negated class), which is supported only in Java regex as far as I know:
[\W&&[^-]]+
It means: match characters that are in [\W] and in [^-], that is, non-word characters other than the hyphen.
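For anyone who wants to try that pattern, here is a small sketch using String.split() (the class name is made up). Note that a hyphen surrounded by spaces is not a separator under this pattern, so it comes back as a token of its own.

```java
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
final class IntersectionSplit {
    // Split on runs of non-word characters, excluding the hyphen.
    static String[] words(String sentence) {
        return sentence.split("[\\W&&[^-]]+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("Hello my-name is Richard"))); // [Hello, my-name, is, Richard]
        System.out.println(Arrays.toString(words("aaa - aa")));                 // [aaa, -, aa]
    }
}
```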

Almost the same regular expression as in your previous question:
String sentence = "Hello my-name is Richard";
Pattern pattern = Pattern.compile("(?<!\\w)\\w+(-\\w+)?(?!\\w)");
Matcher matcher = pattern.matcher(sentence);
while (matcher.find()) {
System.out.println(matcher.group());
}
I just added the optional group (...)? to also match non-hyphenated words.
