Java, getting portion of pattern partially matched by input

As the title says, I'd like to get the portion of the pattern that is partially matched by the input. Example:
Pattern: aabb
Input string: "aa"
At this point I use the hitEnd() method of the Matcher class to find out whether the pattern is partially matched, as shown in this answer, but I'd also like to find out that specifically the "aa" of "aabb" is matched.
Is there any way to do this in Java?

This may be dirty, but here we go...
Once you know that some string hit the end (hitEnd), do a second pass:
1. Remove the last character from the string.
2. Search with the original regex.
3. If it matches, you are done and you have the part of the string.
4. If not, go to 1 and repeat the whole process until you match.
If the test strings can be long, performance may be a problem. So instead of stepping one position at a time from last to first, search in blocks (a binary search).
For example, considering a string of 1,000 chars:
1. Test the first 1000/2 characters (positions 1-500). For this example, assume it matches.
2. Test the first 500 chars plus 500/2 (positions 1-750). For this example, assume it does not match. So we know the position must lie between 500 and 750.
3. Now test 1-625 ((750+500)/2). If it matches, the position must lie between 625 and 750. If it does not match, it must lie between 500 and 625.
...

There is no such method in the Matcher class. However, you could achieve it, for example, this way:
public String getPartialMatching(String pattern, String input) {
    Pattern p = Pattern.compile(pattern);
    Matcher m = p.matcher(input);
    int end = 0;
    // skip over every complete match and remember where the last one ended
    while (m.find()) {
        end = m.end();
    }
    // hitEnd() tells whether the last find() attempt reached the end of the input
    if (m.hitEnd()) {
        return input.substring(end);
    } else {
        return null;
    }
}
First, iterate over all completely matched parts of the string and skip them. For example, with input = "aabbaa", m.hitEnd() would return false if the complete match aabb were not skipped first.
Second, check whether the remaining part of the string partially matches.
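A self-contained sketch of the approach above, with the imports it needs. The class name PartialMatchDemo is made up for illustration, and the method is the one from this answer made static. How hitEnd() behaves for input that has nothing in common with the pattern can depend on how the engine scans, so the sketch only exercises the two inputs discussed here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PartialMatchDemo {

    // Returns the tail of the input left over after all complete matches
    // if the matcher ran into the end of the input while searching; otherwise null.
    static String getPartialMatching(String pattern, String input) {
        Matcher m = Pattern.compile(pattern).matcher(input);
        int end = 0;
        while (m.find()) {
            end = m.end(); // remember where the last complete match stopped
        }
        // hitEnd() reports whether the last find() attempt reached the end of the input
        return m.hitEnd() ? input.substring(end) : null;
    }

    public static void main(String[] args) {
        System.out.println(getPartialMatching("aabb", "aa"));     // prints aa
        System.out.println(getPartialMatching("aabb", "aabbaa")); // prints aa
    }
}
```

For a literal pattern such as aabb, the returned tail is also the prefix of the pattern that has been matched so far. For patterns with quantifiers or alternations, it only tells which part of the input is still pending.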

Related

Delete some part of the string in beginning and some at last in java

I want dynamic code that trims off some part of the String at the beginning and some part at the end. I am able to trim the last part, but I am not able to trim the initial part of the String up to a specific point. Only the first character is deleted in the output.
public static String removeTextAndLastBracketFromString(String string) {
StringBuilder str = new StringBuilder(string);
int i=0;
do {
str.deleteCharAt(i);
i++;
} while(string.equals("("));
str.deleteCharAt(string.length() - 2);
return str.toString();
}
This is my code. When I pass Awaiting Research(5056) as an argument, the output is waiting Research(5056. I want to trim the initial part of such a string up to the ( and keep only the digits. My expected output here is 5056. Please help.

You don't need loops (as in your code); you can use String.substring(int, int) in combination with String.indexOf(char):
public static void main(String[] args) {
    // example input
    String input = "Awaiting Research(5056)";
    // find the parentheses and use their indexes to get the content
    String output = input.substring(
            input.indexOf('(') + 1, // begin index is inclusive, so add 1 to skip the '('
            input.indexOf(')')      // end index is exclusive, so the ')' is left out
    );
    // print the result
    System.out.println(output);
}
Output:
5056
Hint:
Only use this if you are sure the input will always contain a ( and a ) with indexOf('(') < indexOf(')'), or handle the IndexOutOfBoundsException that will occur on most Strings that do not satisfy this constraint.
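The hint above can be sketched as a small helper that checks both indexes before calling substring, so no exception handling is needed. The class name ParenContent and the method name betweenParentheses are hypothetical, and returning an Optional is just one possible way to signal "not found".

```java
import java.util.Optional;

public class ParenContent {

    // Returns the text between the first '(' and the next ')' after it,
    // or an empty Optional if the input has no such pair.
    static Optional<String> betweenParentheses(String input) {
        int open = input.indexOf('(');
        if (open < 0) {
            return Optional.empty();
        }
        int close = input.indexOf(')', open + 1); // search only after the '('
        if (close < 0) {
            return Optional.empty();
        }
        return Optional.of(input.substring(open + 1, close));
    }

    public static void main(String[] args) {
        System.out.println(betweenParentheses("Awaiting Research(5056)").orElse("(none)")); // prints 5056
        System.out.println(betweenParentheses("no parentheses here").orElse("(none)"));     // prints (none)
    }
}
```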

If your goal is just to find one numeric value in the string, search the string with a regex for a run of digits and you'll have the number separated from the rest of the string,
e.g.:
Pattern pattern = Pattern.compile("\\d+");
Matcher matcher = pattern.matcher("somestringwithnumberlike123");
if(matcher.find()) {
System.out.println(matcher.group());
}

Using a regex to extract what you need is a better option:
String test = "Awaiting Research(5056)";
Pattern p = Pattern.compile("([0-9]+)");
Matcher m = p.matcher(test);
if (m.find()) {
System.out.println(m.group());
}

For your case, it is better to use a regular expression to extract the part you are interested in.
Pattern pattern = Pattern.compile("(?<=\\().*(?=\\))");
Matcher matcher = pattern.matcher("Awaiting Research(5056)");
if(matcher.find())
{
return matcher.group();
}

It is much easier to solve the problem using, e.g., String.indexOf(..) and String.substring(from, to). But if for some reason you want to stick to your approach, here are some hints.
Your code does what it does because:
string.equals("(") is only true if the given string is exactly "("
the do {code} while (condition) loop executes code once even if condition is not true -> think about using the while (condition) {code} loop instead
if you change the condition to check for the character at i, your code would remove the first, third, fifth character and so on: after the first execution i is 1 and the char at i is now the third char of the original string (because the first has already been removed) -> think about always checking and removing the char at index 0.
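Putting these hints together, a corrected version of the question's method might look like the sketch below. It keeps the StringBuilder approach, always checks and deletes index 0, and assumes (as in the question) that the closing ) is the last character. The class name TrimToDigits is made up for the example.

```java
public class TrimToDigits {

    // Deletes everything up to and including the first '(' and then a trailing ')'.
    static String removeTextAndLastBracketFromString(String string) {
        StringBuilder str = new StringBuilder(string);
        // always inspect and delete index 0, so no characters are skipped
        while (str.length() > 0 && str.charAt(0) != '(') {
            str.deleteCharAt(0);
        }
        if (str.length() > 0) {
            str.deleteCharAt(0); // remove the '(' itself
        }
        if (str.length() > 0 && str.charAt(str.length() - 1) == ')') {
            str.deleteCharAt(str.length() - 1); // remove the trailing ')'
        }
        return str.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeTextAndLastBracketFromString("Awaiting Research(5056)")); // prints 5056
    }
}
```

If the input contains no ( at all, this sketch returns an empty string; whether that is the right behaviour depends on the caller.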

Finding and retrieving consecutive matches

Say I want to match a string that should consist solely of parts adhering to a specific (regex) pattern, and retrieve the elements in a loop. It seems that Matcher.find() was invented for this. However, find will match anywhere in the string, not just directly after the previous match, so intermediate characters are skipped.
So, for instance, I want to match \\p{XDigit}{2} (two hexadecimal digits) in such a way that:
aabb matches;
_aabb doesn't match;
aa_bb doesn't match;
aabb_ doesn't match.
by using find (or any other iterated call to the regex) so I can directly process each byte in the array. So I want to process aa and bb separately, after matching.
OK, that's it, the most elegant way of doing this wins the accept.
Notes:
the hexadecimal parsing is just an example of a simple repeating pattern;
preferably I would like to keep the regex to the minimal required to match the element;
yes, I know about using (\\p{XDigit}{2})*, but I don't want to scan string twice (as it should be usable on huge input strings).

It appears you want to get all (multiple) matches that appear at the start of the string or right after a successful match. You may combine \G operator with a lookahead that will assure the string only matches some repeated pattern.
Use
(?:\G(?!^)|^(?=(?:\p{XDigit}{2})*$))\p{XDigit}{2}
See the regex demo
Details
(?: - start of a non-capturing group with 2 alternatives:
\G(?!^) - the end of the previous successful match
| - or
^(?=(?:\p{XDigit}{2})*$) - start of a string (^) that is followed with 0+ occurrences of \p{XDigit}{2} pattern up to the end of the string ($)
) - end of the non-capturing group
\p{XDigit}{2} - 2 hex chars.
Java demo:
String regex = "(?:\\G(?!^)|^(?=(?:[0-9a-fA-F]{2})*$))[0-9a-fA-F]{2}";
String[] strings = {"aabb","_aabb","aa_bb", "aabb_"};
Pattern pattern = Pattern.compile(regex);
for (String s : strings) {
System.out.println("Checking " + s);
Matcher matcher = pattern.matcher(s);
List<String> res = new ArrayList<>();
while (matcher.find()) {
res.add(matcher.group(0));
}
if (res.size() > 0) {
System.out.println(res);
} else {
System.out.println("No match!");
}
}
Output:
Checking aabb
[aa, bb]
Checking _aabb
No match!
Checking aa_bb
No match!
Checking aabb_
No match!

OK, I may finally have had a brainwave: the idea is to move the find() call out of the condition of the while loop. Instead I should simply keep a variable holding the location and only stop parsing when the whole string has been processed. The location can also be used to produce a more informative error message.
The location starts at zero and is updated to the end of the match. Each time a new match is found the start of the match is compared with the location, i.e. end of the last match. An error occurs if:
the pattern is not found;
the pattern is found, but not at the end of the last match.
Code:
private static byte[] parseHex(String hex) {
    byte[] bytes = new byte[hex.length() / 2];
    int off = 0;
    // the pattern is normally a constant
    Pattern hexByte = Pattern.compile("\\p{XDigit}{2}");
    Matcher hexByteMatcher = hexByte.matcher(hex);
    int loc = 0;
    // so here we would normally do the while (hexByteMatcher.find()) ...
    while (loc < hex.length()) {
        // optimization in case we have a maximum size of the pattern;
        // the end is clamped because region() throws if it lies beyond the input
        hexByteMatcher.region(loc, Math.min(loc + 2, hex.length()));
        // instead we try and find the pattern, and produce an error if not found at the right location
        if (!hexByteMatcher.find() || hexByteMatcher.start() != loc) {
            // only a single throw, message includes location
            throw new IllegalArgumentException("Hex string invalid at offset " + loc);
        }
        // the processing of the pattern, in this case a double hex digit representing a byte value
        bytes[off++] = (byte) Integer.parseInt(hexByteMatcher.group(), 16);
        // set the next location to the end of the match
        loc = hexByteMatcher.end();
    }
    return bytes;
}
The method can be improved by adding \\G (end of last match) to the regex: \\G\\p{XDigit}{2}. This way the regular expression fails immediately if the pattern cannot be found starting at the end of the last match (or at the start of the string).
For regular expressions with an expected maximum size (2 in this case) it is of course also possible to adjust the end of the region that needs to be matched.
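A minimal sketch of the \\G variant mentioned above, written here as an illustration rather than as the author's final code. The class name HexParser is hypothetical. Because \\G pins every match to the end of the previous one, the usual while (find()) loop can be used again, and a single length check after the loop detects invalid input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HexParser {

    // \G anchors each match at the end of the previous one (or at the start of the input)
    private static final Pattern HEX_BYTE = Pattern.compile("\\G\\p{XDigit}{2}");

    static byte[] parseHex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        int off = 0;
        int loc = 0;
        Matcher m = HEX_BYTE.matcher(hex);
        // find() can only succeed directly after the previous match, so nothing is skipped
        while (m.find()) {
            bytes[off++] = (byte) Integer.parseInt(m.group(), 16);
            loc = m.end();
        }
        // if the loop stopped before the end, the input is invalid at loc
        if (loc != hex.length()) {
            throw new IllegalArgumentException("Hex string invalid at offset " + loc);
        }
        return bytes;
    }

    public static void main(String[] args) {
        byte[] parsed = parseHex("aabb");
        System.out.printf("%02x %02x%n", parsed[0], parsed[1]); // prints aa bb
    }
}
```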

Java Matcher Pattern issue

I am trying to extract everything that is after this string path /share/attachments/docs/. All my strings are starting with /share/attachments/docs/
For example: /share/attachments/docs/image2.png
Number of characters after ../docs/ is not static!
I tried with
Pattern p = Pattern.compile("^(.*)/share/attachments/docs/(\\d+)$");
Matcher m = p.matcher("/share/attachments/docs/image2.png");
m.find();
String link = m.group(2);
System.out.println("Link #: "+link);
But I am getting Exception that: No match found.
Strange because if I use this:
Pattern p = Pattern.compile("^(.*)ABC Results for draw no (\\d+)$");
Matcher m = p.matcher("ABC Results for draw no 2888");
then it works!!!
Also one thing is that in some very rare cases my string does not start with /share/attachments/docs/ and then I should not parse anything but that is not related directly to the issue, but it will be good to handle.

I am getting Exception that: No match found.
This is because image2.png doesn't match \d+. Use a more appropriate pattern such as .+, assuming that you want to extract image2.png.
Your regular expression will then be ^(.*)/share/attachments/docs/(.+)$
In the case of ABC Results for draw no 2888, the regexp ^(.*)ABC Results for draw no (\\d+)$ works because the String ends with several successive digits, whereas in the first case image2.png is a mix of letters, digits and a dot, which is why no match was found.
Generally speaking, to avoid getting an IllegalStateException: No match found, you first need to check the result of find(); if it returns true, the input String matches:
if (m.find()) {
// The String matches with the pattern
String link = m.group(2);
System.out.println("Draw #: "+link);
} else {
System.out.println("Input value doesn't match with the pattern");
}

The regular expression \d+ (expressed as \\d+ inside a string literal) matches a run of one or more digits. Your example input does not have a corresponding digit run, so it is not matched. The regex metacharacter . matches any character (+/- newline, depending on regex options); it seems like that may be what you're really after.
Additionally, when you use Matcher.find() it is unnecessary for the pattern to match the whole string, so it is needless to include .* to match leading context. Furthermore, find() returns a value that tells you whether a match to the pattern was found. You generally want to use this return value, and in your particular case you can use it to reject those rare non-matching strings.
Maybe this is more what you want:
Pattern p = Pattern.compile("/share/attachments/docs/(.+)$");
Matcher m = p.matcher("/share/attachments/docs/image2.png");
String link;
if (m.find()) {
link = m.group(1);
System.out.println("Draw #: " + link);
} else {
link = null;
System.out.println("Draw #: (not found)");
}
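The same idea can be wrapped in a small helper that returns nothing for the rare strings without the prefix. This is only a sketch: the class name AttachmentPath and the method name fileName are invented for the example, and the ^ anchor assumes that valid inputs start with the prefix, as the question states.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttachmentPath {

    // Anchored with ^ on the assumption that valid inputs start with the prefix
    private static final Pattern DOCS =
            Pattern.compile("^/share/attachments/docs/(.+)$");

    static Optional<String> fileName(String path) {
        Matcher m = DOCS.matcher(path);
        // only call group() after find() has returned true
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fileName("/share/attachments/docs/image2.png").orElse("(not found)")); // prints image2.png
        System.out.println(fileName("/other/path/image2.png").orElse("(not found)"));             // prints (not found)
    }
}
```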

Regex matching up to a character if it occurs

I need to match a string as follows:
match everything up to ;
if - occurs, match only up to the -, excluding the -
For example:
abc; should return abc
abc-xyz; should return abc
Pattern.compile("^(?<string>.*?);$");
Using the above I can achieve half of it, but I don't know how to change this pattern to meet the second requirement. How do I change .*? so that it stops at the first occurrence of -?
I am not good with regex. Any help would be great.
EDIT
I need to capture it as a group. I can't change that, since there are many other patterns to match and capture; what I have posted is only part of it.
The code looks something like this:
public static final Pattern findString = Pattern.compile("^(?<string>.*?);$");
if(findString.find())
{
return findString.group("string"); //cant change anything here.
}

Just use a negated char class.
^[^-;]*
ie.
Pattern p = Pattern.compile("^[^-;]*");
Matcher m = p.matcher(str);
while(m.find()) {
System.out.println(m.group());
}
This matches zero or more characters at the start of the string that are neither - nor ;.

This should do what you are looking for:
[^-;]*
It matches characters that are not - or ;.
Tip: If you don't feel confident with regular expressions, there are great online tools for testing your input, e.g. https://regex101.com/
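Since the question requires a named group, the negated character class can be placed inside it. The following is a sketch under that assumption. The class name UpToDashOrSemicolon and the method name capture are made up, while the group name string comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UpToDashOrSemicolon {

    // The named group "string" captures everything before the first '-' or ';'
    private static final Pattern FIND_STRING = Pattern.compile("^(?<string>[^-;]*)");

    static String capture(String input) {
        Matcher m = FIND_STRING.matcher(input);
        // group(...) lives on the Matcher, not on the Pattern
        return m.find() ? m.group("string") : "";
    }

    public static void main(String[] args) {
        System.out.println(capture("abc;"));     // prints abc
        System.out.println(capture("abc-xyz;")); // prints abc
    }
}
```

Unlike the original ^(?<string>.*?);$, this pattern does not require the input to end with ;, so it is a looser check.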

UPDATE
I see you have an issue in the code since you try to access .group in the Pattern object, while you need to use the .group method of the Matcher object:
public static String GetTheGroup(String str) {
Pattern findString = Pattern.compile("(?s)^(?<string>.*?)[;-]");
Matcher matcher = findString.matcher(str);
if (matcher.find())
{
return matcher.group("string"); //you have to change something here.
}
else
return "";
}
And call it as
System.out.println(GetTheGroup("abc-xyz;"));
See IDEONE demo
OLD ANSWER
Your ^(?<string>.*?);$ regex only matches 0 or more characters other than a newline from the beginning up to the first ; that is the last character in the string. I guess it is not what you expect.
You should learn more about using character classes in regex, as you can match 1 symbol from a specified character set that is defined with [...].
You can achieve this with a String.split taking the first element only and a [;-] regex that matches a ; or - literally:
String res = "abc-xyz;".split("[;-]")[0];
System.out.println(res);
Or with replaceAll and the (?s)[;-].*$ regex (which matches the first ; or - and then anything up to the end of the string):
res = "abc-xyz;".replaceAll("(?s)[;-].*$", "");
System.out.println(res);
See IDEONE demo

I have found a solution that keeps the grouping.
(?<string>.*?) matches everything up to the next part of the pattern
(?:-.*?)? is a non-capturing group that starts with - and occurs zero or one time
; is the end character
So, putting it all together (Pattern itself has no find() or group(), so a Matcher is needed; input stands for the string being tested):
public static final Pattern findString = Pattern.compile("^(?<string>.*?)(?:-.*?)?;$");
Matcher matcher = findString.matcher(input);
if (matcher.find())
{
    return matcher.group("string"); // the group name stays unchanged
}

Java Regex is including new line in match

I'm trying to match a regular expression to textbook definitions that I get from a website.
The definition always has the word with a new line followed by the definition. For example:
Zither
Definition: An instrument of music used in Austria and Germany It has from thirty to forty wires strung across a shallow sounding board which lies horizontally on a table before the performer who uses both hands in playing on it Not to be confounded with the old lute shaped cittern or cithern
In my attempts to get just the word (in this case "Zither") I keep getting the newline character.
I tried both ^(\w+)\s and ^(\S+)\s without much luck. I thought that maybe ^(\S+)$ would work, but that doesn't seem to successfully match the word at all. I've been testing with rubular, http://rubular.com/r/LPEHCnS0ri; which seems to successfully match all my attempts the way I want, despite the fact that Java doesn't.
Here's my snippet
String str = ...; //Here the string is assigned a word and definition taken from the internet like given in the example above.
Pattern rgx = Pattern.compile("^(\\S+)$");
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
String result = mtch.group();
terms.add(new SearchTerm(result, System.nanoTime()));
}
This is easily solved by trimming the resulting string, but that seems like it should be unnecessary if I'm already using a regular expression.
All help is greatly appreciated. Thanks in advance!

Try using the Pattern.MULTILINE option
Pattern rgx = Pattern.compile("^(\\S+)$", Pattern.MULTILINE);
This causes the regex to recognise line delimiters in your string, otherwise ^ and $ just match the start and end of the string.
Although it makes no difference for this pattern, the Matcher.group() method returns the entire match, whereas the Matcher.group(int) method returns the match of the particular capture group (...) based on the number you specify. Your pattern specifies one capture group which is what you want captured. If you'd included \s in your Pattern as you wrote you tried, then Matcher.group() would have included that whitespace in its return value.
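A short runnable sketch of the MULTILINE suggestion, using a shortened version of the Zither example as input. The class name FirstLineWord and the method name term are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstLineWord {

    // With MULTILINE, ^ and $ match at the start and end of each line,
    // not only at the start and end of the whole input
    private static final Pattern TERM = Pattern.compile("^(\\S+)$", Pattern.MULTILINE);

    // Returns the first line that consists of a single run of non-whitespace, or null
    static String term(String text) {
        Matcher m = TERM.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(term("Zither\nDefinition: An instrument of music")); // prints Zither
    }
}
```

Because \S+ cannot consume the line break and $ matches just before it, the result contains no trailing newline and no trim() is needed.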

With regular expressions the first group is always the complete matching string. In your case you want group 1, not group 0.
So changing mtch.group() to mtch.group(1) should do the trick:
String str = ...; //Here the string is assigned a word and definition taken from the internet like given in the example above.
Pattern rgx = Pattern.compile("^(\\w+)\\s");
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
String result = mtch.group(1);
terms.add(new SearchTerm(result, System.nanoTime()));
}

A late response, but if you are not using Pattern and Matcher, you can use this alternative of DOTALL in your regex string
(?s)[Your Expression]
Basically (?s) also tells dot to match all characters, including line breaks
Detailed information: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
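As an illustration of the embedded flag, here is a small sketch using String.matches, which compiles the regex internally and therefore cannot take a Pattern.DOTALL argument. The class name DotAllDemo, the method name matchesWhole and the sample regexes are made up for the example.

```java
public class DotAllDemo {

    // String.matches compiles the regex itself, so flags have to be embedded in the regex
    static boolean matchesWhole(String input, String regex) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        String input = "Zither\nDefinition: An instrument of music";
        System.out.println(matchesWhole(input, "Zither.*"));     // prints false: '.' stops at the line break
        System.out.println(matchesWhole(input, "(?s)Zither.*")); // prints true: (?s) lets '.' match '\n'
    }
}
```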

Just replace:
String result = mtch.group();
By:
String result = mtch.group(1);
This will limit your output to the contents of the capturing group (e.g. (\\w+)).

Try the following:
/* The regex pattern: ^(\w+)\r?\n(.*)$ */
private static final Pattern REGEX_PATTERN =
        Pattern.compile("^(\\w+)\\r?\\n(.*)$");
public static void main(String[] args) {
    String input = "Zither\nDefinition: An instrument of music";
    System.out.println(
        REGEX_PATTERN.matcher(input).matches()
    ); // prints "true"
    System.out.println(
        REGEX_PATTERN.matcher(input).replaceFirst("$1 = $2")
    ); // prints "Zither = Definition: An instrument of music"
    System.out.println(
        REGEX_PATTERN.matcher(input).replaceFirst("$1")
    ); // prints "Zither"
}
