How to use Java Regular Expressions to extract the following data? - java

How do I obtain the first long number from the sentence given below using a regular expression:
396124450036269056,"#Anyi1987 asi fue,bano total para mi.,:D",MiriamBustam
I want the result as: 396124450036269056.
So how do I represent the number in this whole sentence using regular expressions?
I am using Apache Pig scripting language which makes use of Java regular expressions.
So in Apache Pig:
REGEX_EXTRACT_ALL:
Syntax:
REGEX_EXTRACT_ALL (string, regex)
Use the REGEX_EXTRACT_ALL function to perform regular expression matching and to extract all matched groups.
This example will return the tuple (192.168.1.5,8020).
REGEX_EXTRACT_ALL('192.168.1.5:8020', '(.*)\:(.*)');
REGEX_EXTRACT:
Syntax:
REGEX_EXTRACT (string, regex, index)
Use the REGEX_EXTRACT function to perform regular expression matching and to extract the matched group defined by the index parameter (where the index is a 1-based parameter.)
This example will return the string '192.168.1.5'.
REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
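To try the same grouping outside Pig, the documented example can be reproduced in plain Java. This is only a sketch: the class and method names (HostPortDemo, extractAll) are made up for illustration, and it mirrors the documented behaviour rather than Pig's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HostPortDemo {
    // Returns every capture group if the whole input matches, otherwise an empty list.
    static List<String> extractAll(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> groups = new ArrayList<>();
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints [192.168.1.5, 8020]
        System.out.println(extractAll("192.168.1.5:8020", "(.*):(.*)"));
    }
}
```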

\d+
Matches one or more consecutive digit characters.
So it matches 396124450036269056 in this case.
You don't need a regex here. You could use a substring().
s.substring(0, s.indexOf(","))

I don't think there is a regular expression that matches the longest number in a text.
Expressions like \d+ or \d* will match only the first number, no matter how many digits it has. So if you have "55 msadmmsada 8882138213821321382183", those expressions will match only 55.
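If the longest number is what you need, one option is to let \d+ find every run of digits and compare lengths in code. A minimal sketch in plain Java, with hypothetical names (LongestNumberDemo, longestNumber):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestNumberDemo {
    // Visits every run of digits and keeps the longest one (the first wins on ties).
    // Returns an empty string when the text contains no digits.
    static String longestNumber(String text) {
        Matcher m = Pattern.compile("\\d+").matcher(text);
        String longest = "";
        while (m.find()) {
            if (m.group().length() > longest.length()) {
                longest = m.group();
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        // prints 8882138213821321382183
        System.out.println(longestNumber("55 msadmmsada 8882138213821321382183"));
    }
}
```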

If your string always starts with a number, simply use (\d+) (see this at regex101).
This will extract all digits at the start of something into a matching group. So, if I understand your examples right,
REGEX_EXTRACT(you, '(\d+).*', 1);
Would do the trick. You would only have to append the .* if this function has to match the whole text to extract something, otherwise you can omit it.
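For comparison, here is how the same capture group behaves in plain Java rather than Pig. This is a sketch; the class and method names are invented for the example, and the ^ anchor is added on the assumption that the number is always at the start of the line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstNumberDemo {
    // ^ anchors the match to the start of the line; group 1 holds the digits.
    private static final Pattern LEADING_DIGITS = Pattern.compile("^(\\d+)");

    // Returns the leading run of digits, or null if the line does not start with a digit.
    static String firstNumber(String line) {
        Matcher m = LEADING_DIGITS.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "396124450036269056,\"#Anyi1987 asi fue,bano total para mi.,:D\",MiriamBustam";
        // prints 396124450036269056
        System.out.println(firstNumber(line));
    }
}
```

find() is used here, so no trailing .* is needed; with matches() the pattern would have to cover the whole line.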

You could use:
\d*
and it will match 396124450036269056
Explanation:
\d matches a digit [0-9]
Quantifier: * Between zero and unlimited times, so \d* can also match an empty string when the text does not start with a digit
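One caveat worth checking: because * allows zero repetitions, \d* succeeds with an empty match whenever the search position is not on a digit, while \d+ skips ahead to the first real digit run. A small sketch in plain Java (the helper name firstMatch is made up):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StarVersusPlusDemo {
    // Returns the text of the first match found, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Text starts with digits: both patterns return the number.
        System.out.println(firstMatch("\\d*", "396124450036269056,abc"));
        System.out.println(firstMatch("\\d+", "396124450036269056,abc"));
        // Text starts with a letter: \d* matches the empty string at index 0.
        System.out.println("[" + firstMatch("\\d*", "id 396124450036269056") + "]"); // prints []
        System.out.println(firstMatch("\\d+", "id 396124450036269056"));
    }
}
```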

Related

Regex Match Reset \K Equivalent in Java

I have come up with a regex pattern to match part of a JSON value, but only the PCRE engine supports it. I want to know the Java equivalent of this regex.
Simplified version
cif:\K.*(?=(.+?){4})
Matches part of the value, leaving the last 4 characters.
cif:test1234
Matched value will be test
https://regex101.com/r/xV4ZNa/1
Note: I can only define the regex and the replacement text. I don't have access to the Java code, since it's handled by a proprietary log-masking framework.
You can simplify the pattern to:
(?<=cif:).*(?=....)
Explanation
(?<=cif:) Positive lookbehind, assert cif: to the left
.* Match any character except newlines, 0 or more times
(?=....) Positive lookahead, assert 4 characters to the right (which can include spaces)
See a regex demo.
If you don't want to match empty strings, then you can use .+ instead
(?<=cif:).+(?=....)
You can use a lookbehind assertion instead:
(?<=cif:).*(?=(.+?){4})
Demo: https://regex101.com/r/xV4ZNa/3
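Since the masking framework itself is not visible, the lookbehind version can at least be checked against java.util.regex directly. The sketch below assumes the framework does something similar to Matcher.replaceAll, which is a guess; the class name, method names and the **** replacement text are all illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CifMaskDemo {
    // Lookbehind replaces PCRE's \K; the lookahead keeps the last 4 characters out of the match.
    private static final Pattern CIF_VALUE = Pattern.compile("(?<=cif:).+(?=....)");

    // Returns the part of the value that would be masked, or null if nothing matches.
    static String matchedPart(String s) {
        Matcher m = CIF_VALUE.matcher(s);
        return m.find() ? m.group() : null;
    }

    // Replaces the matched part with a fixed mask.
    static String mask(String s) {
        return CIF_VALUE.matcher(s).replaceAll("****");
    }

    public static void main(String[] args) {
        System.out.println(matchedPart("cif:test1234")); // prints test
        System.out.println(mask("cif:test1234"));        // prints cif:****1234
    }
}
```

The lookbehind works in Java here because cif: has a fixed length.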

Expression to capture only 1 occurrence for a single character but multiple for others

I am trying to use the following regex to capture following values. This is for use in Java.
(\$|£|$|£)([ 0-9.]+)
Example values which I do want to be captured via above regex which works.
$100
$100.5
$100
$100.6
£200
£200.6
But the following also get captured, which is wrong. I only want to capture values when there is only one dot in the text, not multiple.
£200.15.
£200.6.6.6.6
Is there a way to select such that values with multiple periods don't count?
I can't do something like the following because that would affect the numbers too. Please advise.
(\$|£|$|£)([ 0-9.]{1})
You can use
(\$|£|$|£)(\d+(?:\.\d+)?)\b(?!\.)
See the regex demo.
In this regex, (\d+(?:\.\d+)?)\b(?!\.) matches
(\d+(?:\.\d+)?) - Group 1: one or more digits, then an optional occurrence of . and one or more digits
\b - a word boundary
(?!\.) - not immediately followed with a . char.
Another solution for Java (where the regex engine supports possessive quantifiers) will be
(\$|£|$|£)(\d++(?:\.\d+)?+)(?!\.)
See this regex demo. \d++ and (?:\.\d+)?+ contain ++ and ?+ possessive quantifiers that prevent backtracking into the quantified subpatterns.
In Java, do not forget to double the backslashes in the string literals:
String regex = "(\\$|£|$|£)(\\d++(?:\\.\\d+)?+)(?!\\.)";
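A quick way to check the possessive version against the question's sample values is a small test harness. This sketch makes two assumptions that are not in the answer: the currency alternation is simplified to a character class containing only $ and the pound sign, and the pound sign is written as the escape \u00A3 so the source file stays ASCII. The names AmountDemo and amount are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmountDemo {
    // Currency symbol, then digits with at most one fractional part,
    // and no dot allowed immediately after the number.
    private static final Pattern AMOUNT =
            Pattern.compile("[$\u00A3](\\d++(?:\\.\\d+)?+)(?!\\.)");

    // Returns the numeric part of the first valid amount, or null if there is none.
    static String amount(String text) {
        Matcher m = AMOUNT.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(amount("$100"));                // prints 100
        System.out.println(amount("$100.5"));              // prints 100.5
        System.out.println(amount("\u00A3200.6"));         // prints 200.6
        System.out.println(amount("\u00A3200.15."));       // prints null
        System.out.println(amount("\u00A3200.6.6.6.6"));   // prints null
    }
}
```

Because the quantifiers are possessive, the engine cannot give back digits or the fractional part to satisfy the lookahead, which is what rejects the last two inputs.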
You could try this
(\$|£|$|£)([0-9]+(?:\.[0-9]+)?)$
one or more digits followed by an optional dot and some digits and then the end of the string.
EDIT: some typos fixed.
And it's not OK to delete the whole sentence above because of one word against myself. :(

Regex with Whitespace

I am trying to write a regex to match the following:
act=MATCHME
act=Match me too
I have the following regexes, each of which matches one value but not both. Here is my effort:
matches MATCHME: act=(\w+)
matches Match me too: (\w+\s\w+\s\w+)
Is there any way to combine the two with OR, or am I looking at this wrong?
I am using the Java regex engine.
You may use an optional non-capturing group:
act=(\w+(?:\s+\w+\s+\w+)?)
        ^^^^^^^^^^^^^^^^^
See the regex demo
The ? matches 1 or 0 occurrences of the quantified subpattern. When it is applied to a grouping construct, the quantification is applied to the whole pattern sequence, so (?:\s+\w+\s+\w+)? matches 1 or 0 sequences of 1+ whitespaces, 1+ word chars, 1+ whitespaces and again 1+ word chars.
You may further subsegment the pattern if you need to capture 2-word substrings after act=.
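The pattern with the optional group can be tried against both sample lines in plain Java. A minimal sketch, with hypothetical class and method names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ActDemo {
    // Group 1: one word, optionally followed by exactly two more whitespace-separated words.
    private static final Pattern ACT = Pattern.compile("act=(\\w+(?:\\s+\\w+\\s+\\w+)?)");

    // Returns the value after act=, or null if the line has no act= entry.
    static String actValue(String line) {
        Matcher m = ACT.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(actValue("act=MATCHME"));      // prints MATCHME
        System.out.println(actValue("act=Match me too")); // prints Match me too
    }
}
```

Note that a two-word value such as "act=Match me" yields only "Match", because the optional group requires both extra words.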
Surely you know how to compose regular expressions by alternation.
This regular expression may help you
^[a-zA-Z ]*$

Java regular expressions for specific name\value format

I'm not familiar yet with java regular expressions. I want to validate a string that has the following format:
String INPUT = "[name1 value1];[name2 value2];[name3 value3];";
namei and valuei are Strings that may contain any characters except white-space.
I tried with this expression:
String REGEX = "([\\S*\\s\\S*];)*";
But when I call matches() I always get false, even for a valid String.
What's the best regular expression for it?
This does the trick:
(?:\[\w.*?\s\w.*?\];)*
If you want to only match three of these, replace the * at the end with {3}.
Explanation:
(?:: Start of non-capturing group
\[: Escapes the [ sign, which is a meta-character in regex. This allows it to be used for matching.
\w.*?: Matches one word character ([a-zA-Z0-9_]) followed by a lazy match of any characters. Lazy matching means it matches as few characters as possible, so in this case it stops as soon as the following \s can match.
\s: Matches one whitespace
\]: See \[
;: Matches one semicolon
): End of non-capturing group
*: Matches any number of what is contained in the preceding non-capturing group.
See this link for demonstration
You should escape the square brackets. Also, if your aim is to match only three, replace * with {3}. Written as a Java string literal:
(\\[\\S*\\s\\S*\\];){3}
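The original attempt fails because unescaped [ and ] form a character class, so [\S*\s\S*] matches a single character rather than a bracketed pair. Below is a sketch of a validator with the brackets escaped. It uses \S+ and a trailing + so that empty names and an empty input are rejected, which is an assumption about the requirements; the names NameValueDemo and isValid are illustrative.

```java
import java.util.regex.Pattern;

class NameValueDemo {
    // One or more entries of the form [name value]; with non-whitespace name and value.
    private static final Pattern FORMAT = Pattern.compile("(?:\\[\\S+\\s\\S+\\];)+");

    // matches() requires the whole input to fit the pattern.
    static boolean isValid(String input) {
        return FORMAT.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("[name1 value1];[name2 value2];[name3 value3];")); // prints true
        System.out.println(isValid("[name1value1];"));                                // prints false
        // The pattern from the question, for comparison:
        System.out.println("[name1 value1];".matches("([\\S*\\s\\S*];)*"));           // prints false
    }
}
```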

What is the responsibility of (.*) in the Java String?

What is the responsibility of (.*) in the third line and how it works?
String Str = new String("Welcome to Tutorialspoint.com");
System.out.print("Return Value :" );
System.out.println(Str.matches("(.*)Tutorials(.*)"));
.matches() tests whether the whole of Str matches the regex provided.
Regex, or Regular Expressions, are a way of parsing strings into groups. In the example provided, this matches any string which contains the word "Tutorials". (.*) simply means "a group of zero or more of any character".
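To see what the two groups actually capture, the same pattern can be run through a Matcher. This is a small sketch; the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupsDemo {
    // Returns the text captured before and after "Tutorials",
    // or null if the whole string does not match.
    static String[] aroundTutorials(String text) {
        Matcher m = Pattern.compile("(.*)Tutorials(.*)").matcher(text);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = aroundTutorials("Welcome to Tutorialspoint.com");
        System.out.println("[" + parts[0] + "]"); // prints [Welcome to ]
        System.out.println("[" + parts[1] + "]"); // prints [point.com]
    }
}
```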
This page is a good regex reference (for very basic syntax and examples).
Your expression matches any string that contains the word Tutorials, preceded and followed by any characters. .* means occurrence of any character any number of times, including zero times.
The . represents regular expression meta-character which means any character.
The * is a regular expression quantifier, which means 0 or more occurrences of the expression character it was associated with.
matches takes regular expression string as parameter and (.*) means capture any character zero or more times greedily
.* means a group of zero or more of any character
In Regex:
.
Wildcard: Matches any single character except \n
for example pattern a.e matches ave in nave and ate in water
*
Matches the previous element zero or more times
for example pattern \d*\.\d matches .0, 19.9, 219.9
There is no reason to put parentheses around the .*, nor is there a reason to instantiate a String if you've already got a literal String. But worse is the fact that the matches() method is out of place here.
What it does is greedily match any character from the start to the end of the String. Then it backtracks until it finds "Tutorials", after which it again matches any characters (except newlines).
It's better and clearer to use the find method. The find method simply finds the first "Tutorials" within the String, and you can remove the "(.*)" parts from the pattern.
As a one liner for convenience:
System.out.printf("Return value : %b%n", Pattern.compile("Tutorials").matcher("Welcome to Tutorialspoint.com").find());
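The difference between the two methods is easy to verify side by side: matches() must consume the entire string, while find() looks for the pattern anywhere in it. A minimal sketch with made-up helper names:

```java
import java.util.regex.Pattern;

class MatchesVersusFindDemo {
    // True only if the regex matches the entire text.
    static boolean wholeStringMatches(String text, String regex) {
        return text.matches(regex);
    }

    // True if the regex matches any part of the text.
    static boolean containsMatch(String text, String regex) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Welcome to Tutorialspoint.com";
        System.out.println(wholeStringMatches(text, "Tutorials"));         // prints false
        System.out.println(wholeStringMatches(text, "(.*)Tutorials(.*)")); // prints true
        System.out.println(containsMatch(text, "Tutorials"));              // prints true
    }
}
```

For a fixed word like this one, text.contains("Tutorials") gives the same answer without any regex.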
