Regex to match continuous pattern of integer then space - java

I'm asking the user for input through the Scanner in Java, and now I want to parse out their selections using a regular expression. In essence, I show them an enumerated list of items, and they type in the numbers for the items they want to select, separated by a space. Here is an example:
1 yorkshire terrier
2 staffordshire terrier
3 goldfish
4 basset hound
5 hippopotamus
Type the numbers that correspond to the words you wish to exclude: 3 5
The enumerated list of items can be just a few elements or several hundred. The current regex I'm using looks like this ^|\\.\\s+)\\d+\\s+, but I know it's wrong. I don't fully understand regular expressions yet, so if you can explain what it is doing that would be helpful too!

Pattern pattern = Pattern.compile("^([0-9]*\\s+)*[0-9]*$");
Explanation of the RegEx:
^ : beginning of input
[0-9] : only digits
'*' : any number of digits
\s : a space
'+' : at least one space
'()*' : any number of this digit space combination
$: end of input
This treats all of the following inputs as valid:
"1"
"123 22"
"123 23"
"123456 33 333 3333 "
"12321 44 452 23 "
etc.
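A minimal sketch of using this pattern to validate a line of input. The class name SelectionValidator and the method isValid are hypothetical names chosen for illustration. Note that every part of the pattern is optional, so it also accepts an empty line; a separate emptiness check may be wanted.

```java
import java.util.regex.Pattern;

class SelectionValidator {
    // Same pattern as above, with backslashes doubled for a Java string literal.
    private static final Pattern SELECTIONS =
            Pattern.compile("^([0-9]*\\s+)*[0-9]*$");

    // Returns true when the input consists only of digits and whitespace.
    static boolean isValid(String input) {
        return SELECTIONS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("123 22")); // true
        System.out.println(isValid("1 a"));    // false
        System.out.println(isValid(""));       // true: the pattern allows empty input
    }
}
```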

You want integers:
\d+
followed by any number of repetitions of a space and another integer:
\d+( \d+)*
Note that if you want to use a regex in a Java string you need to escape every \ as \\.

To "parse out" the integers, you don't necessarily want to match the input, but rather you want to split it on spaces (which uses regex):
String[] nums = input.trim().split("\\s+");
If you actually want int values:
List<Integer> selections = new ArrayList<>();
for (String num : input.trim().split("\\s+"))
    selections.add(Integer.parseInt(num));
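Validation and splitting can be combined. The sketch below is one possible arrangement, not the only one: it checks the trimmed line against \d+(\s+\d+)* first and only then splits and converts. SelectionParser and parseSelections are hypothetical names, and allowing several whitespace characters between numbers is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class SelectionParser {
    // Validates the line first, then splits on whitespace and converts each piece.
    static List<Integer> parseSelections(String input) {
        String trimmed = input.trim();
        if (!trimmed.matches("\\d+(\\s+\\d+)*")) {
            throw new IllegalArgumentException(
                    "Expected integers separated by spaces: " + input);
        }
        List<Integer> selections = new ArrayList<>();
        for (String num : trimmed.split("\\s+")) {
            selections.add(Integer.parseInt(num));
        }
        return selections;
    }

    public static void main(String[] args) {
        System.out.println(parseSelections("3 5"));       // [3, 5]
        System.out.println(parseSelections("  1   12 ")); // [1, 12]
    }
}
```

A number too large for an int would still fail in Integer.parseInt with a NumberFormatException, which is itself an IllegalArgumentException.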

If you want to ensure that your string contains only numbers and spaces (with a variable number of spaces and trailing/leading spaces allowed) and extract the numbers at the same time, you can use the \G anchor to find consecutive matches.
String source = "1 3 5 8";
List<String> result = new ArrayList<String>();
Pattern p = Pattern.compile("\\G *(\\d++) *(?=[\\d ]*$)");
Matcher m = p.matcher(source);
while (m.find()) {
    result.add(m.group(1));
}
for (int i=0;i<result.size();i++) {
    System.out.println(result.get(i));
}
Note: at the beginning of a global search, \G matches the start of the string.
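A small sketch of how the \G pattern behaves on valid and invalid input, wrapped in a hypothetical helper named extract. Because the lookahead requires that only digits and spaces remain up to the end of the string, a single stray character makes the very first match fail, and \G prevents the search from restarting further along.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsecutiveNumbers {
    private static final Pattern NUMBER =
            Pattern.compile("\\G *(\\d++) *(?=[\\d ]*$)");

    // Collects the numbers only if the whole string is digits and spaces.
    static List<String> extract(String source) {
        List<String> result = new ArrayList<>();
        Matcher m = NUMBER.matcher(source);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("1 3 5 8")); // [1, 3, 5, 8]
        System.out.println(extract("1 3 x 8")); // []
    }
}
```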

Related

Split String at different lengths in Java

I want to split a string after a certain length.
Let's say we have a string of "message"
123456789
Split like this :
"12" "34" "567" "89"
I thought of splitting them into 2 first using
"(?<=\\G.{2})"
regexp, then joining the last two pieces and splitting again into 3, but is there any way to do it in a single go using a regexp? Please help me out.
Use ^(.{2})(.{2})(.{3})(.{2}).* (See it in action in regex101) to group the String to the specified length and grab the groups as separate Strings
String input = "123456789";
List<String> output = new ArrayList<>();
Pattern pattern = Pattern.compile("^(.{2})(.{2})(.{3})(.{2}).*");
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
    for (int i = 1; i <= matcher.groupCount(); i++) {
        output.add(matcher.group(i));
    }
}
System.out.println(output);
NOTE: Group capturing starts from 1 as the group 0 matches the whole String
And a Magnificent Sorcery from @YCF_L in the comments:
String pattern = "^(.{2})(.{2})(.{3})(.{2}).*";
String[] vals = "123456789".replaceAll(pattern, "$1-$2-$3-$4").split("-");
The magic here is that replaceAll() lets you refer to the captured groups in the replacement string. Use $n (where n is a digit) to refer to captured subsequences. See this stackoverflow question for better explanation.
NOTE: this assumes that no input string contains a - character. If one might, pick any other character that will not appear in any of your input strings and use it as the delimiter instead.
Test this regex in regex101 with the test string 123456789.
^(\d{2})(\d{2})(\d{3})(\d{2})$
output :
Match 1
Full match 0-9 `123456789`
Group 1. 0-2 `12`
Group 2. 2-4 `34`
Group 3. 4-7 `567`
Group 4. 7-9 `89`
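If the piece lengths vary, the same grouping regex can be built from a list of lengths. This is only a sketch: FixedLengthSplitter and splitByLengths are hypothetical names, and returning an empty list when the input is too short is an assumption about the desired behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedLengthSplitter {
    // Builds ^(.{a})(.{b})....* from the lengths and returns the captured groups.
    static List<String> splitByLengths(String input, int... lengths) {
        StringBuilder regex = new StringBuilder("^");
        for (int len : lengths) {
            regex.append("(.{").append(len).append("})");
        }
        regex.append(".*");

        List<String> output = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex.toString()).matcher(input);
        if (matcher.matches()) {
            // Group 0 is the whole match, so the pieces start at group 1.
            for (int i = 1; i <= matcher.groupCount(); i++) {
                output.add(matcher.group(i));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(splitByLengths("123456789", 2, 2, 3, 2)); // [12, 34, 567, 89]
    }
}
```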

Parse numbers and parentheses from a String?

Given a String containing numbers (possibly with decimals), parentheses and any amount of whitespace, I need to iterate through the String and handle each number and parenthesis.
The below works for the String "1 ( 2 3 ) 4", but does not work if I remove the whitespace between the parentheses and the numbers: "1 (2 3) 4".
Scanner scanner = new Scanner(expression);
while (scanner.hasNext()) {
    String token = scanner.next();
    // handle token ...
    System.out.println(token);
}
Scanner uses whitespace as its default delimiter. You can change this to use a different Regex pattern, for example:
(?:\\s+)|(?<=[()])|(?=[()])
This pattern will set the delimiter to the left bracket or right bracket or one or more whitespace characters. However, it will also keep the left and right brackets (as I think you want to include those in your parsing?) but not the whitespace.
Here is an example of using this:
String test = "123(3 4)56(7)";
Scanner scanner = new Scanner(test);
scanner.useDelimiter("(?:\\s+)|(?<=[()])|(?=[()])");
while(scanner.hasNext()) {
    System.out.println(scanner.next());
}
Output:
123
(
3
4
)
56
(
7
)
Detailed Regex Explanation:
(?:\\s+)|(?<=[()])|(?=[()])
1st Alternative: (?:\\s+)
(?:\\s+) Non-capturing group
\\s+ match any white space character [\r\n\t\f ]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
2nd Alternative: (?<=[()])
(?<=[()]) Positive Lookbehind - Assert that the regex below can be matched
[()] match a single character present in the list below
() a single character in the list () literally
3rd Alternative: (?=[()])
(?=[()]) Positive Lookahead - Assert that the regex below can be matched
[()] match a single character present in the list below
() a single character in the list () literally
Scanner's .next() method uses whitespace as its delimiter. Luckily, we can change the delimiter!
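For example, if you need the scanner to treat both whitespace and parentheses as delimiters, you could run this code immediately after constructing your Scanner:
scanner.useDelimiter("[\\s()]+");
Note that the parentheses are then consumed as delimiters and are not returned as tokens.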
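A different technique from the Scanner delimiters above is to describe the tokens themselves and pull them out with Matcher.find(). The sketch below assumes a token is either a number with an optional decimal part or a single parenthesis; ExpressionTokenizer and tokenize are hypothetical names. Any other character is silently skipped, which may or may not be what you want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExpressionTokenizer {
    // A number with an optional decimal part, or a single parenthesis.
    private static final Pattern TOKEN =
            Pattern.compile("\\d+(?:\\.\\d+)?|[()]");

    // Returns the numbers and parentheses in the order they appear.
    static List<String> tokenize(String expression) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(expression);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1 (2.5 3) 4")); // [1, (, 2.5, 3, ), 4]
    }
}
```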

split on integer values but not floating point values

I have a java program where I need to split on integer values but not floating point values
ie. "1/\\2" should produce: [1,/\\,2]
but "1.0/\\2.0" should produce: [1.0,/\\,2.0]
does anybody have any ideas?
or could anybody point me in the direction of how to split on the specific strings "\\/" and "/\\" ?
UPDATE: sorry! one more case! for the string "100 /\ 3.4e+45" I need to split it into:
[100,/\,3.4,e,+,45]
my current regex is (kind of really ugly):
line.split("\\s+|(?<=[-+])|(?=[-+])|(?:(?<=[0-9])(?![0-9.]|$))|(?:(?<![0-9.]|^)(?=[0-9]))|(?<=[-+()])|(?=[-+()])|(?<=e)|(?=e)");
and for the string: "100 /\ 3.4e+45" is giving me:
[100,/\,3.4,+,45]
This regex should do it:
(?:(?<=[0-9])(?![0-9.]|$))|(?:(?<![0-9.]|^)(?=[0-9]))
It's two checks, basically matching:
A digit not followed by a digit, a decimal point, or the end of text.
A digit not preceded by a digit, a decimal point, or the start of text.
It will match the empty space after/before the digit, so you can use this regex in split().
See regex101 for demo.
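A quick sketch of that regex used with split() on the two inputs from the question. NumberBoundarySplit is a hypothetical class name; the regex is the one given above, with backslashes unchanged because it contains none.

```java
import java.util.Arrays;

class NumberBoundarySplit {
    // Zero-width positions just after or just before a number.
    private static final String BOUNDARY =
            "(?:(?<=[0-9])(?![0-9.]|$))|(?:(?<![0-9.]|^)(?=[0-9]))";

    static String[] split(String s) {
        return s.split(BOUNDARY);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("1/\\2")));     // [1, /\, 2]
        System.out.println(Arrays.toString(split("1.0/\\2.0"))); // [1.0, /\, 2.0]
    }
}
```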
Follow-up
could anybody point me in the direction of how to split on the specific strings "\/" and "/\"
If you want to split before a specific pattern, use a positive lookahead: (?=xxx). If you want to split after a specific pattern, use a positive lookbehind: (?<=xxx). To do either, separate by |:
(?<=xxx)|(?=xxx)
where xxx is the text \/ or /\, i.e. the regex \\/|/\\, and doubling for Java string literal:
"(?<=\\\\/|/\\\\)|(?=\\\\/|/\\\\)"
See regex101 for demo.
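A short sketch of the lookbehind/lookahead split around the two operators, using the Java string literal shown above. OperatorSplit is a hypothetical class name.

```java
import java.util.Arrays;

class OperatorSplit {
    // Split just after and just before either two-character operator.
    private static final String AROUND_OPERATOR =
            "(?<=\\\\/|/\\\\)|(?=\\\\/|/\\\\)";

    static String[] split(String s) {
        return s.split(AROUND_OPERATOR);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("1/\\2")));     // [1, /\, 2]
        System.out.println(Arrays.toString(split("1.0\\/2.0"))); // [1.0, \/, 2.0]
    }
}
```

Whitespace around the operator stays attached to the neighbouring pieces, so trim them or add \s+ as a further alternative if the input contains spaces.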
You could try something like this:
String regex = "\\d+(\\.\\d+)?", str = "1//2"; // \\. matches a literal decimal point
Matcher m = Pattern.compile(regex).matcher(str);
ArrayList<String> list = new ArrayList<String>();
int index = 0;
for(index = 0 ; m.find() ; index = m.end()) {
    // text between the previous number and this one
    if(index != m.start()) list.add(str.substring(index, m.start()));
    list.add(str.substring(m.start(), m.end()));
}
// trailing text after the last number, if any
if(index < str.length()) list.add(str.substring(index));
The idea is to find the numbers using a regex and Matcher, and also add the strings in between.

How do I count repetitive/continuous appearance of a character in String(When I don't know index of start/end)?

So if I have 22332, I want to replace that with BEA, as on a mobile keypad. I want to see how many times a digit appears so that I can count A--2, B--22, C--222, D--3, E--33, F--333, etc. (and a 0 is a pause). I want to write a decoder that takes in a digit string and replaces digit occurrences with letters. Example: 44335557075557777 will be decoded as HELP PLS.
This is the key portion of the code:
public void printMessages() throws Exception {
    File msgFile = new File("messages.txt");
    Scanner input = new Scanner(msgFile);
    while(input.hasNext()) {
        String x = input.next();
        String y = input.nextLine();
        System.out.println(x+":"+y);
    }
}
It takes the input from a file as a digit String. Then the Scanner prints the digits. I tried to split the string into digits, but I don't know how to evaluate the kind of repetition described in the question.
for(String x : b.split(""))
System.out.print(x);
gives: 44335557075557777 (input from the file).
I don't know how I can find each run of repeated digits and see how they form the pattern used on a mobile keypad. If I use a for loop, then I have to cycle through the whole string and use lots of if statements. There must be some other way.
Another suggestion: use a regex to break up the encoded string.
A look-around combined with a back-reference makes it easy to split the string at positions where the preceding and following characters are different.
e.g.
String line = "44335557075557777";
String[] tokens = line.split("(?<=(.))(?!\\1)");
// tokens will contain ["44", "33", "555", "7", "0", "7", "555", "7777"]
Then it should be trivial for you to map each string to its corresponding character, either with a Map or even naively with a bunch of if-elses.
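One way the mapping step could look, as a sketch rather than a definitive decoder. It uses a lookup array indexed by key instead of a Map, treats a run of zeros as a single space (an assumption based on the HELP PLS example), and wraps around when a key is pressed more times than it has letters (also an assumption). KeypadDecoder and decode are hypothetical names.

```java
class KeypadDecoder {
    // Letters printed on keys 0-9 of a phone keypad; keys 0 and 1 carry no letters.
    private static final String[] KEYS =
            {"", "", "ABC", "DEF", "GHI", "JKL", "MNO", "PQRS", "TUV", "WXYZ"};

    static String decode(String digits) {
        StringBuilder out = new StringBuilder();
        // Split wherever the characters before and after the position differ.
        for (String run : digits.split("(?<=(.))(?!\\1)")) {
            if (run.isEmpty()) continue;
            int key = run.charAt(0) - '0';
            if (key == 0) {            // assumption: a pause becomes one space
                out.append(' ');
                continue;
            }
            if (key < 2 || key > 9) continue; // skip key 1 and non-digit characters
            String letters = KEYS[key];
            // The run length selects the letter: 1 press = first letter, and so on.
            out.append(letters.charAt((run.length() - 1) % letters.length()));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("44335557075557777")); // HELP PLS
        System.out.println(decode("22332"));             // BEA
    }
}
```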
Edit: Some background on the regex
(?<=(.))(?!\1)
(?<=   ) : Look-behind group, which means finding
           something (a zero-length pattern in this example)
           preceded by this group of pattern
   ( )   : capture group #1
    .    : any char
         : zero-length pattern between the look-behind and
           look-ahead groups
(?!   )  : Negative look-ahead group, which means finding
           a pattern (zero-length in this example) NOT followed
           by this group of pattern
   \1    : back-reference, whatever was matched by
           capture group #1
So it means: find any zero-length position for which the characters before and after that position are different, and use such positions to do the splitting.

Regex to split a string into different parts (using Java)

I'm looking for a regex to split the following strings
red 12478
blue 25 12375
blue 25, 12364
This should give
Keywords red, ID 12478
Keywords blue 25, ID 12375
Keywords blue IDs 25, 12364
Each line has 2 parts, a set of keywords and a set of IDs. Keywords are separated by spaces and IDs are separated by commas.
I came up with the following regex: \s*((\S+\s+)+?)([\d\s,]+)
However, it fails for the second one. I've been trying to work with lookahead, but can't quite work it out
I am trying to split the string into its component parts (keywords and IDs)
The format of each line is one or more space separated keywords followed by one or more comma separated IDs. IDs are numeric only and keywords do not contain commas.
I'm using Java to do this.
I found a two-line solution using replaceAll and split:
String pattern = "(\\S+(?<!,)\\s+(\\d+\\s+)*)";
String[] keywords = theString.replaceAll(pattern+".*","$1").split(" ");
String[] ids = theString.split(pattern)[1].split(",\\s?");
I assumed that the comma will always be immediately after the ID for each ID (this can be enforced by removing spaces adjacent to a comma), and that there is no trailing space.
I also assumed that the first keyword is a sequence of non-whitespace chars (without trailing comma) \\S+(?<!,)\\s+, and the rest of the keywords (if any) are digits (\\d+\\s+)*. I made this assumption based on your regex attempt.
The regex here is very simple, just take (greedily) any sequence of valid keywords that is followed by a space (or whitespaces). The longest will be the list of keywords, the rest will be the IDs.
Full Code:
public static void main(String[] args){
    String pattern = "(\\S+(?<!,)\\s+(\\d+\\s+)*)";
    Scanner sc = new Scanner(System.in);
    while(true){
        String theString = sc.nextLine();
        String[] keywords = theString.replaceAll(pattern+".*","$1").split(" ");
        String[] ids = theString.split(pattern)[1].split(",\\s?");
        System.out.println("Keywords:");
        for(String keyword: keywords){
            System.out.println("\t"+keyword);
        }
        System.out.println("IDs:");
        for(String id: ids){
            System.out.println("\t"+id);
        }
        System.out.println();
    }
}
Sample run:
red 124
Keywords:
red
IDs:
124
red 25 124
Keywords:
red
25
IDs:
124
red 25, 124
Keywords:
red
IDs:
25
124
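For comparison, here is a sketch of a different approach that uses a single anchored match instead of replaceAll and split. It assumes, as the question states, that the IDs are the trailing comma-separated run of numbers and that keywords contain no commas; everything before that run is treated as keywords. KeywordIdSplitter and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordIdSplitter {
    // Group 1: keywords (as short as possible). Group 2: trailing comma-separated IDs.
    private static final Pattern LINE =
            Pattern.compile("\\s*(.+?)\\s+(\\d+(?:\\s*,\\s*\\d+)*)\\s*");

    private static Matcher match(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized line: " + line);
        }
        return m;
    }

    static List<String> keywords(String line) {
        return Arrays.asList(match(line).group(1).split("\\s+"));
    }

    static List<String> ids(String line) {
        return Arrays.asList(match(line).group(2).split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        for (String line : new String[] {"red 12478", "blue 25 12375", "blue 25, 12364"}) {
            System.out.println("Keywords " + keywords(line) + " IDs " + ids(line));
        }
    }
}
```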
I came up with:
(red|blue)( \d+(?!$)(?:, \d+)*)?( \d+)?$
as illustrated in http://rubular.com/r/y52XVeHcxY which seems to pass your tests. It's a straightforward matter to insert your keywords between the match substrings.
Ok since the OP didn't specify a target language, I am willing to tilt at this windmill over lunch as a brain teaser and provide a C#/.Net Regex replace with match evaluator which gives the required output:
Keywords red, ID 12478
Keywords blue 25 ID 12375
Keywords blue IDs 25, 12364
Note there is no error checking. This is a fine example of using a lambda expression as the match evaluator, which builds the replacement according to the rules. Also of note: because the sample data is so small, it doesn't handle larger numbers of IDs or keywords, which the real data may contain.
string data = @"red 12478
blue 25 12375
blue 25, 12364";
var pattern = @"(?xmn) # x=IgnorePatternWhiteSpace m=multiline n=explicit capture
^
(?<Keyword>[^\s]+) # Match Keyword Color
[\s,]+
(
(?<Numbers>\d+)
(?<HasComma>,)? # If there is a comma that signifies IDs
[,\s]*
)+ # 1 or more values
$";
Console.WriteLine (Regex.Replace(data, pattern, (mtch) =>
{
    StringBuilder sb = new StringBuilder();
    sb.AppendFormat("Keywords {0}", mtch.Groups["Keyword"].Value);
    var values = mtch.Groups["Numbers"]
                     .Captures
                     .OfType<Capture>()
                     .Select (cp => cp.Value)
                     .ToList();
    if (mtch.Groups["HasComma"].Success)
    {
        sb.AppendFormat(" IDs {0}", string.Join(", ", values));
    }
    else
    {
        if (values.Count() > 1)
            sb.AppendFormat(" {0} ID {1}", values[0], values[1] );
        else
            sb.AppendFormat(", ID {0}", values[0]);
    }
    return sb.ToString();
}));
