Fix Regular Expression to allow optional fields - java

A data-line looks like this:
$POSL,VEL,SPL,,,4.1,0.0,4.0*12
The 7th field (4.1) is extracted to the named field SPEED using this Java Regexp.
\\$POSL,VEL,SPL,,,(?<SPEED>\\d+.\\d+),.*
The data format has changed slightly: fields 4, 5 and 6 may now contain data:
$POSL,VEL,SPL,a,b,c,4.0,a,b,c,d
But the Regexp now returns zero. Note: fields 4, 5 and 6 may contain letters or numbers, but they will not contain quoted strings (so we don't need to worry about quoted commas).
Can someone offer a fix please?

You could optionally repeat chars a-zA-Z and digits using ,[A-Za-z0-9]*
As there is 1 comma more in the second string, you can make that part optional.
If you are not interested in the last part, but only in the capturing group, you can omit .* at the end. If the value can also occur at the end of the string, you can end the pattern with an alternation (?:,|$)
Remember to escape the dot in this part, \\d+\\.\\d+, because an unescaped . matches any character.
\$POSL,VEL,SPL,[A-Za-z0-9]*,[A-Za-z0-9]*,(?:[A-Za-z0-9]*,)?(?<SPEED>\d+\.\d+)(?:,|$)
In Java, with the backslashes doubled:
String regex = "\\$POSL,VEL,SPL,[A-Za-z0-9]*,[A-Za-z0-9]*,(?:[A-Za-z0-9]*,)?(?<SPEED>\\d+\\.\\d+)(?:,|$)";
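A quick way to check the pattern above against both sample lines from the question is a small harness like the following. It is only a sketch: the class name SpeedExtractor and the helper extractSpeed are hypothetical names chosen for illustration, and returning an empty Optional for a non-matching line is an assumption, not something the question asks for.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpeedExtractor {
    // Pattern from the answer above: fields 4 and 5 may be empty, a sixth field is optional.
    private static final Pattern SPEED_LINE = Pattern.compile(
            "\\$POSL,VEL,SPL,[A-Za-z0-9]*,[A-Za-z0-9]*,(?:[A-Za-z0-9]*,)?(?<SPEED>\\d+\\.\\d+)(?:,|$)");

    // Hypothetical helper: returns the SPEED group as text, or empty if the line does not match.
    static Optional<String> extractSpeed(String line) {
        Matcher m = SPEED_LINE.matcher(line);
        return m.find() ? Optional.of(m.group("SPEED")) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractSpeed("$POSL,VEL,SPL,,,4.1,0.0,4.0*12"));  // old format
        System.out.println(extractSpeed("$POSL,VEL,SPL,a,b,c,4.0,a,b,c,d")); // new format
        System.out.println(extractSpeed("$POSL,VEL,SPL,a,b,c,fast"));        // no decimal speed
    }
}
```

Reading the group by name with group("SPEED") keeps the code independent of how many other groups the pattern contains.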

You can use \w* for fields 4, 5 and 6: \w matches any letter, digit or underscore, and * lets the field be empty.
\\$POSL,VEL,SPL,\\w*,\\w*,\\w*,(?<SPEED>\\d+\\.\\d+),.*
Note that the example line and the regex in your post each seem to be missing a comma if the number is meant to be the seventh field.

This assumes that one comma was missing from the first input line.
package arraysAndStrings;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexGroupCapture {
    public static void main(String[] args) {
        String inputArr[] = { "$POSL,VEL,SPL,,,,4.1,0.0,4.0*12",
                "$POSL,VEL,SPL,a,b,c,4.0,a,b,c,d" };
        for (String input : inputArr) {
            System.out.println(extractSpeed(input));
        }
    }
    private static float extractSpeed(String input) {
        float speed = 0;
        try {
            // \\. matches a literal dot; an unescaped . would also accept a comma there
            String regex = "\\$POSL,VEL,SPL,.*?,.*?,.*?,(?<SPEED>\\d+\\.\\d+),.*";
            Pattern pattern = Pattern.compile(regex);
            Matcher matcher = pattern.matcher(input);
            if (matcher.find()) {
                speed = Float.parseFloat(matcher.group(1));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return speed;
    }
}
Output
=====
4.1
4.0

Related

Regex for extracting all heading digits from a string

I am trying to extract all heading digits from a string using Java regex without writing additional code and I could not find something to work:
"12345XYZ6789ABC" should give me "12345".
"X12345XYZ6789ABC" should give me nothing
public final class NumberExtractor {
    private static final Pattern DIGITS = Pattern.compile("what should be my regex here?");
    public static Optional<Long> headNumber(String token) {
        var matcher = DIGITS.matcher(token);
        return matcher.find() ? Optional.of(Long.valueOf(matcher.group())) : Optional.empty();
    }
}
Use a word boundary \b:
\b\d+
If you strictly want to match only digits at the start of the input, and not from each word (same thing when the input contains only one word), use ^:
^\d+
Pattern DIGITS = Pattern.compile("\\b\\d+"); // leading digits of all words
Pattern DIGITS = Pattern.compile("^\\d+"); // leading digits of input
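I'd think something like "^[0-9]+" would work (with + rather than *, so that an input with no leading digit gives no match instead of an empty one). In Java, \d matches the same characters as [0-9] by default; it matches other Unicode digits as well only if the pattern is compiled with the UNICODE_CHARACTER_CLASS flag.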
Edit: removed errant . from the string.
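For reference, here is the question's NumberExtractor with the ^\d+ suggestion filled in. This is a minimal sketch: the main method is added only for demonstration, and the explicit Matcher type replaces var so the snippet also compiles on older Java versions.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberExtractor {
    // ^ anchors at the start of the input; \d+ requires at least one digit.
    private static final Pattern DIGITS = Pattern.compile("^\\d+");

    static Optional<Long> headNumber(String token) {
        Matcher matcher = DIGITS.matcher(token);
        // Long.valueOf throws NumberFormatException if the digit run is too long for a long.
        return matcher.find() ? Optional.of(Long.valueOf(matcher.group())) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(headNumber("12345XYZ6789ABC"));  // prints Optional[12345]
        System.out.println(headNumber("X12345XYZ6789ABC")); // prints Optional.empty
    }
}
```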

How to use Pattern, Matcher in Java regex API to remove a specific line

I have a complicated string to split: I need to remove the comments and spaces, keep every number as a single token, and break every other string into single characters. If a - sign is at the start and is followed by a number, treat it as part of a negative number rather than as an operator.
A comment has the form ?<space>comments<space>? (where comments is a placeholder).
Input :
-122+2 ? comments ?sa b
-122+2 ? blabla ?sa b
output :
["-122","+","2","?","s","a","b"]
(all string into character and no space, no comments)
Replace the unwanted part, matched by \s*\?\s*\w+\s*(?=\?), with "". You can then chain another String#replaceAll call to remove any remaining whitespace. Here (?=\?) is a positive lookahead: \s*\?\s*\w+\s* matches only when it is followed by a ?, and that ? is not consumed. \s matches a whitespace character and \w a word character. Because \w+ matches a single run of word characters, this assumes the comment is one word, as in the examples.
Then you can use the regex ^-\d+|\d+|\D, which matches either a negative integer at the beginning (^-\d+), a run of digits (\d+), or a single non-digit (\D).
Demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        String str = "-122+2 ? comments ?sa b";
        str = str.replaceAll("\\s*\\?\\s*\\w+\\s*(?=\\?)", "").replaceAll("\\s+", "");
        Pattern pattern = Pattern.compile("^-\\d+|\\d+|\\D");
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}
Output:
-122
+
2
?
s
a
b

regex for letters or numbers in brackets

I am using Java to process text using regular expressions. I am using the following regular expression
^[\([0-9a-zA-Z]+\)\s]+
to match one or more letters or numbers in parentheses one or more times. For instance, I like to match
(aaa) (bb) (11) (AA) (iv)
or
(111) (aaaa) (i) (V)
I tested this regular expression on http://java-regex-tester.appspot.com/ and it is working. But when I use it in my code, the code does not compile. Here is my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Tester {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("^[\([0-9a-zA-Z]+\)\s]+");
        String[] words = pattern.split("(a) (1) (c) (xii) (A) (12) (ii)");
        String w = pattern.
        for(String s:words){
            System.out.println(s);
        }
    }
}
I tried to use \\ instead of \, but then the regex gave different results from what I expected: it matches only one group, like (aaa), not multiple groups like (aaa) (111) (ii).
Two questions:
How can I fix this regex and be able to match multiple groups?
How can I get the individual matches separately (like (aaa) alone and then (111) and so on). I tried pattern.split but did not work for me.
Firstly, you need to escape every backslash inside a Java string literal with another backslash; the regex engine then sees a single backslash. (For example, write a word character as \\w inside the quotation marks.)
Secondly, you have to finish the line that reads:
String w = pattern.
That line explains why it doesn't compile.
Here is my final solution. It matches the individual groups of letters/numbers in brackets that appear at the beginning of a line and ignores the rest:
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Tester {
    static ArrayList<String> listOfEnums;
    public static void main(String[] args) {
        listOfEnums = new ArrayList<String>();
        // No ^ inside the class: there it would only allow a literal caret between the brackets.
        Pattern pattern = Pattern.compile("^\\([0-9a-zA-Z]+\\)");
        String p = "(a) (1) (c) (xii) (A) (12) (ii) and the good news (1)";
        Matcher matcher = pattern.matcher(p);
        boolean isMatch = matcher.find();
        int index = 0;
        //once you find a match, remove it and store it in the arrayList.
        while (isMatch) {
            String s = matcher.group();
            System.out.println(s);
            //Store it in an array
            listOfEnums.add(s);
            //Remove it from the beginning of the string.
            p = p.substring(listOfEnums.get(index).length(), p.length()).trim();
            matcher = pattern.matcher(p);
            isMatch = matcher.find();
            index++;
        }
    }
}
1) Your regex is incorrect. You want to match individual groups of letters / numbers in brackets, but the current regex matches only a single string of one or more such groups. I.e. it will match
(abc) (def) (123)
as a single group rather than three separate groups.
A better regex, which matches only up to the closing bracket, would be
\([0-9a-zA-Z]+\)
(Inside a character class, ^ negates only when it is the first character. Adding ^\) after a-zA-Z would just allow a literal ^ and ) between the brackets, so it is left out here.)
2) In a Java string literal, every backslash has to be escaped with another backslash.
3) The split() method will not do what you want. It finds all matches in your string, throws them away, and returns an array of what is left over. Use a Matcher instead:
Pattern pattern = Pattern.compile("\\([0-9a-zA-Z]+\\)");
Matcher matcher = pattern.matcher("(a) (1) (c) (xii) (A) (12) (ii)");
while (matcher.find()) {
    System.out.println(matcher.group());
}
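To make the difference between split() and find() concrete, the following sketch runs both on the same input. The class name SplitVersusFind and the helper findAll are hypothetical, and the pattern uses the plain [0-9a-zA-Z] class as a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitVersusFind {
    // One parenthesised run of letters or digits, e.g. (xii) or (12).
    static final Pattern GROUP = Pattern.compile("\\([0-9a-zA-Z]+\\)");

    // Collects every match in order of appearance.
    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = GROUP.matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "(a) (1) (xii)";
        // split() returns the text between the matches, which here is only separators.
        System.out.println(Arrays.toString(GROUP.split(input)));
        // find() returns the matches themselves: [(a), (1), (xii)]
        System.out.println(findAll(input));
    }
}
```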

Java: Whitespace in HTML not recognized with Regex pattern

Code:
static short state = 0;
static int td_number = 0;
public static void main(String[] args) {
    final Pattern p = Pattern.compile("^[\\s]*?\\d+\\.\\d+[\\s]*?");
    final short TD_ENTRY = 0;
    final short NO_ENTRY = 1;
    HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
        public void handleText(char[] data, int pos) {
            switch (state) {
            case NO_ENTRY:
                break;
            case TD_ENTRY: {
                // We are in the right table column
                // Create string from char array
                String s = new String(data);
                Matcher m = p.matcher(s);
                boolean b = m.matches();
                // Check if data information has correct format (0.0)
                if (b) {
                }
            }
                break;
            default:
                break;
            }
            state = NO_ENTRY;
        }
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet set, int pos) {
            if (tag == HTML.Tag.TD) {
                //[...]
            }
        }
    };
    Reader reader = new StringReader(html);
    try {
        new ParserDelegator().parse(reader, callback, false);
    } catch (IOException e) {
    }
}
I am trying to parse HTML with regular expressions. The program reads the content of td tags within an HTML table. The content of a table cell should fit the pattern defined in Pattern p.
The main problem is that the regex does not match cell content like " 0.1".
But if I define the String s manually in the code with the value " 0.1", the pattern matches.
Furthermore, if I copy the content of char[] data in debug mode and define s with this copied content, the pattern still does not match, although it looks the same as the manually defined value above.
Is it possible to find out which whitespace characters are really read?
It seems that the whitespace is not always a plain whitespace character and therefore does not match the regex class [\s]. Is this possible?
EDIT:
Thanks for the answers. It really was a whitespace character (\xA0) that was not recognized by the \s regex class.
To everyone who downvoted my question (really frustrating): you simply misunderstood me. Maybe the problem was the sentence "I want to parse HTML with regex", but in fact I simply have content from an HTML table cell with unknown whitespace characters ;-).
I think I would have had the same problem with a library like jsoup.
In Java regexes, the non-breaking space character (NBSP, U+00A0) is traditionally not treated as whitespace for the purpose of matching \s. If that's what's causing your problem, you just need to add it to your existing whitespace class:
"^[\\s\\xA0]*\\d+\\.\\d+[\\s\\xA0]*$"
There are other Unicode whitespace characters that aren't matched by \s, but none of them are anywhere near as common as the NBSP.
Alternatively, if you're running Java 7+ you can specify UNICODE_CHARACTER_CLASS mode and go on using \s.
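The sketch below illustrates both points: it lists the code points of a cell value, which is one way to find out which whitespace characters were really read, and it compares the default \s with UNICODE_CHARACTER_CLASS mode. The class name NbspWhitespaceDemo and the helper codePoints are hypothetical, and the cell value is an assumed example rather than data from the question.

```java
import java.util.regex.Pattern;

class NbspWhitespaceDemo {
    private static final String NUMBER = "^\\s*\\d+\\.\\d+\\s*$";
    // Default mode: \s covers only space, tab, newline, vertical tab, form feed, carriage return.
    static final Pattern DEFAULT_WS = Pattern.compile(NUMBER);
    // Unicode mode (Java 7+): \s follows the Unicode White_Space property, which includes NBSP.
    static final Pattern UNICODE_WS = Pattern.compile(NUMBER, Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical debugging helper: lists each code point in U+XXXX form.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String cell = "\u00A00.1"; // a non-breaking space followed by 0.1
        System.out.println(codePoints(cell));                   // U+00A0 U+0030 U+002E U+0031
        System.out.println(DEFAULT_WS.matcher(cell).matches()); // false
        System.out.println(UNICODE_WS.matcher(cell).matches()); // true
    }
}
```

Note that UNICODE_CHARACTER_CLASS also widens \d and \w to their Unicode definitions, so the explicit [\s\xA0] class is the narrower change.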
Your code snippet is too long, but as far as I understand you just need pattern to match something like 0.0, 10.52 etc, i.e. floating point numbers? Use pattern \\d+\\.\\d+.
\d+ means 1..n digits
\. means dot. A single dot . in regex means "any character"
Here is the usage example:
String str = "123.456";
Pattern p = Pattern.compile("\\d+\\.\\d+");
Matcher m = p.matcher(str);
if (m.matches()) {
    // do something.
}
BTW, note that matches() succeeds only if the pattern matches the entire input. If you want to match part of the input, use find() instead. I personally always use find() and put the start and end markers ^ and $ into the regex itself when needed.
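A small sketch of that difference, using a hypothetical class name and an assumed sample input:

```java
import java.util.regex.Pattern;

class MatchesVersusFind {
    static final Pattern NUMBER = Pattern.compile("\\d+\\.\\d+");
    static final Pattern ANCHORED_NUMBER = Pattern.compile("^\\d+\\.\\d+$");

    public static void main(String[] args) {
        String padded = "x 10.52 y";
        // matches() needs the whole input to be a number.
        System.out.println(NUMBER.matcher(padded).matches());         // false
        // find() is satisfied by a number anywhere in the input.
        System.out.println(NUMBER.matcher(padded).find());            // true
        // With ^ and $ in the regex, find() is as strict as matches() on single-line input.
        System.out.println(ANCHORED_NUMBER.matcher(padded).find());   // false
        System.out.println(ANCHORED_NUMBER.matcher("10.52").find());  // true
    }
}
```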

Parsing CSV input with a RegEx in java

I know, now I have two problems. But I'm having fun!
I started with this advice not to try and split, but instead to match on what is an acceptable field, and expanded from there to this expression.
final Pattern pattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");
The expression looks like this without the annoying escaped quotes:
"([^"]*)"|(?<=,|^)([^,]*)(?=,|$)
This is working well for me - either it matches on "two quotes and whatever is between them", or "something between the start of the line or a comma and the end of the line or a comma". Iterating through the matches gets me all the fields, even if they are empty. For instance,
the quick, "brown, fox jumps", over, "the",,"lazy dog"
breaks down into
the quick
"brown, fox jumps"
over
"the"
"lazy dog"
Great! Now I want to drop the quotes, so I added the lookahead and lookbehind non-capturing groups like I was doing for the commas.
final Pattern pattern = Pattern.compile("(?<=\")([^\"]*)(?=\")|(?<=,|^)([^,]*)(?=,|$)");
again the expression is:
(?<=")([^"]*)(?=")|(?<=,|^)([^,]*)(?=,|$)
Instead of the desired result
the quick
brown, fox jumps
over
the
lazy dog
now I get this breakdown:
the quick
"brown
fox jumps"
,over,
"the"
,,
"lazy dog"
What am I missing?
Operator precedence. Basically there is none. It's all left to right. So the or (|) is applying to the closing quote lookahead and the comma lookahead
Try:
(?:(?<=")([^"]*)(?="))|(?<=,|^)([^,]*)(?=,|$)
(?:^|,)\s*(?:(?:(?=")"([^"].*?)")|(?:(?!")(.*?)))(?=,|$)
This should do what you want.
Explanation:
(?:^|,)\s*
The pattern should start with a , or beginning of string. Also, ignore all whitespace at the beginning.
Lookahead and see if the rest starts with a quote
(?:(?=")"([^"].*?)")
If it does, then match non-greedily till next quote.
(?:(?!")(.*?))
If it does not begin with a quote, then match non-greedily till next comma or end of string.
(?=,|$)
The pattern should end with a comma or end of string.
When I started to understand what I had done wrong, I also started to understand how convoluted the lookarounds were making this. I finally realized that I didn't want all the matched text, I wanted specific groups inside of it. I ended up using something very similar to my original RegEx except that I didn't do a lookahead on the closing comma, which I think should be a little more efficient. Here is my final code.
package regex.parser;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class CSVParser {
    /*
     * This Pattern will match on either quoted text or text between commas, including
     * whitespace, and accounting for beginning and end of line.
     */
    private final Pattern csvPattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?:,|$)");
    private ArrayList<String> allMatches = null;
    private Matcher matcher = null;
    private String match = null;
    private int size;
    public CSVParser() {
        allMatches = new ArrayList<String>();
        matcher = null;
        match = null;
    }
    public String[] parse(String csvLine) {
        matcher = csvPattern.matcher(csvLine);
        allMatches.clear();
        String match;
        while (matcher.find()) {
            match = matcher.group(1);
            if (match!=null) {
                allMatches.add(match);
            }
            else {
                allMatches.add(matcher.group(2));
            }
        }
        size = allMatches.size();
        if (size > 0) {
            return allMatches.toArray(new String[size]);
        }
        else {
            return new String[0];
        }
    }
    public static void main(String[] args) {
        String lineinput = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        CSVParser myCSV = new CSVParser();
        System.out.println("Testing CSVParser with: \n " + lineinput);
        for (String s : myCSV.parse(lineinput)) {
            System.out.println(s);
        }
    }
}
I know this isn't what the OP wants, but for other readers: one of the String.replace methods could be used to strip the quotes from each element in the result array of the OP's current regex.
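A minimal sketch of that idea, assuming the OP's first pattern (the one that keeps the quotes in the match) and an input line without spaces after the commas. The class name QuoteStripper and the helper fields are hypothetical, and replaceAll is just one of several String methods that would work here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteStripper {
    // The OP's original pattern: a quoted field, or an unquoted field between commas.
    private static final Pattern FIELD =
            Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");

    // Hypothetical helper: takes each whole match and removes one leading and one trailing quote.
    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            result.add(m.group().replaceAll("^\"|\"$", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String field : fields(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

The empty field between "the" and "lazy dog" is kept as an empty string, so the list has six entries for this input.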
