Java: Whitespace in HTML not recognized with Regex pattern

Code:
static short state = 0;
static int td_number = 0;
public static void main(String[] args) {
final Pattern p = Pattern.compile("^[\\s]*?\\d+\\.\\d+[\\s]*?");
final short TD_ENTRY = 0;
final short NO_ENTRY = 1;
HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
public void handleText(char[] data, int pos) {
switch (state) {
case NO_ENTRY:
break;
case TD_ENTRY: {
// We are in the right table column
// Create string from char array
String s = new String(data);
Matcher m = p.matcher(s);
boolean b = m.matches();
// Check if data information has correct format (0.0)
if (b) {
}
}
break;
default:
break;
}
state = NO_ENTRY;
}
public void handleStartTag(HTML.Tag tag, MutableAttributeSet set, int pos) {
if (tag == HTML.Tag.TD) {
//[...]
}
}
};
Reader reader = new StringReader(html);
try {
new ParserDelegator().parse(reader, callback, false);
} catch (IOException e) {
}
}
I am trying to parse HTML with Regular Expressions. The program reads the content of td tags within an html table. The content in the table cell should fit a special pattern defined in Pattern p.
The main problem is that the regex pattern does not match cell content such as " 0.1".
But if I define the String s manually in the code with the value " 0.1", the pattern matches.
Furthermore, if I copy the content of char[] data in debug mode and define s with this copied content, the pattern still does not match, although it looks the same as the manually defined value above.
Is it possible to find out which whitespace characters are really read?
It seems that the whitespace is not always a whitespace and therefore does not match with regex class [\s]. Is this possible?
EDIT:
Thanks for the answers. It really was a whitespace character (\xA0) that is not recognized by the \s regex class.
To everyone who downvoted my question (really frustrating): you simply misunderstood me. Maybe the problem was the sentence "I want to parse HTML with regex", but in fact I simply have content from an HTML table cell with unknown whitespace characters ;-).
I think I would have had the same problem with a library like jsoup.

In Java regexes, the non-breaking space character (NBSP, U+00A0) is traditionally not treated as whitespace for the purpose of matching \s. If that's what's causing your problem, you just need to add it to your existing whitespace class:
"^[\\s\\xA0]*\\d+\\.\\d+[\\s\\xA0]*$"
There are other Unicode whitespace characters that aren't matched by \s, but none of them are anywhere as common as the NBSP.
Alternatively, if you're running Java 7+ you can specify UNICODE_CHARACTER_CLASS mode and go on using \s.
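The three options discussed here (default \s, an explicit \xA0 in the class, and UNICODE_CHARACTER_CLASS) can be compared in a minimal sketch. The class name, the field names and the dumpCodePoints helper are hypothetical and only serve the illustration; the helper also gives one way to see which characters were really read.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; all names here are illustrative assumptions.
class NbspWhitespaceDemo {
    // Default \s covers only [ \t\n\x0B\f\r], so U+00A0 is not matched.
    static final Pattern ASCII_WS = Pattern.compile("^\\s*\\d+\\.\\d+\\s*$");
    // U+00A0 added to the whitespace class by hand.
    static final Pattern WITH_NBSP = Pattern.compile("^[\\s\\xA0]*\\d+\\.\\d+[\\s\\xA0]*$");
    // Java 7+: \s follows the Unicode White_Space property, which includes U+00A0.
    static final Pattern UNICODE_WS =
            Pattern.compile("^\\s*\\d+\\.\\d+\\s*$", Pattern.UNICODE_CHARACTER_CLASS);

    // Renders every char as U+XXXX so that invisible characters can be told apart.
    static String dumpCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String cell = "\u00A00.1"; // NBSP followed by "0.1"
        System.out.println(dumpCodePoints(cell));                // U+00A0 U+0030 U+002E U+0031
        System.out.println(ASCII_WS.matcher(cell).matches());    // false
        System.out.println(WITH_NBSP.matcher(cell).matches());   // true
        System.out.println(UNICODE_WS.matcher(cell).matches());  // true
    }
}
```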

Your code snippet is too long, but as far as I understand, you just need a pattern that matches floating-point numbers such as 0.0 or 10.52? Use the pattern \\d+\\.\\d+.
\d+ means 1..n digits
\. means a literal dot. A single unescaped dot . in a regex means "any character"
Here is the usage example:
String str = "123.456";
Pattern p = Pattern.compile("\\d+\\.\\d+");
Matcher m = p.matcher(str);
if (m.matches()) {
// do something.
}
BTW, note that matches() must match the entire input. If you want to match only part of it, use find() instead. I personally always use find() and put the start and end anchors ^ and $ into the regex itself when needed.

Related

Fix Regular Expression to allow optional fields

A data-line looks like this:
$POSL,VEL,SPL,,,4.1,0.0,4.0*12
The 7th field (4.1) is extracted to the named field SPEED using this Java Regexp.
\\$POSL,VEL,SPL,,,(?<SPEED>\\d+.\\d+),.*
New data has slightly changed. The fields in 4,5,6 may now contain data:
$POSL,VEL,SPL,a,b,c,4.0,a,b,c,d
But, the Regexp is now returning zero. Note: fields 4, 5, 6 may contain letters or numbers. But, they will not contain quoted Strings (so we don't need to worry about quoted commas).
Can someone offer a fix please?

You could optionally repeat chars a-zA-Z and digits using ,[A-Za-z0-9]*
As there is 1 comma more in the second string, you can make that part optional.
If you are not interested in the last part, but only in the capturing group, you can omit .* at the end. If the value can also occur at the end of the string, you can end the pattern with an alternation (?:,|$)
Remember to escape the dot in this part: \\d+\\.\\d+
\$POSL,VEL,SPL,[A-Za-z0-9]*,[A-Za-z0-9]*,(?:[A-Za-z0-9]*,)?(?<SPEED>\d+\.\d+)(?:,|$)
In Java with double escaped backslashes
String regex = "\\$POSL,VEL,SPL,[A-Za-z0-9]*,[A-Za-z0-9]*,(?:[A-Za-z0-9]*,)?(?<SPEED>\\d+\\.\\d+)(?:,|$)";
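As a minimal sketch, the pattern above can be applied to both sample lines with find() and the named group. The class name and the extractSpeed helper are hypothetical, introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern itself is the one from the answer.
class SpeedFieldDemo {
    static final Pattern POSL = Pattern.compile(
            "\\$POSL,VEL,SPL,[A-Za-z0-9]*,[A-Za-z0-9]*,(?:[A-Za-z0-9]*,)?(?<SPEED>\\d+\\.\\d+)(?:,|$)");

    // Returns the SPEED field as text, or null when the line does not match.
    static String extractSpeed(String line) {
        Matcher m = POSL.matcher(line);
        return m.find() ? m.group("SPEED") : null;
    }

    public static void main(String[] args) {
        System.out.println(extractSpeed("$POSL,VEL,SPL,,,4.1,0.0,4.0*12"));  // 4.1
        System.out.println(extractSpeed("$POSL,VEL,SPL,a,b,c,4.0,a,b,c,d")); // 4.0
    }
}
```

Returning the text rather than a parsed number keeps "no match" distinguishable from a genuine speed of zero.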

You may use \w* for any digit or letter (it also accepts underscores) in fields 4, 5 and 6, and escape the dot in the number:
\\$POSL,VEL,SPL,\\w*,\\w*,\\w*,(?<SPEED>\\d+\\.\\d+),.*
Note that in your post, the first example and the regex seem to be missing a comma if the number is supposed to be the seventh field.

Assuming one comma was missing from the first input:
package arraysAndStrings;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexGroupCapture {
public static void main(String[] args) {
String inputArr[] = { "$POSL,VEL,SPL,,,,4.1,0.0,4.0*12",
"$POSL,VEL,SPL,a,b,c,4.0,a,b,c,d" };
for (String input : inputArr) {
System.out.println(extractSpeed(input));
}
}
private static float extractSpeed(String input) {
float speed = 0;
try {
String regex = "\\$POSL,VEL,SPL,.*?,.*?,.*?,(?<SPEED>\\d+\\.\\d+),.*";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
if (matcher.find()) {
speed = Float.parseFloat(matcher.group(1));
}
} catch (Exception e) {
e.printStackTrace();
}
return speed;
}
}
Output
=====
4.1
4.0

Hamcrest isEqualIgnoringWhitespace does not Ignore whitespace

We need to convert some HTML to XML and validate that everything from the HTML ends up in the XML file, so we use Hamcrest in unit tests for validation. Because the XML files must contain neither more nor less information, we need a matcher based on equalTo rather than contains. While parsing we have to extract certain elements, as they are not allowed in the externally managed data model. We found that doing so might add extra whitespace in some cases (it has something to do with Jsoup).
Since the spaces are not relevant to the actual content, we decided to ignore them for now (this is purely a PoC), but we do want to validate our concept. My first solution stripped every whitespace character (String.replaceAll("\\s","")), which also strips newlines and tabs. All the text is then concatenated into one String object, which is terrible to read and not good practice when debugging. So instead I opted for Hamcrest's IsEqualIgnoringWhitespace. When testing, I found that it does not do what the name suggests. The code does not delete spaces, tabs or newlines; instead it checks whether the current character is whitespace and, if so, whether the previous character was whitespace too. In that case the extra whitespace is dropped. So basically it only normalises each run of whitespace between two words to a single space.
Here is the code of the used stripSpace method in the class:
public String stripSpace(String toBeStripped) {
final StringBuilder result = new StringBuilder();
boolean lastWasSpace = true;
for (int i = 0; i < toBeStripped.length(); i++) {
char c = toBeStripped.charAt(i);
if (isWhitespace(c)) {
if (!lastWasSpace) {
result.append(' ');
}
lastWasSpace = true;
} else {
result.append(c);
lastWasSpace = false;
}
}
return result.toString().trim();
}
So in essence it does not ignore whitespaces at all. Why is it named like this then?
To give some examples of inputs we want to match with one another, here is some text that contains whitespace but shouldn't (the text is in Dutch, but that doesn't matter):
m2 vs. m 2 (HTML original: m<sup>2</sup>)
Tabel 3.1 vs. Tabel 3 .1 (HTML original: Tabel 3.1)
So as these texts will never be matched by a normal equalTo matcher, the equalToIgnoringWhitespaces should actually match this based on the name but it doesn't.
Does anyone of you know if there actually is a matcher that actually ignores whitespaces?

According to the Javadocs IsEqualIgnoringWhitespace:
Creates a matcher of String that matches when the examined string is equal to the specified expectedString, when whitespace differences are (mostly) ignored.
This is explained in more detail in the Matchers Javadocs:
Creates a matcher of String that matches when the examined string is equal to the specified expectedString, when whitespace differences are (mostly) ignored. To be exact, the following whitespace rules are applied:
all leading and trailing whitespace of both the expectedString and the examined string are ignored
any remaining whitespace, appearing within either string, is collapsed to a single space before comparison
The following test verifies this behaviour:
@Test
public void testIsEqualIgnoringWhitespace() {
// leading and trailing spaces are ignored
assertThat("m 2", equalToIgnoringWhiteSpace(" m 2 "));
// all other spaces are collapsed to a single space
assertThat("m 2", equalToIgnoringWhiteSpace("m     2"));
// does not match because the single space in the expected string is not collapsed any further
assertThat("m2", not(equalToIgnoringWhiteSpace("m 2")));
}
So, that explains why you are seeing the behaviour you described in your question.
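The difference between collapsing whitespace and removing it can be sketched in plain Java without Hamcrest. This only approximates the documented rules with the regex class \s; the class and method names are hypothetical and are not part of the Hamcrest API.

```java
// Hypothetical helper class approximating the two normalisation strategies.
class WhitespaceNormalizationDemo {
    // Roughly what the Javadoc describes: trim, then collapse inner runs to one space.
    static String collapse(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    // What the question asks for: remove every whitespace character.
    static String removeAll(String s) {
        return s.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        System.out.println(collapse(" m    2 "));                       // m 2
        System.out.println(collapse("m2").equals(collapse("m 2")));     // false
        System.out.println(removeAll("m2").equals(removeAll("m 2")));   // true
        System.out.println(removeAll("Tabel 3 .1"));                    // Tabel3.1
    }
}
```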
Re this:
Does anyone of you know if there actually is a matcher that actually ignores whitespaces?
You could write your own matcher. Here's an example:
import org.hamcrest.BaseMatcher;
import org.hamcrest.Description;
public class IgnoresAllWhitespacesMatcher extends BaseMatcher<String> {
private final String expected;
public static IgnoresAllWhitespacesMatcher ignoresAllWhitespaces(String expected) {
return new IgnoresAllWhitespacesMatcher(expected);
}
private IgnoresAllWhitespacesMatcher(String expected) {
this.expected = expected.replaceAll("\\s+", "");
}
@Override
public boolean matches(Object actual) {
// Strip whitespace from the actual value too, so both sides are compared without it
return actual instanceof String
&& expected.equals(((String) actual).replaceAll("\\s+", ""));
}
@Override
public void describeTo(Description description) {
description.appendText(String.format("the given String should match '%s' without whitespaces", expected));
}
}
Using this matcher the following test passes:
@Test
public void testUsingCustomIgnoringAllWhitespaceMatcher() {
// leading and trailing spaces are ignored
assertThat("m2", ignoresAllWhitespaces(" m 2 "));
// intermediate spaces are ignored
assertThat("m2", ignoresAllWhitespaces("m 2"));
}

Java Regex is including new line in match

I'm trying to match a regular expression to textbook definitions that I get from a website.
The definition always has the word with a new line followed by the definition. For example:
Zither
Definition: An instrument of music used in Austria and Germany It has from thirty to forty wires strung across a shallow sounding board which lies horizontally on a table before the performer who uses both hands in playing on it Not to be confounded with the old lute shaped cittern or cithern
In my attempts to get just the word (in this case "Zither") I keep getting the newline character.
I tried both ^(\w+)\s and ^(\S+)\s without much luck. I thought that maybe ^(\S+)$ would work, but that doesn't seem to successfully match the word at all. I've been testing with rubular, http://rubular.com/r/LPEHCnS0ri; which seems to successfully match all my attempts the way I want, despite the fact that Java doesn't.
Here's my snippet
String str = ...; //Here the string is assigned a word and definition taken from the internet like given in the example above.
Pattern rgx = Pattern.compile("^(\\S+)$");
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
String result = mtch.group();
terms.add(new SearchTerm(result, System.nanoTime()));
}
This is easily solved by trimming the resulting string, but that seems like it should be unnecessary if I'm already using a regular expression.
All help is greatly appreciated. Thanks in advance!

Try using the Pattern.MULTILINE option
Pattern rgx = Pattern.compile("^(\\S+)$", Pattern.MULTILINE);
This causes the regex to recognise line delimiters in your string, otherwise ^ and $ just match the start and end of the string.
Although it makes no difference for this pattern, the Matcher.group() method returns the entire match, whereas the Matcher.group(int) method returns the match of the particular capture group (...) based on the number you specify. Your pattern specifies one capture group which is what you want captured. If you'd included \s in your Pattern as you wrote you tried, then Matcher.group() would have included that whitespace in its return value.
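A minimal sketch of both points, the effect of Pattern.MULTILINE and the difference between group() and group(1). The class name and the firstMatch helper are hypothetical, and the sample text is a shortened version of the one in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the anchors and capture groups discussed above.
class MultilineAnchorDemo {
    // Returns the requested group of the first match, or null if nothing matches.
    static String firstMatch(Pattern p, String input, int group) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String text = "Zither\nDefinition: An instrument of music";

        // Without MULTILINE, $ does not match before the inner line break: no match.
        System.out.println(firstMatch(Pattern.compile("^(\\S+)$"), text, 0)); // null

        // With MULTILINE, $ matches at the end of the first line.
        System.out.println(firstMatch(Pattern.compile("^(\\S+)$", Pattern.MULTILINE), text, 0)); // Zither

        // With a trailing \s, group 0 includes the newline, group 1 does not.
        Pattern withSpace = Pattern.compile("^(\\S+)\\s");
        System.out.println(firstMatch(withSpace, text, 0).length()); // 7
        System.out.println(firstMatch(withSpace, text, 1));          // Zither
    }
}
```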

With regular expressions, group 0 is always the complete match. In your case you want group 1, not group 0.
So changing mtch.group() to mtch.group(1) should do the trick:
String str = ...; //Here the string is assigned a word and definition taken from the internet like given in the example above.
Pattern rgx = Pattern.compile("^(\\w+)\\s");
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
String result = mtch.group(1);
terms.add(new SearchTerm(result, System.nanoTime()));
}

A late response, but if you are not using Pattern and Matcher, you can use this alternative of DOTALL in your regex string
(?s)[Your Expression]
Basically (?s) also tells dot to match all characters, including line breaks
Detailed information: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
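A small sketch of the inline flag next to the equivalent Pattern.DOTALL constant. The class and method names are hypothetical, and the sample string is shortened for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing the inline (?s) flag with Pattern.DOTALL.
class DotallFlagDemo {
    // String.matches compiles the regex on each call; flags can only be given inline.
    static boolean matchesInline(String input, String regex) {
        return input.matches(regex);
    }

    // The same check with an explicitly compiled pattern and the DOTALL constant.
    static boolean matchesDotall(String input, String regex) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(input).matches();
    }

    public static void main(String[] args) {
        String text = "Zither\nDefinition";
        System.out.println(matchesInline(text, "Zither.Definition"));     // false
        System.out.println(matchesInline(text, "(?s)Zither.Definition")); // true
        System.out.println(matchesDotall(text, "Zither.Definition"));     // true
    }
}
```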

Just replace:
String result = mtch.group();
By:
String result = mtch.group(1);
This will limit your output to the contents of the capturing group (e.g. (\\w+)).

Try the following:
/* The regex pattern: ^(\w+)\r?\n(.*)$ */
private static final Pattern REGEX_PATTERN =
Pattern.compile("^(\\w+)\\r?\\n(.*)$");
public static void main(String[] args) {
String input = "Zither\nDefinition: An instrument of music";
System.out.println(
REGEX_PATTERN.matcher(input).matches()
); // prints "true"
System.out.println(
REGEX_PATTERN.matcher(input).replaceFirst("$1 = $2")
); // prints "Zither = Definition: An instrument of music"
System.out.println(
REGEX_PATTERN.matcher(input).replaceFirst("$1")
); // prints "Zither"
}

Excluding markup on lowercased parentheses letters

A string can contain one or more lowercase letters in parentheses, like String content = "This is (a) nightmare"; I want to transform the string to "<centamp>This is </centamp>(a) <centamp>nightmare</centamp>"; So basically, add centamp markup around the string, but exclude any lowercase letter in parentheses from the markup.
This is what I have tried so far, but it doesn't achieve the desired result. A string could contain anywhere from none to many such parentheses, and each one should be excluded from the markup.
Pattern pattern = Pattern.compile("^(.*)?(\\([a-z]*\\))?(.*)?$", Pattern.MULTILINE);
String content = "This is (a) nightmare";
System.out.println(content.matches("^(.*)?(\\([a-z]*\\))?(.*)?$"));
System.out.println(pattern.matcher(content).replaceAll("<centamp>$1$3</centamp>$2"));

This can be done in one replaceAll:
String outputString =
inputString.replaceAll("(?s)\\G((?:\\([a-z]+\\))*+)((?:(?!\\([a-z]+\\)).)+)",
"$1<centamp>$2</centamp>");
It allows a non-empty sequence of lowercase English letters inside the parentheses: \\([a-z]+\\).
Features:
Whitespace only sequences are tagged.
There will be no tag surrounding empty string.
Explanation:
\G asserts the match boundary, i.e. the next match can only start from the end of last match. It can also match the beginning of the string (when we have yet to find any match).
Each match of the regex contains a sequence of 0 or more consecutive \\([a-z]+\\) (with no space allowed between them), followed by at least 1 character that does not start a \\([a-z]+\\) sequence.
The 0 or more consecutive \\([a-z]+\\) cover the case where the string does not start with \\([a-z]+\\), and the case where the string does not contain \\([a-z]+\\) at all.
In the pattern for this portion, (?:\\([a-z]+\\))*+, note that the + after * makes the quantifier possessive; in other words, it disallows backtracking. Simply put, an optimization.
The at-least-one-character restriction is necessary to avoid adding a tag that encloses an empty string.
In the pattern for this portion, (?:(?!\\([a-z]+\\)).)+, note that before matching each character I check with (?!\\([a-z]+\\)) that a \\([a-z]+\\) sequence does not start there.
(?s) flag will cause . to match any character including new line. This will allow a tag to enclose text that spans multiple lines.
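A minimal sketch wrapping the replaceAll call from this answer in a method, so the behaviour described in the explanation can be tried on a few inputs. The class and method names are hypothetical.

```java
// Hypothetical wrapper around the single replaceAll call from the answer.
class CentampTagger {
    static String tag(String input) {
        return input.replaceAll(
                "(?s)\\G((?:\\([a-z]+\\))*+)((?:(?!\\([a-z]+\\)).)+)",
                "$1<centamp>$2</centamp>");
    }

    public static void main(String[] args) {
        // The space after "(a)" ends up inside the second tag.
        System.out.println(tag("This is (a) nightmare"));
        // → <centamp>This is </centamp>(a)<centamp> nightmare</centamp>

        // A string made only of parenthesised letters gets no tag at all.
        System.out.println(tag("(a)(b)")); // (a)(b)
    }
}
```

Note that the space following "(a)" is tagged together with "nightmare", which differs slightly from the output requested in the question.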

You can just replace every occurrence of "([a-z])" with </centamp>$1<centamp>, then prepend <centamp> and append </centamp>:
String content = "Test (a) test (b) (c)";
Pattern pattern = Pattern.compile("(\\([a-z]\\))");
Matcher matcher = pattern.matcher(content);
String result = "<centamp>" + matcher.replaceAll("</centamp>$1<centamp>") + "</centamp>";
Note: I wrote the above in the browser, so there may be syntax errors.
EDIT Here's a full example with the simplest RegEx possible.
import java.util.*;
import java.lang.*;
import java.util.regex.*;
class Main
{
public static void main (String[] args) throws java.lang.Exception
{
String content = "test (a) (b) and (c)";
String result = "<centamp>" +
content.replaceAll("(\\([a-z]\\))", "</centamp>$1<centamp>") +
"</centamp>";
result = result.replaceAll("<centamp></centamp>", "");
System.out.print(result);
}
}

This is another solution which uses cleaner regex. The solution is longer, but it allows more flexibility in adjusting the condition to add tag.
The idea here is to match the parenthesis containing lower case characters (the part we don't want to tag), then use the indices from the matches to identify the portion we want to enclose in tag.
// Regex for the parenthesis containing only lowercase English
// alphabet characters
static Pattern REGEX_IN_PARENTHESIS = Pattern.compile("\\([a-z]+\\)");
private static String addTag(String str) {
Matcher matcher = REGEX_IN_PARENTHESIS.matcher(str);
StringBuilder sb = new StringBuilder();
// Index that we have processed up to last append into StringBuilder
int lastAppend = 0;
while (matcher.find()) {
String bracket = matcher.group();
// The string from lastAppend to start of a match is the part
// we want to tag
// If you want to, you can easily add extra logic to process
// the string
if (lastAppend < matcher.start()) { // will not tag if empty string
sb.append("<centamp>")
.append(str, lastAppend, matcher.start())
.append("</centamp>");
}
// Append the parenthesis with lowercase English alphabet as it is
sb.append(bracket);
lastAppend = matcher.end();
}
// The string from lastAppend to end of string (no more match)
// is the part we want to tag
if (lastAppend < str.length()) {
sb.append("<centamp>")
.append(str, lastAppend, str.length())
.append("</centamp>");
}
return sb.toString();
}

Java regex and pattern matching: finding "blanks" in pattern which do not include them?

I need to write a compiler scanner for homework, and thought it would be "elegant" to use regex. The fact is, I have seldom used them, and that was a long time ago, so I forgot most of what I knew about them and had to look things up. I used them successfully for the identifiers (or at least I think so; I still need to do some further tests, but for now they all look OK), but I have a problem with number recognition.
The function nextCh() reads the next character on the input (lookahead char). What I'd like to do here is to check if this char matches the regex [0-9]*. I append every matching char in the str field of my current token, then I read the int value of this field. It recognizes a single number input such as "123", but the problem I have is that for the input "123 456", the final str will be "123 456" while I should get 2 separate tokens with fields "123" and "456". Why is the " " being matched?
private void readNumber(Token t) {
t.str = "" + ch; // force conversion char --> String
final Pattern pattern = Pattern.compile("[0-9]*");
nextCh(); // get next char and check if it is a digit
Matcher match = pattern.matcher("" + ch);
while (match.find() && ch != EOF) {
t.str += ch;
nextCh();
match = pattern.matcher("" + ch);
}
t.kind = Kind.number;
try {
int value = Integer.parseInt(t.str);
t.val = value;
} catch(NumberFormatException e) {
error(t, Message.BIG_NUM, t.str);
}
}
Thank you!
PS: I did solve my problem using the code below. Nevertheless, I'd like to understand where the flaw is in my regex expression.
t.str = "" + ch;
nextCh(); // get next char and check if it is a number
while (ch>='0' && ch<='9') {
t.str += ch;
nextCh();
}
t.kind = Kind.number;
try {
int value = Integer.parseInt(t.str);
t.val = value;
} catch(NumberFormatException e) {
error(t, Message.BIG_NUM, t.str);
}
EDIT: turns out my regex also doesn't work for the identifiers recognition (again, includes blanks), so I had to switch to a system similar to my "solution" (while with a lot of conditions). Guess I'll need to study the regex again :O

I'm not 100% sure whether this is relevant in your case, but this:
Pattern.compile("[0-9]*");
matches zero or more digits anywhere in the string, because of the asterisk. The space gets matched because find() accepts an empty match, that is, 'zero digits'. If you want to make sure the char is a digit, you have to match one or more, using the plus sign:
Pattern.compile("[0-9]+");
or, since you are only comparing a single char at a time, just match exactly one digit:
Pattern.compile("^[0-9]$");
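The empty match can be made visible with a short sketch. The class and helper names are hypothetical; the patterns are the ones from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing why "[0-9]*" reports a match for a blank.
class EmptyMatchDemo {
    // True if find() locates any match of the regex in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("[0-9]*").matcher(" ");
        System.out.println(m.find());           // true
        System.out.println(m.group().length()); // 0, the match is the empty string

        System.out.println(finds("[0-9]+", " "));  // false
        System.out.println(finds("^[0-9]$", "7")); // true
        System.out.println(finds("^[0-9]$", " ")); // false
    }
}
```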

You should be using the matches method rather than the find method. From the documentation:
The matches method attempts to match the entire input sequence against the pattern
The find method scans the input sequence looking for the next subsequence that matches the pattern.
So with find, a match is reported as soon as any subsequence matches the pattern, and because [0-9]* can match the empty string, that is true even for a single space; with matches, the entire string must match the pattern.
For example, try this:
Pattern p = Pattern.compile("[0-9]*");
Matcher m123abc = p.matcher("123 abc");
System.out.println(m123abc.matches()); // prints false
System.out.println(m123abc.find()); // prints true

Use a simpler regex like
/\d+/
Where
\d means a digit
+ means one or more
In code:
final Pattern pattern = Pattern.compile("\\d+");
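As a sketch of how this pattern could be applied to the whole input instead of one character at a time, the loop below collects every run of digits. It is an alternative to the scanner in the question, not its actual design; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: extracts number tokens from a whole input string.
class NumberTokenDemo {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Each call to find() moves on to the next run of one or more digits.
    static List<String> numberTokens(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(numberTokens("123 456")); // [123, 456]
    }
}
```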
