We need to convert some HTML to XML and validate that everything from the HTML ends up in the XML file, so we use Hamcrest in our unit tests for validation. Because the XML files must contain neither more nor less information than the HTML, we need a matcher that behaves like equalTo rather than like a contains-style matcher. The problem is that while parsing we have to extract certain elements, because they are not allowed in the externally managed data model. We found out that doing so might add extra whitespace in some cases (it has something to do with Jsoup).
Since the whitespace is not relevant to the actual content, we decided to ignore it for now (this is purely a PoC), but we do want to validate our concept. My first solution stripped all whitespace, including newlines and tabs, with String.replaceAll("\\s",""). All the text is then concatenated into one String, which is terrible to read and makes debugging hard. So instead I opted for Hamcrest's IsEqualIgnoringWhitespace. When testing, I found out that it does not do what the name suggests. The code does not delete spaces, tabs or newlines; instead it checks whether the current character is whitespace and, if so, whether the previous character was whitespace as well. If that is the case, the character is dropped. So basically it only normalises each run of whitespace between two words to a single space.
Here is the code of the used stripSpace method in the class:
public String stripSpace(String toBeStripped) {
final StringBuilder result = new StringBuilder();
boolean lastWasSpace = true;
for (int i = 0; i < toBeStripped.length(); i++) {
char c = toBeStripped.charAt(i);
if (isWhitespace(c)) {
if (!lastWasSpace) {
result.append(' ');
}
lastWasSpace = true;
} else {
result.append(c);
lastWasSpace = false;
}
}
return result.toString().trim();
}
So in essence it does not ignore whitespaces at all. Why is it named like this then?
To give some examples of inputs we want to match with one another, here is some of the text that contains whitespace but shouldn't (the text is in Dutch, but that doesn't matter):
m2 vs. m 2 (HTML original: m<sup>2</sup>)
Tabel 3.1 vs. Tabel 3 .1 (HTML original: Tabel 3.1)
These texts will never be matched by a plain equalTo matcher, but judging by its name equalToIgnoringWhiteSpace should match them. It doesn't.
Does anyone of you know if there actually is a matcher that actually ignores whitespaces?
According to the Javadoc of IsEqualIgnoringWhitespace:
Creates a matcher of String that matches when the examined string is equal to the specified expectedString, when whitespace differences are (mostly) ignored.
This is explained in more detail in the Matchers Javadocs:
Creates a matcher of String that matches when the examined string is equal to the specified expectedString, when whitespace differences are (mostly) ignored. To be exact, the following whitespace rules are applied:
all leading and trailing whitespace of both the expectedString and the examined string are ignored
any remaining whitespace, appearing within either string, is collapsed to a single space before comparison
The following test verifies this behaviour:
@Test
public void testIsEqualIgnoringWhitespace() {
    // leading and trailing spaces are ignored
    assertThat("m 2", equalToIgnoringWhiteSpace(" m 2 "));
    // all other spaces are collapsed to a single space
    assertThat("m 2", equalToIgnoringWhiteSpace("m     2"));
    // does not match because the single space in the expected string is not collapsed any further
    assertThat("m2", not(equalToIgnoringWhiteSpace("m 2")));
}
So, that explains why you are seeing the behaviour you described in your question.
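The difference between collapsing and removing whitespace can also be reproduced without Hamcrest. The following is a minimal sketch: the class and method names are made up for illustration, and collapse only approximates the documented rules (String.trim and \s do not classify exactly the same characters as whitespace as Hamcrest does).

```java
final class WhitespaceNormalisation {

    // Approximates the documented rules: drop leading/trailing whitespace,
    // then collapse every remaining run of whitespace to a single space.
    static String collapse(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    // Removes every whitespace character, which is what the question asks for.
    static String removeAll(String s) {
        return s.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        // Equal after collapsing: both sides become "m 2"
        System.out.println(collapse(" m   2 ").equals(collapse("m 2")));
        // Not equal after collapsing: "m2" vs. "m 2"
        System.out.println(collapse("m2").equals(collapse("m 2")));
        // Equal after removing all whitespace: both sides become "Tabel3.1"
        System.out.println(removeAll("Tabel 3 .1").equals(removeAll("Tabel 3.1")));
    }
}
```

Only the second comparison fails, which is exactly the case from the question that equalToIgnoringWhiteSpace cannot cover.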
Re this:
Does anyone of you know if there actually is a matcher that actually ignores whitespaces?
You could write your own matcher. Here's an example:
import org.hamcrest.BaseMatcher;
import org.hamcrest.Description;
public class IgnoresAllWhitespacesMatcher extends BaseMatcher<String> {
    // expected value with all whitespace already removed
    private final String expected;
    public static IgnoresAllWhitespacesMatcher ignoresAllWhitespaces(String expected) {
        return new IgnoresAllWhitespacesMatcher(expected);
    }
    private IgnoresAllWhitespacesMatcher(String expected) {
        this.expected = expected.replaceAll("\\s+", "");
    }
    @Override
    public boolean matches(Object actual) {
        // strip whitespace from the actual value too, so both sides are compared the same way
        return actual instanceof String
                && expected.equals(((String) actual).replaceAll("\\s+", ""));
    }
    @Override
    public void describeTo(Description description) {
        description.appendText(String.format("the given String should match '%s' without whitespaces", expected));
    }
}
Using this matcher the following test passes:
@Test
public void testUsingCustomIgnoringAllWhitespaceMatcher() {
    // leading and trailing spaces are ignored
    assertThat("m2", ignoresAllWhitespaces(" m 2 "));
    // intermediate spaces are ignored
    assertThat("m2", ignoresAllWhitespaces("m 2"));
}
Write a procedure loadDocument(String name) which loads and analyzes the input line by line, searching for a link in every line. The link format is as follows: the 5 characters link= (in any mix of upper-case and lower-case letters) followed by a correct identifier. A correct identifier starts with a letter (lower case or upper case) followed by zero or more letters, digits or underscores _. The procedure has to print the identifiers it finds, each one on a separate line. Before printing, the identifiers have to be converted to lower case. The document ends with a line containing the text eod, which means end of document.
My code:
public static void loadDocument(String name, Scanner scan) {
while(scan.hasNext()) {
String line = scan.nextLine();
if(line.equals("eod")) {
return;
}
else if(line.matches("link="+name) && correctLink(name)) {
String identifier = name.toLowerCase();
System.out.println(identifier);
}
else
continue;
}
}
// accepted only small letters, capital letter, digits and '_' (but not on the begin)
public static boolean correctLink(String link) {
if(link.matches("^[a-zA-Z]+[0]+||[0-9]+||_"))
return true;
else
return false;
}
How do I write: if the line starts with link=, return whatever comes after link=?
My problem is in this code:
else if(line.matches("link="+name) && correctLink(name)) {
String identifier = name.toLowerCase();
System.out.println(identifier);
}
For example, if the input is link=abc, I want it to print abc.
First, I would suggest that you get used to comparing against literal strings "the other way round". This will save you from a lot of NullPointerExceptions (but this is just a side comment):
if ("eod".equals(line))
You can use @Ofer's example (it is generated from https://regex101.com, a nice page to play around with regular expressions and get them explained, by the way), but you should use a different regex:
final String regex = "link=([a-z][a-z0-9_]*)";
and a different option for the pattern:
final Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
I used the CASE_INSENSITIVE option to make the "link" also trigger for mixed case writings (like "Link", "liNk" and so on). Therefore I also could skip the uppercase letters from the actual regex, as [a-z] will also match uppercase letters with that option, and that's what was requested.
The "magic" here is that you put the expression that you later want to "read" from the pattern matcher into parentheses "()". This marks a "group". Group 0 always gives back the full match (including the "link=" here); group 1 gives back only the identifier.
You can play around with that expression at https://regex101.com/r/id2CP2/1
Please don't forget to convert the identifiers (you get them from matcher.group(i)) to lowercase before you output them.
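Putting these pieces together, the procedure could look roughly like the sketch below. It is not the assignment's reference solution: the class name LinkExtractor and the method extractIdentifiers are invented here, the method returns a list instead of printing so that it is easy to check, and it assumes that a line may contain several links and that link= does not have to start at a word boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkExtractor {

    // "link=" in any letter case, followed by an identifier captured as group 1.
    private static final Pattern LINK =
            Pattern.compile("link=([a-z][a-z0-9_]*)", Pattern.CASE_INSENSITIVE);

    // Reads lines until "eod" and returns the lower-cased identifiers in order.
    static List<String> extractIdentifiers(Scanner scan) {
        List<String> identifiers = new ArrayList<>();
        while (scan.hasNextLine()) {
            String line = scan.nextLine();
            if ("eod".equals(line)) {
                break;
            }
            Matcher matcher = LINK.matcher(line);
            while (matcher.find()) { // assumption: a line may contain more than one link
                identifiers.add(matcher.group(1).toLowerCase(Locale.ROOT));
            }
        }
        return identifiers;
    }

    public static void main(String[] args) {
        Scanner scan = new Scanner(
                "see LINK=Abc_1 here\nnothing\nlink=x link=9bad\neod\nlink=ignored");
        // prints abc_1 and x, each on its own line
        extractIdentifiers(scan).forEach(System.out::println);
    }
}
```

link=9bad is skipped because the identifier must start with a letter, and link=ignored is never read because it comes after eod.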
I don't believe I saw this when searching (believe me, I spent a good amount of time searching for this) for a solution to this so here goes.
Goal:
Match regex in a string and replace it with something that contains the matched value.
Regex used currently:
\b(Connor|charries96|Foo|Bar)\b
For the record, I suck at regex, in case this isn't the best way to do it.
My current code (and several other methods I tried) can only replace the text with the first match it encounters if there are multiple matches.
private Pattern regexFromList(List<String> input) {
if(input.size() < 1) {
return "";
}
StringBuilder builder = new StringBuilder();
builder.append("\\b");
builder.append("(");
for(String s : input) {
builder.append(s);
if(!s.equals(input.get(input.size() - 1)))
{
builder.append("|");
}
}
builder.append(")");
builder.append("\\b");
return Pattern.compile(builder.toString(), Pattern.CASE_INSENSITIVE);
}
Example input:
charries96's name is Connor.
Example result using TEST as the data to prepend the match with
TESTcharries96's name is TESTcharries96.
Desired result using example input:
TESTcharries96's name is TESTConnor.
Here is my current code for replacing the text:
if(highlight) {
StringBuilder builder = new StringBuilder();
Matcher match = pattern.matcher(event.getMessage());
String string = event.getMessage();
if (match.find()) {
string = match.replaceAll("TEST" + match.group());
// I do realise I'm using #replaceAll but that's mainly given it gives me the same result as other methods so why not just cut to the chase.
}
builder.append(string);
return builder.toString();
}
EDIT:
Working example of desired result on RegExr
There are a few problems here:
You are taking the user input as is and build the regex:
builder.append(s);
If there are special characters in the user input, they might be interpreted as metacharacters and cause unexpected behavior.
Always use Pattern.quote if you want to match a string exactly as it is passed in.
builder.append(Pattern.quote(s));
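Matcher.replaceAll is a high-level function which resets the Matcher (starting the match all over again), searches for all the matches and performs the replacement. In your code, match.group() is evaluated only once, before replaceAll runs, so every match is replaced with the text of the first match. Use $1 in the replacement string instead, which refers to capture group 1 of each individual match. In your case, it can be as simple as: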
String result = match.replaceAll("TEST$1");
The StringBuilder should be thrown away along with the if statement.
Matcher.find and Matcher.group are lower-level functions for fine-grained control over what you want to do with each match.
When you perform the replacement yourself, you need to build the result with Matcher.appendReplacement and Matcher.appendTail.
A while loop (instead of an if statement) should be used with Matcher.find to search for and replace all matches.
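Both approaches side by side, as a minimal sketch. The class and method names are invented for illustration, and the TEST prefix is taken from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class MatchReplacer {

    // Builds \b(name1|name2|...)\b with every name quoted so it is matched literally.
    static Pattern regexFromList(List<String> names) {
        String alternatives = names.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(" + alternatives + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    // High-level variant: $1 is replaced by group 1 of each individual match.
    static String prefixWithReplaceAll(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("TEST$1");
    }

    // Low-level variant: loop over the matches and build the result by hand.
    static String prefixWithLoop(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String replacement = "TEST" + matcher.group(1);
            // quoteReplacement keeps '$' and '\' in the matched text literal
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        Pattern pattern = regexFromList(Arrays.asList("Connor", "charries96", "Foo", "Bar"));
        String input = "charries96's name is Connor.";
        System.out.println(prefixWithReplaceAll(pattern, input)); // TESTcharries96's name is TESTConnor.
        System.out.println(prefixWithLoop(pattern, input));       // TESTcharries96's name is TESTConnor.
    }
}
```

The loop variant is only worth the extra code when the replacement depends on more than the matched text, for example a lookup per name.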
I need to check whether a string consists of a specific arrangement of letters and numbers.
Valid arrangements are for example:
X
X-Y
A-H-K-L-J-Y
A-H-J-Y
123
12?
12*
12-17
Invalid are for example:
-X-Y
-XY
*12
?12
I have written this method in Java to solve this problem (but I don't have much experience with regular expressions):
public boolean checkPatternMatching(String sourceToScan, String searchPattern) {
boolean patternFounded;
if (sourceToScan == null) {
patternFounded = false;
} else {
Pattern pattern = Pattern.compile(Pattern.quote(searchPattern),
Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(sourceToScan);
patternFounded = matcher.find();
}
return patternFounded;
}
How can I implement this requirement with regular expressions?
By the way: is it a good solution to check whether a string has numeric content by using the isNumeric method of the StringUtils class?
//EDIT
The link that was added by the admins is not about specific arrangements of characters, but only about matching characters with regular expressions in general!
After a good while trying to help, answering to constantly changing questions, just found out that the same was asked yesterday, and that the OP doesn't accept answers to his questions...all I have left to say is good night sir, good luck
n-th answer follows:
First pattern: [a-z](-[a-z])* : a letter, possibly followed by more letters, separated by -.
Second pattern: \d+(-\d+)*[?*]* : a number, possibly followed by more numbers, separated by -, and possibly ending with ? or *.
So join them together: ^(([a-z](-[a-z])*)|(\d+(-\d+)*[?*]*))$. ^ and $ mark the beginning and the end of the string; the outer group makes them apply to both alternatives.
A few more comments on the code: you don't need to use Pattern.quote, and you should use matches() instead of find(), because find() returns true if any part of the string matches the pattern, and you want the whole string to match:
public static boolean checkPatternMatching(String sourceToScan, String searchPattern) {
boolean patternFounded;
if (sourceToScan == null) {
patternFounded = false;
} else {
Pattern pattern = Pattern.compile(searchPattern, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(sourceToScan);
patternFounded = matcher.matches();
}
return patternFounded;
}
Called like this: checkPatternMatching(s, "^(([a-z](-[a-z])*)|(\\d+(-\\d+)*[?*]*))$")
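The difference between matches() and find() mentioned above is easy to see on the examples from the question. This is only a sketch; the class and method names are invented, and the anchors are left out on purpose because matches() already requires the whole input to fit.

```java
import java.util.regex.Pattern;

final class ArrangementCheck {

    // Letters separated by '-', or numbers separated by '-' with optional trailing ? or *.
    private static final Pattern ARRANGEMENT = Pattern.compile(
            "([a-z](-[a-z])*)|(\\d+(-\\d+)*[?*]*)", Pattern.CASE_INSENSITIVE);

    // matches() succeeds only if the entire input fits the pattern.
    static boolean isValid(String s) {
        return s != null && ARRANGEMENT.matcher(s).matches();
    }

    // find() succeeds as soon as any substring fits, which is too lenient here.
    static boolean containsArrangement(String s) {
        return s != null && ARRANGEMENT.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"X", "A-H-J-Y", "12?", "12-17", "-X-Y", "*12"}) {
            System.out.println(s + " matches=" + isValid(s) + " find=" + containsArrangement(s));
        }
    }
}
```

For -X-Y and *12, matches() is false while find() is true, because find() is satisfied with the X-Y or 12 inside the string.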
About the second question, this is the current implementation of StringUtils.isNumeric:
public static boolean isNumeric(final CharSequence cs) {
if (isEmpty(cs)) {
return false;
}
final int sz = cs.length();
for (int i = 0; i < sz; i++) {
if (Character.isDigit(cs.charAt(i)) == false) {
return false;
}
}
return true;
}
So no, there is nothing wrong about it, that is as simple as it gets. But you need to include an external JAR in your program, which I find unnecessary if you just want to use such a simple method.
I believe that you should first remove the Pattern.quote() call, because it turns the input pattern into a string literal, which is not useful in your context.
To match the valid arrangements with letters, something like this should work:
^[a-z](?:-[a-z])*$
For the numbers (if I understood the rules correctly):
^\\d+(?:[?*]|-\\d+)*$
And if you want to combine them:
^(?:[a-z](?:-[a-z])*|\\d+(?:[?*]|-\\d+)*)$
I'm not familiar with Java itself, nor the isNumeric method, sorry.
As per your comment, if you want to accept *12 or 1?2 or 12*456, you can use:
^\\*?\\d+(?:[?*]\\d*|-\\d+)*$
Then add it to the previous regex like so:
^(?:[a-z](?:-[a-z])*|\\*?\\d+(?:[?*]\\d*|-\\d+)*)$
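A quick way to try the extended regex from Java is sketched below. The class name is invented, and the CASE_INSENSITIVE flag is an assumption carried over from the question's code, since the regex itself only lists lower-case letters.

```java
import java.util.regex.Pattern;

final class ExtendedArrangementCheck {

    // Combined regex from above; CASE_INSENSITIVE is carried over from the question's code.
    private static final Pattern EXTENDED = Pattern.compile(
            "^(?:[a-z](?:-[a-z])*|\\*?\\d+(?:[?*]\\d*|-\\d+)*)$",
            Pattern.CASE_INSENSITIVE);

    static boolean isValid(String s) {
        return s != null && EXTENDED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"X-Y", "12-17", "*12", "1?2", "12*456", "?12", "-X-Y"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

With this version *12 is accepted, while ?12 and -X-Y are still rejected.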
A string can contain one or more lower-case letters in parentheses, like String content = "This is (a) nightmare"; I want to transform the string to "<centamp>This is </centamp>(a) <centamp>nightmare</centamp>"; So basically, add centamp markup around the string, but exclude any lower-case letter in parentheses from the markup.
This is what I have tried so far, but it doesn't achieve the desired result. There could be anywhere from none to many parentheses in a string, and every pair of parentheses should be excluded from the markup.
Pattern pattern = Pattern.compile("^(.*)?(\\([a-z]*\\))?(.*)?$", Pattern.MULTILINE);
String content = "This is (a) nightmare";
System.out.println(content.matches("^(.*)?(\\([a-z]*\\))?(.*)?$"));
System.out.println(pattern.matcher(content).replaceAll("<centamp>$1$3</centamp>$2"));
This can be done in one replaceAll:
String outputString =
inputString.replaceAll("(?s)\\G((?:\\([a-z]+\\))*+)((?:(?!\\([a-z]+\\)).)+)",
"$1<centamp>$2</centamp>");
It allows a non-empty sequence of lower-case English letters inside the brackets: \\([a-z]+\\).
Features:
Whitespace only sequences are tagged.
There will be no tag surrounding empty string.
Explanation:
\G asserts the match boundary, i.e. the next match can only start from the end of last match. It can also match the beginning of the string (when we have yet to find any match).
Each match of the regex will contain a sequence of: 0 or more consecutive \\([a-z]+\\) (no space between allowed), and followed by at least 1 character that does not form \\([a-z]+\\) sequence.
0 or more consecutive \\([a-z]+\\) to cover the case where the string does not start with \\([a-z]+\\), and the case where the string does not contain \\([a-z]+\\).
In the pattern for this portion (?:\\([a-z]+\\))*+ - note that the + after * makes the quantifier possessive, in other words, it disallows backtracking. Simply put, an optimization.
One character restriction is necessary to prevent adding tag that encloses empty string.
In the pattern for this portion (?:(?!\\([a-z]+\\)).)+ - note that for every character, I check whether it is part of the pattern \\([a-z]+\\) before matching it (?!\\([a-z]+\\))..
(?s) flag will cause . to match any character including new line. This will allow a tag to enclose text that spans multiple lines.
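To see the effect of \G and the possessive quantifier, the one-line replaceAll above can be wrapped in a small method and run on a few inputs. The class and method names are invented for this sketch.

```java
final class CentampTagger {

    // The replaceAll call from above, wrapped in a method.
    static String tag(String input) {
        return input.replaceAll(
                "(?s)\\G((?:\\([a-z]+\\))*+)((?:(?!\\([a-z]+\\)).)+)",
                "$1<centamp>$2</centamp>");
    }

    public static void main(String[] args) {
        // <centamp>This is </centamp>(a)<centamp> nightmare</centamp>
        System.out.println(tag("This is (a) nightmare"));
        // (a)(b)<centamp> x</centamp>
        System.out.println(tag("(a)(b) x"));
        // <centamp>x </centamp>(a)
        System.out.println(tag("x (a)"));
    }
}
```

Note that the space after (a) ends up inside the tag, which differs slightly from the output requested in the question. In the last input the trailing (a) is left untouched because no character follows it, so no empty tag is produced.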
You just replace every occurrence of a parenthesised lower-case letter such as (a) with </centamp>$1<centamp>, and then prepend <centamp> and append </centamp>:
String content = "Test (a) test (b) (c)";
Pattern pattern = Pattern.compile("(\\([a-z]\\))");
Matcher matcher = pattern.matcher(content);
String result = "<centamp>" + matcher.replaceAll("</centamp>$1<centamp>") + "</centamp>";
Note: I wrote the above in the browser, so there may be syntax errors.
EDIT Here's a full example with the simplest RegEx possible.
import java.util.*;
import java.lang.*;
import java.util.regex.*;
class Main
{
public static void main (String[] args) throws java.lang.Exception
{
String content = "test (a) (b) and (c)";
String result = "<centamp>" +
content.replaceAll("(\\([a-z]\\))", "</centamp>$1<centamp>") +
"</centamp>";
result = result.replaceAll("<centamp></centamp>", "");
System.out.print(result);
}
}
This is another solution which uses cleaner regex. The solution is longer, but it allows more flexibility in adjusting the condition to add tag.
The idea here is to match the parenthesis containing lower case characters (the part we don't want to tag), then use the indices from the matches to identify the portion we want to enclose in tag.
// Regex for the parenthesis containing only lowercase English
// alphabet characters
static Pattern REGEX_IN_PARENTHESIS = Pattern.compile("\\([a-z]+\\)");
private static String addTag(String str) {
Matcher matcher = REGEX_IN_PARENTHESIS.matcher(str);
StringBuilder sb = new StringBuilder();
// Index that we have processed up to last append into StringBuilder
int lastAppend = 0;
while (matcher.find()) {
String bracket = matcher.group();
// The string from lastAppend to start of a match is the part
// we want to tag
// If you want to, you can easily add extra logic to process
// the string
if (lastAppend < matcher.start()) { // will not tag if empty string
sb.append("<centamp>")
.append(str, lastAppend, matcher.start())
.append("</centamp>");
}
// Append the parenthesis with lowercase English alphabet as it is
sb.append(bracket);
lastAppend = matcher.end();
}
// The string from lastAppend to end of string (no more match)
// is the part we want to tag
if (lastAppend < str.length()) {
sb.append("<centamp>")
.append(str, lastAppend, str.length())
.append("</centamp>");
}
return sb.toString();
}
Code:
static short state = 0;
static int td_number = 0;
public static void main(String[] args) {
final Pattern p = Pattern.compile("^[\\s]*?\\d+\\.\\d+[\\s]*?");
final short TD_ENTRY = 0;
final short NO_ENTRY = 1;
HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
public void handleText(char[] data, int pos) {
switch (state) {
case NO_ENTRY:
break;
case TD_ENTRY: {
// We are in the right table column
// Create string from char array
String s = new String(data);
Matcher m = p.matcher(s);
boolean b = m.matches();
// Check if data information has correct format (0.0)
if (b) {
}
}
break;
default:
break;
}
state = NO_ENTRY;
}
public void handleStartTag(HTML.Tag tag, MutableAttributeSet set, int pos) {
if (tag == HTML.Tag.TD) {
//[...]
}
}
};
Reader reader = new StringReader(html);
try {
new ParserDelegator().parse(reader, callback, false);
} catch (IOException e) {
}
}
I am trying to parse HTML with Regular Expressions. The program reads the content of td tags within an html table. The content in the table cell should fit a special pattern defined in Pattern p.
The main problem is now that the regex pattern does not match for cell content like this " 0.1".
But if I define the String s manually with the value (" 0.1") in the code the pattern matches.
Furthermore, if I copy the content of char[] data in debug mode and define s with this copied content, the pattern does not match either, although it looks the same as the manually defined value from above.
Is it possible to find out which whitespace characters are really read?
It seems that the whitespace is not always a whitespace and therefore does not match with regex class [\s]. Is this possible?
EDIT:
Thanks for answers. It was really a whitespace character (\xA0) which was not recognized by \s regex class.
To everyone who downvoted my question (really frustrating): you simply misunderstood me. Maybe the problem really was the sentence "I want to parse HTML with regex", but in fact I simply have content from an HTML table cell with unknown whitespace characters ;-).
I think I would have had the same problem with a library like jsoup.
In Java regexes, the non-breaking space character (NBSP, U+00A0) is traditionally not treated as whitespace for the purpose of matching \s. If that's what's causing your problem, you just need to add it to your existing whitespace class:
"^[\\s\\xA0]*\\d+\\.\\d+[\\s\\xA0]*$"
There are other Unicode whitespace characters that aren't matched by \s, but none of them are anywhere near as common as the NBSP.
Alternatively, if you're running Java 7+ you can specify UNICODE_CHARACTER_CLASS mode and go on using \s.
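As for finding out which characters are really in the cell, printing the code points makes an NBSP visible. The sketch below does that and also compares the default \s with UNICODE_CHARACTER_CLASS mode; the class and method names are invented for illustration.

```java
import java.util.regex.Pattern;

final class NbspCheck {

    private static final String NUMBER = "^\\s*\\d+\\.\\d+\\s*$";

    // Default mode: \s covers only [ \t\n\x0B\f\r], so U+00A0 is not whitespace.
    static final Pattern ASCII_WS = Pattern.compile(NUMBER);

    // Unicode mode (Java 7+): \s follows the Unicode White_Space property.
    static final Pattern UNICODE_WS =
            Pattern.compile(NUMBER, Pattern.UNICODE_CHARACTER_CLASS);

    // Renders every char as U+XXXX so invisible characters become visible.
    static String dumpCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String cell = "\u00A00.1"; // NBSP followed by "0.1"
        System.out.println(dumpCodePoints(cell));               // U+00A0 U+0030 U+002E U+0031
        System.out.println(ASCII_WS.matcher(cell).matches());   // false
        System.out.println(UNICODE_WS.matcher(cell).matches()); // true
    }
}
```

A dump like this on the char[] passed to handleText would have shown U+00A0 where a plain space (U+0020) was expected.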
Your code snippet is too long, but as far as I understand you just need a pattern that matches something like 0.0, 10.52 etc., i.e. floating-point numbers? Use the pattern \\d+\\.\\d+.
\d+ means 1..n digits
\. means dot. A single dot . in regex means "any character"
Here is the usage example:
String str = "123.456";
Pattern p = Pattern.compile("\\d+\\.\\d+");
Matcher m = p.matcher(str);
if (m.matches()) {
// do something.
}
BTW, pay attention that matches() has to match the entire input. If you want to match only part of it, use find() instead. I personally always use find() and put the start and end anchors ^ and $ into the regex itself when needed.