Regex expression in java htmlunit - java

I am trying to advance my knowledge of Java by automating web page scraping and form input. I have experimented with jsoup and now HtmlUnit. I found an HtmlUnit example that I am trying to run.
public class GoogleHtmlUnitTest {
static final WebClient browser;
static {
browser = new WebClient();
browser.getOptions().setJavaScriptEnabled(false);
// browser.setJavaScriptEnabled(false);
}
public static void main(String[] arguments) {
boolean result;
try {
result = searchTest();
} catch (Exception e) {
e.printStackTrace();
result = false;
}
System.out.println("Test " + (result? "passed." : "failed."));
if (!result) {
System.exit(1);
}
}
private static boolean searchTest() {
HtmlPage currentPage;
try {
currentPage = (HtmlPage) browser.getPage("http://www.google.com");
} catch (Exception e) {
System.out.println("Could not open browser window");
e.printStackTrace();
return false;
}
System.out.println("Simulated browser opened.");
try {
((HtmlTextInput) currentPage.getElementByName("q")).setValueAttribute("qa automation");
currentPage = currentPage.getElementByName("btnG").click();
System.out.println("contents: " + currentPage.asText());
return containsPattern(currentPage.asText(), "About .* results");
} catch (Exception e) {
System.out.println("Could not search");
e.printStackTrace();
return false;
}
}
public static boolean containsPattern(String string, String regex) {
Pattern pattern = Pattern.compile(regex);
// Check for the existence of the pattern
Matcher matcher = pattern.matcher(string);
return matcher.find();
}
}
It works, apart from some HtmlUnit warnings that, according to advice I found on Stack Overflow, can be ignored. The program runs correctly, so I am taking that advice and ignoring them.
Jul 31, 2016 7:29:03 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNING: CSS error: 'https://www.google.com/search?q=qa+automation&sa=G&gbv=1&sei=_eCdV63VGMjSmwHa85kg' [1:1467] Error in declaration. '*' is not allowed as first char of a property.
My problem at the moment is the regex expression being used for the search. If I am understanding this correctly, “qa automation” is being googled and the retrieved page is being searched by:
return containsPattern(currentPage.asText(), "About .* results");
What is throwing me is “About .* results”. This is the regex, but I don't get how it is being interpreted. What is being searched for on the retrieved page?

.* means "zero or more of any character," in other words, a complete wildcard. It can match
About 28 results
About 2864 results
About 2,864 results
About ERROR results
About  results (with two spaces between the words, since .* may match nothing)
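To check candidates like these yourself, a small sketch along the following lines could be used. It reuses the containsPattern helper from the question; the class name and the sample strings are only for illustration. Note that a string with just one space between the two words does not match, because the pattern contains two literal spaces.

```java
import java.util.regex.Pattern;

class AboutResultsDemo {
    // Same idea as the helper in the question: true if the regex occurs anywhere in the text.
    static boolean containsPattern(String text, String regex) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String regex = "About .* results";
        String[] samples = {
            "About 2,864 results",
            "About ERROR results",
            "About  results",      // two spaces: .* matches the empty string
            "About results",       // one space only: no match
            "Roughly 28 results"   // does not contain "About "
        };
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + containsPattern(s, regex));
        }
    }
}
```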
(Response to comments.)
To be honest, you should find a quick regular expressions tutorial. You're missing some very basic things and instead relying on your own intuitive sense of how "searching" should work, which is leading to confusion.
I like teaching though, so here's a little more :-)
Go to this RegExr link. I already set it up with this expression:
/^About .* results$/gm
Ignore the /^ and the $/gm. (If you really want to know, the two slashes are just the conventional notation for regular expressions. The ^ and $ are "anchors" that force a "full match", which is why it seemed like "About" had to be in position 0. Whatever regex method you're using seems to force anchors; in Java, matches() requires a full match, while find() does not. The g is a flag that just means "Highlight every match," and the m is a flag that means, "Treat every line as a separate entry.") Anyway, back to the main expression:
About .* results
And its matches:
See how if you put a character on either side, it's no longer a match? Again, that's because of anchoring. The expression expects "A" as the first character, so "x" fails. The expression also expects the last character to be "s", so "x" would fail there too. But why did About results fail? It's because there's a space around each side of the .*. The .* wildcard is allowed to match nothing, but the spaces have to match just like letters and numbers. So a single space won't cut it; you need at least two.
You wrote that you tried 230 .* results. See, you're not understanding that regex works character by character, with certain "special" characters you can use. Your expression means, "A string that begins with 230, a space, then anything, a space, "results", and nothing after."
[...] how would I code regex to find the "230" in any position followed by "results", ie "foobar 230 foobar2 results"?
In other words, you want to find a string that starts with anything, has 230 somewhere, has more of anything, a space, "results", and nothing more:
.*230.* results
Do you want the exact number, 230?
.* 230 results
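The difference between an anchored (whole-string) match and a substring match can be tried out in Java with a sketch like the one below. The class and method names are made up for illustration; matches() and find() are the standard java.util.regex.Matcher methods.

```java
import java.util.regex.Pattern;

class AnchoringDemo {
    // Whole-string match, equivalent to wrapping the regex in ^ and $.
    static boolean fullMatch(String text, String regex) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    // Substring match: the regex may occur anywhere in the text.
    static boolean partialMatch(String text, String regex) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "foobar 230 foobar2 results";
        System.out.println(fullMatch(text, "230 .* results"));     // false: text does not start with 230
        System.out.println(partialMatch(text, "230 .* results"));  // true: found inside the text
        System.out.println(fullMatch(text, ".*230.* results"));    // true: leading .* absorbs "foobar "
        System.out.println(fullMatch("About 230 results", ".* 230 results")); // true
    }
}
```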

Related

Using replace function in Java

I am trying to use replace function to replace "('", "'" and "')" from the below string -
String foo = "('UK', 'IT', 'DE')";
I am trying to use the below code in order to do this operation -
(foo.contains("('")?foo.replaceAll("('", ""):foo.replace("'",""))?foo.replaceAll("')",""):foo
but it's failing with -
java.util.regex.PatternSyntaxException: Unclosed group near index 2
Am I missing anything here?
replaceAll takes a regular expression as its search pattern. Since ( is a special character in regular expressions, it needs to be escaped: "\\(". Furthermore, there's no need for the contains test:
final String bar = foo.replaceAll("\\('", "") …
Lastly, you can combine all your replacements into one regular expression:
final String bar = foo.replaceAll("\\(?'([^']*)'\\)?", "$1");
// Output: UK, IT, DE
This will replace each occurrence of a single-quoted part inside your string with its content without the quotes, and it will allow (and discard) surrounding opening and closing parentheses.
foo.replaceAll("[(')]", "") will do the job :)
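As a rough comparison, the sketch below strips the quotes and parentheses in two ways: with String.replace, which treats its arguments as literal text, and with the character-class regex. The class and method names are hypothetical.

```java
class QuoteStripDemo {
    // String.replace works on plain text, so "(" needs no escaping here.
    static String viaLiteralReplace(String s) {
        return s.replace("('", "").replace("')", "").replace("'", "");
    }

    // Inside a character class [...] the parentheses are ordinary characters.
    static String viaCharacterClass(String s) {
        return s.replaceAll("[(')]", "");
    }

    public static void main(String[] args) {
        String foo = "('UK', 'IT', 'DE')";
        System.out.println(viaLiteralReplace(foo)); // UK, IT, DE
        System.out.println(viaCharacterClass(foo)); // UK, IT, DE
    }
}
```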
As other answer(s) and the error message point out, replaceAll() deals with regular expressions, where the opening parenthesis ( has special meaning. Some characters have special meaning even in the replacement argument for the same reason.
If you want to be absolutely sure that your strings are going to behave as strings, there are two built-in "quote" methods (both are static) for "neutralizing" patterns:
Pattern.quote() for wrapping the replacee pattern
Matcher.quoteReplacement() for wrapping the replacement
Example code attempting to replaceAll() two ( to $ symbols:
System.out.println("Naive:");
try {
System.out.println("(("
.replaceAll("(", "$"));
} catch (Exception ex) {
System.out.println(ex);
}
System.out.println("\nPattern.quote:");
try {
System.out.println("q: "+Pattern.quote("("));
System.out.println("(("
.replaceAll(Pattern.quote("("), "$"));
} catch (Exception ex) {
System.out.println(ex);
}
System.out.println("\nPattern.quote+Matcher.quoteReplacement:");
try {
System.out.println("q: "+Pattern.quote("("));
System.out.println("qR: "+Matcher.quoteReplacement("$"));
System.out.println("(("
.replaceAll(Pattern.quote("("), Matcher.quoteReplacement("$")));
} catch (Exception ex) {
System.out.println(ex);
}
Output:
Naive:
java.util.regex.PatternSyntaxException: Unclosed group near index 1
(
Pattern.quote:
q: \Q(\E
java.lang.IllegalArgumentException: Illegal group reference: group index is missing
Pattern.quote+Matcher.quoteReplacement:
q: \Q(\E
qR: \$
$$
Of course, by the time one learns about these methods, one has usually long since grown accustomed to escaping the special characters manually.

Pattern for last occurring element

so I recently wrote a piece of regex like this:
replaceInputWithLatex("sinus(?<formula>.*?)moveout",
"\\\\sin(${formula})");
replaceInputWithLatex("sinus(?<formula>.+)",
"\\\\sin" + changeColorOfString("Red", "(") + "${formula}" + changeColorOfString("Red", ")"));
replaceInputWithLatex("sinus",
"\\\\sin" + noInputBox);
Here's replaceInputWithLatex function:
private static void replaceInputWithLatex(String pattern, String latexOutput) {
regexPattern = Pattern.compile(pattern);
regexMatcher = regexPattern.matcher(mathFormula);
while(regexMatcher.find()){
Log.d("FOUND", regexMatcher.group());
mathFormula = regexMatcher.replaceAll(latexOutput);
replaceInputWithLatex(pattern, latexOutput);
}
}
Let's say I input a string: "sinus1+sinus2x+3moveout".
I would like the 1st match to take this string: "sinus2x+3moveout". And replace it. And in the next iteration match "sinus1+(already_converted)".
However, so far it takes an entire string first. Here are "FOUND" logs:
11-12 19:26:40.750 30244-30244/com.example.user.signum D/FOUND: sinus1+sinus2x+3moveout
11-12 19:26:40.750 30244-30244/com.example.user.signum D/FOUND: sinus2x+3
The LaTeX output looks like this (I want both outer parentheses to be red, the reverse of how it is now):
What pattern shall I use? (I've been trying recursive approach, but I haven't come up with a solution yet)
I've mocked up a working example in JavaScript.
Basically, we keep hitting the input with the same regex. We mark the tail end we've already handled with some special characters, then handle the next one to the left, and so on. When the markers cover the whole string, we're done.
console.clear();
var input = "sinus1+sinus2x+3moveout";
do {
console.log(input);
input = input.replace(/(sinus\dx?\+)?(?:\d?moveout|(<<handled>>))$/, function(m,lead){
return lead ? "<<handled>>" : "{{all done}}"
});
} while (input.indexOf("<<") !== -1);
console.log(input);
It's a bit pseudo-codey, but I hope this gives you some useful ideas.
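A Java take on the same right-to-left idea might look like the sketch below. It is not the asker's replaceInputWithLatex; it assumes the goal is simply to convert the last remaining "sinus" on each pass, uses a negative lookahead to find it, and leaves out the colouring. The class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastSinusDemo {
    // "sinus" that is not followed by another "sinus", then the shortest text
    // up to "moveout" or, if there is no "moveout", up to the end of the input.
    static final Pattern LAST_SINUS =
            Pattern.compile("sinus(?!.*sinus)(?<formula>.*?)(?:moveout|$)");

    // Converts only the rightmost "sinus"; returns the input unchanged if none is left.
    static String convertLast(String s) {
        Matcher m = LAST_SINUS.matcher(s);
        if (!m.find()) {
            return s;
        }
        return s.substring(0, m.start())
                + "\\sin(" + m.group("formula") + ")"
                + s.substring(m.end());
    }

    // Repeats the conversion until the string stops changing.
    static String convertAll(String s) {
        String previous;
        do {
            previous = s;
            s = convertLast(s);
        } while (!s.equals(previous));
        return s;
    }

    public static void main(String[] args) {
        System.out.println(convertAll("sinus1+sinus2x+3moveout"));
    }
}
```

Building the result with substring avoids having to escape the backslash and any $ for replaceAll.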

I can't get the url with Pattern.compile

What I really want is to return the URLs that are in the txt variable. The URLs vary, so I don't know which regular expression to use, or whether part of my code is simply poorly written. Sorry, I only speak Spanish and am using Google Translate.
//I can't get the url with Pattern.compile
//My code example::::: in the works :(
String txt="sources: [{file:\"http://pla.cdn19.fx.rrrrrr.com/luq5t4nidtixexzw6wblbiexs7hg2hdu4coqdlltx6t3hu3knqhbfoxp7jna/normal.mp4\",label:\"360p\"}],sources: [{file:\"http://pla.cdn19.fx.rrrrrr.com/luq5t4nidtixexzw6wblbiexs7hg2hdu4coqdlltx6t3hu3knqhbfoxp7jna/normal.mp4\",label:\"360p\"}]";
ArrayList<String> getfi = new ArrayList<String>();
Matcher matcher = Pattern.compile("sources: [{file:\"(.*)\"").matcher(txt);
if (matcher.find()) {
while(matcher.find()) {
getfi.add(matcher.group(1));
}
System.out.println(getfi);
} else {
System.exit(1);
}
Pattern.compile("sources: [{file:\"(.*)\"")
Your regex is wrong: both [ and { are special characters, so they must be escaped. That is why you get PatternSyntaxException: Unclosed character class near index 21, which you didn't mention in your question.
Also the pattern will match the entire string, except for the last two characters.
if (matcher.find()) {
while(matcher.find()) {
The find() call in the if statement consumes the first match. Since that first match is the entire text except the last two characters, there is no second match for the find() call in the while loop, so the loop is never entered.
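The effect of the extra find() can be seen on a smaller input. The sketch below uses a made-up digit pattern and sample text, unrelated to the URLs in the question, purely to show that the if consumes one match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoopDemo {
    // if (find) + while (find): the first match is consumed and never collected.
    static List<String> buggy(String text) {
        Matcher m = Pattern.compile("\\d+").matcher(text);
        List<String> out = new ArrayList<>();
        if (m.find()) {
            while (m.find()) {
                out.add(m.group());
            }
        }
        return out;
    }

    // A single while (find) loop visits every match.
    static List<String> fixed(String text) {
        Matcher m = Pattern.compile("\\d+").matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(buggy("10 20 30")); // [20, 30]
        System.out.println(fixed("10 20 30")); // [10, 20, 30]
    }
}
```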
To make it work, escape the special characters, change .* to not be greedy, and fix the loop:
String txt="sources: [{file:\"http://pla.cdn19.fx.rrrrrr.com/luq5t4nidtixexzw6wblbiexs7hg2hdu4coqdlltx6t3hu3knqhbfoxp7jna/normal.mp4\",label:\"360p\"}],sources: [{file:\"http://pla.cdn19.fx.rrrrrr.com/luq5t4nidtixexzw6wblbiexs7hg2hdu4coqdlltx6t3hu3knqhbfoxp7jna/normal.mp4\",label:\"360p\"}]";
Matcher matcher = Pattern.compile("sources: \\[\\{file:\"(.*?)\"").matcher(txt);
ArrayList<String> getfi = new ArrayList<String>();
while (matcher.find()) {
getfi.add(matcher.group(1));
}
if (getfi.isEmpty()) {
System.exit(1);
}
System.out.println(getfi);
WARNING:
Notice that sometimes there is a space after :, and sometimes not. That is perfectly valid for JSON. JSON text may contain whitespace, including newlines, so using a simple regex is not a good idea.
Use a JSON parser instead.

Check string using regex

I have the following string:
String s = "http://www.[VP_ANY].com:8080/servlet/[VP_ALL]";
I need to check if this string has the words [VP_ANY] or [VP_ALL]. I tried something like this (and many combinations), but it doesn't work:
Pattern.compile("\b(\\\\[VP_ANY\\\\]|\\\\[VP_ALL\\\\])\b").matcher(s).matches()
What am I doing wrong?
I tried the following:
s = "www.[VP_ANY].com:8080/servlet/[VP_ALL]";
System.out.println(Pattern.compile("\[VP_ANY\]").matcher(s).matches());
System.out.println(s.replaceAll("\[VP_ANY\]", "A"));
The first 'System.out' returns false, and the second one returns the replacement correctly.
I'm escaping the "[" and "]" characters with 2 backslashes, but when I save the post just one is showed. But I'm using 2 ...
Pattern.compile("\b(\\\\[VP_ANY\\\\]|\\\\[VP_ALL\\\\])\b").matcher(s).matches()
String s = "http://www.[VP_ANY].com:8080/servlet/[VP_ALL]";
                      ^^      ^^                ^^      ^
                      NoWB    NoWB              NoWB    WB
Your regex will not work because there is no word boundary between . and [, between ] and ., or between / and [.
Additionally, I think your escaping is off: the word boundaries need one more backslash each (\\b), and the brackets need two fewer (\\[ and \\]).
So, since the word boundaries are not working, you should be fine with the following, used with find() rather than matches(), because matches() requires the whole string to match:
Pattern.compile("\\[VP_(?:ANY|ALL)\\]")
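A minimal sketch of the check, assuming find() is used so that the placeholder may occur anywhere in the string (matches() would only succeed if the whole string were a placeholder). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class PlaceholderCheck {
    // Literal [VP_ANY] or [VP_ALL]; the brackets are escaped for the regex engine.
    static final Pattern PLACEHOLDER = Pattern.compile("\\[VP_(?:ANY|ALL)\\]");

    // True if either placeholder occurs anywhere in the string.
    static boolean hasPlaceholder(String s) {
        return PLACEHOLDER.matcher(s).find();
    }

    public static void main(String[] args) {
        String s = "http://www.[VP_ANY].com:8080/servlet/[VP_ALL]";
        System.out.println(hasPlaceholder(s));                  // true
        System.out.println(PLACEHOLDER.matcher(s).matches());   // false: not a whole-string match
        System.out.println(hasPlaceholder("http://www.example.com/")); // false
    }
}
```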
Try this one (using find(), since String.matches() only succeeds if the entire string matches the pattern):
try {
boolean foundMatch = Pattern.compile("(?i)\\bVP_(?:ANY|ALL)\\b").matcher(subjectString).find();
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
or this
try {
boolean foundMatch = Pattern.compile("(?i)\\[VP_(?:ANY|ALL)\\]").matcher(subjectString).find();
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
Try This
\[VP_ANY\]|\[VP_ALL\]
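My go at Java (with find(), because String.matches() would require the whole string to match):
try {
boolean foundMatch = Pattern.compile("\\[VP_ANY\\]|\\[VP_ALL\\]").matcher("www.[VP_ANY].com:8080/servlet/[VP_ALL]").find();
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}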
"http://www.[VP_ANY].com:8080/servlet/[VP_ALL]".replaceAll ("http://www.(\\[VP_ANY\\]).com:8080/servlet/(\\[VP_ALL\\])", "$1:$2")
res117: java.lang.String = [VP_ANY]:[VP_ALL]
If you're looking for a literal [, you have to mask (escape) it, otherwise it starts a character class like [A-Z].
Now if you read the regex from a file or a JTextField at runtime, that's all. But if you write it to your source code, the compiler will see the \ and treat it as a general masking, which might be needed to mask quotes like in
char apo = '\'';
String quote = "He said: \"Whut?\"";
So you have to mask it again, because only "\\" means "\".
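The two levels of masking can be made visible by printing the string that the regex engine actually receives. This is only an illustrative sketch; the class name and sample text are made up.

```java
import java.util.regex.Pattern;

class EscapeLevels {
    // Written in Java source with doubled backslashes.
    static final String REGEX = "\\[VP_ANY\\]";

    static boolean contains(String text) {
        return Pattern.compile(REGEX).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(REGEX);          // what the regex engine receives: \[VP_ANY\]
        System.out.println(REGEX.length()); // 10 characters, one backslash per bracket
        System.out.println(contains("www.[VP_ANY].com")); // true
    }
}
```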
So, for development, to avoid getting too confused, it is a fine idea to have a simple GUI app with 2 or 3 text fields for testing regexps. Once the expression works, you only have to add the second level of masking; while developing it, you can leave that level out.
Divide et impera, as the ancient Roman programmers told us.

Validating an infix notation possibly using regex

I am thinking of validating an infix expression which consists of letters as operands and +-*/$ as operators [eg: A+B-(C/D)$(E+F)] using regex in Java. Is there any better way? Is there any regex pattern which I can use?
I am not familiar with the language syntax of infix, but you can certainly do a first pass validation check which simply verifies that all of the characters in the string are valid (i.e. acceptable characters = A-Z, +, -, *, /, $, ( and )). Here is a Java program which checks for valid characters and also includes a function which checks for unbalanced (possibly nested) parentheses:
import java.util.regex.*;
public class TEST {
public static void main(String[] args) {
String s = "A+B-(C/D)$(E+F)";
Pattern regex = Pattern.compile(
"# Verify that a string contains only specified characters.\n" +
"^ # Anchor to start of string\n" +
"[A-Z+\\-*/$()]+ # Match one or more valid characters\n" +
"$ # Anchor to end of string\n",
Pattern.COMMENTS);
Matcher m = regex.matcher(s);
if (m.find()) {
System.out.print("OK: String has only valid characters.\n");
} else {
System.out.print("ERROR: String has invalid characters.\n");
}
// Verify the string contains only balanced parentheses.
if (checkParens(s)) {
System.out.print("OK: String has no unbalanced parentheses.\n");
} else {
System.out.print("ERROR: String has unbalanced parentheses.\n");
}
}
// Checks whether the string contains any unbalanced parentheses.
public static boolean checkParens(String s) {
Pattern regex = Pattern.compile("\\(([^()]*)\\)");
Matcher m = regex.matcher(s);
// Loop removes matching nested parentheses from inside out.
while (m.find()) {
// quoteReplacement keeps '$' (a valid operator here) from being read as a group reference.
s = m.replaceFirst(Matcher.quoteReplacement(m.group(1)));
m.reset(s);
}
regex = Pattern.compile("[()]");
m = regex.matcher(s);
// Check if there are any erroneous parentheses left over.
if (m.find()) {
return false; // String has unbalanced parens.
}
return true; // String has balanced parens.
}
}
This does not validate the grammar, but may be useful as a first test to filter out obviously bad strings.
Possibly overkill, but you might consider using a fully fledged parser generator such as ANTLR (http://www.antlr.org/). With ANTLR you can create rules that will generate the java code for you automatically. Assuming you have only got valid characters in the input this is a syntax analysis problem, otherwise you would want to validate the character stream with lexical analysis first.
For syntax analysis you might have rules like:
PLUS : '+' ;
etc...
expression:
term ( ( PLUS | MINUS | MULTIPLY | DIVIDE )^ term )*
;
term:
constant
| OPENPAREN! expression CLOSEPAREN!
;
With constant being integers/reals whatever. If the ANTLR generated parser code can't match the input with your parser rules it will throw an exception so you can determine whether code is valid.
You could probably do it with a recursive PCRE pattern, but this may be a PITA. (Note that java.util.regex does not support recursive patterns.)
Since you only want to validate it, you can keep it very simple. Just use a stack: push the elements one by one and remove valid expressions as you go.
Define some rules, for example:
an operator is only allowed if there is a letter on top of the stack
a letter or a parenthesis is only allowed if there is an operator on top of the stack
everything is allowed if the stack is empty
Then:
if you encounter a closing parenthesis, remove everything up to the opening parenthesis.
if you encounter a letter, remove the expression.
after every removal of an expression, push a dummy letter and repeat the previous steps.
if the result is a single letter, the expression is valid.
Or something like that.
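The rules above are loosely specified, so the following is only a sketch of a related single-pass check, not the exact stack procedure described: it tracks whether an operand or an operator is expected next and uses a depth counter in place of the stack. The class and method names are made up, and operands are assumed to be single uppercase letters.

```java
class InfixValidator {
    static final String OPERATORS = "+-*/$";

    // Returns true if s is a well-formed infix expression over A-Z,
    // the operators + - * / $ and balanced parentheses.
    static boolean isValid(String s) {
        int depth = 0;                 // number of currently open parentheses
        boolean expectOperand = true;  // true at the start and after an operator or '('
        for (char c : s.toCharArray()) {
            if (expectOperand) {
                if (c == '(') {
                    depth++;
                } else if (c >= 'A' && c <= 'Z') {
                    expectOperand = false;
                } else {
                    return false;      // operator or ')' where an operand was needed
                }
            } else {
                if (c == ')') {
                    if (--depth < 0) {
                        return false;  // closing parenthesis without an opening one
                    }
                } else if (OPERATORS.indexOf(c) >= 0) {
                    expectOperand = true;
                } else {
                    return false;      // two operands in a row, '(' after an operand, or bad character
                }
            }
        }
        // Must end on an operand or ')' with every parenthesis closed.
        return !expectOperand && depth == 0;
    }

    public static void main(String[] args) {
        System.out.println(isValid("A+B-(C/D)$(E+F)")); // true
        System.out.println(isValid("A+"));              // false
        System.out.println(isValid("(A+B"));            // false
    }
}
```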
