Why doesn't this Java regex compile? - java

I am trying to extract the pass number from strings of any of the following formats:
PassID_132
PassID_64
Pass_298
Pass_16
For this, I constructed the following regex:
Pass[I]?[D]?_([\d]{2,3})
and tested it in Eclipse's search dialog, where it worked fine.
However, when I use it in code, it doesn't match anything. Here's my code snippet:
String idString = filename.replaceAll("Pass[I]?[D]?_([\\d]{2,3})", "$1");
int result = Integer.parseInt(idString);
I also tried
java.util.regex.Pattern.compile("Pass[I]?[D]?_([\\d]{2,3})")
in the Expressions window while debugging, but that says "", whereas
java.util.regex.Pattern.compile("Pass[I]?[D]?_([0-9]{2,3})")
compiled, but didn't match anything. What could be the problem?

Instead of Pass[I]?[D]?_([\d]{2,3}) try this:
Pass(?:I)?(?:D)?_([\d]{2,3})

There's nothing invalid with your regex, but it is clumsier than it needs to be. You don't need character classes around single-character terms. Try this:
"Pass(?:ID)?_(\\d{2,3})"
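One possible cause of the "doesn't match anything" symptom, assuming the file name contains more than just the pass token (an extension, for instance): replaceAll only rewrites the matched part and leaves the rest of the string in place, so Integer.parseInt then fails on the leftovers. A minimal sketch that uses Matcher.find() and the capture group instead; the class name, method name, and sample file names are made up for illustration.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PassNumberDemo {
    // Simplified pattern from the answer above: optional "ID", then 2-3 digits.
    private static final Pattern PASS = Pattern.compile("Pass(?:ID)?_(\\d{2,3})");

    // Returns the pass number found anywhere in the input, or empty if there is none.
    static OptionalInt extractPassNumber(String filename) {
        Matcher m = PASS.matcher(filename);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Hypothetical file names, not taken from the question.
        String[] names = {"PassID_132.log", "Pass_16", "scan_Pass_298_final.txt", "NoPass"};
        for (String name : names) {
            System.out.println(name + " -> " + extractPassNumber(name));
        }
    }
}
```

Because find() searches for the pattern anywhere in the input, surrounding text such as a prefix or extension no longer ends up in the string passed to Integer.parseInt.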


How can you get the regular (!) source code of a page using selenium in java?

Okay, so here's the thing: All of you are probably thinking the same thing: you can use
driver.getPageSource();
And this is partially true. The only issue is that the source code gets compiled in a rather strange way where all through the code
\"
starts showing up. I tried removing this manually, but that still doesn't fix the problem completely.
One example of what I mean:
normal source code:
\"query_title\":null}",encoded_title:"WyJoZW5rIl0",ref:"unknown",logger_source:"www_main",typeahead_sid:"",tl_log:false,impression_id:"bbdb1882",filter_ids:
Selenium output:
\\\"query_title\\\":null}\",\"encoded_title\":\"WyJoZW5rIl0\",\"ref\":\"br_tf\",\"logger_source\":\"www_main\",\"typeahead_sid\":\"0.6583900225217523\",\"tl_log\":false,\"impression_id\":\"e00060b4\",\"filter_ids\"
It seems to be the same kind of thing as having to put a backslash in front of certain symbols inside quotes to stop Java from interpreting them, but I don't fully understand this behaviour and have no idea how to fix it. Hope you can help :)
Edit:
Replacing doesn't work because of the way this got compiled. An example of why it won't work is actually in the example I included earlier:
original:
}",encoded_title:
compiled version:
}\",\"encoded_title\":
Replacing \" with " would change it into:
}","encoded_title":
which differs from the original...
And if I were to replace \" with nothing, I would get:
},encoded_title:
which, sadly, still differs from the original. The way this is compiled I just don't think replacing is a viable option...
You can use javascript to get html using outerHTML or innerHTML (How do I get the HTML source from the page?):
((JavascriptExecutor) driver).executeScript("return document.documentElement.outerHTML;")
((JavascriptExecutor) driver).executeScript("return document.all[0].outerHTML")
((JavascriptExecutor) driver).executeScript("return new XMLSerializer().serializeToString(document);")
You can use the replaceAll method of the Java String class to replace the unwanted characters with the ones you want.
Old solution -
driver.getPageSource().replaceAll("\\\\\"", "\"").replaceAll("\\\\", "");
New approximate solution (approximate because the page source can contain anything in HTML) -
public class CheckString {
    static String str = "\\\\\\\"query_title\\\\\\\":null}\\\",\\\"encoded_title\\\":\\\"WyJoZW5rIl0\\\",\\\"ref\\\":\\\"br_tf\\\",\\\"logger_source\\\":\\\"www_main\\\",\\\"typeahead_sid\\\":\\\"0.6583900225217523\\\",\\\"tl_log\\\":false,\\\"impression_id\\\":\\\"e00060b4\\\",\\\"filter_ids\\\"";

    public static void main(String[] args) {
        // Each first argument is a regex: "\\\\" in Java source matches one literal backslash.
        System.out.println(str.replaceAll("\\\\\",", "\",")   // \",  becomes  ",
                .replaceAll(":\\\\\"", ":\"")                 // :\"  becomes  :"
                .replaceAll("\\\\\"", "")                     // remaining \" are dropped
                .replaceAll("\\\\\\\\", "\\\\\""));           // \\  becomes  \"
    }
}
Output -
\"query_title\":null}",encoded_title:"WyJoZW5rIl0",ref:"br_tf",logger_source:"www_main",typeahead_sid:"0.6583900225217523",tl_log:false,impression_id:"e00060b4",filter_ids
Note - In the earlier approach I forgot to escape the backslash character, which replaceAll treats as an escape character in the regex.
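Since most of the trouble here comes from escaping a backslash once for the Java literal and again for the regex, the same four steps can also be sketched with String.replace, which treats both arguments as literal text. This is only an illustration on a shortened copy of the sample from the question; the class and method names are made up, and the step order is tuned to that sample rather than being a general unescaper.

```java
class UnescapeDemo {
    // Same four steps as the replaceAll chain above, but String.replace takes
    // literal text, so only Java string-literal escaping is needed.
    static String unescape(String s) {
        return s.replace("\\\",", "\",")   // \",  becomes  ",
                .replace(":\\\"", ":\"")   // :\"  becomes  :"
                .replace("\\\"", "")       // remaining \" are dropped
                .replace("\\\\", "\\\"");  // \\   becomes  \"
    }

    public static void main(String[] args) {
        // Shortened version of the Selenium output shown in the question.
        String sample = "\\\\\\\"query_title\\\\\\\":null}\\\",\\\"encoded_title\\\":\\\"WyJoZW5rIl0\\\",";
        System.out.println(sample);
        System.out.println(unescape(sample));
    }
}
```

For this input the second printed line has the same shape as the "normal source code" fragment from the question. Real page source can contain other escape combinations, so treat this as a starting point.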

Regex: Read value between multiple brackets

I am currently working on translating a website (Smarty) with Poedit. To get all the text from the .tpl files, I'm using a regex to capture the data between {t} and {/t}. An example:
{t}Password incorrect, please try again{/t}
The regex will read Password incorrect, please try again and place it in a .po file. This is all working fine. It goes wrong when it gets a little more advanced.
Sometimes the text between the {t} tags uses a parameter, which looks like this:
{t 1=$email|escape 2=$mailbox}No $1 given, please check your $2{/t}
This is also working great.
The real problem starts when I use brackets inside a parameter, like this:
{t 1={site info='name'} 2=$mailbox}visit %1 or go to your %2{/t}
My regex stops at the first closing bracket, so the result is 2=$mailbox}visit %1 or go to your %2.
My regex looks like this:
\{t.*?\}?[}]([^\{]+)\{\/t\}|\{t\}([^\{]+)\{\/t\}
The regex is used inside a java program.
Does anybody have a way to fix this problem?
The easiest solution I see is to normalize the .tpl files. Just use a regex that matches all tags, something like this one:
{[^}]*[^{]*}
I had the same issue to solve, and normalizing worked pretty well.
The normalizing method would look like this:
final String regex = "\\{[^\\}]*[^\\{]*\\}";
private String normalizeContent(String content) {
return content.replaceAll(regex, "");
}
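To see what the normalizing regex does with the examples from the question, here is a small self-contained sketch. The class name is made up; the regex and method are the ones from the answer above. It works on these inputs because the lazy-looking second class `[^\{]*` has to back off to the last `}` before the next `{`, which is the end of the opening tag.

```java
class NormalizeDemo {
    // Regex from the answer above: matches a whole {...} tag, including a
    // nested {...} inside the opening tag's parameters.
    private static final String TAG_REGEX = "\\{[^\\}]*[^\\{]*\\}";

    static String normalizeContent(String content) {
        return content.replaceAll(TAG_REGEX, "");
    }

    public static void main(String[] args) {
        String[] samples = {
            "{t}Password incorrect, please try again{/t}",
            "{t 1=$email|escape 2=$mailbox}No $1 given, please check your $2{/t}",
            "{t 1={site info='name'} 2=$mailbox}visit %1 or go to your %2{/t}"
        };
        for (String s : samples) {
            System.out.println(normalizeContent(s));
        }
    }
}
```

One caveat: translatable text that itself contains a `}` would be cut short by this pattern, so it assumes the text between the tags has no closing braces.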

Selenium via java - sendKeys doesn't send specific chars to input

I'm seeing strange behaviour when I type into an input using sendKeys: certain characters never show up in the input at all.
What I'm trying to do:
webDriver.findElement(By.id("additionalInfo(token_autocompleteSelectInputId)")).sendKeys("(test)");
The result is that the input field now contains test) and the missing char is '('.
If I try
webDriver.findElement(By.id("additionalInfo(token_autocompleteSelectInputId)")).sendKeys("((((((((((");
the result is that the input is empty.
Has anyone faced this issue before? It happens on one very specific input in the app, and I couldn't find anything related to it in the HTML code.
Thanks in advance.
Edit: I can manually type ( in the input field.
Maybe it's a special character for Selenium. Have you tried escaping it, for example with a backslash in front of it, if that is allowed?
Edit: I found an issue report on GitHub from last year; I'm not sure whether they decided not to fix it. Executing a script to type "(" seems to be an alternative.
Source: https://github.com/seleniumhq/selenium/issues/674
Try declaring the key as a string first:
String keyToSend = "(test)";
webDriver.findElement(By.id("additionalInfo(token_autocompleteSelectInputId)")).sendKeys(keyToSend);
In this case you should try using JavascriptExecutor, as below:
WebElement el = webDriver.findElement(By.id("additionalInfo(token_autocompleteSelectInputId)"));
((JavascriptExecutor)webDriver).executeScript("arguments[0].value = arguments[1]", el, "(test)");
Hope it helps..:)

Matcher.find() only find the last match in JUnit Test

I have this weird problem. I have this Java method that works fine in my program:
/*
* Extract all image urls from the html source code
*/
public void extractImageUrlFromSource(ArrayList<String> imgUrls, String html) {
Pattern pattern = Pattern.compile("\\<[ ]*[iI][mM][gG][\t\n\r\f ]+.*[sS][rR][cC][ ]*=[ ]*\".*\".*>");
Matcher matcher = pattern.matcher(html);
while (matcher.find()) {
imgUrls.add(extractImgUrlFromTag(matcher.group()));
}
}
This method works fine in my Java application, but whenever I test it in a JUnit test, it only adds the last URL to the ArrayList:
/**
* Test of extractImageUrlFromSource method, of class ImageDownloaderProc.
*/
@Test
public void testExtractImageUrlFromSource() {
System.out.println("extractImageUrlFromSource");
String html = "<html><title>fdjfakdsd</title><body><img kfjd src=\"http://image1.png\">df<img dsd src=\"http://image2.jpg\"></body><img dsd src=\"http://image3.jpg\"></html>";
ArrayList<String> imgUrls = new ArrayList<String>();
ArrayList<String> expimgUrls = new ArrayList<String>();
expimgUrls.add("http://image1.png");
expimgUrls.add("http://image2.jpg");
expimgUrls.add("http://image3.jpg");
ImageDownloaderProc instance = new ImageDownloaderProc();
instance.extractImageUrlFromSource(imgUrls, html);
imgUrls.stream().forEach((x) -> {
System.out.println(x);
});
assertArrayEquals(expimgUrls.toArray(), imgUrls.toArray());
}
Is JUnit at fault? Remember, it works fine in my application.
I think there is a problem in the regex:
"\\<[ ]*[iI][mM][gG][\t\n\r\f ]+.*[sS][rR][cC][ ]*=[ ]*\".*\".*>"
The problem (or at least one problem) is the first .*. The + and * metacharacters are greedy, which means that they will attempt to match as many characters as possible. In your unit test, I think that what is happening is that the .* is matching everything up to the last 'src' in the input string.
I suspect that the reason that this "works" in your application is that the input data is different. Specifically, I suspect that you are running your application on input files where each img element is on a different line. Why does this make a difference? Well, it turns out that by default, the . metacharacter does not match line breaks.
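The greedy-matching explanation can be checked on the single-line test string from the question. The sketch below counts matches of the original pattern and compares it with an assumed alternative that uses negated character classes (`[^>]`, `[^"]`) so a match cannot run past the end of a tag. The class name, helper names, and the alternative pattern are illustrative, not the asker's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcDemo {
    // Pattern from the question: the greedy ".*" can run across several tags.
    static final Pattern GREEDY = Pattern.compile(
            "\\<[ ]*[iI][mM][gG][\t\n\r\f ]+.*[sS][rR][cC][ ]*=[ ]*\".*\".*>");

    // Assumed alternative: negated classes keep each match inside one tag.
    static final Pattern BOUNDED = Pattern.compile(
            "<\\s*img\\s+[^>]*?src\\s*=\\s*\"([^\"]*)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Counts how many separate matches find() produces.
    static int countMatches(Pattern p, String html) {
        Matcher m = p.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Collects the src value (capture group 1) of every img tag.
    static List<String> extractSrc(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = BOUNDED.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<html><title>fdjfakdsd</title><body><img kfjd src=\"http://image1.png\">df"
                + "<img dsd src=\"http://image2.jpg\"></body><img dsd src=\"http://image3.jpg\"></html>";
        System.out.println("greedy matches:  " + countMatches(GREEDY, html));
        System.out.println("bounded matches: " + countMatches(BOUNDED, html));
        System.out.println(extractSrc(html));
    }
}
```

On this input the original pattern yields a single match spanning from the first img tag to the end of the string, while the bounded pattern yields one match per tag.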
For what it is worth, using regexes to "parse" HTML is generally thought to be a bad idea. For a start, it is horribly fragile. People who do a lot of this kind of stuff tend to use proper HTML parsers ... like "jsoup".
Reference: RegEx match open tags except XHTML self-contained tags
I wish I could comment as I'm not sure about this, but it might be worth mentioning...
This line looks like it's extracting the URLs from the wrong array...did you mean to extract from expimgUrls instead of imgUrls?
instance.extractImageUrlFromSource(imgUrls, html);
I haven't gotten this far in my Java education so I may be incorrect...I just looked over the code and noticed it. I hope someone else who knows more can actually give you a solid answer!

Regex to Extract First Part of URL

I need a java regex to extract parts of a URL.
For example, take the following URLs:
http://localhost:81/example
https://test.com/test
http://test.com/
I would want my regex expression to return:
http://localhost:81
https://test.com
http://test.com
I will be using this in a Java patcher.
This is what I have so far, problem is it takes the whole URLs:
^https?:\/\/(?!.*:\/\/)\S+
import java.net.URL;
//snip
URL url = new URL(urlString);
return url.getProtocol() + "://" + url.getAuthority();
The right tool for the right job.
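A self-contained version of the same idea, as a sketch. It swaps in java.net.URI for the URL class used above (URI.create throws an unchecked exception on malformed input, which keeps the example short); the class and method names are made up.

```java
import java.net.URI;

class OriginDemo {
    // Returns scheme + "://" + authority (host and optional port), dropping the path.
    static String schemeAndAuthority(String urlString) {
        URI uri = URI.create(urlString);
        return uri.getScheme() + "://" + uri.getAuthority();
    }

    public static void main(String[] args) {
        // The three sample URLs from the question.
        String[] urls = {"http://localhost:81/example", "https://test.com/test", "http://test.com/"};
        for (String u : urls) {
            System.out.println(schemeAndAuthority(u));
        }
    }
}
```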
Building off your attempt, try this:
^https?://[^/]+
I'm assuming that you want to capture everything until the first / after http://? (That's what I was getting from your examples - if not, please post some more).
Are these URLs given as one input, or is each a separate string?
Edit: It was pointed out that there were unnecessary escapes, so fixed to a more condensed version
Language independent answer:
For the whitespace: replace /^\s+/ with the empty string.
For removing the path information from the URL, if you can assume there aren't any slashes in the path (i.e. you're not dealing with http://localhost:81/foo/bar/baz), replace /\/[^\/]+$/ with the empty string. If there might be more slashes, you might try something like replacing /(^\s*.*:\/\/[^\/]+)\/.*/ with $1.
A simple one: ^(https?://[^/]+)
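For completeness, a minimal sketch of using that pattern with a Matcher, assuming each URL arrives as its own string. The class and method names are illustrative.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlPrefixDemo {
    // Scheme, "://", then everything up to (not including) the first "/".
    private static final Pattern PREFIX = Pattern.compile("^(https?://[^/]+)");

    // Returns the scheme-and-host prefix, or empty if the input is not http(s).
    static Optional<String> prefix(String url) {
        Matcher m = PREFIX.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] urls = {"http://localhost:81/example", "https://test.com/test", "http://test.com/"};
        for (String u : urls) {
            System.out.println(prefix(u).orElse("(no match)"));
        }
    }
}
```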
