Matcher.find() only finds the last match in JUnit test - java

I have this weird problem. I have this Java method that works fine in my program:
/*
* Extract all image urls from the html source code
*/
public void extractImageUrlFromSource(ArrayList<String> imgUrls, String html) {
    Pattern pattern = Pattern.compile("\\<[ ]*[iI][mM][gG][\t\n\r\f ]+.*[sS][rR][cC][ ]*=[ ]*\".*\".*>");
    Matcher matcher = pattern.matcher(html);
    while (matcher.find()) {
        imgUrls.add(extractImgUrlFromTag(matcher.group()));
    }
}
This method works fine in my Java application, but whenever I run it in a JUnit test, it only adds the last URL to the ArrayList:
/**
* Test of extractImageUrlFromSource method, of class ImageDownloaderProc.
*/
@Test
public void testExtractImageUrlFromSource() {
    System.out.println("extractImageUrlFromSource");
    String html = "<html><title>fdjfakdsd</title><body><img kfjd src=\"http://image1.png\">df<img dsd src=\"http://image2.jpg\"></body><img dsd src=\"http://image3.jpg\"></html>";
    ArrayList<String> imgUrls = new ArrayList<String>();
    ArrayList<String> expimgUrls = new ArrayList<String>();
    expimgUrls.add("http://image1.png");
    expimgUrls.add("http://image2.jpg");
    expimgUrls.add("http://image3.jpg");
    ImageDownloaderProc instance = new ImageDownloaderProc();
    instance.extractImageUrlFromSource(imgUrls, html);
    imgUrls.stream().forEach((x) -> {
        System.out.println(x);
    });
    assertArrayEquals(expimgUrls.toArray(), imgUrls.toArray());
}
Is JUnit at fault? Remember, it works fine in my application.

I think there is a problem in the regex:
"\\<[ ]*[iI][mM][gG][\t\n\r\f ]+.*[sS][rR][cC][ ]*=[ ]*\".*\".*>"
The problem (or at least one problem) is the first .*. The + and * quantifiers are greedy, which means that they will attempt to match as many characters as possible. In your unit test, I think that what is happening is that the .* is matching everything up to the last 'src' in the input string, so the whole input yields a single match.
I suspect that the reason that this "works" in your application is that the input data is different. Specifically, I suspect that you are running your application on input files where each img element is on a different line. Why does this make a difference? Well, it turns out that by default, the . metacharacter does not match line breaks.
For what it is worth, using regexes to "parse" HTML is generally thought to be a bad idea. For a start, it is horribly fragile. People who do a lot of this kind of stuff tend to use proper HTML parsers ... like "jsoup".
Reference: RegEx match open tags except XHTML self-contained tags
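To make the greedy-versus-bounded difference concrete, here is a minimal sketch that runs two patterns over the single-line HTML from the unit test. The class name ImgSrcDemo, the helper names, and the BOUNDED pattern are my own illustration, not the asker's code. BOUNDED uses [^>]* and [^"]* so a match cannot run past the end of one tag, and it captures the URL in group 1. It is still a regex, so the caveat about fragile HTML parsing applies to it as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; all names here are illustrative.
class ImgSrcDemo {
    // The pattern from the question: every .* is greedy.
    static final Pattern GREEDY = Pattern.compile(
            "\\<[ ]*[iI][mM][gG][\t\n\r\f ]+.*[sS][rR][cC][ ]*=[ ]*\".*\".*>");

    // Assumed alternative: negated character classes keep a match inside one tag.
    static final Pattern BOUNDED = Pattern.compile(
            "<\\s*img\\s[^>]*?\\bsrc\\s*=\\s*\"([^\"]*)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Counts how many times find() succeeds on the input.
    static int countMatches(Pattern p, String html) {
        Matcher m = p.matcher(html);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Collects the src value (group 1) of every img tag matched by BOUNDED.
    static List<String> extractSrc(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = BOUNDED.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<html><title>fdjfakdsd</title><body>"
                + "<img kfjd src=\"http://image1.png\">df"
                + "<img dsd src=\"http://image2.jpg\"></body>"
                + "<img dsd src=\"http://image3.jpg\"></html>";
        System.out.println("greedy matches:  " + countMatches(GREEDY, html));
        System.out.println("bounded matches: " + countMatches(BOUNDED, html));
        System.out.println(extractSrc(html));
    }
}
```

On this input the greedy pattern finds one long match that starts at the first img tag and runs to the end of the string, while the bounded pattern finds each tag separately.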

I wish I could comment as I'm not sure about this, but it might be worth mentioning...
This line looks like it's extracting the URLs from the wrong array...did you mean to extract from expimgUrls instead of imgUrls?
instance.extractImageUrlFromSource(imgUrls, html);
I haven't gotten this far in my Java education so I may be incorrect...I just looked over the code and noticed it. I hope someone else who knows more can actually give you a solid answer!

Related

Regexp for URL cache in Java

I need to match a certain Regexp pattern in Java and I think I'm very close, could anyone with more experience help? I have been testing it for at least a few hours and couldn't come to a solution yet.
This regexp is built from a URL and must represent a "key" for that URL. Depending on the source, the URL may change a lot, but a few things are always there. Strings that it must already match:
http://fictionalURL:8080/servlet/TPCW_new_products_servlet;jsessionid=865266C8B1231C35FEDEAA9D66400074?subject=POLITICS
http://fictionalURL.:8080/servlet/TPCW_buy_request_servlet;jsessionid=6FA80FDC52BB22518DB7D587E0876D63?RETURNING_FLAG=Y&UNAME=OGREREBABAREAT&PASSWD=ogrerebabareat&C_ID=1440046&SHOPPING_ID=171
http://localhost:8080/servlet/;jsessionid=865266C8B1231C35FEDEAA9D66400074?subject=POLITICS
My code is written so that the part that represents the URL pattern is constructed at runtime:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test_regexp {
    public static void main (String[] args){
        String testString = "http://ec2-54-158-62-71.compute-1.amazonaws.com:8080/servlet/TPCW_buy_request_servlet;jsessionid=6FA80FDC52BB22518DB7D587E0876D63?RETURNING_FLAG=Y&UNAME=OGREREBABAREAT&PASSWD=ogrerebabareat&C_ID=1440046&SHOPPING_ID=171";
        int beginIndex = testString.indexOf("servlet");
        int endIndex = testString.indexOf("jsessionid");
        CharSequence cs = new String(testString);
        String patt = "\\(?=.*:8080/.*)(?=.*jsessionid=).*";
        System.out.println("Pattern: "+patt);
        Pattern teste = Pattern.compile(patt);
        System.out.println(teste.matcher(cs).matches());
    }
}
but at the end the pattern should look something like this:
Pattern: ((?=.*:8080/.*)(?=.*jsessionid=).*)
PS: The pattern must include the URL full endpoint (with parameters), but not the sessionId and other stuff
EDIT: I forgot to mention, the regexp must also have the subject parameter, which is after the session ID, I have only realized it while writing this...
For those who want to know my purpose in all this, I'm making an LRU cache based on regexp patterns stored in a HashSet.
I would appreciate the help very much! This is the last task to finish the project!
Thanks in advance.
Double check your pattern. I'm not sure what the backslash out front is for.
"\\(?=.*:8080/.*)(?=.*jsessionid=).*"
Your pattern would also happily match jsessionid=10:8080/
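As a rough sketch of both points: the pattern with the leading backslash does not compile in Java (the \( becomes a literal parenthesis, which leaves an unmatched closing one), and the lookahead version without it accepts the false positive mentioned above. The ORDERED pattern, the class name, and the helper names are assumptions for illustration. ORDERED requires the port, then the path, then ;jsessionid= with a hex id, then a non-empty query string, in that order. Adjust the session id character class to whatever your container actually emits.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; names and the ORDERED pattern are illustrative.
class UrlKeyDemo {
    // The pattern exactly as written in the question.
    static final String BROKEN = "\\(?=.*:8080/.*)(?=.*jsessionid=).*";

    // Same pattern without the stray backslash: two independent lookaheads.
    static final Pattern LOOKAHEADS =
            Pattern.compile("(?=.*:8080/.*)(?=.*jsessionid=).*");

    // Assumed stricter alternative that fixes the order of the pieces.
    static final Pattern ORDERED =
            Pattern.compile("https?://[^/]+:8080/.*;jsessionid=[0-9A-Fa-f]+\\?.+");

    // Returns false if the regex is rejected by Pattern.compile.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // True only if the whole input matches the pattern.
    static boolean matches(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        String url = "http://localhost:8080/servlet/;jsessionid="
                + "865266C8B1231C35FEDEAA9D66400074?subject=POLITICS";
        String bogus = "jsessionid=10:8080/";
        System.out.println("broken compiles:  " + compiles(BROKEN));
        System.out.println("lookaheads/url:   " + matches(LOOKAHEADS, url));
        System.out.println("lookaheads/bogus: " + matches(LOOKAHEADS, bogus));
        System.out.println("ordered/url:      " + matches(ORDERED, url));
        System.out.println("ordered/bogus:    " + matches(ORDERED, bogus));
    }
}
```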

replaceAll replacing the full words before and after slash as well

I have a need wherein I need to replace some specific words.
For example, if my text has
He needs to have java skills
I need to replace it as
He/She needs to have java skills
I more or less achieved this with the code below:
String replacedText = originalText.replaceAll("\\bHe\\b|\\bShe\\b","He/She");
But the problem is when I execute the code again, the output is
He/She/He/She needs to have java skills
The problem is that '\\b' treats the slash as a word boundary, so He and She are still matched as whole words when they sit before or after a slash.
Update: I am getting the source from a word/excel/html file. So for the first time it works fine. My intention is if I run the code again on the modified files, it should not change anything.
How to fix this?
A few hints to start:
he and she can be represented with s?he (where the s is optional), so you don't need he|she (it keeps things shorter and equally simple).
You can also use the (?i) flag, which makes your regex case-insensitive.
Now consider replacing either
he
she
but also
he/she
she/he
with he/she. A regex representing these cases can look like s?he(/s?he)?
So try:
replaceAll("(?i)\\bs?he(/s?he)?\\b","He/She");
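A small sketch to check that this replacement is idempotent, that is, running it again on already-modified text changes nothing. The class and method names are made up for the example. Note that because of (?i), a lowercase "he" is also rewritten to the capitalised "He/She".

```java
// Hypothetical demo class; names are illustrative.
class PronounDemo {
    // Replaces he, she, he/she and she/he (any case) with "He/She".
    static String neutralize(String text) {
        return text.replaceAll("(?i)\\bs?he(/s?he)?\\b", "He/She");
    }

    public static void main(String[] args) {
        String once = neutralize("He needs to have java skills");
        String twice = neutralize(once);
        System.out.println(once);
        System.out.println(twice);
        System.out.println("unchanged on second run: " + once.equals(twice));
    }
}
```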
I achieved it with the help of negative lookahead and negative lookbehind. With this logic I can run the code any number of times, including on already modified files.
private String replace(String originalText) {
    String replacedText = originalText.replaceAll(
            "\\b(he(?!/)|(?<!/)she)\\b", "he/she");
    replacedText = replacedText.replaceAll("\\b(He(?!/)|(?<!/)She)\\b",
            "He/She");
    replacedText = replacedText.replaceAll("\\b(his(?!/)|(?<!/)her)\\b",
            "his/her");
    replacedText = replacedText.replaceAll("\\b(His(?!/)|(?<!/)Her)\\b",
            "His/Her");
    replacedText = replacedText.replaceAll("\\bhim(?!/)\\b", "him/her");
    replacedText = replacedText.replaceAll("\\bHim(?!/)\\b", "Him/Her");
    return replacedText;
}
Thank you Biffen for the idea.
A simple approach could be
String[] originalTexts = {
    "He needs to have java skills",
    "She needs to have java skills",
    "He/She needs to have java skills"
};
for (String original : originalTexts) {
    String replacedText = original.replaceAll("\\b(She/He|He/She|He|She)\\b","He/She");
    System.out.printf("original: %-32s replacedText: %20s%n", original, replacedText);
}

Website/URL Validation Regex in JAVA

I need a regex to match URLs starting with "http://", "https://", or "www.", as well as bare domains such as "google.com".
The code I tried is:
//Pattern to check if this is a valid URL address
Pattern p = Pattern.compile("(http://|https://)(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?");
Matcher m;
m=p.matcher(urlAddress);
But this code can only match URLs such as "http://www.google.com".
I know this may be a duplicate question, but I have tried all of the regexes provided and none of them suits my requirement. Will someone please help me? Thank you.
You need to make the (http://|https://) part of your regex optional.
^(http:\/\/|https:\/\/)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?$
You can use the Apache Commons Validator library (org.apache.commons.validator.routines.UrlValidator) to validate a URL:
String[] schemes = {"http", "https"};
UrlValidator urlValidator = new UrlValidator(schemes);
And use:
urlValidator.isValid(urlAddress)
Then there is no need for a regex.
Link:
https://commons.apache.org/proper/commons-validator/apidocs/org/apache/commons/validator/routines/UrlValidator.html
If you use Java, I recommend this regex (I wrote it myself):
^(https?:\/\/)?(www\.)?([\w]+\.)+[\w]{2,63}\/?$
"^(https?:\\/\\/)?(www\\.)?([\\w]+\\.)+[\\w]{2,63}\\/?$" // as Java-String
To explain:
^ = line start
(https?:\/\/)? = "http://" or "https://" may occur.
(www\.)? = "www." may occur.
([\w]+\.)+ = a word ([a-zA-Z0-9_]) followed by a dot has to occur one or more times. (Extend here if you need special characters like ü, ä, ö or others in your URL, and remember to use IDN.toASCII(url) if you use special characters. If you need to know which characters are legal in general: https://kb.ucla.edu/articles/what-characters-can-go-into-a-valid-http-url)
[\w]{2,63} = a word ([a-zA-Z0-9_]) with 2 to 63 characters has to occur exactly once. (A TLD (top-level domain, for example .com) cannot be shorter than 2 or longer than 63 characters.)
\/? = a "/" character may occur. (Some people or servers put a / at the end.)
$ = line end
-
If you extend it by special characters it could look like this:
^(https?:\/\/)?(www\.)?([\w\Q$-_+!*'(),%\E]+\.)+[\w]{2,63}\/?$
"^(https?:\\/\\/)?(www\\.)?([\\w\\Q$-_+!*'(),%\\E]+\\.)+[\\w]{2,63}\\/?$" // as Java-String
Avinash Raj's answer is not fully correct:
^(http:\/\/|https:\/\/)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?$
The dots are not escaped, which means each of them matches any character. My version is also simpler, and I have never heard of a domain like "test..com" (which that regex actually matches).
Demo: https://regex101.com/r/vM7wT6/279
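A quick way to compare the two regexes from Java is sketched below. The class name, field names, and the sample inputs are assumptions for illustration. ESCAPED is the pattern recommended above, written without the redundant \/ escapes and [\w] brackets. UNESCAPED is the earlier answer's pattern, copied as is.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample inputs are illustrative.
class UrlRegexDemo {
    // Dots escaped: each label must be followed by a literal '.'.
    static final Pattern ESCAPED = Pattern.compile(
            "^(https?://)?(www\\.)?(\\w+\\.)+\\w{2,63}/?$");

    // Dots unescaped: each '.' matches any character.
    static final Pattern UNESCAPED = Pattern.compile(
            "^(http://|https://)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?$");

    // True only if the whole input matches the pattern.
    static boolean isUrl(Pattern p, String candidate) {
        return p.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.google.com", "https://google.com",
            "www.google.com", "google.com", "test..com"
        };
        for (String s : samples) {
            System.out.printf("%-24s escaped=%-5b unescaped=%b%n",
                    s, isUrl(ESCAPED, s), isUrl(UNESCAPED, s));
        }
    }
}
```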
Edit:
Since I saw that some people need a regex which also matches server directories, I wrote this:
^(https?:\/\/)?([\w\Q$-_+!*'(),%\E]+\.)+(\w{2,63})(:\d{1,4})?([\w\Q/$-_+!*'(),%\E]+\.?[\w])*\/?$
This may not be the best one, since I didn't spend much time on it, but maybe it helps someone. You can see how it works here: https://regex101.com/r/vM7wT6/700
It also matches urls like "hello.to/test/whatever.cgi"
A Java-compatible version of @Avinash's answer would be:
//Pattern to check if this is a valid URL address
Pattern p = Pattern.compile("^(http://|https://)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?$");
Matcher m;
m=p.matcher(urlAddress);
boolean matches = m.matches();
pattern="w{3}\.[a-z]+\.?[a-z]{2,3}(|\.[a-z]{2,3})"
This will only accept addresses such as www.google.com and www.google.co.in.
// I use this
static boolean esURL(String cadena){
    boolean bandera = false;
    bandera = cadena.matches("\\b(https://?|ftp://|file://|www.)[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");
    return bandera;
}

Why doesn't this Java regex compile?

I am trying to extract the pass number from strings of any of the following formats:
PassID_132
PassID_64
Pass_298
Pass_16
For this, I constructed the following regex:
Pass[I]?[D]?_([\d]{2,3})
-and tested it in Eclipse's search dialog. It worked fine.
However, when I use it in code, it doesn't match anything. Here's my code snippet:
String idString = filename.replaceAll("Pass[I]?[D]?_([\\d]{2,3})", "$1");
int result = Integer.parseInt(idString);
I also tried
java.util.regex.Pattern.compile("Pass[I]?[D]?_([\\d]{2,3})")
in the Expressions window while debugging, but that says "", whereas
java.util.regex.Pattern.compile("Pass[I]?[D]?_([0-9]{2,3})")
compiled, but didn't match anything. What could be the problem?
Instead of Pass[I]?[D]?_([\d]{2,3}) try this:
Pass(?:I)?(?:D)?_([\d]{2,3})
There's nothing invalid about your regex, but it sucks. You don't need character classes around single-character terms. Try this:
"Pass(?:ID)?_(\\d{2,3})"

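For the pass-number question, one possible cause (the question does not show the value of filename, so this is a guess) is that replaceAll only substitutes the part that matched, so any other text in the name, such as an extension, is left in place and Integer.parseInt then fails. A sketch that uses Matcher.find() and group(1) instead is shown below. The class name, method name, and the sample file name are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample inputs are illustrative.
class PassIdDemo {
    // Optional "ID" suffix, then an underscore and a 2 to 3 digit number.
    static final Pattern PASS = Pattern.compile("Pass(?:ID)?_(\\d{2,3})");

    // Returns the pass number found anywhere in the given string.
    static int passNumber(String filename) {
        Matcher m = PASS.matcher(filename);
        if (!m.find()) {
            throw new IllegalArgumentException("no pass number in: " + filename);
        }
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        System.out.println(passNumber("PassID_132"));
        System.out.println(passNumber("Pass_16"));
        // Surrounding text is ignored rather than passed on to parseInt.
        System.out.println(passNumber("report_PassID_64.log"));
    }
}
```
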
Java regular expression for extracting the data between tags

I am trying to write a regular expression which extracts the data from a string like
<B Att="text">Test</B><C>Test1</C>
The extracted output needs to be Test and Test1. This is what I have done till now:
public class HelloWorld {
    public static void main(String[] args)
    {
        String s = "<B>Test</B>";
        String reg = "<.*?>(.*)<\\/.*?>";
        Pattern p = Pattern.compile(reg);
        Matcher m = p.matcher(s);
        while(m.find())
        {
            String s1 = m.group();
            System.out.println(s1);
        }
    }
}
But this is producing the result <B>Test</B>. Can anybody point out what I am doing wrong?
Three problems:
Your test string is incorrect.
You need a non-greedy modifier in the group.
You need to specify which group you want (group 1).
Try this:
String s = "<B Att=\"text\">Test</B><C>Test1</C>"; // <-- Fix 1
String reg = "<.*?>(.*?)</.*?>"; // <-- Fix 2
// ...
String s1 = m.group(1); // <-- Fix 3
You also don't need to escape a forward slash, so I removed that.
See it running on ideone.
(Also, don't use regular expressions to parse HTML - use an HTML parser.)
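Putting the three fixes together, a complete version might look like the sketch below. The class and method names are made up for the example, and it still carries the usual caveat that this only works for simple, non-nested tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative.
class TagTextDemo {
    // Reluctant quantifiers everywhere; group 1 is the text between the tags.
    static final Pattern TAG_TEXT = Pattern.compile("<.*?>(.*?)</.*?>");

    // Returns the captured text of every open/close tag pair found.
    static List<String> extract(String input) {
        List<String> texts = new ArrayList<>();
        Matcher m = TAG_TEXT.matcher(input);
        while (m.find()) {
            texts.add(m.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        String s = "<B Att=\"text\">Test</B><C>Test1</C>";
        for (String text : extract(s)) {
            System.out.println(text);
        }
    }
}
```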
If you are using Eclipse, there is a nice plugin that will help you check your regular expression without writing a class to test it.
Here is the link:
http://regex-util.sourceforge.net/update/
You will need to show the view by choosing Window -> Show View -> Other, and then Regex Util.
I hope it helps you in your fight with regular expressions.
It almost looks like you're trying to use regex on XML and/or HTML. I'd suggest not using regex and instead creating a parser or lexer to handle this type of arrangement.
I think the best way to get the values of XML nodes is simply to treat the input as XML.
If you really want to stick to regex, try:
<B[^>]*>(.+?)</B\s*>
bearing in mind that this will only ever give you the value of the B tag.
Or if you want the value of any tag you will be using something like:
<.*?>(.*?)</.*?>
