Combining Regex in Java - java

I have some issues with a program that fetches information from an HTML table in Java.
To fetch the content of every column I use the following regex:
<td>([^<]*)</td>
This works very well for me.
To fetch the link names I use this:
<a[^>]*>(.*?)</a>
This also works very well.
But sometimes I need information from a column that contains a link. So I tried to combine the two regexes like this:
<td>([^<]*)</td>|<a[^>]*>(.*?)</a>
I thought it would work like this:
It gets everything between <td> and </td>.
If that content is a link, it gets just the link name.
But this is not working. I'm not the best at regex, so I need help combining these two steps.
Thanks very much.

The code I'm using:
Pattern pattern = Pattern.compile("<td>([^<]*)</td>|<a[^>]*>(.*?)</a>");
String line = "Here are the lines saved from the HTML downloader";
Matcher matcher = pattern.matcher(line);
for (int startPoint = 0; matcher.find(startPoint); startPoint = matcher.end())
{
System.out.println(matcher.group(1));
}
This is just a snippet, but that's how it works in general. (Normally the string is stored in an array.)
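One possible way to combine the two patterns is to put the alternation inside the <td>...</td> pair, so that a cell holds either a link or plain text. The following is only a sketch under some assumptions: the class and method names are made up for illustration, cells are written exactly as <td> with no attributes, and a link's text contains no nested tags (so [^<]* is used for it instead of .*?).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CellText {
    // Group 1: link text when the cell contains a link.
    // Group 2: plain cell text otherwise.
    private static final Pattern CELL = Pattern.compile(
            "<td>\\s*(?:<a[^>]*>([^<]*)</a>|([^<]*))\\s*</td>");

    // Hypothetical helper: returns the visible text of every cell in the line.
    static List<String> cellTexts(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = CELL.matcher(line);
        while (m.find()) {
            // Only one of the two groups takes part in any given match.
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }
}
```

With the original alternation, group(1) is null whenever the link branch matches, which is why printing only group(1) loses the link names.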

Related

Splitting string with similar starting pattern

So, I've been trying to split something I'm reading from a file. But everything that I've tried does not give me only the part that I want.
What I have as string is this:
Scenario:
Bunch of stuf here
Just typing stuff for the example...
Scenario:
More stuff here
A lot more stuff here
XX123
I want to get everything from 'Scenario:' to 'XX123'
Like this:
Scenario:
More stuff here
A lot more stuff here
XX123
The file that I'm reading from has a lot of those 'Scenario:' blocks, and using Pattern from Java doesn't give me only the part that I want. Instead it matches from the first 'Scenario:' it finds up to 'XX123'.
I also tried to use StringUtils.substringBetween, same result.
Thanks in advance
The old-fashioned way to do it would look something like this:
String inputText; // assumed to already hold the text read from the file
String END_MARKER = "XX123";
int indexOfEnd = inputText.indexOf(END_MARKER);
// search backwards from the end marker for the closest "Scenario:"
int indexOfScenario = inputText.lastIndexOf("Scenario:", indexOfEnd);
String result = inputText.substring(indexOfScenario,
        indexOfEnd + END_MARKER.length());
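If a regex is preferred, one option is to forbid a second 'Scenario:' between the start and the end marker with a negative lookahead. This is a sketch only; the class and method names are invented for the example, and it assumes the markers are literally "Scenario:" and "XX123".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScenarioExtractor {
    // Each character between the markers is accepted only if no new
    // "Scenario:" starts there; DOTALL lets '.' cross line breaks.
    private static final Pattern BLOCK = Pattern.compile(
            "Scenario:(?:(?!Scenario:).)*?XX123", Pattern.DOTALL);

    // Hypothetical helper: returns the first matching block, or null if none.
    static String extract(String text) {
        Matcher m = BLOCK.matcher(text);
        return m.find() ? m.group() : null;
    }
}
```

A match attempt that starts at an earlier 'Scenario:' fails as soon as it reaches the next one, so the reported match begins at the last 'Scenario:' before the marker.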

Java regex pattern to match only the last occurrence of string

In an HTML document, I need to replace the full paths to files with just the file names.
The documents are very large, so I think a regex would be a practical solution. I've already read similar questions and tried the solutions, but they didn't work.
Example: given this HTML input,
<img src="app/javax.faces.resource/color_pan.png?ln=img/partidos" style="width:100%; height:30px;" class="centerImg"/>
<img src="/app/javax.faces.resource/pan.png?ln=img/partidos" class="centerImg"/>
I need the following output:
<img src="color_pan.png" style="width:100%; height:30px;" class="centerImg"/>
<img src="pan.png" class="centerImg"/>
I'm trying these patterns:
Pattern p = Pattern.compile("src=\"(?=.*src).*/color_pan.png[^\"]*\"");
Pattern p1 = Pattern.compile("src=\"(?!.*src).*/pan.png[^\"]*\"");
The first one works fine for the 1st image and the second one is the solution for the 2nd (both are on the same html doc).
I need a general pattern that works for every image. So the problem is to find only the first "src" attribute to the left of the file name. In other words, the "src" must be the last one that appears before the file name.
That way, I could replace the strings correctly.
Any help is appreciated.
This regex seems to do the job.
Solution 1 (2 matches in 1509 steps):
(^<img src=")(?:.*?)([\w.]+)(?=\?)[^"]*"(.*$)
A more efficient solution
Solution 2 (2 matches in 593 steps):
(^<img src=").*(?<=\/|")([\w.]+)(?=\?)[^"]*"(.*$)
Java code:
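import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String pattern = "(^<img src=\")(?:.*?)([\\w.]+)(?=\\?)[^\"]*\"(.*$)";
Pattern r = Pattern.compile(pattern);
// x was not declared in the original snippet; a Scanner over standard input is assumed here
Scanner x = new Scanner(System.in);
while (x.hasNextLine()) {
    String line = x.nextLine();
    Matcher m = r.matcher(line);
    if (m.find()) {
        // the closing quote of src is consumed outside the groups, so add it back
        System.out.println(m.group(1) + m.group(2) + "\"" + m.group(3));
    } else {
        System.out.println("Not Found");
    }
}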
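Another way to approach it, sketched here with invented class and method names, is to rewrite every src value in one pass with replaceAll. The pattern below is an assumption rather than a tested general solution: it expects double-quoted src values, and it keeps '?' out of the path part so that slashes inside the query string (as in ?ln=img/partidos) are not mistaken for path separators.

```java
import java.util.regex.Pattern;

class SrcShortener {
    // (?:[^"?/]*/)*  any number of directory segments, each ending in '/'
    // ([^"?/]+)      the file name (group 1)
    // (?:\?[^"]*)?   an optional query string, dropped from the result
    private static final Pattern SRC = Pattern.compile(
            "src=\"(?:[^\"?/]*/)*([^\"?/]+)(?:\\?[^\"]*)?\"");

    // Hypothetical helper: replaces each src value with just its file name.
    static String shorten(String html) {
        return SRC.matcher(html).replaceAll("src=\"$1\"");
    }
}
```

Because the match is anchored on src=" itself and never crosses a double quote, it works per attribute rather than per line, so several images on one line are handled independently.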

Java(Apex) RegEx not working?

I am having trouble with a regex in Salesforce Apex. Since Apex seems to use the same regex syntax and logic as Java, I am addressing this to Java developers as well.
I debugged the String and it is correct: street equals 'str 3 B'.
On http://www.regexr.com/ the regex ('\d \w$') works.
The code:
Matcher hasString = Pattern.compile('\\d \\w$').matcher(street);
if(hasString.matches())
My problem is that hasString.matches() returns false. Can anyone tell me what I did wrong? I tried it without the $, with different casing, etc., and I just can't get it to work.
Thanks in advance!
You need to use find instead of matches for a partial match, because matches attempts to match the complete input text.
Matcher hasString = Pattern.compile("\\d \\w$").matcher(street);
if(hasString.find()) {
// matched
System.out.println("Start position: " + hasString.start());
}
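To make the difference concrete, here is a small Java sketch (the class and method names are made up for the example) that runs the same pattern through both calls. It uses Java's double-quoted string syntax, not Apex's single quotes.

```java
import java.util.regex.Pattern;

class FindVsMatches {
    private static final Pattern P = Pattern.compile("\\d \\w$");

    // matches(): the pattern has to cover the entire input.
    static boolean wholeInput(String s) {
        return P.matcher(s).matches();
    }

    // find(): the pattern may match any part of the input.
    static boolean anywhere(String s) {
        return P.matcher(s).find();
    }
}
```

For 'str 3 B' the first call is false and the second is true, because only the trailing '3 B' fits the pattern. Prefixing the pattern with .* would be another way to make matches() succeed.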

Matcher.find() only finds the last match in JUnit test

I have this weird problem. I have this Java method that works fine in my program:
/*
* Extract all image urls from the html source code
*/
public void extractImageUrlFromSource(ArrayList<String> imgUrls, String html) {
Pattern pattern = Pattern.compile("\\<[ ]*[iI][mM][gG][\t\n\r\f ]+.*[sS][rR][cC][ ]*=[ ]*\".*\".*>");
Matcher matcher = pattern.matcher(html);
while (matcher.find()) {
imgUrls.add(extractImgUrlFromTag(matcher.group()));
}
}
This method works fine in my Java application. But whenever I test it in a JUnit test, it only adds the last URL to the ArrayList:
/**
* Test of extractImageUrlFromSource method, of class ImageDownloaderProc.
*/
@Test
public void testExtractImageUrlFromSource() {
System.out.println("extractImageUrlFromSource");
String html = "<html><title>fdjfakdsd</title><body><img kfjd src=\"http://image1.png\">df<img dsd src=\"http://image2.jpg\"></body><img dsd src=\"http://image3.jpg\"></html>";
ArrayList<String> imgUrls = new ArrayList<String>();
ArrayList<String> expimgUrls = new ArrayList<String>();
expimgUrls.add("http://image1.png");
expimgUrls.add("http://image2.jpg");
expimgUrls.add("http://image3.jpg");
ImageDownloaderProc instance = new ImageDownloaderProc();
instance.extractImageUrlFromSource(imgUrls, html);
imgUrls.stream().forEach((x) -> {
System.out.println(x);
});
assertArrayEquals(expimgUrls.toArray(), imgUrls.toArray());
}
Is JUnit at fault? Remember, it works fine in my application.
I think there is a problem in the regex:
"\\<[ ]*[iI][mM][gG][\t\n\r\f ]+.*[sS][rR][cC][ ]*=[ ]*\".*\".*>"
The problem (or at least one problem) is the first .*. The + and * metacharacters are greedy, which means that they will attempt to match as many characters as possible. In your unit test, I think that what is happening is that the .* is matching everything up to the last 'src' in the input string.
I suspect that the reason that this "works" in your application is that the input data is different. Specifically, I suspect that you are running your application on input files where each img element is on a different line. Why does this make a difference? Well, it turns out that by default, the . metacharacter does not match line breaks.
For what it is worth, using regexes to "parse" HTML is generally thought to be a bad idea. For a start, it is horribly fragile. People who do a lot of this kind of stuff tend to use proper HTML parsers ... like "jsoup".
Reference: RegEx match open tags except XHTML self-contained tags
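If you stay with a regex anyway, one way to sidestep the greediness problem is to replace each .* with a character class that cannot leave the current tag. The following is a sketch, not the asker's code: the class and method names are invented, it captures the URL directly instead of calling extractImgUrlFromTag, and it only handles double-quoted src values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcExtractor {
    // [^>]* stops at the end of the current tag and [^"]* stops at the
    // closing quote, so one match can never swallow several <img> tags.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<\\s*img\\b[^>]*?\\bsrc\\s*=\\s*\"([^\"]*)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the src value of every <img> tag found.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}
```

This behaves the same whether the tags are on one line or spread over several, which removes the difference between the application input and the test input.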
I wish I could comment as I'm not sure about this, but it might be worth mentioning...
This line looks like it's extracting the URLs from the wrong array...did you mean to extract from expimgUrls instead of imgUrls?
instance.extractImageUrlFromSource(imgUrls, html);
I haven't gotten this far in my Java education so I may be incorrect...I just looked over the code and noticed it. I hope someone else who knows more can actually give you a solid answer!

java email extraction regular expression?

I would like a regular expression that will extract email addresses from a String (using Java regular expressions), one that really works.
Here's a regular expression that really works.
I've spent an hour surfing the web and testing different approaches,
and most of them didn't work, although Google top-ranked those pages.
I want to share with you a working regular expression:
[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})
Here's the original link:
http://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/
I had to add some dashes to allow for them. So a final result in Javanese:
final String MAIL_REGEX = "([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,})";
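For extraction rather than validation, the pattern can be run with find() over the whole text. A minimal sketch, with an invented class name and the dash-tolerant pattern written with '@' as the separator; it makes no claim to cover every valid address format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailExtractor {
    // Same shape as the pattern in this answer, compiled once and reused.
    private static final Pattern MAIL = Pattern.compile(
            "[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+"
            + "(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,})");

    // Hypothetical helper: collects every substring that looks like an address.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = MAIL.matcher(text);
        while (m.find()) {
            found.add(m.group()); // group() with no argument is the whole match
        }
        return found;
    }
}
```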
Install this regex tester plugin into Eclipse, and you'll have a whale of a time testing regexes:
http://brosinski.com/regex/.
Points to note:
In the plugin, use only one backslash for character escape. But when you transcribe the regex into a Java/C# string you would have to double them as you would be performing two escapes, first escaping the backslash from Java/C# string mechanism, and then second for the actual regex character escape mechanism.
Surround the sections of the regex whose text you wish to capture with round brackets (parentheses). Then you can use the group functions in Java or C# regex to read the values of those sections.
([_A-Za-z0-9-]+)(\.[_A-Za-z0-9-]+)@([A-Za-z0-9]+)(\.[A-Za-z0-9]+)
For example, using the above regex, the following string
abc.efg@asdf.cde
yields
start=0, end=16
Group(0) = abc.efg@asdf.cde
Group(1) = abc
Group(2) = .efg
Group(3) = asdf
Group(4) = .cde
Group 0 is always the whole matched string.
If you do not enclose any section in parentheses, you can only detect a match, not capture parts of the text.
It might be less confusing to create a few regex than one long catch-all regex, since you could programmatically test one by one, and then decide which regexes should be consolidated. Especially when you find a new email pattern that you had never considered before.
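The group listing above can be reproduced in Java with a few lines. This is only an illustrative sketch: the class and method names are invented, and the pattern is the four-group one shown in this answer with '@' as the separator and the backslashes doubled for a Java string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    private static final Pattern P = Pattern.compile(
            "([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)@([A-Za-z0-9]+)(\\.[A-Za-z0-9]+)");

    // Hypothetical helper: index 0 is the whole match, 1..n are the groups.
    static String[] groups(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }
}
```

Calling groups("abc.efg@asdf.cde") returns the whole match followed by abc, .efg, asdf and .cde, in that order.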
a little late but ok.
Here is what I use. Just paste it into the Firebug console and run it. Then look on the web page for a textarea (most likely at the bottom of the page); it will contain a comma-separated list of all email addresses found in A tags.
var jquery = document.createElement('script');
jquery.setAttribute('src', 'http://code.jquery.com/jquery-1.10.1.min.js');
document.body.appendChild(jquery);
var list = document.createElement('textarea');
list.setAttribute('id', 'emaillist');
document.body.appendChild(list);
var lijst = "";
$("#emaillist").val("");
$("a").each(function(idx,el){
var mail = $(el).filter('[href*="@"]').attr("href");
if(mail){
lijst += mail.replace("mailto:", "")+",";
}
});
$("#emaillist").val(lijst);
Android's built-in email address pattern (android.util.Patterns.EMAIL_ADDRESS) works perfectly:
public static List<String> getEmails(@NonNull String input) {
List<String> emails = new ArrayList<>();
Matcher matcher = Patterns.EMAIL_ADDRESS.matcher(input);
while (matcher.find()) {
int matchStart = matcher.start(0);
int matchEnd = matcher.end(0);
emails.add(input.substring(matchStart, matchEnd));
}
return emails;
}
