I have a problem extracting data from a website.
I'm trying to get the company name and its price; here they are SYGNITY and 8,40:
<a class="link" href="http://www.money.pl/gielda/spolki-gpw/PLCMPLD00016.html">SYGNITY</a>
</td>
<td class="ac"><img width="12" height="11" src="http://static1.money.pl/i/gielda/chart.gif" title="Pokaż wykres" alt="Pokaż wykres" /></td>
<td class="al">SGN</td>
<td class="ar">8,40</td>
I tried this pattern, but it doesn't work:
String expr = "<a class=\"link\" href=\"(.+?)\">(.+?)</a>(.+?)<td class=\"ar\">(.+?)</td> ";
Any advice?
Using JSoup parser
You should use an HTML parser such as JSoup, since regex is a poor fit for parsing HTML.
You can do something like this:
String htmlString = "YOUR HTML HERE";
Document document=Jsoup.parse(htmlString);
Element element=document.select("a[href=http://www.money.pl/gielda/spolki-gpw/PLCMPLD00016.html]").first();
System.out.println(element.text()); //SYGNITY
element=document.select("td[class=ar]").first();
System.out.println(element.text()); //8,40
Using regex
If you still want to use a regex, then you could use a regex like below and grab the content from capturing groups:
PLCMPLD00016.html">(.*?)<\/a>|"ar">(.*?)<\/td>
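String htmlString = "YOUR HTML HERE";
// Quotes inside a Java string literal must be escaped; '/' needs no escaping.
Pattern pattern = Pattern.compile("PLCMPLD00016\\.html\">(.*?)</a>|\"ar\">(.*?)</td>");
Matcher matcher = pattern.matcher(htmlString);
while (matcher.find()) {
    // Only one alternative matches per find(), so the other group is null.
    if (matcher.group(1) != null) System.out.println(matcher.group(1));
    if (matcher.group(2) != null) System.out.println(matcher.group(2));
}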
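For reference, the pattern in the question most likely fails for two reasons: it ends with a stray space, and '.' does not match line breaks by default, so (.+?) cannot cross from the link to the price cell. Below is a minimal sketch of a fixed version, assuming the markup looks exactly like the snippet in the question. The class name QuoteExtractor and the method extract are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteExtractor {
    // DOTALL lets '.' cross the line breaks between the link and the price cell.
    private static final Pattern ROW = Pattern.compile(
        "<a class=\"link\" href=\"[^\"]+\">(.+?)</a>.+?<td class=\"ar\">(.+?)</td>",
        Pattern.DOTALL);

    /** Returns {name, price}, or null when the snippet does not match. */
    static String[] extract(String html) {
        Matcher m = ROW.matcher(html);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String html =
            "<a class=\"link\" href=\"http://www.money.pl/gielda/spolki-gpw/PLCMPLD00016.html\">SYGNITY</a>\n"
            + "</td>\n"
            + "<td class=\"al\">SGN</td>\n"
            + "<td class=\"ar\">8,40</td>";
        String[] result = extract(html);
        System.out.println(result[0] + " " + result[1]);
    }
}
```

This still breaks as soon as the attribute order or class names change, which is why the parser approach above is the safer choice.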
I am a newbie to Java regex. I have a long string which contains text like this (below is only the part of my string that I would like to replace):
href="javascript:openWin('Images/DCRMBex_01B_ex01.jpg',480,640)"
href="javascript:openWin('Images/DCRMBex_01A_ex01.jpg',480,640)"
href="javascript:openWin('Images/DCRMBex_06A_ex06.jpg',480,640)"
I would like to replace
Images
with
http://google.com/Images
For example, my output should look like this:
href="javascript:openWin('http://google.com/Images/DCRMBex_01B_ex01.jpg',480,640)"
Below is my Java program:
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main2 {
public static void main(String[] args) throws FileNotFoundException {
Scanner in = new Scanner(new FileReader("C:\\Projects\\input.txt"));
StringBuilder sb = new StringBuilder();
while (in.hasNext()) {
sb.append(in.next());
}
String patternString = "href=\"javascript:openWin(.+?)\"";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(sb);
while (matcher.find()) {
//System.out.println(matcher.group(1));
//System.out.println(matcher.group(1).replaceAll("Images", "http://google.com/Images"));
matcher.group(1).replaceAll("Images", "http://google.com/Images");
}
System.out.println(sb);
}
}
Below is my input file (input.txt). This is only a part of the file, which is too long to paste here:
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_01_ex01.pdf"><b>Example 1: Bible (Rusch)</b></a> � <a href="javascript:openWin('Images/DCRMBex_01A_ex01.jpg',480,640)">Figure 1A. First page of text</a> � <a href="javascript:openWin('Images/DCRMBex_01B_ex01.jpg',480,640)">Figure 1B. Source of supplied title</a></td>
<td valign="top"> </td>
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_06_ex06.pdf"><b>Example 6: Angelo Carletti</b></a> � <a href="javascript:openWin('Images/DCRMBex_06A_ex06.jpg',480,640)">Figure 6A. Title page</a> � <a href="javascript:openWin('Images/DCRMBex_06B_ex06.jpg',480,640)">Figure 6B. Colophon showing use of i/j and u/v</a></td>
</tr>
<tr>
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_02_ex02.pdf"><b>Example 2: Greek anthology</b></a> � <a href="javascript:openWin('Images/DCRMBex_02A_ex02.jpg',480,640)">Figure 2A. First page of text</a> � <a href="javascript:openWin('Images/DCRMBex_02B_ex02.jpg',480,640)">Figure 2B. Colophon</a></td>
<td valign="top"> </td>
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_07_ex07.pdf"><b>Example 7: Erasmus</b></a> � <a href="javascript:openWin('Images/DCRMBex_07A_ex07.jpg',480,640)">Figure 7A. Title page</a> � <a href="javascript:openWin('Images/DCRMBex_07B_ex07.jpg',480,640)">Figure 7B. Colophon</a> � <a href="javascript:openWin('Images/DCRMBex_07C_ex07.jpg',640,480)">Figure 7C. Running title</a></td>
</tr>
<tr>
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_03_ex03.pdf"><b>Example 3: Heytesbury</b></a> � <a href="javascript:openWin('Images/DCRMBex_03A_ex03.jpg',480,640)">Figure 3A. Title page</a> � <a href="javascript:openWin('Images/DCRMBex_03B_ex03.jpg',480,640)">Figure 3B. Colophon showing use of i/j and u/v</a></td>
<td valign="top"> </td>
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_08_ex08.pdf"><b>Example 8: Pliny</b></a> � <a href="javascript:openWin('Images/DCRMBex_08A_ex08.jpg',480,640)">Figure 8A. Title page</a> � <a href="javascript:openWin('Images/DCRMBex_08B_ex08.jpg',480,640)">Figure 8B. Colophon</a></td>
Output:
1) System.out.println(matcher.group(1))
('Images/DCRMBex_05_ex05.jpg',480,640)
2)System.out.println(matcher.group(1).replaceAll("Images","http://google.com/Images"));
('http://google.com/Images/DCRMBex_05_ex05.jpg',480,640)
But when I print my StringBuilder, it doesn't show any replacement. What am I doing wrong here? Any help is appreciated. Thanks
I would recommend using Files.lines() and Java Streams to process the input. With your actual input you don't even need a regex:
try (Stream<String> lines = Files.lines(Paths.get("input.txt"))) {
String result = lines
.map(line -> line.replace("Images", "http://google.com/Images"))
.collect(Collectors.joining("\n"));
System.out.println(result);
}
If you really want to use a regex, compile the pattern once, outside the stream pipeline, because String.replaceAll() compiles its pattern on every call. Reusing a single compiled Pattern avoids repeating Pattern.compile() for each line:
// Anchor on openWin(' so every link on a line is rewritten, not just the last one.
Pattern pattern = Pattern.compile("(href=\"javascript:openWin\\(')(Images/)");
try (Stream<String> lines = Files.lines(Paths.get("input.txt"))) {
String result = lines
.map(pattern::matcher)
.map(matcher -> matcher.replaceAll("$1http://google.com/$2"))
.collect(Collectors.joining("\n"));
System.out.println(result);
}
This regex defines two capture groups (the parts between parentheses). You can refer to these groups in the replacement string with $index, so $1 inserts the first group and $2 the second.
The result in both cases will be:
href="javascript:openWin('http://google.com/Images/DCRMBex_01B_ex01.jpg',480,640)"
href="javascript:openWin('http://google.com/Images/DCRMBex_01A_ex01.jpg',480,640)"
href="javascript:openWin('http://google.com/Images/DCRMBex_06A_ex06.jpg',480,640)"
replaceAll returns the modified string; it does not modify anything in place, so the result has to be assigned. In this case, I would skip the explicit Pattern/Matcher loop and use String.replaceAll's support for capture groups instead:
Scanner in = new Scanner(new FileReader("C:\\Projects\\input.txt"));
StringBuilder sb = new StringBuilder();
while (in.hasNext()) {
sb.append(in.next());
}
// Modified regex
String patternString = "(href=\"javascript:openWin\\(')(.+?)(')";
String result = sb.toString().replaceAll(patternString, "$1http://google.com/$2$3");
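If you do want to keep an explicit find() loop, the rewritten text has to be written back somewhere, since Strings are immutable and group(1).replaceAll(...) only returns a new String. Here is a sketch using Matcher.appendReplacement and appendTail. The class name ImagePathRewriter and the baseUrl parameter are illustrative and not from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImagePathRewriter {
    // Group 1: everything up to the opening quote; group 2: the relative image path.
    private static final Pattern OPEN_WIN =
        Pattern.compile("(href=\"javascript:openWin\\(')(Images/[^']*)'");

    static String rewrite(String html, String baseUrl) {
        Matcher m = OPEN_WIN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + baseUrl + m.group(2) + "'";
            // quoteReplacement keeps '$' and '\' in the URL from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "<a href=\"javascript:openWin('Images/DCRMBex_01A_ex01.jpg',480,640)\">Figure 1A</a>";
        System.out.println(rewrite(line, "http://google.com/"));
    }
}
```

StringBuffer is used here because the appendReplacement overload that takes a StringBuilder only exists from Java 9 onward.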
Hope this helps!
I want to replace only the exactly matching link in a String.
My code is as follows:
String originalString = "<a target=\"_blank\" href=\"http://example.com/\"><span style=\"font-size: 12px;\">ABC</span></a>"
+ "<a target=\"_blank\" href=\"http://example.com/contact/\"><span style=\"font-size: 12px;\">Contact</span></a>";
String replacedString = originalString.replace("http://example.com/", "link1");
System.out.println("Replaced String:" + replacedString);
replacedString = "<a target="_blank" href="link1"><span style="font-size: 12px;">ABC</span></a><a target="_blank" href="link1contact/"><span style="font-size: 12px;">Contact</span></a>"
requiredString = "<a target="_blank" href="link1"><span style="font-size: 12px;">ABC</span></a><a target="_blank" href="link2"><span style="font-size: 12px;">Contact</span></a>"
I get replacedString as output, but the required output is requiredString.
Thanks in advance.
Include the surrounding quotes in the search string, so only the complete attribute value matches:
String replacedString = originalString.replace("\"http://example.com/\"", "\"link1\"");
replacedString = replacedString.replace("\"http://example.com/contact/\"", "\"link2\"");
The problem is that http://example.com/contact/ contains http://example.com/.
Use this instead:
String replacedString = originalString.replace("http://example.com/contact/", "link2");
String replacedString2 = replacedString.replace("http://example.com/", "link1");
replacedString2 is the required output
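Both fixes depend on either the quotes or the order of the calls. A sketch that avoids both, assuming every link sits in a double-quoted href attribute, is to match each href value as a whole and look it up in a map. The class name HrefReplacer and the method replaceLinks are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefReplacer {
    // Captures the complete value of a double-quoted href attribute.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    static String replaceLinks(String html, Map<String, String> replacements) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Only an exact match of the whole URL is replaced; other links stay as they are.
            String target = replacements.getOrDefault(m.group(1), m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement("href=\"" + target + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> replacements = new LinkedHashMap<>();
        replacements.put("http://example.com/", "link1");
        replacements.put("http://example.com/contact/", "link2");
        String html = "<a href=\"http://example.com/\">ABC</a>"
            + "<a href=\"http://example.com/contact/\">Contact</a>";
        System.out.println(replaceLinks(html, replacements));
    }
}
```

Because the lookup compares whole attribute values, the order of the map entries does not matter.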
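A working regex in Java is http:\\/\\/example.com.*?(?=\\\\). It matches each occurrence of http://example.com and everything up to the next backslash. Note that this only helps if the text really contains backslashes; the \" sequences in the question's source are Java escapes and are not part of the runtime string.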
I have a string obtained from an EditText. The string contains html tags.
Spannable s = mainEditText.getText();
String webText = Html.toHtml(s);
The contents of the string are:
<p dir="ltr">test</p>
<p dir="ltr"><img src="http://files.parsetfss.com/bcff7108-cbce-4ab8-b5d1-1f82827e6519/tfss-0de7a730-3fa9-4a1e-9f82-d34e4f6e2d31-file" /><br /></p>
<p dir="ltr"><img src="http://maps.google.com/maps/api/staticmap?center=22.572646,88.363895&zoom=15&size=960x540&sensor=false&markers=color:blue%7Clabel:!%7C22.572646,88.363895" /><br /> </p>
Now, wherever there is an img tag, I want to wrap it in a center tag.
What should I do to get the following output?
<p dir="ltr">test</p>
<p dir="ltr"><center><img src="http://files.parsetfss.com/bcff7108-cbce-4ab8-b5d1-1f82827e6519/tfss-0de7a730-3fa9-4a1e-9f82-d34e4f6e2d31-file" /></center><br /></p>
<p dir="ltr"><center><img src="http://maps.google.com/maps/api/staticmap?center=22.572646,88.363895&zoom=15&size=960x540&sensor=false&markers=color:blue%7Clabel:!%7C22.572646,88.363895" /></center><br /> </p>
Can a regex solve the issue or should it be done in a different way?
Can JSOUP help in any way? Is there any other type of HTML parser which can do the job?
(<img\s+[^>]*>)
You can try this. Replace with <center>$1</center>. See the demo:
http://regex101.com/r/sU3fA2/38
Something like
var re = /(<img\s+[^>]*>)/g;
var str = '<p dir="ltr">test</p> \n<p dir="ltr"><img src="http://files.parsetfss.com/bcff7108-cbce-4ab8-b5d1-1f82827e6519/tfss-0de7a730-3fa9-4a1e-9f82-d34e4f6e2d31-file" /><br /></p> \n<p dir="ltr"><img src="http://maps.google.com/maps/api/staticmap?center=22.572646,88.363895&zoom=15&size=960x540&sensor=false&markers=color:blue%7Clabel:!%7C22.572646,88.363895" /><br /> </p>';
var subst = '<center>$1</center>';
var result = str.replace(re, subst);
With Jsoup, you can use the wrap() method on the selected elements.
It would look like this:
public String wrapImgWithCenter(String html) {
Document doc = Jsoup.parse(html);
doc.getElementsByTag("img").wrap("<center></center>");
return doc.html();
}
I implemented the JSOUP solution suggested by mourphy, with one small edit to the method, and it worked perfectly for me. The new method is:
public String wrapImgWithCenter(String html){
Document doc = Jsoup.parse(html);
doc.select("img").wrap("<center></center>");
return doc.html();
}
Thanks mourphy and vks for your help!
Using regex, you could also do this in Java:
String formatted = str.replaceAll("(<img\\s+[^>]*>)", "<center>$1</center>");
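If this runs often, the same replacement can be sketched with a precompiled pattern, using $0 (the whole match) so no capture group is needed. The class name ImgCenterWrapper is made up for illustration, and the pattern assumes no '>' character appears inside an attribute value.

```java
import java.util.regex.Pattern;

class ImgCenterWrapper {
    // Compiled once and reused; matches an <img ...> tag up to its closing '>'.
    private static final Pattern IMG = Pattern.compile("<img\\s+[^>]*>");

    static String wrap(String html) {
        // $0 refers to the entire matched tag.
        return IMG.matcher(html).replaceAll("<center>$0</center>");
    }

    public static void main(String[] args) {
        String html = "<p dir=\"ltr\"><img src=\"http://example.com/a.png\" /><br /></p>";
        System.out.println(wrap(html));
    }
}
```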
Let's say I have an HTML fragment like this:
<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>
What I want to extract from that is:
foo bar foobar baz
So my question is: how can I strip all the wrapping tags from the HTML and get only the text, in the same order as it appears in the HTML?
As you can see in the title, I want to use jsoup for the parsing.
Example for accented HTML (note the 'á' character):
<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>
<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>
What I want:
Tarthatatlan biztonsági viszonyok
Tarthatatlan biztonsági viszonyok
This HTML is not static; in general I just want all the text of a generic HTML fragment in decoded, human-readable form, with line breaks.
With Jsoup:
final String html = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
Document doc = Jsoup.parse(html);
System.out.println(doc.text());
Output:
foo bar foobar baz
If you want only the text of p-tag, use this instead of doc.text():
doc.select("p").text();
... or only body:
doc.body().text();
Line breaks:
final String html = "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>"
+ "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>";
Document doc = Jsoup.parse(html);
for( Element element : doc.select("p") )
{
System.out.println(element.text());
// eg. you can use a StringBuilder and append lines here ...
}
Output:
Tarthatatlan biztonsági viszonyok
Tarthatatlan biztonsági viszonyok
Using regex:
String str = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
// Strip the tags, then collapse the whitespace they leave behind.
str = str.replaceAll("<[^>]*>", "").replaceAll("\\s+", " ").trim();
System.out.println(str);
Output:
foo bar foobar baz
Using Jsoup:
Document doc = Jsoup.parse(str);
String text = doc.text();
Actually, the correct way to clean with Jsoup is through a Whitelist (in newer jsoup releases the class is named Safelist):
...
final String html = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
Document doc = Jsoup.parse(html);
Whitelist wl = Whitelist.none();
String cleanText = Jsoup.clean(doc.html(), wl);
If you still want to preserve some tags:
// relaxed() is a static factory method, so no "new" is needed.
Whitelist wl = Whitelist.relaxed().removeTags("a");
I have an HTML string that includes lots of image tags. I need to get each tag and change it. For example:
String imageRegex = "(<img.+(src=\".+\").+/>){1}";
String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />hello world<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
Matcher matcher = Pattern.compile(imageRegex, Pattern.CASE_INSENSITIVE).matcher(str);
int i = 0;
while (matcher.find()) {
i++;
Log.i("TAG", matcher.group());
}
The result is:
<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />hello world<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />
But that's not what I want. I want the result to be:
<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />
<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />
What's wrong with my regular expression?
Try (<img)(.*?)(/>); this should do the trick, although yes, you shouldn't use regex for parsing HTML, as people will tell you over and over.
I don't have Eclipse installed, but I have VS2010, and this works for me in C#:
String imageRegex = "(<img)(.*?)(/>)";
String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />hello world<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
System.Text.RegularExpressions.MatchCollection match = System.Text.RegularExpressions.Regex.Matches(str, imageRegex, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
StringBuilder sb = new StringBuilder();
foreach (System.Text.RegularExpressions.Match m in match)
{
sb.AppendLine(m.Value);
}
System.Windows.MessageBox.Show(sb.ToString());
Result:
<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />
<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />
David M is correct, you really shouldn't try to do this, but your specific problem is that the + quantifier in your regex is greedy, so it will match the longest possible substring that could match.
See The regex tutorial for more details on the quantifiers.
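To see the difference in Java, here is a small sketch that runs a greedy and a lazy version of the pattern over a shortened form of the question's string. The class name GreedyVsLazy and the helper findAll are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    /** Collects every match of regex in input, in order. */
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String str = "<img src=\"9.gif\" alt=\"\" />hello world<img src=\"7.gif\" alt=\"\" />";
        // Greedy: .+ runs to the end and backtracks only to the LAST "/>".
        System.out.println(findAll("<img.+/>", str).size());
        // Lazy: .+? stops at the FIRST "/>" after each "<img".
        System.out.println(findAll("<img.+?/>", str).size());
    }
}
```

The greedy pattern returns a single match spanning both tags, while the lazy one returns each tag separately.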
I would NOT recommend using regex for parsing HTML. Please consider JSoup or a similar solution:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements images = doc.select("img");
Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp.