Java-Jsoup, scrape html - java

I am using Jsoup with Java to parse an HTML file. How can I extract just the line that says "Hourly Rate: 23,016 orders"?
I am parsing a lot of files, so the number next to "Hourly Rate" will change from file to file.
<html>
<head>
<title>Testing</title>
</head>
<body>
<p class=MsoNormal align=center style='background:#DEDEDF'>
<span style='font-size:18.0pt'><b>Testing</b></span></p>
Hourly Rate: 23,016 orders<br>
<table border=0 cellpadding=0>
<tr valign=top>
<td>
Thanks

I just added this code:
String HourlyRate = doc.body().ownText();
//String text = doc.body().text();
System.out.println(HourlyRate);
This Printed out:
Hourly Rate: 23,016 orders

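Grab the element with the MsoNormal class, then use a regular expression to look for a number, e.g.:
Document doc = Jsoup.parse(htmlString);
Element msoNormal = doc.getElementsByClass("MsoNormal").first();
if (msoNormal != null) {
    // Matches a number with a thousands separator, such as 23,016.
    // In the sample HTML the rate text follows </p>, so if it is not inside
    // the paragraph, match against doc.body().ownText() instead.
    Pattern p = Pattern.compile("[0-9]+,[0-9]+");
    Matcher m = p.matcher(msoNormal.text());
    if (m.find()) {
        System.out.println(m.group());
    }
}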
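A minimal sketch of the regex step on its own, assuming the text has already been pulled out of the page (for example with doc.body().ownText(), as in the follow-up above). The class name HourlyRateExtractor and the method extract are made up for illustration; only java.util.regex is used.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup.
final class HourlyRateExtractor {
    // The label "Hourly Rate:" followed by digits that may contain thousands separators.
    private static final Pattern RATE = Pattern.compile("Hourly Rate:\\s*([0-9][0-9,]*)");

    /** Returns the rate as a number, or an empty Optional if the label is not found. */
    static Optional<Long> extract(String text) {
        Matcher m = RATE.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        // Drop the separators before parsing, so "23,016" becomes 23016.
        return Optional.of(Long.parseLong(m.group(1).replace(",", "")));
    }

    public static void main(String[] args) {
        System.out.println(extract("Hourly Rate: 23,016 orders")); // prints Optional[23016]
    }
}
```

Anchoring the pattern on the label rather than on "any number with a comma" should keep it from picking up unrelated numbers elsewhere in the file.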

Related

Java regex , jsoup

How can I extract these values with a regex or jsoup? 19040172b-1、SQL Server Develop、zheng、3-5,7-14、D-101
<div id="AE9D7F630640426F8457A661607D2B8E-5-2" style="display: none;" class="kbcontent">
19040172b-1
<br>SQL Server Develop
<br>
<font title="teacher">zheng</font>
<br>
<font title="week">3-5,7-14</font>
<br>
<font title="classroom">D-101</font>
<br>
</div>
I have tried the following ways but failed.
1. Pattern pattern = Pattern.compile(">(.*?)<br>");
2. Elements msg = doc.select(":matchesOwn([>.*?<br>])");
1) First, it's never a good idea to parse HTML with a regex. You can read more about that here.
2) You can just take all the text between the tags.
Document doc = Jsoup.parse(file, charsetName);
String text= doc.text();
System.out.println(text);
String html = "<div id=\"AE9D7F630640426F8457A661607D2B8E-5-2\" style=\"display: none;\" class=\"kbcontent\"> 19040172b-1 <br>SQL Server Develop <br> <font title=\"teacher\">zheng</font> <br> <font title=\"week\">3-5,7-14</font> <br> <font title=\"classroom\">D-101</font> <br> </div> ";
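// Replace each <br> with a marker that survives text extraction
html = html.replaceAll("<br>", "#~#");
Document doc = Jsoup.parse(html);
String newHtml = doc.text();
// Split on the marker; each piece may still carry surrounding whitespace
String[] ary = newHtml.split("#~#");
This does the job, though there may be cleaner ways to handle the br tags.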
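One possibly cleaner alternative, as a sketch only: walk the child nodes of the div and collect the text pieces, skipping the br elements and blank text. The class name KbContentFields and the method extract are hypothetical; the Jsoup calls used (select, childNodes, TextNode.isBlank, TextNode.text, Element.tagName, Element.text) are standard ones, but I have only reasoned about this against the snippet in the question.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper for the div.kbcontent snippet in the question.
final class KbContentFields {
    /** Collects the non-blank text pieces of the first div.kbcontent, in document order. */
    static List<String> extract(String html) {
        List<String> fields = new ArrayList<>();
        Element div = Jsoup.parse(html).select("div.kbcontent").first();
        if (div == null) {
            return fields; // no matching div in this document
        }
        for (Node node : div.childNodes()) {
            if (node instanceof TextNode) {
                // Bare text such as "19040172b-1" or "SQL Server Develop"
                TextNode text = (TextNode) node;
                if (!text.isBlank()) {
                    fields.add(text.text().trim());
                }
            } else if (node instanceof Element && !"br".equals(((Element) node).tagName())) {
                // Wrapped values such as <font title="teacher">zheng</font>
                fields.add(((Element) node).text().trim());
            }
        }
        return fields;
    }
}
```

This avoids inventing a marker string that might collide with real content, at the cost of a slightly longer loop.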

JSOUP - Help getting <IMG SRC> from <DIV CLASS>

I have the HTML snippet below. There are multiple divs with the class "teaser-img" throughout the document. I want to grab the "img src" value from every one of these "teaser-img" divs.
<div class="teaser-img">
<a href="/julien/blog/failure-consciousness-vs-success-consciousness-shifting-focus-become-badass-or-loser">
<img src="http://www.rsdnation.com/files/imagecache/blog_thumbnail/files/blog_thumbs/rsdnatonaustin.jpg" alt="" title=""/>
</a>
</div>
I have tried many things so I wouldn't know what code to share with you guys. Your help will be much appreciated.
final String html = "<div class=\"teaser-img\">\n"
+ " <a href=\"/julien/blog/failure-consciousness-vs-success-consciousness-shifting-focus-become-badass-or-loser\">\n"
+ " <img src=\"http://www.rsdnation.com/files/imagecache/blog_thumbnail/files/blog_thumbs/rsdnatonaustin.jpg\" alt=\"\" title=\"\"/>\n"
+ " </a>\n"
+ "</div>";
// Parse the html from string or eg. connect to a website using connect()
Document doc = Jsoup.parseBodyFragment(html);
for( Element element : doc.select("div.teaser-img img[src]") )
{
System.out.println(element);
}
Output:
<img src="http://www.rsdnation.com/files/imagecache/blog_thumbnail/files/blog_thumbs/rsdnatonaustin.jpg" alt="" title="">
See here for documentation about the selector syntax.
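The loop above prints the whole img element. If only the URL is wanted, a sketch along these lines should work, assuming the same selector; the class name TeaserImages and the method sources are made up for illustration. Element.attr("src") returns the attribute as written in the page; Element.absUrl("src") could be used instead when relative URLs need resolving against a base URI.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper built on the selector from the answer above.
final class TeaserImages {
    /** Returns the src value of every img inside a div with class teaser-img. */
    static List<String> sources(String html) {
        Document doc = Jsoup.parse(html);
        List<String> urls = new ArrayList<>();
        for (Element img : doc.select("div.teaser-img img[src]")) {
            urls.add(img.attr("src")); // the attribute value only, not the whole tag
        }
        return urls;
    }
}
```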

How to parse 'div' without name?

Using Jsoup:
Element movie_div = doc.select("div.movie").first();
I have HTML like this:
<div class="movie">
<div>
<div>
<strong>Year:</strong> 2014
</div>
<div>
<strong>Country:</strong> USA
</div>
</div>
</div>
How can I use jsoup to extract the country and the year?
For the example html I want the extracted values to be "2014" and "USA".
Thanks.
Use
Element e = doc.select("div.movie").first().child(0);
List<TextNode> textNodes = e.child(0).textNodes();
String year = textNodes.get(textNodes.size()-1).text().trim();
textNodes = e.child(1).textNodes();
String country = textNodes.get(textNodes.size()-1).text().trim();
Did you try something like:
Element movie_div = doc.select("div.movie strong").first();
And to get the text value, you should try:
movie_div.text(); // for the sample HTML this is the label "Year:", not the value after it
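A sketch that combines the two ideas above: select each strong label, then read the own text of its parent div, which is the value sitting next to the label. The class name MovieFields and the method parse are hypothetical, and this assumes every value is a bare text node beside its strong tag, as in the sample HTML.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper for the div.movie snippet in the question.
final class MovieFields {
    /** Maps each label (without the colon) to the text that follows it in the same div. */
    static Map<String, String> parse(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> fields = new LinkedHashMap<>();
        for (Element label : doc.select("div.movie strong")) {
            String key = label.text().replace(":", "").trim();
            // ownText() skips the text of child elements, so the label itself is excluded.
            String value = label.parent().ownText().trim();
            fields.put(key, value);
        }
        return fields;
    }
}
```

With the sample HTML, parse(html).get("Year") would give "2014" and parse(html).get("Country") would give "USA", without relying on the position of each div.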

Parsing table data with jsoup

I am using jsoup in my Android app to parse HTML, but now I need to parse table data and I cannot get it to work. I have tried many approaches without success, so I am asking here in case anyone has experience with this.
Here is part of my html:
<div id="editacia_jedla">
<h2>My header</h2>
<h3>My sub header</h3>
<table border="0" class="jedalny_listok_tabulka" cellpadding="2" cellspacing="1">
<tr>
<td width="100" class="menu_nazov neparna" align="left">Food Menu 1</td>
<td class="jedlo neparna" align="left">vegetable and beef
<div class="jedlo_box_alergeny">Allergens: 1, 3</div>
</td>
</tr>
<tr>
<td width="100" class="menu_nazov parna" align="left">Food Menu 2</td>
<td class="jedlo parna" align="left">Potato salad and pork
<div class="jedlo_box_alergeny">Allergens: 6</div>
</td>
</tr>
</table>
etc
</div>
My java/android code:
try {
String tableHtmlCode="";
Document fullHtmlDocument = Jsoup.connect(urlOfFoodDay).get();
Element elm1 = fullHtmlDocument.select("#editacia_jedla").first();
for( Element element : elm1.children() )
{
tableHtmlCode+=element.getElementsByIndexEquals(2); //this set table content because 0=h2, 1=h3
}
Document parsedTableDocument = Jsoup.parse(tableHtmlCode);
//Element th = parsedTableDocument.select("td[class=jedlo neparna]").first(); THIS IS BAD
String foodContent="";
String foodAllergens="";
}
So now I want to extract the text "vegetable and beef" and save it to the string foodContent, and save the numbers "1, 3" (together) from the div with class jedlo_box_alergeny to the string foodAllergens. Can someone help? I would be very grateful for any ideas.
Get the table with the class jedalny_listok_tabulka and loop over its td tags.
Each td tag is the parent of the a (link) tags that hold the allergen values. Hence, you would loop over the a elements to get your numbers, something like:
Elements myElements = doc.getElementsByClass("jedalny_listok_tabulka")
.first().getElementsByTag("td");
for (Element element : myElements) {
if (element.className().contains("jedlo")) {
String foodContent = element.ownText();
String foodAllergen = "";
for (Element href : element.getElementsByTag("a")) {
foodAllergen += " " + href.text();
}
System.out.println(foodContent + " : " + foodAllergen);
}
}
Output:
vegetable and beef : 1 3
Potato salad and pork : 6
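The HTML posted in the question keeps the allergens in a div with class jedlo_box_alergeny rather than in link tags, so for that markup a variant along these lines may fit better. This is a sketch only: the class name FoodMenu and the method parse are made up, and it assumes the "Allergens:" prefix shown in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper for the table markup shown in the question.
final class FoodMenu {
    /** Maps each food description to its allergen list, e.g. "vegetable and beef" to "1, 3". */
    static Map<String, String> parse(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> menu = new LinkedHashMap<>();
        for (Element cell : doc.select("table.jedalny_listok_tabulka td.jedlo")) {
            // ownText() leaves out the nested allergen div.
            String foodContent = cell.ownText().trim();
            Element box = cell.select("div.jedlo_box_alergeny").first();
            String foodAllergens = box == null
                    ? ""
                    : box.text().replaceFirst("^Allergens:\\s*", "");
            menu.put(foodContent, foodAllergens);
        }
        return menu;
    }
}
```

In the Android code from the question, the document returned by Jsoup.connect(urlOfFoodDay).get() could be queried with the same selector directly, without building an intermediate HTML string first.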

java: regular expression

I have an HTML string which includes lots of image tags. I need to get each tag and change it. For example:
String imageRegex = "(<img.+(src=\".+\").+/>){1}";
String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />hello world<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
Matcher matcher = Pattern.compile(imageRegex, Pattern.CASE_INSENSITIVE).matcher(str);
int i = 0;
while (matcher.find()) {
i++;
Log.i("TAG", matcher.group());
}
The result is:
<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />hello world<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />
But that is not what I want. I want the result to be:
<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />
<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />
What's wrong with my regular expression?
Try (<img)(.*?)(/>), this should do the trick, although yes, you shouldn't use Regex for parsing HTML, as people will tell you over and over.
I don't have Eclipse installed, but I have VS2010, and this works for me (the code below is C#):
String imageRegex = "(<img)(.*?)(/>)";
String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />hello world<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
System.Text.RegularExpressions.MatchCollection match = System.Text.RegularExpressions.Regex.Matches(str, imageRegex, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
StringBuilder sb = new StringBuilder();
foreach (System.Text.RegularExpressions.Match m in match)
{
sb.AppendLine(m.Value);
}
System.Windows.MessageBox.Show(sb.ToString());
Result:
<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />
<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />
David M is correct, you really shouldn't try to do this, but your specific problem is that the + quantifier in your regex is greedy, so it will match the longest possible substring that could match.
See The regex tutorial for more details on the quantifiers.
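To see the difference between the greedy and the reluctant quantifier in Java, here is a small sketch using only java.util.regex. The class name QuantifierDemo, the method findAll, and the shortened test string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing greedy and reluctant quantifiers.
final class QuantifierDemo {
    /** Returns every match of regex in input, in order. */
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String str = "<img src=\"a.gif\" alt=\"\" />hello world<img src=\"b.gif\" alt=\"\" />";
        // Greedy: .+ runs to the last "/>" in the string, so both tags end up in one match.
        System.out.println(findAll("<img.+/>", str).size());  // prints 1
        // Reluctant: .+? stops at the first "/>", so each tag is matched separately.
        System.out.println(findAll("<img.+?/>", str).size()); // prints 2
    }
}
```

This only illustrates the quantifier behaviour; it does not make regex a robust way to handle HTML in general.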
I would NOT recommend using regex for parsing HTML. Please consider jsoup or a similar solution:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements images = doc.select("img");
Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp.
