I'm reading an HTML file using jsoup. I want to print the contents of the HTML table. How can I do that?
I'm a beginner with jsoup, and not a very experienced Java developer. :)
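import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class test {
    public static void main(String[] args) throws IOException {
        // TODO auto-generated method stub
        File input = new File("D://index.html"); // read from an HTML file
        Document doc = Jsoup.parse(input, "UTF-8");
        //test
        Elements trs = doc.select("table").select("tr");
        for (Element e : trs) {
            System.out.println("-------------------");
            System.out.println(e.text());
        }
    }
}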
I don't know jsoup well, but I'd guess you should descend into the HTML structure step by step, like this:
...
//test
Elements tables = doc.select("table");
for (Element table : tables) {
    for (Element row : table.select("tr")) {
        for (Element e : row.select("td")) {
            // output your td-contents here
            System.out.println("-------------------");
            System.out.println(e.text());
        }
    }
}
...
The advantage of this approach is that you have more control over drawing separators between the HTML elements.
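As a minimal sketch of that extra control, the nested loops can join the cells of each row with a separator of your choice. The class name TableRows, the " | " separator, and the inline sample HTML are illustrative assumptions, not part of the original question, and the sketch assumes the jsoup library is on the classpath.

```java
import java.util.StringJoiner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableRows {

    // Renders every row of every table as one line, cells joined by " | ".
    // (Hypothetical helper for illustration.)
    static String render(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder out = new StringBuilder();
        for (Element table : doc.select("table")) {
            for (Element row : table.select("tr")) {
                StringJoiner cells = new StringJoiner(" | ");
                // include header cells (th) as well as data cells (td)
                for (Element cell : row.select("th, td")) {
                    cells.add(cell.text());
                }
                out.append(cells).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // made-up sample table, not taken from the question
        String html = "<table>"
                + "<tr><th>Name</th><th>Age</th></tr>"
                + "<tr><td>Ann</td><td>31</td></tr>"
                + "</table>";
        System.out.print(render(html));
    }
}
```

Because the separator is added per row rather than per cell, changing the layout later (tabs, commas, fixed-width columns) only touches the StringJoiner line.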
I'm trying to scrape some data from the table on this site: https://www.worldometers.info/coronavirus/
Here is the source code of the scraper I've tried:
public static void main(String[] args) throws Exception {
    String url = "https://www.worldometers.info/coronavirus/";
    try {
        Document doc = Jsoup.connect(url).get();
        Element table = doc.getElementById("main_table_countries_today");
        Elements rows = table.getElementsByTag("tr");
        for (Element row : rows) {
            Elements tds = row.getElementsByTag("td");
            for (int i = 0; i < tds.size(); i++) {
                System.out.println(tds.get(i).text());
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
And here is the output
China
80,928
+34
3,245
+8
70,420
7,263
2,274
56
Italy
35,713 ....
I would like to scrape the data for only one specific country, e.g. France.
But I have no idea how to do it.
You first have to check whether any "td" in the row contains "France"; if one does, print the whole row.
public static void main(String[] args) throws Exception {
    String url = "https://www.worldometers.info/coronavirus/";
    try {
        Document doc = Jsoup.connect(url).get();
        Element table = doc.getElementById("main_table_countries_today");
        Elements rows = table.getElementsByTag("tr");
        for (Element row : rows) {
            Elements tds = row.getElementsByTag("td");
            for (int i = 0; i < tds.size(); i++) {
                // print the whole row as soon as one cell is exactly "France"
                if (tds.get(i).text().equals("France")) {
                    System.out.println(row.text());
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Output:
France 14,459 562 1,587 12,310 1,525 222
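The same check can be wrapped in a small helper that stops at the first matching row and tolerates a missing table. This is only a sketch: the class name CountryRow, the method findRow, and the two-row sample table are assumptions made for illustration, and it runs against an inline HTML string rather than the live site.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CountryRow {

    // Returns the text of the first row that has a cell whose text is exactly
    // `country`, or null if the table or the row is not found.
    // (Hypothetical helper for illustration.)
    static String findRow(String html, String tableId, String country) {
        Document doc = Jsoup.parse(html);
        Element table = doc.getElementById(tableId);
        if (table == null) {
            return null; // no element with that id on the page
        }
        for (Element row : table.select("tr")) {
            List<String> cells = row.select("td").eachText();
            if (cells.contains(country)) {
                return row.text();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // made-up sample in the shape of the table from the question
        String html = "<table id=\"main_table_countries_today\">"
                + "<tr><td>China</td><td>80,928</td></tr>"
                + "<tr><td>France</td><td>14,459</td></tr>"
                + "</table>";
        System.out.println(findRow(html, "main_table_countries_today", "France"));
    }
}
```

Returning the row text instead of printing it makes it easy to reuse the helper for other countries or to parse the numbers afterwards.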
I have an HTML string like
<b>test</b><b>er</b>
<span class="ab">continue</span><span> without</span>
I want to collapse adjacent tags that are identical and belong together. In the above sample I want to have
<b>tester</b>
since both elements have the same tag without any further attribute or style. The span tags, however, should remain as they are, because one of them has a class attribute. I am aware that I can iterate over the tree with Jsoup.
Document doc = Jsoup.parse(input);
for (Element element : doc.select("b")) {
}
But it's not clear to me how to look ahead (I guess with something like nextSibling), and then how to collapse the elements.
Or is there a simple regexp-based merge?
I can specify the attributes on my own. A one-size-fits-all solution for every tag is not required.
My approach would be like this; see the comments in the code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;
public class StackOverflow60704600 {
    public static void main(final String[] args) throws IOException {
        Document doc = Jsoup.parse("<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>");
        mergeSiblings(doc, "b");
        System.out.println(doc);
    }
    private static void mergeSiblings(Document doc, String selector) {
        Elements elements = doc.select(selector);
        for (Element element : elements) {
            // get the next sibling
            Element nextSibling = element.nextElementSibling();
            // merge only if the next sibling has the same tag name and the same set of attributes
            if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
                    && nextSibling.attributes().equals(element.attributes())) {
                // your element has only one child, but let's rewrite all of them if there's more
                while (nextSibling.childNodes().size() > 0) {
                    Node siblingChildNode = nextSibling.childNodes().get(0);
                    element.appendChild(siblingChildNode);
                }
                // remove because now it doesn't have any children
                nextSibling.remove();
            }
        }
    }
}
output:
<html>
<head></head>
<body>
<b>tester</b>
<span class="ab">continue</span>
<span> without</span>
</body>
</html>
One more note on why I used the loop while (nextSibling.childNodes().size() > 0): a for loop or an iterator can't be used here, because appendChild adds the child to the target but also removes it from the source element, so the remaining children are shifted. It may not be visible here, but the problem appears when you try to merge: <b>test</b><b>er<a>123</a></b>
I tried to update the code from @Krystian G, but my edit was rejected :-/ So I'm posting it as a separate answer. The code is an excellent starting point, but it fails if a TextNode appears between the tags, e.g.
<span> no class but further</span> (in)valid <span>spanning</span>
would result in
<span> no class but furtherspanning</span> (in)valid
Therefore the corrected code looks like:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
public class StackOverflow60704600 {
    public static void main(final String[] args) throws IOException {
        String test1 = "<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>";
        String test2 = "<b>test</b><b>er<a>123</a></b>";
        String test3 = "<span> no class but further</span> <span>spanning</span>";
        String test4 = "<span> no class but further</span> (in)valid <span>spanning</span>";
        Document doc = Jsoup.parse(test1);
        mergeSiblings(doc, "b");
        System.out.println(doc);
    }
    private static void mergeSiblings(Document doc, String selector) {
        Elements elements = doc.select(selector);
        for (Element element : elements) {
            Node nextElement = element.nextSibling();
            // if the next node is a TextNode that contains only whitespace ==> we need to
            // preserve the spacing
            boolean addSpace = false;
            if (nextElement instanceof TextNode) {
                // getWholeText() returns the raw text; toString() would return escaped HTML
                String content = ((TextNode) nextElement).getWholeText();
                if (!content.isBlank()) { // String.isBlank() requires Java 11+
                    // the next node has some content, so do not merge across it
                    continue;
                } else {
                    addSpace = true;
                }
            }
            // get the next sibling
            Element nextSibling = element.nextElementSibling();
            // merge only if the next sibling has the same tag name and the same set of
            // attributes
            if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
                    && nextSibling.attributes().equals(element.attributes())) {
                // your element has only one child, but let's rewrite all of them if there's more
                while (nextSibling.childNodes().size() > 0) {
                    Node siblingChildNode = nextSibling.childNodes().get(0);
                    if (addSpace) {
                        // since we have had some space previously ==> preserve it and add it
                        if (siblingChildNode instanceof TextNode) {
                            TextNode textChild = (TextNode) siblingChildNode;
                            textChild.text(" " + textChild.getWholeText());
                        } else {
                            element.appendChild(new TextNode(" "));
                        }
                        // only the first moved child needs the preserved space
                        addSpace = false;
                    }
                    element.appendChild(siblingChildNode);
                }
                // remove because now it doesn't have any children
                nextSibling.remove();
            }
        }
    }
}
So I am trying to get the data from this webpage using Jsoup...
I've looked up many different ways of doing it and I've gotten close, but I don't know how to find the tags for specific stats (Attack, Strength, Defence, etc.).
So let's say, for example's sake, I wanted to print out
'Attack', '15', '99', '200,000,000'
How should I go about doing this?
You can use CSS selectors in Jsoup to easily extract the column data.
// retrieve page source code
Document doc = Jsoup
        .connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
        .get();
// find all of the table rows
Elements rows = doc.select("div#contentHiscores table tr");
ListIterator<Element> itr = rows.listIterator();
// loop over each row
while (itr.hasNext()) {
    Element row = itr.next();
    // does the second col contain the word attack?
    if (row.select("td:nth-child(2) a:contains(attack)").first() != null) {
        // if so, assign each sibling col to variable
        String rank = row.select("td:nth-child(3)").text();
        String level = row.select("td:nth-child(4)").text();
        String xp = row.select("td:nth-child(5)").text();
        System.out.printf("rank=%s level=%s xp=%s", rank, level, xp);
        // stop looping rows, found attack
        break;
    }
}
A very rough implementation would be as below. It is only a snippet; optimizations and further conditions still need to be added.
public static void main(String[] args) throws Exception {
    Document doc = Jsoup
            .connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
            .get();
    Element contentHiscoresDiv = doc.getElementById("contentHiscores");
    Element table = contentHiscoresDiv.child(0);
    for (Element row : table.select("tr")) {
        Elements tds = row.select("td");
        for (Element column : tds) {
            if (column.children() != null && column.children().size() > 0) {
                Element anchorTag = column.getElementsByTag("a").first();
                if (anchorTag != null && anchorTag.text().contains("Attack")) {
                    System.out.println(anchorTag.text());
                    Elements attributeSiblings = column.siblingElements();
                    for (Element attributeSibling : attributeSiblings) {
                        System.out.println(attributeSibling.text());
                    }
                }
            }
        }
    }
}
Attack
15
99
200,000,000
I was trying to scrape data from a website, and to some extent I succeeded. The problem is that the web page I am scraping contains multiple HTML tables. When I run my program, it writes only the first table's data to the CSV file and does not retrieve the other tables. My Java class code is as follows.
public static void parsingHTML() throws Exception {
    //tbodyElements = doc.getElementsByTag("tbody");
    for (int i = 1; i <= 1; i++) {
        Elements table = doc.getElementsByTag("table");
        if (table.isEmpty()) {
            throw new Exception("Table is not found");
        }
        elements = table.get(0).getElementsByTag("tr");
        for (Element trElement : elements) {
            trElement2 = trElement.getElementsByTag("tr");
            tdElements = trElement.getElementsByTag("td");
            File fold = new File("C:\\convertedCSV9.csv");
            fold.delete();
            File fnew = new File("C:\\convertedCSV9.csv");
            FileWriter sb = new FileWriter(fnew, true);
            //StringBuilder sb = new StringBuilder(" ");
            //String y = "<tr>";
            for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
                //Element tdElement1 = it.next();
                //final String content2 = tdElement1.text();
                if (it.hasNext()) {
                    sb.append("\r\n");
                }
                for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
                    Element tdElement2 = it.next();
                    final String content = tdElement2.text();
                    //stringjoiner.add(content);
                    //sb.append(formatData(content));
                    if (it2.hasNext()) {
                        sb.append(formatData(content));
                        sb.append(" , ");
                    }
                    if (!it.hasNext()) {
                        String content1 = content.replaceAll(",$", " ");
                        sb.append(formatData(content1));
                        //it2.next();
                    }
                }
                System.out.println(sb.toString());
                sb.flush();
                sb.close();
            }
            System.out.println(sampleList.add(tdElements));
        }
    }
}
My analysis is that there is a loop which only checks the tr and td elements. After the first table there is a style sheet on the HTML page. Maybe the loop is breaking because of the style sheet. I think that's why it is not proceeding to the next table.
P.S.: here's the link I am trying to scrape:
http://www.mufap.com.pk/nav_returns_performance.php?tab=01
What you do just at the beginning of your code will not work:
// loop just once, why
for (int i = 1; i <= 1; i++) {
Elements table = doc.getElementsByTag("table");
if (table.isEmpty()) {
throw new Exception("Table is not found");
}
elements = table.get(0).getElementsByTag("tr");
Here you loop just once, read all table elements, and then process all tr elements of the first table you find. So even if you looped more than once, you would always process the first table.
You will have to iterate over all table elements, e.g.
for (Element table : doc.getElementsByTag("table")) {
    for (Element trElement : table.getElementsByTag("tr")) {
        // process "td"s and so on
    }
}
Edit: Since you're having trouble with the code above, here's a more thorough example. Note that I'm using Jsoup to read and parse the HTML (you didn't specify what you are using).
Document doc = Jsoup
        .connect("http://www.mufap.com.pk/nav_returns_performance.php?tab=01")
        .get();
for (Element table : doc.getElementsByTag("table")) {
    for (Element trElement : table.getElementsByTag("tr")) {
        // skip header "tr"s and process only data "tr"s
        if (trElement.hasClass("tab-data1")) {
            StringJoiner tdj = new StringJoiner(",");
            for (Element tdElement : trElement.getElementsByTag("td")) {
                tdj.add(tdElement.text());
            }
            System.out.println(tdj);
        }
    }
}
This will concatenate and print all data cells of each row that has the class tab-data1. You will still have to modify it to write to your CSV file, though.
Note: in my tests this processes 21 tables, 243 trs and 2634 tds.
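For the remaining step of writing the joined cells to a CSV file, a rough sketch using only the standard library could look like the following. The class name CsvRows, the file name converted.csv, and the sample rows are made up for illustration; the quoting follows the usual CSV convention of wrapping fields that contain commas or quotes, which matters here because values such as 1,234.50 contain the separator themselves.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

public class CsvRows {

    // Quotes a field when it contains a comma, a double quote, or a line break.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Joins the escaped cells of one row with commas.
    static String toCsvLine(List<String> cells) {
        StringJoiner line = new StringJoiner(",");
        for (String cell : cells) {
            line.add(escape(cell));
        }
        return line.toString();
    }

    // Opens the file once and writes every row; try-with-resources closes the
    // writer even if a write fails.
    static void write(Path target, List<List<String>> rows) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (List<String> row : rows) {
                out.write(toCsvLine(row));
                out.write("\r\n");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // hypothetical rows standing in for the td texts collected per tr
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Fund A", "1,234.50", "0.12"),
                Arrays.asList("Fund B", "98.10", "-0.03"));
        write(Paths.get("converted.csv"), rows);
    }
}
```

Opening the writer once, outside the row loop, also avoids the pattern in the question where the file is deleted and reopened for every tr.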
How could I use Jsoup to extract the specification data from this website separately for each row, e.g. Network->Network Type, Battery, etc.?
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class mobilereviews {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
        for (Element table : doc.select("table")) {
            for (Element row : table.select("tr")) {
                Elements tds = row.select("td");
                System.out.println(tds.get(0).text());
            }
        }
    }
}
Here is an attempt to find the solution to your problem:
Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
for (Element table : doc.select("table[id=phone_details]")) {
    for (Element row : table.select("tr:gt(2)")) {
        Elements tds = row.select("td:not([rowspan])");
        System.out.println(tds.get(0).text() + "->" + tds.get(1).text());
    }
}
Parsing the HTML is tricky, and if the HTML changes, your code needs to change as well.
You need to study the HTML markup first to come up with your parsing rules.
There are multiple tables in the HTML, so you first filter on the correct one: table[id=phone_details]
The first 2 table rows contain only markup for formatting, so skip them: tr:gt(2)
Every other row starts with the global description for the content type, so filter it out: td:not([rowspan])
For more complex options in the selector syntax, look here: http://jsoup.org/cookbook/extracting-data/selector-syntax
XPath for the columns - //*[@id="phone_details"]/tbody/tr[3]/td[2]/strong
XPath for the values - //*[@id="phone_details"]/tbody/tr[3]/td[3]
@Joey's code tries to zero in on these. You should be able to write the select() rules based on the XPath.
Replace the numbers (tr[N] / td[N]) with appropriate values.
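One possible translation of those two XPath expressions into jsoup CSS selectors is sketched below, using the structural pseudo-class :nth-of-type(N) in place of the positional tr[N] / td[N] steps. The class name XPathToCss and the SAMPLE markup are assumptions for illustration; the sample only imitates the shape implied by the XPath and is not the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class XPathToCss {

    // CSS counterparts of the XPath steps: tr[3] -> tr:nth-of-type(3), td[2] -> td:nth-of-type(2)
    static final String LABEL =
            "#phone_details > tbody > tr:nth-of-type(3) > td:nth-of-type(2) > strong";
    static final String VALUE =
            "#phone_details > tbody > tr:nth-of-type(3) > td:nth-of-type(3)";

    // Made-up markup with the structure the XPath expressions imply.
    static final String SAMPLE = "<table id=\"phone_details\"><tbody>"
            + "<tr><td colspan=\"3\">header</td></tr>"
            + "<tr><td colspan=\"3\">spacer</td></tr>"
            + "<tr><td rowspan=\"2\">Network</td>"
            + "<td><strong>Network Type</strong></td><td>GSM</td></tr>"
            + "</tbody></table>";

    // Returns "label->value" for the third row of the table.
    static String describe(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select(LABEL).text() + "->" + doc.select(VALUE).text();
    }

    public static void main(String[] args) {
        System.out.println(describe(SAMPLE));
    }
}
```

To walk all rows, the fixed 3 would be replaced by a loop variable or by iterating over the tr elements directly.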
Alternatively, you can pipe the HTML through a text-only browser and extract the data from the text. Here is the text version of the page. You can split the text on a delimiter or read after N chars to extract the data.
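If the text dump happens to contain one "label: value" pair per line, splitting on the delimiter could be sketched as follows. That line format, the class name TextDumpParser, and the sample values are all assumptions for illustration; the actual text version of the page may be laid out differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TextDumpParser {

    // Splits each line at the first occurrence of the delimiter; lines without
    // a delimiter (or with an empty label) are skipped.
    static Map<String, String> parse(String text, String delimiter) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : text.split("\\R")) { // \R matches any line break
            int idx = line.indexOf(delimiter);
            if (idx <= 0) {
                continue;
            }
            String key = line.substring(0, idx).trim();
            String value = line.substring(idx + delimiter.length()).trim();
            if (!key.isEmpty()) {
                result.put(key, value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // hypothetical text dump, not copied from the real page
        String dump = "Network Type: GSM\nBattery: Li-Ion\nsome line without a delimiter";
        parse(dump, ":").forEach((k, v) -> System.out.println(k + "->" + v));
    }
}
```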
This is how I get the data from an HTML table.
org.jsoup.nodes.Element tablaRegistros = doc
        .getElementById("tableId");
for (org.jsoup.nodes.Element row : tablaRegistros.select("tr")) {
    for (org.jsoup.nodes.Element column : row.select("td")) {
        // Elements tds = row.select("td");
        // cadena += tds.get(0).text() + "->" +
        // tds.get(1).text()
        // + " \n";
        cadena += column.text() + ",";
    }
    cadena += "\n";
}
Here is a generic solution for extracting a table from an HTML page with Jsoup.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ExtractTableDataUsingJSoup {
    public static void main(String[] args) {
        extractTableUsingJsoup("http://mobilereviews.net/details-for-Motorola%20L7.htm", "phone_details");
    }
    public static void extractTableUsingJsoup(String url, String tableId) {
        Document doc;
        try {
            // need http protocol
            doc = Jsoup.connect(url).get();
            // Set id of any table from any website and the below code will print the contents of the table.
            // Set the extracted data in appropriate data structures and use them for further processing
            Element table = doc.getElementById(tableId);
            if (table == null) {
                // getElementById returns null when no element has this id
                System.out.println("No table with id " + tableId);
                return;
            }
            Elements tds = table.getElementsByTag("td");
            // You can check for nesting of tds if such structure exists
            for (Element td : tds) {
                System.out.println("\n" + td.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}