I am trying to get some data from a div that is embedded after an ID and an input with type=hidden. I cannot reach the class to get the links listed in that class.
I am using Jsoup with Elements and .select() or .getElementsById() and have tried to combine them to reach the class, without success. The site is https://www.ariva.de/aktien/suche. If you hit the "Suche starten" search button, the result table pops up. The links in this table are what I want to reach.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DatenImportUnternehmen {

    public static void main(String[] args) {
        String url = "https://www.ariva.de/aktien/suche";
        try {
            Document document = Jsoup.connect(url).get();
            for (Element row : document.select("div.aktiensuche_result_table")) {
                if (row.select("input[type=hidden]").text().equals("")) {
                    continue;
                } else {
                    String raw = row.select("[type=hidden]").text();
                    System.out.println(raw);
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
I don't get any results; Eclipse just states terminated.
If I understand correctly, you want to get the links in the table that is generated when you hit the search button on https://www.ariva.de/aktien/suche.
The first problem is that the search results aren't available directly from this URL. Instead, when you click the search button, a POST request is made to https://www.ariva.de/aktiensuche/_result_table.m.
The response to this request contains the table with the links you are interested in. Specifically, the response contains HTML which is then dynamically added to the page as the results table.
The second problem looks to be in the jsoup query. I can't see any hidden input fields in the result table, but it is easy enough to grab the links using document.select("a[href]").
So for me this code:
String searchUrl = "https://www.ariva.de/aktiensuche/_result_table.m";
String searchBody = "page=0&page_size=25&sort=ariva_name&sort_d=asc&ariva_performance_1_year=_&ariva_performance_3_years=&ariva_performance_5_years=&index=0&founding_year=&land=0&industrial_sector=0&sector=0&currency=0&type_of_share=0&year=_all_years&sales=_&profit_loss=&sum_assets=&sum_liabilities=&number_of_shares=&earnings_per_share=&dividend_per_share=&turnover_per_share=&book_value_per_share=&cashflow_per_share=&balance_sheet_total_per_share=&number_of_employees=&turnover_per_employee=_&profit_per_employee=&kgv=_&kuv=_&kbv=_&dividend_yield=_&return_on_sales=_";

// post request to search URL
Document document = Jsoup.connect(searchUrl).requestBody(searchBody).post();

// find links in returned HTML
for (Element link : document.select("a[href]")) {
    System.out.println(link);
}
produces the output:
1&1 Drillisch
11 88 0 Solutions
1st Red
21ST. CENT. FOX B NEW
21st Century Fox
2G Energy
3I Group
3I INFRASTRUCTURE
3M Company
3U Holding
3W Power
4imprint Group
4 SC
6,625% Statkraft AS 09/19 auf Festzins
7C Solarparken
888 Holdings
A.A.A. aktiengesellschaft allgemeine anlageverwaltung
A.G. BARR LS-,04167
A.H.T. Syngas Technology
A.S. Creation Tapeten
A+J Mucklow Group
A+JMUCKLOW GRP PREF. LS 1
A2A
AAC Technologies Holding
Aalberts
Which is, I hope, more or less what you are after. To set the search parameters you will need to examine the search form and modify the form data in the searchBody string (or use the .data() method instead of .requestBody() to build the query).
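For reference, a minimal sketch of the .data() variant. Only a few of the parameters from the form data above are shown and the values are just copied from that string; the server may expect the full parameter set, so treat this as illustrative rather than tested:

// sketch: same POST built with .data() key/value pairs instead of a raw request body;
// parameter names and values are taken from the searchBody string above
Document document = Jsoup.connect("https://www.ariva.de/aktiensuche/_result_table.m")
        .data("page", "0")
        .data("page_size", "25")
        .data("sort", "ariva_name")
        .data("sort_d", "asc")
        .data("land", "0")
        .data("index", "0")
        .post();

for (Element link : document.select("a[href]")) {
    System.out.println(link.text() + " -> " + link.attr("abs:href"));
}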
I am trying to scrape the prices and the dates in the table shown in the attached picture from this URL:
http://www.airfrance.fr/vols/paris+tunis
I succeeded in scraping some information, but not the way I am looking for (date + price). I used these lines of code:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("http://www.airfrance.fr/vols/paris+tunis").get();
            Elements links = doc.select("div");
            for (Element e : links) {
                System.out.println(e.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Running this code gives me just some prices and only a few dates, but not the whole table as shown in the picture below.
Can you please help me resolve this problem for my study project? Thanks.
The problem is that the calendar you are parsing is not in the original source code (right click > view source) as delivered from the server. That table is generated using JavaScript when the page is rendered by the browser (right click > inspect).
Jsoup can only parse source code, so you need to load the page first with something like HtmlUnit and then pass the rendered page to Jsoup.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// load page using HtmlUnit and fire its scripts
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage("http://www.airfrance.fr/vols/paris+tunis");

// convert the rendered page back to HTML and parse it with Jsoup
Document doc = Jsoup.parse(myPage.asXml());

// find all of the date/price cells
for (Element cell : doc.select("td.available.daySelection")) {
    String cellDate = cell.select(".cellDate").text();
    String cellPrice = cell.select(".cellPrice > .day_price").text();
    System.out.println(String.format("cellDate=%s cellPrice=%s", cellDate, cellPrice));
}

// clean up resources
webClient.close();
Console
cellDate=1 septembre cellPrice=302 €
cellDate=2 septembre cellPrice=270 €
cellDate=3 septembre cellPrice=270 €
cellDate=4 septembre cellPrice=270 €
cellDate=5 septembre cellPrice=270 €
....
Source: Parsing JavaScript Generated Pages
I need help with my Java project using Jsoup (if you think there is a more efficient way to achieve the purpose, please let me know). The purpose of my program is to parse certain useful information from different URLs and put it in a text file. I am not an expert in HTML or JavaScript, therefore, it has been difficult for me to code in Java exactly what I want to parse.
On the website that you see in the code below as one of the examples, the information that I am interested in parsing with Jsoup is everything you can see in the table under "Routing" (Route, Location, Vessel/Voyage, Container Arrival Date, Container Departure Date; e.g. Origin, Seattle SSA Terminal T18, 26 Jun 15 A, 26 Jun 15 A… and so on).
So far, with Jsoup, we have only been able to parse the title of the website; we have been unsuccessful in getting any of the body.
Here is the code that I used, which I got from an online source:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Jsouptest71115 {
    public static void main(String[] args) throws Exception {
        String url = "http://google.com/gentrack/trackingMain.do "
                + "?trackInput01=999061985";
        Document document = Jsoup.connect(url).get();
        String title = document.title();
        System.out.println("title : " + title);
        String body = document.select("body").text();
        System.out.println("Body: " + body);
    }
}
Working code:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;

public class Sample {
    public static void main(String[] args) {
        String url = "http://homeport8.apl.com/gentrack/blRoutingPopup.do";
        try {
            Connection.Response response = Jsoup.connect(url)
                    .data("blNbr", "999061985") // tracking number
                    .method(Connection.Method.POST)
                    .execute();
            Element tableElement = response.parse().getElementsByTag("table")
                    .get(2).getElementsByTag("table")
                    .get(2);
            Elements trElements = tableElement.getElementsByTag("tr");
            ArrayList<ArrayList<String>> tableArrayList = new ArrayList<>();
            for (Element trElement : trElements) {
                ArrayList<String> columnList = new ArrayList<>();
                for (int i = 0; i < 5; i++) {
                    columnList.add(i, trElement.children().get(i).text());
                }
                tableArrayList.add(columnList);
            }
            System.out.println("Origin/Location: "
                    + tableArrayList.get(1).get(1)); // row and column number
            System.out.println("Discharge Port/Container Arrival Date: "
                    + tableArrayList.get(5).get(3));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output:
Origin/Location: SEATTLE SSA TERMINAL (T18), WA
Discharge Port/Container Arrival Date: 23 Jul 15 E
You need to use document.select(...); the input to the select method is a CSS selector. To learn more about CSS selectors just google them, or read this. Using CSS selectors you can easily identify parts of the web page body.
In your particular case you have a different problem, though: the table you are after is inside an iframe. If you look at the HTML of the page you are visiting, the iframe's URL is "http://homeport8.apl.com/gentrack/blRoutingFrame.do", and if you visit that URL directly to access its content you will get an exception, which is perhaps some restriction from the server. To get the content properly you need to visit two URLs via Jsoup: 1. http://homeport8.apl.com/gentrack/trackingMain.do?trackInput01=999061985 and 2. http://homeport8.apl.com/gentrack/blRoutingFrame.do?trackInput01=999061985
For the first URL you'll get nothing useful, but for the second URL you'll get the tables of interest. Then try using document.select("table"), which will give you a list of tables; iterate over this list and find the table of interest. Once you have the table, use Element.select("tr") to get the table rows and then, for each "tr", use Element.select("td") to get the table cell data.
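A rough sketch of that approach, assuming the second URL returns the routing table directly; the number of tables and the exact cell layout would need to be checked against the actual response:

// sketch: fetch the iframe URL and walk every table row by row, cell by cell
Document document = Jsoup.connect("http://homeport8.apl.com/gentrack/blRoutingFrame.do?trackInput01=999061985").get();
for (Element table : document.select("table")) {
    for (Element tr : table.select("tr")) {
        StringBuilder row = new StringBuilder();
        for (Element td : tr.select("td")) {
            row.append(td.text()).append(" | ");
        }
        System.out.println(row);
    }
}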
The webpage you are visiting doesn't use CSS classes and ids, which would have made reading it with Jsoup a lot easier, so I am afraid iterating over document.select("table") is your best and easiest option.
Good Luck.
I have recently started to work on a mini project so I can learn the basics of Jsoup; however, I am having some difficulty selecting a table on a particular website. I'm trying to fetch the table with Jsoup but with no success (see picture): http://imgur.com/RC21UBk
I know that the table I'm trying to get has class="meddelande" and is also inside a form element which has the same class="meddelande".
HTML code of the website: http://pastebin.com/ufRDhLSy
I'm trying to fetch the red-marked area; any idea how to do it?
Thanks in advance! :)
My code:
public void startMessage(String cookie1) {
    try {
        doc1 = Jsoup.connect("https://nya.boplats.se/minsida/meddelande")
                .timeout(0).cookie("Boplats-Session", cookie1)
                .get();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    Elements tables = doc1.select("form.meddelande");
    Elements table = tables.select("table.meddelande");
    System.out.println(table);
}
In your code
Elements tables = doc1.select("form.meddelande");
Elements table = tables.select("table.meddelande");
you are trying to access the form using the class attribute meddelande, but in your linked HTML source meddelande is an id, not a class, so instead of
form.meddelande
you should use
form#meddelande
^--# means id, dot represents class
So try with
Elements tables = doc.select("form#meddelande");
Elements table = doc.select("table.meddelande");
or maybe simpler
Elements table = doc.select("form#meddelande table.meddelande");
If this does not work, then the HTML code responsible for the table is probably generated by JavaScript. In that case you will not be able to get it with Jsoup; you will need something like the Selenium web driver or HtmlUnit.
In your situation it is better to select the classes unread and read.
import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupSO {
    public static void main(String[] args0) throws IOException {
        Document doc;
        Elements elements;
        doc = Jsoup.parse(new File("path_to_file or use connect for URL"), "UTF-8");
        elements = doc.getElementsByClass("unread");
        for (Element element : elements) {
            System.out.println(element);
        }
        elements = doc.getElementsByClass("read");
        for (Element element : elements) {
            System.out.println(element);
        }
    }
}
Output: http://pastebin.com/CwG1cL5T
And yes, read their cookbook: http://jsoup.org/
Here is an attempt
Document doc = Jsoup.connect("http://pastebin.com/raw.php?i=ufRDhLSy").get();
System.out.println(doc.select("table[class=meddelande]"));
or use the shorter syntax when selecting the node with a particular class only
System.out.println(doc.select("table.meddelande"));
JSoup supports the selector syntax. So you could use that to select DOM nodes with particular attributes - in this case the class attribute.
For more complex options in the selector syntax, look here http://jsoup.org/cookbook/extracting-data/selector-syntax
I have a process in Talend which gets the search results of a page, saves the HTML, and writes it into files, as seen here:
Initially I had a two-step process, parsing the data out of the HTML files in Java; it works and writes the results to a MySQL database. Here is the code which basically does exactly that (I'm a beginner, sorry for the lack of elegance):
package org.jsoup.examples;

import java.io.*;

import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

public class parse2 {

    static parse2 parseIt2 = new parse2();
    String companyName = "Platzhalter";
    String jobTitle = "Platzhalter";
    String location = "Platzhalter";
    String timeAdded = "Platzhalter";

    public static void main(String[] args) throws IOException {
        parseIt2.getData();
    }

    public void getData() throws IOException {
        Document document = Jsoup.parse(new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
        Elements elements = document.select(".joblisting");
        for (Element element : elements) {
            // Parse data into Elements
            Elements jobTitleElement = element.select(".job_title span");
            Elements companyNameElement = element.select(".company_name span[itemprop=name]");
            Elements locationElement = element.select(".locality span[itemprop=addressLocality]");
            Elements dateElement = element.select(".job_date_added [datetime]");
            // Strip data from unnecessary tags
            String companyName = companyNameElement.text();
            String jobTitle = jobTitleElement.text();
            String location = locationElement.text();
            String timeAdded = dateElement.attr("datetime");
            System.out.println("Firma:\t" + companyName + "\t" + jobTitle + "\t in:\t" + location + " \t Erstellt am \t" + timeAdded);
        }
    }
}
Now I want to do the process end-to-end in Talend, and I was assured this works.
I tried this (which looks quite shady to me):
Basically I put all imports in "advanced settings" and the code in the "basic settings" section. The import library is supposed to load the Jsoup parsing library, as well as the MySQL connector (I might do the connection with Talend tools, though).
Obviously this isn't working. I tried to strip the base code of classes and such, and it was even worse. Can you help me get the generated .txt files parsed with Java here?
EDIT: Here is the link to the Talend job: http://www.share-online.biz/dl/8M5MD99NR1
EDIT2: I changed the code to the one I tried in JavaFlex, but it didn't work (the import part in the start section of the code, the rest in "body/main", and nothing in "end").
This is a problem related to Talend: in your code, use fully qualified names including their packages. For your document parsing, for example, you can use:
Document document = org.jsoup.Jsoup.parse(new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
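Applying the same idea to the rest of your parse2 logic, here is a rough sketch of what the body of a Talend Java code component could look like. It reuses the file path and selectors from your own code; whether you print the values or hand them to a MySQL output component is up to you:

// sketch: same parsing logic as parse2.getData(), but with fully qualified
// class names so it can be pasted into a Talend code component without imports
org.jsoup.nodes.Document document = org.jsoup.Jsoup.parse(
        new java.io.File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
for (org.jsoup.nodes.Element element : document.select(".joblisting")) {
    String companyName = element.select(".company_name span[itemprop=name]").text();
    String jobTitle = element.select(".job_title span").text();
    String location = element.select(".locality span[itemprop=addressLocality]").text();
    String timeAdded = element.select(".job_date_added [datetime]").attr("datetime");
    System.out.println("Firma:\t" + companyName + "\t" + jobTitle
            + "\t in:\t" + location + " \t Erstellt am \t" + timeAdded);
}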
I am currently working on an academic project, developing in Java and XML. The actual task is to parse the XML, passing the required values, preferably in a HashMap, for further processing. Here is a short snippet of the actual XML:
<root>
  <BugReport ID = "1">
    <Title>"(495584) Firefox - search suggestions passes wrong previous result to form history"</Title>
    <Turn>
      <Date>'2009-06-14 18:55:25'</Date>
      <From>'Justin Dolske'</From>
      <Text>
        <Sentence ID = "3.1"> Created an attachment (id=383211) [details] Patch v.2</Sentence>
        <Sentence ID = "3.2"> Ah. So, there's a ._formHistoryResult in the....</Sentence>
        <Sentence ID = "3.3"> The simple fix it to just discard the service's form history result.</Sentence>
        <Sentence ID = "3.4"> Otherwise it's trying to use a old form history result that no longer applies for the search string.</Sentence>
      </Text>
    </Turn>
    <Turn>
      <Date>'2009-06-19 12:07:34'</Date>
      <From>'Gavin Sharp'</From>
      <Text>
        <Sentence ID = "4.1"> (From update of attachment 383211 [details])</Sentence>
        <Sentence ID = "4.2"> Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
      </Text>
    </Turn>
    <Turn>
      <Date>'2009-06-19 13:17:56'</Date>
      <From>'Justin Dolske'</From>
      <Text>
        <Sentence ID = "5.1"> (In reply to comment #3)</Sentence>
        <Sentence ID = "5.2"> > (From update of attachment 383211 [details] [details])</Sentence>
        <Sentence ID = "5.3"> > Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
        <Sentence ID = "5.4"> Good point.</Sentence>
        <Sentence ID = "5.5"> I renamed the one in the wrapper to _formHistResult. </Sentence>
        <Sentence ID = "5.6"> fhResult seemed maybe a bit too short.</Sentence>
      </Text>
    </Turn>
    .....
    and so on
  </BugReport>
There are many commenters like 'Justin Dolske' who have commented on this report, and what I am actually looking for is the list of commenters and all sentences they have written in the whole XML file. Something like if (from == "Justin Dolske") getAllHisSentences(), and similarly for all other commenters. I have tried many different ways to get the sentences only for 'Justin Dolske' or other commenters, even in a generic form for all of them, using XPath, SAX and DOM, but failed. I am quite new to these technologies, including Java, and don't know how to achieve this.
Can anyone guide me specifically on how I could get this with any of the above technologies, or is there a better strategy to do it?
(Note: later I want to put it in a HashMap(key, value), where the key is the name of the commenter (e.g. Justin Dolske) and the value is all of their sentences.)
Urgent help will be highly appreciated.
There are several ways in which you can achieve your requirement.
One way would be to use JAXB. There are several tutorials available on this on the web, so feel free to refer to them.
You can also think of creating a DOM, extracting the data from it, and then putting it into your HashMap.
One reference implementation would be something like this:
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XMLReader {

    private HashMap<String, ArrayList<String>> namesSentencesMap;

    public XMLReader() {
        namesSentencesMap = new HashMap<String, ArrayList<String>>();
    }

    private Document getDocument(String fileName) {
        Document document = null;
        try {
            document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File(fileName));
        } catch (Exception exe) {
            // handle exception
        }
        return document;
    }

    private void buildNamesSentencesMap(Document document) {
        if (document == null) {
            return;
        }
        // Get each Turn block
        NodeList turnList = document.getElementsByTagName("Turn");
        String fromName = null;
        NodeList sentenceNodeList = null;
        for (int turnIndex = 0; turnIndex < turnList.getLength(); turnIndex++) {
            Element turnElement = (Element) turnList.item(turnIndex);
            // Assumption: each Turn contains exactly one <From> element
            Element fromElement = (Element) turnElement.getElementsByTagName("From").item(0);
            fromName = fromElement.getTextContent();
            // Extracting sentences - first check whether the map contains
            // an ArrayList corresponding to the name. If yes, then use that,
            // else create a new one
            ArrayList<String> sentenceList = namesSentencesMap.get(fromName);
            if (sentenceList == null) {
                sentenceList = new ArrayList<String>();
            }
            // Extract sentences from the Turn node
            try {
                sentenceNodeList = turnElement.getElementsByTagName("Sentence");
                for (int sentenceIndex = 0; sentenceIndex < sentenceNodeList.getLength(); sentenceIndex++) {
                    sentenceList.add(((Element) sentenceNodeList.item(sentenceIndex)).getTextContent());
                }
            } finally {
                sentenceNodeList = null;
            }
            // Put the list back in the map
            namesSentencesMap.put(fromName, sentenceList);
        }
    }

    public static void main(String[] args) {
        XMLReader reader = new XMLReader();
        reader.buildNamesSentencesMap(reader.getDocument("<your_xml_file>"));
        for (String names : reader.namesSentencesMap.keySet()) {
            System.out.println("Name: " + names + "\tTotal Sentences: " + reader.namesSentencesMap.get(names).size());
        }
    }
}
Note: This is just a demonstration and you would need to modify it to suit your need. I've created it based on your XML to show one way of doing it.
I suggest using JAXB to create a data model reflecting your XML structure.
Once done, you can load the XML into Java instances.
Put each Turn into a Map<String, List<Turn>>, using Turn.From as the key.
Once done, you can write:
List< Turn > justinsTurn = allTurns.get( "'Justin Dolske'" );
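For illustration, a minimal sketch of what the JAXB model and the grouping could look like. The class names, field names, and the file name bugreport.xml are assumptions derived from the XML snippet above (not a tested mapping), and it assumes a JDK or dependency that provides javax.xml.bind:

import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlElementWrapper;
import javax.xml.bind.annotation.XmlRootElement;

// assumed mapping for a single <BugReport> element
@XmlRootElement(name = "BugReport")
@XmlAccessorType(XmlAccessType.FIELD)
class BugReport {
    @XmlElement(name = "Turn")
    List<Turn> turns = new ArrayList<>();
}

@XmlAccessorType(XmlAccessType.FIELD)
class Turn {
    @XmlElement(name = "From")
    String from;

    // <Text> wraps the <Sentence> elements; the ID attributes are ignored here
    @XmlElementWrapper(name = "Text")
    @XmlElement(name = "Sentence")
    List<String> sentences = new ArrayList<>();
}

public class JaxbDemo {
    public static void main(String[] args) throws Exception {
        // unmarshal a single <BugReport>; adjust if <root> wraps several reports
        BugReport report = (BugReport) JAXBContext.newInstance(BugReport.class)
                .createUnmarshaller()
                .unmarshal(new File("bugreport.xml"));

        // group all sentences by commenter name
        Map<String, List<String>> sentencesByCommenter = new HashMap<>();
        for (Turn turn : report.turns) {
            sentencesByCommenter
                    .computeIfAbsent(turn.from, k -> new ArrayList<>())
                    .addAll(turn.sentences);
        }

        // the quotes are part of the text content in the sample XML
        System.out.println(sentencesByCommenter.get("'Justin Dolske'"));
    }
}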