Web scraping with java and jsoup

I am trying to scrape data from the historical data table on the Yahoo Finance CBOE Volatility Index (^VIX) page.
I am using jsoup for it.
String url = "https://finance.yahoo.com/quote/%5EVIX/history?p=%5EVIX&guccounter=1&guce_referrer=aHR0cHM6Ly9tYWlsLmdvb2dsZS5jb20v&guce_referrer_sig=AQAAAKU5UXnZEhNK_s1k-l6fQ7l-jFaR2xghH5NOhaohsec-HThT1BaEsni-hUlysVCFWpzd4qa2OZ2YZtBDJNQqKw1Uh64_nppDI4RnzPnTgxDGta123-A_SbIBm4SA5B0xopHvDcl5A21esFvWceZnRJPk6ohtud7OGJpWcNLdADYT";
Document doc = Jsoup.connect(url).get();
Element table = doc.getElementById("mrt-node-Col1-1-HistoricalDataTable");
Elements rows = table.select("tr");
Elements first = rows.get(0).select("th,td");
List<String> headers = new ArrayList<>();
for (Element header : first) {
    headers.add(header.text());
}
List<Map<String, String>> listMap = new ArrayList<>();
for (int row = 1; row < rows.size() - 1; row++) {
    Elements colVals = rows.get(row).select("th,td");
    int colCount = 0;
    Map<String, String> tuple = new LinkedHashMap<>();
    for (Element colVal : colVals) {
        tuple.put(headers.get(colCount++), colVal.text());
    }
    listMap.add(tuple);
}
With this approach I only get the first 100 or so rows. The page initially loads that many rows, and further rows are loaded only when you scroll down to the last loaded row. I could not find any pagination, and nothing helpful in the network calls. The data seems to be encoded in GIF format (requested whenever there is a mouse event on scroll).
I found a workaround that uses Selenium WebDriver to fetch all the data. Is there any way to solve this using only jsoup?

Related

Get specific information from Wikipedia Information Box

I'm trying to get the details of the latest release in the information box on the right side. I'm trying to retrieve "6.2 (Build 9200) / August 1, 2012; 7 years ago" from the box by scraping this page using jsoup.
I have code that pulls all data from the box but I can't figure out how to pull the specific part of the box.
org.jsoup.Connection.Response res = Jsoup.connect("https://en.wikipedia.org/wiki/Windows_Server_2012").execute();
String html = res.body();
Document doc2 = Jsoup.parseBodyFragment(html);
Element body = doc2.body();
Elements tables = body.getElementsByTag("table");
for (Element table : tables) {
    if (table.className().contains("infobox")) {
        System.out.println(table.outerHtml());
        break;
    }
}

You can query for the table row that contains a link that ends with Software_release_life_cycle:
String url = "https://en.wikipedia.org/wiki/Windows_Server_2012";
try {
    Document document = Jsoup.connect(url).get();
    Elements elements = document.select("tr:has([href$=Software_release_life_cycle])");
    for (Element element : elements) {
        System.out.println(element.text());
    }
} catch (IOException e) {
    // exception handling
}
This works because, looking at the full HTML, I found that the row you need (and only the row you need, which is a vital detail) contains a link like this. In fact, elements will contain only one Element.
Finally you extract only the text. This code will print:
Latest release 6.2 (Build 9200) / August 1, 2012; 7 years ago (2012-08-01)[2]
If you need even more refinement you can always substring it.
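As a rough sketch of that refinement, a regular expression can drop the row label, the ISO date in parentheses, and the footnote marker. The class name ReleaseTextCleaner and the method clean are hypothetical, and the pattern assumes the row text looks exactly like the output printed above (ordinary spaces, a "Latest release" label).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReleaseTextCleaner {
    // Row label, then the value we want, then an optional "(yyyy-mm-dd)" and optional "[n]" footnotes.
    private static final Pattern ROW = Pattern.compile(
            "^Latest release\\s+(.*?)\\s*(?:\\(\\d{4}-\\d{2}-\\d{2}\\))?\\s*(?:\\[\\d+\\])*$");

    // Returns only the value part of the row, or the trimmed input if it does not match.
    static String clean(String rowText) {
        String trimmed = rowText.trim();
        Matcher m = ROW.matcher(trimmed);
        return m.matches() ? m.group(1) : trimmed;
    }

    public static void main(String[] args) {
        String row = "Latest release 6.2 (Build 9200) / August 1, 2012; 7 years ago (2012-08-01)[2]";
        // prints: 6.2 (Build 9200) / August 1, 2012; 7 years ago
        System.out.println(clean(row));
    }
}
```

In the loop above you would pass element.text() to clean. If the page uses non-breaking spaces, the pattern would need adjusting.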
Hope I helped!
(See the selector syntax reference.)

JSOUP - Extract data from a dynamic web page

I am trying to extract the price. Can anyone please help me? There is no output for the price or the weight. I have tried several approaches, but none of them produces a result.
Document doc = Jsoup.connect("https://www.jakmall.com/tokocamzone/mi-travel-charger-20a-output-fast-charging#9730928979371").get();
Elements rows = doc.getElementsByAttributeValue("class", "div[dp__price dp__price--2 format__money]");
System.out.println("rows.size() = " + rows.size());
String index = "";
for (Element span : rows) {
    index = span.text();
}
System.out.println("index = " + index);
I have tried other approaches too, but I still did not get a result and could not find the right way.

If you run the lines below, you will discover that the fetched document contains no price and no div with the classes dp__price dp__price--2 format__money. There is only JavaScript.
String d = doc.getElementsByClass("dp__header__info").outerHtml();
System.out.println(d);
Jsoup cannot fetch the price because that content is loaded dynamically after the page loads. Consider using Selenium, which is more powerful and supports JavaScript-driven websites.

Reading webpage's inspect element data using java

I have a requirement: I am reading data from a dynamic web page, and the values I need lie within <td> elements, which are visible when I inspect the page. Is it possible to print the data shown in the element inspector using Java?

Use jsoup (see the jsoup cookbook). For example, assuming doc is the Document you have already parsed:
Element table = doc.select("table").get(0);
Elements rows = table.select("tr");
// Start at 1 to skip the header row
for (int i = 1; i < rows.size(); i++) {
    Element row = rows.get(i);
    Elements cols = row.select("td");
    // Use cols.get(index).text() to read the data from a td element
}

I found the solution to this one; I am leaving this answer in case anyone gets stuck on this in the future.
Whatever you see in the element inspector can be read using Selenium.
Here's the code I used:
WebDriver driver= new ChromeDriver();
driver.manage().timeouts().implicitlyWait(15, TimeUnit.SECONDS);
driver.manage().window().maximize();
driver.get("http://www.whatever.com");
Thread.sleep(1000);
List<WebElement> frameList = driver.findElements(By.tagName("frame"));
System.out.println(frameList.size());
driver.switchTo().frame(0);
String temp=driver.findElement(By.xpath("/html/body/table/thead/tr/td/div[2]/table/thead/tr[2]/td[2]")).getText();

Selenium WebDriver Java locating two different table cells

I'm using Selenium WebDriver in Java. I have a table, and I would like to get the last cell of the first row and the last cell of the last row. I managed to get only one of them:
WebElement table = driver.findElement(By.className("dataTable"));
List<WebElement> rows = table.findElements(By.tagName("tr"));
WebElement firstrow = rows.get(0);
WebElement lastrow = rows.get(rows.size() - 1);
List<WebElement> firstcells = firstrow.findElements(By.tagName("td"));
List<WebElement> lastcells = lastrow.findElements(By.tagName("td"));
firstcells.get(6).getText();
I think this is because I'm locating the td tags twice. Any hints on how to get both cells cleanly? I have no identifiers in my rows or cells.

You can use XPath to get the elements:
WebElement lastCellInFirstRow = driver.findElement(By.xpath("//table[@class='dataTable']//tr[1]//td[last()]"));
WebElement lastCellInLastRow = driver.findElement(By.xpath("//table[@class='dataTable']//tr[last()]//td[last()]"));
See the XPath specification for the full syntax.
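To try out the [1] and last() predicates without a browser, here is a minimal sketch that uses the JDK's built-in javax.xml.xpath instead of Selenium. The table markup, the class name XPathLastCellDemo, and the method firstMatchText are made up for illustration. A real page is rarely well-formed XML, so treat this only as a way to experiment with the expressions.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathLastCellDemo {
    // Hypothetical, well-formed stand-in for the page's table.
    static final String TABLE =
            "<html><body><table class='dataTable'>"
          + "<tr><td>a1</td><td>a2</td><td>a3</td></tr>"
          + "<tr><td>b1</td><td>b2</td><td>b3</td></tr>"
          + "<tr><td>c1</td><td>c2</td><td>c3</td></tr>"
          + "</table></body></html>";

    // Evaluates the expression against TABLE and returns the text of the first matching node.
    static String firstMatchText(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(TABLE)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Last cell of the first row; prints a3
        System.out.println(firstMatchText("//table[@class='dataTable']//tr[1]//td[last()]"));
        // Last cell of the last row; prints c3
        System.out.println(firstMatchText("//table[@class='dataTable']//tr[last()]//td[last()]"));
    }
}
```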

You can also try CSS selectors:
String cssFirstRowLastCell = "table[class='dataTable'] tr:first-child > td:last-child";
String cssLastRowLastCell = "table[class='dataTable'] tr:last-child > td:last-child";
(A descendant combinator is used between table and tr because browsers insert a tbody element around the rows, so table > tr usually matches nothing.)
It would be something like this:
driver.findElement(By.cssSelector(cssFirstRowLastCell)).getText();
driver.findElement(By.cssSelector(cssLastRowLastCell)).getText();
Another approach is to use JavaScript (this relies on jQuery's $ being available on the page):
String getText(String cssSel) {
    JavascriptExecutor js = (JavascriptExecutor) driver;
    StringBuilder stringBuilder = new StringBuilder();
    stringBuilder.append("var x = $(\"" + cssSel + "\");");
    stringBuilder.append("return x.text().toString();");
    return (String) js.executeScript(stringBuilder.toString());
}
String text1 = getText(cssFirstRowLastCell);
String text2 = getText(cssLastRowLastCell);
But always make sure that you have located the elements properly (e.g. using the FirePath or Firebug add-ons in Firefox).

The TableDriver extension (https://github.com/jkindwall/TableDriver.Java) offers a nice clean way to handle things like this. If your table has headers, you can (and should) identify the cell column by its header text, but in case you don't have headers, you can still do something like this.
Table table = Table.createWithNoHeaders(driver.findElement(By.className("dataTable")), 0);
WebElement firstRowLastCell = table.findCell(0, table.getColumnCount() - 1);
WebElement lastRowLastCell = table.findCell(table.getRowCount() - 1, table.getColumnCount() - 1);

Using Jsoup to extract data

I am using jsoup to extract data from a table on a website: http://www.moneycontrol.com/stocks/marketstats/gainerloser.php?optex=BSE&opttopic=topgainers&index=-1. I have referred to "Using JSoup To Extract HTML Table Contents" and other similar questions, but my code does not print the data. Could someone please provide the code required to achieve this?
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TestClass {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://www.moneycontrol.com/stocks/marketstats/gainerloser.php?optex=BSE&opttopic=topgainers&index=-1").get();
        for (Element table : doc.select("table.tablehead")) {
            for (Element row : table.select("tr")) {
                Elements tds = row.select("td");
                if (tds.size() > 6) {
                    System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
                }
            }
        }
    }
}

If you want the content of the table (not the head), you need to change the table selector:
for (Element table : doc.select("table.tbldata14"))
instead of
for (Element table : doc.select("table.tablehead"))

One important thing is to check what you are actually getting in doc when you parse the HTML, because there may be a few problems with it:
1. The site might be using iframes to display content.
2. The content might be rendered via JavaScript.
3. A few sites have scripts that block jsoup parsing, so doc will not contain the content you expect.
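The checks in the list above can be roughed out on the raw HTML string before you write any selectors. This is only a heuristic sketch: the class FetchedHtmlCheck and the method diagnose are hypothetical names, and plain substring checks cannot prove how a page renders its content. You would pass doc.outerHtml() as the first argument and a class name or id you expect to find (for example tbldata14) as the second.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class FetchedHtmlCheck {
    // Returns human-readable hints about why the expected content may be missing.
    static List<String> diagnose(String html, String expectedMarker) {
        List<String> findings = new ArrayList<>();
        String lower = html.toLowerCase(Locale.ROOT);
        if (!html.contains(expectedMarker)) {
            findings.add("expected marker missing: " + expectedMarker);
        }
        if (lower.contains("<iframe")) {
            findings.add("page embeds an iframe; the data may live in the framed document");
        }
        if (lower.contains("<script")) {
            findings.add("page contains scripts; the data may be rendered by JavaScript");
        }
        return findings;
    }

    public static void main(String[] args) {
        // Made-up HTML standing in for doc.outerHtml()
        String html = "<html><body><iframe src='data.html'></iframe></body></html>";
        for (String finding : diagnose(html, "tbldata14")) {
            System.out.println(finding);
        }
    }
}
```

An empty result does not guarantee that your selector will match, but a missing marker is a strong hint that the data never reached jsoup.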
