Only scrape specific details from a web page - java

I am using Jsoup to retrieve details from a web page and write them to a text file. Is it possible to retrieve only part of the page? For example, from the following link I want only the job description.
http://aldogroup.luceosolutions.com/recruit/stores/advert_details.php?id=3136&_lang=en&partner_id=139
The job postings sometimes come from different websites, so the HTML structure may vary. The following code retrieves everything on the web page. How can I get only the job description?
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MainCollector {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://aldogroup.luceosolutions.com/recruit/stores/advert_details.php?id=3136&_lang=en&partner_id=139").get();
            String title = doc.title();
            // text() strips the tags and returns the visible text of the whole body
            String bodyText = doc.body().text();
            System.out.println("Title:" + title);
            System.out.println("Body:" + bodyText);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

You can select just that element with a CSS selector:
Elements e = doc.select(".annonce > p:nth-child(5)");
System.out.println(e.text());
To find the right CSS selector, open your browser's developer tools (press F12) and choose the inspector tool.
You should also add a user agent string to your request, so that your program gets the exact same page as your browser:
doc = Jsoup.connect("http://aldogroup.luceosolutions.com/recruit/stores/advert_details.php?id=3136&_lang=en&partner_id=139")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0")
        .get();

Related

Jsoup webscraping to find game odds data

I've been trying to create a Java program that reads the odds of a game from a sportsbook like FanDuel, but I've been running into a lot of problems. When I print the HTML for the site, I don't get the entire HTML, so I'm unable to go into the divs and retrieve the data I want.
I used the URL https://sportsbook.fanduel.com/. If I call something like Element element = doc.getElementById("root"); to get the data inside that div, the rest of the div's contents does not appear. Specifically, I would like to get the moneyline data for each game. Any help would be great.
public class ExtractSportsBookData {
    public static void extractData(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            String html = doc.html();
            System.out.println(html);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
In the browser's inspector, the moneyline data for each game is stored inside the li tags, but I cannot find a way to extract that data using Jsoup.
public class Main {
    public static void main(String[] args) {
        ExtractSportsBookData.extractData("https://sportsbook.fanduel.com/");
    }
}
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ExtractSportsBookData {
    public static void extractData(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            String html = doc.html();
            //System.out.println(html);
            Elements element = doc.getElementsByClass("jo jp fk fe jy jz bs");
            System.out.println(element.isEmpty());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This prints true, meaning that no matching element was found, which is not what I want. Any help would be appreciated.

How to Load Entire Contents of HTML - Jsoup

I am trying to download HTML table rows using Jsoup, but it parses only part of the HTML content. I tried the code below to load the full HTML, but it doesn't work. Any suggestions would be appreciated.
import java.io.FileWriter;
import java.io.IOException;
import java.time.LocalDate;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class AmfiDaily {
    public static void main(String[] args) {
        AmfiDaily amfiDaily = new AmfiDaily();
        amfiDaily.extractAmfiTable("https://www.amfiindia.com/intermediary/other-data/transaction-in-debt-and-money-market-securities");
    }

    public void extractAmfiTable(String url) {
        String path = "D:\\FTRACK\\Amfi Report " + LocalDate.now() + ".csv";
        // try-with-resources closes the writer even if an exception is thrown
        try (FileWriter writer = new FileWriter(path)) {
            Document document = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                    .maxBodySize(0)
                    .timeout(100000 * 5)
                    .get();
            Elements rows = document.select("tr");
            for (Element row : rows) {
                for (Element cell : row.select("td")) {
                    String text = cell.text();
                    // quote cells that contain a comma so they stay in one CSV column
                    if (text.contains(",")) {
                        text = "\"" + text + "\"";
                    }
                    writer.write(text + ",");
                }
                writer.write("\n");
            }
        } catch (IOException e) {
            // printStackTrace() reports the error; getStackTrace() alone prints nothing
            e.printStackTrace();
        }
    }
}
Disable JavaScript in your browser to see exactly what Jsoup sees. Part of the page is loaded with AJAX, and Jsoup does not execute JavaScript, so it never receives that part. There is an easy way to check where the additional data comes from.
Open your browser's developer tools, go to the Network tab, and look at the requests and responses.
You can see that the table is downloaded from this URL:
https://www.amfiindia.com/modules/LoadModules/MoneyMarketSecurities
You can request this URL directly to get the data you need.
If you need the whole page as rendered in the browser instead, use a tool that runs JavaScript, such as Selenium WebDriver. Example here: https://stackoverflow.com/a/54510107/9889778
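The direct-URL approach described above could look roughly like the sketch below. It assumes the endpoint answers a plain GET request with an HTML table; that is not confirmed here, so check the request method and parameters in the Network tab first. The class name AmfiTableFetcher and the helpers toCsv and quote are made up for this illustration.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AmfiTableFetcher {
    // URL found in the browser's Network tab (see the answer above).
    static final String TABLE_URL =
            "https://www.amfiindia.com/modules/LoadModules/MoneyMarketSecurities";

    /** Turns every table row of the document into one CSV line. */
    static String toCsv(Document doc) {
        StringBuilder csv = new StringBuilder();
        for (Element row : doc.select("tr")) {
            StringBuilder line = new StringBuilder();
            for (Element cell : row.select("th, td")) {
                if (line.length() > 0) {
                    line.append(',');
                }
                line.append(quote(cell.text()));
            }
            csv.append(line).append('\n');
        }
        return csv.toString();
    }

    /** Wraps a value in double quotes when it contains a comma or a quote. */
    static String quote(String value) {
        if (value.contains(",") || value.contains("\"")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a plain GET with a browser-like user agent is enough.
        Document doc = Jsoup.connect(TABLE_URL)
                .userAgent("Mozilla/5.0")
                .get();
        System.out.print(toCsv(doc));
    }
}
```

Keeping the parsing in toCsv separate from the network call makes it easy to try the selector logic on a saved copy of the response before hitting the site.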

Jsoup HTML Parsing work on java but doesn't work on android studio

I'm working on an HTML parsing project using Jsoup. I can parse the title and image correctly, but when I try to parse a timer (related to an earlier post), it fails in Android Studio, although the code that @Shn_Android_Dev gave me works in plain Java.
This is my code:
public void EbayTimerTest() {
    new Thread(new Runnable() {
        @Override
        public void run() {
            Document doc;
            try {
                doc = Jsoup.connect(WEBSITE_URL).get();
                String remaining = doc.select("#vi-cdown_timeLeft").first().text();
                remainingMs = getUnixFromString(remaining);
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
            runOnUiThread(new Runnable() {
                @Override
                public void run() {
                    timer.setText(String.valueOf(remainingMs));
                }
            });
        }
    }).start();
}
The error I still get is:
java.lang.NullPointerException: Attempt to invoke virtual method 'java.lang.String org.jsoup.nodes.Element.text()' on a null object reference
I'm pretty sure that the line
String remaining = doc.select("#vi-cdown_timeLeft").first().text();
fails in Android Studio but works in Eclipse.
P.S. Jsoup works well if I parse other elements such as the title and image.
The most likely reason for the exception is that the two environments (desktop Java and Android) send different User-Agent strings to the server, so you get two different HTML documents. On Android, the selector matches nothing, first() returns null, and calling text() on it throws.
You can solve it in one of two ways:
1. Print the doc you get in Android Studio and work out which selector finds the information you need in that HTML.
2. Add a user agent string to the request:
doc = Jsoup.connect(URL).userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:59.0) Gecko/20100101")
        .get();
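Whichever fix you choose, it may help to guard against the selector matching nothing, since first() returns null in that case. The sketch below shows one way to do that; the class SafeSelect, the method firstTextOrNull, and the two sample HTML snippets are invented for the illustration and are not part of the original code.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SafeSelect {
    /** Returns the text of the first match, or null when the selector matches nothing. */
    static String firstTextOrNull(Document doc, String cssQuery) {
        Element match = doc.select(cssQuery).first(); // null when there is no match
        return match == null ? null : match.text();
    }

    public static void main(String[] args) {
        // Made-up documents standing in for the two different server responses.
        Document withTimer = Jsoup.parse("<span id=\"vi-cdown_timeLeft\">2h 15m</span>");
        Document withoutTimer = Jsoup.parse("<p>no timer here</p>");

        System.out.println(firstTextOrNull(withTimer, "#vi-cdown_timeLeft"));
        System.out.println(firstTextOrNull(withoutTimer, "#vi-cdown_timeLeft"));
    }
}
```

With a check like this, the Android code can log the received HTML when the result is null instead of crashing, which also makes it easier to see what the server actually sent.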

Java Jsoup no result from instagram.com

I am trying to get all divs from a website. With google.com or other web pages it works fine, but Instagram gives an empty result. The method looks like this:
public static List<String> getPhotoPaths(String url) {
    List<String> paths = new ArrayList<>();
    try {
        Document doc = Jsoup.connect("https://www.instagram.com/explore/tags/test/")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2")
                .get();
        for (Element element : doc.select("div")) {
            System.out.println(element);
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return paths;
}
Does anyone have an idea what's wrong? This is the test website. It normally uses divs like every other page, doesn't it?
You don't get any result because Instagram loads those pictures asynchronously with JavaScript (if you disable JavaScript in your browser, you will no longer see the pictures), so they are not in the HTML that the server first returns. Jsoup cannot execute JavaScript, so you should either use another library that can, or parse the JSON object assigned to the window._sharedData variable yourself. That object contains the URLs of the thumbnails and the original pictures.
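As a rough sketch of the second option, the JSON text can be cut out of the page source with a regular expression and then handed to a JSON parser of your choice. The code below assumes the page contains a script of the form window._sharedData = {...}; followed by the closing script tag. The class name SharedDataExtractor, the method extractJson, and the sample HTML (including its field names) are invented for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SharedDataExtractor {
    // Assumption: the data is embedded as `window._sharedData = {...};</script>`.
    private static final Pattern SHARED_DATA = Pattern.compile(
            "window\\._sharedData\\s*=\\s*(\\{.*?\\})\\s*;\\s*</script>",
            Pattern.DOTALL);

    /** Returns the raw JSON text assigned to window._sharedData, if present. */
    static Optional<String> extractJson(String html) {
        Matcher m = SHARED_DATA.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up page source; the real JSON is much larger.
        String html = "<html><body><script type=\"text/javascript\">"
                + "window._sharedData = {\"entry_data\":{\"TagPage\":[]}};"
                + "</script></body></html>";
        System.out.println(extractJson(html).orElse("not found"));
    }
}
```

The page source itself can come from Jsoup via doc.html(). A regular expression is only a shortcut here: if the JSON ever contains the text };</script> inside a string value, the match would end too early.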

JSoup Data Issue

I am trying to get some data from a website using Jsoup, and I am not sure how.
This is the code I have been using, and it does not work:
public static Document doc;
public static Elements elementPrice;

public void getDocument() {
    try {
        doc = Jsoup.connect("https://steamcommunity.com/market/search?appid=730&q=ak47+jaguar+factory-new").get();
        elementPrice = doc.select("market_table_value");
        System.out.println(elementPrice);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I am trying to get data from this site: https://steamcommunity.com/market/search?appid=730&q=ak47+jaguar+factory-new
The value I am trying to get is this:
Pris från: (Swedish for "Starting at:")
35,36€
This is the price of a CS:GO item on Steam.
Why doesn't this work?
Thanks for any help! :)
select uses CSS selector syntax, so to match elements by class you need .className (note the dot at the start). Without the dot, "market_table_value" is treated as a tag name and matches nothing. So try:
elementPrice = doc.select(".market_table_value"); // note the leading dot
You can also use the getElementsByClass method instead of select and pass the class name directly, without any CSS syntax:
elementPrice = doc.getElementsByClass("market_table_value");
