How to Load Entire Contents of HTML - Jsoup - java

I was trying to download HTML table rows using Jsoup, but it parses only part of the HTML content. I also tried the code below to load the full HTML content, but it doesn't work. Any suggestions would be appreciated.
import java.io.FileWriter;
import java.io.IOException;
import java.time.LocalDate;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class AmfiDaily {
    public static void main(String[] args) {
        AmfiDaily amfiDaily = new AmfiDaily();
        amfiDaily.extractAmfiTable("https://www.amfiindia.com/intermediary/other-data/transaction-in-debt-and-money-market-securities");
    }

    public void extractAmfiTable(String url) {
        // try-with-resources closes the writer even if the request fails
        try (FileWriter writer = new FileWriter("D:\\FTRACK\\Amfi Report " + LocalDate.now() + ".csv")) {
            Document document = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                    .maxBodySize(0)       // 0 = no limit on the response size
                    .timeout(100000 * 5)  // milliseconds
                    .get();
            Elements rows = document.select("tr");
            for (Element row : rows) {
                Elements cells = row.select("td");
                for (Element cell : cells) {
                    String text = cell.text();
                    if (text.contains(",")) {
                        // quote cells that contain commas so they stay in one CSV column
                        writer.write("\"" + text + "\",");
                    } else {
                        writer.write(text + ",");
                    }
                }
                writer.write("\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Disable JavaScript in your browser to see exactly what Jsoup sees. Part of the page is loaded with AJAX, so Jsoup cannot reach it. There is an easy way to check where the additional data comes from.
Open your browser's developer tools, switch to the Network tab, and look at the requests and responses.
You can see that the table is downloaded from this URL:
https://www.amfiindia.com/modules/LoadModules/MoneyMarketSecurities
You can request this URL directly to get the data you need.
If you need the whole rendered HTML at once, Jsoup alone cannot provide it; use Selenium WebDriver instead. Example here: https://stackoverflow.com/a/54510107/9889778
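Separately from the AJAX issue, the loop in the question writes cell text into a CSV file, and text that contains commas, quotes, or line breaks has to be quoted before it is written. Below is a minimal sketch in plain Java, assuming RFC 4180-style quoting. The class and method names (CsvCells, quote, toCsvLine) are hypothetical and not part of Jsoup, and the sample values are made up.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Jsoup: quotes cell text for CSV output.
class CsvCells {

    // Wrap a cell in double quotes when it contains a comma, a quote, or a
    // line break; embedded quotes are doubled (RFC 4180-style).
    static String quote(String cell) {
        boolean needsQuoting = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        if (!needsQuoting) {
            return cell;
        }
        return "\"" + cell.replace("\"", "\"\"") + "\"";
    }

    // Join already-extracted cell texts into one CSV line (no trailing comma).
    static String toCsvLine(List<String> cells) {
        return cells.stream().map(CsvCells::quote).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Made-up sample row; the second cell contains a comma and gets quoted.
        System.out.println(toCsvLine(List.of("Fund A", "1,250.75", "AAA")));
    }
}
```

In the question's loop, the cell texts of one row could be collected into a list and passed to toCsvLine instead of being written one by one.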

Related

Jsoup HTML parsing works in Java but doesn't work in Android Studio

I'm working on an HTML parsing project using Jsoup. I'm able to parse the title and image correctly, but when I try to parse a timer (related to this post), it fails in Android Studio, although the code that @Shn_Android_Dev gave me works in plain Java.
This is my code:
public void EbayTimerTest() {
    new Thread(new Runnable() {
        @Override
        public void run() {
            Document doc;
            try {
                doc = Jsoup.connect(WEBSITE_URL).get();
                String remaining = doc.select("#vi-cdown_timeLeft").first().text();
                remainingMs = getUnixFromString(remaining);
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
            runOnUiThread(new Runnable() {
                @Override
                public void run() {
                    timer.setText(String.valueOf(remainingMs));
                }
            });
        }
    }).start();
}
}
The error I still get is:
java.lang.NullPointerException: Attempt to invoke virtual method 'java.lang.String org.jsoup.nodes.Element.text()' on a null object reference
I'm pretty sure that the line
String remaining = doc.select("#vi-cdown_timeLeft").first().text();
fails in Android Studio but works in Eclipse.
P.S. Jsoup works well when I parse other elements, such as the title and image.
The main reason for the exception may be that each environment (desktop Java and Android) sends a different User-Agent string to the server, so you get two different HTML documents. When the element is missing from the returned HTML, first() returns null and calling text() on it throws the NullPointerException.
You can solve it in one of two ways:
1. Read the document you get in Android Studio and work out which selector matches the information you need.
2. Add a User-Agent string to the GET request:
doc = Jsoup.connect(URL).userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:59.0) Gecko/20100101")
.get();

Get data from div class by JSOUP

I need to get the value "8.32" from the "rnicper", "36 mg" from "rnstr" and "20/80 PG/VG" from "nirat".
<div class="recline highlight" id="rnic">
<div class="rlab"><span class="nopr indic indic-danger"></span>Nicotine juice <span id="rnstr">36 mg</span> (<span id="nirat">20/80 PG/VG</span>)</div>
<div class="runit" id="rnicml">2.08</div>
<div class="rdrops" id="rnicdr">73</div>
<div class="rgrams" id="rnicg" style="display: none;">2.53</div>
<div class="rpercent" id="rnicper">8.32</div><br>
</div>
I tried various methods, but nothing happens.
doc.getElementById("rnicper").outerHtml();
doc.getElementById("rnicper").text();
doc.select("div#rnicper");
doc.getElementsByAttributeValue("id", "rnicper").text();
Tell me, please, how can I get this information using JSOUP?
Update for Chintak Patel
AsyncTask asyncTask = new AsyncTask() {
    @Override
    protected Object doInBackground(Object[] objects) {
        Document doc = null;
        try {
            doc = Jsoup.connect("http://e-liquid-recipes.com/recipe/2254223/RY4D%20Vanilla%20Swirl%20DL").get();
        } catch (IOException e) {
            e.printStackTrace();
        }
        String content = doc.select("div[id=rnicper]").text();
        Log.d("content", content);
        return null;
    }
};
asyncTask.execute();
The values you are trying to get are not part of the initial HTML; they are set by JavaScript after the page is loaded.
Jsoup only fetches static HTML and does not execute JavaScript.
To get what you want, you can use a tool like HtmlUnit or Selenium.
HtmlUnit example:
try (final WebClient webClient = new WebClient()) {
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    final HtmlPage page = webClient
            .getPage("http://e-liquid-recipes.com/recipe/2254223/RY4D%20Vanilla%20Swirl%20DL");
    System.out.println(page.getElementById("rnicper").asText());
}
Write the following class inside your Activity and run the Jsoup request in it. This code fetches the current app version from the Play Store website. You can change the URL and pass div[id=rnicper] to select(), then handle the result in onPostExecute().
private class GetVersionCode extends AsyncTask<Void, String, String> {
    @Override
    protected String doInBackground(Void... voids) {
        String newVersion = null;
        try {
            newVersion = Jsoup.connect("https://play.google.com/store/apps/details?id=" + MainActivity.this.getPackageName() + "&hl=en")
                    .timeout(30000)
                    .userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
                    .referrer("http://www.google.com")
                    .get()
                    .select("div[itemprop=softwareVersion]")
                    .first()
                    .ownText();
            return newVersion;
        } catch (Exception e) {
            return newVersion;
        }
    }

    @Override
    protected void onPostExecute(String onlineVersion) {
        super.onPostExecute(onlineVersion);
        if (onlineVersion != null && !onlineVersion.isEmpty()) {
            if (Float.valueOf(currentVersion) < Float.valueOf(onlineVersion)) {
                showAlertDialogForUpdate(currentVersion, onlineVersion);
            }
        }
        Log.e("update", "Current version " + currentVersion + "playstore version " + onlineVersion);
    }
}
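One caveat about the Float.valueOf comparison in onPostExecute: it only behaves for version strings with a single dot. "1.10" parses as 1.1 and is treated as older than "1.9", and a string such as "1.2.3" throws a NumberFormatException. A minimal sketch of a segment-by-segment comparison in plain Java follows; the class and method names are hypothetical, and it assumes every segment is a plain non-negative integer.

```java
// Hypothetical helper: compares dotted version strings such as "1.10.2".
class VersionCompare {

    // Returns a negative number, zero, or a positive number when a is older
    // than, equal to, or newer than b. Assumes purely numeric segments;
    // anything else makes Integer.parseInt throw NumberFormatException.
    static int compare(String a, String b) {
        String[] left = a.trim().split("\\.");
        String[] right = b.trim().split("\\.");
        int length = Math.max(left.length, right.length);
        for (int i = 0; i < length; i++) {
            // Missing segments count as 0, so "1.2" equals "1.2.0".
            int l = i < left.length ? Integer.parseInt(left[i]) : 0;
            int r = i < right.length ? Integer.parseInt(right[i]) : 0;
            if (l != r) {
                return Integer.compare(l, r);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // 1.9 is older than 1.10 when compared segment by segment.
        System.out.println(compare("1.9", "1.10") < 0);
        // A trailing zero segment does not change the version.
        System.out.println(compare("1.2", "1.2.0") == 0);
    }
}
```

With such a helper, the condition in onPostExecute could become compare(currentVersion, onlineVersion) < 0, ideally wrapped in a try/catch in case the scraped text is not a numeric version.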

Java Jsoup no result from instagram.com

I am trying to get all divs from a website. With google.com or another web page it works fine, but Instagram gives an empty result. The method looks like this:
public static List<String> getPhotoPaths(String url) {
    List<String> paths = new ArrayList<>();
    try {
        Document doc = Jsoup.connect("https://www.instagram.com/explore/tags/test/")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2")
                .get();
        for (Element element : doc.select("div")) {
            System.out.println(element);
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return paths;
}
Does anyone have an idea what's wrong? This is the test website; it normally uses divs like every other page. Or doesn't it?
You don't get any result because Instagram loads those pictures asynchronously with JavaScript (if you disable JavaScript in your browser, you will no longer see the pictures), so they are not in the HTML when the page is first loaded. Unfortunately, Jsoup cannot execute JavaScript, so you should either use another library that can, or parse the JSON object assigned to the window._sharedData variable yourself; it contains the URLs of the thumbnails and the original pictures.
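Pulling the window._sharedData JSON out of the page source can be sketched with a regular expression in plain Java. This is only a sketch under assumptions: it assumes the page embeds the data as window._sharedData = {...}; immediately before a closing script tag, the class name is hypothetical, and the HTML in main is a made-up stand-in with placeholder keys, not Instagram's real structure (which may also have changed since the answer was written).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the JSON text assigned to window._sharedData.
class SharedDataExtractor {

    // Assumed layout: window._sharedData = {...};</script>
    // The lazy group stops at the first "}" that is followed by ";</script>".
    private static final Pattern SHARED_DATA = Pattern.compile(
            "window\\._sharedData\\s*=\\s*(\\{.*?\\})\\s*;\\s*</script>", Pattern.DOTALL);

    static Optional<String> extractJson(String html) {
        Matcher m = SHARED_DATA.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up stand-in for the real page source; the JSON keys are placeholders.
        String html = "<html><body><script type=\"text/javascript\">"
                + "window._sharedData = {\"example\":{\"url\":\"https://example.com/a.jpg\"}};"
                + "</script></body></html>";
        System.out.println(extractJson(html).orElse("not found"));
    }
}
```

The extracted string would then be handed to a JSON parser of your choice to walk down to the image URLs. The page source itself can be obtained with Jsoup via doc.html().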

Only scrape specific details from a web page

I am using Jsoup to retrieve details from a web page and write them to a text file. Is it possible for me to retrieve only parts of it? For example, from the following link I want to take only the job description.
http://aldogroup.luceosolutions.com/recruit/stores/advert_details.php?id=3136&_lang=en&partner_id=139
Sometimes the job postings come from different websites, so the format of the HTML tags may vary. I need a way to retrieve only the job description. The following code retrieves everything on the web page. How can I get only the job description? Please help.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MainCollector {
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        Document doc;
        try {
            doc = Jsoup.connect("http://aldogroup.luceosolutions.com/recruit/stores/advert_details.php?id=3136&_lang=en&partner_id=139").get();
            String title = doc.title();
            String body = doc.body().toString();
            Document convertText = Jsoup.parseBodyFragment(body);
            String convertedText = convertText.text();
            System.out.println("Title:" + title);
            System.out.println("Body:" + convertedText);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
You can use this -
Elements e = doc.select(".annonce > p:nth-child(5)");
System.out.println(e.text());
To get the right CSS selector, open your browser's developer tools (press F12) and choose the inspector tool.
You should also add a User-Agent string to your request, so that your program gets the exact same page as your browser -
doc = Jsoup.connect("http://aldogroup.luceosolutions.com/recruit/stores/advert_details.php?id=3136&_lang=en&partner_id=139")
.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0")
.get();

Need to parse image src from HTML page then display it

I'm currently trying to develop an app that visits the following site (http://lulpix.com), parses the HTML, and gets the img src from the following section:
<div class="pic rounded-8" style="overflow:hidden;"><div style="margin:0 0 36px 0;overflow:hidden;border:none;height:474px;"><img src="http://lulpix.com/images/2012/April/13/4f883cdde3591.jpg" alt="All clogged up" title="All clogged up" width="319"/></div></div>
It is of course different every time the page is loaded, so I cannot give a direct URL to an asynchronous gallery of images, which is what I intend to build. For instance:
Load Page > Parse img src > download ASync to imageview > Reload lulpix.com > start again
Then place each of these in an ImageView that the user can swipe left and right to browse.
So the TL;DR is: how can I parse the HTML to retrieve the URL, and does anyone have experience with libraries for displaying images?
Thank you very much.
Here's an AsyncTask that connects to lulpix and fakes a referrer and user agent (lulpix apparently tries to block scraping with some pretty lame checks). Start it like this in your Activity:
new ForTheLulz().execute();
The resulting Bitmap is downloaded in a pretty lame way (no caching, and no check whether the image has already been downloaded), and error handling is pretty much non-existent, but the basic concept should be OK.
class ForTheLulz extends AsyncTask<Void, Void, Bitmap> {
    @Override
    protected Bitmap doInBackground(Void... args) {
        Bitmap result = null;
        try {
            Document doc = Jsoup.connect("http://lulpix.com")
                    .referrer("http://www.google.com")
                    .userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
                    .get();
            //parse("http://lulpix.com");
            if (doc != null) {
                Elements elems = doc.getElementsByAttributeValue("class", "pic rounded-8");
                if (elems != null && !elems.isEmpty()) {
                    Element elem = elems.first();
                    elems = elem.getElementsByTag("img");
                    if (elems != null && !elems.isEmpty()) {
                        elem = elems.first();
                        String src = elem.attr("src");
                        if (src != null) {
                            URL url = new URL(src);
                            // Just assuming that "src" isn't a relative URL is probably stupid.
                            InputStream is = url.openStream();
                            try {
                                result = BitmapFactory.decodeStream(is);
                            } finally {
                                is.close();
                            }
                        }
                    }
                }
            }
        } catch (IOException e) {
            // Error handling goes here
        }
        return result;
    }

    @Override
    protected void onPostExecute(Bitmap result) {
        ImageView lulz = (ImageView) findViewById(R.id.lulpix);
        if (result != null) {
            lulz.setImageBitmap(result);
        } else {
            //Your fallback drawable resource goes here
            //lulz.setImageResource(R.drawable.nolulzwherehad);
        }
    }
}
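The comment in the code above points out that src might be a relative URL. A minimal sketch of resolving it against the page URL with java.net.URI follows; the class and method names are hypothetical. It assumes src is a syntactically valid URI (URI.create throws IllegalArgumentException for values containing spaces, for example) and that the page URL ends with a path such as "/", since resolving against a URL with an empty path is unreliable.

```java
import java.net.URI;

// Hypothetical helper: turns an img src value into an absolute URL.
class SrcResolver {

    // Resolve src against the URL of the page it came from.
    // Absolute src values pass through unchanged.
    static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://lulpix.com/";
        // Relative to the page, relative to the host root, and already absolute.
        System.out.println(resolve(page, "images/2012/April/13/4f883cdde3591.jpg"));
        System.out.println(resolve(page, "/images/2012/April/13/4f883cdde3591.jpg"));
        System.out.println(resolve(page, "http://lulpix.com/images/2012/April/13/4f883cdde3591.jpg"));
    }
}
```

Jsoup can also do this itself: elem.absUrl("src") returns the absolute URL, using the address the document was loaded from as the base.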
I recently used Jsoup to parse invalid HTML, and it works well! Do something like this:
Document doc = Jsoup.parse(str);
Element img = doc.body().select("div[class=pic rounded-8] img").first();
String src = img.attr("src");
Play with the selector string to get it right, but I think the above will work. It first selects the outer div based on the value of its class attribute, and then any descendant img element.
There is no need to use a WebView now; check this sample project:
https://github.com/meetmehdi/HTMLImageParser.git
In this sample project I parse the HTML and the image tag, then extract the image from the image URL. The image is downloaded and displayed.
