Jsoup core web text extraction - Java

I am new to Jsoup, so sorry if my question is too trivial.
I am trying to extract article text from http://www.nytimes.com/, but when I print the parsed document
I am not able to see any of the articles in the output:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class App
{
    public static void main( String[] args )
    {
        String url = "http://www.nytimes.com/";
        Document document;
        try {
            document = Jsoup.connect(url).get();
            System.out.println(document.html()); // Articles not getting printed
            //System.out.println(document.toString()); // Same here
            String title = document.title();
            System.out.println("title : " + title); // Title is fine
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
I have also tried to parse "http://en.wikipedia.org/wiki/Big_data" to retrieve the wiki data; same issue there as well, I am not getting the wiki content in the output.
Any help or hint will be much appreciated.
Thanks.

Here's how to get the text of all <p class="summary"> tags:
final String url = "http://www.nytimes.com/";
Document doc = Jsoup.connect(url).get();

for( Element element : doc.select("p.summary") )
{
    if( element.hasText() ) // Skip those tags without text
    {
        System.out.println(element.text());
    }
}
If you need all <p> tags, without any filtering, you can use doc.select("p") instead. But in most cases it's better to select only the elements you need (see the Jsoup Selector documentation).
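For reference, here is a minimal, self-contained sketch of a few common selector patterns; the class names used below (summary, story) are only illustrative and are not tied to the actual NYT markup:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.nytimes.com/").get();

        // All <p> tags, no filtering
        for (Element p : doc.select("p")) {
            if (p.hasText()) {
                System.out.println(p.text());
            }
        }

        // Narrower selections (illustrative patterns only)
        System.out.println(doc.select("h2 a").size());        // anchors inside <h2> headings
        System.out.println(doc.select("div.story p").size()); // paragraphs inside <div class="story">
        System.out.println(doc.select("a[href]").size());     // anchors that carry an href attribute
    }
}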

Related

Get Google Search Result with Java using Jsoup

First of all, I searched for this problem on Stack Overflow and Google; unfortunately I couldn't find a solution.
I am trying to get the Google search results for a keyword. Here's my code:
public static void main(String[] args) throws Exception {
    Document doc;
    try {
        doc = Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=").userAgent("Mozilla").ignoreHttpErrors(true).timeout(0).get();
        Elements links = (Elements) doc.select("li[class=g]");
        for (Element link : links) {
            Elements titles = link.select("h3[class=r]");
            String title = titles.text();
            Elements bodies = link.select("span[class=st]");
            String body = bodies.text();
            System.out.println("Title: " + title);
            System.out.println("Body: " + body + "\n");
        }
    }
    catch (IOException e) {
        e.printStackTrace();
    }
}
And here are the errors: https://prnt.sc/ro4ooi
It says: "can only iterate over an array or an instance of java.lang.Iterable" (at links).
When I delete the (Elements) cast: https://prnt.sc/ro4pa9
Thank you.
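A note that may help, assuming the standard Jsoup API: doc.select(...) already returns org.jsoup.select.Elements, which extends ArrayList<Element> and is iterable, so the cast is unnecessary; that compiler error typically means Elements or Element was imported from the wrong package. A minimal sketch with the expected imports (the li.g / h3.r / span.st selectors are taken from the question and may no longer match Google's current markup):

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GoogleSearchDemo {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://www.google.com/search?as_epq=%22Yorkshire+Capital%22&as_oq=fraud+OR+allegations+OR+scam")
                    .userAgent("Mozilla")
                    .ignoreHttpErrors(true)
                    .timeout(0)
                    .get();

            // select() already returns Elements; no cast needed
            Elements links = doc.select("li.g");
            for (Element link : links) {
                System.out.println("Title: " + link.select("h3.r").text());
                System.out.println("Body: " + link.select("span.st").text() + "\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}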

Java parse data from html table with jsoup

I want to get the data from the table at the following link:
https://www.nasdaq.com/symbol/aapl/financials?query=balance-sheet
I've tried my code but it doesn't work:
public static void main(String[] args) {
    try {
        Document doc = Jsoup.connect("https://www.nasdaq.com/symbol/aapl/financials?query=balance-sheet").get();
        Elements trs = doc.select("td_genTable");
        for (Element tr : trs) {
            Elements tds = tr.getElementsByTag("td");
            Element td = tds.first();
            System.out.println(td.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Can anybody help me get it to work?
I'm not getting any output for the table; nothing happens.
After testing your code I got a read timeout problem. Looking on Google I found a post suggesting that adding a user agent fixes it, and it worked for me. So, you can try this:
public static void main(String[] args) {
    try {
        // add user agent
        Document doc = Jsoup.connect("https://www.nasdaq.com/symbol/aapl/financials?query=balance-sheet")
                .userAgent("Mozilla/5.0").get();
        Elements trs = doc.select("tr");
        for (Element tr : trs) {
            Elements tds = tr.select(".td_genTable");
            // skip header rows, which would otherwise produce a NullPointerException
            if (tds.size() == 0) continue;
            // look for siblings (see the html structure of the web page)
            Element td = tds.first().siblingElements().first();
            System.out.println(td.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I have added the user agent option and fixed some query errors. This should be useful to get your work started ;)
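As a side note, here is a self-contained sketch of the general row/cell iteration pattern with Jsoup, using a small made-up inline table instead of the live NASDAQ page:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TableDemo {
    public static void main(String[] args) {
        // Hypothetical HTML standing in for a downloaded page
        String html = "<table>"
                + "<tr><th>Item</th><th>Value</th></tr>"
                + "<tr><td>Cash</td><td>100</td></tr>"
                + "<tr><td>Debt</td><td>40</td></tr>"
                + "</table>";
        Document doc = Jsoup.parse(html);

        for (Element row : doc.select("tr")) {
            Elements cells = row.select("td");
            if (cells.isEmpty()) continue; // header row contains <th>, not <td>
            System.out.println(cells.get(0).text() + " = " + cells.get(1).text());
        }
    }
}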

Java XML Read with WSIL file

At the moment I am trying to write a program which is able to read a link out of an XML file. I use Jsoup; my current code is the following:
public static String XmlReader() {
    InputStream is = RestService.getInstance().getWsilFile();
    try {
        Document doc = Jsoup.parse(is, null, "", Parser.xmlParser());
        // TODO: extract the link from doc here
        return doc.toString();
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
I would like to read the following part from an XML file:
<wsil:service>
<wsil:abstract>Read the full documentation on: https://host/sap/bc/mdrs/cdo?type=psm_isi_r&objname=II_QUERY_PROJECT_IN&saml2=disabled</wsil:abstract>
<wsil:name>Query Projects</wsil:name>
<wsil:description location="host/sap/bc/srt/wsdl/srvc_00163E5E1FED1EE897C188AB4A5723EF/wsdl11/allinone/ws_policy/document?sap-vhost=host&saml2=disabled" referencedNamespace="http://schemas.xmlsoap.org/wsdl/"/>
</wsil:service>
I want to return the following URL as a String:
host/sap/bc/srt/wsdl/srvc_00163E5E1FED1EE897C188AB4A5723EF/wsdl11/allinone/ws_policy/document?sap-vhost=host&saml2=disabled
How can I do that?
Thank you
If there is only one wsil:description tag, then you can use this code:
doc.outputSettings().escapeMode(EscapeMode.xhtml);
String val = doc.select("wsil|description").attr("location");
The escape mode should be changed, since you are not working with regular HTML but with XML.
If you have more than one tag with the given name, you can search for a distinct neighbouring element and find the required tag relative to it:
String val = doc.select("wsil|name:contains(Query Projects)").first().parent().select("wsil|description").attr("location");
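To put the pieces together, here is a self-contained sketch that parses the fragment from the question with the XML parser and pulls out the location attribute; the inline string below (with a shortened location value) stands in for the real stream returned by RestService:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class WsilDemo {
    public static void main(String[] args) {
        // Shortened copy of the WSIL fragment from the question
        String xml = "<wsil:service>"
                + "<wsil:name>Query Projects</wsil:name>"
                + "<wsil:description location=\"host/sap/bc/srt/wsdl/document?sap-vhost=host\" "
                + "referencedNamespace=\"http://schemas.xmlsoap.org/wsdl/\"/>"
                + "</wsil:service>";

        // Parse as XML, not HTML
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());

        // Namespaced tags are selected with the ns|tag syntax
        String location = doc.select("wsil|description").attr("location");
        System.out.println(location);
    }
}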

Jsoup Google Search Results

I am attempting to parse the HTML of Google's search results to grab the title of each result. This is done on Android in a private nested class, shown below:
private class WebScraper extends AsyncTask<String, Void, String> {

    public WebScraper() {}

    @Override
    protected String doInBackground(String... urls) {
        Document doc;
        try {
            doc = Jsoup.connect(urls[0]).get();
        } catch (IOException e) {
            System.out.println("Failed to open document");
            return "";
        }
        Elements results = doc.getElementsByClass("rc");
        int count = 0;
        for (Element lmnt : results) {
            System.out.println(count++);
            System.out.println(lmnt.text());
        }
        System.out.println("Count is : " + count);
        String key = "test";
        //noinspection Since15
        SearchActivity.this.songs.put(key, SearchActivity.this.songs.getOrDefault(key, 0) + 1);
        // return requested
        return "";
    }
}
An example URL I am trying to parse: http://www.google.com/#q=i+might+site:genius.com
For some reason, when I run the above code, my count is printed as 0, so no elements are being stored in results. Any help is much appreciated! P.S. doc is definitely initialized and the HTML page is loading properly.
This code will search for a word like "Apple" on Google, fetch all the links from the results, and display their title and URL. It can handle up to about 500 searches a day; after that Google detects it and stops returning results.
search="Apple"; //your word to be search on google
String userAgent = "ExampleBot 1.0 (+http://example.com/bot)";
Elements links=null;
try {
links = Jsoup.connect(google +
URLEncoder.encode(search,charset)).
userAgent(userAgent).get().select(".g>.r>a");
} catch (UnsupportedEncodingException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
for (Element link : links) {
String title = link.text();
String url = link.absUrl("href"); // Google returns URLs in
format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
try {
url = URLDecoder.decode(url.substring(url.indexOf('=') +
1, url.indexOf('&')), "UTF-8");
} catch (UnsupportedEncodingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
if (!url.startsWith("http")) {
continue; // Ads/news/etc.
}
System.out.println("Title: " + title);
System.out.println("URL: " + url);
}
If you check the source code of Google's page, you will notice that it does not contain the text that is normally shown in the browser; there is only a bunch of JavaScript code. That means that Google renders all the search results dynamically.
Jsoup will fetch that JavaScript code and will not find any HTML elements with the "rc" class, which is why you get a zero count in your code sample.
Consider using Google's public search API instead of directly parsing its HTML pages: https://developers.google.com/custom-search/.
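As a rough illustration of that route, here is a minimal sketch that calls the Custom Search JSON API and prints the raw JSON response. The API key and search engine id (cx) are placeholders you would obtain from the Google developer console, and the endpoint and parameter names reflect my understanding of the current API rather than anything stated in the question:

import java.net.URLEncoder;

import org.jsoup.Jsoup;

public class CustomSearchDemo {
    public static void main(String[] args) throws Exception {
        String apiKey = "YOUR_API_KEY";      // placeholder
        String cx = "YOUR_SEARCH_ENGINE_ID"; // placeholder
        String query = "i might site:genius.com";

        String url = "https://www.googleapis.com/customsearch/v1"
                + "?key=" + apiKey
                + "&cx=" + cx
                + "&q=" + URLEncoder.encode(query, "UTF-8");

        // Jsoup can fetch non-HTML content if we ignore the content type;
        // the API answers with JSON, printed raw here (use a JSON library to parse it properly).
        String json = Jsoup.connect(url)
                .ignoreContentType(true)
                .execute()
                .body();
        System.out.println(json);
    }
}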
I completely agree with Matvey Sidorenko, but to use the Google public search API you need a Google API key. The problem is that Google limits each API key to 100 searches; once you exceed that it stops working, and the quota resets after 24 hours.
Recently I was working on a project where we needed to get the Google search result links for different queries provided by the user, so to overcome this API limit I made my own API that searches directly on google/ncr and gives you the result links.
Free Google Search API:
http://freegoogleapi.azurewebsites.net/ OR http://google.bittque.com
I used the HtmlUnit library for making this API.
You can use my API or you can use the HtmlUnit library to achieve what you need.

How to read a string which is written outside the HTML <> tags?

I have HTML code of about 1000 lines and I want to extract the data which is written outside the HTML <> tags.
For example:
<>Java Programm<>
It should read only "Java Programm" and skip whatever is written inside the "<>" tags.
I tried the following code, but it reads the whole file including the <> tags, and I do not want the "<>" markup in my output.
public static void main(String[] args) throws Exception {
    try {
        FileInputStream fin = new FileInputStream("C:\\Users\\File.txt");
        int i;
        while ((i = fin.read()) != -1) {
            System.out.print((char) i);
        }
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
You would need an HTML parser. For Jsoup it's:
File input = new File("C:\\Users\\File.txt");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element body = doc.body(); //Get the body of the html
System.out.println(body.text()) ; //Get the all the text inside the body tag
This is one way to do it. Simple enough :) and there are of course other ways to do it. Note that this will leave out any text that sits outside of the body tag; you can explore Jsoup further and find a solution for that.
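If you do also want the text that ends up outside the body (for example in the head), one option, assuming the file parses as regular HTML, is to call text() on the whole document instead of just the body:

import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ExtractTextDemo {
    public static void main(String[] args) throws Exception {
        File input = new File("C:\\Users\\File.txt");
        Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

        // text() on the Document walks the whole tree (head and body)
        // and returns only the text nodes, with all tags stripped.
        System.out.println(doc.text());
    }
}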
