Extracting data using jsoup in Java

I am trying to run this code and I am facing a NullPointerException in my program. I used try and catch, but I do not know how to eliminate the problem.
Here is the code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.net.*;
import java.io.*;

public class WikiScraper {

    public static void main(String[] args) throws IOException {
        scrapeTopic("/wiki/Python");
    }

    public static void scrapeTopic(String url) {
        String html = getUrl("http://www.wikipedia.org/" + url);
        Document doc = Jsoup.parse(html);
        String contentText = doc.select("#mw-content-text>p").first().text();
        System.out.println(contentText);
        System.out.println("The url was malformed!");
    }

    public static String getUrl(String url) {
        URL urlObj = null;
        try {
            urlObj = new URL(url);
        } catch (MalformedURLException e) {
            System.out.println("The url was malformed!");
            return "";
        }

        URLConnection urlCon = null;
        BufferedReader in = null;
        String outputText = "";
        try {
            urlCon = urlObj.openConnection();
            in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
            String line = "";
            while ((line = in.readLine()) != null) {
                outputText += line;
            }
            in.close();
        } catch (IOException e) {
            System.out.println("There was an error connecting to the URL");
            return "";
        }
        return outputText;
    }
}
The error shown is:
There was an error connecting to the URL
Exception in thread "main" java.lang.NullPointerException
at hello.WikiScraper.scrapeTopic(WikiScraper.java:17)
at hello.WikiScraper.main(WikiScraper.java:11)

You have

public static String getUrl(String url) {
    // ...
    return "";
}

which ends in an empty String whenever the connection fails. Jsoup.parse("") then yields a document with no matching paragraph, so .first() returns null and calling .text() on it throws the NullPointerException.

Try

Document doc = Jsoup.connect("http://example.com/").get();

for example.
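
Applied to the scraper above, a minimal sketch might look like this (assuming the English Wikipedia host en.wikipedia.org, since that is where the original URL redirects; error handling kept deliberately simple):

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WikiScraper {

    public static void main(String[] args) throws IOException {
        scrapeTopic("/wiki/Python");
    }

    public static void scrapeTopic(String url) throws IOException {
        // Let jsoup open the connection itself; it follows redirects and
        // throws an IOException instead of silently returning "".
        Document doc = Jsoup.connect("https://en.wikipedia.org" + url).get();
        Element firstParagraph = doc.select("#mw-content-text > p").first();
        if (firstParagraph != null) {               // guard against a null match
            System.out.println(firstParagraph.text());
        }
    }
}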


URLConnectionReader produces UnknownHostException

I'm getting a little familiar with reading data from websites in Java and have tried to do this using a URLConnectionReader.
Unfortunately, I get an UnknownHostException when I test the whole thing in an online Java compiler (https://www.jdoodle.com/online-java-compiler/).
Have I forgotten any imports? I followed a tutorial. Thanks in advance for every input!
Code (written for the jdoodle online Java compiler):
import java.net.*;
import java.io.*;

public class URLConnectionReader {

    public static void main(String[] args) {
        String output = getUrlContents("https://www.tradegate.de/orderbuch_umsaetze.php?isin=NO0010892359");
        System.out.println(output);
    }

    private static String getUrlContents(String theUrl) {
        StringBuilder content = new StringBuilder();
        try {
            URL url = new URL(theUrl);
            URLConnection urlConnection = url.openConnection();
            BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(urlConnection.getInputStream()));
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                content.append(line + "\n");
            }
            bufferedReader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return content.toString();
    }
}
Error message:
java.net.UnknownHostException: www.tradegate.de
at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)
at java.base/java.net.Socket.connect(Socket.java:591)
at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:285)
at java.base/sun.security.ssl.BaseSSLSocketImpl.connect(BaseSSLSocketImpl.java:173)
at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:182)
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:474)
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:569)
at java.base/sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:265)
at java.base/sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:372)
at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1515)
at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:250)
at URLConnectionReader.getUrlContents(URLConnectionReader.java:21)
at URLConnectionReader.main(URLConnectionReader.java:8)
I separated the classes as follows, and your code works without any exceptions:

class Main:

public class Main {
    public static void main(String[] args) throws ClassNotFoundException {
        URLConnectionReader urlcr = new URLConnectionReader();
        String output = urlcr.getUrlContents("https://www.tradegate.de/orderbuch_umsaetze.php?isin=NO0010892359");
        System.out.println(output);
    }
}
and the URLConnectionReader class:

import java.net.*;
import java.io.*;

public class URLConnectionReader {

    public String getUrlContents(String theUrl) {
        StringBuilder content = new StringBuilder();
        try {
            URL url = new URL(theUrl);
            URLConnection urlConnection = url.openConnection();
            BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(urlConnection.getInputStream()));
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                content.append(line + "\n");
            }
            bufferedReader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return content.toString();
    }
}

Read URL in Java

My goal with this program is to extract a website's content and output it to the console. However, an exception gets thrown every time I run this code. I am wondering what I am doing wrong, and whether anyone can point me in the right direction. Thank you ahead of time!
public class twikiripper {
    public static URL url;

    public static void main(String[] args) {
        BufferedReader br = null;
        try {
            URL url = new URL("http://www.google.com");
        } catch (MalformedURLException ex) {}
        try {
            url.openConnection();
            br = new BufferedReader(new InputStreamReader(url.openStream()));
            String line;
            StringBuilder sb = new StringBuilder();
            while ((line = br.readLine()) != null) {
                sb.append(line);
                sb.append(System.lineSeparator());
            }
            System.out.println(sb);
        } catch (Exception e) {
            System.out.println("Exception: " + e.toString());
        }
    }
}
My code is above. I was wondering: why do I always get Exception: java.lang.NullPointerException? I thought I was doing everything right.
All I am trying to do is display the source of a website, that is all. Please help!
You have an unnecessary try-catch block in your code. Try this:
public static void main(String[] args) {
    BufferedReader br = null;
    try {
        URL url = new URL("http://www.google.com");
        url.openConnection();
        br = new BufferedReader(new InputStreamReader(url.openStream()));
        String line;
        StringBuilder sb = new StringBuilder();
        while ((line = br.readLine()) != null) {
            sb.append(line);
            sb.append(System.lineSeparator());
        }
        System.out.println(sb);
    } catch (Exception e) {
        System.out.println("Exception: " + e.toString());
    }
}
Also make sure that you are importing the correct classes.
The reason for the NullPointerException: the url variable assigned in your first try block is a new local variable, and its scope ends at the end of that try block. The static field url that url.openConnection(); uses is never assigned, so it is still null.
Try this:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public class twikiripper {
    public static URL url;

    public static void main(String[] args) {
        BufferedReader br = null;
        try {
            url = new URL("http://www.google.com"); // I have changed this line
        } catch (MalformedURLException ex) {}
        try {
            url.openConnection();
            br = new BufferedReader(new InputStreamReader(url.openStream()));
            String line;
            StringBuilder sb = new StringBuilder();
            while ((line = br.readLine()) != null) {
                sb.append(line);
                sb.append(System.lineSeparator());
            }
            System.out.println(sb);
        } catch (Exception e) {
            System.out.println("Exception: " + e.toString());
        }
    }
}
In your first try block the local variable hides the field url; you've got two different variables with the same name. Change URL url = new URL("http://www.google.com"); to url = new URL("http://www.google.com"); or follow NiVeR's answer. – Eritrean
Correct! Thank you.
The code below will fulfill your requirement:
package com.subham.testing;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public class Test13 {
    public static URL url;

    public static void main(String[] args) {
        BufferedReader br = null;
        try {
            url = new URL("http://www.google.com");
        } catch (MalformedURLException ex) {
            System.out.println("came exception");
        }
        try {
            url.openConnection();
            br = new BufferedReader(new InputStreamReader(url.openStream()));
            String line;
            StringBuilder sb = new StringBuilder();
            while ((line = br.readLine()) != null) {
                sb.append(line);
                sb.append(System.lineSeparator());
            }
            System.out.println(sb);
        } catch (Exception e) {
            System.out.println("Exception: " + e.toString());
        }
    }
}
You were creating a new local url object in the first try block, so the second try block saw null: the static field had been declared but never initialized.
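
The same trap in miniature (hypothetical names, just to illustrate the shadowing):

public class Shadowing {
    static String value;                      // never assigned, stays null

    public static void main(String[] args) {
        String value = "initialized";         // local variable hides the field
        System.out.println(Shadowing.value);  // prints "null"
    }
}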

How to read a simple playlist in Java

I am trying to create a simple Java command-line program that accepts the URL of a playlist entered on the command line and returns the playlist content.
I am getting the following response back:

Enter playlist url here (0 to quit):
http://gv8748.lu.edu:8084/sweng987/simple-01/playlist.m3u8
java.net.MalformedURLException: no protocol: http://lu8748.lu.edu:8084/sweng987/simple-01/playlist.m3u8
at java.base/java.net.URL.<init>(URL.java:627)
at java.base/java.net.URL.<init>(URL.java:523)
at java.base/java.net.URL.<init>(URL.java:470)
at edu.lu.sweng987.SimplePlaylist.getPlaylistUrl(SimplePlaylist.java:36)
at edu.lu.sweng987.SimplePlaylist.main(SimplePlaylist.java:21)
My code is the following:

package edu.psgv.sweng861;

import java.util.Scanner;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.net.*;

public class SimplePlaylist {

    private SimplePlaylist() {
        // don't allow instances
    }

    // The main function returns the URL entered
    public static void main(String[] args) throws IOException {
        String output = getPlaylistUrl("");
        System.out.println(output);
    }

    private static String getPlaylistUrl(String theUrl) {
        String content = "";
        Scanner scanner = new Scanner(System.in);
        boolean validInput = false;
        System.out.println("Enter playlist url here (0 to quit):");
        content = scanner.nextLine();
        try {
            URL url = new URL(theUrl);
            URLConnection urlConnection = (HttpURLConnection) url.openConnection();
            BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(urlConnection.getInputStream()));
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                content += line + "\n";
            }
            bufferedReader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return content;
    }
}
You have incorrectly used the method's parameter when creating the URL instance, instead of the local variable that actually contains the URL.
Change

content = scanner.nextLine();
try {
    URL url = new URL(theUrl);

to

content = scanner.nextLine();
try {
    URL url = new URL(content);
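
With that one change the method fetches what the user typed. A minimal sketch of the corrected method (my own cleanup, not the answer's exact code: the unused parameter and validInput flag are dropped, and the typed URL is kept in its own variable so the returned content is just the playlist):

private static String getPlaylistUrl() {
    StringBuilder content = new StringBuilder();
    Scanner scanner = new Scanner(System.in);
    System.out.println("Enter playlist url here (0 to quit):");
    String input = scanner.nextLine();        // the URL the user typed
    try {
        URL url = new URL(input);             // build the URL from the input
        URLConnection urlConnection = url.openConnection();
        BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(urlConnection.getInputStream()));
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            content.append(line).append('\n');
        }
        bufferedReader.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return content.toString();
}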

InputStreamReader on a URL connection returning null

I am following a tutorial on web scraping from the book "Web Scraping with Java". The following code gives me a NullPointerException. Part of the problem is that (line = in.readLine()) is always null, so the while loop in getUrl never runs, and I do not know why. Can anyone offer me insight into this? This code should print the first paragraph of the Wikipedia article on CPython.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.net.*;
import java.io.*;

public class WikiScraper {

    public static void main(String[] args) {
        scrapeTopic("/wiki/CPython");
    }

    public static void scrapeTopic(String url) {
        String html = getUrl("http://www.wikipedia.org/" + url);
        Document doc = Jsoup.parse(html);
        String contentText = doc.select("#mw-content-text > p").first().text();
        System.out.println(contentText);
    }

    public static String getUrl(String url) {
        URL urlObj = null;
        try {
            urlObj = new URL(url);
        } catch (MalformedURLException e) {
            System.out.println("The url was malformed!");
            return "";
        }

        URLConnection urlCon = null;
        BufferedReader in = null;
        String outputText = "";
        try {
            urlCon = urlObj.openConnection();
            in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
            String line = "";
            while ((line = in.readLine()) != null) {
                outputText += line;
            }
            in.close();
        } catch (IOException e) {
            System.out.println("There was an error connecting to the URL");
            return "";
        }
        return outputText;
    }
}
If you enter http://www.wikipedia.org//wiki/CPython in a web browser, it is redirected to https://en.wikipedia.org/wiki/CPython. HttpURLConnection does not follow a redirect that switches protocol from http to https, so your connection only receives the redirect response, whose body has nothing to read. Use

String html = getUrl("https://en.wikipedia.org/"+url);

instead of

String html = getUrl("http://www.wikipedia.org/"+url);

and then line = in.readLine() can really read something.
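
To see the redirect for yourself, a small probe (my own illustration, not part of the original answer) prints the status code and the target the server sends back:

import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectProbe {
    public static void main(String[] args) throws Exception {
        HttpURLConnection con = (HttpURLConnection)
                new URL("http://www.wikipedia.org/wiki/CPython").openConnection();
        // HttpURLConnection follows http->http redirects automatically,
        // but refuses to cross from http to https.
        System.out.println(con.getResponseCode());           // e.g. 301
        System.out.println(con.getHeaderField("Location"));  // the https URL
    }
}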

Simple Java code for Crawling not working

public class Crawler {
    public static void main(String[] args) {
        List<String> Web = new ArrayList<String>();
        Web.add("www.thehindu.com");
        Web.add("www.indianexpress.com");
        Web.add("www.ndtv.com");
        Web.add("www.tehekla.com");
        try {
            for (int i = 0; i < Web.size(); i++) {
                // URL my_url = new URL("http://www.thehindu.com/");
                String a = Web.get(i).toString();
                System.out.println(a);
                URL my_url = new URL(a);
                BufferedReader br = new BufferedReader(new InputStreamReader(my_url.openStream()));
                String strTemp = "";
                while (null != (strTemp = br.readLine())) {
                    System.out.println(strTemp);
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
When I try to run this code, the error shown is:
java.net.MalformedURLException: no protocol: www.thehindu.com
Try adding http:// before each URL.
You need to place http:// before each website address:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class Crawler {
    public static void main(String[] args) {
        List<String> Web = new ArrayList<String>();
        Web.add("http://www.thehindu.com");
        Web.add("http://www.indianexpress.com");
        Web.add("http://www.ndtv.com");
        Web.add("http://www.tehekla.com");
        try {
            for (int i = 0; i < Web.size(); i++) {
                // URL my_url = new URL("http://www.thehindu.com/");
                String a = Web.get(i).toString();
                System.out.println(a);
                URL my_url = new URL(a);
                BufferedReader br = new BufferedReader(new InputStreamReader(my_url.openStream()));
                String strTemp = "";
                while (null != (strTemp = br.readLine())) {
                    System.out.println(strTemp);
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
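
The underlying rule: java.net.URL rejects any string without a scheme at construction time, before any network access, which is why the exception names the bare host. A quick illustration:

import java.net.MalformedURLException;
import java.net.URL;

public class ProtocolCheck {
    public static void main(String[] args) {
        try {
            new URL("www.thehindu.com");          // no scheme: rejected immediately
        } catch (MalformedURLException e) {
            System.out.println(e.getMessage());   // "no protocol: www.thehindu.com"
        }
    }
}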
