Using Java to pull data from a webpage? - java

I'm attempting to make my first program in Java. The goal is to write a program that browses to a website and downloads a file for me. However, I don't know how to use Java to interact with the internet. Can anyone tell me what topics to look up/read about or recommend some good resources?

The simplest solution (without depending on any third-party library or platform) is to create a URL instance pointing to the web page / link you want to download, and read the content using streams.
For example:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
public class DownloadPage {
public static void main(String[] args) throws IOException {
// Make a URL to the web page
URL url = new URL("http://stackoverflow.com/questions/6159118/using-java-to-pull-data-from-a-webpage");
// Get the input stream through URL Connection
URLConnection con = url.openConnection();
InputStream is = con.getInputStream();
// Once you have the Input Stream, it's just plain old Java IO stuff.
// For this case, since you are interested in getting plain-text web page
// I'll use a reader and output the text content to System.out.
// For binary content, it's better to directly read the bytes from stream and write
// to the target file.
try(BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
String line = null;
// read each line and write to System.out
while ((line = br.readLine()) != null) {
System.out.println(line);
}
}
}
}
Hope this helps.
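For the binary case mentioned in the comments above (which is what "download a file" usually means), a minimal sketch could skip the Reader entirely and copy the raw bytes to disk with Files.copy. The class name FileDownloader and its method names are made up for this illustration; only the java.net and java.nio calls come from the standard library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class; the names are illustrative, not from any library.
final class FileDownloader {

    // Copies every byte from the stream to the target file, replacing an
    // existing file, and returns the number of bytes written.
    static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Opens the URL and saves its raw bytes. No Reader is involved, so binary
    // content (zip, pdf, images) is not altered by charset decoding.
    static long download(String address, Path target) throws IOException {
        try (InputStream in = new URL(address).openStream()) {
            return save(in, target);
        }
    }
}
```

A call such as FileDownloader.download("https://example.com/file.zip", Path.of("file.zip")) would then write the file next to your program; the address here is only a placeholder.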

The Basics
Look at these to build a solution more or less from scratch:
Start from the basics: The Java Tutorial's chapter on Networking, including Working With URLs
Make things easier for yourself: Apache HttpComponents (including HttpClient)
The Easily Glued-Up and Stitched-Up Stuff
You always have the option of calling external tools from Java using the exec() and similar methods. For instance, you could use wget, or cURL.
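As a rough sketch of the external-tool route, the snippet below launches cURL through ProcessBuilder (the more flexible sibling of Runtime.exec()). It assumes curl is installed and on the PATH; the class and method names are invented for the example.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical wrapper around the curl command-line tool.
final class CurlDownload {

    // Builds the argument list: -L follows redirects, -o names the output file.
    static List<String> command(String url, String outputFile) {
        return List.of("curl", "-L", "-o", outputFile, url);
    }

    // Runs curl and waits for it to finish; returns the process exit code
    // (0 means success). Assumes curl is installed and on the PATH.
    static int run(String url, String outputFile) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(url, outputFile))
                .inheritIO() // show curl's progress and errors on our console
                .start();
        return process.waitFor();
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids quoting problems when the URL or file name contains spaces.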
The Hardcore Stuff
Then if you want to go into more fully-fledged stuff, thankfully the need for automated web-testing has given us very practical tools for this. Look at:
HtmlUnit (powerful and simple)
Selenium, Selenium-RC
WebDriver/Selenium2 (still in the works)
JBehave with JBehave Web
Some other libs are purposefully written with web-scraping in mind:
JSoup
Jaunt
Some Workarounds
Java is a language, but also a platform, with many other languages running on it. Some of which integrate great syntactic sugar or libraries to easily build scrapers.
Check out:
Groovy (and its XmlSlurper)
or Scala (with great XML support as presented here and here)
If you know of a great library for Ruby (JRuby, with an article on scraping with JRuby and HtmlUnit) or Python (Jython) or you prefer these languages, then give their JVM ports a chance.
Some Supplements
Some other similar questions:
Scrape data from HTML using Java
Options for HTML Scraping

Here's my solution using URL and a try-with-resources statement to handle the exceptions.
/**
* Created by mona on 5/27/16.
*/
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
public class ReadFromWeb {
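public static void readFromWeb(String webURL) throws IOException {
// Create the URL and open the stream in the try header, so that a bad address
// reaches the MalformedURLException handler and the stream is always closed.
try (InputStream is = new URL(webURL).openStream();
BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
String line;
while ((line = br.readLine()) != null) {
System.out.println(line);
}
}
catch (MalformedURLException e) {
// Rethrow with the offending address in the message
throw new MalformedURLException("URL is malformed: " + webURL);
}
catch (IOException e) {
// Keep the original exception as the cause instead of discarding it
throw new IOException("Could not read " + webURL, e);
}
}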
public static void main(String[] args) throws IOException {
String url = "https://madison.craigslist.org/search/sub";
readFromWeb(url);
}
}
You could additionally save it to file based on your needs or parse it using XML or HTML libraries.

Since Java 11, the most convenient way is to use java.net.http.HttpClient from the standard library.
Example:
import java.net.Authenticator;
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpClient.Redirect;
import java.net.http.HttpClient.Version;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpResponse.BodyHandlers;
import java.time.Duration;
HttpClient client = HttpClient.newBuilder()
.version(Version.HTTP_1_1)
.followRedirects(Redirect.NORMAL)
.connectTimeout(Duration.ofSeconds(20))
.proxy(ProxySelector.of(new InetSocketAddress("proxy.example.com", 80)))
.authenticator(Authenticator.getDefault())
.build();
HttpRequest request = HttpRequest.newBuilder()
.uri(URI.create("https://foo.com/"))
.timeout(Duration.ofMinutes(2))
.GET()
.build();
HttpResponse<String> response = client.send(request, BodyHandlers.ofString());
System.out.println(response.statusCode());
System.out.println(response.body());

I use the following code for my API:
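// try-with-resources closes the stream even if reading fails
try (InputStream content = new URL("https://stackoverflow.com/questions/6159118/using-java-to-pull-data-from-a-webpage").openStream()) {
int c;
// read() returns one byte at a time, or -1 at the end of the stream
while ((c = content.read()) != -1) System.out.print((char) c);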
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException ie) {
ie.printStackTrace();
}
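You can collect the characters (for example in a StringBuilder) and convert them to a String. Note that casting each byte to char is only correct for single-byte encodings such as ISO-8859-1.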
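If the goal is to end up with a String, one possible sketch is to read all bytes and decode them in one step. This assumes Java 9 or newer (for InputStream.readAllBytes) and assumes the page is UTF-8 encoded; the PageText class is a made-up name for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; assumes the content is UTF-8 encoded.
final class PageText {

    // Reads the whole stream and decodes it as UTF-8, so multi-byte
    // characters survive (a byte-by-byte cast to char would mangle them).
    static String readAll(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    // Opens the address, reads it fully and closes the stream afterwards.
    static String fetch(String address) throws IOException {
        try (InputStream in = new URL(address).openStream()) {
            return readAll(in);
        }
    }
}
```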

Related

reproducing the jackson vulnerability CVE-2017-7525 using spring boot and custom java class

I read about the ongoing Jackson vulnerability (CVE-2017-7525), which allows remote code execution, as explained here.
I did some modifications to the example class given on that page and wrote something like this:
import java.io.*;
import java.net.*;
public class Exploit extends com.sun.org.apache.xalan.internal.xsltc.runtime.AbstractTranslet {
private static String urlString = "https://sv443.net/jokeapi/category/any?blacklistFlags=nsfwreligiouspolitical";
public Exploit() throws Exception {
StringBuilder result = new StringBuilder();
URL url = new URL(urlString);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("GET");
BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String line;
while ((line = rd.readLine()) != null) {
result.append(line);
}
rd.close();
//Lets see the joke in the logs
System.out.println(result);
}
@Override
public void transform(com.sun.org.apache.xalan.internal.xsltc.DOM document, com.sun.org.apache.xml.internal.dtm.DTMAxisIterator iterator, com.sun.org.apache.xml.internal.serializer.SerializationHandler handler) {
}
@Override
public void transform(com.sun.org.apache.xalan.internal.xsltc.DOM document, com.sun.org.apache.xml.internal.serializer.SerializationHandler[] handler) {
}
}
I compiled the .java file, opened the generated .class file, and passed its contents as part of the sample API request body provided. However, the custom code does not appear to have been executed: I expected to see the output of the request printed in the application logs, but nothing is printed.
Does anyone have a simple example that showcases this vulnerability using spring boot and jackson, through an api call using bogus jackson?
I understand this is an unusual question, but I am looking into this interesting topic hoping there is someone out there who has come across the need to demo this issue.
In short I am looking to demo this java deserialization vulnerability while using spring boot, jackson by making an api call and passing a Json document which contains the compiled java code to be executed.

How to get the content of a Website to a String in Android Studio ?

I want to display the parts of the content of a Website in my app. I've seen some solutions here but they are all very old and do not work with the newer versions of Android Studio. So maybe someone can help out.
https://jsoup.org/ should help: it fetches the full page and lets you select parts of it by class, id, and so on. For instance, the code below gets and prints the site's title:
Document doc = Jsoup.connect("http://www.moodmusic.today/").get();
String title = doc.select("title").text();
System.out.println(title);
If you want to get raw data from a target website, you will need to do the following:
Create a URL object with the link of the website specified in the parameter
Cast it to HttpURLConnection
Retrieve its InputStream
Convert it to a String
This can work generally with java, no matter which IDE you're using.
To retrieve a connection's InputStream:
// Create a URL object
URL url = new URL("https://yourwebsitehere.domain");
// Retrieve its input stream
HttpURLConnection connection = ((HttpURLConnection) url.openConnection());
InputStream instream = connection.getInputStream();
Make sure to handle java.net.MalformedURLException and java.io.IOException
To convert an InputStream to a String
public static String toString(InputStream in) throws IOException {
StringBuilder builder = new StringBuilder();
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
String line;
while ((line = reader.readLine()) != null) {
builder.append(line).append("\n");
}
reader.close();
return builder.toString();
}
You can copy and modify the code above and use it in your source code!
Make sure to have the following imports
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
Example:
public static String getDataRaw() throws IOException, MalformedURLException {
URL url = new URL("https://yourwebsitehere.domain");
HttpURLConnection connection = ((HttpURLConnection) url.openConnection());
InputStream instream = connection.getInputStream();
return toString(instream);
}
To call getDataRaw(), handle IOException and MalformedURLException and you're good to go!
Hope this helps!
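One Android-specific caveat: Android throws NetworkOnMainThreadException when network calls run on the main (UI) thread, and the app needs the INTERNET permission in its manifest. A minimal sketch of moving the fetch onto a background thread with a plain ExecutorService follows; BackgroundFetch is a hypothetical name, and the Callable you pass in is assumed to wrap something like getDataRaw() from above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper that runs blocking network code off the calling thread.
final class BackgroundFetch {

    // One daemon worker thread named "network", shared by all fetches.
    private static final ExecutorService NETWORK =
            Executors.newSingleThreadExecutor(runnable -> {
                Thread thread = new Thread(runnable, "network");
                thread.setDaemon(true);
                return thread;
            });

    // Schedules the fetch on the worker thread and returns immediately;
    // Future.get() later yields the result or rethrows the failure.
    static Future<String> fetchAsync(Callable<String> fetcher) {
        return NETWORK.submit(fetcher);
    }
}
```

Usage would look like Future<String> page = BackgroundFetch.fetchAsync(() -> getDataRaw());. Avoid calling page.get() on the UI thread, since that blocks it; hand the result back to the UI instead, for example with Activity.runOnUiThread.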

HTTPS web request in Java

I am currently trying to check a MD5 Hash using the api provided by the following site: https://md5db.net/api/
The following code seems to produce an error and can't find the site. The code does however work for other sites. It just doesn't seem to work with the md5db.net site. Not sure what I am doing wrong.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
public class FetchURLData {
public static void main(String[] args) {
try {
URL url = new URL("https://md5db.net/api/5d41402abc4b2a76b9719d911017c592");
BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
String strTemp = "";
while (null != (strTemp = br.readLine())) {
System.out.println(strTemp);
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
Update to Java 8u101 or newer.
The site uses an SSL certificate issued by Let's Encrypt, which Java 8u100 and earlier do not trust, as mentioned here:
Does Java support Let's Encrypt certificates?

How do I work with the Expedia XML API in Java

I have an Expedia account for getting a hotel list, and they provide the data in XML format.
I need to process the XML and display HTML-formatted data on my website using the Java programming language. I used file_get_contents in PHP, but I don't know the first thing about calling an API from Java. Could someone explain it in detail?
The code below reads the page contents, much like file_get_contents in PHP:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
public class URLExp {
public static void main(String[] args) {
try {
URL google = new URL("http://www.google.com/");
URLConnection yc = google.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc
.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
System.out.println(inputLine);
}
in.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}
After you put your own URL in the code, you should get the XML you mentioned. Parse the XML and use it.
Use Apache HttpClient to call the URL, and request XML by setting the headers to text/xml. An easy way to parse the XML is the Castor library. Post your code and we can help more.
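For the parsing step, a sketch with the DOM parser that ships with the JDK might look like this (used here instead of Castor to avoid an extra dependency). The element name "name" and the class HotelXml are placeholders; Expedia's real response schema will differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper; the tag names are illustrative, not Expedia's schema.
final class HotelXml {

    // Returns the text content of every <name> element in the XML document,
    // in document order.
    static List<String> hotelNames(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("name");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getTextContent());
        }
        return names;
    }
}
```

From the resulting list you can build whatever HTML your page needs, for example one list item per hotel.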

How do I get my Java code to display results after connection?

I have java code which connects to a PHP script I've written and posts to it. The PHP contacts an API for evaluation and returns the results in html format.
The Java appears to work, but in Eclipse the result is raw html, not rendered form.
I would like to get my results to launch in a browser. I tried placing it in my xampp folder, but that did nothing, it just downloaded the Java script upon clicking the file.
Any ideas on how I can accomplish this? I am open to changing the PHP code somehow to have it just return the variables and having Java create some form for the user to see. I'm just not so adept at Java right now. Ideas and examples are great!
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
public class Connect {
public void POSTDATA() {
}
public static void main(String[] args) {
try {
// Construct data
String data = URLEncoder.encode("ipaddress", "UTF-8") + "=" + URLEncoder.encode("98.36.2.53", "UTF-8");
// Send data
URL url = new URL("http://localhost/myfiles/WorkingVersion.php");
URLConnection conn = url.openConnection();
conn.setDoOutput(true);
OutputStreamWriter wr = new OutputStreamWriter(conn.getOutputStream());
wr.write(data);
wr.flush();
// Get the response
BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String line;
while ((line = rd.readLine()) != null) {
System.out.println(line);
}
wr.close();
rd.close();
} catch (Exception e) {
// Report the failure instead of silently swallowing it
e.printStackTrace();
}
}
}
Your Java application simply connects to the server and does a POST with ipaddress=98.36.2.53. If you want to display the result in a browser, why use Java anyway?
Several easy solutions are:
Rewrite your PHP script to accept the parameter via GET and access it via a URL in the web browser: http://localhost/myfiles/WorkingVersion.php?ipaddress=98.36.2.53
Write an HTML page that uses a form to POST the data, e.g. by having an <input type="hidden" name="ipaddress" value="98.36.2.53">. You will need user interaction to post the form, but maybe this is sufficient.
Use JavaScript to access the server, do the POST request and read the data. As JavaScript runs in the browser, it is easy to display it on the webpage (e.g. by using jQuery's .html( htmlString ) method).
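If you still want Java to kick things off, the first option can be sketched as follows: build the GET URL and ask the desktop to open it in the default browser via java.awt.Desktop. This assumes the PHP script has already been changed to read the parameter from GET; the OpenInBrowser class and its method names are made up.

```java
import java.awt.Desktop;
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; assumes the server accepts the parameter via GET.
final class OpenInBrowser {

    // Appends one URL-encoded query parameter to the base address.
    static String buildGetUrl(String base, String name, String value) {
        return base + "?"
                + URLEncoder.encode(name, StandardCharsets.UTF_8)
                + "="
                + URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Opens the address in the default browser. Returns false when the
    // platform has no desktop/browser support (e.g. a headless server).
    static boolean open(String url) throws IOException {
        if (!Desktop.isDesktopSupported()
                || !Desktop.getDesktop().isSupported(Desktop.Action.BROWSE)) {
            return false;
        }
        Desktop.getDesktop().browse(URI.create(url));
        return true;
    }
}
```

For example, OpenInBrowser.open(OpenInBrowser.buildGetUrl("http://localhost/myfiles/WorkingVersion.php", "ipaddress", "98.36.2.53")) lets the browser render the returned HTML instead of printing it raw in Eclipse.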
