Reading the page source of a website using HttpURLConnection - java

I'm trying to read the page source of a site, opening it with a different ID each time.
I manage to read 5-6 pages, but after that the pages I get back contain the server notice: "please activate browser cookies to view this site"
I know I need to manage the cookies in some way, but nothing I tried has worked.
Here is my code:
public void read_and_save_pages() {
for (String id : ids) {
try {
// open url
URL url = new URL(link + id);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
// set user agent
connection
.setRequestProperty(
"User-Agent",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.76 Safari/537.36");
// read page source code
BufferedReader in = new BufferedReader(new InputStreamReader(
connection.getInputStream(), "windows-1255"));
// create file to write
FileWriter fstream = new FileWriter(
path + ".html");
BufferedWriter out = new BufferedWriter(fstream);
// write file
String line = in.readLine();
while (line != null) {
out.write(line + '\n');
line = in.readLine();
}
out.close();
} catch (Exception e) {// Catch exception if any
System.err.println("Error: " + e.getMessage());
}
}
}
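The "please activate browser cookies" notice suggests the server expects cookies it set on earlier responses to be sent back. One possible approach, assuming the site uses ordinary Set-Cookie headers, is to install a java.net.CookieManager once before the loop; HttpURLConnection then stores and replays cookies on its own. The class name CookieSetup, its helper methods, and the example.com URI are hypothetical, and main simulates a server response instead of contacting a real site.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class CookieSetup {

    /** Installs a JVM-wide cookie store used by every HttpURLConnection opened afterwards. */
    public static CookieManager install() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    /** Feeds one Set-Cookie header value into the manager, as a real response would. */
    public static void storeFromResponse(CookieManager manager, URI uri, String setCookieValue) {
        try {
            manager.put(uri, Collections.singletonMap(
                    "Set-Cookie", Collections.singletonList(setCookieValue)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the Cookie header value the manager would send to the URI, or "" if none. */
    public static String cookieHeaderFor(CookieManager manager, URI uri) {
        try {
            Map<String, List<String>> headers =
                    manager.get(uri, Collections.<String, List<String>>emptyMap());
            List<String> cookies = headers.get("Cookie");
            return (cookies == null || cookies.isEmpty()) ? "" : String.join("; ", cookies);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CookieManager manager = install();
        URI uri = URI.create("http://example.com/page?id=1"); // placeholder address
        // Simulated response header; with a real site the connection does this itself.
        storeFromResponse(manager, uri, "session=abc123; Path=/");
        System.out.println(cookieHeaderFor(manager, uri));
    }
}
```

In the loop from the question, calling CookieSetup.install() once before the first url.openConnection() would be enough for this sketch to take effect; whether it satisfies this particular site is not something the question confirms.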

Related

Downloading File, java.io.IOException: Server returned HTTP response code: 403 for URL

I'm trying to download a file, but for some people running the program, the server returns error 403.
try (BufferedInputStream in = new BufferedInputStream(new URL("http://example.com/test.zip").openStream());
FileOutputStream fileOutputStream = new FileOutputStream("./test.zip")) {
byte dataBuffer[] = new byte[1024];
int bytesRead;
while ((bytesRead = in.read(dataBuffer, 0, 1024)) != -1) {
fileOutputStream.write(dataBuffer, 0, bytesRead);
}
} catch (IOException e18) {
error("Error: "+e18);
e18.printStackTrace();
return false;
}
While researching this error (403 Forbidden), I found multiple posts saying that a user agent needs to be specified. I believe this may be the case here, but I am not sure how to easily add a user agent to my code.
Thank You in advance!
URL tgtUrl = new URL("http://example.com/test.zip");
java.net.URLConnection c = tgtUrl.openConnection();
c.setRequestProperty("User-Agent", " USER AGENT STRING HERE ");
ReadableByteChannel tar = Channels.newChannel(c.getInputStream());
Or:
URL tgtUrl = new URL("http://example.com/test.zip");
java.net.URLConnection c = tgtUrl.openConnection();
c.setRequestProperty("User-Agent", " USER AGENT STRING HERE ");
BufferedReader br = new BufferedReader(new InputStreamReader(c.getInputStream()));
System.out.println(br.readLine());
Reference: Java: Download from an URL
This might be a duplicate question.
Simply adding:
System.setProperty("http.agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.75 Safari/535.7");
Fixed it!
Thanks everyone!
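The http.agent system property changes the default User-Agent for the whole JVM. If only one download should carry the header, it can be set on that single connection instead. Below is a minimal sketch of that variant; the class name AgentDownload and the method download are hypothetical, and the URL is the placeholder from the question.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class AgentDownload {

    /** Copies the resource at url to target, sending the given User-Agent; returns bytes written. */
    public static long download(URL url, Path target, String userAgent) throws IOException {
        URLConnection connection = url.openConnection();
        // Request headers must be set before getInputStream() opens the connection.
        connection.setRequestProperty("User-Agent", userAgent);
        try (InputStream in = connection.getInputStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        URL url = URI.create("http://example.com/test.zip").toURL(); // placeholder from the question
        long bytes = download(url, Paths.get("test.zip"), "Mozilla/5.0");
        System.out.println("Downloaded " + bytes + " bytes");
    }
}
```

Whether a given server accepts a particular User-Agent string is up to that server, so the value passed here is only an example.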

I'm getting java.io.FileNotFoundException for HTTPS URL

My code looks like this:
URL url = new URL("https://nominatim.openstreetmap.org/reverse?format=json&lat=44.400000&lon=26.088492&zoom=18&addressdetails=1");
HttpsURLConnection connection = (HttpsURLConnection) url.openConnection();
connection.setRequestMethod("POST");
connection.setRequestProperty("User-Agent", "Mozilla/5.0");
connection.setRequestProperty("Accept-Language","en-US");
BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
StringBuilder json = new StringBuilder(1024);
String tmp;
while ((tmp = reader.readLine()) != null) json.append(tmp).append("\n");
reader.close();
JSONObject data = new JSONObject(json.toString());
However, I am getting a java.io.FileNotFoundException at the BufferedReader line. The address is correct, and any browser displays the JSON result. I need to get the human-readable address from lat and lon, also known as reverse geocoding. I have tried many things, but nothing worked, so I would be very thankful if you could tell me what I am doing wrong. If possible, I would prefer not to use any external library.
I wrote this code block and it solved the problem. Take a look at the parameters of the setRequestProperty method:
String response = null;
try {
URL url = new URL("https://nominatim.openstreetmap.org/reverse?format=json&lat=44.400000&lon=26.088492&zoom=18&addressdetails=1");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
connection.connect();
connection.getResponseCode(); //if you want to check response code
InputStream stream = connection.getErrorStream();
if (stream == null) {
stream = connection.getInputStream();
BufferedReader r = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("UTF-8")));
StringBuilder sb = new StringBuilder();
String line;
while ((line = r.readLine()) != null) {
sb.append(line);
}
System.out.println(sb.toString());
}
} catch (Exception e) {
e.printStackTrace();
}
In fact, the problem seems to be gone for now. The only things I changed were using addRequestProperty instead of setRequestProperty and the User-Agent string, though I don't think the latter is that important. I am not very familiar with addRequestProperty and setRequestProperty and don't know exactly how they differ, but the difference seems to matter in this case.
URL url = new URL("https://nominatim.openstreetmap.org/reverse?format=json&lat=44.400000&lon=26.088492&zoom=18&addressdetails=1");
HttpsURLConnection connection = (HttpsURLConnection) url.openConnection();
connection.setRequestMethod("GET"); //POST or GET no matter
connection.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0");
BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
StringBuilder json = new StringBuilder(1024);
String tmp;
while ((tmp = reader.readLine()) != null) json.append(tmp).append("\n");
reader.close();
JSONObject data = new JSONObject(json.toString());
Thank you all for your answers, problem is solved!
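On the difference between the two methods: setRequestProperty replaces any value already stored for a header, while addRequestProperty appends another value for the same header. On a fresh connection with no earlier value for that header, both leave a single value, so this difference alone may not explain the fix. The sketch below illustrates the behavior without sending any request; the class name HeaderDemo and the example.com address are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.List;

public class HeaderDemo {

    /**
     * Sets Accept-Language to "en-US", then applies "de-DE" with either
     * addRequestProperty or setRequestProperty, and returns the stored values.
     */
    public static List<String> acceptLanguageValues(boolean useAdd) {
        try {
            // openConnection() only creates the object; nothing is sent yet.
            URLConnection c = URI.create("http://example.com/").toURL().openConnection();
            c.setRequestProperty("Accept-Language", "en-US");
            if (useAdd) {
                c.addRequestProperty("Accept-Language", "de-DE"); // keeps en-US, appends de-DE
            } else {
                c.setRequestProperty("Accept-Language", "de-DE"); // overwrites en-US
            }
            return c.getRequestProperties().get("Accept-Language");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("set: " + acceptLanguageValues(false));
        System.out.println("add: " + acceptLanguageValues(true));
    }
}
```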

How do I get the same html code from a java request as I do from inspect in Chrome?

I'm trying to get the stream link for a video that is embedded in a website. First I get the HTML from the page containing the player. Then I narrow this down to the embedded link, and from that I get the stream link. In the past, I have been able to use Chrome to find the video player element and then look for it in Java. However, the element I found in Chrome is not in the HTML I get from Java.
(This method has worked in the past with different websites.)
I'm using Inspect Element in Chrome to find the player.
This is my code to find an element of a website in Java:
//Opens Connection
URL url = new URL(address);
//Gets Data
URLConnection connection = url.openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36");
InputStream is = connection.getInputStream();
//Creates bufferd reader
BufferedReader in = new BufferedReader(new InputStreamReader(is));
String inputLine;
//Finds the line
while ((inputLine = in.readLine()) != null) {
if (inputLine.contains(target) == true) {
break;
}
}
//Closes The input stream and buffered reader
in.close();
is.close();
//Returns the found line
return inputLine;
Any help is appreciated.

I'm getting an IllegalStateException: Already connected and I cannot figure out why

I'm trying to write a program which connects to a site and pulls data from the source code. Whenever I call this method, once it reaches the line connection.setRequestProperty("Cookie", cookie); it doesn't proceed any further and throws "IllegalStateException: Already connected". I'm trying to cycle through 123 different URLs, so the URL changes each time the method is called, and I'm not sure why it tells me it's already connected when I'm attempting to connect to a different URL. I've searched everywhere for a solution and cannot find one. Can any of you help? Thanks!
private void getUrlData(String u, String championName) throws IOException {
List<String> data = new ArrayList<String>();
try {
BufferedWriter out = new BufferedWriter(new FileWriter("Other Stuff/Champion Data Test.txt"));
out.write(championName);
out.newLine();
URL url = new URL(u);
URLConnection connection = url.openConnection();
String cookie = connection.getHeaderField("Set-Cookie");
connection.setRequestProperty("Cookie", cookie);
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36");
connection.connect();
Scanner in = new Scanner(connection.getInputStream());
String inputLine;
while(in.hasNext()) {
inputLine = in.nextLine();
if(inputLine.contains("stat-label")) {
out.write(in.nextLine());
in.nextLine();
in.nextLine();
out.write(" " + in.nextLine());
}
}
}
catch(Exception e) {
System.out.println(e);
}
}
I found the problem, but solving it raised new problems. The problem was my call to connection.getHeaderField("Set-Cookie").
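The reason that call matters: reading any response header forces the connection to be opened, and request properties can no longer be changed once it is open. A cookie taken from one response can therefore only be sent on a later, separate connection. A minimal sketch of that ordering follows; the class CookieThenRequest and its methods are hypothetical, and the example.com address is a placeholder.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;

public class CookieThenRequest {

    /** Extracts "name=value" from a Set-Cookie value, dropping attributes such as Path. */
    public static String cookiePair(String setCookieHeader) {
        int semicolon = setCookieHeader.indexOf(';');
        String pair = semicolon < 0 ? setCookieHeader : setCookieHeader.substring(0, semicolon);
        return pair.trim();
    }

    /** Opens a connection and sets all request headers before anything reads the response. */
    public static URLConnection prepare(URL url, String cookie, String userAgent) throws IOException {
        URLConnection connection = url.openConnection();
        if (cookie != null) {
            connection.setRequestProperty("Cookie", cookie);
        }
        connection.setRequestProperty("User-Agent", userAgent);
        return connection; // still unconnected at this point
    }

    public static void main(String[] args) throws IOException {
        URL url = URI.create("http://example.com/").toURL(); // placeholder address
        String agent = "Mozilla/5.0";
        // First request: no cookie yet. getHeaderField() connects implicitly.
        String setCookie = prepare(url, null, agent).getHeaderField("Set-Cookie");
        // Second request: a new connection carries the cookie, if one was received.
        String cookie = setCookie == null ? null : cookiePair(setCookie);
        URLConnection second = prepare(url, cookie, agent);
        try (InputStream in = second.getInputStream()) {
            System.out.println("First byte of body: " + in.read());
        }
    }
}
```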

Processing a website by using POST data and cookies

I am trying to access an ASPX website where subsequent pages are returned based on
POST data. Unfortunately, all my attempts to get the following pages fail.
Hopefully, someone here has an idea where to find the error!
In step one, I read the session ID from the cookie as well as the value of the
viewstate variable in the returned HTML page. Step two is meant to send them
back to the server to get the desired page.
Sniffing the data in the web browser gives:
Host=www.geocaching.com
User-Agent=Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100618 Iceweasel/3.5.9 (like Firefox/3.5.9)
Accept=text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language=en-us,en;q=0.5
Accept-Encoding=gzip,deflate
Accept-Charset=ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive=300
Connection=keep-alive
Referer=http://www.geocaching.com/seek/nearest.aspx?state_id=149
Cookie=Send2GPS=garmin; BMItemsPerPage=200; maprefreshlock=true; ASP.NET_SessionId=c4jgygfvu1e4ft55dqjapj45
Content-Type=application/x-www-form-urlencoded
Content-Length=4099
POSTDATA=__EVENTTARGET=ctl00%24ContentBody%24pgrBottom%24lbGoToPage_3&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPD[...]2Xg%3D%3D&language=on&logcount=on&gpx=on
Currently, my script looks like this
import java.net.*;
import java.io.*;
import java.util.*;
import java.security.*;
import java.net.*;
public class test1 {
public static void main(String args[]) {
// String loginWebsite="http://geocaching.com/login/default.aspx";
final String loginWebsite = "http://www.geocaching.com/seek/nearest.aspx?state_id=159";
final String POST_CONTENT_TYPE = "application/x-www-form-urlencoded";
// step 1: get session ID from cookie
String sessionId = "";
String viewstate = "";
try {
URL url = new URL(loginWebsite);
String key = "";
URLConnection urlConnection = url.openConnection();
if (urlConnection != null) {
for (int i = 1; (key = urlConnection.getHeaderFieldKey(i)) != null; i++) {
// get ASP.NET_SessionId from cookie
// System.out.println(urlConnection.getHeaderField(key));
if (key.equalsIgnoreCase("set-cookie")) {
sessionId = urlConnection.getHeaderField(key);
sessionId = sessionId.substring(0, sessionId.indexOf(";"));
}
}
BufferedReader in = new BufferedReader(new InputStreamReader(urlConnection.getInputStream()));
// get the viewstate parameter
String aLine;
while ((aLine = in.readLine()) != null) {
// System.out.println(aLine);
if (aLine.lastIndexOf("id=\"__VIEWSTATE\"") > 0) {
viewstate = aLine.substring(aLine.lastIndexOf("value=\"") + 7, aLine.lastIndexOf("\" "));
}
}
}
} catch (IOException e) {
e.printStackTrace();
}
System.out.println(sessionId);
System.out.println("\n");
System.out.println(viewstate);
System.out.println("\n");
// String goToPage="3";
// step2: post data to site
StringBuilder htmlResult = new StringBuilder();
try {
String encoded = "__EVENTTARGET=ctl00$ContentBody$pgrBottom$lbGoToPage_3" + "&" + "__EVENTARGUMENT=" + "&"
+ "__VIEWSTATE=" + viewstate;
URL url = new URL(loginWebsite);
URLConnection urlConnection = url.openConnection();
urlConnection = url.openConnection();
// Specifying that we intend to use this connection for input
urlConnection.setDoInput(true);
// Specifying that we intend to use this connection for output
urlConnection.setDoOutput(true);
// Specifying the content type of our post
urlConnection.setRequestProperty("Content-Type", POST_CONTENT_TYPE);
// urlConnection.setRequestMethod("POST");
urlConnection.setRequestProperty("Cookie", sessionId);
urlConnection.setRequestProperty("Content-Type", "text/html");
DataOutputStream out = new DataOutputStream(urlConnection.getOutputStream());
out.writeBytes(encoded);
out.flush();
out.close();
BufferedReader in = new BufferedReader(new InputStreamReader(urlConnection.getInputStream()));
String aLine;
while ((aLine = in.readLine()) != null) {
System.out.println(aLine);
}
} catch (MalformedURLException e) {
// Print out the exception that occurred
System.err.println("Invalid URL " + e.getMessage());
} catch (IOException e) {
// Print out the exception that occurred
System.err.println("Unable to execute " + e.getMessage());
}
}
}
Any idea what's wrong? Any help is much appreciated!
Update
Thank you for the fast reply!
I switched to use the HttpURLConnection instead of the URLConnection which implements the setRequestMethod(). I also corrected the minor mistakes you mentioned, e.g. removed the obsolete first setRequestProperty call.
Unfortunately this doesn’t change anything... I think I set all relevant parameters but still get the first page of the list, only. It seems that the "__EVENTTARGET=ctl00$ContentBody$pgrBottom$lbGoToPage_3" is ignored. I don't have any clues why.
Internally, the form on the website looks like this:
<form name="aspnetForm" method="post" action="nearest.aspx?state_id=159" id="aspnetForm">
It is called by the following javascript:
<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
theForm.__EVENTTARGET.value = eventTarget;
theForm.__EVENTARGUMENT.value = eventArgument;
theForm.submit();
}
}
//]]>
</script>
Hopefully, this helps to find a solution?
Greetings
maik.
Do you actually want to GET or POST? If you want to POST, then you may need the setRequestMethod() line.
You're setting Content-Type twice, and the second call overrides the first -- I think you should combine these into a single call.
Then, don't close the output stream before you try to read from the input stream.
Other than that, is there any more logging you can add, or any clues you can give about how it's going wrong?
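One more thing that may be worth checking: in the sniffed browser request the POST body is percent-encoded ($ appears as %24, and the viewstate contains %2F and %3D), while the script concatenates the raw values. This is an assumption about the cause, not a confirmed diagnosis. A minimal sketch of building an encoded body with URLEncoder follows; the class name FormEncoder and the short viewstate value in main are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormEncoder {

    /** Encodes name/value pairs as an application/x-www-form-urlencoded body. */
    public static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encode(field.getKey())).append('=').append(encode(field.getValue()));
        }
        return body.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>(); // keeps insertion order
        fields.put("__EVENTTARGET", "ctl00$ContentBody$pgrBottom$lbGoToPage_3");
        fields.put("__EVENTARGUMENT", "");
        fields.put("__VIEWSTATE", "/wEPD+2Xg=="); // made-up value, not a real viewstate
        System.out.println(encodeForm(fields));
    }
}
```

The resulting string would be written to the connection's output stream in place of the hand-built encoded variable from the question.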
Try the following code, which uses jsoup:
String userAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0";
org.jsoup.nodes.Document jsoupDoc = Jsoup.connect(url).timeout(15000).userAgent(userAgent).referrer("http://calendar.legis.ga.gov/Calendar/?chamber=House").ignoreContentType(true)
.data("__EVENTTARGET", eventtarget).data("__EVENTARGUMENT", eventarg).data("__VIEWSTATE", viewState).data("__VIEWSTATEGENERATOR", viewStateGenarator)
.data("__EVENTVALIDATION", viewStateValidation).parser(Parser.xmlParser()).post();
