Getting the source code for the following page using Java - java

I am trying to get the source code for the following page: http://www.amazon.com/gp/offer-listing/082470732X/ref=dp_olp_0?ie=UTF8&redirect=true&condition=all
(Please note that Amazon takes you to another page if you click on the link. To get to the page that I am interested in reading, please copy the link and paste it into an empty tab in your browser. Thanks!)
Normally, using the java.net API, I can get the source code for most URLs with almost no problem; for the above link, however, I get nothing. It turned out that the input stream returned by the connection is gzip-encoded, so I tried the following:
URL url = new URL(urlString);
HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();
InputStream is = urlConnection.getInputStream();
HttpURLConnection.setFollowRedirects(true);
urlConnection.setRequestProperty("Accept-Encoding", "gzip, deflate");
String encoding = urlConnection.getContentEncoding();
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
    is = new GZIPInputStream(is);
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
    is = new InflaterInputStream((is), new Inflater(true));
}
However this time I get the following error deterministically:
java.io.EOFException
at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:249)
at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:239)
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:142)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:67)
at domain.logic.ItemScraper.loadURL(ItemScraper.java:405)
at domain.logic.ItemScraper.main(ItemScraper.java:510)
Can anybody see my mistake? Is there another way to read this particular page? Can somebody explain to me why my browser (Firefox) can read it, while I cannot read the source using Java?
Thanks in advance, best regards,

Instead of
is = new GZIPInputStream(is);
try
is = new GZIPInputStream(urlConnection.getInputStream());
As for the EOFException, it goes away if you add
urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.50 Safari/534.24");
before the input stream is opened.
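Putting the suggestions in this answer together: request properties such as User-Agent and Accept-Encoding have to be set before the connection is opened, and the response stream should only be wrapped in a decompressor when the Content-Encoding header asks for it. The following is a minimal sketch of that order of operations, not a tested scraper. The class name GzipAwareFetch, the helper names and the placeholder User-Agent value are made up for illustration, and the response is assumed to be UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

final class GzipAwareFetch {

    // Wraps the raw stream according to the Content-Encoding value (may be null).
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding)) {
            return new GZIPInputStream(raw);
        }
        if ("deflate".equalsIgnoreCase(contentEncoding)) {
            // Assumes zlib-wrapped deflate; a server that sends raw deflate
            // would need new Inflater(true), as in the question.
            return new InflaterInputStream(raw);
        }
        return raw; // no (or unknown) encoding: pass the bytes through unchanged
    }

    // Reads the whole stream as UTF-8 text (an assumption), one line at a time.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Headers first, then getInputStream(), then decoding.
    static String fetch(String urlString) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(urlString).toURL().openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (placeholder)");
        conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
        try (InputStream in = decode(conn.getInputStream(), conn.getContentEncoding())) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Offline demonstration: gzip a small document in memory and decode it again.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        }
        System.out.print(readAll(decode(new ByteArrayInputStream(buf.toByteArray()), "gzip")));
    }
}
```

Calling GzipAwareFetch.fetch(urlString) would perform the real request; whether a given site accepts the placeholder User-Agent is something only a live test can show.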

You can use a standard BufferedReader to read the response of a webserver of a given URL.
URLIn = new BufferedReader(new InputStreamReader(new URL(URLOrFilename).openStream()));
Then use ...
while ((incomingLine = URLIn.readLine()) != null) {
...
}
... to get the response.

Related

Java URL Connection throws IOException with 403, works perfectly in browser

I am currently trying to download the HTML of a website, but one website constantly gives me a 403 back. I've had that error in previous projects and was always able to fix it by adding a User-Agent; this time, however, nothing I tried helped. I even copied every single header from my browser, but I still get a 403 in Java, while it works perfectly with wget or other programming languages. Maybe someone here can help me?
The URL I'm trying to download is: here
I'm using the following code (I copied the headers 1:1 from my request in Firefox):
if (file.exists()) {
    Files.delete(file.toPath());
}
HttpURLConnection httpcon = (HttpURLConnection) url.openConnection();
httpcon.setRequestMethod("GET");
httpcon.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8");
httpcon.setRequestProperty("Accept-Encoding", "gzip, deflate, br");
httpcon.setRequestProperty("Accept-Language", "en-GB,en;q=0.5");
httpcon.setRequestProperty("Cache-Control", "max-age=0");
httpcon.setRequestProperty("Connection", "keep-alive");
httpcon.setRequestProperty("Host", "www.mediamarkt.de");
httpcon.setRequestProperty("Sec-Fetch-Dest", "document");
httpcon.setRequestProperty("Sec-Fetch-Mode", "navigate");
httpcon.setRequestProperty("Sec-Fetch-Site", "none");
httpcon.setRequestProperty("Sec-Fetch-User", "?1");
httpcon.setRequestProperty("TE", "trailers");
httpcon.setRequestProperty("Upgrade-Insecure-Requests", "1");
httpcon.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0");
InputStream is = httpcon.getInputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(is));
BufferedWriter bwr = new BufferedWriter(new FileWriter(file, true));
String line;
while ((line = br.readLine()) != null) {
    bwr.write(line);
}
is.close();
br.close();
bwr.close();

POST Request not working with Python, but working with Java

I am converting Java code to Python, but the login POST request does not work from the Python code.
// Java code
String inputData = "{\"lang\" : \"ko\", \"loginName\" : \"kkstar123\", \"password\" : \"123123123\"}";
String strUrl = "http://10.110.120.80/management/user/login.json";
StringBuffer sb = new StringBuffer();
HttpURLConnection conn = (HttpURLConnection) new URL(strUrl).openConnection();
conn.setDoOutput(true);
conn.setDoInput(true);
conn.setRequestMethod("POST");
conn.setRequestProperty("Accept", "application/json");
conn.setRequestProperty("Content-Type", "application/json; charset=\"UTF-8\"");
OutputStream out = conn.getOutputStream(); // if the OutputStream is removed, it returns a 404 error
out.write(inputData.getBytes());
out.close();
System.out.println(conn.getResponseCode());
InputStreamReader in = new InputStreamReader((InputStream)conn.getContent());
BufferedReader br = new BufferedReader(in);
String line;
while ((line = br.readLine()) != null)
    sb.append(line).append("\n");
System.out.println(sb.toString());
The above works fine, but the Python code below returns a 404 error:
# Python code
import requests
from bs4 import BeautifulSoup
header = {
    'Accept': 'application/json',
    'Content-Type': 'application/json; charset="UTF-8"',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Mobile Safari/537.36'
}
payload = {
    "lang": "ko",
    "loginName": "kkstar123",
    "password": "123123123"
}
loginURL = "http://10.110.120.80/management/user/login.json"
with requests.Session() as s:
    login_req = s.post(loginURL, data = payload, headers = header)
    print(login_req)
    print(login_req.text)
What is the problem? Please help me resolve this issue. (If I use Selenium, it works fine, but I want to send the request and get the response quickly, so I am trying to use the requests API.)
I shouldn't have used the 'data' parameter for this POST request. After changing it to 'json', it works. With 'data', requests form-encodes the dict, so the body was not JSON even though my Content-Type header said it was; with 'json', the dict is serialized as JSON.
Have you checked whether the request sent by the browser involves any cookies?
If there are cookies present, you'll need to convert those cookies using SimpleCookie and add them to the request's headers.
Also, use a Session object from requests so that cookies persist between requests.

How do I get the same html code from a java request as I do from inspect in Chrome?

I'm trying to get the stream link for a video that is embedded in a website. First I get the HTML of the page containing the player, then I narrow this down to the embedded link, and from that I get the stream link. In the past I have been able to use Chrome to find the video player element and then look for it in Java. However, this time the element I found in Chrome is not in the HTML I get from Java.
(This method has worked in the past with different websites.)
I'm using Inspect Element in Chrome to find the player.
This is my code to find an element of a website in Java:
//Opens Connection
URL url = new URL(address);
//Gets Data
URLConnection connection = url.openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36");
InputStream is = connection.getInputStream();
//Creates buffered reader
BufferedReader in = new BufferedReader(new InputStreamReader(is));
String inputLine;
//Finds the line
while ((inputLine = in.readLine()) != null) {
    if (inputLine.contains(target) == true) {
        break;
    }
}
//Closes The input stream and buffered reader
in.close();
is.close();
//Returns the found line
return inputLine;
Any help is appreciated.

how to handle utf-8 content from website

I'm new to Java and I'm stuck on this function:
public String getFromUrl(String url){
    String content = "";
    try{
        URL U = new URL(url);
        URLConnection conn = U.openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
        BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while((line = reader.readLine()) != null)content += line+"\r\n";
        reader.close();
    }
    catch(Exception e){}
    return content;
}
I always get question marks instead of UTF-8 symbols!
What am I doing wrong?
I read this post.
First: I can't understand why a byte array is used.
Second: what should the while loop look like in this case? If I write
while((line = reader.readLine()) != null)content = line.getBytes("UTF-8");
Eclipse says something like "the local variable content may not have been initialized".
Third: how should I convert the byte array back into a string?
Then I read this one. I didn't even try the approach from that post, because I'm trying to write a function that simulates a browser's GET and POST requests. It seems I have found out how to do that with the URL class, so I don't want to use any other classes and methods.
Now the only problem I have is how to handle UTF-8 content.
Any help appreciated!
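Dump:
String encoding = conn.getContentEncoding();
Note that this returns the Content-Encoding header (for example gzip), not the character set. The charset, if the server sends one, is part of the Content-Type header (conn.getContentType()); if it is present, you can use that for the reader.
Also print any exception that is caught instead of swallowing it.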
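A response normally advertises its character set in the Content-Type header (for example "text/html; charset=ISO-8859-1"), so one way to pick the reader's charset is to parse that header and fall back to a default when it is missing. The sketch below shows the idea; the class name ResponseCharset and its method are hypothetical, and it deliberately ignores charset declarations inside the HTML itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ResponseCharset {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=UTF-8". Returns the fallback when the header is
    // null, has no charset parameter, or names a charset Java does not know.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(fromContentType("text/html", StandardCharsets.UTF_8));
    }
}
```

In the function from the question, this could be used as new InputStreamReader(conn.getInputStream(), ResponseCharset.fromContentType(conn.getContentType(), StandardCharsets.UTF_8)). If question marks still appear, the console or file the text is printed to may be the part that cannot represent the characters.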

java and website redirection detection

I have a Java-related question.
The website www.stationv3.com gets updated daily (most of the time at least; it's somewhat irregular). Every time I connect to the site using the address www.stationv3.com (in a browser), it redirects me to its subpage www.stationv3.com/date_of_latest_update.html
I'm trying to make a program that pulls the latest comic from the site, but I am not sure how to find out its exact address. I know I could work it out if I could find out where I am being redirected on every connect. Is that possible with Java? I know it can do all sorts of quirky things, but I'm still new to internet-related stuff...
I used the exact site name just to make it easy for you to check out what's going on.
Also, I'm writing generic code, which could (with some tinkering) be applied to any site that works in that manner.
import java.net.*;
public class ShowStationV3Redirect {
    public static void main(String[] args) throws Exception {
        URL url = new URL(args[0]);
        HttpURLConnection.setFollowRedirects(false);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        System.out.println("Response code = " + connection.getResponseCode());
        String header = connection.getHeaderField("location");
        if (header != null)
            System.out.println("www.stationv3.com redirected to " + header);
    }
}
The above code snippet tells you what URL you are being redirected to.
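One detail the snippet leaves open is that a Location header may hold a relative reference rather than a full URL, in which case it has to be resolved against the address that was requested. A minimal sketch of that step follows; the class name RedirectTarget is hypothetical and the example path in main is made up, not a real page on the site.

```java
import java.net.URI;

final class RedirectTarget {

    // Resolves a Location header value against the URL that was requested.
    // Returns null when the response carried no Location header.
    // The requested URL should include a path (at least "/") so that
    // relative references have a base path to resolve against.
    static URI resolve(String requestedUrl, String locationHeader) {
        if (locationHeader == null) {
            return null;
        }
        return URI.create(requestedUrl).resolve(locationHeader);
    }

    public static void main(String[] args) {
        // Made-up header values, for illustration only.
        System.out.println(resolve("http://www.stationv3.com/", "/d/20110503.html"));
        System.out.println(resolve("http://www.stationv3.com/", "http://example.org/elsewhere"));
    }
}
```

In the program above, the value returned by connection.getHeaderField("location") would be passed as the second argument and args[0] as the first.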
I think you could just fetch:
http://www.stationv3.com/comics/{yyyy}{mm}{dd}sv3.gif
and forget about the redirection problem. You can use this code (untested):
URL server = new URL("<put here the image URL>");
HttpURLConnection connection = (HttpURLConnection) server.openConnection();
connection.setRequestMethod("GET");
// setDoOutput(true) is dropped: a GET request sends no body
connection.setUseCaches(false);
connection.addRequestProperty("Accept", "image/gif");
// No Accept-Encoding header: the bytes are written to disk as received,
// so a gzip-compressed response would produce a corrupt image file.
connection.connect();
// try-with-resources closes both streams, even if the copy fails
try (InputStream is = connection.getInputStream();
     OutputStream os = new FileOutputStream("c:/mycomic.gif")) {
    byte[] buffer = new byte[1024];
    int bytesRead;
    while ((bytesRead = is.read(buffer)) != -1) {
        os.write(buffer, 0, bytesRead);
    }
}
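The {yyyy}{mm}{dd} placeholders in the image address suggested in this answer can be filled in from a date. The sketch below does only that; it assumes the pattern really is year, month and day as eight digits, which the answer implies but which has not been checked against the site. The class name ComicUrl is hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class ComicUrl {

    // Eight-digit date stamp, e.g. 3 May 2011 becomes "20110503".
    private static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Builds the image address for the comic published on the given date.
    static String forDate(LocalDate date) {
        return "http://www.stationv3.com/comics/" + date.format(STAMP) + "sv3.gif";
    }

    public static void main(String[] args) {
        System.out.println(forDate(LocalDate.now()));
    }
}
```

Because the site is not updated every day, a program using this would probably have to try today's address first and step back one day at a time (date.minusDays(1)) until a request succeeds.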
