I'm doing a project using Struts 1.
I'm fetching RSS feeds using ROME, but it fails under two conditions:
When my firewall forbids the RSS URL (response code 403)
When I enter an incorrect RSS URL
What should I do to handle such conditions?
Just catch the exceptions and handle them.
There are some situations you simply cannot avoid.
You can't avoid network outages, you can't avoid incorrectly typed URLs.
What you can do, however, is check up front whether the network is reachable and whether the URL is well-formed.
You should catch the exceptions and provide meaningful error messages to the user.
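For example, a minimal sketch of fetching a feed defensively (assuming the newer com.rometools packages; older ROME releases use com.sun.syndication instead, and feedUrl is just a placeholder):
import com.rometools.rome.feed.synd.SyndFeed;
import com.rometools.rome.io.FeedException;
import com.rometools.rome.io.SyndFeedInput;
import com.rometools.rome.io.XmlReader;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

public class FeedFetcher {
    // Returns the feed, or null after printing a message the caller can show to the user.
    public static SyndFeed fetchFeed(String feedUrl) {
        try {
            return new SyndFeedInput().build(new XmlReader(new URL(feedUrl)));
        } catch (MalformedURLException e) {
            System.err.println("Invalid feed URL: " + feedUrl);             // mistyped URL
        } catch (IOException e) {
            System.err.println("Could not reach feed: " + e.getMessage());  // 403, firewall block, outage
        } catch (FeedException e) {
            System.err.println("Not a parseable RSS/Atom feed: " + e.getMessage());
        }
        return null;
    }
}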
About the 403:
Some feeds have protection against abuse (for example anti-DDoS rules), so based on the User-Agent header (in your case "Java") they refuse to serve the feed.
You therefore have to set your own user agent (for example a Firefox user agent) before opening the connection, like this:
// Set the user agent once, before the connection is opened
System.setProperty("http.agent", USER_AGENT);
URLConnection openConnection = url.openConnection();
is = openConnection.getInputStream();   // reuse the same connection instead of opening a second one
if ("gzip".equals(openConnection.getContentEncoding())) {
    is = new GZIPInputStream(is);       // the server compressed the feed, so decompress it
}
InputSource source = new InputSource(is);
input = new SyndFeedInput();
syndicationFeed = input.build(source);
// Alternatively, let ROME's XmlReader deal with the stream and encoding:
XmlReader reader = new XmlReader(url);
syndicationFeed = input.build(reader);
My current USER_AGENT String is
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0";
Related
The line InputStream is = new GZIPInputStream(con.getInputStream()); throws a "java.util.zip.ZipException: Not in GZIP format" exception. Does anyone know how to solve this?
My code:
private String getJsonFromRapidAPI(final String url) throws Exception {
    final String token = generateSessionToken();
    final HttpClient httpclient = new DefaultHttpClient();
    final HttpGet httpget = new HttpGet(url);
    String fileContents = null;
    StringBuilder sb = new StringBuilder();
    BufferedReader in = null;
    if (inetAddress == null) {
        inetAddress = InetAddress.getLocalHost();
    }
    final String serverIP = inetAddress.getHostAddress();
    URL urlToCall = new URL(url);
    HttpURLConnection con = (HttpURLConnection) urlToCall.openConnection();
    con.setRequestProperty("Accept", "application/json");
    con.setRequestProperty("Accept-Encoding", "gzip");
    con.setRequestProperty("Authorization", token);
    con.setRequestProperty("Customer-Ip", serverIP);
    con.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36");
    InputStream is = new GZIPInputStream(con.getInputStream());
    StringWriter writer = new StringWriter();
    IOUtils.copy(is, writer, "UTF-8");
    return writer.toString();
}
Two issues here:
You can tell a server you want GZIP encoding, but that doesn't mean it will necessarily comply.
As a consequence, your current code reasons: "Well, I asked for gzip, so I must gunzip the stream." That's incorrect: you do need to ask for gzip, as you do, but whether you wrap the response in a GZIPInputStream must depend on whether the server actually compressed it. To figure this out, call con.connect() to actually connect (getInputStream() implies it, but here you need some response info before opening the input stream, so connect explicitly), then check "gzip".equals(con.getContentEncoding()). If that is true, wrap con.getInputStream() in a GZIPInputStream; if it isn't, don't.
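A minimal sketch of that check, reusing con and IOUtils from the code above:
// Ask for gzip, but only decompress if the server actually used it
con.setRequestProperty("Accept-Encoding", "gzip");
con.connect();
InputStream body = con.getInputStream();
if ("gzip".equalsIgnoreCase(con.getContentEncoding())) {
    body = new GZIPInputStream(body);   // the server honoured the Accept-Encoding header
}
StringWriter writer = new StringWriter();
IOUtils.copy(body, writer, "UTF-8");
return writer.toString();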
More generally, you're abusing HttpURLConnection.
HttpURLConnection's job is to deal with the stream and transfer-level processing of the data itself. That means headers like Transfer-Encoding, Host, and Accept-Encoding aren't really 'your turf'; you shouldn't be messing with them. It is the HTTP client that should say 'hey, I can support gzip!' and that should wrap the response in a GZIPInputStream when the server answers 'great, so I gzipped it for you!'.
Unfortunately, HttpURLConnection is extremely basic and messes up this distinction itself; for example, it looks at the Transfer-Encoding header you set and will chunk sends (I think - oof, I haven't used this class in forever because it's so bad).
If your needs are even slightly complicated (and I'd say this qualifies), stop using it. There's OkHttp from Square, and Java itself (JDK 11 and up) considers HttpURLConnection obsolete and has replaced it with java.net.http.HttpClient. Apache also has HttpComponents, but as with many Apache Java libraries the API design feels dated; I wouldn't use that one, even if it is rather popular. It's still much better than HttpURLConnection, though!
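As an illustration, a rough sketch with the JDK 11+ java.net.http.HttpClient (the header values are only examples):
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

static String fetchJson(String url) throws IOException, InterruptedException {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(URI.create(url))
            .header("Accept", "application/json")
            .header("Accept-Encoding", "gzip")
            .GET()
            .build();
    HttpResponse<InputStream> response =
            client.send(request, HttpResponse.BodyHandlers.ofInputStream());
    InputStream body = response.body();
    // Decompress only when the server actually answered with gzip
    if ("gzip".equalsIgnoreCase(
            response.headers().firstValue("Content-Encoding").orElse(""))) {
        body = new GZIPInputStream(body);
    }
    return new String(body.readAllBytes(), StandardCharsets.UTF_8);
}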
I'm trying to retrieve a GitHub web page from Java code; for this I used the following code.
String startingUrl = "https://github.com/xxxxxx";
URL url = new URL(startingUrl);
HttpURLConnection uc = (HttpURLConnection) url.openConnection();
uc.connect();
String line = null;
StringBuffer tmp = new StringBuffer();
try {
    BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream(), "UTF-8"));
    while ((line = in.readLine()) != null) {
        tmp.append(line);
    }
} catch (FileNotFoundException e) {
}
However, the page I receive is different from what I see in the browser after logging in to GitHub. I tried sending an authorization header as follows, but it didn't work either.
uc.setRequestProperty("Authorization", "Basic encodexxx");
How can I retrieve the same page that I see when I'm logged in?
I can't tell you more on this, because I don't know what you are getting, but the most common issue for web crawlers is that website owners mostly don't like them. So you should behave like a regular user - your browser, for instance. Open your browser's developer tools (press F12) while visiting the website, look at what your browser sends in the request, and try to mimic it: for example, add Host, Referer, etc. to your headers. You will need to experiment with this.
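A rough sketch of what that can look like with the connection from your code (the header values here are just examples; copy whatever your own browser actually sends):
HttpURLConnection uc = (HttpURLConnection) new URL(startingUrl).openConnection();
// Headers copied from the browser's Network tab (Host is added automatically by the connection)
uc.setRequestProperty("User-Agent",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:41.0) Gecko/20100101 Firefox/41.0");
uc.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
uc.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
uc.setRequestProperty("Referer", "https://github.com/");
uc.connect();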
Also good to know: some website owners use advanced techniques to block you from accessing their site, some won't stop you crawling their website, and some will let you do whatever you want. The fairest option is to check www.somedomain.com/robots.txt, which lists the endpoints that may be crawled and those that are disallowed.
I'm going to keep this question short and sweet. I have a function that takes a URL to read as a string and returns a string of the HTML source of a webpage. Here it is:
public static String getHTML(String urlToRead) throws Exception // Returns the source code of a given URL.
{
    StringBuilder result = new StringBuilder();
    URL url = new URL(urlToRead);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36");
    BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream(), Charset.forName("UTF-8")));
    String line;
    while ((line = rd.readLine()) != null)
    {
        result.append(line + System.getProperty("line.separator"));
    }
    rd.close();
    return result.toString();
}
It works like a charm, with the exception of one tiny quirk. Certain characters are not being read correctly by the InputStreamReader. The "ł" character isn't correctly read, and is instead replaced by a "?". That's the only character I've found thus far that follows this behaviour but there's no telling what other characters aren't being read correctly.
It seems like an issue with the character set. I'm using UTF-8 as you can see from the code. All other character sets I've tried using in its place have either outright not worked or have had trouble with far more than just one character.
What kind of thing could be causing this issue? Any help would be greatly appreciated!
Have you tried:
conn.setRequestProperty("content-type", "text/plain; charset=utf-8");
You should use the same charset as the resource you read. First find out which encoding that HTML actually uses: usually the content type is sent in a response header, and you can easily get that information with any web browser that has network tracking (since it's a GET request).
For example, using Chrome: open an empty tab, open the dev tools (F12), and load the desired web page. Then look at the Network tab in the dev tools and examine the response headers.
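The same lookup can be done in code, by taking the charset from the Content-Type response header instead of hard-coding UTF-8. A sketch, reusing conn from the question and falling back to UTF-8 when no charset is declared:
// e.g. Content-Type: text/html; charset=ISO-8859-2
String contentType = conn.getContentType();
Charset charset = StandardCharsets.UTF_8;   // fallback when the header declares no charset
if (contentType != null) {
    for (String param : contentType.split(";")) {
        param = param.trim();
        if (param.toLowerCase().startsWith("charset=")) {
            charset = Charset.forName(param.substring("charset=".length()));
        }
    }
}
BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream(), charset));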
I use simple code to get the HTML of http://www.ip-adress.com, but it fails with HTTP code 403.
When I try other websites, such as google.com, in the program, it works, and I can also open www.ip-adress.com in a browser. Why can't I access it from my Java program?
public class urlconnection
{
    public static void main(String[] args)
    {
        StringBuffer document = new StringBuffer();
        try
        {
            URL url = new URL("http://www.ip-adress.com");
            URLConnection conn = url.openConnection();
            BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line = null;
            while ((line = reader.readLine()) != null)
                document.append(line + " ");
            reader.close();
        }
        catch (MalformedURLException e)
        {
            e.printStackTrace();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        System.out.println(document.toString());
    }
}
java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.ip-adress.com/
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at urlconnection.main(urlconnection.java:14)
This is the line you need (set it before opening the input stream):
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
The web server can detect that your request is not coming from a regular browser and rejects it. There are ways to make the request look like it comes from a browser and trick the server.
I suppose the site checks the User-Agent header and blocks anything that looks like a robot, so you need to mimic a normal browser. Check the solution in "Setting user agent of a java URLConnection", or use Apache Commons HttpClient and set the user agent there.
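For the HttpClient route, a sketch with Apache HttpClient 4.x (assuming the httpclient jar is on the classpath; the user-agent string is just an example):
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

CloseableHttpClient client = HttpClients.custom()
        .setUserAgent("Mozilla/5.0 (Windows NT 10.0; rv:41.0) Gecko/20100101 Firefox/41.0")
        .build();
try (CloseableHttpResponse response = client.execute(new HttpGet("http://www.ip-adress.com"))) {
    // The User-Agent set on the builder is sent with every request from this client
    System.out.println(EntityUtils.toString(response.getEntity()));
}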
I don't believe that this is fundamentally a Java problem. You're doing the right thing to make an HTTP connection, and the server is doing "the right thing" from its perspective by responding to your request with a 403 response.
Let's be clear about this - the response you're getting is due to whatever logic is being employed by the target webserver.
So if you were to ask "how can I modify my request so that http://www.ip-adress.com returns a 200 response?", people may be able to come up with workarounds that keep that server happy. But this is a host-specific process; your Java code is arguably correct, though it should have better error handling, because you can always get non-2xx responses.
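For the error-handling part, a small sketch (casting to HttpURLConnection so the status code is available; the variable names follow the question's code):
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
int status = conn.getResponseCode();
// 4xx/5xx responses carry their body on the error stream (which may be null), not the input stream
InputStream stream = (status >= 200 && status < 300)
        ? conn.getInputStream()
        : conn.getErrorStream();
if (status != 200) {
    System.err.println("Server answered " + status + " for " + url);
}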
Try changing the connection's User-Agent to something browser-like; most of the time I use Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1.
I am trying to make a request to a webpage that requires cookies. I'm using HttpURLConnection, but the response always comes back saying
<div class="body"><p>Your browser's cookie functionality is turned off. Please turn it on.
How can I make the request such that the queried server thinks I have cookies turned on? My code goes something like this.
private String readPage(String page) throws MalformedURLException {
    try {
        URL url = new URL(page);
        HttpURLConnection uc = (HttpURLConnection) url.openConnection();
        uc.connect();
        InputStream in = uc.getInputStream();
        int v;
        while ((v = in.read()) != -1) {
            sb.append((char) v);
        }
        in.close();
        uc.disconnect();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return sb.toString();
}
You need to add a CookieHandler to the system for it to handle cookies. Before Java 6 there was no CookieHandler implementation in the JRE, so you had to write your own. If you are on Java 6 or later, you can do this:
CookieHandler.setDefault(new CookieManager());
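A slightly fuller sketch of wiring that up (CookieManager and CookiePolicy live in java.net; ACCEPT_ALL is just one possible policy):
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;

// Install a JVM-wide cookie store once, e.g. at application start-up.
// After this, HttpURLConnection stores and resends cookies automatically,
// so readPage() from the question works unchanged.
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));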
URLConnection's cookie handling is really weak. It barely works and doesn't handle all the cookie rules correctly. You should use Apache HttpClient if you are dealing with sensitive cookies such as authentication.
I think the server can't determine on the first request that a client does not support cookies, so it probably sends a redirect. Try disabling redirects:
uc.setInstanceFollowRedirects(false);
Then you will be able to read the cookies from the response and use them (if you need to) on the next request:
Map<String, List<String>> headers = uc.getHeaderFields();
List<String> setCookies = headers.get("Set-Cookie");   // the cookies the server tried to set
URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.0; pl; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2");
conn.addRequestProperty("Referer", "http://xxxx");
conn.addRequestProperty("Cookie", "...");
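If you do it by hand like that, here is a rough sketch of how the value for that Cookie header can be built from the Set-Cookie values of the first response (setCookies is the list read via getHeaderFields() above):
StringBuilder cookieHeader = new StringBuilder();
if (setCookies != null) {
    for (String c : setCookies) {
        if (cookieHeader.length() > 0) {
            cookieHeader.append("; ");
        }
        cookieHeader.append(c.split(";", 2)[0]);   // keep only the name=value part, drop Path/Expires/etc.
    }
}
conn.addRequestProperty("Cookie", cookieHeader.toString());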
If you're trying to scrape large volumes of data after a login, you may be better off with a scripted web scraper like WebHarvest (http://web-harvest.sourceforge.net/). I've used it to great success in some of my own projects.