JSoup Connection.userAgent defeated by sun.net.www.protocol.http.HttpURLConnection - java

Apparently, sun.net.www.protocol.http.HttpURLConnection always appends "Java/version" to the User-Agent header. Therefore, JSoup's Connection.userAgent cannot set the user agent to exactly what you want; the "Java/version" suffix gets appended anyway.
See Set user-agent property in https connection header
Some websites reject requests that contain "Java" anywhere in the user agent, giving various 4xx and 5xx HTTP errors.
The StackOverflow post referenced above suggests using Apache instead of Sun's HTTP connection class, but this is not an option if I want to use JSoup.
I wonder what the JSoup team thinks of this. Is my description correct? Is it a bug or a feature? Are there any plans to fix it, i.e. to make it possible to set the userAgent to what one wants, without additional appendages?
thanks
JWG

You could use Jsoup.parse(html) where the html String could be fetched using Apache HTTP or any other library of your choice.
Regards,
Allahbaksh
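A minimal sketch of that idea, assuming Java 11+ and using the JDK's own java.net.http.HttpClient as the "other library" (Apache HttpClient would work along the same lines). The class name HtmlFetcher, the URL and the user-agent string are made up for illustration. The request carries the User-Agent value given to the builder, and the returned String is what you would hand to Jsoup.parse.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class HtmlFetcher {
    // Builds a GET request that carries the User-Agent value we choose.
    static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    // Sends the request and returns the response body as a String.
    static String fetch(String url, String userAgent) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url, userAgent), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Hypothetical URL and user agent, only to show the header that would be sent.
        HttpRequest request = buildRequest("https://example.com/", "Mozilla/5.0 (compatible; MyCrawler/1.0)");
        System.out.println(request.headers().firstValue("User-Agent").orElse("(none)"));
        // With jsoup on the classpath, the fetched page could then be parsed like this:
        // Document doc = Jsoup.parse(fetch(url, userAgent), url);
    }
}
```

The main method above does not touch the network; it only prints the header of the request that fetch would send.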

Related

How to check if a webpage is static or dynamic

I'm doing some web scraping and using Jsoup to parse html files and my understanding is that Jsoup doesn't work well with dynamic web pages. Is there a way to check if a web page is dynamic so that I don't bother attempting to parse it using Jsoup?
Short answer: Not really. You need to check case by case
Explanation:
Today's websites are full of ajax calls. Many load important data; others are only marginally interesting when you scrape a site's content. Many very modern sites even do both: they send a completely rendered page to the client, where it gets transformed into a web app (keyword: isomorphic rendering).
So you need to check the site in question case by case. It is not that hard, though: just fire up curl and see if you get the content you need. If not, it is often not that hard to understand the structure and parameters of the ajax calls either. If you do this, you can often get even dynamic content with Jsoup alone.
You cannot be 100% sure whether a website is dynamic or static, because there are ways to hide the clues that show a website is dynamic. But you can check a limited number of HTTP headers to test whether it is dynamic or static:
Cookie : An HTTP cookie previously sent by the server with Set-Cookie
X-Csrf-Token : Used to prevent cross-site request forgery. Alternative header names are: X-CSRFToken and X-XSRF-TOKEN
X-Powered-By : specifies the technology (e.g. ASP.NET, PHP, JBoss) supporting the web application (version details are often in X-Runtime, X-Version, or X-AspNet-Version)
These are three HTTP headers whose presence implies that server-side scripting is involved (as far as I know).
Also chances are that a webpage with form related elements should have a server side mechanism to process form data.
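The header check can be sketched roughly as below. This is only a heuristic: the class name DynamicHints is made up, and looking for Set-Cookie (the response-side counterpart of Cookie) is my assumption. The input is a header map in the shape returned by URLConnection.getHeaderFields().

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class DynamicHints {
    // Lower-cased header names treated as hints of server-side processing.
    // "set-cookie" is an assumption: it is what a response carries when the server issues a cookie.
    private static final Set<String> HINT_HEADERS = Set.of(
            "set-cookie", "x-csrf-token", "x-csrftoken", "x-xsrf-token",
            "x-powered-by", "x-runtime", "x-version", "x-aspnet-version");

    // Returns the hint headers found in the given header map, sorted by name.
    static Set<String> findHints(Map<String, List<String>> headers) {
        Set<String> found = new TreeSet<>();
        for (String name : headers.keySet()) {
            if (name == null) {
                continue; // HttpURLConnection stores the status line under a null key
            }
            String lower = name.toLowerCase(Locale.ROOT);
            if (HINT_HEADERS.contains(lower)) {
                found.add(lower);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, List<String>> headers = Map.of(
                "Content-Type", List.of("text/html"),
                "X-Powered-By", List.of("PHP/8.2"));
        System.out.println(findHints(headers)); // prints [x-powered-by]
    }
}
```

An empty result does not prove the page is static; it only means none of these particular clues were present.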

Set Request Header and Forward to another application

I am writing a Java based Web application, which, in the actual production environment would be front-ended by another application which would set certain HTTP request headers before the request hits my application.
However, in the development environment I do not have the front-ending application, for which I need to create a mock web application that simulates the same behavior. i.e. this mock application should set the request headers and redirect or forward or whatever that I do not know :) to a certain page in my application.
How can I accomplish this?
The following articles may help you:
Adding Header Information to an existing HTTP Request
How to modify request headers in a J2EE web application.
P.S.
I am sorry I provided only links, that was one of my early answer on SO ))
In case you don't want to modify your code as suggested by @user1979427, you can use a proxy server to modify or add headers on the fly.
For example in Apache HTTPD you would add something like below and proxy the
Header add HEADER "HEADERVALUE"
RequestHeader set HEADER "HEADERVALUE"
Refer to HTTPD doc
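If installing HTTPD is too much for a development machine, the same proxy idea can be sketched with classes that ship with the JDK (Java 11+). This is a rough stand-in for the front-end, not a full proxy: it only forwards GET requests, does not copy response headers, and the class name HeaderInjectingProxy plus all ports and header names used with it are made up for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class HeaderInjectingProxy {
    // Starts a proxy on the given port (0 = any free port) that forwards GET requests
    // to backendBase and adds one extra request header on the way.
    static HttpServer start(int port, String backendBase, String headerName, String headerValue)
            throws IOException {
        HttpClient client = HttpClient.newHttpClient();
        HttpServer proxy = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        proxy.createContext("/", exchange -> {
            try {
                // Same path and query as the incoming request, plus the injected header.
                HttpRequest forwarded = HttpRequest
                        .newBuilder(URI.create(backendBase + exchange.getRequestURI()))
                        .header(headerName, headerValue)
                        .GET()
                        .build();
                HttpResponse<byte[]> response =
                        client.send(forwarded, HttpResponse.BodyHandlers.ofByteArray());
                byte[] body = response.body();
                // A length of -1 tells HttpServer that there is no response body.
                exchange.sendResponseHeaders(response.statusCode(), body.length == 0 ? -1 : body.length);
                if (body.length > 0) {
                    exchange.getResponseBody().write(body);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                exchange.sendResponseHeaders(502, -1);
            } finally {
                exchange.close();
            }
        });
        proxy.start();
        return proxy;
    }
}
```

For example, HeaderInjectingProxy.start(9090, "http://localhost:8080", "X-Remote-User", "devuser") would let you browse your application through port 9090 with that header present on every request, assuming the application itself listens on port 8080.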
You should create an AddReqHeaderForFrowardWrapper request wrapper, passing it the header name and header values, and override the request-header-related methods to return your custom header.
You can use Tracer to implement this.
There are frameworks available to support this implementation.
Spring has Sleuth, Zipkin, OpenTracing available.
I find OpenTracing to be easy to use without worrying about dependency conflicts.
Read more about it here: https://opentracing.io/guides/java/
Instead of writing a mock application, I used a browser add-on that allowed me to add custom headers!
To set a header in Java, you can use the HttpServletResponse:
response.setHeader(attributeName, attributeValue);
And to redirect to another page, you can use:
response.sendRedirect(URL);

Should I use URL rewriting to protect against XSS

Let's say someone enters the following URL in their browser:
http://www.mywebsite.com/<script>alert(1)</script>
The page is displayed as normal with the alert popup as well. I think this should result in a 404, but how do I best achieve that?
My webapp is running on a Tomcat 7 server. Modern browsers will automatically protect against this, but older ones (I am looking at you, IE6) won't.
It sounds like you are actually getting a 404 page, but that page includes the resource (in this case a piece of JavaScript code) and doesn't do any converting of < and > to their respective HTML entities. I've seen this happen on several websites.
The solution would be to create a custom 404 page which doesn't echo back the resource to the page, or that does proper HTML entity conversion beforehand. There are plenty of tutorials you can find through Google which should help you do this.
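A minimal sketch of the entity conversion mentioned above, using only the JDK. The class and method names are made up; in a real webapp you would more likely rely on an existing encoder (for example JSTL's c:out or a library such as the OWASP Java Encoder) instead of hand-rolling one.

```java
class HtmlEscaper {
    // Replaces the characters that are significant in HTML with entity references,
    // so that echoed input is displayed as text instead of being interpreted as markup.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("/<script>alert(1)</script>"));
        // prints /&lt;script&gt;alert(1)&lt;/script&gt;
    }
}
```

The 404 page would pass the requested path through escapeHtml before writing it into the response.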
Here's what I did:
Created a high level servlet filter which uses OWASP's HTML sanitizer to check for dodgy characters. If there are any, I redirect to our 404 page.
You should put a filter in your webapp to protect against an XSS attack.
Get all the parameters from the HttpServletRequest object and replace any parameter with value starting with with spaces in filter code.
This way any harmful JS script won't reach your server side components.

jsonp referer url determine

I have a jQuery plugin and I'm using JSONP for a cross-domain call to a JSP file.
I want to restrict the JSP's return values to specific websites in our database.
To achieve this I need to somehow get the IP or URL of the website that triggered the JSONP call, not the client/user IP. I've tried the Referer value in the HTTP header, but this does not work with IE and I guess it is not the best solution either.
How can I securely know who is calling my JSP file with my plugin, from their website?
Thanks in advance.
The simplest answer would be to issue each website a unique key or other identifier that they include in their request. You parse this identifier and flex your response appropriately.
However with a request originating from the client browser, you would have to be careful and would have to evaluate what you mean by how "securely" you need the request to be handled. (since the untrusted client would be making the request it would be a simple task to harvest and reuse such an identifier)...
Referrer (if present) could be used as a double check, but as you pointed out, this is unreliable and coming from an untrusted client computer, this portion of the request could be faked as well.
If we could assume some server side processing by the website owners, you could have them implement a proxy for the jsonp call (which would ensure such a token would never fall into the hands of the browser)... but we'd have to know if such a safeguard would really be worth it or not :)
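The key-plus-Referer check described above could look roughly like this on the server side. It is a sketch under assumptions: the keys, host names and the class name JsonpAccessCheck are invented, and a real implementation would load the registry from your database. As noted, both values come from an untrusted client, so this filters casual misuse rather than providing real security.

```java
import java.net.URI;
import java.util.Map;

class JsonpAccessCheck {
    // Hypothetical registry: issued key -> host name of the website it was issued to.
    private static final Map<String, String> SITE_KEYS = Map.of(
            "key-abc123", "partner.example.com",
            "key-def456", "shop.example.org");

    // Accepts a request when the key is known and, if a Referer value is present,
    // its host matches the host registered for that key.
    static boolean isAllowed(String siteKey, String referer) {
        String registeredHost = siteKey == null ? null : SITE_KEYS.get(siteKey);
        if (registeredHost == null) {
            return false; // unknown or missing key
        }
        if (referer == null || referer.isEmpty()) {
            return true; // Referer is optional and unreliable, so its absence is tolerated
        }
        try {
            String host = URI.create(referer).getHost();
            return registeredHost.equalsIgnoreCase(host);
        } catch (IllegalArgumentException e) {
            return false; // malformed Referer value
        }
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("key-abc123", "http://partner.example.com/page.html")); // prints true
        System.out.println(isAllowed("key-abc123", "http://evil.example.net/"));             // prints false
    }
}
```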

How to modify the header of a HttpUrlConnection

I'm trying to improve the Java Html Document a little, but I'm running into problems with HttpURLConnection. One problem is that some servers block a request if the user agent is a Java VM. Another is that HttpURLConnection does not set the Referrer or Location header field. Since several sites use these fields to verify that the content was accessed from their own site, I'm blocked here as well. As far as I can see, the only resolution is to replace the URL handler of the HTTP protocol. Or is there any way to modify the default HTTP handler?
Open the URL with URL.openConnection. Optionally cast to HttpURLConnection. Call URLConnection.setRequestProperty/addRequestProperty.
The default User-Agent header value is set from the "http.agent" system property. The PlugIn and WebStart allow you to set this property.
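The steps above can be sketched as follows. The class name, URL and header values are placeholders. Note that the HTTP header is spelled "Referer", and that openConnection only creates the connection object, so nothing is sent until you connect or read from it.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class RequestHeaderDemo {
    // Opens (but does not yet connect) an HTTP connection with custom
    // User-Agent and Referer request headers.
    static HttpURLConnection prepare(String url, String userAgent, String referer) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) URI.create(url).toURL().openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        connection.setRequestProperty("Referer", referer);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL and header values.
        HttpURLConnection connection = prepare(
                "http://example.com/page.html",
                "Mozilla/5.0 (compatible; MyCrawler/1.0)",
                "http://example.com/index.html");
        // Reads back the value that will be sent; no network traffic has happened yet.
        System.out.println(connection.getRequestProperty("User-Agent"));
        // Calling connection.getInputStream() would perform the actual request.
    }
}
```

As far as I know, the "Java/version" suffix is only added to a default agent taken from the http.agent system property, not to a User-Agent set per connection like this, but it is worth verifying against your JDK by inspecting the request on the wire.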
If you use Apache HttpClient to manage your programmatic HTTP connectivity you get an extremely useful API which makes creating connections (and optional automatic re-connecting on fail), setting Headers, posts vs gets, handy methods for retrieving the returned content and much much more.
I solved my problem. We can just send the header to application/json and pass the body as a json object. That simply solves the issue.
