I want to get an HTML page from a meta refresh redirect, very much like in the question "Can jsoup handle meta refresh redirect?".
But I can't get it to work. I want to do a search on http://synchronkartei.de.
I have the following code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SynchronkarteiScraper {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.synchronkartei.de/search.php")
                .data("cat", "2")
                .data("search", "Thomas Danneberg")
                .data("action", "search")
                .followRedirects(true)
                .get();
        Elements meta = doc.select("html head meta");
        for (final Element m : meta) {
            if (m.attr("http-equiv").contains("refresh")) {
                doc = Jsoup.connect(m.baseUri() + m.attr("content").split("=")[1]).get();
            }
        }
        System.out.println(doc.body().toString());
    }
}
This performs the search, which leads to a temporary page that, via the refresh, opens the real result page.
It is the same as going to http://synchronkartei.de, selecting "Sprecher" from the drop-down box, entering "Thomas Danneberg" into the text field, and hitting enter.
But even after extracting the refresh URL and doing a second connect, I still get the content of the temporary landing page, as can be seen in the println of the body.
So what is going wrong here?
As a note, the site synchronkartei.de always redirects to HTTPS, and since it uses a certificate from StartCom, Java complains about the certificate path. To make the above code snippet work, it is necessary to use the VM parameter -Djavax.net.ssl.trustStore=<path-to-keystore> with the correct certificate.
I have to admit that I am no expert in Jsoup, but I do know some details about the Synchronkartei.
Deutsche Synchronkartei supports OpenSearchDescriptions, which is linked at /search.xml. That means you can also use https://www.synchronkartei.de/search.php?search={searchTerms} to get your search term into the session.
All you need is the cookie "sid" with the session ID that the Synchronkartei provides you. After that, a direct request to https://www.synchronkartei.de/index.php?action=search will give you the results, regardless of your referrer.
What I mean is: first send a request to https://www.synchronkartei.de/search.php?search={searchTerms} or https://www.synchronkartei.de/search.php?cat={Category}&search={searchTerms}&action=search (as you did above) and ignore the response body completely (as long as it comes back with HTTP status 200), but save the session cookie. After that, place a request to https://www.synchronkartei.de/index.php?action=search, which should then give you the whole list of results.
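A minimal sketch of that two-step flow with Jsoup could look like the following; the URLs, the parameters, and the "sid" cookie name are taken from the description above, so the details may need adjusting:

import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SynchronkarteiSession {
    public static void main(String[] args) throws Exception {
        // Step 1: fire the search request purely to obtain the session cookie.
        Connection.Response searchResponse = Jsoup.connect("https://www.synchronkartei.de/search.php")
                .data("cat", "2")
                .data("search", "Thomas Danneberg")
                .data("action", "search")
                .method(Connection.Method.GET)
                .execute();

        // Ignore the body; keep only the cookies (they should include "sid").
        Map<String, String> cookies = searchResponse.cookies();

        // Step 2: request the result list directly, sending the session cookie back.
        Document results = Jsoup.connect("https://www.synchronkartei.de/index.php?action=search")
                .cookies(cookies)
                .get();
        System.out.println(results.body().text());
    }
}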
Funzi
Related
I tried to parse web pages ending with the .tv and .mobi extensions, but every time I try I end up with the same error. Jsoup can easily parse websites ending in .com, .org, .in etc., but not .tv or .mobi.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class sample {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://www.xmovies8.tv").get();
        String title = doc.title();
        System.out.println(title);
    }
}
Stack trace:
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://www.xmovies8.tv
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:598)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:548)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:235)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:224)
at eric.sample.main(sample.java:30)
/home/azeem/.cache/netbeans/8.1/executor-snippets/run.xml:53: Java returned: 1
BUILD FAILED (total time: 3 seconds)
It also failed to parse:
http://www.xmovies8.tv
www.fztvseries.mobi
Is there any solution in Jsoup so that I can connect to websites ending in .mobi, .tv, .xyz, etc.?
Your problem has nothing to do with the TLD of the domain you're attempting to scrape; in fact, it has nothing to do with the name at all, or even with Jsoup.
If you read your stack trace, you will see you're getting a response code of HTTP 403 Forbidden, which according to the HTTP specification means your request was seen by the web server and deliberately refused.
Now, this could be for a number of reasons that all depend on the website you're trying to scrape.
It could be that the website sees you are trying to scrape it, and has explicitly gone out of its way to prevent being scraped.
It could also be that the page requires a permission you don't have, or that you need to be logged in.
I also noticed that this particular domain uses CloudFlare, so it could be that CloudFlare is intercepting your request before it even reaches the website itself.
I would make sure it's not against the website's policy to scrape it, and if it isn't, try changing the User-Agent header of your scraper from the default Java one to a normal browser user agent, and see if that works.
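For example, a minimal sketch of that last suggestion (the user-agent string here is just an example of a normal browser identifier, not a required value):

Document doc = Jsoup.connect("http://www.xmovies8.tv")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36")
        .get();
System.out.println(doc.title());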
I have been reading around StackOverflow and I can't find a fix for my code. I'm pretty sure I'm just making some silly mistake, as this is my first time using Jsoup.
My main goal is logging into the website: https://ps.seattleschools.org/public/
Right now I have this code:
public static final String USERNAME = "---Place Holder---";
public static final String PASSWORD = "---Place Holder---";
public static final String URL = "https://ps.seattleschools.org/public/";
public static final String POST_URL = "https://ps.seattleschools.org/guardian/home.html";

public static void main(String[] args) throws IOException {
    Connection.Response loginForm = Jsoup.connect(URL)
            .data("Account", USERNAME)
            .data("pw", PASSWORD)
            .method(Connection.Method.POST)
            .execute();
    Document document = Jsoup.connect(POST_URL)
            .cookies(loginForm.cookies())
            .get();
    System.out.println(document);
}
If you look at the website, the form is really big, but most of its items are hidden. Am I supposed to fill them out anyway? Anyway, thanks for helping.
Generally, yes, you need to include all inputs that will be sent along with your POST request.
For some reason your link is not working for me (it takes too long to load), so I'll lay out a set of general practices to follow when logging in with Jsoup:
Sometimes it's essential to first load the login page and save its cookies to pass on to the actual login request (some websites send you a security token when you load the login page, and this token needs to be sent along with the login request).
Hidden inputs are there for a reason: you need to include them! One way to find out which inputs you need to send is by examining the POST request with the browser's developer tools (see the sketch below).
I'd suggest you read a blog post I wrote about logging into a website using Jsoup. As long as the website does not have any fancy JavaScript to prevent you, I'm certain you will be able to log in to any website by following the procedures I've written.
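As a rough sketch of those practices, reusing the constants and field names from the question and assuming the credentials belong to the first form on the page (the selector will likely need adjusting to the real markup; this also uses java.util.HashMap/Map and org.jsoup.nodes.Element on top of the imports already in the question):

// Step 1: load the login page to pick up cookies and hidden inputs.
Connection.Response loginPage = Jsoup.connect(URL)
        .method(Connection.Method.GET)
        .execute();

// Step 2: collect every named input of the form, hidden ones included.
Map<String, String> formData = new HashMap<>();
for (Element input : loginPage.parse().select("form input[name]")) {
    formData.put(input.attr("name"), input.val());
}
formData.put("Account", USERNAME);
formData.put("pw", PASSWORD);

// Step 3: POST the credentials together with the hidden fields and cookies.
Connection.Response login = Jsoup.connect(URL)
        .cookies(loginPage.cookies())
        .data(formData)
        .method(Connection.Method.POST)
        .execute();

// Step 4: reuse the authenticated session cookies for subsequent pages.
Document home = Jsoup.connect(POST_URL)
        .cookies(login.cookies())
        .get();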
I am looking for a clean/simple way in HtmlUnit to request a webpage from a server in a specific language.
To do this I have been trying to request "bankofamerica.com" for their homepage in Spanish instead of English.
This is what i have done so far:
I tried to set the "Accept-Language" header to "es" in the HTTP request. I did this using:
myWebClient.addRequestHeader("Accept-Language" , "es");
It did not work. I then created a web request with the following code:
URL myUrl = new URL("https://www.bankofamerica.com/");
WebRequest myRequest = new WebRequest(myUrl);
myRequest.setAdditionalHeader("Accept-Language", "es");
HtmlPage aPage = myWebClient.getPage(myRequest);
Since this failed too, I printed out the request object for this URL to check whether these headers were being set:
[<url="https://www.bankofamerica.com/", GET, EncodingType[name=application/x-www-form-urlencoded], [], {Accept-Language=es, Accept-Encoding=gzip, deflate, Accept=*/*}, null>]
So the server is being asked for a Spanish page, but in response it sends the homepage in English (the response header has Content-Language set to en-US).
I did find a hack to retrieve the BofA page in Spanish. I visited the page and used the Chrome developer tools to get the cookie value from the request header. I used this value to do the following:
myRequest.setAdditionalHeader("Cookie", "TLTSID= ........._LOCALE_COOKIE=es-US; CONTEXT=es_US; INTL_LANG=es_US; LANG_COOKIE=es_US; hp_pf_anon=anon=((ct=+||st=+||fn=+||zc=+||lang=es_US));..........1870903; throttle_value=43");
I am guessing the answer lies somewhere here.
Here lies my next question: if I am writing a script to retrieve 100 different websites in Spanish (i.e., assuming they all have Spanish versions of their pages), is there a clean way in HtmlUnit to accomplish this?
(If cookies are indeed the solution, then to create them in HtmlUnit you need to specify the domain name, so one would have to create cookies for each of the 100 sites. As far as I know, there is no way in HtmlUnit to do something like:
Cookie langCookie = new Cookie("All Domains", "LANG_COOKIE", "es_US");
myWebClient.getCookieManager().addCookie(langCookie);)
NOTE: I am using HtmlUnit 2.12 and setting BrowserVersion.CHROME in the WebClient.
Thanks.
Regarding your first concern: the clean/simple (/only?) way of requesting a webpage in a particular language is, as you said, to set the HTTP Accept-Language request header to the locale(s) you want. That is it.
Now, the fact that you request a page in a particular language doesn't mean that you will actually get a page in that language. The server has to be set up to process that HTTP header and respond accordingly. Even if a site has a whole section in Spanish, that doesn't mean the site responds to the header.
A clear example of this is the page you provided. I performed a quick test on it and found that it is clearly not responding to the Accept-Language header I set (which was es): hitting the home page with es still returned English results. However, the page has a link labeled En Español ("In Spanish"); through it the page does switch to Spanish, and you get redirected to https://www.bankofamerica.com?request_locale=es_US.
So you might be tempted to think that the page handles the locale via a request parameter. However, that is not (only) the case, because if you then open the home page again (without the locale parameter), you will see the Spanish version again. That is clear proof that the choice is stored somewhere else, most likely in the session, which will most likely be backed by cookies.
That can easily be confirmed by opening a private session or clearing the cookies and observing the behaviour again (I've just done that).
I think that explains the mystery of the webpage existing in Spanish but being fetched in English. (Note how most bank webpages do not conform to basic standards such as responding to simple HTTP requests... and they are handling our money!)
Regarding your second question, it is a bit like asking What is the recipe for never getting ill? It just doesn't depend only on you. Also note that your first concern used the word request while your second used the word retrieve. I think it should be clear by now that you can only be 100% sure of what you request, but not of what you retrieve.
Regarding setting a cookie value manually: that is technically possible. However, it is just like adding another parameter to a GET request, e.g. http://domain.com?login=yes: the parameter will only be processed by the server if it is expecting it; otherwise, it will be ignored. That is what will happen to the value in your cookie.
Summary: there are standards to follow. You can try to use them, but if the party on the other side doesn't, you won't get the results you expect. Your best choice: do your best and follow the standards.
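To make the first part concrete, here is a sketch of both approaches in HtmlUnit: the standards-based Accept-Language header, plus the per-site cookie fallback. The cookie name LANG_COOKIE and value es_US come from the question and are specific to Bank of America; other sites will most likely just ignore them.

WebClient webClient = new WebClient(BrowserVersion.CHROME);
// The standards-compliant way: every request advertises Spanish.
webClient.addRequestHeader("Accept-Language", "es");

// The fallback: one cookie per domain, created in a loop, since HtmlUnit
// (com.gargoylesoftware.htmlunit.util.Cookie) has no "all domains" cookie.
String[] domains = { "www.bankofamerica.com" /* ...the other 99 sites */ };
for (String domain : domains) {
    webClient.getCookieManager().addCookie(new Cookie(domain, "LANG_COOKIE", "es_US"));
}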
I'm building a Django App with allauth.
I have a page, with authentication required, where I put a Java applet. This applet does GET requests to other pages (of the same Django project) which return JSON objects.
The applet gets the CSRF token from the parent web page, using JSObject.
The problem is that I want to put ALL the pages behind authentication, but I cannot get the sessionid cookie from the applet's parent web page, so the applet cannot do GETs (nor POSTs) to obtain (or save) data.
Maybe there is a simple way to achieve this, but I'm a newbie and I haven't found anything.
Ask freely if you need something.
Thank you.
EDIT:
As I wrote below, I found out that the sessionid cookie is marked HttpOnly, so the problem now is finding the safest way to allow the applet to do POST and GET requests.
For example, is it possible to create a JS method in the page which GETs the data and passes it down to the applet? Maybe I can do the POST the same way?
EDIT:
I successfully got the data using a jQuery call from the page. The problem now is that the code throws an InvocationTargetException. I found the position of the problem, but I don't know how to solve it.
Here is the jQuery code:
function getFloor() {
    $.get(
        "{% url ... %}",
        function(data) {
            var output = JSON.stringify(data);
            document.mapGenerator.setFloor(output);
        }
    );
}
And here are the two functions of the applet. The ** part marks the origin of the problem.
public void setFloor(String input) {
    Floor[] f = Floor.parse(input);
}

public static Floor[] parse(String input) {
    **Gson gson = new Gson();**
    Floor[] floors = gson.fromJson(input, Floor[].class);
    return floors;
}
And HERE is the log line from my server, where you can see that the applet tries to load the Gson library from the server (instead of from the applet):
"GET /buildings/generate/com/google/gson/Gson.class HTTP/1.1" 404 4126
Can somebody help me?
You can do something like this in your applet:
String cookies = JSObject.getWindow(this).eval("document.cookie").toString();
This will give you all the cookies for that page delimited by semicolons.
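If you only need a single value out of that string, a small sketch (sessionid is Django's default session cookie name; note that, as discovered in the question, a cookie flagged HttpOnly will not show up in document.cookie at all):

String cookies = JSObject.getWindow(this).eval("document.cookie").toString();
String sessionId = null;
for (String pair : cookies.split(";\\s*")) {
    if (pair.startsWith("sessionid=")) {
        sessionId = pair.substring("sessionid=".length());
    }
}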
I have a servlet named EditPhotos which, believe it or not, is used for editing the photos associated with a certain item on a website I am developing. The URL path to edit a photo is [[SITEROOT]]/EditPhotos/[[ITEMNAME]].
When you go to this path (GET), the page loads fine. You can then click a 'delete' link that POSTs to the same page, telling it to delete the photo. The servlet receives this delete command and successfully deletes the photo. It then sends a redirect back to the first page (GET).
For some reason, this redirect fails. I don't know how or why, but using the HttpFox plugin for Firefox, I can see that the POST request receives 0 bytes in response and has the code NS_BINDING_ABORTED.
The code I am using to send the redirect is the same code I have used throughout the website:
response.sendRedirect(Constants.SITE_ROOT + "EditPhotos/" + itemURL);
I have checked the final URL that the redirect sends, and it is definitely correct, but the browser never receives the redirect. Why?
Read the server logs. Do you see an IllegalStateException: response already committed pointing to the sendRedirect() call in the trace?
If so, then the redirect failed because the response headers have already been sent. Ensure that you aren't touching the HttpServletResponse at all before calling sendRedirect(). A redirect basically consists of a Location response header with the new URL as its value.
If not, then you're probably firing the request using JavaScript, which in turn failed to handle the new location.
If neither is the case, or you still cannot figure it out, then we'd be interested in the smallest possible copy'n'pasteable code snippet that reproduces exactly this problem. Update your question to include it.
Update: as per the comments, the culprit is indeed in the JavaScript. A redirect on an XMLHttpRequest POST isn't going to work. Are you using homegrown XMLHttpRequest functions or a library around it such as jQuery? If jQuery, please read this question carefully. It boils down to this: you need to return a specific response and then let JS/jQuery set the new window.location itself.
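Server-side, that pattern could look something like this sketch, reusing the names from the snippet above; the page script would then navigate using the returned text:

// Instead of response.sendRedirect(...) on an XHR POST, hand the target
// URL back to the client and let the page's JavaScript navigate itself.
response.setContentType("text/plain");
response.getWriter().write(Constants.SITE_ROOT + "EditPhotos/" + itemURL);
// ...and in the page script: window.location = xhr.responseText;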
Turns out that it was the JavaScript I was using to send the POST that was the problem.
I originally had a link that fired the function through a javascript: URL, something like this (the photo ID here is just a placeholder):
<a href="javascript:deletePhoto(42);">Delete</a>
And everything got fixed when I changed it to trigger the function from an onclick handler instead, something like this:
<a href="#" onclick="deletePhoto(42); return false;">Delete</a>
The deletePhoto function is:
function deletePhoto(photoID) {
    doPost(document.URL, {'action': 'delete', 'id': photoID});
}

function doPost(path, params) {
    var form = document.createElement("form");
    form.setAttribute("method", "POST");
    form.setAttribute("action", path);
    for (var key in params) {
        var hiddenField = document.createElement("input");
        hiddenField.setAttribute("type", "hidden");
        hiddenField.setAttribute("name", key);
        hiddenField.setAttribute("value", params[key]);
        form.appendChild(hiddenField);
    }
    document.body.appendChild(form);
    form.submit();
}