I am writing a program to scrape the source code of a website. Each time the Next button is clicked to go to the next page on the website, a POST request is sent.
I have been looking at using HttpClient to take care of this, and have looked through the examples and the HttpClient API, but I can't figure out whether HttpClient can do this. Is this something HttpClient supports, and if so, which class would do it?
I think that you're saying the webpage you're performing an http get on contains a "next button" on it, and that when you view the webpage in the browser and click the next button, the next page of the website is displayed.
If this is the case, yes, HttpClient is able to do the same thing. Understand, though, that HttpClient does not integrate with your web browser. What you can do is scour the source code returned from the HTTP GET request, using a library like jsoup, to extract the URL for the "next" page on the website, and then issue another HTTP GET to fetch that resource.
Assuming you already have the HttpClient code that issues the initial HTTP GET request, no additional API is required. You just make another request after your program discovers the URL for the "next" resource.
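A minimal sketch of that loop in Java, assuming Java 11+ for `java.net.http` and assuming the next link is an anchor whose text is "Next" (neither detail comes from the site in question). A regular expression stands in for jsoup here only to keep the sketch free of dependencies; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NextPageCrawler {

    // Assumption: the link looks like <a ... href="..." ...>Next</a>.
    // A real HTML parser such as jsoup is more robust than this regex.
    private static final Pattern NEXT_LINK = Pattern.compile(
            "<a\\b[^>]*\\bhref=\"([^\"]+)\"[^>]*>\\s*Next\\s*</a>",
            Pattern.CASE_INSENSITIVE);

    /** Returns the absolute URL of the "next" page, if the HTML contains one. */
    static Optional<URI> findNextUrl(String html, URI pageUrl) {
        Matcher m = NEXT_LINK.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // Resolve links such as "/items?page=2" against the current page.
        return Optional.of(pageUrl.resolve(m.group(1)));
    }

    /** Issues a plain HTTP GET and returns the response body. */
    static String fetch(HttpClient client, URI url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(url).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: NextPageCrawler <start-url>");
            return;
        }
        HttpClient client = HttpClient.newHttpClient();
        Optional<URI> current = Optional.of(URI.create(args[0]));
        // Follow "next" links, capped at 5 pages for the demo.
        for (int i = 0; i < 5 && current.isPresent(); i++) {
            URI url = current.get();
            String html = fetch(client, url);
            System.out.println(url + " -> " + html.length() + " chars");
            current = findNextUrl(html, url);
        }
    }
}
```

If the site really does page through results with a POST, as the question describes, the second request would need to be a POST with the same form fields the browser sends; the discovery step stays the same.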
Related
Long story short, I am having an issue sending a POST request to the server after completing my first one. This is being done in Java.
Basically, my question is: using Apache HttpClient, is it possible to press a button? I can't find any other way around this. (I normally use Selenium, but I am trying to save RAM by removing the browser and using POST requests instead.)
Here is an example of html code:
I have looked into making the POST request to the server by using the Network tab in Chrome, and tried multiple things, but none of them worked.
I tried to use HtmlUnit to send POST requests to the server and ran into a small problem: the target .php URL changes from time to time (www123.example.net -> www345.example.net, etc.). The only way to get the new address is to open the site, check its XHR requests, find one that goes to www???.example.net, and then use that address to send the POSTs.
So the question is: is there a way to track XHR requests using HtmlUnit or any other Java library?
If you really need help, you have to show your problem in more detail: provide some information about the website you are requesting, show your code, and explain what you expect and what goes wrong. Without these details we can only guess.
It looks like you should think of HtmlUnit as a browser you can control from Java rather than a tool for making simple HTTP requests. Have a look at the simple samples on the HtmlUnit website (the one at the bottom is for you).
Try something like this (the same steps a user of an ordinary browser takes):
* open the url/page
* fill the various form fields
* find the submit button and click it
* use the resulting page content
Usually HtmlUnit does all the stuff in the background for you.
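For comparison, when full browser emulation is not wanted, the "fill the fields, click submit" steps come down to a single form-encoded POST on the wire. The following is a rough sketch of that request using only the JDK (Java 11+ assumed). It does not use HtmlUnit, it runs no JavaScript, and the field names and the `submit.php` path are hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class FormPost {

    /** Encodes fields as application/x-www-form-urlencoded, e.g. "a=1&b=x+y". */
    static String formEncode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** Builds (but does not send) the POST a browser would issue on submit. */
    static HttpRequest buildSubmit(URI action, Map<String, String> fields) {
        return HttpRequest.newBuilder(action)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(fields)))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical form fields; take the real names from the page's <form>.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("user", "alice");
        fields.put("comment", "hello world");
        HttpRequest request = buildSubmit(
                URI.create("https://www123.example.net/submit.php"), fields);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(formEncode(fields));
    }
}
```

This only works when the server accepts the bare POST. If the page computes hidden fields or the target host in JavaScript, driving the page through HtmlUnit as described above is likely the safer route.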
I am attempting to log into a website using Java's HttpURLConnection. I have figured out how to use a POST request to post to the website and log in, but I have no way of knowing if the login was successful or not.
Looking at some tutorials, I gathered that reloading the page usually works. The problem with this specific site is that, upon entering credentials, the website opens a pop-up window with the same URL as the parent site.
This can be solved in either of two ways. Looking at Chrome's Developer Tools, I realized that the POST request returns whether the login was successful, as seen here
Is it possible to get the pop-up window or look at the response to the POST request? I'd rather use native Java if possible.
Reloading will work if you keep the same HTTP session. In fact, the website cannot open a pop-up; the web browser does that in reaction to the login response. You should do the same, that is, check the response. With luck you don't have to parse the response content: try checking the response code. For a login, HTTP 200 may stand for success and HTTP 401 for failure.
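A sketch of that check with `HttpURLConnection`, assuming the site follows the 200/401 convention mentioned above (it may not) and accepts a form-encoded body. The class name, the `Result` enum, and the command-line arguments are illustrative only.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class LoginCheck {

    enum Result { SUCCESS, BAD_CREDENTIALS, UNKNOWN }

    /** Maps a status code to a result; the 200/401 convention is an assumption. */
    static Result interpret(int statusCode) {
        if (statusCode == HttpURLConnection.HTTP_OK) {
            return Result.SUCCESS;
        }
        if (statusCode == HttpURLConnection.HTTP_UNAUTHORIZED) {
            return Result.BAD_CREDENTIALS;
        }
        return Result.UNKNOWN;
    }

    /** Posts a form-encoded body and reports how the server answered. */
    static Result login(URL loginUrl, String formBody) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) loginUrl.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(formBody.getBytes(StandardCharsets.UTF_8));
        }
        try {
            return interpret(conn.getResponseCode());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // A JVM-wide cookie store keeps the session cookie for later requests.
        CookieHandler.setDefault(new CookieManager());
        if (args.length < 2) {
            System.out.println("usage: LoginCheck <login-url> <form-body>");
            return;
        }
        System.out.println(login(URI.create(args[0]).toURL(), args[1]));
    }
}
```

Some sites answer a failed login with HTTP 200 and an error page, so if `interpret` reports `SUCCESS` for bad credentials, the response body has to be inspected after all.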
I have made a web scraper for Google Scholar in Java with jsoup. The scraper searches Scholar for a DOI and finds the citations for that paper. This data is needed for a research project.
But the scraper only works for the first few requests.
After that the scraper encounters a captcha on the Scholar site.
However, when I open the website in my browser (Chrome) Google Scholar opens normally.
How is this possible? All requests come from the same IP address!
So far I have tried the following options:
* Choose a random user agent for each request (from a list of 5 user agents)
* Add a random delay of 5 to 50 seconds between requests
* Use a Tor proxy. However, almost all the exit nodes have already been blocked by Google
When I analyse the requests Chrome makes to Scholar, I see that a cookie with some session IDs is used. This is probably why the Chrome requests are not blocked. Is it possible to use this cookie for requests made with jsoup?
Thank you!
There are three things that spring to mind:
* You aren't saving the cookies between requests. Your first request should save the cookie and pass it back to the server on the next request (setting the Referer header wouldn't hurt either). There's an example here.
* If Google is being tricky, it could see that your first request didn't load any CSS/JS/images on the page. This is a sure sign that you are a bot.
* JavaScript is doing something in the page once you have it loaded.
I think the first is the most likely option. You should try copying as many of the headers you see in the request from Chrome as you can into your Java code.
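The cookie and header advice can be sketched with the JDK alone. This is not jsoup code; it assumes `HttpURLConnection` plus a JVM-wide `CookieManager`, and the User-Agent string and URLs are placeholders to be replaced with the values Chrome actually sends.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URI;

class CookieSession {

    // Placeholder value: copy the real User-Agent from Chrome's Network tab.
    static final String USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0";

    // One store for the whole JVM, so cookies set by one response
    // are sent back with the next request to the same site.
    static final CookieManager COOKIES = new CookieManager();

    /** Registers the cookie store; HttpURLConnection consults it on every request. */
    static void install() {
        CookieHandler.setDefault(COOKIES);
    }

    /** Prepares (but does not send) a GET with browser-style headers. */
    static HttpURLConnection open(String url, String referer) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", USER_AGENT);
            conn.setRequestProperty("Accept", "text/html,application/xhtml+xml");
            conn.setRequestProperty("Accept-Language", "en-US,en;q=0.9");
            if (referer != null) {
                conn.setRequestProperty("Referer", referer);
            }
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        install();
        if (args.length == 0) {
            System.out.println("usage: CookieSession <url> [more urls]");
            return;
        }
        String referer = null; // the first request has no referring page
        for (String url : args) {
            HttpURLConnection conn = open(url, referer);
            System.out.println(url + " -> HTTP " + conn.getResponseCode());
            System.out.println("cookies stored: "
                    + COOKIES.getCookieStore().getCookies().size());
            conn.disconnect();
            referer = url;
        }
    }
}
```

jsoup's own `Connection` can also carry cookies from one request to the next; the point is the same either way, namely that every request after the first one should send back what the server set.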
Nowadays many websites contain content loaded by Ajax (e.g., comments on some video websites). Normally we can't crawl this data; what we get is just some JavaScript source code. So here is the question: how can we execute the JavaScript code after we get the HTML response, so that we end up with the final page we want?
I know that HtmlUnit can execute background JavaScript, but it has many bugs and errors. Are there any other tools that can help with this?
Some people tell me that I can find the Ajax request URL, analyze its parameters, and send the request again to get the data. If the approach above can't be made to work, can anyone tell me how to extract the Ajax URL and send the request in the correct format?
By the way, if the language is Java, that would be best.
Yes, Netwoof can crawl Ajax easily. Its API and bot builder let you do it without a line of code.
That's the great thing about HTTP: you don't even need Java. My go-to tool for debugging Ajax is the Chrome extension Postman. I start by looking at the request in the Chrome debugger and identifying the salient bits (URL, form-encoded parameters, etc.).
Then it can be as simple as opening a tab and launching requests at the server with Postman. As long as it's all in the same browser context, all of your cookies (for authentication, etc.) will be sent along too.
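Once Postman has confirmed which URL and parameters matter, the same request can be replayed from Java, which is what the question asks for. This is a sketch only (Java 11+ assumed): the endpoint, the `page` parameter, and the `X-Requested-With` header are assumptions, and the real values have to be read off the request shown in the Chrome debugger.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

class AjaxReplay {

    /** Appends URL-encoded query parameters to an endpoint that has no query yet. */
    static URI withQuery(String endpoint, Map<String, String> params) {
        String query = params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return URI.create(endpoint + "?" + query);
    }

    /** Builds a GET that resembles a typical XHR call. */
    static HttpRequest buildXhr(URI uri) {
        return HttpRequest.newBuilder(uri)
                .header("Accept", "application/json")
                // Many JavaScript libraries add this header; some servers check for it.
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: AjaxReplay <ajax-endpoint-url>");
            return;
        }
        // Hypothetical parameter; use the ones seen in the browser's request.
        HttpRequest request = buildXhr(withQuery(args[0], Map.of("page", "1")));
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Unlike Postman running inside the browser, this client starts with no cookies, so an endpoint that requires a login needs the session cookie supplied as well.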