I need to scrape French court cases for a project, but I can't figure out how to get Java to navigate the Court's search engine.
Here's the search page I need to manipulate. I want to start scraping the results page, but I can't get to that page from Java with just the URL. I need some way to have Java order the server to execute a search based on my date parameters (01/01/2003 - 30/06/2003), and then I can run the show by simply manipulating the URL I'm connecting to.
Any Suggestions?
First make sure the terms of service for the site allow this.
I would use HttpClient POST requests to send the data and get the results. Look at the form on the page, work out which variables you need to emulate, and submit them with HttpClient. You should get back the results you are looking for. The page also has a lot of JavaScript, so you need to figure out what it is doing. It may never submit the form at all and instead make AJAX calls to update the page, in which case you may be able to get the same results by making those calls yourself.
You can always install something like Fiddler, watch the HTTP traffic the page sends, and then emulate that using HttpClient.
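A minimal sketch of the form-POST approach described above. It uses the java.net.http.HttpClient built into Java 11+ rather than Apache HttpClient, and the field names dateFrom and dateTo are hypothetical placeholders: the real names have to be read from the page's form or from the captured traffic. The target URL is passed on the command line because the real form action is not known here.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class FormPostSketch {

    // Encode form fields as application/x-www-form-urlencoded.
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // POST the encoded form and return the response body (the results page).
    static String postForm(String url, Map<String, String> fields) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("dateFrom", "01/01/2003"); // hypothetical field name
        fields.put("dateTo", "30/06/2003");   // hypothetical field name

        // Always show what would be sent; only hit the network if a URL is given.
        System.out.println(encodeForm(fields));
        if (args.length > 0) {
            System.out.println(postForm(args[0], fields));
        }
    }
}
```

If the site relies on cookies or hidden form fields, those would have to be carried along as well. The captured traffic shows which ones matter.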
Related
I want to crawl the whole content of the following link with a Java program. The first page is no problem, but when I want to crawl the data of the next pages, there is the same source code as for page one. Therefore a simple HTTP Get does not help at all.
This is the link for the page I need to crawl.
The website has active content that needs to be interpreted and executed by an HTML/CSS/JavaScript rendering engine. I have a simple solution with PhantomJS, but running PhantomJS code from Java is complicated.
Is there any easier way to read the whole content of the page with Java code? I already searched for a solution, but could not find anything suitable.
Appreciate your help,
kind regards.
Using the Chrome network log (or a similar tool in any other browser) you can identify the XHR request that loads the actual data displayed on the page. I have removed some of the query parameters, but essentially the request looks like this:
GET https://www.blablacar.de/search_xhr?fn=frankfurt&fcc=DE&tn=muenchen&tcc=DE&sort=trip_date&order=asc&limit=10&page=1&user_bridge=0&_=1461181945520
Helpfully, the query parameters look quite easy to understand. The order=asc&limit=10&page=1 part looks like it would be easy to adjust to return your desired results. You could adjust the page parameter to crawl successive pages of data.
The response is JSON, for which there are a ton of libraries available.
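As a rough sketch of the paging idea, the Java below rebuilds the request URL shown above for successive values of page. It assumes the trailing "_" parameter is only a cache-busting timestamp and can be dropped, which the source does not confirm, and it leaves JSON parsing to whichever library you pick. The class and method names are made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class SearchPager {

    // Query string from the request above, minus "page" and the "_" parameter
    // (assumption: "_" is a cache-buster the server does not require).
    static final String BASE = "https://www.blablacar.de/search_xhr"
            + "?fn=frankfurt&fcc=DE&tn=muenchen&tcc=DE"
            + "&sort=trip_date&order=asc&limit=10&user_bridge=0";

    // Build the URL for one page of results.
    static String pageUrl(int page) {
        return BASE + "&page=" + page;
    }

    // Fetch one page and return the raw JSON text.
    static String fetch(HttpClient client, int page) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(pageUrl(page)))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        // Print the URLs that would be requested for the first three pages.
        for (int page = 1; page <= 3; page++) {
            System.out.println(pageUrl(page));
        }
    }
}
```

How many pages exist is not visible from the URL alone, so a real crawler would keep calling fetch until the response contains no more results.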
I am trying to scrape a website using WebClient. I am able to get the data on the first page and parse it, but I do not know how to read the data on the second page, because the website calls JavaScript to navigate to it. Can anyone suggest how I can get the data from the next pages?
Thanks in advance
The problem you're going to have is while you (a person) can read the JavaScript in the first page and see it is navigating to another page, having the computer do this is going to be hard.
If you could identify the block of code performing the navigation, you would then need to execute it in such a way that allowed your program to extract the URL. This again is going to be very specific to the structure of the JavaScript and would require a person to identify this.
In short, I think you're dead in the water with this one, though it serves as a good example of why the Unobtrusive JavaScript concept is so important.
This framework integrates HtmlUnit with its headless, JavaScript-enabled browser to fully support scraping multiple pages in the same WebClient session: https://github.com/subes/invesdwin-webproxy
So I'm making a program for android that tries to download something from www.wupload.com. What I want isn't a browser but to interact with the webpage without actually showing it. Like how HtmlUnit is supposed to work.
I'm using Apache for the HTTP requests. What I've done so far is send a POST that simulates clicking on "slow download" on the web page. Then I read the response to get some variables needed for the next POST, and execute that POST. In theory, the web page should now be showing the captcha, because the response I get says "please enter the captcha", but it contains no image URL.
The next step would be to enter the captcha and finally download the file, the problem I'm having is I don't know how to show the captcha image to the user. Do I have to capture it somehow? I know how to make the post to send what the user would type, but the image url of the captcha isn't in the source code.
I thought of inspecting the web page so I could get the url from the DOM tree, like what inspect element on google chrome does, but I have no idea if it's even possible. Any ideas would be great.
thx
The captcha is probably generated using JavaScript. Therefore when you get the source of the website, the captcha hasn't yet been generated and you won't see the image in the source HTML. You would need to run the Javascript somehow. You could try using a WebView because it has built-in support for Javascript, or get a Javascript library for Java and use it somehow. I think it would be a lot of work.
Edit:
Actually, if they are using a third-party captcha library, I'm sure it uses some sort of HTTP request system, so you might be able to inspect it with this plugin for Firefox.
I am looking to develop an app that will take login details from the user, go to a website, login, return values on the web page and then display them to the user on the phone.
Does Java have this functionality? Will I need to use JavaScript instead, maybe? Do these answers depend on the website that I am trying to access?
In my head I figure that I could just read in the parameters as strings or chars, parse the webpage for the appropriate form and "paste" the appropriate value into the form "box". However, I have never attempted anything like this in code, so I am completely new to the idea and don't really know where to start. I tried googling around, but any information that I found was either irrelevant or conflicting.
I'm not looking for the code to do it, because I will not really learn anything from that, but a finger in the right direction would be great. I really do want to get better at programming, which is why I've started to give myself these little side projects.
Any help that can be offered would be great
Ian,
You can try using the HttpClient library (http://hc.apache.org/httpclient-3.x/) from Apache. It lets you access a website programmatically from Java code. You will need to do the following things:
Use the http-client lib to POST the data to the web site.
Receive the html response.
Use some html parser or xpath to retrieve the values from the response html.
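The last step can be sketched with the XPath support that ships with the JDK. This is only an illustration under a strong assumption: the response must be well-formed XML (XHTML), which real-world pages often are not, so a lenient HTML parser such as jsoup may be needed instead. The sample markup and the id "balance" are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ValueExtractor {

    // Evaluate an XPath expression against well-formed (X)HTML
    // and return the matched node's text content.
    static String extract(String xhtml, String xpathExpr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        return XPathFactory.newInstance().newXPath().evaluate(xpathExpr, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the HTML returned by the POST (hypothetical content).
        String response =
                "<html><body><span id=\"balance\">42.50</span></body></html>";
        System.out.println(extract(response, "//span[@id='balance']")); // prints 42.50
    }
}
```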
You would need a script which accesses the webpage and enters the data, but in my opinion this is illegal. Because you are accessing a secured area and are able to look into sensitive data. Also accessing the page via a script is "botting" - most pages have safety precautions to prevent the execution of scripts, because most of them are harmful.
In my opinion there is no legal and easy solution to this.
I'm writing a perl program that was doing a simple get command to retrieve results and process them. But the site has been updated and now has a java component that handles the results (so the actual data is not in the source code anymore).
This is the site:
http://wro.westchesterclerk.com/legalsearch.aspx
Try putting in:
Index Number: 11103
Year: 2009
I want to be able to programmatically enter the "index number" and "year" at the bottom of the form where it says "search by number" and then retrieve the results listed next to it.
I've written many programs in Perl that simply pass variables via the URL, with the results listed in the source code, so it's easy to parse (using LWP::Simple).
Like:
$html = get("http://www.url.com?id=$somenum&year=$someyear");
But this is totally new to me and I don't know where to begin.
I'm somewhat familiar with LWP::UserAgent and Mechanize.
I'd really appreciate any help.
Thanks!
That sort of question gets asked a lot. The standard answer is Wireshark.
I was just using it on that website with the test data you gave and extracted a single responsible POST request. This lets you bypass Javascript altogether.
It might be more logical for you to use one of the modules which drives a browser. Something like Mozilla::Mechanize or the Selenium tools.
A browser knows best how to interact with the server using AJAX and re-render the DOM and so on, so build your script on top of that ability.
What you're asking to do in this case is hard. Not impossible, but hard.
method A:
You can sift through their JavaScript code. What their "AJAX" is doing is making a GET/POST request to another web page and dynamically loading the results. If you can decipher what that link is and the proper arguments, you can continue to use get. I would recommend getting the Firebug plugin and any other tool that will help you de-obfuscate their JavaScript.
Another Method:
If your program can access a web browser with javascript: URL support (like Firefox), you could programmatically go to these addresses, then wait a moment and get your data.
http://wro.westchesterclerk.com/legalsearch.aspx
javascript: function go() { document.getElementById('ctl00_tbSearchArea__ctl1_cphLegalSearch_splMain_tmpl0_tbLegalSearchType__ctl0_txtIndexNo').value=11109; document.getElementById('ctl00_tbSearchArea__ctl1_cphLegalSearch_splMain_tmpl0_tbLegalSearchType__ctl0_txtYear').value='09';searchClick();} go();
This is a method we have used along with mozembed to programmatically get around this stuff. Recently we switched to WebKit. To keep the browser from taking up a video display, we have used Xvfb/Xvnc to create a virtual desktop to load it in.
Those are the methods I have come up with so far. Let me know if you come up with another. I hope this helped.