Read full content of a web page in Java - java

I want to crawl the whole content of the following link with a Java program. The first page is no problem, but when I want to crawl the data of the next pages, there is the same source code as for page one. Therefore a simple HTTP Get does not help at all.
This is the link for the page I need to crawl.
The web site has active content that needs to be interpreted and executed by an HTML/CSS/JavaScript rendering engine. I have a working solution with PhantomJS, but running PhantomJS code from Java is complicated.
Is there any easier way to read the whole content of the page with Java code? I already searched for a solution, but could not find anything suitable.
Appreciate your help,
kind regards.

Using the Chrome network log (or a similar tool in any other browser) you can identify the XHR request that loads the actual data displayed on the page. I have removed some of the query parameters, but essentially the request looks like this:
GET https://www.blablacar.de/search_xhr?fn=frankfurt&fcc=DE&tn=muenchen&tcc=DE&sort=trip_date&order=asc&limit=10&page=1&user_bridge=0&_=1461181945520
Helpfully, the query parameters look quite easy to understand. The order=asc&limit=10&page=1 part looks like it would be easy to adjust to return your desired results. You could adjust the page parameter to crawl successive pages of data.
The response is JSON, for which there are a ton of libraries available.
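As a rough sketch of the paging idea above, the following uses the JDK's built-in java.net.http.HttpClient (Java 11+). The class and method names are made up for illustration, and dropping the cache-busting "_" parameter is an assumption that may not hold for the real endpoint.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: builds and fetches successive result pages.
class SearchPager {
    // Query parameters copied from the request shown above. The "_" timestamp
    // parameter is omitted on the assumption that the server does not require it.
    static final String BASE =
        "https://www.blablacar.de/search_xhr?fn=frankfurt&fcc=DE&tn=muenchen&tcc=DE"
        + "&sort=trip_date&order=asc&limit=10&user_bridge=0";

    // Returns the URI for the given 1-based page number.
    static URI pageUri(int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        return URI.create(BASE + "&page=" + page);
    }

    // Sends a GET for one page and returns the raw response body (expected to be JSON).
    static String fetchPage(HttpClient client, int page)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(pageUri(page))
            .header("Accept", "application/json")
            .GET()
            .build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling SearchPager.fetchPage(HttpClient.newHttpClient(), 2) would request the second page; the returned string can then be handed to whichever JSON library you prefer.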

Related

How to deal in Android with RESTful API that throws HTML table

This API returns a whole HTML table. I'm looking for a way to add this table (as is) to my UI, but I've never seen an API return an HTML table before. Searching the Internet has not given me much hope either.
Is it possible to put it into a webview? or any other UI object? My application sends a word to the API, and I'm getting the table in return.
I'd appreciate some code example.
You can certainly just show that exact same page in a WebView. If you want to parse the table and display only certain information, there is a library called jsoup that makes it very convenient to parse HTML.
It looks like you don't mind displaying the whole thing in a WebView - if that is acceptable, then you just load the page into a WebView widget. WebView will take care of rendering the page exactly as you see it in a browser. You only have to tell it what to load.
You parse the output like you would any other web request. If you wanted to include the table in your own webpage, you could. Or you could parse the response for the specific info you need.
Don't think of it as an API, think of it as a URL you're requesting and now you need to do something with the contents. That might help with your Googling. You're essentially doing page scraping.
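If the API returns only the table markup rather than a complete page, one option is to wrap the fragment in a minimal HTML document before handing it to a WebView. The sketch below is plain Java with a made-up class name; the Android call mentioned in the comment is not executed here.

```java
// Hypothetical helper: turns a bare <table> fragment into a complete HTML document.
class TableDocument {
    // Wraps the fragment returned by the API in a minimal page with a UTF-8 charset.
    static String wrap(String tableHtml) {
        return "<!DOCTYPE html><html><head><meta charset=\"utf-8\"></head><body>"
            + tableHtml
            + "</body></html>";
    }
    // On Android the result could then be shown with something like:
    //   webView.loadDataWithBaseURL(null, TableDocument.wrap(response),
    //                               "text/html", "UTF-8", null);
}
```

If the API already returns a full page, the wrapping step is unnecessary and the URL can be loaded directly.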

How can i extract a dynamic string/word from a website using Java

Hello everyone, here is my problem.
I want to extract two words from a website: "won" or "loss". If I can find those words on the website, I will be able to write the program I am working on. The problems I have are:
When I write a Java program to get the HTML code from the site, it only gives me the HTML that does not change, i.e. it doesn't give me the dynamic PHP parts.
When I "inspect elements" on the website, it gives me exactly what I want. It says in the HTML tags whether I won or lost. However, if I simply view the source, it doesn't show the dynamic PHP parts that you would see when inspecting elements.
Is there a way for me to write code that looks at "inspect elements" for the website and keeps track of the part of the HTML that changes between "won" and "loss"?
I've had trouble with something like this before, and since your question lacks details I will give you the best answer I can.
It would help if you edited the question to include more information:
Code: show what you have so far
The HTML code
APIs or frameworks used in your application
The issue seems to be that when you request the site, the information is not there. Normally this doesn't happen, since most web pages display their information at load time.
These days a lot is done with JavaScript, so that is probably the part you are having problems with. JavaScript can load information onto the page dynamically at any time. It need not be at load time, and even if the content appears to be there when the page loads, it may have been added afterwards, too quickly to notice.
Look into the JavaScript code and see if you can find a GET, POST, or PUT action, then follow it to where it loads the data. Then mimic that request in your program.
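Once you can mimic the request, the remaining step is pulling the word out of the response body. A minimal sketch, assuming the response contains "won" or "loss" as a standalone word (the class name and the sample inputs are invented for illustration):

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: searches a response body for the words "won" or "loss".
class ResultFinder {
    // \b keeps the match to whole words, so "blossom" does not count as "loss".
    private static final Pattern RESULT =
        Pattern.compile("\\b(won|loss)\\b", Pattern.CASE_INSENSITIVE);

    // Returns the first match in lower case, or an empty Optional if neither word occurs.
    static Optional<String> findResult(String body) {
        Matcher m = RESULT.matcher(body);
        return m.find()
            ? Optional.of(m.group(1).toLowerCase(Locale.ROOT))
            : Optional.empty();
    }
}
```

If the real response is JSON or structured HTML, reading the specific field or element with a proper parser would be more robust than a regular expression.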

scrape website multiple pages using Web Client java

I am trying to scrape a website using WebClient. I am able to get the data on the first page and parse it, but I do not know how to read the data on the second page, because the website calls JavaScript to navigate to the second page. Can anyone suggest how I can get the data from the next pages?
Thanks in advance
The problem you're going to have is while you (a person) can read the JavaScript in the first page and see it is navigating to another page, having the computer do this is going to be hard.
If you could identify the block of code performing the navigation, you would then need to execute it in such a way that allowed your program to extract the URL. This again is going to be very specific to the structure of the JavaScript and would require a person to identify this.
In short, I think you're dead in the water with this one, though it serves as a good example of why the Unobtrusive JavaScript concept is so important.
This framework integrates HtmlUnit, with its headless JavaScript-enabled browser, to fully support scraping multiple pages in the same WebClient session: https://github.com/subes/invesdwin-webproxy

How do i get Contents of an ASPX file through java?

In Java, is there any way to get the content of a web page that is an .ASPX file?
I know how to read/write anything from a normal HTML page, but ASP pages seem to have one URL for multiple pages, so it's not really possible to reach the desired page by URL.
I understand you can't/won't give me complete instructions right here, but could you maybe send me in the right direction ?
Thanks in advance.
There is nothing special about ASPX pages compared to any other type of page; "plain" html pages could have been dynamically generated as well.
Just don't forget that the query string is also part of the URL. Many ASPX, PHP, etc pages might not even be 'correct' to request without some query string value at all. And other sites don't have file extensions at all... like this site itself. You just have to be sure to get the entire URL for each unique 'page'.
I'm not an expert on .asp, so I might be wrong. However, my impression is that a .asp page should ultimately return HTML (similarly to what a .jsp page does), so you can fetch the content in the same way as you would do for an HTML page.
However, you write that
asp pages seem to have one URL for multiple pages
this makes me think that perhaps your .asp page is using AJAX and so the page content may change while the URL doesn't. Is this your case?
I understand that you are trying to read the ASPX page from a client PC, not from the server.
If that's right, accessing an HTTP resource is independent of the technology used by the server; all you need to do is open an HTTP request and retrieve the results.
If you see multiple pages from one URL, then one of the following is happening:
1) POST data is sent to the aspx, and it renders different HTML due to these parameters
2) You are not really looking at the inner page but at a page that provides the frames for the HTML being rendered
3) The page relies heavily on Ajax to render. The "contents" of the page are not downloaded through the initial request but later by JavaScript.
Generally, it is probably the first reason.
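For the first case, the POST body is normally sent as application/x-www-form-urlencoded data. A minimal sketch of building such a body with the JDK only; the class name and the field names in the usage note are placeholders, since the real ones have to be read from the page's form:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: encodes form fields the way a browser would for a POST.
class FormBody {
    // Joins URL-encoded key=value pairs with '&', in the map's iteration order.
    static String encode(Map<String, String> fields) {
        return fields.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }
}
```

For example, a LinkedHashMap with the made-up entries page=2 and query="a b&c" encodes to page=2&query=a+b%26c. Any hidden fields in the page's form would usually need to be included as well.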

How to use Java to navigate a Web Search

I need to scrape French court cases for a project, but I can't figure out how to get Java to navigate the Court's search engine.
Here's the search page I need to manipulate. I want to start scraping the results page, but I can't get to that page from Java with just the URL. I need some way to have Java order the server to execute a search based on my date parameters (01/01/2003 - 30/06/2003), and then I can run the show by simply manipulating the URL I'm connecting to.
Any Suggestions?
First make sure the terms of service for the site allow this.
I would use HttpClient POSTs to send the data and get the results. Look at the form on the page, figure out which variables you need to emulate, and submit them with HttpClient. You should get back the results you are looking for. This page also has a lot of JavaScript, so you need to figure out what it is doing. It may never submit the form and instead make Ajax calls to update the page, but you may still be able to get the same results by issuing those calls yourself.
You can always install something like Fiddler, watch the HTTP traffic the page is sending, and then emulate that using HttpClient.
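A rough sketch of emulating such a form submission follows. It uses the JDK's java.net.http API (Java 11+) rather than the Apache HttpClient library the answer probably has in mind, and the URL and field names in the usage note are placeholders, not the court site's real ones.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds a form POST that mimics what the browser sends.
class SearchRequest {
    // formUrl is the form's action URL; encodedBody is the already
    // URL-encoded form data (key=value pairs joined with '&').
    static HttpRequest build(String formUrl, String encodedBody) {
        return HttpRequest.newBuilder(URI.create(formUrl))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(encodedBody))
            .build();
    }
}
```

A call such as SearchRequest.build("https://example.org/search", "dateFrom=01%2F01%2F2003&dateTo=30%2F06%2F2003") produces a request that can be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). The actual field names and any required cookies or headers should be taken from the traffic captured in Fiddler or the browser's network log.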
