java html parser doesnt read all page - java

I'm parsing html pages to get specific information, but there are some pages that I cant get all the information displayed on the web page, for example in this page
I cant get the reviews information.
By the way, if you see the source code of the page there are very much empty lines, and the reviews information dont appear.
Do you know why?
Some library to read this type of pages?
Thanks

I'm willing to bet they are using some sort of javascript to load in the review information. In order to access that information, you are going to need to somehow either mimic the request or evaluate the javascript and then parse the resulting page. I would suggest examining their javascript and mimicking the request they use to download the review information as that will be much easier than attempting to evaluate the javascript in your code.

Related

How do i get relevant Information from a Webpage (HTML) into my Android App?

at first i have to say that my english isnt the best so dont blame me for that:).
I want to create a Food-App for Android-Systems which is able to get Information (like ingredients,preperation) of Webpages by downloading them through an Asynctask and saving them into a Database (SQLite).
I learned to work with JSON - based website and to show the results(after downloading the data) in a ListView. Now i have the problem that i dont have JSON in front of me and i have really no clue about how i can write a code which extracts relevant information of HTML - Webpage. Is it even possible?
Sorry if maybe of u may laugh now how unknowing i am but i try to learn:)
So i basically know much about Asynctask and Databases. But the point is i dont know how to connect them all with my specific problem.
Thank you in advance for all who will deal with my topic!
Try working with jsoup . Here you can find the jsoup library and full source code.
See an example at this site: http://www.vogella.com/tutorials/jsoup/article.html
Add jsoup to your project by adding this line to your app build.gradle :
compile 'org.jsoup:jsoup:1.10.3'
HTML is a XML based presentation of pages.
You can parse it with DOM, if theres repeated tags you may find easier to parse with SAX. But you will need to parse every information on the site and navigate trough the graph to extract what you want.

How to deal in Android with RESTful API that throws HTML table

This API returns a whole HTML table. I'm searching how to add this table (as is) into my UI but I've never seen such API throwing HTM table.Browsing Internet for an answer is not giving me any hope either.
Is it possible to put it into a webview? or any other UI object? My application sends a word to the API, and I'm getting the table in return.
I'd appreciate some code example.
You can certainly just show that exact same page in a WebView. If you want to parse the table and display only certain information, there is a library call JSOUP that is available which makes it very convenient to parse HTML.
It looks like you don't mind displaying the whole thing in a WebView - if that is acceptable, then you just load the page into a WebView widget. WebView will take care of rendering the page exactly as you see it in a browser. You only have to tell it what to load.
You parse the output like you would any other web request. If you wanted to include the table in your own webpage, you could. Or you could parse the response for the specific info you need.
Don't think of it as an API, think of it as a URL you're requesting and now you need to do something with the contents. That might help with your Googling. You're essentially doing page scraping.

Read full content of a web page in Java

I want to crawl the whole content of the following link with a Java program. The first page is no problem, but when I want to crawl the data of the next pages, there is the same source code as for page one. Therefore a simple HTTP Get does not help at all.
This is the link for the page I need to crawl.
The web site has active contents that need to be interpreted and executed by a HMTL/CSS/JavaScript rendering engine. Therefore I have a simple solution with PhantomJS, but it is sophisticated to run PhantomJS code in Java.
Is there any easier way to read the whole content of the page with Java code? I already searched for a solution, but could not find anything suitable.
Appreciate your help,
kind regards.
Using the Chrome network log (or a similar tool in any other browser) you can identify the XHR request that loads the actual data displayed on the page. I have removed some of the query parameters, but essentially the request looks like this:
GET https://www.blablacar.de/search_xhr?fn=frankfurt&fcc=DE&tn=muenchen&tcc=DE&sort=trip_date&order=asc&limit=10&page=1&user_bridge=0&_=1461181945520
Helpfully, the query parameters look quite easy to understand. The order=asc&limit=10&page=1 part looks like it would be easy to adjust to return your desired results. You could adjust the page parameter to crawl successive pages of data.
The response is JSON, for which there are a ton of libraries available.

How can i extract a dynamic string/word from a website using Java

Hello everyone here is my problem.
I want to extract 2 words from a website, the words are "won" or "loss". If i can find those 2 words on the website i will be able to write the program i am working on.The problems i have are...
When i write a java program to get the html code from the site it only gives me the html code that is not changing ie: it doesnt giving the dynamic php code parts.
When i "inspect elements" on the website it gives me exactly what i want. It says i either won or loss in the html tags . However if i simply view source it doesn't show me that dynamic php code that u would see when inspecting elements.
Is there a way for me to write code that looks at "inspect elements" for the website and keep track of the part of the html code that is changing between "win" or "loss"?
I've had trouble with something like this before and since you lack details I will give you the best answer I can...
More information that will be helpful to know maybe if you edit include,
Code... Show me what you got
The html code
APIs or frameworks used in you application
So the issue seems like when you request the site the information is not there. Normally this doesn't happen since most webpage display information at load time.
These days we do a lot of stuff with Javascript so therefore that is probably the part you are having problems with. Javascript can load information onto the page dynamically at anytime. It need not me at load time and even if it looks like it by eye that its there when the page loads it may not be since it's too fast to notice.
Look into the javascript code and see if you can find a get, post, or put action and see if you can follow that to where it loads the page. Then mimic the request in your program.

How to fill out form data on a website

I am looking to develop an app that will take login details from the user, go to a website, login, return values on the web page and then display them to the user on the phone.
Does java have this functionallity? Will I need to use javascript instead maybe? do these answers depend on the website that I am trying to access?
In my head I figure that I could just read in the paramaters as strings or chars, parse the webpage for the appropriate form and "paste" the appropriate value into the form "box". However, I have never attempted anything like this with coding so I am completely new to the idea and dont really know where to start. I tried googling around but any information that I found was either irrelevant or conflicting.
I'm not looking for the code to do it because I will not really learn anythig from that but a finger in the right direction would be great. I really do want to try get better at programming so that's why I've started to give myself these little side projects
Any help that can be offered would be great
Ian,
You can try using http-client (http://hc.apache.org/httpclient-3.x/) lib from apache. It lets to pro grammatically access a website (from a Java code). You will need to do the following things
Use the http-client lib to POST the data to the web site.
Receive the html response.
Use some html parser or xpath to retrieve the values from the response html.
You would need a script which accesses the webpage and enters the data, but in my opinion this is illegal. Because you are accessing a secured area and are able to look into sensitive data. Also accessing the page via a script is "botting" - most pages have safety precautions to prevent the execution of scripts, because most of them are harmful.
In my opinion there is no legal and easy solution to this.

Categories

Resources