Parse dynamically loading (by scroll) page using JSOUP

I am trying to count the number of apps returned for a specific search string, such as "Flash Light". Here is the link I am loading in Jsoup:
Jsoup.connect("https://play.google.com/store/search?q=Flash+Light&c=apps&gl=us&hl=en")
The problem is that it only returns 20 apps, but there are more than 100 results when I open the page in a browser and scroll down. When I monitored it closely, I found that the Play Store shows 20 results at first and fetches the rest on scrolling.
Can anyone please tell me how to handle that?
Also, I just want to count the number of results, so if there is any other way to do that, that would be great too.

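Jsoup only parses the HTML that the server returns; it does not execute JavaScript, so it cannot see content that is loaded dynamically on scroll. You need a different set of tools that can run the page's scripts, like HtmlUnit.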

Related

Split a long text for paginated display

I'm building a website for a friend who's writing a novel, and want to display it, chapter by chapter, in a book-like display, with pages turning.
I have a frontend app in Angular 2 and a backend in Java (as they're the tools I'm most familiar with). A backoffice in the Angular app allows the user to add the text of a chapter, which is sent to the backend to be stored in the DB. The front end of the Angular app then calls the backend to retrieve the chapter, and has to display it in the book-like display.
My problem is how to split the text of the chapter into pages in order to display it. I could change the backoffice to force the user to add the text page by page. I could ask the user to put a specific marker in the text to indicate a page break. But I'd like the process to be as transparent as possible for the user.
So I went for a solution by splitting the text on the backend. I estimated how many characters are on a line, and how many lines are on a page, then I cut the text accordingly (with some adjustments, as it's a HTML text with tags in it).
But it feels like a very strict approach, as I'm choosing the size of a page, regardless of the display interface size.
So I'm wondering if there is a better approach:
- a different splitting algorithm
- a tool front-side to display my text without splitting it
- something else
Has anyone faced a similar problem?
Thanks

You are performing this on the server side, which has no knowledge of the page dimensions on the client.
I think a better approach is to send the complete chapter from the back end to the front end, and have a front-end function calculate:
- the number of characters per line, based on the page size
- the number of lines per page, based on the page size
- the number of pages in the chapter, based on the two values above
This is a much better approach than doing everything in the back end.
However, on its own this is not a responsive approach.
Do you want or need a responsive one?
If so, you can add a watch on the page length/height to recalculate the above values and regenerate your pages.
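The arithmetic behind that calculation can be sketched as follows. This is only a minimal illustration written in Java (the language of the asker's backend); in the setup suggested above, the same logic would live in the front end, where the real page size is known. The class and method names (Paginator, wrap, paginate) are hypothetical, and the sketch assumes plain text in a fixed-width font, so it ignores HTML tags and proportional fonts.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits plain text into pages of a fixed line width and height. */
final class Paginator {

    /** Greedy word wrap: breaks text into lines of at most charsPerLine characters. */
    static List<String> wrap(String text, int charsPerLine) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when the input text is blank
            }
            // Start a new line when the next word would overflow the current one.
            if (line.length() > 0 && line.length() + 1 + word.length() > charsPerLine) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            // A word longer than charsPerLine ends up alone on an overlong line.
            line.append(word);
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }

    /** Groups the wrapped lines into pages of at most linesPerPage lines each. */
    static List<List<String>> paginate(String text, int charsPerLine, int linesPerPage) {
        List<String> lines = wrap(text, charsPerLine);
        List<List<String>> pages = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerPage) {
            int end = Math.min(i + linesPerPage, lines.size());
            pages.add(new ArrayList<>(lines.subList(i, end)));
        }
        return pages;
    }

    public static void main(String[] args) {
        String chapter = "It was a dark and stormy night and the rain fell in torrents";
        List<List<String>> pages = paginate(chapter, 20, 2);
        for (int p = 0; p < pages.size(); p++) {
            System.out.println("Page " + (p + 1) + ": " + pages.get(p));
        }
    }
}
```

For a responsive display, paginate would simply be called again with new charsPerLine and linesPerPage values whenever the page size changes.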

Testing responsiveness of HTML page using Java

I am developing an application to test whether an HTML page is responsive or not. Right now, I am assuming that using media queries is the only way to make an HTML page responsive.
But I am using very crude logic to test it. I parse the HTML file and look for the presence of a media query statement. If one is present, I declare the page responsive; otherwise, non-responsive.
Is there any other way I can go about it?
Is there any other test I can perform before declaring it as responsive or non-responsive?

Check whether they are using hard-coded px values instead of % or em. Maybe also check whether the text is too small or the links are too close together.
At the end of the day, it won't be a great tool for checking responsiveness, since there are so many factors.
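As a rough illustration of the px versus % or em check, here is a minimal sketch that counts a few of these signals in a stylesheet string using regular expressions. The class name ResponsiveHints and its methods are hypothetical, and the patterns are deliberately crude: they count every match, including px values used as media query breakpoints, and they say nothing about whether the page actually behaves responsively.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: counts a few responsive-design "ingredients" in a stylesheet. */
final class ResponsiveHints {
    private static final Pattern MEDIA_QUERY =
            Pattern.compile("@media\\b", Pattern.CASE_INSENSITIVE);
    // Fixed lengths such as 960px or 1.5px.
    private static final Pattern PX_LENGTH =
            Pattern.compile("\\b\\d+(?:\\.\\d+)?px\\b", Pattern.CASE_INSENSITIVE);
    // Relative lengths such as 50%, 1.5em, 2rem, 100vw.
    private static final Pattern RELATIVE_LENGTH =
            Pattern.compile("\\b\\d+(?:\\.\\d+)?(?:%|(?:em|rem|vw|vh)\\b)",
                    Pattern.CASE_INSENSITIVE);

    /** Counts non-overlapping matches of the pattern in the given CSS text. */
    private static int count(Pattern pattern, String css) {
        Matcher matcher = pattern.matcher(css);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    static int countMediaQueries(String css) {
        return count(MEDIA_QUERY, css);
    }

    static int countPxLengths(String css) {
        return count(PX_LENGTH, css);
    }

    static int countRelativeLengths(String css) {
        return count(RELATIVE_LENGTH, css);
    }

    public static void main(String[] args) {
        String css = ".wrap { width: 960px; } "
                + ".col { width: 50%; font-size: 1.5em; } "
                + "@media (max-width: 600px) { .col { width: 100%; } }";
        System.out.println("media queries:    " + countMediaQueries(css));
        System.out.println("px lengths:       " + countPxLengths(css));
        System.out.println("relative lengths: " + countRelativeLengths(css));
    }
}
```

Counts like these can only flag candidates for a closer look; they cannot prove that a page is responsive.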

According to Ethan Marcotte's seminal article that introduced Responsive Web Design (http://alistapart.com/article/responsive-web-design), a responsive page will use media queries, flexible grid layouts and responsive text.
But, even if a page has these elements, it doesn't mean that it is using them correctly. A responsive page is not one that simply uses media queries.
I'm not sure that the ability to programmatically determine if a page is built responsively is even a viable goal. You can check for ingredients, but that won't tell you if the right recipe was followed.
Also, why have you tagged this question with Java?

Developing app to detect webpage change

I'm trying to make a desktop app in Java that tracks changes made to a webpage, both as a side project and to monitor when my professors add content to their webpages. I did a bit of research, and my current approach is to use the Jsoup library to retrieve the webpage, run it through a hashing algorithm, and then compare the current hash value with a previous hash value.
Is this a recommended approach? I'm open to suggestions and ideas since before I did any research I had no clue how to start nor what jsoup was.

One potential problem with your hashing method: if the page contains any dynamically generated content that changes on each refresh, as many modern websites do, your program will report that the page is constantly changing. Hashing the whole page will only work if the site does not employ any of this dynamic content (ads, hit counter, social media, etc.).
What specifically are you looking for that has changed? Perhaps new assignments being posted? You likely do not want to monitor the entire page for changes anyway. Therefore, you should use an HTML parser -- this is where Jsoup comes in.
First, parse the page into a Document object:
Document doc = Jsoup.parse(htmlString);
You can now perform a number of methods on the Document object to traverse the HTML Nodes. (See Jsoup docs on DOM navigation methods)
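For instance, say there is a table on the site, and each row of the table represents a different assignment. The following code would get the table by its ID and then each of its rows by selecting the table's <tr> tags.
// Needs org.jsoup.nodes.Element and org.jsoup.select.Elements
Element assignTbl = doc.getElementById("assignmentTable"); // null if no such ID exists
Elements tblRows = assignTbl.getElementsByTag("tr");
for (Element tblRow : tblRows) {
    // Inner HTML of one assignment row; store or compare it rather than discarding it
    String rowHtml = tblRow.html();
    System.out.println(rowHtml);
}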
You will need to somehow view the webpage's source code (such as Inspect Element in Google Chrome) to figure out the page's structure and design your code accordingly. This way, not only would the algorithm be more reliable, but you could take it much further, such as extracting the details of the assignment that has changed. (If you would like assistance, please edit your question with the target page's HTML.)
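The hashing idea from the question and the parsing idea above can be combined: hash only the extracted fragment rather than the whole page. The following is a minimal sketch of that comparison using only the standard library. The class name FragmentChangeDetector and its methods are hypothetical, and the fragment strings stand in for whatever text the parser extracts (for example, the HTML of the assignment table).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: detects changes in an extracted page fragment by comparing hashes. */
final class FragmentChangeDetector {

    /** Returns the SHA-256 digest of the text as a lowercase hex string. */
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    /** True when the fragment's hash differs from the hash stored on the previous run. */
    static boolean hasChanged(String previousHash, String currentFragment) {
        return !sha256Hex(currentFragment).equals(previousHash);
    }

    public static void main(String[] args) {
        // Placeholder fragments; in practice these come from the parsed page.
        String lastWeek = "<tr><td>Assignment 1</td></tr>";
        String today = "<tr><td>Assignment 1</td></tr><tr><td>Assignment 2</td></tr>";

        String storedHash = sha256Hex(lastWeek);
        System.out.println("changed: " + hasChanged(storedHash, today));
    }
}
```

Storing only the hash keeps the saved state small, but it tells you only that something changed; to report what changed, the previous fragment itself would have to be kept.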

Display fast changing values in the browser

I have written a Java program, which reads numbers from different files. The numbers are added while being read from the files and the sum is displayed in a browser. The browser keeps on displaying the new sum getting created at every step.
I know how to display static values in a browser; I can use JavaScript. But I don't know what mechanism to use to display a continuously changing value.
Any help is appreciated!

You'll have to request the data to display from the server. You can use a data-binding library like Knockout to automatically update the page as the underlying model changes, or you can just use a library like jQuery to modify the DOM on your own.
Alternatively, you could keep a pipe open to the server using the Comet model: http://en.wikipedia.org/wiki/Comet_%28programming%29. However, it can be expensive to eat up a thread for long periods of time on your web server.
Good luck.

Check out Knockout.js (http://www.knockoutjs.com/). It is a framework for updating the UI automatically when the data changes.

Java Bing Image Search

I have a small application in Java that searches for images using Bing image search. The problem I am facing is that it only gets the first 20 images, maybe because a search on bing.com loads the first 20 images and then relies on infinite scrolling for the rest.
Is there any way to search more than 20 images using bing?
Cheers :)

I'm guessing this is because the site uses Ajax to populate the "infinite" scrolling list, as you call it.
You probably send an HTTP request and get the initial page (btw, in my browser I got 6 images across x 4 down, i.e. 24, not 20; thinking about it, maybe my client also got only 20 at first and fetched the last 4 with Ajax...), and you'd need to page through the rest by way of Ajax requests.
At a glance, the XHTML and associated JavaScript of the page are very dense and somewhat obfuscated; it would take a while to get oriented. An alternative to analyzing the page is to use a packet sniffer (such as Wireshark) to capture the requests that take place when you scroll down.
Essentially, this will likely expose some form of Ajax request, which you can then easily emulate in Java. Typically the Ajax response is easy to parse, whatever its nature (XML, JSON, gzip...).
A possible snag in this well-laid-out plan is if the data returned in the Ajax response is encrypted, for example where the extra images are bundled in some sort of envelope whose format you would then need to discover.
Depending on the actual task at hand, you may try alternatives such as automation within GreaseMonkey (on Firefox) or similar tools.
What about the Bing API?
Note that all the above approaches are akin to screen-scraping and hence quite sensitive to even minute changes in the Bing application, and, depending on effective usage and context, this could put the project in a legal grey area... A better approach may be to register and obtain a proper application ID with MS/Bing and to use the Bing API.

You are simulating a browser? Doesn't the Bing engine have an entry point for programs instead (a web service or so), which would make your task much easier?
EDIT: SDK appears to be here: http://msdn.microsoft.com/en-us/library/cc980922.aspx

Just wanted to post a direct answer to the question:
Bing uses Ajax (of course) for the infinite scroll. Each "tick" is based on a simple Ajax GET request, which acquires new images.
For instance, this url returns 30 results (121-151) in a "htmlraw" format based on the query "max payne".
http://www.bing.com/images/async?q=max+payne&format=htmlraw&first=121
Edit:
It works with the original url too, just add &first=NUMBER to the querystring. Example:
www.bing.com/images/search?q=payne&go=&form=QBLH&scope=images&filt=all&first=10
I am building my own bulk image collector (for a "learning project" for myself) and I found out that it is paginated like this.
FYI, Google and Bing are easy; Yahoo and Altavista (redundant, since its results come from Yahoo) are far more problematic, because they don't expose the direct link to the original image.
Have fun! :)
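The paging scheme described in this answer can be sketched as a small URL builder. This is a minimal sketch under assumptions: it reuses the async URL quoted above as-is (the endpoint may no longer behave this way), it assumes that the first parameter is a 1-based offset and that each request returns about 30 results, and the class name BingPageUrls is hypothetical. It only builds the URLs and does not send any request. The URLEncoder.encode(String, Charset) overload needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds one URL per "tick" of the infinite scroll. */
final class BingPageUrls {
    // Endpoint copied from the answer above; not verified against the current site.
    private static final String BASE = "http://www.bing.com/images/async";

    /**
     * Builds the URLs for the given number of pages.
     * Assumption: "first" is a 1-based offset, so page n starts at 1 + n * pageSize.
     */
    static List<String> pageUrls(String query, int pageSize, int pages) {
        String encodedQuery = URLEncoder.encode(query, StandardCharsets.UTF_8);
        List<String> urls = new ArrayList<>();
        for (int page = 0; page < pages; page++) {
            int first = 1 + page * pageSize;
            urls.add(BASE + "?q=" + encodedQuery + "&format=htmlraw&first=" + first);
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : pageUrls("max payne", 30, 5)) {
            System.out.println(url);
        }
    }
}
```

With a page size of 30, the fifth URL ends in first=121, which matches the example URL quoted in the answer. Each URL would then be fetched and parsed in turn until a response contains no more images.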

This can be done by using the count parameter. For example, I tried the call GET "https://api.cognitive.microsoft.com/bing/v7.0/images/search?q=shoes&mkt=en-us&count=30" and it returned 30 images.
