I'm using PHP to scrape some information off webpages, but I've discovered that the info I'm trying to scrape is loaded through some manner of AJAX/JavaScript. I thought I remembered that cURL could execute the JavaScript, but I've found that's not the case.
I seem to remember some sort of backend "web browser" library/function that could trace through JavaScript and AJAX to get at the final page result that a full-featured browser would arrive at.
Is there a library or function that can do this? Any ideas on how to go about this, other than having to manually trace through the scripts/redirects myself? It doesn't have to be pretty -- I'm just looking to scrape the resulting text.
Maybe not in PHP, but in other languages there are: Watir/WatiN, Selenium, watir-webdriver/selenium-webdriver, capybara-webkit, Celerity, and PhantomJS; Node.js runs JS directly. There are also iMacros and similar commercial options.
But I usually find that I can get the data I want without any of these by just looking at the requests the page is making, recreating them, and parsing the responses.
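For illustration, here is a minimal Java 11+ sketch of that approach; the endpoint URL and headers are hypothetical stand-ins for whatever you find in the browser's network tab (the same idea works with cURL in PHP):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AjaxEndpointFetch {
    public static void main(String[] args) throws Exception {
        // Hypothetical JSON endpoint discovered via the browser's dev tools
        String endpoint = "https://example.com/api/items?page=1";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                // Some endpoints check for this header before answering
                .header("X-Requested-With", "XMLHttpRequest")
                .header("Accept", "application/json")
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // The AJAX response is usually JSON or an HTML fragment; parse as needed
        System.out.println(response.body());
    }
}
```

Hitting the data endpoint directly is usually faster and far more robust than driving a full browser, since you skip rendering entirely.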
I don't think there is such a library. If you're really desperate and have lots of time on your hands, you can, of course, download the source code of Firefox, for example, and build yourself something useful. However, I don't think this would be the best use of your or anybody else's resources.
Note that even Google's indexing bot does not process AJAX. Here is what Google has to say about it. It's quite possible that the site you're dealing with does support this, in which case you can try using Google's technique, but on the whole, unfortunately, you're out of luck.
Related
I would like to make an Android application for a website. I would like the app to be more than just a WebView displaying the website; that would be pretty pointless. Has anyone done this before who wouldn't mind helping me out?
It depends on what you would like to do. I love using jsoup.
If it is an app that simply scrapes and displays data from a webpage, that would be the easiest route. If you have a login on that webpage, it is already getting a bit more difficult, but still quite easily doable.
In any case you will have to use some sort of HTML parser/scraper if you want to write an app for an existing webpage, and in my opinion jsoup is one of the best and easiest to handle, because in many cases the complete webpage does not have to be downloaded. It can check for specific tags, classes, names, and elements on a webpage and select only the things you need.
Have a look for yourself at jsoup.org.
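To give a feel for it, here is a minimal jsoup sketch; the URL and CSS selector are placeholders you would swap for the real page's structure:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page in one step (placeholder URL)
        Document doc = Jsoup.connect("https://example.com/news").get();

        // Select only the elements you need via CSS selectors
        for (Element headline : doc.select("h2.headline a")) {
            System.out.println(headline.text() + " -> " + headline.attr("abs:href"));
        }
    }
}
```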
I'm trying to build an android app that will log into a website, scrape the website for data specific to the user, then format that data nicely on a mobile screen.
I've noticed that there are several similar questions to my own, and after reading some of the documentation, I am still very confused as to how I should go about this.
Here's what I know
The site that I want to log into utilizes ASP.NET, and its login.aspx uses POST for the login form.
There is no API for this website
There is also no single sign on
I'm very new to Android and a novice Java programmer at best. Will someone please help me carve the path of research I need to do in order to write this app? I feel that I mostly need help with connecting to the website and getting the data; I'll be able to figure out the layouts and formatting myself.
I am more than willing to research and read whatever is necessary, but I would like to minimize any irrelevant information that would ultimately lead to more confusion.
Thank you in advance for the help
For the purpose of accessing the website, all you need to know about is HTTP. It doesn't matter whether your target website is built with PHP, ASP, etc., as your only concern is how to communicate with the website through HTTP, which is independent of the technology used by the website. You can try Wikipedia for descriptions of the HTTP methods.
It might be worthwhile reading the Java URL tutorials for how to use the relevant Java classes. As regards extracting the data itself, you might want to read up on parsers. This link might give you some first ideas.
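As a concrete starting point from those tutorials, here is a minimal sketch using the plain java.net.URL class (the URL is a placeholder):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class UrlFetch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; any page reachable over HTTP will do
        URL url = new URL("https://example.com/");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw HTML, ready to hand to a parser
            }
        }
    }
}
```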
I'd use the jsoup API. I've posted many, many threads on that issue. This'll probably help you log in. I don't know how Android manages SSL certificates, so that you'll have to research on your own. But this right here is a good start:
Jsoup Cookies for HTTPS scraping
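The basic pattern from that thread looks roughly like the sketch below. The URLs and form field names are hypothetical; inspect the real login form's HTML to find them:

```java
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupLogin {
    public static void main(String[] args) throws Exception {
        // POST the credentials to the login form (placeholder URL and field names).
        // Note: ASP.NET forms typically also require hidden fields such as
        // __VIEWSTATE; scrape them from the login page first and add them via .data().
        Connection.Response loginResponse = Jsoup.connect("https://example.com/login.aspx")
                .data("username", "me")
                .data("password", "secret")
                .method(Connection.Method.POST)
                .execute();

        // Carry the session cookies over to subsequent requests
        Map<String, String> cookies = loginResponse.cookies();

        Document accountPage = Jsoup.connect("https://example.com/account")
                .cookies(cookies)
                .get();
        System.out.println(accountPage.select("#balance").text());
    }
}
```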
I'm looking around for a crawling tool, written in Java, to detect invalid url's in our sites.
The difficulty is that many of the URLs are generated with JavaScript, CSS3, and AJAX, so just fetching the content at the site's URLs wouldn't do.
The ideal would be a headless tool that is able to execute the JavaScript, CSS styling, and AJAX calls, and spits out the various URLs it accessed in doing so.
I do realize this is a tall order, but maybe it exists somewhere ?
I suggest using HtmlUnit (http://htmlunit.sourceforge.net/), which is made for exactly those things.
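A minimal sketch of the idea, assuming a recent HtmlUnit version (where WebClient is AutoCloseable) and a placeholder start URL:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class LinkCollector {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Let the page run its JavaScript so dynamically added links appear
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage("https://example.com/");
            for (HtmlAnchor anchor : page.getAnchors()) {
                System.out.println(anchor.getHrefAttribute());
            }
        }
    }
}
```

From there, checking each collected URL for a non-200 response gives you the invalid-link report.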
Apache HttpComponents Client: http://hc.apache.org/httpcomponents-client-ga/index.html
I am looking to develop an app that will take login details from the user, go to a website, login, return values on the web page and then display them to the user on the phone.
Does Java have this functionality? Will I need to use JavaScript instead, maybe? Do these answers depend on the website I am trying to access?
In my head I figure that I could just read in the parameters as strings or chars, parse the webpage for the appropriate form, and "paste" the appropriate value into the form "box". However, I have never attempted anything like this in code, so I am completely new to the idea and don't really know where to start. I tried Googling around, but any information I found was either irrelevant or conflicting.
I'm not looking for the code to do it, because I won't really learn anything from that, but a finger pointed in the right direction would be great. I really do want to get better at programming; that's why I've started giving myself these little side projects.
Any help that can be offered would be great
Ian,
You can try using the HttpClient (http://hc.apache.org/httpclient-3.x/) lib from Apache. It lets you programmatically access a website from Java code. You will need to do the following (a sketch follows these steps):
1. Use the HttpClient lib to POST the data to the web site.
2. Receive the HTML response.
3. Use an HTML parser or XPath to retrieve the values from the response HTML.
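Here is a rough sketch of steps 1 and 2 with the HttpClient 3.x API; the form action URL and field names are hypothetical, so inspect the real page's form first:

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;

public class FormPost {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();

        // Hypothetical form action and field names; read them from the real <form>
        PostMethod post = new PostMethod("https://example.com/login");
        post.addParameter("username", "me");
        post.addParameter("password", "secret");

        int status = client.executeMethod(post);      // step 1: POST the data
        String html = post.getResponseBodyAsString(); // step 2: the HTML response
        post.releaseConnection();

        System.out.println("HTTP " + status);
        // Step 3: hand `html` to an HTML parser or XPath library to pull out values
    }
}
```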
You would need a script which accesses the webpage and enters the data, but in my opinion this is illegal, because you are accessing a secured area and are able to look into sensitive data. Also, accessing the page via a script is "botting"; most pages have safety precautions to prevent the execution of scripts, because most of them are harmful.
In my opinion there is no legal and easy solution to this.
I need to screen scrape some data from a website, because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself using Apache's HTTP client library to make the relevant HTTP calls to download the data. I figured out the relevant calls I needed to make by clicking through the relevant screens in a browser while using the Charles web proxy to log the corresponding HTTP calls.
As you can imagine this is a fairly tedious process, and I'm wondering if there's a tool that can actually generate the Java code that corresponds to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure it supports this exact use case.
Thanks,
Don
I would also add a +1 for HtmlUnit, since its functionality is very powerful: if you need behaviour "as though a real browser were scraping and using the page", it's definitely the best option available. HtmlUnit executes (if you want it to) the JavaScript in the page.
It currently has full-featured support for all the main JavaScript libraries and will execute JS code that uses them. Correspondingly, you can get handles to the JavaScript objects in the page programmatically within your test.
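For instance, a small sketch (placeholder URL) of running JS in the page's context and reading the result back through HtmlUnit:

```java
import com.gargoylesoftware.htmlunit.ScriptResult;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class JsHandle {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = webClient.getPage("https://example.com/");

            // Evaluate arbitrary JS in the page and get a handle on the result
            ScriptResult result = page.executeJavaScript("document.title");
            System.out.println(result.getJavaScriptResult());
        }
    }
}
```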
If, however, the scope of what you are trying to do is smaller, more along the lines of reading some of the HTML elements, and you don't much care about JavaScript, then using NekoHTML should suffice. It's similar to JDOM, giving programmatic, rather than XPath, access to the tree. You would probably need to use Apache's HttpClient to retrieve pages.
The manageability.org blog has an entry which lists a whole bunch of web page scraping tools for Java. However, I do not seem to be able to reach it right now, but I did find a text only representation in Google's cache here.
You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.
I would say I personally like to use HtmlUnit and Selenium as my two favorite tools for screen scraping.
A tool called The Grinder allows you to script a session to a site by going through its proxy. The output is Python (runnable in Jython).