I am looking to develop an app that will take login details from the user, go to a website, log in, read values from the web page, and then display them to the user on the phone.
Does Java have this functionality? Will I need to use JavaScript instead? Do these answers depend on the website that I am trying to access?
In my head I figure that I could just read in the parameters as strings or chars, parse the web page for the appropriate form, and "paste" the right value into the form "box". However, I have never attempted anything like this in code, so I am completely new to the idea and don't really know where to start. I tried googling around, but any information I found was either irrelevant or conflicting.
I'm not looking for the code to do it, because I won't really learn anything from that, but a pointer in the right direction would be great. I really do want to get better at programming, which is why I've started giving myself these little side projects.
Any help that can be offered would be great.
Ian,
You can try using the HttpClient library (http://hc.apache.org/httpclient-3.x/) from Apache. It lets you programmatically access a website from Java code. You will need to do the following things (a rough sketch follows the list):
Use the HttpClient lib to POST the login data to the website.
Receive the HTML response.
Use an HTML parser or XPath to retrieve the values from the response HTML.
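A minimal sketch of those three steps, assuming the Commons HttpClient 3.x API that the link above points to and Jsoup as the HTML parser (Jsoup is my own choice here, not something the answer prescribes; the login URL, form field names, and CSS selector are made-up placeholders):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LoginScrape {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();

        // 1. POST the login data to the site (URL and parameter names are hypothetical)
        PostMethod post = new PostMethod("https://example.com/login");
        post.addParameter("username", "ian");
        post.addParameter("password", "secret");
        client.executeMethod(post);

        // 2. Receive the HTML response
        String html = post.getResponseBodyAsString();
        post.releaseConnection();

        // 3. Parse the HTML and pull out the value you want
        Document doc = Jsoup.parse(html);
        String value = doc.select("#account-balance").text(); // selector is a placeholder
        System.out.println(value);
    }
}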
You would need a script which accesses the web page and enters the data, but in my opinion this is illegal, because you are accessing a secured area and are able to look at sensitive data. Also, accessing the page via a script is "botting" - most sites have safety precautions in place to block automated scripts, because many of them are harmful.
In my opinion there is no legal and easy solution to this.
As part of a job application I need to make a web app. I'm only familiar with Java SE, so here are my concerns. I need to make a web service where at the beginning there will be an authentication window; then I need to show JSON data (probably parse it and display it) as a table or a list, with a button next to each row to choose one of the rows from the table and get to the next page, where the user can choose materials, and so on.
I have JSON data on a server and I need to pull it from there, then show data from a URL that looks like /materialDetails?ID=x, where x is the ID (it's probably HTTP or a URI). Should I use Java REST? If so, do I need to create a site in XML and then put Java data inside it? There are only a few tutorials on the internet and I can't find any good ones (sometimes the problem is with the server, sometimes with dependencies). I was also looking for information on YouTube, but apart from https://www.youtube.com/watch?v=X36Dud8cS4Y I can't find anything useful. Could someone explain this to make it at least a little bit easier, or just lead me to a specific framework? Thanks in advance.
You could create a Dynamic Web Project with Tomcat and a MySQL database for starters. You could use RESTEasy to create a web service that gets data from your database.
I don't know exactly what is expected of you, but this might be a good start. "Making a web app" is a bit like saying "I need to develop a Java program" - it is a bit vague!
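If you do go the REST route, the /materialDetails?ID=x URL from the question maps quite naturally onto a JAX-RS resource, which RESTEasy can serve. This is only a hedged sketch - the class, the Material DTO, and the hard-coded data are made up, and you would still wire in the database lookup yourself:

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

// Hypothetical resource class; RESTEasy (or any JAX-RS implementation) exposes it over HTTP.
@Path("/materialDetails")
public class MaterialDetailsResource {

    // Handles GET /materialDetails?ID=x and returns JSON
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public Material getMaterial(@QueryParam("ID") int id) {
        // In a real app you would look this up in your MySQL database (e.g. via JDBC or JPA)
        return new Material(id, "example material");
    }

    // Simple DTO that the JSON provider serializes automatically
    public static class Material {
        public int id;
        public String name;

        public Material(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }
}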
I don't know REST, but I think your application can be implemented with these technologies: HTML and Servlets/JSPs.
I would write an authentication page in HTML (one form element with two inputs and a button) which would pass the credentials to a JavaServer Page or a Servlet (they're equivalent). There I would build the table (another HTML element), thus producing a new HTML page. A sketch of the servlet side follows.
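To make that flow concrete, here is a hedged sketch of the servlet the login form could POST to (the URL pattern, field names, and validation logic are placeholders of mine, not part of the answer above):

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet that the HTML login form submits to
@WebServlet("/login")
public class LoginServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // These names must match the <input name="..."> attributes in the HTML form
        String user = req.getParameter("username");
        String pass = req.getParameter("password");

        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        if (isValid(user, pass)) { // your own check, e.g. against a database
            out.println("<table><tr><td>data rows go here</td></tr></table>");
        } else {
            out.println("<p>Login failed</p>");
        }
    }

    private boolean isValid(String user, String pass) {
        return user != null && !user.isEmpty(); // placeholder logic only
    }
}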
P.S.: you're using JSON as a format so there's no need to learn XML.
I am trying to scrape a website using WebClient. I am able to get the data on the first page and parse it, but I do not know how to read the data on the second page; the website calls a JavaScript function to navigate to the second page. Can anyone suggest how I can get the data from the next pages?
Thanks in advance
The problem you're going to have is that while you (a person) can read the JavaScript in the first page and see that it navigates to another page, having the computer do this is going to be hard.
If you could identify the block of code performing the navigation, you would then need to execute it in such a way that allowed your program to extract the URL. This again is going to be very specific to the structure of the JavaScript and would require a person to identify it.
In short, I think you're dead in the water with this one, though it serves as a good example of why the Unobtrusive JavaScript concept is so important.
This framework integrates HtmlUnit, with its headless, JavaScript-enabled browser, to fully support scraping multiple pages in the same WebClient session: https://github.com/subes/invesdwin-webproxy
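For the simpler case, HtmlUnit on its own can often follow a JavaScript-driven "next page" link inside one WebClient session. A hedged sketch - the URL, the link text, and the wait time are placeholders, and your page may need a different way of locating the element that triggers the navigation:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class TwoPageScrape {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);

            // Load the first page and parse whatever you need from it
            HtmlPage first = webClient.getPage("https://example.com/results");

            // Click the element whose onclick handler navigates to page two
            HtmlAnchor nextLink = first.getAnchorByText("Next");
            HtmlPage second = nextLink.click();

            // Give any background JavaScript a moment to finish
            webClient.waitForBackgroundJavaScript(5000);

            System.out.println(second.asXml());
        }
    }
}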
I'm using PHP to scrape some information off web pages; however, I've discovered that the info I'm trying to scrape is loaded through some manner of AJAX/JavaScript. I thought I remembered that cURL could step through the JavaScript, but I've found that that's not the case.
I seem to remember some sort of backend "web browser" library/function that could trace through JavaScript and AJAX to get at the final page that a fully functional browser would arrive at.
Is there a library or function that can do this? Any ideas on how to go about this, other than having to manually trace through the scripts/redirects myself? It doesn't have to be pretty -- I'm just looking to scrape the resulting text.
Maybe not in PHP, but in other languages there are: Watir/WatiN, Selenium, watir/selenium-webdriver, capybara-webkit, Celerity, Node.js (which runs JS directly), as well as PhantomJS. There are also iMacros and similar commercial options.
But I usually find that I can get the data I want without any of these, just by looking at the requests the page is making and recreating them / parsing the response myself.
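To illustrate that approach (in Java, to match the other threads on this page, and with a completely made-up endpoint): once you have found the AJAX request in your browser's network tab, you can usually just replay it yourself and parse whatever it returns.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ReplayAjaxRequest {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint discovered by watching the page's network traffic
        URL url = new URL("https://example.com/api/items?page=2");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Some sites check these headers before answering AJAX calls
        conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");
        conn.setRequestProperty("Accept", "application/json");

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }
        // The response is usually JSON or an HTML fragment; parse it with
        // whatever JSON/HTML library you prefer.
        System.out.println(body);
    }
}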
I don't think there is such a library. If you're really desperate and have lots of time on your hands, then you can, of course, download the source code of Firefox, for example, and build yourself something useful. However, I don't think this is going to be the best use of your or anybody else's resources.
Note that even Google's indexing bot does not process AJAX. Here is what Google has to say about it. It's quite possible that the site you're dealing with does support this, in which case you can try using Google's technique, but on the whole, unfortunately, you're out of luck.
I'm writing a Perl program that was doing a simple GET to retrieve results and process them. But the site has been updated and now has a JavaScript component that handles the results (so the actual data is not in the page source anymore).
This is the site:
http://wro.westchesterclerk.com/legalsearch.aspx
Try putting in:
Index Number: 11103
Year: 2009
I want to be able to programmatically enter the "index number" and "year" at the bottom of the form where it says "search by number" and then retrieve the results listed next to it.
I've written many programs in Perl that simply pass variables via the URL, and the results are listed in the page source, so it's easy to parse. (Using LWP::Simple)
Like:
$html = get("http://www.url.com?id=$somenum&year=$someyear");
But this is totally new to me and I don't know where to begin.
I'm somewhat familiar with LWP::UserAgent and Mechanize.
I'd really appreciate any help.
Thanks!
That sort of question gets asked a lot. The standard answer is Wireshark.
I just used it on that website with the test data you gave and extracted the single POST request that is responsible for the results. This lets you bypass the JavaScript altogether.
It might be more logical for you to use one of the modules which drive a browser - something like Mozilla::Mechanize or the Selenium tools.
A browser knows best how to interact with the server using AJAX, re-render the DOM, and so on, so build your script on top of that ability; a rough Selenium sketch follows.
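As a hedged illustration of the browser-driving approach, here is what it could look like with Selenium's Java bindings (Selenium has Perl bindings too; the element ids are placeholders of mine - inspect the live form for the real, much longer ASP.NET ids, which another answer below also shows):

import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class LegalSearch {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://wro.westchesterclerk.com/legalsearch.aspx");

            // Placeholder ids - use your browser's inspector to find the real ones
            driver.findElement(By.id("txtIndexNo")).sendKeys("11103");
            driver.findElement(By.id("txtYear")).sendKeys("2009");
            driver.findElement(By.id("btnSearch")).click();

            // Wait for the JavaScript-rendered results to appear, then read them from the DOM
            new WebDriverWait(driver, Duration.ofSeconds(10))
                    .until(ExpectedConditions.presenceOfElementLocated(By.id("resultsGrid")));
            System.out.println(driver.findElement(By.id("resultsGrid")).getText());
        } finally {
            driver.quit();
        }
    }
}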
What you're asking to do in this case is hard. Not impossible, but hard.
Method A:
You can sift through their JavaScript code. What their "AJAX" is doing is making a GET/POST request to another web page and dynamically loading the results. If you can decipher what that link is and the proper arguments, you can continue to use get. I would recommend getting the Firebug plugin and any other tool that will help you de-obfuscate their JavaScript.
Another Method:
If your program can access a web browser (with javascript: URL support, like Firefox), you could programmatically go to these addresses, then wait a moment and get your data.
http://wro.westchesterclerk.com/legalsearch.aspx
javascript: function go() { document.getElementById('ctl00_tbSearchArea__ctl1_cphLegalSearch_splMain_tmpl0_tbLegalSearchType__ctl0_txtIndexNo').value=11109; document.getElementById('ctl00_tbSearchArea__ctl1_cphLegalSearch_splMain_tmpl0_tbLegalSearchType__ctl0_txtYear').value='09'; searchClick(); } go();
This is a method we have used along with mozembed to programmatically get around this stuff. Recently we switched to WebKit. And to keep it from taking up a real display, we have used Xvfb/Xvnc to create a virtual desktop to load the browser in.
Those are the methods I have come up with so far. Let me know if you come up with another. I hope this helps.
I need to screen scrape some data from a website, because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself using Apache's HTTP client library to make the relevant HTTP calls to download the data. I figured out the relevant calls I needed to make by clicking through the relevant screens in a browser while using the Charles web proxy to log the corresponding HTTP calls.
As you can imagine this is a fairly tedious process, and I'm wondering if there's a tool that can actually generate the Java code that corresponds to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure if it supports this exact use case.
Thanks,
Don
I would also add a +1 for HtmlUnit, since its functionality is very powerful: if you need behaviour 'as though a real browser were scraping and using the page', it's definitely the best option available. HtmlUnit executes (if you want it to) the JavaScript in the page.
It currently has full-featured support for all the main JavaScript libraries and will execute JS code that uses them. Correspondingly, you can get handles to the JavaScript objects in the page programmatically within your test.
If, however, the scope of what you are trying to do is smaller - more along the lines of reading some of the HTML elements, where you don't much care about JavaScript - then NekoHTML should suffice. It's similar to JDOM in giving programmatic - rather than XPath - access to the tree. You would probably need to use Apache's HttpClient to retrieve the pages.
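A rough sketch of that NekoHTML route (none of this is from the answer itself - the URL and the tag name are placeholders; the DOMParser usage follows NekoHTML's documented pattern, and NekoHTML reports element names in upper case by default):

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class NekoExample {
    public static void main(String[] args) throws Exception {
        // NekoHTML tolerates messy real-world HTML and hands back a standard W3C DOM
        DOMParser parser = new DOMParser();
        parser.parse("https://example.com/"); // placeholder URL

        Document doc = parser.getDocument();
        NodeList tables = doc.getElementsByTagName("TABLE"); // element names are upper-cased
        for (int i = 0; i < tables.getLength(); i++) {
            System.out.println(tables.item(i).getTextContent());
        }
    }
}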
The manageability.org blog has an entry which lists a whole bunch of web page scraping tools for Java. I do not seem to be able to reach it right now, but I did find a text-only copy in Google's cache here.
You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.
Personally, HtmlUnit and Selenium are my two favorite tools for screen scraping.
A tool called The Grinder allows you to script a session to a site by going through its proxy. The output is Python (runnable in Jython).