Web Data Extraction and Form Filling - Java

I am currently starting development of a (UI-level?) backup tool for a web platform. It is not our platform and I don't have access to its source.
All I have is the HTML-rendered view of the form data I entered.
So the task is to browse to the HTML, store the data (as XML/JSON), and then log in to the site to fill out the forms again and resubmit the data...
At the moment I'm prototyping with QtWebEngine in C++.
What's the best way to do such a task? What are good frameworks for "browsing" the web and analysing HTML?
Solutions in C++/Java/JavaScript (or a Firefox add-on?) are preferred.
Thanks for your help!

As with a DSL interpreter, work against the Document Object Model (DOM).
My advice: a C# WinForms app with the WebBrowser control:
webBrowser.Navigate(url)
the WebBrowser.DocumentCompleted event
WebBrowser.Document (read the document, and see the help for System.Windows.Forms.HtmlDocument)
You may also need to inject some JavaScript into the page.
(Please don't use this information for hacking or attacks.)
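
If you'd rather stay in Java (which the question says it prefers), a rough analogue of the same Navigate/DocumentCompleted/Document pattern is JavaFX's WebView. This is only a sketch; the URL and element id are placeholders:

import javafx.application.Application;
import javafx.concurrent.Worker;
import javafx.scene.Scene;
import javafx.scene.web.WebEngine;
import javafx.scene.web.WebView;
import javafx.stage.Stage;

public class FormReader extends Application {
    @Override
    public void start(Stage stage) {
        WebView view = new WebView();
        WebEngine engine = view.getEngine();
        // like WebBrowser.DocumentCompleted: react once the page has loaded
        engine.getLoadWorker().stateProperty().addListener((obs, oldState, newState) -> {
            if (newState == Worker.State.SUCCEEDED) {
                // like WebBrowser.Document: read a form field via injected JavaScript
                Object value = engine.executeScript(
                        "document.getElementById('firstName').value"); // hypothetical field id
                System.out.println("field value: " + value);
            }
        });
        engine.load("https://example.com/form"); // placeholder URL
        stage.setScene(new Scene(view));
        stage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}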

You could definitely do something like this using Firefox's Add-on SDK. In particular, look into the Page Worker module, which allows you to load web pages and run JS code against them without showing the page; everything happens in the background.


What should I use to create a web services app?

As part of a job application I need to make a web app. I'm only familiar with Java SE, hence my concerns. I need to make a web service where, at the beginning, there is an authentication window; then I need to show JSON data (probably parsed) as a table or a list, with a button near each row so the user can choose one and get to the next page, where they can choose materials, and so on.
I have the data as JSON on a server and need to pull it from there; then I need to show data addressed like /materialDetails?ID=x, where x is an ID (so plain HTTP/URI). Should I use Java REST? If so, do I need to create the site in XML and then put the Java data inside? There are only a few tutorials on the internet and I can't find any good ones (sometimes the problem is the server, sometimes the dependencies). I also looked on YouTube, but apart from https://www.youtube.com/watch?v=X36Dud8cS4Y I can't find anything useful. Could someone explain this to make it at least a little bit easier, or just point me to a specific framework? Thanks in advance.
You could create a Dynamic Web Project with Tomcat and a MySQL database for starters, and use RESTEasy to create a web service that gets data from your database.
I don't know what exactly is expected from you, but this might be a good start. "Making a web app" is a bit like saying "I need to develop a Java program": it's a bit vague!
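For illustration, a minimal JAX-RS resource for the /materialDetails?ID=x URL from the question might look like the sketch below. The class and field names are made up, and RESTEasy takes care of the JSON serialisation if a provider such as Jackson is on the classpath:

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

@Path("/materialDetails")
public class MaterialResource {

    // GET /materialDetails?ID=42 -> {"id":42,"name":"example material"}
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public Material byId(@QueryParam("ID") int id) {
        // a real implementation would look the material up in the MySQL database
        return new Material(id, "example material");
    }

    public static class Material {
        public int id;
        public String name;

        public Material(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }
}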
I don't know REST, but I think your application can be implemented with these technologies: HTML and Servlets/JSPs.
I would write an authentication page in HTML (one form element with two inputs and a button) which passes the credentials to a Java Server Page or a servlet (they're roughly equivalent; JSPs compile to servlets). There I would build the table (another HTML element), producing a new HTML page.
P.S.: you're using JSON as the data format, so there's no need to learn XML.
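As a sketch of the servlet side (the paths, parameter names, and credential check are placeholders, not a real authentication scheme):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// receives the POST from the HTML login form
@WebServlet("/login")
public class LoginServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String user = req.getParameter("username"); // input names from the HTML form
        String pass = req.getParameter("password");
        if ("admin".equals(user) && "secret".equals(pass)) { // placeholder check only
            resp.sendRedirect("materials.jsp"); // page that renders the JSON as a table
        } else {
            resp.sendError(HttpServletResponse.SC_UNAUTHORIZED);
        }
    }
}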

Accessing HTML elements from Java

I need to access HTML elements from my Java program based on the id or class name of the element (like getElementById or getElementsByClassName). I also need to be able to click a few buttons on the page.
Main Points:
I am creating a desktop application; it is not a web app.
I need a browserless solution.
My code needs to auto-fill a form and submit it without opening a page.
Are there any libraries out there that could suit my needs, or could I achieve this in plain Java code? If my question isn't clear please let me know and I will try to explain it in a better way. Thank you.
Selenium is a library for driving a browser and scraping HTML elements from Java; it is often used for testing. There is another library called Vaadin, used for doing front-end work on the back-end in Java; it is similar to Java Swing, but for the web. Hope this helps!
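A small sketch of the Selenium approach, running Chrome headless so no window opens (the URL and element ids are placeholders):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class FormFiller {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless"); // no visible browser window
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com/form");                 // placeholder URL
            driver.findElement(By.id("name")).sendKeys("Jane Doe"); // cf. getElementById
            driver.findElement(By.className("accept")).click();     // cf. getElementsByClassName
            driver.findElement(By.id("submit")).click();
        } finally {
            driver.quit();
        }
    }
}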

Scrape multiple pages of a website using WebClient in Java

I am trying to scrape a website using WebClient. I am able to get the data on the first page and parse it, but I do not know how to read the data on the second page; the website calls JavaScript to navigate to the second page. Can anyone suggest how I can get the data from the next pages?
Thanks in advance.
The problem you're going to have is that while you (a person) can read the JavaScript in the first page and see that it navigates to another page, having the computer do this is going to be hard.
If you could identify the block of code performing the navigation, you would then need to execute it in a way that allowed your program to extract the URL. Again, this is going to be very specific to the structure of the JavaScript and would require a person to identify it.
In short, I think you're dead in the water with this one, though it serves as a good example of why the Unobtrusive JavaScript concept is so important.
This framework integrates HtmlUnit, with its headless, JavaScript-enabled browser, to fully support scraping multiple pages in the same WebClient session: https://github.com/subes/invesdwin-webproxy
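For example, with plain HtmlUnit, clicking a JavaScript-driven link returns whatever page the script navigates to, all within the same WebClient session. A sketch, with placeholder URL and selector (exact package names vary by HtmlUnit version):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class MultiPageScraper {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true); // run the page's JavaScript
            HtmlPage page = client.getPage("https://example.com/results"); // placeholder URL
            HtmlAnchor next = page.getFirstByXPath("//a[@id='next']");     // placeholder selector
            // HtmlUnit executes the JS click handler and follows the navigation
            HtmlPage page2 = next.click();
            System.out.println(page2.asXml());
        }
    }
}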

Web interface for a Java application

I recently developed a whole system in Java that connects to a database and exports and imports table content to an Excel sheet. I used Swing for the user interface; the user interacts with it for authentication and file management.
Apparently the client has changed the requirements: he wants everything through a web interface. My team leader advised me to look into JSP.
What does JSP actually do?
Will I have to rewrite the user interface for the web if I use JSP?
Is there a more effective and efficient solution for this job?
I would appreciate a specific answer.
I'm not sure what you mean by "specific answer", but here goes:
JSP is a kind of template language, based on Java, and a technology for dynamically generating HTML. It's a server-side technology. Look here.
Yes, if you're going for a pure web/HTML solution, you'll need to completely rewrite the UI.
There are other frameworks for creating web apps, such as Vaadin or the Play Framework, that may be "better" than JSP, but then there's a whole new API/framework to learn...
Regarding:
What does JSP actually do?
Will I have to rewrite the user interface for the web if I use JSP?
Is there a more effective and efficient solution for this job?
given that you used Swing for the user interface and export/import table content to an Excel sheet: not necessarily; have a look at JavaFX 2.
You will certainly need to rewrite the user interface if you convert to JSPs.
JSPs are essentially just a method for dynamically generating HTML (with the option to embed Java code to produce parts of the page).
It is still possible to run Swing applications from a web browser: you might want to take a look at Java Web Start. This will save you from having to do a complete rewrite.
1.) JSP is pretty much like PHP: it is server-side scripting. Whenever a browser requests a (JSP) page, the server (mostly Tomcat, or whichever application server you deploy your JSP project to) generates the HTML content from the JSP code. A JSP mainly consists of HTML, JavaScript (if you want dynamic client-side behaviour), and Java.
2.) As far as I know, if you are asked to do it in JSP, then you need to redo all the client-side work in JSP; there you will be generating the HTML UI from Java code. But you can reuse all the back-end code you already have.
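To make point 1 concrete, a tiny hypothetical JSP might mix HTML and embedded Java like this (file name and parameter are made up):

<%-- hello.jsp: the server turns this into plain HTML before it reaches the browser --%>
<html>
  <body>
    <p>Server time: <%= new java.util.Date() %></p>
    <% String user = request.getParameter("user");   // plain Java, run server-side
       if (user != null) { %>
      <p>Hello, <%= user %>!</p>
    <% } %>
  </body>
</html>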
You may also be able to use Swing in an applet.

Autogenerate HTTP screen-scraping Java code

I need to screen-scrape some data from a website because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself using Apache's HttpClient library to make the relevant HTTP calls to download the data. I figured out which calls I needed to make by clicking through the relevant screens in a browser while using the Charles web proxy to log the corresponding HTTP calls.
As you can imagine this is a fairly tedious process, and I'm wondering if there's a tool that can actually generate the Java code corresponding to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure if it supports this exact use case.
Thanks,
Don
I would also add +1 for HtmlUnit, since its functionality is very powerful: if you need behaviour 'as though a real browser were scraping and using the page', it's definitely the best option available. HtmlUnit executes (if you want it to) the JavaScript in the page.
It currently has full-featured support for all the main JavaScript libraries and will execute JS code that uses them. Correspondingly, you can get handles to the JavaScript objects in the page programmatically within your test.
If, however, the scope of what you are trying to do is smaller, more along the lines of reading some HTML elements where you don't much care about JavaScript, then NekoHTML should suffice. It's similar to JDOM, giving programmatic (rather than XPath) access to the tree. You would probably need to use Apache's HttpClient to retrieve the pages.
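A sketch of that lighter-weight approach, assuming NekoHTML and Apache HttpClient 4.x are on the classpath (the URL is a placeholder; note NekoHTML upper-cases element names by default):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NekoScraper {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient http = HttpClients.createDefault();
             CloseableHttpResponse resp = http.execute(new HttpGet("https://example.com"))) {
            DOMParser parser = new DOMParser(); // NekoHTML's error-tolerant HTML parser
            parser.parse(new InputSource(resp.getEntity().getContent()));
            Document doc = parser.getDocument();
            NodeList links = doc.getElementsByTagName("A"); // upper-case per Neko's default
            for (int i = 0; i < links.getLength(); i++) {
                System.out.println(links.item(i).getTextContent());
            }
        }
    }
}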
The manageability.org blog has an entry listing a whole bunch of web-page scraping tools for Java. I don't seem to be able to reach it right now, but I did find a text-only copy in Google's cache here.
You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.
I would say HtmlUnit and Selenium are personally my two favorite tools for screen scraping.
A tool called The Grinder allows you to record a session against a site by going through its proxy; the output is Python (runnable in Jython).
