Providing input data for web scraping - java

I want to scrape data from the following site:
http://www.upmandiparishad.in/commodityWiseAll.aspx
There are two input elements, Commodity and Date. How do I provide these values and retrieve the resulting information?

To extract data from a web page from Java, you can use jsoup.
To provide the input values, you first need to understand how your browser submits them.
There are two common methods for a request-response between a client and a server:
GET - Requests data from a specified resource
POST - Submits data to be processed to a specified resource
You can find more about them in any HTTP reference or tutorial.
When you select the Commodity and the Date input values, you can investigate the methods used to provide those values to the server by examining network requests. For example, in Chrome, you can press F12 and select the Network tab to check the information being sent to and from the browser.
Once you know how the data is sent, you can form the same HTTP request via jsoup or a similar library.
For example, here is how you can provide simple input fields to your request:
Document doc = Jsoup.connect("http://example.com/")
.data("some_input_1", "some_data_1")
.data("some_input_2", "some_data_2")
.post();
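To see what jsoup's .data(...) calls actually produce on the wire, here is a minimal stdlib sketch that builds the same application/x-www-form-urlencoded POST body; the field names "commodity" and "date" are invented, and the real .aspx form uses whatever names you find in the Network tab (plus ASP.NET hidden fields such as __VIEWSTATE):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FormBody {
    // Build an application/x-www-form-urlencoded body from name/value pairs,
    // which is what chained .data(...) calls produce under the hood.
    static String encode(String... pairs) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pairs.length; i += 2) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(pairs[i], StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(pairs[i + 1], StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical field names; read the real ones from the Network tab.
        System.out.println(encode("commodity", "Wheat", "date", "01/01/2015"));
    }
}
```

Note how reserved characters are percent-encoded, so a value like "01/01/2015" arrives intact at the server.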
This is of course just to get you started; it is by no means a complete answer. You need to show real effort on your side to search for answers online, as there are plenty.
Here are just a few to get you started:
http://www.mkyong.com/java/how-to-send-http-request-getpost-in-java/
http://simplescrape.sourceforge.net/
http://www.xyzws.com/Javafaq/how-to-use-httpurlconnection-post-data-to-web-server/139
http://www.javaworld.com/article/2077532/learn-java/java-tip-34--posting-via-java.html

Related

Using POST method instead of GET in a rest API

Is there a specific scenario where we use a POST instead of a GET to implement the functionality of a get operation?
GET is supposed to get :) and POST is mainly used to add something new, or sometimes for updates as well (although PUT is recommended in such scenarios). There is no specific scenario where we should use a POST instead of a GET; if we find ourselves requiring this, we are probably doing something wrong. Nothing stops you from doing it, but it is bad design, and you should take a step back and plan your API carefully.
There are two practical points in POST's favour: it keeps parameters out of the URL, and it can send a large amount of data. But even so, I would not recommend using POST to simulate GET behaviour.
Let's look at the usage of GET and POST:
What is the GET method?
It appends form data to the URL in name/value pairs. The length of the URL is limited (around 2048 characters in many browsers). This method must not be used if you have a password or other sensitive information to send to the server. It is used for submitting forms where the user can bookmark the result, and is only suitable for data that is not sensitive. It cannot be used for sending binary data like images or Word documents. In PHP, the $_GET associative array gives access to all the information sent via the GET method.
What is the POST method?
It appends form data to the body of the HTTP request, so the data is not shown in the URL. This method has no restrictions on the size of the data to be sent. Form submissions made with POST cannot be bookmarked. It can be used to send ASCII as well as binary data, such as images and Word documents. The data still travels as plain text, so for real security you need HTTPS. POST is a little safer than GET because the parameters are not stored in browser history or in web server logs. In PHP, the $_POST associative array gives access to all the information sent via the POST method.
Source: https://www.edureka.co/blog/get-and-post-method/
So both the methods have their specific usage.
POST method is used to send data to a server to create or update a resource.
GET method is used to request data from a specified resource.
If you want to fetch data, use the GET method. If you want to create or update a resource, use POST. GET will not help you create or update resources. So the API you expose should match your needs.
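To make the distinction concrete, here is a minimal sketch with the JDK's built-in HTTP client (Java 11+); the URL is a placeholder, and the requests are only built, never sent:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class RestMethods {
    // Build (but do not send) a GET request: parameters ride in the URL's query string.
    static HttpRequest buildGet(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    // Build a POST request: the payload travels in the request body, invisible in the URL.
    static HttpRequest buildPost(String url, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(url))
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        System.out.println(buildGet("https://api.example.com/items?id=42").method());
        System.out.println(buildPost("https://api.example.com/items", "{\"name\":\"widget\"}").method());
    }
}
```

In a real application you would pass these requests to an HttpClient's send method; the point here is only where the data lives in each case.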
UPDATE
So your main question is: in what scenario can we use POST to implement the functionality of GET?
With a GET request you only fetch a resource. With a POST request you create or update a resource, and the response to that same request can carry a body. So suppose you create a new resource and then want to see it: making a POST call followed by a separate GET call for the same resource costs an extra round trip. You can skip the GET call and read the desired data from the POST response itself. That is the one scenario where POST saves you an extra GET call.

Java Http Request that only returns certain elements I want

Is there a method in Java to make a HTTP request to a webpage where the response will only be some specific elements I want instead of the whole document?
For example, if I was to request a <div> called "example", the response would be only that element and not the rest of the fluff that exists on the page, which I do not need.
Most methods I looked at, involve getting an entire HTML page and then parsing it. I want to look at the page and then just pluck out the div I want and only have that as a response. The pages I am dealing with contain a lot of advert content I want to ignore.
That's not possible. The way the web works is that you send an HTTP GET request for a page, and the server returns the entire page. What you do with it (parsing, etc.) is up to you, but you have no influence over the HTTP protocol.
This could, however, be realised if you hosted the page with a custom server/API that you implemented yourself. You could send a request with parameters specifying what you need, and the server could parse the HTML page server-side.
HTTP has nothing to do with the content of the page, it is simply a protocol that governs server requests and responses.
I understand what you want to do, you've just asked slightly the wrong question. Don't worry about HTTP, that is simply the protocol that governs server requests and responses (GET, PUT, POST, HEAD, OPTIONS).
The problem you are describing can only be handled after retrieval of the content is completed. You need to be working with the Document Object Model (DOM) that is the foundation of XML and XHTML. This means that you will need to familiarize yourself with DOM, and maybe XPath and XSL as well.
The functionality you are asking for can be implemented in many ways, but it generally boils down to a sequence of non-trivial operations:
Retrieve page content for URL (including negotiating encodings, HTTP redirects and protocol changes).
Clean up non-well-formed content (i.e., unclosed or improperly nested tags, e.g., using JTidy).
Parse page content into DOM.
Traverse DOM to find the nodes you are interested in (e.g., via DOM or XPath).
Build output DOM (e.g. via org.w3c.dom classes).
Write output DOM to file (combination of java.io and org.w3c.dom).
While it is possible to implement this from scratch, there are already a few open source projects that have this functionality, try something like jsoup: Java HTML Parser.
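As a small illustration of steps 3 and 4, assuming the content has already been fetched and tidied into well-formed XHTML, the JDK's own DOM and XPath classes can pluck out a single element (the markup below is invented):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class ExtractDiv {
    // Parse well-formed (X)HTML into a DOM and return the text content of the
    // first node matching the given XPath expression.
    static String extract(String html, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"example\">hello</div>"
                    + "<div id=\"ads\">noise</div></body></html>";
        // Keep only the div we care about, ignoring the advert content.
        System.out.println(extract(page, "//div[@id='example']"));
    }
}
```

Real-world pages are rarely well-formed, which is why step 2 (tidying, or using a forgiving parser like jsoup) matters in practice.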
No, it's not possible. HTTP GET/POST calls return the complete web page, not just a portion of it.

java - parsing an aspx website - post parameters

I have a client's e-shop, which was created by another company. I want to parse all the products and put them in an XML file. I know how to get to the first page of each brand, but I have difficulty passing the argument that changes the page in the paginated results.
This is the e-shop "http://www.gialia.net.gr/ProductCatalog/20/CAR.aspx" that points to one brand.
When I use Tamper Data on Firefox, I see that pressing the second page of the results posts:
"__EVENTTARGET=ctl00%24wpmMain%24wp131820866%24wp512420601%24dpgTop%24ctl01%24ctl01"
The last segment, "ctl01", means go to page 2; if I change it to "ctl02" it goes to page 3, and so on.
But I am trying to issue it as a GET request so that I can build these parameters dynamically in my Java code and parse each response. When I create the URL as:
http://www.gialia.net.gr/ProductCatalog/20/CAR.aspx?__EVENTTARGET=ctl00$wpmMain$wp131820866$wp512420601$dpgTop$ctl01$ctl02
I get no results.
Can someone please take a look and give me some suggestions?
The site you link to is very poorly designed as far as search engines (SEO) are concerned, so parsing its pages one by one is difficult.
Changing pages is done with a postback, driven by JavaScript only. So to move to the next page of the catalog you must do the same: make a full postback of the page with all its parameters.
Moreover, the programmer has apparently disabled __EVENTVALIDATION on the controls, which is why you are able to tamper with the data at all; but you still need to make a POST. By simply putting a single parameter on the URL, the code-behind does not recognise the request as a postback. You need to send at least the __VIEWSTATE and the rest of the hidden parameters.
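A rough sketch of that scraping step in Java: collect the hidden fields from the fetched page so they can be echoed back in the POST, then add the paging target. The regex here is for brevity only (a real page deserves an HTML parser), and the __EVENTTARGET value is just an example:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HiddenFields {
    // Collect <input type="hidden"> name/value pairs from an HTML page so they
    // can be sent back with the postback POST request.
    static Map<String, String> hiddenFields(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile(
            "<input[^>]*type=\"hidden\"[^>]*name=\"([^\"]+)\"[^>]*value=\"([^\"]*)\"")
            .matcher(html);
        while (m.find()) fields.put(m.group(1), m.group(2));
        return fields;
    }

    public static void main(String[] args) {
        String page = "<input type=\"hidden\" name=\"__VIEWSTATE\" value=\"abc123\" />";
        Map<String, String> postData = hiddenFields(page);
        // Example paging target; take the real value from your tampered request.
        postData.put("__EVENTTARGET", "ctl00$wpmMain$dpgTop$ctl01$ctl02");
        System.out.println(postData.get("__VIEWSTATE"));
    }
}
```

The resulting map is exactly what you would feed into your POST request as form data.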
But wouldn't it be easier to just ask your client for direct access to the database and read the products from there?

How can I send a newsletter with xPages content?

I have some content displayed using computed fields inside a repeat in my XPage.
I now need to be able to send out a newsletter (by email) every week with the content of this repeat. The content can be both plain text and HTML.
My site is also translated into different languages, so I need the code to be able to specify the language and return the content in that language.
I am thinking about creating a scheduled LotusScript or Java agent that somehow reads the content of the repeat. Is this possible? If so, some sample code to get me started would be great.
edit: the content is only available to logged-in users
thanks
Thomas
Use a Java agent, and instead of reading the content natively, open the page over HTTP as if in a browser, then process the result. (You could also make a special version of the web page that hides all extraneous content if you wanted.)
How is the data for the repeat evaluated? Can it be translated into a LotusScript database.search?
If so, it would be best to forget about the actual XPage and concentrate on getting the same data via LotusScript, then write your scheduled agent to loop through the document collection and generate the email that way.
Going through the XPage would generate a lot of extra work: you would need to be authenticated as the user (if the data in the repeat differs from one user to the next) to get exactly the data that a particular user would see, and then you would have to parse the page to extract the data.
If you have a complicated enough newsletter that you want to use an XPage rather than building the HTML yourself in the agent, you could build a single XPage that changes what it renders based on a special query string, then in your agent get the HTML from a URLConnection and pass it into the body of your email.
You could build the URL based on a view that shows documents with today's date.
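For instance, the agent could compute that dated URL like this; the base path and query parameter names are hypothetical and depend entirely on how you design the XPage:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class NewsletterUrl {
    // Build the URL the scheduled agent would fetch for a given day.
    // The "action" and "date" parameter names are invented for this sketch.
    static String forDate(String base, LocalDate date) {
        return base + "?action=newsletter&date="
                + date.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    public static void main(String[] args) {
        System.out.println(forDate("http://server/db.nsf/newsletter.xsp",
                LocalDate.of(2014, 5, 1)));
    }
}
```

The agent would then open this URL with a URLConnection (authenticating as a suitable user, since the content requires login) and use the returned HTML as the mail body.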
I would solve this by giving the user a teaser on what to read and give them a link to the full content.
You should check out my colleague Weihang Chen's article about rendering an XPage as MIME and sending it as a mail.
http://www.bleedyellow.com/blogs/weihang/entry/render_a_xpages_programmtically_and_send_it_as_a_mail?lang=en_us
We got this working in house and it is very convenient.
He describes 3 different approaches to the problem.

How do you access URL text following the # sign through Java?

Using Java (.jsp or whatever), is there a way I can send a request for this page:
http://www.mystore.com/store/shelf.jsp?category=mens#page=2
and have the Java code parse the URL and see the #page=2 and respond accordingly?
Basically, I'm looking for the Java code that allows me to access the characters following the hash tag.
The reason I'm doing this is that I want to load subsequent pages via AJAX (on my shelf) and then allow the user to copy the URL and send it to a friend. If Java cannot read the characters following the hash sign, I'm uncertain how to manipulate the URL with JavaScript in a way the server can also read, without causing the page to reload.
I'm having trouble even figuring out how to access/see the entire URL (http://www.mystore.com/store/shelf.jsp?category=mens#page=2) from within my Java code...
You can't.
The fragment identifier is only used client side, so it isn't sent to the server.
You have to parse it out with JavaScript, and then run your Ajax routines.
If you are loading entire pages (and just leaving some navigation and branding behind) then it almost certainly isn't worth using Ajax for this in the first place. Regular links work better.
Why can't you use a URL like this:
http://www.mystore.com/store/shelf.jsp?category=mens&page=2
If you want the server to see the data in the URL, it has to be in the query string.
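For completeness, java.net.URI shows exactly which part the server can receive: the query string travels with the request, while the fragment never leaves the browser (it is visible below only because we parse the full string locally):

```java
import java.net.URI;

public class UrlParts {
    // Split a URL into its query string (sent to the server)
    // and its fragment (kept by the browser, never sent).
    static String[] queryAndFragment(String url) {
        URI u = URI.create(url);
        return new String[] { u.getQuery(), u.getFragment() };
    }

    public static void main(String[] args) {
        String[] parts = queryAndFragment(
                "http://www.mystore.com/store/shelf.jsp?category=mens#page=2");
        System.out.println(parts[0]); // category=mens
        System.out.println(parts[1]); // page=2
    }
}
```

So with ?category=mens&page=2, both values would appear in getQuery() and reach the server; with the #page=2 form, only the category would.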

Categories

Resources