Jsoup posting modified Document - java

I'm trying to create a web scraper for my upcoming Android app. For that I need to fill out a simple search form on a website and send the values back to the server.
As mentioned in the Jsoup-Cookbook, I scraped the site I needed from the Server and changed the values.
Now I just need to post my modified document back to the server and scrape the resulting page.
As far as I've seen in the Jsoup-API there is no way to post something back, except with the .data-Attribute in Jsoup.connection, which is unfortunately not able to fill out text fields by their id.
Any ideas or workarounds for posting the modified document, or parts of it, back to the website?

You seem to misunderstand how HTTP works in general. The client does not send the entire HTML document with modified input values to the server. Instead, the name=value pairs of all input elements are sent as request parameters, and the server then returns the desired HTML response.
For example, suppose you want to simulate a submit of the following form in Jsoup (you can find the exact HTML form syntax by opening the page with the form in your browser, right-clicking, and choosing View Source):
<form method="post" action="http://example.com/somescript">
<input type="text" name="text1" />
<input type="text" name="text2" />
<input type="hidden" name="hidden1" value="hidden1value" />
<input type="submit" name="button1" value="Submit" />
<input type="submit" name="button2" value="Other button" />
</form>
then you need to construct the request as follows:
Document document = Jsoup.connect("http://example.com/somescript")
.data("text1", "yourText1Value") // Fill the first input field.
.data("text2", "yourText2Value") // Fill the second input field.
.data("hidden1", "hidden1value") // You need to keep it unmodified!
.data("button1", "Submit") // This way the server knows which button was pressed.
.post();
// ...
In some cases you'd also need to send the session cookies back, but that's a subject apart (and a question which has already been asked several times here before; in general, it's easier to use a real HTTP client for this and pass its response through Jsoup#parse()).
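As a rough sketch of the "real HTTP client" route, the JDK's own java.net.http.HttpClient (Java 11+) can build the same POST and keep session cookies through a CookieManager. The class name FormPostSketch and the helpers encodeForm and buildRequest are made up for illustration; the URL and field names are the ones from the example form above. The request is only built here, not sent.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class FormPostSketch {

    // Encode name=value pairs in the application/x-www-form-urlencoded
    // format that browsers use for ordinary form submissions.
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Build (but do not send) a POST request carrying the form fields.
    static HttpRequest buildRequest(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("text1", "yourText1Value");
        fields.put("text2", "yourText2Value");
        fields.put("hidden1", "hidden1value"); // keep hidden values unmodified
        fields.put("button1", "Submit");       // tells the server which button was pressed

        // A client with a CookieManager stores cookies from responses and
        // sends them back on later requests made with the same client.
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .build();

        HttpRequest request = buildRequest("http://example.com/somescript", fields);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(encodeForm(fields));

        // To actually send it (needs network access):
        // String html = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        // The returned HTML can then be handed to Jsoup.parse(html).
    }
}
```

Reusing the same client instance for the initial GET of the form page and the later POST is what lets the cookie store carry the session across both requests.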
See also:
HTTP tutorial
HTTP specification

That's not the way. You should create a POST request (use Apache HTTP Components), get the response and then scrape it with JSoup.

Related

Filling out a HTML-form with complex name (dot-notation in input-tag) with Java and Jaunt API

hey folks,
i am building a Java-tool, trying to automatically fill out some form input elements in an HTML-Page using Java and Jaunt API.
the HTML-Code is like:
<fieldset class = "fieldsetlong">
<legend>searchprofile</legend>
<label for="reference">reference:</label>
<input maxlength="50" name="reference" id="reference" type="text" />
</fieldset>
<fieldset class = "fieldsetlong">
<legend>searchcriteria</legend>
<label for="surname">surname:</label>
<input name="searchprofile.surname" id="surname" type="text" />
</fieldset>
The Java-Code for filling in the "normal" Input-field reference (it works) looks like:
form.set("reference", "123Test");
Unfortunately, I am not able to fill out the fields that use the dot-notation searchprofile.surname in the name
Here's a sample of what i've tried (without success):
form.set("surname", "TestPerson");
form.set("searchprofile.surname", "TestPerson");
form.set("name=\"searchprofile.surname\"", pers.getSurname());
form.set("id=\"surname\"", pers.getSurname());
For each of these commands I get a NotFoundException and don't know whether I can do this with Jaunt.
I would appreciate any kind of help in this regard.
Thanks in advance
Edit - is there a way to reach the dot-notated input-field searchprofile.surname with JSoup?
HTML allows dots in the name-Attribute, but does Jaunt accept this abc.name?
Not sure about Jaunt, I've never used it. However, Jsoup seems to be a pretty decent library for this. I have been using Jsoup for a fairly long time, and it has worked well for scraping web pages, filling in and submitting forms, and of course HTML parsing.
I've posted a step by step guide to fill in form input fields and submit to server in the following answer: How to login with Jsoup
Basically it works very similar to your code, a very brief example would be:
Connection.Response response = Jsoup.connect(url)
.data("Name", "Value")
.method(Method.POST).execute();
Today, at work the Jaunt solution with
form.set("searchprofile.surname", "TestPerson");
worked like a charm.
I don't know what the problem was earlier but I am glad that it worked.
HTML allows dots, hyphens, etc. in names. I had misinterpreted the dot as some kind of nested form or hierarchy, but the dot notation is just a valid value for the name attribute.
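To illustrate that a dotted name is just a flat key once the form is submitted, here is a small JDK-only sketch that parses a form-encoded body into a map. The class DottedNameSketch and the helper parseForm are hypothetical names, and the sample body is assembled from the field names in the question; it says nothing about how Jaunt works internally.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class DottedNameSketch {

    // Parse an application/x-www-form-urlencoded body into a flat map.
    // Names are treated as opaque strings; a dot has no special meaning.
    static Map<String, String> parseForm(String body) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(name, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params =
                parseForm("reference=123Test&searchprofile.surname=TestPerson");
        System.out.println(params.get("searchprofile.surname")); // prints TestPerson
        System.out.println(params.containsKey("surname"));       // prints false
    }
}
```

The lookup has to use the full name attribute, "searchprofile.surname"; the id "surname" never appears in the submitted data.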

Get raw HTML in servlet to churn out a PDF file

I have a JSP which is rendered after it is forwarded from a servlet. Now that I have a HTML from JSP I want to post this page in order to generate a PDF.
As per my understanding, the submit button only submits a form. But I need to submit raw HTML so that I can eventually use FlyingSaucer or a similar PDF creator library.
What is the way to use my HTML and then save the PDF to a file?
Please chime in to correct if I am wrong and what you think about my approach. Any advice would be greatly appreciated.
Edit: Sorry I have posted no code but at the moment I have hit a wall in the servlet in my quest to get around this.
You've basically 2 options:
Let JS set the current HTML DOM tree as a (hidden) request parameter on submit.
<form method="post" action="pdfservlet">
<input type="hidden" name="source" />
<input type="submit" value="generate" onclick="this.form.source.value = document.documentElement.outerHTML;" />
</form>
It's in pdfservlet available as request.getParameter("source").
Let pdfservlet request the desired page programmatically using URL/URLConnection.
InputStream source = new URL("http://localhost:8080/context/someservlet").openStream();
// ...
If necessary, set the JSESSIONID cookie to the current session ID so that the request runs in the same session.
URLConnection connection = new URL("http://localhost:8080/context/someservlet").openConnection();
connection.setRequestProperty("Cookie", "JSESSIONID=" + request.getSession().getId());
InputStream source = connection.getInputStream();
// ...
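The "// ..." part is where the stream gets turned into HTML text for the PDF library. A minimal sketch of that step, assuming the page is served as UTF-8 (check the response's Content-Type if unsure); PageSourceSketch, readAll and fetch are made-up names, and fetch simply wraps the URLConnection code above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

final class PageSourceSketch {

    // Read the whole stream into a String and close it afterwards.
    // Assumption: the page is encoded as UTF-8.
    static String readAll(InputStream source) throws IOException {
        try (InputStream in = source) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Request a page while passing along an existing session ID,
    // so the target servlet sees the same session.
    static String fetch(String pageUrl, String sessionId) throws IOException {
        URLConnection connection = URI.create(pageUrl).toURL().openConnection();
        connection.setRequestProperty("Cookie", "JSESSIONID=" + sessionId);
        return readAll(connection.getInputStream());
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real connection so the sketch runs without a server.
        InputStream fake = new ByteArrayInputStream(
                "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(fake));
    }
}
```

Inside the servlet, the call would look like fetch("http://localhost:8080/context/someservlet", request.getSession().getId()).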

Use Servlets to display the data on same webpage?

I am using a html form like this:
<form action="question" method="get">
where question is a Java servlet class which renders the data from the form and displays it on another page.
What I am trying to do is display this data just below the HTML form, not on another screen.
(Somewhat like the page where we Ask Question in stackoverflow.com where the question you enter is rendered and displayed below.)
So I am trying to do same. Anyone has an idea how to do that?
The simplest way to do it, is to use javascript (client side).
Below is a very crude example on how to do this. This will give you an idea on how to proceed.
create a html page, with two separate text area boxes.
Let the first text area box be the source where you type in the text.
Assign it an id 'source_area'.
<textarea id='source_area'>
</textarea>
Let the second text area box be the destination.
Assign it an id 'destination_area'.
Set this area as "readonly" because you don't want users typing here directly.
<textarea id='destination_area' readonly>
</textarea>
Now when a user types into the first box, we need to capture the particular action.
For this example I will use the "onKeyUp" to capture events when a keyboard key is released.
Now, when a key is released while typing into the source text box, the javascript function "transferToNextArea()" is invoked.
We will create the javascript function "transferToNextArea()" in
Read more about javascripts here. http://w3schools.com/js/js_events.asp
Complete list of events here. http://w3schools.com/jsref/dom_obj_event.asp
The javascript function will extract text from 'source_area' text box.
It will then assign the same text into 'destination_area'.
function transferToNextArea()
{
//extracting text.
var varSrcText = document.getElementById("source_area").value;
//assigning text to destination.
document.getElementById("destination_area").value=varSrcText
}
Complete html (tested in Google Chrome)
<html>
<body >
Source Box
<textarea id='source_area' onKeyUp="transferToNextArea();">
</textarea>
<br>
Destination Box
<textarea id='destination_area' readonly>
</textarea>
</body>
<script type="text/javascript">
function transferToNextArea()
{
var varSrcText = document.getElementById("source_area").value;
document.getElementById("destination_area").value=varSrcText
}
</script>
</html>
This is just a very basic example. It is not very efficient, but it will give you an idea of how data can be moved around.
Before assigning the text, you could manipulate the text however you want it using javascript.
Stackoverflow formats the text as per the html tags after extracting it. This will require lot more code and more work.
Using a servlet for the above task is overkill.
You would use a servlet, only if you want to do something with the data on the server side.
Example
a) store it in a database before displaying it below.
Read about "ajax" calls to send and receive data between the server and client.
Ajax will give you the means to send data to the servlet without having to refresh the whole page.
Create a JSP with a form
on submit post the data to some servlet
process request and produce resultant data and set it to request's attribute
forward the request to same jsp
check if the data is not null display under the form
Just let the servlet forward the request to the same JSP page and use JSTL <c:if> to conditionally display the results.
request.setAttribute("questions", questions);
request.getRequestDispatcher("/WEB-INF/questions.jsp").forward(request, response);
with
<c:if test="${not empty questions}">
<h2>There are ${fn:length(questions)} questions.</h2>
<c:forEach items="${questions}" var="question">
<div class="question">${question}</div>
</c:forEach>
</c:if>
See also:
Our servlets wiki page - Contains concrete Hello World examples.

hide value by a hyperlink

I was asking that : whenever I pass a value by a link then it looks like this:
Click here to view details
Now when i click on that hyperlink i am going to some.jsp and retrieving value of search like:
request.getParameter("someid");
But I am also seeing all those sensitive details in the browser URL, which is vulnerable. I want to hide all these details so that nothing will be shown in the browser's URL but processing will be done internally. How can I do it? Please ignore the JSP tags; I am learning JSTL and will soon replace the scriptlets, but initially I want to implement it with JSP tags. Any help is much appreciated.
If you'd turn the link into a button you could pass it as a hidden POST value and have your some.jsp page read that. For example:
<form method="post" action="some.jsp">
<input type="hidden" name="someid" value="<%=something%>" />
<input type="submit" value="Click here to view details" />
</form>
Then on your some.jsp, you can read the someid POST value and do with that whatever you want.
First of all, if you don't want to show sensitive data in the URL, why are you using a GET request?
You should use POST request. Store the value of someid in request attribute.
You can encrypt and decrypt the value of someid for doing it. See the example given here:
Encrypting a String with DES
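For illustration, here is a JDK-only sketch of encrypting and decrypting an id. It deliberately swaps DES for AES in GCM mode, since DES is widely considered obsolete; this is not the code from the linked answer. The class IdCipherSketch and its methods are hypothetical names, and the key is generated on the spot, whereas a real application would have to keep one key on the server for as long as its links must stay valid.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

final class IdCipherSketch {
    private static final int IV_BYTES = 12;   // usual IV size for GCM
    private static final int TAG_BITS = 128;  // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    // Assumption: a fresh 128-bit key; a real app stores one server-side.
    static SecretKey newKey() throws GeneralSecurityException {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(128);
        return generator.generateKey();
    }

    // Encrypt the value and return a URL-safe token (random IV + ciphertext).
    static String encrypt(String plain, SecretKey key) throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        RANDOM.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] cipherText = cipher.doFinal(plain.getBytes(StandardCharsets.UTF_8));
        byte[] token = ByteBuffer.allocate(iv.length + cipherText.length)
                .put(iv).put(cipherText).array();
        return Base64.getUrlEncoder().withoutPadding().encodeToString(token);
    }

    // Reverse of encrypt; throws if the token was tampered with.
    static String decrypt(String token, SecretKey key) throws GeneralSecurityException {
        byte[] raw = Base64.getUrlDecoder().decode(token);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(TAG_BITS, raw, 0, IV_BYTES));
        byte[] plain = cipher.doFinal(raw, IV_BYTES, raw.length - IV_BYTES);
        return new String(plain, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        SecretKey key = newKey();
        String token = encrypt("42", key);
        System.out.println(token);               // differs on every run
        System.out.println(decrypt(token, key)); // prints 42
    }
}
```

The token could then be used as the value of someid and decrypted in some.jsp before use. Encryption hides the value from the user, but the token itself still shows up in the URL if the request stays a GET.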

Is it possible to hide a password field from the address bar?

I have a login form with username and password. It works, but after the request I see on the web browser something like "...login?user=myUser&password=myPassword".
Given that the form has a password field that hides the password while it's typed, it would not be funny to see the password on the address bar.
Is it possible to avoid this?
The user verification is done on the server with a custom java web server.
Set your HTTP form method to POST instead of GET. This stops the form from appending the parameters to the URL.
Secure your page to use HTTPS instead of HTTP. That way, an eavesdropper cannot read unencrypted HTTP POST message.
The only way that this can be done is by not using the GET method of form submission. You need to use the POST method. More information can be found here http://www.cs.tut.fi/~jkorpela/forms/methods.html
Your form will look like this
<form method="post" action="somepage.php">
</form>
Your form is using the GET not POST. Passing variables via a query-string in the URL (GET) can be dangerous as users can see and modify these values. Change your form's method to POST. In standard HTML this would look like:
<form method="GET" action="......
...to...
<form method="POST" action=".....
You can encode the password, which will obscure it.
However using a POST form instead will hide all its fields.
Yes, use a POST request instead of GET.
Convert your form to use the HTTP "POST" method instead of "GET", e.g.:
<form action="/login" method="post">
Also consider obscuring the password before it is transmitted, e.g. using a scheme such as Base64 or MD5.
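One caveat on Base64 is worth seeing in code: it is an encoding, not encryption, so anyone who sees the value can reverse it without a key. A tiny sketch using the JDK's java.util.Base64 (the class name Base64IsNotEncryption is made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64IsNotEncryption {

    // Encode text as Base64; no key or secret is involved.
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Anyone can reverse the encoding just as easily.
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String obscured = encode("myPassword");
        System.out.println(obscured);
        System.out.println(decode(obscured)); // prints myPassword
    }
}
```

So this only keeps the password from being readable at a glance; protecting it in transit still comes down to HTTPS.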
Change the 'method' attribute on the form from "get" to "post" -- and send the request over HTTPS, preferably.
When you see a "login?user=myUser&password=myPassword" in your address bar this means that your Login form is using the GET request method:
<form id="login" action="some_file" method="get">
The easiest way of hiding this info would be to change from GET to POST method:
<form id="login" action="some_file" method="post">
You can read more about both of these methods here:
When to use POST and GET?
However, note that POST is not much safer than GET. You can read more about this here:
POST and GET in terms of Security
