How to scrape a website, http get vs http post?

How to scrape a website, http get vs http post? - java

I am new to programming and know very little about http, but I wrote a code to scrape a website in Java, and have been running into the issue that my code scrapes "get" http calls (based on typing in a URL) but I do not know how to go about scraping data for a "post" http call.
After a brief overview on http, I believe I will need to simulate the browser, but do not know how to do this in Java. The website I have been trying to use.
As I need to scrape that source code for all the pages, the URL does not change as each next button is clicked. I have used Firefox firebug to look at what is going on when the button is clicked, but I do not know all that I am looking for.
My code to scrape the data as of now is:
public class Scraper {
private static String month = "11";
private static String day = "4";
private static String url = "http://cpdocket.cp.cuyahogacounty.us/SheriffSearch/results.aspx?q=searchType%3dSaleDate%26searchString%3d"+month+"%2f"+day+"%2f2013%26foreclosureType%3d%27NONT%27%2c+%27PAR%27%2c+%27COMM%27%2c+%27TXLN%27"; // the input website to be scraped
public static String sourcetext; //The source code that has been scraped
//scrapeWebsite runs the method to scrape the input URL and returns a string to be parsed.
public static void scrapeWebsite() throws IOException {
URL urlconnect = new URL(url); //creates the url from the variable
URLConnection connection = urlconnect.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(
connection.getInputStream(), "UTF-8"));
String inputLine;
StringBuilder sourcecode = new StringBuilder(); // creates a stringbuilder which contains the sourcecode
while ((inputLine = in.readLine()) != null)
sourcecode.append(inputLine);
in.close();
sourcetext = sourcecode.toString();
}
What would be the best way to go about scraping all the pages for each "post" call?

Take a look at the jersey client interface
View the source of each page and determine the pattern of the url for next an previous pages then loop through.

Related

Is it possible to download only the HEAD tag of a page?

I've done some research about this and had no conclusive answer.
This question lays some of the path through it: How can I download only part of a page?
But then again, I don't want to download only a random part of a page, but one of the first tags, the head.
Is it possible somehow to query the page, and stream it's content to a buffer and stop downloading (discarding the rest) as soon as you find the tag closer </head> ?
EDIT:
Adding stuff to the page itself is not possible, since I want to pull the header of websites on my app.
Imagine http://stackoverflow.com is entered as the parameter. The whole page is around 240kb, but if I stop downloading the moment I hit </head>, it's only 5kb. Allowing me to save around 97% bandwidth for this page.

Maybe this is enough for you - Open a URLConnection and read from the input stream
public class test {
public static void main(String[] args) throws Exception {
URL oracle = new URL("http://www.oracle.com/");
BufferedReader in = new BufferedReader(
new InputStreamReader(oracle.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null){
if(inputLine.contains("</head>")) break;
System.out.println(inputLine);
}
in.close();
}
}
here you have the tutorial

How to display captcha in ImageView in Android.?

I have a PNR Inquiry app on Google Play. It was working very fine. But recently Indian Railwys added captcha to their PNR Inquiry section and because of this I am not able to pass proper data to the server to get proper response. How to add this captcha in my app in form of an imageview and ask the users to enter captcha details also so that I can send proper data and get proper response.
Indian Railways PNR Inquiry Link

If you check the html code, its actualy pretty bad captcha.
Background of captcha is: http://www.indianrail.gov.in/1.jpg
Those numbers are actualy in input tag:
<input name="lccp_cap_val" value="14167" id="txtCaptcha" type="hidden">
What they are doing is, via javascript, use numbers from that hidden input tag
and put them on that span with "captcha" background.
So basicaly your flow is:
read their html
get "captcha" (lol, funny captcha though) value from input field
when user puts data in your PNR field and presses Get Status
post form field, put PNR in proper value, put captcha in proper value
parse response
Oh yeah, one more thing. You can put any value in hidden input and "captcha"
input, as long as they are the same. They aren't checking it via session or
anything.
EDIT (code sample for submiting form):
To simplify posting form i recommend HttpClient components from Apache:
http://hc.apache.org/downloads.cgi
Lets say you downloaded HttpClient 4.3.1. Include client, core and mime
libraries in your project (copy to libs folder, right click on project,
properties, Java Build Path, Libraries, Add Jars -> add those 3.).
Code example would be:
private static final String FORM_TARGET = "http://www.indianrail.gov.in/cgi_bin/inet_pnstat_cgi.cgi";
private static final String INPUT_PNR = "lccp_pnrno1";
private static final String INPUT_CAPTCHA = "lccp_capinp_val";
private static final String INPUT_CAPTCHA_HIDDEN = "lccp_cap_val";
private void getHtml(String userPnr) {
MultipartEntityBuilder builder = MultipartEntityBuilder.create();
builder.addTextBody(INPUT_PNR, userPnr); // users PNR code
builder.addTextBody(INPUT_CAPTCHA, "123456");
builder.addTextBody("submit", "Get Status");
builder.addTextBody(INPUT_CAPTCHA_HIDDEN, "123456"); // values don't
// matter as
// long as they
// are the same
HttpEntity entity = builder.build();
HttpPost httpPost = new HttpPost(FORM_TARGET);
httpPost.setEntity(entity);
HttpClient client = new DefaultHttpClient();
HttpResponse response = null;
String htmlString = "";
try {
response = client.execute(httpPost);
htmlString = convertStreamToString(response.getEntity().getContent());
// now you can parse this string to get data you require.
} catch (Exception letsIgnoreItForNow) {
}
}
private static String convertStreamToString(InputStream is) {
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
StringBuilder sb = new StringBuilder();
String line = null;
try {
while ((line = reader.readLine()) != null) {
sb.append(line);
}
} catch (IOException ignoredOnceMore) {
} finally {
try {
is.close();
} catch (IOException manyIgnoredExceptions) {
}
}
return sb.toString();
}
Also, be warned i didn't wrap this in async call, so you will have to do that.

Image from the network can be displayed in android via efficient image loading api's like Picasso/volley or simply image view via async task.
considering all above things as basic build a logic such that you should need a image URL for that captcha if user resets or refresh the captcha it should reload new image via network call requesting the new request implementation, you have to get REST api access to the Indian railway and check in that any image uri available in that (it may be in base64 format )
if REST API is not available you may think of building your own server with this code
RESTful API to check the PNR Status
pnrapi
Update: you don't need to do this complex hacks , just implement Drago's answer !

Trying to get exact source code from web page I see from my browser using Java.

Very very new to programming,i.e its my 2nd day. I am looking at finance webpage, and am trying to extract the stock symbols from the webpage. Using the source code from the webpage id like a list that looks like ADK-A,AEH,AED, etc..., which is a list of the symbols as they appear on the webpage and browser generated source code.
Looking at the source code via Chrome's browser you can see the stock symbols, but using java even though I get some of the source code, every way i try the stock symbols and plenty of other code are never generated.
I have tried implementations using URL class, URLConnection class, and the HtmlUnit class. I dont know much but im guessing this part of the source is generated by some sort of javascript?? I figured working with Htmlunit would help as supposedly it can handle scripts? It didnt at least the way I am using it. Anyways this is what i tried
private static String name1 = "http://www.quantumonline.com/pfdtable.cfm?Type=TaxAdvPfds&SortColumn=Company&SortOrder=ASC";
//Implementation 1
public static void main (String[] args) throws IOException {
URL thisUrl = new URL(name1);
BufferedReader thisUrlBufferedReader = new BufferedReader (new InputStreamReader(thisUrl.openStream()));
String currentline;
while( (currentline = thisUrlBufferedReader.readLine()) != null) {
if ((currentline.contains("href")) == true) {
System.out.println(currentline);
}
}
}
//Implementation 2. My understading of fudging with addRequestProperty of a URLConnection, was to make sure my that the website wasnt restricting me based on my user-agent, I
honestly dont really know what it does, but i tried with and without, didnt help
public static void main (String[] args) throws IOException {
URL thisUrl = new URL(name1);
URLConnection thisUrlConnect = thisUrl.openConnection();
thisUrlConnect.addRequestProperty("User-Agent", "the user agent i got from http://whatsmyuseragent.com/");
InputStream input = thisUrlConnect.getInputStream();
BufferedReader thisUrlBufferedReader = new BufferedReader (new InputStreamReader (input));
String currentline;
while( (currentline = thisUrlBufferedReader.readLine()) != null) {
System.out.println(currentline);
}
}
//Implementation 3 i also used WebClient(BrowserVersion.CHROME) plus all the other versions
//nothing worked
public static void main(String[] args) throws Exception {
WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage(name1);
System.out.println(page.asXml());
}
}
Anyways if anyone has any ideas im all ears. THANKS!!!

Reading and printing HTML from website hangs up

I've been working on some Java code in which a string is converted into a URL and then used to download and output its corresponding URL. Unfortunately, when I run the program, it just hangs up. Does anyone have any suggestsion?
Note: I've used import java.io.* and import java.net.*
public static boolean htmlOutput(String testURL) throws Exception {
URL myPage2 = new URL(testURL); //converting String to URL
System.out.println(myPage2);
BufferedReader webInput2 = new BufferedReader(
new InputStreamReader(myPage2.openStream()));
String individualLine=null;
String completeInput=null;
while ((individualLine = webInput2.readLine()) != null) {
//System.out.println(inputLine);
System.out.println(individualLine);
completeInput=completeInput+individualLine;
}//end while
webInput2.close();
return true;
}//end htmlOutput()

[Though this answer helped the OP it is wrong. HttpURLConnection does follow redirects so this could not be the OP 's problem. I will remove it as soon as the OP removes the accepted mark.]
My guess is that you don't get anything back in the response stream because the page you are trying to connect sends you a redirect response (i.e. 302).
Try to verify that by reading the response code and iterate over the response headers. There should be a header named Location with a new url that you need to follow
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
int code = connection.getResponseCode();
Map<String, List<String>> map = conn.getHeaderFields();
// iterate over the map and find new url
If you are having trouble getting the above snippet to work take a look at a working example
You could do yourself a favor and use a third party http client like Apache Http client that can handle redirects otherwise you should do this manually.

Java - Parsing HTML - get text

I am tring to get text from a website; when you change the language the html url have an "/en" inside, but the page that have the information that i want don't have.
http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92
html tags: (the text contains the description of the photo)
<div id="redx_gallery_pic_title"> text text </div>
The problem is that the website is in german and i want the text in english, and my script gets only the german version
Any ideas how can i do it?
java code:
...
URL oracle = new URL(x);
BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
String inputLine=null;
StringBuffer theText = new StringBuffer();
while ((inputLine = in.readLine()) != null)
theText.append(inputLine+"\n");
String html = theText.toString();
in.close();
String[] name = StringUtils.substringsBetween(html, "redx_gallery_pic_title\">", "</div>");

That site is internationalized with German as default. You need to tell the server what language you're accepting by specifying the desired ISO 639-1 language code in the Accept-Language request header.
URLConnection connection = new URL(url).openConnection();
connection.setRequestProperty("Accept-Language", "en");
InputStream input = connection.getInputStream();
// ...
Unrelated to the concrete problem, may I suggest you to have a look at Jsoup as a HTML parser? It's much more convenient with its jQuery-like CSS selector syntax and therefore much less bloated than your attempt as far:
String url = "http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92";
Document document = Jsoup.connect(url).header("Accept-Language", "en").get();
String title = document.select("#redx_gallery_pic_title").text();
System.out.println(title); // Beech, glazing V3
That's all.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to scrape a website, http get vs http post? - java

Take a look at the jersey client interface View the source of each page and determine the pattern of the url for next an previous pages then loop through.

Related

Is it possible to download only the HEAD tag of a page?

How to display captcha in ImageView in Android.?

Trying to get exact source code from web page I see from my browser using Java.

Reading and printing HTML from website hangs up

Java - Parsing HTML - get text

Categories

Resources