I am trying to automate downloading text data from a website. Before I can access the website's data, I have to enter my username and password. The code I use to scrape the text is listed below. The problem is I can't figure out how to log in to the page and redirect to the location of the data. I have tried logging in through my browser and then running my code through Eclipse, but I just end up getting data from the login screen. I can retrieve data from websites just fine provided I don't have to go through a login.
static public void printPageA(String urlString) {
    try {
        // Create a URL for the desired page
        URL url = new URL(urlString);

        // Read all the text returned by the server
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String str;
        while ((str = in.readLine()) != null) {
            // str is one line of text; readLine() strips the newline character(s)
            System.out.println(str);
        }
        in.close();
    } catch (MalformedURLException e) {
        // The string was not a valid URL
        e.printStackTrace();
    } catch (IOException e) {
        // Network or read error; don't swallow it silently
        e.printStackTrace();
    }
}
I would suggest using the Apache HttpClient library. It makes it easier to send HTTP requests and it takes care of things like cookies. The site probably uses cookies to keep track of your session, so you need to:
Make the same request as when you submit the login form. This is probably a POST request with parameters such as username and password. You can see it in the network monitor of your browser (developer tools).
Read the response. It will probably contain a Set-Cookie header carrying the ID of your session. You have to send this cookie along with all your subsequent requests, otherwise you will be sent back to the login page. If you use the HttpClient library, it takes care of this; there is no need to handle the cookie manually in your code.
Request any page of the web site that requires authentication. A rough sketch of these steps follows.
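A minimal sketch with HttpClient 4.x is below. The login URL and the form field names ("username", "password") are placeholders; copy the real ones from the network monitor of your browser.

import java.util.Arrays;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class LoginScraper {
    public static void main(String[] args) throws Exception {
        // The client keeps a cookie store, so the session cookie from the
        // login response is sent automatically with later requests.
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            // 1. POST the login form (placeholder URL and field names).
            HttpPost login = new HttpPost("https://example.com/login");
            login.setEntity(new UrlEncodedFormEntity(Arrays.asList(
                    new BasicNameValuePair("username", "myUser"),
                    new BasicNameValuePair("password", "myPassword"))));
            try (CloseableHttpResponse response = client.execute(login)) {
                EntityUtils.consume(response.getEntity()); // discard the body, keep the session cookie
            }

            // 2. Request the protected page; the session cookie goes along automatically.
            HttpGet data = new HttpGet("https://example.com/protected/data.txt");
            try (CloseableHttpResponse response = client.execute(data)) {
                System.out.println(EntityUtils.toString(response.getEntity()));
            }
        }
    }
}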
I would like to retrieve HTML data from a dynamic web page, for example a public Facebook page: https://www.facebook.com/bbcnews/ (public content, without login).
For example, this page has infinite scroll, and we have to go to the bottom of the page to load more posts.
My current code is here:
URL url = new URL("https://www.facebook.com/bbcnews/");
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
BufferedWriter writer = new BufferedWriter(new FileWriter("path"));
String line;
while ((line = reader.readLine()) != null) {
    writer.write(line);
}
writer.close();
reader.close();
This code retrieves only the first part of the page.
How can I retrieve the content of the page that lies beyond the infinite scroll?
Thanks.
You won't get that through a simple BufferedReader looking at an HTTP stream. Open your browser console, then scroll to the end of the page. You'll see that an XHR call (asynchronous request) is fired toward this URL:
https://www.facebook.com/pages_reaction_units
With a lot of cryptic request parameters. You'd need to perform this kind of call in your Java code, but the parameters are obfuscated for a reason: reproducing them from scratch doesn't seem to be a good approach.
Better to use an API provided by Facebook, most likely the Graph API.
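A rough sketch of that route, assuming you have a Facebook app and an access token with the required permissions (both placeholders below). The /{page-id}/posts edge returns the page's posts as JSON, and the "paging.next" URL inside the response takes the place of scrolling: to get older posts, request that URL next.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class GraphApiSketch {
    public static void main(String[] args) throws Exception {
        String pageId = "bbcnews";                 // placeholder: page ID or username
        String accessToken = "YOUR_ACCESS_TOKEN";  // placeholder: token from your Facebook app

        URL url = new URL("https://graph.facebook.com/" + pageId
                + "/posts?access_token=" + accessToken);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON: the posts plus a "paging" object with a "next" URL
            }
        }
    }
}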
I'm trying to retrieve a GitHub web page using Java code; for this I used the following code.
String startingUrl = "https://github.com/xxxxxx";
URL url = new URL(startingUrl);
HttpURLConnection uc = (HttpURLConnection) url.openConnection();
uc.connect();
String line = null;
StringBuffer tmp = new StringBuffer();
try {
    BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream(), "UTF-8"));
    while ((line = in.readLine()) != null) {
        tmp.append(line);
    }
    in.close();
} catch (FileNotFoundException e) {
    // Don't swallow the error silently
    e.printStackTrace();
}
However, the page I receive here is different from what I see in the browser after logging in to GitHub. I tried sending an Authorization header as follows, but it didn't work either.
uc.setRequestProperty("Authorization", "Basic encodexxx");
How can I retrieve the same page that I see when I am logged in?
I can't tell you more on this because I don't know what you are getting, but the most common issue for web crawlers is that website owners mostly don't like them. Thus, you should behave like a regular user, your browser for instance. Open your browser's developer tools (press F12) when you visit the website and look at what your browser sends in the request, then try to mimic it: for example, add Host, Referer, etc. to your headers. You will need to experiment with this.
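A minimal sketch of copying browser-like headers onto the connection; the header values below are only examples, use whatever your own browser actually sends.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class BrowserLikeRequest {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://github.com/xxxxxx"); // placeholder URL from the question
        HttpURLConnection uc = (HttpURLConnection) url.openConnection();

        // Headers copied from a real browser request (F12 -> Network tab); example values only.
        uc.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/96.0 Safari/537.36");
        uc.setRequestProperty("Accept",
                "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        uc.setRequestProperty("Referer", "https://github.com/");

        StringBuilder tmp = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(uc.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                tmp.append(line).append('\n');
            }
        }
        System.out.println(tmp);
    }
}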
Also, good to know: some website owners use advanced techniques to block you from accessing their site, some won't stop you from crawling their website, and some will let you do whatever you want. The fairest option is to check www.somedomain.com/robots.txt, which lists the endpoints that are allowed to be crawled and those that are not.
I want to open a website in a web browser. I know it is easy, but I want to do it in a different way...
It is like a proxy server. I have written Java code that gets the content (source code) of a web page, and when a browser requests localhost on a particular port number, this code writes the source code to the browser. But instead of the rendered web page, I am getting the source code of the web page in the browser. I also want the request made from my Java code to look like it came from a browser, so that the server thinks the request was made by a browser and not by a Java console program.
import java.net.*;
import java.io.*;

public class URLConnectionReader {
    public static void main(String args[]) throws Exception {
        // Wait for a single browser connection on port 9898
        ServerSocket server = new ServerSocket(9898);
        System.out.println("Server is waiting for clients on port no 9898....");
        Socket client = server.accept();
        System.out.println("Connected.....");

        DataOutputStream out = new DataOutputStream(client.getOutputStream());

        // Fetch the remote page and copy its body to the browser.
        // DataInputStream.readLine() is deprecated, so read through a BufferedReader instead.
        URL ul = new URL("http://www.google.com");
        HttpURLConnection ulc = (HttpURLConnection) ul.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(ulc.getInputStream()));

        String c;
        while ((c = in.readLine()) != null) {
            out.writeBytes(c + "\r\n"); // readLine() strips line endings, so add them back
        }

        in.close();
        out.close();
        client.close();
        server.close();
    }
}
Loading web pages is not quite as simple as you probably think. Both the browser and the server use a protocol called HTTP. In simple terms, the browser sends a request consisting of a request line, headers and sometimes data, and the server responds with a response line, headers and data. Most web pages also have related resources that need to be loaded for displaying the page (such as images, stylesheets and scripts), and each resource is loaded through a separate request.
Your program only accepts one request, completely ignores the details of the request, and then loads a fixed web page and sends it as the response. The way you are loading the web page (with a URL), you are only getting the data part of the response (the page source); the response line and the headers are missing. The headers are very important as one of them (named "Content-Type") specifies what kind of resource it is - web page, image or something else. Without it, browsers usually assume the data is plain text and display it accordingly.
So if you want your experiment to work better, you need to make sure you send a complete and valid HTTP response to the browser. You can probably reconstruct the response line and headers from the HttpURLConnection object. Or you can use sockets directly to load the web page.
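One way to see what "complete and valid" means in practice: the sketch below keeps the single-request, fixed-page structure of the program above (the browser's own request is still ignored), but writes a status line and a Content-Type header, copied from the HttpURLConnection, before the body. The target URL is just the placeholder from the question.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class MiniProxy {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(9898)) {
            System.out.println("Waiting for a browser on port 9898...");
            try (Socket client = server.accept()) {
                // Fetch the remote page body, as in the original program.
                HttpURLConnection conn =
                        (HttpURLConnection) new URL("http://www.google.com").openConnection();
                StringBuilder body = new StringBuilder();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        body.append(line).append("\r\n");
                    }
                }

                // A complete HTTP response: status line, headers, blank line, then the body.
                // Copying Content-Type from the remote response tells the browser it is
                // receiving HTML rather than plain text.
                String contentType = conn.getContentType() != null
                        ? conn.getContentType() : "text/html; charset=UTF-8";
                byte[] bodyBytes = body.toString().getBytes(StandardCharsets.UTF_8);
                OutputStream out = client.getOutputStream();
                out.write(("HTTP/1.1 200 OK\r\n"
                        + "Content-Type: " + contentType + "\r\n"
                        + "Content-Length: " + bodyBytes.length + "\r\n"
                        + "Connection: close\r\n"
                        + "\r\n").getBytes(StandardCharsets.US_ASCII));
                out.write(bodyBytes);
                out.flush();
            }
        }
    }
}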
A better solution would be to use a Java web server (such as Jetty) in which you'd run a servlet that loads the remote page using an HTTP client library (such as Apache HttpComponents) and does the necessary processing of addresses and headers. But... small steps :)
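For completeness, a rough sketch of that direction, assuming Jetty 9 with the javax.servlet API on the classpath (newer Jetty versions use the jakarta.servlet namespace instead). It uses plain HttpURLConnection to keep it short; a real proxy would use an HTTP client library as suggested above.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.servlet.ServletContextHandler;
import org.eclipse.jetty.servlet.ServletHolder;

public class ProxyServer {

    // A servlet that fetches a fixed remote page and relays its status,
    // content type and body to the browser.
    public static class ProxyServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws java.io.IOException {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL("http://www.google.com").openConnection();
            resp.setStatus(conn.getResponseCode());
            if (conn.getContentType() != null) {
                resp.setContentType(conn.getContentType());
            }
            try (InputStream in = conn.getInputStream();
                 OutputStream out = resp.getOutputStream()) {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Server server = new Server(9898);
        ServletContextHandler context = new ServletContextHandler();
        context.setContextPath("/");
        context.addServlet(new ServletHolder(new ProxyServlet()), "/*");
        server.setHandler(context);
        server.start();
        server.join();
    }
}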
For starters: I'm not very literate in coding.
I am pretty interested in a script that triggers/throws the basic/standard "Authentication Required" dialog on a specific directory or site, and then checks the credentials that users enter there against a database on another website.
i.e. like those "check who blocked you on MSN" websites that take your credentials on their site, check them against the Hotmail database or servers, and tell you the credentials are incorrect (try again) or, if they are correct, redirect you to the specific website set up by the administrator (in this situation, the Hotmail contact list).
Also, when the script finds that the credentials are correct, how do I make it store those credentials in a specific .txt file or folder?
The only difference is that I just want it to be a basic authentication dialog like this example here, but implemented on my sites.
I hope I'm being comprehensible.
Thank you very much in advance.
You will need to send a 401 response code to the browser which will make the browser prompt for a username and password. Here's an example in PHP taken from the PHP manual:
<?php
if (!isset($_SERVER['PHP_AUTH_USER'])) {
    header('WWW-Authenticate: Basic realm="My Realm"');
    header('HTTP/1.0 401 Unauthorized');
    echo 'Text to send if user hits Cancel button';
    exit;
} else {
    echo "<p>Hello {$_SERVER['PHP_AUTH_USER']}.</p>";
    echo "<p>You entered {$_SERVER['PHP_AUTH_PW']} as your password.</p>";
}
?>
You should be able to do the same thing in the language of your choice, although you will need to research where the username and password variables are stored in the language you use.
As an alternative, you may also be able to configure this in your web server. That way the web server handles authentication and you only need to program your application to get the current user name which is usually found in the "REMOTE_USER" environment variable. In Apache you might restrict access to a specific folder as follows:
<Directory /usr/local/apache/htdocs/secret>
    AuthType Basic
    AuthName "Restricted Files"
    # (Following line optional)
    AuthBasicProvider file
    AuthUserFile /usr/local/apache/passwd/passwords
    Require user rbowen
</Directory>
See the Apache documentation on authentication and access control for more information. Even if you are using a different web server, rest assured that this is a common feature in web servers. I'm sure you will be able to find the equivalent functionality in whatever web server you are using.
Java imports have been excluded...
To show the username/password dialog...
HttpServletResponse httpResponse = (HttpServletResponse) response;
httpResponse.setHeader("WWW-Authenticate", "Basic realm=\"My Realm\"");
httpResponse.sendError(HttpServletResponse.SC_UNAUTHORIZED, "");
To decode the request...
private boolean authenticateRequestOk(HttpServletRequest request)
{
    String authorizationHeader = request.getHeader("Authorization");
    if (authorizationHeader != null)
    {
        byte[] decodedUsernamePassword;
        try
        {
            decodedUsernamePassword = Base64.decode(authorizationHeader.substring("Basic ".length()));
        }
        catch (IOException e)
        {
            log.error("Error decoding authorization header \"" + authorizationHeader + "\"", e);
            return false;
        }
        String usernameAndPassword = new String(decodedUsernamePassword);
        String username = StringUtils.substringBefore(usernameAndPassword, ":");
        String password = StringUtils.substringAfter(usernameAndPassword, ":");
        if (USERNAME.equalsIgnoreCase(username) && PASSWORD.equalsIgnoreCase(password))
        {
            return true;
        }
    }
    return false;
}
I'm building a Java application which will download an HTML page from a website and save the file on my local system. I'm able to manually access the web page's URL via a browser, but when I try to access the same URL in my Java program, the server returns a 503 error. Here's the scenario:
sample URL = http://content.somesite.com/demo/somepage.asp
I am able to access the above URL via a browser, but the Java code below fails to download the page:
StringBuffer data = new StringBuffer();
BufferedReader br = null;
try {
    br = new BufferedReader(new InputStreamReader(sourceUrl.openStream()));
    String inputLine = "";
    while ((inputLine = br.readLine()) != null) {
        data.append(inputLine);
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (br != null) {   // guard against a NullPointerException if openStream() failed
        br.close();
    }
}
So, my questions are:
Am I doing anything wrong here?
Is there a way for the server to block requests from programs/bots and allow only requests coming from browsers?
You may want to try setting the User-Agent and Referer HTTP headers to something like what a normal web browser would send.
You can pick a User-Agent string from this list: Seehowitruns: User-agent strings.
In addition, if the page you are requesting is an internal page, it might also depend on cookies that were generated on a previous page.
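A minimal sketch along those lines: it installs a JVM-wide CookieManager so cookies set by earlier requests are replayed automatically, and sets browser-like headers. The header values are examples only, and the URL is the sample URL from the question.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;

public class PageDownloader {
    public static void main(String[] args) throws Exception {
        // A process-wide cookie store so cookies set by earlier pages
        // are sent back on later requests.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

        URL sourceUrl = new URL("http://content.somesite.com/demo/somepage.asp");
        HttpURLConnection conn = (HttpURLConnection) sourceUrl.openConnection();
        // Browser-like headers; the values are examples only.
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/96.0 Safari/537.36");
        conn.setRequestProperty("Referer", "http://content.somesite.com/demo/");

        StringBuilder data = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String inputLine;
            while ((inputLine = br.readLine()) != null) {
                data.append(inputLine);
            }
        }
        System.out.println(data.length() + " characters downloaded");
    }
}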