I'm trying to retrieve a GitHub web page from Java code. For this I used the following:
String startingUrl = "https://github.com/xxxxxx";
URL url = new URL(startingUrl);
HttpURLConnection uc = (HttpURLConnection) url.openConnection();
uc.connect();
String line = null;
StringBuffer tmp = new StringBuffer();
try {
    BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream(), "UTF-8"));
    while ((line = in.readLine()) != null) {
        tmp.append(line);
    }
} catch (FileNotFoundException e) {
}
However, the page I receive is different from what I see in the browser after logging in to GitHub. I tried sending an Authorization header as follows, but it didn't work either.
uc.setRequestProperty("Authorization", "Basic encodexxx");
How can I retrieve the same page that I see when I'm logged in?
I can't tell you more than this, because I don't know what you are getting, but the most common issue for web crawlers is that website owners mostly don't like them. Thus, you should behave like a regular user, your browser for instance. Open your browser's developer tools (press F12) when you visit the site, look at what the browser sends in its request, and then try to mimic it: for example, add Host, Referer, etc. to your headers. You need to experiment with this.
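As a rough sketch (not the original code, and the header values below are placeholders you would copy from your own browser's network tab, including the Cookie header of a logged-in session), mimicking the browser with HttpURLConnection could look like this:
// Header values are examples only; take the real ones from your browser's request.
URL url = new URL("https://github.com/xxxxxx");
HttpURLConnection uc = (HttpURLConnection) url.openConnection();
uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
uc.setRequestProperty("Referer", "https://github.com/");
uc.setRequestProperty("Cookie", "name=value"); // copy the cookies of a logged-in session
uc.connect();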
Also good to know: some website owners use advanced techniques to block you from accessing their site, some won't stop you from crawling, and some will let you do whatever you want. The fairest option is to check www.somedomain.com/robots.txt, which lists the endpoints that are allowed for scraping and those that are not.
I would like to retrieve HTML data from a dynamic web page, for example a public Facebook page: https://www.facebook.com/bbcnews/ (public content, no login).
For example, this page has an infinite scroll: you have to go to the bottom of the page to load more posts.
My current code is here:
URL url = new URL("https://www.facebook.com/bbcnews/");
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
BufferedWriter writer = new BufferedWriter(new FileWriter("path"));
while ((line = reader.readLine()) != null) {
writer.write(line);
}
This code retrieves only the first part of the page.
How can I retrieve more content from the page, given the infinite scroll?
Thanks.
You won't get that through a simple BufferedReader looking at an HTTP stream. Open your browser console, then reach the end of the page. You'll see that an XHR call (asynchronous request) is fired toward this URL:
https://www.facebook.com/pages_reaction_units
With a lot of cryptic request parameters. You would need to perform this kind of call in your Java code; it's obfuscated for a reason, and getting it done from scratch doesn't seem to be a good approach.
Better to use an API provided by Facebook, such as the Graph API.
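As a rough sketch (PAGE_ID and ACCESS_TOKEN are placeholders; you need a valid access token from Facebook, and the exact endpoint and permissions depend on the Graph API version you use), fetching a page's posts could look roughly like this:
String api = "https://graph.facebook.com/PAGE_ID/posts?access_token=ACCESS_TOKEN";
HttpURLConnection conn = (HttpURLConnection) new URL(api).openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
String line;
while ((line = in.readLine()) != null) {
    System.out.println(line); // JSON containing the posts and "paging" links for the next batch
}
in.close();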
I am trying to automate downloading text data from a website. Before I can access the website's data, I have to enter my username and password. The code I use to scrape the text is listed below. The problem is that I can't figure out how to log in to the page and redirect to the location of the data. I have tried logging in through my browser and then running my code through Eclipse, but I just end up getting data from the login screen. I can retrieve data from websites just fine, provided I don't have to go through a login.
static public void printPageA(String urlString) {
    try {
        // Create a URL for the desired page
        URL url = new URL(urlString);
        // Read all the text returned by the server
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String str;
        while ((str = in.readLine()) != null) {
            // str is one line of text; readLine() strips the newline character(s)
            System.out.println(str);
        }
        in.close();
    } catch (MalformedURLException e) {
    } catch (IOException e) {
    }
}
I would suggest using the Apache HttpClient library. It makes HTTP requests easier and takes care of things like cookies. The site probably uses cookies to keep information about your session, so you need to do the following (a sketch follows the steps):
1. Make the same request as when you submit the login form. This is probably a POST request with parameters such as username and password. You can see it in the network monitor of your browser (developer tools).
2. Read the response. It will probably contain a Set-Cookie header with the ID of your session. You have to send this cookie along with all your subsequent requests, otherwise you will be sent back to the login page. If you use the HttpClient library, it takes care of this; you don't need to handle it in your code.
3. Create a request to any page of the web site that requires authentication.
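Putting those steps together, here is a minimal sketch using Apache HttpClient 4.x (the login URL, data URL, and form field names are made up; take the real ones from your browser's network monitor):
// Needs the org.apache.http.* classes from Apache HttpClient 4.x on the classpath.
CloseableHttpClient client = HttpClients.createDefault(); // keeps cookies between requests

// 1. Replay the login form POST (URL and field names are placeholders).
HttpPost login = new HttpPost("https://example.com/login");
List<NameValuePair> form = new ArrayList<>();
form.add(new BasicNameValuePair("username", "myUser"));
form.add(new BasicNameValuePair("password", "myPassword"));
login.setEntity(new UrlEncodedFormEntity(form));
CloseableHttpResponse loginResponse = client.execute(login);
EntityUtils.consume(loginResponse.getEntity()); // 2. the session cookie is now stored in the client
loginResponse.close();

// 3. Any further request made with the same client sends the session cookie automatically.
HttpGet data = new HttpGet("https://example.com/protected/data.txt");
CloseableHttpResponse dataResponse = client.execute(data);
System.out.println(EntityUtils.toString(dataResponse.getEntity()));
dataResponse.close();
client.close();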
I am currently pen-testing a web application and came across an interesting phenomenon. During my testing sessions, I gathered URLs using a proxy. Now I wanted to test my URL list for anonymous access, so I wrote this little tool:
public static void main(String[] args) {
    try {
        TrustAllCerts.disableCertChecks();
        FileReader fr = new FileReader(new File("urls.txt"));
        BufferedReader br = new BufferedReader(fr);
        String urlStr = br.readLine();
        while (urlStr != null) {
            if (urlStr.trim().length() > 0) {
                URL url = new URL(urlStr);
                HttpsURLConnection urlc = (HttpsURLConnection) url.openConnection();
                urlc.connect();
                if (urlc.getResponseCode() == HttpURLConnection.HTTP_OK) {
                    System.out.println(urlStr);
                } else {
                    System.out.println("[" + urlc.getResponseCode() + "] " + urlStr);
                }
                urlc.disconnect();
            }
            urlStr = br.readLine();
        }
        br.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
It does basically nothing but open a URL connection on a given URL and test the HTTP response code (actually I implemented some more tests, e.g. whether I get redirected to a login page). However, the problem is that this specific application (some custom MS SQL Server Reporting Services) is configured to use NTLM WWW authentication. If I try to access some of the URLs using Firefox, I get a 401 Unauthorized plus a login dialog. Internet Explorer performs NTLM auth in the background and grants access. It seems that the Java URLConnection (or URL) class does the same, because I am getting no 401. Is there a way to disable implicit NTLM authentication in Java? This is a bad pitfall for me.
I think the Java Networking Documentation is the best resource. Setting http.auth.preference="basic" should get you what you want, assuming you don't need Digest or something else. I'm not sure if you can go beyond that to disable NTLM entirely.
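For example (a sketch; whether this fully suppresses the transparent NTLM handshake may depend on the JRE), set the property before opening any connections:
// Prefer Basic authentication so the built-in NTLM negotiation is not chosen.
System.setProperty("http.auth.preference", "basic");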
Another thing to consider is other Java HTTP client implementations, like Apache's or Google's.
I'm not sure that this will help, but I've been stumped by the opposite problem.
I wanted NTLM auth to take place, so on my local machine I use a free app called CNTLM. It's a local proxy server that forwards (and NTLM-authenticates) incoming requests. Good for apps that can't use NTLM proxies.
I'm sorry, I know this isn't answering your question, but maybe it proves helpful to someone out there! :)
I'm making a simple URL request with code like this:
URL url = new URL(webpage);
URLConnection urlConnection = url.openConnection();
InputStream is = urlConnection.getInputStream();
But on that last line I get the "redirected too many times" error. If my webpage variable is, say, google.com, it works fine, but when I try to use my servlet's URL it fails. It seems I can adjust the number of redirects it follows (the default is 20) with this:
System.setProperty("http.maxRedirects", "100");
But when I crank it up to, say, 100, it definitely takes longer to throw the error, so I know it is trying. However, the URL to my servlet works fine in (any) browser, and using the "persist" option in Firebug it seems to redirect only once.
A bit more info on my servlet: it runs in Tomcat and is fronted by Apache using mod_proxy_ajp. Also of note, it uses form authentication, so any URL you enter should redirect you to the login page. As I said, this works correctly in all browsers, but for some reason the redirect isn't working with URLConnection in Java 6.
Thanks for reading ... ideas?
It's apparently redirecting in an infinite loop because you don't maintain the user session. The session is usually backed by a cookie. You need to create a CookieManager before you use URLConnection.
// First set the default cookie manager.
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
// All the following subsequent URLConnections will use the same cookie manager.
URLConnection connection = new URL(url).openConnection();
// ...
connection = new URL(url).openConnection();
// ...
connection = new URL(url).openConnection();
// ...
See also:
Using java.net.URLConnection to fire and handle HTTP requests
Dude, I added these lines:
java.net.CookieManager cm = new java.net.CookieManager();
java.net.CookieHandler.setDefault(cm);
See this example:
java.net.CookieManager cm = new java.net.CookieManager();
java.net.CookieHandler.setDefault(cm);
String buf = "";
dk = new DAKABrowser(input.getText());
try {
    URL url = new URL(dk.toURL(input.getText()));
    // BufferedReader instead of the deprecated DataInputStream.readLine()
    BufferedReader dis = new BufferedReader(new InputStreamReader(url.openStream()));
    String inputLine;
    while ((inputLine = dis.readLine()) != null) {
        buf += inputLine;
        output.append(inputLine + "\n");
    }
    dis.close();
} catch (MalformedURLException me) {
    System.out.println("MalformedURLException: " + me);
} catch (IOException ioe) {
    System.out.println("IOException: " + ioe);
}
titulo.setText(dk.getTitle(buf));
I was using Jenkins on Tomcat 6 in a Unix environment and got this bug. For some reason, upgrading to Java 7 solved it. I'd be interested to know exactly why that fixed it.
I faced the same problem, and it took a considerable amount of time to understand it.
To summarize, the problem was a mismatch of headers.
Consider the resource below:
@GET
@Path("booksMasterData")
@Produces(Array(core.MediaType.APPLICATION_JSON))
def booksMasterData(@QueryParam("stockStatus") stockStatus: String): Response = {
    // some logic here to get the books and send them back
}
And here is the client code that was trying to connect to the resource above:
ClientResponse clientResponse = restClient.resource("http://localhost:8080/booksService")
        .path("rest").path("catalogue").path("booksMasterData")
        .accept("application/boks-master-data+json")
        .get(ClientResponse.class);
The error was thrown on exactly the line above.
What was the problem?
My resource was using "application/json" in the @Produces annotation, while my client was using accept("application/boks-master-data+json"), and this was the problem.
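The fix, in this case, is simply to make the client's Accept header match what the resource produces, along these lines:
ClientResponse clientResponse = restClient.resource("http://localhost:8080/booksService")
        .path("rest").path("catalogue").path("booksMasterData")
        .accept("application/json") // matches the @Produces annotation of the resource
        .get(ClientResponse.class);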
It took me a long time to figure this out, as the error seemed unrelated. The breakthrough came when I tried to access my resource in Postman: with the Accept: "application/json" header it worked fine, but with the Accept: "application/boks-master-data+json" header it didn't.
And again, even Postman was not giving me a proper error; the error was too generic.
I have a simple web page with an embedded Java applet. The applet makes HTTP calls to different Axis cameras that all share the same authentication (e.g. username, password).
I am passing the user name and password to the Java code upon launch of the applet - no problem.
When I run from within NetBeans with the applet viewer, I get full access to the cameras and see streaming video - exactly as advertised.
The problem begins when I open the HTML page in a web browser (Firefox).
Even though my code handles authentication:
URL u = new URL(useMJPGStream ? mjpgURL : jpgURL);
huc = (HttpURLConnection) u.openConnection();
String base64authorization =
        securityMan.getAlias(this.securityAlias).getBase64authorization();
// if authorization is required, set up the connection with the encoded
// authorization information
if (base64authorization != null) {
    huc.setDoInput(true);
    huc.setRequestProperty("Authorization", base64authorization);
    huc.connect();
}
InputStream is = huc.getInputStream();
connected = true;
BufferedInputStream bis = new BufferedInputStream(is);
dis = new DataInputStream(bis);
The browser still brings up an authentication pop-up and requests the username and password for each camera separately!
To make things worse, the images displayed from the camera are frozen and old (from last night).
How can I bypass the browser's authentication?
Fixed
I added the following lines:
huc.setDoOutput(true);
huc.setUseCaches(false);
after the
huc.setDoInput(true);
line.
When running in the browser, base64authorization is not null, correct?
I'm not really sure what getBase64authorization is supposed to return, but I'm fairly certain that when you call huc.setRequestProperty("Authorization", **authorization value**) it's looking for an HTTP Basic authentication value. Meaning **authorization value** needs to be in the format Basic **base64 encoding of username:password**, as described here.
Maybe you just need to prepend the "Basic " string (note the trailing space) to your property.
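As a small sketch (assuming Java 8+ for java.util.Base64; "user" and "pass" are placeholders for the camera credentials), building the header by hand would look like this:
String credentials = "user" + ":" + "pass"; // placeholder credentials
String encoded = java.util.Base64.getEncoder()
        .encodeToString(credentials.getBytes(java.nio.charset.StandardCharsets.UTF_8));
// note the space after "Basic"
huc.setRequestProperty("Authorization", "Basic " + encoded);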