I would like to retrieve HTML data from a dynamic web page, for example a public Facebook page: https://www.facebook.com/bbcnews/ (public content, no login required).
This page uses infinite scroll: you have to reach the bottom of the page to load more posts.
My current code is here:
URL url = new URL("https://www.facebook.com/bbcnews/");
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
BufferedWriter writer = new BufferedWriter(new FileWriter("path"));
String line;
while ((line = reader.readLine()) != null) {
    writer.write(line);
}
writer.close();
reader.close();
This code retrieves only the first part of the page.
How can I retrieve the content that the infinite scroll loads?
Thanks.
You won't get that through a simple BufferedReader looking at an HTTP stream. Open your browser console, then reach the end of the page. You'll see that an XHR call (asynchronous request) is fired toward this URL:
https://www.facebook.com/pages_reaction_units
with a lot of cryptic request parameters. You would need to perform this kind of call from your Java code, but the parameters are obfuscated for a reason: reproducing them from scratch doesn't seem to be a good approach.
Better to use an API provided by Facebook instead, most likely the Graph API.
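If all you need are the page's public posts, a call along these lines could replace the scrolling. This is only a minimal sketch using java.net and java.io classes; the API version, the /bbcnews/posts edge and the token placeholder are assumptions to verify against the Graph API documentation, and you need an access token with the appropriate permissions. The response is JSON and contains a paging.next URL you can follow instead of scrolling.

String accessToken = "replaceThisWithAccessToken"; // obtain one via Facebook Login or the Graph API Explorer
// The /{page}/posts edge returns the page's posts as JSON; paging.next in the
// response points to the next batch, which replaces the infinite scroll.
URL url = new URL("https://graph.facebook.com/v2.3/bbcnews/posts?access_token=" + accessToken);
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line); // raw JSON; feed it to a JSON parser of your choice
}
reader.close();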
Related
I'm trying to retrieve a GitHub web page from Java code, and for this I used the following code.
String startingUrl = "https://github.com/xxxxxx";
URL url = new URL(startingUrl);
HttpURLConnection uc = (HttpURLConnection) url.openConnection();
uc.connect();

String line = null;
StringBuffer tmp = new StringBuffer();
try {
    BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream(), "UTF-8"));
    while ((line = in.readLine()) != null) {
        tmp.append(line);
    }
    in.close();
} catch (FileNotFoundException e) {
    // ignored: the server answered with 404 and no body was read
}
However, the page I receive here is different from what I observe in the browser after logging in to GitHub. I tried sending an Authorization header as follows, but it didn't work either.
uc.setRequestProperty("Authorization", "Basic encodexxx");
How can I retrieve the same page that I see when I am logged in?
I can't tell you more without knowing exactly what you are getting back, but the most common issue for web crawlers is that website owners generally don't like them. So you should behave like a regular user, i.e. like your browser. Open your browser's developer tools (press F12) while visiting the site and look at what the browser sends in its request, then try to mimic it: for example, add Host, Referer, etc. to your headers. You will need to experiment with this.
Also good to know: some website owners use advanced techniques to block you from accessing their site, some won't stop you from crawling, and some will let you do whatever you want. The fairest option is to check www.somedomain.com/robots.txt, which lists the paths that crawlers are and are not allowed to access.
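For example, a request that copies the headers your browser sends could look like the sketch below; the header values are only illustrations taken from a typical browser, so replace them with whatever you see in the F12 network tab.

URL url = new URL("https://github.com/xxxxxx");
HttpURLConnection uc = (HttpURLConnection) url.openConnection();

// Mimic a regular browser: copy the headers shown in the F12 network tab.
// The values below are examples only.
uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
uc.setRequestProperty("Accept", "text/html,application/xhtml+xml");
uc.setRequestProperty("Accept-Language", "en-US,en;q=0.9");
uc.setRequestProperty("Referer", "https://github.com/");

BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream(), "UTF-8"));
String line;
while ((line = in.readLine()) != null) {
    System.out.println(line);
}
in.close();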
I am writing a Java application in which I need to access my chat history (chat messages between me and another Facebook friend). I have looked at this link, but it seems outdated, since I have noticed that Facebook has changed its Messenger API significantly. I was wondering if it is still possible to access my message history via Java.
P.S. I found a good Facebook Graph API client called RestFB, but I was not able to find a similar API for chat messages.
You can use the inbox resource of the Graph API: https://developers.facebook.com/docs/graph-api/reference/v2.3/user/inbox
Edit:
In order to use this from Java, you'll need to first follow the login instructions at https://developers.facebook.com/docs/facebook-login/v2.3 . That's a large enough operation that I'm going to assume that you've already done it -- it's well outside the scope of this answer (but I'm sure there are other questions that handle it sufficiently on StackOverflow if you look).
Once you have an access token for a particular session (you can get one to test with by going to https://developers.facebook.com/docs/graph-api/reference/v2.3/user/inbox, clicking the Graph Explorer button, clicking "Get Token" -> "Get Access Token", and ensuring that "read_mailbox" is selected under "Extended Permissions"), it's pretty straightforward to read the API. You can do it using only standard JDK classes in just a few lines:
String accessToken = "replaceThisWithAccessToken";
String urlString = MessageFormat.format(
        "https://graph.facebook.com/v2.3/me/inbox?access_token={0}&format=json&method=get",
        accessToken);
URL url = new URL(urlString);
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(url.openStream()));
String line;
while ((line = bufferedReader.readLine()) != null) {
    System.out.println(line);
}
bufferedReader.close();
This glosses over a lot of things: it doesn't help with authentication, assumes your active trust store contains a certification path for the Facebook SSL cert (it should), and ignores proper error handling. And in practice you'll want to use RestClient or something similar instead of using URL directly, but the above should be indicative of basically what you need to do.
I am trying to automate downloading text data from a website. Before I can access the website's data, I have to input my username and password. The code I use to scrape the text is listed below. The problem is I can't figure out how to log in to the page and redirect to the location of the data. I have tried logging in through my browser and then running my code through Eclipse, but I just end up getting data from the login screen. I can retrieve data from websites just fine provided I don't have to go through a login.
static public void printPageA(String urlString) {
    try {
        // Create a URL for the desired page
        URL url = new URL(urlString);

        // Read all the text returned by the server
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String str;
        while ((str = in.readLine()) != null) {
            // str is one line of text; readLine() strips the newline character(s)
            System.out.println(str);
        }
        in.close();
    } catch (MalformedURLException e) {
        // errors silently ignored
    } catch (IOException e) {
        // errors silently ignored
    }
}
I would suggest using the Apache HttpClient library. It makes HTTP requests easier and takes care of things like cookies. The site probably uses cookies to keep information about your session, so you need to do the following (a minimal sketch follows these steps):
Make the same request as when you submit the login form. This is probably a POST request with parameters such as username and password. You can see it in a network monitor of your browser (developer tools).
Read the response. It will probably contain a Set-Cookie header containing an ID of your session. You have to send this cookie along with all your subsequent requests, otherwise you will get to the login page. If you use the HTTP Client library, it will take care of it, no need to mess with it in your code.
Create a request to any page of the web site that requires authentication.
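Here is the sketch, using Apache HttpClient 4.x. The login URL, the protected-page URL and the form field names ("username", "password") are placeholders; take the real ones from your browser's network monitor.

// The cookie store keeps the session cookie between requests.
BasicCookieStore cookieStore = new BasicCookieStore();
CloseableHttpClient client = HttpClients.custom().setDefaultCookieStore(cookieStore).build();

// 1) Reproduce the login form POST (URL and field names are placeholders).
HttpPost login = new HttpPost("https://example.com/login");
login.setEntity(new UrlEncodedFormEntity(Arrays.asList(
        new BasicNameValuePair("username", "myUser"),
        new BasicNameValuePair("password", "myPassword"))));
CloseableHttpResponse loginResponse = client.execute(login);
EntityUtils.consume(loginResponse.getEntity()); // the Set-Cookie header is stored automatically
loginResponse.close();

// 2) Request the protected page; the session cookie is sent along for us.
HttpGet dataRequest = new HttpGet("https://example.com/protected/data.txt");
CloseableHttpResponse dataResponse = client.execute(dataRequest);
System.out.println(EntityUtils.toString(dataResponse.getEntity(), "UTF-8"));
dataResponse.close();
client.close();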
I'm using Java and trying to get the content of a website so that I can analyze the text on the page. However, every time I "GET" a response from the server, it is the login page rather than the page I am looking at.
I am logged into the website on all my browsers, but my application is not able to see the page as if it were me.
I also tried to use an API called "Yandex" --> http://api.yandex.com/rca/ as a workaround, but when I call the page through Yandex (which fetches its content), I only see information based on the returned login page.
Can anyone give me a direction to investigate? I would like to be able to get one item on the page of a website that I work for, but it doesn't seem possible.
String m_strSeedUrlPath = "http://myUrl.com/mypage.html"; // not https

URLConnection connection = new URL("http://rca.yandex.com/?key={MyActualKeyNotThisText}&url=" + m_strSeedUrlPath).openConnection();
connection.setRequestProperty("Accept-Charset", "UTF-8");
InputStream response = connection.getInputStream();

StringWriter writer = new StringWriter();
IOUtils.copy(response, writer, "UTF-8");
String strString = writer.toString();
System.out.println(strString);
The URLConnection object will connect to the page, but in a different session. You would have to programmatically log in from your Java code.
Create a URLConnection to the login page, POST the username and password, read the response via the connection's InputStream, and finally create a new connection to the page you wish to analyze. You'll also have to handle cookies in order to view the second page.
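A rough sketch with plain HttpURLConnection could look like this. The login URL, form field names and credentials are placeholders; real sites often also require hidden form fields or CSRF tokens, and may send more than one Set-Cookie header.

// 1) POST the login form (URL and field names are placeholders).
HttpURLConnection login = (HttpURLConnection) new URL("http://myUrl.com/login").openConnection();
login.setRequestMethod("POST");
login.setDoOutput(true);
login.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
OutputStream out = login.getOutputStream();
out.write("user=myUser&pass=myPassword".getBytes("UTF-8"));
out.close();

// 2) Grab the session cookie from the response.
String cookie = login.getHeaderField("Set-Cookie");

// 3) Request the real page, sending the cookie so the server recognises the session.
HttpURLConnection page = (HttpURLConnection) new URL("http://myUrl.com/mypage.html").openConnection();
if (cookie != null) {
    page.setRequestProperty("Cookie", cookie.split(";", 2)[0]); // keep only "name=value"
}
BufferedReader in = new BufferedReader(new InputStreamReader(page.getInputStream(), "UTF-8"));
String line;
while ((line = in.readLine()) != null) {
    System.out.println(line);
}
in.close();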
Hope this helps!
The URL that you are trying to access has its access restricted via login. Even if you are logged in via your browser, you won't be able to access the page from your Java application, because the browser has an authenticated session with the target website. That session is not visible to your Java application.
You would have to research how to log in to the website from your code and then get the page content.
I am trying to create a proxy server.
I want to read websites byte by byte so that I can display images and everything else. I tried readLine but I can't display images. Do you have any suggestions on how I can change my code and send all the data to the browser with a DataOutputStream object?
try {
    Socket s = new Socket(InetAddress.getByName(req.hostname), 80);
    String file = parcala(req.url);

    DataOutputStream out = new DataOutputStream(clientSocket.getOutputStream());
    BufferedReader dis = new BufferedReader(new InputStreamReader(s.getInputStream()));
    PrintWriter socketOut = new PrintWriter(s.getOutputStream());

    socketOut.print("GET " + req.url + "\n\n");
    //socketOut.print("Host: " + req.hostname);
    socketOut.flush();

    String line;
    while ((line = dis.readLine()) != null) {
        System.out.println(line);
    }
}
catch (Exception e) {}
}
Edited Part
This is what I have to do. I can block banned websites, but I can't allow access to other websites in my program.
In the filter program, you will open a TCP socket at the specified port and wait for connections. If a request comes (i.e. the client types a URL to access a web site), the application will process it to decide whether access is allowed or not and then, using the same socket, it will send the reply back to the client. After the client opened her connection to WebPolice (and her request has been checked and is allowed), the real web page needs to be shown to the client. Therefore, since the user already gave her request, now it is WebPolice's turn to forward the request so that the user can get the web page. Thus, WebPolice acts as a client and requests the web page. This means you need to open a connection to the web server (without closing the connection to the user), forward the request over this connection, get the reply and forward it back to the client. You will use threads to handle multiple connections (at the same time and/or at different times).
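For orientation only, a rough sketch of the structure described above might look like the code below. The port number and the extractHost / isBlocked helpers are placeholders you would have to fill in yourself.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class WebPolice {

    static final int LISTEN_PORT = 8888; // assumed port

    public static void main(String[] args) throws IOException {
        ServerSocket server = new ServerSocket(LISTEN_PORT);
        while (true) {
            Socket client = server.accept();
            new Thread(() -> handle(client)).start(); // one thread per connection
        }
    }

    static void handle(Socket client) {
        try (Socket c = client) {
            InputStream fromClient = c.getInputStream();
            OutputStream toClient = c.getOutputStream();

            // Read the client's request (a single read is kept for brevity;
            // a real proxy reads until the blank line that ends the headers).
            byte[] buf = new byte[8192];
            int n = fromClient.read(buf);
            if (n <= 0) {
                return;
            }
            String request = new String(buf, 0, n, "ISO-8859-1");
            String host = extractHost(request); // parse the Host: header (placeholder)

            if (isBlocked(host)) { // check against your banned-site list (placeholder)
                toClient.write("HTTP/1.0 403 Forbidden\r\n\r\nBlocked.\r\n".getBytes("ISO-8859-1"));
                return;
            }

            // Act as a client ourselves: forward the request, then relay the reply bytes.
            try (Socket serverSide = new Socket(host, 80)) {
                serverSide.getOutputStream().write(buf, 0, n);
                serverSide.getOutputStream().flush();
                InputStream fromServer = serverSide.getInputStream();
                int r;
                while ((r = fromServer.read(buf)) != -1) {
                    toClient.write(buf, 0, r); // raw bytes, so images survive intact
                }
                toClient.flush();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Placeholders: implement these for your assignment.
    static String extractHost(String request) { return "example.com"; }
    static boolean isBlocked(String host) { return false; }
}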
I don't know exactly what you're trying to do, but crafting an HTTP request and reading its response involves somewhat more than you have done here. readLine() won't work on binary data anyway.
You can take a look at the URLConnection class (example stolen from here):
URL oracle = new URL("http://www.oracle.com/");
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
Then you can read textual data from the in object; for binary data, read directly from yc.getInputStream() instead of wrapping it in a Reader.
readLine() will treat the line it reads as a String, so unless you want to mess around with converting back to bytes, I wouldn't recommend it.
I would just read bytes until there are no more to read, then write them out; this should allow you to grab the images, keeping file headers intact, which can be important when dealing with files other than text.
Hope this helps.
Instead of using a BufferedReader, you can try using the InputStream directly.
It has several methods for reading bytes.
http://docs.oracle.com/javase/6/docs/api/java/io/InputStream.html
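Building on that suggestion, a sketch of the byte-level copy could look like the helper method below (classes from java.io). In the question's code, fromServer would be s.getInputStream() and toBrowser the DataOutputStream out.

// Copies raw bytes from the remote server to the browser so that binary
// content such as images is not mangled by line-based reading.
static void relay(InputStream fromServer, DataOutputStream toBrowser) throws IOException {
    byte[] buffer = new byte[4096];
    int bytesRead;
    while ((bytesRead = fromServer.read(buffer)) != -1) { // -1 means the server closed the stream
        toBrowser.write(buffer, 0, bytesRead);
    }
    toBrowser.flush();
}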