Extracting links of a Facebook page - Java

How can I extract all the links of a Facebook page? Can I do it with jsoup and pass the "like" link as a parameter to extract the info of all the users who liked that particular page?
private static String readAll(Reader rd) throws IOException
{
StringBuilder sb = new StringBuilder();
int cp;
while ((cp = rd.read()) != -1)
{
sb.append((char) cp);
}
return sb.toString();
}
public static JSONObject readurl(String url) throws IOException, JSONException
{
InputStream is = new URL(url).openStream();
try
{
BufferedReader rd = new BufferedReader
(new InputStreamReader(is, Charset.forName("UTF-8")));
String jsonText = readAll(rd);
JSONObject json = new JSONObject(jsonText);
return json;
}
finally
{
is.close();
}
}
public static void main(String[] args) throws IOException,
JSONException, FacebookException
{
try
{
System.out.println("\nEnter the search string:");
@SuppressWarnings("resource")
Scanner sc = new Scanner(System.in);
String s = sc.nextLine();
JSONObject json = readurl("https://graph.facebook.com/" + s);
System.out.println(json);
}
catch (JSONException e)
{
e.printStackTrace();
}
}
Can I modify and integrate this code? The code below extracts all the links of a particular page. I tried to integrate it with the code above, but it's not working.
// illustrative wrapper method so the snippet compiles as part of a class
public static void printLinks() throws IOException {
String url = "http://www.firstpost.com/tag/crime-in-india";
Document doc = Jsoup.connect(url).get();
Elements links = doc.getElementsByTag("a");
System.out.println(links.size());
for (Element link : links)
{
System.out.println(link.absUrl("href") + " " + trim(link.text(), 35));
}
}
public static String trim(String s, int width) {
if (s.length() > width)
return s.substring(0, width-1) + ".";
else
return s;
}
}
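For the integration question, here is a minimal sketch of how the two pieces could sit together, assuming it is added to the same class as readurl(...) above and that org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element and org.jsoup.select.Elements are imported; the method name extractPageAndLinks and the page-name parameter are only illustrative:
public static void extractPageAndLinks(String pageName) throws IOException, JSONException {
    // 1. public Graph API data for the page (reuses readurl(...) defined above)
    JSONObject json = readurl("https://graph.facebook.com/" + pageName);
    System.out.println(json);
    // 2. fetch the page HTML with jsoup and print every absolute link found in <a> tags
    Document doc = Jsoup.connect("https://www.facebook.com/" + pageName).get();
    Elements links = doc.getElementsByTag("a");
    System.out.println(links.size() + " links found");
    for (Element link : links) {
        System.out.println(link.absUrl("href") + " " + link.text());
    }
}
Note that facebook.com serves very little static markup to anonymous clients, so jsoup may see far fewer links than a browser does, and the list of users who liked a page is only exposed through the Graph API with a suitable access token, not in the HTML.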

You can also try an alternative way, like this:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
public class URLExtractor {
private static class HTMLPaserCallBack extends HTMLEditorKit.ParserCallback {
private Set<String> urls;
public HTMLPaserCallBack() {
urls = new LinkedHashSet<String>();
}
public Set<String> getUrls() {
return urls;
}
@Override
public void handleSimpleTag(Tag t, MutableAttributeSet a, int pos) {
handleTag(t, a, pos);
}
@Override
public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
handleTag(t, a, pos);
}
private void handleTag(Tag t, MutableAttributeSet a, int pos) {
if (t == Tag.A) {
Object href = a.getAttribute(HTML.Attribute.HREF);
if (href != null) {
String url = href.toString();
if (!urls.contains(url)) {
urls.add(url);
}
}
}
}
}
public static void main(String[] args) throws IOException {
InputStream is = null;
try {
String u = "https://www.facebook.com/";
URL url = new URL(u);
is = url.openStream(); // throws an IOException
HTMLPaserCallBack cb = new HTMLPaserCallBack();
new ParserDelegator().parse(new BufferedReader(new InputStreamReader(is)), cb, true);
for (String aUrl : cb.getUrls()) {
System.out.println("Found URL: " + aUrl);
}
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
try {
is.close();
} catch (IOException ioe) {
// nothing to see here
}
}
}
}
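One thing to keep in mind with the HTMLEditorKit approach above is that the callback collects the raw href attribute values, so relative links come back unresolved. A small sketch of how they could be resolved against the page URL with java.net.URL (class and variable names are illustrative):
import java.net.MalformedURLException;
import java.net.URL;

public class LinkResolver {
    // new URL(context, spec) resolves a possibly relative href against the page it came from
    public static String resolve(String pageUrl, String href) throws MalformedURLException {
        return new URL(new URL(pageUrl), href).toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(resolve("https://www.facebook.com/", "/policies"));
        // prints https://www.facebook.com/policies
    }
}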

This kind of works, but I'm not sure you can use jsoup for this; I would rather look into CasperJS or PhantomJS.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class getFaceBookLinks {
public static Elements getElementsByTag_then_FilterBySelector (String tag, String httplink, String selector){
Document doc = null;
try {
doc = Jsoup.connect(httplink).get();
} catch (IOException e) {
e.printStackTrace();
}
Elements links = doc.getElementsByTag(tag);
return links.select(selector);
}
//Test functionality
public static void main(String[] args){
// The class name for the like links on facebook is UFILikeLink
Elements likeLinks = getElementsByTag_then_FilterBySelector("a", "http://www.facebook.com", ".UFILikeLink");
System.out.println(likeLinks);
}
}

Related

I'm trying to read a text file and store it in an arraylist of objects

I'm trying to read a text file and store it in an ArrayList of objects, but I keep getting an error saying I cannot convert a String to an Item, which is the type of the ArrayList I am using. I have tried various solutions, but I am not quite sure how it is supposed to be done. I am new to coding and have this assignment due soon. Anything helps!
private void loadFile(String FileName)
{
Scanner in;
Item line;
try
{
in = new Scanner(new File(FileName));
while (in.hasNext())
{
line = in.nextLine();
MyStore.add(line);
}
in.close();
}
catch (IOException e)
{
System.out.println("FILE NOT FOUND.");
}
}
my apologies for not adding the Item class
public class Item
{
private int myId;
private int myInv;
//default constructor
public Item()
{
myId = 0;
myInv = 0;
}
//"normal" constructor
public Item(int id, int inv)
{
myId = id;
myInv = inv;
}
//copy constructor
public Item(Item OtherItem)
{
myId = OtherItem.getId();
myInv = OtherItem.getInv();
}
public int getId()
{
return myId;
}
public int getInv()
{
return myInv;
}
public int compareTo(Item Other)
{
int compare = 0;
if (myId > Other.getId())
{
compare = 1;
}
else if (myId < Other.getId())
{
compare = -1;
}
return compare;
}
public boolean equals(Item Other)
{
boolean equal = false;
if (myId == Other.getId())
{
equal = true;;
}
return equal;
}
public String toString()
{
String Result;
Result = String.format("%8d%8d", myId, myInv);
return Result;
}
}
This is the creation of my arraylist.
private ArrayList MyStore = new ArrayList ();
Here is a sample of my text file.
3679 87
196 60
12490 12
18618 14
2370 65
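Since the sample rows above are separated by spaces rather than commas, and the Item constructor shown above takes two ints, a minimal sketch of a loadFile(...) matching this exact format could look like the following (it assumes MyStore is declared as private ArrayList<Item> MyStore = new ArrayList<>(); and the usual java.io.File, java.io.IOException and java.util.Scanner imports):
private void loadFile(String fileName)
{
    try
    {
        Scanner in = new Scanner(new File(fileName));
        while (in.hasNextLine())
        {
            // the sample rows are space-separated: "3679 87", "196 60", ...
            String[] parts = in.nextLine().trim().split("\\s+");
            if (parts.length < 2)
            {
                continue; // skip blank or malformed lines
            }
            MyStore.add(new Item(Integer.parseInt(parts[0]), Integer.parseInt(parts[1])));
        }
        in.close();
    }
    catch (IOException e)
    {
        System.out.println("FILE NOT FOUND.");
    }
}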
/*
* To change this license header, choose License Headers in Project Properties.
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package com.mycompany.rosmery;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
/**
*
* @author Sem-6-INGENIERIAINDU
*/
public class aaa {
public static void main(String arg[]) throws FileNotFoundException, IOException{
BufferedReader files=new BufferedReader(new FileReader(new File(""))); // put the path to your file here
List<String> dto=new ArrayList<>();
String line;
// read each line exactly once: calling readLine() again inside the body would skip every other line
while((line= files.readLine())!= null){
dto.add(line);
// apply the logic for this data here
}
files.close();
}
}
in.nextLine() returns a String.
So, you cannot assign in.nextLine() to an instance of Item.
Your code may need to be corrected as follows:
List<String> myStore = new ArrayList<String>();
private void loadFile(String FileName)
{
Scanner in;
try
{
in = new Scanner(new File(FileName));
while (in.hasNext())
{
myStore.add(in.nextLine());
}
in.close();
}
catch (IOException e)
{
System.out.println("FILE NOT FOUND.");
}
}
If you want to have a list of Item after reading the file, then you need to provide the logic that converts a given line of information into an instance of Item.
Let's say your file content is in the following format:
id1,inv1
id2,inv2
.
.
Then, you can use the type Item as follows:
List<Item> myStore = new ArrayList<Item>();
private void loadFile(String FileName)
{
Scanner in;
String[] line;
try
{
in = new Scanner(new File(FileName));
while (in.hasNext())
{
line = in.nextLine().split(",");
myStore.add(new Item(Integer.parseInt(line[0]), Integer.parseInt(line[1]))); // the Item constructor takes two ints
}
in.close();
}
catch (IOException e)
{
System.out.println("FILE NOT FOUND.");
}
}
One possible solution (assuming that the data in the file lines is separated by a comma), using streams:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class Main {
public static void main(String[] args) throws IOException {
List<Item> items = loadFile("myfile.txt");
System.out.println(items);
}
private static List<Item> loadFile(String fileName) throws IOException {
try (Stream<String> stream = Files.lines(Paths.get(fileName))) {
return stream
.map(s -> Stream.of(s.split(",")).mapToInt(Integer::parseInt).toArray())
.map(i -> new Item(i[0], i[1]))
.collect(Collectors.toList());
}
}
}
or with a for-each loop:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class Main {
public static void main(String[] args) throws IOException {
List<Item> items = new ArrayList<>();
for (String line : loadFile("myfile.txt")) {
String[] data = line.split(",");
int id = Integer.parseInt(data[0]);
int inv = Integer.parseInt(data[1]);
items.add(new Item(id, inv));
}
System.out.println(items);
}
private static List<String> loadFile(String fileName) throws IOException {
try (Stream<String> stream = Files.lines(Paths.get(fileName))) {
return stream.collect(Collectors.toList());
}
}
}

String line or StringTokenizer with a Reader?

I had a file to read, and with this code my JUnit tests pass. As you can see, I pass the String line as a parameter to the readPrevisione(...) method.
package oroscopo.persistence;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.StringTokenizer;
import oroscopo.model.Previsione;
import oroscopo.model.SegnoZodiacale;
public class TextFileOroscopoRepository implements OroscopoRepository {
private HashMap<String, List<Previsione>> mapSettore = new HashMap<>();
public TextFileOroscopoRepository(Reader baseReader) throws IOException, BadFileFormatException{
if (baseReader == null)
throw new IllegalArgumentException("baseReader is null");
BufferedReader bufReader = new BufferedReader(baseReader);
String line;
while((line=bufReader.readLine()) != null){
readPrevisione(line,bufReader);
}
}
private void readPrevisione(String line, BufferedReader bufReader) throws IOException, BadFileFormatException{
String nomeSettore = line.trim();
if (!Character.isUpperCase(nomeSettore.charAt(0)))
throw new BadFileFormatException();
List<Previsione> listaPrev = new ArrayList<>();
while (!(line = bufReader.readLine()).equalsIgnoreCase("FINE")){
try{
StringTokenizer st1 = new StringTokenizer(line, "\t");
if(st1.countTokens() < 2)
throw new BadFileFormatException();
String prev = st1.nextToken("\t").trim();
int val = Integer.parseInt(st1.nextToken("\t").trim());
Set<SegnoZodiacale> segni = new HashSet<>();
if (st1.hasMoreTokens()){
while(st1.hasMoreTokens()){
try{
segni.add(SegnoZodiacale.valueOf(st1.nextToken(",").trim()));
}
catch (IllegalArgumentException e){
throw new BadFileFormatException();
}
}
Previsione p = new Previsione(prev,val,segni);
listaPrev.add(p);
}
else{
Previsione p2 = new Previsione(prev,val);
listaPrev.add(p2);
}
}
catch (NumberFormatException e){
throw new BadFileFormatException();
}
catch (NoSuchElementException e){
throw new BadFileFormatException();
}
}
mapSettore.put(nomeSettore, listaPrev);
}
@Override
public Set<String> getSettori() {
return mapSettore.keySet();
}
@Override
public List<Previsione> getPrevisioni(String settore) {
return mapSettore.get(settore.toUpperCase());
}
}
Here, with the same code, instead of passing the read line as a parameter, I pass a StringTokenizer that has already read the line. It should work like the version above, but my JUnit tests fail. What did I do wrong?
package oroscopo.persistence;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.StringTokenizer;
import oroscopo.model.Previsione;
import oroscopo.model.SegnoZodiacale;
public class TextFileOroscopoRepository implements OroscopoRepository {
private HashMap<String, List<Previsione>> mapSettore = new HashMap<>();
public TextFileOroscopoRepository(Reader baseReader) throws IOException, BadFileFormatException{
if (baseReader == null)
throw new IllegalArgumentException("baseReader is null");
BufferedReader bufReader = new BufferedReader(baseReader);
String line;
while((line=bufReader.readLine()) != null){
StringTokenizer st = new StringTokenizer(line);
readPrevisione(st,bufReader);
}
}
private void readPrevisione(StringTokenizer st, BufferedReader bufReader) throws IOException, BadFileFormatException{
String nomeSettore = st.nextToken().trim();
if (!Character.isUpperCase(nomeSettore.charAt(0)))
throw new BadFileFormatException();
List<Previsione> listaPrev = new ArrayList<>();
String line;
while (!(line = bufReader.readLine()).equalsIgnoreCase("FINE")){
try{
StringTokenizer st1 = new StringTokenizer(line, "\t");
if(st1.countTokens() < 2)
throw new BadFileFormatException();
String prev = st1.nextToken("\t").trim();
int val = Integer.parseInt(st1.nextToken("\t").trim());
Set<SegnoZodiacale> segni = new HashSet<>();
if (st1.hasMoreTokens()){
while(st1.hasMoreTokens()){
try{
segni.add(SegnoZodiacale.valueOf(st1.nextToken(",").trim()));
}
catch (IllegalArgumentException e){
throw new BadFileFormatException();
}
}
Previsione p = new Previsione(prev,val,segni);
listaPrev.add(p);
}
else{
Previsione p2 = new Previsione(prev,val);
listaPrev.add(p2);
}
}
catch (NumberFormatException e){
throw new BadFileFormatException();
}
catch (NoSuchElementException e){
throw new BadFileFormatException();
}
}
mapSettore.put(nomeSettore, listaPrev);
}
@Override
public Set<String> getSettori() {
return mapSettore.keySet();
}
@Override
public List<Previsione> getPrevisioni(String settore) {
return mapSettore.get(settore.toUpperCase());
}
}
EDIT: Here is the File.txt that I want to read.
And here is an example of one of my JUnit tests:
@Test
public void testLetturaCorrettaPrevisioni1() throws IOException, BadFileFormatException {
Reader mr = new StringReader(
"NOMESEZIONE\navrai la testa un po' altrove\t\t4\tARIETE,TORO,GEMELLI\ngrande intimita'\t9\nFINE\n"
+ "SEZIONE2\ntesto di prova\t\t\t\t\t66\t\nFINE");
OroscopoRepository or = new TextFileOroscopoRepository(mr);
assertEquals("avrai la testa un po' altrove", or.getPrevisioni("nomesezione").get(0).getPrevisione());
assertEquals(4, or.getPrevisioni("nomesezione").get(0).getValore());
Set<SegnoZodiacale> validi = new HashSet<SegnoZodiacale>() {
private static final long serialVersionUID = 1L;
{
add(SegnoZodiacale.ARIETE);
add(SegnoZodiacale.TORO);
add(SegnoZodiacale.GEMELLI);
}
};
for (SegnoZodiacale s : SegnoZodiacale.values()) {
if (validi.contains(s))
assertTrue(or.getPrevisioni("nomesezione").get(0).validaPerSegno(s));
else
assertFalse(or.getPrevisioni("nomesezione").get(0).validaPerSegno(s));
}
assertEquals("grande intimita'", or.getPrevisioni("nomesezione").get(1).getPrevisione());
assertEquals(9, or.getPrevisioni("nomesezione").get(1).getValore());
for (SegnoZodiacale s : SegnoZodiacale.values()) {
assertTrue(or.getPrevisioni("nomesezione").get(1).validaPerSegno(s));
}
}
You are creating the StringTokenizer with the default delimiters, that is, "the space character, the tab character, the newline character, the carriage-return character, and the form-feed character."
So in the first case you set the whole line as the value of the "nomeSettore" variable, but when you use StringTokenizer.nextToken() you assign to "nomeSettore" just the value of the first token. So "nomeSettore" can have different values if your String "line" contains whitespace, and you will end up with different key-value pairs inside your map.
You can take a look at this example:
public class TestSO {
public static void main(String[] args) {
String line = "abcdfs faf afd fa";
StringTokenizer st = new StringTokenizer(line);
readPrevisione(st, null);
readPrevisione(line, null);
}
private static void readPrevisione(StringTokenizer st, BufferedReader bufReader) {
String nomeSettore = st.nextToken().trim();
System.out.println(nomeSettore);
}
private static void readPrevisione(String st, BufferedReader bufReader) {
String nomeSettore = st.trim();
System.out.println(nomeSettore);
}
}
It prints as output:
abcdfs
abcdfs faf afd fa
I've understood why it didn't work.
The String line was : "EXAMPLE\n"
but after
while((line=bufReader.readLine()) != null){
...}
line = "EXAMPLE" because the readLine() eats the newline.
So I passed to the readPrevisione() a StringTokenizer as parameter
while((line=bufReader.readLine()) != null){
StringTokenizer st = new StringTokenizer(line);
readPrevisione(st,bufReader);
}
private void readPrevisione(StringTokenizer st, BufferedReader bufReader) throws IOException, BadFileFormatException{
String nomeSettore = st.nextToken().trim();
...}
And st.nextToken() searches for a \n that is not contained in "EXAMPLE". That's why it didn't work.

File list checker blocks deleting and naming of files?

In the code below some output is created: some numbers. One of the numbers is a hash value, which is calculated from a folder.
While the hash of the folder is being calculated, deleting, adding, and renaming files seems to be somehow blocked. Is this normal behavior, or could something be changed in the code of the TaskStartPart or TaskPart class?
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Date;
import java.util.List;
public class RestartTest {
StringBuilder sb;
String dtf = "============================";
String hexRes2 = "";
int i1 = 0;
int i2 = 0;
/**
* @param args the command line arguments
*/
public static void main(String[] args) throws InterruptedException, IOException, NoSuchAlgorithmException {
// TODO code application logic here
new RestartTest().startApp();
}
public void startApp() throws InterruptedException, IOException, NoSuchAlgorithmException {
TaskStart startTask = new TaskStart();
startTask.startCalc();
}
class TaskStart {
public void startCalc() throws InterruptedException, IOException, NoSuchAlgorithmException {
while(!Thread.currentThread().isInterrupted()) {
i1 = (int) (Math.random() * 1000);
System.out.println("Value 1: " + i1);
new TaskStart2().startCalc2();
new TaskStartPart().calculHash();
dateiSpeichern(i1,i2,"");
}
}
}
class TaskStart2 {
public void startCalc2() throws InterruptedException, IOException {
i2 = (int) (Math.random() * 1000);
System.out.println("Value 2: " + i2);
dateiSpeichern(i1,i2,"");
}
}
class TaskStartPart {
public void calculHash() throws InterruptedException, IOException, NoSuchAlgorithmException {
try {
DigestInputStream digestInputStream=null ;
MessageDigest messageDigest=MessageDigest.getInstance("SHA-512") ;
digestInputStream=new DigestInputStream(new TaskPart(new File("C:\\Users\\win7p\\Documents/t")),messageDigest) ;
//System.out.println("Path :" + direc.toString()) ;
while(digestInputStream.read()>=0) ;
//System.out.print("\nsha-512 sum=") ;
for(byte b: messageDigest.digest()) {
hexRes2 += String.format("%02x",b);
} sb = new StringBuilder(hexRes2);
dateiSpeichern(0,0,sb.substring(hexRes2.length() - 128,hexRes2.length())); System.out.println(sb.substring(hexRes2.length() - 128,hexRes2.length()));
digestInputStream.close();
} catch (IOException ex ) {ex.printStackTrace();}
}
}
class TaskPart extends InputStream {
private File mFile ;
private List<File> mFiles ;
private InputStream mInputStream ;
public TaskPart(File file) throws FileNotFoundException {
mFile=file ;
if(file.isDirectory()) {
mFiles=new ArrayList<File>(Arrays.asList(file.listFiles())) ;
Collections.sort(mFiles) ;
mInputStream=nextInputStream() ;
} else {
mFiles=new ArrayList<File>() ;
mInputStream=new FileInputStream(file) ;
}
}
@Override
public int read() throws IOException {
int result=mInputStream==null?-1:mInputStream.read() ;
if(result<0 && (mInputStream=nextInputStream())!=null)
return read() ;
else return result ;
}
protected String getRelativePath(File file) {
return file.getAbsolutePath().substring(mFile.getAbsolutePath().length()) ;
}
protected InputStream nextInputStream() throws FileNotFoundException {
if(!mFiles.isEmpty()) {
File nextFile=mFiles.remove(0) ;
return new SequenceInputStream(
new ByteArrayInputStream(getRelativePath(nextFile).getBytes()),
new TaskPart(nextFile)) ;
}
else return null ;
}
}
private void dateiSpeichern(int i1, int i2, String hexR) throws InterruptedException, IOException {
try {
String tF = new SimpleDateFormat("dd-MM-yyyy HH-mm-ss").format(new Date().getTime());
try (BufferedWriter writer = new BufferedWriter(new FileWriter("C:\\Users\\win7p\\Documents/hashLog.txt", true))) {
writer.append(tF);
writer.newLine();
writer.append(dtf);
writer.newLine();
writer.append("Hash Value: ");
//If(hexR.length() == alHash.get(0))
//alHash.add(hexR);
writer.append(hexR);
writer.newLine();
writer.append("-----");
writer.append("Value 1:");
String si1 = Integer.toString(i1);
writer.append(si1);
writer.newLine();
writer.append("*****");
writer.append("Value 2:");
String si2 = Integer.toString(i2);
writer.append(si2);
writer.newLine();
writer.flush();
writer.close();
}
} catch(IOException ex) {System.out.print("konnte Datei nicht speichern");}
catch(NullPointerException nex) {System.out.println("no Log-File, try again...");}
} }
I think I have found the problem.
The class TaskPart extends InputStream holds a list (private List<File> mFiles;) that is used by the method protected InputStream nextInputStream().
The problem was that the list remained filled, so it needed to be cleared once the method was called, with mFiles.clear() in calculHash().
That way the files are no longer listed in that stream, and no longer blocked.
Thank you
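A possible additional cause, since open FileInputStreams keep files locked on Windows: TaskPart never closes the streams it creates, and InputStream.close() is a no-op unless overridden. A hedged sketch of the close handling that could be added to TaskPart (only the stream-closing parts are shown; the rest of the class stays as posted):
@Override
public void close() throws IOException {
    // release the handle of whatever stream is currently open,
    // otherwise the nested FileInputStreams keep the files locked
    if (mInputStream != null) {
        mInputStream.close();
    }
}

@Override
public int read() throws IOException {
    int result = mInputStream == null ? -1 : mInputStream.read();
    if (result < 0) {
        if (mInputStream != null) {
            mInputStream.close(); // close the exhausted stream before moving to the next one
        }
        if ((mInputStream = nextInputStream()) != null) {
            return read();
        }
    }
    return result;
}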

Search for a string in an HTML file using Jsoup

Can anyone help me with searching for a particular string in an HTML file using Jsoup or any other method? There are built-in methods, but they help with extracting the title or the script text inside specific tags, not with searching for a string in general.
In this code I have used one such built-in method to extract the title from the HTML page.
But I want to search for a string instead.
package dynamic_tester;
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class tester {
public static void main(String args[])
{
Document htmlFile = null;
{
try {
htmlFile = Jsoup.parse(new File("x.html"), "ISO-8859-1");
}
catch (IOException e)
{
e.printStackTrace();
}
String title = htmlFile.title();
System.out.println("Title = "+title);
}
}
}
Here's a sample. It reads the HTML file as a text String and then performs the search on that String.
package com.example;
import java.io.FileInputStream;
import java.nio.charset.Charset;
public class SearchTest {
public static void main(String[] args) throws Exception {
StringBuffer htmlStr = getStringFromFile("test.html", "ISO-8859-1");
boolean isPresent = htmlStr.indexOf("hello") != -1;
System.out.println("is Present ? : " + isPresent);
}
private static StringBuffer getStringFromFile(String fileName, String charSetOfFile) {
StringBuffer strBuffer = new StringBuffer();
try(FileInputStream fis = new FileInputStream(fileName)) {
byte[] buffer = new byte[10240]; //10K buffer;
int readLen = -1;
while( (readLen = fis.read(buffer)) != -1) {
strBuffer.append( new String(buffer, 0, readLen, Charset.forName(charSetOfFile)));
}
} catch(Exception ex) {
ex.printStackTrace();
strBuffer = new StringBuffer();
}
return strBuffer;
}
}
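Since the question asks specifically about Jsoup: once the file is parsed, the document's visible text can be searched directly, or the elements containing the string can be selected. A minimal sketch (the file name x.html and the search string "hello" are placeholders):
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupSearch {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.parse(new File("x.html"), "ISO-8859-1");
        // 1. simple yes/no check against the rendered text of the whole page
        boolean found = doc.text().contains("hello");
        System.out.println("is Present ? : " + found);
        // 2. list the elements whose own text contains the string
        Elements hits = doc.getElementsContainingOwnText("hello");
        for (Element e : hits) {
            System.out.println(e.tagName() + ": " + e.text());
        }
    }
}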

Getting many memory errors when trying to run my web crawler for a few days [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 7 years ago.
I am developing a web crawler application. When I run the program I get the error messages below:
I got these errors after running the program for more than 3 hours. I tried to allocate more memory by changing the eclipse.ini setting to 2048 MB of RAM, as was answered in this topic, but I still get the same errors after 3 hours or less. I need to run the program non-stop for more than 2-3 days to analyse the results.
Can you tell me what I am missing here that causes the errors below?
These are my classes:
seeds.txt
http://www.stanford.edu
http://www.archive.org
WebCrawler.java
package pkg.crawler;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.MalformedURLException;
import java.net.SocketTimeoutException;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.TimeUnit;
import org.jsoup.HttpStatusException;
import org.jsoup.UnsupportedMimeTypeException;
import org.joda.time.DateTime;
public class WebCrawler {
public static Queue <LinkNodeLight> queue = new PriorityBlockingQueue <> (); // priority queue
public static final int n_threads = 5; // amount of threads
private static Set<String> processed = new LinkedHashSet <> (); // set of processed urls
private PrintWriter out; // output file
private PrintWriter err; // error file
private static Integer cntIntra = new Integer (0); // counters for intra- links in the queue
private static Integer cntInter = new Integer (0); // counters for inter- links in the queue
private static Integer dub = new Integer (0); // amount of skipped urls
public static void main(String[] args) throws Exception {
System.out.println("Running web crawler: " + new Date());
WebCrawler webCrawler = new WebCrawler();
webCrawler.createFiles();
try (Scanner in = new Scanner(new File ("seeds.txt"))) {
while (in.hasNext()) {
webCrawler.enque(new LinkNode (in.nextLine().trim()));
}
} catch (IOException e) {
e.printStackTrace();
return;
}
webCrawler.processQueue();
webCrawler.out.close();
webCrawler.err.close();
}
public void processQueue(){
/* run in threads */
Runnable r = new Runnable() {
@Override
public void run() {
/* queue may be empty but process is not finished, that's why we need to check if any links are being processed */
while (true) {
LinkNode link = deque();
if (link == null)
continue;
link.setStartTime(new DateTime());
boolean process = processLink(link);
link.setEndTime(new DateTime());
if (!process)
continue;
/* print the data to the csv file */
if (link.getStatus() != null && link.getStatus().equals(LinkNodeStatus.OK)) {
synchronized(out) {
out.println(getOutputLine(link));
out.flush();
}
} else {
synchronized(err) {
err.println(getOutputLine(link));
err.flush();
}
}
}
}
};
/* run n_threads threads which perform dequeue and process */
LinkedList <Thread> threads = new LinkedList <> ();
for (int i = 0; i < n_threads; i++) {
threads.add(new Thread(r));
threads.getLast().start();
}
for (Thread thread : threads) {
try {
thread.join();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
/* returns true if link was actually processed */
private boolean processLink(LinkNode inputLink) {
String url = getUrlGeneralForm(inputLink);
boolean process = true;
synchronized (processed) {
if (processed.contains(url)) {
process = false;
synchronized (dub) {dub++;}
} else
processed.add(url);
}
/* start processing only if the url has not been processed yet and is not being processed */
if (process) {
System.out.println("Processing url " + url);
List<LinkNodeLight> outputLinks = parseAndWieghtResults(inputLink);
for (LinkNodeLight outputLink : outputLinks) {
String getUrlGeneralForumOutput = getUrlGeneralForm(outputLink);
/* add the new link to the queue only if it has not been processed yet */
process = true;
synchronized (processed) {
if (processed.contains(getUrlGeneralForumOutput)) {
process = false;
synchronized (dub) {dub++;}
}
}
if (process) {
enque(outputLink);
}
}
return true;
}
return false;
}
void enque(LinkNodeLight link){
link.setEnqueTime(new DateTime());
/* the add method requires implicit priority */
synchronized (queue) {
if (link.interLinks)
synchronized (cntInter) {cntInter++;}
else
synchronized (cntIntra) {cntIntra++;}
//queue.add(link, 100 - (int)(link.getWeight() * 100.f));
queue.add(link);
}
}
/**
* Picks an element from the queue
* @return top element from the queue or null if the queue is empty
*/
LinkNode deque(){
/* link must be checked */
LinkNode link = null;
synchronized (queue) {
link = (LinkNode) queue.poll();
if (link != null) {
link.setDequeTime(new DateTime());
if (link.isInterLinks())
synchronized (cntInter) {cntInter--;}
else
synchronized (cntIntra) {cntIntra--;}
}
}
return link;
}
private void createFiles() {
/* create output file */
try {
out = new PrintWriter(new BufferedWriter(new FileWriter("CrawledURLS.csv", false)));
out.println(generateHeaderFile());
} catch (IOException e) {
System.err.println(e);
}
/* create error file */
try {
err = new PrintWriter(new BufferedWriter(new FileWriter("CrawledURLSERROR.csv", false)));
err.println(generateHeaderFile());
} catch (IOException e) {
System.err.println(e);
}
}
/**
* formats the string so it can be a valid entry in the csv file
* @param s
* @return
*/
private static String format(String s) {
// replace " by ""
String ret = s.replaceAll("\"", "\"\"");
// put string into quotes
return "\"" + ret + "\"";
}
/**
* Creates the line that needs to be written in the output file
* @param link
* @return
*/
public static String getOutputLine(LinkNode link){
StringBuilder builder = new StringBuilder();
builder.append(link.getParentLink()!=null ? format(link.getParentLink().getUrl()) : "");
builder.append(",");
builder.append(link.getParentLink()!=null ? link.getParentLink().getIpAdress() : "");
builder.append(",");
builder.append(link.getParentLink()!=null ? link.getParentLink().linkProcessingDuration() : "");
builder.append(",");
builder.append(format(link.getUrl()));
builder.append(",");
builder.append(link.getDomain());
builder.append(",");
builder.append(link.isInterLinks());
builder.append(",");
builder.append(Util.formatDate(link.getEnqueTime()));
builder.append(",");
builder.append(Util.formatDate(link.getDequeTime()));
builder.append(",");
builder.append(link.waitingInQueue());
builder.append(",");
builder.append(queue.size());
/* Inter and intra links in queue */
builder.append(",");
builder.append(cntIntra.toString());
builder.append(",");
builder.append(cntInter.toString());
builder.append(",");
builder.append(dub);
builder.append(",");
builder.append(new Date ());
/* URL size*/
builder.append(",");
builder.append(link.getSize());
/* HTML file
builder.append(",");
builder.append(link.getFileName());*/
/* add HTTP error */
builder.append(",");
if (link.getParseException() != null) {
if (link.getParseException() instanceof HttpStatusException)
builder.append(((HttpStatusException) link.getParseException()).getStatusCode());
if (link.getParseException() instanceof SocketTimeoutException)
builder.append("Time out");
if (link.getParseException() instanceof MalformedURLException)
builder.append("URL is not valid");
if (link.getParseException() instanceof UnsupportedMimeTypeException)
builder.append("Unsupported mime type: " + ((UnsupportedMimeTypeException)link.getParseException()).getMimeType());
}
return builder.toString();
}
/**
* generates the header line for the csv file
* @return
*/
private String generateHeaderFile(){
StringBuilder builder = new StringBuilder();
builder.append("Seed URL");
builder.append(",");
builder.append("Seed IP");
builder.append(",");
builder.append("Process Duration");
builder.append(",");
builder.append("Link URL");
builder.append(",");
builder.append("Link domain");
builder.append(",");
builder.append("Link IP");
builder.append(",");
builder.append("Enque Time");
builder.append(",");
builder.append("Deque Time");
builder.append(",");
builder.append("Waiting in the Queue");
builder.append(",");
builder.append("QueueSize");
builder.append(",");
builder.append("Intra in queue");
builder.append(",");
builder.append("Inter in queue");
builder.append(",");
builder.append("Dublications skipped");
/* time was printed, but no header was */
builder.append(",");
builder.append("Time");
/* URL size*/
builder.append(",");
builder.append("Size bytes");
/* HTTP errors */
builder.append(",");
builder.append("HTTP error");
return builder.toString();
}
String getUrlGeneralForm(LinkNodeLight link){
String url = link.getUrl();
if (url.endsWith("/")){
url = url.substring(0, url.length() - 1);
}
return url;
}
private List<LinkNodeLight> parseAndWieghtResults(LinkNode inputLink) {
List<LinkNodeLight> outputLinks = HTMLParser.parse(inputLink);
if (inputLink.hasParseException()) {
return outputLinks;
} else {
return URLWeight.weight(inputLink, outputLinks);
}
}
}
HTMLParser.java
package pkg.crawler;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Writer;
import java.math.BigInteger;
import java.util.Formatter;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;
import java.security.*;
import java.nio.file.Path;
import java.nio.file.Paths;
public class HTMLParser {
private static final int READ_TIMEOUT_IN_MILLISSECS = (int) TimeUnit.MILLISECONDS.convert(30, TimeUnit.SECONDS);
private static HashMap <String, Integer> filecounter = new HashMap<> ();
public static List<LinkNodeLight> parse(LinkNode inputLink){
List<LinkNodeLight> outputLinks = new LinkedList<>();
try {
inputLink.setIpAdress(IpFromUrl.getIp(inputLink.getUrl()));
String url = inputLink.getUrl();
if (inputLink.getIpAdress() != null) {
url = url.replace(URLWeight.getHostName(url), inputLink.getIpAdress()); // String.replace returns a new String, so the result must be assigned
}
Document parsedResults = Jsoup
.connect(url)
.timeout(READ_TIMEOUT_IN_MILLISSECS)
.get();
inputLink.setSize(parsedResults.html().length());
/* IP address moved here in order to speed up the process */
inputLink.setStatus(LinkNodeStatus.OK);
inputLink.setDomain(URLWeight.getDomainName(inputLink.getUrl()));
if (true) {
/* save the file to the html */
String filename = parsedResults.title();//digestBig.toString(16) + ".html";
if (filename.length() > 24) {
filename = filename.substring(0, 24);
}
filename = filename.replaceAll("[^\\w\\d\\s]", "").trim();
filename = filename.replaceAll("\\s+", " ");
if (!filecounter.containsKey(filename)) {
filecounter.put(filename, 1);
} else {
Integer tmp = filecounter.remove(filename);
filecounter.put(filename, tmp + 1);
}
filename = filename + "-" + (filecounter.get(filename)).toString() + ".html";
filename = Paths.get("downloads", filename).toString();
inputLink.setFileName(filename);
/* use md5 of url as file name */
try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(filename)))) {
out.println("<!--" + inputLink.getUrl() + "-->");
out.print(parsedResults.html());
out.flush();
out.close();
} catch (IOException e) {
e.printStackTrace();
}
}
String tag;
Elements tagElements;
List<LinkNode> result;
tag = "a[href";
tagElements = parsedResults.select(tag);
result = toLinkNodeObject(inputLink, tagElements, tag);
outputLinks.addAll(result);
tag = "area[href";
tagElements = parsedResults.select(tag);
result = toLinkNodeObject(inputLink, tagElements, tag);
outputLinks.addAll(result);
} catch (IOException e) {
inputLink.setParseException(e);
inputLink.setStatus(LinkNodeStatus.ERROR);
}
return outputLinks;
}
static List<LinkNode> toLinkNodeObject(LinkNode parentLink, Elements tagElements, String tag) {
List<LinkNode> links = new LinkedList<>();
for (Element element : tagElements) {
if(isFragmentRef(element)){
continue;
}
String absoluteRef = String.format("abs:%s", tag.contains("[") ? tag.substring(tag.indexOf("[") + 1, tag.length()) : "href");
String url = element.attr(absoluteRef);
if(url!=null && url.trim().length()>0) {
LinkNode link = new LinkNode(url);
link.setTag(element.tagName());
link.setParentLink(parentLink);
links.add(link);
}
}
return links;
}
static boolean isFragmentRef(Element element){
String href = element.attr("href");
return href!=null && (href.trim().startsWith("#") || href.startsWith("mailto:"));
}
}
Util.java
package pkg.crawler;
import java.util.Date;
import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;
public class Util {
private static DateTimeFormatter formatter;
static {
formatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss:SSS");
}
public static String linkToString(LinkNode inputLink){
return String.format("%s\t%s\t%s\t%s\t%s\t%s",
inputLink.getUrl(),
inputLink.getWeight(),
formatDate(inputLink.getEnqueTime()),
formatDate(inputLink.getDequeTime()),
differenceInMilliSeconds(inputLink.getEnqueTime(), inputLink.getDequeTime()),
inputLink.getParentLink()==null?"":inputLink.getParentLink().getUrl()
);
}
public static String linkToErrorString(LinkNode inputLink){
return String.format("%s\t%s\t%s\t%s\t%s\t%s",
inputLink.getUrl(),
inputLink.getWeight(),
formatDate(inputLink.getEnqueTime()),
formatDate(inputLink.getDequeTime()),
inputLink.getParentLink()==null?"":inputLink.getParentLink().getUrl(),
inputLink.getParseException().getMessage()
);
}
public static String formatDate(DateTime date){
return formatter.print(date);
}
public static long differenceInMilliSeconds(DateTime dequeTime, DateTime enqueTime){
return (dequeTime.getMillis()- enqueTime.getMillis());
}
public static int differenceInSeconds(Date enqueTime, Date dequeTime){
return (int)((dequeTime.getTime()/1000) - (enqueTime.getTime()/1000));
}
public static int differenceInMinutes(Date enqueTime, Date dequeTime){
return (int)((dequeTime.getTime()/60000) - (enqueTime.getTime()/60000));
}
}
URLWeight.java
package pkg.crawler;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.regex.Pattern;
public class URLWeight {
public static List<LinkNodeLight> weight(LinkNode sourceLink, List<LinkNodeLight> links) {
List<LinkNodeLight> interLinks = new LinkedList<>();
List<LinkNodeLight> intraLinks = new LinkedList<>();
for (LinkNodeLight link : links) {
if (isIntraLink(sourceLink, link)) {
intraLinks.add(link);
link.setInterLinks(false);
} else {
interLinks.add(link);
link.setInterLinks(true);
}
}
/* the original weighting logic is not shown in the post; returning intra-links first simply keeps the method compilable */
List<LinkNodeLight> result = new LinkedList<>(intraLinks);
result.addAll(interLinks);
return result;
}
static boolean isIntraLink(LinkNodeLight sourceLink, LinkNodeLight link){
String parentDomainName = getHostName(sourceLink.getUrl());
String childDomainName = getHostName(link.getUrl());
return parentDomainName.equalsIgnoreCase(childDomainName);
}
public static String getHostName(String url) {
if(url == null){
// System.out.println("Deneme");
return "";
}
String domainName = new String(url);
int index = domainName.indexOf("://");
if (index != -1) {
domainName = domainName.substring(index + 3);
}
for (int i = 0; i < domainName.length(); i++)
if (domainName.charAt(i) == '?' || domainName.charAt(i) == '/') {
domainName = domainName.substring(0, i);
break;
}
/*if (index != -1) {
domainName = domainName.substring(0, index);
}*/
/* have to keep www in order to do replacements with IP */
//domainName = domainName.replaceFirst("^www.*?\\.", "");
return domainName;
}
public static String getDomainName(String url) {
String [] tmp= getHostName(url).split("\\.");
if (tmp.length == 0)
return "";
return tmp[tmp.length - 1];
}
}
PingTaskManager.java
package pkg.crawler;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class PingTaskManager {
private static ExecutorService executor = Executors.newFixedThreadPool(100);
public static void ping (LinkNode e) {
executor.submit(new PingTaks(e));
}
}
class PingTaks implements Runnable {
private LinkNode link;
public PingTaks( LinkNode link ) {
this.link = link;
}
@Override
public void run() {
/* link.ping(); */
}
}
LinkNodeStatus.java
package pkg.crawler;
public enum LinkNodeStatus {
OK,
ERROR
}
LinkNodeLight.java
package pkg.crawler;
import org.joda.time.DateTime;
public class LinkNodeLight implements Comparable<LinkNodeLight> {
protected String url;
protected float weight;
protected DateTime enqueTime;
protected boolean interLinks;
public String getUrl() {
return url;
}
public float getWeight() {
return weight;
}
public void setWeight(float weight) {
this.weight = weight;
}
public DateTime getEnqueTime() {
return enqueTime;
}
public LinkNodeLight(String url) {
this.url = url;
}
public void setEnqueTime(DateTime enqueTime) {
this.enqueTime = enqueTime;
}
@Override
public int compareTo(LinkNodeLight link) {
if (this.weight < link.weight) return 1;
else if (this.weight > link.weight) return -1;
return 0;
}
}
LinkNode.java
package pkg.crawler;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Socket;
import java.net.URL;
import java.net.UnknownHostException;
import java.util.Date;
import org.joda.time.DateTime;
public class LinkNode extends LinkNodeLight{
public LinkNode(String url) {
super(url);
}
private String tag;
private LinkNode parentLink;
private IOException parseException = null; // initialize parse Exception with null
private float weight;
private DateTime dequeTime;
private DateTime startTime;
private DateTime endTime;
private LinkNodeStatus status;
private String ipAdress;
private int size;
private String filename;
private String domain;
public DateTime getStartTime() {
return startTime;
}
public void setStartTime(DateTime startTime) {
this.startTime = startTime;
}
public DateTime getEndTime() {
return endTime;
}
public void setEndTime(DateTime endTime) {
this.endTime = endTime;
}
public DateTime getDequeTime() {
return dequeTime;
}
public String getTag() {
return tag;
}
public LinkNode getParentLink() {
return parentLink;
}
public Exception getParseException() {
return parseException;
}
public boolean hasParseException(){
return parseException!=null;
}
public void setDequeTime(DateTime dequeTime) {
this.dequeTime = dequeTime;
}
public void setTag(String tag) {
this.tag = tag;
}
public void setParentLink(LinkNode parentLink) {
this.parentLink = parentLink;
}
public void setParseException(IOException parseException) {
this.parseException = parseException;
}
@Override
public boolean equals(Object o) {
if (this == o) {
return true;
}
if (o == null || getClass() != o.getClass()) {
return false;
}
LinkNode link = (LinkNode) o;
if (url != null ? !url.equals(link.url) : link.url != null) {
return false;
}
return true;
}
@Override
public int hashCode() {
return url != null ? url.hashCode() : 0;
}
public long waitingInQueue(){
return Util.differenceInMilliSeconds( dequeTime,enqueTime );
}
public long linkProcessingDuration(){
return Util.differenceInMilliSeconds( endTime,startTime );
}
@Override
public String toString() {
StringBuilder sb = new StringBuilder("LinkNode{");
sb.append("url='").append(url).append('\'');
sb.append(", score=").append(weight);
sb.append(", enqueTime=").append(enqueTime);
sb.append(", dequeTime=").append(dequeTime);
sb.append(", tag=").append(tag);
if(parentLink!=null) {
sb.append(", parentLink=").append(parentLink.getUrl());
}
sb.append('}');
return sb.toString();
}
public void setStatus(LinkNodeStatus status) {
this.status = status;
}
public LinkNodeStatus getStatus(){
if (status == null) {
status = LinkNodeStatus.ERROR;
}
return status;
}
// check server link is it exist or not
/* this method gives fake errors
public LinkNodeStatus ping () {
boolean reachable = false;
String sanitizeUrl = url.replaceFirst("^https", "http");
try {
HttpURLConnection connection = (HttpURLConnection) new URL(sanitizeUrl).openConnection();
connection.setConnectTimeout(1000);
connection.setRequestMethod("HEAD");
int responseCode = connection.getResponseCode();
System.err.println(url + " " + responseCode);
reachable = (200 <= responseCode && responseCode <= 399);
} catch (IOException exception) {
}
return reachable?LinkNodeStatus.OK: LinkNodeStatus.ERROR;
}*/
public String getIpAdress() {
return ipAdress;
}
public void setIpAdress(String ipAdress) {
this.ipAdress = ipAdress;
}
/* methods for controlling url size */
public void setSize(int size) {
this.size = size;
}
public int getSize() {
return this.size;
}
public void setFileName(String filename) {
this.filename = filename;
}
public String getFileName() {
return this.filename;
}
public String getDomain() {
return domain;
}
public void setDomain(String domain) {
this.domain = domain;
}
}
I tried to allocate more memory by changing the eclipse.ini setting to 2048 MB of RAM, as was answered in this topic, but I still get the same errors after 3 hours or less.
I hate to repeat myself(*), but in eclipse.ini you set up the memory for Eclipse, which has nothing to do with the memory for your crawler.
When using command line, you need to start it via java -Xmx2G pkg.crawler.WebCrawler.
When starting from Eclipse, you need to add -Xmx2G to the run configuration ("VM arguments" rather than "Program arguments").
(*) Link to a deleted question; requires some reputation to view.
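To confirm that the -Xmx setting actually reached the crawler's JVM (and not just Eclipse), a quick check like this at the start of WebCrawler.main could help; it simply reports the heap limit of the JVM it runs in:
long maxHeapBytes = Runtime.getRuntime().maxMemory();
System.out.println("Max heap available to this JVM: " + (maxHeapBytes / (1024 * 1024)) + " MB");
// if this still prints the small default value, the -Xmx2G flag was not applied to this run configuration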
