Implementing Threads Into Java Web Crawler - java

Here is the original web crawler that I wrote (just for reference):
https://github.com/domshahbazi/java-webcrawler/tree/master
This is a simple web crawler that visits a given initial web page, scrapes all the links from the page, and adds them to a queue (LinkedList). The links are then popped off one by one and visited, and the cycle starts again. To speed up my program, and as a learning exercise, I tried to reimplement it with threads so that several threads could operate at once, indexing more pages in less time. Each class is shown below:
Main class
public class controller {
public static void main(String args[]) throws InterruptedException {
DataStruc data = new DataStruc("http://www.imdb.com/title/tt1045772/?ref_=nm_flmg_act_12");
Thread crawl1 = new Crawler(data);
Thread crawl2 = new Crawler(data);
crawl1.start();
crawl2.start();
}
}
Crawler Class (Thread)
public class Crawler extends Thread {
/** Instance of Data Structure **/
DataStruc data;
/** Number of page connections allowed before program terminates **/
private final int INDEX_LIMIT = 10;
/** Initial URL to visit **/
public Crawler(DataStruc d) {
data = d;
}
public void run() {
// Counter to keep track of number of indexed URLS
int counter = 0;
// While URL's left to visit
while((data.url_to_visit_size() > 0) && counter<INDEX_LIMIT) {
// Pop next URL to visit from stack
String currentUrl = data.getURL();
try {
// Fetch and parse HTML document
Document doc = Jsoup.connect(currentUrl)
.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36")
.referrer("http://www.google.com")
.timeout(12000)
.followRedirects(true)
.get();
// Increment counter if connection to web page succeeds
counter++;
/** .select returns a list of elements (links in this case) **/
Elements links = doc.select("a[href]"); // Relative URL
// Add newly found links to stack
addLinksToQueue(links);
} catch (IOException e) {
//e.printStackTrace();
System.out.println("Error: "+currentUrl);
}
}
}
public void addLinksToQueue(Elements el) {
// For each element in links
for(Element e : el) {
String theLink = e.attr("abs:href"); // 'abs' prefix ensures absolute url is returned rather then relative url ('www.reddit.com/hello' rather then '/hello')
if(theLink.startsWith("http") && !data.oldLink(theLink)) {
data.addURL(theLink);
data.addVisitedURL(theLink); // Register each unique URL to ensure it isnt stored in 'url_to_visit' again
System.out.println(theLink);
}
}
}
}
DataStruc Class
public class DataStruc {
/** Queue to store URL's, can be accessed by multiple threads **/
private ConcurrentLinkedQueue<String> url_to_visit = new ConcurrentLinkedQueue<String>();
/** ArrayList of visited URL's **/
private ArrayList<String> visited_url = new ArrayList<String>();
public DataStruc(String initial_url) {
url_to_visit.offer(initial_url);
}
// Method to add seed URL to queue
public void addURL(String url) {
url_to_visit.offer(url);
}
// Get URL at front of queue
public String getURL() {
return url_to_visit.poll();
}
// URL to visit size
public int url_to_visit_size() {
return url_to_visit.size();
}
// Add visited URL
public void addVisitedURL(String url) {
visited_url.add(url);
}
// Checks if link has already been visited
public boolean oldLink(String link) {
for(String s : visited_url) {
if(s.equals(link)) {
return true;
}
}
return false;
}
}
DataStruc is the shared data structure class, which is accessed concurrently by each Crawler thread. DataStruc has a queue to store links to be visited and an ArrayList to store visited URLs, to prevent entering a loop. I used a ConcurrentLinkedQueue to store the URLs to be visited, since it takes care of concurrent access. I didn't require concurrent access with my ArrayList of visited URLs, as all I need to do is add to it and iterate over it to check for matches.
My problem is that when I compare the running time of a single thread against 2 threads (on the same URL), my single-threaded version seems to run faster. I feel I have implemented the threading incorrectly, and would appreciate some tips if anybody can pinpoint the issues.
Thanks!

Added: see my comment, I think the check in Crawler
// While URL's left to visit
while((data.url_to_visit_size() > 0) && counter<INDEX_LIMIT) {
is wrong. The 2nd Thread will stop immediately since the 1st Thread polled the only URL.
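One way around that early exit, as a minimal sketch only: count the URLs that are queued or still being processed, and let a worker quit only when that count reaches zero. The names Frontier, FrontierDemo, next, done and finished are made up for illustration, and an in-memory link map stands in for the Jsoup fetch so the example runs without network access. The visited set is a concurrent set here, because several threads add to it at once.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical shared frontier: queue of URLs to visit plus a thread-safe "seen" set.
class Frontier {
    private final ConcurrentLinkedQueue<String> toVisit = new ConcurrentLinkedQueue<>();
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    // Number of URLs that are queued or currently being processed.
    private final AtomicInteger pending = new AtomicInteger();

    Frontier(String seed) { add(seed); }

    // Set.add is atomic, so each URL is enqueued at most once.
    void add(String url) {
        if (seen.add(url)) {
            pending.incrementAndGet();
            toVisit.offer(url);
        }
    }

    // Returns the next URL, or null if the queue is momentarily empty.
    String next() { return toVisit.poll(); }

    // Called once a URL has been fully processed (its links were already added).
    void done() { pending.decrementAndGet(); }

    // True only when nothing is queued and no worker is in the middle of a page.
    boolean finished() { return pending.get() == 0; }

    int seenCount() { return seen.size(); }
}

class FrontierDemo {
    // Crawls a fake in-memory link graph with the given number of threads.
    static int crawl(Map<String, List<String>> graph, String seed, int threads) {
        Frontier frontier = new Frontier(seed);
        Runnable worker = () -> {
            while (!frontier.finished()) {
                String url = frontier.next();
                if (url == null) {      // empty for now; another worker may still add links
                    Thread.yield();
                    continue;
                }
                try {
                    for (String link : graph.getOrDefault(url, Collections.emptyList())) {
                        frontier.add(link);
                    }
                } finally {
                    frontier.done();
                }
            }
        };
        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(worker);
            pool[i].start();
        }
        for (Thread t : pool) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return frontier.seenCount();
    }
}
```

The idle workers spin with Thread.yield() to keep the sketch short; a real crawler would more likely block on a queue or sleep briefly instead.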
You can ignore the rest, which I have left for history.
My general approach to such types of "big blocks that can run in parallel" is:
Make each crawler a Callable. Probably Callable<List<String>>
Submit them to an ExecutorService
When they complete, take the results one at a time and add them to a List.
Using this strategy there is no need to use any concurrent lists at all. The disadvantage is that you don't get much live feedback while they are running. And, if what they return is huge, you may need to worry about memory.
Would this suit your needs? You would have to worry about the addVisitedURL so you still need that as a concurrent data structure.
Added: Since you are starting with a single URL this strategy doesn't apply. You could apply it after the visit to the first URL.
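A rough sketch of that Callable strategy, assuming you already have several URLs to start from. CallableCrawlSketch, fetchLinks and crawlAll are hypothetical names, and fetchLinks just fabricates two links per URL in place of a real Jsoup fetch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class CallableCrawlSketch {
    // Hypothetical stand-in for fetching a page and extracting its links.
    static List<String> fetchLinks(String url) {
        return Arrays.asList(url + "/a", url + "/b");
    }

    // Runs one Callable per seed URL and merges the results on the calling thread.
    static List<String> crawlAll(List<String> seeds, int threads) {
        ExecutorService es = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (String seed : seeds) {
                tasks.add(() -> fetchLinks(seed));
            }
            List<String> all = new ArrayList<>();
            // invokeAll blocks until every task has completed and keeps task order.
            for (Future<List<String>> f : es.invokeAll(tasks)) {
                all.addAll(f.get());
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while crawling", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a crawl task failed", e.getCause());
        } finally {
            es.shutdown();
        }
    }
}
```

Because only the calling thread touches the merged list, a plain ArrayList is enough there.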

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class controller {
public static void main(String args[]) throws InterruptedException {
final int LIMIT = 4;
List<String> seedList = new ArrayList<>(); //1
seedList.add("https://www.youtube.com/");
seedList.add("https://www.digg.com/");
seedList.add("https://www.reddit.com/");
seedList.add("https://www.nytimes.com/");
DataStruc[] data = new DataStruc[LIMIT];
for(int i = 0; i < LIMIT; i++){
data[i] = new DataStruc(seedList.get(i)); //2
}
ExecutorService es = Executors.newFixedThreadPool(LIMIT);
Crawler[] crawl = new Crawler[LIMIT];
for(int i = 0; i < LIMIT; i++){
crawl[i] = new Crawler(data[i]); //3
}
for(int i = 0; i < LIMIT; i++){
es.submit(crawl[i]); // 4
}
// Stop accepting new tasks so the JVM can exit once the submitted crawlers finish
es.shutdown();
}
}
You can try this out. The numbered comments in the code correspond to these steps:
1. Create a seed list.
2. Create the DataStruc objects and give each one a URL from the seed list.
3. Create the Crawler array and pass one DataStruc object to each crawler.
4. Pass the Crawler objects to the executor.

Related

How to force remote-only reads in Cassandra3?

We are trying to modify the Cassandra code to perform ONLY remote reads (never read locally) for performance testing purposes of the Speculative Retry and Request Duplication latency reduction techniques.
So far we have modified
src/java/org/apache/cassandra/service/AbstractReadExecutor.java
to do something like this:
public abstract class AbstractReadExecutor {
protected int getNonLocalEndpointIndex (Iterable<InetAddress> endpoints) {
int endpoint_index = 0;
// iterate thru endpoints and pick non-local one
boolean found = false;
for (InetAddress e : endpoints) {
if (! StorageProxy.canDoLocalRequest(e) ) {
found = true;
break;
}
endpoint_index++;
}
if (!found) {
endpoint_index = 0;
}
return endpoint_index;
}
public static class NeverSpeculatingReadExecutor extends AbstractReadExecutor {
public void executeAsync() {
int endpoint_index = getNonLocalEndpointIndex(targetReplicas);
makeDataRequests(targetReplicas.subList(endpoint_index, endpoint_index+1));
if (targetReplicas.size() > 1)
makeDigestRequests(targetReplicas.subList(1, targetReplicas.size()));
}
}
}
However, it does not work, since targetReplicas almost always contains just 1 endpoint (the local one) when using small workloads, 5 Cassandra nodes, and a replication factor of 3.
If this is just for testing, you can put 1 node in the wrong DC and use LOCAL queries for data that the node does not own (with a whitelist load balancing policy on the driver to ensure requests go only to that node). You just need to make sure you only test with data that the node doesn't own a copy of.
Or are you interested in doing things like testing the proxy mutations in the read repairs?
I was able to only do remote-reads by adding a function "getRemoteReplicas()" that filters out the local nodes before/when the ReadExecutor object is created. consistencyLevel.filterForQuery() then usually just returns 1 node (a non-local one).
public static AbstractReadExecutor getReadExecutor(...) {
...
List<InetAddress> remoteReplicas = getRemoteReplicas( allReplicas );
List<InetAddress> targetReplicas = consistencyLevel.filterForQuery(keyspace, remoteReplicas, repairDecision);
...
}
private static List<InetAddress> getRemoteReplicas(List<InetAddress> replicas) {
logger.debug("ALL REPLICAS: " + replicas.toString());
List<InetAddress> remote_replicas = new ArrayList<>();
// iterate thru replicas and pick non-local one
boolean found = false;
for (InetAddress r : replicas) {
if (! StorageProxy.canDoLocalRequest(r) ) {
remote_replicas.add(r);
found = true;
}
}
if (!found) {
return replicas;
}
logger.debug("REMOTE REPLICAS: " + remote_replicas.toString());
return remote_replicas;
}
in src/java/org/apache/cassandra/service/AbstractReadExecutor.java

Multiples instance of Producer/Consumer with Monitor in Java

I'm building a webcrawler to download files from websites. I've a producer (the link fetcher) and a consumer (the downloader).
They can both be summarized as follows:
//Fetcher implements Runnable
public void run(){
while(String link = getLinkFromDatabase != null){
String htmlContent = HTTPrequest.getHTMLtoString(link);
ArrayList<String> links = HTTPrequest.getUrlsFromString(htmlContent); //Custom Parser/Extractor
ArrayList<String> files = HTTPrequest.getFilesFromString(htmlContent);//Custom Parser/Extractor
String SqlQueryAddLinks = "INSERT IGNORE DUPLICATE INTO [...]"; //Insert query for Links with unique key : sha256 of the url.
String SqlQUeryAddFiles = "INSERT IGNORE DUPLICATE INTO [...]"; //Insert query for Files with unique key : sha256 of the url.
Queries.sqlExec(SqlQueryAddLinks);
int RowAffected = Queries.sqlExec(SqlQueryAddFiles);
Queries.archiveLink(link);
Monitor.append(RowAffected);
}
}
//Downloader implements Runnable
public void run(){
while(String link = getFileFromeDatabase != null){
//You don't care of steps here I just download the file
if(fileDownloaded){
Queries.archiveFile(link);
Monitor.take();
}
}
}
Now I'm trying to synchronize both threads to ensure that links cannot get too old. To do so I'm using a monitor (as described in Operating Systems: Internals and Design Principles by William Stallings):
public class Monitor{
int N = 10;
int count;
Condition notfull, notempty;
public Monitor(){
count = 0;
}
public void append(int nbr) throws InterruptedException{
if(count >= N){
notfull.wait();
}
count+=nbr;
notempty.signal();
}
public void take() throws InterruptedException{
if(count == 0){
notempty.wait();
}
count--;
notfull.signal();
}
Now the thing is that I want to launch multiple pairs of fetchers and downloaders, each synchronized by a monitor. Do I need to create a new Monitor object for each pair and add a Monitor field to my Downloader and Fetcher classes, or is there a better way? The book doesn't cover multiple producers/consumers, and it uses the function parbegin(producer, consumer); in C++ (I presume it's C++).
Just by eyeballing it, this code doesn't compile for several reasons and is guaranteed to fail at runtime:
a) You call take/append as static methods, but they are not static.
b) You declare two Condition objects, but there is no ReentrantLock to create them from.
c) You never lock/unlock the reentrant lock behind the conditions before waiting/signalling.
d) You use Condition.wait() instead of Condition.await().
e) You use Condition.signal() instead of Condition.signalAll().
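For illustration, a minimal sketch of a monitor with those points addressed. BoundedMonitor and BoundedMonitorDemo are hypothetical names, and the capacity of 2 in the demo is arbitrary. The waits sit in while loops so that the condition is re-checked after every wake-up. One way to handle multiple pairs, under that assumption, would be to pass each fetcher/downloader pair its own instance through their constructors instead of calling static methods.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Bounded counter monitor; one instance is shared by a fetcher and its downloader.
class BoundedMonitor {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notFull = lock.newCondition();
    private final Condition notEmpty = lock.newCondition();
    private final int capacity;
    private int count = 0;

    BoundedMonitor(int capacity) { this.capacity = capacity; }

    // Producer side: blocks while the counter is at or above capacity.
    void append(int nbr) throws InterruptedException {
        lock.lock();
        try {
            while (count >= capacity) {   // re-check after every wake-up
                notFull.await();
            }
            count += nbr;
            notEmpty.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Consumer side: blocks while there is nothing to take.
    void take() throws InterruptedException {
        lock.lock();
        try {
            while (count == 0) {
                notEmpty.await();
            }
            count--;
            notFull.signalAll();
        } finally {
            lock.unlock();
        }
    }

    int count() {
        lock.lock();
        try { return count; } finally { lock.unlock(); }
    }
}

class BoundedMonitorDemo {
    // Runs one producer and one consumer against a shared monitor; returns the final count.
    static int run(int items) {
        BoundedMonitor monitor = new BoundedMonitor(2);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < items; i++) monitor.append(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < items; i++) monitor.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return monitor.count();
    }
}
```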

How to remove elements from a queue in Java with a loop

I have a data structure like this:
BlockingQueue mailbox = new LinkedBlockingQueue();
I'm trying to do this:
for(Mail mail: mailbox)
{
if(badNews(mail))
{
mailbox.remove(mail);
}
}
Obviously the contents of the loop interfere with the bounds and an error is triggered, so I would normally do this:
for(int i = 0; i < mailbox.size(); i++)
{
if(badNews(mailbox.get(i)))
{
mailbox.remove(i);
i--;
}
}
But sadly BlockingQueue's don't have a function to get or remove an element by index, so I'm stuck. Any ideas?
Edit - A few clarifications:
One of my goals is to maintain the same ordering, so popping from the head and putting it back at the tail is no good. Also, although no other threads will remove mail from a mailbox, they will add to it, so I don't want to be in the middle of a removal algorithm, have someone send me mail, and then have an exception occur.
Thanks in advance!
You may poll and offer all the elements in your queue until you make a complete loop over your queue. Here's an example:
Mail firstMail = mailbox.peek();
Mail currentMail = mailbox.poll(); // BlockingQueue has no pop(); poll() removes the head
while (true) {
//a base condition to stop the loop
Mail tempMail = mailbox.peek();
if (tempMail == null || tempMail.equals(firstMail)) {
mailbox.offer(currentMail);
break;
}
//if there's nothing wrong with the current mail, then re add to mailbox
if (!badNews(currentMail)) {
mailbox.offer(currentMail);
}
currentMail = mailbox.poll();
}
Note that this approach will work only if this code is executed in a single thread and there's no other thread that removes items from this queue.
Maybe you need to check if you really want to poll or take the elements from the BlockingQueue. Similar for offer and put.
More info:
Java BlockingQueue take() vs poll()
LinkedBlockingQueue put vs offer
Another less buggy approach is using a temporary collection, not necessarily concurrent, and store the elements you still need in the queue. Here's a kickoff example:
List<Mail> mailListTemp = new ArrayList<>();
while (mailbox.peek() != null) {
Mail mail = mailbox.take();
if (!badNews(mail)) {
mailListTemp.add(mail);
}
}
for (Mail mail : mailListTemp) {
mailbox.offer(mail);
}
I looked over the solutions posted and I think I found a version that serves my purposes. What do you think about this one?
int size = mailbox.size();
for(int i = 0; i < size; i++)
{
Mail currentMail = mailbox.poll();
if (!badNews(currentMail))
mailbox.offer(currentMail);
}
Edit: A new solution that may be problem-free. What do you guys think?
while(true)
{
boolean badNewRemains = false;
for(Mail mail: mailbox)
{
if(badNews(mail))
{
badNewRemains = true;
mailbox.remove(mail);
break;
}
}
if(!badNewRemains)
break;
}
You can easily implement a queue for your needs, and you will need to if the provided API doesn't have such features.
Something like:
import java.util.Iterator;
import java.util.LinkedList;
class Mail {
boolean badMail;
}
class MailQueue {
private LinkedList<Mail> backingQueue = new LinkedList<>();
private final Object lock = new Object();
public void push(Mail mail){
synchronized (lock) {
backingQueue.addLast(mail);
if(backingQueue.size() == 1){
// this is only element in queue, i.e. queue was empty before, so invoke if any thread waiting for mails in queue.
lock.notify();
}
}
}
public Mail pop() throws InterruptedException{
synchronized (lock) {
while(backingQueue.isEmpty()){
// no elements in queue, wait.
lock.wait();
}
return backingQueue.removeFirst();
}
}
public boolean removeBadMailsInstantly() {
synchronized (lock) {
boolean removed = false;
Iterator<Mail> iterator = backingQueue.iterator();
while(iterator.hasNext()){
Mail mail = iterator.next();
if(mail.badMail){
iterator.remove();
removed = true;
}
}
return removed;
}
}
}
The queue implemented above is thread-safe for both push and pop. You can also extend it with more operations. removeBadMailsInstantly can be called safely from multiple threads as well. And you will also learn multithreading concepts along the way.
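As a shorter alternative to the approaches above, and only as a sketch assuming Java 8 or later: Collection.removeIf works on a LinkedBlockingQueue, removes matching elements in place and keeps the remaining ones in order. The queue's iterators are documented as weakly consistent, so other threads adding mail during the scan do not cause a ConcurrentModificationException. The Mail class and badNews method below are hypothetical stand-ins for the asker's own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class MailboxCleanup {
    // Hypothetical stand-in for the asker's Mail type.
    static class Mail {
        final String subject;
        final boolean bad;
        Mail(String subject, boolean bad) { this.subject = subject; this.bad = bad; }
    }

    // Hypothetical stand-in for the asker's badNews(mail) check.
    static boolean badNews(Mail mail) { return mail.bad; }

    // Removes bad mail in place and returns the surviving subjects in queue order.
    static List<String> demo() {
        BlockingQueue<Mail> mailbox = new LinkedBlockingQueue<>();
        mailbox.offer(new Mail("invoice", false));
        mailbox.offer(new Mail("layoffs", true));
        mailbox.offer(new Mail("lunch", false));

        mailbox.removeIf(MailboxCleanup::badNews);

        List<String> subjects = new ArrayList<>();
        for (Mail m : mailbox) {
            subjects.add(m.subject);
        }
        return subjects;
    }
}
```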

List of Thread and accessing another list

I've already made another question close to this one several minutes ago, and there were good answers, but it was not what I was looking for, so I tried to be a bit clearer.
Let's say I have a list of Thread in a class :
class Network {
private List<Thread> tArray = new ArrayList<Thread>();
private List<ObjectInputStream> input = new ArrayList<ObjectInputStream>();
private void aMethod() {
for(int i = 0; i < 10; i++) {
Runnable r = new Runnable() {
public void run() {
try {
String received = (String) input.get(****).readObject(); // I don't know what to put here instead of the ****
showReceived(received); // random method in Network class
} catch (IOException ioException) {
ioException.printStackTrace();
}
}
}
tArray.add(new Thread(r));
tArray.get(i).start();
}
}
}
What should I put instead of the ****?
The first thread of the tArray list must only access the first input of the input list for example.
EDIT : Let's assume my input list has already 10 elements
It would work if you put i. You also need to add an ObjectInputStream to the list for each thread. I recommend you use input.add for that purpose. You also need to fill the tArray list with some threads, use add again there.
Here's the solution:
private void aMethod() {
for(int i = 0; i < 10; i++) {
final int index = i; // Captures the value of i in a final variable.
Runnable r = new Runnable() {
public void run() {
try {
String received = input.get(index).readObject().toString(); // Use the final variable to access the list.
showReceived(received); // random method in Network class
} catch (Exception exception) {
exception.printStackTrace();
}
}
};
tArray.add(new Thread(r));
tArray.get(i).start();
}
}
As you want each thread to access one element from the input list, you can use the value of the i variable as an index into the list. The problem with using i directly is that an inner class cannot access non-final variables from the enclosing scope. To overcome this we assign i to a final variable, index. Because index is final, it is accessible from the code of your Runnable.
Additional fixes:
readObject().toString()
catch(Exception exception)
tArray.add(new Thread(r))
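The same capture pattern in a form that runs on its own, as a sketch: plain strings stand in for the ObjectInputStream objects, a lambda replaces the anonymous Runnable, and IndexCaptureSketch and readAll are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

class IndexCaptureSketch {
    // Starts one thread per input; thread i reads only inputs.get(i).
    static String[] readAll(List<String> inputs) {
        String[] received = new String[inputs.size()];
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < inputs.size(); i++) {
            final int index = i;   // copy of i that the lambda is allowed to capture
            Thread t = new Thread(() -> received[index] = inputs.get(index).toUpperCase());
            threads.add(t);
            t.start();
        }
        // join() makes each thread's write visible to the caller.
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return received;
    }
}
```

Using i itself inside the lambda would not compile, because i is modified by the loop and is therefore neither final nor effectively final.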

Advice for efficient blocking queries

I would like to store tuple objects in a concurrent Java collection and then have an efficient, blocking query method that returns the first element matching a pattern. If no such element is available, it would block until such an element is present.
For instance if I have a class:
public class Pair {
public final String first;
public final String second;
public Pair( String first, String second ) {
this.first = first;
this.second = second;
}
}
And a collection like:
public class FunkyCollection {
public void add( Pair p ) { /* ... */ }
public Pair get( Pair p ) { /* ... */ }
}
I would like to query it like:
myFunkyCollection.get( new Pair( null, "foo" ) );
which returns the first available pair with the second field equalling "foo" or blocks until such element is added. Another query example:
myFunkyCollection.get( new Pair( null, null ) );
should return the first available pair whatever its values.
Does a solution already exist? If not, how would you suggest implementing the get( Pair p ) method?
Clarification: The method get( Pair p) must also remove the element. The name choice was not very smart. A better name would be take( ... ).
Here's some source code. It is basically the same as what cb160 said, but having the source code might help to clear up any questions you may still have. In particular, the methods on the FunkyCollection must be synchronized.
As meriton pointed out, the get method performs an O(n) scan for every blocked get every time a new object is added. It also performs an O(n) operation to remove objects. This could be improved by using a data structure similar to a linked list where you can keep an iterator to the last item checked. I haven't provided source code for this optimization, but it shouldn't be too difficult to implement if you need the extra performance.
import java.util.*;
public class BlockingQueries
{
public class Pair
{
public final String first;
public final String second;
public Pair(String first, String second)
{
this.first = first;
this.second = second;
}
}
public class FunkyCollection
{
final ArrayList<Pair> pairs = new ArrayList<Pair>();
public synchronized void add( Pair p )
{
pairs.add(p);
notifyAll();
}
public synchronized Pair get( Pair p ) throws InterruptedException
{
while (true)
{
for (Iterator<Pair> i = pairs.iterator(); i.hasNext(); )
{
Pair pair = i.next();
boolean firstOk = p.first == null || p.first.equals(pair.first);
boolean secondOk = p.second == null || p.second.equals(pair.second);
if (firstOk && secondOk)
{
i.remove();
return pair;
}
}
wait();
}
}
}
class Producer implements Runnable
{
private FunkyCollection funkyCollection;
public Producer(FunkyCollection funkyCollection)
{
this.funkyCollection = funkyCollection;
}
public void run()
{
try
{
for (int i = 0; i < 10; ++i)
{
System.out.println("Adding item " + i);
funkyCollection.add(new Pair("foo" + i, "bar" + i));
Thread.sleep(1000);
}
}
catch (InterruptedException e)
{
Thread.currentThread().interrupt();
}
}
}
public void go() throws InterruptedException
{
FunkyCollection funkyCollection = new FunkyCollection();
new Thread(new Producer(funkyCollection)).start();
System.out.println("Fetching bar5.");
funkyCollection.get(new Pair(null, "bar5"));
System.out.println("Fetching foo2.");
funkyCollection.get(new Pair("foo2", null));
System.out.println("Fetching foo8, bar8");
funkyCollection.get(new Pair("foo8", "bar8"));
System.out.println("Finished.");
}
public static void main(String[] args) throws InterruptedException
{
new BlockingQueries().go();
}
}
Output:
Fetching bar5.
Adding item 0
Adding item 1
Adding item 2
Adding item 3
Adding item 4
Adding item 5
Fetching foo2.
Fetching foo8, bar8
Adding item 6
Adding item 7
Adding item 8
Finished.
Adding item 9
Note that I put everything into one source file to make it easier to run.
I know of no existing container that will provide this behavior. One problem you face is the case where no existing entry matches the query. In that case, you'll have to wait for new entries to arrive, and those new entries are supposed to arrive at the tail of the sequence. Given that you're blocking, you don't want to have to examine all the entries that precede the latest addition, because you've already inspected them and determined that they don't match. Hence, you need some way to record your current position, and be able to search forward from there whenever a new entry arrives.
This waiting is a job for a Condition. As suggested in cb160's answer, you should allocate a Condition instance inside your collection, and block on it via Condition#await(). You should also expose a companion overload to your get() method to allow timed waiting:
public Pair get(Pair p) throws InterruptedException;
public Pair get(Pair p, long time, TimeUnit unit) throws InterruptedException;
Upon each call to add(), call on Condition#signalAll() to unblock the threads waiting on unsatisfied get() queries, allowing them to scan the recent additions.
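A minimal sketch of the Condition-based variant with a timed wait. It assumes the removing behaviour from the question's clarification, so the method is named take here. ConditionPairStore and its demo method are made-up names. For simplicity it rescans the whole list after each wake-up instead of remembering the last inspected position.

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class ConditionPairStore {
    static final class Pair {
        final String first;
        final String second;
        Pair(String first, String second) { this.first = first; this.second = second; }
    }

    private final List<Pair> pairs = new LinkedList<>();
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition added = lock.newCondition();

    void add(Pair p) {
        lock.lock();
        try {
            pairs.add(p);
            added.signalAll();   // wake every waiting query so it can rescan
        } finally {
            lock.unlock();
        }
    }

    // Removes and returns the first match, or null if none arrives within the timeout.
    Pair take(Pair pattern, long time, TimeUnit unit) throws InterruptedException {
        long nanos = unit.toNanos(time);
        lock.lock();
        try {
            while (true) {
                for (Iterator<Pair> it = pairs.iterator(); it.hasNext(); ) {
                    Pair p = it.next();
                    if (matches(pattern, p)) {
                        it.remove();
                        return p;
                    }
                }
                if (nanos <= 0L) {
                    return null;
                }
                nanos = added.awaitNanos(nanos);   // returns the remaining wait time
            }
        } finally {
            lock.unlock();
        }
    }

    // A null field in the pattern matches any value.
    private static boolean matches(Pair pattern, Pair p) {
        return (pattern.first == null || pattern.first.equals(p.first))
            && (pattern.second == null || pattern.second.equals(p.second));
    }

    // Small demo: the first query finds a match, the second one times out.
    static String demo() {
        ConditionPairStore store = new ConditionPairStore();
        store.add(new Pair("foo1", "bar1"));
        try {
            Pair hit = store.take(new Pair(null, "bar1"), 100, TimeUnit.MILLISECONDS);
            Pair miss = store.take(new Pair(null, "bar1"), 100, TimeUnit.MILLISECONDS);
            return hit.first + "," + (miss == null ? "timeout" : miss.first);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }
}
```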
You haven't mentioned how or if items are ever removed from this container. If the container only grows, that simplifies how threads can scan its contents without worrying about contention from other threads mutating the container. Each thread can begin its query with confidence as to the minimum number of entries available to inspect. However, if you allow removal of items, there are many more challenges to confront.
In your FunkyCollection add method you could call notifyAll on the collection itself every time you add an element.
In the get method, if the underlying container (any suitable container is fine) doesn't contain the value you need, wait on the FunkyCollection. When the wait is notified, check to see if the underlying container contains the result you need. If it does, return the value, otherwise, wait again.
It appears you are looking for an implementation of Tuple Spaces. The Wikipedia article about them lists a few implementations for Java, perhaps you can use one of those. Failing that, you might find an open source implementation to imitate, or relevant research papers.
