I'm building a web crawler to download files from websites. I have a producer (the link fetcher) and a consumer (the downloader).
Both can be summarized as follows:
//Fetcher implements Runnable
public void run(){
while(String link = getLinkFromDatabase != null){
String htmlContent = HTTPrequest.getHTMLtoString(link);
ArrayList<String> links = HTTPrequest.getUrlsFromString(htmlContent); //Custom Parser/Extractor
ArrayList<String> files = HTTPrequest.getFilesFromString(htmlContent);//Custom Parser/Extractor
String SqlQueryAddLinks = "INSERT IGNORE DUPLICATE INTO [...]"; //Insert query for Links with unique key : sha256 of the url.
String SqlQUeryAddFiles = "INSERT IGNORE DUPLICATE INTO [...]"; //Insert query for Files with unique key : sha256 of the url.
Queries.sqlExec(SqlQueryAddLinks);
int RowAffected = Queries.sqlExec(SqlQueryAddFiles);
Queries.archiveLink(link);
Monitor.append(RowAffected);
}
}
//Downloader implements Runnable
public void run(){
while(String link = getFileFromeDatabase != null){
//You don't care of steps here I just download the file
if(fileDownloaded){
Queries.archiveFile(link);
Monitor.take();
}
}
}
Now I'm trying to synchronize both threads to ensure that links cannot get too old. To do so I'm using a monitor (as described in Operating Systems: Internals and Design Principles by William Stallings):
public class Monitor{
int N = 10;
int count;
Condition notfull, notempty;
public Monitor(){
count = 0;
}
public void append(int nbr) throws InterruptedException{
if(count >= N){
notfull.wait();
}
count+=nbr;
notempty.signal();
}
public void take() throws InterruptedException{
if(count == 0){
notempty.wait();
}
count--;
notfull.signal();
}
Now the thing is that I want to launch multiple fetcher/downloader pairs, each pair synchronized by a monitor. Do I need to create a new Monitor object for each pair and add a Monitor field to my Downloader and Fetcher classes, or is there a better way? The book doesn't cover multiple producers/consumers and uses the function parbegin(producer, consumer); in C++ (I presume it's C++).
Just by eyeballing, this code doesn't compile for many reasons and has guaranteed runtime failures.
a) you call take/append as if they were static methods (Monitor.append(...), Monitor.take()), but they are not static.
b) you declare 2 Condition objects but you have no ReentrantLock to create them from, so they are never initialized.
c) you don't lock/unlock the lock behind the conditions before waiting/signalling.
d) you use Condition.wait() instead of the .await().
e) you are using Condition.signal() instead of the .signalAll()
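For what it's worth, here is a minimal sketch of how the monitor might look once points b) to e) are addressed. The class name BoundedMonitor, the capacity constructor argument and the pending() accessor are my own assumptions, not something from the question or the book. It also swaps the if checks for while loops, since a woken thread should re-check its condition.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Bounded counter guarded by one lock and two conditions (hypothetical name). */
class BoundedMonitor {
    private final ReentrantLock lock = new ReentrantLock();
    // Conditions must be created from the lock that guards them.
    private final Condition notFull = lock.newCondition();
    private final Condition notEmpty = lock.newCondition();
    private final int capacity;
    private int count = 0;

    BoundedMonitor(int capacity) {
        this.capacity = capacity;
    }

    /** Producer side: blocks while the counter is at or above capacity. */
    void append(int nbr) throws InterruptedException {
        lock.lock();
        try {
            while (count >= capacity) { // while, not if: re-check after every wake-up
                notFull.await();
            }
            count += nbr;
            notEmpty.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Consumer side: blocks while nothing is pending. */
    void take() throws InterruptedException {
        lock.lock();
        try {
            while (count == 0) {
                notEmpty.await();
            }
            count--;
            notFull.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Current value of the counter, read under the lock. */
    int pending() {
        lock.lock();
        try {
            return count;
        } finally {
            lock.unlock();
        }
    }
}
```

With something like this, each fetcher/downloader pair would simply get its own instance, created once and handed to both constructors (those constructors are hypothetical as well), instead of going through static calls.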
I have a Hashmap that is created for each "mailer" class and each "agent" class creates a mailer.
My problem is that each of my "agents" creates a "mailer" that in turn creates a new hashmap.
What I'm trying to do is to create one Hashmap that will be used by all the agents(every agent is a thread).
This is the Agent class:
public class Agent implements Runnable {
private int id;
private int n;
private Mailer mailer;
private static int counter;
private List<Integer> received = new ArrayList<Integer>();
@Override
public void run() {
System.out.println("Thread has started");
n = 10;
if (counter < n - 1) {
this.id = ThreadLocalRandom.current().nextInt(0, n + 1);
counter++;
}
Message m = new Message(this.id, this.id);
this.mailer.getMap().put(this.id, new ArrayList<Message>());
System.out.println(this.mailer.getMap());
for (int i = 0; i < n; i++) {
if (i == this.id) {
continue;
}
this.mailer.send(i, m);
}
for (int i = 0; i < n; i++) {
if (i == this.id) {
continue;
}
if (this.mailer.getMap().get(i) == null) {
continue;
} else {
this.received.add(this.mailer.readOne(this.id).getContent());
}
}
System.out.println(this.id + "" + this.received);
}
}
This is the Mailer class :
public class Mailer {
private HashMap<Integer, List<Message>> map = new HashMap<>();
public void send(int receiver, Message m) {
synchronized (map) {
while (this.map.get(receiver) == null) {
this.map.get(receiver);
}
if (this.map.get(receiver) == null) {
} else {
map.get(receiver).add(m);
}
}
}
public Message readOne(int receiver) {
synchronized (map) {
if (this.map.get(receiver) == null) {
return null;
} else if (this.map.get(receiver).size() == 0) {
return null;
} else {
Message m = this.map.get(receiver).get(0);
this.map.get(receiver).remove(0);
return m;
}
}
}
public HashMap<Integer, List<Message>> getMap() {
synchronized (map) {
return map;
}
}
}
I have tried so far :
Creating the mailer object inside the run method in agent.
Going by the idea (based on your own answer to this question) that you made the map static, you've made 2 mistakes.
do not use static
static means there is one map for the entire JVM you run this on. This is not actually a good thing: Now you can't create separate mailers on one JVM in the future, and you've made it hard to test stuff.
You want something else: A way to group a bunch of mailer threads together (these are all mailers for the agent), but a bit more discerning than a simple: "ALL mailers in the ENTIRE system are all the one mailer for the one agent that will ever run".
A trivial way to do this is to pass the map in as argument. Alternatively, have the map be part of the agent, and pass the agent to the mailer constructor, and have the mailer ask the agent for the map every time.
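A rough sketch of the "pass the map in as an argument" idea could look like the following. SharedMailer and register are names I made up, and plain String stands in for the Message class, which the question doesn't show.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Mailer that does not own its map; mailers built on the same map share the same mailboxes. */
class SharedMailer {
    private final Map<Integer, List<String>> mailboxes;

    SharedMailer(Map<Integer, List<String>> mailboxes) {
        this.mailboxes = mailboxes;
    }

    /** Creates an empty mailbox for the receiver if it has none yet. */
    void register(int receiver) {
        synchronized (mailboxes) {
            if (!mailboxes.containsKey(receiver)) {
                mailboxes.put(receiver, new ArrayList<String>());
            }
        }
    }

    /** Returns false when the receiver has no mailbox yet. */
    boolean send(int receiver, String message) {
        synchronized (mailboxes) {
            List<String> box = mailboxes.get(receiver);
            if (box == null) {
                return false;
            }
            box.add(message);
            return true;
        }
    }

    /** Removes and returns the oldest message, or null if there is none. */
    String readOne(int receiver) {
        synchronized (mailboxes) {
            List<String> box = mailboxes.get(receiver);
            return (box == null || box.isEmpty()) ? null : box.remove(0);
        }
    }
}
```

The code that starts the agents would create one map, build every mailer from it, and could later create a second, independent group just by creating a second map.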
this is not thread safe
Thread safety is a crucial concept to get right, because the failure mode if you get it wrong is extremely annoying: It may or may not work, and the JVM is free to base whether it'll work right this moment or won't work on the phase of the moon or the flip of a coin: The JVM is given room to do whatever it feels like it needs to, in order to have a JVM that can make full use of the CPU's powers regardless of which CPU and operating system your app is running on.
Your code is not thread safe.
At any given moment, if 2 threads are both referring to the same field, you've got a problem: You need to ensure that this is done 'safely'. Neither the compiler nor the runtime will throw errors if you fail to do this, but you will get bizarre behaviour because the JVM is free to give you caches, refuse to synchronize things, make ghosts of data appear, and more.
In this case the fix is near-trivial: Use java.util.concurrent.ConcurrentHashMap instead, that's all you'd have to do to make this safe.
Whenever you're interacting with a field that doesn't have a convenient thread-safe type, or you're messing with the field itself (one thread assigns a new value to the field, another reads it - you don't do that here, there is just the one field that always points at the same map, but you're messing with the map) - you need to use synchronized and/or volatile and/or locks from the java.util.concurrent package, and in general it gets very complicated. Concurrent programming is hard.
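As an illustration only, here is one shape the ConcurrentHashMap variant might take. ConcurrentMailboxes is a made-up name and String again stands in for Message. Note one assumption that goes beyond the advice above: the per-receiver ArrayList is swapped for a ConcurrentLinkedQueue, because the lists stored inside the map are shared between threads too.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;

/** Mailboxes keyed by receiver id, usable from many threads without explicit locking. */
class ConcurrentMailboxes {
    private final ConcurrentMap<Integer, Queue<String>> boxes = new ConcurrentHashMap<>();

    void send(int receiver, String message) {
        // computeIfAbsent creates the mailbox atomically the first time a receiver is used.
        boxes.computeIfAbsent(receiver, id -> new ConcurrentLinkedQueue<>()).add(message);
    }

    /** Removes and returns the oldest message, or null if there is none. */
    String readOne(int receiver) {
        Queue<String> box = boxes.get(receiver);
        return box == null ? null : box.poll();
    }
}
```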
I was able to solve this by changing the mailer to static in the Agent class
I have a data structure like this:
BlockingQueue mailbox = new LinkedBlockingQueue();
I'm trying to do this:
for(Mail mail: mailbox)
{
if(badNews(mail))
{
mailbox.remove(mail);
}
}
Obviously the contents of the loop interfere with the bounds and an error is triggered, so I would normally do this:
for(int i = 0; i < mailbox.size(); i++)
{
if(badNews(mailbox.get(i)))
{
mailbox.remove(i);
i--;
}
}
But sadly BlockingQueue's don't have a function to get or remove an element by index, so I'm stuck. Any ideas?
Edit - A few clarifications:
One of my goals is to maintain the same ordering, so popping from the head and putting it back at the tail is no good. Also, although no other threads will remove mail from a mailbox, they will add to it, so I don't want to be in the middle of a removal algorithm, have someone send me mail, and then have an exception occur.
Thanks in advance!
You may poll and offer all the elements in your queue until you have made a complete pass over it. Here's an example:
Mail firstMail = mailbox.peek();
Mail currentMail = mailbox.poll(); // BlockingQueue has no pop(); poll() removes the head
while (true) {
//a base condition to stop the loop
Mail tempMail = mailbox.peek();
if (tempMail == null || tempMail.equals(firstMail)) {
mailbox.offer(currentMail);
break;
}
//if there's nothing wrong with the current mail, then re add to mailbox
if (!badNews(currentMail)) {
mailbox.offer(currentMail);
}
currentMail = mailbox.poll();
}
Note that this approach will work only if this code is executed in a single thread and there's no other thread that removes items from this queue.
Maybe you need to check if you really want to poll or take the elements from the BlockingQueue. Similar for offer and put.
More info:
Java BlockingQueue take() vs poll()
LinkedBlockingQueue put vs offer
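To make the difference concrete, here is a small sketch using a queue of capacity 1. QueueCallsDemo and trace are made-up names; the queue methods themselves are the standard java.util.concurrent ones.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class QueueCallsDemo {
    /** Returns a comma-separated trace of what each non-blocking or timed call returned. */
    static String trace() throws InterruptedException {
        BlockingQueue<String> q = new ArrayBlockingQueue<>(1); // capacity 1 to force "full"
        StringBuilder out = new StringBuilder();
        out.append(q.offer("a")).append(','); // there is room, so offer succeeds
        out.append(q.offer("b")).append(','); // full: offer gives up at once, put("b") would block here
        out.append(q.poll()).append(',');     // removes and returns the head
        out.append(q.poll()).append(',');     // empty: poll gives up at once, take() would block here
        out.append(q.poll(50, TimeUnit.MILLISECONDS)); // timed poll: waits up to 50 ms, then gives up
        return out.toString();
    }
}
```

trace() returns "true,false,a,null,null".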
Another, less buggy approach is to use a temporary collection (not necessarily a concurrent one) to store the elements you still need in the queue. Here's a starting example:
List<Mail> mailListTemp = new ArrayList<>();
while (mailbox.peek() != null) {
Mail mail = mailbox.take();
if (!badNews(mail)) {
mailListTemp.add(mail);
}
}
for (Mail mail : mailListTemp) {
mailbox.offer(mail);
}
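A possible variation on the temporary-collection idea uses drainTo, which moves everything currently queued into a list in one call. DrainFilter and dropMatching are hypothetical names, and the sketch assumes an unbounded queue. The same caveat applies as above: mail that other threads add between the drain and the re-add ends up ahead of the older mail, so ordering is only preserved if nobody adds in that window.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.function.Predicate;

class DrainFilter {
    /** Removes the elements matching 'bad' and returns how many were dropped. */
    static <T> int dropMatching(BlockingQueue<T> queue, Predicate<T> bad) {
        List<T> kept = new ArrayList<>();
        queue.drainTo(kept);       // moves everything currently queued, without blocking
        int before = kept.size();
        kept.removeIf(bad);        // filter on the private copy
        queue.addAll(kept);        // assumes an unbounded queue; addAll throws if a bounded one is full
        return before - kept.size();
    }
}
```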
I looked over the solutions posted and I think I found a version that serves my purposes. What do you think about this one?
int size = mailbox.size();
for(int i = 0; i < size; i++)
{
Mail currentMail = mailbox.poll();
if (!badNews(currentMail))
mailbox.offer(currentMail);
}
Edit: A new solution that may be problem-free. What do you think?
while(true)
{
boolean badNewRemains = false;
for(Mail mail: mailbox)
{
if(badNews(mail))
{
badNewRemains = true;
mailbox.remove(mail);
break;
}
}
if(!badNewRemains)
break;
}
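If Java 8 or later is available, the restart-on-every-removal loop can probably be collapsed into a single removeIf call, since BlockingQueue is a Collection. LinkedBlockingQueue iterators are documented as weakly consistent, so they should not throw ConcurrentModificationException while other threads add mail, and the remaining elements keep their order. MailboxCleaner, purge and demo are made-up names, and strings stand in for Mail objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Predicate;

class MailboxCleaner {
    /** Removes every element matching isBad; returns true if anything was removed. */
    static <T> boolean purge(BlockingQueue<T> mailbox, Predicate<? super T> isBad) {
        return mailbox.removeIf(isBad);
    }

    /** Small demonstration with strings standing in for Mail objects. */
    static List<String> demo() {
        BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
        mailbox.add("hello");
        mailbox.add("BAD offer");
        mailbox.add("invoice");
        mailbox.add("BAD news");
        purge(mailbox, m -> m.startsWith("BAD"));
        return new ArrayList<>(mailbox); // snapshot of what is left, in queue order
    }
}
```

demo() returns [hello, invoice].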
You can easily implement a queue that fits your needs, and you will have to if the provided API doesn't have such features.
Something like:
import java.util.Iterator;
import java.util.LinkedList;
class Mail {
boolean badMail;
}
class MailQueue {
private LinkedList<Mail> backingQueue = new LinkedList<>();
private final Object lock = new Object();
public void push(Mail mail){
synchronized (lock) {
backingQueue.addLast(mail);
// Wake every thread waiting in pop(). Notifying unconditionally avoids a missed
// wake-up when several consumers are waiting and two pushes arrive back to back.
lock.notifyAll();
}
}
public Mail pop() throws InterruptedException{
synchronized (lock) {
while(backingQueue.isEmpty()){
// no elements in queue, wait.
lock.wait();
}
return backingQueue.removeFirst();
}
}
public boolean removeBadMailsInstantly() {
synchronized (lock) {
boolean removed = false;
Iterator<Mail> iterator = backingQueue.iterator();
while(iterator.hasNext()){
Mail mail = iterator.next();
if(mail.badMail){
iterator.remove();
removed = true;
}
}
return removed;
}
}
}
This queue is thread-safe for both push and pop, and you can extend it with more operations. removeBadMailsInstantly can also be called safely from multiple threads. As a bonus, you will learn some multithreading concepts along the way.
Here is the original web crawler that I wrote (just for reference):
https://github.com/domshahbazi/java-webcrawler/tree/master
This is a simple web crawler which visits a given initial web page, scrapes all the links from the page and adds them to a queue (LinkedList), from which they are popped off one by one and visited, and the cycle starts again. To speed up my program, and for learning, I tried to use threads so I could have many threads operating at once, indexing more pages in less time. Below is each class:
Main class
public class controller {
public static void main(String args[]) throws InterruptedException {
DataStruc data = new DataStruc("http://www.imdb.com/title/tt1045772/?ref_=nm_flmg_act_12");
Thread crawl1 = new Crawler(data);
Thread crawl2 = new Crawler(data);
crawl1.start();
crawl2.start();
}
}
Crawler Class (Thread)
public class Crawler extends Thread {
/** Instance of Data Structure **/
DataStruc data;
/** Number of page connections allowed before program terminates **/
private final int INDEX_LIMIT = 10;
/** Initial URL to visit **/
public Crawler(DataStruc d) {
data = d;
}
public void run() {
// Counter to keep track of number of indexed URLS
int counter = 0;
// While URL's left to visit
while((data.url_to_visit_size() > 0) && counter<INDEX_LIMIT) {
// Pop next URL to visit from stack
String currentUrl = data.getURL();
try {
// Fetch and parse HTML document
Document doc = Jsoup.connect(currentUrl)
.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36")
.referrer("http://www.google.com")
.timeout(12000)
.followRedirects(true)
.get();
// Increment counter if connection to web page succeeds
counter++;
/** .select returns a list of elements (links in this case) **/
Elements links = doc.select("a[href]"); // Relative URL
// Add newly found links to stack
addLinksToQueue(links);
} catch (IOException e) {
//e.printStackTrace();
System.out.println("Error: "+currentUrl);
}
}
}
public void addLinksToQueue(Elements el) {
// For each element in links
for(Element e : el) {
String theLink = e.attr("abs:href"); // 'abs' prefix ensures absolute url is returned rather than relative url ('www.reddit.com/hello' rather than '/hello')
if(theLink.startsWith("http") && !data.oldLink(theLink)) {
data.addURL(theLink);
data.addVisitedURL(theLink); // Register each unique URL to ensure it isnt stored in 'url_to_visit' again
System.out.println(theLink);
}
}
}
}
DataStruc Class
public class DataStruc {
/** Queue to store URL's, can be accessed by multiple threads **/
private ConcurrentLinkedQueue<String> url_to_visit = new ConcurrentLinkedQueue<String>();
/** ArrayList of visited URL's **/
private ArrayList<String> visited_url = new ArrayList<String>();
public DataStruc(String initial_url) {
url_to_visit.offer(initial_url);
}
// Method to add seed URL to queue
public void addURL(String url) {
url_to_visit.offer(url);
}
// Get URL at front of queue
public String getURL() {
return url_to_visit.poll();
}
// URL to visit size
public int url_to_visit_size() {
return url_to_visit.size();
}
// Add visited URL
public void addVisitedURL(String url) {
visited_url.add(url);
}
// Checks if link has already been visited
public boolean oldLink(String link) {
for(String s : visited_url) {
if(s.equals(link)) {
return true;
}
}
return false;
}
}
DataStruc is the shared data structure class, which is accessed concurrently by each Crawler thread. DataStruc has a queue to store links to be visited, and an ArrayList to store visited URLs, to prevent entering a loop. I used a ConcurrentLinkedQueue to store the URLs to be visited, as I understand it takes care of concurrent access. I didn't think I required concurrent access for my ArrayList of visited URLs, as all I need to do is add to it and iterate over it to check for matches.
My problem is that when I compare the running time of a single thread vs. 2 threads (on the same URL), my single-threaded version seems to be faster. I feel I have implemented the threading incorrectly, and would like some tips if anybody can pinpoint the issues.
Thanks!
Added: see my comment, I think the check in Crawler
// While URL's left to visit
while((data.url_to_visit_size() > 0) && counter<INDEX_LIMIT) {
is wrong. The 2nd Thread will stop immediately since the 1st Thread polled the only URL.
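One way around that, sketched here under assumptions, is to let each thread wait a little for new work instead of quitting as soon as the queue looks empty. PatientCrawl, crawl, fetchLinks and idleMillis are all names I invented; fetchLinks stands in for the Jsoup download-and-parse step, and the INDEX_LIMIT counter is left out to keep the sketch short.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

class PatientCrawl {
    /** Crawls from seed with the given number of threads and returns every URL seen. */
    static Set<String> crawl(String seed, Function<String, List<String>> fetchLinks,
                             int threads, long idleMillis) throws InterruptedException {
        BlockingQueue<String> toVisit = new LinkedBlockingQueue<>();
        Set<String> seen = ConcurrentHashMap.newKeySet(); // thread-safe set of known URLs
        seen.add(seed);
        toVisit.add(seed);

        Runnable worker = () -> {
            try {
                String url;
                // Wait up to idleMillis for work instead of exiting the moment the queue is empty.
                while ((url = toVisit.poll(idleMillis, TimeUnit.MILLISECONDS)) != null) {
                    for (String link : fetchLinks.apply(url)) {
                        if (seen.add(link)) { // add() checks and inserts in one atomic step
                            toVisit.add(link);
                        }
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(worker);
            pool[i].start();
        }
        for (Thread t : pool) {
            t.join();
        }
        return seen;
    }
}
```

The idle timeout is a crude stopping rule: a thread gives up once nothing has arrived for idleMillis, so the value should be comfortably longer than one page fetch.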
You can ignore the rest; I have left it for history.
My general approach to such types of "big blocks that can run in parallel" is:
Make each crawler a Callable. Probably Callable<List<String>>
Submit them to an ExecutorService
When they complete, take the results one at a time and add them to a List.
Using this strategy there is no need to use any concurrent lists at all. The disadvantage is that you don't get much live feedback as they are running. And, if what they return is huge, you may need to worry about memory.
Would this suit your needs? You would have to worry about the addVisitedURL so you still need that as a concurrent data structure.
Added: Since you are starting with a single URL this strategy doesn't apply. You could apply it after the visit to the first URL.
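A bare-bones sketch of that Callable approach, for one batch of URLs, might look like this. CallableCrawl, crawlAll and fetchLinks are placeholder names of mine, and fetchLinks again stands in for the actual page fetch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class CallableCrawl {
    /** Fetches every URL in parallel and returns all links found, in task order. */
    static List<String> crawlAll(List<String> urls, Function<String, List<String>> fetchLinks,
                                 int threads) throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetchLinks.apply(url)); // one task per URL
            }
            List<String> found = new ArrayList<>();
            // invokeAll blocks until every task has finished and returns the futures in task order.
            for (Future<List<String>> result : pool.invokeAll(tasks)) {
                found.addAll(result.get()); // only the calling thread touches 'found'
            }
            return found;
        } finally {
            pool.shutdown(); // without this the pool threads keep the JVM alive
        }
    }
}
```

Because the results are merged on the calling thread, the result list can stay a plain ArrayList; only a visited-URL set shared between tasks would need to be concurrent.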
class controller {
public static void main(String args[]) throws InterruptedException {
final int LIMIT = 4;
List<String> seedList = new ArrayList<>(); //1
seedList.add("https://www.youtube.com/");
seedList.add("https://www.digg.com/");
seedList.add("https://www.reddit.com/");
seedList.add("https://www.nytimes.com/");
DataStruc[] data = new DataStruc[LIMIT];
for(int i = 0; i < LIMIT; i++){
data[i] = new DataStruc(seedList.get(i)); //2
}
ExecutorService es = Executors.newFixedThreadPool(LIMIT);
Crawler[] crawl = new Crawler[LIMIT];
for(int i = 0; i < LIMIT; i++){
crawl[i] = new Crawler(data[i]); //3
}
for(int i = 0; i < LIMIT; i++){
es.submit(crawl[i]); // 4
}
}
}
You can try this out:
1. Create a seed list.
2. Create the DataStruc objects and give each one a URL from the seed list.
3. Create the Crawler array and pass one DataStruc object to each crawler.
4. Pass the Crawler objects to the executor.
Can someone tell me if the code below would work fine?
class CriticalSection{
int iProcessId, iCounter=0;
public static boolean[] freq = new boolean[Global.iParameter[2]];
int busy;
//constructors
CriticalSection(){}
CriticalSection(int iPid){
this.iProcessId = iPid;
}
int freqAvailable(){
for(int i=0; i<Global.iParameter[2]; i++){
if(freq[i]==true){
//this means that there is no frequency available and the request will be dropped
iCounter++;
}
}
if(iCounter == freq.length)
return 3;
BaseStaInstance.iNumReq++;
return enterCritical();
}
int enterCritical(){
int busy=0;
for(int i=0; i<Global.iParameter[2]; i++){
if(freq[i]==true){
freq[i] = false;
break;
}
}
//implement a thread that will execute the critical section simultaneously as the (contd down)
//basestation leaves it critical section and then generates another request
UseFrequency freqInUse = new UseFrequency;
busy = freqInUse.start(i);
//returns control back to the main program
return 1;
}
}
class UseFrequency extends Thread {
int iFrequency=0;
UseFrequency(int i){
this.iFrequency = i;
}
//this class just allows the frequency to be used in parallel as the other basestations carry on making requests
public void run() {
try {
sleep(((int) (Math.random() * (Global.iParameter[5] - Global.iParameter[4] + 1) ) + Global.iParameter[4])*1000);
} catch (InterruptedException e) { }
}
CriticalSection.freq[iFrequency] = true;
stop();
}
No, this code will not even compile. For example, your "UseFrequency" class has a constructor and a run() method, but then you have two lines, CriticalSection.freq[iFrequency] = true; and stop();, that aren't in any method body - they are just sitting there on their own.
If you get the code to compile it still will not work like you expect, because you have multiple threads and no concurrency control. That means the different threads can "step on each other" and corrupt shared data, like your "freq" array. Any time you have multiple threads you need to protect access to shared variables, for example with a synchronized block. The Java Tutorial on concurrency explains this here http://java.sun.com/docs/books/tutorial/essential/concurrency/index.html
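As a rough illustration of what protecting the array could look like, here is a small sketch in which all reads and writes go through synchronized methods. FrequencyPool, acquire and release are names I made up, and I am assuming that true means "free", which the question's code leaves ambiguous.

```java
import java.util.Arrays;

/** Fixed set of frequencies; every access to the shared array holds the object's lock. */
class FrequencyPool {
    private final boolean[] free; // assumption: true means the frequency is available

    FrequencyPool(int size) {
        free = new boolean[size];
        Arrays.fill(free, true);
    }

    /** Claims a free frequency and returns its index, or -1 if all are in use. */
    synchronized int acquire() {
        for (int i = 0; i < free.length; i++) {
            if (free[i]) {
                free[i] = false; // the check and the update happen under the same lock
                return i;
            }
        }
        return -1;
    }

    /** Hands a frequency back so another request can use it. */
    synchronized void release(int index) {
        free[index] = true;
    }
}
```

A request thread would call acquire(), treat -1 as a dropped request, sleep for the usage period, and then call release with the index it was given.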
Have you tried compiling and testing it? Are you using an IDE like Eclipse? You can step through your program in the debugger to see what it's doing. The way your question is structured, no one can tell whether your program is doing the right or the wrong thing, because the intended behaviour is specified neither in the code comments nor in the question itself.
I would like to store tuple objects in a concurrent Java collection and then have an efficient, blocking query method that returns the first element matching a pattern. If no such element is available, it would block until such an element is present.
For instance if I have a class:
public class Pair {
public final String first;
public final String second;
public Pair( String first, String second ) {
this.first = first;
this.second = second;
}
}
And a collection like:
public class FunkyCollection {
public void add( Pair p ) { /* ... */ }
public Pair get( Pair p ) { /* ... */ }
}
I would like to query it like:
myFunkyCollection.get( new Pair( null, "foo" ) );
which returns the first available pair with the second field equalling "foo" or blocks until such element is added. Another query example:
myFunkyCollection.get( new Pair( null, null ) );
should return the first available pair whatever its values.
Does a solution already exist? If not, how do you suggest implementing the get( Pair p ) method?
Clarification: The method get( Pair p) must also remove the element. The name choice was not very smart. A better name would be take( ... ).
Here's some source code. It's basically the same as what cb160 said, but having the source code might help to clear up any questions you may still have. In particular, the methods on the FunkyCollection must be synchronized.
As meriton pointed out, the get method performs an O(n) scan for every blocked get every time a new object is added. It also performs an O(n) operation to remove objects. This could be improved by using a data structure similar to a linked list where you can keep an iterator to the last item checked. I haven't provided source code for this optimization, but it shouldn't be too difficult to implement if you need the extra performance.
import java.util.*;
public class BlockingQueries
{
public class Pair
{
public final String first;
public final String second;
public Pair(String first, String second)
{
this.first = first;
this.second = second;
}
}
public class FunkyCollection
{
final ArrayList<Pair> pairs = new ArrayList<Pair>();
public synchronized void add( Pair p )
{
pairs.add(p);
notifyAll();
}
public synchronized Pair get( Pair p ) throws InterruptedException
{
while (true)
{
for (Iterator<Pair> i = pairs.iterator(); i.hasNext(); )
{
Pair pair = i.next();
boolean firstOk = p.first == null || p.first.equals(pair.first);
boolean secondOk = p.second == null || p.second.equals(pair.second);
if (firstOk && secondOk)
{
i.remove();
return pair;
}
}
wait();
}
}
}
class Producer implements Runnable
{
private FunkyCollection funkyCollection;
public Producer(FunkyCollection funkyCollection)
{
this.funkyCollection = funkyCollection;
}
public void run()
{
try
{
for (int i = 0; i < 10; ++i)
{
System.out.println("Adding item " + i);
funkyCollection.add(new Pair("foo" + i, "bar" + i));
Thread.sleep(1000);
}
}
catch (InterruptedException e)
{
Thread.currentThread().interrupt();
}
}
}
public void go() throws InterruptedException
{
FunkyCollection funkyCollection = new FunkyCollection();
new Thread(new Producer(funkyCollection)).start();
System.out.println("Fetching bar5.");
funkyCollection.get(new Pair(null, "bar5"));
System.out.println("Fetching foo2.");
funkyCollection.get(new Pair("foo2", null));
System.out.println("Fetching foo8, bar8");
funkyCollection.get(new Pair("foo8", "bar8"));
System.out.println("Finished.");
}
public static void main(String[] args) throws InterruptedException
{
new BlockingQueries().go();
}
}
Output:
Fetching bar5.
Adding item 0
Adding item 1
Adding item 2
Adding item 3
Adding item 4
Adding item 5
Fetching foo2.
Fetching foo8, bar8
Adding item 6
Adding item 7
Adding item 8
Finished.
Adding item 9
Note that I put everything into one source file to make it easier to run.
I know of no existing container that will provide this behavior. One problem you face is the case where no existing entry matches the query. In that case, you'll have to wait for new entries to arrive, and those new entries are supposed to arrive at the tail of the sequence. Given that you're blocking, you don't want to have to examine all the entries that precede the latest addition, because you've already inspected them and determined that they don't match. Hence, you need some way to record your current position, and be able to search forward from there whenever a new entry arrives.
This waiting is a job for a Condition. As suggested in cb160's answer, you should allocate a Condition instance inside your collection, and block on it via Condition#await(). You should also expose a companion overload to your get() method to allow timed waiting:
public Pair get(Pair p) throws InterruptedException;
public Pair get(Pair p, long time, TimeUnit unit) throws InterruptedException;
Upon each call to add(), call on Condition#signalAll() to unblock the threads waiting on unsatisfied get() queries, allowing them to scan the recent additions.
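A compact sketch of that Condition-based design, including the timed overload, might look like the following. PairStore is a made-up name and a two-element String array stands in for Pair; take removes the match, as the question's clarification asks. It rescans the whole list after every wake-up, so it does not implement the "remember your position" optimization described above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class PairStore {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition added = lock.newCondition();
    private final List<String[]> pairs = new ArrayList<>();

    void add(String first, String second) {
        lock.lock();
        try {
            pairs.add(new String[] {first, second});
            added.signalAll(); // let every blocked query rescan
        } finally {
            lock.unlock();
        }
    }

    /** Removes and returns the first match, or null on timeout; a null pattern field matches anything. */
    String[] take(String first, String second, long time, TimeUnit unit) throws InterruptedException {
        long nanos = unit.toNanos(time);
        lock.lock();
        try {
            while (true) {
                for (Iterator<String[]> it = pairs.iterator(); it.hasNext(); ) {
                    String[] p = it.next();
                    boolean firstOk = first == null || first.equals(p[0]);
                    boolean secondOk = second == null || second.equals(p[1]);
                    if (firstOk && secondOk) {
                        it.remove();
                        return p;
                    }
                }
                if (nanos <= 0L) {
                    return null; // timed out without a match
                }
                nanos = added.awaitNanos(nanos); // returns an estimate of the time left
            }
        } finally {
            lock.unlock();
        }
    }
}
```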
You haven't mentioned how or if items are ever removed from this container. If the container only grows, that simplifies how threads can scan its contents without worrying about contention from other threads mutating the container. Each thread can begin its query with confidence as to the minimum number of entries available to inspect. However, if you allow removal of items, there are many more challenges to confront.
In your FunkyCollection add method you could call notifyAll on the collection itself every time you add an element.
In the get method, if the underlying container (any suitable container is fine) doesn't contain the value you need, wait on the FunkyCollection. When the wait is notified, check to see if the underlying container contains the result you need. If it does, return the value, otherwise, wait again.
It appears you are looking for an implementation of Tuple Spaces. The Wikipedia article about them lists a few implementations for Java, perhaps you can use one of those. Failing that, you might find an open source implementation to imitate, or relevant research papers.