crawler4j asynchronously saving results to file - java

I'm evaluating crawler4j for roughly 1M crawls per day.
My scenario is this: I'm fetching a URL and parsing its description, keywords, and title. Now I would like to save each URL and its words into a single file.
I've seen how it's possible to save crawled data to files. However, since I have many crawls to perform, I want different threads performing the save operation on the file system (so as not to block the fetcher threads). Is that possible to do with crawler4j? If so, how?
Thanks

Consider using a Queue (BlockingQueue or similar) into which you put the data to be written, and which is then processed by one or more worker threads (this approach is not crawler4j-specific). Search for "producer consumer" to get some general ideas.
Concerning your follow-up question on how to pass the Queue to the crawler instances, this should do the trick (this is only from looking at the source code; I haven't used crawler4j myself):
final BlockingQueue<Data> queue = …

// use a factory, instead of supplying the crawler type, to pass the queue
controller.start(new WebCrawlerFactory<MyCrawler>() {
    @Override
    public MyCrawler newInstance() throws Exception {
        return new MyCrawler(queue);
    }
}, numberOfCrawlers);
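On the consuming side, a minimal sketch of a writer thread, assuming Data is a simple holder class with hypothetical getUrl() and getWords() accessors and that one tab-separated line per URL is an acceptable format:

Thread writer = new Thread(() -> {
    try (BufferedWriter out = Files.newBufferedWriter(Paths.get("crawl-results.txt"),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
        while (!Thread.currentThread().isInterrupted()) {
            Data data = queue.take(); // blocks until a crawler offers an item
            out.write(data.getUrl() + "\t" + data.getWords()); // hypothetical accessors
            out.newLine();
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // shutdown requested
    } catch (IOException e) {
        e.printStackTrace();
    }
});
writer.start();

You would start this thread before controller.start(...) and interrupt it once the crawl has finished and the queue is drained.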

Related

How to access a shared file from multiple threads most effectively?

I'm developing a small web app whose servlets periodically access a shared resource, which is a simple text file on the server side holding some lines of mutable data. Most of the time, servlets just read the file for the data, but some servlets may also update it, adding new lines or removing and replacing existing lines. Although the file contents are not updated very often, there is still a small chance of data inconsistency and file corruption if two or more servlets decide to read and write the file at the same time.
The first goal is to make the file reading/writing safe. For this purpose, I've created a helper FileReaderWriter class providing static methods for thread-safe file access. The read and write methods are coordinated by a ReentrantReadWriteLock. The rule is quite simple: multiple threads may read from the file at any time, as long as no other thread is writing to it at the same time.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FileReaderWriter {

    private static final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

    public static List<String> read(Path path) {
        List<String> list = new ArrayList<>();
        rwLock.readLock().lock();
        try {
            list = Files.readAllLines(path);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            rwLock.readLock().unlock();
        }
        return list;
    }

    public static void write(Path path, List<String> list) {
        rwLock.writeLock().lock();
        try {
            Files.write(path, list);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            rwLock.writeLock().unlock();
        }
    }
}
Then, every servlet may use the above method for file reading like this:
String dataDir = getServletContext().getInitParameter("data-directory");
Path filePath = Paths.get(dataDir, "test.txt");
List<String> list = FileReaderWriter.read(filePath);
Similarly, writing may be done with the FileReaderWriter.write(filePath, list) method. Note: if some data needs to be replaced or removed (which means fetching the data from the file, processing it, and writing the updated data back), then the whole code path for this operation should be guarded by rwLock.writeLock() for atomicity reasons.
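For instance, a sketch of such an atomic read-modify-write, as a method inside FileReaderWriter (UnaryOperator is java.util.function.UnaryOperator; the change function is whatever transformation the caller needs):

public static void update(Path path, UnaryOperator<List<String>> change) {
    // The whole read-modify-write cycle holds the write lock, so no other
    // reader or writer can interleave between the two steps.
    rwLock.writeLock().lock();
    try {
        List<String> lines = new ArrayList<>(Files.readAllLines(path));
        Files.write(path, change.apply(lines));
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        rwLock.writeLock().unlock();
    }
}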
Now that access to the shared file seems to be safe (at least, I hope so), the next step is to make it fast. From a scalability perspective, reading the file on every user request to a servlet doesn't sound reasonable. So what I thought of is to read the contents of the file into an ArrayList (or another collection) only once, during context initialization, and then share this ArrayList (not the file) as a context-scoped data-holder attribute. The context-scoped attribute can then be shared by servlets with the same locking mechanism as described above, and the contents of the updated ArrayList may be independently stored back to the file on some regular basis.
Another solution (in order to avoid locking) would be to use a CopyOnWriteArrayList (or some other collection from the java.util.concurrent package) for holding the shared data and designate a single-threaded ExecutorService to dump its contents into the file when needed, as sketched below. I have also heard of Java memory-mapped files for mapping an entire file into memory, but I'm not sure whether that approach is appropriate for this particular situation.
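A rough sketch of that second idea, assuming the filePath from above and a one-minute persistence interval (both arbitrary choices):

// Readers iterate lock-free; each write copies the backing array.
List<String> sharedData = new CopyOnWriteArrayList<>();

// A single writer thread persists a snapshot periodically.
ScheduledExecutorService persister = Executors.newSingleThreadScheduledExecutor();
persister.scheduleAtFixedRate(() -> {
    try {
        // Iteration over a CopyOnWriteArrayList sees a consistent snapshot.
        Files.write(filePath, new ArrayList<>(sharedData));
    } catch (IOException e) {
        e.printStackTrace();
    }
}, 1, 1, TimeUnit.MINUTES);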
So could anybody please guide me through the most effective ways to solve this problem of shared file access (maybe suggesting some other alternatives), given that writing to the file is quite infrequent and its contents are not expected to exceed a few dozen lines?
You don't explain your real problem, only your current attempt, so it's difficult to provide a good solution.
Your approach has two serious problems:
Problem 1: concurrency
a shared resource which is a simple text-file on the server side
holding some lines of mutable data
90% of the solution to a problem is a good data structure, and a mutable file is not one. Even popular database engines have important concurrency limitations (e.g. SQLite); don't try to reinvent the wheel.
Problem 2: horizontal scalability
Even if you solve your local concurrency problems (e.g. with synchronized methods), you won't be able to deploy multiple instances (nodes/servers) of your application.
Solution 1: use the right tool for the job
You don't explain the exact nature of your (data management) problem, but probably any NoSQL database will serve you well (reading about MongoDB can be a good starting point).
(Bad) solution 2: use FileLock
If for some reason you insist on doing what you describe, use low-level file locks via FileLock. You can even lock partial regions of the file, and these locks can be distributed horizontally. You won't have to worry about synchronizing other resources either, as the file-level locks will suffice.
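A minimal sketch of this approach. Note that FileLock coordinates between processes, not between threads of the same JVM (overlapping lock requests from one JVM throw OverlappingFileLockException), so in-process you still need something like the ReentrantReadWriteLock above:

try (FileChannel channel = FileChannel.open(path,
        StandardOpenOption.READ, StandardOpenOption.WRITE);
     FileLock lock = channel.lock()) { // exclusive lock on the whole file
    // read/modify/write through the channel while the lock is held
}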
(Limited) solution 3: in memory structure
If you don't need horizontal scalability, you can use a shared in-memory structure like ConcurrentHashMap, but you could lose data if the information is not persisted before an application stop.
Conclusion
Although there are more exotic distributed data models, using a database, even for a single table, may be the best and simplest solution.

Creating objects in parallel using RxJava

I have written a Spring Boot microservice using RxJava (an aggregating service) to implement the following simplified use case. The big picture: when an instructor uploads a course content document, a set of questions should be generated and saved.
User uploads a document to the system.
The system calls a Document Service to convert the document into text.
Then it calls another question-generating service to generate a set of questions from that text content.
Finally, these questions are posted to a basic CRUD microservice to be saved.
When a user uploads a document, lots of questions are created from it (maybe hundreds or so). The problem here is that I am posting questions to the CRUD service one at a time, sequentially. This slows the operation down drastically due to the IO-intensive network calls, so it takes around 20 seconds to complete the entire process. Here is the current code, assuming all the questions have been formulated:
questions.flatMapIterable(list -> list).flatMap(q -> createQuestion(q)).toList();
private Observable<QuestionDTO> createQuestion(QuestionDTO question) {
    return Observable.<QuestionDTO> create(sub -> {
        QuestionDTO questionCreated = restTemplate.postForEntity(QUESTIONSERVICE_API,
                new org.springframework.http.HttpEntity<QuestionDTO>(question), QuestionDTO.class).getBody();
        sub.onNext(questionCreated);
        sub.onCompleted();
    }).doOnNext(s -> log.debug("Question was created successfully."))
      .doOnError(e -> log.error("An ERROR occurred while creating a question: " + e.getMessage()));
}
Now my requirement is to post all the questions to the CRUD service in parallel and merge the results on completion. Note also that the CRUD service accepts only one question object at a time, and that cannot be changed. I know that I could use the Observable.zip operator for this purpose, but I have no idea how to apply it in this context, since the actual number of questions is not predetermined. How can I change the first line of code so that I can improve the performance of the application? Any help is appreciated.
By default, the observables in flatMap operate on the same scheduler you subscribed on. In order to run your createQuestion observables in parallel, you have to subscribe to them on a computation scheduler:
questions.flatMapIterable(list -> list)
         .flatMap(q -> createQuestion(q).subscribeOn(Schedulers.computation()))
         .toList();
Check this article for a full explanation.
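Since createQuestion wraps a blocking REST call rather than CPU work, Schedulers.io() (backed by a growing thread pool) is arguably a better fit than the fixed-size computation pool, and if your RxJava 1.x version has the flatMap overload with a maxConcurrent argument, you can cap the number of simultaneous requests, e.g.:

// Variant: IO scheduler for blocking network calls, at most 10 in flight.
questions.flatMapIterable(list -> list)
         .flatMap(q -> createQuestion(q).subscribeOn(Schedulers.io()), 10)
         .toList();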

Unable to figure out the correct data structure and correct approach in this scenario of parsing text

I am working on a text document parsing application.
The design of the document is as shown in the figure
Here is how the parsing is being done:
The Document contains an ArrayList of pages.
Each page has a Map<Float, List<Character>>, where the Float key is the y-axis value of a character's location and the Character holds the other information (note that Java generics require the boxed Float rather than the primitive float).
Parsing is done character by character via a third-party library. Please add comments if more information is required.
Now, while parsing, I have created two ExecutorService thread pools: one for the pages and one for populating the maps.
I initially create a Document and pass each page to the page parser as a Runnable on the first ExecutorService, which in turn passes an empty map to the text parser.
The text parser checks whether the map already contains the key; if yes, it adds the character to the existing list, otherwise it creates a new list, whatever is necessary.
The problem is that this task could be done concurrently for all pages to speed up execution, but I am unable to do so with this data structure: the threads get stuck when parsing without synchronization, and when done in a synchronized fashion using Collections.synchronizedMap, it takes a lot of time.
Also, I am maintaining two different lists of Future objects to check whether the threads have finished.
Kindly provide suggestions for improvements and for concurrent execution to speed things up.
If each page has its own Map<Float, List<Character>>, then never have more than one thread process a single page; that way you won't need to synchronize access to the map or use a concurrent Map implementation. You can statically partition the pages among your workers, as JB Nizet suggests in the comments; another option is to place all pages in a ConcurrentLinkedQueue and have the workers poll the queue for pages to parse, terminating when the queue is empty (poll returns null). Either way, you'll only need one ExecutorService, since each worker is responsible for both parsing and map population.
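A sketch of the queue-based option (Page, Document, and parse() are placeholders for your actual types; the enclosing method is assumed to declare throws InterruptedException):

Queue<Page> pending = new ConcurrentLinkedQueue<>(document.getPages());
int workers = Runtime.getRuntime().availableProcessors();
ExecutorService pool = Executors.newFixedThreadPool(workers);
for (int i = 0; i < workers; i++) {
    pool.submit(() -> {
        Page page;
        while ((page = pending.poll()) != null) { // null means the queue is drained
            page.parse(); // fills this page's own map; no shared state
        }
    });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);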

Web crawler(NYTimes) using Jsoup (Link within a link)

I have been given an assignment to crawl the "nytimes" website and display the most liked, shared, etc. articles on that website using the concept of a web crawler.
I have made use of Jsoup to extract all the links from the homepage of nytimes. The code is as follows:
public static void processPage(String url) throws IOException {
    Document doc = Jsoup.connect(url).get();
    Elements questions = doc.select("a[href]");
    for (Element link : questions) {
        String absUrl1 = link.absUrl("href");
        if (absUrl1.contains("nytimes.com")) {
            System.out.println(absUrl1);
        }
    }
}
This code extracts and displays all the links containing "nytimes.com", but how do I parse those links and extract the links within each of them, and so on? That's what a crawler is supposed to do, but I'm not able to figure it out. I tried to call the processPage function recursively, but the output I'm getting is not the expected one.
If you're using a single machine, then a Queue is for you.
As you come across links in a page that need to be crawled, add them to the Queue. If you want to stay single-threaded, you could write a while loop that reads from this queue, as sketched below. Initially, the queue might contain only the NY Times homepage link. Once that URL is pulled and crawled, more URLs will be in the queue for processing. You can continue this for all NY Times articles.
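A minimal single-threaded sketch (plain Jsoup, with a visited set to avoid recrawling; politeness delays and a page limit are omitted but advisable):

Queue<String> frontier = new ArrayDeque<>();
Set<String> visited = new HashSet<>();
frontier.add("https://www.nytimes.com/");
while (!frontier.isEmpty()) {
    String url = frontier.poll();
    if (!visited.add(url)) {
        continue; // already crawled
    }
    try {
        Document doc = Jsoup.connect(url).get();
        for (Element link : doc.select("a[href]")) {
            String absUrl = link.absUrl("href");
            if (absUrl.contains("nytimes.com")) {
                frontier.add(absUrl); // skipped later if already visited
            }
        }
    } catch (IOException e) {
        // skip pages that fail to load
    }
}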
Using a Queue also allows you to multithread easily, with multiple threads taking from the queue to increase throughput. Take a look at the producer/consumer pattern.
If a single machine isn't enough, you'll have to do something more distributed, like using Hadoop. Yahoo uses Hadoop to have multiple machines spider the web at once.

Retrieving Large Lists of Objects Using Java EE

Is there a generally-accepted way to return a large list of objects using Java EE?
For example, if you had a database ResultSet that had millions of objects how would you return those objects to a (remote) client application?
Another example -- that is closer to what I'm actually doing -- would be to aggregate data from hundreds of sources, normalize it, and incrementally transfer it to a client system as a single "list".
Since all the data cannot fit in memory, I was thinking that a combination of a stateful SessionBean and some sort of custom Iterator that calls back to the server would do the trick.
So, in other words, if I have an API like Iterator<Data> getData() then what's a good way to implement getData() and Iterator<Data>?
How have you successfully solved this problem in the past?
Definitely don't duplicate the entire DB into Java's memory. It makes no sense and only makes things unnecessarily slow and memory-hogging. Rather, introduce pagination at the database level: query only the data you actually need to display on the current page, as Google does.
If you have a hard time implementing this properly and/or figuring out the SQL query for your specific database, have a look at this answer. For the JPA/Hibernate equivalent, have a look at this answer.
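As an illustration only (MySQL/PostgreSQL-style LIMIT/OFFSET; the connection, table, and Item class are made-up names, and the enclosing method is assumed to declare throws SQLException):

// Fetch exactly one page of rows; the database does the windowing, not the JVM.
String sql = "SELECT id, title FROM item ORDER BY id LIMIT ? OFFSET ?";
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    ps.setInt(1, pageSize);
    ps.setInt(2, pageNumber * pageSize);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            items.add(new Item(rs.getLong("id"), rs.getString("title")));
        }
    }
}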
Update: as per the comments (which actually change the entire question subject...), here's a basic (pseudo) kickoff example:
List<Source> inputSources = createItSomehow();
Source outputSource = createItSomehow();

for (Source inputSource : inputSources) {
    while (inputSource.next()) {
        outputSource.write(inputSource.read());
    }
}
This way you effectively hold only a single entry in Java's memory at a time, instead of the entire collection as in the following (inefficient) example:
List<Source> inputSources = createItSomehow();
List<Entry> entries = new ArrayList<Entry>();

for (Source inputSource : inputSources) {
    while (inputSource.next()) {
        entries.add(inputSource.read());
    }
}

Source outputSource = createItSomehow();
for (Entry entry : entries) {
    outputSource.write(entry);
}
Pagination is a good solution when working with a web-based UI. Sometimes, however, it is much more efficient to stream everything in one call. The rmiio library was written explicitly for this purpose and is already known to work in a variety of app servers.
If your list is huge, you must assume that it won't fit in memory; or at least that, if your server needs to handle many such requests concurrently, you run a high risk of an OutOfMemoryError.
So basically, what you do is paging and batch reading: say you load a thousand objects from your database, send them to the client in the request response, and loop until you have processed all objects. (See the response from BalusC.)
The problem is the same on the client side, and you'll likely need to stream the data to the file system to prevent OutOfMemory errors.
Please also note: it is okay to load millions of objects from a database as an administrative task, like performing a backup or an export in some exceptional case. But you should not do it for a request any user could make; it will be slow and drain server resources.
