Guys
I have the following code to track visited links in my crawler.
After extracting links, I have a for loop that loops through each individual href tag.
After I have visited a link and opened it, I add the URL to a visited-links collection variable, defined as:
private final Collection<String> urlFrontier = Collections.synchronizedSet(new HashSet<String>());
The crawler implementation is multithreaded. Suppose I have visited 100,000 URLs: if I never terminate the crawler, the set will grow day by day and eventually create memory issues. What options do I have to refresh the variable without creating inconsistency across threads?
Thanks in advance!
If your crawlers are any good, managing the crawl frontier quickly becomes difficult, slow and error-prone.
Luckily, you don't need to write this yourself: just write your crawlers to consume the URL Frontier API and plug in an implementation that suits you.
See https://github.com/crawler-commons/url-frontier
The most common approach in modern crawling systems is to use a NoSQL database.
Such a store is notably slower than an in-memory HashSet, which is why you can layer a caching strategy on top, such as Redis, or even a Bloom filter.
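As an illustration of the Bloom-filter option, here is a minimal sketch using Guava's BloomFilter (my choice of library; the sizing numbers are arbitrary). The trade-off is that a Bloom filter can return false positives, so an unvisited URL may occasionally be skipped, but memory use stays fixed no matter how long the crawl runs:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

public class SeenUrls {
    // Sized for ~10 million URLs at a 1% false-positive rate.
    private final BloomFilter<String> seen = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

    public synchronized boolean markVisited(String url) {
        // mightContain may wrongly say "seen" (a false positive),
        // but it never forgets a URL that was actually added.
        if (seen.mightContain(url)) {
            return false;
        }
        seen.put(url);
        return true;
    }
}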
Given the specific nature of URLs, I'd also recommend a Trie data structure, which gives you many options for manipulating and searching by the URL string. (A discussion of Java implementations can be found in this Stack Overflow topic.)
As per the question, I would recommend using Redis to replace the Collection. It's an in-memory data-structure store, very fast at inserting and retrieving data, with support for all the standard data structures; in your case a Set, where you can check the existence of a member with the SISMEMBER command.
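A minimal sketch with the Jedis client (my assumption for the Java binding; the host, port and key name are placeholders):

import redis.clients.jedis.Jedis;

public class VisitedSet {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String url = "http://example.com/";

            // SADD is atomic and returns 1 only for the first caller to
            // add the URL, so there is no check-then-act race between
            // crawler threads (or even separate crawler processes).
            boolean firstVisit = jedis.sadd("visited", url) == 1;

            // A read-only membership test uses SISMEMBER.
            boolean alreadySeen = jedis.sismember("visited", url);

            System.out.println(firstVisit + " / " + alreadySeen);
        }
    }
}

Because the set lives outside the JVM, it also survives crawler restarts, and you can expire or flush it independently of the crawler's lifecycle.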
Apache Nutch is also good to explore.
I'm building an automation framework in Selenium using the Page Object design pattern.
Following are some of the data I'm using and where I have stored them:
Page objects (XPath, ID, etc.) - in the page classes themselves
Configuration data (wait times, browser type, the URL, etc.) - in a properties file
Other data - in a class, as static variables
Once the framework starts growing, it becomes hard to organize all this data. I did some research on how others store data in their frameworks. Here is what I found:
Storing data (mostly page objects) in the classes themselves
Storing data in JSON
And some even suggested storing data in a database to reduce read times
Since there are a lot of options out there, I thought I'd get some feedback on the best way to store data and on how everyone else has stored their data.
JSON or any lightweight data store is the best option, since this is a framework and the purpose is to reuse it across different projects.
I don't see any problem with the way you have stored your data.
Locators (by POM definition) should be stored in the page objects themselves.
Config data can be stored in some sort of config file... whatever you find convenient. You can use plain text, JSON, XML, etc. We use XML but that really comes down to personal preference.
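If you go with a properties file, a minimal loader might look like this (the file name and keys are invented for the example):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public final class Config {
    private static final Properties PROPS = new Properties();

    static {
        // config.properties might contain, e.g.:
        //   base.url=https://example.com
        //   browser=chrome
        //   implicit.wait.seconds=10
        try (InputStream in = new FileInputStream("config.properties")) {
            PROPS.load(in);
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static String get(String key) {
        return PROPS.getProperty(key);
    }
}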
Storing other data in a class as static variables is fine also.
The framework doesn't really grow, the automation suite does. As long as you keep the data stored in the 3 places above consistently, I think you should be fine. The only issue I've run into with this approach is that sometimes certain pages have a LOT of functionality on them so the page objects grow quite large. In those cases, we found a way to divide the page into smaller chunks, e.g. one page had 22 tabs, each consisting of a different panel. In that case, we broke the page object into 22 different class files to keep the size more manageable and then hooked them all back into the main page as properties, e.g. mainPage.Panel1.someMethodOnPanel1();
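A rough sketch of that composition idea (class and method names are invented for illustration):

import org.openqa.selenium.WebDriver;

// Panel1.java -- one class file per panel keeps each page object small.
class Panel1 {
    private final WebDriver driver;

    Panel1(WebDriver driver) {
        this.driver = driver;
    }

    void someMethodOnPanel1() {
        // interactions scoped to this panel only
    }
}

// MainPage.java -- hooks the panels back in as properties.
class MainPage {
    final Panel1 panel1;

    MainPage(WebDriver driver) {
        panel1 = new Panel1(driver);
        // ... panel2 through panel22 ...
    }
}

Usage then reads naturally: new MainPage(driver).panel1.someMethodOnPanel1();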
I advise using interfaces, one per device type, to store the different kinds of selectors, for example:
import org.openqa.selenium.By;

import static org.openqa.selenium.By.cssSelector;
import static org.openqa.selenium.By.id;
import static org.openqa.selenium.By.xpath;

public interface DesktopMainPageSelector {
    By FIRST_ELEMENT = cssSelector("selector_here");
    By SECOND_ELEMENT = xpath("selector_here");
    By THIRD_ELEMENT = id("selector_here");
}
Then, just implement the interface wherever you need its selectors (a short usage sketch follows below).
You can also use enums for a more complex structure.
I found this to be the best solution, because it's easy to manage large numbers of selectors.
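For example, a page class picks up the selectors simply by implementing the interface (the driver wiring is just a sketch):

import org.openqa.selenium.WebDriver;

public class DesktopMainPage implements DesktopMainPageSelector {
    private final WebDriver driver;

    public DesktopMainPage(WebDriver driver) {
        this.driver = driver;
    }

    public void clickFirstElement() {
        // FIRST_ELEMENT is inherited from the interface as a constant
        driver.findElement(FIRST_ELEMENT).click();
    }
}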
I have an issue with my webapp: it's an intranet search webapp that asks Sphinx (http://sphinxsearch.com/, the actual search engine) for a query typed by the user. The problem is that the result set could be very big (even for an intranet), so I want to save the results on the server and handle a sort of lazy loading of the data.
I was planning to use Hibernate, but if the result is too big and I save, for example, 40,000 items, will that be too much work for Hibernate? And what about retrieving them?
Any suggestions?
Thanks in advance
You can use a limit and offset in Sphinx itself: http://sphinxsearch.com/docs/2.1.3/api-func-setlimits.html. From that doc, a word about limiting the result set on the Sphinx server (the limit is 1,000 matches by default):
One thousand records is enough to present to the end user. And if you're thinking about pulling the results to application for further sorting or filtering, that would be much more efficient if performed on Sphinx side.
Maybe I'm missing something, but why not just get the results piecemeal directly out of Sphinx? You can fetch a small page's worth of results at a time with SetLimits.
That way you only download the data as you need it.
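A rough sketch with the bundled Java client (org.sphx.api.SphinxClient; the method names are from my memory of the Sphinx API docs, so treat the details as assumptions):

import org.sphx.api.SphinxClient;
import org.sphx.api.SphinxMatch;
import org.sphx.api.SphinxResult;

public class PagedSearch {
    public static void main(String[] args) throws Exception {
        SphinxClient client = new SphinxClient("localhost", 9312);
        int pageSize = 50;

        for (int offset = 0; ; offset += pageSize) {
            // Ask Sphinx for just one page of matches at a time.
            client.SetLimits(offset, pageSize);
            SphinxResult result = client.Query("user query", "myindex");
            if (result == null || result.matches.length == 0) {
                break; // no more matches
            }
            for (SphinxMatch match : result.matches) {
                System.out.println("doc id: " + match.docId);
            }
        }
    }
}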
I'm writing my own Document Management System (DMS) in Java (the ones available don't satisfy my needs).
The documents shall be described by the Qualified Dublin Core Metadata Standard. The easiest way to do this, in my opinion, is to pack the key-value pairs into an RDF model with an XML representation.
To store the metadata for all documents I have two ideas (the document files themselves will be stored in the filesystem):
Store all metadata of all documents in a single XML file
Make an XML file for each document and store it either in the filesystem or in an RDBMS (like the H2 database engine for Java); a key-value database won't work here because the keys for one document are not unique.
Since (many) documents are linked to each other, the first approach might be better for analysing the data, but the second approach may be much faster.
Which solution would you recommend? Or are there any better solutions?
Stefan
I don't know how your analysis works, but if you need the complete graph in memory to do it, then use variant 1 (store all metadata of all documents in a single XML file), because you would get no gain (only extra work) from variant 2 in that scenario.
Added: if the extra work for variant 2 is not too much, then I recommend variant 2, because it can be more scalable:
You could update or add document metadata by writing only a small XML file instead of a huge one.
It depends on which XML parser you use, but in some cases it is faster to parse several smaller XML files than one huge one (though this strongly depends on the amount of data).
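As a sketch of variant 2 using Apache Jena (my choice of RDF library; the Jena 2.x package names, the URI and the property values are assumptions for the example):

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.vocabulary.DC;

import java.io.FileOutputStream;
import java.io.OutputStream;

public class MetadataWriter {
    public static void main(String[] args) throws Exception {
        // One small RDF/XML file per document: updating one document's
        // metadata never rewrites the metadata of all the others.
        Model model = ModelFactory.createDefaultModel();
        model.createResource("urn:doc:42")
             .addProperty(DC.title, "Some document")
             .addProperty(DC.creator, "Stefan");

        try (OutputStream out = new FileOutputStream("doc-42.rdf")) {
            model.write(out, "RDF/XML");
        }
    }
}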
Have you considered using MongoDB and GridFS? http://www.mongodb.org/display/DOCS/GridFS+Specification
You can store your documents directly in MongoDB as binary data and store the associated metadata for each particular file in any format you want. It can store documents even if they have the same name, and it will generate its own unique IDs.
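A small sketch with the (old) MongoDB Java driver's GridFS classes (the class and method names are from that era's driver docs, so treat the details as assumptions):

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.Mongo;
import com.mongodb.gridfs.GridFS;
import com.mongodb.gridfs.GridFSDBFile;
import com.mongodb.gridfs.GridFSInputFile;

import java.io.File;

public class GridFsDemo {
    public static void main(String[] args) throws Exception {
        DB db = new Mongo("localhost").getDB("dms");
        GridFS fs = new GridFS(db);

        // Store the binary content plus arbitrary metadata in one place.
        GridFSInputFile in = fs.createFile(new File("contract.pdf"));
        in.setMetaData(new BasicDBObject("title", "Contract")
                .append("creator", "Stefan"));
        in.save();

        // Files are addressed by their generated ObjectId, so duplicate
        // file names are not a problem.
        GridFSDBFile out = fs.findOne(new BasicDBObject("_id", in.getId()));
        System.out.println(out.getMetaData());
    }
}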
BTW, even if it doesn't strictly belong to your question: have a look at a JCR (Java Content Repository) implementation like Jackrabbit. You could use it to store your documents, and maybe your metadata too.
I'd look into a NoSQL document solution like CouchDB to see if it could help you.
I don't like the file system solution; there's no abstraction whatsoever to help you there.
If you are always accessing all documents, neither approach will be slower than the other, but I would recommend the second one. When it comes to analyzing the data, you'll need to read all documents anyway, so it makes no difference whether they are in separate files or in one file...
Is there a generally-accepted way to return a large list of objects using Java EE?
For example, if you had a database ResultSet containing millions of objects, how would you return those objects to a (remote) client application?
Another example -- that is closer to what I'm actually doing -- would be to aggregate data from hundreds of sources, normalize it, and incrementally transfer it to a client system as a single "list".
Since all the data cannot fit in memory, I was thinking that a combination of a stateful SessionBean and some sort of custom Iterator that called back to the server would do the trick.
So, in other words, if I have an API like Iterator<Data> getData() then what's a good way to implement getData() and Iterator<Data>?
How have you successfully solved this problem in the past?
Definitely don't duplicate the entire DB into Java's memory. That makes no sense and only makes things unnecessarily slow and memory-hogging. Rather, introduce pagination at the database level: query only the data you actually need to display on the current page, as Google does.
If you have a hard time implementing this properly and/or figuring out the SQL query for your specific database, have a look at this answer. For the JPA/Hibernate equivalent, have a look at this answer.
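For plain JDBC, the idea looks roughly like this (the LIMIT/OFFSET syntax varies per database, and the table and column names are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ItemDao {
    // Fetch exactly one page of rows; everything else stays in the DB.
    public void printPage(Connection con, int page, int pageSize) throws SQLException {
        String sql = "SELECT id, name FROM item ORDER BY id LIMIT ? OFFSET ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, pageSize);
            ps.setInt(2, page * pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + ": " + rs.getString("name"));
                }
            }
        }
    }
}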
Update as per the comments (which actually changes the entire question subject...), here's a basic (pseudo) kickoff example:
List<Source> inputSources = createItSomehow();
Source outputSource = createItSomehow();

for (Source inputSource : inputSources) {
    while (inputSource.next()) {
        // Copy one entry at a time; nothing accumulates in memory.
        outputSource.write(inputSource.read());
    }
}
This way you effectively end up with a single entry in Java's memory instead of the entire collection as in the following (inefficient) example:
List<Source> inputSources = createItSomehow();
List<Entry> entries = new ArrayList<Entry>();

for (Source inputSource : inputSources) {
    while (inputSource.next()) {
        entries.add(inputSource.read());
    }
}

Source outputSource = createItSomehow();
for (Entry entry : entries) {
    outputSource.write(entry);
}
Pagination is a good solution when working with a web-based UI. Sometimes, however, it is much more efficient to stream everything in one call. The rmiio library was written explicitly for this purpose, and it is already known to work in a variety of app servers.
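The canonical rmiio usage looks roughly like this (reconstructed from my memory of the rmiio documentation, so double-check the details; the file name is a placeholder):

import com.healthmarketscience.rmiio.RemoteInputStream;
import com.healthmarketscience.rmiio.RemoteInputStreamClient;
import com.healthmarketscience.rmiio.SimpleRemoteInputStream;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class StreamingExample {
    // Server side: export a local stream over RMI.
    public RemoteInputStream getData() throws Exception {
        SimpleRemoteInputStream server = new SimpleRemoteInputStream(
                new BufferedInputStream(new FileInputStream("huge-export.dat")));
        return server.export();
    }

    // Client side: wrap the remote handle back into a plain InputStream
    // and consume it chunk by chunk.
    public void consume(RemoteInputStream remote) throws Exception {
        try (InputStream in = RemoteInputStreamClient.wrap(remote)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // process each chunk without holding the full data in memory
            }
        }
    }
}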
If your list is huge, you must assume it won't fit in memory, or at least that if your server has to handle many such requests concurrently, you run a high risk of an OutOfMemoryError.
So basically, what you do is paging with batch reading: say you load a thousand objects from the database, send them to the client in the response, and loop until you have processed all objects (see the response from BalusC).
The problem is the same on the client side, and you'll likely need to stream the data to the file system to prevent OutOfMemory errors there too.
Please also note: it is okay to load millions of objects from a database as an administrative task, like performing a backup or an export in some 'exceptional' case, but you should not expose it as a request any user could make; it would be slow and drain server resources.
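A skeletal version of that batch loop (Item, ItemDao, fetchPage and process are placeholder names standing in for your own types and code):

import java.util.List;

public class BatchProcessor {
    // Placeholder types standing in for your own entity and DAO.
    interface Item {}
    interface ItemDao {
        List<Item> fetchPage(int offset, int limit); // e.g. a LIMIT/OFFSET query
    }

    public void processAll(ItemDao dao) {
        final int batchSize = 1000;
        int offset = 0;
        List<Item> batch;
        do {
            batch = dao.fetchPage(offset, batchSize);
            for (Item item : batch) {
                process(item); // stream to the response, a file, etc.
            }
            offset += batchSize;
        } while (batch.size() == batchSize); // a short page means we're done
    }

    private void process(Item item) { /* placeholder */ }
}

At most batchSize objects are in memory at any time, regardless of the total row count.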
I want to write a Java function grabTopResults(String f) such that grabTopResults("automata theory") returns a list of the top 100 cited papers on scholar.google.com for "automata theory".
Does anyone have suggestions for what libraries will make my life easy?
Thanks!
As I'm sure Google can afford the bandwidth, I'll ignore the question of whether this is immoral/illegal/prohibited by Google's T&C.
The first thing you need to do is figure out what HTTP request (or requests) you need to issue in order to obtain the page with the data you need. Once you've figured that out, use HttpClient to issue the same request from Java code; the previous link shows example code that explains how to do this.
Once you've downloaded the content of the relevant page, you'll need to use an HTML parser to extract the data you're interested in. The Jericho parser suggested by peperg is a good choice.
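Put together, a hedged sketch with Apache HttpClient and Jericho might look like this (the query URL and the gs_r result class are taken from the TestPlan answer below, and the selectors may break whenever Google changes the markup):

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class ScholarScraper {
    public static void main(String[] args) throws Exception {
        String url = "http://scholar.google.com/scholar?q=automata+theory";
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            String html = EntityUtils.toString(
                    client.execute(new HttpGet(url)).getEntity());

            Source source = new Source(html);
            // Each result row is a div with class "gs_r"; the title sits in an h3.
            for (Element div : source.getAllElements("div")) {
                if ("gs_r".equals(div.getAttributeValue("class"))) {
                    Element title = div.getFirstElement("h3");
                    if (title != null) {
                        System.out.println(title.getTextExtractor().toString());
                    }
                }
            }
        }
    }
}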
If the Google police come knocking, you've never heard of me, OK?
I use Jericho: http://jericho.htmlparser.net/docs/index.html. Google Scholar doesn't have an API (http://code.google.com/p/google-ajax-apis/issues/detail?id=109). And of course it is not allowed by Google (read the terms of use: automated requests are forbidden).
Below is a bit of example code which gets the titles on the first page using the open source product TestPlan. It is a standalone product, but if you really need it I could help you integrate it into your Java code (it is written in Java itself).
GotoURL http://scholar.google.com/

SubmitForm with
    %Params:q% automate theory
end

set %Items% as response //div[@class='gs_r']
foreach %Item% in %Items%
    set %Title% as selectIn %Item% h3
    Notice %Title%
end
This produces output like the following (my IP is in Germany, hence the German labels in the response). Obviously you could format it however you like, or write it to a file; this is just a rough test.
00000000-00 GOTOURL http://scholar.google.com/
00000001-00 SUBMITFORM default
00000002-00 NOTICE [ZITATION] Stochastic complexity in statistical inquiry theory
00000003-00 NOTICE AUTOMATED THEORY FORMATION IN MATHEMATICS1
00000004-00 NOTICE Constraint generation via automated theory formation
00000005-00 NOTICE [BUCH] Automated theorem proving: after 25 years
00000006-00 NOTICE [BUCH] Introduction to the Theory of Computation
00000007-00 NOTICE [ZITATION] Computer-controlled systems: theory and design
00000008-00 NOTICE [BUCH] … , randomness & incompleteness: papers on algorithmic information theory
00000009-00 NOTICE [BUCH] Automatic control systems
00000010-00 NOTICE [BUCH] VLSI physical design automation: theory and practice
00000011-00 NOTICE Singular Control Systems.