Efficient way to GET multiple HTML pages simultaneously - java

So I'm working on web scraping for a certain website. The problem is:
Given a set of URLs (on the order of hundreds to thousands), I would like to retrieve the HTML of each URL efficiently, especially time-wise. I need to be able to make thousands of requests every 5 minutes.
This usually implies using a pool of threads to issue requests over a set of not-yet-requested URLs. But before jumping into implementing this, I believe it's worth asking here, since this seems to be a fairly common problem when doing web scraping or web crawling.
Is there any library that has what I need?

So I'm working on web scraping for a certain website.
Are you scraping a single server, or are you scraping from multiple other hosts? If it is the former, the server you are scraping may not like too many concurrent connections from a single IP address.
If it is the latter, this is really a general question about how many outbound connections you should open from one machine. There is a physical limit, but it is pretty large. Practically, it depends on where that client is deployed: the better the connectivity, the more connections it can sustain.
You might want to look at the source code of a good download manager to see whether it caps the number of outbound connections.
Definitely use asynchronous I/O, but you would still do well to limit the number.

Your bandwidth utilization will be the sum of all of the HTML documents that you retrieve (plus a little overhead) no matter how you slice it (though some web servers may support compressed HTTP streams, so certainly use a client capable of accepting them).
The optimal number of concurrent threads depends a great deal on your network connectivity to the sites in question. Only experimentation can find an optimal number. You can certainly use one set of threads for retrieving HTML documents and a separate set of threads to process them to make it easier to find the right balance.
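As a rough illustration of that split (my own sketch, not part of the original answer), two fixed pools with plain HttpURLConnection might look like this; the pool sizes, URLs, and class name are placeholders to tune by experiment:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.zip.GZIPInputStream;

public class TwoPoolFetcher {

    public static void main(String[] args) throws InterruptedException {
        // Pool sizes are guesses; only experimentation will find the right balance.
        ExecutorService fetchPool = Executors.newFixedThreadPool(20);
        ExecutorService processPool = Executors.newFixedThreadPool(4);

        List<String> urls = Arrays.asList("http://www.wikipedia.org/", "http://www.example.com/");
        for (String url : urls) {
            fetchPool.submit(() -> {
                try {
                    String html = download(url);
                    // Hand the document off to the processing pool.
                    processPool.submit(() -> process(url, html));
                } catch (Exception e) {
                    System.err.println("Failed to fetch " + url + ": " + e.getMessage());
                }
            });
        }
        fetchPool.shutdown();
        fetchPool.awaitTermination(5, TimeUnit.MINUTES);
        processPool.shutdown();
    }

    private static String download(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip"); // accept compressed streams if offered
        InputStream raw = conn.getInputStream();
        InputStream in = "gzip".equals(conn.getContentEncoding()) ? new GZIPInputStream(raw) : raw;
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    private static void process(String url, String html) {
        // Parsing / extraction / persistence would happen here.
        System.out.println(url + ": " + html.length() + " chars");
    }
}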
I'm a big fan of HTML Agility Pack for web scraping in the .NET world, but I can't make a specific recommendation for Java. The following question may be of use in finding a good Java-based scraping platform:
Web scraping with Java

I would start by researching asynchronous communication. Then take a look at Netty.
Keep in mind there is always a limit to how fast one can load a web page. For an average home connection, it will be around a second. Take this into consideration when programming your application.

http://jsoup.org, just for the scraping part! The thread pooling, I think, you should implement yourself.
Update
If this approach fits your need, you can download the complete class files here:
http://codetoearn.blogspot.com/2013/01/concurrent-web-requests-with-thread.html
AsyncWebReader webReader = new AsyncWebReader(5 /* number of threads */, new String[]{
        "http://www.google.com",
        "http://www.yahoo.com",
        "http://www.live.com",
        "http://www.wikipedia.com",
        "http://www.facebook.com",
        "http://www.khorasannews.com",
        "http://www.fcbarcelona.com",
        "http://www.khorasannews.com",
});
webReader.addObserver(new Observer() {
    @Override
    public void update(Observable o, Object arg) {
        if (arg instanceof Exception) {
            Exception ex = (Exception) arg;
            System.out.println(ex.getMessage());
        } /* else if (arg instanceof List) {
            List vals = (List) arg;
            System.out.println(vals.get(0) + ": " + vals.get(1));
        } */ else if (arg instanceof Object[]) {
            Object[] objects = (Object[]) arg;
            HashMap result = (HashMap) objects[0];
            String[] success = (String[]) objects[1];
            String[] fail = (String[]) objects[2];
            System.out.println("Failed");
            for (int i = 0; i < fail.length; i++) {
                System.out.println(fail[i]);
            }
            System.out.println("-----------");
            System.out.println("Success");
            for (int i = 0; i < success.length; i++) {
                System.out.println(success[i]);
            }
            System.out.println("\n\nResult of Google: ");
            System.out.println(result.remove("http://www.google.com"));
        }
    }
});
Thread t = new Thread(webReader);
t.start();
t.join();
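If you would rather roll the thread pooling yourself with a standard ExecutorService and jsoup (as suggested above), a minimal sketch could look like the following; Jsoup.connect(url).get() is jsoup's real fetch-and-parse call, while the class name, URLs, and pool size are only illustrative:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class JsoupPoolExample {
    public static void main(String[] args) throws InterruptedException {
        List<String> urls = Arrays.asList(
                "http://www.wikipedia.org",
                "http://www.example.com");
        ExecutorService pool = Executors.newFixedThreadPool(5); // tune to your bandwidth
        for (String url : urls) {
            pool.submit(() -> {
                try {
                    Document doc = Jsoup.connect(url).get(); // fetch and parse in one call
                    System.out.println(url + " -> " + doc.title());
                } catch (Exception e) {
                    System.err.println(url + " failed: " + e.getMessage());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
    }
}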

Related

Google App Engine Objectify - load single objects or list of keys?

I am trying to get a grasp on Google App Engine programming and wonder what the difference between these two methods is - if there even is a practical difference.
Method A)
public Collection<Conference> getConferencesToAttend(Profile profile) {
    List<String> keyStringsToAttend = profile.getConferenceKeysToAttend();
    List<Conference> conferences = new ArrayList<Conference>();
    for (String conferenceString : keyStringsToAttend) {
        conferences.add(ofy().load().key(Key.create(Conference.class, conferenceString)).now());
    }
    return conferences;
}
Method B)
public Collection<Conference> getConferencesToAttend(Profile profile) {
    List<String> keyStringsToAttend = profile.getConferenceKeysToAttend();
    List<Key<Conference>> keysToAttend = new ArrayList<>();
    for (String keyString : keyStringsToAttend) {
        keysToAttend.add(Key.<Conference>create(keyString));
    }
    return ofy().load().keys(keysToAttend).values();
}
the "conferenceKeysToAttend" list is guaranteed to only have unique Conferences - does it even matter then which of the two alternatives I choose? And if so, why?
Method A loads entities one by one, while Method B does a bulk load, which is cheaper, since you're making just one network roundtrip to Google's datacenter. You can observe this by measuring the time taken by both methods while loading a bunch of keys multiple times.
While doing a bulk load, you need to be careful about which entities actually came back if the datastore operation throws an exception: the operation might partially succeed, with some of the entities not loaded.
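For illustration, since the batch load in Method B returns a Map (keys with no corresponding entity are simply absent from it), you can detect partial results along these lines. This is my own sketch based on Objectify's Map-returning load().keys(), not part of the original answer:

// Sketch only: load().keys() returns a Map<Key<Conference>, Conference>;
// keys whose entities were not found are simply missing from the map.
Map<Key<Conference>, Conference> loaded = ofy().load().keys(keysToAttend);
for (Key<Conference> key : keysToAttend) {
    if (!loaded.containsKey(key)) {
        // handle the missing entity (log it, retry, etc.)
        System.err.println("Not found: " + key);
    }
}
Collection<Conference> conferences = loaded.values();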
The answer depends on the size of the list. If we are talking about hundreds or more, you should not make it a single batch; I couldn't find documentation on what the limit is, but there is one. If the list is not that large, definitely go with loading one by one. But you should make the calls asynchronous by deferring the now() call:
List<LoadResult<Conference>> conferences = new ArrayList<>();
for (String conferenceString : keyStringsToAttend) {
    conferences.add(ofy().load().key(Key.create(Conference.class, conferenceString)));
}
And when you need the actual data:
for (LoadResult<Conference> loadResult : conferences) {
    Conference c = loadResult.now();
    // ...
}

Ideas on concurrent datastructure

I am not sure if I can put my question in the clearest fashion, but I will try my best.
Let's say I am retrieving some information from a third-party API. The retrieved information will be huge in size. To gain performance, instead of retrieving all the info in one go, I retrieve it in a paged fashion (the API gives me that facility, basically an iterator). The return type is basically a list of objects.
My aim here is to process the information I have in hand (that includes comparing, storing in the DB, and many other operations) while I get the paged responses to the request.
My question to the expert community is: what data structure do you prefer in such a case? Also, does a framework like Spring Batch help you get performance gains in such cases?
I know the question is a bit vague, but I am looking for general ideas, tips, and pointers.
In these cases, the data structure for me is java.util.concurrent.CompletionService.
For purposes of example, I'm going to assume a couple of additional constraints:
You want only one outstanding request to the remote server at a time
You want to process the results in order.
Here goes:
// a class that knows how to update the DB given a page of results
class DatabaseUpdater implements Callable<Object> { ... }

// a background thread to do the work
final CompletionService<Object> exec =
        new ExecutorCompletionService<Object>(Executors.newSingleThreadExecutor());

// first call
List<Object> results = ThirdPartyAPI.getPage( ... );

// Start loading those results into the DB on the background thread
exec.submit(new DatabaseUpdater(results));

while ( /* you need to */ ) {
    // Another call to the remote service
    results = ThirdPartyAPI.getPage( ... );
    // wait for the existing work to complete (get() rethrows any exception from DatabaseUpdater)
    exec.take().get();
    // send more work to the background thread
    exec.submit(new DatabaseUpdater(results));
}

// wait for the last task to complete
exec.take().get();
This is just a simple two-thread design. The first thread is responsible for getting data from the remote service and the second is responsible for writing it to the database.
Any exceptions thrown by DatabaseUpdater will be propagated to the main thread when you call get() on the Future returned by exec.take().
Good luck.
In terms of doing the actual parallelism, one very useful construct in Java is the ThreadPoolExecutor. A rough sketch of what that might look like is this:
public class YourApp {

    static class Processor implements Runnable {
        Widget toProcess;

        public Processor(Widget toProcess) {
            this.toProcess = toProcess;
        }

        public void run() {
            // commit the Widget to the DB, etc.
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor executor =
                new ThreadPoolExecutor(1, 10, 30,
                        TimeUnit.SECONDS,
                        new LinkedBlockingDeque<Runnable>());
        while (thereAreStillWidgets()) {
            ArrayList<Widget> widgets = doExpensiveDatabaseCall();
            for (Widget widget : widgets) {
                Processor processor = new Processor(widget);
                executor.execute(processor);
            }
        }
        executor.shutdown(); // no new work; let queued tasks finish
    }
}
But as I said in a comment: calls to an external API are expensive. It's very likely that the best strategy is to pull all the Widget objects down from the API in one call, and then process them in parallel once you've got them. Doing more API calls gives you the overhead of sending the data all the way from the server to you, every time -- it's probably best to pay that cost the fewest number of times that you can.
Also, keep in mind that if you're doing DB operations, it's possible that your DB doesn't allow for parallel writes, so you might get a slowdown there.

JMX results are confusing

I have been trying to learn JMX for the last few days and am now confused.
I have written a simple JMX program which uses the APIs of the java.lang.management package to extract the PID, CPU time, and user time. In my results I only get the threads of the current JVM, i.e. my JMX program itself, but I thought I would get results for all the Java processes running on the same machine. How can I get the PIDs, CPU time, and user time for all Java processes running on the machine (Linux/Windows)?
How can I get the PIDs, CPU time, and user time for all non-Java processes running on my machine (Linux/Windows)?
My code is below:
public void update() throws Exception {
    final ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    final long[] ids = bean.getAllThreadIds();
    final ThreadInfo[] infos = bean.getThreadInfo(ids);
    for (long id : ids) {
        if (id == threadId) {
            continue; // Exclude polling thread
        }
        final long c = bean.getThreadCpuTime(id);
        final long u = bean.getThreadUserTime(id);
        if (c == -1 || u == -1) {
            continue; // Thread died
        }
    }
    String name = null;
    for (int i = 0; i < infos.length; i++) {
        name = infos[i].getThreadName();
        System.out.println("The name of the id is " + name);
    }
}
I am always getting the result:
The name of the id is Attach Listener
The name of the id is Signal Dispatcher
The name of the id is Finalizer
The name of the id is Reference Handler
The name of the id is main
I have some other Java processes running on my machine, but they are not included in the results of the bean.getAllThreadIds() API.
Ah, now I see what you want to do. I'm afraid I have some bad news.
The APIs that are exposed through ManagementFactory allow you to monitor only the JVM in which your code is running. To monitor other JVMs, you have to use the JMX Remoting API (javax.management.remote), and that introduces a whole new range of issues you have to deal with.
It sounds like what you want to do is basically write your own management console using the stock APIs provided by out-of-the-box JDK. Short answer: you can't get there from here. Slightly longer answer: you can get there from here, but the road is long, winding, uphill (nearly) the entire way, and when you're done you will most likely wish you had gone a different route (read that: use a management console that has already been written).
I recommend you use JConsole or some other management console to monitor your application(s). In my experience it is usually only important that a human (not a program) interpret the stats that are provided by the various MBeans whose references are obtainable through the ManagementFactory static methods. After all, if a program had access to, say, the amount of CPU used by some other process, what conceivable use would it have with that information (other than to provide it in some human-readable format)?
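If you do end up going the javax.management.remote route despite that advice, the shape of the code is roughly as follows. This is a minimal sketch of my own: it assumes the target JVM was started with remote JMX enabled (e.g. -Dcom.sun.management.jmxremote.port=9999 and authentication/SSL disabled), and the host and port are made up:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RemoteThreadDump {
    public static void main(String[] args) throws Exception {
        // Assumed host/port of a JVM started with remote JMX enabled.
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url, null)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Proxy the remote JVM's ThreadMXBean instead of the local one.
            ThreadMXBean remoteThreads = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            for (long id : remoteThreads.getAllThreadIds()) {
                System.out.println(id + ": " + remoteThreads.getThreadCpuTime(id) + " ns CPU");
            }
        }
    }
}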

Is it possible to use IMAP + paging?

I have a requirement to build an IMAP client as a web application.
I implemented sorting as follows:
// userFolder is an Object of IMAPFolder
Message[] messages = userFolder.getMessages();
Arrays.sort(messages, new Comparator<Message>() {
    public int compare(Message message1, Message message2) {
        int returnValue = 0;
        try {
            if (sortCriteria == SORT_SENT_DATE) {
                returnValue = message1.getSentDate().compareTo(message2.getSentDate());
            }
        } catch (Exception e) {
            System.out.println(e.getMessage());
            e.printStackTrace();
        }
        if (sortType == SORT_TYPE_DESCENDING) {
            returnValue = -returnValue;
        }
        return returnValue;
    }
});
The code snippet is not complete; it's just a brief extract.
SORT_SENT_DATE and SORT_TYPE_DESCENDING are my own constants.
This solution works fine, but the logic fails for paging.
Being a web-based application, I can't expect the server to load all messages for every user and sort them
(we do have situations with more than 1000 simultaneous users, with mailboxes holding more than 1000 messages each).
It also does not make sense for the web server to load all messages, sort them, return just a small part (say 1-20),
and on the next request load and sort them all again to return 21-40. Caching is possible, but what's the guarantee the user will actually make the next request?
I heard there is a class called FetchProfile; can that help me here? (I guess it would still load all messages, but only the information that's required.)
Is there any other way to achieve this?
I need a solution that also works for the search operation (searching with paging);
I have built an architecture to create a SearchTerm, but here too I would require paging.
For reference, I have asked this same question at:
http://www.coderanch.com/t/461408/Other-JSE-JEE-APIs/java/it-possible-use-IMAP-paging
You would need a server with the SORT extension and even that may not be enough. Then you issue SORT on the specific mailbox and FETCH only those message numbers that fall into your view.
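For reference, the raw exchange with a SORT-capable server (RFC 5256) looks roughly like this, sorting by sent date with newest first; the tag and the returned message numbers are invented for illustration, and you would then FETCH only the numbers that fall into the requested page:

C: a1 SORT (REVERSE DATE) UTF-8 ALL
S: * SORT 142 87 96 3
S: a1 OK SORT completed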
Update based on comments:
For servers where the SORT extension is not available, the next best thing is to FETCH the header field representing the sort key for all items (e.g. FETCH 1:* BODY[HEADER.FIELDS (SUBJECT)] for subject, or FETCH 1:* BODY[HEADER.FIELDS (DATE)] for sent date), then sort based on that key. You will get a list of sorted message numbers this way, which should be equivalent to what the SORT command would return.
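In JavaMail terms, a FetchProfile (which the question mentions) is the usual way to prefetch just the envelope data, so the sort keys are available without downloading message bodies. A rough sketch, reusing userFolder and the sorting idea from the question; pageStart and pageSize are hypothetical paging parameters I introduce for illustration:

Message[] messages = userFolder.getMessages();
FetchProfile fp = new FetchProfile();
fp.add(FetchProfile.Item.ENVELOPE); // dates, subjects, addresses only
userFolder.fetch(messages, fp);

// Sort by sent date (descending), then slice out the requested page.
Arrays.sort(messages, new Comparator<Message>() {
    public int compare(Message m1, Message m2) {
        try {
            return m2.getSentDate().compareTo(m1.getSentDate());
        } catch (MessagingException e) {
            return 0;
        }
    }
});
int pageStart = 0, pageSize = 20; // hypothetical paging parameters
int to = Math.min(pageStart + pageSize, messages.length);
Message[] page = Arrays.copyOfRange(messages, pageStart, to);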
If server side cache is allowed then the best way is to keep cache of envelopes (in the IMAP ENVELOPE sense) and then update it using the techniques described in RFC 4549. It's easy to sort and page given this cache.
There are two IMAP APIs for Java: the official JavaMail API and Risoretto. Risoretto is more low-level and should allow you to implement anything described above; JavaMail may be able to do so as well, but I don't have much experience with it.

Query Windows Search from Java

I would like to query the Windows Vista Search service directly (or indirectly) from Java.
I know it is possible to query using the search-ms: protocol, but I would like to consume the result within the app.
I have found good information in the Windows Search API but none related to Java.
I would mark as accepted the answer that provides useful and definitive information on how to achieve this.
Thanks in advance.
EDIT
Does anyone have a JACOB sample, before I can mark this as accepted?
:)
You may want to look at one of the Java-COM integration technologies. I have personally worked with JACOB (JAva COm Bridge):
http://danadler.com/jacob/
Which was rather cumbersome (think working exclusively with reflection), but got the job done for me (quick proof of concept, accessing MapPoint from within Java).
The only other such technology I'm aware of is Jawin, but I don't have any personal experience with it:
http://jawinproject.sourceforge.net/
Update 04/26/2009:
Just for the heck of it, I did more research into Microsoft Windows Search, and found an easy way to integrate with it using OLE DB. Here's some code I wrote as a proof of concept:
public static void main(String[] args) {
    DispatchPtr connection = null;
    DispatchPtr results = null;
    try {
        Ole32.CoInitialize();
        connection = new DispatchPtr("ADODB.Connection");
        connection.invoke("Open",
                "Provider=Search.CollatorDSO;" +
                "Extended Properties='Application=Windows';");
        results = (DispatchPtr) connection.invoke("Execute",
                "select System.Title, System.Comment, System.ItemName, System.ItemUrl, System.FileExtension, System.ItemDate, System.MimeType " +
                "from SystemIndex " +
                "where contains('Foo')");
        int count = 0;
        while (!((Boolean) results.get("EOF")).booleanValue()) {
            ++count;
            DispatchPtr fields = (DispatchPtr) results.get("Fields");
            int numFields = ((Integer) fields.get("Count")).intValue();
            for (int i = 0; i < numFields; ++i) {
                DispatchPtr item =
                        (DispatchPtr) fields.get("Item", new Integer(i));
                System.out.println(
                        item.get("Name") + ": " + item.get("Value"));
            }
            System.out.println();
            results.invoke("MoveNext");
        }
        System.out.println("\nCount:" + count);
    } catch (COMException e) {
        e.printStackTrace();
    } finally {
        try {
            results.invoke("Close");
        } catch (COMException e) {
            e.printStackTrace();
        }
        try {
            connection.invoke("Close");
        } catch (COMException e) {
            e.printStackTrace();
        }
        try {
            Ole32.CoUninitialize();
        } catch (COMException e) {
            e.printStackTrace();
        }
    }
}
To compile this, you'll need to make sure that the Jawin JAR is on your classpath and that jawin.dll is on your path (or in the java.library.path system property). This code simply opens an ADO connection to the local Windows Desktop Search index, queries for documents with the keyword "Foo", and prints out a few key properties of the resulting documents.
Let me know if you have any questions, or need me to clarify anything.
Update 04/27/2009:
I tried implementing the same thing in JACOB as well, and will be doing some benchmarks to compare performance differences between the two. I may be doing something wrong in JACOB, but it seems to consistently be using 10x more memory. I'll be working on a jcom and com4j implementation as well, if I have some time, and try to figure out some quirks that I believe are due to the lack of thread safety somewhere. I may even try a JNI based solution. I expect to be done with everything in 6-8 weeks.
Update 04/28/2009:
This is just an update for those who've been following and are curious. It turns out there are no threading issues; I just needed to explicitly close my database resources, since the OLE DB connections are presumably pooled at the OS level (I probably should have closed the connections anyway...). I don't think there will be any further updates to this. Let me know if anyone runs into any problems with this.
Update 05/01/2009:
Added a JACOB example per Oscar's request. This goes through the exact same sequence of calls from a COM perspective, just using JACOB. While it's true JACOB has been much more actively worked on in recent times, I also notice that it's quite a memory hog (it uses 10x as much memory as the Jawin version).
public static void main(String[] args) {
    Dispatch connection = null;
    Dispatch results = null;
    try {
        connection = new Dispatch("ADODB.Connection");
        Dispatch.call(connection, "Open",
                "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';");
        results = Dispatch.call(connection, "Execute",
                "select System.Title, System.Comment, System.ItemName, System.ItemUrl, System.FileExtension, System.ItemDate, System.MimeType " +
                "from SystemIndex " +
                "where contains('Foo')").toDispatch();
        int count = 0;
        while (!Dispatch.get(results, "EOF").getBoolean()) {
            ++count;
            Dispatch fields = Dispatch.get(results, "Fields").toDispatch();
            int numFields = Dispatch.get(fields, "Count").getInt();
            for (int i = 0; i < numFields; ++i) {
                Dispatch item =
                        Dispatch.call(fields, "Item", new Integer(i)).toDispatch();
                System.out.println(
                        Dispatch.get(item, "Name") + ": " +
                        Dispatch.get(item, "Value"));
            }
            System.out.println();
            Dispatch.call(results, "MoveNext");
        }
    } finally {
        try {
            Dispatch.call(results, "Close");
        } catch (JacobException e) {
            e.printStackTrace();
        }
        try {
            Dispatch.call(connection, "Close");
        } catch (JacobException e) {
            e.printStackTrace();
        }
    }
}
As a few posts here suggest, you can bridge between Java and .NET or COM using commercial or free frameworks like JACOB, JNBridge, J-Integra, etc.
Actually, I have had experience with one of these third parties (an expensive one :-) ) and I must say I will do my best to avoid repeating this mistake in the future. The reason is that it involves a lot of "voodoo" you can't really debug; it's very complicated to understand what the problem is when things go wrong.
The solution I would suggest is to create a simple .NET application that makes the actual calls to the Windows Search API. After doing so, you need to establish a communication channel between this component and your Java code. This can be done in various ways, for example by messaging through a small DB that your application periodically polls, or by registering this component with the machine's IIS (if it exists) and exposing a simple web service API to communicate with it.
I know it may sound cumbersome, but the clear advantages are: a) you communicate with the Windows Search API using the language it understands (.NET or COM), and b) you control all the application paths.
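For illustration only, the Java side of the web-service option could be as simple as an HTTP call to the helper service. The endpoint URL, port, and response format below are entirely made up; they would be defined by whatever .NET helper you write:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SearchClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint exposed by the .NET helper that wraps the Windows Search API.
        String query = URLEncoder.encode("Foo", "UTF-8");
        URL url = new URL("http://localhost:8081/search?q=" + query);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // e.g. one result per line, format defined by the helper
            }
        }
    }
}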
Any reason why you couldn't just use Runtime.exec() to query via search-ms and read the BufferedReader with the result of the command? For example:
public class ExecTest {
    public static void main(String[] args) throws IOException {
        Process result = Runtime.getRuntime().exec("search-ms:query=microsoft&");
        BufferedReader output = new BufferedReader(new InputStreamReader(result.getInputStream()));
        StringBuffer outputSB = new StringBuffer(40000);
        String s = null;
        while ((s = output.readLine()) != null) {
            outputSB.append(s + "\n");
            System.out.println(s);
        }
        String resultText = outputSB.toString();
    }
}
There are several libraries out there for calling COM objects from Java; some are open source (but their learning curve is higher), and some are closed source with a quicker learning curve. A closed-source example is EZCom. The commercial ones tend to focus on calling Java from Windows as well, something I've never seen in open source.
In your case, what I would suggest is fronting the call in your own .NET class (I guess using C#, as it is closest to Java without getting into the controversial J#), and focusing on the interoperability with the .NET DLL. That way the Windows programming gets easier, and the interface between Windows and Java is simpler.
If you are looking for how to use a Java COM library, MSDN is the wrong place. But MSDN will help you write what you need from within .NET; then look at the COM library tutorials for how to invoke the one or two methods you need on your .NET objects.
EDIT:
Given the discussion in the answers about using a web service, you could (and probably will have better luck) build a small .NET app that calls an embedded Java web server, rather than trying to give .NET the embedded web service and make Java the consumer of the call. For an embedded web server, my research showed Winstone to be good: not the smallest, but much more flexible.
The way to get that to work is to launch the .NET app from Java, and have the .NET app call the web service on a timer or in a loop to see if there is a request; if there is, it processes it and sends the response.
