I am developing an app which requires HTML parsing. I'm currently using jsoup in an AsyncTaskLoader like this (example):
@Override
public Boolean loadInBackground() {
    try {
        Connection.Response response = Jsoup.connect(getContext().getString(R.string.url_login))
                .data("id", account_id, "password", account_password)
                .timeout(5000)
                .method(Connection.Method.POST)
                .execute();
        String cookie = response.cookie("JSESSIONID");
        Document document = Jsoup.connect(getContext().getString(R.string.url_schedule))
                .cookie("JSESSIONID", cookie)
                .get();
        Element table = document.select("table").first();
        if (table != null) {
            databaseHandler.openDatabase();
            databaseHandler.getDatabase().beginTransaction();
            try {
                for (Element row : table.select("tr")) {
                    Elements columns = row.select("td");
                    addItem(columns, DatabaseHandler.getTableName());
                }
                databaseHandler.getDatabase().setTransactionSuccessful();
            } finally {
                databaseHandler.getDatabase().endTransaction();
            }
            databaseHandler.closeDatabase();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
This scrapes just one page, and there are a few of them. I've noticed that it is not very fast, and I've been told I should consider multithreading and parse all these pages in separate threads at the same time to speed it up. Now I have a few questions:
Should I still use AsyncTaskLoader or AsyncTask, or is there something else (better) for this? I want to know the best practice here.
Can anyone point me to tutorials or examples of how to do multithreading on Android?
Thanks ;)
AsyncTask should work well for this. However, per the docs for AsyncTask, its default model is still "one task at a time" unless you use the executeOnExecutor method.
From http://developer.android.com/reference/android/os/AsyncTask.html :
When first introduced, AsyncTasks were executed serially on a single background thread. Starting with DONUT, this was changed to a pool of threads allowing multiple tasks to operate in parallel. Starting with HONEYCOMB, tasks are executed on a single thread to avoid common application errors caused by parallel execution.
If you truly want parallel execution, you can invoke executeOnExecutor(java.util.concurrent.Executor, Object[]) with THREAD_POOL_EXECUTOR.
You didn't say how many pages there were to parse, so just make sure you limit the number of simultaneous tasks to a reasonable number.
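A rough sketch of what that could look like, assuming a hypothetical PageScrapeTask that wraps the jsoup logic from your question (the pool size of 4 is an arbitrary cap):

// Hypothetical task wrapping the jsoup fetch/parse for one page.
private static class PageScrapeTask extends AsyncTask<String, Void, Document> {
    @Override
    protected Document doInBackground(String... urls) {
        try {
            return Jsoup.connect(urls[0]).timeout(5000).get();
        } catch (IOException e) {
            return null;
        }
    }

    @Override
    protected void onPostExecute(Document document) {
        // walk the table and store rows, as in loadInBackground() above
    }
}

// Run the pages in parallel on a bounded pool instead of the serial default.
ExecutorService pool = Executors.newFixedThreadPool(4);
for (String url : pageUrls) {
    new PageScrapeTask().executeOnExecutor(pool, url);
}

The fixed pool is what keeps the number of simultaneous requests reasonable.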
I'm looking for some help since I don't know how to optimize a process.
I have to invoke a service that returns a list with more than 500K elements (I don't know why; these services belong to the client). For each element of the list, I have to invoke 2 more services and then save some attributes in our database. This last step is not the problem, but the entire process takes between 1 and 2 seconds per element, so at that rate it will take more than 100 hours to complete.
My approach is the following: in my main method I get the large list, then I use a parallelStream to iterate over the elements and a CompletableFuture to call the method that invokes the 2 services mentioned above. I've tried changing the parallelStream to stream and for-each, tried splitting the main list into smaller lists, and many other things, but I don't see better performance. I think the problem is the invocation of those 2 services, but I want to try my luck asking here.
I'm using Java 11 and Spring; for the invocation of the services I'm using RestTemplate. This is my code:
public void updateDiscount() {
    // List with 500k elements
    var relationshipList = relationshipService.getLargeList();
    // CompletableFuture to make the async calls to the method above
    relationshipList.parallelStream().forEach(level1 -> {
        CompletableFuture.runAsync(() -> relationshipService.asyncDiscountSave(level1));
    });
}
// Second class
@Async("nameOfThePool")
public void asyncDiscountSave(ElementOfList element) {
    // Logic to create request
    // .........
    var responseClients = anotherClass.getClients(element.getGroup1());   // first response via RestTemplate
    var responseProducts = anotherClass.getProducts(element.getGroup2()); // second response via RestTemplate
    for (var client : responseClients) {
        for (var product : responseProducts) {
            // Here we just save some attributes of these objects in our DB
        }
    }
}
Thanks for the help.
UPDATE:
For this particular case, the only improvement I can make is to pass a thread pool to the CompletableFuture; the real problem is the response time of the services I need to invoke.
I decided to follow a second approach, and it took about 5 hours to complete; compared with the first approach, this is acceptable.
As you haven't defined an executor, you are using the default pool. Adding an executor allows you to create as many threads as you need and as the server resources can manage:
public void updateDiscount() {
    Executor executor = Executors.newFixedThreadPool(100); // size according to server resources
    // List with 500k elements
    var relationshipList = relationshipService.getLargeList();
    // CompletableFuture to make the async calls to the method above
    relationshipList.parallelStream().forEach(level1 -> {
        CompletableFuture.runAsync(() -> relationshipService.asyncDiscountSave(level1), executor);
    });
}
I'm trying to deal with some code that runs differently in Spark stand-alone mode and when Spark runs on a cluster. Basically, for each item in an RDD, I'm trying to add it to a list, and once this is done, I want to send this list to Solr.
This works perfectly fine when I run the following code in stand-alone mode, but does not work when the same code runs on a cluster. On a cluster, it is as if the "send to Solr" part of the code is executed before the list to be sent to Solr is filled with items. I tried to force execution with solrInputDocumentJavaRDD.collect(); after the foreach, but it seems to have no effect.
// For each RDD
solrInputDocumentJavaDStream.foreachRDD(
        new Function<JavaRDD<SolrInputDocument>, Void>() {
            @Override
            public Void call(JavaRDD<SolrInputDocument> solrInputDocumentJavaRDD) throws Exception {
                // For each item in a single RDD
                solrInputDocumentJavaRDD.foreach(
                        new VoidFunction<SolrInputDocument>() {
                            @Override
                            public void call(SolrInputDocument solrInputDocument) {
                                // Add the solrInputDocument to the list of SolrInputDocuments
                                SolrIndexerDriver.solrInputDocumentList.add(solrInputDocument);
                            }
                        });
                // Try to force execution
                solrInputDocumentJavaRDD.collect();
                // After having finished adding every SolrInputDocument to the list,
                // add it to the solrServer and commit, waiting for the commit to be flushed
                try {
                    if (SolrIndexerDriver.solrInputDocumentList != null
                            && SolrIndexerDriver.solrInputDocumentList.size() > 0) {
                        SolrIndexerDriver.solrServer.add(SolrIndexerDriver.solrInputDocumentList);
                        SolrIndexerDriver.solrServer.commit(true, true);
                        SolrIndexerDriver.solrInputDocumentList.clear();
                    }
                } catch (SolrServerException | IOException e) {
                    e.printStackTrace();
                }
                return null;
            }
        }
);
What should I do so that the sending-to-Solr part executes after the list of SolrInputDocuments has been filled (and also works in cluster mode)?
As I mentioned on the Spark Mailing list:
I'm not familiar with the Solr API but provided that 'SolrIndexerDriver' is a singleton, I guess that what's going on when running on a cluster is that the call to:
SolrIndexerDriver.solrInputDocumentList.add(elem)
is happening on different singleton instances of the SolrIndexerDriver on different JVMs while
SolrIndexerDriver.solrServer.commit
is happening on the driver.
In practical terms, the lists on the executors are being filled in but never committed, while on the driver the opposite is happening.
The recommended way to handle this is to use foreachPartition like this:
rdd.foreachPartition { iter =>
    // prepare connection
    Stuff.connect(...)
    // add elements
    iter.foreach(elem => Stuff.add(elem))
    // submit
    Stuff.commit()
}
This way you can add the data of each partition and commit the results in the local context of each executor. Be aware that this add/commit must be thread safe in order to avoid data loss or corruption.
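In the Java API used in the question, a rough equivalent would be the following sketch (assuming SolrIndexerDriver.solrServer is safe to use from the executors):

solrInputDocumentJavaRDD.foreachPartition(
        new VoidFunction<Iterator<SolrInputDocument>>() {
            @Override
            public void call(Iterator<SolrInputDocument> docs) throws Exception {
                // Collect this partition's documents locally on the executor...
                List<SolrInputDocument> batch = new ArrayList<>();
                while (docs.hasNext()) {
                    batch.add(docs.next());
                }
                // ...and add/commit them from the same executor context.
                if (!batch.isEmpty()) {
                    SolrIndexerDriver.solrServer.add(batch);
                    SolrIndexerDriver.solrServer.commit(true, true);
                }
            }
        });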
Have you checked the Spark UI to see the execution plan of this job?
Check how it is split into stages and their dependencies. That should hopefully give you an idea.
I am not sure if I can put my question in the clearest fashion, but I will try my best.
Let's say I am retrieving some information from a third-party API. The retrieved information will be huge. To gain performance, instead of retrieving all the info in one go, I will retrieve it in a paged fashion (the API gives me that facility; basically an iterator). The return type is basically a list of objects.
My aim here is to process the information I have in hand (that includes comparing, storing in the DB, and many other operations) while I receive the paged responses to the request.
My question to the expert community is: what data structure do you prefer in such a case? Also, does a framework like Spring Batch help with performance gains in such cases?
I know the question is a bit vague, but I am looking for general ideas, tips, and pointers.
In these cases, the data structure for me is java.util.concurrent.CompletionService.
For purposes of example, I'm going to assume a couple of additional constraints:
You want only one outstanding request to the remote server at a time
You want to process the results in order.
Here goes:
// a class that knows how to update the DB given a page of results
class DatabaseUpdater implements Callable<Object> { ... }

// a background thread to do the work
final CompletionService<Object> exec =
        new ExecutorCompletionService<>(Executors.newSingleThreadExecutor());

// first call
List<Object> results = ThirdPartyAPI.getPage( ... );
// Start loading those results to DB on background thread
exec.submit(new DatabaseUpdater(results));

while ( /* there are more pages */ ) {
    // Another call to remote service
    results = ThirdPartyAPI.getPage( ... );
    // wait for existing work to complete
    exec.take();
    // send more work to background thread
    exec.submit(new DatabaseUpdater(results));
}
// wait for the last task to complete
exec.take();
This is just a simple two-thread design. The first thread is responsible for getting data from the remote service and the second is responsible for writing to the database.
Any exceptions thrown by DatabaseUpdater will be propagated to the main thread when the result is retrieved from the Future that exec.take() returns.
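For example, a minimal way to surface such a failure:

// take() returns the completed Future; get() rethrows anything the
// DatabaseUpdater threw, wrapped in an ExecutionException.
try {
    exec.take().get();
} catch (ExecutionException e) {
    throw new RuntimeException("DB update failed", e.getCause());
}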
Good luck.
In terms of doing the actual parallelism, one very useful construct in Java is the ThreadPoolExecutor. A rough sketch of what that might look like is this:
public class YourApp {
    static class Processor implements Runnable {
        Widget toProcess;

        public Processor(Widget toProcess) {
            this.toProcess = toProcess;
        }

        public void run() {
            // commit the Widget to the DB, etc
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor executor =
                new ThreadPoolExecutor(1, 10, 30,
                        TimeUnit.SECONDS,
                        new LinkedBlockingDeque<Runnable>());

        while (thereAreStillWidgets()) {
            ArrayList<Widget> widgets = doExpensiveDatabaseCall();
            for (Widget widget : widgets) {
                Processor processor = new Processor(widget);
                executor.execute(processor);
            }
        }
    }
}
But as I said in a comment: calls to an external API are expensive. It's very likely that the best strategy is to pull all the Widget objects down from the API in one call, and then process them in parallel once you've got them. Doing more API calls gives you the overhead of sending the data all the way from the server to you, every time -- it's probably best to pay that cost the fewest number of times that you can.
Also, keep in mind that if you're doing DB operations, it's possible that your DB doesn't allow for parallel writes, so you might get a slowdown there.
I am writing a parser for a website that has many pages (I call them IndexPages). Each page has a lot of links (about 300 to 400 links per IndexPage). I use Java's ExecutorService to invoke 12 Callables concurrently for one IndexPage. Each Callable just fires an HTTP request to one link and does some parsing and DB-storing actions. When the first IndexPage is finished, the program progresses to the second IndexPage, until no next IndexPage is found.
When running, it seems OK; I can observe the threads working and being scheduled well. Each link's parsing/storing takes about 1 to 2 seconds.
But as time goes by, I observe that each Callable (parsing/storing) takes longer and longer. Take this picture for example: sometimes it takes 10 or more seconds to finish a Callable (the green bar is RUNNING, the purple bar is WAITING). And my PC bogs down; everything becomes sluggish.
This is my main algorithm :
ExecutorService executorService = Executors.newFixedThreadPool(12);
String indexUrl = // Set initial (1st page) IndexPage
while (true)
{
    String nextPage = // parse next page in the indexUrl
    Set<Callable<Void>> callables = new HashSet<>();
    for (String url : getUrls(indexUrl))
    {
        Callable<Void> callable = new ParserCallable(url, … and some DAOs);
        callables.add(callable);
    }
    try {
        executorService.invokeAll(callables);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    if (nextPage == null)
        break;
    indexUrl = nextPage;
} // true
executorService.shutdown();
The algorithm is simple and self-explanatory. I wonder what may cause this situation. Is there any way to prevent such performance degradation?
The CPU/Memory/Heap shows reasonable usage.
Environments, FYI.
==================== updated ====================
I've changed my implementation from ExecutorService to ForkJoinPool:
ForkJoinPool pool = new ForkJoinPool(12);
String indexUrl = // Set initial (1st page) IndexPage
while (true)
{
    Set<Callable<Void>> callables = new HashSet<>();
    for (String url : getUrls(indexUrl))
    {
        Callable<Void> callable = new ParserCallable(url, DAOs...);
        callables.add(callable);
    }
    pool.invokeAll(callables);
    String nextPage = // parse next page in this indexUrl
    if (nextPage == null)
        break;
    indexUrl = nextPage;
} // true
It takes longer than the ExecutorService solution. ExecutorService takes about 2 hours to finish all pages, while ForkJoinPool takes 3 hours, and each Callable still takes longer and longer to complete (from 1 second to 5, 6, or even 10 seconds). I don't mind that it takes longer; I just hope it takes constant time (not longer and longer) to finish a job.
I wondered whether creating a lot of (non-thread-safe) GregorianCalendar, Date and SimpleDateFormat objects in the parser could cause a threading issue. But I don't reuse these objects or pass them among threads. So I still cannot find the reason.
Based on the heap, you have a memory issue. ExecutorService.invokeAll collects all of the results of the Callable instances into a List and returns that List when they all complete. You may want to consider simply calling ExecutorService.submit, since you don't seem to care about the results of each Callable.
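A sketch of that change, reusing the ParserCallable from the question (note that, unlike invokeAll, submit does not block until the page's tasks finish, so the page-by-page pacing would need something like a CountDownLatch):

for (String url : getUrls(indexUrl)) {
    // No result List is accumulated; each returned Future is simply discarded.
    executorService.submit(new ParserCallable(url /*, DAOs as before */));
}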
I can't see why there is a need for a Callable to parse your index pages, since your caller does not expect any result from ParserCallable. I can see you would need a bit of exception handling, but that can still be managed with a Runnable.
When you submit a Callable, a Future is returned, and it is never used here.
You should be able to improve the implementation by using Runnable, which avoids this extra bookkeeping:
ExecutorService executor = Executors.newFixedThreadPool(12);
for (String url : getUrls(indexUrl)) {
    Runnable worker = new ParserRunnable(url, … and some DAOs);
    executor.execute(worker);
}

class ParserRunnable implements Runnable {
    // fetch, parse and store one URL in run()
}
As I understand it, if you have 40 pages, each with ~300 URLs, you will create ~12,000 Callables. While that is probably not too many Callables, it is a lot of HTTP connections and database connections.
I think you should try using one Callable per page. You'll still gain a ton by running them in parallel. I don't know what you are using for the HTTP requests, but you might be able to reuse system resources there instead of opening and closing 12,000 of them.
And especially for the DB: you'll have just 40 connections. You might even be able to be super efficient by collecting the ~300 records locally and then using a batch update.
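A minimal sketch of that batch idea with plain JDBC (the records table and ParsedRecord type are hypothetical placeholders for whatever your DAOs store):

// Write one page's ~300 parsed rows in a single statement batch.
try (PreparedStatement ps = connection.prepareStatement(
        "INSERT INTO records (url, title) VALUES (?, ?)")) {
    for (ParsedRecord r : pageRecords) {
        ps.setString(1, r.getUrl());
        ps.setString(2, r.getTitle());
        ps.addBatch();
    }
    ps.executeBatch(); // one round trip instead of ~300
}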
So I'm working on web scraping for a certain website. The problem is:
Given a set of URLs (on the order of 100s to 1000s), I would like to retrieve the HTML of each URL in an efficient manner, especially time-wise. I need to be able to do 1000s of requests every 5 minutes.
This usually implies using a pool of threads to do requests from a set of not-yet-requested URLs. But before jumping into implementing this, I believe it's worth asking here, since this is a fairly common problem when doing web scraping or web crawling.
Is there any library that has what I need?
So I'm working on web scraping for a certain website.
Are you scraping a single server, or are you scraping from multiple other hosts? If it is the former, the server you are scraping may not like too many concurrent connections from a single IP.
If it is the latter, this is really a general question of how many outbound connections you should open from one machine. There is a physical limit, but it is pretty large. Practically, it depends on where that client is deployed. The better the connectivity, the more connections it can accommodate.
You might want to look at the source code of a good download manager to see if they have a limit on the number of outbound connections.
Definitely use asynchronous I/O, but you would still do well to limit the number.
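One simple way to enforce such a limit is a Semaphore around each request (a sketch; the cap of 20 is arbitrary and fetch is a placeholder for the actual HTTP call):

final Semaphore permits = new Semaphore(20); // cap on in-flight requests

void fetchWithLimit(String url) throws InterruptedException {
    permits.acquire();
    try {
        fetch(url); // placeholder for the actual HTTP request
    } finally {
        permits.release();
    }
}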
Your bandwidth utilization will be the sum of all of the HTML documents that you retrieve (plus a little overhead) no matter how you slice it (though some web servers may support compressed HTTP streams, so certainly use a client capable of accepting them).
The optimal number of concurrent threads depends a great deal on your network connectivity to the sites in question. Only experimentation can find an optimal number. You can certainly use one set of threads for retrieving HTML documents and a separate set of threads to process them to make it easier to find the right balance.
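A rough sketch of that split, handing pages from fetcher threads to processor threads through a bounded queue (fetchHtml and process are placeholders; the pool sizes are starting points for experimentation):

BlockingQueue<String> pages = new LinkedBlockingQueue<>(100); // bounded hand-off
ExecutorService fetchers = Executors.newFixedThreadPool(8);   // retrieval threads
ExecutorService processors = Executors.newFixedThreadPool(2); // processing threads

for (String url : urls) {
    fetchers.execute(() -> {
        try {
            pages.put(fetchHtml(url)); // placeholder for the HTTP retrieval
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    });
}
for (int i = 0; i < 2; i++) {
    processors.execute(() -> {
        try {
            while (true) {
                process(pages.take()); // placeholder for parsing/storage
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    });
}

The bounded queue also acts as back-pressure: fetchers block when the processors fall behind, so you can tune the two pool sizes independently.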
I'm a big fan of HTML Agility Pack for web scraping in the .NET world, but I cannot make a specific recommendation for Java. The following question may be of use in finding a good, Java-based scraping platform:
Web scraping with Java
I would start by researching asynchronous communication. Then take a look at Netty.
Keep in mind there is always a limit to how fast one can load a web page. For an average home connection, it will be around a second. Take this into consideration when programming your application.
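Netty is one option; as a lighter illustration of the same asynchronous idea, Java 11's built-in HttpClient can issue requests without one blocked thread per page (a sketch, not tied to the question's code):

HttpClient client = HttpClient.newHttpClient();
List<CompletableFuture<String>> pending = new ArrayList<>();
for (String url : urls) {
    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
    // sendAsync returns immediately; the body arrives on a background thread
    pending.add(client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                      .thenApply(HttpResponse::body));
}
pending.forEach(CompletableFuture::join); // wait for all pages to arrive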
http://www.jsoup.org just for the scraping part! The thread pooling, I think, you should implement yourself.
Update
If this approach fits your needs, you can download the complete class files here:
http://codetoearn.blogspot.com/2013/01/concurrent-web-requests-with-thread.html
AsyncWebReader webReader = new AsyncWebReader(5 /* number of threads */, new String[]{
        "http://www.google.com",
        "http://www.yahoo.com",
        "http://www.live.com",
        "http://www.wikipedia.com",
        "http://www.facebook.com",
        "http://www.khorasannews.com",
        "http://www.fcbarcelona.com",
        "http://www.khorasannews.com",
});
webReader.addObserver(new Observer() {
    @Override
    public void update(Observable o, Object arg) {
        if (arg instanceof Exception) {
            Exception ex = (Exception) arg;
            System.out.println(ex.getMessage());
        } /* else if (arg instanceof List) {
            List vals = (List) arg;
            System.out.println(vals.get(0) + ": " + vals.get(1));
        } */ else if (arg instanceof Object[]) {
            Object[] objects = (Object[]) arg;
            HashMap result = (HashMap) objects[0];
            String[] success = (String[]) objects[1];
            String[] fail = (String[]) objects[2];
            System.out.println("Failed");
            for (int i = 0; i < fail.length; i++) {
                String string = fail[i];
                System.out.println(string);
            }
            System.out.println("-----------");
            System.out.println("Success");
            for (int i = 0; i < success.length; i++) {
                String string = success[i];
                System.out.println(string);
            }
            System.out.println("\n\nResult for Google: ");
            System.out.println(result.remove("http://www.google.com"));
        }
    }
});
Thread t = new Thread(webReader);
t.start();
t.join();