How to tune HttpClient performance when crawling a large number of small files? - java

I just want to crawl some Hacker News stories. My code:
import org.apache.http.client.fluent.Request;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.logging.Logger;
import java.util.stream.IntStream;

public class HackCrawler {
    private static String getUrlResponse(String url) throws IOException {
        return Request.Get(url).execute().returnContent().asString();
    }

    private static String crawlItem(int id) {
        try {
            String json = getUrlResponse(String.format("https://hacker-news.firebaseio.com/v0/item/%d.json", id));
            if (json.contains("\"type\":\"story\"")) {
                return json;
            }
        } catch (IOException e) {
            System.out.println("crawl " + id + " failed");
        }
        return "";
    }

    public static void main(String[] args) throws FileNotFoundException {
        Logger logger = Logger.getLogger("main");
        PrintWriter printWriter = new PrintWriter("hack.json");
        for (int i = 0; i < 10000; i++) {
            logger.info("batch " + i);
            IntStream.range(12530671 - (i + 1) * 100, 12530671 - i * 100)
                    .parallel()
                    .mapToObj(HackCrawler::crawlItem)
                    .filter(x -> !x.equals(""))
                    .forEach(printWriter::println);
        }
    }
}
Currently it takes about 3 seconds to crawl 100 items (1 batch).
I found that using multithreading via parallel() gives a speed-up (about 5 times), but I have no idea how to optimise it further.
Could anyone give some suggestions on that?

To achieve what Fayaz means, I would use the Jetty HTTP client's asynchronous features (https://webtide.com/the-new-jetty-9-http-client/).
httpClient.newRequest("http://domain.com/path")
        .send(new Response.CompleteListener()
        {
            @Override
            public void onComplete(Result result)
            {
                // Your logic here
            }
        });
This client internally uses Java NIO, so a small number of selector threads can listen for incoming responses across many connections. It then dispatches content to worker threads which are not involved in any blocking I/O operation.
You can try playing with the maximum number of connections per destination (a destination is basically a host):
http://download.eclipse.org/jetty/9.3.11.v20160721/apidocs/org/eclipse/jetty/client/HttpClient.html#setMaxConnectionsPerDestination-int-
Since you are heavily loading a single server, this should be quite high.
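For example, a minimal sketch of that configuration (64 is just an illustrative value, not a recommendation; tune it against what the server tolerates):
import org.eclipse.jetty.client.HttpClient;

public class TunedClient {
    public static void main(String[] args) throws Exception {
        HttpClient httpClient = new HttpClient();
        // raise the per-destination connection limit before starting the client
        httpClient.setMaxConnectionsPerDestination(64);
        httpClient.start();
        // ... issue asynchronous requests here ...
        httpClient.stop();
    }
}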

The following steps should get you started.
Use a single thread to fetch responses from the site, since this is basically an I/O operation.
Put these responses into a queue (read about the various implementations of BlockingQueue).
Now you can have multiple threads pick up these responses and process them as you wish.
Basically, you will have a single producer thread that fetches the responses from the site and multiple consumers that process them.
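A minimal sketch of that pipeline, assuming a hypothetical fetch() for the HTTP call and process() for whatever you do with each JSON response:
import java.util.concurrent.*;

public class CrawlerPipeline {
    private static final BlockingQueue<String> responses = new LinkedBlockingQueue<>(1000);

    public static void main(String[] args) throws InterruptedException {
        // multiple consumers: pure CPU work, no blocking I/O
        ExecutorService consumers = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            consumers.execute(() -> {
                try {
                    while (true) process(responses.take());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // single producer thread: the I/O-bound fetching
        for (int id = 12530571; id < 12530671; id++) {
            responses.put(fetch(id));
        }
        consumers.shutdownNow(); // a real pipeline would drain the queue first
    }

    private static String fetch(int id) { /* HTTP GET goes here */ return ""; }

    private static void process(String json) { /* filter/store goes here */ }
}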

Search for Strings that match a regex in a very huge Set or List in Java

My file with all the words is about 60MB, and searching for a rhyme currently takes about a minute.
How can I speed up the search?
Should I split the text file into, for example, a.txt, b.txt, c.txt (each file holding the words starting with a/b/c...)?
I guess putting it in MySQL and querying with
SELECT *
FROM table
WHERE word LIKE '%rhyme'
will be slower.
I would be pleased if it searched for rhymes in about 10 seconds or less, because I want to use this API with an Angular CLI frontend to suggest rhymes while typing in a textarea.
Thanks for help!
package pl.kamilkoszykowski.dopewriter;

import org.springframework.web.bind.annotation.*;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;
import java.util.concurrent.*;

@RestController
@CrossOrigin("http://localhost:4200")
public class Controller {

    Set<String> dictionary = getDictionary(); // SET WITH WORDS

    @GetMapping("/rhyme/{word}")
    public Set<String> rhymes(@PathVariable String word) throws InterruptedException {
        String regex = "\\b[A-Za-z]*" + word + "\\b"; // RHYME TO SEARCH FOR IN DICTIONARY SET
        // concurrent set: a plain HashSet is not safe for writes from the pool threads
        Set<String> rhymesList = ConcurrentHashMap.newKeySet();
        int numRunnables = 64;
        BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(numRunnables, true);
        RejectedExecutionHandler handler = new ThreadPoolExecutor.CallerRunsPolicy();
        ExecutorService executor = new ThreadPoolExecutor(numRunnables, numRunnables, 0L, TimeUnit.MILLISECONDS, queue, handler);
        for (String a : dictionary) {
            executor.execute(new Runnable() {
                @Override
                public void run() {
                    if (a.matches(regex)) {
                        rhymesList.add(a);
                    }
                }
            });
        }
        executor.shutdown();
        while (!executor.isTerminated()) {
            Thread.sleep(50);
        }
        return rhymesList;
    }

    public Set<String> getDictionary() { // READING TXT FILE
        try {
            List<String> list = new ArrayList<>(List.of(Files.readString(Paths.get("src/main/resources/dictionary.txt")).split(",")));
            return new HashSet<>(list);
        } catch (IOException e) {
            return null;
        }
    }
}
Iterating over the whole collection on every request is itself a time-consuming process. There are multiple ways to solve this:
Use Java streams. -- Faster to implement, and faster here than a custom thread pool executor.
Keep the data in a sorted set and use a binary-search style lookup to get results faster. Still won't be fast enough.
Store the data in a trie and collect all the words under the node that the query leads to. -- Recommended; see the sketch below.
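A minimal sketch of the trie idea: store each word reversed, so that a suffix query (a rhyme is an endsWith match) becomes a prefix walk. Class and method names are illustrative:
import java.util.*;

public class RhymeTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String word; // non-null if a complete (reversed) word ends here
    }

    private final Node root = new Node();

    public void add(String word) {
        Node n = root;
        for (int i = word.length() - 1; i >= 0; i--)
            n = n.children.computeIfAbsent(word.charAt(i), c -> new Node());
        n.word = word;
    }

    /** All words ending with the given suffix. */
    public Set<String> endingWith(String suffix) {
        Node n = root;
        for (int i = suffix.length() - 1; i >= 0; i--) {
            n = n.children.get(suffix.charAt(i));
            if (n == null) return Collections.emptySet();
        }
        Set<String> out = new HashSet<>();
        collect(n, out);
        return out;
    }

    private void collect(Node n, Set<String> out) {
        if (n.word != null) out.add(n.word);
        for (Node child : n.children.values()) collect(child, out);
    }

    public static void main(String[] args) {
        RhymeTrie trie = new RhymeTrie();
        for (String w : List.of("time", "lime", "dime", "tame")) trie.add(w);
        System.out.println(trie.endingWith("ime")); // time, lime, dime (in some order)
    }
}
Building the trie once at startup makes each query proportional to the suffix length plus the number of matches, instead of a scan over the whole dictionary.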
I changed the executor part of the code to this:
executor1.execute(new Runnable() {
    @Override
    public void run() {
        for (String a : dictionary) {
            if (a.endsWith(word)) {
                rhymesList.add(a);
            }
        }
    }
});
and it works. Sorry for the time I've taken from you.
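For reference, the parallel-streams variant suggested above amounts to a one-liner (assuming the same dictionary set and word parameter as in the controller):
import java.util.stream.Collectors;

Set<String> rhymesList = dictionary.parallelStream()
        .filter(a -> a.endsWith(word))
        .collect(Collectors.toSet());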

Design a data source independent application using batch data

We have a legacy application that reads data from Mongo for each user (the query result ranges from small to large depending on the user request); our app creates a file for each user and drops it to an FTP server / S3. We read the data through a Mongo cursor and write each batch to the file as soon as the batch arrives, so the file-writing performance is decent. This application works great but is bound to Mongo and the Mongo cursor.
Now we have to redesign the application to support different data sources, i.e. MongoDB, Postgres, Kinesis, S3, etc. We have thought of the following ideas so far:
Build data APIs for each source and expose a paginated REST response. This is a feasible solution, but it might be slow for large query results compared to the current cursor response.
Build a data abstraction layer by feeding batch data into Kafka and reading the batch data stream in our file generator. But most of the time users ask for sorted data, so we would need to read the messages in sequence; we would lose the benefit of Kafka's throughput and do a lot of extra work to combine these messages before writing them to the file.
We are looking for a solution that replaces the current Mongo cursor and makes our file generator independent of the data source.
So it sounds like you essentially want an API that maintains the efficiency of streaming as much as possible, just as you are doing now by writing the file while you are reading the user data.
In that case, you might want to define a push-parser API for your ReadSources which will stream data to your WriteTargets, which in turn will write the data to anything you have an implementation for. Sorting is handled on the ReadSource side, since for some sources you can read in an ordered manner (such as from databases); for sources where you can't, you might perform an intermediate step to sort your data (such as writing to a temporary table) and then stream it to the WriteTarget.
A basic implementation might look vaguely like this:
public class UserDataRecord {
    private String data1;
    private String data2;

    public String getRecordAsString() {
        return data1 + "," + data2;
    }
}

public interface WriteTarget<Record> {
    /** Write a record to the target */
    public void writeRecord(Record record);

    /** Finish writing to the target and save everything */
    public void commit();

    /** Undo whatever was written */
    public void rollback();
}

public abstract class ReadSource<Record> {
    protected final WriteTarget<Record> writeTarget;

    public ReadSource(WriteTarget<Record> writeTarget) { this.writeTarget = writeTarget; }

    public abstract void read();
}

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class RelationalDatabaseReadSource extends ReadSource<UserDataRecord> {
    private Connection dbConnection;

    public RelationalDatabaseReadSource(WriteTarget<UserDataRecord> writeTarget, Connection dbConnection) {
        super(writeTarget);
        this.dbConnection = dbConnection;
    }

    @Override public void read() {
        // read user data from DB and encapsulate it in a record
        try (Statement statement = dbConnection.createStatement();
             ResultSet resultSet = statement.executeQuery("Select * From TABLE Order By COLUMNS")) {
            while (resultSet.next()) {
                UserDataRecord record = new UserDataRecord();
                // stream the records to the write target
                writeTarget.writeRecord(record);
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class FileWriteTarget implements WriteTarget<UserDataRecord> {
    private File fileToWrite;
    private PrintWriter writer;

    public FileWriteTarget(File fileToWrite) throws IOException {
        this.fileToWrite = fileToWrite;
        this.writer = new PrintWriter(new FileWriter(fileToWrite));
    }

    @Override public void writeRecord(UserDataRecord record) {
        // write the record text itself; printing getBytes() would only print the array reference
        writer.println(record.getRecordAsString());
    }

    @Override public void commit() {
        // write trailing records
        writer.close();
    }

    @Override
    public void rollback() {
        try { writer.close(); } catch (Exception e) { }
        fileToWrite.delete();
    }
}
This is just the general idea and needs serious improvement.
Anyone please feel free to update this API.

Fire and forget for HTTP in Java

We're implementing our own analytics; for that we've exposed a web service which needs to be invoked and which will capture the data in our DB.
The problem is that, as this is analytics, we would be making a lot of calls (like one for every page load, a call after each JS or CSS load, etc.), so there will be very many such calls. I don't want the server to be loaded with lots of requests, or to be more precise, requests pending for a response, because the response we get back will hardly be of any use to us.
So is there any way to just fire the web service request and forget that I've fired it?
I understand that every HTTP request will have a response as well.
One thing that crossed my mind: what if we set the request timeout to zero seconds? But I'm not at all sure that this is the right way to do it.
Please provide me with more suggestions.
You might find the following AsyncRequestDemo.java useful:
import java.net.URI;
import java.net.URISyntaxException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.http.client.fluent.Async;
import org.apache.http.client.fluent.Content;
import org.apache.http.client.fluent.Request;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.concurrent.FutureCallback;

/**
 * Following libraries have been used:
 *
 * 1) httpcore-4.4.5.jar
 * 2) httpclient-4.5.2.jar
 * 3) commons-logging-1.2.jar
 * 4) fluent-hc-4.5.2.jar
 */
public class AsyncRequestDemo {
    public static void main(String[] args) throws Exception {
        URIBuilder urlBuilder = new URIBuilder()
                .setScheme("http")
                .setHost("stackoverflow.com")
                .setPath("/questions/38277471/fire-and-forget-for-http-in-java");
        final int nThreads = 3; // no. of threads in the pool
        final int timeout = 0; // connection timeout in milliseconds
        URI uri = null;
        try {
            uri = urlBuilder.build();
        } catch (URISyntaxException use) {
            use.printStackTrace();
        }
        ExecutorService executorService = Executors.newFixedThreadPool(nThreads);
        Async async = Async.newInstance().use(executorService);
        final Request request = Request.Get(uri).connectTimeout(timeout);
        Future<Content> future = async.execute(request, new FutureCallback<Content>() {
            public void failed(final Exception e) {
                System.out.println("Request failed: " + request);
                System.exit(1);
            }

            public void completed(final Content content) {
                System.out.println("Request completed: " + request);
                System.out.println(content.asString());
                System.exit(0);
            }

            public void cancelled() {
            }
        });
        System.out.println("Request submitted");
    }
}
I used this:
import java.io.InputStream;
import java.net.URL;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

URL url = new URL(YOUR_URL_PATH); // YOUR_URL_PATH is your endpoint as a String
ExecutorService executor = Executors.newFixedThreadPool(1);
Future<HttpResponse> response = executor.submit(new HttpRequest(url));
executor.shutdown();
with HttpRequest and HttpResponse defined as:
public class HttpRequest implements Callable<HttpResponse> {
    private URL url;

    public HttpRequest(URL url) {
        this.url = url;
    }

    @Override
    public HttpResponse call() throws Exception {
        return new HttpResponse(url.openStream());
    }
}

public class HttpResponse {
    private InputStream body;

    public HttpResponse(InputStream body) {
        this.body = body;
    }

    public InputStream getBody() {
        return body;
    }
}
That's it.
Yes, you could initiate the request and break the connection without waiting for a response... But you probably don't want to do that. The overhead of the server-side having to deal with ungracefully broken connections will far outweigh letting it proceed with returning a response.
A better approach to solving this kind of performance problem in a Java servlet would be to shove all the data from the requests into a queue, respond immediately, and have one or more worker threads pick items off the queue for processing (such as writing them into a database).
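A minimal sketch of that queue-and-worker idea; the String event type and the persist() body are placeholders for your own analytics payload and DB write:
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AnalyticsQueue {
    private final BlockingQueue<String> events = new LinkedBlockingQueue<>(10_000);

    public AnalyticsQueue() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    persist(events.take()); // blocks until an event arrives
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "analytics-worker");
        worker.setDaemon(true);
        worker.start();
    }

    /** Called from the request thread; returns immediately. */
    public void record(String event) {
        events.offer(event); // silently drops the event if the queue is full
    }

    private void persist(String event) {
        // placeholder: write the event to the database here
    }
}
The servlet can then respond (e.g. with 204 No Content) right after record() returns, so no request ever waits on the database.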

Paho client limit?

I'm running some performance tests of a service/client setup using the mosquitto broker and clients, and the Paho client. I got some strange results:
Deployment notes:
3 machines: producer, broker, consumer
Producers: 6 Python scripts using mosquitto_pub as fast as they can. See below.
Consumer: a simple Java client, shown below, subscribing to all topics.
The hardware specifics have not shown a significant difference.
1) Mosquitto gets around 1459.5055 messages/s but it sends only 973.9596666666667. The subscribers get just 485.5458333333333.
2) No matter how many instances of the Paho clients are created, the performance does not improve. E.g. if you run 6 producers on one topic and 2 consumers on two topics, you get 485.5458333333333. But if you add 6 producers to the other topic (I already checked that the total number of messages increases), the total performance stays the same and the per-topic rate is halved.
3) If you run precisely the same test with two separate Java applications, the performance does not drop. Each application gets the maximum performance.
In no case does the CPU or memory reach any limit.
Producers.py
from datetime import datetime
import os, sys, json

arg = sys.argv
host = "broker"
n = 1
if len(arg) > 1:
    n = int(arg[1])
while True:
    payload = {"id": str(n), "Time": datetime.now().strftime("%Y-%m-%dT%H:%M:%S.00Z"), "ResultValue": 1.0, "ResultType": "integer", "Datastream": {"id": str(n)}}
    os.system("mosquitto_pub -h " + host + " -t " + "/" + str(payload["id"]) + " -m " + str(json.dumps(json.dumps(payload))))
Consumer.java
package eu.linksmart.testing;

import org.eclipse.paho.client.mqttv3.*;
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

import java.util.UUID;

public class Application implements MqttCallback {

    public Application() {
        id++;
    }

    public static void main(String[] args) {
        try {
            Application app = new Application();
            create("1", new Application());
            create("2", new Application());
            while (true)
                try {
                    Thread.sleep(30000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
        } catch (MqttException e) {
            e.printStackTrace();
        }
    }

    static void create(String id, Application app) throws MqttException {
        MqttClient mqttClient = new MqttClient("tcp://broker:1883", UUID.randomUUID().toString(), new MemoryPersistence());
        mqttClient.connect();
        mqttClient.subscribe("/" + id + "/#", 1);
        mqttClient.setCallback(app);
    }

    long acc = 0;
    int i = 0;
    long start = System.nanoTime();
    static int id = 0;

    @Override
    public void connectionLost(Throwable throwable) {
    }

    @Override
    public void messageArrived(String s, MqttMessage mqttMessage) throws Exception {
        i++;
        acc = (System.nanoTime() - start);
        if (acc / 1000000 > 1000) {
            start = System.nanoTime();
            System.out.println(String.valueOf((i * 1000000000.0) / acc));
            acc = 0;
            i = 0;
        }
    }

    @Override
    public void deliveryComplete(IMqttDeliveryToken iMqttDeliveryToken) {
    }
}
E.g. running the producer for topic 1 as:
python Producers.py 1&
What limits the Paho client inside a Java application?
Well, after a lot of debugging I found out what the problem was.
The topic $SYS/broker/load/messages/received/1min was reporting more messages than I was sending; it probably counts protocol messages as messages. Indeed, when idle this topic reports 3.22 with one subscriber. So I thought I was sending 1459.5055 per second, as reported by mosquitto, but I was actually sending just 485.5458333333333.
So do not trust this topic for application payload messages!

Does Java have any mechanism for a VM to trace method calls on itself, without using javaagent, etc.?

I want to build call graphs on the fly, starting at an arbitrary method call or with a new thread, whichever is easier, from within the running JVM itself. (This piece of software is going to be a test fixture for load-testing another piece of software that consumes call graphs.)
I understand there are some SPI interfaces, but it looks like you need to run with the -javaagent flag to use them. I want to access this directly in the VM itself.
Ideally, I'd like to get a callback for entry and exit of each method call, parameters to that method call, and time in that method. Within a single thread obviously.
I know AOP could probably do this, but I'm just wondering if there are tools within the JDK that would allow me to capture this.
There is no such API provided by the JVM, not even for agents started with -javaagent. The JVM TI is a native interface provided for native agents started with the -agent option or for debuggers. Java agents might use the Instrumentation API, which provides the low-level feature of class instrumentation but no direct profiling capability.
There are two types of profiling implementations, via sampling and via instrumentation.
Sampling works by recording stack traces (samples) periodically. This does not trace every method call, but it still detects hot spots, as they occur multiple times in the recorded stack traces. The advantage is that it requires neither agents nor special APIs, and you have control over the profiler's overhead. You can implement it via the ThreadMXBean, which allows you to get stack traces of all running threads. In fact, even a Thread.getAllStackTraces() would do, but the ThreadMXBean provides more detailed information about the threads.
So the main task is to implement an efficient storage structure for the methods found in the stack traces, i.e. collapsing occurrences of the same method into single call tree items.
Here is an example of a very simple sampler working on its own JVM:
import java.lang.Thread.State;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.*;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class Sampler {
    private static final ThreadMXBean TMX = ManagementFactory.getThreadMXBean();
    private static String CLASS, METHOD;
    private static CallTree ROOT;
    private static ScheduledExecutorService EXECUTOR;

    public static synchronized void startSampling(String className, String method) {
        if (EXECUTOR != null) throw new IllegalStateException("sampling in progress");
        System.out.println("sampling started");
        CLASS = className;
        METHOD = method;
        EXECUTOR = Executors.newScheduledThreadPool(1);
        // "fixed delay" reduces overhead, "fixed rate" raises precision
        EXECUTOR.scheduleWithFixedDelay(new Runnable() {
            public void run() {
                newSample();
            }
        }, 150, 75, TimeUnit.MILLISECONDS);
    }

    public static synchronized CallTree stopSampling() throws InterruptedException {
        if (EXECUTOR == null) throw new IllegalStateException("no sampling in progress");
        EXECUTOR.shutdown();
        EXECUTOR.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
        EXECUTOR = null;
        final CallTree root = ROOT;
        ROOT = null;
        return root;
    }

    public static void printCallTree(CallTree t) {
        if (t == null) System.out.println("method not seen");
        else printCallTree(t, 0, 100);
    }

    private static void printCallTree(CallTree t, int ind, long percent) {
        long num = 0;
        for (CallTree ch : t.values()) num += ch.count;
        if (num == 0) return;
        for (Map.Entry<List<String>, CallTree> ch : t.entrySet()) {
            CallTree cht = ch.getValue();
            StringBuilder sb = new StringBuilder();
            for (int p = 0; p < ind; p++) sb.append(' ');
            final long chPercent = cht.count * percent / num;
            sb.append(chPercent).append("% (").append(cht.cpu * percent / num)
              .append("% cpu) ").append(ch.getKey()).append(" ");
            System.out.println(sb.toString());
            printCallTree(cht, ind + 2, chPercent);
        }
    }

    static class CallTree extends HashMap<List<String>, CallTree> {
        long count = 1, cpu;

        CallTree(boolean cpu) { if (cpu) this.cpu++; }

        CallTree getOrAdd(String cl, String m, boolean cpu) {
            List<String> key = Arrays.asList(cl, m);
            CallTree t = get(key);
            if (t != null) { t.count++; if (cpu) t.cpu++; }
            else put(key, t = new CallTree(cpu));
            return t;
        }
    }

    static void newSample() {
        for (ThreadInfo ti : TMX.dumpAllThreads(false, false)) {
            final boolean cpu = ti.getThreadState() == State.RUNNABLE;
            StackTraceElement[] stack = ti.getStackTrace();
            for (int ix = stack.length - 1; ix >= 0; ix--) {
                StackTraceElement ste = stack[ix];
                if (!ste.getClassName().equals(CLASS) || !ste.getMethodName().equals(METHOD))
                    continue;
                CallTree t = ROOT;
                if (t == null) ROOT = t = new CallTree(cpu);
                for (ix--; ix >= 0; ix--) {
                    ste = stack[ix];
                    t = t.getOrAdd(ste.getClassName(), ste.getMethodName(), cpu);
                }
            }
        }
    }
}
Profilers hunting for every method invocation without going through the debugging API use instrumentation to add notification code to every method they are interested in. The advantage is that they never miss a method invocation but on the other hand they are adding a significant overhead to the execution which might influence the result when searching for hot spots. And it’s way more complicated to implement. I can’t give you a code example for such a byte code transformation.
The Instrumentation API is provided to Java agents only, but in case you want to go in the instrumentation direction, here is a program which demonstrates how to connect to its own JVM and load itself as a Java agent:
import java.io.*;
import java.lang.instrument.Instrumentation;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// this API comes from the tools.jar of your JDK
import com.sun.tools.attach.*;

public class SelfAttacher {
    public static Instrumentation BACK_LINK;

    public static void main(String[] args) throws Exception {
        // create a special property to verify our JVM connection
        String magic = UUID.randomUUID().toString() + '/' + System.nanoTime();
        System.setProperty("magic", magic);
        // the easiest way uses the non-standardized runtime name string ("pid@host")
        String name = ManagementFactory.getRuntimeMXBean().getName();
        int ix = name.indexOf('@');
        if (ix >= 0) name = name.substring(0, ix);
        VirtualMachine vm;
        getVM: {
            try {
                vm = VirtualMachine.attach(name);
                if (magic.equals(vm.getSystemProperties().getProperty("magic")))
                    break getVM;
            } catch (Exception ex) {}
            // if the easy way failed, try iterating over all local JVMs
            for (VirtualMachineDescriptor vd : VirtualMachine.list()) try {
                vm = VirtualMachine.attach(vd);
                if (magic.equals(vm.getSystemProperties().getProperty("magic")))
                    break getVM;
                vm.detach();
            } catch (Exception ex) {}
            // could not find our own JVM or could not attach to it
            return;
        }
        System.out.println("attached to: " + vm.id() + '/' + vm.provider().type());
        vm.loadAgent(createJar().getAbsolutePath());
        synchronized (SelfAttacher.class) {
            while (BACK_LINK == null) SelfAttacher.class.wait();
        }
        System.out.println("Now I have hands on instrumentation: " + BACK_LINK);
        System.out.println(BACK_LINK.isModifiableClass(SelfAttacher.class));
        vm.detach();
    }

    // create a JAR file for the agent; since our class is already in the class path,
    // our jar consists only of a MANIFEST declaring our class as the agent
    private static File createJar() throws IOException {
        File f = File.createTempFile("agent", ".jar");
        f.deleteOnExit();
        Charset cs = StandardCharsets.ISO_8859_1;
        try (FileOutputStream fos = new FileOutputStream(f);
             ZipOutputStream os = new ZipOutputStream(fos)) {
            os.putNextEntry(new ZipEntry("META-INF/MANIFEST.MF"));
            ByteBuffer bb = cs.encode("Agent-Class: " + SelfAttacher.class.getName());
            os.write(bb.array(), bb.arrayOffset() + bb.position(), bb.remaining());
            os.write(10);
            os.closeEntry();
        }
        return f;
    }

    // invoked when the agent is loaded into the JVM; pass inst back to the caller
    public static void agentmain(String agentArgs, Instrumentation inst) {
        synchronized (SelfAttacher.class) {
            BACK_LINK = inst;
            SelfAttacher.class.notifyAll();
        }
    }
}
You can modify the bytecode of each method, adding a routine to log the method's enter/exit events. Javassist will help you: http://www.csg.ci.i.u-tokyo.ac.jp/~chiba/javassist/
Also check out a nice tutorial: https://today.java.net/pub/a/today/2008/04/24/add-logging-at-class-load-time-with-instrumentation.html
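A rough Javassist-based sketch of the idea (not from the answers above; the class name is illustrative, and weaving must happen before the class is first loaded):
import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtMethod;

public class EnterExitWeaver {
    public static Class<?> weave(String className) throws Exception {
        ClassPool pool = ClassPool.getDefault();
        CtClass ct = pool.get(className);
        for (CtMethod m : ct.getDeclaredMethods()) {
            if (m.isEmpty()) continue; // skip abstract/native methods
            // inject a timer variable plus enter/exit logging into each method body
            m.addLocalVariable("__start", CtClass.longType);
            m.insertBefore("__start = System.nanoTime();"
                    + "System.out.println(\"enter " + m.getLongName() + "\");");
            m.insertAfter("System.out.println(\"exit " + m.getLongName()
                    + " took \" + (System.nanoTime() - __start) + \" ns\");");
        }
        return ct.toClass();
    }
}
Calling a hypothetical EnterExitWeaver.weave("com.example.Foo") before com.example.Foo is loaded makes every call to its methods print enter/exit lines with the elapsed time, which is the raw material for a call graph.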
