OrientDB slow write - java

The OrientDB official site says:
On common hardware stores up to 150.000 documents per second, 10
billions of documents per day. Big Graphs are loaded in few
milliseconds without executing costly JOIN such as the Relational
DBMSs.
But executing the following code shows that it takes ~17,000 ms to insert 150,000 simple documents.
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;

public final class OrientDBTrial {

    public static void main(String[] args) {
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("remote:localhost/foo");
        try {
            db.open("admin", "admin");

            long a = System.currentTimeMillis();
            for (int i = 1; i < 150000; ++i) {
                final ODocument foo = new ODocument("Foo");
                foo.field("code", i);
                foo.save();
            }
            long b = System.currentTimeMillis();
            System.out.println(b - a + "ms");

            for (ODocument doc : db.browseClass("Foo")) {
                doc.delete();
            }
        } finally {
            db.close();
        }
    }
}
My hardware:
Dell Optiplex 780
Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz
8GB RAM
Windows 7 64bits
What am I doing wrong?
Splitting the saves across 10 concurrent threads to minimize Java's overhead brought it down to ~13000ms. Still far slower than what the OrientDB front page claims.

You can achieve those numbers by using a 'Flat Database' and OrientDB as an embedded library in Java; see more explained here:
http://code.google.com/p/orient/wiki/JavaAPI
What you are using is server mode, and it sends many requests to the OrientDB server. Judging by your benchmark you got ~10,000 inserts per second, which is not bad; I think 10,000 requests/s is very good performance for any web server (and the OrientDB server actually is a web server: you can query it through HTTP, though the Java client uses the binary protocol).
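For illustration, an embedded run looks roughly like the sketch below. This is a minimal sketch, not tested against your setup: the database path is made up, and the URL prefix depends on the OrientDB version ("local:" in the 1.x line, "plocal:" in later releases):

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;

public final class EmbeddedInsertSketch {

    public static void main(String[] args) {
        // "plocal:" (or "local:" on old 1.x versions) opens the storage
        // in-process, so there is no network round trip per save().
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/databases/foo");
        try {
            if (!db.exists()) {
                db.create();               // create the database on first run
            } else {
                db.open("admin", "admin"); // otherwise just open it
            }

            long start = System.currentTimeMillis();
            for (int i = 1; i < 150000; ++i) {
                final ODocument foo = new ODocument("Foo");
                foo.field("code", i);
                foo.save();
            }
            System.out.println((System.currentTimeMillis() - start) + "ms");
        } finally {
            db.close();
        }
    }
}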

The numbers from the OrientDB site are benchmarked for a local database (with no network overhead), so if you use a remote protocol, expect some delays.
As Krisztian pointed out, reuse objects if possible.

Read the documentation first on how to achieve the best performance!

A few tips:

-> Do NOT instantiate a new ODocument every time; reuse one instance:

final ODocument doc = new ODocument();
for (...) {
    doc.reset();
    doc.setClassName("Class");
    // Put data to fields
    doc.save();
}

-> Do NOT rely on System.currentTimeMillis() - use perf4j or a similar tool to measure times, because currentTimeMillis() measures global (wall-clock) time and therefore includes the execution time of all other programs running on your system!
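As a rough illustration of why wall-clock timing can mislead, the JDK's ThreadMXBean can report the CPU time actually consumed by the current thread alongside the elapsed wall time. This is just a sketch of the measurement idea (standard java.lang.management APIs, nothing OrientDB-specific):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public final class TimingSketch {

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        long wallStart = System.nanoTime();
        long cpuStart = threads.getCurrentThreadCpuTime();

        // ... run the inserts here ...

        long wallMs = (System.nanoTime() - wallStart) / 1000000L;
        long cpuMs = (threads.getCurrentThreadCpuTime() - cpuStart) / 1000000L;

        // Wall time includes everything else running on the machine;
        // CPU time is only what this thread actually consumed.
        System.out.println("wall: " + wallMs + "ms, cpu: " + cpuMs + "ms");
    }
}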

Related

Issues with Dynamic Destinations in Dataflow

I have a Dataflow job that reads data from Pub/Sub and, based on the time and filename, writes the contents to GCS, where the folder path is based on YYYY/MM/DD. This allows files to be generated in folders based on date, using Apache Beam's FileIO and Dynamic Destinations.
About two weeks ago, I noticed an unusual buildup of unacknowledged messages. Upon restarting the Dataflow job, the errors disappeared and new files were written to GCS.
After a couple of days, writing stopped again, except this time there were errors claiming that processing was stuck. After some trusty SO research, I found out that this was likely caused by a deadlock issue in pre-2.9.0 Beam, because it used the Conscrypt library as the default security provider. So, I upgraded from Beam 2.8 to Beam 2.11.
Once again, it worked, until it didn't. I looked more closely at the error and noticed that it had a problem with a SimpleDateFormat object, which isn't thread-safe. So, I switched to java.time and DateTimeFormatter, which is thread-safe. It worked until it didn't. However, this time the error was slightly different and didn't point to anything in my code:
The error is provided below.
Processing stuck in step FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles for at least 05m00s without outputting or completing in state process
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at org.apache.beam.vendor.guava.v20_0.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:469)
at org.apache.beam.vendor.guava.v20_0.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
at org.apache.beam.runners.dataflow.worker.MetricTrackingWindmillServerStub.getStateData(MetricTrackingWindmillServerStub.java:202)
at org.apache.beam.runners.dataflow.worker.WindmillStateReader.startBatchAndBlock(WindmillStateReader.java:409)
at org.apache.beam.runners.dataflow.worker.WindmillStateReader$WrappedFuture.get(WindmillStateReader.java:311)
at org.apache.beam.runners.dataflow.worker.WindmillStateReader$BagPagingIterable$1.computeNext(WindmillStateReader.java:700)
at org.apache.beam.vendor.guava.v20_0.com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
at org.apache.beam.vendor.guava.v20_0.com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
at org.apache.beam.vendor.guava.v20_0.com.google.common.collect.MultitransformedIterator.hasNext(MultitransformedIterator.java:47)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:701)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)
This error started occurring approximately 5 hours after job deployment and at an increasing rate over time. Writing slowed significantly within 24 hours. I have 60 workers and I suspect that one worker fails every time there is an error, which eventually kills the job.
In my writer, I parse the lines for certain keywords (which may not be the best way) in order to determine which folder each line belongs in, and then write the file to GCS with the determined filename. The partition function I use is the following:
@SuppressWarnings("serial")
public static class datePartition implements SerializableFunction<String, String> {

    private String filename;

    public datePartition(String filename) {
        this.filename = filename;
    }

    @Override
    public String apply(String input) {
        String folder_name = "NaN";
        String date_dtf = "NaN";
        String date_literal = "NaN";
        try {
            Matcher foldernames = Pattern.compile("\"foldername\":\"(.*?)\"").matcher(input);
            if (foldernames.find()) {
                folder_name = foldernames.group(1);
            } else {
                Matcher folderid = Pattern.compile("\"folderid\":\"(.*?)\"").matcher(input);
                if (folderid.find()) {
                    folder_name = folderid.group(1);
                }
            }

            Matcher date_long = Pattern.compile("\"timestamp\":\"(.*?)\"").matcher(input);
            if (date_long.find()) {
                date_literal = date_long.group(1);
                if (Utilities.isNumeric(date_literal)) {
                    LocalDateTime date = LocalDateTime.ofInstant(Instant.ofEpochMilli(Long.valueOf(date_literal)), ZoneId.systemDefault());
                    date_dtf = date.format(dtf); // dtf: a DateTimeFormatter defined elsewhere in the class
                } else {
                    date_dtf = date_literal.split(":")[0].replace("-", "/").replace("T", "/");
                }
            }
            return folder_name + "/" + date_dtf + "h/" + filename;
        } catch (Exception e) {
            LOG.error("ERROR with either foldername or date");
            LOG.error("Line : " + input);
            LOG.error("folder : " + folder_name);
            LOG.error("Date : " + date_dtf);
            return folder_name + "/" + date_dtf + "h/" + filename;
        }
    }
}
The pipeline itself is built and run as follows:
public void streamData() {
    Pipeline pipeline = Pipeline.create(options);
    pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
            .apply(options.getWindowDuration() + " Window",
                    Window.<PubsubMessage>into(FixedWindows.of(parseDuration(options.getWindowDuration())))
                            .triggering(AfterWatermark.pastEndOfWindow())
                            .discardingFiredPanes()
                            .withAllowedLateness(parseDuration("24h")))
            .apply(new GenericFunctions.extractMsg())
            .apply(FileIO.<String, String>writeDynamic()
                    .by(new datePartition(options.getOutputFilenamePrefix()))
                    .via(TextIO.sink())
                    .withNumShards(options.getNumShards())
                    .to(options.getOutputDirectory())
                    .withNaming(type -> FileIO.Write.defaultNaming(type, ".txt"))
                    .withDestinationCoder(StringUtf8Coder.of()));
    pipeline.run();
}
The error 'Processing stuck ...' indicates that some particular operation took longer than 5 minutes, not that the job is permanently stuck. However, since the stuck step is FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles and the job gets cancelled/killed, I would suspect an issue while the job is writing temp files.
I found the BEAM-7689 issue, which is related to the second-granularity timestamp (yyyy-MM-dd_HH-mm-ss) used to name temporary files. Several concurrent jobs can share the same temporary directory, and one of the jobs can delete it before the other job(s) finish.
According to the previous link, to mitigate the issue please upgrade to SDK 2.14, and let us know if the error is gone.
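Until then, another way to avoid the collision described in BEAM-7689 is to make sure each job writes its temporary files under its own path, which the updated pipeline below effectively does with withTempDirectory(). A minimal, hypothetical helper (the bucket name is illustrative; getJobName() comes from Beam's PipelineOptions):

import java.util.UUID;
import org.apache.beam.sdk.options.PipelineOptions;

public final class TempDirs {

    // Derive a per-job temp location so that concurrent jobs never share
    // (and never delete) each other's temporary files.
    public static String uniqueTempDir(PipelineOptions options) {
        return "gs://my-bucket/temp/" + options.getJobName() + "-" + UUID.randomUUID();
    }
}

The resulting path would then be passed to .withTempDirectory(...) on the FileIO.Write transform.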
Since posting this question, I've optimized the Dataflow job to dodge bottlenecks and increase parallelization. Much like rsantiago explained, 'processing stuck' isn't an error, but simply a way Dataflow communicates that a step is taking significantly longer than other steps, which is essentially a bottleneck that can't be cleared with the given resources. The changes I made seem to have addressed it. The new code is as follows:
public void streamData() {
    try {
        Pipeline pipeline = Pipeline.create(options);
        pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
                .apply(options.getWindowDuration() + " Window",
                        Window.<PubsubMessage>into(FixedWindows.of(parseDuration(options.getWindowDuration())))
                                .triggering(AfterWatermark.pastEndOfWindow())
                                .discardingFiredPanes()
                                .withAllowedLateness(parseDuration("24h")))
                .apply(FileIO.<String, PubsubMessage>writeDynamic()
                        .by(new datePartition(options.getOutputFilenamePrefix()))
                        .via(Contextful.fn(
                                (SerializableFunction<PubsubMessage, String>) inputMsg -> new String(inputMsg.getPayload(), StandardCharsets.UTF_8)),
                                TextIO.sink())
                        .withDestinationCoder(StringUtf8Coder.of())
                        .to(options.getOutputDirectory())
                        .withNaming(type -> new CrowdStrikeFileNaming(type))
                        .withNumShards(options.getNumShards())
                        .withTempDirectory(options.getTempLocation()));
        pipeline.run();
    } catch (Exception e) {
        LOG.error("Unable to deploy pipeline");
        LOG.error(e.toString(), e);
    }
}
The biggest change involved removing the extractMsg() function and changing the partitioning to use only message metadata. Both of those steps had forced deserialization/reserialization of the messages and heavily impacted performance.
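For reference, a metadata-only partition function might look roughly like the sketch below. This is hypothetical, not the actual code of this job; the attribute names ("foldername", "timestamp") and the date pattern are assumptions:

import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.SerializableFunction;

// Hypothetical metadata-only partitioner: it reads Pub/Sub message attributes
// instead of parsing the payload, so the payload is never deserialized here.
public class MetadataPartition implements SerializableFunction<PubsubMessage, String> {

    private static final DateTimeFormatter DTF = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    private final String filename;

    public MetadataPartition(String filename) {
        this.filename = filename;
    }

    @Override
    public String apply(PubsubMessage msg) {
        // Attribute names are assumptions for the sake of the example.
        String folder = msg.getAttribute("foldername");
        String timestamp = msg.getAttribute("timestamp");

        LocalDateTime date = LocalDateTime.ofInstant(
                Instant.ofEpochMilli(Long.parseLong(timestamp)), ZoneId.systemDefault());
        return folder + "/" + date.format(DTF) + "h/" + filename;
    }
}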
Additionally, since my data set was unbounded, I had to set a non-zero number of shards. I wanted to simplify my filenaming policy, so I set it to 1 without knowing how much it hurt performance. Since then, I've found a good balance of workers/shards/machine type for my job (mostly based on guess & check, unfortunately).
Although it's still possible that a bottleneck might appear under a large enough data load, the pipeline has been performing well despite heavy load (3-5 TB per day). The changes also significantly improved autoscaling, though I'm not sure why. The Dataflow job now reacts to spikes and valleys much more quickly.

Apache Flink Using Windows to induce a delay before writing to Sink

I am wondering whether it is possible, with Flink windowing, to induce a 10-minute delay from when the data enters the pipeline until it is written to a table in Cassandra.
My initial intention was to write each transaction to a table in Cassandra and query the table using a range key at the web layer, but due to the volume of data I am looking at options to delay the write by N seconds. This means my table will only ever hold data that is at least 10 minutes old.
The small diagram below shows 10-minute windows that roll every minute. As time moves on, I only want to write data to Cassandra that is older than 10 minutes (the parts in green). Is this even possible with Flink?
I could create 11 minute windows that roll every minute but I would end up throwing 90% of the data away, which seems a waste.
Final Solution
I created my own flavour of FlinkKafkaConsumer09 called DelayedKafkaConsumer. The main reason for this is to override the creation of the KafkaFetcher:
public class DelayedKafkaConsumer<T> extends FlinkKafkaConsumer09<T> {

    private ConsumerRecordFunction applyDelayAction;

    // ... (constructors and other members omitted)

    @Override
    protected AbstractFetcher<T, ?> createFetcher(SourceContext<T> sourceContext,
            Map<KafkaTopicPartition, Long> assignedPartitionsWithInitialOffsets,
            SerializedValue<AssignerWithPeriodicWatermarks<T>> watermarksPeriodic,
            SerializedValue<AssignerWithPunctuatedWatermarks<T>> watermarksPunctuated,
            StreamingRuntimeContext runtimeContext, OffsetCommitMode offsetCommitMode) throws Exception {
        return new DelayedKafkaFetcher<>(
                sourceContext, assignedPartitionsWithInitialOffsets, watermarksPeriodic, watermarksPunctuated,
                runtimeContext.getProcessingTimeService(), runtimeContext.getExecutionConfig().getAutoWatermarkInterval(),
                runtimeContext.getUserCodeClassLoader(), runtimeContext.getTaskNameWithSubtasks(),
                runtimeContext.getMetricGroup(), this.deserializer, this.properties, this.pollTimeout, useMetrics, applyDelayAction);
    }
}
The DelayedKafkaFetcher has a small piece of code in its runFetchLoop that sleeps for n milliseconds before emitting the record.
private void delayMessage(Long msgTransactTime, Long nowMinusDelay) throws InterruptedException {
    if (msgTransactTime > nowMinusDelay) {
        Long sleepTimeout = msgTransactTime - nowMinusDelay;
        if (LOGGER.isDebugEnabled()) {
            LOGGER.debug(format("Message with transaction time {0}ms is not older than {1}ms. Sleeping for {2}", msgTransactTime, nowMinusDelay, sleepTimeout));
        }
        TimeUnit.MILLISECONDS.sleep(sleepTimeout);
    }
}

Apache Spark : TaskResultLost (result lost from block manager) Error On cluster

I have a Spark standalone cluster with 3 slaves on VirtualBox. My code is in Java and it works fine with my small input datasets, whose inputs total around 100 MB.
I set my virtual machines' RAM to 16 GB, but when I run my code on big input files (about 2 GB) I get this error after hours of processing in my reduce part:
Job aborted due to stage failure: Total size of serialized results of 4 tasks (4.3GB) is bigger than spark.driver.maxResultSize
I edited spark-defaults.conf and assigned a higher amount (2 GB and then 4 GB) to spark.driver.maxResultSize. It didn't help and the same error showed up.
Now I am trying 8 GB for spark.driver.maxResultSize, and my spark.driver.memory is the same as the RAM size (16 GB). But I get this error:
TaskResultLost (result lost from block manager)
Any comments about this? I have also included an image.
I don't know if the problem is caused by the large size of maxResultSize or by something with the collections of RDDs in the code. I also provide the mapper part of the code for better understanding.
JavaRDD<Boolean[][][]> fragPQ = uData.map(new Function<String, Boolean[][][]>() {
    public Boolean[][][] call(String s) {
        // Each record becomes a 2 x 11000 x 11000 array of boxed Booleans
        // (242 million entries), which is very large to serialize and collect.
        Boolean[][][] PQArr = new Boolean[2][][];
        PQArr[0] = new Boolean[11000][];
        PQArr[1] = new Boolean[11000][];
        for (int i = 0; i < 11000; i++) {
            PQArr[0][i] = new Boolean[11000];
            PQArr[1][i] = new Boolean[11000];
            for (int j = 0; j < 11000; j++) {
                PQArr[0][i][j] = true;
                PQArr[1][i][j] = true;
            }
        }
        return PQArr;
    }
});
In general, this error shows that you are collecting/bringing a large amount of data onto the driver. This should never be done. You need to rethink your application logic.
Also, you don't need to modify spark-defaults.conf to set the property. Instead, you can specify such application-specific properties via the --conf option of spark-shell or spark-submit, depending on how you run the job.
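For illustration, the same property can be passed on the command line or set programmatically when the SparkContext is built. A minimal sketch (the 4g value and the app name are just examples):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public final class ConfExample {

    public static void main(String[] args) {
        // Equivalent to: spark-submit --conf spark.driver.maxResultSize=4g ...
        SparkConf conf = new SparkConf()
                .setAppName("conf-example")
                .set("spark.driver.maxResultSize", "4g");

        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            // ... build RDDs and run the job here ...
        } finally {
            sc.stop();
        }
    }
}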
SOLVED:
The problem was solved by increasing the master's RAM size. I studied my case and found that, based on my design, assigning 32 GB of RAM would be sufficient. Having done that, my program works fine and calculates everything correctly.
In my case, I got this error because a firewall was blocking the block manager ports between the driver and the executors.
The ports can be specified with spark.blockManager.port and spark.driver.blockManager.port.
See https://spark.apache.org/docs/latest/configuration.html#networking

JMX results are confusing

I have been trying to learn JMX for the last few days and am now confused.
I have written a simple JMX program which uses the APIs of the java.lang.management package to extract the PID, CPU time, and user time. In my results I am only getting the threads of the current JVM, which is my JMX program itself, but I thought I would get the results of all Java processes running on the same machine. How can I get the PIDs, CPU time, and user time for all Java processes running in a JVM (Linux/Windows)?
How can I get the PIDs, CPU time, and user time for all non-Java processes running on my machine (Linux/Windows)?
My code is below:
public void update() throws Exception {
    final ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    final long[] ids = bean.getAllThreadIds();
    final ThreadInfo[] infos = bean.getThreadInfo(ids);

    for (long id : ids) {
        if (id == threadId) {
            continue; // Exclude polling thread
        }
        final long c = bean.getThreadCpuTime(id);
        final long u = bean.getThreadUserTime(id);
        if (c == -1 || u == -1) {
            continue; // Thread died
        }
    }

    String name = null;
    for (int i = 0; i < infos.length; i++) {
        name = infos[i].getThreadName();
        System.out.print("The name of the id is \n" + name);
    }
}
I am always getting the result:
The name of the id is Attach Listener
The name of the id is Signal Dispatcher
The name of the id is Finalizer
The name of the id is Reference Handler
The name of the id is main
I have some other Java processes running on my machine, but they are not included in the results of the bean.getAllThreadIds() API.
Ah, now I see what you want to do. I'm afraid I have some bad news.
The APIs that are exposed through ManagementFactory allow you to monitor only the JVM in which your code is running. To monitor other JVMs, you have to use the JMX Remoting API (javax.management.remote), and that introduces a whole new range of issues you have to deal with.
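For completeness, connecting to another JVM with the remoting API looks roughly like this. It is a minimal sketch that assumes the target JVM was started with remote JMX enabled (for example -Dcom.sun.management.jmxremote.port=9999 with authentication and SSL disabled); the host and port are illustrative:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public final class RemoteThreadTimes {

    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");

        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection connection = connector.getMBeanServerConnection();

            // Proxy for the *remote* JVM's ThreadMXBean.
            ThreadMXBean remoteThreads = ManagementFactory.newPlatformMXBeanProxy(
                    connection, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

            for (long id : remoteThreads.getAllThreadIds()) {
                System.out.println(id + " cpu=" + remoteThreads.getThreadCpuTime(id)
                        + "ns user=" + remoteThreads.getThreadUserTime(id) + "ns");
            }
        } finally {
            connector.close();
        }
    }
}

Even then, this only reaches JVMs that expose JMX; it will not show non-Java processes.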
It sounds like what you want to do is basically write your own management console using the stock APIs provided by out-of-the-box JDK. Short answer: you can't get there from here. Slightly longer answer: you can get there from here, but the road is long, winding, uphill (nearly) the entire way, and when you're done you will most likely wish you had gone a different route (read that: use a management console that has already been written).
I recommend you use JConsole or some other management console to monitor your application(s). In my experience it is usually only important that a human (not a program) interpret the stats that are provided by the various MBeans whose references are obtainable through the ManagementFactory static methods. After all, if a program had access to, say, the amount of CPU used by some other process, what conceivable use would it have with that information (other than to provide it in some human-readable format)?

Efficient way to GET multiple HTML pages simultaneously

So I'm working on web scraping for a certain website. The problem is:
Given a set of URLs (on the order of 100s to 1000s), I would like to retrieve the HTML of each URL in an efficient manner, especially time-wise. I need to be able to do 1000s of requests every 5 minutes.
This should usually imply using a pool of threads to issue requests from a set of not-yet-requested URLs. But before jumping into implementing this, I believe it's worth asking here, since this is a fairly common problem when doing web scraping or web crawling.
Is there any library that has what I need?
So I'm working on web scraping for a certain website.
Are you scraping a single server, or are you scraping from multiple other hosts? If it is the former, the server you are scraping may not like too many concurrent connections from a single IP.
If it is the latter, this is really a general question about how many outbound connections you should open from one machine. There is a physical limit, but it is pretty large. Practically, it would depend on where that client is deployed. The better the connectivity, the higher the number of connections it can accommodate.
You might want to look at the source code of a good download manager to see if they limit the number of outbound connections.
Definitely use asynchronous I/O, but you would still do well to limit the number.
Your bandwidth utilization will be the sum of all of the HTML documents that you retrieve (plus a little overhead) no matter how you slice it (though some web servers may support compressed HTTP streams, so certainly use a client capable of accepting them).
The optimal number of concurrent threads depends a great deal on your network connectivity to the sites in question. Only experimentation can find an optimal number. You can certainly use one set of threads for retrieving HTML documents and a separate set of threads to process them to make it easier to find the right balance.
I'm a big fan of HTML Agility Pack for web scraping in the .NET world, but I cannot make a specific recommendation for Java. The following question may be of use in finding a good Java-based scraping platform:
Web scraping with Java
I would start by researching asynchronous communication. Then take a look at Netty.
Keep in mind there is always a limit to how fast one can load a web page. For an average home connection, it will be around a second. Take this into consideration when programming your application.
http://www.jsoup.org just for the scraping part! The thread pooling, I think, you should implement yourself.
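If you do roll your own, a minimal sketch of the thread-pool part might look like this, assuming jsoup is on the classpath; the URL list, pool size, and timeout are placeholders:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public final class PooledFetcher {

    public static void main(String[] args) throws Exception {
        List<String> urls = Arrays.asList("http://www.example.com", "http://www.example.org");

        // A bounded pool keeps the number of concurrent connections under control.
        ExecutorService pool = Executors.newFixedThreadPool(10);
        try {
            List<Callable<Document>> tasks = new ArrayList<>();
            for (final String url : urls) {
                tasks.add(() -> Jsoup.connect(url).timeout(10000).get());
            }

            // invokeAll blocks until every fetch has finished or failed.
            for (Future<Document> future : pool.invokeAll(tasks)) {
                try {
                    Document doc = future.get();
                    System.out.println(doc.location() + " -> " + doc.title());
                } catch (Exception fetchFailed) {
                    System.out.println("fetch failed: " + fetchFailed.getMessage());
                }
            }
        } finally {
            pool.shutdown();
        }
    }
}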
Update
If this approach fits your need, you can download the complete class files here:
http://codetoearn.blogspot.com/2013/01/concurrent-web-requests-with-thread.html
AsyncWebReader webReader = new AsyncWebReader(5 /* number of threads */, new String[]{
        "http://www.google.com",
        "http://www.yahoo.com",
        "http://www.live.com",
        "http://www.wikipedia.com",
        "http://www.facebook.com",
        "http://www.khorasannews.com",
        "http://www.fcbarcelona.com",
        "http://www.khorasannews.com",
});
webReader.addObserver(new Observer() {
    @Override
    public void update(Observable o, Object arg) {
        if (arg instanceof Exception) {
            Exception ex = (Exception) arg;
            System.out.println(ex.getMessage());
        } /*else if (arg instanceof List) {
            List vals = (List) arg;
            System.out.println(vals.get(0) + ": " + vals.get(1));
        } */ else if (arg instanceof Object[]) {
            Object[] objects = (Object[]) arg;
            HashMap result = (HashMap) objects[0];
            String[] success = (String[]) objects[1];
            String[] fail = (String[]) objects[2];

            System.out.println("Failed");
            for (int i = 0; i < fail.length; i++) {
                String string = fail[i];
                System.out.println(string);
            }
            System.out.println("-----------");
            System.out.println("success");
            for (int i = 0; i < success.length; i++) {
                String string = success[i];
                System.out.println(string);
            }

            System.out.println("\n\nresult of Google: ");
            System.out.println(result.remove("http://www.google.com"));
        }
    }
});
Thread t = new Thread(webReader);
t.start();
t.join();
