How to use multithreading for gremlin edge creation

I have been working with a TinkerPop Gremlin graph and have been able to do a lot with it, but I'm stuck at one point: I'm trying to process many thousands of vertices and edges, and it takes around one hour to complete. How can I apply a parallelStream() operation to the following part:
for(String s : somelist){
String[] ss = s.split(",");
graphTraversal().addEdge(ss[0], ss[1]);
}
The list "somelist" contains the source and target vertices for each edge (about 65,000 entries).

TinkerGraph technically isn't completely thread-safe for writes. You might hit some problems depending on what you're loading and how you are loading it. I can't say exactly what those problems are and what you might need to do to avoid them, but we definitely haven't tested TinkerGraph that way.
That said, 65,000 edges in the format you're specifying in your sample code should not take an hour to load into TinkerGraph, even in single-threaded mode. That sounds excessive. I assume your sample code is not what you are actually executing, as it is not valid Gremlin syntax, so it's hard to say what the problem might be.
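If you still want to use parallelStream() somewhere, one cautious option is to parallelize only the stateless part (splitting the strings) and keep every graph write on a single thread. The sketch below is an illustration under assumptions, not TinkerPop API: EdgeLoader and the edgeWriter callback are hypothetical names, and edgeWriter stands in for whatever valid Gremlin call actually adds the edge. Parsing 65,000 short strings is unlikely to be the real bottleneck, so measure before relying on this.

```java
import java.util.List;
import java.util.function.BiConsumer;
import java.util.stream.Collectors;

// Hypothetical helper: parse in parallel, write sequentially.
final class EdgeLoader {

    /** Parses "source,target" lines in parallel; parsing is stateless, so this is safe. */
    static List<String[]> parse(List<String> lines) {
        return lines.parallelStream()
                .map(line -> line.split(","))
                .filter(parts -> parts.length == 2)   // skip malformed lines
                .collect(Collectors.toList());        // encounter order is preserved
    }

    /**
     * Feeds each parsed pair to edgeWriter on the calling thread only.
     * edgeWriter is an assumed stand-in for the real graph write.
     */
    static int load(List<String> lines, BiConsumer<String, String> edgeWriter) {
        List<String[]> pairs = parse(lines);
        for (String[] pair : pairs) {
            edgeWriter.accept(pair[0], pair[1]);
        }
        return pairs.size();
    }
}
```

Keeping the writes on one thread sidesteps the write-safety concern described above while still letting the parsing step use several cores.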

Related

Are non-parallel Streams meant to do an operation in mass on big amount of data?

A few weeks ago, I was searching for a way to extract some specific value from a file and stumbled on this question which introduced me to the Stream Object.
My first instinct was to investigate whether this object would help with other file operations, such as replacing several placeholders with corresponding values, for which I used BufferedReader and FileWriter. I failed miserably at producing any working code, but since then I have taken an interest in articles covering the subject, so that I could understand the intended use of Stream.
Along the way, I stumbled upon Optional and came to a good understanding of it; I can now identify the cases where I am comfortable using Optional while keeping my code clean and understandable. However, I can't say the same for Stream, not to mention that it may not have provided the performance gain I imagined it would bring, and it still needs a finally clause in cases where IO is involved.
Here is the main issue I've been trying to wrap my head around, keeping in mind that I have mostly done single-threaded programming until now: when is it preferable to use a Stream, aside from parallel processing?
Is it to do an operation in bulk on a specific subset of a big collection of data, where Collection would have been used when trying to access and manipulate specific objects of the said collection? Although it seems to be the intended use, I'm still not sure that the example I linked at the beginning of my question is your typical use case.
Or is it only a construct used to make code shorter thanks to lambda expressions, at the expense of readability? (Nothing against lambdas if used correctly, but most of the examples of Stream usage I saw were quite illegible, which didn't help my general understanding.)

I've always referred to the description on the Java 8 Streams API page to help me decide between a Collection and a Stream:
"However, [the Streams API] has many benefits. First, the Streams API makes use of several techniques such as laziness and short-circuiting to optimize your data processing queries."
Both a Stream and a Collection can be used to apply a computation on every single element of a dataset before storing it. However, I've found Streams useful if my pipeline includes several distinct filter/sort/map operations for each data element, as the Stream API can optimize these calculations behind the scenes and has parallelization support built in as well.
I agree that readability can be affected both positively and negatively by using a Stream - you're correct that some Stream examples are completely unreadable, and I don't think that readability should be the key decision point for using a Stream over something else.
If you're truly optimizing for performance on a large dataset, consider using a toolset that's purpose-built for massive datasets instead.
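To make "laziness and short-circuiting" concrete, here is a small sketch. The class and method names are made up for illustration. It counts how many elements the pipeline actually pulls before findFirst() is satisfied; on a sequential stream the remaining elements are never touched.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: shows that a sequential stream stops pulling elements
// as soon as the terminal operation has its answer.
final class LazyStreamDemo {

    /** Returns the square of the first even number, counting inspected elements. */
    static Optional<Integer> firstEvenSquare(List<Integer> numbers, AtomicInteger inspected) {
        return numbers.stream()
                .peek(n -> inspected.incrementAndGet()) // runs only for elements that are pulled
                .filter(n -> n % 2 == 0)
                .map(n -> n * n)
                .findFirst();                           // short-circuits on the first match
    }

    public static void main(String[] args) {
        AtomicInteger inspected = new AtomicInteger();
        Optional<Integer> result = firstEvenSquare(Arrays.asList(1, 3, 4, 5, 6, 7), inspected);
        // prints "16 after inspecting 3 elements"
        System.out.println(result.orElse(-1) + " after inspecting " + inspected.get() + " elements");
    }
}
```

An equivalent hand-written loop would need an explicit break to get the same behavior; with a Stream it falls out of the pipeline's design.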

Algorithm and Implementation of Microflow Engine

I am working on a microflow engine (backend): a process flow that is executed at runtime.
Consider the following diagram, where each process is a Java class. Variables flow out of one process and into another. Since the flow is dynamic in nature, very complicated flows are possible, with many gateways (GW) and processes.
Is DFS/BFS a good choice for implementing the runtime engine? Any ideas?

As far as the given example is concerned, it is solved via Depth First Search (DFS), using the output node as the "root" of the tree.
This is because:
For the output to obtain a value, it needs the output of Process4
For Process4 to produce an output, it needs the outputs of Process2 and Process3
For Process2 / Process3 to produce an output, they need the output of GW
For GW to produce an output, it needs the output from Process1
So, the general idea would be to do a DFS from each output, all the way back to the inputs.
This will work almost as described for anything that looks like a Directed Acyclic Graph (DAG, or in fact a Tree), from the point of view of the output.
If a workflow ends up having "cycle edges" or "feedback loops", that is, if it is now a general graph with cycles, then additional consideration will need to be given to avoid infinite traversals and re-evaluation of a Process output.
Finally, if a workflow needs to be aware of the concept of "Time" (in general) then additional consideration will need to be given so that it is ensured that although the graph is evaluated progressively, node-by-node, in the end, it has produced the right output for time instance (n). That is, you want to avoid some Processes producing output AHEAD of the current time instance just because they were called more frequently.
A trivial example of this is already present in the question. Due to DFS, GW will be evaluated for Process2 (or Process3) but it doesn't have to be re-evaluated (for the same time instance) for Process3 (or Process2). When dealing with DAGs, you can simply add an "Evaluated" flag on each Process which is cleared at the beginning of the traversal. Then, DFS would decide to descend down the branch of a node if it finds that it is not yet evaluated. Otherwise, it simply obtains the output of some Process that was evaluated during a previous traversal. (This is why I mention "almost as described" earlier). But, this trivial trick will not work with multiple feedback loops. In that case, you really need to make the nodes "aware" about the passage of time.
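Here is a minimal sketch of the DAG case, assuming each process can be modelled as a function from its inputs' outputs to a single integer. FlowNode and FlowEvaluator are hypothetical names, not part of any engine. The map of already evaluated nodes plays the role of the "Evaluated" flag; creating a fresh evaluator per traversal is the equivalent of clearing the flags. It does not handle feedback loops.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of one process: a name, its upstream nodes, and its logic.
final class FlowNode {
    final String name;
    final List<FlowNode> inputs;
    final Function<List<Integer>, Integer> logic;
    int evaluations = 0; // only here to show that each node runs once

    FlowNode(String name, Function<List<Integer>, Integer> logic, FlowNode... inputs) {
        this.name = name;
        this.logic = logic;
        this.inputs = new ArrayList<>(Arrays.asList(inputs));
    }
}

// One instance per traversal (per "time instance"); DAGs only.
final class FlowEvaluator {
    private final Map<FlowNode, Integer> evaluated = new HashMap<>();

    /** Depth-first: evaluate every input first, then the node itself, at most once. */
    int evaluate(FlowNode node) {
        Integer cached = evaluated.get(node);
        if (cached != null) {
            return cached; // already evaluated during this traversal
        }
        List<Integer> inputValues = new ArrayList<>();
        for (FlowNode input : node.inputs) {
            inputValues.add(evaluate(input));
        }
        int output = node.logic.apply(inputValues);
        node.evaluations++;
        evaluated.put(node, output);
        return output;
    }
}
```

With the flow from the question (Process1 feeding GW, GW feeding Process2 and Process3, both feeding Process4), calling evaluate on Process4 visits GW from both branches but runs its logic only once.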
For more information and for a really thorough exposition of related issues, I would strongly recommend that you go through Bruno Preiss' Y logic simulator. Although it is in C++ and is a logic simulator, it goes through exactly the same considerations that are faced by any similar system of interconnected "abstract nodes" that are supposed to be carrying out some form of "processing".
Hope this helps.

Suggested Architecture for a batch with multi-threading and common resources

I need to write a batch job in Java that uses multiple threads to perform various operations on a set of data.
I have almost 60k rows of data and need to do different operations on them. Some of the operations work on the same data but write to different outputs.
So the question is: is it right to create this big 60k-element ArrayList and pass it through the various operators, so that each can add its own output, or is there a better architecture that someone can suggest?
EDIT:
I need to create these objects:
MyObject, with an ArrayList of MyObject2, 3 different Integers, and 2 Strings.
MyObject2, with 12 floats.
MyBigObject, with an ArrayList of MyObject (usually 60k elements) and some Strings.
My different operators work on the same ArrayList of MyObject2 but write to different Integers. For example, Operators1 reads from the ArrayList of MyObject2, performs some calculation, and writes its result to MyObject.Integer1; Operators2 reads from the ArrayList of MyObject2, performs a different calculation, and writes its result to MyObject.Integer2; and so on.
Is this architecture "safe"? The ArrayList of MyObject2 has to be read-only, never modified by any operator.
EDIT:
I don't have any code yet, because I'm working out the architecture first and will start writing afterwards.
Trying to rephrase my question:
Is it OK, in a batch written in pure Java (without any framework; I'm not using Spring Batch, for example, because that would be like shooting a fly with a shotgun for my project), to create a macro object and pass it around so that the different threads can read from the same data but write their results to different data?
Can it be dangerous if different threads read from the same data at the same time?

It depends on your operations.
Generally it's possible to partition work on a dataset horizontally or vertically.
Horizontally means splitting your dataset into several smaller sets and letting each individual thread handle one such set. This approach is the safest, yet usually slower, because each individual thread will do several different operations. It's also a bit more complex to reason about for the same reason.
Vertically means each thread performs some operation on a specific "field" or "column", or whatever the individual data unit in your dataset is.
This is generally easier to implement (each thread does one thing on the whole set) and can be faster. However, each operation on the dataset needs to be independent of your other operations.
If you are unsure about multi-threading in general, I recommend doing work horizontally in parallel.
Now to the question of whether it is OK to pass your full dataset around (some ArrayList): sure it is! It's just a reference, so passing it costs next to nothing. What matters are the operations you perform on the dataset.
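As a rough sketch of the vertical approach, assuming the shared list is never modified while the operators run: two tasks read the same list concurrently and each writes only its own result field. SharedReadBatch, Results, sum and max are invented for this example and simply stand in for your operators and their Integer outputs.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: concurrent readers of one list, one output field per operator.
final class SharedReadBatch {

    /** Each field is written by exactly one operator. */
    static final class Results {
        int sum; // written only by the first operator
        int max; // written only by the second operator
    }

    static Results run(List<Integer> data) {
        // Read-only view: any accidental write from an operator fails fast.
        List<Integer> readOnly = Collections.unmodifiableList(data);
        Results results = new Results();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<?> sumTask = pool.submit(() -> {
                int s = 0;
                for (int v : readOnly) s += v;
                results.sum = s;
            });
            Future<?> maxTask = pool.submit(() -> {
                int m = Integer.MIN_VALUE;
                for (int v : readOnly) m = Math.max(m, v);
                results.max = m;
            });
            // get() waits for completion and makes each task's write visible here.
            sumTask.get();
            maxTask.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("operator failed", e);
        } finally {
            pool.shutdown();
        }
        return results;
    }
}
```

Concurrent reads of data that nobody modifies are safe; trouble starts only when two threads write the same field, or when one thread writes what another is reading.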

How to publish to KDB Ticker Plant from Java effectively

We have market data handlers which publish quotes to KDB Ticker Plant. We use the exxeleron q java library for this purpose. Unfortunately, latency is quite high: hundreds of milliseconds when we try to insert a batch of records. Can you suggest some latency tips for the KDB + Java binding, as we need to publish quite fast?

There's not enough information in this message to give a fully qualified response, but having done the same with Java+KDB it really comes down to eliminating the possibilities. This is common sense, really, nothing super technical.
Make sure you're inserting asynchronously.
Verify that it's exxeleron q java that is causing the latency. I don't think there are hundreds of milliseconds of overhead there.
Verify that the CPU your tickerplant is on isn't overloaded. Consider re-nicing, core binding, etc.
Analyse your network latencies. Also, if you're using Linux, there are a few TCP tweaks you can try, e.g. TCP_QUICKACK.
As you're using Java, be smarter about garbage collection. It's highly configurable, although not directly controllable.
If you find out that the tickerplant is the source of the latency, you could either recode it to not write to disk, or get a faster local disk.
There are many more suggestions, but the question is a bit too ambiguous.
EDIT
Back in 2007, with old(ish) servers and a very old version of KDB+ we were managing an insertion rate of 90k rows per second using the vanilla c.java. That was after many rounds of the above points. I'm sure you can achieve way more now, it's a matter of finding where the bottlenecks are and fixing them one by one.

Make sure the data published to the ticker plant is batched: wait briefly to accumulate a few rows and insert them as one batch, rather than inserting row by row as each new record arrives.
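A minimal sketch of that batching idea, assuming a size-based flush only. BatchingPublisher and its sink callback are hypothetical: sink stands in for whatever bulk insert call your KDB binding provides and is not part of the exxeleron API. A real publisher would probably also flush on a short timer so that a quiet market does not leave rows sitting in the buffer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical buffer in front of the real publish call.
final class BatchingPublisher<T> {
    private final int maxBatchSize;
    private final Consumer<List<T>> sink; // assumed stand-in for the bulk insert
    private final List<T> buffer = new ArrayList<>();

    BatchingPublisher(int maxBatchSize, Consumer<List<T>> sink) {
        this.maxBatchSize = maxBatchSize;
        this.sink = sink;
    }

    /** Buffers one row and sends a batch once the buffer is full. */
    synchronized void publish(T row) {
        buffer.add(row);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered as a single batch; call this on shutdown too. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // hand over a copy, then reuse the buffer
        buffer.clear();
    }
}
```

The batch size trades latency for throughput: larger batches mean fewer round trips but a longer wait for the first row in each batch.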

General methods for optimizing program for speed

What are some generic methods for optimizing a Java program for speed? I am using a DOM parser to parse an XML file and store certain words in an ArrayList, remove any duplicates, and then spell check those words by creating a Google search URL for each word, fetching the HTML document, locating the corrected word, and saving it to another ArrayList.
Any help would be appreciated! Thanks.

Why do you need to improve performance? From your explanation, it is pretty obvious that the big bottleneck here (or performance hit) is going to be the IO resulting from the fact that you are accessing a URL.
This will surely dwarf by orders of magnitude any minor improvements you make in data structures or XML frameworks.
It is a good general rule of thumb that your big performance problems will involve IO. Humorously enough, I am at this very moment waiting for a database query to return in a batch process. It has been running for almost an hour. But I welcome any suggested improvements to my XML parsing library nevertheless!
Here are my general methods:
Does your program perform any obviously expensive task from the perspective of latency (IO)? Do you have enough logging to see that this is where the delay is (if significant)?
Is your program prone to lock contention (i.e. can it wait around, doing nothing, waiting for some resource to be "free")? Perhaps you are locking an entire Map whilst you make an expensive calculation for a value to store, blocking other threads from accessing the map.
Is there some obvious algorithm (perhaps for data-matching, or sorting) that might have poor characteristics?
Run up a profiler (e.g. jvisualvm, which ships with the JDK itself) and look at your code hotspots. Where is the JVM spending its time?

SAX is faster than DOM. If you don't want to go through the ArrayList searching for duplicates, put everything in a LinkedHashMap -- no duplicates, and you still get the order-of-insertion that ArrayList gives you.
But the real bottleneck is going to be sending the HTTP request to Google, waiting for the response, then parsing the response. Use a spellcheck library, instead.
Edit: But take my educated guesses with a grain of salt. Use a code profiler to see what's really slowing down your program.

Generally the best method is to figure out where your bottleneck is, and fix it. You'll usually find that you spend 90% of your time in a small portion of your code, and that's where you want to focus your efforts.
Once you've figured out what's taking a lot of time, focus on improving your algorithms. For example, removing duplicates from an ArrayList can be O(n²) complexity if you're using the most obvious algorithm, but that can be reduced to O(n) if you leverage the correct data structures.
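To illustrate the duplicate-removal point, here are both versions side by side. Dedup is a made-up class name, and LinkedHashSet is just one reasonable choice of "correct data structure" (it keeps insertion order); treat it as a sketch rather than the only way to do it.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Illustrative comparison of two ways to drop duplicates while keeping order.
final class Dedup {

    /** O(n^2): List.contains rescans the result list for every word. */
    static List<String> quadratic(List<String> words) {
        List<String> out = new ArrayList<>();
        for (String w : words) {
            if (!out.contains(w)) {
                out.add(w);
            }
        }
        return out;
    }

    /** O(n) on average: a hash-based set rejects duplicates in constant time. */
    static List<String> linear(List<String> words) {
        return new ArrayList<>(new LinkedHashSet<>(words));
    }
}
```

Both return the same list; the difference only shows up in running time as the input grows.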
Once you've figured out which portions of your code are taking the most time, and you can't figure out how best to fix it, I'd suggest narrowing down your question and posting another question here on StackOverflow.
Edit
As @oxbow_lakes so snidely put it, not all performance bottlenecks are to be found in the code's big-O characteristics. I certainly had no intention to imply that they were. Since the question was about "general methods" for optimizing, I tried to stick to general ideas rather than talking about this specific program. But here's how you can apply my advice to this specific program:
See where your bottleneck is. There are a number of ways to profile your code, ranging from high-end, expensive profiling software to really hacky. Chances are, any of these methods will indicate that your program spends 99% of its time waiting for a response from Google.
Focus on algorithms. Right now your algorithm is (roughly):
Parse the XML
Create a list of words
For each word
Ping Google for a spell check.
Return results
Since most of your time is spent in the "ping Google" phase, an obvious way to fix this would be to avoid doing that step more times than necessary. For example:
Parse the XML
Create a list of words
Send list of words to spelling service.
Parse results from spelling service.
Return results
Of course, in this case, the biggest speed boost would probably come from using a spell checker that runs on the same machine, but that isn't always an option. For example, TinyMCE runs as a JavaScript program within the browser, and it can't afford to download the entire dictionary as part of the web page. So it packages up all the words into a distinct list and performs a single AJAX request to get a list of those words that aren't in the dictionary.
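The "one request for all words" shape could look roughly like this in Java. BatchSpellCheck and the spellingService function are hypothetical: the function stands in for a remote call that accepts a set of distinct words and returns the ones it does not recognise. No real spelling API is implied.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Illustrative only: collapse the word list, then make a single round trip.
final class BatchSpellCheck {

    /**
     * spellingService is an assumed stand-in for the remote call:
     * distinct words in, unrecognised words out.
     */
    static Set<String> findMisspelled(List<String> words,
                                      Function<Set<String>, Set<String>> spellingService) {
        Set<String> distinct = new LinkedHashSet<>(words); // drop duplicates, keep order
        return spellingService.apply(distinct);            // one call instead of one per word
    }
}
```

However slow the service is, the program now pays its latency once rather than once per word.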

These folks are probably right, but a few random pauses will turn "probably" into "definitely, and here's why".
