I have a rather large dataset, ~68 million data points. The data is currently stored in MongoDB and I have written a Java program that goes through the data to link data points together and place them in the Neo4j database using Cypher commands. I ran this program overnight with a test set of data (~1.5 million points) and it worked. Now when I try to import the whole dataset, the program is extremely slow: it ran the whole weekend and only ~350,000 data points have made it in. Through some short testing, it seems like Neo4j is the bottleneck. It's been half an hour since I stopped the Java program, but Neo4j's CPU usage is still at 100% and new nodes (from the Java program) are still being added. Is there any way to overcome this bottleneck? I've thought about multithreading, but since I'm trying to create a network, there are lots of dependencies and non-thread-safe operations being performed. Thanks for your help!
EDIT: The data I have is a list of users. Each record contains the user's id and an array of the ids of that user's friends. My Cypher queries look a little like this:
"u:USER {id:" + currentID + "}) CREATE (u)-[:FRIENDS {ts:" + timeStamp}]->(u" + connectionID + ":USER {id:" + connectionID + "})"
Sorry if this is really terrible, pretty new to this
You should first look at this:
neo4j import slowing down
If you still decide to DIY, there are a few things you should look out for. First, make sure you don't try to import all your data in one transaction, otherwise your code will spend most of its time suspended by the garbage collector. Second, ensure you have given plenty of memory to the Neo4j process (or to your application, if you're using an embedded instance of Neo4j). 68 million nodes is trivial for Neo4j, but if the Cypher you're generating is constantly looking things up, e.g. to create new relationships, then you'll run into severe paging issues unless you allocate enough memory. Finally, if you are looking up nodes by properties (rather than by id), then you should be using labels and schema indexes:
http://neo4j.com/news/labels-and-schema-indexes-in-neo4j/
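For illustration, here is a rough sketch of what a batched, parameterized import can look like with the Neo4j Java (Bolt) driver, assuming a reasonably recent Neo4j and a schema index on :USER(id); the connection details and the buildNextBatch() helper that pulls rows out of MongoDB are placeholders:

// Sketch only: batched import with parameters via the Bolt driver. The URI,
// credentials and buildNextBatch() are placeholders for your own setup.
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

import java.util.List;
import java.util.Map;

public class FriendImporter {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            while (true) {
                // Each map is one friendship: {id: ..., friendId: ..., ts: ...}.
                List<Map<String, Object>> batch = buildNextBatch(10_000);
                if (batch.isEmpty()) break;

                // One transaction per batch; parameters instead of string concatenation.
                session.writeTransaction(tx -> tx.run(
                        "UNWIND $rows AS row " +
                        "MERGE (u:USER {id: row.id}) " +
                        "MERGE (f:USER {id: row.friendId}) " +
                        "MERGE (u)-[:FRIENDS {ts: row.ts}]->(f)",
                        Values.parameters("rows", batch)).consume());
            }
        }
    }

    private static List<Map<String, Object>> buildNextBatch(int size) {
        return List.of();   // placeholder: pull the next ~size rows from MongoDB here
    }
}

MERGE with a schema index on :USER(id) keeps the lookups cheap, and one reasonably sized transaction per batch keeps the garbage collector happy.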
Did you configure the neo4j.properties and neo4j-wrapper.conf files?
It is highly recommended to adjust the values according to the amount of RAM available on your machine.
In conf/neo4j-wrapper.conf, for a 12 GB RAM server I usually use:
wrapper.java.initmemory=8000
wrapper.java.maxmemory=8000
In conf/neo4j.properties I set:
dbms.pagecache.memory=8000
See http://neo4j.com/blog/import-10m-stack-overflow-questions/ for a complete example of importing 10M nodes in a few minutes; it's a good starting point.
An SSD is also recommended to speed up the import.
One thing I learned when loading bulk data into a database was to switch off indexing temporarily on the destination table(s). Otherwise every new record added caused a separate update to the indexes, resulting in a lot of work on the disk. It was much quicker to re-index the whole table in a separate operation after the data load was complete. YMMV.
I'm currently trying to read large amounts of data into my Java application using the official Bolt driver. I'm having issues because the graph is fairly large (~17k nodes, ~500k relationships) and of course I'd like to read this in chunks for memory efficiency. What I'm trying to get is a mix of fields between the origin and destination nodes, as well as the relationship itself. I tried writing a pagination query:
MATCH (n:NodeLabel)-[r:RelationshipLabel]->(m:NodeLabel)
WITH r.some_date AS some_date, r.arrival_times AS arrival_times,
r.departure_times AS departure_times, r.path_ids AS path_ids,
n.node_id AS origin_node_id, m.node_id AS dest_node_id
ORDER BY id(r)
RETURN some_date, arrival_times, departure_times, path_ids,
origin_node_id, dest_node_id
LIMIT 5000
(I changed some of the label and field naming so it's not obvious what the query is for)
The idea was that I'd use SKIP on subsequent queries to read more data. However, at 5000 rows per read this takes roughly 7 seconds per read, presumably because of the full scan the ORDER BY forces, and if I add SKIP the execution time and memory usage go up significantly. This is way too long to read the whole thing. Is there any way I can speed up the query, or stream the results in chunks into my app? In general, what is the best approach to reading large amounts of data?
Thanks in advance.
Instead of SKIP, from the second call onwards you can filter with id(r) > [last received id(r)]. That should actually reduce the processing time as you go.
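For illustration, a sketch of that keyset-style pagination with the Bolt driver, reusing the anonymized labels and properties from the question; lastRelId starts below any real relationship id and carries the largest id(r) of each page into the next query:

// Sketch only: keyset pagination with the Neo4j Bolt Java driver.
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

class PagedReader {
    static void readAll(Session session) {
        long lastRelId = -1;
        while (true) {
            Result result = session.run(
                "MATCH (n:NodeLabel)-[r:RelationshipLabel]->(m:NodeLabel) " +
                "WHERE id(r) > $lastRelId " +
                "WITH r, n, m ORDER BY id(r) LIMIT 5000 " +
                "RETURN id(r) AS rel_id, r.some_date AS some_date, " +
                "       r.arrival_times AS arrival_times, r.departure_times AS departure_times, " +
                "       r.path_ids AS path_ids, n.node_id AS origin_node_id, m.node_id AS dest_node_id",
                Values.parameters("lastRelId", lastRelId));

            int rows = 0;
            while (result.hasNext()) {
                Record rec = result.next();
                lastRelId = rec.get("rel_id").asLong();   // cursor for the next page
                // ... process the remaining fields here ...
                rows++;
            }
            if (rows < 5000) break;                       // last (partial) page reached
        }
    }
}

Because each page filters on id(r) before ordering, the query no longer produces and then discards the rows that an ever-growing SKIP would.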
We have market data handlers which publish quotes to a KDB Ticker Plant. We use the exxeleron q java library for this purpose. Unfortunately latency is quite high: hundreds of milliseconds when we try to insert a batch of records. Can you suggest some latency tips for the KDB + Java binding, as we need to publish quite fast?
There's not enough information in this message to give a fully qualified response, but having done the same with Java+KDB it really comes down to eliminating the possibilities. This is common sense, really, nothing super technical.
Make sure you're inserting asynchronously.
Verify it's exxeleron q java that is causing the latency. I don't think there's 100's of millis overhead there.
Verify the CPU that your tickerplant is on isn't overloaded. Consider re-nicing, core binding, etc
Analyse your network latencies. Also, if you're using Linux, there's a few tcp tweaks you can try, e.g. TCP_QUICKACK
As you're using Java, be smarter about garbage collection. It's highly configurable, although not directly controllable.
If you find out the tickerplant is the source of latency, you could either recode it to not write to disk, or get a faster local disk.
There's so many more suggestions, but the question is a bit too ambiguous.
EDIT
Back in 2007, with old(ish) servers and a very old version of KDB+ we were managing an insertion rate of 90k rows per second using the vanilla c.java. That was after many rounds of the above points. I'm sure you can achieve way more now, it's a matter of finding where the bottlenecks are and fixing them one by one.
Make sure the data published to the tickerplant is batched: wait a little and insert, say, a few rows of data as one batch, rather than inserting row by row as each new record comes in.
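A minimal illustration of that batching idea in Java; it deliberately avoids any KDB-specific API, so publishBatch() and Quote below are placeholders for whatever call your handler makes (e.g. an async upd via the q java library) and your record type:

// Sketch only: accumulate quotes and flush them as a batch either when the
// batch is full or when a small time window has elapsed.
import java.util.ArrayList;
import java.util.List;

class BatchingPublisher {
    private static final int MAX_BATCH = 500;
    private static final long MAX_WAIT_MS = 50;

    private final List<Quote> buffer = new ArrayList<>();
    private long lastFlush = System.currentTimeMillis();

    synchronized void onQuote(Quote q) {
        buffer.add(q);
        if (buffer.size() >= MAX_BATCH
                || System.currentTimeMillis() - lastFlush >= MAX_WAIT_MS) {
            flush();
        }
    }

    private void flush() {
        if (buffer.isEmpty()) return;
        publishBatch(new ArrayList<>(buffer));   // one async call for the whole batch
        buffer.clear();
        lastFlush = System.currentTimeMillis();
    }

    private void publishBatch(List<Quote> batch) {
        // placeholder: send the batch to the tickerplant asynchronously here
    }
}

class Quote { /* fields of one market-data record */ }

A real implementation would also flush from a timer so that a quiet market doesn't leave rows sitting in the buffer.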
I am working on a Minecraft network which has several servers manipulating 'user objects', which are just Mongo documents. After a user object is modified it needs to be written to the database immediately, otherwise it may be overwritten by other servers (which have an older version of the user object), but sometimes hundreds of objects need to be written away in a short amount of time (a few seconds). My question is: how can I easily write objects to a MongoDB database without really overloading the database?
I have been thinking up an idea but I have no idea if it is relevant:
- Create some sort of queue in another thread: every time a data object needs to be saved to the database it goes into the queue, and the 'queue thread' then saves the objects one by one at some interval.
Thanks in advance
By the way, I'm using Morphia as the framework in Java.
"hundreds of objects [...] in a few seconds" doesn't sound that much. How much can you do at the moment?
The setting most important for the speed of write operations is the WriteConcern. What are you using at the moment and is this the right setting for your project (data safety vs speed)?
If you need to do many write operations at once, you can probably speed things up with bulk operations. They were added in MongoDB 2.6 and Morphia supports them as well; see this unit test.
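For a rough idea of what that looks like, here is a sketch with the plain MongoDB Java driver (Morphia sits on top of the same driver); the database, collection and client setup are illustrative, and a real application would reuse one MongoClient:

// Sketch only: replace/upsert many user documents in one round trip.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOneModel;
import com.mongodb.client.model.ReplaceOptions;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

class UserFlusher {
    static void flush(List<Document> modifiedUsers) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> users =
                    client.getDatabase("minecraft").getCollection("users");

            List<WriteModel<Document>> ops = new ArrayList<>();
            for (Document user : modifiedUsers) {
                ops.add(new ReplaceOneModel<>(
                        Filters.eq("_id", user.get("_id")),
                        user,
                        new ReplaceOptions().upsert(true)));
            }

            // Unordered lets the server apply the writes in parallel.
            users.bulkWrite(ops, new BulkWriteOptions().ordered(false));
        }
    }
}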
I would be very cautious with a queue:
Do you really need it? Depending on your hardware and configuration you should be able to do hundreds or even thousands of write operations per second.
Is async really the best approach for you? The producer of the write operation / message can only assume his change has been applied, but it probably has not and is still waiting in the queue to be written. Is this the intended behaviour?
Does it make your life easier? You need to know another piece of software, which adds many new and most likely unforeseen problems.
If you need to scale your writes, why not use sharding? No additional technology and your code will behave the same with and without it.
You might want to read the following blogpost on why you probably want to avoid queues for this kind of operation in general: http://widgetsandshit.com/teddziuba/2011/02/the-case-against-queues.html
I have to process a big array of strings in Java which can't be kept in memory. Because of this, the array must be processed in several chunks. The size of each chunk can be specified by the program's user, but if the user doesn't specify a size, the program must decide the most appropriate size.
My first thought was to use an on-disk database like Cassandra. That way, every time I want to process a chunk of the big array, I would query the database for it.
The problem I saw is that I'd need to keep track of the available JVM and RAM memory, which I think would be too difficult. I would also have to figure out how to set the size of each chunk to make the most of the available memory without filling it up.
For that, I've thought about using something like Memcached or SSDB (an alternative to Redis that allows you to store part of the database on disk - https://github.com/ideawu/ssdb), but I'm not sure that's the best option. The idea is that Memcached or SSDB would manage the exchange of data between memory and disk without me having to implement any control to avoid filling memory.
Honestly, I don't much like the idea of adding dependencies (Memcached or SSDB) just to make my program work.
So my question is: are there any good alternatives for solving my problem? Is the reasoning above wrong?
Thanks in advance!
CLARIFICATIONS
---------------
What kind of processing do you have to do?
The processing consists of data analysis techniques that extract information from the existing data (the big array).
How big is the array? How big are the strings? Is your processing random access or sequential? Why can't you just use a file?
The size of the array can change; it doesn't have a fixed value. The idea is that a user (not an end-user) can process an array in chunks when that's necessary for them. For example, one user may want to process an array of 100,000 elements in several chunks, while another user doesn't need to process an array in chunks until its size exceeds 1,000,000 (depending on how much memory each user has).
My processing is sequential.
I don't use a file because other answers on this site recommend using a database rather than a file. Moreover, if I used a file, I would have to manage the available memory myself to prevent it from filling up (and causing an error in the program).
Where are the Strings you wanna process? Are they already stored somewhere, or do you generate them somehow on the fly?
The strings are obtained from users and were previously kept entirely in an array. Now, the idea is to store the strings passed by the users in the database, and later (when the user decides) process them; the processing doesn't have to happen immediately after storing the strings in the database.
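To make the chunking itself concrete, here is a storage-agnostic sketch: the strings are consumed through an Iterator (which could be backed by a file, a database cursor, or SSDB) and processed sequentially in chunks of a user-supplied size, falling back to an arbitrary default when none is given. All names here are illustrative:

// Sketch only: sequential processing in chunks; only one chunk is ever in memory.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class ChunkedProcessor {
    private static final int DEFAULT_CHUNK_SIZE = 10_000;   // illustrative default

    static void process(Iterator<String> source, Integer requestedChunkSize) {
        int chunkSize = (requestedChunkSize != null && requestedChunkSize > 0)
                ? requestedChunkSize
                : DEFAULT_CHUNK_SIZE;

        List<String> chunk = new ArrayList<>(chunkSize);
        while (source.hasNext()) {
            chunk.add(source.next());
            if (chunk.size() == chunkSize) {
                analyze(chunk);        // process one chunk, then release it
                chunk.clear();
            }
        }
        if (!chunk.isEmpty()) {
            analyze(chunk);            // trailing partial chunk
        }
    }

    private static void analyze(List<String> chunk) {
        // placeholder for the data-analysis step from the question
    }
}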
What is the best way of storing key-value pairs of Strings in a file in Java, that is scalable (can work with a large number of pairs, i.e. doesn't read or write entire file on access), but is as lightweight as possible?
I am asking this because even the lightest database libraries, like SQLite and H2, seem like overkill for this purpose, and are even impossible to use for ME programs (although I would need this mainly for SE programs for now).
Oracle Berkeley DB Java Edition allows you to store key-value objects; it is simple to use and administer, and scales up to the heavens (or so). At 820k it is not that big.
But if you are thinking about scaling down to J2ME, you may try TinySQL.
Pros:
It is small (93k!)
It is embeddable
It uses DBF or text files to store data, so they are easy to read.
Cons:
It is an old unmaintained project
It is not designed to work in J2ME, but since it can work on JDK 1.1.8 it won't be hard to make it work in J2ME. Of course you will have to change some code from using RandomAccessFile to FileConnection and the like, but at least you won't need to mess with generics-related code.
It is not very fast, because it does not use indexes, so you need to try it and see if it fits your needs.
It is not feature complete, just gives you a small subset of SQL
There are some good ideas in this SO answer. My own inclination would be to use NoSQL or similar, although that discussion is more centered on hashmaps. Either will do, I believe.
For a static set of key-value pairs, Dan Bernstein's cdb comes to mind. To quote from the cdb description:
cdb is a fast, reliable, simple package for creating and reading constant databases. Its database structure provides several features:
Fast lookups: A successful lookup in a large database normally takes just two disk accesses. An unsuccessful lookup takes only one.
Low overhead: A database uses 2048 bytes, plus 24 bytes per record, plus the space for keys and data.
No random limits: cdb can handle any database up to 4 gigabytes. There are no other restrictions; records don't even have to fit into memory. Databases are stored in a machine-independent format.
Fast atomic database replacement: cdbmake can rewrite an entire database two orders of magnitude faster than other hashing packages.
Fast database dumps: cdbdump prints the contents of a database in cdbmake-compatible format.
cdb is designed to be used in mission-critical applications like e-mail. Database replacement is safe against system crashes. Readers don't have to pause during a rewrite.
It appears there is a Java implementation available at http://www.strangegizmo.com/products/sg-cdb/ with a BSD license.
The obvious initial thought is to use Properties, as these can be streamed, but they are ultimately loaded in full; you also couldn't partially read a buffered set.
With that in mind, you could look at this other SO response. It covers navigating (albeit imperfectly) around a stream so that you can reposition your read:
changing the index positioning in InputStream
With a separate index (say by initial character) you could intelligently reposition the cursor in the stream, perhaps.
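As a sketch of that idea, assuming a hypothetical file layout of one key=value pair per line and an in-memory index from key to byte offset (a finer-grained variant of the per-initial-character index suggested above):

// Sketch only: lookups by seeking to a recorded byte offset instead of reading
// the whole file. Assumes one "key=value" line per entry.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

class OffsetIndexedStore {
    // Build the offset index in one sequential pass over the file.
    static Map<String, Long> buildIndex(RandomAccessFile file) throws IOException {
        Map<String, Long> index = new HashMap<>();
        file.seek(0);
        long offset = file.getFilePointer();
        String line;
        while ((line = file.readLine()) != null) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                index.put(line.substring(0, eq), offset);
            }
            offset = file.getFilePointer();
        }
        return index;
    }

    // Look up a single value without reading the whole file.
    static String get(RandomAccessFile file, Map<String, Long> index, String key)
            throws IOException {
        Long offset = index.get(key);
        if (offset == null) return null;
        file.seek(offset);
        String line = file.readLine();
        return line.substring(line.indexOf('=') + 1);
    }
}

Note that this variant keeps every key in memory; the coarser per-initial-character index trades that memory for some extra scanning around the seek position.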
Chronicle Map is a modern off-heap key-value store for Java. It can be (optionally) persisted to disk, acting like an eventually-consistent database. Chronicle Map features:
Queries faster than 1 µs, in some use cases as fast as 100 ns (see the comparison with other similar libraries for Java).
Perfect scalability for processing from multiple threads and even processes, thanks to segmented shared-nothing design and multi-level locks, allowing multiple operations to access the same data concurrently.
Very low overhead per entry, less than 20 bytes / entry is achievable.
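A minimal persisted Chronicle Map sketch, assuming the chronicle-map dependency is on the classpath; the file name and sizing hints are illustrative:

// Sketch only: an off-heap String-to-String map persisted to a single file.
import net.openhft.chronicle.map.ChronicleMap;

import java.io.File;
import java.io.IOException;

class KvStoreDemo {
    public static void main(String[] args) throws IOException {
        try (ChronicleMap<String, String> map = ChronicleMap
                .of(String.class, String.class)
                .entries(1_000_000)                          // expected number of pairs
                .averageKey("someKey")                       // sizing hints for variable-length types
                .averageValue("some value of typical length")
                .createPersistedTo(new File("kv-store.dat"))) {

            map.put("hello", "world");
            System.out.println(map.get("hello"));
        }   // entries remain in kv-store.dat after the map is closed
    }
}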