neo4j - batch insertion using neo4j rest graph db - java

I'm using version 2.0.1.
I have hundreds of thousands of nodes that need to be inserted. My neo4j graph db is on a standalone server, and I'm using the RestAPI through the neo4j rest graph db library to achieve this.
However, performance is slow. I've chopped my queries into batches, sending 500 cypher statements in a single HTTP call. The result I'm getting looks like:
10:38:10.984 INFO commit
10:38:13.161 INFO commit
10:38:13.277 INFO commit
10:38:15.132 INFO commit
10:38:15.218 INFO commit
10:38:17.288 INFO commit
10:38:19.488 INFO commit
10:38:22.020 INFO commit
10:38:24.806 INFO commit
10:38:27.848 INFO commit
10:38:31.172 INFO commit
10:38:34.767 INFO commit
10:38:38.661 INFO commit
And so on.
The query that I'm using is as follows:
MERGE (a{main:{val1},prop2:{val2}}) MERGE (b{main:{val3}}) CREATE UNIQUE (a)-[r:relationshipname]-(b);
My code is this:
private RestAPI restAPI;
private RestCypherQueryEngine engine;
private GraphDatabaseService graphDB = new RestGraphDatabase("http://localdomain.com:7474/db/data/");
...
restAPI = ((RestGraphDatabase) graphDB).getRestAPI();
engine = new RestCypherQueryEngine(restAPI);
...
Transaction tx = graphDB.getRestAPI().beginTx();
try {
    int ctr = 0;
    while (isExists) {
        ctr++;
        // execute query here through engine.query()
        if (ctr % 500 == 0) {
            tx.success();
            tx.close();
            tx = graphDB.getRestAPI().beginTx();
            LOGGER.info("commit");
        }
    }
    tx.success();
} catch (FileNotFoundException | NumberFormatException | ArrayIndexOutOfBoundsException e) {
    tx.failure();
} finally {
    tx.close();
}
Thanks!
UPDATED BENCHMARK.
Sorry for the confusion, the benchmark I posted isn't accurate and is not for 500 queries. My ctr variable isn't actually counting the number of cypher queries.
So now I'm getting roughly 500 queries per 3 seconds, and that 3 seconds keeps increasing as well. It's still far slower than embedded neo4j.

If you have the ability to use Neo4j 2.1.0-M01 (don't use it in prod yet!!), you could benefit from its new features. If you create/generate a CSV file like this:
val1,val2,val3
a_value,another_value,yet_another_value
a,b,c
....
you'd only need to launch the following code:
final GraphDatabaseService graphDB = new RestGraphDatabase("http://server:7474/db/data/");
final RestAPI restAPI = ((RestGraphDatabase) graphDB).getRestAPI();
final RestCypherQueryEngine engine = new RestCypherQueryEngine(restAPI);
final String filePath = "file://C:/your_file_path.csv";
engine.query("USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM '" + filePath
        + "' AS csv MERGE (a{main:csv.val1,prop2:csv.val2}) MERGE (b{main:csv.val3})"
        + " CREATE UNIQUE (a)-[r:relationshipname]->(b);", null);
You'd have to make sure that the file can be accessed from the machine your server is installed on.
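If you're generating that CSV from Java anyway, a plain writer is enough. This is only a minimal sketch: records and MyRecord are hypothetical stand-ins for however your data is produced, and it does no CSV escaping.
// Minimal sketch of writing the CSV consumed by the LOAD CSV statement above.
// Needs java.io.BufferedWriter, java.nio.file.Files, java.nio.file.Paths.
try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("C:/your_file_path.csv"))) {
    writer.write("val1,val2,val3");            // header row expected by WITH HEADERS
    writer.newLine();
    for (MyRecord record : records) {          // 'records'/'MyRecord' are placeholders
        writer.write(record.getVal1() + "," + record.getVal2() + "," + record.getVal3());
        writer.newLine();
    }
}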
Take a look at my server plugin that does this for you on the server. If you build it and put it in the plugins folder, you can use the plugin from Java as follows:
final RestAPI restAPI = new RestAPIFacade("http://server:7474/db/data");
final RequestResult result = restAPI.execute(RequestType.POST, "ext/CSVBatchImport/graphdb/csv_batch_import",
        new HashMap<String, Object>() {
            {
                put("path", "file://C:/.../neo4j.csv");
            }
        });
EDIT:
You can also use a BatchCallback in the Java REST wrapper to boost performance; it removes the transactional boilerplate code as well. You could write your script similar to this:
final RestAPI restAPI = new RestAPIFacade("http://server:7474/db/data");
int counter = 0;
List<Map<String, Object>> statements = new ArrayList<>();
while (isExists) {
    statements.add(new HashMap<String, Object>() {
        {
            put("val1", "abc");
            put("val2", "abc");
            put("val3", "abc");
        }
    });
    if (++counter % 500 == 0) {
        restAPI.executeBatch(new Process(statements));
        statements = new ArrayList<>();
    }
}
static class Process implements BatchCallback<Object> {
    private static final String QUERY = "MERGE (a{main:{val1},prop2:{val2}}) MERGE (b{main:{val3}}) CREATE UNIQUE (a)-[r:relationshipname]-(b);";

    private List<Map<String, Object>> params;

    Process(final List<Map<String, Object>> params) {
        this.params = params;
    }

    @Override
    public Object recordBatch(final RestAPI restApi) {
        for (final Map<String, Object> param : params) {
            restApi.query(QUERY, param);
        }
        return null;
    }
}
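One caveat with the loop above: statements accumulated after the last full batch of 500 are never sent. A final flush after the while loop (a small sketch reusing the same names) takes care of that:
// Send whatever is left over after the last full batch of 500.
if (!statements.isEmpty()) {
    restAPI.executeBatch(new Process(statements));
}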

Related

Akka stream stops processing data

When I run the stream below, it does not receive any subsequent data once it has started.
final long HOUR = 3600000;
final long PAST_HOUR = System.currentTimeMillis() - HOUR;

private final static ActorSystem actorSystem = ActorSystem.create(Behaviors.empty(), "as");

protected static ElasticsearchParams constructElasticsearchParams(
        String indexName, String typeName, ApiVersion apiVersion) {
    if (apiVersion == ApiVersion.V5) {
        return ElasticsearchParams.V5(indexName, typeName);
    } else if (apiVersion == ApiVersion.V7) {
        return ElasticsearchParams.V7(indexName);
    } else {
        throw new IllegalArgumentException("API version " + apiVersion + " is not supported");
    }
}

String queryStr = "{ \"bool\": { \"must\" : [{\"range\" : {"
        + "\"timestamp\" : { "
        + "\"gte\" : " + PAST_HOUR
        + " }} }]}} ";

ElasticsearchConnectionSettings connectionSettings =
        ElasticsearchConnectionSettings.create("****")
                .withCredentials("****", "****");

ElasticsearchSourceSettings sourceSettings =
        ElasticsearchSourceSettings.create(connectionSettings)
                .withApiVersion(ApiVersion.V7);

Source<ReadResult<Stats>, NotUsed> dataSource =
        ElasticsearchSource.typed(
                constructElasticsearchParams("data", "_doc", ApiVersion.V7),
                queryStr,
                sourceSettings,
                Stats.class);

dataSource.buffer(10000, OverflowStrategy.backpressure());
dataSource.backpressureTimeout(Duration.ofSeconds(1));

dataSource
        .log("error")
        .runWith(Sink.foreach(a -> System.out.println(a)), actorSystem);
produces this output:
ReadResult(id=1656107389556,source=Stats(size=0.09471),version=)
Data is continually being written to the data index, but the stream does not process it once it has started. Shouldn't the stream continually process data from the upstream source? In this case, the upstream source is an Elastic index named data.
I've tried amending the query to match all documents:
String queryStr = "{\"match_all\": {}}";
but I get the same result.
The Elasticsearch source does not run continuously. It initiates a search, manages pagination (using the scroll API) and streams results; when Elasticsearch reports no more results, it completes.
You could do something like
Source.repeat(Done).flatMapConcat(done -> ElasticsearchSource.typed(...))
which will run a new search immediately after the previous one finishes. Note that it would be the responsibility of the downstream to filter out duplicates.
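Expanded slightly, that could look like the sketch below, reusing queryStr and sourceSettings from the question. The ConcurrentHashMap-based de-duplication by document id is just one crude option and is an assumption on my part, not part of the Alpakka API.
// Restart the search whenever the previous one completes; de-duplicate downstream.
// Needs akka.Done, java.util.Set, java.util.concurrent.ConcurrentHashMap.
Source<ReadResult<Stats>, NotUsed> repeating =
        Source.repeat(Done.getInstance())
                .flatMapConcat(done ->
                        ElasticsearchSource.typed(
                                constructElasticsearchParams("data", "_doc", ApiVersion.V7),
                                queryStr,
                                sourceSettings,
                                Stats.class));

Set<String> seenIds = ConcurrentHashMap.newKeySet();  // grows without bound; fine for a sketch
repeating
        .filter(result -> seenIds.add(result.id()))   // keep only ids not seen before
        .runWith(Sink.foreach(System.out::println), actorSystem);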

How to call an oracle stored proc in apache beam?

I am just trying to learn Apache Beam and am returning data from an Oracle database. I have managed to set up basic connectivity and return some data, but I need to call a stored proc before running the SQL query that returns my data (the stored proc sets the query context to limit the data returned to a specific partition).
I've tried adding a second .withQuery statement, but this does not work. The code doesn't return an error, but it returns data from all partitions.
Pipeline p = Pipeline.create(options);
PCollection<List<String>> rows = p.apply(JdbcIO.<List<String>>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "oracle.jdbc.driver.OracleDriver", "jdbc:oracle:thin:@server")
                .withUsername("uname")
                .withPassword("pword")
        )
        .withQuery("call procname(partitionid)")
        .withQuery("Select * from table")
        .withCoder(ListCoder.of(StringUtf8Coder.of()))
        .withRowMapper(new JdbcIO.RowMapper<List<String>>() {
            public List<String> mapRow(ResultSet resultSet) throws Exception {
                List<String> addRow = new ArrayList<String>();
                for (int i = 1; i <= resultSet.getMetaData().getColumnCount(); i++) {
                    addRow.add(i - 1, String.valueOf(resultSet.getObject(i)));
                }
                return addRow;
            }
        }));

How to find performance bottlenecks with Spring Data Redis and local Redis server

I am trying to optimize performance on fetching data from Redis. The server is currently running locally on my 2015 Macbook Pro.
First: Problem explanation
For the time being I only have 32 keys stored as hashes. 16 of these store quite long JSON strings in each hash value, with <300 fields per hash. The rest are quite small, so I don't have any problems with them.
From a Spring Boot application, using the Spring Data Redis template, with a Jedis connection, the total time to retrieve the 16 big hashes is ~1700ms by pipelining 4 HGETALL commands 4 times.
My question:
How do I proceed to find the real bottlenecks? I have already checked the SLOWLOG, which tells me the actions performed on the server are very fast, < 1ms per HGETALL command, which is to be expected. This means the bottleneck has to be between the Java application and the Redis server. Is it possible that the latency is the cause of the other ~1650ms? (Latency doesn't seem to be a problem with the other smaller hashes.) Or could it be the deserialization of my JSON strings? I am not sure how I can test this as I have no way to put in timers in the RedisTemplate code.
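One way to check the deserialization theory is to fetch the values as plain strings and time the JSON parsing on its own. A minimal sketch, assuming each hash field holds one Departure as JSON, that a StringRedisTemplate is available and that Jackson is on the classpath (none of which is shown here):
// Fetch as plain strings first, then time the JSON-to-Departure mapping separately.
// Assumes com.fasterxml.jackson.databind.ObjectMapper and a StringRedisTemplate bean.
ObjectMapper mapper = new ObjectMapper();

long fetchStart = System.currentTimeMillis();
Map<Object, Object> raw = stringRedisTemplate.opsForHash().entries(keyspace + key);
long fetchMillis = System.currentTimeMillis() - fetchStart;

long parseStart = System.currentTimeMillis();
Map<Integer, Departure> parsed = new HashMap<>();
try {
    for (Map.Entry<Object, Object> entry : raw.entrySet()) {
        parsed.put(Integer.valueOf((String) entry.getKey()),
                mapper.readValue((String) entry.getValue(), Departure.class));
    }
} catch (IOException e) {
    throw new RuntimeException("Deserialization failed", e);
}
long parseMillis = System.currentTimeMillis() - parseStart;

System.out.println("fetch=" + fetchMillis + "ms, parse=" + parseMillis + "ms");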
Below is the code I use to pipeline the HGETALL commands:
private Map<Date, Map<Integer, Departure>> pipelineMethod(List<String> keys, String keyspace) {
    long pipeTime = System.currentTimeMillis();
    List<Object> results = redisTemplate.executePipelined(
        (RedisCallback) (connection) -> {
            for (String key : keys) {
                long actionTime = System.currentTimeMillis();
                connection.hGetAll((keyspace + key).getBytes());
                System.out.println("HGETALL finished in " + (System.currentTimeMillis() - actionTime) + "ms");
            }
            return null;
        }
    );
    System.out.println("Pipeline finished in " + (System.currentTimeMillis() - pipeTime) + "ms");
    Map<Date, Map<Integer, Departure>> resultMap = new ConcurrentHashMap<>();
    for (int i = 0; i < keys.size(); i++) {
        if (results.get(i) == null) {
            resultMap.put(new Date(Long.parseLong(keys.get(i))), null);
            log.debug("Hash map from redis on " + new Date(Long.parseLong(keys.get(i))) + " was null on retrieval");
        } else {
            resultMap.put(new Date(Long.parseLong(keys.get(i))), (Map<Integer, Departure>) results.get(i));
        }
    }
    return resultMap;
}
Any suggestions would be greatly appreciated.
Try using the Lettuce client with a connection pool:
private GenericObjectPoolConfig getPoolConfig() {
    GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
    // All of the below should use the property sources
    poolConfig.setMaxTotal(20);
    poolConfig.setMaxIdle(20);
    poolConfig.setMinIdle(0);
    return poolConfig;
}

@Bean
@Primary
public RedisConnectionFactory redisConnectionFactory() {
    DefaultLettucePool lettucePool = new DefaultLettucePool(redisHost, Integer.valueOf(redisPort).intValue(), getPoolConfig());
    lettucePool.setPassword(redisPass);
    lettucePool.afterPropertiesSet();
    LettuceConnectionFactory clientConfig = new LettuceConnectionFactory(lettucePool);
    clientConfig.afterPropertiesSet();
    return clientConfig;
}

@Bean
public RedisTemplate<?, ?> redisTemplate(RedisConnectionFactory connectionFactory) {
    RedisTemplate<byte[], byte[]> template = new RedisTemplate<>();
    template.setConnectionFactory(connectionFactory);
    return template;
}

MongoDB - find() calls getting stuck sometimes - Timeout

We are using MongoDb for saving and fetching data.
All calls that put data into collections work fine and go through a common method.
All calls that fetch data from collections work fine most of the time and go through a common method.
But sometimes, and only for one of the collections, my calls get stuck forever and consume CPU. I have to kill the threads manually, otherwise they consume my whole CPU.
Mongo Connection
MongoClient mongo = new MongoClient(hostName , Integer.valueOf(port));
DB mongoDb = mongo.getDB(dbName);
Code To fetch
DBCollection collection = mongoDb.getCollection(collectionName);
DBObject dbObject = new BasicDBObject("_id" , key);
DBCursor cursor = collection.find(dbObject);
I have figured out which collection is causing the issue, but how can I improve on this, given that it only happens for this particular collection and only sometimes?
EDIT
Code to save
DBCollection collection = mongoDb.getCollection(collectionName);
DBObject query = new BasicDBObject("_id" , key);
DBObject update = new BasicDBObject();
update.put("$set" , JSON.parse(value));
collection.update(query , update , true , false);
Bulk Write / collection
DB mongoDb = controllerFactory.getMongoDB();
DBCollection collection = mongoDb.getCollection(collectionName);
BulkWriteOperation bulkWriteOperation = collection.initializeUnorderedBulkOperation();
Map<String, Object> dataMap = (Map<String, Object>) JSON.parse(value);
for (Entry<String, Object> entrySet : dataMap.entrySet()) {
    BulkWriteRequestBuilder bulkWriteRequestBuilder = bulkWriteOperation.find(new BasicDBObject("_id",
            entrySet.getKey()));
    DBObject update = new BasicDBObject();
    update.put("$set", entrySet.getValue());
    bulkWriteRequestBuilder.upsert().update(update);
}
How can I set a timeout for fetch calls?
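For reference, with the legacy driver used above, a socket-level timeout and a per-query time limit can be set roughly like this. This is only a sketch, assuming a 2.12+ driver; the values are placeholders.
// Socket/connect timeouts on the client, plus a server-side time limit per query.
// Needs java.util.concurrent.TimeUnit.
MongoClientOptions options = MongoClientOptions.builder()
        .connectTimeout(5000)     // ms to establish a connection
        .socketTimeout(5000)      // ms to wait on socket reads
        .build();
MongoClient mongo = new MongoClient(new ServerAddress(hostName, Integer.valueOf(port)), options);

DBCursor cursor = collection.find(dbObject).maxTime(5, TimeUnit.SECONDS);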
A different approach is to use the API proposed with the MongoDB 3.2 driver. Keep in mind that you have to update your .jar libraries to the latest version (if you haven't already).
public static MongoClient connectToClient(String hostName, String port) {
    try {
        MongoClient client = new MongoClient(hostName, Integer.valueOf(port));
        return client;
    } catch (MongoClientException e) {
        System.err.println("Cannot connect to Client.");
        return null;
    }
}

public static MongoDatabase connectToDB(MongoClient client, String databaseName) {
    try {
        MongoDatabase db = client.getDatabase(databaseName);
        return db;
    } catch (Exception e) {
        System.err.println("Error in connecting to database " + databaseName);
        return null;
    }
}

public static void closeConnection(MongoClient client) {
    client.close();
}

public static void findDoc(MongoDatabase db, String collectionName, Object key) {
    MongoCollection<Document> collection = db.getCollection(collectionName);
    try {
        FindIterable<Document> iterable = collection
                .find(new Document("_id", key));
        Document doc = iterable.first();
        // For an Int64 field named 'special_id'
        long specialId = doc.getLong("special_id");
    } catch (MongoException e) {
        System.err.println("Error in retrieving document.");
    } catch (NullPointerException e) {
        System.err.println("Document with _id " + key + " does not exist.");
    }
}

public static void insertToDB(MongoDatabase db, String collectionName) {
    try {
        db.getCollection(collectionName).insertOne(new Document()
                .append("special_id", 5)
                // Append anything
        );
    } catch (MongoException e) {
        System.err.println("Error in inserting new document.");
    }
}

public static void updateDoc(MongoDatabase db, String collectionName, long id) {
    MongoCollection<Document> collection = db.getCollection(collectionName);
    try {
        collection.updateOne(new Document("_id", id),
                new Document("$set",
                        new Document("special_id",
                                7)));
    } catch (MongoException e) {
        System.err.println("Error in updating new document.");
    }
}

public static void main(String[] args) {
    String hostName = "myHost";
    String port = "myPort";
    String databaseName = "myDB";
    String collectionName = "myCollection";
    MongoClient client = connectToClient(hostName, port);
    if (client != null) {
        MongoDatabase db = connectToDB(client, databaseName);
        if (db != null) {
            findDoc(db, collectionName, "myKey");
        }
        closeConnection(client);
    }
}
EDIT: As the others suggested, check from the command line whether finding the document by its ID is slow there too. If it is, this may be a problem with your hard drive. The _id field is supposed to be indexed, but for better or for worse, try re-creating the index on the _id field.
The answers posted by others are great, but they did not solve my problem.
The issue was actually in my existing code itself: my cursor was waiting in a while loop for an infinite time.
I was missing a few checks, which has now been resolved.
Just some possible explanations/thoughts.
In general, "query by id" has to be fast since _id is supposed to be always indexed. The code snippet looks correct, so the reason is probably in mongo itself. This leads me to a couple of suggestions:
Try to connect to mongo directly from the command line and run the "find" from there. Chances are you'll still be able to observe occasional slowness.
In this case:
Maybe it's about the disks (maybe this particular server is deployed on a slow disk, or at least there is a correlation with slow disk access).
Maybe you have a sharded configuration and one shard is slower than the others.
Maybe it's a network issue that occurs sporadically. If you run mongo locally/on a staging env. with the same collection, does this reproduce?
Maybe (although I hardly believe it) the query runs in a sub-optimal way. In this case you can use "explain()", as someone has already suggested here (see the sketch below).
If you happen to have a replica set, figure out what the Read Preference is. Who knows, maybe you prefer to get this id from a sub-optimal server.
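A quick way to check the query plan from Java with the legacy driver used in the question (a minimal sketch; the shell's db.collection.find({_id: key}).explain() gives the same information):
// Print the plan MongoDB chooses for the query; an _id index lookup is expected.
DBCollection collection = mongoDb.getCollection(collectionName);
DBObject plan = collection.find(new BasicDBObject("_id", key)).explain();
System.out.println(plan);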

How to mass delete multiple rows in hbase?

I have the following rows with these keys in hbase table "mytable"
user_1
user_2
user_3
...
user_9999999
I want to use the Hbase shell to delete rows from:
user_500 to user_900
I know there is no built-in way to delete a range of rows, but is there a way I could use the "BulkDeleteProcessor" to do this?
I see here:
https://github.com/apache/hbase/blob/master/hbase-examples/src/test/java/org/apache/hadoop/hbase/coprocessor/example/TestBulkDeleteProtocol.java
I want to just paste in imports and then paste this into the shell, but have no idea how to go about this. Does anyone know how I can use this endpoint from the jruby hbase shell?
Table ht = TEST_UTIL.getConnection().getTable("my_table");
long noOfDeletedRows = 0L;
Batch.Call<BulkDeleteService, BulkDeleteResponse> callable =
        new Batch.Call<BulkDeleteService, BulkDeleteResponse>() {
            ServerRpcController controller = new ServerRpcController();
            BlockingRpcCallback<BulkDeleteResponse> rpcCallback =
                    new BlockingRpcCallback<BulkDeleteResponse>();

            public BulkDeleteResponse call(BulkDeleteService service) throws IOException {
                Builder builder = BulkDeleteRequest.newBuilder();
                builder.setScan(ProtobufUtil.toScan(scan));
                builder.setDeleteType(deleteType);
                builder.setRowBatchSize(rowBatchSize);
                if (timeStamp != null) {
                    builder.setTimestamp(timeStamp);
                }
                service.delete(controller, builder.build(), rpcCallback);
                return rpcCallback.get();
            }
        };
Map<byte[], BulkDeleteResponse> result = ht.coprocessorService(BulkDeleteService.class, scan
        .getStartRow(), scan.getStopRow(), callable);
for (BulkDeleteResponse response : result.values()) {
    noOfDeletedRows += response.getRowsDeleted();
}
ht.close();
If there is no way to do this through JRuby, a Java or other alternative way to quickly delete multiple rows is fine.
Do you really want to do it in the shell? There are various better ways. One way is using the native Java API:
Construct an ArrayList of Delete objects,
then pass this list to the Table.delete method.
Method 1: if you already know the range of keys.
public void massDelete(byte[] tableName) throws IOException {
    HTable table = (HTable) hbasePool.getTable(tableName);
    String tablePrefix = "user_";
    int startRange = 500;
    int endRange = 999;
    List<Delete> listOfBatchDelete = new ArrayList<Delete>();
    for (int i = startRange; i <= endRange; i++) {
        String key = tablePrefix + i;
        Delete d = new Delete(Bytes.toBytes(key));
        listOfBatchDelete.add(d);
    }
    try {
        table.delete(listOfBatchDelete);
    } finally {
        if (hbasePool != null && table != null) {
            hbasePool.putTable(table);
        }
    }
}
Method 2: If you want to do a batch delete on the basis of a scan result.
public void bulkDelete(final HTable table) throws IOException {
    Scan s = new Scan();
    List<Delete> listOfBatchDelete = new ArrayList<Delete>();
    // add your filters to the scanner, e.g. s.setFilter(yourFilter);
    ResultScanner scanner = table.getScanner(s);
    for (Result rr : scanner) {
        Delete d = new Delete(rr.getRow());
        listOfBatchDelete.add(d);
    }
    try {
        table.delete(listOfBatchDelete);
    } catch (Exception e) {
        LOGGER.log(e);
    }
}
Now, coming to using a CoProcessor: only one piece of advice, 'DON'T USE a CoProcessor' unless you are an expert in HBase.
CoProcessors have many inbuilt issues; if you need, I can provide a detailed description.
Secondly, when you delete anything from HBase it's never deleted directly; a tombstone marker gets attached to that record and it is only removed later, during a major compaction. So there is no need to use a coprocessor, which is highly resource intensive.
Modified code to support batch operation:
int batchSize = 50;
int batchCounter = 0;
for (int i = startRange; i <= endRange; i++) {
    String key = tablePrefix + i;
    Delete d = new Delete(Bytes.toBytes(key));
    listOfBatchDelete.add(d);
    batchCounter++;
    if (batchCounter == batchSize) {
        try {
            table.delete(listOfBatchDelete);
            listOfBatchDelete.clear();
            batchCounter = 0;
        } catch (IOException e) {
            LOGGER.log(e);
        }
    }
}
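One thing to watch with the batching above: any deletes accumulated after the last full batch of 50 still need a final flush once the loop ends, e.g.:
// Flush whatever is left after the last full batch.
if (!listOfBatchDelete.isEmpty()) {
    table.delete(listOfBatchDelete);
    listOfBatchDelete.clear();
}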
Creating the HBase conf and getting a table instance:
Configuration hConf = HBaseConfiguration.create(conf);
hConf.set("hbase.zookeeper.quorum", "Zookeeper IP");
hConf.set("hbase.zookeeper.property.clientPort", ZookeeperPort);
HTable hTable = new HTable(hConf, tableName);
If you are already aware of the rowkeys of the records that you want to delete from the HBase table, then you can use the following approach.
1. First create a List of Delete objects with these rowkeys:
for (int rowKey = 1; rowKey <= 10; rowKey++) {
    deleteList.add(new Delete(Bytes.toBytes(rowKey + "")));
}
2. Then get the Table object by using the HBase Connection:
Table table = connection.getTable(TableName.valueOf(tableName));
3. Once you have the table object, call delete() by passing the list:
table.delete(deleteList);
The complete code will look like this:
Configuration config = HBaseConfiguration.create();
config.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
config.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
String tableName = "users";
Connection connection = ConnectionFactory.createConnection(config);
Table table = connection.getTable(TableName.valueOf(tableName));
List<Delete> deleteList = new ArrayList<Delete>();
for (int rowKey = 500; rowKey <= 900; rowKey++) {
    deleteList.add(new Delete(Bytes.toBytes("user_" + rowKey)));
}
table.delete(deleteList);
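Once the deletes are done, it's worth releasing the resources as well, e.g.:
// Release the table and the underlying connection when finished.
table.close();
connection.close();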
