How to call an oracle stored proc in apache beam? - java

I am just trying to learn Apache Beam and am returning data from an Oracle database. I have managed to set up basic connectivity and return some data, but I need to call a stored procedure before running the SQL query that returns my data (the stored procedure sets the query context to limit the data returned to a specific partition).
I've tried adding a second .withQuery call, but this does not work. The code doesn't raise an error, but it returns data from all partitions:
Pipeline p = Pipeline.create(options);
PCollection<List<String>> rows = p.apply(JdbcIO.<List<String>>read()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "oracle.jdbc.driver.OracleDriver", "jdbc:oracle:thin:@server")
        .withUsername("uname")
        .withPassword("pword"))
    .withQuery("call procname(partitionid)")
    .withQuery("Select * from table")
    .withCoder(ListCoder.of(StringUtf8Coder.of()))
    .withRowMapper(new JdbcIO.RowMapper<List<String>>() {
        public List<String> mapRow(ResultSet resultSet) throws Exception {
            List<String> addRow = new ArrayList<String>();
            for (int i = 1; i <= resultSet.getMetaData().getColumnCount(); i++) {
                addRow.add(i - 1, String.valueOf(resultSet.getObject(i)));
            }
            return addRow;
        }
    }));

Related

Improve Performance with Multiple Row Inserts and Fetches with Oracle SQL Stored Procedures and Java Spring

I currently have stored procedures for Oracle SQL, version 18c, for both inserting and fetching multiple rows of data from one parent table and one child table, being called from my Java Spring Boot application. Everything works fine, but it is extremely slow, for only a few rows of data.
Inserting only 70 records across the two empty tables takes up to 267 seconds. Fetching that same data back out takes about 40 seconds.
Any help would be greatly appreciated. Let me know if any additional information is needed from me.
Below is a cut-down and renamed version of the stored procedures for my parent and child tables; the actual parent table has 32 columns and the child table has 11.
PROCEDURE processParentData(
i_field_one varchar2,
v_parent_id OUT number) is
v_new PARENT%ROWTYPE;
BEGIN
v_new.id := ROW_SEQUENCE.nextval;
v_new.insert_time := systimestamp;
v_new.field_one := i_field_one;
insert into PARENT values v_new;
v_parent_id := v_new.id;
END;
PROCEDURE readParentData(
i_field_one IN varchar2,
v_parent OUT SYS_REFCURSOR) AS
BEGIN
OPEN v_parent FOR select h.* from PARENT h
where h.field_one = i_field_one;
END;
PROCEDURE processChild(
i_field_one varchar2,
i_parent_id number) is
v_new CHILD%ROWTYPE;
BEGIN
v_new.id := ROW_SEQUENCE.nextval;
v_new.insert_time := systimestamp;
v_new.field_one := i_field_one;
v_new.parent_id := i_parent_id;
insert into CHILD values v_new;
END;
PROCEDURE readChild(
i_parent_id IN number,
v_child OUT SYS_REFCURSOR) AS
BEGIN
OPEN v_child FOR select h.* from CHILD h
where h.parent_id = i_parent_id;
END;
For my Java code I am using Spring JDBC. After I get the parent data, I then fetch each child data by looping through the parent data and calling readChild with the parent ID for each.
var simpleJdbcCall = new SimpleJdbcCall(jdbcTemplate)
.withCatalogName("PARENT_PACKAGE")
.withProcedureName("processParentData");
SqlParameterSource sqlParameterSource = new MapSqlParameterSource()
.addValue("i_field_one", locationId)
.addValue("v_parent_id", null);
Map<String, Object> out = simpleJdbcCall.execute(sqlParameterSource);
var stopId = (BigDecimal) out.get("v_parent_id");
return stopId.longValue();
var simpleJdbcCall = new SimpleJdbcCall(jdbcTemplate)
.withCatalogName("PARENT_PACKAGE")
.withProcedureName("readParentData")
.returningResultSet("v_parent", BeanPropertyRowMapper.newInstance(Parent.class));
SqlParameterSource sqlParameterSource = new MapSqlParameterSource()
.addValue("i_field_one", location.getId());
Map<String, Object> out = simpleJdbcCall.execute(sqlParameterSource);
return (List<Parent>) out.get("v_parent");
UPDATE 1: As I know and have tested, using the same data and tables, if I use pure JDBC or JPA/Hibernate for inserting and fetching to the tables directly and avoid using stored procedures, then the whole process of inserting and fetching only takes a few seconds.
The issue is that the company I work at has set a policy that, going forward, no application is allowed direct read/write access to the database and everything must be done through stored procedures, for security reasons. This means I need to work out how to do what we have been doing for years with direct read/write access, now using only Oracle stored procedures.
UPDATE 2: Adding my current Java code for fetching the child data.
for (Parent parent : parents) {
parent.setChilds(childRepository.readChildByParentId(parent.getId()));
}
public List<Child> readChildByParentId(long parentId) {
var simpleJdbcCall = new SimpleJdbcCall(jdbcTemplate)
.withCatalogName("CHILD_PACKAGE")
.withProcedureName("readChild")
.returningResultSet("v_child", BeanPropertyRowMapper.newInstance(Child.class));
SqlParameterSource sqlParameterSource = new MapSqlParameterSource()
.addValue("i_parent_id ", parentId);
Map<String, Object> out = simpleJdbcCall.execute(sqlParameterSource);
return (List<Child>) out.get("v_child");
}
The problem is that the insert you are performing through the stored procedure is not optimized, because you are calling the database every time you insert a row.
I strongly recommend transforming the data to XML (CSV would also work), passing it to the procedure in one call, and then looping over it inside the procedure to perform the inserts that you need.
Here is an example made using Oracle:
CREATE OR REPLACE PROCEDURE MY_SCHEMA.my_procedure(xmlData CLOB) IS
BEGIN
  FOR contact IN (SELECT *
                    FROM XMLTABLE(
                           '/CONTACTS/CONTACT' PASSING XMLTYPE(xmlData)
                           COLUMNS param_id FOR ORDINALITY
                                  ,id NUMBER PATH 'ID'
                                  ,name VARCHAR2(100) PATH 'NAME'
                                  ,surname VARCHAR2(100) PATH 'SURNAME'))
  LOOP
    INSERT INTO PARENT_TABLE VALUES (contact.id, contact.name, contact.surname);
  END LOOP;
END;
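The single CLOB argument has to be assembled on the Java side before the call. The following is only a minimal sketch of that step, assuming the CONTACTS/CONTACT layout that the XMLTABLE path above expects. The `ContactsXml` class and its nested `Contact` type are hypothetical names introduced for illustration, and the escaping covers just the five predefined XML entities.

```java
import java.util.List;

// Hypothetical helper (not from the original answer): builds the XML payload
// that is handed to the procedure as one CLOB/String argument.
class ContactsXml {

    // Illustrative stand-in for the rows to insert.
    static final class Contact {
        final long id;
        final String name;
        final String surname;

        Contact(long id, String name, String surname) {
            this.id = id;
            this.name = name;
            this.surname = surname;
        }
    }

    // Escape the five predefined XML entities; '&' must be replaced first.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // One <CONTACT> element per row, wrapped in a single <CONTACTS> root.
    static String toXml(List<Contact> contacts) {
        StringBuilder sb = new StringBuilder("<CONTACTS>");
        for (Contact c : contacts) {
            sb.append("<CONTACT>")
              .append("<ID>").append(c.id).append("</ID>")
              .append("<NAME>").append(escape(c.name)).append("</NAME>")
              .append("<SURNAME>").append(escape(c.surname)).append("</SURNAME>")
              .append("</CONTACT>");
        }
        return sb.append("</CONTACTS>").toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml(List.of(new Contact(1, "John", "Smith"))));
    }
}
```

The resulting string would then be bound to the xmlData parameter, so all rows travel to the database in one round trip instead of one call per row.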
The XML can be passed to the procedure as a String:
<CONTACTS>
<CONTACT>
<ID>1</ID>
<NAME>John</NAME>
<SURNAME>Smith</SURNAME>
</CONTACT>
</CONTACTS>
For my Java code I am using Spring JDBC. After I get the parent data, I then fetch each child data by looping through the parent data and calling readChild with the parent ID for each.
Instead of fetching child data in a loop, you can modify your procedure to accept a list of parent ids and return all the data in one call.
It would be helpful if you shared the Spring Boot for-loop code as well.
Update
Instead of fetching the children of one parent at a time, you should update your code like this. You also have to update your procedure.
List<Long> parentIds = new ArrayList<>();
for (Parent parent : parents) {
    parentIds.add(parent.getId());
}
You can use Java streams here, but that is secondary.
Now you have to modify your procedure and method to accept multiple parent ids.
List<Child> children = childRepository.readChildByParentId(parentIds);
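// Existing single-id method; its parameter (and the procedure) must be changed to take the list of ids.
public List<Child> readChildByParentId(long parentId) {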
var simpleJdbcCall = new SimpleJdbcCall(jdbcTemplate)
.withCatalogName("CHILD_PACKAGE")
.withProcedureName("readChild")
.returningResultSet("v_child", BeanPropertyRowMapper.newInstance(Child.class));
SqlParameterSource sqlParameterSource = new MapSqlParameterSource()
.addValue("i_parent_id ", parentId);
Map<String, Object> out = simpleJdbcCall.execute(sqlParameterSource);
return (List<Child>) out.get("v_child");
}
Once you have all the children, you can assign them to their parents in Java code.
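Assigning the children to their parents after a single bulk fetch can be sketched as follows. This is an illustration under assumptions, not the poster's code: the `Parent` and `Child` classes here are simplified stand-ins, `Child` is assumed to carry the `parent_id` column value, and `ChildGrouping.attach` is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: groups children fetched in one call by their parent id.
class ChildGrouping {

    // Simplified stand-in; assumes the row mapper fills parentId from parent_id.
    static final class Child {
        final long id;
        final long parentId;

        Child(long id, long parentId) {
            this.id = id;
            this.parentId = parentId;
        }
    }

    static final class Parent {
        final long id;
        List<Child> children = new ArrayList<>();

        Parent(long id) {
            this.id = id;
        }
    }

    // Build the id -> children map once, then do one map lookup per parent.
    static void attach(List<Parent> parents, List<Child> allChildren) {
        Map<Long, List<Child>> byParent = allChildren.stream()
                .collect(Collectors.groupingBy(c -> c.parentId));
        for (Parent p : parents) {
            // Parents without children get an empty list rather than null.
            p.children = byParent.getOrDefault(p.id, new ArrayList<>());
        }
    }

    public static void main(String[] args) {
        List<Parent> parents = List.of(new Parent(1), new Parent(2));
        attach(parents, List.of(new Child(10, 1), new Child(11, 1)));
        System.out.println(parents.get(0).children.size());
    }
}
```

Grouping once keeps the Java side linear in the number of rows, and the database sees one read for all children instead of one per parent.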
P.S.
Could you please check if you fetch parents with children if parent is coming from the database?
Your performance problems are probably related to the number of operations performed against the database: you are iterating over your collections in Java and interacting with the database in every iteration. You need to minimize the number of operations performed.
One possible solution is to use the standard Oracle STRUCT and ARRAY types. Consider, for instance, the following example:
public static void insertData() throws SQLException {
DriverManagerDataSource dataSource = ...
JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
jdbcTemplate.setResultsMapCaseInsensitive(true);
SimpleJdbcCall insertDataCall = new SimpleJdbcCall(jdbcTemplate)
.withCatalogName("parent_child_pkg")
.withProcedureName("insert_data")
.withoutProcedureColumnMetaDataAccess()
.useInParameterNames("p_parents")
.declareParameters(
new SqlParameter("p_parents", OracleTypes.ARRAY, "PARENT_ARRAY")
);
OracleConnection connection = null;
try {
connection = insertDataCall
.getJdbcTemplate()
.getDataSource()
.getConnection()
.unwrap(OracleConnection.class)
;
List<Parent> parents = new ArrayList<>(100);
Parent parent = null;
List<Child> chilren = null;
Child child = null;
for (int i = 0; i < 100; i++) {
parent = new Parent();
parents.add(parent);
parent.setId((long) i);
parent.setName("parent-" + i);
chilren = new ArrayList<>(1000);
parent.setChildren(chilren);
for (int j = 0; j < 1000; j++) {
child = new Child();
chilren.add(child);
child.setId((long) j);
child.setName("child-" + j);
}
}
System.out.println("Inserting data...");
StopWatch stopWatch = new StopWatch();
stopWatch.start("insert-data");
StructDescriptor parentTypeStructDescriptor = StructDescriptor.createDescriptor("PARENT_TYPE", connection);
ArrayDescriptor parentArrayDescriptor = ArrayDescriptor.createDescriptor("PARENT_ARRAY", connection);
StructDescriptor childTypeStructDescriptor = StructDescriptor.createDescriptor("CHILD_TYPE", connection);
ArrayDescriptor childArrayDescriptor = ArrayDescriptor.createDescriptor("CHILD_ARRAY", connection);
Object[] parentArray = new Object[parents.size()];
int pi = 0;
for (Parent p : parents) {
List<Child> children = p.getChildren();
Object[] childArray = new Object[children.size()];
int ci = 0;
for (Child c : children) {
Object[] childrenObj = new Object[2];
childrenObj[0] = c.getId();
childrenObj[1] = c.getName();
STRUCT childStruct = new STRUCT(childTypeStructDescriptor, connection, childrenObj);
childArray[ci++] = childStruct;
}
ARRAY childrenARRAY = new ARRAY(childArrayDescriptor, connection, childArray);
Object[] parentObj = new Object[3];
parentObj[0] = p.getId();
parentObj[1] = p.getName();
parentObj[2] = childrenARRAY;
STRUCT parentStruct = new STRUCT(parentTypeStructDescriptor, connection, parentObj);
parentArray[pi++] = parentStruct;
}
ARRAY parentARRAY = new ARRAY(parentArrayDescriptor, connection, parentArray);
Map in = Collections.singletonMap("p_parents", parentARRAY);
insertDataCall.execute(in);
connection.commit();
stopWatch.stop();
System.out.println(stopWatch.prettyPrint());
} catch (Throwable t) {
t.printStackTrace();
connection.rollback();
} finally {
if (connection != null) {
try {
connection.close();
} catch (Throwable nested) {
nested.printStackTrace();
}
}
}
}
Where:
CREATE OR REPLACE TYPE child_type AS OBJECT (
id NUMBER,
name VARCHAR2(512)
);
CREATE OR REPLACE TYPE child_array
AS TABLE OF child_type;
CREATE OR REPLACE TYPE parent_type AS OBJECT (
id NUMBER,
name VARCHAR2(512),
children child_array
);
CREATE OR REPLACE TYPE parent_array
AS TABLE OF parent_type;
CREATE SEQUENCE PARENT_SEQ INCREMENT BY 1 MINVALUE 1;
CREATE SEQUENCE CHILD_SEQ INCREMENT BY 1 MINVALUE 1;
CREATE TABLE parent_table (
id NUMBER,
name VARCHAR2(512)
);
CREATE TABLE child_table (
id NUMBER,
name VARCHAR2(512),
parent_id NUMBER
);
CREATE OR REPLACE PACKAGE parent_child_pkg AS
PROCEDURE insert_data(p_parents PARENT_ARRAY);
END;
CREATE OR REPLACE PACKAGE BODY parent_child_pkg AS
PROCEDURE insert_data(p_parents PARENT_ARRAY) IS
l_parent_id NUMBER;
l_child_id NUMBER;
BEGIN
FOR i IN 1..p_parents.COUNT LOOP
SELECT parent_seq.nextval INTO l_parent_id FROM dual;
INSERT INTO parent_table(id, name)
VALUES(l_parent_id, p_parents(i).name);
FOR j IN 1..p_parents(i).children.COUNT LOOP
SELECT child_seq.nextval INTO l_child_id FROM dual;
INSERT INTO child_table(id, name, parent_id)
VALUES(l_child_id, p_parents(i).children(j).name, l_parent_id);
END LOOP;
END LOOP;
END;
END;
And Parent and Child are simple POJOs:
import java.util.ArrayList;
import java.util.List;
public class Parent {
private Long id;
private String name;
private List<Child> children = new ArrayList<>();
public Long getId() {
return id;
}
public void setId(Long id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public List<Child> getChildren() {
return children;
}
public void setChildren(List<Child> children) {
this.children = children;
}
}
public class Child {
private Long id;
private String name;
public Long getId() {
return id;
}
public void setId(Long id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
}
Please forgive the poor legibility of the code and the incorrect error handling. I will improve the answer later and include some information about fetching the data as well.
The times you mention are horrible indeed. A big boost in performance will come from working set-based, which means reducing the row-by-row database calls.
Row by row is synonymous with slow, especially when network round trips are involved.
One call to get the parents.
One call to get the set of children and process them. The JDBC fetch size is a nice tunable here. Give it a chance to work for you.
You do not need to use dynamic SQL (OPEN v_parent FOR), and it is also not clear how the view v_parent is defined.
Try checking the execution plan of this query:
select h.* from PARENT h where h.field_one = ?;
Usually, returning a record set via SYS_REFCURSOR improves performance when you return more than (let's say) 10K records.
The SimpleJdbcCall object can be reused in your scenario, as only the parameters change. The SimpleJdbcCall object compiles the JDBC statement on the first invocation. To do so it fetches some metadata, which means interacting with the database. Creating a separate object per call therefore fetches the same metadata that many times, which is not needed.
So I suggest initialising all four SimpleJdbcCall objects at the very beginning and then working with them.
var insertParentJdbcCall = new SimpleJdbcCall(jdbcTemplate)
.withCatalogName("PARENT_PACKAGE")
.withProcedureName("processParentData");
var readParentJdbcCall = new SimpleJdbcCall(jdbcTemplate)
.withCatalogName("PARENT_PACKAGE")
.withProcedureName("readParentData")
.returningResultSet("v_parent", BeanPropertyRowMapper.newInstance(Parent.class));
var insertChildJdbcCall = new SimpleJdbcCall(jdbcTemplate)
.withCatalogName("CHILD_PACKAGE")
.withProcedureName("processChild");
var readChildJdbcCall = new SimpleJdbcCall(jdbcTemplate)
.withCatalogName("CHILD_PACKAGE")
.withProcedureName("readChild")
.returningResultSet("v_child", BeanPropertyRowMapper.newInstance(Child.class));

How to do batchUpdate instead of update on Namedparameterjdbctemplate

I am parsing a file and creating a list of string elements that I'm inserting into my table. I'm trying to set a batch size of 5 rows inserted at a time and can't figure out how to use .batchUpdate in place of .update in my code.
You're currently calling update(String sql, SqlParameterSource paramSource).
The comparable batch version is batchUpdate(String sql, SqlParameterSource[] batchArgs).
So, to do it as a batch, build an array and make the call.
final int batchSize = 5;
List<SqlParameterSource> args = new ArrayList<>();
for (ZygateEntity zygateInfo : parseData){
SqlParameterSource source = new MapSqlParameterSource("account_name", zygateInfo.getAccountName())
.addValue("command_name", zygateInfo.getCommandName())
.addValue("system_name", zygateInfo.getSystemName())
.addValue("CREATE_DT", zygateInfo.getCreateDt());
args.add(source);
if (args.size() == batchSize) {
namedParameterJdbcTemplate.batchUpdate(sql, args.toArray(new SqlParameterSource[args.size()]));
args.clear();
}
}
if (! args.isEmpty()) {
namedParameterJdbcTemplate.batchUpdate(sql, args.toArray(new SqlParameterSource[args.size()]));
}
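The accumulate-and-flush loop above can also be expressed by splitting the input into fixed-size chunks first and then issuing one batch call per chunk. Below is a minimal sketch of the chunking step using only the JDK. `Chunks.partition` is a hypothetical helper name and is not part of Spring.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a list into consecutive chunks of at most `size` items.
class Chunks {

    static <T> List<List<T>> partition(List<T> items, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < items.size(); from += size) {
            int to = Math.min(from + size, items.size());
            // Copy the view so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(items.subList(from, to)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> rows = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12);
        // With a batch size of 5 the last chunk holds the two leftover rows.
        for (List<Integer> chunk : partition(rows, 5)) {
            System.out.println(chunk);
        }
    }
}
```

Each chunk would then be mapped to a SqlParameterSource[] and passed to batchUpdate. The trailing partial chunk is handled by the same loop, so no separate "flush the rest" branch is needed.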

Query Lucene Indexes Created in Apache Geode

I created a Lucene index in Geode with the code provided in the documentation. I then put a couple of objects in the region and queried the region with a Lucene query, as the documentation also shows. But the query result is always empty. Here is my code:
Starting a Geode server and creating a Lucene index in it:
public static void startServerAndLocator() throws InterruptedException {
ServerLauncher serverLauncher = new ServerLauncher.Builder()
.setMemberName("server1")
.setServerPort(40404)
.set("start-locator", "127.0.0.1[10334]")
.build();
ServerLauncher.ServerState state = serverLauncher.start();
_logger.info(state.toString());
Cache cache = new CacheFactory().create();
createLuceneIndex(cache);
cache.createRegionFactory(RegionShortcut.PARTITION).create("test");
}
public static void createLuceneIndex(Cache cache) throws InterruptedException {
LuceneService luceneService = LuceneServiceProvider.get(cache);
luceneService.createIndexFactory()
.addField("fullName")
.addField("salary")
.addField("phone")
.create("employees", "/test");
}
Putting objects in region and querying:
public static void testGeodeServer() throws LuceneQueryException, InterruptedException {
ClientCache cache = new ClientCacheFactory()
.addPoolLocator("localhost", 10334)
.create();
Region<Integer, Person> region = cache
.<Integer, Person>createClientRegionFactory(ClientRegionShortcut.CACHING_PROXY).create("test");
List<Person> persons = Arrays.asList(
new Person("John", 3000, 5556644),
new Person("Jane", 4000, 6664488),
new Person("Janet", 3500, 1112233));
for (int i = 0; i < persons.size(); i++) {
region.put(i, persons.get(i));
}
LuceneService luceneService = LuceneServiceProvider.get(cache);
LuceneQuery<Integer, Person> query = luceneService.createLuceneQueryFactory()
.setLimit(10)
.create("employees", "/test", "fullName:John AND salary:3000", "salary");
Collection<Person> values = query.findValues();
System.out.println("Query results:");
for (Person person : values) {
System.out.println(person);
}
cache.close();
}
Person is a basic POJO class with three fields (name, salary, phone).
What am I doing wrong here? Why is the query result empty?
If you do a query with just fullName, do you still get no results?
I think the issue is that salary and phone are being stored as IntPoint fields. You could make them String fields in your Person class so that they are stored as strings, or you could use an integer query, e.g.:
luceneService.createLuceneQueryFactory()
    .create("employees", "/test",
        index -> IntPoint.newExactQuery("salary", 3000));
The events are still in the AsyncEventQueue and have not been flushed into the index yet. The AsyncEventQueue's default flush interval is 10 ms, so this can take 10+ milliseconds.
You need to add the following code before running the query:
luceneService.waitUntilFlushed("employees", "/test", 30000, TimeUnit.MILLISECONDS);
Another issue in the program:
The salary field is an integer, but the query runs a string query against the salary field, mixed with another string field.
To query an integer field together with a string field, you need to create a LuceneQueryProvider that combines a StringQueryParser with IntPoint.newExactQuery (or another IntPoint query).
If you just want to try the basic functionality, you can use only String fields for the time being (i.e. change the salary field to String).

How to mass delete multiple rows in hbase?

I have the following rows with these keys in hbase table "mytable"
user_1
user_2
user_3
...
user_9999999
I want to use the Hbase shell to delete rows from:
user_500 to user_900
I know there is no built-in way to delete a range of rows, but is there a way I could use the "BulkDeleteProcessor" to do this?
I see here:
https://github.com/apache/hbase/blob/master/hbase-examples/src/test/java/org/apache/hadoop/hbase/coprocessor/example/TestBulkDeleteProtocol.java
I want to just paste in imports and then paste this into the shell, but have no idea how to go about this. Does anyone know how I can use this endpoint from the jruby hbase shell?
Table ht = TEST_UTIL.getConnection().getTable("my_table");
long noOfDeletedRows = 0L;
Batch.Call<BulkDeleteService, BulkDeleteResponse> callable =
new Batch.Call<BulkDeleteService, BulkDeleteResponse>() {
ServerRpcController controller = new ServerRpcController();
BlockingRpcCallback<BulkDeleteResponse> rpcCallback =
new BlockingRpcCallback<BulkDeleteResponse>();
public BulkDeleteResponse call(BulkDeleteService service) throws IOException {
Builder builder = BulkDeleteRequest.newBuilder();
builder.setScan(ProtobufUtil.toScan(scan));
builder.setDeleteType(deleteType);
builder.setRowBatchSize(rowBatchSize);
if (timeStamp != null) {
builder.setTimestamp(timeStamp);
}
service.delete(controller, builder.build(), rpcCallback);
return rpcCallback.get();
}
};
Map<byte[], BulkDeleteResponse> result = ht.coprocessorService(BulkDeleteService.class, scan
.getStartRow(), scan.getStopRow(), callable);
for (BulkDeleteResponse response : result.values()) {
noOfDeletedRows += response.getRowsDeleted();
}
ht.close();
If there is no way to do this through JRuby, then Java or any other way to quickly delete multiple rows is fine.
Do you really want to do it in the shell? There are various better ways. One of them is to use the native Java API:
Construct an array list of Delete objects.
Pass this array list to the Table.delete method.
Method 1: If you already know the range of keys.
public void massDelete(byte[] tableName) throws IOException {
HTable table=(HTable)hbasePool.getTable(tableName);
String tablePrefix = "user_";
int startRange = 500;
int endRange = 900;
List<Delete> listOfBatchDelete = new ArrayList<Delete>();
for(int i=startRange;i<=endRange;i++){
String key = tablePrefix+i;
Delete d=new Delete(Bytes.toBytes(key));
listOfBatchDelete.add(d);
}
try {
table.delete(listOfBatchDelete);
} finally {
if (hbasePool != null && table != null) {
hbasePool.putTable(table);
}
}
}
Method 2: If you want to do a batch delete on the basis of a scan result.
public void bulkDelete(final HTable table) throws IOException {
    Scan s = new Scan();
    List<Delete> listOfBatchDelete = new ArrayList<Delete>();
    // add your filters to the scan here, e.g. s.setFilter(yourFilter);
    ResultScanner scanner = table.getScanner(s);
    try {
        for (Result rr : scanner) {
            Delete d = new Delete(rr.getRow());
            listOfBatchDelete.add(d);
        }
    } finally {
        // release the server-side scanner resources
        scanner.close();
    }
    try {
        table.delete(listOfBatchDelete);
    } catch (Exception e) {
        LOGGER.log(e);
    }
}
Now, coming to coprocessors, I have only one piece of advice: don't use a coprocessor unless you are an expert in HBase.
Coprocessors have many built-in issues; if you need, I can provide a detailed description.
Secondly, when you delete anything from HBase it is never removed directly. A tombstone marker is attached to the record, and the record is physically removed later during a major compaction, so there is no need to use a coprocessor, which is highly resource-intensive.
Modified code to support batch operation.
int batchSize = 50;
for (int i = startRange; i <= endRange; i++) {
    String key = tablePrefix + i;
    listOfBatchDelete.add(new Delete(Bytes.toBytes(key)));
    if (listOfBatchDelete.size() == batchSize) {
        table.delete(listOfBatchDelete);
        listOfBatchDelete.clear();
    }
}
// Flush the last, partially filled batch
if (!listOfBatchDelete.isEmpty()) {
    table.delete(listOfBatchDelete);
}
Creating HBase conf and getting table instance.
Configuration hConf = HBaseConfiguration.create(conf);
hConf.set("hbase.zookeeper.quorum", "Zookeeper IP");
hConf.set("hbase.zookeeper.property.clientPort", ZookeeperPort);
HTable hTable = new HTable(hConf, tableName);
If you already know the row keys of the records that you want to delete from the HBase table, you can use the following approach.
1. First create a list of Delete objects from these row keys:
for (int rowKey = 1; rowKey <= 10; rowKey++) {
deleteList.add(new Delete(Bytes.toBytes(rowKey + "")));
}
2. Then get the Table object from the HBase Connection:
Table table = connection.getTable(TableName.valueOf(tableName));
3. Once you have the Table object, call delete() and pass it the list:
table.delete(deleteList);
The complete code will look like this:
Configuration config = HBaseConfiguration.create();
config.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
config.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
String tableName = "users";
Connection connection = ConnectionFactory.createConnection(config);
Table table = connection.getTable(TableName.valueOf(tableName));
List<Delete> deleteList = new ArrayList<Delete>();
for (int rowKey = 500; rowKey <= 900; rowKey++) {
deleteList.add(new Delete(Bytes.toBytes("user_" + rowKey)));
}
table.delete(deleteList);
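HBase keeps row keys in lexicographic byte order, which is worth keeping in mind if the explicit key list above is ever replaced by a start/stop scan. With unpadded keys such as user_60 or user_5000, a lexicographic range from user_500 to user_900 does not match the numeric range 500 to 900. The sketch below only illustrates that ordering with plain Java strings (equivalent to byte order for ASCII keys); `RowKeyRange` is a hypothetical class name and no HBase API is involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: explicit key enumeration versus a lexicographic range.
class RowKeyRange {

    // Enumerate exactly the keys prefix+from .. prefix+to, as the loop above does.
    static List<String> keys(String prefix, int from, int to) {
        List<String> result = new ArrayList<>();
        for (int i = from; i <= to; i++) {
            result.add(prefix + i);
        }
        return result;
    }

    // True if key sorts between start and stop (both inclusive) in string order.
    // For ASCII keys this matches the byte-wise ordering of row keys.
    static boolean inLexicographicRange(String key, String start, String stop) {
        return key.compareTo(start) >= 0 && key.compareTo(stop) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(keys("user_", 500, 900).size());
        // Both keys below sort inside the range although 60 and 5000 are not in 500..900.
        System.out.println(inLexicographicRange("user_60", "user_500", "user_900"));
        System.out.println(inLexicographicRange("user_5000", "user_500", "user_900"));
    }
}
```

Enumerating the keys, as the answers above do, avoids this pitfall. A range scan would only be safe if the numeric part were zero-padded to a fixed width.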

neo4j - batch insertion using neo4j rest graph db

I'm using version 2.0.1.
I have hundreds of thousands of nodes that need to be inserted. My Neo4j graph database is on a standalone server, and I'm using the REST API through the neo4j rest graph db library to achieve this.
However, performance is slow. I've chopped my queries into batches, sending 500 Cypher statements in a single HTTP call. The result that I'm getting looks like this:
10:38:10.984 INFO commit
10:38:13.161 INFO commit
10:38:13.277 INFO commit
10:38:15.132 INFO commit
10:38:15.218 INFO commit
10:38:17.288 INFO commit
10:38:19.488 INFO commit
10:38:22.020 INFO commit
10:38:24.806 INFO commit
10:38:27.848 INFO commit
10:38:31.172 INFO commit
10:38:34.767 INFO commit
10:38:38.661 INFO commit
And so on.
The query that I'm using is as follows:
MERGE (a{main:{val1},prop2:{val2}}) MERGE (b{main:{val3}}) CREATE UNIQUE (a)-[r:relationshipname]-(b);
My code is this:
private RestAPI restAPI;
private RestCypherQueryEngine engine;
private GraphDatabaseService graphDB = new RestGraphDatabase("http://localdomain.com:7474/db/data/");
...
restAPI = ((RestGraphDatabase) graphDB).getRestAPI();
engine = new RestCypherQueryEngine(restAPI);
...
Transaction tx = graphDB.getRestAPI().beginTx();
try {
int ctr = 0;
while (isExists) {
ctr++;
//excute query here through engine.query()
if (ctr % 500 == 0) {
tx.success();
tx.close();
tx = graphDB.getRestAPI().beginTx();
LOGGER.info("commit");
}
}
tx.success();
} catch (FileNotFoundException | NumberFormatException | ArrayIndexOutOfBoundsException e) {
tx.failure();
} finally {
tx.close();
}
Thanks!
UPDATED BENCHMARK.
Sorry for the confusion; the benchmark that I posted isn't accurate and is not for 500 queries. My ctr variable doesn't actually refer to the number of Cypher queries.
So now I'm getting about 500 queries per 3 seconds, and that 3 seconds keeps increasing as well. It's still far slower than embedded Neo4j.
If you have the ability to use Neo4j 2.1.0-M01 (don't use it in prod yet!!), you could benefit from its new features. If you create or generate a CSV file like this:
val1,val2,val3
a_value,another_value,yet_another_value
a,b,c
....
you'd only need to launch the following code:
final GraphDatabaseService graphDB = new RestGraphDatabase("http://server:7474/db/data/");
final RestAPI restAPI = ((RestGraphDatabase) graphDB).getRestAPI();
final RestCypherQueryEngine engine = new RestCypherQueryEngine(restAPI);
final String filePath = "file://C:/your_file_path.csv";
engine.query("USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM '" + filePath
+ "' AS csv MERGE (a{main:csv.val1,prop2:csv.val2}) MERGE (b{main:csv.val3})"
+ " CREATE UNIQUE (a)-[r:relationshipname]->(b);", null);
You'd have to make sure that the file can be accessed from the machine where your server is installed.
Take a look at my server plugin, which does this for you on the server. If you build it and put it in the plugins folder, you can use the plugin in Java as follows:
final RestAPI restAPI = new RestAPIFacade("http://server:7474/db/data");
final RequestResult result = restAPI.execute(RequestType.POST, "ext/CSVBatchImport/graphdb/csv_batch_import",
new HashMap<String, Object>() {
{
put("path", "file://C:/.../neo4j.csv");
}
});
EDIT:
You can also use a BatchCallback in the java REST wrapper to boost the performance and it removes the transactional boilerplate code as well. You could write your script similar to:
final RestAPI restAPI = new RestAPIFacade("http://server:7474/db/data");
int counter = 0;
List<Map<String, Object>> statements = new ArrayList<>();
while (isExists) {
statements.add(new HashMap<String, Object>() {
{
put("val1", "abc");
put("val2", "abc");
put("val3", "abc");
}
});
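if (++counter % 500 == 0) {
restAPI.executeBatch(new Process(statements));
statements = new ArrayList<>();
}
}
// Send whatever is left over from the last, partially filled batch
if (!statements.isEmpty()) {
restAPI.executeBatch(new Process(statements));
}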
static class Process implements BatchCallback<Object> {
private static final String QUERY = "MERGE (a{main:{val1},prop2:{val2}}) MERGE (b{main:{val3}}) CREATE UNIQUE (a)-[r:relationshipname]-(b);";
private List<Map<String, Object>> params;
Process(final List<Map<String, Object>> params) {
this.params = params;
}
@Override
public Object recordBatch(final RestAPI restApi) {
for (final Map<String, Object> param : params) {
restApi.query(QUERY, param);
}
return null;
}
}
