Improve performance in MongoDB using the Java driver - java

I want to integrate MongoDB into my application. I have tested it using the Apache benchmark tool (ab), producing 100,000 incoming requests at a concurrency level of 1000. After some insertion tests I found that it inserts around 1,000 records/sec into MongoDB. That is not sufficient for my application. Can anybody suggest the best way to improve performance so that I can reach the goal of 2,000 records/sec?
My code is:
// Connection setup (shared, static)
private static MongoOptions mo = new MongoOptions();
mo.connectionsPerHost = 20;
mo.threadsAllowedToBlockForConnectionMultiplier = 100;
private static Mongo m = new Mongo("127.0.0.1", mo);
private static DB db = m.getDB("mydb");
private static DBCollection coll = db.getCollection("mycoll");

// Per-request handling: parse the incoming message and insert it twice,
// the second time with an overridden "val" field
DBObject dbObj = (DBObject) JSON.parse(msg);
db.requestStart();
coll.insert(dbObj);
dbObj.removeField("_id"); // drop the _id added by the first insert so the second insert gets a fresh one
dbObj.put("val", "-10");
coll.insert(dbObj);
db.requestDone();

Having 1000 clients (which is what I assume you mean by a concurrency level of 1000) hitting the DB at one time sounds high to me. If it is running on a 1-2 core system, your box is probably spending a lot of time switching between the different processes. Are the DB and the benchmarking tool running on the same box? That will also increase the amount of time spent switching between processes.
You could try putting the client on one multi-core box and the DB on another.
Or try running fewer simulated clients, maybe 10-20.
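Not covered in the answer above, but if per-insert round trips turn out to be the limiting factor, the legacy driver also accepts several documents in a single insert call. A minimal sketch, reusing coll and msg from the question (re-parsing msg so the second document starts without an _id):
// Sketch only: send both documents in one round trip instead of two separate inserts.
DBObject first = (DBObject) JSON.parse(msg);
DBObject second = (DBObject) JSON.parse(msg); // fresh copy, no _id assigned yet
second.put("val", "-10");
coll.insert(first, second); // DBCollection.insert accepts multiple documents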

Related

Investigating slow simple queries in JDBC and MySQL

PreparedStatement.executeQuery() is taking ~20x longer to execute than if it were run directly via the shell. I've logged with timers to determine that this method is the culprit.
The query and some DB info (ignoring the Java issue for the moment):
mysql> SELECT username from users where user_id = 1; // lightning fast
Running that same query 1,000 times via mysqlslap is also lightning fast.
mysqlslap --create-schema=mydb --user=root -p --query="select username from phpbb_users where user_id = 1" --number-of-queries=1000 --concurrency=1
Benchmark
Average number of seconds to run all queries: 0.051 seconds
Minimum number of seconds to run all queries: 0.051 seconds
Maximum number of seconds to run all queries: 0.051 seconds
Number of clients running queries: 1
Average number of queries per client: 1000
The Problem: Performing the same query via JDBC slows things down significantly. Calling queryUsername() below 1,000 times in a for loop (from the main method, which isn't shown here) takes around 872 ms. That's ~17x slower! I've tracked down the heavy usage by placing timers in various spots (some omitted for brevity). The primary suspect is stmt.executeQuery(), which took 776 ms of the 872 ms runtime.
public static String queryUsername() {
    String username = "";
    // DBCore.getConnection() returns HikariDataSource.getConnection() implementation exactly as per https://www.baeldung.com/hikaricp
    try (Connection connection = DBCore.getConnection();
         PreparedStatement stmt = connection.prepareStatement(
                 "SELECT username from phpbb_users where user_id = ?")) {
        stmt.setInt(1, 1); // just looking for user_id 1 for now
        // Guava (Google) Stopwatch used to measure how long executeQuery() is taking.
        // Another timer is used outside of this method call to see how long
        // total execution takes.
        // Approximately 1 second in a for loop calling this method 1000 times.
        Stopwatch s = Stopwatch.createStarted();
        try (ResultSet rs = stmt.executeQuery()) {
            s.stop(); // stopping the timer after executeQuery() has been called
            timeElapsed += s.elapsed(TimeUnit.MICROSECONDS); // timeElapsed is a field accumulated across calls
            while (rs.next()) {
                username = rs.getString("username"); // the query returns 1 record
            }
        }
    } catch (SQLException e) {
        e.printStackTrace();
    }
    return username;
}
Additional context and things tried:
SHOW OPEN TABLES has several tables open, but all have In_use=0 and Name_locked=0.
SHOW FULL PROCESSLIST looks healthy.
user_id is an indexed primary key
The Server is an Upcloud $5/month 1-Core, 1GB RAM running Ubuntu 20.04.1 LTS (GNU/Linux 5.4.0-66-generic x86_64). Mysql Ver 8.0.23-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu))
JDBC Driver is mysql-connector-java_8.0.23.jar, which was obtained from mysql-connector-java_8.0.23-1ubuntu20.04_all via https://dev.mysql.com/downloads/connector/j/
Don't reconnect each time. Open the connection at the start; reuse it until the web page (or program) is finished.
Chances are that you are comparing different realities.
When running mysqlslap you are most likely using Unix domain sockets for the communication between the tool and the MySQL server. Try changing that to TCP and you should observe an immediate performance drop. Connector/J, on the other hand, creates TCP-based connections by default (Unix domain sockets can be used, but only via a third-party library).
Also, in mysqlslap you are running a simple query directly, which is handled by a COM_QUERY protocol command. In the Java sample you are preparing the query first and then executing it. Depending on how Connector/J is configured, this may result in a single COM_QUERY protocol command or in a pair of commands, namely COM_STMT_PREPARE and COM_STMT_EXECUTE. Connector/J is also affected by how its statement caches are configured (and/or the connection pool's). However, you are only measuring the executeQuery part, so, theoretically, Connector/J could even be favored here.
Finally, unless you actually come up with a use case where you guarantee that both executions are effectively doing the same work under the same circumstances, you can compare results and point out differences, but you can't draw any conclusions from them. For example, it's not that hard to introduce caches and make those simple iterations skip communicating with the server entirely... that would make things extremely fast.
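To illustrate the configuration point above: whether Connector/J sends a plain COM_QUERY or a COM_STMT_PREPARE/COM_STMT_EXECUTE pair, and whether prepared statements are cached, is driven by connection properties. A sketch of a URL that turns those on (values are illustrative only, not a recommendation):
// Illustrative Connector/J properties; tune or omit as appropriate.
String url = "jdbc:mysql://localhost:3306/mydb"
        + "?useServerPrepStmts=true"     // prepare/execute as separate protocol commands
        + "&cachePrepStmts=true"         // cache prepared statements on the client
        + "&prepStmtCacheSize=250"
        + "&prepStmtCacheSqlLimit=2048";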
Move the connection borrowing and the Stopwatch-related code out of the method, then measure like this:
Stopwatch s = Stopwatch.createStarted();
try (Connection con = DBCore.getConnection()) {   // borrow the connection once
    for (int i = 0; i < 1000; i++) {
        queryUsername(con);                        // the method now takes the Connection as a parameter
    }
}
s.stop();
System.out.println(s.elapsed(TimeUnit.MICROSECONDS));
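A sketch of what queryUsername might look like after that refactor, with the Connection passed in by the caller and the timing removed (identifiers as in the original question):
// Sketch: the caller owns the Connection and the timing; this method only queries.
public static String queryUsername(final Connection connection) throws SQLException {
    String username = "";
    try (PreparedStatement stmt = connection.prepareStatement(
            "SELECT username from phpbb_users where user_id = ?")) {
        stmt.setInt(1, 1);
        try (ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                username = rs.getString("username"); // the query returns 1 record
            }
        }
    }
    return username;
}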

Using Apache Spark in poor systems with cassandra and java

I want to use Apache Spark on my cluster, which is made up of 5 low-end machines. First I set up Cassandra 3.11.3 on my nodes, and all of my nodes are OK.
After that I inserted 100k records into my nodes with a Java API, without using Spark, and all is OK too.
Now I want to execute a simple query like the following:
select * from myKeySpace.myTbl where field1='someValue';
Since my nodes are weak in hardware, I want to get just a few records from myTbl, like this:
select * from myKeySpace.myTbl where field1='someValue' limit 20;
I have tested this (A), but it is very slow (and I don't know why):
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue' limit 20");
and also this (B), which I think makes Spark fetch all the data and only then apply the limit, which is not my goal:
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue'").limit(20);
I think I can use Spark core (C) too. I also know that a method called perPartitionLimit was added in Cassandra 3.6 and later (D).
As you know, since my nodes are weak, I don't want to fetch all records from the Cassandra table and then apply a limit or something like that. I want to fetch just a small number of records from my table, so that my nodes can handle it.
So what is the best solution?
Update:
I have tried the suggestion given by @AKSW in the comments:
SparkConf conf = new SparkConf()
.setAppName("SparkTest")
.set("spark.cassandra.connection.host","192.168.107.100");
long limit=20;
JavaSparkContext jsc = new JavaSparkContext(conf);
CassandraJavaRDD<CassandraRow> rdd1 = javaFunctions(jsc)
.cassandraTable("myKeySpace", "myTbl")
.select("id").perPartitionLimit(limit);
System.out.println("Count: " + rdd1.count()); //output is "Count: 100000" which is wrong!
jsc.stop();
but perPartitionLimit(limit) with limit=20 does not work and all records are fetched!

Influx db java client batch does not write to DB

I am trying to write points to InfluxDB using their Java client.
Batching is important to me.
If I use influxDB.enableBatch(...) with influxDB.write(Point), no data is inserted.
If I use BatchPoints with influxDB.write(batchPoints), data is inserted successfully.
Both code samples are taken from: https://github.com/influxdata/influxdb-java/tree/influxdb-java-2.7
InfluxDB influxDB = InfluxDBFactory.connect(influxUrl, influxUser, influxPassword);
influxDB.setDatabase(dbName);
influxDB.setRetentionPolicy("autogen");
// Flush every 2000 Points, at least every 100ms
influxDB.enableBatch(2000, 100, TimeUnit.MILLISECONDS);
influxDB.write(Point.measurement("cpu")
.time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
.addField("idle", 90L)
.addField("user", 9L)
.addField("system", 1L)
.build());
Query query = new Query("SELECT idle FROM cpu", dbName);
QueryResult result = influxDB.query(query);
Returns nothing.
BatchPoints batchPoints = BatchPoints.database(dbName).tag("async", "true").build();
Point point1 = Point
.measurement("cpu")
.tag("atag", "test")
.addField("idle", 90L)
.addField("usertime", 9L)
.addField("system", 1L)
.build();
batchPoints.point(point1);
influxDB.write(batchPoints);
Query query = new Query("SELECT * FROM cpu ", dbName);
QueryResult result = influxDB.query(query);
This returns data successfully.
As mentioned, I need the first way to function.
How can I achieve that?
versions:
influxdb-1.3.6
influxdb-java:2.7
Regards, Ido
Maybe it's too late or you have already resolved your issue, but I will answer anyway; it may be useful for others.
I think your first example is not working because you enabled batch functionality, which will "flush every 2000 points, at least every 100 ms". So basically it is working, but you are running the select before the actual save is performed.
When you use the influxDB.enableBatch(...) functionality, the influxdb-java client creates an internal thread pool that writes your data once the batch is full or the timeout expires, so the write does not happen immediately.
In the second example, when you use influxDB.write(batchPoints), the client writes your data to InfluxDB synchronously. That's why your select statement is able to return data immediately.
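One way to confirm this without changing the write path is simply to wait longer than the configured flush interval before querying. A rough sketch against the code from the question (newer influxdb-java versions also expose an explicit flush(), but I haven't checked whether 2.7 has it):
influxDB.enableBatch(2000, 100, TimeUnit.MILLISECONDS);
influxDB.write(Point.measurement("cpu")
        .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
        .addField("idle", 90L)
        .build());

// Give the background batch thread time to flush; the batch above is
// configured to flush at least every 100 ms.
try {
    Thread.sleep(500);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}

QueryResult result = influxDB.query(new Query("SELECT idle FROM cpu", dbName));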

Spring JDBC template ROW Mapper is too slow

I have a DB fetch call with Spring JdbcTemplate, and the number of rows to fetch is around 1 million. Iterating over the result set takes too much time. After debugging the behavior I found that it processes a batch of rows, then waits for some time, then takes another batch and processes it. Row processing does not seem to be continuous, so the overall time runs into minutes. I have used the default configuration for the data source. Please help.
[Edit]
Here is some sample code
this.prestoJdbcTempate.query(query, new RowMapper<SomeObject>() {
    @Override
    public SomeObject mapRow(final ResultSet rs, final int rowNum) throws SQLException {
        System.out.println(rowNum);
        SomeObject obj = new SomeObject();
        obj.setProp1(rs.getString(1));
        obj.setProp2(rs.getString(2));
        ....
        obj.setProp8(rs.getString(8));
        return obj;
    }
});
As most of the comments tell you, showing one million records in any UI is useless and unrealistic - if this is a real business requirement, you need to educate your customer.
Network traffic between the application and the database server is a key performance factor in scenarios like this. There is one optional parameter that can really help here: the fetch size - though only to a certain extent.
Example :
Connection connection = //get your connection
Statement statement = connection.createStatement();
statement.setFetchSize(1000); // configure the fetch size
Most JDBC database drivers have a low fetch size by default, and tuning this can help you in this situation. But beware of the following:
Make sure your JDBC driver supports the fetch size.
Make sure your JVM heap setting (-Xmx) is large enough to handle the objects created as a result.
Finally, select only the columns you need to reduce network overhead.
In Spring, JdbcTemplate lets you set the fetchSize.
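For example, a minimal sketch (dataSource, query, and the row mapper are assumed to be the ones you already have; 1000 is just a starting value to tune):
JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
jdbcTemplate.setFetchSize(1000); // hint to the driver to pull rows in larger chunks
List<SomeObject> rows = jdbcTemplate.query(query, rowMapper);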

direct neighbor relationships cypher query performance

This question is similar to these two: 16283441, 15456345.
UPDATE: here's a database dump.
In a db of 190K nodes and 727K relationships (and 128MB of database disk usage), I'd like to run the following query:
START start_node=node(<id>)
MATCH (start_node)-[r:COOCCURS_WITH]-(partner),
(partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)
RETURN COUNT(DISTINCT s) as num_partner_partner_links;
In this db 90% of the nodes have 0 relationships, and the remaining 10% have from 1 up to 670, so the biggest network this query can return cannot possibly have more than 220K links (670*670/2).
On nodes with fewer than 10K partner_partner_links the query takes 2-4 seconds when warmed up.
For more connected nodes (20-45K links) it takes about 40-50 sec (I don't know how long it would take for the most connected ones).
Specifying the relationship direction helps a bit, but not much (and then the query doesn't return what I need it to return).
Profiling the query on one of the biggest nodes says:
==> ColumnFilter(symKeys=[" INTERNAL_AGGREGATE48d9beec-0006-4dae-937b-9875f0370ea6"], returnItemNames=["num_partner_links"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=[], aggregates=["( INTERNAL_AGGREGATE48d9beec-0006-4dae-937b-9875f0370ea6,Distinct)"], _rows=1, _db_hits=0)
==> PatternMatch(g="(partner)-['r']-(start_node)", _rows=97746, _db_hits=34370048)
==> TraversalMatcher(trail="(start_node)-[ UNNAMED3:COOCCURS_WITH WHERE true AND true]-(another_partner)-[s:COOCCURS_WITH WHERE true AND true]-(partner)", _rows=116341, _db_hits=117176)
==> ParameterPipe(_rows=1, _db_hits=0)
neo4j-sh (0)$
I don't see why this would be so slow; most of the data should be in RAM anyway. Is it possible to get this under 100 ms, or is Neo4j not up to that? I could put up the whole db somewhere if that would help.
The biggest puzzle is that the same query runs slower when rewritten to use different node symbols :)
START n=node(36)
MATCH (n)-[r:COOCCURS_WITH]-(m),
(m)-[s:COOCCURS_WITH]-(p)-[:COOCCURS_WITH]-(n)
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;
START start_node=node(36)
MATCH (start_node)-[r:COOCCURS_WITH]-(partner),
(partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;
The former always runs in over 4.2 seconds, and the latter in under 3.8, no matter how many times I run one and the other (interleaved)!?
SW/HW details: (advanced) Neo4j v1.9.RC2, JDK 1.7.0.10, a MacBook Pro with an SSD disk, 8GB RAM, 2-core i7, with the following Neo4j config:
neostore.nodestore.db.mapped_memory=550M
neostore.relationshipstore.db.mapped_memory=540M
neostore.propertystore.db.mapped_memory=690M
neostore.propertystore.db.strings.mapped_memory=430M
neostore.propertystore.db.arrays.mapped_memory=230M
neostore.propertystore.db.index.keys.mapped_memory=150M
neostore.propertystore.db.index.mapped_memory=140M
wrapper.java.initmemory=4092
wrapper.java.maxmemory=4092
Change your query to the one below. On my laptop, with significantly lower specs than yours, the execution time halves.
START start_node=node(36)
MATCH (start_node)-[r:COOCCURS_WITH]-(partner)
WITH start_node, partner
MATCH (partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;
Also, using your settings doesn't affect performance much compared to the default settings. I'm afraid you can't get the performance you want, but this query is a step in the right direction.
Usually the traversal API will be faster than Cypher because you explicitly control the traversal. I've mimicked the query as follows:
public class NeoTraversal {

    public static void main(final String[] args) {
        final GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabaseBuilder("/neo4j")
                .loadPropertiesFromURL(NeoTraversal.class.getClassLoader().getResource("neo4j.properties"))
                .newGraphDatabase();
        final Set<Long> uniquePartnerRels = new HashSet<Long>();
        long startTime = System.currentTimeMillis();
        final Node start = db.getNodeById(36);
        // Depth-1 traversal: collect all direct COOCCURS_WITH partners of the start node
        for (final Path path : Traversal.description()
                .breadthFirst()
                .relationships(Rel.COOCCURS_WITH, Direction.BOTH)
                .uniqueness(Uniqueness.NODE_GLOBAL)
                .evaluator(Evaluators.atDepth(1))
                .traverse(start)) {
            Node partner = start.equals(path.startNode()) ? path.endNode() : path.startNode();
            // Depth-2 traversal from each partner, keeping only paths that end back at the start node
            for (final Path partnerPath : Traversal.description()
                    .depthFirst()
                    .relationships(Rel.COOCCURS_WITH, Direction.BOTH)
                    .uniqueness(Uniqueness.RELATIONSHIP_PATH)
                    .evaluator(Evaluators.atDepth(2))
                    .evaluator(Evaluators.includeWhereEndNodeIs(start))
                    .traverse(partner)) {
                uniquePartnerRels.add(partnerPath.relationships().iterator().next().getId());
            }
        }
        System.out.println("Execution time: " + (System.currentTimeMillis() - startTime));
        System.out.println(uniquePartnerRels.size());
    }

    static enum Rel implements RelationshipType {
        COOCCURS_WITH
    }
}
This clearly outperforms the Cypher query, so it might be a good alternative for you. Further optimization is likely still possible.
It seems that for anything but depth/breadth-first traversal, Neo4j is not that "blazing fast". I solved the problem by precomputing all the networks and storing them in MongoDB. A node document describing a network looks like this:
{
node_id : long,
partners : long[],
partner_partner_links : long[]
}
partners and partner_partner_links hold the ids of documents describing edges. Fetching the whole network takes 2 queries: one for this document, and another for the edge properties (which also hold the node properties):
db.edge.find({"_id" : {"$in" : network.partner_partner_links}});
