Overhead of iterating over a list in Java

I have a REST API that fetches a list of entities which are then mapped to DTOs. The requests take quite a lot of time: for 10 fetched entities it's about 1.5 seconds. The query to the DB seems OK - it takes about 100 ms.
I found that this one piece of code in the entity->DTO mapping phase takes about 80% of the time:
Map<String, List<DailyTimeDTO>> dailyTimesGroupedByClientStream = timesheetReport.getDailyTimes()
        .stream()
        .filter(dailyTime -> dailyTime.getWorkTime() != null)
        .map(DailyTimeDTO::create)
        .collect(Collectors.groupingBy(DailyTimeDTO::getClient));
With about 30-40 DailyTime objects it takes 80-100 ms, and it is run for each entity during mapping, so mapping 10 entities takes about 1.5 seconds and mapping 100 entities takes a whole lot more.
I tried implementing it without the stream, but that didn't really help (as below). I tried debugging what takes so much time (using System.nanoTime() as below), but each iteration of the loop takes about 5 microseconds (so 0.005 ms), while the whole loop takes about 80-100 ms. So where does so much more time get lost? Is it the overhead of iterating over a list multiple times? Is there something I can do about it besides figuring out a different way to perform this mapping?
dailyTimes.forEach(dailyTime -> {
    long startTime = System.nanoTime();
    if (dailyTime == null || dailyTime.getWorkTime() == null) return;
    DailyTimeDTO dailyTimeDTO = DailyTimeDTO.create(dailyTime);
    map.add(dailyTimeDTO.getClient(), dailyTimeDTO);
    log.info("Iteration full {}", (System.nanoTime() - startTime) / 1000);
});
The DailyTimeDTO::create is nothing special:
public static DailyTimeDTO create(DailyTime dailyTime) {
    return DailyTimeDTO.builder()
            .client(dailyTime.getClient() == null ? "" : dailyTime.getClient().getClientName())
            .project(dailyTime.getProject() == null ? "" : dailyTime.getProject().getProjectName())
            .hours(dailyTime.getWorkTime())
            .build();
}
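One way to narrow down where the time goes is to time the whole grouping pass with the same clock as the per-iteration measurements and compare the two totals; if the sum of the iterations stays far below the outer measurement, the cost sits outside the loop body. A minimal single-pass sketch (assuming the same dailyTimes list and getters as above; the computeIfAbsent grouping stands in for the multimap purely for illustration):
long outerStart = System.nanoTime();
Map<String, List<DailyTimeDTO>> grouped = new HashMap<>();
long bodyTotal = 0;
for (DailyTime dailyTime : dailyTimes) {
    long bodyStart = System.nanoTime();
    if (dailyTime != null && dailyTime.getWorkTime() != null) {
        DailyTimeDTO dto = DailyTimeDTO.create(dailyTime);
        grouped.computeIfAbsent(dto.getClient(), k -> new ArrayList<>()).add(dto);
    }
    bodyTotal += System.nanoTime() - bodyStart;
}
// if these two numbers diverge, the missing time is spent outside the loop body,
// e.g. in materializing dailyTimes itself rather than in the mapping
log.info("whole pass: {} ms, sum of iterations: {} ms",
        (System.nanoTime() - outerStart) / 1_000_000, bodyTotal / 1_000_000);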

Related

MongoDB Performance Test - Basic Understanding - Java

I want to check how fast CRUD operations execute on a MongoDB.
Therefore I recorded the time with the following code:
long start = System.nanoTime();
FindIterable<Document> datasetFindIterable = this.collection.find(filter);
long finish = System.nanoTime();
long timeElapsed = finish - start;
I am aware that the FindIterable object comes with "executionStats" and "executionTimeMillis":
JSONObject jsonobject = (JSONObject) parser.parse(datasetFindIterable.explain().toJson());
JSONObject executionStats = (JSONObject) jsonobject.get("executionStats");
Long executionTimeMillis = (Long) executionStats.get("executionTimeMillis");
However, I am a bit confused; I get the following results:
start (ns):               582918161918004
finish (ns):              582918161932511
timeElapsed (ns):         14507
executionTimeMillis (ms): 1234
14507 ns are 0.014507 ms
How can it be that the executionTimeMillis (1234 ms) is so much larger than the System.nanoTime() difference (0.014507 ms)? Shouldn't it be the other way around, since the System.nanoTime() calls also need some time to execute themselves?
If I recall correctly, there are asynchronous and synchronous MongoDB drivers available.
If you use an asynchronous driver, the issue could be that
long finish = System.nanoTime();
does not wait for
FindIterable<Document> datasetFindIterable = this.collection.find(filter);
to return with a value, so the measured time difference can be lower than the execution time stored in the FindIterable variable.
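Either way, one way to make the measured interval cover the full round trip is to force the query to execute and be consumed before the second timestamp is taken. A minimal sketch (assuming the synchronous com.mongodb.client driver, where find() only builds the iterable and into() drains the cursor):
long start = System.nanoTime();
// into() forces the query to run and exhausts the cursor into a list
List<Document> results = this.collection.find(filter).into(new ArrayList<Document>());
long finish = System.nanoTime();
long timeElapsed = finish - start; // now includes query execution and result transfer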

Multi-threading queries Oracle / Java 7

I'm having a performance issue trying to query a large table with multiple threads.
I'm using Oracle, Spring 2 and Java 7.
I use a PoolDataSource (driver oracle.jdbc.pool.OracleDataSource) with as many connections as there are threads analysing the single table.
I made sure, by logging poolDataSource.getStatistics(), that I have enough available connections at any time.
Here is the code:
ExecutorService executorService = Executors.newFixedThreadPool(nbThreads);
List<Foo> foo = new ArrayList<>();
List<Callable<List<Foo>>> callables = new ArrayList<>();
int offset = 1;
while (!offSetMaxReached) {
    callables.add(new Callable<List<Foo>>() {
        @Override
        public List<Foo> call() throws SQLException, InterruptedException {
            return dao.doTheJob(...);
        }
    });
    offset += 10000;
}
for (Future<List<Foo>> fooFuture : executorService.invokeAll(callables)) {
    foo.addAll(fooFuture.get());
}
executorService.shutdown();
executorService.awaitTermination(1, TimeUnit.DAYS);
In the DAO, I use a connection from the PoolDataSource, and Spring's JdbcTemplate.query(query, PreparedStatementSetter, RowCallbackHandler).
The doTheJob method does the exact same thing for every query result.
My queries look like: SELECT A, B, C FROM MY.BIGTABLE OFFSET ? ROWS FETCH NEXT ? ROWS ONLY
So, to sum up, I have n threads run by a fixed thread pool, and each thread deals with the exact same amount of data and does the exact same things.
But each thread takes longer to complete than the previous one!
Example :
4 threads launched at the same time, but the first row of each result set (i.e. in the RowCallbackHandlers) is processed at:
thread 1 : 1.5s
thread 2 : 9s
thread 3 : 18s
thread 4 : 35s
and so on...
What can be the cause of such behavior ?
Edit:
The main cause of the problem was within the processed table itself.
With OFFSET x ROWS FETCH NEXT y ROWS ONLY, Oracle has to run through all the rows from the first one up to the requested offset.
So accessing offset 10 is much faster than accessing offset 10000000!
I was able to get good response times with temporary tables and good indexes.
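A common alternative, when the rows can be ordered by an indexed key, is keyset (seek) pagination, which avoids re-scanning all the skipped rows. A sketch, assuming a hypothetical indexed, monotonically increasing ID column:
SELECT A, B, C FROM MY.BIGTABLE WHERE ID > ? ORDER BY ID FETCH FIRST 10000 ROWS ONLY
Each batch binds the last ID seen by the previous one instead of an ever-growing offset; for parallel workers, precomputed ID ranges (WHERE ID BETWEEN ? AND ?) serve the same purpose.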

How can I get JPA/Entity Manager to make parallel queries instead of lumping them into one batch?

Inside the doGet method in my servlet I'm using a JPA TypedQuery to retrieve my data. I'm able to get the data I want through an HTTP GET request. The method to get the data takes roughly 10 seconds, and when I make a single request all is good. The problem occurs when I get multiple requests at the same time. If I make 4 requests at the same time, all 4 queries are lumped together and it takes 40 seconds to get the data back for all of them. How can I get JPA to make 4 separate queries in parallel? Is this something that needs to be set in persistence.xml, or is it a code-related issue? Note: I've also tried executing this code in a thread. A link and some appropriate terminology to increase my understanding would be appreciated.
Thanks!
String sequenceNo = request.getParameter("sequenceNo");
EntityManagerFactory emf = Persistence.createEntityManagerFactory("mydbcon");
EntityManager em = emf.createEntityManager(); // declared before the try block so it is in scope in finally
try {
    long startTime = System.currentTimeMillis();
    List<Myeo> returnData = methodToGetData(em);
    System.out.println(sequenceNo + " " + (System.currentTimeMillis() - startTime));
    String myJson = new Gson().toJson(returnData);
    resp.getOutputStream().print(myJson);
    resp.getOutputStream().flush();
} finally {
    resp.getOutputStream().close();
    if (em.isOpen())
        em.close();
}
4 simultaneous request samples
localhost/myservlet/mycodeblock?sequenceNo=A
localhost/myservlet/mycodeblock?sequenceNo=B
localhost/myservlet/mycodeblock?sequenceNo=C
localhost/myservlet/mycodeblock?sequenceNo=D
resulting print statements
A 38002
B 38344
C 38785
D 39065
What I want
A 9002
B 9344
C 9785
D 10065
If you do 4 separate GET requests, they should be handled in parallel. They should not be lumped together, since they run in different transactions.
If that does not work as you wrote, you should check whether you have defined a database connection pool size or a servlet thread pool size that serializes the calls to the DBMS.
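If the connection pool turns out to be the bottleneck, and assuming Hibernate were the JPA provider (an assumption; the question doesn't name one), its built-in pool could be sized in persistence.xml roughly like this:
<persistence-unit name="mydbcon">
    <properties>
        <!-- assumption: Hibernate's built-in pool; enough connections for 4 parallel requests -->
        <property name="hibernate.connection.pool_size" value="10"/>
    </properties>
</persistence-unit>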

Compare timestamp from auto_now format of Django in Java

I am working on a Django and Java project in which I need to compare the time stored by Django to the current time in Java.
I am storing the enabled_time in my models as:
enabled_time = models.DateTimeField(auto_now = True, default=timezone.now())
The time gets populated in the DB in the form:
2017-02-26 14:54:02
Now in my Java project a cron job runs which checks whether enabled_time plus an expiry time is greater than the current time, something like:
Long editedTime = db.getEnabledTime() + (expiryTime * 60 * 1000); // expiryTime is in mins
if (System.currentTimeMillis() - editedTime > 0) {
    // do something
}
Here db is the database entity for that table.
But db.getEnabledTime() gives the result '2017'. What am I doing wrong?
PS: I am storing the time as a Long, which seems unsuitable to me. Can someone suggest which datatype I should choose, or does it work fine?
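For what it's worth, a minimal sketch of the comparison using java.sql.Timestamp instead of a raw Long (an assumption, not the asker's actual mapping; resultSet stands for a hypothetical JDBC ResultSet over that table):
Timestamp enabled = resultSet.getTimestamp("enabled_time"); // maps the DATETIME column
long editedTime = enabled.getTime() + expiryTime * 60L * 1000L; // expiryTime is in mins
if (System.currentTimeMillis() > editedTime) {
    // enabled_time plus expiry has passed: do something
}
Reading the column as a date/time type like this may also avoid whatever partial conversion is currently yielding just '2017'.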

Direct neighbor relationships Cypher query performance

This question is similar to these two: 16283441, 15456345.
UPDATE: here's a database dump.
In a db of 190K nodes and 727K relationships (and 128MB of database disk usage), I'd like to run the following query:
START start_node=node(<id>)
MATCH (start_node)-[r:COOCCURS_WITH]-(partner),
(partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)
RETURN COUNT(DISTINCT s) as num_partner_partner_links;
In this db 90% of the nodes have 0 relationships, and the remaining 10% have from 1 up to 670, so the biggest network this query can return cannot possibly have more than ~220K links (670*670/2).
On nodes with fewer than 10K partner_partner_links the query takes 2-4 seconds when warmed up.
For more connected nodes (20-45K links) it takes about 40-50 seconds (I don't know how long it would take for the most connected ones).
Specifying the relationship direction helps a bit, but not much (and then the query doesn't return what I need it to return).
Profiling the query on one of the biggest nodes says:
==> ColumnFilter(symKeys=[" INTERNAL_AGGREGATE48d9beec-0006-4dae-937b-9875f0370ea6"], returnItemNames=["num_partner_links"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=[], aggregates=["( INTERNAL_AGGREGATE48d9beec-0006-4dae-937b-9875f0370ea6,Distinct)"], _rows=1, _db_hits=0)
==> PatternMatch(g="(partner)-['r']-(start_node)", _rows=97746, _db_hits=34370048)
==> TraversalMatcher(trail="(start_node)-[ UNNAMED3:COOCCURS_WITH WHERE true AND true]-(another_partner)-[s:COOCCURS_WITH WHERE true AND true]-(partner)", _rows=116341, _db_hits=117176)
==> ParameterPipe(_rows=1, _db_hits=0)
neo4j-sh (0)$
I don't see why this would be so slow; most of the data should be in RAM anyway. Is it possible to get this under 100 ms, or is neo4j not up to that? I could put up the whole db somewhere if that would help.
The biggest puzzle is that the same query runs slower when rewritten to use different node symbols :)
START n=node(36)
MATCH (n)-[r:COOCCURS_WITH]-(m),
(m)-[s:COOCCURS_WITH]-(p)-[:COOCCURS_WITH]-(n)
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;
START start_node=node(36)
MATCH (start_node)-[r:COOCCURS_WITH]-(partner),
(partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;
The former always runs in over 4.2 seconds, and the latter in under 3.8, no matter how many times I run one and the other (interleaved)!?
SW/HW details: (advanced) Neo4j v1.9.RC2, JDK 1.7.0.10, a MacBook Pro with an SSD disk, 8 GB RAM, 2-core i7, with the following neo4j config:
neostore.nodestore.db.mapped_memory=550M
neostore.relationshipstore.db.mapped_memory=540M
neostore.propertystore.db.mapped_memory=690M
neostore.propertystore.db.strings.mapped_memory=430M
neostore.propertystore.db.arrays.mapped_memory=230M
neostore.propertystore.db.index.keys.mapped_memory=150M
neostore.propertystore.db.index.mapped_memory=140M
wrapper.java.initmemory=4092
wrapper.java.maxmemory=4092
Change your query to the one below. On my laptop, with significantly lower specs than yours, the execution time halves.
START start_node=node(36)
MATCH (start_node)-[r:COOCCURS_WITH]-(partner)
WITH start_node, partner
MATCH (partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;
Also, using your settings doesn't affect performance much as compared to the default settings. I'm afraid that you can't get the performance you want, but this query is a step in the right direction.
Usually the traversal API will be faster than Cypher because you explicitly control the traversal. I've mimicked the query as follows:
import java.util.HashSet;
import java.util.Set;

import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.graphdb.traversal.Evaluators;
import org.neo4j.kernel.Traversal;
import org.neo4j.kernel.Uniqueness;

public class NeoTraversal {

    public static void main(final String[] args) {
        final GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabaseBuilder("/neo4j")
                .loadPropertiesFromURL(NeoTraversal.class.getClassLoader().getResource("neo4j.properties"))
                .newGraphDatabase();
        final Set<Long> uniquePartnerRels = new HashSet<Long>();
        long startTime = System.currentTimeMillis();
        final Node start = db.getNodeById(36);
        // depth 1: visit every partner of the start node
        for (final Path path : Traversal.description()
                .breadthFirst()
                .relationships(Rel.COOCCURS_WITH, Direction.BOTH)
                .uniqueness(Uniqueness.NODE_GLOBAL)
                .evaluator(Evaluators.atDepth(1))
                .traverse(start)) {
            Node partner = start.equals(path.startNode()) ? path.endNode() : path.startNode();
            // depth 2: keep only paths that close the triangle back at the start node,
            // and record the first relationship (the partner-partner link) of each
            for (final Path partnerPath : Traversal.description()
                    .depthFirst()
                    .relationships(Rel.COOCCURS_WITH, Direction.BOTH)
                    .uniqueness(Uniqueness.RELATIONSHIP_PATH)
                    .evaluator(Evaluators.atDepth(2))
                    .evaluator(Evaluators.includeWhereEndNodeIs(start))
                    .traverse(partner)) {
                uniquePartnerRels.add(partnerPath.relationships().iterator().next().getId());
            }
        }
        System.out.println("Execution time: " + (System.currentTimeMillis() - startTime));
        System.out.println(uniquePartnerRels.size());
    }

    static enum Rel implements RelationshipType {
        COOCCURS_WITH
    }
}
This clearly outperforms the Cypher query, so it might be a good alternative for you. Further optimization is likely still possible.
It seems that for anything but depth/breadth-first traversal, neo4j is not that "blazing fast". I solved the problem by precomputing all the networks and storing them in MongoDB. A node document describing a network looks like this:
{
    node_id : long,
    partners : long[],
    partner_partner_links : long[]
}
partners and partner_partner_links are ids of documents describing edges. Fetching the whole network takes 2 queries: one for this document, and another for the edge properties (which also hold the node properties):
db.edge.find({"_id" : {"$in" : network.partner_partner_links}});
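For illustration, the same two-step fetch through the MongoDB Java sync driver might look roughly like this (a sketch; the "network" and "edge" collection names and field layout follow the documents above):
// query 1: the network document for a given node
Document network = db.getCollection("network")
        .find(Filters.eq("node_id", 36L))
        .first();
// query 2: all edge documents referenced by that network
List<Document> edges = db.getCollection("edge")
        .find(Filters.in("_id", network.getList("partner_partner_links", Long.class)))
        .into(new ArrayList<Document>());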
