MongoDB Java driver much slower than console with $gte/$lte - java

I'm using MongoDB 4.0.1 with the Java driver (mongodb-driver-sync) 3.8.0.
My collection has 564,039 documents with 13 key-value pairs, 2 of which are maps with 10 more key-value pairs.
If I execute the following query in the console, it gives me the results in less than a second:
db.getCollection('tracking_points').find({c: 8, d: 11,
    t: {$gte: new Date("2018-08-10"), $lte: new Date("2018-09-10")}
})
But if I execute this in Java it takes more than 30 seconds:
collection.find(
    and(
        eq("c", clientId),
        eq("d", unitId),
        gte("t", start),
        lte("t", end)
    )
).forEach((Block<Document>) document -> {
    // nothing here
});
There is an index on "t" (timestamp); without it, the console find takes a few seconds.
How can this be fixed?
Edit: Here is the log from the DB after the Java query:
"2018-09-21T08:06:38.842+0300 I COMMAND [conn9236] command fleetman_dev.tracking_points command: count { count: \"tracking_points\", query: {}, $db: \"fleetman_dev\", $readPreference: { mode: \"primaryPreferred\" } } planSummary: COUNT keysExamined:0 docsExamined:0 numYields:0 reslen:45 locks:{ Global: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_msg 0ms",
"2018-09-21T08:06:38.862+0300 I COMMAND [conn9236] command fleetman_dev.tracking_points command: find { find: \"tracking_points\", filter: { c: 8, d: 11, t: { $gte: new Date(1536526800000), $lte: new Date(1536613200000) } }, $db: \"fleetman_dev\", $readPreference: { mode: \"primaryPreferred\" } } planSummary: IXSCAN { t: 1 } cursorid:38396803834 keysExamined:101 docsExamined:101 numYields:0 nreturned:101 reslen:24954 locks:{ Global: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { ",
"2018-09-21T08:06:39.049+0300 I COMMAND [conn9236] command fleetman_dev.tracking_points command: getMore { getMore: 38396803834, collection: \"tracking_points\", $db: \"fleetman_dev\", $readPreference: { mode: \"primaryPreferred\" } } originatingCommand: { find: \"tracking_points\", filter: { c: 8, d: 11, t: { $gte: new Date(1536526800000), $lte: new Date(1536613200000) } }, $db: \"fleetman_dev\", $readPreference: { mode: \"primaryPreferred\" } } planSummary: IXSCAN { t: 1 } cursorid:38396803834 keysExamined:33810 doc",

You are using the Java driver correctly, but your conclusion - that the Java driver is much slower than the console - is based on an invalid comparison. The two code blocks in your question are not equivalent. In the shell variant you retrieve a cursor. In the Java variant you retrieve a cursor and walk over the contents of that cursor.
A valid comparison between the Mongo shell and the Java driver would either have to include walking over the cursor in the shell variant, for example:
db.getCollection('tracking_points').find({c: 8, d: 11,
    t: {$gte: new Date("2018-08-10"), $lte: new Date("2018-09-10")}
}).forEach(
    function(myDoc) {
        // nothing here
    }
)
Or it would have to remove walking over the cursor from the Java variant, for example:
collection.find(
    and(
        eq("c", clientId),
        eq("d", unitId),
        gte("t", start),
        lte("t", end)
    )
);
Both of these would be more valid forms of comparison. If you run either of those, you'll see that the elapsed times are much closer to each other. The follow-on question might be 'why does it take 30s to read this data?'. If so, the fact that you can get the cursor back in under a second tells us that the issue is not about indexing; instead, it is likely to be related to the amount of data being read by the query.
To isolate where the issue is occurring you could gather elapsed times for the following:
1. read the data, iterating over each document, but do not parse each document
2. read the data and parse each document while reading
If the elapsed time for no. 2 is not much more than the elapsed time for no. 1, then you know that the issue is not in parsing and is more likely to be in network transfer. If the elapsed time for no. 2 is much greater than no. 1, then you know that the issue is in parsing and you can dig into the parse call to attribute the elapsed time. It could be constrained resources on the client (CPU and/or memory) or a suboptimal parse implementation. I can't tell from here, but using the above approach to isolate where the problem resides will at least help you direct your investigation.
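As a rough illustration of those two measurements, here is a sketch (not from the original answer; the timing variable names are mine, and it assumes the same collection, filter values and static Filters imports as the question, plus org.bson.RawBsonDocument). RawBsonDocument keeps the raw BSON bytes and defers most decoding, so iterating it approximates "read but don't parse", while the second loop decodes every document into a Document:

// 1) transfer the documents but skip most of the decoding work
long startRaw = System.currentTimeMillis();
long rawCount = 0;
for (RawBsonDocument doc : collection.withDocumentClass(RawBsonDocument.class)
        .find(and(eq("c", clientId), eq("d", unitId), gte("t", start), lte("t", end)))) {
    rawCount++;
}
long readOnlyMillis = System.currentTimeMillis() - startRaw;

// 2) the same query, fully decoded into Document objects
long startParsed = System.currentTimeMillis();
long parsedCount = 0;
for (Document doc : collection.find(and(eq("c", clientId), eq("d", unitId), gte("t", start), lte("t", end)))) {
    parsedCount++;
}
long readAndParseMillis = System.currentTimeMillis() - startParsed;

System.out.println("read only:    " + readOnlyMillis + " ms (" + rawCount + " docs)");
System.out.println("read + parse: " + readAndParseMillis + " ms (" + parsedCount + " docs)");

If both numbers are close to the 30 seconds you are seeing, the time is going into transferring the data; if only the second one is, the time is going into decoding on the client.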

Related

How to track committed offset with Spark job for Kafka batch

I have a use case where I am writing to a Kafka topic in batches using a Spark job (no streaming). Initially I push, say, 10 records to the Kafka topic and run the Spark job, which does some processing and finally writes to another Kafka topic.
Next time, when I push another 5 records and run the Spark job, my requirement is to process only these 5 records, not to start from the beginning offset. I need to maintain the committed offset so that the Spark job runs from the next offset position and does the processing.
Here is the code on the Kafka side to fetch the offsets:
private static List<TopicPartition> getPartitions(KafkaConsumer consumer, String topic) {
    List<PartitionInfo> partitionInfoList = consumer.partitionsFor(topic);
    return partitionInfoList.stream().map(x -> new TopicPartition(topic, x.partition())).collect(Collectors.toList());
}

public static void getOffSet(KafkaConsumer consumer) {
    List<TopicPartition> topicPartitions = getPartitions(consumer, topic);
    consumer.assign(topicPartitions);
    consumer.seekToBeginning(topicPartitions);
    topicPartitions.forEach(x -> {
        System.out.println("Partition-> " + x + " startingOffSet-> " + consumer.position(x));
    });
    consumer.assign(topicPartitions);
    consumer.seekToEnd(topicPartitions);
    topicPartitions.forEach(x -> {
        System.out.println("Partition-> " + x + " endingOffSet-> " + consumer.position(x));
    });
    topicPartitions.forEach(x -> {
        consumer.poll(1000);
        OffsetAndMetadata offsetAndMetadata = consumer.committed(x);
        long position = consumer.position(x);
        System.out.printf("Committed: %s, current position %s%n", offsetAndMetadata == null ? null : offsetAndMetadata.offset(), position);
    });
}
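For context, a minimal sketch (not in the original post) of how these helpers might be wired up; the bootstrap server and group id are taken from the Spark options below, and the String deserializers are an assumption:

// imports: java.util.Properties, org.apache.kafka.clients.consumer.ConsumerConfig,
//          org.apache.kafka.clients.consumer.KafkaConsumer,
//          org.apache.kafka.common.serialization.StringDeserializer
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-consumer-group");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    getOffSet(consumer); // prints starting/ending/committed offsets for the topic
}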
The code below is for Spark to load the messages from the topic, which is not working:
Dataset<Row> kafkaDataset = session.read().format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", topic)
    .option("group.id", "test-consumer-group")
    .option("startingOffsets", "{\"Topic1\":{\"0\":2}}")
    .option("endingOffsets", "{\"Topic1\":{\"0\":3}}")
    .option("enable.auto.commit", "true")
    .load();
After the above code executes, I am again trying to get the offsets by calling
getOffSet(consumer)
on the topic, which always reads from offset 0, and the committed offset fetched initially keeps increasing. I am new to Kafka and still figuring out how to handle such a scenario. Please help here.
Initially I had 10 records in my topic; I published another 2 records and here is the output.
Output after the getOffSet method executes:
Partition-> Topic00-0 startingOffSet-> 0
Partition-> Topic00-0 endingOffSet-> 12
Committed: 12, current position 12
Output after the Spark code executes for loading messages:
Partition-> Topic00-0 startingOffSet-> 0
Partition-> Topic00-0 endingOffSet-> 12
Committed: 12, current position 12
I see no difference. Please take a look and suggest a resolution for this scenario.
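Worth noting (not stated in the original question): Spark's Kafka source manages offsets itself and does not commit them back to the consumer group, so a batch read like the one above will not move the committed offset that the plain KafkaConsumer reports. One common workaround is to record the highest offset read per partition and feed offset + 1 into startingOffsets on the next run; a minimal sketch, assuming the standard Kafka source columns topic, partition and offset:

// imports: static org.apache.spark.sql.functions.col, static org.apache.spark.sql.functions.max
Dataset<Row> maxOffsets = kafkaDataset
    .groupBy(col("topic"), col("partition"))
    .agg(max(col("offset")).alias("maxOffset"));

// Persist these values somewhere durable (file, table, ...) and on the next run build
// the startingOffsets JSON from them, e.g. {"Topic1":{"0":<maxOffset + 1>}}.
maxOffsets.show();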

java.nio.ByteBuffer wrap method partly working with sbt run

I have an issue where I read a byte stream from a big file (~100 MB) and after some integers I get the value 0 (but only with sbt run). When I hit the play button in IntelliJ I get the value I expect, > 0.
My guess was that the environment is somehow different, but I could not spot the difference.
// DemoApp.scala
import java.nio.{ByteBuffer, ByteOrder}

object DemoApp extends App {
  val inputStream = getClass.getResourceAsStream("/HandRanks.dat")
  val handRanks = new Array[Byte](inputStream.available)
  inputStream.read(handRanks)
  inputStream.close()

  def evalCard(value: Int) = {
    val offset = value * 4
    println("value: " + value)
    println("offset: " + offset)
    ByteBuffer.wrap(handRanks, offset, handRanks.length - offset).order(ByteOrder.LITTLE_ENDIAN).getInt
  }

  val cards: List[Int] = List(51, 45, 14, 2, 12, 28, 46)

  def eval(cards: List[Int]): Unit = {
    var p = 53
    cards.foreach(card => {
      println("p = " + evalCard(p))
      p = evalCard(p + card)
    })
    println("result p: " + p)
  }

  eval(cards)
}
The HandRanks.dat file can be found here (I put it inside a directory called resources):
https://github.com/Robert-Nickel/scala-texas-holdem/blob/master/src/main/resources/HandRanks.dat
build.sbt is:
name := "LoadInts"
version := "0.1"
scalaVersion := "2.13.4"
On my Windows machine I use sbt 1.4.6 with Oracle Java 11.
You will see that the evalCard call will work 4 times but after the fifth time the return value is 0. It should be higher than 0, which it is when using IntelliJ's play button.
You are not reading the whole content. This
val handRanks = new Array[Byte](inputStream.available)
allocates only as much as the InputStream currently has available in its buffer, and then you read just that amount with
inputStream.read(handRanks)
Depending on defaults you will process a different amount, but it will never be the full 100 MB of data. For that you would have to read the data into some structure in a loop (a bad idea) or process it in chunks (with iterators, streams, etc.).
import scala.annotation.tailrec
import scala.util.Using

// Using will close the resource whether an error happens or not
Using(getClass.getResourceAsStream("/HandRanks.dat")) { inputStream =>
  def readChunk(): Option[Array[Byte]] = {
    // can be done better, but that's not the point here
    val buffer = new Array[Byte](inputStream.available)
    val bytesRead = inputStream.read(buffer)
    if (bytesRead >= 0) Some(buffer.take(bytesRead))
    else None
  }

  @tailrec def process(): Unit = {
    readChunk() match {
      case Some(chunk) =>
        // do something
        process()
      case None =>
        // nothing to do - EOF reached
    }
  }

  process()
}

Cucumber and Jenkins: False "duplicate step definition"

I'm getting this error message in a Jenkins build for a Java, Selenium and Cucumber project:
Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 1.466 sec <<< FAILURE! - in e2e.CucumberTest
e2e.CucumberTest Time elapsed: 1.466 sec <<< ERROR!
cucumber.runtime.DuplicateStepDefinitionException: Duplicate step definitions in void e2e.sak.OpprettSakSteps.TestCaseDSLTester(String) in file:/tmp/workspace/n_DSL-og-TestsakBuilder_combined/e2e/cucumber/target/test-classes/ and e2e.sak.OpprettSakSteps.TestCaseDSLTester$default(OpprettSakSteps,String,int,Object) in file:/tmp/workspace/n_DSL-og-TestsakBuilder_combined/e2e/cucumber/target/test-classes/
I don't see where the supposed duplicate step definition is. When looking through the Java/Kotlin file, there are absolutely no duplicates, and all step definitions have a trailing "$" (which has been the cause of earlier erroneous duplicate messages). Also, I don't understand what Jenkins is comparing, even though it seems that it tries to show me exactly where the duplicate is:
e2e.sak.OpprettSakSteps.TestCaseDSLTester(String)
and
e2e.sak.OpprettSakSteps.TestCaseDSLTester$default(OpprettSakSteps,String,int,Object)
It points to the same method, but with different parameters? Neither the method name nor the Cucumber step definition name is the same, so what's it complaining about?
Here's the step definition code (Kotlin):
@Gitt("^jeg prøver meg på DSL for \"([^\"]*)\"$")
fun TestsakDSLTester(saksRef: String = "saken") {
    TestsakDSL.create {
        opprettet = LocalDateTime.now().minusDays(1)
        kommunenr = "5035"
        hovedsoker = "Kåre Kotlin"
        fnr = "22097930922"
        sokere = 2
        arbeidlisteHovedsoker { mutableListOf(mutableListOf("Knus og Knask AS", 100, true),
            mutableListOf("Del og Hel", 20, false)) }
        arbeidlisteMedsoker { mutableListOf(mutableListOf("Kiosken på hjørnet", 50, false)) }
        barn = 2
        biler = 1
        verger = mutableListOf(VergeUtils.Companion.VergeFor.HOVEDSOKER,
            VergeUtils.Companion.VergeFor.MEDSOKER)
    }
    System.out.println("");
}

@Gitt("en sak med (\\d+) søkere med hver sin verge$")
fun enSakMedNSokereMedHverSinVerge(): Testsak {
    return TestsakBuilder("saken", "5035")
        .setSoknadtype(Bakgrunn.Hva.values().toList().shuffled().first())
        .setVerger(vergeFor = mutableListOf(VergeUtils.Companion.VergeFor.HOVEDSOKER))
        .createSak().build();
}

Is there an alternative to GroupReduceFunction for running Apache Flink (Java) in parallel?

The code below runs locally but not on the cluster. It hangs on the GroupReduceFunction and does not terminate even after hours (for large data it takes ~9 minutes to compute locally). The last message in the log:
GroupReduce (GroupReduce at main(MyClass.java:80)) (1/1) (...) switched from DEPLOYING to RUNNING.
The code fragment:
DataSet<MyData1> myData1 = env.createInput(new UserDefinedFunctions.MyData1Set());
DataSet<MyData2> myData2 = DataSetUtils.sampleWithSize(myData1, false, 8, Long.MAX_VALUE)
    .reduceGroup(new GroupReduceFunction<MyData1, MyData2>() {
        @Override
        public void reduce(Iterable<MyData1> itrbl, Collector<MyData2> clctr) throws Exception {
            int id = 0;
            for (MyData1 myData1 : itrbl) {
                clctr.collect(new MyData2(id++, myData1));
            }
        }
    });
Any ideas how I could run this segment in parallel? Thanks in advance!
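If the GroupReduceFunction is only there to attach sequential ids, one possible alternative (a sketch, not from the original post; it assumes the poster's MyData1/MyData2 types and the MyData2(int, MyData1) constructor used above) is DataSetUtils.zipWithIndex, which assigns consecutive indices in parallel instead of funnelling all records through a single non-parallel group reduce:

// imports: org.apache.flink.api.java.tuple.Tuple2, org.apache.flink.api.java.utils.DataSetUtils
DataSet<MyData1> sampled = DataSetUtils.sampleWithSize(myData1, false, 8, Long.MAX_VALUE);
DataSet<Tuple2<Long, MyData1>> indexed = DataSetUtils.zipWithIndex(sampled);
DataSet<MyData2> myData2 = indexed
    .map(t -> new MyData2(t.f0.intValue(), t.f1))
    .returns(MyData2.class); // type hint needed because of the lambda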

Writing a pipeline plugin in Jenkins

I am writing a Jenkins plugin. I have set up a pipeline script; when I execute the script it calls some shell scripts and sets up a pipeline. That's working fine.
Example of my code:
node('master') {
    try {
        def appWorkspace = './app/'
        def testWorkspace = './tests/'
        stage('Clean up') {
            // cleanWs()
        }
        stage('Build') {
            parallel (
                app: {
                    dir(appWorkspace) {
                        git changelog: false, credentialsId: 'jenkins.git', poll: false, url: 'https://src.url/to/our/repo'
                        dir('./App') {
                            sh "#!/bin/bash -lx \n ./gradlew assembleRelease"
                        }
                    }
                },
                tests: {
                    dir(testWorkspace) {
                        git changelog: false, credentialsId: 'jenkins.git', poll: false, url: 'https://src.url/to/our/repo'
                        sh "#!/bin/bash -lx \n nuget restore ./Tests/MyProject/MyProject.sln"
                        sh "#!/bin/bash -lx \n msbuild ./Tests/MyProject/MyProject.Core/ /p:Configuration=Debug"
                    }
                }
            )
        }
        stage('Prepare') {
            parallel (
                'install-apk': {
                    sh '''#!/bin/bash -lx
                    result="$(adbExtendedVersion shell pm list packages packagename.app)"
                    if [ ! -z "$result" ]
                    then
                        adbExtendedVersion uninstall packagename.app
                    fi
                    adbExtendedVersion install ''' + appWorkspace + '''/path/to/app-release.apk'''
                },
                'start-appium': {
                    sh "#!/bin/bash -lx \n GetAllAttachedDevices.sh"
                    sh "sleep 20s"
                }
            )
        }
        stage('Test') {
            // Reading content of the file
            def portsFileContent = readFile 'file.txt'
            // Split the file by next line
            def ports = portsFileContent.split('\n')
            // Getting device IDs to get properties of device
            def deviceIDFileContent = readFile 'IDs.txt'
            def deviceIDs = deviceIDFileContent.split('\n')
            // Define port and id as a pair
            def pairs = (0..<Math.min(ports.size(), deviceIDs.size())).collect { i -> [id: deviceIDs[i], port: ports[i]] }
            def steps = pairs.collectEntries { pair ->
                ["UI Test on ${pair.id}", {
                    sh "#!/bin/bash -lx \n mono $testWorkspace/Tests/packages/NUnit.ConsoleRunner.3.7.0/tools/nunit3-console.exe $testWorkspace/Tests/bin/Debug/MyProject.Core.dll --params=port=${pair.port}"
                }]
            }
            parallel steps
        }
    }
    catch (Exception e) {
        println(e);
    }
    finally {
        stage('Clean') {
            archiveArtifacts 'TestResult.xml'
            sh "#!/bin/bash -lx \n KillInstance.sh"
        }
    }
}
This is a Groovy script defining my pipeline. What I am trying to achieve with my plugin is that the user who uses it just inserts some path variables, e.g. the path to his solution or the path to his GitHub source. My plugin then executes the above script automatically with the given parameters.
My problem is that I can't find any documentation on how to write such a pipeline construct in Java. If someone could point me in the right direction I would appreciate it.
