Akka stream stops processing data - java

When I run the stream below, it does not receive any subsequent data once it has started.
final long HOUR = 3600000;
final long PAST_HOUR = System.currentTimeMillis() - HOUR;

private final static ActorSystem actorSystem = ActorSystem.create(Behaviors.empty(), "as");

protected static ElasticsearchParams constructElasticsearchParams(
        String indexName, String typeName, ApiVersion apiVersion) {
    if (apiVersion == ApiVersion.V5) {
        return ElasticsearchParams.V5(indexName, typeName);
    } else if (apiVersion == ApiVersion.V7) {
        return ElasticsearchParams.V7(indexName);
    } else {
        throw new IllegalArgumentException("API version " + apiVersion + " is not supported");
    }
}

String queryStr = "{ \"bool\": { \"must\" : [{\"range\" : {"
        + "\"timestamp\" : { "
        + "\"gte\" : " + PAST_HOUR
        + " }} }]}} ";

ElasticsearchConnectionSettings connectionSettings =
        ElasticsearchConnectionSettings.create("****")
                .withCredentials("****", "****");

ElasticsearchSourceSettings sourceSettings =
        ElasticsearchSourceSettings.create(connectionSettings)
                .withApiVersion(ApiVersion.V7);

Source<ReadResult<Stats>, NotUsed> dataSource =
        ElasticsearchSource.typed(
                constructElasticsearchParams("data", "_doc", ApiVersion.V7),
                queryStr,
                sourceSettings,
                Stats.class);

dataSource.buffer(10000, OverflowStrategy.backpressure());
dataSource.backpressureTimeout(Duration.ofSeconds(1));

dataSource
        .log("error")
        .runWith(Sink.foreach(a -> System.out.println(a)), actorSystem);
This produces the output:
ReadResult(id=1656107389556,source=Stats(size=0.09471),version=)
Data is continually being written to the index data, but the stream does not process it once it has started. Shouldn't the stream continually process data from the upstream source? In this case, the upstream source is an Elastic index named data.
I've tried amending the query to match all documents:
String queryStr = "{\"match_all\": {}}";
but I get the same result.

The Elasticsearch source does not run continuously. It initiates a search, manages pagination (via the scroll API), and streams the results; when Elasticsearch reports that there are no more results, it completes.
You could do something like
Source.repeat(Done).flatMapConcat(done -> ElasticsearchSource.typed(...))
which will run a new search immediately after the previous one finishes. Note that it would be the responsibility of the downstream to filter out duplicates.
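For example, a minimal sketch in the Java DSL, assuming the queryStr, sourceSettings and Stats class from the question; the seenIds set used for de-duplication is a hypothetical in-memory approach:

// Re-issue the search each time the previous one completes; de-duplicate downstream.
// Assumes imports of akka.Done, akka.stream.javadsl.Source/Sink and java.util.concurrent.ConcurrentHashMap.
final Set<String> seenIds = ConcurrentHashMap.newKeySet();
Source.repeat(Done.getInstance())
        .flatMapConcat(done ->
                ElasticsearchSource.typed(
                        constructElasticsearchParams("data", "_doc", ApiVersion.V7),
                        queryStr,
                        sourceSettings,
                        Stats.class))
        .filter(result -> seenIds.add(result.id()))   // drop documents that were already emitted
        .runWith(Sink.foreach(System.out::println), actorSystem);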

Related

Taking unexpectedly long amount of time to transform message in kstreams application

I have a pretty basic kstreams application where I am debouncing messages for a few seconds and then using transform to either remove the message or store it in a state store. I also have a punctuate method that fires every 30 seconds to go through the store and emit the messages.
What I'm finding is that the time from when my application gets the message to when it reaches the transform function is much longer than I would expect (I would assume the transform function runs pretty quickly after the window expires). This is not entirely an issue for my use case, but I am curious about what might be taking so long to reach the transform function.
final StreamsBuilder builder = new StreamsBuilder();
final StoreBuilder<KeyValueStore<String, Payload>> store = Stores.keyValueStoreBuilder(
        Stores.inMemoryKeyValueStore(keyValueStoreName),
        Serdes.String(),
        avroSerde);
builder.addStateStore(store);

final Consumed<String, Payload> consumed = Consumed.with(Serdes.String(), avroSerde)
        .withTimestampExtractor(new WallclockTimestampExtractor());
final Produced<String, Payload> produced = Produced.with(Serdes.String(), avroSerde);
final KStream<String, Payload> stream = builder.stream(inputTopic, consumed);

final SessionWindows sessionWindows = SessionWindows.with(Duration.ofSeconds(2));
final SessionWindowTransformerSupplier transformerSupplier =
        new SessionWindowTransformerSupplier(keyValueStoreName, scheduleTimeSeconds);
final SessionBytesStoreSupplier sessionBytesStoreSupplier = Stores.persistentSessionStore(
        "debounce-window",
        Duration.ofSeconds(3));
final Materialized<String, Payload, SessionStore<Bytes, byte[]>> materializedAs =
        Materialized.as(sessionBytesStoreSupplier);

stream
        .selectKey((key, value) -> {
            logger.info("selecting key: " + key);
            return key;
        })
        .groupByKey()
        .windowedBy(sessionWindows)
        .reduce(payloadDebounceFunction::apply, materializedAs)
        .toStream()
        .transform(transformerSupplier, keyValueStoreName)
        .to(outputTopic, produced);

return builder;
Here is my transform/punctuate method:
@Override
public void init(ProcessorContext context) {
    this.processorContext = context;
    this.store = (KeyValueStore<String, Payload>) context.getStateStore(keyValueStoreName);
    context.schedule(ofSeconds(scheduleTime), WALL_CLOCK_TIME, timestamp -> punctuate());
}

@Override
public KeyValue<String, Payload> transform(Windowed<String> key, Payload value) {
    synchronized (this) {
        if (value != null) {
            BatchScanStatus status = extractStatus(value);
            boolean removeFromStoreStatus = BatchScanStatus.CANCELLED.equals(status)
                    || BatchScanStatus.FINALIZING.equals(status);
            if (removeFromStoreStatus) {
                logger.info("Deleting key from store: {}", key);
                store.delete(key.key());
            } else {
                logger.info("Adding key to store: {}", key);
                store.putIfAbsent(key.key(), value);
            }
            processorContext.commit();
        }
        return null;
    }
}

private void punctuate() {
    synchronized (this) {
        final KeyValueIterator<String, Payload> keyIter = store.all();
        while (keyIter.hasNext()) {
            final KeyValue<String, Payload> record = keyIter.next();
            logger.info("Forwarding key: {}", record.key);
            processorContext.forward(record.key, record.value);
        }
        keyIter.close();
    }
}
The length of time it takes to get from the selectKey function to the transform function confuses me; in this run it took ~24 seconds:
15:58:35.238 [scheduler-79112bd0-2310-482e-9aab-8bcaae746082-StreamThread-1] INFO c.b.d.f.s.kstreams.Scheduler - selecting key: keykeykey
15:58:59.181 [scheduler-79112bd0-2310-482e-9aab-8bcaae746082-StreamThread-1] INFO c.b.d.f.s.k.s.SessionTransformer - Adding key to store: [keykeykey#1570737515238/1570737515238]
Does kstreams do more work than it appears to here, such that something like this takes the amount of time that it does? I'm hoping to get some enlightenment on whether this is a configuration/timing issue or whether this is normal behavior for a kstreams application.
EDIT: I think I have found where I have gone wrong, and it has to do with the default value of commit.interval.ms.
Changes were not getting committed to the internal topic until the commit interval elapsed, and thus my transform function would not fire until those changes arrived on the internal topic. I shortened the interval to one second and immediately saw the difference.
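For reference, a minimal sketch of shortening that interval (the property is StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, whose default is 30000 ms; the application id and bootstrap servers below are placeholders):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "scheduler");          // placeholder application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker address
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);             // commit (and flush results downstream) every second
final KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();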

TFS JAVA SDK - How to run shared query

I have an application which uses the TFS Java SDK 14.0.3.
I have a shared query on my TFS; how can I run the shared query and get the response back using TFS SDK 14.0.3?
Also, I can see that the query URL will expire every 90 days, so is there a better way to execute the shared query?
I currently have a method to run a WIQL query; I want a method to run a shared query as well.
public void getWorkItem(TFSTeamProjectCollection tpc, Project project) {
    WorkItemClient workItemClient = project.getWorkItemClient();

    // Define the WIQL query.
    String wiqlQuery = "Select ID, Title,Assigned from WorkItems where (State = 'Active') order by Title";

    // Run the query and get the results.
    WorkItemCollection workItems = workItemClient.query(wiqlQuery);
    System.out.println("Found " + workItems.size() + " work items.");
    System.out.println();

    // Write out the heading.
    System.out.println("ID\tTitle");

    // Output the first results of the query, allowing the TFS SDK to
    // page in data as required
    final int maxToPrint = 5;
    for (int i = 0; i < workItems.size(); i++) {
        if (i >= maxToPrint) {
            System.out.println("[...]");
            break;
        }
        WorkItem workItem = workItems.getWorkItem(i);
        System.out.println(workItem.getID() + "\t" + workItem.getTitle());
    }
}
A shared query is a query that has been built and saved, so what you need is to retrieve the shared query's definition and then run its query text, rather than running the shared query by URL. You could refer to the case Access TFS Team Query from Client Object API:
/// Handles nested query folders
private static Guid FindQuery(QueryFolder folder, string queryName)
{
    foreach (var item in folder)
    {
        if (item.Name.Equals(queryName, StringComparison.InvariantCultureIgnoreCase))
        {
            return item.Id;
        }
        var itemFolder = item as QueryFolder;
        if (itemFolder != null)
        {
            var result = FindQuery(itemFolder, queryName);
            if (!result.Equals(Guid.Empty))
            {
                return result;
            }
        }
    }
    return Guid.Empty;
}

static void Main(string[] args)
{
    var collectionUri = new Uri("http://TFS/tfs/DefaultCollection");
    var server = new TfsTeamProjectCollection(collectionUri);
    var workItemStore = server.GetService<WorkItemStore>();
    var teamProject = workItemStore.Projects["TeamProjectName"];
    var x = teamProject.QueryHierarchy;
    var queryId = FindQuery(x, "QueryNameHere");
    var queryDefinition = workItemStore.GetQueryDefinition(queryId);
    var variables = new Dictionary<string, string>() { { "project", "TeamProjectName" } };
    var result = workItemStore.Query(queryDefinition.QueryText, variables);
}
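Since the question is about the Java SDK, here is a rough Java sketch of the same idea: walk the project's query hierarchy, find the shared query by name, and run its stored WIQL. Treat the exact names used here (getQueryHierarchy(), QueryFolder.getItems(), QueryDefinition.getQueryText(), and the query(wiql, context) overload) as assumptions to verify against TFS Java SDK 14.0.3:

// Assumed API, verify against your SDK version.
public WorkItemCollection runSharedQuery(Project project, String queryName) {
    WorkItemClient workItemClient = project.getWorkItemClient();
    QueryDefinition definition = findQuery(project.getQueryHierarchy(), queryName);
    if (definition == null) {
        throw new IllegalArgumentException("Shared query not found: " + queryName);
    }
    // The saved WIQL may contain macros such as @project, supplied via a query context map.
    Map<String, Object> queryContext = new HashMap<String, Object>();
    queryContext.put("project", project.getName());
    return workItemClient.query(definition.getQueryText(), queryContext);
}

// Handles nested query folders, mirroring the C# FindQuery above.
private QueryDefinition findQuery(QueryFolder folder, String queryName) {
    for (QueryItem item : folder.getItems()) {
        if (item instanceof QueryDefinition && item.getName().equalsIgnoreCase(queryName)) {
            return (QueryDefinition) item;
        }
        if (item instanceof QueryFolder) {
            QueryDefinition nested = findQuery((QueryFolder) item, queryName);
            if (nested != null) {
                return nested;
            }
        }
    }
    return null;
}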
By the way, you could also check the REST API in the following link:
https://learn.microsoft.com/en-us/rest/api/vsts/wit/queries/get

how to get all uncommitted messages in kafka when manually committing offset

In my application I am consuming JSON messages from a Kafka topic, and multiple instances of my application are running. I have set the Kafka property props.put("enable.auto.commit", "false"), so when I consume a message I push it to my DB and then commit it as follows:
private static void commitMessage(KafkaConsumer<String, String> kafkaConsumer, ConsumerRecord message, String kafkaTopic) {
    long nextOffset = message.offset() + 1;
    TopicPartition topicPartition = new TopicPartition(kafkaTopic, message.partition());
    OffsetAndMetadata offsetAndMetadata = new OffsetAndMetadata(nextOffset);
    Map<TopicPartition, OffsetAndMetadata> offsetAndMetadataMap = new HashMap<>();
    offsetAndMetadataMap.put(topicPartition, offsetAndMetadata);

    log.info("Committing processed kafka message, topic [" + kafkaTopic + "], partition [" + message.partition() + "], next offset [" + nextOffset + "]");
    kafkaConsumer.commitSync(offsetAndMetadataMap);
}
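For context, a minimal sketch of the poll loop this commit method is assumed to be called from (saveToDb and the topic name are placeholders; poll(Duration) requires a newer client, older clients use poll(long)):

kafkaConsumer.subscribe(Collections.singletonList(kafkaTopic));
while (true) {
    ConsumerRecords<String, String> records = kafkaConsumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        saveToDb(record.value());                          // hypothetical DB write
        commitMessage(kafkaConsumer, record, kafkaTopic);  // commit only after the DB write succeeds
    }
}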
Now it may happen that after consuming a message (but before pushing it to the DB) my application restarts for some reason. I then want to consume the uncommitted messages again from Kafka after the restart. I am able to do this using seek:
private static void seekAllPartitions(KafkaConsumer<String, String> kafkaConsumer, String kafkaTopic) {
    List<PartitionInfo> partitionInfos = kafkaConsumer.partitionsFor(kafkaTopic);
    System.out.println("Size of partition list: " + partitionInfos.size());
    for (PartitionInfo partitionInfo : partitionInfos) {
        TopicPartition topicPartition = new TopicPartition(kafkaTopic, partitionInfo.partition());
        OffsetAndMetadata committedForPartition = kafkaConsumer.committed(topicPartition);
        try {
            if (committedForPartition != null) {
                System.out.println("Seeking offset... " + committedForPartition.offset());
                kafkaConsumer.seek(topicPartition, committedForPartition.offset());
            }
        } catch (Exception ex) {
            // ignored
        }
    }
}
Now the problem is that seek(topicPartition, committedForPartition.offset()) gives me the last uncommitted message and not the intermediate uncommitted messages. As I mentioned, multiple instances are running, so I may end up with intermediate uncommitted messages; for example, instance A did not commit the 2nd message and instance B did not commit the 5th, but this gives me only the 5th message and not the 2nd.

Database insertion synchronization

I have Java code that generates a request number based on the data received from the database, and then updates the database with the newly generated number:
synchronized (this.getClass()) {
    counter++;
    System.out.println(counter);
    System.out.println("start " + System.identityHashCode(this));
    certRequest.setRequestNbr(
            generateRequestNumber(certInsuranceRequestAddRq.getAccountInfo().getAccountNumberId()));
    System.out.println("outside funcvtion" + certRequest.getRequestNbr());
    reqId = Utils.getUniqueId();
    certRequest.setRequestId(reqId);
    System.out.println(reqId);
    ItemIdInfo itemIdInfo = new ItemIdInfo();
    itemIdInfo.setInsurerId(certRequest.getRequestId());
    certRequest.setItemIdInfo(itemIdInfo);
    dao.insert(certRequest);
    addAccountRel();
    counter++;
    System.out.println(counter);
    System.out.println("end");
}
The output of the System.out.println() statements is:
1
start 27907101
com.csc.exceed.certificate.domain.CertRequest#a042cb
inside function request number66
outside funcvtion66
AF88172D-C8B0-4DCD-9AC6-12296EF8728D
2
end
3
start 21695531
com.csc.exceed.certificate.domain.CertRequest#f98690
inside function request number66
outside funcvtion66
F3200106-6033-4AEC-8DC3-B23FCD3CA380
4
end
In my case this code is called from two threads.
If you observe, both threads run independently. However, the request number is the same in both cases.
Is it possible that the second thread starts executing before the database update for the first thread completes?
The code for generateRequestNumber() is as follows:
public String generateRequestNumber(String accNumber) throws Exception {
    String requestNumber = null;
    if (accNumber != null) {
        String SQL_QUERY = "select CERTREQUEST.requestNbr from CertRequest as CERTREQUEST, "
                + "CertActObjRel as certActObjRel where certActObjRel.certificateObjkeyId=CERTREQUEST.requestId "
                + " and certActObjRel.certObjTypeCd=:certObjTypeCd "
                + " and certActObjRel.certAccountId=:accNumber ";
        String[] parameterNames = { "certObjTypeCd", "accNumber" };
        Object[] parameterVaues = new Object[] {
                Constants.REQUEST_RELATION_CODE, accNumber };
        List<?> resultSet = dao.executeNamedQuery(SQL_QUERY,
                parameterNames, parameterVaues);
        // List<?> resultSet = dao.retrieveTableData(SQL_QUERY);
        if (resultSet != null && resultSet.size() > 0) {
            requestNumber = (String) resultSet.get(0);
        }
        int maxRequestNumber = -1;
        if (requestNumber != null && requestNumber.length() > 0) {
            maxRequestNumber = maxValue(resultSet.toArray());
            requestNumber = Integer.toString(maxRequestNumber + 1);
        } else {
            requestNumber = Integer.toString(1);
        }
        System.out.println("inside function request number" + requestNumber);
        return requestNumber;
    }
    return null;
}
Databases allow multiple simultaneous connections, so unless you write your code properly you can mess up the data.
Since you only seem to require a unique, growing integer, you can easily and safely generate one inside the database, for example with a sequence (if supported by the database). Databases that do not support sequences usually provide some other mechanism, such as auto-increment columns in MySQL.
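As an illustration, a minimal sketch using JDBC generated keys; the cert_request table, its auto-increment request_nbr column, and the dataSource/accountNumberId variables are hypothetical, and on a database with sequences you would call the sequence instead:

// Uses java.sql.Connection, PreparedStatement, ResultSet and Statement.
// Let the database assign the number atomically instead of computing max+1 in Java,
// which is racy across multiple threads/JVMs even inside a synchronized block.
try (Connection conn = dataSource.getConnection();
     PreparedStatement ps = conn.prepareStatement(
             "INSERT INTO cert_request (account_number_id) VALUES (?)",
             Statement.RETURN_GENERATED_KEYS)) {
    ps.setString(1, accountNumberId);
    ps.executeUpdate();
    try (ResultSet keys = ps.getGeneratedKeys()) {
        if (keys.next()) {
            long requestNbr = keys.getLong(1);   // generated by AUTO_INCREMENT / a sequence, so never duplicated
            certRequest.setRequestNbr(Long.toString(requestNbr));
        }
    }
}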

neo4j - batch insertion using neo4j rest graph db

I'm using version 2.0.1.
I have hundreds of thousands of nodes that need to be inserted. My neo4j graph DB is on a standalone server, and I'm using the RestAPI through the neo4j rest graph db library to achieve this.
However, I'm seeing slow performance. I've chopped my queries into batches, sending 500 cypher statements in a single HTTP call. The results that I'm getting look like:
10:38:10.984 INFO commit
10:38:13.161 INFO commit
10:38:13.277 INFO commit
10:38:15.132 INFO commit
10:38:15.218 INFO commit
10:38:17.288 INFO commit
10:38:19.488 INFO commit
10:38:22.020 INFO commit
10:38:24.806 INFO commit
10:38:27.848 INFO commit
10:38:31.172 INFO commit
10:38:34.767 INFO commit
10:38:38.661 INFO commit
And so on.
The query that I'm using is as follows:
MERGE (a{main:{val1},prop2:{val2}}) MERGE (b{main:{val3}}) CREATE UNIQUE (a)-[r:relationshipname]-(b);
My code is this:
private RestAPI restAPI;
private RestCypherQueryEngine engine;
private GraphDatabaseService graphDB = new RestGraphDatabase("http://localdomain.com:7474/db/data/");
...
restAPI = ((RestGraphDatabase) graphDB).getRestAPI();
engine = new RestCypherQueryEngine(restAPI);
...
Transaction tx = graphDB.getRestAPI().beginTx();
try {
int ctr = 0;
while (isExists) {
ctr++;
//excute query here through engine.query()
if (ctr % 500 == 0) {
tx.success();
tx.close();
tx = graphDB.getRestAPI().beginTx();
LOGGER.info("commit");
}
}
tx.success();
} catch (FileNotFoundException | NumberFormatException | ArrayIndexOutOfBoundsException e) {
tx.failure();
} finally {
tx.close();
}
Thanks!
UPDATED BENCHMARK:
Sorry for the confusion; the benchmark that I've posted isn't accurate and is not for 500 queries. My ctr variable isn't actually referring to the number of cypher queries.
So now I'm getting roughly 500 queries per 3 seconds, and that 3 seconds keeps increasing as well. It's still way slower than embedded neo4j.
If you have the ability to use Neo4j 2.1.0-M01 (don't use it in prod yet!!), you could benefit from new features. If you create/generate a CSV file like this:
val1,val2,val3
a_value,another_value,yet_another_value
a,b,c
....
you'd only need to launch the following code:
final GraphDatabaseService graphDB = new RestGraphDatabase("http://server:7474/db/data/");
final RestAPI restAPI = ((RestGraphDatabase) graphDB).getRestAPI();
final RestCypherQueryEngine engine = new RestCypherQueryEngine(restAPI);
final String filePath = "file://C:/your_file_path.csv";
engine.query("USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM '" + filePath
        + "' AS csv MERGE (a{main:csv.val1,prop2:csv.val2}) MERGE (b{main:csv.val3})"
        + " CREATE UNIQUE (a)-[r:relationshipname]->(b);", null);
You'd have to make sure that the file can be accessed from the machine where your server is installed.
Take a look at my server plugin that does this for you on the server. If you build it and put it in the plugins folder, you can use the plugin from Java as follows:
final RestAPI restAPI = new RestAPIFacade("http://server:7474/db/data");
final RequestResult result = restAPI.execute(RequestType.POST, "ext/CSVBatchImport/graphdb/csv_batch_import",
        new HashMap<String, Object>() {
            {
                put("path", "file://C:/.../neo4j.csv");
            }
        });
EDIT:
You can also use a BatchCallback in the Java REST wrapper to boost performance; it removes the transactional boilerplate code as well. You could write your script similar to:
final RestAPI restAPI = new RestAPIFacade("http://server:7474/db/data");
int counter = 0;
List<Map<String, Object>> statements = new ArrayList<>();
while (isExists) {
    statements.add(new HashMap<String, Object>() {
        {
            put("val1", "abc");
            put("val2", "abc");
            put("val3", "abc");
        }
    });
    if (++counter % 500 == 0) {
        restAPI.executeBatch(new Process(statements));
        statements = new ArrayList<>();
    }
}

static class Process implements BatchCallback<Object> {

    private static final String QUERY = "MERGE (a{main:{val1},prop2:{val2}}) MERGE (b{main:{val3}}) CREATE UNIQUE (a)-[r:relationshipname]-(b);";

    private List<Map<String, Object>> params;

    Process(final List<Map<String, Object>> params) {
        this.params = params;
    }

    @Override
    public Object recordBatch(final RestAPI restApi) {
        for (final Map<String, Object> param : params) {
            restApi.query(QUERY, param);
        }
        return null;
    }
}
