How to query directly from a Kafka topic? - java

I've looked into interactive queries and KSQL, but I can't figure out whether querying for specific records by key is possible.
Say I have a record in a topic as shown:
{
key: 12314,
value:
{
id: "1",
name: "bob"
}
}
Would it be possible to search for key 12314 in a topic? Also, do KSQL and interactive queries consume the entire topic to run queries?

Assuming your value is valid JSON (i.e. the field names are also quoted), you can do this easily with KSQL/ksqlDB:
Examine the Kafka topic in ksqlDB:
ksql> PRINT test3;
Format:JSON
1/9/20 12:11:35 PM UTC , 12314 , {"id": "1", "name": "bob" }
Declare the stream:
ksql> CREATE STREAM FOO (ID VARCHAR, NAME VARCHAR)
WITH (KAFKA_TOPIC='test3',VALUE_FORMAT='JSON');
Filter the stream as data arrives:
ksql> SELECT ROWKEY, ID, NAME FROM FOO WHERE ROWKEY='12314' EMIT CHANGES;
+----------------------------+----------------------------+----------------------------+
|ROWKEY                      |ID                          |NAME                        |
+----------------------------+----------------------------+----------------------------+
|12314                       |1                           |bob                         |

Everyone always forgets to add that interactive queries only make sense if the underlying dataset is small enough to be materialized.
For example, you cannot efficiently find a message by key in a huge topic; at least I have not found a way to do so.
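Without such a materialized store, the fallback really is a scan: a plain consumer reads every record and filters by key on the client side, which is O(topic size). A sketch, assuming String keys and values on a topic test3 (the group id is made up):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KeyScan {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "key-scan");           // made-up group id
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");  // read from the start
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test3"));
            boolean found = false;
            while (!found) { // every record is read and filtered client-side
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    if ("12314".equals(rec.key())) {
                        System.out.println(rec.value());
                        found = true; // stop at first match; a real scan would bound by end offsets
                    }
                }
            }
        }
    }
}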

Yes, you can do it with interactive queries.
You can create a Kafka Streams application that reads the input topic and builds a state store (in memory or RocksDB, kept in sync with Kafka).
That state store is queryable by key (ReadOnlyKeyValueStore).
There are multiple examples in the official documentation:
https://kafka.apache.org/10/documentation/streams/developer-guide/interactive-queries.html
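For illustration, here is a minimal sketch of that approach, assuming a topic test3 with String keys and values and a reasonably recent Kafka Streams client (StoreQueryParameters needs 2.5+); the application id and store name below are made up:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class TopicKeyLookup {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-key-lookup"); // made-up id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Materialize the topic into a local state store keyed by record key.
        StreamsBuilder builder = new StreamsBuilder();
        builder.table("test3",
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("records-store")
                        .withKeySerde(Serdes.String())
                        .withValueSerde(Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Thread.sleep(5_000); // crude; real code should wait for State.RUNNING

        // Point lookup by key against the materialized store.
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("records-store",
                        QueryableStoreTypes.keyValueStore()));
        System.out.println(store.get("12314"));
    }
}

Note that the application still consumes the topic once to build the store; after that, lookups are local key/value reads rather than topic scans.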

Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark

I have a Kafka stream that I am loading into Spark. Messages from the Kafka topic have the following attributes: bl_iban, blacklisted, timestamp. So there are IBANs, a flag for whether that IBAN is blacklisted (Y/N), and the timestamp of the record.
The thing is that there can be multiple records for one IBAN, because over time an IBAN can get blacklisted or "removed". What I am trying to achieve is to know the current status of each IBAN. However, I started with an even simpler goal: listing the latest timestamp for each IBAN (after that I would like to add the blacklisted status as well). So I have produced the following code (where blackList represents the Dataset that I have loaded from Kafka):
blackList = blackList.groupBy("bl_iban")
.agg(col("bl_iban"), max("timestamp"));
And after that I have tried to print it to the console using the following code:
StreamingQuery query = blackList.writeStream()
.format("console")
.outputMode(OutputMode.Append())
.start();
I ran my code and I get the following error:
Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark
So I added a watermark to my Dataset like so:
blackList = blackList.withWatermark("timestamp", "2 seconds")
.groupBy("bl_iban")
.agg(col("bl_iban"), max("timestamp"));
And I got the same error after that.
Any ideas on how I can approach this problem?
Update:
With mike's help I have managed to get rid of that error. But the problem is that I still cannot get my blacklist working. I can see the data being loaded from Kafka, but after my group operation I get two empty batches and that is it.
Printed data from Kafka:
+-----------------------+-----------+-----------------------+
|bl_iban |blacklisted|timestamp |
+-----------------------+-----------+-----------------------+
|SK047047595122709025789|N |2020-04-10 17:26:58.208|
|SK341492788657560898224|N |2020-04-10 17:26:58.214|
|SK118866580129485701645|N |2020-04-10 17:26:58.215|
+-----------------------+-----------+-----------------------+
This is how I obtained the blacklist that is printed above:
blackList = blackList.selectExpr("split(cast(value as string),',') as value", "cast(timestamp as timestamp) timestamp")
.selectExpr("value[0] as bl_iban", "value[1] as blacklisted", "timestamp");
And this is my group operation:
Dataset<Row> blackListCurrent = blackList.withWatermark("timestamp", "20 minutes")
.groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("bl_iban"))
.agg(col("bl_iban"), max("timestamp"));
Link to source file: Spark Blacklist
When you use watermarking in Spark you need to ensure that your aggregation knows about the window. The Spark documentation provides some more background.
In your case the code should look something like this:
blackList = blackList.withWatermark("timestamp", "2 seconds")
.groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("bl_iban"))
.agg(col("bl_iban"), max("timestamp"));
It is important that the attribute timestamp has the data type timestamp!
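For completeness, a sketch of wiring the corrected aggregation to the console sink (same column names as above; the alias latest_timestamp is made up). Note that in append mode a windowed aggregate is only emitted once the watermark passes the end of its window, which is also why early micro-batches can look empty:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;

Dataset<Row> blackListCurrent = blackList
        .withWatermark("timestamp", "2 seconds")
        .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("bl_iban"))
        .agg(max("timestamp").as("latest_timestamp"));

// Append mode emits a window only after the watermark moves past its end,
// so results appear with a delay rather than in the first batches.
StreamingQuery query = blackListCurrent.writeStream()
        .format("console")
        .outputMode(OutputMode.Append())
        .option("truncate", false)
        .start();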

How do I access a list of maps nested in ${body} of Apache Camel in Java

I am trying to access data in the incoming ${body} of my JSON input. I have done the unmarshaling with Jackson and mapped it to a Java Map with
.unmarshal().json(JsonLibrary.Jackson, java.util.Map.class)
My JSON data looks something like this after the unmarshal step above:
{ "projectId": 12345,
  "title": "12345 - Plant 1 Processing",
  "partners": [{"partnerName": "partnerJV1", "partnerLocation": "JA"},
               {"partnerName": "partnerJV2", "partnerLocation": "FL"},
               {"partnerName": "partnerJV3", "partnerLocation": "OH"}]
}
The last part can have 0-N partnerName/partnerLocation maps in the partners List.
Now I am having to insert this data into a SQL table with
.to("sql:classpath:sql/sql_queries.sql")
My sql_queries.sql has the following query in it to insert the data fields into the table:
INSERT INTO MY_TABLE(PID, TITLE, PartnerName1, PartnerLocation1, PartnerName2, PartnerLocation2, PartnerName3, PartnerLocation3) VALUES(:#${body['projectId']}, :#${body['title']}, :#${body['partners[0]']['partnerName']}, :#${body['partners[0]']['partnerLocation']})
I cannot figure out the syntax for the last part, which is a List of Maps: whether it should be
:#${body['partners[0]']['partnerName']}
or something else in order for me to get that value.
Any hints would help, thank you!
What worked for me in the end was this:
:#${body['partners'][0]['partnerName']}
But I would love to find a way to iterate over the values like a list in Java when I don't know the size of it to begin with.
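One way to iterate without knowing the size up front is Camel's split EIP; a rough sketch (the direct:project endpoint name is made up, and the simple expression indexes into the unmarshalled Map):

import java.util.Map;

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.model.dataformat.JsonLibrary;

public class PartnerRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("direct:project")                                // hypothetical endpoint
            .unmarshal().json(JsonLibrary.Jackson, Map.class)
            // split() walks the partners list however long it is;
            // each split exchange carries one partner Map as its body
            .split(simple("${body['partners']}"))
                .log("partner ${body['partnerName']} in ${body['partnerLocation']}")
            .end();
    }
}

Each split iteration could then feed a per-partner INSERT instead of hard-coding PartnerName1..3 columns.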

How to get last status message of each record using SQL?

Consider the following scenario.
I have a Java application which uses an Oracle database to store status codes and messages.
For example, a patient record is processed in 3 layers (assume 1. receiving class, 2. translation class, 3. sending class), and each layer stores a row in the database. When we run a query it shows something like this:
Name  Status  Status_Message
XYZ   11      XML message received
XYZ   21      XML message translated to swift format
XYZ   31      Completed message send to destination
ABC   11      XML message received
ABC   21      XML message translated to swift format
ABC   91      Failed message send to destination
In my Java class I am executing the query below to get the last status message:
select STATUS_MESSAGE from INTERFACE_MESSAGE_STATUS
where NAME = ? order by STATUS
I publish this status message on a webpage. But my problem is that I am not getting the last status message; the row returned is effectively arbitrary. Sometimes it prints "XML message received", sometimes "XML message translated to swift format", etc.
But I want to publish the last status, like "Completed message send to destination" or "Failed message send to destination", depending on the final state. How can I do that? Please suggest.
You can use a query like this:
select i.STATUS_MESSAGE
from INTERFACE_MESSAGE_STATUS i
join (
    select max(status) as last_status
    from INTERFACE_MESSAGE_STATUS
    where name = ?
) s on i.status = s.last_status
where i.name = ?
In the above example I am assuming that the row with the highest status code is the last status.
I would recommend creating a view out of this select query and then using that in your codebase. The reason is that a view is much easier to read and makes it possible to select on multiple last statuses without complicating your queries too much.
You have no explicit ordering specified. Oracle stores the rows of a heap-organized table in no particular order; many factors influence which row you get back. Only an explicit ORDER BY guarantees the desired order (an index on the relevant columns can help, too).
My suggestion: add a date_created column to your table and sort based on that.
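A rough JDBC sketch of that suggestion, assuming the proposed DATE_CREATED column has been added and an Oracle version with FETCH FIRST support (12c+):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class LastStatusLookup {
    // Fetch only the newest row for a name by sorting on the new column.
    public static String lastStatus(Connection conn, String name) throws SQLException {
        String sql = "select STATUS_MESSAGE from INTERFACE_MESSAGE_STATUS"
                   + " where NAME = ? order by DATE_CREATED desc"
                   + " fetch first 1 row only";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, name);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}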

Hadoop - Analyze log file (Java)

The logfile looks like this:
Time stamp,activity,-,User,-,id,-,data
--
2013-01-08T16:21:35.561+0100,reminder,-,User1234,-,131235467,-,-
2013-01-02T15:57:24.024+0100,order,-,User1234,-,-,-,{items:[{"prd":"131235467","count": 5, "amount": 11.6},{"prd": "13123545", "count": 1, "amount": 55.99}], oid: 5556}
2013-01-08T16:21:35.561+0100,login,-,User45687,-,143435467,-,-
2013-01-08T16:21:35.561+0100,reminder,-,User45687,-,143435467,-,-
2013-01-08T16:21:35.561+0100,order,-,User45687,-,-,-,{items:[{"prd":"1315467","count": 5, "amount": 11.6},{"prd": "133545", "count": 1, "amount": 55.99}], oid: 5556}
...
...
Edit
A concrete example from this log:
User1234 received a reminder with id=131235467; after this he made an order with the following data: {items:[{"prd":"131235467","count": 5, "amount": 11.6},{"prd": "13123545", "count": 1, "amount": 55.99}], oid: 5556}
In this case the reminder id and the prd in the order data are the same, so I want to sum up count*amount -> in this case 5*11.6 = 58 and output it like
User1234 Prdsum: 58
User45687 also made an order, but he didn't receive a reminder, so his data is not summed up.
Output:
User45687 Prdsum: 0
Final output of this log:
User1234 Prdsum: 58
User45687 Prdsum: 0
My question is: how can I compare these values -> the id and the prd in the data?
The key is the user. Would a custom Writable be useful -> value = (id, data)? I need some ideas.
I recommend computing the raw output sums, as you are doing, in the first pass of one Hadoop job, so that at the end of that job you have a result like this:
User1234 Prdsum: 58
User45687 Prdsum: 0
and then have a second Hadoop job (or standalone job) that compares the various values and produces another report.
Do you need "state" as part of the first Hadoop job? If so, then you will need to keep a HashMap or Hashtable in your mapper or reducer that stores the values of all the keys (the users in this case) to compare - but that is not a good setup, IMHO. You are better off just doing the aggregation in one Hadoop job and the comparison in another.
One way to achieve this is by using a composite key.
The mapper output key is a combination of userid and event id (reminder -> 0, order -> 1). Partition the data by userid, and you need to write your own comparator.
Here is the gist:
Mapper
for every event, check the event type
if event type is "reminder"
    emit : <User1234,0> <reminder id>
if event type is "order"
    split if you have multiple orders
    for every order
        emit : <User1234,1> <prd, count*amount, other fields of interest>
Partition by userid so that all entries with the same user id go to the same reducer.
Reducer
At the reducer all entries will be grouped by userid and sorted by event id (i.e. first you will get all reminders for a given userid, followed by the orders).
If `eventid` is 0
    add the reminder id to a set (`reminderSet`).
If `eventid` is 1 && prd is in `reminderSet`
    emit : `<userid> <prdsum>`
else
    emit : `<userid> <0>`
More details on composite keys can be found in 'Hadoop: The Definitive Guide' or here.
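A bare-bones sketch of such a composite key (all class and field names are made up): it sorts by user id first and event type second, so reminders (0) arrive before orders (1), while the partitioner hashes only the user id so that one user's events meet in a single reducer:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Partitioner;

public class UserEventKey implements WritableComparable<UserEventKey> {
    private final Text userId = new Text();
    private int eventType; // 0 = reminder, 1 = order

    public void set(String user, int type) { userId.set(user); eventType = type; }

    @Override public void write(DataOutput out) throws IOException {
        userId.write(out);
        out.writeInt(eventType);
    }

    @Override public void readFields(DataInput in) throws IOException {
        userId.readFields(in);
        eventType = in.readInt();
    }

    // Sort by user first, then event type, so reminders precede orders.
    @Override public int compareTo(UserEventKey o) {
        int c = userId.compareTo(o.userId);
        return c != 0 ? c : Integer.compare(eventType, o.eventType);
    }

    // Partition on the user id only, so all events of a user hit one reducer.
    public static class UserPartitioner extends Partitioner<UserEventKey, Text> {
        @Override public int getPartition(UserEventKey key, Text value, int numPartitions) {
            return (key.userId.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }
}

The job would register these via job.setMapOutputKeyClass(UserEventKey.class) and job.setPartitionerClass(UserEventKey.UserPartitioner.class), plus a grouping comparator on the user id alone so that one reduce() call sees both event types of a user.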

Need suggestions to clarify the concept of MongoDB to store and retrieve images

I am new to MongoDB. I was told to use MongoDB for my photo management web app, but I am not able to understand MongoDB's basic concept: the documents.
What is a document in MongoDB?
j = { name : "mongo" };
t = { x : 3 };
On the MongoDB website they say that the above 2 lines are 2 documents.
But until now I thought .txt, .doc, .xls, etc. files were documents. (This may be funny, but I really need to understand the concepts!)
How do you represent a txt file, for example example.txt, in MongoDB?
What is a collection?
A collection of documents is known as a "collection" in MongoDB.
How many collections can I create?
Are all documents shared across all collections?
Finally I come to my part: how shall I represent images in MongoDB?
With the help of tutorials I have learned to store and retrieve images from MongoDB using Java!
But without understanding MongoDB's concepts I cannot move further!
The blogs and articles about MongoDB are pretty interesting, but I am still not able to understand its basic concepts!
Can anyone strike my head with MongoDB!?
Perhaps comparing MongoDB to SQL would help you ...
In SQL, queries work against tables, columns and rows in set-based operations. There are pre-defined schemas (and hopefully indexes) to aid the query processor (as well as the querier!).
SQL Table / Rows
id | Column1 | Column2
-----------------------
1  | aaaaa   | Bill
2  | bbbbb   | Sally
3  | ccccc   | Kyle
SQL Query
SELECT * FROM Table1 WHERE Column1 = 'aaaaa' ORDER BY Column2 DESC
This query would return all the columns in the table named Table1 where the column named Column1 has a value of aaaaa; it then orders the results by the value of Column2, descending, and returns them to the client.
MongoDB
In MongoDB there are no tables, columns or rows ... instead there are Collections (these are like tables) and Documents inside the Collections (like rows).
MongoDB Collection / Documents
{
"_id" : ObjectId("497ce96f395f2f052a494fd4"),
"attribute1" : "aaaaa",
"attribute2" : "Bill",
"randomAttribute" : "I am different"
}
{
"_id" : ObjectId("497ce96f395f2f052a494fd5"),
"attribute1" : "bbbbb",
"attribute2" : "Sally"
}
{
"_id" : ObjectId("497ce96f395f2f052a494fd6"),
"attribute1" : "ccccc",
"attribute2" : "Kyle"
}
However, there is no predefined "table structure" or "schema" like a SQL table. For example, you can see that the first document in this collection has an attribute called randomAttribute that none of the other documents have.
This is just fine; it won't affect our queries, but it does allow for some very powerful things ...
The data is stored in a format called BSON, which is very close to the JavaScript JSON standard. You can find out more at http://bson.org/
MongoDB Query
SELECT * FROM Table1 WHERE Column1 = 'aaaaa' ORDER BY Column2 DESC
How would we do this same thing in MongoDB's Shell?
> db.collection1.find({attribute1:"aaaaa"}).sort({attribute2:-1});
Perhaps you can already see how similar a MongoDB query really is to SQL (while appearing quite different). I have some posts up on http://learnmongo.com which might help you as well.
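Since the question is tagged Java, roughly the same query with the MongoDB Java driver might look like this (the database and collection names are made up):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Sorts.descending;

public class FindExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> col =
                    client.getDatabase("test").getCollection("collection1");
            // Same as the shell example: filter on attribute1, sort attribute2 descending.
            for (Document doc : col.find(eq("attribute1", "aaaaa"))
                                   .sort(descending("attribute2"))) {
                System.out.println(doc.toJson());
            }
        }
    }
}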
MongoDB is a document database : http://en.wikipedia.org/wiki/Document-oriented_database
