Hadoop - Analyze log file (Java)

The log file looks like this:
Time stamp,activity,-,User,-,id,-,data
--
2013-01-08T16:21:35.561+0100,reminder,-,User1234,-,131235467,-,-
2013-01-02T15:57:24.024+0100,order,-,User1234,-,-,-,{items:[{"prd":"131235467","count": 5, "amount": 11.6},{"prd": "13123545", "count": 1, "amount": 55.99}], oid: 5556}
2013-01-08T16:21:35.561+0100,login,-,User45687,-,143435467,-,-
2013-01-08T16:21:35.561+0100,reminder,-,User45687,-,143435467,-,-
2013-01-08T16:21:35.561+0100,order,-,User45687,-,-,-,{items:[{"prd":"1315467","count": 5, "amount": 11.6},{"prd": "133545", "count": 1, "amount": 55.99}], oid: 5556}
...
...
Edit
Concrete example from this log:
User1234 got a reminder with id=131235467; after that he made an order with the following data: {items:[{"prd":"131235467","count": 5, "amount": 11.6},{"prd": "13123545", "count": 1, "amount": 55.99}], oid: 5556}
In this case the reminder id and the prd in the data are the same, so I want to sum up count*amount -> in this case 5*11.6 = 58 and output it like
User1234 Prdsum: 58
User45687 also made an order, but he didn't receive a reminder, so his data is not summed up.
Output:
User45687 Prdsum: 0
Final Output of this log:
User1234 Prdsum: 58
User45687 Prdsum: 0
My question is: how can I compare these values -> id and prd in data?
The key is the user. Would a custom Writable be useful -> value = (id, data)? I need some ideas.

I recommend producing the raw sums, as you are doing, as the result of the first pass of one Hadoop job, so that at the end of that job you have a result like this:
User1234 Prdsum: 58
User45687 Prdsum: 0
and then have a second Hadoop job (or standalone job) that compares the various values and produces another report.
Do you need "state" as part of the first Hadoop job? If so, then you will need to keep a HashMap or HashTable in your mapper or reducer that stores the values of all the keys (users in this case) to compare - but that is not a good setup, IMHO. You are better off just doing an aggregate in one Hadoop job, and doing the comparison in another.

One way to achieve this is by using a composite key.
The mapper output key is a combination of userid and event id (reminder -> 0, order -> 1). Partition the data by userid, and you will need to write your own comparator.
Here is the gist.
Mapper
for every event check the event type
if event type is "reminder"
emit : <User1234,0> <reminder id>
if event type is "order"
split if you have multiple orders
for every order
emit : <User1234,1> <prd, count* amount, other interested blah>
Partition using userid so all entries with the same userid go to the same reducer.
Reducer
At the reducer all entries will be grouped by userid and sorted by event id (i.e. first you will get all reminders for a given userid, followed by the orders).
If `eventid` is 0
add reminders id to a set (`reminderSet`).
If `eventid` is 1 && prd is in `reminderSet`
emit : `<userid> <prdsum>`
else
emit : `<userid> <0>`
More details on composite keys can be found in 'Hadoop: The Definitive Guide'.
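As a rough illustration of that composite key in Java (the class and field names here are made up, and a grouping comparator on userid is still needed so that one reduce() call sees both the reminders and the orders of a user):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical composite key: userid plus event type (0 = reminder, 1 = order).
// Sorting on (userid, eventType) means the reducer sees a user's reminders
// before that user's orders.
public class UserEventKey implements WritableComparable<UserEventKey> {

    private Text userId = new Text();
    private int eventType; // 0 = reminder, 1 = order

    public UserEventKey() {
    }

    public UserEventKey(String userId, int eventType) {
        this.userId.set(userId);
        this.eventType = eventType;
    }

    public Text getUserId() {
        return userId;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        userId.write(out);
        out.writeInt(eventType);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        userId.readFields(in);
        eventType = in.readInt();
    }

    @Override
    public int compareTo(UserEventKey other) {
        int cmp = userId.compareTo(other.userId);
        return cmp != 0 ? cmp : Integer.compare(eventType, other.eventType);
    }

    // Partition on userid only, so all events of one user go to the same reducer.
    public static class UserPartitioner extends Partitioner<UserEventKey, Text> {
        @Override
        public int getPartition(UserEventKey key, Text value, int numPartitions) {
            return (key.getUserId().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }
}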

Related

How to query directly from a Kafka topic?

I've looked into interactive queries and KSQL, but I can't seem to figure out whether querying for specific records based on key is possible.
Say I have a record in a topic as shown:
{
  key: 12314,
  value: {
    id: "1",
    name: "bob"
  }
}
Would it be possible to search for key 12314 in a topic? Also, do KSQL and interactive queries consume the entire topic to do queries?
Assuming your value is valid JSON (i.e. the field names are also quoted), you can do this easily with KSQL/ksqlDB:
Examine the Kafka topic in ksqlDB:
ksql> PRINT test3;
Format:JSON
1/9/20 12:11:35 PM UTC , 12314 , {"id": "1", "name": "bob" }
Declare the stream:
ksql> CREATE STREAM FOO (ID VARCHAR, NAME VARCHAR)
WITH (KAFKA_TOPIC='test3',VALUE_FORMAT='JSON');
Filter the stream as data arrives:
ksql> SELECT ROWKEY, ID, NAME FROM FOO WHERE ROWKEY='12314' EMIT CHANGES;
+----------------------------+----------------------------+----------------------------+
|ROWKEY |ID |NAME |
+----------------------------+----------------------------+----------------------------+
|12314 |1 |bob |
Everyone always forgets to add that you can use an interactive query if the underlying dataset is small and can be materialized.
For example, you cannot efficiently find a message by key in a huge topic. At least I cannot find a way to do that.
Yes, you can do it with interactive queries.
You can create a Kafka Streams application that reads the input topic and builds a state store (in memory or RocksDB, synchronized with Kafka).
This state store is queryable by key (ReadOnlyKeyValueStore).
There are multiple examples in the official documentation:
https://kafka.apache.org/10/documentation/streams/developer-guide/interactive-queries.html
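For illustration, a minimal Kafka Streams sketch of that idea (assumes Kafka Streams 2.5+, String keys, and the JSON value handled as a plain String; the store name "records-store" is made up, and a real application would wait for the streams instance to reach the RUNNING state before querying):

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class KeyLookupExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "key-lookup-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Materialize the topic as a table: the latest value per key is kept
        // in a local state store (RocksDB by default) named "records-store".
        StreamsBuilder builder = new StreamsBuilder();
        builder.table("test3",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("records-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Once the store is up, look up a single record by its key.
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("records-store",
                        QueryableStoreTypes.keyValueStore()));
        System.out.println(store.get("12314"));
    }
}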

JOOQ/SQL How to select min date based on foreign key

I have a table called Foo, which contains 3 columns (id, time, barId), and I would like to select all fields from Foo where the time (stored as a timestamp) is the lowest one in a group of barId. For example, if I had
Id, time, barId
1, 10am, 1
2, 11am, 1
3, 10am, 2
4, 9am, 2
I would expect to receive back rows 1 and 4.
Currently I am using
.select(FOO.ID, FOO.TIME.min(), FOO.BAR_ID)
.from(FOO)
.where()
.groupBy(FOO.BAR_ID)
.fetchInto(Foo.class);
And I am receiving an error stating column "foo.id" must appear in the GROUP BY clause or be used in an aggregate function
The issue I had was that I was not grouping by the columns I was selecting.
The working code is
.select(FOO.ID, FOO.TIME.min(), FOO.BAR_ID)
.from(FOO)
.where()
.groupBy(FOO.ID, FOO.TIME, FOO.BAR_ID)
.fetchInto(Foo.class);
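If the goal is the earliest row per barId (rows 1 and 4 in the example), another way to express it in jOOQ is a correlated subquery instead of GROUP BY. A rough sketch, assuming generated FOO/Foo classes and a DSLContext called create:

// Static imports assumed: org.jooq.impl.DSL.min, org.jooq.impl.DSL.select

// Alias a second reference to FOO for the correlated subquery.
var f = FOO.as("f");

// Keep each row whose time equals the minimum time within its barId group.
var result = create
    .select(FOO.ID, FOO.TIME, FOO.BAR_ID)
    .from(FOO)
    .where(FOO.TIME.eq(
        select(min(f.TIME))
            .from(f)
            .where(f.BAR_ID.eq(FOO.BAR_ID))))
    .fetchInto(Foo.class);

Note that this returns more than one row per barId if there is a tie on the minimum time.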

Funnel analysis using MongoDB?

I have a collection named 'event'; it tracks events from mobile applications.
The structure of event document is
{
  eventName: "eventA",
  screenName: "HomeScreen",
  timeStamp: NumberLong("135698658"),
  tracInfo: {
    ...,
    "userId": "user1",
    "sessionId": "123cdasd2123",
    ...
  }
}
I want to create a report to display a particular funnel:
e.g.:
the funnel is: event1 -> event2 -> event3
I want to find count of:
event1
event1 then event2
event1 then event2 and then event3
and the session is also considered, i.e. the events occurred in a single session.
Note: just to be clear, I want to be able to create any funnel that I define, and be able to create a report for it.
Your solution is likely to revolve around an aggregation like this:
db.event.aggregate([
{ $group: { _id: '$tracInfo.sessionId', events: { $push: '$eventName' } } }
])
where every resulting document will contain a sessionId and a list of eventNames. Add other fields to the $group results as needed. I imagine the logic for detecting your desired sequences in-pipeline would be pretty hairy, so you might consider saving the results to a different collection which you can inspect at your leisure. MongoDB 2.6 features a new $out operator for just such occasions.
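For illustration, roughly the same pipeline with the MongoDB Java driver, including the $out stage mentioned above (the database name and the target collection name funnel_sessions are made up):

import java.util.Arrays;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;

public class FunnelAggregation {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events = client.getDatabase("test").getCollection("event");

            // Group events by session and collect the event names per session
            // (a $sort on timeStamp could be prepended if chronological order matters),
            // then write the result to a separate collection for later inspection.
            events.aggregate(Arrays.asList(
                    Aggregates.group("$tracInfo.sessionId",
                            Accumulators.push("events", "$eventName")),
                    Aggregates.out("funnel_sessions")
            )).toCollection();
        }
    }
}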

About solr query facet

In my Solr index, the document data looks like this:
{
  "createTime": "2013-09-10",
  "reason": "reason1",
  "postId": "postId_1",
  "_version_": 1445959401549594624
},
{
  "createTime": "2013-09-11",
  "reason": "reason2",
  "postId": "postId_1",
  "_version_": 1445959401549594624
},
{
  "createTime": "2013-09-12",
  "reason": "reason3",
  "postId": "postId_1",
  "_version_": 1445959401549594624
},
{
  "createTime": "2013-09-13",
  "reason": "reason4",
  "postId": "postId_2",
  "_version_": 1445959401549594624
}
Now I need to use a Solr facet query to select some data like this:
1. postId_1, 3 records, the last createTime is "2013-09-12"
2. postId_2, 1 record, the last createTime is "2013-09-13", reason is reason4
How can I do this using a Solr facet query?
You can use the Field Collapsing feature, which can help you group the results.
If you group on postId, you would be able to get the results per post id.
You would get the count for each post id (numFound), which will give you the "3 records" part.
You can order the results within the group by date desc and return a single result (group.limit=1), which will give you the last date.
You can pick up the reason from the records.
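A small SolrJ sketch of those grouping parameters (the Solr URL and core name are made up; field names follow the documents above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupedPostQuery {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrQuery query = new SolrQuery("*:*");
            query.set("group", true);                    // enable result grouping (field collapsing)
            query.set("group.field", "postId");          // one group per postId
            query.set("group.limit", 1);                 // return a single document per group
            query.set("group.sort", "createTime desc");  // newest createTime first within each group
            query.set("group.ngroups", true);            // also return the number of groups

            QueryResponse response = solr.query(query);
            // Each group's numFound gives the record count per postId; the single
            // returned document carries the latest createTime and its reason.
            System.out.println(response.getGroupResponse().getValues());
        }
    }
}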

Generate random number within a range (0-100k) in a cluster environment

I have to generate a random number within a range (0-100,000) in a cluster environment (many stateless Java-based app servers + MongoDB), so every user request will get some unique number and will keep it for the next few requests.
As I understand it, I have these options:
1. Have some number persisted in Mongo and incrementAndGet it - but it's not atomic - bad choice.
2. Use Redis - it's atomic and supports counters.
3. Any other idea? Is it safe to use a UUID and set a range for it?
4. Hazelcast?
Any other thoughts?
Thanks
I would leverage the existing MongoDB infrastructure and use the MongoDB findAndModify command to do an atomic increment-and-get operation.
For the shell, the command would look like this:
var result = db.ids.findAndModify( {
    query: { _id: "counter" },
    sort: { rating: 1 },
    new: true,
    update: { $inc: { counter: 1 } },
    upsert: true
} );
The 'new: true' option returns the document after the update; 'upsert: true' creates the document if it is missing.
The 10gen supported driver and the Asynchronous Driver both contain helper methods/builders for the find and modify command.
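For illustration, the same atomic increment-and-get with the modern MongoDB Java driver (the database name and connection string are assumptions):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.ReturnDocument;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class CounterExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> ids = client.getDatabase("test").getCollection("ids");

            // Atomically increment the counter and return the updated document,
            // creating it on first use (upsert).
            Document result = ids.findOneAndUpdate(
                    Filters.eq("_id", "counter"),
                    Updates.inc("counter", 1),
                    new FindOneAndUpdateOptions()
                            .upsert(true)
                            .returnDocument(ReturnDocument.AFTER));

            System.out.println(result.getInteger("counter"));
        }
    }
}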
