I am a newbie to Spark. I have around 15 TB of data in MongoDB.
ApplicationName Name IPCategory Success Fail CreatedDate
abc a.com cd 3 1 25-12-2015 00:00:00
def d.com ty 2 2 25-12-2015 01:20:00
abc b.com cd 5 0 01-01-2015 06:40:40
For a given ApplicationName, I am looking to group by (Name, IPCategory) over one week of data. I am able to fetch data from MongoDB and save the output back to MongoDB. I am working on it using Java.
NOTE: From one month of data I need only the last week, grouped by (Name, IPCategory).
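In case it helps to frame the problem, here is a minimal sketch with Spark's Java DataFrame API. It assumes the collection has already been loaded into a Dataset<Row> with the columns shown above (for example via the MongoDB Spark connector) and that CreatedDate is a timestamp column; the class, method, and variable names are placeholders.

import static org.apache.spark.sql.functions.*;

import java.sql.Timestamp;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class WeeklyAggregation {

    // events: one row per record with the columns shown above.
    public static Dataset<Row> lastWeekByNameAndCategory(Dataset<Row> events, String applicationName) {
        // Keep only the last 7 days of data for the given application.
        Timestamp oneWeekAgo = Timestamp.from(Instant.now().minus(7, ChronoUnit.DAYS));

        return events
                .filter(col("ApplicationName").equalTo(applicationName)
                        .and(col("CreatedDate").geq(lit(oneWeekAgo))))
                // Group by (Name, IPCategory) and sum the Success/Fail counters.
                .groupBy(col("Name"), col("IPCategory"))
                .agg(sum("Success").alias("Success"), sum("Fail").alias("Fail"));
    }
}

The resulting Dataset can then be written back to MongoDB with the connector's save helpers.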
As I come from an RDBMS background, I am a bit confused about how to write this query in DynamoDB.
Problem: I need to filter out the items that are more than 15 minutes old.
I have created a GSI with hash key materialType and range key createTime (stored as Instant.now().toEpochMilli()).
Now I have to write a Java query which returns the items that are more than 15 minutes old.
Here is an example using the CLI.
:v1 should be the material type ID that you are searching on. :v2 should be your epoch time in milliseconds for 15 minutes ago, which you will have to calculate.
aws dynamodb query \
--table-name mytable \
--index-name myindex \
--key-condition-expression "materialType = :v1 AND createTime > :v2" \
--expression-attribute-values '{
":v1": {"S": "some id"},
":v2": {"N": "766677876567"}
}'
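Since the question asks for Java, here is a minimal sketch of the same query with the AWS SDK for Java (v1). The table, index, and attribute names are taken from the CLI example above; adjust them and the hash-key value to your own.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;

import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

public class RecentMaterialQuery {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        // Epoch millis for 15 minutes ago, matching the createTime format (Instant.now().toEpochMilli()).
        long fifteenMinutesAgo = Instant.now().minusSeconds(15 * 60).toEpochMilli();

        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":v1", new AttributeValue().withS("some id"));
        values.put(":v2", new AttributeValue().withN(Long.toString(fifteenMinutesAgo)));

        QueryRequest request = new QueryRequest()
                .withTableName("mytable")
                .withIndexName("myindex")
                .withKeyConditionExpression("materialType = :v1 AND createTime > :v2")
                .withExpressionAttributeValues(values);

        QueryResult result = client.query(request);
        result.getItems().forEach(System.out::println);
    }
}

This returns items created within the last 15 minutes, as in the CLI example; flip the comparison to createTime < :v2 if you instead want the items older than 15 minutes.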
I am trying to find out the time spent on each tab/website by the user.
For example, if I visited YouTube and watched it for 10 minutes, then I should be able to see something like this:
www.youtube.com ---> 10 minutes
I have already made a connection with the SQLite database (i.e. the History file present in the Chrome directory) and was able to run the following SQL command to fetch the data:
SELECT urls.id, urls.url, urls.title, urls.visit_count, urls.typed_count, urls.last_visit_time, urls.hidden, urls.favicon_id, visits.visit_time, visits.from_visit, visits.visit_duration, visits.transition, visit_source.source FROM urls JOIN visits ON urls.id = visits.url LEFT JOIN visit_source ON visits.id = visit_source.id
So can anyone tell me which combination of columns I can use to get the time spent on each website?
Please note that visit_duration is not giving me appropriate data.
visit_duration stores the duration in microseconds, so you need to convert and format that number. Here is one way to show a human-readable visit duration:
SELECT urls.url AS URL, (visits.visit_duration / 3600 / 1000000) || ' hours ' || strftime('%M minutes %S seconds', visits.visit_duration / 1000000 / 86400.0) AS Duration
FROM urls LEFT JOIN visits ON urls.id = visits.url
Here is a sample output:
URL                              Duration
http://www.stackoverflow.com/    3 hours 14 minutes 15 seconds
You can also use strftime if you want more formatting options.
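If you would rather do the aggregation and formatting outside of SQL, here is a minimal Java sketch that sums visit_duration per URL. It assumes the xerial sqlite-jdbc driver is on the classpath and that you query a copy of the History file (Chrome keeps the live one locked); the path is a placeholder.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.time.Duration;

public class TimePerSite {
    public static void main(String[] args) throws Exception {
        String jdbcUrl = "jdbc:sqlite:/path/to/History"; // copy of Chrome's History file

        // Sum visit_duration (stored in microseconds) per URL.
        String sql = "SELECT urls.url AS url, SUM(visits.visit_duration) AS total_micros "
                   + "FROM urls JOIN visits ON urls.id = visits.url "
                   + "GROUP BY urls.url ORDER BY total_micros DESC";

        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                Duration d = Duration.ofMillis(rs.getLong("total_micros") / 1000);
                System.out.printf("%s ---> %d minutes%n", rs.getString("url"), d.toMinutes());
            }
        }
    }
}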
I have 2 CSV files.
Employee.csv with the schema
EmpId Fname
1 John
2 Jack
3 Ram
and the 2nd CSV file as
Leave.csv
EmpId LeaveType Designation
1 Sick SE
1 Casual SE
2 Sick SE
3 Privilege M
1 Casual SE
2 Privilege SE
Now I want the data in JSON as
EmpID-1
Sick : 2
Casual : 2
Privilege : 0
Using Spark in Java.
Grouping by the column 'LeaveType' and performing a count on it:
import org.apache.spark.sql.functions.{col, count}
val leaves = ??? // Load leaves
leaves.groupBy(col("LeaveType")).agg(count(col("LeaveType")).as("total_leaves")).show()
I'm not familiar with Java syntax, but if you do not want to use the DataFrame API, you can do something like this in Scala:
val rdd = sc.textFile("/path/to/leave.csv").map(_.split(",")).map(x => ((x(0), x(1), x(2)), 1)).reduceByKey(_ + _)
Now you need to use some external API like Gson to transform each element of this RDD into the desired JSON format. Each element of this RDD is a pair whose key is the (EmpId, LeaveType, Designation) tuple and whose value is the count of leaves.
Let me know if this helped, Cheers.
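Since the question asks for Spark in Java specifically, here is a minimal sketch using the DataFrame API instead. It assumes Leave.csv has a header row and that the per-employee breakdown shown above is the desired output; the file path is a placeholder.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LeaveCounts {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("LeaveCounts")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> leaves = spark.read()
                .option("header", "true")
                .csv("/path/to/Leave.csv");

        // One row per EmpId, one column per LeaveType; employees with no row
        // for a given type get 0 instead of null.
        Dataset<Row> counts = leaves.groupBy("EmpId")
                .pivot("LeaveType")
                .count()
                .na().fill(0);

        // Each row becomes a JSON document, e.g. {"EmpId":"1","Casual":2,"Sick":2,"Privilege":0}
        counts.toJSON().show(false);

        spark.stop();
    }
}

If you also need the employee names, join counts with the Employee.csv DataFrame on EmpId before calling toJSON.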
I am writing a Spark Streaming job in Java which takes input records from Kafka.
Now the records are available in a JavaDStream as a custom Java object.
A sample record is:
TimeSeriesData: {tenant_id='581dd636b5e2ca009328b42b', asset_id='5820870be4b082f136653884', bucket='2016', parameter_id='58218d81e4b082f13665388b', timestamp=Mon Aug 22 14:50:01 IST 2016, window=null, value='11.30168'}
Now I want to aggregate this data by minute, hour, day, and week of the field "timestamp".
My question is: how do I aggregate JavaDStream records based on a window? Sample code would be helpful.
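Here is a minimal sketch of one way to do the per-minute case with the DStream API. It assumes your TimeSeriesData class exposes getTimestamp() and getValue() getters (names guessed from the sample record) and that summing the value per bucket is the aggregation you want. The same pattern applies with a coarser bucket for hourly rollups; for day- and week-level results it is usually more practical to persist the per-minute output and roll it up in a store, since DStream windows are bounded by the streaming batch interval.

import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

public class TimeSeriesAggregation {

    // Sums values per minute bucket over a 1-hour window that slides every minute.
    public static JavaPairDStream<Long, Double> aggregateByMinute(JavaDStream<TimeSeriesData> records) {
        return records
                .mapToPair(r -> {
                    // Truncate the record's timestamp to the start of its minute (epoch millis).
                    long minuteBucket = r.getTimestamp().getTime() / 60000L * 60000L;
                    return new Tuple2<>(minuteBucket, Double.parseDouble(r.getValue()));
                })
                .reduceByKeyAndWindow(Double::sum, Durations.minutes(60), Durations.minutes(1));
    }
}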
In Hive I need to split the column using regexp_extract on "/" and then pick the 3rd value; for example, for products/apple products/iphone this is iphone. If there is no 3rd value, we need to fall back on the 2nd value, which is apple products. Please guide me on achieving that.
input.txt
products/apple products/iphone
products/laptop
products/apple products/mobile/lumia
products/apple products/cover/samsung/gallaxy S4
hive> create table inputTable(line String);
OK
Time taken: 0.086 seconds
hive> load data local inpath '/home/kishore/Data/input.txt'
> into table inputTable;
Loading data to table default.inputtable
Table default.inputtable stats: [numFiles=1, totalSize=133]
OK
Time taken: 0.277 seconds
hive> select split(line,'/')[size(split(line, '/'))-1] from inputTable;
OK
iphone
laptop
lumia
gallaxy S4
Time taken: 0.073 seconds, Fetched: 4 row(s)