In Hive I need to split the column using regexp_extract on "/" and then pick the 3rd value, e.g. for products/apple products/iphone that is iphone; if there is no 3rd value then we need to fall back on the 2nd value, which is apple products. Please guide me on achieving that.
input.txt
products/apple products/iphone
products/laptop
products/apple products/mobile/lumia
products/apple products/cover/samsung/gallaxy S4
hive> create table inputTable(line String);
OK
Time taken: 0.086 seconds
hive> load data local inpath '/home/kishore/Data/input.txt'
> into table inputTable;
Loading data to table default.inputtable
Table default.inputtable stats: [numFiles=1, totalSize=133]
OK
Time taken: 0.277 seconds
Indexing the array returned by split with size(...)-1 picks the last segment of each line, so it naturally falls back to the 2nd value when there is no 3rd:
hive> select split(line,'/')[size(split(line, '/'))-1] from inputTable;
OK
iphone
laptop
lumia
gallaxy S4
Time taken: 0.073 seconds, Fetched: 4 row(s)
I have a Kafka stream that I am loading into Spark. Messages from the Kafka topic have the following attributes: bl_iban, blacklisted, timestamp. So there are IBANs, a flag for whether or not that IBAN is blacklisted (Y/N), and a timestamp for the record.
The thing is that there can be multiple records for one IBAN, because over time an IBAN can get blacklisted or "removed". What I am trying to achieve is to know the current status of each IBAN. However, I have started with an even simpler goal: listing the latest timestamp for each IBAN (after that I would like to add the blacklisted status as well). So I have produced the following code (where blackList represents the Dataset that I have loaded from Kafka):
blackList = blackList.groupBy("bl_iban")
.agg(col("bl_iban"), max("timestamp"));
After that I tried to print it to the console using the following code:
StreamingQuery query = blackList.writeStream()
.format("console")
.outputMode(OutputMode.Append())
.start();
I have run my code and I get the following error:
Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark
So I added a watermark to my Dataset like so:
blackList = blackList.withWatermark("timestamp", "2 seconds")
.groupBy("bl_iban")
.agg(col("bl_iban"), max("timestamp"));
And got the same error after that.
Any ideas on how I can approach this problem?
Update:
With mike's help I have managed to get rid of that error. But the problem is that I still cannot get my blacklist working: I can see the data being loaded from Kafka, but after my group operation I get two empty batches and that is it.
Printed data from Kafka:
+-----------------------+-----------+-----------------------+
|bl_iban |blacklisted|timestamp |
+-----------------------+-----------+-----------------------+
|SK047047595122709025789|N |2020-04-10 17:26:58.208|
|SK341492788657560898224|N |2020-04-10 17:26:58.214|
|SK118866580129485701645|N |2020-04-10 17:26:58.215|
+-----------------------+-----------+-----------------------+
This is how I produced the blacklist that is printed above:
blackList = blackList.selectExpr("split(cast(value as string),',') as value", "cast(timestamp as timestamp) timestamp")
.selectExpr("value[0] as bl_iban", "value[1] as blacklisted", "timestamp");
And this is my group operation:
Dataset<Row> blackListCurrent = blackList.withWatermark("timestamp", "20 minutes")
.groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("bl_iban"))
.agg(col("bl_iban"), max("timestamp"));
Link to source file: Spark Blacklist
When you use watermarking in Spark you need to ensure that your aggregation knows about the window. The Spark documentation provides some more background.
In your case the code should look something like this:
blackList = blackList.withWatermark("timestamp", "2 seconds")
.groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("bl_iban"))
.agg(col("bl_iban"), max("timestamp"));
It is important that the timestamp attribute has the data type timestamp!
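Putting the pieces together, here is a minimal sketch of the full pipeline under the same assumptions as the snippets above (the alias latest_timestamp is purely illustrative, and blackList is the parsed streaming Dataset with a timestamp column of type timestamp):
// Sketch only: assumes
//   import static org.apache.spark.sql.functions.*;   // col, max, window
//   import org.apache.spark.sql.streaming.OutputMode;
Dataset<Row> latestPerIban = blackList
        .withWatermark("timestamp", "2 seconds")
        .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("bl_iban"))
        .agg(max("timestamp").alias("latest_timestamp"));

StreamingQuery query = latestPerIban.writeStream()
        .format("console")
        .outputMode(OutputMode.Append()) // Append emits a window only after the watermark passes its end
        .start();
Note that with Append mode the aggregated rows only appear once the watermark has moved past the end of their window, which is also why short test runs can look like they only produce empty batches.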
I am trying to find out the time spent on each tab/website by the user.
For example, if I visited YouTube and watched it for 10 minutes, then I should be able to see something like this:
www.youtube.com ---> 10 minutes
I have already made a connection with the SQLite database (i.e. the History file present in the Chrome directory) and was able to run the following SQL command to fetch the data:
SELECT urls.id, urls.url, urls.title, urls.visit_count, urls.typed_count,
       urls.last_visit_time, urls.hidden, urls.favicon_id,
       visits.visit_time, visits.from_visit, visits.visit_duration,
       visits.transition, visit_source.source
FROM urls
JOIN visits ON urls.id = visits.url
LEFT JOIN visit_source ON visits.id = visit_source.id
So can anyone tell me which combination of columns I can use to get the time spent on each website?
Please note that visit_duration is not giving me appropriate data.
visit_duration stores the duration in microseconds; you need to convert and format that number. Here is one way to show a human-readable visit duration:
SELECT urls.url AS URL, (visits.visit_duration / 3600 / 1000000) || ' hours ' || strftime('%M minutes %S seconds', visits.visit_duration / 1000000 / 86400.0) AS Duration
FROM urls LEFT JOIN visits ON urls.id = visits.url
Here is a sample output:
URL                            Duration
http://www.stackoverflow.com/  3 hours 14 minutes 15 seconds
You can also use strftime if you want more formatting options.
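If you are reading the History file from Java, a rough JDBC sketch of running the query above and printing a per-URL duration could look like this (assuming the xerial sqlite-jdbc driver is on the classpath; the file path and class name are placeholders):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ChromeVisitDurations {
    public static void main(String[] args) throws Exception {
        // Placeholder path: point it at a copy of Chrome's History file.
        String dbUrl = "jdbc:sqlite:/path/to/History";

        String sql = "SELECT urls.url AS URL, "
                + "(visits.visit_duration / 3600 / 1000000) || ' hours ' || "
                + "strftime('%M minutes %S seconds', visits.visit_duration / 1000000 / 86400.0) AS Duration "
                + "FROM urls LEFT JOIN visits ON urls.id = visits.url";

        try (Connection conn = DriverManager.getConnection(dbUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                // e.g. "http://www.stackoverflow.com/ ---> 3 hours 14 minutes 15 seconds"
                System.out.println(rs.getString("URL") + " ---> " + rs.getString("Duration"));
            }
        }
    }
}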
I am a newbie in Druid. My problem is how to store and query a HashMap in Druid, using Java to interact with it.
I have a network table as follows:
Network  f1  f2  f3  ...  fn
value    1   3   2   ...  2
Additionally, I have a range-time table:
time impression
2016-08-10-00 1000
2016-08-10-00 3000
2016-08-10-00 4000
2016-08-10-00 2000
2016-08-10-00 8000
In Druid, can I store the range-time table as a HashMap and query both of the tables above with a statement like:
Filter f1 = 1 and f2 = 1 and range-time between [t1, t2]?
Can anyone help me? Thanks so much.
#VanThaoNguye,
Yes, you can store the HashMaps in Druid, and you can query with bound filters.
You can read more about bound filters here: http://druid.io/docs/latest/querying/filters.html#bound-filter
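For illustration only, here is a rough Java sketch of posting a native Druid query that combines two bound filters (equality written as identical lower and upper bounds) with a time interval. The broker address, datasource name, and dimension names are assumptions based on the tables above:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class DruidBoundFilterExample {
    public static void main(String[] args) throws Exception {
        // Assumed broker address and datasource name.
        String brokerUrl = "http://localhost:8082/druid/v2/";

        // Native timeseries query: f1 = 1 AND f2 = 1, restricted to the interval [t1, t2).
        String query = "{"
                + "\"queryType\":\"timeseries\","
                + "\"dataSource\":\"network\","
                + "\"granularity\":\"all\","
                + "\"intervals\":[\"2016-08-10T00:00:00Z/2016-08-11T00:00:00Z\"],"
                + "\"filter\":{\"type\":\"and\",\"fields\":["
                + "{\"type\":\"bound\",\"dimension\":\"f1\",\"lower\":\"1\",\"upper\":\"1\"},"
                + "{\"type\":\"bound\",\"dimension\":\"f2\",\"lower\":\"1\",\"upper\":\"1\"}]},"
                + "\"aggregations\":[{\"type\":\"longSum\",\"name\":\"impression\",\"fieldName\":\"impression\"}]"
                + "}";

        HttpURLConnection conn = (HttpURLConnection) new URL(brokerUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(query.getBytes(StandardCharsets.UTF_8));
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON rows returned by the broker
            }
        }
    }
}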
I am a newbie to Spark. I have around 15 TB of data in Mongo:
ApplicationName  Name   IPCategory  Success  Fail  CreatedDate
abc              a.com  cd          3        1     25-12-2015 00:00:00
def              d.com  ty          2        2     25-12-2015 01:20:00
abc              b.com  cd          5        0     01-01-2015 06:40:40
Based on ApplicationName, I am looking to group by (Name, IPCategory) over one week of data. I am able to fetch data from Mongo and save the output back to Mongo. I am working on it using Java.
NOTE: From one month of data I need only the last week, grouped by (Name, IPCategory).
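As a hedged sketch only (column names come from the sample above; the sum aggregation, the date format, and the use of the current date as the reference point are assumptions), the weekly filter and grouping with the Spark Java API could look roughly like this:
// Assumes: import static org.apache.spark.sql.functions.*;
// and that df is the Dataset<Row> already fetched from Mongo with the columns shown above.
Dataset<Row> lastWeek = df
        .withColumn("CreatedDate", to_timestamp(col("CreatedDate"), "dd-MM-yyyy HH:mm:ss"))
        .filter(col("ApplicationName").equalTo("abc"))                // pick one ApplicationName
        .filter(col("CreatedDate").geq(date_sub(current_date(), 7))); // keep only the last 7 days

Dataset<Row> weekly = lastWeek
        .groupBy(col("Name"), col("IPCategory"))
        .agg(sum("Success").alias("Success"), sum("Fail").alias("Fail"));
The result can then be written back to Mongo with the same writer you already use for saving output.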
In Hive, how do I apply the lower() UDF to an array of strings?
Or any UDF in general. I don't know how to apply a "map" in a select query.
If your use case is that you are transforming an array in isolation (not as part of a table), then the combination of explode, lower, and collect_list should do the trick. For example (please pardon the horrible execution times, I'm running on an underpowered VM):
hive> SELECT collect_list(lower(val))
> FROM (SELECT explode(array('AN', 'EXAMPLE', 'ARRAY')) AS val) t;
...
... Lots of MapReduce spam
...
MapReduce Total cumulative CPU time: 4 seconds 10 msec
Ended Job = job_1422453239049_0017
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 4.01 sec HDFS Read: 283 HDFS Write: 17 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 10 msec
OK
["an","example","array"]
Time taken: 33.05 seconds, Fetched: 1 row(s)
(Note: Replace array('AN', 'EXAMPLE', 'ARRAY') in the above query with whichever expression you are using to generate the array.)
If instead your use case is such that your arrays are stored in a column of a Hive table and you need to apply the lowercase transformation to them, to my knowledge you have two principal options:
Approach #1: Use the combination of explode and LATERAL VIEW to separate the array. Use lower to transform the individual elements, and then collect_list to glue them back together. A simple example with silly made-up data:
hive> DESCRIBE foo;
OK
id int
data array<string>
Time taken: 0.774 seconds, Fetched: 2 row(s)
hive> SELECT * FROM foo;
OK
1001 ["ONE","TWO","THREE"]
1002 ["FOUR","FIVE","SIX","SEVEN"]
Time taken: 0.434 seconds, Fetched: 2 row(s)
hive> SELECT
> id, collect_list(lower(exploded))
> FROM
> foo LATERAL VIEW explode(data) exploded_table AS exploded
> GROUP BY id;
...
... Lots of MapReduce spam
...
MapReduce Total cumulative CPU time: 3 seconds 310 msec
Ended Job = job_1422453239049_0014
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 3.31 sec HDFS Read: 358 HDFS Write: 44 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 310 msec
OK
1001 ["one","two","three"]
1002 ["four","five","six","seven"]
Time taken: 34.268 seconds, Fetched: 2 row(s)
Approach #2: Write a simple UDF to apply the transformation. Something like:
package my.package_name;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple UDF that lower-cases every element of a Hive array<string>.
public class LowerArray extends UDF {
    public List<Text> evaluate(List<Text> input) {
        List<Text> output = new ArrayList<Text>();
        for (Text element : input) {
            output.add(new Text(element.toString().toLowerCase()));
        }
        return output;
    }
}
And then invoke the UDF directly on the data:
hive> ADD JAR my_jar.jar;
Added my_jar.jar to class path
Added resource: my_jar.jar
hive> CREATE TEMPORARY FUNCTION lower_array AS 'my.package_name.LowerArray';
OK
Time taken: 2.803 seconds
hive> SELECT id, lower_array(data) FROM foo;
...
... Lots of MapReduce spam
...
MapReduce Total cumulative CPU time: 2 seconds 760 msec
Ended Job = job_1422453239049_0015
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.76 sec HDFS Read: 358 HDFS Write: 44 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 760 msec
OK
1001 ["one","two","three"]
1002 ["four","five","six","seven"]
Time taken: 27.243 seconds, Fetched: 2 row(s)
There are some trade-offs between the two approaches. #2 will probably be more efficient at runtime in general than #1, since the GROUP BY clause in #1 forces a reduction stage while the UDF approach does not. However, #1 does everything in HiveQL and is a bit more easily generalized (you can replace lower with some other kind of string transformation in the query if you needed to). With the UDF approach of #2, you potentially have to write a new UDF for each different kind of transformation you want to apply.