Hive: apply lowercase to an array

Hive: apply lowercase to an array - java

In Hive, how do I apply the lower() UDF to an Array of string?
Or any UDF in general. I don't know how to apply a "map" in a select query

If your use case is that you are transforming an array in isolation (not as part of a table), then the combination of explode, lower, and collect_list should do the trick. For example (please pardon the horrible execution times, I'm running on an underpowered VM):
hive> SELECT collect_list(lower(val))
> FROM (SELECT explode(array('AN', 'EXAMPLE', 'ARRAY')) AS val) t;
...
... Lots of MapReduce spam
...
MapReduce Total cumulative CPU time: 4 seconds 10 msec
Ended Job = job_1422453239049_0017
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 4.01 sec HDFS Read: 283 HDFS Write: 17 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 10 msec
OK
["an","example","array"]
Time taken: 33.05 seconds, Fetched: 1 row(s)
(Note: Replace array('AN', 'EXAMPLE', 'ARRAY') in the above query with whichever expression you are using to generate the array.
If instead your use case is such that your arrays stored in a column of a Hive table and you need to apply the lowercase transformation to them, to my knowledge you have two principle options:
Approach #1: Use the combination of explode and LATERAL VIEW to separate the array. Use lower to transform the individual elements, and then collect_list to glue them back together. A simple example with silly made-up data:
hive> DESCRIBE foo;
OK
id int
data array<string>
Time taken: 0.774 seconds, Fetched: 2 row(s)
hive> SELECT * FROM foo;
OK
1001 ["ONE","TWO","THREE"]
1002 ["FOUR","FIVE","SIX","SEVEN"]
Time taken: 0.434 seconds, Fetched: 2 row(s)
hive> SELECT
> id, collect_list(lower(exploded))
> FROM
> foo LATERAL VIEW explode(data) exploded_table AS exploded
> GROUP BY id;
...
... Lots of MapReduce spam
...
MapReduce Total cumulative CPU time: 3 seconds 310 msec
Ended Job = job_1422453239049_0014
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 3.31 sec HDFS Read: 358 HDFS Write: 44 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 310 msec
OK
1001 ["one","two","three"]
1002 ["four","five","six","seven"]
Time taken: 34.268 seconds, Fetched: 2 row(s)
Approach #2: Write a simple UDF to apply the transformation. Something like:
package my.package_name;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class LowerArray extends UDF {
public List<Text> evaluate(List<Text> input) {
List<Text> output = new ArrayList<Text>();
for (Text element : input) {
output.add(new Text(element.toString().toLowerCase()));
}
return output;
}
}
And then invoke the UDF directly on the data:
hive> ADD JAR my_jar.jar;
Added my_jar.jar to class path
Added resource: my_jar.jar
hive> CREATE TEMPORARY FUNCTION lower_array AS 'my.package_name.LowerArray';
OK
Time taken: 2.803 seconds
hive> SELECT id, lower_array(data) FROM foo;
...
... Lots of MapReduce spam
...
MapReduce Total cumulative CPU time: 2 seconds 760 msec
Ended Job = job_1422453239049_0015
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.76 sec HDFS Read: 358 HDFS Write: 44 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 760 msec
OK
1001 ["one","two","three"]
1002 ["four","five","six","seven"]
Time taken: 27.243 seconds, Fetched: 2 row(s)
There are some trade-offs between the two approaches. #2 will probably be more efficient at runtime in general than #1, since the GROUP BY clause in #1 forces a reduction stage while the UDF approach does not. However, #1 does everything in HiveQL and is a bit more easily generalized (you can replace lower with some other kind of string transformation in the query if you needed to). With the UDF approach of #2, you potentially have to write a new UDF for each different kind of transformation you want to apply.

Related

Get all items of the last 15 minutes in DynamoDB

As I come from RDBM background I am bit confuse with DynamoDB, how to write this query.
Problem : need to filter out those data which is more than 15 minutes.
I have created GSI with hashkeymaterialType and createTime (create time format Instant.now().toEpochMilli()).
Now I have to write a java query which gives those values which is more than 15 minutes.

Here is an example using cli.
:v1 should be the material type I'd that you are searching on. :v2 should be your epoch time in milliseconds for 30 mins ago which you will have to calculate.
aws dynamodb query \
--table-name mytable \
--index-name myindex
--key-condition-expression "materialType = :v1 AND create time > :v2" \
--expression-attribute-values '{
":v1": {"S": "some id"},
":V2": {"N": "766677876567"}
}'

KStream batch process windows

I want to batch messages with KStream interface.
I have a Stream with Keys/values
I tried to collect them in a tumbling window and then I wanted to process the complete window at once.
builder.stream(longSerde, updateEventSerde, CONSUME_TOPIC)
.aggregateByKey(
HashMap::new,
(aggKey, value, aggregate) -> {
aggregate.put(value.getUuid, value);
return aggregate;
},
TimeWindows.of("intentWindow", 100),
longSerde, mapSerde)
.foreach((wk, values) -> {
The thing is foreach gets called on each update to the KTable.
I would like to process the whole window once it is complete. As in collect Data from 100 ms and then process at once. In for each.
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 294
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 295
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 296
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 297
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 298
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 299
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 1
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 2
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 3
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 4
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 5
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 6
at some point the new window starts with 1 entry in the map.
So I don't even know when the window is full.
any hints to to batch process in kafka streams

My actual tasks is to push updates from the stream to redis but I don't want to read / update / write individiually even though redis is fast.
My solution for now is to use KStream.process() supply a processor which adds to a queue on process and actually process the queue in punctuate.
public class BatchedProcessor extends AbstractProcessor{
...
BatchedProcessor(Writer writer, long schedulePeriodic)
#Override
public void init(ProcessorContext context) {
super.init(context);
context.schedule(schedulePeriodic);
}
#Override
public void punctuate(long timestamp) {
super.punctuate(timestamp);
writer.processQueue();
context().commit();
}
#Override
public void process(Long aLong, IntentUpdateEvent intentUpdateEvent) {
writer.addToQueue(intentUpdateEvent);
}
I still have to test but it solves the problem I had. One could easily write such a processor in a very generic way. The API is very neat and clean but a processBatched((List batchedMessaages)-> ..., timeInterval OR countInterval) that just uses punctuate to process the batch and commits at that point and collects KeyValues in a Store might be a useful addition.
But maybe it was intended to solve this with a Processor and keep the API purely in the one message at a time low latency focus.

Right now (as of Kafka 0.10.0.0 / 0.10.0.1): The windowing behavior you are describing is "working as expected". That is, if you are getting 1,000 incoming messages, you will (currently) always see 1,000 updates going downstream with the latest versions of Kafka / Kafka Streams.
Looking ahead: The Kafka community is working on new features to make this update-rate behavior more flexible (e.g. to allow for what you described above as your desired behavior). See KIP-63: Unify store and downstream caching in streams for more details.

====== Update ======
On further testing, this does not work.
The correct approach is to use a processor as outlined by #friedrich-nietzsche. I am down-voting my own answer.... grrrr.
===================
I am still wrestling with this API (but I love it, so it's time well spent :)), and I am not sure what you're trying to accomplish downstream from where your code sample ended, but it looks similar to what I got working. High level is:
Object read from source. It represents a key and 1:∞ number of events, and I want to publish the total number of events per key every 5 seconds ( or TP5s, transactions per 5 seconds ). The beginning of the code looks the same, but I use:
KStreamBuilder.stream
reduceByKey
to a window(5000)
to a new stream which gets the accumulated value for each key every 5 secs.
map that stream to a new KeyValue per key
to the sink topic.
In my case, each window period, I can reduce all events to one event per key, so this works. If you want to retain all the individual events per window, I assume that could use reduce to map each instance to a collection of instances (possibly with the same key, or you might need a new key) and at the end of each window period, the downstream stream will get a bunch of collections of your events (or maybe just one collection of all the events), all in one go. It looks like this, sanitized and Java 7-ish:
builder.stream(STRING_SERDE, EVENT_SERDE, SOURCE_TOPICS)
.reduceByKey(eventReducer, TimeWindows.of("EventMeterAccumulator", 5000), STRING_SERDE, EVENT_SERDE)
.toStream()
.map(new KeyValueMapper<Windowed<String>, Event, KeyValue<String,Event>>() {
public KeyValue<String, Event> apply(final Windowed<String> key, final Event finalEvent) {
return new KeyValue<String, Event>(key.key(), new Event(key.window().end(), finalEvent.getCount());
}
}).to(STRING_SERDE, EVENT_SERDE, SINK_TOPIC);

Cassandra Query on secondary index :ReadTimeout: code=1200

I am using [cqlsh 5.0.1 | Cassandra 2.2.1 | CQL spec 3.3.0 | Native protocol v4] version. I have 2 node cassandra cluster with replication factor as 2.
$ nodetool status test_keyspace
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.xxx.4.xxx 85.32 GB 256 100.0% xxxx-xx-xx-xx-xx rack1
UN 10.xxx.4.xxx 80.99 GB 256 100.0% x-xx-xx-xx-xx rack1
[I have replaced numbers with x]
This is keyspace defination.
cqlsh> describe test_keyspace;
CREATE KEYSPACE test_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2'} AND durable_writes = true;
CREATE TABLE test_keyspace.test_table (
id text PRIMARY KEY,
listids map<int, timestamp>
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX list_index ON test_keyspace.test_table (keys(listids));
id are unique and listids's key has cardinality close to 1000. I have millions records in this keyspace.
I want to get count of records with specific key and also list of those records. I tried this query from cqlsh:
select count(1) from test_table where listids contains key 12;
Got this error after few seconds:
ReadTimeout: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
I have already modified timeout parameters in cqlshrc and cassandra.yaml.
cat /etc/cassandra/conf/cassandra.yaml | grep read_request_timeout_in_ms
#read_request_timeout_in_ms: 5000
read_request_timeout_in_ms: 300000
cat ~/.cassandra/cqlshrc
[connection]
timeout = 36000
request_timeout = 36000
client_timeout = 36000
When i checked /var/log/cassandra/system.log I got only this-
WARN [SharedPool-Worker-157] 2016-07-25 11:56:22,010 SelectStatement.java:253 - Aggregation query used without partition key
I am using Java client form my code. The java client is also getting lot of read timeouts. One solution might be remodeling my data but that will take more time (Although i am not sure about it). Can someone suggest quick solution of this problem?
Adding stats :
$ nodetool cfstats test_keyspace
Keyspace: test_keyspace
Read Count: 5928987886
Read Latency: 3.468279416568199 ms.
Write Count: 1590771056
Write Latency: 0.02020026287239664 ms.
Pending Flushes: 0
Table (index): test_table.list_index
SSTable count: 9
Space used (live): 9664953448
Space used (total): 9664953448
Space used by snapshots (total): 4749
Off heap memory used (total): 1417400
SSTable Compression Ratio: 0.822577888909709
Number of keys (estimate): 108
Memtable cell count: 672265
Memtable data size: 30854168
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 1718274
Local read latency: 63.356 ms
Local write count: 1031719451
Local write latency: 0.015 ms
Pending flushes: 0
Bloom filter false positives: 369
Bloom filter false ratio: 0.00060
Bloom filter space used: 592
Bloom filter off heap memory used: 520
Index summary off heap memory used: 144
Compression metadata off heap memory used: 1416736
Compacted partition minimum bytes: 73
Compacted partition maximum bytes: 2874382626
Compacted partition mean bytes: 36905317
Average live cells per slice (last five minutes): 5389.0
Maximum live cells per slice (last five minutes): 51012
Average tombstones per slice (last five minutes): 2.0
Maximum tombstones per slice (last five minutes): 2759
Table: test_table
SSTable count: 559
Space used (live): 62368820540
Space used (total): 62368820540
Space used by snapshots (total): 4794
Off heap memory used (total): 817427277
SSTable Compression Ratio: 0.4856571513639344
Number of keys (estimate): 96692796
Memtable cell count: 2587248
Memtable data size: 27398085
Memtable off heap memory used: 0
Memtable switch count: 558
Local read count: 5927272991
Local read latency: 3.788 ms
Local write count: 559051606
Local write latency: 0.037 ms
Pending flushes: 0
Bloom filter false positives: 4905594
Bloom filter false ratio: 0.00023
Bloom filter space used: 612245816
Bloom filter off heap memory used: 612241344
Index summary off heap memory used: 196239565
Compression metadata off heap memory used: 8946368
Compacted partition minimum bytes: 43
Compacted partition maximum bytes: 1916
Compacted partition mean bytes: 173
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1

You can either redesign your tables, or split your query in multiple smaller queries.
You are selecting using a secondary index without using the partition key (thats what the warning tells you). Doing that, you essentially perform a full table scan. Your nodes have to look into every partition in order to fulfill your request.
A solution without changing the datamodel would be, to iterate over all partitions and run your query once per partition.
select count(*) from test_table where id = 'somePartitionId' and listids contains key 12;
This way, your nodes know on which partition you are looking for these information. You would then have to aggregate the results of these query on clientside.

I faced the same issue. tried
1) # Can also be set to None to disable:client_timeout = None in cqlshrc in home .cassandra. Did not helped.
2) Increased the timeout *timeout_in_ms in ym.cassandra.yaml
Did not helped too. Finally I settled in running loop on select clause in my java code and received count. 12 million rows gave me count in 7 seconds. It is fast.
Cluster cluster = Cluster.builder()
.addContactPoints(serverIp)
.build();
session = cluster.connect(keyspace);
String cqlStatement = "SELECT count(*) FROM imadmin.device_appclass_attributes";
//String cqlStatement = "SELECT * FROM system_schema.keyspaces";
for (Row row : session.execute(cqlStatement)) {
System.out.println(row.toString());
}

How do I choose the best k mean cluster in weka

As you can see the bottom result I have two different clusters using different seed. I would like to choose the best cluster out of the two clusters.
I know that the minimum square error is the better. However, it shows the same square error although I use different seeds. I want to know why it shows similar square error. I also want to know what other things I need to consider when i am selecting the best cluster.
*******************************************************************
kMeans
======
Number of iterations: 10
Within cluster sum of squared errors: 527.6988818392938
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1
(4898) (2781) (2117)
=====================================================
fixedacidity 6.8548 6.9565 6.7212
volatileacidity 0.2782 0.2826 0.2725
citricacid 0.3342 0.3389 0.3279
residualsugar 6.3914 8.2678 3.9265
chlorides 0.0458 0.0521 0.0374
freesulfurdioxide 35.3081 38.6897 30.8658
totalsulfurdioxide 138.3607 155.2585 116.1627
density 0.994 0.9958 0.9916
pH 3.1883 3.1691 3.2134
sulphates 0.4898 0.492 0.4871
alcohol 10.5143 9.6325 11.6726
quality 5.8779 5.4779 6.4034
Time taken to build model (full training data) : 0.19 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 2781 ( 57%)
1 2117 ( 43%)
***********************************************************************
kMeans
======
Number of iterations: 7
Within cluster sum of squared errors: 527.6993178146143
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1
(4898) (2122) (2776)
=====================================================
fixedacidity 6.8548 6.7208 6.9572
volatileacidity 0.2782 0.2723 0.2828
citricacid 0.3342 0.3281 0.3389
residualsugar 6.3914 3.9451 8.2614
chlorides 0.0458 0.0374 0.0522
freesulfurdioxide 35.3081 30.9105 38.6697
totalsulfurdioxide 138.3607 116.2175 155.2871
density 0.994 0.9917 0.9958
pH 3.1883 3.2137 3.1689
sulphates 0.4898 0.4876 0.4916
alcohol 10.5143 11.6695 9.6312
quality 5.8779 6.4043 5.4755
Time taken to build model (full training data) : 0.15 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 2122 ( 43%)
1 2776 ( 57%)

Define "best result".
By the definition of k-means, a lower sum of squares is better.
Anything else is worse by k-means - but that doesn't mean that a different quality criterion (or clustering algorithm) could be more helpful for your actual problem.

Using different seeds doesnot guarantee you different clusters in the result.

regex extract in hive for the following scenario?

In hive I need to split the column using regex_extract on "/" and then pick the 3rd value for eg: products/apple products/iphone in this case it is iphone, if there is no 3rd value then we need to fallback on the 2nd value which is apple products? plz guide me on achieving that.

input.txt
products/apple products/iphone
products/laptop
products/apple products/mobile/lumia
products/apple products/cover/samsung/gallaxy S4
hive> create table inputTable(line String);
OK
Time taken: 0.086 seconds
hive> load data local inpath '/home/kishore/Data/input.txt'
> into table inputTable;
Loading data to table default.inputtable
Table default.inputtable stats: [numFiles=1, totalSize=133]
OK
Time taken: 0.277 seconds
hive> select split(line,'/')[size(split(line, '/'))-1] from inputTable;
OK
iphone
laptop
lumia
gallaxy S4
Time taken: 0.073 seconds, Fetched: 4 row(s)

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Hive: apply lowercase to an array - java

In Hive, how do I apply the lower() UDF to an Array of string? Or any UDF in general. I don't know how to apply a "map" in a select query

Related

Get all items of the last 15 minutes in DynamoDB

KStream batch process windows

Cassandra Query on secondary index :ReadTimeout: code=1200

How do I choose the best k mean cluster in weka

regex extract in hive for the following scenario?

Categories

Resources