I am using [cqlsh 5.0.1 | Cassandra 2.2.1 | CQL spec 3.3.0 | Native protocol v4]. I have a 2-node Cassandra cluster with a replication factor of 2.
$ nodetool status test_keyspace
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.xxx.4.xxx 85.32 GB 256 100.0% xxxx-xx-xx-xx-xx rack1
UN 10.xxx.4.xxx 80.99 GB 256 100.0% x-xx-xx-xx-xx rack1
[I have replaced the numbers with x]
This is the keyspace definition:
cqlsh> describe test_keyspace;
CREATE KEYSPACE test_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2'} AND durable_writes = true;
CREATE TABLE test_keyspace.test_table (
id text PRIMARY KEY,
listids map<int, timestamp>
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX list_index ON test_keyspace.test_table (keys(listids));
The ids are unique, and the keys of listids have a cardinality close to 1000. I have millions of records in this keyspace.
I want to get the count of records with a specific key, and also the list of those records. I tried this query from cqlsh:
select count(1) from test_table where listids contains key 12;
I got this error after a few seconds:
ReadTimeout: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
I have already modified timeout parameters in cqlshrc and cassandra.yaml.
cat /etc/cassandra/conf/cassandra.yaml | grep read_request_timeout_in_ms
#read_request_timeout_in_ms: 5000
read_request_timeout_in_ms: 300000
cat ~/.cassandra/cqlshrc
[connection]
timeout = 36000
request_timeout = 36000
client_timeout = 36000
When I checked /var/log/cassandra/system.log, I found only this:
WARN [SharedPool-Worker-157] 2016-07-25 11:56:22,010 SelectStatement.java:253 - Aggregation query used without partition key
I am using the Java client from my code, and it is also getting a lot of read timeouts. One solution might be remodeling my data, but that will take more time (although I am not sure about it). Can someone suggest a quick solution to this problem?
Adding stats:
$ nodetool cfstats test_keyspace
Keyspace: test_keyspace
Read Count: 5928987886
Read Latency: 3.468279416568199 ms.
Write Count: 1590771056
Write Latency: 0.02020026287239664 ms.
Pending Flushes: 0
Table (index): test_table.list_index
SSTable count: 9
Space used (live): 9664953448
Space used (total): 9664953448
Space used by snapshots (total): 4749
Off heap memory used (total): 1417400
SSTable Compression Ratio: 0.822577888909709
Number of keys (estimate): 108
Memtable cell count: 672265
Memtable data size: 30854168
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 1718274
Local read latency: 63.356 ms
Local write count: 1031719451
Local write latency: 0.015 ms
Pending flushes: 0
Bloom filter false positives: 369
Bloom filter false ratio: 0.00060
Bloom filter space used: 592
Bloom filter off heap memory used: 520
Index summary off heap memory used: 144
Compression metadata off heap memory used: 1416736
Compacted partition minimum bytes: 73
Compacted partition maximum bytes: 2874382626
Compacted partition mean bytes: 36905317
Average live cells per slice (last five minutes): 5389.0
Maximum live cells per slice (last five minutes): 51012
Average tombstones per slice (last five minutes): 2.0
Maximum tombstones per slice (last five minutes): 2759
Table: test_table
SSTable count: 559
Space used (live): 62368820540
Space used (total): 62368820540
Space used by snapshots (total): 4794
Off heap memory used (total): 817427277
SSTable Compression Ratio: 0.4856571513639344
Number of keys (estimate): 96692796
Memtable cell count: 2587248
Memtable data size: 27398085
Memtable off heap memory used: 0
Memtable switch count: 558
Local read count: 5927272991
Local read latency: 3.788 ms
Local write count: 559051606
Local write latency: 0.037 ms
Pending flushes: 0
Bloom filter false positives: 4905594
Bloom filter false ratio: 0.00023
Bloom filter space used: 612245816
Bloom filter off heap memory used: 612241344
Index summary off heap memory used: 196239565
Compression metadata off heap memory used: 8946368
Compacted partition minimum bytes: 43
Compacted partition maximum bytes: 1916
Compacted partition mean bytes: 173
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
You can either redesign your tables, or split your query into multiple smaller queries.
You are selecting using a secondary index without using the partition key (that's what the warning tells you). Doing that, you essentially perform a full table scan: your nodes have to look into every partition in order to fulfill your request.
A solution without changing the data model would be to iterate over all partitions and run your query once per partition:
select count(*) from test_table where id = 'somePartitionId' and listids contains key 12;
This way, your nodes know which partition to look at. You would then have to aggregate the results of these queries on the client side.
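A minimal sketch of that client-side aggregation with the DataStax Java driver (the list of partition ids and the helper name are hypothetical; you need some way of knowing your ids up front, e.g. from a separate lookup table):

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.List;

public class PerPartitionCounter {
    // knownIds is hypothetical: the partition keys you iterate over must come from
    // your application; Cassandra cannot enumerate them cheaply for you.
    public static long countWithListKey(Session session, List<String> knownIds, int listKey) {
        PreparedStatement ps = session.prepare(
                "SELECT count(*) FROM test_keyspace.test_table " +
                "WHERE id = ? AND listids CONTAINS KEY ?");
        long total = 0;
        for (String id : knownIds) {
            Row row = session.execute(ps.bind(id, listKey)).one();
            total += row.getLong(0);   // count(*) comes back as a bigint
        }
        return total;
    }
}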
I faced the same issue. I tried:
1) Setting client_timeout = None in cqlshrc in ~/.cassandra (the file's comment says it can also be set to None to disable the timeout). It did not help.
2) Increasing the *timeout_in_ms values in cassandra.yaml. That did not help either.
Finally I settled on running a loop over the select in my Java code and counting the results client-side; 12 million rows gave me the count in 7 seconds. It is fast.
Cluster cluster = Cluster.builder()
        .addContactPoints(serverIp)
        .build();
Session session = cluster.connect(keyspace);

// count(*) returns a single row; executing and printing it gives the total
String cqlStatement = "SELECT count(*) FROM imadmin.device_appclass_attributes";
//String cqlStatement = "SELECT * FROM system_schema.keyspaces";
for (Row row : session.execute(cqlStatement)) {
    System.out.println(row.toString());
}
Related
I want to batch messages with the KStream interface.
I have a stream of keys/values.
I tried to collect them in a tumbling window and then process the complete window at once.
builder.stream(longSerde, updateEventSerde, CONSUME_TOPIC)
       .aggregateByKey(
           HashMap::new,
           (aggKey, value, aggregate) -> {
               aggregate.put(value.getUuid(), value);
               return aggregate;
           },
           TimeWindows.of("intentWindow", 100),
           longSerde, mapSerde)
       .foreach((wk, values) -> {
           // process the aggregated window here
       });
The thing is, foreach gets called on each update to the KTable.
I would like to process the whole window once it is complete, i.e. collect data for 100 ms and then process it all at once in foreach.
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 294
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 295
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 296
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 297
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 298
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 299
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 1
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 2
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 3
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 4
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 5
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 6
At some point the new window starts with 1 entry in the map, so I don't even know when the window is full.
Any hints on how to batch process in Kafka Streams?
My actual task is to push updates from the stream to Redis, but I don't want to read/update/write individually, even though Redis is fast.
My solution for now is to use KStream.process() and supply a processor which adds records to a queue in process() and actually processes the queue in punctuate().
public class BatchedProcessor extends AbstractProcessor<Long, IntentUpdateEvent> {
    ...
    BatchedProcessor(Writer writer, long schedulePeriodic)

    @Override
    public void init(ProcessorContext context) {
        super.init(context);
        // request periodic punctuate() calls
        context.schedule(schedulePeriodic);
    }

    @Override
    public void punctuate(long timestamp) {
        super.punctuate(timestamp);
        // flush the queued records as one batch, then commit
        writer.processQueue();
        context().commit();
    }

    @Override
    public void process(Long aLong, IntentUpdateEvent intentUpdateEvent) {
        // just buffer the record; the real work happens in punctuate()
        writer.addToQueue(intentUpdateEvent);
    }
}
I still have to test it, but it solves the problem I had. One could easily write such a processor in a very generic way. The API is very neat and clean, but a processBatched((List batchedMessages) -> ..., timeInterval OR countInterval) that just uses punctuate to process the batch, commits at that point, and collects KeyValues in a store might be a useful addition.
But maybe it was intended that this be solved with a Processor, keeping the API purely focused on one message at a time and low latency.
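For what it's worth, here is a rough, untested sketch of such a generic batching processor against the 0.10.0 Processor API; the BatchHandler interface and all the names are made up for illustration:

import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;

// Hypothetical callback that receives a whole batch (e.g. to write it to Redis in one go).
interface BatchHandler<K, V> {
    void handle(List<KeyValue<K, V>> batch);
}

public class GenericBatchingProcessor<K, V> extends AbstractProcessor<K, V> {
    private final BatchHandler<K, V> handler;
    private final long flushIntervalMs;
    private final List<KeyValue<K, V>> buffer = new ArrayList<>();

    public GenericBatchingProcessor(BatchHandler<K, V> handler, long flushIntervalMs) {
        this.handler = handler;
        this.flushIntervalMs = flushIntervalMs;
    }

    @Override
    public void init(ProcessorContext context) {
        super.init(context);
        context.schedule(flushIntervalMs); // ask for periodic punctuate() calls
    }

    @Override
    public void process(K key, V value) {
        buffer.add(new KeyValue<>(key, value)); // only buffer here
    }

    @Override
    public void punctuate(long timestamp) {
        if (!buffer.isEmpty()) {
            handler.handle(new ArrayList<>(buffer)); // hand the batch over in one call
            buffer.clear();
        }
        context().commit();
    }
}

Note that this buffers in memory rather than in a state store, so records buffered since the last commit may be reprocessed after a failure.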
Right now (as of Kafka 0.10.0.0 / 0.10.0.1): The windowing behavior you are describing is "working as expected". That is, if you are getting 1,000 incoming messages, you will (currently) always see 1,000 updates going downstream with the latest versions of Kafka / Kafka Streams.
Looking ahead: The Kafka community is working on new features to make this update-rate behavior more flexible (e.g. to allow for what you described above as your desired behavior). See KIP-63: Unify store and downstream caching in streams for more details.
====== Update ======
On further testing, this does not work.
The correct approach is to use a processor as outlined by @friedrich-nietzsche. I am down-voting my own answer.... grrrr.
===================
I am still wrestling with this API (but I love it, so it's time well spent :)), and I am not sure what you're trying to accomplish downstream from where your code sample ended, but it looks similar to what I got working. At a high level:
An object is read from the source. It represents a key and a 1:∞ number of events, and I want to publish the total number of events per key every 5 seconds (or TP5s, transactions per 5 seconds). The beginning of the code looks the same, but I use:
KStreamBuilder.stream
reduceByKey
to a window(5000)
to a new stream which gets the accumulated value for each key every 5 secs.
map that stream to a new KeyValue per key
to the sink topic.
In my case, in each window period I can reduce all events to one event per key, so this works. If you want to retain all the individual events per window, I assume you could use reduce to map each instance to a collection of instances (possibly with the same key, or you might need a new key), and at the end of each window period the downstream stream will get a bunch of collections of your events (or maybe just one collection of all the events), all in one go. It looks like this, sanitized and Java 7-ish:
builder.stream(STRING_SERDE, EVENT_SERDE, SOURCE_TOPICS)
    .reduceByKey(eventReducer, TimeWindows.of("EventMeterAccumulator", 5000), STRING_SERDE, EVENT_SERDE)
    .toStream()
    .map(new KeyValueMapper<Windowed<String>, Event, KeyValue<String, Event>>() {
        public KeyValue<String, Event> apply(final Windowed<String> key, final Event finalEvent) {
            return new KeyValue<String, Event>(key.key(), new Event(key.window().end(), finalEvent.getCount()));
        }
    }).to(STRING_SERDE, EVENT_SERDE, SINK_TOPIC);
As you can see from the results below, I have two different clusterings obtained with different seeds. I would like to choose the better of the two.
I know that a lower squared error is better. However, both runs show nearly the same squared error although I used different seeds. I want to know why the squared errors are so similar, and also what other things I need to consider when selecting the best clustering.
*******************************************************************
kMeans
======
Number of iterations: 10
Within cluster sum of squared errors: 527.6988818392938
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1
(4898) (2781) (2117)
=====================================================
fixedacidity 6.8548 6.9565 6.7212
volatileacidity 0.2782 0.2826 0.2725
citricacid 0.3342 0.3389 0.3279
residualsugar 6.3914 8.2678 3.9265
chlorides 0.0458 0.0521 0.0374
freesulfurdioxide 35.3081 38.6897 30.8658
totalsulfurdioxide 138.3607 155.2585 116.1627
density 0.994 0.9958 0.9916
pH 3.1883 3.1691 3.2134
sulphates 0.4898 0.492 0.4871
alcohol 10.5143 9.6325 11.6726
quality 5.8779 5.4779 6.4034
Time taken to build model (full training data) : 0.19 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 2781 ( 57%)
1 2117 ( 43%)
***********************************************************************
kMeans
======
Number of iterations: 7
Within cluster sum of squared errors: 527.6993178146143
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1
(4898) (2122) (2776)
=====================================================
fixedacidity 6.8548 6.7208 6.9572
volatileacidity 0.2782 0.2723 0.2828
citricacid 0.3342 0.3281 0.3389
residualsugar 6.3914 3.9451 8.2614
chlorides 0.0458 0.0374 0.0522
freesulfurdioxide 35.3081 30.9105 38.6697
totalsulfurdioxide 138.3607 116.2175 155.2871
density 0.994 0.9917 0.9958
pH 3.1883 3.2137 3.1689
sulphates 0.4898 0.4876 0.4916
alcohol 10.5143 11.6695 9.6312
quality 5.8779 6.4043 5.4755
Time taken to build model (full training data) : 0.15 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 2122 ( 43%)
1 2776 ( 57%)
Define "best result".
By the definition of k-means, a lower sum of squares is better.
Anything else is worse by k-means - but that doesn't mean that a different quality criterion (or clustering algorithm) could be more helpful for your actual problem.
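For reference, the "within cluster sum of squared errors" that k-means minimizes is, in its usual formulation (Weka's reported number may differ by attribute normalization, so treat this as the general definition):

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

where $C_j$ is the set of instances assigned to cluster $j$ and $\mu_j$ is its centroid.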
Using different seeds does not guarantee different clusters in the result. Here, the two runs appear to have converged to essentially the same partition with the cluster labels swapped (compare cluster 0 of the first run with cluster 1 of the second), which is why the sums of squared errors are nearly identical.
I put almost 190 million records into a Cassandra (2.1.11) cluster with 3 nodes and a replication factor of 1. Then I wrote a client application to count all the records using DataStax's Java driver; the code snippet is as follows:
Statement stmt = new SimpleStatement("select * from test");
System.out.println("starting to read records ");
stmt.setFetchSize(10000);
ResultSet rs = session.execute(stmt);
//System.out.println("rs.size " + rs.all().size());
long cntRecords = 0;
for (Row row : rs) {
    cntRecords++;
    if (cntRecords % 10000000 == 0) {
        System.out.println("the " + cntRecords / 10000000 + " X 10 millions of records");
    }
}
After the variable cntRecords exceeds about 30 million, I always get this exception:
Exception in thread "main" com.datastax.driver.core.exceptions.ReadTimeoutException:
Cassandra timeout during read query at consistency ONE (1 responses were required but only
0 replica responded)
I found several results on Google and changed the heap and GC settings; the following are my relevant settings:
-XX:InitialHeapSize=17179869184
-XX:MaxHeapSize=17179869184
-XX:MaxNewSize=12884901888
-XX:MaxTenuringThreshold=1
-XX:NewSize=12884901888
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseCompressedOops
-XX:+UseConcMarkSweepGC
-XX:+UseCondCardMark
-XX:+UseGCLogFileRotation
-XX:+UseParNewGC
-XX:+UseTLAB
-XX:+UseThreadPriorities
-XX:+CMSClassUnloadingEnabled
I used GCViewer to analyze the GC log files, and the throughputs are 99.95%, 98.15% and 95.75%.
UPDATED BEGIN:
I used jstat to monitor one of the three nodes and found that once the S1 value reaches 100.00, I get the above error shortly after:
/usr/java/jdk1.7.0_80/bin/jstat -gcutil 8862 1000
S0 S1 E O P YGC YGCT FGC FGCT GCT
0.00 100.00 28.57 36.29 74.66 55 14.612 2 0.164 14.776
Once S1 reaches 100.00, it never decreases again. I don't know whether this is related to the error, or which property in cassandra.yaml or cassandra-env.sh I should set for this.
What should I do to finish the task of counting all the records? Thanks in advance!
ATTACHED:
The following are the other JVM options:
-XX:+CMSEdenChunksRecordAlways
-XX:CMSInitiatingOccupancyFraction=75
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSParallelRemarkEnabled
-XX:CMSWaitDuration=10000
-XX:CompileCommandFile=bin/../conf/hotspot_compiler
-XX:GCLogFileSize=94371840
-XX:+HeapDumpOnOutOfMemoryError
-XX:NumberOfGCLogFiles=90
-XX:OldPLABSize=16
-XX:PrintFLSStatistics=1
-XX:+PrintGC
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintPromotionFailure
-XX:+PrintTenuringDistribution
-XX:StringTableSize=1000003
-XX:SurvivorRatio=8
-XX:ThreadPriorityPolicy=42
-XX:ThreadStackSize=256
Examine why you need to know the number of rows. Does your application really need to know this? If it can survive with "just" a good approximation, then create a counter and increment it as you load your data.
http://docs.datastax.com/en/cql/3.1/cql/cql_using/use_counter_t.html
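A minimal sketch of that counter approach with the DataStax Java driver, assuming a hypothetical counter table created as CREATE TABLE row_counts (table_name text PRIMARY KEY, cnt counter):

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class RowCounter {
    private final Session session;
    private final PreparedStatement bump;

    public RowCounter(Session session) {
        this.session = session;
        // counter columns can only be updated; a missing row is created implicitly
        this.bump = session.prepare(
                "UPDATE row_counts SET cnt = cnt + 1 WHERE table_name = 'test'");
    }

    // call this once per row as you load your data
    public void increment() {
        session.execute(bump.bind());
    }

    // reading the approximate total is then a cheap single-partition read
    public long total() {
        return session.execute(
                "SELECT cnt FROM row_counts WHERE table_name = 'test'").one().getLong("cnt");
    }
}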
Things you can try:
Select a single column instead of *. This might reduce GC pressure and network consumption. Preferably pick a column that has a small number of bytes and is part of the primary key: select column1 from test
Add a short pause after every 1M records. Have your loop pause for 500ms or so every 1M records. This may give the nodes a quick breather to take care of things like GC
Edit cassandra.yaml on your nodes and increase range_request_timeout_in_ms and read_request_timeout_in_ms
Figure out the token ranges assigned to each node and issue a separate query for each token range, then add up the counts from each query. This takes advantage of the token-aware driver to issue each "token range" query directly to the node that can answer it. See this blog article for a full description with sample code; a rough sketch of the idea follows below.
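The sketch below assumes the default Murmur3Partitioner and a partition key column called id (hypothetical, since the question does not show the schema of test); instead of reading the exact per-node ranges from the driver metadata, it simply splits the full token range into fixed-size chunks and sums the per-chunk counts:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.math.BigInteger;

public class TokenRangeCount {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build(); // contact point is a placeholder
        Session session = cluster.connect("mykeyspace");                          // keyspace name is a placeholder

        // Murmur3Partitioner tokens span (-2^63, 2^63 - 1]; split that range into chunks
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        BigInteger max = BigInteger.valueOf(Long.MAX_VALUE);
        int chunks = 512;
        BigInteger step = max.subtract(min).divide(BigInteger.valueOf(chunks));

        PreparedStatement ps = session.prepare(
                "SELECT count(*) FROM test WHERE token(id) > ? AND token(id) <= ?");

        long total = 0;
        BigInteger start = min;
        for (int i = 0; i < chunks; i++) {
            BigInteger end = (i == chunks - 1) ? max : start.add(step);
            total += session.execute(ps.bind(start.longValue(), end.longValue())).one().getLong(0);
            start = end;
        }
        System.out.println("total rows: " + total);
        cluster.close();
    }
}

If your driver version exposes token metadata, cluster.getMetadata().getTokenRanges() gives you the actual per-node ranges instead of fixed chunks, which is what the linked article uses.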
I'm trying to create a Cassandra database using a single-node cluster (I think), but no matter what value I set the replication factor to, I keep getting this error:
me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough replicas present to handle consistency level.
Here's my code:
public static String[] getSerializedClusterMap() {
    Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
    // Keyspace keyspace = HFactory.createKeyspace("KMeans", cluster);
    KeyspaceDefinition keyspaceDefinition = cluster.describeKeyspace("myKeyspace");
    if (cluster.describeKeyspace("myKeyspace") == null) {
        ColumnFamilyDefinition columnFamilyDefinition = HFactory.createColumnFamilyDefinition("myKeyspace", "clusters", ComparatorType.BYTESTYPE);
        KeyspaceDefinition keyspaceDefinition1 = HFactory.createKeyspaceDefinition("myKeyspace", ThriftKsDef.DEF_STRATEGY_CLASS, 1, Arrays.asList(columnFamilyDefinition));
        cluster.addKeyspace(keyspaceDefinition1, true);
    }
    Keyspace keyspace = HFactory.createKeyspace("myKeyspace", cluster);
    Mutator<String> mutator = HFactory.createMutator(keyspace, me.prettyprint.cassandra.serializers.StringSerializer.get());
    String[] serializedMap = new String[2], clusters = {"cluster-0", "cluster-1"};
    try {
        me.prettyprint.hector.api.query.ColumnQuery<String, String, String> columnQuery = HFactory.createStringColumnQuery(keyspace);
        for (int i = 0; i < clusters.length; i++) {
            columnQuery.setColumnFamily("user").setKey("cluster").setName(clusters[i]);
            QueryResult<HColumn<String, String>> result = columnQuery.execute();
            serializedMap[i] = result.get().getValue();
        }
    } catch (HectorException ex) {
        ex.printStackTrace();
    }
    return serializedMap;
}
Any suggestions on what I should do, or on what the value of the replication factor should be?
After running 'use "myKeyspace";' and 'describe;', the output is:
Keyspace: myKeyspace:
Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
Durable Writes: true
Options: [replication_factor:3]
Column Families:
ColumnFamily: user
Key Validation Class: org.apache.cassandra.db.marshal.BytesType
Default column value validator: org.apache.cassandra.db.marshal.BytesType
Cells sorted by: org.apache.cassandra.db.marshal.BytesType
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 1.0
DC Local Read repair chance: 0.0
Populate IO Cache on flush: false
Replicate on write: true
Caching: KEYS_ONLY
Bloom Filter FP chance: default
Built indexes: []
Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
Compression Options:
sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
Your keyspace is configured with an RF of 3:
Options: [replication_factor:3]
On a 1-node cluster, no quorum can ever be achieved, since that requires at least 2 replicas. Alter your RF to 1, or use a consistency level of ONE.
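If you want to keep the RF at 3 and go the consistency-level route instead, a sketch of how that looks in Hector (to the best of my recollection of its API) is:

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");

// Drop reads and writes to consistency level ONE so a single live replica is enough,
// even though the keyspace claims a replication factor of 3.
ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
ccl.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);
ccl.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);

Keyspace keyspace = HFactory.createKeyspace("myKeyspace", cluster, ccl);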
In Hive, how do I apply the lower() UDF to an Array of string?
Or any UDF in general. I don't know how to apply a "map" in a select query.
If your use case is that you are transforming an array in isolation (not as part of a table), then the combination of explode, lower, and collect_list should do the trick. For example (please pardon the horrible execution times, I'm running on an underpowered VM):
hive> SELECT collect_list(lower(val))
> FROM (SELECT explode(array('AN', 'EXAMPLE', 'ARRAY')) AS val) t;
...
... Lots of MapReduce spam
...
MapReduce Total cumulative CPU time: 4 seconds 10 msec
Ended Job = job_1422453239049_0017
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 4.01 sec HDFS Read: 283 HDFS Write: 17 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 10 msec
OK
["an","example","array"]
Time taken: 33.05 seconds, Fetched: 1 row(s)
(Note: Replace array('AN', 'EXAMPLE', 'ARRAY') in the above query with whichever expression you are using to generate the array.)
If instead your use case is such that your arrays are stored in a column of a Hive table and you need to apply the lowercase transformation to them, to my knowledge you have two principal options:
Approach #1: Use the combination of explode and LATERAL VIEW to separate the array. Use lower to transform the individual elements, and then collect_list to glue them back together. A simple example with silly made-up data:
hive> DESCRIBE foo;
OK
id int
data array<string>
Time taken: 0.774 seconds, Fetched: 2 row(s)
hive> SELECT * FROM foo;
OK
1001 ["ONE","TWO","THREE"]
1002 ["FOUR","FIVE","SIX","SEVEN"]
Time taken: 0.434 seconds, Fetched: 2 row(s)
hive> SELECT
> id, collect_list(lower(exploded))
> FROM
> foo LATERAL VIEW explode(data) exploded_table AS exploded
> GROUP BY id;
...
... Lots of MapReduce spam
...
MapReduce Total cumulative CPU time: 3 seconds 310 msec
Ended Job = job_1422453239049_0014
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 3.31 sec HDFS Read: 358 HDFS Write: 44 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 310 msec
OK
1001 ["one","two","three"]
1002 ["four","five","six","seven"]
Time taken: 34.268 seconds, Fetched: 2 row(s)
Approach #2: Write a simple UDF to apply the transformation. Something like:
package my.package_name;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class LowerArray extends UDF {
    public List<Text> evaluate(List<Text> input) {
        List<Text> output = new ArrayList<Text>();
        for (Text element : input) {
            output.add(new Text(element.toString().toLowerCase()));
        }
        return output;
    }
}
And then invoke the UDF directly on the data:
hive> ADD JAR my_jar.jar;
Added my_jar.jar to class path
Added resource: my_jar.jar
hive> CREATE TEMPORARY FUNCTION lower_array AS 'my.package_name.LowerArray';
OK
Time taken: 2.803 seconds
hive> SELECT id, lower_array(data) FROM foo;
...
... Lots of MapReduce spam
...
MapReduce Total cumulative CPU time: 2 seconds 760 msec
Ended Job = job_1422453239049_0015
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.76 sec HDFS Read: 358 HDFS Write: 44 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 760 msec
OK
1001 ["one","two","three"]
1002 ["four","five","six","seven"]
Time taken: 27.243 seconds, Fetched: 2 row(s)
There are some trade-offs between the two approaches. #2 will probably be more efficient at runtime in general than #1, since the GROUP BY clause in #1 forces a reduction stage while the UDF approach does not. However, #1 does everything in HiveQL and is a bit more easily generalized (you can replace lower with some other kind of string transformation in the query if you needed to). With the UDF approach of #2, you potentially have to write a new UDF for each different kind of transformation you want to apply.