Related
I get a warning (Agent configuration for 'a1' has no configfilters) when I use Flume 1.9 to transfer data from Kafka to HDFS, but no other error or info is reported.
The source is a KafkaSource, the channel is a file channel, and the sink is an HDFS sink.
The interceptor is a custom one, which I will show below.
Agent configuration for 'a1' has no configfilters.
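For context: config filters are a feature added in Flume 1.9 and are declared per agent, like other components, so this warning only notes that none are defined. A minimal sketch of such a declaration, following the pattern from the Flume 1.9 user guide (the name f1 and the env type are illustrative, not from my setup):
a1.configfilters = f1
a1.configfilters.f1.type = env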
The logger output is below. Unlike other questions about this warning, nothing else in the log looks wrong:
16 Aug 2022 11:45:27,600 WARN [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.validateConfigFilterSet:623) - Agent configuration for 'a1' has no configfilters.
16 Aug 2022 11:45:27,623 INFO [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration.validateConfiguration:163) - Post-validation flume configuration contains configuration for agents: [a1]
16 Aug 2022 11:45:27,624 INFO [conf-file-poller-0] (org.apache.flume.node.AbstractConfigurationProvider.loadChannels:151) - Creating channels
16 Aug 2022 11:45:27,628 INFO [conf-file-poller-0] (org.apache.flume.channel.DefaultChannelFactory.create:42) - Creating instance of channel c1 type file
16 Aug 2022 11:45:27,642 INFO [conf-file-poller-0] (org.apache.flume.node.AbstractConfigurationProvider.loadChannels:205) - Created channel c1
16 Aug 2022 11:45:27,643 INFO [conf-file-poller-0] (org.apache.flume.source.DefaultSourceFactory.create:41) - Creating instance of source r1, type org.apache.flume.source.kafka.KafkaSource
16 Aug 2022 11:45:27,655 INFO [conf-file-poller-0] (org.apache.flume.sink.DefaultSinkFactory.create:42) - Creating instance of sink: k1, type: hdfs
16 Aug 2022 11:45:27,786 INFO [conf-file-poller-0] (org.apache.flume.node.AbstractConfigurationProvider.getConfiguration:120) - Channel c1 connected to [r1, k1]
16 Aug 2022 11:45:27,787 INFO [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:162) - Starting new configuration:{ sourceRunners:{r1=PollableSourceRunner: { source:org.apache.flume.source.kafka.KafkaSource{name:r1,state:IDLE} counterGroup:{ name:null counters:{} } }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor#2aa62fb9 counterGroup:{ name:null counters:{} } }} channels:{c1=FileChannel c1 { dataDirs: [/opt/module/flume/data/ranqi/behavior2] }} }
16 Aug 2022 11:45:27,788 INFO [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:169) - Starting Channel c1
16 Aug 2022 11:45:27,790 INFO [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:184) - Waiting for channel: c1 to start. Sleeping for 500 ms
16 Aug 2022 11:45:27,790 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.FileChannel.start:278) - Starting FileChannel c1 { dataDirs: [/opt/module/flume/data/ranqi/behavior2] }...
16 Aug 2022 11:45:27,833 INFO [lifecycleSupervisor-1-0] (org.apache.flume.instrumentation.MonitoredCounterGroup.register:119) - Monitored counter group for type: CHANNEL, name: c1: Successfully registered new MBean.
16 Aug 2022 11:45:27,833 INFO [lifecycleSupervisor-1-0] (org.apache.flume.instrumentation.MonitoredCounterGroup.start:95) - Component type: CHANNEL, name: c1 started
16 Aug 2022 11:45:27,839 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.<init>:356) - Encryption is not enabled
16 Aug 2022 11:45:27,840 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.replay:406) - Replay started
16 Aug 2022 11:45:27,845 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.replay:418) - Found NextFileID 3, from [/opt/module/flume/data/ranqi/behavior2/log-3, /opt/module/flume/data/ranqi/behavior2/log-2]
16 Aug 2022 11:45:27,851 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>:55) - Starting up with /opt/module/flume/checkpoint/ranqi/behavior2/checkpoint and /opt/module/flume/checkpoint/ranqi/behavior2/checkpoint.meta
16 Aug 2022 11:45:27,851 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>:59) - Reading checkpoint metadata from /opt/module/flume/checkpoint/ranqi/behavior2/checkpoint.meta
16 Aug 2022 11:45:27,906 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.FlumeEventQueue.<init>:115) - QueueSet population inserting 0 took 0
16 Aug 2022 11:45:27,908 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.replay:457) - Last Checkpoint Mon Aug 15 17:11:08 CST 2022, queue depth = 0
16 Aug 2022 11:45:27,918 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.doReplay:542) - Replaying logs with v2 replay logic
16 Aug 2022 11:45:27,919 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:249) - Starting replay of [/opt/module/flume/data/ranqi/behavior2/log-2, /opt/module/flume/data/ranqi/behavior2/log-3]
16 Aug 2022 11:45:27,920 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:262) - Replaying /opt/module/flume/data/ranqi/behavior2/log-2
16 Aug 2022 11:45:27,925 INFO [lifecycleSupervisor-1-0] (org.apache.flume.tools.DirectMemoryUtils.getDefaultDirectMemorySize:112) - Unable to get maxDirectMemory from VM: NoSuchMethodException: sun.misc.VM.maxDirectMemory(null)
16 Aug 2022 11:45:27,926 INFO [lifecycleSupervisor-1-0] (org.apache.flume.tools.DirectMemoryUtils.allocate:48) - Direct Memory Allocation: Allocation = 1048576, Allocated = 0, MaxDirectMemorySize = 1908932608, Remaining = 1908932608
16 Aug 2022 11:45:27,982 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.LogFile$SequentialReader.skipToLastCheckpointPosition:660) - Checkpoint for file(/opt/module/flume/data/ranqi/behavior2/log-2) is: 1660554206424, which is beyond the requested checkpoint time: 1660555388025 and position 0
16 Aug 2022 11:45:27,982 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:262) - Replaying /opt/module/flume/data/ranqi/behavior2/log-3
16 Aug 2022 11:45:27,983 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.LogFile$SequentialReader.skipToLastCheckpointPosition:658) - fast-forward to checkpoint position: 273662090
16 Aug 2022 11:45:27,983 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.LogFile$SequentialReader.next:683) - Encountered EOF at 273662090 in /opt/module/flume/data/ranqi/behavior2/log-3
16 Aug 2022 11:45:27,983 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:345) - read: 0, put: 0, take: 0, rollback: 0, commit: 0, skip: 0, eventCount:0
16 Aug 2022 11:45:27,984 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.FlumeEventQueue.replayComplete:417) - Search Count = 0, Search Time = 0, Copy Count = 0, Copy Time = 0
16 Aug 2022 11:45:27,988 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.replay:505) - Rolling /opt/module/flume/data/ranqi/behavior2
16 Aug 2022 11:45:27,988 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.roll:990) - Roll start /opt/module/flume/data/ranqi/behavior2
16 Aug 2022 11:45:27,989 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.LogFile$Writer.<init>:220) - Opened /opt/module/flume/data/ranqi/behavior2/log-4
16 Aug 2022 11:45:27,996 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.roll:1006) - Roll end
16 Aug 2022 11:45:27,996 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.EventQueueBackingStoreFile.beginCheckpoint:230) - Start checkpoint for /opt/module/flume/checkpoint/ranqi/behavior2/checkpoint, elements to sync = 0
16 Aug 2022 11:45:28,000 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint:255) - Updating checkpoint metadata: logWriteOrderID: 1660621527859, queueSize: 0, queueHead: 557327
16 Aug 2022 11:45:28,008 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.writeCheckpoint:1065) - Updated checkpoint for file: /opt/module/flume/data/ranqi/behavior2/log-4 position: 0 logWriteOrderID: 1660621527859
16 Aug 2022 11:45:28,008 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.FileChannel.start:289) - Queue Size after replay: 0 [channel=c1]
16 Aug 2022 11:45:28,290 INFO [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:196) - Starting Sink k1
16 Aug 2022 11:45:28,291 INFO [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:207) - Starting Source r1
16 Aug 2022 11:45:28,292 INFO [lifecycleSupervisor-1-1] (org.apache.flume.instrumentation.MonitoredCounterGroup.register:119) - Monitored counter group for type: SINK, name: k1: Successfully registered new MBean.
16 Aug 2022 11:45:28,292 INFO [lifecycleSupervisor-1-1] (org.apache.flume.instrumentation.MonitoredCounterGroup.start:95) - Component type: SINK, name: k1 started
16 Aug 2022 11:45:28,292 INFO [lifecycleSupervisor-1-4] (org.apache.flume.source.kafka.KafkaSource.doStart:524) - Starting org.apache.flume.source.kafka.KafkaSource{name:r1,state:IDLE}...
The Flume agent is started with the shell script below.
#!/bin/bash
case $1 in
"start")
echo " --------start flume-------"
ssh hadoop104 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf -f /opt/module/flume/job/ranqi/ranqi_kafka_to_hdfs_db.conf >/dev/null 2>&1 &"
;;
"stop")
echo " --------stop flume-------"
ssh hadoop104 "ps -ef | grep ranqi_kafka_to_hdfs_db.conf | grep -v grep |awk '{print \$2}' | xargs -n1 kill"
;;
esac
The Flume config is below.
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092
a1.sources.r1.kafka.topics = copy_1015
a1.sources.r1.kafka.consumer.group.id = flume
a1.sources.r1.setTopicHeader = true
a1.sources.r1.topicHeader = topic
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.flume.interceptor.ranqi.ranqiTimestampInterceptor$Builder
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/ranqi/behavior2
a1.channels.c1.dataDirs = /opt/module/flume/data/ranqi/behavior2/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1123456
a1.channels.c1.keep-alive = 6
## sink1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/ranqi/db/%{topic}_inc/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = db
a1.sinks.k1.hdfs.round = false
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
## bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel= c1
The ranqiTimestampInterceptor class I defined is below; its jar is placed in flume/lib.
package com.atguigu.flume.interceptor.ranqi;
import com.alibaba.fastjson.JSONObject;
import com.atguigu.flume.interceptor.db.TimestampInterceptor;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.nio.charset.StandardCharsets;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.Map;
public class ranqiTimestampInterceptor implements Interceptor {
public static String dateToStamp(String s) throws ParseException {
String res;
SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date date = simpleDateFormat.parse(s);
long ts = date.getTime();
res = String.valueOf(ts);
return res;
}
@Override
public void initialize() {
}
private final static Logger logger = LoggerFactory.getLogger(ranqiTimestampInterceptor.class);
@Override
public Event intercept(Event event) {
byte[] body = event.getBody();
Long createDate ;
String time = new String();
String log = new String(body, StandardCharsets.UTF_8);
JSONObject jsonObject = JSONObject.parseObject(log);
// logger.info(log);
logger.info(String.valueOf(jsonObject));
JSONObject data = jsonObject.getObject("data", JSONObject.class);
if(data.containsKey("createDate") && data.getLong("createDate") != null){
createDate = data.getLong("createDate");
try {
createDate = Long.valueOf(dateToStamp(String.valueOf(createDate)));
time = String.valueOf(createDate);
} catch (ParseException e) {
e.printStackTrace();
}
finally {
Long ts = jsonObject.getLong("ts");
time = String.valueOf(ts);
}
}else{
Long ts = jsonObject.getLong("ts");
time = String.valueOf(ts);
}
System.out.println(time);
logger.info(time);
Map<String, String> headers = event.getHeaders();
headers.put("timestamp",time);
return event;
}
@Override
public List<Event> intercept(List<Event> list) {
for (Event event : list) {
intercept(event);
}
return list;
}
@Override
public void close() {
}
public static class Builder implements Interceptor.Builder{
@Override
public Interceptor build() {
return new TimestampInterceptor();
}
@Override
public void configure(Context context) {
}
}
}
I'm upgrading Flume 1.7 to 1.9. I have 5 Kafka sources and 7 AWS S3 sinks in Flume's conf.
First I need to stop Flume 1.7, so I execute the command
kill $(ps -ef | grep flume | grep bidinfo | awk '{print $2}')
to stop the bidinfo task, but the process still exists; 22 hours later it is still alive. What else can I do besides kill -9 xxxx? Suggestions welcome!
This server has been running Flume 1.7 and Kafka for 60 days. I've also tried kill -3 xxxx; however, only two sources and two sinks stopped, and the others kept running.
I read the source code of Flume's KafkaSource and watched Flume's log.
It is possible that two methods (consumer.wakeup() and consumer.close()) never complete.
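For what it's worth, the stop pattern I would try before kill -9: send SIGTERM (plain kill), give the agent-shutdown-hook a bounded time to finish, and only then escalate. This is a generic sketch, not something from the Flume docs; the grep pattern is an assumption based on my setup, and it assumes a single matching process.
#!/bin/bash
# Sketch: graceful stop with a bounded wait, escalating only on timeout.
PID=$(ps -ef | grep flume | grep bidinfo | grep -v grep | awk '{print $2}')
kill "$PID"                             # SIGTERM triggers the agent-shutdown-hook
for i in $(seq 1 60); do
  kill -0 "$PID" 2>/dev/null || exit 0  # process gone: clean shutdown
  sleep 1
done
kill -9 "$PID"                          # still alive after 60 s: force it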
#flume-conf
ag.sources = src_1 src_2 src_3 src_4 src_5
ag.channels = ch
ag.sinks = sk_1 sk_2 sk_3 sk_4 sk_5 sk_6 sk_7
#source:kafka
ag.sources.src_1.type = org.apache.flume.source.kafka.KafkaSource
ag.sources.src_1.kafka.bootstrap.servers = xxxx
ag.sources.src_1.kafka.consumer.group.id = flume.xxxxx
ag.sources.src_1.kafka.consumer.retry.backoff.ms = 10000
ag.sources.src_1.batchSize = 5000
ag.sources.src_1.batchDurationMillis = 2000
ag.sources.src_1.kafka.topics = xxxx
ag.sources.src_1.interceptors = i1
ag.sources.src_1.interceptors.i1.type = xxxx.interceptor
ag.sources.src_1.channels = ch
#sink:aws S3
ag.sinks.sk_1.type = hdfs
ag.sinks.sk_1.hdfs.path = s3a://xxxxxxx
ag.sinks.sk_1.hdfs.filePrefix = %{minute}
ag.sinks.sk_1.hdfs.fileSuffix = .xxx.1.lzo
ag.sinks.sk_1.hdfs.rollSize = 0
ag.sinks.sk_1.hdfs.rollCount = 0
ag.sinks.sk_1.hdfs.rollInterval = 0
ag.sinks.sk_1.hdfs.idleTimeout = 180
ag.sinks.sk_1.hdfs.callTimeout = 600000
ag.sinks.sk_1.hdfs.closeTries = 5
ag.sinks.sk_1.hdfs.retryInterval = 60
ag.sinks.sk_1.hdfs.batchSize = 3000
ag.sinks.sk_1.hdfs.codeC = lzop
ag.sinks.sk_1.hdfs.fileType = CompressedStream
ag.sinks.sk_1.hdfs.writeFormat = Text
ag.sinks.sk_1.channel = ch
#channels
ag.channels.ch.type = memory
ag.channels.ch.capacity = 2000000
ag.channels.ch.transactionCapacity = 100000
Error Message
09 Sep 2019 11:14:18,010 ERROR [PollableSourceRunner-KafkaSource-src_2] (org.apache.flume.source.kafka.KafkaSource.doProcess:314) - KafkaSource EXCEPTION, {}
org.apache.flume.ChannelException: java.lang.InterruptedException
at org.apache.flume.channel.BasicTransactionSemantics.commit(BasicTransactionSemantics.java:154)
at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:194)
at org.apache.flume.source.kafka.KafkaSource.doProcess(KafkaSource.java:295)
at org.apache.flume.source.AbstractPollableSource.process(AbstractPollableSource.java:60)
at org.apache.flume.source.PollableSourceRunner$PollingRunner.run(PollableSourceRunner.java:133)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:582)
at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doCommit(MemoryChannel.java:119)
at org.apache.flume.channel.BasicTransactionSemantics.commit(BasicTransactionSemantics.java:151)
... 5 more
# I think this is the main problem
09 Sep 2019 11:14:18,011 INFO [PollableSourceRunner-KafkaSource-src_2] (org.apache.flume.source.PollableSourceRunner$PollingRunner.run:143) - Source runner interrupted. Exiting
09 Sep 2019 11:14:18,011 INFO [agent-shutdown-hook] (org.apache.kafka.clients.consumer.internals.AbstractCoordinator$2.onFailure:571) - LeaveGroup request failed with error
org.apache.kafka.clients.consumer.internals.SendFailedException
#-------------
09 Sep 2019 11:14:18,018 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:149) - Component type: SOURCE, name: src_2 stopped
09 Sep 2019 11:14:18,018 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:155) - Shutdown Metric for type: SOURCE, name: src_2. source.start.time == 1567911973591
09 Sep 2019 11:14:18,018 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:161) - Shutdown Metric for type: SOURCE, name: src_2. source.stop.time == 1567998858018
09 Sep 2019 11:14:18,018 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SOURCE, name: src_2. source.kafka.commit.time == 2470886
09 Sep 2019 11:14:18,018 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SOURCE, name: src_2. source.kafka.empty.count == 0
09 Sep 2019 11:14:18,018 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SOURCE, name: src_2. source.kafka.event.get.time == 78823378
09 Sep 2019 11:14:18,018 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SOURCE, name: src_2. src.append-batch.accepted == 0
09 Sep 2019 11:14:18,018 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SOURCE, name: src_2. src.append-batch.received == 0
09 Sep 2019 11:14:18,018 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SOURCE, name: src_2. src.append.accepted == 0
09 Sep 2019 11:14:18,018 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SOURCE, name: src_2. src.append.received == 0
09 Sep 2019 11:14:18,018 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SOURCE, name: src_2. src.events.accepted == 767319160
09 Sep 2019 11:14:18,018 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SOURCE, name: src_2. src.events.received == 767366357
09 Sep 2019 11:14:18,018 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SOURCE, name: src_2. src.open-connection.count == 0
I recently started programming with Esper. I have a smart wearable that sends pedometer data to my laptop, and I process that data with Esper. Now suppose I have multiple smart wearables, each with a unique MAC address. I use time windows, and my question is: how can I change my rule file so that the rules only fire for events with the same MAC address, and take appropriate action based on that MAC address? My initialization and rule are:
Configuration cepConfig = new Configuration();
cepConfig.addEventType("Steps", Steps.class.getName());
// We setup the engine
EPServiceProvider cep = EPServiceProviderManager.getProvider("myCEPEngine", cepConfig);
EPRuntime cepRT = cep.getEPRuntime();
// We register an EPL statement
EPAdministrator cepSteps1 = cep.getEPAdministrator();
EPStatement cepStatementSteps1 = cepSteps1.createEPL("select * from "
+ "Steps().win:time(1 hour) "
+ "group by macAddress "
+ "having sum(max(steps)-min(steps)) < 100");
cepStatementSteps1.addListener(new rule1Listener());
My Steps class has the following fields:
double steps;
String stepsTimestamp;
String macAddress;
And this is how I insert the events:
Steps steps0 = new Steps(0, new Date(timeStamp).toString(), "K5E45H778");
cepRT.sendEvent(steps0);
Steps steps00 = new Steps(0, new Date(timeStamp).toString(), "LD24ESF74");
cepRT.sendEvent(steps00);
Steps steps1 = new Steps(25, new Date(timeStamp).toString(), "K5E45H778");
cepRT.sendEvent(steps1);
Steps steps2 = new Steps(50, new Date(timeStamp).toString(), "LD24ESF74");
cepRT.sendEvent(steps2);
Steps steps3 = new Steps(55, new Date(timeStamp).toString(), "K5E45H778");
cepRT.sendEvent(steps3);
Steps steps4 = new Steps(105, new Date(timeStamp).toString(), "LD24ESF74");
cepRT.sendEvent(steps4);
Steps steps5 = new Steps(75, new Date(timeStamp).toString(), "K5E45H778");
cepRT.sendEvent(steps5);
Steps steps6 = new Steps(110, new Date(timeStamp).toString(), "K5E45H778");
cepRT.sendEvent(steps6);
This is my output:
Sending tick: Steps: 0.0 Timestamp: Mon Mar 14 18:13:23 CET 2016 Mac Address: K5E45H778
->Rule 1 fired: K5E45H778
Sending tick: Steps: 0.0 Timestamp: Mon Mar 14 18:18:23 CET 2016 Mac Address: LD24ESF7474
->Rule 1 fired: LD24ESF7474
Sending tick: Steps: 25.0 Timestamp: Mon Mar 14 18:23:23 CET 2016 Mac Address: K5E45H778
->Rule 1 fired: K5E45H778
Sending tick: Steps: 105.0 Timestamp: Mon Mar 14 18:28:23 CET 2016 Mac Address: LD24ESF7474
Sending tick: Steps: 55.0 Timestamp: Mon Mar 14 18:33:23 CET 2016 Mac Address: K5E45H778
->Rule 1 fired: K5E45H778
Sending tick: Steps: 75.0 Timestamp: Mon Mar 14 18:38:23 CET 2016 Mac Address: K5E45H778
Sending tick: Steps: 110.0 Timestamp: Mon Mar 14 18:43:23 CET 2016 Mac Address: K5E45H778
Why doesn't the rule fire for the second-to-last event of 75 steps?
The SQL-standard "group by" clause performs aggregation per group, so grouping by macAddress is the right tool: with it, the aggregation functions are computed separately for each MAC address.
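A minimal sketch of the statement with per-group aggregation, reusing cepSteps1 and rule1Listener from the question (the having condition is a simplified illustration, not the exact original rule):
EPStatement stmt = cepSteps1.createEPL(
        "select macAddress, max(steps) - min(steps) as delta "
      + "from Steps.win:time(1 hour) "
      + "group by macAddress "
      + "having max(steps) - min(steps) < 100");
stmt.addListener(new rule1Listener());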
Looking at this flow...
public Date nextExecutionTime(TriggerContext triggerContext) {
    // Fires exactly once: the first call returns "now", every later call
    // returns null, so the poller never runs again.
    return this.invoked.getAndSet(true) ? null : new Date();
}
@Bean
public IntegrationFlow mainFlow() {
JsonObjectMapper<?, ?> jsonObjectMapper = new Jackson2JsonObjectMapper(objectMapper);
// @formatter:off
return IntegrationFlows
.from(
amazonS3InboundSynchronizationMessageSource(),
e -> e.poller(p -> p.trigger(this::nextExecutionTime))
)
.channel(LoggingUtils.createLoggingMessageChannel("File:::"))
.transform(new FileToInputStreamTransformer())
.split(new FileSplitter(), null)
.channel(c -> c.executor(Executors.newFixedThreadPool(10)))
.transform(Transformers.fromJson(persistentType(), jsonObjectMapper))
.handle(LoggingUtils.createLoggingMessageHandler("Parsed JSON record #"))
//.handle(jdbcRepositoryHandler())
//.publishSubscribeChannel(p -> p.subscribe(persistenceSubFlow()))
.get();
// @formatter:on
}
Why am I only able to read one file, even though the configured MessageSource (an AmazonS3InboundSynchronizationMessageSource) writes more than one file to the local directory?
Sample console output
2015-09-11 09:52:59,856 [task-scheduler-1] org.springframework.integration.aws.s3.InboundFileSynchronizationImpl INFO Sync completed
2015-09-11 09:52:59,860 [task-scheduler-1] org.springframework.integration.handler.LoggingHandler INFO Event: [File:::] - Message: [GenericMessage [payload=/Users/cphi/development/projects/expedia/git/luis-data-migration-service/target/s3-dump/RatePlanLevelRestrictionLog/2015/08/23/00/2015-08-22-23-58-0.302402118982895.gz, headers={id=e58c332b-c217-8059-c4e8-09bba2c430a0, timestamp=1441990379859}]]
2015-09-11 09:52:59,918 [pool-2-thread-8] org.springframework.integration.handler.LoggingHandler INFO Event: [Parsed JSON record #] - Message: [GenericMessage [payload=RatePlanLevelRestrictionLog[roomTypeId=,ratePlanId=201744463,stayDate=Wed Sep 02 17:00:00 PDT 2015,ratePlanLevel=0,hotelId=4469515,rprLogSeqNum=16,logActionTypeId=2,sellStateId=1,startAllowed=,endAllowed=,fplosMaskArrival=,fplosMaskStayThrough=,doaCostPriceChanged=,supplierUpdateDate=Sat Aug 22 07:57:24 PDT 2015,supplierUpdateTuid=68630676,createDate=Sat Aug 22 07:57:24 PDT 2015,changeRequestId=31461011173,changeRequestSourceId=], headers={sequenceNumber=8, file_name=2015-08-22-23-58-0.302402118982895.gz, sequenceSize=0, correlationId=16e44a80-2669-b2bf-f2bf-f12fe6bb4510, file_originalFile=/Users/cphi/development/projects/expedia/git/luis-data-migration-service/target/s3-dump/RatePlanLevelRestrictionLog/2015/08/23/00/2015-08-22-23-58-0.302402118982895.gz, id=6b866d25-07e8-22a4-381c-d26205393f3b, timestamp=1441990379898}]]
2015-09-11 09:52:59,919 [pool-2-thread-3] org.springframework.integration.handler.LoggingHandler INFO Event: [Parsed JSON record #] - Message: [GenericMessage [payload=RatePlanLevelRestrictionLog[roomTypeId=,ratePlanId=1030513,stayDate=Wed Aug 26 17:00:00 PDT 2015,ratePlanLevel=0,hotelId=1615126,rprLogSeqNum=6,logActionTypeId=2,sellStateId=0,startAllowed=,endAllowed=,fplosMaskArrival=,fplosMaskStayThrough=,doaCostPriceChanged=,supplierUpdateDate=Sat Aug 22 07:57:35 PDT 2015,supplierUpdateTuid=46712703,createDate=Sat Aug 22 07:57:35 PDT 2015,changeRequestId=31461014045,changeRequestSourceId=], headers={sequenceNumber=3, file_name=2015-08-22-23-58-0.302402118982895.gz, sequenceSize=0, correlationId=16e44a80-2669-b2bf-f2bf-f12fe6bb4510, file_originalFile=/Users/cphi/development/projects/expedia/git/luis-data-migration-service/target/s3-dump/RatePlanLevelRestrictionLog/2015/08/23/00/2015-08-22-23-58-0.302402118982895.gz, id=ddf1ee98-55c4-81de-af77-a886a340fe07, timestamp=1441990379897}]]
2015-09-11 09:52:59,919 [pool-2-thread-2] org.springframework.integration.handler.LoggingHandler INFO Event: [Parsed JSON record #] - Message: [GenericMessage [payload=RatePlanLevelRestrictionLog[roomTypeId=,ratePlanId=163007,stayDate=Fri Dec 11 16:00:00 PST 2015,ratePlanLevel=0,hotelId=897973,rprLogSeqNum=3,logActionTypeId=2,sellStateId=0,startAllowed=,endAllowed=,fplosMaskArrival=,fplosMaskStayThrough=,doaCostPriceChanged=,supplierUpdateDate=Sat Aug 22 07:57:16 PDT 2015,supplierUpdateTuid=46712703,createDate=Sat Aug 22 07:57:16 PDT 2015,changeRequestId=31461009374,changeRequestSourceId=], headers={sequenceNumber=2, file_name=2015-08-22-23-58-0.302402118982895.gz, sequenceSize=0, correlationId=16e44a80-2669-b2bf-f2bf-f12fe6bb4510, file_originalFile=/Users/cphi/development/projects/expedia/git/luis-data-migration-service/target/s3-dump/RatePlanLevelRestrictionLog/2015/08/23/00/2015-08-22-23-58-0.302402118982895.gz, id=d7d7a418-6593-bc57-fc7d-e181778be0c8, timestamp=1441990379899}]]
...
Directory contents
.../target/s3-dump/RatePlanLevelRestriction
+- 2015
+-- 08
+--- 23
+---- 00
+----- 2015-08-22-23-58-0.302402118982895.gz
+----- 2015-08-22-23-58-0.302992661055088.gz
+----- 2015-08-22-23-58-0.303107496339691.gz
If you're curious, here are the gists for:
LoggingUtils: https://gist.github.com/fastnsilver/82f242dd5b42bfd118e8
amazonS3InboundSynchronizationMessageSource() config: https://gist.github.com/fastnsilver/fb750c02b58a04686509
Increase maxMessagesPerPoll on the poller (default is 1).
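A minimal sketch of that change in the flow above, assuming the Java DSL's PollerSpec (-1 means no per-poll limit):
.from(
        amazonS3InboundSynchronizationMessageSource(),
        e -> e.poller(p -> p.trigger(this::nextExecutionTime)
                            .maxMessagesPerPoll(-1))
)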
My MapReduce app counts usage of field values in a Hive table. I managed to build and run it from Eclipse after including all jars from the /usr/lib/hadoop, /usr/lib/hive and /usr/lib/hcatalog directories. It works.
After much frustration I also managed to compile and run it as a Maven project:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.bigdata.hadoop</groupId>
<artifactId>FieldCounts</artifactId>
<packaging>jar</packaging>
<name>FieldCounts</name>
<version>0.0.1-SNAPSHOT</version>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.hcatalog</groupId>
<artifactId>hcatalog-core</artifactId>
<version>0.11.0</version>
</dependency>
</dependencies>
</project>
To run the job from the command line, the following script has to be used:
#!/bin/sh
export LIBJARS=/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar,/usr/lib/hive/lib/hive-exec-0.12.0.2.0.6.1-101.jar,/usr/lib/hive/lib/hive-metastore-0.12.0.2.0.6.1-101.jar,/usr/lib/hive/lib/libfb303-0.9.0.jar,/usr/lib/hive/lib/jdo-api-3.0.1.jar,/usr/lib/hive/lib/antlr-runtime-3.4.jar,/usr/lib/hive/lib/datanucleus-api-jdo-3.2.1.jar,/usr/lib/hive/lib/datanucleus-core-3.2.2.jar
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:.:/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar:/usr/lib/hive/lib/hive-exec-0.12.0.2.0.6.1-101.jar:/usr/lib/hive/lib/hive-metastore-0.12.0.2.0.6.1-101.jar:/usr/lib/hive/lib/libfb303-0.9.0.jar:/usr/lib/hive/lib/jdo-api-3.0.1.jar:/usr/lib/hive/lib/antlr-runtime-3.4.jar:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.1.jar:/usr/lib/hive/lib/datanucleus-core-3.2.2.jar
hadoop jar FieldCounts-0.0.1-SNAPSHOT.jar com.bigdata.hadoop.FieldCounts -libjars ${LIBJARS} simple simpout
Hadoop creates and starts the job, which then fails because Hadoop cannot find the Map class:
14/03/26 16:25:58 INFO mapreduce.Job: Running job: job_1395407010870_0007
14/03/26 16:26:07 INFO mapreduce.Job: Job job_1395407010870_0007 running in uber mode : false
14/03/26 16:26:07 INFO mapreduce.Job: map 0% reduce 0%
14/03/26 16:26:13 INFO mapreduce.Job: Task Id : attempt_1395407010870_0007_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.bigdata.hadoop.FieldCounts$Map not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1720)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:721)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.ClassNotFoundException: Class com.bigdata.hadoop.FieldCounts$Map not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1718)
... 8 more
Why does this happen? The job jar contains all classes, including Map:
jar tvf FieldCounts-0.0.1-SNAPSHOT.jar
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/
121 Wed Mar 26 15:51:04 MSK 2014 META-INF/MANIFEST.MF
0 Wed Mar 26 14:29:58 MSK 2014 com/
0 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/
0 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/
3992 Fri Mar 21 17:29:22 MSK 2014 hive-site.xml
4093 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts.class
2961 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts$Reduce.class
1621 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/TableFieldValueKey.class
4186 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts$Map.class
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/com.bigdata.hadoop/
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/
1030 Wed Mar 26 14:28:22 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/pom.xml
123 Wed Mar 26 14:30:02 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/pom.properties
What is wrong? Should I put Map and Reduce classes in separate files?
MapReduce code:
package com.bigdata.hadoop;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;
import org.apache.hcatalog.mapreduce.*;
import org.apache.hcatalog.data.*;
import org.apache.hcatalog.data.schema.*;
import org.apache.log4j.Logger;
public class FieldCounts extends Configured implements Tool {
public static class Map extends Mapper<WritableComparable, HCatRecord, TableFieldValueKey, IntWritable> {
static Logger logger = Logger.getLogger("com.foo.Bar");
static boolean firstMapRun = true;
static List<String> fieldNameList = new LinkedList<String>();
/**
* Return a list of field names not containing `id` field name
* @param schema
* @return
*/
static List<String> getFieldNames(HCatSchema schema) {
// Filter out `id` name just once
if (firstMapRun) {
firstMapRun = false;
List<String> fieldNames = schema.getFieldNames();
for (String fieldName : fieldNames) {
if (!fieldName.equals("id")) {
fieldNameList.add(fieldName);
}
}
} // if (firstMapRun)
return fieldNameList;
}
@Override
protected void map( WritableComparable key,
HCatRecord hcatRecord,
//org.apache.hadoop.mapreduce.Mapper
//<WritableComparable, HCatRecord, Text, IntWritable>.Context context)
Context context)
throws IOException, InterruptedException {
HCatSchema schema = HCatBaseInputFormat.getTableSchema(context.getConfiguration());
//String schemaTypeStr = schema.getSchemaAsTypeString();
//logger.info("******** schemaTypeStr ********** : "+schemaTypeStr);
//List<String> fieldNames = schema.getFieldNames();
List<String> fieldNames = getFieldNames(schema);
for (String fieldName : fieldNames) {
Object value = hcatRecord.get(fieldName, schema);
String fieldValue = null;
if (null == value) {
fieldValue = "<NULL>";
} else {
fieldValue = value.toString();
}
//String fieldNameValue = fieldName+"."+fieldValue;
//context.write(new Text(fieldNameValue), new IntWritable(1));
TableFieldValueKey fieldKey = new TableFieldValueKey();
fieldKey.fieldName = fieldName;
fieldKey.fieldValue = fieldValue;
context.write(fieldKey, new IntWritable(1));
}
}
}
public static class Reduce extends Reducer<TableFieldValueKey, IntWritable,
WritableComparable, HCatRecord> {
protected void reduce( TableFieldValueKey key,
java.lang.Iterable<IntWritable> values,
Context context)
//org.apache.hadoop.mapreduce.Reducer<Text, IntWritable,
//WritableComparable, HCatRecord>.Context context)
throws IOException, InterruptedException {
Iterator<IntWritable> iter = values.iterator();
int sum = 0;
// Sum up occurrences of the given key
while (iter.hasNext()) {
IntWritable iw = iter.next();
sum = sum + iw.get();
}
HCatRecord record = new DefaultHCatRecord(3);
record.set(0, key.fieldName);
record.set(1, key.fieldValue);
record.set(2, sum);
context.write(null, record);
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
args = new GenericOptionsParser(conf, args).getRemainingArgs();
// To fix Hadoop "META-INFO" (http://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file)
conf.set("fs.hdfs.impl",
org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl",
org.apache.hadoop.fs.LocalFileSystem.class.getName());
// Get the input and output table names as arguments
String inputTableName = args[0];
String outputTableName = args[1];
// Assume the default database
String dbName = null;
Job job = new Job(conf, "FieldCounts");
HCatInputFormat.setInput(job,
InputJobInfo.create(dbName, inputTableName, null));
job.setJarByClass(FieldCounts.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
// An HCatalog record as input
job.setInputFormatClass(HCatInputFormat.class);
// Mapper emits TableFieldValueKey as key and an integer as value
job.setMapOutputKeyClass(TableFieldValueKey.class);
job.setMapOutputValueClass(IntWritable.class);
// Ignore the key for the reducer output; emitting an HCatalog record as
// value
job.setOutputKeyClass(WritableComparable.class);
job.setOutputValueClass(DefaultHCatRecord.class);
job.setOutputFormatClass(HCatOutputFormat.class);
HCatOutputFormat.setOutput(job,
OutputJobInfo.create(dbName, outputTableName, null));
HCatSchema s = HCatOutputFormat.getTableSchema(job);
System.err.println("INFO: output schema explicitly set for writing:"
+ s);
HCatOutputFormat.setSchema(job, s);
return (job.waitForCompletion(true) ? 0 : 1);
}
public static void main(String[] args) throws Exception {
String classpath = System.getProperty("java.class.path");
System.out.println("*** CLASSPATH: "+classpath);
int exitCode = ToolRunner.run(new FieldCounts(), args);
System.exit(exitCode);
}
}
As I found out, the problem was in the permissions of the directory where the MapReduce jar was located. The jar was built in the home directory of a regular user, not the hdfs user. Since this MapReduce job writes its results directly into a Hive table, it has to be run under the hdfs user; run under a regular user, it has no permission to write into the Hive table!
On the other hand, the home directory of a regular user in CentOS has 700 permissions, so when you run the hadoop jar ... command as a user other than the owner of that home directory, access to the jar is denied somewhere while Hadoop loads classes. That's why, under the hdfs user, the job fails with java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.bigdata.hadoop.FieldCounts$Map not found.
Recursively changing the permissions of the home directory where the jar was built from 700 to 755 solves this problem.
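For example, a one-line sketch (the path is a placeholder for the home directory where the jar was built):
chmod -R 755 /home/regularuser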
Yet a more important problem remains: how can I run the job under a regular user so that it has permission to write data into a Hive table?
I've found the following:
Set hadoop system user for client embedded in Java webapp
It allowed me to connect as the expected hadoop user, but the jar is still not uploaded or executed... the ClassNotFoundException remains.
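For reference, a minimal sketch of the approach from that linked question, using Hadoop's UserGroupInformation to run the submission as another user. The "hdfs" user name and the wrapper class are assumptions for illustration, not code from my app:
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.util.ToolRunner;

public class RunAsHdfs {
    public static void main(final String[] args) throws Exception {
        // Impersonate the hdfs user for the whole job submission.
        UserGroupInformation ugi = UserGroupInformation.createRemoteUser("hdfs");
        int exitCode = ugi.doAs(new PrivilegedExceptionAction<Integer>() {
            @Override
            public Integer run() throws Exception {
                return ToolRunner.run(new Configuration(), new FieldCounts(), args);
            }
        });
        System.exit(exitCode);
    }
}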