I'm trying to do a secondary sort in MapReduce with a composite key that consists of:
String natural key = program name
Long key for sorting = time in milliseconds since 1970
The problem is that after sorting, records are grouped by the entire composite key, so I get many reduce groups instead of one per program name.
By debugging I have verified that the hashCode and the compare functions are correct.
The debug logs below, where each block comes from a different reduce call, show that either the grouping or the partitioning didn't succeed.
From the debug logs:
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=the voice
14/12/14 00:55:12 INFO popularitweet.EtanReducer: the voice: Thu Dec 11 17:51:03 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: the voice: Thu Dec 11 17:51:03 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key the voice ended
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=top gear
14/12/14 00:55:12 INFO popularitweet.EtanReducer: top gear: Thu Dec 11 17:51:04 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key top gear ended
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=american horror story
14/12/14 00:55:12 INFO popularitweet.EtanReducer: american horror story: Thu Dec 11 17:51:04 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key american horror story ended
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=the voice
14/12/14 00:55:12 INFO popularitweet.EtanReducer: the voice: Thu Dec 11 17:51:04 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key the voice ended
As you can see, "the voice" is sent to two different reducers, but the timestamps are different.
Any help would be appreciated.
The composite key is the following class:
public class ProgramKey implements WritableComparable<ProgramKey> {
private String program;
private Long timestamp;
public ProgramKey() {
}
public ProgramKey(String program, Long timestamp) {
this.program = program;
this.timestamp = timestamp;
}
@Override
public int compareTo(ProgramKey o) {
int result = program.compareTo(o.program);
if (result == 0) {
result = timestamp.compareTo(o.timestamp);
}
return result;
}
@Override
public void write(DataOutput dataOutput) throws IOException {
WritableUtils.writeString(dataOutput, program);
dataOutput.writeLong(timestamp);
}
@Override
public void readFields(DataInput dataInput) throws IOException {
program = WritableUtils.readString(dataInput);
timestamp = dataInput.readLong();
}
// getters for program and timestamp (plus hashCode/equals) omitted here
}
My implementations of the Partitioner, GroupingComparator, and SortComparator are these:
public class ProgramKeyPartitioner extends Partitioner<ProgramKey, TweetObject> {
@Override
public int getPartition(ProgramKey programKey, TweetObject tweetObject, int numPartitions) {
int hash = programKey.getProgram().hashCode();
int partition = hash % numPartitions;
return partition;
}
}
public class ProgramKeyGroupingComparator extends WritableComparator {
protected ProgramKeyGroupingComparator() {
super(ProgramKey.class, true);
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
ProgramKey k1 = (ProgramKey) a;
ProgramKey k2 = (ProgramKey) b;
return k1.getProgram().compareTo(k2.getProgram());
}
}
public class TimeStampComparator extends WritableComparator {
protected TimeStampComparator() {
super(ProgramKey.class, true);
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
ProgramKey ts1 = (ProgramKey)a;
ProgramKey ts2 = (ProgramKey)a;
int result = ts1.getProgram().compareTo(ts2.getProgram());
if (result == 0) {
result = ts1.getTimestamp().compareTo(ts2.getTimestamp());
}
return result;
}
}
public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
// Create configuration
Configuration conf = new Configuration();
// Create job
Job job = new Job(conf, "test1");
job.setJarByClass(EtanMapReduce.class);
// Set partitioner keyComparator and groupComparator
job.setPartitionerClass(ProgramKeyPartitioner.class);
job.setGroupingComparatorClass(ProgramKeyGroupingComparator.class);
job.setSortComparatorClass(TimeStampComparator.class);
// Setup MapReduce
job.setMapperClass(EtanMapper.class);
job.setMapOutputKeyClass(ProgramKey.class);
job.setMapOutputValueClass(TweetObject.class);
job.setReducerClass(EtanReducer.class);
// Specify key / value
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(TweetObject.class);
// Input
FileInputFormat.addInputPath(job, inputPath);
job.setInputFormatClass(TextInputFormat.class);
// Output
FileOutputFormat.setOutputPath(job, outputDir);
job.setOutputFormatClass(TextOutputFormat.class);
// Delete output if exists
FileSystem hdfs = FileSystem.get(conf);
if (hdfs.exists(outputDir))
hdfs.delete(outputDir, true);
// Execute job
logger.info("starting job");
int code = job.waitForCompletion(true) ? 0 : 1;
System.exit(code);
}
Edit...
Your TimeStampComparator seems to have a typo: you're setting ts2 to a when it should be set to b:
ProgramKey ts1 = (ProgramKey)a;
ProgramKey ts2 = (ProgramKey)a;
when it should be:
ProgramKey ts1 = (ProgramKey)a;
ProgramKey ts2 = (ProgramKey)b;
This would result in incorrectly sorted key/value pairs and invalidates the assumption made by the grouping comparator that the key/value pairs are sorted.
Check also that the original program names are in UTF-8 as that's what WritableUtils assumes. Is your system's default code page also UTF-8?
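For reference, here is your sort comparator with that one-character fix applied (nothing else changed):
public class TimeStampComparator extends WritableComparator {
    protected TimeStampComparator() {
        super(ProgramKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        ProgramKey ts1 = (ProgramKey) a;
        ProgramKey ts2 = (ProgramKey) b; // was (ProgramKey) a
        int result = ts1.getProgram().compareTo(ts2.getProgram());
        if (result == 0) {
            result = ts1.getTimestamp().compareTo(ts2.getTimestamp());
        }
        return result;
    }
}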
I got an error (Agent configuration for 'a1' has no configfilters) when I use Flume 1.9 to transfer data from Kafka to HDFS, but no other error or info was reported.
The source I used is KafkaSource; the channel is a file channel and the sink is an HDFS sink.
The interceptor I used is self-defined, and I will show it below.
Agent configuration for 'a1' has no configfilters.
The logger info is below. Unlike other questions about this message, nothing more than the warning is reported.
16 Aug 2022 11:45:27,600 WARN [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.validateConfigFilterSet:623) - Agent configuration for 'a1' has no configfilters.
16 Aug 2022 11:45:27,623 INFO [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration.validateConfiguration:163) - Post-validation flume configuration contains configuration for agents: [a1]
16 Aug 2022 11:45:27,624 INFO [conf-file-poller-0] (org.apache.flume.node.AbstractConfigurationProvider.loadChannels:151) - Creating channels
16 Aug 2022 11:45:27,628 INFO [conf-file-poller-0] (org.apache.flume.channel.DefaultChannelFactory.create:42) - Creating instance of channel c1 type file
16 Aug 2022 11:45:27,642 INFO [conf-file-poller-0] (org.apache.flume.node.AbstractConfigurationProvider.loadChannels:205) - Created channel c1
16 Aug 2022 11:45:27,643 INFO [conf-file-poller-0] (org.apache.flume.source.DefaultSourceFactory.create:41) - Creating instance of source r1, type org.apache.flume.source.kafka.KafkaSource
16 Aug 2022 11:45:27,655 INFO [conf-file-poller-0] (org.apache.flume.sink.DefaultSinkFactory.create:42) - Creating instance of sink: k1, type: hdfs
16 Aug 2022 11:45:27,786 INFO [conf-file-poller-0] (org.apache.flume.node.AbstractConfigurationProvider.getConfiguration:120) - Channel c1 connected to [r1, k1]
16 Aug 2022 11:45:27,787 INFO [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:162) - Starting new configuration:{ sourceRunners:{r1=PollableSourceRunner: { source:org.apache.flume.source.kafka.KafkaSource{name:r1,state:IDLE} counterGroup:{ name:null counters:{} } }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor#2aa62fb9 counterGroup:{ name:null counters:{} } }} channels:{c1=FileChannel c1 { dataDirs: [/opt/module/flume/data/ranqi/behavior2] }} }
16 Aug 2022 11:45:27,788 INFO [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:169) - Starting Channel c1
16 Aug 2022 11:45:27,790 INFO [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:184) - Waiting for channel: c1 to start. Sleeping for 500 ms
16 Aug 2022 11:45:27,790 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.FileChannel.start:278) - Starting FileChannel c1 { dataDirs: [/opt/module/flume/data/ranqi/behavior2] }...
16 Aug 2022 11:45:27,833 INFO [lifecycleSupervisor-1-0] (org.apache.flume.instrumentation.MonitoredCounterGroup.register:119) - Monitored counter group for type: CHANNEL, name: c1: Successfully registered new MBean.
16 Aug 2022 11:45:27,833 INFO [lifecycleSupervisor-1-0] (org.apache.flume.instrumentation.MonitoredCounterGroup.start:95) - Component type: CHANNEL, name: c1 started
16 Aug 2022 11:45:27,839 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.<init>:356) - Encryption is not enabled
16 Aug 2022 11:45:27,840 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.replay:406) - Replay started
16 Aug 2022 11:45:27,845 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.replay:418) - Found NextFileID 3, from [/opt/module/flume/data/ranqi/behavior2/log-3, /opt/module/flume/data/ranqi/behavior2/log-2]
16 Aug 2022 11:45:27,851 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>:55) - Starting up with /opt/module/flume/checkpoint/ranqi/behavior2/checkpoint and /opt/module/flume/checkpoint/ranqi/behavior2/checkpoint.meta
16 Aug 2022 11:45:27,851 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>:59) - Reading checkpoint metadata from /opt/module/flume/checkpoint/ranqi/behavior2/checkpoint.meta
16 Aug 2022 11:45:27,906 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.FlumeEventQueue.<init>:115) - QueueSet population inserting 0 took 0
16 Aug 2022 11:45:27,908 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.replay:457) - Last Checkpoint Mon Aug 15 17:11:08 CST 2022, queue depth = 0
16 Aug 2022 11:45:27,918 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.doReplay:542) - Replaying logs with v2 replay logic
16 Aug 2022 11:45:27,919 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:249) - Starting replay of [/opt/module/flume/data/ranqi/behavior2/log-2, /opt/module/flume/data/ranqi/behavior2/log-3]
16 Aug 2022 11:45:27,920 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:262) - Replaying /opt/module/flume/data/ranqi/behavior2/log-2
16 Aug 2022 11:45:27,925 INFO [lifecycleSupervisor-1-0] (org.apache.flume.tools.DirectMemoryUtils.getDefaultDirectMemorySize:112) - Unable to get maxDirectMemory from VM: NoSuchMethodException: sun.misc.VM.maxDirectMemory(null)
16 Aug 2022 11:45:27,926 INFO [lifecycleSupervisor-1-0] (org.apache.flume.tools.DirectMemoryUtils.allocate:48) - Direct Memory Allocation: Allocation = 1048576, Allocated = 0, MaxDirectMemorySize = 1908932608, Remaining = 1908932608
16 Aug 2022 11:45:27,982 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.LogFile$SequentialReader.skipToLastCheckpointPosition:660) - Checkpoint for file(/opt/module/flume/data/ranqi/behavior2/log-2) is: 1660554206424, which is beyond the requested checkpoint time: 1660555388025 and position 0
16 Aug 2022 11:45:27,982 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:262) - Replaying /opt/module/flume/data/ranqi/behavior2/log-3
16 Aug 2022 11:45:27,983 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.LogFile$SequentialReader.skipToLastCheckpointPosition:658) - fast-forward to checkpoint position: 273662090
16 Aug 2022 11:45:27,983 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.LogFile$SequentialReader.next:683) - Encountered EOF at 273662090 in /opt/module/flume/data/ranqi/behavior2/log-3
16 Aug 2022 11:45:27,983 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:345) - read: 0, put: 0, take: 0, rollback: 0, commit: 0, skip: 0, eventCount:0
16 Aug 2022 11:45:27,984 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.FlumeEventQueue.replayComplete:417) - Search Count = 0, Search Time = 0, Copy Count = 0, Copy Time = 0
16 Aug 2022 11:45:27,988 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.replay:505) - Rolling /opt/module/flume/data/ranqi/behavior2
16 Aug 2022 11:45:27,988 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.roll:990) - Roll start /opt/module/flume/data/ranqi/behavior2
16 Aug 2022 11:45:27,989 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.LogFile$Writer.<init>:220) - Opened /opt/module/flume/data/ranqi/behavior2/log-4
16 Aug 2022 11:45:27,996 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.roll:1006) - Roll end
16 Aug 2022 11:45:27,996 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.EventQueueBackingStoreFile.beginCheckpoint:230) - Start checkpoint for /opt/module/flume/checkpoint/ranqi/behavior2/checkpoint, elements to sync = 0
16 Aug 2022 11:45:28,000 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint:255) - Updating checkpoint metadata: logWriteOrderID: 1660621527859, queueSize: 0, queueHead: 557327
16 Aug 2022 11:45:28,008 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.writeCheckpoint:1065) - Updated checkpoint for file: /opt/module/flume/data/ranqi/behavior2/log-4 position: 0 logWriteOrderID: 1660621527859
16 Aug 2022 11:45:28,008 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.FileChannel.start:289) - Queue Size after replay: 0 [channel=c1]
16 Aug 2022 11:45:28,290 INFO [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:196) - Starting Sink k1
16 Aug 2022 11:45:28,291 INFO [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:207) - Starting Source r1
16 Aug 2022 11:45:28,292 INFO [lifecycleSupervisor-1-1] (org.apache.flume.instrumentation.MonitoredCounterGroup.register:119) - Monitored counter group for type: SINK, name: k1: Successfully registered new MBean.
16 Aug 2022 11:45:28,292 INFO [lifecycleSupervisor-1-1] (org.apache.flume.instrumentation.MonitoredCounterGroup.start:95) - Component type: SINK, name: k1 started
16 Aug 2022 11:45:28,292 INFO [lifecycleSupervisor-1-4] (org.apache.flume.source.kafka.KafkaSource.doStart:524) - Starting org.apache.flume.source.kafka.KafkaSource{name:r1,state:IDLE}...
The Flume agent is started with the shell script below.
#!/bin/bash
case $1 in
"start")
echo " --------start flume-------"
ssh hadoop104 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf -f /opt/module/flume/job/ranqi/ranqi_kafka_to_hdfs_db.conf >/dev/null 2>&1 &"
;;
"stop")
echo " --------stop flume-------"
ssh hadoop104 "ps -ef | grep ranqi_kafka_to_hdfs_db.conf | grep -v grep |awk '{print \$2}' | xargs -n1 kill"
;;
esac
The Flume config is below.
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092
a1.sources.r1.kafka.topics = copy_1015
a1.sources.r1.kafka.consumer.group.id = flume
a1.sources.r1.setTopicHeader = true
a1.sources.r1.topicHeader = topic
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.flume.interceptor.ranqi.ranqiTimestampInterceptor$Builder
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/ranqi/behavior2
a1.channels.c1.dataDirs = /opt/module/flume/data/ranqi/behavior2/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1123456
a1.channels.c1.keep-alive = 6
## sink1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/ranqi/db/%{topic}_inc/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = db
a1.sinks.k1.hdfs.round = false
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
## assemble (bind source and sink to the channel)
a1.sources.r1.channels = c1
a1.sinks.k1.channel= c1
The ranqiTimestampInterceptor class I defined is below; its jar is placed in flume/lib.
package com.atguigu.flume.interceptor.ranqi;
import com.alibaba.fastjson.JSONObject;
import com.atguigu.flume.interceptor.db.TimestampInterceptor;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.nio.charset.StandardCharsets;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.Map;
public class ranqiTimestampInterceptor implements Interceptor {
public static String dateToStamp(String s) throws ParseException {
String res;
SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date date = simpleDateFormat.parse(s);
long ts = date.getTime();
res = String.valueOf(ts);
return res;
}
@Override
public void initialize() {
}
private final static Logger logger = LoggerFactory.getLogger(ranqiTimestampInterceptor.class);
@Override
public Event intercept(Event event) {
byte[] body = event.getBody();
Long createDate ;
String time = new String();
String log = new String(body, StandardCharsets.UTF_8);
JSONObject jsonObject = JSONObject.parseObject(log);
// logger.info(log);
logger.info(String.valueOf(jsonObject));
JSONObject data = jsonObject.getObject("data", JSONObject.class);
if(data.containsKey("createDate") && data.getLong("createDate") != null){
createDate = data.getLong("createDate");
try {
createDate = Long.valueOf(dateToStamp(String.valueOf(createDate)));
time = String.valueOf(createDate);
} catch (ParseException e) {
e.printStackTrace();
}
finally {
Long ts = jsonObject.getLong("ts");
time = String.valueOf(ts);
}
}else{
Long ts = jsonObject.getLong("ts");
time = String.valueOf(ts);
}
System.out.println(time);
logger.info(time);
Map<String, String> headers = event.getHeaders();
headers.put("timestamp",time);
return event;
}
@Override
public List<Event> intercept(List<Event> list) {
for (Event event : list) {
intercept(event);
}
return list;
}
@Override
public void close() {
}
public static class Builder implements Interceptor.Builder{
@Override
public Interceptor build() {
return new TimestampInterceptor();
}
@Override
public void configure(Context context) {
}
}
}
I have a PCollection of KV where the key is a GCS file_pattern and the value is some additional info about the files (e.g., the "Source" system that generated the files). E.g.,
KV("gs://bucket1/dir1/*", "SourceX"),
KV("gs://bucket1/dir2/*", "SourceY")
I need a PTransform to expand the file_patterns to all matching files in the GCS folders, and keep the "Source" field. E.g., if there are two files X1.dat, X2.dat under dir1 and one file (Y1.dat) under dir2, the output will be:
KV("gs://bucket1/dir1/X1.dat", "SourceX"),
KV("gs://bucket1/dir1/X2.dat", "SourceX")
KV("gs://bucket1/dir2/Y1.dat", "SourceY")
Could I use FileIO.matchAll() to achieve this? I am stuck on how to combine/join the "Source" field to the matching files. This is something I was trying, not quite there yet:
public PCollection<KV<String, String> expand(PCollection<KV<String, String>> filesAndSources) {
return filesAndSources
.apply("Get file names", Keys.create())
.apply(FileIO.matchAll())
.apply(FileIO.readMatches())
.apply(ParDo.of(
new DoFn<ReadableFile, KV<String, String>>() {
@ProcessElement
public void processElement(ProcessContext c) {
ReadableFile file = c.element();
String fileName = file.getMetadata().resourceId().toString();
c.output(KV.of(fileName, XXXXX)); // How to get the value field ("Source") from the input KV?
My difficulty is the last line: for XXXXX, how do I get the value field ("Source") from the input KV? Is there any way to "join" or "combine" the input KV's value back to the 'expanded' keys, since one key (file_pattern) is expanded to multiple values?
Thank you!
MatchResult.Metadata contains the resourceId you are already using, but not the GCS path (with wildcards) it matched.
You can achieve what you want using side inputs. To demonstrate this I created the following filesAndSources (as per your comment this could be an input parameter so it can't be hard-coded downstream):
PCollection<KV<String, String>> filesAndSources = p.apply("Create file pattern and source pairs",
Create.of(KV.of("gs://" + Bucket + "/sales/*", "Sales"),
KV.of("gs://" + Bucket + "/events/*", "Events")));
I materialize this into a side input (in this case as Map). The key will be the glob pattern converted into a regex one (thanks to this answer) and the value will be the source string:
final PCollectionView<Map<String, String>> regexAndSources =
filesAndSources.apply("Glob pattern to RegEx", ParDo.of(new DoFn<KV<String, String>, KV<String, String>>() {
@ProcessElement
public void processElement(ProcessContext c) {
String regex = c.element().getKey();
StringBuilder out = new StringBuilder("^");
for(int i = 0; i < regex.length(); ++i) {
final char ch = regex.charAt(i);
switch(ch) {
case '*': out.append(".*"); break;
case '?': out.append('.'); break;
case '.': out.append("\\."); break;
case '\\': out.append("\\\\"); break;
default: out.append(ch);
}
}
out.append('$');
c.output(KV.of(out.toString(), c.element().getValue()));
}})).apply("Save as Map", View.asMap());
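If it helps to see that conversion in isolation, the same switch logic can be factored into a small helper (globToRegex is just an illustrative name here, not part of the Beam API):
// Minimal sketch: converts a simple glob (* and ?) into an anchored regex,
// using the same character-by-character logic as the DoFn above.
static String globToRegex(String glob) {
    StringBuilder out = new StringBuilder("^");
    for (int i = 0; i < glob.length(); ++i) {
        final char ch = glob.charAt(i);
        switch (ch) {
            case '*':  out.append(".*");   break;
            case '?':  out.append('.');    break;
            case '.':  out.append("\\.");  break;
            case '\\': out.append("\\\\"); break;
            default:   out.append(ch);
        }
    }
    return out.append('$').toString();
}
// e.g. globToRegex("gs://bucket1/dir1/*") returns "^gs://bucket1/dir1/.*$",
// so "gs://bucket1/dir1/X1.dat".matches(globToRegex("gs://bucket1/dir1/*")) is true.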
Then, after reading the filenames we can use the side input to parse each path to see which is the matching pattern/source pair:
filesAndSources
.apply("Get file names", Keys.create())
.apply(FileIO.matchAll())
.apply(FileIO.readMatches())
.apply(ParDo.of(new DoFn<ReadableFile, KV<String, String>>() {
@ProcessElement
public void processElement(ProcessContext c) {
ReadableFile file = c.element();
String fileName = file.getMetadata().resourceId().toString();
Set<Map.Entry<String,String>> patternSet = c.sideInput(regexAndSources).entrySet();
for (Map.Entry< String,String> pattern:patternSet)
{
if (fileName.matches(pattern.getKey())) {
String source = pattern.getValue();
c.output(KV.of(fileName, source));
}
}
}}).withSideInputs(regexAndSources))
Note that the regex conversion is done before materializing the side input, instead of here, to avoid duplicate work.
The output, as expected in my case:
Feb 24, 2019 10:44:05 PM org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
INFO: Matched 2 files for pattern gs://REDACTED/events/*
Feb 24, 2019 10:44:05 PM org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
INFO: Matched 2 files for pattern gs://REDACTED/sales/*
Feb 24, 2019 10:44:05 PM com.dataflow.samples.RegexFileIO$3 processElement
INFO: key=gs://REDACTED/sales/sales1.csv, value=Sales
Feb 24, 2019 10:44:05 PM com.dataflow.samples.RegexFileIO$3 processElement
INFO: key=gs://REDACTED/sales/sales2.csv, value=Sales
Feb 24, 2019 10:44:05 PM com.dataflow.samples.RegexFileIO$3 processElement
INFO: key=gs://REDACTED/events/events1.csv, value=Events
Feb 24, 2019 10:44:05 PM com.dataflow.samples.RegexFileIO$3 processElement
INFO: key=gs://REDACTED/events/events2.csv, value=Events
Full code.
Hi, I am running an application which reads records from HBase and writes them into text files.
I have used a combiner in my application, and a custom partitioner also.
I have used 41 reducers in my application because I need to create 40 reducer output files that satisfy my condition in the custom partitioner.
Everything works fine, but when I use the combiner in my application it creates a map output file per region, i.e., per mapper.
For example, I have 40 regions in my application, so 40 mappers get initiated and they create 40 map-output files.
But the reducer is not able to combine all the map outputs and generate the final reducer output files (which should be 40 reducer output files).
The data in the files is correct, but the number of files has increased.
Any idea how I can get only the reducer output files?
// Reducer Class
job.setCombinerClass(CommonReducer.class);
job.setReducerClass(CommonReducer.class); // reducer class
Below are my job details:
Submitted: Mon Apr 10 09:42:55 CDT 2017
Started: Mon Apr 10 09:43:03 CDT 2017
Finished: Mon Apr 10 10:11:20 CDT 2017
Elapsed: 28mins, 17sec
Diagnostics:
Average Map Time 6mins, 13sec
Average Shuffle Time 17mins, 56sec
Average Merge Time 0sec
Average Reduce Time 0sec
Here is my reducer logic
import java.io.IOException;
import org.apache.log4j.Logger;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
public class CommonCombiner extends Reducer<NullWritable, Text, NullWritable, Text> {
private Logger logger = Logger.getLogger(CommonCombiner.class);
private MultipleOutputs<NullWritable, Text> multipleOutputs;
String strName = "";
private static final String DATA_SEPERATOR = "\\|\\!\\|";
public void setup(Context context) {
logger.info("Inside Combiner.");
multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
}
@Override
public void reduce(NullWritable Key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
for (Text value : values) {
final String valueStr = value.toString();
StringBuilder sb = new StringBuilder();
if ("".equals(strName) && strName.length() == 0) {
String[] strArrFileName = valueStr.split(DATA_SEPERATOR);
String strFullFileName[] = strArrFileName[1].split("\\|\\^\\|");
strName = strFullFileName[strFullFileName.length - 1];
String strArrvalueStr[] = valueStr.split(DATA_SEPERATOR);
if (!strArrvalueStr[0].contains(HbaseBulkLoadMapperConstants.FF_ACTION)) {
sb.append(strArrvalueStr[0] + "|!|");
}
multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName);
context.getCounter(Counters.FILE_DATA_COUNTER).increment(1);
}
}
}
public void cleanup(Context context) throws IOException, InterruptedException {
multipleOutputs.close();
}
}
I have replaced multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName);
with
context.write()
and I got the correct output.
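In other words, the combiner writes back through the normal task context rather than MultipleOutputs. Roughly, the change amounts to the following (a sketch of just the swapped call; the rest of the reduce body stays as above):
// inside reduce(), replacing the MultipleOutputs call:
// multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName);
context.write(NullWritable.get(), new Text(sb.toString()));
context.getCounter(Counters.FILE_DATA_COUNTER).increment(1);
// with MultipleOutputs removed, the multipleOutputs.close() in cleanup() is no longer needed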
I have a set of text files on my FTP server.
I want to read all the files that were uploaded today,
and among those I have to print the last three uploaded files' properties
(name, upload time, size).
Right now I am able to print the names and properties of the files present on the FTP server, but they are not in any order and the output looks like junk.
Now I want to print the name, size, path, and upload time of the last three uploaded files.
Can anyone help me achieve this?
Here is my snippet:
package com;
import java.io.File;
import edu.vt.middleware.crypt.io.TeePrintStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.net.ConnectException;
//import java.sql.Date;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Calendar;
import java.util.Date;
import javax.mail.MessagingException;
import javax.mail.internet.AddressException;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.comparator.LastModifiedFileComparator;
import org.apache.commons.net.ftp.FTPFile;
import com.enterprisedt.net.ftp.FTPConnectMode;
import com.enterprisedt.net.ftp.FTPException;
import com.enterprisedt.net.ftp.FTPTransferType;
import com.enterprisedt.net.ftp.Protocol;
import com.enterprisedt.net.ftp.SecureFileTransferClient;
import java.io.File;
import java.io.FilenameFilter;
import edu.vt.middleware.crypt.io.TeePrintStream;
public class getFilesFTP
{
public static File dir=new File("D:/log_FTPCHECK");
public static String logname="output.txt";
public static File logfile=new File(dir,logname);
public static StringBuffer sb=new StringBuffer();
public static byte[] filesize;
public static void main(String args[]) throws Exception
{
sb.append("*************************************
************************************************
*********************************************************************");
sb.append("<p align=center><B><U>SCOPUS FILE UPLOAD CHECK AUTO
GENERATED LOG REPORT</U></B></p>");
sb.append("***********************************************
****************************************************************
*******************************************");
String host="example.com";
String username="john";
String password="doe";
int count=0;
File Filename;;
Date FileDate;
String invalidfilename=".";
String filetype="";
//DateFormat dateFormat = new SimpleDateFormat("dd/mm/year");
//String date1="";
String Lastmodifieddata="";
String Lastmodifieddata_time="";
Date todayDate;
try{
SimpleDateFormat dateformat=new SimpleDateFormat("yyyy/MM/dd");
SimpleDateFormat Format_time=new SimpleDateFormat("yyyy/MM/dd HH:mm:ss");
String timeStamp = dateformat.format(Calendar.getInstance().getTime());
// System.out.println("I am executed2");
// System.out.println("Todays Date :"+timeStamp );
sb.append(System.lineSeparator());
//sb.append("Todays Date :"+timeStamp );
sb.append(System.lineSeparator());
// System.out.println("I am executed2");
SecureFileTransferClient client=new SecureFileTransferClient();
client.getAdvancedFTPSettings().setConnectMode(FTPConnectMode.PASV);
client.setRemoteHost(host);
client.setUserName(username);
client.setPassword(password);
client.setProtocol(Protocol.SFTP);
client.setRemotePort(22);
client.setContentType(FTPTransferType.BINARY);
// sb.append(System.lineSeparator());
System.out.println("connecting to sftp...");
// sb.append(System.lineSeparator());
sb.append("connecting to sftp...");
sb.append(System.lineSeparator());
client.connect();
System.out.println("SFTP Connection established
successfully.");
sb.append(System.lineSeparator());
sb.append("SFTP Connection established successfully.");
sb.append(System.lineSeparator());
String path1="/sftp/content-providers/tho-e/data/incoming/scopusbk";
com.enterprisedt.net.ftp.FTPFile[] directroy = client.directoryList(path1);
System.out.println("Total Number of Files Found :"+directroy.length);
// Arrays.sort(directroy);
// FTpFileComparator[] comp = new FTpFileComparator[files.length];
sb.append(System.lineSeparator());
//sb.append("Total Number of Files Found :"+directroy.length);
sb.append(System.lineSeparator());
int x=0;
for (int i = 0; i < directroy.length; i++)
{
//System.out.println("entered in for loop");
Filename= new File(directroy[i].getName());
FileDate=(Date) directroy[i].lastModified();
//Filetype=getFileExtension(Filename);
// System.out.println("Name:"+Filename);
Lastmodifieddata=dateformat.format(directroy[i].lastModified());
Lastmodifieddata_time=Format_time.format(directroy[i].lastModified());
//filesize=directroy[i].getName().getBytes();
long size = directroy[i].size();
if(timeStamp.equalsIgnoreCase(Lastmodifieddata))
{
if ((directroy[i]).getName().endsWith("txt"))
{
System.out.println("File Name : "+Filename + " ||
Upload Time : "+Lastmodifieddata_time+" || Size : "+size+" kb");
sb.append(System.lineSeparator());
sb.append("File Name : "+Filename + " || Upload Time :
"+Lastmodifieddata_time+" || Size : "+size+" kb");
sb.append(System.lineSeparator());
//System.out.println();
// String path1="/sftp/suppliers/thomdi/signals
/ContentCAR";
// com.enterprisedt.net.ftp.FTPFile[] directroy =
client.directoryList(path1);
count++;
//}
}
}
else
{
//System.out.println("No todays files");
}
}
if(count>0)
{
System.out.println("Total Number of files :"+count);
sb.append(System.lineSeparator());
sb.append("Total Number of file :"+count);
sb.append(System.lineSeparator());
}
else
{
System.out.println("No Files uploaded today....!!!");
sb.append("No Files uploaded today....!!!");
}
// PrintStream out = new PrintStream(new FileOutputStream("D:/output.txt"));
// System.setOut(out);
if(!logfile.exists())
{
logfile.createNewFile();
}
FileUtils.writeStringToFile(logfile,sb.toString());
FTPMailer.sendmailFTP();
count=0;
}
catch(SecurityException se)
{
System.out.println("Security credentials mismatch
issue...Unable to Login ");
sb.append(System.lineSeparator());
sb.append("Security credentials mismatch issue...Unable to
Login ");
sb.append(System.lineSeparator());
se.printStackTrace();
}
catch(ConnectException ce)
{
System.out.println("Unable to Reach FTP Server..");
System.out.println("Check the Internet Connectivity");
sb.append(System.lineSeparator());
sb.append("Unable to Reach FTP Server..");
sb.append(System.lineSeparator());
sb.append("Check the Internet Connectivity");
sb.append(System.lineSeparator());
}
}
}
Kindly help; I have googled a lot but am not able to figure out FTP file sorting.
Any help will be greatly appreciated.
One solution could be to sort FTPFile[] in descending order (assuming that the last modified time is the uploaded time).
Arrays.sort(ftpfiles, new Comparator<FTPFile>() {
@Override
public int compare(FTPFile o1, FTPFile o2) {
return o2.lastModified().compareTo(o1.lastModified());
}
});
To print the three most recently uploaded files (after sorting the array):
for (int i = 0; i < 3; i++) {
// amend the output for your needs
System.out.println(ftpfiles[i]);
}
Code is not tested. Written based on the javadoc of FTPFile.
Edit: small snippet tested with the free library version.
import com.enterprisedt.net.ftp.FTPFile;
...
public static void main(String[] args) throws Exception {
// create an array of dummy files
Calendar cal = GregorianCalendar.getInstance();
FTPFile[] ftpfiles = new FTPFile[5];
cal.set(Calendar.SECOND, 1);
ftpfiles[0] = new FTPFile("raw", "file1", 111, false, cal.getTime());
cal.set(Calendar.SECOND, 5);
ftpfiles[1] = new FTPFile("raw", "file5", 555, false, cal.getTime());
cal.set(Calendar.SECOND, 3);
ftpfiles[2] = new FTPFile("raw", "file3", 333, false, cal.getTime());
cal.set(Calendar.SECOND, 4);
ftpfiles[3] = new FTPFile("raw", "file4", 444, false, cal.getTime());
cal.set(Calendar.SECOND, 2);
ftpfiles[4] = new FTPFile("raw", "file2", 222, false, cal.getTime());
System.out.println("unsorted file list");
for (FTPFile ftpfile : ftpfiles) {
printFileInfo(ftpfile);
}
// sort array by last modification time in descending order
Arrays.sort(ftpfiles, new Comparator<FTPFile>() {
@Override
public int compare(FTPFile o1, FTPFile o2) {
return o2.lastModified().compareTo(o1.lastModified());
}
});
System.out.println("sorted file list");
for (FTPFile ftpfile : ftpfiles) {
printFileInfo(ftpfile);
}
System.out.println("the three recent files only");
for (int i = 0; i < 3; i++) {
printFileInfo(ftpfiles[i]);
}
}
static void printFileInfo(FTPFile ftpfile) {
System.out.printf("name: %s mtime: %s size: %d%n",
ftpfile.getName(),
ftpfile.lastModified(),
ftpfile.size()
);
}
output
unsorted file list
name: file1 mtime: Fri Feb 12 12:23:01 CET 2016 size: 111
name: file5 mtime: Fri Feb 12 12:23:05 CET 2016 size: 555
name: file3 mtime: Fri Feb 12 12:23:03 CET 2016 size: 333
name: file4 mtime: Fri Feb 12 12:23:04 CET 2016 size: 444
name: file2 mtime: Fri Feb 12 12:23:02 CET 2016 size: 222
sorted file list
name: file5 mtime: Fri Feb 12 12:23:05 CET 2016 size: 555
name: file4 mtime: Fri Feb 12 12:23:04 CET 2016 size: 444
name: file3 mtime: Fri Feb 12 12:23:03 CET 2016 size: 333
name: file2 mtime: Fri Feb 12 12:23:02 CET 2016 size: 222
name: file1 mtime: Fri Feb 12 12:23:01 CET 2016 size: 111
the three recent files only
name: file5 mtime: Fri Feb 12 12:23:05 CET 2016 size: 555
name: file4 mtime: Fri Feb 12 12:23:04 CET 2016 size: 444
name: file3 mtime: Fri Feb 12 12:23:03 CET 2016 size: 333
I initialize the logger like this:
public static void init() {
ConsoleHandler handler = new ConsoleHandler();
handler.setFormatter(new LogFormatter());
Logger.getLogger(TrackerConfig.LOGGER_NAME).setUseParentHandlers(false);
Logger.getLogger(TrackerConfig.LOGGER_NAME).addHandler(handler);
}
The LogFormatter's format function:
@Override
public String format(LogRecord record) {
StringBuilder sb = new StringBuilder();
sb.append(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss Z").format(new Date(record.getMillis())))
.append(" ")
.append(record.getLevel().getLocalizedName()).append(": ")
.append(formatMessage(record)).append(LINE_SEPARATOR);
return sb.toString();
}
To use the Log I use the following method:
private static void log(Level level, String message) {
Logger.getLogger(TrackerConfig.LOGGER_NAME).log(level, message);
if (level.intValue() >= TrackerConfig.DB_LOGGER_LEVEL.intValue()) {
DBLog.getInstance().log(level, message);
}
}
The DBLog.log method:
public void log(Level level, String message) {
try {
this.logBatch.setTimestamp(1, new Timestamp(Calendar.getInstance().getTime().getTime()));
this.logBatch.setString(2, level.getName());
this.logBatch.setString(3, message);
this.logBatch.addBatch();
} catch (SQLException ex) {
Log.logError("SQL error: " + ex.getMessage()); // if this happens the code will exit anyways so it will not cause a loop
}
}
Now a normal Log output looks like that:
2013-04-20 18:00:59 +0200 INFO: Starting up Tracker
It works for some time but the LogFormatter seems to be reset for whatever reason.
Sometimes only one Log entry is displayed correctly and after that the Log entries are displayed like:
Apr 20, 2013 6:01:01 PM package.util.Log log INFO:
Loaded 33266 database entries.
again.
What I tried:
For debugging purposes I added a thread that outputs the memory usage of the JVM every x seconds.
The output used the right log format until the reserved memory value changed (a change in the free memory value did not reset the log format), like this:
2013-04-20 18:16:24 +0200 WARNING: Memory usage: 23 / 74 / 227 MiB
2013-04-20 18:16:25 +0200 WARNING: Memory usage: 20 / 74 / 227 MiB
2013-04-20 18:16:26 +0200 WARNING: Memory usage: 18 / 74 / 227 MiB
Apr 20, 2013 6:16:27 PM package.util.Log log WARNING:
Memory usage: 69 / 96 / 227 MiB
Apr 20, 2013 6:16:27 PM package.util.Log log INFO:
Scheduler running
Apr 20, 2013 6:16:27 PM package.Log log WARNING:
Memory usage: 67 / 96 / 227 MiB
Also note that the log level seems to be reset from warning to info here.
Where the problem seems to be:
When I comment out the database log function like this:
private static void log(Level level, String message) {
Logger.getLogger(TrackerConfig.LOGGER_NAME).log(level, message);
if (level.intValue() >= TrackerConfig.DB_LOGGER_LEVEL.intValue()) {
// DBLog.getInstance().log(level, message);
}
}
the log is formatted properly.
Any ideas what could be wrong with the DBLog's log function or why the log suddenly resets?
I would not really call this a solution but it works now.
The cause seemed to be the memory calculation itself.
Even if I just calculated it without logging it, the log format was reset.
I have no idea why it worked when I just commented out the DBLog usage.
int mb = 1024 * 1024;
long freeMemory = Runtime.getRuntime().freeMemory() / mb;
long reservedMemory = Runtime.getRuntime().totalMemory() / mb;
long maxMemory = Runtime.getRuntime().maxMemory() / mb;
String memoryUsage = "Memory usage: " + freeMemory + " / " + reservedMemory + " / " + maxMemory + " MiB";
This is the code I used. As soon as I commented it out the log format did not reset anymore and now everything works as expected.