Reading Data From FTP Server in Hadoop/Cascading - java

I want to read data from an FTP server. I provide the path of the file on the FTP server in the format ftp://Username:Password@host/path.
When I read the file with a plain MapReduce program, it works fine. I want to read the same file through the Cascading framework, using its Hfs tap. That throws the following exception: Stream closed
at org.apache.hadoop.fs.ftp.FTPInputStream.close(
at Source)
at org.apache.hadoop.util.LineReader.close(
at org.apache.hadoop.mapred.LineRecordReader.close(
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(
at org.apache.hadoop.mapred.MapTask.runOldMapper(
at org.apache.hadoop.mapred.LocalJobRunner$
Below is the Cascading code that reads the file:
public class FTPWithHadoopDemo {
    public static void main(String[] args) {
        Tap source = new Hfs(new TextLine(new Fields("line")), "ftp://user:pwd@xx.xx.xx.xx//input1");
        Tap sink = new Hfs(new TextLine(new Fields("line1")), "OP\\op", SinkMode.REPLACE);
        Pipe pipe = new Pipe("First");
        pipe = new Each(pipe, new RegexSplitGenerator("\\s+"));
        pipe = new GroupBy(pipe);
        Pipe tailpipe = new Every(pipe, new Count());
        FlowDef flowDef = FlowDef.flowDef().addSource(pipe, source).addTailSink(tailpipe, sink);
        new HadoopFlowConnector().connect(flowDef).complete();
    }
}
I looked through the Hadoop source code for the same exception. The MapTask class has a method, runOldMapper, that works with the input stream, and its finally block closes that stream (in.close()). When I remove that line from the finally block, the job works fine. Below is the code:
private <INKEY, INVALUE, OUTKEY, OUTVALUE> void runOldMapper(final JobConf job, final TaskSplitIndex splitIndex,
        final TaskUmbilicalProtocol umbilical, TaskReporter reporter)
        throws IOException, InterruptedException, ClassNotFoundException {
    InputSplit inputSplit = getSplitDetails(new Path(splitIndex.getSplitLocation()), splitIndex.getStartOffset());
    updateJobWithSplit(job, inputSplit);
    RecordReader<INKEY, INVALUE> in = isSkipping()
            ? new SkippingRecordReader<INKEY, INVALUE>(inputSplit, umbilical, reporter)
            : new TrackedRecordReader<INKEY, INVALUE>(inputSplit, job, reporter);
    job.setBoolean("mapred.skip.on", isSkipping());
    int numReduceTasks = conf.getNumReduceTasks();
    LOG.info("numReduceTasks: " + numReduceTasks);
    MapOutputCollector collector = null;
    if (numReduceTasks > 0) {
        collector = new MapOutputBuffer(umbilical, job, reporter);
    } else {
        collector = new DirectMapOutputCollector(umbilical, job, reporter);
    }
    MapRunnable<INKEY, INVALUE, OUTKEY, OUTVALUE> runner = ReflectionUtils.newInstance(job.getMapRunnerClass(), job);
    try {
        runner.run(in, new OldOutputCollector(collector, conf), reporter);
    } finally {
        // close
        in.close(); // close input
    }
}
Please help me solve this problem.

After some effort, I found out that Hadoop uses the class org.apache.hadoop.fs.ftp.FTPFileSystem for FTP.
This class doesn't support seek, i.e. seeking to a given offset from the start of the file. Data is read one block at a time, and the file system then seeks to the next block to continue reading. The default block size of FTPFileSystem is 4 KB. Because seek is not supported, it can only read data that is at most 4 KB in size.
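A possible workaround, which the answer above does not state and which is untested against Cascading, is to copy the remote file in one sequential pass to local disk (or HDFS) and point the tap at the copy, so nothing ever needs to seek on the FTP stream. A minimal plain-Java sketch of the copy step follows; StreamStaging and stageToTempFile are hypothetical names, and the InputStream stands in for whatever stream the FTP file system hands back:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: stages a forward-only stream into a local temp file.
final class StreamStaging {
    /** Reads the stream once, front to back, into a temp file; no seek is required. */
    static Path stageToTempFile(InputStream in) throws IOException {
        Path target = Files.createTempFile("ftp-staged-", ".tmp");
        try (InputStream source = in) {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

The returned path can then be read like any local file, regardless of its size.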


Batching multiple files to Amazon S3 using the Java SDK

I'm trying to upload multiple files to Amazon S3 all under the same key, by appending the files. I have a list of file names and want to upload/append the files in that order. I am pretty much exactly following this tutorial but I am looping through each file first and uploading that in part. Because the files are on hdfs (the Path is actually org.apache.hadoop.fs.Path), I am using the input stream to send the file data. Some pseudocode is below (I am commenting the blocks that are word for word from the tutorial):
// Create a list of UploadPartResponse objects. You get one of these for
// each part upload.
List<PartETag> partETags = new ArrayList<PartETag>();
// Step 1: Initialize. (s3Client is the AmazonS3 client; the variable name is assumed.)
InitiateMultipartUploadRequest initRequest = new InitiateMultipartUploadRequest(
        bk.getBucket(), bk.getKey());
InitiateMultipartUploadResult initResponse = s3Client.initiateMultipartUpload(initRequest);
try {
    int i = 1; // part number
    for (String file : files) {
        Path filePath = new Path(file);
        // Get the input stream and content length
        long contentLength = fss.get(branch).getFileStatus(filePath).getLen();
        InputStream is = fss.get(branch).open(filePath);
        long filePosition = 0;
        while (filePosition < contentLength) {
            // create request
            // upload part and add response to our list
        }
    }
    // Step 3: Complete.
    CompleteMultipartUploadRequest compRequest = new CompleteMultipartUploadRequest(
            bk.getBucket(), bk.getKey(), initResponse.getUploadId(), partETags);
    s3Client.completeMultipartUpload(compRequest);
} catch (Exception e) {
    s3Client.abortMultipartUpload(new AbortMultipartUploadRequest(
            bk.getBucket(), bk.getKey(), initResponse.getUploadId()));
}
However, I am getting the following error: The XML you provided was not well-formed or did not validate against our published schema (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: 2C1126E838F65BB9), S3 Extended Request ID: QmpybmrqepaNtTVxWRM1g2w/fYW+8DPrDwUEK1XeorNKtnUKbnJeVM6qmeNcrPwc
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(
at com.amazonaws.http.AmazonHttpClient.executeHelper(
at com.amazonaws.http.AmazonHttpClient.execute(
If anyone knows what might cause this error, I would greatly appreciate the help. Alternatively, if there is a better way to concatenate a bunch of files into one S3 key, that would be great as well. I tried using Java's built-in SequenceInputStream, but that did not work. For reference, the total size of all the files could be as large as 10-15 GB.
I know it's probably a bit late, but it's worth giving my contribution.
I've managed to solve a similar problem using SequenceInputStream.
The trick is to calculate the total size of the resulting file and then feed the SequenceInputStream with an Enumeration<InputStream>.
Here's some example code that might help:
public void combineFiles() {
    List<String> files = getFiles();
    // Assumes the names are local file paths, so File.length() gives each size.
    long totalFileSize = files.stream()
            .map(f -> new File(f).length())
            .reduce(0L, (f, s) -> f + s);
    try (InputStream partialFile = new SequenceInputStream(getInputStreamEnumeration(files))) {
        ObjectMetadata resultFileMetadata = new ObjectMetadata();
        // Assumed step: tell S3 the combined length up front.
        resultFileMetadata.setContentLength(totalFileSize);
        s3Client.putObject("bucketName", "resultFilePath", partialFile, resultFileMetadata);
    } catch (IOException e) {
        LOG.error("An error occurred while combining files. {}", e);
    }
}

private Enumeration<? extends InputStream> getInputStreamEnumeration(List<String> files) {
    return new Enumeration<InputStream>() {
        private Iterator<String> fileNamesIterator = files.iterator();

        public boolean hasMoreElements() {
            return fileNamesIterator.hasNext();
        }

        public InputStream nextElement() {
            try {
                return new FileInputStream(Paths.get(fileNamesIterator.next()).toFile());
            } catch (FileNotFoundException e) {
                throw new RuntimeException(e);
            }
        }
    };
}
Hope this helps!
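Stripped of the S3 client, the two ingredients of this answer (the total size and a single stream over all files) can be sketched with the standard library alone. FileConcatenation, totalSize, and openConcatenated are hypothetical names, and the sketch assumes the files are on the local file system rather than on HDFS:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper illustrating the SequenceInputStream approach.
final class FileConcatenation {
    /** Sum of the sizes of all files, in bytes. */
    static long totalSize(List<Path> files) throws IOException {
        long total = 0;
        for (Path file : files) {
            total += Files.size(file);
        }
        return total;
    }

    /** One stream over all files in list order; files are opened one at a time. */
    static InputStream openConcatenated(List<Path> files) {
        Iterator<Path> it = files.iterator();
        return new SequenceInputStream(new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return it.hasNext();
            }

            @Override
            public InputStream nextElement() {
                try {
                    return Files.newInputStream(it.next());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        });
    }
}
```

The value returned by totalSize is what would be passed as the content length of the upload, and the stream from openConcatenated is what would be handed to the client as the object body.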

Reading and writing files using Java 7 nio

I have several files, each containing a JSON array of elements.
I also have a process that takes each JSON element as one line of a file and processes it.
So I created a small program that reads each JSON array and writes its elements to another file, one per line.
The output of this utility will be the input of the other process.
I used Java 7 NIO (and Gson), and tried to use as much of NIO as possible.
Is there anything I can improve?
What about the filter? Which approach is better?
public class TransformJsonsUsers {
    public TransformJsonsUsers() {
    }

    public static void main(String[] args) throws IOException {
        final Gson gson = new Gson();
        Path path = Paths.get("C:\\work\\data\\resources\\files");
        final Path outputDirectory = Paths.get("..."); // output directory; the actual path is not given
        DirectoryStream.Filter<Path> filter = new DirectoryStream.Filter<Path>() {
            public boolean accept(Path entry) throws IOException {
                // which is better?
                // BasicFileAttributeView attView = Files.getFileAttributeView(entry, BasicFileAttributeView.class);
                // return attView.readAttributes().isRegularFile();
                return !Files.isDirectory(entry);
            }
        };
        DirectoryStream<Path> directoryStream = Files.newDirectoryStream(path, filter);
        directoryStream.forEach(new Consumer<Path>() {
            public void accept(Path filePath) {
                String fileOutput = outputDirectory.toString() + File.separator + filePath.getFileName();
                Path fileOutputPath = Paths.get(fileOutput);
                try (BufferedReader br = Files.newBufferedReader(filePath);
                     BufferedWriter writer = Files.newBufferedWriter(fileOutputPath, Charset.defaultCharset())) {
                    User[] users = gson.fromJson(br, User[].class);
                    for (User user : users) {
                        // One JSON element per line (loop body assumed from the description above).
                        writer.write(gson.toJson(user));
                        writer.newLine();
                    }
                } catch (IOException e) {
                    throw new RuntimeException(filePath.toString(), e);
                }
            }
        });
    }
}
There is no point in using a Filter if you want to read all the files in the directory. A Filter is primarily meant to apply some criteria and read only a subset of the files. The two approaches are unlikely to differ noticeably in overall performance.
If you are looking to improve performance, you can try a couple of different approaches:
Depending on how many files exist in the directory and how powerful your CPU is, you can use multithreading to process more than one file at a time.
Right now you read one file and write the other synchronously. You can put the file contents on a Queue and have an asynchronous writer drain it.
You can also combine both approaches to improve performance further.
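The multithreading suggestion could look roughly like the sketch below, which hands each file to a fixed thread pool. ParallelFileProcessing, processAll, and transform are hypothetical names, and transform is only a placeholder (a plain line-by-line copy) for the real JSON conversion:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: processes several files concurrently.
final class ParallelFileProcessing {
    /** Processes every input on a fixed pool; returns the output files in input order. */
    static List<Path> processAll(List<Path> inputs, Path outputDir, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (Path input : inputs) {
                pending.add(pool.submit(() -> transform(input, outputDir)));
            }
            List<Path> written = new ArrayList<>();
            for (Future<Path> result : pending) {
                written.add(result.get()); // rethrows a worker failure as ExecutionException
            }
            return written;
        } finally {
            pool.shutdown();
        }
    }

    /** Placeholder for the per-file work: here, a plain line-by-line copy. */
    private static Path transform(Path input, Path outputDir) throws IOException {
        Path output = outputDir.resolve(input.getFileName());
        List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
        Files.write(output, lines, StandardCharsets.UTF_8);
        return output;
    }
}
```

Whether this helps in practice depends on whether the job is bound by CPU (JSON parsing) or by disk throughput, so it is worth measuring with a realistic number of files before committing to it.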
Don't put the I/O into the filter. That's not what it's for. You should get the complete list of files and then process it. For example if the I/O creates another file in the directory, the behaviour is undefined. You might miss a file, or see the new file in the accept() method.
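One way to follow this advice is to take a snapshot of the directory first and only then start processing, as in the sketch below. DirectoryListing and regularFiles are hypothetical names; the try-with-resources block also closes the DirectoryStream, which the code in the question never does:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists a directory before any processing starts.
final class DirectoryListing {
    /** Snapshot of the regular files directly inside dir. */
    static List<Path> regularFiles(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                if (Files.isRegularFile(entry)) {
                    files.add(entry);
                }
            }
        }
        return files;
    }
}
```

Because the list is complete before the first output file is written, files created during processing cannot show up in the iteration, even if the output directory happens to be the input directory.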

How to read Nutch content from Java/Scala?

I'm using Nutch to crawl some websites (as a process that runs separately from everything else), and I want to use a Java (Scala) program to analyse the HTML data of those websites using Jsoup.
I got Nutch to work by following the tutorial (without the script, only executing the individual instructions worked), and I think it's saving the websites' HTML in the crawl/segments/<time>/content/part-00000 directory.
The problem is that I cannot figure out how to actually read the website data (URLs and HTML) in a Java/Scala program. I read this document, but find it a bit overwhelming since I've never used Hadoop.
I tried to adapt the example code to my environment, and this is what I arrived at (mostly by guesswork):
val reader = new MapFile.Reader(FileSystem.getLocal(new Configuration()), ".../apache-nutch-1.8/crawl/segments/20140711115438/content/part-00000", new Configuration())
var key = null
var value = null
reader.next(key, value) // test for a single value
However, I am getting this exception when I run it:
Exception in thread "main" java.lang.NullPointerException
I am not sure how to work with a MapFile.Reader, specifically, what constructor parameters I am supposed to pass to it. What Configuration objects am I supposed to pass in? Is that the correct FileSystem? And is that the data file I'm interested in?
val conf = NutchConfiguration.create()
val fs = FileSystem.get(conf)
val file = new Path(".../part-00000/data")
val reader = new SequenceFile.Reader(fs, file, conf)
val webdata = Stream.continually {
  val key = new Text()
  val content = new Content()
  reader.next(key, content)
  (key, content)
}
public class ContentReader {
    public static void main(String[] args) throws IOException {
        Configuration conf = NutchConfiguration.create();
        Options opts = new Options();
        GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
        String[] remainingArgs = parser.getRemainingArgs();
        FileSystem fs = FileSystem.get(conf);
        String segment = remainingArgs[0];
        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        Text key = new Text();
        Content content = new Content();
        // Loop through sequence files
        while (reader.next(key, content)) {
            try {
                System.out.write(content.getContent(), 0, content.getContent().length);
            } catch (Exception e) {
                // error handling omitted
            }
        }
    }
}
Alternatively, you can use org.apache.nutch.segment.SegmentReader (example).

How to set HTTP header in Apache JClouds?

I'm using Apache JClouds to connect to my OpenStack Swift installation. I managed to upload and download objects from Swift. However, I can't see how to upload a dynamic large object to Swift.
To upload a dynamic large object, I first need to upload all segments, which I can do as usual. Then I need to upload a manifest object that combines them logically. To tell Swift that this is a manifest object, I need to set a special header, and I don't know how to do that with the JClouds API.
Here's a dynamic large object example from the official OpenStack website.
The code I'm using:
public static void main(String[] args) throws IOException {
    BlobStore blobStore = ContextBuilder.newBuilder("swift").endpoint("http://localhost:8080/auth/v1.0")
            .credentials("test:test", "test").buildView(BlobStoreContext.class).getBlobStore();
    blobStore.createContainerInLocation(null, "container");

    ByteSource segment1 = ByteSource.wrap("foo".getBytes(Charsets.UTF_8));
    Blob seg1Blob = blobStore.blobBuilder("/foo/bar/1").payload(segment1).contentLength(segment1.size()).build();
    System.out.println(blobStore.putBlob("container", seg1Blob));

    ByteSource segment2 = ByteSource.wrap("bar".getBytes(Charsets.UTF_8));
    Blob seg2Blob = blobStore.blobBuilder("/foo/bar/2").payload(segment2).contentLength(segment2.size()).build();
    System.out.println(blobStore.putBlob("container", seg2Blob));

    ByteSource manifest = ByteSource.wrap("".getBytes(Charsets.UTF_8));
    // TODO: set manifest header here
    Blob manifestBlob = blobStore.blobBuilder("/foo/bar").payload(manifest).contentLength(manifest.size()).build();
    System.out.println(blobStore.putBlob("container", manifestBlob));

    Blob dloBlob = blobStore.getBlob("container", "/foo/bar");
    InputStream input = dloBlob.getPayload().openStream();
    while (true) {
        int i = input.read();
        if (i < 0) {
            break;
        }
        System.out.print((char) i); // should print "foobar"
    }
}
The "TODO" part is my problem.
I've been told that JClouds handles large file uploads automatically, but that is not so useful in our case. When we start uploading the first segment, we do not know how large the file will be or when the next segment will arrive. Our API is designed to let clients upload their files in chunks of their own chosen size and at their own chosen time and, when done, call a 'commit' to turn these chunks into a file. That is why we want to upload the manifest ourselves.
Following @Everett Toews's answer, I got my code running correctly:
public static void main(String[] args) throws IOException {
    CommonSwiftClient swift = ContextBuilder.newBuilder("swift").endpoint("http://localhost:8080/auth/v1.0")
            .credentials("test:test", "test").buildApi(CommonSwiftClient.class);
    SwiftObject segment1 = swift.newSwiftObject();
    swift.putObject("container", segment1);
    SwiftObject segment2 = swift.newSwiftObject();
    swift.putObject("container", segment2);
    swift.putObjectManifest("container", "foo/bar2");
    SwiftObject dlo = swift.getObject("container", "foo/bar", GetOptions.NONE);
    InputStream input = dlo.getPayload().openStream();
    while (true) {
        int i = input.read();
        if (i < 0) {
            break;
        }
        System.out.print((char) i);
    }
}
jclouds handles writing the manifest for you. Here are a couple of examples that might help: UploadLargeObject and largeblob.MainApp.
Try using
Map<String, String> manifestMetadata = ImmutableMap.of(
        "X-Object-Manifest", "<container>/<prefix>");
If that doesn't work, you might have to use the CommonSwiftClient directly.
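At the HTTP level, the "special header" from the question is X-Object-Manifest, whose value has the form <container>/<prefix> as in the snippet above, and the manifest itself is an empty object. The sketch below only builds such a request (it sends nothing) with Java 11's java.net.http and bypasses jclouds entirely. SwiftManifestRequest and its parameters are hypothetical names; the storage URL and token are assumed to come from a prior authentication call:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: shows what a DLO manifest upload looks like on the wire.
final class SwiftManifestRequest {
    /** Builds, but does not send, the zero-byte PUT that registers a manifest object. */
    static HttpRequest build(String storageUrl, String authToken,
                             String container, String manifestName, String segmentPrefix) {
        return HttpRequest.newBuilder()
                .uri(URI.create(storageUrl + "/" + container + "/" + manifestName))
                .header("X-Auth-Token", authToken)
                // Segments are all objects in this container whose names start with the prefix.
                .header("X-Object-Manifest", container + "/" + segmentPrefix)
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }
}
```

For the segments in the question (foo/bar/1 and foo/bar/2 in "container"), the manifest name would be foo/bar and the prefix foo/bar/. Since the manifest only refers to segments by prefix, segments uploaded at any later time are picked up as long as their names sort in the intended order.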

Hadoop DistributedCache object changed during job

I'm trying to run KMeans on AWS, and I ran into the following exception when trying to read updated cluster centroids from the DistributedCache: The distributed cache object s3://mybucket/centroids_6/part-r-00009 changed during the job from 4/8/13 2:20 PM to 4/8/13 2:20 PM
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.localizePublicCacheObject(
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.getLocalCache(
at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(
at org.apache.hadoop.mapred.TaskTracker$
at Method)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(
at org.apache.hadoop.mapred.TaskTracker.localizeJob(
at org.apache.hadoop.mapred.TaskTracker$
What sets this question apart from this one is that the error appears intermittently. I've run the same code successfully on a smaller dataset. Furthermore, when I change the number of centroids from 12 (seen above in the code) to 8, it fails on iteration 5 instead of 6 (which you can see in the centroids_6 name above).
Here's the relevant DistributedCache code in the main driver that runs the KMeans loop:
int iteration = 1;
long changes = 0;
do {
    // First, write the previous iteration's centroids to the dist cache.
    Configuration iterConf = new Configuration();
    Path prevIter = new Path(centroidsPath.getParent(),
            String.format("centroids_%s", iteration - 1));
    FileSystem fs = prevIter.getFileSystem(iterConf);
    Path pathPattern = new Path(prevIter, "part-*");
    FileStatus[] list = fs.globStatus(pathPattern);
    for (FileStatus status : list) {
        DistributedCache.addCacheFile(status.getPath().toUri(), iterConf);
    }

    // Now, set up the job.
    Job iterJob = new Job(iterConf);
    iterJob.setJobName("KMeans " + iteration);
    Path nextIter = new Path(centroidsPath.getParent(),
            String.format("centroids_%s", iteration));
    KMeansDriver.delete(iterConf, nextIter);

    // Set input/output formats.
    // Set Mapper, Reducer, Combiner
    // Set MR formats.
    // Set input/output paths.
    FileInputFormat.addInputPath(iterJob, data);
    FileOutputFormat.setOutputPath(iterJob, nextIter);

    if (!iterJob.waitForCompletion(true)) {
        System.err.println("ERROR: Iteration " + iteration + " failed!");
    }
    iteration++;
    changes = iterJob.getCounters().findCounter(KMeansDriver.Counter.CONVERGED).getValue();
} while (changes > 0);
How else would the files be modified? The only possibility I can think of is that, at the completion of one iteration, the loop begins again before the centroids from the previous job have finished writing. But within the comment, I invoke the job with waitForCompletion(true), so there shouldn't be any residual parts of the job running when the loop starts over. Any ideas?
This isn't really an answer, but I did realize it was silly to use the DistributedCache in the way I was, as opposed to reading the results from the previous iteration directly from HDFS. I instead wrote this method in the main driver:
public static HashMap<Integer, VectorWritable> readCentroids(Configuration conf, Path path)
        throws IOException {
    HashMap<Integer, VectorWritable> centroids = new HashMap<Integer, VectorWritable>();
    FileSystem fs = FileSystem.get(path.toUri(), conf);
    FileStatus[] list = fs.globStatus(new Path(path, "part-*"));
    for (FileStatus status : list) {
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
        IntWritable key = null;
        VectorWritable value = null;
        try {
            key = (IntWritable) reader.getKeyClass().newInstance();
            value = (VectorWritable) reader.getValueClass().newInstance();
        } catch (InstantiationException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        }
        while (reader.next(key, value)) {
            centroids.put(new Integer(key.get()),
                    new VectorWritable(value.get(), value.getClusterId(), value.getNumInstances()));
        }
        reader.close();
    }
    return centroids;
}
This is invoked in the setup() method of the Mapper and Reducer during each iteration, to read the centroids of the previous iteration.
protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    Path centroidsPath = new Path(conf.get(KMeansDriver.CENTROIDS));
    centroids = KMeansDriver.readCentroids(conf, centroidsPath);
}
This allowed me to remove the block of code in the loop in my original question which writes the centroids to the DistributedCache. I tested it, and it now works on both large and small datasets.
I still don't know why I was getting the error I posted about (how would something in the read-only DistributedCache be changed? especially when I was changing HDFS paths on every iteration?), but this seems to both work and be a much less hack-y way of reading the centroids.

