My process creates a huge number of files from time to time, and I want to transfer files from my local directory to a location in HDFS. Other than using NiFi, is it possible to develop that flow in Java? If yes, please guide me by giving some reference code in Java.
Please help me out!
You could do a couple of things:
1) Use Apache Flume: https://www.dezyre.com/hadoop-tutorial/flume-tutorial. That page says: "Apache Flume is a distributed system used for aggregating the files to a single location." This solution should be better than using Kafka since Flume has been designed specifically for files.
2) Write Java code that connects to your machine over SSH and scans for files modified after a specific timestamp. If you find such files, open an input stream and save a copy on the machine your Java code is running on.
3) Alternatively, your Java code could run on the machine where the files are being created; scan for files created after a specific timestamp and move them to the new machine (see the sketch after this list).
4) If you want to use only Kafka, you could write Java code that reads the files, finds the latest file/row, and publishes it to a Kafka topic. Flume can do all of this out of the box.
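If you go with option 2 or 3 and your destination is HDFS, a minimal sketch using the Hadoop FileSystem API could look like the following (the NameNode URI, local directory, and HDFS target path are assumptions you would replace with your own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.File;

public class LocalToHdfsCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // assumption: your NameNode URI
        FileSystem hdfs = FileSystem.get(conf);

        File localDir = new File("/data/incoming");          // assumption: local source directory
        Path hdfsDir  = new Path("/user/myuser/incoming");   // assumption: HDFS target directory
        long lastRun  = System.currentTimeMillis() - 60_000; // only pick up files newer than this

        File[] files = localDir.listFiles();
        if (files != null) {
            for (File f : files) {
                if (f.isFile() && f.lastModified() > lastRun) {
                    // copyFromLocalFile(delSrc, overwrite, src, dst)
                    hdfs.copyFromLocalFile(false, true,
                            new Path(f.getAbsolutePath()),
                            new Path(hdfsDir, f.getName()));
                }
            }
        }
        hdfs.close();
    }
}

copyFromLocalFile streams each file to HDFS for you; persisting the timestamp of the last scan (for example in a small state file) is what lets the next run pick up only the new files.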
I don't know if there is a limit on the size of a message in Kafka, but you can use the ByteArraySerializer in the producer/consumer properties. Convert your file to bytes and then reconstruct it on the consumer.
Doing a quick search, I found this:
message.max.bytes (default: 1000000) – Maximum size of a message the broker will accept. This has to be smaller than the consumer fetch.message.max.bytes, or the broker will have messages that can't be consumed, causing consumers to hang.
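If you do go down the Kafka route, a minimal producer sketch using the ByteArraySerializer could look like this (the broker address, topic name, and file path are assumptions; files larger than message.max.bytes would still need to be chunked or the broker limit raised):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

public class FileToKafka {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: your broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        // assumption: the file you want to ship as a single message
        byte[] payload = Files.readAllBytes(Paths.get("/data/incoming/sample.log"));

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            // use the file name as the key so the consumer can reconstruct the file
            producer.send(new ProducerRecord<>("file-topic", "sample.log", payload));
            producer.flush();
        }
    }
}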
I am running a Spark job in a cluster which has 2 worker nodes. I am using the code below (Spark with Java) to save the computed DataFrame as CSV to the worker nodes.
dataframe.write().option("header","false").mode(SaveMode.Overwrite).csv(outputDirPath);
I am trying to understand how Spark writes multiple part files on each worker node.
Run 1) worker1 has part files and _SUCCESS; worker2 has _temporary/task*/part* (each task has its own part files).
Run 2) worker1 has part files and also a _temporary directory; worker2 has multiple part files.
Can anyone help me understand why this behavior occurs?
1) Should I consider the records in outputDir/_temporary as part of the output, along with the part files in outputDir?
2) Is the _temporary dir supposed to be deleted after the job run, with the part files moved to outputDir?
3) Why can't it create the part files directly under the output dir?
coalesce(1) and repartition(1) are not an option, since the output itself will be around 500 GB.
Spark 2.0.2 / 2.1.3 and Java 8, no HDFS.
After analysis, I observed that my Spark job was using FileOutputCommitter version 1, which is the default.
Then I included the config to use FileOutputCommitter version 2 instead of version 1 and tested it on a 10-node Spark standalone cluster in AWS. All part-* files were generated directly under the outputDirPath specified in dataframe.write().option("header","false").mode(SaveMode.Overwrite).csv(outputDirPath).
We can set the property in either of two ways:
by including --conf 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2' in the spark-submit command,
or by setting the property on the SparkContext: javaSparkContext.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version","2")
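Putting both pieces together, a minimal Java sketch might look like this (the input source and output path are assumptions):

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CommitterV2Example {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("committer-v2").getOrCreate();

        // switch to FileOutputCommitter algorithm version 2 before any write happens
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2");

        String outputDirPath = "/data/output";                      // assumption: output directory
        Dataset<Row> dataframe = spark.read().json("/data/input");  // assumption: some input source

        dataframe.write().option("header", "false").mode(SaveMode.Overwrite).csv(outputDirPath);
    }
}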
I understand the consequences in case of failures, as outlined in the Spark docs, but I achieved the desired result!
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version (default value: 1)
The file output committer algorithm version; valid version numbers are 1 or 2. Version 2 may have better performance, but version 1 may handle failures better in certain situations, as per MAPREDUCE-4815.
TL;DR To properly write (or read, for that matter) data using a file-system-based source, you'll need shared storage.
The _temporary directory is part of the basic commit mechanism used by Spark: data is first written to a temporary directory and, once all tasks have finished, atomically moved to the final destination. You can read more about this process in Spark _temporary creation reason.
For this process to succeed you need a shared file system (HDFS, NFS, and so on) or an equivalent distributed store (like S3). Since you don't have one, the failure to clean up the temporary state is expected; see Saving dataframe to local file system results in empty results.
The behavior you observed (data partially committed and partially not) can occur when some executors are co-located with the driver and share a file system with it, enabling a full commit for that subset of the data.
Multiple part files are based on your DataFrame's partitioning. The number of files written depends on the number of partitions the DataFrame has at the time you write out the data. By default, one file is written per partition.
You can control this by using coalesce or repartition to reduce or increase the number of partitions.
If you coalesce to 1, you won't see multiple part files, but this affects writing the data in parallel.
[outputDirPath = /tmp/multiple.csv ]
dataframe
.coalesce(1)
.write().option("header","false")
.mode(SaveMode.Overwrite)
.csv(outputDirPath);
On your question of how to refer to it: refer to /tmp/multiple.csv, which covers all of the parts below.
/tmp/multiple.csv/part-00000.csv
/tmp/multiple.csv/part-00001.csv
/tmp/multiple.csv/part-00002.csv
/tmp/multiple.csv/part-00003.csv
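When reading the output back, you can point Spark at the directory itself and it will pick up every part file. For example (a sketch, assuming a SparkSession named spark):

// Spark reads every part-* file under the directory as one logical dataset
Dataset<Row> all = spark.read().option("header", "false").csv("/tmp/multiple.csv");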
Environment: Java 7 on an Ubuntu 12 server.
I have a Java application that polls for incoming .zip files that are delivered via sftp. I have no control over the client that's delivering the files.
The files being delivered are quite large, and in some cases, the poll mechanism detects a file while it's still being written. In this situation, the Java application borks because it thinks the file is corrupt.
What's the most effective way of detecting when the local sftp server has finished writing the file?
There are a number of approaches to dealing with this. You can choose one, but the more you implement the better:
The sender should upload as a .tmp file, then rename to .zip once done so that the watcher only sees the finished file.
The watcher should check the last modified time of the file, and if it was modified in the last 10 seconds (maybe 1 minute) then ignore the file and try again later.
If your OS supports it, try to get an exclusive lock on the file before reading it. This is not so easy in Java and depends on OS specifics (see the sketch after this list).
Always send the file as a zip file: if the file is incomplete or otherwise corrupted, it will fail the CRC check. You also get the added benefits of smaller transfers, a smaller archive folder, etc. (Of course, you are already doing this, as mentioned in the question.)
Look at the File2 component of Camel and all the options it gives you. Makes you want to use Camel, right?
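For the exclusive-lock approach mentioned above, a minimal Java sketch could look like the following. Note that locks on Linux are advisory, so this only helps if the writing process also takes a lock:

import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class LockCheck {
    // Returns true if we could obtain an exclusive lock on the whole file,
    // i.e. no other (cooperating) process currently holds one.
    public static boolean isFinished(String path) {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
             FileChannel channel = raf.getChannel();
             FileLock lock = channel.tryLock()) {
            return lock != null;   // null means someone else holds the lock
        } catch (Exception e) {
            return false;          // can't open or lock it yet; try again later
        }
    }
}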
See answer: https://stackoverflow.com/a/5851185/92063 which mentions incron. You can use it to notify your application that a file system event has taken place.
A quote from the linked website:
incron :: inotify cron system
This program is an "inotify cron" system. It consists of a daemon and a table manipulator. You can use it a similar way as the regular cron. The difference is that the inotify cron handles filesystem events rather than time periods.
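For illustration, an incrontab entry that fires once a file opened for writing has been closed might look like this (the watched directory and the script it calls are assumptions):

/var/sftp/incoming IN_CLOSE_WRITE /usr/local/bin/notify-app.sh $@/$#

IN_CLOSE_WRITE is the inotify event that corresponds to "the writer has finished and closed the file", which is exactly the signal you want before processing.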
You have no control over the sender, which is unfortunate, because the best solution would be the following (I will give another solution afterwards which doesn't require the sender to change anything).
The sender should rename the file when the upload is finished.
E.g. the file is named fileInProgress.txt during upload and fileFinished.txt when the upload is finished. You then restrict your Java program to only watch for files matching *Finished.txt. This is the easiest and an absolutely reliable solution.
Since you can't change the sender, your solution would be the following:
From your Java program, do a file listing of the upload folder and store the file sizes.
Wait 10 seconds (or longer if you want to be on the safe side).
Do a file listing again.
All files that didn't change in size are finished and can be processed (see the sketch below).
Note that this does not give you absolute certainty that the upload is finished, but it comes closer the longer the interval between your file size checks is.
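A minimal Java sketch of that check could look like this (the upload directory and the 10-second interval are assumptions):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class StableFileDetector {
    public static void main(String[] args) throws InterruptedException {
        File uploadDir = new File("/var/sftp/upload");   // assumption: the watched directory

        // First pass: remember the size of every file
        Map<String, Long> firstPass = new HashMap<>();
        File[] before = uploadDir.listFiles();
        if (before != null) {
            for (File f : before) {
                if (f.isFile()) firstPass.put(f.getName(), f.length());
            }
        }

        Thread.sleep(10_000);  // wait 10 seconds (longer to be on the safe side)

        // Second pass: any file whose size did not change is considered finished
        File[] after = uploadDir.listFiles();
        if (after != null) {
            for (File f : after) {
                Long previous = firstPass.get(f.getName());
                if (f.isFile() && previous != null && previous == f.length()) {
                    System.out.println("Finished, safe to process: " + f.getName());
                }
            }
        }
    }
}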
As David Roussel mentioned, Camel would be very useful for this. Take a look at initialDelay (among any other options you may find useful) from File2, as this places a specified delay before the first poll of the directory.
For any sort of file polling that I have done, I have used Camel, as it makes handling these kinds of situations easier.
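For illustration, a Camel (2.x) route using those File2 options might look like this (the directories, delay, and readLock strategy are assumptions to adapt):

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.main.Main;

public class ZipPollerRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Poll the upload directory for zip files; initialDelay waits before the first poll,
        // and readLock=changed only picks up files whose size has stopped changing.
        from("file:/var/sftp/incoming?include=.*\\.zip&initialDelay=60000&readLock=changed")
            .to("file:/var/app/processing");
    }

    public static void main(String[] args) throws Exception {
        Main main = new Main();
        main.addRouteBuilder(new ZipPollerRoute());
        main.run();
    }
}

readLock=changed makes Camel wait until the file has stopped growing before handing it to your route, which addresses the half-written-file problem directly.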
I need some help understanding how HDFS and Storm are integrated. Storm can process an incoming stream of data using many nodes. My data is, let's say, log entries from different machines. So how do I store it all? Ideally I'd like to store the logs from one machine in one or more files dedicated to that machine. But how does that work? Will I be able to append to the same file in HDFS from many different Storm nodes?
PS: I'm still working on getting all this running, so I can't test it physically... but it does bother me.
No, you cannot write to the same file from more than one task at a time. Each task would need to write to its own file in a directory, and then you could process them using directory/* if you are using Hadoop.
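A minimal sketch of the per-task file idea using the HDFS Java API could look like this (the NameNode URI and directory layout are assumptions; the task id would come from your Storm task context, e.g. TopologyContext.getThisTaskId()):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerTaskHdfsWriter {
    // Each task opens its own file, e.g. /logs/machine-A/part-<taskId>,
    // so no two tasks ever write to the same HDFS file.
    public static FSDataOutputStream openFor(String machine, int taskId) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");              // assumption: NameNode URI
        FileSystem hdfs = FileSystem.get(conf);
        Path file = new Path("/logs/" + machine + "/part-" + taskId);  // one file per task
        return hdfs.create(file);   // downstream jobs can later read /logs/machine-A/*
    }
}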
I am facing a very strange issue in Java (a J2EE app). I have an application that reads data from customer configuration files placed at a location on the local machine/server via a Java API and displays it on the UI of the tool. Later, through the UI, the data can be changed and is written back to the file by the tool via the Java API.
The problem is that the tool sometimes fails to read the information (it reads half of the file), causing data loss in the UI. But the issue is not consistent: it happens only about 1 in 20 times; the rest of the time it reads fine.
I am not able to reproduce the issue on my Windows machine, but it was seen on the production server (a UNIX environment).
Please suggest what I need to check. Are there any permission-related issues on UNIX?
Could my tool have a bug in it, or is it an environment problem that the tool suffers from?
Should I try
try {
    // my code
} catch (Throwable t) {
    t.printStackTrace();
}
to debug whether it's an issue in the environment?
Windows tends to lock files, so you are less likely to read a file while it is being written. Linux takes the view that you know what you are doing and doesn't lock by default, which means you can see files before they have been finished. This is a common problem: files were not designed as a messaging protocol, so you have to come up with something heuristic to handle this deficiency. A better approach is not to use files for communication between processes at all; otherwise you have to be very aware of their limitations.
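One common heuristic on the writer's side is to write to a temporary name and then rename atomically, so a reader never sees a half-written file. A minimal Java sketch (the paths and content are assumptions):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SafeConfigWriter {
    public static void main(String[] args) throws Exception {
        Path tmp = Paths.get("/data/config/customer.cfg.tmp"); // assumption: temp name in the same directory
        Path dst = Paths.get("/data/config/customer.cfg");     // assumption: the file the tool reads
        byte[] newContent = "key=value\n".getBytes(StandardCharsets.UTF_8);

        Files.write(tmp, newContent);                           // write everything to the temp file first
        // On Linux, rename within the same filesystem is atomic, so readers
        // only ever see the old complete file or the new complete file.
        Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
    }
}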
I have a cluster of 4 servers. A file consists of many logical documents, and each file is started as a workflow. So, in summary, a workflow runs on a server for each physical input file, which can contain as many as 300,000 logical documents. At any given time, 80 workflows are running concurrently across the cluster. Is there a way to speed up the file processing? Is file splitting a good alternative? Any suggestions? Everything is Java-based, running on a Tomcat servlet engine.
Try processing the files in Oracle Coherence, which gives you grid processing. Coherence also provides data persistence.