My Spark Master needs to read a file in order. Here is what I am trying to avoid (in pseudocode):
if file-path starts with "hdfs://"
    Read via HDFS API
else
    Read via native FS API
I think the following would do the trick, letting Spark deal with distinguishing between local/HDFS:
JavaSparkContext sc = new JavaSparkContext(new SparkConf());
List<String> lines = sc.textFile(path).collect();
Is it safe to assume that lines will be in order, i.e. that lines.get(0) is the first line of the file, lines.get(1) is the second line, and so on?
If not, any suggestions on how to avoid explicitly checking the FS type?
I have a requirement to read CSV files from around 100 S3 folders. In a single execution, only some of those folders may contain files, e.g. 60 folders have files. I need to process those 60 files and publish the data to a Kafka topic. The job needs to be scheduled every 4 hours, and the CSV data can range from a few small records up to huge files of around 6 GB.
I have to develop this in Java and deploy it to AWS.
I am thinking of using Spring Batch, with steps like the following:
1. Traverse all 100 S3 folders and identify which folders have files, e.g. 60 folders have files.
2. Create that many jobs/tasks, e.g. 60 jobs, and execute them in parallel.
Restriction: I should not use AWS EMR for this process.
Please suggest a good approach that handles this with the best performance and minimal data-processing failures.
For your use case, if all the files are of the same type (i.e. they can be processed one by one), you can use the option below.
Using ResourceLoader, we can read files in S3 in an ItemReader just like any other resource. This lets us read the S3 files in chunks instead of loading an entire file into memory.
With the dependencies injected for ResourceLoader and the AmazonS3 client, configure your reader as below.
Replace the values for sourceBucket and sourceObjectPrefix as needed.
@Autowired
private ResourceLoader resourceLoader;

@Autowired
private AmazonS3 amazonS3Client;

// READER
@Bean(destroyMethod = "")
@StepScope
public SynchronizedItemStreamReader<Employee> employeeDataReader() {
    SynchronizedItemStreamReader<Employee> synchronizedItemStreamReader = new SynchronizedItemStreamReader<>();
    List<Resource> resourceList = new ArrayList<>();
    String sourceBucket = yourBucketName;
    String sourceObjectPrefix = yourSourceObjectPrefix;
    log.info("sourceObjectPrefix::" + sourceObjectPrefix);

    // List all objects under the prefix, following pagination markers.
    ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
            .withBucketName(sourceBucket)
            .withPrefix(sourceObjectPrefix);
    ObjectListing sourceObjectsListing;
    do {
        sourceObjectsListing = amazonS3Client.listObjects(listObjectsRequest);
        for (S3ObjectSummary sourceFile : sourceObjectsListing.getObjectSummaries()) {
            if (!(sourceFile.getSize() > 0)
                    || (!sourceFile.getKey().endsWith(DOT.concat("csv")))) {
                // Skip if the file is empty or its extension is not "csv"
                continue;
            }
            log.info("Reading " + sourceFile.getKey());
            resourceList.add(resourceLoader.getResource("s3://".concat(sourceBucket).concat("/")
                    .concat(sourceFile.getKey())));
        }
        listObjectsRequest.setMarker(sourceObjectsListing.getNextMarker());
    } while (sourceObjectsListing.isTruncated());

    Resource[] resources = resourceList.toArray(new Resource[resourceList.size()]);
    MultiResourceItemReader<Employee> multiResourceItemReader = new MultiResourceItemReader<>();
    multiResourceItemReader.setName("employee-multiResource-Reader");
    multiResourceItemReader.setResources(resources);
    multiResourceItemReader.setDelegate(employeeFileItemReader());
    synchronizedItemStreamReader.setDelegate(multiResourceItemReader);
    return synchronizedItemStreamReader;
}
@Bean
@StepScope
public FlatFileItemReader<Employee> employeeFileItemReader() {
    FlatFileItemReader<Employee> reader = new FlatFileItemReader<>();
    reader.setLinesToSkip(1);   // skip the CSV header row
    reader.setLineMapper(new DefaultLineMapper<Employee>() {
        {
            setLineTokenizer(new DelimitedLineTokenizer() {
                {
                    setNames(Employee.fields());
                }
            });
            setFieldSetMapper(new BeanWrapperFieldSetMapper<Employee>() {
                {
                    setTargetType(Employee.class);
                }
            });
        }
    });
    return reader;
}
For each file/resource, the MultiResourceItemReader delegates to the FlatFileItemReader configured above.
For the item-processor part, you can also scale out using the AsyncItemProcessor/AsyncItemWriter approach as needed, as sketched below.
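A minimal sketch of that async approach, assuming spring-batch-integration is on the classpath; employeeProcessor() and employeeKafkaWriter() are illustrative placeholders for your own processor and writer beans:
// Classes from org.springframework.batch.integration.async (spring-batch-integration).
@Bean
public AsyncItemProcessor<Employee, Employee> asyncEmployeeProcessor() {
    AsyncItemProcessor<Employee, Employee> asyncProcessor = new AsyncItemProcessor<>();
    asyncProcessor.setDelegate(employeeProcessor());                         // your existing ItemProcessor (placeholder)
    asyncProcessor.setTaskExecutor(new SimpleAsyncTaskExecutor("emp-proc-")); // each item processed on a separate thread
    return asyncProcessor;
}

@Bean
public AsyncItemWriter<Employee> asyncEmployeeWriter() {
    AsyncItemWriter<Employee> asyncWriter = new AsyncItemWriter<>();
    asyncWriter.setDelegate(employeeKafkaWriter());                          // your existing ItemWriter (placeholder)
    return asyncWriter;
}
The AsyncItemProcessor wraps each item in a Future, and the AsyncItemWriter unwraps those Futures before handing the items to the real writer.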
Here is one possible approach for you to think about. (FYI, I have done file processing using Spring Batch and threading with the strategy I am outlining here, but that code belongs to my company and I cannot share it.)
I would suggest you read these articles to understand how to scale up using Spring Batch.
First, the Spring Batch documentation:
https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html
Next, a good post from Stack Overflow itself:
Best Spring batch scaling strategy
After reading both and understanding all the different ways, I would suggest you concentrate on partitioning:
https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html#partitioning
This is the technique I used as well. In your case, you can spawn a thread for each file from the partitioner.
You may need to maintain state, i.e. whether a file has been assigned to a thread or not. 'Processing' and 'Completed Processing' could also be states in your code.
This depends on your requirements. (I had a whole set of states maintained in a singleton, which all threads would update after picking up a file, after finishing a file, and so on.)
You also need to think about finishing each file before the 4-hour window is over. You may be able to keep the file as is, or you may want to move it to a new location while processing, or rename it while processing. Again, it depends on your requirements, but you need to think about this scenario. (In my case, I renamed the file by adding a unique suffix made of a timestamp in milliseconds, so it could not be overwritten by a new file.)
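As a rough illustration of that idea (my own sketch, not the sample linked below), a partitioner can hand one S3 object key to each worker step; the fileKeys list and the "fileKey" context key name are assumptions:
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// One partition (and hence one worker step execution) per S3 object key.
public class S3FilePartitioner implements Partitioner {

    private final List<String> fileKeys;   // S3 keys discovered for this run (listing assumed done elsewhere)

    public S3FilePartitioner(List<String> fileKeys) {
        this.fileKeys = fileKeys;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int i = 0;
        for (String key : fileKeys) {
            ExecutionContext context = new ExecutionContext();
            context.putString("fileKey", key);   // the worker step reads this, e.g. via #{stepExecutionContext['fileKey']}
            partitions.put("partition-" + i++, context);
        }
        return partitions;
    }
}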
Finally, here is a sample from a blog that processes 5 CSV files through a partitioner.
You can start from this sample:
https://www.baeldung.com/spring-batch-partitioner
And search for more samples to see if this is the approach you want to take. Good luck.
Problem
I have a file saved in HDFS, and all I want to do is run my Spark application, compute a result JavaRDD, and use saveAsTextFile() to store the new "file" in HDFS.
However, Spark's saveAsTextFile() does not work if the output already exists; it does not overwrite it.
What I tried
So I searched for a solution and found that a possible way to make it work is to delete the file through the HDFS API before trying to save the new one.
I added this code:
FileSystem hdfs = FileSystem.get(new Configuration());
Path newFolderPath = new Path("hdfs://node1:50050/hdfs/" + filename);
if (hdfs.exists(newFolderPath)) {
    System.out.println("EXISTS");
    hdfs.delete(newFolderPath, true);
}
filerdd.saveAsTextFile("/hdfs/" + filename);
When I tried to run my Spark application, the file was deleted, but I got a FileNotFoundException.
Considering that this exception occurs when someone tries to read a file from a path that does not exist, this makes no sense to me: after deleting the file, there is no code that tries to read it.
Part of my code
JavaRDD<String> filerdd = sc.textFile("/hdfs/" + filename); // load the file here
...
...
// Transformations here
filerdd = filerdd.map(....);
...
...
// Delete old file here
FileSystem hdfs = FileSystem.get(new Configuration());
Path newFolderPath = new Path("hdfs://node1:50050/hdfs/" + filename);
if (hdfs.exists(newFolderPath)) {
    System.out.println("EXISTS");
    hdfs.delete(newFolderPath, true);
}

// Write new file here
filerdd.saveAsTextFile("/hdfs/" + filename);
I am trying to do the simplest thing here, but I have no idea why it does not work. Maybe the filerdd is somehow connected to the path?
The problem is that you use the same path for input and output. Spark RDDs are evaluated lazily; the job only runs when you call saveAsTextFile. At that point you have already deleted newFolderPath, so filerdd fails when it tries to read its input.
In any case, you should not use the same path for input and output.
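For example, a rough sketch of one way around this (paths are hypothetical): write the output to a temporary location first, and only swap it into place after the job has finished:
// Write the result to a temporary path first, so the original input is still readable while the job runs.
String tmpOutput = "/hdfs/" + filename + "_tmp";
filerdd.saveAsTextFile(tmpOutput);                 // the job actually executes here

// Now it is safe to remove the old data and move the new output into place.
FileSystem hdfs = FileSystem.get(new Configuration());
Path oldPath = new Path("/hdfs/" + filename);
Path tmpPath = new Path(tmpOutput);
if (hdfs.exists(oldPath)) {
    hdfs.delete(oldPath, true);                    // safe now: the input has already been consumed
}
hdfs.rename(tmpPath, oldPath);                     // swap the new output into the original location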
I am trying to implement a MapReduce job that processes a large text file (as a lookup file) in addition to the actual dataset (input). The lookup file is more than 2 GB.
I tried to load the text file as a third argument, but I got a Java heap space error.
After doing some searching, it was suggested to use the distributed cache. This is what I have done so far.
First, I used this method to read the lookup file:
public static String readDistributedFile(Context context) throws IOException {
    URI[] cacheFiles = context.getCacheFiles();
    Path path = new Path(cacheFiles[0].getPath().toString());
    FileSystem fs = FileSystem.get(new Configuration());
    StringBuilder sb = new StringBuilder();
    BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));
    String line;
    while ((line = br.readLine()) != null) {
        // split line
        sb.append(line);
        sb.append("\n");
    }
    br.close();
    return sb.toString();
}
Second, in the Mapper:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    String lookUpText = readDistributedFile(context);
    // do something with the text
}
Third, to run the job:
hadoop jar mapReduceJob.jar the.specific.class -files ../LargeLookUpFileInStoredLocally.txt /user/name/inputdataset/*.gz /user/name/output
But the problem is that the job takes a long time to load.
Maybe it was not a good idea to use the distributed cache, or maybe I am missing something in my code.
I am working with Hadoop 2.5.
I have already checked some related questions, such as [1].
Any ideas would be great!
[1] Hadoop DistributedCache is deprecated - what is the preferred API?
The distributed cache is mostly used to ship files that the MapReduce tasks need on the task nodes and that are not part of the jar.
Another use is when performing a join between a big and a small data set: rather than using multiple input paths, we use a single (big) input file, fetch the small file via the distributed cache, and then compare (or join) the two data sets.
The reason your job takes so long is that you are trying to read the entire 2 GB file before the map phase starts (since it is loaded in the setup method).
Can you give the reason why you are loading the huge 2 GB file through the distributed cache?
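To illustrate the join pattern described above with a genuinely small lookup file, a mapper could look roughly like this; the tab-separated format and the choice of the first field as the join key are assumptions:
// Rough sketch: map-side join with a small lookup file from the distributed cache.
public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load the small cached file into memory once per task.
        URI[] cacheFiles = context.getCacheFiles();
        Path path = new Path(cacheFiles[0].getPath());
        FileSystem fs = FileSystem.get(context.getConfiguration());
        BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));
        String line;
        while ((line = br.readLine()) != null) {
            String[] parts = line.split("\t", 2);   // assumed format: key<TAB>value
            if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);
            }
        }
        br.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        String match = lookup.get(fields[0]);       // join key assumed to be the first field
        if (match != null) {
            context.write(new Text(fields[0]), new Text(value.toString() + "\t" + match));
        }
    }
}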
I wrote a program that reads a CSV file and puts it into a TableModel. My problem is that I want to extend the program so that, if the CSV file is changed from outside, my TableModel gets updated with the new values.
I would now program a scheduler so that a thread sleeps for about a minute and checks every minute whether the file's timestamp has changed; if so, it would read the file again. But I don't know what happens to the whole program if I use a scheduler, because this little piece of software will be part of a much, much bigger application running on JDK 6. So I am looking for a performant solution, independent of the bigger application, to get the changes into the TableModel.
Can someone help out?
The java.nio.file package now contains the Watch Service API. Effectively:
This API enables you to register a directory (or directories) with the watch service. When registering, you tell the service which types of events you are interested in: file creation, file deletion, or file modification. When the service detects an event of interest, it is forwarded to the registered process. The registered process has a thread (or a pool of threads) dedicated to watching for any events it has registered for. When an event comes in, it is handled as needed.
See reference here.
Oh! This API is only available from JDK 7 (onwards).
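A minimal sketch of watching the CSV file's directory for modifications (the directory and file name are placeholders):
import java.nio.file.*;

public class CsvFileWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/path/to/csv/dir");          // directory containing the CSV (placeholder)
        WatchService watchService = FileSystems.getDefault().newWatchService();
        dir.register(watchService, StandardWatchEventKinds.ENTRY_MODIFY);

        while (true) {
            WatchKey key = watchService.take();            // blocks until an event is available
            for (WatchEvent<?> event : key.pollEvents()) {
                Path changed = (Path) event.context();     // file name relative to the watched directory
                if (changed.toString().equals("data.csv")) {
                    // re-read the CSV here and update the TableModel (on the EDT)
                }
            }
            if (!key.reset()) {                            // the watched directory is no longer accessible
                break;
            }
        }
    }
}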
OpenCSV is a good way to read a CSV file in Java.
If you are using Maven, you can add the OpenCSV dependency; otherwise, download its jar from the web.
#SuppressWarnings({"rawtypes", "unchecked"})
public void readCsvFile() {
CSVReader csvReader;
CsvToBean csv;
File fileEntry;
try {
fileEntry = new File("path of your file");
csv = new CsvToBean();
csvReader = new CSVReader(new FileReader(fileEntry), ',', '"', 1);
List list = csv.parse(setColumMapping(), csvReader);
//List of LabReportSampleData class
} catch (IOException e) {
e.printStackTrace();
}
}
// The function below maps the columns of your CSV file to your mapping object.
// columns String array: the order of values inside your CSV file, e.g. index 0 maps to the degree field in your mapping class.
#SuppressWarnings({"rawtypes", "unchecked"})
private static ColumnPositionMappingStrategy setColumMapping() {
ColumnPositionMappingStrategy strategy = new ColumnPositionMappingStrategy();
strategy.setType(LabReportSampleData.class);
String[] columns =
new String[] {"degree", "radian", "shearStress", "shearingStrain", "sourceUnit"};
strategy.setColumnMapping(columns);
return strategy;
}
I'm new to Hadoop, and recently I was asked to do a test project using it.
So while I was reading about Big Data, I happened to learn about Pail. Now what I want to do is something like this: first create a simple object, then serialize it using Thrift, and put it into HDFS using Pail. Then I want to get that object inside the map function and do whatever I want with it. But I have no idea how to get that object inside the map function.
Can someone please point me to some references or explain how to do that?
I can think of three options:
1. Use the -files option and name the file in HDFS (preferable, as the task tracker downloads the file once for all jobs running on that node).
2. Use the DistributedCache (similar logic to the above), but configure the file via some API calls rather than through the command line.
3. Load the file directly from HDFS (less efficient, as you're pulling the file over HDFS for each task).
As for some code, put the load logic into your mapper's setup(...) or configure(...) method (depending on whether you're using the new or old API), as follows:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // The -files option makes the named file available in the task's local working directory.
    File file = new File("filename.dat");
    // open the local file and load its contents ...

    // Alternatively, load the file directly from HDFS:
    FileSystem fs = FileSystem.get(context.getConfiguration());
    InputStream hdfsInputStream = fs.open(new Path("/path/to/file/in/hdfs/filename.dat"));
    // load file contents from the stream ...
}
DistributedCache has some example code in the Javadocs
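For the second option, a rough sketch using the newer Job API (the path is a placeholder):
// Driver side: register the file with the job via the cache API (option 2),
// instead of passing -files on the command line.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "my-job");
job.addCacheFile(URI.create("hdfs:///path/to/file/in/hdfs/filename.dat"));
// ... set mapper, input/output formats and paths, then submit the job ...

// Mapper side: the registered file is available through context.getCacheFiles(),
// and can be opened in setup(...) much like the direct-HDFS variant above.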