SparkSession readText file with custom logic - java

I want to read a Spark text file into a JavaRDD. The code below works perfectly fine:
JavaRDD<String> rdd = sparkSession.sparkContext().textFile(filePath, 100).toJavaRDD();
I want to apply some conditional logic while textFile is reading the file.
For example:
if the content of the text file is as below (note: this is a simplified example):
1
2
2
3
4
4
I want to be able to look ahead or look back and eliminate duplicates based on some logic.
I don't want to do it at the time of processing the RDD; I want to be able to do it at the time of reading the text file itself.

Spark runs your program through an optimizer: it will actually perform the transformations and the filter on each line as it is read, so it does not need to hold all the data in memory.
My advice is to use the filter operation. Furthermore, you can persist the resulting RDD to avoid recomputation.
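A minimal sketch of that advice, reusing sparkSession and filePath from the question (the predicate below is only an illustrative placeholder for your real rule; for eliminating exact duplicates, distinct() is another option):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

// The filter is applied lazily, line by line, while the file is being read --
// Spark never needs to materialize the whole file just to apply it.
JavaRDD<String> rdd = sparkSession.sparkContext()
        .textFile(filePath, 100)
        .toJavaRDD()
        .filter(line -> !line.trim().isEmpty());   // placeholder condition

// Persist the filtered RDD so later actions don't re-read and re-filter the file.
rdd.persist(StorageLevel.MEMORY_ONLY());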

Related

Read a 321 MB CSV file efficiently. I have used opencsv to read data from this 321 MB file, but it takes a lot of time to read and do further operations.

My job is to get data from a CSV between two dates. For this, I first load all the data into a list of beans, then loop through the list and prepare my result.
The problem is that in this process I have to loop through a lot of data.
I want a solution where I load only the necessary data from the CSV. Ultimately I want to reduce the number of operations I am doing.
beansExtractedFromCsv = new CsvToBeanBuilder(new FileReader(s3StorageLocation.concat(s3FileName))).withType(CsvExtractTemplate.class).build().parse();
In this line I am parsing all the data from the CSV.
dataWriteOnCSVOrDB(fromTime, toTime, false, beansExtractedFromCsv);
Here I am passing all the data extracted from the CSV to my method, where I loop through a lot of data to calculate my expected result.
There is another option in OpenCSV to read line by line. I haven't tried that, but a rough sketch of the idea follows.
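The sketch below streams beans instead of calling parse() (untested, as the answer says). CsvToBean is Iterable, so records are parsed lazily one at a time; CsvExtractTemplate, s3StorageLocation, s3FileName, fromTime and toTime come from the question, getTimestamp() is an assumed accessor on the bean, and exception handling is omitted.

import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import com.opencsv.bean.CsvToBean;
import com.opencsv.bean.CsvToBeanBuilder;

CsvToBean<CsvExtractTemplate> csvToBean = new CsvToBeanBuilder<CsvExtractTemplate>(
        new FileReader(s3StorageLocation.concat(s3FileName)))
        .withType(CsvExtractTemplate.class)
        .build();

List<CsvExtractTemplate> inRange = new ArrayList<>();
for (CsvExtractTemplate bean : csvToBean) {                  // parses one record at a time
    if (!bean.getTimestamp().isBefore(fromTime) && !bean.getTimestamp().isAfter(toTime)) {
        inRange.add(bean);                                   // keep only rows between the two dates
    }
}

This avoids materializing every row as a bean up front; only the rows inside the date range are kept.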

Deeplearning4j: How would I prepare this data for a RNN that uses LSTM?

In my code, I download data from a source to a CSV file, then apply a transformation process to it, after which it is written to a final CSV file. At this point, one row of my data looks like this:
45.414001,10358500,45.698002,44.728001,0.0
The first column is the data I want to predict, and the final column (the one with the 0s) is just a placeholder for now; it will be a double number. Using deeplearning4j, I then load this data from the CSV file into a record reader. Here is what that looks like:
RecordReader recordReader = new CSVRecordReader(numSkipLines);
recordReader.initialize(new FileSplit(inputPath));
So my question is: what should I do next? I want to use this data with an RNN LSTM model, which will predict the first column one step into the future.
Usually, depending on how much data/rows you have, you separate the data into training and testing sets. The testing set can usually be a bit smaller than the training set, since testing is just to see whether your model predicts effectively.
The training data should be split further into a smaller training set and a validation set. You can use the validation set to see after how many epochs/rounds of training you begin to overfit/underfit. You want to train up to the point where the model just begins to overfit.
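As a rough sketch of that split (class names follow the DataVec/ND4J APIs; the column indices assume the five-column row shown in the question, the file path and batch size are illustrative, and shaping the rows into actual sequences for the LSTM is not covered here):

import java.io.File;
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.SplitTestAndTrain;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

int numSkipLines = 0;                                          // adjust if the CSV has a header
RecordReader recordReader = new CSVRecordReader(numSkipLines);
recordReader.initialize(new FileSplit(new File("final.csv"))); // illustrative path

// Column 0 is the regression target, so labelIndexFrom = labelIndexTo = 0 and regression = true.
// The batch size here is chosen large enough to pull in the whole (small) file at once.
DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader, 10000, 0, 0, true);
DataSet allData = iterator.next();

// Keep the rows in time order (no shuffle) and hold out the last 20% for testing.
SplitTestAndTrain split = allData.splitTestAndTrain(0.8);
DataSet trainData = split.getTrain();
DataSet testData = split.getTest();
// trainData can be split the same way again to obtain the validation set mentioned above.

Note that recordReader.initialize() throws checked exceptions, so this sketch belongs inside a method that declares or handles them.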

Apache Spark: Issues with saveAsTextFile() and filter()

When I try to use the function saveAsTextFile() I always get empty files, even though the RDD contains tuples:
myRDD.saveAsTextFile("C:/Users/pc/Desktop/chna.txt");
What can be the reason?
Let's assume that it works and the data is written to the text file: how can I retrieve it through the shell or through my code (note: I am using Java)?
Does any solution exist to modify a text file through my code (always using Java)? I tried the following code but got a java.io.NotSerializableException; is there any other possible solution?
// The writer below is captured by the foreach lambda and shipped to the executors,
// which is what triggers the java.io.NotSerializableException.
BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter("C:/Users/pc/Desktop/chn.txt", true));
pairsRDD.foreach(x -> bufferedWriter.write(x._1 + " " + x._2));
bufferedWriter.newLine(); // ...
bufferedWriter.close();
When I used this line of code:
JavaPairRDD<Integer, String> filterRDD = pairsRDD.filter((x,y) -> (x._1.equals(y._1))&&(x._2.equals(y._2)))));
I got an IOException. Is it caused by the RDD being empty, or is the condition used for the filter wrong?
How can I fix this problem, and what is the reason for it?
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
When I create the RDD, it takes the first line (the field names) too; how can I avoid this? I want to take only the lines which contain values.
saveAsTextFile() takes a path to a folder as its parameter, not a path to a file. It will actually write one file per partition in that folder, named part-r-xxxxx (xxxxx running from 00000 up to however many partitions you have).
To read your data again, it's as simple as using the sparkContext.textFile() or wholeTextFiles() methods (depending on whether you want to read a single file or a full folder), as sketched below.
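A minimal sketch of both halves, assuming a JavaSparkContext sc and the pairsRDD from the question (the output path is illustrative and must point to a folder that does not yet exist):

// Write: one part-xxxxx file per partition is created inside this folder.
pairsRDD.map(t -> t._1 + " " + t._2)
        .saveAsTextFile("C:/Users/pc/Desktop/chna_out");

// Read it back later: textFile() accepts the folder and reads every part file in it.
JavaRDD<String> lines = sc.textFile("C:/Users/pc/Desktop/chna_out");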
There's no simple solution in Spark to modify a file in place, since you don't control the naming of whatever Spark writes, and Spark refuses to write into a non-empty folder in the first place.
If you really want to do that, the best thing is not to use Spark at all, since this isn't a distributed-computing problem, and to use e.g. sed or awk for in-place file editing, which will be orders of magnitude more performant, and a one-liner.

Reading large images from HDFS in mapreduce

There is a very large image (~200MB) in HDFS (block size 64MB). I want to know the following:
How do I read the image in a MapReduce job?
Many topics suggest WholeFileInputFormat. Is there any other alternative, and how do I do it?
When WholeFileInputFormat is used, will there be any parallel processing of the blocks? I guess not.
If your block size is 64 MB, HDFS has most probably split your image file into chunks and replicated them across the cluster, depending on your cluster configuration.
Assuming that you want to process your image file as one record rather than as multiple blocks / line by line, here are a few options I can think of for processing the image file as a whole.
You can implement a custom input format and a record reader. The isSplitable() method in the input format should return false. The RecordReader.next(LongWritable pos, RecType val) method should read the entire file and set val to the file contents. This will ensure that the entire file goes to one map task as a single record.
You can subclass the input format and override the isSplitable() method so that it returns false. This example shows how to create a subclass of SequenceFileInputFormat to implement a NonSplittableSequenceFileInputFormat; a sketch is shown below.
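A minimal sketch of that subclass (the key/value types here are an illustrative choice; adapt them to how the image data is actually stored in your sequence files):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class NonSplittableSequenceFileInputFormat
        extends SequenceFileInputFormat<NullWritable, BytesWritable> {

    // Returning false forces one split per file, so the whole image is handed
    // to a single map task instead of being cut at block boundaries.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}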
I guess it depends on what type of processing you want to perform. If you are trying to do something that can be done by first splitting the big input into smaller image files, processing those blocks independently, and finally stitching the output parts back into one large final output, then it may be possible. I'm no image expert, but suppose you want to turn a color image into grayscale: you could cut the large image into small images, convert them in parallel using MR, and once the mappers are done, stitch them back into one large grayscale image.
If you understand the format of the image, then you can write your own record reader to help the framework understand the record boundaries and prevent corruption when they are fed to the mappers.
Although you can use WholeFileInputFormat or SequenceFileInputFormat or something custom to read the image file, the actual issue (in my view) is drawing something out of the file once it is read. OK, you have read the file, now what? How are you going to process your image to detect any object inside your mapper? I'm not saying it's impossible, but it would require a lot of work to be done.
IMHO, you are better off using something like HIPI. HIPI provides an API for performing image processing tasks on top of MapReduce framework.
Edit:
If you really want to do it your way, then you need to write a custom InputFormat. Since images are not like text files, you can't use delimiters like \n for split creation. One possible workaround is to create splits based on some given number of bytes. For example, if your image file is 200 MB, you could write an InputFormat which creates splits of 100 MB (or whatever you give as a parameter in your Job configuration); a rough sketch of the split-size idea follows. I faced such a scenario long ago while dealing with some binary files, and this project helped me a lot.
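As a simplified stand-in for a full custom InputFormat, the split size can also be capped through the stock FileInputFormat setting (the job name is illustrative and a Configuration conf is assumed to already exist):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Job job = Job.getInstance(conf, "image-processing");
// Cap every input split at 100 MB, so a 200 MB file yields two map tasks.
FileInputFormat.setMaxInputSplitSize(job, 100L * 1024 * 1024);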
HTH

How do I output whole files from a map job?

This is a basic question about mapreduce outputs.
I'm trying to create a map function that takes in an XML file and makes a PDF using Apache FOP. However, I'm a little confused as to how to output it, since I know that it goes out as a (key, value) pair.
I'm also not using streaming to do this.
The point of map-reduce is to tackle large amounts of data that would usually not fit in memory, so input and output are usually stored on disk somehow (a.k.a. files).
Input and output must be specified in key-value format:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
I have not tried this, but this is what I would do:
Write the output of the mapper in this form: the key is the file name as Text (keep the file name unique) and the value is the output of fop, written using TextOutputFormat. A rough sketch of that shape is below.
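A rough sketch of that mapper (renderPdf is a placeholder for your existing FOP code, and Base64 is used here only as one illustrative way to push binary PDF bytes through a Text value):

import java.io.IOException;
import java.util.Base64;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class XmlToPdfMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text xmlChunk, Context context)
            throws IOException, InterruptedException {
        // Use the current input file's name as the key so each PDF stays identifiable.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

        byte[] pdfBytes = renderPdf(xmlChunk.toString());   // placeholder for the FOP call
        context.write(new Text(fileName), new Text(Base64.getEncoder().encodeToString(pdfBytes)));
    }

    private byte[] renderPdf(String xml) {
        // The existing Apache FOP transformation would go here.
        return new byte[0];
    }
}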
Suggestion:
I am assuming that your use case is just reading the input XML (maybe doing some operation on its data) and writing the data to PDF files using fop. I don't think this is a Hadoop use case in the first place, because whatever you want to do can be done by a batch script. How big are your XML files? How many XML files do you have to process?
EDIT:
SequenceFileOutputFormat will write to a SequenceFile. A SequenceFile has its own headers and other metadata along with the text that is stored, and it stores data in the form of key:value pairs.
SequenceFile Common Header
version - A byte array: 3 bytes of magic header 'SEQ', followed by 1 byte of actual version no. (e.g. SEQ4 or SEQ6)
keyClassName - String
valueClassName - String
compression - A boolean which specifies if compression is turned on for keys/values in this file.
blockCompression - A boolean which specifies if block compression is turned on for keys/values in this file.
compressor class - The classname of the CompressionCodec which is used to compress/decompress keys and/or values in this SequenceFile (if compression is enabled).
metadata - SequenceFile.Metadata for this file (key/value pairs)
sync - A sync marker to denote end of the header.
Using SequenceFile will ruin your application, as you will end up with corrupted output PDF files. Try it out and see for yourself.
You have lots of input files, and this is where Hadoop sucks (read this). Still, I feel that you can do your desired operation using a script that invokes fop on every document one by one. If you have multiple nodes, run the same script on different subsets of the input documents. Trust me, this will run FASTER than Hadoop, considering the overhead involved in creating maps and reduces (you don't need reduces, I know).
