Passing value between two separate MapReduce jobs - java

I have a Hadoop program in which I need to pass a single output value, generated by the first MapReduce job, to a second MapReduce job.
Example:
MapReduce 1 -> writes a double value to HDFS (the file name is similar to part-00000).
In the second MapReduce job I want to use the double value from the part-00000 file.
How can I do this? Can anyone please give a code snippet?

Wait for the first job to finish, then run the second one on the output of the first. You can do it:
1) In the Driver (a sketch for reading the double value from part-00000 follows this list):
boolean firstJobSucceeded = firstJob.waitForCompletion(true);
if (firstJobSucceeded) {
    Job secondJob = new Job(new Configuration(), "JobChaining-Second");
    TextInputFormat.addInputPath(secondJob, outputDirOfFirstJob);
    ...
}
2) Use JobControl and ControlledJob:
http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html
To use JobControl, start by wrapping your jobs with ControlledJob. Doing this is
relatively simple: you create your job like you usually would, except you also create a
ControlledJob that takes in your Job or Configuration as a parameter, along with a
list of its dependencies (other ControlledJobs). Then, you add them one-by-one to the
JobControl object, which handles the rest.
3) Externally (e.g. from a shell script), passing the input/output paths as arguments.
4) Use Apache Oozie. You will specify your jobs in XML.
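For the value passing itself, a common pattern is to read part-00000 in the driver with the HDFS FileSystem API and hand the value to the second job through its Configuration. This is only a minimal sketch: it assumes the first job wrote the double as a single line in part-00000, and the property name "first.job.result" is made up for illustration.
// Needs: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.{FileSystem, Path},
// org.apache.hadoop.mapreduce.Job, java.io.{BufferedReader, InputStreamReader}
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path firstJobResult = new Path(outputDirOfFirstJob, "part-00000");

double value;
try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(firstJobResult)))) {
    // Assumes the first job wrote a single line containing only the double.
    value = Double.parseDouble(reader.readLine().trim());
}

Job secondJob = new Job(conf, "JobChaining-Second");
// The second job's mappers/reducers can read this back in setup() with
// context.getConfiguration().get("first.job.result").
secondJob.getConfiguration().set("first.job.result", String.valueOf(value));
Passing a small scalar through the Configuration avoids having every task of the second job re-read the first job's output file.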

Related

How to process multiple files separately after SparkContext.wholeTextFiles?

I'm trying to use wholeTextFiles to read all the file names in a folder and process them one by one separately (for example, I'm trying to get the SVD vector of each data set, and there are 100 sets in total). The data are saved in .txt files, split by spaces and arranged on different lines (like a matrix).
The problem I came across is that after I use wholeTextFiles("path with all the text files"), it's really difficult to read and parse the data, and I can't use the method I used when reading only one file. That method works fine when I read just one file and gives me the correct output. Could someone please let me know how to fix it here? Thanks!
public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("whole text files").setMaster("local[2]").set("spark.executor.memory", "1g");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);
    JavaPairRDD<String, String> fileNameContentsRDD = jsc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/");
    JavaRDD<String[]> lineCounts = fileNameContentsRDD.map(new Function<Tuple2<String, String>, String[]>() {
        @Override
        public String[] call(Tuple2<String, String> fileNameContent) throws Exception {
            String content = fileNameContent._2();
            String[] sarray = content.split(" ");
            double[] values = new double[sarray.length];
            for (int i = 0; i < sarray.length; i++) {
                values[i] = Double.parseDouble(sarray[i]);
            }
            pd.cache();
            RowMatrix mat = new RowMatrix(pd.rdd());
            SingularValueDecomposition<RowMatrix, Matrix> svd = mat.computeSVD(84, true, 1.0E-9d);
            Vector s = svd.s();
        }
    });
}
Quoting the scaladoc of SparkContext.wholeTextFiles:
wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)] Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
In other words, wholeTextFiles might simply not be what you want.
Since by design "Small files are preferred" (see the scaladoc), you could use mapPartitions or collect (with filter) to grab a subset of the files and apply the parsing to them.
Once you have the files per partitions in your hands, you could use Scala's Parallel Collection API and schedule Spark jobs to execute in parallel:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
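As a rough illustration of that suggestion (not the asker's original method), the sketch below collects the (path, content) pairs to the driver and then computes one SVD per file, submitting a separate small Spark job from each thread of a parallel stream. The directory path and k = 84 are taken from the question; the names and the assumption that matrix rows are newline-separated are mine.
// Needs: org.apache.spark.mllib.linalg.{Vector, Vectors, Matrix, SingularValueDecomposition},
// org.apache.spark.mllib.linalg.distributed.RowMatrix, scala.Tuple2, java.util.*
List<Tuple2<String, String>> files =
        jsc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/").collect();

files.parallelStream().forEach(fileNameContent -> {
    String path = fileNameContent._1();
    String[] lines = fileNameContent._2().split("\n");

    // Parse each line of the file into one dense row of the matrix.
    List<Vector> rows = new ArrayList<>();
    for (String line : lines) {
        String[] tokens = line.trim().split(" ");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        rows.add(Vectors.dense(values));
    }

    // Each file becomes its own small RowMatrix; each computeSVD call is a separate Spark job,
    // and the jobs run concurrently because they are submitted from separate threads.
    RowMatrix mat = new RowMatrix(jsc.parallelize(rows).rdd());
    SingularValueDecomposition<RowMatrix, Matrix> svd = mat.computeSVD(84, true, 1.0E-9d);
    System.out.println(path + " -> singular values: " + svd.s());
});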

Is it possible to execute a command on all workers within Apache Spark?

I have a situation where I want to execute a system process on each worker within Spark. I want this process to be run on each machine once. Specifically, this process starts a daemon which has to be running before the rest of my program executes. Ideally this should execute before I've read any data in.
I'm on Spark 2.0.2 and using dynamic allocation.
You may be able to achieve this with a combination of a lazy val and a Spark broadcast. It will be something like the following. (I have not compiled the code below; you may have to change a few things.)
object ProcessManager {
  lazy val start = {
    // start your process here.
  }
}
You can broadcast this object at the start of your application before you do any transformations.
val pm = sc.broadcast(ProcessManager)
Now, you can access this object inside your transformation like you do with any other broadcast variables and invoke the lazy val.
rdd.mapPartitions(itr => {
  pm.value.start
  // Other stuff here.
  itr
})
An object with static initialization which invokes your system process should do the trick.
object SparkStandIn extends App {
  object invokeSystemProcess {
    import sys.process._
    val errorCode = "echo Whatever you put in this object should be executed once per jvm".!
    def doIt(): Unit = {
      // This object is constructed once per JVM, but objects are lazy in Scala,
      // so the body only runs the first time the object is accessed.
      // Another way to make sure instantiation happened is to check that errorCode does not represent an error.
    }
  }
  invokeSystemProcess.doIt()
  invokeSystemProcess.doIt() // even if doIt is invoked multiple times, the static initialization happens once
}
A specific answer for a specific use case: I have a cluster with 50 nodes and I wanted to know which ones have the CET timezone set:
(1 until 100).toSeq.toDS.
  mapPartitions(itr => {
    sys.process.Process(
      Seq("bash", "-c", "echo $(hostname && date)")
    ).
      lines.
      toIterator
  }).
  collect().
  filter(_.contains(" CET ")).
  distinct.
  sorted.
  foreach(println)
Notice that I don't think it's 100% guaranteed you'll get a partition on every node, so the command might not run on every node, even when using a 100-element Dataset on a cluster with 50 nodes, as in the previous example.

Save a spark RDD using mapPartition with iterator

I have some intermediate data that I need to store in HDFS and locally as well. I'm using Spark 1.6. In HDFS, as the intermediate form, I'm getting data in /output/testDummy/part-00000 and /output/testDummy/part-00001. I want to save these partitions locally using Java/Scala, either as /users/home/indexes/index.nt (by merging both locally) or as /users/home/indexes/index-0000.nt and /home/indexes/index-0001.nt separately.
Here is my code:
Note: testDummy is the same as test; the output has two partitions. I want to store them separately or combined, but locally, in an index.nt file. I would prefer to store them separately on two data nodes. I'm using a cluster and submit the Spark job on YARN. I also added some comments about how many times things run and what data I'm getting. How could I do this? Any help is appreciated.
val testDummy = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).saveAsTextFile(outputFilePathForHDFS + "/testDummy")
println("testDummy done") //1 time print

def savesData(iterator: Iterator[(String)]): Iterator[(String)] = {
  println("Inside savesData") // now 4 times when coalesce(Constants.INITIAL_PARTITIONS)=2
  println("iter size" + iterator.size) // 2 735 2 735 values
  val filenamesWithExtension = outputPath + "/index.nt"
  println("filenamesWithExtension " + filenamesWithExtension.length) //4 times
  var list = List[(String)]()
  val fileWritter = new FileWriter(filenamesWithExtension, true)
  val bufferWritter = new BufferedWriter(fileWritter)
  while (iterator.hasNext) { //iterator.hasNext is false
    println("inside iterator") //0 times
    val dat = iterator.next()
    println("datadata " + iterator.next())
    bufferWritter.write(dat + "\n")
    bufferWritter.flush()
    println("index files written")
    val dataElements = dat.split(" ")
    println("dataElements") //0
    list = list.::(dataElements(0))
    list = list.::(dataElements(1))
    list = list.::(dataElements(2))
  }
  bufferWritter.close() //closing
  println("savesData method end") //4 times when coal=2
  list.iterator
}

println("before saving data into local") //1
val test = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).mapPartitions(savesData)
println("testRDD partitions " + test.getNumPartitions) //2
println("testRDD size " + test.collect().length) //0
println("after saving data into local") //1
PS: I followed this and this, but it is not exactly what I'm looking for. I did it somehow, but I'm not getting anything in index.nt.
A couple of things:
Never call Iterator.size if you plan to use the data later. Iterators are TraversableOnce. The only way to compute an Iterator's size is to traverse all its elements, and after that there is no more data to be read.
Don't use transformations like mapPartitions for side effects; it is bad practice and doesn't guarantee that a given piece of code will be executed only once. If you want to perform some type of IO, use actions like foreach / foreachPartition instead.
A local path inside an action or transformation is a local path on the particular worker. If you want to write directly on the client machine, you should first fetch the data with collect or toLocalIterator. It could be better, though, to write to distributed storage and fetch the data later.
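A minimal sketch of that last point, written against the Java API with made-up names (rows stands for the asker's coalesced RDD of lines): keep the distributed copy with saveAsTextFile, then stream the rows to the client machine with toLocalIterator and append them to a single local index.nt there, instead of doing the IO inside mapPartitions.
// Driver-side code. Needs: org.apache.spark.api.java.JavaRDD, java.io.*, java.util.Iterator
JavaRDD<String> rows = ...; // e.g. the coalesced RDD exposed through the Java API

// Keep the distributed copy in HDFS.
rows.saveAsTextFile("/output/testDummy");

// Stream the rows to the client machine and append them to one local file.
try (BufferedWriter writer = new BufferedWriter(new FileWriter("/users/home/indexes/index.nt", true))) {
    Iterator<String> it = rows.toLocalIterator();
    while (it.hasNext()) {
        writer.write(it.next());
        writer.newLine();
    }
}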
Java 7 provides means to watch directories.
https://docs.oracle.com/javase/tutorial/essential/io/notification.html
The idea is to create a watch service, register it with the directory of interest (mentioning the events you are interested in, such as file creation or deletion), and then watch: you will be notified of any such events and can take whatever action you want.
You will have to depend on the Java HDFS API heavily wherever applicable.
Run the program in the background, since it waits for events forever. (You can write logic to quit after you have done whatever you want.)
On the other hand, shell scripting will also help.
Be aware of the coherency model of the HDFS file system while reading files.
Hope this helps with some ideas.
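As a minimal sketch of that idea for a local directory (the path is a placeholder; watching HDFS paths would instead go through the HDFS API mentioned above):
import java.nio.file.*;

public class PartFileWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/users/home/indexes"); // placeholder local directory
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watcher.take(); // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                Path created = (Path) event.context();
                System.out.println("New file: " + dir.resolve(created));
                // e.g. merge/copy the newly created part file here
            }
            if (!key.reset()) {
                break; // the watched directory is no longer accessible
            }
        }
    }
}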

Pentaho SDK, how to define a text file input

I'm trying to define a Pentaho Kettle (ktr) transformation via code. I would like to add to the transformation a Text File Input Step: http://wiki.pentaho.com/display/EAI/Text+File+Input.
I don't know how to do this (note that I want to achieve the result in a custom Java application, not using the standard Spoon GUI). I think I should use the TextFileInputMeta class, but when I try to define the filename, the transformation doesn't work anymore (it appears empty in Spoon).
This is the code I'm using. I think the third line has something wrong:
PluginRegistry registry = PluginRegistry.getInstance();
TextFileInputMeta fileInMeta = new TextFileInputMeta();
fileInMeta.setFileName(new String[] {myFileName});
String fileInPluginId = registry.getPluginId(StepPluginType.class, fileInMeta);
StepMeta fileInStepMeta = new StepMeta(fileInPluginId, myStepName, fileInMeta);
fileInStepMeta.setDraw(true);
fileInStepMeta.setLocation(100, 200);
transAWMMeta.addStep(fileInStepMeta);
To run a transformation programmatically, you should do the following:
Initialise Kettle
Prepare a TransMeta object
Prepare your steps
Don't forget about Meta and Data objects!
Add them to TransMeta
Create Trans and run it
By default, each transformation spawns a thread per step, so use trans.waitUntilFinished() to force your thread to wait until execution completes
Pick up the execution's results if necessary
Use this test as example: https://github.com/pentaho/pentaho-kettle/blob/master/test/org/pentaho/di/trans/steps/textfileinput/TextFileInputTests.java
Also, I would recommend creating the transformation manually and loading it from a file, if that is acceptable for your circumstances. This will help avoid lots of boilerplate code. It is quite easy to run transformations in this case; see an example here: https://github.com/pentaho/pentaho-kettle/blob/master/test/org/pentaho/di/TestUtilities.java#L346
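A rough sketch of that load-and-run approach (the .ktr path is a placeholder; the class names come from the pentaho-kettle libraries, so check them against the version you build with):
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransformation {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init(); // initialise Kettle plugins and environment

        TransMeta transMeta = new TransMeta("/path/to/my-transformation.ktr"); // placeholder path
        Trans trans = new Trans(transMeta);

        trans.execute(null);          // no extra command-line parameters
        trans.waitUntilFinished();    // each step runs in its own thread, so wait for all of them

        if (trans.getErrors() > 0) {
            System.err.println("Transformation finished with " + trans.getErrors() + " error(s)");
        }
    }
}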

Retrieving Jobs and/or Process

Is there an easy way to retrieve a job and check e.g. the status with Play?
I have a few encoding jobs/downloading jobs which run for a long time. In some cases I want to cancel them.
Is there a way to retrieve a list of Jobs or something?
E.g. one Job calls the FFMPEG encoder using the ProcessBuilder. I would like to be able to get this job and kill the process if it is no longer required (e.g. the wrong file was uploaded and I don't want to wait an hour for it to finish). If I can get a handle to that Job, then I can get to the process as well.
I am using Play 1.2.4
See JobsPlugin.java to learn how to list all the scheduledJobs.
Getting the task currently being executed is more tricky, but you can find your jobs in the JobsPlugin.scheduledJobs list by checking the Job class, and then call a method on your custom Job to tell it to cancel.
Something like
for (Job<?> job : JobsPlugin.scheduledJobs) {
    if (job instanceof MyJob) {
        ((MyJob) job).cancelWork();
    }
}
where cancelWork is your custom method
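For the FFMPEG case, cancelWork could simply destroy the Process handle the job keeps. A minimal sketch (class name, field, and FFMPEG arguments are made up; error handling omitted):
import play.jobs.Job;

public class MyJob extends Job {
    private volatile Process ffmpegProcess;

    @Override
    public void doJob() throws Exception {
        // placeholder FFMPEG invocation
        ProcessBuilder pb = new ProcessBuilder("ffmpeg", "-i", "input.avi", "output.mp4");
        ffmpegProcess = pb.start();
        ffmpegProcess.waitFor(); // the job blocks here until FFMPEG finishes or is killed
    }

    public void cancelWork() {
        Process p = ffmpegProcess;
        if (p != null) {
            p.destroy(); // kills FFMPEG; waitFor() in doJob() then returns
        }
    }
}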
