Hadoop: Change number of reducers at runtime - java

Assume the following scenario: a set of dependent jobs is sent to Hadoop. Hadoop executes the first one, then the second one that depends on the first, and so on. The jobs are submitted in one go using JobControl (see code below).
Using Hadoop 2.x (in Java), is it possible to change the number of reducers of a job at runtime? More specifically, how can I change the number of reducers in job 2 after job 1 has been executed?
Also, is there a way to let Hadoop automatically infer the number of reducers by estimating the map output? It always uses 1, and I cannot find a way to change that default (other than explicitly setting the number myself).
// 1. create JobControl
JobControl jc = new JobControl(name);
// 2. add all the controlled jobs to the job control
// note that this is done in one go by using a collection
jc.addJobCollection(jobs);
// 3. execute the jobcontrol in a Thread
Thread workflowThread = new Thread(jc, "Thread_" + name);
workflowThread.setDaemon(true); // will not prevent the JVM from shutting down
workflowThread.start();
// 4. we wait for it to complete
LOG.info("Waiting for thread to complete: " + workflowThread.getName());
while (!jc.allFinished()) {
    Thread.sleep(REFRESH_WAIT);
}

Your first question: yes, you can set the number of reducers of job 2 after job 1 has executed, in your driver program:
Job job1 = Job.getInstance(conf, "job 1");
// your job 1 setup here
// ...
job1.submit();
job1.waitForCompletion(true); // block until job 1 finishes

int job2Reducers = ... // compute based on job1 results here

Job job2 = Job.getInstance(conf, "job 2");
job2.setNumReduceTasks(job2Reducers);
// your job 2 setup here
// ...
job2.submit();
job2.waitForCompletion(true);
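To make the "compute based on job1 results here" placeholder concrete, here is a minimal sketch that scales job 2's reducer count with the number of records emitted by job 1's mappers. The counter API is standard Hadoop 2.x; the records-per-reducer target is an assumption you would tune for your data and cluster:
import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

// Sketch: derive job 2's reducer count from job 1's map output volume.
// RECORDS_PER_REDUCER is an assumed tuning knob, not a Hadoop default.
static int computeJob2Reducers(Job job1) throws Exception {
    final long RECORDS_PER_REDUCER = 1_000_000L; // assumption: tune for your cluster
    long mapOutputRecords = job1.getCounters()
            .findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
    return (int) Math.max(1, mapOutputRecords / RECORDS_PER_REDUCER);
}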
Second question: to my knowledge, no, you can't make Hadoop automatically choose the number of reducers based on your mapper load.

The number of map tasks is usually driven by the number of DFS blocks in the input files, which is why people sometimes adjust their DFS block size to tune the number of maps. The number of reduce tasks can be set using the same logic. To make the reducer count dynamic, I wrote logic that sets the number of reduce tasks at runtime to match the number of map tasks.
In Java code:
long defaultBlockSize = 0;
int NumOfReduce = 10; // default value; you can use any number
long inputFileLength = 0;
try {
    FileSystem fileSystem = FileSystem.get(this.getConf()); // HDFS file system
    // total length of the input files stored in the HDFS input location
    inputFileLength = fileSystem.getContentSummary(
            new Path(PROP_HDFS_INPUT_LOCATION)).getLength();
    // default block size for that path
    defaultBlockSize = fileSystem.getDefaultBlockSize(
            new Path(PROP_HDFS_INPUT_LOCATION));
    if (inputFileLength > 0 && defaultBlockSize > 0) {
        // roughly two reduce tasks per input block
        NumOfReduce = (int) (((inputFileLength / defaultBlockSize) + 1) * 2);
    }
    System.out.println("NumOfReduce : " + NumOfReduce);
} catch (Exception e) {
    LOGGER.error("Exception while computing the number of reducers", e);
}
job.setNumReduceTasks(NumOfReduce);

Related

Does Flink streaming have a cache/persist feature (like Spark)?

I have a Flink streaming program that has branch processing logic after a long transformation. Will the long transformation be executed multiple times? Pseudo code:
env = getEnvironment();
DataStream<Event> inputStream = getInputStream();
tempStream = inputStream.map(very_heavy_computation_func);
output1 = tempStream.map(func1);
output1.addSink(sink1);
output2 = tempStream.map(func2);
output2.addSink(sink2);
env.execute();
Questions:
How many times would inputStream.map(very_heavy_computation_func) be executed?
Once or twice?
If twice, how can I cache tempStream (or use some other method) to avoid the earlier transformation being executed multiple times?
You can actually answer (1) easily by just trying out more or less exactly your example:
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;

public class TestProgram {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        SingleOutputStreamOperator<Integer> stream = env.fromElements(1, 2, 3)
                .map(i -> {
                    System.out.println("Executed expensive computation for: " + i);
                    return i;
                });
        stream.map(i -> i).addSink(new PrintSinkFunction<>());
        stream.map(i -> i).addSink(new PrintSinkFunction<>());
        env.execute();
    }
}
produces (on my machine, for example):
Executed expensive computation for: 3
Executed expensive computation for: 1
Executed expensive computation for: 2
9> 3
8> 2
8> 2
9> 3
7> 1
7> 1
You can also find a more technical answer here which explains how records are replicated to downstream operators, rather than running the source/operator multiple times.

Apache Flink: The execution environment and multiple sink

My question might cause some confusion, so please read the Description first; it might help identify my problem. I will add my code at the end of the question (any suggestions regarding my code structure/implementation are also welcome).
Thank you for any help in advance!
My question:
How do I define multiple sinks in Flink batch processing without having it read the data from one source repeatedly?
What is the difference between createCollectionsEnvironment() and getExecutionEnvironment()? Which one should I use in a local environment?
What is the use of env.execute()? My code outputs the result without this statement; if I add it, it throws an exception:
Exception in thread "main" java.lang.RuntimeException: No new data sinks have been defined since the last execution. The last execution refers to the latest call to 'execute()', 'count()', 'collect()', or 'print()'.
at org.apache.flink.api.java.ExecutionEnvironment.createProgramPlan(ExecutionEnvironment.java:940)
at org.apache.flink.api.java.ExecutionEnvironment.createProgramPlan(ExecutionEnvironment.java:922)
at org.apache.flink.api.java.CollectionEnvironment.execute(CollectionEnvironment.java:34)
at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:816)
at MainClass.main(MainClass.java:114)
Description:
I am new to programming. Recently I needed to process some data (grouping data, calculating standard deviation, etc.) using Flink batch processing.
However, I came to a point where I need to output two DataSets.
The structure was something like this:
From Source (database) -> DataSet 1 (add index using zipWithIndex()) -> DataSet 2 (do some calculation while keeping the index) -> DataSet 3
First I output DataSet 2; the index is, e.g., from 1 to 10000.
Then I output DataSet 3 and the index becomes 10001 to 20000, although I did not change the values in any function.
My guess is that when outputting DataSet 3, instead of using the result of the previously calculated DataSet 2, it starts getting data from the database again and then performs the calculation.
With the use of the zipWithIndex() function, it not only gives the wrong index numbers but also increases the number of connections to the database.
I guess this is related to the execution environment: when I use
ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();
I get the "wrong" index numbers (10001-20000),
and
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
gives the correct index numbers (1-10000).
The time taken and the number of database connections are different, and the order of the printed output is reversed.
OS, DB, other environment details and versions:
IntelliJ IDEA 2017.3.5 (Community Edition)
Build #IC-173.4674.33, built on March 6, 2018
JRE: 1.8.0_152-release-1024-b15 amd64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
Windows 10 10.0
My test code (Java):
public static void main(String[] args) throws Exception {
ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();
//Table is used to calculate the standard deviation as I figured that there is no such calculation in DataSet.
BatchTableEnvironment tableEnvironment = TableEnvironment.getTableEnvironment(env);
//Get Data from a mySql database
DataSet<Row> dbData =
env.createInput(
JDBCInputFormat.buildJDBCInputFormat()
.setDrivername("com.mysql.cj.jdbc.Driver")
.setDBUrl($database_url)
.setQuery("select value from $table_name where id =33")
.setUsername("username")
.setPassword("password")
.setRowTypeInfo(new RowTypeInfo(BasicTypeInfo.DOUBLE_TYPE_INFO))
.finish()
);
// Add index for assigning group (group capacity is 5)
DataSet<Tuple2<Long, Row>> indexedData = DataSetUtils.zipWithIndex(dbData);
// Replace index(long) with group number(int), and convert Row to double at the same time
DataSet<Tuple2<Integer, Double>> rawData = indexedData.flatMap(new GroupAssigner());
//Using groupBy() to combine individual data of each group into a list, while calculating the mean and range in each group
//put them into a POJO named GroupDataClass
DataSet<GroupDataClass> groupDS = rawData.groupBy("f0").combineGroup(new GroupCombineFunction<Tuple2<Integer, Double>, GroupDataClass>() {
@Override
public void combine(Iterable<Tuple2<Integer, Double>> iterable, Collector<GroupDataClass> collector) {
Iterator<Tuple2<Integer, Double>> it = iterable.iterator();
Tuple2<Integer, Double> var1 = it.next();
int groupNum = var1.f0;
// Using max and min to calculate range, using i and sum to calculate mean
double max = var1.f1;
double min = max;
double sum = 0;
int i = 1;
// The list is to store individual value
List<Double> list = new ArrayList<>();
list.add(max);
while (it.hasNext())
{
double next = it.next().f1;
sum += next;
i++;
max = next > max ? next : max;
min = next < min ? next : min;
list.add(next);
}
//Store group number, mean, range, and 5 individual values within the group
collector.collect(new GroupDataClass(groupNum, sum / i, max - min, list));
}
});
//print because if no sink is created, Flink will not even perform the calculation.
groupDS.print();
// Get the max group number and range in each group to calculate average range
// if group number start with 1 then the maximum of group number equals to the number of group
// However, because this is the second sink, data will flow from source again, which will double the group number
DataSet<Tuple2<Integer, Double>> rangeDS = groupDS.map(new MapFunction<GroupDataClass, Tuple2<Integer, Double>>() {
@Override
public Tuple2<Integer, Double> map(GroupDataClass in) {
return new Tuple2<>(in.groupNum, in.range);
}
}).max(0).andSum(1);
// collect and print because if no sink is created, Flink will not even perform the calculation.
Tuple2<Integer, Double> rangeTuple = rangeDS.collect().get(0);
double range = rangeTuple.f1/ rangeTuple.f0;
System.out.println("range = " + range);
}
public static class GroupAssigner implements FlatMapFunction<Tuple2<Long, Row>, Tuple2<Integer, Double>> {
@Override
public void flatMap(Tuple2<Long, Row> input, Collector<Tuple2<Integer, Double>> out) {
// index 1-5 will be assigned to group 1, index 6-10 will be assigned to group 2, etc.
int n = new Long(input.f0 / 5).intValue() + 1;
out.collect(new Tuple2<>(n, (Double) input.f1.getField(0)));
}
}
It's fine to connect a source to multiple sinks: the source gets executed only once and the records get broadcast to the multiple sinks. See this question: Can Flink write results into multiple files (like Hadoop's MultipleOutputFormat)?
getExecutionEnvironment() is the right way to get the environment when you want to run your job. createCollectionsEnvironment() is a good way to play around and test. See the documentation.
The exception's error message is very clear: if you call print or collect, your data flow gets executed. So you have two choices:
Either you call print/collect at the end of your data flow and it gets executed and printed. That's good for testing. Bear in mind that you can only call collect/print once per data flow, otherwise it gets executed several times while it is not completely defined.
Or you add a sink at the end of your data flow and call env.execute(). That's what you want to do once your flow is in a more mature shape; see the sketch below.
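As a rough illustration of the second option applied to the code above (assuming groupDS and rangeDS are built as in the question, but without the intermediate print()/collect() calls), both results can be written to file sinks and the plan executed once; the output paths here are placeholders:
// Sketch: two sinks on the same data flow, one execute() call.
// The JDBC source is then read once and its records feed both sinks.
groupDS.writeAsText("file:///tmp/groups");   // placeholder path
rangeDS.writeAsText("file:///tmp/ranges");   // placeholder path
env.execute("two-sink job");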

I'm getting different results every time I run my code

I'm using ELKI to cluster my data. I used KMeansLloyd<NumberVector> with k=3, and every time I run my Java code I get totally different clustering results. Is this normal, or is there something I should do to make my output nearly stable? Here is my code, which I got from the ELKI tutorials:
DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(a);
// Create a database (which may contain multiple relations!)
Database db = new StaticArrayDatabase(dbc, null);
// Load the data into the database (do NOT forget to initialize...)
db.initialize();
// Relation containing the number vectors:
Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD);
// We know that the ids must be a continuous range:
DBIDRange ids = (DBIDRange) rel.getDBIDs();
// K-means should be used with squared Euclidean (least squares):
//SquaredEuclideanDistanceFunction dist = SquaredEuclideanDistanceFunction.STATIC;
CosineDistanceFunction dist= CosineDistanceFunction.STATIC;
// Default initialization, using global random:
// To fix the random seed, use: new RandomFactory(seed);
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);
// Textbook k-means clustering:
KMeansLloyd<NumberVector> km = new KMeansLloyd<>(dist, //
3 /* k - number of partitions */, //
0 /* maximum number of iterations: no limit */, init);
// K-means will automatically choose a numerical relation from the data set:
// But we could make it explicit (if there were more than one numeric
// relation!): km.run(db, rel);
Clustering<KMeansModel> c = km.run(db);
// Output all clusters:
int i = 0;
for(Cluster<KMeansModel> clu : c.getAllClusters()) {
// K-means will name all clusters "Cluster" in lack of noise support:
System.out.println("#" + i + ": " + clu.getNameAutomatic());
System.out.println("Size: " + clu.size());
System.out.println("Center: " + clu.getModel().getPrototype().toString());
// Iterate over objects:
System.out.print("Objects: ");
for(DBIDIter it = clu.getIDs().iter(); it.valid(); it.advance()) {
// To get the vector use:
NumberVector v = rel.get(it);
// Offset within our DBID range: "line number"
final int offset = ids.getOffset(it);
System.out.print(v+" " + offset);
// Do NOT rely on using "internalGetIndex()" directly!
}
System.out.println();
++i;
}
I would say, since you are using RandomlyGeneratedInitialMeans:
Initialize k-means by generating random vectors (within the data sets value range).
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);
Yes, it is normal.
K-Means is supposed to be initialized randomly. It is desirable to get different results when running it multiple times.
If you don't want this, use a fixed random seed.
From the code you copy and pasted:
// To fix the random seed, use: new RandomFactory(seed);
That is exactly what you should do...
long seed = 0;
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(
new RandomFactory(seed));
This was too long for a comment. As @Idos stated, you are initializing your data randomly; that's why you're getting random results. Now the question is, how do you ensure the results are robust? Try this:
Run the algorithm N times. Each time, record the cluster membership for each observation. When you are finished, classify an observation into the cluster which contained it most often. For example, suppose you have 3 observations, 3 classes, and run the algorithm 3 times:
obs  R1  R2  R3
 1   A   A   B
 2   B   B   B
 3   C   B   B
Then you should classify obs1 as A since it was most often classified as A. Classify obs2 as B since it was always classified as B. And classify obs3 as B since it was most often classified as B by the algorithm. The results should become increasingly stable the more times you run the algorithm.
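A minimal sketch of that majority-vote step, assuming you have already collected the labels from each run into an int[numRuns][numObservations] array (the method name and array layout are hypothetical, and this glosses over the fact that raw k-means cluster IDs would first need to be aligned across runs):
import java.util.HashMap;
import java.util.Map;

// Sketch: consensus label per observation = the label it received most often.
// labels[r][i] = cluster label assigned to observation i in run r.
static int[] majorityVote(int[][] labels) {
    int numRuns = labels.length;
    int numObs = labels[0].length;
    int[] consensus = new int[numObs];
    for (int i = 0; i < numObs; i++) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int r = 0; r < numRuns; r++) {
            counts.merge(labels[r][i], 1, Integer::sum);
        }
        int best = labels[0][i];
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        consensus[i] = best;
    }
    return consensus;
}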

how to benchmark an infinite loop java nio watchservice program

I have an infinite polling loop using java.nio.file.WatchService looking for new files. Inside the loop I have a fixed thread pool executor service to process files concurrently.
As the polling service keeps running, how can I benchmark the time taken to process a batch of, say, 10/n files? I am able to time each file in the Runnable class, but how can I get the batch processing time?
Something like this should work:
// inside the listener for the WatchService
final MyTimer t = new MyTimer(); // records the current time, task count initialized to 0
for (Change c : allChanges) {
    t.incrementTaskCount(); // synchronized
    launchConcurrentProcess(c, t);
}

// inside your processor, after processing a change
t.decrementTaskCount(); // also synchronized

// inside MyTimer
public synchronized void decrementTaskCount() {
    totalTasks--;
    // depending on your benchmarking needs, you can do different things here
    // I am printing the max time only (= end of the last task), but min/max/avg may also be nice
    if (totalTasks == 0) {
        System.err.println("The time spent on this batch was "
                + (System.currentTimeMillis() - initialTime));
    }
}
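For completeness, here is a self-contained sketch of what such a MyTimer class could look like; the class and field names are taken from the snippet above and are hypothetical, not an existing library type:
// Counts outstanding tasks for one batch and prints the elapsed wall-clock
// time when the last task finishes.
public class MyTimer {
    private final long initialTime = System.currentTimeMillis(); // batch start time
    private int totalTasks = 0; // tasks launched but not yet finished

    public synchronized void incrementTaskCount() {
        totalTasks++;
    }

    public synchronized void decrementTaskCount() {
        totalTasks--;
        if (totalTasks == 0) {
            System.err.println("The time spent on this batch was "
                    + (System.currentTimeMillis() - initialTime) + " ms");
        }
    }
}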

Creating a large number of multiple outputs

I am creating a large number of output files, for example 500. I am getting an AlreadyBeingCreatedException, as shown below. The program recovers by itself when the number of output files is small: with, say, 50 files, the program starts running successfully after printing this exception several times.
But for many files, it eventually fails with an IOException.
I have pasted the error and then the code below:
12/10/29 15:47:27 INFO mapred.JobClient: Task Id : attempt_201210231820_0235_r_000004_3, Status : FAILED
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /home/users/mlakshm/preopa406/data-r-00004 for DFSClient_attempt_201210231820_0235_r_000004_3 on client 10.0.1.100, because this file is already being created by DFSClient_attempt_201210231820_0235_r_000004_2 on 10.0.1.130
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1406)
I have pasted the code below. In the reduce method, I have the following logic to generate outputs:
int data_hash = (int) data_str.hashCode();
int data_int1 = 0;
int k = 500;
int check1 = 0;
for (int l = 10; l > 0; l++) {
    if ((data_hash % l == 0) && (check1 == 0)) {
        check1 = 1;
        int range = (int) k / 10;
        String check = "true";
        while (range > 0 && check.equals("true")) {
            if (data_hash % range - 1 == 0) {
                check = "false";
                data_int1 = range * 10;
            }
        }
    }
}
mos.getCollector("/home/users/mlakshm/preopa407/cdata" + data_int1, reporter)
   .collect(new Text(t + " " + alsort.get(0) + " " + alsort.get(1)), new Text(intersection));
Please help!
The problem is that all the reducers are trying to write files with the same naming scheme.
The reason is that
mos.getCollector("/home/users/mlakshm/preopa407/cdata" + data_int1, reporter).collect(new Text(t + " " + alsort.get(0) + " " + alsort.get(1)), new Text(intersection));
sets the file name based on a characteristic of the data, not the identity of the reducer.
You have a couple of choices:
Rework your map job so that the key that is emitted matches up with the hash you are calculating in this job. That would make sure that each reducer gets a span of values.
Include in the file name an identifier that is unique to each reducer, as in the sketch below. This would leave you with a set of part files for each reducer.
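A sketch of that second choice, assuming the old mapred API the question appears to use: read the task's partition number in configure() and append it to the generated name so that different reduce tasks never target the same file. "mapred.task.partition" is the property the framework sets per task; the path prefix is the question's own and is kept as-is:
private int taskPartition;

@Override
public void configure(JobConf job) {
    // partition number of this reduce task within the job
    taskPartition = job.getInt("mapred.task.partition", 0);
}

// inside reduce(), name the output from the data AND the task identity:
mos.getCollector("/home/users/mlakshm/preopa407/cdata" + data_int1 + "_" + taskPartition, reporter)
   .collect(new Text(t + " " + alsort.get(0) + " " + alsort.get(1)), new Text(intersection));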
Could you perhaps explain why you are using multiple outputs here? I don't think you need to.
