Pig UDF, file in Distributed Cache deleted during batch work

Pig UDF, file in Distributed Cache deleted during batch work - java

public class GetCountryFromIP extends EvalFunc<String> {
#Override
public List<String> getCacheFiles() {
List<String> list = new ArrayList<String>(1);
list.add("/input/pig/resources/GeoLite2-Country.mmdb#GeoLite2-Country");
return list;
}
#Override
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0 || input.get(0) == null) {
return null;
}
try {
String inputIP = (String) input.get(0);
String output;
File database = new File("./GeoLite2-Country");
//CODE FOR EXPLAIN
if (database.exists()) {
System.out.print("EXIST!!!");
} else {
System.out.print("NOTEXISTS!!!");
}
//CODE FOR EXPLAIN
DatabaseReader reader = new DatabaseReader.Builder(database).build();
InetAddress ipAddress = InetAddress.getByName(inputIP);
CountryResponse response = reader.country(ipAddress);
Country country = response.getCountry();
output = country.getIsoCode();
return output;
} catch (AddressNotFoundException e) {
return null;
} catch (Exception ee) {
throw new IOException("Uncaught exec" + ee);
}
}
}
Here is My UDF code, I need GeoLite2-Count.mmdb File, So use GetCacheFile.
Also I put all Pig-Latin to one pig file, 'batch.pig'
When I run this file 'pig batch.pig'
Output seams like this
2015-10-06 01:16:56,737 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - soft limit at 83886080
2015-10-06 01:16:56,737 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - bufstart = 0; bufvoid = 104857600
2015-10-06 01:16:56,737 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - kvstart = 26214396; length = 6553600
2015-10-06 01:16:56,738 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2015-10-06 01:16:56,744 [LocalJobRunner Map Task Executor #0] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2015-10-06 01:16:56,754 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map - Aliases being processed per job phase (AliasName[line,offset]): M: weblog[-1,-1],weblog_web[30,13],weblog_web[-1,-1],weblog_web[-1,-1],desktop_active_log_account_filter[7,36],desktop_parsed[3,18],desktop_parsed_abstract[5,26],weblog_web[-1,-1],web_active_log_account_filter[20,32],weblog_web_parsed[16,20],weblog_web_parsed_abstract[18,29] C: R:
EXIST!!!
2015-10-06 01:16:56,997 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner -
2015-10-06 01:16:56,997 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Starting flush of map output
...
...
...
2015-10-06 01:16:57,938 [Thread-1885] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce task executor complete.
2015-10-06 01:16:57,939 [pool-59-thread-1] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
NOTEXIST!!!
2015-10-06 01:16:57,974 [pool-59-thread-1] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce - Aliases being processed per job phase (AliasName[line,offset]): M: account_hour_activity[42,24],account_hour_activity_group[41,30],team_hour_activity[76,21],team_hour_activity_group[75,27] C:
...
...
..
2015-10-06 01:16:57,976 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Finished spill 0
2015-10-06 01:16:57,977 [Thread-2139] INFO org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
2015-10-06 01:16:57,981 [Thread-2139] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local1209692101_0021
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: Local Rearrange[tuple]{tuple}(true) - scope-2240 Operator Key: scope-2240): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: weblog_web_parsed_abstract: New For Each(false,false,false)[bag] - scope-1379 Operator Key: scope-1379): org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: com.tosslab.sprinklr.country.GetCountryFromIP [Uncaught execjava.io.FileNotFoundException: ./GeoLite2-Country (No such file or directory)]
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: Local Rearrange[tuple]{tuple}(true) - scope-2240 Operator Key: scope-2240): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: weblog_web_parsed_abstract: New For Each(false,false,false)[bag] - scope-1379 Operator Key: scope-1379): org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: com.tosslab.sprinklr.country.GetCountryFromIP [Uncaught execjava.io.FileNotFoundException: ./GeoLite2-Country (No such file or directory)]
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:316)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:291)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.runPipeline(POSplit.java:259)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.processPlan(POSplit.java:241)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.processPlan(POSplit.java:246)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.getNextTuple(POSplit.java:233)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: weblog_web_parsed_abstract: New For Each(false,false,false)[bag] - scope-1379 Operator Key: scope-1379): org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: com.tosslab.sprinklr.country.GetCountryFromIP [Uncaught execjava.io.FileNotFoundException: ./GeoLite2-Country (No such file or directory)]
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:316)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:246)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
... 17 more
This Means mmdb File is deleted During the batch work...
What's going on here? How could I solve this Issue?

Seems like the job is run from local mode.
2015-10-06 01:16:57,976 [**LocalJobRunner** Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Finished spill 0
When you run the job in local mode the distributed cache is not supported.
2015-10-05 23:22:56,675 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Distributed cache not supported or needed in local mode.
Put everything in HDFS and run in mapreduce mode.

Related

Controlling emit values from Flux generate for sequential task execution

I'm writing a simple orchestration framework using reactor framework which executes tasks sequentially, and the next task to execute is dependent on the result from previous tasks. I might have multiple paths to choose from based on the outcome of previous tasks. Earlier, I wrote a similar framework based on a static DAG where I passed as list of tasks as iterables and used Flux.fromIterable(taskList). However, this does not give me the flexibility to go dynamic because of the static array publisher.
I'm looking for alternate approaches like do(){}while(condition) to solve for DAG traversal and task decision and I came up with Flux.generate(). I evaluate the next step in generate method and pass the next task downstream. The problem I'm facing now is, Flux.generate does not wait for downstream to complete, but pushes until the condition is set to invalid. And by the time task 1 gets executed, task 2 would have been pushed n times, which is not the expected behavior.
Can someone please point me towards the right direction?
Thanks.
First iteration using List of tasks (static DAG)
Flux.fromIterable(taskList)
.publishOn(this.factory.getSharedSchedulerPool())
.concatMap(
reactiveTask -> {
log.info("Running task =>{}", reactiveTask.getTaskName());
return reactiveTask
.run(ctx);
})
// Evaluates status from previous task and terminates stream or continues.
.takeWhile(context -> evaluateStatus(context))
.onErrorResume(throwable -> buildResponse(ctx, throwable))
.doOnCancel(() -> log.info("Task cancelled"))
.doOnComplete(() -> log.info("Completed flow"))
.subscribe();
Attempt to dynamic dag
Flux.generate(
(SynchronousSink<ReactiveTask<OrchestrationContext>> synchronousSink) -> {
ReactiveTask<OrchestrationContext> task = null;
if (ctx.getLastExecutedStep() == null) {
// first task;
task = getFirstTaskFromDAG();
} else {
task = deriveNextStep(ctx.getLastExecutedStep(), ctx.getDecisionData());
}
if (task.getName.equals("END")) {
synchronousSink.complete();
}
synchronousSink.next(task);
})
.publishOn(this.factory.getSharedSchedulerPool())
.doOnNext(orchestrationContextReactiveTask -> log.info("On next => {}",
orchestrationContextReactiveTask.getTaskName()))
.concatMap(
reactiveTask -> {
log.info("Running task =>{}", reactiveTask.getTaskName());
return reactiveTask
.run(ctx);
})
.onErrorResume(throwable -> buildResponse(ctx, throwable))
.takeUntil(context -> evaluateStatus(context, tasks))
.doOnCancel(() -> log.info("Task cancelled"))
.doOnComplete(() -> log.info("Completed flow")).subscribe();
The problem in above approach is, while task 1 is executing, the onNext() subscriber prints many time because generate is publishing. I want the generate method to wait on results from previous task and submit new task. In non-reactive world, this can be achieve through simple while() loop.
Each Task will perform the following action.
public class ResponseTask extends AbstractBaseTask {
private TaskDefinition taskDefinition;
final String taskName;
public ResponseTask(
StateManager stateManager,
ThreadFactory factory,
) {
this.taskDefinition = taskDefinition;
this.taskName = taskName;
}
public Mono<String> transform(OrchestrationContext context) {
Any masterPayload = Any.wrap(context.getIngestionPayload());
return Mono.fromCallable(() -> stateManager.doTransformation(context, masterPayload);
}
public Mono<OrchestrationContext> execute(OrchestrationContext context, String payload) {
log.info("Executing sleep for task=>{}", context.getLastExecutedStep());
return Mono.delay(Duration.ofSeconds(1), factory.getSharedSchedulerPool())
.then(Mono.just(context));
}
public Mono<OrchestrationContext> run(OrchestrationContext context) {
log.info("Executing task:{}. Last executed:{}", taskName, context.getLastExecutedStep());
return transform(context)
.doOnNext((result) -> log.info("Transformation complete for task=?{}", taskName);)
.flatMap(payload -> {
return execute(context, payload);
}).onErrorResume(throwable -> {
context.setStatus(FAILED);
return Mono.just(context);
});
}
}
EDIT - From #Ikatiforis 's recommendation - I got the following output
Here's the output from my side.
2021-12-02 09:58:14,643 INFO (reactive_shared_pool) [ReactiveEngine lambda$doOrchestration$5:98] On next => Task1
2021-12-02 09:58:14,644 INFO (reactive_shared_pool) [ReactiveEngine lambda$doOrchestration$6:101] Running task =>Task1
2021-12-02 09:58:14,644 INFO (reactive_shared_pool) [AbstractBaseTask run:75] Executing task:Task1. Last executed:Task1
2021-12-02 09:58:14,658 INFO (reactive_shared_pool) [ReactiveEngine lambda$doOrchestration$5:98] On next => Task2
2021-12-02 09:58:14,659 INFO (reactive_shared_pool) [AbstractBaseTask lambda$run$0:83] Transformation complete for task=?Task1
2021-12-02 09:58:14,659 INFO (reactive_shared_pool) [ResponseTask execute:41] Executing sleep for task=>Task1
2021-12-02 09:58:15,661 INFO (reactive_shared_pool) [AbstractBaseTask lambda$run$4:106] Success for task=>Task1
2021-12-02 09:58:15,663 INFO (reactive_shared_pool)
[ReactiveEngine lambda$doOrchestration$6:101] Running task =>Task2
2021-12-02 09:58:15,811 INFO (cassandra-nio-worker-8) [AbstractBaseTask run:75] Executing task:Task2. Last executed:Task2
2021-12-02 09:58:15,811 INFO (reactive_shared_pool) [ReactiveEngine lambda$doOrchestration$5:98] On next => Task2
2021-12-02 09:58:15,812 INFO (reactive_shared_pool) [AbstractBaseTask lambda$run$0:83] Transformation complete for task=?Task2
2021-12-02 09:58:15,812 INFO (reactive_shared_pool) [ResponseTask execute:41] Executing sleep for task=>Task2
2021-12-02 09:58:15,837 INFO (centaurus_reactive_shared_pool) [ReactiveEngine lambda$doOrchestration$9:113] Completed flow
I see couple of problems here --
The sequence of execution is
1. Task does transformations ( runs on Mono.fromCallable)
2. Task induces a delay - Mono.fromDelay()
3. Task completes execution. After this, generate method should evaluate the context and pass on the next task to be executed.
What I see from the output is:
1. Task 1 starts the transformations - Runs on Mono.fromCallable.
2. Task 2 doOnNext is reported - which means the stream already got this task.
3. Task 1 completes.
4. Task 2 starts and executes delay -> the stream does not wait for response from task 2 but completes the flow.

The problem in above approach is, while task 1 is executing, the
onNext() subscriber prints many time because generate is publishing.
This is happening because concatMap requests a number of items upfront(32 by default) instead of requesting elements one by one. If you really need to request one element at the time you can use concatMap(Function<? super T,? extends Publisher<? extends V>> mapper,int prefetch) variant method and provide the prefetch value like this:
.concatMap(reactiveTask -> {
log.info("Running task =>{}", reactiveTask.getTaskName());
return reactiveTask.run(ctx);
}, 1)
Edit
There is also a publishOn method which takes a prefetch value. Take a look at the following Fibonacci generator sample and let me know if it works as you expect:
generateFibonacci(100)
.publishOn(boundedElasticScheduler, 1)
.doOnNext(number -> log.info("On next => {}", number))
.concatMap(number -> {
log.info("Running task => {}", number);
return task(number).doOnNext(num -> log.info("Task completed => {}", num));
}, 1)
.takeWhile(context -> context < 3)
.subscribe();
public Flux<Integer> generateFibonacci(int limit) {
return Flux.generate(
() -> new FibonacciState(0, 1),
(state, sink) -> {
log.info("Generating number: " + state);
sink.next(state.getFormer());
if (state.getLatter() > limit) {
sink.complete();
}
int temp = state.getFormer();
state.setFormer(state.getLatter());
state.setLatter(temp + state.getLatter());
return state;
});
}
Here is the output:
2021-12-02 10:47:51,990 INFO main c.u.p.p.s.c.Test - Generating number: FibonacciState(former=0, latter=1)
2021-12-02 10:47:51,993 INFO pool-1-thread-1 c.u.p.p.s.c.Test - On next => 0
2021-12-02 10:47:51,996 INFO pool-1-thread-1 c.u.p.p.s.c.Test - Running task => 0
2021-12-02 10:47:54,035 INFO pool-1-thread-1 c.u.p.p.s.c.Test - Task completed => 0
2021-12-02 10:47:54,035 INFO pool-1-thread-1 c.u.p.p.s.c.Test - Generating number: FibonacciState(former=1, latter=1)
2021-12-02 10:47:54,036 INFO pool-1-thread-1 c.u.p.p.s.c.Test - On next => 1
2021-12-02 10:47:54,036 INFO pool-1-thread-1 c.u.p.p.s.c.Test - Running task => 1
2021-12-02 10:47:56,036 INFO pool-1-thread-1 c.u.p.p.s.c.Test - Task completed => 1
2021-12-02 10:47:56,036 INFO pool-1-thread-1 c.u.p.p.s.c.Test - Generating number: FibonacciState(former=1, latter=2)
2021-12-02 10:47:56,036 INFO pool-1-thread-1 c.u.p.p.s.c.Test - On next => 1
2021-12-02 10:47:56,036 INFO pool-1-thread-1 c.u.p.p.s.c.Test - Running task => 1
2021-12-02 10:47:58,036 INFO pool-1-thread-1 c.u.p.p.s.c.Test - Task completed => 1
2021-12-02 10:47:58,036 INFO pool-1-thread-1 c.u.p.p.s.c.Test - Generating number: FibonacciState(former=2, latter=3)
2021-12-02 10:47:58,036 INFO pool-1-thread-1 c.u.p.p.s.c.Test - On next => 2
2021-12-02 10:47:58,036 INFO pool-1-thread-1 c.u.p.p.s.c.Test - Running task => 2
2021-12-02 10:48:00,036 INFO pool-1-thread-1 c.u.p.p.s.c.Test - Task completed => 2
2021-12-02 10:48:00,037 INFO pool-1-thread-1 c.u.p.p.s.c.Test - Generating number: FibonacciState(former=3, latter=5)
2021-12-02 10:48:00,037 INFO pool-1-thread-1 c.u.p.p.s.c.Test - On next => 3
2021-12-02 10:48:00,037 INFO pool-1-thread-1 c.u.p.p.s.c.Test - Running task => 3
2021-12-02 10:48:02,037 INFO pool-1-thread-1 c.u.p.p.s.c.Test - Task completed => 3
2021-12-02 10:52:07,877 INFO pool-1-thread-2 c.u.p.p.s.c.Test - Completed flow
Edit 04122021
You stated:
I'm trying to simulate HTTP / blocking calls. Hence the Mono.delay.
Mono#Delay is not the appropriate method to simulate a blocking call. The delay is introduced through the parallel scheduler and as a result, it does not wait for the task to complete. You can simulate a blocking call like this:
public String get() throws IOException {
HttpsURLConnection connection = (HttpsURLConnection) new URL("https://jsonplaceholder.typicode.com/comments").openConnection();
connection.setRequestMethod("GET");
try(InputStream inputStream = connection.getInputStream()) {
return new String(inputStream.readAllBytes(), StandardCharsets.UTF_8);
}
}
Note that as an alternative you could use .limitRate(1) operator instead of the prefetch parameter.

Spark java code is runnning in spark_core v2.2 but failing in spark_core v2.3

I have a spark java code that is running good in spark-core_2.11 v2.2.0 but throwing an exception in spark-core_2.11 v2.3.1.
The code is basically to map a column "isrecurrence" to values 1 if the value is true and 0 if the value is false.
That column also contains value "null" (as string). Those "null" string will get replaced by "\N" (for hive will read this data as NULL).
Code:
public static Seq<String> convertListToSeq(List<String> inputList)
{
return JavaConverters.asScalaIteratorConverter(inputList.iterator()).asScala().toSeq();
}
String srcCols = "id,whoid,whatid,whocount,whatcount,subject,activitydate,status,priority,ishighpriority,ownerid,description,isdeleted,accountid,isclosed,createddate,createdbyid,lastmodifieddate,lastmodifiedbyid,systemmodstamp,isarchived,calldurationinseconds,calltype,calldisposition,callobject,reminderdatetime,isreminderset,recurrenceactivityid,isrecurrence,recurrencestartdateonly,recurrenceenddateonly,recurrencetimezonesidkey,recurrencetype,recurrenceinterval,recurrencedayofweekmask,recurrencedayofmonth,recurrenceinstance,recurrencemonthofyear,recurrenceregeneratedtype";
String table = "task";
String[] colArr = srcCols.split(",");
List<String> colsList = Arrays.asList(colArr);
Dataset<Row> filtered = spark.read().format("com.springml.spark.salesforce")
.option("username", prop.getProperty("salesforce_user"))
.option("password", prop.getProperty("salesforce_auth"))
.option("login", prop.getProperty("salesforce_login_url"))
.option("soql", "SELECT "+srcCols+" from "+table)
.option("version", prop.getProperty("salesforce_version"))
.load().persist();
String column = "isrecurrence"; //This column has values 'true' or 'false' as string.
//'true' will be mapped to '1' (as string)
//'false' will be mapped to '0' (as string).
String newCol = column + "_mapped_to_new_value";
filtered = filtered.selectExpr(convertListToSeq(colsList))
.withColumn(newCol, //code is breaking here at "withColumn"
when(filtered.col(column).notEqual("null"),
when(filtered.col(column).equalTo("true"), 1)
.otherwise(when(filtered.col(column).equalTo("false"), 0)))
.otherwise(lit("\\N"))).alias(newCol)
.drop(filtered.col(column));
filtered.write().mode(SaveMode.Overwrite).option("delimiter", "^").csv(hdfsExportLoaction);
Error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved operator 'Project [id#35, whoid#21, whatid#1, whocount#13, whatcount#5, subject#27, activitydate#22, status#19, priority#24, ishighpriority#10, ownerid#15, description#2, isdeleted#20, accountid#3, isclosed#12, createddate#34, createdbyid#16, lastmodifieddate#0, lastmodifiedbyid#37, systemmodstamp#28, isarchived#30, calldurationinseconds#23, calltype#9, calldisposition#6, ... 16 more fields];;
'Project [id#35, whoid#21, whatid#1, whocount#13, whatcount#5, subject#27, activitydate#22, status#19, priority#24, ishighpriority#10, ownerid#15, description#2, isdeleted#20, accountid#3, isclosed#12, createddate#34, createdbyid#16, lastmodifieddate#0, lastmodifiedbyid#37, systemmodstamp#28, isarchived#30, calldurationinseconds#23, calltype#9, calldisposition#6, ... 16 more fields]
+- Project [id#35, whoid#21, whatid#1, whocount#13, whatcount#5, subject#27, activitydate#22, status#19, priority#24, ishighpriority#10, ownerid#15, description#2, isdeleted#20, accountid#3, isclosed#12, createddate#34, createdbyid#16, lastmodifieddate#0, lastmodifiedbyid#37, systemmodstamp#28, isarchived#30, calldurationinseconds#23, calltype#9, calldisposition#6, ... 15 more fields]
+- Project [id#35, whoid#21, whatid#1, whocount#13, whatcount#5, subject#27, activitydate#22, status#19, priority#24, ishighpriority#10, ownerid#15, description#2, isdeleted#20, accountid#3, isclosed#12, createddate#34, createdbyid#16, lastmodifieddate#0, lastmodifiedbyid#37, systemmodstamp#28, isarchived#30, calldurationinseconds#23, calltype#9, calldisposition#6, ... 15 more fields]
+- Relation[LastModifiedDate#0,WhatId#1,Description#2,AccountId#3,RecurrenceDayOfWeekMask#4,WhatCount#5,CallDisposition#6,ReminderDateTime#7,RecurrenceEndDateOnly#8,CallType#9,IsHighPriority#10,RecurrenceRegeneratedType#11,IsClosed#12,WhoCount#13,RecurrenceInterval#14,OwnerId#15,CreatedById#16,RecurrenceActivityId#17,IsReminderSet#18,Status#19,IsDeleted#20,WhoId#21,ActivityDate#22,CallDurationInSeconds#23,... 15 more fields] DatasetRelation(null,com.springml.salesforce.wave.impl.ForceAPIImpl#68303c3e,SELECT id,whoid,whatid,whocount,whatcount,subject,activitydate,status,priority,ishighpriority,ownerid,description,isdeleted,accountid,isclosed,createddate,createdbyid,lastmodifieddate,lastmodifiedbyid,systemmodstamp,isarchived,calldurationinseconds,calltype,calldisposition,callobject,reminderdatetime,isreminderset,recurrenceactivityid,isrecurrence,recurrencestartdateonly,recurrenceenddateonly,recurrencetimezonesidkey,recurrencetype,recurrenceinterval,recurrencedayofweekmask,recurrencedayofmonth,recurrenceinstance,recurrencemonthofyear,recurrenceregeneratedtype from task,null,org.apache.spark.sql.SQLContext#2ec23ec3,null,0,1000,None,false,false)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:92)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:356)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:354)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:354)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:92)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3301)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1312)
at org.apache.spark.sql.Dataset.withColumns(Dataset.scala:2197)
at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:2164)
at com.sfdc.SaleforceReader.mapColumns(SaleforceReader.java:187)
at com.sfdc.SaleforceReader.main(SaleforceReader.java:547)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:904)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/07/10 09:38:51 INFO SparkContext: Invoking stop() from shutdown hook
19/07/10 09:38:51 INFO AbstractConnector: Stopped Spark#72456279{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
19/07/10 09:38:51 INFO SparkUI: Stopped Spark web UI at http://ebdp-po-e007s.sys.comcast.net:4040
19/07/10 09:38:51 INFO YarnClientSchedulerBackend: Interrupting monitor thread
19/07/10 09:38:51 INFO YarnClientSchedulerBackend: Shutting down all executors
19/07/10 09:38:51 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
19/07/10 09:38:51 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
19/07/10 09:38:51 INFO YarnClientSchedulerBackend: Stopped
19/07/10 09:38:51 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/07/10 09:38:51 INFO MemoryStore: MemoryStore cleared
19/07/10 09:38:51 INFO BlockManager: BlockManager stopped
19/07/10 09:38:51 INFO BlockManagerMaster: BlockManagerMaster stopped
I am not sure if this is happening due to nested when-otherwise().

I used the lit() and it worked:
when(filtered.col(column).equalTo("true"), lit(1))
.otherwise(when(filtered.col(column).equalTo("false"), lit(0)))
instead of
when(filtered.col(column).equalTo("true"), 1)
.otherwise(when(filtered.col(column).equalTo("false"), 0))

java.io.IOException in MapReduce

I want to use MapReduce to get the max value and min value for each year in a txt file. the contents in the file look like this:
1979 23 23 2 43 24 25 26 26 26 26 25 26
1980 26 27 28 28 28 30 31 31 31 30 30 30
1981 31 32 32 32 33 34 35 36 36 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38
1985 38 39 39 39 39 41 41 41 00 40 39 39
The first column represents years.
I want MapReduce to give me a final output like this:
1979 2, 26
1980 26, 31
...
so I write the code in Java like this:
public class MaxValue_MinValue {
public static class E_Mappter extends Mapper<Object, Text, Text, IntWritable> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] elements = line.split("\\s");
Text outputKey = new Text(elements[0]);
for(int i = 1; i<elements.length;i++) {
context.write(outputKey, new IntWritable(Integer.parseInt(elements[i])));
}
}
}
public static class E_Reducer extends Reducer<Text,IntWritable, Text, Text> {
public void reduce(Text inKey,Iterable<IntWritable> inValues, Context context) throws IOException, InterruptedException {
int maxTemp = 0;
int minTemp = 0;
for(IntWritable ele : inValues) {
if (ele.get() > maxTemp) {
maxTemp = ele.get();
}
if (ele.get() < minTemp) {
minTemp = ele.get();
}
}
context.write(inKey, new Text("Max is " + maxTemp + ", Min is " + minTemp));
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf,"Max value, min value for each year");
job.setJarByClass(MaxValue_MinValue.class);
job.setMapperClass(E_Mappter.class);
job.setReducerClass(E_Reducer.class);
job.setCombinerClass(E_Reducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0:1);
}
}
but when I run it, I got below error messages:
hadoop#steven81-HP:/usr/local/hadoop277$ ./bin/hadoop jar ./myApp/MinValue_MaxValue.jar /user/hadoop/input/Electrical__Consumption.txt /user/hadoop/output7
19/04/10 16:59:21 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
19/04/10 16:59:21 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
19/04/10 16:59:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/04/10 16:59:22 INFO input.FileInputFormat: Total input paths to process : 1
19/04/10 16:59:22 INFO mapreduce.JobSubmitter: number of splits:1
19/04/10 16:59:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1076320101_0001
19/04/10 16:59:23 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
19/04/10 16:59:23 INFO mapreduce.Job: Running job: job_local1076320101_0001
19/04/10 16:59:23 INFO mapred.LocalJobRunner: OutputCommitter set in config null
19/04/10 16:59:23 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
19/04/10 16:59:23 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/04/10 16:59:23 INFO mapred.LocalJobRunner: Waiting for map tasks
19/04/10 16:59:23 INFO mapred.LocalJobRunner: Starting task: attempt_local1076320101_0001_m_000000_0
19/04/10 16:59:23 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
19/04/10 16:59:23 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
19/04/10 16:59:23 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/hadoop/input/Electrical__Consumption.txt:0+204
19/04/10 16:59:23 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
19/04/10 16:59:23 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
19/04/10 16:59:23 INFO mapred.MapTask: soft limit at 83886080
19/04/10 16:59:23 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
19/04/10 16:59:23 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
19/04/10 16:59:23 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
19/04/10 16:59:24 INFO mapred.MapTask: Starting flush of map output
19/04/10 16:59:24 INFO mapred.LocalJobRunner: map task executor complete.
19/04/10 16:59:24 INFO mapreduce.Job: Job job_local1076320101_0001 running in uber mode : false
19/04/10 16:59:24 INFO mapreduce.Job: map 0% reduce 0%
19/04/10 16:59:24 WARN mapred.LocalJobRunner: job_local1076320101_0001
java.lang.Exception: java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.IntWritable
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.IntWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1077)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:715)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at test.map.reduce.MaxValue_MinValue$E_Mappter.map(MaxValue_MinValue.java:23)
at test.map.reduce.MaxValue_MinValue$E_Mappter.map(MaxValue_MinValue.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/04/10 16:59:25 INFO mapreduce.Job: Job job_local1076320101_0001 failed with state FAILED due to: NA
19/04/10 16:59:25 INFO mapreduce.Job: Counters: 0
I was confused by this error "Type mismatch in value from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.IntWritable" because the map's output is (Text, IntWritable) and the input for the reduce is also (Text, IntWritable), so I don't know why, can anyone help me?

The Combiner must be able to accept data from the Mapper, and must output data that can be used as input for the Reducer. In your case, the Combiner output type is <Text, Text>, but the Reducer input type is <Text, IntWritable> and so they don't match.
You don't actually need MapReduce for this problem, because you have all the data for each year available on each line, and you don't need to compare between lines.
String line = value.toString();
String[] elements = line.split("\\s");
Text year = new Text(elements[0]);
int maxTemp = INTEGER.MIN_VALUE;
int minTemp = INTEGER.MAX_VALUE;
int temp;
for(int i = 1; i<elements.length;i++) {
temp = Integer.parseInt(elements[i])
if (temp < minTemp) {
minTemp = temp;
} else if (temp > maxTemp) {
maxTemp = temp;
}
}
System.out.println("For year " + year + ", the minimum temperature was " + minTemp + " and the maximum temperature was " + maxTemp);

Average in map reduce

I am trying to calculate average of numbers in hadoop stand alone setup. I am not able to run the program. But program compile without any error and jar file also created.I think I am using correct commands to execute the program in hadoop set up. Somebody please review my code and tell me is there any problem . Here is my code
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
class sum_count{
int sum;
int count;
}
public class Average {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, Object>{
private final static IntWritable valueofkey = new IntWritable();
private Text word = new Text();
sum_count sc=new sum_count();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
int sum=0;
int count=0;
int v;
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
v=Integer.parseInt(word.toString());
count=count+1;
sum=sum+v;
}
//valueofkey.set(sum);
word.set("average");
sc.sum=sum;
sc.count=count;
// context.write(word, valueofkey);
// valueofkey.set(count);
// word.set("count");
context.write(word,sc);
}
}
public static class IntSumReducer
extends Reducer<Text,Object,Text,IntWritable> {
private IntWritable result = new IntWritable();
private IntWritable test=new IntWritable();
public void reduce(Text key, Iterable<sum_count> values,Context context) throws IOException, InterruptedException {
int sum = 0;
int count=0;
int wholesum=0;
int wholecount=0;
for (sum_count val : values) {
//value=val.get();
wholesum=wholesum+val.sum;
wholecount=wholecount+val.count;
}
int res=wholesum/wholecount;
result.set(res);
context.write(key, result );
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "");
job.setJarByClass(Average.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Here is my output:
manu#manu-Latitude-E5430-vPro:~/hadoop-2.7.2$ ./bin/hadoop jar av.jar Average bin/user/hduser/input bin/user/hduser/out12
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar
16/07/01 11:19:05 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/07/01 11:19:05 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/07/01 11:19:05 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/07/01 11:19:05 INFO input.FileInputFormat: Total input paths to process : 2
16/07/01 11:19:05 INFO mapreduce.JobSubmitter: number of splits:2
16/07/01 11:19:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local276107063_0001
16/07/01 11:19:05 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/07/01 11:19:05 INFO mapreduce.Job: Running job: job_local276107063_0001
16/07/01 11:19:05 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/07/01 11:19:05 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/07/01 11:19:05 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/07/01 11:19:05 INFO mapred.LocalJobRunner: Waiting for map tasks
16/07/01 11:19:05 INFO mapred.LocalJobRunner: Starting task: attempt_local276107063_0001_m_000000_0
16/07/01 11:19:06 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/07/01 11:19:06 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/07/01 11:19:06 INFO mapred.LocalJobRunner: Starting task: attempt_local276107063_0001_m_000001_0
16/07/01 11:19:06 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/07/01 11:19:06 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/07/01 11:19:06 INFO mapred.LocalJobRunner: map task executor complete.
16/07/01 11:19:06 WARN mapred.LocalJobRunner: job_local276107063_0001
java.lang.Exception: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:134)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:745)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:132)
... 8 more
Caused by: java.lang.NoClassDefFoundError: sum_count
at Average$TokenizerMapper.<init>(Average.java:24)
... 13 more
Caused by: java.lang.ClassNotFoundException: sum_count
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 14 more
16/07/01 11:19:06 INFO mapreduce.Job: Job job_local276107063_0001 running in uber mode : false
16/07/01 11:19:06 INFO mapreduce.Job: map 0% reduce 0%
16/07/01 11:19:06 INFO mapreduce.Job: Job job_local276107063_0001 failed with state FAILED due to: NA
16/07/01 11:19:06 INFO mapreduce.Job: Counters: 0

You're getting a ClassNotFoundException on sum_count. Having two classes declared at the top level of a file isnt really a good way to structure your code. It looks like when the TokenizerMapper tries to create that class, it can't find it on the class path.
I would just put that class in a file of its own. It will need changing anyway, your job won't work as you have it since sum_count doesnt implement the Writable interface. It should look more like:
public class SumCount implements Writable {
public int sum;
public int count;
#Override
public void write(DataOutput out) throws IOException {
out.writeInt(sum);
out.writeInt(count);
}
#Override
public void readFields(DataInput in) throws IOException {
sum = in.readInt();
count = in.readInt();
}
}
In your main() you also need to tell it what types of Key/Value it will write out are:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(SumCount.class);
Note the change in class name. See the Java naming convention docs here.

how to define our own map and reduce class

i want to create map reduce job of my own.
the map class's output is :Text(key),Text(value)
the reduce class's output is :Text,Intwritable
I tried to implement it in following way:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class artistandTrack {
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
String line = value.toString();
String[] names=line.split(" ");
Text artist_name = new Text(names[2]);
Text track_name = new Text(names[3]);
output.collect(artist_name,track_name);
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, IntWritable> {
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += 1;
Text x1=values.next();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(artistandTrack.class);
conf.setJobName("artisttrack");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(Text.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
when i try to run it it shows the following output and terminates
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/10/17 06:09:15 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/10/17 06:09:15 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/10/17 06:09:16 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
14/10/17 06:09:18 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14/10/17 06:09:19 INFO mapred.FileInputFormat: Total input paths to process : 1
14/10/17 06:09:19 INFO mapreduce.JobSubmitter: number of splits:1
14/10/17 06:09:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local803195645_0001
14/10/17 06:09:20 WARN conf.Configuration: file:/app/hadoop/tmp/mapred/staging/userloki803195645/.staging/job_local803195645_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/10/17 06:09:20 WARN conf.Configuration: file:/app/hadoop/tmp/mapred/staging/userloki803195645/.staging/job_local803195645_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/10/17 06:09:20 WARN conf.Configuration: file:/app/hadoop/tmp/mapred/local/localRunner/userloki/job_local803195645_0001/job_local803195645_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/10/17 06:09:20 WARN conf.Configuration: file:/app/hadoop/tmp/mapred/local/localRunner/userloki/job_local803195645_0001/job_local803195645_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/10/17 06:09:20 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
14/10/17 06:09:20 INFO mapreduce.Job: Running job: job_local803195645_0001
14/10/17 06:09:20 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/10/17 06:09:20 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
14/10/17 06:09:20 INFO mapred.LocalJobRunner: Waiting for map tasks
14/10/17 06:09:20 INFO mapred.LocalJobRunner: Starting task: attempt_local803195645_0001_m_000000_0
14/10/17 06:09:20 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/10/17 06:09:20 INFO mapred.MapTask: Processing split: hdfs://localhost:54310/project5/input/sad.txt:0+272
14/10/17 06:09:21 INFO mapred.MapTask: numReduceTasks: 1
14/10/17 06:09:21 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/10/17 06:09:21 INFO mapreduce.Job: Job job_local803195645_0001 running in uber mode : false
14/10/17 06:09:21 INFO mapreduce.Job: map 0% reduce 0%
14/10/17 06:09:22 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
14/10/17 06:09:22 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
14/10/17 06:09:22 INFO mapred.MapTask: soft limit at 83886080
14/10/17 06:09:22 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
14/10/17 06:09:22 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
14/10/17 06:09:25 INFO mapred.LocalJobRunner:
14/10/17 06:09:25 INFO mapred.MapTask: Starting flush of map output
14/10/17 06:09:25 INFO mapred.MapTask: Spilling map output
14/10/17 06:09:25 INFO mapred.MapTask: bufstart = 0; bufend = 120; bufvoid = 104857600
14/10/17 06:09:25 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214384(104857536); length = 13/6553600
14/10/17 06:09:25 INFO mapred.LocalJobRunner: map task executor complete.
14/10/17 06:09:25 WARN mapred.LocalJobRunner: job_local803195645_0001
***java.lang.Exception: java.io.IOException: wrong value class: class org.apache.hadoop.io.IntWritable is not class org.apache.hadoop.io.Text
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.io.IOException: wrong value class: class org.apache.hadoop.io.IntWritable is not class org.apache.hadoop.io.Text
at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:199)
at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:1307)
at artistandTrack$Reduce.reduce(artistandTrack.java:44)
at artistandTrack$Reduce.reduce(artistandTrack.java:37)
at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1572)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1611)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1462)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)***
14/10/17 06:09:26 INFO mapreduce.Job: Job job_local803195645_0001 failed with state FAILED due to: NA
14/10/17 06:09:26 INFO mapreduce.Job: Counters: 11
Map-Reduce Framework
Map input records=4
Map output records=4
Map output bytes=120
Map output materialized bytes=0
Input split bytes=97
Combine input records=0
Combine output records=0
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
File Input Format Counters
Bytes Read=272
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at artistandTrack.main(artistandTrack.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
from where the wrong class is coming
java.lang.Exception: java.io.IOException: wrong value class: class org.apache.hadoop.io.IntWritable is not class org.apache.hadoop.io.Text
and
why the job fails
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at artistandTrack.main(artistandTrack.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
i dont understand where it's going wrong.
any help

I think the problem is in line:
conf.setCombinerClass(Reduce.class);
Your Map produces a pair of Text and Text, then your combiner takes it and produces a pair of Text and IntWritable, but your Reducer can't accept IntWritable as value so it throws an exception. Try to remove line with Combiner setting.

Use Main from bellow:
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(artistandTrack.class);
conf.setJobName("artisttrack");
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(Map.class);
//conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
//conf.setOutputKeyClass(Text.class);
//conf.setOutputValueClass(IntWritable.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(Text.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
JobClient.runJob(conf);
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Pig UDF, file in Distributed Cache deleted during batch work - java

Related

Controlling emit values from Flux generate for sequential task execution

Spark java code is runnning in spark_core v2.2 but failing in spark_core v2.3

java.io.IOException in MapReduce

Average in map reduce

how to define our own map and reduce class

Categories

Resources