Submit PySpark to Yarn cluster using Java

I need to create a Java program that submits Python scripts (which use PySpark) to a YARN cluster.
From what I have seen, using SparkLauncher is the same as using a YarnClient, because it uses a built-in YARN client (writing my own YARN client is insane; I tried, and there are too many things to handle).
So I wrote:
public static void main(String[] args) throws Exception {
    String SPARK_HOME = System.getProperty("SPARK_HOME");
    submit(SPARK_HOME, args);
}

static void submit(String SPARK_HOME, String[] args) throws Exception {
    String[] arguments = new String[]{
            // application name
            "--name",
            "SparkPi-Python",

            "--class",
            "org.apache.spark.deploy.PythonRunner",

            "--py-files",
            SPARK_HOME + "/python/lib/pyspark.zip," + SPARK_HOME + "/python/lib/py4j-0.9-src.zip",

            // Python program
            "--primary-py-file",
            "/home/lorenzo/script.py",

            // number of executors
            "--num-executors",
            "2",

            // driver memory
            "--driver-memory",
            "512m",

            // executor memory
            "--executor-memory",
            "512m",

            // executor cores
            "--executor-cores",
            "2",

            "--queue",
            "default",

            // argument 1 to my Spark program
            "--arg",
            null,
    };

    System.setProperty("SPARK_YARN_MODE", "true");
    System.out.println(SPARK_HOME);

    SparkLauncher sparkLauncher = new SparkLauncher();
    sparkLauncher.setSparkHome("/usr/hdp/current/spark2-client");
    sparkLauncher.setAppResource("/home/lorenzo/script.py");
    sparkLauncher.setMaster("yarn");
    sparkLauncher.setDeployMode("cluster");
    sparkLauncher.setVerbose(true);
    sparkLauncher.launch().waitFor();
}
When I run this JAR from a machine in the cluster, nothing happens... no error, no log, no YARN container... just nothing. If I put a println inside this code, it obviously prints.
What am I misconfiguring?
If I want to run this JAR from a different machine, where and how should I declare the cluster's IP?
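For comparison, here is a minimal sketch of how SparkLauncher is often wired so that the spark-submit child process is actually visible. It reuses the paths and settings from the question, forwards the script arguments through the launcher (the hand-built arguments array above is never handed to SparkLauncher), and inherits the child's stdout/stderr so errors show up in the console. This is a sketch under those assumptions, not a verified fix:

import org.apache.spark.launcher.SparkLauncher;

public class PySparkYarnSubmit {
    public static void main(String[] args) throws Exception {
        int exitCode = new SparkLauncher()
                .setSparkHome("/usr/hdp/current/spark2-client")   // Spark home from the question
                .setAppResource("/home/lorenzo/script.py")        // primary .py file
                .setAppName("SparkPi-Python")
                .setMaster("yarn")
                .setDeployMode("cluster")
                .setConf(SparkLauncher.DRIVER_MEMORY, "512m")
                .setConf(SparkLauncher.EXECUTOR_MEMORY, "512m")
                .setConf(SparkLauncher.EXECUTOR_CORES, "2")
                .setConf("spark.executor.instances", "2")
                .setConf("spark.yarn.queue", "default")
                .addAppArgs(args)                                  // arguments forwarded to script.py
                .setVerbose(true)
                // Without redirection, the spark-submit child process writes to pipes that
                // nobody reads, so nothing appears in the console even when submission fails.
                .redirectOutput(ProcessBuilder.Redirect.INHERIT)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .launch()
                .waitFor();
        System.out.println("spark-submit exited with code " + exitCode);
    }
}

As for submitting from a different machine: the YARN address is typically not set in code at all; spark-submit reads it from the cluster's yarn-site.xml/core-site.xml, located via HADOOP_CONF_DIR or YARN_CONF_DIR, so those configuration files (and that environment variable) need to be present on the machine that runs the JAR.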

Related

Calling a Redis function (loaded Lua script) using the Lettuce library

I am using Java, Spring-Boot, Redis 7.0.4, and lettuce 6.2.0.RELEASE.
I wrote a Lua script as below:
#!lua name=updateRegisterUserJobAndForwardMsg

function updateRegisterUserJobAndForwardMsg(KEYS, ARGV)
    local jobsKey = KEYS[1]
    local inboxKey = KEYS[2]
    local jobRef = KEYS[3]
    local jobIdentity = KEYS[4]
    local accountsMsg = ARGV[1]

    local jobDetail = redis.call('HGET', jobsKey, jobRef)
    local jobObj = cmsgpack.unpack(jobDetail)
    local msgSteps = jobObj['steps']
    msgSteps[jobIdentity] = 'IN_PROGRESS'
    jobDetail = redis.call('HSET', jobsKey, jobRef, cmsgpack.pack(jobObj))

    local ssoMsg = redis.call('RPUSH', inboxKey, cmsgpack.pack(accountsMsg))
    return jobDetail
end

redis.register_function('updateRegisterUserJobAndForwardMsg', updateRegisterUserJobAndForwardMsg)
Then I registered it as a function in my Redis using the command below:
cat updateJobAndForwardMsgScript.lua | redis-cli -x FUNCTION LOAD REPLACE
Now I can easily call my function using redis-cli as below:
FCALL updateJobAndForwardMsg 4 key1 key2 key3 key4 arg1
And it gets executed successfully!
Now I want to call my function using Lettuce, which is the Redis client library in my application, but I haven't found anything on the net; it seems that Lettuce does not support the new Redis 7 feature of calling a FUNCTION via the FCALL command.
Does Lettuce have some other, customized way of executing Redis commands?
Any help would be appreciated!
After a bit more research about the requirement, I found the following StackOverFlow answer:
StackOverFlow Answer
And also based on the documentation:
Redis Custom Commands:
Custom commands can be dispatched on the one hand using Lua and the eval() command, on the other side Lettuce 4.x allows you to trigger own commands. That API is used by Lettuce itself to dispatch commands and requires some knowledge of how commands are constructed and dispatched within Lettuce.
Lettuce provides two levels of command dispatching:
- Using the synchronous, asynchronous or reactive API wrappers which invoke commands according to their nature
- Using the bare connection to influence the command nature and synchronization (advanced)
So I could handle my requirements by creating an interface which extends the io.lettuce.core.dynamic.Commands interface as below:
import io.lettuce.core.dynamic.Commands;
import io.lettuce.core.dynamic.annotation.Command;
import io.lettuce.core.dynamic.annotation.Param;

public interface CustomCommands extends Commands {

    @Command("FCALL :funcName :keyCnt :jobsKey :inboxRef :jobsRef :jobIdentity :frwrdMsg")
    Object fcall_responseJob(@Param("funcName") byte[] functionName, @Param("keyCnt") Integer keysCount,
                             @Param("jobsKey") byte[] jobsKey, @Param("inboxRef") byte[] inboxRef,
                             @Param("jobsRef") byte[] jobsRef, @Param("jobIdentity") byte[] jobIdentity,
                             @Param("frwrdMsg") byte[] frwrdMsg);
}
Then I could easily call my loaded FUNCTION (which was a Lua script) as below:
private void updateResponseJobAndForwardMsgToSSO(SharedObject message, SharedObject responseMessage) {
    try {
        ObjectMapper objectMapper = new MessagePackMapper();
        RedisCommandFactory factory = new RedisCommandFactory(connection);
        CustomCommands commands = factory.getCommands(CustomCommands.class);
        Object obj = commands.fcall_responseJob(
                Constant.REDIS_RESPONSE_JOB_FUNCTION_NAME.getBytes(StandardCharsets.UTF_8),
                Constant.REDIS_RESPONSE_JOB_FUNCTION_KEY_COUNT,
                (message.getAgent() + Constant.AGENTS_JOBS_POSTFIX).getBytes(StandardCharsets.UTF_8),
                (message.getAgent() + Constant.AGENTS_INBOX_POSTFIX).getBytes(StandardCharsets.UTF_8),
                message.getReferenceNumber().getBytes(StandardCharsets.UTF_8),
                message.getTyp().getBytes(StandardCharsets.UTF_8),
                objectMapper.writeValueAsBytes(responseMessage));
        LOG.info(obj.toString());
    } catch (Exception e) {
        e.printStackTrace();
    }
}
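For completeness, the lower-level dispatch route mentioned in the quoted documentation can also reach FCALL without the dynamic-commands interface. The following is only a sketch, under the assumption that FCALL is not available as a predefined CommandType in the Lettuce version in use, so the protocol keyword is declared by hand; the class name and key/argument values are illustrative:

import java.nio.charset.StandardCharsets;

import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.output.ValueOutput;
import io.lettuce.core.protocol.CommandArgs;
import io.lettuce.core.protocol.ProtocolKeyword;

public class FcallViaDispatch {

    // Hand-rolled protocol keyword for FCALL (illustrative; only needed if the
    // Lettuce version in use has no predefined CommandType for it).
    private static final ProtocolKeyword FCALL = new ProtocolKeyword() {
        @Override
        public byte[] getBytes() {
            return "FCALL".getBytes(StandardCharsets.US_ASCII);
        }

        @Override
        public String name() {
            return "FCALL";
        }
    };

    public static String callFunction(StatefulRedisConnection<String, String> connection) {
        StringCodec codec = StringCodec.UTF8;
        CommandArgs<String, String> args = new CommandArgs<>(codec)
                .add("updateRegisterUserJobAndForwardMsg") // function name
                .add(4)                                    // number of keys that follow
                .addKey("key1").addKey("key2").addKey("key3").addKey("key4")
                .addValue("arg1");
        // ValueOutput decodes a single bulk reply; pick an output type that matches
        // what the Lua function actually returns.
        return connection.sync().dispatch(FCALL, new ValueOutput<>(codec), args);
    }
}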

Unable to read a (text) file in FileProcessingMode.PROCESS_CONTINUOUSLY mode

I have a requirement to read files continuously from a specific path.
That is, the Flink job should continuously poll the specified location and read any file that arrives there at certain intervals.
Example: my location on a Windows machine is C:/inputfiles, which gets file_1.txt at 2:00 PM, file_2.txt at 2:30 PM, and file_3.txt at 3:00 PM.
I experimented with the code below.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.io.FilePathFilter;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.apache.flink.util.Collector;

import java.util.Arrays;
import java.util.List;

public class ContinuousFileProcessingTest {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10);

        String localFsURI = "D:\\FLink\\2021_01_01\\";

        TextInputFormat format = new TextInputFormat(new org.apache.flink.core.fs.Path(localFsURI));
        format.setFilesFilter(FilePathFilter.createDefaultFilter());

        DataStream<String> inputStream =
                env.readFile(format, localFsURI, FileProcessingMode.PROCESS_CONTINUOUSLY, 100);

        SingleOutputStreamOperator<String> soso = inputStream.map(String::toUpperCase);
        soso.print();
        soso.writeAsText("D:\\FLink\\completed", FileSystem.WriteMode.OVERWRITE);

        env.execute("read and write");
    }
}
Now, to test this on a Flink cluster, I brought a cluster up using Flink 1.9.2, and I was able to achieve my goal of reading files continuously at some interval.
Note: Flink 1.9.2 can bring up a cluster on a Windows machine.
But now I have to upgrade from Flink 1.9.2 to 1.12, and we used Docker to bring the 1.12 cluster up (unlike 1.9.2).
I changed the file location from the Windows path to the corresponding Docker location, but the same program is not running there.
Moreover: accessing the files is not the problem. If I put a file in place before starting the job, the job reads it correctly, but if I add a new file at runtime, it does not read the newly added files.
Need help to find the solution.
Thanks in advance.
Try changing the directory scan interval from the sample code to Duration.ofSeconds(50).toMillis(), and check out StreamExecutionEnvironment.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC).
RuntimeExecutionMode is documented at https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/api/common/RuntimeExecutionMode.html
Working code below:
import java.time.Duration;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.io.FilePathFilter;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ContinuousFileProcessingTest {

    private static final Logger log = LoggerFactory.getLogger(ContinuousFileProcessingTest.class);

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10);
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);

        String localFsURI = "file:///usr/test";

        // create the monitoring source along with the necessary readers.
        TextInputFormat format = new TextInputFormat(new org.apache.flink.core.fs.Path(localFsURI));
        log.info("format : " + format.toString());
        format.setFilesFilter(FilePathFilter.createDefaultFilter());
        log.info("setFilesFilter : " + FilePathFilter.createDefaultFilter().toString());
        log.info("getFilesFilter : " + format.getFilePath().toString());

        DataStream<String> inputStream =
                env.readFile(format, localFsURI, FileProcessingMode.PROCESS_CONTINUOUSLY, Duration.ofSeconds(50).toMillis());

        SingleOutputStreamOperator<String> soso = inputStream.map(String::toUpperCase);
        soso.writeAsText("file:///usr/test/completed.txt", FileSystem.WriteMode.OVERWRITE);

        env.execute("read and write");
    }
}
This code works on Docker Desktop with Flink 1.12 and the container file path file:///usr/test. Note: keep parallelism at a minimum of 2 so that files can be processed in parallel.
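For instance, the parallelism can be pinned in the job itself right after the environment is created; this is just a sketch of that single setting, using the minimum suggested above:

// readFile in PROCESS_CONTINUOUSLY mode uses a single, non-parallel task to monitor the
// directory, while the readers run at the job parallelism; give the job at least 2.
env.setParallelism(2);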

Execute an AWS command in eclipse

I execute an EC2 command through Eclipse like this:
public static void main(String[] args) throws IOException {
    // TODO Auto-generated method stub
    String spot = "aws ec2 describe-spot-price-history --instance-types"
            + " m3.medium --product-description \"Linux/UNIX (Amazon VPC)\"";
    System.out.println(spot);

    Runtime runtime = Runtime.getRuntime();
    final Process process = runtime.exec(spot);

    //********************
    InputStreamReader isr = new InputStreamReader(process.getInputStream());
    BufferedReader buff = new BufferedReader(isr);

    String line;
    while ((line = buff.readLine()) != null)
        System.out.print(line);
}
The result in the Eclipse console is:
aws ec2 describe-spot-price-history --instance-types m3.medium --product-description "Linux/UNIX (Amazon VPC)"
{ "SpotPriceHistory": []}
However, when I execute the same command (aws ec2 describe-spot-price-history --instance-types m3.medium --product-description "Linux/UNIX (Amazon VPC)") in shell I obtain a different result.
"Timestamp": "2018-09-07T17:52:48.000Z",
"AvailabilityZone": "us-east-1f",
"InstanceType": "m3.medium",
"ProductDescription": "Linux/UNIX",
"SpotPrice": "0.046700"
},
{
"Timestamp": "2018-09-07T17:52:48.000Z",
"AvailabilityZone": "us-east-1a",
"InstanceType": "m3.medium",
"ProductDescription": "Linux/UNIX",
"SpotPrice": "0.047000"
}
My question is: how can I obtain in the Eclipse console the same result as in the shell console?
It looks like you are not getting the expected output because you are passing a console command through your Java code, which is not getting parsed properly, instead of using the AWS SDK for Java.
To get the expected output in your Eclipse console, you could utilize the DescribeSpotPriceHistory Java SDK API call in your code[1]. An example code snippet for this API call according to the documentation is as follows:
AmazonEC2 client = AmazonEC2ClientBuilder.standard().build();
DescribeSpotPriceHistoryRequest request = new DescribeSpotPriceHistoryRequest()
        .withEndTime(new Date("2014-01-06T08:09:10"))
        .withInstanceTypes("m1.xlarge")
        .withProductDescriptions("Linux/UNIX (Amazon VPC)")
        .withStartTime(new Date("2014-01-06T07:08:09"));
DescribeSpotPriceHistoryResult response = client.describeSpotPriceHistory(request);
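To get the shell-style detail into the Eclipse console, the returned result can then be iterated and printed. A minimal sketch, assuming the v1 SDK is on the classpath and that credentials/region come from the default provider chain; the class name and the plain-text formatting are illustrative, not the CLI's JSON:

import java.util.Date;

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.DescribeSpotPriceHistoryRequest;
import com.amazonaws.services.ec2.model.DescribeSpotPriceHistoryResult;
import com.amazonaws.services.ec2.model.SpotPrice;

public class SpotPriceHistoryExample {
    public static void main(String[] args) {
        // Credentials and region come from the default provider chain
        // (environment variables, ~/.aws/credentials, instance profile, ...), same as the CLI.
        AmazonEC2 client = AmazonEC2ClientBuilder.standard().build();

        DescribeSpotPriceHistoryRequest request = new DescribeSpotPriceHistoryRequest()
                .withInstanceTypes("m3.medium")
                .withProductDescriptions("Linux/UNIX (Amazon VPC)")
                .withEndTime(new Date()); // up to "now", mirroring the CLI call

        DescribeSpotPriceHistoryResult response = client.describeSpotPriceHistory(request);
        // Results may be paginated; response.getNextToken() can be used to fetch more pages.
        for (SpotPrice price : response.getSpotPriceHistory()) {
            System.out.printf("%s %s %s %s %s%n",
                    price.getTimestamp(),
                    price.getAvailabilityZone(),
                    price.getInstanceType(),
                    price.getProductDescription(),
                    price.getSpotPrice());
        }
    }
}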
Also, you could look into this website containing Java file examples of various scenarios utilizing the DescribeSpotPriceHistory API call in Java[2].
For more details about DescribeSpotPriceHistory, kindly refer to the official documentation[3].
References
[1]. https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/ec2/AmazonEC2.html#describeSpotPriceHistory-com.amazonaws.services.ec2.model.DescribeSpotPriceHistoryRequest-
[2]. https://www.programcreek.com/java-api-examples/index.php?api=com.amazonaws.services.ec2.model.SpotPrice
[3]. https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeSpotPriceHistory.html

Cucumber 4.2: Separate Runner for each browser

I had implemented Cucumber 4.2 parallel execution for the Chrome browser only. Now I want to implement parallel execution for two browsers (Firefox/Chrome). Please provide an example or skeleton so that I can improve from it. Also, where can I find the Cucumber API Javadoc?
Chrome Runner:
public class ChromeTestNGParallel {

    @Test
    public void execute() {
        //Main.main(new String[]{"--threads", "4", "-p", "timeline:target/cucumber-parallel-report", "-g", "com.peterwkc.step_definitions", "src/main/features"});
        String[] argv = new String[]{"--threads", "8", "-p", "timeline:target/cucumber-parallel-report", "-g", "com.peterwkc.step_definitions", "src/main/features"};
        ClassLoader contextClassLoader = Thread.currentThread().getContextClassLoader();
        byte exitstatus = Main.run(argv, contextClassLoader);
    }
}
Firefox Runner:
public class FirefoxTestNGParallel {

    @Test
    public void execute() {
        //Main.main(new String[]{"--threads", "4", "-p", "timeline:target/cucumber-parallel-report", "-g", "com.peterwkc.step_definitions", "src/main/features"});
        String[] argv = new String[]{"--threads", "8", "-p", "timeline:target/cucumber-parallel-report", "-g", "com.peterwkc.step_definitions", "src/main/features"};
        ClassLoader contextClassLoader = Thread.currentThread().getContextClassLoader();
        byte exitstatus = Main.run(argv, contextClassLoader);
    }
}
This is what I want.
I think you can do this outside Cucumber.
The first part is to configure Cucumber to run with a particular browser, using either a command line parameter or an environment variable.
The second part is to run two (or more) Cucumber instances at the same time. Basically, use virtual machines to do this, or just run Cucumber with different command line parameters to configure the browser.
You could even use a paid service like CircleCI to do this for you.
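One common way to wire up the first part is a small driver factory that picks the browser from a JVM system property (or an environment variable), so each runner, Maven profile, or CI job can select it, e.g. with -Dbrowser=firefox. This is a sketch with illustrative names, assuming the step definitions use Selenium WebDriver:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class DriverFactory {

    // Read the browser choice from a JVM property, falling back to an environment
    // variable and finally to Chrome.
    public static WebDriver createDriver() {
        String browser = System.getProperty("browser",
                System.getenv().getOrDefault("BROWSER", "chrome"));
        switch (browser.toLowerCase()) {
            case "firefox":
                return new FirefoxDriver();
            case "chrome":
            default:
                return new ChromeDriver();
        }
    }
}

The step definitions (or a Before hook) would call DriverFactory.createDriver(), and the two TestNG runners above would each set the property before invoking Main.run.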

part file empty while running pig in Eclipse using libraries

I ran a sample Pig script in MapReduce mode and it ran successfully.
My Pig script:
allsales = load 'sales' as (name,price,country);
bigsales = filter allsales by price >999;
sortedbigsales = order bigsales by price desc;
store sortedbigsales into 'topsales';
Now, I am trying to implement that in Eclipse (currently I am running it using libraries).
One doubt: does Pig local mode mean that we need a Hadoop installation by default?
IdLocal.java:
public class IdLocal {

    public static void main(String[] args) {
        try {
            PigServer pigServer = new PigServer("local");
            runIdQuery(pigServer, "/home/sreeveni/myfiles/pig/data/sales");
        } catch (Exception e) {
        }
    }

    public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
        pigServer.registerQuery("allsales = load '" + inputFile + "' as (name,price,country);");
        pigServer.registerQuery("bigsales = filter allsales by price >999;");
        pigServer.registerQuery("sortedbigsales = order bigsales by price desc;");
        pigServer.store("sortedbigsales", "/home/sreeveni/myfiles/OUT/topsalesjava");
    }
}
The console is showing success for me, but my part file is empty.
Why is it so?
1) Local mode Pig does not mean that you have to have Hadoop installed. You can run it without Hadoop and HDFS; everything is performed single threaded on your machine, and it reads/writes from your local filesystem by default.
2) Regarding your empty output, ensure that your input file exists on your local filesystem and that it has records with a 'price' field greater than 999; you could be filtering them all out otherwise. Also, Pig defaults to tab separated files. Is your input file tab separated? If not, then your schema definition will put the entire row into the 'name' field, and 'price' and 'country' will always be null (see the sketch below for declaring the delimiter explicitly).
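For example, if the sales file were comma separated, the load could declare the delimiter and field types explicitly. This is a sketch reusing the question's relation and field names, with the comma delimiter as an assumption about the data:

// Hypothetical variant of the load statement: explicit delimiter and typed fields,
// so that 'price > 999' compares integers instead of falling back to bytearray handling.
pigServer.registerQuery("allsales = load '" + inputFile
        + "' using PigStorage(',') as (name:chararray, price:int, country:chararray);");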
Hope that helps.
