Checkpoint with Spark file streaming in Java

I want to implement checkpointing in my Spark file streaming application so that it can process all unprocessed files from Hadoop if the streaming application ever stops or terminates. I am following the streaming programming guide, but I cannot find JavaStreamingContextFactory. Please help me: what should I do?
My code is:
public class StartAppWithCheckPoint {

    public static void main(String[] args) {
        try {
            String filePath = "hdfs://Master:9000/mmi_traffic/listenerTransaction/2020/*/*/*/";
            String checkpointDirectory = "hdfs://Mongo1:9000/probeAnalysis/checkpoint";
            SparkSession sparkSession = JavaSparkSessionSingleton.getInstance();

            JavaStreamingContextFactory contextFactory = new JavaStreamingContextFactory() {
                @Override
                public JavaStreamingContext create() {
                    SparkConf sparkConf = new SparkConf().setAppName("ProbeAnalysis");
                    JavaSparkContext sc = new JavaSparkContext(sparkConf);
                    JavaStreamingContext jssc = new JavaStreamingContext(sc, Durations.seconds(300));
                    JavaDStream<String> lines = jssc.textFileStream(filePath).cache();
                    jssc.checkpoint(checkpointDirectory);
                    return jssc;
                }
            };

            JavaStreamingContext context = JavaStreamingContext.getOrCreate(checkpointDirectory, contextFactory);
            context.start();
            context.awaitTermination();
            context.close();
            sparkSession.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

You must use checkpointing.
For checkpointing, use stateful transformations such as updateStateByKey or reduceByKeyAndWindow. There are plenty of examples in the spark-examples module shipped with the prebuilt Spark distribution and in the Spark source on GitHub. For your specific case, see JavaStatefulNetworkWordCount.java.
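Regarding the missing JavaStreamingContextFactory: in recent Spark releases (2.x and later) that factory interface is gone, and JavaStreamingContext.getOrCreate takes an org.apache.spark.api.java.function.Function0 instead. Below is a minimal sketch of that variant, reusing the paths and batch interval from the question; treat it as an illustration rather than a verified drop-in.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function0;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointedFileStream {
    public static void main(String[] args) throws InterruptedException {
        // Paths taken from the question above.
        final String filePath = "hdfs://Master:9000/mmi_traffic/listenerTransaction/2020/*/*/*/";
        final String checkpointDirectory = "hdfs://Mongo1:9000/probeAnalysis/checkpoint";

        // Factory function, used only when no checkpoint data exists yet.
        Function0<JavaStreamingContext> createContext = () -> {
            SparkConf sparkConf = new SparkConf().setAppName("ProbeAnalysis");
            JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(300));
            jssc.checkpoint(checkpointDirectory);
            JavaDStream<String> lines = jssc.textFileStream(filePath);
            lines.print(); // placeholder for the real processing/output
            return jssc;
        };

        // Recreates the context from the checkpoint directory if one is present.
        JavaStreamingContext context = JavaStreamingContext.getOrCreate(checkpointDirectory, createContext);
        context.start();
        context.awaitTermination();
    }
}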

Related

Not able to read from Parquet Reader along with Hadoop configuration using Java

I need to read a Parquet file from S3 using Java with Maven.
public static void main(String[] args) throws IOException, URISyntaxException {
    Path path = new Path("s3", "batch-dev", "/aman/part-e52b.c000.snappy.parquet");

    Configuration conf = new Configuration();
    conf.set("fs.s3.awsAccessKeyId", "xxx");
    conf.set("fs.s3.awsSecretAccessKey", "xxxxx");

    InputFile file = HadoopInputFile.fromPath(path, conf);
    ParquetFileReader reader2 = ParquetFileReader.open(conf, path);
    //MessageType schema = reader2.getFooter().getFileMetaData().getSchema();
    //System.out.println(schema);
}
Using the above code gives a FileNotFoundException.
Note that I am using the s3 scheme and not s3a; I am not sure whether Hadoop still supports the s3 scheme.
Exception in thread "main" java.io.FileNotFoundException: s3://batch-dev/aman/part-e52b.c000.snappy.parquet: No such file or directory.
    at org.apache.hadoop.fs.s3.S3FileSystem.getFileStatus(S3FileSystem.java:334)
    at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)
    at com.bidgely.cloud.core.cass.gb.S3GBRawDataHandler.main(S3GBRawDataHandler.java:505)
However, with the same path, if I use an s3Client I am able to get the object. The problem is that I cannot read the Parquet data from the input stream returned by the code below.
public static void main(String args[]) {
    AWSCredentials credentials = new BasicAWSCredentials("XXXXX", "XXXXX");
    AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
            .withRegion("us-west-2")
            .withCredentials(new AWSStaticCredentialsProvider(credentials))
            .build();

    S3Object object = s3Client.getObject(new GetObjectRequest("batch-dev", "/aman/part-e52b.c000.snappy.parquet"));
    System.out.println(object.getObjectContent());
}
Kindly help me with a solution. (I have to use Java only.)
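A commonly suggested approach for this situation is to switch to the s3a filesystem (the old block-based s3 scheme was removed from recent Hadoop releases) and open the file through HadoopInputFile. The following is only a rough sketch under that assumption; it presumes the hadoop-aws module and a matching AWS SDK are on the classpath, and it reuses the bucket and key from the question.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class ReadParquetFromS3a {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // s3a credential properties from hadoop-aws; values are placeholders.
        conf.set("fs.s3a.access.key", "xxx");
        conf.set("fs.s3a.secret.key", "xxxxx");

        // Same object as in the question, but addressed with the s3a scheme.
        Path path = new Path("s3a://batch-dev/aman/part-e52b.c000.snappy.parquet");

        try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            System.out.println(schema);
        }
    }
}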

Application on Spark using the Java language

I'm new to Spark, and I want to run an application on this framework using Java. I tried the following code:
public class Alert_Arret {

    private static final SparkSession sparkSession = SparkSession.builder()
            .master("local[*]")
            .appName("Stateful Streaming Example")
            .config("spark.sql.warehouse.dir", "file:////C:/Users/sgulati/spark-warehouse")
            .getOrCreate();

    public static Properties Connection_db() {
        Properties connectionProperties = new Properties();
        connectionProperties.put("user", "xxxxx");
        connectionProperties.put("password", "xxxxx");
        connectionProperties.put("driver", "com.mysql.jdbc.Driver");
        connectionProperties.put("url", "xxxxxxxxxxxxxxxxx");
        return connectionProperties;
    }

    public static void GetData() {
        boolean checked = true;
        String dtable = "alerte_prog";
        String dtable2 = "last_tracking_tdays";

        Dataset<Row> df_a_prog = sparkSession.read().jdbc("jdbc:mysql://host:port/database", dtable, Connection_db());
        // df_a_prog.printSchema();
        Dataset<Row> track = sparkSession.read().jdbc("jdbc:mysql://host:port/database", dtable2, Connection_db());

        if (df_a_prog.select("heureDebut") != null && df_a_prog.select("heureFin") != null) {
            track.withColumn("tracking_hour/minute",
                    from_unixtime(unix_timestamp(col("tracking_time")), "HH:mm")).show();
        }
    }

    public static void main(String[] args) {
        Connection_db();
        GetData();
    }
}
When I run this code, nothing is displayed and I get this:
0/05/11 14:00:31 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
# A fatal error has been detected by the Java Runtime Environment:
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000006c40022a,pid=3376, tid=0x0000000000002e84
I use IntelliJ IDEA and Spark version 3.0.0.

How to enable journal in embedded MongoDB from Spring Boot test

I am trying to write a Spring Boot test that uses embedded MongoDB 4.0.2. The code under test requires Mongo change streams, which require MongoDB to run as a replica set, and a MongoDB v4 replica set requires journaling to be enabled. I was not able to find a way to start with journaling enabled, so I posted this here looking for answers. I subsequently found out how to do it; the solution is below.
I am using Spring Boot 2.1.3.RELEASE and Spring Data MongoDB 2.1.5.RELEASE.
This is what I'd been trying:
@RunWith(SpringRunner.class)
@DataMongoTest(properties = {
        "spring.mongodb.embedded.version= 4.0.2",
        "spring.mongodb.embedded.storage.repl-set-name = r_0",
        "spring.mongodb.embedded.storage.journal.enabled=true"
})
public class MyStreamWatcherTest {

    @SpringBootApplication
    @ComponentScan(basePackages = {"my.package.with.dao.classes"})
    @EnableMongoRepositories({"my.package.with.dao.repository"})
    static public class Application {
        public static void main(String[] args) {
            SpringApplication.run(Application.class, args);
        }
    }

    @Before
    public void startup() {
        MongoDatabase adminDb = mongoClient.getDatabase("admin");
        Document config = new Document("_id", "rs0");
        BasicDBList members = new BasicDBList();
        members.add(new Document("_id", 0).append("host", mongoClient.getConnectPoint()));
        config.put("members", members);
        adminDb.runCommand(new Document("replSetInitiate", config));
    }
}
However, when the test started, the options used to launch mongod did not include journaling.
The fix was to add this class:
@Configuration
public class MyEmbeddedMongoConfiguration {

    private int localPort = 0;

    public int getLocalPort() {
        return localPort;
    }

    @Bean
    public IMongodConfig mongodConfig(EmbeddedMongoProperties embeddedProperties) throws IOException {
        MongodConfigBuilder builder = new MongodConfigBuilder()
                .version(Version.V4_0_2)
                .cmdOptions(new MongoCmdOptionsBuilder().useNoJournal(false).build());

        // Save the local port so the replica set initializer can come get it.
        this.localPort = Network.getFreeServerPort();
        builder.net(new Net("127.0.0.1", this.getLocalPort(), Network.localhostIsIPv6()));

        EmbeddedMongoProperties.Storage storage = embeddedProperties.getStorage();
        if (storage != null) {
            String databaseDir = storage.getDatabaseDir();
            String replSetName = "rs0"; // Should be able to: storage.getReplSetName();
            int oplogSize = (storage.getOplogSize() != null)
                    ? (int) storage.getOplogSize().toMegabytes() : 0;
            builder.replication(new Storage(databaseDir, replSetName, oplogSize));
        }
        return builder.build();
    }
}
This enabled journaling, and mongod started with the replica set configured. Then I added another class to initialize the replica set:
@Configuration
public class EmbeddedMongoReplicaSetInitializer {

    @Autowired
    MyEmbeddedMongoConfiguration myEmbeddedMongoConfiguration;

    MongoClient mongoClient;

    // We don't use this MongoClient as it will try to wait for the replica set to stabilize
    // before address-fetching methods will return. It is specified here to order this class's
    // creation after MongoClient, so we can be sure mongod is running.
    EmbeddedMongoReplicaSetInitializer(MongoClient mongoClient) {
        this.mongoClient = mongoClient;
    }

    @PostConstruct
    public void initReplicaSet() {
        //List<ServerAddress> serverAddresses = mongoClient.getServerAddressList();
        MongoClient mongo = new MongoClient(new ServerAddress("127.0.0.1", myEmbeddedMongoConfiguration.getLocalPort()));
        MongoDatabase adminDb = mongo.getDatabase("admin");

        Document config = new Document("_id", "rs0");
        BasicDBList members = new BasicDBList();
        members.add(new Document("_id", 0).append("host", String.format("127.0.0.1:%d", myEmbeddedMongoConfiguration.getLocalPort())));
        config.put("members", members);
        adminDb.runCommand(new Document("replSetInitiate", config));

        mongo.close();
    }
}
That's getting the job done. If anyone has tips to make this easier, please post here.
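One optional extra (an assumption on my part, not something the setup above requires) is to sanity-check the result by asking the embedded server for its replica set status right after replSetInitiate, before mongo.close():

// Optional check at the end of initReplicaSet(): confirm the replica set came up.
Document status = mongo.getDatabase("admin").runCommand(new Document("replSetGetStatus", 1));
System.out.println(status.toJson()); // "set" should be "rs0"; the single member should eventually report PRIMARY.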

Java Kafka Consumer noClassDefFoundError

I am learning Kafka using a test consumer and producer, but I am facing the error below.
Kafka consumer program:
package kafka001;

import java.util.Arrays;
import java.util.Properties;
import java.util.Scanner;

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.errors.WakeupException;

public class ConsumerApp {
    private static Scanner in;
    private static boolean stop = false;

    public static void main(String[] args) throws Exception {
        System.out.println(args[0] + args.length);
        if (args.length != 2) {
            System.err.printf("Usage: %s <topicName> <groupId>\n");
            System.exit(-1);
        }
        in = new Scanner(System.in);
        String topicName = args[0];
        String groupId = args[1];

        ConsumerThread consumerRunnable = new ConsumerThread(topicName, groupId);
        consumerRunnable.start();
        //System.out.println("Here");
        String line = "";
        while (!line.equals("exit")) {
            line = in.next();
        }
        consumerRunnable.getKafkaConsumer().wakeup();
        System.out.println("Stopping consumer now.....");
        consumerRunnable.join();
    }

    private static class ConsumerThread extends Thread {
        private String topicName;
        private String groupId;
        private KafkaConsumer<String, String> kafkaConsumer;

        public ConsumerThread(String topicName, String groupId) {
            //System.out.println("inside ConsumerThread constructor");
            this.topicName = topicName;
            this.groupId = groupId;
        }

        public void run() {
            //System.out.println("inside run");
            // Set up Kafka consumer properties
            Properties configProperties = new Properties();
            configProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "aup7727s.unix.anz:9092");
            configProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
            configProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
            configProperties.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
            configProperties.put(ConsumerConfig.CLIENT_ID_CONFIG, "simple");

            // Subscribe to the topic
            kafkaConsumer = new KafkaConsumer<String, String>(configProperties);
            kafkaConsumer.subscribe(Arrays.asList(topicName));

            // Get/process messages from the topic and print them to the console
            try {
                while (true) {
                    ConsumerRecords<String, String> records = kafkaConsumer.poll(100);
                    for (ConsumerRecord<String, String> record : records)
                        System.out.println(record.value());
                }
            } catch (WakeupException ex) {
                System.out.println("Exception caught " + ex.getMessage());
            } finally {
                kafkaConsumer.close();
                System.out.println("After closing KafkaConsumer");
            }
        }

        public KafkaConsumer<String, String> getKafkaConsumer() {
            return this.kafkaConsumer;
        }
    }
}
When I compile the code, I see the following class files:
ConsumerApp$ConsumerThread.class and
ConsumerApp.class
I've generated a jar file named ConsumerApp.jar through Eclipse, and when I run it on the Hadoop cluster I get a NoClassDefFoundError as below:
java -cp ConsumerApp.jar kafka001/ConsumerApp console1 group1
or
hadoop jar ConsumerApp.jar console1 group1
Exception in thread "main" java.lang.NoClassDefFoundError: org.apache.kafka.common.errors.WakeupException
    at kafka001.ConsumerApp.main(ConsumerApp.java:24)
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.errors.WakeupException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:607)
    at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:846)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:825)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:325)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:805)
    ... 1 more
I am using Eclipse to compile, build with Maven, and generate the jar file. Line 24 corresponds to the creation of the ConsumerThread instance.
I cannot tell whether this is because the ConsumerThread class name is saved incorrectly (the class file is generated as ConsumerApp$ConsumerThread.class instead of ConsumerThread.class), or whether something needs to be taken care of while generating the jar file.
Since I can't view the entire project, I would try this: right-click on the project -> go to Maven 2 Tools -> click "Generate Artifacts" (check for updates). That should pull in any missing dependencies. Also make sure you check out other similar posts that may resolve your issue, like this one.
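For what it's worth, the ConsumerApp$ConsumerThread.class file is normal for a nested class. The NoClassDefFoundError points at org.apache.kafka.common.errors.WakeupException missing from the runtime classpath, which usually means the kafka-clients jar is not packaged into ConsumerApp.jar or supplied alongside it. A small, hypothetical diagnostic (not part of the original post) is to probe for the class, launched the same way you launch the app:

// Hypothetical diagnostic: run this exactly the way you run ConsumerApp
// to confirm whether kafka-clients is visible on the runtime classpath.
public class ClasspathCheck {
    public static void main(String[] args) {
        try {
            Class.forName("org.apache.kafka.common.errors.WakeupException");
            System.out.println("kafka-clients is on the classpath");
        } catch (ClassNotFoundException e) {
            System.out.println("kafka-clients is NOT on the classpath: " + e);
        }
    }
}

If the probe fails, the usual remedies are building a shaded/fat jar or adding the kafka-clients jar to the -cp (or to HADOOP_CLASSPATH when using hadoop jar) at launch time.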

Parse CSV in an Apache Beam pipeline to post it to Google Pub/Sub

I am facing a problem parsing CSV in an Apache Beam pipeline project.
I used line.split(",") to get an array of strings, but some CSV fields contain conversation text that itself includes the "," character, "|", etc.
Here are snippets of my code:
public class ConvertBlockerToConversationOperation extends DoFn<String, PubsubMessage> {

    private final Logger log = LoggerFactory.getLogger(ParseCsv.class);

    @ProcessElement
    public void processElement(ProcessContext c) {
        String startConversationMessage = c.element();
        JsonObject conversation = ParseCsv.getObjectFromCsv(startConversationMessage);
        c.output(new PubsubMessage(conversation.toString().getBytes(), null));
    }
}
I am using TextIO.read() to read the CSV from Google Cloud Storage:
public class CsvToPubsub {

    public interface Options extends PipelineOptions {
        @Description("The file pattern to read records from (e.g. gs://bucket/file-*.csv)")
        @Required
        ValueProvider<String> getInputFilePattern();
        void setInputFilePattern(ValueProvider<String> value);

        @Description("The name of the topic which data should be published to. "
                + "The name should be in the format of projects/<project-id>/topics/<topic-name>.")
        @Required
        ValueProvider<String> getOutputTopic();
        void setOutputTopic(ValueProvider<String> value);
    }

    public static void main(String[] args) {
        ConfigurationLoader configurationLoader = new ConfigurationLoader(args[0].substring(6));
        PipelineUtils pipelineUtils = new PipelineUtils();

        Options options = PipelineOptionsFactory
                .fromArgs(args)
                .withValidation()
                .as(Options.class);

        run(options, configurationLoader, pipelineUtils);
    }

    public static PipelineResult run(Options options, ConfigurationLoader configurationLoader, PipelineUtils pipelineUtils) {
        Pipeline pipeline = Pipeline.create(options);

        pipeline
                .apply("Read Text Data", TextIO.read().from(options.getInputFilePattern()))
                .apply("Transform CSV to Conversation", ParDo.of(new ConvertBlockerToConversationOperation()))
                .apply("Generate conversation command", ParDo.of(new GenerateConversationCommandOperation(pipelineUtils)))
                .apply("Partition conversations", Partition.of(4, new PartitionConversationBySourceOperation()))
                .apply("Publish conversations", new PublishConversationPartitionToPubSubOperation(configurationLoader, new ConvertConversationToStringOperation()));

        return pipeline.run();
    }
}
Is there any CSV library that supports TextIO output?
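I am not aware of a CSV-aware TextIO variant, but a common workaround is to keep TextIO.read() and parse each line inside the DoFn with a real CSV parser instead of line.split(","), so quoted fields containing "," or "|" stay intact. Below is a rough sketch using Apache Commons CSV; the class and method names are hypothetical stand-ins for the ParseCsv helper used in the question, and the library choice is an assumption (any equivalent CSV parser would do):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

// Hypothetical helper that could back ParseCsv.getObjectFromCsv in the DoFn above.
public class CsvLineParser {

    // Splits one CSV line into fields, honoring quotes, unlike line.split(",").
    public static List<String> parseLine(String line) throws IOException {
        List<String> fields = new ArrayList<>();
        try (CSVParser parser = CSVParser.parse(line, CSVFormat.DEFAULT)) {
            for (CSVRecord record : parser) {   // a single line yields one record
                for (String field : record) {
                    fields.add(field);
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        List<String> fields = parseLine("1,\"hello, world\",foo|bar");
        // "hello, world" stays a single field because it is quoted.
        System.out.println(fields.size() + " fields: " + fields); // 3 fields: [1, hello, world, foo|bar]
    }
}

If the records are later turned into JSON for the PubsubMessage, the field list from parseLine can feed whatever mapping ParseCsv.getObjectFromCsv currently performs.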
