I want to implement checkpointing in a Spark file-streaming application so that it processes all unprocessed files from Hadoop in case my Spark streaming application stops or terminates. I am following the streaming programming guide, but I could not find JavaStreamingContextFactory. Please help me figure out what I should do.
My code is:
public class StartAppWithCheckPoint {
public static void main(String[] args) {
try {
String filePath = "hdfs://Master:9000/mmi_traffic/listenerTransaction/2020/*/*/*/";
String checkpointDirectory = "hdfs://Mongo1:9000/probeAnalysis/checkpoint";
SparkSession sparkSession = JavaSparkSessionSingleton.getInstance();
JavaStreamingContextFactory contextFactory = new JavaStreamingContextFactory() {
@Override public JavaStreamingContext create() {
SparkConf sparkConf = new SparkConf().setAppName("ProbeAnalysis");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaStreamingContext jssc = new JavaStreamingContext(sc, Durations.seconds(300));
JavaDStream<String> lines = jssc.textFileStream(filePath).cache();
jssc.checkpoint(checkpointDirectory);
return jssc;
}
};
JavaStreamingContext context = JavaStreamingContext.getOrCreate(checkpointDirectory, contextFactory);
context.start();
context.awaitTermination();
context.close();
sparkSession.close();
} catch(Exception e) {
e.printStackTrace();
}
}
}
You must use Checkpointing
For checkpointing, use stateful transformations such as updateStateByKey or reduceByKeyAndWindow. There are plenty of examples in the spark-examples module shipped with the prebuilt Spark distribution and in the Spark source on GitHub. For your specific case, see JavaStatefulNetworkWordCount.java.
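Regarding the missing JavaStreamingContextFactory: in recent Spark versions (2.x onward) that interface no longer exists, and JavaStreamingContext.getOrCreate takes a Function0<JavaStreamingContext> instead. Below is a minimal sketch of the same program against that API, assuming Spark 2.x; the paths are copied from the question and the output action (print) is only a placeholder.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function0;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StartAppWithCheckPoint {
    public static void main(String[] args) throws Exception {
        String filePath = "hdfs://Master:9000/mmi_traffic/listenerTransaction/2020/*/*/*/";
        String checkpointDirectory = "hdfs://Mongo1:9000/probeAnalysis/checkpoint";

        // Function0 replaces JavaStreamingContextFactory in Spark 2.x.
        Function0<JavaStreamingContext> createContext = () -> {
            SparkConf sparkConf = new SparkConf().setAppName("ProbeAnalysis");
            JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(300));
            jssc.checkpoint(checkpointDirectory);
            // The whole DStream graph must be defined inside the creating function
            // so it can be rebuilt from the checkpoint data on restart.
            JavaDStream<String> lines = jssc.textFileStream(filePath).cache();
            lines.print();
            return jssc;
        };

        // Recreates the context from the checkpoint if one exists, otherwise calls createContext.
        JavaStreamingContext context =
                JavaStreamingContext.getOrCreate(checkpointDirectory, createContext);
        context.start();
        context.awaitTermination();
    }
}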
I need to read a Parquet file from S3 using Java with Maven.
public static void main(String[] args) throws IOException, URISyntaxException {
Path path = new Path("s3", "batch-dev", "/aman/part-e52b.c000.snappy.parquet");
Configuration conf = new Configuration();
conf.set("fs.s3.awsAccessKeyId", "xxx");
conf.set("fs.s3.awsSecretAccessKey", "xxxxx");
InputFile file = HadoopInputFile.fromPath(path, conf);
ParquetFileReader reader2 = ParquetFileReader.open(conf, path);
//MessageType schema = reader2.getFooter().getFileMetaData().getSchema();
//System.out.println(schema);
}
The above code gives a FileNotFoundException.
Note that I am using the s3 scheme and not s3a; I am not sure whether Hadoop supports the s3 scheme.
Exception in thread "main" java.io.FileNotFoundException: s3://batch-dev/aman/part-e52b.c000.snappy.parquet: No such file or directory.
at org.apache.hadoop.fs.s3.S3FileSystem.getFileStatus(S3FileSystem.java:334)
at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)
at com.bidgely.cloud.core.cass.gb.S3GBRawDataHandler.main(S3GBRawDataHandler.java:505)
However, with the same path, if I use an s3Client I am able to get the object. The problem is that I cannot read Parquet data from the input stream obtained via the code below.
public static void main(String args[]) {
AWSCredentials credentials = new BasicAWSCredentials("XXXXX", "XXXXX");
AmazonS3 s3Client = AmazonS3ClientBuilder.standard().withRegion("us-west-2").withCredentials(new AWSStaticCredentialsProvider(credentials)).build();
S3Object object = s3Client.getObject(new GetObjectRequest("batch-dev", "/aman/part-e52b.c000.snappy.parquet"));
System.out.println(object.getObjectContent());
}
Kindly help me with a solution. (I have to use Java only.)
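For reference, this is the s3a-based variant I plan to try next. It is only a sketch: it assumes the hadoop-aws (s3a) connector is on the classpath, uses the fs.s3a.* credential keys, and the class name is my own placeholder.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.schema.MessageType;

public class ReadParquetFromS3a {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // s3a credential keys from the hadoop-aws module; fill in real values.
        conf.set("fs.s3a.access.key", "xxx");
        conf.set("fs.s3a.secret.key", "xxxxx");

        // Same object as above, addressed through the s3a scheme.
        Path path = new Path("s3a://batch-dev/aman/part-e52b.c000.snappy.parquet");

        InputFile file = HadoopInputFile.fromPath(path, conf);
        try (ParquetFileReader reader = ParquetFileReader.open(file)) {
            // Print the Parquet schema from the file footer as a smoke test.
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            System.out.println(schema);
        }
    }
}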
I'm new to Spark, and I want to run an application on this framework using Java. I tried the following code:
public class Alert_Arret {
private static final SparkSession sparkSession = SparkSession.builder().master("local[*]").appName("Stateful Streaming Example").config("spark.sql.warehouse.dir", "file:////C:/Users/sgulati/spark-warehouse").getOrCreate();
public static Properties Connection_db () {
Properties connectionProperties = new Properties();
connectionProperties.put("user", "xxxxx");
connectionProperties.put("password", "xxxxx");
connectionProperties.put("driver","com.mysql.jdbc.Driver");
connectionProperties.put("url","xxxxxxxxxxxxxxxxx");
return connectionProperties;
}
public static void GetData() {
boolean checked = true;
String dtable = "alerte_prog";
String dtable2 = "last_tracking_tdays";
Dataset<Row> df_a_prog = sparkSession.read().jdbc("jdbc:mysql://host:port/database", dtable, Connection_db());
// df_a_prog.printSchema();
Dataset<Row> track = sparkSession.read().jdbc("jdbc:mysql://host:port/database", dtable2, Connection_db());
if (df_a_prog.select("heureDebut") != null && df_a_prog.select("heureFin") != null ) {
track.withColumn("tracking_hour/minute", from_unixtime(unix_timestamp(col("tracking_time")), "HH:mm")).show(); }
}
public static void main(String[] args) {
Connection_db();
GetData();
}
}
When I run this code nothing is displayed and I get this:
0/05/11 14:00:31 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
# A fatal error has been detected by the Java Runtime Environment:
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000006c40022a,pid=3376, tid=0x0000000000002e84
I use IntelliJ IDEA and Spark version 3.0.0.
I am trying to write a Spring Boot test that uses embedded MongoDB 4.0.2; the code under test requires Mongo Change Streams, which require MongoDB to be started as a replica set, and a replica set on MongoDB v4 requires journaling to be enabled. I was not able to find a way to start with journaling enabled, so I posted this here looking for answers. I subsequently found out how to do it - see below.
I have Spring Boot 2.1.3.RELEASE and spring-data-mongodb 2.1.5.RELEASE.
This is what I'd been trying:
@RunWith(SpringRunner.class)
@DataMongoTest(properties= {
"spring.mongodb.embedded.version= 4.0.2",
"spring.mongodb.embedded.storage.repl-set-name = r_0",
"spring.mongodb.embedded.storage.journal.enabled=true"
})
public class MyStreamWatcherTest {
@SpringBootApplication
@ComponentScan(basePackages = {"my.package.with.dao.classes"})
@EnableMongoRepositories( { "my.package.with.dao.repository" })
static public class Application {
public static void main(String[] args) {
SpringApplication.run(Application.class, args);
}
}
@Before
public void startup() {
MongoDatabase adminDb = mongoClient.getDatabase("admin");
Document config = new Document("_id", "rs0");
BasicDBList members = new BasicDBList();
members.add(new Document("_id", 0).append("host",
mongoClient.getConnectPoint()));
config.put("members", members);
adminDb.runCommand(new Document("replSetInitiate", config));
However, when the test started, the options used to launch mongod did not include journaling.
The fix was to add this class:
@Configuration
public class MyEmbeddedMongoConfiguration {
private int localPort = 0;
public int getLocalPort() {
return localPort;
}
@Bean
public IMongodConfig mongodConfig(EmbeddedMongoProperties embeddedProperties) throws IOException {
MongodConfigBuilder builder = new MongodConfigBuilder()
.version(Version.V4_0_2)
.cmdOptions(new MongoCmdOptionsBuilder().useNoJournal(false).build());
// Save the local port so the replica set initializer can come get it.
this.localPort = Network.getFreeServerPort();
builder.net(new Net("127.0.0.1", this.getLocalPort(), Network.localhostIsIPv6()));
EmbeddedMongoProperties.Storage storage = embeddedProperties.getStorage();
if (storage != null) {
String databaseDir = storage.getDatabaseDir();
String replSetName = "rs0"; // Should be able to: storage.getReplSetName();
int oplogSize = (storage.getOplogSize() != null)
? (int) storage.getOplogSize().toMegabytes() : 0;
builder.replication(new Storage(databaseDir, replSetName, oplogSize));
}
return builder.build();
}
}
This got journaling enabled, and mongod started with the replica set configured. Then I added another class to initialize the replica set:
@Configuration
public class EmbeddedMongoReplicaSetInitializer {
@Autowired
MyEmbeddedMongoConfiguration myEmbeddedMongoConfiguration;
MongoClient mongoClient;
// We don't use this MongoClient as it will try to wait for the replica set to stabilize
// before address-fetching methods will return. It is specified here to order this class's
// creation after MongoClient, so we can be sure mongod is running.
EmbeddedMongoReplicaSetInitializer(MongoClient mongoClient) {
this.mongoClient = mongoClient;
}
@PostConstruct
public void initReplicaSet() {
//List<ServerAddress> serverAddresses = mongoClient.getServerAddressList();
MongoClient mongo = new MongoClient(new ServerAddress("127.0.0.1", myEmbeddedMongoConfiguration.getLocalPort()));
MongoDatabase adminDb = mongo.getDatabase("admin");
Document config = new Document("_id", "rs0");
BasicDBList members = new BasicDBList();
members.add(new Document("_id", 0).append("host", String.format("127.0.0.1:%d", myEmbeddedMongoConfiguration.getLocalPort())));
config.put("members", members);
adminDb.runCommand(new Document("replSetInitiate", config));
mongo.close();
}
}
That's getting the job done. If anyone has tips to make this easier, please post here.
I am learning Kafka using a test consumer and producer, but I am facing the error below.
Kafka consumer program:
package kafka001;
import java.util.Arrays;
import java.util.Properties;
import java.util.Scanner;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.errors.WakeupException;
public class ConsumerApp {
private static Scanner in;
private static boolean stop = false;
public static void main(String[] args) throws Exception {
System.out.println(args[0] + args.length);
if (args.length != 2) {
System.err.printf("Usage: %s <topicName> <groupId>\n", ConsumerApp.class.getSimpleName());
System.exit(-1);
}
in = new Scanner(System.in);
String topicName = args[0];
String groupId = args[1];
ConsumerThread consumerRunnable = new ConsumerThread(topicName, groupId);
consumerRunnable.start();
//System.out.println("Here");
String line = "";
while (!line.equals("exit")) {
line = in.next();
}
consumerRunnable.getKafkaConsumer().wakeup();
System.out.println("Stopping consumer now.....");
consumerRunnable.join();
}
private static class ConsumerThread extends Thread{
private String topicName;
private String groupId;
private KafkaConsumer<String,String> kafkaConsumer;
public ConsumerThread(String topicName, String groupId){
//System.out.println("inside ConsumerThread constructor");
this.topicName = topicName;
this.groupId = groupId;
}
public void run() {
//System.out.println("inside run");
// Setup Kafka producer properties
Properties configProperties = new Properties();
configProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "aup7727s.unix.anz:9092");
configProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
configProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
configProperties.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
configProperties.put(ConsumerConfig.CLIENT_ID_CONFIG, "simple");
// subscribe to topic
kafkaConsumer = new KafkaConsumer<String, String>(configProperties);
kafkaConsumer.subscribe(Arrays.asList(topicName));
// Get/process messages from topic and print it to console
try {while(true) {
ConsumerRecords<String, String> records = kafkaConsumer.poll(100);
for (ConsumerRecord<String, String> record : records)
System.out.println(record.value());
}
} catch(WakeupException ex) {
System.out.println("Exception caught " + ex.getMessage());
}finally {
kafkaConsumer.close();
System.out.println("After closing KafkaConsumer");
}
}
public KafkaConsumer<String,String> getKafkaConsumer(){
return this.kafkaConsumer;
}
}
}
When I compile the code, I notice the following class files:
ConsumerApp$ConsumerThread.class and
ConsumerApp.class
I've generated a jar file named ConsumerApp.jar through Eclipse, and when I run it on the Hadoop cluster I get a NoClassDefFoundError as below:
java -cp ConsumerApp.jar kafka001/ConsumerApp console1 group1
or
hadoop jar ConsumerApp.jar console1 group1
Exception in thread "main" java.lang.NoClassDefFoundError: org.apache.kafka.common.errors.WakeupException
at kafka001.ConsumerApp.main(ConsumerApp.java:24)
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.errors.WakeupException
at java.net.URLClassLoader.findClass(URLClassLoader.java:607)
at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:846)
at java.lang.ClassLoader.loadClass(ClassLoader.java:825)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:325)
at java.lang.ClassLoader.loadClass(ClassLoader.java:805)
... 1 more
I am using Eclipse to compile, run the Maven build, and generate the jar file. Line number 24 corresponds to the creation of the ConsumerThread instance.
I am unable to work out whether this is due to the ConsumerThread class name being saved incorrectly (the class file is generated as ConsumerApp$ConsumerThread.class instead of ConsumerThread.class), or whether something needs to be taken care of while generating the jar file.
Since I can't view the entire project, I would try this: right-click on the project -> go to Maven 2 tools -> click generate artifacts (check for updates). That should create any missing dependencies. Also make sure you check out other similar posts that may resolve your issue, like this one.
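If that doesn't resolve it, one more thing worth checking (an assumption on my part, since a runtime NoClassDefFoundError usually just means the kafka-clients jar was available at compile time but is not on the runtime classpath): run with the jar added explicitly, for example (the jar path and version below are placeholders):

java -cp ConsumerApp.jar:/path/to/kafka-clients-2.0.0.jar kafka001.ConsumerApp console1 group1

Alternatively, build a fat/uber jar (for example with the maven-shade-plugin) so that kafka-clients is packaged inside ConsumerApp.jar and your original command works unchanged.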
I'm facing a problem parsing CSV in an Apache Beam pipeline project.
I used line.split(",") to get an array of strings, but I have CSV fields that contain conversation text which itself includes the "," character, "|", etc.
Here are snippets of my code:
public class ConvertBlockerToConversationOperation extends DoFn<String, PubsubMessage> {
private final Logger log = LoggerFactory.getLogger(ParseCsv.class);
@ProcessElement
public void processElement(ProcessContext c) {
String startConversationMessage = c.element();
JsonObject conversation = ParseCsv.getObjectFromCsv(startConversationMessage);
c.output(new PubsubMessage(conversation.toString().getBytes(),null ));
}
}
I am using TextIO.read() to read csv from a GC Storage:
public class CsvToPubsub {
public interface Options extends PipelineOptions {
#Description("The file pattern to read records from (e.g. gs://bucket/file-*.csv)")
#Required
ValueProvider<String> getInputFilePattern();
void setInputFilePattern(ValueProvider<String> value);
#Description("The name of the topic which data should be published to. "
+ "The name should be in the format of projects/<project-id>/topics/<topic-name>.")
@Required
ValueProvider<String> getOutputTopic();
void setOutputTopic(ValueProvider<String> value);
}
public static void main(String[] args) {
ConfigurationLoader configurationLoader = new ConfigurationLoader(args[0].substring(6));
PipelineUtils pipelineUtils = new PipelineUtils();
Options options = PipelineOptionsFactory
.fromArgs(args)
.withValidation()
.as(Options.class);
run(options,configurationLoader,pipelineUtils);
}
public static PipelineResult run(Options options,ConfigurationLoader configurationLoader,PipelineUtils pipelineUtils) {
Pipeline pipeline = Pipeline.create(options);
pipeline
.apply("Read Text Data", TextIO.read().from(options.getInputFilePattern()))
.apply("Transform CSV to Conversation", ParDo.of(new ConvertBlockerToConversationOperation()))
.apply("Generate conversation command", ParDo.of(new GenerateConversationCommandOperation(pipelineUtils)))
.apply("Partition conversations", Partition.of(4, new PartitionConversationBySourceOperation()))
.apply("Publish conveIorsations", new PublishConversationPartitionToPubSubOperation(configurationLoader, new ConvertConversationToStringOperation()));
return pipeline.run();
}
}
Is there any CSV library that supports TextIO output?
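Not Beam-specific, but a regular CSV parser can be applied to each line that TextIO.read() emits. Below is a minimal sketch assuming Apache Commons CSV (org.apache.commons:commons-csv) is added as a dependency; the DoFn name is a placeholder and the payload construction is only illustrative - your ParseCsv.getObjectFromCsv logic would build the conversation JSON from the parsed fields instead.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class ParseCsvLineFn extends DoFn<String, PubsubMessage> {
    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
        String line = c.element();
        // Commons CSV honours quoting, so fields containing "," or "|" stay intact.
        try (CSVParser parser = CSVParser.parse(line, CSVFormat.DEFAULT)) {
            for (CSVRecord record : parser) {
                // Illustrative payload only: re-join the parsed fields with "|".
                String payload = String.join("|", record);
                c.output(new PubsubMessage(payload.getBytes(StandardCharsets.UTF_8),
                        Collections.emptyMap()));
            }
        }
    }
}

One caveat: TextIO.read() emits one element per physical line, so CSV records that contain quoted newlines will still be split across elements; for those you would need to read whole files (e.g. via FileIO) and run the CSV parser over the full contents.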