How to read multiple text files in Spark for document clustering? - java

I want to read multiple text documents from a directory for document clustering.
For that, I want to read data as:
SparkConf sparkConf = new SparkConf().setAppName(appName).setMaster("local[*]").set("spark.executor.memory", "2g");
JavaSparkContext context = new JavaSparkContext(sparkConf);
SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate();
Dataset<Row> dataset = spark.read().textFile("path to directory");
Here, I don't want to use
JavaPairRDD data = context.wholeTextFiles(path);
because I want Dataset as a return type.

In scala you could write this:
context.wholeTextFiles("...").toDS()
In java you need to use an Encoder. See the javadoc for more detail.
JavaPairRDD<String, String> rdd = context.wholeTextFiles("hdfs:///tmp/test_read");
Encoder<Tuple2<String, String>> encoder = Encoders.tuple(Encoders.STRING(), Encoders.STRING());
spark.createDataset(rdd.rdd(), encoder).show();

Related

How to do Kafka-Spark-MongoDb integration efficiently

I am writing a Spark 2.4 transformation for spark benchmarking which will get JSON Streams from Kafka topic and need to dump it to MongoDB. I can do it using Java MongoClient, but data can be huge such as 1 Million records coming through multiple threads from Kafka. Spark processes it very fast but mongo write is very slow.
SparkConf sparkConf = new SparkConf().setMaster("local[*]").
setAppName("JavaDirectKafkaStreaming");
sparkConf.set("spark.streaming.backpressure.enabled","true");
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(2));
Map<String, Object> kafkaParams = new HashMap<String, Object>();
kafkaParams.put("bootstrap.servers", "loacalhost:9092");
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("group.id", "2");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);
Collection<String> topics = Arrays.asList("poc-topic");
final JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(streamingContext,
LocationStrategies.PreferConsistent(),
org.apache.spark.streaming.kafka010.ConsumerStrategies.<String, String> Subscribe(topics, kafkaParams));
#SuppressWarnings("serial")
JavaPairDStream<String, String> jPairDStream = stream
.mapToPair(new PairFunction<ConsumerRecord<String, String>, String, String>() {
public Tuple2<String, String> call(ConsumerRecord<String, String> record) throws Exception {
return new Tuple2<>(record.key(), record.value());
}
});
jPairDStream.foreachRDD(jPairRDD -> {
jPairRDD.foreach(rdd -> {
System.out.println("value=" + rdd._2());
if (rdd._2() != null) {
System.out.println("inserting=" + rdd._2());
Document doc = Document.parse(rdd._2());
// List<Document> list = new ArrayList<>();
// list.add(doc);
db.getCollection("collection").insertOne(doc);
System.out.println("Inserted Data Done");
}
else {
System.out.println("Got no data in this window");
}
});
});
streamingContext.start();
streamingContext.awaitTermination();
Where
MongoClient mongo = new MongoClient("localhost", 27017);
MongoDatabase db = mongo.getDatabase("mongodb");
I expect to speed up the mongo Operation,how to achiever multithreading for mongo write? (should I use MongoClientOptions for minconnection per host?)
Also is the approach taken is correct to use MongoDriver or it should done by MonogSpark connector or By spark writestream() API's. If yes how to write each rdd as separate record in mongo any example in Java?
I don't know about "efficiently" because there are a lot of factors at play here.
For example, Kafka partitions and total Spark executors are just two values that need tuned to accomodate for thoughput.
I do see you are using the ForEachWriter, which is a good way to do it, but maybe not the best considering you're doing constantly calling insertOne, compared to using Spark Structed Streaming to begin with, reading from Kafka, manipulating your data into a Struct object, then using SparkSQL Mongo Connector to directly dump to Mongo collections (which I would guess uses Mongo transactions, and inserts mutiple records at a time)
Also worth mentioning, Landoop offers a MongoDB Kafka Connect Sink, which requires one config file, and no Spark code to be written.

How to write Dataset to a excel file using hadoop office library in apache spark java

Currently I am using com.crealytics.spark.excel to read an Excel file, but using this library I can't write the dataset to an Excel file.
This link says that using hadoop office library (org.zuinnote.spark.office.excel) we can read and write to Excel files
Please help me to write dataset object to an excel file in spark java.
You can use org.zuinnote.spark.office.excel for both reading and writing excel file using Dataset. Examples are given at https://github.com/ZuInnoTe/spark-hadoopoffice-ds/. However, there is one issue if you read the Excel in Dataset and try to write it in another Excel file. Please see the issue and workaround in scala at https://github.com/ZuInnoTe/hadoopoffice/issues/12.
I have written a sample program in Java using org.zuinnote.spark.office.excel and workaround given at that link. Please see if this helps you.
public class SparkExcel {
public static void main(String[] args) {
//spark session
SparkSession spark = SparkSession
.builder()
.appName("SparkExcel")
.master("local[*]")
.getOrCreate();
//Read
Dataset<Row> df = spark
.read()
.format("org.zuinnote.spark.office.excel")
.option("read.locale.bcp47", "de")
.load("c:\\temp\\test1.xlsx");
//Print
df.show();
df.printSchema();
//Flatmap function
FlatMapFunction<Row, String[]> flatMapFunc = new FlatMapFunction<Row, String[]>() {
#Override
public Iterator<String[]> call(Row row) throws Exception {
ArrayList<String[]> rowList = new ArrayList<String[]>();
List<Row> spreadSheetRows = row.getList(0);
for (Row srow : spreadSheetRows) {
ArrayList<String> arr = new ArrayList<String>();
arr.add(srow.getString(0));
arr.add(srow.getString(1));
arr.add(srow.getString(2));
arr.add(srow.getString(3));
arr.add(srow.getString(4));
rowList.add(arr.toArray(new String[] {}));
}
return rowList.iterator();
}
};
//Apply flatMap function
Dataset<String[]> df2 = df.flatMap(flatMapFunc, spark.implicits().newStringArrayEncoder());
//Write
df2.write()
.mode(SaveMode.Overwrite)
.format("org.zuinnote.spark.office.excel")
.option("write.locale.bcp47", "de")
.save("c:\\temp\\test2.xlsx");
}
}
I have tested this code with Java 8 and Spark 2.1.0. I am using maven and added dependency for org.zuinnote.spark.office.excel from https://mvnrepository.com/artifact/com.github.zuinnote/spark-hadoopoffice-ds_2.11/1.0.3

Write JavaPairDStream to single HDFS location

I am exploring Spark Streaming using Java.
I current have Cloudera quick start VM (CDH 5.5) downloaded and I wrote a Java code for Spark streaming
I have written a program which returns JavaPairDStream. When I try to write the output to HDFS, it works but it is creating multiple folders (based on timestamp). The documentation says that this is how it is going to work, but is there a way to write the output to the same folder/file in HDFS? I tried to use repartition(1), but that did not work
Please see the code below:
if (args.length < 3) {
System.err.println("Invalid arguments");
System.exit(1);
}
SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("Product Reco Spark Streaming");
JavaStreamingContext javaStreamContext = new JavaStreamingContext(sparkConf, new Duration(10000));
String inputFile = args[0];
String outputPath = args[1];
String outputFile = args[2];
JavaDStream<String> dStream = javaStreamContext.textFileStream(inputFile);
JavaPairDStream<String, String> finalDStream = fetchProductRecommendation(dStream); // Does some logic to get the final DStream
finalDStream.print();
finalDStream.repartition(1).saveAsNewAPIHadoopFiles(outputPath, outputFile, String.class, String.class, TextOutputFormat.class);
javaStreamContext.start();
javaStreamContext.awaitTermination();
To run this program, here is the command that I am using
spark-submit --master local /home/cloudera/Spark/JarLib_ProductRecoSparkStream.jar /user/ProductRecomendations/SparkInput/ /user/ProductRecomendations/SparkOutput/ productRecoOutput
Please let me know if you need more information since this is the first time I am writing a spark stream code.

Return result in json format - Apache Spark Sql

Hi I am new to Apache spark Sql and now I am using spark java api to query data from hive. Here is the code I have used which is from Spark documentation
SparkConf sparkConf = new SparkConf().setAppName("Spark").setMaster("local");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
HiveContext sqlContext = new .HiveContext(ctx.sc());
Row[] result = = sqlContext.sql("Select * from TableName").collect();
Now I am using for loop to loop through the result and concatenating as json string and it is very slow and is there any other way to achieve this?
Can any one tell me how to do this.

Drools In Spark for Streaming File

We were able to successfully Integrate drools with spark, When we try to apply rules from Drools we were able to do for Batch file, which is present in HDFS, But we tried to use drools for Streaming file so that we can make decision instantly, But we couldn't figure out how to do it.Below is the snippet of the code what we are trying to achieve.
Case1: .
SparkConf conf = new SparkConf().setAppName("sample");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> javaRDD = sc.textFile("/user/root/spark/sample.dat");
List<String> store = new ArrayList<String>();
store = javaRDD.collect();
Case 2: when we use streaming context
SparkConf sparkconf = new SparkConf().setAppName("sparkstreaming");
JavaStreamingContext ssc =
new JavaStreamingContext(sparkconf, new Duration(1));
JavaDStream<String> lines = ssc.socketTextStream("xx.xx.xx.xx", xxxx);
In the first case we were able apply our rules on the variable store, but in the second case we were not able to apply rules on the dstream lines.
If someone has some idea, how it can be done, will be a great help.
Here is one way to get it done.
Create your knowledge session with business rules first.
//Create knowledge and session here
KnowledgeBase kbase = KnowledgeBaseFactory.newKnowledgeBase();
KnowledgeBuilder kbuilder = KnowledgeBuilderFactory.newKnowledgeBuilder();
kbuilder.add( ResourceFactory.newFileResource( "rulefile.drl"),
ResourceType.DRL );
Collection<KnowledgePackage> pkgs = kbuilder.getKnowledgePackages();
kbase.addKnowledgePackages( pkgs );
final StatelessKnowledgeSession ksession = kbase.newStatelessKnowledgeSession();
Create JavaDStream using StreamingContext.
SparkConf sparkconf = new SparkConf().setAppName("sparkstreaming");
JavaStreamingContext ssc =
new JavaStreamingContext(sparkconf, new Duration(1));
JavaDStream<String> lines = ssc.socketTextStream("xx.xx.xx.xx", xxxx);
Call DStream's foreachRDD to create facts and fire your rules.
lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
#Override
public Void call(JavaRDD<String> rdd) throws Exception {
List<String> facts = rdd.collect();
//Apply rules on facts here
ksession.execute(facts);
return null;
}
});

Categories

Resources