I've attempted to merge multiple JavaRDDs, but only two of them end up merged. Can someone kindly help? I've been struggling with this for a while. Overall, I would like to obtain multiple collections, use sqlContext to create a group, and print out all the results.
Here is my code:
JavaRDD<AppLog> logs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppa_logs").union(
    mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.fav_logs").union(
        mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.pps_logs").union(
            mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.dd_logs").union(
                mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppt_logs")
            )
        )
    )
);
public JavaRDD<AppLog> mapCollection(JavaSparkContext sc, String uri) {
    Configuration mongodbConfig = new Configuration();
    mongodbConfig.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
    mongodbConfig.set("mongo.input.uri", uri);

    JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
        mongodbConfig,          // Configuration
        MongoInputFormat.class, // InputFormat: read from a live cluster.
        Object.class,           // Key class
        BSONObject.class        // Value class
    );

    return documents.map(
        new Function<Tuple2<Object, BSONObject>, AppLog>() {
            public AppLog call(final Tuple2<Object, BSONObject> tuple) {
                AppLog log = new AppLog();
                BSONObject header = (BSONObject) tuple._2();
                log.setTarget((String) header.get("target"));
                log.setAction((String) header.get("action"));
                return log;
            }
        }
    );
}
// printing the collections
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
DataFrame logsSchema = sqlContext.createDataFrame(logs, AppLog.class);
logsSchema.registerTempTable("logs");
DataFrame groupedMessages = sqlContext.sql(
    "select * from logs");
    // "select target, action, Count(*) from logs group by target, action");
    // "SELECT to, body FROM messages WHERE to = \"eric.bass@enron.com\"");
groupedMessages.show();
logsSchema.printSchema();
If you want to merge multiple JavaRDDs, simply use sc.union(rdd1, rdd2, ...) instead of rdd1.union(rdd2).union(rdd3).
Also check this: RDD.union vs SparkContext.union
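For example, a minimal sketch using the question's mapCollection helper (the intermediate variable names are my own, and the exact union signature depends on the Spark version: 1.x takes a first RDD plus a java.util.List of the rest, while 2.x takes varargs):
JavaRDD<AppLog> ppaLogs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppa_logs");
JavaRDD<AppLog> favLogs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.fav_logs");
JavaRDD<AppLog> ppsLogs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.pps_logs");
JavaRDD<AppLog> ddLogs  = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.dd_logs");
JavaRDD<AppLog> pptLogs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppt_logs");

// Spark 1.x signature: union(first, java.util.List of the rest)
JavaRDD<AppLog> logs = sc.union(ppaLogs, Arrays.asList(favLogs, ppsLogs, ddLogs, pptLogs));
// Spark 2.x+ signature: varargs
// JavaRDD<AppLog> logs = sc.union(ppaLogs, favLogs, ppsLogs, ddLogs, pptLogs);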
I have a table in DynamoDB with an attribute 'createDate', and I want to do a scan with a filter on a specific period of that attribute (for example: 2022-01-01 to 2022-01-31), but I don't know whether it's possible or how to do it. If anyone has done this and can help me, it would be very helpful.
Just one more question: is it possible to put the result in a CSV file?
Here is my code, where I can scan with a single date:
public class QueryTableResearchAnswers {
    static AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
    static DynamoDB dynamoDB = new DynamoDB(client);
    static String tableName = "research-answers";

    public static void main(String[] args) throws Exception {
        String researchAnswers = "Amazon DynamoDB";
        findAnswersWithinTimePeriod(researchAnswers);
        //findRepliesPostedWithinTimePeriod(researchAnswers);
    }

    private static void findAnswersWithinTimePeriod(String researchAnswers) {
        Table table = dynamoDB.getTable(tableName);

        Map<String, Object> expressionAttributeValues = new HashMap<String, Object>();
        expressionAttributeValues.put(":startDate", "2022-01-01T00:00:00.0Z");

        ItemCollection<ScanOutcome> items = table.scan("createDate > :startDate", // FilterExpression
            "bizId, accountingsessionid, accounttype, acctsessionid, choicecode, contextname, createDate, document, framedipaddress," +
            "macaddress, macaddressnetworkdata, machash, mail, nasgrelocalip, nasidentifier, nasipaddress, nasportid, network, networktype, networkuuid, phone," +
            "question, questionanswer, questioncode, realm, relayingmacaddress, remoteipaddress, useragent, username", // ProjectionExpression
            null, // ExpressionAttributeNames - not used in this example
            expressionAttributeValues);

        System.out.println("Scan of " + tableName + " for january answers");
        Iterator<Item> iterator = items.iterator();
        while (iterator.hasNext()) {
            System.out.println(iterator.next().toJSONPretty());
        }
    }
}
In general, for an arbitrary date range:
createDate BETWEEN :date1 AND :date2
But, in your specific case of 2022-01-01 to 2022-01-31 (the entire month of January), you can simplify this to:
begins_with(createDate, :prefix)
with :prefix bound to "2022-01" in the expression attribute values.
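For the general date-range case, here is a minimal sketch using the same Document API as the question, via the ScanSpec overload of table.scan (the method name, the :endDate value, and the omission of the projection list are my own choices):
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.ScanOutcome;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.ScanSpec;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

private static void findAnswersWithinDateRange(Table table) {
    // createDate is stored as an ISO-8601 string, so a lexicographic BETWEEN
    // matches chronological order for this format.
    ScanSpec spec = new ScanSpec()
            .withFilterExpression("createDate BETWEEN :startDate AND :endDate")
            .withValueMap(new ValueMap()
                    .withString(":startDate", "2022-01-01T00:00:00.0Z")
                    .withString(":endDate", "2022-01-31T23:59:59.999Z"));

    ItemCollection<ScanOutcome> items = table.scan(spec);
    for (Item item : items) {
        System.out.println(item.toJSONPretty());
    }
}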
If I want to run two different select queries on a Flink table created from a DataStream, the Blink planner runs them as two different jobs. Is there a way to combine them and run them as a single job?
Example code:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);
System.out.println("Running credit scores : ");
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
DataStream<String> recordsStream =
env.readTextFile("src/main/resources/credit_trial.csv");
DataStream<CreditRecord> creditStream = recordsStream
.filter((FilterFunction<String>) line -> !line.contains(
"Loan ID,Customer ID,Loan Status,Current Loan Amount,Term,Credit Score,Annual Income,Years in current job" +
",Home Ownership,Purpose,Monthly Debt,Years of Credit History,Months since last delinquent,Number of Open Accounts," +
"Number of Credit Problems,Current Credit Balance,Maximum Open Credit,Bankruptcies,Tax Liens"))
.map(new MapFunction<String, CreditRecord>() {
@Override
public CreditRecord map(String s) throws Exception {
String[] fields = s.split(",");
return new CreditRecord(fields[0], fields[2], Double.parseDouble(fields[3]),
fields[4], fields[5].trim().equals("")?0.0: Double.parseDouble(fields[5]),
fields[6].trim().equals("")?0.0:Double.parseDouble(fields[6]),
fields[8], Double.parseDouble(fields[15]));
}
});
tableEnv.createTemporaryView("CreditDetails", creditStream);
Table creditDetailsTable = tableEnv.from("CreditDetails");
Table resultsTable = creditDetailsTable.select($("*"))
.filter($("loanStatus").isEqual("Charged Off"));
TableResult result = resultsTable.execute();
result.print();
Table resultsTable2 = creditDetailsTable.select($("*"))
.filter($("loanStatus").isEqual("Fully Paid"));
TableResult result2 = resultsTable2.execute();
result2.print();
The above code creates 2 different jobs, but I don't want that. Is there any way out?
I am trying to get live JSON data from RabbitMQ into Apache Spark using Java and do some real-time analytics on it.
I am able to get the data and also run some basic SQL queries on it, but I am not able to figure out the grouping part.
Below is the JSON I have:
{"DeviceId":"MAC-101","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
{"DeviceId":"MAC-101","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
{"DeviceId":"MAC-102","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
{"DeviceId":"MAC-102","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
I would like to group them by device ID. The idea is that this way I can run and gather analytics against individual devices. Below is the sample code snippet that I am trying:
public static void main(String[] args) {
try {
mconf = new SparkConf();
mconf.setAppName("RabbitMqReceiver");
mconf.setMaster("local[*]");
jssc = new JavaStreamingContext(mconf,Durations.seconds(10));
SparkSession spksess = SparkSession
.builder()
.master("local[*]")
.appName("RabbitMqReceiver2")
.getOrCreate();
SQLContext sqlctxt = new SQLContext(spksess);
JavaReceiverInputDStream<String> jsonData = jssc.receiverStream(
new mqreceiver(StorageLevel.MEMORY_AND_DISK_2()));
//jsonData.print();
JavaDStream<String> machineData = jsonData.window(Durations.minutes(1), Durations.seconds(20));
machineData.foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> rdd) {
if(!rdd.isEmpty()){
Dataset<Row> data = sqlctxt.read().json(rdd);
//Dataset<Row> data = spksess.read().json(rdd).select("*");
data.createOrReplaceTempView("DeviceData");
data.printSchema();
//data.show(false);
// The below select query works
//Dataset<Row> groupedData = sqlctxt.sql("select * from DeviceData where DeviceId='MAC-101'");
// The below sql fails...
Dataset<Row> groupedData = sqlctxt.sql("select * from DeviceData GROUP BY DeviceId");
groupedData.show();
}
}
});
jssc.start();
jssc.awaitTermination();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
What I am looking to do with the streamed data is to see if I can push the incoming data into individual buckets...
Let's say we have incoming data from RabbitMQ like the JSON shown above.
What I want to do is to have either a single key/value-based collection with the device ID as the key and a List as the value,
or some kind of individual dynamic collection for each device ID.
Can we do something like the code below (from http://backtobazics.com/big-data/spark/apache-spark-groupby-example/)?
public class GroupByExample {
    public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext();

        // Parallelized with 3 partitions
        JavaRDD<String> rddX = sc.parallelize(
            Arrays.asList("Joseph", "Jimmy", "Tina",
                          "Thomas", "James", "Cory",
                          "Christine", "Jackeline", "Juan"), 3);

        JavaPairRDD<Character, Iterable<String>> rddY = rddX.groupBy(word -> word.charAt(0));
        System.out.println(rddY.collect());
    }
}
So in our case we would need the groupBy to key on DeviceId instead.
Working code:
JavaDStream<String> strmData = jssc.receiverStream(
new mqreceiver(StorageLevel.MEMORY_AND_DISK_2()));
//This is just a sliding window i have kept
JavaDStream<String> machineData = strmData.window(Durations.minutes(1), Durations.seconds(10));
machineData.print();
JavaPairDStream<String, String> pairedData = machineData.mapToPair(s -> new Tuple2<String, String>(s.substring(5, 10) , new String(s)));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
groupedData.print();
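As an aside, here is a minimal sketch of a more robust key extraction than the fixed substring(5, 10) offset, assuming Jackson is on the classpath (the ObjectMapper parsing is my own addition, not part of the question):
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

JavaPairDStream<String, String> pairedData = machineData.mapToPair(s -> {
    // Parse each JSON record and key it by its DeviceId field.
    // (Creating an ObjectMapper per record keeps the sketch simple; reuse one in real code.)
    JsonNode node = new ObjectMapper().readTree(s);
    return new Tuple2<String, String>(node.get("DeviceId").asText(), s);
});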
It's because in queries with GROUP BY, only the following columns can be used in the SELECT:
columns listed in the GROUP BY
aggregations of any column
If you use "*", then all columns end up in the SELECT, and that's why the query fails. Change the query to, for example:
select DeviceId, count(distinct DeviceType) as deviceTypeCount from DeviceData group by DeviceId
and it will work, because it only uses a column from the GROUP BY and columns inside aggregate functions.
The GROUP BY statement is often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result-set by one or more columns.
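For example, a minimal sketch of how such a query would slot into the question's foreachRDD block (the recordCount alias is my own naming):
// Inside the existing foreachRDD, after data.createOrReplaceTempView("DeviceData"):
Dataset<Row> countsPerDevice = sqlctxt.sql(
        "SELECT DeviceId, COUNT(*) AS recordCount FROM DeviceData GROUP BY DeviceId");
countsPerDevice.show();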
I'm using SparkSQL 1.6.2 (Java API) and I have to process the following DataFrame, which has a list of values in 2 columns:
ID AttributeName AttributeValue
0 [an1,an2,an3] [av1,av2,av3]
1 [bn1,bn2] [bv1,bv2]
The desired table is:
ID AttributeName AttributeValue
0 an1 av1
0 an2 av2
0 an3 av3
1 bn1 bv1
1 bn2 bv2
I think I have to use a combination of the explode function and a custom UDF function.
I found the following resources:
Explode (transpose?) multiple columns in Spark SQL table
How do I call a UDF on a Spark DataFrame using JAVA?
and I can successfully run an example that reads the two columns and returns the concatenation of the first two strings in a column:
UDF2 combineUDF = new UDF2<Seq<String>, Seq<String>, String>() {
public String call(final Seq<String> col1, final Seq<String> col2) throws Exception {
return col1.apply(0) + col2.apply(0);
}
};
context.udf().register("combineUDF", combineUDF, DataTypes.StringType);
The problem is how to write the signature of a UDF that returns two columns (in Java).
As far as I understand, I must define a new StructType like the one shown below and set that as the return type, but so far I haven't managed to get the final code working.
StructType retSchema = new StructType(new StructField[]{
new StructField("#AttName", DataTypes.StringType, true, Metadata.empty()),
new StructField("#AttValue", DataTypes.StringType, true, Metadata.empty()),
}
);
context.udf().register("combineUDF", combineUDF, retSchema);
Any help would be really appreciated.
UPDATE: I'm trying to implement the zip(AttributeName, AttributeValue) first, so that I will then only need to apply the standard explode function in SparkSQL:
ID AttName_AttValue
0 [[an1,av1],[an1,av2],[an3,av3]]
1 [[bn1,bv1],[bn2,bv2]]
I built the following UDF:
UDF2 combineColumns = new UDF2<Seq<String>, Seq<String>, List<List<String>>>() {
public List<List<String>> call(final Seq<String> col1, final Seq<String> col2) throws Exception {
List<List<String>> zipped = new LinkedList<>();
for (int i = 0, listSize = col1.size(); i < listSize; i++) {
List<String> subRow = Arrays.asList(col1.apply(i), col2.apply(i));
zipped.add(subRow);
}
return zipped;
}
};
But when I run the code
myDF.select(callUDF("combineColumns", col("AttributeName"), col("AttributeValue"))).show(10);
I got the following error message:
scala.MatchError: [[an1,av1],[an1,av2],[an3,av3]] (of class java.util.LinkedList)
and it looks like the combining has been performed correctly, but then the return type is not the one Scala expects.
Any help?
Finally I managed to get the result I was looking for, but probably not in the most efficient way.
Basically there are 2 steps:
Zip the two lists
Explode the list into rows
For the first step I defined the following UDF:
UDF2 concatItems = new UDF2<Seq<String>, Seq<String>, Seq<String>>() {
    public Seq<String> call(final Seq<String> col1, final Seq<String> col2) throws Exception {
        ArrayList<String> zipped = new ArrayList<String>();
        for (int i = 0, listSize = col1.size(); i < listSize; i++) {
            String subRow = col1.apply(i) + ";" + col2.apply(i);
            zipped.add(subRow);
        }
        return scala.collection.JavaConversions.asScalaBuffer(zipped);
    }
};
Don't forget the function registration with the SparkSession; since the UDF returns a sequence of strings, it is registered with an array return type:
sparkSession.udf().register("concatItems", concatItems, DataTypes.createArrayType(DataTypes.StringType));
and then I called it with the following code:
DataFrame df2 = df.select(col("ID"), callUDF("concatItems", col("AttributeName"), col("AttributeValue")).alias("AttName_AttValue"));
At this stage df2 looks like this:
ID AttName_AttValue
0 [[an1,av1],[an1,av2],[an3,av3]]
1 [[bn1,bv1],[bn2,bv2]]
Then I used the explode function to turn the list into rows:
DataFrame df3 = df2.select(col("ID"),explode(col("AttName_AttValue")).alias("AttName_AttValue_row"));
At this stage df3 looks like this:
ID AttName_AttValue_row
0 [an1,av1]
0 [an1,av2]
0 [an3,av3]
1 [bn1,bv1]
1 [bn2,bv2]
Finally to split the attribute name and value into two different columns, I converted the DataFrame into a JavaRDD in order to use the map function:
JavaRDD df3RDD = df3.toJavaRDD().map(
    (Function<Row, Row>) myRow -> {
        String[] info = String.valueOf(myRow.get(1)).split(";");
        return RowFactory.create(myRow.get(0), info[0], info[1]);
    }).cache();
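To get back to a DataFrame with named columns, the resulting JavaRDD of Rows can be paired with an explicit schema, roughly like this (a sketch using the question's SQLContext, called context there; I'm assuming ID can be kept as a string, so adjust the DataType, or wrap myRow.get(0) in String.valueOf(...) above, if it is numeric):
// Schema for the three exploded columns.
StructType finalSchema = new StructType(new StructField[]{
        new StructField("ID", DataTypes.StringType, true, Metadata.empty()),
        new StructField("AttributeName", DataTypes.StringType, true, Metadata.empty()),
        new StructField("AttributeValue", DataTypes.StringType, true, Metadata.empty())
});
DataFrame finalDF = context.createDataFrame(df3RDD, finalSchema);
finalDF.show();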
If anybody has a better solution feel free to comment.
I hope it helps.
I have an HBase table (accessed from Java) and I want to query the table by a list of keys. I did the following, but it's not working.
mFilterFeatureIt = mFeatureSet.iterator();
FilterList filterList=new FilterList(FilterList.Operator.MUST_PASS_ONE);
while (mFilterFeatureIt.hasNext()) {
long myfeatureId = mFilterFeatureIt.next();
System.out.println("FeatureId:"+myfeatureId+" , ");
RowFilter filter = new RowFilter(CompareOp.EQUAL,new BinaryComparator(Bytes.toBytes(myfeatureId)) );
filterList.addFilter(filter);
}
outputMap = HbaseUtils.getHbaseData("mytable", filterList);
System.out.println("Size of outputMap map:"+ outputMap.szie());
public static Map<String, Map<String, String>> getHbaseData(String table, FilterList filter) {
Map<String, Map<String, String>> data = new HashMap<String, Map<String, String>>();
HTable htable = null;
try {
htable = new HTable(HTableConfiguration.getHTableConfiguration(),table);
Scan scan = new Scan();
scan.setFilter(filter);
ResultScanner resultScanner = htable.getScanner(scan);
Iterator<Result> results = resultScanner.iterator();
while (results.hasNext()) {
Result result = results.next();
String rowId = Bytes.toString(result.getRow());
List<KeyValue> columns = result.list();
if (null != columns) {
HashMap<String, String> colData = new HashMap<String, String>();
for (KeyValue column : columns) {
colData.put(Bytes.toString(column.getFamily()) + ":"+ Bytes.toString(column.getQualifier()),Bytes.toString(column.getValue()));
}
data.put(rowId, colData);
}
}
} catch (IOException e) {
e.printStackTrace();
} finally {
if (htable != null)
try {
htable.close();
} catch (IOException e) {
e.printStackTrace();
}
}
return data;
}
FeatureId:80515900 ,
FeatureId:80515901 ,
FeatureId:80515902 ,
Size of outputMap map: 0
I see that the value of the feature ID is what I want, but I always get the above output even when the key is present in the HBase table. Can anyone tell me what I am doing wrong?
EDIT:
I posted the code for my HBase util method above too, so that you can point me to any bugs there.
I am trying to do the SQL equivalent of select * FROM mytable where featureId in (80515900, 80515901, 80515902). My idea to achieve the same in HBase was to create a filter list with one filter for each featureId. Is that correct?
Here is the content of my table
scan 'mytable', {COLUMNS => ['sample:tag_count'] }
80515900 column=sample:tag_count, timestamp=1339304052748, value=4
80515901 column=sample:tag_count, timestamp=1339304052748, value=0
80515902 column=sample:tag_count, timestamp=1339304052748, value=3
80515903 column=sample:tag_count, timestamp=1339304052748, value=1
80515904 column=sample:tag_count, timestamp=1339304052748, value=2
It's not returning any data because, while inserting the data into HBase, the data type of the key was 'String' (as your scan result shows), while the value passed to the RowFilter when fetching has the 'long' data type. Use this filter:
RowFilter filter = new RowFilter(CompareOp.EQUAL,
    new BinaryComparator(Bytes.toBytes(String.valueOf(myfeatureId))));
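Putting that together, a minimal sketch of the corrected loop from the question (assuming mFeatureSet is a Set<Long>, as the question's iterator usage suggests):
FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ONE);
for (Long myfeatureId : mFeatureSet) {
    // Encode the key the same way it was written to the table: as a String, not as a long.
    filterList.addFilter(new RowFilter(CompareOp.EQUAL,
            new BinaryComparator(Bytes.toBytes(String.valueOf(myfeatureId)))));
}
outputMap = HbaseUtils.getHbaseData("mytable", filterList);
System.out.println("Size of outputMap map: " + outputMap.size());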
Also note that the while loop always generates a new filter and adds it to the filter list, so the list ends up containing filters for all of the keys; such a filter list is not targeted at a single row. To match just one row, create only one filter in the while loop, pointing at a known myfeatureId:
while (mFilterFeatureIt.hasNext()) {
    long myfeatureId = mFilterFeatureIt.next();
    System.out.println("FeatureId:" + myfeatureId + " , ");
    if (myfeatureId == 80515902L) {
        RowFilter filter = new RowFilter(CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes(myfeatureId)));
        filterList.addFilter(filter);
    }
}
EDIT
The query is responsible for how many rows come back; HBase itself is not.
HBase filters
Filters push row-selection criteria out to HBase, so rows can be filtered remotely and in parallel. Using them helps you avoid sending rows to the client that are not needed.
To match on part of the key and get everything from 80515900 to 80515909, try this:
remove this from the loop
RowFilter filter = new RowFilter(CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes(myfeatureId)));
filterList.addFilter(filter);
and add the following above the line outputMap = HbaseUtils.getHbaseData("mytable", filterList);
....
RowFilter filter = new RowFilter(CompareOp.EQUAL, new SubstringComparator("8051590"));
filterList.addFilter(filter);
outputMap = HbaseUtils.getHbaseData("mytable", filterList);