Multiple select queries in a single job on Flink Table API - Java

If I want to run two different select queries on a Flink table created from a DataStream, the Blink planner runs them as two different jobs. Is there a way to combine them and run them as a single job?
Example code:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);
System.out.println("Running credit scores : ");
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
DataStream<String> recordsStream =
        env.readTextFile("src/main/resources/credit_trial.csv");
DataStream<CreditRecord> creditStream = recordsStream
        .filter((FilterFunction<String>) line -> !line.contains(
                "Loan ID,Customer ID,Loan Status,Current Loan Amount,Term,Credit Score,Annual Income,Years in current job" +
                ",Home Ownership,Purpose,Monthly Debt,Years of Credit History,Months since last delinquent,Number of Open Accounts," +
                "Number of Credit Problems,Current Credit Balance,Maximum Open Credit,Bankruptcies,Tax Liens"))
        .map(new MapFunction<String, CreditRecord>() {
            @Override
            public CreditRecord map(String s) throws Exception {
                String[] fields = s.split(",");
                return new CreditRecord(fields[0], fields[2], Double.parseDouble(fields[3]),
                        fields[4], fields[5].trim().equals("") ? 0.0 : Double.parseDouble(fields[5]),
                        fields[6].trim().equals("") ? 0.0 : Double.parseDouble(fields[6]),
                        fields[8], Double.parseDouble(fields[15]));
            }
        });
tableEnv.createTemporaryView("CreditDetails", creditStream);
Table creditDetailsTable = tableEnv.from("CreditDetails");
Table resultsTable = creditDetailsTable.select($("*"))
        .filter($("loanStatus").isEqual("Charged Off"));
TableResult result = resultsTable.execute();
result.print();
Table resultsTable2 = creditDetailsTable.select($("*"))
        .filter($("loanStatus").isEqual("Fully Paid"));
TableResult result2 = resultsTable2.execute();
result2.print();
The above code creates two different jobs, but I don't want that. Is there any way around this?
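One direction worth trying (a sketch, not from the original post) is Flink's StatementSet, which collects several INSERT statements and submits them together as one job when execute() is called once at the end. Every Table.execute() in the snippet above triggers its own job submission, which is why two jobs appear. In the sketch below, the print sinks and the column names loanId/loanStatus are assumptions standing in for whatever sink tables and CreditRecord fields you actually have:

Table chargedOff = creditDetailsTable.filter($("loanStatus").isEqual("Charged Off"));
Table fullyPaid = creditDetailsTable.filter($("loanStatus").isEqual("Fully Paid"));

// Hypothetical sink tables; the 'print' connector simply writes rows to stdout.
tableEnv.executeSql("CREATE TABLE chargedOffSink (loanId STRING, loanStatus STRING) WITH ('connector' = 'print')");
tableEnv.executeSql("CREATE TABLE fullyPaidSink (loanId STRING, loanStatus STRING) WITH ('connector' = 'print')");

StatementSet stmtSet = tableEnv.createStatementSet();
stmtSet.addInsert("chargedOffSink", chargedOff.select($("loanId"), $("loanStatus")));
stmtSet.addInsert("fullyPaidSink", fullyPaid.select($("loanId"), $("loanStatus")));
stmtSet.execute(); // both pipelines are planned and submitted as a single job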

Related

Spark : dataframe - how to properly parallelize the execution

I have a Spark Streaming application that reads from Kafka with a batch interval of 5 minutes.
My application stores the input in a DataFrame and is configured to execute aggregation queries on that DataFrame (~1000 queries).
Current solution: from the driver I run a "for loop" over the list of ~1000 queries and execute them one after another.
Problem: with that many queries to execute, my application takes a huge amount of time.
Is there a way to make this process faster?
SparkConf sparkConf = new SparkConf();
messagesFromKafka.foreachRDD((VoidFunction2<JavaRDD<String>, Time>) (rdd, time) -> {
    //...
    SparkSession spark = JavaSparkSessionSingleton.getInstance(sparkConf);
    // Convert RDD[String] into an RDD of beans
    JavaRDD<Bean> rddRow = rdd.map((Function<String, Bean>) line -> {
        Bean row = new Bean();
        row.setFiled1(line.split(";")[0]);
        row.setFiled2(line.split(";")[1]);
        //..
        return row;
    });
    // ...
    Dataset<Row> ds = spark.createDataFrame(rddRow, Bean.class);
    // Prepare list of queries (ds must be registered as a temp view for spark.sql to see it)
    List<String> listQuery = new ArrayList<>();
    listQuery.add("select sum(..) group by filed1...");
    listQuery.add("select avg (..) group by field2...");
    // perform aggregation with each query
    for (String query : listQuery) {
        Dataset<Row> dsResult = spark.sql(query);
    }
});
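As a sketch only (not from the original post): spark.sql alone is lazy, so each query also needs an action or a write to actually run, and since SparkSession is thread-safe, the independent queries can be submitted from a small thread pool (java.util.concurrent) instead of strictly one after another. The pool size, output path, and parquet format below are assumptions to illustrate the idea:

ExecutorService pool = Executors.newFixedThreadPool(8); // pool size is an assumption; tune for your cluster
List<Future<?>> futures = new ArrayList<>();
for (String query : listQuery) {
    futures.add(pool.submit(() -> {
        // Each query becomes its own Spark job; the scheduler can run several of them concurrently.
        Dataset<Row> dsResult = spark.sql(query);
        dsResult.write().mode("append").parquet("/tmp/results/" + Math.abs(query.hashCode())); // hypothetical sink
    }));
}
for (Future<?> f : futures) {
    try {
        f.get(); // wait for all queries of this batch to finish
    } catch (InterruptedException | ExecutionException e) {
        e.printStackTrace();
    }
}
pool.shutdown();

Enabling the fair scheduler (spark.scheduler.mode=FAIR) can help the concurrent jobs share executors more evenly.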

Apache Spark -- Java , Group Live Stream data

I am trying to get live JSON data from RabbitMQ into Apache Spark using Java and do some real-time analytics out of it.
I am able to get the data and also run some basic SQL queries on it, but I am not able to figure out the grouping part.
Below is the JSON I have:
{"DeviceId":"MAC-101","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
{"DeviceId":"MAC-101","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
{"DeviceId":"MAC-102","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
{"DeviceId":"MAC-102","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
I would like to group the records by device id. The idea is that this way I can run and gather analytics for individual devices. Below is the sample code snippet that I am trying:
public static void main(String[] args) {
    try {
        mconf = new SparkConf();
        mconf.setAppName("RabbitMqReceiver");
        mconf.setMaster("local[*]");
        jssc = new JavaStreamingContext(mconf, Durations.seconds(10));
        SparkSession spksess = SparkSession
                .builder()
                .master("local[*]")
                .appName("RabbitMqReceiver2")
                .getOrCreate();
        SQLContext sqlctxt = new SQLContext(spksess);
        JavaReceiverInputDStream<String> jsonData = jssc.receiverStream(
                new mqreceiver(StorageLevel.MEMORY_AND_DISK_2()));
        //jsonData.print();
        JavaDStream<String> machineData = jsonData.window(Durations.minutes(1), Durations.seconds(20));
        machineData.foreachRDD(new VoidFunction<JavaRDD<String>>() {
            @Override
            public void call(JavaRDD<String> rdd) {
                if (!rdd.isEmpty()) {
                    Dataset<Row> data = sqlctxt.read().json(rdd);
                    //Dataset<Row> data = spksess.read().json(rdd).select("*");
                    data.createOrReplaceTempView("DeviceData");
                    data.printSchema();
                    //data.show(false);
                    // The below select query works
                    //Dataset<Row> groupedData = sqlctxt.sql("select * from DeviceData where DeviceId='MAC-101'");
                    // The below sql fails...
                    Dataset<Row> groupedData = sqlctxt.sql("select * from DeviceData GROUP BY DeviceId");
                    groupedData.show();
                }
            }
        });
        jssc.start();
        jssc.awaitTermination();
    } catch (InterruptedException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
What I am looking to do with the streamed data is to push the incoming records into individual buckets.
Let's say we have the incoming data from RabbitMQ shown above.
What I want is either a single key/value-based collection which has the device id as key and a List of records as value,
or some kind of individual dynamic collection for each device id.
Can we do something like the code below (from http://backtobazics.com/big-data/spark/apache-spark-groupby-example/)?
public class GroupByExample {
    public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext();
        // Parallelized with 3 partitions
        JavaRDD<String> rddX = sc.parallelize(
                Arrays.asList("Joseph", "Jimmy", "Tina",
                        "Thomas", "James", "Cory",
                        "Christine", "Jackeline", "Juan"), 3);
        JavaPairRDD<Character, Iterable<String>> rddY = rddX.groupBy(word -> word.charAt(0));
        System.out.println(rddY.collect());
    }
}
So in our case we need to do the grouping with respect to DeviceId.
Working code:
JavaDStream<String> strmData = jssc.receiverStream(
        new mqreceiver(StorageLevel.MEMORY_AND_DISK_2()));
// This is just a sliding window I have kept
JavaDStream<String> machineData = strmData.window(Durations.minutes(1), Durations.seconds(10));
machineData.print();
JavaPairDStream<String, String> pairedData = machineData.mapToPair(s -> new Tuple2<String, String>(s.substring(5, 10), new String(s)));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
groupedData.print();
It's because in queries with GROUP BY, only the following columns can be used in the SELECT list:
columns listed in the GROUP BY
aggregations of any column
If you use "*", then all columns end up in the SELECT - and that's why the query fails. Change the query to, for example:
select DeviceId, count(distinct DeviceType) as deviceTypeCount from DeviceData group by DeviceId
and it will work, because it uses only a column from the GROUP BY plus columns inside aggregation functions.
The GROUP BY statement is often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result set by one or more columns.
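If the goal is to keep the individual records per device rather than a numeric aggregate, one option (a sketch; availability depends on your Spark version) is to collect the rows into a list per DeviceId, either in SQL with collect_list or through the Dataset API (functions below refers to org.apache.spark.sql.functions):

// SQL variant: one row per device, with all of its readings collected into an array
Dataset<Row> perDevice = sqlctxt.sql(
        "select DeviceId, collect_list(data) as readings from DeviceData group by DeviceId");
perDevice.show(false);

// Dataset API variant: the same grouping without hand-written SQL
Dataset<Row> perDeviceApi = data.groupBy("DeviceId")
        .agg(functions.collect_list("data").alias("readings"));
perDeviceApi.show(false);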

Merge Multiple JavaRDD

I've attempted to merge multiple JavaRDDs, but only 2 get merged; can someone kindly help? I've been struggling with this for a while. Overall I would like to be able to obtain multiple collections and use sqlContext to create a group and print out all results.
Here is my code:
JavaRDD<AppLog> logs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppa_logs").union(
        mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.fav_logs").union(
                mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.pps_logs").union(
                        mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.dd_logs").union(
                                mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppt_logs")
                        )
                )
        )
);
public JavaRDD<AppLog> mapCollection(JavaSparkContext sc, String uri) {
    Configuration mongodbConfig = new Configuration();
    mongodbConfig.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
    mongodbConfig.set("mongo.input.uri", uri);
    JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
            mongodbConfig,          // Configuration
            MongoInputFormat.class, // InputFormat: read from a live cluster.
            Object.class,           // Key class
            BSONObject.class        // Value class
    );
    return documents.map(
            new Function<Tuple2<Object, BSONObject>, AppLog>() {
                public AppLog call(final Tuple2<Object, BSONObject> tuple) {
                    AppLog log = new AppLog();
                    BSONObject header = (BSONObject) tuple._2();
                    log.setTarget((String) header.get("target"));
                    log.setAction((String) header.get("action"));
                    return log;
                }
            }
    );
}
// printing the collections
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
DataFrame logsSchema = sqlContext.createDataFrame(logs, AppLog.class);
logsSchema.registerTempTable("logs");
DataFrame groupedMessages = sqlContext.sql(
        "select * from logs");
        // "select target, action, Count(*) from logs group by target, action");
        // "SELECT to, body FROM messages WHERE to = \"eric.bass@enron.com\"");
groupedMessages.show();
logsSchema.printSchema();
If you want to merge multiple JavaRDDs, simply use sc.union(rdd1, rdd2, ...) instead of rdd1.union(rdd2).union(rdd3).
Also check this: RDD.union vs SparkContext.union
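A sketch of that suggestion using the same mapCollection helper from the question; the list-based overload shown here matches the older JavaSparkContext API this code appears to use, while newer Spark releases expose a varargs overload instead:

JavaRDD<AppLog> ppaLogs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppa_logs");
JavaRDD<AppLog> favLogs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.fav_logs");
JavaRDD<AppLog> ppsLogs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.pps_logs");
JavaRDD<AppLog> ddLogs  = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.dd_logs");
JavaRDD<AppLog> pptLogs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppt_logs");

// One union over all RDDs instead of a deeply nested chain of pairwise unions.
JavaRDD<AppLog> logs = sc.union(ppaLogs, Arrays.asList(favLogs, ppsLogs, ddLogs, pptLogs));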

How to mass delete multiple rows in hbase?

I have rows with the following keys in the HBase table "mytable":
user_1
user_2
user_3
...
user_9999999
I want to use the HBase shell to delete rows from:
user_500 to user_900
I know there is no built-in way to delete a whole range, but is there a way I could use the "BulkDeleteProcessor" to do this?
I see here:
https://github.com/apache/hbase/blob/master/hbase-examples/src/test/java/org/apache/hadoop/hbase/coprocessor/example/TestBulkDeleteProtocol.java
I want to just paste in the imports and then paste this into the shell, but have no idea how to go about it. Does anyone know how I can use this endpoint from the JRuby HBase shell?
Table ht = TEST_UTIL.getConnection().getTable("my_table");
long noOfDeletedRows = 0L;
Batch.Call<BulkDeleteService, BulkDeleteResponse> callable =
        new Batch.Call<BulkDeleteService, BulkDeleteResponse>() {
            ServerRpcController controller = new ServerRpcController();
            BlockingRpcCallback<BulkDeleteResponse> rpcCallback =
                    new BlockingRpcCallback<BulkDeleteResponse>();

            public BulkDeleteResponse call(BulkDeleteService service) throws IOException {
                Builder builder = BulkDeleteRequest.newBuilder();
                builder.setScan(ProtobufUtil.toScan(scan));
                builder.setDeleteType(deleteType);
                builder.setRowBatchSize(rowBatchSize);
                if (timeStamp != null) {
                    builder.setTimestamp(timeStamp);
                }
                service.delete(controller, builder.build(), rpcCallback);
                return rpcCallback.get();
            }
        };
Map<byte[], BulkDeleteResponse> result = ht.coprocessorService(BulkDeleteService.class,
        scan.getStartRow(), scan.getStopRow(), callable);
for (BulkDeleteResponse response : result.values()) {
    noOfDeletedRows += response.getRowsDeleted();
}
ht.close();
If there is no way to do this through JRuby, then Java or any alternate way to quickly delete multiple rows is fine.
Do you really want to do it in the shell? There are various other, better ways. One is to use the native Java API:
Construct an ArrayList of Delete objects
Pass this list to the Table.delete method
Method 1: if you already know the range of keys.
public void massDelete(byte[] tableName) throws IOException {
    HTable table = (HTable) hbasePool.getTable(tableName);
    String tablePrefix = "user_";
    int startRange = 500;
    int endRange = 999;
    List<Delete> listOfBatchDelete = new ArrayList<Delete>();
    for (int i = startRange; i <= endRange; i++) {
        String key = tablePrefix + i;
        Delete d = new Delete(Bytes.toBytes(key));
        listOfBatchDelete.add(d);
    }
    try {
        table.delete(listOfBatchDelete);
    } finally {
        if (hbasePool != null && table != null) {
            hbasePool.putTable(table);
        }
    }
}
Method 2: If you want to do a batch delete on the basis of a scan result.
public void bulkDelete(final HTable table) throws IOException {
    Scan s = new Scan();
    List<Delete> listOfBatchDelete = new ArrayList<Delete>();
    // add your filters to the scan, e.g. s.setFilter(yourFilter);
    ResultScanner scanner = table.getScanner(s);
    for (Result rr : scanner) {
        Delete d = new Delete(rr.getRow());
        listOfBatchDelete.add(d);
    }
    try {
        table.delete(listOfBatchDelete);
    } catch (Exception e) {
        LOGGER.log(e);
    }
}
Now coming down to using a coprocessor: just one piece of advice, 'DON'T USE CoProcessors' unless you are an expert in HBase.
CoProcessors have many inbuilt issues; if you need, I can provide a detailed description.
Secondly, when you delete anything from HBase it is never deleted directly; a tombstone marker gets attached to that record, and later, during a major compaction, it actually gets removed. So there is no need to use a coprocessor, which is highly resource intensive.
Modified code to support batch operation.
int batchSize = 50;
int batchCounter = 0;
for (int i = startRange; i <= endRange; i++) {
    String key = tablePrefix + i;
    Delete d = new Delete(Bytes.toBytes(key));
    listOfBatchDelete.add(d);
    batchCounter++;
    if (batchCounter == batchSize) {
        table.delete(listOfBatchDelete);
        listOfBatchDelete.clear();
        batchCounter = 0;
    }
}
// flush any remaining deletes from the last partial batch
if (!listOfBatchDelete.isEmpty()) {
    table.delete(listOfBatchDelete);
}
Creating HBase conf and getting table instance.
Configuration hConf = HBaseConfiguration.create(conf);
hConf.set("hbase.zookeeper.quorum", "Zookeeper IP");
hConf.set("hbase.zookeeper.property.clientPort", ZookeeperPort);
HTable hTable = new HTable(hConf, tableName);
If you are already aware of the rowkeys of the records that you want to delete from the HBase table, then you can use the following approach.
1. First create a List of Delete objects for these rowkeys
for (int rowKey = 1; rowKey <= 10; rowKey++) {
    deleteList.add(new Delete(Bytes.toBytes(rowKey + "")));
}
2. Then get the Table object by using the HBase Connection
Table table = connection.getTable(TableName.valueOf(tableName));
3. Once you have the Table object, call delete() by passing the list
table.delete(deleteList);
The complete code will look like the below:
Configuration config = HBaseConfiguration.create();
config.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
config.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
String tableName = "users";
Connection connection = ConnectionFactory.createConnection(config);
Table table = connection.getTable(TableName.valueOf(tableName));
List<Delete> deleteList = new ArrayList<Delete>();
for (int rowKey = 500; rowKey <= 900; rowKey++) {
    deleteList.add(new Delete(Bytes.toBytes("user_" + rowKey)));
}
table.delete(deleteList);
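As a further sketch (not from the original answers): for very large ranges, the standard HBase client's BufferedMutator takes care of batching and flushing the deletes, so you don't have to manage batch sizes manually as in the modified code above. It reuses the same connection and tableName as the complete code:

try (BufferedMutator mutator = connection.getBufferedMutator(TableName.valueOf(tableName))) {
    for (int rowKey = 500; rowKey <= 900; rowKey++) {
        // Deletes are buffered client-side and sent to the region servers in batches.
        mutator.mutate(new Delete(Bytes.toBytes("user_" + rowKey)));
    }
    mutator.flush(); // push any remaining buffered mutations before closing
}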

How to add elements to ConcurrentHashMap using ExecutorService

I have a requirement to read user information from 2 different sources (databases) per userId and store the consolidated information in a Map with the userId as key. The number of users can vary based on the period they have opted for, and groups of users may belong to different periods of the year, e.g. daily, weekly, monthly users.
I used HashMap and LinkedHashMap to get this done. As it slows down the process, and to make it faster, I thought of using threading here.
After reading some tutorials and examples, I am now using ConcurrentHashMap and ExecutorService.
In some cases, based on validation, I want to skip the current iteration and move on to the next user's info. The compiler does not allow the continue keyword inside the Callable, because it is no longer directly inside the for loop. Is there any way to achieve the same thing in multithreaded code?
Moreover, the code below works, but it is not significantly faster than the code without threading, which makes me doubt whether the ExecutorService is implemented correctly.
How do we debug when we get an error in multithreaded code? Execution holds at a breakpoint, but it is not consistent, and it does not move to the next line with F6.
Can someone point out if I am missing something in the code? Any other example of a similar use case would also be of great help.
public void getMap() throws UserException
{
    long startTime = System.currentTimeMillis();
    Map<String, Map<Integer, User>> map = new ConcurrentHashMap<String, Map<Integer, User>>();
    //final String key = "";
    try
    {
        final Date todayDate = new Date();
        List<String> applyPeriod = db.getPeriods(todayDate);
        for (String period : applyPeriod)
        {
            try
            {
                final String key = period;
                List<UserTable1> eligibleUsers = db.findAllUsers(key);
                Map<Integer, User> userIdMap = new ConcurrentHashMap<Integer, User>();
                ExecutorService executor = Executors.newFixedThreadPool(eligibleUsers.size());
                CompletionService<User> cs = new ExecutorCompletionService<User>(executor);
                int userCount = 0;
                for (UserTable1 eligibleUser : eligibleUsers)
                {
                    try
                    {
                        cs.submit(
                            new Callable<User>()
                            {
                                public User call()
                                {
                                    int userId = eligibleUser.getUserId();
                                    List<EmployeeTable2> empData = db.findByUserId(userId);
                                    EmployeeTable2 emp = null;
                                    if (null != empData && !empData.isEmpty())
                                    {
                                        emp = empData.get(0);
                                    } else {
                                        String errorMsg = "No record found for given User ID in emp table";
                                        logger.error(errorMsg);
                                        //continue;
                                        // continue does not work here.
                                    }
                                    User user = new User();
                                    user.setUserId(userId);
                                    user.setFullName(emp.getFullName());
                                    return user;
                                }
                            }
                        );
                        userCount++;
                    }
                    catch (Exception ex)
                    {
                        String errorMsg = "Error while creating map :" + ex.getMessage();
                        logger.error(errorMsg);
                    }
                }
                for (int i = 0; i < userCount; i++) {
                    try {
                        User user = cs.take().get();
                        if (user != null) {
                            userIdMap.put(user.getUserId(), user);
                        }
                    } catch (ExecutionException e) {
                    } catch (InterruptedException e) {
                    }
                }
                executor.shutdown();
                map.put(key, userIdMap);
            }
            catch (Exception ex)
            {
                String errorMsg = "Error while creating map :" + ex.getMessage();
                logger.error(errorMsg);
            }
        }
    }
    catch (Exception ex) {
        String errorMsg = "Error while creating map :" + ex.getMessage();
        logger.error(errorMsg);
    }
    logger.info("Size of Map : " + map.size());
    Set<String> periods = map.keySet();
    logger.info("Size of periods : " + periods.size());
    for (String period : periods)
    {
        Map<Integer, User> mapOfuserIds = map.get(period);
        Set<Integer> userIds = mapOfuserIds.keySet();
        logger.info("Size of Set : " + userIds.size());
        for (Integer userId : userIds) {
            User inf = mapOfuserIds.get(userId);
            logger.info("User Id : " + inf.getUserId());
        }
    }
    long endTime = System.currentTimeMillis();
    long timeTaken = (endTime - startTime);
    logger.info("All threads are completed in " + timeTaken + " milisecond");
    logger.info("******END******");
}
You really don't want to create a thread pool with as many threads as users you've read from the db. That doesn't make sense most of the time, because you need to keep in mind that threads need to run somewhere... There are not many servers out there with 10, 100, or even 1000 cores reserved for your application. A much smaller value, like maybe 5, is often enough, depending on your environment.
And as always for topics about performance: first test what your actual bottleneck is. Your application may simply not benefit from threading because, for example, you are reading from a db which only allows 5 concurrent connections at the same time. In that case all your other 995 threads will simply wait.
Another thing to consider is network latency: reading users one by one from multiple threads may even increase the total number of round trips needed to get the data from the database. An alternative approach might be to not read one user at a time, but the data of all 10'000 of them at once. That way a 10 GBit Ethernet connection to your database, if available, might really speed things up, because you have only a small communication overhead with the database and it might serve you all the data you need in one answer quickly.
So in short, in my opinion your question is about performance optimization of your problem in general, but you don't know enough yet to decide which way to go.
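A sketch of that batched variant, assuming a hypothetical bulk-lookup method db.findByUserIds(List<Integer>) and a getUserId() accessor on EmployeeTable2 (neither appears in the original code). It also sidesteps the continue problem, because the skip happens again inside a plain loop:

List<Integer> userIds = new ArrayList<>();
for (UserTable1 eligibleUser : eligibleUsers) {
    userIds.add(eligibleUser.getUserId());
}

// One round trip instead of one query per user; findByUserIds is a hypothetical bulk-lookup method.
Map<Integer, EmployeeTable2> empByUserId = new HashMap<>();
for (EmployeeTable2 emp : db.findByUserIds(userIds)) {
    empByUserId.put(emp.getUserId(), emp); // getUserId() on EmployeeTable2 is assumed here
}

Map<Integer, User> userIdMap = new HashMap<>();
for (UserTable1 eligibleUser : eligibleUsers) {
    EmployeeTable2 emp = empByUserId.get(eligibleUser.getUserId());
    if (emp == null) {
        logger.error("No record found for given User ID in emp table");
        continue; // back in a plain loop, so continue works again
    }
    User user = new User();
    user.setUserId(eligibleUser.getUserId());
    user.setFullName(emp.getFullName());
    userIdMap.put(user.getUserId(), user);
}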
You could try something like this:
List<String> periods = db.getPeriods(todayDate);
// ConcurrentHashMap, because several threads write to it from the parallel stream
Map<String, Map<Integer, User>> hm = new ConcurrentHashMap<>();
periods.parallelStream().forEach(period -> {
    List<UserTable1> eligibleUsers = db.findAllUsers(period);
    hm.put(period, eligibleUsers.parallelStream()
            .collect(Collectors.toMap(UserTable1::getUserId, u -> createUserForId(u.getUserId()))));
});
And in createUserForId you do your db reading:
private User createUserForId(Integer userId) {
    List<EmployeeTable2> empData = db.findByUserId(userId);
    EmployeeTable2 emp = (empData != null && !empData.isEmpty()) ? empData.get(0) : null;
    //...
    User user = new User();
    user.setUserId(userId);
    if (emp != null) {
        user.setFullName(emp.getFullName());
    }
    return user;
}
