Converting Java Map to Spark DataFrame (Java API)

I'm trying to use Spark (Java API) to take an in-memory Map (that potentially contains other nested Maps as its values) and convert it into a dataframe. I think I need something along these lines:
Map myMap = getSomehow();
RDD myRDD = sparkContext.makeRDD(myMap); // ???
DataFrame df = sparkContext.read(myRDD); // ???
But I'm having a tough time seeing the forest through the trees here...any ideas? Again this might be a Map<String,String> or a Map<String,Map>, where there could be several nested layers of maps-inside-of-maps-inside-of-maps, etc.

So I tried something; I'm not sure it's the most efficient way to do it, but I don't see any other right now.
SparkConf sf = new SparkConf().setAppName("name").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(sf);
SQLContext sqlCon = new SQLContext(sc);
HashMap<String, String> putMap = new HashMap<String, String>();
putMap.put("1", "test");
Map<String, Map<String, String>> map = new HashMap<String, Map<String, String>>();
map.put("test1", putMap);
List<Tuple2<String, Map<String, String>>> list = new ArrayList<Tuple2<String, Map<String, String>>>();
for (String key : map.keySet()) {
list.add(new Tuple2<String, Map<String, String>>(key, map.get(key)));
}
JavaRDD<Tuple2<String, Map<String, String>>> rdd = sc.parallelize(list);
System.out.println(rdd.first());
List<StructField> fields = new ArrayList<>();
StructField field1 = DataTypes.createStructField("String", DataTypes.StringType, true);
StructField field2 = DataTypes.createStructField("Map",
DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType), true);
fields.add(field1);
fields.add(field2);
StructType struct = DataTypes.createStructType(fields);
JavaRDD<Row> rowRDD = rdd.map(new Function<Tuple2<String, Map<String, String>>, Row>() {
@Override
public Row call(Tuple2<String, Map<String, String>> arg0) throws Exception {
return RowFactory.create(arg0._1, arg0._2);
}
});
DataFrame df = sqlCon.createDataFrame(rowRDD, struct);
df.show();
In this scenario I assumed that the Map in the DataFrame is of type Map<String, String>. Hope this helps!
Edit: Obviously you can delete all the prints. I did this for visualization purposes!
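If you need to go one level deeper (the Map<String, Map> case from the question), MapTypes can be nested inside each other in the schema. A minimal sketch, assuming string keys and string leaf values at every level (the column names are just placeholders):
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.MapType;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Schema for rows of the form (String key, Map<String, Map<String, String>> value).
MapType innerMap = DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType);
MapType outerMap = DataTypes.createMapType(DataTypes.StringType, innerMap);
StructType nestedStruct = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("key", DataTypes.StringType, true),
    DataTypes.createStructField("nested", outerMap, true)
});
// Deeper nesting follows the same pattern: wrap the previous MapType as the new value type.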

Related

Delete index of Elasticsearch 7 using Spark (Java)

Hi guys, I am looking for help deleting an index in ES7 using Spark. There is no such example or anything on Google that I could find. If you know how, please help me.
SparkConf conf = new SparkConf().setAppName("es7").setMaster("local[*]");
conf.set("es.index.auto.create", "true");
conf.set("es.index.read.missing.as.empty", "false");
conf.set("es.resource", "employeeindex/_doc");
conf.set("es.query", "?q=me*");
JavaSparkContext jsc = new JavaSparkContext(conf);
Map<String, ?> numbers = ImmutableMap.of("one", 1, "two", 2);
Map<String, ?> airports = ImmutableMap.of("OTP", "Otopeni", "SFO", "San Fran");
JavaRDD<Map<String, ?>> javaRDD = jsc.parallelize(ImmutableList.of(numbers, airports));
JavaEsSpark.saveToEs(javaRDD, "employeeindex/_doc");
JavaEsSpark.saveToEs(javaRDD, "employeeindex/_doc", ImmutableMap.of("es.mapping.id", "id"));
JavaPairRDD<String, Map<String, Object>> esRDD =
JavaEsSpark.esRDD(jsc, "employeeindex/_doc");
esRDD.collect();
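As far as I can tell, the JavaEsSpark API (saveToEs/esRDD) only reads and writes documents and does not expose index deletion, so one option is to issue the delete from the driver with the Elasticsearch low-level REST client. A minimal sketch, assuming the elasticsearch-rest-client dependency is on the classpath, ES runs on localhost:9200, and the surrounding method declares throws IOException:
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

// Build a low-level REST client against the cluster and issue DELETE /employeeindex.
try (RestClient restClient = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
    Response response = restClient.performRequest(new Request("DELETE", "/employeeindex"));
    System.out.println("Delete index status: " + response.getStatusLine());
}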

Which data structure in Java can hold HashMaps as its elements?

public static HashMap<String,Integer> users1 = new HashMap<>();
public static HashMap<String,Integer> users2 = new HashMap<>();
public static HashMap<String,Integer> users3 = new HashMap<>();
public static HashMap<String,Integer> users4 = new HashMap<>();
// Enter code here.
I need to add them into an ArrayList or Set or something, and then they should be accessible by that data structure's index. Is there any way to solve this?
Or just simply use an array:
// Java cannot create a generic array directly, so this line compiles with an unchecked warning.
public static HashMap<String, Integer>[] users = new HashMap[4];
users[0] = new HashMap<>();
users[1] = new HashMap<>();
users[2] = new HashMap<>();
users[3] = new HashMap<>();
And then use them (HashMap uses put, not add):
users[0].put("String", 123);
Or you could do it as simply as:
List<HashMap<String, Integer>> list = new ArrayList<>();
// And then add them to list
list.add(users1);
list.add(users2);
...
ArrayList<HashMap<String, Integer>> list = new ArrayList<>();
list.add(users1);
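Putting it together, a minimal sketch of list-based indexed access (the keys and values here are just illustrative):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

HashMap<String, Integer> users1 = new HashMap<>();
HashMap<String, Integer> users2 = new HashMap<>();
users1.put("alice", 1);
users2.put("bob", 2);

// The list index selects which map to work with.
List<HashMap<String, Integer>> userMaps = new ArrayList<>();
userMaps.add(users1);
userMaps.add(users2);

userMaps.get(0).put("carol", 3);                 // modify the first map
System.out.println(userMaps.get(1).get("bob"));  // prints 2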

How to pass a List as a parameter to a Map's put method?

I want to send data from the lists as input to execute a stored procedure. In the code below, all of the list variables contain the data to be sent as input parameters.
public void onClick$btnSend() throws Exception {
Workbook workbook = new Workbook("D:/excel file/Mapping Prod Matriks _Group Sales Commercial.xlsx");
com.aspose.cells.Worksheet worksheet = workbook.getWorksheets().get(0);
com.aspose.cells.Cells cells = worksheet.getCells();
Range displayRange = cells.getMaxDisplayRange();
List<String> ParaObjGroup = new ArrayList<String>();
List<String> ParaObjCode = new ArrayList<String>();
List<String> ParaProdMatrixId = new ArrayList<String>();
List<String> ParaProdChannelId = new ArrayList<String>();
List<String> ParaProdSalesGroupId = new ArrayList<String>();
List<String> ParaCustGroup = new ArrayList<String>();
List<String> ParaSlsThroughId = new ArrayList<String>();
List<Integer> Active = new ArrayList<Integer>();
for(int row= displayRange.getFirstRow()+1;row<displayRange.getRowCount();row++){
ParaObjGroup.add(displayRange.get(row,1).getStringValue());
ParaObjCode.add(displayRange.get(row,3).getStringValue());
ParaProdMatrixId.add(displayRange.get(row,5).getStringValue());
ParaProdChannelId.add(displayRange.get(row,7).getStringValue());
ParaProdSalesGroupId.add(displayRange.get(row,9).getStringValue());
ParaCustGroup.add(displayRange.get(row,11).getStringValue());
ParaSlsThroughId.add(displayRange.get(row,13).getStringValue());
Active.add(displayRange.get(row,14).getIntValue());
}
System.out.println(ParaObjGroup);
System.out.println(ParaObjCode);
System.out.println(ParaProdMatrixId);
System.out.println(ParaProdChannelId);
System.out.println(ParaProdSalesGroupId);
System.out.println(ParaCustGroup);
System.out.println(ParaSlsThroughId);
System.out.println(Active);
lovService.coba(ParaObjGroup,ParaObjCode,ParaProdMatrixId,ParaProdChannelId,ParaProdSalesGroupId,ParaCustGroup,ParaSlsThroughId,Active);
}
And below is the code for executing the stored procedure:
@Transactional(propagation = Propagation.REQUIRED, rollbackFor = {SQLException.class, Exception.class })
public void executeSPForInsertData(DataSource ds,String procedureName,Map<String[], Object[]> inputParameter){
SimpleJdbcCall jdbcCall = new SimpleJdbcCall(paramsDataSourceBean).withProcedureName(procedureName);
jdbcCall.execute(inputParameter);
}
But I have a problem: I cannot pass the lists as values in the map's put method:
@ServiceLog(schema = ConstantaVariable.DBDefinition_Var.PARAMS_DB_SCHEMA, sp = ConstantaVariable.PARAMSProcedure_VAR.PR_SP_FAHMI)
@Transactional(propagation=Propagation.REQUIRED, rollbackFor={Exception.class,SQLException.class})
public void coba(List<String> params1,List<String> params2,List<String> params3,List<String> params4,List<String> params5,
List<String> params6,List<String> params7,List<Integer> params8){
Map<String[], Object[]> mapInputParameter = new LinkedHashMap<String[], Object[]>();
mapInputParameter.put("P_OBJT_GROUP", params1);
mapInputParameter.put("P_CODE", params2);
mapInputParameter.put("P_PROD_MATRIX_ID", params3);
mapInputParameter.put("P_PROD_CHANNEL_ID", params4);
mapInputParameter.put("P_PROD_SALES_GROUP_ID", params5);
mapInputParameter.put("P_CUST_GROUP", params6);
mapInputParameter.put("P_SLS_THROUGH_ID", params7);
mapInputParameter.put("P_ACTIVE", params8);
ParamsService.getService().executeSPForInsertData(null,ConstantaVariable.PARAMSProcedure_VAR.PR_SP_FAHMI,mapInputParameter);
}
The type Map<String[], Object[]> is not compatible with what you are trying to put in: the key is a String and the value is a List<String>.
There are two solutions:
Change the map to be compatible with the inserted parameters.
Map<String, List<String>> mapInputParameter = new LinkedHashMap<>();
If you need to use the original map type, then you have to change the way you put the parameters into the map.
Map<String[], Object[]> mapInputParameter = new LinkedHashMap<>();
mapInputParameter.put(new String[] { "P_OBJT_GROUP" }, new Object[] { params1 });
mapInputParameter.put(new String[] { "P_CODE" }, new Object[] { params2 });
The drawback is that in further processing you have to check if the array is not empty and cast explicitly from Object to List<String>.
If you want something "more universal and more generic", I'd go for Map<String, List<Object>>. In any case, I see no reason to use an array in the map unless it is explicitly required (I have no information about the executeSPForInsertData method).
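For illustration, a minimal sketch of the first option, with plain String keys passed straight to SimpleJdbcCall; the method here is a hypothetical variant of executeSPForInsertData, and whether a List<String> is accepted as a single parameter value depends on how the stored procedure declares its parameters:
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.jdbc.core.simple.SimpleJdbcCall;

// Hypothetical variant of executeSPForInsertData using plain String keys.
public void executeSPForInsertData(DataSource ds, String procedureName, Map<String, Object> inputParameter) {
    SimpleJdbcCall jdbcCall = new SimpleJdbcCall(ds).withProcedureName(procedureName);
    jdbcCall.execute(inputParameter);
}

// Caller side: the keys are the procedure's parameter names, the values are the lists.
Map<String, Object> mapInputParameter = new LinkedHashMap<>();
mapInputParameter.put("P_OBJT_GROUP", params1); // params1 is a List<String>
mapInputParameter.put("P_CODE", params2);
// ... remaining parameters ...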

Fetch a String from a map in DynamoDB

I have dynamodb table structure as follows:
{
"id": "1",
"skills": {
"skill1": "html",
"skill2": "css"
}
}
I have a task to filter by a skills value. In order to complete my task I wrote Java logic as follows:
AmazonDynamoDB client = dynamoDBService.getClient();
DynamoDB dynamoDB = new DynamoDB(client);
Table table = dynamoDB.getTable("dummy");
Map<String, String> attributeNames = new HashMap<String, String >();
attributeNames.put("#columnValue", "skills.skill1");
Map<String, AttributeValue> attributeValues = new HashMap<String, AttributeValue>();
attributeValues.put(":val1", new AttributeValue().withS("html"));
ScanSpec scanSpec = new ScanSpec().withProjectionExpression("skills.skill1")
.withFilterExpression("#columnValue = :val1 ").withNameMap(new NameMap().with("#columnValue", "skills.skill1"))
.withValueMap(new ValueMap().withString(":val1", "html"));
ItemCollection<ScanOutcome> items = table.scan(scanSpec);
Iterator<Item> iter = items.iterator();
while (iter.hasNext()) {
Item item = iter.next();
System.out.println("--------"+item.toString());
}
The mentioned code does not help me out. Any solution?
You can use a ProjectionExpression to retrieve only specific attributes or elements, rather than an entire item. A ProjectionExpression can specify top-level or nested attributes, using document paths.
For example, from the AWS documentation:
GetItemSpec spec = new GetItemSpec()
.withPrimaryKey("Id", 206)
.withProjectionExpression("Id, Title, RelatedItems[0], Reviews.FiveStar")
.withConsistentRead(true);
Item item = table.getItem(spec);
System.out.println(item.toJSONPretty());
A simple (but inefficient) solution to this problem is:
First fetch all the records from the table.
Then iterate over the list of items.
Extract the skills map from each item.
Write your own logic to do the filtering.
Repeat the loop until the last record.
(A rough sketch of this approach follows below.)
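Reusing the table object from the question, it might look roughly like this (note that a full scan reads the entire table, so this is only reasonable for small data sets):
import java.util.Map;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.ScanOutcome;

// Scan everything, then filter on the skills map in application code.
ItemCollection<ScanOutcome> allItems = table.scan();
for (Item item : allItems) {
    Map<String, Object> skills = item.getMap("skills");
    if (skills != null && "html".equals(skills.get("skill1"))) {
        System.out.println("Matched: " + item.toJSONPretty());
    }
}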
I found the solution; the scanSpec should be as follows:
ScanSpec scanSpec = new ScanSpec()
.withFilterExpression("#category.#uid = :categoryuid").withNameMap(new NameMap().with("#category","skills").with("#uid",queryString))
.withValueMap(new ValueMap().withString(":categoryuid", queryString));

How to convert JSON from Kafka to pass it to Spark's machine learning algorithm

I am trying to learn Spark and Spark Streaming using Java, and I am developing an IoT application.
I have a Kafka server in place which accepts JSON data, and I am able to parse it using SQLContext and a foreach function.
Data format is as follows,
[{"t":1481368346000,"sensors":[{"s":"s1","d":"+149.625"},{"s":"s2","d":"+23.062"},{"s":"s3","d":"+16.375"},{"s":"s4","d":"+235.937"},{"s":"s5","d":"+271.437"},{"s":"s6","d":"+265.937"},{"s":"s7","d":"+295.562"},{"s":"s8","d":"+301.687"}]}]
In this, t is the timestamp of each data stream,
and sensors is an array of sensor data, with s as the name of each sensor and d containing its data.
What I have done till now is:
JavaPairInputDStream<String, String> directKafkaStream =
KafkaUtils.createDirectStream(ssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topics);
SQLContext sqlContext = spark.sqlContext();
StreamingLinearRegressionWithSGD model = new StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(2));
JavaDStream<String> json = directKafkaStream.map(new Function<Tuple2<String,String>, String>() {
public String call(Tuple2<String,String> message) throws Exception {
return message._2();
};
});
json.print();
json.foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> jsonRecord) throws Exception {
System.out.println("JSON Record ---- "+jsonRecord);
if(!jsonRecord.isEmpty()){
Dataset<Row> timestamp = sqlContext.read().json(jsonRecord).select("t");
timestamp.printSchema();
timestamp.show(false);
Dataset<Row> data = sqlContext.read().json(jsonRecord).select("sensors");
data.printSchema();
data.show(false);
//DF in table
Dataset<Row> df = data.select(org.apache.spark.sql.functions.explode(org.apache.spark.sql.functions.col("sensors")))
.toDF("sensors").select("sensors.s","sensors.d").where("sensors.s = 's1'");
Row firstRow = df.head();
String valueOfFirstSensor = firstRow.getString(1);
System.out.println("---------valueOfFirstSensor --------"+ valueOfFirstSensor);
double[] values = new double[1];
values[0] = firstRow.getDouble(0);
new LabeledPoint(timestamp.head().getDouble(0), Vectors.dense(values));
df.show(false);
}
}
});
ssc.start();
ssc.awaitTermination();
What I want to do is convert json, which is a JavaDStream<String>, into a data structure that the StreamingLinearRegressionWithSGD model accepts.
When I try to use Spark's map function to map the json stream to a JavaDStream<LabeledPoint> as follows,
JavaDStream<LabeledPoint> forML = json.map(new Function<String, LabeledPoint>() {
@Override
public LabeledPoint call(String jsonRecord) throws Exception {
// TODO Auto-generated method stub
System.out.println("\n\n\n here is JSON in"+ jsonRecord);
LabeledPoint returnObj = null;
if(!jsonRecord.isEmpty()){
Dataset<Row> timestamp = sqlContext.read().json(jsonRecord).select("t");
timestamp.printSchema();
timestamp.show(false);
Dataset<Row> data = sqlContext.read().json(jsonRecord).select("sensors");
data.printSchema();
data.show(false);
//DF in table
Dataset<Row> df = data.select(org.apache.spark.sql.functions.explode(org.apache.spark.sql.functions.col("sensors")))
.toDF("sensors").select("sensors.s","sensors.d").where("sensors.s = 's1'");
Row firstRow = df.head();
String valueOfFirstSensor = firstRow.getString(1);
System.out.println("---------valueOfFirstSensor --------"+ valueOfFirstSensor);
double[] values = new double[1];
values[0] = firstRow.getDouble(0);
returnObj = new LabeledPoint(timestamp.head().getDouble(0), Vectors.dense(values));
df.show(false);
}
return returnObj;
}
}).cache();
model.trainOn(forML);
When model.trainOn is called, it fails with a NullPointerException at
Dataset<Row> timestamp = sqlContext.read().json(jsonRecord).select("t");
Now the questions I have are:
Am I doing this right?
How will I be able to predict values, and why and how do I need to create a different stream to pass to the model's predictOn function?
I will be receiving multiple sensors, but a single value for each sensor, and there can be thousands of such streams. How can I create a different model for each of those thousand sensors and predict for such a vast amount of data efficiently?
Are there any other good machine learning algorithms or approaches which can be utilized for this type of sensor data?
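On the second question, for reference: the streaming model's predictOn expects a stream of feature Vectors rather than LabeledPoints. A minimal sketch, assuming the forML stream built above (and that it contains no null elements):
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.streaming.api.java.JavaDStream;

// Strip the labels off the training stream to get a stream of feature Vectors.
JavaDStream<Vector> features = forML.map(new Function<LabeledPoint, Vector>() {
    @Override
    public Vector call(LabeledPoint point) throws Exception {
        return point.features();
    }
});

// predictOn emits one predicted value per incoming feature Vector.
model.predictOn(features).print();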
