I am very new to Spark. I can see the data using the loadrisk.show() method, but when I create the object JavaRDD<Row> balRDD = loadrisk.javaRDD(); I get a NullPointerException.
public class LoadBalRDD implements Serializable {

    public JavaPairRDD<String, Balrdd> getBalRDD(SQLContext sqlContext) {

        Dataset<Row> loadrisk = sqlContext.read().format("com.databricks.spark.csv")
                .option("header", "true")
                .option("mode", "DROPMALFORMED")
                .load("/home/data/test.csv");

        loadrisk.show(); // able to see the result

        JavaRDD<Row> balRDD = loadrisk.javaRDD(); // here not loading

        JavaPairRDD<String, Balrdd> balRDDMap = balRDD.mapToPair(x -> {
            String aml_acc_id = "";
            if (!x.isNullAt(x.fieldIndex("aml_acc_id")))
                aml_acc_id = x.getAs("aml_acc_id").toString();
            Tuple2<String, Balrdd> tp = new Tuple2<>(x.getAs(x.fieldIndex("aml_acc_id")).toString(),
                    new Balrdd(aml_acc_id));
            return tp;
        }).repartitionAndSortWithinPartitions(new CustomAcctIdPartitioner());

        return balRDDMap;
    }
}
I have two URL calls in one method, addNewMap() - one is buildGetSubtenantsURL and the other is buildGetAssetsURL:
public void addNewMap(MapDTO mapDTO) {
    log.info("going to add the map data into db");
    if (mapRepository.existsMapWithMapName(mapDTO.getMapName()))
        throw new BadRequestException("Map with map name " + mapDTO.getMapName()
                + " already exists. Please provide a different map name.");
    Map<String, String> subtenantInfoMap = new HashMap<>();
    Maps mapEntity = new Maps();
    String iottenant = mapDTO.getTenant();
    String subtenantsURL = buildGetSubtenantsURL(null);
    String subTenantsResponse = getSubtenants(subtenantsURL, iottenant);
    JSONObject subTenant = getSubtenantName(subTenantsResponse);
    checkForMultiplePagesSubtenants(subTenantsResponse, subtenantInfoMap, iottenant);
    if (subtenantInfoMap.get(mapDTO.getSubtenantName()) != null) {
        mapEntity = Maps.builder().subtenant(subtenantInfoMap.get(mapDTO.getSubtenantName()).toString()).build();
    } else {
        throw new DataNotFoundException(SUBTENANT_DOESNT_EXIST);
    }
    String subtenantId = subtenantInfoMap.get(mapDTO.getSubtenantName());
    UriComponents assetsURL = buildGetAssetsURL(iottenant, subtenantId);
    String assetsResponse = getAssets(assetsURL, iottenant);
    String mindsphereAssetId = getAssetId(assetsResponse);
    if (mindsphereAssetId.isEmpty()) {
        throw new DataNotFoundException(ASSET_ID_DOESNT_EXIST);
    } else {
        mapEntity = Maps.builder().mindsphereAssetId(mindsphereAssetId).build();
    }
    mapEntity = Maps.builder().mapName(mapDTO.getMapName()).displayName(getDisplayName(mapDTO))
            .description(Objects.nonNull(mapDTO.getDescription()) ? mapDTO.getDescription() : null)
            .tenant(getTenantNameForMapDTO(mapDTO)).mindsphereAssetId(mindsphereAssetId)
            .subtenant(subtenantInfoMap.get(mapDTO.getSubtenantName()).toString())
            .mapLocation(mapDTO.getMapLocation()).operator(mapDTO.getOperator())
            .recipeName(mapDTO.getRecipeName()).subtenantName(mapDTO.getSubtenantName())
            .createdBy(getUserEmail()).createdAt(new Timestamp(System.currentTimeMillis()))
            .build();
    Maps createdMap = mapRepository.saveAndFlush(mapEntity);
    addStationsMappingforNewMap(createdMap);
}
I have written the test case for the above method as:
@Test
public void addNewMap() {
    map = Maps.builder().mapId(1l).mapName("testMap").displayName("Map Test")
            .mindsphereAssetId("a0609ebf2eb7400da8a5fd707e7f68b7").mapLocation("hyd").operator("operator")
            .recipeName("recipe").subtenantName("NSTI").tenant("ctlbrdev")
            .subtenant("9b04027dde5fbd047073805ab8c1c87c")
            .tenant(Tenant).build();
    maps = Arrays.asList(map);
    mapDTO = MapDTO.builder().mapId(1l).mapName("testMap").displayName("Map Test").subtenantName("NSTI")
            .mapLocation("hyd").recipeName("recipe").operator("operator").description("description")
            .tenant("ctlbrdev").build();
    ReflectionTestUtils.setField(mapService, "mindsphereBaseURL", MindsphereBaseURL);
    ReflectionTestUtils.setField(mapService, "mindsphereSubtenantsURL", mindsphereSubtenantsURL);
    ReflectionTestUtils.setField(mapService, "mindsphereAssetsURL", mindsphereAssetsURL);
    when(restTemplate.exchange(ArgumentMatchers.anyString(), ArgumentMatchers.any(HttpMethod.class),
            ArgumentMatchers.any(HttpEntity.class), ArgumentMatchers.<Class<String>>any()))
            .thenReturn(new ResponseEntity<String>(entityDtoCreaters.getSubtenant(), HttpStatus.OK));
    when(tokenCaching.retrieveHeadersContainingTechToken("ctblrdev")).thenReturn(new HttpHeaders());
    when(mapRepository.existsMapWithMapName(any())).thenReturn(false);
    //doReturn(Tenant).when(mapService).getTenantName();
    doReturn(EMAIL).when(mapService).getUserEmail();
    when(mapRepository.saveAndFlush(any())).thenReturn(map);
    when(restTemplate.exchange(ArgumentMatchers.anyString(), ArgumentMatchers.any(HttpMethod.class),
            ArgumentMatchers.any(HttpEntity.class), ArgumentMatchers.<Class<String>>any()))
            .thenReturn(new ResponseEntity<String>(entityDtoCreaters.getSubtenant(), HttpStatus.OK));
    Map<String, String> subtenantInfoMap = new HashMap<>();
    subtenantInfoMap.get(mapDTO.getSubtenantName());
    mapService.addNewMap(mapDTO);
}
It is not covering the getAssets() method, and hence not covering the whole method. How can I achieve this?
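One detail worth checking, offered as a hedged note rather than a definitive fix: the two restTemplate.exchange(...) stubs above use identical matchers, so the second when(...) simply replaces the first and every exchange call returns the subtenants payload. If the assets call made inside getAssets() is supposed to see a different payload, Mockito's consecutive stubbing can return one response per call (entityDtoCreaters.getAsset() is a hypothetical helper assumed here alongside the existing getSubtenant()):

when(restTemplate.exchange(ArgumentMatchers.anyString(), ArgumentMatchers.any(HttpMethod.class),
        ArgumentMatchers.any(HttpEntity.class), ArgumentMatchers.<Class<String>>any()))
        // first exchange call: subtenants lookup
        .thenReturn(new ResponseEntity<String>(entityDtoCreaters.getSubtenant(), HttpStatus.OK))
        // second exchange call: assets lookup (getAsset() is assumed test data, not from the original post)
        .thenReturn(new ResponseEntity<String>(entityDtoCreaters.getAsset(), HttpStatus.OK));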
I am working on database test automation. I am using the @Factory and @DataProvider annotations to feed the inputs.
I want to restrict the code related to getCountOfPt1(poiLocId) so that it runs only once.
I tried setting a boolean flag as well, but it fails because I am using the @Factory as well as the @DataProvider annotation.
The code which I want to restrict and execute only once is:
String pt1 = null;
if (!alreadyExecuted) {
    Map<String, Integer> records = DbMr.getCountOfPt1(poiLocId);
    pt1 = getMaxKey(records);
    LOG.debug("Max key value is...." + pt1);
    if (StringUtils.isBlank(pt1)) {
        records.remove(null);
        pt1 = getMaxKey(records);
        alreadyExecuted = true;
    }
}
Note: the poiLocId passed to this method comes from the factory method:
@Factory
public Object[] factoryMethod() {
    Object[] poiLocIdData = null;
    if (StringUtils.isNotBlank(cityName)) {
        List<String> poiLocId = DbMr.getPoiLocId(cityName);
        int size = poiLocId.size();
        poiLocIdData = new Object[size];
        for (int i = 0; i < size; i++) {
            poiLocIdData[i] = new CollectsTest(poiLocId.get(i));
        }
    } else {
        LOG.error("The parameter is required: Pass City Name");
        Assert.fail("Problems with parameters");
    }
    return poiLocIdData;
}

public CollectsTest(String locationId) {
    poiLocId = locationId;
    this.reportsPath = "reports_" + cityName;
    this.extent = new ExtentReports();
}
@DataProvider(name = "pData")
public Object[][] getPData() {
    List<PData> pList = DbMr.getCollectionPs(poiLocId);
    Object[][] testData = new Object[pList.size()][];
    for (int i = 0; i < pList.size(); i++) {
        testData[i] = new Object[] { pList.get(i) };
    }
    return testData;
}
@BeforeClass
private void setup() throws Exception {
    ExtentHtmlReporter htmlReporter = new ExtentHtmlReporter(reportsPath + "/" + cityName + "_extent.html");
    htmlReporter.loadXMLConfig("src/test/resources/extent-config.xml");
    extent.attachReporter(htmlReporter);
}
@Test(dataProvider = "pData")
public void verifyData(PData pData) throws Exception {
    extentTest = extent.createTest(pData.toString());
    String pt1 = null;
    if (!alreadyExecuted) {
        Map<String, Integer> records = DbMr.getCountOfPt1(poiLocId);
        pt1 = getMaxKey(records);
        LOG.debug("Max key value is...." + pt1);
        if (StringUtils.isBlank(pt1)) {
            records.remove(null);
            pt1 = getMaxKey(records);
            alreadyExecuted = true;
        }
    }
    if (pt1.equalsIgnoreCase("xxxx")) {
        Assert.assertEquals(pData.getpt1(), "xxxx");
    }
}
Since @Factory and @DataProvider work with instances of the test class, try making the alreadyExecuted variable static (a static variable is shared at the class level).
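A minimal sketch of that suggestion (only the field name comes from the question; the rest is assumed):

// Shared across every test instance created by @Factory, so the guarded
// block in verifyData() runs at most once per class load.
private static boolean alreadyExecuted = false;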
The code below works fine and runs only once; I have used a map so that the lookup is executed only once per location ID.
// declare it as a global variable
private static Map<String, String> LOC_ID_AND_PT1_COUNT_MAP = new HashMap<>();

// test method
@Test(dataProvider = "pData")
public void verifyData(PData pData) throws Exception {
    extentTest = extent.createTest(pData.toString());
    String pt1 = LOC_ID_AND_PT1_COUNT_MAP.get(poiLocId);
    if (pt1 == null) {
        Map<String, Integer> records = DbMr.getCountOfPt1(poiLocId);
        pt1 = getMaxKey(records);
        LOG.debug("Max key value is...." + pt1);
        if (StringUtils.isBlank(pt1)) {
            records.remove(null);
            pt1 = getMaxKey(records);
            LOG.debug("Max key value is...." + pt1);
        }
        LOC_ID_AND_PT1_COUNT_MAP.put(poiLocId, pt1);
    }
}
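If the @Factory-created instances ever run in parallel, the same idea can be made thread-safe with a ConcurrentHashMap and computeIfAbsent. This is a hedged variant, not part of the original answer, and it reuses the DbMr and getMaxKey helpers from the question:

// Computes the value at most once per location ID, even under concurrent test execution.
private static final ConcurrentHashMap<String, String> LOC_ID_AND_PT1_COUNT_MAP = new ConcurrentHashMap<>();

String pt1 = LOC_ID_AND_PT1_COUNT_MAP.computeIfAbsent(poiLocId, id -> {
    Map<String, Integer> records = DbMr.getCountOfPt1(id);
    String max = getMaxKey(records);
    if (StringUtils.isBlank(max)) {
        records.remove(null);
        max = getMaxKey(records);
    }
    return max;
});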
The Spark consumer has to read topics with the same name from different bootstrap servers, so I need to create two JavaDStreams, perform a union, process the stream, and commit the offsets.
JavaInputDStream<ConsumerRecord<String, GenericRecord>> dStream = KafkaUtils.createDirectStream(...);
The problem is that JavaInputDStream doesn't support dStream.union(stream2);
If I use
JavaDStream<ConsumerRecord<String, GenericRecord>> dStream = KafkaUtils.createDirectStream(...);
then JavaDStream doesn't support
((CanCommitOffsets) dStream.inputDStream()).commitAsync(os);
Please bear with the long answer.
There is no direct way to do this that I am aware of, so I would first convert the DStreams to Datasets/DataFrames and then perform a UNION on the two DataFrames/Datasets.
The code below is not tested, but it should work. Please feel free to validate it and make the necessary changes.
// One set of Kafka params per bootstrap server
JavaPairInputDStream<String, String> pairDstream1 = KafkaUtils.createDirectStream(ssc, kafkaParams1, topics);
JavaPairInputDStream<String, String> pairDstream2 = KafkaUtils.createDirectStream(ssc, kafkaParams2, topics);

// Create JavaDStream<String> from the first source (keep only the message value)
JavaDStream<String> dstream1 = pairDstream1.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> tuple2) {
        return tuple2._2();
    }
});

// Create JavaDStream<String> from the second source
JavaDStream<String> dstream2 = pairDstream2.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> tuple2) {
        return tuple2._2();
    }
});

// transformWith pairs up the corresponding micro-batches of the two streams, so both
// RDDs are available in one scope and the two DataFrames can be unioned per batch.
JavaDStream<Row> unionStream = dstream1.transformWith(dstream2,
        new Function3<JavaRDD<String>, JavaRDD<String>, Time, JavaRDD<Row>>() {
            @Override
            public JavaRDD<Row> call(JavaRDD<String> rdd1, JavaRDD<String> rdd2, Time time) {
                // Convert each message into a Row
                Function<String, Row> toRow = new Function<String, Row>() {
                    @Override
                    public Row call(String msg) {
                        return RowFactory.create(msg);
                    }
                };
                JavaRDD<Row> rowRDD1 = rdd1.map(toRow);
                JavaRDD<Row> rowRDD2 = rdd2.map(toRow);
                // Create schema
                StructType schema = DataTypes.createStructType(new StructField[] {
                        DataTypes.createStructField("Message", DataTypes.StringType, true) });
                // Get Spark 2.0 session
                SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd1.context().getConf());
                Dataset<Row> df1 = spark.createDataFrame(rowRDD1, schema);
                Dataset<Row> df2 = spark.createDataFrame(rowRDD2, schema);
                // union the two dataframes and return the result as rows
                return df1.union(df2).javaRDD();
            }
        });

// An output operation is still required to trigger execution each batch
unionStream.print();
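The commitAsync part of the question is not covered by the union above. With the kafka-0-10 direct stream, offsets are tracked per input stream, so (as a hedged sketch, assuming stream1 is one of the original JavaInputDStream<ConsumerRecord<String, GenericRecord>> instances from the question) each source stream would commit its own ranges:

// Offsets belong to the stream they were read from, so commit them against
// the original input stream (repeat the same pattern for stream2).
stream1.foreachRDD(rdd -> {
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    ((CanCommitOffsets) stream1.inputDStream()).commitAsync(offsetRanges);
});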
I have to add a new column with a UUID value. I did this with Spark 1.4 and Java using the following code.
StructType objStructType = inputDataFrame.schema();
StructField[] arrStructField = objStructType.fields();
List<StructField> fields = new ArrayList<StructField>();
List<StructField> newfields = new ArrayList<StructField>();
List<StructField> listFields = Arrays.asList(arrStructField);
StructField a = DataTypes.createStructField(leftCol, DataTypes.StringType, true);
fields.add(a);
newfields.addAll(listFields);
newfields.addAll(fields);
final int size = objStructType.size();

JavaRDD<Row> rowRDD = inputDataFrame.javaRDD().map(new Function<Row, Row>() {
    private static final long serialVersionUID = 3280804931696581264L;

    public Row call(Row tblRow) throws Exception {
        Object[] newRow = new Object[size + 1];
        int rowSize = tblRow.length();
        for (int itr = 0; itr < rowSize; itr++) {
            if (tblRow.apply(itr) != null) {
                newRow[itr] = tblRow.apply(itr);
            }
        }
        newRow[size] = UUID.randomUUID().toString();
        return RowFactory.create(newRow);
    }
});

inputDataFrame = objsqlContext.createDataFrame(rowRDD, DataTypes.createStructType(newfields));
I'm wondering if there is a neat way to do this in Spark 2. Please advise.
You can register a UDF that generates the UUID and use the callUDF function to add the new column to your inputDataFrame. Please see the sample code below, using Spark 2.0.
public class SparkUUIDSample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("SparkUUIDSample").master("local[*]").getOrCreate();

        //sample input data
        List<Tuple2<String, String>> inputList = new ArrayList<Tuple2<String, String>>();
        inputList.add(new Tuple2<String, String>("A", "v1"));
        inputList.add(new Tuple2<String, String>("B", "v2"));

        //dataset
        Dataset<Row> df = spark.createDataset(inputList, Encoders.tuple(Encoders.STRING(), Encoders.STRING())).toDF("key", "value");
        df.show();

        //register udf
        UDF1<String, String> uuid = str -> UUID.randomUUID().toString();
        spark.udf().register("uuid", uuid, DataTypes.StringType);

        //call udf
        df.select(col("*"), callUDF("uuid", col("value"))).show();

        //stop
        spark.stop();
    }
}
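If the generated column should stay on the DataFrame under a fixed name rather than only appear in a select, withColumn works with the same registered UDF (a small usage sketch, not part of the original answer; the column name "uuid" is arbitrary):

// Adds a "uuid" column backed by the UDF registered above.
Dataset<Row> withUuid = df.withColumn("uuid", callUDF("uuid", col("value")));
withUuid.show();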
I am facing a problem in which I have to find the longest line and its index. Here is my approach:
SparkConf conf = new SparkConf().setMaster("local").setAppName("basicavg");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> rdd = sc.textFile("/home/impadmin/ravi.txt");
JavaRDD<Tuple2<Integer,String>> words = rdd.map(new Function<String, Tuple2<Integer,String>>() {
#Override
public Tuple2<Integer,String> call(String v1) throws Exception {
// TODO Auto-generated method stub
return new Tuple2<Integer, String>(v1.split(" ").length, v1);
}
});
JavaPairRDD<Integer, String> linNoToWord = JavaPairRDD.fromJavaRDD(words).sortByKey(false);
System.out.println(linNoToWord.first()._1+" ********************* "+linNoToWord.first()._2);
In this way the tupleRDD gets sorted on the basis of the key, and the first element in the new RDD after sorting has the greatest length:
JavaRDD<String> rdd = sc.textFile("/home/impadmin/ravi.txt");
JavaRDD<Tuple2<Integer, String>> tupleRDD = rdd.map(new Function<String, Tuple2<Integer, String>>() {
    @Override
    public Tuple2<Integer, String> call(String v1) throws Exception {
        return new Tuple2<Integer, String>(v1.split(" ").length, v1);
    }
});
JavaRDD<Tuple2<Integer, String>> tupleRDD1 = tupleRDD.sortBy(new Function<Tuple2<Integer, String>, Integer>() {
    @Override
    public Integer call(Tuple2<Integer, String> v1) throws Exception {
        return v1._1;
    }
}, false, 1);
System.out.println(tupleRDD1.first());
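Sorting the entire RDD only to take its first element does more work than necessary; if only the longest line is needed, a single reduce pass over the same tupleRDD gives it directly (a hedged alternative, not part of the original answer):

// One pass over the data; keeps whichever tuple has the larger word count.
Tuple2<Integer, String> longest = tupleRDD.reduce((a, b) -> a._1() >= b._1() ? a : b);
System.out.println(longest._1() + " ********************* " + longest._2());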
Since you are concerned with both the line number and the text, please try this.
First, create a serializable class Line:
public static class Line implements Serializable {
    public Line(Long lineNo, String text) {
        lineNo_ = lineNo;
        text_ = text;
    }

    public Long lineNo_;
    public String text_;
}
Then do the following operations:
SparkConf conf = new SparkConf().setMaster("local[1]").setAppName("basicavg");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> rdd = sc.textFile("/home/impadmin/words.txt");
JavaPairRDD<Long, Line> linNoToWord2 = rdd.zipWithIndex().mapToPair(new PairFunction<Tuple2<String,Long>, Long, Line>() {
public Tuple2<Long, Line> call(Tuple2<String, Long> t){
return new Tuple2<Long, Line>(Long.valueOf(t._1.split(" ").length), new Line(t._2, t._1));
}
}).sortByKey(false);
System.out.println(linNoToWord2.first()._1+" ********************* "+linNoToWord2.first()._2.text_);
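To also print the line index the question asks for, the Line fields defined above can be reused on the same result (a small usage note, not in the original answer):

// lineNo_ comes from zipWithIndex, so it is the 0-based position of the line in the file.
Line longest = linNoToWord2.first()._2();
System.out.println("Longest line is line #" + longest.lineNo_ + ": " + longest.text_);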