any way to use JavaSparkContext in JavaRdd.map(rdd -> {})? - java

I am thinking of doing the following. However, I get an error saying JavaSparkContext (sc) is not serializable. I am wondering if there is any way to bypass this?
javaRdd.map(rdd -> {
    List<String> data = new ArrayList<>();
    ObjectMapper mapper = new ObjectMapper();
    for (EntityA a : rdd) {
        String json = null;
        try {
            json = mapper.writeValueAsString(a);
        } catch (JsonProcessingException e) {
            e.printStackTrace();
        }
        data.add(json);
    }
    JavaRDD<String> rddData = sc.parallelize(data);
    DataFrame df = sqlContext.read().schema(schema).json(rddData);
});
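For context: SparkContext and SQLContext live only on the driver and cannot be captured by a closure that runs on the executors, so the code above cannot work as written. The usual workaround is to keep only the per-record work (the Jackson serialization) inside map and build the DataFrame on the driver from the resulting RDD. A minimal sketch, assuming javaRdd is a JavaRDD<EntityA> and that the schema and sqlContext from the question exist on the driver:
// Hedged sketch: only the Jackson mapping runs on the executors; the
// DataFrame is created back on the driver, where sqlContext is available.
JavaRDD<String> jsonRdd = javaRdd.map(a -> {
    // Spark's Function.call is declared to throw Exception, so the checked
    // JsonProcessingException does not need a try/catch here.
    ObjectMapper mapper = new ObjectMapper();
    return mapper.writeValueAsString(a);
});

DataFrame df = sqlContext.read().schema(schema).json(jsonRdd);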

Related

How to publish mqtt message to kafka in spring

I am developing an MQTT connector for Kafka using Spring. Using the MQTT library provided by Spring, messages are collected as follows.
Message handler:
@Bean
@ServiceActivator(inputChannel = "mqttInputChannel")
public MessageHandler handler() {
    return new MessageHandler() {
        @Override
        public void handleMessage(Message<?> message) throws MessagingException {
            String topic = message.getHeaders().get(MqttHeaders.RECEIVED_TOPIC).toString();
            if (topic.equals("myTopic")) {
                System.out.println("Mqtt data pub");
            }
            System.out.println(message.getPayload());
            if (topic == null) {
                topic = "mqttdata";
            }
            String tag = "test/vib";
            String name = null;
            if (name == null) {
                name = KafkaMessageService.MQTT_PRODUCER;
            }
            HashMap<String, Object> datalist = new HashMap<String, Object>();
            try {
                datalist = convertJSONstringToMap(message.getPayload().toString());
                System.out.println(datalist.get("mac"));
                // This is the call I cannot get right:
                counts = kafkaMessageService.publish(topic, name, tag, (HashMap<String,Object>[] datalist);
            } catch (Exception e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
    };
}

public static HashMap<String, Object> convertJSONstringToMap(String json) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    HashMap<String, Object> map = new HashMap<String, Object>();
    map = mapper.readValue(json, new TypeReference<HashMap<String, Object>>() {});
    return map;
}
The publish method:
public int publish(String topic, String producerName, String tag, HashMap<String,Object>[] datalist) throws NotMatchedProducerException, KafkaPubFailureException {
    KafkaProducerAdaptor adaptor = searchProducerAdaptor(producerName);
    if (adaptor == null) {
        throw new NotMatchedProducerException();
    }
    KafkaTemplate<String,Object> kafkaTemplate = adaptor.getKafkaTemplate();
    LocalDateTime currentDateTime = LocalDateTime.now();
    String receivedTime = currentDateTime.toString();
    ObjectMapper objectMapper = new ObjectMapper();
    String key = adaptor.getName();
    int counts = 0;
    for (HashMap<String,Object> data : datalist) {
        Map<String,Object> messagePacket = new HashMap<String,Object>();
        messagePacket.put("tag", tag);
        messagePacket.put("data", data);
        messagePacket.put("receivedtime", receivedTime);
        try {
            kafkaTemplate.send(topic, key, objectMapper.valueToTree(messagePacket)).get();
            logger.info("Sent message : topic=["+topic+"],key=["+key+"] value=["+messagePacket+"]");
        } catch (Exception e) {
            logger.info("Unable to send message : topic=["+topic+"],key=["+key+"] message=["+messagePacket+"] / due to : "+e.getMessage());
            throw new KafkaPubFailureException(e);
        }
        counts++;
    }
    return counts;
}
I don't know how to declare a HashMap<String, Object>[] instance or how to use it.
The code above was taken from the Spring support material as is, with some modifications.
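As for declaring a HashMap<String, Object>[] instance: Java does not allow clean generic array creation, so the usual pattern is an unchecked cast over a raw array (or a List converted to an array). A minimal, hypothetical sketch of the first option, reusing the names from the handler above:
// Wrap the single parsed map in a one-element array so it matches the
// publish(String, String, String, HashMap<String,Object>[]) signature.
HashMap<String, Object> parsed = convertJSONstringToMap(message.getPayload().toString());

@SuppressWarnings("unchecked") // generic array creation requires an unchecked cast
HashMap<String, Object>[] datalist = (HashMap<String, Object>[]) new HashMap[] { parsed };

int counts = kafkaMessageService.publish(topic, name, tag, datalist);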

SnakeYAML formatting - remove YAML curly brackets

I have code that dumps a LinkedHashMap into a YAML file:
public class TestDump {
    public static void main(String[] args) {
        LinkedHashMap<String, Object> values = new LinkedHashMap<String, Object>();
        values.put("one", 1);
        values.put("two", 2);
        values.put("three", 3);

        DumperOptions options = new DumperOptions();
        options.setIndent(2);
        options.setPrettyFlow(true);
        Yaml output = new Yaml(options);

        File targetYAMLFile = new File("C:\\temp\\sample.yaml");
        FileWriter writer = null;
        try {
            writer = new FileWriter(targetYAMLFile);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        output.dump(values, writer);
    }
}
But the output looks like this:
{
  one: 1,
  two: 2,
  three: 3
}
Is there a way to get something like this instead?
one: 1
two: 2
three: 3
Although the first one is valid YAML, I want the output format to be like the second one.
It looks like this is just some configuration via DumperOptions:
public class TestDump {
    public static void main(String[] args) {
        LinkedHashMap<String, Object> values = new LinkedHashMap<String, Object>();
        values.put("one", 1);
        values.put("two", 2);
        values.put("three", 3);

        DumperOptions options = new DumperOptions();
        options.setIndent(2);
        options.setPrettyFlow(true);
        // Fix below - additional configuration
        options.setDefaultFlowStyle(DumperOptions.FlowStyle.BLOCK);
        Yaml output = new Yaml(options);

        File targetYAMLFile = new File("C:\\temp\\sample.yaml");
        FileWriter writer = null;
        try {
            writer = new FileWriter(targetYAMLFile);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        output.dump(values, writer);
    }
}
This solves the problem.

How to use LinearRegression in Spark based on text files

I'm fairly new to programming with Spark. I want to set up a linear regression model using Spark, based on log files with tabs as "column" separators. All the tutorials and examples I found about it start off with something like this:
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, path).toJavaRDD();
However, I have a bunch of log files I want to use instead. So this is what I have tried so far:
public static void main(String... args)
{
    if (!new File("LogisticRegressionModel").exists())
    {
        buildTrainingModel();
    }
    else
    {
        testModel();
    }
}

private static void testModel()
{
    SparkSession sc = SparkSession.builder().master("local[2]").appName("LogisticRegressionTest").getOrCreate();
    Dataset<Row> dataSet = sc.read().option("delimiter", "-").option("header", "false").csv("EI/eyeliteidemo/TAP01.log");
    PipelineModel model = PipelineModel.load("LogisticRegressionModel");
    Dataset<Row> predictions = model.transform(dataSet);
}

private static void buildTrainingModel()
{
    SparkSession sc = SparkSession.builder().master("local[2]").appName("LogisticRegressionTest").getOrCreate();
    StructType schema = new StructType(new StructField[]{
        new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
        DataTypes.createStructField("features", DataTypes.StringType, false),
    });
    Dataset<Row> input = sc.read().option("delimiter", "-").option("header", "false").csv("foo/bar/Foo_*.log");
    input = input.drop("_c1", "_c3", "_c4");
    input = input.select(functions.concat(input.col("_c0"), input.col("_c2"), input.col("_c5")));
    input = input.withColumnRenamed("concat(_c0, _c2, _c5)", "features");
    input.show(30, false);
    Dataset<Row> dataSet = sc.createDataFrame(input.collectAsList(), schema);

    Tokenizer tokenizer = new Tokenizer()
        .setInputCol("features")
        .setOutputCol("rawTokens");
    StopWordsRemover swRemover = new StopWordsRemover().setInputCol(tokenizer.getOutputCol()).setOutputCol("cleanedTerms").setStopWords(readStopwords());
    HashingTF hashingTF = new HashingTF()
        .setNumFeatures(1000)
        .setInputCol(swRemover.getOutputCol())
        .setOutputCol("hashedTerms");
    IDF idf = new IDF().setInputCol(hashingTF.getOutputCol()).setOutputCol("featuresIDF");
    LogisticRegression lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001);
    Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[] {tokenizer, swRemover, hashingTF, idf, lr});

    // Fit the pipeline to training documents.
    PipelineModel model = pipeline.fit(dataSet);
    try
    {
        model.save("LogisticRegressionModel");
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
}

private static String[] readStopwords()
{
    List<String> words = new ArrayList<>();
    try (Stream<String> stream = Files.lines(Paths.get(LogisticRegressionTest.class.getResource("stopwords_en.txt").toURI()))) {
        words = stream
            .map(String::toLowerCase)
            .collect(Collectors.toList());
    } catch (IOException e) {
        e.printStackTrace();
    }
    catch (URISyntaxException e)
    {
        e.printStackTrace();
    }
    String[] retWords = new String[words.size()];
    return words.toArray(retWords);
}
Unfortunately, I run into exceptions:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually StringType.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.LogisticRegression.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:193)
at org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:184)
at org.apache.spark.ml.classification.LogisticRegression.validateAndTransformSchema(LogisticRegression.scala:193)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:122)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
at LogisticRegressionTest.buildTrainingModel(LogisticRegressionTest.java:92)
at LogisticRegressionTest.main(LogisticRegressionTest.java:40)
Now my problem/question is how to get this datatype issue right? Moreover, does my code make any sense to Spark experts in the first place?
Thanks!
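Not a complete answer, but the exception itself points at the likely fix: LogisticRegression reads its input from a column named "features" (of Vector type) by default, whereas here "features" is the raw String column; the vectorized output of the pipeline is "featuresIDF". A minimal sketch, assuming the rest of the pipeline from the question (note that the label column also has to be numeric, which a plain csv() read does not guarantee):
// Hedged sketch: point the estimator at the Vector column produced by IDF
// instead of the default "features" (which is the raw String column here).
LogisticRegression lr = new LogisticRegression()
        .setMaxIter(10)
        .setRegParam(0.001);
lr.setFeaturesCol(idf.getOutputCol());   // "featuresIDF", a Vector-typed column
lr.setLabelCol("label");                 // must be a numeric (DoubleType) column

Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[] {tokenizer, swRemover, hashingTF, idf, lr});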

STRING to ArrayList<String> using TypeReference

String str = "[tdr1w6v, tdr1w77]";
ObjectMapper objectMapper = new ObjectMapper();
JavaType type = objectMapper.getTypeFactory()
        .constructCollectionType(ArrayList.class, String.class);
ArrayList<String> list = null;
try {
    list = objectMapper.readValue(str,
            new TypeReference<ArrayList<String>>(){});
} catch (IOException e) {
    e.printStackTrace();
}
Here an exception is thrown:
com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'tdr1w6v': was expecting 'null', 'true', 'false' or NaN
How can I convert str to an ArrayList of String?
@FedericoPeraltaSchaffner's suggestion helped. Now, in my binder class, I use objectMapper.writeValueAsString to convert the data before storing it in the database. And in my Mapper class, while reading from the database, I can use the same approach as in the question:
ObjectMapper objectMapper = new ObjectMapper();
ArrayList<String> list = null;
try {
    list = objectMapper.readValue(str, new TypeReference<ArrayList<String>>(){});
} catch (IOException e) {
    e.printStackTrace();
}
So now I don't have to create a separate DTO class; I can use the same model at the service layer and in the DAO.
The requirement can easily be met without using TypeReference:
String str = "[tdr1w6v, tdr1w77]";
List<String> al = Arrays.asList(str.replaceAll("[\\[\\]]", "").split(","));
System.out.println(al);
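One caveat with the split-based approach: the second element keeps a leading space (" tdr1w77"). A small hedged variant, same idea but with a whitespace-tolerant regex:
String str = "[tdr1w6v, tdr1w77]";
List<String> al = Arrays.asList(str.replaceAll("[\\[\\]]", "").split("\\s*,\\s*"));
System.out.println(al);   // [tdr1w6v, tdr1w77]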

Deserialize JSON to ArrayList<POJO> using Jackson

I have a Java class MyPojo that I am interested in deserializing from JSON. I have configured a special MixIn class, MyPojoDeMixIn, to assist me with the deserialization. MyPojo has only int and String instance variables combined with proper getters and setters. MyPojoDeMixIn looks something like this:
public abstract class MyPojoDeMixIn {
    MyPojoDeMixIn(
        @JsonProperty("JsonName1") int prop1,
        @JsonProperty("JsonName2") int prop2,
        @JsonProperty("JsonName3") String prop3) {}
}
In my test client I do the following, but of course it does not work at compile time because there is a JsonMappingException related to a type mismatch.
ObjectMapper m = new ObjectMapper();
m.getDeserializationConfig().addMixInAnnotations(MyPojo.class, MyPojoDeMixIn.class);
try {
    ArrayList<MyPojo> arrayOfPojo = m.readValue(response, MyPojo.class);
} catch (Exception e) {
    System.out.println(e);
}
I am aware that I could alleviate this issue by creating a "Response" object that has only an ArrayList<MyPojo> in it, but then I would have to create these somewhat useless objects for every single type I want to return.
I also looked online at JacksonInFiveMinutes but had a terrible time understanding the stuff about Map<A,B> and how it relates to my issue. If you cannot tell, I'm entirely new to Java and come from an Obj-C background. They specifically mention:
In addition to binding to POJOs and "simple" types, there is one additional variant: that of binding to generic (typed) containers. This case requires special handling due to so-called Type Erasure (used by Java to implement generics in a somewhat backwards compatible way), which prevents you from using something like Collection.class (which does not compile).
So if you want to bind data into a Map<String,User> you will need to use:
Map<String,User> result = mapper.readValue(src, new TypeReference<Map<String,User>>() { });
How can I deserialize directly to ArrayList?
You can deserialize directly to a list by using the TypeReference wrapper. An example method:
public static <T> T fromJSON(final TypeReference<T> type,
                             final String jsonPacket) {
    T data = null;
    try {
        data = new ObjectMapper().readValue(jsonPacket, type);
    } catch (Exception e) {
        // Handle the problem
    }
    return data;
}
And is used thus:
final String json = "";
Set<POJO> properties = fromJSON(new TypeReference<Set<POJO>>() {}, json);
TypeReference Javadoc
Another way is to use an array as a type, e.g.:
ObjectMapper objectMapper = new ObjectMapper();
MyPojo[] pojos = objectMapper.readValue(json, MyPojo[].class);
This way you avoid all the hassle with the Type object, and if you really need a list you can always convert the array to a list by:
List<MyPojo> pojoList = Arrays.asList(pojos);
IMHO this is much more readable.
And to make it an actual list (one that can be modified; see the limitations of Arrays.asList()), just do the following:
List<MyPojo> mcList = new ArrayList<>(Arrays.asList(pojos));
This variant looks simpler and more elegant.
//import com.fasterxml.jackson.core.JsonProcessingException;
//import com.fasterxml.jackson.databind.ObjectMapper;
//import com.fasterxml.jackson.databind.type.CollectionType;
//import com.fasterxml.jackson.databind.type.TypeFactory;
//import java.util.List;
CollectionType typeReference =
TypeFactory.defaultInstance().constructCollectionType(List.class, Dto.class);
List<Dto> resultDto = objectMapper.readValue(content, typeReference);
This works for me.
@Test
public void cloneTest() {
    List<Part> parts = new ArrayList<Part>();
    Part part1 = new Part(1);
    parts.add(part1);
    Part part2 = new Part(2);
    parts.add(part2);
    try {
        ObjectMapper objectMapper = new ObjectMapper();
        String jsonStr = objectMapper.writeValueAsString(parts);
        List<Part> cloneParts = objectMapper.readValue(jsonStr, new TypeReference<ArrayList<Part>>() {});
    } catch (Exception e) {
        //fail("failed.");
        e.printStackTrace();
    }
    //TODO: Assert: compare both list values.
}
I also had the same problem. I have JSON which needs to be converted to an ArrayList.
Account looks like this:
Account {
    Person p;
    Related r;
}

Person {
    String Name;
    Address a;
}
All of the above classes have been annotated properly.
I have tried TypeReference>() {}, but it is not working.
It gives me an ArrayList, but the ArrayList contains a LinkedHashMap which contains some more LinkedHashMaps holding the final values.
My code is as follows:
public T unmarshal(String responseXML, String c)
{
    ObjectMapper mapper = new ObjectMapper();
    AnnotationIntrospector introspector = new JacksonAnnotationIntrospector();
    mapper.getDeserializationConfig().withAnnotationIntrospector(introspector);
    mapper.getSerializationConfig().withAnnotationIntrospector(introspector);
    try
    {
        this.targetclass = (T) mapper.readValue(responseXML, new TypeReference<ArrayList<T>>() {});
    }
    catch (JsonParseException e)
    {
        e.printStackTrace();
    }
    catch (JsonMappingException e) {
        e.printStackTrace();
    }
    catch (IOException e) {
        e.printStackTrace();
    }
    return this.targetclass;
}
I finally solved the problem: I am able to convert the list in the JSON string directly to an ArrayList as follows:
class JsonMarshallerUnmarshaller<T> {
    T targetClass;

    public ArrayList<T> unmarshal(String jsonString)
    {
        ObjectMapper mapper = new ObjectMapper();
        AnnotationIntrospector introspector = new JacksonAnnotationIntrospector();
        mapper.getDeserializationConfig().withAnnotationIntrospector(introspector);
        mapper.getSerializationConfig().withAnnotationIntrospector(introspector);
        JavaType type = mapper.getTypeFactory()
                .constructCollectionType(ArrayList.class, targetClass.getClass());
        try
        {
            ArrayList<T> temp = (ArrayList<T>) mapper.readValue(jsonString, type);
            return temp;
        }
        catch (JsonParseException e)
        {
            e.printStackTrace();
        }
        catch (JsonMappingException e) {
            e.printStackTrace();
        }
        catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
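For completeness, a hypothetical usage sketch of the helper above (Account is just the example class from this question; targetClass must hold an instance of the element type so that targetClass.getClass() yields the right class, and is assumed to be accessible here, e.g. via same-package access or a setter):
JsonMarshallerUnmarshaller<Account> helper = new JsonMarshallerUnmarshaller<>();
helper.targetClass = new Account();   // element type used by constructCollectionType
ArrayList<Account> accounts = helper.unmarshal(jsonString);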
