I'm using Spark Streaming with Spark MLlib to evaluate a naive Bayes model. I've reached a point where I can't go any further, because I can't transform the JavaPairDStream into an RDD to calculate the accuracy. The predictions and labels are stored in this JavaPairDStream, but I want to go through each pair and compare them to calculate the accuracy.
I'll post my code to make my question clearer. The code raises an exception in the part that calculates the accuracy ("The operator / is undefined for the argument type(s) JavaDStream, double"), because that approach only works with a JavaPairRDD, not a DStream. I need help calculating the accuracy for a JavaPairDStream.
Edit: I edited the code. My problem now is how to read the accuracy value, which is a JavaDStream, and then accumulate it for each batch of data so I get the accuracy over all the data (a sketch of one possible approach follows the code).
public static JSONArray testSparkStreaming(){
    SparkConf sparkConf = new SparkConf().setAppName("My app").setMaster("local[2]").set("spark.driver.allowMultipleContexts", "true");
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.milliseconds(500));

    String savePath = "path to saved model";
    final NaiveBayesModel savedModel = NaiveBayesModel.load(jssc.sparkContext().sc(), savePath);

    JavaDStream<String> data = jssc.textFileStream("path to CSV file");

    JavaDStream<LabeledPoint> testData = data.map(new Function<String, LabeledPoint>() {
        public LabeledPoint call(String line) throws Exception {
            List<String> featureList = Arrays.asList(line.trim().split(","));
            double[] points = new double[featureList.size() - 1];
            double classLabel = Double.parseDouble(featureList.get(featureList.size() - 1));
            for (int i = 0; i < featureList.size() - 1; i++) {
                points[i] = Double.parseDouble(featureList.get(i));
            }
            return new LabeledPoint(classLabel, Vectors.dense(points));
        }
    });

    JavaPairDStream<Double, Double> predictionAndLabel = testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
        public Tuple2<Double, Double> call(LabeledPoint p) {
            return new Tuple2<Double, Double>(savedModel.predict(p.features()), p.label());
        }
    });

    JavaDStream<Long> accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
        public Boolean call(Tuple2<Double, Double> pl) throws JSONException {
            return pl._1().equals(pl._2());
        }
    }).count();

    jssc.start();
    jssc.awaitTermination();

    System.out.println("*************");
    JSONArray jsonArray = new JSONArray();
    JSONObject obj = new JSONObject();
    jsonArray.put(obj);
    obj = new JSONObject();
    obj.put("Accuracy", accuracy*100 + "%");
    jsonArray.put(obj);
    return jsonArray;
}
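A minimal sketch of one possible approach: register a foreachRDD on predictionAndLabel before jssc.start() and keep running totals in driver-side accumulators. This assumes Spark 2.x, where SparkContext.longAccumulator is available; the accumulator names and the printout are illustrative.
// Extra imports needed: org.apache.spark.api.java.JavaPairRDD,
// org.apache.spark.api.java.function.VoidFunction, org.apache.spark.util.LongAccumulator

// Running totals across all batches.
final LongAccumulator correctCount = jssc.sparkContext().sc().longAccumulator("correct");
final LongAccumulator totalCount = jssc.sparkContext().sc().longAccumulator("total");

// Must be registered before jssc.start().
predictionAndLabel.foreachRDD(new VoidFunction<JavaPairRDD<Double, Double>>() {
    public void call(JavaPairRDD<Double, Double> rdd) {
        totalCount.add(rdd.count());
        correctCount.add(rdd.filter(new Function<Tuple2<Double, Double>, Boolean>() {
            public Boolean call(Tuple2<Double, Double> pl) {
                return pl._1().equals(pl._2());
            }
        }).count());
        if (totalCount.value() > 0) {
            double accuracySoFar = 100.0 * correctCount.value() / totalCount.value();
            System.out.println("Accuracy over all batches so far: " + accuracySoFar + "%");
        }
    }
});
The accumulator updates happen on the driver (rdd.count() is an action), so the totals stay consistent across batches.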
I originally created a key-value database with Xodus, which produced a small 2GB database:
public static void main(String[] args) throws Exception {
    if (args.length != 2) {
        throw new Exception("Argument missing. Current number of arguments: " + args.length);
    }
    long offset = Long.parseLong(args[0]);
    long chunksize = Long.parseLong(args[1]);

    Path pathBabelNet = Paths.get("/mypath/BabelNet-API-3.7/config");
    BabelNetLexicalizationDataSource dataSource = new BabelNetLexicalizationDataSource(pathBabelNet);
    Map<String, List<String>> data = new HashMap<String, List<String>>();
    data = dataSource.getDataChunk(offset, chunksize);

    jetbrains.exodus.env.Environment env = Environments.newInstance(".myAppData");
    final Transaction txn = env.beginTransaction();
    Store store = env.openStore("xodus-lexicalizations", StoreConfig.WITHOUT_DUPLICATES, txn);

    for (Map.Entry<String, List<String>> entry : data.entrySet()) {
        String key = entry.getKey();
        String value = entry.getValue().get(0);
        store.put(txn, StringBinding.stringToEntry(key), StringBinding.stringToEntry(value));
    }

    txn.commit();
    env.close();
}
I used a bash script to do this in chunks:
#!/bin/bash
START_TIME=$SECONDS
chunksize=50000
for ((offset=0; offset<165622128;))
do
    echo $offset;
    java -Xmx10g -jar /path/to/jar.jar $offset $chunksize
    offset=$((offset+(chunksize*12)))
done
ELAPSED_TIME=$(($SECONDS - $START_TIME))
echo $ELAPSED_TIME;
Now I changed it so it is relational:
public static void main(String[] args) throws Exception {
    if (args.length != 2) {
        throw new Exception("Argument missing. Current number of arguments: " + args.length);
    }
    long offset = Long.parseLong(args[0]);
    long chunksize = Long.parseLong(args[1]);

    Path pathBabelNet = Paths.get("/mypath/BabelNet-API-3.7/config");
    BabelNetLexicalizationDataSource dataSource = new BabelNetLexicalizationDataSource(pathBabelNet);
    Map<String, List<String>> data = new HashMap<String, List<String>>();
    data = dataSource.getDataChunk(offset, chunksize);

    PersistentEntityStore store = PersistentEntityStores.newInstance("lexicalizations-test");
    final StoreTransaction txn = store.beginTransaction();

    Entity synsetID;
    Entity lexicalization;
    String id;
    for (Map.Entry<String, List<String>> entry : data.entrySet()) {
        String key = entry.getKey();
        String value = entry.getValue().get(0);
        synsetID = txn.newEntity("SynsetID");
        synsetID.setProperty("synsetID", key);
        lexicalization = txn.newEntity("Lexicalization");
        lexicalization.setProperty("lexicalization", value);
        lexicalization.addLink("synsetID", synsetID);
        synsetID.addLink("lexicalization", lexicalization);
        txn.flush();
    }

    txn.commit();
}
This created a file of over 17GB, and it only stopped because I ran out of memory. I understand it will be larger because it has to store the links, among other things, but ten times bigger? What am I doing wrong?
For some reason removing the txn.flush() fixes everything. Now it's just 5.5GB.
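For reference, here is a minimal sketch of the amended loop: the per-entry txn.flush() is gone, and if intermediate flushes are still wanted to bound the transaction size, flushing only every N entries is one option (the batch size below is an arbitrary illustrative value):
int written = 0;
for (Map.Entry<String, List<String>> entry : data.entrySet()) {
    String key = entry.getKey();
    String value = entry.getValue().get(0);

    Entity synsetID = txn.newEntity("SynsetID");
    synsetID.setProperty("synsetID", key);

    Entity lexicalization = txn.newEntity("Lexicalization");
    lexicalization.setProperty("lexicalization", value);

    lexicalization.addLink("synsetID", synsetID);
    synsetID.addLink("lexicalization", lexicalization);

    // Flush in coarse batches instead of after every single entry.
    if (++written % 100000 == 0) {
        txn.flush();
    }
}
txn.commit();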
I want to create a graph that has a source; that source is linked to a broadcast which fans out through two flows, and then the outputs are zipped into a sink.
I did almost everything, but I have two problems:
The builder is not accepting my fan-in shape.
I am providing a Sink, but a SinkShape is required and I don't know how to get one.
public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("test");
    ActorMaterializer materializer = ActorMaterializer.create(system);

    Source<Integer, NotUsed> source = Source.range(1, 100);
    Flow<Integer, Integer, NotUsed> flow1 = Flow.of(Integer.class).map(i -> i + 1);
    Flow<Integer, Integer, NotUsed> flow2 = Flow.of(Integer.class).map(i -> i * 2);

    Sink<List<Integer>, CompletionStage<Integer>> sink = Sink.fold(0, ((arg1, arg2) -> {
        int value = arg1.intValue();
        for (Integer i : arg2) {
            value += i.intValue();
        }
        return value;
    }));

    RunnableGraph<Integer> graph = RunnableGraph.fromGraph(GraphDSL.create(
            (builder) -> {
                UniformFanOutShape fanOutShape = builder.add(Broadcast.create(2));
                UniformFanInShape fanInShape = builder.add(Zip.create());

                return builder.from(builder.add(source))
                        .viaFanOut(fanOutShape)
                        .via(builder.add(flow1))
                        .via(builder.add(flow2))
                        .viaFanIn(fanInShape)
                        .to(sink);
            }
    ));
}
Any help is appreciated.
You are failing to connect the out ports of the broadcast to the specific sub-flows (flow1 and flow2), and similarly you need to connect those flows (flow1 and flow2) to the specific in ports of the zip stage where they come together.
Also, I think it is not clear what you expect from the flow you are writing. The zip stage returns a tuple (int, int), so the output of zip is a stream of tuples. But your sink, which is supposed to be attached after zip, does not accept a stream of tuples, only a stream of Integers.
public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("test");
    ActorMaterializer materializer = ActorMaterializer.create(system);

    Source<Integer, NotUsed> source = Source.range(1, 100);
    Flow<Integer, Integer, NotUsed> flow1 = Flow.of(Integer.class).map(i -> i + 1);
    Flow<Integer, Integer, NotUsed> flow2 = Flow.of(Integer.class).map(i -> i * 2);

    // The zip stage emits Pair<Integer, Integer>, so the sink has to fold over pairs,
    // not over Integers.
    Sink<Pair<Integer, Integer>, CompletionStage<Integer>> sink =
            Sink.fold(0, (acc, pair) -> acc + pair.first() + pair.second());

    RunnableGraph<CompletionStage<Integer>> graph = RunnableGraph.fromGraph(
            GraphDSL.create(sink, (builder, sinkShape) -> {
                UniformFanOutShape<Integer, Integer> broadcast =
                        builder.add(Broadcast.<Integer>create(2));
                FanInShape2<Integer, Integer, Pair<Integer, Integer>> zip =
                        builder.add(Zip.<Integer, Integer>create());

                // source -> broadcast
                builder.from(builder.add(source)).viaFanOut(broadcast);
                // each broadcast out port -> its flow -> the matching zip in port
                builder.from(broadcast.out(0)).via(builder.add(flow1)).toInlet(zip.in0());
                builder.from(broadcast.out(1)).via(builder.add(flow2)).toInlet(zip.in1());
                // zip -> sink
                builder.from(zip.out()).to(sinkShape);
                return ClosedShape.getInstance();
            }));

    graph.run(materializer).thenAccept(total -> System.out.println("Sum: " + total));
}
You can check the link below for more examples.
https://github.com/Cs4r/akka-examples/blob/master/src/main/java/cs4r/labs/akka/examples/ConstructingGraphs.java
I want to process a large text with Spark; the order of the words in the text is important and has to be kept.
I tried the following approach, but it was not successful for large texts: it runs into an out-of-memory issue (obviously!).
hadoopConf.set("textinputformat.record.delimiter", "$$$$$");//read whole text
JavaRDD<String> texts = sparkContext.newAPIHadoopFile(inputFile, TextInputFormat.class, LongWritable.class, Text.class, hadoopConf).values().map((x) -> x.toString());
JavaRDD<Tuple2<String, Integer>> lines = texts.flatMap(new readDocuments());
public class readDocuments implements FlatMapFunction<String, Tuple2<String, Integer>> {

    private static final long serialVersionUID = 1L;

    // (line, index of the line in the original text)
    @Override
    public Iterator<Tuple2<String, Integer>> call(String text) throws Exception {
        List<Tuple2<String, Integer>> lines = new ArrayList<Tuple2<String, Integer>>();
        String[] tempLines = text.split("\n");
        for (int i = 0; i < tempLines.length; i++) {
            if (tempLines[i].length() > 0)
                lines.add(new Tuple2<String, Integer>(tempLines[i], i));
        }
        return lines.iterator();
    }
}
I also tried reading the files with sparkContext.wholeTextFiles(inputFile), but I hit the same problem!
Any idea or hint would be really appreciated.
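One alternative I am considering, as a sketch (it assumes the large text can simply be read line by line, i.e. no single record needs to hold the whole document; inputFile and sparkContext are the same as above): let Spark split the input into lines and use zipWithIndex() to remember each line's original position, so the order can be restored later.
// Extra imports needed: org.apache.spark.api.java.JavaRDD,
// org.apache.spark.api.java.JavaPairRDD

JavaRDD<String> rawLines = sparkContext.textFile(inputFile);

// Pair every line with its position in the file; the index preserves the original order.
JavaPairRDD<String, Long> linesWithIndex = rawLines
        .zipWithIndex()
        .filter(t -> t._1().length() > 0); // drop empty lines, the indices stay intact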
I'm using Android Studio to create a heat map on Google Maps. I have a database consisting of the following information:
    longitude    latitude     Electricity Energy Consumption
1   -77.08527    38.7347905   4.742112594
2   -19.03592    34.8081915   4.742112594
3   -74.04591    12.8815925   5.278542493
4   -32.05547    25.9549935   12.270006486
5   -49.06596    76.0283945   4.742112594
6   -63.08492    20.1017955   4.742112594
Is there any way to take these coordinates and magnitudes and plot a density map using Google Maps?
I've done a bit of research, and the Google API does allow the creation of a heatmap, but it only accepts a dataset containing coordinates. How would I reflect the energy consumption in certain areas?
This is the link to the site that tells you how to create a heatmap: https://developers.google.com/maps/documentation/android-api/utility/heatmap
I just need a push in the right direction to be able to implement this.
There is a method I thought I could use, but I didn't quite understand it, and I was hoping someone could explain how to use it and whether it's possible to apply it to my particular scenario.
This is the code from the site that implements the heatmap, taking only the coordinates into account:
List<LatLng> list = null;

// Get the data: latitude/longitude positions of police stations.
try {
    list = readItems(R.raw.police_stations);
} catch (JSONException e) {
    Toast.makeText(this, "Problem reading list of locations.", Toast.LENGTH_LONG).show();
}

// Create a heat map tile provider, passing it the latlngs of the police stations.
mProvider = new HeatmapTileProvider.Builder()
        .data(list)
        .build();

// Add a tile overlay to the map, using the heat map tile provider.
mOverlay = mMap.addTileOverlay(new TileOverlayOptions().tileProvider(mProvider));
}

private ArrayList<LatLng> readItems(int resource) throws JSONException {
    ArrayList<LatLng> list = new ArrayList<LatLng>();
    InputStream inputStream = getResources().openRawResource(resource);
    String json = new Scanner(inputStream).useDelimiter("\\A").next();
    JSONArray array = new JSONArray(json);
    for (int i = 0; i < array.length(); i++) {
        JSONObject object = array.getJSONObject(i);
        double lat = object.getDouble("lat");
        double lng = object.getDouble("lng");
        list.add(new LatLng(lat, lng));
    }
    return list;
}
This is the code from the site to change the dataset:
ArrayList<WeightedLatLng> data = new ArrayList<WeightedLatLng>();
mProvider.setData(data);
mOverlay.clearTileCache();
Instead of LatLng, you can use WeightedLatLng.
Your code should look like this:
List<WeightedLatLng> list = null;

// Get the data: latitude/longitude positions of police stations.
try {
    list = readItems(R.raw.police_stations);
} catch (JSONException e) {
    Toast.makeText(this, "Problem reading list of locations.", Toast.LENGTH_LONG).show();
}

// Create a heat map tile provider, passing it the weighted latlngs of the police stations.
mProvider = new HeatmapTileProvider.Builder()
        .weightedData(list)
        .build();

// Add a tile overlay to the map, using the heat map tile provider.
mOverlay = mMap.addTileOverlay(new TileOverlayOptions().tileProvider(mProvider));
}

private List<WeightedLatLng> readItems(int resource) throws JSONException {
    List<WeightedLatLng> list = new ArrayList<WeightedLatLng>();
    InputStream inputStream = getResources().openRawResource(resource);
    String json = new Scanner(inputStream).useDelimiter("\\A").next();
    JSONArray array = new JSONArray(json);
    for (int i = 0; i < array.length(); i++) {
        JSONObject object = array.getJSONObject(i);
        double lat = object.getDouble("lat");
        double lng = object.getDouble("lng");
        double magnitude = object.getDouble("mag");
        // WeightedLatLng takes a LatLng plus an intensity value.
        list.add(new WeightedLatLng(new LatLng(lat, lng), magnitude));
    }
    return list;
}
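To connect this to your dataset: the "mag" value plays the role of your Electricity Energy Consumption column, so if you build the list straight from your own rows instead of a JSON file, each entry would look something like this (the variable names are illustrative):
// lat, lng and energyConsumption come from one row of your table
list.add(new WeightedLatLng(new LatLng(lat, lng), energyConsumption));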
Hadoop n00b here.
I have installed Hadoop 2.6.0 on a server where I have stored twelve json files I want to perform MapReduce operations on. These files are large, ranging from 2-5 gigabytes each.
The structure of the JSON files is an array of JSON objects. Snippet of two objects below:
[{"campus":"Gløshaugen","building":"Varmeteknisk og Kjelhuset","floor":"4. etasje","timestamp":1412121618,"dayOfWeek":3,"hourOfDay":2,"latitude":63.419161638078066,"salt_timestamp":1412121602,"longitude":10.404867443910122,"id":"961","accuracy":56.083199914753536},{"campus":"Gløshaugen","building":"IT-Vest","floor":"2. etasje","timestamp":1412121612,"dayOfWeek":3,"hourOfDay":2,"latitude":63.41709424828986,"salt_timestamp":1412121602,"longitude":10.402167488838765,"id":"982","accuracy":7.315199988880896}]
I want to perform MapReduce operations based on the building and timestamp fields, at least in the beginning, until I get the hang of this. E.g., map-reduce the data where building equals a given parameter and timestamp is greater than X and less than Y. The relevant fields I need after the reduce phase are latitude and longitude.
I know there are different tools (Hive, HBase, Pig, Spark, etc.) you can use with Hadoop that might make this easier, but my boss wants an evaluation of the MapReduce performance of standalone Hadoop.
So far I have created the main class triggering the map and reduce classes, implemented what I believe is a start in the map class, but I'm stuck on the reduce class. Below is what I have so far.
public class Hadoop {

    public static void main(String[] args) throws Exception {
        try {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "maze");
            job.setJarByClass(Hadoop.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);

            job.setInputFormatClass(KeyValueTextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            Path inPath = new Path("hdfs://xxx.xxx.106.23:50070/data.json");
            FileInputFormat.addInputPath(job, inPath);

            boolean result = job.waitForCompletion(true);
            System.exit(result ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Mapper:
public class Map extends org.apache.hadoop.mapreduce.Mapper {

    private Text word = new Text();

    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
        try {
            JSONObject jo = new JSONObject(value.toString());
            String latitude = jo.getString("latitude");
            String longitude = jo.getString("longitude");
            long timestamp = jo.getLong("timestamp");
            String building = jo.getString("building");

            StringBuilder sb = new StringBuilder();
            sb.append(latitude);
            sb.append("/");
            sb.append(longitude);
            sb.append("/");
            sb.append(timestamp);
            sb.append("/");
            sb.append(building);
            sb.append("/");
            context.write(new Text(sb.toString()), value);
        } catch (JSONException e) {
            e.printStackTrace();
        }
    }
}
Reducer:
public class Reducer extends org.apache.hadoop.mapreduce.Reducer {

    private Text result = new Text();

    protected void reduce(Text key, Iterable<Text> values, org.apache.hadoop.mapreduce.Reducer.Context context) throws IOException, InterruptedException {
    }
}
UPDATE
private static String BUILDING;
private static int tsFrom;
private static int tsTo;

public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
    try {
        JSONArray ja = new JSONArray(key.toString());
        StringBuilder sb;
        for (int n = 0; n < ja.length(); n++) {
            JSONObject jo = ja.getJSONObject(n);
            String latitude = jo.getString("latitude");
            String longitude = jo.getString("longitude");
            int timestamp = jo.getInt("timestamp");
            String building = jo.getString("building");

            if (BUILDING.equals(building) && timestamp < tsTo && timestamp > tsFrom) {
                sb = new StringBuilder();
                sb.append(latitude);
                sb.append("/");
                sb.append(longitude);
                context.write(new Text(sb.toString()), value);
            }
        }
    } catch (JSONException e) {
        e.printStackTrace();
    }
}

@Override
public void configure(JobConf jobConf) {
    System.out.println("configure");
    BUILDING = jobConf.get("BUILDING");
    tsFrom = Integer.parseInt(jobConf.get("TSFROM"));
    tsTo = Integer.parseInt(jobConf.get("TSTO"));
}
This works for a small data set. Since I am working with LARGE JSON files, I get a Java heap space exception. Since I am not familiar with Hadoop, I'm having trouble understanding how MapReduce can read the data without running into an OutOfMemoryError.
If you simply want a list of LONG/LAT under the constraints building = something and timestamp within some range, this is a simple filter operation; you do not need a reducer for it. In the mapper you should check whether the current JSON record satisfies the condition, and only then write it out to the context. If it fails to satisfy the condition, you don't want it in the output.
The output should be LONG/LAT (no building/timestamp, unless you want them there as well).
If no reducer is present, the output of the mappers is the output of the job, which in your case is sufficient.
As for the code: your driver should pass the building ID and the timestamp range to the mappers through the job configuration. Anything you put there will be available to all of your mappers.
Configuration conf = new Configuration();
conf.set("Building", "123");
conf.set("TSFROM", "12300000000");
conf.set("TSTO", "12400000000");
Job job = new Job(conf);
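Since this is a map-only job, the driver should also disable the reduce phase explicitly:
job.setNumReduceTasks(0); // map output goes straight to the output format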
Your mapper class needs to implement JobConfigurable.configure; there you read the values from the configuration object into static variables:
private static String BUILDING;
private static Long tsFrom;
private static Long tsTo;

public void configure(JobConf job) {
    BUILDING = job.get("Building");
    tsFrom = Long.parseLong(job.get("TSFROM"));
    tsTo = Long.parseLong(job.get("TSTO"));
}
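Note that if you keep the new org.apache.hadoop.mapreduce.Mapper API used in your driver, the equivalent hook is Mapper.setup(Context) rather than JobConfigurable.configure; a minimal sketch of the same idea:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    BUILDING = conf.get("Building");
    tsFrom = Long.parseLong(conf.get("TSFROM"));
    tsTo = Long.parseLong(conf.get("TSTO"));
}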
Now, your map function needs to check:
if (BUILDING.equals(building) && timestamp < tsTo && timestamp > tsFrom) {
    sb = new StringBuilder();
    sb.append(latitude);
    sb.append("/");
    sb.append(longitude);
    // The value must be a Writable; any placeholder works here since there is no
    // reducer to aggregate it.
    context.write(new Text(sb.toString()), new IntWritable(1));
}
This means any rows belonging to other buildings, or falling outside the timestamp range, will not appear in the result.