Sorting in Apache Spark - java

I am new to Apache Spark. I am reading files from an HDFS directory and then filtering and ordering them based on a condition.
I have two files in an HDFS directory.
The first file contains data like below:
Name:xxxx,currenttime:[timestamp],urlvisited:[url]
The second file contains the following information:
Name:xxxx,currenttime :[timestamp],downloadfilename:[filename]
First I filter the data based on Name, then I split each line on commas, and then I order the data by the currenttime field.
So far I have tried:
import java.text.ParseException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;
public class SampleVisit {
public static void main(String[] args) {
final String name = args[0];
SparkConf sparkConf = new SparkConf().setAppName("sample");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
JavaRDD<String> lines = ctx.textFile("hdfs://localhost:9000/sample/*/",1);
JavaRDD<String> filterdata = lines.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String s) {
return s.contains("Name:" + name);
}
});
// Returning all other values as one field and currenttime as another field
JavaRDD<Tuple2<String,String>> stage2 = filterdata.map(new Function<String, Tuple2<String,String>>() {
public Tuple2<String, String> call(String s) throws ParseException {
String [] entries = s.split(",");
return new Tuple2<String, String>(entries[0] + "," + entries[2], entries[1]);
}
});
List<Tuple2<String,String>> sorted = stage2.takeOrdered(100, new CompareValues()) ;
JavaRDD<Tuple2<String,String>> finale = ctx.parallelize(sorted);
finale.coalesce(1, true).saveAsTextFile("hdfs://localhost:9000/sampleout");
}
}
And my CompareValues.java is shown below
import java.io.Serializable;
import java.util.Comparator;
import java.util.Date;
import scala.Tuple2;
public class CompareValues implements Comparator<Tuple2<String,String>>, Serializable {
@Override
public int compare(Tuple2<String, String> o1, Tuple2<String, String> o2) {
long first = Long.valueOf(o1._2);
long second = Long.valueOf(o2._2);
Date firstDate = new Date(first);
Date secondDate = new Date(second);
return secondDate.compareTo(firstDate);
}
}
When I run this with a Name value as the argument, everything runs as expected, but in the result the first file's values come out ordered and then the second file's values come out ordered, whereas I want the values from both files ordered together.
Can anyone help me with this?

I think your problem comes from coalescing your RDD with shuffle=true. This means the RDD is shuffled using a hash partitioner and each item is sent to its target partition. As a result, the items are ordered within each partition, but not globally. When you save the RDD to a file, each partition is written to a separate file to allow concurrent writes.
If you want the result to be ordered across all the partitions, you need a partitioner which guarantees that all the "close" items end up in the same partition.
Look at the Data Partitioning chapter here for further explanation.
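As a minimal sketch (not the asker's exact code; it reuses stage2 and assumes the value in _2 is an epoch timestamp, as the question implies), you could sort the whole RDD by the timestamp before writing, so the output is ordered globally regardless of which input file a line came from:
JavaRDD<Tuple2<String, String>> globallySorted = stage2.sortBy(
    new Function<Tuple2<String, String>, Long>() {
        @Override
        public Long call(Tuple2<String, String> t) {
            return Long.valueOf(t._2); // _2 holds the timestamp
        }
    },
    false, // descending, to match CompareValues
    1);    // one partition => a single ordered part file
globallySorted.saveAsTextFile("hdfs://localhost:9000/sampleout");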

Related

Hbase MapReduce: how to use custom class as value for the mapper and/or reducer?

I am trying to familiarize myself with Hadoop/HBase MapReduce jobs so that I can write them properly. Right now I have an HBase instance with a table called dns containing some DNS records. I made a simple unique-domains counter that outputs a file, and it worked. At the moment I only use IntWritable or Text, and I was wondering whether it's possible to use custom objects for my Mapper/Reducer. I tried to do it myself, but I'm getting
Error: java.io.IOException: Initialization of all the collectors failed. Error in last collector was :null
at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:415)
at org.apache.hadoop.mapred.MapTask.access$100(MapTask.java:81)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:1011)
at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:402)
... 9 more
Since I'm new to this, I don't actually know what to do. I'm guessing I have to implement one or more interfaces or extend an abstract class, but I can't find a proper example here or on the internet.
I tried to make a simple domains counter from my dns table, but using a class as a wrapper over an integer (for didactic purposes only). My Map class looks like this:
public class Map extends TableMapper<Text, MapperOutputValue> {
private static byte[] columnName = "fqdn".getBytes();
private static byte[] columnFamily = "d".getBytes();
public void map(ImmutableBytesWritable row, Result value, Context context)
throws InterruptedException, IOException {
String fqdn = new String(value.getValue(columnFamily, columnName));
Text key = new Text();
key.set(fqdn);
context.write(key, new MapperOutputValue(1));
}
}
The Reducer:
public class Reduce extends Reducer<Text, MapperOutputValue, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<MapperOutputValue> values, Context context)
throws IOException, InterruptedException {
int i = 0;
for (MapperOutputValue val : values) {
i += val.getCount();
}
context.write(key, new IntWritable(i));
}
}
And a part of my Driver/Main function:
TableMapReduceUtil.initTableMapperJob(
"dns",
scan,
Map.class,
Text.class,
MapperOutputValue.class,
job);
/* Set output parameters */
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setOutputFormatClass(TextOutputFormat.class);
As I said, MapperOutputValue is just a simple class that contains a private Integer, a constructor with a parameter, a getter and a setter. I also tried adding a toString method, but it still doesn't work.
So my question is: what's the best way to use custom classes as output of the mapper / input of the reducer? Also, let's say I want to use a class with multiple fields as the final output of the reducer. What should this class implement/extend? Is it a good idea, or should I stick to using "primitives" like IntWritable or Text?
Thanks!
MapperOutputValue should implement Writable so that it can be serialised between tasks in the MapReduce job. Replacing it with the class below should work:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class DomainCountWritable implements Writable {
private Text domain;
private IntWritable count;
public DomainCountWritable() {
this.domain = new Text();
this.count = new IntWritable(0);
}
public DomainCountWritable(Text domain, IntWritable count) {
this.domain = domain;
this.count = count;
}
public Text getDomain() {
return this.domain;
}
public IntWritable getCount() {
return this.count;
}
public void setDomain(Text domain) {
this.domain = domain;
}
public void setCount(IntWritable count) {
this.count = count;
}
public void readFields(DataInput in) throws IOException {
this.domain.readFields(in);
this.count.readFields(in);
}
public void write(DataOutput out) throws IOException {
this.domain.write(out);
this.count.write(out);
}
@Override
public String toString() {
return this.domain.toString() + "\t" + this.count.toString();
}
}
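For completeness, a hypothetical sketch of how the Mapper from the question could emit this Writable (class and column names are taken from the question; the driver would then pass DomainCountWritable.class as the value class to initTableMapperJob, and the Reducer's input value type would change accordingly):
public class Map extends TableMapper<Text, DomainCountWritable> {
    private static final byte[] columnName = "fqdn".getBytes();
    private static final byte[] columnFamily = "d".getBytes();

    public void map(ImmutableBytesWritable row, Result value, Context context)
            throws InterruptedException, IOException {
        String fqdn = new String(value.getValue(columnFamily, columnName));
        // Key is the domain; the value wraps the domain plus a count of 1
        context.write(new Text(fqdn), new DomainCountWritable(new Text(fqdn), new IntWritable(1)));
    }
}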

Get all the table items from DynamoDB table using Java High Level API

I implemented a scan operation on a DynamoDB table using DynamoDBMapper, but I'm not getting all the results. Scan returns a different number of items every time I run my program.
Code snippet :
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression();
List<Books> scanResult = mapper.scan(Books.class, scanExpression);
I looked into it and found out about the limit on the number of items a scan returns. But I couldn't find a way to get all the items from the table using the mapper! Is there a way to loop through all the items of the table? I have set enough heap memory for the JVM, so there won't be memory issues.
In Java, use the DynamoDBScanExpression without any filter:
// Change to your Table_Name (you can load dynamically from lambda env as well)
DynamoDBMapperConfig mapperConfig = new DynamoDBMapperConfig.Builder().withTableNameOverride(DynamoDBMapperConfig.TableNameOverride.withTableNameReplacement("Table_Name")).build();
DynamoDBMapper mapper = new DynamoDBMapper(client, mapperConfig);
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression();
// Change to your model class
List < ParticipantReport > scanResult = mapper.scan(ParticipantReport.class, scanExpression);
// Check the count and iterate the list and perform as desired.
scanResult.size();
The scan should return all the items.
The catch is that the returned collection is lazily loaded.
You need to iterate through the List, and when it has consumed all the items that were fetched, additional calls are made behind the scenes to bring in more items (until everything has been brought in).
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/JavaQueryScanORMModelExample.html
In that example it's:
List<Book> scanResult = mapper.scan(Book.class, scanExpression);
for (Book book : scanResult) {
System.out.println(book);
}
You need to iterate until LastEvaluatedKey is no longer returned. Check how it is done in one of the official examples from the SDK:
https://github.com/awslabs/aws-dynamodb-examples/blob/23837f36944f4166c56988452475edee99868166/src/main/java/com/amazonaws/codesamples/lowlevel/LowLevelQuery.java#L70
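For illustration, a minimal low-level sketch of that loop (the table name "Books" and the client variable are placeholders, not from the question; client is an AmazonDynamoDBClient):
ScanRequest scanRequest = new ScanRequest().withTableName("Books");
Map<String, AttributeValue> lastKey = null;
do {
    scanRequest.setExclusiveStartKey(lastKey);   // null on the first iteration
    ScanResult result = client.scan(scanRequest);
    for (Map<String, AttributeValue> item : result.getItems()) {
        System.out.println(item);                // process each item
    }
    lastKey = result.getLastEvaluatedKey();      // null once the table is exhausted
} while (lastKey != null);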
A little bit late, but
import java.util.HashMap;
import java.util.Map;
import java.util.List;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBQueryExpression;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.regions.Region;
import com.amazonaws.regions.Regions;
public final class LogFetcher {
static AmazonDynamoDBClient client = new AmazonDynamoDBClient();
static String tableName = "SystemLog";
public static List<SystemLog> findLogsForDeviceWithMacID(String macID) {
client.setRegion(Region.getRegion(Regions.EU_WEST_1));
DynamoDBMapper mapper = new DynamoDBMapper(client);
Map<String, AttributeValue> eav = new HashMap<String, AttributeValue>();
eav.put(":val1", new AttributeValue().withS(macID));
DynamoDBQueryExpression<SystemLog> queryExpression = new DynamoDBQueryExpression<SystemLog>()
.withKeyConditionExpression("parentKey = :val1")
.withExpressionAttributeValues(eav);
List<SystemLog> requestedLogs = mapper.query(SystemLog.class, queryExpression);
return requestedLogs;
}
}
And sample class
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBAttribute;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBHashKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBRangeKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBTable;
@DynamoDBTable(tableName="SystemLog")
public final class SystemLog {
public Integer pidValue;
public String uniqueId;
public String parentKey;
//DynamoDB
//Partition (hash) key
@DynamoDBHashKey(attributeName="parentKey")
public String getParentKey() { return parentKey; }
public void setParentKey(String parentKey) { this.parentKey = parentKey; }
//Range key
@DynamoDBRangeKey(attributeName="uniqueId")
public String getUniqueId() { return uniqueId; }
public void setUniqueId(String uniqueId) { this.uniqueId = uniqueId;}
@DynamoDBAttribute(attributeName="pidValue")
public Integer getPidValue() { return pidValue; }
public void setPidValue(Integer pidValue) { this.pidValue = pidValue; }
}
By default, the DynamoDBMapper#scan method returns a "lazy-loaded" collection. It initially returns only one page of results, and then makes a service call for the next page if needed. To obtain all the matching items, iterate over the paginated results collection.
However, PaginatedScanList comes with an out-of-the-box PaginatedScanList#loadAllResults method, which eagerly loads all results for the list.
NOTE: loadAllResults is not supported in ITERATION_ONLY mode.
List<Books> scanResult = mapper.scan(Books.class, new DynamoDBScanExpression());
scanResult.loadAllResults();//Eagerly loads all results for this list.
//Total results loaded into the list
System.out.println(scanResult.size());
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression();
List<Books> scanResult = new ArrayList<Books>(mapper.scan(Books.class, scanExpression));
This works: copying the lazy scan result into an ArrayList forces all the items to be fetched, and you get a plain list back.

How to convert an observableset to an observablelist

I am trying to set items on a TableView, but the setItems method expects an ObservableList, while I have an ObservableSet in my model. The FXCollections utility class does not have a method for creating an ObservableList from an ObservableSet. I tried casting, but that caused a ClassCastException (as expected).
Currently I am using this kind of code:
new ObservableListWrapper<E>(new ArrayList<E>(pojo.getObservableSet()));
And I have some problems with it:
Will editing this in the table update the underlying set as expected?
Is it the 'right' way of doing this?
So, in short, I need a style guide or best practice for converting between an observable set and an observable list, because I expect to be doing this a lot when building a JavaFX GUI.
Will editing this in the table update the underlying set as expected?
No, because you are making a copy of the set:
new ArrayList<E>(pojo.getObservableSet())
Is it the 'right' way of doing this?
I think the right way is not to do that. Sets are not Lists, and vice versa; both have specific constraints. For example, lists are ordered and sets contain no duplicate elements.
Moreover, neither FXCollections nor Bindings provides this kind of conversion.
I would like the collection to remain as a set to enforce uniqueness
I guess you could write a custom ObservableList. For example, Parent::children has similar behavior: it throws an IllegalArgumentException if a duplicate child is added. If you look at the source code, you will see that it is a VetoableListDecorator extension. You could write your own:
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javafx.collections.FXCollections;
import javafx.collections.ObservableList;
import com.sun.javafx.collections.VetoableListDecorator;
public class CustomObservableList<E> extends VetoableListDecorator<E> {
public CustomObservableList(ObservableList<E> decorated) {
super(decorated);
}
@Override
protected void onProposedChange(List<E> toBeAdded, int... indexes) {
for (E e : toBeAdded) {
if (contains(e)) {
throw new IllegalArgumentException("Duplicate element added");
}
}
}
}
class Test {
public static void main(String[] args) {
Object o1 = new Object();
Object o2 = new Object();
Set<Object> set = new HashSet<Object>();
set.add(o1);
CustomObservableList<Object> list = new CustomObservableList<Object>(FXCollections.observableArrayList(set));
list.add(o2);
list.add(o1); // throw Exception
}
}
Just in case someone stumbles over this question looking for a one-way conversion of an ObservableSet into an ObservableList, here is my solution. It doesn't support feeding data back into the set (which in my opinion wouldn't be nice anyway, since TableView has no concept of a value that cannot be changed), but it supports updates of the set and preserves the (in this case sorted) order.
package de.fluxparticle.lab;
import javafx.animation.KeyFrame;
import javafx.animation.Timeline;
import javafx.application.Application;
import javafx.beans.property.SimpleStringProperty;
import javafx.collections.FXCollections;
import javafx.collections.ObservableList;
import javafx.collections.ObservableSet;
import javafx.collections.SetChangeListener;
import javafx.scene.Scene;
import javafx.scene.control.TableColumn;
import javafx.scene.control.TableView;
import javafx.stage.Stage;
import javafx.util.Duration;
import java.util.Collections;
import java.util.Random;
import java.util.TreeSet;
import static javafx.collections.FXCollections.observableSet;
/**
* Created by sreinck on 23.01.17.
*/
public class Set2List extends Application {
private final ObservableSet<Integer> setModel = observableSet(new TreeSet<Integer>());
@Override
public void start(Stage primaryStage) throws Exception {
TableView<Integer> tableView = new TableView<>();
addColumn(tableView, "Number");
ObservableList<Integer> list = convertSetToList(setModel);
tableView.setItems(list);
Random rnd = new Random();
scheduleTask(Duration.millis(1000), () -> setModel.add(rnd.nextInt(10)));
primaryStage.setScene(new Scene(tableView, 800, 600));
primaryStage.setTitle("Set2List");
primaryStage.show();
}
private static void scheduleTask(Duration interval, Runnable task) {
Timeline timeline = new Timeline(new KeyFrame(interval, event -> task.run()));
timeline.setCycleCount(Timeline.INDEFINITE);
timeline.play();
}
private static ObservableList<Integer> convertSetToList(ObservableSet<Integer> set) {
ObservableList<Integer> list = FXCollections.observableArrayList(set);
set.addListener((SetChangeListener<Integer>) change -> {
if (change.wasAdded()) {
Integer added = change.getElementAdded();
int idx = -Collections.binarySearch(list, added)-1;
list.add(idx, added);
} else {
Integer removed = change.getElementRemoved();
int idx = Collections.binarySearch(list, removed);
list.remove(idx);
}
});
return list;
}
private static void addColumn(TableView<Integer> tableView, String text) {
TableColumn<Integer, String> column = new TableColumn<>(text);
column.setCellValueFactory(param -> new SimpleStringProperty(param.getValue().toString()));
tableView.getColumns().add(column);
}
public static void main(String[] args) {
launch(args);
}
}

Java 8 Lambda stack bleed

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;
public class StackBleed {
public static int check(String s) {
if (s.equals("lambda")) {
throw new IllegalArgumentException();
}
return s.length();
}
@SuppressWarnings({ "rawtypes", "unchecked" })
public static void main(String[] args) {
// List lengths = new ArrayList();
List<String> argList = Arrays.asList(args);
Stream lengths2 = argList.stream().map((String name) -> check(name));
}
}
So I was checking out this article http://www.takipiblog.com/2014/03/25/the-dark-side-of-lambda-expressions-in-java-8/ and wrote a similar class, but the JDK 8 approach didn't yield the expected exception. I was wondering if they changed something in JDK 8u5?
You're only calling a non-terminal (intermediate) operation on the stream, so your code doesn't consume any data from it. All it does is say: "when a terminal operation is called, map the strings using the check() method".
Use
List<Integer> transformed =
argList.stream().map((String name) -> check(name)).collect(Collectors.toList());
for example; the call to collect(), which is a terminal operation, will then trigger the iteration over the stream and the transformation of its elements.
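Just to illustrate, any other terminal operation (forEach, count, toArray, ...) also triggers the pipeline, so with "lambda" among the arguments something like this would throw the IllegalArgumentException (reusing argList and check from the question):
argList.stream()
       .map(StackBleed::check)
       .forEach(System.out::println); // terminal operation => check() actually runs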

Java - List sorting doesn't work

I'm trying to sort a HashMap by sorting its keys, but it doesn't work.
The sorting criterion is the length of the list that is the HashMap's value.
See the code below, with a unit test.
Class:
package com.fabri.interpreter.util;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import com.fabri.interpreter.VerbExpr;
import com.fabri.interpreter.ObjectExpr;
public class Environment {
private HashMap<VerbExpr, List<ObjectExpr>> map = new HashMap<VerbExpr, List<ObjectExpr>>();
public List<ObjectExpr> eval(VerbExpr verb) {
return map.get(verb);
}
public void put(VerbExpr verb, ObjectExpr words) {
List<ObjectExpr> values;
if(map.get(verb) == null)
values = new ArrayList<ObjectExpr>();
else
values = map.get(verb);
values.add(words);
map.put(verb, values);
}
public HashMap<VerbExpr, List<ObjectExpr>> getMap() {
return map;
}
public void sort() {
List<VerbExpr> keys = new ArrayList<VerbExpr>(map.keySet());
Collections.sort(keys, new Comparator<VerbExpr>() {
@Override
public int compare(VerbExpr verb1, VerbExpr verb2) {
return map.get(verb1).size()-map.get(verb2).size();
}
});
HashMap<VerbExpr, List<ObjectExpr>> sortedMap = new HashMap<VerbExpr, List<ObjectExpr>>();
for(VerbExpr verb : keys) {
sortedMap.put(verb, map.get(verb));
}
map = sortedMap;
}
}
Testing class:
package com.fabri.interpreter.util;
import static org.junit.Assert.assertTrue;
import java.util.ArrayList;
import java.util.List;
import org.junit.Before;
import org.junit.Test;
import com.fabri.interpreter.ObjectExpr;
import com.fabri.interpreter.VerbExpr;
import com.fabri.interpreter.WordExpr;
public class TestEnvironment {
private Object[] verbExprs;
@Before
public void setUp() {
Environment env = new Environment();
List<WordExpr> words1 = new ArrayList<WordExpr>();
words1.add(new WordExpr("american"));
words1.add(new WordExpr("italian"));
env.put(new VerbExpr("was"), new ObjectExpr(words1));
List<WordExpr> words2 = new ArrayList<WordExpr>();
words2.add(new WordExpr("zero"));
words2.add(new WordExpr("one"));
words2.add(new WordExpr("two"));
env.put(new VerbExpr("is"), new ObjectExpr(words2));
env.sort();
verbExprs = env.getMap().keySet().toArray();
}
@Test
public void testEnvironment() {
assertTrue(((VerbExpr)verbExprs[0]).equals("is"));
assertTrue(((VerbExpr)verbExprs[1]).equals("was"));
}
}
Plain HashMaps are inherently unordered. You can't sort them, or assume anything about the order in which entries are retrieved when iterating over them. Options:
Use a TreeMap if you want to sort by key.
Use a LinkedHashMap if you want to preserve insertion order (which is what your sort method looks like it assumes); see the sketch after this list.
Create a list of key/value pairs and sort that instead.
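As a minimal sketch of the LinkedHashMap option (assuming the Environment class from the question), replacing the HashMap built at the end of sort() with a LinkedHashMap makes the map iterate in the sorted key order, and getMap()'s return type still works because LinkedHashMap extends HashMap:
// Inside Environment.sort(), after sorting the keys; needs: import java.util.LinkedHashMap;
HashMap<VerbExpr, List<ObjectExpr>> sortedMap = new LinkedHashMap<VerbExpr, List<ObjectExpr>>();
for (VerbExpr verb : keys) {
    sortedMap.put(verb, map.get(verb));
}
map = sortedMap;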
As Jon said, I would suggest keeping an ordered list of keys and using it to access the inherently unordered HashMap.
