I'm creating an RDD in the first part of the application, then converting it to a list using rdd.collect().
But for some reason the list size comes out as 0 in the second part of the application, even though the RDD from which I'm creating the list is not empty. Even rdd.toArray() gives an empty list.
Below is my program.
public class Query5kPids implements Serializable {

    List<String> ListFromS3 = new ArrayList<String>();
    private final SparkConf conf;

    public Query5kPids(SparkConf conf) {
        this.conf = conf;
    }

    public static void main(String[] args) throws JSONException, IOException, InterruptedException, URISyntaxException {
        SparkConf conf = new SparkConf();
        conf.setAppName("Spark-Cassandra Integration");
        conf.set("spark.cassandra.connection.host", "12.16.193.19");
        conf.setMaster("yarn-cluster");
        SparkConf conf1 = new SparkConf().setAppName("SparkAutomation").setMaster("yarn-cluster");
        Query5kPids app1 = new Query5kPids(conf1);
        app1.run1(file); // 'file' is defined elsewhere in the full program
        Query5kPids app = new Query5kPids(conf);
        System.out.println("Both RDDs have been generated");
        app.run();
    }

    private void run() throws JSONException, IOException, InterruptedException {
        JavaSparkContext sc = new JavaSparkContext(conf);
        query(sc);
        sc.stop();
    }

    private void run1(File file) throws JSONException, IOException, InterruptedException {
        JavaSparkContext sc = new JavaSparkContext(conf);
        getData(sc, file);
        sc.stop();
    }

    private void getData(JavaSparkContext sc, File file) {
        JavaRDD<String> Data = sc.textFile(file.toString());
        System.out.println("RDD Count is " + Data.count());
        // here it prints some count value
        ListFromS3 = Data.collect();
        // ListFromS3 = Data.toArray();
    }

    private void query(JavaSparkContext sc) {
        System.out.println("RDD Count is " + ListFromS3.size());
        // Prints 0
        // So I can't convert the list back to an RDD
        JavaRDD<String> rddFromGz = sc.parallelize(ListFromS3);
    }
}
NOTE -> In the actual program, the RDD and the list are of type UserSetGet:
List<UserSetGet> ListFromS3 = new ArrayList<UserSetGet>();
JavaRDD<UserSetGet> Data = new ....
where UserSetGet is a POJO with setter and getter methods, and it is Serializable.
app1.run1 puts the RDD contents into app1.ListFromS3. Then you look at app.ListFromS3, which is empty. app1.ListFromS3 and app.ListFromS3 are fields on two different objects; setting one does not set the other.
I think you meant ListFromS3 to be static, so that it belongs to the Query5kPids class rather than to a particular instance. Like this:
static List<String> ListFromS3 = new ArrayList<String>();
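To see the instance-versus-static difference in isolation, here is a minimal sketch (plain Java, independent of Spark; the class and field names are made up for illustration):

import java.util.ArrayList;
import java.util.List;

class Holder {
    List<String> instanceList = new ArrayList<String>();       // one list per object
    static List<String> staticList = new ArrayList<String>();  // one list per class
}

public class FieldDemo {
    public static void main(String[] args) {
        Holder a = new Holder();
        Holder b = new Holder();
        a.instanceList.add("x");       // only 'a' sees this
        Holder.staticList.add("x");    // every Holder sees this
        System.out.println(b.instanceList.size());    // 0 -> fields on different objects are independent
        System.out.println(Holder.staticList.size()); // 1 -> the static field is shared
    }
}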
Before anything, the title doesn't convey what I really want to ask.
What I want to know is how I can build a map that, for several users, collects their data and groups it all together. I'm currently using two lists, one for the users' names and another for their works. I tried using map.put, but it kept overwriting the previous entry. What I'd like to obtain is as follows:
Desired output:
{user1=[work1, work2, work3], user2=[work1, work2], userN=[workN]}
Current output:
{[user1, user2, user3, user4]=[work1, work2, work3, work4, work5 (user1), work1 (user2), work1, work2, work3 (user3)]}
This is the code that I'm currently using to achieve the above.
private static Map<List<String>, List<String>> repositoriesUserData = new HashMap<>();
private static Set<String> collaboratorNames = new HashSet<>();

public static void main(String[] args) throws Exception {
    login();
    getCollabs(GITHUB_REPO_NAME);
    repositoriesUnderUser();
}

public GitManager(String AUTH, String USERNAME, String REPO_NAME) throws IOException {
    this.GITHUB_LOGIN = USERNAME;
    this.GITHUB_OAUTH = AUTH;
    this.GITHUB_REPO_NAME = REPO_NAME;
    this.githubLogin = new GitHubBuilder().withOAuthToken(this.GITHUB_OAUTH, this.GITHUB_LOGIN).build();
    this.userOfLogin = this.githubLogin.getUser(GITHUB_LOGIN);
}

public static void login() throws IOException {
    new GitManager(GIT_TOKEN, GIT_LOGIN, GITHUB_REPO_NAME);
    connect();
}

public static void connect() throws IOException {
    if (githubLogin.isCredentialValid()) {
        valid = true;
        githubLogin.connect(GITHUB_LOGIN, GITHUB_OAUTH);
        userOfLogin = githubLogin.getUser(GITHUB_LOGIN);
    }
}

public static String getCollabs(String repositoryName) throws IOException {
    GHRepository collaboratorsRepository = userOfLogin.getRepository(repositoryName);
    collaboratorNames = collaboratorsRepository.getCollaboratorNames();
    String collaborators = collaboratorNames.toString();
    System.out.println("Collaborators for the following Repository: " + repositoryName + "\nAre: " + collaborators);
    String out = "Collaborators for the following Repository: " + repositoryName + "\nAre: " + collaborators;
    return out;
}

public static List<String> fillList() {
    List<String> collaborators = new ArrayList<>();
    collaboratorNames.forEach(s -> collaborators.add(s));
    return collaborators;
}

public static String repositoriesUnderUser() throws IOException {
    GHUser user;
    List<String> names = new ArrayList<>();
    List<String> repoNames = new ArrayList<>();
    for (int i = 0; i < fillList().size(); i++) {
        user = githubLogin.getUser(fillList().get(i));
        Map<String, GHRepository> temp = user.getRepositories();
        names.add(user.getLogin());
        temp.forEach((c, b) -> repoNames.add(b.getName()));
    }
    repositoriesUserData.put(names, repoNames);
    System.out.println(repositoriesUserData);
    return "temporaryReturn";
}
All help is appreciated!
I'll give it a try (the code in the question is still not working for me):
If I understood correctly, you want a map that contains the repositories for each user.
Therefore I think repositoriesUserData should be a Map<String, List<String>>.
With that in mind, let's fill the map in each loop cycle, with the user from the list as key and the list of repository names as value.
The method would look like this (the temporary return is removed and replaced with void):
public static void repositoriesUnderUser() throws IOException {
    for (int i = 0; i < fillList().size(); i++) {
        GHUser user = githubLogin.getUser(fillList().get(i));
        Map<String, GHRepository> temp = user.getRepositories();
        repositoriesUserData.put(user.getLogin(),
                temp.values().stream().map(GHRepository::getName).collect(Collectors.toList()));
    }
}
Edit (a short explanation of what is happening in your code):
You are collecting all usernames into the local list names and all repository names into the local list repoNames.
Only at the end of the method do you put a new entry into your map repositoriesUserData.
That means at the end of the method you have added just one single entry to the map, where
key = all of the users
value = all of the repositories from all of the users (and because it is a list, if two users have the same repository, it is added twice)
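As an aside, if you ever need to add to the map one entry at a time instead of building each list up front, Map.computeIfAbsent avoids the overwriting that plain put causes; a minimal sketch (the user/work names are placeholders):

Map<String, List<String>> byUser = new HashMap<>();
byUser.computeIfAbsent("user1", k -> new ArrayList<>()).add("work1"); // creates the list on first use
byUser.computeIfAbsent("user1", k -> new ArrayList<>()).add("work2"); // reuses the existing list
byUser.computeIfAbsent("user2", k -> new ArrayList<>()).add("work1");
System.out.println(byUser); // {user1=[work1, work2], user2=[work1]}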
I have a CSV:
id,name,description,price,date,name,address
1,SuperCsv,Write csv file,1234.56,28/03/2016,amar,jp nagar
I want to read it and store it in a JSON file.
I have created two beans, Course (id, name, description, price, date) and Person (name, address).
When reading with the bean reader, I'm not able to set the person's address.
The (beautified) output is
Course [id=1,
name=SuperCsv,
description=Write csv file,
price=1234.56,
date=Mon Mar 28 00:00:00 IST 2016,
person=[
Person [name=amar, address=null],
Person [name=null, address=jpnagar]
]
]
I want the address to be set along with the name.
My code:
public static void readCsv(String csvFileName) throws IOException {
    ICsvBeanReader beanReader = null;
    try {
        beanReader = new CsvBeanReader(new FileReader(csvFileName), CsvPreference.STANDARD_PREFERENCE);
        // the header elements are used to map the values to the bean (names must match)
        final String[] header = beanReader.getHeader(true);
        final CellProcessor[] processors = getProcessors();
        final String[] fieldMapping = new String[header.length];
        for (int i = 0; i < header.length; i++) {
            if (i < 5) {
                // normal mappings
                fieldMapping[i] = header[i];
            } else {
                // attribute mappings
                fieldMapping[i] = "addAttribute";
            }
        }
        ObjectMapper mapper = new ObjectMapper();
        Course course;
        List<Course> courseList = new ArrayList<Course>();
        while ((course = beanReader.read(Course.class, fieldMapping, processors)) != null) {
            // process course
            System.out.println(course);
            courseList.add(course);
        }
    } finally {
        if (beanReader != null) {
            beanReader.close();
        }
    }
}
private static CellProcessor[] getProcessors() {
    final CellProcessor parsePerson = new CellProcessorAdaptor() {
        public Object execute(Object value, CsvContext context) {
            return new Person((String) value, null);
        }
    };
    final CellProcessor parsePersonAddress = new CellProcessorAdaptor() {
        public Object execute(Object value, CsvContext context) {
            return new Person(null, (String) value);
        }
    };
    return new CellProcessor[] {
            new ParseInt(),
            new NotNull(),
            new Optional(),
            new ParseDouble(),
            new ParseDate("dd/MM/yyyy"),
            new Optional(parsePerson),
            new Optional(parsePersonAddress)
    };
}
SuperCSV is the first parser I have seen that lets you create an object within an object.
For what you are wanting, you can try Apache Commons CSV or openCSV (CsvToBean) to do the mapping, but for that to work you need the setters of the inner class (setName, setAddress) on the outer class so that CsvToBean can pick them up. That may or may not work.
What I normally tell people is to have a plain POJO that has all the fields in the CSV, a data transfer object. Let the parser create that, then use a utility/builder class to convert the plain POJO into the nested POJO you want.
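A minimal sketch of that approach, assuming a flat CourseRow DTO and a converter (all names here, including the Course constructor and setPerson, are hypothetical, not from the question):

// Flat DTO mirroring the CSV columns one-to-one.
public class CourseRow {
    private int id;
    private String name;
    private String description;
    private double price;
    private Date date;
    private String personName;
    private String personAddress;
    // getters and setters omitted for brevity
}

// Utility that converts the flat row into the nested object graph.
public static Course toCourse(CourseRow row) {
    Course course = new Course(row.getId(), row.getName(),
            row.getDescription(), row.getPrice(), row.getDate());
    course.setPerson(new Person(row.getPersonName(), row.getPersonAddress()));
    return course;
}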
I am creating a Spark job in Java. Here is my code.
I am trying to filter records from a CSV file. The header contains the fields OID, COUNTRY_NAME, ......
Instead of just filtering based on s.contains("CANADA"), I would like to be more specific: I want to filter based on COUNTRY_NAME.equals("CANADA").
Any thoughts on how I can do this?
public static void main(String[] args) {
    String gaimFile = "hdfs://xx.yy.zz.com/sandbox/data/acc/mydata";
    SparkConf conf = new SparkConf().setAppName("Filter App");
    JavaSparkContext sc = new JavaSparkContext(conf);
    try {
        JavaRDD<String> gaimData = sc.textFile(gaimFile);
        JavaRDD<String> canadaOnly = gaimData.filter(new Function<String, Boolean>() {
            private static final long serialVersionUID = -4438640257249553509L;
            public Boolean call(String s) {
                // My file is a CSV with header OID, COUNTRY_NAME, .....
                // here, instead of just saying s.contains,
                // I would like to be more specific and say
                // if COUNTRY_NAME.equals("CANADA")
                return s.contains("CANADA");
            }
        });
    }
    catch (Exception e) {
        System.out.println("ERROR: G9 MatchUp Failed");
    }
    finally {
        sc.close();
    }
}
You will have to map your values into a custom class first:
rdd.map(line => ConvertToCountry(line))
   .filter(country => country.countryName == "CANADA")
class Country {
    ...ctor that takes an array and fills properties...
    ...properties for each field from the csv...
}
ConvertToCountry(line: String) {
    return new Country(line.split(','))
}
The above is a combination of Scala and pseudocode, but you should get the point.
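In Java, the same idea can also stay inside the question's existing filter; a minimal sketch, assuming COUNTRY_NAME is the second column (the index is an assumption, adjust it to your header):

final int COUNTRY_NAME_INDEX = 1; // assumption: COUNTRY_NAME is the second column

JavaRDD<String> canadaOnly = gaimData.filter(new Function<String, Boolean>() {
    public Boolean call(String s) {
        String[] columns = s.split(",");
        // compare only the COUNTRY_NAME column; the header row falls out
        // naturally because "COUNTRY_NAME" never equals "CANADA"
        return columns.length > COUNTRY_NAME_INDEX
                && columns[COUNTRY_NAME_INDEX].trim().equals("CANADA");
    }
});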
I am a beginner at big data. First I want to try out how MapReduce works with HBase. The scenario is summing the field uas in my HBase table with MapReduce, grouped by the date that forms the first part of the row key. Here is my table:
Hbase::Table - test
ROW COLUMN+CELL
10102010#1 column=cf:nama, timestamp=1418267197429, value=jonru
10102010#1 column=cf:quiz, timestamp=1418267197429, value=\x00\x00\x00d
10102010#1 column=cf:uas, timestamp=1418267197429, value=\x00\x00\x00d
10102010#1 column=cf:uts, timestamp=1418267197429, value=\x00\x00\x00d
10102010#2 column=cf:nama, timestamp=1418267180874, value=jonru
10102010#2 column=cf:quiz, timestamp=1418267180874, value=\x00\x00\x00d
10102010#2 column=cf:uas, timestamp=1418267180874, value=\x00\x00\x00d
10102010#2 column=cf:uts, timestamp=1418267180874, value=\x00\x00\x00d
10102012#1 column=cf:nama, timestamp=1418267156542, value=jonru
10102012#1 column=cf:quiz, timestamp=1418267156542, value=\x00\x00\x00\x0A
10102012#1 column=cf:uas, timestamp=1418267156542, value=\x00\x00\x00\x0A
10102012#1 column=cf:uts, timestamp=1418267156542, value=\x00\x00\x00\x0A
10102012#2 column=cf:nama, timestamp=1418267166524, value=jonru
10102012#2 column=cf:quiz, timestamp=1418267166524, value=\x00\x00\x00\x0A
10102012#2 column=cf:uas, timestamp=1418267166524, value=\x00\x00\x00\x0A
10102012#2 column=cf:uts, timestamp=1418267166524, value=\x00\x00\x00\x0A
My code is as follows:
public class TestMapReduce {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration config = HBaseConfiguration.create();
        Job job = new Job(config, "Test");
        job.setJarByClass(TestMapReduce.TestMapper.class);
        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false);
        TableMapReduceUtil.initTableMapperJob(
                "test",
                scan,
                TestMapReduce.TestMapper.class,
                Text.class,
                IntWritable.class,
                job);
        TableMapReduceUtil.initTableReducerJob(
                "test",
                TestReducer.class,
                job);
        job.waitForCompletion(true);
    }

    public static class TestMapper extends TableMapper<Text, IntWritable> {

        @Override
        protected void map(ImmutableBytesWritable rowKey, Result columns, Mapper.Context context) throws IOException, InterruptedException {
            System.out.println("mulai mapping");
            try {
                // get row key
                String inKey = new String(rowKey.get());
                // get new key having date only
                String onKey = new String(inKey.split("#")[0]);
                // get value of the uas column
                byte[] bUas = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("uas"));
                String sUas = new String(bUas);
                Integer uas = new Integer(sUas);
                // emit date and uas values
                context.write(new Text(onKey), new IntWritable(uas));
            } catch (RuntimeException ex) {
                ex.printStackTrace();
            }
        }
    }

    public class TestReducer extends TableReducer {

        public void reduce(Text key, Iterable values, Reducer.Context context) throws IOException, InterruptedException {
            try {
                int sum = 0;
                for (Object test : values) {
                    System.out.println(test.toString());
                    sum += Integer.parseInt(test.toString());
                }
                Put inHbase = new Put(key.getBytes());
                inHbase.add(Bytes.toBytes("cf"), Bytes.toBytes("sum"), Bytes.toBytes(sum));
                context.write(null, inHbase);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
I got errors like these:
Exception in thread "main" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:451)
at org.apache.hadoop.util.Shell.run(Shell.java:424)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:656)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:745)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:728)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:348)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
at TestMapReduce.main(TestMapReduce.java:97)
Java Result: 1
Help me please :)
Let's look at this part of your code:
byte[] bUas = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("uas"));
String sUas = new String(bUas);
For the current key, you are trying to get the value of the column uas from the column family cf. HBase is a non-relational DB, so it is entirely possible that this key has no value for that column. In that case, the getValue method returns null. The String constructor that accepts a byte[] as input can't handle null values, so it throws a NullPointerException. A quick fix would look like this:
byte[] bUas = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("uas"));
String sUas = bUas == null ? "" : new String(bUas);
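As an aside, the shell output in the question shows the uas values stored as 4-byte binary integers (\x00\x00\x00d is 100), so going through new String(bUas) and new Integer(sUas) would not parse even when the value is present. If the values were written with Bytes.toBytes(int), the usual decoder is Bytes.toInt; a hedged sketch:

byte[] bUas = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("uas"));
int uas = (bUas == null) ? 0 : Bytes.toInt(bUas); // decode the 4-byte integer directly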
Which is the best way to get distributed cache data?
public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    ArrayList<String> globalFreq = new ArrayList<String>();

    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
        Path getPath = new Path(cacheFiles[0].getPath());
        BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
        String setupData = null;
        while ((setupData = bf.readLine()) != null) {
            String[] parts = setupData.split(" ");
            globalFreq.add(parts[0]);
        }
    }

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Accessing "globalFreq" data and doing further processing
    }
}
OR
public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    URI[] cacheFiles;
    FileSystem fs;

    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        fs = FileSystem.get(conf);
        cacheFiles = DistributedCache.getCacheFiles(conf);
    }

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        ArrayList<String> globalFreq = new ArrayList<String>();
        Path getPath = new Path(cacheFiles[0].getPath());
        BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
        String setupData = null;
        while ((setupData = bf.readLine()) != null) {
            String[] parts = setupData.split(" ");
            globalFreq.add(parts[0]);
        }
    }
}
So if we do it like code 2, does that mean that, say we have 5 map tasks, every map task reads the same copy of the data? Written like this, the data is read over and over for each map call, am I right (5 times)?
Code 1: since the reading is written in setup, the data is read once and the global data is then accessed in map.
Which is the right way of using the distributed cache?
Do as much as you can in the setup method: it will be called once by each mapper, and its results are then available for every record that is passed to the mapper. Parsing your data for each record is overhead you can avoid, since nothing there depends on the key, value, and context variables you receive in the map method.
The setup method is called once per map task, but map is called for each record processed by that task (which can clearly be a very high number).
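For reference, a cleaned-up sketch of code 1 under the same assumptions (same deprecated DistributedCache API as in the question; the only additions are the @Override annotations and closing the reader, which both snippets skip):

public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final ArrayList<String> globalFreq = new ArrayList<String>();

    @Override
    protected void setup(Context context) throws IOException {
        // Runs once per map task: parse the cached file here.
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
        Path getPath = new Path(cacheFiles[0].getPath());
        BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
        try {
            String setupData;
            while ((setupData = bf.readLine()) != null) {
                globalFreq.add(setupData.split(" ")[0]);
            }
        } finally {
            bf.close(); // release the stream once the task-wide data is loaded
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Runs once per record: only look data up in globalFreq, no file I/O here.
    }
}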