Convert RDD List to RDD of individual element in spark - java

I have an input RDD (JavaRDD<List<String>>) and I want to convert it to a JavaRDD<String> as output.
Each element of each list in the input RDD should become an individual element in the output RDD.
How can I achieve this in Java?
JavaRDD<List<String>> input; //suppose rdd length is 2
input.saveAsTextFile(...)
output:
[a,b] [c,d]
what i want:
a b c d

Convert it into a DataFrame and use the built-in explode function.

As a workaround, I used the code snippet below.
It concatenates the elements of each list with the separator '\n' and then saves the RDD using the standard Spark API:
inputRdd.map(new Function<List<String>, String>() {
    @Override
    public String call(List<String> scores) throws Exception {
        int size = scores.size();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < size; i++) {
            sb.append(scores.get(i));
            // no trailing newline after the last element
            if (i != size - 1) {
                sb.append("\n");
            }
        }
        return sb.toString();
    }
}).saveAsTextFile("/tmp/data");

If the RDD type is RDD[List[String]], you can just do this (Scala):
val newrdd = rdd.flatMap(line => line)
Each element of the inner lists becomes a separate element, and so a separate line, in the new RDD.
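Since the question asks for Java, here is a minimal plain-Java sketch of the same flattening, using java.util.stream rather than Spark so that it runs without a cluster. The class and method names are hypothetical. On a JavaRDD<List<String>>, the corresponding call should be input.flatMap(list -> list.iterator()) in Spark 2.x and later; Spark 1.x expects the function to return an Iterable instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FlattenSketch {
    // Flattens a list of lists into a single list, preserving order.
    // This is a local analogue of rdd.flatMap, not Spark code.
    static List<String> flatten(List<List<String>> nested) {
        return nested.stream()
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> input = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c", "d"));
        System.out.println(flatten(input)); // prints [a, b, c, d]
    }
}
```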

The following solves the problem:
val conf = new SparkConf().setAppName("test")
  .setMaster("local[1]")
  .setExecutorEnv("executor-cores", "2")
val sc = new SparkContext(conf)
val a = sc.parallelize(Array(List("a", "b"), List("c", "d")))
a.flatMap(x => x).foreach(println)
Output:
a
b
c
d

Related

replace null to empty string in Java

I have the following string output:
["Kolkata","data can be, null",null,"05/31/2020",null]
but I want the output in the following format in Java:
["Kolkata","data can be, null","","05/31/2020",""]
Please help.
I am converting an object to JSON data; see the code below:
List<String> test = new ArrayList<>();
List<Object[]> data = query.list();
for (int i = 0; i < data.size(); i++) {
    Object[] row = data.get(i);
    String jsonString = gson.toJson(row);
    test.add(jsonString);
}
I want to apply this to the jsonString variable using Java 7, since I am not using Java 8.
If you have a list like this, for example, you can replace the nulls in place (note that the lambda and List.replaceAll require Java 8):
List<String> list = Arrays.asList("Kolkata","data can be, null",null,"05/31/2020",null);
list.replaceAll(t -> Objects.isNull(t) ? "''" : t);
System.out.println(list);
The output will be:
[Kolkata, data can be, null, '', 05/31/2020, '']
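Because the question asks for Java 7, a loop-based variant may be a better fit. The sketch below replaces nulls in the Object[] row before it is serialized, so the JSON output would contain "" instead of null. The class name NullToEmpty and the helper replaceNulls are hypothetical, and the Gson call is left as a comment so the example runs with the standard library only.

```java
import java.util.Arrays;

public class NullToEmpty {
    // Returns a copy of the row in which every null is replaced by "".
    // Uses only Java 7 language features (no lambdas).
    static Object[] replaceNulls(Object[] row) {
        Object[] copy = new Object[row.length];
        for (int i = 0; i < row.length; i++) {
            copy[i] = (row[i] == null) ? "" : row[i];
        }
        return copy;
    }

    public static void main(String[] args) {
        Object[] row = {"Kolkata", "data can be, null", null, "05/31/2020", null};
        // In the question's loop this would become:
        //   String jsonString = gson.toJson(replaceNulls(row));
        System.out.println(Arrays.toString(replaceNulls(row)));
    }
}
```

Replacing the values before serialization avoids editing the JSON string itself, where a textual search for null would also hit the literal text "data can be, null".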

How to convert JavaRDD<List<String>> to JavaRDD<String> and write to a file without "[" and "]"

I have a JavaRDD<List<String>> and my file is getting written with [] at the beginning and end of each list of strings when I use
javacontext.parallelize(rdd).coalesce(1, true).saveAsTextFile("dirname");
Can we convert JavaRDD<List<String>> to JavaRDD<String> and write it to a file?
You could use map to apply String.join for each List<String> in JavaRDD:
final String separator = ",";
JavaRDD<String> ys = rdd
    .map(new Function<List<String>, String>() {
        @Override
        public String call(List<String> xs) throws Exception {
            return String.join(separator, xs);
        }
    });
Or using lambdas:
JavaRDD<String> ys = rdd
.map((Function<List<String>, String>) xs -> String.join(separator, xs));

How to convert JavaRDD<Row> to JavaRDD<List<String>>?

JavaRDD<List<String>> documents = StopWordsRemover.Execute(lemmatizedTwits).toJavaRDD().map(new Function<Row, List<String>>() {
    @Override
    public List<String> call(Row row) throws Exception {
        List<String> document = new LinkedList<String>();
        for (int i = 0; i < row.length(); i++) {
            document.add(row.get(i).toString());
        }
        return document;
    }
});
I tried to do it with this code, but I get a WrappedArray:
[[WrappedArray(happy, holiday, beth, hope, wonderful, christmas, wish, best)], [WrappedArray(light, shin, meeeeeeeee, like, diamond)]]
How can I do this correctly?
You can use the getList method:
Dataset<Row> lemmas = StopWordsRemover.Execute(lemmatizedTwits).select("lemmas");
JavaRDD<List<String>> documents = lemmas.toJavaRDD().map(row -> row.getList(0));
where lemmas is the name of the column with the lemmatized text. If there is only one column (which looks like the case here), you can skip the select. If you know the index of the column, you can also skip the select and pass that index to getList, but this is error-prone.
Your current code iterates over the Row not the field you're trying to extract.
Here's an example that reads a semicolon-delimited text file (for instance, a CSV exported from Excel):
JavaRDD<String> data = sc.textFile(yourPath);
String header = data.first();
JavaRDD<String> dataWithoutHeader = data.filter(line -> !line.equalsIgnoreCase(header) && !line.isEmpty());
JavaRDD<List<String>> dataAsList = dataWithoutHeader.map(line -> Arrays.asList(line.split(";")));
I hope this piece of code helps.

To skip empty records from a CSV file using Apache Commons CSV

Suppose a CSV file contains three columns with the values given below:
a,b,c
//empty line
,,,
a,b,c
There are two valid records. Using the Apache Commons CSV parser, I can easily skip empty lines. But how can I skip records that contain only null values?
To work around this, I compare each record, via String equals(), with the string form of a pre-built empty record. Here is a sample implementation:
List<String[]> csvContentsList = new ArrayList<String[]>();
CSVFormat csvFormat = CSVFormat.DEFAULT.withNullString("");
CSVParser csvParser = new CSVParser(fileReader, csvFormat);
String[] nullRecordArray = { null, null, null };
String nullRecordString = Arrays.toString(nullRecordArray);
for (CSVRecord csvRecord : csvParser) {
    try {
        String[] values = { csvRecord.get(0), csvRecord.get(1), csvRecord.get(2) };
        if (!nullRecordString.equals(Arrays.toString(values))) { //lineA
            csvContentsList.add(values);
        }
    } catch (Exception e) {
        // exception handling
    }
}
Without the check on the line marked 'lineA', this implementation puts three records in csvContentsList, as shown below:
[a,b,c]
[null,null,null]
[a,b,c]
Is there a built-in way to do this, or any better way?
Here is another possible solution:
CSVFormat csvFormat = CSVFormat.DEFAULT.withNullString("");
CSVParser csvParser = new CSVParser(fileReader, csvFormat);
for (CSVRecord csvRecord : csvParser.getRecords()) {
    String[] values = {csvRecord.get(0), csvRecord.get(1), csvRecord.get(2)};
    for (String value : values) {
        if (value != null) {
            // as soon as a value is non-null we add the array
            // and exit the for-loop
            csvContentsList.add(values);
            break;
        }
    }
}
assumed input
a,b,c
,,,
d,e,f
output
a,b,c
d,e,f
Edit: if you can use Java 8, a solution might be:
List<String[]> csvContentsList = csvParser.getRecords()
    .stream()
    .sequential() // 1.
    .map((csvRecord) -> new String[]{
        csvRecord.get(0),
        csvRecord.get(1),
        csvRecord.get(2)
    }) // 2.
    .filter(v -> Arrays.stream(v)
        .filter(t -> t != null)
        .findFirst()
        .isPresent()
    ) // 3.
    .collect(Collectors.toList()); // 4.
1. keep the stream sequential if the order of lines is important
2. map a csvRecord to a String[]
3. filter on String arrays with at least one non-null value
4. collect all values and return a List
This might need to be amended, depending on your requirements.
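You could try StringUtils#isNotBlank() from Apache Commons Lang and keep a record as soon as any field is non-blank (with && instead of ||, a record would be kept only when every field is non-blank):
String[] values = {csvRecord.get(0), csvRecord.get(1), csvRecord.get(2)};
if (StringUtils.isNotBlank(values[0])
        || StringUtils.isNotBlank(values[1])
        || StringUtils.isNotBlank(values[2])) {
    csvContentsList.add(values);
}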
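The "keep a record if at least one field has a value" check used in the answers above can be pulled into a small helper that does not depend on the CSV library. The following is a minimal sketch under that assumption; the class EmptyRecordFilter and its methods are hypothetical names, and the String[] rows stand in for the values extracted from each CSVRecord.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmptyRecordFilter {
    // True if at least one field is neither null nor the empty string.
    static boolean hasAnyValue(String[] values) {
        for (String value : values) {
            if (value != null && !value.isEmpty()) {
                return true;
            }
        }
        return false;
    }

    // Returns only the records that contain at least one value, in order.
    static List<String[]> keepNonEmpty(List<String[]> records) {
        List<String[]> kept = new ArrayList<String[]>();
        for (String[] record : records) {
            if (hasAnyValue(record)) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<String[]>();
        records.add(new String[]{"a", "b", "c"});
        records.add(new String[]{null, null, null});
        records.add(new String[]{"d", "e", "f"});
        for (String[] record : keepNonEmpty(records)) {
            System.out.println(Arrays.toString(record));
        }
    }
}
```

Checking for both null and the empty string means the helper behaves the same whether or not the parser is configured with withNullString("").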

How to assign array string to array object

I have something like this:
List<Page> result = new ArrayList<Page>();
Page is a class with three String fields.
I also have a list of arrays like this:
List<String[]> output = new ArrayList<String[]>();
Which is populated like this in a loop:
String[] out = new String[3];
out[0] = "";
out[1] = "";
out[2] = "";
then added to output: output.set(i, out);
How can I convert output (a list of String[]) into result (a list of Page)?
I am guessing you are looking for something like this (the code requires Java 8, but it can easily be rewritten for earlier versions using a loop):
List<String[]> output = new ArrayList<String[]>();
// populate output with arrays containing three elements
// which will be used to initialize Page instances
//...
List<Page> result = output.stream()
    .map(arr -> new Page(arr[0], arr[1], arr[2]))
    .collect(Collectors.toList());
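For versions before Java 8, the loop-based rewrite mentioned in the answer could look roughly like the sketch below. The Page class here is a hypothetical stand-in, since the question only says it has three String fields; the field names and the three-argument constructor are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class PageConversion {
    // Hypothetical stand-in for the asker's Page class.
    static class Page {
        final String first;
        final String second;
        final String third;

        Page(String first, String second, String third) {
            this.first = first;
            this.second = second;
            this.third = third;
        }
    }

    // Builds one Page per three-element array, preserving order.
    static List<Page> toPages(List<String[]> output) {
        List<Page> result = new ArrayList<Page>(output.size());
        for (String[] arr : output) {
            result.add(new Page(arr[0], arr[1], arr[2]));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> output = new ArrayList<String[]>();
        output.add(new String[]{"title", "url", "body"});
        List<Page> result = toPages(output);
        System.out.println(result.size() + " page(s), first field: " + result.get(0).first);
    }
}
```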
