How to convert JavaRDD<Row> to JavaRDD<List<String>>? - java

JavaRDD<List<String>> documents = StopWordsRemover.Execute(lemmatizedTwits).toJavaRDD().map(new Function<Row, List<String>>() {
@Override
public List<String> call(Row row) throws Exception {
List<String> document = new LinkedList<String>();
for(int i = 0; i<row.length(); i++){
document.add(row.get(i).toString());
}
return document;
}
});
I tried to do it with this code, but I get a WrappedArray:
[[WrappedArray(happy, holiday, beth, hope, wonderful, christmas, wish, best)], [WrappedArray(light, shin, meeeeeeeee, like, diamond)]]
How do I do this correctly?

You can use the getList method:
Dataset<Row> lemmas = StopWordsRemover.Execute(lemmatizedTwits).select("lemmas");
JavaRDD<List<String>> documents = lemmas.toJavaRDD().map(row -> row.getList(0));
where lemmas is the name of the column with the lemmatized text. If there is only one column (which looks like the case here), you can skip select. If you know the index of the column, you can also skip select and pass the index to getList, but that is error-prone.
Your current code iterates over the fields of the Row instead of over the contents of the field you're trying to extract.

Here's an example using a semicolon-delimited text file with a header line (for example, one exported from Excel):
JavaRDD<String> data = sc.textFile(yourPath);
String header = data.first();
JavaRDD<String> dataWithoutHeader = data.filter(line -> !line.equalsIgnoreCase(header) && !line.isEmpty());
JavaRDD<List<String>> dataAsList = dataWithoutHeader.map(line -> Arrays.asList(line.split(";")));
Hope this piece of code helps.

Related

Replace null with an empty string in Java

I have the following string output:
["Kolkata","data can be, null",null,"05/31/2020",null]
but I want the output in the following format in Java:
["Kolkata","data can be, null","","05/31/2020",""]
Please help.
I am converting objects to JSON data. Please see the code below:
List<String> test = new ArrayList<>();
List<Object[]> data =query.list();
for (int i = 0; i < data.size(); i++) {
Object[] row = (Object[]) data.get(i);
String jsonString = gson.toJson(row);
test.add(jsonString);
}
I want to apply this to the jsonString variable using Java 7, since I am not using Java 8.
If you have a list like this, for example, you can use List.replaceAll (note that this method and the lambda require Java 8):
List<String> list = Arrays.asList("Kolkata","data can be, null",null,"05/31/2020",null);
list.replaceAll(t -> Objects.isNull(t) ? "" : t);
System.out.println(list);
The output will be:
[Kolkata, data can be, null, , 05/31/2020, ]
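Since the question asks for Java 7, where lambdas and List.replaceAll are not available, a plain loop can do the same replacement on each row before it is serialized. This is a minimal sketch, assuming the goal is an empty string in place of every null. The names NullToEmpty and replaceNulls are hypothetical, and the Gson call from the question is left out so the example has no dependencies.

```java
import java.util.Arrays;

class NullToEmpty {
    // Returns a copy of the row in which every null element is replaced by "".
    static Object[] replaceNulls(Object[] row) {
        Object[] copy = Arrays.copyOf(row, row.length);
        for (int i = 0; i < copy.length; i++) {
            if (copy[i] == null) {
                copy[i] = "";
            }
        }
        return copy;
    }

    public static void main(String[] args) {
        Object[] row = {"Kolkata", "data can be, null", null, "05/31/2020", null};
        System.out.println(Arrays.toString(replaceNulls(row)));
    }
}
```

In the loop from the question, the row would then be serialized with gson.toJson(replaceNulls(row)) instead of gson.toJson(row).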

Convert RDD List to RDD of individual element in spark

I have an input RDD (JavaRDD<List<String>>) and I want to convert it to a JavaRDD<String> as output.
Each element of each list in the input RDD should become an individual element in the output RDD.
How can I achieve this in Java?
JavaRDD<List<String>> input; //suppose rdd length is 2
input.saveAsTextFile(...)
Output:
[a,b] [c,d]
What I want:
a b c d
Convert it into a DataFrame and use the explode function.
I worked around it with the code snippet below.
It joins the elements of each list with the separator '\n' and then saves the RDD using the standard Spark API.
inputRdd.map(new Function<List<String>, String>() {
@Override
public String call(List<String> scores) throws Exception {
// Join the elements of one list with newlines so each ends up on its own line
int size = scores.size();
StringBuilder sb = new StringBuilder();
for (int i = 0; i < size; i++) {
sb.append(scores.get(i));
if (i != size - 1) {
sb.append("\n");
}
}
return sb.toString();
}
}).saveAsTextFile("/tmp/data");
If the RDD type is RDD[List[String]], you can just do this (in Scala):
val newrdd = rdd.flatMap(line => line)
Each of the elements will be a separate element of the new RDD, and therefore its own line when saved.
The following will solve your problem:
var conf = new SparkConf().setAppName("test")
.setMaster("local[1]")
.setExecutorEnv("executor-cores", "2")
var sc = new SparkContext(conf)
val a = sc.parallelize(Array(List("a", "b"), List("c", "d")))
a.flatMap(x => x).foreach(println)
Output:
a
b
c
d
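The question asks for Java, while the answers above are in Scala. What flatMap does can be seen without Spark by using java.util.stream. The following is a plain-Java sketch of the idea, not Spark code, and the class name FlattenSketch is hypothetical. With Spark 2.x or later the corresponding call on the question's RDD would presumably be input.flatMap(list -> list.iterator()); that is stated as an assumption and is not exercised by the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FlattenSketch {
    // Turns a list of lists into one flat list, preserving order.
    static List<String> flatten(List<List<String>> nested) {
        return nested.stream()
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> input = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c", "d"));
        // Each inner element becomes its own element of the result.
        for (String s : flatten(input)) {
            System.out.println(s);
        }
    }
}
```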

How to skip empty records in a CSV file using Apache Commons CSV

Suppose a CSV file contains three columns and the values are as given below:
a,b,c
//empty line
,,,
a,b,c
There are two valid records. Using the Apache Commons CSV parser, I can easily skip records that are empty lines. But how can I skip a record that contains only null values?
To overcome this, I compare each record, via String equals(), with a pre-built empty record. Here is a sample implementation:
List<String[]> csvContentsList = new ArrayList<String[]>();
CSVFormat csvFormat = CSVFormat.DEFAULT.withNullString("");
CSVParser csvParser = new CSVParser(fileReader, csvFormat);
String[] nullRecordArray = { null, null, null};
String nullRecordString = Arrays.toString(nullRecordArray);
for (CSVRecord csvRecord : csvParser) {
try {
String values[] = { csvRecord.get(0),csvRecord.get(1),csvRecord.get(2) };
if (!nullRecordString.equals(Arrays.toString(values))) //lineA
csvContentsList.add(values);
} catch (Exception e) {
// exception handling
}
}
When I don't use the line marked as 'lineA', this implementation gives three records in csvContentsList, as below:
[a,b,c]
[null,null,null]
[a,b,c]
Is there any built-in way to do this, or any other better way?
Here is another possible solution:
CSVFormat csvFormat = CSVFormat.DEFAULT.withNullString("");
CSVParser csvParser = new CSVParser(fileReader, csvFormat);
for (CSVRecord csvRecord : csvParser.getRecords()) {
String values[] = {csvRecord.get(0), csvRecord.get(1), csvRecord.get(2)};
for (String value : values) {
if (value != null) {
// as soon a value is not-null we add the array
// and exit the for-loop
csvContentsList.add(values);
break;
}
}
}
Assumed input:
a,b,c
,,,
d,e,f
Output:
a,b,c
d,e,f
Edit: If you can use Java 8, a solution might be:
List<String[]> csvContentsList = csvParser.getRecords()
.stream()
.sequential() // 1.
.map((csvRecord) -> new String[]{
csvRecord.get(0),
csvRecord.get(1),
csvRecord.get(2)
}) // 2.
.filter(v -> Arrays.stream(v)
.filter(t -> t != null)
.findFirst()
.isPresent()
) // 3.
.collect(Collectors.toList()); // 4.
1. if the order of lines is important
2. map a csvRecord to a String[]
3. filter on String arrays with at least one non-null value
4. collect all values and return a List
Might need to be amended, depending on your requirements.
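You could try StringUtils#isNotBlank() (from Apache Commons Lang) this way, keeping a record as soon as any column has content:
if (StringUtils.isNotBlank(csvRecord.get(0))
|| StringUtils.isNotBlank(csvRecord.get(1))
|| StringUtils.isNotBlank(csvRecord.get(2))) {
// values is the String[] built from the record, as in the question
csvContentsList.add(values);
}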
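The checks above hard-code three columns. A small helper that works for any column count keeps the filter in one place. This is a sketch in plain Java, assuming each record has already been copied into a String[] as in the question. EmptyRecordFilter, hasContent and dropEmpty are hypothetical names, and treating whitespace-only values as empty is an assumption that goes beyond the original null check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EmptyRecordFilter {
    // True if at least one value is neither null nor whitespace-only.
    static boolean hasContent(String[] values) {
        for (String value : values) {
            if (value != null && !value.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    // Keeps only the records that contain at least one non-empty value.
    static List<String[]> dropEmpty(List<String[]> records) {
        List<String[]> kept = new ArrayList<String[]>();
        for (String[] record : records) {
            if (hasContent(record)) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<String[]>();
        records.add(new String[] {"a", "b", "c"});
        records.add(new String[] {null, null, null});
        records.add(new String[] {"d", "e", "f"});
        for (String[] record : dropEmpty(records)) {
            System.out.println(Arrays.toString(record));
        }
    }
}
```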

How to write ArrayList<Object> to a csv file

I have an ArrayList<Metadata> and I want to know if there is a Java API for working with CSV files that has a write method accepting an ArrayList<> as a parameter, similar to LinqToCsv in .NET. As far as I know, OpenCSV is available, but its CSVWriter class doesn't accept a collection.
My Metadata class is:
public class Metadata{
private String page;
private String document;
private String loan;
private String type;
}
ArrayList<Metadata> record = new ArrayList<Metadata>();
Once I populate the record, I want to write each row into a CSV file.
Please suggest an approach.
Surely there'll be a heap of APIs that will do this for you, but why not do it yourself for such a simple case? It will save you a dependency, which is a good thing for any project of any size.
Create a toCsvRow() method in Metadata that joins the strings separated by a comma.
public String toCsvRow() {
return Stream.of(page, document, loan, type)
.map(value -> value.replaceAll("\"", "\"\""))
.map(value -> Stream.of("\"", ",").anyMatch(value::contains) ? "\"" + value + "\"" : value)
.collect(Collectors.joining(","));
}
Collect the result of this method for every Metadata object separated by a new line.
String recordAsCsv = record.stream()
.map(Metadata::toCsvRow)
.collect(Collectors.joining(System.getProperty("line.separator")));
EDIT
Should you not be so fortunate as to have Java 8 and the Stream API at your disposal, this would be almost as simple using a traditional List.
public String toCsvRow() {
String csvRow = "";
for (String value : Arrays.asList(page, document, loan, type)) {
String processed = value;
if (value.contains("\"") || value.contains(",")) {
processed = "\"" + value.replaceAll("\"", "\"\"") + "\"";
}
csvRow += "," + processed;
}
return csvRow.substring(1);
}
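Neither version above actually writes the file, and both would throw a NullPointerException on a null field. A sketch of both steps using only the JDK follows. CsvFileSketch, escape, toCsvRow and write are hypothetical names, writing null as an empty field is an assumption, and the fields are passed as a String[] instead of a Metadata object to keep the example self-contained.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class CsvFileSketch {
    // Quotes a field if it contains a comma, quote or line break; null becomes empty.
    static String escape(String value) {
        if (value == null) {
            return "";
        }
        if (value.contains("\"") || value.contains(",")
                || value.contains("\n") || value.contains("\r")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    // Joins the escaped fields of one record with commas.
    static String toCsvRow(String... fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                row.append(',');
            }
            row.append(escape(fields[i]));
        }
        return row.toString();
    }

    // Writes one line per record to the given file as UTF-8.
    static void write(Path target, List<String[]> records) throws IOException {
        List<String> lines = new ArrayList<String>();
        for (String[] record : records) {
            lines.add(toCsvRow(record));
        }
        Files.write(target, lines, StandardCharsets.UTF_8);
    }
}
```

Inside Metadata, a toCsvRow() method could delegate to CsvFileSketch.toCsvRow(page, document, loan, type).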
With OpenCSV's CSVWriter, you could convert an ArrayList of strings to an array and pass that to the writer (this works only when the list holds String values, one row per call):
csvWriter.writeNext(record.toArray(new String[record.size()]));
If you have an ArrayList of objects (Metadata in your case), you would use BeanToCSV instead of the CSVWriter.
You can look at BeanToCSVTest in the opencsv source code for examples of how to use it.

Parse a csv String and map to a java object

I am trying to parse a csv string like this
COL1,COL2,COL3
1,2,3
2,4,5
and map the columns to a Java object:
class Person {
String COL1;
String COL2;
String COL3;
}
Most of the libraries I found on Google are for CSV files, but I am working with Google App Engine, so I can't write or read files. Currently I am using the split method, but there are problems with this approach.
The column order in the CSV string can vary, for example:
COL1,COL3,COL2
I also don't want to write boilerplate code for splitting and reading each column. What I need is to take the list of column headers and read all columns into a collection using a header mapper, then map each column value to a Java object while iterating.
There are several question based on similar type of requirement but none of them helped me.
If anyone has done this before please could you share the idea? Thanks!
After searching and trying several libraries, I was able to solve it. I am sharing the code in case anyone needs it later (it uses Jackson's CSV module, jackson-dataformat-csv, and Guava's Lists):
public class CSVParsing {
public void parseCSV() throws IOException {
List<Person> list = Lists.newArrayList();
String str = "COL1,COL2,COL3\n" +
"A,B,23\n" +
"S,H,20\n";
CsvSchema schema = CsvSchema.emptySchema().withHeader();
ObjectReader mapper = new CsvMapper().reader(Person.class).with(schema);
MappingIterator<Person> it = mapper.readValues(str);
while (it.hasNext()) {
list.add(it.next());
}
System.out.println("stored list is:" + (list != null ? list.toString() : null));
}}
Most of the libraries I found on google are for csv files but I am
working with google app engine so can't write or read files
You can read files (in the project file system).
You can read and write files in Blobstore and Google Cloud Storage.
Use a StringTokenizer to split the string into tokens, then set them on the object.
//Split the line into tokens using a comma as the token separator
StringTokenizer st = new StringTokenizer(lineFromFile, ",");
List<String> items = new ArrayList<>();
while (st.hasMoreTokens())
{
//Collect each item
items.add(st.nextToken());
}
//Set to object
Person p = new Person(items.get(0), items.get(1), items.get(2));
If the column order can vary, parse the header line, save its values, and use them to decide which column each token falls under using, say, a Map:
String columns[] = new String[3]; //Fill these with the column names from the header line
Map<String,String> map = new HashMap<>();
int i=0;
while (st.hasMoreTokens())
{
//Store each item under its column name
map.put(columns[i++], st.nextToken());
}
Then just create the Person:
Person p = new Person(map.get("COL1"), map.get("COL2"), map.get("COL3"));
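Putting the pieces together for the varying column order: the header line is turned into a name-to-index map, and each data line is then read through that map. This is a dependency-free sketch that assumes fields contain no quoted commas and that all three headers are present. HeaderMappedCsv and its nested Person class with String fields are hypothetical stand-ins for the question's class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeaderMappedCsv {
    // Hypothetical stand-in for the Person class from the question.
    static class Person {
        final String col1;
        final String col2;
        final String col3;

        Person(String col1, String col2, String col3) {
            this.col1 = col1;
            this.col2 = col2;
            this.col3 = col3;
        }
    }

    // Parses CSV text whose first line is a header; column order may vary.
    static List<Person> parse(String csv) {
        String[] lines = csv.split("\\r?\\n");
        // Map each header name to its position in the line.
        Map<String, Integer> index = new HashMap<String, Integer>();
        String[] header = lines[0].split(",", -1);
        for (int i = 0; i < header.length; i++) {
            index.put(header[i].trim(), i);
        }
        List<Person> people = new ArrayList<Person>();
        for (int n = 1; n < lines.length; n++) {
            if (lines[n].trim().isEmpty()) {
                continue; // skip blank lines
            }
            // The -1 limit keeps trailing empty fields.
            String[] cells = lines[n].split(",", -1);
            people.add(new Person(
                    cells[index.get("COL1")],
                    cells[index.get("COL2")],
                    cells[index.get("COL3")]));
        }
        return people;
    }
}
```

Because the lookup goes through the header map, the input "COL1,COL3,COL2\n1,3,2" still fills col2 with "2" and col3 with "3".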
