How to flatMap a DataFrame from another DataFrame in Java?

I have a dataFrame like the following:
+------+----------------+
| id   | document       |
+------+----------------+
| doc1 | "word1, word2" |
| doc2 | "word3 word4"  |
+------+----------------+
I want to create another dataFrame with the following structure:
+------+----------------+-------+
| id   | document       | word  |
+------+----------------+-------+
| doc1 | "word1, word2" | word1 |
| doc1 | "word1, word2" | word2 |
| doc2 | "word3 word4"  | word3 |
| doc2 | "word3 word4"  | word4 |
+------+----------------+-------+
I tried the following:
public static Dataset<Row> buildInvertIndex(Dataset<Row> inputRaw, SQLContext sqlContext, String id) {
    JavaRDD<Row> inputInvertedIndex = inputRaw.javaRDD();
    JavaRDD<Tuple3<String, String, String>> d = inputInvertedIndex.flatMap(x -> {
        List<Tuple3<String, String, String>> k = new ArrayList<>();
        String data2 = x.getString(0).toString();
        String[] field2 = x.getString(1).split(" ", -1);
        for (String s : field2)
            k.add(new Tuple3<String, String, String>(data2, x.getString(1), s));
        return k.iterator();
    });
    JavaPairRDD<String, Tuple2<String, String>> d2 = d.mapToPair(x -> {
        return new Tuple2<String, Tuple2<String, String>>(x._3(), new Tuple2<String, String>(x._1(), x._2()));
    });
    Dataset<Row> d3 = sqlContext.createDataset(JavaPairRDD.toRDD(d2), Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(), Encoders.STRING()))).toDF();
    return d3;
}
But it gives:
+-------+------------------------+
| _1    | _2                     |
+-------+------------------------+
| word1 | [doc1,"word1, word2"]  |
| word2 | [doc1,"word1 word2"]   |
| word3 | [doc2, "word3, word4"] |
| word4 | [doc2, "word3, word4"] |
+-------+------------------------+
I'm new to Spark in Java, so any help would be much appreciated. In addition, suppose that in the second dataFrame above I want to compute a string similarity metric (e.g., Jaccard) on the two columns document and word and add the result as a new column. How can I do that?

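You can use explode and split:
import static org.apache.spark.sql.functions.expr;
// split() breaks the document on runs of commas/spaces; explode() emits one row per element
Dataset<Row> result = inputRaw.withColumn("word", expr("explode(split(document, '[, ]+'))"));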
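To make the semantics concrete without a Spark cluster, here is a minimal plain-Java sketch of what explode(split(document, '[, ]+')) does to a single row, together with one possible token-based Jaccard similarity for the follow-up question. The class and method names (ExplodeSketch, explode, jaccard) are hypothetical, and treating both columns as sets of tokens is an assumption; other definitions of Jaccard (for example on character n-grams) are equally valid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Spark: mimics explode(split(...)) for one row.
class ExplodeSketch {

    /** Emits one {id, document, word} row per word found in the document. */
    static List<String[]> explode(String id, String document) {
        List<String[]> rows = new ArrayList<>();
        for (String word : document.split("[, ]+")) {
            if (!word.isEmpty()) { // a leading separator would yield an empty first token
                rows.add(new String[] {id, document, word});
            }
        }
        return rows;
    }

    /** Jaccard similarity of the two token sets: size of intersection / size of union. */
    static double jaccard(String a, String b) {
        Set<String> setA = tokens(a);
        Set<String> setB = tokens(b);
        Set<String> union = new HashSet<>(setA);
        union.addAll(setB);
        if (union.isEmpty()) {
            return 0.0; // assumption: two empty strings count as dissimilar
        }
        Set<String> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        return (double) intersection.size() / union.size();
    }

    private static Set<String> tokens(String s) {
        Set<String> out = new HashSet<>(Arrays.asList(s.split("[, ]+")));
        out.remove(""); // drop the empty token produced by empty input or a leading separator
        return out;
    }
}
```

In Spark itself, a function like jaccard could be registered as a UDF and applied with withColumn to the document and word columns; that wiring is left out here because it depends on the Spark version in use.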

Related

How to apply the minDF feature extractor CountVectorizer parameter to each label separately, and not to the entire dataset

I am new to Apache Spark and faced the following problem:
there is a dataset:
| label | words |
| -------- | ------------------- |
| 0 | word1 word2 word3 |
| 0 | word4 word1 word5 |
| 0 | word6 word7 word8 |
| 1 | word9 word10 |
| 1 | word9 word11 |
If you use CountVectorizer with setMinDF(0.5) on this dataset, the words word1 and word9 will not get into the dictionary, since each of them occurs in less than 50% of the documents in the dataset.
How can you apply minDF to each label separately to get word1 and word9 into the dictionary?
My current code:
Dataset<Row> wordsData = <my dataset>;
CountVectorizer cv = new CountVectorizer().setInputCol("words").setOutputCol("features");
cv.setMinDF(0.5);
CountVectorizerModel cvModel = cv.fit(wordsData);
cvModel.save(extractorModelFile); // SAVE MODEL
Dataset<Row> rescaledData = cvModel.transform(wordsData);

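If you set minDF to 1.0 (its default value), all the words will be in the vocabulary, because a value of 1 or more is interpreted as an absolute number of documents rather than a fraction.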
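// Assumes a spark-shell session, where spark.implicits._ is already in scope
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.functions.split
val df = Seq(0 -> "word1 word2 word3",
  0 -> "word1 word2 word3",
  0 -> "word6 word7 word8",
  1 -> "word9 word10",
  1 -> "word9 word11")
  .toDF("label", "words").withColumn("words", split('words, " "))
val cv = new CountVectorizer().setInputCol("words").setOutputCol("features").setMinDF(1d)
// fit() returns the CountVectorizerModel, which is the object that holds the vocabulary
val model = cv.fit(df)
model.vocabulary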
res: Array[String] = Array(word9, word3, word2, word1, word11, word8, word7, word6, word10)
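The answer above keeps every word, which sidesteps the per-label part of the question. As a rough illustration of what "minDF per label" could mean, the plain-Java sketch below computes document frequencies within each label's documents and takes the union of the words that pass the threshold. It does not use Spark; the names PerLabelMinDf, vocabulary and perLabelVocabulary are hypothetical, and the "at least minDf" comparison is an assumption about the desired cutoff.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Spark ML.
class PerLabelMinDf {

    /** Words contained in at least a fraction minDf of the given documents. */
    static Set<String> vocabulary(List<String> docs, double minDf) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // De-duplicate so each word counts once per document
            for (String word : new HashSet<>(Arrays.asList(doc.trim().split("\\s+")))) {
                if (!word.isEmpty()) {
                    docFreq.merge(word, 1, Integer::sum);
                }
            }
        }
        Set<String> vocab = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if ((double) e.getValue() / docs.size() >= minDf) {
                vocab.add(e.getKey());
            }
        }
        return vocab;
    }

    /** Applies the threshold inside each label's documents and returns the union. */
    static Set<String> perLabelVocabulary(Map<Integer, List<String>> docsByLabel, double minDf) {
        Set<String> vocab = new TreeSet<>();
        for (List<String> docs : docsByLabel.values()) {
            vocab.addAll(vocabulary(docs, minDf));
        }
        return vocab;
    }
}
```

With the dataset from the question and a threshold of 0.5, the global vocabulary is empty, while the per-label union contains word1 and word9 (and also word10 and word11, each of which appears in half of the label 1 documents). One way to use such a vocabulary in Spark would be to build a CountVectorizerModel directly from the word array instead of fitting it, but that step is not shown here.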

How to split a column into a list and save it into a new .csv file

I have a data frame with two columns: student ID and their courses. The course column has multiple values separated by ";". How can I split the courses into a list and save every pair (studentID, course1), (studentID, course2) into a new CSV file?

You could try split and explode :
val df = Seq((1,("a;b;c"))).toDF("id","values")
df.show()
val df2 = df.select($"id", explode(split($"values",";")).as("value"))
df2.show()
df2.write.option("header", "true").csv("/path/to/csv");
Output:
+---+------+
| id|values|
+---+------+
|  1| a;b;c|
+---+------+
+---+-----+
| id|value|
+---+-----+
|  1|    a|
|  1|    b|
|  1|    c|
+---+-----+
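For readers working in Java rather than Scala, the per-row effect of split plus explode can be sketched with the standard library alone. The class SplitToCsv and its method are hypothetical names for illustration, and the sketch assumes the values themselves contain no commas or quotes (otherwise CSV quoting would be needed).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns one (id, "a;b;c") record into one CSV line per value.
class SplitToCsv {

    static List<String> toCsvLines(int id, String values) {
        List<String> lines = new ArrayList<>();
        for (String value : values.split(";")) {
            String v = value.trim();
            if (!v.isEmpty()) { // skip empty entries such as the one in "a;;b"
                lines.add(id + "," + v);
            }
        }
        return lines;
    }
}
```

The resulting lines could then be written out with java.nio.file.Files.write, preceded by a header line such as "id,value" if one is wanted.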

Variable increment (index++) does not increase by 1 each time

My program reads a file line by line, splits each line on the delimiter | (vertical bar) and stores the result in a String[]. However, since the column positions and the number of columns in the line will change in the future, I use index++ to step through the tokens instead of hard-coded index numbers 0, 1, 2, 3...
After running the program, the index increases by more than 1 each time instead of by exactly 1.
My code is as follows:
BufferedReader br = null;
String line = null;
String[] lineTokens = null;
int index = 1;
DataModel dataModel = new DataModel();
try {
    br = new BufferedReader(new FileReader(filePath));
    while ((line = br.readLine()) != null) {
        // check Group C only
        if (line.contains("CCC")) {
            lineTokens = line.split("\\|");
            dataModel.setGroupID(lineTokens[index++]);
            //System.out.println(index); The value of index not equal to 2 here. The value change each running time
            dataModel.setGroupName(lineTokens[index++]);
            //System.out.println(index);
            // dataModel.setOthers(lineTokens[index++]); <- if the file add columns in the middle of the line in the future, this is required.
            dataModel.setMemberID(lineTokens[index++]);
            dataModel.setMemberName(lineTokens[index++]);
            dataModel.setXXX(lineTokens[index++]);
            dataModel.setYYY(lineTokens[index++]);
            index = 1;
            //blah blah below
        }
    }
    br.close();
} catch (Exception ex) {
}
The file format is as follows:
Prefix | Group ID | Group Name | Memeber ID | Member Name | XXX | YYY
GroupInterface | AAA | Group A | 001 | Amy | XXX | YYY
GroupInterface | BBB | Group B | 002 | Tom | XXX | YYY
GroupInterface | AAA | Group A | 003 | Peter | XXX | YYY
GroupInterface | CCC | Group C | 004 | Sam | XXX | YYY
GroupInterface | CCC | Group C | 005 | Susan | XXX | YYY
GroupInterface | DDD | Group D | 006 | Parker| XXX | YYY
Instead of increasing by 1, index++ increases the index by more than 1. I wonder why this happens and how to solve it. Any help is highly appreciated.

Well, @SomeProgrammerDude slyly hinted it, but I'll just come out and say it: when you reset index it should be set to zero, not 1.
By starting index at 1, you're always indexing one position ahead of where you should be, and you're probably eventually getting an IndexOutOfBoundsException that's being swallowed up by your empty catch clause.
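Whatever the right starting offset turns out to be, two changes make this kind of code easier to debug: keep the index local to each line so it can never carry over, and do not leave the catch block empty (at least call ex.printStackTrace()). The sketch below shows the first point on the sample lines from the question. GroupLineParser and FIRST_DATA_COLUMN are hypothetical names, and the offset of 1 is an assumption based on the sample file, where token 0 is the Prefix column; use 0 if the real file has no prefix.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating a per-line index; not the asker's DataModel.
class GroupLineParser {

    // Assumption: token 0 is the "Prefix" column, as in the sample file.
    private static final int FIRST_DATA_COLUMN = 1;

    /** Returns the trimmed data fields of the line if its group ID matches, else an empty list. */
    static List<String> parse(String line, String groupId) {
        String[] tokens = line.split("\\|");
        List<String> fields = new ArrayList<>();
        if (tokens.length <= FIRST_DATA_COLUMN
                || !tokens[FIRST_DATA_COLUMN].trim().equals(groupId)) {
            return fields;
        }
        int index = FIRST_DATA_COLUMN; // declared per call, so it always starts fresh
        while (index < tokens.length) { // bounds check instead of relying on a swallowed exception
            fields.add(tokens[index++].trim());
        }
        return fields;
    }
}
```

Because the loop is bounded by tokens.length, adding or removing columns in the file changes the size of the returned list rather than causing an out-of-bounds access.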

Matching ${123...456} and extracting 2 numbers in Java?

What is the simplest, most succinct way to extract 2 integers from a String when I know the format will always be ${INT1...INT2}? For example, "Hello ${123...456}" would yield 123 and 456.

I would go with a Pattern with capturing groups.
Here's an example:
String input = "Hello ${123...456}, bye ${789...101112}";
// \\$     escaped "$"
// \\{     escaped "{"
// (\\d+)  first group (one or more digits)
// \\.{3}  3 escaped dots
// (\\d+)  second group (same as the first)
// \\}     escaped "}"
Pattern p = Pattern.compile("\\$\\{(\\d+)\\.{3}(\\d+)\\}");
Matcher m = p.matcher(input);
// iterating over matcher's find for multiple matches
while (m.find()) {
System.out.println("Found...");
System.out.println("\t" + m.group(1));
System.out.println("\t" + m.group(2));
}
Output
Found...
123
456
Found...
789
101112
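The same pattern can be wrapped in a small helper that returns both numbers as ints. This is a minimal sketch; RangeExtractor and extract are hypothetical names, returning null for "no match" is an arbitrary choice, and Integer.parseInt will throw NumberFormatException for values that do not fit in an int.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper around the pattern used in the answer above.
class RangeExtractor {

    private static final Pattern RANGE = Pattern.compile("\\$\\{(\\d+)\\.{3}(\\d+)\\}");

    /** Returns {INT1, INT2} from the first ${INT1...INT2} in the input, or null if there is none. */
    static int[] extract(String input) {
        Matcher m = RANGE.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new int[] {Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))};
    }
}
```

Compiling the pattern once as a static constant avoids recompiling it on every call.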

final String string = "${123...456}";
final String firstPart = string.substring(string.indexOf("${") + "${".length(), string.indexOf("..."));
final String secondPart = string.substring(string.indexOf("...") + "...".length(), string.indexOf("}"));
// Parse each part separately; concatenating them first would give the single number 123456
final int first = Integer.parseInt(firstPart);
final int second = Integer.parseInt(secondPart);

String tokenizing to remove some data

I have a string like this:
1 | 2 | 3 | 4 | 5 | 6 | 7 | | | | | | | |
The string might have more/less data also.
I need to remove | and get only numbers one by one.

Guava's Splitter Rocks!
String input = "1 | 2 | 3 | 4 | 5 | 6 | 7 | | | | | | | |";
Iterable<String> entries = Splitter.on("|")
.trimResults()
.omitEmptyStrings()
.split(input);
And if you really want to get fancy:
Iterable<Integer> ints = Iterables.transform(entries,
    new Function<String, Integer>() {
        @Override
        public Integer apply(String input) { // must be public to implement Function
            return Integer.parseInt(input);
        }
    });
Although you definitely could use a regex method or String.split, I feel that using Splitter is less likely to be error-prone and is more readable and maintainable. You could argue that String.split might be more efficient but since you are going to have to do all the trimming and checking for empty strings anyway, I think it will probably even out.
One comment about transform: it does the calculation on an as-needed basis, which can be great but also means that the transform may be done multiple times on the same element. Therefore I recommend something like this to perform all the calculations once:
Function<String, Integer> toInt = new Function...
Iterable<Integer> values = Iterables.transform(entries, toInt);
List<Integer> valueList = Lists.newArrayList(values);

You can try using a Scanner:
Scanner sc = new Scanner(myString);
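// The delimiter is a regex: a bare "|" would match the empty string, so use a character class
sc.useDelimiter("[\\s|]+");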
List<Integer> numbers = new LinkedList<Integer>();
while(sc.hasNext()) {
if(sc.hasNextInt()) {
numbers.add(sc.nextInt());
} else {
sc.next();
}
}

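Here you go (note that this strips every separator, so the result is the single string "1234567"; the numbers can only be read back one by one if each of them has a single digit):
String str = "1 | 2 | 3 | 4 | 5 | 6 | 7 | | | | | | | |".replaceAll("\\|", "").replaceAll("\\s+", "");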

Do you mean something like this?
String s = "1 | 2 | 3 | 4 | 5 | 6 | 7 | | | | | | | |";
for(String n : s.split(" ?\\| ?")) {
int i = Integer.parseInt(n);
System.out.println(i);
}
prints
1
2
3
4
5
6
7

inputString.split("\\s*\\|\\s*") will give you an array of the numbers as strings. Then you need to parse the numbers:
final List<Integer> ns = new ArrayList<>();
for (String n : inputString.split("\\s*\\|\\s*"))
    ns.add(Integer.parseInt(n));

You can use split with the following regex (allows for extra spaces, tabs and empty buckets):
String input = "1 | 2 | 3 | 4 | 5 | 6 | 7 | | | | | | | | ";
String[] numbers = input.split("([\\s*\\|\\s*])+");
System.out.println(Arrays.toString(numbers));
outputs:
[1, 2, 3, 4, 5, 6, 7]

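Or with Java's built-in methods:
// split() takes a regex, so the pipe must be escaped; a bare "|" would split between every character
String[] data = "1 | 2 | 3 | 4 | 5 | 6 | 7 | | | | | | | |".split("\\|");
for (String elem : data) {
    elem = elem.trim();
    if (elem.length() > 0) {
        // do something
    }
}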

Split the string at its delimiter | and then parse the array.
Something like this should do:
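String test = "|1|2|3";
String delimiter = "|";
// split() takes a regex, so quote the delimiter to match a literal "|"
String[] testArray = test.split(Pattern.quote(delimiter));
List<Integer> values = new ArrayList<Integer>();
for (String string : testArray) {
    if (string.isEmpty()) {
        continue; // the leading "|" produces an empty first token
    }
    int number = Integer.parseInt(string);
    values.add(number);
}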
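Several of the answers above hinge on the same detail: String.split takes a regular expression, and an unescaped "|" is the regex alternation operator, not a literal pipe. A minimal standard-library sketch that handles the input from the question, including the empty trailing cells, might look like this. PipeNumbers and parse are hypothetical names, and the sketch assumes every non-empty cell is a valid integer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: parses "1 | 2 | 3 | | |" into a list of integers.
class PipeNumbers {

    static List<Integer> parse(String input) {
        List<Integer> numbers = new ArrayList<>();
        for (String token : input.split("\\|")) { // escaped, so it matches a literal pipe
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) { // skip empty cells between consecutive pipes
                numbers.add(Integer.parseInt(trimmed));
            }
        }
        return numbers;
    }
}
```

Trimming and skipping empty tokens after the split keeps the regex trivial, at the cost of a couple of extra lines in the loop.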
