I have a very simple dataframe:
+--+------+
|Id|Amount|
+--+------+
|0 |3.47 |
|1 |-3.47 |
|2 |3.47 |
|3 |3.47 |
|4 |2.02 |
|5 |-2.01 |
|6 |-2.01 |
|7 |7.65 |
|8 |7.65 |
+--+------+
I'd like to match lines that cancel each other out, given a threshold value (let's say 0.5).
So in this case, lines 0 and 1 match, 4 and 5 match, and lines 2 and 3 are returned. There are several valid pairings; returning lines 0 and 2 instead is also fine.
The general idea is that the lines should be matched two by two and the leftovers returned: nothing if every line has a match, and all the lines that couldn't be paired otherwise.
Any idea how to do that?
Expected result:
+--+------+
|Id|Amount|
+--+------+
|0 |3.47 |
|2 |3.47 |
|6 |-2.01 |
|7 |7.65 |
|8 |7.65 |
+--+------+
I've been thinking about using a UserDefinedAggregateFunction, but I'm not sure it's enough, especially because I believe it can only return one value per group of lines.
I went with a UDF in the end. Writing UDFs in Java is seriously overcomplicated...
If anybody can see a way to simplify that mess, please post or comment.
private UDF1<WrappedArray<Row>, Row[]> matchData() {
    return (data) -> {
        List<Data> dataList = JavaConversions.seqAsJavaList(data).stream()
                .map(Data::fromRow)
                .collect(Collectors.toList());
        Set<Data> matched = new HashSet<>();
        for (Data element : dataList) {
            if (matched.contains(element)) continue;
            // Find the closest unmatched element of opposite sign that cancels
            // this one within THRESHOLD, and mark both as matched.
            dataList.stream()
                    .filter(e -> !matched.contains(e) && e != element)
                    .filter(e -> Math.abs(e.getAmount() + element.getAmount()) < THRESHOLD
                            && Math.signum(e.getAmount()) != Math.signum(element.getAmount()))
                    .min(Comparator.comparingDouble(e -> Math.abs(e.getAmount() + element.getAmount())))
                    .ifPresent(e -> {
                        matched.add(e);
                        matched.add(element);
                    });
        }
        // If anything is left unpaired, return the whole group; otherwise return nothing.
        if (matched.size() != dataList.size()) {
            return dataList.stream().map(Data::toRow).toArray(Row[]::new);
        } else {
            return new Row[0];
        }
    };
}
With the Data class (using Lombok):
@AllArgsConstructor
@EqualsAndHashCode
@Data
public final class Data {
    private String name;
    private Double amount;

    public static Data fromRow(Row r) {
        return new Data(
                r.getString(r.fieldIndex("name")),
                r.getDouble(r.fieldIndex("amount")));
    }

    public Row toRow() {
        return RowFactory.create(name, amount);
    }
}
I'm returning the whole set when it couldn't be fully paired; that's actually what I need in my case.
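For completeness, this is roughly how the UDF could be wired up. None of this is in the original code: spark, df, and the grouping column groupKey are placeholders, and the struct fields follow the Data class, so treat it as a sketch rather than the actual pipeline.

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;

// Register the UDF with a return type describing an array of structs shaped like Data.toRow().
spark.udf().register("matchData", matchData(),
        DataTypes.createArrayType(DataTypes.createStructType(new StructField[]{
                DataTypes.createStructField("name", DataTypes.StringType, false),
                DataTypes.createStructField("amount", DataTypes.DoubleType, false)})));

// Collect each group into an array, run the UDF on it, then explode the unmatched rows back out.
Dataset<Row> leftovers = df
        .groupBy(col("groupKey"))   // placeholder grouping column
        .agg(collect_list(struct(col("name"), col("amount"))).as("rows"))
        .withColumn("unmatched", explode(callUDF("matchData", col("rows"))))
        .select(col("unmatched.name"), col("unmatched.amount"));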
Related
I have a helper class for an entity:
@Data
@Builder
@AllArgsConstructor
@NoArgsConstructor
public class ChildReports {
    private LocalDate date;
    private BigDecimal amount;
}
I have entries in the database, for example:
| date | amount |
+-----------------+---------------------+
| 2022-06-20 | 10000 |
| 2023-01-15 | 8000 |
| 2023-07-05 | 6500 |
| 2024-02-11 | 5000 |
| 2024-08-18 | 1000 |
Now I want to fill in the gaps between the dates, so that the previous amount is carried over into the months that have no data. At the end it should look something like this:
| date | amount |
+-----------------+---------------------+
| 2022-06-20 | 10000 |
| 2022-07-20 | 10000 |
| 2022-08-20 | 10000 |
| 2022-09-20 | 10000 |
.............
| 2022-12-20 | 10000 |
| 2023-01-15 | 8000 |
| 2023-02-15 | 8000 |
| 2023-03-15 | 8000 |
and so on
In the service, I started writing a method in which I took the entire range of dates, starting from dateStart and ending with dateEnd.
LocalDate dateStart = Objects.requireNonNull(childReports.stream().findFirst().orElse(null)).getDate();
LocalDate dateEnd = Objects.requireNonNull(childReports.stream().reduce((first, second) -> second).orElse(null).getDate());
long monthsBetween = ChronoUnit.MONTHS.between(dateStart, dateEnd);
List<LocalDate> totalMonths = LongStream.iterate(0, i -> i + 1)
        .limit(monthsBetween)
        .mapToObj(dateStart::plusMonths)
        .collect(Collectors.toList());
Map<List<LocalDate>, BigDecimal> map = new HashMap<>();
for (ChildReports childReport : childReports) {
    BigDecimal amount = childReport.getAmount();
    map.put(totalMonths, amount);
}
System.out.println(map);
I get this interval correctly, but now I want to attach the corresponding amount to each date, so that the result comes out as I indicated above.
I can't get this result.
Make sure to adjust the start date to be on the same day of the month as the end date before finding the months between them:
ChronoUnit.MONTHS.between(dateStart.withDayOfMonth(dateEnd.getDayOfMonth()), dateEnd)
The rest should be pretty straightforward using nested loops. Given below is a working sample:
class ReportRow {
    private LocalDate date;
    private BigDecimal amount;

    // Parametrised constructor and getters

    @Override
    public String toString() {
        return date + " | " + amount;
    }
}
public class Solution {
    public static void main(String[] args) {
        List<ReportRow> originalReport = List.of(
                new ReportRow(LocalDate.of(2022, 6, 20), BigDecimal.valueOf(10000)),
                new ReportRow(LocalDate.of(2023, 1, 15), BigDecimal.valueOf(8000)),
                new ReportRow(LocalDate.of(2023, 7, 5), BigDecimal.valueOf(6500)));

        System.out.println("Before:");
        originalReport.forEach(System.out::println);

        List<ReportRow> updatedReport = new ArrayList<>();
        int size = originalReport.size();
        if (size > 0)
            updatedReport.add(originalReport.get(0));

        if (size > 1) {
            for (int i = 1; i < size; i++) {
                ReportRow lastRow = originalReport.get(i - 1);
                ReportRow currentRow = originalReport.get(i);
                BigDecimal lastAmount = lastRow.getAmount();
                LocalDate dateStart = lastRow.getDate();
                LocalDate dateEnd = currentRow.getDate();
                if (ChronoUnit.MONTHS.between(dateStart.withDayOfMonth(dateEnd.getDayOfMonth()), dateEnd) > 1) {
                    for (LocalDate date = dateStart.plusMonths(1); date.isBefore(dateEnd); date = date.plusMonths(1))
                        updatedReport.add(new ReportRow(date, lastAmount));
                }
                updatedReport.add(currentRow);
            }
        }

        System.out.println("After:");
        updatedReport.forEach(System.out::println);
    }
}
Output:
Before:
2022-06-20 | 10000
2023-01-15 | 8000
2023-07-05 | 6500
After:
2022-06-20 | 10000
2022-07-20 | 10000
2022-08-20 | 10000
2022-09-20 | 10000
2022-10-20 | 10000
2022-11-20 | 10000
2022-12-20 | 10000
2023-01-15 | 8000
2023-02-15 | 8000
2023-03-15 | 8000
2023-04-15 | 8000
2023-05-15 | 8000
2023-06-15 | 8000
2023-07-05 | 6500
Note: if you plan to use a Map and intend to maintain the order, you should use LinkedHashMap instead of HashMap.
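For example, a minimal sketch of collecting the filled-in report into an insertion-ordered map keyed by date (updatedReport and the ReportRow getters are the ones from the example above; the merge function only exists to satisfy toMap):

import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

Map<LocalDate, BigDecimal> amountByDate = updatedReport.stream()
        .collect(Collectors.toMap(
                ReportRow::getDate,
                ReportRow::getAmount,
                (first, second) -> first,   // keep the first amount if a date ever repeats
                LinkedHashMap::new));       // LinkedHashMap preserves insertion order, HashMap does not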
I have two dataframes:
Dataframe 1
+-----------------+-----------------+
| hour_Entre | hour_Sortie |
+-----------------+-----------------+
| 18:30:00 | 05:00:00 |
| | |
+-----------------+-----------------+
Dataframe 2
+-----------------+
| hour_Tracking |
+-----------------+
| 19:30:00 |
+-----------------+
I want to keep the hour_Tracking values that fall between hour_Entre and hour_Sortie.
I tried the following code:
boolean checked = true;
try {
    if (df1.select(col("heureSortie")) != null && df1.select(col("heureEntre")) != null) {
        checked = checked && df2.select(col("dateTracking_hour_minute").between(df1.select(col("heureSortie")), df1.select(col("heureEntre"))));
    }
} catch (Exception e) {
    e.printStackTrace();
}
But I get this error:
Operator && cannot be applied to boolean , 'org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>'
In case you are looking for the hour difference:
First, create the date difference:
from pyspark.sql import functions as F
df = df.withColumn('date_diff', F.datediff(F.to_date(df.hour_Entre), F.to_date(df.hour_Sortie)))
Then calculate the hour difference from that:
df = df.withColumn('hours_diff', (df.date_diff * 24) +
                   F.hour(df.hour_Entre) - F.hour(df.hour_Sortie))
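Since the code in the question is Java, the same two steps can be sketched with the Column API as well. This is an untested translation that assumes df is a Dataset<Row> with the columns used above; the underlying problem in the original attempt is that df1.select(...) returns a Dataset, not a boolean, so the comparison has to stay inside Column expressions.

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Day difference first, then the hour difference built on top of it.
Dataset<Row> withDiff = df
        .withColumn("date_diff", datediff(to_date(col("hour_Entre")), to_date(col("hour_Sortie"))))
        .withColumn("hours_diff", col("date_diff").multiply(24)
                .plus(hour(col("hour_Entre")))
                .minus(hour(col("hour_Sortie"))));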
Given the following input table:
+----+------------+----------+
| id | shop | purchases|
+----+------------+----------+
| 1 | 01 | 20 |
| 1 | 02 | 31 |
| 2 | 03 | 5 |
| 1 | 03 | 3 |
+----+------------+----------+
I would like, grouping by id and ordering by purchases, to obtain the top 2 shops, as follows:
+----+-------+------+
| id | top_1 | top_2|
+----+-------+------+
| 1 | 02 | 01 |
| 2 | 03 | |
+----+-------+------+
I'm using Apache Spark 2.0.1, and the first table is the result of other queries and joins on a Dataset. I could probably do this by iterating over the Dataset in plain Java, but I hope there is another way using the Dataset functionality.
My first attempt was the following:
// dataset is already ordered by id, purchases desc
...
Dataset<Row> ds = dataset.repartition(new Column("id"));
ds.foreachPartition(new ForeachPartitionFunction<Row>() {
    @Override
    public void call(Iterator<Row> itrtr) throws Exception {
        int counter = 0;
        while (itrtr.hasNext()) {
            Row row = itrtr.next();
            if (counter < 2)
                // save it into another Dataset
                counter++;
        }
    }
});
But then I was lost on how to save the rows into another Dataset. My goal is, in the end, to save the result into a MySQL table.
Using window functions and pivot, you can first define a window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, row_number}
val w = Window.partitionBy(col("id")).orderBy(col("purchases").desc)
add row_number and filter top two rows:
val dataset = Seq(
(1, "01", 20), (1, "02", 31), (2, "03", 5), (1, "03", 3)
).toDF("id", "shop", "purchases")
val topTwo = dataset.withColumn("top", row_number.over(w)).where(col("top") <= 2)
and pivot:
topTwo.groupBy(col("id")).pivot("top", Seq(1, 2)).agg(first("shop"))
with result being:
+---+---+----+
| id| 1| 2|
+---+---+----+
| 1| 02| 01|
| 2| 03|null|
+---+---+----+
I'll leave converting the syntax to Java as an exercise for the poster (apart from the static imports for the functions, the rest should be close to identical).
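For reference, a rough Java version of the snippet above could look like this (a sketch only, assuming dataset is the Dataset<Row> from the question; it has not been run against it):

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Window over each id, highest purchases first.
WindowSpec w = Window.partitionBy(col("id")).orderBy(col("purchases").desc());

// Rank the shops per id and keep the top two rows.
Dataset<Row> topTwo = dataset
        .withColumn("top", row_number().over(w))
        .where(col("top").leq(2));

// Pivot the rank into columns, taking the shop for each rank.
Dataset<Row> result = topTwo
        .groupBy(col("id"))
        .pivot("top", Arrays.<Object>asList(1, 2))
        .agg(first("shop"));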
I have a result set like this…
+--------------+--------------+----------+--------+
| LocationCode | MaterialCode | ItemCode | Vendor |
+--------------+--------------+----------+--------+
| 1 | 11 | 111 | 1111 |
| 1 | 11 | 111 | 1112 |
| 1 | 11 | 112 | 1121 |
| 1 | 12 | 121 | 1211 |
+--------------+--------------+----------+--------+
And so on for LocationCode 2, 3, 4, etc. I need an object (to be converted to JSON, eventually) as: List<Location>
where the hierarchy of nested objects in the Location class is:
Location.class
    LocationCode
    List<Material>

Material.class
    MaterialCode
    List<Item>

Item.class
    ItemCode
    Vendor
This corresponds to the result set, where one location has two materials, material 11 has two items, and item 111 has two vendors. How do I achieve this? I have used AliasToBeanResultTransformer before, but I doubt it will be of help in this case.
I don't think there is a neat way to do that mapping. I'd just do it with nested loops and custom logic to decide when to start building the next Location, Material, or Item.
Something like this pseudo-code:
while (row = resultSet.next()) {
    if (currentLocation == null || row.locationCode != currentLocation.locationCode) {
        currentLocation = new Location(row.locationCode)
        list.add(currentLocation)
        currentMaterial = null
    }
    if (currentMaterial == null || row.materialCode != currentMaterial.materialCode) {
        currentMaterial = new Material(row.materialCode)
        currentLocation.add(currentMaterial)
    }
    // every row carries an item/vendor pair, so always add it to the current material
    currentMaterial.add(new Item(row.itemCode, row.vendorCode))
}
I have a string like this:
1 | 2 | 3 | 4 | 5 | 6 | 7 | | | | | | | |
The string might also contain more or less data.
I need to remove the | characters and get just the numbers, one by one.
Guava's Splitter Rocks!
String input = "1 | 2 | 3 | 4 | 5 | 6 | 7 | | | | | | | |";
Iterable<String> entries = Splitter.on("|")
.trimResults()
.omitEmptyStrings()
.split(input);
And if you really want to get fancy:
Iterable<Integer> ints = Iterables.transform(entries,
        new Function<String, Integer>() {
            @Override
            public Integer apply(String input) {
                return Integer.parseInt(input);
            }
        });
Although you definitely could use a regex or String.split, I feel that using Splitter is less error-prone and more readable and maintainable. You could argue that String.split might be more efficient, but since you would have to do all the trimming and empty-string checking anyway, I think it will probably even out.
One comment about transform: it does the calculation lazily, on an as-needed basis, which can be great but also means the transform may run multiple times on the same element. Therefore I recommend something like this to perform all the calculations once:
Function<String, Integer> toInt = new Function...
Iterable<Integer> values = Iterables.transform(entries, toInt);
List<Integer> valueList = Lists.newArrayList(values);
You can try using a Scanner:
Scanner sc = new Scanner(myString);
// the delimiter is a regex, so the | has to be escaped (and surrounding spaces swallowed)
sc.useDelimiter("\\s*\\|\\s*");
List<Integer> numbers = new LinkedList<Integer>();
while (sc.hasNext()) {
    if (sc.hasNextInt()) {
        numbers.add(sc.nextInt());
    } else {
        sc.next();
    }
}
Here you go:
String str = "1 | 2 | 3 | 4 | 5 | 6 | 7 | | | | | | | |".replaceAll("\\|", "").replaceAll("\\s+", "");
Do you mean like this?
String s = "1 | 2 | 3 | 4 | 5 | 6 | 7 | | | | | | | |";
for(String n : s.split(" ?\\| ?")) {
int i = Integer.parseInt(n);
System.out.println(i);
}
prints
1
2
3
4
5
6
7
inputString.split("\\s*\\|\\s*") will give you an array of the numbers as strings. Then you need to parse the numbers:
final List<Integer> ns = new ArrayList<>();
for (String n : input.split("\\s*\\|\\s*"))
    ns.add(Integer.parseInt(n));
You can use split with the following regex (allows for extra spaces, tabs and empty buckets):
String input = "1 | 2 | 3 | 4 | 5 | 6 | 7 | | | | | | | | ";
String[] numbers = input.split("([\\s*\\|\\s*])+");
System.out.println(Arrays.toString(numbers));
outputs:
[1, 2, 3, 4, 5, 6, 7]
Or with Java's built-in methods:
String[] data="1 | 2 | 3 | 4 | 5 | 6 | 7 | | | | | | | |".split("|");
for(String elem:data){
elem=elem.trim();
if(elem.length()>0){
// do someting
}
}
Split the string at its delimiter | and then parse the array.
Something like this should do:
String test = "|1|2|3";
String delimiter = "|";
String[] testArray = test.split(delimiter);
List<Integer> values = new ArrayList<Integer>();
for (String string : testArray) {
int number = Integer.parseInt(string);
values.add(number);
}