Parsing key value pairs as Hive Dataset rows using java spark - java

I have an HDFS file with the following data:
key1=value1 key2=value2 key3=value3...
key1=value11 key2=value12 key3=value13..
We use an internal framework that passes the Dataset to a Java method, which should transform it as below and write it into a Hive table:
The keys should become the Hive column names.
Each row is formed by splitting the dataset fields on the = delimiter and taking the value to the right.
Expected Output:
key1      | key2        | key3
----------+-------------+----------
value1    | value2      | value3
value11   | value12     | value13
The HDFS file would have roughly 60 key-value pairs per line, so it is impractical to call withColumn() on the Dataset manually for each one. Any help is appreciated.
Edit1:
This is what I have written so far. Dataset.withColumn() doesn't seem to work in a loop beyond the first iteration:
String[] columnNames = new String[dataset.columns().length];
String unescapedColumn;
Row firstRow = (Row) dataset.first();
String[] rowData = firstRow.mkString(",").split(",");
for (int i = 0; i < rowData.length; i++) {
    unescapedColumn = rowData[i].split("=")[0];
    if (unescapedColumn.contains(">")) {
        columnNames[i] = unescapedColumn.substring(unescapedColumn.indexOf(">") + 1).trim();
    } else {
        columnNames[i] = unescapedColumn.trim();
    }
}
Dataset<Row> namedDataset = dataset.toDF(columnNames);
for (String column : namedDataset.columns()) {
    System.out.println("Column name: " + column);
    namedDataset = namedDataset.withColumn(column, functions.substring_index(namedDataset.col(column), "=", -1));
}
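For what it's worth, one way to sidestep repeated withColumn() calls is to build all the transformed columns up front and apply a single select(). A minimal sketch, assuming (as in the snippet above) that dataset already has one column per key=value token and that columnNames was extracted from the first row:
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;

// Rename the columns, then keep only the text to the right of the '='
// in every column, in one pass instead of a withColumn() per iteration.
Dataset<Row> named = dataset.toDF(columnNames);
Column[] values = new Column[columnNames.length];
for (int i = 0; i < columnNames.length; i++) {
    values[i] = substring_index(col(columnNames[i]), "=", -1).alias(columnNames[i]);
}
Dataset<Row> result = named.select(values);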

Related

Compare and Highlight the differences of two dataframes using spark and java

I am using Spark and Java to try to compare two data frames.
Once I convert my csv files into data frames, I want to highlight exactly what changed between the two dataframes.
They all have the same columns in common.
As you can see, the only thing that differs between the two data frames below is the row with emp_id 4 in df2.
Dataset<Row> df1 = spark.read().csv("/Users/dataframeOne.csv");
Dataset<Row> df2 = spark.read().csv("/Users/dataframeTwo.csv");
df1.unionAll(df2).except(df1.intersect(df2)).show(true);
Df1
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sanjose| romin|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
Df2
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sanjose| romino|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
Difference
+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
| 4| sanjose| romino|9848022331| 45123|SanRamon|
+------+--------+--------+----------+-------+--------+
How can I highlight 'romino', the incorrect field, in yellow using Java and Spark?
Highlighting something depends on your GUI rather than on Spark, so as a first step I would suggest detecting the differing values and adding the information about the differences as an additional column of the dataframe.
Step 1: Add a suffix to all columns of the two dataframes and join them on the primary key (emp_id):
import static org.apache.spark.sql.functions.*;

private static Dataset<Row> suffix(Dataset<Row> df, String suffix) {
    for (String col : df.columns()) df = df.withColumnRenamed(col, col + suffix);
    return df;
}
[...]
Dataset<Row> df1 = spark.read().option("header", "true").csv(...);
Dataset<Row> df2 = spark.read().option("header", "true").csv(...);
String[] columns = df1.columns();
Dataset<Row> joined = suffix(df1, "_1").join(suffix(df2, "_2"),
        col("emp_id_1").eqNullSafe(col("emp_id_2")), "full_outer");
Step 2: Create a list of column expressions that check whether the value from one table differs from the value in the other. This list will later be used as the input parameter for the map function:
List<Column> diffs = new ArrayList<>();
for (String column : columns) {
    diffs.add(lit(column));
    diffs.add(when(col(column + "_1").eqNullSafe(col(column + "_2")), null)
            .otherwise(concat_ws("/", col(column + "_1"), col(column + "_2"))));
}
Step 3: Create a new column containing a map with all differences:
joined.withColumn("differences", map(diffs.toArray(new Column[]{})))
        .withColumn("differences", map_filter(col("differences"), (k, v) -> not(v.isNull())))
        .select("emp_id_1", "differences")
        .filter(size(col("differences")).gt(0))
        .show(false);
Output:
+--------+--------------------------+
|emp_id_1|differences |
+--------+--------------------------+
|4 |{emp_name -> romin/romino}|
+--------+--------------------------+

Write text to a file in Java

I'm trying to write a simple output to a file but I'm getting the wrong output. This is my code:
Map<Integer, List<Client>> hashMapClients = clients.stream()
        .collect(Collectors.groupingBy(Client::getDay));
Map<Integer, List<Transaction>> hashMapTransactions = transactions.stream()
        .collect(Collectors.groupingBy(Transaction::getDay));
// DAYS
String data;
for (Integer key : hashMapClients.keySet()) {
    data = key + " | ";
    for (int i = 0; i < hashMapClients.get(key).size(); i++) {
        data += hashMapClients.get(key).get(i).getType() + " | "
                + hashMapClients.get(key).get(i).getAmountOfClients() + ", ";
        writer.println(data);
    }
}
I get this output
1 | Individual | 0,
1 | Individual | 0, Corporation | 0,
2 | Individual | 0,
2 | Individual | 0, Corporation | 0,
But it should be the following, and it should not end with a comma after the last entry:
1 | Individual | 0, Corporation | 0
2 | Individual | 0, Corporation | 0
What am I doing wrong?
It sounds like you only want to write data to the output in the outer loop, not the inner loop; the inner loop is just for building up the data value to write. Something like this:
String data;
for (Integer key : hashMapClients.keySet()) {
    // initialize the value
    data = key + " | ";
    // build the value
    for (int i = 0; i < hashMapClients.get(key).size(); i++) {
        data += hashMapClients.get(key).get(i).getType() + " | "
                + hashMapClients.get(key).get(i).getAmountOfClients() + ", ";
    }
    // write the value
    writer.println(data);
}
Edit: Thanks for pointing out that the trailing separator also still needs to be removed. Since each entry appends the two-character ", ", without more error checking that could be as simple as:
data = data.substring(0, data.length() - 2);
You can add error checking as your logic requires, perhaps confirming that the value indeed ends with the separator or that the inner loop executed at least once, etc.
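For instance, a minimal guard along those lines (a sketch, assuming the two-character ", " separator used above):
if (data.endsWith(", ")) {
    // drop the trailing separator only when it is actually present
    data = data.substring(0, data.length() - 2);
}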
One problem is that you are calling println after every Client, rather than waiting until the whole list is built. Then, to fix the problem with the trailing comma, you can use a joining collector.
Map<Integer, List<Client>> clientsByDay = clients.stream()
        .collect(Collectors.groupingBy(Client::getDay));
/* Iterate over key-value pairs */
for (Map.Entry<Integer, List<Client>> e : clientsByDay.entrySet()) {
    /* Print the key */
    writer.print(e.getKey());
    /* Print a separator */
    writer.print(" | ");
    /* Print the value */
    writer.println(e.getValue().stream()
            /* Convert each Client to a String in the desired format */
            .map(c -> c.getType() + " | " + c.getAmountOfClients())
            /* Join the clients together in a comma-separated list */
            .collect(Collectors.joining(", ")));
}
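With the joining collector, each day then prints on a single line with no trailing comma, matching the expected output from the question:
1 | Individual | 0, Corporation | 0
2 | Individual | 0, Corporation | 0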

Split Java Spring

I have a db table with 2 columns, key and value. Example record:
------------------------------------
| key | value |
------------------------------------
| A | 1,desc 1;2,desc 2;3,desc 3 |
------------------------------------
I want to split the value column into JSON format:
[{"key":"1","value":"desc 1"},{"key":"2","value":"desc 2"},{"key":"3","value":"desc 3"}]
Where should I put the split function? In a service? It is difficult because there are two levels of splitting. How can I solve this problem?
Thanks,
Bobby
That depends on how your application usually works with this value. If the usual case is using some specific data from this column, I would parse it at the repository level already:
import java.util.stream.Stream;
import org.json.JSONArray;
import org.json.JSONObject;

public static void main(String[] args) {
    // You actually get this from the DB
    String value = "1,desc 1;2,desc 2;3,desc 3";
    JSONArray j = new JSONArray();
    Stream.of(value.split(";")).forEach(pair -> {
        String[] keyValue = pair.split(",");
        JSONObject o = new JSONObject();
        o.put("key", keyValue[0]);
        o.put("value", keyValue[1]);
        j.put(o);
    });
    System.out.println(j);
}
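If the application is a Spring/JPA one, another option for keeping this at the persistence layer is a JPA AttributeConverter, so the entity exposes the parsed pairs directly. A sketch under that assumption; KeyValueListConverter and the List<String[]> representation are illustrative, not from the original:
import java.util.ArrayList;
import java.util.List;
import javax.persistence.AttributeConverter;
import javax.persistence.Converter;

// Hypothetical converter: keeps the "1,desc 1;2,desc 2;3,desc 3" encoding in the
// database column while the entity works with parsed key/value pairs.
@Converter
public class KeyValueListConverter implements AttributeConverter<List<String[]>, String> {

    @Override
    public String convertToDatabaseColumn(List<String[]> attribute) {
        StringBuilder sb = new StringBuilder();
        for (String[] pair : attribute) {
            if (sb.length() > 0) sb.append(';');
            sb.append(pair[0]).append(',').append(pair[1]);
        }
        return sb.toString();
    }

    @Override
    public List<String[]> convertToEntityAttribute(String dbData) {
        List<String[]> result = new ArrayList<>();
        for (String pair : dbData.split(";")) {
            result.add(pair.split(",", 2)); // limit 2 in case a description contains a comma
        }
        return result;
    }
}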

Data manipulation on all columns in Dataset with Java API

After reading a csv file into a Dataset, I want to remove the spaces from String type data using the Java API.
Apache Spark 2.0.0
Dataset<Row> dataset = sparkSession.read().format("csv").option("header", "true").load("/pathToCsv/data.csv");
Dataset<String> dataset2 = dataset.map(new MapFunction<Row, String>() {
    @Override
    public String call(Row value) throws Exception {
        return value.getString(0).replace(" ", "");
        // But this will remove spaces from only the first column
    }
}, Encoders.STRING());
By using a MapFunction, I am not able to remove the spaces from all columns.
But in Scala, I am able to perform the desired operation in spark-shell as follows:
val ds = spark.read.format("csv").option("header", "true").load("/pathToCsv/data.csv")
val opds = ds.select(ds.columns.map(c => regexp_replace(col(c), " ", "").alias(c)): _*)
The Dataset opds has the data without spaces. I want to achieve the same in Java, but in the Java API the columns method returns a String[] and I am not able to apply functional programming over the Dataset.
Input Data
+----------------+----------+-----+---+---+
| x| y| z| a| b|
+----------------+----------+-----+---+---+
| Hello World|John Smith|There| 1|2.3|
|Welcome to world| Bob Alice|Where| 5|3.6|
+----------------+----------+-----+---+---+
Expected Output Data
+--------------+---------+-----+---+---+
| x| y| z| a| b|
+--------------+---------+-----+---+---+
| HelloWorld|JohnSmith|There| 1|2.3|
|Welcometoworld| BobAlice|Where| 5|3.6|
+--------------+---------+-----+---+---+
Try:
for (String col : dataset.columns()) {
    dataset = dataset.withColumn(col, regexp_replace(dataset.col(col), " ", ""));
}
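The Scala select from the question can also be translated to Java almost one-to-one by building a Column[] first. A sketch, assuming Java 8+ and the same dataset variable:
import java.util.Arrays;
import org.apache.spark.sql.Column;
import static org.apache.spark.sql.functions.*;

// Build one cleaned Column per input column, then apply a single select,
// mirroring the Scala one-liner from the question.
Column[] cleaned = Arrays.stream(dataset.columns())
        .map(c -> regexp_replace(col(c), " ", "").alias(c))
        .toArray(Column[]::new);
Dataset<Row> result = dataset.select(cleaned);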
You can try the following regex to remove whitespace within strings:
value.getString(0).replaceAll("\\s+", "");
About \s+: it matches any whitespace character, between one and unlimited times, as many times as possible.
Use the replaceAll function instead of replace.
More about the replace and replaceAll functions: Difference between String replace() and replaceAll()

java reading numbers, interpreting as octal, want interpreted as string

I am having an issue where Java reads an array list of numbers or strings from a YAML file and interprets the numbers as octal if they have a leading 0 and contain no digit 8 or 9.
Is there a way to force Java to read the YAML field as a string?
Code:
ArrayList recordrarray = (ArrayList) sect.get("recordnum");
if (recordrarray != null) {
    recno = join(recordrarray, " ");
}
I have also tried:
Iterator<String> iter = recordrarray.iterator();
if (iter.hasNext()) recno = " " + String.valueOf(iter.next());
System.out.println(" this recnum:" + recno);
while (iter.hasNext()) {
    String next = String.valueOf(iter.next()); // read each element only once per iteration
    recno += next;
    System.out.println(" done recnum:" + next);
}
The input is such:
061456 changes to 25390
061506 changes to 25414
061559 -> FINE
It took a while to figure out what it was doing; apparently this is a common issue in Java.
Any ideas?
Thanks
Edit: I am using jvyaml.
yaml:
22:
  country_code: ' '
  description: ''
  insection: 1
  recordnum:
    - 061264
  type: misc
yaml loading:
import org.jvyaml.YAML;

Map structure = (Map) YAML.load(new FileReader(structurefn)); // load the structure file
Where are you reading the file? The problem lies in where the file contents are being read. Most likely the recordrarray list already contains integers, i.e. they have already been parsed. Find the place where the records are being read. Maybe you are doing something like this:
int val = Integer.decode(record); // decode() treats a leading 0 as octal
Use this instead:
int val = Integer.parseInt(record, 10);
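If the conversion happens inside the YAML parser itself (YAML 1.1 resolvers treat an unquoted scalar with a leading 0 and only digits 0-7 as an octal integer, which matches the observed 061456 -> 25390), another option is to quote the values in the YAML source so they are always read as strings. A sketch of the quoted form:
recordnum:
  - '061264'   # a quoted scalar is always read as a string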
