Talend - generating n multiple rows from 1 row - java

Background: I'm using Talend to do something (I guess) that is pretty common: generating multiple rows from one. For example:
ID | Name | DateFrom | DateTo
01 | Marco| 01/01/2014 | 04/01/2014
...could be split into:
new_ID | ID | Name | DateFrom | DateTo
01 | 01 | Marco | 01/01/2014 | 02/01/2014
02 | 01 | Marco | 02/01/2014 | 03/01/2014
03 | 01 | Marco | 03/01/2014 | 04/01/2014
The number of resulting rows is dynamic, depending on the date period in the original row.
Question: how can I do this? Maybe using tSplitRow? I am going to check those periods with tJavaRow. Any suggestions?

Expanding on the answer given by Balazs Gunics
The first part is to calculate the number of rows one input row will become; easy enough with a date-diff function on the from and to dates.
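Inside Talend you would typically use TalendDate.diffDate for this; outside Talend the same day count can be sketched with plain java.time (a sketch, not the Talend API):

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class RowCount {
    public static void main(String[] args) {
        // One output row per day between DateFrom (inclusive) and DateTo (exclusive)
        LocalDate from = LocalDate.of(2014, 1, 1);
        LocalDate to = LocalDate.of(2014, 1, 4);
        long iterations = ChronoUnit.DAYS.between(from, to);
        System.out.println(iterations); // prints 3: this row expands into 3 rows
    }
}
```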
Part 2 is to pass that value to a tFlowToIterate, and pick it up with a tJavaFlex that uses it in its Begin code to control a for loop:
tJavaFlex Begin code (opens the loop; the iteration count is read once into a variable):
int currentId = (Integer)globalMap.get("out1.id");
String currentName = (String)globalMap.get("out1.name");
Long iterations = (Long)globalMap.get("out1.iterations");
Date dateFrom = (java.util.Date)globalMap.get("out1.dateFrom");
for (int i = 0; i < iterations; i++) {
tJavaFlex Main code (runs once per iteration, filling the output row):
    row2.id = currentId;
    row2.name = currentName;
    row2.dateFrom = TalendDate.addDate(dateFrom, i, "dd");
    row2.dateTo = TalendDate.addDate(dateFrom, i + 1, "dd");
tJavaFlex End code (closes the loop):
}
and sample output:
1|Marco|01-01-2014|02-01-2014
1|Marco|02-01-2014|03-01-2014
1|Marco|03-01-2014|04-01-2014
2|Polo|01-01-2014|02-01-2014
2|Polo|02-01-2014|03-01-2014
2|Polo|03-01-2014|04-01-2014
2|Polo|04-01-2014|05-01-2014
2|Polo|05-01-2014|06-01-2014
2|Polo|06-01-2014|07-01-2014
2|Polo|07-01-2014|08-01-2014
2|Polo|08-01-2014|09-01-2014
2|Polo|09-01-2014|10-01-2014
2|Polo|10-01-2014|11-01-2014
2|Polo|11-01-2014|12-01-2014
2|Polo|12-01-2014|13-01-2014
2|Polo|13-01-2014|14-01-2014
2|Polo|14-01-2014|15-01-2014
2|Polo|15-01-2014|16-01-2014
2|Polo|16-01-2014|17-01-2014
2|Polo|17-01-2014|18-01-2014
2|Polo|18-01-2014|19-01-2014
2|Polo|19-01-2014|20-01-2014
2|Polo|20-01-2014|21-01-2014
2|Polo|21-01-2014|22-01-2014
2|Polo|22-01-2014|23-01-2014
2|Polo|23-01-2014|24-01-2014
2|Polo|24-01-2014|25-01-2014
2|Polo|25-01-2014|26-01-2014
2|Polo|26-01-2014|27-01-2014
2|Polo|27-01-2014|28-01-2014
2|Polo|28-01-2014|29-01-2014
2|Polo|29-01-2014|30-01-2014
2|Polo|30-01-2014|31-01-2014
2|Polo|31-01-2014|01-02-2014
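Outside Talend, the same Begin/Main/End loop can be sketched in plain Java (the output row row2 is replaced here by a printed line, and TalendDate by java.time; dates are formatted dd-MM-yyyy as in the output above):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class ExpandRow {
    public static void main(String[] args) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("dd-MM-yyyy");
        int id = 1;
        String name = "Marco";
        LocalDate dateFrom = LocalDate.of(2014, 1, 1);
        long iterations = 3; // the day count computed in part 1
        for (int i = 0; i < iterations; i++) {
            // Each iteration emits one row covering a single day
            System.out.println(id + "|" + name + "|"
                    + dateFrom.plusDays(i).format(fmt) + "|"
                    + dateFrom.plusDays(i + 1).format(fmt));
        }
    }
}
```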

You can use tJavaFlex to do this.
If you have a small number of columns, then a tFlowToIterate -> tJavaFlex combination could be fine.
In the Begin part you start the iteration, and in the Main part you assign values to the output schema. If you name your output row6, then:
row6.id = (String)globalMap.get("id");
and so on.

I came here because I wanted to dump all context parameters into an Excel data sheet. The solution below works when you take 0 input lines, but it can be adapted to generate several lines for each line in input.
The design is actually straightforward:
tJava --(On Subjob Ok)--> tFileInputDelimited --> tDoSomethingOnRowSet
  |                            ^
  [writes the CSV]             [reads the CSV]
And here is the kind of code structure usable in the tJava (java.io.FileWriter and java.io.IOException need to be imported in the component's advanced settings):
try {
    StringBuilder wad = new StringBuilder();
    wad.append("Key;Nub"); // Header
    context.stringPropertyNames().forEach(
        key -> wad
            .append(System.getProperty("line.separator"))
            .append(key + ";" + context.getProperty(key))
    );
    // Here context.metadata contains the path to the CSV file
    FileWriter output = new FileWriter(context.metadata);
    output.write(wad.toString());
    output.close();
} catch (IOException mess) {
    System.out.println("An error occurred.");
    mess.printStackTrace();
}
Of course if you have a set of rows as input, you can adapt the process to use a tJavaRow instead of a tJava.
You might prefer to use an Excel file as an on-disk buffer, but dealing with that file format takes more work, at least the first time, when you don't have the Java libraries already configured in Talend. Apache POI can help if you nonetheless choose to go this way.

Related

How to merge two parquet files having different schema in spark (java)

I have 2 parquet files with different numbers of columns and am trying to merge them with the following code snippet:
Dataset<Row> dataSetParquet1 = testSparkSession.read().option("mergeSchema",true).parquet("D:\\ABC\\abc.parquet");
Dataset<Row> dataSetParquet2 = testSparkSession.read().option("mergeSchema",true).parquet("D:\\EFG\\efg.parquet");
dataSetParquet1.unionByName(dataSetParquet2);
// dataSetParquet1.union(dataSetParquet2);
for unionByName() I get the error:
Caused by: org.apache.spark.sql.AnalysisException: Cannot resolve column name
for union() I get the error:
Caused by: org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 7 columns and the second table has 6 columns;;
How do I merge these files using spark in java?
UPDATE : Example
dataset 1:
epochMillis | one | two | three| four
--------------------------------------
1630670242000 | 1 | 2 | 3 | 4
1630670244000 | 1 | 2 | 3 | 4
1630670246000 | 1 | 2 | 3 | 4
dataset2 :
epochMillis | one | two | three|five
---------------------------------------
1630670242000 | 11 | 22 | 33 | 55
1630670244000 | 11 | 22 | 33 | 55
1630670248000 | 11 | 22 | 33 | 55
Final dataset after merging:
epochMillis | one | two | three|four |five
--------------------------------------------
1630670242000 | 11 | 22 | 33 |4 |55
1630670244000 | 11 | 22 | 33 |4 |55
1630670246000 | 1 | 2 | 3 |4 |null
1630670248000 | 11 | 22 | 33 |null |55
how to obtain this result for merging two Datasets?
You can use the mergeSchema option and pass all the paths of the parquet files you want to merge to the parquet method, as follows:
Dataset<Row> finalDataset = testSparkSession.read()
.option("mergeSchema", true)
.parquet("D:\\ABC\\abc.parquet", "D:\\EFG\\efg.parquet");
All columns present in the first dataset but not in the second will be set to null in the rows coming from the second dataset, and vice versa.
To merge two rows that come from two different dataframes, you first join the two dataframes, then select the right columns according to how you want to merge.
So for your case, it means:
Read the two dataframes separately from their parquet locations
Join the two dataframes on their epochMillis column, using a full_outer join, as you want to keep the rows present in one dataframe but not in the other
From the new dataframe, which has all the columns of the two dataframes (duplicated), select the merged columns using a function columnMerges (implementation below)
[Optional] Reorder the final dataframe by epochMillis
Translated into code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> dataframe1 = testSparkSession.read().parquet("D:\\ABC\\abc.parquet");
Dataset<Row> dataframe2 = testSparkSession.read().parquet("D:\\EFG\\efg.parquet");

Dataset<Row> merged = dataframe1
    .join(dataframe2, dataframe1.col("epochMillis").equalTo(dataframe2.col("epochMillis")), "full_outer")
    .select(Selector.columnMerges(dataframe2, dataframe1))
    .orderBy("epochMillis");
Note: when we read the parquet files there is no need for the mergeSchema option, since each dataframe reads a single parquet file and thus has a single schema.
For the merge function Selector.columnMerges, for each row what we want to do is:
if the column is present in both dataframes, take the value from dataframe2 if it is not null, else take the value from dataframe1
if the column is only present in dataframe2, take the value from dataframe2
if the column is only present in dataframe1, take the value from dataframe1
So we first build the set of columns of dataframe1, the set of columns of dataframe2, and the deduplicated list of columns from both dataframes. Then we iterate over this list of columns, applying the previous rules to each one:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.when;

public class Selector {
    public static Column[] columnMerges(Dataset<Row> main, Dataset<Row> second) {
        List<Column> columns = new ArrayList<>();
        Set<String> columnsFromMain = new HashSet<>(Arrays.asList(main.columns()));
        Set<String> columnsFromSecond = new HashSet<>(Arrays.asList(second.columns()));
        List<String> columnNames = new ArrayList<>(Arrays.asList(main.columns()));
        for (String column : second.columns()) {
            if (!columnsFromMain.contains(column)) {
                columnNames.add(column);
            }
        }
        for (String column : columnNames) {
            if (columnsFromMain.contains(column) && columnsFromSecond.contains(column)) {
                columns.add(when(main.col(column).isNull(), second.col(column)).otherwise(main.col(column)).as(column));
            } else if (columnsFromMain.contains(column)) {
                columns.add(main.col(column).as(column));
            } else {
                columns.add(second.col(column).as(column));
            }
        }
        return columns.toArray(new Column[0]);
    }
}

RegEx for normalising UK telephone number

I am trying to normalise UK telephone numbers to international format.
The following strings should resolve to: +447834012345
07834012345
+447834012345
+4407834012345
+44 (0) 7834 012345
+44 0 7834 012345
004407834012345
0044 (0) 7834012345
00 44 0 7834012345
So far, I have got this:
"+44" + mobile.replaceAll("[^0-9]0*(44)?0*", "")
This doesn't quite cut it, as I am having problems with leading 0s etc.; see the table below. I'd like to refrain from using the global flag if possible.
Mobile | Normalised |
--------------------+--------------------+------
07834012345 | +4407834012345 | FAIL
+447834012345 | +447834012345 | PASS
+4407834012345 | +447834012345 | PASS
+44 (0) 7834 012345 | +44783412345 | FAIL
+44 0 7834 012345 | +44783412345 | FAIL
004407834012345 | +44004407834012345 | FAIL
0044 (0) 7834012345 | +4400447834012345 | FAIL
00 44 0 7834012345 | +44007834012345 | FAIL
+4407834004445 | +447834004445 | PASS
Thanks
If you still want the regex, I was able to get it working like this:
"+44" + mobile.replaceAll("[^0-9]", "")
        .replaceAll("^0{0,2}(44){0,2}0{0,1}(\\d{10})", "$2")
EDIT: Changed the code to reflect failed tests. Removed non-numeric characters before running the regex.
EDIT: Update code based on comments.
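Wrapped into a small harness, the two-step replaceAll can be checked against the sample inputs (a sketch; normalise is a hypothetical helper name, not part of any library):

```java
public class NormalisePhone {
    static String normalise(String mobile) {
        // Step 1: drop everything that is not a digit
        // Step 2: strip the 00 / 44 / 0 prefixes in front of the 10-digit number
        return "+44" + mobile.replaceAll("[^0-9]", "")
                .replaceAll("^0{0,2}(44){0,2}0{0,1}(\\d{10})", "$2");
    }

    public static void main(String[] args) {
        String[] inputs = {
            "07834012345", "+447834012345", "+4407834012345",
            "+44 (0) 7834 012345", "004407834012345",
            "0044 (0) 7834012345", "00 44 0 7834012345"
        };
        for (String s : inputs) {
            System.out.println(normalise(s)); // each line prints +447834012345
        }
    }
}
```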
Like my answer here, I would also suggest looking at Google's libphonenumber library. I know it is not regex, but it does exactly what you want.
An example of how to do it in Java (the library is also available in other languages) would be the following, from the documentation:
Let's say you have a string representing a phone number from
Switzerland. This is how you parse/normalize it into a PhoneNumber
object:
String swissNumberStr = "044 668 18 00";
PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
try {
PhoneNumber swissNumberProto = phoneUtil.parse(swissNumberStr, "CH");
} catch (NumberParseException e) {
System.err.println("NumberParseException was thrown: " + e.toString());
}
At this point, swissNumberProto contains:
{
"country_code": 41,
"national_number": 446681800
}
PhoneNumber is a class that is auto-generated from the
phonenumber.proto with necessary modifications for efficiency. For
details on the meaning of each field, refer to
https://github.com/googlei18n/libphonenumber/blob/master/resources/phonenumber.proto
Now let us validate whether the number is valid:
boolean isValid = phoneUtil.isValidNumber(swissNumberProto); // returns true
There are a few formats supported by the formatting method, as
illustrated below:
// Produces "+41 44 668 18 00"
System.out.println(phoneUtil.format(swissNumberProto, PhoneNumberFormat.INTERNATIONAL));
// Produces "044 668 18 00"
System.out.println(phoneUtil.format(swissNumberProto, PhoneNumberFormat.NATIONAL));
// Produces "+41446681800"
System.out.println(phoneUtil.format(swissNumberProto, PhoneNumberFormat.E164));

Writing data to text file in table format

So far I have this:
File dir = new File("C:\\Users\\User\\Desktop\\dir\\dir1\\dir2");
dir.mkdirs();
File file = new File(dir, "filename.txt");
FileWriter archivo = new FileWriter(file);
archivo.write(String.format("%20s %20s", "column 1", "column 2 \r\n"));
archivo.write(String.format("%20s %20s", "data 1", "data 2"));
archivo.flush();
archivo.close();
However, the file output comes out misaligned (screenshot omitted), which I do not like at all.
How can I make a better table format for the output of a text file?
Would appreciate any assistance.
Thanks in advance!
EDIT: Fixed!
Also, instead of looking like:
            column 1             column 2
              data 1               data 2
How can I make it look like this:
column 1             column 2
data 1               data 2
Would prefer it that way.
The \r\n is being evaluated as part of the second parameter, so it is basically calculating the required space as something like 20 - "column 2".length() - " \r\n".length(); since the second line doesn't have this, it takes less space and looks misaligned.
Try adding the \r\n to the base format instead, for example:
String.format("%20s %20s \r\n", "column 1", "column 2")
This generates something like the following in my tests:
            column 1             column 2
              data 1               data 2
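A minimal runnable version of that fix, printing to stdout instead of a file (same format strings, with %n as the line terminator):

```java
public class TableFormat {
    public static void main(String[] args) {
        // The line break lives in the format string, so both columns pad to 20 chars
        System.out.print(String.format("%20s %20s%n", "column 1", "column 2"));
        System.out.print(String.format("%20s %20s%n", "data 1", "data 2"));
    }
}
```

Both lines come out 41 characters wide, so the right-aligned columns line up.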
I think you are trying to get data in tabular format. I've developed a Java library that can build fairly complex tables with plenty of customization. You can get the source code here. Following are some of the basic table views the library can create. Hope this is useful enough!
COLUMN WISE GRID(DEFAULT)
+--------------+--------------+--------------+--------------+-------------+
|NAME |GENDER |MARRIED | AGE| SALARY($)|
+--------------+--------------+--------------+--------------+-------------+
|Eddy |Male |No | 23| 1200.27|
|Libby |Male |No | 17| 800.50|
|Rea |Female |No | 30| 10000.00|
|Deandre |Female |No | 19| 18000.50|
|Alice |Male |Yes | 29| 580.40|
|Alyse |Female |No | 26| 7000.89|
|Venessa |Female |No | 22| 100700.50|
+--------------+--------------+--------------+--------------+-------------+
FULL GRID
+------------------------+-------------+------+-------------+-------------+
|NAME |GENDER |MARRIE| AGE| SALARY($)|
+------------------------+-------------+------+-------------+-------------+
|Eddy |Male |No | 23| 1200.27|
+------------------------+-------------+------+-------------+-------------+
|Libby |Male |No | 17| 800.50|
+------------------------+-------------+------+-------------+-------------+
|Rea |Female |No | 30| 10000.00|
+------------------------+-------------+------+-------------+-------------+
|Deandre |Female |No | 19| 18000.50|
+------------------------+-------------+------+-------------+-------------+
|Alice |Male |Yes | 29| 580.40|
+------------------------+-------------+------+-------------+-------------+
|Alyse |Female |No | 26| 7000.89|
+------------------------+-------------+------+-------------+-------------+
|Venessa |Female |No | 22| 100700.50|
+------------------------+-------------+------+-------------+-------------+
NO GRID
NAME GENDER MARRIE AGE SALARY($)
Alice Male Yes 29 580.40
Alyse Female No 26 7000.89
Eddy Male No 23 1200.27
Rea Female No 30 10000.00
Deandre Female No 19 18000.50
Venessa Female No 22 100700.50
Libby Male No 17 800.50
Eddy Male No 23 1200.27
Libby Male No 17 800.50
Rea Female No 30 10000.00
Deandre Female No 19 18000.50
Alice Male Yes 29 580.40
Alyse Female No 26 7000.89
Venessa Female No 22 100700.50
You're currently including " \r\n" within your right-aligned second argument. I suspect you don't want the space at all, and you don't want the \r\n to be part of the count of 20 characters.
To left-align instead of right-aligning, use the - flag, i.e. %-20s instead of %20s. See the Formatter documentation for more information.
Additionally, you can make the code work in a more cross-platform way by using %n to represent the current platform's line terminator (unless you specifically want a Windows-style file).
I'd recommend the use of Files.newBufferedWriter as well, as that allows you to specify the character encoding (and will use UTF-8 otherwise, which is better than using the platform default)... and use a try-with-resources statement to close the writer even in the face of an exception:
try (Writer writer = Files.newBufferedWriter(file.toPath())) {
writer.write(String.format("%-20s %-20s%n", "column 1", "column 2"));
writer.write(String.format("%-20s %-20s%n", "data 1", "data 2"));
}
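A self-contained version of that suggestion, writing to a temp file and reading it back to show the left-aligned result (the temp-file path is just for the demo):

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;

public class LeftAlignDemo {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("table", ".txt");
        try (Writer writer = Files.newBufferedWriter(file)) { // UTF-8 by default
            writer.write(String.format("%-20s %-20s%n", "column 1", "column 2"));
            writer.write(String.format("%-20s %-20s%n", "data 1", "data 2"));
        }
        // Read the table back; the second column starts at offset 21 on every line
        Files.readAllLines(file).forEach(System.out::println);
        Files.deleteIfExists(file);
    }
}
```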
try {
PrintWriter outputStream = new PrintWriter("myObjects.txt");
outputStream.println(String.format("%-20s %-20s %-20s", "Name", "Age", "Gender"));
outputStream.println(String.format("%-20s %-20s %-20s", "John", "30", "Male"));
outputStream.close();
} catch (IOException e) {
e.printStackTrace();
}
It also works with printf, in case you want to output variables instead of hard-coding them:
try {
    PrintWriter resultData = new PrintWriter("Result.txt");
    resultData.println("THE RESULTS OF THE OPERATIONS\n");
    for (int i = 0; i < 15; i++) {
        resultData.printf("%-20d%-20d%n", finalScores[i], midSemScores[i]);
    }
    resultData.close();
} catch (IOException e) {
    System.out.println("An error occurred");
    e.printStackTrace();
}

read pdf from itext

I have made a table in a PDF using iText in a Java web application.
PDF Generated is:
Gender | Column 1 | Column 2 | Column 3
Male | 1845 | 645 | 254
Female | 214 | 457 | 142
On reading the pdf I used the following code:
ArrayList<PdfPRow> allrows = firstable.getRows();
for (PdfPRow currentrow : allrows) {
    PdfPCell[] allcells = currentrow.getCells();
    System.out.println("CurrentRow -> " + currentrow.getCells());
    for (PdfPCell currentcell : allcells) {
        ArrayList<Element> elements = (ArrayList<Element>) currentcell.getCompositeElements();
        System.out.println("Element -> " + elements.toString());
    }
}
How can I read the text from the pdf columns and pass it to int variables?
Why don't you generate the columns of the PDF as form fields, so that reading them back will be much easier?

How can I display a graphical (ascii text graphics) representation of the decision tree in WEKA using graph() or tograph()

I'm trying to display the decision tree generated by different classifiers using the WEKA classes in my own program. Specifically I'm using two different ones: J48 (the C4.5 implementation) and RandomTree. One has the method graph() and the other toGraph(), which appear to have the same functionality for their respective classes.
Since they both declare java.lang.String as their return type, I was expecting to see something like what you see when using the Explorer app:
act = STRETCH
| size = SMALL
| | Color = YELLOW
| | | age = ADULT : T (1/0)
| | | age = CHILD : F (1/0)
| | Color = PURPLE
| | | age = ADULT : T (2/0)
| | | age = CHILD : F (1/0)
| size = LARGE
| | age = ADULT : T (4/0)
| | age = CHILD : F (2/0)
act = DIP : F (8/0)
Instead I get something like this:
digraph Tree {
edge [style=bold]
N13aaa14a [label="1: T"]
N13aaa14a->N268b819f [label="act = STRETCH"]
N268b819f [label="2: T"shape=box]
N13aaa14a->N10eb017e [label="act = DIP"]
N10eb017e [label="3: F"]
N10eb017e->N34aeffdf [label="age = CHILD"]
N34aeffdf [label="4: F"shape=box]
N10eb017e->N4d20a47e [label="age = ADULT"]
N4d20a47e [label="5: T"shape=box]
}
Is this something unique to the WEKA libraries, or is it some type of standard format? It looks similar to some of the JSON stuff I saw while working on another project, but I never got that familiar with it.
Is there an easy way to write a function to display this in a more human-readable format?
The output you are getting is in the so-called "dot" format, which is designed to be rendered by Graphviz. You'll get better results than ASCII art, that's for sure.
Save your output as out.dot and then try this command:
$ dot -Tpng -o out.png out.dot
Then look at what you've got in out.png.
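If you want to stay inside Java, you can write the graph() string to a .dot file and shell out to Graphviz from the program (a sketch; it assumes the dot executable is on the PATH, and uses a stub string in place of the classifier output):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DotToPng {
    public static void main(String[] args) throws IOException {
        // In a real job this string would come from classifier.graph() / toGraph()
        String dotSource = "digraph Tree {\n  a -> b;\n}\n";
        Path dotFile = Path.of("out.dot");
        Files.writeString(dotFile, dotSource);
        System.out.println("wrote " + dotFile);
        try {
            // Render with Graphviz; skipped gracefully when dot is not installed
            new ProcessBuilder("dot", "-Tpng", "-o", "out.png", "out.dot")
                    .inheritIO().start().waitFor();
        } catch (IOException | InterruptedException e) {
            System.out.println("Graphviz not available; run dot manually");
        }
    }
}
```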
