Read a PDF table with iText - Java

I created a table in a PDF using iText in a Java web application.
The generated PDF looks like this:
Gender | Column 1 | Column 2 | Column 3
Male | 1845 | 645 | 254
Female | 214 | 457 | 142
To read the PDF back, I used the following code:
ArrayList<PdfPRow> allRows = firstable.getRows();
for (PdfPRow currentRow : allRows) {
    PdfPCell[] allCells = currentRow.getCells();
    System.out.println("Current row -> " + Arrays.toString(allCells));
    for (PdfPCell currentCell : allCells) {
        // Cells can be null (row/column spans), and getCompositeElements()
        // is only populated for cells built from composite elements.
        if (currentCell == null || currentCell.getCompositeElements() == null) {
            continue;
        }
        List<Element> elements = currentCell.getCompositeElements();
        System.out.println("Element -> " + elements);
    }
}
How can I read the text from the PDF columns and assign it to int variables?

Why don't you generate the columns of the PDF as form fields, so that reading them back will be much easier?
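If you do generate the values as form fields, reading them back is just a matter of opening the PDF and asking for each field by name. A minimal sketch, assuming iText 5 and a hypothetical field name "col1_male" (use whatever names you assign when creating the fields):

import java.io.IOException;
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfReader;

try {
    PdfReader reader = new PdfReader("table.pdf");
    AcroFields form = reader.getAcroFields();
    // getField returns the field's text content; parse it into an int
    int maleColumn1 = Integer.parseInt(form.getField("col1_male").trim());
    System.out.println("Male / Column 1 = " + maleColumn1);
    reader.close();
} catch (IOException e) {
    e.printStackTrace();
}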

Related

How to merge two parquet files having different schema in spark (java)

I have two parquet files with different numbers of columns and am trying to merge them with the following code snippet:
Dataset<Row> dataSetParquet1 = testSparkSession.read().option("mergeSchema",true).parquet("D:\\ABC\\abc.parquet");
Dataset<Row> dataSetParquet2 = testSparkSession.read().option("mergeSchema",true).parquet("D:\\EFG\\efg.parquet");
dataSetParquet1.unionByName(dataSetParquet2);
// dataSetParquet1.union(dataSetParquet2);
for unionByName() I get the error:
Caused by: org.apache.spark.sql.AnalysisException: Cannot resolve column name
for union() I get the error:
Caused by: org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 7 columns and the second table has 6 columns;;
How do I merge these files using spark in java?
UPDATE : Example
dataset 1:
epochMillis | one | two | three| four
--------------------------------------
1630670242000 | 1 | 2 | 3 | 4
1630670244000 | 1 | 2 | 3 | 4
1630670246000 | 1 | 2 | 3 | 4
dataset2 :
epochMillis | one | two | three|five
---------------------------------------
1630670242000 | 11 | 22 | 33 | 55
1630670244000 | 11 | 22 | 33 | 55
1630670248000 | 11 | 22 | 33 | 55
Final dataset after merging:
epochMillis | one | two | three|four |five
--------------------------------------------
1630670242000 | 11 | 22 | 33 |4 |55
1630670244000 | 11 | 22 | 33 |4 |55
1630670246000 | 1 | 2 | 3 |4 |null
1630670248000 | 11 | 22 | 33 |null |55
how to obtain this result for merging two Datasets?
You can use the mergeSchema option and pass all the paths of the parquet files you want to merge to the parquet method, as follows:
Dataset<Row> finalDataset = testSparkSession.read()
.option("mergeSchema", true)
.parquet("D:\\ABC\\abc.parquet", "D:\\EFG\\efg.parquet");
All columns that are present in the first dataset but not in the second will be filled with null values for the rows coming from the second dataset.
To merge two rows that come from two different dataframes, you first join the two dataframes, then select the right columns according to how you want to merge them.
So for your case, it means:
Read the two dataframes separately from their parquet locations
Join the two dataframes on their epochMillis column, using a full_outer join, as you want to keep rows that are present in one dataframe but not in the other
From the new dataframe, which contains all the columns of both dataframes (duplicated), select the merged columns using a function columnMerges (implementation below)
[Optional] Reorder the final dataframe by epochMillis
Translated into code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
Dataset<Row> dataframe1 = testSparkSession.read().parquet("D:\\ABC\\abc.parquet");
Dataset<Row> dataframe2 = testSparkSession.read().parquet("D:\\EFG\\efg.parquet");
Dataset<Row> merged = dataframe1
        .join(dataframe2, dataframe1.col("epochMillis").equalTo(dataframe2.col("epochMillis")), "full_outer")
        .select(Selector.columnMerges(dataframe2, dataframe1))
        .orderBy("epochMillis");
Note: when reading the parquet files there is no need for the mergeSchema option, as each dataframe is read from a single parquet file and thus has a single schema.
For the merge function Selector.columnMerges, what we want to do for each row is:
if the column is present in both dataframes, take the value from dataframe2 if it is not null, otherwise take the value from dataframe1
if the column is only present in dataframe2, take the value from dataframe2
if the column is only present in dataframe1, take the value from dataframe1
So we first build the set of columns of dataframe1, the set of columns of dataframe2, and the deduplicated list of columns from both dataframes. Then we iterate over this list of columns, applying the previous rules to each one:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.when;

public class Selector {

    public static Column[] columnMerges(Dataset<Row> main, Dataset<Row> second) {
        List<Column> columns = new ArrayList<>();
        Set<String> columnsFromMain = new HashSet<>(Arrays.asList(main.columns()));
        Set<String> columnsFromSecond = new HashSet<>(Arrays.asList(second.columns()));

        // All column names: main's columns first, then second's columns not already present
        List<String> columnNames = new ArrayList<>(Arrays.asList(main.columns()));
        for (String column : second.columns()) {
            if (!columnsFromMain.contains(column)) {
                columnNames.add(column);
            }
        }

        for (String column : columnNames) {
            if (columnsFromMain.contains(column) && columnsFromSecond.contains(column)) {
                // Column exists in both: prefer main's value, fall back to second's when null
                columns.add(when(main.col(column).isNull(), second.col(column))
                        .otherwise(main.col(column)).as(column));
            } else if (columnsFromMain.contains(column)) {
                columns.add(main.col(column).as(column));
            } else {
                columns.add(second.col(column).as(column));
            }
        }
        return columns.toArray(new Column[0]);
    }
}

Need to set values in dataset columns based on the value of one column

I have a Dataset<Row> in Java. I need to read the value of one column, which is a JSON string, parse it, and set the values of a few other columns based on the parsed JSON.
My dataset looks like this:
|json | name| age |
========================================
| "{'a':'john', 'b': 23}" | null| null |
----------------------------------------
| "{'a':'joe', 'b': 25}" | null| null |
----------------------------------------
| "{'a':'zack'}" | null| null |
----------------------------------------
And I need to make it like this:
|json | name | age |
========================================
| "{'a':'john', 'b': 23}" | 'john'| 23 |
----------------------------------------
| "{'a':'joe', 'b': 25}" | 'joe' | 25 |
----------------------------------------
| "{'a':'zack'}" | 'zack'|null|
----------------------------------------
I am unable to figure out a way to do it. Please help with the code.
There is a get_json_object function in Spark.
Assuming you have a data frame named df, you can solve your problem this way:
df.selectExpr("get_json_object(json, '$.a') as name", "get_json_object(json, '$.b') as age" )
But first and foremost, make sure that your JSON attributes use double quotes instead of single ones.
Note: there is a full list of Spark SQL functions; I use it heavily. Consider adding it to your bookmarks and referring to it from time to time.
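Since the question is about Java, here is a minimal Java sketch of the same idea, assuming the dataframe is called df (get_json_object returns a string, so the age column is cast to int):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.get_json_object;

Dataset<Row> result = df
        .withColumn("name", get_json_object(col("json"), "$.a"))
        .withColumn("age", get_json_object(col("json"), "$.b").cast("int"));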
You could use UDFs
def parseName(json: String): String = ??? // parse json
val parseNameUDF = udf[String, String](parseName)
def parseAge(json: String): Int = ??? // parse json
val parseAgeUDF = udf[Int, String](parseAge)
dataFrame
.withColumn("name", parseNameUDF(dataFrame("json")))
.withColumn("age", parseAgeUDF(dataFrame("json")))

With Apache Spark, flatten the first 2 rows of each group with Java

Giving the following input table:
+----+------------+----------+
| id | shop | purchases|
+----+------------+----------+
| 1 | 01 | 20 |
| 1 | 02 | 31 |
| 2 | 03 | 5 |
| 1 | 03 | 3 |
+----+------------+----------+
I would like, grouping by id and ranking by purchases, to obtain the top 2 shops, as follows:
+----+-------+------+
| id | top_1 | top_2|
+----+-------+------+
| 1 | 02 | 01 |
| 2 | 03 | |
+----+-------+------+
I'm using Apache Spark 2.0.1, and the first table is the result of other queries and joins on a Dataset. I could probably do this by iterating over the Dataset with plain Java, but I hope there is another way using the Dataset functionality.
My first attempt was the following:
//dataset is already ordered by id, purchases desc
...
Dataset<Row> ds = dataset.repartition(new Column("id"));
ds.foreachPartition(new ForeachPartitionFunction<Row>() {
    @Override
    public void call(Iterator<Row> itrtr) throws Exception {
        int counter = 0;
        while (itrtr.hasNext()) {
            Row row = itrtr.next();
            if (counter < 2) {
                // save it into another Dataset
                counter++;
            }
        }
    }
});
But then I was lost as to how to save it into another Dataset. My goal is, in the end, to save the result into a MySQL table.
Using window functions and pivot you can define a window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, row_number}
val w = Window.partitionBy(col("id")).orderBy(col("purchases").desc)
add row_number and filter top two rows:
val dataset = Seq(
(1, "01", 20), (1, "02", 31), (2, "03", 5), (1, "03", 3)
).toDF("id", "shop", "purchases")
val topTwo = dataset.withColumn("top", row_number.over(w)).where(col("top") <= 2)
and pivot:
topTwo.groupBy(col("id")).pivot("top", Seq(1, 2)).agg(first("shop"))
with result being:
+---+---+----+
| id| 1| 2|
+---+---+----+
| 1| 02| 01|
| 2| 03|null|
+---+---+----+
I'll leave converting syntax to Java as an exercise for the poster (excluding import static for functions the rest should be close to identical).
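A rough Java translation of the above (an untested sketch, assuming the input Dataset<Row> is named dataset as in the Scala snippet):

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.first;
import static org.apache.spark.sql.functions.row_number;

WindowSpec w = Window.partitionBy(col("id")).orderBy(col("purchases").desc());

// number the shops per id by descending purchases and keep the top two
Dataset<Row> topTwo = dataset
        .withColumn("top", row_number().over(w))
        .where(col("top").leq(2));

// pivot the rank into columns
Dataset<Row> result = topTwo
        .groupBy(col("id"))
        .pivot("top", Arrays.<Object>asList(1, 2))
        .agg(first("shop"));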

Talend - generating n rows from 1 row

Background: I'm using Talend to do something (I guess) that is pretty common: generating multiple rows from one. For example:
ID | Name | DateFrom | DateTo
01 | Marco| 01/01/2014 | 04/01/2014
...could be split into:
new_ID | ID | Name | DateFrom | DateTo
01 | 01 | Marco | 01/01/2014 | 02/01/2014
02 | 01 | Marco | 02/01/2014 | 03/01/2014
03 | 01 | Marco | 03/01/2014 | 04/01/2014
The number of resulting rows is dynamic, depending on the date period in the original row.
Question: how can I do this? Maybe using tSplitRow? I am going to check those periods with tJavaRow. Any suggestions?
Expanding on the answer given by Balazs Gunics:
The first part is to calculate the number of rows that one row will become, which is easy enough with a date-diff function on the to and from dates.
Part 2 is to pass that value to a tFlowToIterate, and pick it up with a tJavaFlex that uses it in its start code to control a for loop:
tJavaFlex start code:
    int currentId = (Integer)globalMap.get("out1.id");
    String currentName = (String)globalMap.get("out1.name");
    Long iterations = (Long)globalMap.get("out1.iterations");
    Date dateFrom = (java.util.Date)globalMap.get("out1.dateFrom");
    for (int i = 0; i < iterations; i++) {
tJavaFlex main code:
        row2.id = currentId;
        row2.name = currentName;
        row2.dateFrom = TalendDate.addDate(dateFrom, i, "dd");
        row2.dateTo = TalendDate.addDate(dateFrom, i + 1, "dd");
tJavaFlex end code:
    }
and sample output:
1|Marco|01-01-2014|02-01-2014
1|Marco|02-01-2014|03-01-2014
1|Marco|03-01-2014|04-01-2014
2|Polo|01-01-2014|02-01-2014
2|Polo|02-01-2014|03-01-2014
2|Polo|03-01-2014|04-01-2014
2|Polo|04-01-2014|05-01-2014
2|Polo|05-01-2014|06-01-2014
2|Polo|06-01-2014|07-01-2014
2|Polo|07-01-2014|08-01-2014
2|Polo|08-01-2014|09-01-2014
2|Polo|09-01-2014|10-01-2014
2|Polo|10-01-2014|11-01-2014
2|Polo|11-01-2014|12-01-2014
2|Polo|12-01-2014|13-01-2014
2|Polo|13-01-2014|14-01-2014
2|Polo|14-01-2014|15-01-2014
2|Polo|15-01-2014|16-01-2014
2|Polo|16-01-2014|17-01-2014
2|Polo|17-01-2014|18-01-2014
2|Polo|18-01-2014|19-01-2014
2|Polo|19-01-2014|20-01-2014
2|Polo|20-01-2014|21-01-2014
2|Polo|21-01-2014|22-01-2014
2|Polo|22-01-2014|23-01-2014
2|Polo|23-01-2014|24-01-2014
2|Polo|24-01-2014|25-01-2014
2|Polo|25-01-2014|26-01-2014
2|Polo|26-01-2014|27-01-2014
2|Polo|27-01-2014|28-01-2014
2|Polo|28-01-2014|29-01-2014
2|Polo|29-01-2014|30-01-2014
2|Polo|30-01-2014|31-01-2014
2|Polo|31-01-2014|01-02-2014
You can use tJavaFlex to do this.
If you have a small number of columns, then a tFlowToIterate -> tJavaFlex combination could work fine.
In the begin part you start the iteration, and in the main part you assign values to the output schema. If your output is named row6, then:
row6.id = (String)globalMap.get("id");
and so on.
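For completeness, a minimal sketch of the three tJavaFlex sections for an output named row6; the globalMap keys and the iteration count are assumptions, following the same pattern as the previous answer:

tJavaFlex start code:
    // "iterations" is assumed to have been put into globalMap by a preceding tFlowToIterate
    for (int i = 0; i < ((Integer)globalMap.get("iterations")); i++) {
tJavaFlex main code:
        row6.id = (String)globalMap.get("id");
tJavaFlex end code:
    }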
I came here because I wanted to add all context parameters to an Excel data sheet. The solution below works when you have 0 input lines, but it can be adapted to generate several lines for each input line.
The design is actually straightforward:
tJava –trigger-on-OK→ tFileInputDelimited → tDoSomethingOnRowSet
  ↓                           ↑
[write into a CSV]      [read the CSV]
And here is the kind of code structure usable in the tJava.
// Requires java.io.FileWriter and java.io.IOException (add them under Advanced settings > Import of the tJava component)
try {
    StringBuffer wad = new StringBuffer();
    wad.append("Key;Nub"); // Header
    context.stringPropertyNames().forEach(
        key -> wad
            .append(System.getProperty("line.separator"))
            .append(key + ";" + context.getProperty(key))
    );
    // Here context.metadata contains the path to the CSV file
    FileWriter output = new FileWriter(context.metadata);
    output.write(wad.toString());
    output.close();
} catch (IOException mess) {
    System.out.println("An error occurred.");
    mess.printStackTrace();
}
Of course if you have a set of rows as input, you can adapt the process to use a tJavaRow instead of a tJava.
You might prefer to use an Excel file as an on-disk buffer, but dealing with that file format requires more work, at least the first time, if you don't already have the Java libraries configured in Talend. Apache POI might help if you nonetheless choose to go this way.

How to use Java to encode/decode/transcode byte array using a charset directly

I have a malformed string which may have been caused by a bug in the MySQL JDBC driver.
The bytes of a sample malformed string (malformed_string.getBytes("UTF-8")) are:
C3 A4 C2 B8 C2 AD C3 A6 E2 80 93 E2 80 A1 (UTF-8 applied twice)
These should decode to the following bytes (already UTF-8 encoded, but treated as if they were ISO-8859-1 encoded):
E4 B8 AD E6 96 87 (UTF-8)
which in turn encode the following Unicode big-endian bytes:
4E 2D 65 87 (UTF-16 big-endian)
I want to decode the first form into the second. I tried new String(malformed_string.getBytes("UTF-8"), "ISO-8859-1"), but it does not transcode as expected. I'm wondering if there is something like byte[] encode/decode(byte[] src, String charsetName), or how else I can achieve this transcoding in Java.
Background:
I have a MySQL table with Chinese column names. When I update such a column with data that is too long, the MySQL JDBC driver throws an exception like this:
com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column 'ä¸­æ–‡' at row 1
The column name in the exception is malformed; it should be '中文', and it must be displayed correctly to the user, as follows:
com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column '中文' at row 1
EDIT
Here are MySQL statements to demonstrate how the malformed string occurred, and how to restore it to the correct string:
show variables like 'char%';
+--------------------------+--------------------------+
| Variable_name | Value |
+--------------------------+--------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | C:\mysql\share\charsets\ |
+--------------------------+--------------------------+
-- encode
select
hex(convert(convert(unhex('E4B8ADE69687') using UTF8) using ucs2)) as `hex(src in UNICODE)`,
unhex('E4B8ADE69687') `src in UTF8`,
'E4B8ADE69687' `hex(src in UTF8)`,
hex(convert(convert(unhex('E4B8ADE69687') using latin1) using UTF8)) as `hex(src in UTF8->Latin1->UTF8)`;
+---------------------+-------------+------------------+--------------------------------+
| hex(src in UNICODE) | src in UTF8 | hex(src in UTF8) | hex(src in UTF8->Latin1->UTF8) |
+---------------------+-------------+------------------+--------------------------------+
| 4E2D6587 | 中文 | E4B8ADE69687 | C3A4C2B8C2ADC3A6E28093E280A1 |
+---------------------+-------------+------------------+--------------------------------+
1 row in set (0.00 sec)
-- decode
select
unhex('C3A4C2B8C2ADC3A6E28093E280A1') as `malformed`,
'C3A4C2B8C2ADC3A6E28093E280A1' as `hex(malformed)`,
hex(convert(convert(unhex('C3A4C2B8C2ADC3A6E28093E280A1') using utf8) using latin1)) as `hex(malformed->UTF8->Latin1)`,
convert(convert(convert(convert(unhex('C3A4C2B8C2ADC3A6E28093E280A1') using utf8) using latin1) using binary)using utf8) `malformed->UTF8->Latin1->binary->UTF8`;
+----------------+------------------------------+------------------------------+---------------------------------------+
| malformed | hex(malformed) | hex(malformed->UTF8->Latin1) | malformed->UTF8->Latin1->binary->UTF8 |
+----------------+------------------------------+------------------------------+---------------------------------------+
| ä¸­æ–‡ | C3A4C2B8C2ADC3A6E28093E280A1 | E4B8ADE69687 | 中文 |
+----------------+------------------------------+------------------------------+---------------------------------------+
1 row in set (0.00 sec)
Check out this tutorial: http://download.oracle.com/javase/tutorial/i18n/text/string.html
The punch line is that the way to transcode using only the JDK classes is:
try {
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes();

    String roundTrip = new String(utf8Bytes, "ISO-8859-1");
    System.out.println("roundTrip = " + roundTrip);
    System.out.println();

    printBytes(utf8Bytes, "utf8Bytes");
    System.out.println();
    printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
For a more robust transcoding mechanism, I suggest you check out ICU.
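For the specific double-encoded string in the question, here is a minimal sketch of the repair. Note an assumption on my part: the byte sequences E2 80 93 ('–') and E2 80 A1 ('‡') suggest the original UTF-8 bytes were decoded as windows-1252 rather than strict ISO-8859-1, because 0x96 and 0x87 only map to those characters in cp1252; so the sketch re-encodes with windows-1252 and then decodes as UTF-8:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeFix {

    public static String fixDoubleEncoded(String malformed) {
        // Re-encode with the charset the bytes were wrongly decoded with...
        byte[] originalUtf8 = malformed.getBytes(Charset.forName("windows-1252"));
        // ...then decode the recovered bytes as the UTF-8 they really are.
        return new String(originalUtf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The sample malformed bytes from the question
        byte[] doubleEncoded = {
            (byte) 0xC3, (byte) 0xA4, (byte) 0xC2, (byte) 0xB8, (byte) 0xC2, (byte) 0xAD,
            (byte) 0xC3, (byte) 0xA6, (byte) 0xE2, (byte) 0x80, (byte) 0x93,
            (byte) 0xE2, (byte) 0x80, (byte) 0xA1
        };
        String malformed = new String(doubleEncoded, StandardCharsets.UTF_8);
        System.out.println(fixDoubleEncoded(malformed)); // expected to print 中文
    }
}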
