I have a column of type Timestamp with the format yyyy-MM-dd HH:mm:ss in a DataFrame.
The column is sorted by time, with earlier dates in earlier rows.
When I run this command:
List<Row> timeRows = df.withColumn(ts, df.col(ts).cast("long")).select(ts).collectAsList();
I see a strange issue where the epoch value for the later date is smaller than that of the earlier date. Example:
[670] : 1550967304 (2019-02-23 04:30:15)
[671] : 1420064100 (2019-02-24 08:15:04)
Is this the correct way to convert to Epoch or is there another way?
Try using unix_timestamp to convert the string date/time to a timestamp. According to the documentation:
unix_timestamp(Column s, String p): Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
import org.apache.spark.sql.functions._
val format = "yyyy-MM-dd HH:mm:ss"
df.withColumn("epoch_sec", unix_timestamp($"ts", format)).select("epoch_sec").collectAsList()
Also, see https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-datetime.html
You should use the built-in function unix_timestamp() from org.apache.spark.sql.functions:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html#unix_timestamp()
I think you are looking for unix_timestamp(), which you can import with:
import static org.apache.spark.sql.functions.unix_timestamp;
And use like:
df = df.withColumn(
"epoch",
unix_timestamp(col("date")));
And here is a full example, where I tried to mimic your use-case:
package net.jgp.books.spark.ch12.lab990_others;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_unixtime;
import static org.apache.spark.sql.functions.unix_timestamp;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
/**
* Use of from_unixtime() and unix_timestamp().
*
* @author jgp
*/
public class EpochTimestampConversionApp {
/**
* main() is your entry point to the application.
*
* @param args
*/
public static void main(String[] args) {
EpochTimestampConversionApp app = new EpochTimestampConversionApp();
app.start();
}
/**
* The processing code.
*/
private void start() {
// Creates a session on a local master
SparkSession spark = SparkSession.builder()
.appName("expr()")
.master("local")
.getOrCreate();
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField(
"event",
DataTypes.IntegerType,
false),
DataTypes.createStructField(
"original_ts",
DataTypes.StringType,
false) });
// Building a df with a sequence of chronological timestamps
List<Row> rows = new ArrayList<>();
long now = System.currentTimeMillis() / 1000;
for (int i = 0; i < 1000; i++) {
rows.add(RowFactory.create(i, String.valueOf(now)));
now += new Random().nextInt(3) + 1;
}
Dataset<Row> df = spark.createDataFrame(rows, schema);
df.show();
df.printSchema();
// Turning the timestamps to Timestamp datatype
df = df.withColumn(
"date",
from_unixtime(col("original_ts")).cast(DataTypes.TimestampType));
df.show();
df.printSchema();
// Turning back the timestamps to epoch
df = df.withColumn(
"epoch",
unix_timestamp(col("date")));
df.show();
df.printSchema();
// Collecting the result and printing out
List<Row> timeRows = df.collectAsList();
for (Row r : timeRows) {
System.out.printf("[%d] : %s (%s)\n",
r.getInt(0),
r.getAs("epoch"),
r.getAs("date"));
}
}
}
And the output should be:
...
[994] : 1551997326 (2019-03-07 14:22:06)
[995] : 1551997329 (2019-03-07 14:22:09)
[996] : 1551997330 (2019-03-07 14:22:10)
[997] : 1551997332 (2019-03-07 14:22:12)
[998] : 1551997333 (2019-03-07 14:22:13)
[999] : 1551997335 (2019-03-07 14:22:15)
Hopefully this helps.
Related
I'm using spark-sql 2.4.1 with Java 8.
I have a dynamic list of columns that is passed into my function, i.e.:
List<String> cols = Arrays.asList("col_1","col_2","col_3","col_4");
Dataset<Row> df = //which has above columns plus "id" ,"name" plus many other columns;
I need to select cols plus "id" and "name". I am doing it as below:
Dataset<Row> res_df = df.select("id", "name", cols.stream().toArray( String[]::new));
This gives a compilation error, so how do I handle this use case?
What I tried: when I do something like this:
List<String> cols = new ArrayList<>(Arrays.asList("col_1","col_2","col_3","col_4"));
cols.add("id");
cols.add("name");
I get this error:
Exception in thread "main" java.lang.UnsupportedOperationException
at java.util.AbstractList.add(AbstractList.java:148)
at java.util.AbstractList.add(AbstractList.java:108)
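For reference, that UnsupportedOperationException is what the fixed-size list returned by Arrays.asList throws from add() (AbstractList.add is unimplemented there), so it suggests the list being modified was still the Arrays.asList view rather than a mutable copy. A minimal sketch of building a mutable copy first (variable names are just illustrative):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

List<String> cols = Arrays.asList("col_1", "col_2", "col_3", "col_4");
List<String> selectCols = new ArrayList<>(cols); // a real ArrayList, so add() works
selectCols.add("id");
selectCols.add("name");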
You could create an array of Columns and pass it to the select statement.
import org.apache.spark.sql.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
List<String> cols = new ArrayList<>(Arrays.asList("col_1","col_2","col_3","col_4"));
cols.add("id");
cols.add("name");
Column[] cols2 = cols.stream()
.map(s->new Column(s)).collect(Collectors.toList())
.toArray(new Column[0]);
df.select(cols2).show();
You have a bunch of ways to achieve this, relying on the different select method signatures.
One possible solution, under the assumption that the cols list is immutable and not controlled by your code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import scala.collection.JavaConverters;
public class ATest {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("Java Spark SQL basic example")
.master("local[2]")
.getOrCreate();
List<String> cols = Arrays.asList("col_1", "col_2");
Dataset<Row> df = spark.sql("select 42 as ID, 'John' as NAME, 1 as col_1, 2 as col_2, 3 as col_3, 4 as col4");
df.show();
ArrayList<String> newCols = new ArrayList<>();
newCols.add("NAME");
newCols.addAll(cols);
df.select("ID", JavaConverters.asScalaIteratorConverter(newCols.iterator()).asScala().toSeq())
.show();
}
}
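Another option: Dataset.select(String col, String... cols) is annotated @varargs in Scala, so it is exposed to Java as a String varargs method; the trick is to fold everything except the first column name into a single array instead of mixing loose strings with an array (a sketch, hedged on your Spark version exposing this overload):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

List<String> cols = Arrays.asList("col_1", "col_2", "col_3", "col_4");
List<String> rest = new ArrayList<>();
rest.add("name");
rest.addAll(cols);
// pass "id" as the first argument and everything else as one String[]
Dataset<Row> res_df = df.select("id", rest.toArray(new String[0]));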
My initial data is in a Dataset<Row> and I am trying to write it to a pipe-delimited file. I want each non-empty, non-null value to be placed in quotes; empty or null values should not be quoted.
result.coalesce(1).write()
.option("delimiter", "|")
.option("header", "true")
.option("nullValue", "")
.option("quoteAll", "false")
.csv(Location);
Expected output:
"London"||"UK"
"Delhi"|"India"
"Moscow"|"Russia"
Current Output:
London||UK
Delhi|India
Moscow|Russia
If I change the "quoteAll" to "true", output I am getting is:
"London"|""|"UK"
"Delhi"|"India"
"Moscow"|"Russia"
The Spark version is 2.3 and the Java version is Java 8.
A Java answer. CSV escaping is not just adding " symbols around a value; you also have to handle " characters inside the strings. So let's use StringEscapeUtils and define a UDF that calls it, then apply the UDF to each of the columns.
import org.apache.commons.text.StringEscapeUtils;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;
import java.util.Arrays;
public class Test {
void test(Dataset<Row> result, String Location) {
// define UDF
UserDefinedFunction escape = udf(
(String str) -> (str == null || str.isEmpty()) ? str : StringEscapeUtils.escapeCsv(str), DataTypes.StringType // pass null/empty values through untouched
);
// call udf for each column
Column[] columns = Arrays.stream(result.schema().fieldNames())
.map(f -> escape.apply(col(f)).as(f))
.toArray(Column[]::new);
// save the result
result.select(columns)
.coalesce(1).write()
.option("delimiter", "|")
.option("header", "true")
.option("nullValue", "")
.option("quoteAll", "false")
.csv(Location);
}
}
Side note: coalesce(1) is a bad call. It collects all the data on one executor, so you can get an executor OOM in production for a huge dataset.
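A minimal sketch of the same write without coalesce(1), letting Spark produce multiple part files (which can be merged downstream if a single file is really required):
result.select(columns)
    .write()
    .option("delimiter", "|")
    .option("header", "true")
    .option("nullValue", "")
    .csv(Location); // writes part-*.csv files in parallel instead of funneling everything through one executor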
EDIT & warning: I did not see the java tag. This is a Scala solution that uses foldLeft as a loop to go over all columns. If this is replaced by a Java-friendly loop, everything should work as is; a rough Java translation is sketched after the write example below.
A programmatic solution could be
val columns = result.columns
val randomColumnName = "RND"
val result2 = columns.foldLeft(result) { (data, column) =>
data
.withColumnRenamed(column, randomColumnName)
.withColumn(column,
when(col(randomColumnName).isNull, "")
.otherwise(concat(lit("\""), col(randomColumnName), lit("\"")))
)
.drop(randomColumnName)
}
This will produce the strings with " around them and write empty strings for nulls. If you need to keep nulls, just keep them.
Then just write it down:
result2.coalesce(1).write()
.option("delimiter", "|")
.option("header", "true")
.option("quoteAll", "false")
.csv(Location);
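And here is a rough Java-friendly sketch of the same loop, under the assumption that the plain Java API for when/concat/lit behaves the same as in the Scala snippet above:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.when;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> result2 = result;
String randomColumnName = "RND";
for (String column : result.columns()) {
    // rename the column, rebuild it quoted (or empty for null), then drop the temporary column
    result2 = result2
        .withColumnRenamed(column, randomColumnName)
        .withColumn(column,
            when(col(randomColumnName).isNull(), "")
                .otherwise(concat(lit("\""), col(randomColumnName), lit("\""))))
        .drop(randomColumnName);
}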
This is certainly not an efficient answer, and I am modifying it based on the one given by Artem Aliev, but I thought it would be useful to a few people, so I am posting it.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;
public class Quotes {
private static final String DELIMITER = "|";
private static final String Location = "Give location here";
public static void main(String[] args) {
SparkSession sparkSession = SparkSession.builder()
.master("local")
.appName("Spark Session")
.enableHiveSupport()
.getOrCreate();
Dataset<Row> result = sparkSession.read()
.option("header", "true")
.option("delimiter",DELIMITER)
.csv("Sample file to read"); //Give the details of file to read here
UserDefinedFunction udfQuotesNonNull = udf(
(String abc) -> (abc!=null? "\""+abc+"\"":abc),DataTypes.StringType
);
result = result.withColumn("ind_val", monotonically_increasing_id()); //inducing a new column to be used for join as there is no identity column in source dataset
Dataset<Row> dataset1 = result.select((udfQuotesNonNull.apply(col("ind_val").cast("string")).alias("ind_val"))); //Dataset used for storing temporary results
Dataset<Row> dataset = result.select((udfQuotesNonNull.apply(col("ind_val").cast("string")).alias("ind_val"))); //Dataset used for storing output
String[] str = result.schema().fieldNames();
dataset1.show();
for(int j=0; j<str.length-1;j++)
{
dataset1 = result.select((udfQuotesNonNull.apply(col("ind_val").cast("string")).alias("ind_val")),(udfQuotesNonNull.apply(col(str[j]).cast("string")).alias("\""+str[j]+"\"")));
dataset=dataset.join(dataset1,"ind_val"); //Joining based on induced column
}
result = dataset.drop("ind_val");
result.coalesce(1).write()
.option("delimiter", DELIMITER)
.option("header", "true")
.option("quoteAll", "false")
.option("nullValue", null)
.option("quote", "\u0000")
.option("spark.sql.sources.writeJobUUID", false)
.csv(Location);
}
}
This is my "revenue_data.csv" file:
Client ReportDate Revenue
C1 2019-1-7 12
C2 2019-1-7 34
C1 2019-1-16 56
C2 2019-1-16 78
C3 2019-1-16 90
And my case class to read the file is:
package com.source.code;
import java.time.LocalDate;
public class RevenueRecorder {
private String clientCode;
private LocalDate reportDate;
private int revenue;
public RevenueRecorder(String clientCode, LocalDate reportDate, int revenue) {
this.clientCode = clientCode;
this.reportDate = reportDate;
this.revenue = revenue;
}
public String getClientCode() {
return clientCode;
}
public LocalDate getReportDate() {
return reportDate;
}
public int getRevenue() {
return revenue;
}
}
I can read the file and group by ReportDate, sum(revenue) in the following manner:
import com.source.code.RevenueRecorder;
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.summingInt;
public class RevenueRecorderMain {
public static void main(String[] args) throws IOException {
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-M-d");
List<RevenueRecorder> revenueRecords = new ArrayList<>();
Path path = FileSystems.getDefault().getPath("src", "main", "resources",
"data", "revenue_data.csv");
Files.lines(path)
.skip(1)
.map(s -> s.split(","))
.forEach(s ->
{
String clientCode = s[0];
LocalDate reportDate = LocalDate.parse(s[1], formatter);
int revenue = Integer.parseInt(s[2]);
revenueRecords.add(new RevenueRecorder(clientCode, reportDate, revenue));
});
Map<LocalDate, Integer> reportDateRev = revenueRecords.stream()
.collect(groupingBy(RevenueRecorder::getReportDate,
summingInt(RevenueRecorder::getRevenue)));
}
}
My question is how can I group by ReportDate, count(clientCode) and sum(revenue) in Java 8, specifically:
what collection to use instead of the Map
how to groupby and collect in this case (and generally for more than 2 groupingBy's)
I'm trying:
//import org.apache.commons.lang3.tuple.ImmutablePair;
//import org.apache.commons.lang3.tuple.Pair;
Map<LocalDate, Pair<Integer, Integer>> pairedReportDateRev = revenueRecords.stream()
.collect(groupingBy(RevenueRecorder::getReportDate,
new ImmutablePair(summingInt(RevenueRecorder::getRevenue),
groupingBy(RevenueRecorder::getClientCode, Collectors.counting()))));
But I get the IntelliJ red squiggle underneath RevenueRecorder::getReportDate with the hover message 'Non-static method cannot be referenced from a static context'.
Thanks
EDIT
For clarification, here's the corresponding SQL query that I'm trying to get at:
select
reportDate, count(distinct(clientCode)), sum(revenue)
from
revenue_data_table
group by
reportDate
Although your attempt was not successful, I think it expresses what you want, so I just followed your code and fixed it. Try this one!
Map<LocalDate, ImmutablePair<Integer, Map<String, Long>>> map = revenueRecords.stream()
.collect(groupingBy(RevenueRecorder::getReportDate,
collectingAndThen(toList(), list -> new ImmutablePair(list.stream().collect(summingInt(RevenueRecorder::getRevenue)),
list.stream().collect(groupingBy(RevenueRecorder::getClientCode, Collectors.counting()))))));
I borrowed some sample data code from @Lyashko Kirill to test my code.
This is my own idea; I hope it helps. ╰( ̄▽ ̄)╭
If you already use Java 12, there is a new collector, Collectors.teeing(), which allows you to collect using two independent collectors and then merge their results using the supplied BiFunction. Every element passed to the resulting collector is processed by both downstream collectors, and their results are merged using the specified merge function into the final result. Therefore Collectors.teeing() may be a good fit, since you want counting and summing.
Map<LocalDate, Result> pairedReportDateMRR =
revenueRecords.stream().collect(Collectors.groupingBy(RevenueRecorder::getReportDate,
Collectors.teeing(Collectors.counting(),
Collectors.summingInt(RevenueRecorder::getRevenue), Result::new)));
System.out.println(pairedReportDateMRR);
//output: {2019-01-07={count=2, sum=46}, 2019-01-16={count=3, sum=224}}
For testing purposes I used the following simple static class
static class Result {
private Long count;
private Integer sum;
public Result(Long count, Integer sum) {
this.count = count;
this.sum = sum;
}
@Override
public String toString() {
return "{" + "count=" + count + ", sum=" + sum + '}';
}
}
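Note that the SQL in the question asks for count(distinct clientCode), while Collectors.counting() counts rows; the two happen to coincide here because each client appears at most once per date. A sketch of adapting the same teeing approach to a distinct count, collecting the client codes into a Set and keeping only its size (assuming the same Result class as above):
Map<LocalDate, Result> distinctCounts = revenueRecords.stream()
    .collect(Collectors.groupingBy(RevenueRecorder::getReportDate,
        Collectors.teeing(
            // distinct client codes per date, reduced to their count
            Collectors.mapping(RevenueRecorder::getClientCode,
                Collectors.collectingAndThen(Collectors.toSet(), set -> (long) set.size())),
            Collectors.summingInt(RevenueRecorder::getRevenue),
            Result::new)));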
First of all, you can't produce a Map<LocalDate, Pair<Integer, Integer>>, because you want to do a second grouping, which means that for the same date you may have multiple client codes with separate counters for each of them.
So if I've understood you right, you want to get something like Map<LocalDate, MutablePair<Integer, Map<String, Integer>>>; if that's correct, try this code snippet:
public static void main(String[] args) {
String data = "C1,2019-1-7,12\n" +
"C2,2019-1-7,34\n" +
"C1,2019-1-16,56\n" +
"C2,2019-1-16,78\n" +
"C3,2019-1-16,90";
Stream.of(data.split("\n")).forEach(System.out::println);
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-M-d");
List<RevenueRecorder> revenueRecords = Stream.of(data.split("\n")).map(line -> {
String[] s = line.split(",");
String clientCode = s[0];
LocalDate reportDate = LocalDate.parse(s[1].trim(), formatter);
int revenue = Integer.parseInt(s[2]);
return new RevenueRecorder(clientCode, reportDate, revenue);
}).collect(toList());
Supplier<MutablePair<Integer, Map<String, Integer>>> supplier = () -> MutablePair.of(0, new HashMap<>());
BiConsumer<MutablePair<Integer, Map<String, Integer>>, RevenueRecorder> accumulator = (pair, recorder) -> {
pair.setLeft(pair.getLeft() + recorder.getRevenue());
pair.getRight().merge(recorder.getClientCode(), 1, Integer::sum);
};
BinaryOperator<MutablePair<Integer, Map<String, Integer>>> combiner = (p1, p2) -> {
p1.setLeft(p1.getLeft() + p2.getLeft());
p2.getRight().forEach((key, val) -> p1.getRight().merge(key, val, Integer::sum));
return p1;
};
Map<LocalDate, MutablePair<Integer, Map<String, Integer>>> pairedReportDateMRR = revenueRecords.stream()
.collect(
groupingBy(RevenueRecorder::getReportDate,
Collector.of(supplier, accumulator, combiner))
);
System.out.println(pairedReportDateMRR);
}
I want to verify that all of the contents of the ArrayList are equal to the value of a String variable. If any value does not match, its index number should be printed with an error message, like "value at index 2 didn't match the value of the expectedName variable".
When I run the code below, it prints the error message for all three indexes, instead of printing only index number 1.
Please note that I'm reading the data from a CSV file, putting it into an ArrayList, and then validating it against the expected data in a String variable.
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
public class ValidateVideoDuration {
private static final String CSV_FILE_PATH = "C:\\Users\\videologs.csv";
public static void main(String[] args) throws IOException {
String expectedVideo1Duration = "00:00:30";
String expectedVideo2Duration = "00:00:10";
String expectedVideo3Duration = "00:00:16";
String actualVideo1Duration = "";
String actualVideo2Duration = "";
String actualVideo3Duration = "";
ArrayList<String> actualVideo1DurationList = new ArrayList<String>();
ArrayList<String> actualVideo2DurationList = new ArrayList<String>();
ArrayList<String> actualVideo3DurationList = new ArrayList<String>();
try (Reader reader = Files.newBufferedReader(Paths.get(CSV_FILE_PATH));
CSVParser csvParser = new CSVParser(reader,
CSVFormat.DEFAULT.withFirstRecordAsHeader().withIgnoreHeaderCase().withTrim());) {
for (CSVRecord csvRecord : csvParser) {
// Accessing values by Header names
actualVideo1Duration = csvRecord.get("Video 1 Duration");
actualVideo1DurationList.add(actualVideo1Duration);
actualVideo2Duration = csvRecord.get("Video 2 Duration");
actualVideo2DurationList.add(actualVideo2Duration);
actualVideo3Duration = csvRecord.get("Video 3 Duration");
actualVideo3DurationList.add(actualVideo3Duration);
}
}
for (int i = 0; i < actualVideo2DurationList.size(); i++) {
if (actualVideo2DurationList.get(i) != expectedVideo2Duration) {
System.out.println("Duration of Video 1 at index number " + Integer.toString(i)
+ " didn't match the expected duration");
}
}
}
}
The data inside my CSV file looks like the following:
video 1 duration, video 2 duration, video 3 duration
00:00:30, 00:00:10, 00:00:16
00:00:30, 00:00:15, 00:00:15
00:00:25, 00:00:10, 00:00:16
Don't use == or != for string comparison. == checks the referential equality of two Strings, not the equality of their values. Use the .equals() method instead.
Change your if condition to if (!actualVideo2DurationList.get(i).equals(expectedVideo2Duration))
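Putting it together, a minimal sketch of the corrected loop (Objects.equals is used so a null entry can't throw a NullPointerException):
import java.util.Objects;

for (int i = 0; i < actualVideo2DurationList.size(); i++) {
    if (!Objects.equals(actualVideo2DurationList.get(i), expectedVideo2Duration)) {
        System.out.println("Duration of Video 2 at index number " + i
            + " didn't match the expected duration");
    }
}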
I have several bank statements from our users and I am trying to figure out a way to parse the rows of transactions. I have used PDFBox previously, using TextArea and TextStripper, but I am not sure how to proceed with bank statements, since they will have an undetermined number of rows and the rows may or may not be of fixed size.
I wrote just such a parser to parse our Chase PDF credit card statements, to speed up tax-preparation time, with the help of an open source project called Apache Tika.
You just need to include the Tika core and PDF parser modules in your pom.xml dependencies:
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.17</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.17</version>
</dependency>
The PDF extractor is fairly straightforward as well:
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.xml.sax.ContentHandler;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
public class PdfExtractor {
private static Logger logger = LoggerFactory.getLogger(PdfExtractor.class);
public static void main(String args[]) throws Exception {
StopWatch sw = new StopWatch(); // StopWatch and the PrettyPrinter used below are timing/formatting helpers not shown in this answer
List<String> files = new ArrayList<>();
files.add("C:/Users/m/Downloads/20170115.pdf");
files.add("C:/Users/m/Downloads/20170215.pdf");
files.add("C:/Users/m/Downloads/20170315.pdf");
files.add("C:/Users/m/Downloads/20170415.pdf");
files.add("C:/Users/m/Downloads/20170515.pdf");
files.add("C:/Users/m/Downloads/20170615.pdf");
files.add("C:/Users/m/Downloads/20170715.pdf");
files.add("C:/Users/m/Downloads/20170815.pdf");
files.add("C:/Users/m/Downloads/20170915.pdf");
files.add("C:/Users/m/Downloads/20171015.pdf");
files.add("C:/Users/m/Downloads/20171115.pdf");
files.add("C:/Users/m/Downloads/20171215.pdf");
files.add("C:/Users/m/Downloads/20180115.pdf");
InputStream is;
List<ChasePdfParser.ChaseRecord> full = new ArrayList<>();
for (String fileName : files) {
logger.info("Now processing " + fileName);
is = new FileInputStream(fileName);
ContentHandler contenthandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser pdfparser = new PDFParser();
pdfparser.parse(is, contenthandler, metadata, new ParseContext());
String data = contenthandler.toString();
List<ChasePdfParser.ChaseRecord> chaseRecords = ChasePdfParser.parse(data);
full.addAll(chaseRecords);
is.close();
}
logger.info("Total processing time: " + PrettyPrinter.toMsSoundsGood(sw.getTime()));
full.forEach(cr -> System.err.println(cr.date + "|" + cr.desc + "|" + cr.amt));
}
}
The line parser is also fairly straightforward; since each line has all the necessary info, it's easy to parse:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class ChasePdfParser {
private static Logger logger = LoggerFactory.getLogger(ChasePdfParser.class);
private static int FOR_TAX_YEAR = 2017;
private static String YEAR_EXTENSION = "/" + FOR_TAX_YEAR;
private static DateTimeFormatter check = DateTimeFormatter.ofPattern("MM/dd/uuuu");
private static List<String> exclusions = new ArrayList<>(Arrays.asList("Payment Thank You", "AUTOMATIC PAYMENT"));
public static List<ChaseRecord> parse(String data) {
List<ChaseRecord> l = new ArrayList<>();
for (String line : data.split("\n")) {
if (line.isEmpty()) continue;
String[] split = line.split("\\s");
if (split == null || split.length == 0) continue;
String test = split[0];
if (!isMMDD(test)) continue;
if(skip(line)) continue;
if (split.length < 4) continue;
ChaseRecord cr = new ChaseRecord();
cr.date = extractDate(test);
try {
String last = split[split.length - 1];
last = last.replaceAll(",", "");
cr.amt = Double.parseDouble(last);
} catch (NumberFormatException e) {
e.printStackTrace();
}
cr.desc = String.join(" ", Arrays.copyOfRange(split, 1, split.length - 1));
cr.desc = cr.desc.replaceAll("\\s\\s+", " ");
l.add(cr);
}
return l;
}
private static boolean skip(String s) {
if (s == null || s.isEmpty()) {
return true;
}
for (String e : exclusions) {
if (s.contains(e)) {
return true;
}
}
return false;
}
protected static LocalDate extractDate(String s) {
if (!isMMDD(s)) {
return null;
}
LocalDate localDate = LocalDate.parse(s + YEAR_EXTENSION, check);
return localDate;
}
public static boolean isMMDD(String s) {
if (s == null || s.isEmpty() || s.length() != 5) {
return false;
}
try {
s += YEAR_EXTENSION;
LocalDate.parse(s, check);
return true;
} catch (Exception e) {
return false;
}
}
public static class ChaseRecord {
public LocalDate date;
public String desc;
public Double amt;
@Override
public String toString() {
return "ChaseRecord{" +
"date=" + date +
", desc='" + desc + '\'' +
", amt=" + amt +
'}';
}
}
}
Late to the party. You can also use pdftotext as a workaround. Every once in a great while it will miss a currency amount, particularly in the upper right of the table.
As you'd expect, you join the text on newlines and then start chopping the lines into lists, which you then write out to a TSV. The approach looks like this (HTH):
import csv
import pdftotext
import re
from datetime import *
import os
import pandas as pd
# compile directory ref:
path='path to directory'
directory = os.fsencode(path)
# https://stackoverflow.com/questions/42202872/how-to-convert-list-to-row-dataframe-with-pandas
column_list = ['filesource','filedate','eventdate','description','bankcategory','amount']
filelist=[]
# an example of how to scrape chase statement pdf into list of lists:
def process_pdf_data(filename,filesource,filedate):
    # trying with pdftotext
    # print('starting pdf content scrape', file)
    with open(filename, "rb") as f:
        pdf = pdftotext.PDF(f)
    pdf_join="\n".join(pdf)
    pdf_array=pdf_join.split('\n')
    # print(pdf_array)
    startint=0
    line=''
    # at this point, the pdf_array is just a list of strings read serially from the pdf in succession down the page.
    while line!='Account activity' and startint<=1000:
        line=pdf_array[startint]
        startint+=1
    startint-=1 # bc it still gets incremented on exit above
    # drop data before 'Account activity' as we won't need it.
    del pdf_array[:startint]
    # print(pdf_array)
    # set pattern for date detection
    # https://www.programiz.com/python-programming/regex
    # https://docs.python.org/3/library/re.html
    pattern=re.compile("^([A-Z]|[a-z]){3} [0-9]{1,2}, [0-9]{4}$") # test pattern for regex eval of date
    startint=0 # use for test exit limits
    # print('entering pdf content eval', file)
    while startint<len(pdf_array):
        # if string has certain date format:
        # if it doesn't have this conversion then it's suspect and maybe write it to log
        # print(startint,pdf_array[startint])
        if pattern.match(pdf_array[startint])!=None:
            # transform it to date
            # https://docs.python.org/3/library/datetime.html
            datestr=datetime.strptime(pdf_array[startint], '%b %d, %Y').date().isoformat()
            # print('pattern match',datestr)
            # look ahead and keep next few strings:
            description=pdf_array[startint+2]
            bankcategory=pdf_array[startint+4]
            amount=''
            if '$' in pdf_array[startint+6]:
                amount=pdf_array[startint+6] # will mess with $/string type conversion downstream, when combining sources
            # write to list of lists
            templist=[]
            templist.append(filesource)
            templist.append(filedate)
            templist.append(datestr)
            templist.append(description)
            templist.append(bankcategory)
            templist.append(amount)
            # print(templist)
            filelist.append(templist)
        startint+=1
# call process_pdf_data(<filename>, <filesource>, <filedate>) for each statement file you want to scrape