I am trying to read data from an Access database using the Java library Jackcess. The database has several tables and queries, some of which are linked tables pointing to Excel sheets on the file-system.
I saw that I can use a LinkResolver to intercept the resolving of the linked data, but it expects a full-blown Database, not just data for one single table.
I can easily use Apache POI to open the Excel file and extract the necessary data, but I don't know how I can pass that data to the LinkResolver.
What is the simplest way to provide the location of the Excel file or read the data from the Excel file and pass it back to Jackcess so it can load the linked data successfully?
At this point in time, the LinkResolver API is only built for loading "remote" Table instances from other databases. It was not built to be a general purpose API for any type of external file. You could certainly file a feature request with the Jackcess project.
UPDATE:
As of the 2.1.7 release, Jackcess provides the CustomLinkResolver utility to facilitate loading linked tables from files which are not Access databases (using a temporary db).
I came up with the following initial implementation of a LinkResolver which builds a temporary database with the content from the Excel file. It still lacks some things, such as close-handling and removal of the temporary database file, but it seems to work for basic purposes.
/**
* Sample LinkResolver which reads the data from an Excel file
* The data is read from the first sheet and needs to contain a
* header-row with column-names and then data-rows with string/numeric values.
*/
public class ExcelFileLinkResolver implements LinkResolver {
private final LinkResolver parentResolver;
private final String fileNameInDB;
private final String tableName;
private final File excelFile;
public ExcelFileLinkResolver(LinkResolver parentResolver, String fileNameInDB, File excelFile, String tableName) {
this.parentResolver = parentResolver;
this.fileNameInDB = fileNameInDB;
this.excelFile = excelFile;
this.tableName = tableName;
}
@Override
public Database resolveLinkedDatabase(Database linkerDb, String linkeeFileName) throws IOException {
if(linkeeFileName.equals(fileNameInDB)) {
// TODO: handle close or create database in-memory if possible
File tempFile = File.createTempFile("LinkedDB", ".mdb");
Database linkedDB = DatabaseBuilder.create(Database.FileFormat.V2003, tempFile);
try (Workbook wb = WorkbookFactory.create(excelFile, null, true)) {
TableBuilder tableBuilder = new TableBuilder(tableName);
Table table = null;
List<Object[]> rows = new ArrayList<>();
for(org.apache.poi.ss.usermodel.Row row : wb.getSheetAt(0)) {
if(table == null) {
for(Cell cell : row) {
tableBuilder.addColumn(new ColumnBuilder(cell.getStringCellValue()
// column-names cannot contain some characters
.replace(".", ""),
DataType.TEXT));
}
table = tableBuilder.toTable(linkedDB);
} else {
List<String> values = new ArrayList<>();
for(Cell cell : row) {
if(cell.getCellTypeEnum() == CellType.NUMERIC) {
values.add(Double.toString(cell.getNumericCellValue()));
} else {
values.add(cell.getStringCellValue());
}
}
rows.add(values.toArray());
}
}
Preconditions.checkNotNull(table, "Did not have a row in " + excelFile);
table.addRows(rows);
} catch (InvalidFormatException e) {
throw new IllegalStateException(e);
}
return linkedDB;
}
return parentResolver.resolveLinkedDatabase(linkerDb, linkeeFileName);
}
}
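For completeness, here is a minimal sketch of how such a resolver could be wired up. The file names, paths and table name below are placeholders, and LinkResolver.DEFAULT is used as the fallback parent resolver:
// Hypothetical wiring of the resolver above; file and table names are placeholders.
Database db = DatabaseBuilder.open(new File("linker.accdb"));
db.setLinkResolver(new ExcelFileLinkResolver(
    LinkResolver.DEFAULT,              // fall back to normal linked-db resolution
    "C:\\data\\Sheet.xlsx",            // file name as recorded in the linked-table definition
    new File("C:\\data\\Sheet.xlsx"),  // the actual Excel file to read
    "MyLinkedTable"));
Table linked = db.getTable("MyLinkedTable");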
Related
I am writing a service that obtains data from a large SQL query in the database (over 100,000 records) and streams it into a CSV file returned by an API. Is there any Java library function that does this, or any way to make the code below more efficient? Currently using Java 8 in a Spring Boot environment.
The code is below, with the SQL repository method and the service for the CSV. Preferably I'd like to write to the CSV file while the data is being fetched from SQL concurrently, as the query may take 2-3 minutes for the user.
We are using Snowflake DB.
public class ProductService {
private final ProductRepository productRepository;
private final ExecutorService executorService;
public ProductService(ProductRepository productRepository) {
this.productRepository = productRepository;
this.executorService = Executors.newFixedThreadPool(20);
}
public InputStream getproductExportFile(productExportFilters filters) throws IOException {
PipedInputStream is = new PipedInputStream();
PipedOutputStream os = new PipedOutputStream(is);
executorService.execute(() -> {
try {
Stream<productExport> productStream = productRepository.getproductExportStream(filters);
Field[] fields = Stream.of(productExport.class.getDeclaredFields())
.peek(f -> f.setAccessible(true))
.toArray(Field[]::new);
String[] headers = Stream.of(fields)
.map(Field::getName).toArray(String[]::new);
CSVFormat csvFormat = CSVFormat.DEFAULT.builder()
.setHeader(headers)
.build();
OutputStreamWriter outputStreamWriter = new OutputStreamWriter(os);
CSVPrinter csvPrinter = new CSVPrinter(outputStreamWriter, csvFormat);
productStream.forEach(productExport -> writeProductExportToCsv(productExport, csvPrinter, fields));
csvPrinter.close(); // flushes and closes the underlying writer as well
outputStreamWriter.close();
} catch (Exception e) {
logger.warn("Unable to complete writing to csv stream.", e);
} finally {
try {
os.close();
} catch (IOException ignored) { }
}
});
return is;
}
private void writeProductExportToCsv(productExport productExport, CSVPrinter csvPrinter, Field[] fields) {
Object[] values = Stream.of(fields).
map(f -> {
try {
return f.get(productExport);
} catch (IllegalAccessException e) {
return null;
}
})
.toArray();
try {
csvPrinter.printRecord(values);
csvPrinter.flush();
} catch (IOException e) {
logger.warn("Unable to write record to file.", e);
}
}
public Stream<productExport> getProductExportStream(ProductExportFilters filters) {
MapSqlParameterSource parameterSource = new MapSqlParameterSource();
parameterSource.addValue("customerId", filters.getCustomerId().toString());
parameterSource.addValue("practiceId", filters.getPracticeId().toString());
StringBuilder sqlQuery = new StringBuilder("SELECT * FROM dbo.Product ");
sqlQuery.append("\nWHERE CUSTOMERID = :customerId\n" +
"AND PRACTICEID = :practiceId\n"
);
Streaming allows you to transfer the data little by little, without having to load it all into the server's memory. You can do your operations by using the extractData() method of a ResultSetExtractor. You can find the javadoc for ResultSetExtractor here.
You can view an example using ResultSetExtractor here.
You can also easily run your queries as a ResultSet using JdbcTemplate; you can take a look here at an example of using a ResultSetExtractor.
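For illustration, here is a rough sketch of streaming the rows straight into your existing CSVPrinter with a ResultSetExtractor; the namedParameterJdbcTemplate bean, the PRODUCT_QUERY constant and the column names are assumptions, not your actual schema:
// Sketch only: stream each row into the CSVPrinter as it arrives, so rows never
// accumulate in memory. PRODUCT_QUERY and the column names are placeholders.
namedParameterJdbcTemplate.query(PRODUCT_QUERY, parameterSource, (ResultSet rs) -> {
    try {
        while (rs.next()) {
            csvPrinter.printRecord(rs.getString("PRODUCTID"), rs.getString("PRODUCTNAME"));
        }
        csvPrinter.flush();
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
    return null;
});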
There is a product which we bought some time ago for our company; we even got the source code back then. https://northconcepts.com/ We were also evaluating Apache Camel, which had similar support, but it didn't suit our goal. If you really need speed, you should go to the lowest level possible: pure JDBC and as simple a CSV writer as possible.
The Northconcepts library itself provides the capability to read from JDBC and write to CSV at a lower level. We found a few tweaks which sped up the transmission and processing. With a single thread we are able to stream 100,000 records (with 400 columns) within 1-2 minutes.
Given that you didn't specify which database you use, I can give you only generic answers.
In general, code like this is network-limited, as a JDBC result set is usually transferred in packets of "only n rows", and only when you exhaust one does the database trigger fetching of the next packet. This property is often called fetch size, and you should greatly increase it. With default settings, most databases transfer 10-100 rows in one fetch. In Spring you can use the setFetchSize property. Some benchmarks here.
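For example, in Spring the wiring could look like this (a sketch only; the dataSource variable and the fetch size value are assumptions):
// Assumed wiring: raise the fetch size so each network round trip carries many rows.
JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
jdbcTemplate.setFetchSize(5000); // driver defaults are often only 10-100 rows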
There is other similar low-level stuff which you could do. For example, the Oracle JDBC driver has "InsensitiveResultSetBufferSize" - how big, in bytes, the buffer that holds the result set is. But those things tend to be database-specific.
That being said, the best way to really increase the speed of your transfer is to actually launch multiple queries. Divide your data on some column value, and then launch multiple parallel queries. Essentially, if you can design the data to support parallel queries working on easily distinguished subsets, the bottleneck can be moved to network or CPU throughput.
For example, one of your columns might be 'timestamp'. Instead of having one query fetch all rows, fetch multiple subsets of rows with a query like this:
SELECT * FROM dbo.Product
WHERE CUSTOMERID = :customerId
AND PRACTICEID = :practiceId
AND :lowerLimit <= timestamp AND timestamp < :upperLimit
Launch this query in parallel with different timestamp ranges. Aggregate the results of those subqueries in a shared ConcurrentLinkedQueue and build the CSV there, as sketched below.
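A rough sketch of that idea, assuming a timestamp column; TimeRange, splitIntoRanges(...) and fetchRange(...) are hypothetical helpers that split the time window and run the bounded query above:
// Sketch only: fan out range-bounded queries and collect the rows into a shared queue.
// start, end, splitIntoRanges(...) and fetchRange(...) are hypothetical.
ConcurrentLinkedQueue<productExport> results = new ConcurrentLinkedQueue<>();
List<CompletableFuture<Void>> futures = new ArrayList<>();
for (TimeRange range : splitIntoRanges(start, end, 8)) {
    futures.add(CompletableFuture.runAsync(
        () -> results.addAll(fetchRange(range.lower, range.upper)), executorService));
}
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
// drain 'results' into the CSVPrinter on a single thread afterwards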
With a similar approach I regularly read 100,000 rows/sec from an 80-column table in an Oracle DB. That is a 40-60 MB/sec sustained transfer rate from a table which is not even locked.
I have a CSV file which I should read using Apache POI. While reading the file, the data should follow a certain pattern: it should not contain ' or " characters or new lines. After validating, we need to insert the validated CSV into the DB. The code which I wrote for this is below.
@RequestMapping(value = "/insert", method = RequestMethod.POST)
public void uploadData(@RequestParam("file") final MultipartFile DataFile,
@PathVariable("DataType") final String DataType,
final Model model, final HttpServletRequest request,
final HttpServletResponse response) throws Exception {
byte[] bytes = null;
InputStream inputStream = null;
if (DataFile != null && !DataFile.isEmpty()) {
inputStream = DataFile.getInputStream();
LOGGER.info("Making Service call to save imported Enrichment details in DB ");
if (StringUtils.equalsIgnoreCase(DataType, "csvData1")) {
bytes = enrichmentDataFile.getBytes();
inputStream = new ByteArrayInputStream(Pattern.compile("(\\r\"|\\n\"|\\r\"\\n\"|\"|\')+")
.matcher(new String(bytes)).replaceAll("").getBytes(Charset.forName("UTF-8")));
DataService.insertData(inputStream,
DataFile.getOriginalFilename());//reading data using Apache POI and inserting into db
} else if (StringUtils.equalsIgnoreCase(DataType, "csvData2")) {
DataService.insertData(inputStream,
DataFile.getOriginalFilename());//reading data using Apache POI and inserting into db
}
}
}
I am able to insert csvData2 into the DB, but when I try to insert csvData1, it creates an empty file and that empty file gets inserted into the DB.
Can anyone suggest how I can validate the InputStream (CSV) so that it does not contain any ' or " characters or new lines, and then insert the validated data into the DB?
Your InputStream mutation works perfectly. The problem is most likely in the DataService.insertData method.
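For reference, a quick check of the same pattern on a small sample string (the expected result is noted in the comment):
// The pattern strips quote characters; note that a bare CR/LF not followed by a
// quote is left untouched, which is fine if the quotes are what break your parsing.
String sample = "abc'def\"ghi";
String cleaned = Pattern.compile("(\\r\"|\\n\"|\\r\"\\n\"|\"|\')+")
    .matcher(sample).replaceAll("");
System.out.println(cleaned); // prints: abcdefghi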
I want to get the output of a Spark application (we only use core Spark, and the people working on the project do not want to change it to Spark SQL) as Parquet or Avro files.
When I looked for these two file types, I couldn't find any example without DataFrames, or Spark SQL in general. Can I achieve this without using Spark SQL?
My data is tabular; it has columns, but in the processing all the data will be used, not a single column. Its columns are decided at runtime, so there are no generic columns like "name, ID, address". It looks like this:
No f1 f2 f3 ...
1, 123.456, 123.457, 123.458, ...
2, 123.789, 123.790, 123.791, ...
...
You can't save an RDD as Parquet without converting it to a DataFrame. An RDD does not have a schema, but a Parquet file is a columnar format which needs a schema, so we need to convert the RDD to a DataFrame.
You can use the createDataFrame API; a minimal sketch follows.
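This sketch assumes a SparkSession named spark is available and that the column names are only known at runtime; the names, types, input RDD and output path are placeholders:
// Sketch: build a schema at runtime, wrap each value array in a Row, and write Parquet.
List<String> columns = Arrays.asList("No", "f1", "f2", "f3");
StructType schema = new StructType(columns.stream()
    .map(name -> DataTypes.createStructField(name, DataTypes.DoubleType, true))
    .toArray(StructField[]::new));
JavaRDD<Row> rowRdd = dataRdd.map(values -> RowFactory.create(values)); // dataRdd: JavaRDD<Object[]>
spark.createDataFrame(rowRdd, schema).write().parquet("/tmp/output.parquet");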
I tried this and it works like a champ...
public class ParquetHelper{
static ParquetWriter<GenericData.Record> writer = null;
private static Schema schema;
public ParquetHelper(Schema schema, String pathName){
try {
Path path = new Path(pathName);
writer = AvroParquetWriter.
<GenericData.Record>builder(path)
.withRowGroupSize(ParquetWriter.DEFAULT_BLOCK_SIZE)
.withPageSize(ParquetWriter.DEFAULT_PAGE_SIZE)
.withSchema(schema)
.withConf(new Configuration())
.withCompressionCodec(CompressionCodecName.SNAPPY)
.withValidation(true)
.withDictionaryEncoding(false)
.build();
this.schema = schema;
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
/**
 * Writes every valid record from the given RDD to the Parquet file.
 */
public static void writeToParquet(JavaRDD<Record> empRDDRecords) throws IOException {
empRDDRecords.foreach(record -> {
if(null != record && new RecordValidator().validate(record, schema).isEmpty()){
writer.write(record); // write the single record
}// TODO collect bad records here
});
writer.close();
}
}
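If it helps, a rough usage sketch (the schema file, output path and RDD variable are made up for illustration):
// Hypothetical usage of the helper above.
Schema schema = new Schema.Parser().parse(new File("/tmp/record.avsc"));
ParquetHelper helper = new ParquetHelper(schema, "/tmp/out.parquet");
ParquetHelper.writeToParquet(recordRdd); // recordRdd: JavaRDD<GenericData.Record>
Note that the writer lives in a static field, so this pattern assumes the foreach runs in the same JVM (e.g. local mode); in a distributed job you would need to write per partition or collect the records first.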
I want to skip the header line of a CSV file. As of now, I'm removing the header manually before loading it to Google Storage.
Below is my code:
PCollection<String> financeobj =p.apply(TextIO.read().from("gs://storage_path/Financials.csv"));
PCollection<ClassFinance> pojos5 = financeobj.apply(ParDo.of(new DoFn<String, ClassFinance>() { // converting String into classtype
private static final long serialVersionUID = 1L;
@ProcessElement
public void processElement(ProcessContext c) {
String[] strArr = c.element().split(",");
ClassFinance fin = new ClassFinance();
fin.setBeneficiaryFinance(strArr[0]);
fin.setCatlibCode(strArr[1]);
fin.set_rNR_(Double.valueOf(strArr[2]));
fin.set_rNCS_(Double.valueOf(strArr[3]));
fin.set_rCtb_(Double.valueOf(strArr[4]));
fin.set_rAC_(Double.valueOf(strArr[5]));
c.output(fin);
}
}));
I have checked the existing question on Stack Overflow, but I don't find it promising: Skipping header rows - is it possible with Cloud DataFlow?
Any help?
Edit: I have tried something like the below and it worked:
PCollection<String> financeobj = p.apply(TextIO.read().from("gs://google-bucket/final_input/Financials123.csv"));
PCollection<ClassFinance> pojos5 = financeobj.apply(ParDo.of(new DoFn<String, ClassFinance>() { // converting String into classtype
private static final long serialVersionUID = 1L;
@ProcessElement
public void processElement(ProcessContext c) {
String[] strArr2 = c.element().split(",");
String header = Arrays.toString(strArr2);
ClassFinance fin = new ClassFinance();
if(header.contains("Beneficiary"))
System.out.println("Header");
else {
fin.setBeneficiaryFinance(strArr2[0].trim());
fin.setCatlibCode(strArr2[1].trim());
fin.setrNR(Double.valueOf(strArr2[2].trim().replace("", "0")));
fin.setrNCS(Double.valueOf(strArr2[3].trim().replace("", "0")));
fin.setrCtb(Double.valueOf(strArr2[4].trim().replace("", "0")));
fin.setrAC(Double.valueOf(strArr2[5].trim().replace("", "0")));
c.output(fin);
}
}
}));
The older Stack Overflow post that you shared (Skipping header rows - is it possible with Cloud DataFlow?) does contain the answer to your question.
This option is currently not available in the Apache Beam SDK, although there is an open Feature Request in the Apache Beam JIRA issue tracker, BEAM-123. Note that, as of writing, this feature request is still open and unresolved, and it has been like that for 2 years already. However, it looks like some effort is being done in that sense, and the latest update in the issue is from February 2018, so I would advise you to stay updated on that JIRA issue, as it was last moved to the sdk-java-core component, and it may be getting more attention there.
With that information in mind, I would say that the approach you are using (removing the header before uploading the file to GCS) is the best option for you. I would refrain from doing it manually, as you can easily script that and automate the remove header ⟶ upload file process.
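If you go down that route, stripping the header before the upload is only a few lines of Java (the paths here are placeholders):
// Sketch: write a copy of the CSV without its first line, then upload that copy to GCS.
try (Stream<String> lines = Files.lines(Paths.get("Financials.csv"))) {
    Files.write(Paths.get("Financials_noheader.csv"),
        (Iterable<String>) lines.skip(1)::iterator);
}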
EDIT:
I have been able to come up with a simple filter using a DoFn. It might not be the most elegant solution (I am not an Apache Beam expert myself), but it does work, and you may be able to adapt it to your needs. It requires that you know beforehand the header of the CSV files being uploaded (as it will be filtering by element content), but again, take this just as a template that you may be able to modify to your needs:
public class RemoveCSVHeader {
// The Filter class
static class FilterCSVHeaderFn extends DoFn<String, String> {
String headerFilter;
public FilterCSVHeaderFn(String headerFilter) {
this.headerFilter = headerFilter;
}
@ProcessElement
public void processElement(ProcessContext c) {
String row = c.element();
// Filter out elements that match the header
if (!row.equals(this.headerFilter)) {
c.output(row);
}
}
}
// The main class
public static void main(String[] args) throws IOException {
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<String> vals = p.apply(TextIO.read().from("gs://BUCKET/FILE.csv"));
String header = "col1,col2,col3,col4";
vals.apply(ParDo.of(new FilterCSVHeaderFn(header)))
.apply(TextIO.write().to("out"));
p.run().waitUntilFinish();
}
}
This code works for me. I have used Filter.by() to filter out the header row from the CSV file.
static void run(GcsToDbOptions options) {
Pipeline p = Pipeline.create(options);
// Read the CSV file from GCS input file path
p.apply("Read Rows from " + options.getInputFile(), TextIO.read()
.from(options.getInputFile()))
// filter the header row
.apply("Remove header row",
Filter.by((String row) -> !((row.startsWith("dwid") || row.startsWith("\"dwid\"")
|| row.startsWith("'dwid'")))))
// write the rows to database using prepared statement
.apply("Write to Auths Table in Postgres", JdbcIO.<String>write()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(dataSource(options)))
.withStatement(INSERT_INTO_MYTABLE)
.withPreparedStatementSetter(new StatementSetter()));
PipelineResult result = p.run();
try {
result.getState();
result.waitUntilFinish();
} catch (UnsupportedOperationException e) {
// do nothing
} catch (Exception e) {
e.printStackTrace();
}
}
https://medium.com/@baranitharan/the-textio-write-1be1c07fbef0
The TextIO.Write in Dataflow now has a withHeader function to add a header row to the data. This function was added in version 1.7.0.
So you can add a header to your csv like this:
TextIO.Write.named("WriteToText")
.to("/path/to/the/file")
.withHeader("col_name1,col_name2,col_name3,col_name4")
.withSuffix(".csv");
The withHeader function automatically adds a newline character at the end of the header row.
I am trying to save an Excel file. The Excel file is a template with macros (*.xltm). I can open the file and edit the content, but when I try to save it, the destination Excel file is corrupt.
I try to save the file with:
int id = _workbook.getIDsOfNames(new String[] {"Save"})[0];
_workbook.invoke(id);
or/and
_xlsClientSite.save(_file, true);
You might try specifying a file format in your Save call.
If you're lucky, you can find the file format code you need in the Excel help. If you can't find what you need there, you'll have to get your hands dirty using the OLEVIEW.EXE program. There's likely a copy of it sitting on your hard drive somewhere, but if not, it's easy enough to find a copy with a quick Google search.
To use OLEVIEW.EXE:
Run it
Crack open the 'Type Libraries' entry
Find the version of Excel that you're using
Open that item
Search the enormous pile of text that's displayed for the string 'XlFileFormat'
Examine the XLFileFormat enum for a code that seems promising
If you are using Office2007 ("Excel12") like I am, you might try one of these values:
xlOpenXMLWorkbookMacroEnabled = 52
xlOpenXMLTemplateMacroEnabled = 53
Here's a method that I use to save Excel files using OLE:
/**
* Save the given workbook in the specified format.
*
* @param controlSiteAuto the OLE control site to use
* @param filepath the file to save to
* @param formatCode XlFileFormat code representing the file format to save as
* @param replaceExistingWithoutPrompt true to replace an existing file quietly, false to ask the user first
*/
public void saveWorkbook(OleAutomation controlSiteAuto, String filepath, Integer formatCode, boolean replaceExistingWithoutPrompt) {
Variant[] args = null;
Variant result = null;
try {
// suppress "replace existing?" prompt, if necessary
if (replaceExistingWithoutPrompt) {
setPropertyOnObject(controlSiteAuto, "Application", "DisplayAlerts", "False");
}
// if the given formatCode is null, for some reason, use a reasonable default
if (formatCode == null) {
formatCode = 51; // xlWorkbookDefault=51
}
// save the workbook
int[] id = controlSiteAuto.getIDsOfNames(new String[] {"SaveAs", "FileName", "FileFormat"});
args = new Variant[2];
args[0] = new Variant(filepath);
args[1] = new Variant(formatCode);
result = controlSiteAuto.invoke(id[0], args);
if (result == null || !result.getBoolean()) {
throw new RuntimeException("Unable to save active workbook");
}
// enable alerts again, if necessary
if (replaceExistingWithoutPrompt) {
setPropertyOnObject(controlSiteAuto, "Application", "DisplayAlerts", "True");
}
} finally {
cleanup(args);
cleanup(result);
}
}
protected void cleanup(Variant[] variants) {
if (variants != null) {
for (int i = 0; i < variants.length; i++) {
if (variants[i] != null) {
variants[i].dispose();
}
}
}
}
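For example, to save the workbook as a macro-enabled template (the path is just an illustration, and 53 is xlOpenXMLTemplateMacroEnabled from the XlFileFormat enum mentioned earlier):
// Hypothetical call using the method above.
saveWorkbook(controlSiteAuto, "C:\\temp\\MyTemplate.xltm", 53, true);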