Java Spring: Import Excel to SQL Server

I have code for importing 15,000 rows from an Excel file with Java Spring. It takes around 10 minutes in the production environment, but only around 5 minutes in the development environment. How can I improve the performance? Here is my code.
Flow of the code:
Check that each Excel row is clean to save
Save rows to the database one by one

Start of the row-checking loop:

Cell currentCell = cellsInRow.next();
String uuidAsString = uuid.toString();
Date today = Calendar.getInstance().getTime();
if (cellIndex == 0) {
    ble.setA(currentCell.getStringCellValue());
} else if (cellIndex == 1) {
    ble.setB(currentCell.getStringCellValue());
} else if (cellIndex == 2) {
    ble.setC(currentCell.getDateCellValue());
}

After the loop:

blacklistExternalRepository.saveAll(lstBlacklistExternal);

The code posted is not complete, so here is an idea based on the following assumptions:
the variable today can be computed only once per batch import
the Excel document has a regular format, that is, each row has at least three cells at indices 0, 1, and 2
With this in mind, you could do something like:
LocalDate today = LocalDate.now();      // computed once for the whole import
List<BLE> bleList = new ArrayList<>();  // a list of ble objects
for (Row r : rows) {
    Iterator<Cell> cellsInRow = ...     // get the cells in row r
    BLE ble = new BLE();
    ble.setA(cellsInRow.next().getStringCellValue());
    ble.setB(cellsInRow.next().getStringCellValue());
    ble.setC(cellsInRow.next().getDateCellValue());
    bleList.add(ble);
}
// do whatever you need to with the list of ble objects
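Since the question mentions saving to the database one by one, the persistence step is often the real bottleneck rather than the parsing. Below is a minimal sketch of chunked saving, assuming the repository is a Spring Data JpaRepository backed by Hibernate; the chunk size and class names are illustrative, not from the original code.

import java.util.List;

// Illustrative sketch: split the save into chunks so each transaction stays small.
// With Hibernate, JDBC batching can also be enabled (for example,
// spring.jpa.properties.hibernate.jdbc.batch_size=50 in application.properties)
// so the INSERTs are grouped into fewer round trips to SQL Server.
public class BleBatchSaver {

    private static final int CHUNK_SIZE = 500; // tune for your environment

    private final BlacklistExternalRepository blacklistExternalRepository;

    public BleBatchSaver(BlacklistExternalRepository blacklistExternalRepository) {
        this.blacklistExternalRepository = blacklistExternalRepository;
    }

    public void saveInChunks(List<BLE> bleList) {
        for (int from = 0; from < bleList.size(); from += CHUNK_SIZE) {
            int to = Math.min(from + CHUNK_SIZE, bleList.size());
            // each saveAll call runs in its own transaction by default
            blacklistExternalRepository.saveAll(bleList.subList(from, to));
        }
    }
}

Measure this against the current single saveAll call in both environments; if production is still much slower, the difference is more likely network latency or database load than the parsing code.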

Related

The change of the average according to the last row in Excel

I want to add an SQL query result to a created Excel file, and I need to average some columns at the bottom. My question is:
the number of rows changes between queries. How can I shift the average row so it always sits right after the last data row?
[screenshots: example, after writing to Excel]
I'm currently learning how to write PostgreSQL results to an Excel file created with Apache POI.
Resource resource = resourceLoader.getResource("classpath:/temps/" + querySelected.getTemplateName());
workbook = new XSSFWorkbook(resource.getInputStream());
XSSFSheet sheetTable1 = workbook.getSheet("Table1");
int rowCounter = 1;
for (Tname knz : tname) { // loop header reconstructed from the usage below
    Row values = sheetTable1.createRow(rowCounter);
    rowCounter++;
    Cell cell = values.createCell(0, CellType.NUMERIC);
    cell.setCellValue(knz.gettablename().doubleValue());
}
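One way to keep the average under a changing number of rows, as a sketch (the column and the starting row are assumptions): after all data rows are written, ask the sheet for its last row index and write an AVERAGE formula just below it.

// Sketch: append an AVERAGE formula after the last data row.
// Assumes the data sits in column A starting at Excel row 2; note that POI
// row indices are 0-based while Excel formula references are 1-based.
int lastDataRow = sheetTable1.getLastRowNum();        // 0-based index of the last data row
Row averageRow = sheetTable1.createRow(lastDataRow + 1);
Cell averageCell = averageRow.createCell(0);
averageCell.setCellFormula("AVERAGE(A2:A" + (lastDataRow + 1) + ")");

Because the range is built from getLastRowNum(), the formula follows the data no matter how many rows the query returned.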

Loading Excel files in Java takes too much time

I would like to load an Excel file into a Java program every day, parse it, and insert the necessary data into a database, but I don't want to load the whole file each time the program runs. I only need the last 90 rows. Is it possible to load an Excel (XLSM) file partially in Java (preferred, but another programming language would also work) to reduce the loading time?
The whole run takes around 60-70 seconds, of which loading the Excel file takes 35 seconds. The file has 4,000 rows, and each row has 900 columns.
try {
    workbook = WorkbookFactory.create(new FileInputStream(file));
    sheet = workbook.getSheetAt(0);
    rowSize = sheet.getLastRowNum();
    myWriter = new FileWriter("/Users/mykyusuf/Desktop/filename.txt");
    Row malzeme = sheet.getRow(1);
    Row kaynak = sheet.getRow(2);
    Row endeks = sheet.getRow(3);
    myWriter.write("insert all\n");
    Row row = sheet.getRow(rowSize - 1);
    for (int i = 4; i < rowSize - 1; i++) {
        row = sheet.getRow(i);
        for (Cell cell : row) {
            if (cell.getColumnIndex() > 3) {
                myWriter.write("into piyasa_takip (tarih,malzeme,kaynak,endeks,deger) values (to_date(\'" + row.getCell(3).getLocalDateTimeCellValue().toLocalDate() + "\','YYYY-MM-DD'),\'" + malzeme.getCell(cell.getColumnIndex()) + "\',\'" + kaynak.getCell(cell.getColumnIndex()) + "\',\'" + endeks.getCell(cell.getColumnIndex()) + "\',\'" + cell + "\')\n");
            }
        }
    }
    row = sheet.getRow(rowSize - 1);
    for (Cell cell : row) {
        if (cell.getColumnIndex() > 3) {
            myWriter.write("into piyasa_takip (tarih,malzeme,kaynak,endeks,deger) values (to_date(\'" + row.getCell(3).getLocalDateTimeCellValue().toLocalDate() + "\','YYYY-MM-DD'),\'" + malzeme.getCell(cell.getColumnIndex()) + "\',\'" + kaynak.getCell(cell.getColumnIndex()) + "\',\'" + endeks.getCell(cell.getColumnIndex()) + "\',\'" + cell + "\')\n");
        }
    }
    myWriter.write(" Select * from DUAL\n");
    myWriter.close();
}
I do not know a simple answer to your question, but I want to help you figure it out.
There are two substantially different formats: *.XLS (old) and *.XLSX (new). In general, the new format is more compact because it uses zipping as part of the container.
I don't know a simple way to "cut" the last 90 rows from an Excel file, especially since Excel is a complicated format with tabs, formulas and hyperlinks (and scripts :-) ) in a document.
But we can use the "divide and rule" principle. If you have a big Excel file locally and this file loads very slowly on the remote host, you can process the file locally (to extract only the new records into another file) and load only these "modifications" to the remote host.
Thus, you divide the task into two parts: super-simple processing of the large file locally (to extract the changed part) and normal, smart processing on the remote host.
Maybe this will help you?
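To make the local step concrete, here is a minimal sketch of the "extract only new records" idea, assuming the file only grows at the bottom and the last processed row index is kept in a small marker file; all file names are illustrative. Note that it still opens the whole workbook locally; the gain is that only the small delta file travels to the remote host.

import org.apache.poi.ss.usermodel.*;

import java.io.File;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DeltaExtractor {
    public static void main(String[] args) throws Exception {
        // Marker file remembering how far the previous run got
        // (Files.readString/writeString need Java 11+).
        Path marker = Paths.get("last-row.txt");
        int lastProcessed = Files.exists(marker)
                ? Integer.parseInt(Files.readString(marker).trim())
                : 0;

        try (Workbook wb = WorkbookFactory.create(new File("big-file.xlsm"));
             PrintWriter out = new PrintWriter("delta.csv")) {
            Sheet sheet = wb.getSheetAt(0);
            int lastRow = sheet.getLastRowNum();
            DataFormatter fmt = new DataFormatter();
            // Write only the rows added since the previous run.
            for (int i = lastProcessed + 1; i <= lastRow; i++) {
                Row row = sheet.getRow(i);
                if (row == null) continue;
                StringBuilder line = new StringBuilder();
                for (Cell cell : row) {
                    if (line.length() > 0) line.append(';');
                    line.append(fmt.formatCellValue(cell));
                }
                out.println(line);
            }
            Files.writeString(marker, String.valueOf(lastRow));
        }
    }
}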
Maybe you can try Free Spire.XLS to solve this.
I chose some data (70 rows and 8 columns), and it took me 1-2 seconds to read it.
Hope it helps you save some time.
The code is right below:
import com.spire.xls.Workbook;
import com.spire.xls.Worksheet;

public class GetCellRange {
    public static void main(String[] args) {
        // Load the sample document
        Workbook workbook = new Workbook();
        workbook.loadFromFile("sample.xlsx");
        // Get the first worksheet
        Worksheet worksheet = workbook.getWorksheets().get(0);
        // Print the chosen range, cells separated by tabs, one row per line
        for (int row = 1; row <= 70; row++) {
            for (int col = 1; col <= 8; col++) {
                System.out.print(worksheet.getCellRange(row, col).getValue() + "\t");
            }
            System.out.print("\n");
        }
    }
}

How to retrieve the date from an Excel formula with Apache POI

I have an Excel sheet where a date cell is assigned the Excel formula TODAY() + 1, so today it shows 03/10/2018 by default. I wrote code to read the data from the Excel file that contains the formula, but the date I get back is different.
Code :
Cell c = CellUtil.getCell(r, columnIndex);
CellType type = c.getCellType();
if (type == CellType.FORMULA) {
    switch (c.getCachedFormulaResultType()) {
        case NUMERIC:
            if (DateUtil.isCellDateFormatted(c)) {
                value = (new SimpleDateFormat("dd-MM-yyyy").format(c.getDateCellValue()));
                data.add(value); // Date should display 03-10-2018 but it's showing 23-01-2018
            } else {
                value = (c.getNumericCellValue()) + "";
                data.add(value);
            }
            break;
        case STRING:
            value = c.getStringCellValue();
            data = new LinkedList<String>(Arrays.asList(value.split(";")));
            break;
    }
}
I don't know why it's returning a date from January when the formula TODAY() + 1 is applied.
Similarly, another formula, TODAY() + 15, returns 22-04-2018.
As stated in Formula Evaluation:
"The Excel file format (both .xls and .xlsx) stores a "cached" result
for every formula along with the formula itself. This means that when
the file is opened, it can be quickly displayed, without needing to
spend a long time calculating all of the formula results. It also
means that when reading a file through Apache POI, the result is
quickly available to you too!"
So all formulas have cached results stored from the last time they were evaluated. That is either the last time the workbook was opened in Excel, recalculated and saved, or the last time an evaluation was done outside of Excel.
So if a cell containing the formula =TODAY() has a cached result of 22-01-2018 stored, then the workbook was last evaluated on January 22, 2018.
To always get current formula results, you need to evaluate the formulas before reading them. The simplest way:
...
workbook.getCreationHelper().createFormulaEvaluator().evaluateAll();
...
Or you can use a DataFormatter together with a FormulaEvaluator:
...
DataFormatter formatter = new DataFormatter();
FormulaEvaluator evaluator = workbook.getCreationHelper().createFormulaEvaluator();
...
Cell cell = CellUtil.getCell(...);
...
String value = formatter.formatCellValue(cell, evaluator);
...
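Put together as a minimal, self-contained sketch of the DataFormatter variant (the file name and cell position are assumptions):

import org.apache.poi.ss.usermodel.*;

import java.io.File;

public class ReadFormulaDate {
    public static void main(String[] args) throws Exception {
        try (Workbook workbook = WorkbookFactory.create(new File("dates.xlsx"))) {
            DataFormatter formatter = new DataFormatter();
            FormulaEvaluator evaluator = workbook.getCreationHelper().createFormulaEvaluator();

            // evaluate the TODAY() + 1 formula now instead of trusting the cached result
            Cell cell = workbook.getSheetAt(0).getRow(0).getCell(0); // cell A1
            System.out.println(formatter.formatCellValue(cell, evaluator));
        }
    }
}

Passing the evaluator to formatCellValue makes the formatter evaluate the formula on the fly and render the result using the cell's own date format.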

DataFrames are slow to parse through a small amount of data

I have 2 classes doing a similar task in Apache Spark, but the one using DataFrames is many times (30x) slower than the "regular" one using RDDs.
I would like to use DataFrames since that would eliminate a lot of our code and classes, but obviously I can't have it be that much slower.
The data set is nothing big. We have some 30 files with JSON data in each, about events triggered by activities in another piece of software. There are between 0 and 100 events in each file.
A data set with 82 events takes about 5 minutes to process with DataFrames.
Sample code:
public static void main(String[] args) throws ParseException, IOException {
    SparkConf sc = new SparkConf().setAppName("POC");
    JavaSparkContext jsc = new JavaSparkContext(sc);
    SQLContext sqlContext = new SQLContext(jsc);
    conf = new ConfImpl();
    HashSet<String> siteSet = new HashSet<>();

    // last month
    Date yesterday = monthDate(DateUtils.addDays(new Date(), -1)); // method that returns the date of the first of the month
    Date startTime = startofYear(new Date(yesterday.getTime()));   // method that returns the date of the first of the year

    // list all the sites with a metric file
    JavaPairRDD<String, String> allMetricFiles = jsc.wholeTextFiles("hdfs:///somePath/*/poc.json");
    for (Tuple2<String, String> each : allMetricFiles.toArray()) {
        logger.info("Reading from " + each._1);
        DataFrame metric = sqlContext.read().format("json").load(each._1).cache();
        metric.count();
        boolean siteNameDisplayed = false;
        boolean dateDisplayed = false;
        do {
            Date endTime = DateUtils.addMonths(startTime, 1);
            HashSet<Row> totalUsersForThisMonth = new HashSet<>();
            for (String dataPoint : Conf.DataPoints) { // a String[] with 4 elements in this specific case
                try {
                    if (siteNameDisplayed == false) {
                        String siteName = parseSiteFromPath(each._1); // method returning a parsed String
                        logger.info("Data for site: " + siteName);
                        siteSet.add(siteName);
                        siteNameDisplayed = true;
                    }
                    if (dateDisplayed == false) {
                        logger.info("Month: " + formatDate(startTime)); // SimpleDateFormat("yyyy-MM-dd")
                        dateDisplayed = true;
                    }
                    DataFrame lastMonth = metric.filter("event.eventId=\"" + dataPoint + "\"")
                            .filter("creationDate >= " + startTime.getTime())
                            .filter("creationDate < " + endTime.getTime())
                            .select("event.data.UserId").distinct();
                    logger.info("Distinct for last month for " + dataPoint + ": " + lastMonth.count());
                    totalUsersForThisMonth.addAll(lastMonth.collectAsList());
                } catch (Exception e) {
                    // data does not fit the expected model so there is nothing to print
                }
            }
            logger.info("Total Unique for the month: " + totalUsersForThisMonth.size());
            startTime = DateUtils.addMonths(startTime, 1);
            dateDisplayed = false;
        } while (startTime.getTime() < monthDate(yesterday).getTime());
        // reset startTime for the next site
        startTime = startofYear(new Date(yesterday.getTime()));
    }
}
There are a few things that are not efficient in this code, but when I look at the logs they only add a few seconds to the whole processing time.
I must be missing something big.
I have run this with 2 executors and with 1 executor, and the difference is 20 seconds out of 5 minutes.
This is running with Java 1.7 and Spark 1.4.1 on Hadoop 2.5.0.
Thank you!
So there are a few things, but it's hard to say without seeing the breakdown of the different tasks and their timing. The short version is that you are doing way too much work in the driver and not taking advantage of Spark's distributed capabilities.
For example, you are collecting all of the data back to the driver program (toArray() and your for loop). Instead, you should just point Spark SQL at the files it needs to load.
For the operators, it seems like you're doing many aggregations in the driver; instead, you could use the driver to generate the aggregations and have Spark SQL execute them.
Another big difference between your in-house code and the DataFrame code is schema inference. Since you've already created classes to represent your data, it seems likely that you know the schema of your JSON data. You can likely speed up your code by adding the schema information at read time so Spark SQL can skip inference.
I'd suggest re-visiting this approach and trying to build something using Spark SQL's distributed operators.
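A sketch of what reading with an explicit schema could look like on the Spark 1.x Java API; the field types below are assumptions derived from the filters in the question and must match the real JSON layout:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Schema covering only the fields used in the filters; types are assumptions.
StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("creationDate", DataTypes.LongType, true),
        DataTypes.createStructField("event", DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("eventId", DataTypes.StringType, true),
                DataTypes.createStructField("data", DataTypes.createStructType(new StructField[] {
                        DataTypes.createStructField("UserId", DataTypes.StringType, true)
                }), true)
        }), true)
});

// Point Spark SQL at all the files at once instead of looping in the driver;
// with the schema supplied up front, the inference pass over the JSON is skipped.
DataFrame allMetrics = sqlContext.read()
        .schema(schema)
        .json("hdfs:///somePath/*/poc.json");

From there the per-site and per-month counts can be expressed with groupBy and agg so the work runs on the executors rather than in a driver loop.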

Converting to number format while writing from a text file to Excel in Java

We have an existing application where we read a text file and write it into Excel using Java. The text file has the header in the first row; subsequent rows are records fetched from a database.
Currently, while writing into Excel, all the columns are converted to text format. We want to write one column (say, column Z) as a number.
String[] column = line.split("\\|~");
for (int i = 0; i < column.length; i++) {
    Cell tmpCell = row.createCell(i);
    tmpCell.setCellValue(new HSSFRichTextString(column[i].trim()));
}
I am new to Java and need your help resolving this issue.
Thanks,
Santhosh
You can write to a cell like this:
cell.setCellValue(12345.00000);
But this alone will not be enough in cases where you don't want truncation. Using a style and data formats, you can avoid it. E.g.
CellStyle cellStyle = wb.createCellStyle();
DataFormat df = wb.createDataFormat();
cellStyle.setDataFormat(df.getFormat("0.0")); // you can adjust this as per your requirements
cell.setCellStyle(cellStyle);
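Applied to the loop from the question, a sketch could look like this; the 0-based index of column Z and the format string are assumptions:

// Create the style once, outside the per-row loop; cell styles are a
// limited per-workbook resource in HSSF.
CellStyle numberStyle = wb.createCellStyle();
numberStyle.setDataFormat(wb.createDataFormat().getFormat("0.0"));
int numericColumn = 25; // assumed 0-based index of column Z

String[] column = line.split("\\|~");
for (int i = 0; i < column.length; i++) {
    Cell tmpCell = row.createCell(i);
    String text = column[i].trim();
    if (i == numericColumn) {
        tmpCell.setCellValue(Double.parseDouble(text)); // numeric cell
        tmpCell.setCellStyle(numberStyle);
    } else {
        tmpCell.setCellValue(new HSSFRichTextString(text));
    }
}

Double.parseDouble will throw on non-numeric input, so validate or guard the value if the column can be empty.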
