Slow reading from CSV file - Java

I'm trying to read from a CSV file but it's slow. Here's the code, roughly explained:
private static Film[] readMoviesFromCSV() {
    // Regex to split by comma without splitting in double quotes.
    // https://regexr.com/3s3me <- example on this data
    var pattern = Pattern.compile(",(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)");
    Film[] films = null;
    try (var br = new BufferedReader(new FileReader(FILENAME))) {
        var start = System.currentTimeMillis();
        var temparr = br.lines().skip(1).collect(Collectors.toList()); // skip first line and read into List
        films = temparr.stream().parallel()
                .map(pattern::split)
                .filter(x -> x.length == 24 && x[7].equals("en")) // all fields (total 24) and English-speaking movies
                .filter(x -> x[14].length() > 0) // check if it has x[14] (date)
                .map(movieData -> new Film(movieData[8], movieData[9], movieData[14], movieData[22], movieData[23], movieData[7]))
                // movieData[8] = String title, movieData[9] = String overview
                // movieData[14] = String date (constructor parses it to a LocalDate object)
                // movieData[22] = String avgRating
                .toArray(Film[]::new);
        System.out.println(MessageFormat.format("Execution time: {0}", System.currentTimeMillis() - start));
        System.out.println(films.length);
    } catch (IOException e) { // also covers FileNotFoundException, its subclass
        e.printStackTrace();
    }
    return films;
}
The file is about 30 MB and it takes 3-4 seconds on average. I'm using streams but it's still really slow. Is it because of the regex split on each line?
EDIT: I've managed to speed up reading and processing time by 3x with the uniVocity-parsers library. On average it takes 950 ms to finish. That's pretty impressive.
private static Film[] readMoviesWithLib() {
    Film[] films = null;
    CsvParserSettings parserSettings = new CsvParserSettings();
    parserSettings.setLineSeparatorDetectionEnabled(true);
    RowListProcessor rowProcessor = new RowListProcessor();
    parserSettings.setProcessor(rowProcessor);
    parserSettings.setHeaderExtractionEnabled(true);
    CsvParser parser = new CsvParser(parserSettings);
    var start = System.currentTimeMillis();
    try {
        parser.parse(new BufferedReader(new FileReader(FILENAME)));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    List<String[]> rows = rowProcessor.getRows();
    films = rows.stream()
            .filter(Objects::nonNull)
            .filter(x -> x.length == 24 && x[14] != null && x[7] != null)
            .filter(x -> x[7].equals("en"))
            .map(movieData -> new Film(movieData[8], movieData[9], movieData[14], movieData[22], movieData[23], movieData[7]))
            .toArray(Film[]::new);
    System.out.println(MessageFormat.format("Time: {0}", System.currentTimeMillis() - start));
    return films;
}

Author of the univocity-parsers library here. You can speed up the code you posted in your edit a little bit further by rewriting it like this:
// initialize the ArrayList with a good size to avoid reallocation
final ArrayList<Film> films = new ArrayList<Film>(20000);

CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setLineSeparatorDetectionEnabled(true);
parserSettings.setHeaderExtractionEnabled(true);

// don't generate strings for columns you don't want
parserSettings.selectIndexes(7, 8, 9, 14, 22, 23);

// keep generating rows with the same number of columns found in the input;
// indexes not selected will have nulls as they are not processed.
parserSettings.setColumnReorderingEnabled(false);

parserSettings.setProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        if (row.length == 24 && "en".equals(row[7]) && row[14] != null) {
            films.add(new Film(row[8], row[9], row[14], row[22], row[23], row[7]));
        }
    }
});

CsvParser parser = new CsvParser(parserSettings);

long start = System.currentTimeMillis();
parser.parse(new File(FILENAME), "UTF-8"); // parse(File, encoding) wraps I/O errors in unchecked exceptions
System.out.println(MessageFormat.format("Time: {0}", System.currentTimeMillis() - start));

return films.toArray(new Film[0]);
For convenience, if you have to process stuff into different classes you can also use annotations in your Film class.
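For example, something like this should work (a sketch; the header names below are assumptions based on the column indexes used above, so adjust them to your actual file):

import com.univocity.parsers.annotations.Parsed;
import com.univocity.parsers.common.processor.BeanListProcessor;

public class Film {

    @Parsed(field = "original_language") // assumed header name for column 7
    private String language;

    @Parsed(field = "original_title") // assumed header name for column 8
    private String title;

    @Parsed(field = "overview")
    private String overview;

    @Parsed(field = "release_date")
    private String date;

    @Parsed(field = "vote_average")
    private String avgRating;

    @Parsed(field = "vote_count")
    private String voteCount;
}

// then, instead of the AbstractRowProcessor above:
BeanListProcessor<Film> beanProcessor = new BeanListProcessor<Film>(Film.class);
parserSettings.setProcessor(beanProcessor);
new CsvParser(parserSettings).parse(new File(FILENAME), "UTF-8");
List<Film> filmList = beanProcessor.getBeans();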
Hope this helps.

Related

How to remove specific duplicate data in array after sorting?

This is my code:
FileWriter writers = null;
try {
    BufferedReader reader = new BufferedReader(new FileReader("Database.txt"));
    ArrayList<Data> dataList = new ArrayList<>();
    String line = "";
    while ((line = reader.readLine()) != null) {
        // split the string, construct a Data object and add it to dataList
        dataList.add(parse(line));
    }
    reader.close();
    Collections.sort(dataList);
    writers = new FileWriter("final.txt");
    for (Data d : dataList) {
        writers.write(d.toString());
        writers.write("\r\n");
    }
    writers.close();
} catch (Exception ex) {
    ex.printStackTrace();
}
Input/Output in this code:
input: mamy, 30, new, old
daddy, 21, new, new
output: daddy, 21,new,new
mamy , 30, new, old
Expected output:
daddy,21,new
mamy,30,new,old
My problem is how to remove the duplicates from the list before storing it to final.txt. Any suggestions?
I think a Set is perfect for you; it eliminates duplicates.
Set<Data> dataSet = new HashSet<>(dataList);
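Two caveats: a HashSet only removes duplicates if Data overrides equals() and hashCode(). And since you already call Collections.sort(dataList), Data must implement Comparable, so a TreeSet can de-duplicate and sort in one step (a sketch, assuming your compareTo() is consistent with equals()):

// TreeSet treats elements whose compareTo() returns 0 as duplicates
// and keeps the rest sorted, so no separate sort step is needed:
dataList = new ArrayList<>(new TreeSet<>(dataList));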
To remove duplicates, use this code right before sorting:
ArrayList<Data> newDataList = new ArrayList<>();
for (Data element : dataList) {
    if (!newDataList.contains(element)) {
        newDataList.add(element);
    }
}
dataList = newDataList;
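The contains() check above is linear, which makes the whole loop quadratic; if the list gets large, a LinkedHashSet removes duplicates in one pass while preserving the original order (again, this assumes Data overrides equals() and hashCode()):

// LinkedHashSet keeps the first occurrence of each element, in encounter order:
dataList = new ArrayList<>(new LinkedHashSet<>(dataList));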

How to sort a file in alphabetical order?

I'm currently trying to make a program in Java to sort the contents of a file in alphabetical order. The file contains:
camera10:192.168.112.43
camera12:192.168.2.112
camera1:127.0.0.1
camera2:133.192.31.42
camera3:145.154.42.58
camera8:192.168.12.205
camera3:192.168.2.122
camera5:192.168.112.54
camera123:192.168.2.112
camera4:192.168.112.1
camera6:192.168.112.234
camera7:192.168.112.20
camera9:192.168.2.112
And I would like to sort them and write them back into the same file (which in this case is "daftarCamera.txt"). But my algorithm sorts them the wrong way; the result is:
camera10:192.168.112.43
camera123:192.168.2.112
camera12:192.168.2.112
camera1:127.0.0.1
camera2:133.192.31.42
camera3:145.154.42.58
camera3:192.168.2.122
camera4:192.168.112.1
camera5:192.168.112.54
camera6:192.168.112.234
camera7:192.168.112.20
camera8:192.168.12.205
camera9:192.168.2.112
while the result I want is:
camera1:127.0.0.1
camera2:133.192.31.42
camera3:145.154.42.58
camera3:192.168.2.122
camera4:192.168.112.1
camera5:192.168.112.54
camera6:192.168.112.234
camera7:192.168.112.20
camera8:192.168.12.205
camera9:192.168.2.112
camera10:192.168.112.43
camera12:192.168.2.112
camera123:192.168.2.112
Here's the code I use:
public void sortCamera() {
    BufferedReader reader = null;
    BufferedWriter writer = null;
    ArrayList<String> lines = new ArrayList<String>();
    try {
        reader = new BufferedReader(new FileReader(log));
        String currentLine = reader.readLine();
        while (currentLine != null) {
            lines.add(currentLine);
            currentLine = reader.readLine();
        }
        Collections.sort(lines);
        writer = new BufferedWriter(new FileWriter(log));
        for (String line : lines) {
            writer.write(line);
            writer.newLine();
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (reader != null) {
                reader.close();
            }
            if (writer != null) {
                writer.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
You are currently performing a lexical sort. To perform a numerical sort you need a Comparator that sorts the lines numerically on the value between "camera" and ":" in each line. You could split on ":", then use a regular expression to grab the digits, parse them, and compare. Like:
Collections.sort(lines, (String a, String b) -> {
    String[] leftTokens = a.split(":"), rightTokens = b.split(":");
    Pattern p = Pattern.compile("camera(\\d+)");
    int left = Integer.parseInt(p.matcher(leftTokens[0]).replaceAll("$1"));
    int right = Integer.parseInt(p.matcher(rightTokens[0]).replaceAll("$1"));
    return Integer.compare(left, right);
});
Making a fully reproducible example:
List<String> lines = new ArrayList<>(Arrays.asList( //
        "camera10:192.168.112.43", //
        "camera12:192.168.2.112", //
        "camera1:127.0.0.1", //
        "camera2:133.192.31.42", //
        "camera3:145.154.42.58", //
        "camera8:192.168.12.205", //
        "camera3:192.168.2.122", //
        "camera5:192.168.112.54", //
        "camera123:192.168.2.112", //
        "camera4:192.168.112.1", //
        "camera6:192.168.112.234", //
        "camera7:192.168.112.20", //
        "camera9:192.168.2.112"));
Collections.sort(lines, (String a, String b) -> {
    String[] leftTokens = a.split(":"), rightTokens = b.split(":");
    Pattern p = Pattern.compile("camera(\\d+)");
    int left = Integer.parseInt(p.matcher(leftTokens[0]).replaceAll("$1"));
    int right = Integer.parseInt(p.matcher(rightTokens[0]).replaceAll("$1"));
    return Integer.compare(left, right);
});
System.out.println(lines);
And that outputs
[camera1:127.0.0.1, camera2:133.192.31.42, camera3:145.154.42.58, camera3:192.168.2.122, camera4:192.168.112.1, camera5:192.168.112.54, camera6:192.168.112.234, camera7:192.168.112.20, camera8:192.168.12.205, camera9:192.168.2.112, camera10:192.168.112.43, camera12:192.168.2.112, camera123:192.168.2.112]
Your sorting algorithm assumes the ordinary collating sequence, i.e. it sorts the strings in alphabetical order, as if the digits were letters; the comparison is character by character, which is why "camera10" sorts before "camera2".
You need to specify an ad-hoc comparison function that splits each string and extracts the numerical value of the suffix.
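A minimal sketch of such a comparator, assuming every line starts with the fixed prefix "camera" followed by digits and a colon:

// compare lines by the integer between "camera" and ':'
Comparator<String> byCameraNumber = Comparator.comparingInt(
        line -> Integer.parseInt(line.substring("camera".length(), line.indexOf(':'))));
lines.sort(byCameraNumber);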

Convert CSV to JSON array in Java Springboot

Hi, I am trying to convert a CSV file into a JSON array using a dependency called csvReader, but when I run the code it prints out the JSON response incorrectly, and I am not sure why. Would anyone be able to point me in the right direction?
#GetMapping("/convert")
public List<List<String>> convertCSV() throws FileNotFoundException {
List<List<String>> records = new ArrayList<List<String>>();
try (CSVReader csvReader = new CSVReader(new FileReader("C:/Download/cities.csv"));) {
String[] values = null;
while ((values = csvReader.readNext()) != null) {
records.add(Arrays.asList(values));
}
} catch (IOException e) {
e.printStackTrace();
}
return values;
}
Your case is not a big deal.
You can read that CSV and build the JSON yourself:
read the first row to determine the columns; the rest of the rows are the values.
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class Foo {

    public static void main(String[] args) throws Exception {
        List<String> csvRows = null;
        try (var reader = Files.lines(Paths.get("dataFile.csv"))) {
            csvRows = reader.collect(Collectors.toList());
        } catch (Exception e) {
            e.printStackTrace();
        }
        if (csvRows != null) {
            String json = csvToJson(csvRows);
            System.out.println(json);
        }
    }

    public static String csvToJson(List<String> csv) {
        // remove empty lines;
        // this permanently modifies the list, so be careful
        // if you want to use it after executing this method
        csv.removeIf(e -> e.trim().isEmpty());

        // csv is empty or declares only columns
        if (csv.size() <= 1) {
            return "[]";
        }

        // first line = column names
        String[] columns = csv.get(0).split(",");

        // build all rows
        StringBuilder json = new StringBuilder("[\n");
        csv.subList(1, csv.size()) // sublist without the first row (columns)
           .stream()
           .map(e -> e.split(","))
           .filter(e -> e.length == columns.length) // row size must match column count
           .forEach(row -> {
               json.append("\t{\n");
               for (int i = 0; i < columns.length; i++) {
                   json.append("\t\t\"")
                       .append(columns[i])
                       .append("\" : \"")
                       .append(row[i])
                       .append("\",\n"); // comma-1
               }
               // replace comma-1 with \n
               json.replace(json.lastIndexOf(","), json.length(), "\n");
               json.append("\t},"); // comma-2
           });
        // remove comma-2
        json.replace(json.lastIndexOf(","), json.length(), "");
        json.append("\n]");
        return json.toString();
    }
}
Tested on:
fname,lname,note
Shaun,Curtis,a
Kirby,Beil,b
-----------------------
[
{
"fname" : "Shaun",
"lname" : "Curtis",
"note" : "a"
}, {
"fname" : "Kirby",
"lname" : "Beil",
"note" : "b"
}
]
This method works on any CSV structure, with no need to map columns. Note that the naive split(",") does not handle quoted fields that themselves contain commas; use a real CSV parser for such data.
That happens because you are reading the data into Strings and printing a List of Strings. If you want to map the CSV to an object (a JSON object), you need to read the CSV as a bean. Please find below a code snippet; to print as JSON, override the toString method to produce JSON format.
User.java
public class User {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @NotNull
    private String name;

    @NotNull
    private String surname;

    // getters and setters
}
CsvReaderUtil.java
public static List<User> readCsvFile() throws IOException {
    List<User> list = null;
    CSVReader reader = null;
    InputStream is = null;
    try {
        File initialFile = new File("C:\\Users\\ER\\Desktop\\test.csv");
        is = new FileInputStream(initialFile);
        reader = new CSVReader(new InputStreamReader(is), ',', '"', 1);
        ColumnPositionMappingStrategy strat = new ColumnPositionMappingStrategy();
        strat.setType(User.class);
        String[] columns = new String[]{"id", "name", "surname"};
        strat.setColumnMapping(columns);
        CsvToBean csv = new CsvToBean();
        list = csv.parse(strat, reader);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (reader != null) {
            reader.close();
        }
        if (is != null) {
            is.close();
        }
    }
    return list;
}
Now print this List of Users as a JSON object.
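For example (a sketch using Jackson's ObjectMapper, which is not part of the snippet above; any JSON library would do):

// serialize the beans to a JSON array; handle the checked
// IOException/JsonProcessingException as appropriate in your code
ObjectMapper mapper = new ObjectMapper();
String json = mapper.writerWithDefaultPrettyPrinter()
        .writeValueAsString(readCsvFile());
System.out.println(json);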
Here is a useful example of how to transform CSV to JSON using Java 11+:
private String fromCsvToJson(String csvFile) throws JsonProcessingException {
    String[] lines = csvFile.split("\n");
    if (lines.length <= 1) {
        return "[]";
    }
    var headers = lines[0].split(",");
    var jsonFormat = Arrays.stream(lines)
            .skip(1)
            .map(line -> line.split(","))
            .filter(line -> headers.length == line.length)
            .map(line -> IntStream.range(0, headers.length)
                    .boxed()
                    .collect(toMap(i -> headers[i], i -> line[i], (a, b) -> b)))
            .collect(Collectors.toList());
    return new ObjectMapper().writeValueAsString(jsonFormat);
}
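Hypothetical usage, reading the whole file into a String first (Files.readString is also Java 11+; the path is made up):

// propagate or handle IOException / JsonProcessingException as needed
String csv = Files.readString(Path.of("C:/Download/cities.csv"));
System.out.println(fromCsvToJson(csv));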

Jar showing increased RAM activity after completing the task

I have a class KeywordCount which tokenizes a given sentence and tags it using a maxent tagger from Apache OpenNLP. I first tokenize the text and then feed it to the tagger. I have a problem: RAM usage of up to 165 MB after the jar has completed its tasks, while the rest of the program just makes a DB call and checks for new tasks. I have isolated the leak to this class. You can safely ignore the Apache POI Excel code. I need to know if any of you can find the leak in the code.
public class KeywordCount {

    Task task;
    String taskFolder = "";
    List<String> listOfWords;

    public KeywordCount(String taskFolder) {
        this.taskFolder = taskFolder;
        listOfWords = new ArrayList<String>();
    }

    public void tagText() throws Exception {
        String xlsxOutput = taskFolder + File.separator + "results_pe.xlsx";
        FileInputStream fis = new FileInputStream(new File(xlsxOutput));
        XSSFWorkbook wb = new XSSFWorkbook(fis);
        XSSFSheet sheet = wb.createSheet("Keyword Count");
        XSSFRow row = sheet.createRow(0);
        Cell cell = row.createCell(0);
        XSSFCellStyle csf = (XSSFCellStyle) wb.createCellStyle();
        csf.setVerticalAlignment(CellStyle.VERTICAL_TOP);
        csf.setBorderBottom(CellStyle.BORDER_THICK);
        csf.setBorderRight(CellStyle.BORDER_THICK);
        csf.setBorderTop(CellStyle.BORDER_THICK);
        csf.setBorderLeft(CellStyle.BORDER_THICK);
        Font fontf = wb.createFont();
        fontf.setColor(IndexedColors.GREEN.getIndex());
        fontf.setBoldweight(Font.BOLDWEIGHT_BOLD);
        csf.setFont(fontf);
        int rowNum = 0;
        BufferedReader br = null;
        InputStream modelIn = null;
        POSModel model = null;
        try {
            modelIn = new FileInputStream("taggers" + File.separator + "en-pos-maxent.bin");
            model = new POSModel(modelIn);
        } catch (IOException e) {
            // model loading failed, handle the error
            e.printStackTrace();
        } finally {
            if (modelIn != null) {
                try {
                    modelIn.close();
                } catch (IOException e) {
                    // ignore
                }
            }
        }
        File ftmp = new File(taskFolder + File.separator + "phrase_tmp.txt");
        if (ftmp.exists()) {
            br = new BufferedReader(new FileReader(ftmp));
            POSTaggerME tagger = new POSTaggerME(model);
            String line = "";
            while ((line = br.readLine()) != null) {
                if (line.equals("")) {
                    break;
                }
                row = sheet.createRow(rowNum++);
                if (line.startsWith("Match")) {
                    int index = line.indexOf(":");
                    line = line.substring(index + 1);
                    String[] sent = getTokens(line);
                    String[] tags = tagger.tag(sent);
                    for (int i = 0; i < tags.length; i++) {
                        if (tags[i].equals("NN") || tags[i].equals("NNP") || tags[i].equals("NNS") || tags[i].equals("NNPS")) {
                            listOfWords.add(sent[i].toLowerCase());
                        } else if (tags[i].equals("JJ") || tags[i].equals("JJR") || tags[i].equals("JJS")) {
                            listOfWords.add(sent[i].toLowerCase());
                        }
                    }
                    Map<String, Integer> treeMap = new TreeMap<String, Integer>();
                    for (String temp : listOfWords) {
                        Integer counter = treeMap.get(temp);
                        treeMap.put(temp, (counter == null) ? 1 : counter + 1);
                    }
                    listOfWords.clear();
                    sent = null;
                    tags = null;
                    if (!treeMap.isEmpty()) {
                        for (Map.Entry<String, Integer> entry : treeMap.entrySet()) {
                            row = sheet.createRow(rowNum++);
                            cell = row.createCell(0);
                            cell.setCellValue(entry.getKey().substring(0, 1).toUpperCase() + entry.getKey().substring(1));
                            XSSFCell cell1 = row.createCell(1);
                            cell1.setCellValue(entry.getValue());
                        }
                        treeMap.clear();
                    }
                    treeMap = null;
                }
                rowNum++;
            }
            br.close();
            tagger = null;
            model = null;
        }
        sheet.autoSizeColumn(0);
        fis.close();
        FileOutputStream fos = new FileOutputStream(new File(xlsxOutput));
        wb.write(fos);
        fos.close();
        System.out.println("Finished writing XLSX file for Keyword Count!!");
    }

    public String[] getTokens(String match) throws Exception {
        InputStream modelIn = new FileInputStream("taggers" + File.separator + "en-token.bin");
        TokenizerModel model = null;
        try {
            model = new TokenizerModel(modelIn);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (modelIn != null) {
                try {
                    modelIn.close();
                } catch (IOException e) {
                    // ignore
                }
            }
        }
        Tokenizer tokenizer = new TokenizerME(model);
        String[] tokens = tokenizer.tokenize(match);
        model = null;
        return tokens;
    }
}
My system GCs the RAM after it reaches 165 MB... but when I deploy to the server the GC is not performed and usage rises up to 480 MB (49% of RAM usage).
First of all, increased heap usage is not evidence of a memory leak. It may simply be that the GC has not run yet.
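For a quick sanity check you can force a collection and measure the live heap yourself; this is rough (System.gc() is only a hint to the JVM), but it helps distinguish "the GC simply hasn't run" from real retention:

// measure the heap that survives a forced collection
Runtime rt = Runtime.getRuntime();
System.gc();
long usedBytes = rt.totalMemory() - rt.freeMemory();
System.out.println("Live heap after GC: " + (usedBytes / (1024 * 1024)) + " MB");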
Having said that, it is doubtful that anyone can spot a memory leak just by "eyeballing" your code. The correct way to solve this is for >>you<< to read up on the techniques for finding Java memory leaks, and >>you<< then use the relevant tools (e.g. visualvm, jhat, etc) to search for the problem yourself.
Here are some references on finding storage leaks:
Troubleshooting Guide for Java SE 6 with HotSpot VM: Troubleshooting Memory Leaks. http://www.oracle.com/technetwork/java/javase/memleaks-137499.html (see Note 1)
How to find a Java Memory Leak
Note 1: This link is liable to break. If it does, use Google to find the article.
I have isolated the leak to this class. You can safely ignore the Apache POI Excel code.
If we ignore the Apache POI code, the only source of a potential memory "leakage" is that the word list ( listOfWords ) is retained. (Calling clear() will null out its contents, but the backing array is retained, and that array's size is determined by the maximum list size. From a memory footprint perspective, it would be better to replace the list with a new empty list.)
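In code, the difference is:

listOfWords.clear();             // contents removed, but the grown backing array stays allocated
listOfWords = new ArrayList<>(); // the old list and its backing array become eligible for GC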
However, that is only a "leak" if you keep a reference to the KeywordCount instance. And if you are doing that because you are using the instance, I wouldn't call that a leak at all.

delete a row in csv file

I am appending data to the last row of a CSV. I want to delete the existing last row and then rewrite it with the appended element. Is there any way to delete a row in a CSV? I am using opencsv to read and write the file. I tried using the CSVIterator class; however, it seems the iterator does not support the remove() operation.
Here is the code that I tried:
static String[] readLastRecord(File outputCSVFile) throws WAGException {
    checkArgument(outputCSVFile != null, "Output CSV file cannot be null");
    FileReader fileReader = null;
    CSVReader csvFileReader = null;
    CSVIterator csvIterator = null;
    String[] csvLastRecord = null;
    try {
        fileReader = new FileReader(outputCSVFile);
        csvFileReader = new CSVReader(fileReader, ',', '\'', csvRowCount - 1);
        csvIterator = new CSVIterator(csvFileReader);
        while (csvIterator.hasNext()) {
            csvLastRecord = csvIterator.next();
            csvIterator.remove();
        }
    } catch (IOException ioEx) {
        throw new WAGException(WAGInputExceptionMessage.FILE_READ_ERR.getMessage());
    } finally {
        try {
            if (csvFileReader != null)
                csvFileReader.close();
        } catch (IOException ioEx) {
            throw new WAGException(WAGInputExceptionMessage.FILE_CLOSE_ERR.getMessage());
        }
    }
    return csvLastRecord;
}
I just found an answer; hope it helps.
You need to read the CSV into a list of string arrays, remove the specific row with allElements.remove(rowNumber), and then write the list back to the CSV file.
rowNumber is an int holding the row number.
CSVReader reader2 = new CSVReader(new FileReader(filelocation));
List<String[]> allElements = reader2.readAll();
allElements.remove(rowNumber);
FileWriter sw = new FileWriter(filelocation);
CSVWriter writer = new CSVWriter(sw);
writer.writeAll(allElements);
writer.close();
Look at this example from opencsv: opencsv example
Use unset to remove the row from the CSV (this example is PHP):
function readCSV($csvFile) {
    $file_handle = fopen($csvFile, 'r');
    while (!feof($file_handle)) {
        $line_of_text[] = fgetcsv($file_handle, 1024);
    }
    fclose($file_handle);
    return $line_of_text;
}

$csvFile1 = '../build/js/snowcem.csv';
$csv1 = readCSV($csvFile1);

// put the row number you want to delete in place of $id
unset($csv1[$id]);

$file = fopen("../build/js/snowcem.csv", "w");
foreach ($csv1 as $file1) {
    $result = [];
    array_walk_recursive($file1, function($item) use (&$result) {
        $item = '"'.$item.'"';
        $result[] = $item;
    });
    fputcsv($file, $result);
}
fclose($file);
