I have noticed that using java.util.Scanner is very slow when reading large files (in my case, CSV files).
I want to change the way I am currently reading files, to improve performance. Below is what I have at the moment. Note that I am developing for Android:
InputStreamReader inputStreamReader;
try {
    inputStreamReader = new InputStreamReader(context.getAssets().open("MyFile.csv"));
    Scanner inputStream = new Scanner(inputStreamReader);
    inputStream.nextLine(); // Ignores the first line
    while (inputStream.hasNext()) {
        String data = inputStream.nextLine(); // Gets a whole line
        String[] line = data.split(","); // Splits the line up into a string array
        if (line.length > 1) {
            // Do stuff, e.g:
            String value = line[1];
        }
    }
    inputStream.close();
} catch (IOException e) {
    e.printStackTrace();
}
Using Traceview, I found that the main performance bottlenecks are java.util.Scanner.nextLine() and java.util.Scanner.hasNext().
I've looked at other questions (such as this one), and I've come across some CSV readers, like Apache Commons CSV, but there doesn't seem to be much information on how to use them, and I'm not sure how much faster they would be. I have also heard about using FileReader and BufferedReader in answers like this one, but again, I do not know whether the improvements would be significant.
My file is about 30,000 lines long. With the code above, reading values from around line 600 already takes at least a minute, so I have not timed how long it would take to read values from beyond line 2,000. Sometimes the app even becomes unresponsive and crashes while reading.
Although I could simply change parts of my code and see for myself, I would like to know if there are any faster alternatives I have not mentioned, or if I should just use FileReader and BufferedReader. Would it be faster to split the huge file into smaller files, and choose which one to read depending on what information I want to retrieve? Preferably, I would also like to know why the fastest method is the fastest (i.e. what makes it fast).
uniVocity-parsers has the fastest CSV parser you'll find (2x faster than OpenCSV, 3x faster than Apache Commons CSV), with many unique features.
Here's a simple example on how to use it:
CsvParserSettings settings = new CsvParserSettings(); // many options here, have a look at the tutorial
CsvParser parser = new CsvParser(settings);
// parses all rows in one go
List<String[]> allRows = parser.parseAll(new FileReader(new File("your/file.csv")));
To make the process faster, you can select the columns you are interested in:
settings.selectFields("Column X", "Column A", "Column Y");
Normally, you should be able to parse 4 million rows in around 2 seconds. With column selection, the speed will improve by roughly 30%.
It is even faster if you use a RowProcessor. There are many implementations out of the box for converting rows to objects, POJOs, and so on. The documentation explains all of the available features. It works like this:
// let's get the values of all columns using a column processor
ColumnProcessor rowProcessor = new ColumnProcessor();
settings.setRowProcessor(rowProcessor);
//the parse() method will submit all rows to the row processor
parser.parse(new FileReader(new File("/examples/example.csv")));
//get the result from your row processor:
Map<String, List<String>> columnValues = rowProcessor.getColumnValuesAsMapOfNames();
We also built a simple speed comparison project here.
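Since the question reads from Android assets rather than a file on disk, note that the parser accepts any java.io.Reader. A small sketch combining the pieces above, assuming the same settings object:

// parse straight from the Android asset stream, selecting only the needed columns
settings.selectFields("Column X", "Column A", "Column Y");
CsvParser parser = new CsvParser(settings);
List<String[]> rows = parser.parseAll(
        new InputStreamReader(context.getAssets().open("MyFile.csv")));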
Your code is fine for loading big files. However, when an operation may take longer than expected, it is good practice to run it in a background task rather than on the UI thread, to prevent any lack of responsiveness.
The AsyncTask class helps with that:
private class LoadFilesTask extends AsyncTask<String, Integer, Long> {
    @Override
    protected Long doInBackground(String... str) {
        long lineNumber = 0;
        InputStreamReader inputStreamReader;
        try {
            inputStreamReader = new InputStreamReader(context.getAssets().open(str[0]));
            Scanner inputStream = new Scanner(inputStreamReader);
            inputStream.nextLine(); // Ignores the first line
            while (inputStream.hasNext()) {
                lineNumber++;
                // call publishProgress(...) here if you want onProgressUpdate to fire
                String data = inputStream.nextLine(); // Gets a whole line
                String[] line = data.split(","); // Splits the line up into a string array
                if (line.length > 1) {
                    // Do stuff, e.g:
                    String value = line[1];
                }
            }
            inputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return lineNumber;
    }

    // If you need to show progress, use this method
    @Override
    protected void onProgressUpdate(Integer... progress) {
        setYourCustomProgressPercent(progress[0]);
    }

    // This method is triggered at the end of the process,
    // in your case when the loading has finished
    @Override
    protected void onPostExecute(Long result) {
        showDialog("File Loaded: " + result + " lines");
    }
}
...and executing as:
new LoadFilesTask().execute("MyFile.csv");
You should use a BufferedReader instead. It reads the underlying stream in large chunks (8 KB by default) and readLine() simply scans that buffer, whereas Scanner runs a regular-expression engine on every hasNext()/nextLine() call, which is exactly what your Traceview results are showing:
BufferedReader reader = null;
try {
    reader = new BufferedReader(new InputStreamReader(context.getAssets().open("MyFile.csv")));
    reader.readLine(); // Ignores the first line
    String data;
    while ((data = reader.readLine()) != null) { // Gets a whole line
        String[] line = data.split(","); // Splits the line up into a string array
        if (line.length > 1) {
            // Do stuff, e.g:
            String value = line[1];
        }
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (reader != null) {
        try {
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
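If Java 7 language features are available to you (historically this meant API level 19 on Android, though modern toolchains desugar try-with-resources for older devices), the finally boilerplate goes away. A sketch of the same loop:

try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(context.getAssets().open("MyFile.csv")))) {
    reader.readLine(); // skip the header line
    String data;
    while ((data = reader.readLine()) != null) {
        String[] line = data.split(",");
        if (line.length > 1) {
            String value = line[1]; // do stuff with value
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}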
I'm having a problem working with a 1.3 GB CSV file (it contains 3 million rows). I want to sort the file by a field called "Timestamp", and I can't split the file into multiple reads because then the sorting won't work properly. At some point I get the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
This is my code:
public class createCSV {
    public static BufferedReader br = null;
    public static String csvFile = "/Scrivania/dataset";
    public static String newcsvFile = "/Scrivania/ordinatedataset";
    public static String extFile = ".csv";

    public static void main(String[] args) {
        try {
            List<List<String>> csvLines = new ArrayList<>();
            br = new BufferedReader(new FileReader(csvFile + extFile));
            CSVWriter writer = new CSVWriter(new FileWriter(newcsvFile + extFile));
            String line = br.readLine();
            String[] fields = line.split(",");
            writer.writeNext(fields);
            line = br.readLine();
            while (line != null) {
                csvLines.add(Arrays.asList(line.split(",")));
                line = br.readLine();
            }
            csvLines.sort(new Comparator<List<String>>() {
                @Override
                public int compare(List<String> o1, List<String> o2) {
                    return o1.get(8).compareTo(o2.get(8));
                }
            });
            for (List<String> lin : csvLines) {
                writer.writeNext(lin.toArray(new String[0]));
            }
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I have tried increasing the heap size to the maximum (-Xms512M -Xmx2048M in Run -> Run Configurations), but it still gives me the error. How can I sort the whole file? Thank you in advance.
Reading the whole file into an in-memory list is what exhausts the heap. What you need is to stream through the file, processing lines as you read them. You can do that with the java.util.Scanner class, or with LineIterator from the Apache Commons IO library.
With Scanner:
List<List<String>> csvLines = new ArrayList<>();
FileInputStream inputStream = null;
Scanner sc = null;
try {
    inputStream = new FileInputStream(path);
    sc = new Scanner(inputStream, "UTF-8");
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        // caution: collecting every line still holds the whole file in memory;
        // if memory is the constraint, process each line here instead
        csvLines.add(Arrays.asList(line.split(",")));
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
} finally {
    if (inputStream != null) {
        inputStream.close();
    }
    if (sc != null) {
        sc.close();
    }
}
With Apache Commons IO:
LineIterator it = FileUtils.lineIterator(theFile, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
} finally {
    LineIterator.closeQuietly(it);
}
Hopefully you can find an existing library that will do this for you, or use a command line tool called from Java to do this instead. If you need to code this yourself, here's a suggestion as to a pretty simple approach you might code up...
There's a simple general approach to sorting a large file like this. I call it a "shard sort". Here's what you do:
Pick a number N, which is the number of shards you'll have, and a function that produces a value between 0 and N-1 for each input entry such that you get roughly the same number of entries in each shard. For example, you could choose N to be 10 and use the seconds part of your timestamp, making the shard id id = seconds % 10. This should "randomly" spread your entries across the 10 shards.
Now open the input file and 10 output files, one for each shard. Read each entry from the input file, compute its shard id, and write it to the file for that shard id.
Now read each shard file into memory, sort it based on each entry's timestamp, and write it back out to file. For this example, each sort takes only 10% of the memory needed to sort the whole file.
Now open the 10 shard files for reading and a new result file to contain the final result. Read the next entry from all 10 input files. Write the earliest of those 10 entries, timestamp-wise, to the output file, and when you write out a value, read a new one from the shard file it came from. Repeat this process until all the shard files are empty and every entry has been written.
If your file is so big that 10 shards isn't enough, use more. You could, for example, use 60 shard files and use the entire seconds value from your timestamp for the shard id.
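Here is a rough, untested sketch of the whole procedure (Java 8, made-up file names, comma-separated rows with the timestamp in column 8 to match the question's comparator). It shards on a hash of the timestamp rather than the seconds value, which spreads entries just as evenly; the final merge is still required because the shards interleave timestamps:

import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ShardSort {
    static final int N = 10;   // number of shards
    static final int TS = 8;   // index of the "Timestamp" column

    static String key(String line) {
        return line.split(",")[TS];
    }

    public static void main(String[] args) throws IOException {
        // 1) split the input into N shard files, spread by a hash of the timestamp
        PrintWriter[] shards = new PrintWriter[N];
        for (int i = 0; i < N; i++) {
            shards[i] = new PrintWriter(new FileWriter("shard" + i + ".csv"));
        }
        String header;
        try (BufferedReader in = new BufferedReader(new FileReader("dataset.csv"))) {
            header = in.readLine(); // keep the header out of the shards
            String line;
            while ((line = in.readLine()) != null) {
                shards[(key(line).hashCode() & 0x7fffffff) % N].println(line);
            }
        }
        for (PrintWriter w : shards) {
            w.close();
        }

        // 2) sort each shard in memory; each one is only ~1/N of the file
        for (int i = 0; i < N; i++) {
            Path p = Paths.get("shard" + i + ".csv");
            List<String> lines = Files.readAllLines(p);
            lines.sort(Comparator.comparing(ShardSort::key));
            Files.write(p, lines);
        }

        // 3) merge: repeatedly write out the earliest remaining entry
        BufferedReader[] readers = new BufferedReader[N];
        String[] current = new String[N];
        for (int i = 0; i < N; i++) {
            readers[i] = new BufferedReader(new FileReader("shard" + i + ".csv"));
            current[i] = readers[i].readLine();
        }
        try (PrintWriter out = new PrintWriter(new FileWriter("sorted.csv"))) {
            out.println(header);
            while (true) {
                int min = -1;
                for (int i = 0; i < N; i++) {
                    if (current[i] != null
                            && (min < 0 || key(current[i]).compareTo(key(current[min])) < 0)) {
                        min = i;
                    }
                }
                if (min < 0) {
                    break; // all shards are exhausted
                }
                out.println(current[min]);
                current[min] = readers[min].readLine();
            }
        }
        for (BufferedReader r : readers) {
            r.close();
        }
    }
}

With N = 10, each in-memory sort only needs roughly a tenth of the file; raise N if that is still too much.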
I currently have some code that loads some WAPT load testing data from a CSV file into an ArrayList.
int count = 0;
String file = "C:\\Temp\\loadtest.csv";
List<String[]> content = new ArrayList<>();
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String line = "";
    while (((line = br.readLine()) != null) && (count < 18)) {
        content.add(line.split(";"));
        count++;
    }
} catch (IOException e) { // readLine() can throw more than FileNotFoundException
    // Some error logging
}
So now it gets complicated. The CSV file looks like this; the separator is a ";".
In this case I want to ignore the first line; it's just minutes. I want the second row, but I need to ignore "AddToCartDemo" and "Users", so the first ten entries (in this case all ten 5's) get loaded into the first ten columns of the database. Likewise, in the third row, "Pages/sec" is ignored and the data after it is loaded into the next ten columns of the database, and so on.
;;0:01:00;0:02:00;0:03:00;0:04:00;0:05:00;0:06:00;0:07:00;0:08:00;0:09:00;0:10:00;
AddToCartDemo;Users;5;5;5;5;5;5;5;5;5;5;
;Pages/sec;0.25;0.1;0.22;0.65;0.03;0.4;0.43;0.17;0.22;0.4;
;Hits/sec;0.25;0.1;0.27;0.85;0.03;0.5;0.53;0.22;0.27;0.5;
;HTTP errors;0;0;0;0;0;0;0;0;0;0;
;KB received;1015;4595;422;1600;2.46;4374;1527;1491;2551;2954;
;KB sent;12.9;3.66;13.8;39.9;5.22;21.7;23.8;13.2;12.2;23.1;
;Receiving kbps;135;613;56.3;213;0.33;583;204;199;340;394;
;Sending kbps;1.73;0.49;1.84;5.32;0.7;2.89;3.17;1.76;1.63;3.07;
Anyone have any ideas on how to accomplish this? As usual, a search brings up nothing even close to this. Thanks much in advance!
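For what it's worth, here is a minimal, untested sketch of one way to read that layout, assuming the sample above is representative (semicolon separators, two label fields, then exactly ten values per data row; the path is taken from the snippet):

import java.io.*;
import java.util.*;

public class WaptCsvReader {
    public static void main(String[] args) throws IOException {
        // row label -> its ten numeric values, in column order
        Map<String, double[]> metrics = new LinkedHashMap<>();
        try (BufferedReader br = new BufferedReader(new FileReader("C:\\Temp\\loadtest.csv"))) {
            br.readLine(); // skip the first line (the minutes header)
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split(";", -1); // -1 keeps empty leading fields
                if (parts.length < 12) continue;      // not a data row
                String label = parts[1];              // "Users", "Pages/sec", ...
                double[] values = new double[10];
                for (int i = 0; i < 10; i++) {
                    values[i] = Double.parseDouble(parts[i + 2]);
                }
                metrics.put(label, values);
            }
        }
        // each entry now holds the ten columns to insert into the database
        System.out.println(Arrays.toString(metrics.get("Users")));
    }
}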
I have created a method that reads specific lines from a file based on their line number. It works fine for most files but when I try to read a file that contains a large number of really long lines then it takes ages, particularly as it gets further down in the file. I've also done some debugging and it appears to take a lot of memory as well but I'm not sure if this is something that can be improved. I know there are some other questions which focus on how to read certain lines from a file but this question is focussed primarily on the performance aspect.
public static final synchronized List<String> readLines(final File file, final Integer start, final Integer end) throws IOException {
    BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
    List<String> lines = new ArrayList<>();
    try {
        String line = bufferedReader.readLine();
        Integer currentLine = 1;
        while (line != null) {
            if ((currentLine >= start) && (currentLine <= end)) {
                lines.add(line + "\n");
            }
            currentLine++;
            if (currentLine > end) {
                return lines;
            }
            line = bufferedReader.readLine();
        }
    } finally {
        bufferedReader.close();
    }
    return lines;
}
How can I optimize this method to be faster than light?
I realised that what I was doing before was inherently slow and used too much memory.
By loading every line into memory and then processing them from a List, it not only took twice as long but also created String objects for no reason.
I am now using a Java 8 Stream and processing lines at the point of reading, which is the fastest method I've used so far:
Path path = Paths.get(file.getAbsolutePath());
try (Stream<String> stream = Files.lines(path, StandardCharsets.UTF_8)) {
    for (String line : (Iterable<String>) stream::iterator) {
        // do stuff
    }
}
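For the original start/end use case, the same stream can also restrict itself to the range. A sketch, assuming the 1-based inclusive bounds of the readLines method above (skip() still reads past the earlier lines, but nothing outside the range is retained in memory):

// collect only lines start..end (1-based, inclusive)
try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
    List<String> selected = lines.skip(start - 1)
                                 .limit(end - start + 1)
                                 .collect(Collectors.toList());
}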
I'm trying to basically make a simple Test Generator. I want a button to parse a text file and add the records to my database. The questions and answers are in a text file. I have been searching the net for examples but I can't find one that matches my situation.
The text file has header information that I want to ignore, up until the line that starts with "~ End of Syllabus". I want "~ End of Syllabus" to indicate the beginning of the questions. A couple of lines after that, look for a line with a "(" in the seventh character position. I want that to indicate the Question Number line. The Question Number line is unique in that the "(" is in the seventh character position, and I want to use that as an indicator to mark the start of a new question. In the Question Number line, the first three characters together, "T1A", are the Question Group, and the last part, the "01", is the question number within that group.
So, as you can see, I will also need to get the actual question text line and the answer lines as well. Also, typically after the four answer lines is the Question Terminator, indicated by "~~". I don't know how I would be able to do this for all the questions in the text file. Do I keep adding them to a String array? How would I access this information from the file and add it to a database? This is very confusing for me, and I feel the way I could learn how this works is by seeing an example that covers my situation. Here is a link to the text file I'm talking about: http://pastebin.com/3U3uwLHN
Code:
public static void main(String args[]) {
    String endOfSyllabus = "~ End of Syllabus";
    Path objPath = Paths.get("2014HamTechnician.txt");
    String[] restOfTextFile = null;
    if (Files.exists(objPath)) {
        File objFile = objPath.toFile();
        try (BufferedReader in = new BufferedReader(new FileReader(objFile))) {
            String line = in.readLine();
            List<String> linesFile = new LinkedList<>();
            while (line != null) {
                linesFile.add(line);
                line = in.readLine();
            }
            System.out.println(linesFile);
        } catch (IOException e) {
            System.out.println(e);
        }
    } else {
        System.out.println(objPath.toAbsolutePath() + " doesn't exist");
    }

    /* Create and display the form */
    java.awt.EventQueue.invokeLater(new Runnable() {
        public void run() {
            new A19015_Form().setVisible(true);
        }
    });
}
Reading a text file in Java is straightforward (and there are sure to be other, more creative/efficient ways to do this):
try (BufferedReader reader = new BufferedReader(new FileReader(path))) { // try-with-resources needs JDK 7
    int lineNum = 0;
    String readLine;
    while ((readLine = reader.readLine()) != null) { // read until end of stream
Skipping an arbitrary number of lines can be accomplished like this:
        if (lineNum == 0) {
            lineNum++;
            continue;
        }
Your real problem is what to split the text on. Had you been using a delimited format, you could use String[] nextLine = readLine.split("\t"); to split each line into its respective cells based on tab separation. But you're not, so you'll be stuck reading each line and then finding something to split on.
It seems like you may be in control of the text file format. If you are, move to an easier-to-consume format such as CSV; otherwise you're going to be designing a custom parser for your format.
A bonus to using CSV is that it can mirror a database very effectively, i.e. your CSV header column = database column.
As far as databases go, using JDBC is easy enough; just make sure you use prepared statements to insert your data, to protect against SQL injection:
public Connection connectToDatabase() throws SQLException {
    String url = "jdbc:postgresql://url";
    return DriverManager.getConnection(url);
}

// cInsert is your INSERT statement with ? placeholders;
// fromTS1 is the timestamp value parsed from the file
Connection conn = connectToDatabase();
PreparedStatement pstInsert = conn.prepareStatement(cInsert);
pstInsert.setTimestamp(1, fromTS1);
pstInsert.setString(2, nextLine[1]);
pstInsert.execute();
pstInsert.close();
conn.close();
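If you end up inserting many rows, standard JDBC batching cuts down the per-row round trips. A sketch, where parsedRows stands in for whatever collection your parser produced:

// queue the inserts and send them to the database in one round trip
for (String[] row : parsedRows) { // parsedRows is hypothetical
    pstInsert.setString(1, row[0]);
    pstInsert.setString(2, row[1]);
    pstInsert.addBatch();
}
pstInsert.executeBatch();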
--Edit--
I didn't see your pastebin earlier on. It doesn't appear that you're in charge of the file format, so you're going to need to split on spaces (each word) and rely on regular expressions to determine whether a line is a question or not. Fortunately the file seems fairly consistent, so you should be able to do this without too much trouble.
--Edit 2--
As a possible solution you can try this untested code:
try {
    BufferedReader reader = new BufferedReader(new FileReader("file.txt"));
    boolean doRegex = false;
    String readLine;
    while ((readLine = reader.readLine()) != null) { // read until end of stream
        if (readLine.startsWith("~ End of Syllabus")) {
            doRegex = true;
            continue; // immediately go to the next iteration
        }
        if (doRegex) {
            String[] line = readLine.split(" "); // split on spaces
            if (line[0].matches("your regex here")) {
                // answer should be line[1]
                // do logic with your answer here
            }
        }
    }
    reader.close();
} catch (IOException e) {
    e.printStackTrace();
}
I guess this comes down to reading and writing to the same file. I would like to be able to return the same text file as is input, but with all integer values quadrupled. Should I even be attempting this with Java, or is it better to write to a new file and overwrite the original .txt file?
In essence, I'm trying to transform This:
12
fish
55 10 yellow 3
into this:
48
fish
220 40 yellow 12
Here's what I've got so far. Currently, it doesn't modify the .txt file.
import java.io.*;
import java.util.Scanner;

public class CharacterStretcher
{
    public static void main(String[] args)
    {
        Scanner keyboard = new Scanner(System.in);
        System.out.println("Copy and paste the path of the file to fix");

        // get which file you want to read and write
        File file = new File(keyboard.next());
        File file2 = new File("temp.txt");

        BufferedReader reader;
        BufferedWriter writer;
        try {
            // new a writer and point the writer to the file
            FileInputStream fstream = new FileInputStream(file);
            // Use DataInputStream to read binary NOT text.
            reader = new BufferedReader(new InputStreamReader(fstream));
            writer = new BufferedWriter(new FileWriter(file2, true));

            String line = "";
            String temp = "";
            int var = 0;
            int start = 0;
            System.out.println("000");
            while ((line = reader.readLine()) != null)
            {
                System.out.println("a");
                if (line.contains("="))
                {
                    System.out.println("b");
                    var = 0;
                    temp = line.substring(line.indexOf('='));
                    for (int x = 0; x < temp.length(); x++)
                    {
                        System.out.println(temp.charAt(x));
                        if (temp.charAt(x) > 47 && temp.charAt(x) < 58) // if 0<=char<=9
                        {
                            if (start == 0)
                                start = x;
                            var *= 10;
                            var += temp.indexOf(x) - 48; // converts back into single digit
                        }
                        else
                        {
                            if (start != 0)
                            {
                                temp = temp.substring(0, start) + var * 4 + temp.substring(x);
                                //writer.write(line.substring(0, line.indexOf('=')) + temp);
                                //TODO: Currently writes a bunch of garbage to the end of the file, how to write in the middle?

                                //move x if var*4 has an extra digit
                                if ((var < 10 && var > 2)
                                        || (var < 100 && var > 24)
                                        || (var < 1000 && var > 249)
                                        || (var < 10000 && var > 2499))
                                    x++;
                            }
                            //start = 0;
                        }
                        System.out.println(temp + " " + start);
                    }
                    if (start == 0)
                        writer.write(line);
                    else
                        writer.write(temp);
                }
            }
            System.out.println("end");

            // writer the content to the file
            //writer.write("I write something to a file.");

            // always remember to close the writer
            writer.close();
            //writer = null;
            file2.renameTo(file); //TODO: Not sure if this works...
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Given that this is a pretty quick and simple hack of a formatted text file, I don't think you need to be too clever about it.
Your logic for deciding whether you are looking at a number is pretty complex and I'd say it's overkill.
I've written up a basic outline of what I'd do in this instance.
It's not very clever or impressive, but should get the job done I think.
I've left out the overwriting and reading the input from the console, so you get to do some of the implementation yourself ;-)
import java.io.*;

public class CharacterStretcher {
    public static void main(String[] args) {
        //Assumes the input is at c:\data.txt
        File inputFile = new File("c:\\data.txt");
        //Assumes the output is at c:\temp.txt
        File outputFile = new File("c:\\temp.txt");
        try {
            //Construct a file reader and writer
            final FileInputStream fstream = new FileInputStream(inputFile);
            final BufferedReader reader = new BufferedReader(new InputStreamReader(fstream));
            final BufferedWriter writer = new BufferedWriter(new FileWriter(outputFile, false));
            //Read the file line by line...
            String line;
            while ((line = reader.readLine()) != null) {
                //Create a StringBuilder to build our modified lines that will
                //go into the output file
                StringBuilder newLine = new StringBuilder();
                //Split each line from the input file by spaces
                String[] parts = line.split(" ");
                //For each part of the input line, check if it's a number
                for (String part : parts) {
                    try {
                        //If we can parse the part as an integer, we assume
                        //it's a number because it almost certainly is!
                        int number = Integer.parseInt(part);
                        //We add this to our new line, but multiplied by 4
                        newLine.append(String.valueOf(number * 4));
                    } catch (NumberFormatException nfEx) {
                        //If we couldn't parse it as an integer, we just add it
                        //to the new line - it's going to be a String.
                        newLine.append(part);
                    }
                    //Add a space between each part on the new line
                    newLine.append(" ");
                }
                //Write the new line to the output file, remembering to chop the
                //trailing space off the end and to add the line breaks
                writer.append(newLine.toString().substring(0, newLine.toString().length() - 1) + "\r\n");
                writer.flush();
            }
            //Close the file handles.
            reader.close();
            writer.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
You may want to consider one of these:
Build the new file in memory, rather than trying to write to the same file you are reading from. You could use StringBuilder for this.
Write to a new file, then overwrite the old file with the new one. This SO Question may help you there.
With both of these, you will be able to see your whole output, separate from the input file.
Additionally, with option (2), you don't have the risk of the operation failing in the middle and giving you a messed up file.
Now, you certainly can modify the file in-place. But it seems like unnecessary complexity for your case, unless you have really huge input files.
At the very least, if you try it this way first, you can narrow down on why the more complicated version is failing.
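If you go with option 2, a minimal sketch looks like this (rewriteQuadrupled is a made-up name, and the java.nio.file calls require Java 7+). It reuses the parse-or-pass-through idea from the answer above, and only replaces the original once the new file is fully written:

// read everything, transform each line, write to a temp file,
// then replace the original (option 2 above)
static void rewriteQuadrupled(Path original) throws IOException {
    Path temp = original.resolveSibling(original.getFileName() + ".tmp");
    List<String> output = new ArrayList<>();
    for (String line : Files.readAllLines(original, StandardCharsets.UTF_8)) {
        StringBuilder sb = new StringBuilder();
        for (String part : line.split(" ")) {
            try {
                sb.append(Integer.parseInt(part) * 4); // quadruple the numbers
            } catch (NumberFormatException e) {
                sb.append(part); // not a number, keep it as-is
            }
            sb.append(' ');
        }
        output.add(sb.toString().trim());
    }
    Files.write(temp, output, StandardCharsets.UTF_8);
    Files.move(temp, original, StandardCopyOption.REPLACE_EXISTING);
}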
You cannot read and simultaneously write to the same file, because that would modify the text you are currently reading. This means you must first write a modified new file and later rename it to the original name. You probably need to remove the original file before renaming.
For renaming, you can use File.renameTo, or see one of the many related questions on SO.
You seem to parse integers in your code by collecting single digits and adding them up. Consider using either Scanner.nextInt or Integer.parseInt instead.
You can read your file line by line, split the words at whitespace, and then parse each token to check whether it is an integer or some other word.