I'm having a problem working with a 1.3 GB CSV file (it contains 3 million rows). The problem is that I want to sort the file by a field called "Timestamp", and I can't split the file into multiple reads because otherwise the sorting won't work properly. At one point I get the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
This is my code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import com.opencsv.CSVWriter; // CSVWriter is assumed to come from the opencsv library

public class createCSV {
    public static BufferedReader br = null;
    public static String csvFile = "/Scrivania/dataset";
    public static String newcsvFile = "/Scrivania/ordinatedataset";
    public static String extFile = ".csv";

    public static void main(String[] args) {
        try {
            List<List<String>> csvLines = new ArrayList<>();
            br = new BufferedReader(new FileReader(csvFile + extFile));
            CSVWriter writer = new CSVWriter(new FileWriter(newcsvFile + extFile));

            // Copy the header row unchanged
            String line = br.readLine();
            String[] fields = line.split(",");
            writer.writeNext(fields);

            // Read every remaining row into memory
            line = br.readLine();
            while (line != null) {
                csvLines.add(Arrays.asList(line.split(",")));
                line = br.readLine();
            }

            // Sort all rows by the "Timestamp" column (index 8)
            csvLines.sort(new Comparator<List<String>>() {
                @Override
                public int compare(List<String> o1, List<String> o2) {
                    return o1.get(8).compareTo(o2.get(8));
                }
            });

            for (List<String> lin : csvLines) {
                writer.writeNext(lin.toArray(new String[0]));
            }
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I have tried increasing the heap size to the maximum, 2048 MB, specifically -Xms512M -Xmx2048M in Run -> Run Configurations, but it still gives me the error. How can I sort the whole file? Thank you in advance.
Reading the whole file into a list keeps all of its data in memory, which exhausts the heap. What you need is to stream through the file, processing it piece by piece. You can achieve this with the java.util.Scanner class, or with the LineIterator class from the Apache Commons IO library.
With Scanner:
List<List<String>> csvLines = new ArrayList<>();
FileInputStream inputStream = null;
Scanner sc = null;
try {
    inputStream = new FileInputStream(path);
    sc = new Scanner(inputStream, "UTF-8");
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        csvLines.add(Arrays.asList(line.split(",")));
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
} finally {
    if (inputStream != null) {
        inputStream.close();
    }
    if (sc != null) {
        sc.close();
    }
}
With Apache Commons IO:
LineIterator it = FileUtils.lineIterator(theFile, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
} finally {
    LineIterator.closeQuietly(it);
}
Hopefully you can find an existing library that will do this for you, or use a command line tool called from Java to do this instead. If you need to code this yourself, here's a suggestion as to a pretty simple approach you might code up...
There's a simple general approach to sorting a large file like this. I call it a "shard sort". Here's what you do:
Pick a number N that is the number of shards you'll have and a function that will produce a value between 1 and N for each input entry such that you get roughly the same number of entries in each shard. For example, you could choose N to be 10, and you could use the seconds part of your timestamp and have the shard id be id = seconds % 10. This should "randomly" spread your entries across the 10 shards.
Now open the input file and 10 output files, one for each shard. Read each entry from the input file, compute its shard id, and write it to the file for that shard id.
Now read each shard file into memory, sort it based on each entry's timestamp, and write it back out to file. For this example, this will take 10% of the memory needed to sort the whole file.
Now open the 10 shard files for reading and a new result file to contain the final result. Read in the next entry from all 10 input files. Write out the entry with the earliest timestamp of those 10 to the output file. When you write out a value, read a new one from the shard file it came from. Repeat this process until all the shard files are empty and all entries in memory have been written (a sketch of this merge step follows below).
If your file is so big that 10 shards isn't enough, use more. You could, for example, use 60 shard files and use the entire seconds value from your timestamp for the shard id.
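Here is a minimal sketch of the merge step, assuming the shard files have already been written and individually sorted, that the timestamp is column index 8 (as in the question's comparator), and that the shards are named shard0.csv through shard9.csv (illustrative names). It uses a PriorityQueue to track the earliest pending line across shards:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Comparator;
import java.util.PriorityQueue;

public class ShardMerge {

    // one pending line from one shard
    static class Entry {
        final String line;
        final int shard;
        Entry(String line, int shard) { this.line = line; this.shard = shard; }
        String timestamp() { return line.split(",")[8]; } // column 8, as in the question
    }

    public static void main(String[] args) throws IOException {
        int n = 10; // number of shard files: shard0.csv ... shard9.csv (illustrative)
        BufferedReader[] readers = new BufferedReader[n];
        Comparator<Entry> byTimestamp = Comparator.comparing(e -> e.timestamp());
        PriorityQueue<Entry> queue = new PriorityQueue<>(byTimestamp);

        // prime the queue with the first line of every shard
        for (int i = 0; i < n; i++) {
            readers[i] = new BufferedReader(new FileReader("shard" + i + ".csv"));
            String first = readers[i].readLine();
            if (first != null) {
                queue.add(new Entry(first, i));
            }
        }

        try (PrintWriter out = new PrintWriter("sorted.csv")) {
            while (!queue.isEmpty()) {
                // write out the earliest pending line...
                Entry smallest = queue.poll();
                out.println(smallest.line);
                // ...and refill from the shard it came from
                String next = readers[smallest.shard].readLine();
                if (next != null) {
                    queue.add(new Entry(next, smallest.shard));
                }
            }
        }
        for (BufferedReader r : readers) {
            r.close();
        }
    }
}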
Related
I want to read text files and convert each word to a number. Then, for each file, I write the sequence of numbers instead of words to a new file. I use a HashMap to assign just one number (identifier) to each word; for instance, the word apple is assigned the number 10, so whenever I see apple in a text file I write 10 in the sequence. I need to have just one HashMap to prevent more than one identifier from being assigned to a word. I wrote the following code, but it processes files slowly. For instance, converting a text file of 165.7 MB to a sequence file took 20 hours, and I need to convert 600 text files of the same size. I want to know whether there is any way to improve the efficiency of my code. The following function is called for each text file.
public void ConvertTextToSequence(File file) {
    try {
        FileWriter filewriter = new FileWriter(path.keywordDocIdsSequence, true);
        BufferedWriter bufferedWriter = new BufferedWriter(filewriter);
        String sequence = "";
        FileReader fileReader = new FileReader(file);
        BufferedReader bufferedReader = new BufferedReader(fileReader);
        String line = bufferedReader.readLine();
        while (line != null) {
            StringTokenizer tokens = new StringTokenizer(line);
            String str;
            while (tokens.hasMoreTokens()) {
                str = tokens.nextToken();
                if (keywordsId.containsKey(str)) {
                    sequence = sequence + " " + keywordsId.get(str);
                } else {
                    keywordsId.put(str, id);
                    sequence = sequence + " " + id;
                    id++;
                }
                if (keywordsId.size() % 10000 == 0) {
                    bufferedWriter.append(sequence);
                    sequence = "";
                    start = id;
                }
            }
            line = bufferedReader.readLine();
        }
        if (start < id) {
            bufferedWriter.append(sequence);
        }
        bufferedReader.close();
        fileReader.close();
        bufferedWriter.close();
        filewriter.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The constructor of that class is:
public ConvertTextToKeywordIds() {
    path = new LocalPath();
    repository = new RepositorySQL();
    keywordsId = new HashMap<String, Integer>();
    id = 1;
    start = 1;
}
I suspect that the speed of your program is tied to the rehashing of the hash map as the number of words grows. Each rehash can incur a significant time penalty as the size of the hash map grows. You could try and estimate the number of unique words you expect and use that to initialize the hash map.
As mentioned by @JB Nizet, you may want to write directly to the buffered writer rather than waiting to accumulate a number of entries; the buffered writer is already set up to write out only once it has accumulated enough data.
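A minimal sketch combining both suggestions, using a class and estimate of my own (SequenceWriter, convert, and the 200,000 unique-word guess are illustrative, not from the question):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class SequenceWriter {
    // pre-size the map so it never rehashes; 200_000 is an illustrative guess
    private final Map<String, Integer> keywordsId = new HashMap<>(200_000 * 4 / 3 + 1);
    private int id = 1;

    public void convert(File input, File output) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(input));
             BufferedWriter writer = new BufferedWriter(new FileWriter(output, true))) {
            String line;
            while ((line = reader.readLine()) != null) {
                StringTokenizer tokens = new StringTokenizer(line);
                while (tokens.hasMoreTokens()) {
                    String str = tokens.nextToken();
                    Integer wordId = keywordsId.get(str);
                    if (wordId == null) {
                        wordId = id++;
                        keywordsId.put(str, wordId);
                    }
                    // write directly instead of building up a String in memory
                    writer.append(' ').append(wordId.toString());
                }
            }
        }
    }
}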
Your most effective performance boost is probably using a StringBuilder instead of a String for your sequence.
I would also write and flush the sequence each time it exceeds a certain length rather than whenever you've added 10000 words to your map.
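A small sketch of that idea as a helper class of my own (SequenceBuffer and the 8192-character threshold are illustrative, not from the question):

import java.io.BufferedWriter;
import java.io.IOException;

public class SequenceBuffer {
    private static final int FLUSH_THRESHOLD = 8192; // arbitrary; tune for your workload
    private final StringBuilder sequence = new StringBuilder();
    private final BufferedWriter out;

    public SequenceBuffer(BufferedWriter out) {
        this.out = out;
    }

    // append one word id; write out once the builder grows past the threshold
    public void add(int wordId) throws IOException {
        sequence.append(' ').append(wordId);
        if (sequence.length() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    // write out whatever has accumulated and reuse the builder
    public void flush() throws IOException {
        out.append(sequence); // BufferedWriter accepts a CharSequence
        sequence.setLength(0);
    }
}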
This map could get pretty huge - have you considered improving that? If you hit millions of entries you may get better performance using a database.
I have noticed that using java.util.Scanner is very slow when reading large files (in my case, CSV files).
I want to change the way I am currently reading files, to improve performance. Below is what I have at the moment. Note that I am developing for Android:
InputStreamReader inputStreamReader;
try {
    inputStreamReader = new InputStreamReader(context.getAssets().open("MyFile.csv"));
    Scanner inputStream = new Scanner(inputStreamReader);
    inputStream.nextLine(); // Ignores the first line
    while (inputStream.hasNext()) {
        String data = inputStream.nextLine(); // Gets a whole line
        String[] line = data.split(","); // Splits the line up into a string array
        if (line.length > 1) {
            // Do stuff, e.g:
            String value = line[1];
        }
    }
    inputStream.close();
} catch (IOException e) {
    e.printStackTrace();
}
Using Traceview, I managed to find that the main performance issues are, specifically, java.util.Scanner.nextLine() and java.util.Scanner.hasNext().
I've looked at other questions (such as this one), and I've come across some CSV readers, like the Apache Commons CSV, but they don't seem to have much information on how to use them, and I'm not sure how much faster they would be.
I have also heard about using FileReader and BufferedReader in answers like this one, but again, I do not know whether the improvements will be significant.
My file is about 30,000 lines in length, and using the code I have at the moment (above), it takes at least 1 minute to read values from about 600 lines down, so I have not timed how long it would take to read values from over 2,000 lines down, but sometimes, when reading information, the Android app becomes unresponsive and crashes.
Although I could simply change parts of my code and see for myself, I would like to know if there are any faster alternatives I have not mentioned, or if I should just use FileReader and BufferedReader. Would it be faster to split the huge file into smaller files, and choose which one to read depending on what information I want to retrieve? Preferably, I would also like to know why the fastest method is the fastest (i.e. what makes it fast).
uniVocity-parsers has the fastest CSV parser you'll find (2x faster than OpenCSV, 3x faster than Apache Commons CSV), with many unique features.
Here's a simple example on how to use it:
CsvParserSettings settings = new CsvParserSettings(); // many options here, have a look at the tutorial
CsvParser parser = new CsvParser(settings);
// parses all rows in one go
List<String[]> allRows = parser.parseAll(new FileReader(new File("your/file.csv")));
To make the process faster, you can select the columns you are interested in:
settings.selectFields("Column X", "Column A", "Column Y");
Normally, you should be able to parse 4 million rows in around 2 seconds. With column selection the speed will improve by roughly 30%.
It is even faster if you use a RowProcessor. There are many implementations out of the box for processing conversions to objects, POJOs, etc. The documentation explains all of the available features. It works like this:
// let's get the values of all columns using a column processor
ColumnProcessor rowProcessor = new ColumnProcessor();
settings.setRowProcessor(rowProcessor);
//the parse() method will submit all rows to the row processor
parser.parse(new FileReader(new File("/examples/example.csv")));
//get the result from your row processor:
Map<String, List<String>> columnValues = rowProcessor.getColumnValuesAsMapOfNames();
We also built a simple speed comparison project here.
Your code is fine for loading big files. However, when an operation may take longer than you expect, it's good practice to execute it in a task rather than on the UI thread, in order to prevent any loss of responsiveness.
The AsyncTask class helps to do that:
private class LoadFilesTask extends AsyncTask<String, Integer, Long> {
    protected Long doInBackground(String... str) {
        long lineNumber = 0;
        InputStreamReader inputStreamReader;
        try {
            inputStreamReader = new InputStreamReader(context.getAssets().open(str[0]));
            Scanner inputStream = new Scanner(inputStreamReader);
            inputStream.nextLine(); // Ignores the first line
            while (inputStream.hasNext()) {
                lineNumber++;
                String data = inputStream.nextLine(); // Gets a whole line
                String[] line = data.split(","); // Splits the line up into a string array
                if (line.length > 1) {
                    // Do stuff, e.g:
                    String value = line[1];
                }
            }
            inputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return lineNumber;
    }

    // If you need to show progress, use this method
    protected void onProgressUpdate(Integer... progress) {
        setYourCustomProgressPercent(progress[0]);
    }

    // This method is triggered at the end of the process, in your case when the loading has finished
    protected void onPostExecute(Long result) {
        showDialog("File Loaded: " + result + " lines");
    }
}
...and executing as:
new LoadFilesTask().execute("MyFile.csv");
You should use a BufferedReader instead:
BufferedReader reader = null;
try {
    reader = new BufferedReader(new InputStreamReader(context.getAssets().open("MyFile.csv")));
    reader.readLine(); // Ignores the first line
    String data;
    while ((data = reader.readLine()) != null) { // Gets a whole line
        String[] line = data.split(","); // Splits the line up into a string array
        if (line.length > 1) {
            // Do stuff, e.g:
            String value = line[1];
        }
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (reader != null) {
        try {
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The following code reads a bunch of .csv files and then combines them into one .csv file. I tried System.out.println and all data points are correct; however, when I try to use the PrintWriter I get:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space.
I tried to use FileWriter but got the same error. How should I correct my code?
public class CombineCsv {
    public static void main(String[] args) throws IOException {
        PrintWriter output = new PrintWriter("C:\\User\\result.csv");
        final File file = new File("C:\\Users\\is");
        int i = 0;
        for (final File child : file.listFiles()) {
            BufferedReader CSVFile = new BufferedReader(new FileReader("C:\\Users\\is\\" + child.getName()));
            String dataRow = CSVFile.readLine();
            while (dataRow != null) {
                String[] dataArray = dataRow.split(",");
                for (String item : dataArray) {
                    System.out.println(item + "\t");
                    output.append(item + "," + child.getName().replaceAll(".csv", "") + ",");
                    i++;
                }
                dataRow = CSVFile.readLine(); // Read next line of data.
            }
            // Close the file once all data has been read.
            CSVFile.close();
        }
        output.close();
        System.out.println(i);
    }
}
I can only think of two scenarios in which that code could result in an OOME:
If the file directory has a very large number of elements, then file.listFiles() could create a very large array of File objects.
If one of the input files includes a line that is very long, then CSVFile.readLine() could use a lot of memory in the process of reading it. (Up to 6 times the number of bytes in the line.)
The simplest approach to solving both of these issues is to increase the Java heap size using the -Xmx JVM option.
I can see no reason why your use of a PrintWriter would be the cause of the problem.
Try
boolean autoFlush = true;
PrintWriter output = new PrintWriter(new FileWriter(myFileName), autoFlush);
This creates a PrintWriter instance that flushes its output every time println, printf, or format is called.
My problem is to read non-primes from a txt file and write their prime factors back to the same file.
I don't actually know how BufferedReader works. From my understanding, I am trying to read the file data into a buffer (8 KB) and write the prime factors to the file (by creating a new one).
class PS_Task2
{
    public static void main(String[] args)
    {
        String line = null;
        int x;
        try
        {
            FileReader file2 = new FileReader("nonprimes.txt");
            BufferedReader buff2 = new BufferedReader(file2);
            File file1 = new File("nonprimes.txt");
            file1.createNewFile();
            PrintWriter d = new PrintWriter(file1);
            while ((line = buff2.readLine()) != null)
            {
                x = Integer.parseInt(line);
                d.printf("%d--> ", x);
                while (x % 2 == 0)
                {
                    d.flush();
                    d.print("2" + "*");
                    x = x / 2;
                }
                for (int i = 3; i <= Math.sqrt(x); i = i + 2)
                {
                    while (x % i == 0)
                    {
                        d.flush();
                        d.printf("%d*", i);
                        x = x / i;
                    }
                }
                if (x > 2)
                {
                    d.flush();
                    d.printf("%d ", x);
                }
                d.flush(); // FLUSHING THE STREAM TO FILE
                d.println("\n");
            }
            d.close(); // CLOSING FILE
        }
        catch (IOException e) // handle I/O errors from reading or creating the files
        {
            e.printStackTrace();
        }
    }
}
Feel free to give a detailed explanation. :D Thanks ~anirudh
Reading and writing to a file in Java doesn't edit the file in place; opening it for writing clears the old content and creates a new one. You can use many approaches: for example, read your data, modify it, keep it in memory in a StringBuilder or a collection or whatever, and then re-write it.
Well, I created fileOne.txt containing the following data:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
and I want to multiply all those numbers by 10, then re-write them again:
public static void main(String[] args) throws Exception { // just for the example
    // locate the file
    File fileOne = new File("fileOne.txt");
    FileReader inputStream = new FileReader(fileOne);
    BufferedReader reader = new BufferedReader(inputStream);
    // create a LinkedList to hold the data read
    List<Integer> numbers = new LinkedList<Integer>();
    // prepare variables to refer to the temporary objects
    String line = null;
    int number = 0;
    // start reading
    do {
        // read each line
        line = reader.readLine();
        // check if the read data is not null, so as not to use null values
        if (line != null) {
            number = Integer.parseInt(line);
            numbers.add(number * 10);
        }
    } while (line != null);
    // free resources
    reader.close();
    // check the new numbers before writing to file
    System.out.println("NEW NUMBERS IN MEMORY : " + numbers);
    // assign a printer
    PrintWriter writer = new PrintWriter(fileOne);
    // write down the data
    for (int newNumber : numbers) {
        writer.println(newNumber);
    }
    // free resources
    writer.flush();
    writer.close();
}
This approach is not very good when dealing with massive data.
As per your problem statement, you need to take input from a file, do some processing, and write the processed data back to the same file. For this, please note the points below:
You cannot create a second file with the same name in the same directory, so you must either create the new file at some other location, or write the content into a different file and then rename it after deleting the original one.
While your file is open for reading, modifying the same file is not a good idea. You could use the approach below (a sketch follows the list):
Read the content of the file and store it in a data structure like an array or ArrayList.
Close the file.
Process the data stored in the data structure.
Open the file in write mode (over-write mode rather than append mode).
Write the processed data back into the file.
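A minimal sketch of those steps applied to the prime-factor case, assuming one integer per line in nonprimes.txt (the class name and the factorsOf helper are illustrative, not from the question):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class RewriteWithFactors {
    public static void main(String[] args) throws IOException {
        // steps 1-2: read everything into memory, then close the file
        List<Integer> numbers = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader("nonprimes.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                numbers.add(Integer.parseInt(line.trim()));
            }
        }

        // step 3: process the data while no file is open
        List<String> outputLines = new ArrayList<>();
        for (int n : numbers) {
            outputLines.add(n + "--> " + factorsOf(n));
        }

        // steps 4-5: reopen the same file in over-write mode and write the results back
        try (PrintWriter writer = new PrintWriter("nonprimes.txt")) {
            for (String out : outputLines) {
                writer.println(out);
            }
        }
    }

    // illustrative helper: trial division, formatted like the question's output
    private static String factorsOf(int x) {
        StringBuilder sb = new StringBuilder();
        for (int i = 2; i * i <= x; i++) {
            while (x % i == 0) {
                sb.append(i).append('*');
                x /= i;
            }
        }
        if (x > 1) {
            sb.append(x);
        }
        return sb.toString();
    }
}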
I often use the Scanner class to read files because it is so convenient.
String inputFileName;
Scanner fileScanner;
inputFileName = "input.txt";
fileScanner = new Scanner (new File(inputFileName));
My question is, does the above statement load the entire file into memory at once? Or do subsequent calls on the fileScanner like
fileScanner.nextLine();
read from the file (i.e. from external storage and not from memory)? I ask because I am concerned about what might happen if the file is too huge to be read into memory all at once. Thanks.
If you read the source code you can answer the question yourself.
It appears that the implementation of the Scanner constructor in question shows:
public Scanner(File source) throws FileNotFoundException {
    this((ReadableByteChannel)(new FileInputStream(source).getChannel()));
}
Later this is wrapped into a Reader:
private static Readable makeReadable(ReadableByteChannel source, CharsetDecoder dec) {
    return Channels.newReader(source, dec, -1);
}
And it is read using a buffer of this size:
private static final int BUFFER_SIZE = 1024; // change to 1024;
As you can see in the final constructor in the construction chain:
private Scanner(Readable source, Pattern pattern) {
    assert source != null : "source should not be null";
    assert pattern != null : "pattern should not be null";
    this.source = source;
    delimPattern = pattern;
    buf = CharBuffer.allocate(BUFFER_SIZE);
    buf.limit(0);
    matcher = delimPattern.matcher(buf);
    matcher.useTransparentBounds(true);
    matcher.useAnchoringBounds(false);
    useLocale(Locale.getDefault(Locale.Category.FORMAT));
}
So, it appears Scanner does not read the entire file at once.
From reading the code, it appears to load 1 KB at a time by default. The buffer can grow for long lines of text, up to the size of the longest line you have.
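So a loop like the following (a minimal sketch; input.txt is the file name from the question) only ever holds the current buffer and line in memory, never the whole file:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class ScanLineByLine {
    public static void main(String[] args) throws FileNotFoundException {
        try (Scanner fileScanner = new Scanner(new File("input.txt"))) {
            while (fileScanner.hasNextLine()) {
                String line = fileScanner.nextLine();
                // process one line at a time; earlier lines can be garbage-collected
                System.out.println(line.length());
            }
        }
    }
}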
In ACM contests, fast input reading is very important. In Java, we found that using something like the following is much faster...
FileInputStream inputStream = new FileInputStream("input.txt");
InputStreamReader streamReader = new InputStreamReader(inputStream, "UTF-8");
BufferedReader in = new BufferedReader(streamReader);
Map<String, Integer> map = new HashMap<String, Integer>();
int trees = 0;
for (String s; (s = in.readLine()) != null; trees++) {
    Integer n = map.get(s);
    if (n != null) {
        map.put(s, n + 1);
    } else {
        map.put(s, 1);
    }
}
The file contains, in that case, tree names...
Red Alder
Ash
Aspen
Basswood
Ash
Beech
Yellow Birch
Ash
Cherry
Cottonwood
You can use a StringTokenizer to grab any part of the line that you want.
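For example, a minimal sketch of splitting one of the lines above into its words:

import java.util.StringTokenizer;

public class TokenizeLine {
    public static void main(String[] args) {
        String line = "Yellow Birch"; // one of the tree names above
        StringTokenizer tokenizer = new StringTokenizer(line); // splits on whitespace by default
        while (tokenizer.hasMoreTokens()) {
            System.out.println(tokenizer.nextToken()); // prints "Yellow", then "Birch"
        }
    }
}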
We have run into errors when using Scanner for large files: it read only 100 lines from a file with 10000 lines!
As the API documentation says:
A scanner can read text from any object which implements the Readable interface. If an invocation of the underlying readable's Readable.read(java.nio.CharBuffer) method throws an IOException then the scanner assumes that the end of the input has been reached. The most recent IOException thrown by the underlying readable can be retrieved via the ioException() method.
Good luck!
You're better off going with something like BufferedReader with a FileReader for large files. A basic example can be found here.
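A minimal sketch of that combination, assuming a plain text file named large.txt (an illustrative name):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadLargeFile {
    public static void main(String[] args) throws IOException {
        // BufferedReader pulls the file in chunks, so only one line is held at a time
        try (BufferedReader reader = new BufferedReader(new FileReader("large.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // process the line here
                System.out.println(line.length());
            }
        }
    }
}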