I have a .CSV file containing 100 000 records. I need to parse a set of records and then delete them, then parse the next set, and so on until the end of the file. How can I do this? A code snippet would be very helpful.
I tried, but I am not able to delete the records and then reuse the same CSV file with only the remaining records in it.
This can not be done efficiently, since CSV is a sequential file format. Say you have
"some text", "adsf"
"more text", "adfgagqwe"
"even more text", "adsfasdf"
...
and you want to remove the second line:
"some text", "adsf"
"even more text", "adsfasdf"
...
you need to move up all subsequent lines (which in your case can be 100 000 ...), which involves reading them at their old location and writing them to the new one. That is, deleting the first of 100 000 lines involves reading and writing 99 999 lines of text, which will take a while ...
It is therefore worthwhile to consider alternatives. For instance, if you are trying to process a file and want to keep track of how far you got, it is far more efficient to store the line number (or offset in bytes) you were at, and leave the input file intact. This also prevents corrupting the file if your program crashes while deleting the lines. Another approach is to first split the file into many small files (perhaps 1000 lines each), process each file in its entirety, and then delete that file.
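The bookkeeping idea above could be sketched roughly like this. It is only a minimal sketch under stated assumptions: the class CsvBatchReader, its method names, and the separate checkpoint file are hypothetical and chosen for illustration, and it assumes one record per line (no quoted fields containing line breaks).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: hands out the CSV in batches without ever modifying it.
final class CsvBatchReader {

    // Returns the number of lines already processed, or 0 if no checkpoint exists yet.
    static long loadCheckpoint(Path checkpoint) throws IOException {
        if (!Files.exists(checkpoint)) {
            return 0;
        }
        String text = new String(Files.readAllBytes(checkpoint), StandardCharsets.UTF_8);
        return Long.parseLong(text.trim());
    }

    // Skips the lines handled by earlier calls, returns up to batchSize new lines,
    // and stores the new position in the checkpoint file.
    static List<String> nextBatch(Path csv, Path checkpoint, int batchSize) throws IOException {
        long done = loadCheckpoint(checkpoint);
        List<String> batch = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            for (long i = 0; i < done; i++) {
                if (reader.readLine() == null) {
                    return batch; // file is shorter than the checkpoint: nothing left
                }
            }
            String line;
            while (batch.size() < batchSize && (line = reader.readLine()) != null) {
                batch.add(line);
            }
        }
        String position = Long.toString(done + batch.size());
        Files.write(checkpoint, position.getBytes(StandardCharsets.UTF_8));
        return batch;
    }
}
```

Each call still has to read past the lines that were already handled. Storing a byte offset instead of a line number would avoid that, at the cost of slightly more careful reading code.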
However, if you truly must delete lines from a CSV file, the most robust way is to read the entire file, write all records you want to keep to a new file, delete the original file, and finally rename the new file to the original file.
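A minimal sketch of that rewrite-and-rename approach, assuming the records to remove are simply the first few lines of the file. The class and method names are hypothetical, and the sketch writes the platform line separator, which may differ from the one in the original file.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper illustrating "write the keepers to a new file, then rename".
final class CsvLineRemover {

    // Drops the first `count` lines of `csv` by copying the rest to a temporary
    // file in the same directory and then moving that file over the original.
    static void dropFirstLines(Path csv, long count) throws IOException {
        Path tmp = Files.createTempFile(csv.toAbsolutePath().getParent(), "csv-", ".tmp");
        try (BufferedReader in = Files.newBufferedReader(csv, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            String line;
            long index = 0;
            while ((line = in.readLine()) != null) {
                if (index++ >= count) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
        // REPLACE_EXISTING covers the "delete the original, then rename" step.
        Files.move(tmp, csv, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Creating the temporary file next to the original keeps the final move on one file system, so it is usually a cheap rename rather than a second copy.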
You cannot delete data from the middle of a file in place. Ideally, you should generate a new file for your output. In your case, once you reach the point where you want to delete the existing data, you can create a new file, copy the remaining lines to it, and use this new file as input.
Code:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
...
File infile = new File("C:\\MyInputFile.txt");
File outfile = new File("C:\\MyOutputFile.txt");
// try-with-resources closes both streams, even if copying fails
try (FileInputStream instream = new FileInputStream(infile);
     FileOutputStream outstream = new FileOutputStream(outfile)) {
    byte[] buffer = new byte[1024];
    int length;
    // copy the contents from the input stream to the output stream
    while ((length = instream.read(buffer)) > 0) {
        outstream.write(buffer, 0, length);
    }
}
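The code below is tested and works. It erases a line in an existing CSV file by overwriting every character except the commas with spaces, so the file keeps its size. As posted it handles only the first line; to erase other rows, you will have to put their row numbers in an array and repeat this for each of them. Please check and let me know.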
File f = new File(System.getProperty("user.home") + "/Desktop/c.csv");
RandomAccessFile ra = new RandomAccessFile(f, "rw");
ra.seek(0);
long p = ra.getFilePointer();
// readLine() maps each byte to one char, so ISO-8859-1 gives back the same number of bytes
byte[] b = ra.readLine().getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);
for (int i = 0; i < b.length; i++) {
    if (b[i] != 44) { // replace everything except commas (44) with spaces (32)
        b[i] = 32;
    }
}
ra.seek(p); // go back to the start of the line
ra.write(b); // write a blank line with commas as column separators
ra.close();
I have a .txt file that will be accessed by many users, possibly at the same time (or close to that). Because of that, I need a way to modify that txt file without creating a temporary file, and I haven't found an answer or solution to this. So far, I have only found this approach:
Take existing file -> modify something -> write it to a new file (temp file) -> delete the old file.
But this approach is not good for me. I need something like: Take existing file -> modify it -> save it.
Is this possible? I'm really sorry if this question already exists. I searched Stack Overflow and read through the Oracle docs, but I haven't found a solution that suits my needs.
EDIT:
After modification, the file would stay the same size as before. For example, imagine a list of students where each student can have the value 1 or 0 (passed or failed the exam).
So in this case I would just need to update one character per row in the file (that is, per student). Example:
Lee Jackson 0 -> Lee Jackson 0
Bob White 0 -> would become -> Bob White 1
Jessica Woo 1 -> Jessica Woo 1
In the example above we have a file with 3 records, one below the other, and I need to update the 2nd record while the 1st and 3rd stay the same, all without creating a new file.
Here's a potential approach using RandomAccessFile. The idea is to use readLine() to read the file as strings, but to remember the position in the file so you can go back there and write a new line. It's still risky if anything in the text encoding changes the byte length, because that could overwrite the line break, for example.
void modifyFile(String file) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
        long beforeLine = raf.getFilePointer();
        String line;
        while ((line = raf.readLine()) != null) {
            // edit the line while keeping its length identical
            if (line.endsWith("0")) {
                line = line.substring(0, line.length() - 1) + "1";
            }
            // go back to the beginning of the line
            raf.seek(beforeLine);
            // overwrite the bytes of that line
            raf.write(line.getBytes());
            // advance past the line break
            String ignored = raf.readLine();
            // and remember that position again
            beforeLine = raf.getFilePointer();
        }
    }
}
Handling the String encoding correctly is tricky in this case. If the file isn't in the encoding used by readLine() and getBytes(), you could work around that by doing
// file is in "iso-1234" encoding, which is made up.
// reinterpret the bytes in the correct encoding first
line = new String(line.getBytes("ISO-8859-1"), "iso-1234");
// ... modify line ...
// when writing, use the expected encoding
raf.write(line.getBytes("iso-1234"));
See How to read UTF8 encoded file using RandomAccessFile?
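For the student list from the question, one way to sidestep most of the encoding risk is to overwrite only the single flag byte instead of rewriting the whole line. The following is a sketch under assumptions that are not in the original posts: the file is UTF-8, each record is "name, space, flag", and the class and method names are hypothetical. It relies on the documented behaviour of RandomAccessFile.readLine(), which turns every byte into exactly one char, so line.length() equals the number of bytes in the line.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: flips one student's flag from '0' to '1' in place.
final class InPlaceFlagUpdater {

    // Returns true if a record for `name` ending in '0' was found and updated.
    static boolean markPassed(String file, String name) throws IOException {
        // Express the name the way readLine() will see it: one char per UTF-8 byte.
        String key = new String((name + " ").getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            long lineStart = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                long nextLine = raf.getFilePointer();
                if (line.startsWith(key) && line.endsWith("0")) {
                    // the flag is the last byte before the line break
                    raf.seek(lineStart + line.length() - 1);
                    raf.write('1');
                    return true;
                }
                lineStart = nextLine;
            }
            return false;
        }
    }
}
```

Because only one ASCII byte is replaced by another, the file size and every line break stay exactly where they were. This does nothing about several users writing at the same time; that would still need some form of locking.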
Try keeping the changes you want to make in RAM (as a string or a linked list of strings). Read the file into a linked list of strings, one per line of the file; write a function that merges the string you want to insert into that list; then rewrite the file entirely by writing out every line from the list. That should give you what you want. Here's what I mean in pseudocode; the order is important.
By reading the file only after the input has been set, we minimize interference with other users.
String lineYouWantToWrite = yourInput
LinkedList<String> list = new LinkedList<String>()
while (file has another line)
list.add(file's next line)
add your string to whatever index of list you want
write list to file line by line, file's first line = list[1]...
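In runnable Java, the pseudocode above might look roughly like this. It is a sketch, not a definitive implementation: the class and method names are hypothetical, the file is assumed to be UTF-8 and small enough to fit in memory, and the index is 0-based.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedList;
import java.util.List;

// Hypothetical helper: read everything, change the list in memory, write everything back.
final class LineInserter {

    // Inserts `newLine` at position `index` (0-based) and rewrites the whole file.
    static void insertLine(Path file, int index, String newLine) throws IOException {
        List<String> lines = new LinkedList<>(Files.readAllLines(file, StandardCharsets.UTF_8));
        lines.add(index, newLine);
        Files.write(file, lines, StandardCharsets.UTF_8);
    }
}
```

Note that the rewrite is not atomic: another user who reads the file while it is being written may see a partial file, so this only narrows the window for interference rather than closing it.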
I have a Java program that is supposed to output data, read that data back in, and then output it again with a few extra columns of results (so two outputs in total). To test my program, I just tried to read and print out the exact same CSV to see if it works. However, my first output contains 786718 rows of data, which is complete and correct, but when it is read again and output the second time, the data is cut off at row 786595, and even that row is missing some column data. The file sizes also differ: 74868 KB vs 74072 KB. Is this caused by a lack of memory in my Java program, or is it a problem with Excel or the .csv file?
PrintWriter writer = null;
try {
    writer = new PrintWriter(saveFileName + " updated.csv", "UTF-8");
    for (Map.Entry<String, ArrayList> entry : readOutputCSV(saveFileName).entrySet()) {
        FindOutput.find(entry.getKey(), entry.getValue(), checkInMRTWriter);
    }
} finally {
    if (writer != null) {
        writer.flush();
        writer.close();
    }
}
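The most likely reason is that you are neither flushing nor closing the PrintWriter that actually receives the data. Note that the snippet flushes and closes writer, but passes a different writer, checkInMRTWriter, to FindOutput.find.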
From the Java source:
public PrintWriter(OutputStream out) {
    this(out, false);
}
public PrintWriter(OutputStream out, boolean autoFlush) {
    this(new BufferedWriter(new OutputStreamWriter(out)), autoFlush);
    // ...
}
You can see that PrintWriter is buffered by default.
The default buffer size is 8 KiB so if you leave this data in the buffer and don't write it out you can lose up to the last 8 KiB of your data.
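A simple way to make sure the buffer is always written out is to open the writer in a try-with-resources block. The sketch below only illustrates that pattern; the class CsvCopy and its method are hypothetical and stand in for whatever the real program does with each row.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: copies a CSV line by line through a PrintWriter.
final class CsvCopy {

    // Copies every line of `in` to `out` and returns the number of lines written.
    static int copyLines(Path in, Path out) throws IOException {
        int count = 0;
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             PrintWriter writer = new PrintWriter(out.toFile(), "UTF-8")) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.println(line);
                count++;
            }
        } // close() runs here, even on exceptions, and flushes what is still buffered
        return count;
    }
}
```

Also keep in mind that PrintWriter swallows I/O errors; calling checkError() on it is the only way to find out whether a write failed.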
Several things might play a role here:
input/output encoding
line separators (you might be reading a file with '\r\n' and writing '\n' back)
CSV escaping: values might be escaped or not depending on how you handle the special cases (values containing newlines, commas, or quotes). You might be reading valid CSV with a parser but printing out unescaped (and broken) CSV.
whitespace: some libraries automatically trim whitespace when parsing.
The best way to verify is to use a CSV parsing library, such as univocity-parsers and use it to read/write your data with a fixed format configuration. Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
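To illustrate the escaping point from the list above: if values are printed by hand, every field has to be quoted when it contains a separator, a quote, or a line break. The helper below is a hypothetical, minimal sketch of the common RFC 4180 style quoting; it is not how any particular library implements it.

```java
// Hypothetical helper showing RFC 4180 style quoting of a single CSV field.
final class CsvEscape {

    // Quotes the field if it contains a comma, quote, CR or LF; embedded quotes are doubled.
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }
}
```

If the second output is written without this kind of quoting, a single value containing a comma or a newline is enough to shift columns or split a row when the file is read again.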
We have an issue unzipping bz2 files in Java, whereby the input stream thinks it's finished after reading ~3% of the file.
We would welcome any suggestions for how to decompress and read large bz2 files which have to be processed line by line.
Here are the details of what we have done so far:
For example, a bz2 file is 2.09 GB in size and uncompressed it is 24.9 GB
The code below only reads 343,800 lines of the actual ~10 million lines the file contains.
Modifying the code to decompress the bz2 into a text file (FileInputStream straight into the CompressorInputStream) results in a file of ~190 MB - irrespective of the size of the bz2 file.
I have tried setting a buffer value of 2048 bytes, but this has no effect on the outcome.
We have executed the code on Windows 64 bit and Linux/CentOS both with the same outcome.
Could the buffered reader come to an empty, "null" line and cause the code to exit the while-loop?
import org.apache.commons.compress.compressors.*;
import java.io.*;
...
CompressorInputStream is = new CompressorStreamFactory()
        .createCompressorInputStream(
                new BufferedInputStream(
                        new FileInputStream(filePath)));
lineNumber = 0;
line = "";
br = new BufferedReader(
        new InputStreamReader(is));
while ((line = br.readLine()) != null) {
    this.processLine(line, ++lineNumber);
}
Even this code, which forces an exception when the end of the stream is reached, has exactly the same result:
byte[] buffer = new byte[1024];
int len = 1;
while (len == 1) {
    out.write(buffer, 0, is.read(buffer));
    out.flush();
}
There is nothing obviously wrong with your code; it should work. This means the problem must be elsewhere.
Try to enable logging (i.e. print the lines as you process them). Make sure there are no gaps in the input (maybe write the lines to a new file and do a diff). Use bzip2 --test to make sure the input file isn't corrupt. Check whether it always fails at the same line (maybe the input contains odd characters or binary data?).
The issue lies with the bz2 files: they were created using a version of Hadoop which includes bad block headers inside the files.
Current Java solutions stumble over this, while others ignore it or handle it somehow.
Will look for a solution/workaround.
I am trying to use Protocol Buffers to record a little market data. Each time I get a quote notification from the market, I take this quote and convert it into a protocol buffers object. Then I call "writeDelimitedTo".
Example of my recorder:
try {
    writeLock.lock();
    LimitOrder serializableQuote = ...
    LimitOrderTransport gpbQuoteRaw = serializableQuote.serialize();
    LimitOrderTransport gpbQuote = LimitOrderTransport.newBuilder(gpbQuoteRaw).build();
    gpbQuote.writeDelimitedTo(fileStream);
    csvWriter1.println(gpbQuote.getIdNumber() + DELIMITER+ gpbQuote.getSymbol() + ...);
} finally {
    writeLock.unlock();
}
The reason for the locking is that quotes coming from different markets are handled by different threads, so I was trying to simplify and "serialize" the logging to the file.
Code that reads the resulting file:
FileInputStream stream = new FileInputStream(pathToFile);
PrintWriter writer = new PrintWriter("quoteStream6-compare.csv", "UTF-8");
while(LimitOrderTransport.newBuilder().mergeDelimitedFrom(stream)) {
    LimitOrderTransport gpbQuote= LimitOrderTransport.parseDelimitedFrom(stream);
    csvWriter2.println(gpbQuote.getIdNumber()+DELIMITER+ gpbQuote.getSymbol() ...);
}
When I run the recorder, I get a binary file that seems to grow in size. When I use my reader to read from the file I also appear to get a large number of quotes. They are all different and appear correct.
Here's the issue: Many of the quotes appear to be "missing" - Not present when my reader reads from the file.
I tried an experiment with csvWriter1 and csvWriter2. In my writer, I write out a CSV file; then, in my reader, I write a second CSV file using my protobuf file as the source.
The theory is that they should match up. They don't. The original CSV file contains many more quotes than the CSV that I generate by reading my recorded protobuf data.
What gives? Am I not using writeDelimitedTo/parseDelimitedFrom correctly?
Thanks!
Your problem is here:
while(LimitOrderTransport.newBuilder().mergeDelimitedFrom(stream)) {
LimitOrderTransport gpbQuote= LimitOrderTransport.parseDelimitedFrom(stream);
The first line constructs a new LimitOrderTransport.Builder and uses it to parse a message from the stream. Then that builder is discarded.
The second line parses a new message from the same stream, into a new builder.
So you are discarding every other message.
Do this instead:
while (true) {
    LimitOrderTransport gpbQuote = LimitOrderTransport.parseDelimitedFrom(stream);
    if (gpbQuote == null) break; // EOF
    // ... use gpbQuote here ...
}
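The rule "exactly one read per loop iteration" is not specific to Protocol Buffers. The sketch below shows the same pattern with plain JDK streams and a made-up length-prefixed format (a 4-byte length followed by UTF-8 bytes). This is only an analogy: it is not the protobuf wire format or API, and the class DelimitedRecords is hypothetical.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical length-prefixed record format, used only to illustrate the read loop.
final class DelimitedRecords {

    // Writes one record as a 4-byte length followed by the UTF-8 payload.
    static void write(OutputStream out, String record) throws IOException {
        byte[] payload = record.getBytes(StandardCharsets.UTF_8);
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(payload.length);
        data.write(payload);
        data.flush();
    }

    // Reads one record, or returns null when the stream ends before a new record starts.
    static String read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int length;
        try {
            length = data.readInt();
        } catch (EOFException eof) {
            return null;
        }
        byte[] payload = new byte[length];
        data.readFully(payload);
        return new String(payload, StandardCharsets.UTF_8);
    }

    // Each iteration consumes exactly one record; reading twice would skip every other one.
    static List<String> readAll(InputStream in) throws IOException {
        List<String> records = new ArrayList<>();
        String record;
        while ((record = read(in)) != null) {
            records.add(record);
        }
        return records;
    }
}
```

Calling read() once in the loop condition and again in the loop body would reproduce the symptom from the question: only half of the records arrive.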
I'm trying to transfer the contents of a file filein, starting at position 1300 (in uint8 pieces), into fileto, using the transferFrom function of the RandomAccessFile's channel.
fromfile = java.io.RandomAccessFile(ifile, 'rw');
fromchannel = fromfile.getChannel();
tofile = java.io.RandomAccessFile(ofile, 'rw');
tochannel = tofile.getChannel();
tochannel.transferFrom(fromchannel,n,fromfile.length()-n)
tochannel.close();
fromchannel.close();
fromfile.close();
tofile.close();
My output file is just empty, though.
Does anyone know what I'm doing wrong?
Edit 1:
I've changed
tochannel.transferFrom(fromchannel,n,fromfile.length()-n)
to
fromchannel.transferTo(n,fromfile.length()-n,tochannel)
But now the output file contains all the right data, except that it has a lot of 00 bytes where the header was in the original.
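You want to use transferTo, I believe:
fromchannel.transferTo(n,fromfile.length()-n,tochannel)
because transferFrom treats n as a position in the output file, while transferTo treats n as a position in the input file. Since the output file is empty, position n lies beyond its end, and in that case transferFrom transfers no bytes at all.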
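In plain Java, copying everything after a given offset with transferTo could be sketched as follows. The class TailCopy and its method are hypothetical names, and the sketch assumes the target should start with the first copied byte.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: copies the tail of one file into another file.
final class TailCopy {

    // Copies everything from byte `offset` of `source` to the start of `target`.
    static void copyFrom(Path source, Path target, long offset) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long position = offset;
            long end = in.size();
            while (position < end) {
                // transferTo may move fewer bytes than requested, so loop until done
                position += in.transferTo(position, end - position, out);
            }
        }
    }
}
```

Here the position argument refers to the source channel, and the bytes are written at the target channel's current position, which is 0 for a freshly opened, truncated file.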