Printing multiple files to a csv [closed] - java

I'm currently writing an assignment that takes multiple text files (File objects) with lines, and then combines the lines together, separated by commas, like:
File1Line1, File2Line1
File1Line2, File2Line2
I guess I'm just confused about how to use the files. How would I get the first (second, third, etc.) line from each file, while also dealing with files having different numbers of lines? Any help just on the concept of this is appreciated.

Reading a file line by line is easy to do in most languages. Here's an example in Java: How to read a large text file line by line using Java?.
Conceptually, you should start by thinking of an algorithm and then write some pseudocode to further explore and understand it.
For this assignment, one option would be to alternate reading each file one line at a time and immediately write the lines to the CSV. A second option would be to store each line in a data structure, such as an array, and write everything at the end, but that could be expensive for large files. You can handle different file lengths in many ways, for instance by writing lines that have no counterpart on their own. Here's some pseudocode, loosely based on Java:
FileReader reader1 = FileReader("file1.text")
FileReader reader2 = FileReader("file2.text")
while(reader1.hasNextLine() || reader2.hasNextLine())
{
    if(reader1.hasNextLine()) {
        writeToCSV(reader1.nextLine());
    }
    writeToCSV(", ");
    if(reader2.hasNextLine()) {
        writeToCSV(reader2.nextLine());
    }
    writeToCSV("\r\n");
}
You can find plenty of examples on the actual method calls, but it's important to understand the algorithm first.
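If you want to turn that pseudocode into something runnable, here is a minimal Java sketch using Scanner. The file names, the output name and the ", " separator are just placeholders for whatever your assignment requires:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.util.Scanner;

public class CombineTwoFiles {
    public static void main(String[] args) throws FileNotFoundException {
        try (Scanner reader1 = new Scanner(new File("file1.txt"));   // hypothetical input names
             Scanner reader2 = new Scanner(new File("file2.txt"));
             PrintWriter csv = new PrintWriter(new File("output.csv"))) {
            // Keep going while either file still has lines.
            while (reader1.hasNextLine() || reader2.hasNextLine()) {
                String left = reader1.hasNextLine() ? reader1.nextLine() : "";
                String right = reader2.hasNextLine() ? reader2.nextLine() : "";
                // One CSV row per pair of lines; a missing line becomes an empty field.
                csv.println(left + ", " + right);
            }
        }
    }
}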

If you are sure the lines of the two files have a one-to-one mapping, then it is easy.
You can use two BufferedReaders to read the two files, and you only need to iterate over one of them.
Some code like this:
BufferedReader reader1 = new BufferedReader(new FileReader(new File(pathOfFile1)));
BufferedReader reader2 = new BufferedReader(new FileReader(new File(pathOfFile2)));
BufferedWriter writer = new BufferedWriter(new FileWriter(new File(pathOfOutputCsvFile)));
String lineOfFile1 = null;
while ((lineOfFile1 = reader1.readLine()) != null) {
    String lineOfFile2 = reader2.readLine();
    // here, lineOfFile1 and lineOfFile2 have the same line number
    // then some code for the combination, e.g. writer.write(lineOfFile1 + ", " + lineOfFile2)
    // ...
}
// finally, don't forget to close the readers and the writer.
If you can't be sure the lines in the two files have a one-to-one mapping, then you should read them both into memory, map them there, and then output them as a CSV file.
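For that non-one-to-one case, here is a minimal sketch of the read-everything-into-memory approach, pairing lines by index. combineInMemory is just a hypothetical helper name, and the path variables are the same as above:
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

static void combineInMemory(String pathOfFile1, String pathOfFile2,
                            String pathOfOutputCsvFile) throws IOException {
    // Read both files completely; fine for modest sizes, expensive for huge files.
    List<String> lines1 = Files.readAllLines(Paths.get(pathOfFile1));
    List<String> lines2 = Files.readAllLines(Paths.get(pathOfFile2));

    try (BufferedWriter writer = Files.newBufferedWriter(Paths.get(pathOfOutputCsvFile))) {
        int rows = Math.max(lines1.size(), lines2.size());
        for (int i = 0; i < rows; i++) {
            // A file that has run out of lines contributes an empty field.
            String left = i < lines1.size() ? lines1.get(i) : "";
            String right = i < lines2.size() ? lines2.get(i) : "";
            writer.write(left + ", " + right);
            writer.newLine();
        }
    }
}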

This code only holds one line from each file in RAM at a time, so it should work with huge files without running out of memory. Behind the scenes more memory may be occupied than what you see, but it will still not crash on huge files.
The code works by reading one line at a time from each of the files until all files are empty. As files run out of lines, an empty string is output instead.
void assignment(String outputFile, String... filenames) throws IOException {
    PrintWriter writer = new PrintWriter(outputFile, "UTF-8");
    Scanner[] scanners = new Scanner[filenames.length];
    for (int i = 0; i < filenames.length; i++) {
        Scanner scanner = new Scanner(new File(filenames[i]));
        scanners[i] = scanner;
    }
    boolean running = true;
    while (running) {
        boolean allEmpty = true;
        StringBuilder csvLine = new StringBuilder();
        for (int i = 0; i < scanners.length; i++) {
            if (scanners[i].hasNextLine()) {
                String line = scanners[i].nextLine();
                csvLine.append(line);
                allEmpty = false;
            }
            if (i != scanners.length - 1) csvLine.append(",");
        }
        if (allEmpty)
            running = false;
        else
            writer.println(csvLine.toString());
    }
    writer.close();
    for (Scanner s : scanners) s.close();
}
Usage:
assignment("output.txt","file1.txt","file2.txt","file3.txt","file4.txt");
Or:
String[] args = new String[]{"helloWorld.txt","fun.bin"};
assignment("output2.txt",args);
This code is untested and doesn't handle exceptions. It will let you read in lines from files whose line counts don't match and combine them into a single CSV file. As files run out of lines, only empty fields will be written for them.
This should give you an idea of how to do precisely what you've asked.

Related

Reading a large compressed file using Apache Commons Compress

I'm trying to read a bz2 file using Apache Commons Compress.
The following code works for a small file.
However, for a large file (over 500 MB), it ends after reading a few thousand lines, without any error.
try {
    InputStream fin = new FileInputStream("/data/file.bz2");
    BufferedInputStream bis = new BufferedInputStream(fin);
    CompressorInputStream input = new CompressorStreamFactory()
            .createCompressorInputStream(bis);
    BufferedReader br = new BufferedReader(new InputStreamReader(input, "UTF-8"));
    String line = "";
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
} catch (Exception e) {
    e.printStackTrace();
}
Is there another good way to read a large compressed file?
I was having the same problem with a large file, until I noticed that CompressorStreamFactory has a couple of overloaded constructors that take a boolean decompressUntilEOF parameter.
Simply changing to the following may be all that's missing...
CompressorInputStream input = new CompressorStreamFactory(true)
        .createCompressorInputStream(bis);
Clearly, whoever wrote this factory thinks it's better by default to create new compressor input streams at certain points, over the same underlying buffered input stream, so that each new one picks up where the last one left off, rather than letting a single stream decompress data all the way to the end of the file. (A large .bz2 file often consists of several concatenated bzip2 streams, for example when it was produced by a parallel compressor such as pbzip2; with the default setting, decompression stops at the end of the first stream, which would explain why reading stopped early without an error.) I've no doubt they are cleverer than me, and I haven't worked out what trap I'm setting for future me by setting this parameter to true. Maybe someone will tell me in the comments! :-)
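For completeness, here's roughly what the whole reader looks like with the flag set, using try-with-resources. This is a sketch based on the question's code, not tested against your file:
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.commons.compress.compressors.CompressorInputStream;
import org.apache.commons.compress.compressors.CompressorStreamFactory;

public class Bz2LineReader {
    public static void main(String[] args) throws Exception {
        try (InputStream fin = new FileInputStream("/data/file.bz2");
             BufferedInputStream bis = new BufferedInputStream(fin);
             // true = decompressUntilEOF: keep decompressing across concatenated streams
             CompressorInputStream input = new CompressorStreamFactory(true)
                     .createCompressorInputStream(bis);
             BufferedReader br = new BufferedReader(
                     new InputStreamReader(input, StandardCharsets.UTF_8))) {

            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}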

writeDelimitedTo/parseDelimitedFrom seem to be losing data

I am trying to use protocol buffers to record a little market data. Each time I get a quote notification from the market, I take the quote and convert it into a protocol buffers object. Then I call writeDelimitedTo.
Example of my recorder:
try {
    writeLock.lock();
    LimitOrder serializableQuote = ...
    LimitOrderTransport gpbQuoteRaw = serializableQuote.serialize();
    LimitOrderTransport gpbQuote = LimitOrderTransport.newBuilder(gpbQuoteRaw).build();
    gpbQuote.writeDelimitedTo(fileStream);
    csvWriter1.println(gpbQuote.getIdNumber() + DELIMITER + gpbQuote.getSymbol() + ...);
} finally {
    writeLock.unlock();
}
The reason for the locking is that quotes coming from different markets are handled by different threads, so I was trying to simplify and "serialize" the logging to the file.
Code that reads the resulting file:
FileInputStream stream = new FileInputStream(pathToFile);
PrintWriter writer = new PrintWriter("quoteStream6-compare.csv", "UTF-8");
while (LimitOrderTransport.newBuilder().mergeDelimitedFrom(stream)) {
    LimitOrderTransport gpbQuote = LimitOrderTransport.parseDelimitedFrom(stream);
    csvWriter2.println(gpbQuote.getIdNumber() + DELIMITER + gpbQuote.getSymbol() ...);
}
When I run the recorder, I get a binary file that seems to grow in size. When I use my reader to read from the file I also appear to get a large number of quotes. They are all different and appear correct.
Here's the issue: many of the quotes appear to be "missing" - they are not present when my reader reads from the file.
I tried an experiment with csvWriter1 and csvWriter2. In my writer I write out a CSV file, then in my reader I write a second CSV file using my recorded protobuf file as the source.
The theory is that they should match up. They don't: the original CSV file contains many more quotes than the CSV I generate by reading my recorded protobuf data.
What gives? Am I not using writeDelimitedTo/parseDelimitedFrom correctly?
Thanks!
Your problem is here:
while(LimitOrderTransport.newBuilder().mergeDelimitedFrom(stream)) {
LimitOrderTransport gpbQuote= LimitOrderTransport.parseDelimitedFrom(stream);
The first line constructs a new LimitOrderTransport.Builder and uses it to parse a message from the stream. Then that builder is discarded.
The second line parses a new message from the same stream, into a new builder.
So you are discarding every other message.
Do this instead:
while (true) {
    LimitOrderTransport gpbQuote = LimitOrderTransport.parseDelimitedFrom(stream);
    if (gpbQuote == null) break; // EOF
    // process gpbQuote here, e.g. write it to csvWriter2
}

What is a more performant way to extract patterns from a large file (over 700 MB)?

I have a problem that requires me to parse a text file on the local machine. There are a few complications:
The files can be quite large (700 MB+)
The pattern occurs on multiple lines
I need to store the line content after the pattern
I've written some simple code using BufferedReader, String.indexOf and String.substring (to handle item 3).
Inside the file there is a key (pattern) named code= that occurs many times in different blocks. The program reads each line of the file using BufferedReader.readLine. It uses indexOf to check whether the pattern appears, then extracts the text after the pattern and stores it in a common string.
When I ran my program with a 600 MB file, I noticed that performance got worse as it processed the file. I read an article on CodeRanch saying that the Scanner class isn't performant for large files.
Are there techniques or a library that could improve my performance?
Thanks in advance.
Here's my source code:
String codeC = "code=[";
String source = "";
try {
    FileInputStream f1 = new FileInputStream("c:\\Temp\\fo1.txt");
    DataInputStream in = new DataInputStream(f1);
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String strLine;
    boolean bPrnt = false;
    int ln = 0;
    // Read File Line By Line
    while ((strLine = br.readLine()) != null) {
        // Print the content on the console
        if (strLine.indexOf(codeC) != -1) {
            ln++;
            System.out.println(strLine + " ---- register : " + ln);
            strLine = strLine.substring(codeC.length(), strLine.length());
            source = source + "\n" + strLine;
        }
    }
    System.out.println("");
    System.out.println("Lines :" + ln);
    f1.close();
} catch ( ... ) {
    ...
}
This code of yours is highly suspicious and may well account for at least part of your performance issues:
FileInputStream f1 = new FileInputStream("c:\\Temp\\fo1.txt");
DataInputStream in = new DataInputStream(f1);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
You are involving DataInputStream for no good reason; in fact, using it as the input to a Reader can be considered broken code. Write this instead:
InputStream f1 = new FileInputStream("c:\\Temp\\fo1.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(f1));
A huge detriment to performance is the System.out printing you are doing, especially if you measure performance when running in Eclipse, but even when running from the command line. My guess is that this is the major cause of your bottleneck. By all means ensure you don't print anything in the main loop when you aim for top performance.
In addition to what Marko answered, I suggest closing br, not f1:
br.close();
This will not affect performance, but it is cleaner (close the outermost stream).
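On Java 7+ you can also let try-with-resources do the closing for you. A minimal sketch based on the question's code (the enclosing method still has to catch or declare IOException):
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream("c:\\Temp\\fo1.txt")))) {
    String strLine;
    while ((strLine = br.readLine()) != null) {
        // process the line as before
    }
} // closing br also closes the wrapped InputStreamReader and FileInputStream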
Have a look at java.util.regex
An excellent tutorial from Oracle.
A copy-paste from the Javadoc:
Classes for matching character sequences against patterns specified by regular expressions.
An instance of the Pattern class represents a regular expression that is specified in string form in a syntax similar to that used by Perl.
Instances of the Matcher class are used to match character sequences against a given pattern. Input is provided to matchers via the CharSequence interface in order to support matching against characters from a wide variety of input sources.
Unless otherwise noted, passing a null argument to a method in any class or interface in this package will cause a NullPointerException to be thrown.
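Applied to this question, a sketch might look like the following. extractCodes is a hypothetical helper name; the Set mirrors the HashSet change mentioned below, so use a List instead if duplicate matches must be kept:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

static Set<String> extractCodes(String path) throws IOException {
    // Compile the pattern once and reuse it for every line.
    Pattern codePattern = Pattern.compile("code=\\[(.*)");
    Set<String> source = new HashSet<>();          // cheaper than growing one big String
    try (BufferedReader br = new BufferedReader(new FileReader(path))) {
        String strLine;
        while ((strLine = br.readLine()) != null) {
            Matcher m = codePattern.matcher(strLine);
            if (m.find()) {
                source.add(m.group(1));            // the text after "code=["
            }
        }
    }
    return source;
}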
It works perfectly!!
I followed OldCurmudgeon's, Marko Topolnik's and AlexWien's advice and my performance improved by 1000%. Before, the program took 2 hours to complete the described operation and write the response file.
Now it takes 5 minutes!! And the SYSO (System.out) calls are still in the source code!!
I think the main reason for the great improvement is changing the String "source" to a HashSet "source", as OldCurmudgeon suggested. But I also removed the DataInputStream and used "br.close".
Thanks guys !!

Fastest/Cleanest way to load a text file in memory

I know similar questions have been asked before, but I couldn't find one that answers my exact question.
I need a way to read a file into a String with the least code, as simply and as optimally as possible.
I'm not looking for:
final BufferedReader br = new BufferedReader(new FileReader(file));
String line = null;
while ((line = br.readLine()) != null) {
    // logic
}
And I know I can write my own helper class that does this.
I'm looking for something more along the lines of:
final String wholeFileAsStr = Something.load(file);
Where Something.load() is super optimized and buffers the file properly while reading it, taking the file size into account, for instance.
Can anyone recommend something from Guava or Apache, maybe, that I'm not aware of?
Thanks in advance.
Perhaps IOUtils.toString, from Commons IO.
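For example, a minimal sketch assuming Commons IO is on the classpath; loadWholeFile is a hypothetical helper name:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.io.IOUtils;

static String loadWholeFile(String path) throws IOException {
    try (InputStream in = new FileInputStream(path)) {
        // Commons IO reads the whole stream into a String in one call.
        return IOUtils.toString(in, "UTF-8");
    }
}
Usage would then be close to what the question asks for: final String wholeFileAsStr = loadWholeFile("input.txt");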
For a more detailed look at the various ways of reading a single file in a JVM, try the following article:
Java tip: How to read files quickly

StringBuilders ending with mass nul characters

I'm having a very difficult time debugging a problem with an application I've been building. I cannot seem to reproduce the problem with a representative test program, which makes it difficult to demonstrate. Unfortunately I cannot share my actual source because of security, but the following test represents fairly well what I am doing: the files and data use Unix-style EOL, I write to a zip file with a PrintWriter, and I use StringBuilders:
public class Tester {
    public static void main(String[] args) {
        // variables
        File target = new File("TESTSAVE.zip");
        PrintWriter printout1;
        ZipOutputStream zipStream;
        ZipEntry ent1;
        StringBuilder testtext1 = new StringBuilder();
        StringBuilder replacetext = new StringBuilder();
        // ensure file replace
        if (target.exists()) {
            target.delete();
        }
        try {
            // open the streams
            zipStream = new ZipOutputStream(new FileOutputStream(target, true));
            printout1 = new PrintWriter(zipStream);
            ent1 = new ZipEntry("testfile.txt");
            zipStream.putNextEntry(ent1);
            // construct the data
            for (int i = 0; i < 30; i++) {
                testtext1.append("Testing 1 2 3 Many! \n");
            }
            replacetext.append("Testing 4 5 6 LOTS! \n");
            replacetext.append("Testing 4 5 6 LOTS! \n");
            // the replace operation
            testtext1.replace(21, 42, replacetext.toString());
            // write it
            printout1 = new PrintWriter(zipStream);
            printout1.println(testtext1);
            // save it
            printout1.flush();
            zipStream.closeEntry();
            printout1.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The heart of the problem is that the file I see on my side is 16.3k characters. My friend, whether he uses the app on his PC or looks at exactly the same file as me, sees a file of 19,999 characters, the extra characters being a CRLF followed by a massive number of null characters. No matter what application, encoding or view I use, I cannot see these null characters at all; I only see a single LF at the last line, but I do see a file of 20k. In all cases there is a difference between what is seen with the exact same files on the two machines, even though both are Windows machines and both use the same editing software to view them.
I've not yet been able to reproduce this behaviour with any number of dummy programs. I have, however, been able to trace the final line's stray CRLF to my use of println on the PrintWriter. When I replaced the println(s) with print(s + '\n'), the problem appeared to go away (the file size was 16.3k). However, when I returned the program to println(s), the problem did not reappear. I'm currently having the files verified by a friend in France to see if the problem really did go away (since I cannot see the nulls but he can), but this behaviour has me thoroughly confused.
I've also noticed that StringBuilder's replace method states "This sequence will be lengthened to accommodate the specified String if necessary". Given that StringBuilder's setLength method pads with null characters, and that the ensureCapacity method sets the capacity to the greater of the input or (currentCapacity*2)+2, I suspected a relation somewhere. However, only once while testing this idea have I been able to get a result resembling what I've seen, and I have not been able to reproduce it since.
Does anyone have any idea what could be causing this error or at least have a suggestion on what direction to take the testing?
Edit, since the comments section is broken for me:
Just to clarify, the output is required to be in Unix format regardless of the OS, hence the use of '\n' directly rather than through a formatter. The original StringBuilder that is inserted into is not in fact generated by me but is the contents of a file read in by the program. I'm happy the reading process works, as the information in it is used heavily throughout the application. I've done a little probing too and found that directly prior to saving, the buffer IS the correct capacity and that the output when toString() is invoked is the correct length (i.e. it contains no null characters and is 16,363 characters long, not 19,999). This puts the cause of the error somewhere between generating the string and saving the zip file.
Finally found the cause. I managed to reproduce the problem a few times and traced it not to the output side of the code but to the input side. My file-reading function was essentially this:
char[] buf;
int charcount = 0;
StringBuilder line = new StringBuilder(2048);
InputStreamReader reader = new InputStreamReader(stream); // provides a line-wise read
BufferedReader file = new BufferedReader(reader);
do { // capture loop
    try {
        buf = new char[2048];
        charcount = file.read(buf, 0, 2048);
    } catch (IOException e) {
        return null; // unknown IO error
    }
    line.append(buf);
} while (charcount != -1);
// close and output
The problem was appending a buffer that wasn't full, so the later values were still at their initial value of null ('\u0000'). The reason I couldn't reproduce it was that some data filled the buffers exactly and some didn't.
Why I couldn't see the problem in my text editors I still have no idea, but I should be able to resolve it now. Any suggestions on the best way to do so are welcome; as this is part of one of my long-term utility libraries, I want to keep it as generic and optimised as possible.
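For reference, a corrected version of that capture loop would only append the characters actually read. A sketch, with the IOException handling omitted:
char[] buf = new char[2048];
int charcount;
StringBuilder line = new StringBuilder(2048);
BufferedReader file = new BufferedReader(new InputStreamReader(stream));
while ((charcount = file.read(buf, 0, 2048)) != -1) {
    // Append only the chars that were read; a partially filled buffer
    // no longer leaks its '\u0000' padding into the result.
    line.append(buf, 0, charcount);
}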
