writeDelimitedTo/parseDelimitedFrom seem to be losing data - java

I am trying to use Protocol Buffers to record a little market data. Each time I get a quote notification from the market, I convert the quote into a protocol buffers object and call "writeDelimitedTo".
Example of my recorder:
writeLock.lock(); // acquire before the try so finally only runs once the lock is held
try {
    LimitOrder serializableQuote = ...
    LimitOrderTransport gpbQuoteRaw = serializableQuote.serialize();
    LimitOrderTransport gpbQuote = LimitOrderTransport.newBuilder(gpbQuoteRaw).build();
    gpbQuote.writeDelimitedTo(fileStream);
    csvWriter1.println(gpbQuote.getIdNumber() + DELIMITER + gpbQuote.getSymbol() + ...);
} finally {
    writeLock.unlock();
}
The reason for the locking is because quotes coming from different markets are handled by different threads, so I was trying to simplify and "serialize" the logging to the file.
Code that reads the resulting file:
FileInputStream stream = new FileInputStream(pathToFile);
PrintWriter writer = new PrintWriter("quoteStream6-compare.csv", "UTF-8");
while (LimitOrderTransport.newBuilder().mergeDelimitedFrom(stream)) {
    LimitOrderTransport gpbQuote = LimitOrderTransport.parseDelimitedFrom(stream);
    csvWriter2.println(gpbQuote.getIdNumber() + DELIMITER + gpbQuote.getSymbol() ...);
}
When I run the recorder, I get a binary file that seems to grow in size. When I use my reader to read from the file I also appear to get a large number of quotes. They are all different and appear correct.
Here's the issue: Many of the quotes appear to be "missing" - Not present when my reader reads from the file.
I tried an experiment with csvWriter1 and csvWriter2. In my writer I write out a CSV file, and in my reader I write a second CSV file using my protobufs file as the source.
The theory is that the two should match up. They don't: the original CSV file contains many more quotes than the CSV I generate by reading back my recorded protobufs data.
What gives? Am I not using writeDelimitedTo/parseDelimitedFrom correctly?
Thanks!

Your problem is here:
while (LimitOrderTransport.newBuilder().mergeDelimitedFrom(stream)) {
    LimitOrderTransport gpbQuote = LimitOrderTransport.parseDelimitedFrom(stream);
The first line constructs a new LimitOrderTransport.Builder and uses it to parse a message from the stream. Then that builder is discarded.
The second line parses a new message from the same stream, into a new builder.
So you are discarding every other message.
Do this instead:
while (true) {
    LimitOrderTransport gpbQuote = LimitOrderTransport.parseDelimitedFrom(stream);
    if (gpbQuote == null) break; // parseDelimitedFrom returns null at EOF

    csvWriter2.println(gpbQuote.getIdNumber() + DELIMITER + gpbQuote.getSymbol() ...);
}

Related

PrintWriter not creating output with supposed number of rows

I have a Java program that is supposed to output data, take it in again, read it, and then output it with a few extra columns of results (so two outputs in total). To test my program I just tried to read and print out the exact same CSV to see if it works. However, my first output returns 786718 rows of data, which is complete and correct, but when it gets read again for the second output, the data is cut off at row 786595, and even that row is missing some column data. The file sizes are also 74868 KB vs 74072 KB. Is this because my Java program ran out of memory, or is it a problem with Excel/the .csv file?
PrintWriter writer = null;
try {
    writer = new PrintWriter(saveFileName + " updated.csv", "UTF-8");
    for (Map.Entry<String, ArrayList> entry : readOutputCSV(saveFileName).entrySet()) {
        FindOutput.find(entry.getKey(), entry.getValue(), checkInMRTWriter);
    }
} finally {
    if (writer != null) {
        writer.flush();
        writer.close();
    }
}
The most likely reason is that you are not flushing or closing the PrintWriter that actually receives the data.
From the Java source:
public PrintWriter(OutputStream out) {
    this(out, false);
}

public PrintWriter(OutputStream out, boolean autoFlush) {
    this(new BufferedWriter(new OutputStreamWriter(out)), autoFlush);
}
You can see that PrintWriter is buffered by default.
The default buffer size is 8 KiB so if you leave this data in the buffer and don't write it out you can lose up to the last 8 KiB of your data.
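A simple way to guarantee the flush is try-with-resources, which closes (and therefore flushes) the writer even if an exception is thrown. A minimal sketch reusing the names from the question; whether FindOutput.find should receive this writer rather than checkInMRTWriter is an assumption about where the missing rows were being written:
try (PrintWriter writer = new PrintWriter(saveFileName + " updated.csv", "UTF-8")) {
    for (Map.Entry<String, ArrayList> entry : readOutputCSV(saveFileName).entrySet()) {
        FindOutput.find(entry.getKey(), entry.getValue(), writer);
    }
} // close() here flushes whatever is left in the 8 KiB buffer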
Some other things that might have an influence here:
input/output encoding
line separators (you might be reading a file with '\r\n' and writing '\n' back)
CSV escaping - values might be escaped or not depending on how you handle the special cases (values with newlines, commas, or quotes). You might be reading valid CSV with a parser but printing out unescaped (and broken) CSV.
whitespace - some libraries automatically trim whitespace when parsing.
The best way to verify is to use a CSV parsing library, such as univocity-parsers, and use it to read and write your data with a fixed format configuration. Disclosure: I am the author of this library. It's open-source and free (Apache 2.0 license).
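For illustration, a round-trip sketch with univocity-parsers; the file names are placeholders and the settings shown are assumptions about the format:
import com.univocity.parsers.csv.*;
import java.io.*;
import java.util.List;

public class CsvRoundTrip {
    public static void main(String[] args) throws IOException {
        // Parse with an explicit, fixed configuration
        CsvParserSettings parserSettings = new CsvParserSettings();
        parserSettings.setLineSeparatorDetectionEnabled(true); // tolerate \r\n vs \n input
        CsvParser parser = new CsvParser(parserSettings);
        List<String[]> rows = parser.parseAll(new FileReader("input.csv"));

        // Write back with the same library so escaping stays consistent
        CsvWriterSettings writerSettings = new CsvWriterSettings();
        writerSettings.getFormat().setLineSeparator("\n");
        CsvWriter writer = new CsvWriter(new FileWriter("output.csv"), writerSettings);
        for (String[] row : rows) {
            writer.writeRow((Object[]) row);
        }
        writer.close(); // flushes and closes the underlying FileWriter
    }
}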

StringBuilders ending with mass nul characters

I'm having a very difficult time debugging a problem with an application I've been building. I cannot seem to reproduce the problem with a representative test program, which makes it difficult to demonstrate. Unfortunately I cannot share my actual source because of security, but the following test represents fairly well what I am doing: the files and data use Unix-style EOL, are written to a zip file with a PrintWriter, and make use of StringBuilders:
public class Tester {
    public static void main(String[] args) {
        // variables
        File target = new File("TESTSAVE.zip");
        PrintWriter printout1;
        ZipOutputStream zipStream;
        ZipEntry ent1;
        StringBuilder testtext1 = new StringBuilder();
        StringBuilder replacetext = new StringBuilder();
        // ensure file replace
        if (target.exists()) {
            target.delete();
        }
        try {
            // open the streams
            zipStream = new ZipOutputStream(new FileOutputStream(target, true));
            printout1 = new PrintWriter(zipStream);
            ent1 = new ZipEntry("testfile.txt");
            zipStream.putNextEntry(ent1);
            // construct the data
            for (int i = 0; i < 30; i++) {
                testtext1.append("Testing 1 2 3 Many! \n");
            }
            replacetext.append("Testing 4 5 6 LOTS! \n");
            replacetext.append("Testing 4 5 6 LOTS! \n");
            // the replace operation
            testtext1.replace(21, 42, replacetext.toString());
            // write it
            printout1 = new PrintWriter(zipStream);
            printout1.println(testtext1);
            // save it
            printout1.flush();
            zipStream.closeEntry();
            printout1.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The heart of the problem is that the file I see on my side contains 16.3k characters. My friend, whether he uses the app on his PC or looks at exactly the same file as me, sees a file of 19.999k characters, the extra characters being a CRLF followed by a massive number of nul characters. No matter what application, encoding, or view I use, I cannot see these nul characters at all; I only see a single LF on the last line, yet I do see a file of 20k. In all cases there is a difference between what is seen with the exact same files on the two machines, even though both are Windows machines and both are using the same editing software to view them.
I've not yet been able to reproduce this behaviour with any number of dummy programs. I have, however, been able to trace the final line's stray CRLF to my use of println on the PrintWriter. When I replaced the println(s) with print(s + '\n') the problem appeared to go away (the file size was 16.3k). However, when I returned the program to println(s), the problem did not reappear. I'm currently having the files verified by a friend in France to see if the problem really did go away (since I cannot see the nuls but he can), but this behaviour has me thoroughly confused.
I've also noticed that the StringBuilder's replace function states "This sequence will be lengthened to accommodate the specified String if necessary". Given that StringBuilder's setLength function pads with nul characters, and that the ensureCapacity function sets the capacity to the greater of the input or (currentCapacity*2)+2, I suspected a relation somewhere. However, only once while testing this idea have I been able to get a result that matched what I've seen, and I have not been able to reproduce it since.
Does anyone have any idea what could be causing this error or at least have a suggestion on what direction to take the testing?
Edit since the comments section is broken for me:
Just to clarify, the output is required to be in Unix format regardless of the OS, hence the use of '\n' directly rather than through a formatter. The original StringBuilder that is inserted into is not in fact generated as in the test, but is the contents of a file read in by the program. I'm confident the reading process works, as the information in it is used heavily throughout the application. I've done a little probing, too, and found that directly prior to saving, the buffer IS the correct capacity, and that the output when toString() is invoked is the correct length (i.e. it contains no nul characters and is 16,363 characters long, not 19,999). This would put the cause of the error somewhere between generating the string and saving the zip file.
Finally found the cause. I managed to reproduce the problem a few times and traced it down not to the output side of the code but to the input side. My file reading function was essentially this:
char[] buf;
int charcount = 0;
StringBuilder line = new StringBuilder(2048);
InputStreamReader reader = new InputStreamReader(stream); // provides a line-wise read
BufferedReader file = new BufferedReader(reader);
do { // capture loop
    try {
        buf = new char[2048];
        charcount = file.read(buf, 0, 2048);
    } catch (IOException e) {
        return null; // unknown IO error
    }
    line.append(buf);
} while (charcount != -1);
// close and output
// close and output
The problem was appending a buffer that wasn't full, so the later values were still at their initial value of nul. The reason I couldn't reproduce it was that some data filled the buffers nicely and some didn't.
Why I couldn't see the problem in my text editors I still have no idea, but I should be able to resolve this now. Any suggestions on the best way to do so are welcome; as this is part of one of my long-term utility libraries, I want to keep it as generic and optimised as possible.
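For reference, a minimal sketch of a corrected capture loop: the key change is the bounded append, which copies only the characters actually read rather than the whole, possibly half-empty, buffer (the explicit UTF-8 charset is an assumption; the original relied on the platform default):
import java.io.*;
import java.nio.charset.StandardCharsets;

static String readAll(InputStream stream) throws IOException {
    StringBuilder text = new StringBuilder(2048);
    char[] buf = new char[2048];
    try (BufferedReader file = new BufferedReader(
            new InputStreamReader(stream, StandardCharsets.UTF_8))) {
        int charcount;
        while ((charcount = file.read(buf, 0, buf.length)) != -1) {
            text.append(buf, 0, charcount); // append only what was read; no stale nuls
        }
    }
    return text.toString();
}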

Java - directly reading a file without downloading and storing locally [duplicate]

This question already has answers here:
How do I read / convert an InputStream into a String in Java?
(62 answers)
Closed 8 years ago.
I want to directly read a file into a String, without storing the file locally. I used to do this in an old project, but I don't have the source code anymore. I used to be able to get the source of my website this way.
However, I don't remember if I did it by "InputStream to String array of lines to String", or if I directly read it into a String.
Was there a function for this, or am I remembering wrong?
(Note: this function would be the PHP equivalent of file_get_contents($path))
You need to use InputStreamReader to convert from a binary input stream to a Reader which is appropriate for reading text.
After that, you need to read to the end of the reader.
Personally I'd do all this with Guava, which has convenience methods for this sort of thing, e.g. CharStreams.toString(Readable).
When you create the InputStreamReader, make sure you supply the appropriate character encoding - if you don't, you'll get junk text out (just like trying to play an MP3 file as if it were a WAV, for example).
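Putting those pieces together, a minimal sketch using Guava's CharStreams; the URL source and the UTF-8 charset are assumptions standing in for wherever your stream actually comes from:
import com.google.common.io.CharStreams;
import java.io.*;
import java.net.URL;
import java.nio.charset.StandardCharsets;

static String fetch(String address) throws IOException {
    URL url = new URL(address);
    // InputStreamReader bridges bytes to chars; always pass the encoding explicitly
    try (Reader reader = new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)) {
        return CharStreams.toString(reader);
    }
}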
Check out apache-commons-io and, for your use case, FileUtils.readFileToString(File file)
(it should not be too hard to get a File from the path).
You can use the library or have a look at the code, as it is open source.
There is no direct way to read a File into a String (at least not before java.nio.file.Files arrived in Java 7).
But there is a quick alternative: read the File into a byte array and convert that into a String.
Untested:
File f = new File("/foo/bar");
ByteArrayOutputStream bStream = new ByteArrayOutputStream();
try (InputStream fStream = new FileInputStream(f)) {
    for (int data = fStream.read(); data > -1; data = fStream.read()) {
        bStream.write(data);
    }
}
String theResult = new String(bStream.toByteArray(), "UTF-8");
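On Java 7 and later, the same result is a one-liner with java.nio.file.Files (UTF-8 assumed, as above):
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

String theResult = new String(Files.readAllBytes(Paths.get("/foo/bar")), StandardCharsets.UTF_8);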

How to check whether the file is binary?

I wrote the following method to see whether a particular file contains only ASCII text characters, or control characters in addition to that. Could you glance at this code, suggest improvements, and point out oversights?
The logic is as follows: "If the first 500 bytes of a file contain 5 or more control characters, report it as a binary file."
Thank you.
public boolean isAsciiText(String fileName) throws IOException {
    InputStream in = new FileInputStream(fileName);
    byte[] bytes = new byte[500];
    in.read(bytes, 0, bytes.length);
    int x = 0;
    short bin = 0;
    for (byte thisByte : bytes) {
        char it = (char) thisByte;
        if (!Character.isWhitespace(it) && Character.isISOControl(it)) {
            bin++;
        }
        if (bin >= 5) {
            return false;
        }
        x++;
    }
    in.close();
    return true;
}
Since you call this method "isAsciiText", you know exactly what you're looking for. In other words, it's not "isTextInCurrentLocaleEncoding". Thus you can be more accurate with:
if (thisByte < 32 || thisByte > 127) bin++;
Edit, a long time later: it's pointed out in a comment that this simple check would be tripped up by a text file that starts with a lot of newlines. It would probably be better to use a table of "ok" bytes which includes the printable characters (plus carriage return, newline, and tab, and possibly form feed, though I don't think many modern documents use it), and then check against that table, as sketched below.
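A minimal sketch of that table-based check; the exact set of "ok" bytes (printable ASCII plus tab, LF, CR, and form feed) and the threshold of 5 are assumptions carried over from the question:
private static final boolean[] OK_BYTES = new boolean[256];
static {
    for (int b = 0x20; b < 0x7F; b++) {
        OK_BYTES[b] = true; // printable ASCII
    }
    OK_BYTES['\t'] = OK_BYTES['\n'] = OK_BYTES['\r'] = OK_BYTES['\f'] = true;
}

static boolean looksLikeAsciiText(byte[] bytes, int length) {
    int suspicious = 0;
    for (int i = 0; i < length; i++) {
        if (!OK_BYTES[bytes[i] & 0xFF]) { // mask so bytes >= 0x80 index correctly
            suspicious++;
        }
    }
    return suspicious < 5; // same threshold as the question
}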
x doesn't appear to do anything.
What if the file is less than 500 bytes?
Some binary files have a situation where you can have a header for the first N bytes of the file which contains some data that is useful for an application but that the library the binary is for doesn't care about. You could easily have 500+ bytes of ASCII in a preamble like this followed by binary data in the following gigabyte.
Should handle exception if the file can't be opened or read, etc.
Fails badly if the file size is less than 500 bytes.
The line char it = (char) thisByte; is conceptually dubious: it mixes the byte and char concepts, i.e. it implicitly assumes that the encoding is one byte = one character (which excludes multi-byte Unicode encodings). In particular, it fails if the file is UTF-16 encoded.
The return inside the loop (slightly bad practice IMO) forgets to close the file.
The first thing I noticed - unrelated to your actual question, but you should be closing your input stream in a finally block to ensure it's always done. Usually this merely handles exceptions, but in your case you won't even close the streams of files when returning false.
Aside from that, why the comparison to ISO control characters? That's not a "binary" file, that's a "file that contains 5 or more control characters". A better way to approach the situation, in my opinion, would be to invert the check: write an isAsciiText function instead which asserts that all the characters in the file (or in the first 500 bytes if you so wish) are in a set of bytes that are known good.
Theoretically, only checking the first few hundred bytes of a file could get you into trouble if it was a composite file of sorts (e.g. text with embedded pictures), but in practice I suspect every such file will have binary header data at the start so you're probably OK.
This would not work with the JDK install packages for Linux or Solaris: they have a shell-script start followed by a big binary data blob.
Why not check the MIME type using some library like jMimeMagic (http://sourceforge.net/projects/jmimemagic/) and decide how to handle the file based on the MIME type?
One could parse and compare against a list of known binary file header bytes, like the one provided here.
The problem is, one needs a sorted list of binary-only headers, and the list might not be complete at all (for example, when reading and parsing binary files contained in some Equinox framework jar). If one needs to identify specific file types, though, this should work.
If you're on Linux, for existing files on the disk, native file command execution should work well:
String command = "file -i [ZIP FILE...]";
Process process = Runtime.getRuntime().exec(command);
...
It will output information on the files:
...: application/zip; charset=binary
which you can filter further with grep, or in Java, depending on whether you simply need an estimation of the files' binary character, or whether you need to find out their MIME types.
If parsing InputStreams, like the content of nested files inside archives, this unfortunately doesn't work, unless you resort to shell-only programs like unzip - if you want to avoid creating temporary unzipped files.
For this, a rough estimation based on examining the first 500 bytes has worked out OK for me so far, as was hinted in the examples above; instead of Character.isWhitespace/isISOControl(char), I used Character.isIdentifierIgnorable(codePoint), assuming UTF-8 default encoding:
private static boolean isBinaryFileHeader(byte[] headerBytes) {
    return new String(headerBytes).codePoints()
            .filter(Character::isIdentifierIgnorable)
            .count() >= 5;
}
public void printNestedZipContent(String zipPath) {
    try (ZipFile zipFile = new ZipFile(zipPath)) {
        int zipHeaderBytesLen = 500;
        zipFile.entries().asIterator().forEachRemaining(entry -> {
            String entryName = entry.getName();
            if (entry.isDirectory()) {
                System.out.println("FOLDER_NAME: " + entryName);
                return;
            }
            // Get content bytes from ZipFile for ZipEntry
            try (InputStream zipEntryStream = new BufferedInputStream(zipFile.getInputStream(entry))) {
                // Read and store the header bytes (may be fewer than requested for small entries)
                byte[] headerBytes = zipEntryStream.readNBytes(zipHeaderBytesLen);
                // Skip entry, if nested binary file
                if (isBinaryFileHeader(headerBytes)) {
                    return;
                }
                // Continue reading the stream, if non-binary
                byte[] zipContentBytes = zipEntryStream.readAllBytes();
                // Join the already-read header bytes and the rest of the content, header first
                byte[] joinedZipEntryContent = Arrays.copyOf(headerBytes, headerBytes.length + zipContentBytes.length);
                System.arraycopy(zipContentBytes, 0, joinedZipEntryContent, headerBytes.length, zipContentBytes.length);
                // Output (default/UTF-8) encoded text file content
                System.out.println(new String(joinedZipEntryContent));
            } catch (IOException e) {
                System.out.println("ERROR getting ZipEntry content: " + entry.getName());
            }
        });
    } catch (IOException e) {
        System.out.println("ERROR opening ZipFile: " + zipPath);
        e.printStackTrace();
    }
}
You ignore what read() returns; what if the file is shorter than 500 bytes?
When you return false, you don't close the file.
When converting byte to char, you assume your file is 7-bit ASCII.

CSV file validation with Java

I'm reading a file line by line, like this:
FileReader myFile = new FileReader(file);
BufferedReader inputFile = new BufferedReader(myFile);
// Read the first line
String currentRecord = inputFile.readLine();
while (currentRecord != null) {
    currentRecord = inputFile.readLine();
}
But if other types of files are uploaded, it will still read their contents. For instance, if the uploaded file is an image, it will output junk characters when read. So my question is: how can I make sure the file is a CSV before reading it?
Checking the extension of the file is kind of lame, since someone can upload a file that is not CSV but has a .csv extension. Thanks in advance.
Determining the MIME type of a file is not easy, especially when ASCII sections can be mixed with binary ones.
Actually, when you look at how a Java mail system determines the MIME type of an email, it involves reading all the bytes in it and applying some "rules".
Check out MimeUtility.java
If the primary type of this datasource is "text" and if all the bytes in its input stream are US-ASCII, then the encoding is "7bit".
If more than half of the bytes are non-US-ASCII, then the encoding is "base64".
If less than half of the bytes are non-US-ASCII, then the encoding is "quoted-printable".
If the primary type of this datasource is not "text", then if all the bytes of its input stream are US-ASCII, the encoding is "7bit".
If there is even one non-US-ASCII character, the encoding is "base64".
#return "7bit", "quoted-printable" or "base64"
As mentioned by mmyers in a deleted comment, JavaMimeType is supposed to do the same thing, but:
it has been dead since 2006
it involves reading the whole content!
For example:
File file = new File("/home/bibi/monfichieratester");
ByteArrayOutputStream byteArrayStream = new ByteArrayOutputStream();
try (InputStream inputStream = new FileInputStream(file)) {
    int readByte;
    while ((readByte = inputStream.read()) != -1) {
        byteArrayStream.write(readByte);
    }
}
byte[] bytes = byteArrayStream.toByteArray();
MagicMatch match = Magic.getMagicMatch(bytes);
String mimetype = match.getMimeType();
So... since you are reading the whole content of the file anyway, you could take advantage of that to determine the type based on that content and your own rules.
Java Mime Magic may be of use. It'll analyse MIME types from files and input streams. I can't vouch for its functionality, however.
This link may provide further info. It provides several different means of determining how to do what you want (or at least something similar).
I would perhaps be tempted to write something specific to your problem domain, e.g. determining the number of comma-separated values per line and rejecting the file if it's not within certain limits. Then split on the commas and parse each entry according to requirements (e.g. are they doubles/floats/valid Strings, and if Strings, what encoding). I think you may have to do this anyway, given that someone may upload a file that starts like a CSV but is corrupted half-way through.
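A minimal sketch of that domain-specific check; the comma delimiter, expected column count, and UTF-8 encoding are assumptions to adjust for your format, and a plain split does not handle quoted fields containing commas:
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

static boolean looksLikeCsv(Path path, int expectedColumns, int linesToCheck) throws IOException {
    try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
        String line;
        int checked = 0;
        while ((line = reader.readLine()) != null && checked < linesToCheck) {
            // limit -1 keeps trailing empty fields, so "a,b," still counts 3 columns
            if (line.split(",", -1).length != expectedColumns) {
                return false;
            }
            checked++;
        }
        return checked > 0; // an empty file proves nothing
    }
}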
