I have an XML file that I read from a URLConnection. The file grows over time; today it is 1.3 MB, and I want to read only the first 100k of it. Then I want to parse the part I have read.
How can I do that?
(From scratch)
int length = 100 * 1024;
byte[] buf = new byte[length];
InputStream in = urlConnection.getInputStream();
// read() may return fewer bytes than requested, so loop until the buffer is full
for (int off = 0, n; off < length && (n = in.read(buf, off, length - off)) != -1; ) off += n;
// ByteArrayInputStream replaces the deprecated StringBufferInputStream
SAXParserFactory.newInstance().newSAXParser().parse(new ByteArrayInputStream(buf), myHandler);
As far as I understand, you're interested not just in 100k of a stream, but in 100k of a stream from which you can extract the data you need. This means taking a flat 100k as proposed by Peter won't work, as it might result in non-well-formed XML.
Instead I'd suggest using a StAX parser, which gives you the ability to read and parse XML directly from the stream and to stop when you've reached (or are near) the 100k limit.
For further information, take a look at the XMLStreamReader interface (and samples around its usage). For example, you could loop until you get to the START_ELEMENT with name "result" and then use the method getTextCharacters(int sourceStart, char[] target, int targetStart, int length), specifying 100k as the buffer size.
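A minimal sketch of that idea, assuming Apache Commons IO's CountingInputStream as the byte-counting wrapper (any wrapper that tracks bytes read would do):

import javax.xml.stream.*;
import org.apache.commons.io.input.CountingInputStream;

CountingInputStream counted = new CountingInputStream(urlConnection.getInputStream());
XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(counted);
// stop pulling events once roughly 100k has been consumed from the stream
while (reader.hasNext() && counted.getByteCount() < 100 * 1024) {
    if (reader.next() == XMLStreamConstants.START_ELEMENT
            && "result".equals(reader.getLocalName())) {
        // extract the data you need here
    }
}
reader.close();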
As you mentioned Android: it currently doesn't have a StAX parser available, but it does have XmlPullParser with similar functionality.
I'm looking for a solution to the following problem: how to count how much space an object will take in a file before actually writing it to that file.
Pseudo-code for what I'm looking for is
if (alreadyMarshalled.size() + toBeMarshalled.size() < 40 KB) {
alreadyMarshalled.marshall(toBeMarshalled);
}
So I could use a counting stream, e.g. Apache Commons IO's CountingOutputStream, but first I would need to know how much space the object would take (tags included), and I have no clue how to include tags and prefixes in that count before checking it against what has already been marshalled. Is there any library that would solve such a situation?
The only way to tell is to actually marshal the XML.
The idea of the CountingOutputStream is sound.
// Marshal into a CountingOutputStream wrapping a NullOutputStream (both from
// Apache Commons IO): the bytes are thrown away, but the byte count is kept.
NullOutputStream nos = new NullOutputStream();
CountingOutputStream cos = new CountingOutputStream(nos);
OutputStreamWriter osw = new OutputStreamWriter(cos);
jaxbMarshaller.marshal(object, osw);
osw.flush(); // push any characters still buffered in the writer through to the counter
long result = cos.getByteCount();
You have to run this twice (once to get the count, again to actually write it out), but it's the only deterministic way to do it, and it won't cost you any real memory.
If you're not worried about memory, then just dump it to a ByteArrayOutputStream, and if you decide to "keep it", you can dump the byte array straight into the file without having to run it through the marshaller again.
In fact, with the ByteArrayOutputStream you don't need the CountingOutputStream at all; you can just check the size of the resulting array when it's done. But that can come at a high memory cost.
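A minimal sketch of that buffer-and-reuse variant; the alreadyWritten counter and the target path are illustrative, and the 40 KB budget comes from the question:

import java.io.ByteArrayOutputStream;
import java.nio.file.*;

ByteArrayOutputStream baos = new ByteArrayOutputStream();
jaxbMarshaller.marshal(object, baos);
byte[] xml = baos.toByteArray();                              // the size is simply xml.length
if (alreadyWritten + xml.length < 40 * 1024) {                // the 40 KB budget
    Files.write(archivePath, xml, StandardOpenOption.APPEND); // reuse the bytes, no second marshal
    alreadyWritten += xml.length;
}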
I am trying to implement a processor for a specific resource archive file format in Java. The format has a Header comprised of a three-char description, a dummy byte, plus a byte indicating the number of files.
Then each file has an entry consisting of a dummy byte, a twelve-char string describing the file name, a dummy byte, and an offset declared in a three-byte array.
What would be the proper class for reading this kind of structure? I have tried RandomAccessFile, but it does not allow reading arrays of data; e.g., I can only read three chars by calling readChar() three times, etc.
Of course I can extend RandomAccessFile to do what I want, but there's got to be a proper out-of-the-box class for this kind of processing, isn't there?
This is my reader for the header in C#:
protected override void ReadHeader()
{
    Header = new string(this.BinaryReader.ReadChars(3));
    byte dummy = this.BinaryReader.ReadByte();
    NFiles = this.BinaryReader.ReadByte();
}
I think you got lucky with your C# code: it relies on the character encoding being set somewhere else, and if that didn't match the number of bytes per character in the file, your code would probably have failed.
The safest way to do this in Java is to strictly read bytes and do the conversion to characters yourself. If you need seek abilities, then RandomAccessFile is indeed your easiest solution, but it should be pointed out that InputStream allows skipping, so if you don't need actual random access, just to skip some of the files, you can certainly use it.
In either case, you should read the bytes from the file per the file specification, and then convert them to characters based on a known encoding. You should never trust a file that was not written by a Java program to contain any Java data types other than byte, and even if it was written by Java, it may well have been converted to raw bytes while writing.
So your code should be something along the lines of:
String header = "";
int nFiles = 0;
RandomAccessFile raFile = new RandomAccessFile("filename", "r");
byte[] buffer = new byte[3];
int numRead = raFile.read(buffer);                       // the three-char description, as raw bytes
header = new String(buffer, StandardCharsets.US_ASCII);  // decode with a known, explicit encoding
int numSkipped = raFile.skipBytes(1);                    // the dummy byte
nFiles = raFile.read(); // one byte, read as an integer between 0 and 255 (-1 at EOF)
Sanity checks (that all 3 bytes were actually read, that 1 byte was skipped, and that nFiles is not -1) and exception handling have been skipped for brevity.
It's more or less the same if you use InputStream.
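For illustration, the sequential version might look like this (DataInputStream merely adds readFully() and friends on top of a plain InputStream; the file name is a placeholder):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.nio.charset.StandardCharsets;

DataInputStream in = new DataInputStream(new FileInputStream("filename"));
byte[] buffer = new byte[3];
in.readFully(buffer);                      // throws EOFException rather than returning a short read
String header = new String(buffer, StandardCharsets.US_ASCII);
in.skipBytes(1);                           // the dummy byte
int nFiles = in.read();                    // 0..255, or -1 at EOF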
I would go with MappedByteBuffer. It allows you to seek arbitrarily, and it also deals efficiently and transparently with files that are too large to fit comfortably in RAM.
This is, to my mind, the best way of reading structured binary data like this from a file.
You can then build your own data structure on top of that, to handle the specific file format.
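A minimal sketch of reading the header this way; the file name is a placeholder and the field layout follows the question's format description:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

try (RandomAccessFile file = new RandomAccessFile("archive.dat", "r");
     FileChannel channel = file.getChannel()) {
    MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    byte[] desc = new byte[3];
    buf.get(desc);                          // three-char description
    String header = new String(desc, StandardCharsets.US_ASCII);
    buf.get();                              // dummy byte
    int nFiles = buf.get() & 0xFF;          // file count, as an unsigned byte
}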
I have a servlet that clients will post XML or JSON data to.
Currently I am reading the posted content using Guava:
String string = CharStreams.toString( new InputStreamReader( inputStream, "UTF-8" ) );
I want to be able to abort the entire operation of reading the posted content if it is larger than n bytes.
Is there a way to do this using Guava, or do I have to implement my own function for it?
I don't see anything that aborts, but you can use ByteStreams#limit(InputStream, long) to set a maximum number of bytes to read. The InputStream returned will simply return -1 on any read(..) that goes over the limit.
If you really want abort behavior, you could write your own InputStream wrapper that throws an exception if you go above some number of bytes read.
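A sketch that combines the two ideas: cap the stream with ByteStreams.limit and treat hitting the cap as the abort condition. The 1 MB limit is illustrative:

import com.google.common.io.ByteStreams;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

long limit = 1024 * 1024;
// ask for one byte more than the limit so "exactly at" and "over" can be told apart
InputStream capped = ByteStreams.limit(inputStream, limit + 1);
byte[] data = ByteStreams.toByteArray(capped);
if (data.length > limit) {
    throw new IOException("Post body exceeds " + limit + " bytes, aborting");
}
String string = new String(data, StandardCharsets.UTF_8);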
How can I get the number of lines (rows) from an InputStream or from a CsvMapper without looping through and counting them?
Below I have an InputStream created from a CSV file.
InputStream content = (... from a resource ...);
CsvMapper mapper = new CsvMapper();
mapper.enable(CsvParser.Feature.WRAP_AS_ARRAY);
MappingIterator<Object[]> it = mapper
        .reader(Object[].class)
        .readValues(content);
Is it possible to do something like
int totalRows = mapper.getTotalRows();
I would like to use this number in the loop to update progress.
while (it.hasNextValue()) {
    // do stuff here
    updateProgressHere(currentRow, totalRows);
}
Obviously, I can loop through and count them once. Then loop through again and process them while updating progress. This is inefficient and slow as some of these InputStreams are huge.
Unless you know the row count ahead of time, it is not possible without looping. You have to read the file in its entirety to know how many lines are in it, and neither InputStream nor CsvMapper has a means of reading ahead and abstracting that for you (they are both stream-oriented interfaces).
None of the interfaces that ObjectReader can operate on support querying the underlying file size (if it's a file) or number of bytes read so far.
One possible option is to create your own custom InputStream that also provides methods for grabbing the total size and number of bytes read so far, e.g. if it is reading from a file, it can expose the underlying File.length() and also track the number of bytes read. This may not be entirely accurate, especially if Jackson buffers far ahead, but it could get you something at least.
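A hedged sketch of such a wrapper; ProgressInputStream is a made-up name, and the total size is assumed to come from File.length():

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class ProgressInputStream extends FilterInputStream {
    private final long totalSize;   // e.g. File.length(), if the source is a file
    private long bytesRead;

    ProgressInputStream(InputStream in, long totalSize) {
        super(in);
        this.totalSize = totalSize;
    }

    @Override public int read() throws IOException {
        int b = super.read();
        if (b != -1) bytesRead++;
        return b;
    }

    @Override public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) bytesRead += n;
        return n;
    }

    // approximate progress in [0, 1]; inexact if the parser buffers far ahead
    double progress() {
        return totalSize <= 0 ? 0.0 : (double) bytesRead / totalSize;
    }
}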
Technically speaking, there are only two ways. Either (as you have seen) loop through and increment a counter, or:
Have the sender transmit the count first and the data afterwards. That way you can evaluate the first bytes of the stream as the count before reading the rest. The precondition, of course, is that the sending application knows the size of the data in advance.
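A minimal sketch of such a length-prefixed exchange, using DataOutputStream/DataInputStream over whatever streams connect the two sides (writeCsv and parseCsv are illustrative stand-ins):

import java.io.DataInputStream;
import java.io.DataOutputStream;

// sender: the row count first, then the payload
DataOutputStream out = new DataOutputStream(senderStream);
out.writeInt(rowCount);
writeCsv(out);

// receiver: read the count before handing the rest to the CSV parser
DataInputStream in = new DataInputStream(receiverStream);
int totalRows = in.readInt();
parseCsv(in, totalRows);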
I have a function in which I am only given a BufferedInputStream and no other information about the file to be read. I unfortunately cannot alter the method definition as it is called by code I don't have access to. I've been using the code below to read the file and place its contents in a String:
public String[] doImport(BufferedInputStream stream) throws IOException, PersistenceException {
int bytesAvail = stream.available();
byte[] bytesRead = new byte[bytesAvail];
stream.read(bytesRead);
stream.close();
String fileContents = new String(bytesRead);
//more code here working with fileContents
}
My problem is that for large files (>2 GB), this code causes the program either to run extremely slowly or to truncate the data, depending on the computer it is executed on. Does anyone have a recommendation for dealing with large files in this situation?
You're assuming that available() returns the size of the file; it does not. It returns the number of bytes available to be read, and that may be any number less than or equal to the size of the file.
Unfortunately, there's no way to do what you want in just one shot without having some other source of information about the length of the file data (i.e., by calling java.io.File.length()). Instead, you have to accumulate data from possibly multiple reads. One way is to use a ByteArrayOutputStream: read into a fixed, finite-size array, then write the data you read into the ByteArrayOutputStream, and pull the byte array out at the end. You'll need to use the three-argument forms of read() and write(), and check the return value of read() so you know exactly how many bytes were read into the buffer on each call.
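A minimal sketch of that accumulation loop (decoding as UTF-8 is an assumption; the question never states an encoding):

import java.io.ByteArrayOutputStream;

ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] buffer = new byte[8192];                    // fixed, finite-size read buffer
int n;
while ((n = stream.read(buffer, 0, buffer.length)) != -1) {
    out.write(buffer, 0, n);                       // write exactly the bytes that were read
}
stream.close();
String fileContents = out.toString("UTF-8");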
I'm not sure why you don't think you can read it line-by-line. BufferedInputStream only describes how the underlying stream is accessed, it doesn't impose any restrictions on how you ultimately read data from it. You can use it just as if it were any other InputStream.
Namely, to read it line-by-line you could do
InputStreamReader streamReader = new InputStreamReader(stream);
BufferedReader lineReader = new BufferedReader(streamReader);
String line = lineReader.readLine();
...
[Edit] This response is to the original wording of the question, which asked specifically for a way to read the input file line-by-line.