Total number of rows in an InputStream (or CsvMapper) in Java

How can I get the number of lines (rows) from an InputStream or from a CsvMapper without looping through and counting them?
Below I have an InputStream created from a CSV file.
InputStream content = (... from a resource ...);
CsvMapper mapper = new CsvMapper();
mapper.enable(CsvParser.Feature.WRAP_AS_ARRAY);
MappingIterator<Object[]> it = mapper
        .reader(Object[].class)
        .readValues(content);
Is it possible to do something like
int totalRows = mapper.getTotalRows();
I would like to use this number in the loop to update progress.
while (it.hasNextValue()) {
    // do stuff here
    updateProgressHere(currentRow, totalRows);
}
Obviously, I can loop through and count them once, then loop through again and process them while updating progress. But that is inefficient and slow, as some of these InputStreams are huge.

Unless you know the row count ahead of time, it is not possible without looping. You have to read the file in its entirety to know how many lines are in it, and neither InputStream nor CsvMapper has a means of reading ahead and abstracting that for you (they are both stream-oriented interfaces).
None of the interfaces that ObjectReader can operate on support querying the underlying file size (if it's a file) or number of bytes read so far.
One possible option is to create your own custom InputStream that also provides methods for grabbing the total size and the number of bytes read so far. If it is reading from a file, for example, it can expose the underlying File.length() and track the number of bytes read. This may not be entirely accurate, especially if Jackson buffers far ahead, but it would at least get you something.
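A minimal sketch of such a wrapper, assuming the source is a file whose length is known up front (ProgressInputStream and progress() are made-up names, not part of any library):
class ProgressInputStream extends FilterInputStream {
    private final long totalBytes; // e.g. File.length() when reading from a file
    private long bytesRead;

    ProgressInputStream(InputStream in, long totalBytes) {
        super(in);
        this.totalBytes = totalBytes;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) bytesRead++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) bytesRead += n;
        return n;
    }

    // rough progress in [0.0, 1.0]; off by whatever Jackson has buffered ahead
    double progress() {
        return totalBytes <= 0 ? 0.0 : (double) bytesRead / totalBytes;
    }
}
You would wrap the stream with this before handing it to readValues and call progress() inside the loop instead of relying on a row count.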

Technically speaking, there are only two ways. Either (as you have seen) loop through and increment a counter, or:
Have the sender write the count first and then the data. This lets you read the count from the first bytes of the stream before processing the rest. The precondition, of course, is that the sending application knows in advance how much data it is going to send.
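A rough sketch of that idea, assuming you control both ends of the transfer (the DataOutputStream/DataInputStream length prefix and the variable names are just one possible convention, not something from the question):
// Sender: write the row count first, then the CSV payload.
try (DataOutputStream out = new DataOutputStream(destinationStream)) {
    out.writeInt(totalRows); // known in advance on the sending side
    out.write(csvBytes);
}

// Receiver: read the count before handing the rest of the stream to Jackson.
DataInputStream in = new DataInputStream(content);
int totalRows = in.readInt(); // now available for progress reporting
MappingIterator<Object[]> it = mapper.reader(Object[].class).readValues(in);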

Related

Java Streams - Turn Stream into Stream of Lists

I have a stream of InputStreams that are all being reduced into a single InputStream with a reduce function.
It looks like this:
InputStream pdfReportStream = dynamoDBItems.parallelStream()
        .map(dynamoDBItem -> new ChonkRequest(dynamoDBItem.getString("s3_key"),
                dynamoDBItem.getString("s3_bucket"), dynamoDBItem.getInt("SK")))
        .sorted(Comparator.comparingInt(ChonkRequest::getIndex))
        .map(storageService::getChonk) // Stream<InputStream>
        .reduce(null, pdfService::mergePdf);
I have been tweaking this for about a day now. Originally I just collected the list of InputStreams and passed it into the merge method, which merged them all together very quickly. The problem was that with too many InputStreams it failed.
So I switched to using reduce and adding one at a time. This works for larger numbers, but it is slower overall and will eventually lead to a timeout with a very large number of InputStreams.
What I would like to do is combine the two previous approaches: pass lists of InputStreams with a maximum size into the mergePdf method and have it keep performing reduce on them, outputting an InputStream each time.
So the pseudocode would be like...
InputStream pdfReportStream = dynamoDBItems.parallelStream()
        .map(dynamoDBItem -> new ChonkRequest(dynamoDBItem.getString("s3_key"),
                dynamoDBItem.getString("s3_bucket"), dynamoDBItem.getInt("SK")))
        .sorted(Comparator.comparingInt(ChonkRequest::getIndex))
        .map(storageService::getChonk) // Stream<InputStream>
        .convert_individual_elements_to_lists_of_elements
        .reduce(null, pdfService::mergePdf);
I'm aware that the reduce method would need a converter if its subtotal is an InputStream while the elements being added are lists of InputStreams. I'm not sure of the best overall approach here.
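One possible sketch of that batching idea (the batch size of 50 and a pdfService.mergePdf(InputStream current, List<InputStream> batch) overload are assumptions, not part of the original code):
List<InputStream> chonks = dynamoDBItems.parallelStream()
        .map(dynamoDBItem -> new ChonkRequest(dynamoDBItem.getString("s3_key"),
                dynamoDBItem.getString("s3_bucket"), dynamoDBItem.getInt("SK")))
        .sorted(Comparator.comparingInt(ChonkRequest::getIndex))
        .map(storageService::getChonk)
        .collect(Collectors.toList());

int batchSize = 50;                 // assumed maximum the merge can handle in one call
InputStream pdfReportStream = null; // running result, like the subtotal in reduce
for (int i = 0; i < chonks.size(); i += batchSize) {
    List<InputStream> batch = chonks.subList(i, Math.min(i + batchSize, chonks.size()));
    pdfReportStream = pdfService.mergePdf(pdfReportStream, batch); // hypothetical overload
}
Collecting first and folding the batches sequentially keeps the sorted order; the parallel stream is only used to build and fetch the chunks.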

JAXB Marshaller count how much space it takes to write object to file before writing it to a file

I'm looking for a solution to a problem:
how to count how much space it will take to write an object to a file before actually writing it.
Pseudocode of what I'm looking for:
if (alreadyMarshalled.size() + toBeMarshalled.size() < 40 KB) {
    alreadyMarshalled.marshall(toBeMarshalled);
}
I could use a counting stream such as Apache Commons IO's CountingOutputStream, but first I would need to know how much space the object would take (tags included), and I have no clue how to include tags and prefixes in that count before comparing it against what has already been marshalled. Is there any library that would solve such a situation?
The only way to tell is to actually marshal the XML.
The idea of the CountingOutputStream is sound.
NullOutputStream nos = new NullOutputStream();            // discards the bytes
CountingOutputStream cos = new CountingOutputStream(nos); // counts them as they pass through
OutputStreamWriter osw = new OutputStreamWriter(cos);
jaxbMarshaller.marshal(object, osw);
osw.flush(); // make sure everything buffered in the writer reaches the counting stream
long result = cos.getByteCount();
You have to run this twice (once to get the count, again to actually write the output); that is the only deterministic way to do it, and it won't cost you any real memory.
If you're not worried about memory, then just dump it to a ByteArrayOutputStream, and if you decide to "keep it", you can write the byte array straight to the file without having to run it through the marshaller again.
In fact, with the ByteArrayOutputStream you don't need the CountingOutputStream at all; you can just check the size of the resulting array when it's done. But that can come at a high memory cost.
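A rough sketch of that variant (the 40 KB budget comes from the question; alreadyWrittenBytes and outputPath are placeholders invented here):
ByteArrayOutputStream baos = new ByteArrayOutputStream();
jaxbMarshaller.marshal(object, baos); // marshal once, into memory
byte[] xml = baos.toByteArray();

if (alreadyWrittenBytes + xml.length < 40 * 1024) { // 40 KB budget from the question
    Files.write(outputPath, xml, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    alreadyWrittenBytes += xml.length; // no second marshalling pass needed
}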

Go back 'n' lines in file using Stream.lines

I need to build an application that scans through a large number of files. These files contain blocks with some data about a session, in which each line has a different value, e.g.: "=ID: 39487".
At that point I have that line, but the problem I now face is that I need the value n lines above that ID. I was thinking about an Iterator, but it only has forward methods. I also thought about saving the results in a List, but that defeats the purpose of using a Stream, and some files are huge, so that would cause memory problems.
I was wondering if something like this is possible using the Stream API (Files)? Or perhaps a better question: is there a better way to approach this?
Stream<String> lines = Files.lines(Paths.get(file.getName()));
Iterator<String> search = lines.iterator();
You can't arbitrarily read backwards and forwards through the file with the same reader (no matter if you're using streams, iterators, or a plain BufferedReader.)
If you need:
m lines before a given line
n lines after the given line
You don't know the value of m and n in advance, until you reach that line
...then you essentially have three options:
Read the whole file once, keep it in memory, and then your task is trivial (but this uses the most memory.)
Read the whole file once, mark the line numbers that you need, then do a second pass where you extract the lines you require.
Read the whole file once, storing some form of metadata about line lengths as you go, then use a RandomAccessFile to extract the specific bits you need without having to read the whole file again.
I'd suggest that, given the files are huge, the second option here is probably the most realistic. The third will probably give you better performance, but will require much more development effort.
As an alternative, if you can guarantee that both n and m are below a certain value, and that value is reasonably small, you could also just keep a fixed number of lines in a buffer as you're processing the file and read from that buffer when you need to look at lines "backwards".
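A minimal sketch of that buffering approach, assuming the look-back distance n is known and small (the ID literal is the one from the question):
int n = 5;                                  // assumed maximum look-back distance
Deque<String> window = new ArrayDeque<>(n); // holds the last n lines seen so far

try (Stream<String> lines = Files.lines(Paths.get(file.getName()))) {
    Iterator<String> it = lines.iterator();
    while (it.hasNext()) {
        String line = it.next();
        if (line.contains("=ID: 39487") && window.size() == n) {
            String valueLine = window.peekFirst(); // the line n positions above
            // ... extract the value you need from valueLine ...
        }
        if (window.size() == n) {
            window.removeFirst(); // drop the oldest line
        }
        window.addLast(line);
    }
}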
Try my library, abacus-util:
try (Reader reader = new FileReader(yourFile)) {
    StreamEx.of(reader)
            .sliding(n, n, ArrayList::new)
            .filter(l -> l.get(l.size() - 1).contains("=ID: 39487"))
            ./* then do your work */
}
It doesn't matter how big your file is, as long as n is a small number, not millions.

Memory required by JVM for creating CSV files and zip it on the fly

I am creating two CSV files using StringBuffers and byte arrays.
I use ZipOutputStream to generate the zip file. Each CSV file will have 20K records with 14 columns. The records are fetched from the DB and stored in an ArrayList; I have to iterate the list, build a StringBuffer, and convert the StringBuffer to a byte array to write it to the zip entry.
I want to know the memory required by the JVM to do the entire process, starting from storing the records in the ArrayList.
I have provided a code snippet below.
StringBuffer responseBuffer = new StringBuffer();
String response = new String();
response = "Hello, sdksad, sfksdfjk, World, Date, ask, askdl, sdkldfkl, skldkl, sdfklklgf, sdlksldklk, dfkjsk, dsfjksj, dsjfkj, sdfjkdsfj\n";
for (int i = 0; i < 20000; i++) {
    responseBuffer.append(response);
}
response = responseBuffer.toString();
byte[] responseArray = response.getBytes();
res.setContentType("application/zip");
ZipOutputStream zout = new ZipOutputStream(res.getOutputStream());
ZipEntry parentEntry = new ZipEntry("parent.csv");
zout.putNextEntry(parentEntry);
zout.write(responseArray);
zout.closeEntry();
ZipEntry childEntry = new ZipEntry("child.csv");
zout.putNextEntry(childEntry);
zout.write(responseArray);
zout.closeEntry();
zout.close();
Please help me with this. Thanks in advance.
I'm guessing you've already tried counting how many bytes will be allocated to the StringBuffer and the byte array. But the problem is that you can't really know how much memory your app will use unless you have upper bounds on the sizes of the CSV records. If you want your software to be stable, robust and scalable, I'm afraid you're asking the wrong question: you should strive to perform the task using a fixed amount of memory, which in your case seems easily possible.
The key is that in your case the processing is entirely FIFO: you read records from the database and then write them (in the same order) into a FIFO stream (an OutputStream in this case). Even zip compression is stream-based and uses a fixed amount of memory internally, so you're totally safe there.
Instead of buffering the entire input in a huge String, then converting it to a huge byte array, then writing it to the output stream, you should read each response element separately from the database (or in chunks of a fixed size, say 100 records at a time) and write it to the output stream. Something like:
res.setContentType("application/zip");
ZipOutputStream zout = new ZipOutputStream(res.getOutputStream());
ZipEntry parentEntry = new ZipEntry("parent.csv");
zout.putNextEntry(parentEntry);
while (... fetch entries ...)
zout.write(...data...)
zout.closeEntry();
The advantage of this approach is that, because it works with small chunks, you can easily estimate their sizes and allocate enough memory for your JVM so it never crashes. And you know it will still work if your CSV files grow far beyond 20K lines in the future.
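A slightly more concrete sketch of that loop; Row, fetchNextChunk and toCsvLine are hypothetical placeholders for the actual DB access and row formatting:
res.setContentType("application/zip");
try (ZipOutputStream zout = new ZipOutputStream(res.getOutputStream())) {
    zout.putNextEntry(new ZipEntry("parent.csv"));
    List<Row> chunk;
    while (!(chunk = fetchNextChunk(100)).isEmpty()) { // hypothetical: 100 rows per query
        for (Row row : chunk) {
            zout.write(toCsvLine(row).getBytes(StandardCharsets.UTF_8)); // hypothetical formatter
        }
    }
    zout.closeEntry();
}
Only one chunk of rows plus the zip library's fixed internal buffers are in memory at any time, regardless of how many lines the CSV ends up with.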
To analyze the memory usage you can use a Profiler.
JProfiler or YourKit are very good at doing this.
VisualVM is also good to an extent.
You can measure the memory with the MemoryTestbench described at
http://www.javaspecialists.eu/archive/Issue029.html
The article describes what to do. It is simple, accurate to 1 byte, and I use it often.
It can even be run from a JUnit test case, which makes it very useful, whereas a profiler cannot be run from a JUnit test case.
With that approach you can even measure the memory size of a single Integer object.
But with zip there is one special thing: ZipStream uses a native C library, so the MemoryTestbench may not measure that memory, only the Java part.
You should try both variants, the MemoryTestbench and a profiler (e.g. JProfiler).

Read first part of inputstream in Java

I have an XML file that I read from a URLConnection. The file grows over time; today it is 1.3 MB. I want to read only the first 100k of the file and then parse the part I have read.
How can I do that?
(From scratch)
int length = 100 * 1024;
byte[] buf = new byte[length];
// readNBytes (Java 9+) keeps reading until the buffer is full or the stream ends,
// unlike a single read(), which may return fewer bytes than requested
int read = urlConnection.getInputStream().readNBytes(buf, 0, length);
// note: parsing a truncated document will usually fail near the end; see the next answer
SAXParserFactory.newInstance().newSAXParser()
        .parse(new ByteArrayInputStream(buf, 0, read), myHandler);
As far as I understand, you're interested not just in the first 100k of a stream but in 100k from which you can extract the data you need. This means taking 100k as proposed by Peter won't work, as it might result in non-well-formed XML.
Instead I'd suggest using a StAX parser, which gives you the ability to read and parse XML directly from the stream and to stop when you've reached (or are near) the 100k limit.
For further information take a look at the XMLStreamReader interface (and samples around its usage). For example, you could loop until you get to the START_ELEMENT with name "result" and then use the method getTextCharacters(int sourceStart, char[] target, int targetStart, int length), specifying 100k as the buffer size.
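A rough sketch of that approach on platforms where StAX is available, using Commons IO's CountingInputStream to know roughly how far the parser has read ("result" is just the example element name from above, and the parser's internal buffering makes the byte count approximate):
CountingInputStream counted = new CountingInputStream(urlConnection.getInputStream());
XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(counted);

long limit = 100 * 1024;
while (reader.hasNext() && counted.getByteCount() < limit) {
    if (reader.next() == XMLStreamConstants.START_ELEMENT
            && "result".equals(reader.getLocalName())) {
        String value = reader.getElementText(); // consumes this element's text content
        // ... process value ...
    }
}
reader.close();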
As you mentioned Android: it currently doesn't have a StAX parser available, but it does have XmlPullParser with similar functionality.
