I have a stream of InputStreams that are all being reduced into a single InputStream with a reduce function.
It looks like this:
InputStream pdfReportStream = dynamoDBItems.parallelStream()
.map(dynamoDBItem -> new ChonkRequest(dynamoDBItem.getString("s3_key"),
dynamoDBItem.getString("s3_bucket"), dynamoDBItem.getInt("SK")))
.sorted(Comparator.comparingInt(ChonkRequest::getIndex))
.map(storageService::getChonk) // Stream<InputStream>
.reduce(null, pdfService::mergePdf);
I have been tweaking this for about a day now. Originally I just collected the InputStreams into a list and passed it into the merge method, and it merged them all together very quickly. The problem was that with too many InputStreams it failed.
So I switched to using reduce and adding one at a time. This works for larger numbers, but it is slower overall and will eventually lead to a timeout on a very large number of InputStreams.
What I would like to do is use a combination of both previous approaches: pass lists of InputStreams with a maximum size into the mergePdf method, and have it keep performing the reduce on them, outputting an InputStream each time.
So the pseudocode would be like...
InputStream pdfReportStream = dynamoDBItems.parallelStream()
.map(dynamoDBItem -> new ChonkRequest(dynamoDBItem.getString("s3_key"),
dynamoDBItem.getString("s3_bucket"), dynamoDBItem.getInt("SK")))
.sorted(Comparator.comparingInt(ChonkRequest::getIndex))
.map(storageService::getChonk) // Stream<InputStream>
.convert_individual_elements_to_lists_of_elements
.reduce(null, pdfService::mergePdf);
I'm aware that the reduce method would need a combiner if its subtotal is an InputStream but the elements being added are lists of InputStreams. I'm not sure of the best overall approach here.
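One option that combines the two approaches is to drop back to a plain list once the streams are sorted, then merge in fixed-size batches, carrying the running result into the next batch. This is only a sketch: it assumes a hypothetical pdfService.mergePdf(List<InputStream>) overload that merges one batch into a single InputStream, and the batch size of 50 is an arbitrary placeholder.
List<InputStream> chonks = dynamoDBItems.parallelStream()
        .map(item -> new ChonkRequest(item.getString("s3_key"),
                item.getString("s3_bucket"), item.getInt("SK")))
        .sorted(Comparator.comparingInt(ChonkRequest::getIndex))
        .map(storageService::getChonk)
        .collect(Collectors.toList());

int batchSize = 50; // placeholder: tune to whatever the merge can handle at once
InputStream merged = null;
for (int i = 0; i < chonks.size(); i += batchSize) {
    List<InputStream> batch = new ArrayList<>(
            chonks.subList(i, Math.min(i + batchSize, chonks.size())));
    if (merged != null) {
        batch.add(0, merged); // carry the running result into the next batch
    }
    merged = pdfService.mergePdf(batch); // hypothetical list-taking overload
}
InputStream pdfReportStream = merged;
This keeps each individual merge call small (which is what failed in the first approach) while still merging many streams per call (which is what made the one-at-a-time reduce slow).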
I'm looking for a solution to the following problem:
how to count how much space an object will take when written to a file, before actually writing it to the file.
The pseudocode for what I'm looking for is:
if (alreadyMarshalled.size() + toBeMarshalled.size() < 40 KB) {
alreadyMarshalled.marshall(toBeMarshalled);
}
So I could use a counting stream, e.g. Apache Commons IO's CountingOutputStream, but
first I would need to know how much space the object would take (tags included),
and I have no clue how to include tags and prefixes in that count before
comparing it against what has already been marshalled. Is there any library that would
solve such a situation?
The only way to tell is to actually marshal the XML.
The idea of the CountingOutputStream is sound.
// NullOutputStream and CountingOutputStream come from Apache Commons IO
// (org.apache.commons.io.output); the marshalled bytes are discarded and
// only the byte count is kept.
NullOutputStream nos = new NullOutputStream();
CountingOutputStream cos = new CountingOutputStream(nos);
OutputStreamWriter osw = new OutputStreamWriter(cos);
jaxbMarshaller.marshal(object, osw);
osw.flush(); // make sure any bytes buffered by the writer reach the counter
long result = cos.getByteCount();
You have to run this twice (once to get the count, and again to actually write it out); that's the only deterministic way to do it, and it won't cost you any real memory.
If you're not worried about memory, then just dump it to a ByteArrayOutputStream, and if you decide to "keep it", you can write the byte array straight to the file without having to run it through the marshaller again.
In fact, with the ByteArrayOutputStream you don't need the CountingOutputStream at all; you can just check the size of the resulting array when it's done. But that can come at a high memory cost.
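A minimal sketch of that in-memory variant, assuming jaxbMarshaller and object exist as in the snippet above, using the 40 KB budget from the question; fileOutputStream and the running counter alreadyWrittenBytes are placeholders:
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
jaxbMarshaller.marshal(object, buffer);   // marshal once, into memory
byte[] xml = buffer.toByteArray();

if (alreadyWrittenBytes + xml.length < 40 * 1024) {   // 40 KB budget
    fileOutputStream.write(xml);   // reuse the bytes, no second marshalling pass
    alreadyWrittenBytes += xml.length;
}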
I need to build an application which scans through a large number of files. These files contain blocks with some data about sessions, in which each line has a different value, e.g. "=ID: 39487".
At that point I have that line, but the problem I now face is that I need the value n lines above that ID. I was thinking about an Iterator, but it only has forward methods. I also thought about saving the results in a List, but that defeats the purpose of using a Stream, and some files are huge, so that would cause memory problems.
I was wondering if something like this is possible using the Stream API (Files)? Or, perhaps a better question: is there a better way to approach this?
Stream<String> lines = Files.lines(Paths.get(file.getName()));
Iterator<String> search = lines.iterator();
You can't arbitrarily read backwards and forwards through the file with the same reader (no matter if you're using streams, iterators, or a plain BufferedReader.)
If you need:
m lines before a given line
n lines after the given line
You don't know the value of m and n in advance, until you reach that line
...then you essentially have three options:
Read the whole file once, keep it in memory, and then your task is trivial (but this uses the most memory.)
Read the whole file once, mark the line numbers that you need, then do a second pass where you extract the lines you require.
Read the whole file once, storing some form of metadata about line lengths as you go, then use a RandomAccessFile to extract the specific bits you need without having to read the whole file again.
I'd suggest given the files are huge, the second option here is probably the most realistic. The third will probably give you better performance, but will require much more in the way of development effort.
As an alternative if you can guarantee that both n and m are below a certain value, and that value is a reasonable size - you could also just keep a certain number of lines in a buffer as you're processing the file, and read through that buffer when you need to read lines "backwards".
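A minimal sketch of that buffered alternative, assuming an upper bound MAX_LOOKBACK on how far above the "=ID:" line you ever need to look (the bound and the file name are placeholders):
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Deque;

public class LookbackScanner {
    private static final int MAX_LOOKBACK = 100; // assumed upper bound on m

    public static void main(String[] args) throws IOException {
        Deque<String> lookback = new ArrayDeque<>(MAX_LOOKBACK);
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("sessions.log"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("=ID: 39487")) {
                    // lookback now holds up to MAX_LOOKBACK lines that appeared
                    // immediately before this one, oldest first
                    lookback.forEach(System.out::println);
                }
                if (lookback.size() == MAX_LOOKBACK) {
                    lookback.removeFirst(); // drop the oldest buffered line
                }
                lookback.addLast(line);
            }
        }
    }
}
To pick out the line exactly n positions above the match, copy the deque into a list and index it from the end.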
Try my library, abacus-util:
try(Reader reader = new FileReader(yourFile)) {
StreamEx.of(reader)
.sliding(n, n, ArrayList::new)
.filter(l -> l.get(l.size() - 1).contains("=ID: 39487"))
./* then do your work */
}
It doesn't matter how big your file is, as long as n is a small number (not millions).
How can I get the number of lines (rows) from an InputStream or from a CsvMapper without looping through and counting them?
Below I have an InputStream created from a CSV file.
InputStream content = (... from a resource ...);
CsvMapper mapper = new CsvMapper();
mapper.enable(CsvParser.Feature.WRAP_AS_ARRAY);
MappingIterator<Object[]> it = mapper
.reader(Object[].class)
.readValues(content);
Is it possible to do something like
int totalRows = mapper.getTotalRows();
I would like to use this number in the loop to update progress.
while (it.hasNextValue()){
//do stuff here
updateProgressHere(currentRow, totalRows);
}
Obviously, I can loop through and count them once. Then loop through again and process them while updating progress. This is inefficient and slow as some of these InputStreams are huge.
Unless you know the row count ahead of time, it is not possible without looping. You have to read that file in its entirety to know how many lines are in it, and neither InputStream nor CsvMapper have a means of reading ahead and abstracting that for you (they are both stream oriented interfaces).
None of the interfaces that ObjectReader can operate on support querying the underlying file size (if it's a file) or number of bytes read so far.
One possible option is to create your own custom InputStream that also provides methods for grabbing the total size and number of bytes read so far, e.g. if it is reading from a file, it can expose the underlying File.length() and also track the number of bytes read. This may not be entirely accurate, especially if Jackson buffers far ahead, but it could get you something at least.
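A rough sketch of such a wrapper, assuming the content really does come from a file (or any source whose total length you can get up front); the class name and the progress() method are placeholders:
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ProgressInputStream extends FilterInputStream {
    private final long totalBytes;
    private long bytesRead;

    public ProgressInputStream(InputStream in, long totalBytes) {
        super(in);
        this.totalBytes = totalBytes;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            bytesRead++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    /** Approximate progress in [0.0, 1.0], based on bytes, not rows. */
    public double progress() {
        return totalBytes == 0 ? 0.0 : (double) bytesRead / totalBytes;
    }
}
You would wrap the original stream (e.g. new ProgressInputStream(content, file.length()), where file is assumed to be the underlying File), hand it to the CsvMapper, and poll progress() inside the read loop. Because Jackson buffers ahead, treat the value as an estimate, not an exact row count.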
Technically speaking, there are only two ways. Either (as you have seen) loop through and increment a counter, or:
Have the sender transmit the count first and then the data. That way you can read the count from the first bytes of the stream before parsing the rest. The precondition, of course, is that the sending application knows the size of the data in advance.
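A hedged sketch of that second option, assuming both sides agree on this framing; outputStream, rowCount and csvBytes on the sending side are placeholders:
// sender: write the row count as a long, then the CSV payload
DataOutputStream out = new DataOutputStream(outputStream);
out.writeLong(rowCount);
out.write(csvBytes);
out.flush();

// receiver: read the count before handing the rest of the stream to Jackson
DataInputStream in = new DataInputStream(content);
long totalRows = in.readLong();
MappingIterator<Object[]> it = mapper.reader(Object[].class).readValues(in);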
I am creating two CSV files using String buffers and byte arrays.
I use ZipOutputStream to generate the zip file. Each CSV file will have 20K records with 14 columns. The records are fetched from the DB and stored in an ArrayList. I have to iterate the list, build a StringBuffer, and convert the StringBuffer to a byte array to write it to the zip entry.
I want to know the memory required by JVM to do the entire process starting from storing the records in the ArrayList.
I have provided the code snippet below.
StringBuffer responseBuffer = new StringBuffer();
String response = new String();
response = "Hello, sdksad, sfksdfjk, World, Date, ask, askdl, sdkldfkl, skldkl, sdfklklgf, sdlksldklk, dfkjsk, dsfjksj, dsjfkj, sdfjkdsfj\n";
for (int i = 0; i < 20000; i++) {
    responseBuffer.append(response);
}
response = responseBuffer.toString();
byte[] responseArray = response.getBytes();
res.setContentType("application/zip");
ZipOutputStream zout = new ZipOutputStream(res.getOutputStream());
ZipEntry parentEntry = new ZipEntry("parent.csv");
zout.putNextEntry(parentEntry);
zout.write(responseArray);
zout.closeEntry();
ZipEntry childEntry = new ZipEntry("child.csv");
zout.putNextEntry(childEntry);
zout.write(responseArray);
zout.closeEntry();
zout.close();
Please help me with this. Thanks in advance.
I'm guessing you've already tried counting how many bytes will be allocated to the StringBuffer and the byte array. But the problem is that you can't really know how much memory your app will use unless you have upper bounds on the sizes of the CSV records. If you want your software to be stable, robust and scalable, I'm afraid you're asking the wrong question: you should strive to perform the task using a fixed amount of memory, which in your case seems easily possible.
The key is, that in your case the processing is entirely FIFO - you read records from the database, and then write them (in the same order) into a FIFO stream (OutputStream in that case). Even zip compression is stream-based, and uses a fixed amount of memory internally, so you're totally safe there.
Instead of buffering the entire input in a huge String, then converting it to a huge byte array, then writing it to the output stream - you should read each response element separately from the database (or chunks of fixed size, say 100 records at a time), and write it to the output stream. Something like
res.setContentType("application/zip");
ZipOutputStream zout = new ZipOutputStream(res.getOutputStream());
ZipEntry parentEntry = new ZipEntry("parent.csv");
zout.putNextEntry(parentEntry);
while (... fetch entries ...) {
    zout.write(... data ...);
}
zout.closeEntry();
The advantage of this approach is that because it works with small chunks you can easily estimate their sizes, and allocate enough memory for your JVM so it never crashes. And you know it will still work if your CSV files become much more than 20K lines in the future.
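A slightly fuller sketch of that chunked approach, assuming a hypothetical recordDao.fetchPage(offset, pageSize) that returns up to pageSize rows already formatted as CSV lines (the DAO, its method, and the page size are all placeholders; StandardCharsets is java.nio.charset):
res.setContentType("application/zip");
try (ZipOutputStream zout = new ZipOutputStream(res.getOutputStream())) {
    zout.putNextEntry(new ZipEntry("parent.csv"));
    int pageSize = 100;   // fixed chunk size keeps memory use bounded
    int offset = 0;
    List<String> page;
    while (!(page = recordDao.fetchPage(offset, pageSize)).isEmpty()) {
        for (String csvLine : page) {
            zout.write(csvLine.getBytes(StandardCharsets.UTF_8));
        }
        offset += pageSize;
    }
    zout.closeEntry();
}
The same pattern would be repeated for the child.csv entry.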
To analyze the memory usage you can use a Profiler.
JProfiler or YourKit is very good at doing this.
VisualVM is also good to an extent.
You can measure the memory with the MemoryTestbench described here:
http://www.javaspecialists.eu/archive/Issue029.html
The article describes what to do. It is simple, accurate to 1 byte, and I use it often.
It can even be run from a JUnit test case, which makes it very useful, whereas a profiler cannot be run from a JUnit test case.
With that approach, you can even measure the memory size of a single Integer object.
But with zip there is one special thing: the zip stream uses a native C library, so the MemoryTestbench may not measure that memory, only the Java part.
You should try both variants: the MemoryTestbench and a profiler (e.g. JProfiler).
I am trying to build a map from the contents of a file. My code is as below:
System.out.println("begin to build the sns map....");
String basePath = PropertyReader.getProp("oldbasepath");
String pathname = basePath + "\\user_sns.txt";
FileReader fr;
Map<Integer, List<Integer>> snsMap =
new HashMap<Integer, List<Integer>>(2000000);
try {
fr = new FileReader(pathname);
BufferedReader br = new BufferedReader(fr);
String line;
int i = 1;
while ((line = br.readLine()) != null) {
System.out.println("line number: " + i);
i++;
String[] strs = line.split("\t");
int key = Integer.parseInt(strs[0]);
int value = Integer.parseInt(strs[1]);
List<Integer> list = snsMap.get(key);
//if the follower is not in the map
if(snsMap.get(key) == null)
list = new LinkedList<Integer>();
list.add(value);
snsMap.put(key, list);
System.out.println("map size: " + snsMap.size());
}
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("finish building the sns map....");
return snsMap;
The program is very fast at first but gets much slower when the printed information looks like:
map size: 1138338
line number: 30923602
map size: 1138338
line number: 30923603
....
I tried to find the reason using the two System.out.println() calls, to judge the performance of the BufferedReader and the HashMap instead of using a Java profiler.
Sometimes it takes a while to print the map size after the line number, and sometimes it takes a while to print the line number after the map size. My question is: what makes my program slow, the BufferedReader on a big file or the HashMap for a big map?
If you are testing this from inside Eclipse, you should be aware of the huge performance penalty of writing to stdout/stderr, due to Eclipse capturing that output in the Console view. Printing inside a tight loop is always a performance issue, even outside of Eclipse.
But, if what you are complaining about is the slowdown experienced after processing 30 million lines, then I bet it's a memory issue. First it slows down due to intense GC'ing and then it breaks with OutOfMemoryError.
You will have to check your program with some profiling tools to understand why it is slow.
In general, file access is much slower than in-memory operations (unless you are constrained on memory and doing excessive GC), so the guess would be that reading the file is the slower part here.
Until you have profiled, you will not know what is slow and what isn't.
Most likely, the System.out will show up as being the bottleneck, and you'll then have to profile without them again. System.out is the worst thing you can do for finding performance bottlenecks, because in doing so you usually add an even worse bottleneck.
An obvious optimization for your code is to move the line
snsMap.put(key, list);
into the if statement. You only need to put when you have just created a new list; otherwise, the put just replaces the current value with itself (see the sketch below).
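If you are on Java 8 or later, computeIfAbsent expresses the same "create the list only when the key is new" idea in one call; this is just an equivalent sketch of the suggested change, not a separate optimization:
List<Integer> list = snsMap.computeIfAbsent(key, k -> new LinkedList<>());
list.add(value);   // no snsMap.put needed afterwards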
The cost associated with Integer objects in Java (and in particular the use of Integers in the Java Collections API) is largely a memory (and thus garbage collection!) issue. You can sometimes get significant gains by using primitive collections such as GNU Trove, depending on how well you can adjust your code to use them efficiently. Most of the gains of Trove are in memory usage. Definitely try rewriting your code to use TIntArrayList and TIntObjectMap from GNU Trove. I'd avoid linked lists, too, in particular for primitive types.
Roughly estimated, a HashMap<Integer, List<Integer>> needs at least 3*16 bytes per entry. The doubly linked list again needs at least 2*16 bytes per entry stored. 1m keys + 30m values ~ 1 GB. No overhead included yet. With GNU trove TIntObjectHash<TIntArrayList> that should be 4+4+16 bytes per key and 4 bytes per value, so 144 MB. The overhead is probably similar for both.
The reason that Trove uses less memory is because the types are specialized for primitive values such as int. They will store the int values directly, thus using 4 bytes to store each.
A Java collections HashMap consists of many objects. It roughly looks like this: there are Entry objects that point to a key and a value object each. These must be objects, because of the way generics are handled in Java. In your case, the key will be an Integer object, which uses 16 bytes (4 bytes mark, 4 bytes type, 4 bytes actual int value, 4 bytes padding) AFAIK. These are all 32 bit system estimates. So a single entry in the HashMap will probably need some 16 (entry) + 16 (Integer key) + 32 (yet empty LinkedList) bytes of memory that all need to be considered for garbage collection.
If you have lots of Integer objects, it just will take 4 times as much memory as if you could store everything using int primitives. This is the cost you pay for the clean OOP principles realized in Java.
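A hedged sketch of what the inner loop might look like with Trove 3, assuming gnu.trove.map.hash.TIntObjectHashMap and gnu.trove.list.array.TIntArrayList are on the classpath:
// keys stay primitive ints and each value list stores ints directly
TIntObjectHashMap<TIntArrayList> snsMap = new TIntObjectHashMap<>(2000000);

TIntArrayList followers = snsMap.get(key);
if (followers == null) {
    followers = new TIntArrayList();
    snsMap.put(key, followers); // put only when the key is new
}
followers.add(value);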
The best way is to run your program with a profiler (for example, JProfiler) and see which parts are slow. Debug output, for example, can also slow your program down.
HashMap is not slow; in fact it is among the fastest of the map implementations. Hashtable is a synchronized (thread-safe) map and can be slow at times.
Important note: close the BufferedReader and the FileReader after you read the data; this might help.
E.g. br.close() (which also closes the underlying FileReader).
Please check your system processes in the task manager; there may be too many processes running in the background. Sometimes Eclipse is really resource-heavy, so try running the program from the console instead.