Cap'n Proto - Finding Message Size in Java


I am using a TCP Client/Server to send Cap'n Proto messages from C++ to Java.
Sometimes the receiving buffer may be overfilled or underfilled and to handle these cases we need to know the message size.
When I check the size of the buffer in Java I get 208 bytes, however calling
MyModel.MyMessage.STRUCT_SIZE.total()
returns 4 (not sure what unit of measure is being used here).
I notice that 4 divides into 208, 52 times. But I don't know of a significant conversion factor using 52.
How do I check the message size in Java?

MyMessage.STRUCT_SIZE represents the constant size of that struct itself (measured in 8-byte words), but if the struct contains non-trivial fields (like Text, Data, List, or other structs) then those take up space too, and the amount of space they take is not constant (e.g. Text will take space according to how long the string is).
Generally you should try to let Cap'n Proto directly write to / read from the appropriate ByteChannels, so that you don't have to keep track of sizes yourself. However, if you really must compute the size of a message ahead of time, you could do so with something like:
ByteBuffer[] segments = message.getSegmentsForOutput();
int total = (segments.length / 2 + 1) * 8; // segment table
for (ByteBuffer segment: segments) {
total += segment.remaining();
}
// now `total` is the total number of bytes that will be
// written when the message is serialized.
On the C++ side, you can use capnp::computeSerializedSizeInWords() from serialize.h (and multiply by 8).
But again, you really should structure your code to avoid this, by using the methods of org.capnproto.Serialize with streaming I/O.
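If you do have to handle partially filled buffers by hand on the receiving side, the size can also be read back from the segment table at the start of the message. Here is a minimal sketch, assuming the sender uses the standard unpacked stream framing (a 4-byte little-endian segment count minus one, then one 4-byte size in words per segment, padded to an 8-byte boundary). The class and method names are made up for illustration and are not part of the Cap'n Proto Java runtime.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of org.capnproto.
final class CapnpFraming {
    /**
     * Returns the total size in bytes (segment table plus all segments) of the
     * framed message starting at buf's current position, or -1 if buf does not
     * yet contain the complete segment table. buf's position is left unchanged.
     */
    static long framedMessageBytes(ByteBuffer buf) {
        // duplicate() resets the byte order, so set it explicitly.
        ByteBuffer b = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        if (b.remaining() < 4) {
            return -1;
        }
        int start = b.position();
        // First word half: number of segments minus one (unsigned 32-bit).
        long segmentCount = (b.getInt(start) & 0xFFFFFFFFL) + 1;
        // Table: 4 bytes for the count plus 4 bytes per segment, padded to 8 bytes.
        long tableBytes = (segmentCount / 2 + 1) * 8;
        if (b.remaining() < tableBytes) {
            return -1;
        }
        long total = tableBytes;
        for (int i = 0; i < segmentCount; i++) {
            // Each segment size is given in 8-byte words.
            long words = b.getInt(start + 4 + 4 * i) & 0xFFFFFFFFL;
            total += words * 8;
        }
        return total;
    }
}
```

The receiver can keep appending to its buffer until at least framedMessageBytes(buf) bytes are available and then hand exactly that many bytes to the deserializer. As an illustration of the arithmetic, a single-segment message of 25 words comes to 8 + 25 * 8 = 208 bytes.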


What is an overhead for creating Java objects from lines of csv file

The code reads lines of a CSV file like:
Stream<String> strings = Files.lines(Paths.get(filePath));
then it maps each line in the mapper:
String[] tokens = line.split(","); // String.split returns an array, not a List
return new UserModel(tokens[0], tokens[1], tokens[2], tokens[3]);
and finally collects it:
Set<UserModel> current = currentStream.collect(toSet());
File size is ~500MB
I've connected to the server using jconsole and see that heap size grew from 200MB to 1.8GB while processing.
I can't understand where this x3 memory usage came from - I expected something like 500MB spike or so?
My first impression was it's because there is no throttling and garbage collector simply doesn't have enough time for cleanup.
But I've tried using Guava's RateLimiter to give the garbage collector time to do its job, and the result is the same.

Tom Hawtin made good points - I just wanna expand on them and provide a bit more details.
Java Strings take at least 40 bytes of memory (that's for empty string) due to java object header (see later) overhead and an internal byte array.
That means the minimal size for non-empty string (1 or more characters) is 48 bytes.
Nowadays (since Java 9), the JVM uses Compact Strings, which means that ASCII-only strings occupy only 1 byte per character; before that it was 2 bytes per char minimum.
That means if your file contains characters beyond the ASCII set, memory usage can grow significantly.
Streams also have more overhead compared to plain iteration with arrays/lists (see here Java 8 stream objects significant memory usage)
I guess your UserModel object adds at least 32 bytes overhead on top of each line, because:
the minimum size of java object is 16 bytes where first 12 bytes are the JVM "overhead": object's class reference (4 bytes when Compressed Oops are used) + the Mark word (used for identity hash code, Biased locking, garbage collectors)
and the next 4 bytes are used by the reference to the first "token"
and the next 12 bytes are used by 3 references to the second, third and fourth "token"
and the last 4 bytes are required due to Java Object Alignment at 8-byte boundaries (on 64-bit architectures)
That being said, it's not clear whether you even use all the data that you read from the file - you parse 4 tokens from a line but maybe there are more?
Moreover, you didn't mention how exactly the heap size "grew": was it the committed size or the used size of the heap? The used portion is what is actually occupied by objects (live ones plus garbage that has not been collected yet); the committed portion is what the JVM has obtained from the operating system for the heap, whether or not objects currently occupy it; used <= committed.
You'd have to take a heap snapshot to find out how much memory the result set of UserModel actually occupies, and it would be interesting to compare that to the size of the file.
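The numbers above can be turned into a rough back-of-the-envelope estimate. This is only a sketch under stated assumptions: a 64-bit HotSpot JVM with Compressed Oops and Compact Strings, ASCII-only tokens, a 24-byte String object and a 16-byte array header. The class and method names are invented for illustration, and a heap dump remains the only reliable measurement.

```java
// Hypothetical estimator; the constants are assumptions about a typical
// 64-bit HotSpot JVM with Compressed Oops and Compact Strings.
final class RowFootprint {
    static long align8(long n) {
        return (n + 7) & ~7L; // objects are padded to 8-byte boundaries
    }

    /** String object (assumed 24 bytes) plus its byte[] (assumed 16-byte header + 1 byte per ASCII char). */
    static long stringBytes(int asciiLength) {
        return 24 + align8(16 + asciiLength);
    }

    /** An object holding only references: 12-byte header + 4 bytes per reference, padded. */
    static long holderBytes(int referenceCount) {
        return align8(12 + 4L * referenceCount);
    }

    /** Estimated heap bytes for one parsed row: the holder object plus one String per token. */
    static long rowBytes(int... tokenLengths) {
        long total = holderBytes(tokenLengths.length);
        for (int len : tokenLengths) {
            total += stringBytes(len);
        }
        return total;
    }
}
```

For example, a line with four 10-character tokens takes 43 characters plus a line break on disk, while rowBytes(10, 10, 10, 10) estimates 256 bytes on the heap, before counting the HashSet entries and any temporary line strings and arrays that have not been collected yet.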

It may be that the String implementation is using UTF-16 whereas the file may be using UTF-8. That would double the size, assuming all US-ASCII characters. However, I believe JVMs tend to use a compact form for Strings nowadays.
Another factor is that Java objects tend to be allocated on a nice round address. That means there's extra padding.
Then there's memory for the actual String object, in addition to the actual data in the backing char[] or byte[].
Then there's your UserModel object. Each object has a header and references are usually 8-bytes (may be 4).
Lastly not all the heap will be allocated. GC runs more efficiently when a fair proportion of the memory isn't, at any particular moment, being used. Even C malloc will end up with much of the memory unused once a process is up and running.

Your code reads the full file into memory. Then you start splitting each line into an array, then you create objects of your custom class for each line. So basically you have 3 different pieces of "memory usage" for each line in your file!
While enough memory is available, the JVM might simply not waste time running the garbage collector while turning your 500 megabytes into three different representations. Therefore you are likely to "triplicate" the number of bytes within your file, at least until the GC kicks in and throws away the no-longer-required file lines and split arrays.

Understanding ZipSecureFile.setMinInflateRatio(double ratio)

I am using this function call because reading a trusted file results in a zip bomb error.
ZipSecureFile.setMinInflateRatio(double ratio)
FileInputStream file = new FileInputStream("/file/path/report.xlsx");
ZipSecureFile.setMinInflateRatio(-1.0d);
XSSFWorkbook wb = new XSSFWorkbook(file);
I am trying to understand how it works?
The only source I could find is https://poi.apache.org/apidocs/org/apache/poi/openxml4j/util/ZipSecureFile.html
But, couldn't get a clear picture as I am new to this concept.
What are the differences between
ZipSecureFile.setMinInflateRatio(-1.0d);
vs
ZipSecureFile.setMinInflateRatio(0.009);
vs
ZipSecureFile.setMinInflateRatio(0);

Zip bomb detection works the following way:
While uncompressing, it checks the ratio compressedBytes/uncompressedBytes, and if this falls below a configured threshold (MinInflateRatio), a bomb is detected.
So if the ratio compressedBytes/uncompressedBytes is 0.01d, for example, that means the compressed file is 100 times smaller than the uncompressed one without loss of information. In other words, the compressed file stores the same information in only 1% of the size that the uncompressed one needs. This is really unlikely with real-life data.
To show how unlikely it is, we could take a look (in a popular-science manner) at how compression works:
Let's have the string
"This is a test for compressing having long count of characters which always occurs the same sequence."
This needs 101 bytes. Let's say this string occurs 100,000 times in the file. Then uncompressed it would need 10,100,000 bytes. A compression algorithm would give that string an ID, store the string only once mapped to that ID, and store the ID 100,000 times wherever the string occurs in the file. That would need 101 bytes + 1 byte (ID) + 100,000 bytes (IDs) = 100,102 bytes. This gives a ratio compressedBytes/uncompressedBytes of 0.009911089d.
So if we set the MinInflateRatio lower than 0.01d, we accept such unlikely compression rates.
Also we can see, that the ratio compressedBytes/uncompressedBytes can only be 0 if compressedBytes is 0. But this would mean that there are no bytes to uncompress. So a MinInflateRatio of 0.0d can never be reached nor be undershot. So with a MinInflateRatio of 0.0d all possible ratios will be accepted.
Of course a MinInflateRatio of -1.0d also can never be reached nor be undershot. So using this also all possible ratios will be accepted.
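The comparison described above can be sketched in a few lines. This is a simplified illustration of the ratio check only, not POI's actual implementation; the class, method and parameter names are made up.

```java
// Simplified illustration of the ratio rule; not Apache POI code.
final class InflateRatioCheck {
    /**
     * Returns true if the observed compressed/uncompressed ratio falls
     * below the configured minimum, i.e. the data would be flagged.
     */
    static boolean looksLikeBomb(long compressedBytes, long uncompressedBytes, double minInflateRatio) {
        if (uncompressedBytes == 0) {
            return false; // nothing was inflated, so there is no ratio to check
        }
        double ratio = (double) compressedBytes / uncompressedBytes;
        return ratio < minInflateRatio;
    }
}
```

With the numbers from the example, looksLikeBomb(100_102, 10_100_000, 0.01) is true, whereas the same sizes with a threshold of 0.009, 0 or -1.0 return false. That is the practical difference between the three calls in the question: 0.009 still rejects anything compressed harder than about 111:1, while 0 and -1.0 both switch the check off.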

Creating a bitmask with a large number of options

In my Android app I have a class containing only data (exposed with getters). This class needs to be serialized and sent across to other clients (done naively by iterating over all getters, and storing them in a ByteBuffer).
public class Data
{
public int getOption1() { /* ... */ }
public int getOption2() { /* ... */ }
// ...
public int getOptionN() { /* ... */ }
}
Serialize:
public void serialize(Data data) {
// write getOption1();
// write getOption2();
// ...
}
Deserialize:
public void deserialize() {
// read Option1();
// read Option2();
// ...
}
I'd like to be able to define which fields actually get sent (instead of blindly sending all of them), and one potential solution for this would be to define another field which is a bitmask that defines which fields are actually sent.
The receiving side parses the bitmask, and can tell which of the fields should be deserialized from the received message.
The problem is - using an int (32-bit) for bitmask allows for only 32 unique options (by using the "standard" power of 2 enum values).
How can one define a bitmask that can support a larger number of items? Is there any other encoding (other than storing each value as a power of 2)?
The number of actual values may vary (depending on user input) and may be anything from ~ 50 up to 200.
I'd like to encode the different options in the most efficient encoding.

An int provides a bit for each of 32 options. You can use a long to get a bit for each of 64 options. For a larger number of options, you can use an int or long array. Take the number of options, divide by 32 (for an int array) or 64 (for a long array) and round up.
A byte array will provide the least waste. Divide the number of options by 8 and round up. You can reserve the first byte to contain the length of the byte array (if you're passing other data as well). Since Byte.MAX_VALUE is 127 (but you can treat the value as the maximum valid index, not the byte count), this limits you to 128 bytes, i.e. 128 * 8 = 1024 options with bit indices 0 to 1023 (or 256 bytes and 2048 options if you are willing to do a little extra work to deal with negative byte count values). The maximum waste will be less than one byte (plus an additional byte of overhead to store the count).
If each option can be independently there or not there, you cannot do much better. If options can be grouped such that all options in a group are always either all present or all absent, then some additional compression may be possible.
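A length-prefixed byte-array mask like this can be sketched with java.util.BitSet, which already converts to and from a byte array. The FieldMask class and its methods are hypothetical names, the length byte is read as an unsigned count here, and the sketch assumes BitSet.toByteArray() and BitSet.valueOf(byte[]) are available on the target Android API level.

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

// Hypothetical helper: one length byte followed by the mask bytes.
final class FieldMask {
    /** Encodes the set of present field indices as [length][mask bytes...]. */
    static byte[] encode(BitSet present) {
        byte[] mask = present.toByteArray(); // only as long as the highest set bit requires
        if (mask.length > 255) {
            throw new IllegalArgumentException("mask too long for a one-byte length prefix");
        }
        ByteBuffer out = ByteBuffer.allocate(1 + mask.length);
        out.put((byte) mask.length); // stored as an unsigned byte
        out.put(mask);
        return out.array();
    }

    /** Reads a mask written by encode() and advances the buffer past it. */
    static BitSet decode(ByteBuffer in) {
        int length = in.get() & 0xFF; // undo the sign extension
        byte[] mask = new byte[length];
        in.get(mask);
        return BitSet.valueOf(mask);
    }
}
```

The sender sets bit i for every option it is about to write, writes encode(mask) first and then only the selected values in index order; the receiver calls decode() and reads a value for each set bit. With bits 0, 5 and 150 set, the mask takes 19 bytes plus the length byte.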

File size vs. in memory size in Java

If I take an XML file that is around 2kB on disk and load the contents as a String into memory in Java and then measure the object size it's around 33kB.
Why the huge increase in size?
If I do the same thing in C++ the resulting string object in memory is much closer to the 2kB.
To measure the memory in Java I'm using Instrumentation.
For C++, I take the length of the serialized object (e.g string).

I think there are multiple factors involved.
First of all, as Bruce Martin said, objects in Java have an overhead of 16 bytes per object; C++ objects do not.
Second, Strings in Java might be 2 Bytes per character instead of 1.
Third, it could be that Java reserves more Memory for its Strings than the C++ std::string does.
Please note that these are just ideas where the big difference might come from.

Assuming that your XML file contains mainly ASCII characters and uses an encoding that represents them as single bytes, you can expect the in-memory size to be at least double, since Java uses UTF-16 internally (I've heard of some JVMs that try to optimize this, though). Added to that will be overhead for 2 objects (the String instance and an internal char array) with some fields, IIRC about 40 bytes overall.
So your "object size" of 33kb is definitely not correct, unless you're using a weird JVM. There must be some problem with the method you use to measure it.

In Java, a String object carries some extra data that increases its size.
It is object data, array data and some other variables. This can be array reference, offset, length etc.
Visit http://www.javamex.com/tutorials/memory/string_memory_usage.shtml for details.

String: a String's memory growth tracks its internal char array's growth. However, the String class adds another 24 bytes of overhead.
For a nonempty String of size 10 characters or less, the added overhead cost relative to useful payload (2 bytes for each char plus 4 bytes for the length), ranges from 100 to 400 percent.
More:
What is the memory consumption of an object in Java?

Yes, you should GC and give it time to finish. Just call System.gc(); and print Runtime.getRuntime().totalMemory() in the loop. You had also better create a million copies of the string in an array (measure the size of the empty array and then of the array filled with strings), to be sure that you measure the size of the strings and not of other service objects that may be present in your program. A String alone cannot take 32 kB. But a hierarchy of XML objects can.
That said, I cannot resist the irony that nobody cares about memory (and cache hits) in the world of Java. We all know that the JIT is improving and can outperform native C++ code in some cases. So, there is no need to bother about memory optimization. Premature optimization is the root of all evil.

As stated in other answers, Java's String adds overhead. If you need to store a large number of strings in memory, I suggest storing them as byte[] instead. That way the size in memory should be about the same as the size on disk, provided you use the same encoding as the file.
String -> byte[] :
String a = "hello";
byte[] aBytes = a.getBytes(StandardCharsets.UTF_8); // java.nio.charset.StandardCharsets; an explicit charset avoids platform-default surprises
byte[] -> String :
String b = new String(aBytes, StandardCharsets.UTF_8);

Any code tips for speeding up random reads from a Java FileChannel?

I have a large (3Gb) binary file of doubles which I access (more or less) randomly during an iterative algorithm I have written for clustering data. Each iteration does about half a million reads from the file and about 100k writes of new values.
I create the FileChannel like this...
f = new File(_filename);
_ioFile = new RandomAccessFile(f, "rw");
_ioFile.setLength(_extent * BLOCK_SIZE);
_ioChannel = _ioFile.getChannel();
I then use a private ByteBuffer the size of a double to read from it
private ByteBuffer _double_bb = ByteBuffer.allocate(8);
and my reading code looks like this
public double GetValue(long lRow, long lCol)
{
long idx = TriangularMatrix.CalcIndex(lRow, lCol);
long position = idx * BLOCK_SIZE;
double d = 0;
try
{
_double_bb.position(0);
_ioChannel.read(_double_bb, position);
d = _double_bb.getDouble(0);
}
...snip...
return d;
}
and I write to it like this...
public void SetValue(long lRow, long lCol, double d)
{
long idx = TriangularMatrix.CalcIndex(lRow, lCol);
long offset = idx * BLOCK_SIZE;
try
{
_double_bb.putDouble(0, d);
_double_bb.position(0);
_ioChannel.write(_double_bb, offset);
}
...snip...
}
The time taken for an iteration of my code increases roughly linearly with the number of reads. I have added a number of optimisations to the surrounding code to minimise the number of reads, but I am at the core set that I feel are necessary without fundamentally altering how the algorithm works, which I want to avoid at the moment.
So my question is whether there is anything in the read/write code or JVM configuration I can do to speed up the reads? I realise I can change hardware, but before I do that I want to make sure that I have squeezed every last drop of software juice out of the problem.
Thanks in advance

As long as your file is stored on a regular hard disk, you will get the biggest possible speedup by organizing your data in a way that gives your accesses locality, i.e. causes as many consecutive get/set calls as possible to access the same small area of the file.
This is more important than anything else you can do because accessing random spots on a HD is by far the slowest thing a modern PC does - it takes about 10,000 times longer than anything else.
So if it's possible to work on only a part of the dataset (small enough to fit comfortably into the in-memory HD cache) at a time and then combine the results, do that.
Alternatively, avoid the issue by storing your file on an SSD or (better) in RAM. Even storing it on a simple thumb drive could be a big improvement.

Instead of reading into a ByteBuffer, I would use file mapping, see: FileChannel.map().
Also, you don't really explain how your GetValue(row, col) and SetValue(row, col) access the storage. Are row and col more or less random? The idea I have in mind is the following: sometimes, in image processing, you have to access pixels like row + 1, row - 1, col - 1, col + 1 to average values; one trick is to organize the data in 8 x 8 or 16 x 16 blocks. Doing so helps keep the different pixels of interest in a contiguous memory area (and hopefully in the cache).
You might transpose this idea to your algorithm (if it applies): you map a portion of your file once, so that the different calls to GetValue(row, col) and SetValue(row, col) work on this portion that's just been mapped.
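A minimal sketch of the mapping approach, assuming the file is a flat sequence of big-endian doubles as in the question. A single FileChannel.map() call cannot cover more than Integer.MAX_VALUE bytes, so a 3 GB file has to be mapped in several chunks. MappedDoubleFile and its members are made-up names, and error handling is kept to a minimum.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical wrapper: maps the whole file once, in fixed-size chunks.
final class MappedDoubleFile implements AutoCloseable {
    private final RandomAccessFile file;
    private final MappedByteBuffer[] chunks;
    private final long chunkBytes;

    MappedDoubleFile(String path, long doubleCount, long chunkBytes) throws IOException {
        if (chunkBytes <= 0 || chunkBytes % Double.BYTES != 0 || chunkBytes > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("chunkBytes must be a positive multiple of 8 below 2 GiB");
        }
        this.chunkBytes = chunkBytes;
        this.file = new RandomAccessFile(path, "rw");
        long totalBytes = doubleCount * Double.BYTES;
        file.setLength(totalBytes); // note: resizes an existing file, as in the question
        FileChannel channel = file.getChannel();
        int chunkCount = (int) ((totalBytes + chunkBytes - 1) / chunkBytes);
        chunks = new MappedByteBuffer[chunkCount];
        for (int i = 0; i < chunkCount; i++) {
            long offset = i * chunkBytes;
            long size = Math.min(chunkBytes, totalBytes - offset);
            chunks[i] = channel.map(FileChannel.MapMode.READ_WRITE, offset, size);
        }
    }

    double get(long index) {
        long pos = index * Double.BYTES;
        return chunks[(int) (pos / chunkBytes)].getDouble((int) (pos % chunkBytes));
    }

    void set(long index, double value) {
        long pos = index * Double.BYTES;
        chunks[(int) (pos / chunkBytes)].putDouble((int) (pos % chunkBytes), value);
    }

    @Override
    public void close() throws IOException {
        file.close(); // the mappings stay valid until they are garbage-collected
    }
}
```

A chunk size such as 1L << 30 (1 GiB) keeps every mapping below the limit. Whether this beats positional reads depends on how much of the file fits in the operating system's page cache, so it is worth measuring both.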

Presumably if we can reduce the number of reads then things will go more quickly.
3Gb isn't huge for a 64 bit JVM, hence quite a lot of the file would fit in memory.
Suppose that you treat the file as "pages" which you cache. When you read a value, read the page around it and keep it in memory. Then when you do more reads check the cache first.
Or, if you have the capacity, read the whole thing into memory at the start of processing.
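The page idea can be sketched with an access-ordered LinkedHashMap acting as a small LRU cache. This is only an illustration under assumptions: 64 KiB pages, write-through on every set, single-threaded use, and invented names (PagedDoubleFile, maxPages).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU page cache in front of a FileChannel; not thread-safe.
final class PagedDoubleFile {
    private static final int PAGE_BYTES = 64 * 1024; // assumed page size, a multiple of 8
    private final FileChannel channel;
    private final Map<Long, ByteBuffer> pages;

    PagedDoubleFile(FileChannel channel, int maxPages) {
        this.channel = channel;
        // accessOrder = true makes iteration order least-recently-used first.
        this.pages = new LinkedHashMap<Long, ByteBuffer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, ByteBuffer> eldest) {
                return size() > maxPages;
            }
        };
    }

    double get(long index) throws IOException {
        long pos = index * Double.BYTES;
        long pageNo = pos / PAGE_BYTES;
        ByteBuffer page = pages.get(pageNo);
        if (page == null) {
            page = ByteBuffer.allocate(PAGE_BYTES);
            long base = pageNo * PAGE_BYTES;
            // One large read instead of thousands of 8-byte reads.
            while (page.hasRemaining()) {
                if (channel.read(page, base + page.position()) < 0) {
                    break; // end of file: the rest of the page stays zero
                }
            }
            pages.put(pageNo, page);
        }
        return page.getDouble((int) (pos % PAGE_BYTES));
    }

    void set(long index, double value) throws IOException {
        long pos = index * Double.BYTES;
        ByteBuffer one = ByteBuffer.allocate(Double.BYTES);
        one.putDouble(0, value);
        channel.write(one, pos); // write-through to the file
        ByteBuffer page = pages.get(pos / PAGE_BYTES);
        if (page != null) {
            page.putDouble((int) (pos % PAGE_BYTES), value); // keep the cached copy in sync
        }
    }
}
```

How much this helps depends on access locality: if consecutive reads rarely hit the same page, the cache mostly adds copying, which is why reorganizing the data layout comes first.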

Accessing data byte-by-byte always produces poor performance (not only in Java). Try to read/write bigger blocks (e.g. rows or columns).
How about switching to a database engine for handling such amounts of data? It would handle all optimizations for you.
Maybe this article helps you ...

You might want to consider using a library which is designed for managing large amounts of data and random reads rather than using raw file access routines.
The HDF file format may be a good fit. It has a Java API but is not pure Java. It's licensed under an Apache-style license.
