compressing strings with a static dictionary

This is a somewhat abstract question, since I don't even know whether anything like this exists.
Suppose we have an application that delivers text data from point A to point B.
A and B are far apart, so the size of the data has a significant effect on all the metrics we want to optimize (speed, latency, and throughput). The first thing that comes to mind is compression, but compression is not very effective when we have to compress many small messages individually; it's much more effective when the input is large.
I have no experience with compression algorithms, but my understanding is that the bigger the input, the better the compression ratio can be, since repeated chunks that can be optimized are more likely.
Another option is batching: wait for some period N, collect all the tiny messages, and compress them into one big message. That gives a good compression ratio but sacrifices latency, since the message that arrives first is delayed by up to N for no good reason.
The solution I'm looking for is something like this: when a compression algorithm traverses the data, it presumably builds some dictionary of things it knows can be optimized. This dictionary is thrown away every time compression finishes, and it is always sent with the message to B.
rawMsg -> [dictionary|compressedPayload] -> send to B
However, if this dictionary could be kept in memory and sent only when it changes, we could compress even small messages efficiently and avoid sending the dictionary to the other end every time:
rawMsg -> compress(existingDictionaryOfSomeVersion, rawMsg) -> [dictionaryVersion|compressedPayload] -> send to B
Obviously, the assumption here is that B also keeps an instance of the dictionary and updates it whenever a newer version arrives.
Note that exactly this already happens with protocols like protobuf or FIX (in financial applications).
With any message you have a schema (dictionary) that is available on both ends, and then you just send raw binary data. It's efficient and fast, but the schema is fixed.
I'm looking for something that can be used for free-form text.
Is there any technology that allows this (without a fixed schema)?

You can simply send the many small messages in a single compressed stream. Then they will be able to take advantage of the previous history of small messages. With zlib you can flush out each message, which will avoid having to wait for a whole block to be built up before transmitting. This will degrade compression, but not nearly as much as trying to compress each string individually (which will likely just end up expanding them). In the case of zlib, your dictionary is always the last 32K of messages that you have sent.
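
A minimal Java sketch of this approach, assuming both ends keep one java.util.zip.Deflater / Inflater pair alive for the whole connection and that every flushed chunk reaches B intact and in order. The class and method names are made up for illustration, and framing the chunks on the wire is left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Sketch: one long-lived zlib stream per connection, flushed after every message. */
final class StreamingCompressionSketch {

    /** Sender side (A): a single Deflater is reused for all messages. */
    static final class Sender {
        private final Deflater deflater = new Deflater();
        private final byte[] buf = new byte[4096];

        /** Compresses one message and flushes so that B can decode it right away. */
        byte[] compress(String message) {
            deflater.setInput(message.getBytes(StandardCharsets.UTF_8));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int n;
            do {
                // SYNC_FLUSH emits all pending output but keeps the history window.
                n = deflater.deflate(buf, 0, buf.length, Deflater.SYNC_FLUSH);
                out.write(buf, 0, n);
            } while (n == buf.length); // a full buffer means there may be more output
            return out.toByteArray();
        }

        /** Releases native resources when the connection is closed. */
        void close() {
            deflater.end();
        }
    }

    /** Receiver side (B): a single Inflater mirrors the sender's history. */
    static final class Receiver {
        private final Inflater inflater = new Inflater();
        private final byte[] buf = new byte[4096];

        /** Decodes one flushed chunk; chunks must arrive in the order they were sent. */
        String decompress(byte[] chunk) {
            inflater.setInput(chunk);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try {
                int n;
                while ((n = inflater.inflate(buf)) > 0) {
                    out.write(buf, 0, n);
                }
            } catch (DataFormatException e) {
                throw new IllegalStateException("corrupt or out-of-order chunk", e);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }

        /** Releases native resources when the connection is closed. */
        void close() {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        Sender a = new Sender();
        Receiver b = new Receiver();
        String msg = "{\"symbol\":\"EURUSD\",\"bid\":1.0712,\"ask\":1.0714}";
        for (int i = 0; i < 3; i++) {
            byte[] wire = a.compress(msg);
            System.out.println(wire.length + " bytes on the wire -> " + b.decompress(wire));
        }
        a.close();
        b.close();
    }
}
```

Repeated or similar messages should come out smaller after the first one, because they can refer back to earlier data in the stream. If you do want a fixed dictionary agreed on up front, zlib also supports preset dictionaries, which Java exposes as Deflater.setDictionary and Inflater.setDictionary.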

Related

Is it faster to concatenate strings within a call to print(), or call print() multiple times?

Consider the following two code snippets:
System.out.print(i + " ");
or
System.out.print(i);
System.out.print(" ");
Although both will end up printing the same output, which variant will be faster, in terms of execution speed?
For some context, the reason I am asking is that converting the first one to the second one resulted in a "time limit exceeded (TLE)" error in my online compiler. That suggests that the first one is faster, but I don't know if that is universally true. I would also like to understand why.

You can't really know - it depends on factors not evident in your code.
When is i + " " slower?
For that to be slower, i needs to be a very large string (or some object whose toString() method returns a very large string), so that the i + " " operation is relatively pricey. It's still an entirely in-memory affair, so compared to most I/O it's negligible, but make that a string of a million characters or so and the concatenation is likely to make this variant slower.
When is i, then " " slower?
System.out is, or isn't, buffered. You get no guarantees. It's usually a sizable chain of systems tied together, and eventually you end up writing to a file (if you ran java -cp . yourpkg.YourClass >output.txt, for example), or to a console, and if that console is running over ssh, to TCP/IP packets over a network, etcetera.
Many of these systems add tons of overhead because these are so-called 'packeting' systems: They can't send or process single bytes at all. They can only process in sizable chunks. Take networking: To send some data from one computer on the internet to another, you wrap the data you want to send in a so-called packet, which includes many bytes of data which is used for systems that process that data to know what to do with it. This data will travel through your network card, your router, the tiny little closet thingie at the end of the street, the larger network hub in your town, to a major distribution trunk, across the fiber to the other side of the ocean, and then allll that, back to a server you've ssh-ed into. Should be obvious that you need quite a few bytes of info to make this happen.
So, if you want to send just Hello and nothing more, then an entire packet is made. It could have carried maybe 1800 bytes of data, but it will carry only 5. It still has the ~100 bytes of overhead for the routing info, though. So, the Hello packet is about 105 bytes in total. Your network card shunts that off to your router and it's off to the wide world, and then your code almost immediately says: Okay, great, now send a space! This results in your system dutifully crafting another packet, applying all that routing overhead, and off goes a 101-byte packet, for a grand total of 2 separate packets, totalling 206 bytes.
Contrast that with sending Hello and the space in one shot, for a single packet of 106 bytes.
That's an example of why usually 'buffering', or failing that, bunching your sends into fewer actual writes, will be faster. But the problem is, you have no idea where System.out goes. Console? Network? File? bitbucket? Who knows. If you run java -jar yourapp.jar >/dev/null, System.out is incredibly fast (as the data goes absolutely nowhere). Your question doesn't mention where this goes.
NB: Files end up being packet-based too, modern SSDs can't write single bytes to disk, only entire chunks in one go. If you first write 'hello', the SSD ends up reading an entire chunk into memory, then updating some bytes so they read 'hello', then flashing the location on disk with a surge of power that resets that entire segment, and then writing the whole chunk of many thousands of bytes large back. If you then write a space, that whole routine is done a second time, whereas if you call write only once, the disk will likely do this 'load an entire chunk, update data, surge the chunk, save the chunk' song and dance routine only once.
Okay, so can you simplify this?
Well, no. That's the point. However, usually, it doesn't matter and they are equally fast. But, if it matters, it is highly likely that i + " " will be faster.
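
If you want to see the effect without guessing where System.out goes, here is a rough sketch that counts how many write calls reach the underlying stream. CountingSink and countWrites are hypothetical names invented for this example, and the sink merely stands in for whatever System.out is really attached to.

```java
import java.io.BufferedOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;

/** Sketch: counts how many writes actually reach the destination of a PrintStream. */
final class WriteCountSketch {

    /** Hypothetical destination that discards data and only counts write calls. */
    static final class CountingSink extends OutputStream {
        int calls;

        @Override
        public void write(int b) {
            calls++;
        }

        @Override
        public void write(byte[] b, int off, int len) {
            calls++;
        }
    }

    /**
     * Prints the numbers 0..n-1, each followed by a space, and returns the number
     * of write calls that reached the sink.
     */
    static int countWrites(int n, boolean concatenate, boolean buffered) {
        CountingSink sink = new CountingSink();
        OutputStream target = buffered ? new BufferedOutputStream(sink, 8192) : sink;
        PrintStream out = new PrintStream(target, false); // no auto-flush
        for (int i = 0; i < n; i++) {
            if (concatenate) {
                out.print(i + " ");
            } else {
                out.print(i);
                out.print(" ");
            }
        }
        out.flush();
        return sink.calls;
    }

    public static void main(String[] args) {
        System.out.println("separate, unbuffered:     " + countWrites(1000, false, false));
        System.out.println("concatenated, unbuffered: " + countWrites(1000, true, false));
        System.out.println("separate, buffered:       " + countWrites(1000, false, true));
    }
}
```

The exact counts depend on the JDK's PrintStream internals, but the unbuffered variants should show roughly one write per print call, while wrapping the destination in a BufferedOutputStream collapses them into far fewer writes. That is why explicitly buffering your output tends to matter more than the choice between the two snippets.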

Reading an input stream twice without storing it in memory

With reference to the stackoverflow question it is said that the InputStream can be read multiple times with mark() and reset() provided by the InputStream or by using PushbackInputStream.
In all these cases the content of the stream is stored in a byte array (i.e., the original content of the file is held in main memory) and reused multiple times.
What happens when the size of the file exceeds the available memory? I think this could lead to an OutOfMemoryError.
Is there a better way to read the stream content multiple times without storing it locally (i.e., in main memory)?
Please help me understand this. Thanks in advance.

It depends on the source of the stream.
If it's a local file, you can likely re-open and re-read the stream as many times as you want.
If it's dynamically generated by a process, a remote service, etc., you might not be free to re-generate it. In that case, you need to store it, either in memory or in some more persistent (and slow) storage like a file system or storage service.
Maybe an analogy would help. Suppose your friend is speaking to you at length. You listen carefully without interruption, but when they are done, you realize you didn't understand something they said near the beginning, and want to review that portion.
At this point, there are a few possibilities.
Perhaps your friend was actually reading aloud from a book. You can simply re-read the book.
Or, perhaps you had the foresight to record their monologue. You can replay the recording.
However, since neither you nor your friend has perfect and unlimited recall, simply repeating verbatim what was said ten minutes ago from memory alone is not an option.
An InputStream is like your friend speaking. Neither of you has a good enough memory to remember exactly, word-for-word, what is said. In the same way, neither a process that is generating the data stream nor your program has enough RAM to store, byte-for-byte, the stream. To scale, your program has to rely on its "short-term memory" (RAM), working with just a small portion of the whole stream at any given time, and "taking notes" (writing to a persistent store) as it encounters important points.
If the source of the stream is a local file, then it's like your friend reading a book. Either of you can re-read that content easily enough.
If you copy the stream to some persistent storage, that's like recording your friend's speech. You can replay it as often as you like.
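
The "recording" option can be sketched roughly like this, assuming a temporary file is acceptable as the persistent store. The class and method names are invented for the example; only a small copy buffer is held in RAM at any time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Sketch: "record" a one-shot stream to disk so it can be replayed later. */
final class ReplayableStreamSketch {

    /** Copies the stream to a temporary file and returns the file's path. */
    static Path spoolToTempFile(InputStream source) {
        try {
            Path tmp = Files.createTempFile("spool-", ".bin");
            // createTempFile already created the file, so it has to be replaced.
            Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** One pass over the recorded data; it can be called any number of times. */
    static long countBytes(Path recorded) {
        byte[] chunk = new byte[8192];
        long total = 0;
        try (InputStream in = Files.newInputStream(recorded)) {
            int n;
            while ((n = in.read(chunk)) != -1) {
                total += n; // a real program would process the chunk here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

Every call to countBytes opens a fresh InputStream over the same file, so the second and third reads cost disk I/O rather than heap space. Remember to delete the temporary file once you are done with it.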
Consider a scenario where a browser is uploading a large file, but the server is busy and not able to read that stream for some time. Where is that data stored during that delay?
Because the receiver can't always respond immediately to input, TCP and many other protocols allocate a small buffer to store some data from a sender. But they also have a way to tell the sender to wait when it is sending data too fast: flow control. Going back to the analogy, it's like telling your friend to pause a moment while you catch up with your note-taking.
As the browser uploads the file, at first, the buffer will be filled. But if the server can't keep up, the browser will be instructed to pause its upload until there is more room in the buffer. (This generally happens at the OS and TCP level; the client and server applications don't manage this directly.) The upload speed depends on how fast the browser can read the file from disk, how fast the network link is, and how fast the server can process the uploaded data. Even a fast network and client will be limited by the weak link in this chain.

Should I fill stream into a string/container for manipulation?

I want to ask this as a general question. If we have a program that reads data from outside the program, should we first put the data in a container and then manipulate it, or should we work directly on the stream, provided the stream API of the language is powerful enough?
For example, I am writing a program that reads from a text file. Should I first put the data in a string and then manipulate it, instead of working directly on the stream? I am using Java, and let's say it has powerful enough (for my needs) stream classes.

Stream processing is generally preferable to accumulating data in memory.
Why? One obvious reason is that the file you are reading might not even fit into memory. You might not even know the size of the data before you've read it completely (imagine that you are reading from a socket or a pipe rather than a file).
It is also more efficient, especially when the size isn't known ahead of time: allocating large chunks of memory and moving data around between them can be taxing. Things like processing and concatenating large strings aren't free either.
If the I/O is slow (ever tried reading from a tape?) or if the data is being produced in real time by a peer process (socket/pipe), your processing of the data can, at least in part, happen in parallel with reading, which will speed things up.
Stream processing is inherently easier to scale and parallelize if necessary: because your logic is forced to depend only on the element currently being processed, you are free from state. If the amount of data becomes too large to process sequentially, you can trivially scale your app by adding more readers and splitting the stream between them.
You might argue that in your case none of this matters, because the file you are reading is only 300 bytes. Indeed, for small amounts of data, this is not crucial (you may also bubble sort it while you are at it), but adopting good patterns and practices makes you a better programmer, and will help when it does matter. There is no disadvantage to it. No, it does not make your code more complicated. It might seem so to you at first, but that's simply because you are not used to stream processing. Once you get into the right mindset and it becomes natural to you, you'll see that, if anything, code that deals with one small piece of data at a time, and doesn't care about indexes, pointers and positions, is simpler than the alternative.
All of the above applies to sequential processing though. You read the stream once, processing the data immediately, as it comes in, and discarding it (or, perhaps, writing out to the next stream in pipeline).
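
For a text file in Java, sequential processing of this kind might look like the sketch below. Counting lines that contain a given word is just a stand-in for whatever your real per-line logic is, and the class and method names are invented for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Sketch: process text line by line instead of loading it all into a String. */
final class LineStreamSketch {

    /** Counts the lines containing needle while holding only one line in memory. */
    static long countMatching(Reader source, String needle) {
        long matches = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(needle)) {
                    matches++; // the line is dropped as soon as the loop moves on
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return matches;
    }
}
```

The same method works unchanged whether the Reader wraps a 300-byte file, a multi-gigabyte file, or a socket, which is the point about not needing to know the size up front.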
You mentioned RandomAccessFile ... that's a completely different beast. If you need random access, and the data fits in memory, put it in memory. Seeking the file back and forth is the same thing conceptually, only much slower. There is no benefit to it other than saving memory.

You should certainly process it as you receive it. The other way adds latency and doesn't scale.

How to handle a large stream that is mostly written to disk in netty (use ChunkedWriteHandler?)?

I have a binary protocol with some unknown amount of initial header data (unknown length until header is fully decoded) followed by a stream of data that should be written to disk for later processing.
I have an implementation that decodes the header and then writes the file data to disk as it comes in from one or more frames in ChannelBuffers (so my handler subclasses FrameDecoder, and builds the message step by step, not waiting for one ChannelBuffer to contain the entire message, and writing the file data to disk with each frame). My concern is whether this is enough, or if using something like ChunkedWriteHandler does more than this and is necessary to handle large uploads.
Is there a more optimal way of handling file data than just writing it directly to disk from the ChannelBuffer with each frame?

It should be enough as long as the throughput is good enough. Otherwise, you might want to buffer the received data (say, in a bounded 32 KiB buffer) so that you don't make system calls too often when the individual chunks of received data are small.
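
As a rough illustration of that buffering idea in plain Java (deliberately not using any Netty API), the sketch below coalesces small frames in a 32 KiB buffer before they hit the file. The class name and the onFrame method are hypothetical; in a real handler you would pass in the bytes read from each ChannelBuffer.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch: coalesce many small frames into fewer, larger writes to disk. */
final class BufferedDiskSinkSketch implements AutoCloseable {

    private static final int BUFFER_SIZE = 32 * 1024; // bounded 32 KiB buffer

    private final OutputStream out;

    BufferedDiskSinkSketch(Path target) {
        try {
            out = new BufferedOutputStream(Files.newOutputStream(target), BUFFER_SIZE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Called once per received frame; the file is only touched when the buffer fills. */
    void onFrame(byte[] frame, int offset, int length) {
        try {
            out.write(frame, offset, length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Flushes whatever is still buffered and closes the file. */
    @Override
    public void close() {
        try {
            out.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Frames larger than the buffer are passed straight through by BufferedOutputStream, so the buffer only changes behaviour when frames are small.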
Netty could be even faster if it exposed the transferTo/From operation to a user, but such a feature is not available yet.

You should also think about adding an ExecutionHandler in front of your handler. This will keep you from getting blocked by the disk I/O. Otherwise you may see slowdowns under heavy disk access.

Java NIO ByteBuffer allocation to fit largest dataset?

I'm working on an online game and I've hit a little snag while working on the server side of things.
When using nonblocking sockets in Java, what is the best course of action to handle complete packet data sets that cannot be processed until all the data is available? For example, sending a large 2D tiled map over a socket.
I can think of two ways to handle it:
Allocate a ByteBuffer large enough to hold the complete data set, such as the large 2D tiled map from my example. Keep adding read data to the buffer until it has all been received, and process it from there.
If the ByteBuffer is a smaller size (perhaps 1500), subsequent reads can be done and put out to a file until it can be processed completely from the file. This would prevent having to have large ByteBuffers, but degrades performance because of disk I/O.
I'm using a dedicated ByteBuffer for every SocketChannel so that I can keep reading in data until it's complete for processing. The problem is if my 2D Tiled Map amounts to 2MB in size, is it really wise to use 1000 2MB ByteBuffers (assuming 1000 is a client connection limit and they are all in use)? There must be a better way that I'm not thinking of.
I'd prefer to keep things simple, but I'm open to any suggestions and appreciate the help. Thanks!

Probably, the best solution for now is to use the full 2MB ByteBuffer and let the OS take care of paging to disk (virtual memory) if that's necessary. You probably won't have 1000 concurrent users right away, and when you do, you can optimize. You may be surprised what your real performance issues are.

I decided the best course of action was to simply reduce the size of my massive dataset and send tile updates instead of an entire map update. That way I can simply send a list of tiles that have changed on a map instead of the entire map over again. This reduces the need for such a large buffer and I'm back on track. Thanks.
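
A sketch of what such a tile diff could look like, assuming the map is a plain int[][] of tile ids indexed as map[y][x] and that both versions have the same dimensions. TileUpdate and diff are names made up for the example, not anything from the original code.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: send only the tiles that changed instead of the whole map. */
final class TileDeltaSketch {

    /** Hypothetical update record: tile position plus its new tile id. */
    static final class TileUpdate {
        final int x;
        final int y;
        final int tileId;

        TileUpdate(int x, int y, int tileId) {
            this.x = x;
            this.y = y;
            this.tileId = tileId;
        }
    }

    /** Returns one update per tile that differs between the two map versions. */
    static List<TileUpdate> diff(int[][] before, int[][] after) {
        List<TileUpdate> updates = new ArrayList<>();
        for (int y = 0; y < after.length; y++) {
            for (int x = 0; x < after[y].length; x++) {
                if (before[y][x] != after[y][x]) {
                    updates.add(new TileUpdate(x, y, after[y][x]));
                }
            }
        }
        return updates;
    }
}
```

Each update is three ints, so a handful of changed tiles fits comfortably in a small per-connection ByteBuffer; the full map would then only need to be sent when a client first joins.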
