Compressing content of Java string variable - java

I have a Java application and I would like to compress the content of a certain String variable (it's a huge JSON document). I have seen a lot of examples that compress a String into a byte array, but my application has to (optionally!) send that compressed data as a REST response (I'm using Java Spark), so I would like to compress that String into another (smaller) String.
Not sure if that's really possible, that's why I'm here :)
Why don't I want to send a byte array over the network? Because my response has two parts: metadata and the actual data returned from the DB. I would like to keep the metadata readable and only the actual data compressed.
Is there a way to achieve this?

Related

Send XML Document and JSON as bytes over web socket in Java

What are the benefits of using ByteArrayOutputStream to convert the XML or JSON data to send over web socket instead of sending these values as Strings?
Security: JSON and XML are easy to decode (this matters mostly for WS, as opposed to WSS).
Efficiency: In traffic usage and, in most cases, in encode/decode processing. Byte arrays can be very compact compared to strings, especially for data that is not a string by nature. For example, a Boolean array of size 32 packs into 4 bytes, whereas its string representation needs over 128 (32*4) bytes, so both data usage and encode/decode CPU usage are lower. Check THIS link
Generality: You can send all types of data, including objects with complex hierarchical inheritance between them. Decoding JSON with complex tree-like inheritance requires a very complex parsing method.
Simplicity: Data can be chunked meaningfully. Suppose we always use the first 2 bytes of the data as its type (to decode the rest). Normally additional libraries do that for us.
Integrity: Corrupted data is easy to recognize. Even without a checksum, 1-bit data corruption can be detected in most cases.
Compatibility: Serialized objects can carry a version to control compatibility (version control). Although you could add a version to JSON, it could cause many difficulties, inefficiencies, and trouble. Check THIS
And probably other reasons in special cases.
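As an illustration of the "first 2 bytes as the type" idea from the Simplicity point, here is a minimal sketch. The class name TypedFrame and the type codes are hypothetical, chosen only for this example; a real protocol would normally also carry a length and rely on a library for framing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class TypedFrame {
    // Hypothetical type codes, invented for this example.
    static final short TYPE_JSON = 1;
    static final short TYPE_XML = 2;

    // Prefix the UTF-8 payload with a 2-byte (big-endian) type code.
    static byte[] encode(short type, String payload) {
        byte[] body = payload.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(2 + body.length)
                .putShort(type)
                .put(body)
                .array();
    }

    // Read the type code back from the first 2 bytes.
    static short typeOf(byte[] frame) {
        return ByteBuffer.wrap(frame).getShort();
    }

    // Decode everything after the 2-byte prefix as UTF-8 text.
    static String payloadOf(byte[] frame) {
        return new String(frame, 2, frame.length - 2, StandardCharsets.UTF_8);
    }
}
```

The receiver would call typeOf first and then pick a JSON or XML parser for payloadOf accordingly.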

Is additional base64 encoding necessary when sending files as byte[] from Java service to Java Service via RestTemplate?

I am sending data via json body in a post request from a client (Java) to a server (Java) using a Spring RestTemplate and RestController.
The data is present as a POJO on the client and will be parsed into a POJO with the same structure on the server.
On the client I am converting a file with Files.readAllBytes to byte[] and store it in the content field.
On the server side the whole object including the byte[] will be marshalled to XML using JAXB annotations.
class BinaryObject {
String fileName;
String mimeCode;
byte[] content;
}
Everything is working fine and running as intended.
I heard it could be beneficial to encode the content field before transmitting the data to the server and decode it there before it is marshalled into XML.
My Question
Is it necessary or recommended to additionally encode / decode the content field with base64?
TL;DR
To the best of my knowledge, you are not going against any good practice with your current implementation. One might question the design (exchanging files in JSON? Storing binary inside XML?), but this is a separate question.
Still, there is room for possible optimization, but the toolset you use (e.g. Spring RestTemplate + Spring controller + JSON serialization (Jackson) + XML using JAXB) kind of hides the possible optimizations from you.
You have to carefully weigh the pros and cons of working around your comfortable "automat(g)ical" serializations, which work well as of today, to see if it is worth the trouble to tweak them.
We can nonetheless discuss the theory of what could be done.
A discussion about Base64
Base64 encoding is an efficient way to encode binary data in pure text formats (e.g. MIME structures such as email or some HTTP bodies, JSON, XML, ...), but it has two costs: the first is a non-negligible size increase (~33%), the second is CPU time.
Sometimes (but you'd have to profile to check whether that is your case) this cost is not negligible, especially for large files. Due to some buffering and char/byte conversions in the frameworks, you could easily end up using e.g. 4x the size of the encoded file in the Java heap.
When handling 10kb files at 10 requests/sec, this is usually NOT an issue.
But 10MB files at 100 req/second, well that is another ball park.
So you'd have to check (I doubt your typical server will reach 100 req/s with 10MB files, because that is a 1GB/s incoming network bandwidth).
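To put a number on the ~33% size increase: standard padded Base64 turns every 3 input bytes into 4 output characters, so n bytes become 4 * ceil(n / 3) characters. The sketch below (the class name Base64Overhead is only illustrative) computes that length and compares it with what java.util.Base64 actually produces.

```java
import java.util.Base64;

final class Base64Overhead {
    // Length of standard padded Base64 output for n input bytes: 4 * ceil(n / 3).
    static long encodedLength(long n) {
        return 4 * ((n + 2) / 3);
    }

    public static void main(String[] args) {
        int[] sizes = {10 * 1024, 1024 * 1024};
        for (int size : sizes) {
            // Encode a zero-filled buffer; the output length depends only on the input length.
            int actual = Base64.getEncoder().encode(new byte[size]).length;
            double increase = 100.0 * (actual - size) / size;
            System.out.printf("%d bytes -> %d Base64 chars (formula: %d, +%.1f%%)%n",
                    size, actual, encodedLength(size), increase);
        }
    }
}
```

This covers only the wire size. The heap usage mentioned above depends on how the framework buffers and converts the data, which is why profiling is still needed.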
What is optimizable in your current process
In your current process, you have multiple encodings taking place : the client needs to Base64 encode the bytes read from the file.
When the request hits the server, the server decodes the base64 to a byte[], then your XML serialization (JAXB) reconverts the byte[] to base64.
So in effect, "you" (more exactly, the REST controller side of things) decoded base64 content, all for nothing, because the XML side of things could have used it directly.
What could be done
A few things.
Do you need base64 at the calling site ?
First, you do not have to encode at the client side. When using JSON, there is no choice, but the world did not wait for JSON to exchange files (e.g. arbitrary binary content) over HTTP.
If your content is a file name, a MIME type, and a file body, then standard, direct HTTP calls with no JSON at all are perfectly fine.
The MIME type could be mapped to the Content-Type HTTP Header, the file name inside the Content-Disposition HTTP header, and the contents as the raw HTTP body. No base64 needed (but you need your server-side to accept raw HTTP content as is). This is standard as can be.
This change would allow you to remove the encoding (client side), lower the network size of the call (~33% less), and remove one decoding at the server side. The server would just have to base64 encode (once) a raw stream to produce the XML, and you would not even need to buffer the whole file contents for that. You'd have to tweak your JAXB model a bit, but you can JAXB serialize bytes directly from an InputStream, which means almost no buffering, and since your CPU probably encodes faster than your network serves content, no real latency is incurred.
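A minimal client-side sketch of such a raw upload, using the JDK's java.net.http API (Java 11+) rather than RestTemplate. The class name RawUpload and the endpoint passed in are assumptions made for this example; the file name is placed in the header without escaping, which a real client would need to handle.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class RawUpload {
    // Build a POST whose body is the raw file content, with the MIME type and
    // file name carried in HTTP headers instead of a JSON envelope.
    static HttpRequest buildUpload(URI endpoint, String fileName, String mimeType, byte[] content) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", mimeType)
                // Assumes fileName contains no quotes or non-ASCII characters.
                .header("Content-Disposition", "attachment; filename=\"" + fileName + "\"")
                .POST(HttpRequest.BodyPublishers.ofByteArray(content))
                .build();
    }
}
```

The request could then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding()), provided the server accepts raw bodies as described above. For large files, HttpRequest.BodyPublishers.ofFile(path) avoids loading the whole content into a byte[].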
If this, for some reason, is not an option, let's say your client has to send JSON (and therefore base64 content)
Can you avoid decoding at the server side
Sort of. You can use a server-side bean where the content is actually a String and NOT a byte[]. This is hacky, but your REST controller will no longer deserialize base64; it will keep it "as is", as a JSON string (which happens to be base64-encoded content, but the controller does not care).
So your server will have saved the CPU cost of one base64 decoding, but in exchange, you'll have a base64 String in the Java heap (compared to the raw byte[], +33% size on Java >= 9 with compact strings, +166% size on Java < 9).
To profit from this, you also have to tweak your JAXB so that it treats the base64-encoded String as a byte[], which is not trivial as far as I can tell, unless you modify the JAXB object so that it accepts a String instead of the byte[], which is kind of hacky (if your JAXB objects are generated from an XML schema, this might really become a pain to implement).
All in all, this is much harder, and probably too much if you are not really hitting the wall, performance-wise, on this particular issue.
A few other things
Are your files pure binary, or are they actually text? If they are text, you may benefit from using CDATA on the XML side instead of base64.
Is your XML actually a SOAP call? If so, and if the service supports MTOM, you could avoid base64 completely, but that is an altogether different subject.

Obtain a string from the compressed data and vice versa in java

I want to compress a string(an XML Document) in Java and store it in Cassandra db as varchar. I should be able to decompress it while reading from db. I looked into GZIP and lz4 and both return a byte array on compressing.
My goal is to obtain a string from the compressed data which can also be used to decompress and get back the original string.
What is the best possible approach?
I don't see any good reasons for you to compress your data: Cassandra can do it for you transparently (it will LZ4 your data by default). So, if your goal is to reduce your data footprint then you have a non-existent problem, and I'd feed the XML document directly to C*.
By the way, all the compression algorithms take an array of bytes and produce an array of bytes. As a solution, you could apply something like base64 encoding to your compressed byte array. On decompression, reverse the logic: base64-decode your string and then apply your decompression algorithm.
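The compress-then-base64 approach just described could look roughly like this. It is a sketch using only the JDK (GZIP from java.util.zip and java.util.Base64); the class and method names are made up for the example, and readAllBytes requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class StringCompressor {

    // GZIP the UTF-8 bytes of the text, then Base64-encode the compressed bytes.
    static String compressToBase64(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The GZIP stream is closed here, so the trailer has been written.
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    // Reverse the steps: Base64-decode, gunzip, then decode the bytes as UTF-8.
    static String decompressFromBase64(String encoded) {
        byte[] compressed = Base64.getDecoder().decode(encoded);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because base64 adds roughly a third to the compressed size, the resulting string is only smaller than the original when the data compresses well. Large, repetitive JSON or XML usually does, while short or high-entropy strings may come out larger.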
Not enough reputation to comment so posting as an answer. If you want a string back, then significant compression will depend on your data. A very simple solution might be something like Java compressing Strings but that would work if your string is only characters and no numbers. You can modify this solution to work for most characters but then if you don't have repeating characters then you might actually get a larger string than your original one.

Grabbing data from web service xsd:base64Binary field in Java

I have a web service I am trying to call from Java. The XSD for the service defines a field as an xsd:base64Binary. I am using maven jaxb2 plugin to generate Java artifacts. The field becomes a byte[] in the generated Java object. The data that comes back in that field is either CSV or XML data depending on what is passed into the service. SoapUI displays the data perfectly (not encoded). Watching the wire with wireshark I can also see the non encoded data. My question is, how do I grab this data as a string in Java? I want to take this data and later write it into a file.
Response looks something like this:
Service Agreement,Interval Start Time,Interval End Time,Quantity,Unit of Measure .... etc.
Relevant bit of XSD:
Relevant bit of generated java:
protected byte[] greenDoc;
In my client java code I have been trying every possible combination of new String(byte[]), new String(byte[], charset), Base64 decoding, etc. and I just cannot seem to get the data correctly. I know it is not a limitation of the web service because like I said SoapUI can display the data perfectly.
Any pointers on how the client code can take the byte array and convert to string? Thanks!
Programmatically, you can use DatatypeConverter

Storing files on Database

I have to save a file in any format (XLS, PDF, DOC, JPG, ...) in a database using Java. In my experience, I would do this by storing the binary data of the file in a BLOB type field. Someone told me that an alternative is to encode the binary data as text using BASE64 and store the string in a TEXT type field. Which one is the best option to perform this task?
Thanks.
Paul Manjarres
BLOB would be better, simply because you can use a byte[] data type, and you don't have to encode/decode from BASE64. No reason to use BASE64 for simple storage.
The argument for using BLOB is that it takes fewer CPU cycles, less disk and network i/o, less code, and reduces the likelihood of bugs:
As Will Hartung says, using BLOB enables you to skip the encode/decode steps, which will reduce CPU cycles. Moreover, there are many Java libraries for Base64 encoding and decoding, and there are nuances in implementation (ie PEM line wraps). This means to be safe, the same library should be used for encoding and decoding. This creates an unnecessary coupling between the application which creates the record, and the application that reads the record.
The encoded output will be larger than the raw bytes, which means it will take up more disk space (and network i/o).
Use BLOB to put them in database
FILE to BLOB = DB will not query the content and treat it as, as a ... well ... a meaningless binary BLOB regardless of its content. DB knows this field may be 1KB or 1GB and allocates resources accordingly.
FILE to TEXT = DB can query this thing. Strings can be searched, replaced, and modified in the file. But this time the DBMS will spend more resources to make this thing work. There may be a 100 char long text inside a field which may or may not be storing a 1 million char long text. Files can have any kind of text encoding, and invalid characters may be lost due to table/DB encoding settings. No need to use this if the content of the files will not be used in SQL queries.
BASE64 = Converts any content to a lovely super valid text. A work around to bypass every compatibility issue. Store anywhere, print it, telegraph it, write it on a paper, convert your favorite selfie to a private key. Output will be meaningless and bigger but it will be an ordinary text.
