Understanding ZipSecureFile.setMinInflateRatio(double ratio) - java

I am using the function call ZipSecureFile.setMinInflateRatio(double ratio), because when I read a trusted file, it results in a zip bomb error:
FileInputStream file = new FileInputStream("/file/path/report.xlsx");
ZipSecureFile.setMinInflateRatio(-1.0d);
XSSFWorkbook wb = new XSSFWorkbook(file);
I am trying to understand how it works.
The only source I could find is https://poi.apache.org/apidocs/org/apache/poi/openxml4j/util/ZipSecureFile.html
But I couldn't get a clear picture, as I am new to this concept.
What are the differences between
ZipSecureFile.setMinInflateRatio(-1.0d);
vs
ZipSecureFile.setMinInflateRatio(0.009);
vs
ZipSecureFile.setMinInflateRatio(0);

Zip bomb detection works the following way:
While uncompressing, it checks the ratio compressedBytes/uncompressedBytes, and if this falls below a configured threshold (MinInflateRatio), a bomb is detected.
So if the ratio compressedBytes/uncompressedBytes is 0.01d, for example, that means the compressed file is 100 times smaller than the uncompressed one without information loss. In other words, the compressed file stores the same information in only 1% of the size the uncompressed one needs. This is really unlikely with real-life data.
To show how unlikely it is, we could take a look (in a popular-scientific manner) at how compression works:
Let's have the string
"This is a test for compressing having long count of characters which always occurs the same sequence."
This needs 101 bytes. Let's say this string occurs 100,000 times in the file. Uncompressed, it would then need 10,100,000 bytes. A compression algorithm would give that string an ID, store the string only once mapped to that ID, and store the ID 100,000 times at the places where the string occurs in the file. That would need 101 bytes + 1 byte (ID) + 100,000 bytes (IDs) = 100,102 bytes, which gives a ratio compressedBytes/uncompressedBytes of about 0.00991.
So if we set the MinInflateRatio lower than 0.01d, then we accept even such unlikely compression rates.
We can also see that the ratio compressedBytes/uncompressedBytes can only be 0 if compressedBytes is 0. But that would mean there are no bytes to uncompress. So a MinInflateRatio of 0.0d can never be reached nor undershot; with a MinInflateRatio of 0.0d, all possible ratios will be accepted.
Of course, a MinInflateRatio of -1.0d can also never be reached nor undershot, so using it likewise accepts all possible ratios.
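To make the check concrete, here is a minimal sketch of the ratio test described above; it is a simplification for illustration, not Apache POI's actual implementation:

// Simplified illustration of the zip bomb check described above;
// not the actual Apache POI implementation.
static void checkInflateRatio(long compressedBytes, long uncompressedBytes,
                              double minInflateRatio) {
    if (uncompressedBytes == 0) {
        return; // nothing inflated yet, so there is no ratio to check
    }
    double ratio = (double) compressedBytes / (double) uncompressedBytes;
    if (ratio < minInflateRatio) {
        throw new IllegalStateException(
                "Zip bomb detected! Ratio " + ratio + " is below " + minInflateRatio);
    }
}

With minInflateRatio set to 0.0d or -1.0d, the condition ratio < minInflateRatio can never be true, so every entry is accepted.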

How can the sound level at specific timestamps of a .wav file be measured using the Sound API in Java?

After searching for over 12 hours, I was unable to find anything regarding this. All I could find is how to use functions from the Sound API to measure and change the volume of the device, not the .wav file. It would be great if someone could advise us on how to get and/or change the volume at specific timestamps of a .wav file itself, thank you very much!
Even if it is not possible to change the audio of the .wav file itself, we need to know at least how to measure the volume level at the specific timestamps.
To deal with the amplitude of the sound signal, you will have to inspect the PCM data held in the .wav file. Unfortunately, the Java Clip does not expose the PCM values. Java makes the individual PCM data values available through the AudioInputStream class, but you have to read the data points sequentially. A code example is available at The Java Tutorials: Using Files and Format Converters.
Here's a block quote of the relevant portion of the page:
Suppose you're writing a sound-editing application that allows the
user to load sound data from a file, display a corresponding waveform
or spectrogram, edit the sound, play back the edited data, and save
the result in a new file. Or perhaps your program will read the data
stored in a file, apply some kind of signal processing (such as an
algorithm that slows the sound down without changing its pitch), and
then play the processed audio. In either case, you need to get access
to the data contained in the audio file. Assuming that your program
provides some means for the user to select or specify an input sound
file, reading that file's audio data involves three steps:
1. Get an AudioInputStream object from the file.
2. Create a byte array in which you'll store successive chunks of data from the file.
3. Repeatedly read bytes from the audio input stream into the array. On each iteration, do something useful with the bytes in the array (for example, you might play them, filter them, analyze them, display them, or write them to another file).
The following code snippet outlines these steps:
int totalFramesRead = 0;
File fileIn = new File(somePathName);
// somePathName is a pre-existing string whose value was
// based on a user selection.
try {
    AudioInputStream audioInputStream =
        AudioSystem.getAudioInputStream(fileIn);
    int bytesPerFrame =
        audioInputStream.getFormat().getFrameSize();
    if (bytesPerFrame == AudioSystem.NOT_SPECIFIED) {
        // some audio formats may have unspecified frame size
        // in that case we may read any amount of bytes
        bytesPerFrame = 1;
    }
    // Set an arbitrary buffer size of 1024 frames.
    int numBytes = 1024 * bytesPerFrame;
    byte[] audioBytes = new byte[numBytes];
    try {
        int numBytesRead = 0;
        int numFramesRead = 0;
        // Try to read numBytes bytes from the file.
        while ((numBytesRead =
                audioInputStream.read(audioBytes)) != -1) {
            // Calculate the number of frames actually read.
            numFramesRead = numBytesRead / bytesPerFrame;
            totalFramesRead += numFramesRead;
            // Here, do something useful with the audio data that's
            // now in the audioBytes array...
        }
    } catch (Exception ex) {
        // Handle the error...
    }
} catch (Exception e) {
    // Handle the error...
}
END OF QUOTE
The values themselves will need another conversion step before they are PCM. If the file uses 16-bit encoding (the most common case), you will have to concatenate two bytes to make a single PCM value. With two bytes, the range of values is from -32768 to 32767 (a range of 2^16).
It is very common to normalize these values to floats that range from -1 to 1. This is done by float division using 32767 or 32768 in the denominator. I'm not really sure which is more correct (or how much getting this exactly right matters). I just use 32768 to avoid getting a result less than -1 if the signal has any data points that hit the minimum possible value.
I'm not entirely clear on how to convert the PCM values to decibels. I think the formulas are out there for relative adjustments, such as lowering the volume by 6 dB. Changing the volume is a matter of multiplying each PCM value by the factor that matches the volume change you wish to make.
As far as measuring the volume at a given point: since PCM signal values can range pretty widely as they zigzag back and forth across zero, the usual operation is to average the squares of many PCM values and take the square root of that average, a process referred to as getting the root mean square (RMS). The number of values to include in an RMS calculation can vary; I think the main consideration is to make the rolling window large enough to cover at least one period of the lowest frequency present in the signal.
There are some good tutorials at the HackAudio site. This link is for the RMS calculation.
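As an illustration of both steps (concatenating byte pairs into normalized floats, then averaging), here is a small sketch; it assumes 16-bit little-endian mono PCM, so check your file's AudioFormat (isBigEndian(), getChannels()) before relying on it:

// Sketch: convert 16-bit little-endian PCM bytes to floats in [-1, 1)
// and compute the RMS over one buffer. Assumes mono, 16-bit samples.
static double rms(byte[] audioBytes, int numBytesRead) {
    int sampleCount = numBytesRead / 2;   // two bytes per 16-bit sample
    if (sampleCount == 0) {
        return 0.0;
    }
    double sumOfSquares = 0.0;
    for (int i = 0; i < sampleCount; i++) {
        int lo = audioBytes[2 * i] & 0xFF;    // low byte, treated as unsigned
        int hi = audioBytes[2 * i + 1];       // high byte, keeps the sign
        int sample = (hi << 8) | lo;          // 16-bit signed PCM value
        double normalized = sample / 32768.0; // scale to [-1, 1)
        sumOfSquares += normalized * normalized;
    }
    return Math.sqrt(sumOfSquares / sampleCount);
}

Calling this once per buffer inside the read loop above gives a rolling volume measurement, one RMS value per chunk.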

How to correctly save stream of bytes to file in Java/Scala? How to fix wrongly saved stream?

Story
While conducting an experiment, I was saving a stream of random bytes generated by a hardware RNG device. After the experiment was finished, I realized that the saving method was incorrect. I hope I can find a way to fix the corrupted file so that I obtain the correct stream of random numbers back.
Example
The problem can be explained with the following simple example.
Let's say I have a stream of random numbers in an input file randomInput.bin. I will simulate the stream of random numbers coming from the hardware RNG device by sending the input file to stdout via cat. I found two ways to save this stream to a file:
A) Harmless saving method
This method gives me exactly the original stream of random bytes.
import scala.sys.process._
import java.io.File
val res = ("cat randomInput.bin" #> new File(outputFile))!
B) Saving method leading to corruption
Unfortunately, this is the original saving method I chose.
import scala.sys.process._
import java.io.PrintWriter
val randomBits = "cat randomInput.bin".!!
val out = new PrintWriter(outputFile)
out.println(randomBits)
if (out != null) {
  out.close()
  Seq("chmod", "600", outputFile).!
}
The file saved using method B) is still binary; however, it is approximately 2x larger than the file saved by method A). Further analysis shows that the stream of random bits is significantly less random.
Summary
I suspect that saving method B) adds something to almost every byte, but understanding this is beyond my expertise in Java/Scala I/O.
I would very much appreciate it if somebody explained to me the low-level difference between methods A) and B). The goal is to revert the changes created by saving method B) and obtain the original stream of random bytes.
Thank you very much in advance!
The problem is probably that println is meant for text: the raw bytes are decoded into a String and then re-encoded as text on output, using a charset that may use multiple bytes for some or all characters, depending on the encoding. That round trip is not byte-preserving.
If the file is exactly 2x larger than it should be, then you've probably got a null byte every other byte, which could be easy to fix. Otherwise, it may be harder to figure out what you would need to do to recover the binary data. Viewing the corrupted file in a hex editor may help you see what happened. Either way, I think it may be easier to just generate new random data and save it correctly.
Especially if this is for an experiment, if your random data has been corrupted and then fixed, it may be harder to justify that the data is truly random compared to just generating it properly in the first place.
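For reference, a byte-safe save never converts the data to characters at all. A minimal Java sketch (the file names are illustrative):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class RawCopy {
    public static void main(String[] args) throws IOException {
        // Copy raw bytes with no charset decoding/encoding in between.
        try (InputStream in = new FileInputStream("randomInput.bin");
             OutputStream out = new FileOutputStream("randomOutput.bin")) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n); // write exactly the bytes read
            }
        }
    }
}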

Cap'n Proto - Finding Message Size in Java

I am using a TCP Client/Server to send Cap'n Proto messages from C++ to Java.
Sometimes the receiving buffer may be overfilled or underfilled, and to handle these cases we need to know the message size.
When I check the size of the buffer in Java I get 208 bytes; however, calling
MyModel.MyMessage.STRUCT_SIZE.total()
returns 4 (I'm not sure what unit of measure is being used here).
I notice that 4 divides into 208 exactly 52 times, but I don't know of a significant conversion factor using 52.
How do I check the message size in Java?
MyMessage.STRUCT_SIZE represents the constant size of that struct itself (measured in 8-byte words), but if the struct contains non-trivial fields (like Text, Data, List, or other structs) then those take up space too, and the amount of space they take is not constant (e.g. a Text field takes space according to how long the string is).
Generally you should try to let Cap'n Proto directly write to / read from the appropriate ByteChannels, so that you don't have to keep track of sizes yourself. However, if you really must compute the size of a message ahead of time, you could do so with something like:
ByteBuffer[] segments = message.getSegmentsForOutput();
int total = (segments.length / 2 + 1) * 8; // segment table
for (ByteBuffer segment : segments) {
    total += segment.remaining();
}
// now `total` is the total number of bytes that will be
// written when the message is serialized.
On the C++ side, you can use capnp::computeSerializedSizeInWords() from serialize.h (and multiply by 8).
But again, you really should structure your code to avoid this, by using the methods of org.capnproto.Serialize with streaming I/O.
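A hedged sketch of that streaming approach on the sender side; the Serialize.write(WritableByteChannel, MessageBuilder) call is from the capnproto-java runtime, but verify the exact signature against the version you use, and the socket setup here is purely illustrative:

import org.capnproto.MessageBuilder;
import org.capnproto.Serialize;

import java.io.IOException;
import java.net.Socket;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

public class MessageSender {
    static void send(Socket socket, MessageBuilder message) throws IOException {
        WritableByteChannel channel = Channels.newChannel(socket.getOutputStream());
        // Serialize writes the segment table and all segments itself,
        // so no manual size bookkeeping is needed on the sender side.
        Serialize.write(channel, message);
    }
}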

Store a number as ASCII text in Java?

It's probably a stupid question but here's the thing. I was reading this question:
Storing 1 million phone numbers
and the accepted answer was what I was thinking: using a trie. In the comments, Matt Ball suggested:
I think storing the phone numbers as ASCII text and compressing is a very reasonable suggestion
Problem: how do I do that in Java? And does "ASCII text" stand for String?
For in-memory storage as indicated in the question:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStreamWriter out = new OutputStreamWriter(
        new GZIPOutputStream(baos), "US-ASCII");
for (String number : numbers) {
    out.write(number);
    out.write('\n');
}
out.close(); // closing finishes the GZIP stream; without it the data is truncated
byte[] data = baos.toByteArray();
But as Pete remarked: this may be good for memory efficiency, but you can't really do anything with the data afterwards, so it's not really very useful.
Yes, ASCII means Strings in this case. You can store compressed data in Java using the java.util.zip.GZIPOutputStream.
In answer to an implied, but different, question:
Q: You have 1 billion phone numbers and you need to send them over a low-bandwidth connection. You only need to send whether each phone number is in the collection or not. (No other information is required.)
A: This is the general approach:
1. First sort the list, if it's not sorted already.
2. From the lowest number, find regions of continuous numbers. Send the start of the region and which numbers are taken; this can be stored as a BitSet (1 bit per possible number). Send the phone number at the start together with the BitSet, and begin a new region whenever the gap is more than some threshold. (A sketch follows this list.)
3. Write the stream to a compressed data set.
4. Test this to compare with simply sending all the numbers.
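A minimal sketch of step 2 under assumed details; the region layout, the DataOutputStream framing, and the Deflater wrapper are illustrative choices, not part of the original answer:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.BitSet;
import java.util.zip.DeflaterOutputStream;

public class PhoneRegionEncoder {
    // Encode one dense region: the starting number followed by one bit
    // per possible number in the region (1 = number is present).
    static byte[] encodeRegion(long regionStart, long[] numbersInRegion) throws IOException {
        BitSet present = new BitSet();
        for (long n : numbersInRegion) {
            present.set((int) (n - regionStart)); // offset from region start; assumed to fit in an int
        }
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (DataOutputStream out =
                 new DataOutputStream(new DeflaterOutputStream(baos))) {
            out.writeLong(regionStart);
            byte[] bits = present.toByteArray();
            out.writeInt(bits.length);
            out.write(bits);
        } // closing finishes the deflater
        return baos.toByteArray();
    }
}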
You can use Strings in a sorted TreeMap. One million numbers is not very much and will use about 64 MB. I don't see the need for a more complex solution.
The latest versions of Java can store ASCII text efficiently by using a byte[] instead of a char[]; however, the overhead of your data structure is likely to be larger.
If you need to store phone numbers as keys, you could store them with the assumption that large ranges will be continuous. As such, you could store them like
NavigableMap<String, PhoneDetails[]>
In this structure, the key defines the start of a range, and you have one PhoneDetails entry per number. This need not be much bigger than the references to the PhoneDetails themselves (which is the minimum).
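A hedged sketch of how a lookup against such a map could work; PhoneDetails is the hypothetical record from the answer, and fixed-length number strings are assumed so that lexicographic key order matches numeric order:

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class PhoneDirectory {
    // Key: first number of a continuous block; value: details for each
    // number in the block, indexed by offset from the block start.
    private final NavigableMap<String, PhoneDetails[]> ranges = new TreeMap<>();

    PhoneDetails lookup(String number) {
        Map.Entry<String, PhoneDetails[]> e = ranges.floorEntry(number);
        if (e == null) return null;
        long offset = Long.parseLong(number) - Long.parseLong(e.getKey());
        PhoneDetails[] details = e.getValue();
        return (offset >= 0 && offset < details.length) ? details[(int) offset] : null;
    }
}

class PhoneDetails { /* hypothetical per-number data */ }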
BTW: You can invent very efficient structures if you don't need access to the data. If you never access the data, don't keep it in memory; in fact, you can just discard it, as it won't ever be needed.
A lot depends on what you want to do with the data and why you have it in memory at all.
You can use a DeflaterOutputStream writing to a ByteArrayOutputStream, which will be very small, but not very useful.
I suggest using DeflaterOutputStream as it is more lightweight/faster/smaller than GZIPOutputStream.
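A minimal sketch of that suggestion, mirroring the GZIP example above (numbers is assumed to be an iterable of phone-number strings):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;

static byte[] deflateNumbers(Iterable<String> numbers) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (DeflaterOutputStream out = new DeflaterOutputStream(baos)) {
        for (String number : numbers) {
            out.write(number.getBytes(StandardCharsets.US_ASCII));
            out.write('\n'); // one number per line, as in the GZIP example
        }
    } // closing finishes the deflater
    return baos.toByteArray();
}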
Java Strings are UTF-16 internally, not ASCII; you have to specify the encoding (e.g. US-ASCII) when converting to bytes if you want to manipulate ASCII text.

How to compress a String in Java?

I use GZIPOutputStream or ZipOutputStream to compress a String (my string.length() is less than 20), but the compressed result is longer than the original string.
On some sites, I found people saying that this is because my original string is too short; GZIPOutputStream can be used to compress longer strings.
So, can somebody give me some help to compress a String?
My function is like:
String compress(String original) throws Exception {
}
Update:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

//ZipUtil
public class ZipUtil {
    public static String compress(String str) throws IOException {
        if (str == null || str.length() == 0) {
            return str;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(out);
        gzip.write(str.getBytes());
        gzip.close();
        return out.toString("ISO-8859-1");
    }

    public static void main(String[] args) throws IOException {
        String string = "admin";
        System.out.println("after compress:");
        System.out.println(ZipUtil.compress(string));
    }
}
The result is:
Compression algorithms almost always have some form of space overhead, which means that they are only effective when compressing data which is sufficiently large that the overhead is smaller than the amount of saved space.
Compressing a string which is only 20 characters long is not easy, and it is not always possible. If you have repetition, Huffman coding or simple run-length encoding might be able to compress it, but probably not by very much.
When you create a String, you can think of it as a list of chars; this means that for each character in your String, you need to support all the possible values of char. From the Sun docs:
char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
If you have a reduced set of characters you want to support, you can write a simple compression algorithm, which is analogous to binary -> decimal -> hex radix conversion. You go from 65,536 (or however many characters your target system supports) down to 26 (alphabetic) / 36 (alphanumeric) etc.
I've used this trick a few times, for example encoding timestamps as text (target base 36, source base 10) - just make sure you have plenty of unit tests!
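A tiny illustration of the radix trick, using Java's built-in base-36 support (Character.MAX_RADIX is 36); the timestamp is just an example value:

// A 13-digit decimal timestamp shrinks to 8-9 base-36 characters.
long timestamp = System.currentTimeMillis();
String encoded = Long.toString(timestamp, 36); // e.g. 13 digits -> 8 chars
long decoded = Long.parseLong(encoded, 36);    // round-trips exactly
assert decoded == timestamp;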
If the passwords are more or less "random", you are out of luck; you will not be able to get a significant reduction in size.
But: why do you need to compress the passwords? Maybe what you need is not compression but some sort of hash value? If you just need to check whether a name matches a given password, you don't need to save the password; you can save the hash of the password instead. To check whether a typed-in password matches a given name, you can build the hash value the same way and compare it to the saved hash. As a hash (Object.hashCode()) is an int, you will be able to store all 20 password hashes in 80 bytes.
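A hedged sketch of that idea, with hashCode() as in the answer (note that real password storage should use a salted, slow cryptographic hash, not hashCode()):

public class PasswordHashes {
    private final int[] savedHashes; // e.g. 20 ints = 80 bytes

    PasswordHashes(String[] passwords) {
        savedHashes = new int[passwords.length];
        for (int i = 0; i < passwords.length; i++) {
            savedHashes[i] = passwords[i].hashCode(); // store only the hash
        }
    }

    boolean matches(int index, String typedPassword) {
        // Rebuild the hash the same way and compare.
        return savedHashes[index] == typedPassword.hashCode();
    }
}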
Your friend is correct. Both gzip and ZIP are based on DEFLATE. This is a general-purpose algorithm and is not intended for encoding small strings.
If you need this, a possible solution is a custom encoding and decoding HashMap<String, String>. This can allow you to do a simple one-to-one mapping:
HashMap<String, String> toCompressed, toUncompressed;
String compressed = toCompressed.get(uncompressed);
// ...
String uncompressed = toUncompressed.get(compressed);
Clearly, this requires setup, and is only practical for a small number of strings.
Huffman coding might help, but only if you have a lot of frequent characters in your small String.
The DEFLATE algorithm used by ZIP and gzip is a combination of LZ77-style dictionary compression and Huffman coding. You can use either of these techniques separately.
The compression is based on two factors:
the repetition of substrings in your original text (LZ77): if there are a lot of repetitions, the compression will be efficient. This algorithm has good performance for compressing long plain text, since words are often repeated
the distribution of characters in the text (Huffman): the more unbalanced the character frequencies are, the more efficient the compression will be
In your case, you should try the dictionary (LZ77) approach only. Used on its own, the text can be compressed without adding meta-information, which is probably better for short-string compression.
For the Huffman algorithm, the coding tree has to be sent along with the compressed text. So, for a small text, the result can be larger than the original text, because of the tree.
Huffman encoding is a sensible option here. Gzip and friends do this, but the way they work is to build a Huffman tree for the input, send that, then send the data encoded with the tree. If the tree is large relative to the data, there may be no net saving in size.
However, it is possible to avoid sending a tree: instead, you arrange for the sender and receiver to already have one. It can't be built specifically for every string, but you can have a single global tree used to encode all strings. If you build it from the same language as the input strings (English or whatever), you should still get good compression, although not as good as with a custom tree for every input.
If you know that your strings are mostly ASCII you could convert them to UTF-8.
byte[] bytes = string.getBytes("UTF-8");
This may reduce the memory size by about 50%. However, you will get a byte array out and not a string. If you are writing it to a file though, that should not be a problem.
To convert back to a String:
private final Charset UTF8_CHARSET = Charset.forName("UTF-8");
...
String s = new String(bytes, UTF8_CHARSET);
You don't see any compression happening for your String, as you need at least a couple of hundred bytes to get real compression using GZIPOutputStream or ZipOutputStream. Your String is too small. (I don't understand why you would require compression for it.)
Check the conclusion from this article:
The article also shows how to compress
and decompress data on the fly in
order to reduce network traffic and
improve the performance of your
client/server applications.
Compressing data on the fly, however,
improves the performance of
client/server applications only when
the objects being compressed are more
than a couple of hundred bytes. You
would not be able to observe
improvement in performance if the
objects being compressed and
transferred are simple String objects,
for example.
Take a look at the Huffman algorithm.
https://codereview.stackexchange.com/questions/44473/huffman-code-implementation
The idea is that each character is replaced with a sequence of bits, depending on its frequency in the text (the more frequent the character, the shorter the sequence).
You can read your entire text and build a table of codes, for example:
Symbol  Code
a       0
s       10
e       110
m       111
The algorithm builds a symbol tree based on the text input. The more variety of characters you have, the worse the compression will be.
But depending on your text, it could be effective.
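A compact sketch of building such a code table in Java; this is a standard Huffman construction for illustration, not the implementation from the linked review:

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanTable {
    // A tree node: leaves carry a symbol, internal nodes only a frequency.
    static class Node implements Comparable<Node> {
        final int freq;
        final Character symbol;
        final Node left, right;

        Node(int freq, Character symbol, Node left, Node right) {
            this.freq = freq;
            this.symbol = symbol;
            this.left = left;
            this.right = right;
        }

        public int compareTo(Node other) {
            return Integer.compare(freq, other.freq);
        }
    }

    static Map<Character, String> buildCodes(String text) {
        // Count character frequencies.
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        // Repeatedly merge the two least frequent nodes into one.
        PriorityQueue<Node> queue = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getValue(), e.getKey(), null, null));
        }
        while (queue.size() > 1) {
            Node a = queue.poll(), b = queue.poll();
            queue.add(new Node(a.freq + b.freq, null, a, b));
        }
        Map<Character, String> codes = new HashMap<>();
        assignCodes(queue.poll(), "", codes);
        return codes;
    }

    // Walk the tree: a left edge appends '0', a right edge appends '1'.
    static void assignCodes(Node node, String prefix, Map<Character, String> codes) {
        if (node == null) return;
        if (node.symbol != null) {
            codes.put(node.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        assignCodes(node.left, prefix + "0", codes);
        assignCodes(node.right, prefix + "1", codes);
    }
}

For example, buildCodes("aaaassseem") assigns the shortest code to 'a', matching the table above in spirit (exact codes depend on tie-breaking).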
