What are the extra bytes in the ZipEntry used for? (Java)

Java's zip library exposes ZipEntry.getExtra(), which returns either a byte[] or null. What are these extra bytes used for? I'm aware of this question about archive attributes linked to getExtra(), but it doesn't explain what else the field is used for. Furthermore, that question indicates that some things stored in the extra field cannot be set from Java.

The answer can be found in the first two links in the java.util.zip package documentation.
The basic zip format is described in the PKWARE zip specification. Sections 4.5 and 4.6 describe what the extra data is.
The extra data is a series of zero or more blocks. Each block starts with a little-endian 16-bit ID, followed by a little-endian 16-bit count of the bytes that immediately follow.
The PKWARE specification describes some well known extra data record IDs. The Info-Zip format describes many more.
So, if you wanted to check whether a zip entry includes an ASi Unix Extra Field, you might read it like this:
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Note: getExtra() may return null; check that before wrapping.
ByteBuffer extraData = ByteBuffer.wrap(zipEntry.getExtra());
extraData.order(ByteOrder.LITTLE_ENDIAN);
while (extraData.hasRemaining()) {
    int id = extraData.getShort() & 0xffff;     // 16-bit block ID
    int length = extraData.getShort() & 0xffff; // byte count of the block body
    if (id == 0x756e) {                         // ASi Unix Extra Field
        int crc32 = extraData.getInt();
        short permissions = extraData.getShort();
        int linkLengthOrDeviceNumbers = extraData.getInt();
        int userID = extraData.getChar();
        int groupID = extraData.getChar();
        ByteBuffer linkDestBuffer = extraData.slice().limit(length - 14);
        String linkDestination =
            StandardCharsets.UTF_8.decode(linkDestBuffer).toString();
        // ...
        extraData.position(extraData.position() + (length - 14)); // skip past the link data
    } else {
        extraData.position(extraData.position() + length);        // skip unknown blocks
    }
}

Unable to create a torrent's info hash

I'm having trouble finding the issue with how I'm generating the corresponding info hash for a torrent file. This is the code I have so far:
InputStream input = null;
try {
    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    input = new FileInputStream(file);
    StringBuilder builder = new StringBuilder();
    while (!builder.toString().endsWith("4:info")) {
        builder.append((char) input.read()); // It's ASCII anyway.
    }
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    for (int data; (data = input.read()) > -1; output.write(data));
    sha1.update(output.toByteArray(), 0, output.size() - 1);
    this.infoHash = sha1.digest();
    System.out.println(new String(Hex.encodeHex(infoHash)));
} catch (NoSuchAlgorithmException | IOException e) {
    e.printStackTrace();
} finally {
    if (input != null) try { input.close(); } catch (IOException ignore) {}
}
Below are my expected and actual hashes:
Expected: d4d44272ee5f5bf887a9c85ad09ae957bc55f89d
Actual: 4d753474429d817b80ff9e0c441ca660ec5d2450
The torrent I'm trying to generate an info hash for can be found here (Ubuntu 14.04 Desktop amd64).
Let me know if I can provide any more info, thanks!
Exceptions contain four useful bits of info: type, message, trace, and cause. You're tossing away three of the four. Also, code is part of a process, and when an error occurs, that process generally cannot be finished at all; yet on exceptions your process continues. Stop doing this; you've written code that only hurts you. Remove the try and the catch, and add a throws clause to your method signature. If you can't, the go-to default is throw new RuntimeException("Unhandled", e); (and if your IDE generated the catch-and-print boilerplate, update its templates). This is shorter, does not destroy any of the four interesting bits of info, and ends the process.
Separately, the notion that the right way to handle an InputStream's close() throwing an IOException is to just ignore it is also false. It is highly unlikely to throw, but if it does, you should assume you didn't read every byte. Since that would be one explanation for a mismatched hash, ignoring it is misguided.
Finally, use the proper language construct: there is a try-with-resources statement that would work far better here.
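A minimal sketch of those two suggestions combined; the method name and signature are made up, not taken from your code:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

byte[] computeInfoHash(File file) throws IOException, NoSuchAlgorithmException {
    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    // try-with-resources closes the stream even when something throws,
    // and any IOException propagates instead of being swallowed.
    try (InputStream input = new FileInputStream(file)) {
        // ... read and hash as before ...
    }
    return sha1.digest();
}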
You're calling update with output.size() - 1, which lops off the last byte read. Here that appears intentional (it drops the outer dictionary's closing e that follows the info value), but it deserves a comment; otherwise it reads as an off-by-one bug.
Reading bytes into a builder and then, for every byte, converting the whole builder to a string just to check its last characters is incredibly inefficient; even for a file as small as 1 MB that causes quite a grind.
Reading a single byte at a time from a raw FileInputStream is similarly inefficient, because every read causes file access (reading 1 byte is as expensive as reading a whole buffer full, so it's about 50,000 times slower than it needs to be).
Here's how to do this with a somewhat newer API; look how much nicer this code reads. It also behaves better under erroneous conditions:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

byte[] data = Files.readAllBytes(Paths.get(fileName));
var search = "4:info".getBytes(StandardCharsets.US_ASCII);
int searchIdx = -1;
for (int i = 0; searchIdx == -1 && i <= data.length - search.length; i++) {
    for (int j = 0; j < search.length; j++) {
        if (data[i + j] != search[j]) break;
        if (j == search.length - 1) searchIdx = i + search.length; // first byte after the marker
    }
}
if (searchIdx == -1) throw new IOException("Input torrent file does not contain marker");
var sha1 = MessageDigest.getInstance("SHA-1");
// Hash from just after "4:info" up to, but not including, the outer
// dictionary's closing 'e' (this assumes info is the last top-level key).
sha1.update(data, searchIdx, data.length - searchIdx - 1);
byte[] hash = sha1.digest();
StringBuilder hex = new StringBuilder();
for (byte h : hash) hex.append(String.format("%02x", h));
System.out.println(hex);
While rzwitserloot's answer covers some general Java coding practices, there are also correctness issues at the BitTorrent level.
You are using string processing for a structured data format, which is pretty much the same mistake as attempting to parse HTML with regexes. In this case you're assuming that the only place the data can contain the string 4:info is the top-level dictionary key for the info dict, and that the info dictionary is the last entry of the top-level dictionary.
Instead you should use a proper bencoding decoder/encoder to extract the info dict and then re-encode it for hashing, or a tokenizer to find the exact byte range covering the info value (a sketch of the latter follows below). Note that you need a validating parser for the former, while the latter can also handle some out-of-spec edge cases. Unless you want to implement them yourself, you may want to find a library that handles this for you.
Additionally, you're assuming that the data is ASCII. bencoding is in fact a binary format that merely tends to use ASCII by convention in some places. You should operate on byte arrays directly; your input is already binary and the hasher expects binary, so it is quite circuitous to go through strings.
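As a rough illustration of the tokenizer approach, here is a minimal sketch (the class and method names are hypothetical, and it is not a validating parser): it walks the top-level bencoded dictionary and returns the exact byte range of the info value, so the hash no longer depends on where the string 4:info happens to appear or on info being the last key.

import java.nio.charset.StandardCharsets;

class InfoRange {
    // Returns {start, end} (end exclusive) of the value mapped to the
    // top-level "info" key, or null if the key is absent.
    static int[] find(byte[] d) {
        int i = 0;
        if (d[i++] != 'd') throw new IllegalArgumentException("not a bencoded dict");
        while (d[i] != 'e') {
            int keyStart = i;
            i = skipValue(d, i); // dictionary keys are byte strings, e.g. "4:info"
            String key = new String(d, keyStart, i - keyStart, StandardCharsets.US_ASCII);
            int valStart = i;
            i = skipValue(d, i);
            if (key.equals("4:info")) return new int[] { valStart, i };
        }
        return null;
    }

    // Advances past one bencoded value starting at index i; returns the index after it.
    static int skipValue(byte[] d, int i) {
        switch (d[i]) {
        case 'i': // integer: i<digits>e
            while (d[++i] != 'e') {}
            return i + 1;
        case 'l':
        case 'd': // list or dict: elements until the matching 'e'
            i++;
            while (d[i] != 'e') i = skipValue(d, i);
            return i + 1;
        default: { // byte string: <length>:<bytes>
            int len = 0;
            while (d[i] != ':') len = len * 10 + (d[i++] - '0');
            return i + 1 + len;
        }
        }
    }
}

With that, the hashing step becomes:

int[] r = InfoRange.find(data);
var sha1 = MessageDigest.getInstance("SHA-1");
sha1.update(data, r[0], r[1] - r[0]);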

fwrite() in C and readInt() in Java differ in endianness

Native code, writing the number 27 using fwrite():
#include <errno.h>
#include <stdio.h>

int main()
{
    int a = 27;
    FILE *fp;
    fp = fopen("/data/tmp.log", "w");
    if (!fp)
        return -errno;
    fwrite(&a, 4, 1, fp);
    fclose(fp);
    return 0;
}
Reading back the data (27) using DataInputStream.readInt():
public int readIntDataInputStream() throws IOException
{
    String filePath = "/data/tmp.log";
    InputStream is = new FileInputStream(filePath);
    DataInputStream dis = new DataInputStream(is);
    int k = dis.readInt();
    Log.i(TAG, "Size : " + k);
    return 0;
}
Output:
Size : 452984832
Well, that in hex is 0x1b000000, and 0x1b is 27. But readInt() is reading the data as big-endian while my native code is writing it as little-endian, so instead of 0x0000001b I get 0x1b000000.
Is my understanding correct? Has anyone come across this problem before?
From the Javadoc for readInt():
This method is suitable for reading bytes written by the writeInt method of interface DataOutput
If you want to read something written by a C program you'll have to do the byte swapping yourself, using the facilities in java.nio. I've never done this, but I believe you would read the data into a ByteBuffer, set the buffer's order to ByteOrder.LITTLE_ENDIAN, and then either create an IntBuffer view over the ByteBuffer if you have an array of values, or just use ByteBuffer#getInt() for a single value.
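A minimal sketch of that approach, using the file path from the question (the class name is made up):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadLittleEndianInt {
    public static void main(String[] args) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get("/data/tmp.log"));
        ByteBuffer buf = ByteBuffer.wrap(raw);
        buf.order(ByteOrder.LITTLE_ENDIAN); // match the C program's byte order
        int k = buf.getInt();               // now yields 0x0000001b == 27
        System.out.println("Size : " + k);
    }
}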
All that aside, I agree with @EJP that the external format for the data should be big-endian for greatest compatibility.
There are multiple issues in your code:
You assume that the size of int is 4, which is not necessarily true; since you want to deal with 32-bit ints, you should use int32_t or uint32_t.
You must open the file in binary mode to write binary data reliably. The above code would fail on Windows for less trivial output. Use fopen("/data/tmp.log", "wb").
You must deal with endianness. You are using the file to exchange data between different platforms that may have different native endianness and/or endian-specific APIs. Java uses big-endian, aka network byte order, so you should convert the values on the C side, e.g. with the htobe32() function from <endian.h> (or the standard htonl()). This is unlikely to have a significant impact on performance on the PC side, as such functions usually expand inline, possibly to a single instruction, and most of the time will be spent waiting for I/O anyway.
Here is a modified version of the code:
#include <endian.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t a = htobe32(27);
    FILE *fp = fopen("/data/tmp.log", "wb");
    if (!fp) {
        return errno;
    }
    fwrite(&a, sizeof a, 1, fp);
    fclose(fp);
    return 0;
}
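With the value now written in big-endian on the C side, the original Java readInt() call returns 27 unchanged, since DataInput is specified to read big-endian.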

Why is the String's size smaller than the original ByteArrayOutputStream's size when I call toString()?

I'm facing a curious problem. Some code is better than a long story:
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
buffer.write(...); // I write byte[] data
// In debugger I can see that buffer's count = 449597
String szData = buffer.toString();
int iSizeData = buffer.size();
// But here, szData's count = 240368
// & iSizeData = 449597
So my question is: why doesn't szData contain all the buffer's data? (Only one thread runs this code.) After that kind of operation, I don't want szData.charAt(iSizeData - 1) to crash!
EDIT: szData.getBytes().length = 450566. There are encoding problems, I think. Is it better to use a byte[] instead of a String, after all?
In Java, char ≠ byte. ByteArrayOutputStream.toString() decodes the bytes using the platform's default character encoding, and depending on that encoding a single char can correspond to several bytes, which is why the String's length differs from the byte count. Work either with bytes (binary data) or with characters (strings); you cannot (easily) switch between them.
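A small sketch illustrating this (the two bytes are the UTF-8 encoding of 'é'; toString(Charset) needs Java 10+): the decoded length depends on the charset, and only a one-byte-per-char charset such as ISO-8859-1 preserves the count.

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        buffer.write(new byte[] { (byte) 0xC3, (byte) 0xA9 }); // UTF-8 encoding of 'é'
        System.out.println(buffer.size());                                 // 2 (bytes)
        System.out.println(buffer.toString(StandardCharsets.UTF_8).length());      // 1 (char)
        System.out.println(buffer.toString(StandardCharsets.ISO_8859_1).length()); // 2 (chars)
    }
}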
For String operations like C's strncasecmp, use the methods of the String class, e.g. String.compareToIgnoreCase(String str). Also have a look at the StringUtils class from the Apache Commons Lang library.
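For instance, a trivial sketch:

String a = "Hello", b = "HELLO";
System.out.println(a.compareToIgnoreCase(b) == 0); // prints true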

Protobuf in Java: Uid.Lists from Accumulo Values go in but they don't come out

I am attempting to write UIDs into an Accumulo Value (of a key/value pair) as a protobuf Uid.List and read them back.
Specifically:
import org.apache.accumulo.examples.wikisearch.protobuf.Uid;
import org.apache.accumulo.examples.wikisearch.protobuf.Uid.List.Builder;
I use the following code to write the Uid.List, where seq is the list of UID strings to store:
Builder uidBuilder = Uid.List.newBuilder();
uidBuilder.setIGNORE(false);
for (String entry : seq) {
    uidBuilder.addUID(entry);
}
Uid.List uidList = uidBuilder.build();
Value newAccumuloValue = new Value(uidList.toByteArray());
This seems to work fine.
When I try to read the Uid.List back out of the Accumulo Value, where the value is a protobuf Uid.List, it's a no-go:
byte[] byteVal = value.get(); // retrieve the Accumulo Value's bytes containing the Uid.List
Uid.List uids = Uid.List.parseFrom(byteVal);
while (counter <= counter) {
    String uidStr = uids.getUID(counter).toString();
    System.out.println(uidStr);
}
I keep getting "tag errors"
I would really like to understand how to read out what goes in.
Thanks!
I would suggest changing the second bit of code to something along the lines of:
byte[] byteVal = value.get();
Uid.List uids = Uid.List.parseFrom(byteVal);
int count = uids.getUIDCount();
for (int i = 0; i < count; i++) {
    String uidStr = uids.getUID(i).toString();
    System.out.println(uidStr);
}
This code does work, as long as the UIDs in your list are properly cleaned before the list is built by protobuf. If the data contains characters (such as Unicode nulls) that protobuf uses as part of the list format, then parsing the data back out will break, because data characters will be recognized as format characters that don't properly match the data schema. I would start by taking a look at your data and ensuring that it meets the quality and cleanliness standards for what you are trying to achieve.

Wav comparison, same file

I'm currently stumped. I've been looking around and experimenting with audio comparison. I've found quite a bit of material, and a ton of references to different libraries and methods to do it.
As of now, I've taken Audacity and exported a 3-minute wav file called "long.wav", then split the first 30 seconds of that into a file called "short.wav". I figured that somewhere along the line I could log the data (log.txt) through Java for each and should be able to see at least some visual similarities among the values... here's some code.
Main method:
int totalFramesRead = 0;
File fileIn = new File(filePath);
BufferedWriter writer = new BufferedWriter(new FileWriter(outPath));
writer.flush();
writer.write("");
try {
    AudioInputStream audioInputStream =
        AudioSystem.getAudioInputStream(fileIn);
    int bytesPerFrame =
        audioInputStream.getFormat().getFrameSize();
    if (bytesPerFrame == AudioSystem.NOT_SPECIFIED) {
        // some audio formats may have unspecified frame size
        // in that case we may read any amount of bytes
        bytesPerFrame = 1;
    }
    // Set an arbitrary buffer size of 1024 frames.
    int numBytes = 1024 * bytesPerFrame;
    byte[] audioBytes = new byte[numBytes];
    try {
        int numBytesRead = 0;
        int numFramesRead = 0;
        // Try to read numBytes bytes from the file.
        while ((numBytesRead =
                audioInputStream.read(audioBytes)) != -1) {
            // Calculate the number of frames actually read.
            numFramesRead = numBytesRead / bytesPerFrame;
            totalFramesRead += numFramesRead;
            // Here, do something useful with the audio data that's
            // now in the audioBytes array...
            if (totalFramesRead <= 4096 * 100) {
                Complex[][] results = PerformFFT(audioBytes);
                int[][] lines = GetKeyPoints(results);
                DumpToFile(lines, writer);
            }
        }
    } catch (Exception ex) {
        // Handle the error...
    }
    audioInputStream.close();
} catch (Exception e) {
    // Handle the error...
}
writer.close();
Then PerformFFT:
public static Complex[][] PerformFFT(byte[] data) throws IOException
{
    final int totalSize = data.length;
    int amountPossible = totalSize / Harvester.CHUNK_SIZE;
    // When turning into frequency domain we'll need complex numbers:
    Complex[][] results = new Complex[amountPossible][];
    // For all the chunks:
    for (int times = 0; times < amountPossible; times++) {
        Complex[] complex = new Complex[Harvester.CHUNK_SIZE];
        for (int i = 0; i < Harvester.CHUNK_SIZE; i++) {
            // Put the time domain data into a complex number with imaginary part as 0:
            complex[i] = new Complex(data[(times * Harvester.CHUNK_SIZE) + i], 0);
        }
        // Perform FFT analysis on the chunk:
        results[times] = FFT.fft(complex);
    }
    return results;
}
At this point I've tried logging everywhere: audioBytes before transforms, Complex values, and FFT results.
The problem: no matter what values I log, the log.txt of each wav file is completely different. I don't understand it. Given that I took short.wav from long.wav (and they have all the same properties), there should be a very heavy similarity among either the raw wav byte[] data, the Complex[][] FFT data, or something thus far.
How can I possibly compare these files if the data isn't even close to similar at any point in these calculations?
I know I'm missing quite a bit of knowledge with regards to audio analysis, and this is why I come to the board for help! Thanks for any info, help, or fixes you can offer!!
Have you looked at MARF? It is a well-documented Java library used for audio recognition.
It is used to recognize speakers (for transcription or for securing software), but the same features should be usable to classify audio samples. I'm not familiar with it, but it looks like you'd want to use the FeatureExtraction class to extract an array of features from each audio sample and then create a unique id.
For 16-bit audio, 3e-05 isn't really that different from zero. So a file of values that small is pretty much the same as a file of zeros (maybe missing equality by some tiny rounding errors).
ADDED:
For your comparison, read in and plot (using some Java plotting library) a portion of each of the two waveforms, starting past the portion that's mostly (close to) zero.
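Here is a small sketch of that suggestion, assuming the wav is 16-bit little-endian mono PCM (the usual Audacity export); no particular plotting library is assumed, so it just prints a window of samples that could be fed to one.

import java.io.File;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

public class PlotPrep {
    public static void main(String[] args) throws Exception {
        AudioInputStream in = AudioSystem.getAudioInputStream(new File("long.wav"));
        byte[] raw = in.readAllBytes(); // frame data only; Java 9+
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        // Skip the mostly-silent start, then print a short window of samples.
        buf.position(buf.position() + 2 * 44100); // skip ~1 second at 44.1 kHz mono
        for (int i = 0; i < 40 && buf.remaining() >= 2; i++) {
            System.out.println(buf.getShort()); // one 16-bit sample
        }
    }
}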
For debugging, I think you'd better use MATLAB to plot things out, since MATLAB is much more powerful for dealing with this problem.
Use wavread on the file and stft to get the short-time Fourier transform, which is a matrix of complex numbers. Then abs(Matrix) gives the magnitude of each complex number; show the image with imshow(abs(Matrix),[]).
I don't know how you'd compare the whole file and the 30-second clip, though (by looking at the STFT images?).
I don't know how you are comparing both audio files, but looking at services that offer music recognition (like TrackId or MotoID): these services take a small sample of the music you're hearing (10-20 seconds) and then process it on their servers. I theorize that they keep samples that long or shorter, and that they have a database of patterns for those samples (or calculate them on the fly); in your case, the patterns are Fourier transforms. You may need to break your long audio file into chunks the same size as, or smaller than, your sample data. In the first case, you may find a specific chunk that most resembles the pattern in your sample data; in the second case, your smaller chunks may resemble parts of your sample data, and you can calculate the probability that the sample belongs to a given audio file.
I think you are looking for acoustic fingerprinting.
It's hard, and there are libraries to do it.
If you want to implement it yourself, this is a whitepaper on the Shazam algorithm.
