Seek backwards to deal with invalid MP3 header? - java

I am writing an application to decode MP3 frames. I am having difficulty finding the headers.
An MP3 header is 32 bits and begins with the signature: 11111111 111
In the inner loop below, I look for this signature. When it is found, I read the next two bytes and pass the last three bytes of the header to a custom MpegFrame class. The class verifies the integrity of the header and parses information from it. MpegFrame.isValid() returns a boolean indicating the validity/integrity of the frame's header. If the header is invalid, the outer loop runs again and the signature is searched for again.
When executing my program with a CBR MP3, only some of the frames are found. The application reports many invalid frames.
I believe the invalid frames may be a result of bits being skipped. The header is 4 bytes long. When the header is determined to be invalid, I skip all 4 bytes and resume seeking the signature from the fifth byte. In a case like the following: 11111111 11101101 11111111 11101001, the header signature is found in the first two bytes, but the third byte contains an error which invalidates the header. If I skip all four bytes because I've determined the header beginning with the first byte is invalid, I miss the potential header starting with the third byte (as the third and fourth bytes contain a signature).
I cannot seek backwards in an InputStream, so my question is the following: When I determine a header starting with bytes 1 and 2 to be invalid, how do I run my signature finding loop starting with byte 2, rather than byte 5?
In the below code, b is the first byte of the possible header under consideration, b1 is the second byte, b2 is the third byte and b3 is the fourth byte.
int bytesRead = 0;
// 10 bytes of Tagv2
int j = 0;
byte[] tagv2h = new byte[10];
j = fis.read(tagv2h);
bytesRead += j;
ByteBuffer bb = ByteBuffer.wrap(new byte[]{tagv2h[6], tagv2h[7], tagv2h[8], tagv2h[9]});
bb.order(ByteOrder.BIG_ENDIAN);
int tagSize = bb.getInt();
byte[] tagv2 = new byte[tagSize];
j = fis.read(tagv2);
bytesRead += j;
while (bytesRead < MPEG_FILE.length()) {
    boolean foundHeader = false;
    // Seek frame
    int b = 0;
    int b1 = 0;
    while ((b = fis.read()) > -1) {
        bytesRead++;
        if (b == 255) {
            b1 = fis.read();
            if (b1 > -1) {
                bytesRead++;
                if (((b1 >> 5) & 0x7) == 0x7) {
                    System.out.println("Found header.");
                    foundHeader = true;
                    break;
                }
            }
        }
    }
    if (!foundHeader) {
        continue;
    }
    int b2 = fis.read();
    int b3 = fis.read();
    MpegFrame frame = new MpegFrame(b1, b2, b3, false);
    if (!frame.isValid()) {
        System.out.println("Invalid header # " + (bytesRead - 4));
        continue;
    }
}

You can wrap your input stream in a PushbackInputStream so that you can push back some bytes and re-parse them.
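A minimal sketch of that approach (class and method names here are hypothetical, not from the question's code): scan for the 0xFF byte, peek at the next byte, and unread() it so a failed candidate's second byte can start the next scan.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

public class HeaderScan {
    // Scan for the 11-bit frame sync (0xFF followed by top 3 bits set).
    // The byte after 0xFF is pushed back so the caller can re-read the
    // full header, and so a failed candidate does not swallow bytes.
    static int findSync(PushbackInputStream in) throws IOException {
        int b;
        while ((b = in.read()) != -1) {
            if (b == 0xFF) {
                int b1 = in.read();
                if (b1 == -1) {
                    return -1;
                }
                in.unread(b1); // push back whether or not it matched
                if ((b1 & 0xE0) == 0xE0) {
                    return b;  // signature found; caller reads the header
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x00, (byte) 0xFF, (byte) 0xED, (byte) 0xFF, (byte) 0xE9};
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(data), 4);
        System.out.println(findSync(in)); // prints 255
    }
}
```

With a pushback capacity of at least 3, the bytes after the sync byte of an invalid header can likewise be unread after validation fails, so the next scan resumes at the header's second byte instead of its fifth.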

I ended up writing a function to shift the bytes of an invalid header so it can be re-parsed. I call the function in a loop where I essentially Seek for valid frames.
Seek() returns true when a valid frame is found (elsewhere the last frame found by calling Seek() is stored). CheckHeader() verifies the integrity of a header. SkipAudioData() reads all the audio data of a frame, placing the stream marker at the end of the frame.
private boolean Seek() throws IOException {
    while (!(CheckHeader() && SkipAudioData())) {
        if (!ShiftHeader()) {
            return false;
        }
    }
    return true;
}

private boolean ShiftHeader() {
    try {
        if (bytesRead >= MPEG_FILE.length()) {
            return false;
        }
    } catch (Exception ex) {
        ex.printStackTrace();
        return false;
    }
    header[0] = header[1];
    header[1] = header[2];
    header[2] = header[3];
    try {
        int b = fis.read();
        if (b > -1) {
            header[3] = b;
            return true;
        }
    } catch (IOException ex) {
        return false;
    } catch (Exception ex) {
        return false;
    }
    return false;
}

Related

How to search for a sequence of bytes in a byte array of a bin file?

I'm trying to code a program where I can:
Load a file
Input a start and beginning offset addresses where to scan data from
Scan that offset range in search of specific sequence of bytes (such as "05805A6C")
Retrieve the offset of every match and write them to a .txt file
i66.tinypic.com/2zelef5.png
As the picture shows I need to search the file for "05805A6C" and then print to a .txt file the offset "0x21F0".
I'm using Java Swing for this. So far I've been able to load the file as a byte array, but I haven't found a way to search for a specific sequence of bytes, nor to restrict that search to a range of offsets.
This is my code that opens and reads the file into a byte array:
public class Read {
    static public byte[] readBytesFromFile() {
        try {
            JFileChooser chooser = new JFileChooser();
            int returnVal = chooser.showOpenDialog(null);
            if (returnVal == JFileChooser.APPROVE_OPTION) {
                FileInputStream input = new FileInputStream(chooser.getSelectedFile());
                byte[] data = new byte[input.available()];
                input.read(data);
                input.close();
                return data;
            }
            return null;
        } catch (IOException e) {
            System.out.println("Unable to read bytes: " + e.getMessage());
            return null;
        }
    }
}
And my code where I try to search among the bytes.
byte[] model = Read.readBytesFromFile();
String x = new String(model);
boolean found = false;
for (int i = 0; i < model.length; i++) {
    if (x.contains("05805A6C")) {
        found = true;
    }
}
if (found == true) {
    System.out.println("Yes");
} else {
    System.out.println("No");
}
Here's a bomb-proof1 way to search for a sequence of bytes in a byte array:
public boolean find(byte[] buffer, byte[] key) {
    for (int i = 0; i <= buffer.length - key.length; i++) {
        int j = 0;
        while (j < key.length && buffer[i + j] == key[j]) {
            j++;
        }
        if (j == key.length) {
            return true;
        }
    }
    return false;
}
There are more efficient ways to do this for large-scale searching; e.g. using the Boyer-Moore algorithm. However:
converting the byte array to a String and using Java string search is NOT more efficient, and it is potentially fragile depending on what encoding you use when converting the bytes to a string.
converting the byte array to a hexadecimal-encoded String is even less efficient ... and memory hungry ... though not fragile if you have enough memory. (You may need up to 5 times the file size in memory while doing the conversion ...)
1 - bomb-proof, modulo any bugs :-)
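If you also need the offset range and every match offset (as the question asks), a variant of the same loop can collect offsets. This is a sketch; hexToBytes and the range bounds below are illustrative, not part of the original answer:

```java
import java.util.ArrayList;
import java.util.List;

public class ByteSearch {
    // Parse a hex string like "05805A6C" into its byte values.
    static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Return the offset of every occurrence of key in buffer[from, to).
    static List<Integer> findAll(byte[] buffer, byte[] key, int from, int to) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = from; i <= Math.min(to, buffer.length) - key.length; i++) {
            int j = 0;
            while (j < key.length && buffer[i + j] == key[j]) {
                j++;
            }
            if (j == key.length) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 0x05, (byte) 0x80, 0x5A, 0x6C,
                       0x05, (byte) 0x80, 0x5A, 0x6C};
        for (int off : findAll(data, hexToBytes("05805A6C"), 0, data.length)) {
            System.out.println("0x" + Integer.toHexString(off)); // prints 0x1 then 0x5
        }
    }
}
```

The offsets can then be written to a .txt file with an ordinary PrintWriter.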
EDIT: The default charset differs from system to system, so you may get different results. Here is another approach:
String x = HexBin.encode(model);
String b = "05805A6C";
int index = 0;
while ((index = x.indexOf(b, index)) != -1) {
    System.out.println("0x" + Integer.toHexString(index / 2));
    index = index + 2;
}
...

Implement a function to check if a string/byte array follows utf-8 format

I am trying to solve this interview question.
After being given a clear definition of the UTF-8 format (ex: 1-byte: 0b0xxxxxxx, 2-byte: ...), I was asked to write a function to validate whether the input is valid UTF-8. Input will be a string/byte array; output should be yes/no.
I have two possible approaches.
First, if the input is a string, since UTF-8 is at most 4 byte, after we remove the first two characters "0b", we can use Integer.parseInt(s) to check if the rest of the string is at the range 0 to 10FFFF. Moreover, it is better to check if the length of the string is a multiple of 8 and if the input string contains all 0s and 1s first. So I will have to go through the string twice and the complexity will be O(n).
Second, if the input is a byte array (we can also use this method if the input is a string), we check if each 1-byte element is in the correct range. If the input is a string, first check the length of the string is a multiple of 8 then check each 8-character substring is in the range.
I know there are couple solutions on how to check a string using Java libraries, but my question is how I should implement the function based on the question.
Thanks a lot.
Let's first have a look at a visual representation of the UTF-8 design.
Now let's summarize what we have to do.
Loop over all characters of the string (each character being a byte).
We will need to apply a mask to each byte depending on the code point, as the x characters represent the actual code point. We will use the binary AND operator (&), which copies a bit to the result only if it is set in both operands.
The goal of applying a mask is to strip the payload bits so we can compare the byte's leading bits against each leading-byte pattern. We do the bitwise operation using 0b1xxxxxxx where 1 appears "bytes in sequence" times, and the other bits are 0.
We can then compare with the first byte to verify it is valid, and also determine the actual sequence length.
If the byte matches none of the cases, it is invalid and we return "No".
If we get out of the loop, every character was valid, hence the string is valid.
Make sure the comparison that returned true corresponds to the expected length.
The method would look like this :
public static final boolean isUTF8(final byte[] pText) {
    int expectedLength = 0;
    for (int i = 0; i < pText.length; i++) {
        if ((pText[i] & 0b10000000) == 0b00000000) {
            expectedLength = 1;
        } else if ((pText[i] & 0b11100000) == 0b11000000) {
            expectedLength = 2;
        } else if ((pText[i] & 0b11110000) == 0b11100000) {
            expectedLength = 3;
        } else if ((pText[i] & 0b11111000) == 0b11110000) {
            expectedLength = 4;
        } else if ((pText[i] & 0b11111100) == 0b11111000) {
            expectedLength = 5;
        } else if ((pText[i] & 0b11111110) == 0b11111100) {
            expectedLength = 6;
        } else {
            return false;
        }
        while (--expectedLength > 0) {
            if (++i >= pText.length) {
                return false;
            }
            if ((pText[i] & 0b11000000) != 0b10000000) {
                return false;
            }
        }
    }
    return true;
}
Edit: The actual method is not the original one (almost, but not) and is taken from here. The original one was not working properly, as per @EJP's comment.
A small solution for real world UTF-8 compatibility checking:
public static final boolean isUTF8(final byte[] inputBytes) {
    final String converted = new String(inputBytes, StandardCharsets.UTF_8);
    final byte[] outputBytes = converted.getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(inputBytes, outputBytes);
}
You can check the tests results:
@Test
public void testEncoding() {
    byte[] invalidUTF8Bytes1 = new byte[]{(byte) 0b10001111, (byte) 0b10111111};
    byte[] invalidUTF8Bytes2 = new byte[]{(byte) 0b10101010, (byte) 0b00111111};
    byte[] validUTF8Bytes1 = new byte[]{(byte) 0b11001111, (byte) 0b10111111};
    byte[] validUTF8Bytes2 = new byte[]{(byte) 0b11101111, (byte) 0b10101010, (byte) 0b10111111};

    assertThat(isUTF8(invalidUTF8Bytes1)).isFalse();
    assertThat(isUTF8(invalidUTF8Bytes2)).isFalse();
    assertThat(isUTF8(validUTF8Bytes1)).isTrue();
    assertThat(isUTF8(validUTF8Bytes2)).isTrue();
    assertThat(isUTF8("\u24b6".getBytes(StandardCharsets.UTF_8))).isTrue();
}
Test cases copy from https://codereview.stackexchange.com/questions/59428/validating-utf-8-byte-array
The CharsetDecoder might be what you are looking for:
@Test
public void testUTF8() throws CharacterCodingException {
    // the desired charset
    final Charset UTF8 = Charset.forName("UTF-8");
    // prepare decoder
    final CharsetDecoder decoder = UTF8.newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);

    byte[] bytes = new byte[48];
    new Random().nextBytes(bytes);
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    try {
        decoder.decode(buffer);
        fail("Should not be UTF-8");
    } catch (final CharacterCodingException e) {
        // noop, the test should fail here
    }

    final String string = "hallo welt!";
    bytes = string.getBytes(UTF8);
    buffer = ByteBuffer.wrap(bytes);
    final String result = decoder.decode(buffer).toString();
    assertEquals(string, result);
}
so your function might look like that:
public static boolean checkEncoding(final byte[] bytes, final String encoding) {
    final CharsetDecoder decoder = Charset.forName(encoding).newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    final ByteBuffer buffer = ByteBuffer.wrap(bytes);
    try {
        decoder.decode(buffer);
        return true;
    } catch (final CharacterCodingException e) {
        return false;
    }
}
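A quick usage sketch of that decoder-based check (the byte values are arbitrary examples, not from the answer):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // Same technique as above: a strict decoder that REPORTs instead of
    // replacing malformed input, so decoding doubles as validation.
    static boolean checkEncoding(byte[] bytes, String encoding) {
        CharsetDecoder decoder = Charset.forName(encoding).newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = "héllo".getBytes(StandardCharsets.UTF_8);
        byte[] invalid = {(byte) 0xC0, (byte) 0x20}; // lead byte with no continuation
        System.out.println(checkEncoding(valid, "UTF-8"));   // true
        System.out.println(checkEncoding(invalid, "UTF-8")); // false
    }
}
```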
public static boolean validUTF8(byte[] input) {
    int i = 0;
    // Check for BOM
    if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
            && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
        i = 3;
    }
    int end;
    for (int j = input.length; i < j; ++i) {
        int octet = input[i];
        if ((octet & 0x80) == 0) {
            continue; // ASCII
        }
        // Check for UTF-8 leading byte
        if ((octet & 0xE0) == 0xC0) {
            end = i + 1;
        } else if ((octet & 0xF0) == 0xE0) {
            end = i + 2;
        } else if ((octet & 0xF8) == 0xF0) {
            end = i + 3;
        } else {
            // Java only supports BMP so 3 is max
            return false;
        }
        while (i < end) {
            i++;
            octet = input[i];
            if ((octet & 0xC0) != 0x80) {
                // Not a valid trailing byte
                return false;
            }
        }
    }
    return true;
}
Well, I am grateful for the comments and the answer.
First of all, I have to agree that this is "another stupid interview question". It is true that in Java String is already encoded, so it will always be compatible with UTF-8. One way to check it is given a string:
public static boolean isUTF8(String s) {
    try {
        byte[] bytes = s.getBytes("UTF-8");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
        System.exit(-1);
    }
    return true;
}
However, since all printable strings are already in Unicode form, I never got a chance to trigger an error.
Second, if given a byte array, each byte will always be in the range -2^7 (0b10000000) to 2^7 - 1 (0b01111111), so it will always be in a valid UTF-8 range.
My initial understanding to the question was that given a string, say "0b11111111", check if it is a valid UTF-8, I guess I was wrong.
Moreover, Java does provide constructor to convert byte array to string, and if you are interested in the decode method, check here.
One more thing, the above answer would be correct given another language. The only improvement could be:
In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.
So 4 bytes would be enough.
I am definitely new to this, so correct me if I am wrong. Thanks a lot.

ByteBuffer returns null characters (character 0)

I'm trying to implement a simple client-server application, using NIO.
As an exercise, communication should be text-based and line-oriented. But when the server reads the bytes sent by the client, it gets nothing, or rather, the buffer is filled with a bunch of zeroes.
I'm using a selector, and this is the code triggered, when the channel is readable.
private void handleRead() throws IOException {
    System.out.println("Handler Read");
    while (lineIndex < 0) {
        buffer.clear();
        switch (channel.read(buffer)) {
            case -1:
                // Close the connection.
                return;
            case 0:
                System.out.println("Nothing to read.");
                return;
            default:
                System.out.println("Converting to String...");
                buffer.flip();
                bufferToString();
                break;
        }
    }
    // Do something with the line read.
}
In this snippet, lineIndex is an int holding the index at which the first \n occurred, when reading. It is initialized with -1, meaning there's no \n present.
The variable buffer references a ByteBuffer, and channel represents a SocketChannel.
To keep it simple, without Charsets and whatnot, this is how bufferToString is coded:
private void bufferToString() {
    char c;
    System.out.println("-- Buffer to String --");
    for (int i = builder.length(); buffer.remaining() > 1; ++i) {
        c = buffer.getChar();
        builder.append(c);
        System.out.println("Appending: " + c + "(" + (int) c + ")");
        if (c == '\n' && lineIndex < 0) {
            System.out.println("Found a new-line character!");
            lineIndex = i;
        }
    }
}
The variable builder holds a reference to a StringBuilder.
I expected getChar to do a reasonable conversion, but all I get in my output is a bunch (corresponding to half of the buffer capacity) of
Appending: (0)
Terminated by a
Nothing to read.
Any clues of what may be the cause? I have similar code in the client which is also unable to properly read anything from the server.
If it is of any help, here is a sample of what the writing code looks like:
private void handleWrite() throws IOException {
    buffer.clear();
    String msg = "Some message\n";
    for (int i = 0; i < msg.length(); ++i) {
        buffer.putChar(msg.charAt(i));
    }
    channel.write(buffer);
}
I've also confirmed that the result from channel.write is greater than zero, reassuring that the bytes are indeed written and sent.
Turns out, this was a buffer indexing problem. In the server, a flip() was missing before writing to the socket. In the client code, a few flip() were missing too, after reading and before writing. Now everything works as expected.
Current writing code (server side):
private void handleWrite() throws IOException {
    String s = extractLine();
    for (int i = 0, len = s.length(); i < len; ) {
        buffer.clear();
        while (buffer.remaining() > 1 && i < len) {
            buffer.putChar(s.charAt(i));
            ++i;
        }
        buffer.flip();
        channel.write(buffer);
    }
    // some other operations...
}
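The flip() discipline can be seen in isolation with a plain buffer round-trip. This sketch (not the actual client/server code) shows the sequence that was missing: putChar() fills the buffer, flip() makes the written bytes readable, and only then does getChar() see real data instead of zeroes.

```java
import java.nio.ByteBuffer;

public class FlipDemo {
    // Put chars into a buffer, flip it, and read them back with getChar().
    static String roundTrip(String msg) {
        ByteBuffer buffer = ByteBuffer.allocate(2 * msg.length());
        for (int i = 0; i < msg.length(); i++) {
            buffer.putChar(msg.charAt(i)); // writing side fills the buffer
        }
        buffer.flip(); // limit = bytes written, position = 0: ready to read
        StringBuilder builder = new StringBuilder();
        while (buffer.remaining() >= 2) {
            builder.append(buffer.getChar()); // reading side sees the same bytes
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // Without the flip(), getChar() would start at the write position and
        // return zeroes, which is exactly the symptom described above.
        System.out.print(roundTrip("Some message\n")); // prints Some message
    }
}
```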

Behaviour of ReadableByteChannel.read()

Does read() return -1 if EOF is reached during the read operation, or on the subsequent call? The Java docs aren't entirely clear on this and neither is the book I'm reading.
The following code from the book reads a file with three different types of values repeating: a double, a string of varying length, and a binary long. The buffer is supposed to fill at some random place in the middle of any of the values, and the code will handle it. What I don't understand is: if the -1 is returned during the read operation, the last values won't get output in the printf statement.
try (ReadableByteChannel inCh = Files.newByteChannel(file)) {
    ByteBuffer buf = ByteBuffer.allocateDirect(256);
    buf.position(buf.limit());
    int strLength = 0;
    byte[] strChars = null;
    while (true) {
        if (buf.remaining() < 8) {
            if (inCh.read(buf.compact()) == -1) {
                break;
            }
            buf.flip();
        }
        strLength = (int) buf.getDouble();
        if (buf.remaining() < 2 * strLength) {
            if (inCh.read(buf.compact()) == -1) {
                System.err.println("EOF found while reading the prime string.");
                break;
            }
            buf.flip();
        }
        strChars = new byte[2 * strLength];
        buf.get(strChars);
        if (buf.remaining() < 8) {
            if (inCh.read(buf.compact()) == -1) {
                System.err.println("EOF found reading the binary prime value.");
                break;
            }
            buf.flip();
        }
        System.out.printf("String length: %3s String: %-12s Binary Value: %3d%n",
                strLength, ByteBuffer.wrap(strChars).asCharBuffer(), buf.getLong());
    }
    System.out.println("\n EOF Reached.");
}
I suggest making a simple test to understand how it works, like this:
ReadableByteChannel in = Files.newByteChannel(Paths.get("1.txt"));
ByteBuffer b = ByteBuffer.allocate(100);
System.out.println(in.read(b));
System.out.println(in.read(b));
1.txt contains 1 byte, the test prints
1
-1

How to read the second column in a large file

I have a huge file with millions of columns, split by spaces, but it only has a limited number of rows:
examples.txt:
1 2 3 4 5 ........
3 1 2 3 5 .........
l 6 3 2 2 ........
Now, I just want to read in the second column:
2
1
6
How do I do that in Java with high performance?
Thanks
Update: the file is usually 1.4G containing hundreds of rows.
If your file is not statically structured, your only option is the naive one: read through the file byte sequence by byte sequence looking for newlines and grab the second column after each one. Use FileReader.
If your file were statically structured, you could calculate where in the file the second column would be for a given line and seek() to it directly.
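As a sketch of that seek-based idea (the fixed 8-byte record layout below is a made-up example matching the sample rows; real files would need their actual record length):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SeekDemo {
    // Read one character at a fixed (row, byte-offset) position.
    static char columnChar(RandomAccessFile raf, int row, int recordLength,
                           int colOffset) throws IOException {
        raf.seek((long) row * recordLength + colOffset);
        return (char) raf.read();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical fixed layout: each row is exactly 8 bytes ("c c c c\n"),
        // so column 1 (the second column) of row r starts at r * 8 + 2.
        Path f = Files.createTempFile("rows", ".txt");
        Files.write(f, "1 2 3 4\n3 1 2 3\nl 6 3 2\n".getBytes(StandardCharsets.US_ASCII));
        try (RandomAccessFile raf = new RandomAccessFile(f.toFile(), "r")) {
            for (int row = 0; row < 3; row++) {
                System.out.println(columnChar(raf, row, 8, 2)); // prints 2, 1, 6
            }
        }
        Files.delete(f);
    }
}
```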
I have to concur with @gene: try a BufferedReader and readLine first; it's simple and easy to code. Just be careful not to alias the backing array between the result of readLine and any substring operation you use. String.substring() is a particularly common culprit, and I have had multi-MB byte arrays locked in memory because a 3-char substring was referencing them.
Assuming ASCII, my preference when doing this is to drop down to the byte level. Use mmap to view the file as a ByteBuffer and then do a linear scan for 0x20 and 0x0A (assuming unix-style line separators). Then convert the relevant bytes to a String. If you are using an 8-bit charset it is extremely difficult to be faster than this.
If you are using Unicode the problem is sufficiently more complicated that I strongly urge you to use BufferedReader unless that performance really is unacceptable. If readLine() doesn't work, then consider just looping on a call to read().
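A rough sketch of that byte-level scan, using FileChannel.map() as the Java route to mmap (the scanning helper, its name, and the tiny sample file are illustrative assumptions):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MmapScan {
    // Linear scan of a byte view for 0x20 (space) and 0x0A (newline),
    // collecting the second space-separated column of each line.
    static String secondColumns(ByteBuffer map) {
        StringBuilder out = new StringBuilder();
        int col = 0;
        boolean take = false;
        while (map.hasRemaining()) {
            byte b = map.get();
            if (b == 0x20) {            // space: the next column starts
                col++;
                take = (col == 1);
            } else if (b == 0x0A) {     // newline: reset for the next row
                out.append('\n');
                col = 0;
                take = false;
            } else if (take) {
                out.append((char) b);   // assuming an 8-bit charset
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("cols", ".txt");
        Files.write(f, "1 2 3\n3 1 2\nl 6 3\n".getBytes(StandardCharsets.US_ASCII));
        try (RandomAccessFile raf = new RandomAccessFile(f.toFile(), "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.print(secondColumns(map)); // prints 2, 1, 6 on separate lines
        }
        Files.delete(f);
    }
}
```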
Regardless you should always specify the Charset when initialising a String from an external bytestream. This documents your charset assumption explicitly. So I recommend a minor modification to gene's suggestion, so one of:
int i = Integer.parseInt(new String(buffer, start, length, "US-ASCII"));
int i = Integer.parseInt(new String(buffer, start, length, "ISO-8859-1"));
int i = Integer.parseInt(new String(buffer, start, length, "UTF-8"));
as appropriate.
Here is a little state machine that uses a FileInputStream as its input and handles its own buffering. There is no locale conversion.
On my 7-year-old 1.4 GHz laptop with 1/2 GB of memory, it takes 48 seconds to go through 1.28 billion bytes of data. Buffers bigger than 4 KB seem to run slower.
On a 1-year-old MacBook with 4 GB it runs in 14 seconds. After the file is in cache it runs in 2.7 seconds. Again there is no difference with buffers bigger than 4 KB. This is the same 1.2 billion byte data file.
I expect memory-mapped IO would do better, but this is probably more portable.
It will fetch any column you tell it to.
import java.io.*;
import java.util.Random;

public class Test {

    public static class ColumnReader {

        private final InputStream is;
        private final int colIndex;
        private final byte[] buf;
        private int nBytes = 0;
        private int colVal = -1;
        private int bufPos = 0;

        public ColumnReader(InputStream is, int colIndex, int bufSize) {
            this.is = is;
            this.colIndex = colIndex;
            this.buf = new byte[bufSize];
        }

        /**
         * States for a tiny DFA to recognize columns.
         */
        private static final int START = 0;
        private static final int IN_ANY_COL = 1;
        private static final int IN_THE_COL = 2;
        private static final int WASTE_REST = 3;

        /**
         * Return value of colIndex'th column or -1 if none is found.
         *
         * @return value of column or -1 if none found.
         */
        public int getNext() {
            colVal = -1;
            bufPos = parseLine(bufPos);
            return colVal;
        }

        /**
         * If getNext() returns -1, this can be used to check if
         * we're at the end of file.
         *
         * Otherwise the column did not exist.
         *
         * @return end of file indication
         */
        public boolean atEoF() {
            return nBytes == -1;
        }

        /**
         * Parse a line.
         * The buffer is automatically refilled if p reaches the end.
         * This uses a standard DFA pattern.
         *
         * @param p position of line start in buffer
         * @return position of next unread character in buffer
         */
        private int parseLine(int p) {
            colVal = -1;
            int iCol = -1;
            int state = START;
            for (;;) {
                if (p == nBytes) {
                    try {
                        nBytes = is.read(buf);
                    } catch (IOException ex) {
                        nBytes = -1;
                    }
                    if (nBytes == -1) {
                        return -1;
                    }
                    p = 0;
                }
                byte ch = buf[p++];
                if (ch == '\n') {
                    return p;
                }
                switch (state) {
                    case START:
                        if ('0' <= ch && ch <= '9') {
                            if (++iCol == colIndex) {
                                state = IN_THE_COL;
                                colVal = ch - '0';
                            } else {
                                state = IN_ANY_COL;
                            }
                        }
                        break;
                    case IN_THE_COL:
                        if ('0' <= ch && ch <= '9') {
                            colVal = 10 * colVal + (ch - '0');
                        } else {
                            state = WASTE_REST;
                        }
                        break;
                    case IN_ANY_COL:
                        if (ch < '0' || ch > '9') {
                            state = START;
                        }
                        break;
                    case WASTE_REST:
                        break;
                }
            }
        }
    }

    public static void main(String[] args) {
        final String fn = "data.txt";
        if (args.length > 0 && args[0].equals("--create-data")) {
            PrintWriter pw;
            try {
                pw = new PrintWriter(fn);
            } catch (FileNotFoundException ex) {
                System.err.println(ex.getMessage());
                return;
            }
            Random gen = new Random();
            for (int row = 0; row < 100; row++) {
                int rowLen = 4 * 1024 * 1024 + gen.nextInt(10000);
                for (int col = 0; col < rowLen; col++) {
                    pw.print(gen.nextInt(32));
                    pw.print((col < rowLen - 1) ? ' ' : '\n');
                }
            }
            pw.close();
        }
        FileInputStream fis;
        try {
            fis = new FileInputStream(fn);
        } catch (FileNotFoundException ex) {
            System.err.println(ex.getMessage());
            return;
        }
        ColumnReader cr = new ColumnReader(fis, 1, 4 * 1024);
        int val;
        long start = System.currentTimeMillis();
        while ((val = cr.getNext()) != -1) {
            System.out.print('.');
        }
        long stop = System.currentTimeMillis();
        System.out.println("\nelapsed = " + (stop - start) / 1000.0);
    }
}
