How Do I Code Large Static Byte Arrays In Java? - java

I'm writing a code generator that is replaying events recorded during a packet capture.
The JVM is pretty limited, it turns out: a single method's bytecode can't exceed 64KB. So I added all kinds of trickery to make my code generator split up Java methods.
But now I have a new problem. I was taking a number of byte[] arrays and making them static variables in my class, e.g.:
public class myclass {
    private static byte[] byteArray = { 0x3c, 0x3f, ...
        ...
    };
    private static byte[] byteArray2 = { 0x1a, 0x20, ...
        ...
    };
    ...
    private static byte[] byteArray_n = { 0x0a, 0x0d, ...
        ...
    };
}
Now I get the error: "The code for the static initializer is exceeding the 65535 bytes limit".
I DO NOT WANT TO HAVE AN EXTERNAL FILE AND READ IN THE DATA FROM THERE. I WANT TO USE CODE GENERATED IN A SINGLE FILE.
What can I do? Can I declare the arrays outside the class? Or should I be using a string with unicode for the values 128-255 (e.g. \u009c instead of (byte)0x9c)? Or am I the only person in the world right now that wants to use statically initialised data?
UPDATE
The technique I'm now using is auto-creation of functions like the following:
private byte[] byteArray_6() {
    String localString = "\u00ff\u00d8\u00ff\u00e0\u0000\u0010JFIF\u0000" +
        "(0%()(\u00ff\u00db\u0000C\u0001\u0007\u0007\u0007\n\u0008\n\u0013\n" +
        "\u0000\u00b5\u0010\u0000\u0002\u0001\u0003\u0003\u0002\u0004\u0003";
    byte[] localBuff = new byte[ localString.length() ];
    for ( int localInt = 0; localInt < localString.length(); localInt++ ) {
        localBuff[localInt] = (byte)localString.charAt(localInt);
    }
    return localBuff;
}
Note: Java keeps on surprising. You'd think you could just encode every value in the range 0-255 as \u00XX (where XX is the 2-character hex representation). But you'd be wrong. The Java compiler actually thinks \u000A is a literal "\n" in your code - which breaks the compilation of your source code. So your strings can be littered with Unicode escapes but you'll have to use "\n" and "\r" instead of \u000a and \u000d respectively. And it doesn't hurt to put printable characters as they are in the strings instead of the 6 character Unicode escape representation.
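For reference, a minimal sketch of the kind of escaping helper a generator can use when emitting such string literals (the method name and the exact special cases here are illustrative, not the actual generator code):
private static String escapeByte(int b) {
    switch (b) {
        case 0x0a: return "\\n";    // the six-character escape for 0x0a would break compilation
        case 0x0d: return "\\r";    // same problem for 0x0d
        case 0x22: return "\\\"";   // a raw quote would end the literal
        case 0x5c: return "\\\\";   // a raw backslash must be doubled
        default:
            if (b >= 0x20 && b < 0x7f) {
                return String.valueOf((char) b);    // printable ASCII as-is
            }
            return String.format("\\u%04x", b);     // everything else as a Unicode escape
    }
}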

Generally, you would put the data in a literal String and then have a method which decodes that to a byte[]. toByteArray() is of limited use, as UTF-8 won't produce all possible byte sequences and some values don't appear at all.
This technique is quite popular when trying to produce small object code. Removing huge sequences of array-initialisation code will also help start-up time.
Off the top of my head:
public static byte[] toBytes(String str) {
    char[] src = str.toCharArray();
    int len = src.length;
    byte[] buff = new byte[len];
    for (int i=0; i<len; ++i) {
        buff[i] = (byte)src[i];
    }
    return buff;
}
More compact schemes are available. For instance, you could limit string character contents to [1, 127] (0 is encoded in a non-normalised two-byte form in class files, for really bad reasons). Or something more complicated. I believe JDK 8 will have a public API for Base64 decoding, which isn't too bad and is nicely standardised.
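That JDK 8 API is java.util.Base64. A sketch of what generated code could look like with it (class and field names are made up; the constant is just the ff d8 ff e0 00 10 "JFIF" prefix from the question, base64-encoded):
import java.util.Base64;

public class GeneratedData {
    // Decoded once at class initialisation; only the short string literal
    // ends up in the constant pool, so the static initializer stays tiny.
    private static final byte[] byteArray_6 =
            Base64.getDecoder().decode("/9j/4AAQSkZJRg==");
}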

Declare an ArrayList and populate it in a static initializer block.

Maybe you can use nested classes for storing the static arrays.
This is not the best approach in terms of performance, but I think you could get it working with minimal changes to your code.
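A rough sketch of that idea (the layout is mine, untested): each nested class gets its own static initializer, so each one has its own 64KB bytecode budget.
public class myclass {
    private static final class Part1 {
        static final byte[] DATA = { 0x3c, 0x3f /* ... a few thousand entries at most */ };
    }
    private static final class Part2 {
        static final byte[] DATA = { 0x1a, 0x20 /* ... */ };
    }

    // The outer class just exposes the pieces; its own initializer stays tiny.
    private static byte[] byteArray  = Part1.DATA;
    private static byte[] byteArray2 = Part2.DATA;
}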

Related

Equivalent of MemorySegment.getUtf8String for UTF-16

I'm porting my JNA-based library to "pure" Java using the Foreign Function and Memory API (JEP 424) in JDK 19.
One frequent use case my library handles is reading (null-terminated) Strings from native memory. For most *nix applications, these are "C Strings" and the MemorySegment.getUtf8String() method is sufficient to the task.
Native Windows Strings, however, are stored in UTF-16 (LE). Referenced as arrays of TCHAR or as "Wide Strings", they are treated similarly to "C Strings" except that each character consumes 2 bytes.
JNA provides a Native.getWideString() method for this purpose which invokes native code to efficiently iterate over the appropriate character set.
I don't see a UTF-16 equivalent to the getUtf8String() (and corresponding set...()) optimized for these Windows-based applications.
I can work around the problem with a few approaches:
If I'm reading from a fixed size buffer, I can create a new String(bytes, StandardCharsets.UTF_16LE) and:
If I know the memory was cleared before being filled, use trim()
Otherwise split() on the null delimiter and extract the first element
If I'm just reading from a pointer offset with no knowledge of the total size (or a very large total size I don't want to instantiate into a byte[]) I can iterate character-by-character looking for the null.
While certainly I wouldn't expect the JDK to provide native implementations for every character set, I would think that Windows represents a significant enough usage share to support its primary native encoding alongside the UTF-8 convenience methods. Is there a method to do this that I haven't discovered yet? Or are there any better alternatives than the new String() or character-based iteration approaches I've described?
Since Java’s char is a UTF-16 unit, there’s no need for special “wide string” support in the Foreign API, as the conversion (which may be a mere copying operation in some cases) already exists:
public static String fromWideString(MemorySegment wide) {
    var cb = wide.asByteBuffer().order(ByteOrder.nativeOrder()).asCharBuffer();
    int limit = 0; // check for zero termination
    for(int end = cb.limit(); limit < end && cb.get(limit) != 0; limit++) {}
    return cb.limit(limit).toString();
}

public static MemorySegment toWideString(String s, SegmentAllocator allocator) {
    MemorySegment ms = allocator.allocateArray(ValueLayout.JAVA_CHAR, s.length() + 1);
    ms.asByteBuffer().order(ByteOrder.nativeOrder()).asCharBuffer().put(s).put('\0');
    return ms;
}
This is not using UTF-16LE specifically, but the current platform’s native order, which is usually the intended thing on a platform with native wide strings. Of course, when running on Windows x86 or x64, this will result in the UTF-16LE encoding.
Note that CharBuffer implements CharSequence which implies that for a lot of use cases you can omit the final toString() step when reading a wide string and effectively process the memory segment without a copying step.
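A possible round trip with these two helpers, assuming a JDK 19 confined MemorySession as the allocator (in JDK 19 a MemorySession also serves as a SegmentAllocator; this preview API changed names in later releases):
try (MemorySession session = MemorySession.openConfined()) {
    MemorySegment wide = toWideString("C:\\Temp", session); // UTF-16 in native byte order, null-terminated
    String back = fromWideString(wide);                     // "C:\Temp"
}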
A charset decoder provides a way to convert a null-terminated UTF-16LE wide-string MemorySegment to a String on Windows using the Foreign Memory API. This may not be any different from, or an improvement over, your workaround suggestions, as it still involves scanning the resulting character buffer for the null position.
public static String toJavaString(MemorySegment wide) {
    return toJavaString(wide, StandardCharsets.UTF_16LE);
}

public static String toJavaString(MemorySegment segment, Charset charset) {
    // JDK Panama only handles UTF-8, it does strlen() scan for 0 in the segment
    // which is valid as all code points of 2 and 3 bytes lead with high bit "1".
    if (StandardCharsets.UTF_8 == charset)
        return segment.getUtf8String(0);
    // if (StandardCharsets.UTF_16LE == charset) {
    //     return Holger answer
    // }
    // This conversion is convoluted: MemorySegment->ByteBuffer->CharBuffer->String
    CharBuffer cb = charset.decode(segment.asByteBuffer());
    // cb.array() isn't valid unless cb.hasArray() is true so use cb.get() to
    // find a null terminator character, ignoring it and the remaining characters
    final int max = cb.limit();
    int len = 0;
    while (len < max && cb.get(len) != '\0')
        len++;
    return cb.limit(len).toString();
}
Going the other way String -> null terminated Windows wide MemorySegment:
public static MemorySegment toCString(SegmentAllocator allocator, String s, Charset charset) {
    // "==" is OK here as StandardCharsets.UTF_8 == Charset.forName("UTF8")
    if (StandardCharsets.UTF_8 == charset)
        return allocator.allocateUtf8String(s);
    // else if (StandardCharsets.UTF_16LE == charset) {
    //     return Holger answer
    // }
    // For MB charsets it is safer to append terminator '\0' and let JDK append
    // appropriate byte[] null termination (typically 1,2,4 bytes) to the segment
    return allocator.allocateArray(JAVA_BYTE, (s+"\0").getBytes(charset));
}

/** Convert Java String to Windows Wide String format */
public static MemorySegment toWideString(String s, SegmentAllocator allocator) {
    return toCString(allocator, s, StandardCharsets.UTF_16LE);
}
Like you, I'd also like to know if there are better approaches than the above.

What is exactly done here? Trying to transfer code from Java to NodeJS

I'm currently trying to move some encoding code over from Java to NodeJS.
At the moment the Java code is as follows:
public static final char[] chars = "0123456789abcdef".toCharArray();

public static String sha1Digest(String str) {
    try {
        MessageDigest instance = MessageDigest.getInstance("SHA-1");
        instance.reset();
        instance.update(str.getBytes(StandardCharsets.UTF_8));
        return lastEncode(instance.digest());
    } catch (NoSuchAlgorithmException e) {
        throw new RuntimeException(e);
    }
}

public static String lastEncode(byte[] bArr) {
    StringBuilder encoded = new StringBuilder(bArr.length * 2);
    for (byte b : bArr) {
        encoded.append(chars[(b >> 4) & 15]);
        encoded.append(chars[b & 15]);
    }
    return encoded.toString();
}
The initial parameter passed to the sha1Digest function is a string that consists of a URL appended with a secret key.
Currently, I'm trying to transfer the code over to NodeJS, where I have this code (for now):
async function sha1Digest(str) {
    try {
        const sha1 = crypto.createHmac("SHA1");
        const hmac = sha1.update(new Buffer(str, 'utf-8'));
        return encoder(hmac.digest());
    } catch (e) {
        console.dir(e);
    }
}

async function lastEncode(bArr) {
    let chars = "0123456789abcdef".split('')
    let sb = '';
    for (b in bArr) {
        sb = sb + (chars[(b >> 4) & 15]);
        sb = sb + (chars[b & 15]);
    }
    return sb;
}
Sadly tho, I have no understanding of what the part in the for loop in lastEncode does.
Is anybody able to help me out with this, and also verify that the sha1Digest function seems correct in the NodeJS?
Much appreciated!
lastEncode turns a byte array into hex nibbles. It turns the array: new byte[] {10, 16, (byte) 255} into the string "0a10ff". (0a is hex notation for 10, ff is hex notation for 255, etc - if this sounds like gobbledygook to you, the web has many tutorials on hexadecimal :P).
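To see the nibble arithmetic in isolation (plain Java, reusing lastEncode from the question):
byte[] sample = new byte[] {10, 16, (byte) 255};
System.out.println(lastEncode(sample));   // prints "0a10ff"
// (b >> 4) & 15 extracts the high nibble, b & 15 the low nibble;
// the & 15 also strips the sign extension that negative bytes get in Java.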
Your javascript translation messes up because you're joining on ",". More generally, to do that 'bytes to nibbles' operation from before, see this SO answer.
Just test the lastEncode function by itself. Run it in java, then run your javascript port, and ensure the exact same string is produced in both variants. Only then, move on to the hashing part.
NB: To be clear, this protocol is rather idiotic - you can just hash the byte array itself, there is no need to waste a ton of time turning that into hex nibbles (which is always exactly 2x as large as the input) and then hashing that. But, presumably, you can't mess with the protocol at this point. But if you can, change it. It'll be faster, simpler to explain, and less code. Win-win-win.
EDIT: NB: You also are using a different hash algorithm in the javascript side (HMAC-SHA1 is not the same as just SHA1).

Can a packed C structure and function be ported to java?

In the past I have written code which handles incoming data from a serial port. The data has a fixed format.
Now I want to migrate this code to java (android). However, I see many obstacles.
The actual code is more complex, but I have a simplified version here:
#define byte unsigned char
#define word unsigned short

#pragma pack(1);
struct addr_t
{
    byte foo;
    word bar;
};
#pragma pack();

bool RxData( byte val )
{
    static byte buffer[20];
    static int idx = 0;
    buffer[idx++] = val;
    return ( idx == sizeof(addr_t) );
}
The RxData function is called every time a byte is received. When the complete chunk of data is in, it returns true.
Some of the obstacles:
The data types used are not available in Java. In other threads it is recommended to use larger data types, but in this case that is not a workable solution.
The size of the structure is in this case exactly 3 bytes. That's also why the #pragma statement is important. Otherwise the C compiler might "optimize" it for memory use, with a different size as a result.
Java also doesn't have a sizeof function and I have found no alternative for this kind of situation.
I could replace the 'sizeof' with a fixed value of 3, but that would be very bad practice IMO.
Is it at all possible to write such a code in java? Or is it wiser to try to add native c source into Android Studio?
Your C code has its problems too. Technically, you do not know how big a char and a short are. You probably want uint8_t and uint16_t respectively. Also, I'm not sure how portable packing is.
In Java, you need a class. The class might as well tell you how many bytes you need to initialise it.
class Addr
{
    private byte foo;
    private short bar;

    public final static int bufferBytes = 3;

    public int getUnsignedFoo()
    {
        return (int)foo & 0xff;
    }
    public int getUnsignedBar()
    {
        return (int)bar & 0xffff;
    }
}
Probably a class for the buffer too although there may already be a suitable class in the standard library.
class Buffer
{
    private final static int maxSize = 20;
    private byte[] bytes = new byte[maxSize];
    private int idx = 0;

    private boolean rxData(byte b)
    {
        bytes[idx++] = b;
        return idx == Addr.bufferBytes;
    }
}
To answer the question about the hardcodedness of the 3, this is actually the better way to do it, because the specification of your protocol should say "one byte for foo and two bytes for bar", not "a packed C struct with a char and a short in it". One way to deserialise the buffer is like this:
public class Addr
{
    // All the stuff from above

    public Addr(byte[] buffer)
    {
        foo = buffer[0];
        bar = someFunctionThatGetsTheEndiannessRight(buffer[1], buffer[2]);
    }
}
I have left the way bar is calculated deliberately vague because it depends on your platform as much as anything. You can do it simply with bit shifts, e.g.
(((short)buffer[1] & 0xff) << 8) | ((short)buffer[2] & 0xff)
However, there are better options available. For example, you can use a java.nio.ByteBuffer, which has the machinery to cope with endian issues.
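For instance, a minimal sketch of the ByteBuffer variant of that constructor, assuming the device sends the fields little-endian (swap the ByteOrder if it does not):
public Addr(byte[] buffer)
{
    java.nio.ByteBuffer bb = java.nio.ByteBuffer.wrap(buffer)
            .order(java.nio.ByteOrder.LITTLE_ENDIAN);
    foo = bb.get();       // 1 byte
    bar = bb.getShort();  // 2 bytes, byte order handled by the ByteBuffer
}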

Why String receiver's size is smaller than original ByteArrayOutputStream's size when I call toString()

I'm facing a curious problem. Some code is better than a long story:
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
buffer.write(...); // I write byte[] data
// In debugger I can see that buffer's count = 449597
String szData = buffer.toString();
int iSizeData = buffer.size();
// But here, szData's count = 240368
// & iSizeData = 449597
So my question is: why doesn't szData contain all the buffer's data? (Only one thread runs this code.) After that kind of operation, I don't want szData.charAt(iSizeData - 1) to crash!
EDIT: szData.getBytes().length = 450566. There are encoding problems, I think. Better to use a byte[] instead of a String, then?
In Java, char ≠ byte: depending on the default character encoding of the platform, a single character can correspond to up to 4 bytes. You work either with bytes (binary data) or with characters (strings); you cannot (easily) switch between them.
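A sketch of the two options, if it helps: keep the bytes as bytes, or decode them once with an explicit charset.
byte[] raw = buffer.toByteArray();                      // raw.length == buffer.size(), no charset involved
String text = new String(raw, StandardCharsets.UTF_8);  // only if the bytes really are UTF-8 text;
                                                        // text.length() counts chars, not bytes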
For String operations like strncasecmp in C, use the methods of the String class, e.g. String.compareToIgnoreCase(String str). Also have a look at the StringUtils class from the Apache Commons Lang library.

Converting C++ encryption to Java

I have the following C++ code to cipher a string with XOR.
#define MPI_CIPHER_KEY "qwerty"

Buffer FooClient::cipher_string(const Buffer& _landing_url)
{
    String key(CIPHER_KEY);
    Buffer key_buf(key.chars(), key.length());
    Buffer landing_url_cipher = FooClient::XOR(_url, key_buf);
    Buffer b64_url_cipher;
    base64_encode(landing_url_cipher, b64_url_cipher);
    return b64_url_cipher;
}

Buffer FooClient::XOR(const Buffer& _data, const Buffer& _key)
{
    Buffer retval(_data);
    unsigned int klen=_key.length();
    unsigned int dlen=_data.length();
    unsigned int k=0;
    unsigned int d=0;
    for(;d<dlen;d++)
    {
        retval[d]=_data[d]^_key[k];
        k=(++k<klen?k:0);
    }
    return retval;
}
I have seen in this question a Java implementation like the following. Would that work for this case?
String s1, s2;
StringBuilder sb = new StringBuilder();
for(int i=0; i<s1.length() && i<s2.length();i++)
    sb.append((char)(s1.charAt(i) ^ s2.charAt(i)));
String result = sb.toString();
or is there an easier way to do it?
Doesn't look the same to me. The C++ version loops across all of _data no matter what the _key length is, cycling through _key as necessary (k=(++k<klen?k:0); in the C++ code).
Yours stops as soon as the shorter of key or data is exhausted.
Personally, I'd start with the closest literal translation of the C++ to Java that you can do, keeping parameter and local names the same.
Then write unit tests for it that have known inputs and outputs from the C++.
Then start refactoring the Java version to use Java idioms etc., ensuring the tests still pass.
No - the Java code will only XOR up to the length of the smaller string, whereas the C++ code XORs the entire data.
Assuming s1 is your "key" this can be fixed by changing to
for(int i=0; i<s2.length();i++)
    sb.append((char)(s1.charAt(i%s1.length()) ^ s2.charAt(i)));
Also the base-64 encoding of the return value is missing.
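Putting both points together, a rough Java sketch of the whole cipher_string (method and variable names are mine), including the missing base-64 step via java.util.Base64:
import java.nio.charset.StandardCharsets;
import java.util.Base64;

static String cipherString(String landingUrl, String key) {
    byte[] data = landingUrl.getBytes(StandardCharsets.UTF_8);
    byte[] k = key.getBytes(StandardCharsets.UTF_8);
    byte[] out = new byte[data.length];
    for (int i = 0; i < data.length; i++) {
        out[i] = (byte) (data[i] ^ k[i % k.length]);   // cycle through the key, like the C++ loop
    }
    return Base64.getEncoder().encodeToString(out);
}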
