Equivalent of MemorySegment.getUtf8String for UTF-16 - java

I'm porting my JNA-based library to "pure" Java using the Foreign Function and Memory API ([JEP 424][1]) in JDK 19.
One frequent use case my library handles is reading (null-terminated) Strings from native memory. For most *nix applications, these are "C Strings" and the MemorySegment.getUtf8String() method is sufficient to the task.
Native Windows strings, however, are stored in UTF-16 (LE). Referenced as arrays of TCHAR or as "wide strings", they are treated similarly to C strings except that each character consumes 2 bytes.
JNA provides a Native.getWideString() method for this purpose which invokes native code to efficiently iterate over the appropriate character set.
I don't see a UTF-16 equivalent to the getUtf8String() (and corresponding set...()) optimized for these Windows-based applications.
I can work around the problem with a few approaches:
If I'm reading from a fixed size buffer, I can create a new String(bytes, StandardCharsets.UTF_16LE) and:
If I know the memory was cleared before being filled, use trim()
Otherwise split() on the null delimiter and extract the first element
If I'm just reading from a pointer offset with no knowledge of the total size (or a very large total size I don't want to instantiate into a byte[]) I can iterate character-by-character looking for the null.
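For illustration, that character-by-character workaround might look roughly like the sketch below (my own sketch against the JDK 19 API; the method name is made up and it assumes the terminator actually lies within the segment's bounds and at an even offset):
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Sketch of the char-by-char scan: read 2-byte units until the UTF-16 null.
static String readWideString(MemorySegment segment, long offset) {
    StringBuilder sb = new StringBuilder();
    for (long pos = offset; ; pos += 2) {
        char c = segment.get(ValueLayout.JAVA_CHAR, pos);
        if (c == 0) break;   // stop at the wide-string terminator
        sb.append(c);
    }
    return sb.toString();
}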
While certainly I wouldn't expect the JDK to provide native implementations for every character set, I would think that Windows represents a significant enough usage share to support its primary native encoding alongside the UTF-8 convenience methods. Is there a method to do this that I haven't discovered yet? Or are there any better alternatives than the new String() or character-based iteration approaches I've described?

Since Java’s char is a UTF-16 unit, there’s no need for special “wide string” support in the Foreign API, as the conversion (which may be a mere copying operation in some cases) does already exist:
public static String fromWideString(MemorySegment wide) {
    var cb = wide.asByteBuffer().order(ByteOrder.nativeOrder()).asCharBuffer();
    int limit = 0; // check for zero termination
    for(int end = cb.limit(); limit < end && cb.get(limit) != 0; limit++) {}
    return cb.limit(limit).toString();
}

public static MemorySegment toWideString(String s, SegmentAllocator allocator) {
    MemorySegment ms = allocator.allocateArray(ValueLayout.JAVA_CHAR, s.length() + 1);
    ms.asByteBuffer().order(ByteOrder.nativeOrder()).asCharBuffer().put(s).put('\0');
    return ms;
}
This is not using UTF-16LE specifically, but the current platform’s native order, which is usually the intended thing on a platform with native wide strings. Of course, when running on Windows x86 or x64, this will result in the UTF-16LE encoding.
Note that CharBuffer implements CharSequence which implies that for a lot of use cases you can omit the final toString() step when reading a wide string and effectively process the memory segment without a copying step.
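For example, given the same MemorySegment wide as above (a sketch; java.util.regex imports are assumed and the pattern is arbitrary), a Matcher can run directly over the segment's contents without creating a String:
CharBuffer cb = wide.asByteBuffer().order(ByteOrder.nativeOrder()).asCharBuffer();
int end = 0;
while (end < cb.limit() && cb.get(end) != 0) end++;   // stop at the terminator
cb.limit(end);                                         // no toString(), no copy
Matcher m = Pattern.compile("\\\\").matcher(cb);       // e.g. count path separators
int separators = 0;
while (m.find()) separators++;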

A CharsetDecoder provides a way to convert a null-terminated wide (UTF-16LE) MemorySegment to a String on Windows using the Foreign Memory API. This may not be any different from (or an improvement over) your workaround suggestions, as it still involves scanning the resulting character buffer for the null position.
public static String toJavaString(MemorySegment wide) {
    return toJavaString(wide, StandardCharsets.UTF_16LE);
}

public static String toJavaString(MemorySegment segment, Charset charset) {
    // JDK Panama only handles UTF-8, it does strlen() scan for 0 in the segment
    // which is valid as all code points of 2 and 3 bytes lead with high bit "1".
    if (StandardCharsets.UTF_8 == charset)
        return segment.getUtf8String(0);
    // if (StandardCharsets.UTF_16LE == charset) {
    //     return Holger answer
    // }
    // This conversion is convoluted: MemorySegment->ByteBuffer->CharBuffer->String
    CharBuffer cb = charset.decode(segment.asByteBuffer());
    // cb.array() isn't valid unless cb.hasArray() is true so use cb.get() to
    // find a null terminator character, ignoring it and the remaining characters
    final int max = cb.limit();
    int len = 0;
    while (len < max && cb.get(len) != '\0')
        len++;
    return cb.limit(len).toString();
}
Going the other way String -> null terminated Windows wide MemorySegment:
public static MemorySegment toCString(SegmentAllocator allocator, String s, Charset charset) {
    // "==" is OK here as StandardCharsets.UTF_8 == Charset.forName("UTF8")
    if (StandardCharsets.UTF_8 == charset)
        return allocator.allocateUtf8String(s);
    // else if (StandardCharsets.UTF_16LE == charset) {
    //     return Holger answer
    // }
    // For MB charsets it is safer to append terminator '\0' and let JDK append
    // appropriate byte[] null termination (typically 1,2,4 bytes) to the segment
    return allocator.allocateArray(JAVA_BYTE, (s + "\0").getBytes(charset));
}

/** Convert Java String to Windows Wide String format */
public static MemorySegment toWideString(String s, SegmentAllocator allocator) {
    return toCString(allocator, s, StandardCharsets.UTF_16LE);
}
Like you, I'd also like to know if there are better approaches than the above.


Can I use java.nio for Console Input?

Consider the scenario of competitive programming: I have to read 2*10^5 (or even more) numbers from the console. I currently use BufferedReader, or, for even faster performance, a custom reader class that uses DataInputStream under the hood.
A quick Internet search gave me this: we can use java.io for smaller streams of data, and for large streams we can use java.nio.
So I want to try java.nio console input and test it against the java.io performance.
Is it possible to read console input using java.nio?
Can I read data from System.in using java.nio?
Will it be faster than the input methods I currently have?
Any relevant information will be appreciated.
Thanks ✌️
You can open a channel to stdin like
FileInputStream stdin = new FileInputStream(FileDescriptor.in);
FileChannel stdinChannel = stdin.getChannel();
When stdin has been redirected to a file, operations like querying the size, performing fast transfers to other channels and even memory mapping may work. But when the input is a real console or a pipe or you are reading character data, the performance is unlikely to differ significantly.
The performance depends on the way you read it, not the class you are using.
An example of code directly operating on a channel, to process white-space separated decimal numbers, is
CharsetDecoder cs = Charset.defaultCharset().newDecoder();
ByteBuffer bb = ByteBuffer.allocate(1024);
CharBuffer cb = CharBuffer.allocate(1024);
while(stdinChannel.read(bb) >= 0) {
    bb.flip();
    cs.decode(bb, cb, false);
    bb.compact();
    cb.flip();
    extractDoubles(cb);
    cb.compact();
}
bb.flip();
cs.decode(bb, cb, true);
if(cb.position() > 0) {
    cb.flip();
    extractDoubles(cb);
}

private static void extractDoubles(CharBuffer cb) {
    doubles: for(int p = cb.position(); p < cb.limit(); ) {
        while(p < cb.limit() && Character.isWhitespace(cb.get(p))) p++;
        cb.position(p);
        if(cb.hasRemaining()) {
            for(; p < cb.limit(); p++) {
                if(Character.isWhitespace(cb.get(p))) {
                    int oldLimit = cb.limit();
                    double d = Double.parseDouble(cb.limit(p).toString());
                    cb.limit(oldLimit);
                    processDouble(d);
                    continue doubles;
                }
            }
        }
    }
}
This is more complicated than using java.util.Scanner or a BufferedReader's readLine() followed by split("\\s"), but has the advantage of avoiding the complexity of the regex engine, as well as not creating String objects for the lines. When there is more than one number per line, or there are empty lines, i.e. the line strings would not match the number strings, this can save the copying overhead intrinsic to string construction.
This code is still handling arbitrary charsets. When you know the expected charset and it is ASCII based, using a lightweight transformation instead of the CharsetDecoder, like shown in this answer, can gain an additional performance increase.
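For instance, when the input is known to be ASCII-based, the decode step can be replaced by a plain per-byte cast; a minimal sketch (the helper name is made up) that would take the place of cs.decode(bb, cb, false) in the loop above:
// Lightweight "decode" for ASCII-based input: copy bytes into the CharBuffer
// with a cast, skipping the CharsetDecoder entirely.
private static void decodeAscii(ByteBuffer bb, CharBuffer cb) {
    while (bb.hasRemaining() && cb.hasRemaining()) {
        cb.put((char) (bb.get() & 0xFF));
    }
}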

Unable to create a torrent's info hash

I'm having trouble finding the issue with how I'm generating the corresponding info hash for a torrent file. This is the code I have so far:
InputStream input = null;
try {
    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    input = new FileInputStream(file);
    StringBuilder builder = new StringBuilder();
    while (!builder.toString().endsWith("4:info")) {
        builder.append((char) input.read()); // It's ASCII anyway.
    }
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    for (int data; (data = input.read()) > -1; output.write(data));
    sha1.update(output.toByteArray(), 0, output.size() - 1);
    this.infoHash = sha1.digest();
    System.out.println(new String(Hex.encodeHex(infoHash)));
} catch (NoSuchAlgorithmException | IOException e) {
    e.printStackTrace();
} finally {
    if (input != null) try { input.close(); } catch (IOException ignore) {}
}
Below is my expected and actual hash:
Expected: d4d44272ee5f5bf887a9c85ad09ae957bc55f89d
Actual: 4d753474429d817b80ff9e0c441ca660ec5d2450
The torrent I'm trying to generate an info hash for can be found here (Ubuntu 14.04 Desktop amd64).
Let me know if I can provide any more info, thanks!
Exceptions contain 4 useful bits of info: type, message, trace, and cause. You're tossing away 3 of those 4 relevant bits. Also, code is part of a process, and when an error occurs, generally that process cannot be finished at all; yet on exceptions your process continues. Stop doing this; you've written code that only hurts you. Remove the try and the catch, and add a throws clause to your method signature. If you can't, the go-to default (and update your IDE if it generated this code) is throw new RuntimeException("Unhandled", e);. This is shorter, does not destroy any of the 4 interesting bits of info, and ends the process.
Separately, the notion that the right way to handle an IOException from an InputStream's close() method is to just ignore it is also false. It is highly unlikely to throw, but if it does, you should assume you did not read every byte. As that would be one explanation for a mismatched hash, ignoring it is misguided.
Finally, use the proper language constructs: There is a try-with-resources statement that would work far better here.
You're calling update with output.size() - 1; unless you want to intentionally ignore the last byte, this is a mistake; you're lopping off the last byte read.
Reading bytes into a builder and then, for every byte, converting the whole builder to a string and checking whether it ends with the marker is incredibly inefficient; even for a file as small as 1 MB that'll cause quite a grind.
Reading a single byte at a time from a raw FileInputStream is also that level of inefficient, because every read will cause file access (reading 1 byte is as expensive as reading a whole buffer full, so, it's about 50000 times slower than it needs to be).
Here's how to do this with somewhat newer API, and look how much nicer this code reads. It also acts better under erroneous conditions:
byte[] data = Files.readAllBytes(Paths.get(fileName));
var search = "4:info".getBytes(StandardCharsets.US_ASCII);
int searchIdx = -1;
for (int i = 0; searchIdx == -1 && i < data.length - search.length; i++) {
    for (int j = 0; j < search.length; j++) {
        if (data[i + j] != search[j]) break;
        if (j == search.length - 1) searchIdx = i + j;
    }
}
if (searchIdx == -1) throw new IOException("Input torrent file does not contain marker");

var sha1 = MessageDigest.getInstance("SHA-1");
sha1.update(data, searchIdx, data.length - searchIdx);
byte[] hash = sha1.digest();

StringBuilder hex = new StringBuilder();
for (byte h : hash) hex.append(String.format("%02x", h));
System.out.println(hex);
While rzwitserloot's answer covers some general java coding practices there also are correctness issues on the bittorrent level.
You are using string processing for a structured data format; this is pretty much the same mistake as attempting to parse HTML with regex. In this case you're assuming that the only place the data can contain the string 4:info is the top-level dictionary key for the info dict, and that the info dictionary is the last entry of the top-level dictionary.
Instead you should use a proper bencoding decoder-encoder to extract the info dict and then re-encode it for hashing or a tokenizer to find the exact byte-range covering the info value. Note that you need a validating parser for the former while the latter can also handle some out-of-spec edge cases. Unless you want to implement them yourself you may want to find a library that handles this for you.
Additionally, you're assuming that the data is ASCII. bencoding is in fact a binary format that just tends to use ASCII by convention in some places. You should operate on byte arrays directly. Your input is already binary and the hasher expects binary, so it is quite circuitous to go through strings.
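A minimal sketch of the tokenizer approach (my own sketch, not a validating parser; it assumes well-formed input and knows just enough bencoding to skip values, which is sufficient to find the exact byte range of the top-level info value and hash it):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Returns the SHA-1 of the exact byte range covering the top-level "info" value.
static byte[] infoHash(byte[] data) throws Exception {
    int[] pos = {1};                          // skip the 'd' opening the outer dictionary
    while (data[pos[0]] != 'e') {
        String key = readString(data, pos);
        int valueStart = pos[0];
        skipValue(data, pos);
        if (key.equals("info")) {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            sha1.update(data, valueStart, pos[0] - valueStart);
            return sha1.digest();
        }
    }
    throw new IllegalArgumentException("no info dictionary found");
}

static String readString(byte[] data, int[] pos) {
    int len = readLength(data, pos);
    String s = new String(data, pos[0], len, StandardCharsets.ISO_8859_1);
    pos[0] += len;
    return s;
}

static int readLength(byte[] data, int[] pos) {
    int v = 0;
    while (data[pos[0]] != ':') v = v * 10 + (data[pos[0]++] - '0');
    pos[0]++;                                 // consume the ':'
    return v;
}

static void skipValue(byte[] data, int[] pos) {
    byte b = data[pos[0]];
    if (b == 'i') {                           // integer: i<digits>e
        while (data[pos[0]++] != 'e') {}
    } else if (b == 'l' || b == 'd') {        // list or dictionary: recurse until 'e'
        pos[0]++;
        while (data[pos[0]] != 'e') {
            if (b == 'd') readString(data, pos);   // dictionary entries have a string key
            skipValue(data, pos);
        }
        pos[0]++;                             // consume the closing 'e'
    } else {                                  // byte string: <length>:<bytes>
        int len = readLength(data, pos);
        pos[0] += len;
    }
}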

Why String receiver's size is smaller than original ByteArrayOutputStream's size when I call toString()

I'm in front of a curious problem. Some code is better than long story:
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
buffer.write(...); // I write byte[] data
// In debugger I can see that buffer's count = 449597
String szData = buffer.toString();
int iSizeData = buffer.size();
// But here, szData's count = 240368
// & iSizeData = 449597
So my question is: why doesn't szData contain all of the buffer's data? (Only one thread runs this code.) After that kind of operation, I don't want szData.charAt(iSizeData - 1) to crash!
EDIT: szData.getBytes().length = 450566. There are encoding problems, I think. Is it better to use a byte[] instead of a String after all?
In Java, char ≠ byte: depending on the default character encoding of the platform, a single character can be encoded as up to 4 bytes, so decoding a byte[] will generally not produce one char per byte. You work either with bytes (binary data) or with characters (strings); you cannot (easily) switch between them.
For String operations like strncasecmp in C, use the methods of the String class, e.g. String.compareToIgnoreCase(String str). Also have a look at the StringUtils class from the Apache Commons Lang library.
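A small self-contained illustration of why toString() is lossy for binary data (the byte values here are arbitrary): decoding replaces invalid sequences with U+FFFD, so neither the character count nor a re-encoding matches the original bytes.
import java.nio.charset.StandardCharsets;

byte[] binary = { (byte) 0xC3, (byte) 0xA9, (byte) 0xFF };   // 0xFF is not valid UTF-8
String decoded = new String(binary, StandardCharsets.UTF_8); // "é" + U+FFFD -> 2 chars
byte[] reEncoded = decoded.getBytes(StandardCharsets.UTF_8); // 5 bytes, not the original 3
System.out.println(binary.length + " bytes in, " + decoded.length()
        + " chars, " + reEncoded.length + " bytes back out");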

Storing and comparing a large quantity of Strings in Java

My application stores a large number (about 700,000) of strings in an ArrayList. The strings are loaded from a text file like this:
List<String> stringList = new ArrayList<String>(750_000);
//there's a try catch here but I omitted it for this example
Scanner fileIn = new Scanner(new FileInputStream(listPath), "UTF-8");
while (fileIn.hasNext()) {
    String s = fileIn.nextLine().trim();
    if (s.isEmpty()) continue;
    if (s.startsWith("#")) continue; //ignore comments
    stringList.add(s);
}
fileIn.close();
Later on, other strings are compared against this list, using this code:
String example = "Something";
if (stringList.contains(example))
    doSomething();
This comparison will happen many hundreds (thousands?) of times.
This all works, but I want to know if there's anything I can do to make it better. I notice that the JVM increases in size from about 100MB to 600MB when it loads the 700K Strings. The strings are mainly about this size:
Blackened Recordings
Divergent Series: Insurgent
Google
Pixels Movie Money
X Ambassadors
Power Path Pro Advanced
CYRFZQ
Is there anything I can do to reduce the memory, or is that to be expected? Any suggestions in general?
ArrayList is memory efficient. Your issue is probably caused by java.util.Scanner. Scanner creates a lot of temporary objects during parsing (Patterns, Matchers, etc.) and is not suitable for big files.
Try to replace it with java.io.BufferedReader:
List<String> stringList = new ArrayList<String>();
BufferedReader fileIn = new BufferedReader(
        new InputStreamReader(new FileInputStream(listPath), "UTF-8"));
String line = null;
while ((line = fileIn.readLine()) != null) {
    line = line.trim();
    if (line.isEmpty()) continue;
    if (line.startsWith("#")) continue; //ignore comments
    stringList.add(line);
}
fileIn.close();
See the java.util.Scanner source code.
To pinpoint the memory issue, attach any memory profiler to your JVM, for example VisualVM from the JDK tools.
Added:
Let's make a few assumptions:
you have 700,000 strings with 20 characters each;
an object reference is 32 bits, an object header 24, an array header 16, a char 16, an int 32.
Then every string will consume 24+32*2+32+(16+20*16) = 456 bits.
The whole ArrayList with the string objects will consume about 700,000*(32*2+456) = 364,000,000 bits = 43.4 MB (very roughly).
Not quite an answer, but:
Your scenario uses around 70mb on my machine:
long usedMemory = -(Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory());
{//
    String[] strings = new String[700_000];
    for (int i = 0; i < strings.length; i++) {
        strings[i] = new String(new char[20]);
    }
}//
usedMemory += Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
System.out.println(usedMemory / 1_000_000d + " mb");
How did you reach 500 MB there? As far as I know, a String internally has a char[], and each char takes 16 bits. Even taking the Object and String overhead into account, 500 MB is still quite a lot for the strings alone. You may perform some benchmarking tests on your machine.
As others already mentioned, you should change the datastructure for element look-ups/comparison.
You're likely going to be better off using a HashSet instead of an ArrayList as both add and contains are constant time operations in a HashSet.
However, it does assume that your object's hashCode implementation (which is part of Object, but can be overridden) is evenly distributed.
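A minimal sketch of that change (listPath is the same file the question loads with Scanner; the initial capacity is a guess):
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Load the word list into a HashSet so contains() is O(1) on average.
static Set<String> loadStrings(String listPath) throws IOException {
    Set<String> strings = new HashSet<>(1_000_000);
    try (BufferedReader in = Files.newBufferedReader(Paths.get(listPath), StandardCharsets.UTF_8)) {
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;  // skip blanks and comments
            strings.add(line);
        }
    }
    return strings;
}
The lookup side stays the same, only against the set instead of the list: if (stringSet.contains(example)) doSomething();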
There is the Trie data structure, which can be used as a dictionary; with so many strings, duplicates and shared prefixes are stored only once. https://en.wikipedia.org/wiki/Trie . It seems to fit your case.
UPDATE:
An alternative can be a HashSet, or a HashMap<String, Something> if you want, for example, to count occurrences of strings. A hashed collection will be faster than a list for sure.
I would start with HashSet.
Using an ArrayList is a very bad idea for your use case, because it is not sorted, and hence you cannot efficiently search for an entry.
The best built-in type for your case is a TreeSet<String>. It guarantees O(log(n)) performance for add() and contains().
Be aware that TreeSet is not thread-safe in the basic implementation. Use a thread-safe wrapper (see the JavaDocs of TreeSet for this).
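For completeness, that wrapper is a one-liner (a sketch; whether you need it depends on your threading):
import java.util.Collections;
import java.util.SortedSet;
import java.util.TreeSet;

// Synchronized view of the set; all access must go through this wrapper.
SortedSet<String> stringSet = Collections.synchronizedSortedSet(new TreeSet<String>());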
Here is a Java 8 approach. It uses the Files.lines() method, which takes advantage of the Stream API. This method reads all lines from a file lazily as a Stream.
As a consequence, no collection of String objects is built up; each line is handed straight to the terminal operation, the static method MyExecutor.doSomething(String).
/**
 * Process lines from a file.
 * Uses Files.lines() method which take advantage of Stream API introduced in Java 8.
 */
private static void processStringsFromFile(final Path file) {
    try (Stream<String> lines = Files.lines(file)) {
        lines.map(s -> s.trim())
             .filter(s -> !s.isEmpty())
             .filter(s -> !s.startsWith("#"))
             .filter(s -> s.contains("Something"))
             .forEach(MyExecutor::doSomething);
    } catch (IOException ex) {
        logProcessStringsFailed(ex);
    }
}
I conducted an analysis of memory usage in NetBeans; here are the memory results for an empty implementation of doSomething():
public static void doSomething(final String s) {
}
Live Bytes = 6702720 ≈ 6.4MB.

how to load first x bytes from URL with Java / Scala?

I want to read the first x bytes from a java.net.URLConnection (although I'm not forced to use this class - other suggestions welcome).
My code looks like this:
val head = new Array[Byte](2000)
new BufferedInputStream(connection.getInputStream).read(head)
IOUtils.toString(new ByteArrayInputStream(head), charset)
It works, but does this code load only the first 2000 bytes from the network?
Next trial
As 'JB Nizet' said, it is not useful to use a buffered input stream, so I tried it with an InputStreamReader:
val head = new Array[Char](2000)
new InputStreamReader(connection.getInputStream, charset).read(head)
new String(head)
This code may be better, but the load times are about the same. So does this procedure limit the transferred bytes?
No, it doesn't. It could read up to 8192 bytes (the default buffer size of BufferedInputStream). It could also read 0 bytes, or any number of bytes between 0 and 2000, since you don't check the number of bytes that have actually been read, which is returned by the read() method.
And finally, depending on the value of charset, and of the actual charset used by the HTTP response, this could return an incorrect string, or a String truncated in the middle of a multi-byte character. You should use a Reader to read text.
I suggest you read the Java IO tutorial.
You can use read(Reader, char[]) from Apache Commons IO. Just pass a 2000-character buffer to it and it will fill it with as many characters as possible, up to 2000.
Be sure you understand the objections in the other answers/comments, in particular:
Don't use Buffered... wrappers, it goes against your intentions.
If you read textual data, then use a Reader to read 2000 characters instead of InputStream reading 2000 bytes. The proper procedure would be to determine the character encoding from the headers of a response (Content-Type) and set that encoding into InputStreamReader.
Calling plain read(char[]) on a Reader will not fully fill the array you give to it. It can read as little as one character no matter how big the array is!
Don't forget to close the reader afterwards.
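Putting those points together with plain java.net.URLConnection, a sketch might look like this in Java (the UTF-8 fallback, the helper name, and the naive Content-Type parsing are my assumptions):
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

static String readHead(URL url, int maxChars) throws Exception {
    URLConnection connection = url.openConnection();
    // Take the charset from the Content-Type header, falling back to UTF-8.
    Charset charset = StandardCharsets.UTF_8;
    String contentType = connection.getContentType();
    if (contentType != null && contentType.contains("charset=")) {
        charset = Charset.forName(
                contentType.substring(contentType.indexOf("charset=") + "charset=".length()).trim());
    }
    char[] buf = new char[maxChars];
    int filled = 0;
    try (Reader reader = new InputStreamReader(connection.getInputStream(), charset)) {
        int n;
        // read() may return fewer characters than requested, so keep reading
        // until the buffer is full or the stream ends.
        while (filled < maxChars && (n = reader.read(buf, filled, maxChars - filled)) != -1) {
            filled += n;
        }
    }
    return new String(buf, 0, filled);
}
Closing the reader also closes the connection's stream, so only roughly the first maxChars characters' worth of bytes are consumed from the network (plus whatever the implementation buffers).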
Other than that, I'd strongly recommend using Apache HttpClient instead of java.net.URLConnection. It's much more flexible.
Edit: To understand the difference between Reader.read and IOUtils.read, it's worth examining the source of the latter:
public static int read(Reader input, char[] buffer,
                       int offset, int length)
        throws IOException
{
    if (length < 0) {
        throw new IllegalArgumentException("Length must not be negative: " + length);
    }
    int remaining = length;
    while (remaining > 0) {
        int location = length - remaining;
        int count = input.read(buffer, offset + location, remaining);
        if (EOF == count) { // EOF
            break;
        }
        remaining -= count;
    }
    return length - remaining;
}
Since Reader.read can read fewer characters than the given length (we only know it reads at least 1 and at most length, or returns -1 at the end of the stream), we need to keep calling it in a loop until we get the amount we want.
