I have been using the JNA (Java Native Access) library to access the memory of processes. I have been writing some code to enumerate through all modules of a process, and the struct MODULEENTRY32 is obtained properly - I am getting their handles and base addresses properly. However, the "String" values szModule and szExePath (which are char arrays) that are returned give me random Chinese characters.
JNA provides helper classes for structs such as MODULEENTRY32 (they call it MODULEENTRY32W) for use with functions such as Module32First and Module32Next, which I've been using. They have what amounts to their own toString method for szModule and szExePath, and those return the random Chinese characters as well. I have tried to encode/decode the values myself and got close to the "right" strings (encoding to UTF-16, then decoding to ISO), but the result is still slightly off, so I can't use equals/equalsIgnoreCase to compare it with another String.
Below is roughly an example of what I am getting when printing out szModule and szExePath in the format szModule:szExePath returned from the Module32First/Module32Next calls:
瑮汤汤l: 瑮汤汤l
䕋乒䱅㈳䐮䱌: 䕋乒䱅㈳䐮䱌
䕋乒䱅䅂䕓搮汬: 䕋乒䱅䅂䕓搮汬
档潲敭敟晬搮汬: 档潲敭敟晬搮汬
䕖卒佉⹎汤l: 䕖卒佉⹎汤l
獭捶瑲搮汬: 獭捶瑲搮汬
And here is roughly how I am enumerating:
// hSnapshot is valid, and I already called "Module32First" - this loops through any other modules
while (this.moduleBaseAddr == null && this.moduleHandle == null) {
    Tlhelp32.MODULEENTRY32W currentModuleEntry32 = new Tlhelp32.MODULEENTRY32W();
    if (this.kernel32.Module32Next(hSnapshot, currentModuleEntry32)) {
        currentModuleEntry32.read();
        String currentModuleName = currentModuleEntry32.szModule();
        System.out.println(currentModuleName + ": " + currentModuleEntry32.szModule());
        if (currentModuleName.equals(MODULE_NAME)) {
            this.moduleBaseAddr = currentModuleEntry32.modBaseAddr;
            this.moduleHandle = currentModuleEntry32.hModule.getPointer();
            break;
        }
    } else {
        break;
    }
}
Does anyone have any insight on solving this issue?
You are mixing ANSI function mappings and Unicode structure mappings.
Most Windows API functions come in two versions, one ending in A and one in W, with notes in the documentation. For example, CreateProcess has two versions, CreateProcessA and CreateProcessW, where the documentation states:
The processthreadsapi.h header defines CreateProcess as an alias which automatically selects the ANSI or Unicode version of this function based on the definition of the UNICODE preprocessor constant. Mixing usage of the encoding-neutral alias with code that is not encoding-neutral can lead to mismatches that result in compilation or runtime errors. For more information, see Conventions for Function Prototypes.
That link states:
New Windows applications should use Unicode to avoid the inconsistencies of varied code pages and for ease of localization.
Unfortunately, in the case of Module32First and Module32Next, they do not follow the usual SDK convention. There is no -A version of these functions, so the mapping you have created is ANSI (really ASCII). The byte string behind the first line of the output in your question is 6e74646c6c2e646c6c3a206e74646c6c2e646c6c, which in ASCII or UTF-8 decodes to ntdll.dll: ntdll.dll. Because you are using the MODULEENTRY32W (Unicode) structure mapping, those bytes are interpreted as UTF-16, resulting in the characters you are seeing in your output.
The Unicode mappings are Module32FirstW and Module32NextW, and those are the functions you should be using. They are already mapped in JNA's Kernel32 class; I highly recommend using the JNA mappings rather than reinventing the wheel.
Incidentally, JNA's Kernel32Util class already handles all of this and offers a List<Tlhelp32.MODULEENTRY32W> getModules(int processID) method using the correct mappings, that you may find useful.
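For illustration, a minimal sketch of that approach; the helper method is mine, not part of JNA:

import java.util.List;
import com.sun.jna.Pointer;
import com.sun.jna.platform.win32.Kernel32Util;
import com.sun.jna.platform.win32.Tlhelp32;

// Returns the base address of the named module in the given process, or null if not found.
static Pointer findModuleBase(int pid, String moduleName) {
    List<Tlhelp32.MODULEENTRY32W> modules = Kernel32Util.getModules(pid);
    for (Tlhelp32.MODULEENTRY32W module : modules) {
        // szModule() decodes the wide-character buffer correctly
        if (module.szModule().equalsIgnoreCase(moduleName)) {
            return module.modBaseAddr;
        }
    }
    return null;
}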
I have used the unpack-data logic provided in the link below for Java:
How to unpack COMP-3 digits using Java?
But for null data in the source, the Java unpack code returns values like 404040404. I understand that 0x40 is the EBCDIC space, but how do I unpack while handling this space, or avoid it altogether?
There are two problems we have to deal with: first, is the data valid COMP-3 data, and second, is the data considered “valid” by older language implementations like COBOL (since COMP-3 was mentioned)?
If the offsets are not misaligned, it would appear that spaces are being interpreted by existing programs as 0 instead of spaces. This would be incorrect, but could be an artifact of older programs that were engineered to tolerate this bad behaviour.
The approach I would take in a legacy shop (assuming no misalignment) is to treat “spaces” (sequences of 0x40 bytes, e.g. 0x404040404040) as zero. That is, a legacy-style check compares the field with spaces and, if it matches, assumes 0x00000000000f (a packed zero) as the actual value. This is something an individual shop would have to determine and is not recognized as a general programming approach.
In terms of Java, one has to remember that bytes are “signed”, so comparisons can be tricky depending on how the code is written. The only “unsigned” data type I recall in Java is char, which is really two bytes (a uint16), basically.
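As a rough illustration of both points, here is a hedged Java sketch; treating an all-spaces field as zero is a shop-specific assumption, not a general rule, and the sign-nibble conventions shown are the usual 0xC/0xD/0xF ones:

// Sketch: treat an all-spaces (EBCDIC 0x40) COMP-3 field as zero, otherwise unpack it.
// Java bytes are signed, so compare against (byte) 0x40 or mask with & 0xFF.
static long unpackComp3(byte[] field) {
    boolean allSpaces = true;
    for (byte b : field) {
        if (b != (byte) 0x40) {            // 0x40 is the EBCDIC space
            allSpaces = false;
            break;
        }
    }
    if (allSpaces) {
        return 0L;                         // shop-specific convention: spaces mean zero
    }
    long value = 0;
    for (int i = 0; i < field.length; i++) {
        int hi = (field[i] & 0xFF) >>> 4;  // mask first so the sign bit cannot leak in
        int lo = field[i] & 0x0F;
        if (i < field.length - 1) {
            value = value * 100 + hi * 10 + lo;
        } else {
            value = value * 10 + hi;       // in the last byte the low nibble is the sign
            if (lo == 0x0D) {              // 0xD = negative; 0xC or 0xF = positive
                value = -value;
            }
        }
    }
    return value;
}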
This is less of a programming problem than it is recognizing historical tolerance and remediation.
I am integration testing a component. The component allows you to save and fetch strings.
I want to verify that the component is handling UTF-8 characters properly. What is the minimum test that is required to verify this?
I think that doing something like this is a good start:
// This is the ☺ character
String toSave = "\u263A";
int id = 123;
// Saves to Database
myComponent.save( id, toSave );
// Retrieve from Database
String fromComponent = myComponent.retrieve( id );
// Verify they are same
org.junit.Assert.assertEquals( toSave, fromComponent );
One mistake I have made in the past is I have set String toSave = "è". My test passed because the string was saved and retrieved properly to/from the DB. Unfortunately the application was not actually working correctly because the app was using ISO 8859-1 encoding. This meant that è worked but other characters like ☺ did not.
Question restated: What is the minimum test (or tests) to verify that I can persist UTF-8 encoded strings?
A code and/or documentation review is probably your best option here, but you can probe if you want. It seems that a sufficient test is the goal and minimizing it is less important. It is hard to say what a sufficient test is based only on speculation about the threat, but here is my suggestion: all codepoints, including U+0000, plus proper handling of "combining characters".
The method you want to test takes a Java string as a parameter. Java doesn't have "UTF-8 encoded strings": Java's native text datatypes use the UTF-16 encoding of the Unicode character set. This is common for in-memory representations of text; it's used by Java, .NET, JavaScript, VB6, VBA, and so on. UTF-8 is commonly used for streams and storage, so it makes sense that you ask about it in the context of "saving and fetching". Databases typically offer one or more of UTF-8, 3-byte-limited UTF-8, or UTF-16 (NVARCHAR) datatypes and collations.
The encoding is an implementation detail. If the component accepts a Java string, it should either throw an exception for data it is unwilling to handle or handle it properly.
"Characters" is a rather ill-defined term. Unicode codepoints range from 0x0 to 0x10FFFF—21 bits. Some codepoints are not assigned (aka "defined"), depending on the Unicode Standard revision. Java datatypes can handle any codepoint, but information about them is limited by version. For Java 8, "Character information is based on the Unicode Standard, version 6.2.0.". You can limit the test to "defined" codepoints or go all possible codepoints.
A codepoint is either a base "character" or a "combining character", and each codepoint is in exactly one Unicode category; two categories are for combining characters. To form a grapheme, a base character is followed by zero or more combining characters. It might be difficult to lay out graphemes graphically (see Zalgo text), but for text storage all that is needed is to not mangle the sequence of codepoints (and the byte order, if applicable).
So, here is a non-minimal, somewhat comprehensive test:
final Stream<Integer> codepoints = IntStream
        .rangeClosed(Character.MIN_CODE_POINT, Character.MAX_CODE_POINT)
        .filter(cp -> Character.isDefined(cp)) // optional filtering
        .boxed();
final int[] combiningCategories = {
        Character.COMBINING_SPACING_MARK,
        Character.ENCLOSING_MARK
};
final Map<Boolean, List<Integer>> partitionedCodepoints = codepoints
        .collect(Collectors.partitioningBy(cp ->
                Arrays.binarySearch(combiningCategories, Character.getType(cp)) < 0));
final Integer[] baseCodepoints = partitionedCodepoints.get(true)
        .toArray(new Integer[0]);
final Integer[] combiningCodepoints = partitionedCodepoints.get(false)
        .toArray(new Integer[0]);
final int baseLength = baseCodepoints.length;
final int combiningLength = combiningCodepoints.length;
final StringBuilder graphemes = new StringBuilder();
for (int i = 0; i < baseLength; i++) {
    graphemes.append(Character.toChars(baseCodepoints[i]));
    graphemes.append(Character.toChars(combiningCodepoints[i % combiningLength]));
}
final String test = graphemes.toString();
final byte[] testUTF8 = StandardCharsets.UTF_8.encode(test).array();
// Java 8 counts for when filtering by Character.isDefined
assertEquals(736681, test.length()); // number of UTF-16 code units
assertEquals(3241399, testUTF8.length); // number of UTF-8 code units
If your component is only capable of storing and retrieving strings, then all you need to do is make sure that nothing gets lost in the conversion to and from the Unicode strings of java and the UTF-8 strings that the component stores.
That would involve checking with at least one character from each UTF-8 code point length. So, I would suggest check with:
One character from the US-ASCII set (1-byte-long code point), then
One character from Greek (2-byte-long code point), and
One character from Chinese (3-byte-long code point).
You would also want to check with an emoji (4-byte-long code point); in Java's Unicode strings such a code point is represented as a surrogate pair of two chars, which makes it a worthwhile extra case rather than a moot one.
A useful extra test would be to try a string combining at least one character from each of the above cases, so as to make sure that characters of different code-point lengths can co-exist within the same string.
(If your component does anything more than storing and retrieving strings, like searching for strings, then things can get a bit more complicated, but it seems to me that you specifically avoided asking about that.)
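A minimal JUnit-style sketch of that idea, reusing the hypothetical myComponent.save/retrieve API from the question:

// One code point per UTF-8 length, plus a string that mixes them.
// U+0041 'A' (1 byte), U+03B1 Greek alpha (2 bytes),
// U+4E2D a Chinese character (3 bytes), U+1F600 an emoji (4 bytes, a surrogate pair in Java).
String[] samples = {
    "\u0041",
    "\u03B1",
    "\u4E2D",
    "\uD83D\uDE00",
    "\u0041\u03B1\u4E2D\uD83D\uDE00"
};
int id = 1;
for (String toSave : samples) {
    myComponent.save(id, toSave);
    String fromComponent = myComponent.retrieve(id);
    org.junit.Assert.assertEquals(toSave, fromComponent);
    id++;
}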
I do believe that black box testing is the only kind of testing that makes sense, so I would not recommend polluting the interface of your component with methods that would expose knowledge of its internals. However, there are two things that you can do to increase the testability of the component without ruining its interface:
Introduce additional functions to the interface that might help with testing without disclosing anything about the internal implementation and without requiring that the testing code must have knowledge of the internal implementation of the component.
Introduce functionality useful for testing in the constructor of your component. The code that constructs the component knows precisely what component it is constructing, so it is intimately familiar with the nature of the component, so it is okay to pass something implementation-specific there.
An example of what you could do with either of the above techniques would be to artificially and severely limit the number of bytes that the internal representation is allowed to occupy, so that you can make sure that a string you plan to store will fit. For example, you could limit the internal size to no more than 9 bytes and then make sure that a Java Unicode string containing 3 Chinese characters is properly stored and retrieved.
String instances use a predefined and unchangeable internal encoding (UTF-16, 16-bit code units).
So, returning only a String from your service is probably not enough to do this check.
You should have the service return the byte representation of the persisted String (a byte array, for example) and compare the content of this array with the "\u263A" String encoded to bytes with the UTF-8 charset.
String toSave = "\u263A";
int id = 123;
// Saves to Database
myComponent.save(id, toSave);
// Retrieve from Database
byte[] actualBytes = myComponent.retrieve(id);
// assertion
byte[] expectedBytes = toSave.getBytes(Charset.forName("UTF-8"));
Assert.assertTrue(Arrays.equals(expectedBytes, actualBytes));
I need to register the file association for a certain file type - in fact, I just need to launch a certain Java program with certain arguments and a name of that file.
I got as far as the following:
// in fff-assoc.cmd file:
assoc .fff=SomeFile
ftype SomeFile=java -jar some.jar <arguments1> "%%1" <arguments2>
It works properly for ASCII file names, but when I double-click a file with non-ASCII characters in its name, the argument passed looks like "????" (the int value of each char is 63).
How can I fix those associations?
If what bobince says is accurate and you cannot reliably get the data to java directly, one alternative solution would be to write a small "shim" program in another language (e.g. C, C++ or C#).
The idea is that the program grabs the input as Unicode, encodes it so that it's expressible using only ASCII characters (e.g. by using Base64, or even something as simple as encoding every character as its numerical value), and then assembles the command line argument and launches java itself using CreateProcess.
Your Java code could "undo" the encoding, reconstructing the UNICODE name and proceeding to use it. It's a bit of a roundabout way and requires an extra component for your software, but it should work around the restriction detailed above, if indeed that is an actual restriction.
Update: This is the basic code for the shim program. It encodes input as a sequence of integers, separated by colons. It doesn't do much in the way of error checking and you might want to improve it slightly, but it should at least get you started and going in the right direction.
You should grab Visual Studio Express (if you don't already have Visual Studio) and create a new Visual C++ project, choose "Win32" and select "Win32 Project". Choose "Win32 application". After the project is created, replace everything in the .cpp file that is displayed with this code:
#include "stdafx.h"
#include <string>
int APIENTRY _tWinMain(HINSTANCE, HINSTANCE, LPTSTR lpCmdLine, int)
{
std::string filename;
while((lpCmdLine != NULL) && (*lpCmdLine != 0))
{
if(filename.length() != 0)
filename.append(":");
char buf[32];
sprintf(buf, "%u", (unsigned int)(*lpCmdLine++));
filename.append(buf);
}
if(filename.length() == 0)
return 0;
PROCESS_INFORMATION pi;
memset(&pi, 0, sizeof(PROCESS_INFORMATION));
STARTUPINFOA si;
memset(&si, 0, sizeof(STARTUPINFOA));
si.cb = sizeof(STARTUPINFOA);
char *buf = new char[filename.length() + 256]; // ensure that 256 is enough for your extra arguments!
sprintf(buf, "java.exe -jar some.jar <arguments1> \"%s\" <arguments2>", filename.c_str());
// CHECKME: You hard-coded the path for java.exe here. While that may work on your system
// is it guaranteed that it will work on every system?
if(CreateProcessA("C:\\Program Files\\Java\\jre7\\bin\\java.exe", buf, NULL, NULL, TRUE, 0, NULL, NULL, &si, &pi))
{
CloseHandle(pi.hThread);
CloseHandle(pi.hProcess);
}
delete[] buf;
return 0;
}
You should be able to figure the details on how to compile and so on fairly easily.
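On the Java side, undoing that encoding is straightforward. A minimal sketch, assuming the colon-separated decimal format produced by the shim above:

// Rebuilds the original file name from the shim's colon-separated character codes,
// e.g. "104:101:108:108:111" -> "hello". The values are UTF-16 code units, so a
// plain char cast is enough.
static String decodeShimArgument(String encoded) {
    StringBuilder name = new StringBuilder();
    for (String part : encoded.split(":")) {
        name.append((char) Integer.parseInt(part));
    }
    return name.toString();
}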
I just need to launch a certain Java program with certain arguments and a name of that file.
Unfortunately this 'just' is not actually possible, due to the MS implementation of the standard C library that Java uses to receive argument input (amongst other things), unless you go straight to the native Win32 API, bypassing the standard Java or C interfaces.
See this question for background.
When calling java from the command line, you can specify the encoding of the parameters (which will be used to create the strings in args[]):
java -jar -Dsun.jnu.encoding=cp1252 yourFileName
When using non-ASCII characters, the specified charset has an impact on the value of args[0]. Not sure if that would apply to file associations though.
Note: I'm not sure what other uses that parameter has - this post seems to say none.
How can null byte injection be done on a java webapp, Or rather - how does on protect against it?
Should I look at each byte of the request parameter and check whether its value is 0? I can't imagine a 0 byte sneaking into a request parameter... can it?
My main aim is to make sure the filename used for saving the file is safe enough. And for now, I am not looking for answers that recommend (for example) replacing ALL non-word characters with underscores.
Allowing the user to store files with arbitrary names is dangerous. What happens if the user provides "../../../WINDOWS/explorer.exe"? You should restrict filenames to only contain characters known to be harmless.
'\0' is not known to be harmless. As far as Java is concerned, '\0' is a character like any other. However, the operating system is likely to interpret '\0' as the end of a string. If a string is passed from Java to the operating system, that different interpretation could result in exploitable bugs. Consider:
if (filename.endsWith(".txt")) {
    store(filename, data);
}
where filename is "C:\Windows\explorer.exe\0.txt", which ends with ".txt" to Java, but with ".exe" to the operating system.
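One hedged sketch of such a restriction, as a whitelist check that rejects rather than rewrites (since replacing characters was ruled out); the method name and the allowed character set are illustrative choices:

// Rejects the name outright instead of rewriting it: no null bytes,
// no path separators or "..", and only a conservative character set.
static void validateFilename(String filename) {
    if (filename == null || filename.isEmpty()) {
        throw new IllegalArgumentException("empty filename");
    }
    if (filename.indexOf('\0') >= 0) {
        throw new IllegalArgumentException("null byte in filename");
    }
    if (!filename.matches("[A-Za-z0-9._-]+") || filename.contains("..")) {
        throw new IllegalArgumentException("illegal characters in filename");
    }
}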
I'm not sure why you're concerned with null byte injection. Java isn't like C/C++, where strings are null-terminated character arrays.
You ought to bind and validate parameters and values coming in from the web tier. How do you define "safe enough"?
You have 2 choices:
1. Scan the string (convert it to a char array first) for null bytes.
2. Upgrade to Java 8 or Java 7u40 and you are protected. (Yes, I tested it; it works!)
In May 2013 Oracle fixed the problem: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8014846
Null byte injection in filenames was fixed in Java 7 update 40 (released around September 2013). So it's been fixed for a while now, but it WAS a problem for over a decade, and it was a nasty vulnerability in Java. The fix is documented here: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8014846
-Dave Wichers
According to here, the C compiler will pad out values when writing a structure to a binary file. As the example in the link says, when writing a struct like this:
struct {
    char c;
    int i;
} a;
to a binary file, the compiler will usually leave an unnamed, unused hole between the char and int fields, to ensure that the int field is properly aligned.
How could I create an exact replica of the binary output file (generated in C) using a different language (in my case, Java)?
Is there an automatic way to apply C padding in Java output? Or do I have to go through compiler documentation to see how it works (the compiler is g++ by the way).
Don't do this, it is brittle and will lead to alignment and endianness bugs.
For external data it is much better to explicitly define the format in terms of bytes and write explicit functions to convert between internal and external format, using shift and masks (not union!).
This is true not only when writing to files, but also in memory. It is the fact that the struct is padded in memory, that leads to the padding showing up in the file, if the struct is written out byte-by-byte.
It is in general very hard to replicate with certainty the exact padding scheme, although I guess some heuristics would get you quite far. It helps if you have the struct declaration, for analysis.
Typically, fields larger than one char will be aligned so that their starting offset inside the structure is a multiple of their size. This means shorts will generally be on even offsets (divisible by 2, assuming sizeof (short) == 2), while doubles will be on offsets divisible by 8, and so on.
UPDATE: It is for reasons like this (and also reasons having to do with endianness) that it is generally a bad idea to dump whole structs out to files. It's better to do it field-by-field, like so:
put_char(out, a.c);
put_int(out, a.i);
Assuming the put-functions only write the bytes needed for the value, this will emit a padding-less version of the struct to the file, solving the problem. It is also possible to ensure a proper, known, byte-ordering by writing these functions accordingly.
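The same field-by-field idea on the Java side might look like the sketch below; DataOutputStream writes big-endian, so the byte order still has to be agreed with the C program, and the method name is only illustrative:

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Writes the struct { char c; int i; } field by field, with no padding.
static void writeRecord(String path, byte c, int i) throws IOException {
    try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
        out.writeByte(c);   // 1 byte, no padding emitted
        out.writeInt(i);    // 4 bytes, big-endian
    }
}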
Is there an automatic way to apply C padding in Java output? Or do I have to go through compiler documentation to see how it works (the compiler is g++ by the way).
Neither. Instead, you explicitly specify a data/communication format and implement that specification, rather than relying on implementation details of the C compiler. You won't even get the same output from different C compilers.
For interoperability, look at the ByteBuffer class.
Essentially, you create a buffer of a certain size, put() variables of different types at different positions, and then call array() at the end to retrieve the "raw" data representation:
ByteBuffer bb = ByteBuffer.allocate(8);
bb.order(ByteOrder.LITTLE_ENDIAN);
bb.put(0, someByteValue);    // 1-byte field at offset 0
bb.putInt(4, someInteger);   // 4-byte int at offset 4 (offsets 1-3 left as padding)
byte[] rawBytes = bb.array();
But it's up to you to work out where to put padding, i.e. how many bytes to skip between positions.
For reading data written from C, then you generally wrap() a ByteBuffer around some byte array that you've read from a file.
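For example, reading the padded struct { char c; int i; } from earlier might look like the following sketch; the 3 padding bytes after the char and the little-endian byte order are assumptions about how that particular compiler laid the struct out:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

// Reads one record written by the C program: 1-byte char, 3 padding bytes, 4-byte int.
static void readRecord(String path) throws IOException {
    byte[] raw = Files.readAllBytes(Paths.get(path));
    ByteBuffer bb = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
    byte c = bb.get(0);     // offset 0: the char field
    int i = bb.getInt(4);   // offset 4: the int field, after 3 padding bytes
    System.out.println(c + " " + i);
}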
In case it's helpful, I've written more on ByteBuffer.
A handy way of reading/writing C structs in Java is to use the javolution Struct class (see http://www.javolution.org). This won't help you with automatically padding/aligning your data, but it does make working with raw data held in a ByteBuffer much more convenient. If you're not familiar with javolution, it's well worth a look as there's lots of other cool stuff in there too.
This hole is configurable: the compiler has switches to align structs on 1/2/4/8-byte boundaries.
So the first question is: Which alignment exactly do you want to simulate?
With Java, the sizes of the data types are defined by the language specification. For example, a byte is 1 byte, a short is 2 bytes, and so on. This is unlike C, where the size of each type is architecture-dependent.
Therefore, it would be important to know how the binary file is formatted in order to be able to read the file into Java.
It may be necessary to take steps to be certain that fields are a specific size, to account for differences in the compiler or architecture. The mention of alignment seems to suggest that the output file will depend on the architecture.
You could try Preon:
Preon is a Java library for building codecs for bitstream-compressed data in a declarative (annotation based) way. Think JAXB or Hibernate, but then for binary encoded data.
It can handle big/little-endian binary data, alignment (padding), and various numeric types, among other features. It is a very nice library; I like it very much.
My $0.02.
I highly recommend protocol buffers for exactly this problem.
As I understand it, you're saying that you don't control the output of the C program. You have to take it as given.
So do you have to read this file for some specific set of structures, or do you have to solve this in a general case? I mean, is the problem that someone said, "Here's the file created by program X, you have to read it in Java"? Or do they expect your Java program to read the C source code, find the structure definition, and then read it in Java?
If you've got a specific file to read, the problem isn't really very difficult. Either by reviewing the C compiler specifications or by studying example files, figure out where the padding is. Then on the Java side, read the file as a stream of bytes, and build the values you know are coming. Basically I'd write a set of functions to read the required number of bytes from an InputStream and turn them into the appropriate data type. Like:
int readInt(InputStream is, int len)
    throws IOException, PrematureEndOfDataException
{
    int n = 0;
    while (len-- > 0)
    {
        int i = is.read();
        if (i == -1)
            throw new PrematureEndOfDataException();
        n = (n << 8) + (i & 0xFF); // mask so high bytes are not sign-extended
    }
    return n; // bytes are assembled most-significant first (big-endian)
}
You can alter the packing on the C side to ensure that no padding is used, or alternatively you can look at the resulting file format in a hex editor so that you can write a parser in Java that ignores the bytes that are padding.