How to Find the Default Charset/Encoding in Java?

The obvious answer is to use Charset.defaultCharset(), but we recently found out that this might not be the right answer. I was told that on several occasions its result differs from the real default charset used by the java.io classes. It looks like Java keeps two sets of default charsets. Does anyone have any insight into this issue?
We were able to reproduce one failing case. It's kind of a user error, but it may still expose the root cause of all the other problems. Here is the code:
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class CharSetTest {
    public static void main(String[] args) {
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.setProperty("file.encoding", "Latin-1");
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("Default Charset in Use=" + getDefaultCharSet());
    }

    private static String getDefaultCharSet() {
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        return writer.getEncoding();
    }
}
Our server requires the default charset to be Latin-1 to deal with some mixed encodings (ANSI/Latin-1/UTF-8) in a legacy protocol, so all our servers run with this JVM parameter:
-Dfile.encoding=ISO-8859-1
Here is the result on Java 5,
Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=UTF-8
Default Charset in Use=ISO8859_1
Someone tried to change the encoding at runtime by setting file.encoding in the code. We all know that doesn't work. However, it apparently throws off defaultCharset(), while it doesn't affect the real default charset used by OutputStreamWriter.
Is this a bug or feature?
EDIT: The accepted answer shows the root cause of the issue. Basically, you can't trust defaultCharset() in Java 5: it does not report the default encoding actually used by the I/O classes. Java 6 appears to correct this issue.

This is really strange... Once set, the default Charset is cached and it isn't changed while the class is in memory. Setting the "file.encoding" property with System.setProperty("file.encoding", "Latin-1"); does nothing. Every time Charset.defaultCharset() is called it returns the cached charset.
Here are my results:
Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=ISO-8859-1
Default Charset in Use=ISO8859_1
I'm using JVM 1.6 though.
(update)
Ok. I did reproduce your bug with JVM 1.5.
Looking at the 1.5 source code, the cached default charset is never set. I don't know whether this is a bug or not, but 1.6 changes the implementation and uses the cached charset:
JVM 1.5:
public static Charset defaultCharset() {
    synchronized (Charset.class) {
        if (defaultCharset == null) {
            java.security.PrivilegedAction pa =
                new GetPropertyAction("file.encoding");
            String csn = (String) AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                return cs;
            return forName("UTF-8");
        }
        return defaultCharset;
    }
}
JVM 1.6:
public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            java.security.PrivilegedAction pa =
                new GetPropertyAction("file.encoding");
            String csn = (String) AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}
When you set file.encoding=Latin-1 and then call Charset.defaultCharset() again, what happens is this: because the cached default charset isn't set, the method tries to look up the charset named Latin-1. That name isn't found, because it isn't a valid charset name, so the method falls back to UTF-8.
As for why the I/O classes such as OutputStreamWriter return an unexpected result: the implementation of sun.nio.cs.StreamEncoder (which is used by these I/O classes) also differs between JVM 1.5 and JVM 1.6. The JVM 1.6 implementation relies on the Charset.defaultCharset() method to get the default encoding when one is not provided to the I/O classes. The JVM 1.5 implementation uses a different method, Converters.getDefaultEncodingName(), to get the default charset. That method uses its own cache of the default charset, which is set upon JVM initialization:
JVM 1.6:
public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                  Object lock,
                                                  String charsetName)
    throws UnsupportedEncodingException
{
    String csn = charsetName;
    if (csn == null)
        csn = Charset.defaultCharset().name();
    try {
        if (Charset.isSupported(csn))
            return new StreamEncoder(out, lock, Charset.forName(csn));
    } catch (IllegalCharsetNameException x) { }
    throw new UnsupportedEncodingException(csn);
}
JVM 1.5:
public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                  Object lock,
                                                  String charsetName)
    throws UnsupportedEncodingException
{
    String csn = charsetName;
    if (csn == null)
        csn = Converters.getDefaultEncodingName();
    if (!Converters.isCached(Converters.CHAR_TO_BYTE, csn)) {
        try {
            if (Charset.isSupported(csn))
                return new CharsetSE(out, lock, Charset.forName(csn));
        } catch (IllegalCharsetNameException x) { }
    }
    return new ConverterSE(out, lock, csn);
}
But I agree with the comments. You shouldn't rely on this property. It's an implementation detail.

Is this a bug or feature?
Looks like undefined behaviour. I know that, in practice, you can change the default encoding using a command-line property, but I don't think what happens when you do this is defined.
Bug ID: 4153515 on problems setting this property:
This is not a bug. The "file.encoding" property is not required by the J2SE
platform specification; it's an internal detail of Sun's implementations and
should not be examined or modified by user code. It's also intended to be
read-only; it's technically impossible to support the setting of this property
to arbitrary values on the command line or at any other time during program
execution.
The preferred way to change the default encoding used by the VM and the runtime
system is to change the locale of the underlying platform before starting your
Java program.
I cringe when I see people setting the encoding on the command line - you don't know what code that is going to affect.
If you do not want to use the default encoding, set the encoding you do want explicitly via the appropriate method/constructor.
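For example, a minimal sketch of passing the charset explicitly instead of relying on the platform default (the file name here is just a placeholder; StandardCharsets requires Java 7, so on older JVMs use Charset.forName("ISO-8859-1") instead):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    public static void main(String[] args) throws IOException {
        // The charset is passed to the constructor, so the platform
        // default (file.encoding) never enters the picture.
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("out.txt"), StandardCharsets.ISO_8859_1)) {
            out.write("explicit Latin-1 output\n");
        }
    }
}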

The behaviour is not really that strange. Looking into the implementation of the classes, it is caused by:
Charset.defaultCharset() is not caching the determined character set in Java 5.
Setting the system property "file.encoding" and invoking Charset.defaultCharset() again causes a second evaluation of the system property; no character set named "Latin-1" is found, so Charset.defaultCharset() falls back to "UTF-8".
OutputStreamWriter, however, caches the default character set and is probably already used during VM initialization, so its default character set diverges from Charset.defaultCharset() if the system property "file.encoding" has been changed at runtime.
As already pointed out, it is not documented how the VM must behave in such a situation. The Charset.defaultCharset() API documentation is not very precise on how the default character set is determined, only mentioning that it is usually done on VM startup, based on factors like the OS default character set or default locale.

First, Latin-1 is the same as ISO-8859-1, so the default was already OK for you, right?
You successfully set the encoding to ISO-8859-1 with your command-line parameter. You also set it programmatically to "Latin-1", but that's not a value Java recognizes as a file encoding. See http://java.sun.com/javase/6/docs/technotes/guides/intl/encoding.doc.html
When you do that, it looks like Charset resets to UTF-8, judging from the source. That at least explains most of the behavior.
I don't know why OutputStreamWriter shows ISO8859_1. It delegates to closed-source sun.misc.* classes. I'm guessing it isn't quite dealing with encoding via the same mechanism, which is weird.
But of course you should always be specifying what encoding you mean in this code. I'd never rely on the platform default.

I set the JVM argument -Dfile.encoding=UTF-8 on our WAS (WebSphere Application Server) instances to change the servers' default character set.

Check
System.getProperty("sun.jnu.encoding")
It seems to be the same encoding as the one used in your system's command line.
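A quick sketch to compare the relevant properties side by side (note that sun.jnu.encoding is an undocumented, internal property, so treat the output as informational only):

public class EncodingProps {
    public static void main(String[] args) {
        // file.encoding feeds the default charset used by the I/O classes;
        // sun.jnu.encoding is reportedly used for file paths and process arguments.
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
        System.out.println("defaultCharset   = " + java.nio.charset.Charset.defaultCharset());
    }
}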

Related

Why do Files' methods use CodingErrorAction.REPORT to handle encoding errors while usual JDK behavior is to use REPLACE?

Most JDK methods accepting a Charset (or a charset name) configure the Charset(De|En)coder to handle malformed input and unmappable characters with CodingErrorAction.REPLACE:
String ctors
ByteArrayOutputStream.toString()
InputStreamReader ctors
OutputStreamWriter ctors
PrintStream ctors
PrintWriter ctors
Formatter ctors
Scanner ctors
etc.
Even though it is never stated in the Javadoc, it is easy to check this in the OpenJDK source code or with simple test cases:
private static final byte[] INVALID_UTF_8 = new byte[] {-1, 97};

@Test
public void string_uses_replacement_characters() {
    String str = new String(INVALID_UTF_8, StandardCharsets.UTF_8);
    assertThat(str).isEqualTo("\uFFFDa");
}

@Test
public void inputStreamReader_uses_replacement_characters() throws IOException {
    ByteArrayInputStream bais = new ByteArrayInputStream(INVALID_UTF_8);
    InputStreamReader isr = new InputStreamReader(bais, StandardCharsets.UTF_8);
    BufferedReader br = new BufferedReader(isr);
    assertThat(br.readLine()).isEqualTo("\uFFFDa");
}
Some of these classes also define methods accepting a Charset(En|De)coder for those who want to specify another CodingErrorAction.
JDK 7 added the Files class, which provides utility/factory methods to reduce the boilerplate required by a few very common actions. However, these methods do not follow the usual behavior described earlier: the (en|de)coders are not configured to use CodingErrorAction.REPLACE, and exceptions are thrown on invalid bytes and unmappable characters.
@Rule
public TemporaryFolder tmp = new TemporaryFolder();

@Test
public void readAllLines_throws_MIE_on_invalid_bytes() throws IOException {
    Path p = tmp.newFile().toPath();
    Files.write(p, INVALID_UTF_8);
    assertThatThrownBy(() -> Files.readAllLines(p, StandardCharsets.UTF_8))
        .isInstanceOf(MalformedInputException.class);
}
Does anyone know the rationale for this change, and why nobody found it useful to clearly state it in the Javadoc?
Even though I think REPORT is a saner default behavior, it seems really error-prone to silently change this tacit agreement made years ago. Most developers would expect Files.newBufferedReader(p, StandardCharsets.UTF_8) to be equivalent to new BufferedReader(new InputStreamReader(new FileInputStream(p), "UTF-8")), which is not true.
Note: https://bugs.openjdk.java.net/browse/JDK-8143997 seems related to my question.
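For reference, the old REPLACE semantics can be restored with the Files factories by configuring a decoder explicitly; a minimal sketch:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReplacingReader {
    public static BufferedReader open(Path p) throws IOException {
        // Substitute U+FFFD instead of throwing, mirroring what
        // new InputStreamReader(InputStream, Charset) does internally.
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new BufferedReader(
            new InputStreamReader(Files.newInputStream(p), dec));
    }
}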
it seems really error prone to silently change this tacit agreement that has been made years ago. Most developers would expect
There is no tacit agreement. If there were, then all implementation details would implicitly be part of the specification, and this would happen:
(comic omitted: xkcd's "Workflow" strip about a user depending on spacebar heating; source xkcd.com)
So please don't rely on spacebar heating (unless specified).
A prominent example of behavior not guaranteed by the spec being changed is the Arrays.sort and Collections.sort methods. In the past they tolerated Comparator or equals implementations that violated the transitivity requirement mandated by the spec. When the merge-sort implementation was changed to TimSort, an exception was added that reports violations of the requirement. This was backwards-incompatible but within spec, since such comparators cannot exist within the spec.
So in principle the devs could even have changed the old implementations. But for the sake of backwards compatibility and because there was no pressing need to do so they elected to only change the behavior to better, saner practices on new APIs.
New APIs are an evolution over old APIs. Streams are not Collections are not Enumerations. ByteChannels are not IOStreams.

API throws java.io.UnsupportedEncodingException

I am developing a Java program in Eclipse using a proprietary API, and it throws the following exception at run-time:
java.io.UnsupportedEncodingException:
at java.lang.StringCoding.encode(StringCoding.java:287)
at java.lang.String.getBytes(String.java:954)...
my code:
private static String SERVER = "localhost";
private static int PORT = 80;
private static String DFT = "";
private static String USER = "xx";
private static String pwd = "xx";

public static void main(String[] args) {
    LLValue entInfo = new LLValue();
    LLSession session = new LLSession(SERVER, PORT, DFT, USER, pwd);
    try {
        LAPI_DOCUMENTS doc = new LAPI_DOCUMENTS(session);
        doc.AccessPersonalWS(entInfo);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The session appears to open with no errors, but the encoding exception is thrown at doc.AccessEnterpriseWS(entInfo)
Through researching this error I have tried using the -encoding option of the compiler, changing the encoding of my editor, etc.
My questions are:
How can I find out the encoding of the .class files I am trying to use?
Should I be matching the encoding of my new program to the encoding of the API?
If Java is machine independent, why isn't there a standard encoding?
I have read this stack trace and this guide already --
Any suggestions will be appreciated!
Cheers
Run it in your debugger with a breakpoint on String.getBytes() or StringCoding.encode(). Both are JDK classes, so you have access to them and should be able to see what the third-party code is passing in.
The character encoding specifies how to interpret the raw binary data. The default encoding on English Windows systems is CP1252. Other languages and systems may use a different default encoding. As a quick test, you might try specifying UTF-8 to see if the problem magically disappears.
As noted in this question, the JVM uses the default encoding of the OS, although you can override this default.
Without knowing more about the third party API you are trying to use, it's hard to say what encoding they might be using. Unfortunately from looking at the implementation of StringCoding.encode() it appears there are a couple different ways you could get an UnsupportedEncodingException. Stepping through with a debugger should help narrow things down.
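As a side note, passing a Charset object instead of a charset name avoids the checked UnsupportedEncodingException entirely, because the charset is already resolved. A small sketch (StandardCharsets requires Java 7; this is only an illustration, not a fix for the proprietary API):

import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    public static void main(String[] args) {
        // getBytes(Charset) declares no checked exception,
        // unlike getBytes(String charsetName).
        byte[] utf8 = "test".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length + " bytes");
    }
}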
It looks to me as if something in the proprietary API is calling String.getBytes with an empty string for the character set.
I compiled the following class
public class Test2 {
    public static void main(String[] args) throws Exception {
        "test".getBytes("");
    }
}
and when I ran it, I got the following stacktrace:
Exception in thread "main" java.io.UnsupportedEncodingException:
at java.lang.StringCoding.encode(StringCoding.java:286)
at java.lang.String.getBytes(String.java:954)
at Test2.main(Test2.java:3)
I would be surprised if this has anything to do with the encoding in which the class files are written. It looks to me like a problem with the code, not a problem you can fix by changing file encodings or compiler/JVM switches.
I don't know anything about what this proprietary API is supposed to do or how it works. Perhaps it is expecting to be run inside a Java EE or web application container? Perhaps it has a bug? Perhaps it needs more configuration before it can run without throwing exceptions? Given that it's proprietary, can you get any support from the vendor?

How do I determine MaxDirectMemorySize on a running JVM?

I have an application which uses DirectByteBuffers to store data, but I'd like to know what MaxDirectMemorySize is so I don't accidentally exceed it.
Without configuring this manually, how can I figure out, from within the program, what MaxDirectMemorySize is?
The accepted answer only works if the option is explicitly specified on the command line. As of Java 6, you can access the option directly using the HotSpotDiagnosticMXBean. The following Java 7 code can read it conveniently:
final HotSpotDiagnosticMXBean hsdiag = ManagementFactory
        .getPlatformMXBean(HotSpotDiagnosticMXBean.class);
if (hsdiag != null) {
    System.out.println(hsdiag.getVMOption("MaxDirectMemorySize"));
}
Note that this may return a value of zero, meaning to use the default setting, which is equal to Runtime.getRuntime().maxMemory(). For example, with Oracle JDK 7u71 64-bit on Windows 7, this returns 3,690,987,520.
Alternatively, if you're willing to resort to accessing the sun.misc package, it's available directly by calling sun.misc.VM.maxDirectMemory().
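Putting the two together, here is a sketch of a helper that resolves the effective limit, treating zero as "use the default"; the class and method names are my own:

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public final class DirectMemoryLimit {
    // Returns the effective MaxDirectMemorySize in bytes.
    public static long effective() {
        HotSpotDiagnosticMXBean hsdiag = ManagementFactory
                .getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (hsdiag != null) {
            long configured = Long.parseLong(
                    hsdiag.getVMOption("MaxDirectMemorySize").getValue());
            if (configured > 0) {
                return configured;
            }
        }
        // Zero (or no diagnostic bean) means the default setting,
        // which matches Runtime.getRuntime().maxMemory() as noted above.
        return Runtime.getRuntime().maxMemory();
    }
}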
You can get ALL JVM parameters with...
RuntimeMXBean runtimeMxBean = ManagementFactory.getRuntimeMXBean();
List<String> args = runtimeMxBean.getInputArguments();
for (int i = 0; i < args.size(); i++) {
    System.out.println(args.get(i));
}
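To connect this back to the question: you could scan those arguments for the flag, though this only finds it when it was set explicitly on the command line (a sketch):

for (String arg : args) {
    // Present only if -XX:MaxDirectMemorySize=... was passed to the JVM.
    if (arg.startsWith("-XX:MaxDirectMemorySize=")) {
        System.out.println("Configured: " + arg.substring(arg.indexOf('=') + 1));
    }
}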

Is -Djava.library.path=... equivalent to System.setProperty("java.library.path", ...)

I load an external library that is placed in ./lib. Are these two solutions to set the java.library.path equivalent?
Set path in console when executing jar:
java -Djava.library.path=./lib -jar myApplication.jar
Set path in the code before loading library:
System.setProperty("java.library.path", "./lib");
If they are equivalent, why can Java not find the library with the second solution while the first one works?
If they are not, is there a way to set the path in code?
Although it is not well documented, the java.library.path system property is a "read-only" property as far as the System.loadLibrary() method is concerned. This is a reported bug, but it was closed by Sun rather than fixed. The problem is that the JVM's ClassLoader reads this property once at startup and then caches it, not allowing us to change it programmatically afterward. The line System.setProperty("java.library.path", anyVal); will have no effect except on System.getProperty() method calls.
Luckily, someone posted a workaround on the Sun forums. Unfortunately, that link no longer works but I did find the code on another source. Here is the code you can use to work around not being able to set the java.library.path system property:
public static void addDir(String s) throws IOException {
    try {
        // This enables the java.library.path to be modified at runtime
        // From a Sun engineer at http://forums.sun.com/thread.jspa?threadID=707176
        Field field = ClassLoader.class.getDeclaredField("usr_paths");
        field.setAccessible(true);
        String[] paths = (String[]) field.get(null);
        for (int i = 0; i < paths.length; i++) {
            if (s.equals(paths[i])) {
                return;
            }
        }
        String[] tmp = new String[paths.length + 1];
        System.arraycopy(paths, 0, tmp, 0, paths.length);
        tmp[paths.length] = s;
        field.set(null, tmp);
        System.setProperty("java.library.path",
                System.getProperty("java.library.path") + File.pathSeparator + s);
    } catch (IllegalAccessException e) {
        throw new IOException("Failed to get permissions to set library path");
    } catch (NoSuchFieldException e) {
        throw new IOException("Failed to get field handle to set library path");
    }
}
WARNING: This may not work on all platforms and/or JVMs.
Generally speaking, both approaches have the same net effect in that the system property java.library.path is set to the value ./lib.
However, some system properties are only evaluated at specific points in time, such as the startup of the JVM. If java.library.path is among those properties (and your experiment seems to indicate that), then using the second approach will have no noticeable effect except for returning the new value on future invocations of getProperty().
As a rule of thumb, setting a property with -D on the command line works for all system properties, while System.setProperty() only affects properties that are not read and cached during startup.
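A minimal sketch of that asymmetry (the library name is hypothetical; expect the loadLibrary call to fail even though getProperty reports the new value):

public class LibraryPathDemo {
    public static void main(String[] args) {
        System.setProperty("java.library.path", "./lib");
        // getProperty() sees the new value...
        System.out.println(System.getProperty("java.library.path"));
        // ...but the ClassLoader cached the original path list at startup,
        // so this still throws UnsatisfiedLinkError unless ./lib was on the
        // path when the JVM started (e.g. via -Djava.library.path=./lib).
        System.loadLibrary("mylib");
    }
}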
You can add three lines:
System.setProperty("java.library.path", "/path/to/libs");
Field fieldSysPath = ClassLoader.class.getDeclaredField("sys_paths");
fieldSysPath.setAccessible(true);
fieldSysPath.set(null, null);
(and also import java.lang.reflect.Field). Clearing the cached sys_paths field forces the ClassLoader to re-read the java.library.path property on the next library load, which solves the problem.
This is an addendum to Jesse Webb's amazing answer above: https://stackoverflow.com/a/6408467/257299
For Java 17:
import jdk.internal.loader.NativeLibraries;

final Class<?>[] declClassArr = NativeLibraries.class.getDeclaredClasses();
final Class<?> libraryPaths =
        Arrays.stream(declClassArr)
                .filter(klass -> klass.getSimpleName().equals("LibraryPaths"))
                .findFirst()
                .get();
final Field field = libraryPaths.getDeclaredField("USER_PATHS");
final MethodHandles.Lookup lookup = MethodHandles.privateLookupIn(Field.class, MethodHandles.lookup());
final VarHandle varHandle = lookup.findVarHandle(Field.class, "modifiers", int.class);
varHandle.set(field, field.getModifiers() & ~Modifier.FINAL);
Since package jdk.internal.loader from module java.base is not normally accessible, you will need to add "exports" and "opens" to both the compiler and JVM runtime args.
--add-exports=java.base/jdk.internal.loader=ALL-UNNAMED
--add-opens=java.base/jdk.internal.loader=ALL-UNNAMED
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED
Read more here:
--add-exports: https://stackoverflow.com/a/53647605/257299
--add-opens: https://stackoverflow.com/a/61663667/257299
Remove final modifier on Java12+: https://stackoverflow.com/a/56043252/257299

Setting Java VM line.separator

Has anybody found a way to specify the Java line.separator property on VM startup? I was thinking of something like this:
java -Dline.separator="\n"
But this doesn't interpret the "\n" as a linefeed character. Any ideas?
Try using java -Dline.separator=$'\n'. That should do the trick, at least in bash.
Here is a test-run:
aioobe@r60:~/tmp$ cat Test.java
public class Test {
    public static void main(String[] args) {
        System.out.println("\"" + System.getProperty("line.separator") + "\"");
    }
}
aioobe@r60:~/tmp$ javac Test.java && java -Dline.separator=$'\n' Test
"
"
aioobe@r60:~/tmp$
Note:
The $'' expression uses the Bash feature ANSI-C quoting. It expands backslash-escaped characters, so $'\n' produces a line feed (ASCII code 10), enclosed in single quotes. See the Bash manual, section 3.1.2.4, ANSI-C Quoting.
To bridge the gap between aioobe's and Bozho's answers, I would also advise against setting the line.separator parameter at JVM startup, as this potentially breaks many fundamental assumptions the JVM and library code make about the environment they run in. For instance, if a library you depend on relies on line.separator in order to store a config file in a cross-platform way, you've just broken that behavior. Yes, it's an edge case, but that makes it all the more nefarious when, years from now, a problem does crop up, and by then all your code depends on this tweak being in place, while your libraries (correctly) assume it isn't.
That said, sometimes these things are out of your control, like when a library relies on line.separator and provides no way for you to override that behavior explicitly. In such a case, you're stuck overriding the value, or something more painful like re-implementing or patching the code manually.
For those limited cases, it's acceptable to override line.separator, but we've got to follow two rules:
Minimize the scope of the override
Revert the override no matter what
Both of these requirements are well served by AutoCloseable and the try-with-resources syntax, so I've implemented a PropertiesModifier class that cleanly provides both.
/**
 * Class which enables temporary modifications to the System properties,
 * via an AutoCloseable. Wrap the behavior that needs your modification
 * in a try-with-resources block in order to have your properties
 * apply only to code within that block. Generally, alternatives
 * such as explicitly passing in the value you need, rather than pulling
 * it from System.getProperties(), should be preferred to using this class.
 */
public class PropertiesModifier implements AutoCloseable {
    private final String original;

    public PropertiesModifier(String key, String value) {
        this(ImmutableMap.of(key, value));
    }

    public PropertiesModifier(Map<String, String> map) {
        StringWriter sw = new StringWriter();
        try {
            System.getProperties().store(sw, "");
        } catch (IOException e) {
            throw new AssertionError("Impossible with StringWriter", e);
        }
        original = sw.toString();
        for (Map.Entry<String, String> e : map.entrySet()) {
            System.setProperty(e.getKey(), e.getValue());
        }
    }

    @Override
    public void close() {
        Properties set = new Properties();
        try {
            set.load(new StringReader(original));
        } catch (IOException e) {
            throw new AssertionError("Impossible with StringReader", e);
        }
        System.setProperties(set);
    }
}
My use case was with Files.write(), which is a very convenient method, except it explicitly relies on line.separator. By wrapping the call to Files.write() I can cleanly specify the line separator I want to use, without risking exposing this to any other parts of my application (take note of course, that this still isn't thread-safe).
try (PropertiesModifier pm = new PropertiesModifier("line.separator", "\n")) {
    Files.write(file, ImmutableList.of(line), Charsets.UTF_8);
}
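Alternatively, the dependency on line.separator can be sidestepped entirely by joining the lines with an explicit separator before writing; a small sketch (the path is a placeholder; String.join requires Java 8):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class WriteWithExplicitSeparator {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get("out.txt");            // placeholder path
        List<String> lines = Arrays.asList("a", "b");
        // Join with an explicit "\n" so Files.write never consults line.separator.
        Files.write(file, (String.join("\n", lines) + "\n")
                .getBytes(StandardCharsets.UTF_8));
    }
}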
I wouldn't do that if I were you. The line-separator is platform specific, and should remain so. If you want to write windows-only or linux-only files, define a UNIX_LINE_SEPARATOR constant somewhere and use it instead.
