Print all Unicode characters within a specific range - java

I can't find the right API for this. I tried this;
public static void main(String[] args) {
for (int i = 2309; i < 3000; i++) {
String hex = Integer.toHexString(i);
System.out.println(hex + " = " + (char) i);
}
}
This code only prints like this in Eclipse IDE.
905 = ?
906 = ?
907 = ?
...
How can I make us of these decimal and hex values to get the Unicode characters?

It prints like that because all consoles use a mono spaced font. Try that on a JLabel in a frame and it should display fine.
EDIT:
Try creating a unicode printstream
PrintStream out = new PrintStream(System.out, true, "UTF-8");
And then print to it.
Here's the output in CMD window.

I forgot to save it in UTF-8 format by changing it from
File > Properties > Select the text file encoding
This will properly print the right character from the Eclipse console. The default is cp1252 which will print only ? for those characters it does not understand.

Related

Print unicode character in java

Displaying unicode character in java shows "?" sign. For example, i tried to print "अ". Its unicode Number is U+0905 and html representation is "अ".
The below codes prints "?" instead of unicode character.
char aa = '\u0905';
String myString = aa + " result" ;
System.out.println(myString); // displays "? result"
Is there a way to display unicode character directly from unicode itself without using unicode numbers? i.e "अ" is saved in file now display the file in jsp.
Java defines two types of streams, byte and character.
The main reason why System.out.println() can't show Unicode characters is that System.out.println() is a byte stream that deal with only the low-order eight bits of character which is 16-bits.
In order to deal with Unicode characters(16-bit Unicode character), you have to use character based stream i.e. PrintWriter.
PrintWriter supports the print( ) and println( ) methods. Thus, you can use these methods
in the same way as you used them with System.out.
PrintWriter printWriter = new PrintWriter(System.out,true);
char aa = '\u0905';
printWriter.println("aa = " + aa);
try to use utf8 character set -
Charset utf8 = Charset.forName("UTF-8");
Charset def = Charset.defaultCharset();
String charToPrint = "u0905";
byte[] bytes = charToPrint.getBytes("UTF-8");
String message = new String(bytes , def.name());
PrintStream printStream = new PrintStream(System.out, true, utf8.name());
printStream.println(message); // should print your character
Your myString variable contains the perfectly correct value. The problem must be the output from System.out.println(myString) which has to send some bytes to some output to show the glyphs that you want to see.
System.out is a PrintStream using the "platform default encoding" to convert characters to byte sequences - maybe your platform doesn't support that character. E.g. on my Windows 7 computer in Germany, the default encoding is CP1252, and there's no byte sequence in this encoding that corresponds to your character.
Or maybe the encoding is correct, but simply the font that creates graphical glyphs from characters doesn't have that charater.
If you are sending your output to a Windows CMD.EXE window, then maybe both reasons apply.
But be assured, your string is correct, and if you send it to a destination that can handle it (e.g. a Swing JTextField), it'll show up correctly.
I ran into the same problem wiht Eclipse. I solved my problem by switching the Encoding format for the console from ISO-8859-1 to UTF-8. You can do in the Run/Run Configurations/Common menu.
https://eclipsesource.com/blogs/2013/02/21/pro-tip-unicode-characters-in-the-eclipse-console/
Unicode is a unique code which is used to print any character or symbol.
You can use unicode from --> https://unicode-table.com/en/
Below is an example for printing a symbol in Java.
package Basics;
/**
*
* #author shelc
*/
public class StringUnicode {
public static void main(String[] args) {
String var1 = "Cyntia";
String var2 = new String(" is my daughter!");
System.out.println(var1 + " \u263A" + var2);
//printing heart using unicode
System.out.println("Hello World \u2665");
}
}
******************************************************************
OUTPUT-->
Cyntia ☺ is my daughter!
Hello World ♥

Odd behaviour with StringBuilder, Java

So I've been trying to print out some lines of "-" characters. Why does the following not work?:
StringBuilder horizonRule = new StringBuilder();
for(int i = 0 ; i < 12 ; i++) {
horizonRule.append("─");
System.out.println(horizonRule.toString());
}
The correct output is several lines like
─
──
───
────
and so on, but the incorrect output is
â??
â??â??
â??â??â??
I'm guessing the string is not being properly decoded by println or something
The string in your code is not a hyphen but a UTF8 box drawing character.
The terminal your application is printing to doesn't seem to expect any UTF8 content, so the issue is not inside your application.
Replace it with a real hyphen (-) or make sure the tool that displays the output supports UTF8.
You say that the IDE wants to save as UTF-8. You then probably have saved it as UTF-8.
However your compiler is likely to compile in whatever encoding your system uses.
If you write your code as UTF-8, make sure to compile it with the same encoding:
javac -encoding utf8 MyClass.java
I tried your code (I literally just copy'n'paste) using BeanShell, and it worked perfectly. So there's nothing wrong with the code. It will be your environment.
stewart$ bsh
Picked up JAVA_TOOL_OPTIONS: -Djava.awt.headless=true -Dapple.awt.UIElement=true
BeanShell 2.0b4 - by Pat Niemeyer (pat#pat.net)
bsh % StringBuilder horizonRule = new StringBuilder();
bsh % for(int i=0; i<12; i++) {
horizonRule.append("─");
System.out.println(horizonRule.toString());
}
─
──
───
────
─────
──────
───────
────────
─────────
──────────
───────────
────────────
bsh %
public class myTest1 {
public static void main(String[] args) {
StringBuilder horizonRule = new StringBuilder();
for (int i = 0 ; i <= 13 ; i++){
horizonRule.append('_');
System.out.println(horizonRule.toString());
}
}
}
is correct;
maybe you use a different encoding ? clear env path

JESS Userfunction writes "BS" instead of "/home" to a file

I'm using JESS for my expert system implementation and I have a userfunction. It writes some strings to a text file.
public Value call(ValueVector vv, Context context) throws JessException {
Rete engine = context.getEngine();
int size = vv.size();
for(i = 0; i < size-1; i++)
params[i] = vv.get(i+1).stringValue(context);
engine.eval("(printout file " + params[2] + ")");
return new Value(params[1], RU.STRING);
}
params[2] has /home/username/folder as content. When it prints out to a file I get the following in the file. BS has black background btw.
BSusername/folder
I'm not sure what's going on here. Any ideas?
In addition, I've never had this problem when I print out from JESS code.
The unquoted text /home/ is being parsed as a regular expression; the printed value is somewhat unpredictable. You need to include double quotes in your built-up command so the path is seen as a quoted string.

How to replace � in a string

I have a string that contains a character � I haven't been able to replace it correctly.
String.replace("�", "");
doesn't work, does anyone know how to remove/replace the � in the string?
That's the Unicode Replacement Character, \uFFFD. (info)
Something like this should work:
String strImport = "For some reason my �double quotes� were lost.";
strImport = strImport.replaceAll("\uFFFD", "\"");
Character issues like this are difficult to diagnose because information is easily lost through misinterpretation of characters via application bugs, misconfiguration, cut'n'paste, etc.
As I (and apparently others) see it, you've pasted three characters:
codepoint glyph escaped windows-1252 info
=======================================================================
U+00ef ï \u00ef ef, LATIN_1_SUPPLEMENT, LOWERCASE_LETTER
U+00bf ¿ \u00bf bf, LATIN_1_SUPPLEMENT, OTHER_PUNCTUATION
U+00bd ½ \u00bd bd, LATIN_1_SUPPLEMENT, OTHER_NUMBER
To identify the character, download and run the program from this page. Paste your character into the text field and select the glyph mode; paste the report into your question. It'll help people identify the problematic character.
You are asking to replace the character "�" but for me that is coming through as three characters 'ï', '¿' and '½'. This might be your problem... If you are using Java prior to Java 1.5 then you only get the UCS-2 characters, that is only the first 65K UTF-8 characters. Based on other comments, it is most likely that the character that you are looking for is '�', that is the Unicode replacement character. This is the character that is "used to replace an incoming character whose value is unknown or unrepresentable in Unicode".
Actually, looking at the comment from Kathy, the other issue that you might be having is that javac is not interpreting your .java file as UTF-8, assuming that you are writing it in UTF-8. Try using:
javac -encoding UTF-8 xx.java
Or, modify your source code to do:
String.replaceAll("\uFFFD", "");
As others have said, you posted 3 characters instead of one. I suggest you run this little snippet of code to see what's actually in your string:
public static void dumpString(String text)
{
for (int i=0; i < text.length(); i++)
{
System.out.println("U+" + Integer.toString(text.charAt(i), 16)
+ " " + text.charAt(i));
}
}
If you post the results of that, it'll be easier to work out what's going on. (I haven't bothered padding the string - we can do that by inspection...)
Change the Encoding to UTF-8 while parsing .This will remove the special characters
Use the unicode escape sequence. First you'll have to find the codepoint for the character you seek to replace (let's just say it is ABCD in hex):
str = str.replaceAll("\uABCD", "");
for detail
import java.io.UnsupportedEncodingException;
/**
* File: BOM.java
*
* check if the bom character is present in the given string print the string
* after skipping the utf-8 bom characters print the string as utf-8 string on a
* utf-8 console
*/
public class BOM
{
private final static String BOM_STRING = "Hello World";
private final static String ISO_ENCODING = "ISO-8859-1";
private final static String UTF8_ENCODING = "UTF-8";
private final static int UTF8_BOM_LENGTH = 3;
public static void main(String[] args) throws UnsupportedEncodingException {
final byte[] bytes = BOM_STRING.getBytes(ISO_ENCODING);
if (isUTF8(bytes)) {
printSkippedBomString(bytes);
printUTF8String(bytes);
}
}
private static void printSkippedBomString(final byte[] bytes) throws UnsupportedEncodingException {
int length = bytes.length - UTF8_BOM_LENGTH;
byte[] barray = new byte[length];
System.arraycopy(bytes, UTF8_BOM_LENGTH, barray, 0, barray.length);
System.out.println(new String(barray, ISO_ENCODING));
}
private static void printUTF8String(final byte[] bytes) throws UnsupportedEncodingException {
System.out.println(new String(bytes, UTF8_ENCODING));
}
private static boolean isUTF8(byte[] bytes) {
if ((bytes[0] & 0xFF) == 0xEF &&
(bytes[1] & 0xFF) == 0xBB &&
(bytes[2] & 0xFF) == 0xBF) {
return true;
}
return false;
}
}
dissect the URL code and unicode error. this symbol came to me as well on google translate in the armenian text and sometimes the broken burmese.
profilage bas� sur l'analyse de l'esprit (french)
should be translated as:
profilage basé sur l'analyse de l'esprit
so, in this case � = é
No above answer resolve my issue. When i download xml it apppends <xml to my xml. I simply
xml = parser.getXmlFromUrl(url);
xml = xml.substring(3);// it remove first three character from string,
now it is running accurately.

Java UTF-8 strange behaviour

I am trying to decode some UTF-8 strings in Java.
These strings contain some combining unicode characters, such as CC 88 (combining diaresis).
The character sequence seems ok, according to http://www.fileformat.info/info/unicode/char/0308/index.htm
But the output after conversion to String is invalid.
Any idea ?
byte[] utf8 = { 105, -52, -120 };
System.out.print("{{");
for(int i = 0; i < utf8.length; ++i)
{
int value = utf8[i] & 0xFF;
System.out.print(Integer.toHexString(value));
}
System.out.println("}}");
System.out.println(">" + new String(utf8, "UTF-8"));
Output:
{{69cc88}}
>i?
The console which you're outputting to (e.g. windows) may not support unicode, and may mangle the characters. The console output is not a good representation of the data.
Try writing the output to a file instead, making sure the encoding is correct on the FileWriter, then open the file in a unicode-friendly editor.
Alternatively, use a debugger to make sure the characters are what you expect. Just don't trust the console.
Here how I finally solved the problem, in Eclipse on Windows:
Click Run Configuration.
Click Arguments tab.
Add -Dfile.encoding=UTF-8
Click Common tab.
Set Console Encoding to UTF-8.
Modify the code:
byte[] utf8 = { 105, -52, -120 };
System.out.print("{{");
for(int i = 0; i < utf8.length; ++i)
{
int value = utf8[i] & 0xFF;
System.out.print(Integer.toHexString(value));
}
System.out.println("}}");
PrintStream sysout = new PrintStream(System.out, true, "UTF-8");
sysout.print(">" + new String(utf8, "UTF-8"));
Output:
{{69cc88}}
> ï
The code is fine, but as skaffman said your console probably doesn't support the appropriate character.
To test for sure, you need to print out the unicode values of the character:
public class Test {
public static void main(String[] args) throws Exception {
byte[] utf8 = { 105, -52, -120 };
String text = new String(utf8, "UTF-8");
for (int i=0; i < text.length(); i++) {
System.out.println(Integer.toHexString(text.charAt(i)));
}
}
}
This prints 69, 308 - which is correct (U+0069, U+0308).
Java, not unreasonably, encodes Unicode characters into native system encoded bytes before it writes them to stdout. Some operating systems, like many Linux distros, use UTF-8 as their default character set, which is nice.
Things are a bit different on Windows for a variety of backwards-compatibility reasons. The default system encoding will be one of the "ANSI" codepages and if you open the default command prompt (cmd.exe) it will be one of the old "OEM" DOS codepages (though it is possible to get ANSI and Unicode there with a bit of work).
Since U+0308 isn't in any of the "ANSI" character sets (probably 1252 in your case), it'll get encoded as an error character (usually a question mark).
An alternative to Unicode-enabling everything is to normalize the combining sequence U+0069 U+0308 to the single character U+00EF:
public static void emit(String foo) throws IOException {
System.out.println("Literal: " + foo);
System.out.print("Hex: ");
for (char ch : foo.toCharArray()) {
System.out.print(Integer.toHexString(ch & 0xFFFF) + " ");
}
System.out.println();
}
public static void main(String[] args) throws IOException {
String foo = "\u0069\u0308";
emit(foo);
foo = Normalizer.normalize(foo, Normalizer.Form.NFC);
emit(foo);
}
Under windows-1252, this code will emit:
Literal: i?
Hex: 69 308
Literal: ï
Hex: ef

Categories

Resources