How do I combine unicode characters? - java

I am trying to print Unicode characters in the Java console. I am using the Bengali character set and need to combine two Unicode characters, but I have no clue how to do so. For example, I can print ড and া separately. When combined, they should turn into ডা, which means the circular part should no longer be there. I tried googling but couldn't find anything relevant :/

Just concatenate the characters with empty double quotes ("") in between. It works for me in IntelliJ and Eclipse without installing any other add-ons.
E.g.
System.out.println(myUnicodeChar1+""+myUnicodeChar2+""+myUnicodeChar3);
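As a minimal sketch of that concatenation, assuming the two characters from the question are ড (U+09A1) and া (U+09BE): joining them into one string is all the code has to do, and the console font is what renders the pair as ডা. The class name, the helper method, and the UTF-8 PrintStream are assumptions for illustration; whether the result displays correctly still depends on the console's encoding and font.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class BengaliCombine {
    // Joins a base letter and a dependent vowel sign into a single string.
    static String combine(char base, char vowelSign) {
        return "" + base + vowelSign;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        char dda = '\u09A1';    // BENGALI LETTER DDA
        char aaSign = '\u09BE'; // BENGALI VOWEL SIGN AA
        // Assumption: the console expects UTF-8 encoded output.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(combine(dda, aaSign));
    }
}
```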

Related

showDocument() with non-standard (Chinese) characters

So, I finally discovered that JavaFX lets you use HostServices.showDocument(uri) to open a browser at the given URL. I have run into a problem, though: I cannot open URLs that contain Chinese characters. It can only interpret them as '?', which takes you to the wrong URL. AWT's Desktop.browse(uri) handles these characters without a problem, so I know they can technically be passed to the browser. I'm not sure whether there is anything I can do on my end, though.
My question is: Is there any way to make JavaFX's HostServices.showDocument() correctly read in Chinese characters?
EDIT:
Sample string
http://www.mdbg.net/chindict/chindict.php?page=worddict&wdrst=0&wdqb=%E6%96%87
You can follow the link to see the Chinese character in the address (at the very end of the URL). In doing this, I noticed that the character is converted to a series of % signs, letters, and numbers. Plugging those into showDocument() in place of the character works fine. So I guess the question is now "How do I convert a character to this format?"
I was able to figure out that converting the string into a URI and then calling the .toASCIIString() method gave me what I needed (converting Chinese characters, and I would assume others, into something readable by showDocument()). Thanks for the help, jewelsea.
If there is a better way to do this, feel free to give me another answer.
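A minimal sketch of the URI approach described above, using the sample URL from the question with its last character written as 文 (U+6587). The class and method names are hypothetical; URI.create and toASCIIString are the standard java.net.URI calls.

```java
import java.net.URI;

class UrlEncodeSketch {
    // Percent-encodes non-ASCII characters in an otherwise valid URL.
    static String toAscii(String url) {
        return URI.create(url).toASCIIString();
    }

    public static void main(String[] args) {
        String url = "http://www.mdbg.net/chindict/chindict.php"
                + "?page=worddict&wdrst=0&wdqb=\u6587"; // \u6587 is 文
        // The resulting string can then be handed to showDocument().
        System.out.println(toAscii(url));
    }
}
```

Characters that are already legal in a URL, such as & and =, are left untouched; only the non-ASCII character is rewritten as UTF-8 percent escapes.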

How do I handle non-English characters properly?

So I'm working with last.fm API. Sometimes, the query results in tracks that contain characters like these:
Æther, é, Hṛṣṭa
or non-English characters like these:
水鏡.
When debugging in Eclipse, I see them just fine (as-is), but printing to the console shows them as ???, which is OK for me.
Now, how do I handle these? At first I thought I could remove every song that has any character other than the ones in the English language. I used the regex ^\\w+$ but it didn't work. I also tried \\w+. That didn't work either.
Then I thought further about how to handle these properly. Can anyone help me out? I am perfectly fine with leaving these tracks out of the equation, i.e. I'm fine with having only tracks with English characters.
Another question: What is the best way to display these characters on the console and/or in a Swing GUI?
First, you must ensure that you use the correct encoding when reading your input.
Second, ensure that the font used in Eclipse on the platform you are developing on is able to display all these characters. Swing will display Unicode characters if you read them correctly.
You will likely want to use UTF-8 everywhere.
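If dropping non-English tracks is still wanted, one possible sketch is to test whether a title can be encoded as US-ASCII. Note that \w matches neither spaces nor punctuation, which is a likely reason ^\\w+$ rejected multi-word titles. The class and method names here are hypothetical, and "ASCII only" is used as a rough stand-in for "English characters".

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class TrackFilter {
    // True when every character of the title fits in US-ASCII.
    static boolean isAsciiOnly(String title) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(title);
    }

    // Keeps only the titles made up entirely of ASCII characters.
    static List<String> asciiOnly(List<String> titles) {
        List<String> kept = new ArrayList<>();
        for (String title : titles) {
            if (isAsciiOnly(title)) {
                kept.add(title);
            }
        }
        return kept;
    }
}
```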

Is there a standard way to detect directional character?

I'm parsing a text file made from this Wikipedia article; basically, I pressed Ctrl+A and copy/pasted all the content into a text file (I use it as an example).
I'm trying to make a list of words with their counts and for that I use a Scanner with this delimiter :
sc.useDelimiter("[\\p{javaWhitespace}\\p{Punct}]+");
It works great for my needs, but when analysing the result I saw something that looks like a blank token (again...). The character comes after (nynorsk)‬ in the article (funnily enough, when I copy/paste it here the character disappears; in gedit I can press → and ← and the cursor doesn't move).
After further research I've found out that this token was actually the POP DIRECTIONAL FORMATTING (U+202C).
It's not the only directional character; looking at the Character documentation, Java seems to define them.
So I'm wondering whether there is a standard way to detect these characters and, if possible, one that can be easily integrated into the delimiter pattern.
I'd like to avoid making my own list because I fear I will forget some of them.
You could always go the other way round and use a whitelist rather than a blacklist:
sc.useDelimiter("[^\\p{L}]+");
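Another option, sketched here under the assumption that the unwanted characters all fall in the Unicode general category Cf (Format), as U+202C does, is to add \p{Cf} to the original delimiter class. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DirectionalDelimiter {
    // Splits text on whitespace, ASCII punctuation and format characters (Cf).
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter("[\\p{javaWhitespace}\\p{Punct}\\p{Cf}]+");
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }
}
```

Compared with the [^\\p{L}]+ whitelist, this keeps tokens that contain digits, at the cost of relying on the Cf category to cover every invisible control that may turn up.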

Print arabic string in java

I'm trying to display Arabic text in Java, but when I print it I get junk characters (example: ¤[ï߯[î) or sometimes only question marks. How do I make it print Arabic? I heard that it has something to do with Unicode and UTF-8. This is the first time I'm working with languages, so I have no idea. I'm using the Eclipse Indigo IDE.
EDIT:
If I use UTF-8 encoding, the "¤[ï߯[î" characters become "????????" characters.
For starters you could take a look here. This should allow you to make Eclipse print Unicode in its console (I do not know whether Eclipse supports this out of the box without any extra tweaks).
If that does not solve your problem, you most likely have an issue with the encoding your program is using, so you might want to create strings in a manner similar to this:
String str = new String("تعطي يونيكود رقما فريدا لكل حرف".getBytes(), "UTF-8");
This at least works for me.
If you embed the text literally in the code make sure you set the encoding for your project correctly.
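The getBytes()/new String round trip above depends on the platform default charset, so it may behave differently from machine to machine. A hedged alternative is to leave the string alone and wrap the output stream in a PrintStream with an explicit encoding. This sketch assumes the console itself is set to UTF-8; the class and helper names are hypothetical.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ArabicPrint {
    // Wraps a stream so that everything printed to it is encoded as UTF-8.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform is required to support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two Arabic letters written as escapes, so the source file
        // encoding does not matter.
        utf8(System.out).println("\u0628\u06A9");
    }
}
```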
Is this for Java SE, Java EE, or Java ME?
If this is for Java ME and you use LWUIT, you have to write custom GlyphUtils.
Download this file:
http://dl.dropbox.com/u/55295133/U0600.pdf
It lists the Unicode code points.
Then look at this thread:
https://stackoverflow.com/a/9172732/1061371
The answer by Mohamed Nazar, edited by a user named Alex Kliuchnikau, says:
"The below code can be use for displaying arabic text in J2ME String s=new String("\u0628\u06A9".getBytes(), "UTF-8"); where \u0628\u06A9 is the unicode of two arabic letters"
Looking at the U0600.pdf file, we can see that the example from Mohamed Nazar and Alex Kliuchnikau creates the Arabic characters "ba" and "kaf".
The last point you must consider is: "Make sure your UI supports Unicode (I mean Arabic) characters."
LWUIT, for example, does not yet support Unicode (I mean Arabic) characters.
If your app uses LWUIT, you will have to write your own custom code.

How to remove colors etc. from ssh output

I am using JSch to get SSH output from a local SSH server.
When I display the output in a textbox, I get all these weird strings in the output, for example:
]0;~/rails_sites/rex_raid
[32mRob#shinchanii [33m~/rails_sites/rex_raid[0m
I guess [33m and [0m mark the beginning of a new color or something, and ]0;~ marks a newline.
How do I get rid of these without parsing the output for those strings?
Here is an example (not from me) of what my output looks like:
http://www.google.de/codesearch#048v6jEeHAU/typescript&q=%5D0;~&l=1
These are actually VT100 terminal control escape sequences. You can find a list of them (I'm not sure whether the list is complete) at http://www.termsys.demon.co.uk/vtansi.htm.
You can use String's replaceAll method (http://download.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#replaceAll%28java.lang.String,%20java.lang.String%29) and create a regular expression that matches all valid VT100 escape sequences. However, when creating the regexp, do not forget that there is a non-printable ESC character (\u001B in Unicode) before the square bracket.
These are ANSI escape sequences. As you guessed, they are meant to be interpreted by the terminal displaying them, to change the color or some other font attribute. (They start with an Escape character (ASCII 27), but this is likely not shown in your text box.)
The right way to do this would be to make your shell not print these codes if there is no (or a dumb) terminal. But since they are often hard-coded in scripts (at least on my account here the prompt colors are hard-coded in .bashrc), this might not be easy.
You can parse these codes, either to strip them off or even to interpret them (to make your textbox colorful). I once started to implement the latter, but I think there might be existing implementations around.
I'm also using JSch and experienced the same problem.
For your reference, in JSch, calling Channel.setPtyType("ansi") before connecting can remove the ANSI colors so that the output is acceptable on Windows.
I'm not sure whether this setting is compatible with all remote Linux/Unix servers.
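If the codes cannot be suppressed at the source, stripping them with a regular expression, as suggested in the answers above, could look like the following sketch. It covers only two families: CSI sequences (ESC [ ... letter, which includes the color codes) and OSC sequences (ESC ] ... terminated by BEL or ESC \). The ]0;~/rails_sites/rex_raid fragment in the question looks like an OSC window-title sequence rather than a newline marker. The class name and the pattern are assumptions, not a complete VT100 parser.

```java
import java.util.regex.Pattern;

class AnsiStripper {
    // CSI: ESC [ parameter bytes, intermediate bytes, one final byte.
    // OSC: ESC ] text, ended by BEL or by ESC followed by a backslash.
    private static final Pattern ESCAPES = Pattern.compile(
            "\u001B\\[[0-?]*[ -/]*[@-~]"
            + "|\u001B\\][^\u0007\u001B]*(?:\u0007|\u001B\\\\)");

    // Removes the matched escape sequences and leaves the visible text.
    static String strip(String raw) {
        return ESCAPES.matcher(raw).replaceAll("");
    }
}
```

Calling AnsiStripper.strip on each chunk read from the channel before appending it to the textbox would be one way to use it; sequences split across two reads are not handled by this sketch.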
