showDocument() with non-standard (Chinese) characters - java

So, I finally discovered that JavaFX lets you use HostServices.showDocument(uri) to open a browser to a given URL. I have run into a problem, though: I cannot open URLs that contain Chinese characters. It can only interpret them as '?', taking you to the wrong URL. AWT's Desktop.browse(uri) handles the characters without a problem, so I know they can technically be communicated to the browser. I'm not sure if there is anything I can do on my end, though.
My question is: Is there any way to make JavaFX's HostServices.showDocument() correctly read in Chinese characters?
EDIT:
Sample string
http://www.mdbg.net/chindict/chindict.php?page=worddict&wdrst=0&wdqb=%E6%96%87
You can follow the link through to see the address's Chinese character (at the very end of the URL). In doing this, I noticed that the character gets converted to a series of '%' signs, letters, and numbers. Plugging those into showDocument() in place of the character works fine. So then, I guess the question is now "How do I convert a character to this format?"

I was able to figure out that converting the string into a URI and then using the .toASCIIString() method gave me what I needed (converting Chinese characters, and I would assume others, into something readable by showDocument()). Thanks for the help, jewelsea.
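For example, here is a minimal sketch of what worked for me, using the sample URL from above (inside a javafx.application.Application you would pass the result to getHostServices().showDocument()):

import java.net.URI;

public class UriAsciiExample {
    public static void main(String[] args) throws Exception {
        // Build the URI from its components so that the query string
        // (which contains the Chinese character 文) gets percent-encoded.
        URI uri = new URI("http", "www.mdbg.net", "/chindict/chindict.php",
                "page=worddict&wdrst=0&wdqb=文", null);

        // toASCIIString() yields ...&wdqb=%E6%96%87, which showDocument() accepts.
        String ascii = uri.toASCIIString();
        System.out.println(ascii);

        // In a JavaFX Application subclass:
        // getHostServices().showDocument(ascii);
    }
}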
If there is a better way to do this, feel free to give me another answer.

Related

Can anyone tell me the type of character encoding used within these Strings? [Decompiled]

Currently working on a project for a client that involves remodelling a decompiled and obfuscated set of code from a rather large jar.
There's a set of strings that consistently keep popping up to be decoded; however, the methods that decode those strings have been scrambled to the point of illegibility. From previously asking on here, no one has time to crawl through the methods to figure out an alternate solution, so figuring out the character encoding is the best way to create a solution for the issue.
*Note that the obfuscator used does not have the ability to encrypt hard-coded strings.
I've tried various methods of conversion from different libraries and different character sets, but it doesn't seem to be playing ball. I asked a much more complex question here earlier, but the more effective approach is to start by knowing how to decode the strings in the first place; below are some examples of them.
String encodedPriceExample = "\0163J\032'\032J\037\"m\007:P$\031";
// from interpretation, this shows the price of a transaction
String encodedErrorMessageExample ="V5T\016\005\"J:\037$\036w\0062\013!\017w\0238\037%J4\037%\0302\004#J1\0134\036>\0059J5\0171\005%\017w\0238\037wO$D";
// a longer one; should show a "no join" message of some form
These are only two strings; however, all of them look similar and are decrypted via a scrambled static method, as previously said.
Does this look like any character encoding at all? It needs to be converted into UTF-8 or Base64.
Due to it going through a decompiler, the string itself may just be jumbled and converted into raw Unicode of some form; however, I've never seen that happen before, even with obfuscation. Other hard-coded strings in the project are fine, just the strings in those static methods.
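For reference, here is a minimal diagnostic sketch (not a decoder) that dumps each character's code point in hexadecimal, which makes the raw values easier to compare against candidate encodings or a simple XOR/offset scheme:

import java.util.stream.Collectors;

public class DumpCodePoints {
    public static void main(String[] args) {
        String encodedPriceExample = "\0163J\032'\032J\037\"m\007:P$\031";
        // Print every char's code point in hex so the pattern can be eyeballed.
        String hex = encodedPriceExample.chars()
                .mapToObj(c -> String.format("%02X", c))
                .collect(Collectors.joining(" "));
        System.out.println(hex);
    }
}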
Any input and/or help would be greatly appreciated in sorting this out! This is more of a check to make sure that my angle for fixing it is correct.
Thanks, guys.

How do I combine Unicode characters?

I am trying to print Unicode characters in the Java console. The problem is that I am using the Bengali script and I need to combine two Unicode characters together. I have no clue how to do so. For example, I can print ড and া separately. When combined, this should turn into ডা; that means the circular part should not be there anymore. But I really have no idea how to do so. I tried googling but couldn't find anything relevant :/
Just concatenate the characters with an empty string ("") in between. It works for me in IntelliJ and Eclipse without installing any other add-ons.
E.g.
System.out.println(myUnicodeChar1 + "" + myUnicodeChar2 + "" + myUnicodeChar3);
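For instance, a minimal self-contained sketch using the two characters from the question (code points U+09A1 and U+09BE; this assumes the console font can render Bengali):

public class BengaliCombineExample {
    public static void main(String[] args) {
        char consonant = '\u09A1'; // ড (BENGALI LETTER DDA)
        char vowelSign = '\u09BE'; // া (BENGALI VOWEL SIGN AA)
        // Concatenating the two chars into a single String lets the
        // console font shape them as one combined glyph: ডা
        System.out.println(consonant + "" + vowelSign);
    }
}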

How do I handle non-English characters properly?

So I'm working with the last.fm API. Sometimes, the query results in tracks that contain characters like these:
Æther, é, Hṛṣṭa
or non-English characters like these:
水鏡.
When debugging in Eclipse, I see them just fine (as-is), but printing to the console shows them as ??? - which is OK for me.
Now, how do I handle these? At first I thought I could remove every song that has any character other than the ones in the English language. I used the regex ^\\w+$ but it didn't work. I also tried \\w+. That didn't work either.
Then I thought further about how to handle these properly. Can anyone help me out? I am perfectly fine with leaving these tracks out of the equation, i.e. I'm fine with keeping only tracks with English characters.
Another question: what is the best way to display these characters on the console and/or in a Swing GUI?
You must ensure that you use the correct encoding when reading your input first.
Second, ensure that the font used in Eclipse on the platform you are developing on is able to display all these characters. Swing will display Unicode characters if you read them correctly.
You will likely want to use UTF-8 everywhere.
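A minimal sketch of the reading side, assuming the responses are UTF-8; the URL is just a placeholder for whatever last.fm call you make (a real request also needs your api_key):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ReadResponseUtf8 {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://ws.audioscrobbler.com/2.0/?method=chart.gettoptracks");
        // Specify UTF-8 explicitly instead of relying on the platform default,
        // so characters like Æ, é and 水 survive the read.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}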

Is there a standard way to detect directional character?

I'm parsing a text file made from this Wikipedia article; basically I did a Ctrl+A and copied/pasted all the content into a text file (I'm using it as an example).
I'm trying to make a list of words with their counts, and for that I use a Scanner with this delimiter:
sc.useDelimiter("[\\p{javaWhitespace}\\p{Punct}]+");
It works great for my needs, but analysing the result, I saw something that looks like a blank token (again...). The character is after (nynorsk) in the article (funnily enough, when I copy/paste it here the character disappears; in gedit I can use → and ← and the cursor doesn't move).
After further research I've found out that this token was actually the POP DIRECTIONAL FORMATTING (U+202C).
It's not the only directional character; looking at the Character documentation, Java seems to define them.
So I'm wondering if there is a standard way to detect these characters and, if possible, a way that can be easily integrated into the delimiter pattern.
I'd like to avoid making my own list because I fear I would forget some of them.
You could always go the other way round and use a whitelist rather than a blacklist:
sc.useDelimiter("[^\\p{L}]+");

How do I decipher garbled/gibberish characters in my networking program

I am working on a client-server networking app in Java SE. I am sending strings terminated by a newline from client to server, and the server responds with a null-terminated string.
In the output window of the NetBeans IDE I am finding some gibberish characters amongst the strings that I send and receive.
I can't figure out what these characters are; they mostly look like a rectangular box, and when I paste the line containing such a character into Notepad++, all the characters following and including that character disappear.
How can I find out what characters are appearing in the output screen of the IDE?
If the response you are getting back from the server is supposed to be human-readable text, then this is probably a character encoding problem. For example, if the client and server are both written in Java, it is likely that they are using/assuming different character encodings for the text. (It is also possible that the response is not supposed to be human-readable text. In that case, the client should not be trying to interpret it as text ... so this question is moot.)
You typically see boxes (splats) when a program tries to render a character code that it does not understand. This may be a real character (e.g. a Japanese character, a mathematical symbol or the like), or it could be an artifact caused by a mismatch between the character sets used for encoding and decoding the text.
To try and figure out what is going on, try modifying your client-side code to read the response as bytes rather than characters, and then output the bytes to the console in hexadecimal. Then post those bytes in your question, together with the displayed characters that you currently see.
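Something along these lines would do it (a sketch only; the host, port and protocol details are assumptions about your setup):

import java.io.InputStream;
import java.net.Socket;

public class DumpResponseBytes {
    public static void main(String[] args) throws Exception {
        // Hypothetical server address; the point is to read raw bytes
        // and print them in hex instead of decoding them into a String.
        try (Socket socket = new Socket("localhost", 12345);
             InputStream in = socket.getInputStream()) {
            int b;
            while ((b = in.read()) != -1) {
                System.out.printf("%02X ", b);
            }
            System.out.println();
        }
    }
}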
If you understand character set naming and have some idea of what the character sets are likely to be, the UNIX / Linux iconv utility may be helpful. Emacs has extensive support for displaying / editing files in a wide range of character encodings. (Even plain old WordPad can help, if this is just a problem with line termination sequences; e.g. "\n" versus "\r\n" versus "\n\r".)
(I'd avoid trying to diagnose this by copy and pasting. The copy and paste process may itself "mangle" the data, causing you further confusion.)
This is probably just binary data. Most of it will look like gibberish when interpreted as ASCII. Make sure you are writing the exact number of bytes to the socket, and not some nice number like 4096. It would be best if you could post your code so we can help you find the error(s).
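For illustration, a minimal sketch of the usual fix (the stream variables are hypothetical): write only the bytes that were actually read, not the whole buffer.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyExactBytes {
    // Copy only the bytes actually read on each pass; writing the whole
    // 4096-byte buffer would send leftover junk from earlier reads.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        int bytesRead;
        while ((bytesRead = in.read(buffer)) != -1) {
            out.write(buffer, 0, bytesRead); // not out.write(buffer)
        }
        out.flush();
    }
}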
