How to convert special characters in a string to unicode? - java

I couldn't find an answer to this problem; I have tried combining several answers here to find something that works, to no avail.
An application I'm working on uses a user's name to create PDFs with that name in them. However, when someone's name contains a special character like "Yağmur", the PDF creator freaks out and omits this special character.
However, when it is given the numeric character reference instead ("Ya&#287;mur"), it prints "Yağmur" in the PDF as it should.
How do I check a name/string for any special character (regex = "[^a-z0-9 ]") and, when one is found, replace that character with its Unicode equivalent and return the converted string?

I will try to give the solution in a generic way, as the framework you are using is not mentioned in your problem statement.
I faced the same kind of issue a long time back. This should be handled by the PDF engine if you set the text/char encoding to UTF-8. Please find out how you can set the encoding in your framework for PDF generation and try it out. Hope it helps!
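For instance, if the PDF library happens to be iText 5 (purely an assumption, since the framework isn't named), registering a Unicode-capable font with IDENTITY_H encoding and embedding it usually makes characters like "ğ" come out correctly. This is only a sketch; "arialuni.ttf" is a placeholder for any font file that covers the glyphs you need:

import com.itextpdf.text.Document;
import com.itextpdf.text.Font;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfWriter;
import java.io.FileOutputStream;

public class UnicodePdfSketch {
    public static void main(String[] args) throws Exception {
        Document doc = new Document();
        PdfWriter.getInstance(doc, new FileOutputStream("name.pdf"));
        doc.open();
        // IDENTITY_H encoding + embedding exposes the font's full Unicode range to the PDF
        BaseFont bf = BaseFont.createFont("arialuni.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
        doc.add(new Paragraph("Yağmur", new Font(bf, 12)));
        doc.close();
    }
}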

One hackish way to do this would be as follows:
/*
 * TODO: poorly named
 */
public static String convertUnicodePoints(String input) {
    // getting char array from input
    char[] chars = input.toCharArray();
    // initializing output
    StringBuilder sb = new StringBuilder();
    // iterating input chars
    for (int i = 0; i < input.length(); i++) {
        // checking the character's code point to decide whether "conversion" is required;
        // here, an arbitrary code point of 125 is picked as the boundary
        if (Character.codePointAt(input, i) < 125) {
            sb.append(chars[i]);
        }
        // need to "convert": code point >= boundary
        else {
            // for a hex representation: pads with 0s to get a hex string
            // of the char's code point, 4 characters long
            // sb.append(String.format("&#x%04X;", (int) chars[i]));
            // for a decimal representation, which is what you want here
            sb.append(String.format("&#%d;", (int) chars[i]));
        }
    }
    return sb.toString();
}
If you execute: System.out.println(convertUnicodePoints("Yağmur"));...
... you'll get: Ya&#287;mur.
Of course, you can play with the "conversion" logic and decide which ranges get converted.
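For example, here is a variant (a sketch; escapeSpecials and NON_ASCII are names made up for the example) that converts only the characters matched by the asker's own class [^a-z0-9 ], extended with A-Z so uppercase letters are left alone:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecialCharEscaper {
    // Anything outside this simple ASCII whitelist becomes a decimal character reference
    private static final Pattern NON_ASCII = Pattern.compile("[^a-zA-Z0-9 ]");

    public static String escapeSpecials(String input) {
        Matcher m = NON_ASCII.matcher(input);
        StringBuffer sb = new StringBuffer(); // appendReplacement requires StringBuffer before Java 9
        while (m.find()) {
            // The replacement is purely numeric, so no escaping of $ or \ is needed
            m.appendReplacement(sb, String.format("&#%d;", (int) m.group().charAt(0)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeSpecials("Yağmur")); // Ya&#287;mur
    }
}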

Related

Why won't my buffered `String` match with the `RegEx` pattern?

I've created a Bluetooth data listener that stores incoming data in a String, checks whether it matches a regular expression pattern, and then resets the String. I do this because the data does not arrive whole in one go, and I want to manipulate it only once I have the full text. For example, when I send "Hello Android!" to my device via Bluetooth with my method and print the data, it is printed like this:
#1 New data! "H"
#2 New data! "ello Android!
As you can see, the whole string cannot be sent in one go, which means two strings arrive instead, and I'm sure most people are familiar with that. That is why I am using a RegEx to help me with it.
Instead, I am sending a randomly generated number between two different characters and then trying to parse it, for example "<128>". Now, I want to get the whole number so that I can use it, e.g. parse it to an int, but only once my String buffer has received the whole data being sent, which is determined by a RegEx pattern like ([<])(-?\d+)([>]): the character '<', followed by any positive/negative number, followed by the character '>'.
The problem is that it does not match the pattern at all, for reasons unknown to me.
String szBuffer = "";

if (mmInputStream.available() > 0) {
    StringBuilder builder = new StringBuilder();
    byte[] bData = new byte[1024];
    while (mmInputStream.available() > 0) {
        int read = mmInputStream.read(bData);
        builder.append(new String(bData, 0, read, StandardCharsets.UTF_8));
    }
    szBuffer += builder.toString();
    Log.d("SZ_BUFFER", szBuffer); // For this example, "<128>" gets sent in pieces.

    // <n>
    if (Pattern.matches("([<])(-?\\d+)([>])", szBuffer)) {
        Log.d("SZ_BUFFER_ISMATCH", "MATCH!");
        szBuffer = ""; // Reset the buffer for new data
    } else {
        Log.d("SZ_BUFFER_ISMATCH", "NO MATCH...");
    }
}
Here's a live output:
D/SZ_BUFFER: <
D/SZ_BUFFER_ISMATCH: NO MATCH...
D/SZ_BUFFER: <128>
D/SZ_BUFFER_ISMATCH: NO MATCH...
As you can see, it arrives in two pieces, but once the whole text is together it should be a match, yet it isn't. Why? If I replace szBuffer with a constant String like this:
if(Pattern.matches("([<])(-?\\d+)([>])", "<128>"))
It's a match, meaning that the pattern should be correct, but when it checks for szBuffer it is never a match.
According to the docs, Pattern.matches(regex,string) is equivalent to Pattern.compile(regex).matcher(string).matches(), which is equivalent to string.matches(regex). This means that you are checking if the entire string matches the regex. If you want to check if the string contains your end-of-string marker, you can use Matcher.find instead:
Pattern pattern = Pattern.compile("([<])(-?\\d+)([>])");
String szBuffer = "";

if (mmInputStream.available() > 0) {
    ...
    szBuffer += convertedBytes;
    Log.d("SZ_BUFFER", szBuffer);

    // <n>
    if (pattern.matcher(szBuffer).find()) {
        Log.d("SZ_BUFFER_ISMATCH", "MATCH!");
        szBuffer = "";
    } else {
        Log.d("SZ_BUFFER_ISMATCH", "NO MATCH...");
    }
}
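Since the goal is to use the number itself afterwards, here is a small follow-up sketch (dropping into the same if block, with pattern and szBuffer as above) that pulls the digits out of the second capture group and parses them:

Matcher m = pattern.matcher(szBuffer);
if (m.find()) {
    int value = Integer.parseInt(m.group(2)); // group 2 is (-?\d+)
    Log.d("SZ_BUFFER_VALUE", "received number: " + value);
    szBuffer = ""; // reset the buffer for new data
}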
The problem was that the data contained invisible characters, which I confirmed by checking the data's length (it came out wrong). The solution is to check every character and store it only if it matches certain criteria, for example this regular-expression check:
for (char c : builder.toString().toCharArray()) {
    String s = String.valueOf(c);
    if (Pattern.matches("<|>|-|-?\\d+", s)) {
        szBuffer += s;
    }
}
This is not the best or cleanest solution, but it does solve my problem. I am still learning regular expressions and the many, many ways they can be combined in algorithms. All new recommendations are welcome and will be added below this one.
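Another option worth considering (a sketch, not a verified fix for this exact device) is to strip the invisible characters in one pass with a regex over the buffered text; \p{Cntrl} matches the ASCII control characters that typically sneak in over a serial or Bluetooth link:

// Drop control characters (CR, LF, NUL, ...) before appending to the buffer
szBuffer += builder.toString().replaceAll("\\p{Cntrl}", "");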

Multi-byte character split generates junk symbols on saving to database

In my application, long strings are generated dynamically. I save these values in a database with a maximum length; when the maximum length is exceeded, the string is split using custom code and a new line gets inserted into the database.
The problem occurs when multi-byte characters are used. If the split falls inside a word, at a vowel sign (matra), the result contains junk symbols such as a diamond with a question mark in it.
int blockSize = 12;
String str1 = "<SOME STRING>";
byte[] b = str1.getBytes("UTF-8");
int loopCount = x; // in the actual code this is generated dynamically
String outString = "";
for (int i = 0; i <= loopCount; i++) {
    if (i != loopCount) {
        outString = new String(b, i * blockSize, blockSize, "UTF-8");
    } else {
        outString = new String(b, i * blockSize, (b.length - loopCount * blockSize));
    }
}
1. How can I avoid splitting the string in the middle of a word, and instead carry the complete word over to the next chunk?
2. Or is there any other way to stop the junk symbols from being generated?
Text as conceived in Unicode has its problems on several levels.
As pure text composed from Unicode code points.
ĉ can be represented as one code point U+109, in UTF-16 (the binary format of char) as the single char '\u0109', or as c plus a zero-width, so-called combining diacritical mark for ^. So splitting between code points is already problematic. java.text.Normalizer can normalize to either the composed or the decomposed form. Then there are the Left-To-Right and Right-To-Left markers to consider when using only a part of a text.
On the UTF-16 level (Java char), some code points need 2 chars, a so-called surrogate pair. This is testable in Java using Character. The Character class and also the regular expression Pattern class have rather good Unicode support; one can, for instance, find the category of combining diacritical marks.
On the UTF-8 level, some (non-ASCII) chars or code points need multi-byte sequences, so splitting a byte array produces illegal UTF-8 garbage at the split point.
The solution?
It may be sensible to normalize the text; mind file names.
Do not treat byte sub-arrays as valid text.
Treat the boundaries of byte arrays carefully: even a plain c at the end might really be ĉ. Consider a shifting boundary buffer, as in the sketch below.
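One possible way to honor those boundaries when chunking text for the database (a sketch; splitByUtf8Bytes is a made-up name and the row limit is assumed to be a UTF-8 byte count) is to walk the string with java.text.BreakIterator and only ever cut between user-perceived characters, checking the encoded size of each candidate chunk:

import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class SafeSplitter {
    // Splits text into chunks whose UTF-8 encoding does not exceed maxBytes,
    // cutting only at grapheme boundaries so no surrogate pair or
    // base-plus-combining-mark sequence is torn apart.
    public static List<String> splitByUtf8Bytes(String text, int maxBytes) {
        List<String> chunks = new ArrayList<>();
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int chunkStart = 0;
        int prev = it.first();
        for (int next = it.next(); next != BreakIterator.DONE; prev = next, next = it.next()) {
            int bytes = text.substring(chunkStart, next).getBytes(StandardCharsets.UTF_8).length;
            if (bytes > maxBytes && prev > chunkStart) {
                chunks.add(text.substring(chunkStart, prev)); // close the chunk before this grapheme
                chunkStart = prev;
            }
        }
        if (chunkStart < text.length()) {
            chunks.add(text.substring(chunkStart)); // remainder
        }
        return chunks;
    }
}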

How to convert next character pair to a hex integer

I have a string where I have to first reverse two bytes and then convert that pair to a hex integer. I am trying to convert it as below, but it gives an error. Any idea how to do that? Thanks in advance.
Here is the complete string: http://pastebin.com/1cSCyD78
Sample string
String str = "031890";
Error Message :
java.lang.NumberFormatException: Invalid int: "0x30"
Java Code
for (int start = 0; start < str.length(); start += 2) {
    try {
        String thisByte = new StringBuilder(str.substring(start, start + 2)).reverse().toString();
        thisByte = "0x" + thisByte;
        int value = Integer.parseInt(thisByte, 16);
        char c = (char) value;
        System.out.println(c);
    } catch (Exception e) {
        Log.e("MainActivity", e.getMessage());
    }
}
Update
StringBuilder output = new StringBuilder();
for (int start = 0; start < str.length(); start += 2) {
    try {
        String thisByte = new StringBuilder(str.substring(start, start + 2)).reverse().toString();
        output.append((char) Integer.parseInt(thisByte, 16));
    } catch (Exception e) {
        Log.e("MainActivity", e.getMessage());
    }
}
Yes, I tried without prepending "0x" to the string, and now my output looks weird.
It looks like you need 4 digits from your string per character, not 2.
Given that you interpret 2 digits as a character, though, at first glance the output you showed in the picture does seem to match the string you posted on pastebin. You do get things that look like words in the output, so it's not totally off, and the gaps between the letters come from every second pair of 2 digits being '00'.
Not sure where this string came from, but if it was also generated by converting the characters of some String to bytes, it might make sense that it is 4 digits per character, since, for example, Java's chars are 16 bits (i.e. 2 bytes, i.e. 4 digits in your hex string) and encode the actual Unicode symbol they represent in UTF-16.
If you are working off specs that someone else provided you with, maybe when they said "2 BYTES", they actually meant "two 8-bit numbers", which correspond to 4 digits (four 4-bit nibbles) in your hex string.
But your string looks like it contains binary data as well, not just characters. Do you know what you are actually expecting to see as the answer?
Update (as per comment request):
It's a trivial change to your code, but here it is:
StringBuilder output = new StringBuilder();
for (int start = 0; start < str.length(); start += 4) {
    try {
        String thisByte = new StringBuilder(str.substring(start, start + 4)).reverse().toString();
        output.append((char) Integer.parseInt(thisByte, 16));
    } catch (Exception e) {
        Log.e("MainActivity", e.getMessage());
    }
}
All I did was replace "2" with "4". :)
Update (as per chat):
The code posted here to convert the hex-string into characters (using 4 digits per character) seems to work fine, but the hex-string does not seem to follow the convention the OP is expecting based on the specifications of the data, which caused part of the confusion.
A side-note:
If this is a public application, it is highly risky to include unencrypted SQL statements in the network traffic. If these statements are part of a request and get executed on the server, a hacker can use this to perform unwanted operations on the underlying data (e.g. stealing all phone numbers in the database). If it is merely some debug-/log-information sent to the client, it's still not a good idea as it may give hints to a hacker about the structure of your database and the way you access it, significantly simplifying a potential SQL injection attack.
Try without prepending a "0x" to the string. This prefix is only for the compiler. It's actually a shortcut for saying to use 16 as the radix.
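A tiny illustration of that point:

int a = Integer.parseInt("30", 16); // 48: parseInt expects bare hex digits for radix 16
int b = Integer.decode("0x30");     // 48: decode, by contrast, understands the 0x prefix
// Integer.parseInt("0x30", 16);    // throws NumberFormatException, as in the question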

How to build the longest String with different Unicode characters

Thanks in advance for your patience. This is my problem.
I'm writing a program in Java that works best with a big set of different characters.
I have to store all the characters in a String. I started with
private static final String values = "0123456789";
Then I added A-Z, a-z and all the commons symbols.
But they are still too few, so I thought that maybe Unicode could be the solution.
The problem is now: what is the best way to get all the Unicode characters that can be displayed in Eclipse? (My algorithm will probably fail on unrecognized characters, i.e. those displayed as little rectangles.) Is it possible to build a string (or several strings) with all the characters listed here (en.wikipedia.org/wiki/List_of_Unicode_characters) correctly displayed?
I can do a rough copy-paste from http://www.terena.org/activities/multiling/euroml/tests/test-ucspages1ucs.html or http://zenoplex.jp/tools/unicoderange_generator.html, but I would appreciate a cleaner solution.
I don't know if there is a way to extract the characters from a font (the Unifont one). Or maybe I should parse this webpage (www.utf8-chartable.de/unicode-utf8-table.pl).
Moreover, by adding all the characters into a String I will probably get the error:
"The type generates a string that requires more than 65535 bytes to encode in Utf8 format in the constant pool" (discussed in this question on SO: /questions/10798769/how-to-process-a-string-with-823237-characters).
Hybrid solutions are acceptable. I can remove duplicates following this question on SO (questions/4989091/removing-duplicates-from-a-string-in-java).
Finally: any solution that yields the longest string of all-different characters is accepted.
Thanks!
You are mixing some things up. The question whether a character can be displayed in Eclipse depends on the font you have chosen; and whether the source file can be processed correctly depends on which character encoding you have set up for the source file. When choosing UTF-8 and a good unicode font you can use and display almost any character, at least more than fit into a single String literal.
But is it really required to show the character in Eclipse? You can use the unicode escapes, e.g. \u20ac to refer to characters, regardless of whether they can be displayed or if the file encoding can handle them.
And if it is not a requirement to blow up your source code, it’s easy to create a String containing all existing characters:
// all chars (i.e. UTF-16 values)
StringBuilder sb = new StringBuilder(Character.MAX_VALUE);
for (char c = 0; c < Character.MAX_VALUE; c++) sb.append(c);
String s = sb.toString();
// if it should behave like a compile-time constant:
s = s.intern();
or
// all unicode characters (aka code points)
StringBuilder sb = new StringBuilder(2162686);
for (int c = 0; c < Character.MAX_CODE_POINT; c++) sb.appendCodePoint(c);
String s = sb.toString();
// if it should behave like a compile-time constant:
s = s.intern();
If you want the String to contain only valid Unicode characters, you can use if (Character.isDefined(c)) … inside the loop. But that's a moving target; newer JREs will most probably know more defined characters.
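A minimal sketch of that filter, applied to the code-point loop above:

// Only append code points the current JRE reports as assigned
StringBuilder sb = new StringBuilder();
for (int c = 0; c < Character.MAX_CODE_POINT; c++) {
    if (Character.isDefined(c)) {
        sb.appendCodePoint(c);
    }
}
String s = sb.toString();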
Simply use the Apache Commons classes; org.apache.commons.lang3.RandomStringUtils (commons-lang3) can solve your purpose.
http://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/RandomStringUtils.html
Also, please refer to the code below for API usage:
import org.apache.commons.lang3.RandomStringUtils;

public class RandomString {
    public static void main(String[] args) {
        // Random string only with numbers
        String string = RandomStringUtils.random(64, false, true);
        System.out.println("Random 0 = " + string);

        // Random alphabetic string
        string = RandomStringUtils.randomAlphabetic(64);
        System.out.println("Random 1 = " + string);

        // Random ASCII string
        string = RandomStringUtils.randomAscii(32);
        System.out.println("Random 2 = " + string);

        // Create a random string with indexes from the given array of chars
        string = RandomStringUtils.random(32, 0, 20, true, true, "bj81G5RDED3DC6142kasok".toCharArray());
        System.out.println("Random 3 = " + string);
    }
}

To split only Chinese characters in java

I am writing a Java application but am stuck on this point.
Basically I have a string of Chinese characters that can ALSO contain some Latin characters or numbers, let's say:
查詢促進民間參與公共建設法(210BOT法).
I want to split off the Chinese characters one by one, but keep the Latin characters and numbers together, such as "BOT" above. So at the end I will have this kind of list:
[ 查, 詢, 促, 進, 民, 間, 參, 與, 公, 共, 建, 設, 法, (, 210, BOT, 法, ), ., ]
How can I solve this problem (in Java)?
Chinese characters lie within certain Unicode ranges:
2F00-2FDF: Kangxi
4E00-9FAF: CJK
3400-4DBF: CJK Extension
So basically all you need to do is check whether the character's code point lies within the known ranges. This example is a good starting point for writing a stack-based parser/splitter; you only need to extend it to separate digits from Latin letters, which should be obvious enough (hint: Character#isDigit()):
Set<UnicodeBlock> chineseUnicodeBlocks = new HashSet<UnicodeBlock>() {{
    add(UnicodeBlock.CJK_COMPATIBILITY);
    add(UnicodeBlock.CJK_COMPATIBILITY_FORMS);
    add(UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS);
    add(UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT);
    add(UnicodeBlock.CJK_RADICALS_SUPPLEMENT);
    add(UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION);
    add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS);
    add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A);
    add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B);
    add(UnicodeBlock.KANGXI_RADICALS);
    add(UnicodeBlock.IDEOGRAPHIC_DESCRIPTION_CHARACTERS);
}};

String mixedChinese = "查詢促進民間參與公共建設法(210BOT法)";
for (char c : mixedChinese.toCharArray()) {
    if (chineseUnicodeBlocks.contains(UnicodeBlock.of(c))) {
        System.out.println(c + " is chinese");
    } else {
        System.out.println(c + " is not chinese");
    }
}
Good luck.
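To get all the way to the list from the question, here is one possible extension of that loop (a sketch reusing the chineseUnicodeBlocks set and mixedChinese string above; the grouping rules are my own assumptions): it emits each Chinese character as its own token, keeps runs of digits and runs of Latin letters together, and treats everything else as a single-character token. It also needs java.util.List and java.util.ArrayList in addition to the imports used above.

List<String> tokens = new ArrayList<String>();
StringBuilder run = new StringBuilder();
int runType = -1; // 0 = run of digits, 1 = run of non-Chinese letters, -1 = no open run
for (char c : mixedChinese.toCharArray()) {
    boolean chinese = chineseUnicodeBlocks.contains(UnicodeBlock.of(c));
    int type = Character.isDigit(c) ? 0
             : (Character.isLetter(c) && !chinese) ? 1
             : -1;
    if (type != runType && run.length() > 0) {
        tokens.add(run.toString()); // close the current digit/letter run
        run.setLength(0);
    }
    runType = type;
    if (type == -1) {
        tokens.add(String.valueOf(c)); // Chinese char or punctuation becomes its own token
    } else {
        run.append(c);
    }
}
if (run.length() > 0) {
    tokens.add(run.toString());
}
System.out.println(tokens); // [查, 詢, ..., 法, (, 210, BOT, 法, )]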
Disclaimer: I'm a complete Lucene newbie.
Using the latest version of Lucene (3.6.0 at the time of writing) I manage to get close to the result you require.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36, Collections.emptySet());
List<String> words = new ArrayList<String>();
TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(original));
CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
try {
    tokenStream.reset(); // Resets this stream to the beginning. (Required)
    while (tokenStream.incrementToken()) {
        words.add(termAttribute.toString());
    }
    tokenStream.end(); // Perform end-of-stream operations, e.g. set the final offset.
} finally {
    tokenStream.close(); // Release resources associated with this stream.
}
The result I get is:
[查, 詢, 促, 進, 民, 間, 參, 與, 公, 共, 建, 設, 法, 210bot, 法]
Here's an approach I would take.
You can use Character.codePointAt(char[] charArray, int index) to return the Unicode value for a char in your char array.
You will also need a mapping of Latin Unicode characters.
If you look in the source of Character.UnicodeBlock, the full LATIN block is the interval [0x0000, 0x0249]. So basically you check if your Unicode code point is somewhere within that interval.
I suspect there is a way to just use a Character.Subset to check if it contains your char, but I haven't looked into that.
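For what it's worth, Character.UnicodeBlock is itself a Character.Subset, so the lookup hinted at above can be as direct as this trivial sketch:

// UnicodeBlock extends Character.Subset; of(char) returns the block the char belongs to
boolean isCjk = Character.UnicodeBlock.of('查') == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
System.out.println(isCjk); // true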
