How to pad Strings with Unicode characters in Java

I add right padding to a String to output it in a table format.
for (String[] tuple : testData) {
System.out.format("%-32s -> %s\n", tuple[0], tuple[1]);
}
The result looks like this (random test data):
znZfmOEQ0Gb68taaNU6HY21lvo -> Xq2aGqLedQnTSXg6wmBNDVb
frKweMCH8Kvgyk0J -> lHJ5r7YDV0jTL
NxtHP -> odvPJklwIzZZ
NX2scXjl5dxWmer -> wPDlKCKllVKk
x2HKsSHCqDQ -> RMuWLZ2vaP9sOF0yHmjVysJ
b0hryXKd6b80xAI -> 05MHjvTOxlxq1bvQ8RGe
This approach does not work when there are multi-byte unicode characters:
0OZot🇨🇳ivbyG🧷hZM1FI👡wNhn6r6cC -> OKDxDV1o2NMqXH3VvE7q3uONwEcY5V
fBHRCjU4K8OCdzACmQZSn6WO -> gvGBtUO5a4gPMKj9BKqBHFKx1iO7
cDUh🇲🇺b0cXkLWkS -> SZX
WtP9t -> Q0wWOeY3W66mM5rcQQYKpG
va4d🍷u8SS -> KI
a71?⚖TZ💣🧜‍♀🕓ws5J -> b8A
As you can see, the alignment is off.
My idea was to calculate the difference between the length of the String and the number of bytes used and use that to offset the padding, something like this:
int correction = tuple[0].getBytes().length - tuple[0].length();
And then instead of padding to 32 chars, I would pad to 32 + correction. However, this didn't work either.
Here is my test code (using emoji-java but the behaviour should be reproducible with any unicode characters):
import java.util.Collection;
import org.apache.commons.lang3.RandomStringUtils;
import com.vdurmont.emoji.Emoji;
import com.vdurmont.emoji.EmojiManager;
public class Test {
public static void main(String[] args) {
// create random test data
String[][] testData = new String[15][2];
for (String[] tuple : testData) {
tuple[0] = RandomStringUtils.randomAlphanumeric(2, 32);
tuple[1] = RandomStringUtils.randomAlphanumeric(2, 32);
}
// add some emojis
Collection<Emoji> all = EmojiManager.getAll();
for (String[] tuple : testData) {
for (int i = 1; i < tuple[0].length(); i++) {
if (Math.random() > 0.90) {
Emoji emoji = all.stream().skip((int) (all.size() * Math.random())).findFirst().get();
tuple[0] = tuple[0].substring(0, i - 1) + emoji.getUnicode() + tuple[0].substring(i + 1);
}
}
}
// output
for (String[] tuple : testData) {
System.out.format("%-32s -> %s\n", tuple[0], tuple[1]);
}
}
}

There are actually a few issues here, other than that some fonts display the flag wider than the other characters. I assume that you want to count the Chinese flag as a single character (as it is drawn as a single element on the screen).
The String class reports an incorrect length
The String class works with chars, which are 16-bit UTF-16 code units rather than full Unicode code points. The problem is that not all code points fit in 16 bits; only code points from the Basic Multilingual Plane (BMP) fit in a single char, and all others take two chars (a surrogate pair). String's length() method returns the number of chars, not the number of code points.
Now String's codePointCount method may help in this case: it counts the number of code points in the given index range. So providing string.length() as second argument to the method returns the total count of code points.
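As a small illustration of the difference between length() and codePointCount, here is a sketch using one of the strings from the question. The class and method names (CodePointCountDemo, codePoints) are made up for this example.

```java
class CodePointCountDemo {
    // Number of Unicode code points in s; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "va4d", then U+1F377 (wine glass) as a surrogate pair, then "u8SS"
        String wine = "va4d\uD83C\uDF77u8SS";
        System.out.println(wine.length());    // 10 chars (UTF-16 code units)
        System.out.println(codePoints(wine)); // 9 code points
    }
}
```

The emoji is written with escapes so that the example does not depend on the source file encoding.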
Combining characters
However, there's another problem. The 🇨🇳 Chinese flag, for example, consists of two Unicode code points: the Regional Indicator Symbol Letters C (🇨, U+1F1E8) and N (🇳, U+1F1F3). Those two code points are combined into a flag of China. This is a problem you are not going to solve with the codePointCount method.
The Regional Indicator Symbol Letters seem to be a special case. Two of those characters can be combined into a national flag. I am not aware of a standard way to achieve what you want. You may have to take that into account manually.
I've written a small program to get the length of a string.
static int length(String str) {
String a = "\uD83C\uDDE6";
String z = "\uD83C\uDDFF";
Pattern p = Pattern.compile("[" + a + "-" + z + "]{2}");
Matcher m = p.matcher(str);
int count = 0;
while (m.find()) {
count++;
}
return str.codePointCount(0, str.length()) - count;
}
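To connect this back to the padding in the question, here is a rough sketch that pads manually using that kind of length calculation instead of relying on %-32s. The names PadDemo, displayLength and padRight are invented for this example, and it only accounts for surrogate pairs and flag pairs. Other sequences (for example emoji joined with a zero-width joiner) and fonts that draw emoji wider than one column are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PadDemo {
    // Two consecutive Regional Indicator Symbols (U+1F1E6..U+1F1FF) form one flag.
    private static final Pattern FLAG =
            Pattern.compile("[\uD83C\uDDE6-\uD83C\uDDFF]{2}");

    // Code point count, with each flag pair counted as a single character.
    static int displayLength(String s) {
        Matcher m = FLAG.matcher(s);
        int flags = 0;
        while (m.find()) {
            flags++;
        }
        return s.codePointCount(0, s.length()) - flags;
    }

    // Appends spaces until the display length reaches width.
    static String padRight(String s, int width) {
        StringBuilder sb = new StringBuilder(s);
        for (int i = displayLength(s); i < width; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"NxtHP", "odvPJklwIzZZ"},
            // contains the flag of Mauritius (two regional indicators)
            {"cDUh\uD83C\uDDF2\uD83C\uDDFAb0cXkLWkS", "SZX"}
        };
        for (String[] row : rows) {
            System.out.println(padRight(row[0], 32) + " -> " + row[1]);
        }
    }
}
```

Whether the columns line up on screen still depends on the terminal and font.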

As is discussed in the comments on the question linked to by @Xehpuk, in this discussion on kotlinlang.org, as well as in this blog post by Daniel Lemire, the following seems to be correct:
The problem is that the java String class represents characters as
UTF-16 characters. This means any unicode character that is
represented by more than 16 bits is saved as 2 separate Char values.
This fact is ignored by many of the functions within String, eg.
String.length does not return the number of unicode characters, it
returns the number of 16bit characters within the String, some emoji
counting for 2 characters.
The behaviour, however, seems to be implementation-specific.
As David mentions in his post, you could try the following to get the correct length:
tuple[0].codePointCount(0, tuple[0].length())
See code point methods from Java SE docs

Related

How to trim file segment length, so when written as new filepath is not longer than 255 chars for each segment

My Java code generates a new path name for an existing file, as part of this, I have to ensure each path segment is no longer than 255 characters because this is illegal for most operating systems.
// No path component can be longer than 255 chars
String[] pathComponents = splitPath(newPath);
for(int i=0;i<pathComponents.length - 1;i++) {
if (pathComponents[i]. length() > MAX_FILELENGTH) {
String shortened = pathComponents[i].substring(0, MAX_FILELENGTH - 1);
shortened = shortened.trim();
sb.append(shortened).append(File.separator);
}
else {
sb.append(pathComponents[i]).append(File.separator);
}
}
This works fine most of the time, but it fails when a segment has fewer than 255 characters yet some of them need more than one byte when written to the filesystem, so the segment ends up longer than 255 bytes, which my test doesn't catch.
I can count bytes instead of characters with
if(pathComponents[i].getBytes(StandardCharsets.UTF_8).length > MAX_FILELENGTH)
I cannot work out a nice way to trim by just the right amount of characters.
As you stated, 255 characters are sometimes 255 bytes but sometimes they are longer. This simple test shows that:
String a = "a";
System.out.println(a);
System.out.println((int)a.charAt(0));
System.out.println(Arrays.toString(a.getBytes(StandardCharsets.UTF_8)));
// a
// 97
// [97]
String aa = "ä";
System.out.println(aa);
System.out.println((int)aa.charAt(0));
System.out.println(Arrays.toString(aa.getBytes(StandardCharsets.UTF_8)));
// ä
// 228
// [-61, -92]
As you can see, the letter ä is within the 0-255 range (8 bits), but in UTF-8 it is represented by an array of length 2.
What would I do? From the question, I see that you generate the string yourself, so in this new-path generator I would create chars only in the ASCII range (0-127). Then you can be sure that each character of the generated string is one byte, and string.length() will be the same as getBytes().length.
The following code snippet shows that up to the value 127 everything is one byte long, and afterwards it is a two-byte array. You can also use that rule to shorten the string.
for (int i = 0; i < 255; i++) {
char c = (char)i;
String s = String.valueOf(c);
System.out.println(i + "-> " + s + " ->" + Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
}
// ...
// ...
// 125-> } ->[125]
// 126-> ~ ->[126]
// 127-> ->[127]
// 128-> ->[-62, -128]
// 129-> ->[-62, -127]
// 130-> ->[-62, -126]
I agree with Mark Rotteveel's comment that assuming the OS filename limit is 255 bytes isn't safe, and you need to know the charset of OS filenames.
That said, to answer the question you asked: in order to split a String[] up according to some rule on the maximum length in bytes of each component, you'd need to iterate through the characters or code points until the converted size exceeds the max.
If you don't handle code points, you could do this by writing and flushing one character at a time to an OutputStreamWriter backed by a ByteArrayOutputStream, and splitting up components whenever the next character would mean byteArrayOutput.size() > MAX.
To handle code points you could try this example, which iterates over the code points, finds their corresponding size in bytes, then assembles them into substrings which, when converted to some character set, all keep to your byte length limit:
public static void main(String ... args) {
int maxSizeInBytes = 5;
Charset osPathCharset = StandardCharsets.UTF_8;
ArrayList<String> split = new ArrayList<>();
System.out.println("splitting "+String.join(File.separator, args));
for (String s : args) {
System.out.println("s="+s+ " chars#="+s.length()+ " bytes#="+s.getBytes(osPathCharset).length);
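// code points of s; the index i below is a code point index, not a char index
int[] codePoints = s.codePoints().toArray();
int[] cpAsBytes = s.codePoints().mapToObj(Character::toString).mapToInt(c -> c.getBytes(osPathCharset).length).toArray();
StringBuilder b = new StringBuilder();
int avail = maxSizeInBytes;
for(int i = 0; i < cpAsBytes.length; i++) {
if (avail < cpAsBytes[i]) {
split.add(b.toString());
avail = maxSizeInBytes;
b.setLength(0);
}
avail -= cpAsBytes[i];
// s.codePointAt(i) expects a char index and would misread strings with supplementary characters
b.appendCodePoint(codePoints[i]);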
}
if (b.length() > 0)
split.add(b.toString());
}
System.out.println("split as: "+String.join(File.separator, split.toArray(String[]::new)));
for (String s : split) {
System.out.println("Part s="+s+ " chars#="+s.length()+ " bytes#="+s.getBytes(osPathCharset).length);
}
}
Obviously this isn't memory friendly, it creates a stream of codePoints as String with corresponding int value for each part and isn't robustly tested so I may delete this answer at some stage. I tried it with:
main("å2ø4æ","12345", "67890abcdef");
Which prints:
splitting å2ø4æ\12345\67890abcdef
s=å2ø4æ chars#=5 bytes#=8
s=12345 chars#=5 bytes#=5
s=67890abcdef chars#=11 bytes#=11
split as: å2ø\4æ\12345\67890\abcde\f
Part s=å2ø chars#=3 bytes#=5
Part s=4æ chars#=2 bytes#=3
Part s=12345 chars#=5 bytes#=5
Part s=67890 chars#=5 bytes#=5
Part s=abcde chars#=5 bytes#=5
Part s=f chars#=1 bytes#=1
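Since the question asks about trimming rather than splitting, here is a minimal sketch of truncating a single path segment to a byte budget without cutting a code point in half. It assumes the filesystem encodes names as UTF-8, which (as noted above) is not guaranteed. The names PathSegmentTrim and truncateUtf8 are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class PathSegmentTrim {
    // Returns the longest prefix of s whose UTF-8 encoding fits in maxBytes,
    // never splitting a surrogate pair.
    static String truncateUtf8(String s, int maxBytes) {
        int bytes = 0;
        int i = 0; // char index of the next code point
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // UTF-8 size of one code point: 1 to 4 bytes depending on its value
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + size > maxBytes) {
                break;
            }
            bytes += size;
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        // same test string as above, written with escapes
        String name = "\u00e52\u00f84\u00e6";
        String cut = truncateUtf8(name, 5);
        System.out.println(cut.length());                                // 3
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length); // 5
    }
}
```

In the question's loop this could stand in for the substring call, with MAX_FILELENGTH as the byte budget. It does not try to keep combining sequences together.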

Java: Stringtokenizer To Array

Given a polynomial, I'm attempting to write code to create a polynomial that is ordered by degree and adds like terms together. For instance, given
String term = "323x^3+2x+x-5x+5x^2" //Given
What I'd like = "323x^3+5x^2-2x" //result
So far I've tokenized the given polynomial by this...
term = term.replace("+" , "~+");
term = term.replace("-", "~-");
System.out.println(term);
StringTokenizer multiTokenizer = new StringTokenizer(term, "~");
int numberofTokens = multiTokenizer.countTokens();
String[] tokensArray = new String[numberofTokens];
int x=0;
while (multiTokenizer.hasMoreTokens())
{
System.out.println(multiTokenizer.nextToken());
}
Resulting in
323x^3~+2x~+x~-5x~+5x^2
323x^3
+2x
+x
-5x
+5x^2
How would I go about splitting the coefficient from the x value, saving each coefficient in an array, and then putting the degrees in a different array with the same index as its coefficient? I will then use this algorithm to add like terms....
for (int i = 0; i <= biggest_Root; i++) {
int total = 0; // restart the sum for each degree
for (int j = 0; j < items_in_list; j++)
if (degree_array[j] == i) // compare with ==, not assign with =
total += b1[j];
array_of_totals[i] = total;
}
Any and all help is much appreciated!
You can also update the terms so they all have coefficients:
s/([+-])x/${1}1x/g
So +x^2 becomes +1x^2.
Your individual coefficients can be pulled out by simple regex expressions.
Something like this should suffice:
/([+-]?\d+)x/ // match for x
/([+-]?\d+)x\^2/ // match for x^2
/([+-]?\d+)x\^3/ // match for x^3
/([+-]?\d+)x\^4/ // match for x^4
Then
sum_of_coefficient[degree] += match
where "match" is the parseInt of the regex match (special case where the coefficient is 1 and has no number, e.g. +x)
sum_of_coefficient[3] = 323
sum_of_coefficient[1] = +2+1-5 = -2
sum_of_coefficient[2] = 5
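The regex idea above can be sketched end to end as follows. This is only an illustration under some assumptions: the variable is always x, coefficients and degrees are non-negative integers, and constant terms without x are ignored. The class name LikeTerms and the single combined pattern are choices made for this example, not part of the answer.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LikeTerms {
    // optional sign, optional digits, the variable x, optional ^degree
    private static final Pattern TERM =
            Pattern.compile("([+-]?)(\\d*)x(?:\\^(\\d+))?");

    static String simplify(String poly) {
        // degree -> summed coefficient, highest degree first
        Map<Integer, Integer> sums = new TreeMap<>(Comparator.<Integer>reverseOrder());
        Matcher m = TERM.matcher(poly);
        while (m.find()) {
            // a missing coefficient (as in +x) means 1
            int coeff = m.group(2).isEmpty() ? 1 : Integer.parseInt(m.group(2));
            if (m.group(1).equals("-")) {
                coeff = -coeff;
            }
            // a missing exponent (as in 2x) means degree 1
            int degree = m.group(3) == null ? 1 : Integer.parseInt(m.group(3));
            sums.merge(degree, coeff, Integer::sum);
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Integer, Integer> e : sums.entrySet()) {
            int c = e.getValue();
            if (c == 0) {
                continue; // the terms cancelled out
            }
            if (c > 0 && sb.length() > 0) {
                sb.append('+');
            }
            sb.append(c).append('x');
            if (e.getKey() != 1) {
                sb.append('^').append(e.getKey());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(simplify("323x^3+2x+x-5x+5x^2")); // 323x^3+5x^2-2x
    }
}
```

A TreeMap is used here instead of the sum_of_coefficient array so that the highest degree does not need to be known in advance.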
Using a "Regular Expression" Pattern to Simplify the Parsing
(and make the code cooler and more concise)
Here is a working example that parses coefficient, variable and degree for each term based on the terms you've parsed so far. It just inserted the terms shown into your example into a list of Strings and then processes each string the same way.
This program runs and produces output, and if you like it you can splice it into your program. To try it:
$ javac parse.java
$ java parse
Limitations and Potential Improvements:
Technically speaking the coefficient and degrees could be fractional, so the regular expression could easily be changed to handle those kinds of numbers. And then instead of Integer.parseInt() you could use Float.parseFloat() instead to convert the matched value to a variable you can use.
import java.util.*;
import java.util.regex.*;
public class parse {
public static void main(String args[]) {
/*
* Substitute this List with your own list or
* array from the code you've written already...
*
* vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv */
List<String>terms = new ArrayList<String>();
terms.add("323x^3");
terms.add("+2x");
terms.add("+x");
terms.add("-5x");
terms.add("+5x^2");
/* ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ */
for (String term : terms) {
System.out.print("Term: " + term + ": \n");
Pattern pattern = Pattern.compile("([+-]*\\d*)([A-Za-z]*)\\^*(\\d*)");
Matcher matcher = pattern.matcher(term);
if (matcher.find()) {
int coefficient = 1;
try {
coefficient = Integer.parseInt(matcher.group(1));
} catch (Exception e) {}
String variable = matcher.group(2);
int degree = 1;
try {
degree = Integer.parseInt(matcher.group(3));
} catch (Exception e) {}
System.out.println(" coefficient = " + coefficient);
System.out.println(" variable = " + variable);
System.out.println(" degree = " + degree);
/*
* Here, do what you need to do with
* variable, coefficient and degree
*/
}
}
}
}
Explanation of the Regular Expression in the Example Code:
This is the regular expression used:
([+-]*\\d*)([A-Za-z]*)\\^*(\\d*)
Each parenthesized section represents part of the term I want to match and extract into my result. It puts whatever is matched in a group corresponding to the set of parentheses. The first set of parentheses goes into group 1, the second into group 2, etc.
The first matcher (grouped by ( )), is ([+-]*\\d*)
That is designed to match (i.e. extract) the coefficient (if any) and put it into group 1. It expects zero or more occurrences of '+' or '-' characters, followed by zero or more digits. I probably should have written it as [+-]?\\d*, which would match zero or one + or - characters.
The next grouped matcher is ([A-Za-z]*) That says match zero or more capital or lowercase letters.
That is trying to extract the variable name, if any and put it into group 2.
Following that, there is an ungrouped \\^*, which matches 0 or more ^ characters. It's not grouped in parenthesis, because we want to account for the ^ character in the text, but not stash it anywhere. We're really interested in the exponent number following it. Note: Two backslashes are how you make one backslash in a Java string. The real world regular expression we're trying to represent is \^*. The reason it's escaped here is because ^ unescaped has special meaning in regular expressions, but we just want to match/allow for the possibility of an actual caret ^ at that position in the algebraic term we're parsing.
The final pattern group is (\\d*). Outside of a string literal, as most regexes in the wild are, that would simply be \d*. It's escaped because, by default, in a regex, d, unescaped, means match a literal d at the current position in the text, but, escaped, \d is a special regex pattern that matches any digit [0-9] (as the Pattern javadoc explains). * means expect (match) zero or more digits at that point. Alternatively, + would mean expect 1 or more digits in the text at the current position, and ? would mean 0 or 1 digits are expected in the text at the current position. So, essentially, the last group is designed to match and extract the exponent (if any) after the optional caret, putting that number into group 3.
Remember the ( ) (parenthesized) groupings are just so that we can extract those areas parsed into separate groups.
If this doesn't all make perfect sense, study regular expressions in general and read the Java Pattern class javadoc online. They are NOT as scary as they first look, and they are an extremely worthwhile study for any programmer, as they carry over to most popular scripting languages and compilers, so learn them once and you have an extremely powerful tool for life.
This looks like a homework question, so I won't divulge the entire answer here but here's how I'd get started
public class Polynomial {
private String rawPolynomial;
private int lastTermIndex = 0;
private Map<Integer, Integer> terms = new HashMap<>();
public Polynomial(String poly) {
this.rawPolynomial = poly;
}
public void simplify() {
while(true){
String term = getNextTerm(rawPolynomial);
if ("".equalsIgnoreCase(term)) {
return;
}
Integer degree = getDegree(term);
Integer coeff = getCoefficient(term);
System.out.println(String.format("%dx^%d", coeff, degree));
terms.merge(degree, coeff, Integer::sum);
}
}
private String getNextTerm(String poly) {
...
}
private Integer getDegree(String poly) {
...
}
private Integer getCoefficient(String poly) {
...
}
@Override public String toString() {
return terms.toString();
}
}
and some tests to get you started -
public class PolynomialTest {
@Test public void oneTermPolynomialRemainsUnchanged() {
Polynomial poly = new Polynomial("3x^2");
poly.simplify();
assertTrue("3x^2".equalsIgnoreCase(poly.toString()));
}
}
You should be able to fill in the blanks, hope this helps. I'll be happy to help you further if you're stuck somewhere.

JAVA: Space delimiting all non-numerical characters in a String

I am having some trouble with modifying Strings to be space delimited under the special case of adding spaces to all non-numerical characters.
My code must take a string representing a math equation, and split it up into its individual parts. It does so using space delimiters between values. This part works great if the string is already delimited.
The problem is that I do not always get a space delimited input. To deal with this, I want to first insert these spaces so that the array is created properly.
What my code must do is take any character that is NOT a number, and add a space before and after it.
Something like this:
3*24+321 becomes 3 * 24 + 321
or
((3.0)*(2.5)) becomes ( ( 3.0 ) * ( 2.5 ) )
Obviously I need to avoid inserting space in the numbers, or 2.5 becomes 2 . 5, and then gets entered into the array as 3 elements. which it is not.
So far, I have tried using
String InputLineDelmit = InputLine.replaceAll("\\B", " ");
which successfully changes a string of all letters "abcd" to "a b c d"
But it makes mistakes when it runs into numbers. Using this method, I have gotten that:
(((1)*(2))) becomes ( ( (1) * (2) ) ) ---- * The numbers must be separate from parens
12.7+3.1 becomes 1 2.7+3.1 ----- * 12.7 is split
51/3 becomes 5 1/3 ----- * same issue
and 5*4-2 does not change at all.
So, I know that \D can be used as a regular expression for all non-numbers in java. However, my attempts to implement this (by replacing, or combining it with \B above) have led either to compiler errors or it REPLACING the char with a space, not adding one.
EDIT:
==== Answered! ====
It won't let me add my own answer because I'm new, but an edit to neo108's code below (which, itself, does not work) did the job. What I did was change it to check isDigit, not isLetter, and then do nothing in that case (or in the special case of a decimal point, for doubles). Otherwise, the character gets a space on either side.
public static void main(String[] args){
String formula = "12+((13.0)*(2.5)-17*2)+(100/3)-7";
StringBuilder builder = new StringBuilder();
for (int i = 0; i < formula.length(); i++){
char c = formula.charAt(i);
char cdot = '.';
if(Character.isDigit(c) || c == cdot) {
builder.append(c);
}
else {
builder.append(" "+c+" ");
}
}
System.out.println("OUTPUT:" + builder);
}
OUTPUT: 12 + ( ( 13.0 ) * ( 2.5 ) - 17 * 2 ) + ( 100 / 3 ) - 7
However, any ideas on how to do this more succinctly, and also a decent explanation of StringBuilders, would be appreciated. Namely what is with this limit of 16 chars that I read about on javadocs, as the example above shows that you CAN have more output.
Something like this should work...
String formula = "Ab((3.0)*(2.5))";
StringBuilder builder = new StringBuilder();
for (int i = 0; i < formula.length(); i++){
char c = formula.charAt(i);
if(Character.isLetter(c)) {
builder.append(" "+c+" ");
} else {
builder.append(c);
}
}
Define the operations in your math equation + - * / () etc
Convert your equation string to char[]
Traverse through the char[] one char at a time and append the read char to a StringBuilder object.
If you encounter any character matching the operations defined, then add a space before and after that character and then append this to the StringBuilder object.
Well, this is one of the algorithms you can implement. There might be other ways of doing it as well.
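As for doing this more succinctly, one possible sketch uses replaceAll with a capturing group, so the matched character is kept rather than replaced, and then collapses the doubled spaces. The class name SpaceDelimit is invented for this example. It treats every character other than a digit or '.' as a token, so a unary minus such as -7 comes out as "- 7". The main method also prints the default StringBuilder capacity: 16 is only the initial size of the internal buffer, which grows as needed, not a limit on the content.

```java
class SpaceDelimit {
    // Puts a space on both sides of every character that is not a digit or '.',
    // then trims the ends and collapses runs of whitespace to a single space.
    static String delimit(String formula) {
        return formula
                .replaceAll("([^\\d.])", " $1 ")
                .trim()
                .replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(delimit("3*24+321"));      // 3 * 24 + 321
        System.out.println(delimit("((3.0)*(2.5))")); // ( ( 3.0 ) * ( 2.5 ) )

        StringBuilder builder = new StringBuilder();
        System.out.println(builder.capacity());       // 16, the initial capacity
        builder.append("12+((13.0)*(2.5)-17*2)+(100/3)-7");
        System.out.println(builder.length());         // 32, the buffer has grown
    }
}
```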

Checking if a character is an integer or letter

I am modifying a file using Java. Here's what I want to accomplish:
if an & symbol, along with an integer, is detected while being read, I want to drop the & symbol and translate the integer to binary.
if an & symbol, along with a (random) word, is detected while being read, I want to drop the & symbol and replace the word with the integer 16, and if a different string of characters is being used along with the & symbol, I want to set the number 1 higher than integer 16.
Here's an example of what I mean. If a file is inputted containing these strings:
&myword
&4
&anotherword
&9
&yetanotherword
&10
&myword
The output should be:
&0000000000010000 (which is 16 in decimal)
&0000000000000100 (or the number '4' in decimal)
&0000000000010001 (which is 17 in decimal, since 16 is already used, so 16+1=17)
&0000000000000101 (or the number '9' in decimal)
&0000000000010001 (which is 18 in decimal, or 17+1=18)
&0000000000000110 (or the number '10' in decimal)
&0000000000010000 (which is 16 because value of myword = 16)
Here's what I tried so far, but haven't succeeded yet:
for (i=0; i<anyLines.length; i++) {
char[] charray = anyLines[i].toCharArray();
for (int j=0; j<charray.length; j++)
if (Character.isDigit(charray[j])) {
anyLines[i] = anyLines[i].replace("&","");
anyLines[i] = Integer.toBinaryString(Integer.parseInt(anyLines[i]);
}
else {
continue;
}
if (Character.isLetter(charray[j])) {
anyLines[i] = anyLines[i].replace("&","");
for (int k=16; j<charray.length; k++) {
anyLines[i] = Integer.toBinaryString(Integer.parseInt(k);
}
}
}
}
I hope that I am articulate enough. Any suggestions on how to accomplish this task?
Character.isLetter() //tests to see if it is a letter
Character.isDigit() //tests to see if it is a digit
It looks like something you could match against a regex. I don't know Java but you should have at least one regex engine at your disposal. Then the regex would be:
regex1: &(\d+)
and
regex2: &(\w+)
or
regex3: &(\d+|\w+)
In the first case, if regex1 matches, you know you ran into a number, and that number is in the first capturing group (e.g. match.group(1)). If regex2 matches, you know you have a word. You can then look up that word in a dictionary and see what its associated number is, or if it is not present, add it to the dictionary and associate it with the next free number (16 + the dictionary size before insertion).
regex3 on the other hand will match both numbers and words, so it's up to you to see what's in the capturing group (it's just a different approach).
If neither of the regex match, then you have an invalid sequence, or you need some other action. Note that \w in a regex only matches word characters (ie: letters, _ and possibly a few other characters), so &çSomeWord or &*SomeWord won't match at all, while the captured group in &Hello.World would be just "Hello".
Regex libs usually provide a length for the matched text, so you can move i forward by that much in order to skip already matched text.
You have to somehow tokenize your input. It seems you are splitting it in lines and then analyzing each line individually. If this is what you want, okay. If not, you could simply search for & (indexOf('&')) and then somehow determine what the next token is (either a number or a "word", however you want to define word).
What do you want to do with input which does not match your pattern? Neither the description of the task nor the example really covers this.
You need to have a dictionary of already read strings. Use a Map<String, Integer>.
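Putting the regex and the Map suggestions together, a minimal sketch might look like this. It assumes one token per line, keeps the leading & as in the expected output of the question, and passes through unchanged any line that does not match. The class and method names (AmpersandTranslator, translate, toBinary16) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmpersandTranslator {
    // '&' followed by either a number or a word (regex3 above)
    private static final Pattern TOKEN = Pattern.compile("&(\\d+|\\w+)");
    private static final int FIRST_WORD_VALUE = 16;

    static List<String> translate(List<String> lines) {
        Map<String, Integer> wordValues = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = TOKEN.matcher(line.trim());
            if (!m.matches()) {
                out.add(line); // not an &-token: left unchanged (an assumption)
                continue;
            }
            String token = m.group(1);
            int value;
            if (token.matches("\\d+")) {
                value = Integer.parseInt(token);
            } else {
                Integer known = wordValues.get(token);
                if (known == null) {
                    // next free number: 16 plus the number of words seen so far
                    known = FIRST_WORD_VALUE + wordValues.size();
                    wordValues.put(token, known);
                }
                value = known;
            }
            out.add("&" + toBinary16(value));
        }
        return out;
    }

    // Binary representation, left-padded with zeros to 16 digits.
    static String toBinary16(int value) {
        return String.format("%16s", Integer.toBinaryString(value)).replace(' ', '0');
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "&myword", "&4", "&anotherword", "&9", "&yetanotherword", "&10", "&myword");
        translate(input).forEach(System.out::println);
    }
}
```

Numbers too large for an int make Integer.parseInt throw a NumberFormatException, which this sketch does not handle.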
I would post this as a comment, but don't have the ability yet. What is the issue you are running into? Error? Incorrect Results? 16's not being correctly incremented? Also, the examples use a '%' but in your description you say it should start with a '&'.
Edit2: Was thinking it was line by line, but re-reading indicates you could be trying to find say "I went to the &store" and want it to say "I went to the &000010000". So you would want to split by whitespace and then iterate through and pass the strings into your 'replace' method, which is similar to below.
Edit1: If I understand what you are trying to do, code like this should work.
Map<String, Integer> usedWords = new HashMap<String, Integer>();
List<String> output = new ArrayList<String>();
int wordIncrementer = 16;
String[] arr = test.split("\n");
for(String s : arr)
{
if(s.startsWith("&"))
{
String line = s.substring(1).trim(); //Removes &
try
{
Integer lineInt = Integer.parseInt(line);
output.add("&" + Integer.toBinaryString(lineInt));
}
catch(Exception e)
{
System.out.println("Line was not an integer. Parsing as a String.");
String outputString = "&";
if(usedWords.containsKey(line))
{
outputString += Integer.toBinaryString(usedWords.get(line));
}
else
{
outputString += Integer.toBinaryString(wordIncrementer);
usedWords.put(line, wordIncrementer++);
}
output.add(outputString);
}
}
else
{
continue; //Nothing indicating that we should parse the line.
}
}
How about this?
String input = "&myword\n&4\n&anotherword\n&9\n&yetanotherword\n&10\n&myword";
String[] lines = input.split("\n");
int wordValue = 16;
// to keep track words that are already used
Map<String, Integer> wordValueMap = new HashMap<String, Integer>();
for (String line : lines) {
// if line doesn't begin with &, then ignore it
if (!line.startsWith("&")) {
continue;
}
// remove &
line = line.substring(1);
Integer binaryValue = null;
if (line.matches("\\d+")) {
binaryValue = Integer.parseInt(line);
}
else if (line.matches("\\w+")) {
binaryValue = wordValueMap.get(line);
// if the map doesn't contain the word value, then assign and store it
if (binaryValue == null) {
binaryValue = wordValue;
wordValueMap.put(line, binaryValue);
wordValue++;
}
}
// I'm using Commons Lang's StringUtils.leftPad(..) to create the zero padded string
String out = "&" + StringUtils.leftPad(Integer.toBinaryString(binaryValue), 16, "0");
System.out.println(out);
}
Here's the printout:-
&0000000000010000
&0000000000000100
&0000000000010001
&0000000000001001
&0000000000010010
&0000000000001010
&0000000000010000
Just FYI, the binary value for 10 is "1010", not "110" as stated in your original post.

Creating Unicode character from its number

I want to display a Unicode character in Java. If I do this, it works just fine:
String symbol = "\u2202";
symbol is equal to "∂". That's what I want.
The problem is that I know the Unicode number and need to create the Unicode symbol from that. I tried (to me) the obvious thing:
int c = 2202;
String symbol = "\\u" + c;
However, in this case, symbol is equal to "\u2202". That's not what I want.
How can I construct the symbol if I know its Unicode number (but only at run-time---I can't hard-code it in like the first example)?
If you want to get a UTF-16 encoded code unit as a char, you can parse the integer and cast it to char as others have suggested.
If you want to support all code points, use Character.toChars(int). This will handle cases where code points cannot fit in a single char value.
Doc says:
Converts the specified character (Unicode code point) to its UTF-16 representation stored in a char array. If the specified code point is a BMP (Basic Multilingual Plane or Plane 0) value, the resulting char array has the same value as codePoint. If the specified code point is a supplementary code point, the resulting char array has the corresponding surrogate pair.
Just cast your int to a char. You can convert that to a String using Character.toString():
String s = Character.toString((char)c);
EDIT:
Just remember that the escape sequences in Java source code (the \u bits) are in HEX, so if you're trying to reproduce an escape sequence, you'll need something like int c = 0x2202.
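To make the hex point concrete, here is a small sketch that starts from the digits "2202" as text, parses them with radix 16, and builds the String through Character.toChars so that supplementary code points work as well. The names CodePointFromNumber and fromHex are invented for this example.

```java
class CodePointFromNumber {
    // Builds a String from a code point given as hex digits, e.g. "2202".
    static String fromHex(String hexDigits) {
        int codePoint = Integer.parseInt(hexDigits, 16);
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String partial = fromHex("2202");
        System.out.println((int) partial.charAt(0)); // 8706, which is 0x2202
        System.out.println(partial.length());        // 1, fits in a single char

        String heart = fromHex("1F495");
        System.out.println(heart.codePointAt(0));    // 128149
        System.out.println(heart.length());          // 2, stored as a surrogate pair
    }
}
```

Parsing "2202" without the radix would give decimal 2202, which is U+089A and not the symbol asked for.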
The other answers here either only support unicode up to U+FFFF (the answers dealing with just one instance of char) or don't tell how to get to the actual symbol (the answers stopping at Character.toChars() or using incorrect method after that), so adding my answer here, too.
To support supplementary code points also, this is what needs to be done:
// this character:
// http://www.isthisthingon.org/unicode/index.php?page=1F&subpage=4&glyph=1F495
// using code points here, not U+n notation
// for equivalence with U+n, below would be 0xnnnn
int codePoint = 128149;
// converting to char[] pair
char[] charPair = Character.toChars(codePoint);
// and to String, containing the character we want
String symbol = new String(charPair);
// we now have symbol with the desired character as the first item
// confirm that we indeed have character with code point 128149
System.out.println("First code point: " + symbol.codePointAt(0));
I also did a quick test as to which conversion methods work and which don't
int codePoint = 128149;
char[] charPair = Character.toChars(codePoint);
System.out.println(new String(charPair, 0, 2).codePointAt(0)); // 128149, worked
System.out.println(charPair.toString().codePointAt(0)); // 91, didn't work
System.out.println(new String(charPair).codePointAt(0)); // 128149, worked
System.out.println(String.valueOf(codePoint).codePointAt(0)); // 49, didn't work
System.out.println(new String(new int[] {codePoint}, 0, 1).codePointAt(0));
// 128149, worked
--
Note: as @Axel mentioned in the comments, with Java 11 there is Character.toString(int codePoint), which would arguably be best suited for the job.
This one worked fine for me.
String cc2 = "2202";
String text2 = String.valueOf(Character.toChars(Integer.parseInt(cc2, 16)));
Now text2 will have ∂.
Remember that char is an integral type, and thus can be given an integer value, as well as a char constant.
char c = 0x2202;//aka 8706 in decimal. \u codepoints are in hex.
String s = String.valueOf(c);
String st="2202";
int cp=Integer.parseInt(st,16);// parses st as a hexadecimal number.
char c[]=Character.toChars(cp);
System.out.println(c);// displays the character corresponding to '\u2202'.
Although this is an old question, there is a very easy way to do this in Java 11 which was released today: you can use a new overload of Character.toString():
public static String toString(int codePoint)
Returns a String object representing the specified character (Unicode code point). The result is a string of length 1 or 2, consisting solely of the specified codePoint.
Parameters:
codePoint - the codePoint to be converted
Returns:
the string representation of the specified codePoint
Throws:
IllegalArgumentException - if the specified codePoint is not a valid Unicode code point.
Since:
11
Since this method supports any Unicode code point, the length of the returned String is not necessarily 1.
The code needed for the example given in the question is simply:
int codePoint = '\u2202';
String s = Character.toString(codePoint); // <<< Requires JDK 11 !!!
System.out.println(s); // Prints ∂
This approach offers several advantages:
It works for any Unicode code point rather than just those that can be handled using a char.
It's concise, and it's easy to understand what the code is doing.
It returns the value as a string rather than a char[], which is often what you want. The answer posted by McDowell is appropriate if you want the code point returned as char[].
This is how you do it:
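int cc = 2202; // the hex digits of the code point, held in a decimal int as in the question
char ccc = (char) Integer.parseInt(String.valueOf(cc), 16); // re-read those digits as hex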
final String text = String.valueOf(ccc);
This solution is by Arne Vajhøj.
The code below writes the 4 Unicode chars (given as decimal values) for the word "be" in Japanese. Yes, the verb "be" in Japanese has 4 chars!
The character values are in decimal and have been read into a String[] (using split, for instance). If you have octal or hex values, parseInt accepts a radix as well.
// Steps:
// 1. init the String[] containing the 4 code points in decimal :: intsInStrs
// 2. allocate room for up to two chars per code point :: c2s
// 3. use Integer.parseInt (with or without a radix) to get the int value
// 4. write it at the next free position in the char array
// 5. convert the filled part of c2s to a String
// 6. print
String[] intsInStrs = {"12354", "12426", "12414", "12377"}; // 1.
char[] c2s = new char[intsInStrs.length * 2]; // 2. a code point needs at most 2 chars
int pos = 0; // next free index in c2s
for (String intString : intsInStrs) {
    // 3 + 4. toChars returns how many chars it wrote (1 for these BMP values),
    // so advancing by that amount leaves no unused slots between the characters
    pos += Character.toChars(Integer.parseInt(intString), c2s, pos);
}
String symbols = new String(c2s, 0, pos); // 5. use only the chars actually written
System.out.println("\nLooooonger code point: " + symbols); // 6.
// I tested it in Eclipse and Java 7 and it works. Enjoy
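If you do not need the intermediate char[], the same conversion can be written with StringBuilder.appendCodePoint, which handles the one-or-two-chars question itself. This is only a sketch of an alternative, not part of the answer above; the class and method names are invented for the example.

```java
public class CodePointsToString {
    /** Builds a String from decimal code point values given as text. */
    public static String fromDecimalCodePoints(String[] decimals) {
        StringBuilder sb = new StringBuilder();
        for (String decimal : decimals) {
            // appendCodePoint adds one char or a surrogate pair as needed
            sb.appendCodePoint(Integer.parseInt(decimal));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] intsInStrs = {"12354", "12426", "12414", "12377"};
        String symbols = fromDecimalCodePoints(intsInStrs);
        System.out.println(symbols + " has " + symbols.length() + " chars");
    }
}
```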
Here is a block that prints the Unicode chars from \u00c0 to \u00ff:
char[] ca = {'\u00c0'};
for (int i = 0; i < 4; i++) {
for (int j = 0; j < 16; j++) {
String sc = new String(ca);
System.out.print(sc + " ");
ca[0]++;
}
System.out.println();
}
Unfortunately, removing one backslash, as suggested in the first comment (newbiedoodle), does not give a good result: most (if not all) IDEs report a syntax error. The reason is that the Java Unicode escape format requires the syntax "\uXXXX", where XXXX are four mandatory hexadecimal digits. Attempts to assemble this string from pieces fail. Of course, "\u" is not the same as "\\u": the first is the start of a Unicode escape, while the second is an escaped backslash (that is, a literal backslash) followed by 'u'. Strangely, the Apache pages present a utility that does exactly this, but in reality it only mimics the escape format.
Apache has some utilities of its own (I didn't test them) that do this work for you, although they may still not be what you want: Apache Escape Unicode utilities. That utility does take a good approach to the solution, though. Combining it with the approach described above (MeraNaamJoker), my solution is to create this mimic escaped string and then convert it back to Unicode (to get around the restriction on real Unicode escapes). I used it for copying text, so in the uencodeStr method it may be better to use '\\u' instead of '\\\\u'. Try it.
/**
* Converts character to the mimic unicode format i.e. '\\u0020'.
*
* This format is the Java source code format.
*
* CharUtils.unicodeEscaped(' ') = "\\u0020"
* CharUtils.unicodeEscaped('A') = "\\u0041"
*
* @param ch the character to convert
* @return the character in the mimic escaped Unicode format
*/
public static String unicodeEscaped(char ch) {
String returnStr;
//String uniTemplate = "\u0000";
final String charEsc = "\\u"; // a local variable cannot be declared static
if (ch < 0x10) {
returnStr = "000" + Integer.toHexString(ch);
}
else if (ch < 0x100) {
returnStr = "00" + Integer.toHexString(ch);
}
else if (ch < 0x1000) {
returnStr = "0" + Integer.toHexString(ch);
}
else
returnStr = "" + Integer.toHexString(ch);
return charEsc + returnStr;
}
/**
* Converts the string from UTF8 to the mimic Unicode format, i.e. '\\u0020'.
* Note: I cannot use the real Unicode escape format, because it is translated
* to the character immediately, both at compile time and when the editor (e.g. NetBeans) checks it.
* Instead of the real Unicode format, i.e. '\u0020', I use the mimic format '\\u0020'
* as a string, but of course it does not give the same results.
*
* This format is the Java source code format.
*
* CharUtils.unicodeEscaped(' ') = "\\u0020"
* CharUtils.unicodeEscaped('A') = "\\u0041"
*
* @param nationalString the UTF8 string to convert
* @return the string in the Java mimic escaped Unicode format
*/
public String encodeStr(String nationalString) throws UnsupportedEncodingException {
String convertedString = "";
for (int i = 0; i < nationalString.length(); i++) {
Character chs = nationalString.charAt(i);
convertedString += unicodeEscaped(chs);
}
return convertedString;
}
/**
* Converts the string from mimic unicode format i.e. '\\u0020' back to UTF8.
*
* This format is the Java source code format.
*
* CharUtils.unicodeEscaped(' ') = "\\u0020"
* CharUtils.unicodeEscaped('A') = "\\u0041"
*
* @param escapedString the string in the Java mimic escaped Unicode format
* @return the decoded UTF8 string
*/
public String uencodeStr(String escapedString) throws UnsupportedEncodingException {
String convertedString = "";
String[] arrStr = escapedString.split("\\\\u");
String str, istr;
for (int i = 1; i < arrStr.length; i++) {
str = arrStr[i];
if (!str.isEmpty()) {
Integer iI = Integer.parseInt(str, 16);
char[] chaCha = Character.toChars(iI);
convertedString += String.valueOf(chaCha);
}
}
return convertedString;
}
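For comparison, here is a condensed sketch of the same round trip that uses String.format for the zero padding and a StringBuilder instead of repeated concatenation. It is an illustration under the same assumptions as the code above (the input to the decoder consists only of mimic escapes); the class name MimicEscape and the method names escape and unescape are invented for this example.

```java
public class MimicEscape {
    /** Encodes every UTF-16 code unit as a backslash, 'u' and four hex digits. */
    public static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    /** Reverses escape(); assumes the input consists only of such escapes. */
    public static String unescape(String escaped) {
        StringBuilder sb = new StringBuilder();
        // The regex matches a literal backslash followed by 'u'.
        for (String hex : escaped.split("\\\\u")) {
            if (!hex.isEmpty()) {
                sb.append((char) Integer.parseInt(hex, 16));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String real = "\u2202";   // a real escape: one character after compilation
        String mimic = "\\u2202"; // escaped backslash plus five ordinary characters
        System.out.println(real.length() + " vs " + mimic.length());

        String escaped = escape("a" + real);
        System.out.println(escaped);
        System.out.println(unescape(escaped).equals("a" + real));
    }
}
```

Working per UTF-16 code unit means a character outside the BMP is written as two escapes, one for each surrogate, and the decoder puts them back together in order.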
char c=(char)0x2202;
String s=""+c;
(This answer is for .NET 4.5; a similar approach must exist in Java.)
I am from West Bengal in India.
As I understand it, your problem is ...
You want to produce something like ' অ ' (a letter in the Bengali language),
which has the Unicode hex value 0x0985.
Now, if you know this value for your language, how do you produce that language-specific Unicode symbol?
In .NET it is as simple as this:
int c = 0X0985;
string x = Char.ConvertFromUtf32(c);
Now x is your answer.
But this converts one hex value at a time; sentence-to-sentence conversion is a job for researchers :P
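A rough Java counterpart to the .NET snippet above could look like this. It is a sketch rather than an official equivalent: Character.toChars is the standard library call, while the class name and the helper fromCodePoint are made up for the example.

```java
public class BengaliLetter {
    /** Converts a code point to a String, similar in spirit to Char.ConvertFromUtf32. */
    public static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int c = 0x0985; // BENGALI LETTER A
        String x = fromCodePoint(c);
        // Whether the glyph displays correctly depends on the console encoding and font.
        System.out.println(x);
    }
}
```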
