What is the recommended way to escape HTML symbols in plain Java? - java

Is there a recommended way to escape <, >, " and & characters when outputting HTML in plain Java code? (Other than manually doing the following, that is).
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = source.replace("&", "&amp;").replace("<", "&lt;"); // ...

StringEscapeUtils from Apache Commons Lang:
import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = escapeHtml(source);
For version 3:
import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
// ...
String escaped = escapeHtml4(source);

An alternative to Apache Commons: Use Spring's HtmlUtils.htmlEscape(String input) method.

Nice short method:
public static String escapeHTML(String s) {
StringBuilder out = new StringBuilder(Math.max(16, s.length()));
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c > 127 || c == '"' || c == '\'' || c == '<' || c == '>' || c == '&') {
out.append("&#");
out.append((int) c);
out.append(';');
} else {
out.append(c);
}
}
return out.toString();
}
Based on https://stackoverflow.com/a/8838023/1199155 (the ampersand is missing there). Apart from the apostrophe, the characters checked in the if clause are the only ones below 128 that have named entities, according to http://www.w3.org/TR/html4/sgml/entities.html

There is a newer version of the Apache Commons Lang library and it uses a different package name (org.apache.commons.lang3). The StringEscapeUtils now has different static methods for escaping different types of documents (http://commons.apache.org/proper/commons-lang/javadocs/api-3.0/index.html). So to escape HTML version 4.0 string:
import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
String output = escapeHtml4("The less than sign (<) and ampersand (&) must be escaped before using them in HTML");

For those who use Google Guava:
import com.google.common.html.HtmlEscapers;
[...]
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = HtmlEscapers.htmlEscaper().escape(source);

Be careful with this. There are a number of different 'contexts' within an HTML document: Inside an element, quoted attribute value, unquoted attribute value, URL attribute, javascript, CSS, etc... You'll need to use a different encoding method for each of these to prevent Cross-Site Scripting (XSS). Check the OWASP XSS Prevention Cheat Sheet for details on each of these contexts. You can find escaping methods for each of these contexts in the OWASP ESAPI library -- https://github.com/ESAPI/esapi-java-legacy.
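As a rough illustration of why the context matters, the sketch below (plain Java, no library) keeps element-content escaping and double-quoted-attribute escaping as two separate methods. The class and method names are made up for this example, and it deliberately covers only those two contexts; URL, JavaScript and CSS contexts need their own encoders, such as the ones in ESAPI.

```java
final class ContextEscapers {
    private ContextEscapers() {}

    /** Escapes text that is placed between tags (element content). */
    static String forText(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Escapes text that is placed inside a double-quoted attribute value. */
    static String forDoubleQuotedAttribute(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '"': out.append("&#34;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "a < b & \"c\"";
        System.out.println("<p>" + forText(input) + "</p>");
        System.out.println("<p title=\"" + forDoubleQuotedAttribute(input) + "\"></p>");
    }
}
```

Keeping one method per context makes it harder to reuse the wrong escaper by accident, which is the mistake the cheat sheet warns about.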

On Android (API 16 or greater) you can use:
Html.escapeHtml(textToEscape);
or, for lower API levels:
TextUtils.htmlEncode(textToEscape);

For some purposes, HtmlUtils:
import org.springframework.web.util.HtmlUtils;
[...]
HtmlUtils.htmlEscapeDecimal("&"); //gives &#38;
HtmlUtils.htmlEscape("&"); //gives &amp;

org.apache.commons.lang3.StringEscapeUtils is now deprecated. Use org.apache.commons.text.StringEscapeUtils instead, by adding this dependency:
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
<version>${commons.text.version}</version>
</dependency>

While @dfa's answer of org.apache.commons.lang.StringEscapeUtils.escapeHtml is nice, and I have used it in the past, it should not be used for escaping HTML (or XML) attributes; otherwise the whitespace will be normalized (meaning all adjacent whitespace characters become a single space).
I know this because I have had bugs filed against my library (JATL) for attributes where whitespace was not preserved. Thus I have a drop-in (copy and paste) class (some of which I took from JDOM) that differentiates between the escaping of attributes and of element content.
While proper attribute escaping may not have mattered as much in the past, it is becoming increasingly important given the use of HTML5's data- attributes.

Java 8+ Solution:
public static String escapeHTML(String str) {
return str.chars().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
"&#" + c + ";" : String.valueOf((char) c)).collect(Collectors.joining());
}
String#chars returns an IntStream of the char values from the String. We can then use mapToObj to escape the characters with a character code greater than 127 (non-ASCII characters) as well as the double quote ("), single quote ('), left angle bracket (<), right angle bracket (>), and ampersand (&). Collectors.joining concatenates the Strings back together.
To better handle Unicode characters, String#codePoints can be used instead.
public static String escapeHTML(String str) {
return str.codePoints().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
"&#" + c + ";" : new String(Character.toChars(c)))
.collect(Collectors.joining());
}
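To make the difference between the two variants concrete, here is a small side-by-side sketch. The class and method names are invented for this comparison; the method bodies are the two snippets above. For a supplementary character such as U+1F495, the chars() version emits one numeric reference per surrogate, which is not a valid character reference in HTML, while the codePoints() version emits a single reference for the whole code point.

```java
import java.util.stream.Collectors;

final class EscapeStreams {
    private EscapeStreams() {}

    // Works on UTF-16 code units: a supplementary character becomes two references.
    static String escapeByChar(String str) {
        return str.chars()
                .mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1
                        ? "&#" + c + ";" : String.valueOf((char) c))
                .collect(Collectors.joining());
    }

    // Works on code points: a supplementary character becomes one reference.
    static String escapeByCodePoint(String str) {
        return str.codePoints()
                .mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1
                        ? "&#" + c + ";" : new String(Character.toChars(c)))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        String heart = new String(Character.toChars(0x1F495));
        System.out.println(escapeByChar(heart));      // &#55357;&#56469;
        System.out.println(escapeByCodePoint(heart)); // &#128149;
    }
}
```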

Most libraries escape everything they can, including hundreds of symbols and thousands of non-ASCII characters, which is not what you want in a UTF-8 world.
Also, as Jeff Williams noted, there's no single “escape HTML” option; there are several contexts.
Assuming you never use unquoted attributes, and keeping in mind that different contexts exist, I've written my own version:
private static final long TEXT_ESCAPE =
1L << '&' | 1L << '<';
private static final long DOUBLE_QUOTED_ATTR_ESCAPE =
TEXT_ESCAPE | 1L << '"';
private static final long SINGLE_QUOTED_ATTR_ESCAPE =
TEXT_ESCAPE | 1L << '\'';
private static final long ESCAPES =
DOUBLE_QUOTED_ATTR_ESCAPE | SINGLE_QUOTED_ATTR_ESCAPE;
// 'quot' and 'apos' are 1 char longer than '#34' and '#39'
// which I've decided to use
private static final String REPLACEMENTS = "&#34;&amp;&#39;&lt;";
private static final int REPL_SLICES = /* [0, 5, 10, 15, 19) */
5<<5 | 10<<10 | 15<<15 | 19<<20;
// These 5-bit numbers packed into a single int
// are indices within REPLACEMENTS which is a 'flat' String[]
private static void appendEscaped(
Appendable builder, CharSequence content, long escapes) {
try {
int startIdx = 0, len = content.length();
for (int i = 0; i < len; i++) {
char c = content.charAt(i);
long one;
if (((c & 63) == c) && ((one = 1L << c) & escapes) != 0) {
// -^^^^^^^^^^^^^^^ -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
// | | take only dangerous characters
// | java shifts longs by 6 least significant bits,
// | e. g. << 0b110111111 is same as << 0b111111.
// | Filter out bigger characters
int index = Long.bitCount(ESCAPES & (one - 1));
builder.append(content, startIdx, i /* exclusive */).append(
REPLACEMENTS,
REPL_SLICES >>> (5 * index) & 31,
REPL_SLICES >>> (5 * (index + 1)) & 31
);
startIdx = i + 1;
}
}
builder.append(content, startIdx, len);
} catch (IOException e) {
// typically, our Appendable is StringBuilder which does not throw;
// also, there's no way to declare 'if A#append() throws E,
// then appendEscaped() throws E, too'
throw new UncheckedIOException(e);
}
}
Consider copy-pasting from Gist without line length limit.
UPD: As another answer suggests, > escaping is not necessary; also, " within attr='…' is allowed, too. I've updated the code accordingly.
You may check it out yourself:
<!DOCTYPE html>
<html lang="en">
<head><title>Test</title></head>
<body>
<p title="&lt;&#34;I'm double-quoted!&#34;>">&lt;"Hello!"></p>
<p title='&lt;"I&#39;m single-quoted!">'>&lt;"Goodbye!"></p>
</body>
</html>
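The index lookup inside appendEscaped can be checked in isolation. The sketch below is a simplified stand-in, not the author's code: it uses a plain int[] for the slice boundaries instead of the packed int, and the class and method names are invented. It shows how counting the mask bits below a character's own bit gives that character's position in REPLACEMENTS.

```java
final class BitmaskIndexDemo {
    private BitmaskIndexDemo() {}

    // One bit per escaped character; all four have code points below 64.
    static final long ESCAPES = 1L << '"' | 1L << '&' | 1L << '\'' | 1L << '<';
    // Replacements in ascending order of the escaped character's code.
    static final String REPLACEMENTS = "&#34;&amp;&#39;&lt;";
    // Slice boundaries inside REPLACEMENTS (unpacked here for readability).
    static final int[] SLICES = {0, 5, 10, 15, 19};

    /** Returns the replacement for c, or null if c does not need escaping. */
    static String replacementFor(char c) {
        if (c >= 64 || (ESCAPES & (1L << c)) == 0) {
            return null;
        }
        // Number of escaped characters with a smaller code than c.
        int index = Long.bitCount(ESCAPES & ((1L << c) - 1));
        return REPLACEMENTS.substring(SLICES[index], SLICES[index + 1]);
    }

    public static void main(String[] args) {
        for (char c : new char[] {'"', '&', '\'', '<', '>', 'a'}) {
            System.out.println(c + " -> " + replacementFor(c));
        }
    }
}
```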

Related

Check if string contains only Unicode values [\u0030-\u0039] or [\u0660-\u0669]

I need to check, in java, if a string is composed only of Unicode values [\u0030-\u0039] or [\u0660-\u0669]. What is the most efficient way of doing this?
Use \x for unicode characters:
^([\x{0030}-\x{0039}\x{0660}-\x{0669}]+)$
If the pattern should match an empty string too, use * instead of +.
Use this if you don't want to allow mixing characters from the two sets you provided:
^([\x{0030}-\x{0039}]+|[\x{0660}-\x{0669}]+)$
https://regex101.com/r/xqWL4q/6
As mentioned by Holger in the comments below, \x{0030}-\x{0039} is equivalent to 0-9, so it could be substituted to make the pattern more readable.
As said here, it’s not clear whether you want to check for probably mixed occurrences of these digits or check for either of these ranges.
A simple check for mixed digits would be string.matches("[0-9٠-٩]*") or, to avoid confusing changes of the read/write direction, or if your source code encoding doesn’t support all characters, string.matches("[0-9\u0660-\u0669]*").
Checking whether the string matches either range, can be done using
string.matches("[0-9]*")||string.matches("[٠-٩]*") or
string.matches("[0-9]*")||string.matches("[\u0660-\u0669]*").
An alternative would be
string.chars().allMatch(c -> c >= '0' && c <= '9' || c >= '٠' && c <= '٩').
Or to check for either, string.chars().allMatch(c -> c >= '0' && c <= '9') || string.chars().allMatch(c -> c >= '٠' && c <= '٩')
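For repeated checks it may be worth compiling the patterns once. The following is a minimal sketch of the two variants (mixed digits versus a single range); the class and method names are placeholders chosen for this example.

```java
import java.util.regex.Pattern;

final class DigitRanges {
    private DigitRanges() {}

    // Any mix of ASCII digits and Arabic-Indic digits; the empty string is allowed.
    private static final Pattern MIXED =
            Pattern.compile("[0-9\\u0660-\\u0669]*");
    // Only ASCII digits, or only Arabic-Indic digits.
    private static final Pattern SINGLE_RANGE =
            Pattern.compile("[0-9]*|[\\u0660-\\u0669]*");

    static boolean isMixedDigits(String s) {
        return MIXED.matcher(s).matches();
    }

    static boolean isSingleRange(String s) {
        return SINGLE_RANGE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String mixed = "12\u0663";
        System.out.println(isMixedDigits(mixed)); // true
        System.out.println(isSingleRange(mixed)); // false
    }
}
```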
Since these code points represent numerals in two different Unicode blocks,
I suggest checking whether each character is a numeral:
boolean isNumerals(String s) {
return s.chars().allMatch(Character::isDigit);
}
This will definitely match more than asked for, but in some cases or in more controlled environment it may be useful to make code more readable.
(edit)
Java API also allows to determine a unicode block of a specific character:
Character.UnicodeBlock arabic = Character.UnicodeBlock.ARABIC;
Character.UnicodeBlock latin = Character.UnicodeBlock.BASIC_LATIN;
boolean isValidBlock(String s) {
return s.chars().allMatch(v ->
Character.UnicodeBlock.of(v).equals(arabic) ||
Character.UnicodeBlock.of(v).equals(latin)
);
}
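Combining this with the digit check above narrows the match to digits in those two blocks (for the Arabic block that also includes the extended Arabic-Indic digits U+06F0 to U+06F9).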
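A minimal sketch of that combination, with invented class and method names, could look like this. It iterates over code points rather than chars so that the block lookup also behaves for supplementary characters.

```java
final class DigitBlockCheck {
    private DigitBlockCheck() {}

    // True for digits (Unicode category Nd) in the Basic Latin or Arabic block.
    // Note: the Arabic block also holds U+06F0..U+06F9, so those pass as well.
    static boolean isDigitInAllowedBlock(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return Character.isDigit(codePoint)
                && (block == Character.UnicodeBlock.BASIC_LATIN
                        || block == Character.UnicodeBlock.ARABIC);
    }

    static boolean isValid(String s) {
        return s.codePoints().allMatch(DigitBlockCheck::isDigitInAllowedBlock);
    }

    public static void main(String[] args) {
        System.out.println(isValid("12\u0663")); // true
        System.out.println(isValid("1a"));       // false
    }
}
```

If exactly the two ranges from the question are required, the explicit range checks from the earlier answers are the safer choice.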
On the plus side - higher abstraction gives more flexibility, makes code more readable and is not dependent on exact encoding of string passed.
A simple solution using a regex
(explained much better by @Predicate at https://stackoverflow.com/a/60597367/12558456):
private boolean legalRegex(String s) {
return s.matches("^([\u0030-\u0039]|[\u0660-\u0669])*$");
}
A faster but uglier solution (it needs a HashSet of allowed chars, here called allowedCharacters):
private boolean legalCharactersOnly(String s) {
for (char c:s.toCharArray()) {
if (!allowedCharacters.contains(c)) {
return false;
}
}
return true;
}
Here is a solution which works without regex for arbitrary unicode code points (outside of the Basic Multilingual Plane).
private final Set<Integer> codePoints = new HashSet<Integer>();
public boolean test(String string) {
for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
codePoint = string.codePointAt(i);
if (!codePoints.contains(codePoint)) {
return false;
}
}
return true;
}

Replace "<" and ">" with "&lt;" and "&gt;"

I want to replace < and > with &lt; and &gt; if they are not part of an HTML tag.
Input will be a string that may contain certain html tags. It can also contain less than & greater than signs (">" "<").
For example:
String example1 = "-> <b> Bold </b> <-";
String example2 = "< <i> Italic </i> >";
String example3 = "<i>foo >> </i>";
As output I want to get:
String output1 = "-&gt; <b> Bold </b> &lt;-";
String output2 = "&lt; <i> Italic </i> &gt;";
String output3 = "<i>foo &gt;&gt; </i>";
So replaceAll doesn't work, I have to use a regular expression match I guess. Any ideas? Some other way?
Note1: 3rd party library is not an option because of certain project requirements.
Note2: We support only a subset of HTML tags(text styling tags: italic, underline, bold etc.)
This is a non-trivial task. HTML is not a regular language (perhaps it is irregular?), so you cannot parse it using regular expressions. I suggest the following:
Option 1
Use this if you do not need to preserve whitespace.
1. Remove all whitespace from the input.
2. Split the input into tokens using "<" and ">" as the separators; preserve the separators.
3. Process each token as follows:
   - If the token is not a supported HTML tag and contains a "<", convert the "<" as desired.
   - If the token is not a supported HTML tag and contains a ">", convert the ">" as desired.
   - Pass HTML tags through unchanged.
Option 2
Process each input line using multi-character look-ahead. The characters to convert are {">", "<"}.
For each character in the input:
1. Is the character one of the characters to convert?
   - If no, advance to the next character.
   - If yes, look ahead to determine whether this is the start of a supported HTML tag (this is the tricky part).
2. If it is not part of a supported HTML tag, convert the character.
3. If it is part of a supported HTML tag, advance to the character following the HTML tag.
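Option 2 could be sketched roughly as follows, using only the standard library. The whitelist (b, i and u tags without attributes) and the class and method names are assumptions made for this example; the look-ahead is done with a Matcher anchored at the current position.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagAwareEscaper {
    private TagAwareEscaper() {}

    // Hypothetical whitelist: opening and closing b, i, u tags without attributes.
    private static final Pattern SUPPORTED_TAG = Pattern.compile("</?(?:b|i|u)>");

    static String escapeOutsideTags(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        Matcher tag = SUPPORTED_TAG.matcher(input);
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '<' && tag.region(i, input.length()).lookingAt()) {
                // A supported tag starts here: copy it and jump past it.
                out.append(tag.group());
                i = tag.end();
            } else if (c == '<') {
                out.append("&lt;");
                i++;
            } else if (c == '>') {
                out.append("&gt;");
                i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeOutsideTags("-> <b> Bold </b> <-"));
        System.out.println(escapeOutsideTags("< <i> Italic </i> >"));
        System.out.println(escapeOutsideTags("<i>foo >> </i>"));
    }
}
```

Because a recognized tag is consumed as a whole, its closing ">" never reaches the branch that converts stray ">" characters.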
If you only support five HTML tags, you could first remove those tags from the text,
replace < and > with &lt; and &gt;, and then add the HTML tags back. You remove <b> from the text by replacing it with, for instance, [b]. Do the same with the other tags.
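A sketch of this placeholder idea is shown below. Instead of [b], it uses characters from the Unicode private-use area as placeholders, on the assumption that they never occur in the input; the tag list and all names are illustrative.

```java
final class PlaceholderEscaper {
    private PlaceholderEscaper() {}

    // Hypothetical list of supported tags.
    private static final String[] TAGS = {"<b>", "</b>", "<i>", "</i>", "<u>", "</u>"};

    // Placeholder for the k-th tag: a private-use character (assumed absent from input).
    private static String placeholder(int k) {
        return String.valueOf((char) (0xE000 + k));
    }

    static String escape(String input) {
        String s = input;
        // 1. Hide the supported tags behind placeholders.
        for (int k = 0; k < TAGS.length; k++) {
            s = s.replace(TAGS[k], placeholder(k));
        }
        // 2. Everything that is left is a stray angle bracket.
        s = s.replace("<", "&lt;").replace(">", "&gt;");
        // 3. Put the tags back.
        for (int k = 0; k < TAGS.length; k++) {
            s = s.replace(placeholder(k), TAGS[k]);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(escape("-> <b> Bold </b> <-"));
    }
}
```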
If you can't be bothered to use an external library then you would need to make an array with all the html tags and run it against the string.
I don't really recommend it because there are libraries for that...
Assuming arbitrary HTML files, you have to isolate text nodes and run replace on those.
If you can't use existing libraries, I'd just write my own.
(JSoup can do this but it's an 'external library' -- that is, not included in the Java SE standard, but just re-implementing it is an option.)
Assuming that the strings contain valid HTML tags, the following method could be used to parse the strings and achieve the result you are looking for:
private static String parse(String str)
{
StringBuilder sBuilder = new StringBuilder();
for (int i = 0 ; i < str.length() ; i++)
{
char ch = str.charAt(i);
if (ch == '>' && i != 0)
{
char c = str.charAt( i - 1);
if (Character.isWhitespace(c) || !Character.isLetter(c))
{
sBuilder.append("&gt;");
}
else
sBuilder.append(ch);
}
else if (ch == '>' && i==0)
{
sBuilder.append("&gt;");
}
else if (ch == '<' && i < str.length() - 1)
{
char c = str.charAt( i + 1);
if (!(c=='/' || Character.isLetter(c)))
{
sBuilder.append("&lt;");
}
else
sBuilder.append(ch);
}
else if (ch == '<' && i == str.length() - 1)
{
sBuilder.append("&lt;");
}
else
{
sBuilder.append(ch);
}
}
return sBuilder.toString();
}

Creating Unicode character from its number

I want to display a Unicode character in Java. If I do this, it works just fine:
String symbol = "\u2202";
symbol is equal to "∂". That's what I want.
The problem is that I know the Unicode number and need to create the Unicode symbol from that. I tried (to me) the obvious thing:
int c = 2202;
String symbol = "\\u" + c;
However, in this case, symbol is equal to "\u2202". That's not what I want.
How can I construct the symbol if I know its Unicode number (but only at run-time---I can't hard-code it in like the first example)?
If you want to get a UTF-16 encoded code unit as a char, you can parse the integer and cast to it as others have suggested.
If you want to support all code points, use Character.toChars(int). This will handle cases where code points cannot fit in a single char value.
Doc says:
Converts the specified character (Unicode code point) to its UTF-16 representation stored in a char array. If the specified code point is a BMP (Basic Multilingual Plane or Plane 0) value, the resulting char array has the same value as codePoint. If the specified code point is a supplementary code point, the resulting char array has the corresponding surrogate pair.
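As a short illustration of that behaviour, the sketch below wraps Character.toChars in a helper (the class and method names are made up here) and shows that a BMP code point yields a one-char string while a supplementary code point yields a surrogate pair.

```java
final class CodePointToString {
    private CodePointToString() {}

    // Builds a String from a single code point, BMP or supplementary.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String partial = fromCodePoint(0x2202);  // the symbol from the question
        String heart = fromCodePoint(0x1F495);   // a supplementary code point
        System.out.println(partial + " has length " + partial.length()); // length 1
        System.out.println(heart + " has length " + heart.length());     // length 2
    }
}
```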
Just cast your int to a char. You can convert that to a String using Character.toString():
String s = Character.toString((char)c);
EDIT:
Just remember that the escape sequences in Java source code (the \u bits) are in HEX, so if you're trying to reproduce an escape sequence, you'll need something like int c = 0x2202.
The other answers here either only support unicode up to U+FFFF (the answers dealing with just one instance of char) or don't tell how to get to the actual symbol (the answers stopping at Character.toChars() or using incorrect method after that), so adding my answer here, too.
To support supplementary code points also, this is what needs to be done:
// this character:
// http://www.isthisthingon.org/unicode/index.php?page=1F&subpage=4&glyph=1F495
// using code points here, not U+n notation
// for equivalence with U+n, below would be 0xnnnn
int codePoint = 128149;
// converting to char[] pair
char[] charPair = Character.toChars(codePoint);
// and to String, containing the character we want
String symbol = new String(charPair);
// we now have symbol with the desired character as the first item
// confirm that we indeed have character with code point 128149
System.out.println("First code point: " + symbol.codePointAt(0));
I also did a quick test as to which conversion methods work and which don't
int codePoint = 128149;
char[] charPair = Character.toChars(codePoint);
System.out.println(new String(charPair, 0, 2).codePointAt(0)); // 128149, worked
System.out.println(charPair.toString().codePointAt(0)); // 91, didn't work
System.out.println(new String(charPair).codePointAt(0)); // 128149, worked
System.out.println(String.valueOf(codePoint).codePointAt(0)); // 49, didn't work
System.out.println(new String(new int[] {codePoint}, 0, 1).codePointAt(0));
// 128149, worked
--
Note: as @Axel mentioned in the comments, with Java 11 there is Character.toString(int codePoint), which would arguably be best suited for the job.
This one worked fine for me.
String cc2 = "2202";
String text2 = String.valueOf(Character.toChars(Integer.parseInt(cc2, 16)));
Now text2 will have ∂.
Remember that char is an integral type, and thus can be given an integer value, as well as a char constant.
char c = 0x2202;//aka 8706 in decimal. \u codepoints are in hex.
String s = String.valueOf(c);
String st = "2202";
int cp = Integer.parseInt(st, 16); // parses st as a hexadecimal number
char[] c = Character.toChars(cp);
System.out.println(c); // prints the character corresponding to '\u2202'
Although this is an old question, there is a very easy way to do this in Java 11 which was released today: you can use a new overload of Character.toString():
public static String toString(int codePoint)
Returns a String object representing the specified character (Unicode code point). The result is a string of length 1 or 2, consisting solely of the specified codePoint.
Parameters:
codePoint - the codePoint to be converted
Returns:
the string representation of the specified codePoint
Throws:
IllegalArgumentException - if the specified codePoint is not a valid Unicode code point.
Since:
11
Since this method supports any Unicode code point, the length of the returned String is not necessarily 1.
The code needed for the example given in the question is simply:
int codePoint = '\u2202';
String s = Character.toString(codePoint); // <<< Requires JDK 11 !!!
System.out.println(s); // Prints ∂
This approach offers several advantages:
It works for any Unicode code point rather than just those that can be handled using a char.
It's concise, and it's easy to understand what the code is doing.
It returns the value as a string rather than a char[], which is often what you want. The answer posted by McDowell is appropriate if you want the code point returned as char[].
This is how you do it:
String cc = "2202"; // the code point as a hexadecimal string
char ccc = (char) Integer.parseInt(cc, 16);
final String text = String.valueOf(ccc);
This solution is by Arne Vajhøj.
The code below writes the 4 Unicode characters (given as decimal numbers) for the word "be" in Japanese. Yes, the verb "be" in Japanese has 4 chars!
The character values are in decimal and have been read into a String[] (using split, for instance). If you have octal or hex values, parseInt takes a radix as well.
// pseudo code
// 1. init the String[] containing the 4 code points in decimal :: intsInStrs
// 2. allocate room for up to two chars per code point :: c2s
// 3. using Integer.parseInt (with or without a radix), get the int value
// 4. write it at the next free position in the char array
// 5. convert the filled part of c2s to a String
// 6. print
String[] intsInStrs = {"12354", "12426", "12414", "12377"}; // 1.
char[] c2s = new char[intsInStrs.length * 2]; // 2. a code point needs at most two chars
int len = 0; // number of chars written so far
for (String intString : intsInStrs) {
// 3 + 4. toChars returns how many chars it wrote (1 for BMP code points, 2 otherwise)
len += Character.toChars(Integer.parseInt(intString), c2s, len);
}
String symbols = new String(c2s, 0, len); // 5.
System.out.println("\nLooooonger code point: " + symbols); // 6.
// I tested it in Eclipse and Java 7 and it works. Enjoy
Here is a block that prints the Unicode chars from \u00c0 to \u00ff:
char[] ca = {'\u00c0'};
for (int i = 0; i < 4; i++) {
for (int j = 0; j < 16; j++) {
String sc = new String(ca);
System.out.print(sc + " ");
ca[0]++;
}
System.out.println();
}
Unfortunately, removing one backslash, as suggested in the first comment (newbiedoodle), does not lead to a good result; most (if not all) IDEs report a syntax error. The reason is that the Java escaped Unicode format expects the syntax "\uXXXX", where XXXX are 4 mandatory hexadecimal digits, so attempts to assemble this string from pieces fail. Of course, "\u" is not the same as "\\u": the first means an escaped 'u', the second means an escaped backslash (which is a backslash) followed by 'u'. Strangely, the Apache pages present a utility that does exactly this, but in reality it only mimics the escape format. Apache has some utilities of its own (I didn't test them) that do this work for you, although that may still not be what you want (see Apache Escape Unicode utilities). That utility does, however, take a good approach to the solution. Combining it with the approach described above (MeraNaamJoker), my solution is to create this mimicked escape string and then convert it back to Unicode (to avoid the restriction on real escaped Unicode). I used it for copying text, so it is possible that in the uencodeStr method it would be better to use '\\u' instead of '\\\\u'. Try it.
/**
* Converts character to the mimic unicode format i.e. '\\u0020'.
*
* This format is the Java source code format.
*
* CharUtils.unicodeEscaped(' ') = "\\u0020"
* CharUtils.unicodeEscaped('A') = "\\u0041"
*
* @param ch the character to convert
* @return the character in the mimic escaped unicode format
*/
public static String unicodeEscaped(char ch) {
String returnStr;
//String uniTemplate = "\u0000";
final String charEsc = "\\u"; // local variables cannot be declared static
if (ch < 0x10) {
returnStr = "000" + Integer.toHexString(ch);
}
else if (ch < 0x100) {
returnStr = "00" + Integer.toHexString(ch);
}
else if (ch < 0x1000) {
returnStr = "0" + Integer.toHexString(ch);
}
else
returnStr = "" + Integer.toHexString(ch);
return charEsc + returnStr;
}
/**
* Converts the string from UTF8 to mimic unicode format i.e. '\\u0020'.
* notice: i cannot use real unicode format, because this is immediately translated
* to the character in time of compiling and editor (i.e. netbeans) checking it
* instead reaal unicode format i.e. '\u0020' i using mimic unicode format '\\u0020'
* as a string, but it doesn't gives the same results, of course
*
* This format is the Java source code format.
*
* CharUtils.unicodeEscaped(' ') = "\\u0020"
* CharUtils.unicodeEscaped('A') = "\\u0041"
*
* @param nationalString the UTF8 string to convert
* @return the string in JAVA unicode mimic escaped format
*/
public String encodeStr(String nationalString) throws UnsupportedEncodingException {
String convertedString = "";
for (int i = 0; i < nationalString.length(); i++) {
Character chs = nationalString.charAt(i);
convertedString += unicodeEscaped(chs);
}
return convertedString;
}
/**
* Converts the string from mimic unicode format i.e. '\\u0020' back to UTF8.
*
* This format is the Java source code format.
*
* CharUtils.unicodeEscaped(' ') = "\\u0020"
* CharUtils.unicodeEscaped('A') = "\\u0041"
*
* @param escapedString the string in JAVA unicode mimic escaped format
* @return the string converted back to UTF8
*/
public String uencodeStr(String escapedString) throws UnsupportedEncodingException {
String convertedString = "";
String[] arrStr = escapedString.split("\\\\u");
String str, istr;
for (int i = 1; i < arrStr.length; i++) {
str = arrStr[i];
if (!str.isEmpty()) {
Integer iI = Integer.parseInt(str, 16);
char[] chaCha = Character.toChars(iI);
convertedString += String.valueOf(chaCha);
}
}
return convertedString;
}
char c=(char)0x2202;
String s=""+c;
(This answer is for .NET 4.5; a similar approach must exist in Java.)
I am from West Bengal in India.
As I understand it, your problem is this:
you want to produce something like 'অ' (a letter in the Bengali language),
which has the Unicode hex value 0x0985.
If you know this value for your language, how do you produce the corresponding language-specific Unicode symbol?
In .NET it is as simple as this:
int c = 0X0985;
string x = Char.ConvertFromUtf32(c);
Now x is your answer.
But this is a hex-by-hex conversion; sentence-to-sentence conversion is work for researchers :P

How to classify Japanese characters as either kanji or kana?

Given the text below, how can I classify each character as kana or kanji?
誰か確認上記これらのフ
To get some thing like this
誰 - kanji
か - kana
確 - kanji
認 - kanji
上 - kanji
記 - kanji
こ - kana
れ - kana
ら - kana
の - kana
フ - kana
(Sorry if I did it incorrectly.)
This functionality is built into the Character.UnicodeBlock class. Some examples of the Unicode blocks related to the Japanese language:
Character.UnicodeBlock.of('誰') == CJK_UNIFIED_IDEOGRAPHS
Character.UnicodeBlock.of('か') == HIRAGANA
Character.UnicodeBlock.of('フ') == KATAKANA
Character.UnicodeBlock.of('ﾌ') == HALFWIDTH_AND_FULLWIDTH_FORMS
Character.UnicodeBlock.of('！') == HALFWIDTH_AND_FULLWIDTH_FORMS
Character.UnicodeBlock.of('。') == CJK_SYMBOLS_AND_PUNCTUATION
But, as always, the devil is in the details:
Character.UnicodeBlock.of('Ａ') == HALFWIDTH_AND_FULLWIDTH_FORMS
where Ａ is the full-width character. So this is in the same category as the half-width katakana ﾌ above. Note that the full-width Ａ is different from the normal (half-width) A:
Character.UnicodeBlock.of('A') == BASIC_LATIN
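Building on these blocks, a classifier for the question's sample text might look like the sketch below. The class name, method name and the three labels are assumptions for this example, and only three blocks are considered: half-width katakana and the CJK extension blocks fall into "other" unless you add them. The sample text is written with Unicode escapes so the file compiles regardless of source encoding.

```java
final class JapaneseClassifier {
    private JapaneseClassifier() {}

    // Maps a code point to "kana", "kanji" or "other" based on its Unicode block.
    static String classify(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        if (block == Character.UnicodeBlock.HIRAGANA
                || block == Character.UnicodeBlock.KATAKANA) {
            return "kana";
        }
        if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
            return "kanji";
        }
        return "other";
    }

    public static void main(String[] args) {
        // The sample text from the question.
        String text = "\u8AB0\u304B\u78BA\u8A8D\u4E0A\u8A18\u3053\u308C\u3089\u306E\u30D5";
        text.codePoints().forEach(cp ->
                System.out.println(new String(Character.toChars(cp)) + " - " + classify(cp)));
    }
}
```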
Use a table like this one to determine which Unicode values are used for kana and kanji; then you can simply cast the character to an int and check where it belongs, something like
int val = (int) 'て';
if (val >= 0x3040 && val <= 0x309f)
return HIRAGANA;
..
This seems like it'd be an interesting use for Guava's CharMatcher class. Using the tables linked in Jack's answer, I created this:
public class JapaneseCharMatchers {
public static final CharMatcher HIRAGANA =
CharMatcher.inRange((char) 0x3040, (char) 0x309f);
public static final CharMatcher KATAKANA =
CharMatcher.inRange((char) 0x30a0, (char) 0x30ff);
public static final CharMatcher KANA = HIRAGANA.or(KATAKANA);
public static final CharMatcher KANJI =
CharMatcher.inRange((char) 0x4e00, (char) 0x9faf);
public static void main(String[] args) {
test("誰か確認上記これらのフ");
}
private static void test(String string) {
System.out.println(string);
System.out.println("Hiragana: " + HIRAGANA.retainFrom(string));
System.out.println("Katakana: " + KATAKANA.retainFrom(string));
System.out.println("Kana: " + KANA.retainFrom(string));
System.out.println("Kanji: " + KANJI.retainFrom(string));
}
}
Running this prints the expected:
誰か確認上記これらのフ
Hiragana: かこれらの
Katakana: フ
Kana: かこれらのフ
Kanji: 誰確認上記
This gives you a lot of power for working with Japanese text by defining the rules for determining if a character is in one of these groups in an object that can not only do a lot of useful things itself, but can also be used with other APIs such as Guava's Splitter class.
Edit:
Based on jleedev's answer, you could also write a method like:
public static CharMatcher inUnicodeBlock(final Character.UnicodeBlock block) {
return new CharMatcher() {
public boolean matches(char c) {
return Character.UnicodeBlock.of(c) == block;
}
};
}
and use it like:
CharMatcher HIRAGANA = inUnicodeBlock(Character.UnicodeBlock.HIRAGANA);
I think this might be a bit slower than the other version though.
You need to get a reference that gives the separate ranges for kana and kanji characters. From what I've seen, alphabets and equivalents typically get a block of characters.
I know you didn't ask for VBA, but here is the VBA flavor for those who want to know:
Here's a function that will do it. It will break down the sentence like you have above into a single cell. You might need to add some error checking for how you want to deal with line breaks or English characters, etc. but this should be a good start.
Function KanjiKanaBreakdown(ByVal text As String) As String
Application.ScreenUpdating = False
Dim kanjiCode As Long
Dim result As String
Dim i As Long
For i = 1 To Len(text)
If Asc(Mid$(text, i, 1)) > -30562 And Asc(Mid$(text, i, 1)) < -950 Then
result = (result & (Mid$(text, i, 1)) & (" - kanji") & vbLf)
Else
result = (result & (Mid$(text, i, 1)) & (" - kana") & vbLf)
End If
Next
KanjiKanaBreakdown = result
Application.ScreenUpdating = True
End Function

Converting a char to uppercase

String lower = Name.toLowerCase();
int a = Name.indexOf(" ",0);
String first = lower.substring(0, a);
String last = lower.substring(a+1);
char f = first.charAt(0);
char l = last.charAt(0);
System.out.println(l);
How would I get the f and l variables converted to uppercase?
You can use Character#toUpperCase() for this.
char fUpper = Character.toUpperCase(f);
char lUpper = Character.toUpperCase(l);
It has however some limitations since the world is aware of many more characters than can ever fit in 16bit char range. See also the following excerpt of the javadoc:
Note: This method cannot handle supplementary characters. To support all Unicode characters, including supplementary characters, use the toUpperCase(int) method.
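A minimal sketch of the int overload in use is given below; the helper name upperFirst is invented for this example. It upper-cases the first code point of a string, so it also works when that character lies outside the 16-bit char range (the Deseret small letter U+10428 maps to U+10400).

```java
final class UpperCaseCodePoints {
    private UpperCaseCodePoints() {}

    // Upper-cases the first code point of s and leaves the rest unchanged.
    static String upperFirst(String s) {
        if (s.isEmpty()) {
            return s;
        }
        int first = s.codePointAt(0);
        int upper = Character.toUpperCase(first); // the int overload
        return new StringBuilder(s.length())
                .appendCodePoint(upper)
                .append(s, Character.charCount(first), s.length())
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(upperFirst("foo")); // Foo
        String deseret = new String(Character.toChars(0x10428)) + "x";
        System.out.println(Integer.toHexString(upperFirst(deseret).codePointAt(0))); // 10400
    }
}
```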
Instead of using existing utilities, you may try the conversion below, which uses bitwise operations:
To upper case:
char upperChar = 'l' & 0x5f;
To lower case:
char lowerChar = 'L' ^ 0x20;
How it works:
Binary, hex and decimal table:
-----------------------------------
| Binary  | Hexadecimal | Decimal |
-----------------------------------
| 1011111 | 0x5f        | 95      |
-----------------------------------
| 0100000 | 0x20        | 32      |
-----------------------------------
Let's take the example of converting lowercase l to L:
The binary AND operation: (l & 0x5f)
The character l has ASCII code 108, which is 1101100 in binary.
  1101100
& 1011111
-----------
  1001100 = 76 in decimal, which is the ASCII code of L
Similarly, the L to l conversion:
The binary XOR operation: (L ^ 0x20)
  1001100
^ 0100000
-----------
  1101100 = 108 in decimal, which is the ASCII code of l
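These masks only make sense for ASCII letters: applied to the digit '1' (0x31), the AND with 0x5f yields 0x11, a control character. A guarded version might look like the sketch below; the class and method names are placeholders, and it uses OR with 0x20 for lowering so that the result does not depend on the input's current case.

```java
final class AsciiCase {
    private AsciiCase() {}

    // Clears bit 5 (0x20) for 'a'..'z'; every other char is returned unchanged.
    static char toUpperAscii(char c) {
        return (c >= 'a' && c <= 'z') ? (char) (c & 0x5f) : c;
    }

    // Sets bit 5 (0x20) for 'A'..'Z'; every other char is returned unchanged.
    static char toLowerAscii(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c | 0x20) : c;
    }

    public static void main(String[] args) {
        System.out.println(toUpperAscii('l')); // L
        System.out.println(toLowerAscii('L')); // l
        System.out.println(toUpperAscii('1')); // 1 (unchanged thanks to the guard)
    }
}
```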
Have a look at the java.lang.Character class, it provides a lot of useful methods to convert or test chars.
f = Character.toUpperCase(f);
l = Character.toUpperCase(l);
Since you know the chars are lower case, you can subtract the offset between the lowercase and uppercase ASCII letters (32) to make them uppercase:
char a = 'a';
a -= 32;
System.out.println("a is " + a); //a is A
Here is an ASCII table for reference
System.out.println(first.substring(0,1).toUpperCase());
System.out.println(last.substring(0,1).toUpperCase());
If you are including the Apache Commons Lang jar in your project, then the easiest solution would be to do:
WordUtils.capitalize(Name)
takes care of all the dirty work for you.
See the javadoc here
Alternatively, you also have a capitalizeFully(String) method which also lower cases the rest of the characters.
You can call .toUpperCase() directly on String variables, or on the text you pass to text fields. Note that it returns a new String rather than changing the original. Ex:
String str;
TextView txt;
str = str.toUpperCase(); // str now refers to an all-upper-case copy, OR
txt.append(str.toUpperCase());
txt.setText(str.toUpperCase());
Let's assume you have a variable whose first character you want to convert:
String name = "Your name variable";
char nameChar = Character.toUpperCase(name.charAt(0));
I think you are trying to capitalize the first and last character of each word in a sentence, with space as the delimiter.
This can be done with a StringBuffer:
public static String toFirstLastCharUpperAll(String string){
StringBuffer sb=new StringBuffer(string);
for(int i=0;i<sb.length();i++)
if(i==0 || sb.charAt(i-1)==' ' //for first character of string/each word
|| i==sb.length()-1 || sb.charAt(i+1)==' ') //for last character of string/each word
sb.setCharAt(i, Character.toUpperCase(sb.charAt(i)));
return sb.toString();
}
The easiest solution for your case is to change the first line so that it does just the opposite:
String lower = Name.toUpperCase();
Of course, it's worth changing the variable's name too.
