How to classify Japanese characters as either kanji or kana?

Given the text below, how can I classify each character as kana or kanji?
誰か確認上記これらのフ
to get something like this:
誰 - kanji
か - kana
確 - kanji
認 - kanji
上 - kanji
記 - kanji
こ - kana
れ - kana
ら - kana
の - kana
フ - kana
(Sorry if I did it incorrectly.)

This functionality is built into the Character.UnicodeBlock class. Some examples of the Unicode blocks related to the Japanese language:
Character.UnicodeBlock.of('誰') == CJK_UNIFIED_IDEOGRAPHS
Character.UnicodeBlock.of('か') == HIRAGANA
Character.UnicodeBlock.of('フ') == KATAKANA
Character.UnicodeBlock.of('ﾌ') == HALFWIDTH_AND_FULLWIDTH_FORMS
Character.UnicodeBlock.of('！') == HALFWIDTH_AND_FULLWIDTH_FORMS
Character.UnicodeBlock.of('。') == CJK_SYMBOLS_AND_PUNCTUATION
But, as always, the devil is in the details:
Character.UnicodeBlock.of('Ａ') == HALFWIDTH_AND_FULLWIDTH_FORMS
where Ａ is the full-width character. So this is in the same category as the halfwidth Katakana ﾌ above. Note that the full-width Ａ is different from the normal (half-width) A:
Character.UnicodeBlock.of('A') == BASIC_LATIN
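Putting these block lookups together, a minimal sketch of a per-character classifier might look like the following. The names BlockClassifier, classify and classifyAll are made up for illustration. Treating only CJK_UNIFIED_IDEOGRAPHS as kanji is an assumption: ideographs in extension blocks, half-width katakana and punctuation all end up as "other" here.

```java
import java.lang.Character.UnicodeBlock;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the JDK): labels code points by Unicode block.
final class BlockClassifier {

    // Returns "kanji", "kana" or "other" for a single code point.
    static String classify(int codePoint) {
        UnicodeBlock block = UnicodeBlock.of(codePoint);
        if (block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
            return "kanji";
        }
        if (block == UnicodeBlock.HIRAGANA || block == UnicodeBlock.KATAKANA) {
            return "kana";
        }
        // Assumption: everything else, including half-width katakana, is "other".
        return "other";
    }

    // Produces one "<character> - <label>" line per code point of the input.
    static List<String> classifyAll(String text) {
        List<String> lines = new ArrayList<>();
        text.codePoints().forEach(cp ->
                lines.add(new String(Character.toChars(cp)) + " - " + classify(cp)));
        return lines;
    }
}
```

Iterating over code points rather than chars keeps the sketch safe for characters outside the Basic Multilingual Plane, although those would currently all be labelled "other".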

Use a Unicode code chart to determine which code point ranges are used for hiragana, katakana and kanji; then you can simply cast the character to an int and check which range it falls in, something like
int val = (int) 'て';
if (val >= 0x3040 && val <= 0x309f)
return HIRAGANA;
..

This seems like it'd be an interesting use for Guava's CharMatcher class. Using the tables linked in Jack's answer, I created this:
public class JapaneseCharMatchers {
public static final CharMatcher HIRAGANA =
CharMatcher.inRange((char) 0x3040, (char) 0x309f);
public static final CharMatcher KATAKANA =
CharMatcher.inRange((char) 0x30a0, (char) 0x30ff);
public static final CharMatcher KANA = HIRAGANA.or(KATAKANA);
public static final CharMatcher KANJI =
CharMatcher.inRange((char) 0x4e00, (char) 0x9faf);
public static void main(String[] args) {
test("誰か確認上記これらのフ");
}
private static void test(String string) {
System.out.println(string);
System.out.println("Hiragana: " + HIRAGANA.retainFrom(string));
System.out.println("Katakana: " + KATAKANA.retainFrom(string));
System.out.println("Kana: " + KANA.retainFrom(string));
System.out.println("Kanji: " + KANJI.retainFrom(string));
}
}
Running this prints the expected:
誰か確認上記これらのフ
Hiragana: かこれらの
Katakana: フ
Kana: かこれらのフ
Kanji: 誰確認上記
This gives you a lot of power for working with Japanese text by defining the rules for determining if a character is in one of these groups in an object that can not only do a lot of useful things itself, but can also be used with other APIs such as Guava's Splitter class.
Edit:
Based on jleedev's answer, you could also write a method like:
public static CharMatcher inUnicodeBlock(final Character.UnicodeBlock block) {
return new CharMatcher() {
public boolean matches(char c) {
return Character.UnicodeBlock.of(c) == block;
}
};
}
and use it like:
CharMatcher HIRAGANA = inUnicodeBlock(Character.UnicodeBlock.HIRAGANA);
I think this might be a bit slower than the other version though.

You need to get a reference that gives the separate ranges for kana and kanji characters. From what I've seen, alphabets and equivalents typically get a block of characters.

I know you didn't ask for VBA, but here is the VBA flavor for those who want to know:
Here's a function that will do it. It will break down the sentence like you have above into a single cell. You might need to add some error checking for how you want to deal with line breaks or English characters, etc. but this should be a good start.
Function KanjiKanaBreakdown(ByVal text As String) As String
Application.ScreenUpdating = False
Dim kanjiCode As Long
Dim result As String
Dim i As Long
For i = 1 To Len(text)
If Asc(Mid$(text, i, 1)) > -30562 And Asc(Mid$(text, i, 1)) < -950 Then
result = (result & (Mid$(text, i, 1)) & (" - kanji") & vbLf)
Else
result = (result & (Mid$(text, i, 1)) & (" - kana") & vbLf)
End If
Next
KanjiKanaBreakdown = result
Application.ScreenUpdating = True
End Function

Related

How to pad Strings with Unicode characters in Java

I add right padding to a String to output it in a table format.
for (String[] tuple : testData) {
System.out.format("%-32s -> %s\n", tuple[0], tuple[1]);
}
The result looks like this (random test data):
znZfmOEQ0Gb68taaNU6HY21lvo -> Xq2aGqLedQnTSXg6wmBNDVb
frKweMCH8Kvgyk0J -> lHJ5r7YDV0jTL
NxtHP -> odvPJklwIzZZ
NX2scXjl5dxWmer -> wPDlKCKllVKk
x2HKsSHCqDQ -> RMuWLZ2vaP9sOF0yHmjVysJ
b0hryXKd6b80xAI -> 05MHjvTOxlxq1bvQ8RGe
This approach does not work when there are multi-byte unicode characters:
0OZot🇨🇳ivbyG🧷hZM1FI👡wNhn6r6cC -> OKDxDV1o2NMqXH3VvE7q3uONwEcY5V
fBHRCjU4K8OCdzACmQZSn6WO -> gvGBtUO5a4gPMKj9BKqBHFKx1iO7
cDUh🇲🇺b0cXkLWkS -> SZX
WtP9t -> Q0wWOeY3W66mM5rcQQYKpG
va4d🍷u8SS -> KI
a71?⚖TZ💣🧜‍♀🕓ws5J -> b8A
As you can see, the alignment is off.
My idea was to calculate the difference between the length of the String and the number of bytes used and use that to offset the padding, something like this:
int correction = tuple[0].getBytes().length - tuple[0].length();
And then instead of padding to 32 chars, I would pad to 32 + correction. However, this didn't work either.
Here is my test code (using emoji-java, but the behaviour should be reproducible with any Unicode characters):
import java.util.Collection;
import org.apache.commons.lang3.RandomStringUtils;
import com.vdurmont.emoji.Emoji;
import com.vdurmont.emoji.EmojiManager;
public class Test {
public static void main(String[] args) {
// create random test data
String[][] testData = new String[15][2];
for (String[] tuple : testData) {
tuple[0] = RandomStringUtils.randomAlphanumeric(2, 32);
tuple[1] = RandomStringUtils.randomAlphanumeric(2, 32);
}
// add some emojis
Collection<Emoji> all = EmojiManager.getAll();
for (String[] tuple : testData) {
for (int i = 1; i < tuple[0].length(); i++) {
if (Math.random() > 0.90) {
Emoji emoji = all.stream().skip((int) (all.size() * Math.random())).findFirst().get();
tuple[0] = tuple[0].substring(0, i - 1) + emoji.getUnicode() + tuple[0].substring(i + 1);
}
}
}
// output
for (String[] tuple : testData) {
System.out.format("%-32s -> %s\n", tuple[0], tuple[1]);
}
}
}

There are actually a few issues here, other than that some fonts display the flag wider than the other characters. I assume that you want to count the Chinese flag as a single character (as it is drawn as a single element on the screen).
The String class reports an incorrect length
The String class works with chars, which are 16-bit UTF-16 code units rather than full Unicode code points. Not all code points fit in 16 bits: only code points from the Basic Multilingual Plane (BMP) fit in a single char, and all others are stored as a surrogate pair of two chars. String's length() method returns the number of chars, not the number of code points.
Now String's codePointCount method may help in this case: it counts the number of code points in the given index range. So providing string.length() as second argument to the method returns the total count of code points.
Combining characters
However, there's another problem. The 🇨🇳 Chinese flag, for example, consists of two Unicode code points: the Regional Indicator Symbol Letters C (🇨, U+1F1E8) and N (🇳, U+1F1F3). Those two code points are combined into a flag of China. This is a problem you are not going to solve with the codePointCount method.
The Regional Indicator Symbol Letters seem to be a special case. Two of those characters can be combined into a national flag. I am not aware of a standard way to achieve what you want. You may have to take that into account manually.
I've written a small program to get the length of a string.
static int length(String str) {
String a = "\uD83C\uDDE6";
String z = "\uD83C\uDDFF";
Pattern p = Pattern.compile("[" + a + "-" + z + "]{2}");
Matcher m = p.matcher(str);
int count = 0;
while (m.find()) {
count++;
}
return str.codePointCount(0, str.length()) - count;
}

As discussed in the comments on the question linked to by @Xehpuk, in this discussion on kotlinlang.org, and in this blog post by Daniel Lemire, the following seems to be correct:
The problem is that the java String class represents characters as
UTF-16 characters. This means any unicode character that is
represented by more than 16 bits is saved as 2 separate Char values.
This fact is ignored by many of the functions within String, eg.
String.length does not return the number of unicode characters, it
returns the number of 16bit characters within the String, some emoji
counting for 2 characters.
The behaviour, however, seems to be implementation-specific.
As David mentions in his post, you could try the following to get the correct length:
tuple[0].codePointCount(0, tuple[0].length())
See code point methods from Java SE docs
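As a rough sketch of how codePointCount could drive the padding itself, the helper below pads by code points instead of chars. The names CodePointPad and padRight are hypothetical. It assumes every code point occupies one column, so flags, ZWJ sequences and wide glyphs will still be misaligned.

```java
// Hypothetical helper: right-pads a string to a width measured in code points.
final class CodePointPad {

    static String padRight(String s, int width) {
        // Count code points, so a surrogate pair counts once instead of twice.
        int count = s.codePointCount(0, s.length());
        StringBuilder sb = new StringBuilder(s);
        for (int i = count; i < width; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }
}
```

It could replace the %-32s format specifier, for example System.out.println(CodePointPad.padRight(tuple[0], 32) + " -> " + tuple[1]).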

Check if string contains only Unicode values [\u0030-\u0039] or [\u0660-\u0669]

I need to check, in java, if a string is composed only of Unicode values [\u0030-\u0039] or [\u0660-\u0669]. What is the most efficient way of doing this?

Use \x{...} for Unicode characters:
^([\x{0030}-\x{0039}\x{0660}-\x{0669}]+)$
If the pattern should match an empty string too, use * instead of +.
Use this if you don't want to allow mixing characters from the two sets you provided:
^([\x{0030}-\x{0039}]+|[\x{0660}-\x{0669}]+)$
https://regex101.com/r/xqWL4q/6
As Holger mentioned in the comments below, \x{0030}-\x{0039} is equivalent to 0-9, so it could be substituted, which would be more readable.
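Since the question asks about efficiency, one option is to compile these patterns once and reuse them, because String.matches compiles its regex on every call. The sketch below uses the * variants, so the empty string matches. The names DigitPatterns, isMixedDigits and isSingleSetDigits are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical holder for precompiled versions of the two patterns.
final class DigitPatterns {

    // ASCII and Arabic-Indic digits, freely mixed.
    static final Pattern MIXED = Pattern.compile("[0-9\\x{0660}-\\x{0669}]*");

    // All ASCII digits or all Arabic-Indic digits, but not a mixture.
    static final Pattern EITHER = Pattern.compile("[0-9]*|[\\x{0660}-\\x{0669}]*");

    static boolean isMixedDigits(String s) {
        // matches() requires the whole input to match, so no ^ or $ is needed.
        return MIXED.matcher(s).matches();
    }

    static boolean isSingleSetDigits(String s) {
        return EITHER.matcher(s).matches();
    }
}
```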

As said here, it’s not clear whether you want to allow mixed occurrences of these digits or to check for either one of these ranges.
A simple check for mixed digits would be string.matches("[0-9٠-٩]*") or, to avoid confusing changes of the read/write direction, or if your source code encoding doesn’t support all characters, string.matches("[0-9\u0660-\u0669]*").
Checking whether the string matches either range can be done using
string.matches("[0-9]*")||string.matches("[٠-٩]*") or
string.matches("[0-9]*")||string.matches("[\u0660-\u0669]*").
An alternative would be
string.chars().allMatch(c -> c >= '0' && c <= '9' || c >= '٠' && c <= '٩').
Or to check for either, string.chars().allMatch(c -> c >= '0' && c <= '9') || string.chars().allMatch(c -> c >= '٠' && c <= '٩')

Since these code points represent numerals in two different Unicode blocks,
I suggest checking whether each character is a numeral:
boolean isNumerals(String s) {
return s.chars().allMatch(Character::isDigit);
}
This will definitely match more than was asked for, but in some cases, or in a more controlled environment, it may be useful for making the code more readable.
(edit)
The Java API also lets you determine the Unicode block of a specific character:
Character.UnicodeBlock arabic = Character.UnicodeBlock.ARABIC;
Character.UnicodeBlock latin = Character.UnicodeBlock.BASIC_LATIN;
boolean isValidBlock(String s) {
return s.chars().allMatch(v ->
Character.UnicodeBlock.of(v).equals(arabic) ||
Character.UnicodeBlock.of(v).equals(latin)
);
}
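Combined with the check above, this gives close to the result the OP asked for (the ARABIC block also contains the Extended Arabic-Indic digits U+06F0 to U+06F9, which would pass as well).
On the plus side, the higher abstraction gives more flexibility, makes the code more readable, and does not depend on the exact encoding of the string passed.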
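A sketch of what combining the digit check with the block check could look like is given below. The class and method names are hypothetical. It deliberately uses code points rather than chars, and the comment notes the one known place where it accepts more than the two requested ranges.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical combination of Character.isDigit and a Unicode block test.
final class BlockDigitCheck {

    static boolean isDigitInAllowedBlock(int codePoint) {
        UnicodeBlock block = UnicodeBlock.of(codePoint);
        // Caveat: ARABIC also contains the digits U+06F0..U+06F9, which pass too.
        return Character.isDigit(codePoint)
                && (block == UnicodeBlock.BASIC_LATIN || block == UnicodeBlock.ARABIC);
    }

    static boolean allDigits(String s) {
        return s.codePoints().allMatch(BlockDigitCheck::isDigitInAllowedBlock);
    }
}
```

If the two ranges have to be matched exactly, an explicit range comparison is the safer choice.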

Simple solution using regex:
(see also the much better explanation by @Predicate: https://stackoverflow.com/a/60597367/12558456)
private boolean legalRegex(String s) {
return s.matches("^([\u0030-\u0039]|[\u0660-\u0669])*$");
}
faster but ugly solution: (needs a hashset of allowed chars)
private boolean legalCharactersOnly(String s) {
for (char c:s.toCharArray()) {
if (!allowedCharacters.contains(c)) {
return false;
}
}
return true;
}

Here is a solution which works without regex for arbitrary unicode code points (outside of the Basic Multilingual Plane).
private final Set<Integer> codePoints = new HashSet<Integer>();
public boolean test(String string) {
for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
codePoint = string.codePointAt(i);
if (!codePoints.contains(codePoint)) {
return false;
}
}
return true;
}
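The codePoints set above is left unpopulated. One way to fill it for the two ranges from the question is sketched below; the class name AllowedCodePoints and the constructor are assumptions added for illustration, and the test method is the one from the answer.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.IntStream;

// Hypothetical wrapper that fills the allowed set with both digit ranges.
final class AllowedCodePoints {

    private final Set<Integer> codePoints = new HashSet<>();

    AllowedCodePoints() {
        // U+0030..U+0039 (ASCII digits) and U+0660..U+0669 (Arabic-Indic digits).
        IntStream.rangeClosed(0x0030, 0x0039).boxed().forEach(codePoints::add);
        IntStream.rangeClosed(0x0660, 0x0669).boxed().forEach(codePoints::add);
    }

    boolean test(String string) {
        // Advance by one or two chars depending on the size of the code point.
        for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
            codePoint = string.codePointAt(i);
            if (!codePoints.contains(codePoint)) {
                return false;
            }
        }
        return true;
    }
}
```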

Removing duplicate same characters in a row

I am trying to create a method which will either remove all duplicates from a string or only keep the same 2 characters in a row based on a parameter.
For example:
helllllllo -> helo
or
helllllllo -> hello - This keeps double letters
Currently I remove duplicates by doing:
private String removeDuplicates(String word) {
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < word.length(); i++) {
char letter = word.charAt(i);
if (buffer.length() == 0 || letter != buffer.charAt(buffer.length() - 1)) {
buffer.append(letter);
}
}
return buffer.toString();
}
If I want to keep double letters I was thinking of having a method like private String removeDuplicates(String word, boolean doubleLetter)
When doubleLetter is true it will return hello not helo
I'm not sure of the most efficient way to do this without duplicating a lot of code.

why not just use a regex?
public class RemoveDuplicates {
public static void main(String[] args) {
System.out.println(new RemoveDuplicates().result("hellllo", false)); //helo
System.out.println(new RemoveDuplicates().result("hellllo", true)); //hello
}
public String result(String input, boolean doubleLetter){
String pattern = null;
if(doubleLetter) pattern = "(.)(?=\\1{2})";
else pattern = "(.)(?=\\1)";
return input.replaceAll(pattern, "");
}
}
(.) --> matches any character and puts in group 1.
?= --> this is called a positive lookahead.
?=\\1 --> positive lookahead for the first group
So overall, this regex looks for any character that is followed (positive lookahead) by itself. For example aa or bb, etc. It is important to note that only the first character is part of the match actually, so in the word 'hello', only the first l is matched (the part (?=\1) is NOT PART of the match). So the first l is replaced by an empty String and we are left with helo, which does not match the regex
The second pattern is the same thing, but this time we look ahead for TWO occurrences of the first group, for example helllo. On the other hand 'hello' will not be matched.
Look here for a lot more: Regex
P.S. Feel free to accept the answer if it helped.

try
String s = "helllllllo";
System.out.println(s.replaceAll("(\\w)\\1+", "$1"));
output
helo

Taking this previous SO example as a starting point, I came up with this:
String str1= "Heelllllllllllooooooooooo";
String removedRepeated = str1.replaceAll("(\\w)\\1+", "$1");
System.out.println(removedRepeated);
String keepDouble = str1.replaceAll("(\\w)\\1{2,}", "$1");
System.out.println(keepDouble);
It yields:
Helo
Heelo
What it does:
(\\w)\\1+ will match any letter and place it in a regex capture group. This group is later accessed through the \\1+. Meaning that it will match one or more repetitions of the previous letter.
(\\w)\\1{2,} is the same as above the only difference being that it looks after only characters which are repeated more than 2 times. This leaves the double characters untouched.
EDIT:
Re-read the question and it seems that you want to replace multiple characters by doubles. To do that, simply use this line:
String keepDouble = str1.replaceAll("(\\w)\\1+", "$1$1");
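For comparison, a loop-based sketch that covers both cases with a single parameter, and so avoids duplicating code, might look like this. The names RunLimiter and limitRuns are hypothetical. Unlike the \\w patterns above, it applies to every char, not only word characters.

```java
// Hypothetical helper: keeps at most maxRun identical characters in a row.
final class RunLimiter {

    static String limitRuns(String word, int maxRun) {
        StringBuilder out = new StringBuilder(word.length());
        int run = 0; // length of the current run of identical characters
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            run = (i > 0 && c == word.charAt(i - 1)) ? run + 1 : 1;
            if (run <= maxRun) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

A boolean doubleLetter flag, as in the question, would then map to limitRuns(word, doubleLetter ? 2 : 1).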

Try this; it should be an efficient way [edited after comment]:
public static String removeDuplicates(String str) {
int checker = 0;
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < str.length(); ++i) {
int val = str.charAt(i) - 'a';
if ((checker & (1 << val)) == 0)
buffer.append(str.charAt(i));
checker |= (1 << val);
}
return buffer.toString();
}
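I am using bits to identify uniqueness.
EDIT:
The whole logic is that once a character has been processed, its corresponding bit is set; the next time that character comes up, it is not added to the StringBuffer because the bit is already set. Note that this only works for the lowercase letters 'a' to 'z' (an int has 32 bits), and that it drops every later occurrence of a letter anywhere in the string, not just repeats in a row.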

What is the recommended way to escape HTML symbols in plain Java?

Is there a recommended way to escape <, >, " and & characters when outputting HTML in plain Java code? (Other than manually doing the following, that is).
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = source.replace("&", "&amp;").replace("<", "&lt;"); // ... (ampersand first, so the & of &lt; is not escaped again)

StringEscapeUtils from Apache Commons Lang:
import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = escapeHtml(source);
For version 3:
import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
// ...
String escaped = escapeHtml4(source);

An alternative to Apache Commons: Use Spring's HtmlUtils.htmlEscape(String input) method.

Nice short method:
public static String escapeHTML(String s) {
StringBuilder out = new StringBuilder(Math.max(16, s.length()));
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c > 127 || c == '"' || c == '\'' || c == '<' || c == '>' || c == '&') {
out.append("&#");
out.append((int) c);
out.append(';');
} else {
out.append(c);
}
}
return out.toString();
}
Based on https://stackoverflow.com/a/8838023/1199155 (the amp is missing there). The four characters checked in the if clause are the only ones below 128, according to http://www.w3.org/TR/html4/sgml/entities.html
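If numeric references for every non-ASCII character are not wanted, for instance when the page is served as UTF-8, a variant that touches only the five markup characters could be sketched as follows. MinimalHtmlEscaper is a hypothetical name, and the sketch assumes the output goes into element content or a quoted attribute value, not into script, CSS or URL contexts.

```java
// Hypothetical helper: escapes only the five characters that are special in HTML markup.
final class MinimalHtmlEscaper {

    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c); // everything else passes through unchanged
            }
        }
        return out.toString();
    }
}
```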

There is a newer version of the Apache Commons Lang library and it uses a different package name (org.apache.commons.lang3). The StringEscapeUtils now has different static methods for escaping different types of documents (http://commons.apache.org/proper/commons-lang/javadocs/api-3.0/index.html). So to escape HTML version 4.0 string:
import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
String output = escapeHtml4("The less than sign (<) and ampersand (&) must be escaped before using them in HTML");

For those who use Google Guava:
import com.google.common.html.HtmlEscapers;
[...]
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = HtmlEscapers.htmlEscaper().escape(source);

Be careful with this. There are a number of different 'contexts' within an HTML document: Inside an element, quoted attribute value, unquoted attribute value, URL attribute, javascript, CSS, etc... You'll need to use a different encoding method for each of these to prevent Cross-Site Scripting (XSS). Check the OWASP XSS Prevention Cheat Sheet for details on each of these contexts. You can find escaping methods for each of these contexts in the OWASP ESAPI library -- https://github.com/ESAPI/esapi-java-legacy.

On android (API 16 or greater) you can:
Html.escapeHtml(textToScape);
or for lower API:
TextUtils.htmlEncode(textToScape);

For some purposes, HtmlUtils:
import org.springframework.web.util.HtmlUtils;
[...]
HtmlUtils.htmlEscapeDecimal("&"); //gives &#38;
HtmlUtils.htmlEscape("&"); //gives &amp;

org.apache.commons.lang3.StringEscapeUtils is now deprecated. Use org.apache.commons.text.StringEscapeUtils instead, by adding this dependency:
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
<version>${commons.text.version}</version>
</dependency>

While @dfa's answer of org.apache.commons.lang.StringEscapeUtils.escapeHtml is nice, and I have used it in the past, it should not be used for escaping HTML (or XML) attributes; otherwise the whitespace will be normalized (meaning all adjacent whitespace characters become a single space).
I know this because I have had bugs filed against my library (JATL) for attributes where whitespace was not preserved. Thus I have a drop-in (copy 'n' paste) class (some of which I stole from JDOM) that differentiates between the escaping of attributes and element content.
While proper attribute escaping may not have mattered as much in the past, it is becoming increasingly important given the use of HTML5's data- attributes.

Java 8+ Solution:
public static String escapeHTML(String str) {
return str.chars().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
"&#" + c + ";" : String.valueOf((char) c)).collect(Collectors.joining());
}
String#chars returns an IntStream of the char values from the String. We can then use mapToObj to escape the characters with a character code greater than 127 (non-ASCII characters) as well as the double quote ("), single quote ('), left angle bracket (<), right angle bracket (>), and ampersand (&). Collectors.joining concatenates the Strings back together.
To better handle Unicode characters, String#codePoints can be used instead.
public static String escapeHTML(String str) {
return str.codePoints().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
"&#" + c + ";" : new String(Character.toChars(c)))
.collect(Collectors.joining());
}
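To see why the code point version handles Unicode better, the two variants can be run side by side on a character outside the Basic Multilingual Plane. The sketch below copies both methods under the hypothetical names escapeByChar and escapeByCodePoint. For U+1F600 the first one emits two surrogate references, which are not valid character references in HTML, while the second emits a single reference.

```java
import java.util.stream.Collectors;

// Hypothetical side-by-side copy of the two variants above.
final class EscapeComparison {

    // Works on UTF-16 code units: a surrogate pair becomes two references.
    static String escapeByChar(String str) {
        return str.chars().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
                "&#" + c + ";" : String.valueOf((char) c)).collect(Collectors.joining());
    }

    // Works on code points: a surrogate pair becomes one reference.
    static String escapeByCodePoint(String str) {
        return str.codePoints().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
                "&#" + c + ";" : new String(Character.toChars(c)))
                .collect(Collectors.joining());
    }
}
```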

Most libraries escape everything they can, including hundreds of symbols and thousands of non-ASCII characters, which is not what you want in a UTF-8 world.
Also, as Jeff Williams noted, there's no single “escape HTML” option; there are several contexts.
Assuming you never use unquoted attributes, and keeping in mind that different contexts exist, I've written my own version:
private static final long TEXT_ESCAPE =
1L << '&' | 1L << '<';
private static final long DOUBLE_QUOTED_ATTR_ESCAPE =
TEXT_ESCAPE | 1L << '"';
private static final long SINGLE_QUOTED_ATTR_ESCAPE =
TEXT_ESCAPE | 1L << '\'';
private static final long ESCAPES =
DOUBLE_QUOTED_ATTR_ESCAPE | SINGLE_QUOTED_ATTR_ESCAPE;
// 'quot' and 'apos' are 1 char longer than '#34' and '#39'
// which I've decided to use
private static final String REPLACEMENTS = "&#34;&amp;&#39;&lt;";
private static final int REPL_SLICES = /* [0, 5, 10, 15, 19) */
5<<5 | 10<<10 | 15<<15 | 19<<20;
// These 5-bit numbers packed into a single int
// are indices within REPLACEMENTS which is a 'flat' String[]
private static void appendEscaped(
Appendable builder, CharSequence content, long escapes) {
try {
int startIdx = 0, len = content.length();
for (int i = 0; i < len; i++) {
char c = content.charAt(i);
long one;
if (((c & 63) == c) && ((one = 1L << c) & escapes) != 0) {
// -^^^^^^^^^^^^^^^ -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
// | | take only dangerous characters
// | java shifts longs by 6 least significant bits,
// | e. g. << 0b110111111 is same as << 0b111111.
// | Filter out bigger characters
int index = Long.bitCount(ESCAPES & (one - 1));
builder.append(content, startIdx, i /* exclusive */).append(
REPLACEMENTS,
REPL_SLICES >>> (5 * index) & 31,
REPL_SLICES >>> (5 * (index + 1)) & 31
);
startIdx = i + 1;
}
}
builder.append(content, startIdx, len);
} catch (IOException e) {
// typically, our Appendable is StringBuilder which does not throw;
// also, there's no way to declare 'if A#append() throws E,
// then appendEscaped() throws E, too'
throw new UncheckedIOException(e);
}
}
Consider copy-pasting from Gist without line length limit.
UPD: As another answer suggests, > escaping is not necessary; also, " within attr='…' is allowed, too. I've updated the code accordingly.
You may check it out yourself:
<!DOCTYPE html>
<html lang="en">
<head><title>Test</title></head>
<body>
<p title="&lt;&#34;I'm double-quoted!&#34;>">&lt;"Hello!"></p>
<p title='&lt;"I&#39;m single-quoted!">'>&lt;"Goodbye!"></p>
</body>
</html>

How can I check if a single character appears in a string?

In Java is there a way to check the condition:
"Does this single character appear at all in string x"
without using a loop?

You can use string.indexOf('a').
If the char a is present in the string,
it returns the index of the first occurrence of the character in
the character sequence represented by this object, or -1 if the
character does not occur.

String.contains() which checks if the string contains a specified sequence of char values
String.indexOf() which returns the index within the string of the first occurrence of the specified character or substring (there are 4 variations of this method)

I'm not sure what the original poster is asking exactly. Since indexOf(...) and contains(...) both probably use loops internally, perhaps he's looking to see if this is possible at all without a loop? I can think of two ways offhand; one would of course be recursion:
public boolean containsChar(String s, char search) {
if (s.length() == 0)
return false;
else
return s.charAt(0) == search || containsChar(s.substring(1), search);
}
The other is far less elegant, but for completeness:
/**
* Works for strings of up to 5 characters
*/
public boolean containsChar(String s, char search) {
if (s.length() > 5) throw new IllegalArgumentException();
try {
if (s.charAt(0) == search) return true;
if (s.charAt(1) == search) return true;
if (s.charAt(2) == search) return true;
if (s.charAt(3) == search) return true;
if (s.charAt(4) == search) return true;
} catch (IndexOutOfBoundsException e) {
// reached when the string is shorter than 5 characters
return false;
}
return false;
}
The number of lines grows as you need to support longer and longer strings, of course. But there are no loops or recursion at all. You can even remove the length check if you're concerned that length() uses a loop.

You can use 2 methods from the String class.
String.contains() which checks if the string contains a specified sequence of char values
String.indexOf() which returns the index within the string of the first occurrence of the specified character or substring, or returns -1 if the character is not found (there are 4 variations of this method)
Method 1:
String myString = "foobar";
if (myString.contains("x")) {
// Do something.
}
Method 2:
String myString = "foobar";
if (myString.indexOf("x") >= 0) {
// Do something.
}
Links by: Zach Scrivena

String temp = "abcdefghi";
if(temp.indexOf("b")!=-1)
{
System.out.println("there is 'b' in temp string");
}
else
{
System.out.println("there is no 'b' in temp string");
}

If you need to check the same string often, you can calculate the character occurrences up front. This is an implementation that uses a bit array stored in a long array:
public class FastCharacterInStringChecker implements Serializable {
private static final long serialVersionUID = 1L;
private final long[] l = new long[1024]; // 65536 / 64 = 1024
public FastCharacterInStringChecker(final String string) {
for (final char c: string.toCharArray()) {
final int index = c >> 6;
final int value = c - (index << 6);
l[index] |= 1L << value;
}
}
public boolean contains(final char c) {
final int index = c >> 6; // c / 64
final int value = c - (index << 6); // c - (index * 64)
return (l[index] & (1L << value)) != 0;
}}
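The same precomputation idea can also be sketched with java.util.BitSet, which hides the index and shift arithmetic. This is an alternative formulation rather than the code above; the class name CharPresenceIndex is hypothetical, and like the original it indexes UTF-16 chars, not code points.

```java
import java.util.BitSet;

// Hypothetical BitSet-based variant of the precomputed lookup.
final class CharPresenceIndex {

    // One bit per possible char value (0..65535).
    private final BitSet present = new BitSet(Character.MAX_VALUE + 1);

    CharPresenceIndex(String s) {
        for (int i = 0; i < s.length(); i++) {
            present.set(s.charAt(i)); // mark this char as seen
        }
    }

    boolean contains(char c) {
        return present.get(c);
    }
}
```

Building the index still loops over the string once; only the repeated lookups afterwards avoid a scan.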

To check if something does not exist in a string, you at least need to look at each character in a string. So even if you don't explicitly use a loop, it'll have the same efficiency. That being said, you can try using str.contains(""+char).

Is the below what you were looking for?
int index = string.indexOf(character);
return index != -1;

Yes, using the indexOf() method on the string class. See the API documentation for this method

String.contains(String) or String.indexOf(String) - suggested
"abc".contains("Z"); // false - correct
"zzzz".contains("Z"); // false - correct
"Z".contains("Z"); // true - correct
"😀and😀".contains("😀"); // true - correct
"😀and😀".contains("😂"); // false - correct
"😀and😀".indexOf("😀"); // 0 - correct
"😀and😀".indexOf("😂"); // -1 - correct
String.indexOf(int) and carefully considered String.indexOf(char) with char to int widening
"😀and😀".indexOf("😀".charAt(0)); // 0 though incorrect usage has correct output due to portion of correct data
"😀and😀".indexOf("😂".charAt(0)); // 0 -- incorrect usage and ambiguous result
"😀and😀".indexOf("😂".codePointAt(0)); // -1 -- correct usage and correct output
What counts as a character is ambiguous in the Java world
Can the value of a char or Character be considered a single character?
No. In the context of Unicode, a char or Character is sometimes only part of a single character and should not be treated as a complete character logically.
If not, what should be considered a single character (logically)?
Any system supporting Unicode should treat a Unicode code point as a single character.
So Java should make that loud and clear rather than exposing so many internal implementation details to users.
The String class is bad at abstraction (you need a confusingly large amount of knowledge about its internals to understand the abstraction 😒😒😒, hence an anti-pattern).
How is it different from general char usage?
A char can only be mapped to a character in the Basic Multilingual Plane.
Only a code point (an int) can cover the complete range of Unicode characters.
Why is there this difference?
A char is a 16-bit unsigned value, and UTF-16 cannot represent every Unicode character in a single 16-bit unit. Sometimes a 16-bit value has to be combined with another 16-bit value (a surrogate pair) to define a character.
Without getting too verbose, the usage of indexOf, charAt, length and similar methods should be more explicit. I sincerely hope Java will add new UnicodeString and UnicodeCharacter classes with clearly defined abstractions.
Reason to prefer contains and not indexOf(int)
In practice, a lot of Java code treats a logical character as a char.
In a Unicode context, char is not sufficient.
Although indexOf takes an int, the implicit char to int conversion hides this from the user, who might write something like str.indexOf(someotherstr.charAt(0)) (unless the user is aware of the exact context).
So, treating everything as a CharSequence (aka String) is better.
public static void main(String[] args) {
System.out.println("😀and😀".indexOf("😀".charAt(0))); // 0 though incorrect usage has correct output due to portion of correct data
System.out.println("😀and😀".indexOf("😂".charAt(0))); // 0 -- incorrect usage and ambiguous result
System.out.println("😀and😀".indexOf("😂".codePointAt(0))); // -1 -- correct usage and correct output
System.out.println("😀and😀".contains("😀")); // true - correct
System.out.println("😀and😀".contains("😂")); // false - correct
}
Semantics
char can handle most practical use cases. Still, it's better to use code points within the programming environment for future extensibility.
Code points should handle nearly all of the technical use cases around encodings.
Still, grapheme clusters fall outside the scope of the code point level of abstraction.
Storage layers can choose a char interface if ints are too costly (double the size). Unless storage cost is the only metric, it's still better to use code points. Also, it's better to treat storage as bytes and delegate semantics to the business logic built around storage.
Semantics can be abstracted at multiple levels. The code point should become the lowest level of interface, and other semantics can be built around code points in the runtime environment.
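A small sketch of what searching at the code point level looks like with the existing String API is given below. The class and method names are hypothetical. It relies on String.indexOf(int) accepting a full code point, including supplementary ones.

```java
// Hypothetical helper: searches by code point instead of by char.
final class CodePointSearch {

    // True if the given code point occurs anywhere in s.
    static boolean containsCodePoint(String s, int codePoint) {
        return s.indexOf(codePoint) >= 0;
    }

    // True if the first logical character of other occurs in s.
    static boolean containsFirstCharacterOf(String s, String other) {
        // codePointAt(0) reads a whole surrogate pair; charAt(0) would read half of it.
        return !other.isEmpty() && containsCodePoint(s, other.codePointAt(0));
    }
}
```

This still works on code points only, so multi-code-point grapheme clusters such as flags are outside its scope.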

package com;
public class _index {
public static void main(String[] args) {
String s1="be proud to be an indian";
char ch=s1.charAt(s1.indexOf('e'));
int count = 0;
for(int i=0;i<s1.length();i++) {
if(s1.charAt(i)=='e'){
System.out.println("number of E:=="+ch);
count++;
}
}
System.out.println("Total count of E:=="+count);
}
}

static String removeOccurences(String a, String b)
{
StringBuilder s2 = new StringBuilder(a);
for(int i=0;i<b.length();i++){
char ch = b.charAt(i);
System.out.println(ch+" first index"+a.indexOf(ch));
// indexOf returns -1 once no occurrence of ch is left, so loop while k >= 0
for(int k=s2.indexOf(String.valueOf(ch));k >= 0;k=s2.indexOf(String.valueOf(ch))){
s2.deleteCharAt(k);
System.out.println("val of s2 : "+s2.toString());
}
}
System.out.println(s2.toString());
return (s2.toString());
}

You can use this code. It checks whether the char is present: if it is, indexOf returns a value >= 0, otherwise -1. Here I am printing the letters of the alphabet that are not present in the input.
import java.util.Scanner;
public class Test {
public static void letters()
{
System.out.println("Enter input char");
Scanner sc = new Scanner(System.in);
String input = sc.next();
System.out.println("Output : ");
for (char alphabet = 'A'; alphabet <= 'Z'; alphabet++) {
if(input.toUpperCase().indexOf(alphabet) < 0)
System.out.print(alphabet + " ");
}
}
public static void main(String[] args) {
letters();
}
}
//Output example
Enter input char
nandu
Output :
B C E F G H I J K L M O P Q R S T V W X Y Z

If you look at the source code of indexOf in Java:
public int indexOf(int ch, int fromIndex) {
final int max = value.length;
if (fromIndex < 0) {
fromIndex = 0;
} else if (fromIndex >= max) {
// Note: fromIndex might be near -1>>>1.
return -1;
}
if (ch < Character.MIN_SUPPLEMENTARY_CODE_POINT) {
// handle most cases here (ch is a BMP code point or a
// negative value (invalid code point))
final char[] value = this.value;
for (int i = fromIndex; i < max; i++) {
if (value[i] == ch) {
return i;
}
}
return -1;
} else {
return indexOfSupplementary(ch, fromIndex);
}
}
You can see that it uses a for loop to find the character. Note that each indexOf call in your code amounts to one loop.
So a loop is unavoidable, even for a single character.
However, if you want to find a string that can take several different forms, use a library such as java.util.regex, which matches a character or string pattern using regular expressions. For example, to find an email in a string:
String regex = "^(.+)@(.+)$";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(email);
If you don't want to use regex, just use a loop and charAt, and try to cover all cases in one loop.
Be careful: recursive methods have more overhead than loops, so they are not recommended.
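The regex approach can be sketched as a complete method. The sketch assumes the separator in the pattern is meant to be @, and the names EmailShape and split are hypothetical. It only checks the rough shape something@something and is not an email validator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits a string of the rough shape local@domain.
final class EmailShape {

    // Compiled once and reused, since Pattern.compile is comparatively expensive.
    private static final Pattern SHAPE = Pattern.compile("^(.+)@(.+)$");

    // Returns {localPart, domain}, or null when the input does not have that shape.
    static String[] split(String email) {
        Matcher m = SHAPE.matcher(email);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }
}
```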

How about this (note that this is JavaScript, not Java):
let text = "Hello world, welcome to the universe.";
let result = text.includes("world");
console.log(result); // true
The result will be true or false.
This always works for me.

You won't be able to check whether a char appears in a string without going over the string at least once using a loop or recursion (the built-in methods like indexOf also use a loop).
If the number of lookups is much larger than the length of the string, then I would recommend using a Set data structure, as that would be more efficient than simply using indexOf.
String s = "abc";
// Build a set so we can check if character exists in constant time O(1)
Set<Character> set = new HashSet<>();
int len = s.length();
for(int i = 0; i < len; i++) set.add(s.charAt(i));
// Now we can check without the need of a loop
// contains method of set doesn't use a loop unlike string's contains method
set.contains('a'); // true
set.contains('z'); // false
Using set you will be able to check if character exists in a string in constant time O(1) but you will also use additional memory ( Space complexity will be O(n) ).
