How can I convert Bengali Unicode numerical values (০,১,২,৩,...,৮,৯) to (0,1,2,3,...,8,9) in Java?
Use Character.getNumericValue to get the integer value associated with a character:
System.out.println(Character.getNumericValue('০'));
System.out.println(Character.getNumericValue('১'));
// etc.
Output:
0
1
The advantage over other approaches here is that this works for any numeric chars, not just Bengali.
A simple solution is to subtract the value of '০' from each digit, since the Bengali digits are contiguous in the Unicode table, and then add '0':
public static void main(String[] args) {
char[] bengaliDigits = {'০','১','২','৩','৪','৫','৬','৭','৮','৯'};
for (char bengaliDigit : bengaliDigits) {
char digit = (char) (bengaliDigit - '০' + '0');
System.out.print(digit);
}
}
This will print 0123456789.
Use Character.getNumericValue('০').
It works irrespective of the language, because it uses the character's Unicode data for the conversion.
Try this:
/**
 * Converts a Bengali numeral to its Arabic equivalent numeral.
 *
 * @param bengaliNumeral Bengali numeral to be converted
 * @return the equivalent Arabic numeral
 * @see #bengaliToInt
 */
public static char bengaliToArabic(char bengaliNumeral) {
return (char) (bengaliNumeral - '০' + '0');
}
public static int bengaliToInt(char bengaliNumeral) {
return Character.getNumericValue(bengaliNumeral);
}
DEMO
SAMPLE CODE
System.out.format("bengaliToArabic('১') == %s // char\n", bengaliToArabic('১'));
System.out.format("bengaliToInt('১') == %s // int\n", bengaliToInt('১'));
OUTPUT
bengaliToArabic('১') == 1 // char
bengaliToInt('১') == 1 // int
A lot of solutions here suggest to simply subtract the Unicode value for the character ০ to get the numerical value. This works, but will only work if you know for a fact that the number is in fact a Bengali number. There are plenty of other numbers, and Java provides a standardised way to handle this using Character.getNumericValue() and Character.digit():
String s = "123০১২৩৪৫৬৭৮৯";
for(int i = 0 ; i < s.length() ; i++) {
System.out.println(Character.digit(s.charAt(i), 10));
}
This will work with not only Bengali numbers, but with numbers from all languages.
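The same idea can be applied to a whole string rather than one character at a time. The following is a minimal sketch, not part of the original answer: the class name DigitNormalizer and the method toAsciiDigits are hypothetical, and it assumes that anything Character.digit does not recognise as a decimal digit should be copied through unchanged.

```java
final class DigitNormalizer {
    // Replaces every Unicode decimal digit with its ASCII counterpart;
    // all other characters are copied unchanged.
    static String toAsciiDigits(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            int value = Character.digit(cp, 10); // -1 when cp is not a decimal digit
            if (value >= 0) {
                out.append((char) ('0' + value));
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAsciiDigits("123০১২৩৪৫৬৭৮৯")); // prints 1230123456789
    }
}
```

Iterating over code points instead of chars keeps digits outside the Basic Multilingual Plane intact as well.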
I add right padding to a String to output it in a table format.
for (String[] tuple : testData) {
System.out.format("%-32s -> %s\n", tuple[0], tuple[1]);
}
The result looks like this (random test data):
znZfmOEQ0Gb68taaNU6HY21lvo -> Xq2aGqLedQnTSXg6wmBNDVb
frKweMCH8Kvgyk0J -> lHJ5r7YDV0jTL
NxtHP -> odvPJklwIzZZ
NX2scXjl5dxWmer -> wPDlKCKllVKk
x2HKsSHCqDQ -> RMuWLZ2vaP9sOF0yHmjVysJ
b0hryXKd6b80xAI -> 05MHjvTOxlxq1bvQ8RGe
This approach does not work when there are multi-byte unicode characters:
0OZot🇨🇳ivbyG🧷hZM1FI👡wNhn6r6cC -> OKDxDV1o2NMqXH3VvE7q3uONwEcY5V
fBHRCjU4K8OCdzACmQZSn6WO -> gvGBtUO5a4gPMKj9BKqBHFKx1iO7
cDUh🇲🇺b0cXkLWkS -> SZX
WtP9t -> Q0wWOeY3W66mM5rcQQYKpG
va4d🍷u8SS -> KI
a71?⚖TZ💣🧜♀🕓ws5J -> b8A
As you can see, the alignment is off.
My idea was to calculate the difference between the length of the String and the number of bytes used and use that to offset the padding, something like this:
int correction = tuple[0].getBytes().length - tuple[0].length();
And then instead of padding to 32 chars, I would pad to 32 + correction. However, this didn't work either.
Here is my test code (using emoji-java, but the behaviour should be reproducible with any Unicode characters):
import java.util.Collection;
import org.apache.commons.lang3.RandomStringUtils;
import com.vdurmont.emoji.Emoji;
import com.vdurmont.emoji.EmojiManager;
public class Test {
public static void main(String[] args) {
// create random test data
String[][] testData = new String[15][2];
for (String[] tuple : testData) {
tuple[0] = RandomStringUtils.randomAlphanumeric(2, 32);
tuple[1] = RandomStringUtils.randomAlphanumeric(2, 32);
}
// add some emojis
Collection<Emoji> all = EmojiManager.getAll();
for (String[] tuple : testData) {
for (int i = 1; i < tuple[0].length(); i++) {
if (Math.random() > 0.90) {
Emoji emoji = all.stream().skip((int) (all.size() * Math.random())).findFirst().get();
tuple[0] = tuple[0].substring(0, i - 1) + emoji.getUnicode() + tuple[0].substring(i + 1);
}
}
}
// output
for (String[] tuple : testData) {
System.out.format("%-32s -> %s\n", tuple[0], tuple[1]);
}
}
}
There are actually a few issues here, other than that some fonts display the flag wider than the other characters. I assume that you want to count the Chinese flag as a single character (as it is drawn as a single element on the screen).
The String class reports an incorrect length
The String class works with chars, which are 16-bit UTF-16 code units rather than whole Unicode code points. The problem is that not all code points fit in 16 bits; only code points from the Basic Multilingual Plane (BMP) fit in a single char. String's length() method returns the number of chars, not the number of code points.
Now String's codePointCount method may help in this case: it counts the number of code points in the given index range. So providing string.length() as second argument to the method returns the total count of code points.
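As a small illustration of the difference between the two counts (the class name CodePointLength and the helper codePoints are made up for this sketch):

```java
final class CodePointLength {
    // Number of Unicode code points in s, as opposed to the number of chars.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (one code point stored as two chars), 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length());    // 4 chars
        System.out.println(codePoints(s)); // 3 code points
    }
}
```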
Combining characters
However, there's another problem. The 🇨🇳 Chinese flag, for example, consists of two Unicode code points: the Regional Indicator Symbol Letters C (🇨, U+1F1E8) and N (🇳, U+1F1F3). Those two code points are combined into a flag of China. This is a problem you are not going to solve with the codePointCount method.
The Regional Indicator Symbol Letters seem to be a special case. Two of those characters can be combined into a national flag. I am not aware of a standard way to achieve what you want. You may have to take that into account manually.
I've written a small program to get the length of a string.
static int length(String str) {
String a = "\uD83C\uDDE6";
String z = "\uD83C\uDDFF";
Pattern p = Pattern.compile("[" + a + "-" + z + "]{2}");
Matcher m = p.matcher(str);
int count = 0;
while (m.find()) {
count++;
}
return str.codePointCount(0, str.length()) - count;
}
As is discussed in the comments on the question linked to by @Xehpuk, in this discussion on kotlinlang.org, as well as in this blog post by Daniel Lemire, the following seems to be correct:
The problem is that the java String class represents characters as
UTF-16 characters. This means any unicode character that is
represented by more than 16 bits is saved as 2 separate Char values.
This fact is ignored by many of the functions within String, e.g.
String.length() does not return the number of unicode characters, it
returns the number of 16bit characters within the String, some emoji
counting for 2 characters.
The behaviour, however, seems to be implementation-specific.
As David mentions in his post, you could try the following to get the correct length:
tuple[0].codePointCount(0, tuple[0].length())
See code point methods from Java SE docs
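Putting this together with the padding from the question, one possible sketch pads by code point count instead of relying on the "%-32s" format width. The class name CodePointPadding and the method padRight are hypothetical. This only corrects for surrogate pairs: flags made of two regional indicator code points, and glyphs that a font draws wider than one column, would still be misaligned.

```java
final class CodePointPadding {
    // Pads s with spaces on the right until it contains `width` code points.
    // Strings that are already that long are returned unchanged.
    static String padRight(String s, int width) {
        int count = s.codePointCount(0, s.length());
        StringBuilder sb = new StringBuilder(s);
        for (int i = count; i < width; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[][] testData = { { "va4d\uD83C\uDF77u8SS", "KI" }, { "WtP9t", "Q0wWOeY3W66mM5rcQQYKpG" } };
        for (String[] tuple : testData) {
            System.out.println(padRight(tuple[0], 32) + " -> " + tuple[1]);
        }
    }
}
```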
I have inputs like
AS23456SDE
MFD324FR
I need to get First Character values like
AS, MFD
The prefix is not always the first two or the first three characters, and the input can change. I need to get the leading characters before the first number.
Thank you.
Edit : This is what I have tried.
public static String getPrefix(String serial) {
StringBuilder prefix = new StringBuilder();
for(char c : serial.toCharArray()){
if(Character.isDigit(c)){
break;
}
else{
prefix.append(c);
}
}
return prefix.toString();
}
Here is a nice one line solution. It uses a regex to match the first non numeric characters in the string, and then replaces the input string with this match.
public String getFirstLetters(String input) {
return new String("A" + input).replaceAll("^([^\\d]+)(.*)$", "$1")
.substring(1);
}
System.out.println(getFirstLetters("AS23456SDE"));
System.out.println(getFirstLetters("1AS123"));
Output:
AS
(empty)
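For comparison, here is a sketch of the same result without the "A" prefix trick, using a Matcher anchored at the start of the input. The class name PrefixExtractor and the method prefix are hypothetical, and \D here means "anything that is not an ASCII digit".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrefixExtractor {
    // Zero or more non-digit characters at the very start of the input.
    private static final Pattern LEADING_NON_DIGITS = Pattern.compile("^\\D*");

    static String prefix(String input) {
        Matcher m = LEADING_NON_DIGITS.matcher(input);
        // The pattern can match the empty string, so find() succeeds even
        // when the input starts with a digit.
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(prefix("AS23456SDE")); // AS
        System.out.println(prefix("MFD324FR"));   // MFD
    }
}
```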
A simple solution could be like this:
public static void main (String[]args) {
String str = "MFD324FR";
char[] characters = str.toCharArray();
for(char c : characters){
if(Character.isDigit(c))
break;
else
System.out.print(c);
}
}
Use the following function to get the required output:
public String getFirstChars(String str){
int zeroAscii = '0'; int nineAscii = '9';
String result = "";
for (int i=0; i< str.length(); i++){
int ascii = str.charAt(i);
if(ascii >= zeroAscii && ascii <= nineAscii){
// first digit found: everything collected so far is the prefix
return result;
}else{
result = result + str.charAt(i);
}
}
// no digit found: the whole string is the prefix
return str;
}
Pass your string as the argument.
I think this can be done with a simple regex that matches digits and Java's String.split function. This regex-based approach will be more efficient than the methods using more complicated regexes.
Something as below will work
String inp = "ABC345.";
String beginningChars = inp.split("[\\d]+",2)[0];
System.out.println(beginningChars); // only if you want to print.
The regex I used "[\\d]+" is escaped for java already.
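What does it do?
It matches one or more digits (\d). Note that in Java, \d matches only the ASCII digits 0-9 by default; to match digits of any script in Unicode (for example Bengali or Arabic-Indic digits), use \p{Nd} or enable the Pattern.UNICODE_CHARACTER_CLASS flag.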
What does String beginningChars = inp.split("[\\d]+",2)[0] do?
It applies this regex and separates the string into string arrays where ever a match is found. The [0] at the end selects the first result from that array, since you wanted the starting chars.
What is the second parameter to .split(regex,int) which I supplied as 2?
This is the Limit parameter. This means that the regex will be applied on the string till 1 match is found. Once 1 match is found the string is not processed anymore.
From the Strings javadoc page:
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
This will be efficient if your string is huge.
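To make the effect of the limit parameter concrete, here is a small sketch (the class name SplitLimitDemo and the helper splitOnDigits are made up for this example):

```java
import java.util.Arrays;

final class SplitLimitDemo {
    // Splits the input on runs of ASCII digits, applying the given limit.
    static String[] splitOnDigits(String input, int limit) {
        return input.split("\\d+", limit);
    }

    public static void main(String[] args) {
        // limit 2: the pattern is applied at most once, the rest stays in one piece
        System.out.println(Arrays.toString(splitOnDigits("ABC345DEF678GHI", 2))); // [ABC, DEF678GHI]
        // limit 0: the pattern is applied as many times as possible
        System.out.println(Arrays.toString(splitOnDigits("ABC345DEF678GHI", 0))); // [ABC, DEF, GHI]
    }
}
```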
Another possible regex, which spells out the ASCII (English) digits explicitly:
"[0-9]+"
public static void main(String[] args) {
String testString = "MFD324FR";
int index = 0;
for (Character i : testString.toCharArray()) {
if (Character.isDigit(i))
break;
index++;
}
System.out.println(testString.substring(0, index));
}
This prints the characters that come before the first digit in the string.
I want to find the character codes (ASCII/Unicode) of letters so that my Android app can support the Arabic locale. Many Arabic characters behave differently from English ones: some letters are displayed joined and some are split.
How can I find the specific code for each letter?
Unicode is a numbering of all characters. Code points range up to U+10FFFF, so a full code point needs three bytes. A Unicode character is conventionally written as U+XXXX, where XXXX stands for the number in hexadecimal (base 16) notation. A Unicode character is called a code point, and in Java it is represented by the type int.
A Java char is 2 bytes (a UTF-16 code unit), so it cannot represent code points above U+FFFF; for those, a pair of two chars (a surrogate pair) is used.
The java class Character deals with conversion.
char lowUnicode = '\u0627'; // Alef, fitting in a char
int cp = (int) lowUnicode;
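For a code point above U+FFFF the same conversion yields two chars. The sketch below prints the UTF-16 code units for a given code point; the class name SurrogateDemo and the method describe are hypothetical names used only for illustration.

```java
final class SurrogateDemo {
    // Formats a code point together with the UTF-16 code units Java stores for it.
    static String describe(int codePoint) {
        char[] units = Character.toChars(codePoint);
        StringBuilder sb = new StringBuilder(String.format("U+%04X ->", codePoint));
        for (char unit : units) {
            sb.append(String.format(" 0x%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0627));  // U+0627 -> 0x0627 (Alef, one char)
        System.out.println(describe(0x1F600)); // U+1F600 -> 0xD83D 0xDE00 (surrogate pair)
    }
}
```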
One can iterate through code points of a String as follows:
String s = "...";
for (int i = 0; i < s.length(); ) {
int codePoint = s.codePointAt(i);
...
i += Character.charCount(codePoint);
}
Or in java 8:
s.codePoints().forEach(
(codePoint) -> System.out.println(codePoint));
Dumping Arabic between U+600 and U+8FF:
The code below dumps Unicode in the main Arabic range.
for (int codePoint = 0x600; codePoint < 0x900; ++codePoint) {
if (Character.isAlphabetic(codePoint)
&& UnicodeScript.of(codePoint) == UnicodeScript.ARABIC) {
System.out.printf("\u200E\\%04X \u200F%s\u200E %s%n",
codePoint,
new String(Character.toChars(codePoint)),
Character.getName(codePoint));
}
}
Under Windows/Linux/... there exist char map tools to display Unicode.
In the code above, U+200E is the left-to-right mark and U+200F is the right-to-left mark.
If you want to get the code of a Unicode character, the code below will do that:
char character = 'ع';
int code = (int) character;
I want to display a Unicode character in Java. If I do this, it works just fine:
String symbol = "\u2202";
symbol is equal to "∂". That's what I want.
The problem is that I know the Unicode number and need to create the Unicode symbol from that. I tried (to me) the obvious thing:
int c = 2202;
String symbol = "\\u" + c;
However, in this case, symbol is equal to "\u2202". That's not what I want.
How can I construct the symbol if I know its Unicode number (but only at run-time---I can't hard-code it in like the first example)?
If you want to get a UTF-16 encoded code unit as a char, you can parse the integer and cast to it as others have suggested.
If you want to support all code points, use Character.toChars(int). This will handle cases where code points cannot fit in a single char value.
Doc says:
Converts the specified character (Unicode code point) to its UTF-16 representation stored in a char array. If the specified code point is a BMP (Basic Multilingual Plane or Plane 0) value, the resulting char array has the same value as codePoint. If the specified code point is a supplementary code point, the resulting char array has the corresponding surrogate pair.
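A minimal sketch of how this could be used when the number is only known at run time, for example as a hexadecimal string. The class name CodePointToString and both method names are hypothetical; only Character.toChars and Integer.parseInt are JDK calls.

```java
final class CodePointToString {
    // Builds a String from a single code point, BMP or supplementary.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Parses hexadecimal digits such as "2202" and converts the result.
    static String fromHex(String hexDigits) {
        return fromCodePoint(Integer.parseInt(hexDigits, 16));
    }

    public static void main(String[] args) {
        System.out.println(fromHex("2202"));           // the partial differential sign
        System.out.println(fromHex("1F495").length()); // 2, stored as a surrogate pair
    }
}
```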
Just cast your int to a char. You can convert that to a String using Character.toString():
String s = Character.toString((char)c);
EDIT:
Just remember that the escape sequences in Java source code (the \u bits) are in HEX, so if you're trying to reproduce an escape sequence, you'll need something like int c = 0x2202.
The other answers here either only support unicode up to U+FFFF (the answers dealing with just one instance of char) or don't tell how to get to the actual symbol (the answers stopping at Character.toChars() or using incorrect method after that), so adding my answer here, too.
To support supplementary code points also, this is what needs to be done:
// this character:
// http://www.isthisthingon.org/unicode/index.php?page=1F&subpage=4&glyph=1F495
// using code points here, not U+n notation
// for equivalence with U+n, below would be 0xnnnn
int codePoint = 128149;
// converting to char[] pair
char[] charPair = Character.toChars(codePoint);
// and to String, containing the character we want
String symbol = new String(charPair);
// we now have symbol with the desired character as the first item
// confirm that we indeed have character with code point 128149
System.out.println("First code point: " + symbol.codePointAt(0));
I also did a quick test as to which conversion methods work and which don't
int codePoint = 128149;
char[] charPair = Character.toChars(codePoint);
System.out.println(new String(charPair, 0, 2).codePointAt(0)); // 128149, worked
System.out.println(charPair.toString().codePointAt(0)); // 91, didn't work
System.out.println(new String(charPair).codePointAt(0)); // 128149, worked
System.out.println(String.valueOf(codePoint).codePointAt(0)); // 49, didn't work
System.out.println(new String(new int[] {codePoint}, 0, 1).codePointAt(0));
// 128149, worked
--
Note: as @Axel mentioned in the comments, with Java 11 there is Character.toString(int codePoint), which would arguably be best suited for the job.
This one worked fine for me.
String cc2 = "2202";
String text2 = String.valueOf(Character.toChars(Integer.parseInt(cc2, 16)));
Now text2 will have ∂.
Remember that char is an integral type, and thus can be given an integer value, as well as a char constant.
char c = 0x2202;//aka 8706 in decimal. \u codepoints are in hex.
String s = String.valueOf(c);
String st="2202";
int cp=Integer.parseInt(st,16);// parses st as a hexadecimal number
char c[]=Character.toChars(cp);
System.out.println(c);// displays the character corresponding to '\u2202'
Although this is an old question, there is a very easy way to do this in Java 11 which was released today: you can use a new overload of Character.toString():
public static String toString(int codePoint)
Returns a String object representing the specified character (Unicode code point). The result is a string of length 1 or 2, consisting solely of the specified codePoint.
Parameters:
codePoint - the codePoint to be converted
Returns:
the string representation of the specified codePoint
Throws:
IllegalArgumentException - if the specified codePoint is not a valid Unicode code point.
Since:
11
Since this method supports any Unicode code point, the length of the returned String is not necessarily 1.
The code needed for the example given in the question is simply:
int codePoint = '\u2202';
String s = Character.toString(codePoint); // <<< Requires JDK 11 !!!
System.out.println(s); // Prints ∂
This approach offers several advantages:
It works for any Unicode code point rather than just those that can be handled using a char.
It's concise, and it's easy to understand what the code is doing.
It returns the value as a string rather than a char[], which is often what you want. The answer posted by McDowell is appropriate if you want the code point returned as char[].
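On a JDK older than 11, a similar result can be had with StringBuilder.appendCodePoint, which also accepts any valid code point and throws IllegalArgumentException for an invalid one. This is a sketch of an alternative, not part of the answer above; the class name CodePointCompat and the method of are made up.

```java
final class CodePointCompat {
    // Pre-Java-11 stand-in for Character.toString(int codePoint).
    static String of(int codePoint) {
        return new StringBuilder().appendCodePoint(codePoint).toString();
    }

    public static void main(String[] args) {
        System.out.println(of(0x2202));           // the partial differential sign
        System.out.println(of(0x1F495).length()); // 2, because it needs a surrogate pair
    }
}
```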
This is how you do it:
String cc = "2202"; // hexadecimal digits of the code point, e.g. read at run time
char ccc = (char) Integer.parseInt(cc, 16);
final String text = String.valueOf(ccc);
This solution is by Arne Vajhøj.
The code below will write the 4 unicode chars (represented by decimals) for the word "be" in Japanese. Yes, the verb "be" in Japanese has 4 chars!
The value of characters is in decimal and it has been read into an array of String[] -- using split for instance. If you have Octal or Hex, parseInt take a radix as well.
// pseudo code
// 1. init the String[] containing the 4 unicodes in decimal :: intsInStrs
// 2. allocate the proper number of character pairs :: c2s
// 3. Using Integer.parseInt (... with radix or not) get the right int value
// 4. place it in the correct location of in the array of character pairs
// 5. convert c2s[] to String
// 6. print
String[] intsInStrs = {"12354", "12426", "12414", "12377"}; // 1.
char [] c2s = new char [intsInStrs.length * 2]; // 2. two chars per unicode
int ii = 0;
for (String intString : intsInStrs) {
// 3. NB ii*2 because the 16 bit value of Unicode is written in 2 chars
Character.toChars(Integer.parseInt(intsInStrs[ii]), c2s, ii * 2 ); // 3 + 4
++ii; // advance to the next char
}
String symbols = new String(c2s); // 5.
System.out.println("\nLooooonger code point: " + symbols); // 6.
// I tested it in Eclipse and Java 7 and it works. Enjoy
Here is a block to print out unicode chars between \u00c0 to \u00ff:
char[] ca = {'\u00c0'};
for (int i = 0; i < 4; i++) {
for (int j = 0; j < 16; j++) {
String sc = new String(ca);
System.out.print(sc + " ");
ca[0]++;
}
System.out.println();
}
Unfortunately, removing one backslash, as suggested in the first comment (newbiedoodle), does not lead to a good result. Most (if not all) IDEs report a syntax error. The reason is that the Java Unicode escape format expects the syntax "\uXXXX", where XXXX are four mandatory hexadecimal digits. Attempts to assemble such a string from pieces fail. Of course, "\u" is not the same as "\\u": the first is the start of a Unicode escape, while the second is an escaped backslash (that is, a literal backslash) followed by 'u'.
Strangely, the Apache pages present a utility that does exactly this. In reality it is an escape-mimicking utility: Apache has some utilities of its own (I did not test them) that do this work for you. It may still not be what you want, but the Apache Unicode escape utilities take a good approach to the solution. Combined with the approach described above (MeraNaamJoker), my solution is to create this mimicked escape string and then convert it back to Unicode (to avoid the restriction on real Unicode escapes). I used it for copying text, so it is possible that in the uencodeStr method it would be better to use '\\u' instead of '\\\\u'. Try it.
/**
* Converts character to the mimic unicode format i.e. '\\u0020'.
*
* This format is the Java source code format.
*
* CharUtils.unicodeEscaped(' ') = "\\u0020"
* CharUtils.unicodeEscaped('A') = "\\u0041"
*
 * @param ch the character to convert
 * @return the character as a mimicked Unicode escape string
 */
public static String unicodeEscaped(char ch) {
String returnStr;
//String uniTemplate = "\u0000";
final String charEsc = "\\u"; // a local variable cannot be declared static
if (ch < 0x10) {
returnStr = "000" + Integer.toHexString(ch);
}
else if (ch < 0x100) {
returnStr = "00" + Integer.toHexString(ch);
}
else if (ch < 0x1000) {
returnStr = "0" + Integer.toHexString(ch);
}
else
returnStr = "" + Integer.toHexString(ch);
return charEsc + returnStr;
}
/**
 * Converts a string to the mimic Unicode format, e.g. '\\u0020'.
 * Note: the real Unicode escape format (e.g. '\u0020') cannot be used here, because it is
 * translated to the character immediately at compile time and checked by the editor
 * (e.g. NetBeans). The mimic format '\\u0020' is used as a string instead, although it
 * does not give the same results, of course.
 *
 * This format is the Java source code format.
 *
 * CharUtils.unicodeEscaped(' ') = "\\u0020"
 * CharUtils.unicodeEscaped('A') = "\\u0041"
 *
 * @param nationalString the string to convert
 * @return the string in the Java Unicode mimic escaped format
*/
public String encodeStr(String nationalString) throws UnsupportedEncodingException {
String convertedString = "";
for (int i = 0; i < nationalString.length(); i++) {
Character chs = nationalString.charAt(i);
convertedString += unicodeEscaped(chs);
}
return convertedString;
}
/**
* Converts the string from mimic unicode format i.e. '\\u0020' back to UTF8.
*
* This format is the Java source code format.
*
* CharUtils.unicodeEscaped(' ') = "\\u0020"
* CharUtils.unicodeEscaped('A') = "\\u0041"
*
 * @param escapedString the string in the Java Unicode mimic escaped format
 * @return the decoded string
*/
public String uencodeStr(String escapedString) throws UnsupportedEncodingException {
String convertedString = "";
String[] arrStr = escapedString.split("\\\\u");
String str, istr;
for (int i = 1; i < arrStr.length; i++) {
str = arrStr[i];
if (!str.isEmpty()) {
Integer iI = Integer.parseInt(str, 16);
char[] chaCha = Character.toChars(iI);
convertedString += String.valueOf(chaCha);
}
}
return convertedString;
}
char c=(char)0x2202;
String s=""+c;
(This answer is for .NET 4.5; a similar approach must exist in Java.)
I am from West Bengal in INDIA.
As I understand your problem is ...
You want to produce similar to ' অ ' (It is a letter in Bengali language)
which has Unicode HEX : 0X0985.
So, if you know this value for your language, how do you produce the corresponding language-specific Unicode symbol?
In Dot Net it is as simple as this :
int c = 0X0985;
string x = Char.ConvertFromUtf32(c);
Now x is your answer.
But this is HEX by HEX convert and sentence to sentence conversion is a work for researchers :P
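For reference, a rough Java counterpart to the .NET call is sketched below. It uses the String(int[] codePoints, int offset, int count) constructor, so several code points can be converted in one go. The class name CodePointsToText and the method fromCodePoints are hypothetical.

```java
final class CodePointsToText {
    // Builds a String from one or more Unicode code points.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        System.out.println(fromCodePoints(0x0985));         // Bengali letter A
        System.out.println(fromCodePoints(0x0985, 0x0986)); // Bengali letters A and AA
    }
}
```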
In Java is there a way to check the condition:
"Does this single character appear at all in string x"
without using a loop?
You can use string.indexOf('a').
If the char a is present in the string,
it returns the index of the first occurrence of the character in
the character sequence represented by this object, or -1 if the
character does not occur.
I'm not sure what the original poster is asking exactly. Since indexOf(...) and contains(...) both probably use loops internally, perhaps he's looking to see if this is possible at all without a loop? I can think of two ways offhand; one would of course be recursion:
public boolean containsChar(String s, char search) {
if (s.length() == 0)
return false;
else
return s.charAt(0) == search || containsChar(s.substring(1), search);
}
The other is far less elegant, but included for completeness:
/**
* Works for strings of up to 5 characters
*/
public boolean containsChar(String s, char search) {
if (s.length() > 5) throw new IllegalArgumentException();
try {
if (s.charAt(0) == search) return true;
if (s.charAt(1) == search) return true;
if (s.charAt(2) == search) return true;
if (s.charAt(3) == search) return true;
if (s.charAt(4) == search) return true;
} catch (IndexOutOfBoundsException e) {
// reached when the string is shorter than 5 characters and no match was found
return false;
}
return false;
}
The number of lines grows as you need to support longer and longer strings, of course. But there are no loops or recursion at all. You can even remove the length check if you're concerned that length() uses a loop.
You can use 2 methods from the String class.
String.contains() which checks if the string contains a specified sequence of char values
String.indexOf() which returns the index within the string of the first occurrence of the specified character or substring, or returns -1 if the character is not found (there are 4 variations of this method)
Method 1:
String myString = "foobar";
if (myString.contains("x")) {
// Do something.
}
Method 2:
String myString = "foobar";
if (myString.indexOf("x") >= 0) {
// Do something.
}
Links by: Zach Scrivena
String temp = "abcdefghi";
if(temp.indexOf("b")!=-1)
{
System.out.println("there is 'b' in temp string");
}
else
{
System.out.println("there is no 'b' in temp string");
}
If you need to check the same string often, you can calculate the character occurrences up front. This is an implementation that uses a bit array stored in a long array:
public class FastCharacterInStringChecker implements Serializable {
private static final long serialVersionUID = 1L;
private final long[] l = new long[1024]; // 65536 / 64 = 1024
public FastCharacterInStringChecker(final String string) {
for (final char c: string.toCharArray()) {
final int index = c >> 6;
final int value = c - (index << 6);
l[index] |= 1L << value;
}
}
public boolean contains(final char c) {
final int index = c >> 6; // c / 64
final int value = c - (index << 6); // c - (index * 64)
return (l[index] & (1L << value)) != 0;
}}
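The same precomputation idea can also be sketched with java.util.BitSet, which does the index and mask arithmetic internally. This is a swapped-in alternative for illustration, not the implementation above; the class name CharPresenceIndex is hypothetical.

```java
import java.util.BitSet;

final class CharPresenceIndex {
    // One bit per possible char value (0 .. 65535).
    private final BitSet present = new BitSet(Character.MAX_VALUE + 1);

    CharPresenceIndex(String s) {
        for (int i = 0; i < s.length(); i++) {
            present.set(s.charAt(i));
        }
    }

    // Constant-time lookup after the one-off pass in the constructor.
    boolean contains(char c) {
        return present.get(c);
    }

    public static void main(String[] args) {
        CharPresenceIndex index = new CharPresenceIndex("foobar");
        System.out.println(index.contains('o')); // true
        System.out.println(index.contains('x')); // false
    }
}
```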
To check if something does not exist in a string, you at least need to look at each character in a string. So even if you don't explicitly use a loop, it'll have the same efficiency. That being said, you can try using str.contains(""+char).
Is the below what you were looking for?
int index = string.indexOf(character);
return index != -1;
Yes, using the indexOf() method on the string class. See the API documentation for this method
String.contains(String) or String.indexOf(String) - suggested
"abc".contains("Z"); // false - correct
"zzzz".contains("Z"); // false - correct
"Z".contains("Z"); // true - correct
"😀and😀".contains("😀"); // true - correct
"😀and😀".contains("😂"); // false - correct
"😀and😀".indexOf("😀"); // 0 - correct
"😀and😀".indexOf("😂"); // -1 - correct
String.indexOf(int) and carefully considered String.indexOf(char) with char to int widening
"😀and😀".indexOf("😀".charAt(0)); // 0 though incorrect usage has correct output due to portion of correct data
"😀and😀".indexOf("😂".charAt(0)); // 0 -- incorrect usage and ambiguous result
"😀and😀".indexOf("😂".codePointAt(0)); // -1 -- correct usage and correct output
The notion of a character is ambiguous in the Java world
Can the value of a char or Character be considered a single character?
No. In the context of Unicode characters, a char or Character can sometimes be only part of a single character and should not be treated as a complete single character logically.
If not, what should be considered a single character (logically)?
Any system supporting character encodings for Unicode characters should consider a Unicode code point as a single character.
So Java should state that loudly and clearly rather than exposing too many internal implementation details to users.
The String class is bad at abstraction (it requires a confusingly large amount of understanding of its encapsulation to understand the abstraction 😒😒😒, and is hence an anti-pattern).
How is it different from general char usage?
A char can only be mapped to a character in the Basic Multilingual Plane.
Only a code point (an int) can cover the complete range of Unicode characters.
Why is there this difference?
A char is internally treated as a 16-bit unsigned value and, with the UTF-16 internal representation, cannot represent all Unicode characters using only 2 bytes. Sometimes a 16-bit value has to be combined with another 16-bit value to correctly define a character.
Without getting too verbose, the usage of indexOf, charAt, length and similar methods should be more explicit. I sincerely hope Java will add new UnicodeString and UnicodeCharacter classes with clearly defined abstractions.
Reason to prefer contains and not indexOf(int)
Practically there are many code flows that treat a logical character as char in java.
In Unicode context, char is not sufficient
Though the indexOf takes in an int, char to int conversion masks this from the user and user might do something like str.indexOf(someotherstr.charAt(0))(unless the user is aware of the exact context)
So, treating everything as CharSequence (aka String) is better
public static void main(String[] args) {
System.out.println("😀and😀".indexOf("😀".charAt(0))); // 0 though incorrect usage has correct output due to portion of correct data
System.out.println("😀and😀".indexOf("😂".charAt(0))); // 0 -- incorrect usage and ambiguous result
System.out.println("😀and😀".indexOf("😂".codePointAt(0))); // -1 -- correct usage and correct output
System.out.println("😀and😀".contains("😀")); // true - correct
System.out.println("😀and😀".contains("😂")); // false - correct
}
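When the thing being searched for is an int code point rather than a String, one possible sketch is to search the code point stream. The class name CodePointSearch and the method containsCodePoint are hypothetical. Because codePoints() combines properly paired surrogates, passing half of a pair by mistake yields false here instead of the misleading index 0 shown above.

```java
final class CodePointSearch {
    // True if text contains the given code point as a whole character.
    static boolean containsCodePoint(String text, int codePoint) {
        return text.codePoints().anyMatch(cp -> cp == codePoint);
    }

    public static void main(String[] args) {
        String text = "\uD83D\uDE00and\uD83D\uDE00"; // U+1F600, "and", U+1F600
        System.out.println(containsCodePoint(text, 0x1F600)); // true
        System.out.println(containsCodePoint(text, 0x1F602)); // false
        // A lone high surrogate never matches, since it is only half of a pair here
        System.out.println(containsCodePoint(text, "\uD83D\uDE02".charAt(0))); // false
    }
}
```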
Semantics
char can handle most practical use cases. Still, it's better to use code points within the programming environment for future extensibility.
Code points should handle nearly all of the technical use cases around encodings.
Still, grapheme clusters fall outside the scope of the code point level of abstraction.
Storage layers can choose the char interface if ints are too costly (doubled). Unless storage cost is the only metric, it's still better to use code points. Also, it's better to treat storage as bytes and delegate semantics to business logic built around storage.
Semantics can be abstracted at multiple levels. The code point should become the lowest level of interface, and other semantics can be built around code points in the runtime environment.
package com;
public class _index {
public static void main(String[] args) {
String s1="be proud to be an indian";
char ch=s1.charAt(s1.indexOf('e'));
int count = 0;
for(int i=0;i<s1.length();i++) {
if(s1.charAt(i)=='e'){
System.out.println("number of E:=="+ch);
count++;
}
}
System.out.println("Total count of E:=="+count);
}
}
static String removeOccurences(String a, String b)
{
StringBuilder s2 = new StringBuilder(a);
for(int i=0;i<b.length();i++){
char ch = b.charAt(i);
System.out.println(ch+" first index"+a.indexOf(ch));
// keep deleting ch until indexOf reports -1; index 0 is a valid match
for(int k=s2.indexOf(String.valueOf(ch));k >= 0;k=s2.indexOf(String.valueOf(ch))){
s2.deleteCharAt(k);
System.out.println("val of s2 : "+s2.toString());
}
}
System.out.println(s2.toString());
return s2.toString();
}
You can use this code. It checks whether the char is present: if it is, indexOf returns a value >= 0, otherwise it returns -1. Here I am printing the letters of the alphabet that are not present in the input.
import java.util.Scanner;
public class Test {
public static void letters()
{
System.out.println("Enter input char");
Scanner sc = new Scanner(System.in);
String input = sc.next();
System.out.println("Output : ");
for (char alphabet = 'A'; alphabet <= 'Z'; alphabet++) {
if(input.toUpperCase().indexOf(alphabet) < 0)
System.out.print(alphabet + " ");
}
}
public static void main(String[] args) {
letters();
}
}
// Output example
Enter input char
nandu
Output :
B C E F G H I J K L M O P Q R S T V W X Y Z
If you see the source code of indexOf in JAVA:
public int indexOf(int ch, int fromIndex) {
final int max = value.length;
if (fromIndex < 0) {
fromIndex = 0;
} else if (fromIndex >= max) {
// Note: fromIndex might be near -1>>>1.
return -1;
}
if (ch < Character.MIN_SUPPLEMENTARY_CODE_POINT) {
// handle most cases here (ch is a BMP code point or a
// negative value (invalid code point))
final char[] value = this.value;
for (int i = fromIndex; i < max; i++) {
if (value[i] == ch) {
return i;
}
}
return -1;
} else {
return indexOfSupplementary(ch, fromIndex);
}
}
you can see it uses a for loop for finding a character. Note that each indexOf you may use in your code, is equal to one loop.
So, it is unavoidable to use loop for a single character.
However, if you want to find a more complex string that can take different forms, use a library such as java.util.regex, which uses more powerful algorithms to match a character or string pattern with regular expressions. For example, to find an email in a string:
String regex = "^(.+)@(.+)$";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(email);
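A sketch of how the matcher might then be used, assuming the intended separator in the pattern is '@'. The class name EmailSplit and the method split are hypothetical, and the pattern is deliberately loose: it does not validate real email addresses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailSplit {
    // Something, an '@', then something more.
    private static final Pattern SIMPLE_EMAIL = Pattern.compile("^(.+)@(.+)$");

    // Returns {local part, domain}, or an empty array when there is no match.
    static String[] split(String email) {
        Matcher matcher = SIMPLE_EMAIL.matcher(email);
        if (!matcher.matches()) {
            return new String[0];
        }
        return new String[] { matcher.group(1), matcher.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("user@example.com");
        System.out.println(parts[0] + " at " + parts[1]); // user at example.com
    }
}
```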
If you don't want to use regex, just use a loop with charAt and try to cover all cases in one loop.
Be careful: recursive methods have more overhead than loops, so they are not recommended.
How about using this? (Note that this snippet is JavaScript, not Java; the Java counterpart of includes is String.contains.)
let text = "Hello world, welcome to the universe.";
let result = text.includes("world");
console.log(result); // true
The result will be true or false.
This always works for me.
You won't be able to check whether a char appears in a string without going over the string at least once using a loop or recursion (built-in methods such as indexOf also use a loop).
If the number of lookups is much larger than the length of the string, then I would recommend using a Set data structure, as that would be more efficient than simply using indexOf.
String s = "abc";
// Build a set so we can check if character exists in constant time O(1)
Set<Character> set = new HashSet<>();
int len = s.length();
for(int i = 0; i < len; i++) set.add(s.charAt(i));
// Now we can check without the need of a loop
// contains method of set doesn't use a loop unlike string's contains method
set.contains('a'); // true
set.contains('z'); // false
Using set you will be able to check if character exists in a string in constant time O(1) but you will also use additional memory ( Space complexity will be O(n) ).