Android toLowerCase() issue with accented characters

My app has a feature to filter content based on some keywords.
The filter is case-insensitive, so to make it work I first call String.toLowerCase() on the source content.
The issue arises when the source is in upper case and contains accented characters, as in the French word "INVITÉ".
This word when set to lowercase using the device default locale returns "invité"
The problem is that the last character is not the single lowercase character "é".
Instead, it is a combination of two chars:
"e" (code 101) and
the combining acute accent (code 769, U+0301).
Because of this "invité" does not match "invité"
How can I solve this? I would prefer not to remove accented characters altogether.

You should normalize the string to the composed form (NFC) with java.text.Normalizer after lowercasing it, like this:
String upper = "INVITÉ";
System.out.println(upper + " length=" + upper.length());
String lower = upper.toLowerCase();
System.out.println(lower + " length=" + lower.length());
String normalized = Normalizer.normalize(lower, Normalizer.Form.NFC);
System.out.println(normalized + " length=" + normalized.length());
output:
INVITÉ length=7
invité length=7
invité length=6
It also works for Japanese.
String japanese = "が";
System.out.println(japanese + " length=" + japanese.length());
String normalized = Normalizer.normalize(japanese, Normalizer.Form.NFC);
System.out.println(normalized + " length=" + normalized.length());
output:
が length=2
が length=1
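For the keyword filter in the question, the same normalization can be applied to both the content and the keyword before comparing. The following is only a minimal sketch: the class and method names are hypothetical, and Locale.ROOT is an assumption made here to keep lowercasing independent of the device locale (the question uses the default locale).

```java
import java.text.Normalizer;
import java.util.Locale;

public class AccentFilter {
    // Lowercase first, then compose to NFC so that "e" + U+0301
    // becomes the single character U+00E9.
    public static String normalizeForMatch(String s) {
        String lower = s.toLowerCase(Locale.ROOT); // assumption: locale-independent lowercasing
        return Normalizer.normalize(lower, Normalizer.Form.NFC);
    }

    // Case-insensitive containment check on normalized text.
    public static boolean containsKeyword(String content, String keyword) {
        return normalizeForMatch(content).contains(normalizeForMatch(keyword));
    }

    public static void main(String[] args) {
        String decomposed = "INVITE\u0301"; // 7 chars: E followed by a combining acute accent
        String keyword = "invit\u00E9";     // 6 chars: precomposed e-acute

        // Plain lowercasing keeps the combining accent, so the strings differ.
        System.out.println(decomposed.toLowerCase(Locale.ROOT).equals(keyword)); // false
        // After NFC normalization on both sides, the keyword is found.
        System.out.println(containsKeyword(decomposed, keyword)); // true
    }
}
```

Normalizing both sides matters because the keyword itself may also arrive in decomposed form.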

Related

How to Insert slashes into a string?

I have a birth date number in the format: 890520, so yy/mm/dd.
However, I need to display it separated by slashes, e.g. 89/05/20.
How can I do this, as there is no delimiter with which I can split the string?

String a = "890520";
System.out.println(a.substring(0, 2) + "/" + a.substring(2, 4) + "/" + a.substring(4));
substring() is what you are looking for.
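If the value comes from outside the program, it may be worth checking its shape before slicing it. This is a small sketch built on the same substring() calls; the class name, method name and the six-digit check are assumptions added for illustration.

```java
public class BirthDateFormatter {
    // Turns "890520" into "89/05/20"; rejects anything that is not exactly six digits.
    public static String withSlashes(String yymmdd) {
        if (yymmdd == null || !yymmdd.matches("\\d{6}")) {
            throw new IllegalArgumentException("expected six digits, got: " + yymmdd);
        }
        return yymmdd.substring(0, 2) + "/"
                + yymmdd.substring(2, 4) + "/"
                + yymmdd.substring(4);
    }

    public static void main(String[] args) {
        System.out.println(withSlashes("890520")); // 89/05/20
    }
}
```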

Why does the Java regular expression "|" find a matching substring for any input string?

I am trying to understand why a regular expression ending with "|" (or simply "|" itself) will find a matching substring with start index 0 and end "offset after the last character matched (as per JavaDoc for Matcher)" 0.
The following code demonstrates this:
public static void main(String[] args) {
String regExp = "|";
String toMatch = "A";
Matcher m = Pattern.compile(regExp).matcher(toMatch);
System.out.println("ReqExp: " + regExp +
" found " + toMatch + "(" + m.find() + ") " +
" start: " + m.start() +
" end: " + m.end());
}
Output is:
ReqExp: | found A(true) start: 0 end: 0
I'm confused by the fact that it is even a valid regular expression. And further confused by the fact that start and end are both 0.
Hoping someone can explain this to me.

The pipe in a regular expression means "or." So your regular expression is basically "(empty string) or (empty string)". It successfully finds an empty string at the beginning of the string, and an empty string has a length of 0.
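To see this in action, the sketch below collects every match span instead of only the first one. The class and helper names are hypothetical. It shows that "|" behaves like the empty pattern "": both match the empty string at every position, including the end of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyAlternation {
    // Returns "start-end" for every match of regex in input.
    public static List<String> matchSpans(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        // Both alternatives of "|" are empty, so each match has zero length.
        System.out.println(matchSpans("|", "A")); // [0-0, 1-1]
        // The empty pattern gives the same spans.
        System.out.println(matchSpans("", "A"));  // [0-0, 1-1]
    }
}
```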

Split String To Get Word Separators

I want to find and store all the separators between the words in a sentence, which could be spaces or newlines.
Say I have the following String:
String text = "hello, darkness my old friend.\nI've come to you again\r\nasd\n 123123";
String[] separators = text.split("\\S+");
Output: [, , , , ,
, , , , ,
,
]
So when I split on anything but whitespace, it returns an empty separator first and the rest are good. Why is there an empty string at the start, though?
I would also like to split on periods and commas, but I don't know how to do that so that ".\n" counts as one separator.
Wanted Output for the above String:
separators = {", ", " ", " ", " ", ".\n", " ", " ", " ", " ", "\r\n", "\n "}
or
separators = {",", " ", " ", " ", " ", ".", "\n", " ", " ", " ", " ", "\r\n", "\n "}

Try this:
String[] separators = text.split("[\\w']+");
This defines non-separators as "word chars" and/or apostrophes.
This does leave a leading blank in the result array, which is not possible to avoid, except by removing the leading word first:
String[] separators = text.replaceAll("^[\\w']+", "").split("[\\w']+");
You may consider adding the hyphen to the character class if you want hyphenated words to count as one word, i.e.
String[] separators = text.split("[\\w'-]+");

I think this can also work correctly:
String[] separators = text.split("\\w+");

I think it's easier to use the .find() method to obtain the desired result:
String text = "hello, darkness my old friend.\nI've come to you again\r\nasd\n 123123";
String pat = "[\\s,.]+"; // add all that you need to the character class
Matcher m = Pattern.compile(pat).matcher(text);
List<String> list = new ArrayList<String>();
while( m.find() ) {
list.add(m.group());
}
// the result is already stored in "list" but if you
// absolutely want to store the result in an array, just do:
String[] result = list.toArray(new String[0]);
This way you avoid the empty string problem at the beginning.
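The two approaches can be combined: match runs of everything that is not a word character or apostrophe and collect them with find(). The sketch below uses hypothetical class and method names, and the pattern "[^\\w']+" is an assumption taken as the complement of the word class used in the split() answer. It also shows where the leading empty string comes from: split() puts an empty first element in the array when the input starts with a non-empty match, as it does here with the word "hello".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SeparatorFinder {
    // Collects every run of characters that is not a word char or an apostrophe.
    public static List<String> separators(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("[^\\w']+").matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "hello, darkness my old friend.\nI've come to you again\r\nasd\n 123123";

        List<String> seps = separators(text);
        System.out.println(seps.size());               // 11
        System.out.println(seps.get(0).equals(", "));  // true
        System.out.println(seps.get(4).equals(".\n")); // true

        // split() on the word pattern starts with "" because the text begins with a word.
        String[] viaSplit = text.split("[\\w']+");
        System.out.println(viaSplit[0].isEmpty());     // true
    }
}
```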

Digits are getting deleted when splitting a string

I have a string from which I need to remove all mentioned punctuations and spaces. My code looks as follows:
String s = "s[film] fever(normal) curse;";
String[] spart = s.split("[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~##$%^&\\s+]");
System.out.println("spart[0]: " + spart[0]);
System.out.println("spart[1]: " + spart[1]);
System.out.println("spart[2]: " + spart[2]);
System.out.println("spart[3]: " + spart[3]);
But, I am getting some elements which are blank. The output is:
spart[0]: s
spart[1]: film
spart[2]:
spart[3]: normal

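- is a special character in regex character classes (the rule is the same in Java as in PHP). For instance, [a-z] matches all chars from a to z inclusive. Note that you've got )-_ in your regex, which is read as the range from ) to _ and therefore includes the digits.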

- defines a range inside a character class in the regular expression passed to String.split, so it needs to be escaped:
String[] part = line.toLowerCase().split("[,/?:;\"{}()\\-_+*=|<>!`~##$%^&]");

String[] spart = s.split("[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~##$%^&\\s]+");
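The only change in the line above is that the + has moved outside the brackets. Inside the class, + is just one more literal character, so every single delimiter splits on its own and two adjacent delimiters such as "] " leave an empty string between them. Outside the class, + makes a whole run of delimiters count as one separator. A minimal sketch comparing the two, with hypothetical class and constant names:

```java
import java.util.Arrays;

public class PunctuationSplit {
    // Pattern from the question: '+' sits inside the class and is a literal plus sign.
    public static final String ONE_PER_CHAR = "[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~##$%^&\\s+]";
    // Same class with '+' moved outside: one or more delimiters form one separator.
    public static final String RUNS = "[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~##$%^&\\s]+";

    public static String[] tokens(String s, String regex) {
        return s.split(regex);
    }

    public static void main(String[] args) {
        String s = "s[film] fever(normal) curse;";
        // Adjacent delimiters produce empty elements here.
        System.out.println(Arrays.toString(tokens(s, ONE_PER_CHAR)));
        // Runs of delimiters are collapsed here.
        System.out.println(Arrays.toString(tokens(s, RUNS))); // [s, film, fever, normal, curse]
    }
}
```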

Using a JTextField to get a regular expression from a user. How do I make it see \t as a tab instead of a \ followed by a t

JTextField reSource; //contains the regex expression the user wants to search for
String re=reSource.getText();
Pattern p=Pattern.compile(re,myflags); //myflags defined elsewhere in code
Matcher m=p.matcher(src); //src is the text to search and comes from a JTextArea
while (m.find()==true) {
If the user enters \t it finds \t not tab.
If the user enters \\\t it finds \\\t not tab.
If the user enters [\t] or [\\\t] it finds t not tab.
I want it such that if the user enters \t it finds tab. Of course it also needs to work with \n, \r etc...
If re="\t"; is used instead of re=reSource.getText(); with \t in the JTextField then it finds tabs. How do I get it to work with the contents of the JTextField?

Example:
String src = "This\tis\ta\ttest";
System.out.println("src=\"" + src + '"'); // --> prints "This is a test"
String re="\\t";
System.out.println("re=\"" + re + '"'); // --> prints "\t" - as when you use reSource.getText();
Pattern p = Pattern.compile(re);
Matcher m = p.matcher(src);
while (m.find()) {
System.out.println('"' + m.group() + '"');
}
Output:
src="This is a test"
re="\t"
" "
" "
" "
Try this:
re=re.replace("\\t", "\t");
OR
re=re.replace("\\t", "\\\\t");
I think the problem lies in understanding that when you type:
String str = "\t";
it is actually the same as:
String str = " "; // a literal tab character between the quotes
But if you type:
String str = "\\t";
then System.out.print(str) will print "\t".

Matching \t should work, however, your flags might have a problem.
Here's what works for me:
String src = "A\tBC\tD";
Pattern p=Pattern.compile("\\w\\t\\w"); //simulates the user entering \w\t\w
Matcher m=p.matcher(src);
while (m.find())
{
System.out.println("Match: \"" + m.group(0) + "\"");
}
Output is:
Match: "A B"
Match: "C D"
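The question does not show what myflags contains, so the following is only a guess at how a flag could cause the reported behavior. Pattern.LITERAL is one standard flag that makes the regex engine treat a backslash followed by t as those two literal characters instead of a tab. The class and method names below are hypothetical, and the string "\\t" stands in for the two characters a user would type into the field.

```java
import java.util.regex.Pattern;

public class UserRegexTab {
    // Compiles the user's text with the given flags and reports whether it is found in src.
    public static boolean finds(String userText, int flags, String src) {
        return Pattern.compile(userText, flags).matcher(src).find();
    }

    public static void main(String[] args) {
        String typed = "\\t";          // two characters: backslash, t
        String withTab = "a\tb";       // contains a real tab
        String withBackslashT = "a\\tb"; // contains backslash followed by t

        // No flags: the regex engine itself reads backslash-t as a tab.
        System.out.println(finds(typed, 0, withTab));                      // true
        System.out.println(finds("[\\t]", 0, withTab));                    // true
        // Assumption: with Pattern.LITERAL the same text only matches the literal two characters.
        System.out.println(finds(typed, Pattern.LITERAL, withTab));        // false
        System.out.println(finds(typed, Pattern.LITERAL, withBackslashT)); // true
    }
}
```

If the flags turn out to be fine, printing the length of the string returned by getText() is a quick way to check exactly which characters reach Pattern.compile.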

My experience is that Java Swing JTextField and JTable GUI controls escape user-entered backslashes by prefixing a backslash.
User types two-character sequence "backslash t", control's getText() method returns a String containing the three-character sequence "backslash backslash t". The SO formatter does its own thing with backslashes in text so here it is as code:
Single backslash: input is 2 char sequence \t and return value is 3 char \\t
For three-character input sequence "backsl backsl t", getText() returns the five-character sequence "backsl backsl backsl backsl t". As code:
Double backslash: input is 3 char sequence \\t and return value is 5 char \\\\t
This basically prevents the backslash from modifying the t to yield a character sequence that becomes a tab when interpreted by something like System.out.println.
Conveniently, and surprisingly to me, the regex processor accepts it either way. A pattern containing an actual tab character (written "\t" in Java source) matches a tab, as does a pattern containing a backslash followed by t (written "\\t" in Java source). Please see the demo code below. The System.out calls show which sequences and patterns contain tabs, and in JDK 1.7 both matches yield true.
package my.text;
/**
* Demonstrate use of tab character in regexes
*/
public class RegexForSo {
public static void main(String [] argv) {
final String sequenceTab="x\ty\tz";
final String patternBsTab = "x\t.*";
final String patternBsBsTab = "x\\t.*";
System.out.println("sequence is >" + sequenceTab + "<");
System.out.println("pattern BsTab is >" + patternBsTab + "<");
System.out.println("pattern BsBsTab is >" + patternBsBsTab + "<");
System.out.println("matched BsTab = " + sequenceTab.matches(patternBsTab));
System.out.println("matched BsBsTab = " + sequenceTab.matches(patternBsBsTab));
}
}
Output on my JDK1.7 system is below, tabs in output might not survive SO formatter :)
sequence is >x y z<
pattern BsTab is >x .*<
pattern BsBsTab is >x\t.*<
matched BsTab = true
matched BsBsTab = true
HTH
