Java String.contains() not working on a cyrillic string

Here is what happens.
The user types in "лос ан".
I have a bunch of products whose location is "лос анджелис".
If I do:
String userInput = "лос ан";
for (Product product : products) {
    if (product.getCity().trim().toLowerCase().contains(userInput.trim().toLowerCase())) {
        System.out.println("MATCH");
    }
}
I don't get MATCH.
This works for Latin characters.

Try specifying a Locale in toLowerCase() on both sides of the comparison: http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#toLowerCase(java.util.Locale)
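A minimal sketch of that suggestion. The helper name containsIgnoreCase and the choice of Locale.ROOT are assumptions for illustration, not something from the question. The Cyrillic text is written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.Locale;

public class LocaleContains {
    // Hypothetical helper: case-insensitive substring test that lower-cases
    // both sides with the same explicit locale.
    static boolean containsIgnoreCase(String haystack, String needle, Locale locale) {
        return haystack.trim().toLowerCase(locale)
                .contains(needle.trim().toLowerCase(locale));
    }

    public static void main(String[] args) {
        // "Лос Анджелис" and "лос ан", spelled with Unicode escapes.
        String city = "\u041b\u043e\u0441 \u0410\u043d\u0434\u0436\u0435\u043b\u0438\u0441";
        String userInput = "\u043b\u043e\u0441 \u0430\u043d";
        System.out.println(containsIgnoreCase(city, userInput, Locale.ROOT)); // true
    }
}
```

If this prints true while the original loop does not match, the locale or the source encoding is the likelier culprit rather than String.contains() itself.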

The editor and the compiler (javac -encoding) must use the same encoding.
The compiler encoding is easy to set. The editor's source encoding can be checked with a programmer's editor such as Notepad++ or jEdit, which can switch encodings.
You can also u-escape the Java source text to check this:
String userInput = "\u043b\u043e\u0441 \u0430\u043d";
If the escaped version behaves differently from the literal one, there is a discrepancy between the encodings.
Furthermore, String.toLowerCase(new Locale("ru", "RU")) or similar has already been mentioned.
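One way to see what the compiler actually read from a literal is to dump it back as escapes and compare with the escaped form above. This is only a sketch; the helper toUnicodeEscapes is a hypothetical name introduced here.

```java
public class EscapeDump {
    // Hypothetical helper: renders every UTF-16 code unit of the string
    // as a backslash-u escape with four hex digits.
    static String toUnicodeEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The literal is typed directly, so the output shows how javac decoded it.
        System.out.println(toUnicodeEscapes("лос ан"));
    }
}
```

When the editor and javac agree on the encoding, this should print \u043b\u043e\u0441\u0020\u0430\u043d. Anything else (for example, twice as many code units) points to an encoding mismatch.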

Using jdk 1.8.0_45, the following code gives a match in both cases:
System.out.println("лос анджелис".trim().toLowerCase().contains("лос ан".trim().toLowerCase()));
System.out.println("лос анджелис".trim().toLowerCase(Locale.ROOT).contains("лос ан".trim().toLowerCase(Locale.ROOT)));
As others already mentioned, you may look for a working Locale as argument to String#toLowerCase.

Related

Cucumber: how to define a string with only letters as the Given

I am working with the Cucumber framework. I have written my feature file and run the test runner, which gave me the snippets that have to be implemented. I am a bit confused by one of them: the scenario is that a user types a non-digit string, e.g. "nonumbers".
@Given("The string contains {string}")
public void the_string_contains(String string) {
}
Since I cannot just say string = "^[a-zA-Z]+$";, I am not sure how I should define the string as a non-digit string. As it is the @Given, I am not using Pattern to check whether the string is correctly formatted.

According to the documentation you can use {string} to match single-quoted or double-quoted strings, for example "banana split" or 'banana split' (but not banana split). Only the text between the quotes will be extracted. The quotes themselves are discarded.
Note that Cucumber expressions (like {string}) are available as of Cucumber-jvm v3.x
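Since {string} accepts any quoted text, one option is to validate the captured value inside the step itself. The sketch below is plain Java so it runs without Cucumber; the helper isLettersOnly is a hypothetical name, and the step wiring is shown only as a comment that assumes cucumber-java is on the classpath.

```java
public class LettersOnlyStep {
    // Hypothetical check a step definition could run on its {string} argument:
    // true only if the whole value consists of ASCII letters.
    static boolean isLettersOnly(String value) {
        return value.matches("[a-zA-Z]+");
    }

    // Possible use inside a Cucumber step class (assumption, not compiled here):
    //   @Given("The string contains {string}")
    //   public void the_string_contains(String string) {
    //       if (!isLettersOnly(string)) {
    //           throw new IllegalArgumentException("expected letters only: " + string);
    //       }
    //   }

    public static void main(String[] args) {
        System.out.println(isLettersOnly("nonumbers")); // true
        System.out.println(isLettersOnly("abc123"));    // false
    }
}
```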

For my feature file I used the implementation shown in two screenshots (feature file and Java file).
With that implementation everything executed just fine.

How to remove \u200B (Zero Length Whitespace Unicode Character) from String in Java?

My application uses Spring Integration to poll email from an Outlook mailbox.
Because it receives the String (the email body) from an external system (Outlook), I have no control over it.
For example:
String emailBodyStr = "rejected by sundar14-\u200B.";
Now I am trying to remove the Unicode character \u200B from this String.
What I have tried already:
Try#1:
emailBodyStr = emailBodyStr.replaceAll("\u200B", "");
Try#2:
emailBodyStr = emailBodyStr.replaceAll("\u200B", "").trim();
Try#3 (using Apache Commons):
StringEscapeUtils.unescapeJava(emailBodyStr);
Try#4:
StringEscapeUtils.unescapeJava(emailBodyStr).trim();
Nothing has worked so far.
When I print this String using the code below:
logger.info("Comment BEFORE:{}",emailBodyStr);
logger.info("Comment AFTER :{}",emailBodyStr);
the Eclipse console does NOT print the Unicode character:
Comment BEFORE:rejected by sundar14-.
But the same code prints the Unicode character in the Linux console:
Comment BEFORE:rejected by sundar14-\u200B.
I read some examples where str.replace() is recommended, but note that those examples use JavaScript and PHP, not Java.

Finally, I was able to remove the 'Zero Width Space' character by using a Unicode regex:
String plainEmailBody = emailBodyStr.replaceAll("[\\p{Cf}]", "");
References for finding the category of a Unicode character:
The Character class in Java lists all of these Unicode categories.
Website: http://www.fileformat.info/
Website: http://www.regular-expressions.info/ => Unicode Regular Expressions
Note 1: Because I received this string from an Outlook email body, none of the approaches listed in my question worked.
Note 2: This SO answer helped me learn about Unicode regular expressions.
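As a self-contained sketch of the same idea: \p{Cf} matches the Unicode "format" category, which includes U+200B. The class and method names below are hypothetical.

```java
public class FormatCharStripper {
    // Removes every character in the Unicode "format" category (Cf),
    // which covers the zero width space among others.
    static String stripFormatChars(String s) {
        return s.replaceAll("\\p{Cf}", "");
    }

    public static void main(String[] args) {
        String emailBodyStr = "rejected by sundar14-\u200B.";
        String cleaned = stripFormatChars(emailBodyStr);
        System.out.println(emailBodyStr.length()); // 23
        System.out.println(cleaned.length());      // 22
        System.out.println(cleaned.equals("rejected by sundar14-.")); // true
    }
}
```

Note that this removes all format characters, not just the zero width space, so it is broader than the replaceAll attempts in the question.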

non-basic characters in java, how to handle the encoding correctly

When I try to call a method with a parameter in Polish, e.g.
node.call("ąćęasdasdęczć")
I get these characters as input:
Ä?Ä?Ä?asdasdÄ?czÄ
I don't know where to set the correct encoding: in the Maven pom.xml, or in my IDE? I tried changing UTF-8 to ISO_8859-2 in my IDE settings, but it didn't work. I searched for similar questions but didn't find an answer.
#Edit 1
Sample code:
public void findAndSendKeys(String vToSet , By vLocator){
WebElement element;
element = webDriverWait.until(ExpectedConditions.presenceOfElementLocated(vLocator));
element.sendKeys(vToSet);
}
By nameLoc = By.id("First_Name");
findAndSendKeys("ąćęasdasdęczć" , nameLoc );
Then in the input field I get Ä?Ä?Ä?asdasdÄ?czÄ. Converting the string to Basic Latin in my IDE helps, but it's not the solution I need.
I also have problems with fields in classes, e.g. I have a class in which I have to convert a String to Basic Latin:
public class Contacts {
    private static final By LOC_ADDRESS_BTN = By.xpath("//button[contains(@aria-label,'Wybór adresu')]");
    // it doesn't work, I have to use basic latin and replace "ó" with "\u00f3" in my IDE
}
#Edit 2 - Changed encoding, but problem still exists
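The garbled output looks like UTF-8 bytes being decoded with a single-byte charset. Assuming that is the cause here (the question does not confirm it), the effect can be reproduced with a small sketch; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes the text as UTF-8, then decodes the bytes as ISO-8859-1,
    // imitating a tool that reads a UTF-8 source file with the wrong charset.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u0105"); // Polish "a with ogonek"
        System.out.println(garbled.length());                       // 2
        System.out.println(Integer.toHexString(garbled.charAt(0))); // c4
        System.out.println(Integer.toHexString(garbled.charAt(1))); // 85
    }
}
```

One Polish letter becomes two characters, the first of which is Ä (U+00C4), matching the pattern in the question. If that is what is happening, the editor, the compiler (javac -encoding, or the project.build.sourceEncoding property in a Maven build), and the runtime all need to agree on UTF-8.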

Why doesn't this Java regex compile?

I am trying to extract the pass number from strings of any of the following formats:
PassID_132
PassID_64
Pass_298
Pass_16
For this, I constructed the following regex:
Pass[I]?[D]?_([\d]{2,3})
and tested it in Eclipse's search dialog, where it worked fine.
However, when I use it in code, it doesn't match anything. Here's my code snippet:
String idString = filename.replaceAll("Pass[I]?[D]?_([\\d]{2,3})", "$1");
int result = Integer.parseInt(idString);
I also tried
java.util.regex.Pattern.compile("Pass[I]?[D]?_([\\d]{2,3})")
in the Expressions window while debugging, but that says "", whereas
java.util.regex.Pattern.compile("Pass[I]?[D]?_([0-9]{2,3})")
compiled, but didn't match anything. What could be the problem?

Instead of Pass[I]?[D]?_([\d]{2,3}) try this:
Pass(?:I)?(?:D)?_([\d]{2,3})

There's nothing invalid about your regex, but it is needlessly verbose. You don't need character classes around single-character terms. Try this:
"Pass(?:ID)?_(\\d{2,3})"
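Note that replaceAll only substitutes the part of the input that matches, so if the filename contains anything besides the pass ID, the leftover text stays in the result and Integer.parseInt will fail. A sketch that extracts the group with a Matcher instead; the sample filename and the -1 fallback are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PassNumber {
    private static final Pattern PASS_ID = Pattern.compile("Pass(?:ID)?_(\\d{2,3})");

    // Returns the pass number found anywhere in the filename,
    // or -1 (an arbitrary sentinel chosen for this sketch) if there is none.
    static int extractPassNumber(String filename) {
        Matcher m = PASS_ID.matcher(filename);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(extractPassNumber("scan_PassID_132.tif")); // 132
        System.out.println(extractPassNumber("Pass_16"));             // 16
        System.out.println(extractPassNumber("no-id-here"));          // -1
    }
}
```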

Locale appending 'variation' to language and country code

LocaleContext.getLocale() currently returns the locale object as 'en_US_WOL'. I inspected the locale object at a breakpoint: 'en' is the language (English), 'US' is the country code, and 'WOL' is the variant (a field of the Locale object).
How and why is the variant field being appended and returned by getLocale(), and how can I stop that? (LocaleContext is of type ThreadLocal.)

According to http://docs.oracle.com/javase/6/docs/api/java/util/Locale.html
The variant argument is a vendor or browser-specific code. For example, use WIN for Windows, MAC for Macintosh, and POSIX for POSIX. Where there are two variants, separate them with an underscore, and put the most important one first. For example, a Traditional Spanish collation might construct a locale with parameters for language, country and variant as: "es", "ES", "Traditional_WIN".
If you're after a Locale for a specific variant, I presume you can use this constructor:
Locale(String language, String country, String variant)
Or adjust your browser's locale settings (if your application involves a browser at all).

I had a problem with this too. Unfortunately I haven't found any built-in method to nicely output the language-country code without the variant, so I helped myself with this snippet (maybe it will be handy to somebody):
public static String getLanguageCode(Locale locale) {
    StringBuilder sb = new StringBuilder();
    sb.append(locale.getLanguage());
    if (locale.getCountry() != null && locale.getCountry().length() > 0) {
        sb.append("-");
        sb.append(locale.getCountry());
    }
    return sb.toString();
}
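An alternative sketch is to rebuild the Locale from its language and country only, using Locale.Builder (available since Java 7). The method name stripVariant is hypothetical, and the example assumes the language and country codes are well-formed, since the builder throws IllformedLocaleException otherwise.

```java
import java.util.Locale;

public class LocaleWithoutVariant {
    // Returns a copy of the locale that keeps language and country
    // but drops the variant (and any script or extensions).
    static Locale stripVariant(Locale locale) {
        return new Locale.Builder()
                .setLanguage(locale.getLanguage())
                .setRegion(locale.getCountry())
                .build();
    }

    public static void main(String[] args) {
        // Three-argument constructor as mentioned above (deprecated in recent JDKs).
        Locale withVariant = new Locale("en", "US", "WOL");
        Locale plain = stripVariant(withVariant);
        System.out.println(plain);                 // en_US
        System.out.println(plain.toLanguageTag()); // en-US
    }
}
```

This yields a Locale object rather than a string, which may be more convenient if the result is passed on to other locale-aware APIs.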
