Checking if character is a part of Latin alphabet? - java

I need to test whether a character is a letter or a space before moving on with processing. So, I have:
for (char c : take.toCharArray()) {
    if (!(Character.isLetter(c) || Character.isSpaceChar(c)))
        continue;
    data.append(c);
}
Once I examined the data, I saw that it contains characters which look like Unicode representations of characters from outside the Latin alphabet. How can I modify the above code to tighten my conditions so that it only accepts letters in the ranges [a-z] and [A-Z]?
Is regex the way to go, or is there a better (faster) way?

If you specifically want to handle only those 52 characters, then just handle them:
public static boolean isLatinLetter(char c) {
return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}
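As a rough sketch of how this helper could slot into the loop from the question: the class name LatinFilter and the method keepLatinLettersAndSpaces are made up for illustration, and treating only the plain space ' ' as an acceptable space is an assumption.

```java
final class LatinFilter {
    // Same range check as above: only the 52 ASCII letters pass.
    static boolean isLatinLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    // Hypothetical helper: copies ASCII letters and plain spaces, drops everything else.
    static String keepLatinLettersAndSpaces(String take) {
        StringBuilder data = new StringBuilder(take.length());
        for (char c : take.toCharArray()) {
            if (isLatinLetter(c) || c == ' ') {
                data.append(c);
            }
        }
        return data.toString();
    }
}
```

Accented letters such as é, digits and punctuation are all dropped, because none of them fall inside the two ASCII ranges.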

If you just want to strip out everything that is not an ASCII letter, then a quick approach is to use String.replaceAll() with a regex (note that this also removes spaces; use "[^a-zA-Z ]" to keep them):
s.replaceAll("[^a-zA-Z]", "")
Can't say anything about performance vs. a character by character scan and append to StringBuilder, though.

I'd use the regular expression you specified for this. It's easy to read and should be quite speedy (especially if you allocate it statically).
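One possible reading of "allocate it statically" is to compile the pattern once and reuse it, since String.replaceAll compiles its regex on every call. A minimal sketch, where the class name LetterStripper and the method strip are illustrative assumptions:

```java
import java.util.regex.Pattern;

final class LetterStripper {
    // Compiled once when the class loads and reused for every call.
    private static final Pattern NON_LATIN_LETTER = Pattern.compile("[^a-zA-Z]");

    // Removes every character that is not an ASCII letter (spaces included).
    static String strip(String s) {
        return NON_LATIN_LETTER.matcher(s).replaceAll("");
    }
}
```

Whether this is measurably faster than a plain loop depends on the input, so it is worth benchmarking before relying on it.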

Related

What do I do if I want to check for a special character in Java using char?

I want to create a program for checking whether any inputted character is a special character or not. The problem is that I have no idea what to do: either check for special characters or check for the ASCII value. Can anyone tell me if I can just check the numerical ASCII value using an 'if' statement, or if I need to check each special character?
You can use regex (Regular Expressions):
if (String.valueOf(character).matches("[^a-zA-Z0-9]")) {
//Your code
}
The code in the if statement will execute if the character is not alphanumeric. (whitespace will count as a special character.) If you don't want white space to count as a special character, change the string to "[^a-zA-Z0-9\\s]".
Further reading:
JavaDoc for the matches method
An excellent regex tutorial
More info about regex in Java
A regex builder (pointed out by @Wietlol)
You can use isLetter(char c) and isDigit(char c). You could do it like this:
char c;
//assign c in some way
if(!Character.isLetter(c) && !Character.isDigit(c)) {
//do something in case of special character
} else {
//do something for non-special character
}
EDIT: As pointed out in the comments it may be more viable to use isLetterOrDigit(char c) instead.
EDIT2: As ostrichofevil pointed out (which I did not think or know of when I posted the answer), this solution won't restrict "non-special" characters to A-Z, a-z and 0-9, but will include anything that is considered a letter or number in Unicode. This probably makes ostrichofevil's answer a more practical solution in most cases.
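To make the difference described in the two edits concrete, here is a small sketch comparing the Unicode-aware check with an ASCII-only one. The class and method names are hypothetical and only serve the illustration.

```java
final class SpecialCharCheck {
    // Unicode-aware: anything Character treats as a letter or digit is "non-special".
    static boolean isSpecialUnicode(char c) {
        return !Character.isLetterOrDigit(c);
    }

    // ASCII-only: only A-Z, a-z and 0-9 are "non-special".
    static boolean isSpecialAscii(char c) {
        boolean asciiAlnum = (c >= 'A' && c <= 'Z')
                || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9');
        return !asciiAlnum;
    }
}
```

For 'é' (written '\u00e9' in source) the first method reports "not special" while the second reports "special"; for '#' both agree that it is special.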
You can achieve it this way:
char[] specialCh = {'!','@',']','#','$','%','^','&','*'}; // you can specify all special characters in this array
boolean hasSpecialChar = false;
char current = '#'; // the character to test; assign it from your input
for (char c : specialCh) {
    if (current == c) {
        hasSpecialChar = true;
        break; // no need to keep scanning once a match is found
    }
}

How to detect if a string does not contains other languages letters other than English letters? [duplicate]

Consider a line like:
[Hello簲 bye 簲 ]
This line has both Chinese and English letters, which is not what I want. So I want to find out whether a string contains letters from any language other than English. Any ideas?
EDIT
I do not want to solve it with regex. Otherwise I would have tagged it!
https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html
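In class Character, there is this:
getNumericValue(char ch)
Returns the int value that the specified Unicode character represents.
Note, however, that this is the value of the character read as a digit (for example, 'A' through 'Z' map to 10 through 35), not its Unicode code. To get the Unicode value of a char, cast it to int; you can then check whether it falls in the ranges of the English letters (65 to 90 for 'A' to 'Z', 97 to 122 for 'a' to 'z').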
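A short sketch of the range check on the Unicode value of a char, and of how that value differs from what getNumericValue returns. The class CodePointDemo and the method isEnglishLetter are names invented for this example.

```java
final class CodePointDemo {
    // True only for the 52 English letters, judged by their Unicode values.
    static boolean isEnglishLetter(char c) {
        int code = c; // widening a char to int yields its UTF-16 code unit value
        return (code >= 65 && code <= 90) || (code >= 97 && code <= 122);
    }

    public static void main(String[] args) {
        System.out.println((int) 'A');                      // 65
        System.out.println(Character.getNumericValue('A')); // 10
        System.out.println(isEnglishLetter('\u4e2d'));      // false (a Chinese character)
    }
}
```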
If you don't want to use regexp, you can use the code below:
String str = "Hello簲 bye 簲";
boolean isValid = true;
for (char c : str.toCharArray()) {
if (!(c >= 'a' && c <= 'z') && !(c >= 'A' && c <= 'Z')) {
isValid = false;
break;
}
}
System.out.println(isValid);
You can make use of ASCII values of all English characters in this program - digits, upper case and lower case alphabets (and also, blank spaces must be checked).
The logic: Iterate through each character of the String and check if the current character is an English character, i.e., its ASCII value lies between 48 and 57 (for numbers 0 - 9), 65 and 90 (for upper case alphabets) or 97 and 122 (for lower case alphabets) or is a blank space. If it's not any of these, then it's a non English character.
Here's the code:
String s = "abcdEFGH 745401 妈妈"; // the string to check
int illegal = 0; //to count no. of non english characters
for(int i=0; i< s.length(); i++){
int c = (int)s.charAt(i);
if(!((c>=48 && c<=57)||(c>=65 && c<=90)||(c>=97 && c<=122)||((char)c == ' ')))
illegal++;
}
if(illegal > 0)
System.out.print("String contains non english characters");
else
System.out.print("String does not contain non english characters");
NOTE: Make sure the Chinese characters actually reach the program intact. Java strings are Unicode internally (UTF-16), so the check itself handles them; what can go wrong is the encoding of the source file or of the input. Save the source in a Unicode encoding such as UTF-8 and compile with the matching setting (for example, javac -encoding UTF-8).
I tested this code on my computer with the following test data:
String s = "abcdEFGH 745401 妈妈";
and I got the correct output because my source file was read as Unicode. If the encoding is misread, the Chinese letters 妈妈 can turn into other characters such as ?????? before the check ever runs, so the program tests a different string from the one you typed and its answer may be misleading. Take care of this if you're running the program in an online terminal/IDE or on a mobile phone; on a Windows or Mac computer with a Unicode-aware editor it is usually not a problem.
I hope it helps you.

How would I use regex to allow certain characters?

I am mainly using regex. Essentially, my code sends a client return code if the username does not match the regex. My problem is that I do not know how to allow spaces.
Currently this is my code. I would like to allow a space, a-z, A-Z and 0-9.
if (username.length() < 1 || username.length() >= 13
|| !username.matches("[a-zA-Z_0-9]"))
{
session.getLoginPackets().sendClientPacket(3);
return;
}
The regex you're looking for is [a-zA-Z_0-9][a-zA-Z_0-9 ]* assuming you don't want a name to start with spaces.
I am quite sure you want to be Unicode compliant, so you should use
[\p{L}\p{Nd}][\p{L}\p{Nd} ]*
I created two character classes to ensure that it is not starting with a space, if this check is not needed, just remove the first class and change the quantifier of the second to a +.
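A minimal sketch of the Unicode-aware pattern in Java, assuming it is applied with a full match. The class name UnicodeUsername and the method isValid are placeholders, and the length limits from the question are left out on purpose.

```java
import java.util.regex.Pattern;

final class UnicodeUsername {
    // The first class has no space in it, so the name cannot start with one.
    private static final Pattern NAME =
            Pattern.compile("[\\p{L}\\p{Nd}][\\p{L}\\p{Nd} ]*");

    static boolean isValid(String name) {
        return NAME.matcher(name).matches(); // matches() tests the whole string
    }
}
```

Note that the underscore is neither a letter nor a decimal digit, so unlike [a-zA-Z_0-9] this pattern rejects it.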
From regular-expressions.info
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{L} or \p{Letter}: any kind of letter from any language.
More about Unicode in Java
Use the \w metasequence (letters, digits, underscore), plus a space (or \s to match tabs too), in a character class:
var pattern = @"^[\w ]{1,12}$"; // C#; with the anchors it validates the length as well
Edit: this seems to work for single spacing only, but does not validate the length:
var pattern = @"^(\w+\s?)+$";
Try this one
if (username.length() < 1 || username.length() >= 13
|| !username.matches("[a-zA-Z0-9 ]+"))
{
session.getLoginPackets().sendClientPacket(3);
return;
}
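The reason the original pattern rejected every name longer than one character is that String.matches tests the whole string, so a bare character class only accepts a single character. A small sketch of the corrected check, with UsernameCheck and isValid as illustrative names:

```java
final class UsernameCheck {
    // Accepts 1 to 12 characters drawn from letters, digits and spaces.
    static boolean isValid(String username) {
        return username.length() >= 1
                && username.length() < 13
                && username.matches("[a-zA-Z0-9 ]+"); // '+' lets the class repeat
    }
}
```

For example, "ab".matches("[a-zA-Z_0-9]") is false even though both characters are allowed, while "ab".matches("[a-zA-Z_0-9]+") is true.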

How can I allow one space, A-Z, a-z, 0-9 in regex?

I am mainly using regex. Essentially, my code sends a client return code if the username does not match the regex. My problem is that I do not know how to allow spaces.
Currently this is my code. I would like to allow a space, a-z, A-Z and 0-9.
if (username.length() < 1 || username.length() >= 13
|| !username.matches("[a-zA-Z_0-9]"))
{
session.getLoginPackets().sendClientPacket(3);
return;
}
It depends on the specific regex class you're using as to what the magic sequences are, but usually either \s or the POSIX class [:space:] will work (Java spells the latter \p{Space}). For languages where a space in the regex isn't ignored, you can just put the space in directly: [a-zA-Z_0-9 ] will also work.
The biggest thing missing is the repetition of the regex. For example:
if (username.length() < 1 || username.length() >= 13 || !username.matches("^[a-zA-Z_0-9 ]+$")) {
session.getLoginPackets().sendClientPacket(3);
return;
}
The space character can go anywhere inside the character set; the end is simply the easiest place to spot it (which I think is mostly what you were asking). The other symbols:
* '^' is 'the beginning of the entire string'
* '$' is 'the end of the string' (unless there are newlines...)
* '+' is 'what's in the [...] character set, at least once'
So, add the space at the end of [ ] and use a '+' at the end, and you should have it.
Worth noting you can do everything within the regex, e.g.:
if (!username.matches("^[a-zA-Z0-9_ ]{1,12}$")) {
    session.getLoginPackets().sendClientPacket(3);
    return;
}
The {1,12} is a bound saying "at least once, at most 12 times" (inclusive), which matches the original check that rejects lengths of 13 or more.
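A sketch of folding the length check into the regex, assuming the intended maximum is 12 characters (the original condition rejects a length of 13 or more). BoundedUsername and isValid are hypothetical names.

```java
import java.util.regex.Pattern;

final class BoundedUsername {
    // {1,12}: at least one and at most twelve characters from the class.
    private static final Pattern NAME = Pattern.compile("[a-zA-Z0-9_ ]{1,12}");

    static boolean isValid(String username) {
        // matches() already anchors at both ends, so ^ and $ are optional here.
        return NAME.matcher(username).matches();
    }
}
```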
I should also point out a common pitfall: [A-z] is not a shortcut for any upper- or lower-case letter, because the range also covers the six ASCII characters between 'Z' and 'a' ([, \, ], ^, _ and `). Use [a-zA-Z], or \w for [a-zA-Z_0-9].
--
EDIT:
After several comments re: the 'single space', I have to admit I am still not reading the requirement that way.
If the trick is 'only allows one space', this should work:
if (username.length() < 1 || username.length() >= 13 || !username.matches("^[a-zA-Z0-9_]*\\s[a-zA-Z0-9_]*$")) {
session.getLoginPackets().sendClientPacket(3);
return;
}
Basically, you retain the size boundaries originally, then ensure it is made up of groups of letters, numbers, and underscore, with exactly one space.
Try this: [a-zA-Z_0-9]*( )?[a-zA-Z_0-9]*
This allows exactly one or no spaces within every combination of the characters a-z,A-Z,_,0-9.
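Try
!username.matches("[a-zA-Z_0-9 ]+")
or
!username.matches("[a-zA-Z_0-9\\s]+")
The reason \s is better is that it includes all the whitespace characters, e.g. tabs. In a Java string literal the backslash must be doubled, and the + is needed because matches() tests the whole string.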
Try this
^[a-zA-Z_0-9]+(?: [a-zA-Z_0-9]+)?$
The string starts with at least one alphanumeric character, optionally followed by a space and one or more alphanumerics, up to the end of the string.
See it here on Regexr
Since [a-zA-Z_0-9] is equivalent to \w you can simplify it to
^\w+(?: \w+)?$
If you want to be Unicode compliant, you should use the option Pattern.UNICODE_CHARACTER_CLASS; see here for more details:
Enables the Unicode version of Predefined character classes and POSIX character classes.
This means that \w then matches Unicode letters and digits, not just the ASCII ones.
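To illustrate the effect of the flag, here is a sketch that compiles the same regex with and without Pattern.UNICODE_CHARACTER_CLASS. The class name UnicodeWord and its two methods are assumptions made for the example.

```java
import java.util.regex.Pattern;

final class UnicodeWord {
    private static final String REGEX = "^\\w+(?: \\w+)?$";

    // Default behaviour: \w is [a-zA-Z_0-9].
    private static final Pattern ASCII_WORDS = Pattern.compile(REGEX);

    // With the flag, \w also covers letters and digits outside ASCII.
    private static final Pattern UNICODE_WORDS =
            Pattern.compile(REGEX, Pattern.UNICODE_CHARACTER_CLASS);

    static boolean matchesAscii(String s) {
        return ASCII_WORDS.matcher(s).matches();
    }

    static boolean matchesUnicode(String s) {
        return UNICODE_WORDS.matcher(s).matches();
    }
}
```

A name such as "José María" fails the default pattern but passes the flagged one; both still reject two consecutive spaces.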

Java, Make sure a String contains only alphanumeric, spaces and dashes

In Java, I need to make sure a String only contains alphanumeric, space and dash characters.
I found the class org.apache.commons.lang.StringUtils and the almost adequate method isAlphanumericSpace(String)... but I also need to include dashes.
What is the best way to do this? I don't want to use Regular Expressions.
You could use:
StringUtils.isAlphanumericSpace(string.replace('-', ' '));
Hum... just program it yourself using String.charAt(int), it's pretty easy...
Iterate through all char in the string using a position index, then compare it using the fact that ASCII characters 0 to 9, a to z and A to Z use consecutive codes, so you only need to check that character x numerically verifies one of the conditions:
between '0' and '9'
between 'a' and 'z'
between 'A' and 'Z'
a space ' '
a hyphen '-'
Here is a basic code sample (using CharSequence, which lets you pass a String but also a StringBuilder as arg):
public boolean isValidChar(CharSequence seq) {
int len = seq.length();
for(int i=0;i<len;i++) {
char c = seq.charAt(i);
// Test for all positive cases
if('0'<=c && c<='9') continue;
if('a'<=c && c<='z') continue;
if('A'<=c && c<='Z') continue;
if(c==' ') continue;
if(c=='-') continue;
// ... insert more positive character tests here
// If we get here, we had an invalid char, fail right away
return false;
}
// All seen chars were valid, succeed
return true;
}
Just iterate through the string, using the character-class methods in java.lang.Character to test whether each character is acceptable or not. Which is presumably all that the StringUtils methods do, and regular expressions are just a way of driving a generalised engine to do much the same.
You have 1 of 2 options:
1. Compose a list of chars that CAN be in the string, then loop over the string checking to make sure each character IS in the list.
2. Compose a list of chars that CANNOT be in the string, then loop over the string checking to make sure each character IS NOT in the list.
Choose whatever option is quicker to compose the list.
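Option 1 could be sketched as follows, using a String as the allow-list and indexOf for the membership test. The class AllowList, the constant ALLOWED and the method containsOnlyAllowed are all hypothetical names, and the list contents simply mirror the question (alphanumerics, space, dash).

```java
final class AllowList {
    // Hypothetical allow-list: every character that may appear in the string.
    private static final String ALLOWED =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 -";

    static boolean containsOnlyAllowed(CharSequence seq) {
        for (int i = 0; i < seq.length(); i++) {
            if (ALLOWED.indexOf(seq.charAt(i)) < 0) {
                return false; // this character is not in the list
            }
        }
        return true; // every character was found in the list
    }
}
```

Option 2 is the same loop with the test inverted against a list of forbidden characters. An empty string passes this check, so add a length test if that matters.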
Definitely use a regular expression. There's no point in writing your own system when a very comprehensive system is already in place for this exact task. If you need to learn about or brush up on regex, then check out this website, it's great: http://regexr.com
I would challenge yourself on this one.
