How to check if String contains Latin letters without regex

How to check if String contains Latin letters without regex - java

I want to check if String contains only Latin letters but also can contains numbers and other symbols like: _/+), etc.
String utm_source=google should pass, utm_source=google&2019_and_2020! should pass too. But utm_ресурс=google should not pass (coz cyrillic letters). I know code with regex, but how can i do it without using regex and classic for loop, maybe with Streams and Character class?

Use this code
public static boolean isValidUsAscii (String s) {
return Charset.forName("US-ASCII").newEncoder().canEncode(s);
}

For restricted "latin" (no é etcetera), it must be either US-ASCII (7 bits), or ISO-8859-1 but without accented letters.
boolean isBasicLatin(String s) {
return s.codePoints().allMatch(cp -> cp < 128 || (cp < 256 && !isLetter(cp)));
}

Less of a neat single line approach but really all you need to do is check whether the numeric value of the character is within certain limits like so:
public boolean isQwerty(String text) {
int length = text.length();
for(int i = 0; i < length; i++) {
char character = text.charAt(i);
int ascii = character;
if(ascii<32||ascii>126) {
return false;
}
}
return true;
}
Test Run
ä returns false
abc returns true

Related

Removing duplicate same characters in a row

I am trying to create a method which will either remove all duplicates from a string or only keep the same 2 characters in a row based on a parameter.
For example:
helllllllo -> helo
or
helllllllo -> hello - This keeps double letters
Currently I remove duplicates by doing:
private String removeDuplicates(String word) {
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < word.length(); i++) {
char letter = word.charAt(i);
if (buffer.length() == 0 && letter != buffer.charAt(buffer.length() - 1)) {
buffer.append(letter);
}
}
return buffer.toString();
}
If I want to keep double letters I was thinking of having a method like private String removeDuplicates(String word, boolean doubleLetter)
When doubleLetter is true it will return hello not helo
I'm not sure of the most efficient way to do this without duplicating a lot of code.

why not just use a regex?
public class RemoveDuplicates {
public static void main(String[] args) {
System.out.println(new RemoveDuplicates().result("hellllo", false)); //helo
System.out.println(new RemoveDuplicates().result("hellllo", true)); //hello
}
public String result(String input, boolean doubleLetter){
String pattern = null;
if(doubleLetter) pattern = "(.)(?=\\1{2})";
else pattern = "(.)(?=\\1)";
return input.replaceAll(pattern, "");
}
}
(.) --> matches any character and puts in group 1.
?= --> this is called a positive lookahead.
?=\\1 --> positive lookahead for the first group
So overall, this regex looks for any character that is followed (positive lookahead) by itself. For example aa or bb, etc. It is important to note that only the first character is part of the match actually, so in the word 'hello', only the first l is matched (the part (?=\1) is NOT PART of the match). So the first l is replaced by an empty String and we are left with helo, which does not match the regex
The second pattern is the same thing, but this time we look ahead for TWO occurrences of the first group, for example helllo. On the other hand 'hello' will not be matched.
Look here for a lot more: Regex
P.S. Fill free to accept the answer if it helped.

try
String s = "helllllllo";
System.out.println(s.replaceAll("(\\w)\\1+", "$1"));
output
helo

Taking this previous SO example as a starting point, I came up with this:
String str1= "Heelllllllllllooooooooooo";
String removedRepeated = str1.replaceAll("(\\w)\\1+", "$1");
System.out.println(removedRepeated);
String keepDouble = str1.replaceAll("(\\w)\\1{2,}", "$1");
System.out.println(keepDouble);
It yields:
Helo
Heelo
What it does:
(\\w)\\1+ will match any letter and place it in a regex capture group. This group is later accessed through the \\1+. Meaning that it will match one or more repetitions of the previous letter.
(\\w)\\1{2,} is the same as above the only difference being that it looks after only characters which are repeated more than 2 times. This leaves the double characters untouched.
EDIT:
Re-read the question and it seems that you want to replace multiple characters by doubles. To do that, simply use this line:
String keepDouble = str1.replaceAll("(\\w)\\1+", "$1$1");

Try this, this will be most efficient way[Edited after comment]:
public static String removeDuplicates(String str) {
int checker = 0;
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < str.length(); ++i) {
int val = str.charAt(i) - 'a';
if ((checker & (1 << val)) == 0)
buffer.append(str.charAt(i));
checker |= (1 << val);
}
return buffer.toString();
}
I am using bits to identify uniqueness.
EDIT:
Whole logic is that if a character has been parsed then its corrresponding bit is set and next time when that character comes up then it will not be added in String Buffer the corresponding bit is already set.

Java function to return if string contains illegal characters

I have the following characters that I would like to be considered "illegal":
~, #, #, *, +, %, {, }, <, >, [, ], |, “, ”, \, _, ^
I'd like to write a method that inspects a string and determines (true/false) if that string contains these illegals:
public boolean containsIllegals(String toExamine) {
return toExamine.matches("^.*[~##*+%{}<>[]|\"\\_^].*$");
}
However, a simple matches(...) check isn't feasible for this. I need the method to scan every character in the string and make sure it's not one of these characters. Of course, I could do something horrible like:
public boolean containsIllegals(String toExamine) {
for(int i = 0; i < toExamine.length(); i++) {
char c = toExamine.charAt(i);
if(c == '~')
return true;
else if(c == '#')
return true;
// etc...
}
}
Is there a more elegant/efficient way of accomplishing this?

You can make use of Pattern and Matcher class here. You can put all the filtered character in a character class, and use Matcher#find() method to check whether your pattern is available in string or not.
You can do it like this: -
public boolean containsIllegals(String toExamine) {
Pattern pattern = Pattern.compile("[~##*+%{}<>\\[\\]|\"\\_^]");
Matcher matcher = pattern.matcher(toExamine);
return matcher.find();
}
find() method will return true, if the given pattern is found in the string, even once.
Another way that has not yet been pointed out is using String#split(regex). We can split the string on the given pattern, and check the length of the array. If length is 1, then the pattern was not in the string.
public boolean containsIllegals(String toExamine) {
String[] arr = toExamine.split("[~##*+%{}<>\\[\\]|\"\\_^]", 2);
return arr.length > 1;
}
If arr.length > 1, that means the string contained one of the character in the pattern, that is why it was splitted. I have passed limit = 2 as second parameter to split, because we are ok with just single split.

I need the method to scan every character in the string
If you must do it character-by-character, regexp is probably not a good way to go. However, since all characters on your "blacklist" have codes less than 128, you can do it with a small boolean array:
static final boolean blacklist[] = new boolean[128];
static {
// Unassigned elements of the array are set to false
blacklist[(int)'~'] = true;
blacklist[(int)'#'] = true;
blacklist[(int)'#'] = true;
blacklist[(int)'*'] = true;
blacklist[(int)'+'] = true;
...
}
static isBad(char ch) {
return (ch < 128) && blacklist[(int)ch];
}

Use a constant for avoids recompile the regex in every validation.
private static final Pattern INVALID_CHARS_PATTERN =
Pattern.compile("^.*[~##*+%{}<>\\[\\]|\"\\_].*$");
And change your code to:
public boolean containsIllegals(String toExamine) {
return INVALID_CHARS_PATTERN.matcher(toExamine).matches();
}
This is the most efficient way with Regex.

If you can't use a matcher, then you can do something like this, which is cleaner than a bunch of different if statements or a byte array.
for(int i = 0; i < toExamine.length(); i++) {
char c = toExamine.charAt(i);
if("~##*+%{}<>[]|\"_^".contains(c)){
return true;
}
}

Try the negation of a character class containing all the blacklisted characters:
public boolean containsIllegals(String toExamine) {
return toExamine.matches("[^~##*+%{}<>\\[\\]|\"\\_^]*");
}
This will return true if the string contains illegals (your original function seemed to return false in that case).
The caret ^ just to the right of the opening bracket [ negates the character class. Note that in String.matches() you don't need the anchors ^ and $ because it automatically matches the whole string.

A pretty compact way of doing this would be to rely on the String.replaceAll method:
public boolean containsIllegal(final String toExamine) {
return toExamine.length() != toExamine.replaceAll(
"[~##*+%{}<>\\[\\]|\"\\_^]", "").length();
}

How do I recognize a character such as "ç" as a letter?

I have an array of bytes that contains a sentence. I need to convert the lowercase letters on this sentence into uppercase letters. Here is the function that I did:
public void CharUpperBuffAJava(byte[] word) {
for (int i = 0; i < word.length; i++) {
if (!Character.isUpperCase(word[i]) && Character.isLetter(word[i])) {
word[i] -= 32;
}
}
return cchLength;
}
It will work fine with sentences like: "a glass of water". The problem is it must work with all ANSI characters, which includes "ç,á,é,í,ó,ú" and so on. The method Character.isLetter doesn't work with these letters and, therefore, they are not converted into uppercase.
Do you know how can I identify these ANSI characters as a letter in Java?
EDIT
If someone wants to know, I did method again after the answers and now it looks like this:
public static int CharUpperBuffAJava(byte[] lpsz, int cchLength) {
String value;
try {
value = new String(lpsz, 0, cchLength, "Windows-1252");
String upperCase = value.toUpperCase();
byte[] bytes = upperCase.getBytes();
for (int i = 0; i < cchLength; i++) {
lpsz[i] = bytes[i];
}
return cchLength;
} catch (UnsupportedEncodingException e) {
return 0;
}
}

You need to "decode" the byte[] into a character string. There are several APIs to do this, but you must specify the character encoding that is use for the bytes. The overloaded versions that don't use an encoding will give different results on different machines, because they use the platform default.
For example, if you determine that the bytes were encoded with Windows-1252 (sometimes referred to as ANSI).
String s = new String(bytes, "Windows-1252");
String upper = s.toUpperCase();

Convert the byte array into a string, supporting the encoding. Then call toUpperCase(). Then, you can call getBytes() on the string if you need it as a byte array after capitalizing.

Can't you simply use:
String s = new String(bytes, "cp1252");
String upper = s.toUpperCase(someLocale);

Wouldn't changing the character set do the trick before conversion? The internal conversion logic of Java might work fine. Something like http://www.exampledepot.com/egs/java.nio.charset/ConvertChar.html, but use ASCII as the target character set.

I am looking at this table:
http://slayeroffice.com/tools/ascii/
But anything > 227 appears to be a letter, but to make it upper case you would subtract 27 from the ASCII value.

How do I check for non-word characters within a single word in Java?

I want to know if a String such as "equi-distant" or "they're" contains a non-word character. Is there a simple way to check for it?

Solution without regex (generally faster for a very simple check like this):
public static boolean hasNonWordCharacter(String s) {
char[] a = s.toCharArray();
for (char c : a) {
if (!Character.isLetter(c)) {
return true;
}
}
return false;
}

It depends entirely on what you mean by "word character".
If by "word character" you mean A-Z or a-z then you can use this:
bool containsNonWordCharacter = s.matches(".*[^A-Za-z].*");
If you mean "any character that is considered to be a letter in Unicode", then look at Character.isLetter instead.
This is code provided by bobbymcr nearly works:
public static boolean hasNonWordCharacter(String s) {
char[] a = s.toCharArray();
for (char c : a) {
if (!Character.isLetter(c)) {
return true;
}
}
return false;
}
However see the documentation:
Note: This method cannot handle supplementary characters. To support all Unicode characters, including supplementary characters, use the isLetter(int) method.
This should work for all Unicode characters:
public static boolean hasNonWordCharacter(String s) {
int offset = 0, strLen = str.length();
while (offset < strLen) {
int curChar = str.codePointAt(offset);
offset += Character.charCount(curChar);
if (!Character.isLetter(curChar)) {
return true;
}
}
return false;
}

I like the non-regex way. But with regex it could be written like this-
private static boolean containsNonWord(String toCheck) {
Pattern p = Pattern.compile("\\w*");
return !p.matcher(toCheck).matches();
}

Java regular expression \w does not support unicode. \b does support unicode under java. I think that most flavours of regex support the standard \w notation [A-Za-z0-9_]. Also isLetter only returns letters and not numbers and underscore...so that does not work for "word characters" under regular expression...Maybe Java has changed since?

How can I check if a single character appears in a string?

In Java is there a way to check the condition:
"Does this single character appear at all in string x"
without using a loop?

You can use string.indexOf('a').
If the char a is present in string :
it returns the the index of the first occurrence of the character in
the character sequence represented by this object, or -1 if the
character does not occur.

String.contains() which checks if the string contains a specified sequence of char values
String.indexOf() which returns the index within the string of the first occurence of the specified character or substring (there are 4 variations of this method)

I'm not sure what the original poster is asking exactly. Since indexOf(...) and contains(...) both probably use loops internally, perhaps he's looking to see if this is possible at all without a loop? I can think of two ways off hand, one would of course be recurrsion:
public boolean containsChar(String s, char search) {
if (s.length() == 0)
return false;
else
return s.charAt(0) == search || containsChar(s.substring(1), search);
}
The other is far less elegant, but completeness...:
/**
* Works for strings of up to 5 characters
*/
public boolean containsChar(String s, char search) {
if (s.length() > 5) throw IllegalArgumentException();
try {
if (s.charAt(0) == search) return true;
if (s.charAt(1) == search) return true;
if (s.charAt(2) == search) return true;
if (s.charAt(3) == search) return true;
if (s.charAt(4) == search) return true;
} catch (IndexOutOfBoundsException e) {
// this should never happen...
return false;
}
return false;
}
The number of lines grow as you need to support longer and longer strings of course. But there are no loops/recurrsions at all. You can even remove the length check if you're concerned that that length() uses a loop.

You can use 2 methods from the String class.
String.contains() which checks if the string contains a specified sequence of char values
String.indexOf() which returns the index within the string of the first occurence of the specified character or substring or returns -1 if the character is not found (there are 4 variations of this method)
Method 1:
String myString = "foobar";
if (myString.contains("x") {
// Do something.
}
Method 2:
String myString = "foobar";
if (myString.indexOf("x") >= 0 {
// Do something.
}
Links by: Zach Scrivena

String temp = "abcdefghi";
if(temp.indexOf("b")!=-1)
{
System.out.println("there is 'b' in temp string");
}
else
{
System.out.println("there is no 'b' in temp string");
}

If you need to check the same string often you can calculate the character occurrences up-front. This is an implementation that uses a bit array contained into a long array:
public class FastCharacterInStringChecker implements Serializable {
private static final long serialVersionUID = 1L;
private final long[] l = new long[1024]; // 65536 / 64 = 1024
public FastCharacterInStringChecker(final String string) {
for (final char c: string.toCharArray()) {
final int index = c >> 6;
final int value = c - (index << 6);
l[index] |= 1L << value;
}
}
public boolean contains(final char c) {
final int index = c >> 6; // c / 64
final int value = c - (index << 6); // c - (index * 64)
return (l[index] & (1L << value)) != 0;
}}

To check if something does not exist in a string, you at least need to look at each character in a string. So even if you don't explicitly use a loop, it'll have the same efficiency. That being said, you can try using str.contains(""+char).

Is the below what you were looking for?
int index = string.indexOf(character);
return index != -1;

Yes, using the indexOf() method on the string class. See the API documentation for this method

String.contains(String) or String.indexOf(String) - suggested
"abc".contains("Z"); // false - correct
"zzzz".contains("Z"); // false - correct
"Z".contains("Z"); // true - correct
"😀and😀".contains("😀"); // true - correct
"😀and😀".contains("😂"); // false - correct
"😀and😀".indexOf("😀"); // 0 - correct
"😀and😀".indexOf("😂"); // -1 - correct
String.indexOf(int) and carefully considered String.indexOf(char) with char to int widening
"😀and😀".indexOf("😀".charAt(0)); // 0 though incorrect usage has correct output due to portion of correct data
"😀and😀".indexOf("😂".charAt(0)); // 0 -- incorrect usage and ambiguous result
"😀and😀".indexOf("😂".codePointAt(0)); // -1 -- correct usage and correct output
The discussions around character is ambiguous in Java world
can the value of char or Character considered as single character?
No. In the context of unicode characters, char or Character can sometimes be part of a single character and should not be treated as a complete single character logically.
if not, what should be considered as single character (logically)?
Any system supporting character encodings for Unicode characters should consider unicode's codepoint as single character.
So Java should do that very clear & loud rather than exposing too much of internal implementation details to users.
String class is bad at abstraction (though it requires confusingly good amount of understanding of its encapsulations to understand the abstraction 😒😒😒 and hence an anti-pattern).
How is it different from general char usage?
char can be only be mapped to a character in Basic Multilingual Plane.
Only codePoint - int can cover the complete range of Unicode characters.
Why is this difference?
char is internally treated as 16-bit unsigned value and could not represent all the unicode characters using UTF-16 internal representation using only 2-bytes. Sometimes, values in a 16-bit range have to be combined with another 16-bit value to correctly define character.
Without getting too verbose, the usage of indexOf, charAt, length and such methods should be more explicit. Sincerely hoping Java will add new UnicodeString and UnicodeCharacter classes with clearly defined abstractions.
Reason to prefer contains and not indexOf(int)
Practically there are many code flows that treat a logical character as char in java.
In Unicode context, char is not sufficient
Though the indexOf takes in an int, char to int conversion masks this from the user and user might do something like str.indexOf(someotherstr.charAt(0))(unless the user is aware of the exact context)
So, treating everything as CharSequence (aka String) is better
public static void main(String[] args) {
System.out.println("😀and😀".indexOf("😀".charAt(0))); // 0 though incorrect usage has correct output due to portion of correct data
System.out.println("😀and😀".indexOf("😂".charAt(0))); // 0 -- incorrect usage and ambiguous result
System.out.println("😀and😀".indexOf("😂".codePointAt(0))); // -1 -- correct usage and correct output
System.out.println("😀and😀".contains("😀")); // true - correct
System.out.println("😀and😀".contains("😂")); // false - correct
}
Semantics
char can handle most of the practical use cases. Still its better to use codepoints within programming environment for future extensibility.
codepoint should handle nearly all of the technical use cases around encodings.
Still, Grapheme Clusters falls out of the scope of codepoint level of abstraction.
Storage layers can choose char interface if ints are too costly(doubled). Unless storage cost is the only metric, its still better to use codepoint. Also, its better to treat storage as byte and delegate semantics to business logic built around storage.
Semantics can be abstracted at multiple levels. codepoint should become lowest level of interface and other semantics can be built around codepoint in runtime environment.

package com;
public class _index {
public static void main(String[] args) {
String s1="be proud to be an indian";
char ch=s1.charAt(s1.indexOf('e'));
int count = 0;
for(int i=0;i<s1.length();i++) {
if(s1.charAt(i)=='e'){
System.out.println("number of E:=="+ch);
count++;
}
}
System.out.println("Total count of E:=="+count);
}
}

static String removeOccurences(String a, String b)
{
StringBuilder s2 = new StringBuilder(a);
for(int i=0;i<b.length();i++){
char ch = b.charAt(i);
System.out.println(ch+" first index"+a.indexOf(ch));
int lastind = a.lastIndexOf(ch);
for(int k=new String(s2).indexOf(ch);k > 0;k=new String(s2).indexOf(ch)){
if(s2.charAt(k) == ch){
s2.deleteCharAt(k);
System.out.println("val of s2 : "+s2.toString());
}
}
}
System.out.println(s1.toString());
return (s1.toString());
}

you can use this code. It will check the char is present or not. If it is present then the return value is >= 0 otherwise it's -1. Here I am printing alphabets that is not present in the input.
import java.util.Scanner;
public class Test {
public static void letters()
{
System.out.println("Enter input char");
Scanner sc = new Scanner(System.in);
String input = sc.next();
System.out.println("Output : ");
for (char alphabet = 'A'; alphabet <= 'Z'; alphabet++) {
if(input.toUpperCase().indexOf(alphabet) < 0)
System.out.print(alphabet + " ");
}
}
public static void main(String[] args) {
letters();
}
}
//Ouput Example
Enter input char
nandu
Output :
B C E F G H I J K L M O P Q R S T V W X Y Z

If you see the source code of indexOf in JAVA:
public int indexOf(int ch, int fromIndex) {
final int max = value.length;
if (fromIndex < 0) {
fromIndex = 0;
} else if (fromIndex >= max) {
// Note: fromIndex might be near -1>>>1.
return -1;
}
if (ch < Character.MIN_SUPPLEMENTARY_CODE_POINT) {
// handle most cases here (ch is a BMP code point or a
// negative value (invalid code point))
final char[] value = this.value;
for (int i = fromIndex; i < max; i++) {
if (value[i] == ch) {
return i;
}
}
return -1;
} else {
return indexOfSupplementary(ch, fromIndex);
}
}
you can see it uses a for loop for finding a character. Note that each indexOf you may use in your code, is equal to one loop.
So, it is unavoidable to use loop for a single character.
However, if you want to find a special string with more different forms, use useful libraries such as util.regex, it deploys stronger algorithm to match a character or a string pattern with Regular Expressions. For example to find an email in a string:
String regex = "^(.+)#(.+)$";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(email);
If you don't like to use regex, just use a loop and charAt and try to cover all cases in one loop.
Be careful recursive methods has more overhead than loop, so it's not recommended.

how about one uses this ;
let text = "Hello world, welcome to the universe.";
let result = text.includes("world");
console.log(result) ....// true
the result will be a true or false
this always works for me

You won't be able to check if char appears at all in some string without atleast going over the string once using loop / recursion ( the built-in methods like indexOf also use a loop )
If the no. of times you look up if a char is in string x is more way more than the length of the string than I would recommend using a Set data structure as that would be more efficient than simply using indexOf
String s = "abc";
// Build a set so we can check if character exists in constant time O(1)
Set<Character> set = new HashSet<>();
int len = s.length();
for(int i = 0; i < len; i++) set.add(s.charAt(i));
// Now we can check without the need of a loop
// contains method of set doesn't use a loop unlike string's contains method
set.contains('a') // true
set.contains('z') // false
Using set you will be able to check if character exists in a string in constant time O(1) but you will also use additional memory ( Space complexity will be O(n) ).

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to check if String contains Latin letters without regex - java

Use this code public static boolean isValidUsAscii (String s) { return Charset.forName("US-ASCII").newEncoder().canEncode(s); }

For restricted "latin" (no é etcetera), it must be either US-ASCII (7 bits), or ISO-8859-1 but without accented letters. boolean isBasicLatin(String s) { return s.codePoints().allMatch(cp -> cp < 128 || (cp < 256 && !isLetter(cp))); }

Related

Removing duplicate same characters in a row

Java function to return if string contains illegal characters

How do I recognize a character such as "ç" as a letter?

How do I check for non-word characters within a single word in Java?

How can I check if a single character appears in a string?

Categories

Resources