When doing Java string comparison, what value does ~ (tilde) hold? - java

For example:
public static void smallestWord() {
String smallestWord = "~";
List<String> words = new ArrayList<>();
words.add("dba");
words.add("dba");
words.add("eba");
words.add("dca");
words.add("eca");
for (String word : words) {
if (word.compareTo(smallestWord) < 0) {
smallestWord = word;
}
}
}
It returns dba as the smallest word, which is correct, but only because I initialized smallestWord to ~. If I initialize it to an empty string or to . instead, I do not get the correct answer. What value does ~ hold in Java's lexicographic ordering?

All characters in Java are compared by their Unicode code point. ~ is U+007E (126) in Unicode, which is larger than all the standard ASCII Latin letters, but less than characters from all other scripts and less than accented Latin characters. For more detailed information on how strings are compared, you can read the String.compareTo JavaDoc.
What you want to do is probably rather something like this:
public static void smallestWord() {
String smallestWord = null;
List<String> words = new ArrayList<>();
words.add("dba");
words.add("dba");
words.add("eba");
words.add("dca");
words.add("eca");
for (String word : words) {
if ((smallestWord == null) || (word.compareTo(smallestWord) < 0)) {
smallestWord = word;
}
}
}
Or, alternatively, use the standard library:
Collections.min(words);

As others have pointed out, '~' is an ASCII character / Unicode code point that is larger than all ASCII letters; i.e. upper and lowercase 'A' through 'Z'.
Therefore, according to the specification (see note 1 below) of the String class, "~" comes after any English word.
However, the '~' code point is NOT larger than accented letters and letters in non-Latin alphabets. So the "~" trick won't work with Cyrillic or Hindi. And if you can think of a French / German / Portuguese / etc. word that has an accented first letter, it won't work in those languages either.
And it won't work with Emojis either.
In short, that code using "~" as in your example won't work in internationalized contexts.
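You could use null as per @Dolda2000's answer, or you could use "\uffff".
(Note that the largest Unicode code point, U+10FFFF, cannot be written as "\u10ffff" in Java source: that literal is parsed as \u10ff followed by "ff", and U+10FFFF has to be written as the surrogate pair "\udbff\udfff". Because compareTo compares UTF-16 code units, "\uffff" sorts after every surrogate pair anyway, which makes it the larger sentinel. However, that approach is not entirely foolproof either. There are legal Java strings that are larger than "\uffff"; e.g. "\uffffZZZZ". Unfortunately, the largest possible string value cannot be written as a string literal, and its representation is ridiculously large: roughly 2^31 chars!)
1 - The ordering of strings is based on the ordering of UTF-16 code units rather than Unicode code points. The two orderings agree as long as only characters from the Basic Multilingual Plane are involved; they can differ when a supplementary character (stored as a surrogate pair) is compared with a character in the range U+E000 to U+FFFF.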
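To make the ordering concrete, here is a minimal sketch that compares "~" with an ASCII word, an accented word and a space, and finds the smallest word without any sentinel value. The class name TildeOrdering and the method smallest are made up for this illustration; only String.compareTo comes from the discussion above.

```java
import java.util.Arrays;
import java.util.List;

final class TildeOrdering {
    // Returns the smallest word according to String.compareTo,
    // or null when the list is empty. No sentinel string is needed.
    static String smallest(List<String> words) {
        String smallestWord = null;
        for (String word : words) {
            if (smallestWord == null || word.compareTo(smallestWord) < 0) {
                smallestWord = word;
            }
        }
        return smallestWord;
    }

    public static void main(String[] args) {
        // '~' is 126 and 'z' is 122, so "~" sorts after any lowercase ASCII word.
        System.out.println("~".compareTo("zebra") > 0);
        // The accented e (U+00E9, 233) is larger than '~', so the trick breaks here.
        System.out.println("~".compareTo("\u00e9cole") < 0);
        // A space is 32, smaller than every letter, so it would always "win".
        System.out.println(" ".compareTo("dba") < 0);
        System.out.println(smallest(Arrays.asList("dba", "dba", "eba", "dca", "eca")));
    }
}
```

For a non-empty list, Collections.min(words) should give the same result as smallest(words).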

compareTo compares Unicode character values. ~ has a value greater than every letter of the English alphabet, so it works as an initial value. Space and dot have values smaller than the letters, so they are always considered the smallest and the method keeps returning them.

Related

Password check using regex in Java

My application has a feature to check password. I need to get this scenario:
password length 10 ~ 32 .
It has to be a combination of either:
character and numbers
characters and special characters
numbers and special characters
Current code in application:
private boolean isSpecialMixedText(String password)
{
String number = "[0-9]";
String english = "[a-zA-Z]";
String special = "[!@#\\$%^&*()~`\\-=_+\\[\\]{}|:\\\";',\\./<>?£¥\\\\]";
Pattern numberPattern = Pattern.compile(number);
Matcher numberMatcher = numberPattern.matcher(password);
Pattern englishPattern = Pattern.compile(english);
Matcher englishMatcher = englishPattern.matcher(password);
Pattern specialPattern = Pattern.compile(special);
Matcher specialMatcher = specialPattern.matcher(password);
return numberMatcher.find() && englishMatcher.find() || specialMatcher.find();
}
Please help me get the combination working
Actually, the regexes look fine. The problem is in this statement:
return numberMatcher.find() && englishMatcher.find() ||
specialMatcher.find();
It actually needs to be something like this:
boolean n = numberMatcher.find();
boolean e = englishMatcher.find();
boolean s = specialMatcher.find();
return (n && e) || (n && s) || (e && s);
And I agree with @adelphus' comment. Your rules for deciding what passwords are acceptable are very English-language-centric.
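As a rough, self-contained sketch of that fix, the three find() results can be combined pairwise and the length rule from the question added on top. The class name PasswordMix and the shortened special-character set are assumptions made for illustration; the real application would plug in its own character class.

```java
import java.util.regex.Pattern;

final class PasswordMix {
    private static final Pattern DIGIT = Pattern.compile("[0-9]");
    private static final Pattern LETTER = Pattern.compile("[a-zA-Z]");
    // Hypothetical, shortened special-character set, for illustration only.
    private static final Pattern SPECIAL = Pattern.compile("[!#$%^&*()~_+=-]");

    // True when the password is 10 to 32 chars long and contains
    // at least two of the three classes: digits, letters, specials.
    static boolean isAcceptable(String password) {
        if (password.length() < 10 || password.length() > 32) {
            return false;
        }
        boolean n = DIGIT.matcher(password).find();
        boolean e = LETTER.matcher(password).find();
        boolean s = SPECIAL.matcher(password).find();
        return (n && e) || (n && s) || (e && s);
    }
}
```

Note that this version also accepts passwords that contain all three classes, and it does not reject characters outside the three classes.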
In my opinion your logic is wrong, because you are looking for a combination (so only these characters are allowed) of: characters and numbers OR characters and special characters OR numbers and special characters. However, with a pair of matches like [0-9] and [a-zA-Z] you are actually looking for a String with some digits and some letters, so it could also be 123ABC#$%#$%$#%#$ (because it has letters and digits).
What you need is a check that the given string is composed ONLY of an allowed combination of characters. I think you can use one regex here (not too elegant, but effective), like:
^(?:((?=.*[A-Za-z].*)(?=.*[0-9].*)[A-Za-z0-9]{10,32})|((?=.*[-!@#\\$%^&*()~`\=_+\[\]{}|:\";',.\/<>?£¥\\].*)(?=.*[0-9].*)[0-9-!@#\\$%^&*()~`\=_+\[\]{}|:\";',.\/<>?£¥\\]{10,32})|((?=.*[-!@#\\$%^&*()~`\=_+\[\]{}|:\";',.\/<>?£¥\\].*)(?=.*[A-Za-z].*)[-A-Za-z!@#\\$%^&*()~`\=_+\[\]{}|:\";',.\/<>?£¥\\]{10,32}))$
DEMO - it shows valid and invalid matches.
This is quite a long regex, but mainly because of your special character class. The regular expression is composed of three parts with a similar structure:
positive lookahead for required characters + character class of allowed characters
For example:
(?=.*[A-Za-z].*)(?=.*[0-9].*)[A-Za-z0-9]{10,32}
means that the string needs to have:
(?=.*[A-Za-z].*) - at least one letter (positive lookahead for a letter, which may be surrounded by other characters),
(?=.*[0-9].*) - at least one digit (positive lookahead for a digit, which may be surrounded by other characters),
[A-Za-z0-9]{10,32} - from 10 to 32 letters or digits.
In effect, the password needs to have 10 to 32 characters and contain both letters and digits; the proportion is not important.
What's more, the ^ at the beginning and the $ at the end ensure that the whole examined string has this composition.
I also agree with the others that restricting the allowed characters in a password like this is not the best idea, but it is your decision.
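To see one of the three parts in isolation, here is a small sketch of the letters-and-digits branch only, compiled as a Java pattern. The class name LettersAndDigitsOnly is hypothetical, and the trailing .* inside the lookaheads is dropped because it does not change what the lookahead accepts.

```java
import java.util.regex.Pattern;

final class LettersAndDigitsOnly {
    // Two lookaheads require at least one letter and one digit;
    // the character class then allows nothing but letters and digits.
    private static final Pattern BRANCH =
            Pattern.compile("^(?=.*[A-Za-z])(?=.*[0-9])[A-Za-z0-9]{10,32}$");

    static boolean matches(String candidate) {
        return BRANCH.matcher(candidate).matches();
    }
}
```

With this branch alone, a candidate such as abcde1234! is rejected because ! is outside the allowed character class, which is the behaviour the plain find() approach cannot give.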

How do I compare each character of a String while accounting for characters with length > 1?

I have a variable string that might contain any unicode character. One of these unicode characters is the han 𩸽.
The thing is that this "han" character has "𩸽".length() == 2 but is written in the string as a single character.
Considering the code below, how would I iterate over all characters and compare each one while considering the fact it might contain one character with length greater than 1?
for ( int i = 0; i < string.length(); i++ ) {
char character = string.charAt( i );
if ( character == '𩸽' ) {
// Fail, it interprets as 2 chars =/
}
}
EDIT:
This question is not a duplicate. This asks how to iterate for each character of a String while considering characters that contains .length() > 1 (character not as a char type but as the representation of a written symbol). This question does not require previous knowledge of how to iterate over unicode code points of a Java String, although an answer mentioning that may also be correct.
int hanCodePoint = "𩸽".codePointAt(0);
for (int i = 0; i < string.length();) {
int currentCodePoint = string.codePointAt(i);
if (currentCodePoint == hanCodePoint) {
// do something here.
}
i += Character.charCount(currentCodePoint);
}
The String.charAt and String.length methods treat a String as a sequence of UTF-16 code units. You want to treat the string as Unicode code-points.
Look at the "code point" methods in the String API:
codePointAt(int index) returns the (32 bit) code point at a given code-unit index
offsetByCodePoints(int index, int codePointOffset) returns the code-unit index corresponding to codePointOffset code-points from the code-unit at index.
codePointCount(int beginIndex, int endIndex) counts the code-points between two code-unit indexes.
Indexing the string by code point index is a bit tricky, especially if the string is long and you want to do it efficiently. However, it is doable, although the code is rather cumbersome.
@sstan's answer is one solution.
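As a small sketch of how the code point methods fit together, the helper below looks up a code point by its code-point index and counts occurrences of a given code point. The class and method names are invented for this example, and U+29E3D is assumed here to be the code point of the han character from the question.

```java
final class CodePointIndexing {
    // Returns the code point at the given code-point index (not char index).
    // Throws IndexOutOfBoundsException when cpIndex is out of range.
    static int codePointAtCodePointIndex(String s, int cpIndex) {
        int charIndex = s.offsetByCodePoints(0, cpIndex);
        return s.codePointAt(charIndex);
    }

    // Counts how often the target code point occurs, stepping over
    // surrogate pairs as a single unit.
    static int count(String s, int target) {
        int occurrences = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp == target) {
                occurrences++;
            }
            i += Character.charCount(cp);
        }
        return occurrences;
    }

    public static void main(String[] args) {
        // U+29E3D is a supplementary code point, stored as two chars.
        String han = new String(Character.toChars(0x29E3D));
        String text = "a" + han + "b";
        System.out.println(text.length());                           // chars
        System.out.println(text.codePointCount(0, text.length()));   // code points
        System.out.println(codePointAtCodePointIndex(text, 1) == 0x29E3D);
    }
}
```

Each call to offsetByCodePoints scans from the start index, so calling it in a loop over a long string is quadratic; for a full pass, the codePointAt / charCount loop is the cheaper choice.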
This will be simpler if you treat both the string and the data you're searching for as Strings. If you just need to test for the presence of that character:
if (string.contains("𩸽")) {
// do something here.
}
If you specifically need the index where that character appears:
int i = string.indexOf("𩸽");
if (i >= 0) {
// do something with i here.
}
And if you really need to iterate through every code point, see How can I iterate through the unicode codepoints of a Java String? .
The han character lies outside the Basic Multilingual Plane, so Java stores it as a surrogate pair, i.e. two UTF-16 code units (two char values). That is why its length is 2 even though it is a single Unicode character. ASCII characters, like most other common characters, fit in a single char.

How to build the longest String with different Unicode characters

Thanks in advance for your patience. This is my problem.
I'm writing a program in Java that works best with a big set of different characters.
I have to store all the characters in a String. I started with
private static final String values = "0123456789";
Then I added A-Z, a-z and all the commons symbols.
But they are still too few, so I thought that maybe Unicode could be the solution.
The problem is now: what is the best way to get all the unicode characters that can be displayed in Eclipse (my algorithm will probably fail if there are unrecognized characters - those displayed like little rectangles). Is it possible to build a string (or some strings) with all the characters present here (en.wikipedia.org/wiki/List_of_Unicode_characters) correctly displayed?
I can do a rough copy-paste from http://www.terena.org/activities/multiling/euroml/tests/test-ucspages1ucs.html or http://zenoplex.jp/tools/unicoderange_generator.html, but I would appreciate some cleaner solution.
I don't know if there is a way to extract characters from a font (the Unifont one). Or maybe I should parse this (www. utf8-chartable.de/unicode-utf8-table.pl) webpage.
Moreover, by adding all the characters into a String I will probably get the error:
"The type generates a string that requires more than 65535 bytes to encode in Utf8 format in the constant pool" (discussed in this question on SO: /questions/10798769/how-to-process-a-string-with-823237-characters).
Hybrid solutions can be accepted. I can remove duplicates following this question on SO (questions/4989091/removing-duplicates-from-a-string-in-java).
Finally: every solution to get the longest only-different-characters string is accepted.
Thanks!
You are mixing some things up. The question whether a character can be displayed in Eclipse depends on the font you have chosen; and whether the source file can be processed correctly depends on which character encoding you have set up for the source file. When choosing UTF-8 and a good unicode font you can use and display almost any character, at least more than fit into a single String literal.
But is it really required to show the character in Eclipse? You can use the unicode escapes, e.g. \u20ac to refer to characters, regardless of whether they can be displayed or if the file encoding can handle them.
And if it is not a requirement to blow up your source code, it’s easy to create a String containing all existing characters:
// all chars (i.e. UTF-16 values)
StringBuilder sb=new StringBuilder(Character.MAX_VALUE);
for(char c=0; c<Character.MAX_VALUE; c++) sb.append(c);
String s=sb.toString();
// if it should behave like a compile-time constant:
s=s.intern();
or
// all unicode characters (aka code points)
StringBuilder sb=new StringBuilder(2162686);
for(int c=0; c<Character.MAX_CODE_POINT; c++) sb.appendCodePoint(c);
String s=sb.toString();
// if it should behave like a compile-time constant:
s=s.intern();
If you want the String to contain valid Unicode characters only, you can use if(Character.isDefined(c)) … inside the loop. But that’s a moving target: newer JREs will most probably know more defined characters.
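A possible shape for that filtered loop is sketched below. The class name DefinedCharacters is made up, and skipping the surrogate range is an extra assumption on my part: lone surrogates are not characters on their own, so they are left out even where Character.isDefined accepts them.

```java
final class DefinedCharacters {
    // Builds a string holding every code point the running JRE reports
    // as defined, each exactly once, without lone surrogates.
    static String allDefined() {
        StringBuilder sb = new StringBuilder();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean surrogate =
                    cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE;
            if (!surrogate && Character.isDefined(cp)) {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }
}
```

The result still contains control characters and private-use code points, which many fonts cannot display, so further filtering (for example with Character.isISOControl or Character.getType) may be needed before showing it anywhere.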
Simply use the Apache Commons Lang class org.apache.commons.lang3.RandomStringUtils; it can serve your purpose.
http://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/RandomStringUtils.html
See the code below for API usage:
import org.apache.commons.lang3.RandomStringUtils;
public class RandomString {
public static void main(String[] args) {
// Random string only with numbers
String string = RandomStringUtils.random(64, false, true);
System.out.println("Random 0 = " + string);
// Random alphabetic string
string = RandomStringUtils.randomAlphabetic(64);
System.out.println("Random 1 = " + string);
// Random ASCII string
string = RandomStringUtils.randomAscii(32);
System.out.println("Random 2 = " + string);
// Create a random string with indexes from the given array of chars
string = RandomStringUtils.random(32, 0, 20, true, true, "bj81G5RDED3DC6142kasok".toCharArray());
System.out.println("Random 3 = " + string);
}
}

Create String[] containing only certain characters

I am trying to create a String[] which contains only words that consist of certain characters. For example, I have a dictionary containing a number of words like so:
arm
army
art
as
at
attack
attempt
attention
attraction
authority
automatic
awake
baby
back
bad
bag
balance
I want to narrow the list down so that it only contains words with the characters a, b and g. Therefore the list should only contain the word 'bag' in this example.
Currently I am trying to do this using regexes but having never used them before I can't seem to get it to work.
Here is my code:
public class LetterJugglingMain {
public static void main(String[] args) {
String dictFile = "/Users/simonrhillary/Desktop/Dictionary(3).txt";
fileReader fr = new fileReader();
fr.openFile(dictFile);
String[] dictionary = fr.fileToArray();
String regx = "able";
String[] newDict = createListOfValidWords(dictionary, regx);
printArray(newDict);
}
public static String[] createListOfValidWords(String[] d, String regex){
List<String> narrowed = new ArrayList<String>();
for(int i = 0; i<d.length; i++){
if(d[i].matches(regex)){
narrowed.add(d[i]);
System.out.println("added " + d[i]);
}
}
String[] narrowArray = narrowed.toArray(new String[0]);
return narrowArray;
}
However, the returned array is always empty unless the String regex is the exact word! Any ideas? I can post more code if needed. I think I must be initialising the regex wrongly.
The narrowed down list must contain ONLY the characters from the regex.
Frankly, I'm not an expert in regexes, but I don't think it's the best tool to do what you want. I would use a method like the following:
public boolean containsAll(String s, Set<Character> chars) {
Set<Character> copy = new HashSet<Character>();
for (int i = 0; i < s.length() && copy.size() < chars.size(); i++) {
char c = s.charAt(i);
if (chars.contains(c)) {
copy.add(c);
}
}
return copy.size() == chars.size();
}
The regex able will match only the string "able". However, if you want a regular expression to match either character of a, b, l or e, the regex you're looking for is [able] (in brackets). If you want words containing several such characters, add a + for repeating the pattern: [able]+.
The OP wants words that contain every character. Not just one of them.
And other characters are not a problem.
If this is the case, I think the simplest way would be to loop through the entire string, character by character, and check whether it contains all of the characters you want. Keep flags to record whether every character has been found.
If this isn't the case.... :
Try using the regex:
^[able]+$
Here's what it does:
^ matches the beginning of the string and $ matches the end of the string. This makes sure that you're not getting a partial match.
[able] matches the characters you want the string to consist of, in this case a, b, l, and e. + Makes sure that there are 1 or more of these characters in the string.
Note: This regex will match any string that consists only of these 4 letters, in any order and with repetition. For example, it will match:
able, albe, aeble, aaaabbblllleeee
and will not match
qable, treatable, and abled.
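A quick way to try this out is sketched below; the class OnlyTheseLetters and its method are invented for the example, and the letters argument is assumed to contain plain ASCII letters only, since it is pasted into the character class without escaping.

```java
import java.util.ArrayList;
import java.util.List;

final class OnlyTheseLetters {
    // Keeps the words that are made up exclusively of the given letters.
    // Assumes 'letters' holds plain ASCII letters (no regex metacharacters).
    static List<String> keepWordsMadeOf(String letters, String... words) {
        // String.matches tests the whole string, so ^ and $ are implied.
        String regex = "[" + letters + "]+";
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (word.matches(regex)) {
                kept.add(word);
            }
        }
        return kept;
    }
}
```

Calling keepWordsMadeOf("able", "able", "albe", "qable", "abled") should keep only able and albe.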
Here is a sample regex that keeps only the words containing at least one occurrence of every character in a set. This one matches any English word (case-insensitive) that contains at least one occurrence of each of the characters a, b and g:
(?i)(?=.*a)(?=.*b)(?=.*g)[a-z]+
Example of strings that match would be bag, baggy, grab.
Example of strings that don't match would be big, argument, nothing.
The (?i) turns on the case-insensitive flag.
You need to append one (?=.*<character>) for each character in the set.
I assume a word only contains letters of the English alphabet, so I specify [a-z]. Specify more if you need space, hyphen, etc.
I assume you use the matches(String regex) method of the String class, so I omitted the ^ and $.
The performance may be bad, since in the worst case (the characters are found at the end of the words), I think that the regex engine may go through the string for around n times where n is the number of characters in the set. It may not be an actual concern at all, since the words are very short, but if it turns out that this is a bottleneck, you may consider doing simple looping.
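Here is a sketch of how such a pattern could be built from an arbitrary set of required letters and applied to a word list. The class ContainsAllLetters and its two methods are hypothetical names for this illustration; the regex structure is the one described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ContainsAllLetters {
    // Builds (?i)(?=.*x)(?=.*y)...[a-z]+ with one lookahead per required letter.
    static Pattern patternFor(String required) {
        StringBuilder regex = new StringBuilder("(?i)");
        for (char c : required.toCharArray()) {
            regex.append("(?=.*")
                 .append(Pattern.quote(String.valueOf(c)))
                 .append(")");
        }
        regex.append("[a-z]+");
        return Pattern.compile(regex.toString());
    }

    // Keeps the words that contain every required letter at least once.
    static List<String> filter(String required, List<String> words) {
        Pattern pattern = patternFor(required);
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (pattern.matcher(word).matches()) {
                kept.add(word);
            }
        }
        return kept;
    }
}
```

Compiling the pattern once and reusing it across the whole dictionary avoids recompiling the regex for every word, which String.matches would do.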

To split only Chinese characters in java

I am writing a Java application, but I am stuck on this point.
Basically I have a string of Chinese characters that may ALSO contain some Latin characters or numbers, let's say:
查詢促進民間參與公共建設法(210BOT法).
I want to split off each Chinese character individually while keeping runs of Latin letters or digits, such as "BOT" above, together. So, at the end I will have this kind of list:
[ 查, 詢, 促, 進, 民, 間, 參, 與, 公, 共, 建, 設, 法, (, 210, BOT, 法, ), ., ]
How can I resolve this problem (for java)?
Chinese characters lie within certain Unicode ranges:
2F00-2FDF: Kangxi
4E00-9FFF: CJK
3400-4DBF: CJK Extension
So all you basically need to do is check whether the character's code point lies within the known ranges. This example is a good starting point for writing a stack-based parser/splitter; you only need to extend it to separate digits from Latin letters, which should be obvious enough (hint: Character#isDigit()):
Set<UnicodeBlock> chineseUnicodeBlocks = new HashSet<UnicodeBlock>() {{
add(UnicodeBlock.CJK_COMPATIBILITY);
add(UnicodeBlock.CJK_COMPATIBILITY_FORMS);
add(UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS);
add(UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT);
add(UnicodeBlock.CJK_RADICALS_SUPPLEMENT);
add(UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION);
add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS);
add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A);
add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B);
add(UnicodeBlock.KANGXI_RADICALS);
add(UnicodeBlock.IDEOGRAPHIC_DESCRIPTION_CHARACTERS);
}};
String mixedChinese = "查詢促進民間參與公共建設法(210BOT法)";
for (char c : mixedChinese.toCharArray()) {
if (chineseUnicodeBlocks.contains(UnicodeBlock.of(c))) {
System.out.println(c + " is chinese");
} else {
System.out.println(c + " is not chinese");
}
}
Good luck.
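One way the extension hinted at above might look is sketched here. It is not the code from this answer: it swaps the UnicodeBlock set for Character.UnicodeScript.HAN to keep the example short, and the class MixedSplitter with its Kind enum is invented for illustration. Each Han character becomes its own token, runs of digits and runs of letters are grouped, whitespace is dropped, and anything else is emitted one symbol at a time.

```java
import java.util.ArrayList;
import java.util.List;

final class MixedSplitter {
    private enum Kind { HAN, DIGIT, LETTER, SPACE, OTHER }

    private static Kind kindOf(int cp) {
        if (Character.isWhitespace(cp)) return Kind.SPACE;
        if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) return Kind.HAN;
        if (Character.isDigit(cp)) return Kind.DIGIT;
        if (Character.isLetter(cp)) return Kind.LETTER;
        return Kind.OTHER;
    }

    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        Kind runKind = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            Kind kind = kindOf(cp);
            boolean grouped = kind == Kind.DIGIT || kind == Kind.LETTER;
            // Flush the pending run unless this code point extends it.
            if (run.length() > 0 && !(grouped && kind == runKind)) {
                tokens.add(run.toString());
                run.setLength(0);
            }
            if (grouped) {
                run.appendCodePoint(cp);
                runKind = kind;
            } else {
                runKind = null;
                if (kind != Kind.SPACE) {
                    // Han characters and punctuation are single-symbol tokens.
                    tokens.add(new String(Character.toChars(cp)));
                }
            }
        }
        if (run.length() > 0) {
            tokens.add(run.toString());
        }
        return tokens;
    }
}
```

Because the loop walks code points rather than chars, supplementary ideographs (for example those in CJK Extension B) are handled as single tokens as well.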
Disclaimer: I'm a complete Lucene newbie.
Using the latest version of Lucene (3.6.0 at the time of writing) I manage to get close to the result you require.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36, Collections.emptySet());
List<String> words = new ArrayList<String>();
TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(original));
CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
try {
tokenStream.reset(); // Resets this stream to the beginning. (Required)
while (tokenStream.incrementToken()) {
words.add(termAttribute.toString());
}
tokenStream.end(); // Perform end-of-stream operations, e.g. set the final offset.
}
finally {
tokenStream.close(); // Release resources associated with this stream.
}
The result I get is:
[查, 詢, 促, 進, 民, 間, 參, 與, 公, 共, 建, 設, 法, 210bot, 法]
Here's an approach I would take.
You can use Character.codePointAt(char[] charArray, int index) to return the Unicode value for a char in your char array.
You will also need a mapping of Latin Unicode characters.
If you look in the source of Character.UnicodeBlock, the Latin blocks (Basic Latin through Latin Extended-B) together cover the interval [0x0000, 0x024F]. So basically you check whether your Unicode code point is somewhere within that interval.
I suspect there is a way to just use a Character.Subset to check if it contains your char, but I haven't looked into that.
