To split only Chinese characters in Java

I am writing a Java application, but I am stuck on this point.
Basically I have a string of Chinese characters that may ALSO contain some Latin characters or numbers, let's say:
查詢促進民間參與公共建設法(210BOT法).
I want to split the string into single Chinese characters, while keeping runs of Latin letters or digits together, such as "BOT" above. So, at the end I will have this kind of list:
[查, 詢, 促, 進, 民, 間, 參, 與, 公, 共, 建, 設, 法, (, 210, BOT, 法, ), .]
How can I solve this problem in Java?

Chinese characters lie within certain Unicode ranges:
2F00-2FDF: Kangxi Radicals
4E00-9FFF: CJK Unified Ideographs
3400-4DBF: CJK Unified Ideographs Extension A
So all you basically need to do is check whether the character's code point lies within the known ranges. This example is a good starting point for writing a stack-based parser/splitter; you only need to extend it to separate digits from Latin letters, which should be obvious enough (hint: Character#isDigit()):
Set<UnicodeBlock> chineseUnicodeBlocks = new HashSet<UnicodeBlock>() {{
add(UnicodeBlock.CJK_COMPATIBILITY);
add(UnicodeBlock.CJK_COMPATIBILITY_FORMS);
add(UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS);
add(UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT);
add(UnicodeBlock.CJK_RADICALS_SUPPLEMENT);
add(UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION);
add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS);
add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A);
add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B);
add(UnicodeBlock.KANGXI_RADICALS);
add(UnicodeBlock.IDEOGRAPHIC_DESCRIPTION_CHARACTERS);
}};
String mixedChinese = "查詢促進民間參與公共建設法(210BOT法)";
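// Iterate by code point rather than by char: Extension B lies outside the BMP
// and is stored as surrogate pairs, which UnicodeBlock.of(char) cannot classify
for (int i = 0; i < mixedChinese.length(); ) {
int cp = mixedChinese.codePointAt(i);
i += Character.charCount(cp);
String s = new String(Character.toChars(cp));
if (chineseUnicodeBlocks.contains(UnicodeBlock.of(cp))) {
System.out.println(s + " is chinese");
} else {
System.out.println(s + " is not chinese");
}
}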
Good luck.
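As a rough sketch of the splitter hinted at above, the following classifies each code point and groups only digits and letters into runs. It swaps the block set for Character.UnicodeScript (Java 7+), which is an assumption on my part rather than the approach of the answer; the class name MixedSplitter and its helpers are made up for illustration.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

public class MixedSplitter {

    // Han characters and "other" symbols always become one-character tokens;
    // digits and non-Han letters are grouped into runs of the same kind.
    private enum Kind { HAN, DIGIT, LETTER, OTHER }

    private static Kind kindOf(int cp) {
        if (UnicodeScript.of(cp) == UnicodeScript.HAN) return Kind.HAN;
        if (Character.isDigit(cp)) return Kind.DIGIT;
        if (Character.isLetter(cp)) return Kind.LETTER;
        return Kind.OTHER;
    }

    public static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        Kind runKind = null;
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            Kind kind = kindOf(cp);
            boolean groupable = kind == Kind.DIGIT || kind == Kind.LETTER;
            // Flush the pending token unless this code point continues its run
            if (run.length() > 0 && !(groupable && kind == runKind)) {
                tokens.add(run.toString());
                run.setLength(0);
            }
            run.appendCodePoint(cp);
            runKind = kind;
        }
        if (run.length() > 0) tokens.add(run.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // Escaped, shortened sample so the source file stays ASCII-only
        System.out.println(split("\u516c\u5171(210BOT\u6cd5)."));
    }
}
```

For the shortened sample this should yield the tokens 公, 共, (, 210, BOT, 法, ), and a final dot. Whitespace is not treated specially here and would come out as its own token, so you may want to skip it.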

Disclaimer: I'm a complete Lucene newbie.
Using the latest version of Lucene (3.6.0 at the time of writing) I managed to get close to the result you require.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36, Collections.emptySet());
List<String> words = new ArrayList<String>();
TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(original));
CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
try {
tokenStream.reset(); // Resets this stream to the beginning. (Required)
while (tokenStream.incrementToken()) {
words.add(termAttribute.toString());
}
tokenStream.end(); // Perform end-of-stream operations, e.g. set the final offset.
}
finally {
tokenStream.close(); // Release resources associated with this stream.
}
The result I get is:
[查, 詢, 促, 進, 民, 間, 參, 與, 公, 共, 建, 設, 法, 210bot, 法]

Here's an approach I would take.
You can use Character.codePointAt(char[] charArray, int index) to return the Unicode value for a char in your char array.
You will also need a mapping of Latin Unicode characters.
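If you look in the source of Character.UnicodeBlock, the Latin blocks (Basic Latin through Latin Extended-B) together cover the interval [0x0000, 0x024F]. So basically you check whether your Unicode code point is somewhere within that interval. Note that this interval also contains digits, punctuation and control characters, not just letters.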
I suspect there is a way to just use a Character.Subset to check if it contains your char, but I haven't looked into that.
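To make the difference between "inside the Latin blocks" and "is a Latin letter" concrete, here is a small sketch. It uses Character.UnicodeBlock and Character.UnicodeScript from the JDK; the class and method names (ScriptCheck, isInLatinBlocks, isLatinLetter) are invented for the example.

```java
import java.lang.Character.UnicodeBlock;
import java.lang.Character.UnicodeScript;

public class ScriptCheck {

    // True for any code point in the four blocks spanning U+0000..U+024F,
    // which includes digits, punctuation and control characters
    static boolean isInLatinBlocks(int cp) {
        UnicodeBlock block = UnicodeBlock.of(cp);
        return block == UnicodeBlock.BASIC_LATIN
            || block == UnicodeBlock.LATIN_1_SUPPLEMENT
            || block == UnicodeBlock.LATIN_EXTENDED_A
            || block == UnicodeBlock.LATIN_EXTENDED_B;
    }

    // True only for characters whose script is Latin (letters such as B or U+011F)
    static boolean isLatinLetter(int cp) {
        return UnicodeScript.of(cp) == UnicodeScript.LATIN;
    }

    public static void main(String[] args) {
        int[] samples = { 'B', '1', '(', 0x011F, 0x6CD5 };
        for (int cp : samples) {
            System.out.printf("U+%04X blocks=%b script=%b%n",
                    cp, isInLatinBlocks(cp), isLatinLetter(cp));
        }
    }
}
```

The script check is probably the closer fit for the question, since a digit such as '1' sits in the Basic Latin block but belongs to the COMMON script.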

Related

How to convert special characters in a string to unicode?

I couldn't find an answer to this problem, having tried several answers here combined to find something that works, to no avail.
An application I'm working on uses a user's name to create PDFs with that name in it. However, when someone's name contains a special character, as in "Yağmur", the PDF creator freaks out and omits this special character.
However, when it gets the Unicode equivalent ("Ya&#287;mur"), it prints "Yağmur" in the PDF as it should.
How do I check a name/string for any special character (regex = "[^a-z0-9 ]") and, when one is found, replace that character with its Unicode equivalent and return the new string?
I will try to give a generic solution, as the framework you are using is not mentioned in your problem statement.
I faced the same kind of issue a long time ago. This should be handled by the PDF engine if you set the text/character encoding to UTF-8. Please find out how to set the encoding for PDF generation in your framework and try it out. Hope it helps!
One hackish way to do this would be as follows:
/*
* TODO: poorly named
*/
public static String convertUnicodePoints(String input) {
// getting char array from input
char[] chars = input.toCharArray();
// initializing output
StringBuilder sb = new StringBuilder();
// iterating input chars
for (int i = 0; i < input.length(); i++) {
// checking character code point to infer whether "conversion" is required
// here, picking an arbitrary code point 125 as boundary
if (Character.codePointAt(input, i) < 125) {
sb.append(chars[i]);
}
// need to "convert", code point > boundary
else {
// for hex representation: prepends as many 0s as required
// to get a hex string of the char code point, 4 characters long
// sb.append(String.format("&#x%04X;", (int)chars[i]));
// for decimal representation, which is what you want here
sb.append(String.format("&#%d;", (int)chars[i]));
}
}
return sb.toString();
}
If you execute: System.out.println(convertUnicodePoints("Yağmur"));...
... you'll get: Ya&#287;mur.
Of course, you can play with the "conversion" logic and decide which ranges get converted.
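One limitation of the char-based loop is that characters outside the Basic Multilingual Plane would be emitted as two surrogate values. A variant that walks code points instead might look like the sketch below; the class name EntityEncoder and the choice of 128 (everything non-ASCII) as the boundary are assumptions of mine, not part of the answer above.

```java
public class EntityEncoder {

    // Replaces every non-ASCII code point with a decimal numeric character reference
    public static String toNumericEntities(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp); // 2 for surrogate pairs, 1 otherwise
            if (cp < 128) {
                sb.append((char) cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u011F is the g-with-breve from the name in the question
        System.out.println(toNumericEntities("Ya\u011Fmur")); // prints Ya&#287;mur
    }
}
```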

How to build the longest String with different Unicode characters

Thanks in advance for your patience. This is my problem.
I'm writing a program in Java that works best with a big set of different characters.
I have to store all the characters in a String. I started with
private static final String values = "0123456789";
Then I added A-Z, a-z and all the commons symbols.
But they are still too few, so I thought that maybe Unicode could be the solution.
The problem is now: what is the best way to get all the unicode characters that can be displayed in Eclipse (my algorithm will probably fail if there are unrecognized characters - those displayed like little rectangles). Is it possible to build a string (or some strings) with all the characters present here (en.wikipedia.org/wiki/List_of_Unicode_characters) correctly displayed?
I can do a rough copy-paste from http://www.terena.org/activities/multiling/euroml/tests/test-ucspages1ucs.html or http://zenoplex.jp/tools/unicoderange_generator.html, but I would appreciate some cleaner solution.
I don't know if there is a way to extract characters from a font (the Unifont one). Or maybe I should parse this (www. utf8-chartable.de/unicode-utf8-table.pl) webpage.
Moreover, by adding all the characters into a String I will probably get the error:
"The type generates a string that requires more than 65535 bytes to encode in Utf8 format in the constant pool" (discussed in this question on SO: /questions/10798769/how-to-process-a-string-with-823237-characters).
Hybrid solutions can be accepted. I can remove duplicates following this question on SO (questions/4989091/removing-duplicates-from-a-string-in-java).
Finally: every solution to get the longest only-different-characters string is accepted.
Thanks!
You are mixing some things up. The question whether a character can be displayed in Eclipse depends on the font you have chosen; and whether the source file can be processed correctly depends on which character encoding you have set up for the source file. When choosing UTF-8 and a good unicode font you can use and display almost any character, at least more than fit into a single String literal.
But is it really required to show the character in Eclipse? You can use the unicode escapes, e.g. \u20ac to refer to characters, regardless of whether they can be displayed or if the file encoding can handle them.
And if it is not a requirement to blow up your source code, it’s easy to create a String containing all existing characters:
// all chars (i.e. UTF-16 values)
StringBuilder sb=new StringBuilder(Character.MAX_VALUE);
for(char c=0; c<Character.MAX_VALUE; c++) sb.append(c);
String s=sb.toString();
// if it should behave like a compile-time constant:
s=s.intern();
or
// all unicode characters (aka code points)
StringBuilder sb=new StringBuilder(2162686);
for(int c=0; c<Character.MAX_CODE_POINT; c++) sb.appendCodePoint(c);
String s=sb.toString();
// if it should behave like a compile-time constant:
s=s.intern();
If you want the String to contain valid Unicode characters only, you can use if(Character.isDefined(c)) … inside the loop. But that's a moving target: newer JREs will most probably know more defined characters.
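A possible shape for that filtered loop is sketched below. Besides Character.isDefined it also skips surrogates, control characters and private-use code points, which is my own assumption about what "displayable" should exclude; the class name DefinedChars is hypothetical.

```java
public class DefinedChars {

    // Builds a string holding each defined code point exactly once,
    // leaving out code points that cannot sensibly be displayed
    public static String allDefined() {
        StringBuilder sb = new StringBuilder();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (!Character.isDefined(cp)) continue;
            int type = Character.getType(cp);
            if (type == Character.SURROGATE      // lone surrogates are not characters
                    || type == Character.CONTROL
                    || type == Character.PRIVATE_USE) {
                continue;
            }
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = allDefined();
        // The count depends on the Unicode version the running JRE supports
        System.out.println(s.codePointCount(0, s.length()) + " code points");
    }
}
```

Whether a given character actually renders still depends on the font, so this only narrows the set down to what the JRE knows about.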
Simply use the Apache Commons Lang class org.apache.commons.lang3.RandomStringUtils (commons-lang3); it can serve your purpose.
http://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/RandomStringUtils.html
Also, please refer to the code below for API usage:
import org.apache.commons.lang3.RandomStringUtils;
public class RandomString {
public static void main(String[] args) {
// Random string only with numbers
String string = RandomStringUtils.random(64, false, true);
System.out.println("Random 0 = " + string);
// Random alphabetic string
string = RandomStringUtils.randomAlphabetic(64);
System.out.println("Random 1 = " + string);
// Random ASCII string
string = RandomStringUtils.randomAscii(32);
System.out.println("Random 2 = " + string);
// Create a random string with indexes from the given array of chars
string = RandomStringUtils.random(32, 0, 20, true, true, "bj81G5RDED3DC6142kasok".toCharArray());
System.out.println("Random 3 = " + string);
}
}

Create String[] containing only certain characters

I am trying to create a String[] which contains only words that consist of certain characters. For example, I have a dictionary containing a number of words like so:
arm
army
art
as
at
attack
attempt
attention
attraction
authority
automatic
awake
baby
back
bad
bag
balance
I want to narrow the list down so that it only contains words with the characters a, b and g. Therefore the list should only contain the word 'bag' in this example.
Currently I am trying to do this using regexes but having never used them before I can't seem to get it to work.
Here is my code:
public class LetterJugglingMain {
public static void main(String[] args) {
String dictFile = "/Users/simonrhillary/Desktop/Dictionary(3).txt";
fileReader fr = new fileReader();
fr.openFile(dictFile);
String[] dictionary = fr.fileToArray();
String regx = "able";
String[] newDict = createListOfValidWords(dictionary, regx);
printArray(newDict);
}
public static String[] createListOfValidWords(String[] d, String regex){
List<String> narrowed = new ArrayList<String>();
for(int i = 0; i<d.length; i++){
if(d[i].matches(regex)){
narrowed.add(d[i]);
System.out.println("added " + d[i]);
}
}
String[] narrowArray = narrowed.toArray(new String[0]);
return narrowArray;
}
however the array returned is always empty unless the String regex is the exact word! Any ideas? I can post more code if needed...I think I must be trying to initialise the regex wrong.
The narrowed down list must contain ONLY the characters from the regex.
Frankly, I'm not an expert in regexes, but I don't think it's the best tool to do what you want. I would use a method like the following:
public boolean containsAll(String s, Set<Character> chars) {
Set<Character> copy = new HashSet<Character>();
for (int i = 0; i < s.length() && copy.size() < chars.size(); i++) {
char c = s.charAt(i);
if (chars.contains(c)) {
copy.add(c);
}
}
return copy.size() == chars.size();
}
The regex able will match only the string "able". However, if you want a regular expression to match either character of a, b, l or e, the regex you're looking for is [able] (in brackets). If you want words containing several such characters, add a + for repeating the pattern: [able]+.
The OP wants words that contain every character. Not just one of them.
And other characters are not a problem.
If this is the case, I think the simplest way would be to loop through the entire string, character by character, and check whether it contains all of the characters you want. Keep flags to track whether every character has been found.
If this isn't the case.... :
Try using the regex:
^[able]+$
Here's what it does:
^ matches the beginning of the string and $ matches the end of the string. This makes sure that you're not getting a partial match.
[able] matches the characters you want the string to consist of, in this case a, b, l, and e. + Makes sure that there are 1 or more of these characters in the string.
Note: This regex will match a string that consists only of these 4 letters, in any order and any number of times. For example, it will match:
able, albe, aeble, aaaabbblllleeee
and will not match
qable, treatable, and abled.
Here is a sample regex that keeps only words containing at least one occurrence of every character in a set. This will match any English word (case-insensitive) that contains at least one occurrence of each of the characters a, b, g:
(?i)(?=.*a)(?=.*b)(?=.*g)[a-z]+
Example of strings that match would be bag, baggy, grab.
Example of strings that don't match would be big, argument, nothing.
The (?i) turns on the case-insensitive flag.
You need to append one (?=.*<character>) for each character in the set.
I assume a word only contains letters of the English alphabet, so I specify [a-z]. Specify more if you need space, hyphen, etc.
I assume you use the matches(String regex) method of the String class, which has to match the whole string, so I omitted the ^ and $.
The performance may be bad, since in the worst case (the characters are found at the end of the words), I think that the regex engine may go through the string for around n times where n is the number of characters in the set. It may not be an actual concern at all, since the words are very short, but if it turns out that this is a bottleneck, you may consider doing simple looping.
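The lookahead pattern above can also be generated from the required letters instead of being written by hand. The following is a minimal sketch of that idea; the class ContainsAllRegex and its methods are made-up names, and it keeps the [a-z]+ assumption about what a word looks like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ContainsAllRegex {

    // Builds (?i)(?=.*a)(?=.*b)...[a-z]+ with one lookahead per required character
    static Pattern containsAll(String required) {
        StringBuilder sb = new StringBuilder("(?i)");
        for (char c : required.toCharArray()) {
            sb.append("(?=.*").append(Pattern.quote(String.valueOf(c))).append(')');
        }
        sb.append("[a-z]+");
        return Pattern.compile(sb.toString());
    }

    // Keeps the words that contain every required character at least once
    static List<String> filter(String[] words, String required) {
        Pattern pattern = containsAll(required);
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (pattern.matcher(word).matches()) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] dictionary = { "arm", "bag", "baggy", "grab", "big", "bad" };
        System.out.println(filter(dictionary, "abg")); // prints [bag, baggy, grab]
    }
}
```

If the words may only use the given letters and nothing else, replacing the trailing [a-z]+ with a character class built from the same letters would express that stricter reading.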

Select words with at least two different letters

I am using this code
Matcher m2 = Pattern.compile("\\b[ABE]+\\b").matcher(key);
to only get keys from a HashMap that contain the letters A, B or E.
However, I am not interested in words such as AAAAAA or EEEEE; I need words with at least two different letters (in the best case, three).
Is there a way to modify the regex? Can anyone offer insight on this?
Replace everything except your letters, make a Set of the result, test the Set for size.
public static void main (String args[])
{
String alphabet = "ABC";
String totest = "BBA";
if (args.length == 2)
{
alphabet = args[0];
totest = args[1];
}
String cleared = totest.replaceAll ("[^" + alphabet + "]", "");
char[] ca = cleared.toCharArray ();
Set <Character> unique = new HashSet <Character> ();
for (char c: ca)
unique.add (c);
System.out.println ("Result: " + (unique.size () > 1));
}
Example implementation
You could use a more complicated regex to do it e.g.
(.*A.*[BE].*|.*[BE].*A.*)|(.*B.*[AE].*|.*[AE].*B.*)|(.*E.*[BA].*|.*[BA].*E.*)
But it's probably going to be easier to understand if you do some kind of replacement, for instance a loop that replaces one letter at a time with '' and checks the size of the new string each time. If the size of the string changes twice, then you've got two of your desired characters. EDIT: actually, if you know the set of desired characters at runtime before you do the check, NullUserException had it right in his comment: indexOf or contains will be more efficient and probably more readable than this.
Note that if your set of desired characters is unknown at compile time (or at least pre-string-checking at runtime), the second option is preferable - if you're looking for any characters, just replace all occurrences of the first character in a while(str.length > 0) loop - the number of times it goes through the loop is the number of different characters you've got.
Mark explicitly the repetition of desired letters,
It would look like this :
\b[ABE]{1,3}\b
It matches AAE, EEE, AEE but not AAAA, AAEE
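If the regex itself should reject words made of a single repeated letter, one option is a negative lookahead with a back-reference. The sketch below assumes each key is tested as a whole with matches() rather than searched with \b boundaries, and the class name TwoDistinct is invented for the example.

```java
import java.util.regex.Pattern;

public class TwoDistinct {

    // (?!(.)\1*$) rejects input that is one character repeated to the end;
    // [ABE]+ then requires that only the letters A, B, E occur
    static final Pattern AT_LEAST_TWO = Pattern.compile("(?!(.)\\1*$)[ABE]+");

    // One lookahead per letter: all three must occur somewhere
    static final Pattern ALL_THREE = Pattern.compile("(?=.*A)(?=.*B)(?=.*E)[ABE]+");

    static boolean hasAtLeastTwo(String key) {
        return AT_LEAST_TWO.matcher(key).matches();
    }

    static boolean hasAllThree(String key) {
        return ALL_THREE.matcher(key).matches();
    }

    public static void main(String[] args) {
        for (String key : new String[] { "AAE", "AAAAAA", "EEEEE", "BA", "ABE" }) {
            System.out.println(key + " two=" + hasAtLeastTwo(key) + " three=" + hasAllThree(key));
        }
    }
}
```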

Number to words Mapping , Awe inspiring memory whim?

I need help for completing this little project
The program will take a phone number as input and convert it into a proper English word.
Explanation:
Letters related to the digits 0-9 are saved in the first ten lines of a text file, something like
1 akl
2 dgh
3 qnm
4 rtu
5 zx
6 cvf
7 eip
8 wjs
9 yb
0 o
Line 11 holds the total number of words, i.e. 50000.
After that, starting from line 12, all 50000 words are present, one word per line.
The program will then take numbers as input from the user until the user enters -1
and generate a proper matching English word from this text file. Each letter represents a digit from the list.
for example user enters
6182703
output will be :
Fashion
If there is more than one matching word, the system will list all the words separated by a hyphen '-'.
How should I start this, what approach should I use ?
If someone gives Pseudo code or hints .. It would be really great.
I would take a dictionary of words and sort it in a file by your needs.
e.g:
apple = 17717
cherry = 627449
Then go through the file with a search algorithm.
EDIT: or you could store the data in a Relational DB (http://hsqldb.org/ is simple) to avoid a bigger memory footprint. If you like the solution you also could investigate some key/value stores etc.
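As an in-memory variant of the same idea, each word can be encoded into its digit string once and stored in a map from digit string to words, so a lookup needs no search at all. The sketch below uses the letter table from the question; the class DigitIndex and its method names are hypothetical, and the short word list stands in for the 50000-word file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class DigitIndex {

    // The ten mapping lines from the question, in the form "<digit> <letters>"
    static final String[] MAPPING = {
        "1 akl", "2 dgh", "3 qnm", "4 rtu", "5 zx",
        "6 cvf", "7 eip", "8 wjs", "9 yb", "0 o"
    };

    // Inverts the table: each letter points to its digit
    static Map<Character, Character> letterToDigit(String[] mappingLines) {
        Map<Character, Character> map = new HashMap<>();
        for (String line : mappingLines) {
            String[] parts = line.trim().split("\\s+");
            char digit = parts[0].charAt(0);
            for (char letter : parts[1].toCharArray()) {
                map.put(letter, digit);
            }
        }
        return map;
    }

    // Returns the digit string for a word, or null if a letter has no digit
    static String encode(String word, Map<Character, Character> letterToDigit) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            Character digit = letterToDigit.get(c);
            if (digit == null) return null;
            sb.append(digit.charValue());
        }
        return sb.toString();
    }

    static Map<String, List<String>> buildIndex(List<String> words,
                                                Map<Character, Character> letterToDigit) {
        Map<String, List<String>> index = new HashMap<>();
        for (String word : words) {
            String key = encode(word, letterToDigit);
            if (key != null) {
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(word);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Character, Character> table = letterToDigit(MAPPING);
        List<String> words = Arrays.asList("apple", "cherry", "fashion", "dog", "god", "hog");
        Map<String, List<String>> index = buildIndex(words, table);
        List<String> hits = index.getOrDefault("202", Collections.<String>emptyList());
        System.out.println(String.join("-", hits)); // prints dog-god-hog
    }
}
```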
A lot of the detail in your question relates to the input spec, which is all pretty trivial.
After parsing your input, you're going to have a list of "candidate" words (all the words), and a mapping of digits to the set of characters it can be represented with.
List<String> words;
Map<Character, Set<Character>> digitMapping;
The simplest way of generating the word for a number is probably this: sequentially filter the list of candidates, testing if they match the input digits, and removing them otherwise. Something like this might do the trick (consider this pseudocode - I haven't tried compiling it):
List<String> getMatches(String inputDigits) {
// Take a copy of the word list. You don't want to ruin the list for the next caller
List<String> candidates = new ArrayList<String>(words);
for (Iterator<String> it = candidates.iterator(); it.hasNext(); ) {
String candidate = it.next();
// A word of a different length can never match the digits
if (candidate.length() != inputDigits.length()) {
it.remove();
continue;
}
for (int i = 0; i < inputDigits.length(); ++i) {
char c = candidate.charAt(i);
char d = inputDigits.charAt(i);
Set<Character> letters = digitMapping.get(d);
if (letters == null || !letters.contains(c)) {
it.remove();
break; // remove() may only be called once per next()
}
}
}
return candidates;
}
It will return all the words that match, so in your example, "555" will likely return an empty list. "6182703" might only return a single word, "fashion", while "202" might return several words in a list ("dog", "hog", "god"). You'll need to decide how you want to handle the zero and multiple cases.
Edit: Details on populating digitMapping:
The digitMapping will be something like:
Map<Character, Set<Character>> digitMapping = new HashMap<Character, Set<Character>>();
Then you'll need to grab a char and a String from the input. For the input line "1 akl", your char will be '1', while your String will be "akl". You're mapping from the character to the set of characters in the string, so will need to construct an empty set, put it into the map, then populate the set. Something like (again, I haven't even tried compiling this, so take it with a grain of salt):
private void addDigitToMap(char digit, String chars) {
Set<Character> set = new HashSet<Character>();
// put() needs both the key (the digit) and the value (its letter set)
digitMapping.put(digit, set);
for (char c : chars.toCharArray()) {
set.add(c);
}
}
So now the map will have an entry that points to a set of the characters it can be represented by.
