stripAccents on Thai language - java

I am trying to strip accents from a Thai word using the stripAccents function from Apache Commons Lang in Scala, but it does not seem to strip them.
import org.apache.commons.lang3.StringUtils.stripAccents
println("stripped string " + stripAccents("CLEกอ่ตัRงขึนในปีR"))
stripped string CLEกอ่ตัRงขึนในปีR
I am running this in IntelliJ on Windows. The same call strips accents correctly for many other languages, such as German and Dutch.
Has anyone faced a similar issue, and how did you resolve it?

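You can use java.text.Normalizer and also remove characters in the Unicode mark category (\p{M}), which covers the Thai vowel and tone marks: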
import java.text.Normalizer
val thaiString = "CLEกอ่ตัRงขึนในปีR"
val strippedString = Normalizer.normalize(thaiString, Normalizer.Form.NFD)
.replaceAll("[\\p{InCombiningDiacriticalMarks}\\p{IsM}]+", "")
println(strippedString)
//CLEกอตRงขนในปR
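For readers working in Java rather than Scala, here is a minimal sketch of the same idea. As far as I can tell, stripAccents only removes characters from the Combining Diacritical Marks block, which would explain why Thai marks survive; treat that as an assumption. The class and method names are made up for illustration. Note also that Thai marks are vowels and tone marks, so removing them changes the word rather than just simplifying it.

```java
import java.text.Normalizer;

class MarkStripper {
    // Decompose to NFD, then drop every code point in Unicode category M (marks).
    static String stripMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Latin: U+00E9 decomposes to 'e' plus U+0301, and the mark is removed.
        System.out.println(stripMarks("caf\u00E9"));
        // Thai: U+0E01 (KO KAI) followed by U+0E48 (MAI EK, a nonspacing mark).
        System.out.println(stripMarks("\u0E01\u0E48").length());
    }
}
```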

Related

Replace quote (‘NOA’) using groovy

Can anyone guide me on how to remove these characters (‘ ’) using Groovy or Java?
When I try the code below (I assumed this is a single quote), it does not work.
def a = "‘NOA’,’CTF’,’CLM’"
def rep = a.replaceAll("\'","")
My expected output: NOA,CTF,CLM
Those are curly quotes in your source text, but your replaceAll is replacing straight quotes.
Copy and paste the actual characters from your source instead.
System.out.println(
"‘NOA’,’CTF’,’CLM’"
.replaceAll( "‘" , "" )
.replaceAll( "’" , "" )
);
See this code run live at OneCompiler.
NOA,CTF,CLM
I would suggest a character class:
a.replaceAll("[‘’]", "")
or, even better, escape the Unicode characters in the source code:
a.replaceAll("[\u2018\u2019]", "")
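The escaped form can be sketched in plain Java as follows. The class and method names are hypothetical. The doubled backslashes pass the escapes through to the regex engine, which parses them itself, so the source file stays pure ASCII and is not affected by editor or compiler encoding settings.

```java
class CurlyQuotes {
    // Removes U+2018 and U+2019, the left and right single quotation marks.
    static String stripCurlyQuotes(String s) {
        return s.replaceAll("[\\u2018\\u2019]", "");
    }

    public static void main(String[] args) {
        String input = "\u2018NOA\u2019,\u2019CTF\u2019,\u2019CLM\u2019";
        System.out.println(stripCurlyQuotes(input));
    }
}
```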

Java regex is working in my system but not in the server

The regular expression is
String regex = "^[\\p{IsHangul}\\p{IsDigit}]+";
Whenever I call
text.matches(regex);
it works fine on my system but not on some other systems.
I have not been able to track down the issue.
Thank you in advance.
Exception:
Exception in thread "main" java.util.regex.PatternSyntaxException: Unknown character property name {Hangul} near index 13
^[\p{IsHangul}\p{IsDigit}]+
^
at java.util.regex.Pattern.error(Pattern.java:1713)
at java.util.regex.Pattern.charPropertyNodeFor(Pattern.java:2437)
at java.util.regex.Pattern.family(Pattern.java:2412)
at java.util.regex.Pattern.range(Pattern.java:2335)
at java.util.regex.Pattern.clazz(Pattern.java:2268)
at java.util.regex.Pattern.sequence(Pattern.java:1818)
at java.util.regex.Pattern.expr(Pattern.java:1752)
at java.util.regex.Pattern.compile(Pattern.java:1460)
at java.util.regex.Pattern.<init>(Pattern.java:1133)
at java.util.regex.Pattern.compile(Pattern.java:823)
at java.util.regex.Pattern.matches(Pattern.java:928)
at java.lang.String.matches(String.java:2090)
at com.mycompany.helper.ApplicationHelper.main(ApplicationHelper.java:200)
According to Using Regular Expressions in Java:
Java 5 fixes some bugs and adds support for Unicode blocks. ...
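Note, however, that script names such as \p{IsHangul} were only added to java.util.regex in Java 7, so make sure the server is running Java 7 or later.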
It seems that the Java version you are using does not recognise Hangul as a script name, so you can build your own character class covering the same ranges as Hangul in newer versions of Java.
From the source code of Character.UnicodeScript in Java 8, Hangul refers to these Unicode ranges:
1100..11FF
302E..302F
3131..318F
3200..321F
3260..327E
A960..A97F
AC00..D7FB
FFA0..FFDF
so maybe try a pattern like this:
Pattern.compile("^["
+ "\u1100-\u11FF"
+ "\u302E-\u302F"
+ "\u3131-\u318F"
+ "\u3200-\u321F"
+ "\u3260-\u327E"
+ "\uA960-\uA97F"
+ "\uAC00-\uD7FB"
+ "\uFFA0-\uFFDF"
+ "\\p{IsDigit}]+");
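If the same code has to run on both old and new runtimes, one option is to try the script name first and fall back to the explicit ranges only when compilation fails. This is a sketch under that assumption, not a tested fix for the server in question. The class and method names are invented, and the ranges are copied from the list above.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class HangulPattern {
    // Explicit ranges from the list above; the regex engine parses the
    // backslash-u escapes itself, so the source stays ASCII.
    private static final String EXPLICIT =
        "^[\\u1100-\\u11FF\\u302E-\\u302F\\u3131-\\u318F\\u3200-\\u321F"
        + "\\u3260-\\u327E\\uA960-\\uA97F\\uAC00-\\uD7FB\\uFFA0-\\uFFDF"
        + "\\p{IsDigit}]+";

    static Pattern explicit() {
        return Pattern.compile(EXPLICIT);
    }

    // Prefer the script name; fall back when the runtime rejects it.
    static Pattern compile() {
        try {
            return Pattern.compile("^[\\p{IsHangul}\\p{IsDigit}]+");
        } catch (PatternSyntaxException e) {
            return explicit();
        }
    }

    public static void main(String[] args) {
        // U+D55C U+AE00 are two Hangul syllables, followed by digits.
        System.out.println(compile().matcher("\uD55C\uAE00123").matches());
    }
}
```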

A custom tokenizer for Java

I am developing an application in which I need to process text files containing emails. I need all the tokens from the text and the following is the definition of token:
Alphanumeric
Case-sensitive (case to be preserved)
'!' and '$' are to be considered as constituent characters. Ex: FREE!!, $50 are tokens
'.' (dot) and ',' comma are to be considered as constituent characters if they occur between numbers. For ex:
192.168.1.1, $24,500
are tokens.
and so on..
Please suggest some open-source tokenizers for Java that are easy to customize to suit my needs. Will simply using StringTokenizer and regex be enough? I also have to perform stopping, which is why I was looking for an open-source tokenizer that also handles extras such as stopping and stemming.
A few comments up front:
From StringTokenizer javadoc:
StringTokenizer is a legacy class that is retained for compatibility
reasons although its use is discouraged in new code. It is recommended
that anyone seeking this functionality use the split method of String
or the java.util.regex package instead.
Always use Google first. The first result as of now is JTopas. I have not used it, but it looks like it could work for this.
As for regex, it really depends on your requirements. Given the above, this might work:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Mkt {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("([$\\d.,]+)|([\\w\\d!$]+)");
        String str = "--- FREE!! $50 192.168.1.1 $24,500";
        System.out.println("input: " + str);
        Matcher m = p.matcher(str);
        while(m.find()) {
            System.out.println("token: " + m.group());
        }
    }
}
Here's a sample run:
$ javac Mkt.java && java Mkt
input: --- FREE!! $50 192.168.1.1 $24,500
token: FREE!!
token: $50
token: 192.168.1.1
token: $24,500
Now, you might need to tweak the regex, for example:
You gave $24,500 as an example. Should this work for $24,500abc or $24,500EUR?
You mentioned 192.168.1.1 should be included. Should it also include 192,168.1,1 (given . and , are to be included)?
and I guess there are other things to consider.
Hope this helps to get you started.
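As one possible tweak, the sketch below only accepts '.' and ',' when they sit between digits, so trailing punctuation after a number is not swallowed. The class name and the exact rules are assumptions for illustration. In particular, a number immediately followed by letters is treated as an ordinary alphanumeric token, which may or may not be what you want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailTokenizer {
    private static final Pattern TOKEN = Pattern.compile(
        // Number form: optional '$', digits, then '.' or ',' only between digits.
        // The lookahead rejects this form when a word character, '!' or '$' follows.
        "\\$?\\d+(?:[.,]\\d+)*(?![\\w!$])"
        // Otherwise: a run of word characters, '!' and '$'.
        + "|[\\w!$]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("--- FREE!! $50 192.168.1.1 $24,500"));
        System.out.println(tokenize("Pay $24,500, ok."));
    }
}
```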

simple regex in java split

I have a string blah-*-bleh-*-bloh
and I want to split it by -*- so I tried (amongst other things):
res.split("/-\\*-/g");
But it's not working. Does anyone have an idea?
In Java, the pattern is passed as a plain string, so there is no need for / before and /g after:
String[] splittedArray = res.split("-\\*-");
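If the delimiter is a fixed string, an alternative to escaping it by hand is Pattern.quote, which makes the regex engine treat the whole delimiter literally. A small sketch, with a made-up class and method name:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // Splits on the delimiter taken literally, with no regex metacharacters.
    static String[] splitOnLiteral(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnLiteral("blah-*-bleh-*-bloh", "-*-")));
    }
}
```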

how can i get the Unicode infinity symbol converted to String

I want to use the infinity symbol (an 8 lying sideways) in Java.
Furthermore, I want to use it as part of a String.
I did not find a working char code or ASCII code for it (is there one?).
I tried:
String s=Character.toString(236);
String s=Character.toString('236');
Am I missing something?
I now have this:
System.out.println(Character.toString('\u221E'));
but the output is ?
I am using the Java 1.7 JDK and Eclipse. Why is the infinity sign not showing up?
You need the Unicode infinity sign, U+221E. 236 is a Windows typing convention (an Alt code) that won't help you here. '\u221e' is the character constant.
Now, I can't promise that this will put any ∞ characters on your screen. That depends on what sort of computer you have, what font you are using, and what you set in -Dfile.encoding. Also see this question.
I know this is a very late reply, but the information below may help someone.
In Eclipse, the default text file encoding for the console is Cp1252, which cannot represent ∞; see "How to support UTF-8 encoding in Eclipse" to change it.
I would also encourage handling the infinity symbol in a String as in the code below:
String infinitySymbol = null;
try {
    infinitySymbol = new String(String.valueOf(Character.toString('\u221E')).getBytes("UTF-8"), "UTF-8");
} catch (UnsupportedEncodingException ex) {
    infinitySymbol = "?";
    //ex.printStackTrace(); //print the unsupported encoding exception.
} finally {
    System.out.print("Symbol of infinity is : " + infinitySymbol);
}
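Since the String already holds the right character, the part that usually matters is how it is encoded on the way out. A minimal sketch, assuming the console is set to display UTF-8: wrap System.out in a PrintStream with an explicit charset instead of relying on the platform default. The class name is made up for illustration.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class InfinityDemo {
    // U+221E INFINITY as a one-character String.
    static final String INFINITY = "\u221E";

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Write UTF-8 bytes regardless of the JVM's default charset.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("Symbol of infinity is: " + INFINITY);
    }
}
```

If the console still shows ? or garbage, the bytes are probably correct and the console encoding or font is the remaining suspect.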
