String split, words including accented characters - java

I'm using this regex:
x.split("[^a-zA-Z0-9']+");
This returns an array of strings with letters and/or numbers.
If I use this:
String name = "CEN01_Automated_TestCase.java";
String[] names = name.split("[^a-zA-Z0-9']+");
I got:
CEN01
Automated
TestCase
java
But if I use this:
String name = "CEN01_Automação_Caso_Teste.java";
String[] names = name.split("[^a-zA-Z0-9']+");
I got:
CEN01
Automa
o
Caso
Teste
java
How can I modify this regex to include accented characters? (á,ã,õ, etc...)

From http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname.
Since the Character class has an isAlphabetic method, you can use
name.split("[^\\p{IsAlphabetic}0-9']+");
You can also use
name.split("(?U)[^\\p{Alpha}0-9']+");
but this needs the UNICODE_CHARACTER_CLASS flag (available since Java 7), which you can enable by adding (?U) to the regex.
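A minimal sketch of both variants side by side, assuming Java 7 or later. The class and method names (AccentedSplitDemo, splitWords, splitWordsUnicodeFlag) are made up for illustration, and the accented letters are written as Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.util.Arrays;

final class AccentedSplitDemo {
    // Splits on runs of anything that is not an alphabetic character,
    // an ASCII digit, or an apostrophe.
    static String[] splitWords(String name) {
        return name.split("[^\\p{IsAlphabetic}0-9']+");
    }

    // Same idea with the POSIX class Alpha, which only covers
    // non-ASCII letters when the (?U) flag is switched on.
    static String[] splitWordsUnicodeFlag(String name) {
        return name.split("(?U)[^\\p{Alpha}0-9']+");
    }

    public static void main(String[] args) {
        // The file name from the question, with the accents as escapes.
        String name = "CEN01_Automa\u00E7\u00E3o_Caso_Teste.java";
        System.out.println(Arrays.toString(splitWords(name)));
        System.out.println(Arrays.toString(splitWordsUnicodeFlag(name)));
    }
}
```

Both calls should return the five tokens CEN01, Automação, Caso, Teste and java.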

I would check out the Java documentation on regular expressions. There is a Unicode section, which I believe is what you are looking for.
EDIT: Example
Another way would be to match on the character code you are looking for. For example
\uFFFF where FFFF is the hexadecimal number of the character you are trying to match.
Example: \u00E0 matches à
Realize that the backslash will need to be escaped in Java if you are using it as a string literal.
Read more about it here.
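As a rough illustration of the escape approach, the sketch below lists two code points explicitly (U+00E7 and U+00E3, picked here only because they occur in the question's file name). The class and method names are hypothetical. Any accented letter that is not listed still acts as a separator, which is the main drawback compared with a Unicode property class.

```java
import java.util.Arrays;

final class UnicodeEscapeSplitDemo {
    // The doubled backslash keeps each escape intact in the Java string literal,
    // so the regex engine itself resolves the two code points
    // (c with cedilla and a with tilde).
    static String[] splitKeepingListedAccents(String s) {
        return s.split("[^a-zA-Z0-9'\\u00E7\\u00E3]+");
    }

    public static void main(String[] args) {
        String name = "CEN01_Automa\u00E7\u00E3o_Caso_Teste.java";
        System.out.println(Arrays.toString(splitKeepingListedAccents(name)));
    }
}
```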

You can use this:
String[] names = name.split("[^a-zA-Z0-9'\\p{L}]+");
System.out.println(Arrays.toString(names));
Will output:
[CEN01, Automação, Caso, Teste, java]
See this for more information.

Why not split on the separator characters?
String[] names = name.split("[_.]");

Instead of blacklisting all the characters you don't want, you could always whitelist the characters you want, like:
^[^<>%$]*$
The expression [^(many characters here)] just matches any character that is not listed.
But that is a personal opinion.

Related

Split and parse string into integers [duplicate]

So I want to split a string in java on any non-alphanumeric characters.
Currently I have been doing it like this
words= Str.split("\\W+");
However I want to keep apostrophes("'") in there. Is there any regular expression to preserve apostrophes but kick the rest of the junk? Thanks.
words = Str.split("[^\\w']+");
Just add it to the character class. \W is equivalent to [^\w], which you can then add ' to.
Do note, however, that \w also actually includes underscores. If you want to split on underscores as well, you should be using [^a-zA-Z0-9'] instead.
For basic English characters, use
words = Str.split("[^a-zA-Z0-9']+");
If you want to include English words with special characters (such as fiancé) or for languages that use non-English characters, go with
words = Str.split("[^\\p{L}0-9']+");
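To make the difference between the two character classes concrete, here is a small sketch. The class name, method names and sample sentence are invented for the example, and the accented letter is written as an escape to stay independent of the file encoding.

```java
import java.util.Arrays;

final class ApostropheSplitDemo {
    // ASCII letters, digits and apostrophes stay inside tokens.
    static String[] asciiWords(String text) {
        return text.split("[^a-zA-Z0-9']+");
    }

    // Any Unicode letter, ASCII digits and apostrophes stay inside tokens.
    static String[] unicodeWords(String text) {
        return text.split("[^\\p{L}0-9']+");
    }

    public static void main(String[] args) {
        String text = "Don't leave, my fianc\u00E9 said!";
        System.out.println(Arrays.toString(asciiWords(text)));   // the accented word is cut short
        System.out.println(Arrays.toString(unicodeWords(text))); // the accented word survives
    }
}
```

Note that split drops trailing empty strings but keeps a leading one, so input that starts with a separator yields an empty first element.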

Validate with REGEX alpha with support in many languages

Hi guys, I am working on a Java validator library. My question is: how can I validate that inputs are alphabetic (not alphanumeric) in many languages? I have the following regex:
public AlphaValidator() {
super();
this.rule = "^[a-zA-Z[*]]+$"; // its fine with : angel, world, bottle, etc.
}
It works, but if the library is used for Spanish or French input, phrases like "vi un ñandu" or words like "árbol" do not match the regex.
I started listing the special characters like this:
private String getSpanishFilter() {
return "-ñ-Ñ-á-Á-é-É-í-Í-ó-Ó-ú-Ú-ü-Ü";
}
private String getFrenchFilter() {
return "â-à-ç-é-ê-ë-è-ï-î-ô-û-ù-Â-À-Ç-É-Ê-Ë-È-Ï-Î-Ô-Û-Ù";
}
But I think this is not the best solution. Any help?
I don't understand why you have enclosed * inside a nested character class. A nested class is just a union, so it is equivalent to writing * directly. To match Unicode letters, you can use \p{L}.
And if you're already on Java 7, then you can use Pattern.UNICODE_CHARACTER_CLASS flag, or embedded flag - (?U) with your given pattern:
Pattern p = Pattern.compile("^[*\\w&&[^\\d_]]+$", Pattern.UNICODE_CHARACTER_CLASS);
And if you're keeping regex as string, then use embedded flag as:
rule = "(?U)^[*\\w&&[^\\d_]]+$";
Have you looked at the docs for Pattern?
Under "Classes for Unicode scripts, blocks, categories and binary properties":
\p{IsAlphabetic} An alphabetic character (binary property)
So your pattern could be:
"\\p{IsAlphabetic}+"
The shortest way with matches():
\\pL+ # no need to add anchors with matches() method
The shortest way with find():
\\PL # stop at the first non letter character
note: you can write \\p{L} and \\P{L} too, \\pL and \\PL are shortcuts.
But if you need to match only latin characters, it's better to use:
\\p{IsLatin}+
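A hedged sketch of how the matches() and find() variants could be wired into a validator. AlphaValidatorSketch and its method names are hypothetical and are not the asker's actual library API.

```java
import java.util.regex.Pattern;

final class AlphaValidatorSketch {
    // One or more Unicode letters; matches() anchors the pattern implicitly.
    private static final Pattern LETTERS_ONLY = Pattern.compile("\\p{L}+");

    // A single non-letter; find() stops at the first offending character.
    private static final Pattern NON_LETTER = Pattern.compile("\\P{L}");

    static boolean isAlpha(String input) {
        return LETTERS_ONLY.matcher(input).matches();
    }

    static boolean isAlphaViaFind(String input) {
        // The empty string has no non-letter, so it is rejected explicitly.
        return !input.isEmpty() && !NON_LETTER.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isAlpha("\u00E1rbol"));        // accented Spanish word
        System.out.println(isAlpha("abc123"));            // contains digits
        System.out.println(isAlphaViaFind("\u00F1andu")); // starts with n-tilde
    }
}
```

A phrase such as "vi un ñandu" still fails both checks because spaces are not letters. If spaces should be allowed, they would have to be added to the class, for example "[\\p{L} ]+".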

What all characters can be used as String Delimiters in Java?

I am trying break a String in various pieces using delimiter(":").
String sepIds[]=ids.split(":");
It is working fine. But when I replace ":" with " * " and use " * " as delimiter, it doesn't work.
String sepIds[]=ids.split("*"); //doesn't work
It just hangs up there, and doesn't execute further.
What mistake I am making here?
String#split takes a regular expression as parameter. In regex some chars have special meanings so they need to be escaped, for example:
"foo*bar".split("\\*")
the result will be as you expect:
[foo, bar]
You could also use the method Pattern#quote to simplify the task.
"foo*bar".split(Pattern.quote("*"))
String.split expects a regular expression argument. * has got a meaning in regex. So if you want to use them then you need to escape them like this:
String sepIds[]=ids.split("\\*");
The argument of .split() is a regular expression, not a string literal. Therefore you need to escape * since it is a special regex character. Write:
ids.split("\\*");
This is how you would split on one or more whitespace characters:
ids.split("\\s+");
Note that Guava has Splitter which is very, very fast and can split against literals:
Splitter.on('*').split(ids);
'*' and '.' are special characters, so you have to escape them with a backslash.
String sepIds[]=ids.split("\\*");
To read more about java patterns please visit that page.
That is expected behaviour. The documentation for the String split function says that the input string is treated as a regular expression (with a link explaining how that works). As Germann points out, '*' is a special character in regular expressions.
Java's String.split() uses regular expressions to split up the string (unlike similar functions in C# or python). * is a special character in regular expressions and you need to escape it with a \ (backslash). So you should use instead:
String sepIds[]=ids.split("\\*");
You can find more information on regular expressions all over the internet. A fairly complete list of the special characters supported by Java is here: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
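The answers above can be pulled together in a short sketch. The helper names (splitOnLiteral, compilesAsRegex) are invented for illustration. On a standard JDK, passing a bare "*" to split does not hang silently: Pattern compilation throws a PatternSyntaxException, which the second helper makes visible.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class LiteralDelimiterDemo {
    // Pattern.quote turns the delimiter into a literal, so regex
    // metacharacters such as * or . lose their special meaning.
    static String[] splitOnLiteral(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    // Reports whether a string is a syntactically valid regex.
    static boolean compilesAsRegex(String candidate) {
        try {
            Pattern.compile(candidate);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnLiteral("12*34*56", "*")));
        System.out.println(compilesAsRegex("*"));   // a dangling quantifier is not a valid regex
        System.out.println(compilesAsRegex("\\*")); // the escaped form is valid
    }
}
```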

Split on non arabic characters

I have a String like this
أصبح::ينال::أخذ::حصل (على)::أحضر
And I want to split it on non Arabic characters using java
And here's my code
String s = "أصبح::ينال::أخذ::حصل (على)::أحضر";
String[] arr = s.split("^\\p{InArabic}+");
System.out.println(Arrays.toString(arr));
And the output was
[, ::ينال::أخذ::حصل (على)::أحضر]
But I expect the output to be
[ينال,أخذ,حصل,على,أحضر]
So I don't know what's wrong with this?
You need a negated class, and to do that, you need square brackets [ ... ]. Try to split with this:
"[^\\p{InArabic}]+"
If \\p{InArabic} matches any arabic character, then [^\\p{InArabic}] will match any non-arabic character.
Another option is an equivalent syntax that uses P instead of p to negate the \\p{InArabic} character class, as @Pshemo mentioned:
"\\P{InArabic}+"
This works just like \\W is the opposite of \\w.
The only possible advantage of the first syntax over the second (again, as @Pshemo mentioned) is flexibility when you want to add other characters to the list of characters that shouldn't match. For example, to match everything that is neither \\p{InArabic} nor a period:
"[^\\p{InArabic}.]+"
^
Otherwise, if you really want to use \\P{InArabic}, you'll need subtraction within classes:
"[\\P{InArabic}&&[^.]]+"
The expression you want is "\\P{InArabic}+"
This means match any (non-zero) number of characters that are not Arabic.
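A small sketch of the \\P{InArabic}+ split, using a shortened sample written with Unicode escapes so it does not depend on source encoding or on right-to-left rendering. The class name, method name and sample string are assumptions made for the example.

```java
import java.util.Arrays;

final class ArabicSplitDemo {
    // Splits on runs of characters outside the Arabic Unicode block.
    static String[] arabicRuns(String s) {
        return s.split("\\P{InArabic}+");
    }

    public static void main(String[] args) {
        // Three Arabic words separated by "::" and " (", with a trailing ")".
        String s = "\u0623\u062E\u0630::\u062D\u0635\u0644 (\u0639\u0644\u0649)";
        System.out.println(Arrays.toString(arabicRuns(s)));
    }
}
```

Applied to the full string from the question, this approach also returns the first word, since nothing in the pattern skips it. Input that begins with a non-Arabic character produces an empty first element.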

Remove everything from a string up to a certain character and optionally a string if it follows too

I am looking to write a regex that removes all characters up to the first &emsp and, if a (new section) follows the &emsp, removes that as well. But the following regex doesn't seem to work. Why? How do I correct this?
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
Pattern removeEmspPattern1 = Pattern.compile("(.*( (\\(new section\\)))?)(.*)", Pattern.MULTILINE);
System.out.println(removeEmspPattern1.matcher(removeEmsp).replaceAll("$2"));
Have you tried String.split? It creates an array of strings from a string, based on a delimiter.
Once you have the string split, just select the elements of the array that you need for print statement.
Read more here
Your regex is very long and I do not want to debug it. However, the tip is that some characters have special meaning in regular expressions. For example, square brackets define character classes, and && inside a character class means intersection. Such characters must be escaped if you want them to be treated as literal characters rather than regex syntax. To escape a special character, put \ in front of it. But \ is also the escape character in Java string literals, so it must be doubled.
For example, to replace every ampersand with the letter A you can write str.replaceAll("\\&", "A") (a lone & is not special, so "&" works as well).
Now you have all information you need. Try to start from simpler regex and then expand it to what you need. Good luck.
EDIT
BTW, parsing XML and/or HTML with regular expressions is possible but strongly discouraged. Use a dedicated parser for such formats.
Try this:
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
System.out.println(removeEmsp.replaceFirst("^.*?\\ (\\(new\\ssection\\))?", ""));
System.out.println(removeEmsp.replaceAll("^.*?\\ (\\(new\\ssection\\))?", ""));
Output:
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
It will remove everything up to " " and optionally, the following "(new section)" text if any.
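The same idea as a self-contained sketch, under the assumption that the separator reaches the program as the actual em space character (U+2003) rather than as entity text. If the input contains the entity text instead, the escape in the pattern would need to be replaced accordingly. The class and method names are hypothetical.

```java
final class EmspPrefixRemover {
    // Lazily consumes everything up to and including the first em space,
    // then an optional "(new section)" directly after it.
    // (?s) lets the dot cross line breaks, which is an assumption about the input.
    static String stripThroughEmsp(String s) {
        return s.replaceFirst("(?s)^.*?\\u2003(\\(new section\\))?", "");
    }

    public static void main(String[] args) {
        String withMarker = "431:10A-126\u2003(new section)[<centd>]Chemotherapy services.";
        String withoutMarker = "431:10A-126\u2003[<centd>]Chemotherapy services.";
        System.out.println(stripThroughEmsp(withMarker));
        System.out.println(stripThroughEmsp(withoutMarker));
    }
}
```

A string without any em space is returned unchanged, because the pattern then fails to match.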
