Java 7 Unicode Regex Tabs-only and Spaces-only - java

I'm currently trying to add support to our application for Japanese and French language encodings. In doing so, I'm trying to create two Pattern matchers to detect tabs-only and spaces-only in a read file, regardless of language encoding.
These will be used to determine what delimiter is used in a file, so they can be processed accordingly.
When I've tried compiling a space pattern
Pattern.compile(" ", Pattern.UNICODE_CHARACTER_CLASS);
I don't see it generating a regex to handle different unicode space values.
eg something like "[\\u00A0\\u2028\\u2029\\u3000\\u00C2\\u009A\\u0041]"
Compilation seems to work properly with the '\s' character set, but that includes tabs and newlines.
How should I be doing this in Java?
UPDATE
So part of the reason this wasn't working was that Japanese web text HAS NO spaces, even though there appear to be spaces. Take the following line from a web import:
実なので説明は不要だろう。その後1987
There are actually no spaces in う。そ, just three characters.
Fixing this is really the subject of another question, so I have accepted Casimir's answer, as it handled the French case just fine.
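One way to check the observation in the update is to print the code points of the suspicious fragment. The sketch below is only an illustration (the class and method names are made up); it lists each code point and whether Java treats it as a Unicode space character. The visible gap appears to come from the glyph of the ideographic full stop (U+3002) rather than from a separate space character.

```java
class CodePointDump {
    // Hypothetical helper: one line per code point, with its space status.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            sb.append(String.format("U+%04X space=%b%n", cp, Character.isSpaceChar(cp)));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Hypothetical helper: true if any code point is a Unicode space character.
    static boolean containsSpace(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isSpaceChar(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // The fragment from the update: hiragana U, ideographic full stop, hiragana SO.
        String fragment = "\u3046\u3002\u305D";
        System.out.print(describe(fragment));
        System.out.println("contains a space: " + containsSpace(fragment));
    }
}
```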

You can use a negated character class. Example:
[^\\S \\t]
that means \s without space and tab.
Or you can use a class intersection:
[\\s&&[^ \\t]]
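A small check of the two character classes above, written as a sketch (the class and method names are only for illustration). Both are expected to accept whitespace such as newline or carriage return while rejecting the plain space and the tab.

```java
import java.util.regex.Pattern;

class WhitespaceClassDemo {
    // Negated class: anything that is not (non-whitespace, space, or tab).
    static final Pattern NEGATED = Pattern.compile("[^\\S \\t]");
    // Intersection: whitespace that is also not a space or a tab.
    static final Pattern INTERSECTION = Pattern.compile("[\\s&&[^ \\t]]");

    static boolean negated(String s) {
        return NEGATED.matcher(s).matches();
    }

    static boolean intersection(String s) {
        return INTERSECTION.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"\n", "\r", " ", "\t", "a"};
        for (String s : samples) {
            // Prints the result of both patterns for each one-character sample.
            System.out.println(negated(s) + " " + intersection(s));
        }
    }
}
```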

If I follow your question, you could use something like this for spaces -
Pattern p = Pattern.compile("^[ ]+$", Pattern.UNICODE_CHARACTER_CLASS);
String[] inputs = {" ", " ", " \t", "Hello"};
for (String input : inputs) {
Matcher m = p.matcher(input);
System.out.printf("For input: '%s' = %s%n", input, m.find());
}
Output is
For input: ' ' = true
For input: ' ' = true
For input: ' ' = false
For input: 'Hello' = false
and for tabs
Pattern p = Pattern.compile("^[\t]+$", Pattern.UNICODE_CHARACTER_CLASS);
String[] inputs = {"\t", "\t\t", " \t", "Hello"};
for (String input : inputs) {
Matcher m = p.matcher(input);
System.out.printf("For input: '%s' = %s%n", input, m.find());
}
Output is
For input: ' ' = true
For input: ' ' = true
For input: ' ' = false
For input: 'Hello' = false
Finally, use * instead of + to allow zero or more matches. The patterns above use +, so at least one character is required, and ^ and $ anchor the match to the start and end of the input.
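If the goal is to also accept non-ASCII spaces such as the no-break space (U+00A0) or the ideographic space (U+3000), one possible variation is the Unicode category \p{Zs} (space separators), which does not include the tab. This is a sketch under that assumption, not part of the answer above; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class DelimiterPatterns {
    // Assumption: "spaces" means Unicode space separators (category Zs),
    // e.g. U+0020, U+00A0 and U+3000. Tab is not in that category.
    static final Pattern SPACES_ONLY = Pattern.compile("\\p{Zs}+");
    static final Pattern TABS_ONLY = Pattern.compile("\\t+");

    static boolean spacesOnly(String s) {
        // matches() requires the whole input to match, so no anchors are needed.
        return SPACES_ONLY.matcher(s).matches();
    }

    static boolean tabsOnly(String s) {
        return TABS_ONLY.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(spacesOnly(" \u00A0\u3000"));
        System.out.println(spacesOnly(" \t"));
        System.out.println(tabsOnly("\t\t"));
    }
}
```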

Related

Regex for finding only single alphabets in a string and ignore consecutive double

I have searched a lot but I am unable to find a regex that selects only single letters and doubles them, while letters that are already doubled remain untouched.
I tried
String str = "yahoo";
str = str.replaceAll("(\\w)\\1+", "$0$0");
But since (\\w)\\1+ selects all doubled characters, my output becomes yahoooo. I tried to negate it with !(\\w)\\1+, but that didn't work and the output is the same as the input. I have also tried
str.replaceAll(".", "$0$0");
But that doubles every character, including the ones that are already doubled.
Please help me write a regex that replaces every single character with a double one, while characters that are already doubled remain untouched.
Example
abc -> aabbcc
yahoo -> yyaahhoo (o should remain untouched)
opinion -> ooppiinniioonn
aaaaaabc -> aaaaaabbcc
You can match using this regex:
((.)\2+)|(.)
And replace it with:
$1$3$3
RegEx Explanation:
((.)\2+): Match a character, capture it in group #2, and use \2+ after it to match all further repeats of that character. The whole run of repeated characters is captured in group #1
|: OR
(.): Match any character and capture in group #3
Code Demo:
import java.util.List;
class Ideone {
public static void main(String[] args) {
List<String> input = List.of("aaa", "abc", "yahoo",
"opinion", "aaaaaabc");
for (String s: input) {
System.out.println( s + " => " +
s.replaceAll("((.)\\2+)|(.)", "$1$3$3") );
}
}
}
Output:
aaa => aaa
abc => aabbcc
yahoo => yyaahhoo
opinion => ooppiinniioonn
aaaaaabc => aaaaaabbcc
The solution by @anubhava, if viable in Java, is probably the best way to go. For a more brute-force approach, we can iterate over the matches of the following pattern:
(\\w)\\1+|\\w
This greedily matches a run of two or more identical word characters or, failing that, a single word character. For each match, we leave a multi-character run unchanged and double any single character. Here is a short Java snippet which does this:
List<String> inputs = Arrays.asList(new String[] {"abc", "yahoo", "opinion", "aaaaaabc"});
String pattern = "(\\w)\\1+|\\w";
Pattern r = Pattern.compile(pattern);
for (String input : inputs) {
Matcher m = r.matcher(input);
StringBuffer buffer = new StringBuffer();
while (m.find()) {
if (m.group().matches("(\\w)\\1+")) {
m.appendReplacement(buffer, m.group());
}
else {
m.appendReplacement(buffer, m.group() + m.group());
}
}
m.appendTail(buffer);
System.out.println(input + " => " + buffer.toString());
}
This prints:
abc => aabbcc
yahoo => yyaahhoo
opinion => ooppiinniioonn
aaaaaabc => aaaaaabbcc
I've got two different understandings of the question.
If the goal is to get an even amount of each word character:
Search for (\w)\1? and replace with $1$1 (regex101 demo).
If only single characters should be duplicated and others left untouched:
Search for (\w)\1?(\1*) and replace with $1$1$2 (regex 101 demo).
Captures a word character \w to $1, optionally matches the same character again. The second variant captures any more of the same character to $2 for attaching in the replacement.
FYI: If using as a Java string remember to escape the pattern. E.g. \1 -> \\1, \w ->\\w, ...
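As a rough illustration of the two variants with the escaping applied for Java strings (the class and method names are made up for this sketch):

```java
class DoubleSingles {
    // Variant 1: every word character ends up with an even count.
    static String evenCount(String s) {
        return s.replaceAll("(\\w)\\1?", "$1$1");
    }

    // Variant 2: only lone characters are doubled; longer runs stay as they are.
    static String doubleSingles(String s) {
        return s.replaceAll("(\\w)\\1?(\\1*)", "$1$1$2");
    }

    public static void main(String[] args) {
        String[] inputs = {"aaa", "abc", "yahoo", "opinion", "aaaaaabc"};
        for (String input : inputs) {
            System.out.println(input + " => " + evenCount(input)
                    + " / " + doubleSingles(input));
        }
    }
}
```

The two variants only differ on runs of odd length greater than one: for "aaa" the first yields "aaaa" and the second leaves "aaa" unchanged.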

Regex NotBlank and doesnt contains <

I am trying to create a regex for a String that is not blank and does not contain "<".
What am I doing wrong? Thank you.
"(\\A(?!\\s*\\Z))|([^<]+)"
Edit
Or, put another way: how do I combine this regex
^[^<]+$
with this regex
\\A(?!\\s*\\Z).+
With regex, you can use
\A(?!\s+\z)[^<]+\z
(?U)\A(?!\s+\z)[^<]+\z
The (?U) is only necessary when you expect any Unicode chars in the input.
In Java, when used with matches, the anchors on both ends are implicit:
text.matches("(?U)(?!\\s+\\z)[^<]+")
The regex in matches is executed once and requires the full string match. Here, it matches
\A - (implicit in matches) - start of string
(?U) - Pattern.UNICODE_CHARACTER_CLASS option enabled so that \s could match any Unicode whitespaces
(?!\\s+\\z) - a negative lookahead that fails the match if the whole string, from the start to the very end, consists only of one or more whitespace characters
[^<]+ - one or more chars other than <
\z - (implicit in matches) - end of string.
See the Java test:
String texts[] = {"Abc <<", " ", "", "abc 123"};
Pattern p = Pattern.compile("(?U)(?!\\s+\\z)[^<]+");
for(String text : texts)
{
Matcher m = p.matcher(text);
System.out.println("'" + text + "' => " + m.matches());
}
Output:
'Abc <<' => false
' ' => false
'' => false
'abc 123' => true
See an online regex test (modified to fit the single multiline string demo environment so as not to cross over line boundaries.)
You can try to use this regex:
[^<\s]+
One or more characters, none of which is "<" or whitespace.
Here is the example to test it: https://regex101.com/r/9ptt15/2
However, you can try to solve it without a regular expression:
boolean isValid = s != null && !s.isEmpty() && s.indexOf(" ") == -1 && s.indexOf("<") == -1;
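If spaces between words should still be allowed (as in 'abc 123' from the test output earlier in this thread), a regex-free check could look roughly like the sketch below. The helper name is hypothetical, and Character.isWhitespace follows Java's own definition of whitespace, which is not identical to \s with (?U).

```java
class NotBlankNoAngle {
    // Hypothetical helper: non-null, contains no '<', and has at least
    // one character that is not whitespace.
    static boolean isValid(String s) {
        if (s == null || s.indexOf('<') >= 0) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isWhitespace(s.charAt(i))) {
                return true;
            }
        }
        return false; // empty or whitespace only
    }

    public static void main(String[] args) {
        String[] texts = {"Abc <<", " ", "", "abc 123"};
        for (String text : texts) {
            System.out.println("'" + text + "' => " + isValid(text));
        }
    }
}
```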

regex - How to match elements while ignoring others between quotation marks?

I can't seem to find the regex that suits my needs.
I have a .txt file of this form:
Abc "test" aBC : "Abc aBC"
Brooking "ABC" sadxzc : "I am sad"
asd : "lorem"
a22 : "tactius"
testsa2 : "bruchia"
test : "Abc aBC"
b2 : "Ast2"
From this .txt file I wish to extract everything matching this regex "([a-zA-Z]\w+)", except the ones between the quotation marks.
I want to rename every word (except the words in quotation marks), so I should have for example the following output:
A "test " B : "Abc aBC"
Z "ABC" X : "I am sad"
Test : "lorem"
F : "tactius"
H : "bruchia"
Game : "Abc aBC"
S: "Ast2"
Is this even achievable using a regex? Are there alternatives without using regex?
If quotes are balanced and there is no escaping in the input like \" then you can use this regex to match words outside double quotes:
(?=(?:(?:[^"]*"){2})*[^"]*$)(\b[a-zA-Z]\w+\b)
In java it will be:
Pattern p = Pattern.compile("(?=(?:(?:[^\"]*\"){2})*[^\"]*$)(\\b[a-zA-Z]\\w+\\b)");
This regex matches words that are outside double quotes by using a lookahead to make sure there is an even number of quotes after each matched word.
A simple approach might be to split the string by ", then do the replace using your regex on every odd part (on parts 1, 3, ..., if you start the numbering from 1), and join everything back.
UPD
However, it is also simple to implement manually. Just go along the line and track whether you are inside quotes or not.
insideQuotes = false
result = ""
currentPart = ""
input = input + '"' // so that we do not need to process the last part separately
for ch in input
    if ch == '"'
        if not insideQuotes
            currentPart = replace(currentPart)
        result = result + currentPart + '"'
        currentPart = ""
        insideQuotes = not insideQuotes
    else
        currentPart = currentPart + ch
drop the last symbol of result (it is that quote mark that we have added)
However, also consider whether you will need more advanced syntax, for example quote escaping like
word "inside quote \" still inside" outside again
If so, you will need a more advanced parser, or you might think of using some special format.
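The manual scan described above could be sketched in Java roughly as follows. The replace() method here is only a placeholder that upper-cases its argument, standing in for whatever renaming is actually needed, and the sketch flushes the last part after the loop instead of appending a sentinel quote. It assumes balanced quotes and no escaping.

```java
import java.util.Locale;

class QuoteAwareRename {
    // Placeholder for the real renaming step (an assumption for this sketch).
    static String replace(String part) {
        return part.toUpperCase(Locale.ROOT);
    }

    static String process(String input) {
        StringBuilder result = new StringBuilder();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;
        for (char ch : input.toCharArray()) {
            if (ch == '"') {
                // A part ends at every quote; only unquoted parts are rewritten.
                String part = current.toString();
                result.append(insideQuotes ? part : replace(part)).append('"');
                current.setLength(0);
                insideQuotes = !insideQuotes;
            } else {
                current.append(ch);
            }
        }
        // Flush whatever follows the last quote.
        String tail = current.toString();
        result.append(insideQuotes ? tail : replace(tail));
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(process("Abc \"test\" aBC : \"Abc aBC\""));
    }
}
```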
You can’t formulate a “within quotes” condition the way you might think. But you can easily search for unquoted words or quoted strings and take action only for the unquoted words:
Pattern p = Pattern.compile("\"[^\"]*\"|([a-zA-Z]\\w+)");
for(String s: lines) {
Matcher m=p.matcher(s);
while(m.find()) {
if(m.group(1)!=null) {
System.out.println("take action with "+m.group(1));
}
}
}
This utilizes the fact that each search for the next match starts at the end of the previous. So if you find a quoted string ("[^"]*") you don’t take any action and continue searching for other matches. Only if there is no match for a quoted string, the pattern looks for a word (([a-zA-Z]\w+)) and if one is found, the group 1 captures the word (will be non null).
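To actually rewrite the line rather than just print the words, the same pattern can be combined with appendReplacement and appendTail. This is a sketch only: rename() is a made-up rule that wraps the word in angle brackets so the effect is visible, and quoted strings are copied back unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RenameUnquoted {
    // Quoted string first, so words inside quotes are consumed without capture.
    static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|([a-zA-Z]\\w+)");

    // Hypothetical renaming rule used only for illustration.
    static String rename(String word) {
        return "<" + word + ">";
    }

    static String process(String line) {
        Matcher m = TOKEN.matcher(line);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Group 1 is non-null only when an unquoted word was matched.
            String replacement = (m.group(1) != null) ? rename(m.group(1)) : m.group();
            // quoteReplacement keeps any '$' or '\' in the text literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(process("Abc \"test\" aBC : \"Abc aBC\""));
    }
}
```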

Optimizing several RegEx in Java Code

The regexes below perform very poorly on a very large string of more than 2000 lines. The Java String contains a PL/SQL script.
1- Replace each occurrence of a delimiting character sequence, for example ||, != or >, with the same sequence surrounded by a space on each side. This never finishes, so no time can be recorded.
// Delimiting characters for SQLPlus
private static final String[] delimiters = { "\\|\\|", "=>", ":=", "!=", "<>", "<", ">", "\\(", "\\)", "!", ",", "\\+", "-", "=", "\\*", "\\|" };
for (int i = 0; i < delimiters.length; i++) {
script = script.replaceAll(delimiters[i], " " + delimiters[i] + " ");
}
2- The following pattern looks for all occurrences of a forward slash / except the ones that are preceded or followed by a *. That means: don't match the forward slashes in block comment syntax. This takes about 103 seconds for a string of 2000 lines.
Pattern p = Pattern.compile("([^\\*])([\\/])([^\\*])");
Matcher m = p.matcher(script);
while (m.find()) {
script = script.replaceAll(m.group(2), " " + m.group(2) + " ");
}
3- Remove any white spaces from within date or date format
Pattern p = Pattern.compile("(?i)(\\w{1,2}) +/ +(\\w{1,2}) +/ +(\\w{2,4})");
// Create a matcher with an input string
Matcher m = p.matcher(script);
while (m.find()) {
part1 = script.substring(0, m.start());
part2 = script.substring(m.end());
script = part1 + m.group().replaceAll("[ \t]+", "") + part2;
m = p.matcher(script);
}
Is there any way to optimize all the three RegEx so that they take less time?
Thanks
Ali
I'll answer the first question.
You can combine all this into a single regex replace operation:
script = script.replaceAll("\\|\\||=>|[:!]=|<>|[<>()!,+=*|-]", " $0 ");
Explanation:
\|\| # Match ||
| # or
=> # =>
| # or
[:!]= # := or !=
| # or
<> # <>
| # or
[<>()!,+=*|-] # <, >, (, ), !, comma, +, =, *, | or -
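For illustration, here is the same replacement as a small runnable sketch. Compiling the pattern once into a constant is an addition for this example, as are the class and method names; $0 in the replacement refers to the whole match. Because everything happens in one pass, a later delimiter such as | or = cannot re-match characters that an earlier replacement already padded.

```java
import java.util.regex.Pattern;

class DelimiterSpacing {
    // Longer operators come first so that "||" or ":=" win over "|" or "=".
    static final Pattern DELIMITER =
            Pattern.compile("\\|\\||=>|[:!]=|<>|[<>()!,+=*|-]");

    static String pad(String script) {
        // Surround every matched delimiter with one space on each side.
        return DELIMITER.matcher(script).replaceAll(" $0 ");
    }

    public static void main(String[] args) {
        System.out.println(pad("a:=b||c"));
        System.out.println(pad("x<>y"));
    }
}
```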
Sure. Your second approach is "almost" good. The problem is that you do not use your pattern for the replacement itself. When you use str.replaceAll(), you are actually creating a new Pattern instance every time you call the method. Pattern.compile() is called for you, and it takes 90% of the time.
You should use Matcher.replaceAll() instead.
String script = "dfgafjd;fjfd;jfd;djf;jds\\fdfdf****\\/";
String result = script;
Pattern p = Pattern.compile("[\\*\\/\\\\]"); // write all characters you want to remove here.
Matcher m = p.matcher(script);
if (m.find()) {
result = m.replaceAll("");
}
System.out.println(result);
It isn't the regexes causing your performance problem, it's the fact that you're doing many passes over the text, and constantly creating new Pattern objects. And it's not just performance that suffers, as Tim pointed out; it's much too easy to mess up the results of prior passes when you do that.
In fact, I'm guessing that those extra spaces in the dates are just a side effect of your other replacements. If so, here's a way you can do all the replacements in one pass, without adding unwanted characters:
static String doReplace(String input)
{
String regex =
"/\\*[^*]*(?:\\*(?!/)[^*]*)*\\*/|" // a comment
+ "\\b\\d{2}/\\d{2}/\\d{2,4}\\b|" // a date
+ "(/|\\|\\||=>|[:!]=|<>|[<>()!,+=*|-])"; // an operator
Matcher m = Pattern.compile(regex).matcher(input);
StringBuffer sb = new StringBuffer();
while (m.find())
{
// if we found an operator, replace it
if (m.start(1) != -1)
{
m.appendReplacement(sb, " $1 ");
}
}
m.appendTail(sb);
return sb.toString();
}
see the online demo
The trick is, if you don't call appendReplacement(), the append position is not updated, so it's as if the match didn't occur. Because I ignore them, the comments and dates get reinserted along with the rest of the unmatched text, and I don't have to worry about matching the slash characters inside them.
EDIT Make sure the "comment" part of the regex comes before the "operator" part. Otherwise, the leading / of every comment will be treated as an operator.

Regular expression matching "dictionary words"

I'm a Java user but I'm new to regular expressions.
I just want to have a tiny expression that, given a word (we assume that the string is only one word), answers with a boolean, telling if the word is valid or not.
An example: I want to catch all words that could plausibly be in a dictionary. So I just want words with characters from a-z and A-Z, a hyphen (for example: man-in-the-middle) and an apostrophe (like I'll or Tiffany's).
Valid words:
"food"
"RocKet"
"man-in-the-middle"
"kahsdkjhsakdhakjsd"
"JESUS", etc.
Non-valid words:
"gipsy76"
"www.google.com"
"me#gmail.com"
"745474"
"+-x/", etc.
I use this code, but it doesn't give the correct answer:
Pattern p = Pattern.compile("[A-Za-z&-&']");
Matcher m = p.matcher(s);
System.out.println(m.matches());
What's wrong with my regex?
Add a + after the expression to say "one or more of those characters".
Escape the hyphen with \ (or put it last).
Remove those & characters.
Here's the code:
Pattern p = Pattern.compile("[A-Za-z'-]+");
Matcher m = p.matcher(s);
System.out.println(m.matches());
Complete test:
String[] ok = {"food","RocKet","man-in-the-middle","kahsdkjhsakdhakjsd","JESUS"};
String[] notOk = {"gipsy76", "www.google.com", "me#gmail.com", "745474","+-x/" };
Pattern p = Pattern.compile("[A-Za-z'-]+");
for (String shouldMatch : ok)
if (!p.matcher(shouldMatch).matches())
System.out.println("Error on: " + shouldMatch);
for (String shouldNotMatch : notOk)
if (p.matcher(shouldNotMatch).matches())
System.out.println("Error on: " + shouldNotMatch);
(Produces no output.)
This should work:
"[A-Za-z'-]+"
But "-word" and "word-" are not valid words, so you can use this pattern instead:
WORD_EXP = "^[A-Za-z]+(-[A-Za-z]+)*$"
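The pattern above only covers hyphens. One possible extension, assuming apostrophes should follow the same rule (allowed only between runs of letters), is sketched below; the class and method names are made up. Note that this assumption also rejects a trailing apostrophe, as in the plural possessive "dogs'".

```java
import java.util.regex.Pattern;

class DictionaryWord {
    // Assumption: a hyphen or apostrophe must have letters on both sides.
    static final Pattern WORD = Pattern.compile("[A-Za-z]+(?:['-][A-Za-z]+)*");

    static boolean isWord(String s) {
        // matches() checks the whole input, so explicit anchors are not needed.
        return WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"man-in-the-middle", "Tiffany's", "-word", "word-", "gipsy76"};
        for (String s : samples) {
            System.out.println(s + " => " + isWord(s));
        }
    }
}
```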
Regex - /^([a-zA-Z]*('|-)?[a-zA-Z]+)*/
You can use the regex above if you don't want successive "'" or "-" characters.
It should match your text accurately.
It accepts
man-in-the-middle
asd'asdasd'asd
It rejects the following strings
man--in--midle
asdasd''asd
Hi Aloob, please check this one. It is a bit lengthy and there might be a shorter version, but still...
[A-z]*||[[A-z]*[-]*]*||[[A-z]*[-]*[']*]*
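One thing worth checking before relying on the [A-z] range used above: in ASCII, the characters [ \ ] ^ _ and ` sit between Z and a, so the range accepts them as well. A quick sketch to confirm this (class and method names are only for illustration):

```java
import java.util.regex.Pattern;

class RangeCheck {
    // The range A-z spans code points 65 to 122, including six punctuation marks.
    static boolean wide(String s) {
        return Pattern.matches("[A-z]+", s);
    }

    // Two separate ranges cover letters only.
    static boolean strict(String s) {
        return Pattern.matches("[A-Za-z]+", s);
    }

    public static void main(String[] args) {
        String[] samples = {"abc", "_", "^", "snake_case"};
        for (String s : samples) {
            System.out.println(s + " => " + wide(s) + " / " + strict(s));
        }
    }
}
```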
