How to parse Chinese meta keywords with Java regex? - java

I have the code below, but it seems to parse Chinese keywords the wrong way. How can I change it?
OUTPUT:
keyword:test
keyword:中
keyword:文
keyword:U
keyword:I
keyword:素
keyword:材
Should be below:
keyword:test
keyword:中文
keyword:UI
keyword:素材
This is my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test {
    public static final Pattern KEYWORDS_REGEX =
            Pattern.compile("[^\\s,](?:[^,]+[^\\s,])?");

    /**
     * @param args
     */
    public static void main(String[] args) {
        String keywords = "test, 中文, UI, 素材";
        Matcher matcher = KEYWORDS_REGEX.matcher(keywords);
        while (matcher.find()) {
            String s = matcher.group();
            System.out.println("keyword:" + s);
        }
    }
}
Thanks!

The problem isn't with Chinese characters, the problem is with keywords that are two characters long. (That's why it affects UI as well.) This regex:
[^\s,](?:[^,]+[^\s,])?
allows two possibilities:
[^\s,] <-- exactly one character
[^\s,][^,]+[^\s,] <-- three or more characters
so any keywords with two characters will not match, so they get split into single-character keywords.
You could fix your regex by changing [^,]+ to [^,]*, but I'm inclined to agree with the spirit of Kisaro's comment above; I think you'd be better off using Pattern.split:
private static final Pattern KEYWORD_SPLITTER = Pattern.compile("\\s*,\\s*");

for (final String s : KEYWORD_SPLITTER.split(keywords)) {
    System.out.println("keyword:" + s);
}
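Both fixes can be compared side by side. The following is only a minimal sketch; the class name KeywordParser and the method names withRegex and withSplit are made up for illustration and do not come from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordParser {
    // The question's pattern with [^,]+ relaxed to [^,]* so that
    // two-character keywords are matched as a whole.
    static final Pattern KEYWORDS_REGEX =
            Pattern.compile("[^\\s,](?:[^,]*[^\\s,])?");

    // Splits on a comma together with any whitespace around it.
    static final Pattern KEYWORD_SPLITTER = Pattern.compile("\\s*,\\s*");

    static List<String> withRegex(String keywords) {
        List<String> result = new ArrayList<>();
        Matcher matcher = KEYWORDS_REGEX.matcher(keywords);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    static List<String> withSplit(String keywords) {
        // trim() keeps leading/trailing whitespace out of the first and last keyword.
        return Arrays.asList(KEYWORD_SPLITTER.split(keywords.trim()));
    }

    public static void main(String[] args) {
        String keywords = "test, 中文, UI, 素材";
        for (String s : withRegex(keywords)) {
            System.out.println("keyword:" + s);
        }
        for (String s : withSplit(keywords)) {
            System.out.println("keyword:" + s);
        }
    }
}
```

The two variants differ on empty entries: for an input such as "a,,b" the regex version skips the gap, while the split version returns an empty string between "a" and "b".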

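Your regex could be \\w+ to match words. Note that in Java \\w only matches [a-zA-Z_0-9] by default, so it generates the desired output for the Chinese keywords only if the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS.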
Also since someone suggested explode: Apache Commons
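To see what \w does with the Chinese keywords, here is a small sketch. The class name UnicodeWords and the helper words are hypothetical; Pattern.UNICODE_CHARACTER_CLASS is a standard flag available since Java 7.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWords {
    // Default \w: ASCII letters, digits and underscore only.
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // With UNICODE_CHARACTER_CLASS, \w also covers non-ASCII letters
    // such as Han characters.
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    static List<String> words(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String keywords = "test, 中文, UI, 素材";
        System.out.println(words(ASCII_WORD, keywords));   // Chinese keywords are skipped
        System.out.println(words(UNICODE_WORD, keywords)); // all four keywords
    }
}
```

Keep in mind that \w+ also stops at spaces, so a keyword made of several words (for example "web design") would be returned as two separate matches; splitting on commas does not have that limitation.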

Related

Convert this pattern to regex for Pattern.matches(..)

Some of my strings may contain a substring that looks like #[alph4Num3ric-alph4Num3ric], where I will find the alphanumeric id and replace it with the corresponding text value mapped to that key in a map.
My first inclination was to check if my string.contains("#[") but I want to be more specific,
so now I am looking at Pattern.matches( but am unsure of the regex and the total expression.
How would I write a regex for #[ ...... - .... ] in the Pattern.matches method? It must also account for dashes, so I'm not sure what needs to be escaped or wildcarded in this syntax.
I am also not 100% sure if this is the best method. I want to get a boolean from Pattern.matches first, and then get the real value and modify the string with those values, which seems good enough, but I want to minimize computations.
Please try this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String expression = "String contains #[alph4Num3ric-alph4Num3ric] as substring";
        Pattern pattern = Pattern
                .compile("\\#\\[([a-zA-Z0-9]+)-([a-zA-Z0-9]+)\\]");
        Matcher matcher = pattern.matcher(expression);
        while (matcher.find()) {
            System.out.println("matched: " + matcher.group());
            System.out.println("group1: " + matcher.group(1));
            System.out.println("group2: " + matcher.group(2));
            // Note: String.replace swaps every occurrence of group 1 in the whole string
            System.out.println("after replace: "
                    + expression.replace(matcher.group(1), "customkey"));
        }
    }
}
output :
matched: #[alph4Num3ric-alph4Num3ric]
group1: alph4Num3ric
group2: alph4Num3ric
after replace: String contains #[customkey-customkey] as substring
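Since the question asks for a map lookup with as little work as possible, a single find/appendReplacement pass may be enough; a separate Pattern.matches check is not needed, and it would return false anyway for any string that has text around the token, because it matches the whole input. The sketch below rests on assumptions that are not in the question: the class name TokenReplacer, the method replaceTokens, the choice to treat the whole text between "#[" and "]" as the map key, and the choice to leave unknown keys untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenReplacer {
    // Group 1 captures the whole key between "#[" and "]", e.g. "ab1-cd2".
    static final Pattern TOKEN =
            Pattern.compile("#\\[([a-zA-Z0-9]+-[a-zA-Z0-9]+)\\]");

    static String replaceTokens(String input, Map<String, String> values) {
        Matcher matcher = TOKEN.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            // Assumption: tokens without a mapping are copied through unchanged.
            String replacement = (value != null) ? value : matcher.group();
            // quoteReplacement protects '$' and '\' inside the replacement text.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("alph4Num3ric-alph4Num3ric", "some text");
        System.out.println(replaceTokens(
                "String contains #[alph4Num3ric-alph4Num3ric] as substring", values));
    }
}
```

If the input contains no token at all, the loop body never runs and appendTail copies the string through, so no extra boolean check is required.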
Try using this:
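#\[[a-zA-Z0-9-]+\]
I haven't given it a try but hope this would help. The square brackets of the literal #[ ... ] have to be escaped, and inside a Java string literal every backslash is doubled, which gives "#\\[[a-zA-Z0-9-]+\\]". If the trailing dash in the character class returns an error, add a backslash between 9 and -, e.g. [a-zA-Z0-9\-].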

Java regex matcher not working

I'm trying to get the hang of Pattern and Matcher. This method should use the regex pattern to iterate over an array of state capitals and return the state or states that correspond to the pattern. The method works fine when I check for whole strings like "tallahassee" or "salt lake city", but not for something like "^t". What is it that I'm not getting?
This is the method and main that calls it:
public ArrayList<String> getState(String s) throws RemoteException
{
    Pattern pattern = Pattern.compile(s);
    Matcher matcher;
    int i = 0;
    System.out.println(s);
    for (String ct : capitalValues)
    {
        matcher = pattern.matcher(ct);
        if (ct.toLowerCase().matches(s))
            states.add(stateValues[i]);
        i++;
    }
    return states;
}

public static void main(String[] args) throws RemoteException
{
    ArrayList<String> result = new ArrayList<String>();
    hashTester ht = new hashTester();
    result = ht.getState(("^t").toLowerCase());
    System.out.println("result: ");
    for (String s : result)
        System.out.println(s);
}
thanks for your help
You're not even using your matcher for matching. You're using the String#matches() method. Both that method and Matcher#matches() match the regex against the complete string, not a part of it, so your regex would have to cover the entire string. If you just want to match a part of the string, use the Matcher#find() method.
You should use it like this (find() takes no string argument, so the lower-cased text goes into the matcher):
matcher = pattern.matcher(ct.toLowerCase());
if (matcher.find()) {
    // Found regex pattern
}
BTW, if you only want to see if a string starts with t, you can directly use String#startsWith() method. No need of regex for that case. But I guess it's a general case here.
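The difference between matching the whole string and finding a part of it can be checked with a few lines. This is only an illustrative sketch: the class name FindVersusMatches, the helper containing and the short list of capitals are invented for the example and are not the asker's data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class FindVersusMatches {
    // Returns every value in which the regex matches somewhere (not necessarily everywhere).
    static List<String> containing(String regex, List<String> values) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String value : values) {
            if (pattern.matcher(value.toLowerCase(Locale.ROOT)).find()) {
                hits.add(value);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // matches() needs the regex to cover the whole string: false
        System.out.println("tallahassee".matches("^t"));
        // find() only needs a match somewhere in the string: true
        System.out.println(Pattern.compile("^t").matcher("tallahassee").find());

        List<String> capitals =
                Arrays.asList("Tallahassee", "Salt Lake City", "Topeka", "Austin");
        System.out.println(containing("^t", capitals)); // [Tallahassee, Topeka]
    }
}
```

Compiling the pattern with Pattern.CASE_INSENSITIVE would be an alternative to lower-casing each value.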
^ is an anchor character in regex. You have to escape it if you do not want anchoring; otherwise ^t means a t at the beginning of the string. To match a literal ^, escape it as \\^t.

Find string that does not contain some substring

I have a one liner string that looks like this:
My db objects are db.main_flow_tbl, 'main_flow_audit_tbl',
main_request_seq and MAIN_SUBFLOW_TBL.
I want to use regular expressions to return the database tables that start with main but do not contain the words audit or seq, irrespective of case. So in the above example the strings main_flow_tbl and MAIN_SUBFLOW_TBL should be returned. Can someone help me with this please?
Here is a fully regex based solution:
public static void main(String[] args) throws Exception {
    final String in = "My db objects are db.main_flow_tbl, 'main_flow_audit_tbl', main_request_seq and MAIN_SUBFLOW_TBL.";
    final Pattern pat = Pattern.compile("main_(?!\\w*?(?:audit|seq))\\w++", Pattern.CASE_INSENSITIVE);
    final Matcher m = pat.matcher(in);
    while (m.find()) {
        System.out.println(m.group());
    }
}
Output:
main_flow_tbl
MAIN_SUBFLOW_TBL
This assumes that table names can only contain the characters [a-zA-Z_0-9], which \w is the shorthand for.
Pattern breakdown:
main_ is the literal "main_" that you want tables to start with
(?!\\w*?(?:audit|seq)) is a negative lookahead (not followed by) which takes any number of \w characters (lazily) followed by either "audit" or "seq". This excludes table names that contain those sequences.
\\w++ consumes the remaining table name characters possessively.
EDIT
The OP commented that table names may contain numbers as well.
No change is needed for that: \w already matches digits, so [\\d\\w] is equivalent to \\w and this pattern works as is:
main_(?!\\w*?(?:audit|seq))\\w++
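String str; // one candidate table name
String lower = str.toLowerCase(); // compare irrespective of case
if (lower.startsWith("main") && !lower.contains("audit") && !lower.contains("seq")) {
    // your code here
}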
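For completeness, the startsWith/contains idea can be applied to the whole sentence by first splitting it into identifiers. This is a rough sketch under assumptions of my own: the class name TableFilter, the method mainTables, and the rule that any run of non-word characters separates two names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TableFilter {
    static List<String> mainTables(String text) {
        List<String> result = new ArrayList<>();
        // Split on runs of non-word characters so each identifier becomes one token.
        for (String token : text.split("\\W+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (lower.startsWith("main")
                    && !lower.contains("audit")
                    && !lower.contains("seq")) {
                result.add(token); // keep the original spelling
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String in = "My db objects are db.main_flow_tbl, 'main_flow_audit_tbl', "
                + "main_request_seq and MAIN_SUBFLOW_TBL.";
        System.out.println(mainTables(in)); // [main_flow_tbl, MAIN_SUBFLOW_TBL]
    }
}
```

All three conditions are joined with &&; mixing && and || without parentheses would let names containing "audit" slip through.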
If the string matches
^main_(\w_)*(?!(?:audit|seq))
it should be what you want...

regular expression for comments in java [duplicate]

A little fun with Java this time. I want to write a program that reads a code from standard input (line by line, for example), like:
// some comment
class Main {
/* blah */
// /* foo
foo();
// foo */
foo2();
/* // foo2 */
}
finds all comments in it and removes them. I'm trying to use regular expressions, and for now I've done something like this:
private static String ParseCode(String pCode)
{
String MyCommentsRegex = "(?://.*)|(/\\*(?:.|[\\n\\r])*?\\*/)";
return pCode.replaceAll(MyCommentsRegex, " ");
}
but it seems not to work for all the cases, e.g.:
System.out.print("We can use /* comments */ inside a string of course, but it shouldn't start a comment");
Any advice or ideas different from regex?
Thanks in advance.
You may have already given up on this by now but I was intrigued by the problem.
I believe this is a partial solution...
Native regex:
//.*|("(?:\\[^"]|\\"|.)*?")|(?s)/\*.*?\*/
In Java:
String clean = original.replaceAll( "//.*|(\"(?:\\\\[^\"]|\\\\\"|.)*?\")|(?s)/\\*.*?\\*/", "$1 " );
This appears to properly handle comments embedded in strings as well as properly escaped quotes inside strings. I threw a few things at it to check but not exhaustively.
There is one compromise in that all "" blocks in the code will end up with space after them. Keeping this simple and solving that problem would be very difficult given the need to cleanly handle:
int/* some comment */foo = 5;
A simple Matcher.find/appendReplacement loop could conditionally check for group(1) before replacing with a space and would only be a handful of lines of code. Still simpler than a full up parser maybe. (I could add the matcher loop too if anyone is interested.)
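The matcher loop mentioned above could look roughly like this. It reuses the same regex and is only a sketch; the class name StringAwareStripper and the method strip are made up, and the limits of the regex itself (it was not tested exhaustively) still apply.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StringAwareStripper {
    // line comment | string literal (group 1) | block comment
    static final Pattern TOKEN = Pattern.compile(
            "//.*|(\"(?:\\\\[^\"]|\\\\\"|.)*?\")|(?s)/\\*.*?\\*/");

    static String strip(String source) {
        Matcher m = TOKEN.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // A string literal is copied back unchanged; a comment becomes one space.
            String replacement = (m.group(1) != null) ? m.group(1) : " ";
            // quoteReplacement keeps '$' and '\' inside string literals intact.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip(
                "int/* c */foo = 5; String s = \"/* not a comment */\"; // tail"));
    }
}
```

With this loop string literals no longer get a trailing space, while int/* c */foo still turns into int foo.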
The last example is no problem I think:
/* we comment out some code
System.out.print("We can use */ inside a string of course");
we end the comment */
... because the comment actually ends with "We can use */. This code does not compile.
But I have another problematic case:
int/*comment*/foo=3;
Your pattern will transform this into:
intfoo=3;
...which is invalid code. So better replace your comments with " " instead of "".
I think a 100% correct solution using regular expressions is either inhuman or impossible (taking into account escapes, etc.).
I believe the best option would be using ANTLR. I believe they even provide a Java grammar you can use.
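Short of a full grammar, a hand-written scanner that tracks whether it is inside code, a string, a char literal or a comment is one non-regex idea. The sketch below is my own illustration, not ANTLR and not a complete Java lexer: the class name CommentScanner and its states are invented, and it deliberately ignores text blocks (""") and Unicode escapes.

```java
class CommentScanner {
    private enum State { CODE, STRING, CHAR, LINE_COMMENT, BLOCK_COMMENT }

    static String strip(String src) {
        StringBuilder out = new StringBuilder();
        State state = State.CODE;
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            char next = (i + 1 < src.length()) ? src.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    if (c == '/' && (next == '/' || next == '*')) {
                        // A comment starts: emit one space so tokens stay separated.
                        state = (next == '/') ? State.LINE_COMMENT : State.BLOCK_COMMENT;
                        out.append(' ');
                        i += 2;
                        continue;
                    }
                    if (c == '"') state = State.STRING;
                    else if (c == '\'') state = State.CHAR;
                    out.append(c);
                    break;
                case STRING:
                case CHAR:
                    out.append(c);
                    if (c == '\\' && i + 1 < src.length()) {
                        // Copy the escaped character too, so \" does not end the literal.
                        out.append(next);
                        i += 2;
                        continue;
                    }
                    if ((state == State.STRING && c == '"')
                            || (state == State.CHAR && c == '\'')) {
                        state = State.CODE;
                    }
                    break;
                case LINE_COMMENT:
                    if (c == '\n' || c == '\r') {
                        // The line break belongs to the code, not to the comment.
                        state = State.CODE;
                        out.append(c);
                    }
                    break;
                case BLOCK_COMMENT:
                    if (c == '*' && next == '/') {
                        state = State.CODE;
                        i += 2;
                        continue;
                    }
                    break;
            }
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("int/*comment*/foo=3;"));
        System.out.println(strip("String s = \"/* no */\"; // x\nfoo();"));
    }
}
```

Each comment is replaced by a single space, so int/*comment*/foo=3; stays valid, and comment markers inside string or char literals are copied through untouched.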
I ended up with this solution.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentsFun {
    static List<Match> commentMatches = new ArrayList<Match>();

    public static void main(String[] args) {
        Pattern commentsPattern = Pattern.compile("(//.*?$)|(/\\*.*?\\*/)", Pattern.MULTILINE | Pattern.DOTALL);
        Pattern stringsPattern = Pattern.compile("(\".*?(?<!\\\\)\")");
        String text = getTextFromFile("src/my/test/CommentsFun.java");
        // Collect every candidate comment together with its start offset
        Matcher commentsMatcher = commentsPattern.matcher(text);
        while (commentsMatcher.find()) {
            Match match = new Match();
            match.start = commentsMatcher.start();
            match.text = commentsMatcher.group();
            commentMatches.add(match);
        }
        // Drop the candidates that start inside a string literal
        List<Match> commentsToRemove = new ArrayList<Match>();
        Matcher stringsMatcher = stringsPattern.matcher(text);
        while (stringsMatcher.find()) {
            for (Match comment : commentMatches) {
                if (comment.start > stringsMatcher.start() && comment.start < stringsMatcher.end())
                    commentsToRemove.add(comment);
            }
        }
        for (Match comment : commentsToRemove)
            commentMatches.remove(comment);
        // Replace the remaining comments with a single space
        for (Match comment : commentMatches)
            text = text.replace(comment.text, " ");
        System.out.println(text);
    }

    //Single-line
    // "String? Nope"
    /*
     * "This is not String either"
     */
    //Complex */
    ///*More complex*/
    /*Single line, but */
    String moreFun = " /* comment? doubt that */";
    String evenMoreFun = " // comment? doubt that ";

    static class Match {
        int start;
        String text;
    }
}
Another alternative is to use a library that supports AST parsing; e.g. org.eclipse.jdt.core has all the APIs you need to do this and more. But then that's just one alternative :)
