I have several strings that each follow one of multiple masks. Is there a better way of parsing strings against a mask than calling String.split, looping over the tokens, and identifying each one by its position in the sequence? That code gets clumsy because a lot of per-token logic has to be written.
Sample masks can be:
PROD-LOC-STATE-CITY
PROD-DEST-STATE-ZIP
PROD-OZIP-DZIP-VER-INS
Sample Strings:
CoolDuo-GROUND-NYC-10082
Sample code:
String[] arr = input.split("-");
int pos = 0;
for (String k : arr) {
    if (pos == 0) {
        //-- k is of PROD
        ...
        ...
    }
    ..
    ...
    pos++;
}
Above type of code is kept for every mask type.
You can use regex groups and retrieve the target strings by group name; see http://docs.oracle.com/javase/tutorial/essential/regex/groups.html and the related question "Regex Named Groups in Java".
If you can't use named groups, you can do it this way (if you are absolutely sure of the structure of your strings):
final static int PROD_POS = 1;
final static int STATE_POS = 3;
...
Pattern pattern = Pattern.compile("(some_regexp)-(some_regexp)-(some_regexp)");
Matcher matcher = pattern.matcher(input);
if ( matcher.matches() ) {
    String state = matcher.group(STATE_POS);
}
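As a rough sketch of the named-group idea, the mask itself can be turned into a pattern with one named group per field. The class name MaskParser and the choice of `[^-]+` for each field are assumptions made for illustration, and the sketch assumes every mask token is a valid Java group name (letters and digits only).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: builds a regex with one named group per mask token.
class MaskParser {
    static Map<String, String> parse(String mask, String input) {
        String[] names = mask.split("-");
        StringBuilder regex = new StringBuilder();
        for (String name : names) {
            if (regex.length() > 0) {
                regex.append('-');
            }
            // Assumption: a field is any run of characters other than '-'
            regex.append("(?<").append(name).append(">[^-]+)");
        }
        Matcher matcher = Pattern.compile(regex.toString()).matcher(input);
        Map<String, String> fields = new LinkedHashMap<>();
        if (matcher.matches()) {
            for (String name : names) {
                fields.put(name, matcher.group(name));
            }
        }
        return fields; // empty when the input does not fit the mask
    }
}
```

Calling MaskParser.parse("PROD-DEST-STATE-ZIP", "CoolDuo-GROUND-NYC-10082") would then give a map from field name to value, so no per-mask position logic is needed. In real use the compiled Pattern for each mask should be cached rather than rebuilt per call.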
If you really want to dig deeper into this problem, for instance when your masks get too big to manage by hand, you can use one of the lexical analysis packages available for Java.
For background on what that means, see http://en.wikipedia.org/wiki/Lexical_analysis.
A popular package for Java is JFlex (http://jflex.de/), but there are many others; a web search will turn them up.
Best of luck
I have a regex with multiple disjunctive capture groups
(a)|(b)|(c)|...
Is there a faster way than this one to access the index of the first successfully matching capture group?
(matcher is an instance of java.util.regex.Matcher)
int getCaptureGroup(Matcher matcher){
    for(int i = 1; i <= matcher.groupCount(); ++i){
        if(matcher.group(i) != null){
            return i;
        }
    }
    return -1; // no capture group participated in the match
}
That depends on what you mean by faster. You can make the code a little more efficient by using start(int) instead of group(int)
if(matcher.start(i) != -1){
If you don't need the actual content of the group, there's no point trying to create a new string object to hold it. I doubt you'll notice any difference in performance, but there's no reason not to do it this way.
But you still have to write the same amount of boilerplate code; there's no way around that. Java's regex flavor is severely lacking in syntactic sugar compared to most other languages.
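Put together, the start(int) variant might look like the following sketch. The class and method names are made up for illustration, and returning -1 when no group matched is an assumption about what the caller wants.

```java
import java.util.regex.Matcher;

class FirstGroup {
    // Returns the index of the first group that took part in the current match,
    // or -1 if none did. Must be called after a successful find() or matches().
    static int firstMatchedGroup(Matcher matcher) {
        for (int i = 1; i <= matcher.groupCount(); i++) {
            // start(i) is -1 for a group that did not participate;
            // unlike group(i), it does not create a String
            if (matcher.start(i) != -1) {
                return i;
            }
        }
        return -1;
    }
}
```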
I guess the usual pattern is this:
if (matcher.find()) {
    String wholeMatch = matcher.group(0);
    String firstCaptureGroup = matcher.group(1);
    String secondCaptureGroup = matcher.group(2);
    //etc....
}
There could be more than one match, so you could use a while loop instead of the if to go through all of them.
Please take a look at the "Group number" section in the Javadoc of java.util.regex.Pattern.
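For the multiple-match case, a minimal sketch of the while loop could look like this. The key=value pattern and the names here are invented for the example; only find() and group(int) come from the discussion above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllMatches {
    // Hypothetical pattern: a word, '=', then digits
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\d+)");

    static List<String> pairs(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = PAIR.matcher(input);
        // Each call to find() advances to the next match in the input
        while (matcher.find()) {
            result.add(matcher.group(1) + ":" + matcher.group(2));
        }
        return result;
    }
}
```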
I have a group of strings in an ArrayList.
1) I want to remove all the strings that contain only numbers,
and also strings like (0.75%) or $1.5; basically every string that does not contain any alphabetic characters.
2) I want to remove all special characters from each string before I write it to the console.
"God should be printed: God
&quot;Including should be printed: quotIncluding
'find should be printed: find
Java boasts a very nice Pattern class that makes use of regular expressions. You should definitely read up on that. A good reference guide is here.
I was going to post a coding solution for you, but styfle beat me to it! The only thing I was going to do differently was that, within the for loop, I would have used the Pattern and Matcher classes, like this:
Pattern p = Pattern.compile("[a-zA-Z]"); // compile once, outside the loop
for(int i = 0; i < myArray.size(); i++){
    Matcher m = p.matcher(myArray.get(i));
    boolean match = m.find(); // true if the string contains at least one letter
    //more code to get the string you want
}
But that is too bulky. styfle's solution is succinct and easy.
When you say "characters," I'm assuming you mean only "a through z" and "A through Z." You probably want to use Regular Expressions (Regex) as D1e mentioned in a comment. Here is an example using the replaceAll method.
import java.util.ArrayList;
public class Test {
    public static void main(String[] args) {
        ArrayList<String> list = new ArrayList<String>(5);
        list.add("\"God");
        list.add("&quot;Including");
        list.add("'find");
        list.add("24No3Numbers97");
        list.add("w0or5*d;");
        for (String s : list) {
            s = s.replaceAll("[^a-zA-Z]",""); //use whatever regex you wish
            System.out.println(s);
        }
    }
}
The output of this code is as follows:
God
quotIncluding
find
NoNumbers
word
The replaceAll method uses a regex pattern and replaces all the matches with the second parameter (in this case, the empty string).
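The first part of the question (dropping strings that contain no letters at all) can be combined with the replaceAll idea. The following is only a sketch under the same assumption that "characters" means a-z and A-Z; the class name LetterFilter is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class LetterFilter {
    private static final Pattern HAS_LETTER = Pattern.compile("[a-zA-Z]");
    private static final Pattern NON_LETTER = Pattern.compile("[^a-zA-Z]");

    static List<String> clean(List<String> input) {
        List<String> result = new ArrayList<>();
        for (String s : input) {
            // Skip entries such as "(0.75%)" or "$1.5" that contain no letter
            if (!HAS_LETTER.matcher(s).find()) {
                continue;
            }
            // Strip everything that is not a letter from the remaining entries
            result.add(NON_LETTER.matcher(s).replaceAll(""));
        }
        return result;
    }
}
```

Building a new list avoids removing elements from the ArrayList while iterating over it.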
I am not a beginner with regular expressions, but their use in Perl seems a bit different than in Java.
Anyway, I basically have a dictionary of shorthand words and their definitions. I want to iterate over the words in the dictionary and replace them with their meanings. What is the best way to do this in Java?
I have seen String.replaceAll(), String.replace(), as well as the Pattern/Matcher classes. I wish to do a case insensitive replacement along the lines of:
word =~ s/\s?\Q$short_word\E\s?/ \Q$short_def\E /sig
While I am at it, do you think that it is best to extract all the words from the string and then apply my dictionary or just apply the dictionary to the string? I know that I need to be careful, because the shorthand words could match parts of other shorthand meanings.
Hopefully this all makes sense.
Thanks.
Clarification:
Dictionary is something like:
lol:laugh out loud, rofl:rolling on the floor laughing, ll:like lemons
string is:
lol, i am rofl
replaced text:
laugh out loud, i am rolling on the floor laughing
Notice how the definition of ll wasn't added anywhere.
The danger is false positives inside normal words: "fell" must not become "felike lemons".
One way is to split the words on whitespace (do multiple spaces need to be conserved?) and then loop over the list, performing the "if contained { replace } else { output original }" idea above.
My output class would be a StringBuffer:
StringBuffer outputBuffer = new StringBuffer();
for(String s: split(inputText)) {
    // Map has containsKey(), not contains()
    outputBuffer.append( dictionary.containsKey(s) ? dictionary.get(s) : s);
}
Make your split method smart enough to return word delimiters also:
split("now is the time") -> now,<space>,is,<space>,the,<space><space>,time
Then you don't have to worry about conserving white space - the loop above will just append anything that isn't a dictionary word to the StringBuffer.
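One possible way to get such a delimiter-preserving split is a zero-width regex that splits wherever a word character meets a non-word character. This is only a sketch: the class name, the lookaround pattern, and the assumption that dictionary keys are stored in lowercase are all mine, not part of the answer above.

```java
import java.util.Locale;
import java.util.Map;

class TokenExpander {
    // Zero-width split points between word and non-word characters,
    // so words and the delimiters between them both come back as tokens
    private static final String BOUNDARY = "(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)";

    static String expand(String input, Map<String, String> dictionary) {
        StringBuilder out = new StringBuilder(input.length());
        for (String token : input.split(BOUNDARY)) {
            // Assumption: keys are lowercase, which makes the lookup case-insensitive
            String definition = dictionary.get(token.toLowerCase(Locale.ROOT));
            out.append(definition != null ? definition : token);
        }
        return out.toString();
    }
}
```

Because only whole tokens are looked up, "fell" is never touched by an entry for "ll", and punctuation and runs of spaces pass through unchanged.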
Here's a recent SO thread on retaining delimiters when regexing.
If you insist on using regex, this would work (taking Zoltan Balazs' dictionary map approach):
Map<String, String> substitutions = loadDictionaryFromSomewhere();
int lengthOfShortestKeyInMap = 3; //Calculate
int lengthOfLongestKeyInMap = 3; //Calculate
StringBuffer output = new StringBuffer(input.length());
Pattern pattern = Pattern.compile("\\b(\\w{" + lengthOfShortestKeyInMap + "," + lengthOfLongestKeyInMap + "})\\b");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    String candidate = matcher.group(1);
    String substitute = substitutions.get(candidate);
    if (substitute == null)
        substitute = candidate; // no match, use original
    matcher.appendReplacement(output, Matcher.quoteReplacement(substitute));
}
matcher.appendTail(output);
// output now contains the text with substituted words
If you plan to process many inputs, pre-compiling the pattern is more efficient than using String.split(), which compiles a new Pattern each call.
(edit) Compiling all of the keys into a single pattern yields a more efficient approach, like so:
Pattern pattern = Pattern.compile("\\b(lol|rtfm|rofl|wtf)\\b");
// rest of the method unchanged, don't need the shortest/longest key stuff
This allows the regex engine to skip over any words that happen to be short enough but aren't in the list, saving you a lot of map accesses.
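Rather than hard-coding the alternation, the pattern can be built from the map keys. The sketch below assumes the keys are stored in lowercase and adds Pattern.quote and CASE_INSENSITIVE, which the answer above does not mention; the class name is hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyPatternReplacer {
    static String replace(String input, Map<String, String> substitutions) {
        if (substitutions.isEmpty()) {
            return input; // an empty alternation would match the empty string
        }
        StringBuilder alternation = new StringBuilder();
        for (String key : substitutions.keySet()) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(key)); // keys are treated literally
        }
        Pattern pattern = Pattern.compile("\\b(" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(input);
        StringBuffer output = new StringBuffer(input.length());
        while (matcher.find()) {
            // Assumption: keys are lowercase, so normalise the matched text
            String substitute = substitutions.get(matcher.group(1).toLowerCase(Locale.ROOT));
            matcher.appendReplacement(output, Matcher.quoteReplacement(substitute));
        }
        matcher.appendTail(output);
        return output.toString();
    }
}
```

When many inputs are processed, the compiled Pattern should be built once and reused instead of being rebuilt on every call as it is here.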
The first thing that comes to my mind is this:
...
// eg: lol -> laugh out loud
Map<String, String> dictionary;
ArrayList<String> originalText;
ArrayList<String> replacedText;
for(String string : originalText) {
    if(dictionary.containsKey(string)) {
        replacedText.add(dictionary.get(string));
    } else {
        replacedText.add(string);
    }
}
...
Or you could use a StringBuffer instead of replacedText.
Say for example I want to take this phrase:
{{Hello|What's Up|Howdy} {world|planet} |
{Goodbye|Later}
{people|citizens|inhabitants}}
and randomly make it into one of the following:
Hello world
Goodbye people
What's Up world
What's Up planet
Later citizens
etc.
The basic idea is that enclosed within every pair of braces will be an unlimited number of choices separated by "|". The program needs to go through and randomly choose one choice for each set of braces. Keep in mind that braces can be nested endlessly within each other. I found a thread about this and tried to convert it to Java, but it did not work. Here is the python code that supposedly worked:
import re
from random import randint

def select(m):
    choices = m.group(1).split('|')
    return choices[randint(0, len(choices)-1)]

def spinner(s):
    r = re.compile('{([^{}]*)}')
    while True:
        s, n = r.subn(select, s)
        if n == 0: break
    return s.strip()
Here is my attempt to convert that Python code to Java.
public String generateSpun(String text){
    String spun = new String(text);
    Pattern reg = Pattern.compile("{([^{}]*)}");
    Matcher matcher = reg.matcher(spun);
    while (matcher.find()){
        spun = matcher.replaceFirst(select(matcher.group()));
    }
    return spun;
}
private String select(String m){
    String[] choices = m.split("|");
    Random random = new Random();
    int index = random.nextInt(choices.length - 1);
    return choices[index];
}
Unfortunately, when I try to test this by calling
generateAd("{{Hello|What's Up|Howdy} {world|planet} | {Goodbye|Later} {people|citizens|inhabitants}}");
in the main method of my program, I get a PatternSyntaxException at the line in generateSpun where Pattern reg is declared:
java.util.regex.PatternSyntaxException: Illegal repetition
{([^{}]*)}
Can someone try to create a Java method that will do what I am trying to do?
Here are some of the problems with your current code:
You should reuse your compiled Pattern, instead of Pattern.compile every time
You should reuse your Random, instead of new Random every time
Be aware that String.split is regex-based, so you must split("\\|")
Be aware that curly braces in Java regex must be escaped to match literally, so Pattern.compile("\\{([^{}]*)\\}");
You should query group(1), not group() which defaults to group 0
You're using replaceFirst wrong, look up Matcher.appendReplacement/Tail instead
Random.nextInt(int n) has exclusive upper bound (like many such methods in Java)
The algorithm itself actually does not handle arbitrarily nested braces properly
Note that escaping is done by preceding with \, and as a Java string literal it needs to be doubled (i.e. "\\" contains a single character, the backslash).
Attachment
Source code and output with above fix but no major change to algorithm
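Applying the fixes listed above, a Java version that mirrors the Python loop might look like the sketch below. It is not the attached source; the class name Spinner and the constructor taking a Random are my own choices, and it inherits whatever limitations the Python algorithm has with nested braces.

```java
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Spinner {
    // Matches an innermost group: braces with no braces inside. Compiled once.
    private static final Pattern INNERMOST = Pattern.compile("\\{([^{}]*)\\}");
    private final Random random;

    Spinner(Random random) {
        this.random = random; // reused for every choice
    }

    String spin(String text) {
        String current = text;
        while (true) {
            Matcher matcher = INNERMOST.matcher(current);
            StringBuffer sb = new StringBuffer();
            boolean replaced = false;
            while (matcher.find()) {
                replaced = true;
                // group(1) is the text between the braces; "|" must be escaped
                String[] choices = matcher.group(1).split("\\|", -1);
                // nextInt(n) returns 0..n-1, so no "- 1" is needed
                String pick = choices[random.nextInt(choices.length)];
                matcher.appendReplacement(sb, Matcher.quoteReplacement(pick));
            }
            if (!replaced) {
                return current.trim(); // no braces left
            }
            matcher.appendTail(sb);
            current = sb.toString(); // rescan: outer groups are innermost now
        }
    }
}
```

Each pass resolves the current innermost groups, and the outer loop repeats on a fresh Matcher until no braces remain, which corresponds to the `subn` loop in the Python code.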
To fix the regex, add backslashes before the outer { and }. These are meta-characters in Java regexes. However, I don't think that will result in a working program. You are modifying the variable spun after it has been bound to the regex, and I do not think the returned Matcher will reflect the updated value.
I also don't think the python code will work for nested choices. Have you actually tried the python code? You say it "supposedly works", but it would be wise to verify that before you spend a lot of time porting it to Java.
Well, I just created one in PHP and Python; there is a demo at http://spin.developerscrib.com. It's at a very early stage, so it might not work as expected. The source code is on GitHub: https://github.com/razzbee/razzy-spinner
Use the appendReplacement/appendTail pattern; I did, and it works great:
Pattern p = Pattern.compile("cat");
Matcher m = p.matcher("one cat two cats in the yard");
StringBuffer sb = new StringBuffer();
while (m.find()) {
    m.appendReplacement(sb, "dog");
}
m.appendTail(sb);
System.out.println(sb.toString());
And here, in your select method:
private String select(String m){
    String[] choices = m.split("|");
    Random random = new Random();
    int index = random.nextInt(choices.length - 1);
    return choices[index];
}
replace m.split("|") with m.split("\\|").
Otherwise it splits between each and every character.
Also use Pattern.compile("\\{([^{}]*)\\}");
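A corrected select could look roughly like this. It is a sketch that assumes the argument is the text between the braces (group 1), and the class name ChoicePicker is invented; using nextInt(choices.length) instead of choices.length - 1 is an extra change so that the last choice can also be picked.

```java
import java.util.Random;

class ChoicePicker {
    private static final Random RANDOM = new Random(); // created once, reused

    // Splits "a|b|c" into its alternatives; the pipe is escaped because
    // String.split takes a regex, and an unescaped "|" matches the empty string
    static String[] choices(String group) {
        return group.split("\\|");
    }

    static String select(String group) {
        String[] options = choices(group);
        // nextInt(n) is exclusive of n, so every index 0..n-1 is possible
        return options[RANDOM.nextInt(options.length)];
    }
}
```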
Yup, you read that right. I need something that is capable of generating random text from a regular expression. So the text should be random, but be matched by the regular expression. It seems it doesn't exist, but I could be wrong.
Just an example: that library would be capable of taking '[ab]*c' as input and generating samples such as:
abc
abbbc
bac
etc.
Update: I created something myself: Xeger. Check out http://code.google.com/p/xeger/.
I just created a library for doing this a minute ago. It's hosted here: http://code.google.com/p/xeger/. Carefully read the instructions before using it. (Especially the one referring to downloading another required library.) ;-)
This is the way you use it:
String regex = "[ab]{4,6}c";
Xeger generator = new Xeger(regex);
String result = generator.generate();
assert result.matches(regex);
I am not aware of such a library. If you're interested in writing one yourself, then these are probably the steps you'll need to take:
Write a parser for regular expressions (you may want to start out with a restricted class of regexes).
Use the result to construct an NFA.
(Optional) Convert the NFA to a DFA.
Randomly traverse the resulting automaton from the start state to any accepting state, while storing the characters outputted by every transition.
The result is a word which is accepted by the original regex. For more, see e.g. Converting a Regular Expression into a Deterministic Finite Automaton.
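To illustrate only the last step (the random traversal), here is a sketch that walks a hand-built DFA for the example regex [ab]*c. The parsing and NFA/DFA construction steps are skipped entirely, and the table layout and names are assumptions made for the example.

```java
import java.util.Random;

class DfaSampler {
    // One outgoing transition: the emitted character and the target state
    static final class Edge {
        final char symbol;
        final int target;
        Edge(char symbol, int target) {
            this.symbol = symbol;
            this.target = target;
        }
    }

    // Hand-built DFA for [ab]*c: state 0 is the start state, state 1 accepts
    private static final Edge[][] TRANSITIONS = {
        { new Edge('a', 0), new Edge('b', 0), new Edge('c', 1) },
        { }
    };
    private static final boolean[] ACCEPTING = { false, true };

    static String sample(Random random) {
        StringBuilder word = new StringBuilder();
        int state = 0;
        // Simplification: stop at the first accepting state that is reached
        while (!ACCEPTING[state]) {
            Edge[] edges = TRANSITIONS[state];
            Edge edge = edges[random.nextInt(edges.length)];
            word.append(edge.symbol); // record the character on this transition
            state = edge.target;
        }
        return word.toString();
    }
}
```

A general generator would also need a length bound and a way to continue past accepting states that still have outgoing transitions; this sketch does neither.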
Here are a few implementations of such a beast, but none of them are in Java (and all but the closed-source Microsoft one are very limited in their regexp feature support).
Based on Wilfred Springer's solution together with
http://www.brics.dk/~amoeller/automaton/, I built another generator.
It does not use recursion. It takes as input the pattern (regular expression), a minimum string length, and a maximum string length. The result is an accepted string whose length is between min and max. It also allows some of the XML "shorthand character classes".
I use this for an XML sample generator that builds valid strings for facets.
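import java.util.LinkedList;
import java.util.List;
import java.util.Random;
import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;
import dk.brics.automaton.State;
import dk.brics.automaton.Transition;

public static final String generate(final String pattern, final int minLength, final int maxLength) {
    // Expand the shorthand classes into explicit character classes
    final String regex = pattern
        .replace("\\d", "[0-9]")        // \d = digit
        .replace("\\w", "[A-Za-z0-9_]") // \w = word character
        .replace("\\s", "[ \t\r\n]");   // \s = whitespace
    final Automaton automaton = new RegExp(regex).toAutomaton();
    final Random random = new Random(System.nanoTime());
    final List<String> validLength = new LinkedList<>();
    int len = 0;
    final StringBuilder builder = new StringBuilder();
    State state = automaton.getInitialState();
    Transition[] transitions;
    while(len <= maxLength && (transitions = state.getSortedTransitionArray(true)).length != 0) {
        final int option = random.nextInt(transitions.length);
        // Remember every accepted prefix whose length is within bounds
        if (state.isAccept() && len >= minLength && len <= maxLength) validLength.add(builder.toString());
        final Transition t = transitions[option]; // random transition
        // Pick a random character from the transition's [min, max] range
        builder.append((char) (t.getMin()+random.nextInt(t.getMax()-t.getMin()+1)));
        len++;
        state = t.getDest();
    }
    // The loop also ends at an accepting state without outgoing transitions; check it too
    if (state.isAccept() && len >= minLength && len <= maxLength) validLength.add(builder.toString());
    if(validLength.size() == 0) throw new IllegalArgumentException(automaton.toString()+" , "+minLength+" , "+maxLength);
    return validLength.get(random.nextInt(validLength.size()));
}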
Here is a Python implementation of a module like that: http://www.mail-archive.com/python-list#python.org/msg125198.html It should be portable to Java.