Pattern, matcher in Java, REGEX help

I'm trying to just get rid of duplicate consecutive words from a text file, and someone mentioned that I could do something like this:
Pattern p = Pattern.compile("(\\w+) \\1");
StringBuilder sb = new StringBuilder(1000);
int i = 0;
for (String s : lineOfWords) { // lineOfWords is a List<String> holding each line read from the text file
    Matcher m = p.matcher(s.toUpperCase());
    // and then do something like
    while (m.find()) {
        // do something here
    }
}
I tried looking at m.end() to see if I could build a new string or remove the matched items, but I wasn't sure how it works after reading the documentation. For example, as a test case to see how it worked, I did:
if (m.find()) {
System.out.println(s.substring(i, m.end()));
}
on a text file containing: This is an example example test test test.
Why is my output "This is"?
Edit:
If I have an ArrayList lineOfWords that holds each line read from a .txt file, and I then create a new ArrayList to hold the modified strings, for example:
List<String> newString = new ArrayList<String>();
for (String s : lineOfWords) {
    s = s.replaceAll( code from Kobi here);
    newString.add(s);
}
then I don't get the new s, only the original s. Is it because of shallow vs. deep copy?

Try something like:
s = s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
That regex is a bit stronger than yours - it checks for whole words (no partial matches), and gets rid of any number of consecutive repetitions.
The regex captures a first word: \b(\w+)\b, and then attempts to match spaces and repetitions of that word: (\s+\1)+. The final \b is to avoid partial matching of \1, as in "for formatting".
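As a rough check of the regex above, here is a small sketch that runs it on the question's sample line. The class and method names are made up for illustration, and the `(?i)` variant is an assumption about what the asker wanted when calling `toUpperCase()`: it ignores case while keeping the original text.

```java
class DedupeWords {
    // Collapses runs of the same whole word ("test test test" -> "test").
    static String dedupe(String line) {
        return line.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
    }

    // Same idea, but (?i) makes the back-reference case-insensitive,
    // so "The the" also collapses; the first spelling is kept.
    static String dedupeIgnoreCase(String line) {
        return line.replaceAll("(?i)\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
    }

    public static void main(String[] args) {
        System.out.println(dedupe("This is an example example test test test."));
        System.out.println(dedupeIgnoreCase("The the cat"));
        System.out.println(dedupe("for formatting")); // left alone: no partial match
    }
}
```

Because a Java String is immutable, replaceAll returns a new string; the result has to be assigned (or added to the new list) for the change to be visible.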

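The first match is the "IS IS" inside "THIS IS" (the "is" that ends "This", a space, and the word "is"), so m.end() points to the end of the second "is". Since i is 0, s.substring(i, m.end()) prints everything from the start of the line up to that point: "This is". I'm not sure why you use i for the start index; try m.start() instead.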
To improve your regex, use \b before and after the word to indicate that there should be word boundaries: (\\b\\w+\\b). Otherwise, as you're seeing, you'll get matches inside of words.
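To make the offsets concrete, the following sketch compares the question's pattern with a word-boundary version on the same sample line. The helper name `firstMatch` is hypothetical; it only wraps `Matcher.find()`, `start()` and `end()`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchOffsets {
    // Returns {start, end} of the first match, or null if there is none.
    static int[] firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        String s = "This is an example example test test test.";
        String upper = s.toUpperCase();

        // Without word boundaries the match starts inside "THIS".
        int[] loose = firstMatch("(\\w+) \\1", upper);
        System.out.println(s.substring(0, loose[1]));         // This is

        // With word boundaries only whole repeated words match.
        int[] strict = firstMatch("\\b(\\w+)\\b \\1\\b", upper);
        System.out.println(s.substring(strict[0], strict[1])); // example example
    }
}
```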

Related

java create variable from regex findings

I'm pretty new to Java, but I am looking to create a String variable from a regex finding. But I am not too sure how.
Basically I need: previous_identifier = (all the text in the next line up to the third comma);
Something maybe like this?
previous_identifier = line.split("^(.+?),(.+?),(.+?),");
Or:
line = reader.readLine();
Pattern courseColumnPattern = Pattern.compile("^(.+?),(.+?),(.+?),");
previous_identifier = (courseColumnPattern.matcher(line).find());
But I know that won't work. What should I do differently?

You can use split to return an array of Strings, then use a StringBuilder to build your return string. An advantage of this approach is being able to easily return the first four strings, two strings, ten strings, etc.
int limit = 3, current = 0;
StringBuilder sb = new StringBuilder();
// Used as an example of input
String str = "test,west,best,zest,jest";
String[] strings = str.split(",");
for(String s : strings) {
if(++current > limit) {
// We've reached the limit; bail
break;
}
if(current > 1) {
// Add a comma if it's not the first element. Alternative is to
// append a comma each time after appending s and remove the last
// character
sb.append(",");
}
sb.append(s);
}
System.out.println(sb.toString()); // Prints "test,west,best"
If you don't need to use the three elements separately (you truly want just the first three elements in a chunk), you can use a Matcher with the following regular expression:
String str = "test, west, best, zest, jest";
// Matches against "non-commas", then a comma, then "non-commas", then
// a comma, then "non-commas". This way, you still don't have a trailing
// comma at the end.
Matcher match = Pattern.compile("^([^,]*,[^,]*,[^,]*)").matcher(str);
if(match.find())
{
// Print out the output!
System.out.println(match.group(1));
}
else
{
// We didn't have a match. Handle it here.
}

Your regex will work, but could be expressed more briefly. This is how you can "extract" it:
String head = str.replaceAll("((.+?,){3}).*", "$1");
This matches the whole string, while capturing the target, with the replacement being the captured input using a back reference to group 1.
Despite the downvote, here's proof the code works!
String str = "foo,bar,baz,other,stuff";
String head = str.replaceAll("((.+?,){3}).*", "$1");
System.out.println(head);
Output:
foo,bar,baz,

Try an online regex tester to work out the regex. I think you need fewer brackets to get the entire text; I'd guess something like:
([^,]+,[^,]+,[^,]+)
which says: find everything except a comma, then a comma, then everything but a comma, then a comma, then everything else that isn't a comma. I suspect this can be improved dramatically; I am not a regex expert.
Then your Java just needs to compile it and match against your string:
line = reader.readLine();
Pattern courseColumnPattern = Pattern.compile("([^,]+,[^,]+,[^,]+)");
Matcher matcher = courseColumnPattern.matcher(line);
if (matcher.find()) {
    previous_identifier = matcher.group(1);
}

scanner.useDelimiter() regex java

I have a string in an inconvenient format. Here is an example:
(Air Fresheners,17)->(Chocolate Chips,14)->(Juice-Frozen,24)
I need to go through this string and extract only the first items in the parenthesis. So using the snippet from above as input, I would like my code to return:
Air Fresheners
Chocolate Chips
Juice-Frozen
Note that some of the items have - in the name of the item. These should be kept and included in the final output. I was trying to use:
Scanner.useDelimiter(insert regex here)
...but I am not having any luck. Other methods of accomplishing the task are fine, but please keep it relatively simple.

I know this is old and I'm no expert but can't you use replaceAll? As below:
String s = "(Air Fresheners,17)->(Chocolate Chips,14)->(Juice-Frozen,24)".replaceAll("(->)|[\\(\\)]|\\d+","");
for (String str : s.split(","))
{
System.out.println(str);
}

Try this one.
Use a regex to find each )->( separator and print the text between the separators:
String s="(Air Fresheners,17)->(Chocolate Chips,14)->(Juice-Frozen,24)";
Pattern regex = Pattern.compile("\\)->\\(");
Matcher regexMatcher = regex.matcher(s);
int i=0;
while (regexMatcher.find()) {
System.out.println(s.substring(i+1,regexMatcher.start()));
i=regexMatcher.end()-1;
}
System.out.println(s.substring(i+1,s.length()-1));
Or try the String.split() method:
String s = "(Air Fresheners,17)->(Chocolate Chips,14)->(Juice-Frozen,24)";
for (String str : s.substring(1, s.length() - 1).split("\\)->\\(")) {
System.out.println(str);
}
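Both snippets above print each pair together with its count (for example Air Fresheners,17). A possible follow-up step, sketched here under the assumption that the count always follows the last comma of a pair, is to cut each piece at that comma. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class FirstItems {
    // Splits "(a,1)->(b,2)" into pairs and keeps the text before the last comma.
    static List<String> firstItems(String s) {
        List<String> names = new ArrayList<>();
        if (s.length() < 2) {
            return names; // nothing between the outer parentheses
        }
        String inner = s.substring(1, s.length() - 1); // drop outer ( and )
        for (String pair : inner.split("\\)->\\(")) {
            int comma = pair.lastIndexOf(',');
            names.add(comma >= 0 ? pair.substring(0, comma) : pair);
        }
        return names;
    }

    public static void main(String[] args) {
        String s = "(Air Fresheners,17)->(Chocolate Chips,14)->(Juice-Frozen,24)";
        for (String name : firstItems(s)) {
            System.out.println(name);
        }
    }
}
```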

This can be done with regular expressions. In the pattern, ([^,)(]*) matches any name that does not contain brackets or commas, ,\\d+\\) matches the ,14) part, and (?:->)? matches a possible -> after the tuple. We use group(1) to get the name (group(0) returns the whole tuple, e.g. (Air Fresheners,17)->).
List<String> ans = new ArrayList<>();
Matcher m = Pattern.compile("\\(([^,)(]*),\\d+\\)(?:->)?").matcher(str);
while (m.find()) {
    // group(1) is the item name without the parentheses or the count
    ans.add(m.group(1));
}
Given (Air Fresheners,17)->(Chocolate Chips,14)->(Juice-Frozen,24), this program returns [Air Fresheners, Chocolate Chips, Juice-Frozen]

(Air Fresheners,17)->(Chocolate Chips,14)->(Juice-Frozen,24)
You could think of it as everything between the ( and the ,
So,
\(.*?\,
would match "(Air Fresheners," (the ? is to make it non-greedy, and stop when it sees a comma)
So if you're keen to use regex, then just match these, and take a substring to get rid of the ( and ,
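A minimal sketch of that idea in Java, assuming no item name contains a comma; the class name is illustrative only. Each hit looks like "(Air Fresheners," so the first and last characters are dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenToComma {
    static List<String> names(String s) {
        List<String> out = new ArrayList<>();
        // Non-greedy: from "(" up to the first "," that follows it.
        Matcher m = Pattern.compile("\\(.*?,").matcher(s);
        while (m.find()) {
            String hit = m.group();
            out.add(hit.substring(1, hit.length() - 1)); // strip "(" and ","
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(names("(Air Fresheners,17)->(Chocolate Chips,14)->(Juice-Frozen,24)"));
    }
}
```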

I would first go through using )-> as the delimiter. On each scanner.next(), get rid of the first character (the opening parenthesis) using substring, and then place a second scanner on that string that uses , as the delimiter. In code this would look something like:
Scanner s1 = new Scanner(string).useDelimiter("\\s*\\)->\\s*");
while (s1.hasNext()) {
    // s1.next() is e.g. "(Air Fresheners,17"; the second scanner cuts it at the comma
    Scanner s2 = new Scanner(s1.next()).useDelimiter("\\s*,\\s*");
    System.out.println(s2.next().substring(1));
}

How to best strip out certain strings in a file?

If I have a file with the following content:
11:17 GET this is my content #2013
11:18 GET this is my content #2014
11:19 GET this is my content #2015
How can I use a Scanner and ignore certain parts of a String line = scanner.nextLine();?
The result that I like to have would be:
this is my content
this is my content
this is my content
So I'd like to strip everything from the start up to and including GET, and then take everything up to the # char.
How could this easily be done?

You can use the String.indexOf(String str) and String.indexOf(char ch) methods. For example:
String line = scanner.nextLine();
int start = line.indexOf("GET");
int end = line.indexOf('#');
String result = line.substring(start + 4, end).trim(); // trim() drops the space left before '#'

One way might be
String strippedStart = scanner.nextLine().split(" ", 3)[2];
String result = strippedStart.substring(0, strippedStart.lastIndexOf("#")).trim();
This assumes there are always two space-separated tokens at the beginning (11:22 GET or 11:33 POST, idk).

You could do something like this:-
String line ="11:17 GET this is my content #2013";
int startIndex = line.indexOf("GET ");
int endIndex = line.indexOf("#");
line = line.substring(startIndex+4, endIndex-1);
System.out.println(line);

In my opinion the best solution for your problem would be using Java regex. Using regex you can define which group or groups of text you want to retrieve and what kind of text comes where. I haven't been working with Java in a long time, so I'll try to help you out from the top of my head. I'll try to give you a point in the right direction.
First off, compile a pattern:
Pattern pattern = Pattern.compile("^\\d{1,2}:\\d{1,2} GET (.*?) #\\d+$", Pattern.MULTILINE);
First part of the regex says that you expect one or two digits followed by a colon followed by one or two digits again. After that comes the GET (you can use GET|POST if you expect those words or \w+? if you expect any word). Then you define the group you want with the parentheses. Lastly, you put the hash and any number of digits with at least one digit. You might consider putting flags DOTALL and CASE_INSENSITIVE, although I don't think you'll be needing them.
Then you continue with the matcher:
Matcher matcher = pattern.matcher(textToParse);
while (matcher.find())
{
//extract groups here
String group = matcher.group(1);
}
In the while loop you can use matcher.group(1) to find the text in the group you selected with the parentheses (the text you'd like extracted). matcher.group(0) gives the entire find, which is not what you're currently looking for (I guess).
Sorry for any errors in the code, it has not been tested. Hope this puts you on the right track.
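For reference, here is a self-contained sketch of the pattern-and-matcher steps described in this answer, run against the three sample lines from the question. The class and method names are assumptions made for the example; note that inside a Java string literal each regex backslash has to be doubled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GetLineExtractor {
    // MULTILINE lets ^ and $ match at every line break, not only at the ends of the input.
    private static final Pattern LINE =
            Pattern.compile("^\\d{1,2}:\\d{1,2} GET (.*?) #\\d+$", Pattern.MULTILINE);

    static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = LINE.matcher(text);
        while (m.find()) {
            out.add(m.group(1)); // the text between "GET " and " #"
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "11:17 GET this is my content #2013\n"
                + "11:18 GET this is my content #2014\n"
                + "11:19 GET this is my content #2015";
        for (String content : extract(text)) {
            System.out.println(content);
        }
    }
}
```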

You can try this rather flexible solution:
Scanner s = new Scanner(new File("data"));
Pattern p = Pattern.compile("^(.+?)\\s+(.+?)\\s+(.*)\\s+(.+?)$");
Matcher m;
while (s.hasNextLine()) {
m = p.matcher(s.nextLine());
if (m.find()) {
System.out.println(m.group(3));
}
}
This piece of code strips the first, second, and last words from every line and prints what remains.
Advantage is that it relies on whitespaces rather than specific string literals to perform the stripping.

The best way to find out whether part of a string is a potential regex match

How would you do this?
I have a string and some regexes. I iterate over the string, and in every iteration I need to know whether the part of the string seen so far (index 0 up to the current index) could still become a full match of one or more of the given regexes in later iterations.
Thank you for your help.

What about code like this:
// put all of the *greedy* regexes into a list
List<String> regex = new ArrayList<String>();
// here is my text
String mytext = "...";
String tmp = null;
// iterate over the letters of my text
for (int i = 1; i <= mytext.length(); i++) {
    // substring from position 0 up to (but not including) index i
    tmp = mytext.substring(0, i);
    // try each regex on the sub-text
    for (String reg : regex) {
        Pattern p = Pattern.compile(reg);
        Matcher m = p.matcher(tmp);
        // if found, do something
        if (m.find()) {
            // bingo: handle the match here
        }
    }
}

You could use Matcher.lookingAt() to try to match as much as possible from a given input, but not requiring the whole input to match (.matches() would require the full input to match and .find() would not require the match to start at the beginning).
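The difference between the three calls can be seen in a short sketch; the class name and the string-returning helper are made up purely so the results are easy to compare.

```java
import java.util.regex.Pattern;

class MatchModes {
    // Reports "matches lookingAt find" for one regex/input pair.
    static String modes(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        boolean whole = p.matcher(input).matches();     // entire input must match
        boolean prefix = p.matcher(input).lookingAt();  // match must start at index 0
        boolean anywhere = p.matcher(input).find();     // match may start anywhere
        return whole + " " + prefix + " " + anywhere;
    }

    public static void main(String[] args) {
        System.out.println(modes("\\d+", "123abc")); // false true true
        System.out.println(modes("\\d+", "abc123")); // false false true
        System.out.println(modes("\\d+", "123"));    // true true true
    }
}
```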

I don't believe the Java regular expression API provides such "incremental" or "step-by-step" search.
What you could do however, is to formulate your expression using reluctant quantifiers.
[...] The reluctant quantifiers, however, take the opposite approach: They start at the beginning of the input string, then reluctantly eat one character at a time looking for a match. The last thing they try is the entire input string. [...]
If this isn't viable in your case, you could use the Matcher.region(int start, int end) method to incrementally increase the region used by the matcher.
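One further option that the answers here do not mention is Matcher.hitEnd(), which reports whether the last match attempt ran into the end of the input. The sketch below assumes that "could still become a full match" can be read as "matches() succeeded, or it failed only because the input ran out"; the helper name `couldStillMatch` and the sample pattern are illustrative, not from the original thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartialMatch {
    // True if prefix already matches, or if more input might complete a match.
    static boolean couldStillMatch(Pattern p, CharSequence prefix) {
        Matcher m = p.matcher(prefix);
        return m.matches() || m.hitEnd();
    }

    public static void main(String[] args) {
        Pattern phone = Pattern.compile("\\d{3}-\\d{4}");
        String text = "123-4567";
        // Feed growing prefixes, as in the question's loop.
        for (int i = 1; i <= text.length(); i++) {
            String prefix = text.substring(0, i);
            System.out.println(prefix + " -> " + couldStillMatch(phone, prefix));
        }
        System.out.println("12a -> " + couldStillMatch(phone, "12a"));
    }
}
```

Treat this as a heuristic to test against your own patterns rather than a guarantee: hitEnd() describes what the engine did on the last attempt, which is not always the same as proving that some continuation exists.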

So I've been searching for alternatives to Java's standard regex library and found one that does the job well: JRegex.

how to replace parts of string using regular expressions

I am not a beginner to regular expressions, but their use in perl seems a bit different than in Java.
Anyway, I basically have a dictionary of shorthand words and their definitions. I want to iterate over the words in the dictionary and replace them with their meanings. What is the best way to do this in Java?
I have seen String.replaceAll(), String.replace(), as well as the Pattern/Matcher classes. I wish to do a case insensitive replacement along the lines of:
word =~ s/\s?\Q$short_word\E\s?/ \Q$short_def\E /sig
While I am at it, do you think that it is best to extract all the words from the string and then apply my dictionary or just apply the dictionary to the string? I know that I need to be careful, because the shorthand words could match parts of other shorthand meanings.
Hopefully this all makes sense.
Thanks.
Clarification:
Dictionary is something like:
lol:laugh out loud, rofl:rolling on the floor laughing, ll:like lemons
string is:
lol, i am rofl
replaced text:
laugh out loud, i am rolling on the floor laughing
Notice how the ll wasn't replaced anywhere.

The danger is false positives inside of normal words. "fell" != "felike lemons"
One way is to split the words on whitespace (do multiple spaces need to be conserved?) and then loop over the List, applying the 'if contains() { replace } else { output original }' idea from the other answer.
My output class would be a StringBuffer:
StringBuffer outputBuffer = new StringBuffer();
for (String s : split(inputText)) {
    // Map has containsKey(), not contains()
    outputBuffer.append(dictionary.containsKey(s) ? dictionary.get(s) : s);
}
Make your split method smart enough to return word delimiters also:
split("now is the time") -> now,<space>,is,<space>,the,<space><space>,time
Then you don't have to worry about conserving white space - the loop above will just append anything that isn't a dictionary word to the StringBuffer.
Here's a recent SO thread on retaining delimiters when regexing.
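One way to get a delimiter-preserving split like the one described above is to tokenize with a Matcher instead of String.split. The sketch below is an assumption about how such a split could look: it alternates between runs of word characters and runs of everything else, so punctuation and spacing are copied through untouched. The class, the sample dictionary, and the lower-casing of lookups are all illustrative choices.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenReplacer {
    // Either a run of word characters or a run of non-word characters.
    private static final Pattern TOKEN = Pattern.compile("\\w+|\\W+");

    // Hypothetical dictionary taken from the question's clarification.
    static Map<String, String> sampleDictionary() {
        Map<String, String> d = new HashMap<>();
        d.put("lol", "laugh out loud");
        d.put("rofl", "rolling on the floor laughing");
        d.put("ll", "like lemons");
        return d;
    }

    static String replaceWords(String input, Map<String, String> dictionary) {
        StringBuilder out = new StringBuilder(input.length());
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String token = m.group();
            // Assumes dictionary keys are stored in lower case.
            String definition = dictionary.get(token.toLowerCase(Locale.ROOT));
            out.append(definition != null ? definition : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceWords("lol, i am rofl", sampleDictionary()));
        System.out.println(replaceWords("fell", sampleDictionary())); // unchanged
    }
}
```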

If you insist on using regex, this would work (taking Zoltan Balazs' dictionary map approach):
Map<String, String> substitutions = loadDictionaryFromSomewhere();
int lengthOfShortestKeyInMap = 3; //Calculate
int lengthOfLongestKeyInMap = 3; //Calculate
StringBuffer output = new StringBuffer(input.length());
Pattern pattern = Pattern.compile("\\b(\\w{" + lengthOfShortestKeyInMap + "," + lengthOfLongestKeyInMap + "})\\b");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
String candidate = matcher.group(1);
String substitute = substitutions.get(candidate);
if (substitute == null)
substitute = candidate; // no match, use original
matcher.appendReplacement(output, Matcher.quoteReplacement(substitute));
}
matcher.appendTail(output);
// output now contains the text with substituted words
If you plan to process many inputs, pre-compiling the pattern is more efficient than using String.split(), which compiles a new Pattern each call.
(edit) Compiling all of the keys into a single pattern yields a more efficient approach, like so:
Pattern pattern = Pattern.compile("\\b(lol|rtfm|rofl|wtf)\\b");
// rest of the method unchanged, don't need the shortest/longest key stuff
This allows the regex engine to skip over any words that happen to be short enough but aren't in the list, saving you a lot of map accesses.
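The single-pattern idea can be taken one step further by building the alternation from the map's keys instead of hard-coding it. This is a sketch under a few assumptions that are not in the original answer: keys consist of word characters and are stored in lower case, and matching should ignore case as the question asked. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShorthandExpander {
    static String expand(String input, Map<String, String> substitutions) {
        if (substitutions.isEmpty()) {
            return input;
        }
        // Build \b(key1|key2|...)\b, quoting each key so it is taken literally.
        StringBuilder alternation = new StringBuilder();
        for (String key : substitutions.keySet()) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(key));
        }
        Pattern pattern = Pattern.compile("\\b(" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);

        Matcher matcher = pattern.matcher(input);
        StringBuffer output = new StringBuffer(input.length());
        while (matcher.find()) {
            String substitute = substitutions.get(matcher.group(1).toLowerCase(Locale.ROOT));
            if (substitute == null) {
                substitute = matcher.group(1); // key not stored in lower case: keep original
            }
            matcher.appendReplacement(output, Matcher.quoteReplacement(substitute));
        }
        matcher.appendTail(output);
        return output.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<>();
        dict.put("lol", "laugh out loud");
        dict.put("rofl", "rolling on the floor laughing");
        dict.put("ll", "like lemons");
        System.out.println(expand("LOL, i am rofl", dict));
        System.out.println(expand("fell", dict)); // unchanged: "ll" is not a whole word here
    }
}
```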

The first thing that comes to mind is this:
...
// e.g.: lol -> laugh out loud
Map<String, String> dictionary;
ArrayList<String> originalText;
ArrayList<String> replacedText;
for (String string : originalText) {
    if (dictionary.containsKey(string)) {
        replacedText.add(dictionary.get(string));
    } else {
        replacedText.add(string);
    }
}
...
Or you could use a StringBuffer instead of replacedText.
