Split to ArrayList using pattern matcher in Java


String s = "A..?-B^&';(,,,)G56.6C,,,M4788C..,,A1''";
String[] result = s.split("(?=[ABC])");
System.out.println(Arrays.toString(result));
Output:
[A..?-, B^&';(,,,)G56.6, C,,,M4788, C..,,, A1'']
Please refer to the split above. I am trying to separate the string at each A, B, or C. How can I get the same substrings into an ArrayList using Pattern and Matcher? I could not figure out how to group in the code below.
Pattern p = Pattern.compile("(?=[ABC])");
Matcher m = p.matcher(s);
List<String> matches = new ArrayList<>();
while (m.find()) {
matches.add(m.group());
}
Also, suppose I have a few characters before the first occurrence of A, B, or C, and I want to combine them with the first element in the ArrayList, e.g. ,,A..
Appreciate the help.

[ABC][^ABC]*
Unless I have omitted an edge case, that should work with the code you provided.
For the extra question, you could possibly add (^[^ABC]*)* to the beginning, but that makes it slower and less readable, not to mention that it will only work when checking single-line strings. I would recommend just parsing the beginning characters manually, treating them as the special case they are.
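To make the above concrete, here is a minimal sketch that collects the chunks into a list. It uses a slightly different prefix than the one suggested above, an optional non-capturing group (?:^[^ABC]+)?, so that leading characters are glued onto the first element. The class and method names are made up for illustration, and manual handling of the prefix remains a reasonable alternative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitToList {
    // Optional run of non-ABC characters anchored at the start of the input,
    // then one of A, B, C followed by everything up to the next A, B, or C.
    private static final Pattern CHUNK = Pattern.compile("(?:^[^ABC]+)?[ABC][^ABC]*");

    // Hypothetical helper: returns each chunk as one list element.
    static List<String> chunks(String s) {
        List<String> matches = new ArrayList<>();
        Matcher m = CHUNK.matcher(s);
        while (m.find()) {
            matches.add(m.group()); // group() is the whole match, no capturing group needed
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(chunks("A..?-B^&';(,,,)G56.6C,,,M4788C..,,A1''"));
        System.out.println(chunks(",,A..B")); // leading ",," stays with the first chunk
    }
}
```

Without the MULTILINE flag, ^ only matches at the very start of the input, so the optional prefix can only ever attach to the first chunk.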

Related

How to match this string pattern?

Below is my string.
{" 3/4", "ERW", "A53-A", "STEEL", "PIPE", "STD", "BLK", "PE"}
I need to match this string using a regular expression; please help me achieve this.
I tried the code snippet below, but it matches only partially (I can match only 6 of the strings with it).
String pattern = "\\s*,\\s*";
String[] sourceValues= listTwo.get(1).toString().split(pattern);
I am not able to match the first and last strings using this pattern.
Please help me achieve this; I need to match all 8 strings.
Thanks,
Sandesh P

You might try:
" ?([^"]+)"
That would capture what is between double quotes (without a leading single whitespace) in group 1. Now you have 8 strings instead of 6.
List<String> allMatches = new ArrayList<String>();
Matcher m = Pattern.compile("\" ?([^\"]+)\"").matcher("{\" 3/4\", \"ERW\", \"A53-A\", \"STEEL\", \"PIPE\", \"STD\", \"BLK\", \"PE\"}");
while (m.find()) {
allMatches.add(m.group(1));
}

Multiple replace option for a string in java regex

How can I replace different matches with different replacements using regex?
If I have two match options separated by |, I want to be able to reference, for each match, the string or substring that matched.
if I have
Pattern p = Pattern.compile("man|woman|girls");
Matcher m = p.matcher("some string");
If the match is "man" I want to use a different replacement from when the match is "woman" or "girls".
I have looked through "Most efficient way to use replace multiple words in a string" but don't understand how to reference the match itself.

Consider improving your pattern a little by adding word boundaries, so that it does not match only part of a word; otherwise man would also match inside mandatory.
(BTW: if you leave the word boundaries out and want to replace words that share a prefix, like man and manual, place manual before man in the regex. Otherwise man will consume the <man>ual part, which prevents the remaining ual from being matched. So the correct order would be manual|man.)
So your regex could look more like this:
Pattern p = Pattern.compile("\\b(man|woman|girls)\\b");
Matcher m = p.matcher("some text about woman and few girls");
The next thing you can do is store originalValue -> replacement pairs in a collection that lets you easily look up the replacement for a value. The simplest way would be to use a Map:
Map<String, String> replacementMap = new HashMap<>();
replacementMap.put("man", "foo");
replacementMap.put("woman", "bar");
replacementMap.put("girls", "baz");
Now your code can look like this:
StringBuffer sb = new StringBuffer();
while(m.find()){
String wordToReplace = m.group();
// replace the found word with its replacement from the map;
// quoteReplacement keeps any $ or \ in the replacement literal
m.appendReplacement(sb, Matcher.quoteReplacement(replacementMap.get(wordToReplace)));
}
m.appendTail(sb);
String replaced = sb.toString();
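For reference, on Java 9 or later the same map lookup can be sketched with Matcher.replaceAll taking a function, which does the appendReplacement/appendTail loop internally. The class and method names below are invented for the example; the map contents are the ones from the answer above.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordReplacer {
    // Same pairs as in the answer above.
    private static final Map<String, String> REPLACEMENTS =
            Map.of("man", "foo", "woman", "bar", "girls", "baz");

    private static final Pattern WORDS = Pattern.compile("\\b(man|woman|girls)\\b");

    // Hypothetical helper: looks up each matched word in the map (requires Java 9+).
    static String replaceWords(String text) {
        return WORDS.matcher(text).replaceAll(
                match -> Matcher.quoteReplacement(REPLACEMENTS.get(match.group())));
    }

    public static void main(String[] args) {
        System.out.println(replaceWords("some text about woman and few girls"));
        // prints "some text about bar and few baz"
    }
}
```

Because every alternative in the pattern is also a key in the map, the lookup never returns null here; a pattern built from other sources would need a fallback.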

You could do
str =
str.replace("woman", "REPLACEMENT1")
.replace("man", "REPLACEMENT2")
.replace("girls", "REPLACEMENT3");

How to best strip out certain strings in a file?

If I have a file with the following content:
11:17 GET this is my content #2013
11:18 GET this is my content #2014
11:19 GET this is my content #2015
How can I use a Scanner and ignore certain parts of a String line = scanner.nextLine();?
The result that I would like to have is:
this is my content
this is my content
this is my content
So I'd like to strip everything from the start up to GET, and then take everything up to the # char.
How could this easily be done?

You can use the String.indexOf(String str) and String.indexOf(char ch) methods. For example:
String line = scanner.nextLine();
int start = line.indexOf("GET");
int end = line.indexOf('#');
String result = line.substring(start + 4, end).trim(); // skip "GET " and drop the space before '#'

One way might be
String strippedStart = scanner.nextLine().split(" ", 3)[2];
String result = strippedStart.substring(0, strippedStart.lastIndexOf("#")).trim();
This assumes there are always two space-separated tokens at the beginning (11:22 GET or 11:33 POST, for example).

You could do something like this:-
String line ="11:17 GET this is my content #2013";
int startIndex = line.indexOf("GET ");
int endIndex = line.indexOf("#");
line = line.substring(startIndex+4, endIndex-1);
System.out.println(line);

In my opinion the best solution for your problem would be Java regex. With a regex you can define which group or groups of text you want to retrieve and what kind of text comes where. I haven't worked with Java in a long time, so this is off the top of my head; I'll try to point you in the right direction.
First off, compile a pattern:
Pattern pattern = Pattern.compile("^\\d{1,2}:\\d{1,2} GET (.*?) #\\d+$", Pattern.MULTILINE);
First part of the regex says that you expect one or two digits followed by a colon followed by one or two digits again. After that comes the GET (you can use GET|POST if you expect those words or \w+? if you expect any word). Then you define the group you want with the parentheses. Lastly, you put the hash and any number of digits with at least one digit. You might consider putting flags DOTALL and CASE_INSENSITIVE, although I don't think you'll be needing them.
Then you continue with the matcher:
Matcher matcher = pattern.matcher(textToParse);
while (matcher.find())
{
//extract groups here
String group = matcher.group(1);
}
In the while loop you can use matcher.group(1) to find the text in the group you selected with the parentheses (the text you'd like extracted). matcher.group(0) gives the entire find, which is not what you're currently looking for (I guess).
Sorry for any errors in the code, it has not been tested. Hope this puts you on the right track.
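Since the snippets above were untested, here is a small self-contained sketch of the same idea that can be run as-is. The class and method names are made up, and the sample text is the three lines from the question. Note that inside a Java string literal each regex backslash has to be written twice (\\d rather than \d).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentExtractor {
    // MULTILINE makes ^ and $ match at the start and end of every line.
    private static final Pattern LINE =
            Pattern.compile("^\\d{1,2}:\\d{1,2} GET (.*?) #\\d+$", Pattern.MULTILINE);

    // Hypothetical helper: returns group 1 (the text between GET and #) of every matching line.
    static List<String> extract(String text) {
        List<String> contents = new ArrayList<>();
        Matcher matcher = LINE.matcher(text);
        while (matcher.find()) {
            contents.add(matcher.group(1));
        }
        return contents;
    }

    public static void main(String[] args) {
        String text = "11:17 GET this is my content #2013\n"
                + "11:18 GET this is my content #2014\n"
                + "11:19 GET this is my content #2015";
        extract(text).forEach(System.out::println); // prints "this is my content" three times
    }
}
```

Lines that do not follow the time/GET/#number layout are simply skipped rather than raising an error.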

You can try this rather flexible solution:
Scanner s = new Scanner(new File("data"));
Pattern p = Pattern.compile("^(.+?)\\s+(.+?)\\s+(.*)\\s+(.+?)$");
Matcher m;
while (s.hasNextLine()) {
m = p.matcher(s.nextLine());
if (m.find()) {
System.out.println(m.group(3));
}
}
This piece of code ignores the first, second, and last words of every line and prints whatever lies between them.
Advantage is that it relies on whitespaces rather than specific string literals to perform the stripping.

Java regex pattern matcher

I have a string of the following format:
String name = "A|DescA+B|DescB+C|DescC+...X|DescX+"
So the repeating pattern is ?|?+, and I don't know how many there will be. The part I want to extract is the part before |...so for my example I want to extract a list (an ArrayList for example) that will contain:
[A, B, C, ... X]
I have tried the following pattern:
(.+)\\|.*\\+
but that doesn't work the way I want it to. Any suggestions?

To convert this into a list you can do this:
String name = "A|DescA+B|DescB+C|DescC+X|DescX+";
Matcher m = Pattern.compile("([^|]+)\\|.*?\\+").matcher(name);
List<String> matches = new ArrayList<String>();
while (m.find()) {
matches.add(m.group(1));
}
This gives you the list:
[A, B, C, X]
Note the ? in the middle: it prevents the second part of the regex from consuming the entire string, since it makes the * lazy instead of greedy.
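The effect of the ? is easiest to see by running both variants side by side. The sketch below is only an illustration; the helper name firstGroups is invented for it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedy {
    // Hypothetical helper: collects group 1 of every match of the given regex.
    static List<String> firstGroups(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String name = "A|DescA+B|DescB+C|DescC+X|DescX+";
        // Greedy .* runs to the last +, so the first match swallows the whole string.
        System.out.println(firstGroups("([^|]+)\\|.*\\+", name));  // prints [A]
        // Lazy .*? stops at the first +, leaving the rest for later matches.
        System.out.println(firstGroups("([^|]+)\\|.*?\\+", name)); // prints [A, B, C, X]
    }
}
```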

You are consuming any character (.), and that includes the |, so the engine goes on munching everything. Once it has taken every character, it looks for a | and finds nothing left, so it backtracks only as far as the last |, and the group ends up holding nearly the whole string.
So, try to match any character but | like this:
"([^|]+)\\|.*\\+"
And if it fits, make sure your all-but-| is at the beginning of the string using ^ and that there's a + at the end of the string with $:
"^([^|]+)\\|.*\\+$"
UPDATE: Tim Pietzcker makes a good point: since you are already matching until you find a |, you could just as well match the rest of the string and be done with it:
"^([^|]+).*\\+$"
UPDATE2: By the way, if you want to simply get the first part of the string, you can simplify things with:
myString.split("\\|")[0]

Another idea: Find all characters between + (or start of string) and |:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("(?<=^|[+])[^|]+");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}

I think the easiest solution would be to split by \\+, then for each part apply the (.+?)\\|.* pattern to extract the group you need.
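A rough sketch of that two-step approach, with invented class and method names, might look like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitThenExtract {
    // Captures everything before the first | in a single part.
    private static final Pattern PART = Pattern.compile("(.+?)\\|.*");

    // Hypothetical helper: splits on + and keeps the text before | from each part.
    static List<String> keys(String name) {
        List<String> keys = new ArrayList<>();
        for (String part : name.split("\\+")) { // trailing empty strings are dropped by split
            Matcher m = PART.matcher(part);
            if (m.matches()) {
                keys.add(m.group(1));
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(keys("A|DescA+B|DescB+C|DescC+X|DescX+")); // prints [A, B, C, X]
    }
}
```

Parts that contain no | at all are skipped, which may or may not be what you want for malformed input.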

how to replace parts of string using regular expressions

I am not a beginner to regular expressions, but their use in perl seems a bit different than in Java.
Anyway, I basically have a dictionary of shorthand words and their definitions. I want to iterate over the words in the dictionary and replace them with their meanings. What is the best way to do this in Java?
I have seen String.replaceAll(), String.replace(), as well as the Pattern/Matcher classes. I wish to do a case insensitive replacement along the lines of:
word =~ s/\s?\Q$short_word\E\s?/ \Q$short_def\E /sig
While I am at it, do you think that it is best to extract all the words from the string and then apply my dictionary or just apply the dictionary to the string? I know that I need to be careful, because the shorthand words could match parts of other shorthand meanings.
Hopefully this all makes sense.
Thanks.
Clarification:
Dictionary is something like:
lol:laugh out loud, rofl:rolling on the floor laughing, ll:like lemons
string is:
lol, i am rofl
replaced text:
laugh out loud, i am rolling on the floor laughing
Notice how the ll wasn't added anywhere.

The danger is false positives inside of normal words. "fell" != "felike lemons"
One way is to split the words on whitespace (do multiple spaces need to be conserved?), then loop over the List applying the 'if contains() { replace } else { output original }' idea.
My output class would be a StringBuffer:
StringBuffer outputBuffer = new StringBuffer();
for(String s: split(inputText)) {
outputBuffer.append( dictionary.containsKey(s) ? dictionary.get(s) : s);
}
Make your split method smart enough to return word delimiters also:
split("now is the time") -> now,<space>,is,<space>,the,<space><space>,time
Then you don't have to worry about conserving white space - the loop above will just append anything that isn't a dictionary word to the StringBuffer.
Here's a recent SO thread on retaining delimiters when regexing.
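One way to get such a delimiter-preserving split, sketched here under my own assumptions rather than taken from the linked thread, is to match alternating runs of word and non-word characters instead of calling split at all. The class name, the method name, and the assumption that dictionary keys are stored in lowercase are all mine.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShorthandExpander {
    // A token is either a run of word characters or a run of everything else,
    // so concatenating all tokens reproduces the input exactly.
    private static final Pattern TOKEN = Pattern.compile("\\w+|\\W+");

    // Hypothetical helper; assumes the dictionary keys are lowercase.
    static String expand(String input, Map<String, String> dictionary) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String token = m.group();
            // Whole-token lookup: "fell" is never touched by the "ll" entry.
            out.append(dictionary.getOrDefault(token.toLowerCase(Locale.ROOT), token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = Map.of(
                "lol", "laugh out loud",
                "rofl", "rolling on the floor laughing",
                "ll", "like lemons");
        System.out.println(expand("lol, i am rofl", dictionary));
        // prints "laugh out loud, i am rolling on the floor laughing"
    }
}
```

Whitespace and punctuation pass through untouched because they are just tokens that never appear in the dictionary.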

If you insist on using regex, this would work (taking Zoltan Balazs' dictionary map approach):
Map<String, String> substitutions = loadDictionaryFromSomewhere();
int lengthOfShortestKeyInMap = 3; //Calculate
int lengthOfLongestKeyInMap = 3; //Calculate
StringBuffer output = new StringBuffer(input.length());
Pattern pattern = Pattern.compile("\\b(\\w{" + lengthOfShortestKeyInMap + "," + lengthOfLongestKeyInMap + "})\\b");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
String candidate = matcher.group(1);
String substitute = substitutions.get(candidate);
if (substitute == null)
substitute = candidate; // no match, use original
matcher.appendReplacement(output, Matcher.quoteReplacement(substitute));
}
matcher.appendTail(output);
// output now contains the text with substituted words
If you plan to process many inputs, pre-compiling the pattern is more efficient than using String.split(), which compiles a new Pattern each call.
(edit) Compiling all of the keys into a single pattern yields a more efficient approach, like so:
Pattern pattern = Pattern.compile("\\b(lol|rtfm|rofl|wtf)\\b");
// rest of the method unchanged, don't need the shortest/longest key stuff
This allows the regex engine to skip over any words that happen to be short enough but aren't in the list, saving you a lot of map accesses.
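Rather than typing the alternation by hand, it can be built from the map keys. The following is a sketch under a few assumptions of my own: keys are stored in lowercase, they start and end with word characters so that \b behaves as expected, and the class and method names are invented. Pattern.quote guards against keys that contain regex metacharacters, and CASE_INSENSITIVE gives the case-insensitive matching asked for in the question.

```java
import java.util.Collection;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class DictionaryPattern {
    // Joins all keys into \b(?:key1|key2|...)\b, quoting each key literally.
    static Pattern fromKeys(Collection<String> keys) {
        String alternation = keys.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    // Hypothetical helper; assumes lowercase keys in the dictionary.
    static String expand(String input, Map<String, String> dictionary) {
        Matcher matcher = fromKeys(dictionary.keySet()).matcher(input);
        StringBuffer output = new StringBuffer(input.length());
        while (matcher.find()) {
            String substitute = dictionary.get(matcher.group().toLowerCase(Locale.ROOT));
            if (substitute == null) {
                substitute = matcher.group(); // defensive fallback, keeps the original text
            }
            matcher.appendReplacement(output, Matcher.quoteReplacement(substitute));
        }
        matcher.appendTail(output);
        return output.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = Map.of(
                "lol", "laugh out loud",
                "rofl", "rolling on the floor laughing",
                "ll", "like lemons");
        System.out.println(expand("LOL, i am rofl", dictionary));
        // prints "laugh out loud, i am rolling on the floor laughing"
    }
}
```

If the dictionary is fixed, compile the pattern once and reuse it instead of rebuilding it on every call as this sketch does.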

The first thing that comes to my mind is this:
...
// e.g. lol -> laugh out loud
Map<String, String> dictionary;
ArrayList<String> originalText;
ArrayList<String> replacedText;
for(String string : originalText) {
if(dictionary.containsKey(string)) {
replacedText.add(dictionary.get(string));
} else {
replacedText.add(string);
}
}
...
Or you could use a StringBuffer instead of the replacedText.
