regex to select specific multiple lines - java

I'm trying to capture a group of lines, out of a large number of lines (up to 100 to 130), that follow a specific term.
Here is my code.
String inp = "Welcome!\n"
+" Welcome to the Apache ActiveMQ Console of localhost (ID:InternetXXX022-45298-5447895412354475-2:9) \n"
+" You can find more information about Apache ActiveMQ on the Apache ActiveMQ Site \n"
+" Broker\n"
+" Name localhost\n"
+" Version 5.13.3\n"
+" ID ID:InternetXXX022-45298-5447895412354475-2:9\n"
+" Uptime 14 days 14 hours\n"
+" Store percent used 19\n"
+" Memory percent used 0\n"
+" Temp percent used 0\n"
+ "Queue Views\n"
+ "Graph\n"
+ "Topic Views\n"
+ " \n"
+ "Subscribers Views\n";
Pattern rgx = Pattern.compile("(?<=Broker)\n((?:.*\n){1,7})", Pattern.DOTALL);
Matcher mtch = rgx.matcher(inp);
if (mtch.find()) {
String result = mtch.group();
System.out.println(result);
}
I want to capture the lines below from all the lines in inp.
Name localhost\n
Version 5.13.3\n
ID ID:InternetXXX022-45298-5447895412354475-2:9\n
Uptime 14 days 14 hours\n
Store percent used 19\n
Memory percent used 0\n
Temp percent used 0\n
But my code gives me all the lines after "Broker". What am I doing wrong?
Secondly, I understand that ?: means a non-capturing group, so why is my regex ((?:.*\n)) still able to capture the lines after Broker?

Remove Pattern.DOTALL: it makes . match line breaks too, so the first .* grabs the rest of the text and the {1,7} limiting quantifier no longer limits anything.
Besides, your real data seems to contain CRLF line endings, so it is more convenient to use \R rather than \n to match line breaks. Else, you may use a Pattern.UNIX_LINES modifier (or its embedded flag equivalent, (?d), inside the pattern) and then you may keep your pattern as is (since only \n, LF, will be considered a line break and . will match carriage returns, CRs).
Also, I suggest trimming the result.
Use
Pattern rgx = Pattern.compile("(?<=Broker)\\R((?:.*\\R){1,7})");
// Or,
// Pattern rgx = Pattern.compile("(?d)(?<=Broker)\n((?:.*\n){1,7})");
Matcher mtch = rgx.matcher(inp);
if (mtch.find()) {
String result = mtch.group();
System.out.println(result.trim());
}
See the Java demo online.
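On the second question: (?:...) only groups .*\R so that the {1,7} quantifier can repeat it. The outer parentheses are an ordinary capturing group (group 1), and group() with no argument always returns the whole match (group 0). A minimal sketch of this, assuming Java 8+ for \R; the class name BrokerLines and the shortened input are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BrokerLines {
    // No DOTALL here, so '.' stops at each line break.
    private static final Pattern BLOCK =
            Pattern.compile("(?<=Broker)\\R((?:.*\\R){1,7})");

    // Returns the (up to) seven lines after "Broker", or null if there is no match.
    static String brokerBlock(String text) {
        Matcher m = BLOCK.matcher(text);
        // group(1) is the outer capturing group; the inner (?:...) creates no group.
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String inp = "Welcome!\n Broker\n Name localhost\n Version 5.13.3\n ID x\n"
                + " Uptime 14 days\n Store percent used 19\n Memory percent used 0\n"
                + " Temp percent used 0\nQueue Views\nGraph\n";
        System.out.println(brokerBlock(inp));
    }
}
```

Note that {1,7} takes the next seven lines whatever they contain, so this only works while the Broker block always has exactly seven entries.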

Related

Python Regex to Java

I am trying to convert a python regex to java. It finds a match in python but fails on the same string in java.
Python regex : "(CommandLineEventConsumer)(\x00\x00)(.*?)(\x00)(.*?)({})(\x00\x00)?([^\x00]*)?".format(event_consumer_name)
Java regex : "(CommandLineEventConsumer)(\\u0000\\u0000)(.*?)(\\u0000)(.*?)(" + event_consumer_name + ")(\\u0000\\u0000)?([^\\u0000]*)?"
I also tried this : "(CommandLineEventConsumer)(\\x00\\x00)(.*?)(\\x00)(.*?)(" + event_consumer_name + ")(\\x00\\x00)?([^\\x00]*)?"
What am I missing, please?
I have attached a piece of the code:
String sampleStr = "\u0000\u0000�\u0003\b\u0000\u0000\u0000�\u0005\u0000\u0000\u0003\u0000\u0000�\u0000\u000B\u0000\u0000\u0000���\u0005\u0000\u0000\u0000\u0003\u0000\u0000\u0000 \u0000\u0000\u0000\u0000string\u0000\u0000WMIDataID\u0000\u0000SystemVersion\u0000\b\u0000\u0000\u0000\f\u0000.\u0000\u0000\u0000\u0000\u0000\u0000\u0000)\u0000\u0000\u0000 \u0000\u0000�\u0003\b\u0000\u0000\u0000'\u0006\u0000\u0000\u0003\u0000\u0000�\u0000\u000B\u0000\u0000\u0000��/\u0006\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u000B\u0000\u0000\u0000\u0000string\u0000\u0000WMIDataID\u0000\f\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000�\u0016\u0000\u0000\u0000R\u0000O\u0000O\u0000T\u0000\\\u0000M\u0000i\u0000c\u0000r\u0000o\u0000s\u0000o\u0000f\u0000t\u0000\\\u0000H\u0000o\u0000m\u0000e\u0000N\u0000e\u0000t\u0000\u0019\u0000\u0000\u0000H\u0000N\u0000e\u0000t\u0000_\u0000C\u0000o\u0000n\u0000n\u0000e\u0000c\u0000t\u0000i\u0000o\u0000n\u0000P\u0000r\u0000o\u0000p\u0000e\u0000r\u0000t\u0000i\u0000e\u0000s\u0000 
\u0000\u0000\u0000C\u0000o\u0000n\u0000n\u0000e\u0000c\u0000t\u0000i\u0000o\u0000n\u0000�\u0000\u0000\u0000N\u0000S\u0000_\u00005\u00001\u00001\u00006\u00002\u00006\u0000F\u0000A\u0000E\u00004\u0000F\u00005\u00007\u0000D\u0000B\u0000D\u00002\u00000\u0000D\u0000F\u00005\u0000C\u0000D\u00004\u00004\u0000A\u00004\u00001\u0000D\u0000A\u0000E\u0000C\u0000E\u0000D\u00002\u00008\u0000C\u0000F\u00007\u0000B\u00003\u0000F\u0000D\u00008\u0000B\u00001\u00002\u00000\u00001\u00002\u0000C\u00007\u0000F\u00004\u0000B\u00005\u00008\u0000F\u00004\u00004\u0000E\u00006\u00006\u00005\u0000\\\u0000K\u0000I\u0000_\u0000A\u00000\u00001\u00000\u00008\u0000C\u0000E\u00002\u00006\u00001\u0000D\u00006\u0000C\u0000D\u00007\u00000\u0000D\u00003\u00005\u00000\u0000F\u00005\u0000B\u00007\u00002\u0000F\u00002\u0000E\u00009\u00008\u00007\u00004\u0000A\u0000E\u00006\u0000E\u00000\u00000\u00004\u0000D\u00003\u00000\u00002\u00009\u00000\u00001\u00005\u0000B\u00000\u00009\u00001\u00009\u0000B\u00001\u0000B\u0000D\u00003\u00002\u00006\u0000B\u0000B\u00006\u00004\u00009\u0000\\\u0000I\u0000_\u0000E\u0000D\u0000C\u0000E\u0000A\u00001\u00004\u0000E\u0000C\u00006\u00003\u0000A\u00005\u00007\u00004\u00001\u0000F\u0000A\u0000A\u00006\u00003\u00000\u00001\u0000C\u00007\u00007\u0000C\u0000A\u00002\u00006\u00000\u0000A\u0000B\u0000E\u0000C\u00000\u0000E\u00007\u00007\u00000\u00009\u00005\u00001\u00004\u0000F\u00006\u0000A\u00003\u00002\u0000C\u00000\u00003\u00004\u00007\u0000E\u00000\u00002\u00006\u00008\u00001\u00007\u0000C\u00008\u00008\u0000\u0000\u0000WQL:Re4\u00007\u0000C\u00007\u00009\u0000E\u00006\u00002\u0000C\u00002\u00002\u00002\u00007\u0000E\u0000D\u0000D\u00000\u0000F\u0000F\u00002\u00009\u0000B\u0000F\u00004\u00004\u0000D\u00008\u00007\u0000F\u00002\u0000F\u0000A\u0000F\u00009\u0000F\u0000E\u0000D\u0000F\u00006\u00000\u0000A\u00001\u00008\u0000D\u00009\u0000F\u00008\u00002\u00005\u00009\u00007\u00006\u00000\u00002\u0000B\u0000D\u00009\u00005\u0000E\u00002\u00000\u0000B\u0000D\u00003\u0000�3u�&��\u00
01����+\u0004�\u0001�\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\f;\u0000\u0000\u0000\u000F\u0000\u0000\u0000�\u0000\u0000\u0000F\u0000\u0000\u0000/\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004\u0000\u0000\u0000\u0001�\u0000\u0000�\u0000__EventFilter\u0000\u001C\u0000\u0000\u0000\u0001\u0005\u0000\u0000\u0000\u0000\u0000\u0005\u0015\u0000\u0000\u0000�tw�}\n" +
"z�p�)��\u0001\u0000\u0000\u0000root\\cimv2\u0000\u0000BVTFilter\u0000\u0000SELECT * FROM __InstanceModificationEvent WITHIN 60 WHERE TargetInstance ISA \"Win32_Processor\" AND TargetInstance.LoadPercentage > 99\u0000\u0000WQL\u0000B\u0000B\u0000F\u0000C\u0000C\u0000B\u00004\u00004\u00004\u0000C\u0000F\u00006\u00006\u0000A\u0000A\u00000\u00009\u0000A\u0000E\u00006\u0000F\u00001\u00005\u00009\u00006\u00007\u0000A\u00006\u00008\u00006\u00005\u00001\u00007\u00005\u0000B\u0000B\u00000\u0000E\u0000D\u00002\u00001\u00006\u0000D\u00001\u00009\u00009\u00007\u00000\u0000A\u00007\u00009\u00008\u00008\u0000B\u00007\u00002\u0000C\u0000D\u0000F\u00000\u0000A\u00003\u0000A\u00004\u0000�3u�&��\u0001Ԏ��+\u0004�\u0001�\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u000F�����\"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000/\u0000\u0000\u0000O\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u001A\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\\\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004\u0000\u0000\u0000\u0001q\u0000\u0000�\u0000CommandLineEventConsumer\u0000\u0000cscript KernCap.vbs\u0000\u001C\u0000\u0000\u0000\u0001\u0005\u0000\u0000\u0000\u0000\u0000\u0005\u0015\u0000\u0000\u0000�tw�}\n" +
"z�p�)��\u0001\u0000\u0000\u0000BVTConsumer\u0000\u0000C:\\\\tools\\\\kernrate\u00000\u0000A\u00007\u0000A\u0000B\u0000E\u00006\u00003\u0000F\u00003\u00006\u0000E\u00002\u0000B\u00002\u00009\u00002\u00000\u0000F\u0000E\u0000D\u0000A\u0000F\u0000A\u0000E\u00008\u00004\u00009\u00008\u00002\u00003\u0000A\u0000F\u00009\u00004\u00002\u00009\u0000C\u0000C\u00000\u0000E\u0000A\u00003\u00007\u00003\u0000F\u0000F\u0000E\u0000E\u00001\u00005\u00000\u00007\u0000E\u0000D\u0000B\u00002\u00001\u0000F\u0000D\u00009\u00001\u00007\u00000\u0000�3u�&��\u0001����+\u0004�\u0001�\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000�";
String event_consumer_name = "BVTConsumer";
String cPattern = "(CommandLineEventConsumer)(\\u0000\\u0000)(.*?)(\\u0000)(.*?)(" + event_consumer_name + ")(\\u0000\\u0000)?([^\\u0000]*)?";
Pattern consumer_mo = Pattern.compile(cPattern, Pattern.CASE_INSENSITIVE);
Matcher consumer_match = consumer_mo.matcher(sampleStr);
if(consumer_match.find()){
System.out.println(consumer_match.group(6));
}
UPDATE
In Python, the groups return:
(screenshot of the Python result)
From what I posted as comments:
The (CommandLineEventConsumer)(\u0000\u0000)(.*?)(\u0000)(.*?) part matches fine.
group(3) gets cscript KernCap.vbs
group(4) gets a null character
but group(5) gets nothing.
I did try in Python and I have the exact same lack of match when I include the (BVTConsumer). So you probably had a difference in the code doing the matching in Python, not the regex itself.
The reason is that your string contains a \n, and by default . does not match line terminators, so (.*?) cannot reach past it. If you compile the pattern with DOTALL,
Pattern consumer_mo = Pattern.compile(cPattern, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
it does match in your example.
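To see the effect in isolation, here is a small sketch with a much shorter, made-up input (the class name DotAllDemo and the SAMPLE string are assumptions, not the real data). The same pattern finds nothing with default flags and finds the consumer name once DOTALL is set:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Shortened stand-in for the binary dump; note the '\n' between the two names.
    static final String SAMPLE =
            "CommandLineEventConsumer\u0000\u0000cscript KernCap.vbs\u0000\u001C\n"
            + "z\u0001BVTConsumer\u0000\u0000C:\\tools";

    private static final String REGEX =
            "(CommandLineEventConsumer)(\\u0000\\u0000)(.*?)(\\u0000)(.*?)"
            + "(BVTConsumer)(\\u0000\\u0000)?([^\\u0000]*)?";

    // Returns group 6 (the consumer name), or null when the pattern does not match.
    static String consumerName(String input, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(input);
        return m.find() ? m.group(6) : null;
    }

    public static void main(String[] args) {
        System.out.println(consumerName(SAMPLE, 0));              // '.' stops at '\n'
        System.out.println(consumerName(SAMPLE, Pattern.DOTALL)); // '.' crosses '\n'
    }
}
```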

Regex expression to get the file name

I want to extract only the file name from the complete file name plus timestamp. Below is the input.
String filePath = "fileName1_20150108.csv";
expected output should be: "fileName1"
String filePath2 = "fileName1_filedesc1_20150108_002_20150109013841.csv";
And expected output should be: "fileName1_filedesc1"
I wrote the code below in Java to get the file name; it works for the first input (filePath) but not for filePath2.
Pattern pattern = Pattern.compile(".*.(?=_)");
String filePath = "fileName1_20150108.csv";
String filePath2 = "fileName1_filedesc1_20150108_002_20150109013841.csv";
Matcher matcher = pattern.matcher(filePath);
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end() + " ");
System.out.println(matcher.group());
}
Can somebody please help me correct the regex so I can parse both file paths with the same regex?
Thanks
Anchor the start, and make the .* non-greedy:
^.*?(_\D.*?)?(?=[_.])
Update: change the second group (for fileDesc) to optional, and enforce that it starts with a non-digit character. This will work as long as your fileDesc strings never start with numbers.
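A quick way to check that pattern against both inputs from the question (a sketch only; the class name FileBaseName and the method baseName are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileBaseName {
    // Lazy prefix, optional "_<non-digit>..." description part,
    // then stop just before the next '_' or '.'.
    private static final Pattern BASE = Pattern.compile("^.*?(_\\D.*?)?(?=[_.])");

    // Returns the leading name part, or null if the pattern does not match.
    static String baseName(String fileName) {
        Matcher m = BASE.matcher(fileName);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(baseName("fileName1_20150108.csv"));
        System.out.println(baseName("fileName1_filedesc1_20150108_002_20150109013841.csv"));
    }
}
```

As noted above, this relies on the description part never starting with a digit.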
You can get the characters before the first underscore, the first underscore, and then the characters until the next underscore:
^[^_]*_[^_]*
This should work: "^(.*?)_([0-9_]*)\\.([^.]*)$"
It will return you 3 groups:
the base name (assuming not a single part will be all numbers)
the timestamp info
the extension.
You can test here: http://fiddle.re/v0hne6 (RegexPlanet)

Certain strings that should be found by a working Regex are missed, and I need help identifying why

I have a set of strings that I cycle through, checking each against the following set of regexes to separate the first small section from the rest of the string. The regex works in almost all cases, but I have no idea why it fails occasionally. I've been using Pattern and Matcher to print out the string if the pattern is found.
Two example working strings:
98. SORGHUM Moench - Millets Annuals or rhizomatous perennials; inflorescence …
99. MISCANTHUS Andersson - Silver-grasses Rhizomatous perennials; inflorescence …
Two example failed strings:
100. ZEA L. - Maize Annuals; male and female inflorescences separate, the …
26. POA L. (Parodiochloa C.E. Hubb.) - Meadow-grasses Annuals or perennials with or without stolons or rhizomes; sheaths overlapping or some …
Regexes used so far:
Pattern endOfGenus = Pattern.compile("(?<=(^\\d+\\. " + genusNames[l].toUpperCase() + "))");
Pattern endOfGenusTwo = Pattern.compile("(?<=(^\\d+" + genusNames[l].toUpperCase() + "))");
Pattern endOfGenusThree = Pattern.compile("(?<=(\\d+\\. " + genusNames[l] + "))");
Pattern endOfGenusFour = Pattern.compile("(?<=(\\d+" + genusNames[l] + "))");
Pattern endOfGenusFive = Pattern.compile("(?<=(\\. " + genusNames[l] + "))");
The first of these is the one that's producing reliable results so far.
Example Code
Pattern endOfGenus = Pattern.compile("(?<=(^\\d+\\. " + genusNames[l].toUpperCase() + "))");
Matcher endOfGenusFinder = endOfGenus.matcher(descriptionPartBits[b]);
if (endOfGenusFinder.find()) {
System.out.print(descriptionPartBits[b] + ":- ");
System.out.print(genusNames[l] + "\n");
String[] genusNameBits = descriptionPartBits[b].split("(?<=(^\\d+\\. " + genusNames[l].toUpperCase() + "))");
}
Desired Output. This is what is produced by strings that work. Strings that don't work simply don't appear in the output:
98. SORGHUM Moench - Millets Annuals or rhizomatous perennials:- Sorghum
99. MISCANTHUS Andersson - Silver-grasses Rhizomatous perennials:- Miscanthus
From regex tutorial:
Lookahead and lookbehind, collectively called "lookaround", are
zero-length assertions just like the start and end of line, and start
and end of word anchors explained earlier in this tutorial.
Lookahead and lookbehind only return true or false.
So I changed your code example:
Pattern endOfGenus = Pattern.compile("(?<=(^\\d+\\. ZEA L))(.+)$");
// Matcher matcher = endOfGenus.matcher("98. SORGHUM Moench - Millets Annuals or rhizomatous perennials; inflorescence …");
Matcher matcher = endOfGenus.matcher("100. ZEA L. - Maize Annuals; male and female inflorescences separate, the …");
while (matcher.find()) {
String group1 = matcher.group(1);
String group2 = matcher.group(2);
System.out.println("group1=" + group1);
System.out.println("group2=" + group2);
}
Group 1 is matched by (^\\d+\\. ZEA L). Group 2 is matched by (.+).
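Since the goal is only to separate the "number plus genus" prefix from the rest, one alternative is to skip the lookbehind and use two plain capturing groups. This is a sketch under assumptions: the class name GenusSplit and the method splitAfterGenus are made up, and the genus name is passed through Pattern.quote so that any regex metacharacters in it are treated literally.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GenusSplit {
    // Returns {prefix, rest}, e.g. {"100. ZEA", " L. - Maize ..."}, or null if no match.
    static String[] splitAfterGenus(String line, String genusName) {
        String genus = Pattern.quote(genusName.toUpperCase(Locale.ROOT));
        Pattern p = Pattern.compile("^(\\d+\\.\\s+" + genus + ")(.*)$");
        Matcher m = p.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = splitAfterGenus(
                "100. ZEA L. - Maize Annuals; male and female inflorescences separate", "Zea");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```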

More efficient way to make a string in a string of just words

I am making an application where I will be fetching tweets and storing them in a database. I will have a column for the complete text of the tweet and another where only the words of the tweet will remain (I need the words to calculate which words were most used later).
I currently do this with 6 different .replaceAll() calls, some of which may run more than once. For example, I have a for loop that removes every hashtag using replaceAll().
The problem is that I will be editing as many as thousands of tweets that I fetch every few minutes, and I don't think the way I am doing it will be efficient enough.
My requirements, in this order (also written in the comments below):
Delete all usernames mentioned
Delete all RT (retweets flags)
Delete all hashtags mentioned
Replace all break lines with spaces
Replace all double spaces with single spaces
Delete all special characters except spaces
Here is a Short and Compilable Example:
public class StringTest {
public static void main(String args[]) {
String text = "RT @AshStewart09: Vote for Lady Gaga for \"Best Fans\""
+ " at iHeart Awards\n"
+ "\n"
+ "RT!!\n"
+ "\n"
+ "My vote for #FanArmy goes to #LittleMonsters #iHeartAwards"
+ " htt…";
String[] hashtags = {"#FanArmy", "#LittleMonsters", "#iHeartAwards"};
System.out.println("Before: " + text + "\n");
// Delete all usernames mentioned (may run multiple times)
text = text.replaceAll("@AshStewart09", "");
System.out.println("First Phase: " + text + "\n");
// Delete all RT (retweets flags)
text = text.replaceAll("RT", "");
System.out.println("Second Phase: " + text + "\n");
// Delete all hashtags mentioned
for (String hashtag : hashtags) {
text = text.replaceAll(hashtag, "");
}
System.out.println("Third Phase: " + text + "\n");
// Replace all break lines with spaces
text = text.replaceAll("\n", " ");
System.out.println("Fourth Phase: " + text + "\n");
// Replace all double spaces with single spaces
text = text.replaceAll(" +", " ");
System.out.println("Fifth Phase: " + text + "\n");
// Delete all special characters except spaces
text = text.replaceAll("[^a-zA-Z0-9 ]+", "").trim();
System.out.println("Finaly: " + text);
}
}
Relying on replaceAll is probably the biggest performance killer as it compiles the regex again and again. The use of regexes for everything is probably the second most significant problem.
Assuming all usernames start with @, I'd replace
// Delete all usernames mentioned (may run multiple times)
text = text.replaceAll("@AshStewart09", "");
with a loop that copies everything until it finds an @, then checks whether the following characters match any of the listed usernames and, if so, skips them. For this lookup you could use a trie. A simpler method would be a replaceAll-like loop for the regex @\w+ together with a HashMap lookup.
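The simpler variant could look roughly like this. It is a sketch, not a tuned implementation: it assumes mentions have the form @ followed by word characters, uses a HashSet rather than a HashMap since only membership matters, and the names MentionFilter and removeKnownMentions are made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MentionFilter {
    // Compiled once and reused for every tweet.
    private static final Pattern MENTION = Pattern.compile("@\\w+");

    // Drops every @mention that is in 'known'; other mentions are copied unchanged.
    static String removeKnownMentions(String text, Set<String> known) {
        Matcher m = MENTION.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String mention = m.group();
            String replacement = known.contains(mention)
                    ? ""
                    : Matcher.quoteReplacement(mention);
            m.appendReplacement(out, replacement);
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("@AshStewart09"));
        System.out.println(removeKnownMentions("RT @AshStewart09: hi @other", known));
    }
}
```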
// Delete all RT (retweets flags)
text = text.replaceAll("RT", "");
Here,
private static final Pattern RT_PATTERN = Pattern.compile("RT");
is a sure win. All the following parts could be handled similarly. Instead of
// Delete all special characters except spaces
text = text.replaceAll("[^a-zA-Z0-9 ]+", "").trim();
you could use Guava's CharMatcher. The method removeFrom does exactly what you did, but collapseFrom or trimAndCollapseFrom might be better.
According to the now closed question, it all boils down to
tweet = tweet.replaceAll("@\\w+|#\\w+|\\bRT\\b", "")
.replaceAll("\n", " ")
.replaceAll("[^\\p{L}\\p{N} ]+", " ")
.replaceAll(" +", " ")
.trim();
The second line seems to be redundant, as the third one removes \n too. Changing the first line's replacement to " " doesn't change the outcome and allows the replacements to be merged.
tweet = tweet.replaceAll("@\\w*|#\\w*|\\bRT\\b|[^@#\\p{L}\\p{N} ]+", " ")
.replaceAll(" +", " ")
.trim();
I've changed the usernames and hashtags part so that it also eats a lone @ or #, which means the special-characters part doesn't need to consume them. This is necessary for correct processing of strings like !@AshStewart09.
For maximum performance, you need a precompiled pattern. I'd again suggest using Guava's CharMatcher for the second part. Guava is huge (2 MB, I guess), but you will surely find more useful things there. So in the end you can get
private static final Pattern PATTERN =
Pattern.compile("@\\w*|#\\w*|\\bRT\\b|[^@#\\p{L}\\p{N} ]+");
private static final CharMatcher CHAR_MATCHER = CharMatcher.is(' ');
tweet = PATTERN.matcher(tweet).replaceAll(" ");
tweet = CHAR_MATCHER.trimAndCollapseFrom(tweet, ' ');
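If you would rather not pull in Guava, the same two steps can be done with two precompiled patterns from the standard library. A minimal sketch, assuming usernames start with @ and hashtags with #; the class name TweetCleaner and the sample tweet are made up:

```java
import java.util.regex.Pattern;

class TweetCleaner {
    // Mentions, hashtags, standalone "RT", and runs of other non-letter/digit characters.
    private static final Pattern NOISE =
            Pattern.compile("@\\w*|#\\w*|\\bRT\\b|[^@#\\p{L}\\p{N} ]+");
    private static final Pattern SPACES = Pattern.compile(" +");

    static String clean(String tweet) {
        // Replace each piece of noise with a space, then collapse runs of spaces.
        String spaced = NOISE.matcher(tweet).replaceAll(" ");
        return SPACES.matcher(spaced).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String tweet = "RT @AshStewart09: Vote for Lady Gaga\n\nRT!!\n\n"
                + "My vote for #FanArmy goes to #LittleMonsters";
        System.out.println(clean(tweet)); // prints "Vote for Lady Gaga My vote for goes to"
    }
}
```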
You can fold everything that is replaced with nothing into one call to replaceAll, and everything that is replaced with a space into another, like so (also using a regex to find the hashtags and usernames, as this seems easier):
text = text.replaceAll("@\\w+|#\\w+|RT", "");
text = text.replaceAll("\n| +", " ");
text = text.replaceAll("[^a-zA-Z0-9 ]+", "").trim();

Regex best-practices

I'm just learning how to use regex's:
I'm reading in a text file that is split into sections of two different sorts, demarcated by
<:==]:> and <:==}:> . I need to know for each section whether it's a ] or } , so I can't just do
pattern.compile("<:==]:>|<:==}:>"); pattern.split(text)
Doing this:
pattern.compile("<:=="); pattern.split(text)
works, and then I can just look at the first char in each substring, but this seems sloppy to me, and I think I'm only resorting to it because I'm not fully grasping something I need to grasp about regex's:
What would be the best practice here? Also, is there any way to split a string while keeping the delimiter in the resulting strings, so that each one begins with the delimiter?
EDIT: the file is laid out like this:
Old McDonald had a farm
<:==}:>
EIEIO. And on that farm he had a cow
<:==]:>
And on that farm he....
It may be a better idea not to use split() for this. You could instead do a match:
List<String> delimList = new ArrayList<String>();
List<String> sectionList = new ArrayList<String>();
Pattern regex = Pattern.compile(
"(<:==[\\]}]:>) # Match a delimiter, capture it in group 1.\n" +
"( # Match and capture in group 2:\n" +
" (?: # the following group which matches...\n" +
" (?!<:==[\\]}]:>) # (unless we're at the start of another delimiter)\n" +
" . # any character\n" +
" )* # any number of times.\n" +
") # End of group 2",
Pattern.COMMENTS | Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
delimList.add(regexMatcher.group(1));
sectionList.add(regexMatcher.group(2));
}
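If you do want split(), a zero-width lookahead keeps the delimiter at the start of each piece, which also answers the second part of the question. A small sketch (the class name SectionSplit and the method name are made up):

```java
class SectionSplit {
    // Splits just before each "<:==]:>" or "<:==}:>" without consuming it,
    // so every piece after the first begins with its delimiter.
    static String[] splitKeepingDelimiters(String text) {
        return text.split("(?=<:==[\\]}]:>)");
    }

    public static void main(String[] args) {
        String text = "Old McDonald had a farm\n<:==}:>\nEIEIO\n<:==]:>\nAnd on that farm";
        for (String part : splitKeepingDelimiters(text)) {
            // The bracket type is the 5th character of each delimited section.
            System.out.println("[" + part.replace("\n", "\\n") + "]");
        }
    }
}
```

Any text before the first delimiter comes back as the first element, so check that a piece really starts with <:== before reading its bracket character.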
