Multiline non terminated regex - java

I came across a problem with regex parsing columns in ASCII tables.
Imagine an ASCII table like:
COL1  | COL2    | COL3
======================
ONE   | APPLE   | PIE
----------------------
TWO   | APPLE   | PIES
----------------------
THREE | PLUM-   | PIES
      | APRICOT |
For the first two entries, a trivial capturing regex does the job:
(?:(?<COL1>\w+)\s*\|\s*(?<COL2>\w+)\s*\|\s*(?<COL3>\w+)\s*)
However, this regex also captures the header, and it doesn't capture the third entry.
I can't solve the following two problems:
How to exclude the header?
How to extend the COL2 capture group to capture the multiline entry PLUM-APRICOT?
Thanks for your help!

Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems. (http://regex.info/blog/2006-09-15/247)
I've assumed an input string like:
String input = ""
+ "\n" + "COL1 | COL2 | COL3"
+ "\n" + "======================"
+ "\n" + "ONE | APPLE | PIE "
+ "\n" + "----------------------"
+ "\n" + "TWO | APPLE | PIES"
+ "\n" + "----------------------"
+ "\n" + "THREE | PLUM- | PIES"
+ "\n" + " | APRICOT | ";
To split the header and the table you can use input.split("={2,}"). This returns an array of strings of the header and the table.
After trimming the table you can use table.split("-{2,}") to get the rows of the table.
All rows can be converted to arrays of cells by using row.split("\\|").
Dealing with multiline rows: before converting a row to cells, call row.split("\n") to break it into its physical lines.
When this split returns more than one line, split each line on pipes (split("\\|")) and merge the resulting cells column by column.
From there it's just element manipulation to get the data into the format you want.
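The steps above could be wired together roughly as follows. This is only a sketch: the names AsciiTableSplit, parseTable and parseRow are made up for illustration, and it assumes that the header is not needed, that continuation lines are merged by plain concatenation, and that no cell contains two or more consecutive hyphens or equals signs.

```java
import java.util.ArrayList;
import java.util.List;

class AsciiTableSplit {

    // Turns one logical row (one or more physical lines) into merged cells.
    static List<String> parseRow(String row) {
        List<String> cells = new ArrayList<>();
        for (String line : row.trim().split("\n")) {
            // Limit -1 keeps trailing empty cells, e.g. the one after "APRICOT |".
            String[] parts = line.split("\\|", -1);
            for (int i = 0; i < parts.length; i++) {
                String part = parts[i].trim();
                if (i < cells.size()) {
                    cells.set(i, cells.get(i) + part); // continuation line: append
                } else {
                    cells.add(part);                   // otherwise start a new cell
                }
            }
        }
        return cells;
    }

    // Drops the header (everything before the "====" rule) and parses each row.
    static List<List<String>> parseTable(String input) {
        String body = input.split("={2,}")[1].trim();
        List<List<String>> rows = new ArrayList<>();
        for (String row : body.split("-{2,}")) {
            rows.add(parseRow(row));
        }
        return rows;
    }

    public static void main(String[] args) {
        String input = "COL1 | COL2 | COL3\n"
                + "======================\n"
                + "ONE | APPLE | PIE \n"
                + "----------------------\n"
                + "TWO | APPLE | PIES\n"
                + "----------------------\n"
                + "THREE | PLUM- | PIES\n"
                + " | APRICOT | ";
        // prints [[ONE, APPLE, PIE], [TWO, APPLE, PIES], [THREE, PLUM-APRICOT, PIES]]
        System.out.println(parseTable(input));
    }
}
```

Merging by position keeps the code short; if a continuation line could have a different number of pipes than the first line, the merge step would need more care.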

Related

How to extract members with regex

I have this string to parse and extract all elements between <>:
String text = "test user #myhashtag <#C5712|user_name_toto> <#U433|user_hola>";
I tried with this pattern, but it doesn't work (no result):
String pattern = "<#[C,U][0-9]+\\|[.]+>";
So in this example I want to extract:
<#C5712|user_name_toto>
<#U433|user_hola>
Then for each, I want to extract:
C or U element
ID (ie: 5712 or 433)
user name (ie: user_name_toto)
Thank you very much guys
The main problem I can see with your pattern is that it doesn't contain groups, hence retrieving parts of it will be impossible without further parsing.
You define numbered groups within parentheses: (partOfThePattern).
From Java 7 onwards, you can also define named groups as follows: (?<theName>partOfThePattern).
The second problem is that [.] corresponds to a literal dot, not an "any character" wildcard.
The third problem is your last quantifier, which is greedy: once [.] is replaced with a real wildcard, it would consume the whole rest of the string starting from the first username.
Here's a self-contained example fixing all that:
String text = "test user #myhashtag <#C5712|user_name_toto> <#U433|user_hola>";
//                           | starting <#
//                           | | group 1: any 1 char
//                           | |  | group 2: 1+ digits
//                           | |  |     | escaped "|"
//                           | |  |     |  | group 3: 1+ non-">" chars, greedy
//                           | |  |     |  |      | closing >
//                           | |  |     |  |      |
Pattern p = Pattern.compile("<#(.)(\\d+)\\|([^>]+)>");
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.printf(
        "C or U? %s%nUser ID: %s%nUsername: %s%n",
        m.group(1), m.group(2), m.group(3)
    );
}
Output
C or U? C
User ID: 5712
Username: user_name_toto
C or U? U
User ID: 433
Username: user_hola
Note
I'm not validating C vs U here; the (.) group simply gives you another example of the . wildcard.
You can easily replace the initial (.) with (C|U) if only those two letters can occur. ([CU]) is equivalent.
<#([CU])(\d+)\|(\w+)>
Where:
$1 --> C/U
$2 --> 5712/433
$3 --> user_name_toto/user_hola
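For completeness, here is a small sketch of the named-group syntax mentioned earlier, applied to the same input. The group names (type, id, name) and the class name MentionParser are my own choices, not something from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MentionParser {

    // Group names are illustrative; [CU] restricts the first character to C or U.
    static final Pattern MENTION =
            Pattern.compile("<#(?<type>[CU])(?<id>\\d+)\\|(?<name>[^>]+)>");

    // Returns one "type id name" line per match found in the text.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            sb.append(m.group("type")).append(' ')
              .append(m.group("id")).append(' ')
              .append(m.group("name")).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "test user #myhashtag <#C5712|user_name_toto> <#U433|user_hola>";
        System.out.print(describe(text));
    }
}
```

Named groups make the extraction code read better than m.group(1), m.group(2), and they keep working if another group is later inserted in front of them.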

How can I use a regex to match multiple address lines?

Given the following example text, can I use a regex to match each line of each address, and to add markers to know when one address finishes and the next begins? At present, I know how to match each entire address. I could then run a second regex to pick out the individual lines, but is it possible to achieve both these steps in one go?
Address:
Address 1 line 1,
Address 1 line 2,
Address 1 line 3
Address:
Address 2 line 1,
Address 2 line 2,
Address 2 line 3,
Address 2 line 4
Address:
Address 3 line 1,
Address 3 line 2
Here's a Pattern with the DOTALL flag on, so that . also matches line separators and a match can span multiple lines, using the "Address:" string as a delimiter:
// for test
String addresses = "Address:" + System.getProperty("line.separator")
+ "Address 1 line 1," + System.getProperty("line.separator")
+ "Address 1 line 2," + System.getProperty("line.separator")
+ "Address 1 line 3" + System.getProperty("line.separator")
+ "Address:" + System.getProperty("line.separator")
+ "Address 2 line 1," + System.getProperty("line.separator")
+ "Address 2 line 2," + System.getProperty("line.separator")
+ "Address 2 line 3";
//                           | look behind for "Address:"
//                           |            | any 1+ character,
//                           |            | reluctantly quantified
//                           |            |  | lookahead for "Address:"
//                           |            |  | or end of input
//                           |            |  |                | dot can mean
//                           |            |  |                | line separator
Pattern p = Pattern.compile("(?<=Address:).+?(?=Address:|$)", Pattern.DOTALL);
Matcher m = p.matcher(addresses);
// iterating matches within given string, and printing
while (m.find()) {
    System.out.printf("Found: %s%n%n", m.group());
}
Output
Found:
Address 1 line 1,
Address 1 line 2,
Address 1 line 3
Found:
Address 2 line 1,
Address 2 line 2,
Address 2 line 3
Note
In order to exclude the line separator after your "Address:" token from the match, you can use this refined pattern:
Pattern p = Pattern.compile("(?<=Address:"
+ System.getProperty("line.separator")+").+?(?=Address:"
+ System.getProperty("line.separator")+"|$)",
Pattern.DOTALL
);
If regex is what you want...
If you have a limited number of lines in an address (in your example 4), you could grab them with:
Address:\s*?(?:\n(.*),)?(?:\n(.*),)?(?:\n(.*),)?(?:\n(.*),)?(?:\n(.*))
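Here the text Address: marks the beginning of the block. As written, the pattern captures up to five lines, the first four of which are optional; drop one optional group for a strict maximum of four.
(You'll need the global flag; in Java, that means calling find() in a loop.)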
regex101 example.
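If the number of lines per address is not bounded, one alternative is to let the regex find each block and then split the block into lines in plain Java. The sketch below reuses the lookbehind/lookahead pattern from the first answer. The class and method names are hypothetical, and stripping the trailing comma from each line is my assumption about the desired result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressLines {

    // Same idea as the first answer: everything between two "Address:" tokens.
    static final Pattern BLOCK =
            Pattern.compile("(?<=Address:).+?(?=Address:|$)", Pattern.DOTALL);

    // Returns one list of lines per address; the outer list marks the boundaries.
    static List<List<String>> parse(String text) {
        List<List<String>> addresses = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            List<String> lines = new ArrayList<>();
            // \R matches any line-break sequence (Java 8+).
            for (String line : m.group().split("\\R")) {
                String cleaned = line.trim().replaceAll(",$", "");
                if (!cleaned.isEmpty()) {
                    lines.add(cleaned);
                }
            }
            addresses.add(lines);
        }
        return addresses;
    }

    public static void main(String[] args) {
        String text = "Address:\nAddress 1 line 1,\nAddress 1 line 2\n"
                + "Address:\nAddress 2 line 1";
        System.out.println(parse(text));
    }
}
```

This is technically two steps, but it happens in a single pass over the matches, and the nested lists tell you where one address ends and the next begins.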

Regex pattern to match certain url

I have a large text and I only want to use certain information from it. The text looks like this:
Some random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8
More random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_1_av.m3u8
More random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_2_av.m3u8
More random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_3_av.m3u8
I only want the http text. There are several of them in the text but I only need one of them. The regular expression should be "starts with http and ends with .m3u8".
I looked at the glossary of all the different expressions but it is very confusing to me. I tried "/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{12,30})([\/\w \.-]*)*\/?$/" as my pattern. But is that enough?
All help is appreciated. Thank you.
Assuming each line shown in your example ends with a line separator, here's a snippet that will work:
String text =
"Some random text here" +
System.getProperty("line.separator") +
"http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8" +
System.getProperty("line.separator") +
"More random text here" +
System.getProperty("line.separator") +
"http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8" +
System.getProperty("line.separator") +
// removed some for brevity
"More random text here" +
System.getProperty("line.separator") +
// added counter-example ending with "NOPE"
"http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.NOPE";
// Multi-line pattern:
//                           ┌ line starts with http
//                           |    ┌ any 1+ character reluctantly quantified
//                           |    |  ┌ dot escape
//                           |    |  |  ┌ ending text
//                           |    |  |  |   ┌ end of line marker
//                           |    |  |  |   |
Pattern p = Pattern.compile("^http.+?\\.m3u8$", Pattern.MULTILINE);
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println(m.group());
}
Output
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8
Edit
For a refined "filter" by the "index_x" file of the URL, you can simply add it in the Pattern between the protocol and ending of the line, e.g.:
Pattern.compile("^http.+?index_0.+?\\.m3u8$", Pattern.MULTILINE);
I didn't test it, but this should do the trick:
^(http:\/\/.*\.m3u8)
This is #capnibishop's answer with two small changes.
^(http://).*(/index_1)[^/]*\.m3u8$
Added the missing "$" sign at the end. This ensures it matches
http://something.m3u8
and not
http://something.m3u81
Added the condition to match index_1 in the last path segment, which means it will match:
http://something/index_1_something_else.m3u8
and not
http://something/index_1/something_else.m3u8
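Since the question only needs one of the URLs, a single find() call is enough. Here is a minimal sketch built on the pattern above. The helper name findPlaylist and its index parameter are hypothetical, and the underscore required after the index number is my assumption, added so that index_1 does not also match index_10.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaylistUrlFinder {

    // Returns the first line that is an http(s) URL whose last path segment
    // starts with "index_<index>_" and ends with ".m3u8", if there is one.
    static Optional<String> findPlaylist(String text, int index) {
        Pattern p = Pattern.compile(
                "^https?://.*/index_" + index + "_[^/]*\\.m3u8$",
                Pattern.MULTILINE);
        Matcher m = p.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String text = "Some random text here\n"
                + "http://example.invalid/a/index_0_av.m3u8\n"
                + "More random text here\n"
                + "http://example.invalid/a/index_1_av.m3u8\n";
        System.out.println(findPlaylist(text, 1).orElse("no match"));
    }
}
```

Without the DOTALL flag the dot never crosses a line break, so each match stays on a single line even though the input spans many.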

How to replace multiple words with space in a string using Java

I tried to replace a list of words in a given string with the following code.
String Sample = " he saw a cat running of that pat's mat ";
String regex = "'s | he | of | to | a | and | in | that";
Sample = Sample.replaceAll(regex, " ");
The output is
[ saw cat running that pat mat ]
// minus the []
It still has the last word "that". Is there any way to modify the regex so that it also handles the last word?
Try:
String Sample = " he saw a cat running of that pat's mat remove 's";
String resultString = Sample.replaceAll("\\b( ?'s|he|of|to|a|and|in|that)\\b", "");
System.out.print(resultString);
saw cat running pat mat remove
DEMO
http://ideone.com/Yitobz
The problem is that you have consecutive words that you are trying to replace.
For example, consider the substring
[ of that ]
While replaceAll is running, [ of ] matches
[ of that ]
 ^  ^
and is replaced with a single space. The next character to match is then t, not the space expected by
... | that | ...
What I think you can do to fix this is add word boundaries instead of spaces.
String regex = "'s\\b|\\bhe\\b|\\bof\\b|\\bto\\b|\\ba\\b|\\band\\b|\\bin\\b|\\bthat\\b";
or the shorter version as shown in Tuga's answer.
It doesn't work because the " of " part is replaced first and its trailing space is consumed by that match, so there is no longer a space before "that" for the regex to match.
You can fix the regex in two ways:
String regex = "'s | he | of| to | a | and | in | that";
or
String regex = "'s | he | of | to | a | and | in |that ";
Or simply call Sample = Sample.replaceAll(regex, " "); a second time.
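Putting the word-boundary idea into a small helper might look like the sketch below. The class and method names are made up, and collapsing the leftover whitespace into single spaces is my assumption about what the final string should look like.

```java
class StopWordRemover {

    // Word list taken from the question. "'s" gets no leading \b, so it matches
    // both directly after a word (pat's) and after a space.
    static final String STOP_WORDS = "'s\\b|\\b(?:he|of|to|a|and|in|that)\\b";

    static String strip(String sentence) {
        return sentence.replaceAll(STOP_WORDS, "") // remove the words themselves
                       .replaceAll("\\s+", " ")    // collapse the gaps left behind
                       .trim();
    }

    public static void main(String[] args) {
        // prints: saw cat running pat mat
        System.out.println(strip(" he saw a cat running of that pat's mat "));
    }
}
```

Because \b is zero-width, no match consumes the space that the next word needs, so consecutive words such as "of that" are both removed in one pass.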

How to remove everything between two outer chars?

I have the following part of string:
{{Infobox musical artist
|honorific-prefix = [[The Honourable]]
| name = Bob Marley
| image = Bob-Marley.jpg
| alt = Black and white image of Bob Marley on stage with a guitar
| caption = Bob Marley in concert, 1980.
| background = solo_singer
| birth_name = Robert Nesta Marley
| alias = Tuff Gong
| birth_date = {{birth date|df=yes|1945|2|6}}
| birth_place = [[Nine Mile, Jamaica|Nine Mile]], [[Jamaica]]
| death_date = {{death date and age|df=yes|1981|5|11|1945|2|6}}
| death_place = [[Miami]], [[Florida]]
| instrument = Vocals, guitar, percussion
| genre = [[Reggae]], [[ska]], [[rocksteady]]
| occupation = [[Singer-songwriter]], [[musician]], [[guitarist]]
| years_active = 1962–1981
| label = [[Beverley's]], [[Studio One (record label)|Studio One]],
| associated_acts = [[Bob Marley and the Wailers]]
| website = {{URL|bobmarley.com}}
}}
And I'd like to remove all of it. If I try the regex \{\{(.*?)\}\}, it catches {{birth date|df=yes|1945|2|6}}, which makes sense. So I tried \{\{([^\}]*?)\}\}, which then grabs from the start but ends on that same line, which also makes sense because it has encountered }} there. I've also tried without the reluctant ?, with the same results. My question is: how can I remove everything that's inside a {{ }}, no matter how many of the same characters are nested inside?
Edit: If you want my entire input, it's this:
https://en.wikipedia.org/w/index.php?maxlag=5&title=Bob+Marley&action=raw
Here's a solution with a DOTALL Pattern and a greedy quantifier for an input that contains only one instance of the fragment you wish to remove (i.e. replace with an empty String):
String input = "Foo {{Infobox musical artist\n"
+ "|honorific-prefix = [[The Honourable]]\n"
+ "| name = Bob Marley\n"
+ "| image = Bob-Marley.jpg\n"
+ "| alt = Black and white image of Bob Marley on stage with a guitar\n"
+ "| caption = Bob Marley in concert, 1980.\n"
+ "| background = solo_singer\n"
+ "| birth_name = Robert Nesta Marley\n"
+ "| alias = Tuff Gong\n"
+ "| birth_date = {{birth date|df=yes|1945|2|6}}\n"
+ "| birth_place = [[Nine Mile, Jamaica|Nine Mile]], [[Jamaica]]\n"
+ "| death_date = {{death date and age|df=yes|1981|5|11|1945|2|6}}\n"
+ "| death_place = [[Miami]], [[Florida]]\n"
+ "| instrument = Vocals, guitar, percussion\n"
+ "| genre = [[Reggae]], [[ska]], [[rocksteady]]\n"
+ "| occupation = [[Singer-songwriter]], [[musician]], [[guitarist]] \n"
+ "| years_active = 1962–1981\n"
+ "| label = [[Beverley's]], [[Studio One (record label)|Studio One]],\n"
+ "| associated_acts = [[Bob Marley and the Wailers]]\n"
+ "| website = {{URL|bobmarley.com}}\n" + "}} Bar";
//                                   | DOTALL flag
//                                   |   | first two curly brackets
//                                   |   |     | multi-line dot
//                                   |   |     | | last two curly brackets
//                                   |   |     | |        | replace with empty
System.out.println(input.replaceAll("(?s)\\{\\{.+\\}\\}", ""));
Output
Foo Bar
Notes after comments
This case amounts to using regular expressions to manipulate a markup language.
Regular expressions are not designed to parse hierarchical (nested) markup, so this answer is at best a stub for what would be an ugly workaround.
See here for a famous SO thread on parsing markup with regex.
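If arbitrary nesting has to be handled, a small depth counter is one way to avoid regex altogether. This is a sketch under assumptions rather than a wiki-markup parser: the names BraceStripper and stripTemplates are made up, every "{{" is assumed to be closed by a matching "}}", and triple-brace constructs are not treated specially.

```java
class BraceStripper {

    // Removes every {{ ... }} block, including blocks nested inside it.
    static String stripTemplates(String text) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // how many "{{" are currently open
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("{{", i)) {
                depth++;
                i += 2;
            } else if (depth > 0 && text.startsWith("}}", i)) {
                depth--;
                i += 2;
            } else {
                if (depth == 0) {
                    out.append(text.charAt(i)); // keep only text outside all blocks
                }
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "Foo  Bar  Baz" (two spaces where each block was removed)
        System.out.println(stripTemplates("Foo {{a {{b}} c}} Bar {{d}} Baz"));
    }
}
```

Unlike the greedy regex, this also works when the input contains several separate blocks, because each block ends as soon as its own depth returns to zero.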
Use a greedy quantifier instead of the reluctant one you're using.
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Edit: spoonfeeding: "(?s)\\{\\{.*\\}\\}"
Try this pattern, it should take care of everything:
"\\D\\{\\{I.+[\\P{M}\\p{M}*+].+\\}\\}\\D"
specify: DOTALL
code:
String result = searchText.replaceAll("\\D\\{\\{I.+[\\P{M}\\p{M}*+].+\\}\\}\\D", "");
example: http://fiddle.re/5n4zg
This regex matches a single such block (only):
\{\{([^{}]*?\{\{.*?\}\})*.*?\}\}
See a live demo.
In Java, to remove all such blocks:
str = str.replaceAll("(?s)\\{\\{([^{}]*?\\{\\{.*?\\}\\})*.*?\\}\\}", "");
