I have the following part of a string:
{{Infobox musical artist
|honorific-prefix = [[The Honourable]]
| name = Bob Marley
| image = Bob-Marley.jpg
| alt = Black and white image of Bob Marley on stage with a guitar
| caption = Bob Marley in concert, 1980.
| background = solo_singer
| birth_name = Robert Nesta Marley
| alias = Tuff Gong
| birth_date = {{birth date|df=yes|1945|2|6}}
| birth_place = [[Nine Mile, Jamaica|Nine Mile]], [[Jamaica]]
| death_date = {{death date and age|df=yes|1981|5|11|1945|2|6}}
| death_place = [[Miami]], [[Florida]]
| instrument = Vocals, guitar, percussion
| genre = [[Reggae]], [[ska]], [[rocksteady]]
| occupation = [[Singer-songwriter]], [[musician]], [[guitarist]]
| years_active = 1962–1981
| label = [[Beverley's]], [[Studio One (record label)|Studio One]],
| associated_acts = [[Bob Marley and the Wailers]]
| website = {{URL|bobmarley.com}}
}}
And I'd like to remove all of it. If I try the regex \{\{(.*?)\}\}, it catches {{birth date|df=yes|1945|2|6}}, which makes sense. So I tried \{\{([^\}]*?)\}\}, which then grabs from the start but ends on the same line, which also makes sense since it has encountered }}. I've also tried it without the ? (greedy), with the same results. My question is: how can I remove everything that's inside a {{}}, no matter how many of the same characters are inside?
Edit: If you want my entire input, it's this:
https://en.wikipedia.org/w/index.php?maxlag=5&title=Bob+Marley&action=raw
Here's a solution with a DOTALL Pattern and a greedy quantifier for an input that contains only one instance of the fragment you wish to remove (i.e. replace with an empty String):
String input = "Foo {{Infobox musical artist\n"
+ "|honorific-prefix = [[The Honourable]]\n"
+ "| name = Bob Marley\n"
+ "| image = Bob-Marley.jpg\n"
+ "| alt = Black and white image of Bob Marley on stage with a guitar\n"
+ "| caption = Bob Marley in concert, 1980.\n"
+ "| background = solo_singer\n"
+ "| birth_name = Robert Nesta Marley\n"
+ "| alias = Tuff Gong\n"
+ "| birth_date = {{birth date|df=yes|1945|2|6}}\n"
+ "| birth_place = [[Nine Mile, Jamaica|Nine Mile]], [[Jamaica]]\n"
+ "| death_date = {{death date and age|df=yes|1981|5|11|1945|2|6}}\n"
+ "| death_place = [[Miami]], [[Florida]]\n"
+ "| instrument = Vocals, guitar, percussion\n"
+ "| genre = [[Reggae]], [[ska]], [[rocksteady]]\n"
+ "| occupation = [[Singer-songwriter]], [[musician]], [[guitarist]] \n"
+ "| years_active = 1962–1981\n"
+ "| label = [[Beverley's]], [[Studio One (record label)|Studio One]],\n"
+ "| associated_acts = [[Bob Marley and the Wailers]]\n"
+ "| website = {{URL|bobmarley.com}}\n" + "}} Bar";
// (?s)      DOTALL flag
// \\{\\{    first two curly brackets
// .+        multi-line dot
// \\}\\}    last two curly brackets
// ""        replace with empty
System.out.println(input.replaceAll("(?s)\\{\\{.+\\}\\}", ""));
Output
Foo  Bar
Notes after comments
This case implies using regular expressions to manipulate markup language.
Regular expressions are not made to parse hierarchical markup entities and would not serve well here, so this answer is only a stub for what would be an ugly workaround at best.
See here for a famous SO thread on parsing markup with regex.
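As a hedged alternative to the regex, nested blocks can be handled by scanning the text and tracking how deep inside {{ }} the current position is. The sketch below is not from the original answer; the class and method names (TemplateStripper, stripTemplates) are made up for illustration, and it assumes the braces are balanced and that triple braces such as {{{x}}} do not occur.

```java
final class TemplateStripper {

    /** Removes every {{...}} block, including nested ones, by tracking brace depth. */
    static String stripTemplates(String text) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // number of currently open "{{"
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("{{", i)) {
                depth++;
                i += 2;
            } else if (depth > 0 && text.startsWith("}}", i)) {
                depth--;
                i += 2;
            } else {
                // keep only the characters that are outside every block
                if (depth == 0) {
                    out.append(text.charAt(i));
                }
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "Foo {{Infobox musical artist\n"
                + "| birth_date = {{birth date|df=yes|1945|2|6}}\n"
                + "| website = {{URL|bobmarley.com}}\n"
                + "}} Bar {{another block}} Baz";
        System.out.println(stripTemplates(input));
    }
}
```

Unlike the greedy pattern, this keeps the text between two separate blocks (" Bar " in the example), because a block ends as soon as its own closing braces are reached.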
Use a greedy quantifier instead of the reluctant one you're using.
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Edit: spoonfeeding: "\{\{.*\}\}"
Try this pattern, it should take care of everything:
"\\D\\{\\{I.+[\\P{M}\\p{M}*+].+\\}\\}\\D"
specify: DOTALL
code:
String result = searchText.replaceAll("\\D\\{\\{I.+[\\P{M}\\p{M}*+].+\\}\\}\\D", "");
example: http://fiddle.re/5n4zg
This regex matches a single such block (only):
\{\{([^{}]*?\{\{.*?\}\})*.*?\}\}
See a live demo.
In Java, to remove all such blocks:
str = str.replaceAll("(?s)\\{\\{([^{}]*?\\{\\{.*?\\}\\})*.*?\\}\\}", "");
I came across a problem with regex parsing columns in ASCII tables.
Imagine an ASCII table like:
COL1  | COL2    | COL3
======================
ONE   | APPLE   | PIE
----------------------
TWO   | APPLE   | PIES
----------------------
THREE | PLUM-   | PIES
      | APRICOT |
For the first two entries, a trivial capturing regex does the job:
(?:(?<COL1>\w+)\s*\|\s*(?<COL2>\w+)\s*\|\s*(?<COL3>\w+)\s*)
However, this regex also captures the header, and it doesn't capture the third row.
I can't solve the following two problems:
How to exclude the header?
How to extend the COL2 capture group to capture the multiline entry PLUM-APRICOT?
Thanks for your help!
Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems. (http://regex.info/blog/2006-09-15/247)
I've assumed an input string like:
String input = ""
+ "\n" + "COL1 | COL2 | COL3"
+ "\n" + "======================"
+ "\n" + "ONE | APPLE | PIE "
+ "\n" + "----------------------"
+ "\n" + "TWO | APPLE | PIES"
+ "\n" + "----------------------"
+ "\n" + "THREE | PLUM- | PIES"
+ "\n" + " | APRICOT | ";
To split the header and the table you can use input.split("={2,}"). This returns an array of strings of the header and the table.
After trimming the table you can use table.split("-{2,}") to get the rows of the table.
All rows can be converted to arrays of cells by using row.split("\\|").
Dealing with multiline rows: before converting the rows to cells, you can call row.split("\n") to split multiline rows into their lines.
When this split operation returns an array with more than one element, each line should be split on pipes (split("\\|")) and the resulting cells merged column by column.
From here it's just element manipulation to get it into the format you desire.
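The split-based steps above could be put together roughly as follows. This is only a sketch: the class and method names (AsciiTableParser, parseRows) are hypothetical, and it assumes the input contains exactly one "====" header separator, "\n" line endings, and no cell containing two or more consecutive hyphens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AsciiTableParser {

    /** Splits the table body into rows, merging the cells of multi-line rows. */
    static List<List<String>> parseRows(String input) {
        // parts[0] is the header, parts[1] is the table body
        String[] parts = input.split("={2,}");
        List<List<String>> rows = new ArrayList<>();
        for (String row : parts[1].trim().split("-{2,}")) {
            String[] merged = null;
            for (String line : row.trim().split("\n")) {
                // limit -1 keeps trailing empty cells, e.g. after "APRICOT |"
                String[] cells = line.split("\\|", -1);
                if (merged == null) {
                    merged = new String[cells.length];
                    Arrays.fill(merged, "");
                }
                // append each cell to the cell in the same column
                for (int i = 0; i < cells.length && i < merged.length; i++) {
                    merged[i] += cells[i].trim();
                }
            }
            rows.add(Arrays.asList(merged));
        }
        return rows;
    }

    public static void main(String[] args) {
        String input = ""
                + "\n" + "COL1  | COL2    | COL3"
                + "\n" + "======================"
                + "\n" + "ONE   | APPLE   | PIE "
                + "\n" + "----------------------"
                + "\n" + "TWO   | APPLE   | PIES"
                + "\n" + "----------------------"
                + "\n" + "THREE | PLUM-   | PIES"
                + "\n" + "      | APRICOT | ";
        // prints [[ONE, APPLE, PIE], [TWO, APPLE, PIES], [THREE, PLUM-APRICOT, PIES]]
        System.out.println(parseRows(input));
    }
}
```

Because the header is dropped by the first split and continuation lines are merged by column index, this covers both problems from the question without a single large regex.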
I'm looking for a Regex pattern that matches the following, but I'm kind of stumped so far. I'm not sure how to grab the results of the two groups I want, marked by id, and attr.
Should match:
account[id].attr
account[anotherid].anotherattr
These should respectively return id, attr,
and anotherid, anotherattr
Any tips?
Here's a complete solution mapping your id -> attributes:
String[] input = {
    "account[id].attr",
    "account[anotherid].anotherattr"
};
// account    literal "account"
// \\[        escaped "["
// (.+)       group 1: one or more of any character
// \\]        escaped "]"
// \\.        escaped "."
// (.+)       group 2: one or more of any character
Pattern p = Pattern.compile("account\\[(.+)\\]\\.(.+)");
Map<String, String> output = new LinkedHashMap<String, String>();
// iterating over input Strings
for (String s : input) {
    // matching
    Matcher m = p.matcher(s);
    // finding only once per input String. Change to a while-loop if multiple instances
    // within single input
    if (m.find()) {
        // referencing group 1 and 2 as key -> value
        output.put(m.group(1), m.group(2));
    }
}
System.out.println(output);
Output
{id=attr, anotherid=anotherattr}
Note
In this implementation, "incomplete" inputs such as "account[anotherid]." will not be put in the Map as they don't match the Pattern at all.
In order to have these cases put as id -> null, you only need to add a ? at the end of the Pattern.
That will make the last group optional.
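To illustrate the note, here is a small sketch of the variant with the trailing ?. The class and method names (OptionalAttrDemo, parse) are invented for the example; the pattern is the one above with ? appended.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OptionalAttrDemo {

    // same pattern as above, with "?" making group 2 optional
    private static final Pattern P = Pattern.compile("account\\[(.+)\\]\\.(.+)?");

    static Map<String, String> parse(String... inputs) {
        Map<String, String> output = new LinkedHashMap<>();
        for (String s : inputs) {
            Matcher m = P.matcher(s);
            if (m.find()) {
                // group(2) is null when the optional group did not take part in the match
                output.put(m.group(1), m.group(2));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // prints {id=attr, anotherid=null}
        System.out.println(parse("account[id].attr", "account[anotherid]."));
    }
}
```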
What is the simplest, most succinct way to extract 2 integers from a String when I know the format will always be ${INT1...INT2}? E.g. "Hello ${123...456}" would extract 123 and 456.
I would go with a Pattern with groups and back-references.
Here's an example:
String input = "Hello ${123...456}, bye ${789...101112}";
// \\$       escaped "$"
// \\{       escaped "{"
// (\\d+)    first group (any number of digits)
// \\.{3}    3 escaped dots
// (\\d+)    second group (same as 1st)
// \\}       escaped "}"
Pattern p = Pattern.compile("\\$\\{(\\d+)\\.{3}(\\d+)\\}");
Matcher m = p.matcher(input);
// iterating over matcher's find for multiple matches
while (m.find()) {
    System.out.println("Found...");
    System.out.println("\t" + m.group(1));
    System.out.println("\t" + m.group(2));
}
Output
Found...
123
456
Found...
789
101112
final String string = "${123...456}";
// text between "${" and "..."
final String firstPart = string.substring(string.indexOf("${") + "${".length(), string.indexOf("..."));
// text between "..." and "}"
final String secondPart = string.substring(string.indexOf("...") + "...".length(), string.indexOf("}"));
// parse each part on its own to get the two integers (123 and 456)
final int first = Integer.parseInt(firstPart);
final int second = Integer.parseInt(secondPart);
I tried to remove a list of words from a given string with the following code.
String Sample = " he saw a cat running of that pat's mat ";
String regex = "'s | he | of | to | a | and | in | that";
Sample = Sample.replaceAll(regex, " ");
The output is
[ saw cat running that pat mat ]
// minus the []
It still contains the last word in the list, "that". Is there any way to modify the regex so that the last word is also removed?
Try:
String Sample = " he saw a cat running of that pat's mat remove 's";
String resultString = Sample.replaceAll("\\b( ?'s|he|of|to|a|and|in|that)\\b", "");
System.out.print(resultString);
saw cat running pat mat remove
DEMO
http://ideone.com/Yitobz
The problem is that you have consecutive words that you are trying to replace.
For example, consider the substring
[ of that ]
While replaceAll is running, the [ of ] alternative matches, including the spaces on both sides:
[ of that ]
 ^  ^
and that match will be replaced with a single space. The next character to match is then t, not the leading space expected by
... | that | ...
What I think you can do to fix this is use word boundaries instead of spaces.
String regex = "'s\\b|\\bhe\\b|\\bof\\b|\\bto\\b|\\ba\\b|\\band\\b|\\bin\\b|\\bthat\\b";
or the shorter version as shown in Tuga's answer.
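The difference between the two approaches can be seen side by side in a small sketch. The class and method names here (StopWordDemo, withSpaces, withBoundaries) are only illustrative; the second regex is a grouped form of the word-boundary pattern above, and the extra replaceAll that collapses repeated spaces is an addition for readable output.

```java
final class StopWordDemo {

    // original regex: every alternative consumes its surrounding spaces
    static final String SPACES = "'s | he | of | to | a | and | in | that";
    // word-boundary regex: \b matches a position and consumes no characters
    static final String BOUNDARIES = "'s\\b|\\b(?:he|of|to|a|and|in|that)\\b";

    static String withSpaces(String text) {
        return text.replaceAll(SPACES, " ");
    }

    static String withBoundaries(String text) {
        // remove the words, then collapse the runs of spaces left behind
        return text.replaceAll(BOUNDARIES, "").replaceAll(" {2,}", " ");
    }

    public static void main(String[] args) {
        String sample = " he saw a cat running of that pat's mat ";
        // "that" survives: its leading space was consumed by the " of " match
        System.out.println("[" + withSpaces(sample) + "]");
        // "that" is removed: the boundaries do not consume the spaces
        System.out.println("[" + withBoundaries(sample) + "]");
    }
}
```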
It doesn't work because the " of " match consumes the space after it, so there is no space left before "that" for the " that" alternative to match.
You can fix it in one of three ways:
String regex = "'s | he | of| to | a | and | in | that";
or
String regex = "'s | he | of | to | a | and | in |that ";
or just call Sample = Sample.replaceAll(regex, " "); again.
My text will look like this
| birth_date = {{birth date|1925|09|2|df=y}}
| birth_place = [[Bristol]], [[England]], UK
| death_date = {{death date and age|2000|11|16|1925|09|02|df=y}}
| death_place = [[Eastbourne]], [[Sussex]], England, UK
| origin =
| instrument = [[Piano]]
| genre =
| occupation = [[Musician]]
I would like to get everything that is inside of [[ ]]. I tried to use replace all to replace everything that is not inside the [[ ]] and then use split by new line to get a list of text with [[ ]].
input = input.replaceAll("^[\\[\\[(.+)\\]\\]]", "");
Required output:
[[Bristol]]
[[England]]
[[Eastbourne]]
[[Sussex]]
[[Piano]]
[[Musician]]
But this is not giving the desired output. What am I missing here? There are thousands of documents; is this the fastest way to do it? If not, please tell me the best way to get the desired output.
You need to match it, not replace it:
Matcher m = Pattern.compile("\\[\\[\\w+\\]\\]").matcher(input);
while (m.find()) {
    m.group(); // result
}
Use Matcher.find. For example:
import java.util.regex.*;
...
String text =
"| birth_date = {{birth date|1925|09|2|df=y}}\n" +
"| birth_place = [[Bristol]], [[England]], UK\n" +
"| death_date = {{death date and age|2000|11|16|1925|09|02|df=y}}\n" +
"| death_place = [[Eastbourne]], [[Sussex]], England, UK\n" +
"| origin = \n" +
"| instrument = [[Piano]]\n" +
"| genre = \n" +
"| occupation = [[Musician]]\n";
Pattern pattern = Pattern.compile("\\[\\[.+?\\]\\]");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println(matcher.group());
}
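Since the question mentions thousands of documents, one reasonable arrangement (a sketch, not a benchmarked recommendation) is to compile the pattern once and reuse it for every document. The class and method names (WikiLinkExtractor, extractLinks) are hypothetical; the pattern is the one used above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WikiLinkExtractor {

    // compiled once and reused for every document
    private static final Pattern LINK = Pattern.compile("\\[\\[.+?\\]\\]");

    /** Returns every [[...]] occurrence in the text, in order of appearance. */
    static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher matcher = LINK.matcher(text);
        while (matcher.find()) {
            links.add(matcher.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String text =
                "| birth_place = [[Bristol]], [[England]], UK\n" +
                "| origin = \n" +
                "| instrument = [[Piano]]\n";
        // prints one link per line: [[Bristol]], [[England]], [[Piano]]
        for (String link : extractLinks(text)) {
            System.out.println(link);
        }
    }
}
```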
Just for fun, using replaceAll:
String output = input.replaceAll("(?s)(\\]\\]|^).*?(\\[\\[|$)", "$1\n$2");