My text will look like this
| birth_date = {{birth date|1925|09|2|df=y}}
| birth_place = [[Bristol]], [[England]], UK
| death_date = {{death date and age|2000|11|16|1925|09|02|df=y}}
| death_place = [[Eastbourne]], [[Sussex]], England, UK
| origin =
| instrument = [[Piano]]
| genre =
| occupation = [[Musician]]
I would like to get everything that is inside [[ ]]. I tried using replaceAll to remove everything that is not inside [[ ]], and then splitting on newlines to get a list of the [[ ]] items:
input = input.replaceAll("^[\\[\\[(.+)\\]\\]]", "");
Required output:
[[Bristol]]
[[England]]
[[Eastbourne]]
[[Sussex]]
[[Piano]]
[[Musician]]
But this does not give the desired output. What am I missing here? There are thousands of documents; is this the fastest way to do it? If not, please tell me the best way to get the desired output.
You need to match it, not replace it:
Matcher m=Pattern.compile("\\[\\[\\w+\\]\\]").matcher(input);
while(m.find())
{
m.group();//result
}
Use Matcher.find. For example:
import java.util.regex.*;
...
String text =
"| birth_date = {{birth date|1925|09|2|df=y}}\n" +
"| birth_place = [[Bristol]], [[England]], UK\n" +
"| death_date = {{death date and age|2000|11|16|1925|09|02|df=y}}\n" +
"| death_place = [[Eastbourne]], [[Sussex]], England, UK\n" +
"| origin = \n" +
"| instrument = [[Piano]]\n" +
"| genre = \n" +
"| occupation = [[Musician]]\n";
Pattern pattern = Pattern.compile("\\[\\[.+?\\]\\]");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group());
}
Just for fun, using replaceAll:
String output = input.replaceAll("(?s)(\\]\\]|^).*?(\\[\\[|$)", "$1\n$2");
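On the speed question: with thousands of documents it probably helps to compile the Pattern once and reuse it, rather than recompiling it for every document. A minimal sketch of that idea, assuming a hypothetical helper class LinkExtractor with a method extractLinks (neither name comes from the question):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Compiled once and shared; Pattern instances are immutable and thread-safe.
    private static final Pattern LINK = Pattern.compile("\\[\\[.+?\\]\\]");

    // Returns every [[...]] occurrence in the text, in order of appearance.
    // (Hypothetical helper, not part of the original answers.)
    static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "| birth_place = [[Bristol]], [[England]], UK\n"
                + "| origin =\n"
                + "| instrument = [[Piano]]\n";
        for (String link : extractLinks(text)) {
            System.out.println(link);
        }
    }
}
```

A Matcher, unlike a Pattern, is not thread-safe, so each call creates its own.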
String s = "My cake should have ( sixteen | sixten | six teen ) candles, I love and ( should be | would be ) puff them.";
Final changed string:
My cake should have <div><p id="1">sixteen</p><p id="2">sixten</p><p id="3">six teen</p></div> candles, I love and <div><p id="1">should be</p><p id="2">would be</p></div> puff them.
What I tried is this:
Pattern pattern = Pattern.compile("\\(\\s*(.*?)(?=\\s*\\))");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println(matcher.group(1));
}
You can match strings between parentheses and then split the texts inside with a pipe and build the replacement dynamically using Matcher.appendReplacement:
String s = "My cake should have ( sixteen | sixten | six teen ) candles, I love and ( should be | would be ) puff them.";
String rx = "\\(([^()]*)\\)";
StringBuffer result = new StringBuffer();
Matcher m = Pattern.compile(rx).matcher(s);
while (m.find()) {
String add = "";
String[] items = m.group(1).split("\\|");
for (int i=1; i<=items.length; i++) {
add += "<p id=\"" + i + "\">" + items[i-1].trim() + "</p>";
}
m.appendReplacement(result, "<div>"+add+"</div>");
}
m.appendTail(result);
System.out.println(result.toString());
See the Java online demo. Output:
My cake should have <div><p id="1">sixteen</p><p id="2">sixten</p><p id="3">six teen</p></div> candles, I love and <div><p id="1">should be</p><p id="2">would be</p></div> puff them.
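One caveat with appendReplacement: the replacement string treats $ and \ specially, so an item such as "$5" would be read as a group reference and fail. A hedged variant of the code above that wraps the built string in Matcher.quoteReplacement; the class name OptionMarkup and the method toMarkup are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionMarkup {
    // Text between a pair of parentheses that contains no nested parentheses.
    private static final Pattern GROUP = Pattern.compile("\\(([^()]*)\\)");

    // Hypothetical helper: turns "( a | b )" into <div><p id="1">a</p><p id="2">b</p></div>.
    static String toMarkup(String s) {
        StringBuffer result = new StringBuffer();
        Matcher m = GROUP.matcher(s);
        while (m.find()) {
            StringBuilder add = new StringBuilder("<div>");
            String[] items = m.group(1).split("\\|");
            for (int i = 0; i < items.length; i++) {
                add.append("<p id=\"").append(i + 1).append("\">")
                   .append(items[i].trim()).append("</p>");
            }
            add.append("</div>");
            // quoteReplacement escapes '$' and '\' so they are inserted literally.
            m.appendReplacement(result, Matcher.quoteReplacement(add.toString()));
        }
        m.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(toMarkup("Pay ( $5 | $10 ) now."));
    }
}
```

Building the fragment with a StringBuilder also avoids repeated String concatenation inside the loop.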
I am trying to parse a string to retrieve the home and away teams, and also the result.
So the strings can be something like this:
Football: Real Madrid 2-1 FC Barcelona
Football: Atletico de Madrid 4-2 Real Madrid
That is, the home team name, then the result as {homeTeamGoals}-{awayTeamGoals}, and then the away team name.
I want to use regexp to parse the string and retrieve the team names and result. I thought of having something like this:
String PATTERN_SPORT = "([a-zA-Z]+ ?[0-9]?)";
String PATTERN_NAME = "(.*)";
String PATTERN_RESULT = "([0-9]*)-([0-9]*)";
Pattern PATTERN_SPORT_AND_HOME_TEAM_RESULT_AWAY_TEAM = Pattern.compile("^" + PATTERN_SPORT + ": " + PATTERN_NAME + " " + PATTERN_RESULT + " ?"
+ PATTERN_NAME + "?$");
But it does not match, and I don't know why, since I used (.*) for the name pattern. Any clue?
I would use the following regex: (\w*:)\s?(.*)\s?(\d{1,2}-\d{1,2})\s?(.*)
group 1 (\w*:) will match the sport and the colon (to capture only the sport without the colon, use (\w*): instead)
group 2 (.*) first team name
group 3 (\d{1,2}-\d{1,2}) this will take any score (0-0 to 99-99)
group 4 (.*) second team name
The \s? parts only consume optional whitespace between the groups and can be ignored.
This will work only for your format (for other formats the regex can be adjusted).
Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class Test {
public static void main(String [] args){
String s = "Football: Hannover 96 3-3 1.FC Nuernberg";
String PATTERN_SPORT = "(\\w*:)";
String PATTERN_NAME = "(.*)";
String PATTERN_RESULT = "(\\d{1,2}-\\d{1,2})";
Pattern PATTERN_RESULTS= Pattern.compile("^" + PATTERN_SPORT + "\\s?" + PATTERN_NAME + "\\s?" + PATTERN_RESULT + "\\s?" + PATTERN_NAME + "$", Pattern.UNICODE_CHARACTER_CLASS);
Matcher matcher = PATTERN_RESULTS.matcher(s);
if (matcher.matches()){
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
System.out.println(matcher.group(3));
System.out.println(matcher.group(4));
}
}
}
You can paste the code here and test it.
Output:
Football:
Hannover 96
3-3
1.FC Nuernberg
You need to make sure you match all Unicode whitespace (the first space after : is a non-breaking space). Replacing all literal spaces with \s and compiling with the Pattern.UNICODE_CHARACTER_CLASS option will solve the issue:
String PATTERN_SPORT = "([a-zA-Z]+\\s?[0-9]?)";
String PATTERN_NAME = "(.*)";
String PATTERN_RESULT = "([0-9]*)-([0-9]*)";
Pattern PATTERN_SPORT_AND_HOME_TEAM_RESULT_AWAY_TEAM = Pattern.compile("^" + PATTERN_SPORT + ":\\s" + PATTERN_NAME + "\\s" + PATTERN_RESULT + "\\s?"
+ PATTERN_NAME + "$", Pattern.UNICODE_CHARACTER_CLASS);
Java demo:
String s = "Football: Real Madrid 2-1 FC Barcelona";
String PATTERN_SPORT = "([a-zA-Z]+\\s?[0-9]?)";
String PATTERN_NAME = "(.*)";
String PATTERN_RESULT = "([0-9]*)-([0-9]*)";
Pattern PATTERN_SPORT_AND_HOME_TEAM_RESULT_AWAY_TEAM = Pattern.compile("^" + PATTERN_SPORT + ":\\s" + PATTERN_NAME + "\\s" + PATTERN_RESULT + "\\s?" + PATTERN_NAME + "$", Pattern.UNICODE_CHARACTER_CLASS);
Matcher matcher = PATTERN_SPORT_AND_HOME_TEAM_RESULT_AWAY_TEAM.matcher(s);
if (matcher.matches()){
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
System.out.println(matcher.group(3));
System.out.println(matcher.group(4));
System.out.println(matcher.group(5));
}
Output:
Football
Real Madrid
2
1
FC Barcelona
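The non-breaking space explanation can be checked in isolation. The sketch below (the class name NbspCheck is just illustrative) tests a single U+00A0 character against a literal space, the default \s, and \s compiled with Pattern.UNICODE_CHARACTER_CLASS:

```java
import java.util.regex.Pattern;

class NbspCheck {
    static final String NBSP = "\u00A0"; // non-breaking space

    // A literal ASCII space in the pattern does not match U+00A0.
    static boolean plainSpaceMatches() {
        return Pattern.compile(" ").matcher(NBSP).matches();
    }

    // By default \s only covers ASCII whitespace: [ \t\n\x0B\f\r].
    static boolean defaultWhitespaceMatches() {
        return Pattern.compile("\\s").matcher(NBSP).matches();
    }

    // With UNICODE_CHARACTER_CLASS, \s follows the Unicode White_Space property.
    static boolean unicodeWhitespaceMatches() {
        return Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS).matcher(NBSP).matches();
    }

    public static void main(String[] args) {
        System.out.println(plainSpaceMatches());        // false
        System.out.println(defaultWhitespaceMatches()); // false
        System.out.println(unicodeWhitespaceMatches()); // true
    }
}
```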
You can try this pattern: (?<=: )(?P<home_team>[\w ]+) (?P<result>\d{1,2}-\d{1,2}) (?P<away_team>[\w ]+). Note that this uses the Python/PCRE (?P<name>...) syntax; Java writes named groups as (?<name>...) and only allows letters and digits in group names.
You might want to use a different lookbehind, (?<=Football: ), to parse only football results.
I also assumed that one team won't score 100 goals or more :) \d{1,2} will match scores in the range 0-99.
Demo
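For Java, the same idea might look like the sketch below. The group names homeTeam and awayTeam are renamed from the pattern above because Java group names cannot contain underscores; the class ScoreParser and the method parse are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScoreParser {
    // Java flavour of the pattern: (?<name>...) instead of (?P<name>...).
    private static final Pattern SCORE = Pattern.compile(
            "(?<=: )(?<homeTeam>[\\w ]+) (?<result>\\d{1,2}-\\d{1,2}) (?<awayTeam>[\\w ]+)");

    // Returns {homeTeam, result, awayTeam}, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = SCORE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("homeTeam"), m.group("result"), m.group("awayTeam") };
    }

    public static void main(String[] args) {
        String[] parts = parse("Football: Real Madrid 2-1 FC Barcelona");
        System.out.println(String.join(" / ", parts));
    }
}
```

Because the team names are matched with [\w ]+, a name containing other characters, such as the dot in "1.FC Nuernberg", would not be captured in full by this pattern.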
What is the simplest, most succinct way to extract 2 integers from a String when I know the format will always be ${INT1...INT2}? For example, "Hello ${123...456}" would yield 123 and 456.
I would go with a Pattern with capturing groups.
Here's an example:
String input = "Hello ${123...456}, bye ${789...101112}";
// Pattern parts, left to right:
//   \\$       escaped "$"
//   \\{       escaped "{"
//   (\\d+)    first group (one or more digits)
//   \\.{3}    3 escaped dots
//   (\\d+)    second group (same as 1st)
//   \\}       escaped "}"
Pattern p = Pattern.compile("\\$\\{(\\d+)\\.{3}(\\d+)\\}");
Matcher m = p.matcher(input);
// iterating over matcher's find for multiple matches
while (m.find()) {
System.out.println("Found...");
System.out.println("\t" + m.group(1));
System.out.println("\t" + m.group(2));
}
Output
Found...
123
456
Found...
789
101112
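Alternatively, without regex, using indexOf and substring:
final String string = "${123...456}";
final String firstPart = string.substring(string.indexOf("${") + "${".length(), string.indexOf("..."));
final String secondPart = string.substring(string.indexOf("...") + "...".length(), string.indexOf("}"));
// Parse each part separately; concatenating them first would yield the single number 123456.
final int first = Integer.parseInt(firstPart);
final int second = Integer.parseInt(secondPart);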
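To collect every ${INT1...INT2} pair in a string as numbers, the regex approach can be wrapped in a helper that converts each captured group with Integer.parseInt. A small sketch; RangeParser and parseRanges are illustrative names, and it assumes each value fits in an int:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RangeParser {
    private static final Pattern RANGE = Pattern.compile("\\$\\{(\\d+)\\.{3}(\\d+)\\}");

    // Returns one {first, second} pair per ${INT1...INT2} occurrence.
    // Integer.parseInt throws NumberFormatException if a value overflows int.
    static List<int[]> parseRanges(String input) {
        List<int[]> ranges = new ArrayList<>();
        Matcher m = RANGE.matcher(input);
        while (m.find()) {
            ranges.add(new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) });
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (int[] r : parseRanges("Hello ${123...456}, bye ${789...101112}")) {
            System.out.println(r[0] + "," + r[1]);
        }
    }
}
```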
I tried to replace a list of words in a given string with the following code.
String Sample = " he saw a cat running of that pat's mat ";
String regex = "'s | he | of | to | a | and | in | that";
Sample = Sample.replaceAll(regex, " ");
The output is
[ saw cat running that pat mat ]
// minus the []
It still contains "that", the last word in the list. Is there any way to modify the regex so that the last word is handled as well?
Try:
String Sample = " he saw a cat running of that pat's mat remove 's";
String resultString = Sample.replaceAll("\\b( ?'s|he|of|to|a|and|in|that)\\b", "");
System.out.print(resultString);
Output:
saw cat running pat mat remove
DEMO
http://ideone.com/Yitobz
The problem is that you have consecutive words that you are trying to replace.
For example, consider the substring
[ of that ]
While replaceAll is running, the alternative [ of ] matches
[ of that ]
 ^  ^
and is replaced with a single space. Matching then resumes at the t of "that", not at the leading space expected by
... | that | ...
What I think you can do to fix this is add word boundaries instead of spaces.
String regex = "'s\\b|\\bhe\\b|\\bof\\b|\\bto\\b|\\ba\\b|\\band\\b|\\bin\\b|\\bthat\\b";
or the shorter version as shown in Tuga's answer.
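The word-boundary approach can also be built from a word list, with a second pass that collapses the leftover spaces. This is only a sketch under the assumption that the stop words are plain words; WordRemover and removeWords are hypothetical names:

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;
import java.util.regex.Pattern;

class WordRemover {
    // Removes the possessive 's and every listed word, then normalizes whitespace.
    static String removeWords(String text, List<String> words) {
        StringJoiner alternation = new StringJoiner("|");
        for (String w : words) {
            alternation.add(Pattern.quote(w)); // quote in case a word has regex metacharacters
        }
        String regex = "'s\\b|\\b(?:" + alternation + ")\\b";
        return text.replaceAll(regex, "")   // drop the matches
                   .replaceAll("\\s+", " ") // collapse runs of whitespace
                   .trim();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("he", "of", "to", "a", "and", "in", "that");
        System.out.println(removeWords(" he saw a cat running of that pat's mat ", words));
    }
}
```

Since \b is zero-width, consecutive words such as "of that" no longer compete for the same space.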
It doesn't work because the " of " match is replaced first, which consumes the space before "that", so " that" can no longer match.
You can fix this in several ways. Either drop the trailing space after "of":
String regex = "'s | he | of| to | a | and | in | that";
or drop the leading space before "that":
String regex = "'s | he | of | to | a | and | in |that ";
or simply call Sample = Sample.replaceAll(regex, " "); a second time.
I have the following part of string:
{{Infobox musical artist
|honorific-prefix = [[The Honourable]]
| name = Bob Marley
| image = Bob-Marley.jpg
| alt = Black and white image of Bob Marley on stage with a guitar
| caption = Bob Marley in concert, 1980.
| background = solo_singer
| birth_name = Robert Nesta Marley
| alias = Tuff Gong
| birth_date = {{birth date|df=yes|1945|2|6}}
| birth_place = [[Nine Mile, Jamaica|Nine Mile]], [[Jamaica]]
| death_date = {{death date and age|df=yes|1981|5|11|1945|2|6}}
| death_place = [[Miami]], [[Florida]]
| instrument = Vocals, guitar, percussion
| genre = [[Reggae]], [[ska]], [[rocksteady]]
| occupation = [[Singer-songwriter]], [[musician]], [[guitarist]]
| years_active = 1962–1981
| label = [[Beverley's]], [[Studio One (record label)|Studio One]],
| associated_acts = [[Bob Marley and the Wailers]]
| website = {{URL|bobmarley.com}}
}}
And I'd like to remove all of it. If I try the regex \{\{(.*?)\}\}, it matches {{birth date|df=yes|1945|2|6}}, which makes sense. So I tried \{\{([^\}]*?)\}\}, which then grabs from the start but ends on the same line, which also makes sense since it has encountered }}. I've also tried it without the ? (that is, greedy), with the same results. My question is: how can I remove everything that's inside a {{ }}, no matter how many of the same characters are inside?
Edit: If you want my entire input, it's this:
https://en.wikipedia.org/w/index.php?maxlag=5&title=Bob+Marley&action=raw
Here's a solution with a DOTALL Pattern and a greedy quantifier for an input that contains only one instance of the fragment you wish to remove (i.e. replace with an empty String):
String input = "Foo {{Infobox musical artist\n"
+ "|honorific-prefix = [[The Honourable]]\n"
+ "| name = Bob Marley\n"
+ "| image = Bob-Marley.jpg\n"
+ "| alt = Black and white image of Bob Marley on stage with a guitar\n"
+ "| caption = Bob Marley in concert, 1980.\n"
+ "| background = solo_singer\n"
+ "| birth_name = Robert Nesta Marley\n"
+ "| alias = Tuff Gong\n"
+ "| birth_date = {{birth date|df=yes|1945|2|6}}\n"
+ "| birth_place = [[Nine Mile, Jamaica|Nine Mile]], [[Jamaica]]\n"
+ "| death_date = {{death date and age|df=yes|1981|5|11|1945|2|6}}\n"
+ "| death_place = [[Miami]], [[Florida]]\n"
+ "| instrument = Vocals, guitar, percussion\n"
+ "| genre = [[Reggae]], [[ska]], [[rocksteady]]\n"
+ "| occupation = [[Singer-songwriter]], [[musician]], [[guitarist]] \n"
+ "| years_active = 1962–1981\n"
+ "| label = [[Beverley's]], [[Studio One (record label)|Studio One]],\n"
+ "| associated_acts = [[Bob Marley and the Wailers]]\n"
+ "| website = {{URL|bobmarley.com}}\n" + "}} Bar";
// Pattern parts, left to right:
//   (?s)      DOTALL flag
//   \\{\\{    first two curly brackets
//   .+        multi-line dot
//   \\}\\}    last two curly brackets
// The second argument replaces the match with an empty String.
System.out.println(input.replaceAll("(?s)\\{\\{.+\\}\\}", ""));
Output
Foo Bar
Notes after comments
This case implies using regular expressions to manipulate markup language.
Regular expressions are not designed to parse hierarchical markup, so they do not really fit this case; this answer is at best a stub for what would be an ugly workaround.
See here for a famous SO thread on parsing markup with regex.
Use a greedy quantifier instead of the reluctant one you're using.
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Edit: spoonfeeding: "\{\{.*\}\}"
Try this pattern, it should take care of everything:
"\\D\\{\\{I.+[\\P{M}\\p{M}*+].+\\}\\}\\D"
specify: DOTALL
code:
String result = searchText.replaceAll("\\D\\{\\{I.+[\\P{M}\\p{M}*+].+\\}\\}\\D", "");
example: http://fiddle.re/5n4zg
This regex matches a single such block (only):
\{\{([^{}]*?\{\{.*?\}\})*.*?\}\}
See a live demo.
In java, to remove all such blocks:
str = str.replaceAll("(?s)\\{\\{([^{}]*?\\{\\{.*?\\}\\})*.*?\\}\\}", "");
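Since the regex answers above each cope with only a limited amount of nesting, a small brace-depth scanner may be more robust for arbitrarily nested {{ }} blocks. The following is a sketch of that alternative technique, not of any answer above; TemplateStripper and stripTemplates are made-up names, and it assumes only double braces matter (wikitext triple braces are not treated specially):

```java
class TemplateStripper {
    // Copies the text while skipping everything inside {{ ... }},
    // tracking nesting depth so inner templates do not end the outer one.
    static String stripTemplates(String text) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("{{", i)) {
                depth++;      // entering a template
                i += 2;
            } else if (depth > 0 && text.startsWith("}}", i)) {
                depth--;      // leaving a template
                i += 2;
            } else {
                if (depth == 0) {
                    out.append(text.charAt(i)); // keep only text outside templates
                }
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTemplates("Foo {{Infobox | birth_date = {{birth date|1945|2|6}} }} Bar"));
    }
}
```

An unbalanced }} outside any template is kept as ordinary text, and an unclosed {{ drops everything up to the end of the input.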