Parsing internal links from text in an XML file (Java)

I need to extract the internal links from the text field of a Wikinews XML file.
In my case they come in two formats:
[[w:President of the People's Republic of China|President]]
[[People's Republic of China]]
I applied these regex patterns:
internalLinks = Pattern.compile("\\[\\[w:([^|:]+)\\|.*\\]\\]").matcher(internalLinks).replaceAll("##en.wikipedia.org/wiki/$1##");
internalLinks = Pattern.compile("\\[\\[([^:|]+)\\]\\]").matcher(internalLinks).replaceAll("[[[en.wikinews.org/wiki/$1]]]");
Pattern pattern = Pattern.compile("\\[\\[\\[(.*?)\\]\\]\\]");
Matcher matcher = pattern.matcher(internalLinks);
while (matcher.find())
{
    interLinks += matcher.group(1)+",";
}
Pattern pattern1 = Pattern.compile("##(.*?)##");
Matcher matcher1 = pattern1.matcher(internalLinks);
while (matcher1.find())
{
    interLinks += matcher1.group(1)+",";
}
if (interLinks.length() > 0) {
    interLinks = interLinks.substring(0, interLinks.length()-1);
    return interLinks;
} else return "";
The problem is that it returns only links matching the first pattern, and only 3 or 4 of them rather than all.
Here is an excerpt of the text field of a document:
{{date|November 13, 2004}}
{{Brazil}}[[w:Hu Jintao|Hu Jintao]], the [[w:President of the People's Republic of China|President]] of the [[People's Republic of China]] had lunch today with the [[w:President of Brazil|President]] of [[Brazil]], [[w:Luiz Inácio Lula da Silva|Luiz Inácio Lula da Silva]], at the ''Granja do Torto'', the President's country residence in the [[w:Brazilian Federal District|Brazilian Federal District]]. Lunch was a traditional Brazilian [[w:barbecue|barbecue]] with different kinds of meat.
Some Brazilian ministers were present at the event: [[w:Antonio Palocci|Antonio Palocci]] (Economy), [[w:pt:Eduardo Campos|Eduardo Campos]] ([[w:Ministry of Science and Technology (Brazil)|Science and Technology]]), [[w:João Roberto Rodrigues|Roberto Rodrigues]] (Agriculture), [[w:pt:Luiz Fernando Furlan|Luiz Fernando Furlan]] (Development), [[w:Celso Amorim|Celso Amorim]] ([[w:Ministry of
External Relations (Brazil)|Exterior Relations]]), [[w:Dilma Rousseff|Dilma Rousseff]] (Mines and Energy). Also present were [[w:pt:Roger Agnelli|Roger Agnelli]] ([[w:Vale (mining company)|Vale do Rio Doce]] company president) and Eduardo Dutra ([[w:Petrobras|Petrobras]], government oil company, president).
This meeting is part of a new [[w:political economy|political economy]] agreement between Brazil and China where Brazil has recognized mainland China's [[w:socialist market economy|market economy]] status, and China has promised to buy more [[w:economy of Brazil|Brazilian products]].

Solution
\[\[(?:w:)?.*?\]\]
Discussion
This regex assumes that the character sequence ]] never appears between [[ and ]].
I have not yet found out how a literal ]] would be escaped inside a link.
Demo
http://regexr.com?37e51
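
As a rough sketch of how the idea above could be applied in Java, the following extracts each link target in a single pass instead of rewriting the text twice. The class name `WikiLinkExtractor` is hypothetical; the two URL prefixes are the ones used in the question. It assumes links are not nested and that a target never contains `|` or `]`; interlanguage prefixes such as `pt:` are kept as part of the target.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikiLinkExtractor {
    // "[[", optional "w:" prefix, target up to the first '|' or ']',
    // optional "|label" part, then "]]".
    private static final Pattern LINK =
            Pattern.compile("\\[\\[(w:)?([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

    static List<String> extract(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            // A "w:" prefix means Wikipedia; otherwise the link stays on Wikinews.
            String base = (m.group(1) != null)
                    ? "en.wikipedia.org/wiki/"
                    : "en.wikinews.org/wiki/";
            links.add(base + m.group(2).trim());
        }
        return links;
    }

    public static void main(String[] args) {
        String sample = "[[w:Hu Jintao|Hu Jintao]], the "
                + "[[w:President of the People's Republic of China|President]] "
                + "of the [[People's Republic of China]]";
        System.out.println(String.join(",", extract(sample)));
    }
}
```

The negated character classes stop each match at the first `|` or `]`, so one link cannot swallow the rest of the line the way a greedy `.*` does.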

I visited the download page; at the top it says:
See Meta:Data dumps for documentation on the provided data formats.
I guess it describes better parsing approaches than plain regex, so check it out.

Related

How to use Scanner.useDelimiter() to match two characters next to each other followed by a word?

I am trying to parse a plain .txt file with the general structure
[[Title]]
CATEGORIES: text, text, text
some text etc...
[[Next Title]]
CATEGORIES: text, text, text
Next other text etc ...
In my code I use this pattern:
Scanner inputScanner = new Scanner(fileEntry);
inputScanner.useDelimiter("\\]\\]|\\[\\[");
while (inputScanner.hasNext()) {
    // Get title of wiki article and contents
    String wikiName = inputScanner.next();
    String wikiContents = inputScanner.next();
}
But it is also catching items like
"[some text [ some other text ] some more text ]"
"[[Vertebrate trachea|trachea]]s from human stem cells. Several [[artificial urinary bladder]]s"
"[[Image:Bohr-atom-PAR.svg|thumb|right|310px|The Rutherford–Bohr model of the hydrogen atom ([tpl]nowrap|Z [tpl]=[/tpl] 1[/tpl]) or a hydrogen-like ion ([tpl]nowrap|Z > 1[/tpl]), results in a photon of wavelength 656 nm (red light).]]"
"[[File:Gettysburg Campaign.png|thumb|350px|Gettysburg Campaign (through July 3); cavalry movements shown with dashed lines. [tpl]legend|#ff0000|Confederate[/tpl]]]"
"observed is not some nonphysical world of [[consciousness]], mind, or mental life "
I want the scanner to split only where it sees
'[[' or ']] CATEGORIES'
but I am not sure how to do that, since I am not very good with patterns or regex.
Can anyone suggest a pattern that might work? I have looked at other delimiter questions and the Javadocs, but found it hard to apply them to my problem.
Thank you for your time and any help you can give!

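To match the title correctly, we can use a positive lookahead in the regex:
\[\[(?=.*]]\nCATEGORIES:)|]]\n(?=CATEGORIES:)
Explanation:
The first alternative matches [[ only when the rest of the line ends in ]] and the next line starts with CATEGORIES:. Because the check is a lookahead, only the [[ itself is consumed as the delimiter.
The second alternative similarly matches ]] and the line break when CATEGORIES: follows.
Updated snippet (it also tolerates whitespace around the line break):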
String text = "[[title1]] \n" +
"CATEGORIES: [some text [ some other text ] some more text ]\n" +
"[[Vertebrate trachea|trachea]]s from human stem cells. Several [[artificial urinary bladder]]s\n" +
"[[Image:Bohr-atom-PAR.svg|thumb|right|310px|The Rutherford–Bohr model of the hydrogen atom ([tpl]nowrap|Z [tpl]=[/tpl] 1[/tpl]) or a hydrogen-like ion ([tpl]nowrap|Z > 1[/tpl]), results in a photon of wavelength 656 nm (red light).]]\n" +
"[[File:Gettysburg Campaign.png|thumb|350px|Gettysburg Campaign (through July 3); cavalry movements shown with dashed lines. [tpl]legend|#ff0000|Confederate[/tpl]]]\n" +
"observed is not some nonphysical world of [[consciousness]], mind, or mental life\n" +
"[[title2]]\n" +
"CATEGORIES: [[some more text]]";
Scanner inputScanner = new Scanner(text);
inputScanner.useDelimiter("\\[\\[(?=.*]]\\s*CATEGORIES:)|]]\\s*\n(?=\\s*CATEGORIES:)");
while (inputScanner.hasNext()) {
String wikiName = inputScanner.next();
String wikiContents = inputScanner.next();
System.out.printf("Name:%s\nContents:%s\n\n", wikiName, wikiContents);
}
Output:
Name:title1
Contents:CATEGORIES: [some text [ some other text ] some more text ]
[[Vertebrate trachea|trachea]]s from human stem cells. Several [[artificial urinary bladder]]s
[[Image:Bohr-atom-PAR.svg|thumb|right|310px|The Rutherford–Bohr model of the hydrogen atom ([tpl]nowrap|Z [tpl]=[/tpl] 1[/tpl]) or a hydrogen-like ion ([tpl]nowrap|Z > 1[/tpl]), results in a photon of wavelength 656 nm (red light).]]
[[File:Gettysburg Campaign.png|thumb|350px|Gettysburg Campaign (through July 3); cavalry movements shown with dashed lines. [tpl]legend|#ff0000|Confederate[/tpl]]]
observed is not some nonphysical world of [[consciousness]], mind, or mental life
Name:title2
Contents:CATEGORIES: [[some more text]]
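
For comparison, here is a sketch that avoids a delimiter regex altogether and walks the file line by line. This is a different technique from the Scanner approach above, not a restatement of it. It assumes a title always sits alone on its line as `[[...]]` and is directly followed by a line starting with `CATEGORIES:`; the class name `WikiDumpSplitter` is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WikiDumpSplitter {
    // Returns title -> contents in file order. A line counts as a title only
    // if it is wrapped in [[...]] and the next line starts with "CATEGORIES:".
    static Map<String, String> split(String text) {
        Map<String, String> articles = new LinkedHashMap<>();
        String[] lines = text.split("\\R");
        String title = null;
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            boolean isTitle = line.startsWith("[[") && line.endsWith("]]")
                    && i + 1 < lines.length
                    && lines[i + 1].trim().startsWith("CATEGORIES:");
            if (isTitle) {
                if (title != null) {
                    articles.put(title, body.toString().trim());
                }
                title = line.substring(2, line.length() - 2);
                body.setLength(0);
            } else if (title != null) {
                body.append(lines[i]).append('\n');
            }
        }
        if (title != null) {
            articles.put(title, body.toString().trim());
        }
        return articles;
    }

    public static void main(String[] args) {
        String text = "[[Title]]\nCATEGORIES: a, b\nsee [[consciousness]] here\n"
                + "[[Next Title]]\nCATEGORIES: c\nmore text";
        split(text).forEach((name, contents) ->
                System.out.printf("Name:%s%nContents:%s%n%n", name, contents));
    }
}
```

Inline links such as `[[consciousness]]` stay inside the contents because they are never followed by a `CATEGORIES:` line. Articles with the same title would overwrite each other in the map.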

How to extract only English hashtags from tweet text

I am using the Twitter streaming API to get real-time tweets, and I am checking the lang field. I extract hashtags from those tweets, but the result contains both English and non-English hashtags. Is there any way to extract only the English hashtags from a given tweet text? This is my code for extracting hashtags once I have the tweet text:
private String getHashTag(String TweetText) {
    String[] words = TweetText.split(" ");
    Set<String> hashtags = new HashSet<String>();
    for (String word : words) {
        if (word.startsWith("#")) {
            hashtags.add(word);
        }
    }
    return hashtags.toString();
}

You should use Apache Tika and its API for language detection. This is an example:
import org.apache.tika.language.LanguageIdentifier;
LanguageIdentifier identifier = new LanguageIdentifier(word);
String language = identifier.getLanguage();
With this solution you can detect the language and therefore consider only English tweets.

What you want is to detect the language of a string. See this post: How to detect language of user entered text?
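
If a full language detector is more than you need, a much cruder heuristic is to keep only hashtags written entirely in ASCII letters, digits and underscores. This is a script check, not language detection: a French or Spanish hashtag without accents would still pass. The sketch below is an assumption about what "English" might mean in practice, and the class name `AsciiHashtags` is hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AsciiHashtags {
    // '#' plus ASCII word characters, rejected if the tag continues with a
    // letter, combining mark or digit from any other script.
    private static final Pattern TAG =
            Pattern.compile("#[A-Za-z0-9_]+(?![\\p{L}\\p{M}\\p{N}])");

    static Set<String> extract(String tweetText) {
        Set<String> hashtags = new LinkedHashSet<>();
        Matcher m = TAG.matcher(tweetText);
        while (m.find()) {
            hashtags.add(m.group());
        }
        return hashtags;
    }

    public static void main(String[] args) {
        // \u65e5\u672c\u8a9e is a Japanese hashtag; \u30bf\u30b0 makes the third tag mixed-script.
        String tweet = "Loving #Java and #\u65e5\u672c\u8a9e and #mixed\u30bf\u30b0 plus #java_8!";
        System.out.println(extract(tweet));
    }
}
```

Using a regex instead of `split(" ")` also strips trailing punctuation such as the `!` after `#java_8`.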

Parse an input text using Java Regex

I have this corresponding input text:
Clark is set to work in ''[[Superman (the Hero)|Superman]]'', a [[SuperHero Genre II]] movie directed [[Source:NYTimes]]...
Clark visited the [[University of Pleasantville]] campus in November 2009 to ...
*[[1973]] &ndash; [[Clark Kent]], superhero and newspaper reporter...
After appearing in other movies, Clark starred as [[negative hero]] [[Alternate Superman]] in ''[[Superman (2003 film)|Superman]]''...
Clark met ''[[Daily Planet]]'' reporter [[Louis Lane]]...
This is the pattern code that I am using in Java:
String pattern = "(?:\\p{Punct}|\\B|\\b)(\\[\\[[^(Arch:|Zeus:|Source:)].*?\\]\\])(?:\\p{Punct}|\\b|\\B)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(data);
while (m.find( )) {
    System.out.println("Found value: " + m.group(1) );
}
I am reading the file line by line using readLine of BufferedReader (sysout-ing every line as I parse it) and getting the following output using my regex:
Clark is set to work in ''[[Superman (the Hero)|Superman]]'', a [[SuperHero Genre II]] movie directed [[Source:NYTimes]]...
Clark visited the [[University of Pleasantville]] campus in November 2009 to ...
Found value: [[University of Pleasantville]]
*[[1973]] – [[Clark Kent]], superhero and newspaper reporter...
Found value: [[1973]]
After appearing in other movies, Clark starred as [[negative hero]] [[Alternate Superman]] in ''[[Superman (2003 film)|Superman]]''...
Found value: [[negative hero]]
Found value: [[Alternate Superman]]
Clark met ''[[Daily Planet]]'' reporter [[Louis Lane]]...
Found value: [[Daily Planet]]
Found value: [[Louis Lane]]
As you can see, I am not able to extract every link of the form [[I_want_to_extract_these_except_Source_or_Arch_or_Zeus]]. For example, from the first line I should have extracted [[Superman (the Hero)|Superman]] and [[SuperHero Genre II]], but nothing was retrieved. How can I modify my regex to extract everything except links such as [[Source:something]]? Thank you.

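The bracket expression [^(Arch:|Zeus:|Source:)] is a negated character class, not a negated group: it rejects any link whose first character is one of the listed characters, which is why links starting with S (such as [[Superman (the Hero)|Superman]]) are skipped. Use a negative lookahead, (?!...), instead:
\[\[(?!Arch:|Zeus:|Source:).*?\]\]
See it in action: http://regex101.com/r/lJ6sH3/1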
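
In Java the backslashes have to be doubled inside the string literal. A minimal sketch, with a colon after each excluded prefix and a hypothetical class name `LinkFilter`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkFilter {
    // "[[" not followed by an excluded prefix, then the shortest run up to "]]".
    private static final Pattern LINK =
            Pattern.compile("\\[\\[(?!Arch:|Zeus:|Source:).*?\\]\\]");

    static List<String> links(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = LINK.matcher(line);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "Clark is set to work in ''[[Superman (the Hero)|Superman]]'', "
                + "a [[SuperHero Genre II]] movie directed [[Source:NYTimes]]...";
        for (String link : links(line)) {
            System.out.println("Found value: " + link);
        }
    }
}
```

For the first sample line this finds `[[Superman (the Hero)|Superman]]` and `[[SuperHero Genre II]]` and skips `[[Source:NYTimes]]`.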

regex to find email address from a String

My intention is to get email addresses from a web page. I have the page source and I am reading it line by line. I want to get the email address from the line I am currently reading; that line may or may not contain one. I have seen a lot of regex examples, but most of them are for validating an email address. I want to extract the address from the page source, not validate it. It should work the way http://emailx.discoveryvip.com/ does.
Some example input lines are:
1)<p>Send details to neeraj@yopmail.com</p>
2)<p>Interested should send details directly to www.abcdef.com/abcdef/. Should you have any questions, please email neeraj@yopmail.com.
3)Note :- Send your queries at neeraj@yopmail.com for more details call Mr. neeraj 012345678901.
I want to get neeraj@yopmail.com from examples 1, 2 and 3.
I am using Java and I am not good at regex. Please help.

You can validate e-mail address formats according to RFC 2822 with this:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
and here's an explanation from regular-expressions.info:
This regex has two parts: the part before the @, and the part after the @. There are two alternatives for the part before the @: it can either consist of a series of letters, digits and certain symbols, including one or more dots. However, dots may not appear consecutively or at the start or end of the email address. The other alternative requires the part before the @ to be enclosed in double quotes, allowing any string of ASCII characters between the quotes. Whitespace characters, double quotes and backslashes must be escaped with backslashes.
And you can check this out here: Rubular example.

The correct code is:
Pattern p = Pattern.compile("\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b",
        Pattern.CASE_INSENSITIVE);
Matcher matcher = p.matcher(input);
Set<String> emails = new HashSet<String>();
while(matcher.find()) {
    emails.add(matcher.group());
}
This will give you the set of mail addresses found in your long text / HTML input.
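
A self-contained sketch of that loop, run against the three sample lines from the question, might look as follows. It assumes the addresses use the usual `@` separator; the class name `EmailExtractor` is hypothetical. Note that `{2,4}` limits the top-level domain to four letters, so longer ones such as `.museum` would be missed.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailExtractor {
    private static final Pattern EMAIL = Pattern.compile(
            "\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b",
            Pattern.CASE_INSENSITIVE);

    // Collects every address found in the input, without duplicates.
    static Set<String> extract(String input) {
        Set<String> emails = new LinkedHashSet<>();
        Matcher matcher = EMAIL.matcher(input);
        while (matcher.find()) {
            emails.add(matcher.group());
        }
        return emails;
    }

    public static void main(String[] args) {
        String[] lines = {
            "<p>Send details to neeraj@yopmail.com</p>",
            "<p>Interested should send details directly to www.abcdef.com/abcdef/. "
                + "Should you have any questions, please email neeraj@yopmail.com.",
            "Note :- Send your queries at neeraj@yopmail.com for more details "
                + "call Mr. neeraj 012345678901."
        };
        for (String line : lines) {
            System.out.println(extract(line));
        }
    }
}
```

Because `find()` scans for matches anywhere in the input, surrounding HTML tags and trailing punctuation do not need to be stripped first.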

You need something like this regex:
".*(\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b).*"
When it matches, you can extract the first group and that will be your email.
String regex = ".*(\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b).*";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("your text here");
if (m.matches()) {
    String email = m.group(1);
    // do something with your email
}

On Android, this is a simple way to extract all emails from an input String using Patterns.EMAIL_ADDRESS:
public static List<String> getEmails(@NonNull String input) {
    List<String> emails = new ArrayList<>();
    Matcher matcher = Patterns.EMAIL_ADDRESS.matcher(input);
    while (matcher.find()) {
        int matchStart = matcher.start(0);
        int matchEnd = matcher.end(0);
        emails.add(input.substring(matchStart, matchEnd));
    }
    return emails;
}

Replace string by excluding some strings in Java

How can I turn the following string in Java:
Sports videos (From 2002 To 2003) here.
into
Sports videos 2002 2003 here.
I have used the code below, but it removes the whole parenthesised part, i.e.
I am getting this output: Sports videos here.
String pattern= "\\((From)(?:\\s*\\d*\\s*)(To)(?:\\s*\\d*\\s*)\\)";
String testStr = "Sports videos (From 2002 To 2003) here.";
String testStrAfterRegex = testStr.replaceFirst(pattern, "");
What is missing here?
Thanks
DIFFERENT STRING WITH DATE FORMATTER
If the string above contains a date separator (such as \\) or any other characters or words besides digits, the answer's pattern will not match.
I replaced the pattern from the original answer with this one, and it works:
String pattern= "\\((From)(.*)(To)(.*)\\)";

Change to
String pattern= "\\((From)\\s*(\\d*)\\s*(To)\\s*(\\d*)\\s*\\)";
String testStr = "Sports videos (From 2002 To 2003) here.";
String testStrAfterRegex = testStr.replaceFirst(pattern, "$2 $4");
There are two problems:
First
You wrapped the years in (?:...), which is a non-capturing group, so the matched years are not remembered.
Second
You replace the match with an empty string instead of referring back to the groups with $1, $2 and so on.
I fixed this by using $2 and $4 for the 2nd and 4th groups. The \\s* sits outside the groups so that the surrounding spaces are not copied into the result.
EDIT
Cleaner solution:
String pattern= "\\(From\\s*(\\d*)\\s*To\\s*(\\d*)\\s*\\)";
String testStr = "Sports videos (From 2002 To 2003) here.";
String testStrAfterRegex = testStr.replaceFirst(pattern, "$1 $2");
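
If the same clean-up has to run over many strings, or a string may contain more than one range, it can help to compile the pattern once and use Matcher.replaceAll. A minimal sketch, assuming the years are plain digits and using a hypothetical class name `YearRangeCleaner`:

```java
import java.util.regex.Pattern;

class YearRangeCleaner {
    // Whitespace is matched outside the groups, so only the digits are captured.
    private static final Pattern RANGE =
            Pattern.compile("\\(From\\s*(\\d+)\\s*To\\s*(\\d+)\\s*\\)");

    // Replaces every "(From X To Y)" with "X Y".
    static String clean(String text) {
        return RANGE.matcher(text).replaceAll("$1 $2");
    }

    public static void main(String[] args) {
        System.out.println(clean("Sports videos (From 2002 To 2003) here."));
    }
}
```

Strings that contain no such range are returned unchanged.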
