Parse an input text using Java Regex


I have this corresponding input text:
Clark is set to work in ''[[Superman (the Hero)|Superman]]'', a [[SuperHero Genre II]] movie directed [[Source:NYTimes]]...
Clark visited the [[University of Pleasantville]] campus in November 2009 to ...
*[[1973]] &ndash; [[Clark Kent]], superhero and newspaper reporter...
After appearing in other movies, Clark starred as [[negative hero]] [[Alternate Superman]] in ''[[Superman (2003 film)|Superman]]''...
Clark met ''[[Daily Planet]]'' reporter [[Louis Lane]]...
This is the pattern code that I am using in Java:
String pattern = "(?:\\p{Punct}|\\B|\\b)(\\[\\[[^(Arch:|Zeus:|Source:)].*?\\]\\])(?:\\p{Punct}|\\b|\\B)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(data);
while (m.find( )) {
System.out.println("Found value: " + m.group(1) );
}
I am reading the file line by line using readLine of BufferedReader (sysout-ing every line as I parse it) and getting the following output using my regex:
Clark is set to work in ''[[Superman (the Hero)|Superman]]'', a [[SuperHero Genre II]] movie directed [[Source:NYTimes]]...
Clark visited the [[University of Pleasantville]] campus in November 2009 to ...
Found value: [[University of Pleasantville]]
*[[1973]] – [[Clark Kent]], superhero and newspaper reporter...
Found value: [[1973]]
After appearing in other movies, Clark starred as [[negative hero]] [[Alternate Superman]] in ''[[Superman (2003 film)|Superman]]''...
Found value: [[negative hero]]
Found value: [[Alternate Superman]]
Clark met ''[[Daily Planet]]'' reporter [[Louis Lane]]...
Found value: [[Daily Planet]]
Found value: [[Louis Lane]]
As you can see, the problem is that I cannot extract everything inside the double brackets, i.e. [[I_want_to_extract_these_except_Source_or_Arch_or_Zeus]]. For example, from the first line I should have extracted [[Superman (the Hero)|Superman]] and [[SuperHero Genre II]], but nothing was retrieved. How can I modify my regex to extract every link except those of the form [[Source:something]], [[Arch:something]] or [[Zeus:something]]? Thank you.

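Use a negative lookahead, (?!...). In the original pattern, [^(Arch:|Zeus:|Source:)] is a character class: it rejects any single character from that set, so a link whose first character is S, A or Z (such as [[Superman (the Hero)|Superman]]) is rejected, instead of only the links that start with those whole prefixes. A lookahead tests the whole prefix: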
\[\[(?!Arch:|Zeus:|Source).*?\]\]
See it in action: http://regex101.com/r/lJ6sH3/1
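A minimal Java sketch of the lookahead approach, applied to the first input line from the question. The class and method names (WikiLinkFilter, extractLinks) are made up for illustration, and the pattern is written with "Source:" including the colon to mirror the other two prefixes, which is an assumption about the intent:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original post.
class WikiLinkFilter {
    // "[[" not followed by an excluded prefix, then the shortest run up to "]]"
    private static final Pattern LINK =
            Pattern.compile("\\[\\[(?!Arch:|Zeus:|Source:).*?\\]\\]");

    // Returns every [[...]] link in the line except the excluded ones
    static List<String> extractLinks(String line) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(line);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "Clark is set to work in ''[[Superman (the Hero)|Superman]]'', "
                + "a [[SuperHero Genre II]] movie directed [[Source:NYTimes]]...";
        for (String link : extractLinks(line)) {
            System.out.println("Found value: " + link);
        }
        // Found value: [[Superman (the Hero)|Superman]]
        // Found value: [[SuperHero Genre II]]
    }
}
```

The surrounding (?:\p{Punct}|\b|\B) groups from the question are dropped here because they can match the empty string and so do not constrain the match.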

Related

How to use Scanner.useDelimiter() to match two characters next to each other followed by a word?

I am trying to parse a plain .txt file with the general structure
[[Title]]
CATEGORIES: text, text, text
some text etc...
[[Next Title]]
CATEGORIES: text, text, text
Next other text etc ...
In my code I use this pattern
Scanner inputScanner = new Scanner(fileEntry);
inputScanner.useDelimiter("\\]\\]|\\[\\[");
while (inputScanner.hasNext()) {
// Get title of wiki article and contents
String wikiName = inputScanner.next();
String wikiContents = inputScanner.next();
}
But it is also catching items like
"[some text [ some other text ] some more text ]"
"[[Vertebrate trachea|trachea]]s from human stem cells. Several [[artificial urinary bladder]]s"
"[[Image:Bohr-atom-PAR.svg|thumb|right|310px|The Rutherford–Bohr model of the hydrogen atom ([tpl]nowrap|Z [tpl]=[/tpl] 1[/tpl]) or a hydrogen-like ion ([tpl]nowrap|Z > 1[/tpl]), results in a photon of wavelength 656 nm (red light).]]"
"[[File:Gettysburg Campaign.png|thumb|350px|Gettysburg Campaign (through July 3); cavalry movements shown with dashed lines. [tpl]legend|#ff0000|Confederate[/tpl]]]"
"observed is not some nonphysical world of [[consciousness]], mind, or mental life "
I want the scanner to delimit whenever it sees
'[[' or ']] CATEGORIES'
but not sure how I could do that since I'm not that good at patterns or regex.
Can anyone identify a pattern that might work? I've tried looking around at other delimiter questions and the javadocs but it was hard to apply them to my problem.
Thank you for your time and any help you can give!

For matching the title correctly, we can use positive lookahead in the regex:
\[\[(?=.*]]\nCATEGORIES:)|]]\n(?=CATEGORIES:)
Explanation:
Match [[ only when the rest of the same line ends in ]] and the next line starts with CATEGORIES:. Because the check is a positive lookahead, only the [[ itself is consumed as the delimiter.
Similarly, match ]] plus the line break only when CATEGORIES: follows.
Updated Snippet:
String text = "[[title1]] \n" +
"CATEGORIES: [some text [ some other text ] some more text ]\n" +
"[[Vertebrate trachea|trachea]]s from human stem cells. Several [[artificial urinary bladder]]s\n" +
"[[Image:Bohr-atom-PAR.svg|thumb|right|310px|The Rutherford–Bohr model of the hydrogen atom ([tpl]nowrap|Z [tpl]=[/tpl] 1[/tpl]) or a hydrogen-like ion ([tpl]nowrap|Z > 1[/tpl]), results in a photon of wavelength 656 nm (red light).]]\n" +
"[[File:Gettysburg Campaign.png|thumb|350px|Gettysburg Campaign (through July 3); cavalry movements shown with dashed lines. [tpl]legend|#ff0000|Confederate[/tpl]]]\n" +
"observed is not some nonphysical world of [[consciousness]], mind, or mental life\n" +
"[[title2]]\n" +
"CATEGORIES: [[some more text]]";
Scanner inputScanner = new Scanner(text);
inputScanner.useDelimiter("\\[\\[(?=.*]]\\s*CATEGORIES:)|]]\\s*\n(?=\\s*CATEGORIES:)");
while (inputScanner.hasNext()) {
String wikiName = inputScanner.next();
String wikiContents = inputScanner.next();
System.out.printf("Name:%s\nContents:%s\n\n", wikiName, wikiContents);
}
Output:
Name:title1
Contents:CATEGORIES: [some text [ some other text ] some more text ]
[[Vertebrate trachea|trachea]]s from human stem cells. Several [[artificial urinary bladder]]s
[[Image:Bohr-atom-PAR.svg|thumb|right|310px|The Rutherford–Bohr model of the hydrogen atom ([tpl]nowrap|Z [tpl]=[/tpl] 1[/tpl]) or a hydrogen-like ion ([tpl]nowrap|Z > 1[/tpl]), results in a photon of wavelength 656 nm (red light).]]
[[File:Gettysburg Campaign.png|thumb|350px|Gettysburg Campaign (through July 3); cavalry movements shown with dashed lines. [tpl]legend|#ff0000|Confederate[/tpl]]]
observed is not some nonphysical world of [[consciousness]], mind, or mental life
Name:title2
Contents:CATEGORIES: [[some more text]]

Extracting all DATES from a .txt file

Hopefully this is short and to the point.
In the below program I have successfully extracted ALL data from a notepad doc named "pad.txt", which consists of 3 sets vertically aligned with an 'ID' followed by 'Name' followed by 'Date Joined', that pattern is consistent.
The notepad doc consists solely of this:
ID: 1
Name: Bob
Date Joined: 01/12/2014
ID: 2
Name: Jim
Date Joined: 8/21/1993
ID: 3
Name: Steve
Date Joined: 6/07/2016
I have also defined a regex for an acceptable date format: 1-2 digits, a slash, 1-2 digits again, a slash, then 2 to 4 digits for the year. At the beginning I put the wildcard character "." (the dot) with the greedy quantifier "*" (the star), to say that ANY number of ANY character is accepted before the date, and I put the same ".*" after the date.
My main goal with this code is to EXTRACT ONLY all of the DATES within the pad.txt file, and store them in a String or something..
public class Main {
public static void main(String args[]) throws Exception{
StringBuilder builder = new StringBuilder();
FileReader reader = new FileReader(new File("pad.txt"));
// Define valid date format via regex
String dateRegex = ".* (\\d{1,2})/(\\d{1,2})/(\\d{2,4}) .* ";
int fileContent = 0;
// iterate through entire notepad doc, until = 0 AKA (finished searching doc)
while((fileContent = reader.read()) !=-1){
builder.append((char)fileContent);
}//encapsulating loop
reader.close();
String extracted = builder.toString();
System.out.println("Extracted: " + extracted);
System.out.println();
Matcher m = null;
// Validate that file contents conform with 'dateRegex'
m = Pattern.compile(dateRegex).matcher(extracted);
if(m.find()){
System.out.println("Entire group : " + m.group());
}
}
}
Unfortunately, the m.group(); outprint only returns:
"Entire group : 6/07/2016"
As stated, my goal is to extract ALL of the dates, but I can't fiddle with all of the dates if the .matcher call ONLY catches the "Entire group : 6/07/2016"
In my mind, I say ANY character of ANY amount is allowed before and AFTER the date, so it scrolls to the very bottom and finds ONLY the LAST date. How do I define the regex so that it pulls out ALL of the dates, not just the very LAST one, and why is it only pulling the last one?
I've tried relentlessly with this and cannot figure out how..
Thanks in advance

Well, that's relatively easy. You can't write a regex that matches all dates at once, but you can use matcher as it was intended to be used, i.e. find() returns true as often as another match can be found.
So you have to modify your regex and remove the .* on both ends. Then you can simply do this:
StringBuilder dateListBuilder = new StringBuilder();
while(m.find()){
dateListBuilder.append(m.group());
}
System.out.println(dateListBuilder.toString());
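Put together, the find() loop might look like the following self-contained sketch. It uses the date shape from the question with the leading and trailing ".*" removed, collects the matches in a list rather than one concatenated string, and takes the file content as an in-memory string for brevity. The class and method names (DateExtractor, extractDates) are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original post.
class DateExtractor {
    // 1-2 digits, slash, 1-2 digits, slash, 2-4 digits; no surrounding ".*"
    private static final Pattern DATE =
            Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{2,4}");

    // Each call to find() advances to the next date in the text
    static List<String> extractDates(String text) {
        List<String> dates = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            dates.add(m.group());
        }
        return dates;
    }

    public static void main(String[] args) {
        // Stand-in for the content of pad.txt from the question
        String content = "ID: 1\nName: Bob\nDate Joined: 01/12/2014\n"
                + "ID: 2\nName: Jim\nDate Joined: 8/21/1993\n"
                + "ID: 3\nName: Steve\nDate Joined: 6/07/2016\n";
        System.out.println(extractDates(content));
        // prints [01/12/2014, 8/21/1993, 6/07/2016]
    }
}
```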

Using regex to parse a string from text that includes a newline

Given the following text, I'm trying to parse out the string "TestFile" that follows "Address:" under OFFICE INFORMATION:
File: TestFile
Branch
OFFICE INFORMATION
Address: TestFile
City: L.A.
District.: 43
State: California
Zip Code: 90210
DISTRICT INFORMATION
Address: TestFile2
....
I understand that lookbehinds require zero-width so quantifiers are not allowed, meaning this won't work:
(?<=OFFICE INFORMATION\n\s*Address:).*(?=\n)
I could use this
(?<=OFFICE INFORMATION\n Address:).*
but it depends on consistent spacing, which isn't dynamic and thus not ideal.
How do I reliably parse out "TestFile" and not "TestFile2" as shown in my example above. Note that Address appears twice but I only need the first value.
Thank you

You don't really need to use a lookbehind here. Get your matched text using captured group:
(?:\bOFFICE INFORMATION\s+Address:\s*)(\S+)
captured group #1 will have value TestFile
JS Code:
var re = /(?:\bOFFICE INFORMATION\s+Address:\s*)(\S+)/;
var m;
var matches = [];
if ((m = re.exec(input)) !== null) {
if (m.index === re.lastIndex)
re.lastIndex++;
matches.push(m[1]);
}
console.log(matches);
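Since the other answers on this page are in Java, here is a rough Java equivalent of the same capture-group idea. The non-capturing wrapper is dropped because it adds nothing, and the names OfficeAddressFinder and find are invented for this sketch:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original post.
class OfficeAddressFinder {
    // \s+ spans the line break and any indentation, so spacing need not be fixed
    private static final Pattern OFFICE_ADDRESS =
            Pattern.compile("\\bOFFICE INFORMATION\\s+Address:\\s*(\\S+)");

    // Returns the first token after "Address:" in the OFFICE INFORMATION block
    static Optional<String> find(String input) {
        Matcher m = OFFICE_ADDRESS.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String input = "File: TestFile\nBranch\n\nOFFICE INFORMATION\n"
                + "   Address: TestFile\n   City: L.A.\n\n"
                + "DISTRICT INFORMATION\n   Address: TestFile2\n";
        System.out.println(find(input).orElse("(not found)"));
        // prints TestFile
    }
}
```

Because (\S+) stops at the first whitespace, an address containing spaces would be cut short; capturing with (.*) instead would take the rest of the line.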

Working with Array:
// A sample String
String questions = "File: TestFile Branch OFFICE INFORMATION Address: TestFile City: L.A. District.: 43 State: California Zip Code: 90210 DISTRICT INFORMATION Address: TestFile2";
// An array list to store split elements
ArrayList<String> arr = new ArrayList<>();
// Split based on colon and spaces.
// Including spaces resolves problems for new lines etc
for(String x : questions.split(":|\\s"))
// Ignore blank elements, so we get a clean array
if(!x.trim().isEmpty())
arr.add(x);
This will give you an array which is:
[File, TestFile, Branch, OFFICE, INFORMATION, Address, TestFile, City, L.A., District., 43, State, California, Zip, Code, 90210, DISTRICT, INFORMATION, Address, TestFile2]
Now let's analyze. Suppose you want the information corresponding to the Address element. This element is at position 5 in the array, which means element 6 is what you want.
So you would do this:
String address = arr.get(6);
This will return TestFile.
Similarly for City, element 8 is what you want. The count starts from 0. You can of course modify my matching pattern, or create a loop, to find better ways to do this task. This is just a hint.
Here is one such example loop:
// Values sit at every second index starting from 6; the tag for each value is at i-1.
// Skip the first 6 elements because they are of no real use to us.
for(int i = 6; i<(arr.size()/2)+6; i+=2)
System.out.println(arr.get(i));
This gives following output:
TestFile
L.A.
43
California
Code
Of course this loop is unrefined; refine it a little and you will get every element correctly, even the last one. Better yet, use ZipCode instead of Zip Code (no space inside a tag) and the loop works with little extra effort.
The advantage over using a direct regex: you won't have to specify a regex for every single element. Iterating over the tokens handles them all in one pass.

See this
//read input from file
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File("D:/tests/sample.txt"))));
StringBuilder string = new StringBuilder();
String line = "";
while((line = reader.readLine()) != null){
string.append(line);
string.append("\n");
}
//now string will contain the input as
/*File: TestFile
Branch
OFFICE INFORMATION
Address: TestFile
City: L.A.
District.: 43
State: California
Zip Code: 90210
DISTRICT INFORMATION
Address: TestFile2
....*/
Pattern regex = Pattern.compile("(OFFICE INFORMATION.*\\r?\\n.*Address:[ \\t]*(?<officeAddress>.*)\\r?\\n)");
Matcher regexMatcher = regex.matcher(string.toString());
while (regexMatcher.find()) {
System.out.println(regexMatcher.group("officeAddress"));//prints TestFile
}
You can see the named group officeAddress in the pattern which is needed to be extracted.

Read a Line and Fetch the value of a particular word in Java

I have a file which I read line by line in Java.
Below is the content of the file
My File contains the following characters (persons, indicated by name)
There are three characters in this line Jack = 10 Jill = 11 Jhon = 12
There are two characters in the line Jack = 14 Melissa = 15
I have to search line by line for 'Jack', fetch his value (10 in the first line, 14 in the second) and pass it to another variable. How can I achieve this?

This should get you started. I assume you know how to read a file line by line; this is a draft of what you should do for every line.
Pattern pattern = Pattern.compile("(.*Jack)\\s*=\\s*(\\d+)(.*)");
String testString = " Jack =154, Jill = 111";
Matcher matcher = pattern.matcher(testString);
if(matcher.find()) {
System.out.println(matcher.group(2));
}
These are the essentials you should know to understand what's going on: http://docs.oracle.com/javase/tutorial/essential/regex/
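Applied to the lines from the question, the same idea could be sketched as below. The lines are passed in as a list, on the assumption that the caller has already read them from the file; NameValueFinder and valuesFor are hypothetical names, and only the first occurrence of the name on each line is taken:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original post.
class NameValueFinder {
    // Collects the number after "<name> =" from every line that contains it
    static List<Integer> valuesFor(String name, List<String> lines) {
        // Pattern.quote keeps the name literal even if it has regex metacharacters
        Pattern p = Pattern.compile("\\b" + Pattern.quote(name) + "\\s*=\\s*(\\d+)");
        List<Integer> values = new ArrayList<>();
        for (String line : lines) {
            Matcher m = p.matcher(line);
            if (m.find()) {
                values.add(Integer.parseInt(m.group(1)));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "My File contains the following characters (persons, indicated by name)",
                "There are three characters in this line Jack = 10 Jill = 11 Jhon = 12",
                "There are two characters in the line Jack = 14 Melissa = 15");
        System.out.println(valuesFor("Jack", lines));
        // prints [10, 14]
    }
}
```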

parsing internal links from text in xml file

I need to get internal links present in text field of Wikinews xml file.
In my case those are coming in two formats
[[w:President of the People's Republic of China|President]]
[[People's Republic of China]]
I applied these regex patterns
internalLinks = Pattern.compile("\\[\\[w:([^|:]+)\\|.*\\]\\]").matcher(internalLinks).replaceAll("##en.wikipedia.org/wiki/$1##");
internalLinks = Pattern.compile("\\[\\[([^:|]+)\\]\\]").matcher(internalLinks).replaceAll("[[[en.wikinews.org/wiki/$1]]]");
Pattern pattern = Pattern.compile("\\[\\[\\[(.*?)\\]\\]\\]");
Matcher matcher = pattern.matcher(internalLinks);
while (matcher.find())
{
interLinks += matcher.group(1)+",";
}
Pattern pattern1 = Pattern.compile("##(.*?)##");
Matcher matcher1 = pattern1.matcher(internalLinks);
while (matcher1.find())
{
interLinks += matcher1.group(1)+",";
}
if (interLinks.length() > 0) {
interLinks = interLinks.substring(0, interLinks.length()-1);
return interLinks;
} else return "";
The problem is that it only gives me links matching the first pattern, and even then only 3-4 of them rather than all.
Here I have provided an excerpt of the text field of a document.
{{date|November 13, 2004}}
{{Brazil}}[[w:Hu Jintao|Hu Jintao]], the [[w:President of the People's Republic of China|President]] of the [[People's Republic of China]] had lunch today with the [[w:President of Brazil|President]] of [[Brazil]], [[w:Luiz Inácio Lula da Silva|Luiz Inácio Lula da Silva]], at the ''Granja do Torto'', the President's country residence in the [[w:Brazilian Federal District|Brazilian Federal District]]. Lunch was a traditional Brazilian [[w:barbecue|barbecue]] with different kinds of meat.
Some Brazilian ministers were present at the event: [[w:Antonio Palocci|Antonio Palocci]] (Economy), [[w:pt:Eduardo Campos|Eduardo Campos]] ([[w:Ministry of Science and Technology (Brazil)|Science and Technology]]), [[w:João Roberto Rodrigues|Roberto Rodrigues]] (Agriculture), [[w:pt:Luiz Fernando Furlan|Luiz Fernando Furlan]] (Development), [[w:Celso Amorim|Celso Amorim]] ([[w:Ministry of
External Relations (Brazil)|Exterior Relations]]), [[w:Dilma Rousseff|Dilma Rousseff]] (Mines and Energy). Also present were [[w:pt:Roger Agnelli|Roger Agnelli]] ([[w:Vale (mining company)|Vale do Rio Doce]] company president) and Eduardo Dutra ([[w:Petrobras|Petrobras]], government oil company, president).
This meeting is part of a new [[w:political economy|political economy]] agreement between Brazil and China where Brazil has recognized mainland China's [[w:socialist market economy|market economy]] status, and China has promised to buy more [[w:economy of Brazil|Brazilian products]].

Solution
\[\[(?:w:)?.*?\]\]
Discussion
This regex assumes that the sequence of characters ]] will not appear between [[ and ]].
So far I have not been able to find out how ]] would be escaped inside a link.
Demo
http://regexr.com?37e51
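As a hedged sketch of how this could be used from Java in a single pass, the variant below captures the optional "w:" prefix and the link target separately, then builds the same comma-separated list the question's code was aiming for. The class and method names (InternalLinks, extract) are invented, and mapping "w:" links to en.wikipedia.org and the rest to en.wikinews.org follows the question's own replacement strings:

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original post.
class InternalLinks {
    // group 1: optional "w:" prefix; group 2: target up to "|" or "]";
    // the optional "|label" part is matched but not captured
    private static final Pattern LINK =
            Pattern.compile("\\[\\[(w:)?([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

    // Returns the link URLs joined by commas, or "" when there are none
    static String extract(String text) {
        StringJoiner joined = new StringJoiner(",");
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            String host = (m.group(1) != null) ? "en.wikipedia.org" : "en.wikinews.org";
            joined.add(host + "/wiki/" + m.group(2));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        String text = "[[w:Hu Jintao|Hu Jintao]], the "
                + "[[w:President of the People's Republic of China|President]] "
                + "of the [[People's Republic of China]]";
        System.out.println(extract(text));
    }
}
```

Like the regex above, this assumes that ] never appears inside a link. Targets with a further language prefix, such as w:pt:Eduardo Campos, are returned with the "pt:" part left in place.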

I visited the download page; at the top it says:
See Meta:Data dumps for documentation on the provided data formats.
I guess they offer better parsing approaches than plain regex, so check it out.
