How can I use a regex to match multiple address lines?

How can I use a regex to match multiple address lines? - java

Given the following example text, can I use a regex to match each line of each address, and to add markers to know when one address finishes and the next begins? At present, I know how to match each entire address. I could then run a second regex to pick out the individual lines, but is it possible to achieve both these steps in one go?
Address:
Address 1 line 1,
Address 1 line 2,
Address 1 line 3
Address:
Address 2 line 1,
Address 2 line 2,
Address 2 line 3,
Address 2 line 4
Address:
Address 3 line 1,
Address 3 line 2

Here's a Pattern with the DOTALL flag on, enabling to find through multiple lines, using the "Address:" string as a delimiter:
// for test
String addresses = "Address:" + System.getProperty("line.separator")
+ "Address 1 line 1," + System.getProperty("line.separator")
+ "Address 1 line 2," + System.getProperty("line.separator")
+ "Address 1 line 3"
+ "Address:" + System.getProperty("line.separator")
+ "Address 2 line 1," + System.getProperty("line.separator")
+ "Address 2 line 2," + System.getProperty("line.separator")
+ "Address 2 line 3";
// | look behind for "Address:"
// | | any 1+ character,
// | | reluctantly quantified
// | | | lookahead for "Address:"
// | | | or end of input
// | | | | dot can mean
// | | | | line separator
Pattern p = Pattern.compile("(?<=Address:).+?(?=Address:|$)", Pattern.DOTALL);
Matcher m = p.matcher(addresses);
// iterating matches within given string, and printing
while (m.find()) {
System.out.printf("Found: %s%n%n", m.group());
}
Output
Found:
Address 1 line 1,
Address 1 line 2,
Address 1 line 3
Found:
Address 2 line 1,
Address 2 line 2,
Address 2 line 3
Note
In order to exclude the line separator after your "Address:" token from the match, you can use this refined pattern:
Pattern p = Pattern.compile("(?<=Address:"
+ System.getProperty("line.separator")+").+?(?=Address:"
+ System.getProperty("line.separator")+"|$)",
Pattern.DOTALL
);

If regex is what you want...
If you have a limited number of lines in an address (in your example 4), you could grab them with:
Address:\s*?(?:\n(.*),)?(?:\n(.*),)?(?:\n(.*),)?(?:\n(.*),)?(?:\n(.*))
Here the text Address: marks the beginning of the block and the four lines are grabbed, with the first three being optional.
(You'll need the global flags.)
regex101 example.

Related

java parse regex multiple capture groups

Hi I need to be able to handle both of these scenarios
John, party of 4
william, party of 6 dislikes John, jeff
What I want to capture is
From string 1: John, 4
From String 2: william, 6, john, jeff
I'm pretty stumped at how to achieve this
I know that ([^,])+ gives me the first group (just the name before the comma, without including the comma) but I have no clue on how to concatenate the other portion of the expression.

You may use
(\w+)(?:,\s*party of (\d+)|(?![^,]))
See the regex demo.
Details
(\w+) - Group 1: one or more word chars
(?:,\s*party of (\d+)|(?![^,])) - a non-capturing group matching
,\s*party of (\d+) - ,, then 0+ whitespaces, then party of and a space, and then Group 2 capturing 1+ digits
| - or
(?![^,]) - a location that is followed with , or end of string.
See Java demo:
String regex = "(\\w+)(?:,\\s*party of (\\d+)|(?![^,]))";
List<String> strings = Arrays.asList("John, party of 4", "william, party of 6 dislikes John, jeff");
Pattern pattern = Pattern.compile(regex);
for (String s : strings) {
System.out.println("-------- Testing '" + s + "':");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
System.out.println(matcher.group(1) + ": " + (matcher.group(2) != null ? matcher.group(2) : "N/A"));
}
}
Output:
-------- Testing 'John, party of 4':
John: 4
-------- Testing 'william, party of 6 dislikes John, jeff':
william: 6
John: N/A
jeff: N/A

Multiline non terminated regex

I came across a problem with regex parsing columns in ASCII tables.
Imagine an ASCII table like:
COL1 | COL2 | COL3
======================
ONE | APPLE | PIE
----------------------
TWO | APPLE | PIES
----------------------
THREE | PLUM- | PIES
| APRICOT |
For the first 2 entries a trivial capture regex does the deal
(?:(?<COL1>\w+)\s*\|\s*(?<COL2>\w+)\s*\|\s*(?<COL3>\w+)\s*)
However this regex captures the header, as well as it doesn't capture the 3rd line.
I can't solve following two problems :
How to exclude the header?
How to extend the COL2 capture group to capture the multiline entry PLUM-APRICOT?
Thanks for your help!

Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems. (http://regex.info/blog/2006-09-15/247)
I've assumed an input string like:
String input = ""
+ "\n" + "COL1 | COL2 | COL3"
+ "\n" + "======================"
+ "\n" + "ONE | APPLE | PIE "
+ "\n" + "----------------------"
+ "\n" + "TWO | APPLE | PIES"
+ "\n" + "----------------------"
+ "\n" + "THREE | PLUM- | PIES"
+ "\n" + " | APRICOT | ";
To split the header and the table you can use input.split("={2,}"). This returns an array of strings of the header and the table.
After trimming the table you can use table.split("-{2,}") to get the rows of the table.
All rows can be converted to arrays of cells by using row.split("\\|").
Dealing with multiline rows: Before converting the rows to cells, you can call row.split("\n") to split multiline rows.
When this split operations returns an array with more than one element, they should be split on pipes (split("\\|")) and the resulting cells should be merged.
From here its just element manipulation to get it into the format you desire.

How to extract members with regex

I have this string to parse and extract all elements between <>:
String text = "test user #myhashtag <#C5712|user_name_toto> <#U433|user_hola>";
I tried with this pattern, but it doesn't work (no result):
String pattern = "<#[C,U][0-9]+\\|[.]+>";
So in this example I want to extract:
<#C5712|user_name_toto>
<#U433|user_hola>
Then for each, I want to extract:
C or U element
ID (ie: 5712 or 433)
user name (ie: user_name_toto)
Thank you very much guys

The main problem I can see with your pattern is that it doesn't contain groups, hence retrieving parts of it will be impossible without further parsing.
You define numbered groups within parenthesis: (partOfThePattern).
From Java 7 onwards, you can also define named groups as follows: (?<theName>partOfThePattern).
The second problem is that [.] corresponds to a literal dot, not an "any character" wildcard.
The third problem is your last quantifier, which is greedy, therefore it would consume the whole rest of the string starting from the first username.
Here's a self-contained example fixing all that:
String text = "test user #myhashtag <#C5712|user_name_toto> <#U433|user_hola>";
// | starting <#
// | | group 1: any 1 char
// | | | group 2: 1+ digits
// | | | | escaped "|"
// | | | | | group 3: 1+ non-">" chars, greedy
// | | | | | | closing >
// | | | | | |
Pattern p = Pattern.compile("<#(.)(\\d+)\\|([^>]+))>");
Matcher m = p.matcher(text);
while (m.find()) {
System.out.printf(
"C or U? %s%nUser ID: %s%nUsername: %s%n",
m.group(1), m.group(2), m.group(3)
);
}
Output
C or U? C
User ID: 5712
Username: user_name_toto
C or U? U
User ID: 433
Username: user_hola
Note
I'm not validating C vs U here (gives you another . example).
You can easily replace the initial (.) with (C|U) if you only have either. You can also have the same with ([CU]).

<#([CU])(\d{4})\|(\w+)>
Where:
$1 --> C/U
$2 --> 5712/433
$3 --> user_name_toto/user_hola

Regex pattern to match certain url

I have a large text and I only want to use certain information from it. The text looks like this:
Some random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8
More random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_1_av.m3u8
More random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_2_av.m3u8
More random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_3_av.m3u8
I only want the http text. There are several of them in the text but I only need one of them. The regular expression should be "starts with http and ends with .m3u8".
I looked at the glossary of all the different expression but it is very confusing to me. I tried "/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{12,30})([\/\w \.-]*)*\/?$/" as my pattern. But is that enough?
All help is appreciated. Thank you.

Assuming your text is line-separated at every line representation in your example, here's a snippet that will work:
String text =
"Some random text here" +
System.getProperty("line.separator") +
"http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8" +
System.getProperty("line.separator") +
"More random text here" +
System.getProperty("line.separator") +
"http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8" +
System.getProperty("line.separator") +
// removed some for brevity
"More random text here" +
System.getProperty("line.separator") +
// added counter-example ending with "NOPE"
"http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.NOPE";
// Multi-line pattern:
// ┌ line starts with http
// | ┌ any 1+ character reluctantly quantified
// | | ┌ dot escape
// | | | ┌ ending text
// | | | | ┌ end of line marker
// | | | | |
Pattern p = Pattern.compile("^http.+?\\.m3u8$", Pattern.MULTILINE);
Matcher m = p.matcher(text);
while (m.find()) {
System.out.println(m.group());
}
Output
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8
Edit
For a refined "filter" by the "index_x" file of the URL, you can simply add it in the Pattern between the protocol and ending of the line, e.g.:
Pattern.compile("^http.+?index_0.+?\\.m3u8$", Pattern.MULTILINE);

I didn't test it, but this should do the trick:
^(http:\/\/.*\.m3u8)

It is the answer of #capnibishop, but with a little change.
^(http://).*(/index_1)[^/]*\.m3u8$
Added the missing "$" sign at the end. This ensures it matches
http://something.m3u8
and not
http://something.m3u81
Added the condition to match index_1 at the end of the line, which means it wil match:
http://something/index_1_something_else.m3u8
and not
http://something/index_1/something_else.m3u8

How to replace multiple words with space in a string using Java

I tried to replace a list of words from a give string with the following code.
String Sample = " he saw a cat running of that pat's mat ";
String regex = "'s | he | of | to | a | and | in | that";
Sample = Sample.replaceAll(regex, " ");
The output is
[ saw cat running that pat mat ]
// minus the []
It still has the last word "that". Is there anyway to modify the regex to consider the last word also.

Try:
String Sample = " he saw a cat running of that pat's mat remove 's";
String resultString = Sample.replaceAll("\\b( ?'s|he|of|to|a|and|in|that)\\b", "");
System.out.print(resultString);
saw cat running pat mat remove
DEMO
http://ideone.com/Yitobz

The problem is that you have consecutive words that you are trying to replace.
For example, consider the substring
[ of that ]
while the replaceAll is running, the [ of ] matches
[ of that ]
^ ^
and that will be replaced with a (space). The next character to match is t, not a space expected by
... | that | ...
What I think you can do to fix this is add word boundaries instead of spaces.
String regex = "'s\\b|\\bhe\\b|\\bof\\b|\\bto\\b|\\ba\\b|\\band\\b|\\bin\\b|\\bthat\\b";
or the shorter version as shown in Tuga's answer.

it doesn't work, because you delete the " of " part first and then there is no space before the "that" word, because you deleted it (replaced)
you can change in two ways:
String regex = "'s | he | of| to | a | and | in | that";
or
String regex = "'s | he | of | to | a | and | in |that ";
or you just call Sample = Sample.replaceAll(regex, " "); again

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How can I use a regex to match multiple address lines? - java

Related

java parse regex multiple capture groups

Multiline non terminated regex

How to extract members with regex

Regex pattern to match certain url

How to replace multiple words with space in a string using Java

Categories

Resources