Parsing HTML with Regular Expressions?

I've been trying to gather information using regular expressions:
Pattern hp = Pattern.compile("<small>.....</small>");
Matcher mp = hp.matcher(code);
while (mp.find()) {
    String grupoHORARIO = mp.group();
    System.out.println(grupoHORARIO);
}
When I run the program, instead of showing me:
RESULT1
RESULT2
RESULT3
It shows this:
<small>RESULT1</small>
<small>RESULT2</small>
As you see, it shows the opening and closing "small" tags before and after the word I am looking for.
What I need is just the word, without the "small" tags around it.

USING REGEX TO PARSE HTML IS BAD.
Again, using RegEx to parse HTML is bad.
That being said, to answer your question: the problem is the regular expression itself. The only code of yours I would change is the argument to Pattern.compile(). As written, the pattern only matches <small>, then exactly 5 characters, then </small>, and the match includes the start and end tags.
If you want to match only the middle part, you can try using regex lookaround. The way I did it is: (?<=<small>).*(?=</small>). Broken into parts:
.* - Any number of characters.
.*(?=</small>) - Any number of characters that are followed by </small>.
(?<=<small>).*(?=</small>) - Any number of characters that are preceded by <small> and followed by </small>.
If you don't want it to match any character, replace the .* with whatever you do want to find (for example, ..... or .{5} will match exactly 5 characters). Note that .* is greedy: if several <small> elements can appear on the same line, use the lazy form .*? so that each match stops at the first </small>.
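As a rough illustration of the lookaround approach, here is a minimal sketch. The class and method names are made up for this example, and it uses the lazy .*? variant on the assumption that several <small> elements may share a line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question's code.
class SmallTagExtractor {
    // Lookbehind and lookahead assert the tags without including them in the match.
    private static final Pattern INNER_TEXT =
            Pattern.compile("(?<=<small>).*?(?=</small>)");

    static List<String> extract(String html) {
        List<String> results = new ArrayList<>();
        Matcher m = INNER_TEXT.matcher(html);
        while (m.find()) {
            results.add(m.group()); // only the text between the tags
        }
        return results;
    }

    public static void main(String[] args) {
        String code = "<small>RESULT1</small>\n<small>RESULT2</small><small>RESULT3</small>";
        for (String s : extract(code)) {
            System.out.println(s);
        }
    }
}
```

A capturing group, <small>(.*?)</small> read with group(1), would give the same text without lookaround; which one to use is mostly a matter of taste.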

Related

Regular Expressions - Match Character Not Between Two Strings

I've read many questions that ask about finding a regular expression to match characters between two strings, but my problem is the inverse. I'm attempting to create an expression that will match characters NOT between two strings.
Consider the following string.
This is short & [tag]fun & interesting[/tag].
I want to replace any ampersand character that is NOT inside the tag elements with the symbol #. The result should be as shown below.
This is short # [tag]fun & interesting[/tag].
I tried the following regular expression, but unfortunately, it matches the ampersand inside the tag elements.
/(?<!\[tag\])&(?!\[\/tag\])/g
I understand that it matches that ampersand because it's surrounded by characters on either side in the string. But I can't add a random number of characters to check because the lookbehind and lookahead must be fixed length.
Is there a regular expression that will accomplish what I want here?

This does the job even with nested tag:
Find: \[(\w+)\].+?\[/\1\](*SKIP)(*FAIL)|&
Replace: #
How it works:
\[(\w+)\].+?\[/\1\] is trying to match opening and closing tag with some data inside
(*SKIP)(*FAIL) if tag is found, then discard it
| else
& match an ampersand. At this point, we are sure it is not inside a tag.
Unfortunately, this doesn't work in Java, because (*SKIP)(*FAIL) are PCRE backtracking verbs that java.util.regex does not support. The Java requirement was only added after I answered.
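For Java, one possible workaround (a sketch, not the answerer's method) is to keep the same alternation but decide in code what to do with each match: if the tag branch matched, put the text back unchanged, otherwise replace the ampersand. This assumes Java 9 or later for Matcher.replaceAll with a function argument; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper emulating (*SKIP)(*FAIL) with a replacement function.
class AmpersandReplacer {
    // Branch 1: a whole [tag]...[/tag] block (group 1 is the tag name).
    // Branch 2: a bare ampersand, which can only match outside such a block.
    private static final Pattern TAG_OR_AMP =
            Pattern.compile("\\[(\\w+)\\].+?\\[/\\1\\]|&");

    static String replaceOutsideTags(String input) {
        return TAG_OR_AMP.matcher(input).replaceAll(m ->
                m.group(1) != null
                        ? Matcher.quoteReplacement(m.group()) // tag block: keep as-is
                        : "#");                               // bare ampersand: replace
    }

    public static void main(String[] args) {
        System.out.println(replaceOutsideTags("This is short & [tag]fun & interesting[/tag]."));
    }
}
```

Matcher.quoteReplacement is there so that any $ or backslash inside a tag block is not interpreted as a group reference.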

Extract words (multiple whitespace) starting with # by regular expression

I have a problem with my regular expression:
String regex = "(?<=[\\s])#\\w+\\s";
I want a regex that formats a string like this:
"This is a Text #tag1 #tag2 #tag3"
With this regular expression, I get the last two values as the result but not tag1, because it is preceded by more than one whitespace character. But I want all three of them!
I tried some variations, but nothing worked.

Use this regular expression:
(?<=(^|\\S)\\s)#\\w+(?=\\s|$)

It's a bit unclear from your question what you're really after, so I've put up some simple alternatives:
To capture all the tags in the string, we can use a lookbehind:
((?<=\\s|^)#\\w+)
To capture all the tags at the end of the string, we can use a lookahead:
(#\\w+(?=\\s#)|#\\w+$)
If there's always three tags at the end, there's no need for a lookaround:
(#\\w+)\\s(#\\w+)\\s(#\\w+)$
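The first alternative above (capture every tag with a lookbehind) could be used roughly like this. It is only a sketch: the class and method names are invented for the example, and the sample string deliberately has several spaces before the first tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for collecting #tags from a string.
class HashtagFinder {
    // A tag must start at the beginning of the input or right after whitespace.
    private static final Pattern TAG = Pattern.compile("(?<=\\s|^)#\\w+");

    static List<String> findTags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(findTags("This is a Text   #tag1 #tag2 #tag3"));
    }
}
```

Because the lookbehind only inspects the single character before the #, the amount of whitespace in front of a tag does not matter.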

Possible Regular Expression Question

I have a simple program that looks up details of an IP address you give it. Here is an example of some of my code:
int regIndex = src.indexOf("Region:") + 16;
int endIndex = src.indexOf("<", regIndex);
String region = src.substring(regIndex, endIndex);
if(regIndex == 15) region = "None";
int counIndex = src.indexOf("Country:") + 17;
int couneIndex = src.indexOf(" <", counIndex);
String country = src.substring(counIndex, couneIndex);
As you can see, it is definitely not the most efficient way to do this. The website I am using gives the information like this: http://whatismyipaddress.com/ip/1.1.1.1
I have never really used Regular Expressions before, but it seems to me like there might be one that could really make this more efficient and easier to program, but I've been looking around and I'm pretty lost.
So basically my question is, how could I use a Regular Expression for this (Or if there is another more efficient way).
Any help would be great,
Thanks :)

You can do something like this:
String s = "bla Country: Australia <bla";
Pattern pattern = Pattern.compile("Country: (.*) [<]");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println("Country = " + matcher.group(1));
}

The source would look like this
<tr><th>Country:</th><td>Australia <img src="http://whatismyipaddress.com/images/flags/au.png" alt="au flag"> </td></tr>
To use regular expression means to match a pattern.
The pattern that marks your wanted data is pretty straightforward: Country:. You also need to match the tags that follow it, </th><td>. The forward slash is escaped here as \/, which is required in languages that delimit regexes with slashes; in Java it is optional but harmless. Then comes the data you are looking for. I would suggest matching everything that is not a <, so [^<]. This is a negated character class, meaning any character that is not a <. To repeat it, add a + after it, meaning one or more of the preceding item.
So, the complete thing should look like this:
Country:<\/th><td>\s*([^<]+)\s*<
I also added parentheses here. They form a capturing group, so your result can be found in capturing group 1. I also added \s*, a whitespace character repeated 0 or more times, to match whitespace before and after your data, which I assume you don't want. Because [^<]+ is greedy, trailing whitespace can still end up inside the group, so trim the result if that matters.
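In Java, the pattern above might be applied as in the following sketch. The class and method names are hypothetical, the sample HTML is a shortened version of the row shown earlier, and the trim() call is an addition to drop whitespace that the greedy [^<]+ pulls into the group.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the page layout it relies on is taken from the sample row above.
class CountryExtractor {
    // Same regex as above, written as a Java string (backslashes doubled).
    private static final Pattern COUNTRY =
            Pattern.compile("Country:</th><td>\\s*([^<]+)\\s*<");

    static Optional<String> findCountry(String html) {
        Matcher m = COUNTRY.matcher(html);
        // group(1) is the text between <td> and the next tag
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String src = "<tr><th>Country:</th><td>Australia <img src=\"au.png\" alt=\"au flag\"> </td></tr>";
        System.out.println(findCountry(src).orElse("None"));
    }
}
```

Returning an Optional avoids the magic index comparison from the original code when the field is missing.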

Firstly, there are some online sites that can help you develop a regular expression. They let you enter some text and a regular expression, and then show you the result of applying the expression to the text. This saves you from having to write code as you develop the expression and expand your understanding. A good site I use a lot is FileFormat regex, because it allows me to test one expression against multiple test strings. A quick search also brought up RegexPlanet, RegExr and RegexPal. There are lots of others.
In terms of resources, the Java Pattern class reference is useful for Java development and I quite like regular-expressions.info as well.
For your problem I used fileformat.info and came up with this regex to match "http://whatismyipaddress.com/ip/1.1.1.1":
.*//([.\w]+)/.*/(\d+(?:\.\d+){3})
or as a Java string:
".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})"
A quick breakdown: anything (.*), followed by two slashes (//), followed by one or more dots or word characters (([.\w]+)), followed by a slash, any number of characters and another slash (/.*/), followed by at least one digit (\d+), followed by 3 sets of a dot and at least one digit ((?:\.\d+){3}). The dot is escaped as \. so that it matches a literal dot rather than any character. The parentheses around the server name part and the IP part are called capturing groups, and you can use methods on the Java Matcher class to return the contents of these sections. The ?: in the second part of the IP address says that the parentheses only group the characters and are not to be treated as a capturing group.
This regex is not as strict or as flexible as it should be, but it's a starting point.
All of this can be researched on the above links.
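To show how the capturing groups are read back in Java, here is a small sketch using the regex from this answer with the dot escaped. The class and method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that splits the lookup URL into host and IP.
class IpUrlParser {
    private static final Pattern URL =
            Pattern.compile(".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})");

    // Returns {host, ip}, or an empty array if the URL does not match.
    static String[] parse(String url) {
        Matcher m = URL.matcher(url);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("http://whatismyipaddress.com/ip/1.1.1.1");
        System.out.println("host = " + parts[0] + ", ip = " + parts[1]);
    }
}
```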

How do I write a regular expression to find the following pattern?

I am trying to write a regular expression to do a find and replace operation. Assume Java regex syntax. Below are examples of what I am trying to find:
12341+1
12241+1R1
100001+1R2
So, I am searching for a string beginning with one or more digits, followed by a "1+1" substring, followed by 0 or more characters. I have the following regex:
^(\d+)(1\\+1).*
This regex will successfully find the examples above, however, my goal is to replace the strings with everything before "1+1". So, 12341+1 would become 1234, and 12241+1R1 would become 1224. If I use the first grouped expression $1 to replace the pattern, I get the wrong result as follows:
12341+1 becomes 12341
12241+1R1 becomes 12241
100001+1R2 becomes 100001
Any ideas?

Your existing regex works fine, just that you are missing a \ before \d
String str = "100001+1R2";
str = str.replaceAll("^(\\d+)(1\\+1).*","$1");

IMHO, the regex is correct.
Perhaps you wrote it wrong in the code. If you want to code the regex ^(\d+)(1\+1).* in a string, you have to write something like String regex = "^(\\d+)(1\\+1).*".
Your output is the result of ^(\d+)(1+1).* replacement, as you miss some backslash in the string (e.g. "^(\\d+)(1\+1).*").
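A quick way to check the escaping is to run the three inputs from the question through replaceAll with the doubled backslashes. This is only a sketch; the class and method names are invented for the example.

```java
// Hypothetical demo class for the escaping discussed above.
class OnePlusOneStripper {
    // Regex ^(\d+)(1\+1).* written as a Java string literal:
    // every backslash in the regex is doubled.
    static String strip(String s) {
        return s.replaceAll("^(\\d+)(1\\+1).*", "$1");
    }

    public static void main(String[] args) {
        for (String s : new String[] { "12341+1", "12241+1R1", "100001+1R2" }) {
            System.out.println(s + " becomes " + strip(s));
        }
    }
}
```

The greedy \d+ first swallows the trailing 1, then backtracks by one digit so that the literal "1+1" can match, which is why group 1 ends up without it.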

Your regex looks fine to me - I don't have access to java but in JavaScript the code..
"12341+1".replace(/(\d+)(1\+1)/g, "$1");
Returns 1234 as you'd expect. This works on a string with many 'codes' in too e.g.
"12341+1 54321+1".replace(/(\d+)(1\+1)/g, "$1");
gives 1234 5432.

Personally, I wouldn't use a Regex at all (it'd be like using a hammer on a thumbtack), I'd just create a substring from (Pseudocode)
stringName.substring(0, stringName.indexOf("1+1"))
But it looks like other posters have already mentioned the non-greedy operator.
In most regex syntaxes you can add a '?' after a '+' or '*' to make it lazy, meaning it matches as little as possible before moving on in the pattern. Thus ^(\d+?)(1\+1) matches as few digits as possible up to the first "1+1", so group 1 does not include the "1+1".
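The substring idea from this answer could look like the following in real Java. It is a sketch: the class and method names are hypothetical, and returning the input unchanged when "1+1" is absent is an assumption, since the question does not say what should happen in that case.

```java
// Hypothetical regex-free alternative.
class PrefixCutter {
    static String beforeOnePlusOne(String s) {
        int idx = s.indexOf("1+1");
        // indexOf returns -1 when the marker is missing; keep the input as-is then
        return idx >= 0 ? s.substring(0, idx) : s;
    }

    public static void main(String[] args) {
        System.out.println(beforeOnePlusOne("12241+1R1"));
    }
}
```

Unlike the anchored regex, this does not check that the prefix consists only of digits.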

Linkify text with regular expressions in Java

I have a wysiwyg text area in a Java webapp. Users can input text and style it or paste some already HTML-formatted text.
What I am trying to do is to linkify the text. This means, converting all possible URLs within text, to their "working counterpart", i.e. adding < a href="...">...< /a>.
This solution works when all I have is plain text:
String r = "http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
Pattern pattern = Pattern.compile(r, Pattern.DOTALL | Pattern.UNIX_LINES | Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(comment);
comment = matcher.replaceAll("<a href=\"$0\">$0</a>"); // group 0 is the whole expression
But the problem is when there is some already formatted text, i.e. that it already has the < a href="...">...< /a> tags.
So I am looking for some way for the pattern not to match whenever it finds the text between two HTML tags (< a>). I have read this can be achieved with lookahead or lookbehind but I still can't make it work. I am sure I am doing it wrong because the regex still matches. And yes, I have been playing around/ debugging groups, changing $0 to $1 etc.
Any ideas?

You are close. You can use a "negative lookbehind" like so:
(?<!href=")http:// etc
All results preceded by href will be ignored.

If you want to use regex, (though I think parsing to XML/HTML first is more robust) I think look-ahead or -behind makes sense. A first stab might be to add this at the end of your regex:
(?!</a>)
Meaning: don't match if there's a closing a tag just afterwards. (This could be tweaked forever, of course.) This doesn't work well, though, because given the string
http://example.com/
followed by a closing </a> tag, this regex will try to match "http://example.com/", fail due to the lookahead (as we hope), and then backtrack the greedy quantifier on the end and match "http://example.com" instead, which doesn't have a </a> directly after it.
You can fix the latter problem by using possessive quantifiers for your +, * and ? operators: just stick a + after them. This prevents them from backtracking. This is probably good for performance as well.
This works for me (note the three extra +'s):
String r = "http(s)?://([\\w+?\\.\\w+])++([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*+)?+(?!</a>)";

If you really want to do it with regex, then:
String r = "(?<![=\"\\/>])http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
i.e., check that the URL is not directly preceded by =, ", / or >.
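A minimal sketch of this lookbehind approach follows. The URL pattern is deliberately simplified compared to the one in the question, and the class and method names are invented for the example, so treat it as an illustration rather than a complete linkifier.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the URL pattern is a simplified assumption.
class Linkifier {
    // Negative lookbehind: skip URLs directly preceded by =, ", / or >,
    // which covers href="..." attributes and existing link text.
    private static final Pattern BARE_URL = Pattern.compile(
            "(?<![=\"/>])https?://[\\w.-]+(?:/[\\w./?%&=+#~-]*)?");

    static String linkify(String text) {
        // $0 is the whole match, used as both the href and the link text
        return BARE_URL.matcher(text).replaceAll("<a href=\"$0\">$0</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("Visit http://example.com/docs today"));
        System.out.println(linkify("<a href=\"http://example.com/\">http://example.com/</a>"));
    }
}
```

The lookbehind sits at the start of the match, so the backtracking problem described for the trailing lookahead does not arise here. It will still miss cases such as a URL that follows some other tag's closing >, which is one reason an HTML parser is the more robust choice.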

Perhaps html parsing will be more appropriate for you (htmlparser for example). Then you could have html nodes and only "linkify" links in the text and not in the attributes.

If you have to roll your own, at least look at the algorithms/patterns used in an Open Source implementation of Markdown, e.g., MarkdownJ.
