combined regex parse - java

I want to capture the value attribute of an HTML tag.
It could look like this:
value='3242312-3245-3245,234:3245:324,asdf asdf,asdf asd'>
or like this:
value=358 >
and maybe this:
value=83 selected='selected'>
I tried:
Pattern.compile("value=[[\'(.+)\'][(0-9)\\s]]")
but with no success.
Any idea what pattern I should use?

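This expression should work. Note that Java does not allow two groups to share a name, so the quoted and unquoted alternatives are named separately:
(?<field>\w+)=(?:'(?<quoted>[^']*)'|(?<bare>\S+))
You can test here using the same expression without group names (since the tool doesn't support them):
(\w+)=(('([^']*)')|(\S+))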
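A minimal Java sketch of the idea above, limited to the value attribute. The class name ValueAttribute and the group names quoted and bare are made up for illustration. The unquoted branch uses [^\s>]+ instead of \S+ so that a closing > directly after the number is not captured; that is an assumption about the input, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ValueAttribute {
    // "quoted" captures a single-quoted value, "bare" an unquoted one.
    // Java rejects duplicate group names, hence two different names.
    private static final Pattern VALUE =
            Pattern.compile("\\bvalue=(?:'(?<quoted>[^']*)'|(?<bare>[^\\s>]+))");

    /** Returns the first value=... attribute value, or null if there is none. */
    static String extract(String tag) {
        Matcher m = VALUE.matcher(tag);
        if (!m.find()) {
            return null;
        }
        String quoted = m.group("quoted");
        return quoted != null ? quoted : m.group("bare");
    }

    public static void main(String[] args) {
        System.out.println(extract("value='3242312-3245-3245,234:3245:324,asdf asdf,asdf asd'>"));
        System.out.println(extract("value=358 >"));
        System.out.println(extract("value=83 selected='selected'>"));
    }
}
```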

Related

Regex: Read value between multiple brackets

I am currently working on translating a website (Smarty) with Poedit. To get all the text from the .tpl files I'm using a regex to get the data between {t} and {/t}. An example:
{t}Password incorrect, please try again{/t}
The regex reads "Password incorrect, please try again" and places it in a .po file. This all works fine. It goes wrong when things get a little more advanced.
Sometimes the text between the {t} tags uses parameters. That looks like this:
{t 1=$email|escape 2=$mailbox}No $1 given, please check your $2{/t}
This is also working great.
The real problem starts when I use braces inside a parameter, like this:
{t 1={site info='name'} 2=$mailbox}visit %1 or go to your %2{/t}
My regex stops at the first closing brace, so the result is 2=$mailbox}visit %1 or go to your %2.
My regex looks like this:
\{t.*?\}?[}]([^\{]+)\{\/t\}|\{t\}([^\{]+)\{\/t\}
The regex is used inside a Java program.
Does anybody have a way to fix this problem?
The easiest solution I see is to normalize the .tpl files. Just use a regex that matches all tags, something like this one:
{[^}]*[^{]*}
I had the same issue to solve, and normalizing worked pretty well.
The normalizing method would look like this:
// Matches a template tag such as {t ...} or {/t}.
final String regex = "\\{[^\\}]*[^\\{]*\\}";

// Removes every matched tag from the content.
private String normalizeContent(String content) {
    return content.replaceAll(regex, "");
}
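An alternative to normalizing, sketched here under assumptions, is to let the opening tag tolerate one level of nested braces and capture the text directly. This is not the method from the answer above; the class name SmartyTexts is hypothetical, and the pattern assumes parameters never nest more than one {...} deep and that the translatable text itself contains no {.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SmartyTexts {
    // \{t                              start of the opening tag
    // (?:\s(?:[^{}]|\{[^{}]*\})*)?     optional parameters: plain characters
    //                                  or a single nested {...}
    // \}                               end of the opening tag
    // ([^{]+)                          the translatable text (group 1)
    // \{/t\}                           closing tag
    private static final Pattern T_BLOCK = Pattern.compile(
            "\\{t(?:\\s(?:[^{}]|\\{[^{}]*\\})*)?\\}([^{]+)\\{/t\\}");

    /** Returns the text of every {t}...{/t} block in the template. */
    static List<String> extract(String template) {
        List<String> texts = new ArrayList<>();
        Matcher m = T_BLOCK.matcher(template);
        while (m.find()) {
            texts.add(m.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "{t 1={site info='name'} 2=$mailbox}visit %1 or go to your %2{/t}"));
    }
}
```

Because the two alternatives inside the parameter group start with different characters, the pattern does not backtrack excessively on long tags.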

Selecting lines matching keywords using XPATH

I want to print all lines that start with a "+" and contain a keyword such as "hasRole".
String search="//td[contains(@class,'blob-code blob-code-addition') and contains(text(),'hasRole')]";
I know a simple "and" condition will not be enough. How do I formulate the XPath search for this?
Here's a screenshot.
Also, how do I make this search case-insensitive?
You can use something like this (lower-case() requires XPath 2.0, which Java's built-in javax.xml.xpath does not support):
//td[(contains(lower-case(@class), 'blob-code blob-code-addition') ...
OR, for XPath 1.0:
//td[(contains(translate(@class, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), 'blob-code blob-code-addition') ...
I tried the below and it worked.
String search="//td[(contains(@class,'blob-code blob-code-addition') and contains(.,'HasRole')) or (contains(@class,'blob-code blob-code-deletion') and contains(.,'HasRole'))]";
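The pieces above can be combined into one case-insensitive search with the standard javax.xml.xpath API, which implements XPath 1.0, so translate() is used instead of lower-case(). This is only a sketch: the class name DiffLineSearch is hypothetical, the input is assumed to be well-formed XHTML, and the keyword is assumed to contain no apostrophe (it is pasted into the expression as a literal).

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DiffLineSearch {
    private static final String UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final String LOWER = "abcdefghijklmnopqrstuvwxyz";

    /** Counts added or deleted cells whose text contains the keyword, ignoring case. */
    static int countMatches(String xhtml, String keyword) {
        // Lower-case the cell text inside XPath, and the keyword in Java.
        String cellText = "translate(., '" + UPPER + "', '" + LOWER + "')";
        String search = "//td[(contains(@class, 'blob-code-addition')"
                + " or contains(@class, 'blob-code-deletion'))"
                + " and contains(" + cellText + ", '"
                + keyword.toLowerCase(Locale.ROOT) + "')]";
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList cells = (NodeList) xpath.evaluate(
                    search, new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            return cells.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + search, e);
        }
    }
}
```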

How to use java regex to filter xml file

I have a Java string containing XML, and I am trying to use a Java regex to filter out all the junk between the words so that they form a phrase enclosed in brackets, e.g. [DEFENDANT CITY].
I want to go from this:
<w:p><w:r><w:t>[</w:t></w:r><st1:PlaceName w:st="on"><w:r><w:t>DEFENDANT</w:t></w:r>
</st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r><w:t>CITY</w:t></w:r>
</st1:PlaceType><w:r><w:t>], [</w:t></w:r><st1:place w:st="on"><st1:PlaceName w:st="on"><w:r>
<w:t>DEFENDANT</w:t></w:r></st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r>
<w:t>STATE</w:t></w:r></st1:PlaceType></st1:place><w:r><w:t>] [DEFENDANT ZIP]</w:r><w:r>
to this:
<w:p><w:r><w:t>[DEFENDANT CITY], [DEFENDANT STATE] [DEFENDANT ZIP]</w:r><w:r>
I have been testing extensively with regex expressions like (\[)<.+>+([A-Z ]+\]) on RegexPlanet, to no avail.
Do not use regex to parse XML. Just use the built-in Java XML library.
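As a rough illustration of the parser-based approach, the sketch below uses the JDK's DOM parser to concatenate the text of every w:t element in document order. It assumes the input is a complete, well-formed XML document (the fragment in the question is truncated, so it would need its missing closing tags first); the class name WordTextJoiner and the namespace URIs in the usage example are placeholders.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class WordTextJoiner {
    /** Concatenates the text of every <w:t> element in document order. */
    static String joinText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Matches on the qualified tag name, prefix included.
            NodeList runs = doc.getElementsByTagName("w:t");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < runs.getLength(); i++) {
                out.append(runs.item(i).getTextContent());
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder namespace URIs; a real document declares its own.
        String xml = "<w:p xmlns:w='urn:example:w' xmlns:st1='urn:example:st1'>"
                + "<w:r><w:t>[</w:t></w:r>"
                + "<st1:PlaceName w:st='on'><w:r><w:t>DEFENDANT</w:t></w:r></st1:PlaceName>"
                + "<w:r><w:t> </w:t></w:r>"
                + "<st1:PlaceType w:st='on'><w:r><w:t>CITY</w:t></w:r></st1:PlaceType>"
                + "<w:r><w:t>]</w:t></w:r></w:p>";
        System.out.println(joinText(xml));
    }
}
```

The joined string can then be written back into a single w:t run if the surrounding markup needs to be kept.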
If it's all on a single line, like this:
<w:p><w:r><w:t>[</w:t></w:r><st1:PlaceName w:st="on"><w:r><w:t>DEFENDANT</w:t></w:r></st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r><w:t>CITY</w:t></w:r></st1:PlaceType><w:r><w:t>], [</w:t></w:r><st1:place w:st="on"><st1:PlaceName w:st="on"><w:r><w:t>DEFENDANT</w:t></w:r></st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r><w:t>STATE</w:t></w:r></st1:PlaceType></st1:place><w:r><w:t>] [DEFENDANT ZIP]</w:r><w:r>
Then this regex should work:
([<\w:\w>]+)(\[[</\w:\w>]+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\s</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\],\s\[</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\s</w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+></\w+:\w+><\w:\w><\w:\w>\]\s\[)(\w+\s\w+)(\])(</\w:\w><\w:\w>)
I have a working example here: RegExr
I could have grouped things a little better, but overall, it gets the job done, so you should be able to see it working.
Also, if it's not on a single line (if it's like it is in your example), then this would work:
([<\w:\w>]+)(\[[</\w:\w>]+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w>\s+</\w+:\w+><\w:\w><\w:\w>\s</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w>\s+</\w+:\w+><\w:\w><\w:\w>\],\s\[</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w+:\w+\s\w:\w+="\w+"><\w:\w>\s+<\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\s</w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w>\s+<\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+></\w+:\w+><\w:\w><\w:\w>\]\s\[)(\w+\s\w+)(\])(</\w:\w><\w:\w>)
You can see that on RegExr here.

Need a regex expression to get value between two tags

I need a regular expression to extract the values between > and <, as in >xxxxx<. Can anybody help me with this?
<ChangeID type="String">C10286</ChangeID>
<ChangeID type="String">C10296</ChangeID>
Is it possible to get the two values in a comma separated format like C10286,C10296 in a single regex expression?
Thanks and Regards
Riyas Hussain A
Try this:
(?<=>)[^<]*
Test it with grep -Po:
kent$ echo '<ChangeID type="String">C10286</ChangeID>
<ChangeID type="String">C10296</ChangeID>'|grep -Po '(?<=>)[^<]*'
C10286
C10296
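A regex on its own only finds the individual values; producing the comma-separated form takes a little surrounding code. A hedged Java sketch of that, with the hypothetical class name TagValues, is shown below. It uses [^<]+ rather than [^<]* and skips blank matches, because in Java the whole string is searched at once and the line break between the two elements would otherwise be returned as a value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagValues {
    // Text that directly follows '>' and runs up to the next '<'.
    private static final Pattern BETWEEN_TAGS = Pattern.compile("(?<=>)[^<]+");

    /** Returns all non-blank element values joined with commas. */
    static String joinValues(String xml) {
        List<String> values = new ArrayList<>();
        Matcher m = BETWEEN_TAGS.matcher(xml);
        while (m.find()) {
            String value = m.group().trim();
            if (!value.isEmpty()) { // skip whitespace between elements
                values.add(value);
            }
        }
        return String.join(",", values);
    }

    public static void main(String[] args) {
        String xml = "<ChangeID type=\"String\">C10286</ChangeID>\n"
                + "<ChangeID type=\"String\">C10296</ChangeID>";
        System.out.println(joinValues(xml));
    }
}
```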
My idea would be to look up all words and exclude the ones we don't need (in case you have more than one value inside your tag):
(?!ChangeID\b)(?!type\b)(?!String\b)\b\w+
You can try it out on : http://regexpal.com/

Match all urls in a domain except queries

I wish to match (with a Java regex) all URLs belonging to a certain domain, except the ones that contain a query string.
For example, I wish to match
http://www.thehindu.com/arts/music/marrying-keys-to-chips/article4061904.ece
But avoid
http://www.thehindu.com/arts/music?article=23417
I tried the following, but it allows both of the above patterns.
+^http://www\.thehindu\.com([^\?=])*
What about
if (yourString.matches("(http://)?www\\.thehindu\\.com[^\\?=]*")) {
    // match --> doesn't look like a query
} else {
    // no match --> looks like a query or a completely different URL
}
Try this:
(^|\s)http:\/\/www\.thehindu\.com([^\?])*(\s|$)
Here (^|\s) and (\s|$) are the delimiters you expect between URLs. Add more to them if you need.
I suppose a regexp isn't required; just check whether the URL contains a question mark (?).
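The regex-free check can be sketched with java.net.URI, which separates the host from the query string. The class and method names here are made up for illustration, and the sketch assumes the URLs are syntactically valid; anything that fails to parse is treated as a non-match.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UrlFilter {
    /** True if the URL is on the given host and carries no query string. */
    static boolean isPlainDomainUrl(String url, String host) {
        try {
            URI uri = new URI(url);
            // getHost() may be null for relative URLs; equalsIgnoreCase(null) is false.
            return host.equalsIgnoreCase(uri.getHost()) && uri.getRawQuery() == null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String host = "www.thehindu.com";
        System.out.println(isPlainDomainUrl(
                "http://www.thehindu.com/arts/music/marrying-keys-to-chips/article4061904.ece", host));
        System.out.println(isPlainDomainUrl(
                "http://www.thehindu.com/arts/music?article=23417", host));
    }
}
```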
