Regex string modifications

Regex string modifications - java

I have the following string and want to extract the part number MBRB1045T4G from it with a regular expression in Java. How would I achieve that?
String:
<p class="ref">
<b>Mfr Part#:</b>
MBRB1045T4G<br>
<b>Technologie:</b>
Tab Mount<br>
<b>Bauform:</b>
D2PAK-3<br>
<b>Verpackungsart:</b>
REEL<br>
<b>Standard Verpackungseinheit:</b>
800<br>

As Wrikken correctly says, HTML can't be parsed correctly by regex in the general case. However, it seems you're looking at an actual website and want to scrape some of its contents. In that case, assuming the whitespace and formatting of the HTML code don't change, you can use a regex like this:
Mfr Part#:</b>\s*([^<]+)<br>
And collect the first capture group like so (where string is your HTML). Note that the backslash has to be doubled inside a Java string literal, and that find() is needed rather than matches(), because matches() only succeeds when the entire input matches the pattern:
Pattern pt = Pattern.compile("Mfr Part#:</b>\\s*([^<]+)<br>");
Matcher m = pt.matcher(string);
if (m.find())
    System.out.println(m.group(1).trim());
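
Put together as a self-contained program, the approach above might look like the following sketch. The class name PartNumberScraper and the method extractPartNumber are made up for illustration, the sample HTML is taken from the question, and the code assumes the page layout stays as shown. It uses find() so that the pattern may occur anywhere in the input.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartNumberScraper {
    // Label, optional whitespace, then everything up to the next tag.
    private static final Pattern PART =
            Pattern.compile("Mfr Part#:</b>\\s*([^<]+)<br>");

    // Hypothetical helper: returns the part number if the label is present.
    static Optional<String> extractPartNumber(String html) {
        Matcher m = PART.matcher(html);
        // find() scans for the pattern anywhere in the input;
        // matches() would require the whole input to match.
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<p class=\"ref\">\n<b>Mfr Part#:</b>\nMBRB1045T4G<br>\n"
                + "<b>Technologie:</b>\nTab Mount<br>\n";
        System.out.println(extractPartNumber(html).orElse("not found"));
    }
}
```

Returning an Optional keeps the "label not found" case explicit instead of letting group(1) throw.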

Related

RegEx for matching between any two HTML tags

I have the following content :
<div class="TEST-TEXT">hi</span>
first young CEO's TEST-TEXT
<span class="test">hello</span>
I am trying to match the TEST-TEXT string in order to replace its value, but only when it appears as text and not inside an attribute value.
I have looked into look-ahead and look-behind in regex, but the problem is that the look-behind needs a fixed width for the match. The question regex-match-all-characters-between-two-html-tags shows a very similar case, except that it relies on a span with a class to create the match.
I also checked regex-match-attribute-in-a-html-code.
Here are the two regular expressions I am trying:
\"([^"]*)\"
(?s)(?<=<([^{]*)>)(.+?)(?=</.>)
Neither works for me; you can try them at [https://regex101.com/r/ApbUEW/2].
Expected behavior: it matches the string only when it is text.
Current behavior: it matches both cases.
Edit: I want the text to be dynamic and not specific to TEST-TEXT.

Something like this should help:
\>([^"<]*)\<
EDIT:
Without open and close tags included:
(?<=\>)([^"<]*)(?=\<)

Try TEST-TEXT(?=<\/a>)
TEST-TEXT matches the literal text TEST-TEXT.
(?=<\/a>) is a lookahead that checks for the closing tag </a>.
See it on regex101.

Here we can add a boundary to the right of the desired output (as you have already been doing), use a character class for the output itself, capture both parts, and then do the replacement using the capturing groups. Maybe something similar to this:
([A-Z-]+)(<\/)
This snippet just shows that the expression might work:
const regex = /([A-Z-]+)(<\/)/gm;
const str = `<div class="TEST-TEXT">hi</span><a href=\\"https://en.wikipedia.org/wiki/TEST-TEXT\\">first young CEO's
TEST-TEXT</a><span class="test">hello</span><div class="TEST-TEXT">hi</span><a href=\\"https://en.wikipedia.org/wiki/TEST-TEXT\\">first young CEO's
TEST-TEXT</a><span class="test">hello</span>`;
const subst = `NEW-TEXT$2`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
RegEx
If this expression wasn't desired, it can be modified or changed in regex101.com.
RegEx Circuit
jex.im also helps to visualize the expressions.

Maybe this will help?
String html = "<div class=\"TEST-TEXT\">hi</span>\n" +
"first young CEO's TEST-TEXT\n" +
"<span class=\"test\">hello</span>";
Pattern pattern = Pattern.compile("(<)(.*)(>)(.*)(TEST-TEXT)(.*)</.*>");
Matcher matcher = pattern.matcher(html);
while (matcher.find()){
System.out.println(matcher.group(5));
}

A regex that matches the string only when it appears as text between HTML tags, not inside a tag:
(?![^<>]*>)(TEST\-TEXT)
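
As a rough illustration of how that negative lookahead could be used for the replacement, here is a minimal Java sketch. The class TextOnlyReplace and the method replaceOutsideTags are hypothetical names, and the sketch assumes that text and attribute values never contain a bare < or >. The lookahead (?![^<>]*>) rejects any position from which a > can be reached without first passing a <, which is the case inside a tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextOnlyReplace {
    // Replaces target only where it is plain text, not part of a tag or attribute.
    static String replaceOutsideTags(String html, String target, String replacement) {
        // Pattern.quote keeps the target dynamic: any regex metacharacters
        // in it are treated literally.
        Pattern p = Pattern.compile("(?![^<>]*>)" + Pattern.quote(target));
        return p.matcher(html).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String html = "<div class=\"TEST-TEXT\">hi</span>\n"
                + "first young CEO's TEST-TEXT\n"
                + "<span class=\"test\">hello</span>";
        // Only the occurrence on the second line is replaced.
        System.out.println(replaceOutsideTags(html, "TEST-TEXT", "NEW-TEXT"));
    }
}
```

Because the target is passed in and quoted, the same method works for strings other than TEST-TEXT.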

How to get the middle strings with regex?

I have an input string that looks like this
DatalogSetupFile: BTS50xx1EJA\3.20\log_all.stp
The DatalogSetupFile: and \3.20\log_all.stp are constant. I wish to extract BTS50xx1EJA from the string. How should I do it?

You can write the constant parts of the string literally and wrap the variable part in a capturing group, so that the variable part is available as a group of its own.
You can define the regex as follows (the backslashes and dots in the constant suffix must be escaped, otherwise \3 is read as a backreference):
^(?:DatalogSetupFile:\s)(.*)(?:\\3\.20\\log_all\.stp)$
The first group then contains your dynamic string.
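
In Java, the capture-group idea could be sketched as below. MiddleExtractor and extractMiddle are illustrative names, not part of any library. As an assumption of this sketch, the constant suffix is escaped with Pattern.quote rather than by hand, which avoids mistakes with the backslashes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiddleExtractor {
    // The constant tail of the line, written as a Java string literal.
    private static final String SUFFIX = "\\3.20\\log_all.stp";

    // Constant prefix, one capturing group for the variable part, quoted suffix.
    private static final Pattern LINE = Pattern.compile(
            "^DatalogSetupFile:\\s(.*)" + Pattern.quote(SUFFIX) + "$");

    // Returns the variable middle part, or null if the line has another shape.
    static String extractMiddle(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "DatalogSetupFile: BTS50xx1EJA\\3.20\\log_all.stp";
        System.out.println(extractMiddle(input)); // BTS50xx1EJA
    }
}
```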

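Give this regex a try:
(?<=\s)[^\\]+
It matches the run of non-backslash characters that follows the whitespace. (PCRE-style engines could use \s\K[^\\]+ instead, but java.util.regex does not support \K, so a lookbehind is used here.) In Java it would look like:
String myInputString = "DatalogSetupFile: BTS50xx1EJA\\3.20\\log_all.stp";
Pattern myPattern = Pattern.compile("(?<=\\s)[^\\\\]+");
Matcher myMatcher = myPattern.matcher(myInputString);
if (myMatcher.find()) {
    System.out.println(myMatcher.group(0)); // BTS50xx1EJA
}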

Find the google loginform with java pattern

I am trying to search the google loginform within the html code with a simple java pattern. The loginform looks like this:
<form ... id="gaia_loginform" ... > ... </form>
I am using the following pattern to find it:
Pattern pat = Pattern.compile("<form[^>]*id=[\"|']gaia_loginform[\"|'][^>]*>(.*)</form>")
Matcher mat = pat.find(html); // html is the complete website
System.out.println(mat.group(1)); // throws exception
It should print the contents between the two tags. Thanks for any advice on what I am doing wrong :)

You are misusing the Matcher: Pattern has no find() method. Create a Matcher with pat.matcher(...) and call find() on it before reading a group:
String str = "<form ... id=\"gaia_loginform\" ... >\nCONTENT\n</form>";
Pattern pat = Pattern.compile("<form\\b[^>]*\\bid=[\"']gaia_loginform[\"'][^>]*>(.*?)</form>", Pattern.DOTALL);
Matcher matcher = pat.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
See IDEONE demo
For parsing HTML you should consider using HTML parsers, even if you are not using them now.
A couple of words on the regex: I am using the Pattern.DOTALL flag because . should be able to match newline symbols. The tag name and the id attribute name must be matched as whole words, which is why I am using \\b. Instead of .* we are safer with .*? (lazy matching), which captures as few characters as possible.
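
To make the difference between .* and .*? concrete, here is a small sketch that runs both variants of the pattern against a page with two forms. The class GreedyVsLazy, the helper firstGroup, and the second form with id "other" are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    static final String GREEDY =
            "<form\\b[^>]*\\bid=[\"']gaia_loginform[\"'][^>]*>(.*)</form>";
    static final String LAZY =
            "<form\\b[^>]*\\bid=[\"']gaia_loginform[\"'][^>]*>(.*?)</form>";

    // Returns group 1 of the first match, or null if there is none.
    static String firstGroup(String regex, String input) {
        // DOTALL lets '.' match line breaks inside the form body.
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<form id=\"gaia_loginform\">LOGIN</form>"
                + "<form id=\"other\">OTHER</form>";
        // Greedy runs on to the last </form> in the input.
        System.out.println(firstGroup(GREEDY, html));
        // Lazy stops at the first </form>.
        System.out.println(firstGroup(LAZY, html));
    }
}
```

The greedy version swallows the second form as well, which is why the lazy quantifier is the safer choice when a page may contain more than one form.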

Negating a Regular Expression for string replacement

I have the following code that can replace the email address in a String in Java:
addressStr.replaceFirst("([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})", "")
So, a string with John Smith <john@smith.com> would become John Smith <>. How do I negate it so that it instead removes everything that doesn't match the email address, leaving just john@smith.com as the final result?
I tried to put in the ^ and ?<= at the front but it doesn't work.

Well, it's not the regex you need to change but the calling code. Your regex matches the e-mail address (in a weird way), and replaceFirst() removes it from the string.
So just use
Pattern regex = Pattern.compile("([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})");
Matcher regexMatcher = regex.matcher(addressStr);
if (regexMatcher.find()) {
address = regexMatcher.group();
}
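
A complete version of this "find instead of replace" approach might look like the sketch below. EmailExtractor and extractEmail are hypothetical names. The pattern is the one from the question, written here with the conventional @ separator and compiled once into a static field so it can be reused.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailExtractor {
    // The question's pattern, compiled once and reused for every call.
    private static final Pattern EMAIL = Pattern.compile(
            "([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)"
            + "|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})");

    // Returns the first address found in the input, or null if there is none.
    static String extractEmail(String addressStr) {
        Matcher m = EMAIL.matcher(addressStr);
        // group() is only valid after find() has returned true.
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extractEmail("John Smith <john@smith.com>")); // john@smith.com
    }
}
```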

The complete Java regex for catching e-mails would be as follows:
"(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])+)\\])"
Take a look at https://www.rfc-editor.org/rfc/rfc2822#section-3.4.1 for more info on this.
It is a bit complicated, but it covers the address formats described there (yours does not allow addresses like bob+bib@gmail.com, which are valid).
For your problem, as stated multiple times, just use find() (borrowing Tim Pietzcker's piece of code):
Pattern regex = Pattern.compile("(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])+)\\])");
Matcher regexMatcher = regex.matcher(addressStr);
foundMatch = regexMatcher.find();

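You can try:
Matcher m = Pattern.compile(regexp).matcher(addressStr);
String mailId = m.find() ? m.group() : null;
The idea here is to get the matched string rather than trying to replace everything else with blanks. find() must be called (and return true) before group(), otherwise group() throws an IllegalStateException. You can extract the compiled Pattern into a field if this operation is repeated often.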

Just don't replace; use a Matcher and find() to extract the match instead.

Java regular expression for extracting the data between tags

I am trying to write a regular expression which extracts the data from a string like
<B Att="text">Test</B><C>Test1</C>
The extracted output needs to be Test and Test1. This is what I have done till now:
public class HelloWorld {
public static void main(String[] args)
{
String s = "<B>Test</B>";
String reg = "<.*?>(.*)<\\/.*?>";
Pattern p = Pattern.compile(reg);
Matcher m = p.matcher(s);
while(m.find())
{
String s1 = m.group();
System.out.println(s1);
}
}
}
But this is producing the result <B>Test</B>. Can anybody point out what I am doing wrong?

Three problems:
Your test string is incorrect.
You need a non-greedy modifier in the group.
You need to specify which group you want (group 1).
Try this:
String s = "<B Att=\"text\">Test</B><C>Test1</C>"; // <-- Fix 1
String reg = "<.*?>(.*?)</.*?>"; // <-- Fix 2
// ...
String s1 = m.group(1); // <-- Fix 3
You also don't need to escape a forward slash, so I removed that.
See it running on ideone.
(Also, don't use regular expressions to parse HTML - use an HTML parser.)
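
With the three fixes applied, a complete program could look like this sketch. TagTextExtractor and extractTexts are illustrative names. As the answer notes, an HTML or XML parser remains the more robust choice, and this only handles flat, non-nested tags like the ones in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTextExtractor {
    // Lazy quantifiers everywhere, so each match covers a single element.
    private static final Pattern TAG_TEXT = Pattern.compile("<.*?>(.*?)</.*?>");

    // Collects the text between each opening tag and the next closing tag.
    static List<String> extractTexts(String s) {
        List<String> texts = new ArrayList<>();
        Matcher m = TAG_TEXT.matcher(s);
        while (m.find()) {
            texts.add(m.group(1)); // group 1 is the text, group 0 the whole element
        }
        return texts;
    }

    public static void main(String[] args) {
        String s = "<B Att=\"text\">Test</B><C>Test1</C>";
        System.out.println(extractTexts(s)); // [Test, Test1]
    }
}
```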

If you are using Eclipse, there is a nice plugin that helps you check your regular expressions without writing a class to test them.
Here is the link:
http://regex-util.sourceforge.net/update/
You will need to show the view by choosing Window -> Show View -> Other, and then Regex Util.
I hope it helps you in your fight with regular expressions.

It almost looks like you're trying to use regex on XML and/or HTML. I'd suggest not using regex and instead creating a parser or lexer to handle this type of arrangement.

I think the best way to get the values of XML nodes is simply to treat the input as XML.
If you really want to stick to regex, try:
<B[^>]*>(.+?)</B\s*>
keeping in mind that this will only ever give you the value of the B tag.
Or, if you want the value of any tag, you can use something like:
<.*?>(.*?)</.*?>
