In an HTML document, I need to replace the full paths to files with just the file names.
The documents are very large, so I think a regex is a practical solution. I've already read similar questions and tried their solutions, but they didn't work.
Example: given this HTML input:
<img src="app/javax.faces.resource/color_pan.png?ln=img/partidos" style="width:100%; height:30px;" class="centerImg"/>
<img src="/app/javax.faces.resource/pan.png?ln=img/partidos" class="centerImg"/>
I need the following output:
<img src="color_pan.png" style="width:100%; height:30px;" class="centerImg"/>
<img src="pan.png" class="centerImg"/>
I'm trying these patterns:
Pattern p = Pattern.compile("src=\"(?=.*src).*/color_pan.png[^\"]*\"");
Pattern p1 = Pattern.compile("src=\"(?!.*src).*/pan.png[^\"]*\"");
The first one works for the first image and the second one works for the second image (both are in the same HTML document).
I need a general pattern that works for every image. So the problem is to match only the "src" nearest to the left of the file name. In other words, the "src" must be the last one that appears before the file name.
That way, I could replace the strings correctly.
Any help is appreciated.
This regex seems to do the job.
Solution 1 (2 matches in 1509 steps):
(^<img src=")(?:.*?)([\w.]+)(?=\?)[^"]*"(.*$)
Towards an efficient solution
Solution 2 (2 matches in 593 steps):
(^<img src=").*(?<=\/|")([\w.]+)(?=\?)[^"]*"(.*$)
Java Code
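// x is assumed to be a java.util.Scanner over the HTML input, read line by line
String pattern = "(^<img src=\")(?:.*?)([\\w.]+)(?=\\?)[^\"]*\"(.*$)";
Pattern r = Pattern.compile(pattern);
while (x.hasNextLine()) { // stop at end of input instead of looping forever
    String line = x.nextLine();
    Matcher m = r.matcher(line);
    if (m.find()) {
        // group 1 = tag start, group 2 = file name, group 3 = rest of the tag;
        // the closing quote is matched outside the groups, so add it back
        System.out.println(m.group(1) + m.group(2) + "\"" + m.group(3));
    } else {
        System.out.println("Not Found");
    }
}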
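As a hedged alternative, the replacement can also be done over the whole document in one pass instead of line by line. The sketch below is not from the answer above: the class name SrcFileNames, the method stripPaths and the pattern itself are illustrative assumptions. It assumes double-quoted src attributes, keeps only the last path segment and drops any query string.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class SrcFileNames {
    // src=" + any number of "segment/" prefixes + (file name) + optional ?query + closing quote
    private static final Pattern SRC =
            Pattern.compile("src=\"(?:[^\"?/]*/)*([^\"?/]+)(?:\\?[^\"]*)?\"");

    static String stripPaths(String html) {
        // $1 is the file name captured by the only capturing group
        return SRC.matcher(html).replaceAll("src=\"$1\"");
    }

    public static void main(String[] args) {
        String html =
                "<img src=\"app/javax.faces.resource/color_pan.png?ln=img/partidos\" class=\"centerImg\"/>\n"
              + "<img src=\"/app/javax.faces.resource/pan.png?ln=img/partidos\" class=\"centerImg\"/>";
        System.out.println(stripPaths(html));
    }
}
```

Because every path segment must end in a slash and the file name cannot contain one, the pattern has only one way to split a given src value, which keeps backtracking low on large documents.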
I have the following content :
<div class="TEST-TEXT">hi</span>
first young CEO's TEST-TEXT
<span class="test">hello</span>
I am trying to match the TEST-TEXT string so I can replace its value, but only when it appears as text and not inside an attribute value.
I have looked into look-ahead and look-behind in regex, but the issue is that look-behind needs a fixed-width match. The question regex-match-all-characters-between-two-html-tags shows a very similar case, except that there a span with a class is used to create the match.
I also checked the question regex-match-attribute-in-a-html-code.
Here are the two regular expressions I am trying:
\"([^"]*)\"
(?s)(?<=<([^{]*)>)(.+?)(?=</.>)
Neither works for me; you can try them at [https://regex101.com/r/ApbUEW/2].
Expected behavior: it matches the string only when it is text.
Current behavior: it matches both cases.
Edit: I want the text to be dynamic and not specific to TEST-TEXT.
Something like this should help:
\>([^"<]*)\<
EDIT:
Without open and close tags included:
(?<=\>)([^"<]*)(?=\<)
Try TEST-TEXT(?=<\/a>)
TEST-TEXT matches the literal TEST-TEXT
(?=<\/a>) is a look-ahead that checks for the closing tag </a>
Here, we might add a soft boundary on the right of the desired output (which you have already been doing), put a character class for the desired output in front of it, and capture both. After that we can make the replacement using the capturing groups (). Maybe something like this:
([A-Z-]+)(<\/)
This snippet is just to show that the expression might be valid:
const regex = /([A-Z-]+)(<\/)/gm;
const str = `<div class="TEST-TEXT">hi</span><a href=\\"https://en.wikipedia.org/wiki/TEST-TEXT\\">first young CEO's
TEST-TEXT</a><span class="test">hello</span><div class="TEST-TEXT">hi</span><a href=\\"https://en.wikipedia.org/wiki/TEST-TEXT\\">first young CEO's
TEST-TEXT</a><span class="test">hello</span>`;
const subst = `NEW-TEXT$2`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
RegEx
If this expression wasn't desired, it can be modified or changed in regex101.com.
RegEx Circuit
jex.im also helps to visualize the expressions.
Maybe this will help?
String html = "<div class=\"TEST-TEXT\">hi</span>\n" +
"first young CEO's TEST-TEXT\n" +
"<span class=\"test\">hello</span>";
Pattern pattern = Pattern.compile("(<)(.*)(>)(.*)(TEST-TEXT)(.*)</.*>");
Matcher matcher = pattern.matcher(html);
while (matcher.find()){
System.out.println(matcher.group(5));
}
A regex that matches the string only when it sits between HTML tags, not inside one:
(?![^<>]*>)(TEST\-TEXT)
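A minimal Java sketch of how the look-ahead above could be used for a replacement with a dynamic search word. The class name TextOnlyReplace and the method replaceInText are hypothetical, and the sketch assumes the word contains no angle brackets and that the markup has no stray < or > characters in its text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answers.
final class TextOnlyReplace {
    static String replaceInText(String html, String word, String replacement) {
        // (?![^<>]*>) rejects a position when the next angle bracket after it is '>',
        // which means the position lies inside a tag such as <div class="...">
        Pattern p = Pattern.compile("(?![^<>]*>)" + Pattern.quote(word));
        return p.matcher(html).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String html = "<div class=\"TEST-TEXT\">hi</span>\n"
                + "first young CEO's TEST-TEXT\n"
                + "<span class=\"test\">hello</span>";
        System.out.println(replaceInText(html, "TEST-TEXT", "NEW-TEXT"));
    }
}
```

Pattern.quote and Matcher.quoteReplacement keep the search word and the replacement literal, so characters such as - or $ in either need no manual escaping.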
I am using the Pattern and Matcher classes from Java.
I am reading a template text and I want to replace:
1. src="scripts/test.js" with src="scripts/test.js?Id=${Id}"
2. src="Servlet?Template=scripts/test.js" with src="Servlet?Id=${Id}&Template=scripts/test.js"
I'm using the code below to handle case 2:
//strTemplateText is the Template's text
Pattern p2 = Pattern.compile("(?i)(src\\s*=\\s*[\"'])(.*?\\?)");
Matcher m2 = p2.matcher(strTemplateText);
strTemplateText = m2.replaceAll("$1$2Id=" + CurrentESSession.getAttributeString("Id", "") + "&");
The above code works correctly for case 2, but how can I create a regex that handles both cases 1 and 2?
Thank you
You don't need a regular expression. If you change case 2 to
replace Servlet?Template=scripts/test.js with Servlet?Template=scripts/test.js&Id=${Id}
all you need to do is check whether the source string contains a ?: if not, add ?Id=${Id}; otherwise add &Id=${Id}.
After all
if (strTemplateText.contains("?")) {
strTemplateText += "&Id=${Id}";
}
else {
strTemplateText += "?Id=${Id}";
}
does the job.
Or even shorter
strTemplateText += strTemplateText.contains("?") ? "&Id=${Id}" : "?Id=${Id}";
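A small self-contained sketch of that check, applied to one URL value at a time rather than to a whole template. The class IdParam and the method appendId are hypothetical names, and the id is assumed to be safe to place in a URL without encoding.

```java
// Hypothetical helper, not part of the original answer.
final class IdParam {
    // Appends an Id parameter to a single URL value
    static String appendId(String url, String id) {
        // use '&' when a query string already exists, '?' otherwise
        String separator = url.contains("?") ? "&" : "?";
        return url + separator + "Id=" + id;
    }

    public static void main(String[] args) {
        System.out.println(appendId("scripts/test.js", "42"));
        System.out.println(appendId("Servlet?Template=scripts/test.js", "42"));
    }
}
```

If the template holds several src attributes, each URL would first have to be isolated before this helper is applied to it.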
Your actual question doesn't match up so well with your example code. The example code seems to handle a more general case, and it substitutes an actual session Id value instead of a reference to one. The code below takes the example code to be more indicative of what you really want, but the same approach could be adapted to what you asked in the question text (using a simpler regex, even).
With that said, I don't see any way to do this with a single replaceAll() because the replacement text for the two cases is too different. You could nevertheless do it with one regex, in one pass, if you used a different approach:
Pattern p2 = Pattern.compile("(src\\s*=\\s*)(['\"])([^?]*?)(\\?.*?)?\\2",
Pattern.CASE_INSENSITIVE);
Matcher m2 = p2.matcher(strTemplateText);
StringBuffer revisedText = new StringBuffer();
while (m2.find()) {
// Append the whole match except the closing quote
m2.appendReplacement(revisedText, "$1$2$3$4");
// group 4 is the optional query string; null if none was matched
revisedText.append((m2.group(4) == null) ? '?' : '&');
revisedText.append("Id=");
revisedText.append(CurrentESSession.getAttributeString("Id", ""));
// append a copy of the opening quote
revisedText.append(m2.group(2));
}
m2.appendTail(revisedText);
strTemplateText = revisedText.toString();
That relies on BetaRide's observation that query parameter order is not significant, although the same general approach could accommodate a requirement to make Id the first query parameter, as in the question. It also matches the end of the src attribute to the correct closing delimiter, which your pattern does not do (though it needs to, to avoid matching text that spans more than one src attribute).
Do note that nothing in the above prevents a duplicate query parameter 'Id' being added; this is consistent with the regex presented in the question. If you want to avoid that with the above approach then in the loop you need to parse the query string (when there is one) to determine whether an 'Id' parameter is already present.
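For the variant where Id must be the first query parameter, a hedged sketch along the same lines follows. The class IdFirst and the method addId are hypothetical, the session lookup is replaced by a plain id argument, and group 4 here captures the query string without its leading ?, which differs slightly from the pattern above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class IdFirst {
    private static final Pattern SRC = Pattern.compile(
            "(src\\s*=\\s*)(['\"])([^?]*?)(?:\\?(.*?))?\\2",
            Pattern.CASE_INSENSITIVE);

    static String addId(String text, String id) {
        Matcher m = SRC.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String query = m.group(4); // null when the URL has no query string
            String url = m.group(3) + "?Id=" + id
                    + (query == null || query.isEmpty() ? "" : "&" + query);
            // quoteReplacement keeps any $ or \ in the rebuilt attribute literal
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + m.group(2) + url + m.group(2)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Like the code above, this does not check whether an Id parameter is already present.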
You can do the following:
//strTemplateText is the Template's text
String strTemplateText = "src=\"scripts/test.js\"";
strTemplateText = "src=\"Servlet?Template=scripts/test.js\"";
java.util.regex.Pattern p2 = java.util.regex.Pattern.compile("(src\\s*=\\s*[\"'])(.*?)((?:[\\w\\s\\d.\\-\\#]+\\/?)+)(?:[?]?)(.*?\\=.*)*(['\"])");
java.util.regex.Matcher m2 = p2.matcher(strTemplateText);
System.out.println(m2.matches());
strTemplateText = m2.replaceAll("$1$2$3?Id=" + CurrentESSession.getAttributeString("Id", "") + (m2.group(4)==null? "":"&") + "$4$5");
System.out.println(strTemplateText);
It works in both cases.
If you are using Java 7 or later, you could use named capturing groups to make the regex more readable and easier to debug.
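To illustrate the named-group idea, here is a minimal sketch. It uses a simplified pattern of my own rather than the one above, and the class NamedGroupsDemo, the method describe and the group names are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo, not part of the original answer.
final class NamedGroupsDemo {
    private static final Pattern SRC = Pattern.compile(
            "(?<prefix>src\\s*=\\s*)(?<quote>['\"])(?<path>[^?'\"]*)"
                    + "(?:\\?(?<query>[^'\"]*))?\\k<quote>",
            Pattern.CASE_INSENSITIVE);

    // Returns "path|query" for the first src attribute, or null if none is found
    static String describe(String text) {
        Matcher m = SRC.matcher(text);
        if (!m.find()) {
            return null;
        }
        String query = m.group("query"); // null when the URL has no query string
        return m.group("path") + "|" + (query == null ? "" : query);
    }
}
```

Groups are read by name with m.group("path"), and \k<quote> refers back to whichever quote character opened the attribute.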
I have the following string and I want to extract MBRB1045T4G from it with a regular expression in Java. How would I achieve that?
String:
<p class="ref">
<b>Mfr Part#:</b>
MBRB1045T4G<br>
<b>Technologie:</b>
Tab Mount<br>
<b>Bauform:</b>
D2PAK-3<br>
<b>Verpackungsart:</b>
REEL<br>
<b>Standard Verpackungseinheit:</b>
800<br>
As Wrikken correctly says, HTML can't be parsed correctly by regex in the general case. However it seems you're looking at an actual website and want to scrape some contents. In that case, assuming space elements and formatting in the HTML code don't change, you can use a regex like this:
Mfr Part#:</b>([^<]+)<br>
And collect the first capture group like so (where string is your HTML):
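// the backslash in \s must be doubled inside a Java string literal
Pattern pt = Pattern.compile("Mfr Part#:</b>\\s+([^<]+)<br>",Pattern.MULTILINE);
Matcher m = pt.matcher(string);
// find() searches inside the input; matches() would require the whole HTML to match
if (m.find())
    System.out.println(m.group(1));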
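For reference, a self-contained sketch of the same extraction. It uses find(), which looks for the pattern anywhere in the input, whereas matches() only succeeds when the entire input matches. The class PartNumberScraper and the method extract are hypothetical names, and the pattern is a slightly adjusted version that trims whitespace around the value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class PartNumberScraper {
    // lazy [^<]+? plus \s* on both sides keeps surrounding whitespace out of group 1
    private static final Pattern PART =
            Pattern.compile("Mfr Part#:</b>\\s*([^<]+?)\\s*<br>");

    // Returns the part number, or null when the label is not present
    static String extract(String html) {
        Matcher m = PART.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<p class=\"ref\">\n<b>Mfr Part#:</b>\nMBRB1045T4G<br>\n"
                + "<b>Technologie:</b>\nTab Mount<br>";
        System.out.println(extract(html));
    }
}
```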
In my application I need to get a link and break it up if it is longer than 10 (for example) characters.
The problem is, if I send the whole text, for example: "this is my website www.stackoverflow.com" directly to this matcher
Pattern patt = Pattern.compile("(?i)\\b((?:https?://|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:\'\".,<>?«»“”‘’]))");
Matcher matcher = patt.matcher(text);
matcher.replaceAll("$1");
it would show the whole website, without breaking it.
What I was trying to do is get the value of $1, so I could break up the second one while keeping the first one intact.
I've got another method to break the string up.
UPDATE
What I want to get is only the website, so I can break it up afterwards. It would help me a lot.
You can't use replaceAll; you should iterate through the matches and process each one individually. Java's Matcher already has an API for this:
// expanding on the example in the 'appendReplacement' JavaDoc:
Pattern p = Pattern.compile("..."); // your URL regexp
Matcher m = p.matcher(text);
StringBuffer sb = new StringBuffer();
while (m.find()) {
String truncatedURL = m.group(1).replaceFirst("^(.{10}).*","$1..."); // keep the first 10 chars and add an ellipsis; shorter URLs stay unchanged
m.appendReplacement(sb,
"<a href=\"http://$1\" target=\"_blank\">"); // simple replacement for $1
sb.append(truncatedURL);
sb.append("</a>");
}
m.appendTail(sb);
System.out.println(sb.toString());
(For performance, you should factor out compiled Patterns for the replace* calls inside the loop.)
Edit: use sb.append() so not to worry about escaping $ and \ in 'truncatedURL'.
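A self-contained sketch of the loop above. The URL pattern here is a deliberately simplified stand-in (an assumption, not the regex from the question), and the class Linkify, the method linkify and the maxLen parameter are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class Linkify {
    // Simplified URL pattern: http(s):// or www. followed by non-space, non-bracket chars
    private static final Pattern URL =
            Pattern.compile("\\b((?:https?://|www\\.)[^\\s<>]+)", Pattern.CASE_INSENSITIVE);

    static String linkify(String text, int maxLen) {
        Matcher m = URL.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String url = m.group(1);
            // the visible label is shortened, the href keeps the full URL
            String label = url.length() > maxLen ? url.substring(0, maxLen) + "..." : url;
            String href = url.regionMatches(true, 0, "http", 0, 4) ? url : "http://" + url;
            m.appendReplacement(sb, ""); // copy the text before the match, add nothing yet
            sb.append("<a href=\"").append(href).append("\" target=\"_blank\">")
              .append(label).append("</a>");
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Appending through sb.append() rather than through the replacement string means $ and \ in a URL need no escaping.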
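I think you have a problem similar to the one in this question:
Java : replacing text URL with clickable HTML link
They suggested something like this:
// Java does not support POSIX bracket classes such as [:space:]; use \p{Space} and \p{Alnum} instead
String basicUrlRegex = "(.*://[^<>\\p{Space}]+[\\p{Alnum}/])";
myString = myString.replaceAll(basicUrlRegex, "$1"); // Strings are immutable, so keep the result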
Can someone please help me parse these links from an HTML page:
http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299
http://nemertes.lis.upatras.gr/dspace/handle/123456789/3154
http://nemertes.lis.upatras.gr/dspace/handle/123456789/3158
I want to select them using the word "handle", which is common to these links.
I'm using [Pattern pattern = Pattern.compile("<a.+href=\"(.+?)\"");], but it returns all the href links on the page.
Any suggestions?
Thanks
Your regular expression is looking at ALL <a href... tags. "handle" is always used as "/dspace/handle" etc. so you can use something like this to scrape the urls you're looking for:
Pattern pattern = Pattern.compile("<a.+href=\"(/dspace/handle/.+?)\"");
Looks like your regex is doing something wrong. Instead of
Pattern pattern = Pattern.compile("<a.+href=\"(.+?)\"");
Try:
Pattern pattern = Pattern.compile("<a\\s+href=\"(.+?)\"");
The '.+' after the 'a' in your first pattern matches any character one or more times. If you intended to match whitespace, use '\s+' instead.
The following code works perfectly:
String s = "<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299\"/> " +
"<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/3154\" /> " +
"<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/3158\"/>";
Pattern p = Pattern.compile("<a\\s+href=\"(.+?)\"", Pattern.MULTILINE);
Matcher m = p.matcher(s);
while(m.find()){
System.out.println(m.start()+" : "+m.group(1));
}
output:
0 : http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299
72 : http://nemertes.lis.upatras.gr/dspace/handle/123456789/3154
145 : http://nemertes.lis.upatras.gr/dspace/handle/123456789/3158
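To keep only the links that contain the "handle" path, the two suggestions above can be combined. The sketch below is an assumption-laden illustration: the class HandleLinks, the method extract and the pattern are hypothetical, and it assumes double-quoted href values that may be absolute or relative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answers.
final class HandleLinks {
    // [^>]*? stays inside one <a ...> tag; the href value must contain /dspace/handle/
    private static final Pattern HANDLE_HREF =
            Pattern.compile("<a\\b[^>]*?\\bhref=\"([^\"]*/dspace/handle/[^\"]*)\"");

    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HANDLE_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```

Using [^>]* instead of .+ between <a and href stops a match from running across several tags, which is what makes the original pattern pick up unrelated links.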