I have a bunch of HTML files. In these files I need to correct the src attribute of the IMG tags.
The IMG tags look typically like this:
<img alt="" src="./Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
where the attributes are NOT in any specific order.
I need to remove the dot and the forward slash at the beginning of the src attribute of the IMG tags so they look like this:
<img alt="" src="Suitbert%20%E2%80%93%20Wikipedia_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
I have the following class so far:
import java.util.regex.*;
public class Replacer {
// this PATTERN should find all img tags with 0 or more attributes before the src-attribute
private static final String PATTERN = "<img\\.*\\ssrc=\"\\./";
private static final String REPLACEMENT = "<img\\.*\\ssrc=\"";
private static final Pattern COMPILED_PATTERN = Pattern.compile(PATTERN, Pattern.CASE_INSENSITIVE);
public static void findMatches(String html){
Matcher matcher = COMPILED_PATTERN.matcher(html);
// Check all occurrences
System.out.println("------------------------");
System.out.println("Following Matches found:");
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end() + " ");
System.out.println(matcher.group());
}
System.out.println("------------------------");
}
public static String replaceMatches(String html){
//Pattern replace = Pattern.compile("\\s+");
Matcher matcher = COMPILED_PATTERN.matcher(html);
html = matcher.replaceAll(REPLACEMENT);
return html;
}
}
So my method findMatches(String html) seems to correctly find all IMG tags whose src attribute starts with ./.
However, my method replaceMatches(String html) does not replace the matches correctly.
I am a newbie to regex, but I assume that either the REPLACEMENT regex is incorrect, or the usage of the replaceAll method, or both.
As you can see, the replacement String contains 2 parts which are identical in all IMG tags:
<img and src="./. In between these 2 parts, there should be the 0 or more HTML attributes from the original string.
How do I formulate such a REPLACEMENT string?
Can somebody please enlighten me?
Don't use regex for HTML. Use a parser, obtain the src attribute and replace it.
Try these:
PATTERN = "(<img[^>]*\\ssrc=\")\\./"
REPLACEMENT = "$1"
Basically, you capture everything except the ./ in group #1, then plug it back in using the $1 placeholder, effectively stripping off the ./.
Notice how I changed your .* to [^>]*, too. If there happened to be two IMG tags on the same line, like this:
<img src="good" /><img src="./bad" />
...your regex would match this:
<img src="good" /><img src="./
It would do that even if you used a non-greedy .*?. [^>]* makes sure the match is always contained within the one tag.
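The suggestion above can be tried out with a small sketch like the following. The class name SrcPrefixStripper and the method name stripDotSlash are made up for illustration; the pattern and the "$1" replacement are the ones proposed in this answer.

```java
import java.util.regex.Pattern;

public class SrcPrefixStripper {
    // Group 1 captures everything from "<img" up to and including src=" ;
    // the "./" that follows is matched but not captured.
    private static final Pattern IMG_SRC = Pattern.compile(
            "(<img[^>]*\\ssrc=\")\\./", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper name, not part of the asker's class.
    public static String stripDotSlash(String html) {
        // "$1" re-inserts group 1, so only the "./" is dropped.
        return IMG_SRC.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String html = "<img src=\"good\" /><img src=\"./bad\" />";
        // prints: <img src="good" /><img src="bad" />
        System.out.println(stripDotSlash(html));
    }
}
```

Because of [^>]*, the first tag is left alone and only the second one loses its ./ prefix.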
Your replacement is incorrect. replaceAll substitutes the replacement string for each match, and the replacement is not interpreted as a regexp. To achieve what you want, you need to use groups. A group is delimited by parentheses in the regexp; each opening parenthesis starts a new group.
You can use $i in the replacement string to reproduce what a group has matched, where i is the group number. See the documentation of appendReplacement for the details.
// Here is an example (it looks a bit like your case but not exactly)
String input = "<img name=\"foobar\" src=\"img.png\">";
String regexp = "<img(.+)src=\"[^\"]+\"(.*)>";
Matcher m = Pattern.compile(regexp).matcher(input);
StringBuffer sb = new StringBuffer();
while(m.find()) {
// Found a match!
// Append all chars before the match, then replace the match with the
// replacement (the replacement refers to groups 1 & 2 with $1 & $2,
// which match respectively everything between '<img' and 'src' and
// everything between the src value and the closing '>')
m.appendReplacement(sb, "<img$1src=\"something else\"$2>");
}
m.appendTail(sb);// No more match, we append the end of input
Hope this helps you
If src attributes only occur in your HTML within img tags, you can just do this:
input.replace("src=\"./", "src=\"")
You could also do this without Java by using sed if you're on a *nix OS.
Related
I'm working on a program, which formats HTML Code, extracted from a PDF file.
I have a list of strings, one entry per paragraph.
As the PDF has hyperlinks, I decided to replace them with a footnote number such as "[1]".
This will be used for citing sources. Eventually I plan to put the number at the end of a paragraph or sentence, so you can look up the sources as you would in a book.
My Problem
For some reason, not all the hyperlinks are replaced.
The reason is most likely that there is text directly next to the tag.
Hell<a href="http://www.example.com">o old chap!
Specifically, the "o" part and the "Hell" part seem to prevent Java's replaceAll function from doing its job.
Expected Result
Hello [1] old chap!
EDIT:
If I just added a space before and after the URL, it might split some words, like "help" into "hel p", which is not an option either.
My code would have to replace the URL tag (without the ) and create no new extra spaces.
This is some of my code, where the problem occurs:
for (int i = 0; i < EN.length; i++) {
Pattern pattern_URL = Pattern.compile("<a(.+?)\">", Pattern.DOTALL);
Matcher matcher_URL = pattern_URL.matcher(EN[i]); // Checks the current array element.
if (matcher_URL.find() == true) {
source_number++;
String extractedURL = matcher_URL.group(0);
//System.out.println(extractedURL);
String extractedURL_fully = extractedURL.replaceAll("href=\"", ""); // quotation marks
//System.out.println(extractedURL_fully);
String nobracketURL = extractedURL.replaceAll("\\)", ""); // Remove round brackets from URL
EN[i] = EN[i].replaceAll("\\)\"", "\""); /* Remove round brackets at the end of href URLs in the array. (For some reason there were href URLs with a bracket at the end; this was already in the PDF. They caused massive problems because the bracket was not escaped, so the entire replaceAll call didn't work.) */
EN[i] = EN[i].replaceAll(nobracketURL, "[" + source_number + "]"); // Replace URL tags with the number in square brackets
}
else{
//System.out.println("FALSE: " + "[" + i + "]");
}
}
The whole idea is that it loops through the array and replaces all the URLs, from the start of the opening tag <a to the end of the opening tag "> (which can also be seen in the pattern regex).
Correct me if I'm wrong, but what you need is to eliminate all the <a> tags from a given string, right? If that's the case all you needed to do was use a code like the following:
final String string = "<a href=\"http://www.example.com\">Sen";
final Pattern pattern = Pattern.compile("<a(.+?)>", Pattern.DOTALL);
final Matcher matcher = pattern.matcher(string);
final String result = matcher.replaceAll("");
System.out.println(result); // prints "Sen"
Notice I didn't use the replaceAll from the String object, but the one from the Matcher object. This replaces every match with the empty string "".
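If the numbered footnote marker from the question is wanted instead of an empty string, the same Matcher can be driven with appendReplacement so that each tag gets its own number. This is only a sketch: the class and method names are invented here, and it puts the marker exactly where the tag was rather than moving it to the end of the word or sentence.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FootnoteMarker {
    // Matches an opening <a ...> tag; [^>]* keeps the match inside one tag.
    private static final Pattern OPEN_ANCHOR = Pattern.compile("<a\\s[^>]*>");

    // Hypothetical helper: replaces the n-th opening anchor tag with "[n]".
    public static String numberLinks(String text) {
        Matcher matcher = OPEN_ANCHOR.matcher(text);
        StringBuffer out = new StringBuffer();
        int sourceNumber = 0;
        while (matcher.find()) {
            sourceNumber++;
            // quoteReplacement is not strictly needed for "[n]", but it keeps
            // the replacement literal if the format ever contains $ or \.
            matcher.appendReplacement(out,
                    Matcher.quoteReplacement("[" + sourceNumber + "]"));
        }
        matcher.appendTail(out); // copy the text after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Hell[1]o old chap!
        System.out.println(
                numberLinks("Hell<a href=\"http://www.example.com\">o old chap!"));
    }
}
```

Text touching the tag on either side does not matter here, because the pattern only describes the tag itself.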
I have a JavaScript regex that extracts all the <img> tags whose src starts with http://.... from a string.
regex = /<img[^>]+src="?(http:\/\/[^">]+)"?\s*\/>/g;
My question is how to do this in Java. Secondly, the above regex only captures the content of src; I want to match the whole <img> tag and replace it with blank spaces.
PS. The tag may have many other attributes along with src, like 'class', 'alt' etc.
//Try this solution:
//This answer was tested I hope it is what you're looking for :
Pattern p = Pattern.compile("<img(.+?)\\s*/>"); // group 1: everything between "<img" and "/>"
Matcher m = p.matcher("<img src=\"http://google.com\"/>");
if(m.find())
System.out.println(m.group(1));
try this one:
regex = /(<img[^>]+src="?http:\/\/[^">]+"?[^>]+\/>)/g
It should match the whole img tag (I changed the end of the regexp and moved the parentheses around the whole img tag).
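For the Java side of the question, a pattern along the same lines can be handed to Pattern.compile and used with replaceAll. The following is a rough sketch, not a tested drop-in: the class and method names are invented, the tail of the pattern is loosened to [^>]*> so that tags ending in either > or /> are covered, and the tag is replaced with an empty string (pass " " instead if a blank is really wanted).

```java
import java.util.regex.Pattern;

public class HttpImageRemover {
    // Whole <img ...> tag whose src starts with http:// ; no capture group is
    // needed because the entire match is what gets replaced.
    private static final Pattern HTTP_IMG = Pattern.compile(
            "<img[^>]+src=\"?http://[^\">]+\"?[^>]*>");

    // Hypothetical helper name.
    public static String removeHttpImages(String html) {
        return HTTP_IMG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "before<img class=\"a\" src=\"http://x/y.png\" alt=\"b\"/>"
                + "after<img src=\"local.png\"/>";
        // prints: beforeafter<img src="local.png"/>
        System.out.println(removeHttpImages(html));
    }
}
```

In a Java string literal the forward slashes need no escaping; only the double quotes do.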
Please try the below segment
.*(<img\s+.*src\s*=\s*"([^"]+)".*>).*
Here it will create two capture groups:
1. Group 1 is the complete img tag
2. Group 2 holds the URL of the image only.
Example
package com.company;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String htmlFragment = "<img src='http://img01.ibnlive.in/ibnlive/uploads/2015/11/Videocon-Delite.gif' width='90' height='62'>Videocon Mobile Phones has launched three new Android smartphones - Z55 Delite, Z45 Dazzle, and Z45 Amaze with prices starting at Rs 4,599.";
Pattern pattern =
Pattern.compile( ".*(<img\\s+.*src\\s*=\\s*'([^']+)'.*>).*" );
Matcher matcher = pattern.matcher( htmlFragment );
if( matcher.matches()) {
String match = matcher.group(1);
String match1 = matcher.group(2);
//match.replaceAll("'","");
System.out.println(match);
System.out.println(match1);
//System.out.println(match2);
String newString = htmlFragment.replace(match, ""); // literal replacement: match is plain text, not a regex
System.out.println(newString);
}
}
}
The example uses a single-quoted image URL, but the regex given at the top is for your case with double quotes.
I'm trying to extract part of the URL in the text files.
for example:
/p/gnomecatalog/bugs/search/?q=status%3Aclosed-accepted+or+status%3Awont-fix+or+status%3Aclosed" class="search_bin"><span>Closed Tickets</span></a>
I would like to extract only
/p/gnomecatalog/bugs/search/?q=status%3Aclosed-accepted+or+status%3Awont-fix+or+status%3Aclosed
How could I do that using a regular expression? I tried the regex
"/p/*./bugs/*."
but it didn't work.
Try this:
"\/p.*\/bugs[^"]*"
it means: "/p"
then: all chars,
then: "/bugs",
then: all chars except "
You can use :
(\/p\/.*\/bugs\/.*?(?="))
Java Code :
String REGEX = "(\\/p\\/.*\\/bugs\\/.*?(?=\"))";
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(line);
while (m.find()) {
String matched = m.group();
System.out.println("Matched : "+ matched);
}
OUTPUT
Matched : /p/gnomecatalog/bugs/search/?q=status%3Aclosed-accepted+or+status%3Awont-fix+or+status%3Aclosed
Here's another way:
(?i)/p/[a-z/]+bugs/[^ "]+
The (?i) in the beginning makes the regex case insensitive so you don't have to worry about that. Then after bugs/ it will continue until it reaches either a space or a ".
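A short Java sketch of how this pattern could be applied to one line of the text file. The class and method names are placeholders; the sample line is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BugsUrlExtractor {
    // Same regex as above; only the double quote needs escaping in Java.
    private static final Pattern BUGS_URL =
            Pattern.compile("(?i)/p/[a-z/]+bugs/[^ \"]+");

    // Hypothetical helper: returns the first matching URL part, or null.
    public static String extract(String line) {
        Matcher m = BUGS_URL.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "/p/gnomecatalog/bugs/search/?q=status%3Aclosed-accepted"
                + "+or+status%3Awont-fix+or+status%3Aclosed\" class=\"search_bin\">"
                + "<span>Closed Tickets</span></a>";
        System.out.println(extract(line));
    }
}
```

Note that [a-z/]+ only allows letters and slashes between /p/ and bugs/, so a project name containing digits or hyphens would need a wider character class.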
What regular expression can be used to extract the value of src attribute in the iframe tag?
If you really are using Java (not JavaScript) and you only have the iframe, you can try the regular expression:
(?<=src=")[^"]*(?<!")
e.g.:
private static final Pattern REGEX_PATTERN =
Pattern.compile("(?<=src=\")[^\"]*(?<!\")");
public static void main(String[] args) {
String input = "<iframe name=\"I1\" id=\"I1\" marginwidth=\"1\" marginheight=\"1\" height=\"430px\" width=\"100%\" border=\"0\" frameborder=\"0\" scrolling=\"no\" src=\"report.htm?view=country=us\">";
System.out.println(
REGEX_PATTERN.matcher(input).matches()
); // prints "false"
Matcher matcher = REGEX_PATTERN.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
Output:
report.htm?view=country=us
I would say look into DOM parsing; from there it would be very similar to the JavaScript answer.
A DOM parser will turn the HTML into a document, and from there you can do:
iframe = document.getElementById("I1");
src = iframe.getAttribute("src");
Regex is somewhat costly; avoid it when a simpler solution is available. In Java, try this:
String src="<iframe name='I1' id='I1' marginwidth='1' marginheight='1'" +
" height='430px' width='100%' border='0' frameborder='0' scrolling='no'" +
" src='report.htm?view=country=us'>";
int position1 = src.indexOf("src") + 5;
System.out.println(position1);
int position2 = src.indexOf("\'", position1);
System.out.println(position2);
System.out.println(src.substring(position1, position2));
Output:
134
160
report.htm?view=country=us
In case you meant javascript instead of java:
var iframe = document.getElementById("I1");
var src = iframe.getAttribute("src");
alert(src); //outputs the value of the src attribute
src="(.*?)"
The regular expression will match src="report.htm?view=country=us", but you will find only the part between the " in the first (and only) submatch.
When you only want to match src-attributes when they are in an iframe, do this:
<iframe.*?src="(.*?)".*?>
but there are certain corner-cases where this could fail due to the inherently non-regular nature of HTML. See the top answer to RegEx match open tags except XHTML self-contained tags for an amusing rant about this problem.
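In Java, the iframe variant could be used roughly like this; the value of the src attribute is read from group 1. The class and method names are invented for the sketch, and the input is a shortened version of the iframe tag from the other answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IframeSrcExtractor {
    // Lazy quantifiers stop at the first src="..." inside the iframe tag.
    private static final Pattern IFRAME_SRC =
            Pattern.compile("<iframe.*?src=\"(.*?)\".*?>");

    // Hypothetical helper: returns the src value of the first iframe, or null.
    public static String firstSrc(String html) {
        Matcher m = IFRAME_SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "<iframe name=\"I1\" id=\"I1\" scrolling=\"no\" "
                + "src=\"report.htm?view=country=us\">";
        // prints: report.htm?view=country=us
        System.out.println(firstSrc(input));
    }
}
```

find() is used rather than matches(), since the tag is normally only part of a larger document.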
Requirement: a String "richText" which can include plain text plus anchor tags. Each anchor tag is rewritten to modify its target, append JS, etc.
Issue:
The pattern matcher's find() and appendReplacement() work fine as long as there is no special character "$" in the anchor tag. An exception is thrown when $ is part of the anchor tag.
Line 1 fixes the exception but creates an issue if "$" or "\" is present in the plain text, since the plain text now has additional escape characters before those 2 special characters (because of quoteReplacement()). How do I strip the additional escape characters from the plain text (undo the effect of quoteReplacement)?
Method:
String richText = Matcher.quoteReplacement(rText); //Line 1-escape characters
String anchorTagPattern = "<a[^>]*?href\\s*=[^>]*>(.*?)</a>";
StringBuffer result = new StringBuffer(richText.length());
Pattern pattern = Pattern.compile(anchorTagPattern);
Matcher matcher = pattern.matcher(richText);
while (matcher.find()) {
String aTag = matcher.group();
.......
String formattedAnchorTag = rewriteTag(aTag);
matcher.appendReplacement(result, formattedAnchorTag); ....
}
matcher.appendTail(result);
//Plain text with $ \ has some additional escape characters because of Line 1. How to remove them:
rText entered is
Plain text having $. Anchor tag to be rewritten is google$
If Line 1 in the method (quoteReplacement) is commented out, I get java.lang.IllegalArgumentException: Illegal group reference
at java.util.regex.Matcher.appendReplacement(Matcher.java:724)
If I leave it, the exception goes away but the string returned is
Plain text having \$. Anchor tag to be rewritten is google$
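Matcher.quoteReplacement should not be called on rText, which is the input being searched. The first question mark in the pattern seems superfluous. Only the string returned by rewriteTag is passed as a replacement, so only it can cause the exception and only it needs quoting: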
formattedAnchorTag = Matcher.quoteReplacement(formattedAnchorTag);
matcher.appendReplacement(result, formattedAnchorTag);
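Put together, the loop might look like the sketch below. The rewriteTag here is only a stand-in that adds a target attribute (the real one from the question is not shown), and the class name and sample href are made up. The input is left unquoted; only the replacement goes through quoteReplacement.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorRewriter {
    private static final Pattern ANCHOR =
            Pattern.compile("<a[^>]*href\\s*=[^>]*>(.*?)</a>");

    // Hypothetical stand-in for the question's rewriteTag.
    static String rewriteTag(String aTag) {
        return "<a target=\"_blank\"" + aTag.substring(2);
    }

    public static String rewrite(String rText) {
        StringBuffer result = new StringBuffer(rText.length());
        Matcher matcher = ANCHOR.matcher(rText); // input is NOT quoted
        while (matcher.find()) {
            String formatted = rewriteTag(matcher.group());
            // Quote only the replacement so $ and \ in the tag stay literal.
            matcher.appendReplacement(result, Matcher.quoteReplacement(formatted));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String rText = "Plain text having $. Anchor tag to be rewritten is "
                + "<a href=\"http://x\">google$</a>";
        System.out.println(rewrite(rText));
    }
}
```

The text between matches is copied by appendReplacement and appendTail as-is, so a $ or \ in the plain text never needs escaping.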