Regular expression for matching repeated substring


I need to extract URLs from the background-image value in an HTML style attribute. So far I have this regular expression (URL stands for a long pattern that matches valid URLs; I omit it here for simplicity):
background-image\s*?\:\s*?(url\(\s*?(['"])?\s*?(URL)\s*?(\2)?\s*?\)([,]?))+
It matches only the first occurrence of a URL. I thought I had allowed it to match all occurrences, but obviously I haven't. What am I doing wrong?
The input may look like this:
String txt = "<div style=\"background-image: url('A'), url(B);\">fooo</div>";
and this is what I need to achieve with my regular expression:
Check that there is a background-image property followed by any number of spaces, then a colon, then again any number of spaces.
Extract all values that appear in url() patterns.
Right now I can get all the values in url() patterns, but I cannot ensure that they belong to a background-image value.

Your regex is fine, except that it doesn't search for URLs; it searches for the literal text URL. I've added a \d after URL to demonstrate that your regex works:
Pattern p = Pattern.compile("background-image\\s*?\\:\\s*?(url\\(\\s*?(['\"])?\\s*?(URL\\d)\\s*?(\\2)?\\s*?\\)([,]?))+");
Matcher m = p.matcher("background-image: url(URL1); background-image: url(URL2)");
while (m.find()) {
    System.out.println(m.group(3));
}
Output:
URL1
URL2
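Note that the demonstration above uses two separate background-image declarations. For a single declaration with several url() entries, such as the input in the question, a repeated capturing group keeps only its last iteration, and the question's pattern also has no room for the space after the comma. One common alternative is to work in two passes: first match the declaration, then find every url() inside it. The following is only a sketch under stated assumptions: the class and method names are made up for illustration, the URL part is matched loosely with .*? instead of the long URL pattern, and the value is assumed not to contain a semicolon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answer.
class BackgroundImageUrls {
    // Pass 1: capture the value of a background-image declaration (everything up to ';').
    private static final Pattern DECLARATION =
            Pattern.compile("background-image\\s*:\\s*([^;]*)");
    // Pass 2: one url(...) entry; group 1 is the optional quote, group 2 the URL text.
    private static final Pattern URL_ENTRY =
            Pattern.compile("url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)");

    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher declaration = DECLARATION.matcher(html);
        while (declaration.find()) {
            // Only search for url() inside the declaration value, never in the rest of the HTML.
            Matcher entry = URL_ENTRY.matcher(declaration.group(1));
            while (entry.find()) {
                urls.add(entry.group(2));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String txt = "<div style=\"background-image: url('A'), url(B);\">fooo</div>";
        System.out.println(extract(txt)); // prints [A, B]
    }
}
```

Splitting the work this way keeps each pattern simple: the first one answers "is this a background-image value?", the second one "which URLs does it contain?".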

Related

Java replace all regex matches

I have an htmlBody field that holds the HTML of a web page. I want to find all relative links ending in .html and remove the extension from each of them. I do not want htmlBody.replaceAll(".html", "") because that would strip the extension from every link and break some external links. My approach is to find all occurrences that match a regex, remove the extension from each one using replaceAll(), and append the result to sb. I tried to follow the example from the official documentation, but it does not change any link. What could be the problem?
StringBuilder sb = new StringBuilder();
Pattern p = Pattern.compile("^\\/(.+\\\\)*(.+).(html)$");
Matcher m = p.matcher(htmlBody);
while (m.find()) {
    String updatedLink = m.group().replaceAll(".html", "");
    m.appendReplacement(sb, updatedLink);
}
m.appendTail(sb);

Your regex is wrong: ^ matches the start of the string and $ matches the end of the string,
so the matcher in your code never matches a link inside a larger document.
A working regex looks like Pattern p = Pattern.compile("['\"]\\/(.+\\\\)*(.+).(html)");
but it cannot match an unquoted attribute such as <a href=/a.html>
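As a rough sketch of the find-and-append loop without the anchors, the example below rewrites only quoted href values that start with a single slash. The class name, the href-based pattern, and the sample HTML are assumptions made for illustration; unquoted attributes are still not handled. StringBuffer is used because the StringBuilder overloads of appendReplacement and appendTail only exist from Java 9 onward.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the pattern is an assumption, not the asker's original regex.
class HtmlExtensionStripper {
    // group 1: href= plus opening quote, group 2: the quote character,
    // group 3: a path starting with exactly one '/', followed by ".html" and the same quote.
    private static final Pattern RELATIVE_HTML_LINK =
            Pattern.compile("(href\\s*=\\s*(['\"]))(/(?!/)[^'\"]*?)\\.html\\2");

    static String strip(String htmlBody) {
        Matcher m = RELATIVE_HTML_LINK.matcher(htmlBody);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Rebuild the attribute without the extension.
            String updatedLink = m.group(1) + m.group(3) + m.group(2);
            // quoteReplacement keeps '$' and '\' in the link from being read as group references.
            m.appendReplacement(sb, Matcher.quoteReplacement(updatedLink));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"/docs/a.html\">A</a> <a href=\"http://x.com/b.html\">B</a>";
        System.out.println(strip(html));
    }
}
```

With this input, only the first link loses its extension; the absolute link is left untouched because its value does not start with a slash.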

How to replace a given substring with "" from a given string?

I went through a couple of examples of replacing a given substring of a string with "", but could not get the result I wanted. The string is too long to post; it contains a substring like this:
/image/journal/article?img_id=24810&t=1475128689597
I want to replace this substring with "". The values of img_id and t can vary, so I have to use a regular expression. I tried the following code:
String regex="^/image/journal/article?img_id=([0-9])*&t=([0-9])*$";
content=content.replace(regex,"");
Here content is the original string. But this code does not replace anything in content. Any help would be appreciated. Thanks in advance.

Use replaceAll, which works with a regex (String.replace treats its first argument as literal text):
content=content.replaceAll("[0-9]*","");
Code
String content="/image/journal/article?img_id=24810&t=1475128689597";
content=content.replaceAll("[0-9]*","");
System.out.println(content);
Output :
/image/journal/article?img_id=&t=
Update: a simple alternative, perhaps a little less elegant but easy:
String content="sas/image/journal/article?img_id=24810&t=1475128689597";
content=content.replaceAll("\\/image.*","");
System.out.println(content);
Output:
sas
If there is something more after t=1475128689597, such as /?tag=343sdds, and you want to keep ?tag=343sdds, then use the following:
String content="sas/image/journal/article?img_id=24810&t=1475128689597/?tag=343sdds";
content=content.replaceAll("(\\/image.*[0-9]+[\\/])","");
System.out.println(content);
Output:
sas?tag=343sdds

If you're trying to replace the numeric values in the URL with two quotation marks, like so:
/image/journal/article?img_id=""&t=""
then you need to put escaped quotes \"\" in the replacement, edit your regex so that it matches only the numbers, and switch to replaceAll:
content=content.replaceAll(regex,"\"\"");

You can use Java's regex classes to replace the substring with "" (or any other string literal) based on a given pattern, as follows:
String content = "ALPHA_/image/journal/article?img_id=24810&t=1475128689597_BRAVO";
String regex = "\\/image\\/journal\\/article\\?img_id=\\d+&t=\\d+";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(content);
if (matcher.find()) {
    String replacement = matcher.replaceAll("PK");
    System.out.println(replacement); // Will print ALPHA_PK_BRAVO
}
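To see why the attempt in the question replaced nothing, it may help to compare the two String methods side by side. The sketch below assumes the target substring sits inside a longer string, so it drops the ^ and $ anchors and escapes the question mark; the class name and sample content are made up for illustration.

```java
// Hypothetical demo class, not from the original answers.
class SubstringRemoval {
    // '?' is escaped so that it means a literal question mark; there are no ^/$ anchors
    // because the target is assumed to appear in the middle of a longer string.
    private static final String REGEX = "/image/journal/article\\?img_id=\\d*&t=\\d*";

    static String remove(String content) {
        // replaceAll interprets its first argument as a regular expression.
        return content.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        String content = "ALPHA_/image/journal/article?img_id=24810&t=1475128689597_BRAVO";
        // String.replace looks for the pattern text literally, so nothing is replaced here.
        System.out.println(content.replace(REGEX, "").equals(content)); // prints true
        System.out.println(remove(content)); // prints ALPHA__BRAVO
    }
}
```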

How to extract image url from a string?

I am trying to extract an image URL from inside a string, using Pattern and Matcher with a regular expression. Whenever I debug the code, both matcher.matches() and matcher.find() return false.
I am attaching the image URL and the regular expression as well as my code.
Pattern pattern_name;
Matcher matcher_name;
String regex = "(http(s?):/)(/[^/]+)+\" + \"\\.(?:jpg|gif|png)";
String url = "http://www.medivision360.com/pharma/pages/articleImg/thumbnail/thumb3756d839adc5da3.jpg";
pattern_name = Pattern.compile(regex);
matcher_name = pattern_name.matcher(url);
matcher_name.matches();
matcher_name.find();

There is a problem with the regex: the \" + \" part seems to come from code that was mistaken for part of the regex. That subpattern requires a double quote, one or more spaces, another space, and another double quote to appear right before the extension. It would match something like http://www.medivision360.com/pharma/pages/articleImg/thumbnail/thumb3756d839adc5da3"  ".jpg.
Also, the two capture groups at the beginning are redundant; you do not need them.
Use
String regex = "https?:/(?:/[^/]+)+\\.(?:jpg|gif|png)";
Java demo:
String rx = "https?:/(?:/[^/]+)+\\.(?:jpg|gif|png)";
String url = "http://www.medivision360.com/pharma/pages/articleImg/thumbnail/thumb3756d839adc5da3.jpg";
Pattern pat = Pattern.compile(rx);
Matcher matcher = pat.matcher(url);
if (matcher.matches()) {
    System.out.println(matcher.group());
}
Note that Matcher#matches() requires a full string match, while Matcher#find() will find a partial match, a match inside a larger string.
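The difference between the two calls can be sketched as follows, using the corrected regex from above. The class name, the method names, and the sample sentence are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the corrected regex from the answer.
class ImageUrlFinder {
    private static final Pattern IMAGE_URL =
            Pattern.compile("https?:/(?:/[^/]+)+\\.(?:jpg|gif|png)");

    // matches(): true only if the WHOLE input is an image URL.
    static boolean isImageUrl(String input) {
        return IMAGE_URL.matcher(input).matches();
    }

    // find(): returns the first image URL found anywhere in the input, or null if there is none.
    static String firstImageUrl(String input) {
        Matcher matcher = IMAGE_URL.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String text = "See http://a.com/x.jpg now";
        System.out.println(isImageUrl(text));    // prints false
        System.out.println(firstImageUrl(text)); // prints http://a.com/x.jpg
    }
}
```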

You've escaped the double quotes in the string concatenation,
so the regex engine sees this: (http(s?):/)(/[^/]+)+" + "\.(?:jpg|gif|png)
after Java parses the string literal.
You can un-escape them: "(http(s?):/)(/[^/]+)+" + "\\.(?:jpg|gif|png)"
or just join the parts together: "(http(s?):/)(/[^/]+)+\\.(?:jpg|gif|png)"

If the expression is always at the end, I would suggest:
([^/?]+)(?=/?(?:$|\?))

conditional replaceAll java

I have HTML code with img tags whose src attributes point to URLs. Some have mysite.com/myimage.png as src; others have mysite.com/1234/12/12/myimage.png. I want to replace these URLs with a cache file path. I'm looking for something like this:
String website = "mysite.com";
String text = webContent.replaceAll(website+ "\\d{4}\\/\\d{2}\\/\\d{2}", String.valueOf(cacheDir));
However, this code does not work when the URL does not have the extra date stamp. Does anyone know how I might achieve this? Thanks!

Try this one
mysite\.com/(\d{4}/\d{2}/\d{2}/)?
Here ? means zero or one occurrence, so the date part is optional.
Note: escape the dot as \. to match a literal dot, because an unescaped . matches any character in a regex.
Sample code:
String[] webContents = new String[] { "mysite.com/myimage.png",
        "mysite.com/1234/12/12/myimage.png" };
for (String webContent : webContents) {
    String text = webContent.replaceAll("mysite\\.com/(\\d{4}/\\d{2}/\\d{2}/)?",
            String.valueOf("mysite.com/abc/"));
    System.out.println(text);
}
output:
mysite.com/abc/myimage.png
mysite.com/abc/myimage.png

You are missing a forward slash between the website name and the first four digits.
String text = webContent.replaceAll(Pattern.quote(website) + "/\\d{4}\\/\\d{2}\\/\\d{2}", String.valueOf(cacheDir));
I'd also recommend treating the website value as a literal (the Pattern.quote part).
Finally, you are also missing the forward slash after the last two digits, so it won't be replaced, but that may be on purpose...
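Putting the two suggestions together (an optional date group plus Pattern.quote), a minimal sketch might look like the following. The class and method names and the "/cache" directory are hypothetical, and the replacement is assumed to be the cache directory followed by a slash.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper combining the optional group with Pattern.quote.
class CachePathRewriter {
    static String rewrite(String webContent, String website, String cacheDir) {
        // Pattern.quote makes the dots in the website name literal;
        // the trailing (?:...)? makes the yyyy/mm/dd/ part optional.
        String regex = Pattern.quote(website) + "/(?:\\d{4}/\\d{2}/\\d{2}/)?";
        // quoteReplacement protects '$' and '\' that a file path may contain.
        return webContent.replaceAll(regex, Matcher.quoteReplacement(cacheDir + "/"));
    }

    public static void main(String[] args) {
        System.out.println(rewrite("mysite.com/myimage.png", "mysite.com", "/cache"));
        System.out.println(rewrite("mysite.com/1234/12/12/myimage.png", "mysite.com", "/cache"));
    }
}
```

Both calls yield the same path, /cache/myimage.png, because the date segment is consumed when it is present and simply skipped when it is not.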

Try:
String text = webContent.replaceAll("(?<="+website+")(.*)(?=\\/)",
String.valueOf(cacheDir));

Why this regex not giving expected output?

I have a string that contains the value given below. I want to replace the HTML img tag containing a specific customerId with some new text. I tried a small Java program, but it does not give me the expected output. Here are the details.
My input string is
String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
+ "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
Regex is
String regex = "(?s)\\<img.*?customerId=3340.*?>";
The new text I want to put into the input string:
EDIT Starts:
String newText = "<img src=\"getCustomerNew.do\">";
EDIT ENDS:
Now I am doing
String outputText = inputText.replaceAll(regex, newText);
The output is
Starting here.. Replacing Text ..Ending here
but my expected output is
Starting here.. <img src="getCustomers.do?custCode=2&customerId=3334&param1=123/></p><p>someText</p>Replacing Text ..Ending here
Please note that in my expected output only the img tag containing customerId=3340 is replaced with Replacing Text. I do not understand why both img tags are replaced in the actual output.

You've got "wildcard"/"any" patterns (.*?) in there, which can match any character, including >. Even though they are lazy, the match starts at the first <img and is extended across the end of the first tag until customerId=3340 is found; the final > in the pattern then matches the > that closes the second tag.
You should be able to fix this by changing the .*? parts to something like [^>]+ so that the matching won't span past the first > character.
Parsing HTML with regular expressions is bound to cause pain.
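As a sketch of that fix applied to the input from the question, the example below uses [^>]* on both sides of the customerId test so that a match can never leave the tag it started in. The class and method names are made up for illustration, and the \b after the id is an added assumption so that 3340 does not also match a longer id such as 33401.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the [^>] approach; not a substitute for an HTML parser.
class ImgTagReplacer {
    static String replace(String inputText, String customerId, String newText) {
        // [^>]* cannot cross a '>', so the match stays inside a single <img ...> tag.
        String regex = "<img[^>]*?customerId=" + Pattern.quote(customerId) + "\\b[^>]*>";
        return inputText.replaceAll(regex, Matcher.quoteReplacement(newText));
    }

    public static void main(String[] args) {
        String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
                + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
        System.out.println(replace(inputText, "3340", "Replacing Text"));
    }
}
```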

As other people have told you in the comments, HTML is not a regular language, so using regex to manipulate it is usually painful. Your best option is to use an HTML parser. I haven't used Jsoup before, but from a little googling it seems you need something like:
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
public class MyJsoupExample {
    public static void main(String args[]) {
        String inputText = "<html><head></head><body><p><img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123\"/></p>"
                + "<p>someText <img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456\"/></p></body></html>";
        Document doc = Jsoup.parse(inputText);
        Elements myImgs = doc.select("img[src*=customerId=3340]");
        for (Element element : myImgs) {
            element.replaceWith(new TextNode("my replaced text", ""));
        }
        System.out.println(doc.toString());
    }
}
Basically, the code gets the list of img nodes with a src attribute containing a given string
Elements myImgs = doc.select("img[src*=customerId=3340]");
then loops over the list and replaces those nodes with some text.
UPDATE
If you don't want to replace the whole img node with text but instead need to give a new value to its src attribute, you can replace the body of the for loop with:
element.attr("src", "my new value");
or, if you want to change just a part of the src value, you can do:
String srcValue = element.attr("src");
element.attr("src", srcValue.replace("getCustomers.do", "getCustomerNew.do"));
which is very similar to what I posted in this thread.

What happens is that your regex starts matching at the first img tag, then consumes everything (regardless of whether the quantifier is greedy or not) until it finds customerId=3340, and then continues consuming until it finds >.
If you want it to consume just the img tag with customerId=3340, think about what distinguishes this tag from the other tags it may match.
In this particular case, one possible solution is to look at what precedes that img tag using a look-behind (which doesn't consume any input). This regex will work:
String regex = "(?<=</p>)<img src=\".*?customerId=3340.*?>";
