Why is this regex not giving the expected output? - java

I have a string that contains the value given below. I want to replace the HTML img tags containing a specific customerId with some new text. I tried a small Java program, but it does not give me the expected output. Here is the program.
My input string is
String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
+ "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
Regex is
String regex = "(?s)\\<img.*?customerId=3340.*?>";
The new text I want to put into the input string is
EDIT Starts:
String newText = "<img src=\"getCustomerNew.do\">";
EDIT ENDS:
Now I am doing
String outputText = inputText.replaceAll(regex, newText);
output is
Starting here.. Replacing Text ..Ending here
but my expected output is
Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p><p>someText</p>Replacing Text ..Ending here
Please note that in my expected output only the img tag containing customerId=3340 is replaced with Replacing Text. I do not understand why both img tags are replaced in the actual output.

Your pattern contains "wildcard"/"any" parts (.*?) that can match any character, including >, so nothing stops the match from running across tag boundaries. The regex engine starts at the first <img, and the first .*? then expands (even though it is lazy) until it reaches customerId=3340 in the second tag, so everything from the first <img through the > that closes the second tag is replaced.
You should be able to fix this by changing the .*? parts to something like [^>]+ so that the matching won't span past the first > character.
Parsing HTML with regular expressions is bound to cause pain.
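As a minimal sketch of that fix, assuming the input and replacement strings from the question (the class and method names are made up for illustration), a negated character class keeps each match inside a single tag:

```java
import java.util.regex.Matcher;

class ImgTagReplacer {
    // [^>]* can never cross a '>', so a match stays within one <img ...> tag.
    static final String REGEX = "<img[^>]*customerId=3340[^>]*>";

    // Hypothetical helper: replaces only the img tag carrying customerId=3340.
    static String replaceCustomerImg(String inputText, String newText) {
        // quoteReplacement keeps '$' and '\' in newText from being read as group references.
        return inputText.replaceAll(REGEX, Matcher.quoteReplacement(newText));
    }

    public static void main(String[] args) {
        String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
                + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
        System.out.println(replaceCustomerImg(inputText, "<img src=\"getCustomerNew.do\">"));
    }
}
```

With this pattern the first img tag fails to match on its own (it has no customerId=3340 before its closing >), so only the second tag is replaced.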

As other people have told you in the comments, HTML is not a regular language so using regex for manipulating it is usually painful. Your best option is to use an HTML parser. I haven't used Jsoup before, but googling a little bit it seems you need something like:
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
public class MyJsoupExample {
    public static void main(String[] args) {
        String inputText = "<html><head></head><body><p><img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123\"/></p>"
                + "<p>someText <img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456\"/></p></body></html>";
        Document doc = Jsoup.parse(inputText);
        // Select every img whose src attribute contains the given substring
        Elements myImgs = doc.select("img[src*=customerId=3340]");
        for (Element element : myImgs) {
            // Swap the whole img node for a plain text node
            // (recent jsoup versions take only the text: new TextNode("my replaced text"))
            element.replaceWith(new TextNode("my replaced text", ""));
        }
        System.out.println(doc.toString());
    }
}
Basically the code gets the list of img nodes with a src attribute containing a given string
Elements myImgs = doc.select("img[src*=customerId=3340]");
then loop over the list and replace those nodes with some text.
UPDATE
If you don't want to replace the whole img node with text but instead you need to give a new value to its src attribute then you can replace the block of the for loop with:
element.attr("src", "my new value");
or if you want to change just a part of the src value then you can do:
String srcValue = element.attr("src");
element.attr("src", srcValue.replace("getCustomers.do", "getCustomerNew.do"));
which is very similar to what I posted in this thread.

What happens is that your regex starts matching at the first img tag, then consumes everything (regardless of whether the quantifier is greedy or not) until it finds customerId=3340, and then continues consuming until it finds >.
If you want it to consume just the img with customerId=3340, think about what makes this tag different from the other tags it may match.
In this particular case, one possible solution is to look at what comes before that img tag using a look-behind (which checks the preceding text without consuming it). This regex will work:
String regex = "(?<=</p>)<img src=\".*?customerId=3340.*?>";
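A quick way to check the look-behind pattern is to run it against the question's input, as in the sketch below (class and method names are hypothetical). Keep in mind that it relies on the target tag directly following </p>, which happens to hold for this input but is not a general property of the HTML.

```java
import java.util.regex.Matcher;

class LookBehindDemo {
    // (?<=</p>) only lets the match start at an <img that directly follows </p>.
    static final String REGEX = "(?<=</p>)<img src=\".*?customerId=3340.*?>";

    static String replace(String inputText, String newText) {
        return inputText.replaceAll(REGEX, Matcher.quoteReplacement(newText));
    }

    public static void main(String[] args) {
        String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
                + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
        System.out.println(replace(inputText, "<img src=\"getCustomerNew.do\">"));
    }
}
```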

Related

Searching and replacing Pipe delimited characters in a text paragraph through regex

I have a scenario where, in a large text, I want to identify a mail signature and remove it. The signature appears like this:
name | some text | some text | some text E-mail:abc#xyz.com
in the paragraph. Please note that the number of pipe delimiters may be three or more, but the signature always ends with the E-mail field.
I need Java code that locates these portions using a regex and then removes them. Any pointers would help.
Thanks in advance.
I just want to add that the signature pattern mentioned above may occur one or more times in a large text. Also, the text inside the pipe delimiters (shown as some text) will vary, along with the name and the E-mail field.
You will find the email with:
[^|]+$
That matches everything that is not a pipe before line end.
Try this:
public static void main(String[] args) {
    String str = "name | some text | some text | some text E-mail:abc#xyz.com";
    // Strip everything up to the last pipe, plus the text before the final whitespace
    String regex = ".*\\|.*\\s+";
    String email = str.replaceAll(regex, "");
    System.out.println(email); // E-mail:abc#xyz.com
}
After splitting the string, compare the last element of the resulting array with an email regex; I'm sure you can find one online.
String[] s = yourString.split("\\|");
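None of the snippets above actually removes the signature from a larger text, so here is a rough sketch of one way to do it. It rests on assumptions that the question does not confirm: each signature sits on a line of its own, contains at least three pipes, and ends with "E-mail:" followed by a token without whitespace. The class name and sample text are made up.

```java
import java.util.regex.Pattern;

class SignatureStripper {
    // Assumption: a signature is one whole line with three or more pipes
    // that ends in "E-mail:" plus a non-blank token.
    private static final Pattern SIGNATURE = Pattern.compile(
            "(?m)^[^|\\r\\n]*(?:\\|[^|\\r\\n]*){3,}E-mail:\\S+[ \\t]*\\R?");

    // Removes every signature line (including its line break) from the text.
    static String strip(String text) {
        return SIGNATURE.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "Hello team,\n"
                + "name | some text | some text | some text E-mail:abc#xyz.com\n"
                + "see you on Monday.";
        System.out.println(strip(text));
    }
}
```

If signatures can appear in the middle of a line, the start of the name is ambiguous and the pattern would need an extra rule for where a signature begins.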

conditional replaceAll java

I have HTML code with img src attributes pointing to URLs. Some have mysite.com/myimage.png as src, others have mysite.com/1234/12/12/myimage.png. I want to replace these URLs with a cache file path. I'm looking for something like this.
String website = "mysite.com";
String text = webContent.replaceAll(website+ "\\d{4}\\/\\d{2}\\/\\d{2}", String.valueOf(cacheDir));
This code, however, does not work when the URL does not have the extra date stamp at the end. Does anyone know how I might achieve this? Thanks!
Try this one:
mysite\.com/(\d{4}/\d{2}/\d{2}/)?
Here ? means zero or one occurrence, which makes the date part optional.
Note: escape the dot as \. to match a literal dot, because an unescaped . matches any character in a regex.
Sample code :
String[] webContents = new String[] { "mysite.com/myimage.png",
"mysite.com/1234/12/12/myimage.png" };
for (String webContent : webContents) {
String text = webContent.replaceAll("mysite\\.com/(\\d{4}/\\d{2}/\\d{2}/)?",
String.valueOf("mysite.com/abc/"));
System.out.println(text);
}
output:
mysite.com/abc/myimage.png
mysite.com/abc/myimage.png
You are missing a forward slash between the website.com and the first 4 digits.
String text = webContent.replaceAll(Pattern.quote(website) + "/\\d{4}\\/\\d{2}\\/\\d{2}", String.valueOf(cacheDir));
I'd also recommend using a literal for your website.com value (the Pattern.quote part).
Finally you are also missing the last forward slash after the last two digits so it won't be replaced, but that may be on purpose...
Try:
String text = webContent.replaceAll("(?<="+website+")(.*)(?=\\/)",
String.valueOf(cacheDir));
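The suggestions above can be combined into one helper. The sketch below is only an illustration: the class name, method name, and the sample cache directories are assumptions. It quotes the site name so the dot is literal, makes the date part optional, and quotes the replacement so that a cache path containing a backslash or $ is inserted as-is.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CachePathRewriter {
    // Replaces "website/" plus an optional "yyyy/mm/dd/" stamp with "cacheDir/".
    static String toCachePath(String webContent, String website, String cacheDir) {
        String regex = Pattern.quote(website) + "/(?:\\d{4}/\\d{2}/\\d{2}/)?";
        // quoteReplacement stops '\' and '$' in cacheDir from being treated specially.
        return webContent.replaceAll(regex, Matcher.quoteReplacement(cacheDir + "/"));
    }

    public static void main(String[] args) {
        System.out.println(toCachePath("mysite.com/myimage.png", "mysite.com", "/cache"));
        System.out.println(toCachePath("mysite.com/1234/12/12/myimage.png", "mysite.com", "/cache"));
    }
}
```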

java get next few words in string

I am trying to search a .txt file that contains HTML. I need to search the file for specific HTML tags, then grab the next few characters that follow. I am new to Java, but am willing to learn what I need to.
For example: Say I have the code: <span class="date">Apr 13</span> and all I need is the date(Apr 13). How do I go about doing this?
Thanks a lot!
Have a look at String class docs and try to find the method to search the string.
Since you said you are getting it from a HTML file, you can have a look at Jsoup which is a HTML parser, which will make searching for strings in HTML documents a lot easier.
With jsoup, you can do it like this
File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements spans = doc.select("span");
for (Element element : spans) {
System.out.println(element.html());
}
Try this:
Matcher m = Pattern.compile(">(.*?)<").matcher(s);
while (m.find()) {
    String text = m.group(1); // the text between > and <
}
If what you want is something basic (I thought it would be good as you are new), you can use this:
if (s.indexOf("span class=\"date\"") != -1)
    s = s.substring(s.indexOf(">") + 1, s.lastIndexOf("<"));
But this answer is specific to your question rather than a general one.
String yourString = "<span class=\"date\">Apr 13</span>";
String date = yourString.split("class=\"date\">")[1].split("</sp")[0];
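For the specific case in the question, a capturing group tied to the date span is a little more targeted than matching any text between > and <. This is a minimal sketch that assumes the tag is written exactly as <span class="date">...</span> with no other attributes; for anything less regular an HTML parser is the safer choice. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateSpanExtractor {
    // Group 1 lazily captures the text between the opening and closing span tag.
    private static final Pattern DATE_SPAN =
            Pattern.compile("<span class=\"date\">(.*?)</span>");

    // Returns the contents of every date span found in the given HTML text.
    static List<String> extractDates(String html) {
        List<String> dates = new ArrayList<>();
        Matcher m = DATE_SPAN.matcher(html);
        while (m.find()) {
            dates.add(m.group(1).trim());
        }
        return dates;
    }

    public static void main(String[] args) {
        System.out.println(extractDates("<p>News</p><span class=\"date\">Apr 13</span>"));
    }
}
```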

Regular expression for matching repeated substring

I need to get URLs from the background-image value in an HTML style attribute. At this stage I have this regular expression (URL stands for a long regular expression matching valid URLs, omitted here for simplicity):
background-image\s*?\:\s*?(url\(\s*?(['"])?\s*?(URL)\s*?(\2)?\s*?\)([,]?))+
It matches only the first occurrence of a URL, although I thought I had allowed it to match all occurrences (obviously I haven't). What am I doing wrong?
The input may look like this:
String txt = "<div style=\"background-image: url('A'), url(B);\">fooo</div>";
and this is what I need to achieve with my regular expression:
Check whether there is a background-image property name followed by any number of spaces, then : (colon), and again any number of spaces.
Extract all values in the url() pattern.
Right now I am able to get all values in the url() pattern, but I am not able to ensure that a background-image property precedes them.
Your regex is fine, except that it doesn't search for URLs; it searches for the literal text URL. I've added a \d after URL to demonstrate that your regex works:
Pattern p = Pattern.compile("background-image\\s*?\\:\\s*?(url\\(\\s*?(['\"])?\\s*?(URL\\d)\\s*?(\\2)?\\s*?\\)([,]?))+");
Matcher m = p.matcher("background-image: url(URL1); background-image: url(URL2)");
while( m.find() ){
System.out.println(m.group(3));
}
Output:
URL1
URL2
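Note that the demonstration above uses two separate background-image declarations. When several url() values share one declaration, as in the question's input, a repeated capturing group in Java keeps only its last iteration, so a single pattern cannot return every URL through one group. One possible workaround is a two-step match: first isolate the background-image value, then scan that value for url(...) entries. The sketch below assumes the value ends at the first ; or " and that URLs contain no closing parenthesis; the class and pattern names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackgroundUrls {
    // Step 1: capture the value of a background-image declaration
    // (assumed to end at the first ';' or '"').
    private static final Pattern DECLARATION =
            Pattern.compile("background-image\\s*:\\s*([^;\"]*)");
    // Step 2: capture each url(...) entry; \1 requires the same (optional) quote on both sides.
    private static final Pattern URL_ENTRY =
            Pattern.compile("url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)");

    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher declaration = DECLARATION.matcher(html);
        while (declaration.find()) {
            Matcher entry = URL_ENTRY.matcher(declaration.group(1));
            while (entry.find()) {
                urls.add(entry.group(2));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String txt = "<div style=\"background-image: url('A'), url(B);\">fooo</div>";
        System.out.println(extract(txt));
    }
}
```

Because the url(...) scan only runs on text captured after background-image, url() values belonging to other properties are ignored.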

Convert HTML symbols and HTML names to HTML number using Java

I have an XML file which contains many special symbols like ® (HTML number &#174;) etc.
and HTML names like &atilde; (HTML number &#227;) etc.
I am trying to replace these HTML symbols and HTML names with the corresponding HTML numbers using Java. To do this, I first read the XML file into a string and then used the replaceAll method:
File fn = new File("myxmlfile.xml");
String content = FileUtils.readFileToString(fn);
content = content.replaceAll("®", "&\#174");
FileUtils.writeStringToFile(fn, content);
But this is not working.
Can anyone please tell how to do it.
Thanks !!!
The signature for the replaceAll method is:
public String replaceAll(String regex, String replacement)
You have to be careful that your first parameter is a valid regular expression. The Java Pattern class describes the constructs used in a Java regular expression.
Based on what I see in the Pattern class description, I don't see what's wrong with:
content = content.replaceAll("®", "&\#174");
You could try:
content = content.replaceAll("\\u00AE", "&#174;");
and see if that works better.
I don't think that \# is a valid escape sequence.
BTW, what's wrong with "&#174" ?
If you want HTML numbers, try escaping for XML first.
Use StringEscapeUtils from Apache Commons Lang.
Java may have trouble dealing with it, so I prefer to escape for Java first, and after that for XML or HTML.
String escapedStr = StringEscapeUtils.escapeJava(yourString);
// then escape the result for XML ...
escapedStr = StringEscapeUtils.escapeXml(escapedStr);
// ... or for HTML instead (the method is named escapeHtml4 in Commons Lang 3)
// escapedStr = StringEscapeUtils.escapeHtml(escapedStr);
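If the goal is strictly numeric references, a plain-Java approach that avoids regex and escape-sequence problems is to walk the string by code point and write every non-ASCII character as &#N;. The sketch below is an illustration only: the class name is made up, and the named-entity table is a deliberately tiny, hypothetical lookup that a real program would need to extend.

```java
import java.util.HashMap;
import java.util.Map;

class NumericEntities {
    // Hypothetical, incomplete table of named entities and their numeric forms.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("&reg;", "&#174;");
        NAMED.put("&atilde;", "&#227;");
    }

    static String toNumeric(String input) {
        String text = input;
        // String.replace works on literal text, so no regex escaping is involved.
        for (Map.Entry<String, String> e : NAMED.entrySet()) {
            text = text.replace(e.getKey(), e.getValue());
        }
        StringBuilder out = new StringBuilder();
        // Every code point above 127 becomes a decimal numeric character reference.
        text.codePoints().forEach(cp -> {
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00AE is the registered-trademark sign
        System.out.println(toNumeric("Acme\u00AE &atilde;"));
    }
}
```

When reading and writing the file, passing an explicit encoding (for example "UTF-8") to FileUtils helps ensure the special characters arrive in the string intact.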
