java: regular expression

I have an HTML string that includes a lot of image tags. I need to extract each tag and change it. For example:
String imageRegex = "(<img.+(src=\".+\").+/>){1}";
String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />hello world<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
Matcher matcher = Pattern.compile(imageRegex, Pattern.CASE_INSENSITIVE).matcher(str);
int i = 0;
while (matcher.find()) {
    i++;
    Log.i("TAG", matcher.group());
}
The result is:
<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />hello world<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />
But that is not what I want. I want the result to be:
<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />
<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />
What's wrong with my regular expression?

Try (<img)(.*?)(/>); this should do the trick. That said, you shouldn't use regex for parsing HTML, as people will tell you over and over.
I don't have Eclipse installed, but I have VS2010, and this works for me (note that this snippet is C#, not Java):
String imageRegex = "(<img)(.*?)(/>)";
String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />hello world<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
System.Text.RegularExpressions.MatchCollection match = System.Text.RegularExpressions.Regex.Matches(str, imageRegex, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
StringBuilder sb = new StringBuilder();
foreach (System.Text.RegularExpressions.Match m in match)
{
sb.AppendLine(m.Value);
}
System.Windows.MessageBox.Show(sb.ToString());
Result:
<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />
<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />

David M is correct: you really shouldn't try to do this. Your specific problem, though, is that the + quantifier in your regex is greedy, so it matches the longest possible substring that could match.
See the regex tutorial for more details on quantifiers.
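To make the greedy behaviour concrete, here is a minimal Java sketch that runs a greedy and a reluctant version of the pattern over a shortened input. The class and method names (GreedyVsLazy, findAll) and the sample string are illustrative assumptions, not the asker's exact code or data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Collects every match of regex in input, ignoring case.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened version of the input from the question.
        String html = "<img src=\"a/9.gif\" alt=\"\" />hello world<img src=\"a/7.gif\" alt=\"\" />";

        // Greedy: .+ runs to the end of the input, then backtracks to the LAST "/>".
        System.out.println(findAll("<img.+/>", html).size());   // 1

        // Reluctant: .+? stops at the FIRST "/>" it can reach.
        System.out.println(findAll("<img.+?/>", html).size());  // 2

        // Negated class: [^>]* can never run past the end of a tag.
        System.out.println(findAll("<img[^>]*>", html).size()); // 2
    }
}
```

The negated character class is often preferred over the reluctant quantifier because it cannot overshoot a tag boundary, but it still assumes that no attribute value contains a literal > character.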

I would not recommend using regex to parse HTML. Please consider Jsoup or a similar library:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements images = doc.select("img");
Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp.

Related

Java: Replace only exactly matching URL

I want to replace only the exactly matching link in a String.
My code is as follows:
String originalString = "<a target=\"_blank\" href=\"http://example.com/\"><span style=\"font-size: 12px;\">ABC</span></a>"
+ "<a target=\"_blank\" href=\"http://example.com/contact/\"><span style=\"font-size: 12px;\">Contact</span></a>";
String replacedString = originalString.replace("http://example.com/", "link1");
System.out.println("Replaced String:" + replacedString);
replacedString = "<a target="_blank" href="link1"><span style="font-size: 12px;">ABC</span></a><a target="_blank" href="link1contact/"><span style="font-size: 12px;">Contact</span></a>"
requiredString = "<a target="_blank" href="link1"><span style="font-size: 12px;">ABC</span></a><a target="_blank" href="link2"><span style="font-size: 12px;">Contact</span></a>"
I get the output shown as replacedString, but the required output is the one shown as requiredString.
Thanks in advance.

Replace the URL with the quotes:
String replacedString = originalString.replace("\"http://example.com/\"", "\"link1\"");
replacedString = replacedString.replace("\"http://example.com/contact/\"", "\"link2\"");

The problem is that http://example.com/contact/ contains http://example.com/.
Use this instead:
String replacedString = originalString.replace("http://example.com/contact/", "link2");
String replacedString2 = replacedString.replace("http://example.com/", "link1");
replacedString2 is the required output
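Both suggestions above work for two known URLs, but they depend on either the surrounding quotes or the order of the calls. A rough sketch of a more general variant matches each whole href value and looks it up in a map, so a URL that is a prefix of another one cannot interfere. The names HrefReplacer and replaceHrefs are hypothetical, and the sketch assumes every href is written with double quotes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefReplacer {
    // Matches href="..." and captures the complete attribute value.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Replaces an href value only when it equals a map key exactly.
    static String replaceHrefs(String html, Map<String, String> replacements) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String url = m.group(1);
            String target = replacements.getOrDefault(url, url);
            // quoteReplacement keeps '$' and '\' in the target from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement("href=\"" + target + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> replacements = new HashMap<>();
        replacements.put("http://example.com/", "link1");
        replacements.put("http://example.com/contact/", "link2");

        String html = "<a href=\"http://example.com/\">ABC</a>"
                + "<a href=\"http://example.com/contact/\">Contact</a>";
        System.out.println(replaceHrefs(html, replacements));
    }
}
```

Because the whole attribute value is compared, the order in which the URLs are registered does not matter.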

A working regex in Java is http:\\/\\/example.com.*?(?=\\\\). It matches every occurrence of http://example.com and everything after it up to the next backslash.

Java regex , jsoup

How can I extract these values with regex or Jsoup: 19040172b-1; SQL Server Develop; zheng; 3-5,7-14; D-101?
<div id="AE9D7F630640426F8457A661607D2B8E-5-2" style="display: none;" class="kbcontent">
19040172b-1
<br>SQL Server Develop
<br>
<font title="teacher">zheng</font>
<br>
<font title="week">3-5,7-14</font>
<br>
<font title="classroom">D-101</font>
<br>
</div>
I have tried the following approaches, but both failed:
1. Pattern pattern = Pattern.compile(">(.*?)<br>");
2. Elements msg = doc.select(":matchesOwn([>.*?<br>])");

1) First, it's never a good idea to parse HTML with a regex. You can read more about that here.
2) You can just take all the text between the tags:
Document doc = Jsoup.parse(file, charsetName);
String text= doc.text();
System.out.println(text);

String html = "<div id=\"AE9D7F630640426F8457A661607D2B8E-5-2\" style=\"display: none;\" class=\"kbcontent\"> 19040172b-1 <br>SQL Server Develop <br> <font title=\"teacher\">zheng</font> <br> <font title=\"week\">3-5,7-14</font> <br> <font title=\"classroom\">D-101</font> <br> </div> ";
html = html.replaceAll("<br>", "#~#");
Document doc = Jsoup.parse(html.toString());
String newHtml = doc.text();
String[] ary = newHtml.split("#~#");
This does the job, though there may be cleaner ways to handle the br tags.
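If Jsoup is not available, the same splitting idea can be approximated with the standard library alone. The sketch below is a swapped-in technique, not the answer's method: it replaces every tag with a line break and keeps the non-empty pieces. The names TagTextSplitter and textParts are hypothetical, and the approach is only reasonable for small, well-formed fragments like the one in the question.

```java
import java.util.ArrayList;
import java.util.List;

class TagTextSplitter {
    // Turns every tag into a line break, then keeps the non-blank pieces.
    static List<String> textParts(String html) {
        List<String> parts = new ArrayList<>();
        for (String piece : html.replaceAll("<[^>]+>", "\n").split("\n")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String html = "<div class=\"kbcontent\"> 19040172b-1 <br>SQL Server Develop <br> "
                + "<font title=\"teacher\">zheng</font> <br> "
                + "<font title=\"week\">3-5,7-14</font> <br> "
                + "<font title=\"classroom\">D-101</font> <br> </div>";
        // Prints [19040172b-1, SQL Server Develop, zheng, 3-5,7-14, D-101]
        System.out.println(textParts(html));
    }
}
```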

java regular expressions regex

I have a problem extracting data from a website.
I'm trying to get the company name and the price; in this example they are SYGNITY and 8,40.
<a class="link" href="http://www.money.pl/gielda/spolki-gpw/PLCMPLD00016.html">SYGNITY</a>
</td>
<td class="ac"><img width="12" height="11" src="http://static1.money.pl/i/gielda/chart.gif" title="Pokaż wykres" alt="Pokaż wykres" /></td>
<td class="al">SGN</td>
<td class="ar">8,40</td>
I tried to use this pattern, but it doesn't work:
String expr = "<a class=\"link\" href=\"(.+?)\">(.+?)</a>(.+?)<td class=\"ar\">(.+?)</td> ";
Any advice?

Using the Jsoup parser
You should use an HTML parser such as Jsoup, since regex is not a good tool for parsing HTML.
You can do something like this:
String htmlString = "YOUR HTML HERE";
Document document=Jsoup.parse(htmlString);
Element element=document.select("a[href=http://www.money.pl/gielda/spolki-gpw/PLCMPLD00016.html]").first();
System.out.println(element.text()); //SYGNITY
element=document.select("td[class=ar]").first();
System.out.println(element.text()); //8,40
Using regex
If you still want to use a regex, then you could use a regex like below and grab the content from capturing groups:
PLCMPLD00016.html">(.*?)<\/a>|"ar">(.*?)<\/td>
Working demo
String htmlString = "YOUR HTML HERE";
Pattern pattern = Pattern.compile("PLCMPLD00016.html\">(.*?)<\\/a>|\"ar\">(.*?)<\\/td>");
Matcher matcher = pattern.matcher(htmlString);
while (matcher.find()) {
    // Only one alternative matches per find(), so the other group is null.
    if (matcher.group(1) != null) System.out.println(matcher.group(1));
    if (matcher.group(2) != null) System.out.println(matcher.group(2));
}
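Two likely reasons the pattern from the question fails are that . does not match line breaks by default in Java, and that the pattern string ends with a trailing space. A minimal sketch of a single pattern that captures both values in one pass, using Pattern.DOTALL, could look like this. The names QuoteRowParser and parse are hypothetical, and the sample HTML is a shortened version of the snippet in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteRowParser {
    // DOTALL lets '.' cross the line breaks between the link and the price cell.
    private static final Pattern ROW = Pattern.compile(
            "<a class=\"link\" href=\"[^\"]+\">(.+?)</a>.+?<td class=\"ar\">(.+?)</td>",
            Pattern.DOTALL);

    // Returns {name, price}, or null when the pattern does not match.
    static String[] parse(String html) {
        Matcher m = ROW.matcher(html);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String html = "<a class=\"link\" href=\"http://www.money.pl/gielda/spolki-gpw/PLCMPLD00016.html\">SYGNITY</a>\n"
                + "</td>\n"
                + "<td class=\"al\">SGN</td>\n"
                + "<td class=\"ar\">8,40</td>";
        String[] row = parse(html);
        System.out.println(row[0] + " " + row[1]); // SYGNITY 8,40
    }
}
```

This stays as fragile as any regex over HTML: a change in attribute order or class names on the page will silently break it, which is why the Jsoup version is usually the safer choice.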

Add a new html tag to an html string in android

I have a string obtained from an EditText. The string contains html tags.
Spannable s = mainEditText.getText();
String webText = Html.toHtml(s);
The contents of the string are:
<p dir="ltr">test</p>
<p dir="ltr"><img src="http://files.parsetfss.com/bcff7108-cbce-4ab8-b5d1-1f82827e6519/tfss-0de7a730-3fa9-4a1e-9f82-d34e4f6e2d31-file" /><br /></p>
<p dir="ltr"><img src="http://maps.google.com/maps/api/staticmap?center=22.572646,88.363895&zoom=15&size=960x540&sensor=false&markers=color:blue%7Clabel:!%7C22.572646,88.363895" /><br /> </p>
Now, what I want to do is, wherever there is an img src tag, I want to precede it with a center tag.
What should I do to get the following output?
<p dir="ltr">test</p>
<p dir="ltr"><center><img src="http://files.parsetfss.com/bcff7108-cbce-4ab8-b5d1-1f82827e6519/tfss-0de7a730-3fa9-4a1e-9f82-d34e4f6e2d31-file" /></center><br /></p>
<p dir="ltr"><center><img src="http://maps.google.com/maps/api/staticmap?center=22.572646,88.363895&zoom=15&size=960x540&sensor=false&markers=color:blue%7Clabel:!%7C22.572646,88.363895" /></center><br /> </p>
Can a regex solve the issue or should it be done in a different way?
Can JSOUP help in any way? Is there any other type of HTML parser which can do the job?

(<img\s+[^>]*>)
You can try this. Replace with <center>$1</center>. See the demo:
http://regex101.com/r/sU3fA2/38
Something like
var re = /(<img\s+[^>]*>)/g;
var str = '<p dir="ltr">test</p> \n<p dir="ltr"><img src="http://files.parsetfss.com/bcff7108-cbce-4ab8-b5d1-1f82827e6519/tfss-0de7a730-3fa9-4a1e-9f82-d34e4f6e2d31-file" /><br /></p> \n<p dir="ltr"><img src="http://maps.google.com/maps/api/staticmap?center=22.572646,88.363895&zoom=15&size=960x540&sensor=false&markers=color:blue%7Clabel:!%7C22.572646,88.363895" /><br /> </p>';
var subst = '<center>$1</center>';
var result = str.replace(re, subst);

With Jsoup, you can use the wrap() method.
It would look like this:
public String wrapImgWithCenter(String html) {
    Document doc = Jsoup.parse(html);
    doc.getElementsByTag("img").wrap("<center></center>");
    return doc.html();
}

I implemented the Jsoup solution suggested by mourphy, with a small edit to the method, and it worked for me. The new method is:
public String wrapImgWithCenter(String html) {
    Document doc = Jsoup.parse(html);
    doc.select("img").wrap("<center></center>");
    return doc.html();
}
Thanks mourphy and vks for your help!

Using regex, you could also do this in Java:
String formatted = str.replaceAll("(<img\\s+[^>]*>)", "<center>$1</center>");

Java-Jsoup, scrape html

I am using Jsoup with Java to Parse an HTML file. My question is how can I just extract the line that says "Hourly Rate: 23,016 orders"
I am parsing a lot of files, so the number next to the Hourly Rate will change.
<html>
<head>
<title>Testing</title>
</head>
<body>
<p class=MsoNormal align=center style='background:#DEDEDF'>
<span style='font-size:18.0pt'><b>Testing</b></span></p>
Hourly Rate: 23,016 orders<br>
<table border=0 cellpadding=0>
<tr valign=top>
<td>
Thanks

I just added this code:
String HourlyRate = doc.body().ownText();
//String text = doc.body().text();
System.out.println(HourlyRate);
This printed out:
Hourly Rate: 23,016 orders

Grab the MsoNormal class, then use a regular expression to look for a number, i.e.
Document doc = Jsoup.parse(htmlString);
Element msoNormal = doc.getElementsByClass("MsoNormal").first();
if (msoNormal != null) {
    Pattern p = Pattern.compile("[0-9]+,[0-9]+");
    Matcher m = p.matcher(msoNormal.text());
    if (m.find())
        System.out.println(m.group());
}
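Since the number changes from file to file, it may help to anchor the regex on the "Hourly Rate:" label and convert the figure to an int. The sketch below works on plain text that has already been extracted from the page (for example, the string printed by the ownText() answer). The names HourlyRateParser and parseHourlyRate are hypothetical, and returning -1 for a missing value is an arbitrary choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HourlyRateParser {
    // Captures the digits and thousands separators that follow the label.
    private static final Pattern RATE = Pattern.compile("Hourly Rate:\\s*([0-9][0-9,]*)");

    // Returns the order count, or -1 when the text has no "Hourly Rate:" figure.
    static int parseHourlyRate(String text) {
        Matcher m = RATE.matcher(text);
        if (!m.find()) {
            return -1;
        }
        // Drop the thousands separators before parsing.
        return Integer.parseInt(m.group(1).replace(",", ""));
    }

    public static void main(String[] args) {
        System.out.println(parseHourlyRate("Hourly Rate: 23,016 orders")); // 23016
        System.out.println(parseHourlyRate("Testing"));                    // -1
    }
}
```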
