Java regex to replace in between text using a pattern

I am new to Java regex. I have a long string that contains text like the following (only the part I would like to change is shown):
href="javascript:openWin(&#39;Images/DCRMBex_01B_ex01.jpg&#39;,480,640)"
href="javascript:openWin(&#39;Images/DCRMBex_01A_ex01.jpg&#39;,480,640)"
href="javascript:openWin(&#39;Images/DCRMBex_06A_ex06.jpg&#39;,480,640)"
I would like to replace
Images
with
http://google.com/Images
For example, my output should look like this:
href="javascript:openWin(&#39;http://google.com/Images/DCRMBex_01B_ex01.jpg&#39;,480,640)"
Below is my Java program:
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main2 {
public static void main(String[] args) throws FileNotFoundException {
Scanner in = new Scanner(new FileReader("C:\\Projects\\input.txt"));
StringBuilder sb = new StringBuilder();
while (in.hasNext()) {
sb.append(in.next());
}
String patternString = "href=\"javascript:openWin(.+?)\"";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(sb);
while (matcher.find()) {
//System.out.println(matcher.group(1));
//System.out.println(matcher.group(1).replaceAll("Images", "http://google.com/Images"));
matcher.group(1).replaceAll("Images", "http://google.com/Images");
}
System.out.println(sb);
}
}
Below is part of my input file (input.txt); the full file is too long to paste here:
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_01_ex01.pdf"><b>Example 1: Bible (Rusch)</b></a> ï¿½ <a href="javascript:openWin(&#39;Images/DCRMBex_01A_ex01.jpg&#39;,480,640)">Figure 1A. First page of text</a> ï¿½ <a href="javascript:openWin(&#39;Images/DCRMBex_01B_ex01.jpg&#39;,480,640)">Figure 1B. Source of supplied title</a></td>
<td valign="top">&nbsp;&nbsp;</td>
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_06_ex06.pdf"><b>Example 6: Angelo Carletti</b></a> ï¿½ <a href="javascript:openWin(&#39;Images/DCRMBex_06A_ex06.jpg&#39;,480,640)">Figure 6A. Title page</a> ï¿½ <a href="javascript:openWin(&#39;Images/DCRMBex_06B_ex06.jpg&#39;,480,640)">Figure 6B. Colophon showing use of i/j and u/v</a></td>
</tr>
<tr>
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_02_ex02.pdf"><b>Example 2: Greek anthology</b></a> ï¿½ <a href="javascript:openWin(&#39;Images/DCRMBex_02A_ex02.jpg&#39;,480,640)">Figure 2A. First page of text</a> ï¿½ <a href="javascript:openWin(&#39;Images/DCRMBex_02B_ex02.jpg&#39;,480,640)">Figure 2B. Colophon</a></td>
<td valign="top">&nbsp;&nbsp;</td>
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_07_ex07.pdf"><b>Example 7: Erasmus</b></a> ï¿½ <a href="javascript:openWin(&#39;Images/DCRMBex_07A_ex07.jpg&#39;,480,640)">Figure 7A. Title page</a> ï¿½ <a href="javascript:openWin(&#39;Images/DCRMBex_07B_ex07.jpg&#39;,480,640)">Figure 7B. Colophon</a> ï¿½ <a href="javascript:openWin(&#39;Images/DCRMBex_07C_ex07.jpg&#39;,640,480)">Figure 7C. Running title</a></td>
</tr>
<tr>
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_03_ex03.pdf"><b>Example 3: Heytesbury</b></a> ï¿½ <a href="javascript:openWin(&#39;Images/DCRMBex_03A_ex03.jpg&#39;,480,640)">Figure 3A. Title page</a> ï¿½ <a href="javascript:openWin(&#39;Images/DCRMBex_03B_ex03.jpg&#39;,480,640)">Figure 3B. Colophon showing use of i/j and u/v</a></td>
<td valign="top">&nbsp;&nbsp;</td>
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_08_ex08.pdf"><b>Example 8: Pliny</b></a> ï¿½ <a href="javascript:openWin(&#39;Images/DCRMBex_08A_ex08.jpg&#39;,480,640)">Figure 8A. Title page</a> ï¿½ <a href="javascript:openWin(&#39;Images/DCRMBex_08B_ex08.jpg&#39;,480,640)">Figure 8B. Colophon</a></td>
Output:
1) System.out.println(matcher.group(1))
(&#39;Images/DCRMBex_05_ex05.jpg&#39;,480,640)
2)System.out.println(matcher.group(1).replaceAll("Images","http://google.com/Images"));
(&#39;http://google.com/Images/DCRMBex_05_ex05.jpg&#39;,480,640)
But when I print my StringBuilder, it doesn't show any replacement. What am I doing wrong here? Any help is appreciated. Thanks.

I would recommend using Files.lines() and Java streams to process the input. For your actual input you don't even need a regex:
try (Stream<String> lines = Files.lines(Paths.get("input.txt"))) {
String result = lines
.map(line -> line.replace("Images", "http://google.com/Images"))
.collect(Collectors.joining("\n"));
System.out.println(result);
}
If you really want to use a regex, I would recommend compiling the pattern once, outside the loop, because String.replaceAll() internally compiles the pattern every time you call it. Performance is much better if Pattern.compile() is not run for each line:
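// Anchor on the fixed prefix instead of a greedy .* so every link on a line is rewritten, not only the last one
Pattern pattern = Pattern.compile("(href=\"javascript:openWin\\(&#39;)(Images/)");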
try (Stream<String> lines = Files.lines(Paths.get("input.txt"))) {
String result = lines
.map(pattern::matcher)
.map(matcher -> matcher.replaceAll("$1http://google.com/$2"))
.collect(Collectors.joining("\n"));
System.out.println(result);
}
The regex used for the replacement defines two capture groups (the parts in parentheses). You can refer to these groups in the replacement string as $index, so $1 inserts the first group and $2 the second.
The result in both cases will be:
href="javascript:openWin(&#39;http://google.com/Images/DCRMBex_01B_ex01.jpg&#39;,480,640)"
href="javascript:openWin(&#39;http://google.com/Images/DCRMBex_01A_ex01.jpg&#39;,480,640)"
href="javascript:openWin(&#39;http://google.com/Images/DCRMBex_06A_ex06.jpg&#39;,480,640)"
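The group references described above can be tried on an in-memory line, without a file. This is only a sketch: the class name GroupReferenceDemo, the method prefixImages, and the file names a.jpg and b.jpg are made up for illustration. The sample line holds two links to show that each occurrence is rewritten.

```java
import java.util.regex.Pattern;

class GroupReferenceDemo {
    // Compiled once and reused for every line.
    // Group 1 is the fixed prefix up to the opening &#39;, group 2 is the start of the path.
    private static final Pattern LINK =
            Pattern.compile("(href=\"javascript:openWin\\(&#39;)(Images/)");

    static String prefixImages(String line) {
        // $1 and $2 re-insert the captured groups around the new host
        return LINK.matcher(line).replaceAll("$1http://google.com/$2");
    }

    public static void main(String[] args) {
        String line = "<a href=\"javascript:openWin(&#39;Images/a.jpg&#39;,480,640)\">A</a>"
                + " <a href=\"javascript:openWin(&#39;Images/b.jpg&#39;,480,640)\">B</a>";
        System.out.println(prefixImages(line));
    }
}
```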

replaceAll returns the modified string; it does not modify the original in place, so the result has to be assigned to something. In this case, I would not use Pattern and Matcher directly and would instead rely on String.replaceAll's support for capture groups:
Scanner in = new Scanner(new FileReader("C:\\Projects\\input.txt"));
StringBuilder sb = new StringBuilder();
while (in.hasNextLine()) {
// Read whole lines; in.next() would drop the whitespace between tokens
sb.append(in.nextLine()).append('\n');
}
// Modified regex
String patternString = "(href=\"javascript:openWin\\(&#39;)(.+?)(&#39;)";
String result = sb.toString().replaceAll(patternString, "$1http://google.com/$2$3");
Hope this helps!
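If the Matcher loop from the question is preferred, one option is Matcher.appendReplacement together with appendTail, which builds the output while iterating instead of discarding each result. The following is a minimal sketch, assuming the whole input is already held in a String; the class name AppendReplacementDemo and the method rewrite are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    // Group 1 captures everything between "openWin(" and the closing ")"
    private static final Pattern OPEN_WIN =
            Pattern.compile("href=\"javascript:openWin\\((.+?)\\)\"");

    static String rewrite(String input) {
        Matcher matcher = OPEN_WIN.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Rewrite only the text inside openWin(...)
            String args = matcher.group(1).replace("Images/", "http://google.com/Images/");
            String replacement = "href=\"javascript:openWin(" + args + ")\"";
            // quoteReplacement keeps any '$' or '\' in the text literal
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"javascript:openWin(&#39;Images/DCRMBex_01A_ex01.jpg&#39;,480,640)\">Figure 1A</a>";
        System.out.println(rewrite(html));
    }
}
```

Text outside the matched href attributes, such as link captions, is copied through unchanged.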

Related

How to extract substring(html) and another substring (which will be used for regex) and place it all in proper format?

I have a giant string that contains the markup below. I need to process it so that any HTML is kept as it is, while any substring matching the pattern described below is turned into links in the proper format and put back in the same place, and so on for the rest of the string.
Example:
<div id="contentPermission">
[[MI44,MI304,MI409,MI45,MI264,MI108,MI46,MI47,MI48,MI49,MI50,MI51,MI52,MI58,MI530]]
</div>
<div> </div>
<p> </p>
<div> </div>
<p> </p>
<p>[[LP1137]]</p>
Pattern: starts with "[[" and ends with "]]".
From the above code:
[[anything between these brackets]]
So the output should look like this:
<div id="contentPermission">
<a href="index?page=content&id=MI44"></a>
<a href="index?page=content&id=MI304"></a>
<a href="index?page=content&id=MI409"></a>
......
......
</div>
<div> </div>
<p> </p>
<div> </div>
<p> </p>
<p><a href="index?page=content&id=LP1137"></a></p>

Solution
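import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) {
    StringBuilder str = new StringBuilder("<div id=\"contentPermission\">"
            + " [[MI44,MI304,MI409,MI45,MI264,MI108,MI46,MI47,MI48,MI49,MI50,MI51,MI52,MI58,MI530]]"
            + "</div><div> </div><p> </p><div> </div><p> </p><p>[[LP1137]]</p>");
    System.out.println("Before " + str + "\n\n\n");
    // Matches "[[", then a run of characters other than "]", then "]]"
    Pattern pattern = Pattern.compile("\\[{2}.[^\\]]*\\]{2}");
    Matcher matcher = pattern.matcher(str);
    while (matcher.find()) {
        String codes = matcher.group(0);
        // Strip the surrounding "[[" and "]]"
        codes = codes.substring(2, codes.length() - 2);
        StringBuilder urls = new StringBuilder();
        for (String code : codes.split(",")) {
            // One link per code; the link format is taken from the expected output in the question
            urls.append("<a href=\"index?page=content&id=").append(code.trim()).append("\"></a>\n");
        }
        // Replace the first remaining match, then rescan the updated text
        str = new StringBuilder(matcher.replaceFirst(Matcher.quoteReplacement(urls.toString())));
        matcher = pattern.matcher(str);
    }
    System.out.println("Replaced " + str);
}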

Another solution using only regex (no split, inner loop, or substring):
String content = "<div id=\"contentPermission\">[[MI44,MI304,MI409,MI45,MI264,MI108,MI46,MI47,MI48,MI49,MI50,MI51,MI52,MI58,MI530]]</div><div> </div><p> </p><div> </div><p> </p><p>[[LP1137]]</p>";
Pattern p = Pattern.compile("(?<=\\[\\[).*?(?=\\]\\])");
Matcher m = p.matcher(content);
while (m.find()) {
    // Turn each code into a link (format taken from the expected output), then swap it in for the next [[...]]
    content = content.replaceFirst("(\\[\\[).*?(\\]\\])",
            m.group().replaceAll("(\\w+)(,\\s*\\d*)*", "<a href=\"index?page=content&id=$1\"></a>"));
}
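A single-pass variant is also possible with Matcher.appendReplacement, which avoids rescanning the string after every replacement. This is a sketch rather than either answerer's method; the class name BracketLinkDemo and the method expandCodes are invented for illustration, and the link format is copied from the expected output in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketLinkDemo {
    // Group 1 is the comma-separated list between "[[" and "]]"
    private static final Pattern CODES = Pattern.compile("\\[\\[(.*?)\\]\\]");

    static String expandCodes(String html) {
        Matcher m = CODES.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder links = new StringBuilder();
            for (String code : m.group(1).split(",")) {
                links.append("<a href=\"index?page=content&id=")
                     .append(code.trim())
                     .append("\"></a>\n");
            }
            // quoteReplacement prevents '$' or '\' in a code from being read as a group reference
            m.appendReplacement(out, Matcher.quoteReplacement(links.toString()));
        }
        m.appendTail(out); // keep the HTML after the last [[...]] block
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expandCodes("<div>[[MI44,MI304]]</div><p>[[LP1137]]</p>"));
    }
}
```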

java regular expressions regex

I have a problem extracting data from a website.
I'm trying to get the company name and the price; here they are SYGNITY and 8,40:
<a class="link" href="http://www.money.pl/gielda/spolki-gpw/PLCMPLD00016.html">SYGNITY</a>
</td>
<td class="ac"><img width="12" height="11" src="http://static1.money.pl/i/gielda/chart.gif" title="Pokaż wykres" alt="Pokaż wykres" /></td>
<td class="al">SGN</td>
<td class="ar">8,40</td>
I tried this pattern, but it doesn't work:
String expr = "<a class=\"link\" href=\"(.+?)\">(.+?)</a>(.+?)<td class=\"ar\">(.+?)</td> ";
Any advice?

Using JSoup parser
You should use an HTML parser such as Jsoup, since regex is a poor fit for parsing HTML.
You can do something like this:
String htmlString = "YOUR HTML HERE";
Document document=Jsoup.parse(htmlString);
Element element=document.select("a[href=http://www.money.pl/gielda/spolki-gpw/PLCMPLD00016.html]").first();
System.out.println(element.text()); //SYGNITY
element=document.select("td[class=ar]").first();
System.out.println(element.text()); //8,40
Using regex
If you still want to use a regex, then you could use a regex like below and grab the content from capturing groups:
PLCMPLD00016.html">(.*?)<\/a>|"ar">(.*?)<\/td>
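String htmlString = "YOUR HTML HERE";
// Quotes inside the Java string literal must be escaped; '/' needs no escaping in a Java regex
Pattern pattern = Pattern.compile("PLCMPLD00016\\.html\">(.*?)</a>|\"ar\">(.*?)</td>");
Matcher matcher = pattern.matcher(htmlString);
while (matcher.find()) {
    // Only one alternative matches per find(), so the other group is null
    if (matcher.group(1) != null) System.out.println(matcher.group(1)); // SYGNITY
    if (matcher.group(2) != null) System.out.println(matcher.group(2)); // 8,40
}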
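The pattern in the question likely fails for two reasons: by default "." does not match line breaks, and the pattern ends with a stray space. A sketch of a single pattern that captures both values using Pattern.DOTALL follows; the class name QuoteExtractDemo and the method extract are hypothetical, and the sample HTML is a shortened version of the snippet in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteExtractDemo {
    // DOTALL lets '.' cross the line breaks between the link and the price cell
    private static final Pattern ROW = Pattern.compile(
            "<a class=\"link\" href=\"[^\"]*\">(.+?)</a>.*?<td class=\"ar\">(.+?)</td>",
            Pattern.DOTALL);

    // Returns {name, price}, or an empty array when nothing matches
    static String[] extract(String html) {
        Matcher m = ROW.matcher(html);
        return m.find() ? new String[] { m.group(1), m.group(2) } : new String[0];
    }

    public static void main(String[] args) {
        String html = "<a class=\"link\" href=\"http://www.money.pl/gielda/spolki-gpw/PLCMPLD00016.html\">SYGNITY</a>\n"
                + "</td>\n"
                + "<td class=\"al\">SGN</td>\n"
                + "<td class=\"ar\">8,40</td>";
        String[] result = extract(html);
        System.out.println(result[0] + " " + result[1]);
    }
}
```

This only holds as long as the markup keeps this exact shape, which is why a parser remains the more robust choice.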

Parse xml with empty valued attribute

I have input in this format:
<table>
<tbody>
<tr bgcolor='#999999'>
<td nowrap width='1%'>
</td>
<td nowrap width='3%' align='center'>
<font style='font-size: 8pt'> System ID </font>
</td>
<td nowrap width='5%' align='center'>
To remove the nowrap attribute, I was using this code:
if (deletedString == null)
{
return exportedTable;
}
int tagPos = 0;
String resultTable = exportedTable;
while (resultTable.indexOf(deletedString) != -1)
{
tagPos = resultTable.indexOf(deletedString, tagPos);
String beforTag = resultTable.substring(0, tagPos);
String afterTag = resultTable.substring(tagPos + deletedString.length());
resultTable = beforTag + afterTag;
}
return resultTable;
deletedString is nowrap, and the input is exportedTable.
But this causes performance issues. Is there a better way to do it?

My recommendation: StringUtils.remove(source, substring) from Apache Commons Lang removes all instances of the substring from the source string. A benchmark in another answer found this method to be five times faster than a few alternatives.
Alternatively, use a StringBuilder to aggregate your substrings - every time you concatenate two strings you're creating a new string, whereas StringBuilder is mutable and doesn't need to create a new copy on an update.
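The StringBuilder suggestion can be sketched as a single pass that copies everything except the matches, so no intermediate strings are created. The class name RemoveAllDemo and the method removeAll are made up for this example; only standard JDK calls are used.

```java
class RemoveAllDemo {
    static String removeAll(String source, String target) {
        if (source == null || target == null || target.isEmpty()) {
            return source;
        }
        StringBuilder out = new StringBuilder(source.length());
        int from = 0;
        int hit;
        while ((hit = source.indexOf(target, from)) != -1) {
            out.append(source, from, hit); // copy the text before the match
            from = hit + target.length();  // skip over the match itself
        }
        out.append(source, from, source.length()); // copy the remainder
        return out.toString();
    }

    public static void main(String[] args) {
        String table = "<td nowrap width='1%'></td><td nowrap width='3%' align='center'>";
        System.out.println(removeAll(table, " nowrap"));
    }
}
```

Passing " nowrap" with the leading space also removes the separator, so no double space is left behind in the tag.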

You could create an XMLStreamReader and use a while loop that walks through the XML for as long as streamReader.hasNext() returns true.
Format:
// Create the stream reader
// Position it at the beginning of the document
// While the stream reader has a next event
// perform the action

WebDriver getText() Method with replace using Java

Html Code
<span data-bind="html: TotalCharges">
<span class="CurrencySymbol">USD </span>
7400.00
<br>
(0.00+0.00)
</span>
I use WebDriver's getText() method to read the TotalCharges value.
Code:
driver.findElement(By.xpath("//span[@data-bind='html: TotalCharges']")).getText().substring(4);
This gives the following output:
"7400.00
(0.00+0.00)"
My expected output is "7400.00".
How can I drop the text that follows the <br> tag, i.e. remove "(0.00+0.00)"?
I'm using Java.

Use the following XPath to get 7400.00:
driver.findElement(By.xpath("//span[@class='CurrencySymbol']/following-sibling::text()[1]")).getText();
Oh, my mistake. Thanks for correcting me, @alecxe:
You can get it with:
driver.findElement(By.xpath("//span[@class='CurrencySymbol']/.."))
.getText().split("\n")[0].split(" ")[1]
Splitting at \n separates the text at the <br> tag.

Try the following solution. It gives 7400.00 as output:
String temp = driver.findElement(By.cssSelector("html>body>span")).getText();
String s1=temp.replace("USD", "").replace("\n", "").replace("\r", "");
String finalStr = s1.substring(0,s1.indexOf("(")).trim();
System.out.println(finalStr);
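The string handling can be checked separately from the browser. The sketch below assumes getText() returns "USD 7400.00" followed by a line break and "(0.00+0.00)", as the question describes, and picks the first decimal number with a regex. The class name TotalChargeDemo and the method firstAmount are hypothetical, and no Selenium call is involved.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TotalChargeDemo {
    // Digits, optionally followed by a decimal part
    private static final Pattern AMOUNT = Pattern.compile("\\d+(?:\\.\\d+)?");

    // Returns the first number in the text, or an empty string if there is none
    static String firstAmount(String elementText) {
        Matcher m = AMOUNT.matcher(elementText);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        // Stand-in for the value returned by WebElement.getText()
        String text = "USD 7400.00\n(0.00+0.00)";
        System.out.println(firstAmount(text));
    }
}
```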

Jsoup returned string " " is not returning true on equals(" ")

Just playing around and pulling some data off a site to manipulate when I come across this:
String request = "http://foo";
String data = "bar";
Connection.Response res = Jsoup.connect(request).data(data).method(Method.POST).execute();
Document doc = res.parse();
Elements all = doc.select("td");
for(Element elem : all){
String test = elem.text();
if(test.equals(" ")){
//redefine test to 0 and print it
}
else{
//print it
}
}
The site in question is coded as so:
<td align="center">Henry</td>
<td>23</td>
<td align="center">Savannah</td>
<td>15</td></tr>
...
<td align="center"> </td>
<td> </td>
<td align="center">Jane</td>
<td>15</td></tr>
In my for loop, test is never redefined.
I've debugged in Eclipse to inspect String test and test.charAt(0):
[Screenshots: Eclipse debugger showing the value of test and of test.charAt(0)]
org.jsoup.nodes.Element.text() says "Returns unencoded text or empty string if none". I'm assuming the unencoded part has something to do with this, but I can't figure it out.
I ran a test program:
public static void main(String[] args) {
String str = " ";
if (str.equals(" ")){
System.out.println("True");
}
}
and it prints True.
What gives?

I don't know whether you control the HTML sent in the body of the response, or whether this is what you see in a browser's page source or elsewhere:
<td> </td>
But it's possible the actual content is
<td>&nbsp;</td> // or &#160;
where &nbsp; is the HTML entity for the non-breaking space.
In Java, you can represent it as
char nbsp = 160;
So you could just check for both char values: the one for a regular space and the one for the non-breaking space.
Note that other code points are also rendered as white space, so you need to know which ones you're looking for.
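A check that covers both kinds of space might look like the sketch below. The class name BlankCellDemo and the method isBlank are made up for illustration. It relies on the fact that Character.isWhitespace returns false for the non-breaking space (U+00A0), while Character.isSpaceChar returns true for it.

```java
class BlankCellDemo {
    // True when the text is empty or consists only of spacing characters,
    // including the non-breaking space U+00A0
    static boolean isBlank(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // isWhitespace misses U+00A0; isSpaceChar covers it
            if (!Character.isWhitespace(c) && !Character.isSpaceChar(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String cell = "\u00A0"; // what a &nbsp; cell could yield as text
        System.out.println(cell.equals(" ")); // false: different code point
        System.out.println(isBlank(cell));    // true
    }
}
```

In the loop from the question, the condition test.equals(" ") could then be swapped for a call like isBlank(test).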
