Java regex to replace in between text using a pattern - java

I am a newbie to Java regex. I have a long string which contains text like this (below is only the part of my string which I would like to replace):
href="javascript:openWin('Images/DCRMBex_01B_ex01.jpg',480,640)"
href="javascript:openWin('Images/DCRMBex_01A_ex01.jpg',480,640)"
href="javascript:openWin('Images/DCRMBex_06A_ex06.jpg',480,640)"
I would like to replace
Images
with
http://google.com/Images
For example, my output should look like this:
href="javascript:openWin('http://google.com/Images/DCRMBex_01B_ex01.jpg',480,640)"
Below is my Java program:
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main2 {
public static void main(String[] args) throws FileNotFoundException {
Scanner in = new Scanner(new FileReader("C:\\Projects\\input.txt"));
StringBuilder sb = new StringBuilder();
while (in.hasNext()) {
sb.append(in.next());
}
String patternString = "href=\"javascript:openWin(.+?)\"";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(sb);
while (matcher.find()) {
//System.out.println(matcher.group(1));
//System.out.println(matcher.group(1).replaceAll("Images", "http://google.com/Images"));
matcher.group(1).replaceAll("Images", "http://google.com/Images");
}
System.out.println(sb);
}
}
Below is my input file (input.txt). This is only a part of the file, which is too long to paste here:
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_01_ex01.pdf"><b>Example 1: Bible (Rusch)</b></a> � <a href="javascript:openWin(&#39;Images/DCRMBex_01A_ex01.jpg&#39;,480,640)">Figure 1A. First page of text</a> � <a href="javascript:openWin(&#39;Images/DCRMBex_01B_ex01.jpg&#39;,480,640)">Figure 1B. Source of supplied title</a></td>
<td valign="top">&nbsp;&nbsp;</td>
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_06_ex06.pdf"><b>Example 6: Angelo Carletti</b></a> � <a href="javascript:openWin(&#39;Images/DCRMBex_06A_ex06.jpg&#39;,480,640)">Figure 6A. Title page</a> � <a href="javascript:openWin(&#39;Images/DCRMBex_06B_ex06.jpg&#39;,480,640)">Figure 6B. Colophon showing use of i/j and u/v</a></td>
</tr>
<tr>
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_02_ex02.pdf"><b>Example 2: Greek anthology</b></a> � <a href="javascript:openWin(&#39;Images/DCRMBex_02A_ex02.jpg&#39;,480,640)">Figure 2A. First page of text</a> � <a href="javascript:openWin(&#39;Images/DCRMBex_02B_ex02.jpg&#39;,480,640)">Figure 2B. Colophon</a></td>
<td valign="top">&nbsp;&nbsp;</td>
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_07_ex07.pdf"><b>Example 7: Erasmus</b></a> � <a href="javascript:openWin(&#39;Images/DCRMBex_07A_ex07.jpg&#39;,480,640)">Figure 7A. Title page</a> � <a href="javascript:openWin(&#39;Images/DCRMBex_07B_ex07.jpg&#39;,480,640)">Figure 7B. Colophon</a> � <a href="javascript:openWin(&#39;Images/DCRMBex_07C_ex07.jpg&#39;,640,480)">Figure 7C. Running title</a></td>
</tr>
<tr>
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_03_ex03.pdf"><b>Example 3: Heytesbury</b></a> � <a href="javascript:openWin(&#39;Images/DCRMBex_03A_ex03.jpg&#39;,480,640)">Figure 3A. Title page</a> � <a href="javascript:openWin(&#39;Images/DCRMBex_03B_ex03.jpg&#39;,480,640)">Figure 3B. Colophon showing use of i/j and u/v</a></td>
<td valign="top">&nbsp;&nbsp;</td>
<td valign="top"><a href="http://www.google.com/cds/desktop/documents/DCRMBex/DCRMBex_08_ex08.pdf"><b>Example 8: Pliny</b></a> � <a href="javascript:openWin(&#39;Images/DCRMBex_08A_ex08.jpg&#39;,480,640)">Figure 8A. Title page</a> � <a href="javascript:openWin(&#39;Images/DCRMBex_08B_ex08.jpg&#39;,480,640)">Figure 8B. Colophon</a></td>
Output:
1) System.out.println(matcher.group(1))
(&#39;Images/DCRMBex_05_ex05.jpg&#39;,480,640)
2)System.out.println(matcher.group(1).replaceAll("Images","http://google.com/Images"));
(&#39;http://google.com/Images/DCRMBex_05_ex05.jpg&#39;,480,640)
But when I print my StringBuilder, it doesn't show any replacement. What am I doing wrong here? Any help is appreciated. Thanks

I would recommend using Files.lines() and Java Streams to modify the input. With your actual input you also don't need a regex:
try (Stream<String> lines = Files.lines(Paths.get("input.txt"))) {
String result = lines
.map(line -> line.replace("Images", "http://google.com/Images"))
.collect(Collectors.joining("\n"));
System.out.println(result);
}
If you really want to use a regex, I would recommend compiling the pattern once, outside the loop, because String.replaceAll() compiles its pattern on every call. Performance is better when Pattern.compile() is not repeated for each line. The groups are kept narrow so that every link on a line is rewritten, not just the last one:
Pattern pattern = Pattern.compile("(href=\"javascript:openWin\\(&#39;)(Images/)");
try (Stream<String> lines = Files.lines(Paths.get("input.txt"))) {
String result = lines
.map(pattern::matcher)
.map(matcher -> matcher.replaceAll("$1http://google.com/$2"))
.collect(Collectors.joining("\n"));
System.out.println(result);
}
This regex defines two capturing groups (the parts in parentheses). You can refer to these groups in the replacement string with $index, so $1 inserts the first group and $2 the second.
The result in both cases will be:
href="javascript:openWin(&#39;http://google.com/Images/DCRMBex_01B_ex01.jpg&#39;,480,640)"
href="javascript:openWin(&#39;http://google.com/Images/DCRMBex_01A_ex01.jpg&#39;,480,640)"
href="javascript:openWin(&#39;http://google.com/Images/DCRMBex_06A_ex06.jpg&#39;,480,640)"

replaceAll returns the modified string; it does not modify in place. In this case, I would not use Pattern and Matcher directly and would instead rely on String.replaceAll's support for capture groups:
Scanner in = new Scanner(new FileReader("C:\\Projects\\input.txt"));
StringBuilder sb = new StringBuilder();
// Read whole lines; in.next() would drop all whitespace between tokens
while (in.hasNextLine()) {
sb.append(in.nextLine()).append('\n');
}
// Modified regex
String patternString = "(href=\"javascript:openWin\\(&#39;)(.+?)(&#39;)";
String result = sb.toString().replaceAll(patternString, "$1http://google.com/$2$3");
Hope this helps!
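If the find() loop from the question is worth keeping (for example, to apply per-match logic), a minimal sketch along these lines collects the rewritten text in a separate buffer with Matcher.appendReplacement and appendTail. The class name AppendReplacementDemo and the method prefixImages are made up for illustration. The regex is an assumption based on the sample links: it only touches "Images/" inside an openWin href.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    // Matches one whole href="javascript:openWin(...)" attribute (lazy, so it stops at the first )" ).
    private static final Pattern OPEN_WIN =
            Pattern.compile("href=\"javascript:openWin\\((.+?)\\)\"");

    // Hypothetical helper: returns a new string, the input is never modified.
    static String prefixImages(String html) {
        Matcher matcher = OPEN_WIN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Strings are immutable, so the result of replace() must be kept and used.
            String rewritten = matcher.group().replace("Images/", "http://google.com/Images/");
            // quoteReplacement keeps any '$' or '\' in the rewritten text literal.
            matcher.appendReplacement(out, Matcher.quoteReplacement(rewritten));
        }
        // Copy whatever follows the last match.
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"javascript:openWin(&#39;Images/DCRMBex_01A_ex01.jpg&#39;,480,640)\">Figure 1A</a>";
        System.out.println(prefixImages(html));
    }
}
```

StringBuffer is used because that overload of appendReplacement is available on all Java versions; on Java 9 and later a StringBuilder works the same way.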

Related

How to extract substring(html) and another substring (which will be used for regex) and place it all in proper format?

I have a giant string which contains the code below. I need to process its contents so that any HTML is copied through unchanged, while any substring matching the following pattern is turned into links, in the proper format and in the same place.
Example:
<div id="contentPermission">
[[MI44,MI304,MI409,MI45,MI264,MI108,MI46,MI47,MI48,MI49,MI50,MI51,MI52,MI58,MI530]]
</div>
<div> </div>
<p> </p>
<div> </div>
<p> </p>
<p>[[LP1137]]</p>
Pattern: starts with "[[" and ends with "]]"
From the above code:
[[anything between these brackets]]
So the output should look like this:
<div id="contentPermission">
<a href="index?page=content&id=MI44"></a>
<a href="index?page=content&id=MI304"></a>
<a href="index?page=content&id=MI409"></a>
......
......
</div>
<div> </div>
<p> </p>
<div> </div>
<p> </p>
<p><a href="index?page=content&id=LP1137"></a></p>
Solution
public static void main(String[] args) {
StringBuilder str = new StringBuilder("<div id=\"contentPermission\">"
+ " [[MI44,MI304,MI409,MI45,MI264,MI108,MI46,MI47,MI48,MI49,MI50,MI51,MI52,MI58,MI530]]"
+ "</div><div> </div><p> </p><div> </div><p> </p><p>[[LP1137]]</p>");
System.out.println("Before " + str.toString()+"\n\n\n");
Pattern pattern = Pattern.compile("\\[{2}.[^\\]]*\\]{2}");
Matcher matcher = pattern.matcher(str);
while(matcher.find()){
String codes = matcher.group(0);
codes = codes.substring(2, codes.length()-2);
StringBuilder urls = new StringBuilder();
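for(String code:codes.split(",")){
// One anchor per code, in the format shown in the expected output
urls.append("<a href=\"index?page=content&id=").append(code.trim()).append("\"></a>\n");
}
// quoteReplacement keeps any '$' or '\' in the generated text literal
str = new StringBuilder(matcher.replaceFirst(Matcher.quoteReplacement(urls.toString())));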
matcher = pattern.matcher(str);
}
System.out.println("Replaced " + str.toString());
}
Another solution using regex only (no split, inner loop, or substring):
String content = "<div id=\"contentPermission\">[[MI44,MI304,MI409,MI45,MI264,MI108,MI46,MI47,MI48,MI49,MI50,MI51,MI52,MI58,MI530]]</div><div> </div><p> </p><div> </div><p> </p><p>[[LP1137]]</p>";
Pattern p = Pattern.compile("(?<=\\[\\[).*?(?=\\]\\])");
Matcher m = p.matcher(content);
while(m.find())
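// $1 is the code captured by (\w+); each code becomes one anchor as in the expected output
content = content.replaceFirst("(\\[\\[).*?(\\]\\])", m.group().replaceAll("(\\w+)(,\\s*\\d*)*", "<a href=\"index?page=content&id=$1\"></a>"));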
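For comparison, here is a minimal single-pass sketch of the same idea. It assumes Java 9 or later, where Matcher.replaceAll accepts a function, and it assumes the link format from the expected output in the question. BracketLinker and link are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class BracketLinker {
    // Group 1 is whatever sits between "[[" and the nearest "]]".
    private static final Pattern CODES = Pattern.compile("\\[\\[(.*?)\\]\\]");

    // Hypothetical helper: replaces each [[A,B,...]] block with one anchor per code.
    static String link(String html) {
        return CODES.matcher(html).replaceAll(match ->
                // quoteReplacement keeps the generated HTML literal in the replacement.
                Matcher.quoteReplacement(
                        Arrays.stream(match.group(1).split(","))
                                .map(String::trim)
                                .map(code -> "<a href=\"index?page=content&id=" + code + "\"></a>")
                                .collect(Collectors.joining("\n"))));
    }

    public static void main(String[] args) {
        System.out.println(link("<div>[[MI44,MI304]]</div><p>[[LP1137]]</p>"));
    }
}
```

Because the matcher walks the input once, there is no need to rebuild the string and restart matching after every replacement.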

java regular expressions regex

I have a problem extracting data from a website.
I'm trying to get the company name and its price; here they are SYGNITY and 8,40:
<a class="link" href="http://www.money.pl/gielda/spolki-gpw/PLCMPLD00016.html">SYGNITY</a>
</td>
<td class="ac"><img width="12" height="11" src="http://static1.money.pl/i/gielda/chart.gif" title="Pokaż wykres" alt="Pokaż wykres" /></td>
<td class="al">SGN</td>
<td class="ar">8,40</td>
I tried to use this pattern, but it doesn't work:
String expr = "<a class=\"link\" href=\"(.+?)\">(.+?)</a>(.+?)<td class=\"ar\">(.+?)</td> ";
Any advice?
Using JSoup parser
You should use an HTML parser like JSoup, since regex is not a good tool for parsing HTML.
You can do something like this:
String htmlString = "YOUR HTML HERE";
Document document=Jsoup.parse(htmlString);
Element element=document.select("a[href=http://www.money.pl/gielda/spolki-gpw/PLCMPLD00016.html]").first();
System.out.println(element.text()); //SYGNITY
element=document.select("td[class=ar]").first();
System.out.println(element.text()); //8,40
Using regex
If you still want to use a regex, then you could use a regex like below and grab the content from capturing groups:
PLCMPLD00016.html">(.*?)<\/a>|"ar">(.*?)<\/td>
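String htmlString = "YOUR HTML HERE";
// Inside a Java string literal the double quotes must be escaped; '/' needs no escaping
Pattern pattern = Pattern.compile("PLCMPLD00016\\.html\">(.*?)</a>|\"ar\">(.*?)</td>");
Matcher matcher = pattern.matcher(htmlString);
while (matcher.find()) {
// Only one alternative matches at a time, so the other group is null
if (matcher.group(1) != null) {
System.out.println(matcher.group(1)); // company name
} else {
System.out.println(matcher.group(2)); // price
}
}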
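Two details probably explain why the pattern in the question never matches: by default "." does not match line terminators, so (.+?) cannot span the lines between </a> and the price cell, and the pattern ends with a literal space. A minimal sketch of a corrected single-pattern version, using Pattern.DOTALL, is shown here. QuoteExtractor and extract are hypothetical names, and the markup is assumed to look exactly like the snippet in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteExtractor {
    // DOTALL lets '.' cross line breaks; the lazy quantifiers stop at the first possible match.
    private static final Pattern ROW = Pattern.compile(
            "<a class=\"link\" href=\"[^\"]+\">(.+?)</a>.+?<td class=\"ar\">(.+?)</td>",
            Pattern.DOTALL);

    // Hypothetical helper: returns {name, price}, or null when the markup does not match.
    static String[] extract(String html) {
        Matcher m = ROW.matcher(html);
        return m.find() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        String html = "<a class=\"link\" href=\"http://www.money.pl/gielda/spolki-gpw/PLCMPLD00016.html\">SYGNITY</a>\n"
                + "</td>\n<td class=\"al\">SGN</td>\n<td class=\"ar\">8,40</td>";
        String[] result = extract(html);
        System.out.println(result[0] + " " + result[1]);
    }
}
```

This stays fragile against markup changes, which is why the parser-based approach is still the safer choice.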

Parse xml with empty valued attribute

I have an input in this format:
<table>
<tbody>
<tr bgcolor='#999999'>
<td nowrap width='1%'>
</td>
<td nowrap width='3%' align='center'>
<font style='font-size: 8pt'> System ID </font>
</td>
<td nowrap width='5%' align='center'>
To remove the nowrap attribute, I was previously using this code:
if (deletedString == null)
{
return exportedTable;
}
int tagPos = 0;
String resultTable = exportedTable;
while (resultTable.indexOf(deletedString) != -1)
{
tagPos = resultTable.indexOf(deletedString, tagPos);
String beforTag = resultTable.substring(0, tagPos);
String afterTag = resultTable.substring(tagPos + deletedString.length());
resultTable = beforTag + afterTag;
}
return resultTable;
deletedString is nowrap, and the input is exportedTable.
But this is causing performance issues. Is there a better way to do it?
My recommendation: StringUtils.remove(source, substring) from Apache Commons Lang will remove all instances of the substring from the source string. A linked answer benchmarked this method and found it to be five times faster than a few alternatives.
Alternatively, use a StringBuilder to aggregate your substrings - every time you concatenate two strings you're creating a new string, whereas StringBuilder is mutable and doesn't need to create a new copy on an update.
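A minimal sketch of the StringBuilder variant, using only the standard library, might look like this. SubstringRemover and removeAll are hypothetical names. Unlike the loop in the question, it scans the source once and copies each kept segment once, so text that only becomes a match after a removal is left alone.

```java
class SubstringRemover {
    // Hypothetical helper: removes every occurrence of toDelete in a single pass.
    static String removeAll(String source, String toDelete) {
        if (source == null || toDelete == null || toDelete.isEmpty()) {
            return source;
        }
        StringBuilder out = new StringBuilder(source.length());
        int from = 0; // start of the segment not yet copied
        int hit;
        while ((hit = source.indexOf(toDelete, from)) != -1) {
            out.append(source, from, hit);   // keep the text before the match
            from = hit + toDelete.length();  // skip the match itself
        }
        out.append(source, from, source.length()); // keep the tail
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeAll("<td nowrap width='1%'>", "nowrap "));
    }
}
```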
You could create an XMLStreamReader and use a while loop that walks through the XML for as long as streamReader.hasNext() returns true.
Format:
//Create stream reader
//Position at beginning of document
//While the stream reader has next (another parsing event is available)
//perform action

WebDriver getText() Method with replace using Java

Html Code
<span data-bind="html: TotalCharges">
<span class="CurrencySymbol">USD </span>
7400.00
<br>
(0.00+0.00)
</span>
WebDriver code to get the TotalCharges value using the getText method
Code:
driver.findElement(By.xpath("//span[@data-bind='html: TotalCharges']")).getText().substring(4);
With the above I get the output below:
"7400.00
(0.00+0.00)"
My expected output: "7400.00"
So how can I drop the text that follows the <br> tag (I need to remove "(0.00+0.00)")?
I'm using Java.
Use the following xpath to get 7400.00:
driver.findElement(By.xpath("//span[@class='CurrencySymbol']/following-sibling::text()[1]")).getText();
Oh, my mistake. Thanks for correcting me, @alecxe:
You can get it by:
driver.findElement(By.xpath("//span[@class='CurrencySymbol']/.."))
.getText().split("\n")[0].split(" ")[1];
Splitting at \n separates the text at the <br> tag.
Try the following solution. It will give you 7400.00 as output:
String temp = driver.findElement(By.cssSelector("html>body>span")).getText();
String s1=temp.replace("USD", "").replace("\n", "").replace("\r", "");
String finalStr = s1.substring(0,s1.indexOf("(")).trim();
System.out.println(finalStr);
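The string handling in these answers can be tried without a browser. The sketch below assumes that getText() on the outer span returns "USD 7400.00" followed by a line break and "(0.00+0.00)", as the output in the question suggests. TotalChargeParser and firstAmount are hypothetical names.

```java
class TotalChargeParser {
    // Hypothetical helper: takes the element text and returns the amount on the first line.
    static String firstAmount(String elementText) {
        // \R matches any line break (\n, \r\n, ...); the text after it is ignored.
        String firstLine = elementText.split("\\R", 2)[0].trim();
        // Drop the currency prefix, if any, by keeping what follows the last space.
        int space = firstLine.lastIndexOf(' ');
        return space >= 0 ? firstLine.substring(space + 1) : firstLine;
    }

    public static void main(String[] args) {
        System.out.println(firstAmount("USD 7400.00\n(0.00+0.00)"));
    }
}
```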

Jsoup returned string " " is not returning true on equals(" ")

Just playing around and pulling some data off a site to manipulate when I come across this:
String request = "http://foo";
String data = "bar";
Connection.Response res = Jsoup.connect(request).data(data).method(Method.POST).execute();
Document doc = res.parse();
Elements all = doc.select("td");
for(Element elem : all){
String test = elem.text();
if(test.equals(" ")){
//redefine test to 0 and print it
}
else{
//print it
}
}
The site in question is coded as so:
<td align="center">Henry</td>
<td>23</td>
<td align="center">Savannah</td>
<td>15</td></tr>
...
<td align="center"> </td>
<td> </td>
<td align="center">Jane</td>
<td>15</td></tr>
In my for loop, test is never redefined.
I've debugged in Eclipse and inspected both String test and test.charAt(0) (screenshots not preserved).
org.jsoup.nodes.Element.text() says "Returns unencoded text or empty string if none". I'm assuming the unencoded part has something to do with this, but I can't figure it out.
I ran a test program:
public static void main(String[] args) {
String str = " ";
if (str.equals(" ")){
System.out.println("True");
}
}
and it returns true.
What gives?
I don't know if you control the HTML being sent in the body of the response or if that is what you see in a browser's source page or elsewhere
<td> </td>
But it's possible the actual content is
<td>&nbsp;</td> // or &#160;
where &nbsp; is the HTML entity for the non-breaking space.
In Java, you can represent it as
char nbsp = 160;
So you could just check for both char values, the one for space and the one for non-breaking space.
Note that there might be other codepoints that are represented as white space. You need to know what you're looking for.
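A minimal sketch of such a check, treating a cell as blank when it contains only ordinary whitespace or non-breaking spaces, could look like this. BlankCheck and isBlankCell are hypothetical names, and the choice to count an empty string as blank is an assumption.

```java
class BlankCheck {
    private static final char NBSP = '\u00A0'; // code point 160, the non-breaking space

    // Hypothetical helper: true when every character is whitespace or a non-breaking space.
    static boolean isBlankCell(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Character.isWhitespace deliberately excludes non-breaking spaces,
            // so NBSP has to be checked separately.
            if (c != NBSP && !Character.isWhitespace(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String fromJsoup = "\u00A0"; // what a cell holding only a non-breaking space may yield
        System.out.println(fromJsoup.equals(" "));   // false: different characters
        System.out.println(isBlankCell(fromJsoup));  // true
    }
}
```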
