I have
String s = "https://stackoverflow.com<br/>https://google.com"
Now I just want to replace all links in the href attributes by prefixing them with a fixed value (e.g. `abc.com?`). Here's the result that I want:
String s = "https://stackoverflow.com<br/>https://google.com"
I tried the following, but it doesn't solve the problem because it replaces every string beginning with http://, not only those within href attributes:
s= s.replaceAll("http://.+?(com|net|org|vn)/{0,1}","abc.com" + "&url=" + "$0");
What can I do to replace only within the attribute, and not in other content?
You could use an HTML parser such as jsoup:
String s = "https://stackoverflow.com";
Document document = Jsoup.parse(s);
Elements anchors = document.getElementsByTag("a");
anchors.get(0).attr("href", "...new href...");
Alternatively, if this is too heavyweight, a regex should suffice:
<a href="(?<url>[^"]+)">(?<text>[^<]+)<\/a>
Note: if you don't care about the text group, replace ?<text> with ?:
Just replace the url & text group using a similar approach to this answer
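For what it's worth, here is a minimal Java sketch of that group-based replacement. The anchor markup in the sample input and the prefix `abc.com?url=` are assumptions for illustration (the markup in the question did not survive), and `HrefPrefixer` / `prefixHrefs` are hypothetical names. Java's replaceAll accepts `${name}` references to named groups.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefPrefixer {
    // Same pattern as above, with quotes escaped for a Java string literal
    // (the "/" in "</a>" needs no escaping in Java).
    private static final Pattern ANCHOR =
            Pattern.compile("<a href=\"(?<url>[^\"]+)\">(?<text>[^<]+)</a>");

    // Rebuilds each matched anchor with the prefix placed in front of the href value only.
    static String prefixHrefs(String html, String prefix) {
        // quoteReplacement keeps any "$" or "\" in the prefix from being read as a group reference.
        String replacement =
                "<a href=\"" + Matcher.quoteReplacement(prefix) + "${url}\">${text}</a>";
        return ANCHOR.matcher(html).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Assumed input: two plain anchors separated by <br/>.
        String s = "<a href=\"https://stackoverflow.com\">https://stackoverflow.com</a><br/>"
                + "<a href=\"https://google.com\">https://google.com</a>";
        System.out.println(prefixHrefs(s, "abc.com?url="));
    }
}
```

Only text inside a matched `<a href="...">` is touched, so a bare URL elsewhere in the string is left alone. The pattern is as strict as the one above: extra attributes or single quotes will make it miss the anchor.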
As RealSkeptic said, look for href instead of the link itself; it saves a lot of effort.
var s = 'https://stackoverflow.com<br/>https://google.com';
s = s.replace(/href="/g,"href=\"abc.com&url=" );
console.log(s);
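Since the question is about Java, here is a rough Java equivalent of the snippet above. `HrefAttributePrefix` and `prefixHrefAttributes` are made-up names, and the anchor markup in the sample is an assumption. String.replace does a literal (non-regex) replacement of every occurrence, which is all this approach needs.

```java
final class HrefAttributePrefix {
    // Inserts the prefix right after every literal href=" in the input.
    static String prefixHrefAttributes(String html, String prefix) {
        return html.replace("href=\"", "href=\"" + prefix);
    }

    public static void main(String[] args) {
        // Assumed input markup, for illustration only.
        String s = "<a href=\"https://stackoverflow.com\">https://stackoverflow.com</a><br/>"
                + "<a href=\"https://google.com\">https://google.com</a>";
        System.out.println(prefixHrefAttributes(s, "abc.com&url="));
    }
}
```

This only handles double-quoted attributes written exactly as href=", with no spaces around the equals sign.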
I'd like to filter an html String before loading it in a WebView:
I'd like to remove all the img tags with the param:
data-custom:'delete'
For example:
<img src="https://..." data-custom:'delete'/>
How can I do this on Android in an elegant way (without external libraries if possible)?
I'm going to go for a nice and simple:
String element = "<img src='https://...' data-custom:'delete'/>";
String attributeRemoved = element.replaceAll("data-custom:['\"][^'\"]*['\"]", "");
Updated based on comment
If you want to remove the whole tag you can do this:
String elementRemoved = element.replaceAll("<[^>]*data-custom:['\"][^'\"]*['\"][^>]*>", "");
If you only want to do it for <img> tags you can do:
String imgElementRemoved = element.replaceAll("<img[^>]*data-custom:['\"][^'\"]*['\"][^>]*>", "");
A much more reliable way would be to parse the HTML as an XML document and use XPath to find all elements with a data-custom attribute and remove them from the document, then save the updated document. While you can do this stuff with regex, it's not normally a good idea...
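A rough sketch of that XPath approach using only the javax.xml classes that ship with Java. It assumes the input is well-formed XML and that the attribute is written as `data-custom="delete"` (the `data-custom:'delete'` form from the question would not parse as XML). `DataCustomFilter` and `removeMarkedImages` are hypothetical names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DataCustomFilter {
    // Parses well-formed markup, removes every <img data-custom="delete">, and serializes the rest.
    static String removeMarkedImages(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Select the marked images anywhere in the document.
            NodeList marked = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//img[@data-custom='delete']", doc, XPathConstants.NODESET);
            for (int i = 0; i < marked.getLength(); i++) {
                Node img = marked.item(i);
                img.getParentNode().removeChild(img);
            }

            // Write the modified tree back to a string, without an XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Malformed input ends up here; real HTML is often not well-formed XML.
            throw new IllegalStateException("Input could not be processed as XML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<div><img src=\"a.png\" data-custom=\"delete\"/><img src=\"b.png\"/></div>";
        System.out.println(removeMarkedImages(html));
    }
}
```

Real-world HTML frequently fails XML parsing (unclosed tags, unquoted attributes), so this only helps when you control the markup.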
I'm having a scenario where within a span tag I have two strings, separated by an img tag.
<span>
text
<img/>
text
</span>
When I locate this span using Selenium and XPath, I find the element, but its getText() method returns "texttext". I want to get "text text".
driver.findElement(By.xpath("MY_XPATH_TO_FIND_THAT_SPAN")).getText();
My XPath is fine (I'm getting the right web element), but how can I get the string as described above? I want a space wherever there is an img tag.
Will be glad for your help,
Thanks!
There is no direct way to do it using .getText(). You can use .getAttribute("innerHTML") and then you will need to replace whatever is between the two "text" strings (IMG, etc.) with a space.
Here's a simple example based on your HTML that will probably work.
String s = driver.findElement(By.xpath("MY_XPATH_TO_FIND_THAT_SPAN")).getAttribute("innerHTML"); // text<img/>text (innerHTML does not include the <span> tags themselves)
s = s.replaceAll("<img.*?/>", " ");
System.out.println(s);
This prints
text text
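If the real markup has line breaks around the img tag, as in the question, the replacement alone leaves stray whitespace behind. A small sketch of a helper that also collapses whitespace, working on the innerHTML string only so it needs no Selenium to run. `InnerHtmlText` and `textWithSpaces` are hypothetical names, and it assumes img is the only tag inside the span.

```java
final class InnerHtmlText {
    // Replaces each <img ...> tag with a space, then collapses runs of whitespace.
    static String textWithSpaces(String innerHtml) {
        return innerHtml
                .replaceAll("(?i)<img\\b[^>]*>", " ") // matches both <img/> and <img ...>
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        // innerHTML of the span from the question, line breaks included.
        System.out.println(textWithSpaces("\ntext\n<img/>\ntext\n")); // prints: text text
    }
}
```

In the Selenium code you would pass it the result of getAttribute("innerHTML").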
To retrieve the text from the first child node and the text from the third child node, you can call getAttribute("innerHTML"), split the result on line breaks, and print the two parts with a space between them. Note that the indices used below depend on where the line breaks fall in the actual markup:
String my_string = driver.findElement(By.xpath("MY_XPATH_TO_FIND_THAT_SPAN")).getAttribute("innerHTML");
String[] stringParts = my_string.split("\n");
String partA = stringParts[0];
String partB = stringParts[2];
System.out.println(partA + " " + partB);
I want to delete the <a> tag and its link text from my HTML.
Simple example:
String inputString = "<html><p>test link </p></html>";
I tried to use something like this:
String result = inputString.replaceAll("</?a[^>]*>", " ");
but it deletes only the <a> tags, not the link text.
Expected output:
String result = "<html><p>test</p></html>";
If I understand your question, you could use String#replaceAll() and a regular expression. Something like this,
String inputString = "<html><p>test "
+ "link </p></html>";
System.out.println(inputString);
inputString = inputString.replaceAll("\\s*<a.*</a>\\s*", "");
System.out.println(inputString);
Output is
<html><p>test link </p></html>
<html><p>test</p></html>
Unlike for XML, where there is JAXB, Oracle offers no parser for HTML, but you can use an external parser.
If you want to manipulate HTML tags, download jsoup from http://jsoup.org/.
You can use this regex to remove nodes and spaces
String result = inputString.replaceAll("(\\s+)?<a.+?/a>(\\s+)?", "");
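Here is a self-contained sketch of the non-greedy variant. The anchor markup in the sample input is an assumption (the question's markup did not survive), and `LinkRemover` / `removeLinks` are hypothetical names. The lazy `.*?` matters when a string contains more than one link: a greedy `.*` would swallow everything from the first `<a` to the last `</a>`.

```java
import java.util.regex.Pattern;

final class LinkRemover {
    // An <a ...>...</a> element plus the whitespace around it; DOTALL lets the link text span lines.
    private static final Pattern LINK =
            Pattern.compile("\\s*<a\\b[^>]*>.*?</a>\\s*", Pattern.DOTALL);

    static String removeLinks(String html) {
        return LINK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumed input markup, for illustration only.
        String input = "<html><p>test <a href=\"https://example.com\">link</a> </p></html>";
        System.out.println(removeLinks(input)); // prints: <html><p>test</p></html>
    }
}
```

The whitespace on both sides of each link is removed as well, so words that were separated only by a link end up joined.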
I have the following HTML...
<h3 class="number">
<span class="navigation">
6:55 <b>»</b>
</span>**This is the text I need to parse!**</h3>
I can use the following code to extract the text from h3 tag.
Element h3 = doc.select("h3").get(0);
Unfortunately, that gives me everything in that tag.
6:55 » This is the text I need to parse!
Can I use Jsoup to parse between different tags? Is there a best practice for doing this (regex?)
(regex?)
No, as you can read in the answers of this question, you can't parse HTML using a regular expression.
Try this:
Element h3 = doc.select("h3").get(0);
String h3Text = h3.text();
String spanText = h3.select("span").get(0).text();
String textBetweenSpanEndAndH3End = h3Text.replace(spanText, "");
No, jsoup wasn't made for this. It's supposed to parse something hierarchical. Searching for text that sits between an end tag and a start tag, or the other way around, wouldn't make any sense for jsoup. That's what regular expressions are for.
But you should of course narrow it down as much as you can using JSoup first, before you shoot with a regex at the string.
Just use ownText()
@Test
void innerTextCase() {
    String sample = "<h3 class=\"number\">\n" +
            "<span class=\"navigation\">\n" +
            "6:55 <b>»</b>\n" +
            "</span>**This is the text I need to parse!**</h3>\n";
    Assertions.assertEquals("**This is the text I need to parse!**",
            Jsoup.parse(sample).select("h3").first().ownText());
}
I'm trying to make something pretty simple, but I simply suck at regular expressions.
My goal is to replace :
Link To Google
To :
<b>Link To Google</b>
In java.
I tried this :
String input = "Link to Google";
String Regex1 = "<a href(.*)>";
String Regex2 = "</a>";
String output = test.replace(Regex1, "<b>");
output = test.replace(Regex2, "</b>");
But the first Regex1 is not matched with my input. Any clue ?
Thanks in advance!
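The pattern matches just fine (even though the result is wrong), and you should not use regex to parse HTML.
Two things need to change. String.replace treats its first argument as literal text, so use replaceAll to have it interpreted as a regex. And you want to make the second replacement on the result of the first one, not on the original string:
String output = test.replaceAll(Regex1, "<b>");
output = output.replaceAll(Regex2, "</b>");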
You can make it work for your example by using:
String Regex1 = "<a href.*?>";
This makes the quantifier ungreedy. But this expression will break very easily with the slightest change in the input HTML, which is one of the reasons why you shouldn't use regex to work with HTML.
Some simple examples the above regex would not work for:
<A HREF="http://www.google.com">
<a href="http://www.google.com">
<a href="http://www.google.com"
>
<a href=">">
Use a parser. They are easy to use and always the more correct solution.
jsoup (http://jsoup.org) would handle your task easily like this:
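import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

File input = new File("your.html");
Document doc = Jsoup.parse(input, "UTF-8");
Elements links = doc.select("a[href]");
// Replace every <a href=...> element with a <b> element holding the same text.
for (Element link : links) {
    Element bold = doc.createElement("b").appendText(link.text());
    link.replaceWith(bold);
}
// now do something with...
// doc.outerHtml()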
If you want it to work, replace Regex1 with:
<a href=\"(.*)\">
And then:
output = output.replace(Regex2,"</b>")
I don't know about using regexes in Java, but there must be a notion of a "capture group":
Your initial regex would be: "<a\s+href\s*=\s*".*?">(.*?)</a>"
That you would replace by: "<b>$1</b>" (where $1 means the group captured between parentheses in the first regex)
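Java does support capture groups, and String.replaceAll accepts `$1` in the replacement string. A minimal sketch; `AnchorToBold` and `anchorsToBold` are made-up names, and the sample anchor markup is an assumption, since the question's markup did not survive.

```java
final class AnchorToBold {
    // Swaps each <a href="...">text</a> for <b>text</b>, reusing the captured link text.
    static String anchorsToBold(String html) {
        return html.replaceAll("<a\\s+href\\s*=\\s*\".*?\">(.*?)</a>", "<b>$1</b>");
    }

    public static void main(String[] args) {
        String input = "<a href=\"http://www.google.com\">Link To Google</a>";
        System.out.println(anchorsToBold(input)); // prints: <b>Link To Google</b>
    }
}
```

The usual caveats from the other answers still apply: uppercase tags, single quotes, or extra attributes before href will not match.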