How to replace text in an XML document using Java

How do I replace text in an XML document using Java?
Source:
<body>
<title>Home Owners Agreement</title>
<p>The <b>good</b> thing about a Home Owners Agreement is that...</p>
</body>
Desired output:
<body>
<title>Home Owners Agreement</title>
<p>The <b>good</b> thing about a HOA is that...</p>
</body>
I only want text in <p> tags to be replaced. I tried the following:
void replaceText(String term, String replaceWith, org.w3c.dom.Node p) {
    p.setTextContent(p.getTextContent().replace(term, replaceWith));
}
The problem with the above code is that all the child nodes of p are lost.

Okay, I figured out the solution.
The key to this is that you don't want to replace the text of the element node itself. Each run of text is actually represented by its own child text node, and those are the nodes to change. I was able to accomplish what I needed with this code:
private static void replace(Node root) {
    // Only text nodes are rewritten, so elements such as <b> stay in place.
    if (root.getNodeType() == Node.TEXT_NODE) {
        root.setTextContent(root.getTextContent().replace("Home Owners Agreement", "HOA"));
    }
    // Recurse into every child so nested text is handled too.
    for (int i = 0; i < root.getChildNodes().getLength(); i++) {
        replace(root.getChildNodes().item(i));
    }
}
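A minimal end-to-end sketch of the same idea, using only the JDK's built-in DOM and transformer APIs. The class and method names (ParagraphTextReplacer, replaceInParagraphs) are made up for illustration, and the sketch assumes the document fits in memory and that only text inside <p> elements should change:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class ParagraphTextReplacer {

    // Recursively rewrites text nodes only; element nodes such as <b> are left in place.
    static void replaceInTextNodes(Node node, String term, String replacement) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            node.setNodeValue(node.getNodeValue().replace(term, replacement));
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            replaceInTextNodes(children.item(i), term, replacement);
        }
    }

    // Parses the XML, applies the replacement inside every <p>, and serializes the result.
    static String replaceInParagraphs(String xml, String term, String replacement) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList paragraphs = doc.getElementsByTagName("p");
            for (int i = 0; i < paragraphs.getLength(); i++) {
                replaceInTextNodes(paragraphs.item(i), term, replacement);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("Could not process XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<body><title>Home Owners Agreement</title>"
                + "<p>The <b>good</b> thing about a Home Owners Agreement is that...</p></body>";
        System.out.println(replaceInParagraphs(xml, "Home Owners Agreement", "HOA"));
    }
}
```

The <title> text is untouched because the traversal starts only from <p> elements, and the <b> child survives because only text nodes are modified.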

The problem here is that you actually want to replace the node, not only its text.
You can traverse the children of the current node, add them to a new node, and then swap the nodes.
But that takes a lot of work and is very sensitive to your document structure. For example, if somebody wraps your <p> tag in a <div>, you will have to change your parsing.
Moreover, this approach is inefficient in terms of CPU and memory: you have to parse the whole document to change a couple of words in it.
My suggestion is the following: try regular expressions. In most cases they are powerful enough. For example, code like
xml.replaceFirst("(<p>.*?</p>)", "<p>The <b>good</b> thing about a HOA is that...</p>")
will work (at least in your case).

Related

text content located in second br tag can't be printed

I'm trying to print the text content located in the second br tag with the following xpath, but the text from all the br tags is printed to the console. What might be the reason?
driver.findElement(By.xpath("//*[text()[contains(.,'Telefon')]]")).getText();
The reason you can't get the text is that the text is not in the br tag.
In <br />, the < opens the tag and the /> closes it, so there is nothing in between.
Additionally, if you read a bit more about it, even the /> is surplus to requirements. If you had just <br>, the text still wouldn't be contained within it because:
The <br> tag is an empty tag which means that it has no end tag.
The point is, all your text belongs to the h2. You need to deal with that as best you can.
To solve your issue you'll need to:
call .getAttribute("innerHTML") - this will give you all the text of the h2, including the br tags
split your string on the string <br> (please note that in Chrome my <br /> becomes <br> - you might need to adjust this)
either select item [2] or use a lambda to select the item that contains your text (whichever you feel more comfortable with)
And those steps look like this:
// Get the element
var h2Element = driver.findElement(By.xpath("//*[text()[contains(.,'Telefon')]]"));
var myTextArray = h2Element.getAttribute("innerHTML").split("<br>");
// Approach 1 - just print the [2] item (the third piece, since arrays start at 0)
var approach1Text = myTextArray[2];
System.out.println(approach1Text);
// Approach 2 - use a lambda to select the piece that contains the text
var approach2Text = Arrays.stream(myTextArray).filter(a -> a.contains("Telefon")).findFirst().get();
System.out.println(approach2Text);
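Since browsers may serialize the tag as <br>, <br/> or <br />, the split step can be made a little more tolerant. This is only a sketch on a plain string, with no Selenium involved; the class name BrSplitDemo, its methods, and the sample innerHTML value are invented for illustration:

```java
import java.util.Arrays;
import java.util.Optional;

// Hypothetical helper class, not part of Selenium.
class BrSplitDemo {

    // Splits on <br>, <br/> or <br /> (case-insensitive) and trims each piece.
    static String[] splitOnBr(String innerHtml) {
        return Arrays.stream(innerHtml.split("(?i)<br\\s*/?>"))
                .map(String::trim)
                .toArray(String[]::new);
    }

    // Returns the first piece containing the needle, or an empty Optional if none does.
    static Optional<String> firstContaining(String innerHtml, String needle) {
        return Arrays.stream(splitOnBr(innerHtml))
                .filter(part -> part.contains(needle))
                .findFirst();
    }

    public static void main(String[] args) {
        // Made-up stand-in for h2Element.getAttribute("innerHTML").
        String innerHtml = "Adres: Example St 1<br>E-mail: a@example.com<br />Telefon: 555 0100";
        System.out.println(splitOnBr(innerHtml)[2]);
        System.out.println(firstContaining(innerHtml, "Telefon").orElse("not found"));
    }
}
```

Returning an Optional avoids the NoSuchElementException that a bare .get() throws when no piece matches.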
As a bonus note - you probably had fun getting your xpath to work because the br tag splits the text into separate nodes. As a result, your h2 actually has multiple text() values. It has text(), text()[2], text()[3], etc. - one more than there are brs.
I put together a simple page to test this for you - just to show you what's going on: (note the xpath in dev tools)
This is text()[3] because xpaths are indexed from 1 (compared to the Java code above, which starts at 0). However - that's just an example of why it's tricky; I wouldn't recommend you do it that way.
The easy way to eliminate the effect of <br> (and other tags!) on text is to use normalize-space().
An xpath like this works and is relatively simple to follow.
//h2[contains(normalize-space(),'Telefon')]
Maps to my sample page OK:
I share this extra bit in case you have any more text-split objects and it helps you down the line.
...All that said - good work on getting your original xpath to work. That's good too.
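To see the text() indexing and normalize-space() behaviour outside a browser, here is a rough sketch using the JDK's built-in javax.xml.xpath (XPath 1.0) instead of Selenium. The sample markup and the class name XPathTextNodesDemo are assumptions made up for the example, and the markup has to be well-formed XML for this parser:

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; the page content below is invented.
class XPathTextNodesDemo {

    static final String PAGE =
            "<html><body><h2>Adres: Example St 1<br/>E-mail: a@example.com<br/>"
            + "Telefon: 555 0100</h2></body></html>";

    // Evaluates an XPath 1.0 expression against PAGE and returns the result as a string.
    static String evaluate(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(PAGE)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The two <br/> elements split the h2 text into three separate text nodes.
        System.out.println(evaluate("//h2/text()[1]"));
        System.out.println(evaluate("//h2/text()[3]"));
        // contains(text(), ...) only looks at the first text node, so it misses 'Telefon'.
        System.out.println(evaluate("boolean(//h2[contains(text(),'Telefon')])"));
        // normalize-space() works on the whole string value of the h2.
        System.out.println(evaluate("boolean(//h2[contains(normalize-space(),'Telefon')])"));
    }
}
```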
Note that driver.findElement() returns only the first element matching the given XPath; driver.findElements() returns all matches. To target exactly one element, you can pass the element's full XPath to driver.findElement(By.xpath(...)).
Try changing your xpath expression to
//h2/br/following-sibling::text()[contains(.,'Telefon')]
and see if it works.

How to get the specific information from an element in JSOUP?

I am trying to parse an html file using jsoup. Here is my code:
Document doc;
doc = Jsoup.connect("http://www.marketimyilmazlar.com/index.php?route=product/product&path=64_80&product_id=14102").get();
Elements elements = doc.getElementsByClass("price");
Then, when I look at the elements variable, its content is the following:
<div class="price">
2.75 TL
<span class="kdv">KDV Dahil</span>
<br />
</div>
What I want is to get the value "2.75 TL". I thought of using the elements.get(int index) method, but I do not know how to use the index argument. Can anyone help me with this?
Thanks
You can use the ownText method, e.g.
Elements elements = doc.getElementsByClass("price");
System.out.println(elements.get(0).ownText()); // 2.75 TL
Quite simple: you need to get the text nodes out of the element and then take the first of them, so the solution is something like:
element.textNodes().get(0);

How to keep line breaks when using Jsoup.parse?

This is not a duplicate. There was a similar question, but none of its answers can deal with a real HTML file. One can save any HTML page, even this one, and try to run any of the solutions from that question... none of them solves the problem completely.
The question is:
I have a saved .htm file on my desktop. I need to get pure text from it. However, I do need to keep the line breaks so that the text is not squeezed onto just one or a couple of lines.
I tried the following, and all the methods from here:
FileInputStream in = new FileInputStream("C:\\...myfile.htm");
String htmlText = IOUtils.toString(in);
for (String line : htmlText.split("\n")) {
String stripped = Jsoup.parse(line).text();
System.out.println(stripped);
}
This preserves only the line structure of the HTML file itself. The text is still messed up because tags such as <br> and <p> get removed. How can I parse it so that the text preserves all natural line breaks?
This is a difference I've noticed between jsoup and, say, Selenium: Selenium keeps the line breaks when extracting text and jsoup does not. With that said, I think the best route is to get the inner HTML of the node you are extracting text from, then do a replaceAll on it to replace <br> and <p> with line breaks.
As a more complete solution, instead of reading the file line by line, can you traverse the HTML more natively? Your best bet would be to traverse the tree with a recursive function: when you hit a TextNode, append its text to the stripped variable from your example, and when you hit a <p> or <br> element, add a line feed as needed.
Something like:
Document doc = Jsoup.parse(htmlText);
Then pass that in a recursive function for each child node:
String getText(Element parentElement) {
    String working = "";
    for (Node child : parentElement.childNodes()) {
        if (child instanceof TextNode) {
            // Node itself has no text() method, so cast to TextNode first.
            working += ((TextNode) child).text();
        }
        if (child instanceof Element) {
            Element childElement = (Element) child;
            // do more of these for p or other tags you want a new line for
            if (childElement.tag().getName().equalsIgnoreCase("br")) {
                working += "\n";
            }
            working += getText(childElement);
        }
    }
    return working;
}
Then you can just call the function to strip the text.
String strippedText = getText(doc);
It's not the simplest solution, but it's one I can think of that should work if you want to extract all the text from an HTML document. I haven't run this code, I just wrote it now, so I apologize if I missed something. But it should give you the general idea.

How to get JDOM2 Element text as a list if its content separated by its inner Elements?

I want to build up a String from an XML-file in Java, using JDOM2.
The XML-file snippet what I want to process looks like the following:
...
<title>
usefuldatapart1
<span profile="id1"> optionaldata1 </span>
<span profile="id2"> optionaldata2 </span>
<span profile="id3"> optionaldata3 </span>
usefuldatapart2
</title>
...
The element 'title' contains useful textual content for me, split into several parts by inner Elements. If any of the profiles becomes active, I have to insert the content of that inner Element between the right parts (only one can be active at a time, but in this case that's not important).
Is there any elegant way to get the Element text back as an array of Strings for further operations?
Or, if I'm going about this the wrong way, how could I do it properly?
(Currently struggling with Element's 'getText' and 'getContent' methods, and the basics of 'XMLOutputter'.)
Thanks for any help!
There are potentially multiple ways to do this. One of which is using XPaths, but perhaps it's just simpler to use a descendants iterator and a StringBuilder, and a check on the ancestry of each text node....
For example (and I'm typing in by hand, not validating this...):
public String getTitleText(final Element title) {
    final StringBuilder sb = new StringBuilder();
    for (final Text txt : title.getDescendants(Filters.text())) {
        final Element parent = txt.getParentElement();
        if (parent == title ||
                parent.getAttributeValue("active", "not").equals("true")) {
            sb.append(txt.getValue());
        }
    }
    return sb.toString();
}
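If a list of the separate text parts is wanted rather than one concatenated String, the same idea can be sketched with the JDK's built-in org.w3c.dom API. Note that this swaps in standard DOM rather than JDOM2, and that the class name TitleSegments and the method directTextSegments are hypothetical:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper built on standard DOM, not JDOM2.
class TitleSegments {

    // Returns the non-blank text parts that are direct children of the first <tagName> element.
    static List<String> directTextSegments(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node element = doc.getElementsByTagName(tagName).item(0);
            List<String> segments = new ArrayList<>();
            if (element == null) {
                return segments; // no such element, so no segments
            }
            NodeList children = element.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.TEXT_NODE) {
                    String text = child.getNodeValue().trim();
                    // Whitespace-only nodes between the <span> elements are skipped.
                    if (!text.isEmpty()) {
                        segments.add(text);
                    }
                }
            }
            return segments;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<title>\nusefuldatapart1\n"
                + "<span profile=\"id1\"> optionaldata1 </span>\n"
                + "<span profile=\"id2\"> optionaldata2 </span>\n"
                + "usefuldatapart2\n</title>";
        System.out.println(directTextSegments(xml, "title"));
    }
}
```

The text of an active <span> could then be inserted between the returned parts as needed.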

Could the value of an html anchor tag be fetched using xpath?

If I have HTML that looks like:
<td class="blah">&nbs;???? </td>
Could I get the ???? value using xpath?
What would it look like?
To use XPath you usually need XML, not HTML, but some parsers (e.g. the one built into PHP) have a relaxed mode which will parse most HTML, too.
If you want to find all <a> that are direct children of <td class="blah"> the XPath you need is
//td[@class = 'blah']/a
or
//td[@class = 'blah']/a[@href = 'http://...']
(depending on whether you only want the one url or all urls)
This will give you a set of nodes. Iterate through it and, for each node, check the nodeType of its firstChild (it should be a text node) and the number of child nodes (it should be 1). The firstChild will then contain the ????
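In Java, those steps might look roughly like the sketch below, which uses the JDK's javax.xml.xpath. It assumes the markup is well-formed XML (an undefined entity such as &nbs; would make the parse fail); the sample table and the class name AnchorTextDemo are invented for illustration:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class AnchorTextDemo {

    // Collects the text of every <a> that is a direct child of <td class="blah">.
    static List<String> anchorTexts(String xhtml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList anchors = (NodeList) xpath.evaluate(
                    "//td[@class = 'blah']/a",
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Node anchor = anchors.item(i);
                Node first = anchor.getFirstChild();
                // Accept only anchors whose single child is a text node.
                if (anchor.getChildNodes().getLength() == 1
                        && first.getNodeType() == Node.TEXT_NODE) {
                    texts.add(first.getNodeValue());
                }
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<table><tr>"
                + "<td class=\"blah\"><a href=\"http://example.com/\">link text</a></td>"
                + "<td class=\"other\"><a href=\"http://example.org/\">skipped</a></td>"
                + "</tr></table>";
        System.out.println(anchorTexts(xhtml));
    }
}
```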
Why would you use an XML parser to parse HTML?
I would suggest using a dedicated Java HTML parser; there are many, but I haven't tried any myself.
As for your question of whether it would work: I suspect it will not. You will get an error when an XML parser tries to parse it, right at &nbs; if not earlier.
