Replace string with jsoup only in text portions - java

I have found several topics with similar questions and valuable answers, but I am still struggling with this:
I want to parse some html with Jsoup so I can replace, for example,
"changeme"
with
<changed>changeme</changed>
, but only if it appears on a text portion of the html, no if it is part of a tag. So, starting with this html:
<body>
<p>test changeme app</p>
</BODY>
</HTML>
I would want to get to this:
<body>
<p>test <changed>changeme</changed> app</p>
</BODY>
</HTML>
I have tried several approaches, this one is which brings me closer to the desired result:
Document doc = null;
try {
doc = Jsoup.parse(new File("tmp1450348256397.txt"), "UTF-8");
} catch (Exception ex) {
}
Elements els = doc.body().getAllElements();
for (Element e : els) {
if (e.text().contains("changeme")) {
e.html(e.html().replaceAll("changeme","<changed>changeme</changed>"));
}
}
html = doc.toString();
System.out.println(html);
But with this approach I find two problems:
<body>
<p><a href="http://<changed>changeme</changed> .html">test
<changed>
changeme
</changed>
app</a></p>
</BODY>
</HTML>
Line breaks are inserted before and after the new element I am introducing. This is not a real problem as I coul get rid of them if I use #changed# to do the replacing and after the doc.toString() I replace them again to the desired value (with < >).
The real problem: The URL in the href has been modified, and I don't want it to happen.
Ideas? Thx.

Here is my solution:
String html=""
+"<p><a href=\"http://changeme.html\">"
+ "test changeme "
+ "<div class=\"changeme\">"
+ "inner text changeme"
+ "</div>"
+ " app</a>"
+"</p>";
Document doc = Jsoup.parse(html);
Elements els = doc.body().getAllElements();
for (Element e : els) {
List<TextNode> tnList = e.textNodes();
for (TextNode tn : tnList){
String orig = tn.text();
tn.text(orig.replaceAll("changeme","<changed>changeme</changed>"));
}
}
html = doc.toString();
System.out.println(html);
TextNodes are always leaf nodes, i.e. they do not contain more HTML elements. In your original approach you replace the HTML of an element with new HTML with replaced changme strings. You only check for the changeme to be part of the TextNodes contents, but you replace every occurrence in the HTML string of the element, including all occurrences outside TextNodes.
My solution basically works like yours, but I use the JSoup method textNodes(). This way I don't need to typecast.
P.S.
Of course, my solution as well as yours will contain <changed>changeme</changed> instead of <changed>changeme</changed> in the end. This may or may not be what you want. If you do not want this, then your result is not any more valid HTML, since changed is no valid HTML tag. Jsoup will not help you in this case. However, you can of course replace in the resulting string all <changed>changeme</changed> again - outside JSoup.

I think your issue is that you're replacing the elements html rather than just its text, change:
e.html(e.html().replaceAll("changeme","<changed>changeme</changed>"));
to
e.text(e.text().replaceAll("changeme","<changed>changeme</changed>"));
the line breaks issue can probably be solved by doing doc.outputSettings().prettyPrint(false); before doing html = doc.toString();

Finally I tried this solution (at the end of the question), using TextNodes:
How I can replace "text" in the each tag using Jsoup
This is the resulting code:
Elements els = doc.body().getAllElements();
for (Element e : els) {
for (Node child : e.childNodes()){
if (child instanceof TextNode && !((TextNode) child).isBlank()) {
((TextNode)child).text(((TextNode)child).text().replaceAll("changeme","<changed>changeme</changed>"));
}
}
}
Now the output is the expected, and it even does not introduce extra break lines. In this case prettyPrint must be set to True.
The only problem is that I don't really understand the difference of using TextNode vs Element.text(). If someone wants to provide some info it will be much appreciated.
Thanks.

Related

Modifying an html tag's own text in Java using JSoup

So yeah, suppose I have this piece of HTML
<p>And finally, how about some Links?</p>
and I want to access and modify the "And finally, how about some" part only, and get this:
<p>new text Links?</p>
I can't seem to figure out how. Here's what I've tried so far:
Document doc = null;
try {
doc = Jsoup.connect("http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html").userAgent("Mozilla").get();
} catch (IOException e1) {
e1.printStackTrace();
}
Elements d = doc.body().children();
Element e = d.get(20); //Assuming the HTML line in question is found at index 20
e.text("new text") //just outputs <p>new value</p>, which is not good for me
It seems that I can access it by
Element e = d.get(20);
System.out.println("\n"+e.ownText()); //outputs: And finally, how about some
but modifying it doesn't work.
Element e = d.get(20);
String s = e.toString().replace(e.ownText(), "new text");
e.text(s);
System.out.println(e.toString());
The output for the code above is
<p><p>changed <a href="http://www.yahoo.com/">Links?</a></p></p>
It seems to be taking the tags as literals, but I want them as < or > because I then have to re build the webpage with the new text.
Any kind of help will be hugely appreciated.
How about something like
Element e = d.get(20);
e.text("new text");
e.append("Links?");//lets you add HTML.
If link is dynamic and you don't want to change it you can earlier store it and use later
Element e = d.get(20);
Element link = e.child(0);
e.text("new text");
e.append(link.toString());

Using JSoup to select a group of tags

I am attempting to use JSoup to scrape some information off a page, which can be identified by a group of tags in a particular order. The order of them is as follows:
<span class="sold" >Sold</span></td>
<td class='prc'>
<div class="g-b bidsold" itemprop="price">
AU $1.00</div>
I am looking to grab each value that is in place of the AU $1.00 field on the page, but they can only be identified by the span class="sold" selector that occurs a few tags beforehand.
I have tried something like select("span.sold:lt(4) + [itemprop=price]") but feel like I'm flailing around in the dark!
The code below should do the trick!!!
Document doc = Jsoup.connect(/*URL of your HTML document*/").get();
Element part = doc.body();
Elements parts = part.getElementsByTag("div");
String attValue;
String requiredContent;
for(Element ent : parts)
{
if(ent.hasAttr("class"))
{
attValue = ent.attr("class");
if(attValue.equals("g-b bidsold"))
{
System.out.println("\n");
requiredContent=ent.text();
System.out.println(requiredContent);
}
}
}
Just make sure to iterate and get the output in an array.
You could also do this:
Elements soldPrices = doc.select("td:has(.sold) + td [itemprop=price]");
That will return elements (the DIVs) that have price itemprops, which have immediately preceeding TDs with elements (the SPANs) with class=sold.
See the Selector syntax for more details.

adding text before and after a link jSoup

I've just stared learning Jsoup and the cookbook on their website but I'm just a bit stuck with addling text to an element I've parsed.
try{
Document doc = Jsoup.connect(url).get();
Element add = doc.prependText("a href") ;
Elements links = add.select("a[href]");
for (Element link : links) {
PrintStream sb = System.out.format("%n %s",link.attr("abs:href"));
System.out.print("<br>");
}
}
catch(Exception e){
System.out.print("error --> " + e);
}
Example run with google.com I get
http://www.google.ie/imghp?hl=en&tab=wi<br>
http://maps.google.ie/maps?hl=en&tab=wl<br>
https://play.google.com/?hl=en&tab=w8<br>
But I really want
<a href> http://www.google.ie/imghp?hl=en&tab=wi<br></a>
<a href> http://maps.google.ie/maps?hl=en&tab=wl<br></a>
<a href> https://play.google.com/?hl=en&tab=w8<br></a>
With this code I've gotten all the links off the page but I want to also get the and tags so I can them create my on webpage. I've tried adding a string and prepend text but just can't seem to get it right.
Thanks
with link.attr(...) you get the attribute value.
But you need the whole tag:
Document doc = Jsoup.connect(...).get();
for( Element e : doc.select("a[href]") ) // Select all 'a'-Tags with 'href' attribute
{
String wholeTag = e.toString(); // Get a string as the element is
/* No you you can use the html - in this example for a simple output */
System.out.println(wholeTag);
}

How do I parse this HTML with Jsoup

I am trying to extract "Know your tractor" and "Shell Petroleum Company.1955"? Bear in mind that that is just a snippet of the whole code and there are more then one H2/H3 tag. And I would like to get the data from all the H2 and H3 tags.
Heres the HTML: http://i.stack.imgur.com/Pif3B.png
The Code I have just now is:
ArrayList<String> arrayList = new ArrayList<String>();
Document doc = null;
try{
doc = Jsoup.connect("http://primo.abdn.ac.uk:1701/primo_library/libweb/action/search.do?dscnt=0&scp.scps=scope%3A%28ALL%29&frbg=&tab=default_tab&dstmp=1332103973502&srt=rank&ct=search&mode=Basic&dum=true&indx=1&tb=t&vl(freeText0)=tractor&fn=search&vid=ABN_VU1").get();
Elements heading = doc.select("h2.EXLResultTitle span");
for (Element src : heading) {
String j = src.text();
System.out.println(j); //check whats going into the array
arrayList.add(j);
}
How would I extract "Know your tractor" and "Shell Petroleum Company.1955"? Thanks for your help!
Your selector only selects <span> elements which are inside <h2 class="EXLResultTitle">, while you actually need those <h2> elements themself. So, just remove span from the selector:
Elements headings = doc.select("h2.EXLResultTitle");
for (Element heading : headings) {
System.out.println(heading.text());
}
You should be able to figure the selector for <h3 class="EXLResultAuthor"> yourself based on the lesson learnt.
See also:
Jsoup cookbook - CSS selectors
Jsoup Selector API documentation

HTML Parsing using Jsoup.Jar

Document doc = Jsoup.connect("http://reviews.opentable.com/0938/9/reviews.htm").get();
Element part = doc.body();
Elements parts = part.getElementsByTag("span");
String attValue;
String html;
for(Element ent : parts)
{
if(ent.hasAttr("class"))
{
attValue = ent.attr("class");
if(attValue=="BVRRReviewText description")
{
System.out.println("\n");
html=ent.text();
System.out.println(html);
}
}
}
Am using Jsoup.jar for the above program.
I am accessing the webpage and my aim is to the print the text that is found within the tag <span class="BVRRReviewText description">text</span>.
But nothing is getting printed as output. There is no contents added to the String html in the program. But attValue is getting all the attribute values of the span tag.
Where must I have went wrong? Please advise.
if(attValue=="BVRRReviewText description")
should be
if(attValue.equals("...")) surely?
This is Java, not Javascript.
Change
attValue=="BVRRReviewText description"
for
attValue.matches("...")

Categories

Resources