Get part of string that is not html in Java - java

In my Java application I have Strings that have to be edited. The problem is that these Strings can contain HTML tags/elements, which should not be edited (there is no id to retrieve an element by).
Scenario (prepend "-" to the text):
String a = "<span> <table> </table> </span> <div></div> <div> text 2</div>";
should become: <span> <table> </table> </span> <div></div> <div> -text 2</div>
String b = "text";
should become: -text
String c = "<p> t </p>";
should become: <p> -t </p>
My question is: How can I retrieve the text in a string that can contain HTML tags (I cannot add an id or class)?

You can use an XML parsing library and walk the nodes of the parsed document.
String newText = null;
for (Node node : document.nodes()) { // pseudocode: iterate over the parsed nodes
    if (node.text() != null) newText = "-" + node.text();
}
Note that this is pseudocode.
newText will now be "-text", or whatever the text of the last visited node is.
EDIT:
Your question is a bit ambiguous in terms of "the text can contain html elements."
If it doesn't contain HTML tags, then you cannot use an XML parser, which raises the question: if it doesn't contain tags, why can't you just do...
String newString = "-" + a;
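For the three sample strings in the question, a regex-based sketch using only the standard library could look like the following. This swaps in a regular expression instead of the parser recommended above, and it assumes simple markup: no comments, no script blocks, and no ">" inside attribute values. The class name TextPrefixer and the method prefixText are made up for illustration. It prefixes every non-blank text run, so a string with several text runs gets several dashes.

```java
import java.util.regex.Pattern;

public class TextPrefixer {
    // Matches the first non-whitespace character of a text run. The run must
    // start at the beginning of the input or right after the '>' closing a tag.
    private static final Pattern TEXT_START = Pattern.compile("(^|>)(\\s*)([^<\\s])");

    // Inserts "-" before each non-blank text run; the tags are left untouched.
    // Assumption: simple markup without comments, scripts, or '>' in attributes.
    public static String prefixText(String html) {
        return TEXT_START.matcher(html).replaceAll("$1$2-$3");
    }

    public static void main(String[] args) {
        // <span> <table> </table> </span> <div></div> <div> -text 2</div>
        System.out.println(prefixText("<span> <table> </table> </span> <div></div> <div> text 2</div>"));
        // -text
        System.out.println(prefixText("text"));
        // <p> -t </p>
        System.out.println(prefixText("<p> t </p>"));
    }
}
```

For arbitrary real-world HTML, walking the text nodes of a parsed document is likely to be more robust than this pattern.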

Related

Extract Data from HTML using JSoup

I am writing a script to extract data from an HTML document. Here is a part of the document.
<div class="info">
<div id="info_box" class="inf_clear">
<div id="restaurant_info_box_left">
<table id="rest_logo">
<tr>
<td>
<a itemprop="url" title="XYZ" href="XYZ.com">
<img src="/files/logo/26721.jpg" alt="XYZ" title="XYZ" width="100" />
</a>
</td>
</tr>
</table>
<h1 id="Name"><a class="fn org url" rel="Order Online" href="XYZ.com" title="XYZ" itemprop="name">XYZ</a></h1>
<div class="rest_data" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="telephone">(305) 535-1379</span> | <b>
<span itemprop="streetAddress">1755 Alton Rd</span>,
<span itemprop="addressLocality">Miami Beach</span>,
<span itemprop="addressRegion">FL</span>
<span itemprop="postalCode">33139</span></b>
</div>
<div class="geo">
<span class="latitude" title="25.792588"></span>
<span class="longitude" title="-80.141214"></span>
</div>
<div class="rest_data">Estimated delivery time: <b>45-60 min</b></div>
</div>
</div>
I am using Jsoup and am not quite sure how to achieve this.
There are many div tags in the document, and I try to match them by their unique attributes.
Say, for the div tag whose class attribute value is "info":
Elements divs = doc.select("div");
for (Element div : divs) {
    String divClass = div.attr("class");
    if (divClass.equalsIgnoreCase("info")) {
        // matched: need the table with id "rest_logo" inside this div
    }
}
If it matches, I have to get the table with id "rest_logo" inside that div tag.
When doc.select("table") is used, it looks like the parser searches the entire document.
What I need to achieve is: if the div tag's attribute matches, fetch the elements and attributes inside the matched div tag.
Expected Output:
Name : XYZ
telephone:(305) 535-1379
streetAddress:1755 Alton Rd
addressLocality:Miami Beach
addressRegion:FL
postalCode:33139
latitude:25.792588
longitude:-80.141214
Estimated delivery time:45-60 min
Any Ideas?
for (Element e : doc.select("div.info")) {
System.out.println("Name: " + e.select("a.fn").text());
System.out.println("telephone: " + e.select("span[itemprop=telephone]").text());
System.out.println("streetAddress: " + e.select("span[itemprop=streetAddress]").text());
// .....
}
Here's how I would do it:
Document doc = Jsoup.parse(myHtml);
Elements elements = doc.select("div.info")
        .select("a[itemprop=url], span[itemprop=telephone], span[itemprop=streetAddress], span[itemprop=addressLocality], span[itemprop=addressRegion], span[itemprop=postalCode], span.longitude, span.latitude");
// descendant selector: the rest_data divs are nested, not direct children of div.info
elements.add(doc.select("div.info div.rest_data").last());
for (Element e : elements) {
    if (e.hasAttr("itemprop")) {
        System.out.println(e.attr("itemprop") + ": " + e.text());
    }
    if (e.hasAttr("itemprop") && e.attr("itemprop").equals("url")) {
        System.out.println("name: " + e.attr("title"));
    }
    if (e.attr("class").equals("longitude") || e.attr("class").equals("latitude")) {
        System.out.println(e.attr("class") + ": " + e.attr("title"));
    }
    if (e.attr("class").equals("rest_data")) {
        System.out.println(e.text());
    }
}
(Note: I wrote this on my phone, so untested, but it should work, may also contain typos)
A bit of explanation: First get all the desired elements via doc.select(...), and then extract the desired data from each one.
Let me know if it works.
Probably the main thing to realise is that an element with an id can be selected directly - no need to loop through a collection of elements searching for it.
I've not used JSoup and my Java is very rusty but here goes ...
// 1. Select elements from document
Element container = doc.select("#restaurant_info_box_left").first(); // find element in document with id="restaurant_info_box_left"
Element h1 = container.select("h1").first(); // find h1 element in container
Elements restData = container.select(".rest_data"); // find all divs in container with class="rest_data"
Element restData_0 = restData.get(0); // find first rest_data div
Element restData_1 = restData.get(1); // find second rest_data div
Elements restData_0_spans = restData_0.select("span"); // find first rest_data div's spans
Elements geos = container.select(".geo"); // find all divs in container with class="geo"
Element geo = geos.get(0); // find first .geo div
Elements geo_spans = geo.select("span"); // find first .geo div's spans
// 2. Compose output
// h1 text
System.out.println("Name: " + h1.text());
// restData_0_spans text
for (Element span : restData_0_spans) {
    System.out.println(span.attr("itemprop") + ": " + span.text());
}
// geo data
for (Element span : geo_spans) {
    System.out.println(span.attr("class") + ": " + span.attr("title"));
}
// restData_1 text
System.out.println(restData_1.text());
For someone used to JavaScript/jQuery, this all seems very laboured. With luck it may simplify somewhat.

Add a new html tag to an html string in android

I have a string obtained from an EditText. The string contains html tags.
Spannable s = mainEditText.getText();
String webText = Html.toHtml(s);
The contents of the string is :
<p dir="ltr">test</p>
<p dir="ltr"><img src="http://files.parsetfss.com/bcff7108-cbce-4ab8-b5d1-1f82827e6519/tfss-0de7a730-3fa9-4a1e-9f82-d34e4f6e2d31-file" /><br /></p>
<p dir="ltr"><img src="http://maps.google.com/maps/api/staticmap?center=22.572646,88.363895&zoom=15&size=960x540&sensor=false&markers=color:blue%7Clabel:!%7C22.572646,88.363895" /><br /> </p>
Now, what I want to do is: wherever there is an img tag, I want to wrap it in a center tag.
What should I do to get the following output?
<p dir="ltr">test</p>
<p dir="ltr"><center><img src="http://files.parsetfss.com/bcff7108-cbce-4ab8-b5d1-1f82827e6519/tfss-0de7a730-3fa9-4a1e-9f82-d34e4f6e2d31-file" /></center><br /></p>
<p dir="ltr"><center><img src="http://maps.google.com/maps/api/staticmap?center=22.572646,88.363895&zoom=15&size=960x540&sensor=false&markers=color:blue%7Clabel:!%7C22.572646,88.363895" /></center><br /> </p>
Can a regex solve the issue or should it be done in a different way?
Can JSOUP help in any way? Is there any other type of HTML parser which can do the job?
(<img\s+[^>]*>)
You can try this. Replace with <center>$1</center>. See demo.
http://regex101.com/r/sU3fA2/38
Something like
var re = /(<img\s+[^>]*>)/g;
var str = '<p dir="ltr">test</p> \n<p dir="ltr"><img src="http://files.parsetfss.com/bcff7108-cbce-4ab8-b5d1-1f82827e6519/tfss-0de7a730-3fa9-4a1e-9f82-d34e4f6e2d31-file" /><br /></p> \n<p dir="ltr"><img src="http://maps.google.com/maps/api/staticmap?center=22.572646,88.363895&zoom=15&size=960x540&sensor=false&markers=color:blue%7Clabel:!%7C22.572646,88.363895" /><br /> </p>';
var subst = '<center>$1</center>';
var result = str.replace(re, subst);
With Jsoup, you can use the wrap() method on the selected elements.
It would look like this :
public String wrapImgWithCenter(String html) {
Document doc = Jsoup.parse(html);
doc.getElementsByTag("img").wrap("<center></center>");
return doc.html();
}
I implemented the Jsoup solution suggested by mourphy, with a small edit to the method, and it worked for me. The new method is:
public String wrapImgWithCenter(String html){
Document doc = Jsoup.parse(html);
doc.select("img").wrap("<center></center>");
return doc.html();
}
Thanks mourphy and vks for your help!
Using Regex, you could also do this in java:
String formatted = str.replaceAll("(<img\\s+[^>]*>)", "<center>$1</center>");

How to validate html using java? getting issues with jsoup library

I need to validate HTML using Java, so I tried the jsoup library. But some of my test cases fail with it.
For example, this is my HTML content. I don't have any control over this content; I get it from an external source provider.
String invalidHtml = "<div id=\"myDivId\" ' class = claasnamee value='undaa' > <<p> p tagil vanne <br> <span> span close cheythillee!! </p> </div>";
doc = Jsoup.parseBodyFragment(invalidHtml);
For above html I am getting this output.
<html>
<head></head>
<body>
<div id="myDivId" '="" class="claasnamee" value="undaa">
<
<p> p tagil vanne <br /> <span> span close cheythillee!! </span></p>
</div>
</body>
</html>
The stray single quote in my string above comes out as the attribute '="" shown in this output. How can I fix this issue? Can anyone help me, please?
The best place to validate your HTML would be http://validator.w3.org/, but that would be a manual process. Don't worry, though: jsoup can do this for you as well. The program below is something of a workaround, but it serves the purpose.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class JsoupValidate {
public static void main(String[] args) throws Exception {
String invalidHtml = "<div id=\"myDivId\" ' class = claasnamee value='undaa' > <<p> p tagil vanne <br> <span> span close cheythillee!! </p> </div>";
Document initialDoc = Jsoup.parseBodyFragment(invalidHtml);
Document validatedDoc = Jsoup.connect("http://validator.w3.org/check")
.data("fragment", initialDoc.html())
.data("st", "1")
.post();
System.out.println("******");
System.out.println("Errors");
System.out.println("******");
for(Element error : validatedDoc.select("li.msg_err")){
System.out.println(error.select("em").text() + " : " + error.select("span.msg").text());
}
System.out.println();
System.out.println("**************");
System.out.println("Cleaned output");
System.out.println("**************");
Document cleanedOutput = Jsoup.parse(validatedDoc.select("pre.source").text());
cleanedOutput.select("meta[name=generator]").first().remove();
cleanedOutput.outputSettings().indentAmount(4);
cleanedOutput.outputSettings().prettyPrint(true);
System.out.println(cleanedOutput.html());
}
}
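Jsoup can also track parse errors itself if you enable error tracking on the parser (Parser here is org.jsoup.parser.Parser):
var invalidHtml = "<div id=\"myDivId\" ' class = claasnamee value='undaa' > <<p> p tagil vanne <br> <span> span close cheythillee!! </p> </div>";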
var parser = Parser.htmlParser()
.setTrackErrors(10); // Set the number of errors it can track. 0 by default so it's important to set that
var dom = Jsoup.parse(invalidHtml, "" /* this is the default */, parser);
System.out.println(parser.getErrors()); // Do something with the errors, if any

how to get proper formatted text from html when tags don't have line breaks

I am trying to parse this sample html file with the help of Jsoup HTML parsing Library.
<html>
<body>
<p> this is sample text</p>
<h1>this is heading sample</h1>
<select name="car" size="1">
<option value="Ford">Ford</option><option value="Chevy">Chevy</option><option selected value="Subaru">Subaru</option>
</select>
<p>this is second sample text</p>
</body>
</html>
And I am getting the following when I extract only text.
this is sample text this is heading sample FordChevySubaru this is second sample text
There are no spaces or line breaks in the option tag text.
Whereas If the html had been like this
<html>
<body>
<p> this is sample text</p>
<h1>this is heading sample</h1>
<select name="car" size="1">
<option value="Ford">Ford</option>
<option value="Chevy">Chevy</option>
<option selected value="Subaru">Subaru</option>
</select>
<p>this is second sample text</p>
</body>
</html>
now in this case the text is like this
this is sample text this is heading sample Ford Chevy Subaru this is second sample text
with proper spaces in the text of the option tags. How do I get the second output from the first HTML file? That is, when there are no line breaks between the tags, how can I keep the strings from being concatenated?
I am using the following code in Java.
public static String extractText(File file) throws IOException {
Document document = Jsoup.parse(file,null);
Element body=document.body();
String textOnly=body.text();
return textOnly;
}
I think the only solution that meets your requirements is to traverse the DOM and collect the text nodes:
public static String extractText(File file) throws IOException {
    StringBuilder sb = new StringBuilder();
    Document document = Jsoup.parse(file, null);
    // visit every element and collect its own (direct) text nodes
    for (Element e : document.getAllElements()) {
        for (TextNode t : e.textNodes()) {
            if (!t.isBlank()) { // skip whitespace-only nodes
                sb.append(t.text().trim()).append(" ");
            }
        }
    }
    return sb.toString().trim();
}
Hope it helps.

Jsoup returned string " " is not returning true on equals(" ")

Just playing around and pulling some data off a site to manipulate when I come across this:
String request = "http://foo";
String data = "bar";
Connection.Response res = Jsoup.connect(request).data(data).method(Method.POST).execute();
Document doc = res.parse();
Elements all = doc.select("td");
for (Element elem : all) {
    String test = elem.text();
    if (test.equals(" ")) {
        // redefine test to 0 and print it
    } else {
        // print it
    }
}
The site in question is coded as so:
<td align="center">Henry</td>
<td>23</td>
<td align="center">Savannah</td>
<td>15</td></tr>
...
<td align="center"> </td>
<td> </td>
<td align="center">Jane</td>
<td>15</td></tr>
In my for loop, test is never redefined.
I've debugged in Eclipse and String test is showing as so:
Edit
Debugging test.charAt(0):
org.jsoup.nodes.Element.text() says "Returns unencoded text or empty string if none". I'm assuming the unencoded part has something to do with this, but I can't figure it out.
I ran a test program:
public static void main(String[] args) {
String str = " ";
if (str.equals(" ")){
System.out.println("True");
}
}
and it returns true.
What gives?
I don't know whether you control the HTML sent in the body of the response, or whether this is what you see in a browser's page source or elsewhere:
<td> </td>
But it's possible the actual content is
<td>&nbsp;</td> // or &#160;
where &nbsp; is the HTML entity for the non-breaking space.
In java, you can represent it as
char nbsp = 160;
So you could just check for both char values, the one for space and the one for non-breaking space.
Note that there might be other codepoints that are represented as white space. You need to know what you're looking for.
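As a minimal sketch of that check, assuming the cell text really contains U+00A0 as suggested above, a helper can treat both ordinary whitespace and Unicode space separators as blank. The class name BlankCellCheck and the method isBlankCell are made up for illustration; only standard library calls are used.

```java
public class BlankCellCheck {
    // True if the text is empty or made up only of whitespace, counting the
    // non-breaking space (U+00A0) that Character.isWhitespace does not accept.
    public static boolean isBlankCell(String text) {
        return text.chars().allMatch(c -> Character.isWhitespace(c) || Character.isSpaceChar(c));
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // the character that &nbsp; decodes to
        System.out.println(nbsp.equals(" "));    // false: different code points
        System.out.println(isBlankCell(nbsp));   // true
        System.out.println(isBlankCell(" "));    // true
        System.out.println(isBlankCell("Jane")); // false
    }
}
```

In the loop from the question, the condition test.equals(" ") could then be replaced with a call such as isBlankCell(test).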
