separate html coded string and normal string - java

I want to split a single string containing normal text as well as html code into array of string. I tried to search on google but not found any suitable suggestion.
Consider the following string:
blahblahblahblahblahblahblahblahblahblah
blahblah First para blahblahblahblah
blahblahblahblahblahblahblahblahblahblah
<html>
<body>
<p>hello</p>
</body>
</html>
blahblahblahblahblahblahblahblahblahblah
blahblah Second Para lahblahblahblahblah
blahblahblahblahblahblahblahblahblahblah
this becomes:
s[0]=whole first para
s[1]=html code
s[2]=whole second para
Is it possible through jsoup ?. Or I need any other api?

It is possible with jQuery. Here below is a code snippet.
var str = "blablabla <html><body><p>hello</p></body></html> blabla";
var parsedHTML = $.parseHTML(str);
myList = [];
// loop through parsed text and put it into text based on its type
$.each(parsedHTML, function( i, el ) {
if (el.nodeType < 3) myList[i] = el.nodeName;
else myList[i] = el.data;
});
// use myList ...
Here is a fiddle which shows you that it works. The only disadvantage is that both <html> and <body> tag is parsed and not being obtained in the parsedHTML.
jsfiddle example

This can be done with JSoup
Simple use example:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Then you can navigate the DOM structure to extract the information.
update
To get the text with all the tags you could wrap the entire string in <meta> ... </meta> tags; then parse it, access the individual components, and finally serialize the components back into strings.
Alternatively if you believe the code is well formed (with matching beginning and end tags) you could search for the first match of the regex
/<(html|body)\s*>/
Depending on what the contents of the first tag (match) are you then look for the last occurrence of the matching close tag.
More manual, more prone to error, not recommended. But since you have a non- standard problem it seems you might want a non-standard solution .

Related

JSOUP missing tag when converting html row

I having problem with jsoup whereby i want to get a row of data which later I will be inserting the row into another html document. But when i inspect time saw that there is no and tag. How can i solve it
String htmlcontent = "<tr><td colspan=\"2\"><div class=\"content-wrapper\"><p><strong><span class=\"CLASS 1 CLASS 2 CLASS 3\">123</span></strong><br /><strong>DATA 1</strong></p></td><td></td><td></td><td></td><td></td></tr>";
Document docnewinput = Jsoup.parse(htmlcontent, "UTF-8");
[<html>
<head></head>
<body>
<div class="content-wrapper">
<p><strong><span class="CLASS 1 CLASS 2 CLASS 3">123</span></strong><br><strong>DATA 1</strong></p>
</div>
</body>
</html>]
You have a fragment of body HTML (e.g. a div containing a couple of p tags; as opposed to a full HTML document) that you want to parse.
Use the Jsoup.parseBodyFragment(String html) method.
String html = "<table><tr><td colspan=\"2\"><div class=\"content-wrapper\"><p><strong><span class=\"CLASS 1 CLASS 2 CLASS 3\">123</span></strong><br /><strong>DATA 1</strong></p></td><td></td><td></td><td></td><td></td></tr></table>";
Document doc = Jsoup.parseBodyFragment(html);
The parseBodyFragment method creates an empty shell document, and inserts the parsed HTML into the body element. If you used the normal Jsoup.parse(String html) method, you would generally get the same result, but explicitly treating the input as a body fragment ensures that any bozo HTML provided by the user is parsed into the body element.
The parser will make every attempt to create a clean parse from the HTML you provide, regardless of whether the HTML is well-formed or not. It handles:
unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...)
reliably creating the document structure (html containing a head and body, and only appropriate elements within the head)
EDIT:
By using Jsoup.parse():
String html = "<table><tr><td colspan=\"2\"><div class=\"content-wrapper\"><p><strong><span class=\"CLASS 1 CLASS 2 CLASS 3\">123</span></strong><br /><strong>DATA 1</strong></p></td><td></td><td></td><td></td><td></td></tr></table>";
Document doc = Jsoup.parse(html);
Working Demo: https://try.jsoup.org/~EdJSrHl_biDcQkyhL2BLH5ZNnck
Need to use xmlParser() so that it will just read the string as it without formatting it.

How to preserve the meaning of tags like <br>, <ul> , <li> , <p> etc when reading them in Java using JSOUP library?

I am writing a program that extracts some certain information from local HTML files. That information is then shown on a Java JFrame and is exported to an excel file. (I am using JSoup 1.9.2 library for the HTML parsing purposes)
I am running into an issue where whenever I extract anything from an HTML file, JSoup is not taking HTML tags like break tags, line tags etc. into account and so, all the information is being extracted like a big chunk of data without any proper newlines or formatting.
To show you an example, if this is the data that I want to read :
Title Line 1 Line 2 Unordered
Listelement 1 element 2
The data is coming back as :
Title Line 1 Line 2 Unordered List element 1 element 2 (i.e. all the
HTML tags are ignored)
This is the piece of code that I am using for reading in :
private String getTitle(Document doc) { // doc is the local HTML file
Elements title = doc.select(".title");
for (Element id : title) {
return id.text();
}
return "No Title Available ";
}
Can anyone suggest me a way that can be used to preserve the meaning behind the HTML tags by using which I can both display the data on the JFrame and export it to excel with a more readable format?
Thanks.
Just to give everyone an update, I was able to find a solution (more like a workaround) to the formatting issue. What i am doing now is extracting the complete HTML using id.html() which I am storing in a String object. Then, i am using the String function replaceAll() with a regular expression to get rid of all the HTML tags without pushing everything into a single line. The replaceAll() function looks something like replaceAll("\\<[^>]*>",""). My whole processHTML() function looks something like :
private String processHTML(String initial) { //initial is the String with all the HTML tags
String modified = initial;
modified = modified.replaceAll("\\<[^>]*>",""); //regular expression used
modified = modified.trim(); //To get rid of any unwanted space before and after the needed data
//All the replaceAll() functions below are to get rid of any HTML entities that might be left in the data extarcted from the HTML
modified = modified.replaceAll(" ", " ");
modified = modified.replaceAll("<", "<");
modified = modified.replaceAll(">", ">");
modified = modified.replaceAll("&", "&");
modified = modified.replaceAll(""", "\"");
modified = modified.replaceAll("&apos;", "\'");
modified = modified.replaceAll("¢", "¢");
modified = modified.replaceAll("©", "©");
modified = modified.replaceAll("®", "®");
return modified;
}
Thanks you all again for helping me with this
Cheers.

Java How to select half part of string?

I have string like this:
<html>
<body>
<p>Hello</p>
</body>
</html>
and In java I want to after selection string looked like this:
<p>Hello</p>
How?
If you are dealing with HTML parsing then go for JSoup
String html = "<html><body><p>Hello</p></body></html>";
Document doc = Jsoup.parseBodyFragment(html);
Elements fragment = doc.select("p"); // p tag
System.out.println(fragment.html());
Quick link
This one line should do it:
String text = input.replaceAll("(?s).*(<p>.*</p>).*", "$1");
Or to just get everything in the <body> tag, do this:
String text = input.replaceAll("(?s).*<body>(.*)</body>.*", "$1");
You can use JSoup, it will be quit handy for you.

Removing Html tags except few specific ones from String in java

My input is plain text string and requirement is to remove all html tags except few specific tags like:
<p>
<li>
<u>
<li>
If these specific tags have attributes like class or id, I want to remove these attributes.
A few examples:
Link -> Link
<p>paragraph</p> -> <p>paragraph</p>
<p class="class1">paragraph</p> -> <p>paragraph</p>
I have gone through this Remove HTML tags from a String but it does not answer my question completely.
Can it be handled by a set of regex's or could I make use of some library?
I tried JSoup and It seems to be able to handle all such cases. Here is example code.
public String clean(String unsafe){
Whitelist whitelist = Whitelist.none();
whitelist.addTags(new String[]{"p","br","ul"});
String safe = Jsoup.clean(unsafe, whitelist);
return StringEscapeUtils.unescapeXml(safe);
}
For input string
String unsafe = "<p class='p1'>paragraph</p>< this is not html > <a link='#'>Link</a> <![CDATA[<sender>John Smith</sender>]]>";
I get following output which is pretty much I require.
<p>paragraph</p>< this is not html > Link <sender>John Smith</sender>
For simple HTML, this may be sufficient:
// remove any <script> tags
html = html.replaceAll("(?i)<script.*?</script>", ""));
// this removes any attributes
html = html.replaceAll("(?i)<([a-zA-Z0-9-_]*)(\\s[^>]*)>", "<$1>"));
// this removes any tags (not li and p)
html = html.replaceAll("(?i)<(?!(/?(li|p)))[^>]*>", ""));
Hope that helps.

A good HTML object model in Java?

I'm looking for an HTML object model in Java, capable of parsing HTML (not required) and containing all HTML elements (and CSS as well) in an elegant object model.
I'm looking for a pure java version of the Groovy's HTML builder.
(I have no luck on google with this request.)
I want to be able to perform stuff like:
HTML html = new HTML();
Body body = html.body();
Table table body.addTable(myCssStyle);
Row row = table.addRow("a", "b", "c").withCss(cssRowStyle);
and so on...
Check out Jsoup:
Example:
(Building some html)
Document doc = Document.createShell("");
Element headline = doc.body().appendElement("h1").text("thats a headline");
Element pTag = doc.body().appendElement("p").text("some text ...");
Element span = pTag.prependElement("span").text("That's");
System.out.println(doc);
Output:
<html>
<head></head>
<body>
<h1>thats a headline</h1>
<p><span>That's</span>some text ...</p>
</body>
</html>
Documentation:
Codebook
API Documentation (JavaDoc)
Jakarta ECS might be able to do what you want.
Just an idea: you could take a look at the source code of xhtmlrenderer project.
http://code.google.com/p/flying-saucer//
It's not plain HTML (it's XHTML), but may be a good starting point, don't you think?

Categories

Resources