Open source java library for HTML to text conversion [duplicate] - java

This question already has answers here:
Remove HTML tags from a String
(35 answers)
Can you recommend an open source Java library (preferably ASL/BSD/LGPL licensed) that converts HTML to plain text? It should clean out all the tags, convert entities (&amp;, &nbsp;, etc.) and handle <br> and tables properly.
More Info
I have the HTML as a string; there's no need to fetch it from the web. What I'm looking for is a method like this:
String convertHtmlToPlainText(String html)

Try Jericho.
The TextExtractor class sounds like it will do what you want. Sorry, I can't post a second link as I'm a new user, but scroll down the homepage a bit and there's a link to it.
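In case it helps, a minimal sketch of what that looks like (class and method names from my reading of the Jericho API, so double-check against the Javadoc):
import net.htmlparser.jericho.Source;
...
public static String convertHtmlToPlainText(String html) {
    // Source parses the HTML; TextExtractor walks it and returns the text content
    Source source = new Source(html);
    return source.getTextExtractor().toString();
}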

HtmlUnit - it even gives you the page after processing JavaScript/Ajax.
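A rough sketch of the usual HtmlUnit flow (this fetches a URL rather than starting from a string, and asText() may be called asNormalizedText() in recent versions - treat it as an illustration only):
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
...
public static String pageText(String url) throws Exception {
    try (WebClient webClient = new WebClient()) {
        // getPage() downloads the page and runs its JavaScript
        HtmlPage page = webClient.getPage(url);
        return page.asText();
    }
}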

The bliki engine can do this in two steps. See info.bliki.wiki / Home
How to convert HTML to Mediawiki text -- Mediawiki text is already a rather plain text format, but you can convert it further.
How to convert Mediawiki text to plain text -- your goal.
It will be some 7-8 lines of code, like this:
// html to wiki
import info.bliki.html.HTML2WikiConverter;
import info.bliki.html.wikipedia.ToWikipedia;
// wiki to plain text
import info.bliki.wiki.filter.PlainTextConverter;
import info.bliki.wiki.model.WikiModel;
...
String sbodyhtml = readFile( infilepath ); //get content as string
HTML2WikiConverter conv = new HTML2WikiConverter();
conv.setInputHTML( sbodyhtml );
String resultwiki = conv.toWiki(new ToWikipedia());
WikiModel wikiModel = new WikiModel("${image}", "${title}");
String plainStr = wikiModel.render(new PlainTextConverter(false), resultwiki );
System.out.println( plainStr );
Jsoup can do this more simply:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
...
Document doc = Jsoup.parse(sbodyhtml);
String plainStr = doc.body().text();
but in the result you lose all paragraph formatting -- there will be no newlines at all.
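A commonly used workaround (my addition, not part of the original answer) is to plant literal "\n" markers before the block-level elements, take the text, and then turn the markers back into real newlines:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
...
Document doc = Jsoup.parse(sbodyhtml);
doc.select("br").before("\\n");   // insert a literal \n text node before each <br>
doc.select("p").before("\\n");    // and before each paragraph
String plainStr = doc.text().replace("\\n", "\n").trim();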

I use TagSoup; it is available for several languages and does a really good job with HTML found "in the wild". It produces either a cleaned-up version of the HTML or XML, which you can then process with a DOM/SAX parser.
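For the Java version, here is a sketch of how that typically looks - TagSoup is a SAX parser, so you collect the character events yourself (class names from the TagSoup/SAX APIs; verify before relying on it):
import java.io.StringReader;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
...
public static String htmlToText(String html) throws Exception {
    XMLReader reader = new Parser();           // TagSoup's lenient SAX parser
    StringBuilder text = new StringBuilder();
    reader.setContentHandler(new DefaultHandler() {
        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);    // keep only the text nodes
        }
    });
    reader.parse(new InputSource(new StringReader(html)));
    return text.toString();
}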

I've used Apache Commons Lang to go the other way. But it looks like it can do what you need via StringEscapeUtils.
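To be clear, StringEscapeUtils only converts entities - it does not strip tags - so it covers part of the question. A small illustration, using the commons-lang3 class that also appears in the answers below:
import org.apache.commons.lang3.StringEscapeUtils;
...
String s = StringEscapeUtils.unescapeHtml4("Fish &amp; chips &eacute;");
// s is now "Fish & chips é"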

Related

Java XSS Sanitization for nested HTML elements

I am using the JSoup library in Java to sanitize input to prevent XSS attacks. It works well for simple inputs like <script>alert('vulnerable')</script>.
Example:
String data = "<script>alert('vulnerable')</script>";
data = Jsoup.clean(data, "", Whitelist.none());
data = StringEscapeUtils.unescapeHtml4(data); //StringEscapeUtils from apache-commons lib
System.out.println(data);
Output: ""
However, if I tweak the input to the following, JSoup cannot sanitize the input.
String data = "<<b>script>alert('vulnerable');<</b>/script>";
data = Jsoup.clean(data, "", Whitelist.none());
data = StringEscapeUtils.unescapeHtml4(data);
System.out.println(data);
Output: <script>alert('vulnerable');</script>
This output is obviously still prone to XSS attacks. Is there a way to fully sanitize the input so that all HTML tags are removed from it?
Not sure if this is the best solution, but a temporary workaround would be parsing the raw text into a Document and then cleaning the combined text of the Document element and all its children:
String unsafe = "<<b>script>alert('vulnerable');<</b>/script>";
Document doc = Jsoup.parse(unsafe);
String safe = Jsoup.clean(doc.text(), Whitelist.none());
System.out.println(safe);
Wait for someone else to come up with the best solution.
The problem is that you are unescaping the safe HTML that jsoup has made. The output of the Cleaner is HTML. The none safelist passes no tags through, only the text nodes, as HTML.
So the input:
<<b>script>alert('vulnerable');<</b>/script>
run through the Cleaner returns:
&lt;script&gt;alert('vulnerable');&lt;/script&gt;
which is perfectly safe for presenting as HTML. See https://try.jsoup.org/~hfn2nvIglfl099_dVxLQEPxekqg
Just don't include the unescape line.
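In other words, the corrected flow is just (a sketch):
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
...
String data = "<<b>script>alert('vulnerable');<</b>/script>";
String safeHtml = Jsoup.clean(data, Whitelist.none());
// safeHtml holds the escaped text (&lt;script&gt;...), which is safe to render in an HTML context
System.out.println(safeHtml);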

How to format a java String as JSP code and save it to file?

I have a Java string containing JSP code which I generated programmatically. I want the code in the string to be properly formatted and aligned, and then to save the code to disk.
I am already formatting some strings as Java code using the Formatter.formatSource() method, and XML code held in a string can be formatted using javax.xml.transform.Transformer.
But I could not find anything similar to properly format the string as JSP code.
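For reference, the javax.xml.transform approach mentioned above usually looks roughly like this; note it only pretty-prints well-formed XML, so it will not handle arbitrary JSP (which is the original problem):
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
...
public static String prettyPrintXml(String xml) throws Exception {
    Transformer t = TransformerFactory.newInstance().newTransformer();
    t.setOutputProperty(OutputKeys.INDENT, "yes");
    t.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");  // indent width (Xalan-specific key)
    StringWriter out = new StringWriter();
    t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
    return out.toString();
}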
It is like HTML, so try this (this uses Android's Html.fromHtml):
Spanned spannedContent = Html.fromHtml(YOUR_STRING);
textView.setText(spannedContent, BufferType.SPANNABLE);

Convert PDF to HTML using iText library [duplicate]

I used iText 5 to create a nice-looking report which includes some tables and graphs. I wonder if iText lets you convert PDF to HTML and, if so, how can one do it?
I believe previous versions of iText allowed it, but in iText 5 I was not able to find a way to do this.
No. iText has never converted PDF to HTML, only the reverse.
Have you had a look at http://www.jpedal.org/pdf_to_html_conversion.php - there is currently a free beta.
Possible to do with Apache Tika (it uses Apache PDFBox under the hood):
// These classes come from Apache PDFBox (org.apache.pdfbox.pdmodel.PDDocument,
// org.apache.pdfbox.util.PDFText2HTML), which Tika uses under the hood.
public String pdfToHtml(InputStream content) throws IOException {
    PDDocument pddDocument = PDDocument.load(content);
    PDFText2HTML stripper = new PDFText2HTML("UTF-8");
    String html = stripper.getText(pddDocument);
    pddDocument.close();
    return html;
}
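If you want to stay at the Tika level instead of calling PDFBox classes directly, something along these lines should work (a sketch; class names as I recall the Tika API, so verify them):
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToXMLContentHandler;
...
public String pdfToHtmlViaTika(InputStream content) throws Exception {
    ToXMLContentHandler handler = new ToXMLContentHandler();  // collects the parsed content as XHTML
    new AutoDetectParser().parse(content, handler, new Metadata());
    return handler.toString();
}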

Microdata extraction from HTML in Java

I really need help extracting Microdata embedded in HTML5. My goal is to get structured data from a webpage, just like this Google tool: http://www.google.com/webmasters/tools/richsnippets. I have searched a lot but found no workable solution.
Currently I use the any23 library, but I can't find any documentation, only Javadocs, which don't provide enough information for me.
I use any23's Microdata Extractor, but I'm getting stuck at the third parameter: "org.w3c.dom.Document in". I can't turn HTML content into a W3C DOM. I have tried JTidy as well as JSoup, but the DOM objects from those libraries don't fit the extractor's constructor. In addition, I also have doubts about the second parameter of the Microdata Extractor.
I hope someone can help me with any23 or suggest another library that can solve this extraction issue.
Edit: I found a solution myself by doing the same thing the any23 command line tool does. Here is the snippet of code:
// value holds the URL of the page to fetch
HTTPDocumentSource doc = new HTTPDocumentSource(DefaultHTTPClient.createInitializedHTTPClient(), value);
InputStream documentInputInputStream = doc.openInputStream();
// TagSoupParser turns the (possibly messy) HTML into an org.w3c.dom.Document
TagSoupParser tagSoupParser = new TagSoupParser(documentInputInputStream, doc.getDocumentURI());
Document document = tagSoupParser.getDOM();
ByteArrayOutputStream byteArrayOutput = new ByteArrayOutputStream();
MicrodataParser.getMicrodataAsJSON(tagSoupParser.getDOM(), new PrintStream(byteArrayOutput));
String result = byteArrayOutput.toString("UTF-8");
These lines of code only extract microdata from the HTML and write it out in JSON format. I tried to use MicrodataExtractor, which can change the output format to others (RDF, Turtle, ...), but its input document seems to accept only XML; it throws "Document didn't start" when I feed it an HTML document.
If anyone has found a way to use MicrodataExtractor, please leave an answer here.
Thank you.
XPath is generally the way to consume HTML or XML.
Have a look at: How to read XML using XPath in Java
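For example, once you have an org.w3c.dom.Document (from TagSoup as above, or any lenient HTML parser), plain XPath can pull the microdata attributes out - a sketch, assuming the Document is already built:
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
...
public static void printItemprops(Document document) throws Exception {
    XPath xpath = XPathFactory.newInstance().newXPath();
    // every element carrying an itemprop attribute
    NodeList props = (NodeList) xpath.evaluate("//*[@itemprop]", document, XPathConstants.NODESET);
    for (int i = 0; i < props.getLength(); i++) {
        String name = props.item(i).getAttributes().getNamedItem("itemprop").getTextContent();
        System.out.println(name + " = " + props.item(i).getTextContent().trim());
    }
}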

Extracting PDF annotations/comments [duplicate]

This question already has answers here:
How to extract Highlighted Parts from PDF files
(2 answers)
We have a pretty complex print workflow where the controlling department adds comments and annotations to draft versions of generated PDF documents using Adobe Reader or Adobe Acrobat. As part of the workflow, the imported PDF documents with annotations and comments should be parsed, and the annotations should be imported into a CMS system (together with the PDF).
Q: Are there any reliable tools (preferably Python or Java) for extracting such data from PDF files in a clean and reliable way?
This code should do the job. One of the answers to the question Parse annotations from a pdf was very helpful in getting me to write the code below. It uses the poppler library to parse the annotations. This is a link to annotations.pdf.
code
import poppler, os.path

path = 'file://%s' % os.path.realpath('annotations.pdf')
doc = poppler.document_new_from_file(path, None)
pages = [doc.get_page(i) for i in range(doc.get_n_pages())]
for page_no, page in enumerate(pages):
    items = [i.annot.get_contents() for i in page.get_annot_mapping()]
    items = [i for i in items if i]
    print "page: %s comments: %s " % (page_no + 1, items)
output
page: 1 comments: ['This is an annotation']
page: 2 comments: [' Please note ', ' Please note ', 'This is a comment in the text']
installation
On Ubuntu the installation is as follows.
apt-get install python-poppler
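Since the question also allows Java: the same idea with Apache PDFBox looks roughly like this (my addition, not from the answer above; API names as in PDFBox 2.x, so double-check against your version):
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
...
public static void printAnnotations(File pdf) throws IOException {
    try (PDDocument doc = PDDocument.load(pdf)) {
        int pageNo = 1;
        for (PDPage page : doc.getPages()) {
            for (PDAnnotation annotation : page.getAnnotations()) {
                // getContents() returns the comment text; many annotations have none
                if (annotation.getContents() != null && !annotation.getContents().isEmpty()) {
                    System.out.println("page " + pageNo + " comment: " + annotation.getContents());
                }
            }
            pageNo++;
        }
    }
}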
