Insert page number with text in footer aspose word java

I want to insert a page number in the footer alongside some text, but when the footer already contains text, the page number and the text switch places.
I am doing this while converting a document from HTML to Word using the Aspose.Words for Java library.
The footer text comes from the HTML, and I just want to add the page number.
Code for adding the page number to the footer:
log.debug("Add page number");
DocumentBuilder builder = new DocumentBuilder(doc);
// Insert PAGE field into the footer
builder.moveToHeaderFooter(HeaderFooterType.FOOTER_PRIMARY);
builder.insertField("PAGE", null);
builder.write("/");
builder.insertField("NUMPAGES", null);
Also, is there any way to replace the whole text in the footer?

You can add page numbers and text together in the footer by using a table, as follows:
DocumentBuilder builder = new DocumentBuilder(doc);
// Insert PAGE field into the footer
builder.moveToHeaderFooter(HeaderFooterType.FOOTER_PRIMARY);
builder.startTable();
// Clear table borders
builder.getCellFormat().clearFormatting();
builder.insertCell();
// Set first cell to 1/3 of the page width.
builder.getCellFormat().setPreferredWidth(
PreferredWidth.fromPercent(100 / 3));
// Insert page numbering text here.
// It uses PAGE and NUMPAGES fields to auto calculate current page
// number and total number of pages.
builder.insertField("PAGE", null);
builder.write("/");
builder.insertField("NUMPAGES", null);
// Align this text to the left.
builder.getCurrentParagraph().getParagraphFormat()
.setAlignment(ParagraphAlignment.LEFT);
builder.insertCell();
// Set the second cell to 2/3 of the page width.
builder.getCellFormat().setPreferredWidth(
PreferredWidth.fromPercent(100 * 2 / 3));
builder.write("(C) 2017 Aspose Pty Ltd. All rights reserved.");
// Align this text to the right.
builder.getCurrentParagraph().getParagraphFormat()
.setAlignment(ParagraphAlignment.RIGHT);
builder.endRow();
builder.endTable();
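A side note on the width arithmetic, as a small stand-alone sketch in plain Java (no Aspose calls; the class name FooterWidths is only illustrative). With int operands, 100 / 3 and 100 * 2 / 3 are truncated before the result is passed on, so the two cells are requested at 33% and 66%. If exact thirds matter, making one operand a double avoids the truncation.

```java
public class FooterWidths {
    // Integer arithmetic, exactly as written in the snippet above.
    static int firstCellInt() { return 100 / 3; }
    static int secondCellInt() { return 100 * 2 / 3; }

    // Floating-point variant: one double operand keeps the fraction.
    static double firstCellDouble() { return 100.0 / 3; }
    static double secondCellDouble() { return 100.0 * 2 / 3; }

    public static void main(String[] args) {
        // Integer widths leave one percent of the row unassigned.
        System.out.println(firstCellInt() + " + " + secondCellInt());
        // Double widths add up to (approximately) the full row.
        System.out.println(firstCellDouble() + " + " + secondCellDouble());
    }
}
```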
To replace the whole text of the footer, you can access the footer, clear all of its text, and add new content, as follows:
DocumentBuilder builder = new DocumentBuilder(doc);
Section currentSection = builder.getCurrentSection();
// Despite living in the HeadersFooters collection, this is the primary footer.
com.aspose.words.HeaderFooter primaryFooter = currentSection.getHeadersFooters().getByHeaderFooterType(HeaderFooterType.FOOTER_PRIMARY);
primaryFooter.getParagraphs().clear();
...
I am Tilal Ahmad, developer evangelist at Aspose.

Related

How to replace date field with some text in the ViewMaster (Vertical) for word/pdf using Aspose?

The Aspose code inserts the ViewMaster (Vertical) with a default
date-selection field as the text inside. I want to replace that date
with some text, as shown in the image.
I followed the code mentioned in "ViewMaster(vertical) using Aspose"
to generate the ViewMaster (Vertical) in the Word/PDF output. Can someone
help me find the right code to replace the date with text?

The date is set in a structured document tag (SDT). You can use code like this to get and modify the value of this SDT:
// Get structured document tags from footer.
NodeCollection tags = doc.FirstSection.HeadersFooters[HeaderFooterType.FooterPrimary].GetChildNodes(NodeType.StructuredDocumentTag, true);
foreach (StructuredDocumentTag tag in tags)
{
if (tag.Title.Equals("Date") && tag.SdtType == SdtType.Date)
{
tag.IsShowingPlaceholderText = false;
tag.FullDate = DateTime.Now;
// By default the SDT is mapped to XML. We can simply remove the mapping to use the value set in the FullDate property.
tag.XmlMapping.Delete();
}
}
If you do not need date, but need to insert some custom text, you can remove the tag and insert a simple paragraph with text instead. For example:
// Get structured document tags from footer.
NodeCollection tags = doc.FirstSection.HeadersFooters[HeaderFooterType.FooterPrimary].GetChildNodes(NodeType.StructuredDocumentTag, true);
foreach (StructuredDocumentTag tag in tags)
{
if (tag.Title.Equals("Date") && tag.SdtType == SdtType.Date)
{
// Put an empty paragraph after the structured document tag
Paragraph p = new Paragraph(doc);
tag.ParentNode.InsertAfter(p, tag);
// Remove tag
tag.Remove();
// move DocumentBuilder to the newly inserted paragraph and insert some text.
builder.MoveTo(p);
builder.Write("This is my custom vertical text");
}
}

How can I create an accessible PDF with Java PDFBox 2.0.8 library that is also verifiable with PAC 2 tool?

Background
I have small project on GitHub in which I am trying to create a section 508 compliant (section508.gov) PDF which has form elements within a complex table structure. The tool recommended to verify these PDFs is at http://www.access-for-all.ch/en/pdf-lab/pdf-accessibility-checker-pac.html and my program’s output PDF does pass most of these checks. I will also know what every field is meant for at runtime, so adding tags to structure elements should not be an issue.
The Problem
The PAC 2 tool seems to have an issue with two particular items in the output PDF. In particular, my radio buttons’ widget annotations are not nested inside of a form structure element and my marked content is not tagged (Text and Table Cells).
PAC 2 verifies the P structure element that is within top-left cell but not the marked content…
However, PAC 2 does identify the marked content as an error (i.e. Text/Path object not tagged).
Also, the radio button widgets are detected, but there seems to be no APIs to add them to a form structure element.
What I Have Tried
I have looked at several questions on this website and others on the subject including this one Tagged PDF with PDFBox, but it seems that there are almost no examples for PDF/UA and very little useful documentation (That I have found). The most useful tips that I have found have been at sites that explain specs for tagged PDFs like https://taggedpdf.com/508-pdf-help-center/object-not-tagged/.
The Question
Is it possible to create a PAC 2 verifiable PDF with Apache PDFBox that includes marked content and radio button widget annotations? If it is possible, is it doable using higher level (non-deprecated) PDFBox APIs?
Side Note: This is actually my first StackExchange question (although I have used the site extensively) and I hope everything is in order! Feel free to add any necessary edits and ask any questions that I may need to clarify. Also, I have an example program on GitHub which generates my PDF document at https://github.com/chris271/UAPDFBox.
Edit 1: Direct link to Output PDF Document
Edit 2: After using some of the lower-level PDFBox APIs and viewing raw data streams for fully compliant PDFs with PDFDebugger, I was able to generate a PDF with nearly identical content structure compared to the compliant PDF's content structure... However, the same errors appear that the text objects are not tagged and I really can't decide where to go from here... Any guidance would be greatly appreciated!
Edit 3: Side-by-side raw PDF content comparison.
Edit 4: Internal structure of the generated PDF
and the compliant PDF
Edit 5: I have managed to fix the PAC 2 errors for tagged path/text objects thanks in part to suggestions from Tilman Hausherr! I will add an answer if I manage to fix the issues regarding 'annotation widgets not being nested inside form structure elements'.

After going through a large amount of the PDF Spec and many PDFBox examples I was able to fix all issues reported by PAC 2. There were several steps involved to create the verified PDF (with a complex table structure) and the full source code is available here on github. I will attempt to do an overview of the major portions of the code below. (Some method calls will not be explained here!)
Step 1 (Setup metadata)
Various setup info like document title and language
//Setup new document
pdf = new PDDocument();
acroForm = new PDAcroForm(pdf);
pdf.getDocumentInformation().setTitle(title);
//Adjust other document metadata
PDDocumentCatalog documentCatalog = pdf.getDocumentCatalog();
//The catalog's Lang entry expects a language tag such as "en-US" rather than a language name.
documentCatalog.setLanguage("en-US");
documentCatalog.setViewerPreferences(new PDViewerPreferences(new COSDictionary()));
documentCatalog.getViewerPreferences().setDisplayDocTitle(true);
documentCatalog.setAcroForm(acroForm);
documentCatalog.setStructureTreeRoot(structureTreeRoot);
PDMarkInfo markInfo = new PDMarkInfo();
markInfo.setMarked(true);
documentCatalog.setMarkInfo(markInfo);
Embed all fonts directly into resources.
//Set AcroForm Appearance Characteristics
PDResources resources = new PDResources();
defaultFont = PDType0Font.load(pdf,
new PDTrueTypeFont(PDType1Font.HELVETICA.getCOSObject()).getTrueTypeFont(), true);
resources.put(COSName.getPDFName("Helv"), defaultFont);
acroForm.setNeedAppearances(true);
acroForm.setXFA(null);
acroForm.setDefaultResources(resources);
acroForm.setDefaultAppearance(DEFAULT_APPEARANCE);
Add XMP Metadata for PDF/UA spec.
//Add UA XMP metadata based on specs at https://taggedpdf.com/508-pdf-help-center/pdfua-identifier-missing/
XMPMetadata xmp = XMPMetadata.createXMPMetadata();
xmp.createAndAddDublinCoreSchema();
xmp.getDublinCoreSchema().setTitle(title);
xmp.getDublinCoreSchema().setDescription(title);
xmp.createAndAddPDFAExtensionSchemaWithDefaultNS();
xmp.getPDFExtensionSchema().addNamespace("http://www.aiim.org/pdfa/ns/schema#", "pdfaSchema");
xmp.getPDFExtensionSchema().addNamespace("http://www.aiim.org/pdfa/ns/property#", "pdfaProperty");
xmp.getPDFExtensionSchema().addNamespace("http://www.aiim.org/pdfua/ns/id/", "pdfuaid");
XMPSchema uaSchema = new XMPSchema(XMPMetadata.createXMPMetadata(),
"pdfaSchema", "pdfaSchema", "pdfaSchema");
uaSchema.setTextPropertyValue("schema", "PDF/UA Universal Accessibility Schema");
uaSchema.setTextPropertyValue("namespaceURI", "http://www.aiim.org/pdfua/ns/id/");
uaSchema.setTextPropertyValue("prefix", "pdfuaid");
XMPSchema uaProp = new XMPSchema(XMPMetadata.createXMPMetadata(),
"pdfaProperty", "pdfaProperty", "pdfaProperty");
uaProp.setTextPropertyValue("name", "part");
uaProp.setTextPropertyValue("valueType", "Integer");
uaProp.setTextPropertyValue("category", "internal");
uaProp.setTextPropertyValue("description", "Indicates, which part of ISO 14289 standard is followed");
uaSchema.addUnqualifiedSequenceValue("property", uaProp);
xmp.getPDFExtensionSchema().addBagValue("schemas", uaSchema);
xmp.getPDFExtensionSchema().setPrefix("pdfuaid");
xmp.getPDFExtensionSchema().setTextPropertyValue("part", "1");
XmpSerializer serializer = new XmpSerializer();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
serializer.serialize(xmp, baos, true);
PDMetadata metadata = new PDMetadata(pdf);
metadata.importXMPMetadata(baos.toByteArray());
pdf.getDocumentCatalog().setMetadata(metadata);
Step 2 (Setup document tag structure)
You will need to add the root structure element and all necessary structure elements as children to the root element.
//Adds a DOCUMENT structure element as the structure tree root.
void addRoot() {
PDStructureElement root = new PDStructureElement(StandardStructureTypes.DOCUMENT, null);
root.setAlternateDescription("The document's root structure element.");
root.setTitle("PDF Document");
pdf.getDocumentCatalog().getStructureTreeRoot().appendKid(root);
currentElem = root;
rootElem = root;
}
Each marked content element (text and background graphics) will need to have an MCID and an associated tag for reference in the parent tree which will be explained in step 3.
//Assign an id for the next marked content element.
private void setNextMarkedContentDictionary(String tag) {
currentMarkedContentDictionary = new COSDictionary();
currentMarkedContentDictionary.setName("Tag", tag);
currentMarkedContentDictionary.setInt(COSName.MCID, currentMCID);
currentMCID++;
}
Artifacts (background graphics) will not be detected by the screen reader. Text needs to be detectable so a P structure element is used here when adding text.
//Set up the next marked content element with an MCID and create the containing TD structure element.
PDPageContentStream contents = new PDPageContentStream(
pdf, pages.get(pageIndex), PDPageContentStream.AppendMode.APPEND, false);
currentElem = addContentToParent(null, StandardStructureTypes.TD, pages.get(pageIndex), currentRow);
//Make the actual cell rectangle and set as artifact to avoid detection.
setNextMarkedContentDictionary(COSName.ARTIFACT.getName());
contents.beginMarkedContent(COSName.ARTIFACT, PDPropertyList.create(currentMarkedContentDictionary));
//Draws the cell itself with the given colors and location.
drawDataCell(table.getCell(i, j).getCellColor(), table.getCell(i, j).getBorderColor(),
x + table.getRows().get(i).getCellPosition(j),
y + table.getRowPosition(i),
table.getCell(i, j).getWidth(), table.getRows().get(i).getHeight(), contents);
contents.endMarkedContent();
currentElem = addContentToParent(COSName.ARTIFACT, StandardStructureTypes.P, pages.get(pageIndex), currentElem);
contents.close();
//Draw the cell's text as a P structure element
contents = new PDPageContentStream(
pdf, pages.get(pageIndex), PDPageContentStream.AppendMode.APPEND, false);
setNextMarkedContentDictionary(COSName.P.getName());
contents.beginMarkedContent(COSName.P, PDPropertyList.create(currentMarkedContentDictionary));
//... Code to draw actual text...//
//End the marked content and append its P structure element to the containing TD structure element.
contents.endMarkedContent();
addContentToParent(COSName.P, null, pages.get(pageIndex), currentElem);
contents.close();
Annotation Widgets (form objects in this case) will need to be nested within Form structure elements.
//Add a radio button widget.
if (!table.getCell(i, j).getRbVal().isEmpty()) {
PDStructureElement fieldElem = new PDStructureElement(StandardStructureTypes.FORM, currentElem);
radioWidgets.add(addRadioButton(
x + table.getRows().get(i).getCellPosition(j) -
radioWidgets.size() * 10 + table.getCell(i, j).getWidth() / 4,
y + table.getRowPosition(i),
table.getCell(i, j).getWidth() * 1.5f, 20,
radioValues, pageIndex, radioWidgets.size()));
fieldElem.setPage(pages.get(pageIndex));
COSArray kArray = new COSArray();
kArray.add(COSInteger.get(currentMCID));
fieldElem.getCOSObject().setItem(COSName.K, kArray);
addWidgetContent(annotationRefs.get(annotationRefs.size() - 1), fieldElem, StandardStructureTypes.FORM, pageIndex);
}
//Add a text field in the current cell.
if (!table.getCell(i, j).getTextVal().isEmpty()) {
PDStructureElement fieldElem = new PDStructureElement(StandardStructureTypes.FORM, currentElem);
addTextField(x + table.getRows().get(i).getCellPosition(j),
y + table.getRowPosition(i),
table.getCell(i, j).getWidth(), table.getRows().get(i).getHeight(),
table.getCell(i, j).getTextVal(), pageIndex);
fieldElem.setPage(pages.get(pageIndex));
COSArray kArray = new COSArray();
kArray.add(COSInteger.get(currentMCID));
fieldElem.getCOSObject().setItem(COSName.K, kArray);
addWidgetContent(annotationRefs.get(annotationRefs.size() - 1), fieldElem, StandardStructureTypes.FORM, pageIndex);
}
Step 3
After all content elements have been written to the content stream and the tag structure has been set up, it is necessary to go back and add the parent tree to the structure tree root. Note: Some method calls in the above code (addWidgetContent() and addContentToParent()) set up the necessary COSDictionary objects.
//Adds the parent tree to root struct element to identify tagged content
void addParentTree() {
COSDictionary dict = new COSDictionary();
nums.add(numDictionaries);
for (int i = 1; i < currentStructParent; i++) {
nums.add(COSInteger.get(i));
nums.add(annotDicts.get(i - 1));
}
dict.setItem(COSName.NUMS, nums);
PDNumberTreeNode numberTreeNode = new PDNumberTreeNode(dict, dict.getClass());
pdf.getDocumentCatalog().getStructureTreeRoot().setParentTreeNextKey(currentStructParent);
pdf.getDocumentCatalog().getStructureTreeRoot().setParentTree(numberTreeNode);
}
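For orientation, here is a library-free sketch of the data shape that addParentTree() fills in. In the PDF format, a number-tree node keeps its entries in a flat Nums array that alternates integer keys and values in ascending key order, and ParentTreeNextKey is expected to be greater than any key already in the tree. The class ParentTreeModel and its string values are stand-ins for illustration only; they are not PDFBox types and are not taken from the GitHub project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ParentTreeModel {
    // Keys play the role of StructParent(s) indices; the strings stand in
    // for the structure elements (or arrays of them) stored in a real tree.
    private final TreeMap<Integer, String> entries = new TreeMap<>();

    void put(int key, String structElem) {
        entries.put(key, structElem);
    }

    // Flatten to the [key, value, key, value, ...] layout of a Nums array.
    // TreeMap iteration is already in ascending key order.
    List<Object> toNums() {
        List<Object> nums = new ArrayList<>();
        for (Map.Entry<Integer, String> e : entries.entrySet()) {
            nums.add(e.getKey());
            nums.add(e.getValue());
        }
        return nums;
    }

    // Smallest value that is greater than every key currently in use.
    int nextKey() {
        return entries.isEmpty() ? 0 : entries.lastKey() + 1;
    }

    public static void main(String[] args) {
        ParentTreeModel tree = new ParentTreeModel();
        tree.put(1, "FormElem");
        tree.put(0, "PageContentElems");
        System.out.println(tree.toNums());
        System.out.println(tree.nextKey());
    }
}
```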
If all widget annotations and marked content were added correctly to the structure tree and parent tree then you should get something like this from PAC 2 and PDFDebugger.
Thank you to Tilman Hausherr for pointing me in the right direction to solve this! I will most likely make some edits to this answer for additional clarity as recommended by others.
Edit 1:
If you want to have a table structure like the one I have generated, you will also need to add correct table markup to fully comply with the 508 standard... The 'Scope', 'ColSpan', 'RowSpan', or 'Headers' attributes will need to be correctly added to each table cell structure element, similar to this or this. The main purpose of this markup is to allow screen-reading software such as JAWS to read the table content in an understandable way. These attributes can be added in a way similar to the following...
private void addTableCellMarkup(Cell cell, int pageIndex, PDStructureElement currentRow) {
COSDictionary cellAttr = new COSDictionary();
cellAttr.setName(COSName.O, "Table");
if (cell.getCellMarkup().isHeader()) {
currentElem = addContentToParent(null, StandardStructureTypes.TH, pages.get(pageIndex), currentRow);
currentElem.getCOSObject().setString(COSName.ID, cell.getCellMarkup().getId());
if (cell.getCellMarkup().getScope().length() > 0) {
cellAttr.setName(COSName.getPDFName("Scope"), cell.getCellMarkup().getScope());
}
if (cell.getCellMarkup().getColspan() > 1) {
cellAttr.setInt(COSName.getPDFName("ColSpan"), cell.getCellMarkup().getColspan());
}
if (cell.getCellMarkup().getRowSpan() > 1) {
cellAttr.setInt(COSName.getPDFName("RowSpan"), cell.getCellMarkup().getRowSpan());
}
} else {
currentElem = addContentToParent(null, StandardStructureTypes.TD, pages.get(pageIndex), currentRow);
}
if (cell.getCellMarkup().getHeaders().length > 0) {
COSArray headerA = new COSArray();
for (String s : cell.getCellMarkup().getHeaders()) {
headerA.add(new COSString(s));
}
cellAttr.setItem(COSName.getPDFName("Headers"), headerA);
}
currentElem.getCOSObject().setItem(COSName.A, cellAttr);
}
Be sure to do something like currentElem.setAlternateDescription(currentCell.getText()); on each of the structure elements with text marked content for JAWS to read the text.
Note: Each of the fields (radio button and textbox) will need a unique name to avoid setting multiple field values. GitHub has been updated with a more complex example PDF with table markup and improved form fields!

Generating pdf with itext : Some Czech characters not showing in HTMLWorker parsed paragraphs

We are using itext 2.1.7.
We have an embedded rich text editor (CKEditor) whose contents (html) are stored in a database. The editor allows contents to be formatted (bold, italic).
We generate pdf based on those html contents using the HTMLWorker.parseToList method. It works well and renders formatted content properly.
Except when some diacritics are formatted bold or italic (see capture below).
Some code to reproduce the failing behaviour :
ArrayList elements;
Font diacriticReadyFont = FontFactory.getFont("/images/arial.ttf", BaseFont.IDENTITY_H, true);
// Add one normally styled paragraph with Czech diacritics
Paragraph p1 = new Paragraph("", diacriticReadyFont);
elements = HTMLWorker.parseToList(new StringReader("<p>A normal style paragraph with Czech diacritics shows fine : Č,Ć,Š,Ž,Đ</p>"), null);
for (Object element : elements) {
p1.add(element);
}
getDocument().add(p1);
// Add one mixed style paragraph with standard characters
Paragraph p2 = new Paragraph("", diacriticReadyFont);
elements = HTMLWorker.parseToList(new StringReader("<p>A paragraph with some <em>italic text </em>and <strong>bold text </strong>shows fine</p>"), null);
for (Object element : elements) {
p2.add(element);
}
getDocument().add(p2);
// Add one bold style paragraph with Czech diacritics
Paragraph p3 = new Paragraph("", diacriticReadyFont);
elements = HTMLWorker.parseToList(new StringReader("<p><strong>However, bold text with Czech diacritics Č,Ć,Š,Ž,Đ will miss some of those diacritics</strong></p>"), null);
for (Object element : elements) {
p3.add(element);
}
getDocument().add(p3);
// Add one italic style paragraph with Czech diacritics
Paragraph p4 = new Paragraph("", diacriticReadyFont);
elements = HTMLWorker.parseToList(new StringReader("<p><em>Also, italic text with Czech diacritics Č,Ć,Š,Ž,Đ will miss some too</em></p>"), null);
for (Object element : elements) {
p4.add(element);
}
getDocument().add(p4);
// Forcing the font on "element" paragraphs does not help
Paragraph p5 = new Paragraph("", diacriticReadyFont);
elements = HTMLWorker.parseToList(new StringReader("<p><strong>Forcing the font on \"element\" paragraphs does not help : Č,Ć,Š,Ž,Đ</strong></p>"), null);
for (Object element : elements) {
((Paragraph)element).setFont(diacriticReadyFont);
p5.add(element);
}
getDocument().add(p5);
gives :
According to my analysis (greatly helped by this excellent post : Can't get Czech characters while generating a PDF), it seems the font automagically applied by the HTMLWorker to the formatted (bold or italic) text is the culprit.
As paragraph 5 example shows, manually forcing this font does not help.
Any insight ?

Keeping a (title-) Paragraph and a Table together on one page?

I'm generating a PDF document with iText 5.5.8
In this document there are numbered paragraphs that only contain a title Paragraph and a PdfPTable.
for (Item item : getItems()) {
Paragraph title = new Paragraph();
Chunk chunk = new Chunk(new Chunk(getIcon(item), 0, 0));
addBookmark(item, chunk);
title.add(chunk);
Chunk chunk2 = new Chunk(getName(item), catFont_u);
title.add(chunk2);
title.setSpacingBefore(20);
title.setSpacingAfter(14);
PdfPTable table = createTable(item); // can be more than a page!
table.setKeepTogether(true);
Section subSection = chapter.addSection(title);
subSection.add(table);
}
Now when the table is larger than the space left on the rest of the page, the table will be 'moved' to the next page (setKeepTogether()). This is good.
However, I want the title Paragraph to always be on the same page as the PdfPTable. So the title Paragraph should be moved to the next page also.
How do I accomplish this?
Thanks,
Carel

You can create an outer table with a single column and add your title paragraph to it. After that, create another table, innerTable, where you place your data; add innerTable to a cell, and then add that cell to the outer table. That way your title and table stay together. Also call setSplitLate(false) on the outer table.

Remove HTMLs and CSS styles from PDF created using itext

We are dynamically creating PDF using itext in our application. The content of the PDF is inserted by the user in the web application using a screen where he has a Rich Text Editor.
Below are the steps specifically.
1. The user goes to an "add PDF content" page.
2. The add page has a Rich Text Editor (RTE) where the user can enter the PDF content.
3. Sometimes the user copies content from an existing Word document and pastes it into the RTE.
4. Once the user submits the content, the PDF is created.
The RTE is used because we have some other pages where we need to show the content with BOLD, italics etc.
But, we don't want this RTE stuff in the PDF being generated.
We have used some java utility to remove the RTE stuff from the content before generating the PDF.
This works normally but when the content is copied from the word document, html and css styles applied by the document are not being removed by the java utility we are using.
How can I generate the PDF without any HTML or CSS in it?
Here is the code
Paragraph paragraph = new Paragraph(Util.removeHTML(content), font);
And the removeHTML method is as below
public static String removeHTML(String htmlString) {
if (htmlString == null)
return "";
// String is immutable, so the result of replace() has to be assigned back.
htmlString = htmlString.replace("\"", "'");
htmlString = htmlString.replaceAll("\\<.*?>", "");
htmlString = htmlString.replaceAll("&nbsp;", "");
return htmlString;
}
And below is the additional content being shown in the PDF when I copy/paste from the word document.
<w:LsdException Locked="false" Priority="10" SemiHidden="false
UnhideWhenUsed="false" QFormat="true" Name="Title" />
<w:LsdException Locked="false" Priority="11" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtitle" />
<w:LsdException Locked="false" Priority="22" SemiHidden="false"
Please help !
Thanks.

Our application is similar, we have a Rich Text Editor (TinyMCE), and our output is PDF generated via iText PDF. We want to have the HTML as clean as possible, and ideally only using the HTML tags supported by iText's HTMLWorker. TinyMCE can do some of this, but there are still situations where an end user can submit HTML which is really screwed up, and which can possibly break iText's ability to generate a PDF.
We're using a combination of jSoup and jTidy + CSSParser to filter out unwanted CSS styles entered in HTML "style" attributes. HTML entered into TinyMCE is scrubbed using this service, which cleans up any paste-from-Word markup (if the user didn't use the Paste From Word button in TinyMCE) and gives us HTML that translates well for iTextPDF's HTMLWorker.
I also found issues with table widths in iText's HTMLWorker parser (5.0.6): if the table width is in the style attribute, HTMLWorker ignores it and sets the table width to 0, so there is some logic below to fix that. We use the following libs:
com.itextpdf:itextpdf:5.0.6 // used to generate PDFs
org.jsoup:jsoup:1.5.2 // used for cleaning HTML, primary cleaner
net.sf.jtidy:jtidy:r938 // used for cleaning HTML, secondary cleaner
net.sourceforge.cssparser:cssparser:0.9.5 // used to parse out unwanted HTML "style" attribute values
Below is some code from a Groovy service we built to scrub the HTML and only keep the tags and style attributes supported by iText + fixes the table issue. There are a few assumptions made in the code which is specific to our application. This is working really well for us at the moment.
import com.steadystate.css.parser.CSSOMParser
import org.htmlcleaner.CleanerProperties
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyHtmlSerializer
import org.htmlcleaner.SimpleHtmlSerializer
import org.htmlcleaner.TagNode
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.safety.Cleaner
import org.jsoup.safety.Whitelist
import org.jsoup.select.Elements
import org.w3c.css.sac.InputSource
import org.w3c.dom.css.CSSRule
import org.w3c.dom.css.CSSRuleList
import org.w3c.dom.css.CSSStyleDeclaration
import org.w3c.dom.css.CSSStyleSheet
import org.w3c.tidy.Tidy
class HtmlCleanerService {
static transactional = true
def cleanHTML(def html) {
// clean with JSoup which should filter out most unwanted things and
// ensure good html syntax
html = soupClean(html);
// run through JTidy to remove repeated nested tags, clean anything JSoup left out
html = tidyClean(html);
return html;
}
def tidyClean(def html) {
Tidy tidy = new Tidy()
tidy.setAsciiChars(true)
tidy.setDropEmptyParas(true)
tidy.setDropProprietaryAttributes(true)
tidy.setPrintBodyOnly(true)
tidy.setEncloseText(true)
tidy.setJoinStyles(true)
tidy.setLogicalEmphasis(true)
tidy.setQuoteMarks(true)
tidy.setHideComments(true)
tidy.setWraplen(120)
// (makeClean || dropFontTags) = replaces presentational markup by style rules
tidy.setMakeClean(true) // remove presentational clutter.
tidy.setDropFontTags(true)
// word2000 = drop style & class attributes and empty p, span elements
// draconian cleaning for Word2000
tidy.setWord2000(true)
tidy.setMakeBare(true) // remove Microsoft cruft.
tidy.setRepeatedAttributes(org.w3c.tidy.Configuration.KEEP_FIRST) // keep first or last duplicate attribute
// TODO ? tidy.setForceOutput(true)
def reader = new StringReader(html);
def writer = new StringWriter();
// hide output from stderr
tidy.setShowWarnings(false)
tidy.setErrout(new PrintWriter(new StringWriter()))
tidy.parse(reader, writer); // run tidy, providing an input and output stream
return writer.toString()
}
def soupClean(def html) {
// clean the html
Document dirty = Jsoup.parseBodyFragment(html);
Cleaner cleaner = new Cleaner(createWhitelist());
Document clean = cleaner.clean(dirty);
// now hunt down all style attributes and ensure we only have those that render with iTextPDF
Elements styledNodes = clean.select("[style]"); // every element that carries a style attribute
styledNodes.each { element ->
def style = element.attr("style");
def tag = element.tagName().toLowerCase()
def newstyle = ""
CSSOMParser parser = new CSSOMParser();
InputSource is = new InputSource(new StringReader(style))
CSSStyleDeclaration styledeclaration = parser.parseStyleDeclaration(is)
boolean hasProps = false
for (int i=0; i < styledeclaration.getLength(); i++) {
def propname = styledeclaration.item(i)
def propval = styledeclaration.getPropertyValue(propname)
propval = propval ? propval.trim() : ""
if (["padding-left", "text-decoration", "text-align", "font-weight", "font-style"].contains(propname)) {
newstyle = newstyle + propname + ": " + propval + ";"
hasProps = true
}
// standardize table widths, itextPDF won't render tables if there is only width in the
// style attribute. Here we ensure the width is in its own attribute, and change the value so
// it is in percentage and no larger than 100% to avoid end users from creating really goofy
// tables that they can't edit properly because they have made the width too large.
//
// width of the display area in the editor is about 740px, so let's ensure everything
// is relative to that
//
// TODO could get into trouble with nested tables and widths within as we assume
// one table (e.g. could have nested tables both with widths of 500)
if (tag.equals("table") && propname.equals("width")) {
if (propval.endsWith("%")) {
// ensure it is <= 100%
propval = propval.replaceAll(~"[^0-9]", "")
propval = Math.min(100, propval.toInteger())
}
else {
// else we have measurement in px or assumed px, clean up and
// get integer value, then calculate a percentage
propval = propval.replaceAll(~"[^0-9]", "")
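// multiply before dividing: casting the ratio to int first would truncate it to 0 for widths under 740
propval = Math.min(100, (int) (propval.toInteger() * 100 / 740))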
}
element.attr("width", propval + "%")
}
}
if (hasProps) {
element.attr("style", newstyle)
} else {
element.removeAttr("style")
}
}
return clean.body().html();
}
/**
* Returns a JSoup whitelist suitable for sane HTML output and iTextPDF
*/
def createWhitelist() {
Whitelist wl = new Whitelist();
// iText supported tags
wl.addTags(
"br", "div", "p", "pre", "span", "blockquote", "q", "hr",
"h1", "h2", "h3", "h4", "h5", "h6",
"u", "strike", "s", "strong", "sub", "sup", "em", "i", "b",
"ul", "ol", "li",
"table", "tbody", "td", "tfoot", "th", "thead", "tr"
);
// iText attributes recognized which we care about
// padding-left (div/p/span indentation)
// text-align (for table right/left align)
// text-decoration (for span/div/p underline, strikethrough)
// font-weight (for span/div/p bolder etc)
// font-style (for span/div/p italic etc)
// width (for tables)
// colspan/rowspan (for tables)
["span", "div", "p", "table", "ul", "ol", "pre", "td", "th"].each { tag ->
["style", "padding-left", "text-decoration", "text-align", "font-weight", "font-style"].each { attr ->
wl.addAttributes(tag, attr)
}
}
["td", "th"].each { tag ->
["colspan", "rowspan", "width"].each { attr ->
wl.addAttributes(tag, attr)
}
}
wl.addAttributes("table", "width", "style", "cellpadding")
// img support
// wl.addAttributes("img", "align", "alt", "height", "src", "title", "width")
return wl
}
}
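The table-width rule described in the comments above (percentages capped at 100, pixel values scaled against an editor width of about 740px) can be checked in isolation. The following is a plain-Java sketch of that rule only, not part of the service: the class and method names are illustrative, the 740 constant is taken from the comment above, and falling back to 100% for a value without digits is an assumption of this sketch.

```java
public class TableWidths {
    // Width of the editor's display area in pixels, per the comment above.
    private static final int EDITOR_WIDTH_PX = 740;

    // Converts a CSS width such as "370px", "80%" or "1480" to a
    // percentage string that never exceeds 100%.
    static String toPercent(String cssWidth) {
        String digits = cssWidth.replaceAll("[^0-9]", "");
        if (digits.isEmpty()) {
            return "100%"; // assumption: treat an unusable width as full width
        }
        int value = Integer.parseInt(digits);
        int percent = cssWidth.trim().endsWith("%")
                ? value
                : value * 100 / EDITOR_WIDTH_PX; // multiply first, then divide
        return Math.min(100, percent) + "%";
    }

    public static void main(String[] args) {
        System.out.println(toPercent("370px"));
        System.out.println(toPercent("150%"));
    }
}
```

Multiplying before dividing keeps the integer arithmetic from truncating the ratio to zero for widths smaller than the editor width.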

If you just want the text content of the HTML document, then use an XML API such as SAX or DOM to emit only the text nodes from the document. This is trivial with the DocumentTraversal API if you know your way around DOM. If I had my IDE running, I'd paste a sample ...
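A sample along those lines might look like the sketch below. It uses only the JDK's built-in DOM parser and the DocumentTraversal API, and it assumes the input is well-formed XML (XHTML) without HTML-only named entities such as &nbsp;; real-world HTML would have to be tidied first. The class name TextOnly and the method extractText are illustrative names, not from any library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

public class TextOnly {
    // Returns the concatenated text nodes of a well-formed XHTML fragment.
    static String extractText(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // SHOW_TEXT restricts the iterator to text nodes, so elements,
            // attributes and comments are skipped.
            NodeIterator it = ((DocumentTraversal) doc).createNodeIterator(
                    doc.getDocumentElement(), NodeFilter.SHOW_TEXT, null, true);
            StringBuilder out = new StringBuilder();
            for (Node n = it.nextNode(); n != null; n = it.nextNode()) {
                out.append(n.getNodeValue());
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>Hello <b>bold</b> world<!-- note --></p>"));
    }
}
```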
In addition, the removeHTML method shown is inefficient. Use Pattern.compile and cache the result in a static variable, then use the Matcher API to do the replacements into a StringBuffer (or perhaps StringBuilder, if that's what it uses). That way you're not creating a bunch of intermediate strings and throwing them away.
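A minimal sketch of the cached-pattern idea, using only java.util.regex (the class name HtmlStripper is illustrative). It also touches on why the Word markup in the question survives: in the question's pattern, '.' does not match line terminators by default, so a tag whose attributes wrap onto a second line is never matched. Here the tag pattern uses a negated character class, which does cross line breaks, and the comment pattern is compiled with DOTALL. This version still creates one intermediate string per pass, and it does not drop the text inside elements such as style or script.

```java
import java.util.regex.Pattern;

public class HtmlStripper {
    // Compiled once and reused for every call.
    // DOTALL lets '.' match line breaks, so multi-line comments (including
    // Word's conditional-comment blocks) are removed as a whole.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // [^>] matches any character except '>', line breaks included.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String removeHtml(String html) {
        if (html == null) {
            return "";
        }
        String withoutComments = COMMENT.matcher(html).replaceAll("");
        return TAG.matcher(withoutComments).replaceAll("");
    }

    public static void main(String[] args) {
        String pasted = "Title<w:LsdException Locked=\"false\"\n"
                + " UnhideWhenUsed=\"false\" Name=\"Title\" /> text";
        System.out.println(removeHtml(pasted));
    }
}
```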
