Remove HTMLs and CSS styles from PDF created using itext - java

We are dynamically creating PDF using itext in our application. The content of the PDF is inserted by the user in the web application using a screen where he has a Rich Text Editor.
Below are the steps specifically.
User goes to a add PDF content page.
The add page has a Rich text Editor where he can enter the PDF content.
Sometimes user can copy/paste the content from the existing word document and enter in the RTE.
Once he submits the content, PDF is created.
The RTE is used because we have some other pages where we need to show the content with BOLD, italics etc.
But, we don't want this RTE stuff in the PDF being generated.
We have used some java utility to remove the RTE stuff from the content before generating the PDF.
This works normally but when the content is copied from the word document, html and css styles applied by the document are not being removed by the java utility we are using.
How can I generate the PDF without any HTML or CSS in it?
Here is the code
Paragraph paragraph = new Paragraph(Util.removeHTML(content), font);
And the removeHTML method is as below
public static String removeHTML(String htmlString) {
if (htmlString == null)
return "";
htmlString.replace("\"", "'");
htmlString = htmlString.replaceAll("\\<.*?>", "");
htmlString = htmlString.replaceAll(" ", "");
return htmlString;
}
And below is the additional content being shown in the PDF when I copy/paste from the word document.
<w:LsdException Locked="false" Priority="10" SemiHidden="false
UnhideWhenUsed="false" QFormat="true" Name="Title" />
<w:LsdException Locked="false" Priority="11" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtitle" />
<w:LsdException Locked="false" Priority="22" SemiHidden="false"
Please help !
Thanks.

Our application is similar, we have a Rich Text Editor (TinyMCE), and our output is PDF generated via iText PDF. We want to have the HTML as clean as possible, and ideally only using the HTML tags supported by iText's HTMLWorker. TinyMCE can do some of this, but there are still situations where an end user can submit HTML which is really screwed up, and which can possibly break iText's ability to generate a PDF.
We're using a combination of jSoup and jTidy + CSSParser to filter out unwanted CSS styles entered in HTML "style" attributes. HTML entered into TinyMCE is scrubbed using this service which cleans up any paste from word markup (if the user didn't use the Paste From Word button in TinyMCE) and gives us HTML that translates well for iTextPDFs HTMLWorker.
I also found issues with table widths in iText's HTMLWorker parser (5.0.6) if the table width is in the style attribute, HTMLWorker ignores it and sets the table width to 0, so this is some logic to fix that below. We use the following libs: a
com.itextpdf:itextpdf:5.0.6 // used to generate PDFs
org.jsoup:jsoup:1.5.2 // used for cleaning HTML, primary cleaner
net.sf.jtidy:jtidy:r938 // used for cleaning HTML, secondary cleaner
net.sourceforge.cssparser:cssparser:0.9.5 // used to parse out unwanted HTML "style" attribute values
Below is some code from a Groovy service we built to scrub the HTML and only keep the tags and style attributes supported by iText + fixes the table issue. There are a few assumptions made in the code which is specific to our application. This is working really well for us at the moment.
import com.steadystate.css.parser.CSSOMParser
import org.htmlcleaner.CleanerProperties
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyHtmlSerializer
import org.htmlcleaner.SimpleHtmlSerializer
import org.htmlcleaner.TagNode
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.safety.Cleaner
import org.jsoup.safety.Whitelist
import org.jsoup.select.Elements
import org.w3c.css.sac.InputSource
import org.w3c.dom.css.CSSRule
import org.w3c.dom.css.CSSRuleList
import org.w3c.dom.css.CSSStyleDeclaration
import org.w3c.dom.css.CSSStyleSheet
import org.w3c.tidy.Tidy
class HtmlCleanerService {
static transactional = true
def cleanHTML(def html) {
// clean with JSoup which should filter out most unwanted things and
// ensure good html syntax
html = soupClean(html);
// run through JTidy to remove repeated nested tags, clean anything JSoup left out
html = tidyClean(html);
return html;
}
def tidyClean(def html) {
Tidy tidy = new Tidy()
tidy.setAsciiChars(true)
tidy.setDropEmptyParas(true)
tidy.setDropProprietaryAttributes(true)
tidy.setPrintBodyOnly(true)
tidy.setEncloseText(true)
tidy.setJoinStyles(true)
tidy.setLogicalEmphasis(true)
tidy.setQuoteMarks(true)
tidy.setHideComments(true)
tidy.setWraplen(120)
// (makeClean || dropFontTags) = replaces presentational markup by style rules
tidy.setMakeClean(true) // remove presentational clutter.
tidy.setDropFontTags(true)
// word2000 = drop style & class attributes and empty p, span elements
// draconian cleaning for Word2000
tidy.setWord2000(true)
tidy.setMakeBare(true) // remove Microsoft cruft.
tidy.setRepeatedAttributes(org.w3c.tidy.Configuration.KEEP_FIRST) // keep first or last duplicate attribute
// TODO ? tidy.setForceOutput(true)
def reader = new StringReader(html);
def writer = new StringWriter();
// hide output from stderr
tidy.setShowWarnings(false)
tidy.setErrout(new PrintWriter(new StringWriter()))
tidy.parse(reader, writer); // run tidy, providing an input and output stream
return writer.toString()
}
def soupClean(def html) {
// clean the html
Document dirty = Jsoup.parseBodyFragment(html);
Cleaner cleaner = new Cleaner(createWhitelist());
Document clean = cleaner.clean(dirty);
// now hunt down all style attributes and ensure we only have those that render with iTextPDF
Elements styledNodes = clean.select("[style]"); // a with href
styledNodes.each { element ->
def style = element.attr("style");
def tag = element.tagName().toLowerCase()
def newstyle = ""
CSSOMParser parser = new CSSOMParser();
InputSource is = new InputSource(new StringReader(style))
CSSStyleDeclaration styledeclaration = parser.parseStyleDeclaration(is)
boolean hasProps = false
for (int i=0; i < styledeclaration.getLength(); i++) {
def propname = styledeclaration.item(i)
def propval = styledeclaration.getPropertyValue(propname)
propval = propval ? propval.trim() : ""
if (["padding-left", "text-decoration", "text-align", "font-weight", "font-style"].contains(propname)) {
newstyle = newstyle + propname + ": " + propval + ";"
hasProps = true
}
// standardize table widths, itextPDF won't render tables if there is only width in the
// style attribute. Here we ensure the width is in its own attribute, and change the value so
// it is in percentage and no larger than 100% to avoid end users from creating really goofy
// tables that they can't edit properly becuase they have made the width too large.
//
// width of the display area in the editor is about 740px, so let's ensure everything
// is relative to that
//
// TODO could get into trouble with nested tables and widths within as we assume
// one table (e.g. could have nested tables both with widths of 500)
if (tag.equals("table") && propname.equals("width")) {
if (propval.endsWith("%")) {
// ensure it is <= 100%
propval = propval.replaceAll(~"[^0-9]", "")
propval = Math.min(100, propval.toInteger())
}
else {
// else we have measurement in px or assumed px, clean up and
// get integer value, then calculate a percentage
propval = propval.replaceAll(~"[^0-9]", "")
propval = Math.min(100, (int) (propval.toInteger() / 740)*100)
}
element.attr("width", propval + "%")
}
}
if (hasProps) {
element.attr("style", newstyle)
} else {
element.removeAttr("style")
}
}
return clean.body().html();
}
/**
* Returns a JSoup whitelist suitable for sane HTML output and iTextPDF
*/
def createWhitelist() {
Whitelist wl = new Whitelist();
// iText supported tags
wl.addTags(
"br", "div", "p", "pre", "span", "blockquote", "q", "hr",
"h1", "h2", "h3", "h4", "h5", "h6",
"u", "strike", "s", "strong", "sub", "sup", "em", "i", "b",
"ul", "ol", "li", "ol",
"table", "tbody", "td", "tfoot", "th", "thead", "tr",
);
// iText attributes recognized which we care about
// padding-left (div/p/span indentation)
// text-align (for table right/left align)
// text-decoration (for span/div/p underline, strikethrough)
// font-weight (for span/div/p bolder etc)
// font-style (for span/div/p italic etc)
// width (for tables)
// colspan/rowspan (for tables)
["span", "div", "p", "table", "ul", "ol", "pre", "td", "th"].each { tag ->
["style", "padding-left", "text-decoration", "text-align", "font-weight", "font-style"].each { attr ->
wl.addAttributes(tag, attr)
}
}
["td", "th"].each { tag ->
["colspan", "rowspan", "width"].each { attr ->
wl.addAttributes(tag, attr)
}
}
wl.addAttributes("table", "width", "style", "cellpadding")
// img support
// wl.addAttributes("img", "align", "alt", "height", "src", "title", "width")
return wl
}
}

If you just want the text content of the HTML document, then use an XML API such as SAX or DOM to emit only the text nodes from the document. This is trivial with the DocumentTraversal API if you know your way around DOM. If I had my IDE running, I'd paste a sample ...
In addition, the removeHtml method shown is inefficient. Use Pattern.compile and cache that in a static variable and use the Matcher API to do the replacements into a StringBuffer (or perhaps StringBuilder, if that's what it uses). That way you're not creating a bunch of intermediate strings and throwing them away.

Related

How to replace date field with some text in the ViewMaster (Vertical) for word/pdf using Aspose?

Aspose code is inserting Viewmaster(vertical) with default date to
select as a text inside. I want to replace with some text as shown in
the image.
Followed the code mentioned in ViewMaster(vertical) using Aspose
to generate the ViewMaster(Vertical) in the word/pdf. can someone help
in getting the right code to replace the date with text
Date is set in structured document tag. You can use code like this to get and modify value of this SDT:
// Get structured document tags from footer.
NodeCollection tags = doc.FirstSection.HeadersFooters[HeaderFooterType.FooterPrimary].GetChildNodes(NodeType.StructuredDocumentTag, true);
foreach (StructuredDocumentTag tag in tags)
{
if (tag.Title.Equals("Date") && tag.SdtType == SdtType.Date)
{
tag.IsShowingPlaceholderText = false;
tag.FullDate = DateTime.Now;
// By default SDT is minded to XML. We can simply remove mapping to use value set in FullDate property.
tag.XmlMapping.Delete();
}
}
If you do not need date, but need to insert some custom text, you can remove the tag and insert a simple paragraph with text instead. For example:
// Get structured document tags from footer.
NodeCollection tags = doc.FirstSection.HeadersFooters[HeaderFooterType.FooterPrimary].GetChildNodes(NodeType.StructuredDocumentTag, true);
foreach (StructuredDocumentTag tag in tags)
{
if (tag.Title.Equals("Date") && tag.SdtType == SdtType.Date)
{
// Put an empty paragraph ater the structured document tag
Paragraph p = new Paragraph(doc);
tag.ParentNode.InsertAfter(p, tag);
// Remove tag
tag.Remove();
// move DocumentBuilder to the newly inserted paragraph and insert some text.
builder.MoveTo(p);
builder.Write("This is my custom vertical text");
}
}

How can I create an accessible PDF with Java PDFBox 2.0.8 library that is also verifiable with PAC 2 tool?

Background
I have small project on GitHub in which I am trying to create a section 508 compliant (section508.gov) PDF which has form elements within a complex table structure. The tool recommended to verify these PDFs is at http://www.access-for-all.ch/en/pdf-lab/pdf-accessibility-checker-pac.html and my program’s output PDF does pass most of these checks. I will also know what every field is meant for at runtime, so adding tags to structure elements should not be an issue.
The Problem
The PAC 2 tool seems to have an issue with two particular items in the output PDF. In particular, my radio buttons’ widget annotations are not nested inside of a form structure element and my marked content is not tagged (Text and Table Cells).
PAC 2 verifies the P structure element that is within top-left cell but not the marked content…
However, PAC 2 does identify the marked content as an error (i.e. Text/Path object not tagged).
Also, the radio button widgets are detected, but there seems to be no APIs to add them to a form structure element.
What I Have Tried
I have looked at several questions on this website and others on the subject including this one Tagged PDF with PDFBox, but it seems that there are almost no examples for PDF/UA and very little useful documentation (That I have found). The most useful tips that I have found have been at sites that explain specs for tagged PDFs like https://taggedpdf.com/508-pdf-help-center/object-not-tagged/.
The Question
Is it possible to create a PAC 2 verifiable PDF with Apache PDFBox that includes marked content and radio button widget annotations? If it is possible, is it doable using higher level (non-deprecated) PDFBox APIs?
Side Note: This is actually my first StackExchange question (Although I have used the site extensively) and I hope everything is in order! Feel free to add any necessary edits and ask any questions that I may need clarify. Also, I have an example program on GitHub which generates my PDF document at https://github.com/chris271/UAPDFBox.
Edit 1: Direct link to Output PDF Document
*EDIT 2: After using some of the lower-level PDFBox APIs and viewing raw data streams for fully compliant PDFs with PDFDebugger, I was able to generate a PDF with nearly identical content structure compared to the compliant PDF's content structure... However, the same errors appear that the text objects are not tagged and I really can't decide where to go from here... Any guidance would be greatly appreciated!
Edit 3: Side-by-side raw PDF content comparison.
Edit 4: Internal structure of the generated PDF
and the compliant PDF
Edit 5: I have managed to fix the PAC 2 errors for tagged path/text objects thanks in part to suggestions from Tilman Hausherr! I will add an answer if I manage to fix the issues regarding 'annotation widgets not being nested inside form structure elements'.
After going through a large amount of the PDF Spec and many PDFBox examples I was able to fix all issues reported by PAC 2. There were several steps involved to create the verified PDF (with a complex table structure) and the full source code is available here on github. I will attempt to do an overview of the major portions of the code below. (Some method calls will not be explained here!)
Step 1 (Setup metadata)
Various setup info like document title and language
//Setup new document
pdf = new PDDocument();
acroForm = new PDAcroForm(pdf);
pdf.getDocumentInformation().setTitle(title);
//Adjust other document metadata
PDDocumentCatalog documentCatalog = pdf.getDocumentCatalog();
documentCatalog.setLanguage("English");
documentCatalog.setViewerPreferences(new PDViewerPreferences(new COSDictionary()));
documentCatalog.getViewerPreferences().setDisplayDocTitle(true);
documentCatalog.setAcroForm(acroForm);
documentCatalog.setStructureTreeRoot(structureTreeRoot);
PDMarkInfo markInfo = new PDMarkInfo();
markInfo.setMarked(true);
documentCatalog.setMarkInfo(markInfo);
Embed all fonts directly into resources.
//Set AcroForm Appearance Characteristics
PDResources resources = new PDResources();
defaultFont = PDType0Font.load(pdf,
new PDTrueTypeFont(PDType1Font.HELVETICA.getCOSObject()).getTrueTypeFont(), true);
resources.put(COSName.getPDFName("Helv"), defaultFont);
acroForm.setNeedAppearances(true);
acroForm.setXFA(null);
acroForm.setDefaultResources(resources);
acroForm.setDefaultAppearance(DEFAULT_APPEARANCE);
Add XMP Metadata for PDF/UA spec.
//Add UA XMP metadata based on specs at https://taggedpdf.com/508-pdf-help-center/pdfua-identifier-missing/
XMPMetadata xmp = XMPMetadata.createXMPMetadata();
xmp.createAndAddDublinCoreSchema();
xmp.getDublinCoreSchema().setTitle(title);
xmp.getDublinCoreSchema().setDescription(title);
xmp.createAndAddPDFAExtensionSchemaWithDefaultNS();
xmp.getPDFExtensionSchema().addNamespace("http://www.aiim.org/pdfa/ns/schema#", "pdfaSchema");
xmp.getPDFExtensionSchema().addNamespace("http://www.aiim.org/pdfa/ns/property#", "pdfaProperty");
xmp.getPDFExtensionSchema().addNamespace("http://www.aiim.org/pdfua/ns/id/", "pdfuaid");
XMPSchema uaSchema = new XMPSchema(XMPMetadata.createXMPMetadata(),
"pdfaSchema", "pdfaSchema", "pdfaSchema");
uaSchema.setTextPropertyValue("schema", "PDF/UA Universal Accessibility Schema");
uaSchema.setTextPropertyValue("namespaceURI", "http://www.aiim.org/pdfua/ns/id/");
uaSchema.setTextPropertyValue("prefix", "pdfuaid");
XMPSchema uaProp = new XMPSchema(XMPMetadata.createXMPMetadata(),
"pdfaProperty", "pdfaProperty", "pdfaProperty");
uaProp.setTextPropertyValue("name", "part");
uaProp.setTextPropertyValue("valueType", "Integer");
uaProp.setTextPropertyValue("category", "internal");
uaProp.setTextPropertyValue("description", "Indicates, which part of ISO 14289 standard is followed");
uaSchema.addUnqualifiedSequenceValue("property", uaProp);
xmp.getPDFExtensionSchema().addBagValue("schemas", uaSchema);
xmp.getPDFExtensionSchema().setPrefix("pdfuaid");
xmp.getPDFExtensionSchema().setTextPropertyValue("part", "1");
XmpSerializer serializer = new XmpSerializer();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
serializer.serialize(xmp, baos, true);
PDMetadata metadata = new PDMetadata(pdf);
metadata.importXMPMetadata(baos.toByteArray());
pdf.getDocumentCatalog().setMetadata(metadata);
Step 2 (Setup document tag structure)
You will need to add the root structure element and all necessary structure elements as children to the root element.
//Adds a DOCUMENT structure element as the structure tree root.
void addRoot() {
PDStructureElement root = new PDStructureElement(StandardStructureTypes.DOCUMENT, null);
root.setAlternateDescription("The document's root structure element.");
root.setTitle("PDF Document");
pdf.getDocumentCatalog().getStructureTreeRoot().appendKid(root);
currentElem = root;
rootElem = root;
}
Each marked content element (text and background graphics) will need to have an MCID and an associated tag for reference in the parent tree which will be explained in step 3.
//Assign an id for the next marked content element.
private void setNextMarkedContentDictionary(String tag) {
currentMarkedContentDictionary = new COSDictionary();
currentMarkedContentDictionary.setName("Tag", tag);
currentMarkedContentDictionary.setInt(COSName.MCID, currentMCID);
currentMCID++;
}
Artifacts (background graphics) will not be detected by the screen reader. Text needs to be detectable so a P structure element is used here when adding text.
//Set up the next marked content element with an MCID and create the containing TD structure element.
PDPageContentStream contents = new PDPageContentStream(
pdf, pages.get(pageIndex), PDPageContentStream.AppendMode.APPEND, false);
currentElem = addContentToParent(null, StandardStructureTypes.TD, pages.get(pageIndex), currentRow);
//Make the actual cell rectangle and set as artifact to avoid detection.
setNextMarkedContentDictionary(COSName.ARTIFACT.getName());
contents.beginMarkedContent(COSName.ARTIFACT, PDPropertyList.create(currentMarkedContentDictionary));
//Draws the cell itself with the given colors and location.
drawDataCell(table.getCell(i, j).getCellColor(), table.getCell(i, j).getBorderColor(),
x + table.getRows().get(i).getCellPosition(j),
y + table.getRowPosition(i),
table.getCell(i, j).getWidth(), table.getRows().get(i).getHeight(), contents);
contents.endMarkedContent();
currentElem = addContentToParent(COSName.ARTIFACT, StandardStructureTypes.P, pages.get(pageIndex), currentElem);
contents.close();
//Draw the cell's text as a P structure element
contents = new PDPageContentStream(
pdf, pages.get(pageIndex), PDPageContentStream.AppendMode.APPEND, false);
setNextMarkedContentDictionary(COSName.P.getName());
contents.beginMarkedContent(COSName.P, PDPropertyList.create(currentMarkedContentDictionary));
//... Code to draw actual text...//
//End the marked content and append it's P structure element to the containing TD structure element.
contents.endMarkedContent();
addContentToParent(COSName.P, null, pages.get(pageIndex), currentElem);
contents.close();
Annotation Widgets (form objects in this case) will need to be nested within Form structure elements.
//Add a radio button widget.
if (!table.getCell(i, j).getRbVal().isEmpty()) {
PDStructureElement fieldElem = new PDStructureElement(StandardStructureTypes.FORM, currentElem);
radioWidgets.add(addRadioButton(
x + table.getRows().get(i).getCellPosition(j) -
radioWidgets.size() * 10 + table.getCell(i, j).getWidth() / 4,
y + table.getRowPosition(i),
table.getCell(i, j).getWidth() * 1.5f, 20,
radioValues, pageIndex, radioWidgets.size()));
fieldElem.setPage(pages.get(pageIndex));
COSArray kArray = new COSArray();
kArray.add(COSInteger.get(currentMCID));
fieldElem.getCOSObject().setItem(COSName.K, kArray);
addWidgetContent(annotationRefs.get(annotationRefs.size() - 1), fieldElem, StandardStructureTypes.FORM, pageIndex);
}
//Add a text field in the current cell.
if (!table.getCell(i, j).getTextVal().isEmpty()) {
PDStructureElement fieldElem = new PDStructureElement(StandardStructureTypes.FORM, currentElem);
addTextField(x + table.getRows().get(i).getCellPosition(j),
y + table.getRowPosition(i),
table.getCell(i, j).getWidth(), table.getRows().get(i).getHeight(),
table.getCell(i, j).getTextVal(), pageIndex);
fieldElem.setPage(pages.get(pageIndex));
COSArray kArray = new COSArray();
kArray.add(COSInteger.get(currentMCID));
fieldElem.getCOSObject().setItem(COSName.K, kArray);
addWidgetContent(annotationRefs.get(annotationRefs.size() - 1), fieldElem, StandardStructureTypes.FORM, pageIndex);
}
Step 3
After all content elements have been written to the content stream and tag structure has been setup, it is necessary to go back and add the parent tree to the structure tree root. Note: Some method calls (addWidgetContent() and addContentToParent()) in the above code setup the necessary COSDictionary objects.
//Adds the parent tree to root struct element to identify tagged content
void addParentTree() {
COSDictionary dict = new COSDictionary();
nums.add(numDictionaries);
for (int i = 1; i < currentStructParent; i++) {
nums.add(COSInteger.get(i));
nums.add(annotDicts.get(i - 1));
}
dict.setItem(COSName.NUMS, nums);
PDNumberTreeNode numberTreeNode = new PDNumberTreeNode(dict, dict.getClass());
pdf.getDocumentCatalog().getStructureTreeRoot().setParentTreeNextKey(currentStructParent);
pdf.getDocumentCatalog().getStructureTreeRoot().setParentTree(numberTreeNode);
}
If all widget annotations and marked content were added correctly to the structure tree and parent tree then you should get something like this from PAC 2 and PDFDebugger.
Thank you to Tilman Hausherr for pointing me in the right direction to solve this! I will most likely make some edits to this answer for additional clarity as recommended by others.
Edit 1:
If you want to have a table structure like the one I have generated you will also need to add correct table markup to fully comply with the 508 standard... The 'Scope', 'ColSpan', 'RowSpan', or 'Headers' attributes will need to be correctly added to each table cell structure element similar to this or this. The main purpose for this markup is to allow a screen reading software like JAWS to read the table content in an understandable way. These attributes can be added in a similar way as below...
private void addTableCellMarkup(Cell cell, int pageIndex, PDStructureElement currentRow) {
COSDictionary cellAttr = new COSDictionary();
cellAttr.setName(COSName.O, "Table");
if (cell.getCellMarkup().isHeader()) {
currentElem = addContentToParent(null, StandardStructureTypes.TH, pages.get(pageIndex), currentRow);
currentElem.getCOSObject().setString(COSName.ID, cell.getCellMarkup().getId());
if (cell.getCellMarkup().getScope().length() > 0) {
cellAttr.setName(COSName.getPDFName("Scope"), cell.getCellMarkup().getScope());
}
if (cell.getCellMarkup().getColspan() > 1) {
cellAttr.setInt(COSName.getPDFName("ColSpan"), cell.getCellMarkup().getColspan());
}
if (cell.getCellMarkup().getRowSpan() > 1) {
cellAttr.setInt(COSName.getPDFName("RowSpan"), cell.getCellMarkup().getRowSpan());
}
} else {
currentElem = addContentToParent(null, StandardStructureTypes.TD, pages.get(pageIndex), currentRow);
}
if (cell.getCellMarkup().getHeaders().length > 0) {
COSArray headerA = new COSArray();
for (String s : cell.getCellMarkup().getHeaders()) {
headerA.add(new COSString(s));
}
cellAttr.setItem(COSName.getPDFName("Headers"), headerA);
}
currentElem.getCOSObject().setItem(COSName.A, cellAttr);
}
Be sure to do something like currentElem.setAlternateDescription(currentCell.getText()); on each of the structure elements with text marked content for JAWS to read the text.
Note: Each of the fields (radio button and textbox) will need a unique name to avoid setting multiple field values. GitHub has been updated with a more complex example PDF with table markup and improved form fields!

How can I use Jsoup to turn unallowed html tag delimiter into entities where there are unallowed tags

Using Jsoup clean is it possible to convert this string:
Here is some <b>important</b> stuff that can't have
<script>javascript</script> or the following embed tag
<embed src="helloworld.swf" type="application/vnd.adobe.flash-movie"> movie
in the output
to this :
Here is some <b>important</b> stuff that can't have
<script>javascript</script> or the following embed tag
<embed src="helloworld.swf" type="application/vnd.adobe.flash-movie">
movie in the output
so it renders
Here is some important stuff that can't have
<script>javascript</script> or the following embed tag
<embed src="helloworld.swf" type="application/vnd.adobe.flash-movie">
movie in the output
Where the bold tag is allowed and left alone but the script and embed tags delimiters change from < > to < and > so they are treated as just text and not real html elements.
What settings are necessary to accomplish this? I have:
private static String limitHtml(String value) {
String result = value;
if (value != null && !value.isEmpty()) {
Document.OutputSettings settings = new Document.OutputSettings();
settings.prettyPrint(false);
// what other settings ???
Whitelist whitelist = Whitelist.none().addTags(ALLOWED_HTML_TAGS);
whitelist.addAttributes(":all", ALLOWED_HTML_ATTRIBUTES);
result = Jsoup.clean(value, "", whitelist, settings);
}
return result;
}
Is there a similar Java lib that can accomplish this if Jsoup doesn't.
Jsoup can definitively get your back here. The trick is to use a dummy document (transitional variable in the code) with a single pre element in it.
We will simply add each unallowed element found in this pre element.
Later, we replace the unallowed element in the initial value with its escaped html code.
CODE
// Comma separated list of allowed tags.
private static String ALLOWED_HTML_TAGS_CSS_QUERY = "b,span";
private static String limitHtml(String value) {
String result = value;
if (value != null && !value.isEmpty()) {
// Build a sided document. It will help us escape unallowed tags.
Document transitional = Jsoup.parse("<pre></pre>");
// Parse the actual value for finding unallowed tags
Document doc = Jsoup.parseBodyFragment(value, "");
Elements unallowedElements = doc.select("*:not("+ALLOWED_HTML_TAGS_CSS_QUERY+")");
for (Element e : unallowedElements) {
switch (e.tagName()) {
case "#root": case "html": case "head": case "body":
// Those tags are added automatically by Jsoup. Nothing to do...
break;
default:
// Load the unallowed element to escape its html code in the transitional document
Element pre = transitional.select("pre").first().text(e.outerHtml());
// Replace unallowed element with its escape html code
e.replaceWith(new TextNode(pre.text(), ""));
}
}
// Get the final sanitized value
Document.OutputSettings settings = new Document.OutputSettings();
settings.prettyPrint(false);
Whitelist whitelist = Whitelist.none().addTags(ALLOWED_HTML_TAGS);
whitelist.addAttributes(":all", ALLOWED_HTML_ATTRIBUTES);
result = Jsoup.clean(doc.body().html(), "", whitelist, settings);
}
return result;
}
SAMPLE USAGE
String unsanitizedHtml = "Here is some <b>important</b> stuff that can't have " + //
"<script>javascript</script> or the following embed tag " + //
"<embed src=\"helloworld.swf\" type=\"application/vnd.adobe.flash-movie\"> movie" + //
"in the output";
System.out.println("BEFORE:\n" + unsanitizedHtml);
System.out.println();
System.out.println("AFTER:\n" + limitHtml(unsanitizedHtml));
OUTPUT
BEFORE:
Here is some <b>important</b> stuff that can't have <script>javascript</script> or the following embed tag <embed src="helloworld.swf" type="application/vnd.adobe.flash-movie"> moviein the output
AFTER:
Here is some <b>important</b> stuff that can't have <script>javascript</script> or the following embed tag <embed src="helloworld.swf" type="application/vnd.adobe.flash-movie"> moviein the output

Table of contents not added in pdf format using ASPOSE WORD , java

def gDataDir;
def index() {
gDataDir = "/home/sithapa/gitProject/aposeWord/documents/basics/";
topResultsTest();
}
def topResultsTest(){
Document main_src = new Document(gDataDir + "cover_page_toc.docx");
Document src3 = new Document(gDataDir + "our_testing_template.docx");
def String[] fields = ["Heading1","Subtitle1","Subtitle2"];
def Object[] values = ['This is a Heading','this is a subtitle1','\nthis is a subtitle2'];
src3.getMailMerge().execute(fields, values);
//Appending
main_src.appendDocument(src3, ImportFormatMode.KEEP_SOURCE_FORMATTING);
//Update the table of contents.
main_src.updateFields();
main_src.updatePageLayout();
main_src.save(gDataDir + "final_output.docx");
saveAsPDF(main_src)
}
def saveAsPDF(main_src){
//Document src = new Document(gDataDir + "final_output.docx");
//main_src.save(gDataDir + "simpleOpenSaveAsPDF_output.pdf", SaveFormat.PDF);
main_src.save(gDataDir + "Document.Doc2PdfSave Out.pdf");
}
Here the Table of contents is seen in .docx in a linux OS but not seen in Windows. TOC in pdf format is not seen in both OS.I have attached the required documents in this link:
enter link description here
I noted that your headings are in the document header. Please move these to the document body.
These headings are not saved in the PDF by default. You need to specify these in an instance of PdfSaveOptions.
// Set 2 levels of headings to appear in PDF
PdfSaveOptions so = new PdfSaveOptions();
so.getOutlineOptions().setHeadingsOutlineLevels(2);
// Specify the save options as parameter
document.save("output.pdf", so);
I work for Aspose as Developer Evangelist.

Docx4j - Images in the document

How can we remove an image from the docx4j.
Say I have 10 images, and i want to replace 8 images with my own byte array/binary data, and I want to delete remaining 2.
I am also having trouble in locating images.
Is it somehow possible to replace text placeholders in the document with images?
Refer to this post : http://vixmemon.blogspot.com/2013/04/docx4j-replace-text-placeholders-with.html
for(Object obj : elemetns){
if(obj instanceof Tbl){
Tbl table = (Tbl) obj;
List rows = getAllElementFromObject(table, Tr.class);
for(Object trObj : rows){
Tr tr = (Tr) trObj;
List cols = getAllElementFromObject(tr, Tc.class);
for(Object tcObj : cols){
Tc tc = (Tc) tcObj;
List texts = getAllElementFromObject(tc, Text.class);
for(Object textObj : texts){
Text text = (Text) textObj;
if(text.getValue().equalsIgnoreCase("${MY_PLACE_HOLDER}")){
File file = new File("C:\\image.jpeg");
P paragraphWithImage = addInlineImageToParagraph(createInlineImage(file));
tc.getContent().remove(0);
tc.getContent().add(paragraphWithImage);
}
}
System.out.println("here");
}
}
System.out.println("here");
}
}
wordMLPackage.save(new java.io.File("C:\\result.docx"));
See docx4j checking checkboxes for the 2 approaches to finding stuff (XPath, or non XPath traversal).
VariableReplace allows you to replace text placeholders, but not with images. I think there may be code floating around (in the docx4j forums?) which extends it to do that.
But I'd suggest you use content control databinding instead. See how to create a new word from template with docx4j
You can use base64 encoded images in your XML data, and docx4j and/or Word will do the rest.

Categories

Resources