We are using an old version of iText (2.x) with Java 6 at the moment.
What we are now trying to do is open an existing PDF and convert its colors to grayscale. I found the method PdfWriter.setDefaultColorspace(PdfName key, PdfObject cs),
but I'm not really sure how to use it.
Can anyone tell me how to use it the right way? Or does anybody know another way to convert a PDF to grayscale with this old iText version?
Many thanks in advance!
I implemented the code here using iText 5.5.14 but it should also work with iText 2.1.7 with minimal changes.
There are two ways to remove color from PDF pages:
either one actually iterates through all color-related instructions of the page content streams and replaces the colors set therein with an equivalent gray,
or one appends instructions to each page content stream that remove the color saturation of everything the existing instructions create.
The former option is beyond the scope of a Stack Overflow answer (there are many different kinds of color in PDFs, embedded bitmaps bring color of their own, and one also has to consider the effects of transparency and blend modes), but the latter option is fairly easy to implement by overlaying each page with a grayscale fill in blend mode Saturation:
void dropSaturation(PdfStamper pdfStamper) {
    // The Saturation blend mode keeps the luminosity of the backdrop (the existing page
    // content) but takes the saturation of the source, here a black fill, i.e. zero saturation.
    PdfGState gstate = new PdfGState();
    gstate.setBlendMode(PdfName.SATURATION);
    PdfReader pdfReader = pdfStamper.getReader();
    for (int i = 1; i <= pdfReader.getNumberOfPages(); i++) {
        // Over-content: a black rectangle in Saturation blend mode drains the color
        // from everything already drawn on the page.
        PdfContentByte canvas = pdfStamper.getOverContent(i);
        canvas.setGState(gstate);
        Rectangle mediaBox = pdfReader.getPageSize(i);
        canvas.setColorFill(BaseColor.BLACK);
        canvas.rectangle(mediaBox.getLeft(), mediaBox.getBottom(), mediaBox.getWidth(), mediaBox.getHeight());
        canvas.fill();
        // Under-content: a white backdrop so empty page areas blend against white.
        canvas = pdfStamper.getUnderContent(i);
        canvas.setColorFill(BaseColor.WHITE);
        canvas.rectangle(mediaBox.getLeft(), mediaBox.getBottom(), mediaBox.getWidth(), mediaBox.getHeight());
        canvas.fill();
    }
}
You can apply it like this:
PdfReader pdfReader = new PdfReader(SOURCE_PDF);
PdfStamper pdfStamper = new PdfStamper(pdfReader, RESULT_STREAM);
dropSaturation(pdfStamper);
pdfStamper.close();
Beware, this is a proof of concept. For a complete solution you would actually have to apply the same treatment to all annotations of the pages.
Related
The Context
Currently, I have a solution where I loop through a PDF and draw black rectangles throughout it.
So I already have a PDRectangle list representing the areas I need to fill/cover on the PDF, hiding all the text I want to hide.
The Problems
Problem number 1: The text underneath the black rectangle is easily copied, searchable, or extracted by other tools.
I solved this by flattening my PDF (converting it into an image so that it becomes a single-layer document and the black rectangles can no longer be circumvented). Same solution as described here:
Disable pdf-text searching with pdfBox
This is not actual redaction; it's more of a workaround.
Which leads me to
Problem number 2:
My final PDF becomes an image document, so I lose all the PDF properties, including searching, copying, etc.; it's also a much slower process. I want to keep all the PDF properties while making the redacted areas unreadable by any means.
What I want to accomplish
That being said, I'd like to know whether it is possible, and how, to do actual redaction with PDFBox: blacking out the rectangular areas (I already have all the positions I need) while keeping the PDF properties and not allowing the redacted content to be recovered.
Note: I'm aware of the problems PDFBox had with the old ReplaceText function, but here I already have the positions, so I can blank precisely the areas I need.
Also, I'm open to suggestions for other free libraries.
Technical Specification:
PDFBox 2.0.21
Java 11.0.6+10, AdoptOpenJDK
macOS Catalina 10.15.4, 16 GB RAM, x86_64
My Code
This is how I draw the black rectangle:
private void draw(PDPage page, PDRectangle hitPdRectangle) throws IOException {
    PDPageContentStream content = new PDPageContentStream(pdDocument, page,
            PDPageContentStream.AppendMode.APPEND, false, false);
    content.setNonStrokingColor(0f); // black
    content.addRect(hitPdRectangle.getLowerLeftX(),
            hitPdRectangle.getLowerLeftY() - 0.5f,
            hitPdRectangle.getUpperRightX() - hitPdRectangle.getLowerLeftX(),
            hitPdRectangle.getUpperRightY() - hitPdRectangle.getLowerLeftY());
    content.fill();
    content.close();
}
This is how I convert it into an Image PDF:
private PDDocument createNewRedactedPdf() throws IOException {
    PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);
    PDDocument redactedDocument = new PDDocument();
    for (int pageIndex = 0; pageIndex < pdDocument.getNumberOfPages(); pageIndex++) {
        BufferedImage image = pdfRenderer.renderImageWithDPI(pageIndex, 200);
        String formatName = "jpg";
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ImageIO.write(image, formatName, baos);
        byte[] bimg = baos.toByteArray();
        PDPage page = pdDocument.getPage(pageIndex);
        float pageWidth = page.getMediaBox().getWidth();
        float pageHeight = page.getMediaBox().getHeight();
        PDPage pageDraw = new PDPage(new PDRectangle(pageWidth, pageHeight));
        redactedDocument.addPage(pageDraw);
        String imgSuffixName = pageIndex + "." + formatName;
        PDImageXObject img = PDImageXObject.createFromByteArray(redactedDocument, bimg,
                pdDocument.getDocument().getDocumentID() + imgSuffixName);
        try (PDPageContentStream contentStream
                = new PDPageContentStream(redactedDocument, pageDraw, PDPageContentStream.AppendMode.OVERWRITE, false)) {
            contentStream.drawImage(img, 0, 0, pageWidth, pageHeight);
        }
    }
    return redactedDocument;
}
Any thoughts?
What you want, a true redaction feature, is possible to implement on top of PDFBox, but it requires a lot of coding (similar to the pdfSweep add-on implemented on top of iText).
In particular, you have found out yourself that it does not suffice to draw black rectangles over the areas to redact, as text extraction or copy & paste from a viewer usually ignores completely whether text is visible or covered by something.
Thus, in the code you have to find the actual instructions drawing the text to redact and remove them. But you cannot simply remove them without replacement; otherwise additional text on the same line may be shifted by your redaction.
But you cannot simply replace them with the same number of spaces or a move-right by the width of the removed text: Just consider the case of a table you want to redact a column from with only "yes" and "no" entries. If after redaction a text extractor returns three spaces where there was a "yes" and two spaces where there was a "no", anyone looking at those results knows what there was in the redacted area.
You also have to clean up instructions around the actual text drawing instruction. Consider the example of the column to redact with "yes"/"no" information again, but this time for more clarity the "yes" is drawn in green and the "no" in red. If you only replace the text drawing instructions, someone with an extractor that also extracts attributes like the color will immediately know the redacted information.
In the case of tagged PDFs, the tag attributes have to be inspected too. In particular, there is an attribute ActualText which contains the actual text represented by the tagged instructions (used, for example, by screen readers). If you only remove the text-drawing instructions but leave the tags with their attributes, anyone using a screen reader may not even realize that you tried to redact something, as the screen reader reads the complete, original text to them.
For a proper redaction, therefore, you essentially have to interpret all the current instructions, determine the actual content they draw, and create a new set of instructions which draws the same content without unnecessary extra instructions which may give away something about the redacted content.
And here we only looked at redacting the text; redacting vector and bitmap graphics on a PDF page has a similar amount of challenges to overcome for proper redaction.
...
Thus, the code required for actual redaction is beyond the scope of a Stack Overflow answer. Nonetheless, the items above may help someone implementing a redactor avoid the typical traps of overly naive redaction code.
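As a starting point only, here is a hedged sketch (PDFBox 2.0.x; the method and variable names are illustrative) of how one could walk a page's content-stream tokens, spot the text-showing operators a redactor has to deal with, and write the tokens back. Deciding whether a particular string actually falls inside a redaction rectangle requires tracking the graphics and text state (the kind of processing PDFTextStripper does), which this sketch deliberately leaves out:
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;

// Sketch only: parse a page's content stream into tokens, inspect the text-showing
// operators, and write the (possibly modified) token list back to the page.
void rewritePageContent(PDDocument document, PDPage page) throws IOException {
    List<String> textShowingOperators = Arrays.asList("Tj", "TJ", "'", "\"");

    PDFStreamParser parser = new PDFStreamParser(page);
    parser.parse();
    List<Object> tokens = parser.getTokens();

    for (Object token : tokens) {
        if (token instanceof Operator
                && textShowingOperators.contains(((Operator) token).getName())) {
            // The operands preceding this operator hold the string(s) being drawn.
            // A real redactor would determine where that text lands on the page and,
            // if it intersects a redaction area, replace or drop these tokens and
            // clean up the surrounding state-setting instructions and tags (see above).
        }
    }

    // Write the token list back into a fresh, compressed content stream.
    PDStream newContents = new PDStream(document);
    try (OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE)) {
        new ContentStreamWriter(out).writeTokens(tokens);
    }
    page.setContents(newContents);
}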
I'm building an app which among other things allows the user to insert a text into a PDF, using a layer.
The position of the text in the PDF page can be set using the app, which renders the PDF using ICEPdf inside a JPanel. After selecting the position and the size of the layer, the app renders it to the PDF using iText (v. 5.3.2).
The problem I'm facing is that the font rendering in Swing is slightly different from the final result in the PDF.
Here are some screenshots, both using the plain Helvetica font inside the same bounding box:
Rendering a text with Swing:
protected void paintComponent(Graphics g) {
    // for each line...
    g.drawString(text, b0, b1);
    // b0 and b1 are computed from the selected bounding box for the text
}
I have this:
Rendering a text with iText:
PdfTemplate t; // PdfTemplate is created elsewhere
ColumnText ct = new ColumnText(t);
ct.setRunDirection(PdfWriter.RUN_DIRECTION_NO_BIDI);
ct.setSpaceCharRatio(1);
ct.setSimpleColumn(new Phrase(text, font), b0, b1, b3, b4, font.getSize(), Element.ALIGN_BOTTOM);
// b0, b1, b3 and b4 are the bounding box of the text
ct.go();
I have this:
So the question is: what can be done to make Swing and iText render fonts exactly the same way? I can tweak either Swing or iText; no matter which code is modified, I need a truly WYSIWYG experience for the user.
I tried other fonts and styles, but there are still differences between them. I think I'm missing some configuration.
Thanks.
Edit:
Fonts in Java are measured in pixels, but iText measures in points. There are 72 points per inch, and a standard Windows machine uses 96 pixels per inch (dpi).
First we need to find the ratio between a point and a pixel:
ratio = 72 / dpi
0.75 = 72 / 96
Next we multiply the Java font size by the ratio to get the iText font size; if the Java font size is 16, then the iText font size should be 12 at 96 dpi.
iTextFontSize = ratio x javaFontSize
12 = 0.75 x 16
On a Windows machine 96 dpi is often the norm, but remember that is not always the case; you should work it out for each machine.
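A small illustrative sketch of that conversion (the 16-pixel font size is just the example above; the dpi value is whatever the toolkit reports):
// Illustrative only: convert a Swing font size (pixels) to an iText font size (points).
int dpi = java.awt.Toolkit.getDefaultToolkit().getScreenResolution(); // e.g. 96 on many Windows machines
float javaFontSize = 16f;                        // the size passed to java.awt.Font
float iTextFontSize = javaFontSize * 72f / dpi;  // 12 pt when dpi == 96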
Original Post
I believe the difference is caused by rendering hints.
Solution 1:
The best way would be to draw everything onto a buffered image. The buffered image is then drawn to the Swing component, and the exact same buffered image is drawn to the PDF as well; that way there should be no difference between them.
Alternate Idea:
Another idea that may not work as well, but fits better with your existing code, is to draw the contents of the Swing component directly to a buffered image and then draw the buffered image to the PDF.
The following is a quick example that draws the contents of a JPanel to a buffered image with a few different rendering hints (change the hints to suit your Swing component):
public BufferedImage createImage(JPanel panel)
{
    // Note the argument order: width first, then height.
    BufferedImage swingComponent = new BufferedImage(
            panel.getWidth(), panel.getHeight(),
            BufferedImage.TYPE_INT_RGB);
    Graphics2D g = swingComponent.createGraphics();
    g.setRenderingHint(RenderingHints.KEY_RENDERING,
            RenderingHints.VALUE_RENDER_QUALITY);
    g.setRenderingHint(RenderingHints.KEY_ANTIALIASING,
            RenderingHints.VALUE_ANTIALIAS_ON);
    g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
            RenderingHints.VALUE_INTERPOLATION_BILINEAR);
    panel.printAll(g); // actually paint the panel's contents into the image
    g.dispose();
    return swingComponent; // print this to your PDF
}
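If the PDF side uses iText 5, a hedged usage sketch (assuming an open com.itextpdf.text.Document called document; the names are illustrative, not taken from your code) could hand that buffered image straight to iText:
// Sketch: add the buffered image produced by createImage(...) to an iText 5 Document.
BufferedImage rendered = createImage(panel);
com.itextpdf.text.Image pdfImage = com.itextpdf.text.Image.getInstance(rendered, null);
document.add(pdfImage);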
So the question is: what can be done to make Swing and iText render fonts exactly the same way?
Exactly the same? Very little. Presuming that you supply the same font characteristics to both, there may still be differences introduced in rendering.
PDF files (or really, the text phrases) hold font information—family, size, style, etc.—but it is your PDF reader that renders the text to your device. Depending on what program you use, it can use the platform's renderer or it can choose one of its own.
AFAIK, Java Swing uses its own font rendering engine, as distinct from the underlying platform's renderer. This makes it look more consistent across platforms, but also makes it very unlikely that text will be rendered the same way any other program would.
Knowing that, your options are limited:
Use a PDF reader that also uses Swing font rendering. That's a fairly limiting solution.
Use your platform's rendering engine to draw into Swing. That sounds like iffy work, and also like something SWT would have solved. Maybe you could embed an SWT component?
I'm generating a multipage PDF from Java using iText. Problem: the lines on my charts shift color between certain pages.
Here's a screenshot of the transition between pages:
This was taken from Adobe Reader. The lines are the correct color in OS X Preview.app.
In Reader the top is #73C352, the bottom is #35FF69. In Preview.app the line is #00FE7E.
Any thoughts on what could be causing this discrepancy? I saved the PDF from Preview.app and opened it in Adobe Reader; the colors are still off.
Here is the PDF that is having trouble. Open it in Adobe Reader and look at the transition between pages 11 & 12.
On checking this out further, it appears that java.awt.print.PrinterJob is calling print() twice for each pageIndex. This might be a clue.
The problem with the pages with darker colors is that they include a pattern object with a transparent image. When transparency is involved, Adobe Acrobat automatically switches to a custom CMYK profile, and this causes the darker colors. Only Acrobat does this; other viewers behave just fine. The solution is either to remove the pattern object with the transparent image (it seems to be a drawing artifact of the PDF generator engine; it is not used anywhere on the page) or to make the page part of a transparency group and specify that the group uses the RGB colorspace.
Several different possibilities, yes.
Different color matching. If you're using a "calibrated" color space on one page and a "device" color space on another, the same RGB/CMYK values can produce visually different values.
If the graph is inside a Form XObject, the same graph can appear differently depending on the current graphic state when the form is drawn.
If you could post a link to your PDF, I could probably give you a specific answer.
Ouch. That PDF is painful to schlep through. I'd like to have some words with whoever wrote their PDF converter. Harsh ones. Lots of unnecessary clipping ("text" is being clipped hither and yon, page 7 for example), poor use of patterns for images, yet not using patterns when they would actually help, drawing text as paths, and on and on...
EDIT: Which is precisely the sort of stuff you see when rendering Java UI via a PdfGraphics2D object. You CAN keep the text as text though. It's just a matter of how you create the PdfGraphics2D instance.
Okay, so the color of the line itself is identical. 0 1 0.4 RG. HOWEVER, there is some "transparency stuff" going on.
On pages that have images with soft masks or extended graphic states that change the transparency, the green line appears darker. On pages without, it appears brighter.
I suspect that all those other PDF viewers that draw the lines consistently either don't support transparency at all or support it only poorly.
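If you want to check which pages actually introduce such transparency, here is a small diagnostic sketch (using PDFBox purely as an inspection aid, not as part of the fix; the file name is a placeholder) that lists pages whose extended graphics states set a soft mask or a non-default alpha:
import java.io.File;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;

// Diagnostic sketch: report pages whose ExtGState resources introduce transparency.
try (PDDocument doc = PDDocument.load(new File("report.pdf"))) {
    int pageNo = 0;
    for (PDPage page : doc.getPages()) {
        pageNo++;
        PDResources resources = page.getResources();
        if (resources == null) continue;
        for (COSName name : resources.getExtGStateNames()) {
            COSDictionary gs = resources.getExtGState(name).getCOSObject();
            boolean softMask = gs.containsKey(COSName.SMASK);
            float strokeAlpha = gs.getFloat(COSName.CA, 1);
            float fillAlpha = gs.getFloat(COSName.getPDFName("ca"), 1);
            if (softMask || strokeAlpha < 1 || fillAlpha < 1) {
                System.out.printf("Page %d: /%s sets SMask=%b CA=%.2f ca=%.2f%n",
                        pageNo, name.getName(), softMask, strokeAlpha, fillAlpha);
            }
        }
    }
}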
Exposition:
I'm writing an OpenGL app in Java via JOGL. The built-in fonts that come with GLUT are very basic. I want to be able to take a Java Font, render all the characters in it to a 2D array of bytes (a greyscale image representing the characters), which I can then load as a texture in OpenGL.
Problem:
I can load the textures in OpenGL; I can use JOGL. My only problem is the step from "a font Java can read" to "a 2D greyscale image of all the characters". What functions / libraries should I be using?
I'm not sure I quite understand what you're looking for. I think what you want is some code that will generate grayscale bitmap images of a given size for every glyph in a font.
There isn't a way (that I am aware of anyway) to get all the glyphs a font supports (oddly, you can get the number of glyphs... so yeah, I may just be missing something, bah). However, you can get glyph metrics for given characters quite easily.
Something along these lines should work for you.
HashMap<int[], Rectangle2D> generateGlyphs(int fontSize, String characters, Font font) {
    HashMap<int[], Rectangle2D> ret = new HashMap<int[], Rectangle2D>();
    FontRenderContext rendCont = new FontRenderContext(null, true, true);
    for (int i = 0; i < characters.length(); i++) {
        String glyph = characters.substring(i, i + 1);
        Rectangle2D bounds = font.getStringBounds(glyph, rendCont);
        // Grayscale image just big enough for this character
        BufferedImage bi = new BufferedImage((int) bounds.getWidth(), (int) bounds.getHeight(),
                BufferedImage.TYPE_BYTE_GRAY);
        Graphics g = bi.getGraphics();
        g.setFont(font);
        g.drawString(glyph, 0, (int) bounds.getHeight());
        g.dispose();
        // Map the raw grayscale pixel samples to the glyph bounds
        ret.put(bi.getData().getPixels(0, 0, (int) bounds.getWidth(), (int) bounds.getHeight(), (int[]) null),
                bounds);
    }
    return ret;
}
Note that I'm clipping rather than rounding in some places, which could potentially be an issue.
Possibly I'm missing something, but can't java.awt do most of this for you? (A short sketch follows the steps below.)
create a new BufferedImage
get its Graphics2D (image.getGraphics)
paint the text to the image
get the Raster (getData) from the buffered image.
dispose of the Graphics2D context when done.
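A minimal sketch of those steps for a single character (the glyph, font, size, image dimensions and baseline offset are arbitrary placeholders):
import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.awt.image.Raster;

// Sketch of the java.awt steps above.
BufferedImage img = new BufferedImage(64, 64, BufferedImage.TYPE_BYTE_GRAY);
Graphics2D g2 = (Graphics2D) img.getGraphics();
g2.setRenderingHint(RenderingHints.KEY_TEXT_ANTIALIASING, RenderingHints.VALUE_TEXT_ANTIALIAS_ON);
g2.setFont(new Font("SansSerif", Font.PLAIN, 32));
g2.setColor(Color.WHITE);
g2.drawString("A", 8, 48);       // paint the character; y is the baseline
g2.dispose();                    // dispose of the graphics context when done
Raster pixels = img.getData();   // grayscale samples ready to upload as a texture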
Processing can render fonts in OpenGL, and it uses its own in-house open-source class PFont to "generate grayscale bitmap images of a given size for every glyph in a font," as Kevin put it. I recommend you look at the source, which is here.
I am trying to read a PDF document in a J2EE application.
For a web application I have to store PDF documents on disk. To make searching easy I want to build a reverse index of the text inside the documents, provided they contain text (e.g. from OCR).
With the PDFBox library it is possible to create a PDDocument object which contains an entire PDF file. However, to preserve memory and improve overall performance, I'd rather handle the document as a stream and read one page at a time into a buffer.
I wonder if it is possible to read a file stream containing a PDF page by page, or even one line at a time.
For a given generic PDF document you have no way of knowing where one page ends and another one starts, using PDFBox at least.
If your concern is the use of resources, I suggest you parse the PDF document into a COSDocument and extract the parsed objects from the COSDocument using .getObjects(), which will give you a java.util.List. This should be easy to fit into whatever scarce resources you have.
Note that you can easily convert your parsed pdf documents into Lucene indexes through the PDFBox API.
Also, before venturing into the land of optimisations, be sure that you really need them. PDFBox is able to make an in-memory representation of quite large PDF documents without much effort.
For parsing the PDF document from an InputStream, look at the COSDocument class
For writing lucene indexes, look at LucenePDFDocument class
For in-memory representations of COSDocuments, look at FDFDocument
In the 2.0.* versions, open the PDF like this:
PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());
This will set up memory usage so that buffering only uses temporary file(s) (no main memory), with no size restriction.
This was answered here.
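If the goal is the reverse index from the question, a hedged sketch (PDFBox 2.0.x; the file name is a placeholder) could combine that loading mode with PDFTextStripper's page range so text is extracted one page at a time:
import java.io.File;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Sketch: extract text page by page while buffering to temporary files only.
try (PDDocument doc = PDDocument.load(new File("document.pdf"), MemoryUsageSetting.setupTempFileOnly())) {
    PDFTextStripper stripper = new PDFTextStripper();
    for (int page = 1; page <= doc.getNumberOfPages(); page++) {
        stripper.setStartPage(page); // limit extraction to a single page
        stripper.setEndPage(page);
        String pageText = stripper.getText(doc);
        // feed pageText into the reverse index (e.g. one Lucene document per page)
    }
}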
Take a look at the PDF Renderer Java library. I have tried it myself and it seems much faster than PDFBox. I haven't tried getting the OCR text, however.
Here is an example copied from the link above which shows how to draw a PDF page into an image:
File file = new File("test.pdf");
RandomAccessFile raf = new RandomAccessFile(file, "r");
FileChannel channel = raf.getChannel();
ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
PDFFile pdffile = new PDFFile(buf);
// draw the first page to an image
PDFPage page = pdffile.getPage(0);
//get the width and height for the doc at the default zoom
Rectangle rect = new Rectangle(0,0,
(int)page.getBBox().getWidth(),
(int)page.getBBox().getHeight());
//generate the image
Image img = page.getImage(
rect.width, rect.height, //width & height
rect, // clip rect
null, // null for the ImageObserver
true, // fill background with white
true // block until drawing is done
);
I'd imagine you can read through the file byte by byte looking for page breaks. Line by line is more difficult because of possible PDF formatting issues.