I am trying to identify the non-text data which is a redaction done by another software(so essentially a black filled rectangle box) in a PDF in which all pages are saved as images. I am able to use Tesseract-ocr with Apache Tika to read the text in the pdf.
In the output, the text is coming as 95% correct but for the black box areas, it prints out characters like -
a OR
iis
I did some research and looks like OpenCV is an option but i am not sure if that will work for my use case. After searching the internet I have found that the best tool for this is OpenCV. I have no experience in OpenCV or image processing itself.
Any suggestions are welcome. Thanks!
My code to perform OCR-
Metadata metadata = new Metadata();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
pdfConfig.setExtractUniqueInlineImagesOnly(false);
TesseractOCRConfig config = new TesseractOCRConfig();
ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
parseContext.set(Parser.class, parser);
try (InputStream stream = new FileInputStream(new File("file.pdf"))) {
parser.parse(stream, handler, metadata, parseContext);
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("Contents of the PDF:" + handler.toString());```
Related
The task is this, you need to create a printable form, replace the fields with the necessary data and convert to pdf format. Replacing text and saving in docx format is correct.
But when I run:
PdfConverter.getInstance().convert(document, outputStream, pdfOptions);
The fields that have been replaced are missing in the pdf file. The fields in the form are entered in the form for tex (figure), you can see it in the photo.
If you insert into plain text, then the replacement is correct. I think that you need to somehow configure PdfOptions, but so far there are no options for how to do it ...
If someone had such a problem and has suggestions on what to try to do, I will be glad for help))
The code that converts the finished form:
`InputStream inputStream = ByteSource.wrap(printFileDTO.getContent()).openStream();
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
PdfOptions pdfOptions = PdfOptions.create();
pdfOptions.fontProvider((familyName, encoding, size, style, color) -> {
try {
BaseFont baseFont =
BaseFont.createFont(TNR, encoding, BaseFont.EMBEDDED);
return new Font(baseFont, size, style, color);
} catch (Exception e) {
throw new RuntimeException(e);
}
});
XWPFDocument document = new XWPFDocument(inputStream);
PdfConverter.getInstance().convert(document, outputStream, pdfOptions);`
Versions:
<apache.poi>5.2.3</apache.poi> <poi.ooxml.version>5.2.3</poi.ooxml.version> <poi.scratchpad.version>5.2.3</poi.scratchpad.version> <opensagres.xdocreport.version>2.0.3</opensagres.xdocreport.version>
Updated to the latest to try, but it did not change anything
Piece of completed form:
enter image description here
I'am looking for some "stable" method to convert DOCX file from MS WORD into PDF. Since now I have used OpenOffice installed as listener but it often hangs. The problem is that we have situations when many users want to convert SXW,DOCX files into PDF at the same time. Is there some other possibility? I tryed with examples from this site: https://angelozerr.wordpress.com/2012/12/06/how-to-convert-docxodt-to-pdfhtml-with-java/ but the output result is not good (converted documents have errors and layout is quite modified).
here is "source" docx document:
here is document converted with docx4j with some exception text inside document. Also the text in upper right corner is missing.
this one is PDF created with OpenOffice as converter from docx to pdf. Some text is missing "upper right corner"
Is there some other option to convert docx into pdf with Java?
There are lot of methods to do conversion
One of the used method is using POI and DOCX4j
InputStream is = new FileInputStream(new File("your Docx PAth"));
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage
.load(is);
List sections = wordMLPackage.getDocumentModel().getSections();
for (int i = 0; i < sections.size(); i++) {
wordMLPackage.getDocumentModel().getSections().get(i)
.getPageDimensions();
}
Mapper fontMapper = new IdentityPlusMapper();
PhysicalFont font = PhysicalFonts.getPhysicalFonts().get(
"Comic Sans MS");//set your desired font
fontMapper.getFontMappings().put("Algerian", font);
wordMLPackage.setFontMapper(fontMapper);
PdfSettings pdfSettings = new PdfSettings();
org.docx4j.convert.out.pdf.PdfConversion conversion = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(
wordMLPackage);
//To turn off logger
List<Logger> loggers = Collections.<Logger> list(LogManager
.getCurrentLoggers());
loggers.add(LogManager.getRootLogger());
for (Logger logger : loggers) {
logger.setLevel(Level.OFF);
}
OutputStream out = new FileOutputStream(new File("Your OutPut PDF path"));
conversion.output(out, pdfSettings);
System.out.println("DONE!!");
This works perfect and even tried on multiple DOCX files.
I'm trying to inline (embed) an image in a JEditorPane from a file such as:
<img src="data:image/gif;utf-8,data...">
But I'm struggling with the code.
So far I have (assuming a gif file):
try
{
File imageFile = new File("C:\\test\\testImage.gif");
File htmlFile = new new File("C:\\test\\testOutput.html");
byte[] imageBytes = Files.toByteArray(imageFile);
String imageData = new String(imageBytes, "UTF-8");
String html = "<html><body><img src=\"data:image/gif;utf-8," + imageData + "\"></body></html>";
FileUtils.writeStringToFile(htmlFile, htmlText);
} catch (Exception e) {
e.printStackTrace();
}
This does create a file but the image is invalid. I'm sure I'm not converting the image the proper way...
JEditorPane (and Java HTML rendering in general) does not support Base64 encoded images.
Of course 'does not' != 'could not'.
The thing is, you'd need to create (or adjust) an EditorKit can have new elements defined. An e.g. is seen in the AppletEditorKit. You'd need to look for HTML.tag.IMG - it is is a standard image, call the super functionality, else use this source (or similar) to convert it to an image, then embed it.
I'm trying to generate an SVG image and then transcode it to PNG using Apache Batik. However, I end up with an empty image and I can't see why.
I use the Document from SVGDomImplementation as the base for my transcoding (to avoid writing the SVG to disk and loading it again). Here's an example:
DOMImplementation domImpl = SVGDOMImplementation.getDOMImplementation();
String namespace = SVGDOMImplementation.SVG_NAMESPACE_URI;
Document document = domImpl.createDocument(namespace, "svg", null);
//stuff that builds SVG (and works)
TranscoderInput transcoderInput = new TranscoderInput(svgGenerator.getDOMFactory());
PNGTranscoder transcoder = new PNGTranscoder();
transcoder.addTranscodingHint(PNGTranscoder.KEY_WIDTH, new Float(svgWidth));
transcoder.addTranscodingHint(PNGTranscoder.KEY_HEIGHT, new Float(svgHeight));
try {
File temp = File.createTempFile(key, ".png");
FileOutputStream outputstream = new FileOutputStream(temp);
TranscoderOutput output = new TranscoderOutput(outputstream);
transcoder.transcode(transcoderInput, output);
outputstream.flush();
outputstream.close();
name = temp.getName();
} catch (IOException ioex) {
ioex.printStackTrace();
} catch (TranscoderException trex) {
trex.printStackTrace();
}
My problem is that the resulting image is empty and I can't see why. Any hints?
I think it depends on how you're creating the SVG document. What are you using svgGenerator for (which I assume is an SVGGraphics2D)?
TranscoderInput transcoderInput = new TranscoderInput(svgGenerator.getDOMFactory());
If you've built up the SVG document in document, then you should pass it to the TranscoderInput constructor.
This page has an example of rasterizing an SVG DOM to a JPEG.
I am currently using javax.imageio.ImageIO to write a PNG file. I would like to include a tEXt chunk (and indeed any of the chunks listed here), but can see no means of doing so.
By the looks of com.sun.imageio.plugins.png.PNGMetadata it should be possible.
I should be most grateful for any clues or answers.
M.
The solution I struck upon after some decompilation, goes as follows ...
RenderedImage image = getMyImage();
Iterator<ImageWriter> iterator = ImageIO.getImageWritersBySuffix( "png" );
if(!iterator.hasNext()) throw new Error( "No image writer for PNG" );
ImageWriter imagewriter = iterator.next();
ByteArrayOutputStream bytes = new ByteArrayOutputStream();
imagewriter.setOutput( ImageIO.createImageOutputStream( bytes ) );
// Create & populate metadata
PNGMetadata metadata = new PNGMetadata();
// see http://www.w3.org/TR/PNG-Chunks.html#C.tEXt for standardized keywords
metadata.tEXt_keyword.add( "Title" );
metadata.tEXt_text.add( "Mandelbrot" );
metadata.tEXt_keyword.add( "Comment" );
metadata.tEXt_text.add( "..." );
metadata.tEXt_keyword.add( "MandelbrotCoords" ); // custom keyword
metadata.tEXt_text.add( fractal.getCoords().toString() );
// Render the PNG to memory
IIOImage iioImage = new IIOImage( image, null, null );
iioImage.setMetadata( metadata ); // Attach the metadata
imagewriter.write( null, iioImage, null );
I realise this question is long since answered, but if you want to do it without dipping into the "com.sun" hierarchy, here's an quick and very ugly example as I couldn't find this documented anywhere else.
BufferedImage img = new BufferedImage(300, 300, BufferedImage.TYPE_INT_ARGB);
// Create a DOM Document describing the metadata;
// I've gone the quick and dirty route. The description for PNG is at
// [http://download.oracle.com/javase/1.4.2/docs/api/javax/imageio/metadata/doc-files/png_metadata.html][1]
Calendar c = Calendar.getInstance();
String xml = "<?xml version='1.0'?><javax_imageio_png_1.0><tIME year='"+c.get(c.YEAR)+"' month='"+(c.get(c.MONTH)+1)+"' day='"+c.get(c.DAY_OF_MONTH)+"' hour='"+c.get(c.HOUR_OF_DAY)+"' minute='"+c.get(c.MINUTE)+"' second='"+c.get(c.SECOND)+"'/><pHYs pixelsPerUnitXAxis='"+11811+"' pixelsPerUnitYAxis='"+11811+"' unitSpecifier='meter'/></javax_imageio_png_1.0>";
DOMResult domresult = new DOMResult();
TransformerFactory.newInstance().newTransformer().transform(new StreamSource(new StringReader(xml)), domresult);
Document document = dom.getResult();
// Apply the metadata to the image
ImageWriter writer = (ImageWriter)ImageIO.getImageWritersBySuffix("png").next();
IIOMetadata meta = writer.getDefaultImageMetadata(new ImageTypeSpecifier(img), null);
meta.setFromTree(meta.getMetadataFormatNames()[0], document.getFirstChild());
FileOutputStream out = new FileOutputStream("out.png");
writer.setOutput(ImageIO.createImageOutputStream(out));
writer.write(new IIOImage(img, null, meta));
out.close();
Using Java 1.6, I edited Mike's code to
Node document = domresult.getNode();
instead of his line
Document document = dom.getResult();
Moreover, I'd suggest to add a line
writer.dispose()
after the job has been done, so that any resources held by the writer are released.
We do this in the JGraphX project. Download the source code and have a look in the com.mxgraph.util.png package, there you'll find three classes for encoding that we copied from the Apache Batik sources. An example of usage is in com.mxgraph.examples.swing.editor.EditorActions in the saveXmlPng method. Slightly edited the code looks like:
mxPngEncodeParam param = mxPngEncodeParam
.getDefaultEncodeParam(image);
param.setCompressedText(new String[] { "mxGraphModel", xml });
// Saves as a PNG file
FileOutputStream outputStream = new FileOutputStream(new File(
filename));
try
{
mxPngImageEncoder encoder = new mxPngImageEncoder(outputStream,
param);
if (image != null)
{
encoder.encode(image);
}
}
finally
{
outputStream.close();
}
Where image is the BufferedImage that will form the .PNG and xml is the string we wish to place in the iTxt section. "mxGraphModel" is the key for that xml string (the section comprises some number of key/value pairs), obviously you replace that with your key.
Also under com.mxgraph.util.png we've written a really simple class that extracts the iTxt without processing the whole image. You could apply the same idea for the tEXt chunk using mxPngEncodeParam.setText instead of setCompressedText(), but the compressed text section does allow for considerable larger text sections.
Try the Sixlegs Java PNG library (http://sixlegs.com/software/png/).
It claims to have support for all chunk types and does private chunk handling.
Old question, but... PNGJ gives full control for reading and writing PNG chunks