org.apache.poi.xwpf.converter.xhtml.XHTMLConverter not generating images


I am using the org.apache.poi.xwpf.converter.xhtml.XHTMLConverter class to convert docx to HTML. Below is my Groovy code:
public Map convert(String wordDocPath, String htmlPath,
Map conversionParams)
{
log.info("Converting word file "+wordDocPath)
try
{
...
String notificationWorkingFolder = "C:\\tomcats\\Notification\\store\\Notification1234"
FileInputStream fis = new FileInputStream(wordDocPath);
XWPFDocument document = new XWPFDocument(fis);
XHTMLOptions options = XHTMLOptions.create().URIResolver(new FileURIResolver(new File(notificationWorkingFolder)));
File htmlFile = new File(htmlPath);
OutputStream out = new FileOutputStream(htmlFile)
XHTMLConverter.getInstance().convert(document, out, options);
log.info("Converted to HTML file "+htmlPath)
return [success:true,htmlFileName:getFileName(htmlPath)]
}
catch(Exception e)
{
log.error("Exception :"+e.getMessage(),e)
return [success:false]
}
}
The above code converts the docx to HTML successfully, but if the docx contains any images it emits <img src="C:\tomcats\Notification\store\Notification1234\word\media\image1.png"> without copying the image to that folder. As a result, when I open the HTML file, all images appear empty. Am I missing something in the code? Also, is there a way to generate an image source URL instead of an absolute path, like <img src="http://localhost:8080/webapp/image1.png">?

I found the answer to the first question at lychaox.com/java/poi/Word07toHtml.html. I had to add one line of code, options.setExtractor(new FileImageExtractor(imageFolderFile));, so that the images are written out.
I resolved the second question with a pattern search and replace on the generated HTML.
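
The search-and-replace step is not shown in the answer, so the following is only a minimal sketch of one way it could be done with the JDK's regex classes. The class name ImgSrcRewriter, the method rewrite, and the base URL are hypothetical names chosen for illustration; the sketch assumes every image should be served from a single base URL and keeps only the file name of each src path.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcRewriter {
    // Captures the part of an <img> tag up to src=", the path itself, and the closing quote.
    private static final Pattern IMG_SRC =
            Pattern.compile("(<img\\b[^>]*?\\bsrc=\")([^\"]*)(\")", Pattern.CASE_INSENSITIVE);

    /** Replaces the directory part of every img src with baseUrl, keeping the file name. */
    public static String rewrite(String html, String baseUrl) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String path = m.group(2);
            // Handle both Windows and URL-style separators.
            int cut = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
            String fileName = path.substring(cut + 1);
            String replacement = m.group(1) + baseUrl + fileName + m.group(3);
            // quoteReplacement keeps backslashes and dollar signs literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p><img src=\"C:\\tomcats\\Notification\\store\\Notification1234"
                + "\\word\\media\\image1.png\"></p>";
        System.out.println(rewrite(html, "http://localhost:8080/webapp/"));
    }
}
```

A regex is adequate here because the converter's output is machine-generated and regular; for arbitrary HTML an HTML parser would be the safer choice.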

Even with proper usage, it's worth noting that XHTMLConverter uses XHTMLMapper, which does not process headers, footers, or VML images. Any images falling into those categories will be lost.
The PDFConverter is more fully featured, but it also uses the GPL-licensed library iText.

Related

PDF with forms to simple image PDF

How can I transform a PDF with forms made in Adobe LiveCycle into a simple image PDF using Java?
I tried using Apache PDFBox, but it can't render a PDF with forms to an image.
This is what I tried (from this question: Convert PDF files to images with PDFBox):
String pdfFilename = "PDForm_1601661791_587488.pdf";
try (PDDocument document = PDDocument.load(new File(pdfFilename))) {
PDFRenderer pdfRenderer = new PDFRenderer(document);
for (int page = 0; page < document.getNumberOfPages(); ++page) {
BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
ImageIOUtil.writeImage(bim, pdfFilename + "-" + (page+1) + ".png", 300);
}
} catch (IOException ex) {
Logger.getLogger(StartClass.class.getName()).log(Level.SEVERE, null, ex);
}
But it is not working; the result is an image that says "The document you are trying to load requires Adobe Reader 8 or higher."

I guess it is not possible; I tried many libraries and none worked.
This is how I solved the problem:
I used an external tool, PDFCreator.
In PDFCreator I created a special printer that prints and saves the PDF without asking any questions (PDFCreator has options for this).
This is simple to reproduce because PDFCreator's Debug section has an option to load a config file, so I keep this file prepared: I just install PDFCreator and load the config file.
If you use my INI file from the link above, note that the resulting PDF is automatically saved in the folder "current user folder/Desktop/temporary".
The rest of the job is done from Java using Adobe Reader; in my case the code is:
ProcessBuilder pb = new ProcessBuilder(adobePath, "/t", path+"/"+filename, printerName);
Process p = pb.start();
This code opens my PDF in Adobe Reader, prints it to the specified virtual printer, and exits automatically.
"adobePath" is the path to the Adobe executable.
path+"/"+filename is the path to my PDF.
"printerName" is the name of the virtual printer created in PDFCreator.
So this is not a pure Java solution, and in the future I intend to use Apache PDFBox to generate my PDFs in a format that is compatible with browsers and all readers, but this works too.

Add OCR layer to existing PDF without the need to write to file system

I'm trying to take a scanned PDF document and add an OCR layer on top. I can get the following code to achieve this:
public void ocrFile(PDDocument pdDocument, File file) throws TesseractException, IOException {
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(pdDocument);
Tesseract instance = new Tesseract(); // JNA Interface Mapping
File tessDataFolder = LoadLibs.extractTessResources("tessdata");
instance.setDatapath(tessDataFolder.getAbsolutePath());
List<RenderedFormat> list = new ArrayList<RenderedFormat>();
list.add(RenderedFormat.PDF);
String outputFileName = FilenameUtils.removeExtension(file.getAbsolutePath());
instance.createDocuments(file.getAbsolutePath(), outputFileName, list);
}
This will output the PDF with the OCR layer in place to a specific location on disk. I'm trying to change this so the application does not need to write any files to disk. I'm not sure if this can be done?
Ideally I'd like to change the File input of ocrFile with a MultipartFile and have that be returned from this method, negating the need for involving the file system. Is this achievable?

No, it cannot be done. Tesseract's TessResultRenderer API outputs to physical files, hence the required outputbase input parameter to specify the name of output file.
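
Since the file-based output cannot be avoided, one possible workaround is to confine the disk usage to a temporary directory and expose only bytes to the caller. This is a sketch of that general pattern, not of Tesseract's API: TempFileBridge and FileProcessor are hypothetical names, and the processor passed in stands for whatever file-based call (such as the createDocuments call in the question) does the real work. Files are still written, just not anywhere the caller has to manage.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempFileBridge {
    /** Hypothetical stand-in for a tool that can only read and write files. */
    public interface FileProcessor {
        void process(Path input, Path output) throws IOException;
    }

    /** Runs a file-based processor on in-memory bytes and cleans up the temp files. */
    public static byte[] run(byte[] inputBytes, FileProcessor processor) throws IOException {
        Path dir = Files.createTempDirectory("ocr-");
        Path in = dir.resolve("input.pdf");
        Path out = dir.resolve("output.pdf");
        try {
            Files.write(in, inputBytes);
            processor.process(in, out);
            return Files.readAllBytes(out);
        } finally {
            // Remove the files first so the directory is empty when it is deleted.
            Files.deleteIfExists(in);
            Files.deleteIfExists(out);
            Files.deleteIfExists(dir);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo processor that only copies the input; a real one would invoke the OCR engine.
        byte[] result = run("demo".getBytes(), (i, o) -> Files.copy(i, o));
        System.out.println(result.length);
    }
}
```

With a MultipartFile, the input bytes would come from its getBytes() method and the returned array could be wrapped in whatever response type the application uses.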

Java: Download .txt File from URL

I want to download a .txt file from a website. My code runs without errors and saves the document, but the document is full of HTML code instead of my content.
public static void main(String[] args) {
try {
URL website = new URL("http://www.file-upload.net/download-11700212/document.txt.html");
String filepath = "C://Users//" + System.getProperty("user.name") + "//Desktop//document.txt";
ReadableByteChannel channel = Channels.newChannel(website.openStream());
FileOutputStream stream = new FileOutputStream(filepath);
stream.getChannel().transferFrom(channel, 0, Long.MAX_VALUE);
System.out.println("Download successful.");
} catch (Exception e) {
System.out.println("Download was not successful.");
}
}
The download itself works and I get the txt file on my desktop, but the content is wrong and full of HTML code.
Please help.
Thanks.

The URL you are trying to download from is an HTML page, rather than the document itself. The link on that page you should be trying to download from is...
http://www.file-upload.net/download5.php?valid=451.69031370715&id=11700212&name=document.txt
However, if you wish to guarantee that you're downloading a text file, then you should choose a text file to download directly e.g.
http://humanstxt.org/humans.txt
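
To catch this kind of mistake early, the download code can inspect the declared content type before saving anything. The following is a minimal sketch under that assumption; TextDownloader, isHtml and download are hypothetical names, and it relies on the server (or, for file URLs, the JDK) reporting a sensible Content-Type.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

public class TextDownloader {
    /** True when the declared content type indicates an HTML page. */
    public static boolean isHtml(String contentType) {
        return contentType != null
                && contentType.toLowerCase(Locale.ROOT).startsWith("text/html");
    }

    /** Saves the resource at url to target, refusing responses declared as HTML. */
    public static void download(URL url, Path target) throws IOException {
        URLConnection connection = url.openConnection();
        // getContentType() connects if necessary and may return null.
        if (isHtml(connection.getContentType())) {
            throw new IOException("Expected a file but got an HTML page: " + url);
        }
        try (InputStream in = connection.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        download(new URL("http://humanstxt.org/humans.txt"), Paths.get("humans.txt"));
    }
}
```

The try-with-resources block also closes the stream, which the code in the question never does.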

I have a Python project called Python Webscraper which can read a URL and copy its textual contents to a text file without the HTML.
You'll need to install a package called Beautiful Soup then run the code from the GitHub repo.

How to create PS file from PDF file using Java?

I wrote an application that loads a PDF file into a PDDocument, and it works fine. I use the PDFBox library:
PDDocument pdfDoc = PDDocument.load(pdfFile);
Now I want to create a PS (PostScript) file from the PDF file. Is there any way to do this in Java? I can use any free API.
Many thanks.

Adobe seems to have a library. Here are some instructions. Please note, I have not tried this myself: http://help.adobe.com/en_US/livecycle/9.0/programLC/help/index.htm?content=000761.html
This link has a more detailed solution:
http://help.adobe.com/en_US/livecycle/9.0/programLC/help/index.htm?content=000074.html

You can use PDFDocument to load your PDF then use PSConverter to convert the PDF document into an OutputStream.
The library I'm using is called ghost4j:
import org.ghost4j.converter.PSConverter;
import org.ghost4j.document.PDFDocument;
Here's a small snippet:
private ByteArrayOutputStream convertPDFtoPS(){
ByteArrayOutputStream outstreamFile = new ByteArrayOutputStream();
try{
PDFDocument document = new PDFDocument();
//getPDFFile just returns an InputStream of the PDF file
document.load(getPDFFile());
PSConverter converter = new PSConverter();
converter.convert(document, outstreamFile);
outstreamFile.close();
}
catch(Exception e)
{
e.printStackTrace();
}
return outstreamFile;
}

How to check a uploaded file whether it is an image or other file?

In my web application I have an image uploading module. I want to check whether the uploaded file is an image file or some other file. I am using Java on the server side.
The image is read as a BufferedImage in Java and then I write it to disk with ImageIO.write().
How can I check whether the BufferedImage is really an image or something else?
Any suggestions or links would be appreciated.

I'm assuming that you're running this in a servlet context. If it's affordable to check the content type based on just the file extension, then use ServletContext#getMimeType() to get the mime type (content type). Just check if it starts with image/.
String fileName = uploadedFile.getFileName();
String mimeType = getServletContext().getMimeType(fileName);
// getMimeType() returns null for unknown extensions, so guard against that.
if (mimeType != null && mimeType.startsWith("image/")) {
// It's an image.
} else {
// It's not an image.
}
The default mime types are defined in the web.xml of the servlet container in question. In Tomcat, for example, it's located in /conf/web.xml. You can extend or override it in the /WEB-INF/web.xml of your webapp as follows:
<mime-mapping>
<extension>svg</extension>
<mime-type>image/svg+xml</mime-type>
</mime-mapping>
But this doesn't protect you from users who fool you by changing the file extension. If you'd like to cover this as well, then you can also determine the mime type based on the actual file content. If it's affordable to check for only BMP, GIF, JPEG, PNG, TIFF or WBMP types (but not PSD, SVG, etc), then you can just feed it directly to ImageIO#read() and check that it doesn't throw an exception. Note that ImageIO#read() returns null when no registered reader recognizes the content, so the toString() call below deliberately triggers an exception in that case.
try (InputStream input = uploadedFile.getInputStream()) {
try {
ImageIO.read(input).toString();
// It's an image (only BMP, GIF, JPEG, PNG, TIFF and WBMP are recognized).
} catch (Exception e) {
// It's not an image.
}
}
But if you'd like to cover more image types as well, then consider using a 3rd party library which does all the work by sniffing the file signatures. For example Apache Tika which recognizes on top of ImageIO formats also PSD, BPG, WEBP, ICNS and SVG as well:
Tika tika = new Tika();
try (InputStream input = uploadedFile.getInputStream()) {
String mimeType = tika.detect(input);
if (mimeType.startsWith("image/")) {
// It's an image.
} else {
// It's not an image.
}
}
If necessary, you can combine these approaches and weigh one against the other.
That said, you don't necessarily need ImageIO#write() to save the uploaded image to disk. Just writing the obtained InputStream directly to a Path or any OutputStream like FileOutputStream the usual Java IO way is more than sufficient (see also Recommended way to save uploaded files in a servlet application):
try (InputStream input = uploadedFile.getInputStream()) {
Files.copy(input, new File(uploadFolder, fileName).toPath());
}
Unless you'd like to gather some image information like its dimensions and/or want to manipulate it (crop/resize/rotate/convert/etc) of course.
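
The ImageIO check above can also be written with an explicit null test instead of relying on an exception. The following is a minimal, JDK-only sketch; ImageCheck and isImage are hypothetical names, and it takes the upload as a byte array so the same bytes can be saved afterwards. It recognizes only the formats ImageIO has readers for.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ImageCheck {
    /** True when ImageIO has a reader that can decode the bytes. */
    public static boolean isImage(byte[] data) {
        try {
            // read() returns null when no registered reader recognizes the content.
            return ImageIO.read(new ByteArrayInputStream(data)) != null;
        } catch (IOException | RuntimeException e) {
            // Recognized format but corrupt or unreadable content.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny PNG in memory to have a known-good image.
        ByteArrayOutputStream png = new ByteArrayOutputStream();
        ImageIO.write(new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB), "png", png);
        System.out.println(isImage(png.toByteArray()));
        System.out.println(isImage("not an image".getBytes()));
    }
}
```

Decoding the full image is more expensive than sniffing a signature, but it also rejects files that merely start with a valid header.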

I used org.apache.commons.imaging.Imaging in my case. Below is a sample piece of code that checks whether an image is a JPEG image. It throws ImageReadException if the uploaded file is not an image.
try {
//image is InputStream
byte[] byteArray = IOUtils.toByteArray(image);
ImageFormat format = Imaging.guessFormat(byteArray);
if (format == ImageFormats.JPEG) {
return;
} else {
// handle image of different format. Ex: PNG
}
} catch (ImageReadException e) {
//not an image
}

This is built into the JDK and simply requires a stream with support for mark and reset:
byte[] data = ...; // the uploaded file's bytes
InputStream is = new BufferedInputStream(new ByteArrayInputStream(data));
String mimeType = URLConnection.guessContentTypeFromStream(is);
//...close stream
Since Java SE 6 https://docs.oracle.com/javase/6/docs/api/java/net/URLConnection.html
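
As a fuller illustration of the same JDK call, here is a small sketch that wraps it in an image check. StreamSniffer, sniff and isImage are hypothetical names. The method only looks at the leading bytes and knows a limited set of signatures, so treat a null result as "unknown" rather than "not an image".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;

public class StreamSniffer {
    /** Returns the guessed MIME type, or null when the leading bytes are not recognized. */
    public static String sniff(byte[] data) throws IOException {
        // ByteArrayInputStream supports mark/reset, which guessContentTypeFromStream needs.
        try (InputStream is = new ByteArrayInputStream(data)) {
            return URLConnection.guessContentTypeFromStream(is);
        }
    }

    /** True when the guessed type is one of the image/* types the JDK knows. */
    public static boolean isImage(byte[] data) throws IOException {
        String type = sniff(data);
        return type != null && type.startsWith("image/");
    }

    public static void main(String[] args) throws IOException {
        byte[] gifHeader = "GIF89a".getBytes("US-ASCII");
        System.out.println(sniff(gifHeader));
        System.out.println(isImage("plain text".getBytes("US-ASCII")));
    }
}
```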

Try using multipart file instead of BufferedImage
import org.apache.http.entity.ContentType;
...
public void processImage(MultipartFile file) {
if(!Arrays.asList(ContentType.IMAGE_JPEG.getMimeType(), ContentType.IMAGE_PNG.getMimeType(), ContentType.IMAGE_GIF.getMimeType()).contains(file.getContentType())) {
throw new IllegalStateException("File must be an Image");
}
}
