itext Converting PDF to csv

itext Converting PDF to csv - java

I am trying to use itext framework to convert a pdf file into a csv for import into excel.
The output is garbled and I pressume I am missing a step in regards to format conversion however I can't seem to find the information in the itext site and am looking for assistance.
Current is as below.
package com.pdf.convert;
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
public class ThirdPDF {
private static String INPUTFILE = "/location/test.pdf";
private static String OUTPUTFILE = "/location/test.csv";
public static void main(String[] args) throws DocumentException,
IOException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document,
new FileOutputStream(OUTPUTFILE));
document.open();
PdfReader reader = new PdfReader(INPUTFILE);
int n = reader.getNumberOfPages();
PdfImportedPage page;
// Go through all pages
for (int i = 1; i <= n; i++) {
// Only page number 2 will be included
if (i == 2) {
page = writer.getImportedPage(reader, i);
Image instance = Image.getInstance(page);
document.add(instance);
}
}
document.close();
}
}

Converting PDF file to CSV file.
Present Directory and File creation is based on Android Framework.
Change your path and Directory as per your Framework Accordingly.
private void convertPDFToCSV(String pdfFilePath) {
String myfolder = Environment.getExternalStorageDirectory() + "/Mycsv";
if (createFolder(myfolder)) {
try {
Document document = new Document();
document.open();
FileOutputStream fos=new FileOutputStream(myfolder + "/MyCSVFile.csv");
StringBuilder parsedText=new StringBuilder();
PdfReader reader1 = new PdfReader(pdfFilePath);
int n = reader1.getNumberOfPages();
for (int i = 0; i <n ; i++) {
parsedText.append(parsedText+PdfTextExtractor.getTextFromPage(reader1, i+1).trim()+"\n") ;
//Extracting the content fromx the different pages
}
StringReader stReader = new StringReader(parsedText.toString());
int t;
while((t=stReader.read())>0)
fos.write(t);
document.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
private boolean createFolder(String myfolder) {
File f = new File(myfolder);
if (!f.exists()) {
if (!f.mkdir()) {
return false;
} else {
return true;
}
}else{
return true;
}
}

Related

PDFBox - read text from multiple PDFs and load it into multiple Text files

I have more than 1000 pdf files in a folder , each one to be converted and saved in its corresponding text file .
I'm a bit new to Java and i'm using PDFBox to make the conversion ; I successfully got the code for one single pdf , but I'm stuck on how to do the conversion for all the PDFS in a single Folder. Can someone help me to achieve that in Java? .
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
public final class ExtractPdf
{
public static void main( String[] args ) throws IOException
{
String fileName = "sample.pdf";
PDDocument document = null;
try (PrintWriter out = new PrintWriter("out.txt"))
{
document = PDDocument.load( new File(fileName));
PDFTextStripper stripper = new PDFTextStripper();
String pdfText = stripper.getText(document).toString();
System.out.println( "Text in the area:" + pdfText);
out.println(pdfText);
}
finally
{
if( document != null )
{
document.close();
}
}
}
}
Thanks, Free

Basically your question is how to go through a directory…
public static void main(String[] args) throws IOException
{
File dir = new File("....");
File[] files = dir.listFiles(new FilenameFilter()
{
// use anonymous inner class
#Override
public boolean accept(File dir, String name)
{
return name.toLowerCase().endsWith(".pdf");
}
});
// null check omitted!
for (File file : files)
{
int len = file.getAbsolutePath().length();
String txtFilename = file.getAbsolutePath().substring(0, len - 4) + ".txt";
// check whether txt file exists omitted
try (OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(txtFilename), Charsets.UTF_8);
PDDocument document = PDDocument.load(file))
{
PDFTextStripper stripper = new PDFTextStripper();
stripper.writeText(document, out);
}
}
// exception catch omitted. Add code here to avoid your whole job
// dying if only one file is broken
}

Reading text of a pdf using PDFBOX occasionally returns \r\n

I’m currently using PDFBox to read the text of a set of pdfs that I’ve inherited.
I’m only interested in reading the text, not making any changes to the file.
The code that works for most of the files is:
File pdfFile = myPath.toFile();
PDDocument document = PDDocument.load(pdfFile );
Writer sw = new StringWriter();
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage( 1 );
stripper.writeText( document, sw );
String documentText = sw.toString()
For most files, I wind up with the text in the documentText field.
But, for 3 of 24 files, the documentText content for the first file is “\r\n”, for the second “\r\n\r\n”, and for the third “\r\n\r\n\r\n:, But the three files are not consecutive. Multiple good files are between each of these files.
The File is derived from a java.nio.Path. The WindowsFileAttribute that is part of the Path has a size of 279K, so the file is not empty on disk.
I can open the file and view the data, and it looks like the other files that my code reads.
I’m using Java 8.0.121, and PDFBox 2.0.4. (this is the latest version, I believe.)
Any suggestions? Is there a better way to read the text? (I’m not interested in the formatting, or fonts used, just the text.)
Thanks.

Reading multiple PDF docs using pdfbox in java
package readwordfile;
import java.io.BufferedReader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
/**
* This is an example on how to extract words from PDF document
*
* #author saravanan
*/
public class GetWordsFromPDF extends PDFTextStripper {
static List<String> words = new ArrayList<String>();
public GetWordsFromPDF() throws IOException {
}
/**
* #param args
* #throws IOException If there is an error parsing the document.
*/
public static void main(String[] args) throws IOException {
String files;
// FileWriter fs = new FileWriter("C:\\Users\\saravanan\\Desktop\\New Text Document (2).txt");
// FileInputStream fstream1 = new FileInputStream("C:\\Users\\saravanan\\Desktop\\New Text Document (2).txt");
// DataInputStream in1 = new DataInputStream(fstream1);
// BufferedReader br1 = new BufferedReader(new InputStreamReader(in1));
String path = "C:\\Users\\saravanan\\Desktop\\New folder\\"; //local folder path name
File folder = new File(path);
File[] listOfFiles = folder.listFiles();
for (int i = 0; i < listOfFiles.length; i++) {
if (listOfFiles[i].isFile()) {
files = listOfFiles[i].getName();
if (files.endsWith(".pdf") || files.endsWith(".PDF")) {
String nfiles = "C:\\Users\\saravanan\\Desktop\\New folder\\";
String fileName1 = nfiles + files;
System.out.print("\n\n" + files+"\n");
PDDocument document = null;
try {
document = PDDocument.load(new File(fileName1));
PDFTextStripper stripper = new GetWordsFromPDF();
stripper.setSortByPosition(true);
stripper.setStartPage(0);
stripper.setEndPage(document.getNumberOfPages());
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);
int x = 0;
System.out.println("");
for (String word : words) {
if (word.startsWith("xxxxxx")) { //here you can give your pdf doc starting word
x = 1;
}
if (x == 1) {
if (!(word.endsWith("YYYYYY"))) { //here you can give your pdf doc ending word
System.out.print(word + " ");
// fs.write(word);
} else {
x = 0;
break;
}
}
}
} finally {
if (document != null) {
document.close();
words.clear();
}
}
}
}
}
}
/**
* Override the default functionality of PDFTextStripper.writeString()
*
* #param str
* #param textPositions
* #throws java.io.IOException
*/
#Override
protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
String[] wordsInStream = str.split(getWordSeparator());
if (wordsInStream != null) {
for (String word : wordsInStream) {
words.add(word); //store the pdf content into the List
}
}
}
}

Search for a string in html file using Jsoup

Can anyone help me with searching for a particular string in HTML file using Jsoup or any other method. There are inbuilt methods but they help in extracting title or script texts inside a specific tags and not string in general.
In this code I have used one such inbuilt method to extract title from the html page.
But I want to search a string instead.
package dynamic_tester;
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class tester {
public static void main(String args[])
{
Document htmlFile = null;
{
try {
htmlFile = Jsoup.parse(new File("x.html"), "ISO-8859-1");
}
catch (IOException e)
{
e.printStackTrace();
}
String title = htmlFile.title();
System.out.println("Title = "+title);
}
}
}

Here's a sample. It reads the HTML file as text String and then performs search on that String.
package com.example;
import java.io.FileInputStream;
import java.nio.charset.Charset;
public class SearchTest {
public static void main(String[] args) throws Exception {
StringBuffer htmlStr = getStringFromFile("test.html", "ISO-8859-1");
boolean isPresent = htmlStr.indexOf("hello") != -1;
System.out.println("is Present ? : " + isPresent);
}
private static StringBuffer getStringFromFile(String fileName, String charSetOfFile) {
StringBuffer strBuffer = new StringBuffer();
try(FileInputStream fis = new FileInputStream(fileName)) {
byte[] buffer = new byte[10240]; //10K buffer;
int readLen = -1;
while( (readLen = fis.read(buffer)) != -1) {
strBuffer.append( new String(buffer, 0, readLen, Charset.forName(charSetOfFile)));
}
} catch(Exception ex) {
ex.printStackTrace();
strBuffer = new StringBuffer();
}
return strBuffer;
}
}

How to solve this iText PDF error?

I'm merging two pdf pages (test1.pdf & test2.pdf) using iText PDF and got the output in test_result.pdf. But the output page has not come out like the input page, it's cropped to half of the actual size. How to overcome this error? Here is my code:
public class MergePDF {
public static void main(String[] args) {
try {
List<InputStream> pdfs = new ArrayList<InputStream>();
pdfs.add(new FileInputStream("test1.pdf"));
pdfs.add(new FileInputStream("test2.pdf/"));
// pdfs.add(new FileInputStream("test_result.pdf/"));
OutputStream output = new FileOutputStream("/home/ant000112/merge_result.pdf");
MergePDF.concatPDFs(pdfs, output, true);
} catch (Exception e) {
e.printStackTrace();
}
}
public static void concatPDFs(List<InputStream> streamOfPDFFiles,
OutputStream outputStream, boolean paginate) {
Document document = new Document();
try {
List<InputStream> pdfs = streamOfPDFFiles;
List<PdfReader> readers = new ArrayList<PdfReader>();
int totalPages = 0;
Iterator<InputStream> iteratorPDFs = pdfs.iterator();
// Create Readers for the pdfs.
while (iteratorPDFs.hasNext()) {
InputStream pdf = iteratorPDFs.next();
PdfReader pdfReader = new PdfReader(pdf);
readers.add(pdfReader);
totalPages += pdfReader.getNumberOfPages();
}
// Create a writer for the outputstream
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
document.open();
//BaseFont bf = BaseFont.createFont(BaseFont.HELVETICA,BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
BaseFont bf = BaseFont.createFont(BaseFont.HELVETICA,BaseFont.CP1257, BaseFont.NOT_EMBEDDED);
PdfContentByte cb = writer.getDirectContent(); // Holds the PDF
// data
PdfImportedPage page;
int currentPageNumber = 0;
int pageOfCurrentReaderPDF = 0;
Iterator<PdfReader> iteratorPDFReader = readers.iterator();
// Loop through the PDF files and add to the output.
while (iteratorPDFReader.hasNext()) {
PdfReader pdfReader = iteratorPDFReader.next();
// Create a new page in the target for each source page.
while (pageOfCurrentReaderPDF < pdfReader.getNumberOfPages()) {
document.newPage();
pageOfCurrentReaderPDF++;
currentPageNumber++;
page = writer.getImportedPage(pdfReader,
pageOfCurrentReaderPDF);
cb.addTemplate(page, 0, 0);
// Code for pagination.
if (paginate) {
cb.beginText();
cb.setFontAndSize(bf, 9);
cb.showTextAligned(PdfContentByte.ALIGN_CENTER, ""+ currentPageNumber + " of " + totalPages, 520,5, 0);
cb.endText();
}
}
pageOfCurrentReaderPDF = 0;
}
outputStream.flush();
document.close();
outputStream.close();
} catch (Exception e) {
e.printStackTrace();
} finally {
if (document.isOpen())
document.close();
System.out.println("ghghklh");
try {
if (outputStream != null)
outputStream.close();
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
}

Did you try reading examples from the iText site? It appears you're doing it in a different way. Like you're not using PdfCopy.
/*
* This class is part of the book "iText in Action - 2nd Edition"
* written by Bruno Lowagie (ISBN: 9781935182610)
* For more info, go to: http://itextpdf.com/examples/
* This example only works with the AGPL version of iText.
*/
package part2.chapter06;
import java.io.FileOutputStream;
import java.io.IOException;
import java.sql.SQLException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
import part1.chapter02.MovieHistory;
import part1.chapter02.MovieLinks1;
public class Concatenate {
/** The resulting PDF file. */
public static final String RESULT
= "results/part2/chapter06/concatenated.pdf";
/**
* Main method.
* #param args no arguments needed
* #throws DocumentException
* #throws IOException
* #throws SQLException
*/
public static void main(String[] args)
throws IOException, DocumentException, SQLException {
// using previous examples to create PDFs
MovieLinks1.main(args);
MovieHistory.main(args);
String[] files = { MovieLinks1.RESULT, MovieHistory.RESULT };
// step 1
Document document = new Document();
// step 2
PdfCopy copy = new PdfCopy(document, new FileOutputStream(RESULT));
// step 3
document.open();
// step 4
PdfReader reader;
int n;
// loop over the documents you want to concatenate
for (int i = 0; i < files.length; i++) {
reader = new PdfReader(files[i]);
// loop over the pages in that document
n = reader.getNumberOfPages();
for (int page = 0; page < n; ) {
copy.addPage(copy.getImportedPage(reader, ++page));
}
copy.freeReader(reader);
}
// step 5
document.close();
}
}
http://itextpdf.com/examples/iia.php?id=123
EDIT: Just to be fair I downloaded the library and I tried the example. It works like a charm.

iText PDF; howto convert jpeg2000 to jpg using Java

I'm trying to solve a problem using java, iText, and the Java advanced imaging library. My software system uses ghostscript to create jpg thumbnail images etc... from PDF files. However on CentOS 5.x the highest version of ghostscript is 8.7 which has a known issue of not being able to handle PDF files containing JPEG 2000 images in them. My plan is to scan the file first and see if it contains jpeg2000 images (I've already got this part figured out); if so, then use iText and the Java Advanced Imaging library (contains the jpeg2000 read & write codecs) to convert the contained jpeg2000 files into regular jpeg files & then pass the new PDF file to ghostscript. The code below attempts this, but results in another file containing jpeg2000 files. Any help with this would be much appreciated.
public class ImageReplacer{
public static void main(String [] args){
try{
String RESULT = "";
PdfReader reader = new PdfReader("pdf_containing_jpeg2000_images.pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
MyImageRenderListener listener = new MyImageRenderListener(RESULT);
MyImageConverterListener clistener = new MyImageConverterListener(RESULT);
clistener.setReader(reader);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
parser.processContent(i, clistener);
}
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("out.pdf"));
stamper.close();
}catch(Exception e){
e.printStackTrace();
}
}
}
class MyImageConverterListener implements RenderListener {
protected String path = "";
protected PdfReader reader;
public MyImageConverterListener(String path) {
this.path = path;
}
public void beginTextBlock() { }
public void endTextBlock() { }
public void renderImage(ImageRenderInfo renderInfo) {
try {
PdfImageObject image = renderInfo.getImage();
PdfName filter = (PdfName)image.get(PdfName.FILTER);
if (PdfName.JPXDECODE.equals(filter)) {
if(image.getDictionary().isStream()){
BufferedImage bi = image.getBufferedImage();
if (bi == null) return;
int width = (int)bi.getWidth();
int height = (int)bi.getHeight();
ByteArrayOutputStream imgBytes = new ByteArrayOutputStream();
ImageIO.write(bi, "JPG", imgBytes);
PRStream stream = new PRStream(reader,imgBytes.toByteArray());
stream.clear();
stream.setData(imgBytes.toByteArray(), false, PRStream.NO_COMPRESSION);
stream.put(PdfName.TYPE, PdfName.XOBJECT);
stream.put(PdfName.SUBTYPE, PdfName.IMAGE);
stream.put(new PdfName("foo"+Math.random()), new PdfName("bar"+Math.random()));
stream.put(PdfName.FILTER, PdfName.DCTDECODE);
stream.put(PdfName.WIDTH, new PdfNumber(width));
stream.put(PdfName.HEIGHT, new PdfNumber(height));
stream.put(PdfName.BITSPERCOMPONENT, new PdfNumber(8));
stream.put(PdfName.COLORSPACE, PdfName.DEVICERGB);
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
public void renderText(TextRenderInfo renderInfo) { }
public void setReader(PdfReader r){
reader = r;
}
}

So I managed to solve this one on my own (with a little help from iText in action by Bruno Lowagie - great book). Just to re-iterate, my intention is to scan a PDF using iText to see if it contains any JPEG2000 images and if it does output the same PDF but with the inner JPEG2000 images replaced with regular JPEG images. This solves the fatal ghostscript 8.7 'Unable to process JPXDecode data' error, but would also provide useful for making PDF's iOS compatible.
So without further a do kids; here goes...
Step 1) Download iText 5.x .jar file, and download jai_imageio-1.1.jar (the Java advanced imaging library that allows you to convert JPEG2000 files)
Step 2) Create a file called PDFConverter.java and put this code in it:
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import com.itextpdf.text.pdf.PdfNumber;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import com.itextpdf.text.pdf.PdfStamper;
import java.io.*;
public class PDFConverter{
public static void main(String [] args){
if(args.length==1){
if(hasJpeg2000(args[0])){
System.out.println("Contains JPEG2000 images: Converting them to JPEG...");
convertPDF(args[0]);
System.out.println("Done...");
}else{
System.out.println("Doesn't contain any JPEG2000 images: Nothing to be done...");
}
}else{
System.out.println("Please specify a PDF filename as a command line argument!");
}
}
public static boolean hasJpeg2000(String s){
try{
PdfReader reader = new PdfReader(s);
int n = reader.getXrefSize();
PdfObject object;
PRStream stream;
for (int i = 0; i < n; i++) {
object = reader.getPdfObject(i);
if (object == null || !object.isStream())continue;
stream = (PRStream)object;
PdfImageObject image = new PdfImageObject(stream);
PdfName filter = (PdfName)image.get(PdfName.FILTER);
if (PdfName.JPXDECODE.equals(filter)) {
return true;
}
}
}catch(Exception e){
e.printStackTrace();
}
return false;
}
public static void convertPDF(String s){
try{
PdfReader reader = new PdfReader(s);
int n = reader.getXrefSize();
PdfObject object;
PRStream stream;
for (int i = 0; i < n; i++) {
object = reader.getPdfObject(i);
if (object == null || !object.isStream())continue;
stream = (PRStream)object;
PdfImageObject image = new PdfImageObject(stream);
PdfName filter = (PdfName)image.get(PdfName.FILTER);
if (PdfName.JPXDECODE.equals(filter)) {
BufferedImage bi = image.getBufferedImage();
if (bi == null) continue;
int width = (int)(bi.getWidth());
int height = (int)(bi.getHeight());
ByteArrayOutputStream imgBytes = new ByteArrayOutputStream();
ImageIO.write(bi, "JPG", imgBytes);
stream.clear();
stream.setData(imgBytes.toByteArray(),false, PRStream.NO_COMPRESSION);
stream.put(PdfName.TYPE, PdfName.XOBJECT);
stream.put(PdfName.SUBTYPE, PdfName.IMAGE);
stream.put(new PdfName("foo"+Math.random()), new PdfName("bar"+Math.random()));
stream.put(PdfName.FILTER, PdfName.DCTDECODE);
stream.put(PdfName.WIDTH, new PdfNumber(width));
stream.put(PdfName.HEIGHT, new PdfNumber(height));
stream.put(PdfName.BITSPERCOMPONENT,new PdfNumber(8));
stream.put(PdfName.COLORSPACE, PdfName.DEVICERGB);
}
}
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("out.pdf")); stamper.close();
}catch(Exception e){
e.printStackTrace();
}
}
}
Step 3)Compile the above file in the following manner:
javac -cp .:iText-5.0.4.jar:jai_imageio-1.1.jar PDFConverter.java
Step 4)Run the program with a PDF...
java -cp .:iText-5.0.4.jar:jai_imageio-1.1.jar PDFConverter PDFFileName.pdf
Booyah...

Works great, but I had some issues with GlassFish v3.1. Glassfish acted as if there was no jai_imageio-1.1.jar in Classpath. I fixed this putting jai_imageio.jar in my "/path/to/glassfish/domains/domain1/lib/ext/" folder.

I have had some NullPointer problems with the PDFConverter from Reece, because my PDF had different types of embedded elements inside GhostScript on CentOS 5.3 - Unable to process JPXDecode data. So I make some Object/Type checks and added the output filename to command line.
Everything else is great and worked perfect to the image jpeg2000 problem. Thanks to Reece :)
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.*;
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import com.itextpdf.text.pdf.PdfNumber;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import com.itextpdf.text.pdf.PdfStamper;
import java.io.*;
public class PDFConverter{
public static void main(String [] args){
if(args.length==2){
if(hasJpeg2000(args[0])){
System.out.println("Contains JPEG2000 images: Converting them to JPEG...");
convertPDF(args[0], args[1]);
System.out.println("Done...");
}else{
System.out.println("Doesn't contain any JPEG2000 images: Nothing to be done...");
}
}else{
System.out.println("Please specify a PDF filename and a output filename as a command line arguments!");
}
}
public static boolean hasJpeg2000(String s){
try{
PdfReader reader = new PdfReader(s);
int n = reader.getXrefSize();
PdfObject object;
PRStream stream;
for (int i = 0; i < n; i++) {
object = reader.getPdfObject(i);
if (object == null || !object.isStream())continue;
stream = (PRStream)object;
PdfObject pdfsubtype = stream.get(PdfName.SUBTYPE);
System.out.println(pdfsubtype);
if (pdfsubtype != null && pdfsubtype.toString().equals(PdfName.IMAGE.toString())) {
PdfImageObject image = new PdfImageObject(stream);
PdfName filter = (PdfName)image.get(PdfName.FILTER);
if (PdfName.JPXDECODE.equals(filter)) {
return true;
}
}
}
}catch(Exception e){
e.printStackTrace();
}
return false;
}
private static void filterObject(PdfImageObject image,PdfName filter,PRStream stream) throws java.io.IOException {
if (PdfName.JPXDECODE.equals(filter)) {
BufferedImage bi = image.getBufferedImage();
if (bi == null) return;
int width = (int)(bi.getWidth());
int height = (int)(bi.getHeight());
ByteArrayOutputStream imgBytes = new ByteArrayOutputStream();
ImageIO.write(bi, "JPG", imgBytes);
stream.clear();
stream.setData(imgBytes.toByteArray(),false, PRStream.NO_COMPRESSION);
stream.put(PdfName.TYPE, PdfName.XOBJECT);
stream.put(PdfName.SUBTYPE, PdfName.IMAGE);
stream.put(new PdfName("foo"+Math.random()), new PdfName("bar"+Math.random()));
stream.put(PdfName.FILTER, PdfName.DCTDECODE);
stream.put(PdfName.WIDTH, new PdfNumber(width));
stream.put(PdfName.HEIGHT, new PdfNumber(height));
stream.put(PdfName.BITSPERCOMPONENT,new PdfNumber(8));
stream.put(PdfName.COLORSPACE, PdfName.DEVICERGB);
}
}
public static void convertPDF(String s, String out){
try{
PdfReader reader = new PdfReader(s);
int n = reader.getXrefSize();
PdfObject object;
PRStream stream;
for (int i = 0; i < n; i++) {
object = reader.getPdfObject(i);
if (object == null || !object.isStream())continue;
stream = (PRStream)object;
PdfObject pdfsubtype = stream.get(PdfName.SUBTYPE);
if (pdfsubtype != null && pdfsubtype.toString().equals(PdfName.IMAGE.toString())) {
PdfImageObject image = new PdfImageObject(stream);
Object listOrName = image.get(PdfName.FILTER);
if (listOrName instanceof PdfName) {
PdfName filter = (PdfName)image.get(PdfName.FILTER);
filterObject(image, filter, stream);
}
else if (listOrName instanceof PdfArray) {
PdfArray list = (PdfArray)image.get(PdfName.FILTER);
for (int j = 0; j < list.size(); j++) {
PdfName filter = list.getAsName(j);
filterObject(image, filter, stream);
}
}
else {
System.err.println("Unknown Obejcttype: " + listOrName);
}
}
}
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(out)); stamper.close();
} catch(Exception e){
e.printStackTrace();
}
}
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

itext Converting PDF to csv - java

Related

PDFBox - read text from multiple PDFs and load it into multiple Text files

Reading text of a pdf using PDFBOX occasionally returns \r\n

Search for a string in html file using Jsoup

How to solve this iText PDF error?

iText PDF; howto convert jpeg2000 to jpg using Java

Categories

Resources