Can duplicating a PDF with PDFBox be as small as with iText? - java

I am reading in a PDF and outputting a PDF with multiple copies of the original PDF in it. I test by doing the same thing with both PDFBox and iText. iText creates a much smaller output if I duplicate each page individually.
The question: Is there another way to do this in PDFBox that results in smaller output PDFs?
For one example input file, generating two copies to the output with both tools:
Original PDF size: 30K
PDFBox (v 1.7.1) generated PDF: 84K
iText (v 5.3.4) generated PDF: 35K
Java code for PDFBox (sorry to inflict error handling on you). Notice how it reads the input over and over and duplicates it as a whole:
PDFMergerUtility merger = new PDFMergerUtility();
PDDocument workplace = null;
try {
    for (int cnt = 0; cnt < COPIES; ++cnt) {
        PDDocument document = null;
        InputStream stream = null;
        try {
            stream = new FileInputStream(new File(sourceFileName));
            document = PDDocument.load(stream);
            if (workplace == null) {
                workplace = document;
            } else {
                merger.appendDocument(workplace, document);
            }
        } finally {
            if (document != null && document != workplace) {
                document.close();
            }
            if (stream != null) {
                stream.close();
            }
        }
    }
    OutputStream out = null;
    try {
        out = new FileOutputStream(new File(destinationFileName));
        workplace.save(out);
    } finally {
        if (out != null) {
            out.close();
        }
    }
} catch (COSVisitorException e1) {
    e1.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (workplace != null) {
        try {
            workplace.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Code to do it with iText. Notice how it loads the input file page by page and transfers each page to the output:
Document document = null;
PdfReader reader = null;
InputStream inputStream = null;
FileOutputStream outputStream = null;
try {
    inputStream = new FileInputStream(new File(sourceFileName));
    outputStream = new FileOutputStream(new File(destinationFileName));
    document = new Document();
    PdfCopy copy = new PdfSmartCopy(document, outputStream);
    document.open();
    reader = new PdfReader(inputStream);
    // loop over the pages in that document
    int pdfPageNo = reader.getNumberOfPages();
    for (int page = 0; page < pdfPageNo;) {
        PdfImportedPage onePage = copy.getImportedPage(reader, ++page);
        // duplicate each page N times
        for (int i = 0; i < COPIES; ++i) {
            copy.addPage(onePage);
        }
    }
    copy.freeReader(reader);
} catch (DocumentException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (reader != null) {
        reader.close();
    }
    if (document != null) {
        document.close();
    }
    try {
        if (inputStream != null) {
            inputStream.close();
        }
        if (outputStream != null) {
            outputStream.close();
        }
    } catch (IOException e) {
        // do nothing
    }
}
Both are surrounded by this:
public class Duplicate {

    /** The original PDF file */
    private static final String sourceFileName = "PDF_CI_US2CA.pdf";
    /** The resulting PDF file. */
    private static final String destinationFileName = "itext_output.pdf";

    private static final int COPIES = 2;

    public static void main(String[] args) {
        ...
    }
}

Using the following solution, I was able to create a PDF file with many duplicate pages and have a minimal impact on storage.
PDDocument samplePdf = null;
try {
    samplePdf = PDDocument.load(PDF_PATH);
    PDPage page = (PDPage) samplePdf.getDocumentCatalog().getAllPages().get(0);
    for (int i = 0; i < COPIES; i++) {
        samplePdf.importPage(page);
    }
    samplePdf.save(SAVE_PATH);
} catch (IOException e) {
    e.printStackTrace();
} catch (COSVisitorException e) {
    e.printStackTrace();
}
In my first attempt I used samplePdf.addPage(page), but it didn't work as expected. So obviously there is a difference between the add and import functions; I'll have to check the source or documentation to see why. Anyway, this should help you devise a solution for your needs with PDFBox.
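For the multi-page case in the question, a minimal sketch along the same lines (PDFBox 1.x, assuming importPage keeps sharing the underlying page resources as the answer above suggests) could look like this; whether the saved file stays close to the original size is worth verifying on your own documents:
// Hypothetical sketch: duplicate every page of the source COPIES times by
// re-importing the original page objects instead of appending whole documents.
PDDocument doc = null;
try {
    doc = PDDocument.load(sourceFileName);
    List<?> pages = doc.getDocumentCatalog().getAllPages();
    int originalCount = pages.size();            // remember the size before adding copies
    for (int copy = 1; copy < COPIES; copy++) {  // the loaded document already is copy #1
        for (int i = 0; i < originalCount; i++) {
            doc.importPage((PDPage) pages.get(i));
        }
    }
    doc.save(destinationFileName);
} catch (IOException e) {
    e.printStackTrace();
} catch (COSVisitorException e) {
    e.printStackTrace();
} finally {
    if (doc != null) {
        try {
            doc.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}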

Related

Extract pdf attachment on AWS S3 using iText Java

I am using the iText Java code below to extract attachments from a PDF file. It works fine on a local system: it extracts an XML file from the PDF and stores it at strOutputPath. I want to perform this operation on AWS S3: the PDF file will be on S3 and the attachment should be extracted to S3. How can I use the absolute path of a file on S3 in this case? I used s3client.getUrl().toExternalForm(); but I get an HTTP 403 error.
import java.util.Iterator;
import java.util.Set;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.File;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfDictionary;
import java.io.IOException;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;

public class app {

    public static void main(final String[] args) {
        try {
            final String strInputPath = args[0];
            final String strOutputPath = args[1];
            final PdfReader pdfReader = new PdfReader(strInputPath);
            final PdfDictionary catalog = pdfReader.getCatalog();
            final PdfDictionary names = catalog.getAsDict(PdfName.NAMES);
            final PdfDictionary embeddedFiles = names.getAsDict(PdfName.EMBEDDEDFILES);
            final PdfArray embeddedFilesArray = embeddedFiles.getAsArray(PdfName.NAMES);
            for (int i = 0; i < embeddedFilesArray.size(); ++i) {
                final PdfDictionary FileSpec = embeddedFilesArray.getAsDict(i);
                if (FileSpec != null) {
                    String strFileName = FileSpec.getAsString(PdfName.F).toString();
                    System.out.println(strFileName);
                    if (strFileName.endsWith(".xml")) {
                        strFileName = String.valueOf(System.currentTimeMillis()) + ".xml";
                        extractFiles(pdfReader, FileSpec, String.valueOf(strOutputPath) + strFileName);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void extractFiles(final PdfReader pdfReader, final PdfDictionary filespec, final String strFileName) {
        final PdfDictionary refs = filespec.getAsDict(PdfName.EF);
        PRStream prStream = null;
        FileOutputStream outputStream = null;
        final Set<PdfName> keys = (Set<PdfName>) refs.getKeys();
        try {
            for (final PdfName key : keys) {
                prStream = (PRStream) PdfReader.getPdfObject((PdfObject) refs.getAsIndirectObject(key));
                outputStream = new FileOutputStream(new File(strFileName));
                outputStream.write(PdfReader.getStreamBytes(prStream));
                outputStream.flush();
                outputStream.close();
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e2) {
            e2.printStackTrace();
        } finally {
            try {
                if (outputStream != null) {
                    outputStream.close();
                }
            } catch (IOException e3) {
                e3.printStackTrace();
            }
        }
    }
}
I think what you need to do is write a Java client that works on the files in your S3 bucket and performs the following steps:
Downloads the required file from S3.
Extracts the attachment from the file.
Uploads the resulting files back to S3.
Sample code to perform the above steps is as follows:
import java.io.*;
import java.util.Set;
import com.amazonaws.services.s3.*;
import com.amazonaws.services.s3.model.*;
import com.itextpdf.text.pdf.*;

public class S3PDFAttachmentExtractor {

    public static void main(String[] args) throws IOException {
        // download file from S3
        AmazonS3Client amazonS3Client = new AmazonS3Client();
        S3Object object = amazonS3Client.getObject("<yours3location>", "fileKey");

        // write the file content to a local file.
        S3ObjectInputStream objectContent = object.getObjectContent();
        FileOutputStream out = new FileOutputStream("tempOutputFile.pdf");
        writeToFile(objectContent, out);

        // Extract attachment from the downloaded file.
        extractAttachment("tempOutputFile.pdf", "tempAttachement.xml");

        // upload the attachment
        uploadFile("<s3bucket.fully.qualified.name>", "tempAttachement.xml", "attachementNameOnS3.xml");
    }

    private static void writeToFile(InputStream input, FileOutputStream out) throws IOException {
        // Copy the input stream to the output file in fixed-size chunks.
        try (BufferedInputStream in = new BufferedInputStream(input)) {
            byte[] chunk = new byte[1024];
            int len;
            while ((len = in.read(chunk)) != -1) {
                out.write(chunk, 0, len); // only write the bytes actually read
            }
        } finally {
            input.close();
            out.close();
        }
    }

    public static void extractAttachment(final String strInputPath, final String strOutputPath) {
        try {
            final PdfReader pdfReader = new PdfReader(strInputPath);
            final PdfDictionary catalog = pdfReader.getCatalog();
            final PdfDictionary names = catalog.getAsDict(PdfName.NAMES);
            final PdfDictionary embeddedFiles = names.getAsDict(PdfName.EMBEDDEDFILES);
            final PdfArray embeddedFilesArray = embeddedFiles.getAsArray(PdfName.NAMES);
            for (int i = 0; i < embeddedFilesArray.size(); ++i) {
                final PdfDictionary FileSpec = embeddedFilesArray.getAsDict(i);
                if (FileSpec != null) {
                    String strFileName = FileSpec.getAsString(PdfName.F).toString();
                    System.out.println(strFileName);
                    if (strFileName.endsWith(".xml")) {
                        strFileName = String.valueOf(System.currentTimeMillis()) + ".xml";
                        extractFiles(pdfReader, FileSpec, String.valueOf(strOutputPath) + strFileName);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void extractFiles(final PdfReader pdfReader, final PdfDictionary filespec, final String strFileName) {
        final PdfDictionary refs = filespec.getAsDict(PdfName.EF);
        PRStream prStream = null;
        FileOutputStream outputStream = null;
        final Set<PdfName> keys = (Set<PdfName>) refs.getKeys();
        try {
            for (final PdfName key : keys) {
                prStream = (PRStream) PdfReader.getPdfObject((PdfObject) refs.getAsIndirectObject(key));
                outputStream = new FileOutputStream(new File(strFileName));
                outputStream.write(PdfReader.getStreamBytes(prStream));
                outputStream.flush();
                outputStream.close();
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e2) {
            e2.printStackTrace();
        } finally {
            try {
                if (outputStream != null) {
                    outputStream.close();
                }
            } catch (IOException e3) {
                e3.printStackTrace();
            }
        }
    }

    private static void uploadFile(String bucketFullPath, String fileLocation, String fileName) throws IOException {
        AmazonS3Client amazonS3Client = new AmazonS3Client();
        InputStream bis = new FileInputStream(fileLocation);
        ObjectMetadata objectMetadata = new ObjectMetadata();
        objectMetadata.setContentType("application/xml");
        amazonS3Client.putObject(bucketFullPath, fileName, bis, objectMetadata);
    }
}
Please note that a better way to do this type of thing is to write an AWS Lambda function in Java using the above code. Since AWS Lambda can easily be configured to process events from S3 storage, your code will be invoked automatically whenever a file is written or modified in the S3 bucket. For further details, see the AWS Lambda documentation.
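A minimal sketch of such a handler (assuming the aws-lambda-java-core and aws-lambda-java-events dependencies, a hypothetical handler class name, and reuse of the extractAttachment method from the class above):
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;
import java.io.File;

public class AttachmentExtractorHandler implements RequestHandler<S3Event, String> {

    @Override
    public String handleRequest(S3Event event, Context context) {
        // The S3 event carries the bucket and key of the object that triggered us.
        String bucket = event.getRecords().get(0).getS3().getBucket().getName();
        String key = event.getRecords().get(0).getS3().getObject().getKey();

        AmazonS3Client s3 = new AmazonS3Client();
        // Lambda functions may only write under /tmp.
        File localPdf = new File("/tmp/input.pdf");
        s3.getObject(new GetObjectRequest(bucket, key), localPdf);

        // Reuse the extraction logic from the class above; it writes the XML
        // attachment(s) under the given output prefix.
        S3PDFAttachmentExtractor.extractAttachment(localPdf.getPath(), "/tmp/");

        // The extracted file(s) can then be uploaded back with putObject,
        // as in uploadFile() above.
        return "done";
    }
}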
Edit:
Another alternative: if you are running the Java code on AWS EC2, there is a way to mount an S3 bucket as a file system. This lets you access the files as if they were stored locally, and your original code would work unchanged. Note that this approach only works in an AWS EC2 environment.

JSF Primefaces p:fileDownload file name contains UTF-8 characters

I am working with Java 8, JSF 2, and Primefaces 5.1.
Conversion to PDF or Docx works, but when I display the file name, it just skips the UTF-8 encoded letters, in my case Lithuanian letters like ą,č,ę,ė,į,š,ų,ū.
What I have tried so far is:
<h:form enctype="multipart/form-data;charset=UTF-8">
Charset.forName("UTF-8").encode(myString)
or
byte[] bytes = templateTitle.getBytes(Charset.forName("UTF-8"));
String title = new String(bytes, Charset.forName("UTF-8"));
or
UTF-8 text is garbled when form is posted as multipart/form-data
I checked some tutorials about encoding, still no use.
I also checked this, but I just do not understand the example...
Primefaces fileDownload non-english file names corrupt
my code:
Download file as docx
public void downloadTemplateAsDocx() throws Exception {
    try {
        InputStream content = null;
        String objID = this.actData.getMainActs().get(0).getId();
        ContentStream cmisStream = folderCatalogue.getDocumentContentStream(objID);
        content = cmisStream.getStream();

        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
        AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/hw.html"));
        afiPart.setBinaryData(content);
        afiPart.setContentType(new ContentType("text/html"));
        Relationship altChunkRel = wordMLPackage.getMainDocumentPart().addTargetPart(afiPart);
        CTAltChunk ac = Context.getWmlObjectFactory().createCTAltChunk();
        ac.setId(altChunkRel.getId());
        wordMLPackage.getMainDocumentPart().addObject(ac);
        wordMLPackage.getContentTypeManager().addDefaultContentType("html", "text/html");

        File fileTmp = File.createTempFile("tempDocFile", "docx");
        wordMLPackage.save(fileTmp);

        streamedContent = new DefaultStreamedContent(new FileInputStream(fileTmp), cmisStream.getMimeType(),
                templateTitle + ".docx", "UTF-8");
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (InvalidFormatException eInv) {
        eInv.printStackTrace();
    } catch (IOException ioEx) {
        ioEx.printStackTrace();
    } catch (Docx4JException docxEx) {
        docxEx.printStackTrace();
    }
}
Code for the .pdf file download:
public void downloadTemplateAsPdf() {
    try {
        InputStream content = null;
        String objID = this.actData.getMainActs().get(0).getId();
        ContentStream cmisStream = folderCatalogue.getDocumentContentStream(objID);
        content = cmisStream.getStream();

        File fileTmp = File.createTempFile("tempFile", "pdf");
        OutputStream fileStream = new FileOutputStream(fileTmp);

        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, fileStream);
        document.open();
        XMLWorkerHelper worker = XMLWorkerHelper.getInstance();
        worker.parseXHtml(writer, document, content, Charset.forName("UTF-8"));
        document.close();
        fileStream.close();

        streamedContent = new DefaultStreamedContent(new FileInputStream(fileTmp), cmisStream.getMimeType(),
                templateTitle + ".pdf");
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        System.out.println("File was not found");
    } catch (IOException ex) {
        ex.printStackTrace();
    } catch (Exception exeption) {
        exeption.printStackTrace();
    }
}
EDIT:
<p:fileDownload value="#{controller.streamedContent}" />
private StreamedContent streamedContent;
Solution:
String title = URLEncoder.encode(templateTitle, "UTF-8");
StringBuilder fileName = new StringBuilder(title);
if (title.contains("+")) {
    for (int i = 0; i < title.length(); i++) {
        if (title.charAt(i) == '+') {
            fileName.setCharAt(i, ' ');
        }
    }
}
This encoding works fine; it just replaces all spaces with +, which is why I loop over the string afterwards.
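The loop can also be collapsed into a single call; a sketch of the same idea, assuming the same URLEncoder approach:
// URL-encode the title, then map '+' (the encoding of spaces) back to spaces.
// Literal '+' characters in the original title are encoded as %2B, so they are not affected.
String title = URLEncoder.encode(templateTitle, "UTF-8").replace('+', ' ');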

Is it possible to convert HTML into XHTML with Jsoup 1.8.1?

String body = "<br>";
Document document = Jsoup.parseBodyFragment(body);
document.outputSettings().escapeMode(EscapeMode.xhtml);
String str = document.body().html();
System.out.println(str);
expect: <br />
result: <br>
Can Jsoup convert HTML into XHTML?
See Document.OutputSettings.Syntax.xml:
private String toXHTML(String html) {
    final Document document = Jsoup.parse(html);
    document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
    return document.html();
}
You have to tell Jsoup which output syntax you want, HTML or XML.
public String parserXHtml(String html) {
    org.jsoup.nodes.Document document = Jsoup.parseBodyFragment(html);
    document.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml); // this will ensure the validity
    document.outputSettings().charset("UTF-8");
    return document.toString();
}
You can use the JTidy API to do this. Use jtidy-r938.jar.
You can use the following method to get XHTML from HTML:
public static String getXHTMLFromHTML(String inputFile, String outputFile) throws Exception {
    File file = new File(inputFile);
    FileOutputStream fos = null;
    InputStream is = null;
    try {
        fos = new FileOutputStream(outputFile);
        is = new FileInputStream(file);
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        tidy.parse(is, fos);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } finally {
        if (fos != null) {
            try {
                fos.close();
            } catch (IOException e) {
                fos = null;
            }
            fos = null;
        }
        if (is != null) {
            try {
                is.close();
            } catch (IOException e) {
                is = null;
            }
            is = null;
        }
    }
    return outputFile;
}
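A hypothetical call, with placeholder file names, from code that declares throws Exception:
// Converts input.html and writes the XHTML result to output.xhtml.
String written = getXHTMLFromHTML("input.html", "output.xhtml");
System.out.println("Wrote " + written);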

Fileinput stream / loading a simple txt file

Does anyone know why this crashes? All I'm doing is reading in a txt file from my raw folder, and when I click the load button in the other activity window the code breaks when I access the variable testing in the file reader object on click: log.d(null, ReadFileObject.fileText). Thanks in advance!
public class ReadFile extends Activity {

    public String test;
    public String testing;

    protected void onCreate(Bundle savedInstanceState) {
    }

    public void fileText() {
        InputStream fis;
        fis = getResources().openRawResource(R.raw.checkit);
        byte[] input;
        try {
            input = new byte[fis.available()];
            while (fis.read() != -1) {
                test += new String(input);
            }
            testing = test;
            fis.close();
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
            System.out.println(e.getMessage());
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
            System.out.println(e.getMessage());
        }

        /* InputStream fis = null;
        try {
            fis = getResources().openRawResource(R.raw.checkit);
            BufferedReader br = new BufferedReader(new InputStreamReader(fis));
            String nextLine;
            int i = 0, j = 0;
            while ((nextLine = br.readLine()) != null) {
                if (j == 5) {
                    j = 0;
                    i++;
                }
                test += nextLine;
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println(e.getMessage());
        } finally {
            if (fis != null) {
                try { fis.close(); }
                catch (IOException ignored) {}
            }
        } */
    }
}
Your code is broken here:
byte[] input;
input = new byte[fis.available()];
while (fis.read() != -1) {
    test += new String(input);
}
testing = test;
fis.close();
In Java, available() is unreliable (read the Javadoc) and may even return 0. You should instead use a loop similar to:
InputStream fis = getResources().openRawResource(R.raw.checkit);
try {
    byte[] buffer = new byte[4096]; // 4K buffer
    int len = 0;
    while ((len = fis.read(buffer)) != -1) {
        test += new String(buffer, 0, len);
    }
    testing = test;
} catch (IOException ioe) {
    ioe.printStackTrace();
    // make sure you do any other appropriate handling.
} finally {
    try {
        fis.close();
    } catch (IOException ignored) {
        // close() can also throw; ignore it here
    }
}
(although using string concatenation is probably not the best idea, use a StringBuilder).
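A sketch of that suggestion (assuming the same R.raw.checkit resource; note that decoding raw byte chunks can split multi-byte characters, so an InputStreamReader would be even safer):
// Hypothetical variant: accumulate into a StringBuilder instead of concatenating Strings.
InputStream fis = getResources().openRawResource(R.raw.checkit);
StringBuilder sb = new StringBuilder();
try {
    byte[] buffer = new byte[4096];
    int len;
    while ((len = fis.read(buffer)) != -1) {
        sb.append(new String(buffer, 0, len));
    }
    testing = sb.toString();
} catch (IOException ioe) {
    ioe.printStackTrace();
} finally {
    try { fis.close(); } catch (IOException ignored) {}
}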
Your class extends Activity, but there is nothing inside onCreate. If you just need a simple Java program, create a plain Java project instead. Since you extend Activity, you should call setContentView(yourLayout) in onCreate, then call your method from onCreate and do your work there.
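A minimal sketch of that structure (R.layout.read_file is a hypothetical layout name):
@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState); // required call to the superclass
    setContentView(R.layout.read_file); // hypothetical layout for this Activity
    fileText();                         // populate test/testing before anything reads them
}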

JTidy Java API to convert HTML to XHTML

I am using JTidy to convert from HTML to XHTML, but I found this tag in my XHTML file: .
Can I prevent it?
This is my code:
// from html to xhtml
try {
    fis = new FileInputStream(htmlFileName);
} catch (java.io.FileNotFoundException e) {
    System.out.println("File not found: " + htmlFileName);
}

Tidy tidy = new Tidy();
tidy.setShowWarnings(false);
tidy.setXmlTags(false);
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setXHTML(true);
tidy.setMakeClean(true);
Document xmlDoc = tidy.parseDOM(fis, null);
try {
    tidy.pprint(xmlDoc, new FileOutputStream("c.xhtml"));
} catch (Exception e) {
}
I only had success when the input is treated as XML as well. So either set xmlTags to true
tidy.setXmlTags(true);
and live with the errors and warnings, or do the conversion twice:
a first conversion to sanitize the HTML (HTML to XHTML), and a second conversion from XHTML to XHTML with xmlTags set, so that no errors or warnings occur.
String htmlFileName = "test.html";

try (InputStream in = Thread.currentThread().getContextClassLoader().getResourceAsStream(htmlFileName);
     FileOutputStream fos = new FileOutputStream("tmp.xhtml")) {
    Tidy tidy = new Tidy();
    tidy.setShowWarnings(true);
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setXHTML(true);
    tidy.setMakeClean(true);
    Document xmlDoc = tidy.parseDOM(in, fos);
} catch (Exception e) {
    e.printStackTrace();
}

try (InputStream in = new FileInputStream("tmp.xhtml");
     FileOutputStream fos = new FileOutputStream("c.xhtml")) {
    Tidy tidy = new Tidy();
    tidy.setShowWarnings(true);
    tidy.setXmlTags(true);
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setXHTML(true);
    tidy.setMakeClean(true);
    Document xmlDoc = tidy.parseDOM(in, null);
    tidy.pprint(xmlDoc, fos);
} catch (Exception e) {
    e.printStackTrace();
}
I used the latest jtidy version 938.
I created a function that parses the XHTML code, removes the unwelcome tags, and adds a link to the CSS file "tableStyle.css":
public static String xhtmlparser() {
    String Cleanline = "";
    try {
        // read the generated XHTML file line by line
        FileInputStream fstream = new FileInputStream("c.xhtml");
        BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
        String strLine = null;
        int linescounter = 0;
        while ((strLine = br.readLine()) != null) { // read every line in the file
            String m = strLine.replaceAll(" ", "");
            linescounter++;
            if (linescounter == 5)
                m = m + "\n" + "<link rel=" + "\"stylesheet\" " + "type=" + "\"text/css\" " + "href= " + "\"tableStyle.css\"" + "/>";
            Cleanline += m + "\n";
        }
    } catch (IOException e) {
    }
    return Cleanline;
}
But is it good from a performance point of view?
By the way, it works well.
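Regarding the performance concern: the repeated String concatenation inside the loop is the main cost. A sketch of the same loop body and return using a StringBuilder (keeping the same file reading setup and CSS link) would be:
// Hypothetical rewrite of the loop using StringBuilder instead of += on a String.
StringBuilder clean = new StringBuilder();
String strLine;
int linescounter = 0;
while ((strLine = br.readLine()) != null) {
    String m = strLine.replaceAll(" ", "");
    if (++linescounter == 5) {
        m += "\n<link rel=\"stylesheet\" type=\"text/css\" href=\"tableStyle.css\"/>";
    }
    clean.append(m).append('\n');
}
return clean.toString();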
You can use the following method to get XHTML from HTML:
public static String getXHTMLFromHTML(String inputFile, String outputFile) throws Exception {
    File file = new File(inputFile);
    FileOutputStream fos = null;
    InputStream is = null;
    try {
        fos = new FileOutputStream(outputFile);
        is = new FileInputStream(file);
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        tidy.parse(is, fos);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } finally {
        if (fos != null) {
            try {
                fos.close();
            } catch (IOException e) {
                fos = null;
            }
            fos = null;
        }
        if (is != null) {
            try {
                is.close();
            } catch (IOException e) {
                is = null;
            }
            is = null;
        }
    }
    return outputFile;
}
