PDFBox - options to increase the performance

I have two questions about the PDFBox library (Java):
1. I have just started using PDFBox and, although it works well, I couldn't
help noticing that it runs slower than iText (the other PDF library I have used) when
merging with ut.mergeDocuments() (compared with iText's concat_pdf.main(..)).
Does anyone know whether, and how, I can improve the performance of this tool?
2. PDFBox seems to be stricter about encrypted files. iText lets
me merge encrypted PDFs, but PDFBox throws an exception:
PDFBoxConcat failed
java.io.IOException: Error: destination PDF is encrypted, can't append encrypted PDF documents.
Does anyone know why this works in iText but not in PDFBox?
My guess is that iText is sophisticated enough to know exactly what is encrypted
and to allow operations accordingly, while PDFBox only checks whether the document is encrypted at all.
Can anyone confirm this?
Here is the (open-source) PDFBox code for the mergeDocuments() method, where you can see the encryption check:
    if( destination.isEncrypted() )
    {
        throw new IOException( "Error: destination PDF is encrypted, can't append encrypted PDF documents." );
    }
I tried commenting this check out, but the merged document came out as gibberish.
Here are some code examples of my attempts to improve performance.
These are the three different approaches I tried:
    private static void PDFBoxConcat(String filePath) {
        PDFMergerUtility ut = new PDFMergerUtility();
        for (int i = 0; i < 50; i++) {
            ut.addSource(filePath);
        }
        ut.setDestinationFileName("C:\\amdocs\\sensis\\dlv858\\pdfBox" + testNum + ".pdf");
        try {
            ut.mergeDocuments();
        } catch (Exception e) {
            System.out.println("PDFBoxConcat failed");
            e.printStackTrace();
        }
    }

    private static void PDFBoxConcat2(String filePath) {
        // 50 source files followed by the destination file as the last argument
        String[] fileNamesArray = new String[51];
        int i = 0;
        for (i = 0; i < 50; i++) {
            fileNamesArray[i] = filePath;
        }
        fileNamesArray[i] = "C:\\amdocs\\sensis\\dlv858\\pdfM" + testNum + ".pdf";
        try {
            PDFMerger.main(fileNamesArray);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void PDFBoxConcat3(String filePath) {
        ArrayList<InputStream> list = new ArrayList<InputStream>();
        PDFMergerUtility ut = new PDFMergerUtility();
        // Note: no destination file name is set in this variant.
        try {
            // new FileInputStream(..) throws a checked FileNotFoundException,
            // so the streams are opened inside the try block.
            for (int i = 0; i < 50; i++) {
                InputStream inputStream = new FileInputStream(filePath);
                list.add(inputStream);
            }
            ut.addSources(list);
            ut.mergeDocuments();
        } catch (Exception e) {
            System.out.println("PDFBoxConcat failed");
            e.printStackTrace();
        }
    }

Concerning your first question (how to improve the performance of this tool, i.e. Apache PDFBox's PDFMergerUtility):
the following setting reduced my merge time by about 75%:
    pdfMergerUtility.setDocumentMergeMode(PDFMergerUtility.DocumentMergeMode.OPTIMIZE_RESOURCES_MODE);

Related

Huffman code - can't decompress BMP files using BitSet

I built a classic Huffman coder, with an encoder and a decoder. I use a BitSet to compress the input file, but decoding does not work for every file I try: a txt file decodes fine, while other files, such as BMP, do not.
Before I used BitSet the code worked, but without any compression, so I suspect the problem is with BitSet.
The decoder I built is:
    public void Decompress(String[] input_names, String[] output_names) {
        HuffmanVerticle tree = new HuffmanVerticle();
        tree = readTreeFile(output_names);
        restoreInput(tree, output_names, input_names);
    }

    public static void restoreInput(HuffmanVerticle tree, String[] binary_names, String[] original_names) {
        BitSet huffmanCodeBit;
        try {
            FileOutputStream to_original = new FileOutputStream(original_names[0]);
            FileInputStream binary = new FileInputStream(binary_names[0]);
            ObjectInputStream s = new ObjectInputStream(binary);
            huffmanCodeBit = (BitSet) s.readObject();
            System.out.println(huffmanCodeBit.toString());
            int index = 0;
            while (huffmanCodeBit.length() > index)
            {
                HuffmanVerticle tmp = tree;
                while (!tmp.isNullTree())
                {
                    boolean bit = huffmanCodeBit.get(index);
                    index++;
                    System.out.println(bit);
                    if (!bit)
                        tmp = tmp.left;
                    else
                        tmp = tmp.right;
                }
                to_original.write(tmp.character);
            }
            binary.close();
            to_original.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
What am I missing here? Why doesn't the code work for certain files? When I run it on some files, the restored files are unusable.
For BMP files it does not finish even after half an hour, whereas for txt files it runs very fast.
Thanks for your help.
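
One thing that may be worth checking in the decoder above (an observation, not a confirmed diagnosis): BitSet.length() reports the index of the highest set bit plus one, so zero bits at the end of the encoded stream are not counted, and a BitSet read back from a file carries no record of how many bits were actually written. Printing every bit with System.out.println is also likely to be very slow on large binary inputs. Below is a minimal sketch, using only the JDK, of the length effect and of one possible workaround, which is to keep the bit count next to the BitSet. The class name and the append helper are hypothetical and not part of the original code.

```java
import java.util.BitSet;

public class BitSetLengthDemo {

    /**
     * Hypothetical helper: appends a code such as "100" to the BitSet,
     * starting at position bitCount, and returns the new bit count.
     */
    public static int append(BitSet bits, int bitCount, String code) {
        for (char c : code.toCharArray()) {
            bits.set(bitCount++, c == '1');
        }
        return bitCount;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        int bitCount = 0;
        bitCount = append(bits, bitCount, "101");
        bitCount = append(bits, bitCount, "000"); // a code made only of zero bits

        // The explicit counter knows about all six bits ...
        System.out.println("bits written:    " + bitCount);
        // ... while length() stops after the last bit that is set to 1.
        System.out.println("BitSet.length(): " + bits.length());
    }
}
```

If this turns out to be the cause, the encoder could write the bit count (for example with ObjectOutputStream.writeInt) before the BitSet, and the decoder could loop up to that count instead of huffmanCodeBit.length().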

Write a Java program that downloads the first 100 comics of the webcomic XKCD. Be sure to use https:// for all URLS

This is what I have so far. I am having trouble downloading comics 1 to 100, starting at https://xkcd.com/1/. I know I am supposed to look at the page source of each comic, but I can't figure out how to save the first 100 comics to the folder I chose. For example, I want the images from https://xkcd.com/1/ (view-source:https://xkcd.com/1/), https://xkcd.com/2/ (view-source:https://xkcd.com/2/), and so on up to comic 100. I know the img src is on line 50 of the page source, but I don't know how to approach it.
    public static void main(String[] args) {
        URL imgURL = null;
        for (int web = 1; web <= 100; web++) {
            try {
                imgURL = new URL("https://imgs.xkcd.com/comics/barrel_cropped_(1).jpg");
                InputStream stream = imgURL.openStream();
                Files.copy(stream, Paths.get("file/WebComics" + web + ".png"));
                System.out.println("Done!");
            } catch (Exception e) {
                e.printStackTrace();
                System.out.println("Error!");
            }
        }
    }
}

Add the jsoup library JAR to your project, and then try this:
    static void do_page(int id) throws IOException {
        Document doc = Jsoup.connect("https://xkcd.com/" + id).get();
        // Select the image inside the element with id "comic"
        Elements imgs = doc.select("#comic img");
        for (Element e : imgs) {
            System.out.println(e.attr("src"));
        }
    }
Then call the do_page function in a loop:
    for (int i = 1; i <= 100; i++) {
        do_page(i);
    }
Now, instead of printing the src attribute, you can download each image however you see fit, for example by opening a stream on the image URL and copying it to a file.
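
As a rough sketch of that last step, using only the JDK: the helper below turns a src value into an https:// URL, derives a local file name, and copies the image to disk. It assumes the src attribute may be protocol-relative (starting with "//"); the class name, method names, and file naming scheme are hypothetical choices for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ComicDownloader {

    /** Turns a src value such as "//imgs.xkcd.com/comics/x.jpg" into an https:// URL. */
    public static String toHttpsUrl(String src) {
        if (src.startsWith("//")) {
            return "https:" + src;
        }
        if (src.startsWith("http://")) {
            return "https://" + src.substring("http://".length());
        }
        return src;
    }

    /** Builds a local file name from the comic number and the last path segment of the URL. */
    public static String fileNameFor(int id, String imageUrl) {
        String name = imageUrl.substring(imageUrl.lastIndexOf('/') + 1);
        return id + "_" + name;
    }

    /** Downloads one image into targetDir (needs network access). */
    public static Path download(int id, String src, Path targetDir) throws IOException {
        String url = toHttpsUrl(src);
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileNameFor(id, url));
        try (InputStream in = new URL(url).openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) {
        // No network access here: only show how a src value is converted.
        String src = "//imgs.xkcd.com/comics/barrel_cropped_(1).jpg";
        System.out.println(toHttpsUrl(src));
        System.out.println(fileNameFor(1, toHttpsUrl(src)));
    }
}
```

Inside do_page, the call would then be something like download(id, e.attr("src"), Paths.get("file")), with the target folder chosen to match your project.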

GATE - Loading a gapp file takes a long time. How can I reduce it?

I am working on a GATE-related project in which I build a pipeline in GATE and load the resulting .gapp file from my Java code. Loading the .gapp file takes around 10 seconds, which is too long for my application.
How can I solve this?
My second problem is that I have to call System.exit after processing a document to release the memory; if I don't, I get an OutOfMemoryError.
So, how can I solve these issues?
My code looks like this:
public class GateMainClass {
    CorpusController application = null;
    Corpus corpus = null;

    public void processApp(String gateHome, String gapFilePath, String docSourcePath, String isSingleDocument) throws ResourceInstantiationException {
        try {
            // Initialise GATE only once per JVM
            if (!Gate.isInitialised()) {
                Gate.runInSandbox(true);
                Gate.setGateHome(new File(gateHome));
                Gate.setPluginsHome(new File(gateHome, "plugins"));
                Gate.init();
            }
            // The .gapp file is loaded again on every call
            application = (CorpusController) PersistenceManager.loadObjectFromFile(new File(gapFilePath));
            corpus = Factory.newCorpus("main");
            application.setCorpus(corpus);
            if (isSingleDocument.equals(Boolean.TRUE.toString())) {
                Document doc = Factory.newDocument(new File(docSourcePath).toURI().toURL());
                corpus.add(doc);
            } else {
                File[] files;
                File folder = new File(docSourcePath);
                files = folder.listFiles(new FileUtil.CustomFileNameFilter(".xml"));
                Arrays.sort(files, LastModifiedFileComparator.LASTMODIFIED_REVERSE);
                for (int i = 0; i < files.length; i++) {
                    Document doc = Factory.newDocument(files[i].toURI().toURL());
                    corpus.add(doc);
                }
            }
            application.execute();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            corpus.clear();
        }
    }
}
And my gapp file contains the following:
1. Document Reset PR
2. ANNIE English Tokenizer
3. ANNIE Gazetteer
4. ANNIE Sentence Splitter
5. ANNIE POS Tagger
6. GATE Morphological Analyser
7. Flexible Gazetteer
8. HTML markup transfer
9. Main JAPE file

Batching multiple files to Amazon S3 using the Java SDK

I'm trying to upload multiple files to Amazon S3 under a single key by appending the files to one another. I have a list of file names and want to upload/append the files in that order. I am following this tutorial almost exactly, except that I loop through each file first and upload it in parts. Because the files are on HDFS (the Path is actually org.apache.hadoop.fs.Path), I use an input stream to send the file data. Some pseudocode is below (I have marked the blocks that are word for word from the tutorial with comments):
    // Create a list of UploadPartResponse objects. You get one of these for
    // each part upload.
    List<PartETag> partETags = new ArrayList<PartETag>();
    // Step 1: Initialize.
    InitiateMultipartUploadRequest initRequest = new InitiateMultipartUploadRequest(
            bk.getBucket(), bk.getKey());
    InitiateMultipartUploadResult initResponse =
            s3Client.initiateMultipartUpload(initRequest);
    try {
        int i = 1; // part number
        for (String file : files) {
            Path filePath = new Path(file);
            // Get the input stream and content length
            long contentLength = fss.get(branch).getFileStatus(filePath).getLen();
            InputStream is = fss.get(branch).open(filePath);
            long filePosition = 0;
            while (filePosition < contentLength) {
                // create request
                // upload part and add response to our list
                i++;
            }
        }
        // Step 3: Complete.
        CompleteMultipartUploadRequest compRequest = new
                CompleteMultipartUploadRequest(bk.getBucket(),
                        bk.getKey(),
                        initResponse.getUploadId(),
                        partETags);
        s3Client.completeMultipartUpload(compRequest);
    } catch (Exception e) {
        //...
    }
However, I am getting the following error:
com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was not well-formed or did not validate against our published schema (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: 2C1126E838F65BB9), S3 Extended Request ID: QmpybmrqepaNtTVxWRM1g2w/fYW+8DPrDwUEK1XeorNKtnUKbnJeVM6qmeNcrPwc
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1109)
    at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:741)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:461)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:296)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3743)
    at com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:2617)
If anyone knows what might cause this error, I would greatly appreciate the help. Alternatively, if there is a better way to concatenate a set of files into one S3 key, that would be great as well. I tried Java's built-in SequenceInputStream, but that did not work. For reference, the total size of all the files could be as large as 10-15 GB.

I know it's probably a bit late, but here is my contribution anyway.
I managed to solve a similar problem using SequenceInputStream.
The trick is to calculate the total size of the resulting file up front and then feed the SequenceInputStream with an Enumeration<InputStream>.
Here's some example code that might help:
    public void combineFiles() {
        List<String> files = getFiles();
        // Total size of the combined object, needed for the content length
        long totalFileSize = files.stream()
                .map(this::getContentLength)
                .reduce(0L, (f, s) -> f + s);
        try {
            try (InputStream partialFile = new SequenceInputStream(getInputStreamEnumeration(files))) {
                ObjectMetadata resultFileMetadata = new ObjectMetadata();
                resultFileMetadata.setContentLength(totalFileSize);
                s3Client.putObject("bucketName", "resultFilePath", partialFile, resultFileMetadata);
            }
        } catch (IOException e) {
            LOG.error("An error occurred while combining files.", e);
        }
    }

    private Enumeration<? extends InputStream> getInputStreamEnumeration(List<String> files) {
        return new Enumeration<InputStream>() {
            private Iterator<String> fileNamesIterator = files.iterator();

            @Override
            public boolean hasMoreElements() {
                return fileNamesIterator.hasNext();
            }

            @Override
            public InputStream nextElement() {
                // Each file is opened only when the previous one has been fully read
                try {
                    return new FileInputStream(Paths.get(fileNamesIterator.next()).toFile());
                } catch (FileNotFoundException e) {
                    System.err.println(e.getMessage());
                    throw new RuntimeException(e);
                }
            }
        };
    }
Hope this helps!
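
To see the SequenceInputStream part in isolation, here is a small sketch that needs only the JDK and no S3 client. In-memory byte arrays stand in for the files, and the class and method names are hypothetical; it only illustrates that the streams are read back to back and that the combined length equals the sum of the part sizes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SequenceDemo {

    /** Reads every part in order through one SequenceInputStream and returns the combined bytes. */
    public static byte[] concat(List<byte[]> parts) throws IOException {
        List<InputStream> streams = new ArrayList<>();
        for (byte[] part : parts) {
            streams.add(new ByteArrayInputStream(part));
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream combined = new SequenceInputStream(Collections.enumeration(streams))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = combined.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return out.toByteArray();
    }

    /** Sum of the part sizes: the value that would be passed as the content length. */
    public static long totalLength(List<byte[]> parts) {
        long total = 0;
        for (byte[] part : parts) {
            total += part.length;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> parts = Arrays.asList(
                "first,".getBytes(StandardCharsets.UTF_8),
                "second,".getBytes(StandardCharsets.UTF_8),
                "third".getBytes(StandardCharsets.UTF_8));
        byte[] all = concat(parts);
        System.out.println(new String(all, StandardCharsets.UTF_8));
        System.out.println(all.length == totalLength(parts));
    }
}
```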

Compare image to actual screen

I'd like my Java program to compare the current screen with a picture (a screenshot).
I don't know if it's possible, but I have seen it done in Jitbit (a macro recorder) and I would like to implement it myself. (Maybe that example makes clear what I mean.)
Thanks
----edit-----
In other words, is it possible to check whether an image is currently showing on the screen, that is, to find and compare those pixels on the screen?

You may try aShot (see its documentation):
1) aShot can ignore areas that you mark with a special color.
2) aShot can produce an image that displays the differences between two images.
    private void compareTwoImages(BufferedImage expectedImage, BufferedImage actualImage) {
        ImageDiffer imageDiffer = new ImageDiffer();
        ImageDiff diff = imageDiffer
                .withDiffMarkupPolicy(new PointsMarkupPolicy()
                        .withDiffColor(Color.YELLOW))
                .withIgnoredColor(Color.MAGENTA)
                .makeDiff(expectedImage, actualImage);
        // areImagesDifferent is true if the images differ, false if they are the same
        boolean areImagesDifferent = diff.hasDiff();
        if (areImagesDifferent) {
            // code in case of failure
        } else {
            // code in case of success
        }
    }
To save the image with the differences:
    private void saveImage(BufferedImage image, String imageName) {
        // Path where the image will be saved
        String outputFilePath = String.format("target/%s.png", imageName);
        File outputFile = new File(outputFilePath);
        try {
            ImageIO.write(image, "png", outputFile);
        } catch (IOException e) {
            // Some code in case of failure
        }
    }

You can do this in two steps:
1. Create a screenshot using java.awt.Robot:
    BufferedImage image = new Robot().createScreenCapture(new Rectangle(Toolkit.getDefaultToolkit().getScreenSize()));
    ImageIO.write(image, "png", new File("/screenshot.png"));
2. Compare the screenshots using something like the approach in: How to check if two images are similar or not using openCV in java?

Have a look at the Sikuli project. Its automation engine is based on image comparison.
I guess that internally it still uses OpenCV to calculate image similarity, and there are plenty of OpenCV Java bindings that let you do the same from Java.
The project source code is located here: https://github.com/sikuli/sikuli

OK, so I found an answer after a few days.
This method takes the screenshot:
    public static void takeScreenshot() {
        try {
            BufferedImage image = new Robot().createScreenCapture(new Rectangle(490, 490, 30, 30));
            /* The first two arguments are the X and Y coordinates of the top-left corner;
               the last two are the width and height of the captured area. */
            ImageIO.write(image, "png", new File("C:\\Example\\Folder\\capture.png"));
        } catch (IOException e) {
            e.printStackTrace();
        } catch (HeadlessException e) {
            e.printStackTrace();
        } catch (AWTException e) {
            e.printStackTrace();
        }
    }
And this one compares the images:
    public static String compareImage() throws Exception {
        // savedImage is the image we want to look for in the new screenshot.
        // Both must have the same width and height
        String c1 = "savedImage";
        String c2 = "capture";
        BufferedInputStream in = new BufferedInputStream(new FileInputStream(c1 + ".png"));
        BufferedInputStream in1 = new BufferedInputStream(new FileInputStream(c2 + ".png"));
        int i, j;
        int k = 1;
        // Compare the two files byte by byte; a difference in length also counts as a mismatch
        do {
            i = in.read();
            j = in1.read();
            if (i != j) {
                k = 0;
            }
        } while (k == 1 && i != -1);
        in.close();
        in1.close();
        if (k == 1) {
            System.out.println("Ok...");
            return "Ok";
        } else {
            System.out.println("Fail ...");
            return "Fail";
        }
    }
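
The method above compares the bytes of two PNG files, so it only reports a match when both captures cover exactly the same area. As a hedged alternative sketch for the "is this image showing somewhere on the screen" case, the code below looks for an exact pixel match of a small image inside a larger one using BufferedImage.getRGB. It is a brute-force search with no tolerance for color differences, and the class and method names are hypothetical. In real use, the larger image would come from Robot.createScreenCapture and the smaller one from ImageIO.read.

```java
import java.awt.Point;
import java.awt.image.BufferedImage;

public class ImageSearch {

    /** Returns true if the region of screen starting at (x, y) equals target pixel for pixel. */
    static boolean matchesAt(BufferedImage screen, BufferedImage target, int x, int y) {
        for (int ty = 0; ty < target.getHeight(); ty++) {
            for (int tx = 0; tx < target.getWidth(); tx++) {
                if (screen.getRGB(x + tx, y + ty) != target.getRGB(tx, ty)) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Returns the top-left corner of the first exact match, or null if there is none. */
    public static Point find(BufferedImage screen, BufferedImage target) {
        for (int y = 0; y <= screen.getHeight() - target.getHeight(); y++) {
            for (int x = 0; x <= screen.getWidth() - target.getWidth(); x++) {
                if (matchesAt(screen, target, x, y)) {
                    return new Point(x, y);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Synthetic data instead of a real screenshot: one red pixel on a black background
        BufferedImage screen = new BufferedImage(20, 20, BufferedImage.TYPE_INT_RGB);
        screen.setRGB(7, 5, 0xFF0000);

        // A 3x3 target with the red pixel in its centre
        BufferedImage target = new BufferedImage(3, 3, BufferedImage.TYPE_INT_RGB);
        target.setRGB(1, 1, 0xFF0000);

        System.out.println(find(screen, target));
    }
}
```

For full-screen captures this search can be slow, and it fails on any anti-aliasing or scaling difference, which is where the OpenCV-based approaches mentioned in the other answers are a better fit.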
