Merging PDFs with Sejda fails with stream output - java

Using Sejda 1.0.0.RELEASE, I basically followed the tutorial for splitting a PDF but tried merging instead (org.sejda.impl.itext5.MergeTask, MergeParameters, ...). All works great with the FileTaskOutput:
parameters.setOutput(new FileTaskOutput(new File("/some/path/merged.pdf")));
However, I am unable to change this to StreamTaskOutput correctly:
OutputStream os = new FileOutputStream("/some/path/merged.pdf");
parameters.setOutput(new StreamTaskOutput(os));
parameters.setOutputName("merged.pdf");
No error is reported, but the resulting file cannot be read by Preview.app and is approximately 31 kB smaller (out of the ~1.2 MB total result) than the file saved above.
My first idea was: the stream is not being closed properly! So I added os.close(); at the end of the CompletionListener, but the problem remains.
Remarks:
The reason I need to use StreamTaskOutput is that this merge logic will live in a web app, and the merged PDF will be sent directly over HTTP. I could store the temporary file and serve that one, but that is a hack.
Due to licensing issues, I cannot use the iText 5 version of the task.
Edit
Turns out, the reason is that StreamTaskOutput zips the result into a ZIP file! OutputWriterHelper.copyToStream() is the culprit. If I rename merged.pdf to merged.zip, it's a valid ZIP file containing a perfectly valid merged.pdf file!
Could anyone (dear authors of the library) comment on why this is happening?

The idea is that when a task consumes MultipleOutputTaskParameters, producing multiple output documents, StreamTaskOutput has to group them to be able to write all of them to a single stream. Unfortunately, Sejda currently applies the same logic to SingleOutputTaskParameters, hence your issue. We can fix this in Sejda 2.0, since it makes more sense to stream the output document directly in the SingleOutputTaskParameters case. For Sejda 1.x I'm not sure how to address this while remaining compatible with the existing behaviour.
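In the meantime, a possible workaround on Sejda 1.x (a sketch, assuming the behaviour described above: the output stream always contains a ZIP archive with the merged PDF as its only entry) is to unwrap that entry before handing the bytes to the client. The class and method names here are illustrative, not part of Sejda:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UnwrapZippedPdf {

    /**
     * Copies the first entry of a ZIP stream (e.g. what Sejda 1.x
     * StreamTaskOutput produces for a single-output merge) to out.
     */
    static void unwrapFirstEntry(InputStream zipped, OutputStream out) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(zipped)) {
            ZipEntry entry = zin.getNextEntry();
            if (entry == null) {
                throw new IOException("empty ZIP stream");
            }
            byte[] buf = new byte[8192];
            int n;
            while ((n = zin.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}
```

In a web app you would pass the servlet response's output stream as out, so the browser receives a plain PDF rather than a ZIP.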


Spring Integration: Allocate Space when sending file to FTP

I have the following problem:
We are sending files to an FTP server. We had no problems while we were sending files smaller than 5 MB. If the file size is greater than 5 MB, then we get an abend (an abnormal end) with this error:
In order to "solve" this issue, we should allocate space before sending the file to the FTP server, with something like this:
QUOTE SITE BLOCKSIZE=0 LRECL=256 WRAP UNIT=DISK RECFM=VB PRI=50 SEC=50 CYL
Currently I'm using a DefaultFtpSessionFactory along with a FileTransferringMessageHandler to send files to the FTP server (obviously it works well unless the file is > 5 MB).
My question is: Is there a way to solve this issue using Spring?
I haven't tried this, but take a look: you can extend DefaultFtpSessionFactory and override its postProcessClientAfterConnect(FTPClient client).
There you can try to issue the site command. Note that FTPClient.sendSiteCommand() already prefixes its argument with SITE (QUOTE is only the command-line client's way of sending a raw command), so the whole parameter string goes into a single call:
client.sendSiteCommand("BLOCKSIZE=0 LRECL=256 WRAP UNIT=DISK RECFM=VB PRI=50 SEC=50 CYL");
You can check here also.

Weka CSV loader limit

I'm using Weka for a sentiment analysis project I'm working on. I'm using the Weka CSV Loader to load the training instances from a CSV file, but for some reason, if I want to load more than 70 instances, the program throws a "java.lang.ArrayIndexOutOfBoundsException: 2". I found that you can pass options to the Weka CSV Loader:
-B
The size of the in memory buffer (in rows).
(default: 100)
This may be the option I need to set to get rid of the error, but I'm not sure how to do that from a Java project. If anyone can help me with this, I would appreciate it greatly.
UPDATE: Changing the buffer size didn't help; the problem comes from somewhere else.
How I'm using the loader:
private void getTrainingDataset(final String INPUT_FILENAME)
{
    try {
        // reading the training dataset from a CSV file
        CSVLoader trainingLoader = new CSVLoader();
        trainingLoader.setSource(new File(INPUT_FILENAME));
        inputDataset = trainingLoader.getDataSet();
    } catch (IOException ex) {
        System.out.println("Exception in getTrainingDataset method");
    }
}
UPDATE: for those who want to know where the exception occurs
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
at weka.core.converters.CSVLoader.getInstance(CSVLoader.java:1251)
at weka.core.converters.CSVLoader.readData(CSVLoader.java:866)
at weka.core.converters.CSVLoader.readHeader(CSVLoader.java:1150)
at weka.core.converters.CSVLoader.getStructure(CSVLoader.java:924)
at weka.core.converters.CSVLoader.getDataSet(CSVLoader.java:836)
at sentimentanalysis.SentimentAnalysis.getTrainingDataset(SentimentAnalysis.java:209)
at sentimentanalysis.SentimentAnalysis.trainClassifier(SentimentAnalysis.java:134)
at sentimentanalysis.SentimentAnalysis.main(SentimentAnalysis.java:282)
UPDATE: Even with fewer than 70 instances, the classifier eventually throws an error as well. Everything works fine for around 10-20 instances, but beyond that it all falls apart.
Weka reads the CSV file twice: the first pass, limited to the buffer size (in rows), extracts the classes of the nominal attributes; the second pass reads the entire file.
The classes of each nominal attribute must match the classes of the training set (no more, no less).
Increase the buffer size to more than the number of rows in the file.
If the error still occurs, look for a class that is not present in both files.

Failing for Larger Input Files Only: FileServiceFactory getBlobKey throws IllegalArgumentException

I have a Google App Engine app that converts XML to CSV files. It works fine for small XML inputs, but refuses to finalize the file for larger input XML. The XML is read from, and the resulting CSV files are written to, many times before finalization, over a long-running (multi-day) task. My problem is different from FileServiceFactory getBlobKey throws IllegalArgumentException, since my code works fine in both production and development with small input files, so it's not that I'm neglecting to write to the file before closing/finalizing. The failure appears only with larger input: the XML file is ~150 MB, and each of the 5 resulting CSV files is much smaller (perhaps 10 MB each). I persisted the file URLs for the new CSV files, and even tried to close them with some static code, but I just reproduce the same error, which is
java.lang.IllegalArgumentException: creation_handle: String properties must be 500 characters or less. Instead, use com.google.appengine.api.datastore.Text, which can store strings of any length.
at com.google.appengine.api.datastore.DataTypeUtils.checkSupportedSingleValue(DataTypeUtils.java:242)
at com.google.appengine.api.datastore.DataTypeUtils.checkSupportedValue(DataTypeUtils.java:207)
at com.google.appengine.api.datastore.DataTypeUtils.checkSupportedValue(DataTypeUtils.java:173)
at com.google.appengine.api.datastore.Query$FilterPredicate.<init>(Query.java:900)
at com.google.appengine.api.datastore.Query$FilterOperator.of(Query.java:75)
at com.google.appengine.api.datastore.Query.addFilter(Query.java:351)
at com.google.appengine.api.files.FileServiceImpl.getBlobKey(FileServiceImpl.java:329)
But I know that it's not a String/Text data type issue, since I already used file service URLs of similar length in the earlier successful runs with smaller files; it also wasn't the issue in the other Stack Overflow post linked above. I also tried putting one last meaningless write before finalizing, in case it would help as it did for the other post, but it made no difference. So there's really no way for me to debug this... Here is my file-closing code that is not working. It's quite similar to the Google how-to example at http://developers.google.com/appengine/docs/java/blobstore/overview#Writing_Files_to_the_Blobstore .
log.info("closing out file 1");
try {
    // locked set to true
    FileWriteChannel fwc1 = fileService.openWriteChannel(csvFile1, true);
    fwc1.closeFinally();
} catch (IOException ioe) {
    ioe.printStackTrace();
}
// You can't get the blob key until the file is finalized
BlobKey blobKeyCSV1 = fileService.getBlobKey(csvFile1);
log.info("csv blob storage key is: " + blobKeyCSV1.getKeyString());
csvUrls[i-1] = blobKeyCSV1.getKeyString();
break;
At this point, I just want to finalize my new blob files, for which I have the URLs, but cannot. How can I get around this issue, and what may be the cause? Again, my code works for small input files (~60 kB), but the ~150 MB input file fails. Thank you for any advice on what is causing this or how to get around it! Also, how long will my unfinalized files stick around before being deleted?
This issue was a bug in the Java MapReduce and Files API, which was recently fixed by Google. Read the announcement here: groups.google.com/forum/#!topic/google-appengine/NmjYYLuSizo

Unable to print PNG files using Java Print Services (Everything else works fine)

I am using the Java Print Service API to print a PNG file, but it sends erroneous output to the printer. What actually gets printed (when I use a PNG) is some text saying:
ERROR: /syntaxerror in --%ztokenexec_continue--
Operand stack:
--nostringval-
There seems to be some more text, but it is mostly lost outside the page margins. I am setting the DocFlavor to DocFlavor.INPUT_STREAM.PNG, and the supplied document is indeed an InputStream. (Just changing the DocFlavor to DocFlavor.INPUT_STREAM.PDF and using a PDF file works.)
I have also tried different PNG files, but the problem persists. For what it's worth, even PostScript works.
The errors being printed look quite similar to gd (or ImageMagick?) errors, so my best guess right now is that the PNG -> PS conversion is failing.
The code is as follows:
PrintService printService = this.getPrintService("My printer name");
final Doc doc = new SimpleDoc(document, DocFlavor.INPUT_STREAM.PNG, null);
final DocPrintJob printJob = printService.createPrintJob();
printJob.print(doc, null); // submits the job; print attributes omitted
Here, getPrintService fetches a print service and is fetching a valid one. As for the document, here is how I get it:
File pngFile = new File("/home/rprabhu/temp/myprintfile.png");
FileInputStream document = new FileInputStream(pngFile);
I have no clue why it is going wrong, and I don't see any errors being output to the console as well.
Any help is greatly appreciated. Thanks.
Printing is always a messy business – inevitably so, because you have to worry about tedious details such as the size of a page, the margin sizes, and how many pages you're going to need for your output. As you might expect, the process for printing an image is different from printing text and you may also have the added complication of several printers with different capabilities being available, so with certain types of documents you need to select an appropriate printer.
Please see the links below:
http://vineetreynolds.wordpress.com/2005/12/12/silent-print-a-pdf-print-pdf-programmatically/
http://hillert.blogspot.com/2011/12/java-print-service-frustrations.html
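One workaround worth trying (a sketch, not from the answers above): with DocFlavor.INPUT_STREAM.PNG the raw bytes are often handed straight to the driver, which appears to be what fails here. Decoding the PNG yourself with ImageIO and submitting it as DocFlavor.SERVICE_FORMATTED.PRINTABLE lets Java 2D do the rasterization instead. The wrapper class name below is hypothetical:

```java
import java.awt.Graphics;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.print.PageFormat;
import java.awt.print.Printable;

/** Wraps an image so it can be printed via DocFlavor.SERVICE_FORMATTED.PRINTABLE. */
public class ImagePrintable implements Printable {

    private final BufferedImage image;

    public ImagePrintable(BufferedImage image) {
        this.image = image;
    }

    @Override
    public int print(Graphics g, PageFormat pf, int pageIndex) {
        if (pageIndex > 0) {
            return NO_SUCH_PAGE; // single-page job
        }
        Graphics2D g2 = (Graphics2D) g;
        g2.translate(pf.getImageableX(), pf.getImageableY());
        // Scale the image uniformly so it fits the printable area.
        double scale = Math.min(pf.getImageableWidth() / image.getWidth(),
                                pf.getImageableHeight() / image.getHeight());
        g2.scale(scale, scale);
        g2.drawImage(image, 0, 0, null);
        return PAGE_EXISTS;
    }
}
```

You would then build the doc as new SimpleDoc(new ImagePrintable(ImageIO.read(pngFile)), DocFlavor.SERVICE_FORMATTED.PRINTABLE, null) and submit it as before, assuming the chosen service lists SERVICE_FORMATTED.PRINTABLE among its supported flavors (most Java 2D-backed services do).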

Running a JavaScript command from MATLAB to fetch a PDF file

I'm currently writing some MATLAB code to interact with my company's internal reports database. So far I can access the HTML abstract page using code which looks like this:
import com.mathworks.mde.desk.*;
wb = com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.setCurrentLocation(ReportURL(8:end));
pause(1);
s = {};
while isempty(s)
    s = char(wb.getHtmlText);
    pause(.1);
end
desk = MLDesktop.getInstance;
desk.removeClient(wb);
I can extract various bits of information from the HTML text that ends up in the variable s; however, the PDF of the report is accessed via what I believe is a JavaScript call (onClick="gotoFulltext('','[Report Number]')").
Any ideas as to how I execute this JavaScript command and get the contents of the PDF file into a MATLAB variable?
(MATLAB sits on top of Java, so I believe a Java solution would work...)
I think you should take a look at the JavaScript that is being called and see what the final request to the web server looks like.
You can do this quite easily in Firefox using the Firebug plugin:
https://addons.mozilla.org/en-US/firefox/addon/1843
Once you have found the real server request, you can simply request (or post to) that URL instead of trying to run the JavaScript.
Once you have the correct URL (a la the answer from pjp), your next problem is to "get the contents of the PDF file into a MATLAB variable". Whether or not this is possible may depend on what you mean by "contents"...
If you want the raw data in the PDF file, I don't think there is currently a way to do this in MATLAB. The URLREAD function was the first thing I thought of for reading content from a URL into a string, but its documentation contains this note:
s = urlread('url') reads the content at a URL into the string s. If the server returns binary data, s will be unreadable.
Indeed, if you try to read a PDF as in the following example, s contains some text intermingled with mostly garbage:
s = urlread('http://samplepdf.com/sample.pdf');
If you want to get the text from the PDF file, you have some options. First, you can use URLWRITE to save the contents of the URL to a file:
urlwrite('http://samplepdf.com/sample.pdf','temp.pdf');
Then you should be able to use one of two submissions on The MathWorks File Exchange to extract the text from the PDF:
Extract text from a PDF document by Dimitri Shvorob
PDF Reader by Tom Gaudette
If you simply want to view the PDF, you can just open it in Adobe Acrobat with the OPEN function:
open('temp.pdf');
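Since MATLAB sits on top of Java, another option, once you know the final URL, is to drop the download into plain Java and get the raw bytes back. A minimal sketch (the class name is hypothetical; from MATLAB you would call it as bytes = UrlFetcher.fetch('http://...')):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class UrlFetcher {

    /** Reads the raw bytes behind a URL (http:, file:, ...) into a byte array. */
    public static byte[] fetch(String url) throws IOException {
        try (InputStream in = new URL(url).openStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

Unlike urlread, this does not mangle binary data, so the returned array can be written to disk or parsed further.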
Alternatively, you can execute the JavaScript directly in the MATLAB web browser component:
wb = com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.executeScript('javascript:alert(''Some code from a link'')');
desk = com.mathworks.mde.desk.MLDesktop.getInstance;
desk.removeClient(wb);
