I have an application that currently uses Apache Abdera to parse AtomPub documents (Workspace, Collection, Feed, Entry), and I want to switch to the GData libraries, mainly to shed a lot of dependencies; I have also found the GData calls to be consistently faster. However, I cannot figure out how to generate some of these document types through GData.
Example:
Workspace w = new Workspace(new PlainTextConstruct("My Workspace"));
System.out.println(w); // prints a memory location
System.out.println(w.getXmlBlob()); // prints memory location or null
In Abdera this would have worked. I am guessing I am missing some parsing class, but the documentation is not very forthcoming on this topic.
I am expecting a document like this (not exactly):
<workspace><atom:title>My Workspace</atom:title></workspace>
Well, I managed to find the answer myself. I am still trying to figure out how to assign a default namespace so it doesn't prefix every XML tag with "atom".
Workspace workspace = new Workspace(new PlainTextConstruct("My Workspace"));
CharArrayWriter charWr = new CharArrayWriter();
workspace.generate(new XmlWriter(charWr), new ExtensionProfile());
System.out.println(charWr.toString());
I wanted to make a simple program to get text content from a pdf file through Java. Here is the code:
PDFTextStripper ts = new PDFTextStripper();
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = PDDocument.load(file);
String allText = ts.getText(doc1);
String gradeText = allText.substring(allText.indexOf("GRADE 10B"), allText.indexOf("GRADE 10C"));
System.out.println("Meeting ID for English: "
+ gradeText.substring(gradeText.indexOf("English") + 7, gradeText.indexOf("English") + 20));
This is just part of the code, but this is the part with the problem.
The error is: The method load(File) is undefined for the type PDDocument
I learned PDFBox from JavaTPoint. I have followed the instructions for installing the PDFBox libraries and adding them to the build path.
My PDFBox version is 3.0.0
I have also searched the source files and their methods, and I am unable to find the load method there.
Thank you in advance.
As per the 3.0 migration guide, PDDocument.load has been replaced by the Loader methods:
For loading a PDF, PDDocument.load has been replaced with the Loader methods. The same is true for loading an FDF document.

When saving a PDF, this will now be done in compressed mode by default. To override that, use PDDocument.save with CompressParameters.NO_COMPRESSION.

PDFBox now loads a PDF document incrementally, reducing the initial memory footprint. This will also reduce the memory needed to consume a PDF if only certain parts of the PDF are accessed. Note that, due to the nature of PDF, uses such as iterating over all pages, accessing annotations, signing a PDF, etc. might still load all parts of the PDF over time, leading to a similar memory consumption as with PDFBox 2.0.

The input file must not be used as output for saving operations. It will corrupt the file and throw an exception, as parts of the file are read for the first time when saving it.
So you can either drop back to a 2.x version of PDFBox, or use the new Loader class. I believe this should work:
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = Loader.loadPDF(file);
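A fuller sketch of the 3.x pattern (assuming PDFBox 3.x is on the classpath; the file path and text-stripping step are the ones from the question), using try-with-resources so the document is always closed:

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractText {
    public static void main(String[] args) throws IOException {
        File file = new File("C:\\Meeting IDs.pdf");
        // Loader.loadPDF replaces PDDocument.load in PDFBox 3.x
        try (PDDocument doc = Loader.loadPDF(file)) {
            String allText = new PDFTextStripper().getText(doc);
            System.out.println(allText);
        }
    }
}
```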
I need to find a way to ignore pictures and photos from PDF document during conversion to DOCX file.
I am creating an instance of FineReader Engine:
IEngine engine = Engine.InitializeEngine(
engineConfig.getDllFolder(), engineConfig.getCustomerProjectId(),
engineConfig.getLicensePath(), engineConfig.getLicensePassword(), "", "", false);
After that, I am converting a document:
IFRDocument document = engine.CreateFRDocument();
document.AddImageFile(file.getAbsolutePath(), null, null);
document.Process(null);
String exportPath = FileUtil.prepareExportPath(file, resultFolder);
document.Export(exportPath, FileExportFormatEnum.FEF_DOCX, null);
As a result, it converts all images from the initial pdf document.
When exporting PDF to DOCX you should pass export parameters; here you can use IRTFExportParams. You can get this object from the engine:
IRTFExportParams irtfExportParams = engine.CreateRTFExportParams();
and then disable the write-pictures property like this:
irtfExportParams.setWritePictures(false);
Here IEngine engine is the main interface, initialized as shown in the question.
You also have to pass processing parameters to document.Process() (document being the IFRDocument). Process() takes an IDocumentProcessingParams object, which has a setPageProcessingParams() method; give it an IPageProcessingParams object (obtained from engine.CreatePageProcessingParams()). On that object, call:
iPageProcessingParams.setPerformAnalysis(true);
iPageProcessingParams.setPageAnalysisParams(iPageAnalysisParams);
passing true to the first, and to the second an iPageAnalysisParams obtained from engine.CreatePageAnalysisParams().
As a last step, disable picture detection on it: iPageAnalysisParams.setDetectPictures(false). That's all.
When you export the document, pass the parameters like this:
IFRDocument document = engine.CreateFRDocument();
document.Export(filePath, FileExportFormatEnum.FEF_DOCX, irtfExportParams);
I hope this answer helps.
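Putting the steps above together as one sketch, using only the interfaces and calls named in this answer and the question. This is untested (FineReader Engine is proprietary and needs a license), and the CreateDocumentProcessingParams factory call is an assumption by analogy with the other Create* methods:

```
// Sketch: suppress pictures when exporting PDF to DOCX with FineReader Engine.
IRTFExportParams rtfParams = engine.CreateRTFExportParams();
rtfParams.setWritePictures(false);             // do not write pictures to the DOCX

IPageAnalysisParams analysisParams = engine.CreatePageAnalysisParams();
analysisParams.setDetectPictures(false);       // ignore pictures during analysis

IPageProcessingParams pageParams = engine.CreatePageProcessingParams();
pageParams.setPerformAnalysis(true);
pageParams.setPageAnalysisParams(analysisParams);

// Assumed factory method, by analogy with the Create* calls above:
IDocumentProcessingParams docParams = engine.CreateDocumentProcessingParams();
docParams.setPageProcessingParams(pageParams);

IFRDocument document = engine.CreateFRDocument();
document.AddImageFile(file.getAbsolutePath(), null, null);
document.Process(docParams);
document.Export(exportPath, FileExportFormatEnum.FEF_DOCX, rtfParams);
```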
I'm not really familiar with PDF to DOCX conversion, but I think you could try custom profiles tailored to your needs.
At some point in your code you create an Engine object, and then a Document object (or IFRDocument object, depending on your application). Add this line just before handing your document to the engine for processing:
engine.LoadProfile(PROFILE_FILENAME);
Then create your profile file with the processing parameters described in the documentation packaged with your FRE installation, under the "Working with Profiles" section.
Do not forget to add in your file:
; ... some params under other sections
[PageAnalysisParams]
DetectText = TRUE        ; force text detection
DetectPictures = FALSE   ; ignore pictures
; ... other params under [PageAnalysisParams]
; ... some params under other sections
It works the same way for barcodes, etc. But keep in mind to benchmark your results when adding or removing settings from this file, as they may alter the processing speed and the overall quality of your results.
What do PDF input pages contain? What is expected in MS Word?
It would be great if you would attach an example of an input PDF file and an example of the desired result in MS Word format.
Then it will be much easier to give a useful recommendation.
I tried using the simple EditorKit option, but that doesn't seem to support all the RTF formats.
So I turned to using Tika, JODConverter, or POI.
As of now I have managed to make it work with JODConverter and OpenOffice using this:
OfficeManager officeManager = new DefaultOfficeManagerConfiguration()
        .setPortNumbers(8100, 8101).buildOfficeManager();
officeManager.start();
OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
try {
    File tempFile = File.createTempFile("tempRtf", ".rtf");
    BufferedWriter bw = new BufferedWriter(new FileWriter(tempFile));
    bw.write(rtfString);
    bw.close();
    File outputTempFile = File.createTempFile("outputTempFile", ".html");
    converter.convert(tempFile, outputTempFile);
    return FileUtils.readFileToString(outputTempFile);
} finally {
    officeManager.stop(); // this start/stop on every call is the slow part
}
This works.
My problem is that I start up a server and shut it down on every conversion, which takes a lot of time.
I tried to see whether I can bring the process up on the first run/report (I use this as a handler in a BIRT report) and then just check whether the process is already running; if so, use it to convert, and that's it. That would save the large amount of time I see wasted on starting and stopping the process (I don't mind it staying up).
My problem is that the classes noted here don't seem to be present in my version of JODConverter.
After further investigation, I found out that they are in the JODConverter 2.2 API, while I use the 3.0-beta-4 core.
JODConverter seems rather complex for my simple need.
So if anyone knows how to start the office manager once and then just check whether it's up, I'd love a code sample; and of course, if anyone has a better solution than JODConverter for my need, I'll be glad to hear it.
EDIT: I need my handler to do two things: 1. check whether an instance of OfficeManager is up, and connect to it (skipping officeManager.start());
and 2. if the instance isn't up, do basically what the code sample above does.
This code is written in a BIRT handler, so I can't create the OfficeManager globally and just share it, because the handler class runs every time I call the BIRT engine.
Maybe I can set up the OfficeManager in BIRT itself? Then I'd have the instance in the handler?
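The "start once, reuse on later calls" requirement described above is the classic lazy-singleton holder idiom: as long as the handler class is loaded by the same classloader across BIRT invocations, a static instance survives between runs. A self-contained sketch, with a placeholder class standing in for JODConverter's OfficeManager (in real code you would build and start the library's manager in create()):

```java
public class OfficeManagerHolder {
    // Placeholder standing in for JODConverter's OfficeManager so the sketch
    // is self-contained; in real code this would be the library type.
    public static class OfficeManager {
        public volatile boolean started;
        public void start() { started = true; }
    }

    // Initialization-on-demand holder: the JVM guarantees INSTANCE is created
    // once, lazily, and thread-safely on the first call to get().
    private static class Holder {
        static final OfficeManager INSTANCE = create();

        private static OfficeManager create() {
            OfficeManager m = new OfficeManager(); // real code: buildOfficeManager()
            m.start();                             // started exactly once per JVM
            return m;
        }
    }

    public static OfficeManager get() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        System.out.println(OfficeManagerHolder.get() == OfficeManagerHolder.get()); // true: same instance
        System.out.println(OfficeManagerHolder.get().started);                      // true: started once
    }
}
```

One caveat: if the hosting container reloads the handler's classloader, the static instance is lost and a new manager is started, so this only helps when the classloader is stable across reports.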
I am trying to write the name of a file into Accumulo. I am using accumulo-core 1.4.3.
For some reason, certain files seem to be written into Accumulo with trailing \x00 characters at the end of the name. The upload comes through a Java servlet (using the jQuery File Upload plugin). In the servlet, I check the name of the file with a System.out.println and it looks normal; I even tried unescaping the string with
org.apache.commons.lang.StringEscapeUtils.unescapeJava(...);
The actual writing to accumulo looks like this:
Mutation mut = new Mutation(new Text(checkSum));
Value val = new Value(new Text(filename).getBytes());
long timestamp = System.currentTimeMillis();
mut.put(new Text(colFam), new Text(EMPTY_BYTES), timestamp, val);
but nothing unusual showed up there (perhaps \x00 isn't escaped?). But then if I do a scan on my table in Accumulo, there will be one or more \x00 characters in the file name.
The problem this seems to cause: I return that string within XML when I retrieve a list of files (where it shows up) and pass it back to the browser, and the XSL that is supposed to render the information in the XML no longer works when these extra characters are present (I'm not sure why that is the case either).
In Chrome, in the response for these calls, I see three red dots after the file name, and when I hover over them, \u0 pops up (which I think is another representation of the null character?).
Anyway, I'm just trying to figure out why this happens, or at the very least how I can filter out \x00 characters before returning the file name in Java. Any ideas?
You are likely using the Hadoop Text class incorrectly -- this is not an error in Accumulo. Specifically, the mistake is in this line from your example:
Value val = new Value(new Text(filename).getBytes());
You must adhere to the length provided by the Text class: getBytes() returns the backing byte array, which may be longer than the actual data. See the Text javadoc for more information. If you're using Hadoop 2.2.0, you can use the provided copyBytes method on Text. If you're on an older version of Hadoop where this method doesn't exist yet, you can use something like the ByteBuffer class or System.arraycopy to get a copy of the byte[] with the proper limits enforced.
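The issue can be demonstrated without Hadoop: when a backing array is longer than the valid data, the extra bytes come through as \x00. Copying only the valid prefix (which is what copyBytes does in newer Hadoop, using Text.getLength() for the limit) fixes it, and stripping the characters from an already-corrupted string works as a fallback. A self-contained sketch:

```java
import java.util.Arrays;

public class TrimDemo {
    // Copy only the first validLength bytes, dropping any \x00 padding from
    // the backing array (the equivalent of Text#copyBytes in Hadoop 2.2.0+,
    // with validLength coming from Text#getLength).
    public static byte[] trim(byte[] backing, int validLength) {
        return Arrays.copyOf(backing, validLength);
    }

    // Fallback for cleaning a string that already contains null characters.
    public static String stripNulls(String s) {
        return s.replace("\u0000", "");
    }

    public static void main(String[] args) {
        // Simulate Text.getBytes(): a backing array longer than the valid data.
        byte[] backing = new byte[16];          // unused bytes stay \x00
        byte[] name = "file.txt".getBytes();
        System.arraycopy(name, 0, backing, 0, name.length);

        System.out.println(new String(trim(backing, name.length))); // file.txt
        System.out.println(stripNulls("file.txt\u0000\u0000"));     // file.txt
    }
}
```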
I'm trying to read an XML file from an Android app, using XOM as the XML library. I'm trying this:
Builder parser = new Builder();
Document doc = parser.build(context.openFileInput(XML_FILE_LOCATION));
But I'm getting nu.xom.ParsingException: Premature end of file. even when the file is empty.
I need to parse a very simple XML file, and I'm ready to use another library instead of XOM, so let me know if there's a better one, or just a solution to the problem using XOM.
In case it helps, I'm using Xerces to get the parser.
------Edit-----
PS: The purpose of this wasn't to parse an empty file; the file just happened to be empty on the first run, which exposed this error.
If you follow this post to the end, it seems that this has to do with Xerces and the fact that it's an empty file, and no solution was reached on the Xerces side.
So I handled the issue as follows:
Document doc = null;
try {
    Builder parser = new Builder();
    doc = parser.build(context.openFileInput(XML_FILE_LOCATION));
} catch (ParsingException ex) { // other catch blocks are required for other exceptions
    // Parsing failed (here, because the file is empty),
    // so create a new root element and a new document.
    // They are filled with XML data elsewhere in the code and saved.
    Element root = new Element("root");
    doc = new Document(root);
}
Then I can do whatever I want with doc, and you can add extra checks to make sure that the cause really is an empty file (for example, check the file size, as indicated by one of sam's comments on the question).
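The behavior is easy to reproduce with the JDK's built-in parser (which is also Xerces-based): an empty stream fails with the same "premature end of file" error, which is why either checking for the empty-file case up front or catching the exception as above is needed. A self-contained sketch:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

public class EmptyXmlDemo {
    // Returns true if parsing the given bytes fails with a parse error.
    public static boolean failsToParse(byte[] xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return false;
        } catch (SAXParseException e) {
            // For empty input this is "Premature end of file."
            return true;
        } catch (Exception e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsToParse(new byte[0]));          // true: empty file fails
        System.out.println(failsToParse("<root/>".getBytes())); // false: minimal doc parses
    }
}
```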
An empty file is not a well-formed XML document. Throwing a ParsingException is the right thing to do here.