I am creating a parser for the .one file extension, which I will add to the Apache Tika project once it is finished.
Here is the Apache 2.0 licensed open source project I'm creating: https://github.com/nddipiazza/onenote-parser-java
I used the specification document here: https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-one/73d22548-a613-4350-8c23-07d15576be50
As a starting point, I ported over the code from this open source C++ project: https://github.com/dropbox/onenote-parser
I have gotten a long way in parsing the documents, but I've hit a roadblock.
Here is the OneNote file I'm using to parse: https://drive.google.com/file/d/1uROTEnKeBKU08CG_K5zdDTGHa178LgBK/view?usp=sharing
I am unable to view Section1TextArea1 and Section1TextArea2 in my parsed results, so I'm missing some key data-parsing element.
The text is definitely in the OneNote file itself; I can see it in a hex viewer.
Here is the JSON parse output: https://gist.github.com/nddipiazza/02d2252d357b3b02a6b9ab1050474267
I feel like the spec document is missing some very important information needed to parse this proprietary format.
What major element(s) am I missing that prevent me from getting the actual text content?
I figured it out. It was a matter of understanding that property values in OneNote can have any of:
Binary contents
ASCII text contents
UTF-16LE text contents
There is a variety of them sprinkled throughout.
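For illustration, here is a minimal sketch of the kind of type detection involved. The class name and the heuristics are mine, not from the spec; in the real parser the property ID tells you which encoding to expect, so treat this purely as an illustrative fallback:

    import java.nio.charset.StandardCharsets;

    public class PropertyValueDecoder {

        // Returns decoded text, or null if the value should be kept as raw bytes.
        public static String decode(byte[] raw) {
            if (looksLikeUtf16le(raw)) {
                return new String(raw, StandardCharsets.UTF_16LE);
            }
            if (looksLikeAscii(raw)) {
                return new String(raw, StandardCharsets.US_ASCII);
            }
            return null; // binary contents
        }

        // ASCII-range text stored as UTF-16LE has 0x00 in every odd byte position.
        private static boolean looksLikeUtf16le(byte[] raw) {
            if (raw.length < 4 || raw.length % 2 != 0) {
                return false;
            }
            for (int i = 1; i < raw.length; i += 2) {
                if (raw[i] != 0) {
                    return false;
                }
            }
            return true;
        }

        // Printable 7-bit characters plus tab/LF/CR only.
        private static boolean looksLikeAscii(byte[] raw) {
            for (byte b : raw) {
                if ((b < 0x20 || b > 0x7E) && b != 0x09 && b != 0x0A && b != 0x0D) {
                    return false;
                }
            }
            return raw.length > 0;
        }
    }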
Also, I just went ahead and parsed the entire root file tree. It results in lots of duplicate text, but I don't really care.
The project is updated with test cases and the fix here: https://github.com/nddipiazza/onenote-parser-java/tree/master/src/main/java/org/apache/tika/onenote
UPDATE:
Just created the Apache Tika PR: https://github.com/apache/tika/pull/300
I have a Word template, complete with fonts, colors, etc. I am querying a database and retrieving information into a POJO. I want to extract the relevant info from said POJO and create a Word document as per my template's directives.
The doc will have tables and graphs, so I need to use Content Control Data Binding. As I understand it, I'll have to do the following to achieve this:
Modify the Word template to add content controls
Transform the POJO into an XML object (template?)
Use ContentControlMergeXML to bind the XML data to the Word template
Unfortunately, I can't find a good step-by-step example of this anywhere; nearly all of the links in the docx4j forum lead to broken GitHub pages.
My questions
How can I use OpenDoPE to add tags to my Word template? I'll need to preserve styles, so I want the correct OpenDoPE version.
Should the POJO be converted into an XML object or document?
Is there an end to end example of this entire process so I can follow along? (preferably with source code)
Content control data binding essentially injects an XPath value into a content control in the Word document.
That XPath is evaluated against an XML document, so yes, you need to convert your POJO into XML.
Authoring
Now, there are three different OpenDoPE Word AddIns you can use to add content controls to your Word document. See the links at https://opendope.org/implementations.html
The most recent one assumes a fixed XML format. So to use that, you'd need to transform your POJO to match that format (i.e., use the AddIn to author your docx, then inspect the resulting XML embedded in the docx, then figure out how to transform your POJO to that).
The older AddIns support arbitrary XML, but are cruder. To use one of these, first convert your POJO to XML (e.g., using JAXB, as sketched below), then feed the AddIn your sample XML.
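For the POJO-to-XML step, a minimal JAXB sketch (the Customer class and its fields are made-up placeholders, not anything OpenDoPE requires):

    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Marshaller;
    import javax.xml.bind.annotation.XmlRootElement;
    import java.io.StringWriter;

    @XmlRootElement(name = "customer")
    public class Customer {
        public String name;   // JAXB binds public fields directly
        public int orderCount;

        public static void main(String[] args) throws Exception {
            Customer c = new Customer();
            c.name = "Acme Corp";
            c.orderCount = 3;

            JAXBContext ctx = JAXBContext.newInstance(Customer.class);
            Marshaller m = ctx.createMarshaller();
            m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);

            StringWriter out = new StringWriter();
            m.marshal(c, out); // this XML is what the content controls bind against
            System.out.println(out);
        }
    }

The resulting document is what ends up as the custom XML part inside the docx, and it's what the content-control XPaths are evaluated against.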
Runtime
To bind your XML to a docx "template" to create an instance docx, see https://github.com/plutext/docx4j/blob/master/docx4j-samples-docx4j/src/main/java/org/docx4j/samples/ContentControlBindingExtensions.java
You can run that sample code against the sample docx and data; take a look at the docx to see what the content controls look like (they bind a custom XML part inside the docx, so unzip it to see that).
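In outline, the runtime step reduces to something like the following sketch (docx4j 3.x as I recall it; the file names are placeholders, and the exact bind flags are worth checking against the sample above):

    import org.docx4j.Docx4J;
    import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
    import java.io.File;
    import java.io.FileInputStream;

    public class BindDemo {
        public static void main(String[] args) throws Exception {
            WordprocessingMLPackage pkg =
                    WordprocessingMLPackage.load(new File("template.docx"));

            // Inject the XML as a custom xml part and evaluate the
            // content controls' XPaths against it
            Docx4J.bind(pkg, new FileInputStream("data.xml"),
                    Docx4J.FLAG_BIND_INSERT_XML | Docx4J.FLAG_BIND_BIND_XML);

            pkg.save(new File("instance.docx"));
        }
    }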
PS: the GitHub links broke as a result of a recent code re-org, and GitHub isn't smart enough to maintain them dynamically :-( See https://www.docx4java.org/downloads.html for downloadable sample code.
I want to extract the data present inside a PDF file and present it in the format of a CSV/Excel sheet. I learned that this can be done using the Tika library in Java, and I found out how to extract the data as plain text, but I want to know how to store it in an Excel sheet.
If someone has done this type of work before, please help me.
The first part (and the hard one) is to parse the original data and interpret it as a table. Apache Tika will give you an XHTML representation (or call your own handler with SAX events), but it usually won't construct a table for you. From a PDF file, I mean, since PDF isn't a tabular format by itself.
So, you'll have to take the Tika-produced paragraphs, split them, and pass the resulting cells to some CSV/XLS/XLSX writer.
It might work if you have a fairly regular table in your PDF (one line per table row, clean logical cell separation, etc.). But it will look like parsing plain text, of course.
In case that doesn't work, you'll have to take a PDF parser (like Apache PDFBox) and try to interpret its output.
The second part (output) is simple. If CSV/SSV/TSV is suitable for you, use your preferred library to produce it (I can recommend Apache commons-csv).
But take into account that MS Excel requires a BOM on UTF-8 and UTF-16 CSV files to understand that the file isn't in a one-byte encoding (like CP-1252).
If you want Excel xls or xlsx format, just use Apache POI to write it.
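Putting the two halves together, a rough sketch of the Tika-to-CSV route. It assumes one table row per extracted text line and cells separated by runs of whitespace, which rarely holds for real PDFs, and the file names are placeholders:

    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVPrinter;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;

    public class PdfToCsv {
        public static void main(String[] args) throws Exception {
            // 1. Extract plain text with Tika (-1 = no write limit)
            BodyContentHandler handler = new BodyContentHandler(-1);
            try (InputStream in = new FileInputStream("input.pdf")) {
                new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
            }

            // 2. Write CSV; the leading BOM lets Excel detect UTF-8
            try (Writer w = new OutputStreamWriter(
                         new FileOutputStream("output.csv"), StandardCharsets.UTF_8);
                 CSVPrinter csv = new CSVPrinter(w, CSVFormat.DEFAULT)) {
                w.write('\uFEFF');
                for (String line : handler.toString().split("\n")) {
                    if (!line.trim().isEmpty()) {
                        // naive cell split on runs of two or more spaces
                        csv.printRecord((Object[]) line.trim().split("\\s{2,}"));
                    }
                }
            }
        }
    }

Swapping the CSV half for Apache POI's XSSFWorkbook would give you xlsx output instead.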
I have a text file and I need to convert all of its data to XML format to make it more readable.
Text file
How can I convert it to XML format?
Is there any Java library or any other way I can do it?
Your question is rather vague (and you could probably find the answer yourself with just a little research), but I'll give you a hint.
Your sample appears to be an INI file (as traditionally used for configuration files on Windows & DOS). So, look for an "INI file parser." If you can't find one, you should be able to write a simple parser yourself using regular expressions. It's a simple file format, consisting of section headings like [SectionTitle] and data fields like Key=Value. That's all.
As for generating XML ... it shouldn't be hard, but "xml format" is not a useful description. Can you be more specific? E.g., what will the XML be used for?
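Assuming it really is a simple INI-style file, a bare-bones version of the parse-and-emit approach might look like this. The element and attribute names are arbitrary choices, since the target XML vocabulary isn't specified:

    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;

    public class IniToXml {
        public static void main(String[] args) throws Exception {
            FileWriter out = new FileWriter("config.xml");
            XMLStreamWriter xml = XMLOutputFactory.newFactory().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("config");

            boolean inSection = false;
            try (BufferedReader r = new BufferedReader(new FileReader("config.ini"))) {
                String line;
                while ((line = r.readLine()) != null) {
                    line = line.trim();
                    if (line.isEmpty() || line.startsWith(";")) {
                        continue; // skip blank lines and comments
                    }
                    if (line.startsWith("[") && line.endsWith("]")) {
                        if (inSection) {
                            xml.writeEndElement(); // close previous <section>
                        }
                        xml.writeStartElement("section");
                        xml.writeAttribute("title", line.substring(1, line.length() - 1));
                        inSection = true;
                    } else if (line.contains("=")) {
                        String[] kv = line.split("=", 2);
                        xml.writeStartElement("entry");
                        xml.writeAttribute("key", kv[0].trim());
                        xml.writeCharacters(kv[1].trim());
                        xml.writeEndElement();
                    }
                }
            }
            if (inSection) {
                xml.writeEndElement();
            }
            xml.writeEndElement(); // </config>
            xml.writeEndDocument();
            xml.flush();
            out.close(); // XMLStreamWriter.close() does not close the underlying writer
        }
    }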
Try this: http://www.smooks.org/mediawiki/index.php?title=Main_Page. I've used it and it's great.
A more sophisticated solution would be to use Mule Data Mapper. On the server side, obviously.
I am using Java SAX parser to parse XML data sent from a third party source that is around 3 GB. I am getting an error resulting from the XML document not being well formed: The processing instruction target matching "[xX][mM][lL]" is not allowed.
As far as I understand, this is normally due to a character being somewhere it should not be.
Main problem: I cannot manually edit these files due to their very large size.
I was wondering whether there is a workaround for files that are too large to open and edit by hand, and whether there is a way to programmatically remove any problematic characters.
I would think the most likely explanation is that the file contains a concatenation of several XML documents, or perhaps an embedded XML document: either way, an XML declaration that isn't at the start of the file.
A lot now depends on your relationship with the supplier of the bad data. If they sent you faulty equipment or buggy software, you would presumably complain and ask them to fix it. But if you don't have a service relationship with the third party, you either have to change supplier or do the best you can with the faulty input, which means repairing the fault yourself. In general, you can't repair faulty XML unless you know what kind of fault you are looking for, and that can be very difficult to determine if the files are large (or if the failures are very rare).
The data isn't XML, so don't try to use XML tools to process it. Use text processing tools such as sed or awk. The first step is to search the file for occurrences of <?xml and see if that gives any hints.
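Since the file is too large to open in an editor, even that first search has to be done as a stream. A small sketch that reports every <?xml occurrence that isn't at the very start of the file, reading line by line so the 3 GB never sits in memory:

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class FindXmlDecls {
        public static void main(String[] args) throws Exception {
            try (BufferedReader r = new BufferedReader(new FileReader(args[0]))) {
                String line;
                long lineNo = 0;
                while ((line = r.readLine()) != null) {
                    lineNo++;
                    int col = line.indexOf("<?xml");
                    // a declaration anywhere but line 1, column 1 is suspect
                    if (col >= 0 && !(lineNo == 1 && col == 0)) {
                        System.out.println("line " + lineNo + ", column " + (col + 1));
                    }
                }
            }
        }
    }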
This error occurs if the declaration is anywhere but at the very beginning of the document. The reason might be:
Whitespace before the XML declaration
Any hidden character before the XML declaration
The XML declaration appears anywhere else in the document
You should start by checking case #2; see here: http://www.w3.org/International/questions/qa-byte-order-mark#remove
If that doesn't help, you should remove leading whitespace from the document. You could do that by wrapping the original InputStream in another InputStream that strips the whitespace (see the sketch below).
The same can be done if you are facing case #3, but the implementation would be a bit more complex.
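A sketch of that wrapper idea, covering cases #1 and #2: it consumes a UTF-8 BOM and any leading whitespace, then hands the rest of the stream through untouched (case #3 would additionally require scanning the whole stream):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.PushbackInputStream;

    public class LeadingJunkSkipper {

        public static InputStream wrap(InputStream in) throws IOException {
            PushbackInputStream pb = new PushbackInputStream(in, 3);

            // Case #2: swallow a UTF-8 BOM (EF BB BF) if present
            byte[] bom = new byte[3];
            int n = pb.read(bom, 0, 3);
            boolean isBom = n == 3 && bom[0] == (byte) 0xEF
                    && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF;
            if (!isBom && n > 0) {
                pb.unread(bom, 0, n);
            }

            // Case #1: swallow leading whitespace before the XML declaration
            int c;
            while ((c = pb.read()) != -1 && Character.isWhitespace(c)) {
                // skip
            }
            if (c != -1) {
                pb.unread(c);
            }
            return pb;
        }
    }

You would then hand the wrapped stream to the SAX parser, e.g. via new InputSource(LeadingJunkSkipper.wrap(rawStream)).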
Is there a way to look up the line number of a given element in an XML file via the W3C DOM API?
My use case for this is that we have 30,000+ maps in KML/XML format. I wrote a unit test that iterates over each file found on the hard drive (about 17 GB worth) and tests that it is parseable by our application. When a file fails, I throw an exception that contains the element instance that was considered "invalid". In order for our mapping department (nobody here knows how to program) to easily track down the typo, we would like to log the line number of the element that caused the exception.
Can anybody suggest a way to do this? Please note we are using the W3C DOM API included in the Android 1.6 SDK.
I'm not sure whether the Android API is different, but a normal Java application could catch a SAXParseException when parsing and look at the line number.
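Something along these lines with standard JAXP (whether Android 1.6 reports positions the same way is an assumption worth testing):

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.xml.sax.SAXParseException;
    import java.io.File;

    public class ParseWithLineNumbers {
        public static void main(String[] args) throws Exception {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            try {
                builder.parse(new File("map.kml"));
            } catch (SAXParseException e) {
                System.err.println("Error at line " + e.getLineNumber()
                        + ", column " + e.getColumnNumber() + ": " + e.getMessage());
            }
        }
    }

Note this only helps for errors the parser itself raises; once the DOM tree is built, the nodes no longer carry position information.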
I may be wrong, but the line number shouldn't be relevant to your XML parser/reader as long as the XML structure itself is valid.
You might try to extrapolate the line number programmatically on the assumption that each node/content must be on a distinct line, but it's going to be tricky.
It looks like you're validating your XML files. That is, you're not interested in whether the documents are merely syntactically correct ("well-formed") but in whether they are semantically valid for your application. The right tool for this is a validating XML parser coupled with a dedicated XML schema. See for example this tutorial on XML validation in Java. Validation errors will usually contain detailed error information, including the line number of problematic elements.
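A sketch of that setup using javax.xml.validation (the schema file name is a placeholder, and whether this package is available on Android 1.6 is an assumption you'd need to verify):

    import javax.xml.XMLConstants;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;
    import javax.xml.validation.Validator;
    import org.xml.sax.ErrorHandler;
    import org.xml.sax.SAXParseException;
    import java.io.File;

    public class ValidateKml {
        public static void main(String[] args) throws Exception {
            Schema schema = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new File("ogckml22.xsd")); // placeholder schema file
            Validator validator = schema.newValidator();
            validator.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { report(e); }
                public void error(SAXParseException e) { report(e); }
                public void fatalError(SAXParseException e) { report(e); }
                private void report(SAXParseException e) {
                    System.err.println("line " + e.getLineNumber() + ": " + e.getMessage());
                }
            });
            // recoverable validation errors are reported and validation continues
            validator.validate(new StreamSource(new File("map.kml")));
        }
    }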