PDF and "massive" XMP data storage


I have a program that creates an output PDF file, and I want to make that file readable by my program by embedding metadata in it. That is quite a lot of data.
It was suggested that I do this using the XMP format. However, I'm not sure that is going to work.
If you don't feel like reading all this, skip to the last paragraph(s); if you don't understand the questions, return here.
My file could have structure like this:
Heading1
<indent>1.Question
<indent><indent>a)answer
<indent><indent>b)answer
<indent>2.Question
<indent><indent>a)answer
<indent><indent>b)answer
<indent><indent>c)answer
<indent>3.Question
<indent>4.Question
Heading2
<indent>1.Question
<indent>2.Question
<indent><indent>a)answer
Every question has its parent heading and every answer has its parent question. A file like this could have an unlimited number of headings, an unlimited number of questions per heading, and zero to five answers per question.
In order for my program to be able to assemble the same file in its GUI, it requires several pieces of information.
It needs to know:
number of headings (integer)
heading type (boolean) (heading doesn't have to contain only questions, so this is needed, but I omitted the other type of heading in example to simplify the matter)
string containing the text in each heading/question/answer
Following the example, this is what my readable file could look like:
2 //heading number
Q/4/headingText //type of heading/number of question/content
2/questionText //number of answers/content
answerText //content
answerText //etc...
3/questionText
answerText
answerText
answerText
0/questionText
0/questionText
Q/2/headingText
0/questionText
1/questionText
answerText
This is possible if I assume that the file is read line by line. The first line would tell how many headings to expect; the second line (and every heading line) would tell the heading type and how many questions to expect before the next heading. Question lines would tell how many succeeding lines contain answer content. Answer lines would contain only content.
All this is to illustrate what I need of my "save file".
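The line-by-line reading described above could be sketched roughly as follows. This is only an illustration under some assumptions: the class and field names (`SaveFileParser`, `Heading`, `Question`) are made up, the `//` comments from the example are not part of the real data, and no heading, question or answer text contains a line break.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical reader for the line format shown above.
final class SaveFileParser {

    static final class Question {
        final String text;
        final List<String> answers = new ArrayList<>();
        Question(String text) { this.text = text; }
    }

    static final class Heading {
        final String type;   // e.g. "Q" for a heading that contains questions
        final String text;
        final List<Question> questions = new ArrayList<>();
        Heading(String type, String text) { this.type = type; this.text = text; }
    }

    static List<Heading> parse(List<String> lines) {
        Iterator<String> it = lines.iterator();
        List<Heading> headings = new ArrayList<>();

        // First line: how many headings to expect.
        int headingCount = Integer.parseInt(next(it).trim());
        for (int h = 0; h < headingCount; h++) {
            // Heading line: type/questionCount/text. The limit of 3 keeps
            // any further '/' characters inside the text.
            String[] hp = next(it).split("/", 3);
            Heading heading = new Heading(hp[0], hp[2]);
            int questionCount = Integer.parseInt(hp[1]);

            for (int q = 0; q < questionCount; q++) {
                // Question line: answerCount/text.
                String[] qp = next(it).split("/", 2);
                Question question = new Question(qp[1]);
                int answerCount = Integer.parseInt(qp[0]);

                // The following answerCount lines are plain answer text.
                for (int a = 0; a < answerCount; a++) {
                    question.answers.add(next(it));
                }
                heading.questions.add(question);
            }
            headings.add(heading);
        }
        return headings;
    }

    private static String next(Iterator<String> it) {
        if (!it.hasNext()) {
            throw new IllegalArgumentException("Unexpected end of data");
        }
        return it.next();
    }
}
```

Because every count precedes the items it describes, the reader never has to look ahead, so the amount of data does not need to be fixed in advance.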
Last paragraph(s)
Is all that possible with XMP? Can I read properties line by line and have a property with multiple values attached to it, or at least somehow divide it into a couple of properties in a way that keeps this functionality?
And the most important question: can XMP readers/writers (iText) handle the non-fixed size of the XMP file?
My alternative is to simply attach those lines somewhere at the end of the PDF file (so as not to mess up the cross-reference table), comment them out (using %), and then create a special reader in Java that would seek out and parse those lines.

This is how I interpret your question.
You want to create a PDF that is readable by humans and that renders header texts, questions, and possible answers.
At the same time, you want the PDF to be readable by a program that doesn't know anything about PDF. The content read by the program is different from the content as it can be read by humans, in the sense that it has some structure.
I don't see the link with PDF. I would store the data you want to be machine-readable as an attachment to the PDF, and have your program extract that attachment. If your program can use iText, then it's a piece of cake. If your program can only read bytes, then you could try different options:
(1) store the data as a stream that isn't compressed. Find the uncompressed stream by adding some kind of long recognizable String as the first line of data (that's more or less how an XMP stream is detected by software that can't interpret PDF syntax).
(2) store the data as a compressed stream, but add an extra entry to the stream dictionary of the compressed stream. Loop over the objects in the PDF file, look for a stream dictionary with that specific, custom key/value pair, read the stream and uncompress it.
If I misinterpreted your question, please rephrase.
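For a program that can only read bytes, option (1) could look roughly like the sketch below. The marker strings and the class name `MarkerScan` are invented for illustration, and the closing marker is an extra assumption of this sketch (only a recognizable first line is suggested above). It works only if the stream is stored uncompressed and unencrypted.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical byte-level scanner; it knows nothing about PDF syntax.
final class MarkerScan {

    // Made-up markers: long and unusual enough not to occur elsewhere in the file.
    static final byte[] BEGIN =
            "%%MYAPP-DATA-BEGIN-3f9c2a%%".getBytes(StandardCharsets.US_ASCII);
    static final byte[] END =
            "%%MYAPP-DATA-END-3f9c2a%%".getBytes(StandardCharsets.US_ASCII);

    // Naive search for needle in haystack, starting at index from.
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Returns the text between the two markers, or null if either is missing.
    static String extract(byte[] fileBytes) {
        int start = indexOf(fileBytes, BEGIN, 0);
        if (start < 0) {
            return null;
        }
        int dataStart = start + BEGIN.length;
        int end = indexOf(fileBytes, END, dataStart);
        if (end < 0) {
            return null;
        }
        return new String(fileBytes, dataStart, end - dataStart, StandardCharsets.UTF_8);
    }
}
```

The bytes could come from `java.nio.file.Files.readAllBytes`. Searching on raw bytes rather than on a decoded String avoids any charset issues with the binary parts of the PDF.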

Related

Extract paragraph sample from file

I have an unknown file type uploaded. It can be doc, pdf, xls, etc.
My ultimate goal is to:
(1) Determine if there are paragraphs of text in the file (as opposed to, say, a bunch of picture captions or text from a chart or table)
(2) If (1) is true and there are paragraphs of text, extract a few sample paragraphs from the file.
I know that I can use a program like Apache Tika to extract the file to a String.
However, I would like to also get the format of the extracted text and determine where there are paragraphs of full, written text (as opposed to captions, etc.).
So I also would like a way to analyze the extracted text. Specifically, I would like a library that can identify full, written paragraphs, as opposed to text that was simply taken from things like photo captions, charts, etc.
While Tika is a rather large library, I would be willing to add it if it can perform the tasks that I need.
However, I can not find anything in Tika that would allow me to analyze the structure of the text in such a way.
Is there something I missed?
Other than Tika, I am aware of some APIs for analyzing text, specifically Comprehend or Textract, but I still couldn't find something that can ensure the extraction of full, written paragraphs as I require.
I am looking for any suggestion using the libraries I listed above or others. Again, I'd like to avoid things like photo captions and such and only get text that was part of full, written paragraphs.
Is there any library that can help me with this or will I have to code the logic myself (for detecting paragraphs as well as detecting the difference between full paragraphs and text that was extracted from charts and captions)?

is it possible to have java fill in the blanks on a word document

I am making a Java program into which I input answers for a friendship survey. It outputs each student's top ten friends. However, I need to print out the results and give them to the students. The old way of doing it was to have the Java program write HTML; then we would open each file one at a time and print the page. With 400+ students, that takes a while.
Since I am remaking the program, I would like it to produce Word files so that I can print them all out at once. However, I don't know how to write to a Word file, and Notepad isn't stylish enough. Does anyone know how to make this possible, or another way that is easier?

I did a similar thing some years ago, using Rich Text Format. Its advantage is that it's a plain text format that can easily be manipulated.
I created the form document in Word with some unique placeholder strings where I'd later fill in the actual data and saved it as RTF.
With a text editor, I made sure that Word didn't split the placeholders by inserting some junk formatting directives, and corrected that manually where necessary.
Then, filling in the actual data just meant doing some simple text replacement (in my case, there was no risk of interfering with the formatting directives) and saving the resulting RTF file.
As Word typically opens RTF files just as easily as DOC or DOCX ones, this was an easy working solution for me.
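The replacement step might look something like this sketch. The placeholder syntax (`@@NAME@@`) and the class name `RtfTemplate` are only examples, not what I originally used. Unlike my case, it escapes the three characters that are special in RTF, in case the data can contain them.

```java
import java.util.Map;

// Hypothetical helper for filling placeholders in an RTF template.
final class RtfTemplate {

    // Escapes a value so it cannot be mistaken for RTF formatting directives.
    static String escapeRtf(String value) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                // Backslash and braces are RTF syntax, so prefix them with a backslash.
                sb.append('\\').append(c);
            } else if (c > 127) {
                // Non-ASCII: RTF Unicode control word (backslash, 'u', the code
                // unit as a signed 16-bit decimal), then '?' as the fallback
                // character for readers without Unicode support.
                sb.append("\\u").append((int) (short) c).append('?');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Replaces every placeholder key found in the template with its escaped value.
    static String fill(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            result = result.replace(e.getKey(), escapeRtf(e.getValue()));
        }
        return result;
    }
}
```

Read the template file once, call `fill` once per student, and write each result to its own `.rtf` file. Note that the replacements run one after another, so a value that happens to contain another placeholder would be replaced too; choose placeholder strings that cannot occur in the data.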

how to make table formatted text file in android sd card?

I have seen how to create a text file, and it helped me, but all the examples I see just write a plain string to the text file.
Can anyone tell me how to make a table-formatted text file in Android?
I want to make a file (an invoice).

This is most likely going to involve some slightly messy string processing. Assuming you have your data in an acceptable format (such as string arrays), you should be able to construct a single Java string representing the whole table, and then use the code you found already to print it to a file. Use the escape character \t to separate columns and \n to separate rows.

That would be TSV format, and it is very easy to generate. Just add a TAB after every field, and a CR/LF pair after every record.
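A minimal sketch of building such a string, assuming the table is held in a `String[][]` (the class name `TsvTable` is made up). It puts the TAB between fields rather than after every field, and replaces any tabs or line breaks inside a field with spaces so they cannot break the layout.

```java
// Hypothetical helper that turns rows of fields into tab-separated text.
final class TsvTable {

    static String toTsv(String[][] rows) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) {
                    sb.append('\t'); // column separator
                }
                // Keep separators out of the field content itself.
                sb.append(row[i].replace('\t', ' ')
                                .replace('\r', ' ')
                                .replace('\n', ' '));
            }
            sb.append("\r\n"); // record separator (CR/LF pair)
        }
        return sb.toString();
    }
}
```

The resulting string can be written with whatever file-writing code you already have. Columns only line up visually if the viewer uses wide enough tab stops; a spreadsheet program will open the file as a proper table.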

How can I output data with special characters visible?

I have a text file that was provided to me and no one knows the encoding on it. Looking at it in a text editor, everything looks fine, aligned properly into neat columns.
However, I'm seeing some anomalies when I read the data. Even though, visually, the field "Foo" appears in the same columns in the text file (for instance, in columns 15-20), when I try to pull it out using substring(15,20) my data varies wildly. Sometimes I'll pull bytes 11-16, sometimes 18-23, sometimes 15-20...there's no consistency between records.
I suspect that there are some special characters, invisible to my text editor but readable by (and counted in the index of) the String methods. Is there any way in Java to dump the contents of the file with any special characters visible, so I can see which strings I need to replace with a regex?
If not in Java, can anyone recommend a tool that may be able to help me out?

I would start with having a look at the file directly. Any code adds a layer of doubt. Take a Total Commander (or equivalent on your platform), view the file (F3) and switch to hex mode. You suggest that the special characters behavior is not even consistent between lines, so you should get some visual clue about the format before you even attempt to fix it algorithmically.

Have you tried printing the contents of the file as individual integers or bytes? That way you can see if there are any hidden characters.
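Something along these lines would do it. This is just a sketch, and the class name `VisibleChars` is made up: every character outside printable ASCII is shown as its hexadecimal code, so tabs, control characters and multi-byte leftovers stand out.

```java
// Hypothetical helper that makes non-printable characters visible.
final class VisibleChars {

    static String reveal(String line) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c >= 0x20 && c < 0x7F) {
                sb.append(c); // printable ASCII stays as it is
            } else {
                // Everything else is written as a backslash, 'u' and four hex digits.
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }
}
```

Since the encoding is unknown, one option is to read the file as ISO-8859-1 (for example with `Files.readAllLines(path, StandardCharsets.ISO_8859_1)`), which maps every byte to exactly one char, and then print `reveal(line)` for each line.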

Text editor in J2ME - Store text in memory to edit

I'm developing a text editor in J2ME for editing source code, and because it has special features like syntax coloring, I can't use the regular TextBox, so I have to make a text box from scratch, using Canvas.
I found the way of reading/writing files from/to memory card, using FileConnection and the InputStreamReader/OutputStreamWriter classes for read and write text.
Now the problem is: when I read the file, how can I store the information in memory so that I can edit the text freely and decide later whether to save or discard the changes?
Do I create a temporary file where I store the data for editing? But how can I write/delete text in the middle of the file? Or do I have to dump the data in a StringBuffer?
Any methods or alternatives will be welcome.
Thanks!

I'd just use a String (for storing the whole text in one variable)
or a Vector of Strings (for storing the text line by line).
Temporary files are a very bad solution.
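A rough sketch of the line-by-line variant. The class name `LineBuffer` and its methods are made up for illustration. It sticks to calls that I believe are available in CLDC (`Vector`, `StringBuffer`, `String.substring`) and uses a raw `Vector` with casts because J2ME has no generics; a modern compiler will accept it with an unchecked warning.

```java
import java.util.Vector;

// Hypothetical in-memory text buffer: one String per line.
final class LineBuffer {

    private final Vector lines = new Vector(); // raw type on purpose (no generics in J2ME)

    void appendLine(String s) {
        lines.addElement(s);
    }

    int lineCount() {
        return lines.size();
    }

    String line(int row) {
        return (String) lines.elementAt(row);
    }

    // Inserts text into the given line at column col.
    void insert(int row, int col, String text) {
        String old = line(row);
        lines.setElementAt(old.substring(0, col) + text + old.substring(col), row);
    }

    // Deletes the characters from column from (inclusive) to column to (exclusive).
    void delete(int row, int from, int to) {
        String old = line(row);
        lines.setElementAt(old.substring(0, from) + old.substring(to), row);
    }

    // Breaks a line in two at column col (what pressing Enter would do).
    void splitLine(int row, int col) {
        String old = line(row);
        lines.setElementAt(old.substring(0, col), row);
        lines.insertElementAt(old.substring(col), row + 1);
    }

    // Joins all lines again, e.g. for writing back to the file on save.
    String toText() {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < lines.size(); i++) {
            if (i > 0) {
                sb.append('\n');
            }
            sb.append(line(i));
        }
        return sb.toString();
    }
}
```

Only the edited line is rebuilt on each keystroke, and discarding the changes just means throwing the buffer away and re-reading the file.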
