I'm struggling with a small Java project.
I wrote a program that auto-fills a PDF form. Almost everything works, but there is one problem: the form (which is provided by my company, so I have to work with this document) contains a calculated field that computes the total cost from the number of items and the unit price. I set the unit price of a single item as a String in my PDF like this:
public void setEinzelpreis(String Einzelpreis)
{
try {
fieldList.get(30).setValue(Einzelpreis);
...
The unit price should appear in the empty field in the first row. The last cell of the row is calculated automatically by the PDF.
When I click into the "empty" field in the PDF, the value appears.
When I click into another field, the value disappears. This is my problem.
I'm getting the field list via PDFBox; the code that reads the field list from the PDF is:
try {
pdfTemplate = PDDocument.load(template);
PDDocumentCatalog docCatalog = pdfTemplate.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
if (acroForm != null)
{
// Get field names
fieldList = acroForm.getFields();
}
...
So, can anybody tell me what I'm doing wrong? Maybe the PDF expects a double value for the calculation and I'm passing a String? But I don't know how to write a double into the field list. Thanks a lot for any hint!
Edit:
The PDF File which I'm using:
https://1drv.ms/b/s!Av6exjPNXlgOioouAuXL6QV4eUGkqg?e=ocfhvC
And this is the file I generated:
https://1drv.ms/b/s!Av6exjPNXlgOioovK-HuRuXW2aRy_w?e=D1ZCA8
The strange thing is: when I change the value in the document by hand, everything behaves normally, even in a different document viewer.
First of all, the AcroForm form structure in your PDF is weird. It looks like someone used a graphical form generation tool he did not understand and clicked, dragged, dropped, copied, ... until the form in a viewer did what he wanted, not caring about it having become difficult to maintain.
In particular the Einzelpreis fields have a completely unnecessary structure of intermediate and final fields, e.g.
Thus, the field Einzelpreis in € exkl USt1 (the '€' is missing in the tree above) is not the one to fill in, it's merely an intermediary field. The actual form field to fill in is Einzelpreis in € exkl USt1.0.0.0.0.
Unfortunately, your code simply grabs the field at index 30 of the field list returned by PDAcroForm, and this field happens to be the intermediate field Einzelpreis in € exkl USt1; as an intermediate field it has no visible widgets of its own, so your setValue call doesn't change the visible Einzelpreis.
The JavaScript instruction calculating the Gesamtpreis uses the value from the final field, too:
AFSimple_Calculate("PRD", new Array ("Anzahl1", "Einzelpreis in € exkl USt1.0.0.0.0"));
But as the field value is inheritable and none of the .0 fields has a value of its own, the calculation sees the 100 set on the intermediate field once form calculation has been triggered, and uses it.
Thus, you should fill the Einzelpreis in € exkl USt1.0.0.0.0 field instead. And the more robust way to retrieve it is not by index in a field list but by name:
PDField fieldByName = acroForm.getField("Einzelpreis in € exkl USt1.0.0.0.0");
(excerpt from FillInForm test testFill2020_04BeschaffungsantragEinzelpreis)
After filling that field, the "100" should be visible in your form.
The remaining problem, that the Gesamtpreis value is not calculated, is due to the fact already mentioned by @Tilman in a comment on the question: PDFBox doesn't execute JavaScript. Thus, you have to calculate those values yourself and update the fields in question accordingly.
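Because PDFBox will not run the form's script, the product that AFSimple_Calculate("PRD", ...) would produce has to be computed in your own code. Here is a minimal sketch of that calculation, assuming the quantity and the unit price are available as plain decimal strings with a dot as decimal separator. The class and method names are hypothetical, and the number format the Gesamtpreis field actually expects is an assumption you should check against the form:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper, not part of PDFBox.
final class GesamtpreisCalculator {

    /** Multiplies quantity by unit price and rounds the product to two decimals. */
    static String total(String anzahl, String einzelpreis) {
        BigDecimal quantity = new BigDecimal(anzahl.trim());
        BigDecimal unitPrice = new BigDecimal(einzelpreis.trim());
        return quantity.multiply(unitPrice)
                .setScale(2, RoundingMode.HALF_UP)
                .toPlainString();
    }
}
```

The returned string would then be passed to setValue on the Gesamtpreis field, looked up by name in the same way as the Einzelpreis field.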
If you need to know the correct name of a form field, you can do as Tilman proposed and use the PDFBox PDFDebugger. If you hover over the field there, it will display the name in the status bar at the bottom.
By the way, the AcroForm method getFields won't return the field required here anyway. As documented in its JavaDocs, this method returns only the document's root fields, not fields further down in the hierarchy, at least not immediately. (From the user's perspective the method name getFields is a misnomer. It is accurate, though, from the PDF specification's perspective, as the corresponding entry in the AcroForm object has the key Fields.)
Beware, though, that you will probably have to update your PDFBox version. In earlier versions PDFBox did not update appearances of fields with JavaScript actions (assuming some JavaScript would fill them in anyway). I used the current 3.0.0-SNAPSHOT, in which that behavior has been changed.
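To illustrate why a list of root fields does not contain Einzelpreis in € exkl USt1.0.0.0.0, and why that leaf still "sees" a value set further up, here is a small stand-in model of a field hierarchy. FormNode is a hypothetical class written for this explanation, not a PDFBox type; it only mimics fully qualified names built from partial names and the inheritance of a value from the nearest ancestor that has one:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a node in an AcroForm field tree.
final class FormNode {
    final String partialName;
    final FormNode parent;
    final List<FormNode> kids = new ArrayList<>();
    String value; // null means "no value of its own"

    FormNode(String partialName, FormNode parent) {
        this.partialName = partialName;
        this.parent = parent;
        if (parent != null) {
            parent.kids.add(this);
        }
    }

    /** Joins the partial names from the root down to this node with dots. */
    String fullyQualifiedName() {
        return parent == null ? partialName : parent.fullyQualifiedName() + "." + partialName;
    }

    /** Returns the own value, or else the value of the nearest ancestor that has one. */
    String inheritedValue() {
        for (FormNode n = this; n != null; n = n.parent) {
            if (n.value != null) {
                return n.value;
            }
        }
        return null;
    }

    /** Depth-first lookup by fully qualified name, starting from the root fields. */
    static FormNode find(List<FormNode> roots, String fqName) {
        for (FormNode root : roots) {
            FormNode hit = root.findBelow(fqName);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    private FormNode findBelow(String fqName) {
        if (fullyQualifiedName().equals(fqName)) {
            return this;
        }
        for (FormNode kid : kids) {
            FormNode hit = kid.findBelow(fqName);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }
}
```

In this model, a value set on the root node is what inheritedValue() reports for a leaf four levels down, while the leaf itself never shows up in the list of roots. That matches the situation described above, where the value set on the intermediate field is picked up by the calculation but not shown by the leaf's widget.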
I'm working with Lucene 7.4 and have indexed a sample of txt files.
I have some Fields that have been stored, such as path and filename,
and a content Field, which was unstored before passing the doc to the IndexWriter.
Consequently my content Field contains the processed (e.g. tokenized, stemmed) content data of the file, my filename Field contains the unprocessed filename, the entire String.
try (InputStream stream = Files.newInputStream(file)) {
// create empty document
Document doc = new Document();
// add the last modification time field
Field lastModField = new StoredField(LuceneConstants.LAST_MODIFICATION_TIME, Files.getAttribute(file, "lastModifiedTime", LinkOption.NOFOLLOW_LINKS).toString());
doc.add(lastModField);
// add the path Field
Field pathField = new StringField(LuceneConstants.FILE_PATH, file.toString(), Field.Store.YES);
doc.add(pathField);
// add the name Field
doc.add(new StringField(LuceneConstants.FILE_NAME, file.getFileName().toString(), Field.Store.YES));
// add the content
doc.add(new TextField(LuceneConstants.CONTENTS, new BufferedReader(new InputStreamReader(stream))));
System.out.println("adding " + file);
writer.addDocument(doc);
Now, as far as I understand, I have to use 2 QueryParsers, since I need to use 2 different Analyzers for searching over both fields, one for each.
I can't figure out how to combine them.
What I want is a TopDocs in which the results are ordered by a relevance score that is some combination of the 2 relevance scores from the search over the filename Field and the search over the content Field.
Does Lucene 7.4 provide you with the means for an easy solution to this?
PS: This is my first post in a long time, if not ever. Please point out any formatting or content issues.
EDIT:
Analyzer used for indexing content Field and for searching content Field:
Analyzer myTxtAnalyzer = CustomAnalyzer.builder()
.withTokenizer("standard")
.addTokenFilter("lowercase")
.addTokenFilter("stop")
.addTokenFilter("porterstem")
.build();
And I'm using the KeywordAnalyzer to search over the filename Field, which, to reiterate, is stored, so not analyzed.
My program is supposed to index files and search over that index, retrieving
a list of the most relevant documents. If a searchString, which may contain whitespaces, exactly matches the fileName,
I'd like that to heavily impact my search results.
I'm a computer science student, and this is my first project with Lucene.
If there are no functions available, that's fine. What I'm asking for is not a requirement for my task. I'm just pondering, and I feel like a simple solution for this might already exist. But I can't seem to find it, if it does.
EDIT 2:
I had a misconception about what happens when using Store.YES/Store.NO.
My problem has nothing to do with it.
The String wasn't tokenized, because it was in a StringField.
I assumed it was because it was stored.
However, my question remains.
Is there a way to search over tokenized and untokenized Fields concurrently?
As @andrewjames mentions, you don't need to use multiple analyzers in your example because only the TextField gets analyzed; the StringFields do not. If you had a situation where you did need to use different analyzers for different fields, Lucene can accommodate that. To do so you use a PerFieldAnalyzerWrapper, which basically lets you specify a default Analyzer and then as many field-specific analyzers as you like (passed to PerFieldAnalyzerWrapper as a map from field name to analyzer). Then, when analyzing the doc, it will use the field-specific analyzer if one was specified, and otherwise the default analyzer you specified for the PerFieldAnalyzerWrapper.
Whether using a single analyzer or using multiple via PerFieldAnalyzerWrapper, you only need one QueryParser and you will pass that parser either the one analyzer or the PerFieldAnalyzerWrapper which is an analyzer that wraps several analyzers.
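The dispatch rule described above (field-specific analyzer if one is registered, default analyzer otherwise) can be sketched without Lucene on the classpath. The class below is a hypothetical stand-in, not Lucene's API: analyzers are modeled as plain functions from text to a token list, and keyword and lowercaseWords are simplified substitutes for a KeywordAnalyzer and a lowercasing tokenizer chain. In Lucene itself the corresponding piece is the PerFieldAnalyzerWrapper constructor that takes a default Analyzer and a Map of field names to Analyzers.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in that models per-field analyzer selection.
final class PerFieldAnalysis {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField;

    PerFieldAnalysis(Function<String, List<String>> defaultAnalyzer,
                     Map<String, Function<String, List<String>>> perField) {
        this.defaultAnalyzer = defaultAnalyzer;
        this.perField = perField;
    }

    /** Uses the analyzer registered for the field, falling back to the default one. */
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    /** Keeps the whole input as a single token, like a keyword-style analyzer. */
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    /** Lowercases the input and splits it on whitespace. */
    static List<String> lowercaseWords(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }
}
```

With "filename" mapped to keyword and lowercaseWords as the default, a query string for the filename field stays one token, whitespace included, while text for any other field is split and lowercased.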
The fact that some of your fields are stored and some are not stored has no impact on searching them. The only thing that matters for the search is that the field is indexed, and both StringFields and TextFields are always indexed.
You mention the following:
And I'm using the KeywordAnalyzer to search over the filename Field, which, to reiterate, is stored, so not analyzed.
Whether a field is stored or not has nothing to do with whether it's analyzed. For the filename field your code is using a StringField with Field.Store.YES. Because it's a StringField it will be indexed BUT not analyzed, and because you specified to store the field it will be stored. So since the field is NOT analyzed, it won't be using the KeywordAnalyzer or any other analyzer :-)
Is there a way to search over tokenized and untokenized Fields concurrently?
The real issue here isn't about searching tokenized and untokenized fields concurrently; it's really just about searching multiple fields concurrently. The fact that one is tokenized and one is not is of no consequence to Lucene. To search multiple fields at once you can use a BooleanQuery: you add multiple subqueries to it, one for each field, and specify an AND (i.e. MUST) or an OR (i.e. SHOULD) relationship between the subqueries.
I hope this helps clear things up for you.
I want to create a simple spreadsheet with docx4j / xlsx4j. It should contain only Strings; no formulas are needed. The purpose is basically to switch from CSV to XLSX.
Therefore I tried the example here: https://github.com/plutext/docx4j/blob/master/src/samples/xlsx4j/org/xlsx4j/samples/CreateSimpleSpreadsheet.java
Unfortunately it is not working, even after removing the deprecated parts ( http://pastebin.com/bUnJWmFD ).
Excel reports unreadable content and suggests a repair. After that I get the error: "Entfernte Datensätze: Zellinformationen von /xl/worksheets/sheet1.xml-Part", which means something like "Removed records: cell information from /xl/worksheets/sheet1.xml part".
This error occurs when createCell is called in line 58 (see GitHub, not pastebin) or when cell.setV is called with "Hello World" instead of "1234".
I think you are raising 2 issues here:
the resulting XLSX needing repair: this was the result of a typo in cell2.setR, fixed at https://github.com/plutext/docx4j/commit/7d04a65057ad61f5197fb9a98168fc654220f61f
calling setV with "Hello World", you shouldn't do that. Per http://webapp.docx4java.org/OnlineDemo/ecma376/SpreadsheetML/v.html
This element expresses the value contained in a cell. If the cell
contains a string, then this value is an index into the shared string
table, pointing to the actual string value. Otherwise, the value of
the cell is expressed directly in this element. .. For applications
not wanting to implement the shared string table, an 'inline string'
may be expressed in an <is> element under <c> (instead of a
<v> element under <c>), in the same way a string would be
expressed in the shared string table.
though I guess our setV method could detect misuse and either throw an exception or do one of those other things instead.
The CreateSimpleSpreadsheet sample as it stands shows you how to set an inline string, so you just need to test whether your input is a number or not.
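The "is it a number or not" test can be sketched in plain Java. This is a minimal sketch with a hypothetical helper name. It treats anything BigDecimal can parse as numeric, on the assumption that such values are safe to pass to setV, and everything else as text that should go the inline-string (or shared-string) route:

```java
import java.math.BigDecimal;

// Hypothetical helper, not part of docx4j / xlsx4j.
final class CellValueKind {

    /** Returns true if the raw cell content parses as a decimal number. */
    static boolean isNumeric(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return false;
        }
        try {
            new BigDecimal(raw.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

BigDecimal is used here rather than Double.parseDouble because the latter also accepts inputs such as "NaN" or "1234d", which are not plain numbers.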
In a Java class in my XPages application, I'm trying to get a handle on a Notes Document in a Notes View. The Notes View contains several Notes Documents. To get the Notes Document I want, I use 2 keys. This produces an error. If I use just one key, the first Notes Document in the Notes View is returned. The Notes View contains two sorted columns. The first column contains the empLang value, the second column contains the templateType value. Here is my code:
String empLang = "en";
String templateType = "C";
Database dbCurr = session.getCurrentDatabase();
String viewName = "vieAdminTemplates" + empLang;
View tview = dbCurr.getView(viewName);
Vector viewKey = new Vector();
viewKey.addElement(empLang);
viewKey.addElement(templateType); // this line causes the code to fail
Document templateDoc = tview.getDocumentByKey(viewKey);
What could be the cause of this problem?
A couple of ideas
1) You could concatenate the key into a single column since you said that worked. Something like 'en~C'
2) You could use the database.search method where you include a string of formula language that isolates the document you want. It returns a collection, and then you pull the document from there.
getDocumentByKey works with multiple columns. There's a known problem with doubles, but you're not hitting that there. One thing that stands out is the second column is just a single letter. That could be considered as a Char instead of a String, either when you do addElement or by the view.
I'd recommend printing out what data types they are. I think viewKey.get(1).getClass().getName() gives you the class the value is stored as. Do the same for the view column value.
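A minimal sketch of that check for the key Vector, using only java.util (the helper class is hypothetical). It lists the runtime class of every key element, which makes a String "C" easy to tell apart from a char 'C' that was boxed to a Character:

```java
import java.util.Vector;

// Hypothetical debugging helper.
final class KeyTypes {

    /** Returns the runtime class names of all key elements, comma-separated. */
    static String describe(Vector<Object> key) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < key.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(key.get(i).getClass().getName());
        }
        return sb.toString();
    }
}
```

For a key built with addElement("en") and addElement("C") this yields "java.lang.String, java.lang.String". With addElement('C') the second entry becomes java.lang.Character instead.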
When you say it causes the code to fail, how does it fail? Does it just not return anything or throw an error?
My next step would be to try testing it where the View and the Vector contain more than one character, e.g. "CC", to help check if there's an underlying issue with Java getDocumentByKey and single characters.
I'm very sorry. The problem here is that the view name in the code is incorrect. There is a view "vieAdminTemplates" but it does not have a second column containing the value "C". With the correct view, the code works fine. Thanks for taking the time to respond to my question.
Given an acrokey, is it possible to find the absolute position and dimension of that particular field (getLeft, getTop, getWidth, getHeight) ?
And is the reverse possible: if I know the position, can I get the acrokey of the field?
First part of your question:
Suppose that you have an AcroFields instance (form), either retrieved from a PdfReader (read only) or a PdfStamper instance, then you can get the field position of the first widget that corresponds with a specific field name like this:
Rectangle rectangle = form.getFieldPositions(name).get(0).position;
Note that one field can correspond with multiple widgets. For instance, to get the second widget, you need:
Rectangle rectangle = form.getFieldPositions(name).get(1).position;
Of course: you probably also want to know the page number:
int page = form.getFieldPositions(name).get(0).page;
Second part of your question
Fields correspond with widget annotations. If you know the page number of the widget annotation, you could get the page dictionary and inspect the entries of the /Annots array. You'll have to loop over the different annotations, inspecting each annotation's /Rect entry. Once you find a match, you need to crawl the content for the field that corresponds with the annotation. That's more work than can be provided in a code sample on this site.
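The matching step itself is simple once the rectangles are at hand. The sketch below covers only that step and uses no iText types: it assumes you have already collected the /Rect values of the widgets on one page into a map from field name to a four-element array {llx, lly, urx, ury}. The class name, the map layout and the way it is filled are all assumptions made for this illustration:

```java
import java.util.Map;

// Hypothetical helper that matches a point against widget rectangles.
final class FieldLocator {

    /** rect is {llx, lly, urx, ury}; corners are normalized before testing. */
    static boolean contains(float[] rect, float x, float y) {
        float left = Math.min(rect[0], rect[2]);
        float right = Math.max(rect[0], rect[2]);
        float bottom = Math.min(rect[1], rect[3]);
        float top = Math.max(rect[1], rect[3]);
        return x >= left && x <= right && y >= bottom && y <= top;
    }

    /** Returns the name of the first field whose rectangle contains the point, or null. */
    static String fieldAt(Map<String, float[]> rectsByName, float x, float y) {
        for (Map.Entry<String, float[]> entry : rectsByName.entrySet()) {
            if (contains(entry.getValue(), x, y)) {
                return entry.getKey();
            }
        }
        return null;
    }
}
```

Since one field can have several widgets, possibly on different pages, a fuller version would keep one entry per widget together with its page number rather than one entry per field.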
What do I obtain if I call IndexReader.getTermFreqVector(...) on an index created with the TermVector.YES option?
The documentation already answers this, as Xodorap notes in a comment.
The TermFreqVector object returned can retrieve which terms (words produced by your analyzer) a field contains and how many times each of those terms exists within that field.
You can cast the returned TermFreqVector to the interface TermPositionVector if you index the field using TermVector.WITH_OFFSETS, TermVector.WITH_POSITIONS or TermVector.WITH_POSITIONS_OFFSETS. This gives you access to GetTermPositions, which allows you to check where in the field the term exists, and GetOffsets, which allows you to check where in the original content the term originated. The latter allows, combined with Store.YES, highlighting of matching terms in a search query.
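To make the kind of information concrete, here is a small stand-in written with plain collections rather than Lucene types (the class and method names are hypothetical). Given the token stream an analyzer produced for one field, it records for each term the token positions at which it occurs; the number of positions is the term's frequency within that field:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the data a term vector with positions holds.
final class TermVectorSketch {

    /** Maps each term (in sorted order) to the token positions where it occurs. */
    static Map<String, List<Integer>> positions(List<String> tokens) {
        Map<String, List<Integer>> result = new TreeMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            result.computeIfAbsent(tokens.get(i), t -> new ArrayList<>()).add(i);
        }
        return result;
    }

    /** Number of occurrences of the term in the field, 0 if absent. */
    static int frequency(Map<String, List<Integer>> vector, String term) {
        List<Integer> hits = vector.get(term);
        return hits == null ? 0 : hits.size();
    }
}
```

For the tokens "to be or not to be", the term "to" maps to positions 0 and 4 and therefore has a frequency of 2. Character offsets into the original text would be stored in the same way, as start and end pairs per occurrence.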
There are different contributed highlighters available under Contrib area found at the Lucene homepage.
Alternatively, you can use the positions to implement proximity or first-occurrence score contributions, which highlighting won't help you with at all.