I'm trying to extract text from a PDF which is full of tables.
In some cases, a column is empty.
When I extract the text from the PDF, the empty columns are skipped and replaced by a whitespace; therefore, my regular expressions can't figure out that there was a column with no information at this spot.
Image for a better understanding:
We can see that the columns aren't respected in the extracted text.
Sample of my code that extracts the text from the PDF:
PDFTextStripper reader = new PDFTextStripper();
reader.setSortByPosition(true);
reader.setStartPage(page);
reader.setEndPage(page);
String st = reader.getText(document);
List<String> lines = Arrays.asList(st.split(System.getProperty("line.separator")));
How can I maintain the full structure of the original PDF when extracting text from it?
Thanks a lot.
(This was originally the answer (dated Feb 6 '15) to another question which the OP deleted, including all answers. Due to its age, the code in the answer was still based on PDFBox 1.8.x, so some changes might be necessary to make it run with PDFBox 2.0.x.)
In comments, the OP showed interest in a solution extending the PDFBox PDFTextStripper to return text lines that attempt to reflect the PDF file layout, which might help in the case of the question at hand.
A proof-of-concept for that would be this class:
public class LayoutTextStripper extends PDFTextStripper
{
public LayoutTextStripper() throws IOException
{
super();
}
@Override
protected void startPage(PDPage page) throws IOException
{
super.startPage(page);
cropBox = page.findCropBox();
pageLeft = cropBox.getLowerLeftX();
beginLine();
}
@Override
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
float recentEnd = 0;
for (TextPosition textPosition: textPositions)
{
String textHere = textPosition.getCharacter();
if (textHere.trim().length() == 0)
continue;
float start = textPosition.getTextPos().getXPosition();
boolean spacePresent = endsWithWS || textHere.startsWith(" ");
if (needsWS || spacePresent || Math.abs(start - recentEnd) > 1)
{
int spacesToInsert = insertSpaces(chars, start, needsWS && !spacePresent);
for (; spacesToInsert > 0; spacesToInsert--)
{
writeString(" ");
chars++;
}
}
writeString(textHere);
chars += textHere.length();
needsWS = false;
endsWithWS = textHere.endsWith(" ");
try
{
recentEnd = getEndX(textPosition);
}
catch (IllegalArgumentException | IllegalAccessException | NoSuchFieldException | SecurityException e)
{
throw new IOException("Failure retrieving endX of TextPosition", e);
}
}
}
@Override
protected void writeLineSeparator() throws IOException
{
super.writeLineSeparator();
beginLine();
}
@Override
protected void writeWordSeparator() throws IOException
{
needsWS = true;
}
void beginLine()
{
endsWithWS = true;
needsWS = false;
chars = 0;
}
int insertSpaces(int charsInLineAlready, float chunkStart, boolean spaceRequired)
{
int indexNow = charsInLineAlready;
int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
int spacesToInsert = indexToBe - indexNow;
if (spacesToInsert < 1 && spaceRequired)
spacesToInsert = 1;
return spacesToInsert;
}
float getEndX(TextPosition textPosition) throws IllegalArgumentException, IllegalAccessException, NoSuchFieldException, SecurityException
{
Field field = textPosition.getClass().getDeclaredField("endX");
field.setAccessible(true);
return field.getFloat(textPosition);
}
public float fixedCharWidth = 3;
boolean endsWithWS = true;
boolean needsWS = false;
int chars = 0;
PDRectangle cropBox = null;
float pageLeft = 0;
}
It is used like this:
PDDocument document = PDDocument.load(PDF);
LayoutTextStripper stripper = new LayoutTextStripper();
stripper.setSortByPosition(true);
stripper.fixedCharWidth = charWidth; // e.g. 5
String text = stripper.getText(document);
fixedCharWidth is the assumed character width. Depending on the writing in the PDF in question, a different value might be more appropriate. In my sample documents, values from 3 to 6 were of interest.
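Not part of the original answer, but for tuning one might simply compare the output for a few assumed widths; a throwaway sketch reusing the class above and the PDF variable from the usage snippet:
// Sketch: print the layout text for several assumed character widths
// to pick the one that best preserves the table columns.
PDDocument document = PDDocument.load(PDF);
for (float width = 3; width <= 6; width++) {
    LayoutTextStripper stripper = new LayoutTextStripper();
    stripper.setSortByPosition(true);
    stripper.fixedCharWidth = width;
    System.out.println("=== fixedCharWidth = " + width + " ===");
    System.out.println(stripper.getText(document));
}
document.close();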
It essentially emulates the analogous solution for iText in this answer. Results differ a bit, though, as iText text extraction forwards text chunks and PDFBox text extraction forwards individual characters.
Please be aware that this is merely a proof-of-concept. It especially does not take any rotation into account.
I have a file saved in doc format, and I need to extract highlighted text.
I have code like the following:
HWPFDocument document = new HWPFDocument(fis);
Range r = document.getRange();
for (int i=0;i<5;i++) {
CharacterRun t = r.getCharacterRun(i);
System.out.println(t.isHighlighted());
System.out.println(t.getHighlightedColor());
System.out.println(r.getCharacterRun(i).SPRM_HIGHLIGHT);
System.out.println(r.getCharacterRun(i));
}
None of the above methods show that text is highlighted, but when I open it, it is highlighted.
What can the reason be, and how can I find out whether the text is highlighted or not?
Highlighting text in Word is possible using two different methods. First is applying highlighting to text runs. Second is applying shading to words or paragraphs.
For the first, and using *.doc, the Word binary file format, Apache POI provides methods in CharacterRun. For the second, Apache POI provides Paragraph.getShading, but this is only set if the shading applies to the whole paragraph. If the shading is applied only to single runs, then Apache POI provides nothing for that, so using the underlying SprmOperations is needed.
Microsoft's documentation 2.6.1 Character Properties describes sprmCShd80 (0x4866), which is "A Shd80 structure that specifies the background shading for the text." So we need to search for that.
Example:
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.*;
import org.apache.poi.hwpf.sprm.*;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
public class HWPFInspectBgColor {
private static void showCharacterRunInternals(CharacterRun run) throws Exception {
Field _chpx = CharacterRun.class.getDeclaredField("_chpx");
_chpx.setAccessible(true);
SprmBuffer sprmBuffer = (SprmBuffer) _chpx.get(run);
for (SprmIterator sprmIterator = sprmBuffer.iterator(); sprmIterator.hasNext(); ) {
SprmOperation sprmOperation = sprmIterator.next();
System.out.println(sprmOperation);
}
}
static SprmOperation getCharacterRunShading(CharacterRun run) throws Exception {
SprmOperation shd80Operation = null;
Field _chpx = CharacterRun.class.getDeclaredField("_chpx");
_chpx.setAccessible(true);
Field _value = SprmOperation.class.getDeclaredField("_value");
_value.setAccessible(true);
SprmBuffer sprmBuffer = (SprmBuffer) _chpx.get(run);
for (SprmIterator sprmIterator = sprmBuffer.iterator(); sprmIterator.hasNext(); ) {
SprmOperation sprmOperation = sprmIterator.next();
short sprmValue = (short)_value.get(sprmOperation);
if (sprmValue == (short)0x4866) { // we have a Shd80 structure, see https://msdn.microsoft.com/en-us/library/dd947480(v=office.12).aspx
shd80Operation = sprmOperation;
}
}
return shd80Operation;
}
public static void main(String[] args) throws Exception {
HWPFDocument document = new HWPFDocument(new FileInputStream("sample.doc"));
Range range = document.getRange();
for (int p = 0; p < range.numParagraphs(); p++) {
Paragraph paragraph = range.getParagraph(p);
System.out.println(paragraph);
if (!paragraph.getShading().isEmpty()) {
System.out.println("Paragraph's shading: " + paragraph.getShading());
}
for (int r = 0; r < paragraph.numCharacterRuns(); r++) {
CharacterRun run = paragraph.getCharacterRun(r);
System.out.println(run);
if (run.isHighlighted()) {
System.out.println("Run's highlighted color: " + run.getHighlightedColor());
}
if (getCharacterRunShading(run) != null) {
System.out.println("Run's Shd80 structure: " + getCharacterRunShading(run));
}
}
}
}
}
I'm writing a Java program to swap images inside a PDF. Due to the generation process they are stored as high-dpi RGB images, but they really are bitonal/monochrome images. I'm using iText 7.1.1, but I also tested the latest dev version (7.1.2 snapshot).
I'm already able to extract the images from the PDF and convert them to PNG or TIF using indexed colours or gray (0 & 255 only) in ImageMagick (I also tested GIMP).
I modified some code from iText to replace the images inside the PDF, which does work for DeviceRGB and DeviceGray images, but not for bitonal ones:
public static Image readPng(String pImageFolder, int pImageNumber) throws IOException {
String url = "./" + pImageFolder + "/" + pImageNumber + ".png";
File ifile = new File(url);
if (ifile.exists() && ifile.isFile()) {
return new Image(ImageDataFactory.create(url));
} else {
return null;
}
}
public static void replaceStream(PdfStream orig, PdfStream stream) throws IOException {
orig.clear();
orig.setData(stream.getBytes());
for (PdfName name : stream.keySet()) {
orig.put(name, stream.get(name));
}
}
public static void replaceImages(String pFilename, String pImagefolder, String pOutputFilename) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader(pFilename), new PdfWriter(pOutputFilename));
for (int i = 0; i < pdfDoc.getNumberOfPages(); i++) {
PdfDictionary page = pdfDoc.getPage(i + 1).getPdfObject();
PdfDictionary resources = page.getAsDictionary(PdfName.Resources);
PdfDictionary xobjects = resources.getAsDictionary(PdfName.XObject);
Iterator<PdfName> iter = xobjects.keySet().iterator();
PdfName imgRef;
PdfStream stream;
Image img;
int number;
while (iter.hasNext()) {
imgRef = iter.next();
number = xobjects.get(imgRef).getIndirectReference().getObjNumber();
stream = xobjects.getAsStream(imgRef);
img = readPng(pImagefolder, number);
if (img != null) {
replaceStream(stream, img.getXObject().getPdfObject());
}
}
}
pdfDoc.close();
}
If I convert the images to TIF and use them as replacements, there are dark images (all pixels are black) inside the PDF. If I try to use PNG images, they are not shown and pdfimages complains "Unknown compression method in flate stream".
FYI:
There was an error in my replaceStream: getBytes() decodes (inflates) a PdfStream's data. All stream attributes were copied, though, so there was still filter information saying FlateDecode is necessary.
I had to tell getBytes() not to decode the data by setting the decoded parameter to false: getBytes(false)
public static void replaceStream(PdfStream orig, PdfStream stream) throws IOException {
orig.clear();
orig.setData(stream.getBytes(false));
for (PdfName name : stream.keySet()) {
orig.put(name, stream.get(name));
}
}
Now everything works fine, except:
Bitonal images are not CCITT4, which they should be. (Doesn't matter, because they are converted to JBIG2.)
Images are said to have an error by Acrobat, but every other viewer displays them just fine: there seems to be an error inside the ColorSpace information. It should be DeviceGray, but it is CalGray with some gamma information and a missing WhitePoint. Changing it to DeviceGray by hand makes it work. A workaround is to strip gAMA and cHRM from the PNGs.
Both are conversion errors in iText 7:
CCITT4: PNGImageHelper line 254 should be RawImageHelper.updateRawImageParameters(png.image, png.width, png.height, components, bpc, png.idat.toByteArray(), null); to trigger the conversion.
WhitePoint is correctly read from the file and stored inside the ImageData class, but is discarded inside PdfImageXObject -> createPdfStream.
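As an illustration of the colour-space workaround (a sketch, not the iText fix itself): after replaceStream in the loop above, one could forcibly rewrite the image's ColorSpace entry. This assumes, as described above, that the replacement images really are gray only; PdfName.ColorSpace, PdfName.CalGray and PdfName.DeviceGray are standard iText 7 names, and stream is the variable from replaceImages.
// Sketch: if the replaced image ended up with a CalGray colour space,
// overwrite it with DeviceGray so Acrobat does not complain about the
// missing WhitePoint.
PdfObject cs = stream.get(PdfName.ColorSpace);
if (cs instanceof PdfArray && PdfName.CalGray.equals(((PdfArray) cs).get(0))) {
    stream.put(PdfName.ColorSpace, PdfName.DeviceGray);
}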
I'm using Apache PDFBox from Java, and I have a source PDF with multiple optional content groups. What I want to do is export a version of the PDF that includes only the standard content and the optional content groups that were enabled. It is important for my purposes that I preserve any dynamic aspects of the original, so text fields are still text fields, vector images are still vector images, etc. The reason this is required is that I ultimately intend to use a PDF form editor program that does not know how to handle optional content and would blindly render all of it, so I want to preprocess the source PDF and use the form editing program on a less cluttered destination PDF.
I've been trying to find something with Google that could give me any hints on how to do this, but to no avail. I don't know if I'm just using the wrong search terms, or if this is just something outside of what the PDFBox API was designed for; I rather hope it's not the latter. The info shown here does not seem to work (converting the C# code to Java), because despite the PDF I'm trying to import having optional content, there do not seem to be any OC resources when I examine the tokens on each page.
for(PDPage page:pages) {
PDResources resources = page.getResources();
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
Collection tokens = parser.getTokens();
...
}
I'm truly sorry for not having any more code to show what I've tried so far, but I've just been poring over the java API docs for about 8 hours now trying to figure out what I might need to do this, and just haven't been able to figure it out.
What I DO know how to do is add text, lines, and images to a new PDPage, but I do not know how to retrieve that information from a given source page to copy it over, nor how to tell which optional content group such information is part of (if any). I am also not sure how to copy form fields in the source pdf over to the destination, nor how to copy the font information over.
Honestly, if there's a web page out there that I wasn't able to find with google with the searches that I tried, I'd be entirely happy to read up more about it, but I am really quite stuck here, and I don't know anyone personally that knows about this library.
Please help.
EDIT:
Trying what I understand from what was suggested below, I've written a loop to examine each XObject on the page as follows:
PDResources resources = pdPage.getResources();
Iterable<COSName> names = resources.getXObjectNames();
for(COSName name:names) {
PDXObject xobj = resources.getXObject(name);
PDFStreamParser parser = new PDFStreamParser(xobj.getStream().toByteArray());
parser.parse();
Object [] tokens = parser.getTokens().toArray();
for(int i = 0;i<tokens.length-1;i++) {
Object obj = tokens[i];
if (obj instanceof COSName && obj.equals(COSName.OC)) {
i++;
obj = tokens[i];
if (obj instanceof COSName) {
PDPropertyList props = resources.getProperties((COSName)obj);
if (props != null) {
...
However, after an OC key, the next entry in the tokens array is always an Operator tagged as "BMC". Nowhere do I find any info that I can recognize from the named optional content groups.
Here's a robust solution for removing marked content blocks (open to feedback if anyone finds anything that isn't working right). You should be able to adjust for OC blocks...
This code properly handles nesting and removal of resources (xobject, graphics state and fonts - easy to add others if needed).
public class MarkedContentRemover {
private final MarkedContentMatcher matcher;
/**
*
*/
public MarkedContentRemover(MarkedContentMatcher matcher) {
this.matcher = matcher;
}
public int removeMarkedContent(PDDocument doc, PDPage page) throws IOException {
ResourceSuppressionTracker resourceSuppressionTracker = new ResourceSuppressionTracker();
PDResources pdResources = page.getResources();
PDFStreamParser pdParser = new PDFStreamParser(page);
PDStream newContents = new PDStream(doc);
OutputStream newContentOutput = newContents.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter newContentWriter = new ContentStreamWriter(newContentOutput);
List<Object> operands = new ArrayList<>();
Operator operator = null;
Object token;
int suppressDepth = 0;
boolean resumeOutputOnNextOperator = false;
int removedCount = 0;
while (true) {
operands.clear();
token = pdParser.parseNextToken();
while(token != null && !(token instanceof Operator)) {
operands.add(token);
token = pdParser.parseNextToken();
}
operator = (Operator)token;
if (operator == null) break;
if (resumeOutputOnNextOperator) {
resumeOutputOnNextOperator = false;
suppressDepth--;
if (suppressDepth == 0)
removedCount++;
}
if (OperatorName.BEGIN_MARKED_CONTENT_SEQ.equals(operator.getName())
|| OperatorName.BEGIN_MARKED_CONTENT.equals(operator.getName())) {
COSName contentId = (COSName)operands.get(0);
final COSDictionary properties;
if (operands.size() > 1) {
Object propsOperand = operands.get(1);
if (propsOperand instanceof COSDictionary) {
properties = (COSDictionary) propsOperand;
} else if (propsOperand instanceof COSName) {
properties = pdResources.getProperties((COSName)propsOperand).getCOSObject();
} else {
properties = new COSDictionary();
}
} else {
properties = new COSDictionary();
}
if (matcher.matches(contentId, properties) || suppressDepth > 0) {
// also count nested marked-content sequences while suppressing,
// so an inner EMC does not end the suppression prematurely
suppressDepth++;
}
}
if (OperatorName.END_MARKED_CONTENT.equals(operator.getName())) {
if (suppressDepth > 0)
resumeOutputOnNextOperator = true;
}
else if (OperatorName.SET_GRAPHICS_STATE_PARAMS.equals(operator.getName())) {
resourceSuppressionTracker.markForOperator(COSName.EXT_G_STATE, operands.get(0), suppressDepth == 0);
}
else if (OperatorName.DRAW_OBJECT.equals(operator.getName())) {
resourceSuppressionTracker.markForOperator(COSName.XOBJECT, operands.get(0), suppressDepth == 0);
}
else if (OperatorName.SET_FONT_AND_SIZE.equals(operator.getName())) {
resourceSuppressionTracker.markForOperator(COSName.FONT, operands.get(0), suppressDepth == 0);
}
if (suppressDepth == 0) {
newContentWriter.writeTokens(operands);
newContentWriter.writeTokens(operator);
}
}
if (resumeOutputOnNextOperator)
removedCount++;
newContentOutput.close();
page.setContents(newContents);
resourceSuppressionTracker.updateResources(pdResources);
return removedCount;
}
private static class ResourceSuppressionTracker{
// if the boolean is TRUE, then the resource should be removed. If the boolean is FALSE, the resource should not be removed
private final Map<COSName, Map<COSName, Boolean>> tracker = new HashMap<>();
public void markForOperator(COSName resourceType, Object resourceNameOperand, boolean preserve) {
if (!(resourceNameOperand instanceof COSName)) return;
if (preserve) {
markForPreservation(resourceType, (COSName)resourceNameOperand);
} else {
markForRemoval(resourceType, (COSName)resourceNameOperand);
}
}
public void markForRemoval(COSName resourceType, COSName refId) {
if (!resourceIsPreserved(resourceType, refId)) {
getResourceTracker(resourceType).put(refId, Boolean.TRUE);
}
}
public void markForPreservation(COSName resourceType, COSName refId) {
getResourceTracker(resourceType).put(refId, Boolean.FALSE);
}
public void updateResources(PDResources pdResources) {
for (Map.Entry<COSName, Map<COSName, Boolean>> resourceEntry : tracker.entrySet()) {
for(Map.Entry<COSName, Boolean> refEntry : resourceEntry.getValue().entrySet()) {
if (refEntry.getValue().equals(Boolean.TRUE)) {
// remove from the dictionary of the matching resource type (XObject, ExtGState, Font, ...)
COSDictionary resourceDict = pdResources.getCOSObject().getCOSDictionary(resourceEntry.getKey());
if (resourceDict != null) {
resourceDict.removeItem(refEntry.getKey());
}
}
}
}
}
private boolean resourceIsPreserved(COSName resourceType, COSName refId) {
// preserved entries are stored as FALSE; absent entries are not preserved
return Boolean.FALSE.equals(getResourceTracker(resourceType).get(refId));
}
private Map<COSName, Boolean> getResourceTracker(COSName resourceType){
if (!tracker.containsKey(resourceType)) {
tracker.put(resourceType, new HashMap<>());
}
return tracker.get(resourceType);
}
}
}
Helper class:
public interface MarkedContentMatcher {
public boolean matches(COSName contentId, COSDictionary props);
}
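A hypothetical usage sketch (not from the original answer; the OCG name "Layer1" and the file names are made up), matching optional-content sequences by the group's /Name entry:
// Sketch: remove content belonging to the optional content group named "Layer1".
// COSName.OC tags optional-content BDC sequences; an OCG's properties
// dictionary carries its /Name entry.
MarkedContentRemover remover = new MarkedContentRemover(
    (contentId, props) -> COSName.OC.equals(contentId)
        && "Layer1".equals(props.getString(COSName.NAME)));

try (PDDocument doc = PDDocument.load(new File("in.pdf"))) {
    for (PDPage page : doc.getPages()) {
        remover.removeMarkedContent(doc, page); // rewrites the page content stream
    }
    doc.save("out.pdf");
}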
Optional Content Groups are marked with BDC and EMC. You will have to navigate through all of the tokens returned from the parser and remove the "section" from the array. Here is some C# code that was posted a while ago: How to delete an optional content group alongwith its content from pdf using pdfbox?
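Not from the answer itself, but for orientation one can dump the tokens to see the pattern described; a throwaway PDFBox sketch (the page variable is assumed):
// Sketch: print the content stream tokens. An optional content sequence
// shows up roughly as: COSName{OC}, COSName{MC0}, Operator{BDC}, ..., Operator{EMC},
// where MC0 (name illustrative) resolves through the page's /Properties resources.
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
for (Object token : parser.getTokens()) {
    System.out.println(token);
}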
I investigated that C# code (converting it to Java) but couldn't get it to work as expected. I managed to remove the content between BDC and EMC and then save the result using the same technique as the sample, but the PDF was corrupted. Perhaps that is due to my lack of C# knowledge (related to tuples etc.).
Here is what I came up with; as I said, it doesn't work, but perhaps you or someone else (mkl, Tilman Hausherr) can spot the flaw.
void OCGDelete(PDDocument doc, int pageNum, String OCName) {
    try {
        PDPage pdPage = doc.getDocumentCatalog().getPages().get(pageNum);
        PDResources pdResources = pdPage.getResources();
        PDFStreamParser pdParser = new PDFStreamParser(pdPage);
        pdParser.parse();
        List<Object> newTokens = new ArrayList<>(pdParser.getTokens());
        int ocgStart = -1;
        int ocgLength = -1;
        for (int index = 0; index < newTokens.size(); index++) {
            Object obj = newTokens.get(index);
            if (obj instanceof COSName && obj.equals(COSName.OC)) {
                // found optional content
                index++;
                if (index < newTokens.size()) {
                    obj = newTokens.get(index);
                    if (obj instanceof COSName) {
                        PDPropertyList prop = pdResources.getProperties((COSName) obj);
                        if (prop instanceof PDOptionalContentGroup
                                && ((PDOptionalContentGroup) prop).getName().equals(OCName)) {
                            System.out.println("Found the layer to be deleted");
                            index++;
                            if (index < newTokens.size()) {
                                obj = newTokens.get(index);
                                if (obj instanceof Operator && ((Operator) obj).getName().equals("BDC")) {
                                    ocgStart = index;
                                    System.out.println("OCG start " + ocgStart);
                                    ocgLength = 0;
                                    index++;
                                    while (index < newTokens.size()) {
                                        ocgLength++;
                                        obj = newTokens.get(index);
                                        if (obj instanceof Operator && ((Operator) obj).getName().equals("EMC")) {
                                            System.out.println("OCG end " + (ocgStart + ocgLength));
                                            break;
                                        }
                                        index++;
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
        if (ocgStart >= 0 && ocgLength > 0) {
            // remove the BDC ... EMC range; removing at a fixed index lets
            // the following tokens shift left one by one
            for (int i = 0; i <= ocgLength; i++) {
                newTokens.remove(ocgStart);
            }
        }
        PDStream newContents = new PDStream(doc);
        OutputStream output = newContents.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter writer = new ContentStreamWriter(output);
        writer.writeTokens(newTokens);
        output.close();
        pdPage.setContents(newContents);
    } catch (Exception ex) {
        System.out.println(ex.getMessage());
    }
}
My program needs to index with Lucene (4.10) unstructured documents whose contents can be anything. So my custom Analyzer makes use of the ClassicTokenizer to first tokenize the documents.
Yet it does not completely fit my needs, because, for example, I want to be able to search for parts of an email address or parts of a serial number (which can also be a telephone number or anything containing numbers) that can be written as 1234.5678.9012 or 1234-5678-9012 depending on who wrote the document being indexed.
Since this ClassicTokenizer recognizes emails and treats points followed by numbers as a whole token, the generated index includes email addresses as a whole and serial numbers as a whole too, whereas I would also like to break those tokens into pieces to enable the user to later search for those pieces.
Let me give a concrete example: if the input document features xyz@gmail.com, the ClassicTokenizer recognizes it as an email and consequently tokenizes it as xyz@gmail.com. If the user searches for xyz they will find nothing, whereas a search for xyz@gmail.com will yield the expected result.
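This behaviour is easy to reproduce; a minimal sketch (Lucene 4.10, field name and input made up) printing what a ClassicAnalyzer emits:
// Sketch: show that the classic tokenizer keeps an email address as one token.
Analyzer analyzer = new ClassicAnalyzer(Version.LUCENE_4_10_0);
try (TokenStream ts = analyzer.tokenStream("field", "mail me at xyz@gmail.com")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(term.toString()); // prints: mail, me, xyz@gmail.com ("at" is a stop word)
    }
    ts.end();
}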
After reading lots of blog posts and SO questions I came to the conclusion that one solution could be to use a TokenFilter that would split the email into its pieces (on each side of the @ sign). Please note that I don't want to create my own tokenizer with JFlex and co.
Dealing with email, I wrote the following code inspired by Lucene in Action (2nd edition)'s SynonymFilter:
public class SymbolSplitterFilter extends TokenFilter {
private final CharTermAttribute termAtt;
private final PositionIncrementAttribute posIncAtt;
private final Stack<String> termStack;
private AttributeSource.State current;
public SymbolSplitterFilter(TokenStream in) {
super(in);
termStack = new Stack<>();
termAtt = addAttribute(CharTermAttribute.class);
posIncAtt = addAttribute(PositionIncrementAttribute.class);
}
@Override
public boolean incrementToken() throws IOException {
if (!input.incrementToken()) {
return false;
}
final String currentTerm = termAtt.toString();
System.err.println("The original word was " + termAtt.toString());
final int bufferLength = termAtt.length();
if (bufferLength > 1 && currentTerm.indexOf("@") > 0) { // There must be sth more than just @
// If this is the first pass we fill in the stack with the terms
if (termStack.isEmpty()) {
// We split the token abc@cd.com into abc and cd.com
termStack.addAll(Arrays.asList(currentTerm.split("@")));
// Now we have the constituting terms of the email in the stack
System.err.println("The terms on the stacks are ");
for (int i = 0; i < termStack.size(); i++) {
System.err.println(termStack.get(i));
/** The terms on the stacks are
* xyz
* gmail.com
*/
}
// I am not sure it is the right place for this.
current = captureState();
} else {
// This part seems to never be reached!
// We add the constituents terms as tokens.
String part = termStack.pop();
System.err.println("Current part is " + part);
restoreState(current);
termAtt.setEmpty().append(part);
posIncAtt.setPositionIncrement(0);
}
}
System.err.println("In the end we have " + termAtt.toString());
// In the end we have xyz@gmail.com
return true;
}
}
Please note: I just started with the email part; that's why I only showed that part of the code, but I'll have to enhance it to also manage serial numbers (as explained earlier).
However, the stack is never processed. Indeed, I can't figure out how the incrementToken method works, although I read this SO question, nor when it processes the given token from the TokenStream.
Finally, the goal I want to achieve is: for xyz@gmail.com as input text, I want to generate the following subtokens:
xyz@gmail.com
xyz
gmail.com
Any help appreciated,
Your problem is that the input TokenStream is already exhausted by the time your stack is filled, so input.incrementToken() returns false.
You should check whether the stack is filled before incrementing the input. Like so:
public final class SymbolSplitterFilter extends TokenFilter {
private final CharTermAttribute termAtt;
private final PositionIncrementAttribute posIncAtt;
private final Stack<String> termStack;
private AttributeSource.State current;
private final TypeAttribute typeAtt;
public SymbolSplitterFilter(TokenStream in)
{
super(in);
termStack = new Stack<>();
termAtt = addAttribute(CharTermAttribute.class);
posIncAtt = addAttribute(PositionIncrementAttribute.class);
typeAtt = addAttribute(TypeAttribute.class);
}
@Override
public boolean incrementToken() throws IOException
{
if (!this.termStack.isEmpty()) {
String part = termStack.pop();
restoreState(current);
termAtt.setEmpty().append(part);
posIncAtt.setPositionIncrement(0);
return true;
} else if (!input.incrementToken()) {
return false;
} else {
final String currentTerm = termAtt.toString();
final int bufferLength = termAtt.length();
if (bufferLength > 1 && currentTerm.indexOf("@") > 0) { // There must be sth more than just @
if (termStack.isEmpty()) {
termStack.addAll(Arrays.asList(currentTerm.split("@")));
current = captureState();
}
}
return true;
}
}
}
Note that you might also want to correct your offsets and change the order of your tokens, as the test shows for the resulting tokens:
public class SymbolSplitterFilterTest extends BaseTokenStreamTestCase {
@Test
public void testSomeMethod() throws IOException
{
Analyzer analyzer = this.getAnalyzer();
assertAnalyzesTo(analyzer, "hey xyz@example.com",
new String[]{"hey", "xyz@example.com", "example.com", "xyz"},
new int[]{0, 4, 4, 4},
new int[]{3, 19, 19, 19},
new String[]{"word", "word", "word", "word"},
new int[]{1, 1, 0, 0}
);
}
private Analyzer getAnalyzer()
{
return new Analyzer()
{
@Override
protected Analyzer.TokenStreamComponents createComponents(String fieldName)
{
Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
SymbolSplitterFilter testFilter = new SymbolSplitterFilter(tokenizer);
return new Analyzer.TokenStreamComponents(tokenizer, testFilter);
}
};
}
}
I wrote some code to find all URLs within a PDF file and replace the one(s) that match the parameters passed from a PHP script.
It is working fine when a single URL is passed, but I don't know how to handle more than one URL. I'm guessing I would need a loop that reads the array length and calls the changeURL method with the correct parameters.
I actually made it work with if statements (if myarray.length < 4 do this, if it is < 6 do that, if < 8 ...), but I'm guessing this is not the optimal way, so I removed it and want to try something else.
Parameters passed from PHP (in this order):
args[0] - Location of original PDF
args[1] - Location of new PDF
args[2] - URL 1 (URL to be changed)
args[3] - URL 1a (URL that will replace URL 1)
args[4] - URL 2 (URL to be changed)
args[5] - URL 2a - (URL that will replace URL 2)
args...
and so on... up to maybe around 16 args, depending on how many URLs the PDF file contains.
Here's the code:
Main.java
public class Main {
public static void main(String[] args) {
if (args.length >= 4) {
URLReplacer.changeURL(args);
} else {
System.out.println("PARAMETER MISSING FROM PHP");
}
}
}
URLReplacer.java
public class URLReplacer {
public static void changeURL(String... a) {
try (PDDocument doc = PDDocument.load(a[0])) {
List<?> allPages = doc.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
PDPage page = (PDPage) allPages.get(i);
List annotations = page.getAnnotations();
for (int j = 0; j < annotations.size(); j++) {
PDAnnotation annot = (PDAnnotation) annotations.get(j);
if (annot instanceof PDAnnotationLink) {
PDAnnotationLink link = (PDAnnotationLink) annot;
PDAction action = link.getAction();
if (action instanceof PDActionURI) {
PDActionURI uri = (PDActionURI) action;
String oldURL = uri.getURI();
if (a[2].equals(oldURL)) {
//System.out.println("Page " + (i + 1) + ": Replacing " + oldURL + " with " + a[3]);
uri.setURI(a[3]);
}
}
}
}
}
doc.save(a[1]);
} catch (IOException | COSVisitorException e) {
e.printStackTrace();
}
}
}
I have tried all sorts of loops, but with my limited Java skills did not achieve any success.
Also, if you notice any dodgy code, kindly let me know so I can learn best practices from more experienced programmers.
Your main problem, as I understand it, is the "variable number of variables" you have to send from PHP to Java.
You can either transmit them one by one, as in your example, or in a structure.
There are several possible structures.
JSON is rather simple on the PHP side; there are multiple examples here:
encode json using php?
and for Java you have: Decoding JSON String in Java.
Or others (like XML, which seems too complex for this).
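As a rough sketch of the JSON route (assuming the org.json library on the Java side; all names are illustrative):
import java.util.HashMap;
import java.util.Map;
import org.json.JSONObject;

public class UrlArgs {
    public static void main(String[] args) {
        // The PHP side could send a single argument built with
        // json_encode(array("http://old1" => "http://new1", "http://old2" => "http://new2"))
        JSONObject json = new JSONObject(args[0]);
        Map<String, String> urls = new HashMap<>();
        for (String key : json.keySet()) {
            urls.put(key, json.getString(key)); // old URL -> new URL
        }
        System.out.println(urls);
    }
}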
I'd restructure your method to accept specific parameters. I used a map to accept the URLs; a custom object would be another option.
Also notice the way the loops are changed; it might give you a hint on some Java idioms.
public static void changeURL(String originalPdf, String targetPdf, Map<String, String> urls ) {
try (PDDocument doc = PDDocument.load(originalPdf)) {
List<PDPage> allPages = doc.getDocumentCatalog().getAllPages();
for (PDPage page : allPages) {
List<PDAnnotation> annotations = page.getAnnotations();
for (PDAnnotation annot : annotations) {
if (annot instanceof PDAnnotationLink) {
PDAnnotationLink link = (PDAnnotationLink) annot;
PDAction action = link.getAction();
if (action instanceof PDActionURI) {
PDActionURI uri = (PDActionURI) action;
String oldURL = uri.getURI();
for (Map.Entry<String, String> url : urls.entrySet()){
if (url.getKey().equals(oldURL)) {
uri.setURI(url.getValue());
}
}
}
}
}
}
doc.save(targetPdf);
} catch (IOException | COSVisitorException e) {
e.printStackTrace();
}
}
If you have to get the URL and PDF locations from command line, then call the changeURL function like this:
public static void main(String[] args) {
if (args.length >= 4) {
String originalPdf = args[0];
String targetPdf = args[1];
Map<String, String> urls = new HashMap<String, String>();
for(int i = 2; i< args.length; i+=2){
urls.put(args[i], args[i+1]);
}
URLReplacer.changeURL(originalPdf, targetPdf, urls);
} else {
System.out.println("PARAMETER MISSING FROM PHP");
}
}
Off the top of my head, you could do something like this:
public static void main(String[] args) {
if (args.length >= 4 && args.length % 2 == 0) {
for(int i = 2; i < args.length; i += 2) {
URLReplacer.changeURL(args[0], args[1], args[i], args[i+1]);
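// from the second pass on, load the file we just saved (args[1])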
args[0] = args[1];
}
} else {
System.out.println("PARAMETER MISSING FROM PHP");
}
}