Can PDF documents contain "unreachable" content? - java

I am investigating Java PDF libraries.
I have tried:
org.apache.pdfbox
File file = new File("file.pdf");
PDDocument document = PDDocument.load(file);
// Instantiate PDFTextStripper class
PDFTextStripper pdfStripper = new PDFTextStripper();
// Retrieving text from PDF document
String text = pdfStripper.getText(document);
System.out.println(text);
// Closing the document
document.close();
com.itextpdf.text.pdf
public static final String SRC = "file.pdf";
public static final String DEST = "streams";
public static void main(final String[] args) throws IOException {
File file = new File(DEST);
new BruteForce().parse(SRC, DEST);
}
public void parse(final String src, final String dest) throws IOException {
PdfReader reader = new PdfReader(src);
PdfObject obj;
for (int i = 1; i <= reader.getXrefSize(); i++) {
obj = reader.getPdfObject(i);
if ((obj != null) && obj.isStream()) {
PRStream stream = (PRStream) obj;
byte[] b;
try {
b = PdfReader.getStreamBytes(stream);
} catch (UnsupportedPdfException e) {
b = PdfReader.getStreamBytesRaw(stream);
}
FileOutputStream fos = new FileOutputStream(String.format(dest, i));
fos.write(b);
fos.flush();
fos.close();
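} else if (obj != null && obj.isDictionary()) {
// Unused xref entries yield null and not every object is a dictionary, so check before casting.
final PdfDictionary pdfDictionary = (PdfDictionary) obj;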
System.out.println("\t>>>>> " + pdfDictionary + "\t\t" + pdfDictionary.getKeys());
final Set<PdfName> pdfNames = pdfDictionary.getKeys();
for (final PdfName pdfName : pdfNames) {
final PdfObject pdfObject = pdfDictionary.get(pdfName);
final int type = pdfObject.type();
switch (type) {
case PdfObject.NULL:
System.out.println("\t NULL " + pdfObject);
break;
case PdfObject.BOOLEAN:
System.out.println("\t BOOLEAN " + pdfObject);
break;
case PdfObject.NUMBER:
System.out.println("\t NUMBER " + pdfObject);
break;
case PdfObject.STRING:
System.out.println("\t STRING " + pdfObject);
break;
case PdfObject.NAME:
System.out.println("\t NAME " + pdfObject);
break;
case PdfObject.ARRAY:
System.out.println("\t ARRAY " + pdfObject);
break;
case PdfObject.DICTIONARY:
System.out.println("\t DICTIONARY " + ((PdfDictionary)pdfObject).getKeys());
break;
case PdfObject.STREAM:
System.out.println("\t STREAM " + pdfObject);
break;
case PdfObject.INDIRECT:
System.out.println("\t INDIRECT " +pdfObject.getIndRef());
break;
default:
}
System.out.println("\t\t--- " + pdfObject.type());
}
}
}
}
com.snowtide.pdf
String pdfFilePath = "file.pdf";
Document pdf = PDF.open(pdfFilePath);
final List<Annotation> annotations = pdf.getAllAnnotations();
for (final Annotation annotation : annotations) {
System.out.println(annotation.pageNumber());
}
System.out.println(pdf.getAttributeMap());
System.out.println(pdf.getAttributeKeys());
System.out.println("=============================");
StringBuilder text = new StringBuilder(1024);
pdf.pipe(new OutputTarget(text));
pdf.close();
System.out.println(text);
I can extract all visible PDF content including links, text, and images apart from what appears to be a "Watermark" that appears on every page.
Can PDF documents contain "unreachable" content?
Is there no way to extract ALL content from a PDF file?
UPDATE
Thinking the "watermark" was an image, I tried this code:
File fileW = new File("file.pdf");
PDDocument document = PDDocument.load(fileW);
PDPageTree list = document.getPages();
for (PDPage page : list) {
PDResources pdResources = page.getResources();
for (COSName c : pdResources.getXObjectNames()) {
System.out.println("????? ::>>>" + c);
PDXObject o = pdResources.getXObject(c);
if (o instanceof org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) {
File file = new File("Temp/" + System.nanoTime() + ".png");
ImageIO.write(((org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) o).getImage(), "png", file);
} else {
}
}
}
The PDF does contain images of the authors, however the "watermark" is not reached with this approach.

The page content streams of the example document provided by the OP have the following structure from page 2 onward:
A textual header line "www.electrophoresis-journal.com Page X Electrophoresis":
BT
/F1 9.12 Tf
1 0 0 1 72.024 798.46 Tm
/GS7 gs
0 g
0 G
[(w)11(w)11(w)11(.)-12(e)-2(l)15(e)-2(c)23(t)-10(r)-8(o)26(pho)26(r)-8(e)-2(s)21(i)-10(s)] TJ
ET
[...]
BT
1 0 0 1 441.53 798.46 Tm
[(E)6(l)-10(e)-2(c)23(t)-10(r)-8(o)26(pho)26(r)-8(e)23(s)-5(i)15(s)] TJ
ET
BT
1 0 0 1 497.47 798.46 Tm
[( )] TJ
ET
BT
1 0 0 1 72.024 787.9 Tm
[( )] TJ
ET
This text can easily be extracted using normal iText or PDFBox text extraction.
A textual multi-line footer "Received: ... All rights reserved."
BT
1 0 0 1 72.024 109.7 Tm
[(R)9(e)-2(c)23(e)-2(i)-10(v)26(e)-2(d:)41( )] TJ
ET
[...]
BT
1 0 0 1 72.024 47.76 Tm
[(T)6(hi)-10(s)21( )-12(a)23(r)-8(t)15(i)-10(c)23(l)-10(e)23( )13(i)-10(s)21( )-12(pr)-8(o)26(t)15(e)-2(c)23(t)-10(e)-2(d)26( )-12(by)53( )-12(c)-2(o)26(p)-25(y)53(r)-8(i)-10(g)26(ht)-10(.)-12( )-12(A)38(l)-10(l)15( )13(r)-8(i)-10(g)26(ht)15(s)-5( )13(r)-8(e)23(s)-5(e)-2(r)-8(v)26(e)-2(d)26(.)] TJ
ET
BT
1 0 0 1 278.52 47.76 Tm
[( )] TJ
ET
BT
1 0 0 1 72.024 37.2 Tm
[( )] TJ
ET
This text also can easily be extracted using normal iText or PDFBox text extraction.
A set of PDF path creation and filling operations using a custom graphics state forming the transparent "Accepted Article" writing on the left of the page:
/GS8 gs
0 g
39.605 266.51 m
39.605 261.29 39.605 256.06 39.605 250.84 c
42.197 249.94 44.776 248.99 47.367 248.09 c
49.296 247.41 50.704 247.08 51.649 247.08 c
52.413 247.08 53.058 247.38 53.609 247.97 c
54.191 248.54 54.548 249.82 54.729 251.77 c
55.18 251.77 55.624 251.77 56.075 251.77 c
56.075 247.51 56.075 243.26 56.075 239.02 c
55.624 239.02 55.18 239.02 54.729 239.02 c
54.36 240.72 53.903 241.8 53.314 242.3 c
52.144 243.3 49.809 244.47 46.247 245.67 c
32.719 250.33 19.286 255.25 5.7645 259.91 c
5.7645 260.26 5.7645 260.61 5.7645 260.95 c
19.43 265.57 33.014 270.43 46.679 275.05 c
49.984 276.16 52.075 277.24 53.064 278.15 c
54.053 279.06 54.623 280.36 54.729 282 c
55.18 282 55.624 282 56.075 282 c
56.075 276.68 56.075 271.35 56.075 266.03 c
55.624 266.03 55.18 266.03 54.729 266.03 c
54.623 267.64 54.303 268.75 53.753 269.31 c
53.202 269.88 52.519 270.15 51.718 270.15 c
50.666 270.15 48.97 269.75 46.679 268.95 c
44.319 268.15 41.971 267.31 39.605 266.51 c
h
36.92 265.67 m
30.284 263.43 23.686 261.05 17.045 258.81 c
23.686 256.5 30.284 254.07 36.92 251.77 c
36.92 256.4 36.92 261.04 36.92 265.67 c
h
f*
[...]
35.361 630.34 m
40.294 630.31 44.156 631.32 46.967 633.29 c
49.784 635.27 51.18 637.63 51.18 640.31 c
51.18 642.1 50.573 643.67 49.364 645 c
48.156 646.3 46.141 647.43 43.236 648.31 c
43.48 648.62 43.712 648.93 43.962 649.24 c
47.261 648.83 50.253 647.57 52.989 645.6 c
55.731 643.62 57.089 641.06 57.089 638.05 c
57.089 634.76 55.549 631.92 52.413 629.63 c
49.302 627.3 45.158 626.1 39.899 626.1 c
34.203 626.1 29.802 627.33 26.585 629.71 c
23.405 632.07 21.834 635.12 21.834 638.73 c
21.834 641.8 23.048 644.34 25.496 646.28 c
27.981 648.22 31.267 649.24 35.361 649.24 c
35.361 642.94 35.361 636.64 35.361 630.34 c
h
33.258 630.34 m
33.258 634.56 33.258 638.78 33.258 643 c
31.117 642.91 29.633 642.7 28.763 642.37 c
27.417 641.87 26.341 641.14 25.571 640.16 c
24.801 639.19 24.406 638.13 24.406 637.06 c
24.406 635.42 25.158 633.91 26.729 632.64 c
28.306 631.34 30.466 630.55 33.258 630.34 c
h
f*
(The instructions I quoted draw the initial 'A' and the final 'e'.)
This writing cannot be extracted using normal iText or PDFBox text extraction, as it is neither drawn using text instructions nor marked with an ActualText entry. (The latter could be recognized using customized iText or PDFBox text extraction.)
But you can extract this writing as the sequence of path creation and drawing commands it consists of using an implementation of the iText ExtRenderListener interface or a subclass of the PDFBox PDFGraphicsStreamEngine.
The actual text content of the article, opaque, using text drawing instructions, e.g.
BT
/F2 10.08 Tf
1 0 0 1 72.024 760.78 Tm
/GS7 gs
0 g
[(H)-7(I)8(G)16(H)-7( )-106(TH)-6(R)32(O)-7(U)8(G)16(H)-7(P)16(U)8(T )-106(M)-7(U)8(LTI)] TJ
ET
BT
1 0 0 1 212.98 760.78 Tm
[(-)] TJ
ET
BT
1 0 0 1 216.1 760.78 Tm
[(O)-7(R)8(G)-7(A)8(N)32( )-130(M)15(ETA)32(BO)-6(LO)16(M)-7(I)8(C)8(S)8( )-130(I)8(N)8( )-106(TH)-6(E)24( )-130(A)8(P)16(P)16(/)-7(P)16(S)8(1 )-106(M)-7(O)-7(U)8(S)8(E)24( )-130(M)15(O)-7(D)8(EL)24( )-106(O)-7(F)16( )] TJ
ET
This text also can easily be extracted using normal iText or PDFBox text extraction.
Concerning the OP's questions, therefore,
I can extract all visible PDF content including links, text, and images apart from what appears to be a "Watermark" that appears on every page.
Can PDF documents contain "unreachable" content?
That content is not "unreachable"; it merely is not text drawn using text drawing instructions but instead text drawn like an arbitrary shape.
Is there no way to extract ALL content from a PDF file?
You can extract that content, merely not as text but instead as a collection of path creation and drawing instructions. Whenever you suspect such instructions to actually draw letter shapes, you can try to determine the text by rendering these paths as a bitmap and applying OCR.
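The path operators quoted above map directly onto a vector shape. As a minimal sketch (this is not the iText ExtRenderListener or PDFBox PDFGraphicsStreamEngine route, which work on real content streams), the following stdlib-only Java turns a fragment of `m`/`l`/`c`/`h` operators into a `java.awt.geom.Path2D`. Such a shape could then be filled into a bitmap and handed to OCR. The class name `PathFragment` and the whitespace-only tokenizer are assumptions made for illustration; the code assumes the fragment contains nothing but numbers and these four operators.

```java
import java.awt.geom.Path2D;
import java.util.ArrayList;
import java.util.List;

class PathFragment {
    /** Builds a shape from the path construction operators m, l, c and h. */
    static Path2D parse(String fragment) {
        // The quoted stream paints with f*, i.e. the even-odd fill rule.
        Path2D path = new Path2D.Double(Path2D.WIND_EVEN_ODD);
        List<Double> operands = new ArrayList<>();
        for (String token : fragment.trim().split("\\s+")) {
            switch (token) {
                case "m":
                    path.moveTo(operands.get(0), operands.get(1));
                    break;
                case "l":
                    path.lineTo(operands.get(0), operands.get(1));
                    break;
                case "c":
                    path.curveTo(operands.get(0), operands.get(1),
                                 operands.get(2), operands.get(3),
                                 operands.get(4), operands.get(5));
                    break;
                case "h":
                    path.closePath();
                    break;
                default:
                    // Anything else is treated as a numeric operand.
                    operands.add(Double.parseDouble(token));
                    continue;
            }
            operands.clear(); // operands are consumed by each operator
        }
        return path;
    }

    public static void main(String[] args) {
        // Inner triangle of the initial 'A' from the quoted content stream.
        String innerTriangle =
              "36.92 265.67 m "
            + "30.284 263.43 23.686 261.05 17.045 258.81 c "
            + "23.686 256.5 30.284 254.07 36.92 251.77 c "
            + "36.92 256.4 36.92 261.04 36.92 265.67 c "
            + "h";
        System.out.println(parse(innerTriangle).getBounds2D());
    }
}
```

A real extractor would also have to track the current transformation matrix and graphics state (here `/GS8`), which this sketch ignores.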

Related

Keras Neural Network output different than Java TensorFlowInferenceInterface output

I have created a neural network in Keras using the InceptionV3 pretrained model:
base_model = applications.inception_v3.InceptionV3(weights='imagenet', include_top=False)
# add a global spatial average pooling layer
x = base_model.output
x = GlobalAveragePooling2D()(x)
# let's add a fully-connected layer
x = Dense(2048, activation='relu')(x)
x = Dropout(0.5)(x)
predictions = Dense(len(labels_list), activation='sigmoid')(x)
I trained the model successfully and want to predict the following image: https://imgur.com/a/hoNjDfR. To do so, the image is resized to 299x299 and normalized (just divided by 255):
def img_to_array(img, data_format='channels_last', dtype='float32'):
    if data_format not in {'channels_first', 'channels_last'}:
        raise ValueError('Unknown data_format: %s' % data_format)
    # Numpy array x has format (height, width, channel)
    # or (channel, height, width)
    # but original PIL image has format (width, height, channel)
    x = np.asarray(img, dtype=dtype)
    if len(x.shape) == 3:
        if data_format == 'channels_first':
            x = x.transpose(2, 0, 1)
    elif len(x.shape) == 2:
        if data_format == 'channels_first':
            x = x.reshape((1, x.shape[0], x.shape[1]))
        else:
            x = x.reshape((x.shape[0], x.shape[1], 1))
    else:
        raise ValueError('Unsupported image shape: %s' % (x.shape,))
    return x

def load_image_as_array(path):
    if pil_image is not None:
        _PIL_INTERPOLATION_METHODS = {
            'nearest': pil_image.NEAREST,
            'bilinear': pil_image.BILINEAR,
            'bicubic': pil_image.BICUBIC,
        }
        # These methods were only introduced in version 3.4.0 (2016).
        if hasattr(pil_image, 'HAMMING'):
            _PIL_INTERPOLATION_METHODS['hamming'] = pil_image.HAMMING
        if hasattr(pil_image, 'BOX'):
            _PIL_INTERPOLATION_METHODS['box'] = pil_image.BOX
        # This method is new in version 1.1.3 (2013).
        if hasattr(pil_image, 'LANCZOS'):
            _PIL_INTERPOLATION_METHODS['lanczos'] = pil_image.LANCZOS
    with open(path, 'rb') as f:
        img = pil_image.open(io.BytesIO(f.read()))
        width_height_tuple = (IMG_HEIGHT, IMG_WIDTH)
        resample = _PIL_INTERPOLATION_METHODS['nearest']
        img = img.resize(width_height_tuple, resample)
    return img_to_array(img, data_format=K.image_data_format())

img_array = load_image_as_array('https://imgur.com/a/hoNjDfR')
img_array = img_array/255
Then I predict it with the trained model in Keras:
predict(img_array.reshape(1,img_array.shape[0],img_array.shape[1],img_array.shape[2]))
The result is the following:
array([[0.02083278, 0.00425783, 0.8858412 , 0.17453966, 0.2628744 ,
0.00428194, 0.2307986 , 0.01038828, 0.07561868, 0.00983179,
0.09568241, 0.03087404, 0.00751176, 0.00651798, 0.03731382,
0.02220723, 0.0187968 , 0.02018479, 0.3416505 , 0.00586909,
0.02030778, 0.01660049, 0.00960067, 0.02457979, 0.9711478 ,
0.00666443, 0.01468313, 0.0035468 , 0.00694743, 0.03057212,
0.00429407, 0.01556832, 0.03173089, 0.01407397, 0.35166138,
0.00734553, 0.0508953 , 0.00336689, 0.0169737 , 0.07512951,
0.00484502, 0.01656419, 0.01643038, 0.02031735, 0.8343202 ,
0.02500874, 0.02459189, 0.01325032, 0.00414564, 0.08371573,
0.00484318]], dtype=float32)
The important point is that it has four values with a value greater than 0.8:
>>> y[y>=0.8]
array([0.9100583 , 0.96635956, 0.91707945, 0.9711707 ], dtype=float32))
Now I have converted my network to .pb and imported it in an android project. I wanted to predict the same image in android. Therefore I also resize the image and normalize it like I did in Python by using the following code:
// Resize image:
InputStream imageStream = getAssets().open("test3.jpg");
Bitmap bitmap = BitmapFactory.decodeStream(imageStream);
Bitmap resized_image = utils.processBitmap(bitmap,299);
and then normalize by using the following function:
public static float[] normalizeBitmap(Bitmap source,int size){
float[] output = new float[size * size * 3];
int[] intValues = new int[source.getHeight() * source.getWidth()];
source.getPixels(intValues, 0, source.getWidth(), 0, 0, source.getWidth(), source.getHeight());
for (int i = 0; i < intValues.length; ++i) {
final int val = intValues[i];
output[i * 3] = Color.blue(val) / 255.0f;
output[i * 3 + 1] = Color.green(val) / 255.0f;
output[i * 3 + 2] = Color.red(val) / 255.0f ;
}
return output;
}
But in Java I get different values: none of the four indices has a value greater than 0.8; their values are between 0.1 and 0.4.
I have checked my code several times, but I don't understand why I don't get the same values for the same image on Android. Any idea or hint?
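One difference between the two pipelines is visible in the code above: an array built from a PIL RGB image stores red, green, blue for each pixel, whereas `normalizeBitmap` writes blue first. Whether this accounts for the whole gap is not confirmed here, because the resize step in `utils.processBitmap` is not shown and may also differ from PIL's nearest-neighbour resize. The following is a minimal sketch of an RGB-ordered normalization. It uses plain bit shifts in place of `android.graphics.Color` so that it runs on any JVM, and the class and method names are hypothetical.

```java
class RgbNormalizer {
    /**
     * Converts packed ARGB pixels (the layout Bitmap.getPixels produces)
     * into interleaved R, G, B floats in the range [0, 1].
     */
    static float[] normalize(int[] argbPixels) {
        float[] output = new float[argbPixels.length * 3];
        for (int i = 0; i < argbPixels.length; i++) {
            int pixel = argbPixels[i];
            output[i * 3]     = ((pixel >> 16) & 0xFF) / 255.0f; // red
            output[i * 3 + 1] = ((pixel >> 8) & 0xFF) / 255.0f;  // green
            output[i * 3 + 2] = (pixel & 0xFF) / 255.0f;         // blue
        }
        return output;
    }

    public static void main(String[] args) {
        // One opaque pure-red pixel: red ends up in the first slot.
        float[] rgb = normalize(new int[] {0xFFFF0000});
        System.out.println(rgb[0] + " " + rgb[1] + " " + rgb[2]);
    }
}
```

Comparing the first few floats of this array with `img_array.flatten()[:6]` on the Python side is a quick way to check whether both inputs really match before looking at the model itself.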

How to convert raw pdf from server to pdf document

This is my code that converts the Retrofit HTTP ResponseBody to a raw String:
Method 1:
fun ByteArray.toHexString(): String {
var cnt = ""
var cnter = 0
return this.joinToString(cnt) {
if (cnter % 2 == 0)
cnt = " "
else
cnt = ""
cnter++
String.format("%02x", it)
}
}
fun convert() {
val result = response.byteStream().readBytes(response.contentLength().toInt())
val rawHtml = result.toHexString()
}
Method 1 result (snippet). It should have a whitespace after every 4th Byte:
255044462d312e340d0a25aaabacad0d0a312030206f626a0d0a3c3c0d0a2f4e616d65732032203020520d0a2f4f7574707574496e74656e7473205b3c3c0d0a2f446573744f757470757450726f66696c652033203020520d0a2f53202f4754535f50444641310d0a2f496e666f202863850eea75051264315790c769f97999de290d0a2f52656769737472794e616d652028290d0a2f4f7574707574436f6e646974696f6e2028290d0a2f54797065202f4f7574707574496e74656e740d0a2f4f7574707574436f6e646974696f6e4964656e746966696572202853a23adc3a21290d0a3e3e0d0a5d0d0a2f5669657765725072...
Method 2:
private fun getRawHTML(responseBody: ResponseBody): String {
val bodyString = responseBody.byteStream()
val reader = BufferedReader(InputStreamReader(bodyString, "iso-8859-1"), 16)
val sb = StringBuilder()
var line: String?
line = reader.readLine()
while (line != null) {
sb.append(line + "\n")
line = reader.readLine()
}
bodyString.close()
return sb.toString()
}
Method 2 result (snippet):
%PDF-1.4
1 0 obj
<<
/Title (þÿ��M��i���n��p��e��n��s��o��v��e��r��z��i��c�.��n��l)
/Creator (þÿ��w�m��p��d��f�� ��0��1��2��.��1��.��2)
/Producer (þÿ�t�� ��4����6)
/CreationDate (D:20181122184902+01'00')
>>
endobj
3 0 obj
<<
/Type /ExtGState
/SA true
/SM 0.02
/ca 1.0
/CA 1.0
/AIS false
/SMask /None>>
/Filter /FlateDecode
>>
stream
xí]MGr½Ï¯èó*å÷` )Ñ ðÁðÁàZ^,FË{ðß÷{YÕ]
When scrolling down in this PDF it shows that the encoding is /Identity-H:
/Name /FBUKTZ+Verdana
/Type /Font
/Subtype /Type0
/BaseFont /FBUKTZ+Verdana
/Encoding /Identity-H
/ToUnicode 28 0 R
/DescendantFonts [29 0 R]
>>
Which charset corresponds to this?
I want to convert this to a PDF file that can be opened by Adobe Acrobat Reader and shows the original PDF. When I open a correct PDF file with the Sublime Text editor, I see this:
2550 4446 2d31 2e37 0a25 e2e3 cfd3 0a31
2030 206f 626a 0a3c 3c2f 416c 7465 726e
6174 652f 4465 7669 6365 5247 422f 4e20
332f 4c65 6e67 7468 2032 3631 352f 4669
Maybe I could rephrase the question to how can I convert the small snippet to this format? I'm using Kotlin and Java.
Here's a Kotlin program that downloads a PDF file from a server and saves it in a way that allows it to be opened in a PDF viewer:
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.ResponseBody
import java.io.FileOutputStream

fun savePDF(response: ResponseBody) {
    // Copy the raw bytes unchanged; use {} closes both streams even on failure.
    FileOutputStream("my.pdf").use { output ->
        response.byteStream().use { input -> input.copyTo(output) }
    }
}

fun main(args: Array<String>) {
    val request = Request.Builder()
        .url("http://www.oracle.com/events/global/en/java-outreach/resources/java-a-beginners-guide-1720064.pdf")
        .build()
    val client = OkHttpClient()
    val response = client.newCall(request).execute()
    val responseBody = response.body()
    if (responseBody != null) {
        savePDF(responseBody)
    }
}
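Why the bytes have to be written unchanged can be shown without any network code. The following stdlib-only Java sketch (the class name `BinaryRoundTrip` is hypothetical) takes the first 17 bytes of the Method 1 hex dump and sends them through three String round trips: UTF-8, ISO-8859-1, and ISO-8859-1 combined with the `readLine()` loop from Method 2.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BinaryRoundTrip {
    // 25 50 44 46 2d 31 2e 34 0d 0a 25 aa ab ac ad 0d 0a from the hex dump.
    static final byte[] HEADER = {
        0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x34, 0x0d, 0x0a,
        0x25, (byte) 0xaa, (byte) 0xab, (byte) 0xac, (byte) 0xad, 0x0d, 0x0a
    };

    static byte[] viaUtf8String(byte[] in) {
        // Bytes that are not valid UTF-8 are replaced while decoding.
        return new String(in, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    static byte[] viaLatin1String(byte[] in) {
        // ISO-8859-1 maps every byte to exactly one char and back.
        return new String(in, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.ISO_8859_1);
    }

    static byte[] viaReadLine(byte[] in) {
        // Same loop shape as Method 2: readLine() drops "\r\n", "\n" is appended.
        String text = new String(in, StandardCharsets.ISO_8859_1);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString().getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 intact:      " + Arrays.equals(HEADER, viaUtf8String(HEADER)));
        System.out.println("ISO-8859-1 intact: " + Arrays.equals(HEADER, viaLatin1String(HEADER)));
        System.out.println("readLine length:   " + viaReadLine(HEADER).length + " of " + HEADER.length);
    }
}
```

The ISO-8859-1 decoding in Method 2 is lossless on its own; the damage there comes from replacing every line ending, which changes the file length while the PDF's cross-reference table still holds the original byte offsets. The `/Identity-H` entry is a font encoding for glyph codes inside content streams, not a charset for the file, so no charset needs to be chosen at all when the bytes are saved directly.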

Mainframe comp-3 field reading using JRecord

I am trying to read a mainframe file, and everything works except the COMP-3 field. The program below gives strange values: it cannot read the salary value, which should be a decimal number, and it also prints 2020202020.20 values. I don't know what I am missing. Please help me find it.
Program:
public final class Readcopybook {
private String dataFile = "EMPFILE.txt";
private String copybookName = "EMPCOPYBOOK.txt";
public Readcopybook() {
super();
AbstractLine line;
try {
ICobolIOBuilder iob = JRecordInterface1.COBOL.newIOBuilder(copybookName)
.setFileOrganization(Constants.IO_BINARY_IBM_4680).setSplitCopybook(CopybookLoader.SPLIT_NONE);
AbstractLineReader reader = iob.newReader(dataFile);
while ((line = reader.read()) != null) {
System.out.println(line.getFieldValue("EMP-NO").asString() + " "
+ line.getFieldValue("EMP-NAME").asString() + " "
+ line.getFieldValue("EMP-ADDRESS").asString() + " "
+ line.getFieldValue("EMP-SALARY").asString() + " "
+ line.getFieldValue("EMP-ZIPCODE").asString());
}
reader.close();
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
new Readcopybook();
}
}
EMPCOPYBOOK:
001700 01 EMP-RECORD.
001900 10 EMP-NO PIC 9(10).
002000 10 EMP-NAME PIC X(30).
002100 10 EMP-ADDRESS PIC X(30).
002200 10 EMP-SALARY PIC S9(8)V9(2) COMP-3.
002200 10 EMP-ZIPCODE PIC 9(4).
EMPFILE:
0000001001suneel kumar r bangalore e¡5671
0000001002JOSEPH WHITE FIELD rrn4500
Output:
1001 suneel kumar r bangalore 20200165a10 5671
2020202020.20
2020202020.20
2020202020.20
2020202020.20
2020202020.20
2020202020.20
2020202020.20
2020202020.20
0.00
1002 JOSEPH WHITE FIELD 202072726e0 4500
One problem is that you have done an EBCDIC-to-ASCII conversion on the file.
The 2020... is a dead giveaway: x'20' is the ASCII space character.
This Answer deals with problems with doing an Ebcdic to ascii conversion.
You need to do a binary transfer from the mainframe and read the file using EBCDIC. You will need to check the RECFM on the mainframe. If the RECFM is
FB - no problems, just transfer
VB - either convert to FB on the mainframe or include the RDW (Record Descriptor Word) option in the transfer.
Other - convert to FB/VB on the mainframe
Updated java Code
int fileOrg = Constants.IO_FIXED_LENGTH_RECORDS; // or Constants.IO_VB
ICobolIOBuilder iob = JRecordInterface1.COBOL
        .newIOBuilder(copybookName)
        .setFileOrganization(fileOrg)
        .setFont("Cp037")
        .setSplitCopybook(CopybookLoader.SPLIT_NONE);
Note: IO_BINARY_IBM_4680 is for IBM 4690 Registers
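To see why bytes that went through a character conversion cannot be a valid COMP-3 value, it helps to look at the packed-decimal layout itself: each byte holds two decimal digits, except the last, whose low nibble is the sign. `PIC S9(8)V9(2) COMP-3` therefore occupies 6 bytes (10 digits plus sign is 11 nibbles, padded to 12). The following is a minimal stdlib-only sketch of that decoding; it is not JRecord's implementation, and the class name `PackedDecimal` is hypothetical.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

class PackedDecimal {
    /** Decodes a COMP-3 field; scale is the number of digits after the implied decimal point (V). */
    static BigDecimal decode(byte[] field, int scale) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < field.length; i++) {
            digits.append(digit((field[i] >> 4) & 0x0F));
            if (i < field.length - 1) {
                digits.append(digit(field[i] & 0x0F));
            }
        }
        // The low nibble of the last byte carries the sign.
        int sign = field[field.length - 1] & 0x0F;
        boolean negative;
        switch (sign) {
            case 0xA: case 0xC: case 0xE: case 0xF:
                negative = false;
                break;
            case 0xB: case 0xD:
                negative = true;
                break;
            default:
                throw new IllegalArgumentException("Invalid sign nibble: " + sign);
        }
        BigInteger unscaled = new BigInteger(digits.toString());
        return new BigDecimal(negative ? unscaled.negate() : unscaled, scale);
    }

    private static char digit(int nibble) {
        if (nibble > 9) {
            throw new IllegalArgumentException("Invalid digit nibble: " + nibble);
        }
        return (char) ('0' + nibble);
    }

    public static void main(String[] args) {
        // x'00001234567C' read as S9(8)V9(2)
        byte[] salary = {0x00, 0x00, 0x12, 0x34, 0x56, 0x7C};
        System.out.println(decode(salary, 2));
    }
}
```

Six ASCII spaces (x'202020202020') end in the nibble 0, which is not a sign code, so this decoder rejects them. Any transfer that rewrites bytes as characters destroys the field in the same way, which is why the file has to arrive in binary.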
There is a wiki entry here
or this Question
How do you generate java~jrecord code for a Cobol copybook

PDFBox 2.0.7 ExtractText not working but 1.8.13 does and PDFReader as well

hopefully you have an idea of what is going wrong with extracting a text from PDF using pdfbox 2.0.7. The result is very strange:
Using 1.8.13, the command java -jar pdfbox-app-1.8.13.jar ExtractText -sort -nonSeq test.pdf leads to
Deutsche Bank Privat- und Geschäftskunden AG
Bruttoertrag 43,80 USD 37,15 EUR
Kapitalertragsteuer (KESt) - 5,36 USD - 4,55 EUR
Solidaritätszuschlag auf KESt - 0,29 USD - 0,25 EUR
Umrechnungskurs USD zu EUR 1,1791000000
Gutschrift mit Wert 15.08.2017 32,35 EUR
Using 2.0.7, the command java -jar pdfbox-app-2.0.7.jar ExtractText -sort test.pdf leads to
aeutsche Bank mrivat- und deschäftskunden Ad
Bruttoertrag QPIUM rpa PTINR bro
hapitaäertragsteuer EhbptF - RIPS rpa - QIRR bro
poäidaritätszuschäag auf hbpt - MIOV rpa - MIOR bro
rmrechnungskurs rpa zu bro NINTVNMMMMMM
dutschrift mit tert NRKMUKOMNT POIPR bro
The debugger with java -jar pdfbox-app-2.0.7.jar PDFDebugger test.pdf shows the correct text in Root/Pages/Kids/[1]/Contents/[1] so somehow the text is read correctly but not exported correctly.
I have tried to compare the information shown in the two PDFDebugger applications but they seem rather identical to me (although I don't know where/what to look for exactly). Unfortunately, I cannot share the PDF document.
I would be happy for any kind of hint of how to solve or even only attack this problem as otherwise I cannot use the newer version of pdfbox. Thanks in advance for your time!
Here is a screenshot of the Font which is used in the document (extracted with 2.0.7). This is exactly the translation of the letters that apparently is not performed:
The entry ToUnicode says
%!PS-Adobe-3.0 Resource-CMap
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /AdHoc-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
68 beginbfchar
<0004> <0021>
<0009> <0026>
<000b> <0028>
<000c> <0029>
<000f> <002c>
<0010> <002d>
<0011> <002e>
<0012> <002f>
<0013> <0030>
<0014> <0031>
<0015> <0032>
<0016> <0033>
<0017> <0034>
<0018> <0035>
<0019> <0036>
<001a> <0037>
<001b> <0038>
<001c> <0039>
<001d> <003a>
<001e> <003b>
<0024> <0041>
<0025> <0042>
<0026> <0043>
<0027> <0044>
<0028> <0045>
<0029> <0046>
<002a> <0047>
<002b> <0048>
<002c> <0049>
<002e> <004b>
<0030> <004d>
<0031> <004e>
<0032> <004f>
<0033> <0050>
<0034> <0051>
<0035> <0052>
<0036> <0053>
<0037> <0054>
<0038> <0055>
<0039> <0056>
<003a> <0057>
<003d> <005a>
<0044> <0061>
<0045> <0062>
<0046> <0063>
<0047> <0064>
<0048> <0065>
<0049> <0066>
<004a> <0067>
<004b> <0068>
<004c> <0069>
<004d> <006a>
<004e> <006b>
<004f> <006c>
<0050> <006d>
<0051> <006e>
<0052> <006f>
<0053> <0070>
<0055> <0072>
<0056> <0073>
<0057> <0074>
<0058> <0075>
<0059> <0076>
<005a> <0077>
<005d> <007a>
<006c> <00e4>
<0081> <00fc>
<0089> <00df>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
The TextView of page 2 of PDF already shows the correct text, but then somehow these replacement tables that are shown above seem to incorrectly modify the text content before it is exported by pdfbox:
Root/Pages/Kids/[1]/Contents/[1]:
=================================
0 Tw
0 Tc
0 0 0 rg
0 0 0 RG
BT
/F1 10 Tf
1 0 0 1 69.449 697.11 Tm
(Wir) Tj
1 0 0 1 87.199 697.11 Tm
(\374berweisen) Tj
1 0 0 1 141.099 697.11 Tm
(den) Tj
1 0 0 1 160.549 697.11 Tm
(Betrag) Tj
1 0 0 1 192.759 697.11 Tm
(von) Tj
1 0 0 1 211.649 697.11 Tm
(32,35) Tj
1 0 0 1 239.429 697.11 Tm
(EUR) Tj
1 0 0 1 263.299 697.11 Tm
(auf) Tj
1 0 0 1 279.959 697.11 Tm
(Ihr) Tj
1 0 0 1 294.389 697.11 Tm
(Konto) Tj
1 0 0 1 323.269 697.11 Tm
(XXXXXXX) Tj
1 0 0 1 364.959 697.11 Tm
(XX) Tj
1 0 0 1 376.079 697.11 Tm
(.) Tj
0 G
0 g
ET
69.449 669.448 m
69.449 669.698 l
549.921 669.698 l
549.921 669.448 l
549.921 669.198 l
69.449 669.198 l
h
f
0 0 0 rg
0 0 0 RG
BT
/F1 6 Tf
1 0 0 1 249.022 658.948 Tm
(Kapitalertr\344ge) Tj
1 0 0 1 288.016 658.948 Tm
(sind) Tj
1 0 0 1 300.682 658.948 Tm
(einkommensteuerpflichtig!) Tj
1 0 0 1 213.865 652.783 Tm
(Diese) Tj
1 0 0 1 230.863 652.783 Tm
(Mitteilung) Tj
1 0 0 1 258.187 652.783 Tm
(wurde) Tj
1 0 0 1 276.187 652.783 Tm
(maschinell) Tj
1 0 0 1 306.187 652.783 Tm
(erstellt) Tj
1 0 0 1 325.507 652.783 Tm
(und) Tj
1 0 0 1 337.177 652.783 Tm
(wird) Tj
1 0 0 1 349.837 652.783 Tm
(nicht) Tj
1 0 0 1 364.165 652.783 Tm
(unterschrieben.) Tj
0 G
0 g
ET
q
1 0 0 1 504.562 772.646 cm
1 0 0 1 0 0 cm
q
0 Tw
0 Tc
45.36 0 0 45.36 0 0 cm
/I0 Do
Q
Q
0 0 0 rg
0 0 0 RG
BT
/F1 10.5 Tf
1 0 0 1 552.756 23.464 Tm
(2) Tj
1 0 0 1 558.594 23.464 Tm
(/) Tj
1 0 0 1 561.503 23.464 Tm
(2) Tj
ET
Q
q
0 0 m
0 841.89 l
595.276 841.89 l
595.276 0 l
h
0 0 m
595.276 0 l
595.276 841.89 l
0 841.89 l
h
W
n
Q
1.8.13 shows:
Wir überweisen den Betrag von 32,35 EUR auf Ihr Konto XXXXXXX XX.
Kapitalerträge sind einkommensteuerpflichtig!
Diese Mitteilung wurde maschinell erstellt und wird nicht unterschrieben.
2/2
2.0.7 shows:
tir überweisen den Betrag von POIPR bro auf fhr honto XXXXXXX XX
hapitaäerträge sind einkommensteuerpfäichtig!
aiese jitteiäung wurde maschineää ersteäät und wird nicht unterschriebenK
O/O
This is the file that you were asking for: https://wetransfer.com/downloads/214674449c23713ee481c5a8f529418320170827201941/b2bea6
The information about the font in question in your PDF is contradictory and partially broken. Depending on how a given piece of software reacts to that, it may or may not extract the text correctly.
On the one hand the font has an Encoding value WinAnsiEncoding. This is ok and matches what we see in the content stream, a one-byte encoding covering many of the ANSI codes.
On the other hand we have a ToUnicode map which implies that the underlying encoding is some two-byte encoding (it has a code space range <0000> <ffff>), and even if one ignores the two-byte nature, it has mappings which in particular map digit ANSI codes to uppercase letters, uppercase letter ANSI codes to other lowercase letters, and the lowercase 'l' ANSI code to the Unicode value of 'ä'.
When extracting text, PDFBox 2.0.x seems to follow the broken ToUnicode map (interpreting the two-byte codes in the table as one-byte codes, ignoring the upper 0) where possible (resulting in garbage) and otherwise to interpret the character code as ANSI (resulting in proper text). PDFBox 1.8.x seems to have ignored the ToUnicode map, and so does Adobe Reader.
Actually it looks like the ToUnicode map has been made for a font using Identity-H encoding.
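The effect can be reproduced outside PDFBox. The sketch below is a stdlib-only illustration, not PDFBox's decoding logic: it applies a handful of the bfchar entries quoted in the question to one-byte codes, keeping only the low byte of each source code, and leaves unmapped codes as they are. The class name `BrokenToUnicodeDemo` is hypothetical, and only a subset of the 68 entries is included.

```java
import java.util.HashMap;
import java.util.Map;

class BrokenToUnicodeDemo {
    // Subset of the bfchar entries from the question, keyed by the low byte.
    static final Map<Integer, Character> BFCHAR = new HashMap<>();
    static {
        BFCHAR.put(0x2c, 'I');      // <002c> <0049>  ','
        BFCHAR.put(0x2e, 'K');      // <002e> <004b>  '.'
        BFCHAR.put(0x32, 'O');      // <0032> <004f>  '2'
        BFCHAR.put(0x33, 'P');      // <0033> <0050>  '3'
        BFCHAR.put(0x35, 'R');      // <0035> <0052>  '5'
        BFCHAR.put(0x45, 'b');      // <0045> <0062>  'E'
        BFCHAR.put(0x4b, 'h');      // <004b> <0068>  'K'
        BFCHAR.put(0x52, 'o');      // <0052> <006f>  'R'
        BFCHAR.put(0x55, 'r');      // <0055> <0072>  'U'
        BFCHAR.put(0x57, 't');      // <0057> <0074>  'W'
        BFCHAR.put(0x6c, '\u00e4'); // <006c> <00e4>  'l'
    }

    /** Applies the map where an entry exists and keeps the ANSI character otherwise. */
    static String misdecode(String contentStreamText) {
        StringBuilder out = new StringBuilder();
        for (char c : contentStreamText.toCharArray()) {
            // For the ASCII range the char value equals the WinAnsi code.
            Character mapped = BFCHAR.get((int) c);
            if (mapped != null) {
                out.append(mapped.charValue());
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(misdecode("Wir"));
        System.out.println(misdecode("32,35 EUR"));
        System.out.println(misdecode("2/2"));
    }
}
```

Fed with strings from the content stream, this yields the same garbage as the 2.0.7 output in the question ("tir", "POIPR bro", "O/O"), which supports the reading that the map is being applied to one-byte codes it was never written for.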
If you are confronted with such a PDF and need to extract its text, you can pre-process it and remove the ToUnicode entries; thereafter text extraction should return proper text. E.g.
PDDocument document = PDDocument.load(SOURCE);
for (int pageNr = 0; pageNr < document.getNumberOfPages(); pageNr++)
{
    PDPage page = document.getPage(pageNr);
    PDResources resources = page.getResources();
    removeToUnicodeMaps(resources);
}
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
(ExtractText test method testNoToUnicodeTest2)
using helper methods
void removeToUnicodeMaps(PDResources pdResources) throws IOException
{
    COSDictionary resources = pdResources.getCOSObject();
    COSDictionary fonts = asDictionary(resources, COSName.FONT);
    if (fonts != null)
    {
        for (COSBase object : fonts.getValues())
        {
            while (object instanceof COSObject)
                object = ((COSObject)object).getObject();
            if (object instanceof COSDictionary)
            {
                COSDictionary font = (COSDictionary)object;
                font.removeItem(COSName.TO_UNICODE);
            }
        }
    }
    for (COSName name : pdResources.getXObjectNames())
    {
        PDXObject xobject = pdResources.getXObject(name);
        if (xobject instanceof PDFormXObject)
        {
            PDResources xobjectPdResources = ((PDFormXObject)xobject).getResources();
            removeToUnicodeMaps(xobjectPdResources);
        }
    }
}

COSDictionary asDictionary(COSDictionary dictionary, COSName name)
{
    COSBase object = dictionary.getDictionaryObject(name);
    return object instanceof COSDictionary ? (COSDictionary) object : null;
}
(from ExtractText)
You should execute this pre-processing as early as possible after loading the document to prevent the fonts, including their wrong ToUnicode mappings, from being read into the document font cache.

LineBreakMeasurer produces result differ from MS Word / LibreOffice

In a Swing application, I need to predict how a string will wrap when it is placed in a word processor such as MS Word or LibreOffice, given the same displayable width, the same font (face and size) and the same string, as follows:
displayable area width: 179mm (in a .doc file, setup an A4 portrait page - width = 210mm, margin left = 20mm, right = 11mm; the paragraph is formatted with zero margins)
Font Times New Roman, size 14
Test string: Tadf fdas fdas daebjnbvx dasf opqwe dsa: dfa fdsa ewqnbcmv caqw vstrt vsip d asfd eacc
And the result:
In both MS Word and LibreOffice, the test string is displayed on a single line; no text wrapping occurs.
My program below reports that the text wraps onto 2 lines:
Line 1: Tadf fdas fdas daebjnbvx dasf opqwe dsa: dfa fdsa ewqnbcmv caqw vstrt vsip d asfd
Line 2: eacc
Is it possible to achieve the same text wrapping effect like MS Word in swing? What could be wrong in the code?
Below is my program:
public static List<String> wrapText(String text, float maxWidth,
Graphics2D g, Font displayFont) {
// Normalize the graphics context so that 1 point is exactly
// 1/72 inch and thus fonts will display at the correct sizes:
GraphicsConfiguration gc = g.getDeviceConfiguration();
g.transform(gc.getNormalizingTransform());
AttributedCharacterIterator paragraph = new AttributedString(text).getIterator();
Font backupFont = g.getFont();
g.setFont(displayFont);
LineBreakMeasurer lineMeasurer = new LineBreakMeasurer(
paragraph, BreakIterator.getWordInstance(), g.getFontRenderContext());
// Set position to the index of the first character in the paragraph.
lineMeasurer.setPosition(paragraph.getBeginIndex());
List<String> lines = new ArrayList<String>();
int beginIndex = 0;
// Get lines until the entire paragraph has been displayed.
while (lineMeasurer.getPosition() < paragraph.getEndIndex()) {
lineMeasurer.nextLayout(maxWidth);
lines.add(text.substring(beginIndex, lineMeasurer.getPosition()));
beginIndex = lineMeasurer.getPosition();
}
g.setFont(backupFont);
return lines;
}
public static void main(String[] args) throws Exception {
JFrame frame = new JFrame();
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
JTextPane txtp = new JTextPane();
frame.add(txtp);
frame.setSize(200,200);
frame.setVisible(true);
Font displayFont = new Font("Times New Roman", Font.PLAIN, 14);
float textWith = (179 * 0.0393701f) // from Millimeter to Inch
* 72f; // From Inch to Pixel (User space)
List<String> lines = wrapText(
"Tadf fdas fdas daebjnbvx dasf opqwe dsa: dfa fdsa ewqnbcmv caqw vstrt vsip d asfd eacc",
textWith,
(Graphics2D) txtp.getGraphics(),
displayFont);
for (int i = 0; i < lines.size(); i++) {
System.out.print("Line " + (i + 1) + ": ");
System.out.println(lines.get(i));
}
frame.dispose();
}
+1 for the question
From my experience with text editors, it is not possible to achieve exactly the same measurements.
You can experiment with the DPI: the default is 72, while Windows uses 96.
You can also experiment with the rendering hints of the Graphics object, such as text antialiasing.
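The DPI remark can be made concrete with a little arithmetic. A minimal sketch (the class and method names are hypothetical) converts the 179 mm text width from the question into units at 72 and at 96 units per inch, using the exact 25.4 mm per inch instead of the rounded 0.0393701 factor:

```java
class PageWidth {
    static final double MM_PER_INCH = 25.4;

    /** Converts a length in millimetres to device units at the given units-per-inch. */
    static double mmToUnits(double millimetres, double unitsPerInch) {
        return millimetres / MM_PER_INCH * unitsPerInch;
    }

    public static void main(String[] args) {
        System.out.printf("179 mm at 72 per inch: %.2f%n", mmToUnits(179, 72));
        System.out.printf("179 mm at 96 per inch: %.2f%n", mmToUnits(179, 96));
    }
}
```

The two results differ by a third (about 507.4 versus 676.5), so a mismatch between the width passed to `wrapText` and the scale at which the font is measured is enough to move a line break by several words.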
