Reading PDF Literal String parsing dilemma - java
I have the following contents in the same PDF page, in different objects:
First:
[(some text)] TJ ET Q
[(some other text)] TJ ET Q
Very simple and basic so far...
The second:
[( H T M L E x a m p l e)] TJ ET Q
[( S o m e s p e c i a l c h a r a c t e r s : < ¬ ¬ ¬ & ט ט © > \\ s l a s h \\ \\ d o u b l e - s l a s h \\ \\ \\ t r i p l e - s l a s h )] TJ ET Q
NOTE: It is not noticeable in the text above, but:
'H T M L E x a m p l e' is actually 0H0T0M0L0[32]0E0x0a0m0p0l0e, where each 0 is a literal byte with value 0 (i.e. (char) 0). So if I ignore all the 0 bytes, this turns out to be just like the first example...
Some Bytes:
htmlexample == [0, 72, 0, 84, 0, 77, 0, 76, 0, 32, 0, 69, 0, 120, 0, 97, 0, 109, 0, 112, 0, 108, 0, 101]
<content> == [0, 32, 32, -84, 0, 32, 32, -84, 0, 32, 32, -84, 0, 32, 0, 38, 0, 32, 0, -24, 0, 32, 0, -24, 0, 32, 0, -87, 0, 32, 0]
But in the next line I need to combine every two bytes into one char, because of the following:
< ¬ ¬ ¬...> is actually <0[32][32]¬0[32][32]¬0[32][32]¬...>, where the two-byte combination [32]¬ (0x20 0xAC) is €
The problem I'm facing is not the conversion itself. For the conversion I use:
new String(sb.toString().getBytes("UTF-8"),"UTF-16BE")
The problem is to know when to apply it and when to keep the UTF-8.
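For illustration, this is the kind of byte-pair combination I mean; a minimal sketch working directly on the raw string bytes (class and method names are mine), which for these characters gives the same result as decoding the raw bytes as UTF-16BE:

import java.nio.charset.StandardCharsets;

class TwoByteDecode {
    // Combine the raw string bytes pairwise into big-endian 16-bit code units,
    // e.g. the pair [32, -84] (0x20 0xAC) becomes U+20AC, the euro sign.
    static String decodeTwoByte(byte[] raw) {
        StringBuilder out = new StringBuilder(raw.length / 2);
        for (int i = 0; i + 1 < raw.length; i += 2) {
            int hi = raw[i] & 0xFF;      // first byte = high-order byte
            int lo = raw[i + 1] & 0xFF;  // second byte = low-order byte
            out.append((char) ((hi << 8) | lo));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] htmlExample = {0, 72, 0, 84, 0, 77, 0, 76, 0, 32, 0, 69, 0, 120,
                              0, 97, 0, 109, 0, 112, 0, 108, 0, 101};
        System.out.println(decodeTwoByte(htmlExample));                         // HTML Example
        System.out.println(new String(htmlExample, StandardCharsets.UTF_16BE)); // HTML Example
    }
}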
== UPDATE ==
The font used for the problematic Object is:
#7 0# {
'Name' : "F4"
'BaseFont' : "AAAAAE+DejaVuSans-Bold"
'Subtype' : "Type0"
'ToUnicode' : #41 0# {
'Filter' : "FlateDecode"
'Length' : 1679.0f
} + Stream(5771 bytes)
'Encoding' : "Identity-H"
'DescendantFonts' : [#42 0# {
'FontDescriptor' : #43 0# {
'MaxWidth' : 2016.0f
'AvgWidth' : 573.0f
'FontBBox' : [-1069.0f, -415.0f, 1975.0f, 1174.0f]
'MissingWidth' : 600.0f
'FontName' : "AAAAAE+DejaVuSans-Bold"
'Type' : "FontDescriptor"
'CapHeight' : 729.0f
'StemV' : 60.0f
'Leading' : 0.0f
'FontFile2' : #34 0# {
'Filter' : "FlateDecode"
'Length1' : 83036.0f
'Length' : 34117.0f
} + Stream(83036 bytes)
'Ascent' : 928.0f
'Descent' : -236.0f
'XHeight' : 547.0f
'StemH' : 26.0f
'Flags' : 32.0f
'ItalicAngle' : 0.0f
}
'Subtype' : "CIDFontType2"
'W' : [32.0f, [348.0f, 456.0f, 521.0f, 838.0f, 696.0f, 1002.0f, 872.0f, 306.0f, 457.0f, 457.0f, 523.0f, 838.0f, 380.0f, 415.0f, 380.0f, 365.0f], 48.0f, 57.0f, 696.0f, 58.0f, 59.0f, 400.0f, 60.0f, 62.0f, 838.0f, 63.0f, [580.0f, 1000.0f, 774.0f, 762.0f, 734.0f, 830.0f, 683.0f, 683.0f, 821.0f, 837.0f, 372.0f, 372.0f, 775.0f, 637.0f, 995.0f, 837.0f, 850.0f, 733.0f, 850.0f, 770.0f, 720.0f, 682.0f, 812.0f, 774.0f, 1103.0f, 771.0f, 724.0f, 725.0f, 457.0f, 365.0f, 457.0f, 838.0f, 500.0f, 500.0f, 675.0f, 716.0f, 593.0f, 716.0f, 678.0f, 435.0f, 716.0f, 712.0f, 343.0f, 343.0f, 665.0f, 343.0f, 1042.0f, 712.0f, 687.0f, 716.0f, 716.0f, 493.0f, 595.0f, 478.0f, 712.0f, 652.0f, 924.0f, 645.0f, 652.0f, 582.0f, 712.0f, 365.0f, 712.0f, 838.0f], 160.0f, [348.0f, 456.0f, 696.0f, 696.0f, 636.0f, 696.0f, 365.0f, 500.0f, 500.0f, 1000.0f, 564.0f, 646.0f, 838.0f, 415.0f, 1000.0f, 500.0f, 500.0f, 838.0f, 438.0f, 438.0f, 500.0f, 736.0f, 636.0f, 380.0f, 500.0f, 438.0f, 564.0f, 646.0f], 188.0f, 190.0f, 1035.0f, 191.0f, 191.0f, 580.0f, 192.0f, 197.0f, 774.0f, 198.0f, [1085.0f, 734.0f], 200.0f, 203.0f, 683.0f, 204.0f, 207.0f, 372.0f, 208.0f, [838.0f, 837.0f], 210.0f, 214.0f, 850.0f, 215.0f, [838.0f, 850.0f], 217.0f, 220.0f, 812.0f, 221.0f, [724.0f, 738.0f, 719.0f], 224.0f, 229.0f, 675.0f, 230.0f, [1048.0f, 593.0f], 232.0f, 235.0f, 678.0f, 236.0f, 239.0f, 343.0f, 240.0f, [687.0f, 712.0f, 687.0f, 687.0f, 687.0f, 687.0f, 687.0f], 247.0f, [838.0f, 687.0f], 249.0f, 252.0f, 712.0f, 253.0f, [652.0f, 716.0f]]
'Type' : "Font"
'BaseFont' : "AAAAAE+DejaVuSans-Bold"
'CIDSystemInfo' : {
'Supplement' : 0.0f
'Ordering' : "Identity" + Stream(8 bytes)
'Registry' : "Adobe" + Stream(5 bytes)
}
'DW' : 600.0f
'CIDToGIDMap' : #44 0# {
'Filter' : "FlateDecode"
'Length' : 10200.0f
} + Stream(131072 bytes)
}]
'Type' : "Font"
}
There is no indication of the encoding type of the font.
== Update ==
As for the ToUnicode object: in the case of this font it is unnecessary. It should effectively have been the identity (matching Identity-H), but instead it is spelled out as an X == X mapping. Here are some of the entries, which go from 0000 up to FFFF:
<0000> <00ff> <0000>
<0100> <01ff> <0100>
<0200> <02ff> <0200>
<0300> <03ff> <0300>
<0400> <04ff> <0400>
<0500> <05ff> <0500>
<0600> <06ff> <0600>
<0700> <07ff> <0700>
<0800> <08ff> <0800>
<0900> <09ff> <0900>
<0a00> <0aff> <0a00>
<0b00> <0bff> <0b00>
<0c00> <0cff> <0c00>
<0d00> <0dff> <0d00>
<0e00> <0eff> <0e00>
<0f00> <0fff> <0f00>
<1000> <10ff> <1000>
<1100> <11ff> <1100>
....
....
....
<fc00> <fcff> <fc00>
<fd00> <fdff> <fd00>
<fe00> <feff> <fe00>
<ff00> <ffff> <ff00>
So the real mapping is not in the ToUnicode object, but other renderers can still render it correctly!
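For reference, a bfrange entry of this simple form <srcLo> <srcHi> <dstLo> maps a code to dstLo + (code - srcLo), so the ranges above all map each code to itself. A minimal sketch (the method name is mine):

// Apply one ToUnicode bfrange entry of the simple form <srcLo> <srcHi> <dstLo>.
// With the identity ranges above, e.g. <0000> <00ff> <0000>, the code is returned unchanged.
static int mapThroughBfRange(int code, int srcLo, int srcHi, int dstLo) {
    if (code < srcLo || code > srcHi) {
        return -1;                      // this range does not cover the code
    }
    return dstLo + (code - srcLo);
}
// mapThroughBfRange(0x0041, 0x0000, 0x00FF, 0x0000) == 0x0041 ('A')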
Any Ideas?
I use: new String(sb.toString().getBytes("UTF-8"),"UTF-16BE")
The problem is to know when to apply it and when to keep the UTF-8.
The OP assumes, probably after examining some sample PDF files, that strings in PDF content streams are encoded using either UTF-8 or UTF-16BE.
This assumption is wrong.
PDF allows some standard single-byte encodings (MacRomanEncoding, MacExpertEncoding, and WinAnsiEncoding), none of which is UTF-8 (due to the relations between different encodings, especially ASCII, Latin-1, and UTF-8, they may be confused with each other when only a limited sample is examined). Furthermore, numerous predefined multi-byte encodings are also allowed, some of which are indeed UTF-16-related.
But PDF allows completely custom encodings, both single-byte and multi-byte, to be used, too!
E.g. this text drawing operation
(ABCCD) Tj
for a simple font with this encoding:
<<
/Type /Encoding
/Differences [ 65 /H /e /l /o ]
>>
displays the word Hello!
And while this may look like an artificially constructed example, the procedure to create a custom encoding like this (i.e. by assigning codes from some start value upwards to glyphs in the order in which they first occur on the page or in the document) is fairly often used.
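To make this concrete, here is a minimal sketch (plain Java, names are mine, not any PDF library's API) of applying such a /Differences encoding; note that in a real file /H /e /l /o are glyph names that still have to be mapped to Unicode, e.g. via the font program or the Adobe Glyph List:

import java.util.HashMap;
import java.util.Map;

class DifferencesDemo {
    public static void main(String[] args) {
        // /Differences [ 65 /H /e /l /o ] assigns code 65 -> H, 66 -> e, 67 -> l, 68 -> o
        Map<Integer, Character> encoding = new HashMap<>();
        encoding.put(65, 'H');
        encoding.put(66, 'e');
        encoding.put(67, 'l');
        encoding.put(68, 'o');

        String operand = "ABCCD";                  // the bytes inside (ABCCD) Tj
        StringBuilder shown = new StringBuilder();
        for (char code : operand.toCharArray()) {
            shown.append(encoding.get((int) code));
        }
        System.out.println(shown);                 // prints Hello
    }
}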
Furthermore, the OP's current solution
If your font object has a CMap, then you treat it as a UTF-16, otherwise not.
will only work for very few documents because
a) simple fonts (using single-byte encodings) may also supply a ToUnicode CMap, and
b) composite font CMaps need not be UTF-like either but can instead use a mixed multi-byte encoding.
Thus, there is no way around an in-depth analysis of the font information used, cf. sections 9.5 to 9.9 of the PDF specification ISO 32000-1.
PS On some comments by the OP:
this: new String(sb.toString().getBytes("UTF-8"),"UTF-16BE") was an example of how the problem is solved, not a solution! The solution is done while fetching the glyphs, whether I treat the data as 16-bit or 8-bit
and
the ToUnicode map is 16-bit (the only ones I've seen) per key,
The data may be mixed, e.g. have a look at the Adobe CMap and CIDFont Files Specification; there, CMap example 9 contains the section
4 begincodespacerange
<00> <80>
<8140> <9ffc>
<a0> <de>
<e040> <fbec>
endcodespacerange
which is explained to mean
Figure 6 shows how the codespace definition in this example comprises two single-byte linear ranges of codes (<00> to <80> and <A0> to <DF>) and two double-byte rectangular ranges of codes (<8140> to <9FFC> and <E040> to <FBFC>). The first two-byte region comprises all codes bounded by first-byte values of 81 through 9F and second-byte values of 40 through FC. Thus, the input code <86A9> is within the region because both bytes are within bounds. That code is valid. The input code <8210> is not within the region, even though its first byte is between 81 and 9F, because its second byte is not within bounds. That code is invalid. The second two-byte region is similarly bounded.
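A sketch of how a decoder might split a string into codes for such a mixed codespace, hard-coding the four ranges of this example (illustrative only, the method name is mine and error handling is largely omitted):

import java.util.ArrayList;
import java.util.List;

class CodespaceSplit {
    // Split raw string bytes into codes for the codespace
    // <00>-<80>, <8140>-<9ffc>, <a0>-<de>, <e040>-<fbec>.
    static List<Integer> splitCodes(byte[] raw) {
        List<Integer> codes = new ArrayList<>();
        int i = 0;
        while (i < raw.length) {
            int b = raw[i] & 0xFF;
            if (b <= 0x80 || (b >= 0xA0 && b <= 0xDE)) {
                codes.add(b);                       // single-byte code
                i += 1;
            } else if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFB)) {
                if (i + 1 >= raw.length) break;     // truncated two-byte code, ignored here
                // second-byte range check (40-FC / 40-EC) omitted in this sketch
                codes.add((b << 8) | (raw[i + 1] & 0xFF));
                i += 2;
            } else {
                i += 1;                             // invalid first byte, skipped in this sketch
            }
        }
        return codes;
    }
}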
OK, so this turned out to be complicated, and the reason for this bug was a silly one, mostly on my end, but there is a lesson to be learned with regard to when to treat the chars as UTF-16 and when not to.
My problem was not in parsing the fonts, but in rendering them. According to the details specified in the Font object you can determine the type of the font and apply the correct logic to it.
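For completeness, a much simplified sketch of that decision, assuming only Identity-H/Identity-V composite fonts and single-byte simple fonts occur (the method name is mine; a general solution still has to analyze the font's CMap as the answer above explains):

// Decide whether string bytes for this font are two-byte CIDs or single-byte codes.
// Only covers the simple case of Identity-H/Identity-V Type0 fonts vs. simple fonts.
static boolean usesTwoByteCodes(String subtype, String encoding) {
    return "Type0".equals(subtype)
            && ("Identity-H".equals(encoding) || "Identity-V".equals(encoding));
}
// For the font above: usesTwoByteCodes("Type0", "Identity-H") == true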