We are building a Python 3 program which calls a Java program. The Java program (which is a 3rd party program we cannot modify) is used to tokenize strings (find the words) and provide other annotations. Those annotations are in the form of character offsets.
As an example, we might provide the program with string data such as "lovely weather today". It provides something like the following output:
0,6
7,14
15,20
Where 0,6 are the offsets corresponding to the word "lovely", 7,14 are the offsets corresponding to the word "weather", and 15,20 are the offsets corresponding to the word "today" within the source string. We read these offsets in Python to extract the text at those points and perform further processing.
All is well and good as long as the characters are within the Basic Multilingual Plane (BMP). However, when they are not, the offsets reported by this Java program show up all wrong on the Python side.
For example, given the string "I feel 🙂 today", the Java program will output:
0,1
2,6
7,9
10,15
On the Python side, these translate to:
0,1 "I"
2,6 "feel"
7,9 "🙂 "
10,15 "oday"
Where the last index is technically invalid. Java sees "🙂" as length 2, which causes all the annotations after that point to be off by one from the Python program's perspective.
Presumably this occurs because Java encodes strings internally in a UTF-16-esque way, and all string operations act on those UTF-16 code units. Python strings, on the other hand, appear to operate on the actual Unicode characters (code points). So when a character shows up outside the BMP, the Java program sees it as length 2, whereas Python sees it as length 1.
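To see the mismatch from the Java side, here is a small standalone sketch (not part of either program; the string literal just spells out the example with an explicit surrogate pair):

public class SurrogateDemo {
    public static void main(String[] args) {
        // "I feel 🙂 today" -- U+1F642 written as the surrogate pair \uD83D\uDE42
        String s = "I feel \uD83D\uDE42 today";

        System.out.println(s.length());                       // 15 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));  // 14 Unicode code points
    }
}

Python's len() of the same string is 14, which is exactly the off-by-one described above.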
So now the question is: what is the best way to "correct" those offsets before Python uses them, so that the annotation substrings are consistent with what the Java program intended to output?
You could convert the string to a bytearray in UTF-16 encoding, then use the offsets (multiplied by 2, since there are two bytes per UTF-16 code unit) to index that array:
x = "I feel 🙂 today"
y = bytearray(x, "UTF-16LE")
offsets = [(0,1),(2,6),(7,9),(10,15)]
for word in offsets:
    print(str(y[word[0]*2:word[1]*2], 'UTF-16LE'))
Output:
I
feel
🙂
today
Alternatively, you could convert every Python character in the string individually to UTF-16 and count the number of code units it takes. This lets you map the indices in terms of code units (from Java) to indices in terms of Python characters:
from itertools import accumulate
x = "I feel 🙂 today"
utf16offsets = [(0,1),(2,6),(7,9),(10,15)] # from java program
# map python string indices to an index in terms of utf-16 code units
chrLengths = [len(bytearray(ch, "UTF-16LE"))//2 for ch in x]
utf16indices = [0] + list(accumulate(chrLengths))
# reverse the map so that it maps utf16 indices to python indices
index_map = dict((x,i) for i, x in enumerate(utf16indices))
# convert the offsets from utf16 code-unit indices to python string indices
offsets = [(index_map[o[0]], index_map[o[1]]) for o in utf16offsets]
# now you can just use those indices as normal
for word in offsets:
    print(x[word[0]:word[1]])
Output:
I
feel
🙂
today
The above code is messy and can probably be made clearer, but you get the idea.
This solves the problem given the proper encoding, which, in our situation, appears to be 'UTF-16BE':
def correct_offsets(input, offsets, encoding):
    offset_list = [{'old': o, 'new': [o[0], o[1]]} for o in offsets]
    for idx in range(0, len(input)):
        if len(input[idx].encode(encoding)) > 2:
            for o in offset_list:
                if o['old'][0] > idx:
                    o['new'][0] -= 1
                if o['old'][1] > idx:
                    o['new'][1] -= 1
    return [o['new'] for o in offset_list]
This may be pretty inefficient though. I gladly welcome any performance improvements.
Hi all and thank you for the help in advance.
I have scoured the web and have not really turned up anything concrete regarding my initial question.
I have a program I am developing in Java whose primary purpose is to read a .DAT file, extract certain values from it, and then calculate an output based on the extracted values, which it then writes back to the file.
The file is made up of records that are all the same length and format, so it should be fairly straightforward to access. Currently I am using a loop and an if statement to find the first occurrence of a record, and then through user input determine the length of each record so I can then loop through each record.
HOWEVER! The first record of this file is blank (or so I thought). As it turns out, this first record is the key to the rest of the file, in that the first few chars are ASCII and reference the record length and the number of records contained within the file, respectively.
Below is a list of the ASCII values themselves as found in the files (disregard the quotation marks; the ASCII is contained within them):
"#¼ ä "
"#g â "
"ÇG # "
"lj ‰ "
"Çò È "
"=¼ "
A friend of mine who many years ago used to code in BASIC reckons the first 3 chars refer to the record length and the following 9 refer to the number of records.
Basically what I need to do is convert this initial string of ASCII chars to two decimal numbers in order to work out the length of each record and the number of records.
Any assistance will be greatly appreciated.
Edit...
Please find below the Basic code used to access the file in the past, perhaps this will help?
CLS
INPUT "Survey System Data File? : ", survey$
survey$ = "f:\apps\survey\" + survey$
reclen = 3004
OPEN survey$ + ".dat" FOR RANDOM AS 1 LEN = reclen
FIELD #1, 3 AS RL$, 9 AS n$
GET #1, 1
RL = CVI(RL$): n = CVI(n$)
PRINT "Record Length = "; RL
reclen = RL
PRINT "Number of Records = "; n
CLOSE #1
Basically what I am looking for is something similar, but in Java.
ASCII is a particular way to translate a bit pattern in a byte to a character, and that gives each character a numerical value; for the letter 'A' this is 65.
In Java, you can get that numerical value by converting the char to an int (strictly, this gives you the Unicode value, but since the Unicode values of the ASCII characters are the same as their ASCII values, this does not matter).
But now you need to know how the length is calculated: do you have to add the values? Or multiply them? Or append them? Or multiply them with 128^p where p is the position, and add the result? And, in the latter case, is the first byte on position 0 or position 3?
Same for the number of records, of course.
Another possible interpretation of the data is that the bytes are BCD-encoded numbers. In that case, each nibble (4-bit group) represents a digit from 0 to 9, and you have to do some bit manipulation to extract the digits and concatenate them, from left (highest) to right (lowest). At least you do not have to struggle with the byte order and further interpretation there.
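As a minimal sketch of that bit manipulation (assuming packed BCD with the high nibble as the more significant digit, which is only one possible layout):

static int decodeBcd(byte[] bytes) {
    int value = 0;
    for (byte b : bytes) {
        int high = (b >> 4) & 0x0F; // upper nibble
        int low = b & 0x0F;         // lower nibble
        value = value * 100 + high * 10 + low;
    }
    return value;
}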
But as BCD would require all 8 bits, this would not be the right interpretation if the file really contains ASCII, as ASCII is 7-bit.
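Given that the BASIC code in the question reads the header fields with CVI, which converts the first two bytes of a field into a little-endian signed 16-bit integer, one plausible Java translation is the sketch below. The file name is a placeholder, and the "little-endian, first two bytes" assumption should be checked against a file whose record length is already known:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DatHeaderReader {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("survey.dat", "r")) {
            byte[] rlField = new byte[3]; // FIELD #1, 3 AS RL$
            byte[] nField = new byte[9];  // 9 AS n$
            raf.readFully(rlField);
            raf.readFully(nField);

            // CVI interprets the first two bytes of each field as a
            // little-endian signed 16-bit integer; replicate that here.
            short recordLength = ByteBuffer.wrap(rlField, 0, 2).order(ByteOrder.LITTLE_ENDIAN).getShort();
            short numberOfRecords = ByteBuffer.wrap(nField, 0, 2).order(ByteOrder.LITTLE_ENDIAN).getShort();

            System.out.println("Record Length = " + recordLength);
            System.out.println("Number of Records = " + numberOfRecords);
        }
    }
}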
In searching for an answer, I used the solution provided in the following link: How to format a Java string with leading zero?
I have the following code that needs to be translated into Java:
TRIM(TO_CHAR(123,'000X'))
From what I can tell, it translates the number into hexadecimal and adds some leading zeros.
However, if I give it a big value, I get ##### as the answer, e.g. for the following code:
TRIM(TO_CHAR(999999,'000X'))
In Java, my current solution is the following:
String numberAsHex = Integer.toHexString(123);
System.out.println(("0000" + numberAsHex).substring(numberAsHex.length()));
It works fine for small numbers, but for big ones, like 999999, it outputs 423f. So it does the transformation, resulting in the value f423f, and then it cuts a part off. So I am nowhere near the value from Oracle.
Any suggestion as to how to do this? Or why are ##### displayed in the Oracle case?
Instead of Integer.toHexString I would recommend using String.format because of its greater flexibility.
int number = ...;
String numberAsHex = String.format("%06x", number);
The 06 means you get 6 digits with leading zeros, x means you get lowercase hexadecimal.
Examples:
for number = 123 you get numberAsHex = "00007b"
for number = 999999 you get numberAsHex = "0f423f"
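For completeness, a small self-contained check of the two example values (just a usage sketch of the format call above):

public class HexFormatDemo {
    public static void main(String[] args) {
        int[] samples = { 123, 999999 };
        for (int number : samples) {
            // 06 = pad to 6 digits with leading zeros, x = lowercase hexadecimal
            String numberAsHex = String.format("%06x", number);
            System.out.println(number + " -> " + numberAsHex);
        }
    }
}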
I am writing a Huffman Compression/Decompression program. I have started writing my compression method and I am stuck. I am trying to read all bytes in the file and then put all of the bytes into a byte array. After putting all bytes into the byte array I create an int[] array that will store all the frequencies of each byte (with the index being the ASCII code).
It does include the extended ASCII table, since the size of the int array is 256. However, I encounter issues as soon as I read a special character in my file (i.e. characters with an ASCII value higher than 127). I understand that a byte is signed and will wrap around to a negative value as soon as it crosses the 127 limit (and an array index obviously can't be negative), so I tried to counter this by turning it into an unsigned value when I specify my index for the array (array[myByte & 0xFF]).
This kind of worked, but it gave me the wrong ASCII value (for example, if the correct ASCII value for the character is 134, I instead got 191 or something). The even more annoying part is that I noticed that special characters are split into 2 separate bytes, which I feel will cause problems later (for example, when I try to decompress).
How do I make my program compatible with every single type of character (this program is supposed to be able to compress/decompress pictures, MP3s, etc.)?
Maybe I am taking the wrong approach to this, but I don't know what the right approach is. Please give me some tips for structuring this.
Tree:
package CompPck;
import java.util.TreeMap;
abstract class Tree implements Comparable<Tree> {
    public final int frequency; // the frequency of this tree

    public Tree(int freq) { frequency = freq; }

    // compares on the frequency
    public int compareTo(Tree tree) {
        return frequency - tree.frequency;
    }
}

class Leaf extends Tree {
    public final int value; // the character this leaf represents

    public Leaf(int freq, int val) {
        super(freq);
        value = val;
    }
}

class Node extends Tree {
    public final Tree left, right; // subtrees

    public Node(Tree l, Tree r) {
        super(l.frequency + r.frequency);
        left = l;
        right = r;
    }
}
Build tree method:
public static Tree buildTree(int[] charFreqs) {
    PriorityQueue<Tree> trees = new PriorityQueue<Tree>();
    for (int i = 0; i < charFreqs.length; i++) {
        if (charFreqs[i] > 0) {
            trees.offer(new Leaf(charFreqs[i], i));
        }
    }
    //assert trees.size() > 0;
    while (trees.size() > 1) {
        Tree a = trees.poll();
        Tree b = trees.poll();
        trees.offer(new Node(a, b));
    }
    return trees.poll();
}
Compression method:
public static void compress(File file) {
    try {
        Path path = Paths.get(file.getAbsolutePath());
        byte[] content = Files.readAllBytes(path);
        TreeMap<Integer, String> treeMap = new TreeMap<Integer, String>();
        File nF = new File(file.getName() + "_comp");
        nF.createNewFile();
        BitFileWriter bfw = new BitFileWriter(nF);

        int[] charFreqs = new int[256];

        // read each byte and record the frequencies
        for (byte b : content) {
            charFreqs[b & 0xFF]++;
            System.out.println(b & 0xFF);
        }

        // build tree
        Tree tree = buildTree(charFreqs);

        // build TreeMap
        fillEncodeMap(tree, new StringBuffer(), treeMap);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Encodings matter
If I take the character "ö" and read it in my file it will now be
represented by 2 different values (191 and 182 or something like that)
when its actual ASCII table value is 148.
That really depends on which kind of encoding was used to create your text file. Encodings determine how text is stored.
In UTF-8 the ö is stored as hex [0xc3, 0xb6] or [195, 182]
In ISO/IEC 8859-1 (= "Latin-1") it would be stored as hex [0xf6], or [246]
In Mac OS Central European, it would be hex [0x9a] or [154]
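You can see those byte sequences for yourself with a quick sketch (limited to the two charsets available via java.nio.charset.StandardCharsets; the Mac encoding is left out because its charset name and availability depend on the JRE):

import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String s = "ö";
        printBytes("UTF-8     ", s.getBytes(StandardCharsets.UTF_8));
        printBytes("ISO-8859-1", s.getBytes(StandardCharsets.ISO_8859_1));
    }

    static void printBytes(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label + ":");
        for (byte b : bytes) {
            sb.append(String.format(" 0x%02x", b & 0xFF)); // mask to show the unsigned value
        }
        System.out.println(sb); // UTF-8: 0xc3 0xb6, ISO-8859-1: 0xf6
    }
}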
Please note that the basic ASCII table itself doesn't really describe anything for that kind of character. ASCII only uses 7 bits, and by doing so only maps 128 codes.
Part of the problem is that, in layman's terms, "ASCII" is sometimes used to describe extensions of ASCII as well (e.g. Latin-1).
History
There's actually a bit of history behind that. Originally ASCII was a very limited set of characters. When those weren't enough, each region started using the 8th bit to add its language-specific characters, leading to all kinds of compatibility issues.
Then there was a consortium that made an inventory of all characters in all possible languages (and beyond). That set is called "Unicode". It contains not just 128 or 256 characters, but thousands of them.
From that point on you need more advanced encodings to cover them. UTF-8 is one of those encodings; it covers the entire Unicode set, and it does so while being kind of backwards compatible with ASCII.
Each ASCII character is still mapped in the same way, but when 1 byte isn't enough, it will use the 8th bit to indicate that a 2nd byte will follow, which is the case for the ö character.
Tools
If you're using a more advanced text editor like Notepad++, then you can select your encoding from the drop-down menu.
In programming
Having said that, your current Java source reads bytes; it's not reading characters. And I would think that it's a plus when it works on byte level, because then it can support all encodings. Maybe you don't need to work on character level at all.
However, it may matter for your specific algorithm. Let's say you've written an algorithm that is only supposed to handle Latin-1 encoding. Then it's really going to work on character level and not on byte level. In that case, consider reading directly into a String or char[].
Java can do the heavy lifting for you in that case. There are readers in Java that will let you read a text file directly into Strings/char[]. However, in those cases you should of course specify an encoding when you use them. Internally a single Java char is 2 bytes (one UTF-16 code unit).
Trying to convert bytes to characters manually is a tricky business, unless you're working with plain old ASCII of course. The moment you see a value above 0x7F (127) (which is represented by a negative value in a byte), you're no longer working with simple ASCII. Then consider using something like new String(bytes, StandardCharsets.UTF_8). There's no need to write a decoding algorithm from scratch.
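A short sketch of both levels, with the file name and charset as assumptions (pick whatever actually matches your data):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadBytesOrText {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("input.dat"); // placeholder path

        // Byte level: encoding-agnostic, which is what a general-purpose
        // Huffman coder for pictures, MP3s, etc. really wants.
        byte[] raw = Files.readAllBytes(path);
        int[] freqs = new int[256];
        for (byte b : raw) {
            freqs[b & 0xFF]++; // index by the unsigned byte value
        }

        // Character level: only if the algorithm is about text, and only
        // with an explicit charset instead of guessing.
        String text = new String(raw, StandardCharsets.UTF_8);
        System.out.println(text.length() + " characters, " + raw.length + " bytes");
    }
}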
I'm trying to replicate the behavior of a Python 2.7 function in Java, but I'm getting different results when running a (seemingly) identical sequence of bytes through a SHA-256 hash. The bytes are generated by manipulating a very large integer (exactly 2048 bits long) in a specific way (2nd line of my Python code example).
For my examples, the original 2048-bit integer is stored as big_int and bigInt in Python and Java respectively, and both variables contain the same number.
Python2 code I'm trying to replicate:
raw_big_int = ("%x" % big_int).decode("hex")
buff = struct.pack(">i", len(raw_big_int) + 1) + "\x00" + raw_big_int
pprint("Buffer contains: " + buff)
pprint("Encoded: " + buff.encode("hex").upper())
digest = hashlib.sha256(buff).digest()
pprint("Digest contains: " + digest)
pprint("Encoded: " + digest.encode("hex").upper())
Running this code prints the following (note that the only result I'm actually interested in is the last one - the hex-encoded digest. The other 3 prints are just to see what's going on under the hood):
'Buffer contains: \x00\x00\x01\x01\x00\xe3\xbb\xd3\x84\x94P\xff\x9c\'\xd0P\xf2\xf0s,a^\xf0i\xac~\xeb\xb9_\xb0m\xa2&f\x8d~W\xa0\xb3\xcd\xf9\xf0\xa8\xa2\x8f\x85\x02\xd4&\x7f\xfc\xe8\xd0\xf2\xe2y"\xd0\x84ck\xc2\x18\xad\xf6\x81\xb1\xb0q\x19\xabd\x1b>\xc8$g\xd7\xd2g\xe01\xd4r\xa3\x86"+N\\\x8c\n\xb7q\x1c \x0c\xa8\xbcW\x9bt\xb0\xae\xff\xc3\x8aG\x80\xb6\x9a}\xd9*\x9f\x10\x14\x14\xcc\xc0\xb6\xa9\x18*\x01/eC\x0eQ\x1b]\n\xc2\x1f\x9e\xb6\x8d\xbfb\xc7\xce\x0c\xa1\xa3\x82\x98H\x85\xa1\\\xb2\xf1\'\xafmX|\x82\xe7%\x8f\x0eT\xaa\xe4\x04*\x91\xd9\xf4e\xf7\x8c\xd6\xe5\x84\xa8\x01*\x86\x1cx\x8c\xf0d\x9cOs\xebh\xbc1\xd6\'\xb1\xb0\xcfy\xd7(\x8b\xeaIf6\xb4\xb7p\xcdgc\xca\xbb\x94\x01\xb5&\xd7M\xf9\x9co\xf3\x10\x87U\xc3jB3?vv\xc4JY\xc9>\xa3cec\x01\x86\xe9c\x81F-\x1d\x0f\xdd\xbf\xe8\xe9k\xbd\xe7c5'
'Encoded: 0000010100E3BBD3849450FF9C27D050F2F0732C615EF069AC7EEBB95FB06DA226668D7E57A0B3CDF9F0A8A28F8502D4267FFCE8D0F2E27922D084636BC218ADF681B1B07119AB641B3EC82467D7D267E031D472A386222B4E5C8C0AB7711C200CA8BC579B74B0AEFFC38A4780B69A7DD92A9F101414CCC0B6A9182A012F65430E511B5D0AC21F9EB68DBF62C7CE0CA1A382984885A15CB2F127AF6D587C82E7258F0E54AAE4042A91D9F465F78CD6E584A8012A861C788CF0649C4F73EB68BC31D627B1B0CF79D7288BEA496636B4B770CD6763CABB9401B526D74DF99C6FF3108755C36A42333F7676C44A59C93EA36365630186E96381462D1D0FDDBFE8E96BBDE76335'
'Digest contains: Q\xf9\xb9\xaf\xe1\xbey\xdc\xfa\xc4.\xa9 \xfckz\xfeB\xa0>\xb3\xd6\xd0*S\xff\xe1\xe5*\xf0\xa3i'
'Encoded: 51F9B9AFE1BE79DCFAC42EA920FC6B7AFE42A03EB3D6D02A53FFE1E52AF0A369'
Now, below is my Java code so far. When I test it, I get the same value for the input buffer, but a different value for the digest. (bigInt contains a BigInteger object containing the same number as big_int in the Python example above)
byte[] rawBigInt = bigInt.toByteArray();
ByteBuffer buff = ByteBuffer.allocate(rawBigInt.length + 4);
buff.order(ByteOrder.BIG_ENDIAN);
buff.putInt(rawBigInt.length).put(rawBigInt);
System.out.print("Buffer contains: ");
System.out.println( DatatypeConverter.printHexBinary(buff.array()) );
MessageDigest hash = MessageDigest.getInstance("SHA-256");
hash.update(buff);
byte[] digest = hash.digest();
System.out.print("Digest contains: ");
System.out.println( DatatypeConverter.printHexBinary(digest) );
Notice that in my Python example, I started the buffer off with len(raw_big_int) + 1 packed, where in Java I started with just rawBigInt.length. I also omitted the extra 0-byte ("\x00") when writing in Java. I did both of these for the same reason - in my tests, calling toByteArray() on a BigInteger returned a byte array already beginning with a 0-byte that was exactly 1 byte longer than Python's byte sequence. So, at least in my tests, len(raw_big_int) + 1 equaled rawBigInt.length, since rawBigInt began with a 0-byte and raw_big_int did not.
Alright, that aside, here is the Java code's output:
Buffer contains: 0000010100E3BBD3849450FF9C27D050F2F0732C615EF069AC7EEBB95FB06DA226668D7E57A0B3CDF9F0A8A28F8502D4267FFCE8D0F2E27922D084636BC218ADF681B1B07119AB641B3EC82467D7D267E031D472A386222B4E5C8C0AB7711C200CA8BC579B74B0AEFFC38A4780B69A7DD92A9F101414CCC0B6A9182A012F65430E511B5D0AC21F9EB68DBF62C7CE0CA1A382984885A15CB2F127AF6D587C82E7258F0E54AAE4042A91D9F465F78CD6E584A8012A861C788CF0649C4F73EB68BC31D627B1B0CF79D7288BEA496636B4B770CD6763CABB9401B526D74DF99C6FF3108755C36A42333F7676C44A59C93EA36365630186E96381462D1D0FDDBFE8E96BBDE76335
Digest contains: E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
As you can see, the buffer contents appear the same in both Python and Java, but the digests are obviously different. Can someone point out where I'm going wrong?
I suspect it has something to do with the strange way Python seems to store bytes - the variables raw_big_int and buff show as type str in the interpreter, and when printed out by themselves have that strange format with the '\x's that is almost the same as the bytes themselves in some places, but is utter gibberish in others. I don't have enough Python experience to understand exactly what's going on here, and my searches have turned up fruitless.
Also, since I'm trying to port the Python code into Java, I can't just change the Python - my goal is to write Java code that takes the same input and produces the same output. I've searched around (this question in particular seemed related) but didn't find anything to help me out. Thanks in advance, if for nothing else than for reading this long-winded question! :)
In Java, you've got the data in the buffer, but the cursor positions are all wrong. After you've written your data to the ByteBuffer it looks like this, where the x's represent your data and the 0's are unwritten bytes in the buffer:
xxxxxxxxxxxxxxxxxxxx00000000000000000000000000000000000000000
                    ^ position                               ^ limit
The cursor is positioned after the data you've written. A read at this point will read from position to limit, which is the bytes you haven't written.
Instead, you want this:
xxxxxxxxxxxxxxxxxxxx00000000000000000000000000000000000000000
^ position          ^ limit
where the position is 0 and the limit is the number of bytes you've written. To get there, call flip(). Flipping a buffer conceptually switches it from write mode to read mode. I say "conceptually" because ByteBuffers don't have explicit read and write modes, but you should think of them as if they do.
(The opposite operation is compact(), which switches back to write mode.)
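Applied to the code from the question, the fix is a single flip() before hashing; here it is as a sketch wrapped in a method (the class name and exception handling are added only to make it self-contained):

import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BigIntDigest {
    static byte[] sha256OfLengthPrefixedBigInt(BigInteger bigInt) throws NoSuchAlgorithmException {
        byte[] rawBigInt = bigInt.toByteArray();
        ByteBuffer buff = ByteBuffer.allocate(rawBigInt.length + 4);
        buff.order(ByteOrder.BIG_ENDIAN);
        buff.putInt(rawBigInt.length).put(rawBigInt);

        buff.flip(); // position -> 0, limit -> number of bytes written

        MessageDigest hash = MessageDigest.getInstance("SHA-256");
        hash.update(buff); // now consumes exactly the bytes written above
        return hash.digest();
    }
}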
I am working on an engine that does OCR post-processing, and currently I have a set of organizations in the database, including Chamber of Commerce Numbers.
Also from the OCR output I have a list of possible Chamber of Commerce (COC) numbers.
What would be the best way to search for the most similar one? Currently I am using Levenshtein distance, but the result range is simply too big, and on big databases I really doubt its feasibility. Currently it's implemented in Java, and the database is a MySQL database.
Side note: a Chamber of Commerce number in The Netherlands is defined to be an 8-digit number for every company. An earlier version of this system used another 4 digits (0000, 0001, etc.) to indicate an establishment of an organization; nowadays totally new COC numbers are given out for those.
Example of COCNumbers:
30209227
02045251
04087614
01155720
20081288
020179310000
09053023
09103292
30039925
13041611
01133910
09063023
34182B01
27124701
List of possible COCNumbers determined by post-processing:
102537177
000450093333
465111338098
NL90223l30416l
NLfl0737D447B01
12juni2013
IBANNL32ABNA0242244777
lncassantNL90223l30416l10000
KvK13041611
BtwNLfl0737D447B01
A few extra notes:
The post-processing picks up words and word groups from the invoice, and those word groups are concatenated into one string. (A word group is, as it says, a group of words, usually denoted by a space between them.)
The condition that the post-processing uses for something to be a COC number is the following: the length should be 8 or more, half of the content should be digits, and it should be alphanumeric.
The number of possible COCNumbers determined by post-processing is relatively small.
The database itself can grow very big, up to tens of thousands of records.
How would I proceed to find the best match in general? (In this case (13041611, KvK13041611) is the best (and moreover correct) match)
Doing this matching exclusively in MySQL is probably a bad idea for a simple reason: there's no way to use a regular expression to modify a string natively.
You're going to need to use some sort of scoring algorithm to get this right, in my experience (which comes from ISBNs and other book-identifying data).
This is procedural -- you probably need to do it in Java (or some other procedural programming language).
1. Is the candidate string found in the table exactly? If yes, score 1.0.
2. Is the candidate string "kvk" (case-insensitive) prepended to a number that's found in the table exactly? If so, score 1.0.
3. Is the candidate string the correct length, and does it match after changing lower case L into 1 and upper case O into 0? If so, score 0.9.
4. Is the candidate string the correct length after trimming all alphabetic characters from either the beginning or the end, and does it match? If so, score 0.8.
5. Do both steps 3 and 4, and if you get a match, score 0.7.
6. Trim alpha characters from both the beginning and end, and if you get a match, score 0.6.
7. Do steps 3 and 6, and if you get a match, score 0.55.
The highest scoring match wins.
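A rough Java sketch of that scoring pass, against an in-memory set standing in for the MySQL table (the class and method names are made up, and only the steps that are easy to express with plain string operations are included):

import java.util.Set;

public class CocMatcher {
    private final Set<String> knownCocNumbers; // stand-in for the database table

    public CocMatcher(Set<String> knownCocNumbers) {
        this.knownCocNumbers = knownCocNumbers;
    }

    // Returns a rough score in [0, 1] for one OCR candidate.
    public double score(String candidate) {
        String c = candidate.trim();
        if (knownCocNumbers.contains(c)) {
            return 1.0; // step 1: exact match
        }
        if (c.regionMatches(true, 0, "kvk", 0, 3) && knownCocNumbers.contains(c.substring(3))) {
            return 1.0; // step 2: "KvK" prefix stripped
        }
        String ocrFixed = c.replace('l', '1').replace('O', '0');
        if (ocrFixed.length() == 8 && knownCocNumbers.contains(ocrFixed)) {
            return 0.9; // step 3: common OCR confusions fixed
        }
        String trimmed = c.replaceAll("^[A-Za-z]+", "").replaceAll("[A-Za-z]+$", "");
        if (trimmed.length() == 8 && knownCocNumbers.contains(trimmed)) {
            return 0.6; // step 6: letters trimmed from both ends
        }
        String both = trimmed.replace('l', '1').replace('O', '0');
        if (both.length() == 8 && knownCocNumbers.contains(both)) {
            return 0.55; // step 7: trimming plus OCR fixes
        }
        return 0.0; // no match by these rules
    }
}

With this, "KvK13041611" scores 1.0 against the record 13041611, which is the correct match from the example.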
Take a visual look at the ones that don't match after this set of steps and see if you can discern another pattern of OCR junk or concatenated junk. Perhaps your OCR is seeing "g" where the input is "8", or other possible issues.
You may be able to try using Levenshtein's distance to process these remaining items if you match substrings of equal length. They may also be few enough in number that you can correct your data manually and proceed.
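If you do fall back to Levenshtein distance for the leftovers, the standard two-row dynamic-programming version is small enough to inline (a generic sketch, not specialized to equal-length substrings):

public class Levenshtein {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}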
Another possibility: you may be able to use Amazon Mechanical Turk to purchase crowdsourced labor to resolve some difficult cases.