How to parse freedict files (*.dict and *.index) - java

I was searching for free translation dictionaries. FreeDict (freedict.org) provides the ones I need, but I don't know how to parse the *.index and *.dict files. I also don't really know what to google to find useful information about these formats.
The *.index files look like the following:
00databasealphabet QdGI l
00databasedictfmt1121 B b
00databaseinfo c 5o
00databaseshort 6E u
00databaseurl 6y c
00databaseutf8 A B
a BHO M
a bad risc BHa u
a bag of nerves BII 2
[...]
and the *.dict files:
[Lot of info stuff]
German-English FreeDict Dictionary ver. 0.3.4
Pipi machen /piːpiːmaxən/
to pee; to piss
(Aktien) zusammenlegen /aktsiːəntsuːzamənleːgən/
to merge (with)
[...]
I would be glad to see some example projects (preferably in Python, but Java, C, and C++ are also OK) to understand how to handle these files.

This comes too late for me; however, I hope it can be useful for others like me.
JGoerzen wrote a library, dictdlib. You can see there in more detail how he parses the .index and .dict files:
https://github.com/jgoerzen/dictdlib/blob/master/dictdlib.py

dictd considers its .index and .dict[.dz] format private, reserving itself the right to change it in the future.
If you want to process it directly anyway: the index contains the headwords and the .dict[.dz] file contains the definitions, optionally compressed with a specially modified gzip algorithm that provides almost-random access, which plain gzip does not. The index contains three columns per line, tab separated (a decoding sketch follows the list):
The headword for looking up the definition.
The absolute byte position of the definition in the .dict[.dz] file, base64 encoded.
The length of the definition in bytes, base64 encoded.
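A minimal sketch in Java of how those two columns decode (the dictionary file name is hypothetical, and this only works on an uncompressed .dict file; a .dict.dz would have to be decompressed first, e.g. with dictunzip):
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class DictIndex {
    // The index columns use the standard Base64 alphabet, but as digits of
    // a plain base-64 number: most significant digit first, no padding.
    private static final String B64 =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    static long decodeB64Number(String s) {
        long n = 0;
        for (char c : s.toCharArray()) {
            n = n * 64 + B64.indexOf(c);
        }
        return n;
    }

    public static void main(String[] args) throws Exception {
        // From the sample line "a bad risc<TAB>BHa<TAB>u":
        long offset = decodeB64Number("BHa"); // byte position in the .dict file
        long length = decodeB64Number("u");   // length of the definition in bytes
        try (RandomAccessFile dict = new RandomAccessFile("deu-eng.dict", "r")) {
            byte[] buf = new byte[(int) length];
            dict.seek(offset);
            dict.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }
    }
}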
For more details see the dict(8) man page (section Database Format), which you should have found in your research before asking your question. For processing the headwords correctly, you'd have to consider encoding and character collation.
It would probably be better to use an existing library to read dictd databases, but that really depends on whether the library is any good (no experience here).
Finally, as you noted yourself, XML is made exactly for easy processing. You could extract the headwords and translations using XPath, leaving out all the grammatical stuff, with no need to bother parsing anything yourself.
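For instance, a minimal sketch using Java's built-in XPath support; the file name and the element names (form/orth for headwords, quote for translations) are assumptions about FreeDict's TEI layout, so check the actual file you downloaded:
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class TeiExtract {
    public static void main(String[] args) throws Exception {
        // The default (non-namespace-aware) parser lets the XPath below
        // match the TEI elements by their plain names.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("deu-eng.tei"));
        XPath xp = XPathFactory.newInstance().newXPath();
        NodeList entries = (NodeList) xp.evaluate("//entry", doc, XPathConstants.NODESET);
        for (int i = 0; i < entries.getLength(); i++) {
            Node entry = entries.item(i);
            String headword = xp.evaluate("form/orth", entry);
            String translation = xp.evaluate(".//quote", entry);
            System.out.println(headword + " -> " + translation);
        }
    }
}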
After getting this far, the next problem would be that there is no one-to-one mapping between words in different languages...

EBCDIC unpacking comp-3 data returns 40404** in Java

I have used the unpack logic provided in the link below for Java:
How to unpack COMP-3 digits using Java?
But for null data in the source it returns 404040404 from the Java unpack code. I understand this is the space character in EBCDIC, but how do I unpack while handling this space, or avoid it altogether?
There are two problems that we have to deal with. First, is the data valid COMP-3 data? And second, is the data considered "valid" by older language implementations like COBOL, since COMP-3 was mentioned?
If the offsets are not misaligned, it would appear that spaces are being interpreted by existing programs as 0 instead of spaces. This would be incorrect, but could be an artifact of older programs that were engineered to tolerate this bad behaviour.
The approach I would take in a legacy shop (assuming no misalignment) is to consider "spaces" (sequences of 0x40 bytes, such as 0x404040404040) as being zero. The legacy check would be to compare the field with spaces and then assume 0x00000000000F (a packed zero) as the actual default. This is something an individual shop would have to determine; it is not recognized as a general programming approach.
In terms of Java, one has to remember that bytes are "signed", so comparisons can be tricky depending on how the code is written. The only "unsigned" data type I recall in Java is char, which is really two bytes (a uint16), basically.
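As a minimal sketch of that legacy check in Java (the masking is the important part, because of the signed bytes just mentioned; the field layout is hypothetical and the convention is shop-specific, not a general rule):
// Treat an all-spaces COMP-3 field as packed zero before unpacking.
static byte[] normalizeComp3(byte[] field) {
    for (byte b : field) {
        // 0x40 is the EBCDIC space; mask with 0xFF because Java bytes are signed.
        if ((b & 0xFF) != 0x40) {
            return field; // real packed data, leave it alone
        }
    }
    byte[] zero = new byte[field.length];
    // Packed-decimal zero: all digit nibbles 0, sign nibble 0xF in the last byte.
    zero[field.length - 1] = 0x0F;
    return zero;
}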
This is less of a programming problem than it is recognizing historical tolerance and remediation.

What is the definition of "lexically-ordered" base64 and why is RFCC-1940 apparently the canonical reference?

Today I was reading the documentation for Netty's Base64Dialect class.
It includes a dialect called ORDERED, of which it says, somewhat briefly:
Special "ordered" dialect of Base64 described in RFC1940.
To cut to the chase, I can't find any definition of what this is, and the documentation includes an erroneous reference which seems to be replicated all over the internet.
Instead of RFC-1940, the document actually links to RFCC-1940, which apparently is a "reader comment", and a nonsensical one at that:
RFC 920: whkpiy clujzis brkyh dwojfmz jydwq hrnwcgklt fsltaiu
Comment by lsnxkrjo sxavymwpg
Submitted on 10/26/2006
Related RFC: RFC-920
Now RFC-920 appears to have nothing to do with base 64:
Domain requirements
This memo restates and refines the requirements on establishing a
Domain first described in RFC-881. It adds considerable detail
to that discussion, and introduces the limited set of top level
domains.
Is RFC-1940 relevant? Skimming it: no, I can't see any base 64 encoding definitions there:
Source Demand Routing: Packet Format and Forwarding Specification (Version 1).
The purpose of SDRP is to support source-initiated selection of
routes to complement the route selection provided by existing routing
protocols for both inter-domain and intra-domain routes. [...]
In fact, searching the web for "rfcc 1940 ordered base64" finds this same URL in lots of other documentation, but sadly no explanation of "lexically ordered base 64".
Is there a legitimate definition of this anywhere? And why hasn't anyone else noticed this URL refers to nonsense?
I have not found a "legitimate definition" of ordered Base64. (At the time of writing, it is not even mentioned in the Wikipedia page on Base64.)
If you treat the code as a specification(!), ordered Base64 is a variant in which the alphabet has been reordered into ascending ASCII order. This means that the natural ordering for ordered Base64 is the same as the natural ordering for the corresponding byte sequence.
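Taking that reading literally, here is a minimal sketch; the alphabet below is the URL-safe Base64 character set rearranged into ascending ASCII order, which is what Netty's ORDERED dialect appears to use (padding is omitted for brevity):
public class OrderedBase64 {
    // URL-safe Base64 characters in ascending ASCII order:
    // '-' < '0'-'9' < 'A'-'Z' < '_' < 'a'-'z'
    private static final char[] ALPHABET =
        "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz".toCharArray();

    // Encode one 3-byte group as 4 characters.
    static String encodeBlock(byte b0, byte b1, byte b2) {
        int n = ((b0 & 0xFF) << 16) | ((b1 & 0xFF) << 8) | (b2 & 0xFF);
        return "" + ALPHABET[(n >>> 18) & 0x3F] + ALPHABET[(n >>> 12) & 0x3F]
                  + ALPHABET[(n >>> 6) & 0x3F] + ALPHABET[n & 0x3F];
    }

    public static void main(String[] args) {
        // Because the alphabet is in ASCII order, encoded strings compare
        // the same way as the byte sequences they encode:
        String a = encodeBlock((byte) 0x00, (byte) 0x01, (byte) 0x02);
        String b = encodeBlock((byte) 0x00, (byte) 0x02, (byte) 0x01);
        System.out.println(a.compareTo(b) < 0); // true
    }
}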
Is it a problem that there isn't a specification for ordered Base64?
Probably not.
In reality the RFCs that "specify" the different variants of Base64 (and Base32 / Base16) are actually more of an attempt to describe the variants rather than specify them. And the same applies to the Wikipedia article.
From what I can tell (google searches), the ordered Base64 variant is rarely used.
The Base64 implementation that introduced the ordered variant is legacy code. (It hasn't been changed in the last 8 years). New Java code that requires Base64 encoding / decoding capability should be using the standard Java java.util.Base64 class introduced in Java 8.
But it is concerning that the javadocs you linked to (and others!) all refer to a nonsense page. That page probably had a legitimate description at some point, but it looks like it has been vandalized.

Load a Perl Hash into Java

I have a big .pm file, which consists only of a very big Perl hash with lots of subhashes. I have to load this hash into a Java program, do some work and changes on the underlying data, and save it back into a .pm file, which should look similar to the one I started with.
So far, I have tried to convert it line-wise by regex and string matching, turning it into an XML document and later parsing that element-wise back into a Perl hash.
This somehow works, but seems quite dodgy. Is there any more reliable way to parse the Perl hash without having a Perl runtime installed?
You're quite right, it's utterly filthy. Regex and string matching for producing XML is a horrible idea in the first place, and honestly XML is probably not a good fit for this anyway.
I would suggest that you consider JSON. I would be stunned to find that Java can't handle JSON, and it's an inherently hash-and-array-oriented data structure.
So you can quite literally:
use JSON;
print to_json ( $data_structure, { pretty => 1 } );
Note: it won't work for serialising objects, but for Perl hash/array/scalar-type structures it'll work just fine.
You can then import it back into perl using:
my $new_data = from_json $string;
print Dumper $new_data;
You could Dumper it to a file, but given that your requirement is multi-language going forward, just using native JSON as your 'at rest' data is probably the more sensible choice.
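On the Java side, any JSON library can then read the dump straight into nested maps; a minimal sketch assuming the Jackson library and a hypothetical data.json:
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.util.Map;

public class HashRoundTrip {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // JSON objects/arrays become nested Maps/Lists, mirroring the
        // Perl hash-of-hashes structure.
        Map<?, ?> data = mapper.readValue(new File("data.json"), Map.class);
        // ... inspect and modify the nested maps here ...
        mapper.writerWithDefaultPrettyPrinter()
              .writeValue(new File("data.json"), data);
    }
}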
But if you're looking at parsing Perl code within Java, without a Perl interpreter? No, that's just insanity.

Problems parsing iso8583 message

I've downloaded the code from here https://github.com/vikrantlabde/iso8583-Java and after some modifications I'm parsing my fields almost fine...
I defined the schema like this:
ISOSCHEMA.put("1","BITMAP");
ISOSCHEMA.put("2","NUM-2-19-0_0");
ISOSCHEMA.put("3","NUMERIC-0-6-0_0");
ISOSCHEMA.put("4","NUMERIC-0-12-0_0");
ISOSCHEMA.put("7","NUMERIC-0-10-0_0");
ISOSCHEMA.put("11","NUMERIC-0-6-0_0");
ISOSCHEMA.put("12","NUMERIC-0-6-0_0");
ISOSCHEMA.put("13","NUMERIC-0-4-0_0");
ISOSCHEMA.put("22","NUMERIC-0-3-0_0");
ISOSCHEMA.put("23","NUMERIC-0-3-0_0");
ISOSCHEMA.put("35","NUM-2-37-0_0");
ISOSCHEMA.put("41","FCHAR-0-8-0_0");
ISOSCHEMA.put("49","FCHAR-0-3-0_0");
ISOSCHEMA.put("55","NUM-3-999-0_0");
The problem is field 55, which is a binary field. The standard documentation says:
55 Reserved ISO B 255 LLLVAR (ISO DOCUMENTATION)
I'm getting an error parsing a string that has the bitmap bit turned on for field 55.
The output I'm getting is:
820200409F36020004950500000000009A031409039C01005F2A0209789F02060000000005009F03060000000000009F10201F430200200000000000000000045895000000000000000000000000000000009F260840D26C4BA5577CFB9F2701809F370443DD7E879F1A0202509F3303E0B0C8
But I expect:
820200409F36020004950500000000009A031409039C01005F2A0201249F02060000000005009F03060000000000009F10201F430200200000000000000000045895000000000000000000000000000000009F260840D26C4BA5577CFB9F2701809F370443DD7E879F1A0202509F3303E0B0C8
The length of the converted ISO payload is very different, too...
The program output is:
303130307238060020C280C28200313636353433323131313232333334343535303030303030303030303030303030303031313031363138333432363030323339343133333433303130313630373130303133373635343332313131323233333434353564333131303232303030393238333030313031303238343031373430393132343233303832303230303430394633363032303030343935303530303030303030303030394130333134303930333943303130303546324130323039373839463032303630303030303030303035303039463033303630303030303030303030303039463130323031463433303230303230303030303030303030303030303030303034353839353030303030303030303030303030303030303030303030303030303030303030394632363038343044323643344241353537374346423946323730313830394633373034343344443745383739463141303230323530394633333033453042304338
What I expect is:
30313030723806002080820031363635343332313131323233333434353530303030303030303030303030303030303131303136313833343236303032333934313333343330313031363037313030313337363534333231313132323333343435353D33313130323230303039323833303031303130323834303137343039313234313135820200409F36020004950500000000009A031409039C01005F2A0201249F02060000000005009F03060000000000009F10201F430200200000000000000000045895000000000000000000000000000000009F260840D26C4BA5577CFB9F2701809F370443DD7E879F1A0202509F3303E0B0C8
One piece of advice I received is that I have to make an explicit conversion to hex from the resultant byte[], and vice versa. That is:
String isoMessage = ISOUtil.hexString(packIsoMsg("0100", isofields).getBytes());
And:
unpackIsoMsg(new String(ISOUtil.hex2byte(isoMessage), "UTF-8"));
What about the definition of this kind of field in this class? I'm really a newbie with the standard, but I arrived here because jPOS doesn't work in an Android environment. I'm also confused by the hex conversion mentioned above.
Any help is really appreciated...
Kind regards.
DE55 is defined as a Tag-Length-Value (TLV) field. It is not in the normal binary / text / packed-numeric format you typically see in the rest of an ISO-8583 message, but in ASN.1 BER-TLV / X.690-0207 format.
Unless you account for the BER-TLV encoding, you will not successfully unpack DE55, except where it is used for non-EMV/tokenization purposes. It threw me at first as well, as I was expecting something more straightforward. Be aware that the transport format is sometimes actually longer than the original plain text or other binary output, so it is not the most efficient.
There are a couple of other fields that, depending on the ISO specification, may also use BER-TLV, but DE55 is the industry-standard field to use BER-TLV for EMV functionality, replacing DE55's previous use as a generic and rarely used 'fee field'.
The ISO-7816 specification describes its ISO-8583 use in detail for EMV and tokenization; there are also other references and quick guides out there if you are looking for something not so in-depth. All volumes of the ISO-7816 specification can be found openly on the internet for free, or can be purchased (spendy) directly from the ISO organization if you want them in the plain ISO format.
I am not familiar with the specific Java repository you referenced, but this one has a help page on how to use BER-TLV. Oracle also has a page on dealing with BER-TLV here. BinaryFoo has a Git repository available as well.
For the purposes of initial testing, if your data is just test data (DO NOT USE PRODUCTION DATA!), you can use http://www.emvlab.org/tlvutils/ to verify your results. When I input your inputs, it kicks out your expected output.
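To give a feel for what those libraries do, here is a minimal single-level BER-TLV parsing sketch; it handles multi-byte tags and the long length form, but skips recursion into constructed tags and other corner cases a real library covers:
import java.util.LinkedHashMap;
import java.util.Map;

public class TlvSketch {
    // Parse one level of BER-TLV data into a tag-hex -> value map.
    static Map<String, byte[]> parse(byte[] data) {
        Map<String, byte[]> out = new LinkedHashMap<>();
        int i = 0;
        while (i < data.length) {
            // Tag: if the low 5 bits of the first byte are all ones, the tag
            // continues; following bytes with the high bit set add more bytes.
            int tagStart = i;
            if ((data[i] & 0x1F) == 0x1F) {
                do { i++; } while ((data[i] & 0x80) == 0x80);
            }
            i++;
            String tag = hex(data, tagStart, i - tagStart);
            // Length: short form (< 0x80), or long form 0x81..0x84 where the
            // low bits give the number of length bytes that follow.
            int len = data[i++] & 0xFF;
            if (len > 0x80) {
                int numBytes = len & 0x7F;
                len = 0;
                for (int j = 0; j < numBytes; j++) {
                    len = (len << 8) | (data[i++] & 0xFF);
                }
            }
            byte[] value = new byte[len];
            System.arraycopy(data, i, value, 0, len);
            i += len;
            out.put(tag, value);
        }
        return out;
    }

    static String hex(byte[] b, int off, int len) {
        StringBuilder sb = new StringBuilder();
        for (int j = off; j < off + len; j++) {
            sb.append(String.format("%02X", b[j]));
        }
        return sb.toString();
    }
}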
Check how the sender defined field 55 and assign the same definition when unpacking. If they are sending a string, it should be LLLVAR.
When sending, the ISO message header must be in hex format; hence they convert the bytes to hex.
From the looks of it, you consider 5F2A020978 wrong, whereas you expect 5F2A020124. EMV tag 5F2A (with length 02) is the transaction currency code. This means your transaction is performed in euros (code 0978) instead of Canadian dollars (code 0124) as you expect. You can find a list of currency codes here.
Hope this helps.

Replicating C struct padding in Java

According to here, the C compiler will pad out values when writing a structure to a binary file. As the example in the link says, when writing a struct like this:
struct {
    char c;
    int i;
} a;
to a binary file, the compiler will usually leave an unnamed, unused hole between the char and int fields, to ensure that the int field is properly aligned.
How could I create an exact replica of the binary output file (generated in C) using a different language (in my case, Java)?
Is there an automatic way to apply C padding in Java output? Or do I have to go through compiler documentation to see how it works (the compiler is g++ by the way).
Don't do this, it is brittle and will lead to alignment and endianness bugs.
For external data it is much better to explicitly define the format in terms of bytes and write explicit functions to convert between the internal and external format, using shifts and masks (not unions!).
This is true not only when writing to files, but also in memory. It is the fact that the struct is padded in memory, that leads to the padding showing up in the file, if the struct is written out byte-by-byte.
It is in general very hard to replicate the exact padding scheme with certainty, although I guess some heuristics would get you quite far. It helps if you have the struct declaration for analysis.
Typically, fields larger than one char will be aligned so that their starting offset inside the structure is a multiple of their size. This means shorts will generally be on even offsets (divisible by 2, assuming sizeof (short) == 2), while doubles will be on offsets divisible by 8, and so on. For the struct in the question, this typically puts the char at offset 0 and the int at offset 4, with three padding bytes between them and a total size of 8 (assuming sizeof (int) == 4).
UPDATE: It is for reasons like this (and also reasons having to do with endianness) that it is generally a bad idea to dump whole structs out to files. It's better to do it field-by-field, like so:
put_char(out, a.c);
put_int(out, a.i);
Assuming the put-functions only write the bytes needed for the value, this will emit a padding-less version of the struct to the file, solving the problem. It is also possible to ensure a proper, known, byte-ordering by writing these functions accordingly.
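In Java, the same field-by-field idea might look like this minimal sketch (names and values are hypothetical; note that DataOutputStream always writes big-endian):
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class WriteFields {
    public static void main(String[] args) throws IOException {
        byte c = 'x'; // the struct's char field
        int i = 42;   // the struct's int field
        try (DataOutputStream out =
                new DataOutputStream(new FileOutputStream("a.bin"))) {
            out.writeByte(c); // exactly 1 byte, no padding
            out.writeInt(i);  // exactly 4 bytes, big-endian
        }
    }
}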
Is there an automatic way to apply C padding in Java output? Or do I have to go through compiler documentation to see how it works (the compiler is g++ by the way).
Neither. Instead, you explicitly specify a data/communication format and implement that specification, rather than relying on implementation details of the C compiler. You won't even get the same output from different C compilers.
For interoperability, look at the ByteBuffer class.
Essentially, you create a buffer of a certain size, put() variables of different types at different positions, and then call array() at the end to retrieve the "raw" data representation:
ByteBuffer bb = ByteBuffer.allocate(8);
bb.order(ByteOrder.LITTLE_ENDIAN);
bb.put(0, someByte);        // the struct's char field at offset 0
bb.putInt(4, someInteger);  // the int field at offset 4, after 3 padding bytes
byte[] rawBytes = bb.array();
But it's up to you to work out where to put padding-- i.e. how many bytes to skip between positions.
For reading data written from C, you generally wrap() a ByteBuffer around some byte array that you've read from a file.
In case it's helpful, I've written more on ByteBuffer.
A handy way of reading/writing C structs in Java is to use the javolution Struct class (see http://www.javolution.org). This won't help you with automatically padding/aligning your data, but it does make working with raw data held in a ByteBuffer much more convenient. If you're not familiar with javolution, it's well worth a look as there's lots of other cool stuff in there too.
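For what it's worth, the struct from the question maps onto the javolution Struct idiom roughly like this (a sketch from memory; check the javolution docs for the exact member-class names):
import javolution.io.Struct;

// Each member class (Signed8, Signed32, ...) maps one C field, and
// javolution aligns them C-style within the backing ByteBuffer.
public class A extends Struct {
    public final Signed8  c = new Signed8();   // char c;
    public final Signed32 i = new Signed32();  // int i;
}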
This hole is configurable: the compiler has switches to align structs on 1/2/4/8-byte boundaries.
So the first question is: which alignment exactly do you want to simulate?
With Java, the sizes of the data types are defined by the language specification. For example, a byte type is 1 byte, a short is 2 bytes, and so on. This is unlike C, where the size of each type is architecture-dependent.
Therefore, it would be important to know how the binary file is formatted in order to be able to read the file into Java.
It may be necessary to take steps to be certain that fields are a specific size, to account for differences in the compiler or architecture. The mention of alignment seems to suggest that the output file will depend on the architecture.
You could try Preon:
Preon is a java library for building codecs for bitstream-compressed data in a declarative (annotation based) way. Think JAXB or Hibernate, but then for binary encoded data.
It can handle big/little-endian binary data, alignment (padding), and various numeric types, among other features. It is a very nice library; I like it very much.
my 0.02$
I highly recommend protocol buffers for exactly this problem.
As I understand it, you're saying that you don't control the output of the C program. You have to take it as given.
So do you have to read this file for some specific set of structures, or do you have to solve this in a general case? I mean, is the problem that someone said, "Here's the file created by program X, you have to read it in Java"? Or do they expect your Java program to read the C source code, find the structure definition, and then read it in Java?
If you've got a specific file to read, the problem isn't really very difficult. Either by reviewing the C compiler specifications or by studying example files, figure out where the padding is. Then on the Java side, read the file as a stream of bytes, and build the values you know are coming. Basically I'd write a set of functions to read the required number of bytes from an InputStream and turn them into the appropriate data type. Like:
int readInt(InputStream is, int len)
    throws IOException, PrematureEndOfDataException
{
    int n = 0;
    while (len-- > 0)
    {
        int i = is.read();
        if (i == -1)
            throw new PrematureEndOfDataException();
        // Mask to avoid sign extension: byte values >= 0x80 would
        // otherwise corrupt the accumulated result.
        n = (n << 8) | (i & 0xFF);
    }
    return n;
}
You can alter the packing on the C side to ensure that no padding is used, or alternatively you can look at the resultant file format in a hex editor, which lets you write a parser in Java that ignores the bytes that are padding.
