Why does the BSON Java implementation use a 4-byte inc field? - java

In the BSON Java implementation, an ObjectId is composed of 3 pieces (source code: http://grepcode.com/file/repo1.maven.org/maven2/org.mongodb/mongo-java-driver/2.9.0/org/bson/types/ObjectId.java#ObjectId.%3Cinit%3E%28int%2Cint%2Cint%29 ):
XXXX XXXX XXXX
-------------------------
time machine&pid inc
(each X represents a byte)
This is a bit different from what is described in the documentation (doc: http://docs.mongodb.org/manual/core/object-id/ ):
XXXX XXX XX XXX
--------------------------
time machine pid inc
(each X represents a byte)
Can anyone let me know why the java-driver didn't follow the spec?
Thanks!
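For concreteness, here is a minimal Java sketch (my own illustration, not driver code) that splits a 12-byte ObjectId according to the documented 4-3-2-3 layout; under the Java driver's 4-4-4 layout the machine/pid boundary would fall elsewhere:

import java.nio.ByteBuffer;

public class ObjectIdLayout {
    // Decode a 12-byte ObjectId per the documented layout:
    // 4 bytes time, 3 bytes machine, 2 bytes pid, 3 bytes inc.
    static void dumpDocumentedLayout(byte[] id) {
        int time = ByteBuffer.wrap(id, 0, 4).getInt();
        int machine = ((id[4] & 0xFF) << 16) | ((id[5] & 0xFF) << 8) | (id[6] & 0xFF);
        int pid = ((id[7] & 0xFF) << 8) | (id[8] & 0xFF);
        int inc = ((id[9] & 0xFF) << 16) | ((id[10] & 0xFF) << 8) | (id[11] & 0xFF);
        System.out.printf("time=%d machine=%d pid=%d inc=%d%n", time, machine, pid, inc);
    }

    // The 2.x Java driver instead writes three big-endian 4-byte ints
    // (time, machine&pid, inc), so the middle 4 bytes form a single field.
}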

I will put this as an answer since it is a bit long for a comment.
There are a couple of JIRA links to this:
https://jira.mongodb.org/browse/JAVA-81
https://jira.mongodb.org/browse/JAVA-337
The second acknowledges that the Java driver differs from the spec, but makes no reference as to why.
If I were to make a guess, it could be due to the way the PID and machine id work in Java; it could be related to: https://jira.mongodb.org/browse/JAVA-586.
You may find a better answer on the Google Group mongodb-user, since the maintainers hang out there.

I expect the original intent of an ObjectID was to generate a reasonably unique primary key, rather than packing fields that drivers would then start parsing as data.
As the MongoDB ecosystem has evolved, some developers have found it useful to interpret the ObjectID from multiple drivers as well as ensure consistency of generated IDs.
If you look at the BSON spec you will see there are a few subtypes for UUID used by older drivers, and various changes for interoperability. For example, there is mention on PYTHON-387 of supporting "legacy" byte orders and endianness for the C# and Java drivers.
As per JAVA-337 in the MongoDB issue tracker, the Java driver's ObjectID inconsistency is planned to be addressed in the 3.0 Java driver release.

I cannot explain why they are different, but I can tell you that the Python driver generates object ids using the same approach that the Java one does:
https://github.com/mongodb/mongo-python-driver/blob/master/bson/objectid.py

Related

EBCDIC unpacking comp-3 data returns 40404** in Java

I have used the unpack-data logic provided in the link below for Java:
How to unpack COMP-3 digits using Java?
But for null data in the source, the Java unpack code returns values like 404040404. I understand this is a space in EBCDIC, but how do I unpack while handling this space, or avoid it?
There are two problems we have to deal with. First, is the data valid COMP-3 data? Second, is the data considered "valid" by older language implementations such as COBOL, since COMP-3 was mentioned?
If the offsets are not misaligned, it would appear that spaces are being interpreted by existing programs as 0 instead of spaces. This would be incorrect, but could be an artifact of older programs that were engineered to tolerate this bad behaviour.
The approach I would take in a legacy shop (assuming no misalignment) is to consider "spaces" (sequences of 0x40 bytes) as being zero. This would be a legacy check: compare the field with spaces and, if it matches, assume the packed-decimal zero 0x00000000000f as the actual default. This is something an individual shop would have to determine; it is not recognized as a general programming approach.
In terms of Java, one has to remember that bytes are "signed", so comparisons can be tricky depending on how the code is written. The only "unsigned" data type I recall in Java is char, which is really two bytes (a 16-bit unsigned integer).
This is less of a programming problem than it is recognizing historical tolerance and remediation.
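To make the idea concrete, here is a minimal Java sketch of a COMP-3 unpacker that applies the shop convention described above (treating an all-spaces field as zero). The all-spaces-means-zero rule is an assumption a given shop would have to confirm, not a general standard:

public class Comp3Unpacker {
    // Decode a COMP-3 (packed decimal) field: two BCD digits per byte,
    // with the final low nibble holding the sign (0xD = negative).
    // An all-EBCDIC-spaces (0x40...) field is treated as zero here,
    // per the legacy-shop convention discussed above (an assumption).
    public static long unpackComp3(byte[] field) {
        boolean allSpaces = true;
        for (byte b : field) {
            if (b != 0x40) { allSpaces = false; break; }
        }
        if (allSpaces) {
            return 0L; // legacy tolerance: spaces treated as zero
        }
        long value = 0;
        for (int i = 0; i < field.length; i++) {
            int b = field[i] & 0xFF;          // mask: Java bytes are signed
            int high = (b >> 4) & 0x0F;
            int low = b & 0x0F;
            if (i < field.length - 1) {
                value = value * 100 + high * 10 + low;
            } else {
                value = value * 10 + high;    // last low nibble is the sign
                if (low == 0x0D) value = -value;
            }
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] positive = { 0x12, 0x34, 0x5C };   // packed +12345
        byte[] spaces = { 0x40, 0x40, 0x40 };     // EBCDIC spaces
        System.out.println(unpackComp3(positive)); // 12345
        System.out.println(unpackComp3(spaces));   // 0
    }
}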

What is the definition of "lexically-ordered" base64 and why is RFCC-1940 apparently the canonical reference?

Today I was reading the documentation for Netty's Base64Dialect class.
It includes a dialect called ORDERED, of which it says, somewhat briefly:
Special "ordered" dialect of Base64 described in RFC1940.
To cut to the chase, I can't find any definition of what this is, and the documentation includes an erroneous reference which seems to be replicated all over the internet.
Instead of RFC-1940, the document actually links to RFCC-1940, which apparently is a "reader comment", and a nonsensical one at that:
RFC 920: whkpiy clujzis brkyh dwojfmz jydwq hrnwcgklt fsltaiu
Comment by lsnxkrjo sxavymwpg
Submitted on 10/26/2006
Related RFC: RFC-920
Now RFC-920 appears to have nothing to do with base 64:
Domain requirements
This memo restates and refines the requirements on establishing a
Domain first described in RFC-881. It adds considerable detail
to that discussion, and introduces the limited set of top level
domains.
Is RFC-1940 relevant? Skimming it, no: I can't see any base 64 encoding definitions here:
Source Demand Routing: Packet Format and Forwarding Specification (Version 1).
The purpose of SDRP is to support source-initiated selection of
routes to complement the route selection provided by existing routing
protocols for both inter-domain and intra-domain routes. [...]
In fact, searching the web for "rfcc 1940 ordered base64" finds this same URL in lots of other documentation, but sadly no explanation of "lexically ordered base 64".
Is there a legitimate definition of this anywhere? And why hasn't anyone else noticed this URL refers to nonsense?
I have not found a "legitimate definition" of ordered Base64. (At the time of writing, it is not even mentioned in the Wikipedia page on Base64.)
If you treat the code as a specification(!), ordered Base64 is a variant in which the alphabet has been reordered into ascending ASCII order. This means that the natural ordering for ordered Base64 is the same as the natural ordering for the corresponding byte sequence.
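For illustration, a minimal sketch of that idea. The alphabet below is my reading of the legacy implementation (the URL-safe Base64 characters sorted into ascending ASCII order), so treat it as an assumption rather than a spec:

public class OrderedBase64Demo {
    // "Ordered" alphabet: '-', digits, uppercase, '_', lowercase,
    // i.e. 64 characters in strictly ascending ASCII order (assumed
    // from the legacy code, not from any RFC).
    private static final char[] ORDERED =
        "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"
            .toCharArray();

    // Minimal encoder for whole 3-byte groups, enough to show the
    // order-preserving property (padding handling omitted).
    static String encode(byte[] in) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 2 < in.length; i += 3) {
            int n = ((in[i] & 0xFF) << 16) | ((in[i + 1] & 0xFF) << 8) | (in[i + 2] & 0xFF);
            out.append(ORDERED[(n >> 18) & 0x3F])
               .append(ORDERED[(n >> 12) & 0x3F])
               .append(ORDERED[(n >> 6) & 0x3F])
               .append(ORDERED[n & 0x3F]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] a = { 0, 0, 1 };
        byte[] b = { 0, 0, 2 };
        // Because the alphabet is in ASCII order, the encoded strings
        // compare the same way as the underlying byte sequences.
        System.out.println(encode(a).compareTo(encode(b)) < 0); // true
    }
}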
Is it a problem that there isn't a specification for ordered Base64?
Probably not.
In reality the RFCs that "specify" the different variants of Base64 (and Base32 / Base16) are actually more of an attempt to describe the variants rather than specify them. And the same applies to the Wikipedia article.
From what I can tell (google searches), the ordered Base64 variant is rarely used.
The Base64 implementation that introduced the ordered variant is legacy code. (It hasn't been changed in the last 8 years). New Java code that requires Base64 encoding / decoding capability should be using the standard Java java.util.Base64 class introduced in Java 8.
But it is concerning that the javadocs you linked to (and others!) all refer to a nonsense page. That page probably had a legitimate description at some point, but it looks like it has been vandalized.

How to parse freedict files (*.dict and *.index)

I was searching for free translation dictionaries. Freedict (freedict.org) provides the ones I need, but I don't know how to parse the *.index and *.dict files. I also don't really know what to google to find useful information about these formats.
The *.index files look like the following:
00databasealphabet QdGI l
00databasedictfmt1121 B b
00databaseinfo c 5o
00databaseshort 6E u
00databaseurl 6y c
00databaseutf8 A B
a BHO M
a bad risc BHa u
a bag of nerves BII 2
[...]
and the *.dict files:
[Lot of info stuff]
German-English FreeDict Dictionary ver. 0.3.4
Pipi machen /piːpiːmaxən/
to pee; to piss
(Aktien) zusammenlegen /aktsiːəntsuːzamənleːgən/
to merge (with)
[...]
I would be glad to see some example projects (preferably in python, but java, c, c++ are also ok) to understand how to handle these files.
It is a bit late, but I hope this can be useful for others like me.
JGoerzen wrote a library called dictdlib. You can see in more detail how he parses the .index and .dict files:
https://github.com/jgoerzen/dictdlib/blob/master/dictdlib.py
dictd considers the format of its .index and .dict[.dz] files private, reserving the right to change it in the future.
If you want to process it directly anyway: the index contains the headwords and the .dict[.dz] file contains the definitions. The latter is optionally compressed with a specially modified gzip algorithm providing almost-random access, which plain gzip does not. The index contains 3 columns per line, tab separated (a parsing sketch follows the list):
The headword for looking up the definition.
The absolute byte position of the definition in the .dict[.dz] file, base64 encoded.
The length of the definition in bytes, base64 encoded.
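As a rough illustration, here is a minimal Java sketch of that layout. It assumes an uncompressed .dict file and hypothetical file names; dictd encodes the numbers as base-64 digits using the standard Base64 alphabet, most significant digit first (my reading of the dict(8) description, so verify against your files):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class DictdIndexReader {
    // dictd numbers: base-64 digits over the standard Base64 alphabet.
    private static final String B64 =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    static long decodeNumber(String s) {
        long value = 0;
        for (char c : s.toCharArray()) {
            value = value * 64 + B64.indexOf(c);
        }
        return value;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file names, for illustration only.
        try (BufferedReader index = new BufferedReader(new FileReader("deu-eng.index"));
             RandomAccessFile dict = new RandomAccessFile("deu-eng.dict", "r")) {
            String line;
            while ((line = index.readLine()) != null) {
                String[] cols = line.split("\t"); // headword, offset, length
                long offset = decodeNumber(cols[1]);
                int length = (int) decodeNumber(cols[2]);
                byte[] buf = new byte[length];
                dict.seek(offset);
                dict.readFully(buf);
                System.out.println(cols[0] + " => "
                        + new String(buf, StandardCharsets.UTF_8));
            }
        }
    }
}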
For more details see the dict(8) man page (section Database Format) you should have found in your research before asking your question. For processing the headwords correctly, you'd have to consider encoding and character collation.
It might eventually be better to use an existing library to read dictd databases, but that really depends on whether the library is good (no experience here).
Finally, as you noted yourself, XML is made exactly for easy processing. You could extract the headwords and translations using XPath, leaving out all the grammatical stuff and no need to bother parsing anything.
After getting this far, the next problem would be that there is no one-to-one mapping between words in different languages...

Which keywords are reserved in JavaScript but not in Java?

One example is debugger, but there are more.
By reserved I mean reserved words as well as future reserved words (in both strict and non-strict mode) and special tokens like null, true and false.
I'm interested in ECMAScript 5.1 as well as current 6 vs. Java 5-8 (not sure if there were new keywords since Java 5).
Update
For those who are interested in reasons to know this:
I know many Java developers switching from Java to JavaScript (my story). Knowing the delta in keywords is helpful.
Language history.
My very specific reason for asking: I'm building Java/JavaScript code-generation tools (quasi cross-language). Which reserved keywords should I add to the Java code generator so that it produces JavaScript-compatible identifiers in the cross-language case?
This is what I've found out so far.
There seem to have been no new keywords in Java since 5.0 (which added enum).
Java vs. ECMAScript 5.1:
debugger
delete
function
in
typeof
var
with
export
let
yield
Java vs. ECMAScript 6 Rev 36 Release Candidate 3:
all of the above
await
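For the code-generator use case, a minimal sketch of how a keyword filter could use the lists above (the renaming convention here, an underscore suffix, is just an illustrative choice, not a standard):

import java.util.Set;

public class JsKeywordFilter {
    // Reserved in ECMAScript 5.1/6 but not in Java, per the lists above.
    private static final Set<String> JS_ONLY_RESERVED = Set.of(
        "debugger", "delete", "function", "in", "typeof", "var",
        "with", "export", "let", "yield", "await");

    // Rename an identifier if it would collide with a JS reserved word.
    public static String toJsSafeIdentifier(String javaIdentifier) {
        return JS_ONLY_RESERVED.contains(javaIdentifier)
                ? javaIdentifier + "_"
                : javaIdentifier;
    }

    public static void main(String[] args) {
        System.out.println(toJsSafeIdentifier("yield")); // yield_
        System.out.println(toJsSafeIdentifier("total")); // total
    }
}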

Whirlpool hash in java and in python give different results

I have two projects, panager and panager-android. I use the Whirlpool hash algorithm, and with the same data panager gives different results than panager-android.
panager is written in Python and panager-android (I guess) in Java.
I'm ultra-new to Java, so take it easy :P
In Python I use a module that I found on the net (whirlpool.py) and in Java I use the Jacksum library.
There are different versions of the Whirlpool spec which generate different output for the same input. It looks like whirlpool.py might be implementing the original Whirlpool (referred to as "Whirlpool-0"), whereas in panager-android you use Whirlpool-2:
AbstractChecksum encode = JacksumAPI.getChecksumInstance("whirlpool2");
Try changing that to "whirlpool0" and see if it matches your Python implementation now. Failing that, try "whirlpool1".
Wikipedia has known Whirlpool hashes from each version for a given test input which you may use to identify the version of a questioned Whirlpool implementation, or find out if it's just entirely wrong and broken.
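For example, a small sketch that prints all three variants with Jacksum so you can compare against the Python output. It assumes Jacksum's AbstractChecksum API as used in the one line shown above; double-check the method names against your Jacksum version:

import jonelo.jacksum.JacksumAPI;
import jonelo.jacksum.algorithm.AbstractChecksum;

public class WhirlpoolVariants {
    public static void main(String[] args) throws Exception {
        byte[] data = "The quick brown fox jumps over the lazy dog"
                .getBytes("UTF-8");
        // Jacksum's names for the three Whirlpool spec revisions.
        for (String name : new String[] { "whirlpool0", "whirlpool1", "whirlpool2" }) {
            AbstractChecksum cs = JacksumAPI.getChecksumInstance(name);
            cs.update(data);
            System.out.println(name + ": " + cs.getFormattedValue());
        }
    }
}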
