I am trying to figure out what the maximum length for an AppEngine key name is in the Java API.
This question has been asked in much less depth before:
How long (max characters) can a datastore entity key_name be? Is it bad to have very long key_names?
and received two conflicting answers (with the one that seems less credible to me being the accepted one...)
@ryan was able to provide links to the relevant Python API source in his answer, and I've been trying to find something similar in the Java API.
But neither Key.java, nor KeyFactory.java, nor KeyTranslator.java seem to enforce any limits on the name property of a key. So, if there is a limit, it is implemented elsewhere. KeyTranslator calls com.google.storage.onestore.v3.OnestoreEntity.Path.Element.setName(), which could be the place where the limit is implemented, but unfortunately I can't find the source for this class anywhere...
Specifically, I would like to know:
Is the 500 character limit a hard limit specifically imposed on key names somewhere in the backend, or is it simply a recommendation that should be sufficient to ensure the full key string never exceeds the 1500-byte limit of a short text property (long text properties with more bytes cannot be indexed, if I understand correctly)?
If it is a hard limit:
Is it 500 characters or 500 bytes (i.e. the length after some encoding)?
Are the full 500 bytes/characters available for the name of the key or do the other key-components (kind, parent, app-id, ...) deduct from this number?
If it is a recommendation:
Is it sufficient in all cases?
What is the maximum I can use if all keys are located in the root of my application and the kind is only one letter long? In other words: Is there a formula I can use to calculate the real limit given the other key components?
Lastly, if I simply try to measure this limit by attempting to store keys of increasing length until I get some exception, will I be able to rely on the limit that I find if I only create keys with identical ancestor paths and same-length kinds in the same application? Or are there other variable-length components to a key that might get added and reduce the available key-name-length in some cases? Should it be the same for the Development and the Production Servers?
The Datastore implements all of its validation in the backend (this prevents an operation that succeeds in one client from failing in another). Datastore keys have the following restrictions:
A key can have at most 100 path elements (these are kind, name/id pairs)
Each kind can be at most 1500 bytes.
Each name can be at most 1500 bytes.
The 500 character limit has been converted into a 1500 byte limit. So places where you've seen a 500 character limit before (like @ryan's answer in the linked question) are now 1500 bytes. Strings are encoded using UTF-8.
Importantly, to answer some specifics from your question:
Are the full 500 bytes/characters available for the name of the key or do the other key-components (kind, parent, app-id, ...) deduct from this number?
No, the 1500 byte limit is per field.
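If you want to guard against over-long names on the client side, a minimal sketch is shown below. It assumes the 1500-byte UTF-8 per-field limit described above; MAX_NAME_BYTES and createKeyChecked are illustrative names, not part of the App Engine API.

    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.KeyFactory;
    import java.nio.charset.StandardCharsets;

    public class KeyNameCheck {
        // Assumed per-field limit, per the answer above (1500 bytes, UTF-8).
        private static final int MAX_NAME_BYTES = 1500;

        // Fail fast locally instead of waiting for the backend to reject the put.
        public static Key createKeyChecked(String kind, String name) {
            int bytes = name.getBytes(StandardCharsets.UTF_8).length;
            if (bytes > MAX_NAME_BYTES) {
                throw new IllegalArgumentException("key name is " + bytes
                        + " bytes (UTF-8), limit is " + MAX_NAME_BYTES);
            }
            return KeyFactory.createKey(kind, name);
        }
    }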
Related
I have a big string array, each element of which contains between 24 and 32 random characters (drawn from 0123456789abcdefghijklmnopqrstuvwxyz!@#$%^&*()_+=-[]';/.,<>?}{). Sometimes the array is empty, but at other times it has more than 1000 elements.
I send them to my client, which is a browser, via AJAX every time it requests them, and I want to reload a part of my application only if that array is different - that is, if there was a modification, addition or removal in said array. So I want to send the entire array, plus some kind of hash of all the elements inside it. I can't use MD5 or anything like that, because the elements inside the array might move around.
What do you suggest I do? The server uses Java to serve pages.
Are you sure transmitting 1000 characters is actually a problem in your use case? For instance, this stackoverflow page is currently 17000 bytes in size, and stackoverflow makes no effort to only transmit it if it has changed. Put differently, transmitting 1000 characters will take about 1000 bytes, or 1 ms on a 1 MBit connection (which is slow by modern standards ;-).
That said, transmitting data only if it has changed is such a basic optimization strategy that it has been incorporated into the HTTP standard itself. The HTTP standard describes both time-based and ETag-based invalidation, and it is implemented by virtually all software and hardware interacting via HTTP, including browsers and CDNs. To learn more, read a tutorial by Google or the normative specification.
You could use time-based invalidation, either by specifying a fixed lifetime or by interpreting the If-Modified-Since header. You could also use an ETag that is not sensitive to ordering, by putting your elements into a canonical order (e.g. by sorting) before hashing, as sketched below.
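A sketch of such an order-insensitive hash in Java (assuming plain UTF-8 strings; the class and method names are illustrative):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;

    public class OrderInsensitiveHash {
        // Sort a copy first, so reordering the array does not change the digest.
        public static String etagFor(String[] elements) throws NoSuchAlgorithmException {
            String[] sorted = elements.clone();
            Arrays.sort(sorted);
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String s : sorted) {
                md5.update(s.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator, so ["ab","c"] != ["a","bc"]
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }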
I would suggest a system that allows you to skip sending the strings altogether if the client has the latest version. The client keeps the version number (or hash code) of the latest version it received. If it hasn't received any strings yet, it can default to 0.
So, when the client needs to get the strings, it can say, "Give me the strings if the current version isn't X," where X is the version that the client currently has.
The server maintains a version number or hash code which it updates whenever the strings change. If it receives a request, and the client's version is the same as the current version, then the server returns a result that says, "You already have the current version."
The point here is twofold: prevent transmitting information that you don't need to transmit, and prevent the client from having to compute a hash code.
If the server needs to compute a hash at every request rather than just keeping a current hash code value, have the server sort the array of strings first, and then do an MD5 or CRC or whatever.
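A minimal sketch of the server side of the version-check handshake described above (all names are illustrative, not tied to any specific framework):

    public final class StringStore {
        private long version = 0;            // bumped on every change
        private String[] strings = new String[0];

        public synchronized void replace(String[] newStrings) {
            strings = newStrings.clone();
            version++;                       // clients holding the old version will refetch
        }

        // Returns null to mean "you already have the current version".
        public synchronized Response get(long clientVersion) {
            return (clientVersion == version) ? null : new Response(version, strings);
        }

        public static final class Response {
            public final long version;
            public final String[] strings;
            Response(long version, String[] strings) {
                this.version = version;
                this.strings = strings;
            }
        }
    }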
For some time I have been working on creating an index for very large data sets (around 190 million records). I have a B-tree that can insert data sets (typically one object per record) and search for a key, and while I was searching for ways to persist the data into files on disk, I came across this amazing article (http://www.javaworld.com/article/2076333/java-web-development/use-a-randomaccessfile-to-build-a-low-level-database.html#resources). It pretty much gave me the starting point.
Here they are indexing a String key to a binary object (blob). Their file format is divided into 3 regions: a header (stores the start point of the indexes), an index region (stores the indexes and their corresponding locations) and a data region (stores the data). They use RandomAccessFile to get at the data.
How do I define a similar file format for a B-tree? All I know is that for every read made to disk, I have to fetch one node (typically one block of 512 bytes). There are many similar questions on how to persist, but it is a little difficult to understand the big picture of why we decide on a particular implementation, as in this question (Persisting B-Tree nodes to RandomAccessFile -[SOLVED]). Please share your thoughts.
Here is an alternative take on the question, based on problem specifics that have become known in the meantime. This post is based on the following assumptions:
record count about 190 million, fixed
keys are 64-byte hashes, like SHA-256
values are filenames: variable length, but sensible (average length < 64 bytes, max < page)
page size 4 KiByte
Efficient representation of filenames in a database is a different topic that cannot be addressed here. Should the filenames be awkward - longish on average and/or Unicode - then the hashing solution will punish you with increased disk read counts (more overflows, more chaining) or reduced average occupancy (more wasted space). A B-tree solution reacts somewhat more benignly, though, since an optimum tree can be constructed in any case.
The most efficient solution in this situation - and the simplest to implement by a wide margin - is hashing, since your keys are perfect hashes already. Take the first 23 bits of the hash as the page number, and lay out the pages like this:
page header
uint32_t next_page
uint16_t key_count
key/offset vector
uint16_t value_offset;
byte key[64];
... unallocated space ...
last arrived filename
...
2nd arrived filename
1st arrived filename
Values (filenames) are stored from the end of the page downwards, prefixed with their 16-bit length, and the key/offset vector grows upwards. That way neither low/high key counts nor short/long values can cause unnecessary waste of space, as would be the case with fixed-size structures. Nor do you have to parse variable-length structures during key searches. Apart from that I've aimed for the greatest possible simplicity - no premature optimisation. The bottom of the heap can be stored in the page header, in KO[PH.key_count].value_offset (my preference), or computed as KO.Take(PH.key_count).Select(r => r.value_offset).Min(), whatever pleases you most.
The key/offset vector needs to be kept sorted on the keys so that you can use binary search but the values can be written as they arrive, they do not need to be in any particular order. If the page overflows, allocate a new one just like it at the current end of the file (growing the file by one page) and stash its page number in the appropriate header slot. This means that you can binary search within a page but all chained pages need to be read and searched one by one. Also, you do not need any kind of file header, since the file size is otherwise available and that's the only piece of global management information that needs to be maintained.
Create the file as a sparse file with the number of pages as indicated by your chosen number of hash key bits (e.g. 8388608 pages for 23 bits). Empty pages in a sparse file don't take up any disk space and read as all 0s, which works perfectly fine with our page layout/semantics. Extend the file by one page whenever you need to allocate an overflow page. Note: the 'sparse file' thing isn't very important here since almost all pages will have been written to when you're done building the file.
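In Java terms, page addressing and retrieval could look roughly like the sketch below, under the assumptions above (4 KiB pages, the top 23 hash bits as page number, the file pre-sized to 2^23 pages; class and method names are illustrative):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class HashPageFile {
        static final int PAGE_SIZE = 4096; // 4 KiB, as assumed above
        static final int KEY_BITS  = 23;   // leading hash bits select the page

        // Take the top 23 bits of the 64-byte hash key as the page number.
        static long pageNumber(byte[] hashKey) {
            int b0 = hashKey[0] & 0xFF, b1 = hashKey[1] & 0xFF, b2 = hashKey[2] & 0xFF;
            return ((long) b0 << 15) | (b1 << 7) | (b2 >>> 1); // 8 + 8 + 7 = 23 bits
        }

        // Read one page; untouched pages of a sparse, pre-sized file read as all 0s.
        static byte[] readPage(RandomAccessFile file, long pageNo) throws IOException {
            byte[] page = new byte[PAGE_SIZE];
            file.seek(pageNo * PAGE_SIZE);
            file.readFully(page);
            return page;
        }
    }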
For maximum efficiency you need to run some analyses on your data. In my simulation - with random numbers as stand-ins for the hashes, and on the assumption that average filename size is 62 bytes or less - the optimum turned out to be making 2^23 = 8388608 buckets/pages. This means that you take the first 23 bits of the hash as the page number to load. Here are the details:
# bucket statistics for K = 23 and N = 190000000 ... 7336.5 ms
average occupancy 22.6 records
0 empty buckets (min: 3 records)
310101/8388608 buckets with 32+ records (3.7%)
That keeps the chaining to a minimum, on average you need to read just 1.04 pages per search. Increasing the hash key size by one single bit to 24 reduces the expected number of overflowing pages to 3 but doubles the file size and reduces average occupancy to 11.3 records per page/bucket. Reducing the key to 22 bits means that almost all pages (98.4%) can be expected to overflow - meaning the file is virtually the same size as that for 23 bits but you have to do twice as many disk reads per search.
Hence you see how important it is to run a detailed analysis on the data to decide on the proper number of bits to use for hash addressing. You should run an analysis that uses the actual filename sizes and tracks the per-page overhead, to see what the actual picture looks like for 22 bits to 24 bits. It'll take a while to run but that's still way faster than building a multi-gigabyte file blindly and then finding that you have wasted 70% of space or that searches take significantly more than 1.05 page reads on average.
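Such an analysis is cheap to script. A toy version of the simulation quoted above (random ints as stand-ins for the real hashes, and a fixed per-page record capacity instead of actual filename sizes, both of which are simplifying assumptions) might look like:

    import java.util.Random;

    public class BucketSim {
        public static void main(String[] args) {
            final int K = 23;                    // hash bits used for addressing
            final long N = 190_000_000L;         // record count
            final int buckets = 1 << K;
            final int capacity = 32;             // assumed records per 4 KiB page
            int[] counts = new int[buckets];
            Random rnd = new Random(42);
            for (long i = 0; i < N; i++) {
                counts[rnd.nextInt(buckets)]++;  // stand-in for the top K hash bits
            }
            long empty = 0, overflowing = 0;
            for (int c : counts) {
                if (c == 0) empty++;
                if (c >= capacity) overflowing++;
            }
            System.out.printf("avg occupancy %.1f records, %d empty, %d/%d buckets with %d+ records%n",
                    (double) N / buckets, empty, overflowing, buckets, capacity);
        }
    }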
Any B-tree based solution would be much more involved (read: complicated) but could not reduce the page read count per search below 1.000, for obvious reasons, and even that only on the assumption that a sufficient number of internal nodes can be kept cached in memory. If your system has such humongous amounts of RAM that data pages can be cached to a significant degree then the hashing solution will benefit just as much as one that is based on some kind of B-tree.
As much as I would like an excuse for building a screamingly fast hybrid radix/B+tree, the hashing solution delivers essentially the same performance for a tiny fraction of the effort. The only thing where B-treeish solutions can outdo hashing here is space efficiency, since it is trivial to construct an optimum tree for existing pre-sorted data.
There are plenty of open source key/value stores and full database engines - take a week off and start Googling. Even if you end up using none of them, you still need to study a representative cross-section (architecture, design histories, key implementation details) to get enough of an overview of the subject matter to make informed decisions and ask intelligent questions. For a brief overview, try to Google details on index file formats, both historic ones like IDX or NTX and current ones used in various database engines.
If you want to roll your own then you might consider hitching yourself to the bandwagon of an existing format, like the dBASE variants Clipper and Visual FoxPro (my favourite). This gives you the ability to work your data with existing tools, including Total Commander plugins and whatnot. You don't need to support the full formats, just the single binary instance of the format that you choose for your project. Great for debugging, reindexing, ad hoc queries and so on. The format itself is dead simple and easy to generate even if you don't use any of the existing libraries. The index file formats aren't quite as trivial but still manageable.
If you want to roll your own from scratch then you've got quite a road ahead of you, since the basics of intra-node (intra-page) design and practice are poorly represented on the Internet and in literature. For example, some old DDJ issues contained articles about efficient key matching in connection with prefix truncation (a.k.a. 'prefix compression') and so on but I found nothing comparable out there on the 'net at the moment, except buried deeply in some research papers or source code repositories.
The single most important item here is the algorithm for searching prefix-truncated keys efficiently. Once you've got that, the rest more or less falls into place. I have found only one resource on the 'net, which is this DDJ (Dr Dobb's Journal) article:
Supercharging Sequential Searches by Walter Williams
A lot of tricks can also be gleaned from papers like
Efficient index compression in DB2 LUW
For more details and pretty much everything else you could do a lot worse than reading the following two books cover to cover (both of them!):
Goetz Graefe: Modern B-Tree Techniques (ISBN 1601984820)
Jim Gray: Transaction Processing: Concepts and Techniques (ISBN 1558601902)
An alternative to the latter might be
Philip A. Bernstein: Principles of Transaction Processing (ISBN 1558606238)
It covers a similar spectrum and it seems to be a bit more hands-on, but it does not seem to have quite the same depth. I cannot say for certain, though (I've ordered a copy but haven't got it yet).
These books give you a complete overview over all that's involved, and they are virtually free of fat - i.e. you need to know almost everything that's in there. They will answer gazillions of questions that you didn't know you had, or that you should have asked yourself. And they cover the whole ground - from B-tree (and B+tree) basics to detailed implementation issues like concurrency, locking, page replacement strategies and so forth. And they enable you to utilise the information that is scattered over the 'net, like articles, papers, implementation notes and source code.
Having said that, I'd recommend matching the node size to the architecture's RAM page size (4 KB or 8 KB), because then you can utilise the paging infrastructure of your OS instead of running afoul of it. And you're probably better off keeping index and blob data in separate files. Otherwise you couldn't put them on different volumes, and the data would break the caching of the index pages in subsystems that are not part of your program (hardware, OS and so forth).
I'd definitely go with a B+tree structure instead of watering down the index pages with data as in a normal B-tree. I'd also recommend using an indirection vector (Graefe has some interesting details there) in connection with length-prefixed keys. Treat the keys as raw bytes and keep all the collation/normalisation/upper-lower nonsense out of your core engine. Users can feed you UTF8 if they want - you don't want to have to care about that, trust me.
There is something to be said for using only suffix truncation in internal nodes (i.e. for distinguishing between 'John Smith' and 'Lucky Luke', 'K' or 'L' work just as well as the given keys) and only prefix truncation in leaves (i.e. instead of 'John Smith' and 'John Smythe' you store 'John Smith' and 7+'ythe').
It simplifies the implementation, and gives you most of the bang that could be got. I.e. shared prefixes tend to be very common at the leaf level (between neighbouring records in index order) but not so much in internal nodes, i.e. at higher index levels. Conversely, the leaves need to store the full keys anyway and so there's nothing to truncate and throw away there, but internal nodes only need to route traffic and you can fit a lot more truncated keys in a page than non-truncated ones.
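A sketch of the leaf-level encoding - each key stored as (shared prefix length, remaining suffix) relative to its predecessor in index order; the names are illustrative:

    public class PrefixTruncation {
        // Length of the byte prefix a key shares with its predecessor in index order.
        static int sharedPrefixLength(byte[] prev, byte[] key) {
            int n = Math.min(prev.length, key.length);
            int i = 0;
            while (i < n && prev[i] == key[i]) i++;
            return i;
        }

        public static void main(String[] args) {
            byte[] prev = "John Smith".getBytes();
            byte[] key  = "John Smythe".getBytes();
            int skip = sharedPrefixLength(prev, key);               // 7
            String rest = new String(key, skip, key.length - skip); // "ythe"
            System.out.println(skip + "+'" + rest + "'");           // prints 7+'ythe'
        }
    }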
Key matching against a page full of prefix-truncated keys is extremely efficient - on average you compare a lot less than one character per key - but it's still a linear scan, even with all the hopping forward based on skip counts. This limits effective page sizes somewhat, since binary search is more complicated in the face of truncated keys. Graefe has a lot of details on that. One workaround for enabling bigger node sizes (many thousands of keys instead of hundreds) is to lay out the node like a mini B-tree with two or three levels. It can make things lightning-fast (especially if you respect magic thresholds like 64-byte cache line size), but it also makes the code hugely more complicated.
I'd go with a simple lean and mean design (similar in scope to IDA's key/value store), or use an existing product/library, unless you are in search of a new hobby...
In some previous posts I have asked questions about coding a custom hash map/table in Java. Now, since I couldn't solve it and may have forgotten to properly mention what I really want, I am summarizing all of it here to make it clear and precise.
What I am going to do:
I am trying to write code for our server in which I have to find the access type of users by URL.
Now, I have 1110 million URLs (approximately).
So, what we did,
1) Divided the database into 10 parts, each of 110 million URLs.
2) Built a HashMap using parallel arrays whose keys are one part of the URL (represented as a LONG) and whose values are the other part of the URL (represented as an INT) - a key can have multiple values.
3) Then, every day when the system starts, search the HashMap for some other URLs (millions of URLs saved in one day).
What I have tried:
1) I have tried many NoSQL databases; however, we found them not so good for our purpose.
2) I have built our custom hashmap (using two parallel arrays) for that purpose.
So, what the issue is:
When the system starts, we have to load our hashtable from each database and perform searches for millions of URLs:
Now, issue is,
1) Though the HashTable search performance is quite good, the code takes a long time loading the HashTable (we are using a FileChannel and a memory-mapped buffer to load it, which takes 20 seconds per HashTable of 220 million entries - with a load factor of 0.5, which we found fastest).
So we are spending: (HashTable load + HashTable search) * number of DBs = (20 + 5) * 10 = 250 seconds, which is quite expensive for us, and most of the time (200 out of 250 sec) goes into loading the hashtables.
Have you thought of any other way:
One way can be:
Leave caching to the operating system by using a memory-mapped buffer, without worrying about loading and storing. But since I have to search for millions of keys, this gives worse performance than the above.
As we found that HashTable search performance is good but the loading time is high, we thought to cut it down in another way:
1) Create an array of linked lists of size Integer.MAX_VALUE (my own custom linked list).
2) Insert values (ints) into the linked list whose index number is the key number (we reduce the key size to an INT).
3) Then we have to store only the linked lists to disk.
Now the issue is that it takes a lot of time to create such a number of linked lists, and creating so many linked lists makes no sense if the data is not well distributed.
So, what are your requirements:
Simply, my requirements:
1) Keys with multiple values - insertion and searching. Looking for good search performance.
2) A fast way to load the table (especially) into memory.
(Keys are 64-bit INTs and values are 32-bit INTs; one key can have at most 2-3 values. We could also make our keys 32-bit, but that would give more collisions - acceptable for us, if it makes things better.)
Can anyone help me solve this, or offer any comments on how to approach this issue?
Thanks.
NB:
1) As per previous suggestions on Stack Overflow, pre-reading the data for disk caching is not possible, because our application starts working as soon as the system starts, and the system is restarted each day.
2) We have not found that NoSQL DBs scale well for our requirements, which are simple (just inserting hashtable key/value pairs, then loading and searching (retrieving values)).
3) As our application is part of a small project to be deployed on a small campus, I don't think anybody will buy me an SSD for it. That is my limitation.
4) We also tried Guava/Trove, but they are not able to store such a large amount of data in 16 GB either (we are using a 32 GB Ubuntu server).
If you need quick access to 1110 million data items then hashing is the way to go. But don't reinvent the wheel; use something like:
memcacheDB: http://memcachedb.org
MongoDB: http://www.mongodb.org
Cassandra: http://cassandra.apache.org
It seems to me (if I understand your problem correctly) that you are trying to approach the problem in a convoluted manner.
I mean, the data you are trying to pre-load is huge to begin with (say 220 million * 64 bytes ~ 14 GB), and you are trying to memory-map it, etc.
I think this is a typical problem that is solved by distributing the load across different machines. That is, instead of trying to locate a linked list index, you should figure out which machine a specific part of the map has been loaded on and get the value from that machine (each machine has loaded part of this database map, and you get the data from the appropriate part of the map, i.e. machine, each time).
Maybe I am way off here but I also suspect you are using a 32bit machine.
So if you have to stay with a single-machine architecture, and it is not economically possible to improve your hardware (a 64-bit machine and more RAM, or an SSD, as you point out), I don't think you can make any dramatic improvement.
I don't really understand in what form you are storing the data on disk. If what you are storing consists of URLs and some numbers, you might be able to speed up loading from disk quite a bit by compressing the data (unless you are already doing that).
Creating a multithreaded loader that decompresses while loading might be able to give you quite a big boost.
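A single-threaded sketch of the compressed-load idea (the file layout and all names are assumptions; splitting the data into independently compressed chunks would let several threads decompress in parallel):

    import java.io.BufferedInputStream;
    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;

    public class CompressedTableLoader {
        // Loads the two parallel arrays (long keys, int values) from a gzipped file.
        // Reading fewer bytes from disk can outweigh the decompression cost.
        public static long[] keys;
        public static int[] values;

        public static void load(String path, int entryCount) throws IOException {
            try (DataInputStream in = new DataInputStream(new GZIPInputStream(
                    new BufferedInputStream(new FileInputStream(path)), 1 << 16))) {
                keys = new long[entryCount];
                values = new int[entryCount];
                for (int i = 0; i < entryCount; i++) {
                    keys[i] = in.readLong();
                    values[i] = in.readInt();
                }
            }
        }
    }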
I have a small application with an embedded database. Sometimes I get truncation errors when trying to insert varchars which exceed the maximum size of the corresponding database column.
I wish to detect this before insert/updating and show a correct message to the user.
Now I presume that there are two possibilities to achieve this.
Get the maximum length of the column of interest through the DatabaseMetaData object. You could reduce the performance hit by caching the values in a singleton or a similar construction.
Keep the maximum lengths in the Java code (e.g. in a ResourceBundle or Properties file) and check against these values. The downside, of course, is that the Java code and the database must be kept in sync. This is error-prone.
What would be the best approach?
The only answer that won't require maintenance is getting the maximum length of the column of interest at database connect time.
If you use Integer.valueOf(...) you can store each length as an object; the lower values (according to the current JVM spec) are backed by a shared cache anyway. This avoids a lot of memory overhead, as all the columns will eventually refer to the few unique length values you likely have in your database.
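A sketch of the connect-time lookup using standard JDBC metadata calls (the class and method names here are illustrative, and note that COLUMN_SIZE may count characters or bytes depending on the driver):

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.HashMap;
    import java.util.Map;

    public class ColumnLimits {
        private final Map<String, Integer> maxLengths = new HashMap<>();

        // Read COLUMN_SIZE once per table at connect time and cache it.
        public ColumnLimits(Connection conn, String table) throws SQLException {
            DatabaseMetaData meta = conn.getMetaData();
            try (ResultSet rs = meta.getColumns(null, null, table, "%")) {
                while (rs.next()) {
                    maxLengths.put(rs.getString("COLUMN_NAME"), rs.getInt("COLUMN_SIZE"));
                }
            }
        }

        public boolean fits(String column, String value) {
            Integer max = maxLengths.get(column);
            return max == null || value.length() <= max;
        }
    }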
Also, digging around in the DatabaseMetaData, I would look for any flags that indicate that columns would be truncated upon larger than data inserts. It may provide the switch to know if your code is needed.
By putting the values in a property file, you ease the detection of the issue, but at the cost of possibly getting them out of sync. Such techniques are effectively quick implementations with little up-front cost, but they create latent issues. Whether the issue ever gets raised will be unknown, but given enough time, even the remote possibilities are encountered.
A combination of both approaches: at application build time, use DatabaseMetaData to dynamically create the resource bundle.
One solution would be to use a CLOB. I don't know what other requirements you have for this field.
Also, use the smallest max character value you have as a constant in the Java code. This avoids the need to stay in sync with or depend on the database, and the limit is more or less arbitrary anyway. Users don't care what the max size is; they just need to be told when they exceed it, or be kept from making the error automatically.
Suppose we have a JDO entity that uses an unencoded string as the PrimaryKey. Is there some practical limit on the size that this unencoded string could be? Specifically, I'm wondering if I could use a String that is extremely large, e.g. 500+ KB in size.
I understand the app engine quotas on in-memory object size (1MB) and datastore entity size (32MB), I'm wondering about the key field itself. Before you start ripping me for bad design and telling me to use entity relationships, this is a theoretical question, and is something that I don't intend to abuse.
In Python a key name is limited to 500 characters. The limit should be pretty easy to test in Java as well.
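A rough probe using the low-level datastore API could look like the sketch below (a sketch only: the kind and names are illustrative, and the exact exception type thrown for an over-long name is an assumption):

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;

    public class KeyLimitProbe {
        public static int probe() {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
            StringBuilder name = new StringBuilder("x");
            int lastOk = 0;
            try {
                while (true) {
                    ds.put(new Entity("Probe", name.toString()));
                    lastOk = name.length();
                    name.append('x'); // grow by one character and retry
                }
            } catch (RuntimeException e) {
                return lastOk;        // longest key name the backend accepted
            }
        }
    }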
Having a super long key name is not a good idea though. It would cause your indexes to consume a lot more space and probably increase write overhead.
See How Entities and Indexes are Stored for more details.