I want to generate unique keys automatically in Solr. I checked the default function here, but it generates ids like 1cdee8b4-c42d-4101-8301-4dc350a4d522. In my application, I need unique auto-increment numbers like the ones MySQL produces. What should the approach be? SolrJ pointers would be very helpful.
Another solution (a hack) that I've implemented is to keep a counter record inside the existing Solr schema. For example, if your schema has two string fields, one can hold the literal value MAX_VALUE and the other can hold the actual maximum integer, stored as a string. Every time you add a document, you query for "fieldname:MAX_VALUE", read the string value from the other field of that document, parse it, and add 1. You then update the existing MAX_VALUE document. It's not the most elegant solution, but it works, and it keeps your maximum number inside your index rather than in another application.
It's also SolrJ-friendly, since both the lookup query and the update are fairly straightforward to write.
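For illustration, a rough SolrJ sketch of that counter-document approach. The field names (fieldname, max_value_s, id) and the core URL are assumptions, not anything prescribed by Solr:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    // Reads the counter document, increments it, writes it back, and returns the
    // next id to use. There is no concurrency control here; two clients doing this
    // at the same time could race.
    static long nextId(SolrClient solr) throws Exception {
        SolrDocument counter = solr.query(new SolrQuery("fieldname:MAX_VALUE"))
                                   .getResults().get(0);
        long next = Long.parseLong((String) counter.getFieldValue("max_value_s")) + 1;

        SolrInputDocument update = new SolrInputDocument();
        update.addField("id", counter.getFieldValue("id"));   // same unique key overwrites the old doc
        update.addField("fieldname", "MAX_VALUE");
        update.addField("max_value_s", Long.toString(next));
        solr.add(update);
        solr.commit();
        return next;
    }

    // Usage: SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();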
Consider this:
I have a database with 10 rows.
Each row has a unique id (int) paired with some value e.g. name (varchar).
These ids are incremented from 1 to 10.
I delete 2 of the records - 2 and 8.
I add 2 more records, which get ids 11 and 12.
Questions:
Is there a good way to redistribute the unique ids in this database so they run from 1 to 10 again?
Would this be considered bad practice ?
I ask this question because, after some use of this database (adding and deleting values), the ids would drift far from a contiguous sequence.
One way to approach this would be to just generate the row numbers you want at the time you actually query, something like this:
SET @rn := 0;
SELECT
  (@rn := @rn + 1) AS rn, name
FROM yourTable
ORDER BY id;
Generally speaking, you should not worry about the auto-increment values that MySQL assigns. MySQL will make sure the values are unique without your intervention.
If you set the ID column to be the primary key with auto-increment, "resetting" is not really necessary, because MySQL will keep assigning unique IDs anyway.
If the thing that bothers you is the "gaps" among the existing values, then you might resort to "soft deletion" by adding an is_deleted column with bit/boolean values, defaulting to 0 (or b'0'). In fact, soft deletion is advisable if there is important data that might be useful later on, especially where payment-related entries are involved that a user might delete either by accident or deliberately.
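A minimal JDBC sketch of what soft deletion looks like in practice; the users table and is_deleted column are assumed names:

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    // Flag the row instead of removing it; readers then filter on the flag,
    // e.g. SELECT id, name FROM users WHERE is_deleted = 0.
    static void softDelete(Connection conn, long id) throws Exception {
        try (PreparedStatement ps =
                 conn.prepareStatement("UPDATE users SET is_deleted = 1 WHERE id = ?")) {
            ps.setLong(1, id);
            ps.executeUpdate();
        }
    }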
There is no simple way to delete one value and re-arrange the remaining IDs so that the sequence is preserved. A workaround might be the following steps (sketched in code after the list):
DELETE the entry first, i.e. delete from <table> where id = _value
INSERT INTO ... SELECT (without the id column) into a backup table. Please note that the backup table needs to be identical in terms of columns and types for this query to work properly; you can also use a temporary table as the backup_table, i.e. insert into <backup_table> select <column1, column2, ...> from <table>
TRUNCATE your table, i.e. truncate table <table>
Copy the values from the temp table back into the existing table. You can use INSERT INTO ... SELECT once again, but make sure to drop the temp table at the end.
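Purely for illustration, a JDBC sketch of those steps under an assumed users(id, name) table with an AUTO_INCREMENT id (again, not recommended, as explained below):

    import java.sql.Connection;
    import java.sql.Statement;

    static void renumber(Connection conn) throws Exception {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("DELETE FROM users WHERE id = 8");
            // Back up everything except the id column.
            st.executeUpdate("CREATE TEMPORARY TABLE users_backup AS SELECT name FROM users");
            // TRUNCATE empties the table and (in MySQL) resets the AUTO_INCREMENT counter.
            st.executeUpdate("TRUNCATE TABLE users");
            // Re-inserting without ids lets AUTO_INCREMENT hand out a fresh 1..N sequence.
            st.executeUpdate("INSERT INTO users (name) SELECT name FROM users_backup");
            st.executeUpdate("DROP TEMPORARY TABLE users_backup");
        }
    }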
Please note that I would NOT advise you to do this, mainly because most applications use some form of caching and rely on specific ways of deciding whether two objects represent the same record.
I.e. in Java, the equals() and hashCode() methods of POJOs are overridden, and programmers generally rely on IDs as a permanent way of identifying a specific object. By using the method above, you essentially break that whole concept, and I would not advise changing an object's auto-increment ID value for this reason before anything else.
Essentially, what you want to do is an anti-pattern: it turns common patterns and practices used by experienced programmers into solutions that are prone to unexpected issues and/or failures. This especially applies when advanced features are involved, such as running the application on a Galera cluster or with application-level caching.
I have a question regarding UUID generation.
Typically, when I'm generating a UUID I will use a random or time based generation method.
HOWEVER, I'm migrating legacy data from MySQL over to a C* datastore, and I need to change the legacy (auto-incrementing) integer IDs to UUIDs. Instead of creating another denormalized table with the legacy integer IDs as the primary key and all the data duplicated, I was wondering what folks thought about padding 0's onto the front of the integer ID to form a UUID. Example below.
*Something important to note is that the legacy IDs' highest values will never top 1 million, so overflow isn't really an issue.
The idea would look like this:
Legacy ID: 123456 ---> UUID: 00000000-0000-0000-0000-000000123456
This would be done using some string concatenation and the UUID.fromString("00000000-0000-0000-0000-000000123456") method.
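For concreteness, a small sketch of that conversion; the 12-digit padding assumes the id stays below 10^12, which holds given the 1 million cap above:

    import java.util.UUID;

    // Packs the decimal digits of a legacy id into the last group of an otherwise-zero UUID.
    static UUID fromLegacyId(long legacyId) {
        return UUID.fromString(String.format("00000000-0000-0000-0000-%012d", legacyId));
    }

    // fromLegacyId(123456).toString() -> "00000000-0000-0000-0000-000000123456"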
Does this seem like a bad pattern to anyone? I'm not a huge fan of the idea; it leaves a bad taste in my mouth, but I don't have a technical reason why.
As far as collisions go, the probability of a collision occurring is still ridiculously low, so I'm not worried about increasing collisions. I suppose it just seems like bad practice to me, that it's "too easy".
We faced the same kind of issue before when migrating from Oracle with ids generated by sequence to Cassandra with generated UUIDs.
We had to design a type that supports both old data coming from Oracle (type long) and new data (type uuid).
The obvious solution is to use type blob to store the id. A blob can encode either a long or a uuid.
This solution only works for the partition key, because you query it using =. It won't work for clustering columns used with operators like > or <, because those require an ordering on the value.
There was a small objection at the time: using a blob to store the id makes it opaque to the user. For example, in cqlsh, when you're doing a SELECT and need to provide the id, how would you construct the blob?
Fortunately, the native functions of CQL bigIntAsBlob(), blobAsBigInt(), uuidAsBlob() and blobAsUUID() come in very handy.
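The CQL functions above do the conversion server-side; if you ever need to build the same blob values client-side (for example with the Java driver), a sketch of the two encodings could look like this:

    import java.nio.ByteBuffer;
    import java.util.UUID;

    // A legacy long id becomes an 8-byte blob...
    static ByteBuffer longToBlob(long id) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
        buf.putLong(id);
        buf.flip();
        return buf;
    }

    // ...and a UUID becomes a 16-byte blob, so both fit the same blob column.
    static ByteBuffer uuidToBlob(UUID uuid) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(uuid.getMostSignificantBits());
        buf.putLong(uuid.getLeastSignificantBits());
        buf.flip();
        return buf;
    }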
I've decided to go in a different direction from doanduyhai's answer.
In order to maintain data consistency, we decided to fully de-normalize the data and create another table in C* that is keyed on our legacy IDs. When migrating the objects from our legacy into C*, they are assigned a new randomly generated UUID, which will be their new primary ID for the future. The legacy IDs will be kept around until such a time that we decide they are no longer needed. Upon that time, we can cleanly drop the legacy ID table and be done with them.
This solution allowed for a cleaner break from our legacy ID system in the future and prevented the use of strange custom-made UUIDs. I also wasn't a huge fan of having the ID field be a blob type that could hold multiple kinds of data, since in the future we only want UUIDs to be stored there.
I have a HashMap of key/value pairs. The keys are the column names of a table. Now I want to insert the values into that table, say users_table: I should be able to match each key to a column name and, if they are the same, insert the corresponding value into the table.
What I'm doing now is writing a PreparedStatement that lists all the columns and then passing the HashMap values as parameters using the PreparedStatement setter methods.
To do this I need to know all the columns of the table, which is tedious, since there can be many columns and the step has to be repeated for many tables.
Is there a better way of doing this? Thanks in advance.
First off, use a LinkedHashMap to preserve the order of the columns. This will make a difference when iterating over the map to assign column names and then values.
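A rough sketch of the idea, assuming the map's keys really are trusted column names (they are concatenated into the SQL, so they must be validated or whitelisted, since column names cannot be bound as parameters):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.LinkedHashMap;
    import java.util.StringJoiner;

    // Builds "INSERT INTO users_table (col1, col2, ...) VALUES (?, ?, ...)" from the
    // map's keys, then binds the values in the same insertion order.
    static void insertRow(Connection conn, String table,
                          LinkedHashMap<String, Object> row) throws Exception {
        StringJoiner cols = new StringJoiner(", ");
        StringJoiner marks = new StringJoiner(", ");
        for (String col : row.keySet()) {
            cols.add(col);
            marks.add("?");
        }
        String sql = "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int i = 1;
            for (Object value : row.values()) {
                ps.setObject(i++, value);
            }
            ps.executeUpdate();
        }
    }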
I'm not entirely sure what you're asking, but you're hinting at what is called Object Relational Mapping (ORM). Simply put, it's a way to map database tables to plain old Java objects (POJO). Though there's a lot more to it than that.
If you're interested in representing your database tables as objects, you should look into Hibernate, which is a popular Java ORM API.
Otherwise, create and keep to a standard that is uniform across both your database and your Java project and you'll be fine.
Edit:
If I understand your question a little more, you're having issues with knowing the names of the columns? This is something you have to know; there's not going to be an easy, dynamic, or efficient way of getting that information.
One example of setting that information is storing the column names in a String array of a class that represents your table. You can then access the array and iterate over it when saving to a database.
And finally, if you feel like doing some reading, check out my answer (Store nested Pojo Objects as individuall Objects in Database). I go quite in-depth on how I manage Database to Java and vice versa.
I need to generate an encoding String for each item I insert into the database, for example:
x00001 for the first item
x00002 for the second item
x00003 for the third item
The way I chose to do this is by counting the rows. Before I insert the third item, I count against the database; since there are already 2 rows, the next encoding ends with 3.
But there is a problem. If I delete the second item, the fourth item will not be x00004 but x00003.
I could add an additional column to the table to store the next encoding, but I don't know whether there are better solutions.
Most databases support some sort of auto incrementing identity field. This field is normally also setup to be unique, so duplicate ids do not occur.
Consult your database documentation to see how it is done in your database and use that - don't reinvent the wheel when you have a good mechanism in place already.
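For example, with JDBC you can let the database assign the id and read it back afterwards; the items table here is an assumed name:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Insert without an id; the AUTO_INCREMENT (or identity) column fills it in,
    // and getGeneratedKeys() hands it back to the application.
    static long insertItem(Connection conn, String name) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO items (name) VALUES (?)", Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, name);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                return keys.getLong(1);
            }
        }
    }

    // The returned id can then be formatted for display, e.g. String.format("x%05d", id).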
What you want is SELECT MAX(id) or SELECT MAX(some_function(id)) inside the transaction.
As suggested in Oded's answer, a lot of databases have their own mechanisms for providing sequences; these are more efficient and, depending on the DBMS, might support non-numeric ids.
You could also break the id into the prefix (x in your example) and 00001 as separate columns and make both columns the primary key; then most databases would be able to provide the sequence for the numeric part.
However, this raises the question of whether your primary key should have meaning or not; the prefix suggests that part of the key is meaningful (otherwise you would be content with a plain integer id).
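A sketch of the SELECT MAX approach inside a transaction, with assumed table and column names; FOR UPDATE is included so that two concurrent inserts don't read the same maximum (supported by e.g. MySQL/InnoDB):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    static String nextCode(Connection conn) throws Exception {
        conn.setAutoCommit(false);
        long next;
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT MAX(id) FROM items FOR UPDATE")) {
            rs.next();
            next = rs.getLong(1) + 1;   // MAX of an empty table reads as 0, so next = 1
        }
        String code = String.format("x%05d", next);
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO items (id, code) VALUES (?, ?)")) {
            ps.setLong(1, next);
            ps.setString(2, code);
            ps.executeUpdate();
        }
        conn.commit();
        return code;
    }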
Let's say you have a large text file. Each row contains an email id and some other information (say a product id). Assume there are millions of rows in the file. You have to load this data into a database. How would you efficiently de-dup the data (i.e. eliminate duplicates)?
Insane number of rows
Use a Map/Reduce framework (e.g. Hadoop). This is full-blown distributed computing, so it's overkill unless you have TBs of data. (j/k :))
Unable to fit all rows in memory
Even the result won't fit: use merge sort, persisting intermediate data to disk. As you merge, you can discard duplicates (this sample probably helps). This can be multi-threaded if you want.
The results will fit: instead of reading everything into memory and then putting it in a HashSet (see below), you can use a line iterator or similar and keep adding to the HashSet. You can use a ConcurrentHashMap and more than one thread to read files and add to this map. Another multi-threaded option is a ConcurrentSkipListSet; in that case, you implement compareTo() instead of equals()/hashCode() (compareTo() == 0 means duplicate) and keep adding to the SortedSet.
Fits in memory
Design an object that holds your data, implement a good equals()/hashCode() method and put them all in a HashSet.
Or use the methods given above (you probably don't want to persist to disk though).
Oh, and if I were you, I would put a unique constraint on the DB anyway...
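A minimal sketch of the fits-in-memory case; the tab-separated "email<TAB>productId" line layout is an assumption about the input file:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashSet;
    import java.util.Set;

    // Streams the file line by line; HashSet.add() returns false for an email id
    // that has already been seen, so duplicate rows are simply skipped.
    static void dedupe(Path input, Path output) throws IOException {
        Set<String> seenEmails = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(input);
             BufferedWriter writer = Files.newBufferedWriter(output)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String email = line.split("\t", 2)[0];
                if (seenEmails.add(email)) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }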
I will start with the obvious answer: make a HashMap, put the email id in as the key and the rest of the information in as the value (or make an object to hold all the information). When you get to a new line, check whether the key exists; if it does, move to the next line. At the end, write out all your SQL statements using the HashMap. I do agree with eqbridges that memory constraints will be important if you have a "gazillion" rows.
You have two options:
do it in Java: you could put together something like a HashSet for duplicate testing, adding the email id of each incoming item if it doesn't already exist in the set.
do it in the database: put a unique constraint on the table, such that dups will not be added to the table. An added bonus to this is that you can repeat the process and remove dups from previous runs.
Take a look at Duke (https://github.com/larsga/Duke), a fast dedupe and record-linkage engine written in Java. It uses Lucene to index the data and reduce the number of comparisons (avoiding the unacceptable Cartesian-product comparison). It supports the most common algorithms (edit distance, Jaro-Winkler, etc.) and it is extremely extensible and configurable.
Can you not index the table by email and product ID? Then reading by index should make duplicates of either email or email+prodId readily identified via sequential reads and simply matching the previous record.
Your problem can be solved with an Extract, Transform, Load (ETL) approach:
You load your data in an import schema;
Do every transformation you like on the data;
Then load it into the target database schema.
You can do this manually or use an ETL tool.