Mongo DuplicateKey error despite no overlap

I have a well-logged pool of several Java servers behind an F5 load balancer (professionally managed; it is not sending a request to more than one host), running Tomcat with my application installed and connecting to a sharded MongoDB cluster. I'm using a base64-encoded SHA-1 hash of the primary natural key as the _id. When a new record is to be created, I do a pretty basic check-then-insert:
BasicDBObject query = new BasicDBObject();
query.put("userId", userId);
query.put("_id", id);
DBObject user = getUsersCollection().findOne(query);
if (user == null) {
    getUsersCollection().insert(new UserObject(userId));
}
This is simplified. In fact there are multiple checks for the pre-existence of this user, including one which should throw a custom exception, and none are triggered. The traffic logs indicate a single incoming create request, and here's an example of what happens:
2014-01-19 20:03:45,167 [http-bio-7950-exec-827]:[...] : ERROR FATAL [...] - Internal server error
[...]: com.mongodb.MongoException$DuplicateKey: { "serverUsed" : "[...]" , "singleShard" : "replicaset_2/host1:27017,host2:27017,host3:27017" , "err" : "E11000 duplicate key error index: Users.$_id_ dup key: { : \"HASH\" }" , "code" : 11000 , "n" : 0 , "lastOp" : { "$ts" : 1390190614 , "$inc" : 1} , "connectionId" : 335764 , "ok" : 1.0}
Yet in my Users collection the record has been created:
db.Users.findOne({_id:"HASH"}):
{
    "_id" : "HASH",
    "createDate" : ISODate("2014-01-20T04:03:45.161Z"),
    ...
}
I'm including this because the timestamps matter. We have a timezone issue, but that aside, I interpret the 6 ms difference as clock skew between the Mongo cluster and my application servers. There is no other record of this incoming traffic (and it is logged as it bounces from server to server, so anything else would show up). So I am 99.999% confident that my SINGLE LEGITIMATE insert call is both inserting and throwing an error.
Any theories as to how/why this is happening would be greatly appreciated. I'll run tracers and examples if needed to answer questions with more information.

You are searching for a user using both the _id and userId fields. Try commenting out this line: query.put("_id", id);.
It's not clear from your code where the Java variable userId comes from. It's also not clear how UserObject sets an _id, if at all.
Overall, it looks like the way you search for a user and the way you create one do not match, i.e. they disagree on what defines the unique key for that user.
One fix could be to replace these lines:
query.put("userId", userId);
query.put("_id", id);
with:
query.put("_id", userId);
This makes the _id field your userId.
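If you would rather keep the hashed _id described in the question, one way to make sure the lookup and the insert agree is to derive the _id in a single place. The following is only a sketch: the class name UserIds, the method idFor, and the choice of UTF-8 are assumptions, not something taken from the original code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical helper: derives the document _id from the natural key in one place,
// so the lookup and the insert cannot disagree about what identifies a user.
final class UserIds {
    private UserIds() {}

    // Returns the base64-encoded SHA-1 hash of the given natural key (UTF-8 is assumed).
    static String idFor(String userId) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(userId.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on the Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }
}
```

Both the query (query.put("_id", UserIds.idFor(userId))) and whatever sets _id on the new document would then call the same method, so they cannot drift apart.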

BigQuery: 404 "Table is truncated." when insert right after truncate

I truncate my table by executing a queryJob as described here: https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries
"truncate table " + PROJECT_ID + "." + datasetName + "." + tableName;
I wait until the job finishes via
queryJob = queryJob.waitFor();
Truncate works fine.
However, if I do an insert right after the truncate operation via
InsertAllResponse response = table.insert(rows);
it results in a
com.google.cloud.bigquery.BigQueryException: Table is truncated.
with the following log:
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
POST https://www.googleapis.com/bigquery/v2/projects/[MYPROJECTID]/datasets/[MYDATASET]/tables/[MYTABLE]/insertAll?prettyPrint=false
{
  "code" : 404,
  "errors" : [ {
    "domain" : "global",
    "message" : "Table is truncated.",
    "reason" : "notFound"
  } ],
  "message" : "Table is truncated.",
  "status" : "NOT_FOUND"
}
Sometimes I even have to wait more than 5 minutes between the truncate and the insert.
I would like to check periodically whether my table is still in the "Table is truncated." state, until this state is gone.
How can I query the BigQuery API to check whether the table is ready for inserts?
How can I query the BigQuery API to get the status of the table?
Edit
An example to reproduce the issue can be found here

If a table is truncated while a streaming pipeline is still running, or if you perform a streaming insert into a recently truncated table, you may receive errors like the one in the question ("Table is truncated."). That is expected behavior. The metadata consistency mode for InsertAll (a very high QPS API) is eventual consistency, which means that InsertAll may see stale table metadata and return a failure such as "Table is truncated." The typical way to resolve this is to back off and retry.
Currently, there is no option in the BigQuery API to check if the table is in truncated state or not.
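A back-off-and-retry loop could be sketched roughly as below. This is a generic helper, not part of the BigQuery client library; the class name Backoff, its parameters, and the doubling schedule are all assumptions. It also assumes the failing call surfaces as an unchecked exception, as the BigQueryException in the question appears to.

```java
import java.util.function.Supplier;

// Generic exponential back-off helper (hypothetical; not a BigQuery API).
final class Backoff {
    private Backoff() {}

    // Runs the action until it succeeds or maxAttempts is reached,
    // sleeping between attempts and doubling the delay each time.
    static <T> T retry(Supplier<T> action, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure in case this was the final attempt
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while backing off", ie);
                }
                delay *= 2;
            }
        }
        throw last; // every attempt failed; rethrow the most recent failure
    }
}
```

With the code from the question this might be used as Backoff.retry(() -> table.insert(rows), 8, 1000). In practice you would probably want to retry only when the message is "Table is truncated." and let other errors propagate immediately.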

Unfortunately, the API does not (yet?) provide an endpoint to check the truncated state of the table.
To avoid this issue, one can use a load job via Google Cloud Storage.
It looks like the load job respects this state, as I have had no issues running truncate/load multiple times in a row.
public void load(String datasetName, String tableName, String sourceUri) throws InterruptedException {
    Table table = getTable(datasetName, tableName);
    Job job = table.load(FormatOptions.json(), sourceUri);
    // Wait for the job to complete
    Job completedJob = job.waitFor(RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
            RetryOption.totalTimeout(Duration.ofMinutes(3)));
    if (completedJob != null && completedJob.getStatus().getError() == null) {
        // Job completed successfully
    } else {
        // Handle error case
    }
}

Failed to make bulk upsert using mongo

I'm trying to do an upsert using the MongoDB driver; here is the code:
BulkWriteOperation builder = coll.initializeUnorderedBulkOperation();
DBObject toDBObject;
for (T entity : entities) {
    toDBObject = morphia.toDBObject(entity);
    builder.find(toDBObject).upsert().replaceOne(toDBObject);
}
BulkWriteResult result = builder.execute();
where "entity" is a Morphia object. When I run the code the first time (there are no entities in the DB, so all of the operations should be inserts), it works fine and I see the entities in the database with a generated _id field. On the second run I change some fields and try to save the changed entities, and then I receive the following error from Mongo:
E11000 duplicate key error collection: statistics.counters index: _id_ dup key: { : ObjectId('56adfbf43d801b870e63be29') }
What did I forget to configure in my example?

I don't know the structure of dbObject, but that bulk Upsert needs a valid query in order to work.
Let's say, for example, that you have a unique (_id) property called "id". A valid query would look like:
builder.find({id: toDBObject.id}).upsert().replaceOne(toDBObject);
This way, the engine can (a) find an object to update and then (b) update it (or, insert if the object wasn't found). Of course, you need the Java syntax for find, but same rule applies: make sure your .find will find something, then do an update.
I believe (just a guess) that the way it's written now will find "all" docs and try to update the first one ... but the behavior you are describing suggests it's finding "no doc" and attempting an insert.
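One plausible reading of the error is that the filter is the whole document, so once a field changes nothing matches, the upsert falls back to an insert, and the insert collides on _id. The toy model below illustrates that reasoning only. It is NOT the MongoDB driver: ToyCollection and upsertReplace are made-up names, and it assumes every document carries an "_id". With the legacy driver from the question, filtering on the key alone would presumably look like builder.find(new BasicDBObject("_id", toDBObject.get("_id"))).upsert().replaceOne(toDBObject);.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Toy in-memory model of "find(filter).upsert().replaceOne(doc)"; NOT the MongoDB driver.
final class ToyCollection {
    private final List<Map<String, Object>> docs = new ArrayList<>();

    // Replaces the first document that matches every filter field,
    // or inserts the replacement if no document matches.
    void upsertReplace(Map<String, Object> filter, Map<String, Object> replacement) {
        for (int i = 0; i < docs.size(); i++) {
            if (docs.get(i).entrySet().containsAll(filter.entrySet())) {
                docs.set(i, new HashMap<>(replacement));
                return;
            }
        }
        // No match: fall back to an insert, which enforces a unique "_id".
        for (Map<String, Object> doc : docs) {
            if (Objects.equals(doc.get("_id"), replacement.get("_id"))) {
                throw new IllegalStateException("E11000 duplicate key: " + replacement.get("_id"));
            }
        }
        docs.add(new HashMap<>(replacement));
    }

    int size() {
        return docs.size();
    }
}
```

In this model, calling upsertReplace(changedDoc, changedDoc) on the second run throws the duplicate key error, while a filter that contains only "_id" finds the stored document and replaces it.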

DynamoDB's withLimit clause with DynamoDBMapper.query

I am using DynamoDBMapper for a class, let's say "User" (username being the primary key), which has a field called "Status". It is a Hash+Range key table, and every time a user's status changes (changes are extremely infrequent), we add a new entry to the table along with the timestamp (which is the range key). To fetch the current status, this is what I am doing:
DynamoDBQueryExpression expr =
    new DynamoDBQueryExpression(new AttributeValue().withS(userName))
        .withScanIndexForward(false).withLimit(1);
PaginatedQueryList<User> result =
    this.getMapper().query(User.class, expr);
if (result == null || result.size() == 0) {
    return null;
}
for (final User user : result) {
    System.out.println(user.getStatus());
}
For some reason, this prints all the statuses the user has had so far. I have set scanIndexForward to false so that the results are in descending order, and I set a limit of 1. I expect this to return the single latest entry in the table for that username.
However, when I look at the wire logs, I see a huge number of entries being returned, many more than 1. For now, I am using:
final String currentStatus = result.get(0).getStatus();
What I am trying to understand is: what is the whole point of the withLimit clause in this case, or am I doing something wrong?

In March 2013 a user on the AWS forums complained about the same problem.
A representative from Amazon pointed him to the queryPage function.
It seems the limit does not cap the total number of elements returned; rather, it caps the chunk of elements retrieved in a single API call, so queryPage might help.
You could also look into the pagination loading strategy configuration.
Also, you can always open a GitHub issue for the team.
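To make the per-call reading of the limit concrete, here is a toy model. It is NOT the AWS SDK; ToyPagedQuery and its methods are invented for illustration. In the real mapper, the single-page call would presumably be getMapper().queryPage(User.class, expr).getResults().

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a paginated query; NOT the AWS SDK. "limit" caps each page, not the total.
final class ToyPagedQuery {
    private final List<String> table; // items, assumed already sorted newest-first
    private final int limit;          // maximum items returned per simulated API call
    int apiCalls = 0;                 // number of simulated round trips so far

    ToyPagedQuery(List<String> table, int limit) {
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be >= 1");
        }
        this.table = table;
        this.limit = limit;
    }

    // One simulated API call: returns at most `limit` items starting at `offset`.
    List<String> queryPage(int offset) {
        apiCalls++;
        int end = Math.min(offset + limit, table.size());
        return new ArrayList<>(table.subList(offset, end));
    }

    // Mimics iterating a lazily loaded result list: keeps fetching pages until exhausted.
    List<String> queryAll() {
        List<String> all = new ArrayList<>();
        int offset = 0;
        while (offset < table.size()) {
            List<String> page = queryPage(offset);
            all.addAll(page);
            offset += page.size();
        }
        return all;
    }
}
```

With a limit of 1 and three stored statuses, queryPage(0) returns just the newest one in a single call, whereas queryAll() makes three calls and returns everything, which matches the behavior described in the question.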

Google BigQuery - query ran successfully but results not pushed to destination table

We run a nightly query against BigQuery via the Java REST API that specifies a destination table for the results to be pushed to (write disposition=WRITE_TRUNCATE). Today's query appeared to run without errors but the results were not pushed to the destination table.
This query has been running for a few weeks now and we've had no issues. No code changes were made either.
Manually running it a second time after it "failed" worked fine. It was just this one glitch that we spotted and we're concerned it may happen again.
Our logged JSON response from the "failed" query looks fine (I've obfuscated any sensitive data):
INFO: Job finished successfully: {
  "configuration" : {
    "dryRun" : false,
    "query" : {
      "createDisposition" : "CREATE_IF_NEEDED",
      "destinationTable" : {
        "datasetId" : "[REMOVED]",
        "projectId" : "[REMOVED]",
        "tableId" : "[REMOVED]"
      },
      "priority" : "INTERACTIVE",
      "query" : "[REMOVED]",
      "writeDisposition" : "WRITE_TRUNCATE"
    }
  },
  "etag" : "[REMOVED]",
  "id" : "[REMOVED]",
  "jobReference" : {
    "jobId" : "[REMOVED]",
    "projectId" : "[REMOVED]"
  },
  "kind" : "bigquery#job",
  "selfLink" : "[REMOVED]",
  "statistics" : {
    "creationTime" : "1390435780070",
    "endTime" : "1390435780769",
    "query" : {
      "cacheHit" : false,
      "totalBytesProcessed" : "12546"
    },
    "startTime" : "1390435780245",
    "totalBytesProcessed" : "12546"
  },
  "status" : {
    "state" : "DONE"
  }
}
Using the "try it!" for Jobs/GET here and plugging in the job id also shows the job was indeed successful and matches our logged output (pasted above).
Checking the web console shows the destination table has been truncated but not updated. Weirdly, the "Last Modified" has not been updated (I did try refreshing the page numerous times):
http://i.stack.imgur.com/384NL.png
Has anyone experienced this before with BigQuery - a query appearing to run successfully but if a destination/reference table was specified the results were not pushed yet the table was truncated?

I am a developer on the BigQuery team. I've looked up the details of your job from the breadcrumbs you left (your query was the only one that started at that start time).
It looks like your destination table was truncated at 4:09 pm today PST, which is the time your job ran, but it was left empty -- the query that truncated it didn't actually fill in any information.
I'm having a little bit of trouble piecing together the details, because one of the source tables appears to have been overwritten (the left table in your left outer join was created at 4:20 PM).
However, there is a clue in the "total bytes processed" field -- it says that the query only processed 12K of data. The internal statistics say that only 384 rows were involved in the query among both tables that were involved.
My guess is that the query legitimately returned 0 rows, so the table was cleared.
There is a bug in that deleting all of the data in a table doesn't update the last modified time. We use last modified to mean either the last time the metadata was updated (like description, schema, etc.) or the last time the table had data added to it. But if you just truncate the table, that doesn't update the metadata or add data, so we end up with a stale last-modified time.
If this doesn't sound like a reasonable chain of events, we'll need more information from you about how to debug it (especially since it looks like the tables involved have been modified since you ran this query), and a way that we can reproduce it would be great.

So, we figured out what the problem is with this. It failed again a few times over the last few days so we dug in further.
The query being executed is dependent on another query which is executed immediately before it. Although we do wait for the first query to finish (job status = "DONE"), it appears that behind the scenes it is not fully complete and its data is not yet available to be used.
Current process is:
1. Fetch data from another data source and stream the results into table A
2. When (1) is complete (poll job id and get status "DONE") submit another query which uses the results in table A to join on to create table B
3. Table A's data is not yet available so query from (2) results in an empty table
We've noticed it takes about 5-10 seconds for the data to actually appear and be available in BigQuery when using streaming for the first query.
We used a fairly ugly workaround - simply wait a few seconds after the first query before running the next one. Not exactly elegant but it works.
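A slightly less blind variant of the fixed sleep is to poll for readiness with a deadline. This swaps in a different technique from the one described above, and it is only a sketch: Poller and awaitReady are made-up names, and the readiness check itself (for example, a query that confirms table A has the expected rows) is something you would have to supply.

```java
import java.util.function.BooleanSupplier;

// Generic bounded polling helper (hypothetical); the readiness check is supplied by the caller.
final class Poller {
    private Poller() {}

    // Polls `ready` every intervalMillis until it returns true or timeoutMillis elapses.
    // Returns true if the condition was met, false on timeout or interruption.
    static boolean awaitReady(BooleanSupplier ready, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (ready.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // deadline passed without the condition becoming true
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The second query would then be submitted only after awaitReady returns true, with a timeout a little above the 5-10 seconds observed.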

Reading error results from mongod in java after insert attempt

I am getting started with mongodb and I would like to do an operation where I attempt to insert a user with a username, password, and email. I have built unique indexes on username and email so the insert will fail if the specified username or email already exists.
So now I would like to report to the user either that their email is already registered or that the username they chose is taken. So I have gotten as far as:
CommandResult result = db.getLastError();
However, I don't see an easy way to read the error other than parsing the single error message that it gives me.
{ "serverUsed" : "127.0.0.1:27017" ,
"err" : "E11000 duplicate key error index: mojulo.users.$username_1 dup key: { : \"blahblah\" }" ,
"code" : 11000 ,
"n" : 0 ,
"connectionId" : 12 ,
"ok" : 1.0}
Also, it appears that this only reports the first error it encounters. Is there any way to check both email and username in a single query?

If you want to check both, then you will have to issue a query yourself. If you just insert documents, then it will report only the first violation of a unique index. The code 11000 (E11000) should only ever indicate a duplicate key error, which makes it easy to detect. You will need to parse the error message to figure out which collection and index were involved, though.
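Parsing the message could be sketched as follows. The class and method names are hypothetical, and the pattern is based only on the message shape shown in the question plus one other shape seen from a different server version; the format is not a stable contract, so treat a non-match as "unknown".

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the violated index name out of an E11000 message.
final class DuplicateKeyErrors {
    private DuplicateKeyErrors() {}

    // Matches "index: <db>.<coll>.$<name> dup key" as well as "index: <name> dup key".
    private static final Pattern INDEX =
            Pattern.compile("index:\\s+(?:\\S+\\.\\$)?(\\S+)\\s+dup key");

    // Returns the index name (e.g. "username_1"), or empty if the message has another shape.
    static Optional<String> violatedIndex(String err) {
        if (err == null) {
            return Optional.empty();
        }
        Matcher m = INDEX.matcher(err);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

You could then map "username_1" to a "username is taken" message, and do the same for whatever name the email index has (email_1 if it was created with the default naming, which is an assumption here).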
