I am using Java and SQL to push data to a Timestamp partitioned table in BigQuery. In my code, I specify the destination table:
.setDestinationTable(TableId.of("MyDataset", "MyTable"))
When I run it, it creates a table perfectly. However, when I attempt to insert new data, it throws a BigQueryException claiming the table already exists:
Exception in thread "main" com.google.cloud.bigquery.BigQueryException:
Already Exists: Table MyProject:MyDataset.MyTable
After some documentation digging, I found a solution that works:
.setWriteDisposition(WriteDisposition.WRITE_APPEND)
Adding the above appends any data (even duplicates). I'm not sure why the default for .setDestinationTable() is the equivalent of WRITE_EMPTY, which fails with an "already exists" error. The Google docs for .setDestinationTable() say:
Describes the table where the query results should be stored. If not
present, a new table will be created to store the results.
The docs should probably clarify the default value.
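For reference, here's a minimal sketch of how the destination table and write disposition fit together with the google-cloud-bigquery client (the query string is a placeholder, and builder method names can differ slightly between client versions):
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
QueryJobConfiguration queryConfig =
    QueryJobConfiguration.newBuilder("SELECT * FROM `MyDataset.SourceTable`") // placeholder query
        .setDestinationTable(TableId.of("MyDataset", "MyTable"))
        // the default behaves like WRITE_EMPTY, which fails once the table has data;
        // WRITE_APPEND adds the new rows instead
        .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
        .build();
Job job = bigquery.create(JobInfo.of(queryConfig));
job.waitFor(); // blocks until the query job finishes (throws InterruptedException)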
I just wrote a toy class to test Spark dataframe (actually Dataset since I'm using Java).
Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'");
ds = ds.withColumn("dt", lit("2018-12-17"));
ds.cache();
// first action: append the cached rows back into the source table
ds.write().mode(SaveMode.Append).insertInto("test2.dummy");
// second action: count the rows
System.out.println(ds.count());
According to my understanding, there are two actions here: "insertInto" and "count".
I debugged the code step by step; when running "insertInto", I see several lines like:
19/01/21 20:14:56 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
When running "count", I still see similar logs:
19/01/21 20:15:26 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
I have 2 questions:
1) When there are two actions on the same dataframe like above, and I don't call ds.cache or ds.persist explicitly, will the second action always cause the SQL query to be re-executed?
2) If I understand the log correctly, both actions trigger HDFS file reads. Does that mean ds.cache() doesn't actually work here? If so, why doesn't it work?
Many thanks.
It's because you append into the table that ds is created from, so ds needs to be recomputed because the underlying data changed. In such cases, Spark invalidates the cache. See, for example, this Jira (https://issues.apache.org/jira/browse/SPARK-24596):
When invalidating a cache, we invalid other caches dependent on this
cache to ensure cached data is up to date. For example, when the
underlying table has been modified or the table has been dropped
itself, all caches that use this table should be invalidated or
refreshed.
Try to run the ds.count before inserting into the table.
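In terms of the snippet in the question, that reordering would look roughly like this (same table and columns as above, imports as in the original code):
Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'");
ds = ds.withColumn("dt", lit("2018-12-17"));
ds.cache();
// first action: materialize the cache before the source table is modified
long rows = ds.count();
// the append invalidates the cache, but the count has already been computed
ds.write().mode(SaveMode.Append).insertInto("test2.dummy");
System.out.println(rows);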
I found that the other answer doesn't work. What I had to do was break the lineage so that the DataFrame I was writing does not know that one of its sources is the table I am writing to. To break the lineage, I created a copy of the DataFrame using
copy_of_df = sql_context.createDataFrame(df.rdd)
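For the Java API used in the question, a similar lineage break might look like this sketch (assuming the spark session and the ds Dataset from the question; rebuilding the Dataset from its RDD and schema gives a plan that no longer references test2.dummy as a source):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// rebuild the Dataset from its underlying rows so the write no longer depends on the table
Dataset<Row> detached = spark.createDataFrame(ds.toJavaRDD(), ds.schema());
detached.write().mode(SaveMode.Append).insertInto("test2.dummy");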
Hi, I'm trying to make a simple Android app that works with DynamoDB, following this tutorial:
Link of Tutorial
I have been able to connect with the dbClient and access the table. I can successfully perform the dbTable.putItem and also other methods like dbTable.getTableDescription.
I'm having trouble understanding how to execute the dbTable.getItem method, which requires a Primitive as input. I don't quite understand how to use the hash key or primary key.
My table looks like this (screenshot omitted; it showed the table's items with their hash key / primary key values):
When I execute this line of code:
Document doc = dbTable.getItem(new Primitive("1"));
where 1 is the hash key value of the first item in the table.
I get this error.
java.lang.IllegalStateException: hash key type does not match the one
in table defination
at com.amazonaws.mobileconnectors.dynamodbv2.document.Table.makeKey(Table.java:720)
at com.amazonaws.mobileconnectors.dynamodbv2.document.Table.getItem(Table.java:298)
at com.example.user.dynamodb.MainActivity$1.run(MainActivity.java:65)
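The error suggests the hash key attribute in the table is not of string type. One way to confirm is to print the key schema and attribute definitions and compare them with the Primitive being passed; a diagnostic sketch, assuming getTableDescription() returns the SDK's standard TableDescription model:
import android.util.Log;
import com.amazonaws.services.dynamodbv2.model.AttributeDefinition;
import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
import com.amazonaws.services.dynamodbv2.model.TableDescription;

TableDescription desc = dbTable.getTableDescription();
for (KeySchemaElement key : desc.getKeySchema()) {
    // prints HASH or RANGE plus the attribute name the key uses
    Log.d("DynamoDB", "key: " + key.getAttributeName() + " / " + key.getKeyType());
}
for (AttributeDefinition attr : desc.getAttributeDefinitions()) {
    // "S" = string, "N" = number, "B" = binary; a Primitive built from the string "1"
    // will only match a hash key attribute of type "S"
    Log.d("DynamoDB", "attr: " + attr.getAttributeName() + " / " + attr.getAttributeType());
}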
We are trying to save a DataFrame to a Hive table using the saveAsTable() method, but we are getting the exception below. We are trying to store the data as TextInputFormat.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Saving data in the Hive serde table `cdx_network`.`inv_devices_incr` is not supported yet. Please use the insertInto() API as an alternative..;
reducedFN.write().mode(SaveMode.Append).saveAsTable("cdx_network.alert_pas_incr");
I tried insertInto() together with enableHiveSupport(), and it works. But I want to use saveAsTable().
I want to understand why saveAsTable() does not work. I tried going through the documentation and also the code, but did not get much understanding; it is supposed to work. I have seen issues raised by people who use the Parquet format, but for TextInputFormat I did not see any issues.
Table definition
CREATE TABLE `cdx_network.alert_pas_incr`(
`alertid` string,
`alerttype` string,
`alert_pas_documentid` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'maprfs:/apps/cdx-dev/alert_pas_incr'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='0',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='0',
'transient_lastDdlTime'='1524121971')
Looks like this is a bug. I did a little research and found this issue: SPARK-19152. The fix version is 2.2.0. Unfortunately I can't verify it, because my company's cluster uses version 2.1.0.
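For anyone stuck on 2.1.x, the insertInto() route that the exception message suggests looks roughly like this (a sketch; it assumes a SparkSession built with enableHiveSupport() and the reducedFN DataFrame from the question, and the app name is a placeholder):
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("alert-pas-incr-load") // placeholder app name
        .enableHiveSupport()            // needed so insertInto can resolve the Hive table
        .getOrCreate();

// insertInto writes into the existing table definition (TextInputFormat + LazySimpleSerDe),
// so the DataFrame columns must line up positionally with the DDL above
reducedFN.write().mode(SaveMode.Append).insertInto("cdx_network.alert_pas_incr");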
I would like to stream data from my java code to a BigQuery table using templateSuffix, but I can't make it work properly. My code:
return bigquery.tabledata()
.insertAll(
projectId,
datasetId,
tableId,
new TableDataInsertAllRequest()
.setTemplateSuffix(templateSuffix)
.setRows(singletonList(row))
).execute();
When I run it with projectId, datasetId, MyTable20160426 and 20160426, I get the error:
"message" : "404 Not found: Table projectId:datasetId.MyTable20160426"
When I run it with projectId, datasetId, MyTable and 20160426, I get the error:
"message" : "404 Not found: Table projectId:datasetId.MyTable"
The table MyTable already exists and is already templated on date (I used the bulk upload for GCS); 20160426 is today's date.
How can I make it work?
Where should I look to understand what's wrong?
Thanks
First, the base table projectId:datasetId.MyTable should exist and should already have a schema. This is how BigQuery knows how to find the schema of the templated table that gets created.
Second, you should pass MyTable instead of MyTable20160426 as the table ID in your request.
Third, the existence (or non-existence) of a table is cached. So if you get a "not found" error and then create the table, you'll still get a "not found" error for up to a half hour.
It sounds like you might be able to wait and try again. If this doesn't work, please provide the actual project, dataset, and table ids you're using and e-mail the details to tigani#google.com, and I can help look into what is going on.
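Putting those points together, the call from the question would pass the base table name and let BigQuery combine it with the suffix, roughly like this (same client calls as in the question; the literal values are only for illustration):
// tableId is the base table, which must already exist with a schema;
// BigQuery appends the suffix to create/route to MyTable20160426
return bigquery.tabledata()
    .insertAll(
        projectId,
        datasetId,
        "MyTable",
        new TableDataInsertAllRequest()
            .setTemplateSuffix("20160426")
            .setRows(singletonList(row))
    ).execute();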
I'm trying to do an upsert using the MongoDB driver; here is the code:
BulkWriteOperation builder = coll.initializeUnorderedBulkOperation();
DBObject toDBObject;
for (T entity : entities) {
toDBObject = morphia.toDBObject(entity);
builder.find(toDBObject).upsert().replaceOne(toDBObject);
}
BulkWriteResult result = builder.execute();
where "entity" is morphia object. When I'm running the code first time (there are no entities in the DB, so all of the queries should be insert) it works fine and I see the entities in the database with generated _id field. Second run I'm changing some fields and trying to save changed entities and then I receive the folowing error from mongo:
E11000 duplicate key error collection: statistics.counters index: _id_ dup key: { : ObjectId('56adfbf43d801b870e63be29') }
What did I forget to configure in my example?
I don't know the structure of dbObject, but that bulk upsert needs a valid query in order to work.
Let's say, for example, that you have a unique (_id) property called "id". A valid query would look like:
builder.find({id: toDBObject.id}).upsert().replaceOne(toDBObject);
This way, the engine can (a) find an object to update and then (b) update it (or insert it if the object wasn't found). Of course, you need the Java syntax for find, but the same rule applies: make sure your .find will find something, then do an update.
I believe (just a guess) that the way it's written now will find "all" docs and try to update the first one ... but the behavior you are describing suggests it's finding "no doc" and attempting an insert.
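With the legacy Java driver, matching on the Morphia-generated _id as the unique key might look like the sketch below (it assumes the converted document still carries its _id on the second run; otherwise query on whatever unique business key you have):
import com.mongodb.BasicDBObject;
import com.mongodb.BulkWriteOperation;
import com.mongodb.BulkWriteResult;
import com.mongodb.DBObject;

BulkWriteOperation builder = coll.initializeUnorderedBulkOperation();
for (T entity : entities) {
    DBObject toDBObject = morphia.toDBObject(entity);
    // match on the unique key only, not on the whole (changed) document,
    // so the second run updates the existing doc instead of inserting a duplicate _id
    DBObject query = new BasicDBObject("_id", toDBObject.get("_id"));
    builder.find(query).upsert().replaceOne(toDBObject);
}
BulkWriteResult result = builder.execute();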