How to read BigQuery table from java spark with BigQuery connector


I am trying to read a BigQuery table from Spark Java code, as below:
BigQuerySQLContext bqSqlCtx = new BigQuerySQLContext(sqlContext);
bqSqlCtx.setGcpJsonKeyFile("sxxxl-gcp-1x4c0xxxxxxx.json");
bqSqlCtx.setBigQueryProjectId("winged-standard-2xxxx");
bqSqlCtx.setBigQueryDatasetLocation("asia-east1");
bqSqlCtx.setBigQueryGcsBucket("dataproc-9cxxxxx39-exxdc-4e73-xx07- 2258xxxx4-asia-east1");
Dataset<Row> testds = bqSqlCtx.bigQuerySelect("select * from bqtestdata.customer_visits limit 100");
But I'm facing the below issue:
19/01/14 10:52:01 WARN org.apache.spark.sql.SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
19/01/14 10:52:01 INFO com.samelamin.spark.bigquery.BigQueryClient: Executing query select * from bqtestdata.customer_visits limit 100
19/01/14 10:52:02 INFO com.samelamin.spark.bigquery.BigQueryClient: Creating staging dataset winged-standard-2xxxxx:spark_bigquery_staging_asia-east1
Exception in thread "main" java.util.concurrent.ExecutionException: com.google.api.client.googleapis.json.GoogleJsonResponseException:
400 Bad Request
{
"code" : 400,
"errors" :
[ {
"domain" : "global",
"message" : "Invalid dataset ID \"spark_bigquery_staging_asia-east1\". Dataset IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long.",
"reason" : "invalid"
} ],
"message" : "Invalid dataset ID \"spark_bigquery_staging_asia-east1\". Dataset IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long.",
"status" : "INVALID_ARGUMENT"
}

The message in the response
Dataset IDs must be alphanumeric (plus underscores)...
indicates that the dataset ID "spark_bigquery_staging_asia-east1" is invalid since it has a hyphen in it, specifically in asia-east1.
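To make the rule concrete, here is a minimal sketch that checks a candidate ID against the constraint quoted in the error message. The class and method names are made up for illustration, and `sanitize` is a hypothetical helper, not something the library offers; this shows why the ID is rejected, it does not change how the library builds its staging dataset name.

```java
import java.util.regex.Pattern;

/** Sketch: check a candidate dataset ID against the rule quoted in the error message. */
final class DatasetIdCheck {
    // Letters, digits and underscores only, 1 to 1024 characters (taken from the error text).
    private static final Pattern VALID_ID = Pattern.compile("[A-Za-z0-9_]{1,1024}");

    static boolean isValid(String datasetId) {
        return datasetId != null && VALID_ID.matcher(datasetId).matches();
    }

    // Hypothetical helper: replace every disallowed character with an underscore.
    static String sanitize(String datasetId) {
        return datasetId.replaceAll("[^A-Za-z0-9_]", "_");
    }

    public static void main(String[] args) {
        String stagingId = "spark_bigquery_staging_asia-east1";
        System.out.println(isValid(stagingId));           // false: contains '-'
        System.out.println(sanitize(stagingId));          // spark_bigquery_staging_asia_east1
        System.out.println(isValid(sanitize(stagingId))); // true
    }
}
```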

I had a similar problem with samelamin's Scala library. Apparently the library cannot handle locations other than US and EU, so it cannot access datasets in asia-east1.
For now, I'm using the BigQuery Spark Connector to read and write my BigQuery data.
If you find a workaround that makes this library work, please share it as well.

Related

BigQuery: 404 "Table is truncated." when insert right after truncate

I truncate my table by executing a query job as described here: https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries
"truncate table " + PROJECT_ID + "." + datasetName + "." + tableName;
I wait until the job finishes via
queryJob = queryJob.waitFor();
Truncate works fine.
However, if I do an insert right after the truncate operation via
InsertAllResponse response = table.insert(rows);
it results in a
com.google.cloud.bigquery.BigQueryException: Table is truncated.
with following log:
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
POST https://www.googleapis.com/bigquery/v2/projects/[MYPROJECTID]/datasets/[MYDATASET]/tables/[MYTABLE]/insertAll?prettyPrint=false
{
"code" : 404,
"errors" : [ {
"domain" : "global",
"message" : "Table is truncated.",
"reason" : "notFound"
} ],
"message" : "Table is truncated.",
"status" : "NOT_FOUND"
}
Sometimes I even have to wait more than 5 minutes between truncate and insert.
I would like to check periodically whether my table is still in the "Table is truncated." state, until that state is gone.
How can I query the BigQuery API to check whether the table is ready for inserts?
How can I query the BigQuery API to get the status of the table?
Edit: an example to reproduce the issue can be found here

If a table is truncated while a streaming pipeline is still running, or if you perform a streaming insert on a recently truncated table, you can receive errors like the one in the question ("Table is truncated."). That is expected behavior. The metadata consistency mode for InsertAll (a very high QPS API) is eventually consistent, which means a call to InsertAll may see stale table metadata and fail with an error such as "Table is truncated.". The typical way to resolve this issue is to back off and retry.
Currently, the BigQuery API has no option to check whether a table is in the truncated state.
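A back-off-and-retry loop could look roughly like the sketch below. It is plain Java with no BigQuery dependency; `callWithBackoff` and its parameters are names invented for this example, and the lambda in `main` only stands in for the real `table.insert(rows)` call.

```java
import java.util.concurrent.Callable;

/** Sketch: retry an operation with exponential back-off. All names here are illustrative. */
final class BackoffRetry {

    static <T> T callWithBackoff(Callable<T> operation, int maxAttempts,
                                 long initialDelayMillis, long maxDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                // In real code, retry only the transient failure (for example the
                // "Table is truncated." error) and rethrow everything else immediately.
                if (attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(delay);
                delay = Math.min(delay * 2, maxDelayMillis); // double the wait, up to a cap
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for table.insert(rows): fails twice, then succeeds.
        String result = callWithBackoff(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("Table is truncated.");
            }
            return "inserted";
        }, 5, 10, 1000);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Since the question mentions waits of more than 5 minutes, the attempt count and delay cap would need to be sized accordingly.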

Unfortunately, the API does not (yet?) provide an endpoint to check the truncated state of a table.
To avoid this issue, you can use a load job via Google Cloud Storage.
The load job appears to respect this state: I have had no issues running truncate/load multiple times in a row.
public void load(String datasetName, String tableName, String sourceUri) throws InterruptedException {
    Table table = getTable(datasetName, tableName);
    Job job = table.load(FormatOptions.json(), sourceUri);
    // Wait for the job to complete
    Job completedJob = job.waitFor(RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
            RetryOption.totalTimeout(Duration.ofMinutes(3)));
    if (completedJob != null && completedJob.getStatus().getError() == null) {
        // Job completed successfully
    } else {
        // Handle error case
    }
}

Mule4 Bulk insert after map in dataweave prompting an error for field contains multiple object

I am trying to insert bulk data to mssql after batch processing.
Below is the input to bulk insert component in Mule4:
[
{
"schemaId": 311,
"createDT": "2019-04-29 04:22:51.535",
"jsonData": {
"Employee Name": "Becky Forgey"
}
},
{
"schemaId": 311,
"createDT": "2019-04-29 04:22:51.536",
"jsonData": {
"Employee Name": "sahana"
}
}
]
Database Query is:
INSERT INTO [test].[dbo].[EmployeeData] (SchemaID,CreateDatetime,JsonData) VALUES (:schemaId,:createDT,:jsonData)
INPUT parameter is payload.
If I send a string value for jsonData, the insert works, but the batch result consists of multiple records and I am mapping it in DataWeave.
I get the error below if I try to insert the JSON above:
Message : The conversion from UNKNOWN to NVARCHAR is unsupported.
Error type : DB:QUERY_EXECUTION
Element : test-mapFlow/processors/5 # test-map:test-map.xml:41 (Bulk insert)
Element XML : <db:bulk-insert doc:name="Bulk insert" doc:id="98f8b9a0-b3d2-4beb-a31c-9f76af7f1447" config-ref="Database_Config">
<db:sql>INSERT INTO [rq].[dbo].[EmployeeMasterData] (SchemaID,CreateDatetime,JsonData) VALUES (:schemaId,:createDT,:jsonData)</db:sql>
</db:bulk-insert>
Please guide

Please provide the full script showing how you present the data to the SQL. Usually it contains a mapping between your values and the SQL values. Without it I can only guess, and my guess is that instead of jsonData it should be jsonData."Employee Name".
Another guess, which I cannot confirm without proper logging: jsonData is absent. To avoid such issues, a default value should be provided for each parameter.
In general, try to avoid multiple conversions, or do them on one platform as close to the end as possible: https://simpleflatservice.com/mule4/AvoidCoversionsOrMakeThemNative.html

BigQuery returns 404 error on table that exists

com.google.api.client.googleapis.json.GoogleJsonResponseException: 404
NOT_FOUND
{
"code" : 404,
"errors" : [ {
"domain" : "global",
"message" : "Not found: Table
optimizehit.com:optimizehit-db:_026b772272a27a80bf416f682c0426cbb2061afe.anon1d35b89ee725c18278192da78ee38f63d0b749ca",
"reason" : "notFound"
} ],
"message" : "Not found: Table
optimizehit.com:optimizehit-db:_026b772272a27a80bf416f682c0426cbb2061afe.anon1d35b89ee725c18278192da78ee38f63d0b749ca"
}
We are getting this error while querying a table that always exists in our dataset, so it's not a problem of the table missing as per the documentation of the error code.
Checking the dashboard also showed that the table exists, and querying it after a minute or so, it was working.
There is, however, a cron that updates this table with a query job with
writeDisposition set to WRITE_TRUNCATE.
Could this affect querying the table?
Edit: The time of this error was: August 5, 2015 at 6:48:39 PM UTC+5:30

You're receiving the "not found" error not on the table from which you query, but on the anonymous table to which you are writing:
optimizehit.com:optimizehit-db:_026b772272a27a80bf416f682c0426cbb2061afe.anon1d35b89ee725c18278192da78ee38f63d0b749ca
As a BigQuery engineer, I looked into the history of this anonymous table and the jobs that operated on it. The job that would have used this anonymous table failed with a RATE_LIMIT_EXCEEDED error.
The configuration.query.destination_table fields would still contain the anonymous table reference the job would have filled with results. Did you check for an error result before retrieving the query job results?
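The check being asked about can be sketched as follows. `JobStatus` here is a hypothetical, stripped-down mirror of the `status` object in a job resource (a `state` plus an optional `errorResult`), not a class from any client library; with a real client you would read the equivalent fields from the job it returns.

```java
/** Sketch: decide whether a finished job's results are safe to read. */
final class JobResultCheck {

    // Hypothetical, minimal mirror of the "status" object of a job resource:
    // "state" is PENDING, RUNNING or DONE; an error result is present only if the job failed.
    static final class JobStatus {
        final String state;
        final String errorReason; // null when the job recorded no error result

        JobStatus(String state, String errorReason) {
            this.state = state;
            this.errorReason = errorReason;
        }
    }

    // DONE only means the job stopped running; it succeeded only if no error was recorded.
    static boolean resultsAvailable(JobStatus status) {
        return "DONE".equals(status.state) && status.errorReason == null;
    }

    public static void main(String[] args) {
        JobStatus failed = new JobStatus("DONE", "RATE_LIMIT_EXCEEDED");
        JobStatus succeeded = new JobStatus("DONE", null);
        System.out.println(resultsAvailable(failed));    // false: do not read the destination table
        System.out.println(resultsAvailable(succeeded)); // true
    }
}
```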

Google BigQuery - query ran successfully but results not pushed to destination table

We run a nightly query against BigQuery via the Java REST API that specifies a destination table for the results to be pushed to (write disposition=WRITE_TRUNCATE). Today's query appeared to run without errors but the results were not pushed to the destination table.
This query has been running for a few weeks now and we've had no issues. No code changes were made either.
Manually running it a second time after it "failed" worked fine. It was just this one glitch that we spotted and we're concerned it may happen again.
Our logged JSON response from the "failed" query looks fine (I've obfuscated any sensitive data):
INFO: Job finished successfully: {
"configuration" : {
"dryRun" : false,
"query" : {
"createDisposition" : "CREATE_IF_NEEDED",
"destinationTable" : {
"datasetId" : "[REMOVED]",
"projectId" : "[REMOVED]",
"tableId" : "[REMOVED]"
},
"priority" : "INTERACTIVE",
"query" : "[REMOVED]",
"writeDisposition" : "WRITE_TRUNCATE"
}
},
"etag" : "[REMOVED]",
"id" : "[REMOVED]",
"jobReference" : {
"jobId" : "[REMOVED]",
"projectId" : "[REMOVED]"
},
"kind" : "bigquery#job",
"selfLink" : "[REMOVED]",
"statistics" : {
"creationTime" : "1390435780070",
"endTime" : "1390435780769",
"query" : {
"cacheHit" : false,
"totalBytesProcessed" : "12546"
},
"startTime" : "1390435780245",
"totalBytesProcessed" : "12546"
},
"status" : {
"state" : "DONE"
}
}
Using the "try it!" for Jobs/GET here and plugging in the job id also shows the job was indeed successful and matches our logged output (pasted above).
Checking the web console shows the destination table has been truncated but not updated. Weirdly, the "Last Modified" has not been updated (I did try refreshing the page numerous times):
http://i.stack.imgur.com/384NL.png
Has anyone experienced this before with BigQuery - a query appearing to run successfully but if a destination/reference table was specified the results were not pushed yet the table was truncated?

I am a developer on the BigQuery team. I've looked up the details of your job from the breadcrumbs you left (your query was the only one that started at that start time).
It looks like your destination table was truncated at 4:09 pm today PST, which is the time your job ran, but it was left empty -- the query that truncated it didn't actually fill in any information.
I'm having a little bit of trouble piecing together the details, because one of the source tables appears to have been overwritten (the left table in your left outer join was created at 4:20 PM).
However, there is a clue in the "total bytes processed" field -- it says that the query only processed 12K of data. The internal statistics say that only 384 rows were involved in the query among both tables that were involved.
My guess is that the query legitimately returned 0 rows, so the table was cleared.
There is a bug in that deleting all of the data in a table doesn't update the last modified time. We use last modified to mean either the last time the metadata was updated (like description, schema, etc.) or the last time the table had data added to it. But if you just truncate the table, that doesn't update the metadata or add data, so we end up with a stale last-modified time.
If this doesn't sound like a reasonable chain of events, we'll need more information from you about how to debug it (especially since it looks like the tables involved have been modified since you ran this query), and a way that we can reproduce it would be great.

So, we figured out what the problem is with this. It failed again a few times over the last few days so we dug in further.
The query that is being executed is dependent on another query which is executed immediately before it. Although we do wait for the first query to finish (job status = "DONE"), it appears that behind the scenes it's actually not fully complete and its data is not yet available to be used.
Current process is:
1. Fetch data from another data source and stream the results into table A
2. When (1) is complete (poll job id and get status "DONE") submit another query which uses the results in table A to join on to create table B
3. Table A's data is not yet available so query from (2) results in an empty table
We've noticed it takes about 5-10 seconds for the data to actually appear and be available in BigQuery when using streaming for the first query.
We used a fairly ugly workaround - simply wait a few seconds after the first query before running the next one. Not exactly elegant but it works.
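A slightly less fragile variant of the fixed wait is to poll a readiness check with a deadline. The sketch below is a different technique from the plain sleep described above, and all names in it are made up. The check itself is left abstract: it could be a row-count query on table A, but whether such a query reliably sees freshly streamed rows is an assumption you would need to verify.

```java
import java.util.function.BooleanSupplier;

/** Sketch: poll a readiness check until it passes or a deadline expires. */
final class WaitUntilReady {

    static boolean waitUntil(BooleanSupplier ready, long timeoutMillis, long pollIntervalMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (ready.getAsBoolean()) {
                return true;  // condition met, safe to continue
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: caller decides whether to fail or retry later
            }
            Thread.sleep(pollIntervalMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Stand-in for a real check such as "table A now returns rows".
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 50, 2_000, 10);
        System.out.println(ready ? "table A has data, run the join" : "timed out");
    }
}
```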

Mongo DuplicateKey error despite no overlap

I have a well-logged pool of several java servers behind an F5 load balancer (professionally managed, it's not sending traffic to >1 host) running Tomcat with my application installed, connecting to a sharded mongo cluster. I'm using a base64-encoded SHA-1 hash of the primary natural key as the _id. When a new record is to be created, I do a pretty basic:
BasicDBObject query = new BasicDBObject();
query.put("userId", userId);
query.put("_id", id);
DBObject user = getUsersCollection().findOne(query);
if (user == null) {
    getUsersCollection().insert(new UserObject(userId));
}
This is simplified. In fact there are multiple checks for the pre-existence of this user, including one which should throw a custom exception, and none are triggered. The traffic logs indicate a single incoming create request, and here's an example of what happens:
2014-01-19 20:03:45,167 [http-bio-7950-exec-827]:[...] : ERROR FATAL [...] - Internal server error
[...]: com.mongodb.MongoException$DuplicateKey: { "serverUsed" : "[...]" , "singleShard" : "replicaset_2/host1:27017,host2:27017,host3:27017" , "err" : "E11000 duplicate key error index: Users.$_id_ dup key: { : \"HASH\" }" , "code" : 11000 , "n" : 0 , "lastOp" : { "$ts" : 1390190614 , "$inc" : 1} , "connectionId" : 335764 , "ok" : 1.0}
Yet in my Users collection the record has been created:
db.Users.findOne({_id:"HASH"}):
{
"_id" : "HASH",
"createDate" : ISODate("2014-01-20T04:03:45.161Z"),
...
}
I'm pasting this as important because of the timestamps. We have a timezone issue, but that aside I interpret the 6ms difference as clock skew between the mongo cluster and my application servers. There is no other record of this incoming traffic (and it is logged as it bounces from server to server, even - nothing else!) So I am 99.999% confident that my SINGLE LEGITIMATE insert call is both inserting and throwing an error.
Any theories as to how/why this is happening would be greatly appreciated. I'll run tracers and examples if needed to answer questions with more information.

You are searching for a user using both _id and userId fields. Try to comment out this line: query.put("_id", id);.
It's not clear in your code where Java variable userId comes from. It's also not clear how UserObject sets an _id if at all.
Overall, it looks like the way you search for the user and the way you create it do not match, i.e. it is unclear what defines the unique key for that user.
One fix could be to replace these lines:
query.put("userId", userId);
query.put("_id", id);
with:
query.put("_id", userId);
to make the _id field your userId.
