Good day!
I'm using the Cloud DLP API to inspect BigQuery views by converting chunks of the data into a ContentItem and passing it to the inspect request. However, I am having trouble converting the findings and saving them to a BigQuery table. Previously I used an Airflow DLP operator for this, and the output was saved automatically by passing an output storage config in the InspectConfig. That approach won't be applicable anymore, though, because I'm calling the DLP API per chunk of data using Apache Beam in Java.
I saw that the Finding object has a writeTo() method, but I'm not sure how to use it or how to save the findings with the correct types into a BigQuery table. Can you help me with this? I'm currently stuck. Thank you!
What I want to do is something like this:
for (Finding res : result.getFindingsList()) {
    TableRow bqRow = new TableRow();
    Object data = res.getLocation();
    bqRow.set("field", data);
    context.output(bqRow);
}
But this approach wouldn't save it to BigQuery with the correct types, especially for getLocation(), since it returns a protobuf message type.
I was trying to see if I could use the writeTo() method, but I'm not sure how to use it. Thank you in advance for the help!
for (Finding res : result.getFindingsList()) {
    res.writeTo(...)
    ...
    context.output(...);
}
If you use HybridInspect, we'll store the findings in BigQuery for you:
https://cloud.google.com/dlp/docs/how-to-hybrid-jobs
If you do it yourself, you will need to convert the findings to a native BigQuery format such as JSON (see also: Load protobuf data to BigQuery).
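One way to do that conversion is to map the scalar fields of each Finding directly and serialize nested protobuf messages (like Location) to JSON with JsonFormat. Here is a minimal sketch, assuming the response comes from inspectContent and that the column names below match a table schema you define yourself:

import com.google.api.services.bigquery.model.TableRow;
import com.google.privacy.dlp.v2.Finding;
import com.google.privacy.dlp.v2.InspectContentResponse;
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.util.JsonFormat;
import java.util.ArrayList;
import java.util.List;

public class FindingConverter {

    // Column names ("info_type", "likelihood", "quote", "location") are
    // placeholders; they must match the schema of the target BigQuery table.
    static List<TableRow> toTableRows(InspectContentResponse response)
            throws InvalidProtocolBufferException {
        List<TableRow> rows = new ArrayList<>();
        for (Finding finding : response.getResult().getFindingsList()) {
            TableRow row = new TableRow();
            // Scalar fields map cleanly to STRING columns.
            row.set("info_type", finding.getInfoType().getName());
            row.set("likelihood", finding.getLikelihood().name());
            row.set("quote", finding.getQuote());
            // Nested messages such as Location can be serialized to a JSON
            // string and stored in a STRING (or JSON) column instead of the raw proto.
            row.set("location", JsonFormat.printer().print(finding.getLocation()));
            rows.add(row);
        }
        return rows;
    }
}

In your Beam DoFn you would then call context.output(row) for each converted row and write the resulting PCollection of TableRow with BigQueryIO.writeTableRows().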
I am very new to DynamoDB. I have a lot of stale data in a table and I want to delete it in batches. What I have done so far is query the table using a GSI. Since there is not much relevant content about using batchWriteItem in Java, can someone please help me? A code example would be appreciated.
I have tried googling a lot and have read the AWS documentation for batchWriteItem, but it doesn't have any code examples.
You have to use addDeleteItem on a WriteBatch, which you then add to the BatchWriteItemEnhancedRequest:
var builder = BatchWriteItemEnhancedRequest.builder();
items.forEach(item -> builder.addWriteBatch(
    WriteBatch.builder(ItemEntity.class)
        .mappedTableResource(itemTable)   // the DynamoDbTable<ItemEntity> for your table
        .addDeleteItem(item)              // or addDeleteItem(key) with the item's Key
        .build()));
enhancedClient.batchWriteItem(builder.build());
All of this is from the v2 SDK (the DynamoDB enhanced client).
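For context, here is a rough sketch of how the enhancedClient and the itemTable (the mappedTableResource used above) could be set up. The table name "item_entity" and the ItemEntity bean class are assumptions, not part of your code:

import software.amazon.awssdk.enhanced.dynamodb.DynamoDbEnhancedClient;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbTable;
import software.amazon.awssdk.enhanced.dynamodb.TableSchema;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;

// Build the enhanced client on top of the standard v2 client.
DynamoDbClient ddb = DynamoDbClient.create();
DynamoDbEnhancedClient enhancedClient = DynamoDbEnhancedClient.builder()
    .dynamoDbClient(ddb)
    .build();

// The mapped table passed to WriteBatch.builder(...).mappedTableResource(...).
DynamoDbTable<ItemEntity> itemTable =
    enhancedClient.table("item_entity", TableSchema.fromBean(ItemEntity.class));

Also keep in mind that a single BatchWriteItem call accepts at most 25 put/delete requests, so a large delete has to be split into multiple batches.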
This is most similar to this question.
I am creating a pipeline in Dataflow 2.x that takes streaming input from a Pubsub queue. Every single message that comes in needs to be streamed through a very large dataset that comes from Google BigQuery and have all the relevant values attached to it (based on a key) before being written to a database.
The trouble is that the mapping dataset from BigQuery is very large - any attempt to use it as a side input fails with the Dataflow runners throwing the error "java.lang.IllegalArgumentException: ByteString would be too long". I have attempted the following strategies:
1) Side input
As stated, the mapping data is (apparently) too large to do this. If I'm wrong here or there is a workaround, please let me know, because this would be the simplest solution.
2) Key-Value pair mapping
In this strategy, I read the BigQuery data and Pubsub message data in the first part of the pipeline, then run each through ParDo transformations that change every value in the PCollections to KeyValue pairs. Then, I run a Merge.Flatten transform and a GroupByKey transform to attach the relevant mapping data to each message.
The trouble here is that streaming data requires windowing to be merged with other data, so I have to apply windowing to the large, bounded BigQuery data as well. It also requires that the windowing strategies are the same on both datasets. But no windowing strategy for the bounded data makes sense, and the few windowing attempts I've made simply send all the BQ data in a single window and then never send it again. It needs to be joined with every incoming pubsub message.
3) Calling BQ directly in a ParDo (DoFn)
This seemed like a good idea - have each worker declare a static instance of the map data. If it's not there, then call BigQuery directly to get it. Unfortunately this throws internal errors from BigQuery every time (as in the entire message just says "Internal error"). Filing a support ticket with Google resulted in them telling me that, essentially, "you can't do that".
It seems this task doesn't really fit the "embarrassingly parallelizable" model, so am I barking up the wrong tree here?
EDIT:
Even when using a high-memory machine in Dataflow and attempting to make the side input into a map view, I get the error java.lang.IllegalArgumentException: ByteString would be too long.
Here is an example (pseudocode) of the code I'm using:
Pipeline pipeline = Pipeline.create(options);

PCollectionView<Map<String, TableRow>> mapData = pipeline
    .apply("ReadMapData", BigQueryIO.read().fromQuery("SELECT whatever FROM ...").usingStandardSql())
    .apply("BQToKeyValPairs", ParDo.of(new BQToKeyValueDoFn()))
    .apply(View.asMap());

PCollection<PubsubMessage> messages = pipeline.apply(PubsubIO.readMessages()
    .fromSubscription(String.format("projects/%1$s/subscriptions/%2$s", projectId, pubsubSubscription)));

messages.apply(ParDo.of(new DoFn<PubsubMessage, TableRow>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        JSONObject data = new JSONObject(new String(c.element().getPayload()));
        String key = getKeyFromData(data);
        TableRow sideInputData = c.sideInput(mapData).get(key);
        if (sideInputData != null) {
            LOG.info("holyWowItWOrked");
            c.output(new TableRow());
        } else {
            LOG.info("noSideInputDataHere");
        }
    }
}).withSideInputs(mapData));
The pipeline throws the exception and fails before logging anything from within the ParDo.
Stack trace:
java.lang.IllegalArgumentException: ByteString would be too long: 644959474+1551393497
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.concat(ByteString.java:524)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:576)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.copyFrom(ByteString.java:559)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString$Output.toByteString(ByteString.java:1006)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillBag.persistDirectly(WindmillStateInternals.java:575)
com.google.cloud.dataflow.worker.WindmillStateInternals$SimpleWindmillState.persist(WindmillStateInternals.java:320)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillCombiningState.persist(WindmillStateInternals.java:951)
com.google.cloud.dataflow.worker.WindmillStateInternals.persist(WindmillStateInternals.java:216)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.flushState(StreamingModeExecutionContext.java:513)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:363)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1000)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$800(StreamingDataflowWorker.java:133)
com.google.cloud.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:771)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Check out the section called "Pattern: Streaming mode large lookup tables" in Guide to common Cloud Dataflow use-case patterns, Part 2. It might be the only viable solution since your side input doesn't fit into memory.
Description:
A large (in GBs) lookup table must be accurate, and changes often or does not fit in memory.
Example:
You have point of sale information from a retailer and need to associate the name of the product item with the data record which contains the productID. There are hundreds of thousands of items stored in an external database that can change constantly. Also, all elements must be processed using the correct value.
Solution:
Use the "Calling external services for data enrichment" pattern, but rather than calling a micro service, call a read-optimized NoSQL database (such as Cloud Datastore or Cloud Bigtable) directly. For each value to be looked up, create a Key Value pair using the KV utility class. Do a GroupByKey to create batches of the same key type to make the call against the database. In the DoFn, make a call out to the database for that key and then apply the value to all values by walking through the iterable. Follow best practices with client instantiation as described in "Calling external services for data enrichment".
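Here is a rough Beam (Java) sketch of that pattern, continuing from the messages PCollection in your code above. LookupClient, enrich(), and the fixed windowing are placeholders and assumptions for illustration, not a drop-in implementation:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.json.JSONObject;

// 1) Key each incoming message by the lookup id; keep the raw JSON payload as
//    the value so the default string coder can be used.
PCollection<KV<String, String>> keyed = messages.apply("KeyByLookupId",
    ParDo.of(new DoFn<PubsubMessage, KV<String, String>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            String json = new String(c.element().getPayload());
            c.output(KV.of(getKeyFromData(new JSONObject(json)), json));
        }
    }));

// 2) Window (GroupByKey on unbounded data requires it), group by key, then make
//    one lookup call per key and apply the result to every grouped element.
keyed
    .apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardSeconds(10))))
    .apply(GroupByKey.<String, String>create())
    .apply("LookupAndEnrich", ParDo.of(new DoFn<KV<String, Iterable<String>>, TableRow>() {
        private transient LookupClient client; // placeholder for a Bigtable/Datastore client

        @Setup
        public void setup() {
            client = LookupClient.create(); // one client per DoFn instance, reused across bundles
        }

        @ProcessElement
        public void processElement(ProcessContext c) {
            // One database call per key...
            TableRow mapping = client.lookup(c.element().getKey());
            for (String json : c.element().getValue()) {
                // ...applied to every element that shares that key.
                c.output(enrich(new JSONObject(json), mapping));
            }
        }
    }));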
Other relevant patterns are described in Guide to common Cloud Dataflow use-case patterns, Part 1:
Pattern: Slowly-changing lookup cache
Pattern: Calling external services for data enrichment
I am new to Android and Windows Azure. I have successfully inserted data from my Android application, but how do I retrieve a single item and display it in a TextView?
The read function after the getTable(Class) call is also not working. What is the exact function to use for it? I have followed these instructions, but they did not work for me, and I also do not understand the documentation.
Currently, I can just provide some tutorials about how to query data from an Azure database. I recommend referring to this official document about how to use the Azure client library with Java: https://azure.microsoft.com/en-us/documentation/articles/mobile-services-android-how-to-use-client-library . You can focus on two parts: “how to query data from a mobile service” and “how to bind data to the UI”.
At the same time, you can view this video from Channel 9: https://channel9.msdn.com/Series/Windows-Azure-Mobile-Services/Android-Getting-Started-With-Data-Connecting-your-app-to-Windows-Azure-Mobile-Services.
For the sample code project of this tutorial, please go to the GitHub link https://github.com/Azure/mobile-services-samples/tree/master/GettingStartedWithData .
Regarding the getTable(Class) function not working, please double-check whether the class name is the same as the table name. If it is, you can use it like below:
MobileServiceTable<ToDoItem> mToDoTable = mClient.getTable(ToDoItem.class);
If not, you can write your code like this:
MobileServiceTable<ToDoItem> mToDoTable = mClient.getTable("ToDoItemBackup", ToDoItem.class);
For further support, please share more details about your code snippet.
I am new to Elasticsearch. I read Elasticsearch's Java client APIs and am able to build a query and send it to the Elasticsearch server via the transport client.
My query is quite complex, with multi-level filters, and I notice that it is cumbersome to build via the Java client. I feel it would be much simpler to build a JSON query string and then send it over to the Elasticsearch server via the Java client.
Is this something Elasticsearch offers?
I do like what the Elasticsearch Java API can do after receiving results, such as scrolling over them, and I want to keep these features.
Thanks for any input and links!
Regards.
I did further research on the Elasticsearch API and found that Elasticsearch does offer this capability. Here is how:
SearchResponse scrollResp = client.prepareSearch("my-index")
    .setTypes("my-type")
    .setSearchType(SearchType.SCAN)
    .setQuery(query) // <-- query string in JSON format
    .execute().actionGet();
You can no longer pass a string to the setQuery() function; however, you can use a WrapperQueryBuilder like this:
WrapperQueryBuilder builder = QueryBuilders.wrapperQuery(searchQuery);
SearchRequestBuilder sr = client.prepareSearch().setIndices(index).setTypes(mapping).setQuery(builder);
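Putting the two pieces together, a minimal sketch (reusing the transport client from above; the index/type names, JSON body, scroll timeout, and page size are placeholders) might look like this, and it keeps the scrolling you mentioned:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.WrapperQueryBuilder;

// Raw JSON query string, built by hand or from a template.
String searchQuery = "{ \"match\": { \"title\": \"elasticsearch\" } }";
WrapperQueryBuilder builder = QueryBuilders.wrapperQuery(searchQuery);

SearchResponse response = client.prepareSearch()
    .setIndices("my-index")
    .setTypes("my-type")
    .setQuery(builder)
    .setScroll(TimeValue.timeValueMinutes(1)) // keep the scroll alive between pages
    .setSize(100)
    .execute()
    .actionGet();

// Fetch the next page of hits using the returned scroll id.
SearchResponse next = client.prepareSearchScroll(response.getScrollId())
    .setScroll(TimeValue.timeValueMinutes(1))
    .execute()
    .actionGet();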
I'd recommend using the Java API; it is very good once you get used to it, and in most cases it is less cumbersome. If you look through the Elasticsearch source code, you will see that the Java API builds the JSON under the hood. Here is an example from MatchAllQueryBuilder:
@Override
public void doXContent(XContentBuilder builder, Params params) throws IOException {
    builder.startObject(MatchAllQueryParser.NAME);
    if (boost != -1) {
        builder.field("boost", boost);
    }
    if (normsField != null) {
        builder.field("norms_field", normsField);
    }
    builder.endObject();
}
Elasticsearch has built-in capabilities to do exactly what you need, in an organized manner.
To answer your question, please see this link (the material is gone on elastic's site, so it might no longer work):
https://web.archive.org/web/20150906215934/https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/search.html
All you have to do is build a simple file which contains your search template, i.e. your complex search query. It can be a simple JSON file or a text file. Then you simply pass in your parameters through your Java code. See the example in the link; it makes things amply clear.
Bhargav.
I'm very new to using web services. I'd appreciate it if anyone can help me with this.
In my PHP code, I'm trying to use the SOAP web services from another server (JIRA, Java). The JIRA SOAP API is shown here.
$jirasoap = new SoapClient($jiraserver['url']);
$token = $jirasoap->login($jiraserver['username'], $jiraserver['password']);
$remoteissue = $jirasoap->getIssue($token, "issuekey");
I found that my code has no problem calling the functions listed on that page. However, I don't know how to use the objects returned by the API calls.
My questions are:
In my PHP code, how can I use the methods of the Java class objects returned by SOAP API calls?
For example, the function $remoteissue = $jirasoap->getIssue($a, $b) will return a RemoteIssue. Based on this (http://docs.atlassian.com/rpc-jira-plugin/latest/com/atlassian/jira/rpc/soap/beans/RemoteIssue.html), there are methods like getSummary, getKey, etc. How can I use these functions in my code?
Based on some PHP examples I found on the internet, it seems that everyone is using something like this:
$remoteissue = $jirasoap->getIssue($token, "issuekey");
$key = $remoteissue->key;
They are not using the object's methods.
Referring to this example, it seems that someone is able to do this in other languages. Can it be done in PHP too?
The problem I'm facing is that I am trying to get the ID of an Attachment. However, it seems that we can't get the Attachment ID using this method: $attachmentid = $remoteattachment->id;. I am trying to use the $remoteattachment->getId() method instead.
In PHP code, after we make a SOAP API call and receive the returned objects, how do we know what data fields are available in that object?
For example,
$remoteissue = $jirasoap->getIssue($token, "issuekey");
$summary = $remoteissue->summary;
How do we know ->summary is available in $remoteissue?
When I refer to this document (http://docs.atlassian.com/rpc-jira-plugin/latest/com/atlassian/jira/rpc/soap/beans/RemoteIssue.html), I don't see it mention any data fields in RemoteIssue. How do we know we can get key, summary, etc., from this object? How do we know it is ->summary, not ->getSummary()? Do we need to use a web browser to open the WSDL URL?
Thanks.
This question is over one year old, but to share knowledge and provide an answer to people who have this same question and found this page, here are my findings.
The document mentioned in the question is an overview of the JiraSoapService interface. This is a good reference for what functions can be called with which arguments and what they return.
If you use Java for your Jira SoapClient, the returned objects are implemented as the documented types, but if you use PHP, the returned objects aren't of the types stated in this documentation and do not have any of the methods mentioned. The returned objects are instances of the internal PHP class stdClass, which is a placeholder for undefined objects. The best way to know what is returned is to use var_dump() on the objects returned from the SOAP calls.
$jirasoap = new SoapClient($jiraserver['url']);
$token = $jirasoap->login($jiraserver['username'], $jiraserver['password']);
$remoteissue = $jirasoap->getIssue($token, "PROJ-1");
var_dump($remoteissue);
/* -- You will get something like this ---
object(stdClass)#2 (21) {
["id"]=> string(3) "100"
["affectsVersions"]=> array(0) { }
["assignee"]=> string(4) "user"
...
["created"]=> string(24) "2012-12-13T09:27:49.934Z"
...
["description"]=> string(17) "issue description"
....
["key"]=> string(6) "PROJ-1"
["priority"]=> string(1) "3"
["project"]=> string(4) "PROJ"
["reporter"]=> string(4) "user"
["resolution"]=> NULL
["status"]=> string(1) "1"
["summary"]=> string(15) "Project issue 1"
["type"]=> string(1) "3"
["updated"]=> string(24) "2013-01-21T16:11:43.073Z"
["votes"]=> int(0)
}
*/
// You can access data like this:
$jiraKey = $remoteissue->key;
$jiraProject = $remoteissue->project;
The document you referred to in #2 is for a Java implementation and really doesn't give you any help with PHP. If they do not publish a public API for their service (which would be unusual), then using the WSDL as a reference will let you know what objects and methods are accepted by the service, and you can plan your method calls accordingly.
The technique you used to call getIssue(...) seems fine, although you should consider using try...catch in case of a SoapFault.
I have used the Jira SOAP API in a .NET project, and IntelliSense showed me what fields are available on the returned object.
You can use something like VS.Php for Visual Studio or Php for Visual Studio if you are using Visual Studio.
Or you can choose one of the IDEs from here with support for IntelliSense.