I have the following task:
Create a job with a SQL query against a Hive table;
Run this job on a remote Flink cluster;
Collect the result of this job into a file (HDFS is preferable).
Note
Because it is necessary to run this job on a remote Flink cluster, I cannot use TableEnvironment in a simple way. This problem is mentioned in this ticket: https://issues.apache.org/jira/browse/FLINK-18095. For the current solution I use the advice from http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Table-Environment-for-Remote-Execution-td35691.html.
Code
EnvironmentSettings batchSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
// create remote env
StreamExecutionEnvironment streamExecutionEnvironment = StreamExecutionEnvironment.createRemoteEnvironment("localhost", 8081, "/path/to/my/jar");
// create StreamTableEnvironment
TableConfig tableConfig = new TableConfig();
ClassLoader classLoader = Thread.currentThread().getContextClassLoader();
CatalogManager catalogManager = CatalogManager.newBuilder()
        .classLoader(classLoader)
        .config(tableConfig.getConfiguration())
        .defaultCatalog(
                batchSettings.getBuiltInCatalogName(),
                new GenericInMemoryCatalog(
                        batchSettings.getBuiltInCatalogName(),
                        batchSettings.getBuiltInDatabaseName()))
        .executionConfig(streamExecutionEnvironment.getConfig())
        .build();
ModuleManager moduleManager = new ModuleManager();
BatchExecutor batchExecutor = new BatchExecutor(streamExecutionEnvironment);
FunctionCatalog functionCatalog = new FunctionCatalog(tableConfig, catalogManager, moduleManager);
StreamTableEnvironmentImpl tableEnv = new StreamTableEnvironmentImpl(
        catalogManager,
        moduleManager,
        functionCatalog,
        tableConfig,
        streamExecutionEnvironment,
        new BatchPlanner(batchExecutor, tableConfig, functionCatalog, catalogManager),
        batchExecutor,
        false);
// configure HiveCatalog
String name = "myhive";
String defaultDatabase = "default";
String hiveConfDir = "/path/to/hive/conf"; // a local path
HiveCatalog hive = new HiveCatalog(name, defaultDatabase, hiveConfDir);
tableEnv.registerCatalog("myhive", hive);
tableEnv.useCatalog("myhive");
// request to Hive
Table table = tableEnv.sqlQuery("select * from myhive.`default`.test");
Question
At this step I can call the table.execute() method and then get a CloseableIterator via the collect() method. But in my case the query can return a large number of rows, so it would be ideal to collect them into a file (ORC in HDFS).
How can I reach my goal?
Table.execute().collect() returns the result of the view to your client side for interactive purposes. In your case, you can use the filesystem connector and use INSERT INTO to write the view to a file. For example:
// create a filesystem table
tableEnvironment.executeSql("CREATE TABLE MyUserTable (\n" +
" column_name1 INT,\n" +
" column_name2 STRING,\n" +
" ..." +
" \n" +
") WITH (\n" +
" 'connector' = 'filesystem',\n" +
" 'path' = 'hdfs://path/to/your/file',\n" +
" 'format' = 'orc' \n" +
")");
// submit the job
tableEnvironment.executeSql("insert into MyUserTable select * from myhive.`default`.test");
See more about the filesystem connector: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/filesystem.html
Related
I am trying to get an aggregate query to run using the MarkLogic Java API, and am having a bit of trouble. I have followed the documentation, but there is a requirement for a values constraint, and I'm not really sure what that is supposed to look like. I tried the following:
String options =
"<options xmlns:search=\"http://marklogic.com/appservices/search\">" +
" <values name=\"average\">" +
" <range type=\"xs:string\">" +
" <element ns=\"\" name=\"content-id\"/>" +
" </range>" +
" </values>" +
"</options>";
StringHandle handle = new StringHandle(options);
QueryOptionsManager optMgr = client.newServerConfigManager().newQueryOptionsManager();
optMgr.writeOptions("average", handle);
QueryManager queryManager = client.newQueryManager();
StructuredQueryBuilder queryBuilder = queryManager.newStructuredQueryBuilder();
ValuesDefinition valuesDefinition = queryManager.newValuesDefinition("average");
valuesDefinition.setAggregate("avg");
valuesDefinition.setQueryDefinition(queryBuilder.value(queryBuilder.element("content-id"),contentId));
ValuesHandle results = queryManager.values(valuesDefinition, new ValuesHandle());
I took a stab at the options based on some other options I'm using. However, when I try to write the options it tells me "Invalid Content: Unexpected Payload".
I get the feeling I'm going about this the wrong way. Essentially I want to find all documents that have a given value in the element "content-id", and then get the average of another element called "star-rating".
Should the options be set for "content-id" or "star-rating"? The documentation doesn't show the use of a queryDefinition; should I remove that? Modify it? Is there an easier way to do this in Java?
Edit: Forgot to mention, I also created an element range index on content-id with type string.
With the guidance of @SamMefford I was able to reach a solution. It looks like this:
String options =
"<options xmlns=\"http://marklogic.com/appservices/search\">" +
" <values name=\"star-rating\">" +
" <range type=\"xs:float\">" +
" <element ns=\"\" name=\"star-rating\"/>" +
" </range>" +
" </values>" +
"</options>";
StringHandle handle = new StringHandle(options);
QueryOptionsManager optMgr = client.newServerConfigManager().newQueryOptionsManager();
optMgr.writeOptions("star-rating", handle);
QueryManager queryManager = client.newQueryManager();
StructuredQueryBuilder queryBuilder =
queryManager.newStructuredQueryBuilder("star-rating");
ValuesDefinition valuesDefinition = queryManager.newValuesDefinition("star-rating");
valuesDefinition.setAggregate("avg");
valuesDefinition.setQueryDefinition(
queryBuilder.range(
queryBuilder.element("content-id"),"string",
StructuredQueryBuilder.Operator.EQ,contentId
)
);
String results = queryManager.values(valuesDefinition, new ValuesHandle())
.getAggregate("avg").getValue();
The options were created around the field that I wanted the average of. I created element indexes for both <star-rating> and <content-id>. The query then allowed me to filter to records with a specific content-id, and then get the average value of their star-ratings.
Code -
private static String getDBInstanceTag(AmazonRDS amazonRDS, String dbInstanceIdentifier, String region, String tagKey) {
    log.info("Trying to fetch dbInstanceIdentifier - " + dbInstanceIdentifier + " in db instance region - " + region);
    String arn = String.format("arn:aws:rds:" + region + ":%s:db:%s",
            SyncJobConstants.AWSProperties.AWS_ACCOUNT_NUMBER,
            dbInstanceIdentifier);
    ListTagsForResourceResult tagsList = amazonRDS.listTagsForResource(
            new ListTagsForResourceRequest().withResourceName(arn));
    for (Tag tag : tagsList.getTagList()) {
        if (tagKey.equalsIgnoreCase(tag.getKey())) {
            return tag.getValue();
        }
    }
    throw new InternalProcessingException(tagKey + " is not present in given dbInstance - " + tagsList);
}
public static String getDBInstanceTag(String dbInstanceIdentifier, String tagKey) throws IOException {
    AWSCredentials credentials = new PropertiesCredentials(
            RedshiftUtil.class.getClassLoader().getResourceAsStream("AWSCredentials.properties"));
    AmazonRDS amazonRDS = new AmazonRDSClient(credentials);
    DBInstance dbInstance = new DBInstance();
    dbInstance.setDBInstanceIdentifier(dbInstanceIdentifier);
    for (String region : SyncJobConstants.AWSProperties.RDS_REGIONS) {
        try {
            return getDBInstanceTag(amazonRDS, dbInstanceIdentifier, region, tagKey);
        } catch (DBInstanceNotFoundException e) {
            log.info("dbInstanceIdentifier - " + dbInstanceIdentifier + " is not present in db instance region - " + region);
        } catch (AmazonServiceException e) {
            if ("AccessDenied".equals(e.getErrorCode())) {
                log.info("dbInstanceIdentifier - " + dbInstanceIdentifier + " is not present in db instance region - " + region);
            } else {
                throw new InternalProcessingException("Not able to fetch dbInstance details from RDS. DBInstanceId - " + dbInstanceIdentifier, e);
            }
        }
    }
    throw new InvalidRequestException("RDS endpoint details is not correct.");
}
It throws an error for some of the calls even though the DB instance is there. Error detail -
Caused by: com.amazonaws.AmazonServiceException: The specified resource name does not match an RDS resource in this region. (Service: AmazonRDS; Status Code: 400; Error Code: InvalidParameterValue; Request ID: b0e01d56-36ca-11e6-8441-1968d9061f57)
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:310)
at com.amazonaws.services.rds.AmazonRDSClient.invoke(AmazonRDSClient.java:5197)
at com.amazonaws.services.rds.AmazonRDSClient.listTagsForResource(AmazonRDSClient.java:1997)
Can you please tell me what I am missing here?
Error meanings-
http://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html
Update 2
public static final List<String> RDS_REGIONS = Arrays.asList("us-east-1",
        "us-west-1",
        "us-west-2",
        "eu-west-1",
        "eu-central-1",
        "ap-northeast-1",
        "ap-northeast-2",
        "ap-southeast-1",
        "ap-southeast-2",
        "sa-east-1");
Seems like a region-related issue. Is your RDS instance located in the us-east-1 region? (That's the default region of the AWS SDK.)
Log into the AWS web console and confirm the region. Set the correct region and try again.
Reference: AWS Region Selection
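For illustration only, a minimal sketch of that idea against the question's code (it assumes the AWS SDK for Java v1 classes already in use; getDBInstanceTag and RDS_REGIONS are taken from the question):
import com.amazonaws.regions.Region;
import com.amazonaws.regions.Regions;

// Sketch: re-point the same client at each candidate region before the lookup,
// so that listTagsForResource resolves the ARN against that region.
for (String region : SyncJobConstants.AWSProperties.RDS_REGIONS) {
    amazonRDS.setRegion(Region.getRegion(Regions.fromName(region)));
    // ... then call getDBInstanceTag(amazonRDS, dbInstanceIdentifier, region, tagKey)
    //     and keep the per-region exception handling from the question
}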
Although this is an old question, I am sharing an idea that might help someone.
Since the user is using an AWS SDK for Java client, he can call describeDBInstances on the client:
AmazonRDS amazonRDS = new AmazonRDSClient(credentials);
DescribeDBInstancesResult describeDBInstancesResult = amazonRDS.describeDBInstances();
Then you can debug and inspect the describeDBInstancesResult to make sure that the DB is actually within the scope of the currently instantiated amazonRDS client.
If it's not there, it may be that the right region is not being passed into the client.
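A minimal sketch of that check, assuming the same SDK v1 client as above (the output format is only illustrative):
import com.amazonaws.services.rds.model.DBInstance;
import com.amazonaws.services.rds.model.DescribeDBInstancesResult;

// Sketch: print the instances the client can actually see in its configured region.
DescribeDBInstancesResult describeDBInstancesResult = amazonRDS.describeDBInstances();
for (DBInstance instance : describeDBInstancesResult.getDBInstances()) {
    System.out.println(instance.getDBInstanceIdentifier() + " (" + instance.getAvailabilityZone() + ")");
}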
When I try to start a task using taskService.start(task.getId(), "krisv");, I get "No query defined for that name [getAuditTaskById]". The BPMN file is very similar to the Evaluation.bpmn file. My current version of jBPM is 6.2.
The code snippet is the following:
List<TaskSummary> tasks = taskService.getTasksAssignedAsPotentialOwner("krisv", "en-UK");
if (tasks.size() > 0) {
    TaskSummary task = tasks.get(0);
    System.out.println("Task id: " + task.getId());
    System.out.println("'krisv' completing task " + task.getName() + ": " + task.getDescription());
    System.out.println("Task status: " + task.getStatus().name());
    System.out.println("Potential owners: " + task.getActualOwner().getId());
    taskService.start(task.getId(), "krisv");
    Map<String, Object> results = new HashMap<String, Object>();
    results.put("performance", "exceeding");
    taskService.complete(task.getId(), "krisv", results);
    System.out.println("Completed task");
} else {
    System.out.println("No tasks!");
}
The code above is almost a replica of the ProcessTest.java file in the sample folder. ProcessTest.java allows the completion of the tasks, but the exact same code doesn't work in my custom Java file.
Also, the current task's status is "Reserved", if that is of any help. Thanks!
The query is defined in the jbpm-human-task-audit jar; you need that on your classpath:
https://github.com/droolsjbpm/jbpm/blob/6.2.0.Final/jbpm-human-task/jbpm-human-task-audit/src/main/resources/META-INF/TaskAuditorm.xml#L40
And you need to make sure this file is referenced in your persistence.xml, for example like here:
https://github.com/droolsjbpm/jbpm/blob/6.2.0.Final/jbpm-test/src/main/resources/META-INF/persistence.xml#L15
I'm sure this question will be silly or annoying on multiple levels....
I am using SVNKit in Java.
I want to get the list of files committed in a particular commit. I have the revision ID. Normally I would run something like
svn log url/to/repository -qv -r12345
And I would get the list of changed files as normal.
I can't puzzle out how to do a similar thing in SVNKit. Any tips? :)
final SvnOperationFactory svnOperationFactory = new SvnOperationFactory();
final SvnLog log = svnOperationFactory.createLog();
log.setSingleTarget(SvnTarget.fromURL(url));
log.addRange(SvnRevisionRange.create(SVNRevision.create(12345), SVNRevision.create(12345)));
log.setDiscoverChangedPaths(true);
final SVNLogEntry logEntry = log.run();
final Map<String,SVNLogEntryPath> changedPaths = logEntry.getChangedPaths();
for (Map.Entry<String, SVNLogEntryPath> entry : changedPaths.entrySet()) {
    final SVNLogEntryPath svnLogEntryPath = entry.getValue();
    System.out.println(svnLogEntryPath.getType() + " " + svnLogEntryPath.getPath() +
            (svnLogEntryPath.getCopyPath() == null ?
                    "" : (" from " + svnLogEntryPath.getCopyPath() + ":" + svnLogEntryPath.getCopyRevision())));
}
If you want to run one log request for a revision range, you should use the log.setReceiver() call with your receiver implementation.
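For example, a minimal sketch under the assumption of the same SVNKit wc2 API as above (the revision numbers are placeholders, and ISvnObjectReceiver comes from org.tmatesoft.svn.core.wc2):
// Sketch: collect the changed paths of every revision in a range via a receiver.
final SvnLog rangeLog = svnOperationFactory.createLog();
rangeLog.setSingleTarget(SvnTarget.fromURL(url));
rangeLog.addRange(SvnRevisionRange.create(SVNRevision.create(12300), SVNRevision.create(12345)));
rangeLog.setDiscoverChangedPaths(true);
rangeLog.setReceiver(new ISvnObjectReceiver<SVNLogEntry>() {
    @Override
    public void receive(SvnTarget target, SVNLogEntry logEntry) throws SVNException {
        System.out.println("r" + logEntry.getRevision() + ": " + logEntry.getChangedPaths().keySet());
    }
});
rangeLog.run();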
I'm trying to get a list of workflows the document is attached to in an Alfresco webscript, but I am kind of stuck.
My original problem is that I have a list of files, and the current user may have workflows assigned to him with these documents. So, now I want to create a webscript that will look in a folder, take all the documents there, and assemble a list of documents together with task references, if there are any for the current user.
I know about the "workflow" object that gives me the list of workflows for the current user, but this is not a solution for my problem.
So, can I get a list of workflows a specific document is attached to?
Well, for future reference, I've found a way to get all the active workflows on a document from JavaScript:
var nodeR = search.findNode('workspace://SpacesStore/'+doc.nodeRef);
for each ( wf in nodeR.activeWorkflows )
{
// Do whatever here.
}
I used the packageContains association to find workflows for a document.
Below I posted Alfresco JavaScript code for active workflows (as zladuric answered) and also for all workflows:
/*global search, logger, workflow*/
var getWorkflowsForDocument, getActiveWorkflowsForDocument;
getWorkflowsForDocument = function () {
    "use strict";
    var doc, parentAssocs, packages, packagesLen, i, pack, props, workflowId, instance, isActive;
    //
    doc = search.findNode("workspace://SpacesStore/8847ea95-108d-4e08-90ab-34114e7b3977");
    parentAssocs = doc.getParentAssocs();
    packages = parentAssocs["{http://www.alfresco.org/model/bpm/1.0}packageContains"];
    //
    if (packages) {
        packagesLen = packages.length;
        //
        for (i = 0; i < packagesLen; i += 1) {
            pack = packages[i];
            props = pack.getProperties();
            workflowId = props["{http://www.alfresco.org/model/bpm/1.0}workflowInstanceId"];
            instance = workflow.getInstance(workflowId);
            /* instance is org.alfresco.repo.workflow.jscript.JscriptWorkflowInstance */
            isActive = instance.isActive();
            logger.log(" + instance: " + workflowId + " (active: " + isActive + ")");
        }
    }
};
getActiveWorkflowsForDocument = function () {
    "use strict";
    var doc, activeWorkflows, activeWorkflowsLen, i, instance;
    //
    doc = search.findNode("workspace://SpacesStore/8847ea95-108d-4e08-90ab-34114e7b3977");
    activeWorkflows = doc.activeWorkflows;
    activeWorkflowsLen = activeWorkflows.length;
    for (i = 0; i < activeWorkflowsLen; i += 1) {
        instance = activeWorkflows[i];
        /* instance is org.alfresco.repo.workflow.jscript.JscriptWorkflowInstance */
        logger.log(" - instance: " + instance.getId() + " (active: " + instance.isActive() + ")");
    }
};
getWorkflowsForDocument();
getActiveWorkflowsForDocument();
Unfortunately the JavaScript API doesn't expose all the workflow functions. It looks like getting the list of workflow instances that are attached to a document only works in Java (or Java-backed webscripts).
List<WorkflowInstance> workflows = workflowService.getWorkflowsForContent(node.getNodeRef(), true);
A usage of this can be found in the workflow list in the document details: http://svn.alfresco.com/repos/alfresco-open-mirror/alfresco/HEAD/root/projects/web-client/source/java/org/alfresco/web/ui/repo/component/UINodeWorkflowInfo.java
To get to the users who have tasks assigned, you would then need to use the getWorkflowPaths and getTasksForWorkflowPath methods of the WorkflowService.
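A rough sketch of how those calls could fit together in a Java-backed webscript (the accessor and property names here are assumptions based on the public WorkflowService API, not code from the original answer):
// Sketch: for each workflow attached to the node, walk its paths and list the tasks.
List<WorkflowInstance> workflows = workflowService.getWorkflowsForContent(node.getNodeRef(), true);
for (WorkflowInstance wf : workflows) {
    for (WorkflowPath path : workflowService.getWorkflowPaths(wf.getId())) {
        for (WorkflowTask task : workflowService.getTasksForWorkflowPath(path.getId())) {
            // the task properties map holds details such as the owner/assignee
            System.out.println(wf.getId() + " -> " + task.getName()
                    + ", owner: " + task.getProperties().get(ContentModel.PROP_OWNER));
        }
    }
}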