Defining new schema for Spark Rows - java

I have a DataFrame, and one of its columns contains a string of JSON. So far, I've implemented the Function interface required by the JavaRDD.map method, Function<Row, Row>. Within this function I parse the JSON and create a new row whose additional columns come from values in the JSON. For example:
Original row:
+------+-----------------------------------+
| id   | json                              |
+------+-----------------------------------+
| 1    | {"id":"abcd", "name":"dmux",...}  |
+------+-----------------------------------+
After applying my function:
+------+----------+-----------+
| id   | json_id  | json_name |
+------+----------+-----------+
| 1    | abcd     | dmux      |
+------+----------+-----------+
I'm running into trouble when trying to create a new DataFrame from the returned JavaRDD. Now that I have these new rows, I need to create a schema for them. The schema depends heavily on the structure of the JSON, so I'm trying to figure out a way of passing schema data back from the function along with the Row object. I can't use broadcast variables, because the SparkContext doesn't get passed into the function.
Other than looping through each column of a row in the caller of the Function, what options do I have?

You can create a StructType. This is Scala, but it would work the same way:
val newSchema = StructType(Array(
  StructField("id", LongType, false),
  StructField("json_id", StringType, false),
  StructField("json_name", StringType, false)
))
val newDf = sqlContext.createDataFrame(rdd, newSchema)
Incidentally, you need to make sure your rdd is of type RDD[Row].
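Since the question is Java, here is a rough Java equivalent - a minimal sketch assuming Spark 1.x, the sqlContext from the answer, and a JavaRDD<Row> (called rowRdd here) produced by your Function<Row, Row>:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Build the schema programmatically; the fields mirror the Scala example.
StructType newSchema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("id", DataTypes.LongType, false),
    DataTypes.createStructField("json_id", DataTypes.StringType, false),
    DataTypes.createStructField("json_name", DataTypes.StringType, false)
});

// rowRdd is the JavaRDD<Row> returned by your map over the original rows.
DataFrame newDf = sqlContext.createDataFrame(rowRdd, newSchema);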

Related

Java generate queryDSL

So I have a table named BLOG_REPORTERS like:
blog_id | reporter_id
--------+------------
      1 |           1
      2 |           3
And REPORTER:
reporter_id | name | etc
------------+------+-----
          1 | asd  | etc
I need to perform a JOIN operation between them. I generated QueryDSL classes for them and tried something like:
new JPAQuery<>(entityManager)
    .select(QReporter.reporter)
    .from(QReporter.reporter)
    .join(QBlogReporters.blogreporters)
but this is wrong, because the join() method accepts an EntityPath<P> while QBlogReporters extends BeanPath<T>.
Is there any way to do this?
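No answer is included in this excerpt, but a common approach is to avoid joining the link table directly: map BLOG_REPORTERS as a @ManyToMany collection on Blog and join through the association. A sketch under that assumption (QBlog, the reporters collection, and the generated classes are assumptions, not from the question):
// Assumes Blog maps BLOG_REPORTERS as e.g.
//   @ManyToMany @JoinTable(name = "BLOG_REPORTERS", ...)
//   List<Reporter> reporters;
// so QueryDSL generates a collection path and the join table needs no entity.
QBlog blog = QBlog.blog;
QReporter reporter = QReporter.reporter;

List<Reporter> reporters = new JPAQuery<Reporter>(entityManager)
    .select(reporter)
    .from(blog)
    .join(blog.reporters, reporter) // joins through BLOG_REPORTERS
    .fetch();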

spark scala : Convert DataFrame OR Dataset to single comma separated string

Below is Spark Scala code that prints a one-column Dataset[Row]:
import org.apache.spark.sql.{Dataset, Row, SparkSession}

val spark: SparkSession = SparkSession.builder()
  .appName("Spark DataValidation")
  .config("SPARK_MAJOR_VERSION", "2")
  .enableHiveSupport()
  .getOrCreate()

val kafkaPath: String = "hdfs:///landing/APPLICATION/*"
val targetPath: String = "hdfs://datacompare/3"
val pk: String = "APPLICATION_ID"

val pkValues = spark
  .read
  .json(kafkaPath)
  .select("message.data.*")
  .select(pk)
  .distinct()
pkValues.show()
Output of the above code:
+--------------+
|APPLICATION_ID|
+--------------+
| 388|
| 447|
| 346|
| 861|
| 361|
| 557|
| 482|
| 518|
| 432|
| 422|
| 533|
| 733|
| 472|
| 457|
| 387|
| 394|
| 786|
| 458|
+--------------+
Question:
How do I convert this DataFrame to a comma-separated String variable?
Expected output :
val data:String= "388,447,346,861,361,557,482,518,432,422,533,733,472,457,387,394,786,458"
Please suggest how to convert a DataFrame[Row] or Dataset to a single String.
I don't think that's a good idea, since a DataFrame is a distributed object and can be immense. collect will bring all the data to the driver, so you should perform this kind of operation carefully.
Here is what you can do with a DataFrame (two options):
df.select("APPLICATION_ID").rdd.map(r => r(0)).collect.mkString(",")
df.select("APPLICATION_ID").collect.mkString(",")
Result of the first option with a test DataFrame of only 3 rows:
String = 388,447,346
(Note that the second option collects Rows, so its output keeps the Row brackets: [388],[447],[346].)
Edit: With a Dataset you can do it directly:
ds.collect.mkString(",")
Use collect_list:
import org.apache.spark.sql.functions._

val data = pkValues.select(collect_list(col(pk))) // collect to one row
  .as[Array[Long]] // set the encoder, so you will have a strongly-typed Dataset
  .take(1)(0)      // get the first row - the result will be an Array[Long]
  .mkString(",")   // and join all values
However, it's quite a bad idea to collect or take all rows. Instead, you may want to save pkValues somewhere with .write, or pass it as an argument to another function, to keep the computation distributed.
Edit: I just noticed that @SCouto posted another answer right after me. Collect will also be correct; with the collect_list function you have one advantage - you can easily add grouping if you want to, e.g. group keys into even and odd ones. It's up to you which solution you prefer: the simpler one with collect, or the one-line-longer but more powerful one.
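For completeness, a minimal Java sketch of the same collect-and-join idea (assuming the Spark 2.x Dataset API and the pkValues dataset from the question); the cast to string sidesteps the column's inferred numeric type, and the same driver-memory caveat applies:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

// pkValues is the deduplicated one-column Dataset<Row> from the question.
String data = String.join(",",
        pkValues.select(col("APPLICATION_ID").cast("string"))
                .as(Encoders.STRING())
                .collectAsList());
// data == "388,447,346,..." - collected to the driver, so keep it small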

Postgresql Array Functions with QueryDSL

I use Vlad Mihalcea's library to map SQL arrays (PostgreSQL in my case) to JPA. Now let's imagine I have an entity, e.g.:
@TypeDefs(
    {@TypeDef(name = "string-array", typeClass = StringArrayType.class)}
)
@Entity
public class Entity {
    @Type(type = "string-array")
    @Column(columnDefinition = "text[]")
    private String[] tags;
}
The appropriate SQL is:
CREATE TABLE entity (
    tags text[]
);
Using QueryDSL, I'd like to fetch rows whose tags contain all the given ones. The raw SQL could be:
SELECT * FROM entity WHERE tags #> '{"someTag","anotherTag"}'::text[];
(taken from: https://www.postgresql.org/docs/9.1/static/functions-array.html)
Is it possible to do this with QueryDSL? Something like the code below?
predicate.and(entity.tags.eqAll(<whatever>));
The 1st step is to generate the proper SQL: WHERE tags #> '{"someTag","anotherTag"}'::text[];
The 2nd step is described by coladict (thanks a lot!): figure out which functions back the operators: #> is arraycontains and ::text[] is string_to_array.
The 3rd step is to call them properly. After hours of debugging I figured out that HQL doesn't treat functions as functions unless I add an expression sign (in my case: ... = true), so the final solution looks like this:
predicate.and(
    Expressions.booleanTemplate("arraycontains({0}, string_to_array({1}, ',')) = true",
        entity.tags,
        tagsStr)
);
where tagsStr is a String of values separated by commas.
Since you can't use custom operators, you will have to use their functional equivalents. You can look them up in the psql console with \doS+. For \doS+ #> we get several results, but this is the one you want:
List of operators
   Schema   | Name | Left arg type | Right arg type | Result type |   Function    | Description
------------+------+---------------+----------------+-------------+---------------+-------------
 pg_catalog | #>   | anyarray      | anyarray       | boolean     | arraycontains | contains
It tells us the function used is called arraycontains, so now we look up that function to see its parameters using \df arraycontains:
List of functions
   Schema   |     Name      | Result data type | Argument data types |  Type
------------+---------------+------------------+---------------------+--------
 pg_catalog | arraycontains | boolean          | anyarray, anyarray  | normal
From here, we transform the target query you're aiming for into:
SELECT * FROM entity WHERE arraycontains(tags, '{"someTag","anotherTag"}'::text[]);
You should then be able to use the builder's function call to create this condition (sketched here assuming a Criteria root for Entity; note that the function name, arraycontains, goes in the first argument):
ParameterExpression<String[]> tags = cb.parameter(String[].class);
Expression<Boolean> tagcheck =
    cb.function("arraycontains", Boolean.class, root.get(Entity_.tags), tags);
Though I use a different array solution (might publish soon), I believe it should work, unless there are bugs in the underlying implementation.
An alternative to this method would be to compile the escaped string format of the array and pass it on as the second parameter. It's easier to print if you don't treat the double quotes as optional. In that event, you have to replace String[] with String in the ParameterExpression line above.
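For illustration, a hypothetical helper (not part of the answer) that renders such an escaped Postgres array literal from a list of tags, always double-quoting as suggested:
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: renders {"someTag","anotherTag"} from a list of
// tags, quoting every element and escaping embedded double-quotes.
static String toPgArrayLiteral(List<String> values) {
    return values.stream()
            .map(v -> "\"" + v.replace("\"", "\\\"") + "\"")
            .collect(Collectors.joining(",", "{", "}"));
}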
For EclipseLink I created a function:
CREATE OR REPLACE FUNCTION check_array(array_val text[], string_comma character varying) RETURNS bool AS $$
BEGIN
    RETURN arraycontains(array_val, string_to_array(string_comma, ','));
END;
$$ LANGUAGE plpgsql;
As pointed out by Serhii, you can then use Expressions.booleanTemplate("FUNCTION('check_array', {0}, {1}) = true", entity.tags, tagsStr).

Stored procedures in jooq with dynamic names

I want to call PostgreSQL stored procedures from jOOQ by name, dynamically:
final Field function = function("report_" + name, Object.class,
    (Field[]) params.toArray(new Field[params.size()]));
dsl().select(function).fetchArrays();
For example it generates:
select report_total_requests('83.84.85.3184');
Which returns:
report_total_requests
-----------------------
(3683,2111,0)
(29303,10644,1)
In Java, that comes back as an array of "(3683,2111,0)" objects.
I want to generate:
select * from report_total_requests('83.84.85.3184')
To produce:
total | users | priority
------+-------+---------
 3683 |  2111 |        0
29303 | 10644 |        1
That is, in Java, an array of arrays of objects.
Any ideas?
The way forward is to use plain SQL as follows:
Name name = DSL.name("report_" + name);
QueryPart arguments = DSL.list(params);
dsl().select().from("{0}({1})", name, arguments).fetch();
Note, I've wrapped the function name in a DSL.name() object, to prevent SQL injection.
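For illustration, a hedged sketch of the complete call with concrete stand-in values (the function name and argument are hypothetical; jOOQ 3.x assumed), fetching each row as an Object[]:
import org.jooq.Name;
import org.jooq.QueryPart;
import org.jooq.Record;
import org.jooq.Result;
import static org.jooq.impl.DSL.*;

// Hypothetical concrete values standing in for the dynamic parts.
Name fn = name("report_total_requests");
QueryPart args = list(val("83.84.85.3184"));

// Renders: select * from "report_total_requests"(?)
Result<Record> result = dsl().select().from("{0}({1})", fn, args).fetch();
for (Record r : result) {
    Object[] row = r.intoArray(); // e.g. [3683, 2111, 0]
}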

How to delete an item in DynamoDB using Java?

I know this sounds like a simple question, but for some reason I can't find a clear answer online or on Stack Overflow.
I have a DynamoDB with a Table named "ABC". The primary key is "ID" as a String and one of the other attributes is "Name" as a String. How can I delete an item from this table using Java?
AmazonDynamoDBClient dynamoDB;
// ...
DeleteItemRequest dir = new DeleteItemRequest();
dir.withConditionExpression("ID = 214141").withTableName("ABC");
DeleteItemResult deleteResult = dynamoDB.deleteItem(dir);
I get a validation exception:
Exception in thread "main" com.amazonaws.AmazonServiceException: 1 validation error detected: Value null at 'key' failed to satisfy constraint: Member must not be null (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: RQ70OIGOQAJ9MRGSUA0UIJLRUNVV4KQNSO5AEMVJF66Q9ASUAAJG)
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1160)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:748)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:467)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:302)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:3240)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.deleteItem(AmazonDynamoDBClient.java:972)
at DynamoDBUploader.deleteItems(DynamoDBUploader.java:168)
at Main.main(Main.java:56)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
If I need to know the hash key in order to delete an item from a DynamoDB table, then I think I may need to redesign my database to delete items efficiently. My table looks like this:
ID | Name         | Date       | Value
---+--------------+------------+------
 1 | TransactionA | 2015-06-21 | 30
 2 | TransactionB | 2015-06-21 | 40
 3 | TransactionC | 2015-06-21 | 50
Basically, I would like to easily delete all transactions with Date "2015-06-21". How can I do this simply and quickly without having to deal with the Hash Key ID?
AWS DynamoDB knows which attribute is the hash key of your table; you just need to specify its value for the item to be deleted. Note that your stack trace is from the v2 SDK, where the key is a plain map from attribute name to value, and DeleteItemRequest has a fluent API for it:
Map<String, AttributeValue> keyToDelete =
    Collections.singletonMap("ID", new AttributeValue("214141"));
DeleteItemRequest dir = new DeleteItemRequest()
    .withTableName("ABC")
    .withKey(keyToDelete);
For Kotlin:
I have a table with:
Partition key: account_id (String)
Sort key: message_id (String)
To delete an item from DynamoDB I do the following:
fun deleteMessageById(messageId: String, accountId: String) {
    val item = HashMap<String, AttributeValue>()
    item["account_id"] = AttributeValue(accountId)
    item["message_id"] = AttributeValue(messageId)
    val deleteRequest = DeleteItemRequest().withTableName(tableName).withKey(item)
    dynamoConfiguration.amazonDynamoDB().deleteItem(deleteRequest)
}
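As for the second half of the question - deleting every transaction with Date "2015-06-21" without knowing the IDs up front - a hedged Java sketch (attribute names taken from the question's table) is to scan for matching items and delete each one by its key. A scan reads the whole table, so a global secondary index on Date plus a Query would scale better:
import java.util.Collections;
import java.util.Map;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.DeleteItemRequest;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;

// "Date" is a reserved word in DynamoDB expressions, hence the #dt alias.
ScanRequest scan = new ScanRequest()
    .withTableName("ABC")
    .withFilterExpression("#dt = :d")
    .withExpressionAttributeNames(Collections.singletonMap("#dt", "Date"))
    .withExpressionAttributeValues(
        Collections.singletonMap(":d", new AttributeValue("2015-06-21")));

// Delete each matching item by its hash key.
for (Map<String, AttributeValue> item : dynamoDB.scan(scan).getItems()) {
    dynamoDB.deleteItem(new DeleteItemRequest()
        .withTableName("ABC")
        .withKey(Collections.singletonMap("ID", item.get("ID"))));
}
// Large tables return paged results; keep scanning while
// getLastEvaluatedKey() is non-null.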
