Alternative of Select DATE_FORMAT(date, format) in Apache Spark - java

I am using Apache-Spark SQL and Java to read from parquet file. The file contains a date column(M/d/yyyy) and I want to change that to some other format(yyyy-dd-MM). Something like Select DATE_FORMAT(date, format) that we can do in mysql. Is there any similar method in Apache-Spark?

What you can do is parse the string using to_timestamp with your current schema and format it with the one you desire using date_format:
val df = Seq("1/1/2015", "02/10/2014", "4/30/2010", "03/7/2015").toDF("d")
df.select('d, date_format(to_timestamp('d, "MM/dd/yyyy"), "yyyy-dd-MM") as "new_d")
.show
+----------+----------+
| d| new_d|
+----------+----------+
| 1/1/2015|2015-01-01|
|02/10/2014|2014-10-02|
| 4/30/2010|2010-30-04|
| 03/7/2015|2015-07-03|
+----------+----------+
Note that the parsing is pretty robust and supports single digit days and months.

Related

spark scala : Convert DataFrame OR Dataset to single comma separated string

Below is the spark scala code which will print one column DataSet[Row]:
import org.apache.spark.sql.{Dataset, Row, SparkSession}
val spark: SparkSession = SparkSession.builder()
.appName("Spark DataValidation")
.config("SPARK_MAJOR_VERSION", "2").enableHiveSupport()
.getOrCreate()
val kafkaPath:String="hdfs:///landing/APPLICATION/*"
val targetPath:String="hdfs://datacompare/3"
val pk:String = "APPLICATION_ID"
val pkValues = spark
.read
.json(kafkaPath)
.select("message.data.*")
.select(pk)
.distinct()
pkValues.show()
Output of about code :
+--------------+
|APPLICATION_ID|
+--------------+
| 388|
| 447|
| 346|
| 861|
| 361|
| 557|
| 482|
| 518|
| 432|
| 422|
| 533|
| 733|
| 472|
| 457|
| 387|
| 394|
| 786|
| 458|
+--------------+
Question :
How to convert this dataframe to comma separated String variable ?
Expected output :
val data:String= "388,447,346,861,361,557,482,518,432,422,533,733,472,457,387,394,786,458"
Please suggest how to convert DataFrame[Row] or Dataset to one String .
I don't think that's a good idea, since a dataFrame is a distributed object and can be inmense. Collect will bring all the data to the driver, so you should perform this kind operation carefully.
Here is what you can do with a dataFrame (two options):
df.select("APPLICATION_ID").rdd.map(r => r(0)).collect.mkString(",")
df.select("APPLICATION_ID").collect.mkString(",")
Result with a test dataFrame with only 3 rows:
String = 388,447,346
Edit: With DataSet you can do directly:
ds.collect.mkString(",")
Use collect_list:
import org.apache.spark.sql.functions._
val data = pkValues.select(collect_list(col(pk))) // collect to one row
.as[Array[Long]] // set encoder, so you will have strongly-typed Dataset
.take(1)(0) // get the first row - result will be Array[Long]
.mkString(",") // and join all values
However, it's quite a bad idea to perform collect or take of all rows. Instead, you may want to save pkValues somewhere with .write? Or make it an argument to other function, to keep distributed computing
Edit: Just noticed, that #SCouto posted other answer just after me. Collect will also be correct, with collect_list function you have one advantage - you can easily go grouping if you want and i.e. group keys to even and odd ones. It's up to you which solution you prefer, simpler with collect or one line longer, but more powerful

Postgresql Array Functions with QueryDSL

I use the Vlad Mihalcea's library in order to map SQL arrays (Postgresql in my case) to JPA. Then let's imagine I have an Entity, ex.
#TypeDefs(
{#TypeDef(name = "string-array", typeClass =
StringArrayType.class)}
)
#Entity
public class Entity {
#Type(type = "string-array")
#Column(columnDefinition = "text[]")
private String[] tags;
}
The appropriate SQL is:
CREATE TABLE entity (
tags text[]
);
Using QueryDSL I'd like to fetch rows which tags contains all the given ones. The raw SQL could be:
SELECT * FROM entity WHERE tags #> '{"someTag","anotherTag"}'::text[];
(taken from: https://www.postgresql.org/docs/9.1/static/functions-array.html)
Is it possible to do it with QueryDSL? Something like the code bellow ?
predicate.and(entity.tags.eqAll(<whatever>));
1st step is to generate proper sql: WHERE tags #> '{"someTag","anotherTag"}'::text[];
2nd step is described by coladict (thanks a lot!): figure out the functions which are called: #> is arraycontains and ::text[] is string_to_array
3rd step is to call them properly. After hours of debug I figured out that HQL doesn't treat functions as functions unless I added an expression sign (in my case: ...=true), so the final solution looks like this:
predicate.and(
Expressions.booleanTemplate("arraycontains({0}, string_to_array({1}, ',')) = true",
entity.tags,
tagsStr)
);
where tagsStr - is a String with values separated by ,
Since you can't use custom operators, you will have to use their functional equivalents. You can look them up in the psql console with \doS+. For \doS+ #> we get several results, but this is the one you want:
List of operators
Schema | Name | Left arg type | Right arg type | Result type | Function | Description
------------+------+---------------+----------------+-------------+---------------------+-------------
pg_catalog | #> | anyarray | anyarray | boolean | arraycontains | contains
It tells us the function used is called arraycontains, so now we look-up that function to see it's parameters using \df arraycontains
List of functions
Schema | Name | Result data type | Argument data types | Type
------------+---------------+------------------+---------------------+--------
pg_catalog | arraycontains | boolean | anyarray, anyarray | normal
From here, we transform the target query you're aiming for into:
SELECT * FROM entity WHERE arraycontains(tags, '{"someTag","anotherTag"}'::text[]);
You should then be able to use the builder's function call to create this condition.
ParameterExpression<String[]> tags = cb.parameter(String[].class);
Expression<Boolean> tagcheck = cb.function("Flight_.id", Boolean.class, Entity_.tags, tags);
Though I use a different array solution (might publish soon), I believe it should work, unless there are bugs in the underlying implementation.
An alternative to method would be to compile the escaped string format of the array and pass it on as the second parameter. It's easier to print if you don't treat the double-quotes as optional. In that event, you have to replace String[] with String in the ParameterExpression row above
For EclipseLink I created a function
CREATE OR REPLACE FUNCTION check_array(array_val text[], string_comma character varying ) RETURNS bool AS $$
BEGIN
RETURN arraycontains(array_val, string_to_array(string_comma, ','));
END;
$$ LANGUAGE plpgsql;
As pointed out by Serhii, then you can useExpressions.booleanTemplate("FUNCTION('check_array', {0}, {1}) = true", entity.tags, tagsStr)

Neo4j CSV Import Type Error

I'm trying to import a CSV file into Neo4j (Community Edition V 2.3.2). The CSV is structured like this:
id,title,value,updated
123456,"title 1",10,20160407
123457,"title 2",11,20160405
The CSV path is set within the Neo4j properties file.
When I use the following import statement
LOAD CSV WITH HEADERS FROM
'file:///test.csv' AS line
CREATE (:Title_Node { title: line[1], identifier: toInt(line[0]), value: line[3]})
I receive the following error message:
WARNING: Expected 1 to be a java.lang.String, but it was a java.lang.Long
When I just query the test.csv file with
LOAD CSV WITH HEADERS FROM 'file:///test.csv'
AS line
RETURN line.title, line.id, line.value;
Cypher can access the data without any problem.
+------------------------------------+
| line.title | line.id | line.value |
+------------------------------------+
| "title 1" | "123456" | "10" |
| "title 2" | "123457" | "11" |
+------------------------------------+
The effect occurs in the browser as well as in the shell.
I found the following question at Having `Neo.ClientError.Statement.InvalidType` in Neo4j and tried the hints mentioned in the Neo4j Link posted in this answer, but with little success. The CSV file itself seems to be ok by structure (UTF8, no hidden entries etc.).
Every help in solving this is greatly appreciated.
Best
Krid
You're supplying the fields for the line header, so use them in the import -
LOAD CSV WITH HEADERS FROM
'file:///test.csv' AS line
CREATE (:Title_Node { title: line.title, identifier: line.id, value: line.value})
This is a classic example of an error message which is entirely correct, but not very helpful!
You can supply literals to line indexes if you prefer-
LOAD CSV WITH HEADERS FROM
'file:///test.csv' AS line
CREATE (:Title_Node { title: line['title'], identifier: line['id'], value: line['value']})

Yahoo Finance Stream Parsing to insert into MySQL database

I'm using the Yahoo Finance Streaming API to get stock quotes, I wanted to save these into a DB table for historical reference.
I'm looking for something which can easily parse various strings which have a format that varies like the examples below:
<script>try{parent.yfs_mktmcb({"unixtime":1310957222});}catch(e){}</script>
<script>try{parent.yfs_u1f({"ASX.AX":{c10:"-0.06"}});}catch(e){}</script>
<script>try{parent.yfs_u1f({"AWC.AX":{l10:"2.16",c10:"+0.01",p20:"+0.47"}});}catch(e){}</script>
<script>try{parent.yfs_u1f({"ALZ.AX":{l10:"2.6900",c10:"-0.1200",p20:"-4.27"}});}catch(e){}</script>
I want to parse these strings to a MySQL database and I was thinking the easiest way will be using Java to do this parsing. Basically these entries are line by line in a text file. I want to extract the time, the stock code, the price and the change values in a simple table.
The table looks like StockCode | Date | Time | Price | ChangeDol | ChangePer
Are there any tools or frameworks which would make this process easy?
Thanks!
I don't how you get your quote, but if you could use YQL, any XML parser would do:
YQL
<quote symbol="YHOO">
<Ask>14.76</Ask>
<AverageDailyVolume>28463800</AverageDailyVolume>
<Bid>14.51</Bid>
<AskRealtime>14.76</AskRealtime>
<BidRealtime>14.51</BidRealtime>
<BookValue>9.826</BookValue>
<Change_PercentChange>0.00 - 0.00%</Change_PercentChange>
....
</quote>
List of XML Parsers for Java
You could have a look there
http://www.wikijava.org/wiki/Downloading_stock_market_quotes_from_Yahoo!_finance
They get finencial data as csv from yahoo.

handling DATETIME values 0000-00-00 00:00:00 in JDBC

I get an exception (see below) if I try to do
resultset.getString("add_date");
for a JDBC connection to a MySQL database containing a DATETIME value of 0000-00-00 00:00:00 (the quasi-null value for DATETIME), even though I'm just trying to get the value as string, not as an object.
I got around this by doing
SELECT CAST(add_date AS CHAR) as add_date
which works, but seems silly... is there a better way to do this?
My point is that I just want the raw DATETIME string, so I can parse it myself as is.
note: here's where the 0000 comes in: (from http://dev.mysql.com/doc/refman/5.0/en/datetime.html)
Illegal DATETIME, DATE, or TIMESTAMP
values are converted to the “zero”
value of the appropriate type
('0000-00-00 00:00:00' or
'0000-00-00').
The specific exception is this one:
SQLException: Cannot convert value '0000-00-00 00:00:00' from column 5 to TIMESTAMP.
SQLState: S1009
VendorError: 0
java.sql.SQLException: Cannot convert value '0000-00-00 00:00:00' from column 5 to TIMESTAMP.
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1055)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:926)
at com.mysql.jdbc.ResultSetImpl.getTimestampFromString(ResultSetImpl.java:6343)
at com.mysql.jdbc.ResultSetImpl.getStringInternal(ResultSetImpl.java:5670)
at com.mysql.jdbc.ResultSetImpl.getString(ResultSetImpl.java:5491)
at com.mysql.jdbc.ResultSetImpl.getString(ResultSetImpl.java:5531)
Alternative answer, you can use this JDBC URL directly in your datasource configuration:
jdbc:mysql://yourserver:3306/yourdatabase?zeroDateTimeBehavior=convertToNull
Edit:
Source: MySQL Manual
Datetimes with all-zero components (0000-00-00 ...) — These values can not be represented reliably in Java. Connector/J 3.0.x always converted them to NULL when being read from a ResultSet.
Connector/J 3.1 throws an exception by default when these values are encountered as this is the most correct behavior according to the JDBC and SQL standards. This behavior can be modified using the zeroDateTimeBehavior configuration property. The allowable values are:
exception (the default), which throws an SQLException with an SQLState of S1009.
convertToNull, which returns NULL instead of the date.
round, which rounds the date to the nearest closest value which is 0001-01-01.
Update: Alexander reported a bug affecting mysql-connector-5.1.15 on that feature. See CHANGELOGS on the official website.
I stumbled across this attempting to solve the same issue. The installation I am working with uses JBOSS and Hibernate, so I had to do this a different way. For the basic case, you should be able to add zeroDateTimeBehavior=convertToNull to your connection URI as per this configuration properties page.
I found other suggestions across the land referring to putting that parameter in your hibernate config:
In hibernate.cfg.xml:
<property name="hibernate.connection.zeroDateTimeBehavior">convertToNull</property>
In hibernate.properties:
hibernate.connection.zeroDateTimeBehavior=convertToNull
But I had to put it in my mysql-ds.xml file for JBOSS as:
<connection-property name="zeroDateTimeBehavior">convertToNull</connection-property>
Hope this helps someone. :)
My point is that I just want the raw DATETIME string, so I can parse it myself as is.
That makes me think that your "workaround" is not a workaround, but in fact the only way to get the value from the database into your code:
SELECT CAST(add_date AS CHAR) as add_date
By the way, some more notes from the MySQL documentation:
MySQL Constraints on Invalid Data:
Before MySQL 5.0.2, MySQL is forgiving of illegal or improper data values and coerces them to legal values for data entry. In MySQL 5.0.2 and up, that remains the default behavior, but you can change the server SQL mode to select more traditional treatment of bad values such that the server rejects them and aborts the statement in which they occur.
[..]
If you try to store NULL into a column that doesn't take NULL values, an error occurs for single-row INSERT statements. For multiple-row INSERT statements or for INSERT INTO ... SELECT statements, MySQL Server stores the implicit default value for the column data type.
MySQL 5.x Date and Time Types:
MySQL also allows you to store '0000-00-00' as a “dummy date” (if you are not using the NO_ZERO_DATE SQL mode). This is in some cases more convenient (and uses less data and index space) than using NULL values.
[..]
By default, when MySQL encounters a value for a date or time type that is out of range or otherwise illegal for the type (as described at the beginning of this section), it converts the value to the “zero” value for that type.
DATE_FORMAT(column name, '%Y-%m-%d %T') as dtime
Use this to avoid the error. It return the date in string format and then you can get it as a string.
resultset.getString("dtime");
This actually does NOT work. Even though you call getString. Internally mysql still tries to convert it to date first.
at com.mysql.jdbc.ResultSetImpl.getDateFromString(ResultSetImpl.java:2270)
~[mysql-connector-java-5.1.15.jar:na] at com.mysql.jdbc.ResultSetImpl.getStringInternal(ResultSetImpl.java:5743)
~[mysql-connector-java-5.1.15.jar:na] at com.mysql.jdbc.ResultSetImpl.getString(ResultSetImpl.java:5576)
~[mysql-connector-java-5.1.15.jar:na]
If, after adding lines:
<property
name="hibernate.connection.zeroDateTimeBehavior">convertToNull</property>
hibernate.connection.zeroDateTimeBehavior=convertToNull
<connection-property
name="zeroDateTimeBehavior">convertToNull</connection-property>
continues to be an error:
Illegal DATETIME, DATE, or TIMESTAMP values are converted to the “zero” value of the appropriate type ('0000-00-00 00:00:00' or '0000-00-00').
find lines:
1) resultSet.getTime("time"); // time = 00:00:00
2) resultSet.getTimestamp("timestamp"); // timestamp = 00000000000000
3) resultSet.getDate("date"); // date = 0000-00-00 00:00:00
replace with the following lines, respectively:
1) Time.valueOf(resultSet.getString("time"));
2) Timestamp.valueOf(resultSet.getString("timestamp"));
3) Date.valueOf(resultSet.getString("date"));
I wrestled with this problem and implemented the 'convertToNull' solutions discussed above. It worked in my local MySql instance. But when I deployed my Play/Scala app to Heroku it no longer would work. Heroku also concatenates several args to the DB URL that they provide users, and this solution, because of Heroku's use concatenation of "?" before their own set of args, will not work. However I found a different solution which seems to work equally well.
SET sql_mode = 'NO_ZERO_DATE';
I put this in my table descriptions and it solved the problem of '0000-00-00 00:00:00' can not be represented as java.sql.Timestamp
I suggest you use null to represent a null value.
What is the exception you get?
BTW:
There is no year called 0 or 0000. (Though some dates allow this year)
And there is no 0 month of the year or 0 day of the month. (Which may be the cause of your problem)
I solved the problem considerating '00-00-....' isn't a valid date, then, I changed my SQL column definition adding "NULL" expresion to permit null values:
SELECT "-- Tabla item_pedido";
CREATE TABLE item_pedido (
id INTEGER AUTO_INCREMENT PRIMARY KEY,
id_pedido INTEGER,
id_item_carta INTEGER,
observacion VARCHAR(64),
fecha_estimada TIMESTAMP,
fecha_entrega TIMESTAMP NULL, // HERE IS!!.. NULL = DELIVERY DATE NOT SET YET
CONSTRAINT fk_item_pedido_id_pedido FOREIGN KEY (id_pedido)
REFERENCES pedido(id),...
Then, I've to be able to insert NULL values, that means "I didnt register that timestamp yet"...
SELECT "++ INSERT item_pedido";
INSERT INTO item_pedido VALUES
(01, 01, 01, 'Ninguna', ADDDATE(#HOY, INTERVAL 5 MINUTE), NULL),
(02, 01, 02, 'Ninguna', ADDDATE(#HOY, INTERVAL 3 MINUTE), NULL),...
The table look that:
mysql> select * from item_pedido;
+----+-----------+---------------+-------------+---------------------+---------------------+
| id | id_pedido | id_item_carta | observacion | fecha_estimada | fecha_entrega |
+----+-----------+---------------+-------------+---------------------+---------------------+
| 1 | 1 | 1 | Ninguna | 2013-05-19 15:09:48 | NULL |
| 2 | 1 | 2 | Ninguna | 2013-05-19 15:07:48 | NULL |
| 3 | 1 | 3 | Ninguna | 2013-05-19 15:24:48 | NULL |
| 4 | 1 | 6 | Ninguna | 2013-05-19 15:06:48 | NULL |
| 5 | 2 | 4 | Suave | 2013-05-19 15:07:48 | 2013-05-19 15:09:48 |
| 6 | 2 | 5 | Seco | 2013-05-19 15:07:48 | 2013-05-19 15:12:48 |
| 7 | 3 | 5 | Con Mayo | 2013-05-19 14:54:48 | NULL |
| 8 | 3 | 6 | Bilz | 2013-05-19 14:57:48 | NULL |
+----+-----------+---------------+-------------+---------------------+---------------------+
8 rows in set (0.00 sec)
Finally: JPA in action:
#Stateless
#LocalBean
public class PedidosServices {
#PersistenceContext(unitName="vagonpubPU")
private EntityManager em;
private Logger log = Logger.getLogger(PedidosServices.class.getName());
#SuppressWarnings("unchecked")
public List<ItemPedido> obtenerPedidosRetrasados() {
log.info("Obteniendo listado de pedidos retrasados");
Query qry = em.createQuery("SELECT ip FROM ItemPedido ip, Pedido p WHERE" +
" ip.fechaEntrega=NULL" +
" AND ip.idPedido=p.id" +
" AND ip.fechaEstimada < :arg3" +
" AND (p.idTipoEstado=:arg0 OR p.idTipoEstado=:arg1 OR p.idTipoEstado=:arg2)");
qry.setParameter("arg0", Tipo.ESTADO_BOUCHER_ESPERA_PAGO);
qry.setParameter("arg1", Tipo.ESTADO_BOUCHER_EN_SERVICIO);
qry.setParameter("arg2", Tipo.ESTADO_BOUCHER_RECIBIDO);
qry.setParameter("arg3", new Date());
return qry.getResultList();
}
At last all its work. I hope that help you.
To add to the other answers: If yout want the 0000-00-00 string, you can use noDatetimeStringSync=true (with the caveat of sacrificing timezone conversion).
The official MySQL bug: https://bugs.mysql.com/bug.php?id=47108.
Also, for history, JDBC used to return NULL for 0000-00-00 dates but now return an exception by default.
Source
you can append the jdbc url with
?zeroDateTimeBehavior=convertToNull&autoReconnect=true&characterEncoding=UTF-8&characterSetResults=UTF-8
With the help of this, sql convert '0000-00-00 00:00:00' as null value.
eg:
jdbc:mysql:<host-name>/<db-name>?zeroDateTimeBehavior=convertToNull&autoReconnect=true&characterEncoding=UTF-8&characterSetResults=UTF-8

Categories

Resources