I am parsing a CSV file containing data like:
2016-10-03, 18.00.00, 2, 6
I am reading the file and creating the schema as below:
StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("Date", DataTypes.DateType, false),
        DataTypes.createStructField("Time", DataTypes.TimestampType, false),
        DataTypes.createStructField("CO(GT)", DataTypes.IntegerType, false),
        DataTypes.createStructField("PT08.S1(CO)", DataTypes.IntegerType, false)));

Dataset<Row> df = spark.read().format("csv").schema(schema)
        .load("src/main/resources/AirQualityUCI/sample.csv");
It produces the following error:
Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Unknown Source)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
I suspect this is caused by the time format. How can I convert these values to specific formats, or what changes should I make to the StructType so they are parsed correctly?
The format I want for the time is HH:mm:ss, since that makes it easy to build a timestamp later via Spark SQL by concatenating the columns:
2016-10-03, 18:00:00, 2, 6
If you read both Date and Time as strings, you can easily merge them and convert the result to a Timestamp. You do not need to change "." to ":" in the Time column, since the format can be specified when creating the Timestamp. Example of a solution in Scala:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("2016-10-03", "00.00.17"), ("2016-10-04", "00.01.17"))
  .toDF("Date", "Time")

val df2 = df.withColumn("DateTime", concat($"Date", lit(" "), $"Time"))
  .withColumn("Timestamp", unix_timestamp($"DateTime", "yyyy-MM-dd HH.mm.ss"))
Which will give you:
+----------+--------+-------------------+----------+
| Date| Time| DateTime| Timestamp|
+----------+--------+-------------------+----------+
|2016-10-03|00.00.17|2016-10-03 00.00.17|1475424017|
|2016-10-04|00.01.17|2016-10-04 00.01.17|1475510477|
+----------+--------+-------------------+----------+
Of course, you can still convert the Time column to use ":" instead of "." if you want. This can be done with regexp_replace:
df.withColumn("Time2", regexp_replace($"Time", "\\.", ":"))
If you do this before converting to a Timestamp, you need to change the specified format above.
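If you prefer to stay with the Java API from the question, a minimal equivalent sketch could look like the following. It assumes the spark session variable and file path from the question and the usual static import of Spark's SQL functions; everything else simply mirrors the Scala answer above.

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Read Date and Time as plain strings instead of DateType/TimestampType.
StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("Date", DataTypes.StringType, false),
        DataTypes.createStructField("Time", DataTypes.StringType, false),
        DataTypes.createStructField("CO(GT)", DataTypes.IntegerType, false),
        DataTypes.createStructField("PT08.S1(CO)", DataTypes.IntegerType, false)));

Dataset<Row> df = spark.read().format("csv")
        .option("ignoreLeadingWhiteSpace", "true") // the sample row has spaces after the commas
        .schema(schema)
        .load("src/main/resources/AirQualityUCI/sample.csv");

// Concatenate the two strings and parse them using "." as the time separator.
Dataset<Row> df2 = df
        .withColumn("DateTime", concat(col("Date"), lit(" "), col("Time")))
        .withColumn("Timestamp", unix_timestamp(col("DateTime"), "yyyy-MM-dd HH.mm.ss"));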
So, as in the title, I have the following example Document in my MongoDB database:
{"_id":{"$oid":"5fcf541b466a3d10f55f8241"}, "dateOfBirth":"1992-11-02T12:05:17"}
As you can see, the date is stored as a String and not as an ISODate object. As far as I know, MongoDB should still be able to handle and query it as a Date. (source)
Thus, I am trying to query it in my Java app with JDBC in the following way:
java.util.Date queryDate = new GregorianCalendar(1980, Calendar.JANUARY, 1).getTime();
Bson query = Filters.gte("dateOfBirth", queryDate);
FindIterable<Document> result = collection.find(query);
However, this does not work. My thought process was that if I pass in a java.util.Date, the Filters.gte() method will know I mean to query a Date, and it will work as intended in MongoDB. Instead, I get 0 matches.
I also tried applying a formatter to my queryDate (something I had previously done for a different purpose):
DateFormat dformat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
Bson query = Filters.gte("dateOfBirth", dformat.format(queryDate));
However, this caused Filters.gte() to compare the values as Strings, i.e. in roughly alphabetical order. This made me think that the original java.util.Date version did indeed know I was querying a Date and not a String, and that it simply failed to convert the value in the database to a date type? I am unsure how this is supposed to work.
I understand this is a niche use case, and that I really should be storing dates as ISODate in MongoDB, but in my particular situation that is not an option.
Is there a way to query dates stored as Strings in MongoDB if I am using JDBC?
Minor point: you are using the MongoDB Java driver, not JDBC. JDBC drivers are for relational databases, which use the SQL query language. I have therefore changed the JDBC tag to Java in your question.
Working with Dates as Strings
Regarding the datetime format in your documents: Because of the format you are using, and because it is stored as a string, it is OK to use string comparisons when running your queries. Lexical ordering will ensure your string comparisons will be equivalent to datetime comparisons. This is what is being used by the code in the question you linked to.
Obviously this assumption will break if you have any data stored in other string formats, such as "dd-MM-yyyy", where the string ordering would not match the datetime ordering.
However you proceed, you should avoid the old and problematic Java Date and Calendar classes. Instead, use the modern java.time classes. More background here.
In your case, your documents are storing datetime data without any timezone or offset information. You can use java.time.LocalDateTime for this. The word "local" in this name actually means "no specific locality or timezone" - which matches what you have in your Mongo documents.
The Java imports:
import java.time.LocalDateTime;
import java.time.Month;
import java.time.format.DateTimeFormatter;
And an example local datetime:
LocalDateTime ldt = LocalDateTime.of(1980, Month.JANUARY, 1, 0, 0);
DateTimeFormatter dtf = DateTimeFormatter.ISO_DATE_TIME;
String s = ldt.format(dtf); // "1980-01-01T00:00:00"
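A minimal sketch of that string comparison with the MongoDB Java driver, reusing the formatted string s from above (collection is assumed to be your MongoCollection&lt;Document&gt;):

import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.conversions.Bson;

// Lexical ordering of these ISO-8601-style strings matches datetime ordering,
// so a plain string comparison is enough here.
Bson query = Filters.gte("dateOfBirth", s); // s = "1980-01-01T00:00:00"
for (Document document : collection.find(query)) {
    System.out.println(document.toJson());
}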
Working with Dates as Objects
If you want to use a Java LocalDateTime object directly in your query, instead of using string comparisons, you can use a projection to create a date object in your query results, and then use the LocalDateTime object directly in your filter:
Bson resultsWithDate = Aggregates.project(Projections.fields(
Projections.include("dateOfBirth"),
Projections.computed("birthDate", Projections.computed("$toDate", "$dateOfBirth"))
));
The above projection keeps the dateOfBirth field and adds a new birthDate field to each retrieved document, populated via the $toDate operator.
Then we can apply our filter:
collection.aggregate(
Arrays.asList(
resultsWithDate,
Aggregates.match(Filters.gte("birthDate", ldt)))
).forEach(printConsumer);
The filter now uses our ldt object, from above.
I am using the following helper method to print each results document as a JSON string in my console:
Consumer<Document> printConsumer = (final Document document) -> {
System.out.println(document.toJson());
};
There may be a more compact or efficient way to build this MongoDB aggregate - I am not a regular Mongo user.
Also, as a final note: my use of the Mongo $toDate operator does not specify a timezone, so it defaults to Zulu time (the UTC timezone), as shown in the sample output below:
{
"_id": {
"$oid": "5fcf541b466a3d10f55f8241"
},
"dateOfBirth": "1992-11-02T12:05:17",
"birthDate": {
"$date": "1992-11-02T12:05:17Z"
}
}
I am adding partition columns to a Spark DataFrame. The new columns contain the year, month, and day.
I have a timestamp column in my dataframe.
DataFrame dfPartition = df.withColumn("year", df.col("date").substr(0, 4));
dfPartition = dfPartition.withColumn("month", dfPartition.col("date").substr(6, 2));
dfPartition = dfPartition.withColumn("day", dfPartition.col("date").substr(9, 2));
I can see the correct column values when I output the DataFrame, e.g. 2016 01 08.
But when I write this DataFrame to a Hive table like this:
dfPartition.write().partitionBy("year", "month","day").mode(SaveMode.Append).saveAsTable("testdb.testtable");
I see that the generated directory structure is missing the leading zeroes.
I tried casting the columns to String, but it did not work.
Is there a way to keep two-digit day/month values in the Hive partitions?
Thanks
Per the Spark documentation, partition-column type inference is enabled by default. The OP's string values, being interpretable as ints, were converted accordingly. If this is undesirable for the Spark session as a whole, you can disable it by setting the corresponding configuration attribute to false:
SparkSession.builder.config("spark.sql.sources.partitionColumnTypeInference.enabled", value = false)
or by running the corresponding SET key=value command in SQL. Otherwise, you can counteract it per column with Spark's native format_string function, as J.Doe suggests.
Refer to Add leading zeros to Columns in a Spark Data Frame
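For completeness, here is a minimal Java sketch of the session-wide switch described above; the config key and the SQL SET form are the ones mentioned, the rest is standard SparkSession API:

import org.apache.spark.sql.SparkSession;

// Disable partition-column type inference for the whole session...
SparkSession spark = SparkSession.builder()
        .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
        .getOrCreate();

// ...or flip it on an existing session via SQL.
spark.sql("SET spark.sql.sources.partitionColumnTypeInference.enabled=false");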
You can see how to add leading 0s in this answer:
val df2 = df
.withColumn("month", format_string("%02d", $"month"))
I tried this on my code using the snippet below and it worked!
.withColumn("year", year(col("my_time")))
.withColumn("month", format_string("%02d",month(col("my_time")))) //pad with leading 0's
.withColumn("day", format_string("%02d",dayofmonth(col("my_time")))) //pad with leading 0's
.withColumn("hour", format_string("%02d",hour(col("my_time")))) //pad with leading 0's
.writeStream
.partitionBy("year", "month", "day", "hour")
I am reading a text file that has a timestamp field in the format "yyyy-MM-dd HH:mm:ss".
I want to convert it, in Java, into an Impala BIGINT field that looks like yyyyMMddHHmmss.
I am using Talend for the ETL, but I get the error "schema's dbType not correct for this component",
so I want to apply the right transformation in my tImpalaOutput component.
One obvious option is to read the date in as a string, format it to the output you want and then convert it to a long before sending it to Impala.
To do this you would start by using Talend's parseDate function with something like:
TalendDate.parseDate("yyyy-MM-dd HH:mm:ss",row1.date)
This parses the date string into a Date type object. From here you can convert this into your desired string format with:
TalendDate.formatDate("yyyMMddHHmmss",row2.date)
Alternatively this can be done in one go with:
TalendDate.formatDate("yyyMMddHHmmss",TalendDate.parseDate("yyyy-MM-dd HH:mm:ss",row1.date))
After this you should have a date string in your desired format. You can then convert it to a Long using a tConvertType component or the following Java code:
Long.valueOf(row3.date)
Or, once again we can do the whole thing in a one liner:
Long.valueOf(TalendDate.formatDate("yyyyMMddHHmmss", TalendDate.parseDate("yyyy-MM-dd HH:mm:ss", row1.date)))
From here you should be able to send the value to Impala as a Java Long into an Impala BIGINT field.
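Outside of Talend, the same transformation can be written as plain Java; a minimal sketch (the class and method names are illustrative):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimestampToBigint {
    // "2016-10-03 18:00:00" -> 20161003180000L
    public static Long toBigint(String input) throws ParseException {
        Date parsed = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(input);
        String formatted = new SimpleDateFormat("yyyyMMddHHmmss").format(parsed);
        return Long.valueOf(formatted);
    }
}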
BigDecimal myNumber = new BigDecimal(1234.56);
NumberFormat instance = NumberFormat.getInstance(Locale.GERMAN);
String localizedNumber = instance.format(myNumber);
System.out.print("\nformatting " + localizedNumber); // output: 1.234,56
Up to this point the code works fine, but the line below throws a NumberFormatException because the string contains a comma:
BigDecimal bigDecimal = new BigDecimal(localizedNumber);
I want the numeric values to be localized, but I cannot return a String, because putting the number into the cell as a string shows the following warning in Excel:
Number in this cell is formatted as text or preceded by an apostrophe
Is there any way to return a numeric value (BigDecimal / Integer / BigInteger, etc.) in a localized format, so that I do not get the above error in Excel and can still perform filter operations on the Excel data?
I have also tried new BigDecimalType().valueToString(value, locale) and new BigDecimalType().stringToValue(s, locale) from the dynamic Jasper Reports API.
Happy to answer this question, which I asked myself; I am quite surprised that no one replied to it.
Actually, we don't have to do anything for number localization in the Excel export, because it is handled automatically by the operating system's settings.
Go to "Region and Language" -> Format -> select the language/country name -> Apply, then check the Excel file you generated earlier in English.
You will see the numbers in the currently selected country's number format. :)
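If you ever do need to turn a localized string back into a BigDecimal in Java (not required for the Excel case above), a minimal sketch using DecimalFormat in BigDecimal parsing mode; the cast assumes the locale's NumberFormat is a DecimalFormat, which holds for the standard JDK locales:

import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class LocalizedParse {
    // "1.234,56" with Locale.GERMAN -> 1234.56
    public static BigDecimal parseLocalized(String text, Locale locale) throws ParseException {
        DecimalFormat format = (DecimalFormat) NumberFormat.getInstance(locale);
        format.setParseBigDecimal(true); // return BigDecimal instead of Double
        return (BigDecimal) format.parse(text);
    }
}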
For example, I have some documents described by fields: id, date and price.
First document: id=1, date='from 10.01.2014 to 20.01.2014', price='120'
Second document: id=2, date='19.01.2014' and price='from 100 to 140'
My program receives key/value parameters and should find the most suitable documents. For example, with the parameters date='19.01.2014' and price='120', the program should find both documents. With date='20.01.2014' and price='120', only the first document. With date='19.01.2014' and price='140', only the second one.
How can I do this with Lucene in Java? I have seen examples where the query is like 'give me docs where the date is from .. to ..' and Lucene returns the docs in that range. Instead, I want to specify the range on the document, not on the query.
You could index both the opening and closing bounds of the date and price ranges, e.g.
Your document #1 would be indexed as:
id = 1
dateFrom = 10.01.2014
dateTo = 20.01.2014
priceFrom = 120
priceTo = 9999999999
And document #2 as
id=2
dateFrom = 19.01.2014
dateTo = 01.01.2099
priceFrom = 100
priceTo = 140
The query then requires each stored range to contain the queried value; for date=19.01.2014 and price=120 it would look like this:
+dateFrom:[* TO 19.01.2014] +dateTo:[19.01.2014 TO *] +priceFrom:[* TO 120] +priceTo:[120 TO *]
This is not very efficient, but it should work.
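A minimal Java sketch of this indexing and query scheme, using Lucene's IntPoint fields and encoding dates as yyyyMMdd integers so that numeric ordering matches date ordering; the field names follow the answer, everything else is illustrative:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class RangeDocuments {
    // Index each document with the bounds of its date and price ranges.
    static Document buildDoc(int id, int dateFrom, int dateTo, int priceFrom, int priceTo) {
        Document doc = new Document();
        doc.add(new StoredField("id", id));
        doc.add(new IntPoint("dateFrom", dateFrom)); // e.g. 20140110 for 10.01.2014
        doc.add(new IntPoint("dateTo", dateTo));
        doc.add(new IntPoint("priceFrom", priceFrom));
        doc.add(new IntPoint("priceTo", priceTo));
        return doc;
    }

    // Match documents whose [from, to] ranges contain the queried date and price.
    static Query buildQuery(int date, int price) {
        return new BooleanQuery.Builder()
                .add(IntPoint.newRangeQuery("dateFrom", Integer.MIN_VALUE, date), BooleanClause.Occur.MUST)
                .add(IntPoint.newRangeQuery("dateTo", date, Integer.MAX_VALUE), BooleanClause.Occur.MUST)
                .add(IntPoint.newRangeQuery("priceFrom", Integer.MIN_VALUE, price), BooleanClause.Occur.MUST)
                .add(IntPoint.newRangeQuery("priceTo", price, Integer.MAX_VALUE), BooleanClause.Occur.MUST)
                .build();
    }
}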