I just wrote a toy class to test a Spark DataFrame (actually a Dataset, since I'm using Java).
Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'");
ds = ds.withColumn("dt", lit("2018-12-17"));
ds.cache();
ds.write().mode(SaveMode.Append).insertInto("test2.dummy");
//
System.out.println(ds.count());
According to my understanding, there are two actions here: "insertInto" and "count".
I debugged the code step by step. When running "insertInto", I see several lines like:
19/01/21 20:14:56 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
When running "count", I still see similar logs:
19/01/21 20:15:26 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
I have 2 questions:
1) When there are two actions on the same dataframe as above, if I don't call ds.cache or ds.persist explicitly, will the second action always cause the SQL query to be re-executed?
2) If I understand the logs correctly, both actions trigger HDFS file reads. Does that mean ds.cache() doesn't actually work here? If so, why doesn't it work?
Many thanks.
It's because you append into the table that ds is created from, so ds needs to be recomputed because the underlying data changed. In such cases, Spark invalidates the cache. See, for example, this JIRA (https://issues.apache.org/jira/browse/SPARK-24596):
When invalidating a cache, we invalid other caches dependent on this
cache to ensure cached data is up to date. For example, when the
underlying table has been modified or the table has been dropped
itself, all caches that use this table should be invalidated or
refreshed.
Try running ds.count before inserting into the table.
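For example, a sketch of the question's snippet reordered so the cache is materialized before the source table is modified:

Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'")
        .withColumn("dt", lit("2018-12-17"));
ds.cache();
System.out.println(ds.count());                              // materializes the cache while the table is unchanged
ds.write().mode(SaveMode.Append).insertInto("test2.dummy");  // this write invalidates the cache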
I found that the other answer didn't work for me. What I had to do was break the lineage so that the DataFrame I was writing did not know that one of its sources is the table I am writing to. To break the lineage, I created a copy of the DataFrame (in PySpark) using
copy_of_df = sql_context.createDataFrame(df.rdd)
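A rough Java equivalent of that lineage break (a sketch, not from the original answer): rebuild the Dataset from its RDD and schema, so the new plan no longer references the source table.

Dataset<Row> copyOfDs = spark.createDataFrame(ds.javaRDD(), ds.schema());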
In Java Spark, I have a dataframe that has a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to.
I want to write the dataframe to a Cassandra DB. The data must be written with a TTL, and the TTL should depend on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured.
Currently I am writing to Cassandra with Spark using a constant TTL, with the following code:
df.write().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {
            {
                put("keyspace", "key_space_name");
                put("table", "table_name");
                put("spark.cassandra.output.ttl", Long.toString(CONST_TTL)); // should depend on the bucket_timestamp column
            }
        }).mode(SaveMode.Overwrite).save();
One possible way I thought of is, for each possible bucket_timestamp, to filter the data by timestamp, calculate the TTL, and write the filtered data to Cassandra. But this seems very inefficient and not the Spark way. Is there a way in Java Spark to provide a Spark column as the TTL option, so that the TTL differs for each row?
The solution should work with Java and Dataset<Row>: I came across some solutions for doing this with RDDs in Scala, but couldn't find one for Java and DataFrames.
Thanks!
From the Spark-Cassandra connector options (https://github.com/datastax/spark-cassandra-connector/blob/v2.3.0/spark-cassandra-connector/src/main/java/com/datastax/spark/connector/japi/RDDAndDStreamCommonJavaFunctions.java) you can set the TTL as:
constant value (withConstantTTL)
automatically resolved value (withAutoTTL)
column-based value (withPerRowTTL)
In your case you could try the last option and compute the TTL as a new column of the starting Dataset with the rule you provided in the question.
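A minimal sketch of computing such a column (assuming bucket_timestamp is a timestamp column, CONST_TTL is expressed in seconds, and the functions are statically imported from org.apache.spark.sql.functions):

import static org.apache.spark.sql.functions.*;

// ROW_TTL = CONST_TTL - (current time - bucket_timestamp), everything in seconds
Dataset<Row> withTtl = df.withColumn("row_ttl",
        lit(CONST_TTL).minus(unix_timestamp().minus(unix_timestamp(col("bucket_timestamp")))));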
For a usage example you can see the test here: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/writer/TableWriterSpec.scala#L612
The DataFrame API has no support for such functionality yet. There is a JIRA for it (https://datastax-oss.atlassian.net/browse/SPARKC-416); you can watch it to get notified when it's implemented.
So the only choice you have is to use the RDD API, as described in #bartosz25's answer.
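A heavily hedged sketch of that RDD route, starting from a Dataset that already has the TTL column computed as above. The bean BucketRow and its rowTtl property are hypothetical, and the assumption that withPerRowTTL takes the name of the property carrying the per-row TTL should be verified against the linked test:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

JavaRDD<BucketRow> rdd = withTtl.javaRDD()
        .map(r -> new BucketRow(r)); // hypothetical bean holding the table columns plus row_ttl
javaFunctions(rdd)
        .writerBuilder("key_space_name", "table_name", mapToRow(BucketRow.class))
        .withPerRowTTL("rowTtl")     // assumption: names the bean property that holds the TTL
        .saveToCassandra();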
I wonder if there is any way to disable WAL (write-ahead log) operations when inserting new data into an HBase table with the Java API?
Thank you for your help :)
In HBase 2.0.0
To skip WAL at an individual update level (for a single Put or Delete):
Put p = new Put(ROW_ID).addColumn(FAMILY, NAME, VALUE).setDurability(Durability.SKIP_WAL);
To apply this setting to the entire table (so you don't have to set it on every update):
TableDescriptorBuilder tBuilder = TableDescriptorBuilder.newBuilder(TableName.valueOf(TABLE_ID));
tBuilder.setDurability(Durability.SKIP_WAL);
... continue building the table
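For completeness, a hedged sketch of what the remaining steps could look like (the admin handle and the FAMILY constant are assumptions, not from the question):

// add at least one column family, build the descriptor, and create the table
tBuilder.setColumnFamily(ColumnFamilyDescriptorBuilder.of(FAMILY));
admin.createTable(tBuilder.build());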
Hope this helps
I have an application that executes a query using NamedParameterJdbcTemplate. The result set is then parsed row by row using ResultSet.next().
In some multi-threaded scenarios this goes wrong: the result set returns wrong values. When I execute the same query in SQL Developer, I see the correct values. I'm not sure what the problem behind this could be.
while (rs.next()) {
    count++;
    long dbKy = rs.getLong("DBKY");
    pAttrs = map.get(dbKy);
    if (pAttrs == null) {
        pAttrs = new HashMap<String, String>();
        map.put(dbKy, pAttrs);
    }
    log.info("PrintingResultSet!!::" + rs.getLong("DBKY")
            + "::" + rs.getString(ATTR_NAME)
            + "::" + rs.getString(ATTR_VAL)
            + "::" + rs.getString(Constants.VAL));
    pAttrs.put(rs.getString(ATTR_NAME), rs.getString(ATTR_VAL));
}
EDIT: This code is in the repository layer of a Spring Boot application. By multithreading I mean that the issue occurs when multiple requests are sent simultaneously. I have printed the thread ID in my logs, and it confirms that this happens only in multi-threaded scenarios.
The value that is being returned is actually the value of some other row.
What (wrong) values do you see when you try to display the result set? If you see unknown text or symbols, it could be an encoding issue. I suggest you look into how to encode values such as special characters/symbols in your service layer: you can no doubt see the data in the database by running the query, but if that data contains special characters/symbols, it needs to be encoded as UTF-8.
Thanks!
For processing with Apache Flink I am trying to create a DataSet from data stored in a Microsoft SQL database. The test_table has two columns, "numbers" and "strings", which contain INTs and VARCHARs respectively.
// supply row type info
TypeInformation<?>[] fieldTypes = new TypeInformation<?>[] {
        BasicTypeInfo.INT_TYPE_INFO,
        BasicTypeInfo.CHAR_TYPE_INFO,
};
RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);
// create and configure input format
JDBCInputFormat inputFormat = JDBCInputFormat.buildJDBCInputFormat()
        .setDrivername("com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .setDBUrl(serverurl)
        .setUsername(username)
        .setPassword(password)
        .setQuery("SELECT numbers, strings FROM test_table")
        .setRowTypeInfo(rowTypeInfo)
        .finish();
// create and configure type information for DataSet
TupleTypeInfo typeInformation = new TupleTypeInfo(Tuple2.class, BasicTypeInfo.INT_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);
// Read data from a relational database using the JDBC input format
DataSet<Tuple2<Integer, String>> dbData = environment.createInput(inputFormat, typeInformation);
// write to sink
dbData.print();
On execution, the following error happens and no output is created.
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:714)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:660)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:660)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.NullPointerException
at org.apache.flink.api.java.io.jdbc.JDBCInputFormat.open(JDBCInputFormat.java:231)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:147)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:745)
This leaves me with no real clue where and how to look for a solution. Curiously, this piece of code worked before I changed my Flink JDBC version from 0.10.2 to 1.1.3. The RowTypeInfo part was not necessary with the old version (meaning it probably checked the types itself?), but apart from adding that to the code, nothing changed.
Chances are, then, that it has to do with the RowTypeInfo. I tried changing them around a bit, e.g. using BasicTypeInfo.CHAR_TYPE_INFO instead of BasicTypeInfo.STRING_TYPE_INFO (as the column is a VARCHAR column), but the error remained.
Ideally, I would like to fix the NullPointer problem and proceed with a DataSet containing the information from the database. Considering the lack of documentation/tutorials and (working) examples, I also arrive at a more general question: Is it a good idea at all to try and process SQL data in Flink or is it just not meant for this? As of now, I'm starting to think it might be easier, though tedious if nothing else, to create a routine that reads from a database and saves its contents to a CSV file before starting a Flink job on that.
I have a table with approximately 62,000,000 rows, and I need to select data from it and export it to a .txt or .csv file.
My query limits the result to approximately 60,000 rows.
When I run the query on my development machine, it eats all the memory and I get a java.lang.OutOfMemoryError.
At the moment I use Hibernate for the DAO, but I can change to a pure JDBC solution if you recommend it.
My pseudo-code is:
List<Map> list = myDao.getMyData(Params param); // program crashes here
initFile();
for (Map map : list) {
    util.append(map); // this writes one row to the file
}
closeFile();
Can you suggest how I should write my file?
Note: I use .setResultTransformer(Transformers.ALIAS_TO_ENTITY_MAP) to get a Map instead of an entity.
You could use Hibernate's ScrollableResults. See the documentation here: http://docs.jboss.org/hibernate/orm/4.3/manual/en-US/html/ch11.html#objectstate-querying-executing-scrolling
This uses server-side cursors, if your database engine/driver supports them. For this to work, be sure you set the following properties:
query.setReadOnly(true);    // entities are loaded read-only
query.setCacheable(false);  // don't populate the query cache
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
while (results.next()) {
    SomeEntity entity = (SomeEntity) results.get()[0]; // process one row at a time
}
results.close();
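Adapted to the map-based export from the question, a hedged sketch (session, hql, util, initFile()/closeFile() are taken from or implied by the question; the fetch size value is just a guess):

Query query = session.createQuery(hql)
        .setResultTransformer(Transformers.ALIAS_TO_ENTITY_MAP)
        .setReadOnly(true)
        .setCacheable(false)
        .setFetchSize(1000); // hint for the JDBC driver

initFile();
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
try {
    while (results.next()) {
        Map row = (Map) results.get()[0]; // one row at a time, never the full list in memory
        util.append(row);
    }
} finally {
    results.close();
}
closeFile();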
Lock the table and then perform subset selections and exports, appending to the results file. Ensure you unconditionally unlock when done.
It's not nice, but the task will run to completion even on servers or clients with limited resources.