Null Pointer Exception with JdbcIO read in Apache Beam [DataFlow] - java

I have written code for reading data from Oracle using Apache Beam with Dataflow as a Runner.
I am facing a weird error for a few tables while ingesting data.
There are various tables that have 200+ columns with various data types such as DATE, NUMBER, and VARCHAR2(***) in Oracle.
Our requirement is to migrate all columns through Dataflow to BigQuery. When we select all the columns in the SELECT query, it gives the NullPointerException mentioned below. So we tried selecting only some columns in the query, and in that case:
When the combined data type size of the selected columns is less than ~46752 bytes, the pipeline runs successfully,
and when it crosses this limit, it starts giving the NullPointerException.
To explain in more detail: if the limit were 2000 bytes (an assumption; the actual value we are seeing is around 46752 bytes), we would be able to select only two columns of type VARCHAR2(1000), or four columns of VARCHAR2(500), and so on.
Note - we calculated the ~46752-byte threshold by adding columns to the query one by one and executing the code.
We are not sure whether any such limit is attached to the Java JDBC connector of Apache Beam, but we are facing challenges with the migration when our selected columns cross this limit.
Please help me if I am missing any point or parameter here while reading data through JdbcIO.
Below is the code snippet that gives the error. It is the entry point of our pipeline.
The error occurs during the read only. In the code below I have not included the write to BigQuery, since it never gets executed because of the failure at JdbcIO.read() (I cross-checked this by commenting out the BigQuery logic as well).
Read Data from Oracle Code
// Read from JDBC
Pipeline p2 = Pipeline.create(
    PipelineOptionsFactory.fromArgs(args).withValidation().create());

String query2 = "SELECT Col1...Col00 FROM table WHERE rownum<=1000";

PCollection<TableRow> rows = p2.apply(JdbcIO.<TableRow>read()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
        "oracle.jdbc.OracleDriver", "jdbc:oracle:thin:@//localhost:1521/orcl")
        .withUsername("root")
        .withPassword("password"))
    .withQuery(query2)
    .withRowMapper(new JdbcIO.RowMapper<TableRow>() {
        @Override
        public TableRow mapRow(ResultSet resultSet) throws Exception {
            schema = getSchemaFromResultSet(resultSet);
            TableRow tableRow = new TableRow();
            List<TableFieldSchema> columnNames = schema.getFields();
            // (column-by-column population of tableRow omitted here)
            return tableRow;
        }
    })
);

p2.run().waitUntilFinish();
Error (only when the selected columns cross the byte limit described above)
Error message from worker: java.lang.NullPointerException
oracle.sql.converter.CharacterConverter1Byte.toUnicodeChars(CharacterConverter1Byte.java:344)
oracle.sql.CharacterSet1Byte.toCharWithReplacement(CharacterSet1Byte.java:134)
oracle.jdbc.driver.DBConversion._CHARBytesToJavaChars(DBConversion.java:964)
oracle.jdbc.driver.DBConversion.CHARBytesToJavaChars(DBConversion.java:867)
oracle.jdbc.driver.T4CVarcharAccessor.unmarshalOneRow(T4CVarcharAccessor.java:298)
oracle.jdbc.driver.T4CTTIrxd.unmarshal(T4CTTIrxd.java:934)
oracle.jdbc.driver.T4CTTIrxd.unmarshal(T4CTTIrxd.java:853)
oracle.jdbc.driver.T4C8Oall.readRXD(T4C8Oall.java:699)
oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:337)
oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:191)
oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:523)
oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:207)
oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:863)
oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1153)
oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1275)
oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3576)
oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3620)
oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1491)
org.apache.commons.dbcp2.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:122)
org.apache.commons.dbcp2.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:122)
org.apache.beam.sdk.io.jdbc.JdbcIO$ReadFn.processElement(JdbcIO.java:1381)
More Details
Oracle Version - 11g
Apache Beam SDK - Java
A few of the columns we are selecting have null values as well, but that should not cause any issue.
There is no issue with specific columns, since I have tried all possible combinations of the columns.
There is no limit on the number of columns selected, since I am able to read data from an Oracle table with more than 300 columns as long as the total byte size is less than 46752 bytes.

The JdbcIO source transform has a property called fetchSize that controls how much data is fetched from the database per call. The default value is 50,000. You can change it using the withFetchSize(int fetchSize) method. Docs.
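For reference, a minimal sketch of applying this to the read from the question; the fetch size of 1000 is only an illustrative value, and the withCoder(TableRowJsonCoder.of()) line and the empty row mapper are assumptions standing in for the original mapping logic:

PCollection<TableRow> rows = p2.apply(JdbcIO.<TableRow>read()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
        "oracle.jdbc.OracleDriver", "jdbc:oracle:thin:@//localhost:1521/orcl")
        .withUsername("root")
        .withPassword("password"))
    .withQuery(query2)
    // Fetch fewer rows per database round trip so each batch stays small for very wide rows
    .withFetchSize(1000)
    .withCoder(TableRowJsonCoder.of())
    .withRowMapper(resultSet -> {
        TableRow tableRow = new TableRow();
        // column-by-column mapping as in the question goes here
        return tableRow;
    }));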

Related

Push Data to Existing Table in BigQuery

I am using Java and SQL to push data to a Timestamp partitioned table in BigQuery. In my code, I specify the destination table:
.setDestinationTable(TableId.of("MyDataset", "MyTable"))
When I run it, it creates a table perfectly. However, when I attempt to insert new data, it throws a BigQueryException claiming the table already exists:
Exception in thread "main" com.google.cloud.bigquery.BigQueryException:
Already Exists: Table MyProject:MyDataset.MyTable
After some documentation digging, I found a solution that works:
.setWriteDisposition(WriteDisposition.WRITE_APPEND)
Adding the above appends any data (even if it's a duplicate). I'm not sure why the default setting for .setDestinationTable() is the equivalent of WRITE_EMPTY, which returns a "duplicate" error. The Google docs for .setDestinationTable() say:
Describes the table where the query results should be stored. If not
present, a new table will be created to store the results.
The docs should probably clarify the default value.
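For context, a minimal sketch (not the asker's exact code) of how the destination table and write disposition sit together on a QueryJobConfiguration; the query string and table names are placeholders:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.JobInfo.WriteDisposition;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
QueryJobConfiguration config = QueryJobConfiguration.newBuilder(
        "SELECT * FROM `MyProject.MyDataset.SourceTable`") // placeholder query
    .setDestinationTable(TableId.of("MyDataset", "MyTable"))
    // WRITE_APPEND adds rows to the existing table instead of failing with "Already Exists"
    .setWriteDisposition(WriteDisposition.WRITE_APPEND)
    .build();
bigquery.query(config); // runs the query and appends its results to MyDataset.MyTable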

Spark writing to Cassandra with varying TTL

In Java Spark, I have a DataFrame that has a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to.
I want to write the DataFrame to a Cassandra DB. The data must be written to the DB with a TTL. The TTL should depend on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured.
Currently I am writing to Cassandra from Spark using a constant TTL, with the following code:
df.write().format("org.apache.spark.sql.cassandra")
    .options(new HashMap<String, String>() {
        {
            put("keyspace", "key_space_name");
            put("table", "table_name");
            put("spark.cassandra.output.ttl", Long.toString(CONST_TTL)); // should depend on the bucket_timestamp column
        }
    }).mode(SaveMode.Overwrite).save();
One possible way I thought of is, for each possible bucket_timestamp, to filter the data by timestamp, calculate the TTL, and write the filtered data to Cassandra. But this seems very inefficient and not the Spark way. Is there a way in Java Spark to provide a Spark column as the TTL option, so that the TTL differs for each row?
The solution should work with Java and Dataset<Row>: I encountered some solutions for doing this with RDDs in Scala, but didn't find a solution for Java and DataFrames.
Thanks!
From the Spark-Cassandra connector options (https://github.com/datastax/spark-cassandra-connector/blob/v2.3.0/spark-cassandra-connector/src/main/java/com/datastax/spark/connector/japi/RDDAndDStreamCommonJavaFunctions.java) you can set the TTL as a:
constant value (withConstantTTL)
automatically resolved value (withAutoTTL)
column-based value (withPerRowTTL)
In your case you could try the last option and compute the TTL as a new column of the starting Dataset with the rule you provided in the question.
For a use case, you can see the test here: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/writer/TableWriterSpec.scala#L612
For the DataFrame API there is no support for such functionality yet... There is a JIRA for it - https://datastax-oss.atlassian.net/browse/SPARKC-416; you can watch it to get notified when it's implemented...
So the only choice you have is to use the RDD API as described in @bartosz25's answer...
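As an illustration only, a sketch in Java of computing that per-row TTL as a new column before handing the data to the connector's column-based TTL option; the "ttl" column name, the constant value, and the assumption that bucket_timestamp holds epoch seconds are placeholders rather than anything mandated by the connector:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.current_timestamp;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.unix_timestamp;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

long CONST_TTL = 86400L; // hypothetical constant TTL in seconds

// ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), materialised as a "ttl" column
Dataset<Row> withTtl = df.withColumn(
    "ttl",
    lit(CONST_TTL).minus(
        unix_timestamp(current_timestamp()).minus(col("bucket_timestamp"))));
// The "ttl" column can then be used with the connector's column-based TTL option
// (withPerRowTTL) via the RDD API mentioned above.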

NullPointerException in JDBCInputFormat.open when trying to read DataSet from MS SQL

For processing with Apache Flink I am trying to create a DataSet from data in a Microsoft SQL database. The test_table has two columns, "numbers" and "strings", which contain INTs and VARCHARs respectively.
// supply row type info
TypeInformation<?>[] fieldTypes = new TypeInformation<?>[] {
    BasicTypeInfo.INT_TYPE_INFO,
    BasicTypeInfo.CHAR_TYPE_INFO,
};
RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);

// create and configure input format
JDBCInputFormat inputFormat = JDBCInputFormat.buildJDBCInputFormat()
    .setDrivername("com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .setDBUrl(serverurl)
    .setUsername(username)
    .setPassword(password)
    .setQuery("SELECT numbers, strings FROM test_table")
    .setRowTypeInfo(rowTypeInfo)
    .finish();

// create and configure type information for DataSet
TupleTypeInfo typeInformation = new TupleTypeInfo(Tuple2.class, BasicTypeInfo.INT_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);

// Read data from a relational database using the JDBC input format
DataSet<Tuple2<Integer, String>> dbData = environment.createInput(inputFormat, typeInformation);

// write to sink
dbData.print();
On execution, the following error happens and no output is created.
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:714)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:660)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:660)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.NullPointerException
at org.apache.flink.api.java.io.jdbc.JDBCInputFormat.open(JDBCInputFormat.java:231)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:147)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:745)
This leaves me with no real clue where and how to look for a solution. Curiously, this piece of code worked before I changed my Flink JDBC version from 0.10.2 to 1.1.3. The RowTypeInfo part was not necessary with the old version (meaning it probably checked the types itself?), but apart from adding that to the code, nothing changed.
Chances are, then, that it has to do with the RowTypeInfo. I tried changing it around a bit, e.g. using BasicTypeInfo.CHAR_TYPE_INFO instead of BasicTypeInfo.STRING_TYPE_INFO (as the column is a VARCHAR column), but the error remained.
Ideally, I would like to fix the NullPointerException and proceed with a DataSet containing the information from the database. Considering the lack of documentation/tutorials and (working) examples, I also arrive at a more general question: is it a good idea at all to try to process SQL data in Flink, or is it just not meant for this? As of now, I'm starting to think it might be easier, though tedious, to create a routine that reads from a database and saves its contents to a CSV file before starting a Flink job on that.

java select large table and export to file

I have a table with approximately 62,000,000 rows, and I need to select data from it and export it to a .txt or .csv file.
My query limits the result to approximately 60,000 rows.
When I run the query on my development machine, it eats all the memory and I get a java.lang.OutOfMemoryError.
At the moment I use Hibernate for the DAO, but I can change to a pure JDBC solution if you recommend it.
My pseudo-code is:
List<Map> list = myDao.getMyData(param); // program crashes here
initFile();
for (Map map : list) {
    util.append(map); // this transforms a row into the file
}
closeFile();
How do you suggest I write my file?
Note: I use .setResultTransformer(Transformers.ALIAS_TO_ENTITY_MAP); to get a Map instead of an Entity.
You could use Hibernate's ScrollableResults. See the documentation here: http://docs.jboss.org/hibernate/orm/4.3/manual/en-US/html/ch11.html#objectstate-querying-executing-scrolling
This uses server-side cursors, if your database engine / database driver supports them. For this to work, be sure you set the following properties:
query.setReadOnly(true);
query.setCacheable(false);

ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
while (results.next()) {
    SomeEntity entity = (SomeEntity) results.get()[0];
    // process the row here, e.g. append it to the output file
}
results.close();
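For completeness, a rough sketch of how this scrolling read could be combined with streaming rows out to a file; the entity name, query, and the 1000-row flush interval are placeholders, and the periodic session.clear() is just one common way to keep memory flat:

import java.io.BufferedWriter;
import java.io.FileWriter;

import org.hibernate.Query;
import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.Session;

Session session = sessionFactory.openSession(); // assumes an existing SessionFactory
try (BufferedWriter out = new BufferedWriter(new FileWriter("export.csv"))) {
    Query query = session.createQuery("from SomeEntity"); // placeholder entity
    query.setReadOnly(true);
    query.setCacheable(false);

    ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
    int count = 0;
    while (results.next()) {
        SomeEntity entity = (SomeEntity) results.get()[0];
        out.write(entity.toString()); // placeholder: format the row however you need
        out.newLine();
        if (++count % 1000 == 0) {
            session.clear(); // detach already-written entities so they can be garbage collected
        }
    }
    results.close();
} finally {
    session.close();
}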
Lock the table and then perform subset selections and exports, appending to the results file. Ensure you unconditionally unlock when done.
Not nice, but the task will run to completion even on limited-resource servers or clients.

file (not in memory) based JDBC driver for CSV files

Is there an open source file-based (NOT in-memory-based) JDBC driver for CSV files? My CSVs are dynamically generated from the UI according to the user's selections, and each user will have a different CSV file. I'm doing this to reduce database hits, since the information is contained in the CSV file. I only need to perform SELECT operations.
HSQLDB allows for indexed searches if we specify an index, but I won't be able to provide an unique column that can be used as an index, hence it does SQL operations in memory.
Edit:
I've tried CsvJdbc, but that doesn't support simple operations like ORDER BY and GROUP BY. It is still unclear whether it reads from the file or loads it into memory.
I've tried xlSQL, but that again relies on HSQLDB and only works with Excel, not CSV. Plus it's not in development or supported anymore.
H2, but that only reads CSV. Doesn't support SQL.
You can solve this problem using the H2 database.
The following groovy script demonstrates:
Loading data into the database
Running a "GROUP BY" and "ORDER BY" sql query
Note: H2 supports in-memory databases, so you have the choice of persisting the data or not.
import groovy.sql.Sql

// Create the database
def sql = Sql.newInstance("jdbc:h2:db/csv", "user", "pass", "org.h2.Driver")

// Load CSV file
sql.execute("CREATE TABLE data (id INT PRIMARY KEY, message VARCHAR(255), score INT) AS SELECT * FROM CSVREAD('data.csv')")

// Print results
def result = sql.firstRow("SELECT message, score, count(*) FROM data GROUP BY message, score ORDER BY score")
assert result[0] == "hello world"
assert result[1] == 0
assert result[2] == 6

// Cleanup
sql.close()
Sample CSV data:
0,hello world,0
1,hello world,1
2,hello world,0
3,hello world,1
4,hello world,0
5,hello world,1
6,hello world,0
7,hello world,1
8,hello world,0
9,hello world,1
10,hello world,0
If you check the SourceForge project CsvJdbc, please report your experiences. The documentation says it is useful for importing CSV files.
Project page
This was discussed on Superuser https://superuser.com/questions/7169/querying-a-csv-file.
You can use the Text Tables feature of hsqldb: http://hsqldb.org/doc/2.0/guide/texttables-chapt.html
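To illustrate the idea, a sketch only (reusing the same hypothetical schema as the H2 example above): an HSQLDB TEXT table is backed by the CSV file on disk rather than loaded fully into memory, and it is bound to the file with SET TABLE ... SOURCE:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

try (Connection conn = DriverManager.getConnection("jdbc:hsqldb:file:csvdb", "SA", "");
     Statement stmt = conn.createStatement()) {
    // Define a text table and bind it to the CSV file (comma-separated by default)
    stmt.execute("CREATE TEXT TABLE data (id INT, message VARCHAR(255), score INT)");
    stmt.execute("SET TABLE data SOURCE 'data.csv'");

    ResultSet rs = stmt.executeQuery(
        "SELECT message, score, COUNT(*) FROM data GROUP BY message, score ORDER BY score");
    while (rs.next()) {
        System.out.println(rs.getString(1) + " / " + rs.getInt(2) + " / " + rs.getInt(3));
    }
}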
csvsql/gcsvsql are also possible solutions (but there is no JDBC driver, you will have to run a command line program for your query).
sqlite is another solution but you have to import the CSV file into a database before you can query it.
Alternatively, there is commercial software such as http://www.csv-jdbc.com/ which will do what you want.
To do anything with a file you have to load it into memory at some point. What you could do is just open the file and read it line by line, discarding the previous line as you read in a new one. The only downside to this approach is that it is linear. Have you thought about using something like memcache on a server, where you use key-value stores in memory that you can query instead of dumping to a CSV file?
You can use either a specialized JDBC driver like CsvJdbc (http://csvjdbc.sourceforge.net), or you may choose to configure a database engine such as MySQL to treat your CSV as a table and then manipulate your CSV through a standard JDBC driver.
The trade-off here is available SQL features vs. performance.
Direct access to CSV via CsvJdbc (or similar) will allow you very quick operations on big data volumes, but without the ability to sort or group records using SQL commands;
the MySQL CSV engine can provide a rich set of SQL features, but at the cost of performance.
So if the size of your table is relatively small, go with MySQL. However, if you need to process big files (> 100 MB) without the need for grouping or sorting, go with CsvJdbc.
If you need both - to handle very big files and to manipulate them using SQL - then the optimal course of action is to load the CSV into a normal database table (e.g. MySQL) first and then handle the data as a usual SQL table.
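For the CsvJdbc route, a minimal sketch of what querying a generated file could look like; the directory, the file name data.csv, and the column names are placeholders, and note that CsvJdbc expects a header row in the file by default:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

Class.forName("org.relique.jdbc.csv.CsvDriver");
// The directory acts as the "database"; each CSV file in it is exposed as a table
// named after the file, so data.csv becomes the table "data".
try (Connection conn = DriverManager.getConnection("jdbc:relique:csv:/path/to/csv/dir");
     Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery("SELECT id, message FROM data WHERE score = 1")) {
    while (rs.next()) {
        System.out.println(rs.getInt("id") + "," + rs.getString("message"));
    }
}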
