Hello,
I have written code for a streaming job where both the source and the target are a PostgreSQL database. I used JDBCInputFormat/JDBCOutputFormat to read and write the records (referenced example).
Code:
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
environment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

JDBCInputFormatBuilder inputBuilder = JDBCInputFormat.buildJDBCInputFormat()
        .setDrivername(JDBCConfig.DRIVER_CLASS)
        .setDBUrl(JDBCConfig.DB_URL)
        .setQuery(JDBCConfig.SELECT_FROM_SOURCE)
        .setRowTypeInfo(JDBCConfig.ROW_TYPE_INFO);

SingleOutputStreamOperator<Row> source = environment.createInput(inputBuilder.finish())
        .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Row>() {
            @Override
            public long extractAscendingTimestamp(Row row) {
                Date dt = (Date) row.getField(2);
                return dt.getTime();
            }
        })
        .keyBy(0)
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .fold(null, new FoldFunction<Row, Row>() {
            @Override
            public Row fold(Row row1, Row row) throws Exception {
                return row;
            }
        });
source.writeUsingOutputFormat(JDBCOutputFormat.buildJDBCOutputFormat()
        .setDrivername(JDBCConfig.DRIVER_CLASS)
        .setDBUrl(JDBCConfig.DB_URL)
        .setQuery("insert into tablename(id, name) values (?,?)")
        .setSqlTypes(new int[]{Types.BIGINT, Types.VARCHAR})
        .finish());
This code executes correctly, but it does not run continuously on the Flink server (the select query is executed only once).
I expect it to run continuously on the Flink server.
You probably have to define your own Flink source or your own JDBCInputFormat, since the one you use here stops the SourceTask once it has fetched all results from the DB. One way to solve this is to create your own JDBC input format based on JDBCInputFormat, re-executing the SQL query in nextRecord once the last row has been read from the DB.
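A minimal sketch of the custom-source option, under the assumption that a simple polling source is acceptable (the class name PollingJdbcSource, the column layout, and the 5-second poll interval are illustrative and not part of the original answer; it uses a RichSourceFunction with plain JDBC rather than subclassing JDBCInputFormat, whose internal state is private):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.types.Row;

public class PollingJdbcSource extends RichSourceFunction<Row> {
    private volatile boolean running = true;

    @Override
    public void run(SourceFunction.SourceContext<Row> ctx) throws Exception {
        try (Connection conn = DriverManager.getConnection(JDBCConfig.DB_URL)) {
            while (running) {
                // Re-execute the query on every pass so new rows keep flowing downstream.
                try (PreparedStatement stmt = conn.prepareStatement(JDBCConfig.SELECT_FROM_SOURCE);
                     ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        Row row = new Row(3);
                        row.setField(0, rs.getLong(1));      // id
                        row.setField(1, rs.getString(2));    // name
                        row.setField(2, rs.getTimestamp(3)); // event-time column
                        ctx.collect(row);
                    }
                }
                Thread.sleep(5000); // poll interval, purely illustrative
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
Such a source could then be plugged in with environment.addSource(new PollingJdbcSource()) in place of environment.createInput(inputBuilder.finish()).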
Related
I am writing Java code to get data from SAP BAPI using the Java Connector (JCo). This is my first time making a connection to SAP using JCo. I was able to get the tables available in the data source, and also to get one particular table and its number of columns using table_name.getNumColumns(), which gives me the total count of columns. But when I call table_name.getNumRows(), it says 0, whereas in my data source there are around 85 rows. How can I get the rows from this table?
The code I have been using:
public class SapConnection {

    public static void gettingTableData(JCoFunction function) {
        JCoParameterList table_list = function.getTableParameterList();
        JCoTable my_table = function.getTableParameterList().getTable("SOME_TABLE");
        System.out.println("Field Count: " + my_table.getFieldCount());

        // This is not working as Number of Rows is 0.
        for (int i = 0; i < my_table.getNumRows(); i++, my_table.nextRow()) {
            // get those rows and do something ..
        }

        System.out.println("Is Empty: " + my_table.isEmpty());        // returns True
        System.out.println("Is First Row: " + my_table.isFirstRow()); // returns false
        System.out.println("Next Row: " + my_table.nextRow());        // returns false
        System.out.println("Num Rows: " + my_table.getNumRows());     // returning 0
    }

    public static void loadDataSourceAndGetData(JCoDestination dest) throws JCoException {
        JCoRepository sapRepository = dest.getRepository();
        JCoFunctionTemplate template = sapRepository.getFunctionTemplate("DATA_SOURCE_NAME");
        JCoFunction my_function = template.getFunction();
        gettingTableData(my_function);
    }

    public static void main(String[] args) throws JCoException {
        // get the Properties created for connection.
        Properties pp = getJcoProperties();
        PropertiesDestinationDataProvider pddp = new PropertiesDestinationDataProvider(pp);
        Environment.registerDestinationDataProvider(pddp);
        JCoDestination dest = getDestination();
        try {
            // Using JCo Context for stateful function calls to Start() and End()
            JCoContext.begin(dest);
            loadDataSourceAndGetData(dest);
            JCoRepository sapRepository = dest.getRepository();
            System.out.println(sapRepository.getMonitor().getLastAccessTimestamp());
        } finally {
            // end the connection.
            JCoContext.end(dest);
        }
    }
}
If you would like to get data from an SAP BAPI, it would also help a lot to actually call this BAPI. The data doesn't materialize in the JCo objects out of thin air.
In your code you do not execute any JCoFunction.
Set the mandatory import parameter values for this BAPI (if there are any), execute the BAPI (your JCoFunction object), and you will then get the export data from the SAP system in response, which will also fill the appropriate rows in the JCoTable object.
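A minimal sketch of what that could look like inside loadDataSourceAndGetData, before gettingTableData(my_function) is called (the import parameter name "SOME_IMPORT_PARAM" and its value are placeholders; your BAPI may require different import parameters or none at all):
// Set mandatory import parameters, if the BAPI defines any (placeholder name/value).
my_function.getImportParameterList().setValue("SOME_IMPORT_PARAM", "SOME_VALUE");

// Actually execute the function against the destination; only after this call
// does the SAP system return export and table data.
my_function.execute(dest);

// Now the table parameter is populated and can be iterated.
JCoTable my_table = my_function.getTableParameterList().getTable("SOME_TABLE");
for (int i = 0; i < my_table.getNumRows(); i++) {
    my_table.setRow(i);
    System.out.println(my_table.getString(0)); // read the first column of the current row
}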
I have a log file of 30K records, which I am publishing to Kafka and persisting into HBase through Spark. Out of the 30K records, I can see only 4K records in the HBase table.
I have tried saving the stream to MySQL, and it saves all records in MySQL properly.
But with HBase, if I publish a file of 100 records to the Kafka topic, only 36 records are saved in the HBase table, whereas if I publish 30K records, HBase shows only 4K records.
Also, the records (rows) in HBase are not in sequence, e.g. 1, 3, 10, 17.
final Job newAPIJobConfiguration1 = Job.getInstance(config);
newAPIJobConfiguration1.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "logs");
newAPIJobConfiguration1.setOutputFormatClass(org.apache.hadoop.hbase.mapreduce.TableOutputFormat.class);
HTable hTable = new HTable(config, "country");

lines.foreachRDD((rdd, time) -> {
    // Get the singleton instance of SparkSession
    SparkSession spark = SparkSession.builder().config(rdd.context().getConf()).getOrCreate();

    // Convert RDD[String] to RDD[case class] to DataFrame
    JavaRDD<Log> rowRDD = rdd.map(line -> {
        String[] logLine = line.split(" +");
        Log record = new Log();
        record.setTime(logLine[0]);
        record.setTime_taken(logLine[1]);
        record.setIp(logLine[2]);
        return record;
    });

    saveToHBase(rowRDD, newAPIJobConfiguration1.getConfiguration());
});

ssc.start();
ssc.awaitTermination();
}
//6. saveToHBase method - insert data into HBase
public static void saveToHBase(JavaRDD<Log> rowRDD, Configuration conf) throws IOException {
    // create Key, Value pair to store in HBase
    JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts = rowRDD.mapToPair(
        new PairFunction<Log, ImmutableBytesWritable, Put>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<ImmutableBytesWritable, Put> call(Log row) throws Exception {
                Put put = new Put(Bytes.toBytes(System.currentTimeMillis()));
                //put.addColumn(Bytes.toBytes("sparkaf"), Bytes.toBytes("message"), Bytes.toBytes(row.getMessage()));
                put.addImmutable(Bytes.toBytes("time"), Bytes.toBytes("col1"), Bytes.toBytes(row.getTime()));
                put.addImmutable(Bytes.toBytes("time_taken"), Bytes.toBytes("col2"), Bytes.toBytes(row.getTime_taken()));
                put.addImmutable(Bytes.toBytes("ip"), Bytes.toBytes("col3"), Bytes.toBytes(row.getIp()));
                return new Tuple2<>(new ImmutableBytesWritable(), put);
            }
        });

    // save to HBase - Spark built-in API method
    hbasePuts.saveAsNewAPIHadoopDataset(conf);
}
Since HBase stores records uniquely by rowkey, it is very possible that you are overwriting records.
You are using the currentTime in milliseconds as the rowkey and any records created with the same rowkey will overwrite the old one.
Put put = new Put(Bytes.toBytes(System.currentTimeMillis()));
So if 100 Puts are created in 1 millisecond, then only 1 will show up in HBase, since the same row was overwritten 99 times.
It's likely that the 4k rowkeys in HBase are the 4k unique milliseconds (4 seconds) it took to load the data.
I would suggest using a different rowkey design. Also, as a side note, it is typically a bad idea to use monotonically increasing rowkeys in HBase:
Further Information
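As a rough illustration of the suggestion (the getters come from the question's Log class; combining record fields with a random suffix is just one possible design, not something prescribed by the original answer), the Put in call() could be keyed like this:
// Build a rowkey from fields of the record plus a random suffix, so that two
// records written in the same millisecond no longer share a rowkey.
String rowKey = row.getIp() + "_" + row.getTime() + "_" + java.util.UUID.randomUUID();
Put put = new Put(Bytes.toBytes(rowKey));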
Using the following code, I am getting the error below when trying to write to BigQuery. I am using Apache Beam 2.0.0.
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NullPointerException
If I change text.startsWith("X") to "D", everything works fine (i.e. something is output).
Is there some way to catch or watch for empty PCollections?
Based on the stack trace, it looks like the error is actually in BigQueryIO: the file left in my bucket has 0 bytes, and maybe this is causing BigQueryIO a problem.
My use case is that I am using side outputs for dead letters, and I encountered this error when my job produced no dead-letter output, so handling this robustly would be useful.
The job should really be able to run in batch or streaming mode. My best guess is to write any output to GCS / TextIO in batch mode and to BigQuery when streaming, if that sounds sensible?
Any help gratefully received.
public class EmptyPCollection {

    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.create();
        options.setTempLocation("gs://<your-bucket-here>/temp");
        Pipeline pipeline = Pipeline.create(options);

        String schema = "{\"fields\": [{\"name\": \"pet\", \"type\": \"string\", \"mode\": \"required\"}]}";
        String table = "<your-dataset>.<your-table>";

        List<String> pets = Arrays.asList("Dog", "Cat", "Goldfish");
        PCollection<String> inputText = pipeline.apply(Create.of(pets)).setCoder(StringUtf8Coder.of());

        PCollection<TableRow> rows = inputText.apply(ParDo.of(new DoFn<String, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String text = c.element();
                if (text.startsWith("X")) { // change to (D)og and works fine
                    TableRow row = new TableRow();
                    row.set("pet", text);
                    c.output(row);
                }
            }
        }));

        rows.apply(BigQueryIO.writeTableRows().to(table).withJsonSchema(schema)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

        pipeline.run().waitUntilFinish();
    }
}
[direct-runner-worker] INFO org.apache.beam.sdk.io.gcp.bigquery.TableRowWriter - Opening TableRowWriter to gs://<your-bucket>/temp/BigQueryWriteTemp/05c7a7c0786a4656abad97f11ef23d8e/2675e1c7-f4d7-4f78-a85f-a38095b57e6b.
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NullPointerException
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:322)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:292)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:200)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:295)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:281)
at EmptyPCollection.main(EmptyPCollection.java:54)
Caused by: java.lang.NullPointerException
at org.apache.beam.sdk.io.gcp.bigquery.WriteTables.processElement(WriteTables.java:97)
This looks like a bug in the BigQuery sink implementation within Apache Beam. The Apache Beam Jira is the appropriate place to file this.
I have filed https://issues.apache.org/jira/browse/BEAM-2406 to track this issue.
I'm having a hard time getting HBase's FuzzyRowFilter to work.
I have the following test table:
hbase(main):014:0> scan 'test'
ROW COLUMN+CELL
row-01 column=colfam1:col1, timestamp=1481193793338, value=value1
row-02 column=colfam1:col1, timestamp=1481193799186, value=value2
row-03 column=colfam1:col1, timestamp=1481193803941, value=value3
row-04 column=colfam1:col1, timestamp=1481193808209, value=value4
row-05 column=colfam1:col1, timestamp=1481193812737, value=value5
5 row(s) in 0.0200 seconds
Here is my Java code (I started with Scala, but the results are the same - none):
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "localhost:2182");
conf.set("hbase.master", "localhost:60000");
conf.set("hbase.rootdir", "/hbase");

try {
    Scan scan = new Scan();
    scan.setCaching(5);

    byte[] rowKeys = Bytes.toBytesBinary("???-01");
    byte[] fuzzyInfo = {0x01, 0x01, 0x01, 0x00, 0x00, 0x00};
    FuzzyRowFilter fuzzyFilter = new FuzzyRowFilter(
            Arrays.asList(
                    new Pair<byte[], byte[]>(
                            rowKeys,
                            fuzzyInfo)));
    System.out.println("### fuzzyFilter: " + fuzzyFilter.toString());

    scan.addFamily(Bytes.toBytesBinary("colfam1"));
    scan.setStartRow(Bytes.toBytesBinary("row-01"));
    scan.setStopRow(Bytes.toBytesBinary("row-05"));
    scan.setFilter(fuzzyFilter);

    Connection conn = ConnectionFactory.createConnection(conf);
    Table table = conn.getTable(TableName.valueOf("test"));
    ResultScanner results = table.getScanner(scan);

    int count = 0;
    int limit = 100;
    for (Result r : results) {
        System.out.println("" + r.toString());
        if (count++ >= limit) break;
    }
} catch (Exception e) {
    e.printStackTrace();
}
I simply do not get any results back from the server. If I comment out the line scan.setFilter(fuzzyFilter);, I get the expected results:
keyvalues={row-01/colfam1:col1/1481193793338/Put/vlen=6/seqid=0}
keyvalues={row-02/colfam1:col1/1481193799186/Put/vlen=6/seqid=0}
keyvalues={row-03/colfam1:col1/1481193803941/Put/vlen=6/seqid=0}
keyvalues={row-04/colfam1:col1/1481193808209/Put/vlen=6/seqid=0}
Am I doing something wrong? Is there a bug in HBase (version 1.2.2)? I am using the version installed through Homebrew on the latest macOS Sierra.
Update
On a Cloudera Hadoop cluster running CDH 5.7 with HBase 1.2.0-cdh5.7.0, I get the desired output for rowkey row-01. The error must somehow be related to my local setup.
Solution
Indeed, the problem was that HBase server installation and client JAR versions did not match. In my case, I was using the artifacts
hbase-common
hbase-client
hbase-server
with version 1.2.0-cdh5.7.0 instead of 1.2.2.
My mistake was assuming that minor version differences would not have a large impact, but apparently Cloudera has applied some major changes in their versions with respect to the official code base. Changing to the official version 1.2.2 made the FuzzyRowFilter work as expected.
It should print only the rowkey of row-01, as can be seen from the filter condition.
There is no such bug, and it works as expected; I have been using the same filter for some time now.
Check your configuration, dependencies, etc.
Due to versioning, libraries and their clients often become incompatible.
Let's take a simple example:
class ServerVersionA {
    public static DataObject getData() {
        return new DataObject(/* data with headerVersionA */);
    }
}

class ClientVersionB {
    public void showData() {
        DataObject dataObject = makeRequest(params);
        // Check whether the data received is of version B after verifying the header
        boolean status = validate(dataObject);
        if (status) {
            doIO(dataObject);
        }
    }
}
In this case, if the header does not match, the client simply sits idle.
These kinds of issues are mostly taken care of, but sometimes they creep in.
If we look at the sources of the installed server and the client version, we can find out why data is not being returned and no exception is propagated.
I have a problem: I write entries from Java code to a Cassandra database; it works for a while, and then stops writing. (nodetool cfstats keyspace.users -H on all nodes shows no changes in "Number of keys (estimate)".)
Configuration: 4 nodes (4 GB, 4 GB, 4 GB, and 6 GB RAM).
I am using the DataStax driver, and connect like this:
private Cluster cluster = Cluster.builder()
.addContactPoints(<points>)
.build();
private Session session = cluster.connect("keyspace");
private MappingManager mappingManager = new MappingManager(session);
...
I insert into the database like this:
public void writeUser(User user) {
Mapper<User> mapper = mappingManager.mapper(User.class);
mapper.saveAsync(user, Mapper.Option.timestamp(TimeUnit.NANOSECONDS.toMicros(System.nanoTime())));
}
I also tried
public void writeUser(User user) {
Mapper<User> mapper = mappingManager.mapper(User.class);
mapper.save(user);
}
And two variants in between.
In debug.log on the server I see:
DEBUG [GossipStage:1] 2016-05-11 12:21:14,565 FailureDetector.java:456 - Ignoring interval time of 2000380153 for /node
Maybe the problem is that the server is in another country? But then why does it write entries at the beginning? How can I fix my problem?
Another update: session.execute on mapper.save returns ResultSet[ exhausted: true, Columns[]]