I have a job that reads a list of documents from a SQL Server database. The documents need to be in certain statuses and sorted by the column status_updated_time.
I want to read only the document.id and then process it in the job processor, following the Driving Query Based ItemReaders pattern.
The status column is changed in the writer, so I can't use JpaPagingItemReader because of this problem.
I used JdbcPagingItemReader, but got an error when sorting by status_updated_time.
Then I tried adding id to the sort keys as well, but that didn't help.
The query I want to generate is:
SELECT id
FROM document
WHERE status IN (0, 1, 2)
ORDER BY status_updated_time ASC, id ASC
My reader:
@StepScope
@Bean
private ItemReader<Long> statusReader() {
    JdbcPagingItemReader<Long> reader = new JdbcPagingItemReader<>();
    ...
    reader.setRowMapper(SingleColumnRowMapper.newInstance(Long.class));
    ...
    Map<String, Order> sortKeys = new HashMap<>();
    sortKeys.put("status_updated_time", Order.ASCENDING);
    sortKeys.put("id", Order.ASCENDING);
    SqlServerPagingQueryProvider queryProvider = new SqlServerPagingQueryProvider();
    queryProvider.setSelectClause(SELECT_CLAUSE);
    queryProvider.setFromClause(FROM_CLAUSE);
    queryProvider.setWhereClause(WHERE_CLAUSE);
    queryProvider.setSortKeys(sortKeys);
    reader.setQueryProvider(queryProvider);
    ...
    return reader;
}
Where the constants are:
private static final String SELECT_CLAUSE = "id";
private static final String FROM_CLAUSE = "document";
private static final String WHERE_CLAUSE = "status IN (0, 1, 2) ";
When the job is executed, I get this error:
org.springframework.dao.TransientDataAccessResourceException: StatementCallback; SQL [SELECT TOP 10 id FROM document WHERE status IN (0, 1, 2) ORDER BY id ASC, status_updated_time ASC]; The column name status_updated_time is not valid.; nested exception is com.microsoft.sqlserver.jdbc.SQLServerException: The column name status_updated_time is not valid.
at org.springframework.jdbc.support.SQLStateSQLExceptionTranslator.doTranslate(SQLStateSQLExceptionTranslator.java:110)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:72)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:81)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:81)
at org.springframework.jdbc.core.JdbcTemplate.translateException(JdbcTemplate.java:1443)
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:388)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:452)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:462)
at org.springframework.batch.item.database.JdbcPagingItemReader.doReadPage(JdbcPagingItemReader.java:210)
at org.springframework.batch.item.database.AbstractPagingItemReader.doRead(AbstractPagingItemReader.java:108)
at org.springframework.batch.item.support.AbstractItemCountingItemStreamItemReader.read(AbstractItemCountingItemStreamItemReader.java:92)
at org.springframework.batch.core.step.item.SimpleChunkProvider.doRead(SimpleChunkProvider.java:94)
at org.springframework.batch.core.step.item.FaultTolerantChunkProvider.read(FaultTolerantChunkProvider.java:87)
at org.springframework.batch.core.step.item.SimpleChunkProvider$1.doInIteration(SimpleChunkProvider.java:119)
I saw some questions regarding The column name XYZ is not valid on Stack Overflow (this...), but I haven't seen anything that works in my case, where I need to sort by another column.
Another problem is the order of the sort columns.
No matter whether I add status_updated_time or id to the map first, the generated script always contains ORDER BY id ASC, status_updated_time ASC.
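As an aside, here is a minimal sketch of the same sort-key setup using a LinkedHashMap, which preserves insertion order (this assumes the query provider builds the ORDER BY clause by iterating the sortKeys map in that order):
Map<String, Order> sortKeys = new LinkedHashMap<>(); // preserves insertion order, unlike HashMap
sortKeys.put("status_updated_time", Order.ASCENDING); // intended primary sort key
sortKeys.put("id", Order.ASCENDING);                  // tie-breaker
queryProvider.setSortKeys(sortKeys);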
EDIT:
Reading this question, especially this line:
JdbcPagingItemReader assumes here that the sort key and the column in the select clause are called exactly the same
I realized that I need the column status_updated_time in the result set, so I refactored:
private static final String SELECT_CLAUSE = "id, status_updated_time";
...
queryProvider.setSelectClause(SELECT_CLAUSE);
...
reader.setRowMapper(
    (rs, i) -> {
        Document document = new Document();
        document.setId(rs.getLong(1));
        document.setStatusUpdatedTime(rs.getObject(2, Timestamp.class));
        return document;
    }
);
Now the application compiles and the job runs.
But the sorting problem stays the same: I can't order by status_updated_time first and then by id; id always comes first.
I tried removing id from the sort keys and ran into another problem.
On the test environment I had 1,600 rows to process. My job processes a row and updates its status_updated_time to now(). When the job started, it didn't stop at 1,600 rows but kept going, because each processed row got a new status_updated_time, the reader considered it a new row, and processing continued endlessly.
When I sorted only by id, the job processed the 1,600 rows and then stopped.
So it seems I can't use JdbcPagingItemReader because of this sorting problem.
I also wanted a reader that can run in parallel to speed this job up (it runs for about 20 minutes every hour of the day).
Any suggestions?
I want to thank Mahmoud for monitoring the Spring Batch questions and trying to help. His proposal didn't work for me, though, so I used a different approach.
I used a temporary (auxiliary) table to prepare the data for the main step; in the main step the reader reads from that table.
The first step drops the help table:
@Bean
private Step dropHelpTable() {
    return stepBuilderFactory
        .get(STEP_DROP_HELP_TABLE)
        .transactionManager(cronTransactionManager)
        .tasklet(dropHelpTableTasklet())
        .build();
}

private Tasklet dropHelpTableTasklet() {
    return (contribution, chunkContext) -> {
        jdbcTemplate.execute(DROP_SCRIPT);
        return RepeatStatus.FINISHED;
    };
}

private static final String STEP_DROP_HELP_TABLE = "dropHelpTable";
private static final String DROP_SCRIPT = "IF EXISTS (SELECT 1 FROM INFORMATION_SCHEMA.TABLES "
    + "WHERE TABLE_NAME = 'query_document_helper') "
    + "BEGIN "
    + "    DROP TABLE query_document_helper "
    + "END";
The second step prepares the data: it inserts the ids of the documents that will be processed later:
@Bean
private Step insertDataToHelpTable() {
    return stepBuilderFactory
        .get(STEP_INSERT_HELP_TABLE)
        .transactionManager(cronTransactionManager)
        .tasklet(insertDataToHelpTableTasklet())
        .build();
}

private Tasklet insertDataToHelpTableTasklet() {
    return (contribution, chunkContext) -> {
        jdbcTemplate.execute("SELECT TOP " + limit + " id " + INSERT_SCRIPT);
        return RepeatStatus.FINISHED;
    };
}

private static final String STEP_INSERT_HELP_TABLE = "insertHelpTable";
private static final String INSERT_SCRIPT = "INTO query_document_helper "
    + "FROM dbo.document "
    + "WHERE status IN (0, 1, 2) "
    + "ORDER BY status_updated_time ASC";

@Value("${cron.batchjob.queryDocument.limit}")
private Integer limit;
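With limit set to, say, 100, the statement the tasklet executes works out to:
SELECT TOP 100 id INTO query_document_helper FROM dbo.document WHERE status IN (0, 1, 2) ORDER BY status_updated_time ASC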
After this I have all the data that will be used in one job execution, so ordering by status_updated_time is no longer needed (the requirement is not to process the youngest documents in this execution, but in some later execution once they have become the oldest).
Then, in the next step, I use a regular reader.
@Bean
private Step queryDocumentStep() {
    return stepBuilderFactory
        .get(STEP_QUERY_NEW_DOCUMENT_STATUS)
        .transactionManager(cronTransactionManager)
        .<Long, Document>chunk(chunk)
        .reader(documentReader())
        ...
        .taskExecutor(multiThreadingTaskExecutor.threadPoolTaskExecutor())
        .build();
}

@StepScope
@Bean
private ItemReader<Long> documentReader() {
    JdbcPagingItemReader<Long> reader = new JdbcPagingItemReader<>();
    reader.setDataSource(coreBatchDataSource);
    reader.setMaxItemCount(limit);
    reader.setPageSize(chunk);
    ...
    Map<String, Order> sortKeys = new HashMap<>();
    sortKeys.put("id", Order.ASCENDING);
    SqlServerPagingQueryProvider queryProvider = new SqlServerPagingQueryProvider();
    queryProvider.setSelectClause(SELECT_CLAUSE);
    queryProvider.setFromClause(FROM_CLAUSE);
    queryProvider.setSortKeys(sortKeys);
    reader.setQueryProvider(queryProvider);
    ...
    return reader;
}
private static final String STEP_QUERY_NEW_DOCUMENT_STATUS = "queryNewDocumentStatus";
private static final String SELECT_CLAUSE = "id";
private static final String FROM_CLAUSE = "query_document_helper";
And the job looks like this:
@Bean
public Job queryDocumentJob() {
    return jobBuilderFactory
        .get(JOB_QUERY_DOCUMENT)
        .incrementer(new RunIdIncrementer())
        .start(dropHelpTable())
        .next(insertDataToHelpTable())
        .next(queryDocumentStep())
        .build();
}
private static final String JOB_QUERY_DOCUMENT = "queryDocument";
Related
When I run the stream below, it does not receive any subsequent data once the stream is running.
final long HOUR = 3600000;
final long PAST_HOUR = System.currentTimeMillis() - HOUR;

private final static ActorSystem actorSystem = ActorSystem.create(Behaviors.empty(), "as");

protected static ElasticsearchParams constructElasticsearchParams(
        String indexName, String typeName, ApiVersion apiVersion) {
    if (apiVersion == ApiVersion.V5) {
        return ElasticsearchParams.V5(indexName, typeName);
    } else if (apiVersion == ApiVersion.V7) {
        return ElasticsearchParams.V7(indexName);
    } else {
        throw new IllegalArgumentException("API version " + apiVersion + " is not supported");
    }
}

String queryStr = "{ \"bool\": { \"must\" : [{\"range\" : {" +
    "\"timestamp\" : { " +
    "\"gte\" : " + PAST_HOUR +
    " }} }]}} ";

ElasticsearchConnectionSettings connectionSettings =
    ElasticsearchConnectionSettings.create("****")
        .withCredentials("****", "****");

ElasticsearchSourceSettings sourceSettings =
    ElasticsearchSourceSettings.create(connectionSettings)
        .withApiVersion(ApiVersion.V7);

Source<ReadResult<Stats>, NotUsed> dataSource =
    ElasticsearchSource.typed(
        constructElasticsearchParams("data", "_doc", ApiVersion.V7),
        queryStr,
        sourceSettings,
        Stats.class);

dataSource.buffer(10000, OverflowStrategy.backpressure());
dataSource.backpressureTimeout(Duration.ofSeconds(1));

dataSource
    .log("error")
    .runWith(Sink.foreach(a -> System.out.println(a)), actorSystem);
produces output :
ReadResult(id=1656107389556,source=Stats(size=0.09471),version=)
Data is continually being written to the data index, but the stream does not process it once it has started. Shouldn't the stream continually process data from the upstream source? In this case, the upstream source is an Elasticsearch index named data.
I've tried amending the query to match all documents :
String queryStr = "{\"match_all\": {}}";
but the result is the same.
The Elasticsearch source does not run continuously. It initiates a search, manages pagination (via Elasticsearch's scroll API), and streams the results; once Elasticsearch reports that there are no more results, the source completes.
You could do something like:
Source.repeat(Done).flatMapConcat(done -> ElasticsearchSource.typed(...))
which runs a new search immediately after the previous one finishes. Note that it is then the downstream's responsibility to filter out duplicates.
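A minimal sketch of that pattern, reusing the query, settings, and helper from the question (duplicates across runs still have to be filtered downstream):
Source<ReadResult<Stats>, NotUsed> repeatedSource =
    Source.repeat(Done.getInstance())
        .flatMapConcat(done ->
            ElasticsearchSource.typed(
                constructElasticsearchParams("data", "_doc", ApiVersion.V7),
                queryStr,
                sourceSettings,
                Stats.class));

repeatedSource.runWith(Sink.foreach(System.out::println), actorSystem);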
I have a table in DynamoDB with an attribute createDate, and I want to do a scan with a filter over a specific period of that attribute (for example: 2022-01-01 to 2022-01-31), but I don't know whether that is possible or how to do it. If anyone has done this and can help me, it would be very helpful.
Just one more question: is it possible to put the result in a CSV file?
Here is my code, where I scan with a single date:
public class QueryTableResearchAnswers {

    static AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
    static DynamoDB dynamoDB = new DynamoDB(client);
    static String tableName = "research-answers";

    public static void main(String[] args) throws Exception {
        String researchAnswers = "Amazon DynamoDB";
        findAnswersWithinTimePeriod(researchAnswers);
        //findRepliesPostedWithinTimePeriod(researchAnswers);
    }

    private static void findAnswersWithinTimePeriod(String researchAnswers) {
        Table table = dynamoDB.getTable(tableName);

        Map<String, Object> expressionAttributeValues = new HashMap<String, Object>();
        expressionAttributeValues.put(":startDate", "2022-01-01T00:00:00.0Z");

        ItemCollection<ScanOutcome> items = table.scan("createDate > :startDate", // FilterExpression
            "bizId, accountingsessionid, accounttype, acctsessionid, choicecode, contextname, createDate, document, framedipaddress," +
            "macaddress, macaddressnetworkdata, machash, mail, nasgrelocalip, nasidentifier, nasipaddress, nasportid, network, networktype, networkuuid, phone," +
            "question, questionanswer, questioncode, realm, relayingmacaddress, remoteipaddress, useragent, username", // ProjectionExpression
            null, // ExpressionAttributeNames - not used in this example
            expressionAttributeValues);

        System.out.println("Scan of " + tableName + " for january answers");

        Iterator<Item> iterator = items.iterator();
        while (iterator.hasNext()) {
            System.out.println(iterator.next().toJSONPretty());
        }
    }
}
In general, for an arbitrary date range:
createDate BETWEEN :date1 AND :date2
But in your specific case of 2022-01-01 to 2022-01-31 (the entire month of January), you can simplify this to:
begins_with(createDate, :monthPrefix)
with :monthPrefix bound to "2022-01" in the expression attribute values (DynamoDB filter expressions reference values through placeholders, not literals).
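As a sketch, here is the same Document API scan as in the question, but with a date-range filter; the attribute and value names follow the question, while the exact bounds and the shortened projection are just examples:

Map<String, Object> expressionAttributeValues = new HashMap<>();
expressionAttributeValues.put(":startDate", "2022-01-01T00:00:00.0Z");
expressionAttributeValues.put(":endDate", "2022-01-31T23:59:59.9Z");

ItemCollection<ScanOutcome> items = table.scan(
        "createDate BETWEEN :startDate AND :endDate", // FilterExpression
        "bizId, createDate, document",                // ProjectionExpression, trimmed for brevity
        null,                                         // ExpressionAttributeNames - not needed here
        expressionAttributeValues);

for (Item item : items) {
    System.out.println(item.toJSONPretty());
}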
If I want to run two different select queries on a Flink table created from a DataStream, the Blink planner runs them as two different jobs. Is there a way to combine them and run them as a single job?
Example code :
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);
System.out.println("Running credit scores : ");
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

DataStream<String> recordsStream =
    env.readTextFile("src/main/resources/credit_trial.csv");

DataStream<CreditRecord> creditStream = recordsStream
    .filter((FilterFunction<String>) line -> !line.contains(
        "Loan ID,Customer ID,Loan Status,Current Loan Amount,Term,Credit Score,Annual Income,Years in current job" +
        ",Home Ownership,Purpose,Monthly Debt,Years of Credit History,Months since last delinquent,Number of Open Accounts," +
        "Number of Credit Problems,Current Credit Balance,Maximum Open Credit,Bankruptcies,Tax Liens"))
    .map(new MapFunction<String, CreditRecord>() {
        @Override
        public CreditRecord map(String s) throws Exception {
            String[] fields = s.split(",");
            return new CreditRecord(fields[0], fields[2], Double.parseDouble(fields[3]),
                fields[4], fields[5].trim().equals("") ? 0.0 : Double.parseDouble(fields[5]),
                fields[6].trim().equals("") ? 0.0 : Double.parseDouble(fields[6]),
                fields[8], Double.parseDouble(fields[15]));
        }
    });

tableEnv.createTemporaryView("CreditDetails", creditStream);
Table creditDetailsTable = tableEnv.from("CreditDetails");

Table resultsTable = creditDetailsTable.select($("*"))
    .filter($("loanStatus").isEqual("Charged Off"));
TableResult result = resultsTable.execute();
result.print();

Table resultsTable2 = creditDetailsTable.select($("*"))
    .filter($("loanStatus").isEqual("Fully Paid"));
TableResult result2 = resultsTable2.execute();
result2.print();
The above code creates two different jobs, but I don't want that. Is there any way around this?
I have a nested SQL query to fetch employee details using their IDs.
Right now I am using BeanListHandler to fetch the data as a List<Details>, but I want to store it as a Map<String, Details> where the ID I originally pass is the key, for easy retrieval instead of searching the list with streams every time.
I have tried converting to a Map, but I am not sure how to map the ID as a String, nor how to get the original ID passed to the inner query as a column in the final result.
MainTest.java:
String candidateId = "('1111', '2222', '3333', '4444')";
String detailsQuery =
"select PARTNER, BIRTHDT, XSEXM, XSEXF from \"schema\".\"platform.view/table2\" where partner IN \r\n"
+ "(select SID from \"schema\".\"platform.view/table1\" where TYPE='BB' and CLASS='yy' and ID IN \r\n"
+ "(select SID from \"schema\".\"platform.view/table1\" where TYPE='AA' and CLASS='zz' and ID IN"
+ candidateId + "\r\n" + "))";
Map<String, Details> detailsView = queryRunner.query(conn, detailsQuery, new DetailsViewHandler());
Details.java:
public class Details {
    private String candidateId;
    private String birthDate;
    private String maleSex;
    private String femaleSex;
    // getter and setter
}
DetailsViewHandler.java:
public class DetailsViewHandler extends BeanMapHandler<String, Details> {

    public DetailsViewHandler() {
        super(Details.class, new BasicRowProcessor(new BeanProcessor(getColumnsToFieldsMap())));
    }

    public static Map<String, String> getColumnsToFieldsMap() {
        Map<String, String> columnsToFieldsMap = new HashMap<>();
        columnsToFieldsMap.put("PARTNER", "candidateId");
        columnsToFieldsMap.put("BIRTHDT", "birthDate");
        columnsToFieldsMap.put("XSEXM", "maleSex");
        columnsToFieldsMap.put("XSEXF", "femaleSex");
        return columnsToFieldsMap;
    }
}
Is there a way to get the ID (candidateId) in the result, and what am I missing in terms of creating the key-value pairing?
From the doc https://commons.apache.org/proper/commons-dbutils/apidocs/org/apache/commons/dbutils/handlers/BeanMapHandler.html
for the constructor which you are using:
public BeanMapHandler(Class<V> type,
RowProcessor convert)
// Creates a new instance of BeanMapHandler. The value of the first column of each row will be a key in the Map.
The above should work.
You can also try overriding createKey, like so:
protected K createKey(ResultSet rs) throws SQLException {
    return rs.getString("PARTNER"); // or getInt, whatever suits your key type
}
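Put together, a sketch of the question's handler with that override added (everything else is unchanged from the question):

public class DetailsViewHandler extends BeanMapHandler<String, Details> {

    public DetailsViewHandler() {
        super(Details.class, new BasicRowProcessor(new BeanProcessor(getColumnsToFieldsMap())));
    }

    @Override
    protected String createKey(ResultSet rs) throws SQLException {
        return rs.getString("PARTNER"); // the candidate id becomes the map key
    }

    public static Map<String, String> getColumnsToFieldsMap() {
        Map<String, String> columnsToFieldsMap = new HashMap<>();
        columnsToFieldsMap.put("PARTNER", "candidateId");
        columnsToFieldsMap.put("BIRTHDT", "birthDate");
        columnsToFieldsMap.put("XSEXM", "maleSex");
        columnsToFieldsMap.put("XSEXF", "femaleSex");
        return columnsToFieldsMap;
    }
}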
I am trying to insert an item into MongoDB using the Java MongoDB driver. Before inserting, I try to get the next id to use, but I am not sure why I always get nextId as 4. I am using the method below to get the next id before inserting any item into Mongo.
private Long getNextIdValue(DBCollection dbCollection) {
    Long nextSequenceNumber = 1L;
    DBObject query = new BasicDBObject();
    query.put("id", -1);
    DBCursor cursor = dbCollection.find().sort(query).limit(1);
    while (cursor.hasNext()) {
        DBObject itemDBObj = cursor.next();
        nextSequenceNumber = new Long(itemDBObj.get("id").toString()) + 1;
    }
    return nextSequenceNumber;
}
I have 13 records in total in my MongoDB collection. What am I doing wrong here?
Please don't do that. You don't need to build your own id management, as the driver already handles this in the best way; just use the right type and annotations for the field:
@Id
@ObjectId
private String id;
Then write a generic method to insert all entities:
public T create(T entity) throws MongoException, IOException {
    WriteResult<? extends Object, String> result = jacksonDB.insert(entity);
    return (T) result.getSavedObject();
}
This will create a time-based, indexed hash for the ids, which is far more robust than computing the "next id" yourself.
https://www.tutorialspoint.com/mongodb/mongodb_objectid.htm
How can you perform arithmetic operations like +1 on a String?
nextSequenceNumber = new Long(itemDBObj.get("id").toString()) + 1;
Try creating a sequence collection like this:
{"id":"MySequence","sequence":1}
Then use Update to increment the sequence:
// Query for the sequence collection
Query query = new Query(new Criteria().where("id").is("MySequence"));

// Increment the sequence by 1
Update update = new Update();
update.inc("sequence", 1);

FindAndModifyOptions findAndModifyOptions = new FindAndModifyOptions();
findAndModifyOptions.returnNew(true);

SequenceCollection sequenceCollection = mongoOperations.findAndModify(query, update, findAndModifyOptions, SequenceCollection.class);
return sequenceCollection.getSequence();
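For completeness, a minimal sketch of the SequenceCollection document class this snippet assumes; the collection name and field names here are guesses, not taken from the original code:

@Document(collection = "sequence")
public class SequenceCollection {

    @Id
    private String id;     // e.g. "MySequence"
    private long sequence; // current counter value

    public long getSequence() {
        return sequence;
    }

    public void setSequence(long sequence) {
        this.sequence = sequence;
    }
}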
I found a workaround using db.collection.count(). I simply take the total count and increment it by 1 to assign an id to my object.