Cassandra stops writing after some time - Java

I have a problem: I write entries from Java code to a Cassandra database; it works for a while and then stops writing. (Running nodetool cfstats keyspace.users -H on all nodes shows no change in "Number of keys (estimate)".)
Configuration: 4 nodes (4 GB, 4 GB, 4 GB, and 6 GB RAM).
I am using the DataStax driver, and I create the connection like this:
private Cluster cluster = Cluster.builder()
        .addContactPoints(<points>)
        .build();
private Session session = cluster.connect("keyspace");
private MappingManager mappingManager = new MappingManager(session);
...
I insert into the database like this:
public void writeUser(User user) {
    Mapper<User> mapper = mappingManager.mapper(User.class);
    mapper.saveAsync(user, Mapper.Option.timestamp(TimeUnit.NANOSECONDS.toMicros(System.nanoTime())));
}
I also tried
public void writeUser(User user) {
    Mapper<User> mapper = mappingManager.mapper(User.class);
    mapper.save(user);
}
And two variants in between.
In debug.log on the server I see
DEBUG [GossipStage:1] 2016-05-11 12:21:14,565 FailureDetector.java:456 - Ignoring interval time of 2000380153 for /node
Maybe the problem is that the server is in another country? But then why does it write entries at the beginning? How can I fix this?
Another update: session.execute on mapper.save returns ResultSet[ exhausted: true, Columns[]]
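For what it's worth, mapper.saveAsync returns a ListenableFuture, so any write error is dropped silently unless the future is inspected. A minimal diagnostic sketch (assuming the Guava version bundled with the driver, where the two-argument Futures.addCallback is available) that at least logs failed writes:
public void writeUser(User user) {
    Mapper<User> mapper = mappingManager.mapper(User.class);
    // attach a callback so that a failed write (e.g. WriteTimeoutException,
    // NoHostAvailableException) becomes visible instead of vanishing
    ListenableFuture<Void> future = mapper.saveAsync(user);
    Futures.addCallback(future, new FutureCallback<Void>() {
        @Override
        public void onSuccess(Void result) {
            // write acknowledged by the cluster
        }

        @Override
        public void onFailure(Throwable t) {
            t.printStackTrace();
        }
    });
}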

Related

OutOfMemoryError: Java heap space, while retrieving data by using JdbcTemplate (RowCallbackHandlerResultSetExtractor.extractData)

public List<Employee> getEmployeeDataByDate(Date dateCreated) throws Exception {
    List<Employee> listDetails = new ArrayList<>();
    sqlServTemplate.query(SyncQueryConstants.RETRIEVE_EMPLOYEE_RECORDS_BY_DATE, new Object[] { dateCreated },
            new RowCallbackHandler() {
                @Override
                public void processRow(ResultSet rs) throws SQLException {
                    Employee emp = new Employee();
                    emp.setEmployeeID(rs.getString("employeeID"));
                    emp.setFirstName(rs.getString("firstName"));
                    emp.setLastName(rs.getString("lastName"));
                    emp.setMiddleName(rs.getString("middleName"));
                    emp.setNickName(rs.getString("nickName"));
                    byte[] res = rs.getBytes("employeeImage");
                    Blob blob = new SerialBlob(res);
                    emp.setEmployeeImage(blob);
                    // .....
                    listDetails.add(emp);
                }
            });
    return listDetails;
}
Here I'm trying to retrieve records from the employee table. Because of the BLOB data it's throwing OutOfMemoryError: Java heap space. Could anyone help me with this?
It's a stand-alone application; I'm syncing from one table to another, so I am unable to use pagination. Every day about 2k records are synced at midnight by a cron job. Please give me some idea of how I can solve this issue.
SELECT * FROM Employees with(nolock) WHERE cast (datediff (day, 0, dateCreated) as datetime) >= ?
This query gives me all data based on the date (around 2k records each day).
If I comment out these lines:
byte[] res = rs.getBytes("employeeImage");
Blob blob = new SerialBlob(res);
emp.setEmployeeImage(blob);
then there is no issue. Otherwise it throws the error.
Please give me some idea, and if possible some sample code.
I've been struggling with this for 2 days.
As some other commenters have mentioned, you can either increase your heap space or limit the number of records returned from your query and process them in smaller batches.
You are reading MB-sized images into byte arrays, which will consume your heap memory.
Instead, try using a binary stream:
InputStream image = rs.getBinaryStream("employeeImage");
Rather than adding each employee to the list, you could process them one at a time. Put each record into the other database as it's pulled from the source database, rather than accumulating them in the List, which is what is causing the OOM error. If there is some other processing downstream, then inject a class into this DAO that handles the actual processing/writing to the target DB.
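A minimal sketch of that approach, assuming a hypothetical EmployeeWriter collaborator (not in the original code) that persists each row to the target table; the query constant and template are the ones from the question:
public void syncEmployeesByDate(Date dateCreated, EmployeeWriter writer) {
    sqlServTemplate.query(SyncQueryConstants.RETRIEVE_EMPLOYEE_RECORDS_BY_DATE, new Object[] { dateCreated },
            new RowCallbackHandler() {
                @Override
                public void processRow(ResultSet rs) throws SQLException {
                    Employee emp = new Employee();
                    emp.setEmployeeID(rs.getString("employeeID"));
                    emp.setFirstName(rs.getString("firstName"));
                    emp.setLastName(rs.getString("lastName"));
                    // stream the image instead of materialising a byte[] per row
                    try (InputStream image = rs.getBinaryStream("employeeImage")) {
                        // EmployeeWriter is a hypothetical interface:
                        // void write(Employee emp, InputStream image) throws SQLException
                        // Writing each record immediately keeps only one row in memory at a time.
                        writer.write(emp, image);
                    } catch (IOException e) {
                        throw new SQLException("Failed to copy employee image", e);
                    }
                }
            });
}
This way no List is built up, so heap usage stays flat regardless of how many records the query returns.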

Connection pooling in multi tenant app. Shared pool vs pool per tenant

I'm building a multi-tenant REST server application with Spring Boot 2.x, Hibernate 5.x, Spring Data REST, and MySQL 5.7.
Spring Boot 2.x uses HikariCP for connection pooling.
I'm going to use a DB-per-tenant approach, so every tenant will have its own database.
I created my MultiTenantConnectionProvider in this way:
@Component
@Profile("prod")
public class MultiTenantConnectionProviderImpl implements MultiTenantConnectionProvider {

    private static final long serialVersionUID = 3193007611085791247L;

    private Logger log = LogManager.getLogger();

    private Map<String, HikariDataSource> dataSourceMap = new ConcurrentHashMap<String, HikariDataSource>();

    @Autowired
    private TenantRestClient tenantRestClient;

    @Autowired
    private PasswordEncrypt passwordEncrypt;

    @Override
    public void releaseAnyConnection(Connection connection) throws SQLException {
        connection.close();
    }

    @Override
    public Connection getAnyConnection() throws SQLException {
        Connection connection = getDataSource(TenantIdResolver.TENANT_DEFAULT).getConnection();
        return connection;
    }

    @Override
    public Connection getConnection(String tenantId) throws SQLException {
        Connection connection = getDataSource(tenantId).getConnection();
        return connection;
    }

    @Override
    public void releaseConnection(String tenantId, Connection connection) throws SQLException {
        log.info("releaseConnection " + tenantId);
        connection.close();
    }

    @Override
    public boolean supportsAggressiveRelease() {
        return false;
    }

    @Override
    public boolean isUnwrappableAs(Class unwrapType) {
        return false;
    }

    @Override
    public <T> T unwrap(Class<T> unwrapType) {
        return null;
    }

    public HikariDataSource getDataSource(@NotNull String tenantId) throws SQLException {
        if (dataSourceMap.containsKey(tenantId)) {
            return dataSourceMap.get(tenantId);
        } else {
            HikariDataSource dataSource = createDataSource(tenantId);
            dataSourceMap.put(tenantId, dataSource);
            return dataSource;
        }
    }

    public HikariDataSource createDataSource(String tenantId) throws SQLException {
        log.info("Create Datasource for tenant {}", tenantId);
        try {
            Database database = tenantRestClient.getDatabase(tenantId);
            DatabaseInstance databaseInstance = tenantRestClient.getDatabaseInstance(tenantId);
            if (database != null && databaseInstance != null) {
                HikariConfig hikari = new HikariConfig();
                String driver = "";
                String options = "";
                switch (databaseInstance.getType()) {
                case MYSQL:
                    driver = "jdbc:mysql://";
                    options = "?useLegacyDatetimeCode=false&serverTimezone=UTC&useUnicode=yes&characterEncoding=UTF-8&useSSL=false";
                    break;
                default:
                    driver = "jdbc:mysql://";
                    options = "?useLegacyDatetimeCode=false&serverTimezone=UTC&useUnicode=yes&characterEncoding=UTF-8&useSSL=false";
                }
                hikari.setJdbcUrl(driver + databaseInstance.getHost() + ":" + databaseInstance.getPort() + "/" + database.getName() + options);
                hikari.setUsername(database.getUsername());
                hikari.setPassword(passwordEncrypt.decryptPassword(database.getPassword()));
                // MySQL optimizations, see
                // https://github.com/brettwooldridge/HikariCP/wiki/MySQL-Configuration
                hikari.addDataSourceProperty("cachePrepStmts", true);
                hikari.addDataSourceProperty("prepStmtCacheSize", "250");
                hikari.addDataSourceProperty("prepStmtCacheSqlLimit", "2048");
                hikari.addDataSourceProperty("useServerPrepStmts", "true");
                hikari.addDataSourceProperty("useLocalSessionState", "true");
                hikari.addDataSourceProperty("useLocalTransactionState", "true");
                hikari.addDataSourceProperty("rewriteBatchedStatements", "true");
                hikari.addDataSourceProperty("cacheResultSetMetadata", "true");
                hikari.addDataSourceProperty("cacheServerConfiguration", "true");
                hikari.addDataSourceProperty("elideSetAutoCommits", "true");
                hikari.addDataSourceProperty("maintainTimeStats", "false");
                hikari.setMinimumIdle(3);
                hikari.setMaximumPoolSize(5);
                hikari.setIdleTimeout(30000);
                hikari.setPoolName("JPAHikari_" + tenantId);
                // mysql wait_timeout 600 seconds
                hikari.setMaxLifetime(580000);
                hikari.setLeakDetectionThreshold(60 * 1000);
                HikariDataSource dataSource = new HikariDataSource(hikari);
                return dataSource;
            } else {
                throw new SQLException(String.format("DB not found for tenant %s!", tenantId));
            }
        } catch (Exception e) {
            throw new SQLException(e.getMessage());
        }
    }
}
In my implementation I read tenantId and I get information about the database instance from a central management system.
I create a new pool for each tenant and cache it in order to avoid recreating it each time.
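As a side note on that caching, here is a small sketch of an atomic variant using ConcurrentHashMap.computeIfAbsent (purely illustrative; the containsKey/put pair shown above can race when several requests for a brand-new tenant arrive at the same time):
public HikariDataSource getDataSource(String tenantId) {
    // computeIfAbsent guarantees the pool for a given tenant is created at most once
    return dataSourceMap.computeIfAbsent(tenantId, id -> {
        try {
            return createDataSource(id);
        } catch (SQLException e) {
            throw new IllegalStateException("Could not create pool for tenant " + id, e);
        }
    });
}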
I read this interesting question, but my question is quite different.
I'm thinking to use AWS (both for server instance, and RDS db instance).
Let's hypothesize a concrete scenario in which I have 100 tenants.
The application is management/point-of-sale software. It will be used only by agents. Let's say each tenant has an average of 3 agents working concurrently at any moment.
With those numbers in mind, and according to this article, the first thing I realize is that it seems hard to have a pool for each tenant.
For 100 tenants I would like to think that a db.r4.large (2 vCPUs, 15.25 GB RAM and fast disk access) with Aurora should be enough (about 150€/month).
According to the formula to size a connection pool:
connections = ((core_count * 2) + effective_spindle_count)
I should have (2 cores * 2) + 1 = 5 connections in the pool.
From what I understand, this should be the maximum number of connections in the pool to maximize performance on that DB instance.
1st solution
So my first question is pretty simple: how can I create a separate connection pool for each tenant, given that I should only use 5 connections in total?
That seems impossible to me. Even if I assigned just 2 connections to each tenant, I would have 200 connections to the DBMS!
According to this question, on a db.r4.large instance I could have at most 1300 connections, so it seems the instance should handle the load quite well.
But according to the article I mentioned before, it seems to be bad practice to use hundreds of connections to the DB:
If you have 10,000 front-end users, having a connection pool of 10,000 would be shear insanity. 1000 still horrible. Even 100 connections, overkill. You want a small pool of a few dozen connections at most, and you want the rest of the application threads blocked on the pool awaiting connections.
2nd solution
The second solution I have in mind is to share a connection pool among tenants on the same DBMS. This means that all 100 tenants would use the same Hikari pool of 5 connections (honestly that seems quite low to me).
Would this be the right way to maximize performance and reduce the response time of the application?
Do you have a better idea of how to manage this scenario with Spring, Hibernate and MySQL (hosted on AWS RDS Aurora)?
Most definitely, opening a connection pool per tenant is a very bad idea. All you need is a pool of connections shared across all users.
So the first step would be to measure the load, or anticipate what it would be, based on some projections.
Decide how much latency is acceptable, what the burst/peak-time traffic is, etc.
Finally, work out the number of connections you will need for this and decide on the number of instances required. For instance, if your peak usage is 10k queries per second and each query takes 10 ms, then each connection can serve about 100 queries per second, so you will need roughly 100 open connections to keep up.
Implement it without any binding to the tenant/user, i.e. the same pool shared across all, unless you have a reason to group, say, premium/basic users into two separate pools, etc.
Finally, as you are doing this on AWS, if you need more than 1 instance based on point 3, see if you can autoscale up/down based on load to save costs.
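For illustration only, a minimal sketch of what a shared-pool MultiTenantConnectionProvider could look like with a database per tenant on a single MySQL instance: one HikariCP pool for the whole DBMS, and the tenant's database selected on checkout via Connection.setCatalog. The tenant-to-database naming and the reset schema name are assumptions, not part of the question:
public class SharedPoolConnectionProvider implements MultiTenantConnectionProvider {

    // one pool shared by all tenants living on the same MySQL instance
    private final HikariDataSource dataSource;

    public SharedPoolConnectionProvider(HikariDataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public Connection getAnyConnection() throws SQLException {
        return dataSource.getConnection();
    }

    @Override
    public Connection getConnection(String tenantId) throws SQLException {
        Connection connection = dataSource.getConnection();
        // switch the connection to the tenant's database/schema;
        // assumes each tenant's database is named after its tenantId
        connection.setCatalog(tenantId);
        return connection;
    }

    @Override
    public void releaseConnection(String tenantId, Connection connection) throws SQLException {
        // reset to a neutral schema before returning the connection to the pool (name is an assumption)
        connection.setCatalog("shared_default");
        connection.close();
    }

    @Override
    public void releaseAnyConnection(Connection connection) throws SQLException {
        connection.close();
    }

    @Override
    public boolean supportsAggressiveRelease() {
        return false;
    }

    @Override
    public boolean isUnwrappableAs(Class unwrapType) {
        return false;
    }

    @Override
    public <T> T unwrap(Class<T> unwrapType) {
        return null;
    }
}
With MySQL Connector/J, setCatalog maps to switching the current database on the existing connection, so the pool size can stay small regardless of the number of tenants.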
Check these out for some comparison metrics.
This one is probably the most interesting in terms of spike demand:
https://github.com/brettwooldridge/HikariCP/blob/dev/documents/Welcome-To-The-Jungle.md
Some more...
https://github.com/brettwooldridge/HikariCP
https://www.wix.engineering/blog/how-does-hikaricp-compare-to-other-connection-pools
Following a previous Q&A, the selected strategy for a multi-tenant environment would be (surprisingly) a connection pool per tenant:
Strategy 2 : each tenant have it's own schema and it's own connection pool in a single database
strategy 2 is more flexible and safe : every tenant cannot consume more than a given amount of connection (and this amount can be configured per tenant if you need it)
I suggest putting HikariCP's formula aside here, and using a smaller number of tenants per instance, say 10 (dynamic sizing?), with a low connection pool size such as 2.
Focus more on the traffic you expect; note that the pool size of 10 mentioned in HikariCP's pool-sizing discussion may well suffice:
10 as a nice round number. Seem low? Give it a try, we'd wager that you could easily handle 3000 front-end users running simple queries at 6000 TPS on such a setup.
See also this comment, which indicates that hundreds of connections would be too much:
...but it would have to be a massive load to require 100s.
— EssexBoy

Spark to HBase table is not showing complete data records

I have a log file of 30k records, which I am publishing to Kafka and persisting into HBase through Spark. Out of the 30k records, I can see only 4k records in the HBase table.
I have tried saving the stream to MySQL, and it saves all records in MySQL properly.
But with HBase, if I publish a file of 100 records to the Kafka topic, it saves 36 records in the HBase table, whereas if I publish 30k records, HBase shows only 4k records.
Also, the records (rows) in HBase are not in sequence, e.g. 1..3..10..17.
final Job newAPIJobConfiguration1 = Job.getInstance(config);
newAPIJobConfiguration1.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "logs");
newAPIJobConfiguration1.setOutputFormatClass(org.apache.hadoop.hbase.mapreduce.TableOutputFormat.class);
HTable hTable = new HTable(config, "country");
lines.foreachRDD((rdd, time) -> {
    // Get the singleton instance of SparkSession
    SparkSession spark = SparkSession.builder().config(rdd.context().getConf()).getOrCreate();
    // Convert RDD[String] to RDD[case class] to DataFrame
    JavaRDD rowRDD = rdd.map(line -> {
        String[] logLine = line.split(" +");
        Log record = new Log();
        record.setTime((logLine[0]));
        record.setTime_taken((logLine[1]));
        record.setIp(logLine[2]);
        return record;
    });
    saveToHBase(rowRDD, newAPIJobConfiguration1.getConfiguration());
});
ssc.start();
ssc.awaitTermination();
}
// 6. saveToHBase method - insert data into HBase
public static void saveToHBase(JavaRDD rowRDD, Configuration conf) throws IOException {
    // create Key, Value pair to store in HBase
    JavaPairRDD hbasePuts = rowRDD.mapToPair(
            new PairFunction() {
                private static final long serialVersionUID = 1L;

                @Override
                public Tuple2 call(Log row) throws Exception {
                    Put put = new Put(Bytes.toBytes(System.currentTimeMillis()));
                    //put.addColumn(Bytes.toBytes("sparkaf"), Bytes.toBytes("message"), Bytes.toBytes(row.getMessage()));
                    put.addImmutable(Bytes.toBytes("time"), Bytes.toBytes("col1"), Bytes.toBytes(row.getTime()));
                    put.addImmutable(Bytes.toBytes("time_taken"), Bytes.toBytes("col2"), Bytes.toBytes(row.getTime_taken()));
                    put.addImmutable(Bytes.toBytes("ip"), Bytes.toBytes("col3"), Bytes.toBytes(row.getIp()));
                    return new Tuple2(new ImmutableBytesWritable(), put);
                }
            });

    // save to HBase - Spark built-in API method
    hbasePuts.saveAsNewAPIHadoopDataset(conf);
Since HBase stores records uniquely by rowkey, it is very possible that you are overwriting records.
You are using the current time in milliseconds as the rowkey, and any record created with the same rowkey will overwrite the old one.
Put put = new Put(Bytes.toBytes(System.currentTimeMillis()));
So if 100 Puts are created within the same millisecond, only 1 row will show up in HBase, since that row was overwritten 99 times.
It's likely that the 4k rowkeys in HBase are the 4k unique milliseconds (4 seconds) it took to load the data.
I would suggest using a different rowkey design. Also, as a side note, it is typically a bad idea to use monotonically increasing rowkeys in HBase:
Further Information
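For illustration only (not the poster's code), one way the rowkey could be made collision-free is to combine a field from the record with a unique suffix:
// Example rowkey: ip + timestamp + random suffix, so two Puts created in the
// same millisecond no longer collide. The exact layout is an assumption and
// should be chosen to match your read patterns.
String rowKey = row.getIp() + "|" + row.getTime() + "|" + UUID.randomUUID();
Put put = new Put(Bytes.toBytes(rowKey));
Leading with the IP (or another well-distributed field) also avoids the monotonically increasing key problem mentioned above.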

How to run apache flink streaming job continuously on Flink server

Hello,
I have written code for a streaming job where both the source and the target are a PostgreSQL database. I used JDBCInputFormat/JDBCOutputFormat to read and write the records (referenced example).
Code:
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
environment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

JDBCInputFormatBuilder inputBuilder = JDBCInputFormat.buildJDBCInputFormat()
        .setDrivername(JDBCConfig.DRIVER_CLASS)
        .setDBUrl(JDBCConfig.DB_URL)
        .setQuery(JDBCConfig.SELECT_FROM_SOURCE)
        .setRowTypeInfo(JDBCConfig.ROW_TYPE_INFO);

SingleOutputStreamOperator<Row> source = environment.createInput(inputBuilder.finish())
        .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Row>() {
            @Override
            public long extractAscendingTimestamp(Row row) {
                Date dt = (Date) row.getField(2);
                return dt.getTime();
            }
        })
        .keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .fold(null, new FoldFunction<Row, Row>() {
            @Override
            public Row fold(Row row1, Row row) throws Exception {
                return row;
            }
        });

source.writeUsingOutputFormat(JDBCOutputFormat.buildJDBCOutputFormat()
        .setDrivername(JDBCConfig.DRIVER_CLASS)
        .setDBUrl(JDBCConfig.DB_URL)
        .setQuery("insert into tablename(id, name) values (?,?)")
        .setSqlTypes(new int[]{Types.BIGINT, Types.VARCHAR})
        .finish());
This code executes correctly, but it does not keep running continuously on the Flink server (the select query is executed only once).
I expected it to run continuously on the Flink server.
Probably you have to define your own Flink source or JDBCInputFormat, since the one you use here will stop the SourceTask once it has fetched all results from the DB. One way to solve this is to create your own JDBC input format based on JDBCInputFormat, re-executing the SQL query in nextRecord once the last row has been read from the DB.
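A rough sketch of the "own Flink source" variant, assuming plain JDBC and a configurable poll interval; this is an illustrative approach rather than the canonical Flink solution, and it re-reads the full result set on every poll unless you add your own "last seen key" tracking:
public class PollingJdbcSource extends RichSourceFunction<Row> {

    private final String url;
    private final String query;
    private final long pollIntervalMs;
    private volatile boolean running = true;
    private transient Connection connection;

    public PollingJdbcSource(String url, String query, long pollIntervalMs) {
        this.url = url;
        this.query = query;
        this.pollIntervalMs = pollIntervalMs;
    }

    @Override
    public void open(org.apache.flink.configuration.Configuration parameters) throws Exception {
        // assumes credentials are embedded in the JDBC URL (or add user/password here)
        connection = DriverManager.getConnection(url);
    }

    @Override
    public void run(SourceContext<Row> ctx) throws Exception {
        while (running) {
            // re-execute the query on every poll instead of stopping after the first pass
            try (PreparedStatement stmt = connection.prepareStatement(query);
                 ResultSet rs = stmt.executeQuery()) {
                int columnCount = rs.getMetaData().getColumnCount();
                while (running && rs.next()) {
                    Row row = new Row(columnCount);
                    for (int i = 0; i < columnCount; i++) {
                        row.setField(i, rs.getObject(i + 1));
                    }
                    ctx.collect(row);
                }
            }
            Thread.sleep(pollIntervalMs);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void close() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}
It could then be plugged in with environment.addSource(new PollingJdbcSource(JDBCConfig.DB_URL, JDBCConfig.SELECT_FROM_SOURCE, 5000)) in place of environment.createInput(...).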

HBase FuzzyRowFilter returns no results

I'm having a hard time getting HBase's FuzzyRowFilter to work.
I have the following test table:
hbase(main):014:0> scan 'test'
ROW COLUMN+CELL
row-01 column=colfam1:col1, timestamp=1481193793338, value=value1
row-02 column=colfam1:col1, timestamp=1481193799186, value=value2
row-03 column=colfam1:col1, timestamp=1481193803941, value=value3
row-04 column=colfam1:col1, timestamp=1481193808209, value=value4
row-05 column=colfam1:col1, timestamp=1481193812737, value=value5
5 row(s) in 0.0200 seconds
Here is my Java code (I started with Scala, but the results are the same - none):
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "localhost:2182");
conf.set("hbase.master", "localhost:60000");
conf.set("hbase.rootdir", "/hbase");

try {
    Scan scan = new Scan();
    scan.setCaching(5);

    byte[] rowKeys = Bytes.toBytesBinary("???-01");
    byte[] fuzzyInfo = {0x01, 0x01, 0x01, 0x00, 0x00, 0x00};
    FuzzyRowFilter fuzzyFilter = new FuzzyRowFilter(
            Arrays.asList(
                    new Pair<byte[], byte[]>(
                            rowKeys,
                            fuzzyInfo)));
    System.out.println("### fuzzyFilter: " + fuzzyFilter.toString());

    scan.addFamily(Bytes.toBytesBinary("colfam1"));
    scan.setStartRow(Bytes.toBytesBinary("row-01"));
    scan.setStopRow(Bytes.toBytesBinary("row-05"));
    scan.setFilter(fuzzyFilter);

    Connection conn = ConnectionFactory.createConnection(conf);
    Table table = conn.getTable(TableName.valueOf("test"));
    ResultScanner results = table.getScanner(scan);

    int count = 0;
    int limit = 100;
    for (Result r : results) {
        System.out.println("" + r.toString());
        if (count++ >= limit) break;
    }
} catch (Exception e) {
    e.printStackTrace();
}
I simply do not get any results back from the server. If I comment out the line scan.setFilter(fuzzyFilter);, I get the expected results:
keyvalues={row-01/colfam1:col1/1481193793338/Put/vlen=6/seqid=0}
keyvalues={row-02/colfam1:col1/1481193799186/Put/vlen=6/seqid=0}
keyvalues={row-03/colfam1:col1/1481193803941/Put/vlen=6/seqid=0}
keyvalues={row-04/colfam1:col1/1481193808209/Put/vlen=6/seqid=0}
Am I doing something wrong? Is there a bug in HBase (version 1.2.2)? I am using the version installed through Homebrew on latest Mac OS Sierra.
Update
On a Cloudera Hadoop cluster running CDH 5.7 with HBase 1.2.0-cdh5.7.0, I get the desired output for rowkey row-01. The error must somehow be related to my local setup.
Solution
Indeed, the problem was that HBase server installation and client JAR versions did not match. In my case, I was using the artifacts
hbase-common
hbase-client
hbase-server
with version 1.2.0-cdh5.7.0 instead of 1.2.2.
My mistake was assuming that minor version differences would not have a large impact, but apparently Cloudera has applied some major changes in their versions with respect to the official code base. Changing to the official version 1.2.2 made the FuzzyRowFilter work as expected.
It should print only the rowkey of row-01, as can be seen from the filter condition.
There is no such bug, and it will work as expected; I have been using the same filter for some time now.
Check your configuration, dependencies, etc.
Due to versioning, libraries and their clients often become incompatible.
Let's take a simple example:
class ServerVersionA {
    public static DataObject getData() {
        return new DataObject(/* data with headerVersionA */);
    }
}

class ClientVersionB {
    public void showData() {
        DataObject dataObject = makeRequest(params);
        // Check whether the data received is of version B after verifying the header
        boolean status = validate(dataObject);
        if (status) {
            doIO(dataObject);
        }
    }
}
In this case, if the header does not match, the client simply sits idle.
These kinds of issues are mostly taken care of, but sometimes they creep in.
If we look at the sources of the installed server version and the client version, we can find out why no data is being returned and no exception is propagated.
