HBase CopyTable inside Java - java

I want to copy one HBase table to another location with good performance.
I would like to reuse the code from CopyTable.java from the hbase-server GitHub repository.
I've been looking at the HBase documentation, but it didn't help me much: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/CopyTable.html
After looking at this Stack Overflow post: Can a main() method of class be invoked in another class in java
I think I can directly call it using its main class.
Question: Do you think there is any better way to get this copy done than using CopyTable from hbase-server? Do you see any inconvenience in using this CopyTable?
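For reference, a rough sketch of what that direct call might look like; the table names are placeholders, and --new.name is one of CopyTable's standard command-line options (check the usage output of your HBase version). Note that, depending on the version, main() may call System.exit() when the job finishes, which would terminate your own JVM.
public class CopyTableRunner {
    public static void main(String[] args) throws Exception {
        String[] copyArgs = new String[] {
            "--new.name=destinationTable",  // placeholder destination table
            "sourceTable"                   // placeholder source table
        };
        // Runs the MapReduce copy job exactly as the command-line tool would
        org.apache.hadoop.hbase.mapreduce.CopyTable.main(copyArgs);
    }
}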

Question: Do you think there is any better way to get this copy done than using CopyTable from hbase-server? Do you see any inconvenience in using this CopyTable?
First of all, a snapshot is a better approach than CopyTable.
HBase snapshots allow you to take a snapshot of a table without much impact on region servers. Snapshot, clone, and restore operations don't involve data copying. Also, exporting a snapshot to another cluster has no impact on the region servers.
Prior to version 0.94.6, the only way to back up or clone a table was to use CopyTable/ExportTable, or to copy all the HFiles in HDFS after disabling the table. The disadvantage of these methods is that you can degrade region server performance (Copy/Export Table), or you need to disable the table, which means no reads or writes; this is usually unacceptable.
A snapshot is not just a rename; if you want to be able to restore to one particular point in time between multiple operations, this is the right tool to use:
A snapshot is a set of metadata information that allows an admin to get back to a previous state of the table. A snapshot is not a copy of the table; it’s just a list of file names and doesn’t copy the data. A full snapshot restore means that you get back to the previous “table schema” and you get back your previous data losing any changes made since the snapshot was taken.
Also, see Snapshots and Repeatable reads for HBase Tables
Snapshot Internals
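For illustration, a minimal sketch of taking a snapshot and cloning it into a new table through the Java client (table and snapshot names are placeholders; this is the 1.0+ Admin API, older versions use HBaseAdmin with similar method names, and exporting to another cluster is done with the ExportSnapshot tool):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SnapshotCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Take a snapshot of the source table (metadata only, no data is copied)
            admin.snapshot("sourceTable-snapshot", TableName.valueOf("sourceTable"));
            // Clone the snapshot into a new table; HFiles are shared, not copied
            admin.cloneSnapshot("sourceTable-snapshot", TableName.valueOf("sourceTable-copy"));
        }
    }
}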
Another MapReduce alternative to CopyTable:
You can implement something like the code below. This is for a standalone program; for a MapReduce job you would insert multiple Put records as a batch (say, 100,000) per call.
This improved performance for standalone inserts into the HBase client; you can try the same approach in a MapReduce job.
public void addMultipleRecordsAtaShot(final ArrayList<Put> puts, final String tableName) throws Exception {
    HTable table = null;
    try {
        table = new HTable(HBaseConnection.getHBaseConfiguration(), getTable(tableName));
        // send the whole batch of Puts in one round trip
        table.put(puts);
        LOG.info("INSERT record[s] " + puts.size() + " to table " + tableName + " OK.");
    } catch (final Throwable e) {
        e.printStackTrace();
    } finally {
        LOG.info("Processed ---> " + puts.size());
        if (puts != null) {
            puts.clear();
        }
        if (table != null) {
            table.close(); // release the table handle
        }
    }
}
Along with that, you can also consider the following...
Enable a write buffer larger than the default:
1) table.setAutoFlush(false)
2) Set the buffer size:
<property>
  <name>hbase.client.write.buffer</name>
  <!-- you can double this for better performance: 2 x 20971520 = 41943040 -->
  <value>20971520</value>
</property>
OR
void setWriteBufferSize(long writeBufferSize) throws IOException
The buffer is only ever flushed on two occasions:
Explicit flush
Use the flushCommits() call to send the data to the servers for permanent storage.
Implicit flush
This is triggered when you call put() or setWriteBufferSize().
Both calls compare the currently used buffer size with the configured limit and optionally invoke the flushCommits() method.
In case the entire buffer is disabled, setting setAutoFlush(true) will force the client to call the flush method for every invocation of put().
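Putting the buffering pieces together, a minimal sketch of the pattern described above, using the same pre-1.0 HTable API as the rest of this answer (table, column family, and qualifier names are placeholders; newer clients use BufferedMutator instead):
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedWrites {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "destinationTable");
        try {
            table.setAutoFlush(false);            // buffer puts on the client side
            table.setWriteBufferSize(20971520L);  // 20 MB, same value as the XML property above
            List<Put> puts = new ArrayList<Put>();
            for (int i = 0; i < 100000; i++) {
                Put put = new Put(Bytes.toBytes("row-" + i));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
                puts.add(put);
            }
            table.put(puts);       // buffered; flushed implicitly when the buffer limit is hit
            table.flushCommits();  // explicit flush for whatever is still buffered
        } finally {
            table.close();
        }
    }
}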

Related

Deleting record in Kafka StateStore does not work (NullPointerException thrown on .delete(key))

I have a materialized in-memory state store in my code. I have another separate stream that is supposed to look up and delete records based on some criteria.
I need to allow my stream to access and delete records in the previously built state store.
I have the following code below
@Bean
public StreamsBuilder myStreamCodeBean(StreamsBuilder streamsBuilder) {
    //create store supplier
    KeyValueBytesStoreSupplier myStoreSupplier = Stores.inMemoryKeyValueStore("MyStateStore");
    //materialize state store and enable caching
    Materialized<String, MyObject, KeyValueStore<Bytes, byte[]>> materializedStore =
            Materialized.<String, MyObject>as(myStoreSupplier)
                .withKeySerde(Serdes.String())
                .withValueSerde(myObjectSerde)
                .withCachingEnabled();
    //other code here that creates KTable, and another stream to consume records into ktable from another topic
    //........
    //another stream that consumes another topic and deletes records in store with some logic
    streamsBuilder
        .stream("someTopicName", someConsumerObject)
        .filter((key, value) -> {
            KeyValueStore<Bytes, byte[]> kvStore = myStoreSupplier.get();
            kvStore.delete(key); //StateStore never "open" and this throws NullPointerException (even though key is NOT null)
            return true;
        })
        .to("some topic name here", producerObject);
    return streamsBuilder;
}
The error that is thrown is really generic; it just says that Kafka Streams is not running.
Doing some debugging, I found that my state store isn't "open" when doing the delete.
What am I doing wrong here? I can read records using ReadOnlyKeyValueStore, but I need to delete, so I can't use that.
Any help appreciated.
The state store must be accessed through the processor's context, not through the supplier object.
After creating the store, you need to ensure that it is accessible by the processor from which you are trying to access it.
If your store is a local store, then you need to specify which processors are going to access it.
If your store is a global store, then it is accessible by all the processors in the topology.
You are creating a stream using streamsBuilder.stream(), and at least from the code you have posted, you don't seem to give your processor access to the state store.
Ensure that you have called addStateStore() on the StreamsBuilder.
To get the state store in the processor, use context.getStateStore(storeName).
You can refer to the following example.
I don't think we can access the state store in filter() because it is a stateless operation. So you can use a Processor or Transformer instead and pass in the state store names ("MyStateStore" in your case).
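For illustration, a rough sketch of the Transformer approach, reusing the names from the question (MyObject, myObjectSerde, someConsumerObject, producerObject); it assumes "MyStateStore" is registered with the topology (via addStateStore() or your materialized KTable) so it can be connected to this transformer by name:
streamsBuilder
    .stream("someTopicName", someConsumerObject)
    .transform(() -> new Transformer<String, MyObject, KeyValue<String, MyObject>>() {
        private KeyValueStore<String, MyObject> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            // looked up by name through the processor's context, not the supplier
            store = (KeyValueStore<String, MyObject>) context.getStateStore("MyStateStore");
        }

        @Override
        public KeyValue<String, MyObject> transform(String key, MyObject value) {
            store.delete(key);                // your deletion criteria would go here
            return KeyValue.pair(key, value); // forward the record unchanged
        }

        @Override
        public void close() {
        }
    }, "MyStateStore")                        // connect the named store to this transformer
    .to("some topic name here", producerObject);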

Save a Spark RDD using mapPartitions with an iterator

I have some intermediate data that I need to store in HDFS and locally as well. I'm using Spark 1.6. In HDFS as an intermediate form I'm getting data in /output/testDummy/part-00000 and /output/testDummy/part-00001. I want to save these partitions locally using Java/Scala, either as /users/home/indexes/index.nt (by merging both locally) or as /users/home/indexes/index-0000.nt and /home/indexes/index-0001.nt separately.
Here is my code:
Note: testDummy is the same as test; output has two partitions. I want to store them separately, or combined, but locally in an index.nt file. I prefer to store them separately on the two data nodes. I'm using a cluster and submit the Spark job on YARN. I also added some comments on how many times, and what data, I'm getting. How could I do this? Any help is appreciated.
val testDummy = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).saveAsTextFile(outputFilePathForHDFS + "/testDummy")
println("testDummy done") //1 time print

def savesData(iterator: Iterator[(String)]): Iterator[(String)] = {
  println("Inside savesData")                                        // now 4 times when coalesce(Constants.INITIAL_PARTITIONS)=2
  println("iter size" + iterator.size)                               // 2 735 2 735 values
  val filenamesWithExtension = outputPath + "/index.nt"
  println("filenamesWithExtension " + filenamesWithExtension.length) //4 times
  var list = List[(String)]()
  val fileWritter = new FileWriter(filenamesWithExtension, true)
  val bufferWritter = new BufferedWriter(fileWritter)
  while (iterator.hasNext) {                                         //iterator.hasNext is false
    println("inside iterator")                                       //0 times
    val dat = iterator.next()
    println("datadata " + iterator.next())
    bufferWritter.write(dat + "\n")
    bufferWritter.flush()
    println("index files written")
    val dataElements = dat.split(" ")
    println("dataElements")                                          //0
    list = list.::(dataElements(0))
    list = list.::(dataElements(1))
    list = list.::(dataElements(2))
  }
  bufferWritter.close()                                              //closing
  println("savesData method end")                                    //4 times when coal=2
  list.iterator
}

println("before saving data into local") //1
val test = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).mapPartitions(savesData)
println("testRDD partitions " + test.getNumPartitions)               //2
println("testRDD size " + test.collect().length)                     //0
println("after saving data into local")                              //1
PS: I followed this and this, but they are not exactly what I'm looking for. I did it somehow, but I'm not getting anything in index.nt.
A couple of things:
Never call Iterator.size if you plan to use the data later. Iterators are TraversableOnce. The only way to compute an Iterator's size is to traverse all its elements, and after that there is no more data to be read.
Don't use transformations like mapPartitions for side effects. It is bad practice and doesn't guarantee that a given piece of code will be executed only once. If you want to perform some type of IO, use actions like foreach / foreachPartition.
A local path inside an action or transformation is a local path on a particular worker. If you want to write directly on the client machine, you should fetch the data first with collect or toLocalIterator. It could be better, though, to write to distributed storage and fetch the data later.
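To illustrate both points, writing inside an action on the workers versus collecting to the driver first, here is a rough sketch using the Java RDD API (the same shape applies in Scala; paths and names are placeholders):
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.Iterator;

import org.apache.spark.TaskContext;
import org.apache.spark.api.java.JavaRDD;

public class SaveLocally {

    // Runs on the workers: each partition is appended to a local file on the node that owns it.
    public static void savePerWorker(JavaRDD<String> lines, final String outputDir) {
        lines.foreachPartition((Iterator<String> it) -> {
            String file = outputDir + "/index-" + TaskContext.get().partitionId() + ".nt";
            try (BufferedWriter writer = new BufferedWriter(new FileWriter(file, true))) {
                while (it.hasNext()) {
                    writer.write(it.next());
                    writer.newLine();
                }
            }
        });
    }

    // Runs on the driver: collect first, then write a single local file on the client machine.
    public static void saveOnDriver(JavaRDD<String> lines, String localFile) throws Exception {
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(localFile))) {
            for (String line : lines.collect()) {
                writer.write(line);
                writer.newLine();
            }
        }
    }
}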
Java 7 provides means to watch directories.
https://docs.oracle.com/javase/tutorial/essential/io/notification.html
The idea is to create a watch service and register it with the directory of interest (specifying the events you care about, such as file creation and deletion). You then watch: you will be notified of any such events and can take whatever action you want, as in the sketch below.
You will have to rely heavily on the Java HDFS API wherever applicable.
Run the program in the background, since it waits for events forever. (You can write logic to quit after you have done whatever you want.)
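A minimal sketch of that watch loop (the watched directory is a placeholder):
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class DirectoryWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/users/home/indexes");  // placeholder: directory to watch
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE, StandardWatchEventKinds.ENTRY_DELETE);

        while (true) {
            WatchKey key = watcher.take();            // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                // react to the event, e.g. copy a newly created part file elsewhere
                System.out.println(event.kind() + ": " + event.context());
            }
            if (!key.reset()) {                       // directory no longer accessible
                break;
            }
        }
    }
}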
On the other hand, shell scripting will also help.
Be aware of the coherency model of the HDFS file system while reading files.
Hope this gives you some ideas.

Getting CPU 100 percent when I am trying to download CSV in Spring

I am seeing a CPU performance issue on the server when I try to download a CSV in my project: the CPU goes to 100%, although SQL returns the response within 1 minute. In the CSV we write around 600K records; for one user it works fine, but for concurrent users we get this issue.
Environment
Spring 4.2.5
Tomcat 7/8 (RAM 2GB Allocated)
MySQL 5.0.5
Java 1.7
Here is the Spring Controller code:-
@RequestMapping(value="csvData")
public void getCSVData(HttpServletRequest request,
                       HttpServletResponse response,
                       @RequestParam(value="param1", required=false) String param1,
                       @RequestParam(value="param2", required=false) String param2,
                       @RequestParam(value="param3", required=false) String param3) throws IOException {
    List<Log> logs = service.getCSVData(param1, param2, param3);
    response.setHeader("Content-type", "application/csv");
    response.setHeader("Content-disposition", "inline; filename=logData.csv");
    PrintWriter out = response.getWriter();
    out.println("Field1,Field2,Field3,.......,Field16");
    for (Log row : logs) {
        out.println(row.getField1()+","+row.getField2()+","+row.getField3()+"......"+row.getField16());
    }
    out.flush();
    out.close();
}
Persistence code (I am using Spring JdbcTemplate):
@Override
public List<Log> getCSVLog(String param1, String param2, String param3) {
    String sql = SqlConstants.CSV_ACTIVITY.toString();
    List<Log> csvLog = jdbcTemplate.query(sql, new Object[]{param1, param2, param3},
        new RowMapper<Log>() {
            @Override
            public Log mapRow(ResultSet rs, int rowNum) throws SQLException {
                Log log = new Log();
                log.setField1(rs.getInt("field1"));
                log.setField2(rs.getString("field2"));
                log.setField3(rs.getString("field3"));
                .
                .
                .
                log.setField16(rs.getString("field16"));
                return log;
            }
        });
    return csvLog;
}
I think you need to be specific about what you mean by "100% CPU usage", whether it's the Java process or the MySQL server. As you have 600K records, trying to load everything into memory would easily end up in an OutOfMemoryError. Given that this works for one user, you've got enough heap space to process this number of records for just one user, and the symptoms surface when there are multiple users trying to use the same service.
The first issue I can see in your posted code is that you try to load everything into one big list, and the size of that list varies based on the content of the Log class. Using a list like this also means that you have to have enough memory to process the JDBC result set and generate a new list of Log instances. This can be a major problem with a growing number of users. This type of short-lived object causes frequent GC, and once GC cannot keep up with the amount of garbage being collected, it obviously fails. To solve this major issue, my suggestion is to use a scrollable ResultSet. Additionally, you can make this result set read-only; for example, below is a code fragment for creating a scrollable result set. Take a look at the documentation for how to use it.
Statement st = conn.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_READ_ONLY);
The above option is suitable if you're using pure JDBC or the Spring JDBC template. If Hibernate is already used in your project, you can still achieve the same with the code fragment below. Again, please check the documentation for more information, especially if you have a different JPA provider.
StatelessSession session = sessionFactory.openStatelessSession();
Query query = session.createSQLQuery(queryStr).setCacheable(false).setFetchSize(Integer.MIN_VALUE).setReadOnly(true);
query.setParameter(query_param_key, query_paramter_value);
ScrollableResults resultSet = query.scroll(ScrollMode.FORWARD_ONLY);
This way you're not loading all the records into the Java process in one go; instead, they're loaded on demand and you will have a small memory footprint at any given time. Note that the JDBC connection will stay open until you're done processing the entire record set. This also means that your DB connection pool can be exhausted if many users download CSV files from this endpoint. You need to take measures to overcome this problem (e.g. use an API manager to rate-limit calls to this endpoint, read from a read replica, or whatever option is viable).
My other suggestion is to stream the data, which you have already done, so that any records fetched from the DB are processed and sent to the client before the next set of records is processed. Again, I would suggest using a CSV library such as SuperCSV to handle this, as these libraries are designed to handle a good amount of data.
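For illustration, a rough sketch of that streaming idea with Spring's RowCallbackHandler, so each row is written to the response as it is read instead of being collected into a List first (the SQL constant and column names are taken from the question, the rest are placeholders; combine this with a streaming or scrollable result set as described above so the driver does not buffer everything):
@RequestMapping(value = "csvData")
public void streamCSVData(HttpServletResponse response,
                          @RequestParam(value = "param1", required = false) String param1) throws IOException {
    response.setHeader("Content-type", "application/csv");
    response.setHeader("Content-disposition", "inline; filename=logData.csv");
    final PrintWriter out = response.getWriter();
    out.println("Field1,Field2,Field3");

    jdbcTemplate.query(SqlConstants.CSV_ACTIVITY.toString(), new Object[]{param1},
        new RowCallbackHandler() {
            @Override
            public void processRow(ResultSet rs) throws SQLException {
                // one row in, one CSV line out; nothing accumulates in a List on the heap
                out.println(rs.getInt("field1") + "," + rs.getString("field2") + "," + rs.getString("field3"));
            }
        });

    out.flush();
}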
Please note that this answer may not exactly answer your question, as you haven't provided necessary parts of your source (such as how you retrieve data from the DB), but it should give you the right direction to solve this issue.
Your problem is in loading all the data from the database onto the application server at once. Try to run the query with limit and offset parameters (with a mandatory ORDER BY), push the loaded records to the client, and then load the next chunk of data with a different offset. This helps you decrease the memory footprint and does not require keeping the connection to the database open all the time. Of course, the database will be loaded a bit more, but the overall situation may be better. Try different limit values, for example 5K-50K, and monitor CPU usage on both the app server and the database.
If you can afford to keep many open connections to the database, @Bunti's answer is very good.
http://dev.mysql.com/doc/refman/5.7/en/select.html
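A rough sketch of that paging loop, reusing the names from the question (the ordering column, page size, and the logRowMapper helper are placeholders):
int limit = 10000;  // tune between ~5K and 50K as suggested above
int offset = 0;
List<Log> page;
do {
    page = jdbcTemplate.query(
        SqlConstants.CSV_ACTIVITY.toString() + " ORDER BY field1 LIMIT ? OFFSET ?",
        new Object[]{param1, param2, param3, limit, offset},
        logRowMapper);                       // the RowMapper from the question
    for (Log row : page) {
        out.println(row.getField1() + "," + row.getField2() + "," + row.getField3());
    }
    out.flush();                             // push this chunk to the client before loading the next one
    offset += limit;
} while (page.size() == limit);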

How to read and write the records in EhCache?

Hi All,
My current requirement is to store and read records using EhCache. I am new to EhCache implementation. I have read the EhCache documentation and started to implement it. I have done the record insert part and also the read part. When records are inserted, *.data and *.index files are created. Following is the code.
public class Driver {
    public static void main(String[] args) {
        CacheManager cm = CacheManager.create("ehcache.xml");
        Cache cache = cm.getCache("test");
        // I do a couple of puts
        for (int i = 0; i < 10; i++) {
            cache.put(new Element("key1", "val1"));
            cache.flush();
        }
        System.out.println(cache.getKeys());
        for (int i = 0; i < 10; i++) {
            Element el = cache.get("key" + i);
            System.out.println(el.getObjectValue());
        }
        cm.shutdown();
    }
}
Now the issue is with cm.shutdown(). If I comment out this line, comment out the insert part, and run the program, I am not able to retrieve the records, and the *.index file is deleted. So in a real scenario, if the program is stopped abruptly, we can't read the records after startup. I want to know why the file is deleted and why I can't read the records in this situation. The exception coming in the console is:
net.sf.ehcache.util.SetAsList#b66cc
Exception in thread "main" java.lang.NullPointerException
at Driver.main(Driver.java:29)...
Any input is appreciated, please.
What you are doing is correct, and the behaviour you see is expected too. Caches are typically used to enhance application performance by providing frequently used data quickly, while avoiding costly trips to the datastore.
Not all applications need to persist the cache after the system is shut down, and that's the default behaviour you are seeing (most applications will build the cache on application startup or as requests start coming in). The data you are caching is on the heap, and as soon as your JVM dies, the cache is gone. Now you want to persist it beyond a restart? There are options available. Look them up here.
And I am copying the code snippet right from the same page:
DiskStoreConfiguration diskStoreConfiguration = new DiskStoreConfiguration();
diskStoreConfiguration.setPath("/my/path/dir");
// Already created a configuration object ...
configuration.addDiskStore(diskStoreConfiguration);
// By adding configuration for storing the cache in a file - you are not using default cache manager
CacheManager mgr = new CacheManager(configuration);
In addition, you will have to also configure the persistence options as explained here
Again copying code snippet from link:
<cache>
  <persistence strategy="localRestartable" synchronousWrites="true"/>
</cache>
Hope this helps!

Postgresql 8.4 reading OID style BLOBs with Hibernate

I am seeing a weird case when querying Postgres 8.4 for some records with BLOBs (of type OID) with Hibernate. The query returns all right, but when my code tries to read the content of the BLOB with the simple code below, it gets 0 bytes back.
public static byte[] readBlob(Blob blob) throws Exception {
    InputStream is = null;
    try {
        is = blob.getBinaryStream();
        return org.apache.commons.io.IOUtils.toByteArray(is);
    } finally {
        if (is != null) {
            try {
                is.close();
            } catch (Exception e) {}
        }
    }
}
Funny thing is that I am only seeing this behavior since I started adding more than one such record to the table.
The underlying JDBC library is type 3 (postgresql 8.4-701).
Can someone give me a hint as to how to solve this issue?
Thanks
Peter
Looks like you may have found this bug:
http://opensource.atlassian.com/projects/hibernate/browse/HHH-4876
It has been a while since I ran into a similar issue, and since I've refreshed my memory on this topic I am sharing the results. The problem is that Postgres (and, a few versions back, Oracle too) will not handle the BLOB content at record creation time in the same transaction. The content has to be passed in after the external file (where the content eventually gets stored) has been created and reserved for the operation. Yes, the record gets created, but the BLOB is blank. To have the BLOB filled with whatever you need to put in, you need a second transaction (essentially an update of the record). That's a funny business (maybe a major bug).
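A minimal sketch of that two-transaction pattern, assuming a hypothetical mapped entity MyEntity with a Blob property; getLobHelper() is the Hibernate 3.6+ API and may differ in your version:
Session session = sessionFactory.openSession();

// Transaction 1: insert the record, leaving the blob empty
Transaction tx1 = session.beginTransaction();
MyEntity entity = new MyEntity();
session.save(entity);
tx1.commit();

// Transaction 2: attach the blob content as an update
Transaction tx2 = session.beginTransaction();
entity.setContent(session.getLobHelper().createBlob(bytes)); // bytes: your payload
session.update(entity);
tx2.commit();

session.close();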
