Fastest way to export keys from cassandra - java

What is the fastest way to export all the row keys from a column family in Cassandra (0.7.x and later versions) with the Java APIs or other tools?
Currently I am using the Java Pelops API and paging through all records, but I'm wondering if there is a better mechanism.
I am specifically interested in exporting only the row keys (no columns/subcolumns), so I'm wondering if there is a part of Cassandra's direct storage APIs that could be used to do this as quickly as possible (bypassing Thrift).

What about using the Java Hector client? Sample taken from
https://github.com/rantav/hector/wiki/User-Guide
// Query a range of rows, returning only the keys (no column data)
RangeSlicesQuery<String, String, String> rangeSlicesQuery =
        HFactory.createRangeSlicesQuery(keyspace, stringSerializer,
                                        stringSerializer, stringSerializer);
rangeSlicesQuery.setColumnFamily("Standard1");
rangeSlicesQuery.setKeys("fake_key_", "");  // start key, end key
rangeSlicesQuery.setReturnKeysOnly();       // use this to skip column data
rangeSlicesQuery.setRowCount(5);
Result<OrderedRows<String, String, String>> result = rangeSlicesQuery.execute();
Thrift is the API interface for Cassandra. Going directly to storage would require you to read the data files in binary. The code above should give you good performance.
If you need this for a one-time export, I would say it's OK. If you need this in production, you should reconsider your data model; you may be doing something wrong.
You may need to split the query into multiple key ranges if you need to scan many rows; a paging sketch follows below.
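A rough paging sketch, assuming the same keyspace, serializers and imports as the snippet above; it relies on Hector's OrderedRows.peekLast() and Row.getKey() to feed the last key of each page back in as the next start key, skipping the duplicate first row of every page after the first:
String startKey = "";
int pageSize = 1000;
while (true) {
    RangeSlicesQuery<String, String, String> query =
            HFactory.createRangeSlicesQuery(keyspace, stringSerializer,
                                            stringSerializer, stringSerializer);
    query.setColumnFamily("Standard1");
    query.setKeys(startKey, "");
    query.setReturnKeysOnly();
    query.setRowCount(pageSize);

    OrderedRows<String, String, String> rows = query.execute().get();
    for (Row<String, String, String> row : rows) {
        if (!row.getKey().equals(startKey)) {
            System.out.println(row.getKey());   // export the row key
        }
    }
    if (rows.getCount() < pageSize) {
        break;                                  // last page reached
    }
    startKey = rows.peekLast().getKey();        // next page starts at the last key seen
}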

Related

Azure Synapse Database To And From Netezza - Most Efficient Approach

We were hoping to load data into Azure Synapse (cloud) from Netezza and vice versa using Qlik; however, we are finding the performance unacceptable. What is the fastest way to achieve this?
We have some in-house tools written in Java that perform this task, but I have no clue how to run this code in the native cloud environment, or whether this is even feasible.
I do not have much experience with the cloud, so any guidance on where to spend my time to reach my goal faster would be appreciated.
From Netezza the fastest is ‘create external table as select….’
If your Netezza is new enough (CP4D) you can even refer to a file location ON the cloud, but otherwise you may need a (fast) file store on both Azure and on-prem.
A bit of reading:
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop
https://www.ibm.com/docs/en/SSULQD_7.2.1/com.ibm.nz.load.doc/c_load_create_external_tbl_syntax.html
Basically you will need to use UTF8 (aka ‘internal’ on Netezza) and choose 5 special characters:
an escape character (usually ‘\’)
a column delimiter (usually the TAB character)
a row delimiter (usually new-line)
a string delimiter (usually the double quote ‘"’)
a NULL indicator character (usually a star ‘*’)
Choose the same 5 characters on both ends and do a binary file transfer of some sort (xFTPx, HTTP, or a dedicated Azure copy tool of some sort), and you should be good :)

Store multiple values in a file - best format?

I want to store multiple values (String, Int and Date) in a file via Java in Android Studio.
I don't have that much experience in this area, so I tried to google a bit, but I didn't find the solution I've been looking for. So maybe you can recommend something?
What I've tried so far:
Android offers a SharedPreferences feature, which allows a user to save a primitive value for a key. But I have multiple values for a key, so that won't work for me.
Another option is saving data on an external storage medium as a file. So far so good. But I want to keep the file size to a minimum and load the file as fast as possible. That's where I can't get ahead. If I save all values directly as plain text, I would have to parse the .txt file by hand to load the data, which will take time for many entries.
Is there a possibility to save multiple entries with multiple values for a particular key in an efficient way?
No need to reinvent the wheel. Most probably the best option for your case is using a database. Look into SQLite or Realm.
You don’t divulge enough details about your data structure or volume, so it is difficult to give a specific solution.
Generally speaking, you have these three choices.
Serialize a collection
I have multiple values for a key
You could use a Map with a List or Set as its value. This has been discussed countless times on Stack Overflow.
Then use Serialization to write and read to storage.
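For example, a minimal sketch of that idea using plain Java serialization (the file name and sample values are placeholders):
import java.io.*;
import java.util.*;

public class MapSerializationDemo {
    public static void main(String[] args) throws Exception {
        // one key, several values of different kinds kept together as strings
        Map<String, List<String>> data = new HashMap<>();
        data.put("user1", Arrays.asList("some text", "42", "2023-05-01"));

        // write the whole map to storage
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream("data.ser"))) {
            out.writeObject(data);
        }

        // read it back
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream("data.ser"))) {
            @SuppressWarnings("unchecked")
            Map<String, List<String>> restored =
                    (Map<String, List<String>>) in.readObject();
            System.out.println(restored.get("user1"));
        }
    }
}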
Text file
Write a text file.
Use Tab-delimited or CSV format if appropriate. I suggest using the Apache Commons CSV library for that.
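A rough sketch with Commons CSV (the org.apache.commons:commons-csv artifact; the file name and columns are made up for illustration):
import java.io.*;
import java.nio.charset.StandardCharsets;
import org.apache.commons.csv.*;

public class CsvDemo {
    public static void main(String[] args) throws IOException {
        File file = new File("entries.csv");

        // write a few records: name, count, date
        try (Writer writer = new OutputStreamWriter(
                 new FileOutputStream(file), StandardCharsets.UTF_8);
             CSVPrinter printer = new CSVPrinter(writer, CSVFormat.DEFAULT)) {
            printer.printRecord("user1", 42, "2023-05-01");
            printer.printRecord("user2", 7, "2023-06-15");
        }

        // read them back
        try (Reader reader = new InputStreamReader(
                 new FileInputStream(file), StandardCharsets.UTF_8);
             CSVParser parser = CSVFormat.DEFAULT.parse(reader)) {
            for (CSVRecord record : parser) {
                System.out.println(record.get(0) + " -> " + record.get(1));
            }
        }
    }
}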
Database
If you have much data, or concurrency issues with multiple threads, use a database such as the H2 Database Engine.
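A minimal embedded-H2 sketch (the com.h2database:h2 driver; the JDBC URL and table are invented for illustration):
import java.sql.*;

public class H2Demo {
    public static void main(String[] args) throws SQLException {
        // "./appdata" creates a database file next to the application
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./appdata")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS entry(" +
                           "name VARCHAR(100), amount INT, created DATE)");
            }
            try (PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO entry VALUES (?, ?, ?)")) {
                ps.setString(1, "user1");
                ps.setInt(2, 42);
                ps.setDate(3, Date.valueOf("2023-05-01"));
                ps.executeUpdate();
            }
        }
    }
}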

How to manage a crawler URL frontier?

Guys
I have the following code to add visited links in my crawler.
After extracting links, I have a for loop which loops through each individual href tag.
After I have visited a link and opened it, I add the URL to the visited-link collection variable defined below.
private final Collection<String> urlForntier = Collections.synchronizedSet(new HashSet<String>());
The crawler implementation is multithreaded. Assume I have visited 100,000 URLs; if I don't terminate the crawler, the set will grow day by day and will eventually create memory issues. What options do I have to refresh the variable without creating inconsistency across threads?
Thanks in advance!
If your crawlers are any good, managing the crawl frontier quickly becomes difficult, slow and error-prone.
Luckily, you don't need to write this yourself: just write your crawlers to consume the URL Frontier API and plug in an implementation that suits you.
See https://github.com/crawler-commons/url-frontier
The most common approach for modern crawling systems is to use a NoSQL database.
This is notably slower than a HashSet, which is why you can layer a caching strategy on top, such as Redis or even a Bloom filter (a sketch follows below).
Given the specific nature of URLs, I'd also recommend a Trie data structure, which gives you many options to manipulate and search by URL string. (A discussion of Java implementations can be found in this Stack Overflow topic.)
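A rough sketch of the Bloom-filter idea using Guava's BloomFilter (the com.google.guava:guava artifact; the expected size and false-positive rate are placeholders):
import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class VisitedUrlFilter {
    // expect up to 10 million URLs with roughly 1% false positives
    private final BloomFilter<String> visited =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8),
                               10_000_000, 0.01);

    /** Returns true if the URL was definitely not seen before and marks it visited. */
    public synchronized boolean markVisited(String url) {
        if (visited.mightContain(url)) {
            return false;   // probably seen already (may be a false positive)
        }
        visited.put(url);
        return true;
    }
}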
As per the question, I would recommend using Redis to replace the Collection. It's an in-memory data structure store that is super fast at inserting and retrieving data, with support for all the standard data structures (in your case a Set; you can check the existence of a member with the SISMEMBER command). A sketch with the Jedis client follows below.
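A minimal sketch using the Jedis client (the redis.clients:jedis artifact; host, port and key name are placeholders):
import redis.clients.jedis.Jedis;

public class RedisFrontierDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String url = "http://example.com/page";

            // SADD returns 1 when the member was new, 0 if it was already present
            if (jedis.sadd("visited_urls", url) == 1) {
                System.out.println("first visit: " + url);
            }

            // or check membership explicitly
            boolean seen = jedis.sismember("visited_urls", url);
            System.out.println("already visited? " + seen);
        }
    }
}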
Apache Nutch is also good to explore.

mapping data in properties files

I have the following data:
User System SubSystem
user1 System1 SubSystem1
user2 System1 SubSystem2
user3 N/A N/A
and I need to be able to determine the system/subsystem tuple from the user. I must be able to add users at any time without rebuilding and redeploying the system.
I know a database would be the best option here, but I cannot use a database table.
I currently have it mapped using a hash map, but I don't want it to be hard-coded. I was thinking about using a properties file, but I can't visualize how I would implement it. Anyone have any suggestions?
Not that it matters, but I'm using Java on WebLogic 10.3.
You could do this using a HashMap (as you do now) and store it using XStream.
XStream allows you to serialise/deserialise Java objects to/from readable/editable XML. You can then write this to (say) a filesystem, and the result is editable by hand.
The downside is that it's a serialisation in XML of a Java object, so not as immediately obvious as a properties file to edit. However it's still very readable, and easily understood by anyone remotely technical. Whether this is an appropriate solution depends on the audience of this file.
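For instance, a quick sketch with XStream (the com.thoughtworks.xstream:xstream artifact; the map contents and file name are placeholders, and newer XStream versions may additionally require configuring allowed types before deserializing):
import java.io.FileReader;
import java.io.FileWriter;
import java.util.HashMap;
import java.util.Map;
import com.thoughtworks.xstream.XStream;

public class UserMappingStore {
    public static void main(String[] args) throws Exception {
        Map<String, String[]> users = new HashMap<>();
        users.put("user1", new String[] {"System1", "SubSystem1"});
        users.put("user3", new String[] {"N/A", "N/A"});

        XStream xstream = new XStream();

        // write the map out as editable XML
        try (FileWriter out = new FileWriter("users.xml")) {
            out.write(xstream.toXML(users));
        }

        // read it back later
        @SuppressWarnings("unchecked")
        Map<String, String[]> restored =
                (Map<String, String[]>) xstream.fromXML(new FileReader("users.xml"));
        System.out.println(restored.get("user1")[1]);
    }
}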
Sounds like something you could very well use YAML for.
SnakeYAML looks to be a workable Java implementation.
I would go for something as simple as :
user1 = userValue
user1.system = systemValue
user1.system.subsystem = subsystemValue
user2 = userValue
user2.system = systemValue
user2.system.subsystem = subsystemValue
user(id) is used as the "primary" key in your properties, and a very simple concatenation of your fields stores your table values.
I use this very often: trust me, it's much more powerful than it may appear :)
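For completeness, a minimal sketch of reading that layout back with java.util.Properties (the file name users.properties is a placeholder):
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class UserLookup {
    private final Properties props = new Properties();

    public UserLookup(String path) throws IOException {
        try (InputStream in = new FileInputStream(path)) {
            props.load(in);
        }
    }

    /** Returns {system, subsystem} for the given user id, or nulls if absent. */
    public String[] lookup(String userId) {
        return new String[] {
            props.getProperty(userId + ".system"),
            props.getProperty(userId + ".system.subsystem")
        };
    }

    public static void main(String[] args) throws IOException {
        UserLookup lookup = new UserLookup("users.properties");
        String[] tuple = lookup.lookup("user1");
        System.out.println(tuple[0] + " / " + tuple[1]);
    }
}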
For this project I've gone with the solution proposed by Olivier. Some project constraints (legacy of the project) prevent me from going with the probably better solution of using XStream.
Thanks for the feedback, guys.

Best way to save data in a Java application?

I'm trying to find the best way to save the state of a simple application.
From a DB point of view there are 4-5 tables with date fields and relationships, of course.
Because the app is simple, and I want the user to have the option of moving the data around (USB pen, Dropbox, etc.), I wanted to put all data in a single file.
What is the best way/lib to do this?
XML is usually the best format for this (readability and openness), but I haven't found any great lib for it without doing SAX/DOM.
If you want to use XML, take a look at XStream for simple serialization of Java objects into XML. Here is "Two minute tutorial".
If you want something simple, the standard Java Properties format can also be a way to store/load some small data.
Consider using plain JAXB annotations that come with the JDK:
@XmlRootElement
class Foo {
    @XmlAttribute
    private String text = "bar";
}
Here's a blog post of mine that gives more details on this simple usage of JAXB (it also mentions a more "classy" JAXB-based approach, in case you need better control over your XML schema, e.g. to guarantee backwards compatibility).
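A short usage sketch for the class above, assuming the javax.xml.bind API (bundled with the JDK up to Java 10; later JDKs need the jakarta.xml.bind dependency instead) and a placeholder file name:
import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;

public class JaxbDemo {
    public static void main(String[] args) throws Exception {
        // Foo is assumed to be visible to JAXB and to have a no-arg constructor
        JAXBContext context = JAXBContext.newInstance(Foo.class);

        // write Foo to a single portable XML file
        Marshaller marshaller = context.createMarshaller();
        marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        marshaller.marshal(new Foo(), new File("state.xml"));

        // read it back
        Foo restored = (Foo) context.createUnmarshaller()
                                    .unmarshal(new File("state.xml"));
        System.out.println(restored);
    }
}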
Two other options you might consider:
HSQLDB is a small SQL database written in Java. More relevant for your purposes, it can be configured to simply write to a CSV file as its data store, so you could conceivably use its text output as a portable datastore and still use SQL, if that's what you prefer.
A second option might be to write the datastore directly to a serialized file, either directly or through a library like Prevayler. Very good performance and simple to implement; the cons are the fragility and opacity of the format.
But if the data is small enough, XML is probably much less bother.
If you don't need to provide semantic meaning to your data, then XML is probably the wrong choice. I would recommend using the fat-free alternative, JSON, which is much more naturally built for data structures.
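One possible sketch with Gson (the com.google.code.gson:gson artifact; the AppState class and file name are invented for illustration):
import java.io.FileReader;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;
import com.google.gson.Gson;

public class JsonStateDemo {
    static class AppState {
        List<String> notes = new ArrayList<>();
        String lastOpened = "2023-05-01";
    }

    public static void main(String[] args) throws Exception {
        Gson gson = new Gson();

        // save the whole application state into one portable file
        AppState state = new AppState();
        state.notes.add("first note");
        try (FileWriter out = new FileWriter("state.json")) {
            gson.toJson(state, out);
        }

        // load it back
        try (FileReader in = new FileReader("state.json")) {
            AppState restored = gson.fromJson(in, AppState.class);
            System.out.println(restored.notes);
        }
    }
}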
