I am trying to move all my data from one column family (table) to another. Since the two tables have different schemas, I have to pull all the data from table-1, create a new object for table-2, and then do a bulk async insert. Table-1 has millions of records, so I cannot load all the data into a data structure at once and work on it there. I am looking for a way to do this easily using Spring Data Cassandra with Java.
I initially planned to move all the data to a temp table first, then create some composite-key relations and query back into my master table. However, that approach does not seem favorable to me. Can anyone suggest a good strategy to do this? Any leads would be appreciated. Thanks!
> My table-1 has millions of records so I cannot get all the data directly in my data structure and work that out.
With the DataStax Java driver you can fetch the data token range by token range and process each range separately. For example:
Set<TokenRange> tokenRanges = session.getCluster().getMetadata().getTokenRanges();
for (TokenRange tr : tokenRanges) {
    List<Row> rows = new ArrayList<>();
    // a range may wrap around the ring, so unwrap() splits it into contiguous sub-ranges
    for (TokenRange sub : tr.unwrap()) {
        String query = "SELECT * FROM keyspace.table WHERE token(pk) > ? AND token(pk) <= ?";
        SimpleStatement st = new SimpleStatement(query, sub.getStart(), sub.getEnd());
        rows.addAll(session.execute(st).all());
    }
    transformAndWriteToNewTable(rows);
}
Each token range contains only a slice of the data and is owned by one physical machine. You can handle each token range independently (in parallel or asynchronously) to get more throughput.
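A hedged sketch of handling the ranges in parallel with a fixed thread pool, reusing the session and the transformAndWriteToNewTable method from the snippet above (the pool size is an arbitrary assumption):

ExecutorService pool = Executors.newFixedThreadPool(8);
List<CompletableFuture<Void>> futures = new ArrayList<>();
for (TokenRange tr : tokenRanges) {
    futures.add(CompletableFuture.runAsync(() -> {
        List<Row> rows = new ArrayList<>();
        for (TokenRange sub : tr.unwrap()) {
            SimpleStatement st = new SimpleStatement(
                "SELECT * FROM keyspace.table WHERE token(pk) > ? AND token(pk) <= ?",
                sub.getStart(), sub.getEnd());
            rows.addAll(session.execute(st).all());
        }
        transformAndWriteToNewTable(rows);   // same user-supplied method as above
    }, pool));
}
futures.forEach(CompletableFuture::join);    // wait for every range to finish
pool.shutdown();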
You could use Apache Spark Streaming: technically, you would read the data from the first table, do an on-the-fly transformation, and write it to the second table.
Note that I prefer the Spark Scala API, as it is more elegant and the streaming job code is more laconic. But if you want to do it in pure Java, that is your choice.
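For example, a minimal Java sketch using the Spark Cassandra Connector's DataFrame source (assuming the connector is on the classpath and spark.cassandra.connection.host is configured; the keyspace/table names and the transformation are placeholders, and table2 must already exist with the target schema):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class TableMigration {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cassandra-table-migration")
                .getOrCreate();

        // read the source table
        Dataset<Row> source = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "my_ks")
                .option("table", "table1")
                .load();

        // the on-the-fly transformation goes here; selecting/renaming columns is just a placeholder
        Dataset<Row> transformed = source.selectExpr("pk", "col_a AS new_col");

        // write to the target table
        transformed.write()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "my_ks")
                .option("table", "table2")
                .mode(SaveMode.Append)
                .save();

        spark.stop();
    }
}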
I'm new to Spark. I'm loading a huge CSV file using the DataFrame code given below:
Dataset<Row> df = sqlContext.read().format("com.databricks.spark.csv")
        .schema(customSchema)
        .option("delimiter", "|")
        .option("header", true)
        .load(inputDataPath);
Now, after loading the CSV data into the data frame, I want to iterate through each row and, based on some columns, query a PostgreSQL DB (performing some geometry operation). Later I want to merge some fields returned from the DB with the data frame records. What is the best way to do this, considering the huge number of records?
Any help appreciated. I'm using Java.
As @mck also pointed out: the best way is to use a join.
With Spark you can read an external JDBC table using the DataFrame API,
for example:
val props = Map(....)
spark.read.format("jdbc").options(props).load()
See the DataFrameReader scaladoc for more options and for which properties and values you need to set.
Then use a join to merge the fields.
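In Java, reusing the df and sqlContext from the question, a rough sketch of the same idea could look like this (the JDBC options and the join columns geom_id / id are assumptions):

Map<String, String> options = new HashMap<>();
options.put("url", "jdbc:postgresql://host:5432/mydb");   // hypothetical connection details
options.put("dbtable", "geometry_table");                 // hypothetical table name
options.put("user", "user");
options.put("password", "secret");

Dataset<Row> pgDf = sqlContext.read().format("jdbc").options(options).load();

// join the CSV data frame with the database table, then select whatever merged fields you need
Dataset<Row> merged = df.join(pgDf, df.col("geom_id").equalTo(pgDf.col("id")), "left_outer");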
I have a very large table in the database. The table has a column called "unique_code_string" and almost 100,000,000 records.
Every 2 minutes I receive 100,000 code strings in an array; they are unique among themselves. I need to insert them into the large table if they are all "good".
The meaning of "good" is this: none of the 100,000 codes in the array already occurs in the large table. If one or more codes already occur in the large table, the whole array is rejected, meaning no codes from the array are inserted into the large table.
Currently, I do it this way: first I loop over the array and check each code to see if the same code already exists in the large table; second, if every code is "new", I do the real insert. But this way is very slow, and I must finish everything within 2 minutes.
I am thinking of other ways:

1. Join the 100,000 codes into a SQL IN clause. Each code is 32 characters long, and I doubt any database will accept an IN clause of 32 * 100,000 characters.
2. Use a database transaction: force the insert anyway and roll back the transaction if an error happens. This causes some performance issues.
3. Use a database temporary table. I am not good at writing SQL queries, so please give me an example if this idea may work.
Now, can any experts give me some advice or solutions? I am a non-English speaker; I hope you can see the issue I am facing. Thank you very much.
Load the 100,000 rows into a table!
Create a unique index on the original table:
create unique index unq_bigtable_uniquecodestring on bigtable (unique_code_string);
Now, you have the tools you need. I think I would go for a transaction, something like this:
insert into bigtable ( . . . )
select . . .
from smalltable;
If any row fails (due to the unique index), then the transaction will fail and nothing is inserted. You can also be explicit:
insert into bigtable ( . . . )
select . . .
from smalltable
where not exists (select 1
                  from smalltable st join
                       bigtable bt
                       on st.unique_code_string = bt.unique_code_string
                 );
For this version, you should also have an index/unique constraint on smalltable(unique_code_string).
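If it helps, here is a rough JDBC sketch of the whole flow in Java: batch-load the incoming codes into a staging table, then run the conditional insert inside a single transaction. The table and column names (staging_codes, bigtable, unique_code_string) and the connection details are assumptions, not something from your schema:

import java.sql.*;
import java.util.List;

public class CodeBatchLoader {

    public static void insertIfAllNew(String jdbcUrl, String user, String password,
                                      List<String> codes) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
            conn.setAutoCommit(false);
            try {
                // 1. bulk-load the 100,000 incoming codes into the staging table
                try (PreparedStatement stage = conn.prepareStatement(
                        "INSERT INTO staging_codes (unique_code_string) VALUES (?)")) {
                    for (String code : codes) {
                        stage.setString(1, code);
                        stage.addBatch();
                    }
                    stage.executeBatch();
                }
                // 2. insert into the big table only if none of the staged codes already exist there
                try (Statement st = conn.createStatement()) {
                    st.executeUpdate(
                        "INSERT INTO bigtable (unique_code_string) " +
                        "SELECT unique_code_string FROM staging_codes " +
                        "WHERE NOT EXISTS (SELECT 1 FROM staging_codes s " +
                        "  JOIN bigtable bt ON s.unique_code_string = bt.unique_code_string)");
                    st.executeUpdate("DELETE FROM staging_codes");
                }
                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }
}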
It's hard to find an optimal solution with so little information. Often this depends on the network latency between the application and the database server, and on hardware resources.
You could load the 100,000,000 unique_code_string values from the database and use a HashSet or TreeSet to de-duplicate in memory before inserting into the database. If your database server is resource constrained or there is considerable network latency, this might be faster.
Depending on how you receive the 100,000-record delta, you could load it into the database; e.g. a CSV file can be read using an external table. If you can get the data efficiently into a temporary table and the database server is not overloaded, you can do this very efficiently with SQL or a stored procedure.
You should spend some time understanding how real-time the update has to be, e.g. how many SQL queries are reading the 100,000,000-row table and whether you can allow some of them to be cancelled or blocked while you update the rows. It is often a good idea to create a shadow table:
1. Create a new table as a copy of the existing 100,000,000-row table.
2. Disable the indexes on the new table.
3. Load the delta rows into the new table.
4. Rebuild the indexes on the new table.
5. Delete the existing table.
6. Rename the new table to the existing 100,000,000-row table's name.
The approach here is database specific. It will depend on how your database defines indexes; e.g. if you have a partitioned table, it might not be necessary. A rough JDBC sketch of the sequence follows.
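This is an illustration only: the DDL is database specific, PostgreSQL-flavored SQL is assumed, and the staging_codes delta table plus the index/table names are all assumptions (jdbcUrl, user and password are assumed to be in scope):

try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
     Statement st = conn.createStatement()) {
    // 1. copy the existing table (CREATE TABLE ... AS does not copy indexes, which covers step 2)
    st.executeUpdate("CREATE TABLE bigtable_shadow AS SELECT * FROM bigtable");
    // 3. load the delta rows
    st.executeUpdate("INSERT INTO bigtable_shadow (unique_code_string) " +
                     "SELECT unique_code_string FROM staging_codes");
    // 4. rebuild the indexes
    st.executeUpdate("CREATE UNIQUE INDEX unq_shadow_code ON bigtable_shadow (unique_code_string)");
    // 5. and 6. swap the tables
    st.executeUpdate("ALTER TABLE bigtable RENAME TO bigtable_old");
    st.executeUpdate("ALTER TABLE bigtable_shadow RENAME TO bigtable");
    st.executeUpdate("DROP TABLE bigtable_old");
}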
How can we read data from a relational database using a custom data source? I am new to Flink streaming and am facing problems adding a new custom data source. Please help me add a custom data source and read data continuously from the source DB.
As suggested by Chengzhi, relational databases are not designed to be processed in a streaming fashion and it would be better to use Kafka, Kinesis or some other system for that.
However, you could write a custom source function that uses a JDBC connection to fetch the data. It would have to continuously query the DB for any new data. The issue is that you need a way to determine which data you have already read/processed and which you have not. Off the top of my head, you could use a couple of things, like remembering the last processed primary key and using it in a subsequent query like:
SELECT * FROM events WHERE event_id > $last_processed_event_id;
Alternatively you could clear the events table inside some transaction like:
SELECT * FROM unprocessed_events;
DELETE FROM unprocessed_events WHERE event_id IN $PROCESSED_EVENT_IDS;
event_id can be anything that lets you uniquely identify the records; it could be a timestamp or a set of fields.
Another thing to consider is that you would have to take care of checkpointing the last_processed_event_id offset manually if you want to provide any reasonable at-least-once or exactly-once guarantees.
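A minimal sketch of such a polling source function, assuming the events / event_id / payload names from the query above, a fixed one-second poll interval, and with checkpointing of lastId deliberately left out:

import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcPollingSource implements SourceFunction<String> {

    private final String jdbcUrl;          // e.g. "jdbc:postgresql://host:5432/db" (assumption)
    private volatile boolean running = true;
    private long lastId = 0L;              // last processed primary key

    public JdbcPollingSource(String jdbcUrl) {
        this.jdbcUrl = jdbcUrl;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT event_id, payload FROM events WHERE event_id > ? ORDER BY event_id")) {
            while (running) {
                ps.setLong(1, lastId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // emit the record and remember the highest id seen so far
                        synchronized (ctx.getCheckpointLock()) {
                            ctx.collect(rs.getString("payload"));
                            lastId = rs.getLong("event_id");
                        }
                    }
                }
                Thread.sleep(1000);        // poll interval
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}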
I'm using MongoDB and PostgreSQL in my application. The reason for using MongoDB is that any number of new fields might get added, and we store the data for those fields in MongoDB.
We store our fixed field values in PostgreSQL and the custom field values in MongoDB.
E.g.
**Employee Table (RDBMS):**
id Name Salary
1 Krish 40000
**Employee Collection (MongoDB):**
{
    <some autogenerated id of mongodb>,
    instanceId: 1,        (the id from SQL: MANUALLY ASSIGNED)
    employeeCode: A001
}
We get the records from SQL and, using their ids, fetch the related records from MongoDB. Then we map the results to get the values of the new fields and send them to the UI.
Now I'm looking for an optimized way to get the MongoDB results into my PostgreSQL POJO / model, so I don't have to fetch the data from MongoDB manually by passing the SQL ids and then mapping them again.
Is there any way to connect MongoDB with PostgreSQL through columns (here the id of the RDBMS and the instanceId of MongoDB) so that with one fetch I can get the related Mongo result too? Any kind of return type is acceptable, but I need all of it in one call.
I'm using Hibernate and Spring in my application.
Using Spring Data might be the best solution for your use case, since it supports both:
- JPA
- MongoDB
You can still get all the data in one request, but that doesn't mean you have to use a single DB call. You can have one service call that spans two database calls. Because the PostgreSQL row is probably the primary entity, I advise you to share the PostgreSQL primary key with MongoDB too.
There's no need to have separate IDs. This way you can simply fetch the SQL row and the Mongo document by the same ID. Sharing the same ID lets you process those two requests concurrently and merge the results before returning from the service call, so the service method duration will not be the sum of the two repository calls but roughly the max of the two.
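A minimal sketch of that idea with Spring Data (the repository and entity names are assumptions, EmployeeView is an assumed view class with an (Employee, EmployeeDoc) constructor, Spring Data 2.x style findById returning an Optional is assumed, and both stores share the same id as suggested above):

import java.util.concurrent.CompletableFuture;
import org.springframework.stereotype.Service;

@Service
public class EmployeeService {

    private final EmployeeRepository jpaRepo;        // Spring Data JPA repository (PostgreSQL)
    private final EmployeeDocRepository mongoRepo;   // Spring Data MongoDB repository

    public EmployeeService(EmployeeRepository jpaRepo, EmployeeDocRepository mongoRepo) {
        this.jpaRepo = jpaRepo;
        this.mongoRepo = mongoRepo;
    }

    public EmployeeView findById(Long id) {
        CompletableFuture<Employee> fixedFields =
                CompletableFuture.supplyAsync(() -> jpaRepo.findById(id).orElse(null));
        CompletableFuture<EmployeeDoc> customFields =
                CompletableFuture.supplyAsync(() -> mongoRepo.findById(id).orElse(null));
        // merge once both calls have completed; total latency is roughly the max of the two calls
        return fixedFields.thenCombine(customFields, EmployeeView::new).join();
    }
}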
Astonishingly, yes, you potentially can. There's a foreign data wrapper named mongo_fdw that allows PostgreSQL to query MongoDB. I haven't used it and have no opinion as to its performance, utility or quality.
I would be very surprised if you could effectively use this via Hibernate, unless you can convince Hibernate that the FDW mapped "tables" are just views. You might have more luck with EclipseLink and their "NoSQL" support if you want to do it at the Java level.
Separately, this sounds like a monstrosity of a design. There are many sane ways to do what you want within a decent RDBMS, without going for a hybrid database platform. There's a time and a place for hybrid, but I really doubt your situation justifies the complexity.
Just use PostgreSQL's json / jsonb support for dynamic mappings. Or use traditional options like storing JSON as text fields, storing XML, or even EAV mapping. Don't build a Rube Goldberg machine.
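To sketch the jsonb idea (the employee table, its custom_fields jsonb column, and the in-scope jdbcUrl / user / password are all assumptions), the dynamic attributes can live in a single jsonb column next to the fixed ones and be read back with plain JDBC:

try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
     PreparedStatement ps = conn.prepareStatement(
             "SELECT id, name, salary, custom_fields FROM employee WHERE id = ?")) {
    ps.setLong(1, 1L);
    try (ResultSet rs = ps.executeQuery()) {
        if (rs.next()) {
            // jsonb is returned as text; parse it with any JSON library into a Map of custom fields
            String customFieldsJson = rs.getString("custom_fields");
        }
    }
}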
I have an HBase table with a couple of million records. Each record has a couple of properties describing it, each stored in a column qualifier (mostly int or string values).
I have a requirement that I should be able to see the records paginated and sorted by a column qualifier (or even more than one, in the future). What would be the best approach to do this? I have looked into secondary indexes using coprocessors (mostly hindex from Huawei), but it doesn't seem to match my use case exactly. I've also thought about replicating all the data into multiple tables, one per sort property, which would be included in the row key, and then redirecting queries to those tables. But this seems very tedious, as I already have a few of these so-called properties.
Thanks for any suggestions.
You need your NoSQL database to work just like an RDBMS, and given the size of your data, your life would be a lot simpler if you just used one, unless you expect exponential growth :) Also, you don't mention whether your data gets updated; this is very important for making a good decision.
Having said that, you have a lot of options; here are some:
- If you can wait for the results: write a MapReduce task to do the scan, sort it and retrieve the top X rows. Do you really need more than 1,000 pages (20-50k rows) for each sort type? Another option would be using something like Hive.
- If you can aggregate the data and "reduce" the dataset: write a MapReduce task to periodically export the newest aggregated data to a SQL table (which will handle the queries). I've done this a few times too, and it works like a charm, but it depends on your requirements.
- If you have plenty of storage: write a MapReduce task to periodically regenerate (or append to) a new table for each property (sorted by it in the row key). You don't need multiple tables; just use a prefix in your row keys for each case. Or, if you don't want extra tables and you won't have a lot of queries, simply write the sorted data to CSV files and store them in HDFS; they could easily be read by your frontend app.
- Manually maintain a secondary index: this would not be very tolerant to schema updates and new properties, but would work great for near-real-time results. To do it, update your code to also write to the secondary table, with a good buffer to help with performance while avoiding hot regions. Think about this type of row key: [4B SORT FIELD ID (4 chars)][8B SORT FIELD VALUE][8B timestamp], with just one column storing the row key of the main table. To retrieve the data sorted by any of the fields, perform a SCAN using the SORT FIELD ID as the start row plus the starting sort field value as the pivot for pagination (ignore it to get the first page, then set it to the last one retrieved). That gives you the row keys of the main table, and you can then perform a multi-get against it to retrieve the full data (a sketch of such a scan follows after this list). Keep in mind that you'll need a small script to scan the main table and write the data to the index table for the existing rows.
- Rely on any of the automatic secondary indexing coprocessors like the ones you mentioned, although I do not like this option at all.
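For the manual secondary index option, here is a minimal sketch of the paginated scan against the index table (HBase 2.x client API assumed; the index table name main_table_idx, the column family d and the qualifier rowkey are assumptions following the layout suggested above):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SortedPageReader {

    // scans one page of index rows for the given sort field and returns the main-table row keys
    public static List<byte[]> readPage(String sortFieldId, byte[] lastSeenIndexRow, int pageSize)
            throws Exception {
        List<byte[]> mainRowKeys = new ArrayList<>();
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table index = conn.getTable(TableName.valueOf("main_table_idx"))) {

            Scan scan = new Scan();
            // all index rows for this sort field start with its 4-char id
            scan.setRowPrefixFilter(Bytes.toBytes(sortFieldId));
            if (lastSeenIndexRow != null) {
                // resume after the last index row of the previous page
                scan.withStartRow(lastSeenIndexRow, false);
            }
            scan.setLimit(pageSize);

            try (ResultScanner scanner = index.getScanner(scan)) {
                for (Result r : scanner) {
                    // each index row stores the row key of the main table in d:rowkey
                    mainRowKeys.add(r.getValue(Bytes.toBytes("d"), Bytes.toBytes("rowkey")));
                }
            }
        }
        return mainRowKeys;   // multi-get these from the main table to load the full records
    }
}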
You have mostly enumerated the options. HBase natively does not support secondary indexes, as you are aware. In addition to hindex you may consider Phoenix
https://github.com/forcedotcom/phoenix
(from Salesforce), which in addition to secondary indexes has a JDBC driver and SQL support.