I've been working on an application that requires regular writes and massive reads at once.
The application stores a few text columns, which are not very big, and a map column, which is the biggest column in the table.
Working with Phantom-DSL in Scala (Datastax Java driver underneath), my application crashes when the data size increases.
Here is a log from my application.
[error] - com.websudos.phantom - All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.OperationTimedOutException: [/127.0.0.1:9042] Operation timed out))
And here are the Cassandra logs.
I have posted the Cassandra logs in a pastebin because they were too large to embed in the question.
I can't seem to understand the reason for this crash. I have tried increasing the timeout and turning off row cache.
From what I understand, this is a basic problem and can be resolved by tuning Cassandra for this special case.
My Cassandra writes come from several different data sources, so writes are not very frequent. But the reads are big: over 300K rows may be required at once, which then need to be transferred over HTTP.
The logs show significant GC pressure (ParNew pauses of 5 seconds).
When you say "reads are big in size in that over 300K rows may be required at once", do you mean you're pulling 300k rows in a single query? The Datastax driver supports native paging: set the fetch size significantly lower (500 or 1000) and allow it to page through those queries instead of trying to load all 300k rows in a single pass.
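As a rough illustration (a sketch only, with made-up keyspace/table names), this is what driver-side paging looks like with the plain Datastax Java driver that Phantom sits on top of:

import com.datastax.driver.core.*;

public class PagedRead {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withQueryOptions(new QueryOptions().setFetchSize(1000)) // default page size
                .build();
        try {
            Session session = cluster.connect("my_keyspace");            // hypothetical keyspace
            Statement stmt = new SimpleStatement("SELECT id, payload FROM my_table");
            stmt.setFetchSize(500);        // pull 500 rows per page instead of all 300k at once
            ResultSet rs = session.execute(stmt);
            for (Row row : rs) {           // the driver fetches further pages transparently
                process(row);              // e.g. write each row out to the HTTP response
            }
        } finally {
            cluster.close();               // also closes any open sessions
        }
    }

    private static void process(Row row) { /* hypothetical per-row handling */ }
}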
Maps (and collections in general) can be very demanding on Cassandra heap space. Changing your data model to replace the map with another table may solve your GC issues. But that's a lot of speculation given the lack of further details on your Cassandra usage.
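For example (purely an assumption about what the map holds, with made-up names), each map entry can become its own row, keyed by the original partition key plus the former map key as a clustering column, so reads can be paged instead of pulling one huge cell:

import com.datastax.driver.core.Session;

public class MapRemodel {
    // Reuses a Session like the one built in the previous sketch.
    static void remodel(Session session, String entityId, String attrName, String attrValue) {
        session.execute(
            "CREATE TABLE IF NOT EXISTS my_keyspace.entity_attributes ("
          + " entity_id text,"
          + " attr_name text,"                 // was the map key
          + " attr_value text,"                // was the map value
          + " PRIMARY KEY (entity_id, attr_name))");

        // one row per former map entry
        session.execute(
            "INSERT INTO my_keyspace.entity_attributes (entity_id, attr_name, attr_value) VALUES (?, ?, ?)",
            entityId, attrName, attrValue);
    }
}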
Related
I am trying to migrate data in MongoDB with Mongock and stumbled upon an issue when I need to run a migration across a large amount of data.
Is there a way to partition this data? I didn't find support for that in the documentation.
The problem is that it takes a lot of time to query this data, and loading all of it into memory at once can also cause problems.
Update:
The problem is that the sample code below may bring back anywhere from one to a million documents, and that can take a huge amount of time:
// findAll() materializes every document in memory before the stream even starts
mongoTemplate.findAll(User.class).stream()
        .map(this::migrateUser)
        .forEach(mongoTemplate::save);
There are a lot of questions about this error, but none for this condition.
I am running Elasticsearch 5.4.1 on a Mac, with a Java (1.8) client which uses the API to make Elasticsearch calls. I have a large number of documents (~8,000) which I have to search on, insert, and merge.
I get this exception
None of the configured nodes are available: [{#transport#-1}{gNzkHZmURzaabE336I-T4w}{localhost}{127.0.0.1:9300}]
The thing is, it works with a smaller number of entries (e.g. 5,000 documents), so the connection seems to be going down for some reason. Should I allocate more memory or add more nodes? Is there some weird garbage collection going on?
According to the activity monitor for the Mac, memory pressure is fine.
I have created an application which migrates data (in GBs) from Oracle to MongoDB. I have used sharding in MongoDB.
How can I reduce the time the migration takes and increase throughput?
Your migration could be slow for a number of reasons.
It could be because of:
how the migration code works
how fast Oracle responds to the queries
how you are pulling the data from Oracle (whether a stream or something similar is possible; see the sketch at the end of this answer)
whether you are directly querying with something like select *
It also matters whether the two DB servers are in the same data center or are separated. All of the above assumes the two DB servers are in the same data center (the same location, private server area, or virtual cloud region).
It also depends on the number of indexes on the Oracle side (which may impact the rate at which the data is read).
If the operations are just reads they should be faster; if you are performing reads plus updates/deletes on Oracle, the performance could again be slower.
This is a very generic question, and there could be a long list of reasons causing the performance issues.
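To make the "pull as a stream" point concrete, here is a rough sketch (connection strings, table and field names are all made up) that reads from Oracle with a modest JDBC fetch size and writes to MongoDB in bulk batches instead of one document at a time:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class OracleToMongo {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//oracle-host:1521/ORCL", "user", "pass"); // made-up DSN
             MongoClient mongo = MongoClients.create("mongodb://mongos-host:27017")) {
            MongoCollection<Document> target =
                    mongo.getDatabase("migration").getCollection("customers");     // made-up names

            try (PreparedStatement ps = conn.prepareStatement(
                         "SELECT id, name, email FROM customers")) {
                ps.setFetchSize(1000);                    // stream rows instead of buffering them all
                try (ResultSet rs = ps.executeQuery()) {
                    List<Document> batch = new ArrayList<>();
                    while (rs.next()) {
                        batch.add(new Document("_id", rs.getLong("id"))
                                .append("name", rs.getString("name"))
                                .append("email", rs.getString("email")));
                        if (batch.size() == 1000) {       // bulk insert instead of per-row writes
                            target.insertMany(batch);
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        target.insertMany(batch);
                    }
                }
            }
        }
    }
}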
We are working with Google BigQuery (using Java) for one of our cloud solutions and are facing a lot of issues in development. Our observations and issues are as follows:
We are using query jobs for data retrieval (for example, the jobs().insert()/jobs().query() methods first and then tabledata().list() for the data). Job execution takes 2-3 seconds (we only have data in MBs right now). We looked at sample code on code.google.com and github.com and tried to implement it, but we have not been able to get execution faster than 2-3 seconds. What is the fastest way to retrieve data from BigQuery tables? Is there a way to improve job execution speed? If so, can you provide links to sample code?
In our screens we need to fetch data from different tables (different queries) and display it. So we insert multiple query jobs, and the total execution time adds up (for example, with two jobs, i.e. two queries, it takes 6-7 seconds). The Google documentation mentions that we can run concurrent jobs. Is there any sample code available for this?
Waiting for your valuable responses.
Querying cached results can be much faster, if you can run the query independently ahead of time; an identical repeat query will then be served from the cache and run faster.
Check that the bottleneck is not related to the network, pagination, page rendering, etc. You can do that by trying to execute only the second step.
Parallel jobs might be queued on the BQ end based on their current load.
My recommendation would be to separate the query from the presentation: run the BQ queries, retrieve the "small size" data into a fast-access data store (flat file, cache, Cloud SQL, etc.), and present it from there.
As Pentium10 says, BQ is excellent for huge datasets (and returns results faster and cheaper than any comparable solution). If you are looking for a back end for a fast reporting/visualization tool, I am afraid BQ might not be your solution.
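On the concurrent-jobs question: as a sketch only (using the newer google-cloud-bigquery client rather than the jobs().insert()/tabledata().list() API mentioned in the question, and with made-up project/table names), the queries can simply be submitted from separate threads so their 2-3 seconds overlap instead of adding up:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentQueries {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        List<String> queries = Arrays.asList(                        // made-up screen queries
                "SELECT name, total FROM `my_project.dashboard.sales_summary`",
                "SELECT region, visits FROM `my_project.dashboard.traffic_summary`");

        ExecutorService pool = Executors.newFixedThreadPool(queries.size());
        List<Future<TableResult>> futures = new ArrayList<>();
        for (String sql : queries) {
            // each query becomes its own job; the jobs run concurrently
            futures.add(pool.submit(() ->
                    bigquery.query(QueryJobConfiguration.newBuilder(sql).build())));
        }
        for (Future<TableResult> f : futures) {
            f.get().iterateAll().forEach(row -> { /* render the row on the screen */ });
        }
        pool.shutdown();
    }
}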
1) BigQuery is a highly scalable database before it is a "super fast" database. It's designed to process HUGE amounts of data by distributing the processing among several different machines, using a technique named Dremel. Because it's designed to use several machines and parallel processing, you should expect super-scalability with good performance.
2) BigQuery is an asset when you want to analyze billions of rows.
For example: analyzing all the Wikipedia revisions in 5-10 seconds isn't bad, is it? But even a much smaller table would take about the same time, even if it has only 10k rows.
3) Below this size, you'll be better off using more traditional data storage solutions such as Cloud SQL or the App Engine Datastore. If you want to keep SQL capability, Cloud SQL is the best bet.
Sybase IQ, for instance, is often installed as a single database and doesn't use Dremel. That said, it's going to be faster than BigQuery in many scenarios... as designed.
4) Certainly the performance differs from a dedicated environment. You can get your own dedicated environment for $20K a month.
I am using a MySQL database in which a table has 1.7 million records. Through the Restlet framework in Java I want to fetch these records and return them to the client. I am using Linux CentOS on a remote server. I have created a WAR file and uploaded it to the server. When I run the service it takes a lot of time to respond; I waited for 40 minutes but got no output.
Can anybody please help me resolve this problem?
That's probably not going to work: holding that many rows of data in memory will probably cause an out of memory exception (can you look at the logs on the server and see what exactly is happening?).
To do something like this you'll either need to abandon that plan and do pagination of some sort, or you'll need a solution that lets you stream the records to the client without holding them in memory. I'm unsure that the Restlet framework lets you do that: you'll probably need to implement that using servlets yourself.
When I have a very large number of rows I have used memory-mapped files. For example, I have one database where I have to retrieve and process 1.1 billion rows in around a minute (it's over 200 GB).
This is a very specialist approach, and I suspect there is a way to tune your SQL database or use a NoSQL database to do what you want. I would have thought you could retrieve 1.7 million rows in under a minute, depending on what you are doing (e.g. if you are selecting that many from amongst a few TB, it's going to take a while).
But, if there really is no other option, you could write a custom data store.
BTW: in my case only a summary is produced. No one should be expected to read that many rows, certainly not display them in a browser. Perhaps there is something you can do to produce a report or summary so there is less to send to the client.
I have successfully done just this kind of work in my apps. If your client is ready to accept a big response, there is nothing wrong with the approach. The main point is that you'll need to stream the response, which means you can't build the entire response as a string. Get the output stream of the HTTP response and write records into it one by one. On the DB end you need to set up a scrollable result set (easy to do at the JDBC level, as well as at the Hibernate level).
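Roughly, the JDBC end of that could look like this (a sketch only, with made-up connection details and columns, shown as a plain servlet; a streaming representation in Restlet wraps the same write-to-output-stream idea). With MySQL Connector/J, rather than a scrollable result set, a forward-only, read-only statement with a fetch size of Integer.MIN_VALUE streams rows one at a time instead of buffering all 1.7 million:

import javax.servlet.http.*;
import java.io.PrintWriter;
import java.sql.*;

// Streams rows straight from MySQL to the HTTP response without holding them all in memory.
public class RecordsServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        resp.setContentType("text/csv");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/mydb", "user", "pass");      // made-up credentials
             Statement stmt = conn.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // MySQL Connector/J streams row by row when the fetch size is Integer.MIN_VALUE
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM records");
                 PrintWriter out = resp.getWriter()) {
                while (rs.next()) {
                    out.print(rs.getLong("id"));
                    out.print(',');
                    out.println(rs.getString("name"));    // one row written at a time
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}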