Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
We are building a system that needs to put large volumes of data into persistent storage for a fixed period of 30 to 60 days. Since the data is not critical (we can tolerate losing some, for example when a virtual machine goes down) and latency is critical for us, we don't want to pay the price of persisting it synchronously with every request. We are therefore considering either buffering and batching the data, or writing it asynchronously.
Data is append-only; we need to persist 2-3 items per request, and the system processes ~10k requests per second across multiple horizontally scaled hosts.
We are hesitating between Mongo (3.x?) and Cassandra, but we are open to any other solution. Does anyone here have experience or hints for solving this kind of problem? We are running some PoCs, but we might not find all the problems early enough, and a pivot could be costly.
I can't comment on MongoDB, but I can speak to Cassandra. Cassandra does indeed have a TTL feature with which you can expire data after a certain time. You have to plan for it, though, because TTLs add some overhead during a process Cassandra runs called 'compaction' - see: http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_write_path_c.html
and: http://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html
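For reference, here is a minimal CQL sketch of a per-write TTL; the table and column names are hypothetical, and `dateof(now())` matches the 2.1-era docs linked above:

```sql
-- Hypothetical table; each row expires 30 days (2592000 s) after the write
CREATE TABLE IF NOT EXISTS events (
    id uuid,
    ts timestamp,
    payload text,
    PRIMARY KEY (id, ts)
);

INSERT INTO events (id, ts, payload)
VALUES (uuid(), dateof(now()), 'some data')
USING TTL 2592000;
```

A default TTL can also be set once on the whole table (`default_time_to_live`) so individual writes don't need the `USING TTL` clause.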
As long as you size for that kind of workload, you should be OK. That being said, Cassandra really excels when you have event-driven data - things like time series, product catalogs, clickstream data, etc.
If you aren't familiar with Patrick McFadin, meet your new best friend: https://www.youtube.com/watch?v=tg6eIht-00M
And of course, there are plenty of free tutorials and training here: https://academy.datastax.com/
EDIT: adding one more idea for expiring data 'safely' and with the least overhead. This one is by a sharp guy by the name of Ryan Svihla: https://lostechies.com/ryansvihla/2014/10/20/domain-modeling-around-deletes-or-using-cassandra-as-a-queue-even-when-you-know-better/
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 17 days ago.
I need to perform an automated housekeeping task, and this is my query:
delete from sample_table where id = '1'
This scheduled query gets executed from multiple service instances.
Will this have a significant performance impact? What would be an appropriate way of testing this?
Issuing multiple deletes for the same partition can have a significant impact on your cluster.
Remember that all writes in Cassandra (INSERT, UPDATE, DELETE) are inserts under the hood. Since Cassandra does not perform a read-before-write (with the exception of lightweight transactions), issuing a DELETE will insert a tombstone marker regardless of whether the data exists or has already been deleted.
Every single DELETE you issue counts as a write request, so depending on how busy your cluster is, it may have a measurable impact on its performance. Cheers!
Erick's answer is pretty solid, but I'd just like to add that the place you'll most likely see performance issues is at read time. That's because running:
SELECT * FROM sample_table WHERE id='1';
...will read ALL of the tombstones left by those DELETEs from the SSTable files. The default settings on a table mean deleted data stays around for 10 days (to ensure proper replication) before it can be picked up by compaction.
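That 10-day window is the table's `gc_grace_seconds` setting. A sketch of tuning it (only safe to lower if you understand the repair implications, since it exists to let tombstones replicate before removal):

```sql
-- gc_grace_seconds defaults to 864000 (10 days); tombstones become
-- eligible for removal by compaction only after this window elapses
ALTER TABLE sample_table WITH gc_grace_seconds = 864000;
```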
So figure out how many times that DELETE happens per key over a 10-day period; that's roughly how many tombstones Cassandra will have to reconcile at read time.
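As a hypothetical illustration of that arithmetic (the schedule below is invented for the example, not taken from the question):

```java
// Hypothetical schedule: estimate how many tombstones accumulate per key
// before gc_grace_seconds (10 days by default) lets compaction remove them.
public class TombstoneEstimate {
    public static long perKey(int instances, int runsPerHour, int gcGraceDays) {
        return (long) instances * runsPerHour * 24 * gcGraceDays;
    }

    public static void main(String[] args) {
        // e.g. 3 service instances each running the DELETE once a minute
        System.out.println(perKey(3, 60, 10)); // prints 43200
    }
}
```

Tens of thousands of tombstones on a single partition is well past the point where reads slow down noticeably, which is why consolidating the job onto one instance is worth considering.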
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
I'm about to start a program that constantly pulls information from a web API, so I want it to read the API's JSON responses as fast as possible in order not to lose any time (I want to check some values that are constantly updating).
Since I'm coding in Java, I thought I could use Gson to parse that JSON, but I don't know if it's the most efficient way.
Also, is Java a good language for reading APIs efficiently?
If it's about performance, there are several things to look at and optimize:
query speed for the data (e.g. the HTTP or network client used)
speed of parsing the result
speed of the output of the data
....
If you want to improve performance, I would suggest you first analyze where in that chain the most time is lost, for example by logging the duration of each part.
Once you've found the bottleneck, you can improve it and measure again to see whether your program became faster.
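A minimal sketch of that kind of measurement; the fetch/parse stage names are placeholders for your actual steps:

```java
// Log how long each stage of the pipeline takes, to find the bottleneck
public class StageTimer {
    public static long timeMillis(Runnable stage) {
        long start = System.nanoTime();
        stage.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long fetchMs = timeMillis(() -> { /* call the web API here */ });
        long parseMs = timeMillis(() -> { /* parse the JSON here */ });
        System.out.println("fetch=" + fetchMs + "ms parse=" + parseMs + "ms");
    }
}
```

Logging the two numbers separately tells you immediately whether the network or the parser deserves your attention.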
For example, you could perhaps use UDP instead of TCP, or HTTP/2 instead of the old HTTP protocol, and so on.
If it's really the parsing part that takes the critical time, you could exploit the fact that the data is always in the same structure. For example, you could look for "keywords" in your JSON format and extract the text right before or after those keywords. Then your program doesn't have to parse (or "understand") the whole structure and can (possibly) operate faster.
Or you can extract the facts you're looking for by position (for example, if the info always comes after the sixth opening curly brace).
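A sketch of that keyword trick, assuming the field appears exactly once and its value is a quoted string without escaped quotes (both assumptions must hold for your actual feed; the field names here are invented):

```java
// Pull one quoted value out of a JSON string without a full parser.
// Fragile by design: it assumes the key occurs once and the value
// contains no escaped quotes.
public class QuickExtract {
    public static String extract(String json, String key) {
        String marker = "\"" + key + "\":\"";
        int start = json.indexOf(marker);
        if (start < 0) return null;
        start += marker.length();
        int end = json.indexOf('"', start);
        return end < 0 ? null : json.substring(start, end);
    }

    public static void main(String[] args) {
        String json = "{\"symbol\":\"ABC\",\"price\":\"42.5\"}";
        System.out.println(extract(json, "price")); // prints 42.5
    }
}
```

Benchmark this against Gson before committing to it; for small payloads a real parser is often fast enough that the fragility isn't worth it.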
But you should only optimize if there's a real performance gain (see the first part of the answer), because it's quite likely your code becomes less readable when you optimize it for the sake of performance. That's often the trade-off one has to choose.
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
I have around ~100 inventory items of different types saved in the DB for a property.
Is it better to have code like this:
List<Inventory> type1s = inventoryRepo.findByPropertyIdAndType(propertyId, Type1);
List<Inventory> type2s = inventoryRepo.findByPropertyIdAndType(propertyId, Type2);
Map<InventoryType, List<Inventory>> typeListMap = new HashMap<>();
typeListMap.put(Type1, type1s);
typeListMap.put(Type2, type2s);
or
List<Inventory> inventories = inventoryRepo.findByPropertyIdAndTypeIn(
        propertyId, Arrays.asList(Type1, Type2));
Map<InventoryType, List<Inventory>> typeListMap = inventories.stream()
        .collect(Collectors.groupingBy(Inventory::getType, Collectors.toList()));
Note: DB is postgresql.
I think the 2nd approach is better, going by the rule of minimizing DB calls. But am I missing some other key aspects to consider?
As is always the case with these questions, the answer is - it depends.
If there is not much data to process, then a single round trip is optimal.
On the other hand, when the amount of data grows, problems arise:
The application memory might not be sufficient, and an OutOfMemoryError might be thrown.
The database can be unresponsive because it has to process a costly query.
In that case batching is usually the most reasonable approach.
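A minimal sketch of that batching idea; the chunk size is a parameter you would tune, not a recommendation:

```java
import java.util.ArrayList;
import java.util.List;

// Split a large list of keys into fixed-size chunks so each DB call
// stays bounded in both application memory and query cost.
public class Batcher {
    public static <T> List<List<T>> chunk(List<T> items, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            chunks.add(items.subList(i, Math.min(i + size, items.size())));
        }
        return chunks;
    }
}
```

Each chunk would then be passed to one repository call (e.g. an `IN` query), so neither the application heap nor the database has to handle the whole set at once.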
It is better to fetch the data from the DB at once and then process it in the application; it's always advisable to put less load on the database if you ever need to scale.
Up to a point, the fastest approach is to gather all data at once. Above that point, it would be advisable to separate the data into chunks of, say, 10K rows at a time.
It's up to you to discover the point at which it would be best to split, and the optimal chunk size.
I can't think of any scenario in which selecting one row at a time would be faster.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
I have to build a search functionality where the GUI provides a search field to search objects in an Oracle database. There are currently 30K objects to search, but they will grow over time, by roughly 300-400 per month.
As part of the requirement, when the user types any text into the search field, for example "ABC", all objects in the DB that contain "ABC" should appear in a datatable, much like the system is predicting results based on what the user has typed.
The question is: how do I architect such a feature?
A simple way to do it is to load everything into a GUI JavaScript object and run the search on it. Since JS is ridiculously fast, performance won't be an issue.
Another way is to run a query against the database every time the user types text into the search field. This does not seem convenient, as it will put unnecessary load on the database.
Is there a better way to architect this feature? Please share thoughts.
Premature optimization is seldom useful.
Growth of 300-400 objects per month on a 30K base is nothing at all for any DB to handle.
Loading all 30K objects at once in the browser is awful and may hurt performance, whereas querying the DB for results will not have this problem until you have lots and lots of users hitting the DB.
You should build the service on top of the database, and then, if/when you reach a bottleneck, you can think about optimization tricks such as caching frequent queries.
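A sketch of the per-keystroke query, with hypothetical table and column names; at 30K rows a plain LIKE scan is fine, and Oracle Text (a CONTAINS index) can cover growth later:

```sql
-- Case-insensitive substring match; fine at this scale
SELECT object_id, object_name
FROM   searchable_objects
WHERE  UPPER(object_name) LIKE '%' || UPPER(:search_text) || '%';
```

Pair this with a client-side debounce (only fire the query after the user pauses typing) and the database load stays modest even with many concurrent users.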
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
So I have this app, a Java servlet. It uses a dictionary object that reads words from a file specified as a constructor parameter on instantiation and then serves queries.
I can do basically the same in PHP, but as I understand it, the class will be instantiated on each and every request, and the file will be read again every time. In fact, I did it, and it works, but it brings down my humble Amazon EC2 micro instance at the ridiculous rate of 11 requests per second or more.
My question is: shouldn't some kind of compiler/filesystem optimization kick in and make the performance impact insignificant when the file does not change at all?
If the answer is no, I guess my design is quite poor and I should try to improve it. In that case, my second question is: What would be the best approach to improve it?
Building a servlet-like service so the code is properly reused?
Using memcached to keep the words file content in memory?
Using a RDBMS instead of a plain text file and have my dictionary querying it?
(despite the dictionary being only a few KB of static data, and despite having to perform some complex queries, such as selecting a (cryptographically safe) random word from those with a length higher than some per-request user setting and such?)
Something else?
Your best bet is to generate a PHP file that contains the final structure of the dictionary as PHP code. You can then include() that cache file in your code, and write a new one whenever the source file changes. Store it on the filesystem; no databases. You could cache it in memory as well, but I don't think that's really needed at this point.
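A minimal sketch of that approach; the file names are hypothetical, and the second half would live in your request handler:

```php
<?php
// build_cache.php - regenerate the cache whenever words.txt changes.
$words = file('words.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
file_put_contents(
    'words.cache.php',
    '<?php return ' . var_export($words, true) . ';'
);

// In the request handler: include() returns the already-parsed array,
// and PHP's opcode cache (OPcache) keeps the compiled file in memory
// across requests, so the text file is never re-read or re-parsed.
$dictionary = include 'words.cache.php';
```

This is why the include() trick beats re-reading the text file: the parsing cost is paid once at cache-build time, and the opcode cache amortizes even the include across requests.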