I have a bunch of XML files here which I would like to store in a Cassandra database. Is there any possibility to manage that, or do I have to parse and transform the XML files?
You can certainly store them as a blob or text, but you will not be able to query the individual fields within the XML files. One other thing to be cautious of is payload size and partition size. Cassandra in general isn't really designed as an object store, so depending on payload size and desired query functionality, you may either have to parse/chunk the files out or look for an alternative solution.
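If storing each file whole as text is enough, here is a minimal sketch with the DataStax Java driver (the contact point, keyspace, and table are assumptions for illustration):

import java.net.InetSocketAddress;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

// Assumes a table created as:
//   CREATE TABLE docs.xml_files (id text PRIMARY KEY, content text);
try (CqlSession session = CqlSession.builder()
        .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
        .withLocalDatacenter("datacenter1")
        .build()) {
    PreparedStatement insert =
        session.prepare("INSERT INTO docs.xml_files (id, content) VALUES (?, ?)");
    String xml = "<order><id>42</id></order>"; // your XML document as a string
    session.execute(insert.bind("order-42", xml));

    // Retrieval is by primary key only; the XML itself stays opaque to Cassandra.
    String stored = session.execute(
            "SELECT content FROM docs.xml_files WHERE id = 'order-42'")
        .one().getString("content");
}

Note that the XML content is just a text value here: any lookup other than by id means reading the document back and parsing it in your application.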
I am trying to get values as key/value pairs. I am currently doing it from a JSON file, and now another approach has been suggested: do it from DB tables, because if a value changes in the future, only the DB needs to be updated.
I think using a JSON file is better, since the values will hardly ever change (rarest of rare cases), although the advantage of the DB approach is that you just change the DB value and you're done.
So my point is that JSON will be faster than the DB, and using JSON will reduce the load on the DB, since clicking through the UI would otherwise invoke an extra DB call.
What do you think? Please let me know.
This very much depends on how you are going to use these data.
Do you need to update it often?
Do you need to update by just one specific field?
Do you need to fetch records based on some specific field?
Do you need to fetch whole json or just some specific fields?
Do some parts of json reference any other tables?
Also, consider the size of that data: if the JSON files together grow larger than all the other tables combined, you may break the DB cache. On the other hand, you can always create a separate database for your JSON files if you still need some relational database features.
So I would start by answering the first five questions anyway.
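If you do keep the values in a JSON file, a minimal sketch of loading it once at startup and caching it in memory (using Gson; the config.json path and the String-to-String shape are just assumptions) could look like this:

import java.io.FileReader;
import java.io.Reader;
import java.util.Map;
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;

// Load the key/value pairs once at startup and keep them in memory,
// so the UI never has to hit the DB (or re-read the file) on every click.
Map<String, String> config;
try (Reader reader = new FileReader("config.json")) {   // hypothetical path
    config = new Gson().fromJson(reader,
        new TypeToken<Map<String, String>>() {}.getType());
}
String value = config.get("someKey");

With values that rarely change, this in-memory cache is what makes the file approach fast; the DB approach only wins if you need the values to be editable without redeploying the file.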
I need to permanently store a big vocabulary and associate some information with each word (and use it to search words efficiently).
Is it better to store it in a DB (in a simple table and let the DBMS do the work of structuring the data based on the key), or is it better to create a trie data structure and then serialize it to a file and deserialize it once the program starts (or maybe use an XML file instead of serialization)?
Edit: the vocabulary would be on the order of 5 thousand to 10 thousand words, and for each word the metadata is structured as an array of 10 integers. Access to the words is very frequent (this is why I thought of a trie data structure, whose search time is roughly O(1) in the number of words, instead of a DB that uses a B-tree or something similar where the search is ~O(log n)).
P.S. using Java.
Thanks!
Using a DB is better.
Many companies have moved to a DB; for example, the ERP Divalto used to rely on serialization and moved to a DB to get better performance.
You have many choices of DBMS. If you want to keep all the data in one file, the simple way is to use SQLite. Its advantage is that it does not need a DBMS server running.
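As a rough sketch of the SQLite route from Java (assuming the xerial sqlite-jdbc driver is on the classpath; the table layout is just one way to do it):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// One file on disk, no server process required.
try (Connection conn = DriverManager.getConnection("jdbc:sqlite:vocabulary.db")) {
    conn.createStatement().execute(
        "CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, metadata TEXT)");

    // Store the 10 integers of metadata, e.g. as a comma-separated string.
    PreparedStatement insert = conn.prepareStatement(
        "INSERT OR REPLACE INTO words (word, metadata) VALUES (?, ?)");
    insert.setString(1, "example");
    insert.setString(2, "1,2,3,4,5,6,7,8,9,10");
    insert.executeUpdate();

    // Lookup by word uses the primary-key index.
    PreparedStatement query = conn.prepareStatement(
        "SELECT metadata FROM words WHERE word = ?");
    query.setString(1, "example");
    try (ResultSet rs = query.executeQuery()) {
        if (rs.next()) {
            String metadata = rs.getString("metadata");
        }
    }
}

For 5,000 to 10,000 words this will be fast either way; the DB mainly buys you persistence and easy updates without reserializing a whole in-memory structure.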
Performance wise, is it smart to do this?
An example would be with Gson
import com.google.gson.Gson;

Gson gson = new Gson();
String json = gson.toJson(myObject);
// store json string in sql with primary key
// retrieve json string in sql with primary key
I want to simplify the way I store and retrieve objects, instead of breaking them apart into individual columns each time I store/retrieve from a database.
But my concern with using a JSON string is that its length may impact performance as the database fills up? I'm not sure; this is why I'm asking.
There is no issue with the 'space' used or the performance of such: MySQL will deal with that just fine.
That is, while the entire JSON chunk must be pulled/pushed for every read and change, MySQL will continue to handle it as well as it ever did, even as the database fills up.
However, there are problems with normalization and opaqueness of information when following this design. Databases are about information, but a blob of JSON is just .. a blob of JSON to an SQL database.
Because of this, none of the data in the JSON can be used for relationships or queries, nor can it participate in indices or constraints. So much for the "relational" part of the database...
Unless the JSON truly is opaque data, like the contents of a binary file, consider working on relevant normalization, or switch to a (JSON) document-oriented database (e.g. Raven, Redis, Couch, Mongo).
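If you do go the JSON-string route anyway, the placeholder comments in the question boil down to something like this minimal sketch (assuming a MySQL table my_objects with an id and a TEXT payload column, and a MyObject class; all of those names are made up for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import com.google.gson.Gson;

Gson gson = new Gson();
try (Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/mydb", "user", "password")) {

    // store json string in sql with primary key
    PreparedStatement insert = conn.prepareStatement(
        "INSERT INTO my_objects (id, payload) VALUES (?, ?)");
    insert.setLong(1, 42L);
    insert.setString(2, gson.toJson(myObject));
    insert.executeUpdate();

    // retrieve json string in sql with primary key
    PreparedStatement select = conn.prepareStatement(
        "SELECT payload FROM my_objects WHERE id = ?");
    select.setLong(1, 42L);
    try (ResultSet rs = select.executeQuery()) {
        if (rs.next()) {
            MyObject restored = gson.fromJson(rs.getString("payload"), MyObject.class);
        }
    }
}

Note that the only useful access path here is the primary key; everything inside payload is invisible to the database.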
There are no space or performance issues with storing JSON strings. MySQL is capable of handling large blobs if you need it to.
The decision whether or not to store your data serialized as JSON should be based on how you need to process this data. Relational databases such as MySQL assume that you normalize the data and establish relationships between records. That said, in many cases it can be practical to do otherwise (i.e. store as JSON).
If you store your data as JSON strings, you will not be able to effectively process this data using MySQL features; e.g. you cannot filter or sort records by values stored inside the JSON. However, if you only need to store this data, and the processing is going to be done by the application code, it can be reasonable to use JSON.
As document-oriented databases like MongoDB become more popular, some of the traditional relational databases, such as PostgreSQL and MariaDB, recently also implemented native JSON support.
I have an HBase table with a couple of million records. Each record has a couple of properties describing it, each stored in a column qualifier (mostly int or string values).
I have a requirement that I should be able to see the records paginated and sorted by a column qualifier (or even more than one, in the future). What would be the best approach to do this? I have looked into secondary indexes using coprocessors (mostly hindex from Huawei), but it doesn't seem to match my use case exactly. I've also thought about replicating all the data into multiple tables, one for each sort property, which would be included in the row key, and then redirecting queries to those tables. But this seems very tedious, as I already have quite a few such properties.
Thanks for any suggestions.
You need your NoSQL database to work just like an RDBMS, and given the size of your data, your life would be a lot simpler if you stuck to an RDBMS, unless you expect exponential growth :) Also, you don't mention whether your data gets updated; this is very important for making a good decision.
Having said that, you have a lot of options; here are some:
If you can wait for the results: Write a MapReduce task to do the scan, sort it, and retrieve the top X rows. Do you really need more than 1000 pages (20-50k rows) for each sort type? Another option would be to use something like Hive.
If you can aggregate the data and "reduce" the dataset: Write a MapReduce task to periodically export the newest aggregated data to a SQL table (which will handle the queries). I've done this a few times and it works like a charm, but it depends on your requirements.
If you have plenty of storage: Write a MapReduce task to periodically regenerate (or append to) a new table for each property (sorting by it in the row key). You don't need multiple tables; just use a prefix in your row keys for each case. Or, if you do not want extra tables and you won't have a lot of queries, simply write the sorted data to CSV files and store them in HDFS, where they can easily be read by your frontend app.
Manually maintain a secondary index: This would not be very tolerant of schema updates and new properties, but it works great for near-real-time results. To do it, you have to update your code to also write to the secondary table, with a good write buffer to help with performance while avoiding hot regions. Think about this type of row key: [4B SORT FIELD ID (4 chars)] [8B SORT FIELD VALUE] [8B timestamp], with just one column storing the row key of the main table (see the sketch after this list). To retrieve the data sorted by any of the fields, just perform a SCAN using the SORT FIELD ID as the start row, plus the starting sort field value as the pivot for pagination (ignore it to get the first page, then set it to the last value retrieved); that way you'll have the row keys of the main table, and you can perform a multi-get against it to retrieve the full data. Keep in mind that you'll need a small script to scan the main table and write the data to the index table for the existing rows.
Rely on one of the automatic secondary indexing solutions built on coprocessors, like the one you mentioned, although I do not like this option at all.
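For the manual secondary index option, a minimal sketch of writing one index entry with the HBase Java client could look like this (the index table name, column family, and 4-char sort field id are made-up examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table indexTable = connection.getTable(TableName.valueOf("records_idx"))) {

    long sortValue = 12345L;                          // value of the property you sort by
    byte[] mainRowKey = Bytes.toBytes("record-001");  // row key of the row in the main table

    // [4B sort field id][8B sort field value][8B timestamp]
    byte[] indexRowKey = Bytes.add(
        Bytes.toBytes("PRC0"),                        // 4-char sort field id
        Bytes.toBytes(sortValue),
        Bytes.toBytes(System.currentTimeMillis()));

    // The index row only stores a pointer back to the main table.
    Put put = new Put(indexRowKey);
    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("main"), mainRowKey);
    indexTable.put(put);
}

Reading a page is then a SCAN over this table starting at the sort field id (plus the last seen sort value for subsequent pages), followed by a multi-get against the main table with the pointers you got back.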
You have mostly enumerated the options. HBase natively does not support secondary indexes, as you are aware. In addition to hindex, you may consider Phoenix
https://github.com/forcedotcom/phoenix
(from Salesforce), which in addition to secondary indexes has a JDBC driver and SQL support.
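For illustration, a rough sketch of how Phoenix might be used from Java over JDBC (the ZooKeeper quorum, table, and column names are assumptions, and the exact SQL features available depend on the Phoenix version):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zookeeper-host:2181")) {
    // A secondary index on the property you want to sort/paginate by.
    conn.createStatement().execute(
        "CREATE INDEX records_by_price ON records (price)");

    // Keyset-style pagination expressed as plain SQL: fetch the next 50 rows
    // after the last price value seen on the previous page.
    PreparedStatement ps = conn.prepareStatement(
        "SELECT * FROM records WHERE price > ? ORDER BY price LIMIT 50");
    ps.setLong(1, 12345L);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // map the row to your record object
        }
    }
}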
I am trying to create an application in Java which pulls records out of a database and maps them to objects. It does that without knowing what the schema of the database looks like. All I want to do is fetch all rows from all tables and store them somewhere. There could be a thousand tables with thousands of records each. The application doesn't know the name of any table or attribute; it should map "on the fly". I looked at Hibernate, but it doesn't give me what I want for this app. I don't want to create hard-coded XML files and classes for the mapping. Any ideas how I can accomplish this?
Thanks
Oracle has a bunch of data dictionary views for metadata.
ALL_TABLES and ALL_TAB_COLUMNS would be the first places to start. Then you'd build ad-hoc queries based on what you get out of there. Not sure whether you have to deal with all data types (dates, blobs, spatial, user-defined...).
Not sure what you mean by "store them somewhere". If you start thinking CSV or XML files, you'll need to escape various characters from VARCHAR2 columns.
If you are looking for some generic extract/unload routines, you should look at what is already available in the database or open-source/commercially.
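A rough sketch of that data-dictionary approach in Java (assuming an open Oracle JDBC Connection named conn; the schema name is a placeholder):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// 1. Discover the tables from the data dictionary.
List<String> tables = new ArrayList<>();
try (PreparedStatement ps = conn.prepareStatement(
        "SELECT table_name FROM all_tables WHERE owner = ?")) {
    ps.setString(1, "MY_SCHEMA");            // placeholder schema
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            tables.add(rs.getString("TABLE_NAME"));
        }
    }
}

// 2. Build ad-hoc SELECTs and keep each row as a column-name -> value map.
for (String table : tables) {
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT * FROM MY_SCHEMA." + table)) {
        ResultSetMetaData meta = rs.getMetaData();
        List<Map<String, Object>> rows = new ArrayList<>();
        while (rs.next()) {
            Map<String, Object> row = new LinkedHashMap<>();
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                row.put(meta.getColumnName(i), rs.getObject(i));
            }
            rows.add(row);
        }
        // "store them somewhere": rows now holds every record of this table.
    }
}

With thousands of tables you would likely stream and write out each table's rows as you go rather than hold everything in memory, but the discovery-then-query pattern stays the same.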
MyBatis provides a pretty simple way to map data results to objects and back, maybe check that out?
http://code.google.com/p/mybatis/
Not to be flip, but for this task you might want to check out Ruby on Rails and its ActiveRecord approach.