SELECT DISTINCT vs Java Set/List - java

I need to get the distinct values from a column (which is not indexed) and the table contains billions of rows.
When I use DISTINCT in the SELECT query, the query times out, because the timeout is set to 3 minutes.
Would it be a good approach to fetch all the data from the table and then use a Set to get the unique values?
Please suggest the best approach here.
Thanks in advance !! :)

It is not a good idea to fetch all rows (especially if there are billions of them): it is not memory efficient, and you are likely to get an OutOfMemoryError.
The best option, of course, is to restructure the unstructured data.
For large data sets, a common practice is to use a paging (pagination) mechanism, which lets you fetch the data in small chunks so that no single query hits the timeout.
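A minimal sketch of the chunked approach in Java. It assumes the table has an indexed primary key (called id here, which is hypothetical) so that each page can be fetched with a query along the lines of SELECT id, col FROM t WHERE id > ? ORDER BY id LIMIT ?, and that the number of distinct values is small enough to fit in memory. The PageSource interface and the in-memory stand-in are placeholders for a real JDBC call, not part of any library.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ChunkedDistinct {

    /** One fetched row: the primary key plus the column of interest. */
    static final class Row {
        final long id;
        final String value;

        Row(long id, String value) {
            this.id = id;
            this.value = value;
        }
    }

    /** Hypothetical stand-in for a paged query ordered by id. */
    interface PageSource {
        List<Row> fetchAfter(long lastId, int pageSize);
    }

    /** Walks the table page by page and keeps only the distinct values. */
    static Set<String> distinctValues(PageSource source, int pageSize) {
        Set<String> distinct = new LinkedHashSet<>();
        long lastId = Long.MIN_VALUE;
        List<Row> page = source.fetchAfter(lastId, pageSize);
        while (!page.isEmpty()) {
            for (Row row : page) {
                distinct.add(row.value);
            }
            // Resume after the last key seen, so each query stays small.
            lastId = page.get(page.size() - 1).id;
            page = source.fetchAfter(lastId, pageSize);
        }
        return distinct;
    }

    /** In-memory PageSource used only to exercise the loop; rows must be sorted by id. */
    static PageSource inMemory(List<Row> sortedRows) {
        return (lastId, pageSize) -> {
            List<Row> page = new ArrayList<>();
            for (Row row : sortedRows) {
                if (row.id > lastId && page.size() < pageSize) {
                    page.add(row);
                }
            }
            return page;
        };
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row(1, "red"), new Row(2, "blue"), new Row(3, "red"),
                new Row(4, "green"), new Row(5, "blue"));
        System.out.println(distinctValues(inMemory(rows), 2));
    }
}
```

Paging this way keeps every individual query short, but the total work still reads every row once, so it trades one long query for many short ones.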

Related

~4K rows insertion to DB per user - design & performance

I'm writing an application that allows each user to label English words in three categories (some lexical exercise).
The main DB table, Word, contains ~4K different rows of words.
The Label table contains 3 labels.
The Word-Label table (which contains 3 columns: word_id, label_id, user_id) will gain 4K rows per user (let's assume every word starts with some predefined label when the user registers with the system).
The problem is that the table will grow very fast. A 1:4000 user-to-row ratio is bad in my opinion.
What can you suggest here to avoid such a huge table? I've read that table-per-user is also considered bad practice.
In addition, I'm using Spring & Hibernate, and the 4K insertions after a user registers for the first time are pretty heavy and take time.
I can consider a NoSQL solution or a tool other than Hibernate, but I insist on using Spring & Java, so please suggest something appropriate.
I will be glad for your help here!
There is no issue with data size. You may have an issue with Hibernate, but that is another issue.
If you end up with thousands of users, you'll have a few tens of millions of rows. That is not a large number of rows. If you want to insert default labels for a new user, then the code would look something like this:
insert into userLabels (userId, wordId, label)
select :userId, w.wordId, <default label>
from words w;
I would be surprised if this took more than a second or two.
If you knew that you would be having millions of users, then size might be more of an issue. The best solution would require better understanding of the application. The solution might vary from partitioning the tables, using arrays, or coming up with a different structure for representing your data.
You probably want various indexes on your tables to speed performance, but that depends on the queries you want to run. You might consider using a native interface to the database. Your use-case doesn't seem particularly complicated, so I don't know what advantage Hibernate or similar layers gets you.
First approach: only add a row to the Word-Label table when the user actually labels a word, so most users will probably never have 4K rows in that table. If the queries around that functionality later become a bottleneck, fix the issue and improve performance then.
There are many performance tricks you can use in SQL databases. For example, you wrote about table-per-user, which is not the best solution; in MySQL you can instead create table partitions, so the data is handled as one table but with improved performance.
Second approach: for this type of data, a NoSQL database such as MongoDB would of course perform well.
You could encode the user's response map into a 4000-entry bit array, or a string, if you don't need the relational capabilities of the database.
Then it would be one record per user.
create table user_words (userid int, worddata text);
insert into user_words values (1,'YNYYNmmmYY'/* ... */ );
Your application would need to hold the list of words and know which word each character refers to.
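A minimal Java sketch of the application side of this encoding. It assumes each word has a fixed position in an application-side word list and each label is a single character; the class and method names are hypothetical.

```java
final class UserWordLabels {

    // One character per word, indexed by the word's position in the word list.
    private final StringBuilder labels;

    /** Starts every word with the same default label, for a newly registered user. */
    UserWordLabels(int wordCount, char defaultLabel) {
        labels = new StringBuilder(wordCount);
        for (int i = 0; i < wordCount; i++) {
            labels.append(defaultLabel);
        }
    }

    /** Restores the labels from the value stored in the text column. */
    UserWordLabels(String stored) {
        labels = new StringBuilder(stored);
    }

    char labelOf(int wordIndex) {
        return labels.charAt(wordIndex);
    }

    void setLabel(int wordIndex, char label) {
        labels.setCharAt(wordIndex, label);
    }

    /** The string to write back to the single row for this user. */
    String toColumnValue() {
        return labels.toString();
    }

    public static void main(String[] args) {
        UserWordLabels user = new UserWordLabels(5, 'N');
        user.setLabel(2, 'Y');
        System.out.println(user.toColumnValue()); // prints NNYNN
    }
}
```

The trade-off is the one stated above: the database can no longer query individual labels, and the word order must never change once data has been stored.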

efficient way for storing query results?

I have a requirement like running 'n' numbers of select queries at fixed time intervals and storing that data. These results need to be pulled later upon a client's demand.
My question is:
1) Is it okay to store it as csv files? Or could you suggest another format?
2) Or, should it be stored as clob variable in a db?
Please suggest any compression techniques to store these query results; also, is it possible to store only revisions of previous resultsets instead of storing the whole resultset?
note:
The minimum time interval is hourly.
The number of queries (n) will be varying (currently 10 to 200 queries.)
The resultset size of each query is also varying (say 10 to 1,000,000 but mostly around 10k.)
The resultset data fetched between each time intervals doesn't differ much. (The row value will not be updated frequently.)
I am new to computer science and programming and also not very aware about storage or db designs.
It sounds like you should be building a data warehouse.
Performance-wise, I suppose it would be better to have a table whose purpose is to store the query results.
I think you need to store the data in a database. An SQL database would serve you best.
As for storing the data at fixed time intervals, you only need to record the changes in the data set instead of storing the whole data set again and again. I don't know what your requirements are or how much infrastructure you can afford. If you have such huge queries, I recommend working with a distributed system and using a NoSQL database for better performance.
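Storing only the changes between two runs of the same query could look roughly like the Java sketch below. It assumes each row is represented as one string (for example a CSV line), that rows are unique within a result set, and that row order does not matter; the class name is hypothetical.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ResultSetDelta {

    final Set<String> added;   // rows present now but not in the previous run
    final Set<String> removed; // rows present in the previous run but not now

    private ResultSetDelta(Set<String> added, Set<String> removed) {
        this.added = added;
        this.removed = removed;
    }

    /** Computes what changed between two runs of the same query. */
    static ResultSetDelta between(Collection<String> previous, Collection<String> current) {
        Set<String> added = new LinkedHashSet<>(current);
        added.removeAll(previous);
        Set<String> removed = new LinkedHashSet<>(previous);
        removed.removeAll(current);
        return new ResultSetDelta(added, removed);
    }

    /** Rebuilds the current result set from the previous one plus this delta. */
    Set<String> applyTo(Collection<String> previous) {
        Set<String> result = new LinkedHashSet<>(previous);
        result.removeAll(removed);
        result.addAll(added);
        return result;
    }

    public static void main(String[] args) {
        List<String> hour1 = List.of("1,alice", "2,bob", "3,carol");
        List<String> hour2 = List.of("2,bob", "3,carol", "4,dave");
        ResultSetDelta delta = between(hour1, hour2);
        System.out.println("added=" + delta.added + " removed=" + delta.removed);
    }
}
```

Only the first full result set and the subsequent deltas would need to be stored; an updated row shows up as one removed row plus one added row.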

Most performant way of querying database with JDBC?

I need to get data from several tables, so I used a query with N left outer joins. It seems to me that it may be a waste of performance, since I get the cartesian product of lots of data. What is the preferable way to do this in order to achieve better performance? I'm thinking of doing N+1 small queries. Am I on the right track?
I know, this has little to do with JDBC specifics. I want to retrieve data from a single table, and make left outer joins to other N tables. The result set gets very big because I get a cartesian product. For example:
table1data1, table2data1, table3data1
table1data1, table2data2, table3data1
table1data1, table2data1, table3data2
table1data1, table2data2, table3data2
I know that if I make several queries to the database (in my example I would get 1 record for table 1, 2 records for table 2 and 2 records for table 3) I'll make a lot of round trips to the database. But I've tested it this way and it looks a lot faster.
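The row counts in the example above can be worked out with a small Java sketch. It assumes one parent row and child tables that are each joined only to the parent, as in the example; the class and method names are hypothetical.

```java
final class JoinRowCount {

    /** Rows returned when one parent row is LEFT JOINed to several independent child tables. */
    static long joinedRows(int... childCounts) {
        long rows = 1;
        for (int count : childCounts) {
            // A LEFT JOIN keeps the parent row even when a child table has no match.
            rows *= Math.max(1, count);
        }
        return rows;
    }

    /** Rows returned in total by one query for the parent plus one query per child table. */
    static long separateQueryRows(int... childCounts) {
        long rows = 1;
        for (int count : childCounts) {
            rows += count;
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(joinedRows(2, 2));               // prints 4
        System.out.println(separateQueryRows(2, 2));        // prints 5
        System.out.println(joinedRows(10, 10, 10));         // prints 1000
        System.out.println(separateQueryRows(10, 10, 10));  // prints 31
    }
}
```

The joined result grows with the product of the child counts while the separate queries grow with their sum, which is why the single query can end up transferring far more data despite needing fewer round trips.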
This really isn't JDBC specific. Generally speaking, depending on the amount of data being returned, you'll get better performance retrieving everything in a single result set. N+1 queries tend to make for a lot of round trips to the database. Does the result set contain fields you don't need? Can you trim the columns being returned? That would be a first step, if possible.
I think your current approach of getting a lot of data in one trip to the database is the right approach. However, if you find yourself executing the same query many times with different parameters, it is more performant to write it as a stored procedure using bind variables. But I would definitely shy away from breaking your JOIN into several smaller queries.

Large SQL dataset query using java

I have the following configuration:
SQL Server 2008
Java as backend technology - Spring + Hibernate
Basically what I want to do is a select with a where clause on a table. The problem is the table has about 700M entries and the query takes a really long time.
Can you please give me some pointers on where to optimize the query, or what sort of techniques I can use to improve performance?
Thanks.
Using indexes is the standard technique used to deal with this problem. As requested, here are some pointers that should get you started:
http://odetocode.com/articles/70.aspx
http://www.simple-talk.com/sql/learn-sql-server/sql-server-index-basics/
http://www.petri.co.il/introduction-to-sql-server-indexes.htm
The first thing I do in this case is isolate whether it is the amount of data I am returning that is the problem or not (an i/o issue). A simple non-scientific way to do this is change your query to just return the count:
select count(*) --just return a count, no data!
from MyTable
inner join MyOtherTable on ...
where ...
If this runs very quickly, it tells you your indexes are in order (assuming no sub-selects in your WHERE clause). If not, then you need to work on indexes, the WHERE clause, or your query construction itself (JOINs being done, etc).
Once that is satisfactory, add back in your SELECT clause. If it is slow, you are going to have to look at your data access pattern:
Can you return fewer columns?
Can you return fewer rows at once?
Is there caching you can do in the application layer?
Is this query a candidate for partitioned/materialized views (if your database supports those)?
I would run Profiler to find the exact query that is being generated. ORMs can create less than optimal queries. Once you know the query, you can run it in SSMS and see the execution plan. This will give you clues as to where you have performance problems.
Several things that can cause performance problems:
Lack of correct indexing (foreign keys should be indexed if you have joins, as well as the criteria in the where clause)
Lack of sargability in the where clause, forcing the query to not use existing indexes
Returning more columns than are needed
Correlated subqueries and scalar functions that cause row-by-agonizing-row operations
Returning too much data (will anybody really be looking at 1 million records returned? You only want to return the amount you show on a page, not the whole possible recordset)
Locking and blocking
There's more (after all, whole very long books are written on this subject), but that should be enough to get you started on where to look.
You should create indexes on the columns you often use to restrict the result. Another thing to consider is pagination of the result set.
Regardless of the specific DB, I would do the following:
Run an EXPLAIN ANALYZE.
Make sure you have an index on the columns that are part of your WHERE clause.
If the indexes are OK, it is very likely that you are fetching a lot of records from disk, which is very slow. If you really cannot refine your query so that it fetches fewer records, consider clustering your table to improve the disk locality of your records.

Java data structure to use with Hibernate to store unknown number of parameters?

Following problem: I want to render a news stream of short messages based on localized texts. In various places of these messages I have to insert parameters to "customize" them. I guess you know what I mean ;)
My question probably falls into the "Which is the best style to do it?" category: How would you store these parameters (they may be Strings and Numbers that need to be formatted according to Locale) in the database? I'm using Hibernate to do the ORM and I can think of the following solutions:
build a combined String and save it as such (ugly and hard to maintain I think)
do some kind of fancy normalization and make every parameter a single row in the database (clean I guess, but a performance nightmare)
put the params into an Array, Map or other Java data structure and save it in binary format (probably causes a lot of overhead size-wise)
I tend towards option #3, but I'm afraid that it might be too costly in terms of size in the database. What do you think?
If you can afford the performance hit of using the normalized approach of having a separate table I would go with this approach. We use the same approach as your first suggestion at work, and it gets messy, especially when you reach the column limit and key/values start getting truncated!
Do the normalization.
I would suggest something like:
Table Message
id
Table Params
message_id
key
value
Storing serialized Java objects in the database is quite a bad thing in most cases. As they are hard to maintain and you cannot access them with 'simple' SQL tools.
The performance impact is not as big, as you can fetch all together in a single select using a join.
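Once the rows of the Params table have been loaded, rendering a localized message could look roughly like the Java sketch below. It uses java.text.MessageFormat from the standard library and assumes that the key column holds the zero-based placeholder position and that the keys run from 0 to n-1 without gaps; the class name and the sample pattern are hypothetical.

```java
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class NewsMessageRenderer {

    /**
     * Fills the placeholders of a localized pattern with the stored parameters.
     * The map key is the placeholder position, the map value is the parameter.
     */
    static String render(String pattern, Locale locale, Map<Integer, Object> params) {
        Object[] args = new Object[params.size()];
        params.forEach((position, value) -> args[position] = value);
        // MessageFormat applies locale-specific formatting to numbers and dates.
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        Map<Integer, Object> params = new HashMap<>();
        params.put(0, "Ann");
        params.put(1, 1234);
        System.out.println(render("{0} scored {1,number,integer} points", Locale.US, params));
        // prints: Ann scored 1,234 points
    }
}
```

Numbers should be passed as Number instances rather than strings, otherwise they are inserted verbatim instead of being formatted for the locale.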
It depends a bit. Is the number of parameters for each entity huge? If it is not, the second option is probably the best.
If you don't want the extra queries caused by lazy loading, you can always change the fetch type for the variable number of parameters; that would only add one join to a query you were always doing. Under normal conditions that is not a big price to pay.
Also, the third and the first options rule out any kind of query over the parameters forever. That is a huge technical debt for the future that I would not be willing to take on.
Put it directly into a string and save it.
