I have 50,000 entries in a MySQL database that need to be fetched and iterated over in Java. Is pagination required?
The table could contain 20 columns, each varchar(100).
If that is not feasible, is pagination required if I am fetching only one column from each row for these 50,000 entries?
This very much depends on your use case. If you want to display the data to an end user, you will very likely need pagination, as displaying that much data in one go may make your UI sluggish (apart from not being very user-friendly). If you just need to do calculations in a background task, you can probably live without pagination.
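If you go the background-task route, the bigger concern is memory rather than pagination. Below is a minimal sketch of streaming the result set one row at a time (assuming MySQL Connector/J, which streams when the fetch size is Integer.MIN_VALUE on a forward-only, read-only statement; the table and column names are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamAllRows {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password");
             Statement stmt = conn.createStatement(
                 ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Ask the driver to stream rows instead of buffering all 50,000 in memory.
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery("SELECT some_column FROM my_table")) {
                while (rs.next()) {
                    process(rs.getString(1)); // handle one row at a time
                }
            }
        }
    }

    private static void process(String value) {
        // calculation / aggregation logic goes here
    }
}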
I need to get the distinct values from a column (which is not indexed), and the table contains billions of rows.
When I use DISTINCT in the select query, the query times out, as the timeout is set to 3 minutes.
Would it be a good approach to get all the data from the table and then use a Set to get the unique values?
Please suggest the best approach here.
Thanks in advance !! :)
It is not a good idea to fetch all rows (especially if there are billions of them), because it is not memory efficient and you'll get an OutOfMemoryError.
The best option, of course, is to rearrange the unstructured data.
Also, for big sets of data a common practice is to use a paging (pagination) mechanism, which allows you to take the data in small chunks, so you'll bypass the timeout issue.
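A minimal sketch of that chunked approach (assuming MySQL-style LIMIT, an indexed numeric primary key named id, and otherwise hypothetical table/column names): walk the table by primary key and collect the values in a HashSet, so that no single query runs anywhere near the 3-minute timeout.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashSet;
import java.util.Set;

public class DistinctInChunks {
    public static Set<String> distinctValues(Connection conn) throws SQLException {
        Set<String> distinct = new HashSet<>();
        long lastId = 0;
        final int chunkSize = 100_000;
        // Keyset pagination: each query only scans one chunk, using the primary key index.
        String sql = "SELECT id, some_column FROM big_table WHERE id > ? ORDER BY id LIMIT ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            while (true) {
                ps.setLong(1, lastId);
                ps.setInt(2, chunkSize);
                int rows = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        lastId = rs.getLong("id");
                        distinct.add(rs.getString("some_column"));
                        rows++;
                    }
                }
                if (rows < chunkSize) {
                    break; // last chunk reached
                }
            }
        }
        return distinct;
    }
}

Note that the set still has to hold every distinct value in memory, so this only helps when the number of distinct values (as opposed to rows) is manageable.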
I have a very large table in the database; the table has a column called
"unique_code_string", and it has almost 100,000,000 records.
Every 2 minutes, I will receive 100,000 code strings; they are in an array and they are unique to each other. I need to insert them into the large table if they are all "good".
The meaning of "good" is this:
All 100,000 codes in the array never occur in the database large table.
If one or more codes occur in the database large table, the whole array will not be used at all;
it means no codes in the array will be inserted into the large table.
Currently, I do it this way:
First, I do a loop and check each code in the array to see if the same code is already in the database large table.
Second, if all codes are "new", then I do the real insert.
But this way is very slow; I must finish everything within 2 minutes.
I am thinking of other ways:
Join the 100,000 codes into a SQL "IN" clause; each code is 32 characters long, and I think no database will accept an IN clause that is 32*100,000 characters long.
Use a database transaction: I force-insert the codes anyway, and if an error happens, the transaction rolls back. This causes some performance issues.
Use a database temporary table: I am not good at writing SQL queries, so please give me an example if this idea could work.
Now, can any experts give me some advice or some solutions?
I am a non-English speaker; I hope you can see the issue I am facing.
Thank you very much.
Load the 100,000 rows into a table!
Create a unique index on the original table:
create unique index unq_bigtable_uniquecodestring on bigtable (unique_code_string);
Now, you have the tools you need. I think I would go for a transaction, something like this:
insert into bigtable ( . . . )
select . . .
from smalltable;
If any row fails (due to the unique index), then the transaction will fail and nothing is inserted. You can also be explicit:
insert into bigtable ( . . . )
    select . . .
    from smalltable
    where not exists (select 1
                      from smalltable st join
                           bigtable bt
                           on st.unique_code_string = bt.unique_code_string
                     );
For this version, you should also have an index/unique constraint on smalltable(unique_code_string).
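If you drive this from Java, a minimal sketch of the all-or-nothing transaction (assuming the unique index above, a JDBC connection, and otherwise hypothetical column lists) could look like this:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class AllOrNothingInsert {
    // Inserts every code, or nothing at all if any code already exists in bigtable.
    public static boolean insertBatch(Connection conn, List<String> codes) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO bigtable (unique_code_string) VALUES (?)")) {
            for (String code : codes) {
                ps.setString(1, code);
                ps.addBatch();
            }
            ps.executeBatch();   // the unique index rejects any duplicate
            conn.commit();
            return true;
        } catch (SQLException e) {
            conn.rollback();     // one duplicate means nothing is inserted
            return false;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}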
It's hard to find an optimal solution with so little information. Often this depends on the network latency between the application and the database server, and on hardware resources.
You can load the 100,000,000 unique_code_string values from the database and use a HashSet or TreeSet to de-duplicate in memory before inserting into the database. If your database server is resource-constrained or there is considerable network latency, this might be faster.
Depending on how you receive the 100,000-record delta, you could load it into the database; e.g. a CSV file can be read using an external table. If you can get the data efficiently into a temporary table and the database server is not overloaded, you can do this very efficiently with SQL or a stored procedure.
You should spend some time understanding how real-time the update has to be, e.g. how many SQL queries are reading the 100,000,000-row table, and whether you can allow some of those queries to be cancelled or blocked while you update the rows. Often it's a good idea to create a shadow table (sketched below):
Create a new table as a copy of the existing 100,000,000-row table.
Disable the indexes on the new table.
Load the delta rows into the new table.
Rebuild the indexes on the new table.
Delete the existing table.
Rename the new table to the existing 100,000,000-row table's name.
The approach here is database-specific. It will depend on how your database defines the indexes; e.g. if you have a partitioned table, it might not be necessary.
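As a rough sketch of those shadow-table steps driven from Java (assuming MySQL-flavored statements and hypothetical table names; DISABLE KEYS only affects non-unique indexes on some storage engines, and whether the table swap is atomic is database-specific):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class ShadowTableSwap {
    public static void rebuild(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            // 1. Create the shadow table with the same structure and copy the existing rows.
            st.execute("CREATE TABLE bigtable_shadow LIKE bigtable");
            // 2. Disable index maintenance while bulk loading (engine-dependent).
            st.execute("ALTER TABLE bigtable_shadow DISABLE KEYS");
            st.execute("INSERT INTO bigtable_shadow SELECT * FROM bigtable");
            // 3. Load the delta rows (assumed to have been staged in delta_table).
            st.execute("INSERT INTO bigtable_shadow SELECT * FROM delta_table");
            // 4. Rebuild the indexes.
            st.execute("ALTER TABLE bigtable_shadow ENABLE KEYS");
            // 5 & 6. Swap the tables, then drop the old copy.
            st.execute("RENAME TABLE bigtable TO bigtable_old, bigtable_shadow TO bigtable");
            st.execute("DROP TABLE bigtable_old");
        }
    }
}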
I have a project fetching data from a DB2 database, and we have the following scenario over which I need quality inputs. Thanks in advance.
The current application fetches data from a table (let's say SALES) in a DB schema, say ORIGIN_X.
The same table, with a different name, exists in another schema, say ORIGIN_Y.
Both tables have more than 5 million records each, and growing.
Problem Statement
I want to merge the data from both schemas/tables to present a combined view on the UI without compromising performance.
No more than 200 records are shown on the UI, but scanning 5 + 5 = 10 million records degrades performance.
Solutions tried so far
Created a logical view and tried to fetch the data from it, but the query performance is dead slow.
Thinking of an MQT in DB2 (the equivalent of a materialized view, so that an index can be created on a column); still in progress.
Help Needed
Are both of these approaches right for the problem statement? If yes, what should be done to proceed with the MQT?
What is the better approach other than above two?
Thoughts?
I have an HBase table with a couple of million records. Each record has a couple of properties describing it, each stored in a column qualifier (mostly int or string values).
I have a requirement that I should be able to see the records paginated and sorted by a column qualifier (or even more than one, in the future). What would be the best approach to do this? I have looked into secondary indexes using coprocessors (mostly hindex from Huawei), but it doesn't seem to match my use case exactly. I've also thought about replicating all the data into multiple tables, one for each sort property, which would be included in the rowkey, and then redirecting queries to those tables. But this seems very tedious, as I already have quite a few of these properties...
Thanks for any suggestions.
You need your NoSQL database to work just like an RDBMS, and given the size of your data, your life would be a lot simpler if you stuck to one, unless you expect exponential growth :) Also, you don't mention whether your data gets updated; this is very important for making a good decision.
Having said that, you have a lot of options, here are some:
If you can wait for the results: write a MapReduce task to do the scan, sort it and retrieve the top X rows. Do you really need more than 1,000 pages (20-50k rows) for each sort type? Another option would be using something like Hive.
If you can aggregate the data and "reduce" the dataset: write a MapReduce task to periodically export the newest aggregated data to a SQL table (which will handle the queries). I've done this a few times and it works like a charm, but it depends on your requirements.
If you have plenty of storage: write a MapReduce task to periodically regenerate (or append the data to) a new table for each property (sorting by it in the row key). You don't need multiple tables; just use a prefix in your rowkeys for each case, or, if you don't want extra tables and won't have a lot of queries, simply write the sorted data to CSV files and store them in HDFS, where they can easily be read by your frontend app.
Manually maintain a secondary index: this would not be very tolerant of schema updates and new properties, but it works great for near-real-time results. To do it, you have to update your code to also write to the secondary table, with a good buffer to help performance while avoiding hot regions. Think about this type of rowkey: [4B SORT FIELD ID (4 chars)] [8B SORT FIELD VALUE] [8B timestamp], with just one column storing the rowkey of the main table. To retrieve the data sorted by any of the fields, just perform a SCAN using the SORT FIELD ID as the start row, plus the starting sort field value as the pivot for pagination (ignore it to get the first page, then set it to the last value retrieved); that way you'll have the rowkeys of the main table, and you can just perform a multi-get on it to retrieve the full data (see the sketch after this list). Keep in mind that you'll need a small script to scan the main table and write the data to the index table for the existing rows.
Rely on automatic secondary indexing through coprocessors, like you mentioned, although I do not like this option at all.
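A minimal sketch of that manually maintained index table (assuming the HBase 2.x Java client; the index table name "record_index", column family "d", the 4-character field id "pric" and a long-valued sort field are all hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SortIndexSketch {
    private static final byte[] CF = Bytes.toBytes("d");
    private static final byte[] MAIN_ROW = Bytes.toBytes("main_row");

    // Index rowkey layout: [4B sort field id][8B sort field value][8B timestamp]
    static byte[] indexKey(String fieldId, long value, long ts) {
        return Bytes.add(Bytes.toBytes(fieldId), Bytes.toBytes(value), Bytes.toBytes(ts));
    }

    // Write one index entry pointing back to the main table rowkey.
    static void writeIndexEntry(Table indexTable, String fieldId, long value,
                                long ts, byte[] mainRowKey) throws Exception {
        Put put = new Put(indexKey(fieldId, value, ts));
        put.addColumn(CF, MAIN_ROW, mainRowKey);
        indexTable.put(put);
    }

    // Scan one "page": start at the field prefix (first page) or just after the
    // last index rowkey seen (subsequent pages), and stop after pageSize entries.
    static ResultScanner pageScan(Table indexTable, String fieldId,
                                  byte[] startRow, int pageSize) throws Exception {
        Scan scan = new Scan()
                .withStartRow(startRow != null ? startRow : Bytes.toBytes(fieldId))
                .setLimit(pageSize);
        return indexTable.getScanner(scan);
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table index = conn.getTable(TableName.valueOf("record_index"))) {
            writeIndexEntry(index, "pric", 42L, System.currentTimeMillis(),
                            Bytes.toBytes("main-row-001"));
            try (ResultScanner page = pageScan(index, "pric", null, 100)) {
                for (Result r : page) {
                    byte[] mainKey = r.getValue(CF, MAIN_ROW);
                    // multi-get the main table with these keys to fetch the full records
                }
            }
        }
    }
}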
You have mostly enumerated the options. HBase natively does not support secondary indexes, as you are aware. In addition to hindex, you may consider Phoenix
https://github.com/forcedotcom/phoenix
(from Salesforce), which in addition to secondary indexes has a JDBC driver and SQL support.
I would like to display 100,000 records in a browser across multiple pages with minimal impact on memory, i.e. 100 records per page.
I would like to move back and forth between pages. My doubts are:
1) Can I maintain all the records in memory? Is this a good idea?
2) Can I make a database connection/query for every page? If so, how do I write the query?
Could anyone please help me?
It's generally not a good idea to maintain so many records in memory. If the application is accessed by several users at the same time, the memory impact will be huge.
I don't know which DBMS you are using, but in MySQL and several others you can rely on the DB for pagination with a query such as:
SELECT * FROM MyTable
LIMIT 0, 100
The first number after limit is the offset (how many records it will skip) and the second is the number of records it will fetch.
Bear in mind that this SQL does not have the same syntax on every DB (some don't even support it).
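A minimal sketch of issuing that query per page from Java (assuming MySQL's LIMIT syntax; the table and column names are hypothetical, and the ORDER BY is there so pages stay stable between requests):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class PageQuery {
    // Fetches one page of 100 records; pageNumber starts at 0.
    public static List<String> fetchPage(Connection conn, int pageNumber) throws SQLException {
        final int pageSize = 100;
        List<String> page = new ArrayList<>();
        String sql = "SELECT name FROM MyTable ORDER BY id LIMIT ?, ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, pageNumber * pageSize); // offset: rows to skip
            ps.setInt(2, pageSize);              // number of rows to fetch
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    page.add(rs.getString("name"));
                }
            }
        }
        return page;
    }
}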
I would not hold the data in memory (either in the browser or in the serving application). Instead I'd page through the results using SQL.
How you do this can be database-specific. See here for one example in MySql. Mechanisms will exist for other databases.
1) No, having all the records in memory kind of defeats the point of having a database. Look into using a scrollable result set (see the sketch below); that way you can get the functionality you want without having to play with the SQL. You can also adjust how many records are fetched at a time so that you don't load more records than you need.
2) DB connections are expensive to create and destroy, but any serious system will pool the connections, so the impact on performance won't be that great.
If you want to get a bit more fancy you can do away with pages altogether and just load more records as the user scrolls through the list.
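A minimal sketch of the scrollable result set idea (assuming a JDBC driver that supports scrollable cursors; the table and column names are hypothetical):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ScrollablePage {
    // Jumps straight to a page of a scrollable ResultSet; pageNumber starts at 1.
    public static void printPage(Connection conn, int pageNumber, int pageSize) throws SQLException {
        try (Statement stmt = conn.createStatement(
                ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(pageSize); // hint: fetch about one page per round trip
            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM MyTable ORDER BY id")) {
                // absolute() positions the cursor on a 1-based row number
                if (pageNumber > 1 && !rs.absolute((pageNumber - 1) * pageSize)) {
                    return; // requested page is past the end of the result set
                }
                int shown = 0;
                while (shown < pageSize && rs.next()) {
                    System.out.println(rs.getLong("id") + " " + rs.getString("name"));
                    shown++;
                }
            }
        }
    }
}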
It would not be a good idea, as you would be making the browser hold all of that data.
When I had something like this to do, I used JavaScript to render the page and just made AJAX calls to get the next page. There is a slight delay in displaying the next table as you fetch it, but users are used to that.
If you are showing 100 records per page, use JSON to pass the data from the server, as JavaScript can parse it quickly, and then use innerHTML to insert the HTML, as the DOM is much slower at rendering tables.
As mentioned by others here, it is not a good idea to store a large list of results in memory. Querying for results for each page is certainly a much better approach. To do that you have two options. One is to use whatever database-specific features your DBMS provides for targeting a specific subsection of results from a query. The other approach is to use the generic methods provided by JDBC to achieve the same effect. This keeps your code from being tied to a specific database:
// get a ResultSet from some query; for absolute() to work, the Statement must be
// created with a scrollable type such as ResultSet.TYPE_SCROLL_INSENSITIVE
ResultSet results = ...
// count is the page size; beginIndex is the zero-based page number
if (count > 0) {
    results.setFetchSize(count + 1);
    results.setFetchDirection(ResultSet.FETCH_FORWARD);
    if (beginIndex > 0) {
        results.absolute(count * beginIndex); // skip the earlier pages
    }
}
for (int rowNumber = 0; results.next(); ++rowNumber) {
    if (count > 0 && rowNumber >= count) {
        break; // stop after one full page
    }
    // process the ResultSet below
    ...
}
Using a library like Spring JDBC or Hibernate can make this even easier.
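For example, a minimal sketch with Hibernate (assuming Hibernate 5 with javax.persistence annotations, an open Session, and a hypothetical Item entity) expresses the same page extraction without vendor-specific SQL:

import java.util.List;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.Session;

@Entity
class Item {        // hypothetical mapped entity
    @Id
    Long id;
    String name;
}

public class HibernatePaging {
    // Fetches one page of Item entities; pageNumber starts at 0.
    public static List<Item> fetchPage(Session session, int pageNumber, int pageSize) {
        return session.createQuery("from Item order by id", Item.class)
                .setFirstResult(pageNumber * pageSize) // rows to skip
                .setMaxResults(pageSize)               // rows to return
                .list();
    }
}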
In many SQL dialects you have a notion of LIMIT (MySQL, ...) or OFFSET (MSSQL).
You can use this kind of feature to limit the number of rows per page.
Depends on the data. 100k ints might not be too bad if you are caching them.
T-SQL has SET ROWCOUNT 100 to limit the number of records returned.
But to do it right and return the total number of pages, you need a more advanced paging sproc.
It's a pretty hotly debated topic and there are many ways to do it.
Here's a sample of an old sproc I wrote
CREATE PROCEDURE Objects_GetPaged
(
    @sort VARCHAR(255),
    @Page INT,
    @RecsPerPage INT,
    @Total INT OUTPUT
)
AS
SET NOCOUNT ON

--Create a temporary table
CREATE TABLE #TempItems
(
    id INT IDENTITY,
    memberid INT
)

INSERT INTO #TempItems (memberid)
SELECT Objects.id
FROM Objects
ORDER BY CASE @sort WHEN 'Alphabetical' THEN Objects.UserName ELSE NULL END ASC,
         CASE @sort WHEN 'Created' THEN Objects.Created ELSE NULL END DESC,
         CASE @sort WHEN 'LastLogin' THEN Objects.LastLogin ELSE NULL END DESC

SELECT @Total = COUNT(*) FROM #TempItems

-- Find out the first and last record we want
DECLARE @FirstRec INT, @LastRec INT
SELECT @FirstRec = (@Page - 1) * @RecsPerPage
SELECT @LastRec = (@Page * @RecsPerPage + 1)

SELECT *
FROM #TempItems
INNER JOIN Objects ON (Objects.id = #TempItems.memberid)
WHERE #TempItems.id > @FirstRec AND #TempItems.id < @LastRec
ORDER BY #TempItems.id
I would recommend using a CachedRowSet.
A CachedRowSet object is a container for rows of data that caches its rows in memory, which makes it possible to operate without always being connected to its data source.
A CachedRowSet object is a disconnected rowset, which means that it makes use of a connection to its data source only briefly. It connects to its data source while it is reading data to populate itself with rows and again while it is propagating changes back to its underlying data source.
Because a CachedRowSet object stores data in memory, the amount of data that it can contain at any one time is determined by the amount of memory available. To get around this limitation, a CachedRowSet object can retrieve data from a ResultSet object in chunks of data, called pages. To take advantage of this mechanism, an application sets the number of rows to be included in a page using the method setPageSize. In other words, if the page size is set to five, a chunk of five rows of data will be fetched from the data source at one time. An application can also optionally set the maximum number of rows that may be fetched at one time. If the maximum number of rows is set to zero, or no maximum number of rows is set, there is no limit to the number of rows that may be fetched at a time.
After properties have been set, the CachedRowSet object must be populated with data using either the method populate or the method execute. The following lines of code demonstrate using the method populate. Note that this version of the method takes two parameters, a ResultSet handle and the row in the ResultSet object from which to start retrieving rows.
CachedRowSet crs = new CachedRowSetImpl();
crs.setMaxRows(20);
crs.setPageSize(4);
crs.populate(rsHandle, 10);
When this code runs, crs will be populated with four rows from rsHandle starting with the tenth row.
Along similar lines, you could build a strategy to paginate your data on the JSP, and so on.
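A minimal sketch of paging through a query with the standard javax.sql.rowset API (the connection details, table and column names are hypothetical): the rowset holds one page of 100 rows at a time, and nextPage() pulls the next chunk from the underlying ResultSet.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.sql.rowset.CachedRowSet;
import javax.sql.rowset.RowSetProvider;

public class CachedRowSetPaging {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/mydb", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM MyTable ORDER BY id")) {

            CachedRowSet crs = RowSetProvider.newFactory().createCachedRowSet();
            crs.setPageSize(100);   // keep only 100 rows in memory at a time
            crs.populate(rs, 1);    // start populating from the first row of rs

            do {
                while (crs.next()) { // rows of the page currently in memory
                    System.out.println(crs.getInt("id") + " " + crs.getString("name"));
                }
            } while (crs.nextPage()); // fetch the next page, if there is one
        }
    }
}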