Solr paging performance - java

I have read (http://old.nabble.com/using-q%3D--,-adding-fq%3D-to26753938.html#a26805204):
FWIW: limiting the number of rows per
request to 50, but not limiting the
start doesn't make much sense -- the
same amount of work is needed to
handle start=0&rows=5050 and
start=5000&rows=50.
Then he continues:
There are very few use cases for
allowing people to iterate through all
the rows that also require sorting.
Is that right? Is that true just for sorted results?
How many pages of 10 rows each do you recommend to allow the user to iterate?
Does Solr 1.4 suffer the same limitation?

Yes, that's true, and it applies to Solr 1.4 as well. That does not mean that start=0&rows=5050 has the same performance as start=5000&rows=50: the former has to return 5050 documents while the latter returns only 50. Less data to transfer means a faster response.
Solr doesn't have any way to get ALL results in a single page since it doesn't make much sense. As a comparison, you can't fetch the whole Google index in a single query. Nobody really needs to do that.
The page size of your application should be user-definable (e.g. the user might choose to see 10, 25, 50, or 100 results at once).
The default page size depends on what kind of data you're paging and how relevant the results really are. For example, when searching on Google you usually don't look beyond the first few results, so 10 elements are enough. eBay, on the other hand, is more about browsing the results, so it shows 50 results per page by default, and it doesn't even offer 10 results per page.
You also have to take scrolling into account. Users would probably get lost when trying to browse through a 200-result page, not to mention that it takes considerably longer to load.
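For reference, a paged request with SolrJ might look roughly like the sketch below. The HttpSolrClient builder comes from recent SolrJ releases (Solr 1.4 shipped a different client class), and the core URL, query, and page values are placeholders.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class PagedSearch {
    public static void main(String[] args) throws Exception {
        // Core URL is an assumption; adjust to your own Solr instance.
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        int page = 3;        // zero-based page index chosen by the user
        int pageSize = 25;   // user-selectable page size (10, 25, 50, 100, ...)

        SolrQuery query = new SolrQuery("title:foo");
        query.setStart(page * pageSize);  // offset into the full result list
        query.setRows(pageSize);          // only this many documents come back over the wire

        QueryResponse response = solr.query(query);
        SolrDocumentList results = response.getResults();
        System.out.println("total hits: " + results.getNumFound() + ", returned: " + results.size());
        solr.close();
    }
}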

start=0&rows=5050 and start=5000&rows=50
It depends on how you jump to start=5000. If you scroll through all results from 0 to 4999, ignoring them, and then continue from 5000 to 5050, then yes, the same amount of work is done. The best thing to do is to limit the rows fetched from the database itself, using something like ROWNUM in Oracle.
iterate through all the rows that also require sorting
Few, but yes, there are use cases with this requirement. Examples would be CSV/Excel/PDF exports.
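For such an export, the usual pattern is simply to walk the sorted result set page by page. Here is a hedged SolrJ sketch; the query, sort field, output fields, and page size are placeholders, and very deep pages get progressively more expensive, as discussed above.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import java.io.PrintWriter;

public class CsvExport {
    // Iterates over the whole (sorted) result set page by page and writes CSV lines.
    static void export(SolrClient solr, PrintWriter out) throws Exception {
        int pageSize = 1000;
        SolrQuery query = new SolrQuery("*:*");
        query.setSort("id", SolrQuery.ORDER.asc);  // a stable sort keeps the pages consistent
        query.setRows(pageSize);

        long start = 0;
        while (true) {
            query.setStart((int) start);
            SolrDocumentList page = solr.query(query).getResults();
            if (page.isEmpty()) {
                break;  // no more documents
            }
            for (SolrDocument doc : page) {
                out.println(doc.getFieldValue("id") + "," + doc.getFieldValue("title"));
            }
            start += page.size();
            if (start >= page.getNumFound()) {
                break;  // reached the last page
            }
        }
    }
}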

Related

how to do pagination with elasticsearch? from vs scroll API

I'm using elasticsearch as DB to store a large batch of log data.
I know there are 2 ways to do pagination:
Use size and from API
Use scroll API
Now I'm using 'from' to do pagination. I get the page and size parameters from the front end, and at the back end (Java):
searchSourceBuilder.size(size);
searchSourceBuilder.from(page * size);
However, if page * size > 10000, an exception is thrown by ES.
Can I use scroll API to do pagination?
I know that if I use scroll API, the searchResponse object will return me a _scroll_id, which looks like a base64 string.
How can I control page and size?
It seems the Scroll API only supports successive page numbers?
There is nothing in Elasticsearch which allows a direct jump to a specific page, as the results have to be collected from different shards. So in your case search_after will be a better option. You can reduce the amount of data returned for the intermediate queries, and then, once you reach the page which is actually requested, fetch the complete data.
Example: let's say you have to jump to the 99th page. You can request a reduced amount of data for the first 98 pages, and once you're at page 99 you can fetch the complete data.
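For reference, a search_after loop with the high-level REST Java client might look roughly like this; the index name "logs", the sort field "timestamp", and the tie-breaker field "log_id" are assumptions, and search_after always needs a stable sort.

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortBuilders;
import org.elasticsearch.search.sort.SortOrder;
import java.io.IOException;

public class SearchAfterPager {
    // Walks forward page by page; "logs", "timestamp" and "log_id" are assumed names.
    static void pageThrough(RestHighLevelClient client, int pageSize) throws IOException {
        SearchSourceBuilder source = new SearchSourceBuilder()
                .size(pageSize)
                .sort(SortBuilders.fieldSort("timestamp").order(SortOrder.ASC))
                .sort(SortBuilders.fieldSort("log_id").order(SortOrder.ASC)); // unique tie-breaker field

        SearchResponse response = client.search(new SearchRequest("logs").source(source), RequestOptions.DEFAULT);
        SearchHit[] hits = response.getHits().getHits();

        while (hits.length > 0) {
            // process hits here ...
            Object[] lastSort = hits[hits.length - 1].getSortValues();
            source.searchAfter(lastSort); // resume right after the last hit of this page
            response = client.search(new SearchRequest("logs").source(source), RequestOptions.DEFAULT);
            hits = response.getHits().getHits();
        }
    }
}

To land on page 99 cheaply, you would run the first 98 iterations with a trimmed-down source (for example disabling _source with fetchSource(false)) and only request the full documents on the last iteration.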
What you said is correct!
You can't do traditional pagination using the scroll API.
I might suggest you look at the Search After API.
It might not cover your requirement, though!
The From / Size default max result size is 10,000.
As mentioned here:
Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000
So if you increase the index.max_result_window setting persistently, it will raise the maximum number of search results. This is not really a solution, but it does loosen the limitation.
Remember that this workaround might hamper the performance of your ES server. Read through all the posts here.
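If you do go that route, the setting can be raised per index. With the high-level REST Java client that looks roughly like the following; the index name and the new limit are just examples.

import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;
import java.io.IOException;

public class RaiseResultWindow {
    // Raises index.max_result_window for the "logs" index (an example name) to 50,000.
    // Deep from/size pages still cost memory and CPU on every shard, so use with care.
    static void raise(RestHighLevelClient client) throws IOException {
        UpdateSettingsRequest request = new UpdateSettingsRequest("logs");
        request.settings(Settings.builder().put("index.max_result_window", 50000).build());
        client.indices().putSettings(request, RequestOptions.DEFAULT);
    }
}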
My suggestion is to use the Scroll API and change the pagination style.
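A scroll-based iteration over the whole result set would look roughly like the sketch below; it suits exports or batch jobs rather than a user jumping to an arbitrary page. The index name, batch size, and keep-alive are assumptions, and note that the TimeValue class moved to the org.elasticsearch.core package in later client versions.

import org.elasticsearch.action.search.ClearScrollRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import java.io.IOException;

public class ScrollIterator {
    // Scrolls through every hit of the "logs" index (an assumed name), batch by batch.
    static void scrollAll(RestHighLevelClient client) throws IOException {
        TimeValue keepAlive = TimeValue.timeValueMinutes(1L);

        SearchRequest searchRequest = new SearchRequest("logs");
        searchRequest.source(new SearchSourceBuilder().size(1000));
        searchRequest.scroll(keepAlive);

        SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
        String scrollId = response.getScrollId();
        SearchHit[] hits = response.getHits().getHits();

        while (hits != null && hits.length > 0) {
            // process hits here ...
            SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
            scrollRequest.scroll(keepAlive);
            response = client.scroll(scrollRequest, RequestOptions.DEFAULT);
            scrollId = response.getScrollId();
            hits = response.getHits().getHits();
        }

        ClearScrollRequest clear = new ClearScrollRequest(); // free the scroll context when done
        clear.addScrollId(scrollId);
        client.clearScroll(clear, RequestOptions.DEFAULT);
    }
}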

What is the Searching difference between various small HashSet and 1 large HashSet?

I did a small POC and found that when I search in one large Set of 400 items, it is 6-7 times faster than searching in 20 sets of 20 items each. Hashing is used in both cases, so how does just looping over the sets cost so much?
Would you expect it to take the same time or 20 times longer? With 20 sets, you need 10.5 lookups on average (assuming the item is present in exactly one of them), so a factor of about 10.5 should result. This is reasonably close to your reported factor of 6-7. As you gave us no code, we can't point to where your benchmark fails. But without reading something about how to benchmark, nobody gets it right.
If you want to know more, provide us with more details.
PS: You should hardly ever use 20 sets the way you're probably using them. A Map<Item, Integer> is a much better representation of a set partitioning and is as fast as a Set<Item> (a HashSet is in fact implemented on top of a HashMap).
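A minimal sketch of the difference, with made-up item names; variant 1 probes the 20 sets one by one, variant 2 answers the same question with a single map lookup.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PartitionLookup {
    public static void main(String[] args) {
        // 400 items split into 20 partitions of 20 items each (names are made up).
        List<Set<String>> partitions = new ArrayList<>();
        Map<String, Integer> partitionOf = new HashMap<>();
        for (int p = 0; p < 20; p++) {
            Set<String> set = new HashSet<>();
            for (int i = 0; i < 20; i++) {
                String item = "item-" + p + "-" + i;
                set.add(item);
                partitionOf.put(item, p); // single map: item -> partition index
            }
            partitions.add(set);
        }

        String needle = "item-17-5";

        // Variant 1: scan the 20 sets; ~10.5 contains() calls on average.
        int found = -1;
        for (int p = 0; p < partitions.size(); p++) {
            if (partitions.get(p).contains(needle)) {
                found = p;
                break;
            }
        }

        // Variant 2: one lookup in the combined map.
        Integer direct = partitionOf.get(needle);

        System.out.println(found + " / " + direct); // both print 17
    }
}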

JavaFx Tableview speed and sorting in Sql application

I am working on a small toy application to expand my knowledge of Java, JavaFX and SQL. I have a MySQL server on my local network, which I am able to communicate with, and a simple TableView can be populated, sorted, etc. The data only has to be shown to the user, no editing. Everything nice and clean.
The Problems:
There are around 170,000 rows with 10 columns, all chars, to display, which seems to be rather hard to do in reasonable time. The query is done during startup and it takes around 1.5 minutes before I can see the table.
The memory footprint is also enormous: the application without the populated TableView uses around 70 MB; with all the data it uses 600-700 MB (the XML file used to populate MySQL is 70 MB in size ... )!
Sorting is REALLY slow. I am using StringProperty, which should give a boost according to: JavaFx tableview sort is really slow how to improve sort speed as in java swing (if I understood that correctly). I have not tried the custom sort so far, however.
My thoughts:
Similar to application design for mobile, I think an adapter pattern can fix these problems. Hence, I create an ObservableList with the correct number of elements, but only populate a limited number of rows at the beginning. When I am scrolling down (scroll wheel), the list has to be updated with new elements in advance via SQL queries. This should give me a performance boost for the first problem. Nice idea, but what am I going to do if the user scrolls down via the scrollbar (click and drag down)? Then I would skip certain entries, yet I need that information to give the user feedback about where to scroll to.
How could I fix this ?
For the sorting, I would use the SQL sorting methods, so each sort would be performed on the SQL server and a new ObservableList would be created. As before, only a certain amount of data would be loaded by the first query.
I am not sure whether this approach would also reduce the memory footprint.
Your opinion:
Are my ideas reasonable and doable in Java/JavaFX?
I would love to hear your ideas about these problems.
Thank you.
I found out that JVx is capable of providing lazy loading. This should do the trick.
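If you prefer to stay with plain JDBC instead of JVx, the lazy-loading idea from the question can be sketched roughly as below: fetch one page at a time with LIMIT/OFFSET and let MySQL do the sorting. The table and column names are placeholders, and the ORDER BY value must come from a fixed whitelist, never from user input.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class PageLoader {
    private static final int PAGE_SIZE = 200;

    // Loads one page of rows, already sorted by the database.
    static List<String[]> loadPage(Connection conn, int page, String orderByColumn) throws SQLException {
        // orderByColumn must be taken from a whitelist of known column names.
        String sql = "SELECT * FROM measurements ORDER BY " + orderByColumn + " LIMIT ? OFFSET ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, PAGE_SIZE);
            ps.setInt(2, page * PAGE_SIZE);
            try (ResultSet rs = ps.executeQuery()) {
                int cols = rs.getMetaData().getColumnCount();
                List<String[]> rows = new ArrayList<>();
                while (rs.next()) {
                    String[] row = new String[cols];
                    for (int i = 0; i < cols; i++) {
                        row[i] = rs.getString(i + 1);
                    }
                    rows.add(row);
                }
                return rows;
            }
        }
    }
}

The returned rows would then be wrapped in StringProperty-backed model objects and appended to the ObservableList as the user scrolls, and a separate SELECT COUNT(*) gives you the total row count needed to size the scrollbar.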

Store and search sets (with many possible values) in a database (from Java)

The problem is how to store (and search) a set of items a user likes and dislikes. Although each user may have 2-100 items in their set, the possible item values number in the tens of thousands (and the list is expanding).
Associated with each item is a value say from 10 (like) to 0 (neutral) to -10 (dislike).
So, given a user with a particular set, how do I find users with similar sets (say, a percentage overlap on the intersection)? Ideally the set of matches could be reduced via a filter that includes only items with like/dislike values within a certain percentage.
I don't see how to use a key/value or column store for this, and walking a relational table of items for each user would seem to consume too many resources. Turning the sets into documents would seem to lose clarity.
The web app is in Java. I've searched ORMS, NoSQL, ElasticSearch and related tools and databases. Any suggestions?
OK, it seems the actual storage isn't the problem; rather, you want to build a recommendation system based on the likes/dislikes.
You can store things however you want, even in SQL; most SQL RDBMSs will be good enough for your data store, but you can of course use anything else as well. The point is that no SQL solution (that I know of) will give you good results with this by itself. What you are looking for is a recommendation system, and the best-known one for distributed systems, with many algorithms already implemented, is Apache Mahout.
According to what I’ve learned about it so far, it can do what you need basically out of the box. I know that it’s based on Hadoop and Yarn but I’m not sure if you can import data from anywhere you want, or need to have it in HDFS.
Another option would be to implement a machine learning algorithm on your own, which would run on only one machine, but you just won't get the results you want with a simple query in any SQL system.
The reason you need machine learning algorithms, and a query with some numbers won't be enough in most cases, is the diversity of the users you are facing… What if you have a user B who liked/disliked everything he has in common with user A in exactly the same way, but the overlap is only 15%? On the other hand you have a user C who is pretty similar to A (not 100%, but the directions are mostly the same) and who has rated over 90% of the things A also rated. In this scenario C is much closer to A than B would be, even though B agrees 100% on the overlap. There are many other scenarios where simple percentages won't be enough, and that's why many companies with recommendation systems (Amazon, Netflix, Spotify, …) use Apache Mahout and similar systems to get this done.
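A minimal sketch with Mahout's (single-machine, non-distributed) Taste API: it assumes a ratings.csv file with userID,itemID,value lines, where the value is your -10..10 score, and a neighborhood size of 10; all of those names and numbers are placeholders.

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class SimilarUsers {
    public static void main(String[] args) throws Exception {
        // ratings.csv: one "userID,itemID,value" line per like/dislike (file name is an assumption).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        GenericUserBasedRecommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        long userId = 42L; // placeholder user
        long[] similarUsers = recommender.mostSimilarUserIDs(userId, 10); // "users with similar sets"
        for (RecommendedItem item : recommender.recommend(userId, 5)) {   // items the user might like
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
        System.out.println(java.util.Arrays.toString(similarUsers));
    }
}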

Java - Custom Hash Map/Table Some Points

In some previous posts I have asked questions about coding a custom hash map/table in Java. Since I haven't been able to solve it, and maybe I forgot to mention properly what I really want, I am summarizing everything here to make it clear and precise.
What I am going to do:
I am writing code for our server in which I have to find users' access type by URL.
Now, I have 1110 million URLs (approx.).
So, what we did,
1) Divided the database into 10 parts, each of 110 million URLs.
2) Built a HashMap using parallel arrays whose keys are one part of the URL (represented as a LONG) and whose values are the other part of the URL (represented as an INT) - a key can have multiple values.
3) Then, each day when the system starts, search the HashMap for some other URLs (millions of URLs saved per day).
What have you tried:
1) I have tried many NoSQL databases; however, we found them not so good for our purpose.
2) I have built our own custom hashmap (using two parallel arrays) for that purpose.
So, what the issue is:
When the system starts, we have to load the hashtable of each database and perform searches for millions of URLs:
Now, issue is,
1) Though the HashTable search performance is quite nice, the code takes a lot of time loading the HashTable (we are using a FileChannel & memory-mapped buffer to load it, which takes 20 seconds per HashTable of 220 million entries - with a load factor of 0.5, which we found fastest).
So, we are spending: (HashTable load + HashTable search) * no. of DBs = (20 + 5) * 10 = 250 seconds, which is quite expensive for us, and most of the time (200 out of 250 sec) goes to loading hashtables.
Have you thought of any other way:
One way can be:
Stop worrying about loading and storing, and leave caching to the operating system by using a memory-mapped buffer. But, as I have to search for millions of keys, this gives worse performance than the above.
As we found the HashTable performance is nice but the loading time is high, we thought of cutting it down in another way, like:
1) Create an array of linked lists of size Integer.MAX_VALUE (my own custom linked list).
2) Insert values (ints) into the linked list whose index is the key number (we reduce the key size to INT).
3) So, we only have to store the linked lists on disk.
Now, the issue is that it takes a lot of time to create that many linked lists, and creating such a large number of linked lists makes no sense if the data is not well distributed.
So, what are your requirements:
Simply my requirements:
1) Insertion and searching of keys with multiple values. Looking for good search performance.
2) A fast way to load the table into memory (especially).
(Keys are 64-bit INTs and values are 32-bit INTs; one key can have at most 2-3 values. We can also make our keys 32-bit, but that would give more collisions, which is acceptable for us if it makes things better.)
Can anyone help me solve this, or offer any comments on how to approach the issue?
Thanks.
NB:
1) As per previous Stack Overflow suggestions, pre-reading data to warm the disk cache is not possible, because our application starts working as soon as the system starts, and again the next day when the system starts.
2) We have not found that NoSQL DBs scale well for us, as our requirements are simple (just insert hashtable key/value pairs, then load and search, i.e. retrieve values).
3) As our application is part of a small project to be deployed on a small campus, I don't think anybody will buy me an SSD for it. That is my limitation.
4) We use Guava/Trove as well, but they are not able to store such a large amount of data in 16 GB either (we are using a 32 GB Ubuntu server).
If you need quick access to 1110 million data items then hashing is the way to go. But don't reinvent the wheel; use something like:
memcacheDB: http://memcachedb.org
MongoDB: http://www.mongodb.org
Cassandra: http://cassandra.apache.org
It seems to me (if I understand your problem correctly) that you are trying to approach the problem in a convoluted manner.
I mean, the data you are trying to pre-load is huge to begin with (say 220 million * 64 bytes ≈ 14 GB), and you are trying to memory-map it, etc.
I think this is a typical problem that is solved by distributing the load across different machines. That is, instead of trying to locate a linked list index, you should be trying to figure out which machine a specific part of the map has been loaded onto, and get the value from that machine (each machine holds part of this database map, and you fetch the data from the appropriate part, i.e. machine, each time).
Maybe I am way off here, but I also suspect you are using a 32-bit machine.
So if you have to stick with a single-machine architecture and it is not economically possible to improve your hardware (a 64-bit machine and more RAM, or an SSD, as you point out), I don't think you can make any dramatic improvement.
I don't really understand in what form you are storing the data on disk. If what you are storing consists of URLs and some numbers, you might be able to speed up loading from disk quite a bit by compressing the data (unless you are already doing that).
Creating a multithreaded loader that decompresses while loading might be able to give you quite a big boost.
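A hypothetical sketch of that "compress + multithreaded loader" idea: each of the 10 partitions is assumed to be a gzip-compressed file of (long key, int value) pairs; the file names and entry counts are made up for illustration.

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;

public class CompressedLoader {

    static final class Partition {
        final long[] keys;
        final int[] values;
        Partition(long[] keys, int[] values) { this.keys = keys; this.values = values; }
    }

    // Reads one compressed partition file sequentially into parallel arrays.
    static Partition loadPartition(File file, int entries) throws IOException {
        long[] keys = new long[entries];
        int[] values = new int[entries];
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(
                new GZIPInputStream(new FileInputStream(file)), 1 << 20))) {
            for (int i = 0; i < entries; i++) {
                keys[i] = in.readLong();
                values[i] = in.readInt();
            }
        }
        return new Partition(keys, values);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // decompress several partitions in parallel
        List<Future<Partition>> futures = new ArrayList<>();
        for (int p = 0; p < 10; p++) {
            File file = new File("partition-" + p + ".gz");           // assumed naming scheme
            futures.add(pool.submit(() -> loadPartition(file, 1_000_000))); // entry count is a placeholder
        }
        for (Future<Partition> f : futures) {
            Partition part = f.get(); // hand keys/values to the custom hash table here
            System.out.println("loaded " + part.keys.length + " entries");
        }
        pool.shutdown();
    }
}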
