How to query + Reindex in AWS hosted Elasticsearch outside of Kibana

I have a problem in which I need to query for a subset of records on a large index containing a high volume of records, whilst running a Painless script with the search query to augment the result. The (much smaller) result is to be saved in a secondary index for later use. In a different SO question: Reindex part of Elasticsearch index onto new index via Jest, I mentioned this is possible through the Kibana interface, but there does not seem to be a Java library that can accomplish what I need. Has anyone ever accomplished a query within a _reindex operation outside of Kibana? I am leaning toward using the URLConnection family in Java, but am looking for suggestions and advice at this point.
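For illustration, here is a minimal sketch of what calling the _reindex REST endpoint directly from Java might look like, using the JDK 11+ java.net.http classes rather than URLConnection. The class name, method names, index names, and endpoint URL are all hypothetical. The body layout (source with a query, dest, and a Painless script) follows the documented _reindex request format; older Elasticsearch versions (5.x) use "inline" instead of "source" for the script text, so check the version your AWS domain runs.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds a _reindex request with a query and a Painless script.
final class ReindexSketch {

    // Escapes a string so it can be embedded inside a JSON string literal.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // The query restricts which source documents are copied; the script
    // modifies each document before it is written to the destination index.
    // queryJson is inserted as-is and must already be valid JSON.
    static String buildBody(String sourceIndex, String destIndex,
                            String queryJson, String painlessSource) {
        return "{"
            + "\"source\":{\"index\":\"" + jsonEscape(sourceIndex) + "\","
            + "\"query\":" + queryJson + "},"
            + "\"dest\":{\"index\":\"" + jsonEscape(destIndex) + "\"},"
            + "\"script\":{\"lang\":\"painless\",\"source\":\""
            + jsonEscape(painlessSource) + "\"}"
            + "}";
    }

    // Builds (but does not send) the POST request. The endpoint is an assumption:
    // pass the base URL of your own domain, without a trailing slash.
    static HttpRequest buildRequest(String endpoint, String body) {
        return HttpRequest.newBuilder(URI.create(endpoint + "/_reindex"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }
}
```

The request could then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). Note that this sketch does no authentication: if the domain's access policy requires IAM credentials, the request would also need AWS Signature Version 4 signing, which is omitted here.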

Related

N1QL vs Async api for multi get document couchbase

There are 2 ways to fetch multiple documents in couchbase.
N1QL query
Reactive client (source)
I understand that the 2nd has backpressure and so on because it is reactive in nature. But I want to understand what other functional differences there are between the 2 methods (for example, does one fire the get query to all shards while the other goes to a particular shard only, etc.). Can someone help me understand the functional differences and caveats of using the 1st approach over the 2nd?
My use case is to get multiple documents by id.

The best I can tell, when you use the Flux.fromIterable as in that example, it will use the key-value API behind the scenes. This is different from the N1QL approach in a number of ways that include (but probably aren't limited to):
N1QL can be used to fetch documents by attributes other than the document key (e.g. `SELECT * FROM foo WHERE name LIKE '%best wishes%'`).
N1QL queries will use the Couchbase query service and (usually) the index service and (most likely) the data service.
The key-value API will go directly to the data service.
I think using N1QL to fetch documents by ID may not need to use the index service (assuming you use the right syntax), but will still need to use the query service. So there is some overhead.
Key-value access is always the fastest way to retrieve data from Couchbase. However, depending on your document size, concurrency needs, other operations, and what overhead the Reactive client introduces (if any--I don't know), the difference in overall performance could be anywhere from 0 to way-way-way better.
My gut recommendation is to go with Reactive (and therefore key-value) for your use case of "get multiple document by id".

How to get count for database query in Accumulo

Every database I've ever seen has a method for retrieving the count of the query prior to actually executing it. But I can't figure how to do this simple task in Accumulo.
Just for clarity, I want the Accumulo analog of this Mongo feature.
I checked the Scanner apidocs but I can't find anything. I'm using Java but answers for other languages would be greatly helpful too.

Accumulo is a lower-level application than a traditional RDBMS. It is based on Google's Bigtable design and does not behave like a relational database. It's more accurately described as a massively parallel sorted map than a database.
It is designed to do different kinds of tasks than a relational database, and its focus is on big data.
To achieve the equivalent of the MongoDB feature you mentioned in Accumulo (to get a count of the size of an arbitrary query's result set), you can write a server-side Iterator which returns counts from each server, which can be summed on the client side to get a total. If you can anticipate your queries, you can also create an index which keeps track of counts during the ingest of your data.
Creating custom Iterators is an advanced activity. Typically, there are important trade-offs (time/space/consistency/convenience) to implementing something as seemingly simple as a count of a result set, so proceed with caution. I would recommend consulting the user mailing list for information and advice.

Improve results of a "search" input field?

I have a database with 20,000 records. Each record has a name. When a user wants to view a record, he can visit a webapp and type the name of the record in an input field. While typing, results from the database would be shown/filtered, matching what the user typed. I would like to know the basic architecture/concepts on how to program this.
I'm using the following language stack:
frontend: html5/javascript (+ajax to make instant calls while user is typing)
backend: java + jdbc to connect to simple sql database
My initial idea is:
A user types text
Whenever a character is entered or removed in the inputfield, make an ajax request to the backend
The backend does a LIKE %input% query on the name field in the database
All data found by the query is sent as a json string to the frontend
The frontend processes the json string and displays whatever results it finds
My two concerns are: the high amount of ajax requests to process, in conjunction with the possibly very heavy LIKE queries. What are ways to optimize this? Only search for every two characters they type/remove? Only query for the first ten results?
Do you know of websites that utilise these optimizations?
NOTE: assume the records are persons and names are like real people names, so some names are more common than others.

You can choose an SPA approach: load all 20,000 names/ids to the client side and then filter them in memory. It should be the fastest way, with minimal load on the database and back end.

Here are possible solutions:
Restrict search to prefix search - LIKE 'prefix%' can be executed efficiently using a BTREE-type index.
Measure the performance of the naive LIKE '%str%' solution - if you are working on a B2B application, the database will likely load that table in memory and run the queries fast enough.
Look at the documentation for your database - there could be special features for that, like an inverted index.
As @Stepan Novikov suggested, load your data in memory and search manually.
Use specialized search indexers like SOLR or ElasticSearch (likely overkill for only 20k records)
If you are feeling ninja, implement your own N-gram index.
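To make the last option more concrete, here is a minimal in-memory sketch of an N-gram (here: trigram) index in Java. The class and method names are made up for illustration, and it assumes the 20,000 names fit comfortably in memory. Each name is split into lowercase 3-character substrings, every trigram maps to the ids of the names containing it, and a search intersects those id sets before verifying candidates with a plain substring check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory trigram index for case-insensitive substring search.
final class TrigramIndex {
    private static final int N = 3;
    private final List<String> names = new ArrayList<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Registers a name under every trigram it contains.
    void add(String name) {
        int id = names.size();
        names.add(name);
        String key = name.toLowerCase(Locale.ROOT);
        for (int i = 0; i + N <= key.length(); i++) {
            postings.computeIfAbsent(key.substring(i, i + N), k -> new TreeSet<>()).add(id);
        }
    }

    // Returns up to 'limit' names containing 'input', in insertion order.
    List<String> search(String input, int limit) {
        String q = input.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        if (q.length() < N) {
            // Too short to form a trigram: fall back to a linear scan.
            for (String name : names) {
                if (result.size() >= limit) break;
                if (name.toLowerCase(Locale.ROOT).contains(q)) result.add(name);
            }
            return result;
        }
        Set<Integer> candidates = null;
        for (int i = 0; i + N <= q.length(); i++) {
            Set<Integer> ids = postings.get(q.substring(i, i + N));
            if (ids == null) return result; // some trigram never occurs: no match
            if (candidates == null) candidates = new TreeSet<>(ids);
            else candidates.retainAll(ids);
        }
        // Sharing all trigrams does not guarantee a contiguous match, so verify.
        for (int id : candidates) {
            if (result.size() >= limit) break;
            String name = names.get(id);
            if (name.toLowerCase(Locale.ROOT).contains(q)) result.add(name);
        }
        return result;
    }
}
```

The limit parameter corresponds to the "only query for the first ten results" idea from the question. For inputs shorter than three characters the sketch degrades to a full scan, which is usually acceptable at this data size.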

Any reference for good Datamining tools in Java?

We are working on an internship project for a company. The project itself consists of data mining. The database we have to work with is huge (in gigabytes).
Sad to say that DB itself is very poorly structured with inconsistent values and most importantly no primary or foreign keys. So in our simple Servlet modules to extract and show the inconsistent data, it takes forever for queries to perform and show up on servlet.
As n00b programmers we do not know about Join and such things in DB. Also we are using MySQL as our DB server. The DB is composed of real-time data from telecom towers.
To find sample inconsistencies in table values we are using a combination of multiple queries, with the output of one query serving as input to another, like:
"SELECT DISTINCT tow_id FROM tower_data WHERE time_stamp LIKE ? ";
//query for finding tower-id.
"SELECT time_stamp FROM tower_data WHERE time_stamp LIKE ? AND param_code = ? AND tow_id = ? GROUP BY time_stamp HAVING count( * ) >1";
//query for finding time stamps with duplicate data.
And so on.
Also there are some 10 tables in the database. We need to combine 2-3 tables to get values for custom queries.
After finding all the inconsistent values for multiple factors, we have to do data cleansing, removal of noise, data prediction and such tasks in the next stage.
So we thought we can apply some Java Data Mining tools which would in turn apply some algorithm to speed up the data retrieval.
Please guide us towards some good datamining tools. Any guidance towards optimizing/rewriting the queries would also be highly appreciated.

I'm not 100% sure it will help in your case, but have a look at google-refine...

Since you seem to have a lot of badly structured data, I do not think data-mining will help.
You may consider using Apache Hadoop for going over all this data and finding inconsistencies. You can use Amazon EC2 for a simple and relatively cheap way to run Hadoop. You can also use Hadoop to port the databases to a better schema, provided that you can build one.
EDIT: I guess you can also do some things within MySQL. Use query explanation to find the slow parts of your query - I believe 'LIKE' is usually slow, and maybe you can reformulate the query to something faster. Maybe you can first sort your schema by timestamp and then look at sub-ranges. Again, you first have to have an efficient way to get the data, and then you can try to mine it. Good luck.

Keeping query statistics using lucene

I am developing a search component of a web application using Lucene. I would like to save the user queries to an index and use them to suggest alternate queries to users, and to keep query statistics (most often used queries, top scoring queries, ...).
To use this data for alternate query suggestions, I would analyze the queries to see which terms are most often used with one another and use that to create a suggestion to the user.
But I can't figure out in which form to index the data. I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content. Does anyone have any ideas about the way this can be accomplished?
Thanks for the help.

"I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content"
You can tell Lucene not to store document content, which means that the principal overhead will be the unique Terms, and the index itself. So, it might not be a large overhead to store each query as a unique Document...this way you will not be throwing away any information.

First, I believe that you should store the queries separately from the existing index. The problem is not redundant data but rather "watering down" your index - storing the queries in the same index may harm the relevance of your searches. Some options for this are:
Use a separate Lucene index.
Use Solr, with two separate cores, one for the documents and the other for the queries.
Use a query log. Store scores with the queries. Build query statistics using post-processing. As this is a web application, you can probably use the logs of a servlet container, such as Tomcat, for this.
Second, Auto-Suggest From Popular Queries Using EdgeNGrams suggests an alternative implementation of query suggestion using Solr.
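As a rough illustration of the query-log post-processing idea, the following Java sketch counts how often each query occurs and which terms appear together in the same query. It is plain Java rather than Lucene or Solr code: the class and method names are hypothetical, and the whitespace split is only a stand-in for whatever analyzer the application really uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical post-processor for a log of user queries.
final class QueryLogStats {
    private final Map<String, Integer> queryCounts = new HashMap<>();
    private final Map<String, Map<String, Integer>> cooccurrence = new HashMap<>();

    // Records one logged query: bumps its count and the pairwise term counts.
    void record(String rawQuery) {
        String normalized = rawQuery.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) return;
        queryCounts.merge(normalized, 1, Integer::sum);
        // Assumption: whitespace tokenization; duplicates within a query count once.
        Set<String> terms = new TreeSet<>(Arrays.asList(normalized.split("\\s+")));
        for (String a : terms) {
            for (String b : terms) {
                if (!a.equals(b)) {
                    cooccurrence.computeIfAbsent(a, k -> new HashMap<>())
                                .merge(b, 1, Integer::sum);
                }
            }
        }
    }

    // Most frequently used queries, most frequent first.
    List<String> topQueries(int n) {
        return topKeys(queryCounts, n);
    }

    // Terms most often used together with 'term', as suggestion candidates.
    List<String> relatedTerms(String term, int n) {
        Map<String, Integer> counts = cooccurrence.get(term.toLowerCase(Locale.ROOT));
        return counts == null ? new ArrayList<>() : topKeys(counts, n);
    }

    // Sorts by count descending, then alphabetically to keep the order stable.
    private static List<String> topKeys(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (out.size() >= n) break;
            out.add(e.getKey());
        }
        return out;
    }
}
```

Because identical queries are collapsed into a single counter, this keeps the statistics without storing one document per query, which addresses the redundancy concern from the question. Scores could be tracked the same way if they are written to the log.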
