Performance of querying Apache Ignite using In-Clause

I have a schema created in Apache Ignite with 10 columns, 3 of which are indexed (say A and B are strings, C is an int). The table holds around 40,000,000 rows. Here is how I create the cache table:
CacheConfiguration<AffinityKey<Long>, Object> cacheCfg = new CacheConfiguration<>();
cacheCfg.setName(CACHE_NAME);
cacheCfg.setDataRegionName("MY_DATA_REGION");
cacheCfg.setBackups(1);
QueryEntity queryEntity = new QueryEntity(AffinityKey.class, Object.class)
    .setTableName("DataCache")
    .addQueryField("Field_A", String.class.getName(), null)
    .addQueryField("Field_B", String.class.getName(), null)
    .addQueryField("Field_C", Integer.class.getName(), null)
    .addQueryField("Field_D", Integer.class.getName(), null);
List<QueryIndex> queryIndices = new ArrayList<>();
List<String> groupIndices = new ArrayList<>();
groupIndices.add("Field_A");
groupIndices.add("Field_B");
groupIndices.add("Field_C");
queryIndices.add(new QueryIndex(groupIndices, QueryIndexType.SORTED));
queryEntity.setIndexes(queryIndices);
cacheCfg.setQueryEntities(Arrays.asList(queryEntity));
ignite.getOrCreateCache(cacheCfg);
I'm trying to query the Ignite cache with a SQL statement like
select * from DataCache where
Field_A in (...) and Field_B in (...) and Field_C in (...)
with each IN clause containing 1,000 to 5,000 values. The query is slow, even slower than running it directly against Google BigQuery. Is there any way to improve query performance when using IN clauses?

You don't say how you created your table, but I'm guessing you have three indexes, one on each column. I suspect you'll need to create a group index, i.e., one index across all three columns. With so many elements in your IN clauses, it may also be beneficial to rewrite as a JOIN.
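One way to express the JOIN rewrite is the TABLE function workaround described in Ignite's SQL tuning notes, where each value list is passed as a single array parameter and joined against the indexed column. The sketch below only builds the SQL text. The class name InClauseAsJoin, the aliases, and the SQL type names are illustrative assumptions, and whether the planner actually uses the group index should be checked with EXPLAIN on your Ignite version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds a query that joins the table against one
// TABLE(...) function per filtered column instead of using IN (...).
final class InClauseAsJoin {

    // columnSqlTypes maps column name -> SQL type of that column (assumed names).
    static String buildSql(String table, Map<String, String> columnSqlTypes) {
        StringBuilder sql = new StringBuilder("SELECT t.* FROM ").append(table).append(" t");
        int i = 0;
        for (Map.Entry<String, String> e : columnSqlTypes.entrySet()) {
            String alias = "v" + i++;
            // Each "?" is meant to be bound to an array holding the values for that column.
            sql.append(" JOIN TABLE(x ").append(e.getValue()).append(" = ?) ").append(alias)
               .append(" ON t.").append(e.getKey()).append(" = ").append(alias).append(".x");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cols = new LinkedHashMap<>();
        cols.put("Field_A", "VARCHAR");
        cols.put("Field_B", "VARCHAR");
        cols.put("Field_C", "INT");
        System.out.println(buildSql("DataCache", cols));
    }
}
```

The resulting string would then be run through a SqlFieldsQuery with one array argument per placeholder (for example a String[] for Field_A). That step is not shown here because it needs a running Ignite node.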

Related

How do I retrieve the number of distinct rows with the JPA Criteria API

I need to count the number of rows returned by a CriteriaQuery. The relevant piece of information here is that those need to be distinct rows based on a dynamically selected set of columns.
This means that I cannot simply count over the entire table, since that could include rows that become duplicates once some columns are dropped from the projection.
I have a List of Predicates and Selections to conform to:
private final List<Selection<?>> projection = new ArrayList<>();
private final List<Predicate> predicates = new ArrayList<>();
and I want to count the number of rows that would be returned, if this query was executed non-paginated:
criteriaQuery.multiselect(projection)
.distinct(true)
.where(cb.and(predicates.toArray(new Predicate[0])));
The usual approach of transforming this into a Subquery will not work, since you can't multiselect on a Subquery and also can't select from the Subquery.
Can this be done with the Criteria API?

Can you try this?
CriteriaBuilder builder = em.getCriteriaBuilder();
CriteriaQuery<String> query = builder.createQuery(String.class);
Root<RuleVar> ruleVariableRoot = query.from(RuleVar.class);
query.select(ruleVariableRoot.get(RuleVar_.varType)).distinct(true);
More info: https://www.objectdb.com/java/jpa/query/criteria

Sqlite relative complement on combined key

First some background about my Problem:
I am building a crawler and I want to monitor some highscore lists.
The highscore lists are defined by two parameters: a category and a collection (together unique).
After a successful download I create a new stats entry (category, collection, createdAt, ...)
Problem: I want to query the highscore list only once per day. So I need a query that will return category and collection that haven't been downloaded in 24h.
The stats Table should be used for this.
I have a List of all possible categories and of all possible collections. They work like a cross join.
So basically I need the cross join minus the entries from the last 24 h (a relative complement).
My idea: cross join categories and collections and subtract every (category, collection) pair found in stats entries created during the last 24 h.
Question 1: Is it possible to define categories and collections inside the query and cross join them, or do I have to create a table for them?
Question 2: Is my idea the correct approach? How would you do this in SQLite?
OK, I realise that this might sound confusing, so I drew an image of what I actually want.
I am interested in C.
Here is my current code in java, maybe it helps to understand the problem:
public List<Pair<String, String>> getCollectionsToDownload() throws SQLException {
    long threshold = System.currentTimeMillis() - DAY;
    QueryBuilder<TopAppStatistics, Long> query = queryBuilder();
    List<TopAppStatistics> collectionsNotToQuery = query.where().ge(TopAppStatistics.CREATED_AT, threshold).query();
    List<Pair<String, String>> toDownload = crossJoin();
    for (TopAppStatistics stat : collectionsNotToQuery) {
        toDownload.remove(new Pair<>(stat.getCategory(), stat.getCollection()));
    }
    return toDownload;
}

private List<Pair<String, String>> crossJoin() {
    String[] categories = PlayUrls.CATEGORIES;
    String[] collections = PlayUrls.COLLECTIONS;
    List<Pair<String, String>> toDownload = new ArrayList<>();
    for (String ca : categories) {
        for (String co : collections) {
            toDownload.add(new Pair<>(ca, co));
        }
    }
    return toDownload;
}

The easiest solution to your problem is an EXCEPT. Say you have a subquery
that computes A and another one that computes B. These queries
can be very complex. The key is that both should return the same number of columns and comparable data types.
In SQLite you can then do:
<your subquery 1> EXCEPT <your subquery 2>
As simple as that.
For example:
SELECT a, b FROM T where a > 10
EXCEPT
SELECT a,b FROM T where b < 5;
Remember, both subqueries must return the same number of columns.
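Applied to the question, the categories and collections can be declared inline as VALUES lists in a WITH clause, so no extra tables are needed (this assumes a SQLite version with common table expressions, 3.8.3 or later). The sketch below only builds the parameterised SQL text. The class name MissingPairsSql is hypothetical, and the table name stats and column createdAt are taken from the question's description and may differ in the real schema.

```java
import java.util.Collections;

// Hypothetical helper: builds "cross join EXCEPT recently downloaded pairs".
final class MissingPairsSql {

    // Bind order: all categories, then all collections, then the time threshold.
    static String build(int categoryCount, int collectionCount) {
        if (categoryCount < 1 || collectionCount < 1) {
            throw new IllegalArgumentException("need at least one category and one collection");
        }
        return "WITH categories(category) AS (VALUES " + placeholders(categoryCount) + "), "
             + "collections(collection) AS (VALUES " + placeholders(collectionCount) + ") "
             + "SELECT category, collection FROM categories CROSS JOIN collections "
             + "EXCEPT "
             + "SELECT category, collection FROM stats WHERE createdAt >= ?";
    }

    // Produces "(?), (?), ..." with n single-column rows.
    private static String placeholders(int n) {
        return String.join(", ", Collections.nCopies(n, "(?)"));
    }

    public static void main(String[] args) {
        System.out.println(build(3, 2));
    }
}
```

The last placeholder would be bound to the same threshold the Java code computes (current time minus one day), so the database returns only the pairs that still need downloading.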

Hibernate - sqlQuery map redundant records while using JOIN on OneToMany

I have a @OneToMany association between 2 entities (Entity1 to Entity2).
My sqlQueryString consists of next steps:
select ent1.*, ent2.differ_field from Entity1 as ent1 left outer join Entity2 as ent2 on ent1.item_id = ent2.item_id
Adding some subqueries and writing results to some_field2, some_field3 etc.
Execute:
Query sqlQuery = getCurrentSession().createSQLQuery(sqlQueryString)
.setResultTransformer(Transformers.aliasToBean(SomeDto.class));
List list = sqlQuery.list();
and
class SomeDto {
    item_id;
    some_field1;
    ...
    differ_field;
    ...
}
So the result is the List<SomeDto>
Fields which are highlighted with grey are the same.
So what I want is to group by, for example, item_id and
the List<Object> differFieldList would be as aggregation result.
class SomeDto {
...fields...
List<Object> differFieldList;
}
or something like that Map<SomeDto, List<Object>>
I can map it manually but there is a trouble:
When I use sqlQuery.setFirstResult(offset).setMaxResults(limit), I retrieve limit records, but some of them are redundant rows. After merging, I actually end up with fewer records.
Thanks in advance!

If you would like to store the query results in a collection of this class:
class SomeDto {
...fields...
List<Object> differFieldList;
}
When using sqlQuery.setFirstResult(offset).setMaxResults(n), the number of records being limited is based on the joined result set. After merging the number of records could be less than expected, and the data in List could also be incomplete.
To get the expected data set, the query needs to be broken down into two.
In first query you simply select data from Entity1
select * from Entity1
Query.setFirstResult(offset).setMaxResults(n) can be used here to limit the records you want to return. If fields from Entity2 needs to be used as condition in this query, you may use exists subquery to join to Entity2 and filter by Entity2 fields.
Once data is returned from the query, you can extract item_id and put them into a collection, and use the collection to query Entity 2:
select item_id, differ_field from Entity2 where item_id in (:itemid)
Query.setParameterList() can be used to set the item id collection returned from first query to the second query. Then you will need to manually map data returned from query 2 to data returned from query 1.
This seems verbose. If a JPA @OneToMany mapping is configured between the two entities and your query can be written in HQL (you said in a comment that this is not possible), you can let Hibernate lazily load the Entity2 collection for you. The code is then much cleaner, but behind the scenes Hibernate may issue more queries to the database while lazily loading the entities on the many side.
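The manual mapping step of the two-query approach could look roughly like the sketch below. It assumes the second query returns each row as an Object[] of (item_id, differ_field), which is what a native query without a result transformer typically yields. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups (item_id, differ_field) rows by item_id.
final class DifferFieldGrouping {

    // row[0] is assumed to be item_id, row[1] the differ_field value.
    static Map<Object, List<Object>> groupByItemId(List<Object[]> rows) {
        // LinkedHashMap keeps the order in which item ids first appear.
        Map<Object, List<Object>> grouped = new LinkedHashMap<>();
        for (Object[] row : rows) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {1L, "a"});
        rows.add(new Object[] {1L, "b"});
        rows.add(new Object[] {2L, "c"});
        System.out.println(groupByItemId(rows));
    }
}
```

Each SomeDto from the first (paginated) query can then pick up its differFieldList by looking up its item_id in the returned map, so pagination counts Entity1 rows rather than joined rows.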

The duplicated records are natural from a relational database perspective. To group projection according to Object Oriented principles, you can use a utility like this one:
public void visit(T object, EntityContext entityContext) {
    Class<T> clazz = (Class<T>) object.getClass();
    ClassId<T> objectClassId = new ClassId<T>(clazz, object.getId());
    boolean objectVisited = entityContext.isVisited(objectClassId);
    if (!objectVisited) {
        entityContext.visit(objectClassId, object);
    }
    P parent = getParent(object);
    if (parent != null) {
        Class<P> parentClass = (Class<P>) parent.getClass();
        ClassId<P> parentClassId = new ClassId<P>(parentClass, parent.getId());
        if (!entityContext.isVisited(parentClassId)) {
            setChildren(parent);
        }
        List<T> children = getChildren(parent);
        if (!objectVisited) {
            children.add(object);
        }
    }
}
The code is available on GitHub.

Returning certain amount of documents from elasticsearch query in java

I am trying to limit the number of documents returned by my query. I want, say, only 10 documents back, but my query normally returns 22. How would I go about setting a limit on the returned output? I am aware I can limit the list size by creating a list and adding to it, but I want to do it at the query level.
Thanks in advance :) My query:
QueryBuilder raceGenderQuery = QueryBuilders.boolQuery()
.must(termQuery("lep_etg_desc", "indian"))
.must(termQuery("lep_gen_desc", "male"));
Set<String> suburbanLocationSet = new HashSet<String>();
suburbanLocationSet.add("queensburgh");
suburbanLocationSet.add("umhlanga");
suburbanLocationSet.add("tongaat");
suburbanLocationSet.add("phoenix");
suburbanLocationSet.add("shallcross");
suburbanLocationSet.add("balito");
//Build the necessary location query.
QueryBuilder locationQuery = QueryBuilders.boolQuery().must(termsQuery("lep_suburb_home", suburbanLocationSet));
//Combine all Queries so that its filtered to get exact results.
FilteredQueryBuilder finalSearchQuery = QueryBuilders.filteredQuery(QueryBuilders.boolQuery().must(raceGenderQuery).must(locationQuery), FilterBuilders.boolFilter().must(FilterBuilders.rangeFilter("lep_age").gte(25).lte(45)).must(FilterBuilders.rangeFilter("lep_max_income").gte(25000).lte(45000)));
//Run Query through elasticsearch iterating through documents in the traceps index for query matches.
List<Leads> finalLeadsList = new ArrayList<Leads>();
for (Leads leads : this.leadsRepository.search(finalSearchQuery)) {
finalLeadsList.add(leads);
}

I think this is what you want:
SearchResponse response = client.prepareSearch().setSearchType(SearchType.QUERY_THEN_FETCH).setSize(10).setQuery(finalSearchQuery).execute().get();
You have to use QUERY_THEN_FETCH for it to return exactly size results because otherwise it gets size results from each shard.

ORMLite createOrUpdate() records while preserving specific column?

I'm using ORMLite to manage database tables which contain lists of lookup values for a data collection application. These lookup values are periodically updated from a remote server. However, I'd like to be able to preserve the data in a specific column while creating or updating the records, since I would like to store usage counts (specific to the device) associated with each lookup value. Here's how I'm updating the records:
//build list of new records
final List<BaseLookup> rows = new ArrayList<BaseLookup>();
for (int i = 0; i < jsonRows.length(); i++) {
    JSONObject jsonRow = jsonRows.getJSONObject(i);
    //parse jsonRow into a new BaseLookup object and add to rows
    ...
}
//add the new records
dao.callBatchTasks(new Callable<Void>() {
    public Void call() throws Exception {
        for (BaseLookup row : rows) {
            //this is where I'd like to preserve the existing
            //value (if any) of the "usageCount" column
            Dao.CreateOrUpdateStatus result = dao.createOrUpdate(row);
        }
        return null;
    }
});
I've considered attempting to fetch and merge each record individually within the loop, but this seems like it would perform poorly (some tables are a few thousand records). Is there a simpler or more integrated way to accomplish this?

I'd like to be able to preserve the data in a specific column while creating or updating the records, since I would like to store usage counts (specific to the device) associated with each lookup value
If you have to update certain columns from the JSON data but you want to set the usageCount to usageCount + 1 then you have a couple of options.
You could build an update statement using the dao.updateBuilder() method and the UpdateBuilder class, then update the columns to their new values and usageCount to usageCount + 1 where the id matches. You should check the return value to make sure a row was updated. If none was, create the object.
However, it would be easier to just:
get the BaseLookup from the database
if null, call dao.create() to persist a new entry
otherwise update columns and increment the usageCount
and save it back with a dao.update(...)
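The fetch-then-merge steps above can be sketched as follows. To keep the example self-contained, a plain Map stands in for the ORMLite Dao, and the comments note which Dao call each line would correspond to. The Lookup class, its fields, and the method name are hypothetical. This variant leaves usageCount untouched, as the question asks, and the increment mentioned above would be a one-line change.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a Map plays the role of the database table / Dao.
final class UsageCountMerge {

    // Hypothetical stand-in for BaseLookup.
    static final class Lookup {
        final long id;
        String label;      // example of a column refreshed from the server
        int usageCount;    // device-specific column that must survive updates

        Lookup(long id, String label, int usageCount) {
            this.id = id;
            this.label = label;
            this.usageCount = usageCount;
        }
    }

    static void createOrUpdatePreservingUsage(Map<Long, Lookup> store, Lookup incoming) {
        Lookup existing = store.get(incoming.id);   // would be dao.queryForId(id)
        if (existing == null) {
            store.put(incoming.id, incoming);       // would be dao.create(incoming)
        } else {
            existing.label = incoming.label;        // copy only server-supplied columns
            // usageCount is deliberately not copied; would be followed by dao.update(existing)
        }
    }

    public static void main(String[] args) {
        Map<Long, Lookup> store = new HashMap<>();
        store.put(1L, new Lookup(1L, "old", 7));
        createOrUpdatePreservingUsage(store, new Lookup(1L, "new", 0));
        Lookup merged = store.get(1L);
        System.out.println(merged.label + " " + merged.usageCount);
    }
}
```

With a few thousand rows, running this loop inside dao.callBatchTasks, as the question already does, keeps the per-row lookups within a single batch.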
