Solr Composite Unique key from existing fields in schema - java

I have an index named LocationIndex in solr with fields as follows:
<fields>
<field name="solr_id" type="string" stored="true" required="true" indexed="true"/>
<field name="solr_ver" type="string" stored="true" required="true" indexed="true" default="0000"/>
// and some more fields
</fields>
<uniqueKey>solr_id</uniqueKey>
But now I want to change schema so that unique key must be composite of two already present fields solr_id and solr_ver... something as follows:
<fields>
<field name="solr_id" type="string" stored="true" required="true" indexed="true"/>
<field name="solr_ver" type="string" stored="true" required="true" indexed="true" default="0000"/>
<field name="composite-id" type="string" stored="true" required="true" indexed="true"/>
// and some more fields
</fields>
<uniqueKey>solr_ver-solr_id</uniqueKey>
After searching I found that it's possible by adding following to schema: (ref: Solr Composite Unique key from existing fields in schema)
<updateRequestProcessorChain name="composite-id">
<processor class="solr.CloneFieldUpdateProcessorFactory">
<str name="source">docid_s</str>
<str name="source">userid_s</str>
<str name="dest">id</str>
</processor>
<processor class="solr.ConcatFieldUpdateProcessorFactory">
<str name="fieldName">id</str>
<str name="delimiter">--</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
So I changed schema and finally it looks like:
<updateRequestProcessorChain name="composite-id">
<processor class="solr.CloneFieldUpdateProcessorFactory">
<str name="source">solr_ver</str>
<str name="source">solr_id</str>
<str name="dest">id</str>
</processor>
<processor class="solr.ConcatFieldUpdateProcessorFactory">
<str name="fieldName">id</str>
<str name="delimiter">-</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<fields>
<field name="solr_id" type="string" stored="true" required="true" indexed="true"/>
<field name="solr_ver" type="string" stored="true" required="true" indexed="true" default="0000"/>
<field name="id" type="string" stored="true" required="true" indexed="true"/>
// and some more fields
</fields>
<uniqueKey>id</uniqueKey>
But while adding a document it's giving me error:
org.apache.solr.client.solrj.SolrServerException: Server at http://localhost:8983/solr/LocationIndex returned non ok status:400, message:Document [null] missing required field: id
I'm not getting what changes in schema are required to work as desired?
In a document I add, it contain fields solr_ver and solr_id. How and where it'll (solr) create id field by combining both these field something like solr_ver-solr_id?
EDIT:
At this link It's given how refer to this chain. Bu I'm unable to understand how would it be used in schema? And where should I make changes?

So it looks like you have your updateRequestProcessorChain defined appropriately and it should work. However, you need to add this to the solrconfig.xml file and not the schema.xml. The additional link you provided shows you how to modify your solrconfig.xml file and add your defined updateRequestProcessorChain to the current /update request handler for your solr instance.
So find do the following:
Move your <updateRequestProcessorChain> to your solrconfig.xml file.
Update the <requestHandler name="/update" class="solr.UpdateRequestHandler"> entry in your solrconfig.xml file and modify it so it looks like the following:
<requestHandler name="/update" class="solr.UpdateRequestHandler">
<lst name="defaults">
<str name="update.chain">composite-id</str>
</lst>
</requestHandler>
This should then execute your defined update chain and populate the id field when new documents are added to the index.

The described above solution may have some limitations, what if "dest" is over maximum length because concatenated fields are too long.
There is also one more solution with MD5Signature (A class capable of generating a signature String from the concatenation of a group of specified document fields, 128 bit hash used for exact duplicate detection)
<!-- An example dedup update processor that creates the "id" field on the fly
based on the hash code of some other fields. This example has
overwriteDupes set to false since we are using the id field as the
signatureField and Solr will maintain uniqueness based on that anyway. -->
<updateRequestProcessorChain name="dedupe">
<processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<bool name="overwriteDupes">false</bool>
<str name="signatureField">id</str>
<str name="fields">name,features,cat</str>
<str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
From here: http://lucene.472066.n3.nabble.com/Solr-duplicates-detection-td506230.html

I'd like to add this as a comment, but it's impossible to get the creds these days... anyway, here is a better link:
https://wiki.apache.org/solr/Deduplication

Related

Apache Solr DataImportHandler failes trying to index

I am trying to index some xml files into Solr 6.2.1 using their DataImportHandler.
For that purpose I have added the needed import and this RequestHandler into the solrconfig.xml:
<lib dir="${solr.install.dir:../../../..}/contrib/dataimporthandler/lib/" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler" startup="lazy">
<lst name="default">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
Then I wrote the data-config.xml and put it into the same path as the solrconfig.xml:
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8"/>
<document>
<entity name="pickupdir"
processor="FileListEntityProcessor"
dataSource="null"
baseDir="/vagrant/TREC8all/Adhoc/"
recursive="true"
fileName="^[\w\d-]+\.xml$" />
<entity name="trec8_simple"
processor="XPathEntityProcessor"
stream="true"
datasource="pickupdir"
url="${pickupdir.fileAbsolutePath}"
forEach="/DOCS/DOC">
<field column="id" xpath="/DOCS/DOC/DOCNO"/>
<field column="header" xpath="/DOCS/DOC/HEADER"/>
<field column="text" xpath="/DOCS/DOC/TEXT"/>
</entity>
</document>
</dataConfig>
This should make the ImportHandler iterate recursively through all xml files in the directory and index them according to the xpaths.
When I call the requestHandler like this: (I am running solr in a vagrant box instead of locally)
http://192.168.155.156:8983/solr/trec8/dataimport?command=full-import&entity=trec8_simple
I am getting this Exception in the solr.log:
ERROR (Thread-14) [ x:trec8] o.a.s.h.d.DataImporter Full Import failed:java.lang.NullPointerException
at org.apache.solr.handler.dataimport.DataImporter.createPropertyWriter(DataImporter.java:325)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:412)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:475)
at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:458)
at java.lang.Thread.run(Thread.java:745)
Im assuming this should be the source for the DataImportHandler:
https://github.com/sudarshang/lucene-solr/blob/master/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DataImporter.java
I have trouble figuring out what is causing this exception and what it is meaning. Would be nice if somebody could help me out. Thanks!
EDIT:
I think this has something to do with the DataImportHandler not beeing able to finde the data-config.xml. When I remove it will throw the exact same exception
Ok I found the issue!
Problem was in the solrconfig,
<lst name="default">
<str name="config">data-config.xml</str>
</lst>
should have been
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>

Solr : Always seeking for a core named "collection1"

I've created a solr core with configurations and when I try to launch solr embedded server, I get the below error.
Caused by: java.io.IOException: Can't find resource 'solrconfig.xml' in
classpath or '/home/tharindu/Desktop/solr_tmp/custom/newsportal/collection1/conf'
at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:362)
at org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:308)
at org.apache.solr.core.Config.<init>(Config.java:117)
at org.apache.solr.core.Config.<init>(Config.java:87)
at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:167)
at org.apache.solr.core.SolrConfig.readFromResourceLoader(SolrConfig.java:145)
... 9 more
It seems that it is trying to find a solr core named collection1 by default.
The custom folder contains,
-- solr.xml
-- newsportal
-- conf
-- schema.xml
-- solrconfig.xml
-- core.properties
I'm using Spring solr template. The EmbeddedServer configuration is below.
#Bean
public EmbeddedSolrServerFactoryBean solrServerFactoryBean() {
EmbeddedSolrServerFactoryBean factory = new EmbeddedSolrServerFactoryBean();
factory.setSolrHome("/home/tharindu/Desktop/solr_tmp/custom/newsportal");
return factory;
}
#Bean
public SolrTemplate solrTemplate() throws Exception {
return new SolrTemplate(solrServerFactoryBean().getObject(), "newsportal");
}
When I change the EmbeddedServer bean as follows,(only changing the path of the core)
#Bean
public EmbeddedSolrServerFactoryBean solrServerFactoryBean() {
EmbeddedSolrServerFactoryBean factory = new EmbeddedSolrServerFactoryBean();
factory.setSolrHome("/home/tharindu/Desktop/solr_tmp/custom");
return factory;
}
I get the below error.
Caused by: org.apache.solr.common.SolrException: No such core:
at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:112)
at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
at org.springframework.data.solr.core.SolrTemplate$11.doInSolr(SolrTemplate.java:417)
at org.springframework.data.solr.core.SolrTemplate$11.doInSolr(SolrTemplate.java:414)
at org.springframework.data.solr.core.SolrTemplate.execute(SolrTemplate.java:141)
... 59 more
But when I rename the core folder as collection1 and change core name in the core.properties to name=collection1, everything works fine.
Below is my schema.xml and solrconfig.xml
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="newsportal" version="1.5">
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="text_general" class="solr.TextField" omitNorms="true">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.StopFilterFactory" words="stopwords_en.txt" />
</analyzer>
</fieldType>
</types>
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true" required="true" termVectors="true"/>
<field name="description" type="text_general" indexed="true" stored="true" required="true" termVectors="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true" multiValued="true" />
<defaultSearchField>keywords</defaultSearchField>
<copyField source="title" dest="keywords"/>
<copyField source="description" dest="keywords"/>
</fields>
<uniqueKey>id</uniqueKey>
</schema>
solrconfig.xml
<?xml version="1.0" encoding="UTF-8" ?>
<config>
<luceneMatchVersion>LUCENE_48</luceneMatchVersion>
<dataDir>${solr.data.dir:}</dataDir>
<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}" />
<codecFactory class="solr.SchemaCodecFactory" />
<schemaFactory class="ClassicIndexSchemaFactory" />
<indexConfig>
<lockType>${solr.lock.type:native}</lockType>
</indexConfig>
<updateHandler class="solr.DirectUpdateHandler2"/>
<query>
<maxBooleanClauses>1024</maxBooleanClauses>
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0" />
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0" />
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0" />
<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>20</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
<useColdSearcher>false</useColdSearcher>
<maxWarmingSearchers>2</maxWarmingSearchers>
</query>
<requestDispatcher handleSelect="false">
<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048000" formdataUploadLimitInKB="2048" />
<httpCaching never304="true" />
</requestDispatcher>
<requestHandler name="/select" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="sort">title asc</str>
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="q">*:*</str>
<bool name="facet">false</bool>
</lst>
</requestHandler>
<requestHandler name="/update" class="solr.UpdateRequestHandler"/>
<requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler" />
<requestHandler name="/analysis/document" class="solr.DocumentAnalysisRequestHandler" startup="lazy" />
<requestHandler name="/admin/" class="solr.admin.AdminHandlers" />
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
<lst name="invariants">
<str name="q">*:*</str>
</lst>
<lst name="defaults">
<str name="echoParams">all</str>
</lst>
</requestHandler>
<admin>
<defaultQuery>*:*</defaultQuery>
</admin>
</config>
core.properties file
name=newsportal
EDIT
solr.xml file
<solr>
<solrcloud>
<str name="host">${host:}</str>
<int name="hostPort">${jetty.port:8983}</int>
<str name="hostContext">${hostContext:solr}</str>
<int name="zkClientTimeout">${zkClientTimeout:30000}</int>
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
</solrcloud>
<shardHandlerFactory name="shardHandlerFactory"
class="HttpShardHandlerFactory">
<int name="socketTimeout">${socketTimeout:0}</int>
<int name="connTimeout">${connTimeout:0}</int>
</shardHandlerFactory>
</solr>
Solr version : 4.10.4
Spring solr data version : 1.5.5.BUILD-SNAPSHOT
I appreciate any help to resolve this issue.
Unfortunately, there is no clean approach how to do that because collection1 is harcoded in EmbeddedSolrServerFactory class (corresponding Jira ticket).
But I tried this hacky one and it works for me:
#Bean
public EmbeddedSolrServerFactoryBean solrClient() {
EmbeddedSolrServerFactoryBean embeddedSolrServerFactoryBean = new EmbeddedSolrServerFactoryBean() {
#Override
public EmbeddedSolrServer getObject() {
return (EmbeddedSolrServer) getSolrClient("<YOUR_CORE_NAME>");
}
};
embeddedSolrServerFactoryBean.setSolrHome("<YOUR_SOLR_HOME");
return embeddedSolrServerFactoryBean;
}
have a look at this url https://wiki.apache.org/solr/Solr.xml%20(supported%20through%204.x)
To enable support for dynamic SolrCore administration, place a file
named solr.xml in the solr.home directory. Here is an example solr.xml
file:
<solr persistent="true" sharedLib="lib">
<cores adminPath="/admin/cores">
<core name="core0" instanceDir="core0" />
<core name="core1" instanceDir="core1" />
</cores>
</solr>
Thus, change the file solr.xml with the an appropriate entry
<core name="core0" instanceDir="core0" />
Hope this helps

How to perform nested aggregation on multiple fields in Solr?

I am trying to perform search result aggregation (count and sum) grouping by several fields in a nested fashion.
For example, with the schema shown at the end of this post, I'd like to be able to get the sum of "size" grouped by "category" and sub-grouped further by "subcategory" and get something like this:
<category name="X">
<subcategory name="X_A">
<size sum="..." />
</subcategory>
<subcategory name="X_B">
<size sum="..." />
</subcategory>
</category>
....
I've been looking primarily at Solr's Stats component which, as far as I can see, doesn't allow nested aggregation.
I'd appreciate it if anyone knows of some way to implement this, with or without the Stats component.
Here is a cut-down version of the target schema:
<types>
<fieldType name="string" class="solr.StrField" />
<fieldType name="text" class="solr.TextField">
<analyzer><tokenizer class="solr.StandardTokenizerFactory" /></analyzer>
</fieldType>
<fieldType name="date" class="solr.DateField" />
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0" />
</types>
<fields>
<field name="id" type="string" indexed="true" stored="true" />
<field name="category" type="text" indexed="true" stored="true" />
<field name="subcategory" type="text" indexed="true" stored="true" />
<field name="pdate" type="date" indexed="true" stored="true" />
<field name="size" type="int" indexed="true" stored="true" />
</fields>
The new faceting module in Solr 5.1 can do this, it was added in https://issues.apache.org/jira/browse/SOLR-7214
Here is how you would add sum(size) to every facet bucket, and sort descending by that statistic.
json.facet={
categories:{terms:{
field:category,
sort:"total_size desc", // this will sort the facet buckets by your stat
facet:{
total_size:"sum(size)" // this calculates the stat per bucket
}
}}
}
And this is how you would add in the subfacet on subcategory:
json.facet={
categories:{terms:{
field:category,
sort:"total_size desc",
facet:{
total_size:"sum(size)",
subcat:{terms:{ // this will facet on the subcategory field for each bucket
field:subcategory,
facet:{
sz:"sum(size)" // this calculates the sum per sub-cat bucket
}}
}
}}
}
So the above will give you the sum(size) at both the category and subcategory levels. Documentation for the new facet module is currently at http://yonik.com/json-facet-api/
There is a patch SOLR-3583, which adds percentiles and averages to facets, pivot facets, and distributed pivot facets by making use of range facet internals. It is possible to add sums to pivot facets by improving this patch.
For example, averages can be calculated for categories using this url:
http://localhost:8983/solr/select?q=*%3A*
&facet=true
&facet.pivot=category,subcategory
&facet.stats.percentiles=true
&facet.stats.percentiles.averages=true
&facet.stats.percentiles.field=size
&f.size.stats.percentiles.requested=25,50,75
&f.size.stats.percentiles.lower.fence=0
&f.size.stats.percentiles.upper.fence=1000
&f.size.stats.percentiles.gap=10
See also this video and slides for more details.
1. Counts
To get the counts, you can use Pivot Faceting. It will generate a list very similar to what you asked but with counts only.
You'll need to append this to your query:
&facet=true&facet.pivot=category,subcategory
Note that this works on Solr 4.0 and after.
2. Sums
As for the sums, I think you can achieve them with ordinary facets but using a facet query instead of facet field.. I'm not entirely sure of this one, I'll try it out and re-post if found anything useful.

how to store in SOLR (mini) relational data

My data set is title, description and tags.
I would like to store and index in the SOLR the tag_name and their relative tag_id.
As can be understood, each record has one title, one description bt many tag names + tag ids.
I guess I can store the tags as "some-tag-[id]" but his seems wrong.
You can index tags and tags_id as multivalued fields and add in order.
The order is maintained so you can map them within the fields.
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="tags_id" type="string" indexed="false" stored="true" multiValued="true"/>
Response -
<arr name="tags">
<str>tag1</str>
<str>tag2</str>
<str>tag3</str>
</arr>
<arr name="tags_id">
<str>id1</str>
<str>id2</str>
<str>id3</str>
</arr>

Data Import Handler in Solr

I am trying the Data Import Handler for MySQL Database.
I added the DIhandler in solrconfig.xml, created a data-config.xml according to my database scheme and also added a field in the schema.xml which was different. I am connecting with MySQL database
After i connect and I run the dataimport?command=full-import i get this response
"00C:\solr\conf\data-config.xmlfull-importidle1102011-03-05 15:01:04Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.2011-03-05 15:01:042011-03-05 15:01:040:0:0.400This response format is experimental. It is likely to change in the future."
The xml files are in this http://pastebin.com/iKebKGSZ
<field column="manu" name="manu" />
<field column="id" name="id" />
<field column="weight" name="weight" />
<field column="price" name="price" />
<field column="popularity" name="popularity" />
<field column="instock" name="inStock" />
<field column="includes" name="includes" />
Are these fields also in your schema.xml?
I couldnt see them in the pastebin link. Make sure you have all fields in your schema as well.

Categories

Resources