How to index existing documents in Java with Solr

I am trying to write an indexer for a search engine in Solr, using Java. After a lot of searching I have found different approaches. One uses a Core Container to add the documents, after which the Solr server indexes all the data. Another uses Nutch with its Solr indexer.
I am new to Solr and do not know which code to use. I also need the indexed documents to be stored, and I do not know where Solr saves them.
Would it be better to use Nutch?
Any help would be appreciated.

You can use the Solr Data Import Handler.
Basically, in the core you create a data-config.xml that declares the data source:
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/dbname" user="db_username" password="db_password"/>
Then you define how Solr should map the database columns to fields in the index, something like:
<document name="products">
<entity name="item" query="select * from item">
<field column="ID" name="id" />
</entity>
</document>
More reference here.
Also, if you need to use Java code, you can try spring-data-solr.
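If you want to start the import from plain Java, here is a minimal sketch using only the JDK. It assumes the handler is registered under /dataimport in solrconfig.xml, that Solr runs at http://localhost:8983/solr, and that the core is called products; all three are placeholders, not values from the question. The Data Import Handler runs a full import when it receives command=full-import.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class DataImportTrigger {

    // Builds the request URL for a full import. The "/dataimport" path is an
    // assumption: it must match the handler name registered in solrconfig.xml.
    static URI fullImportUri(String solrBaseUrl, String coreName) {
        return URI.create(solrBaseUrl + "/" + coreName + "/dataimport?command=full-import");
    }

    // Sends the request and returns the HTTP status code of the response.
    static int trigger(URI uri) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) uri.toURL().openConnection();
        connection.setRequestMethod("GET");
        try {
            return connection.getResponseCode();
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) {
        // Hypothetical base URL and core name; replace them with your own.
        URI uri = fullImportUri("http://localhost:8983/solr", "products");
        System.out.println(uri);
        // With Solr running, call trigger(uri) to start the import.
    }
}
```

As far as I know the import then runs in the background, and the same handler reports progress when called with command=status.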
Hope this helps!

Related

hibernate version property mapping in reverse engineering file

I'm working on an application that uses 66 tables (SQL Server 2012). Hibernate refuses to recognise the "identity" property set for each id column, so I have to put the <table> element below in my reverse engineering file to fix that, in case I have to re-map my tables after changing them in SQL Server 2012:
<table name="EMPLOYEE">
<primary-key>
<generator class="identity"/>
</primary-key>
</table>
The question is: I want to do the same for the version property, without having to go into all 66 hbm.xml files to add <version name="version" type="long" /> every time I map my tables.
Is there a way to set something in the reverse engineering or somewhere else so that hibernate automatically adds the version property?
Thanks

Figured this out. The answer is to make sure the versioning column in your database is called "version"; Hibernate will then set up the versioning in the XML mapping for you automatically.

How to give weight to the specific field?

I am using Apache Solr for indexing and searching. I need to give weight to a specific field, so that a search matches first on the most heavily weighted field and then on the others.
I am using SolrJ, Java, and GWT for development.

To boost at index time you need to supply a boost attribute in your update document.
<add overwrite="true">
<doc boost="2.0">
<field name="id">1234</field>
<field name="type">type1</field>
</doc>
<doc>
<field name="id">2345</field>
<field name="type" boost="0.5">type2</field>
</doc>
</add>
The example above shows how to boost a complete document (the boost attribute on the first <doc>) as well as how to boost a specific field (the boost attribute on the type field of the second <doc>).
For more documentation look here and here

Using the dismax (or edismax) query handler, you can set the qf (Query Fields) parameter to assign boosts to different fields. It uses this format:
field1^boost_val field2^boost_val ... etc.
There are other good parameters to help you control your result ranking as well.
http://wiki.apache.org/solr/ExtendedDisMax
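
As a rough illustration of the qf format in Java, the sketch below builds a qf value from a map of field names to boosts. The field names title and description are made up for the example, and so are the class and method names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class QueryFieldBoosts {

    // Renders field boosts in the qf format: "field1^boost field2^boost".
    static String toQf(Map<String, Double> boosts) {
        StringJoiner qf = new StringJoiner(" ");
        for (Map.Entry<String, Double> entry : boosts.entrySet()) {
            qf.add(entry.getKey() + "^" + entry.getValue());
        }
        return qf.toString();
    }

    public static void main(String[] args) {
        // Hypothetical field names; use the fields from your own schema.
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("title", 3.0);
        boosts.put("description", 1.0);
        // The value still has to be URL-encoded if it is sent over HTTP by hand.
        System.out.println("qf=" + toQf(boosts));
    }
}
```

With SolrJ the resulting string would normally be set as the qf parameter on the query, together with defType=edismax (or dismax).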

How to split xml to header and items using smooks?

I have an XML file roughly like this:
<batch>
<header>
<headerStuff />
</header>
<contents>
<timestamp />
<invoices>
<invoice>
<invoiceStuff />
</invoice>
<!-- Insert 1000 invoice elements here -->
</invoices>
</contents>
</batch>
I would like to split that file into 1000 files, each with the same headerStuff and only one invoice. The Smooks documentation makes much of its transformation capabilities, but I don't want to transform anything.
The only way I've figured out to do this is to repeat the whole structure in a FreeMarker template. That feels like unnecessary duplication, and since the header has around 30 different tags it would also be a lot of work.
What I currently have is this:
<?xml version="1.0" encoding="UTF-8"?>
<smooks-resource-list xmlns="http://www.milyn.org/xsd/smooks-1.1.xsd"
xmlns:calc="http://www.milyn.org/xsd/smooks/calc-1.1.xsd"
xmlns:frag="http://www.milyn.org/xsd/smooks/fragment-routing-1.2.xsd"
xmlns:file="http://www.milyn.org/xsd/smooks/file-routing-1.1.xsd">
<params>
<param name="stream.filter.type">SAX</param>
</params>
<frag:serialize fragment="INVOICE" bindTo="invoiceBean" />
<calc:counter countOnElement="INVOICE" beanId="split_calc" start="1" />
<file:outputStream openOnElement="INVOICE" resourceName="invoiceSplitStream">
<file:fileNamePattern>invoice-${split_calc}.xml</file:fileNamePattern>
<file:destinationDirectoryPattern>target/invoices</file:destinationDirectoryPattern>
<file:highWaterMark mark="10"/>
</file:outputStream>
<resource-config selector="INVOICE">
<resource>org.milyn.routing.io.OutputStreamRouter</resource>
<param name="beanId">invoiceBean</param>
<param name="resourceName">invoiceSplitStream</param>
<param name="visitAfter">true</param>
</resource-config>
</smooks-resource-list>
That creates a file for each invoice tag, but I don't know how to continue from there to get the header into each file as well.
EDIT:
The solution has to use Smooks. We use it in an application as a generic splitter and just create different smooks configuration files for different types of input files.

I just started with Smooks myself. However... your problem sounds identical to this: http://www.smooks.org/mediawiki/index.php?title=V1.5:Smooks_v1.5_User_Guide#Routing_to_File
You will have to provide the whole output format as an FTL template; that's the downside of using a general-purpose tool, I guess. Data mapping often involves a lot of what feels like redundancy. One way around this is to rely on convention, but that has to be built into the framework.

I don't know Smooks, but the simplest solution (with poor performance) would be, to create the Nth file:
copy the whole XML structure
delete all the invoice tags but the Nth one
I don't know how to do that in Smooks; it's only an idea. In this case you don't need to duplicate the structure of the XML in a FreeMarker template.
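
To make the idea concrete, here is a sketch that uses only the JDK's DOM API. It is not Smooks, so it does not meet the constraint in the question, and it loads the whole batch into memory. The element name invoice comes from the sample XML; the class and method names are made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InvoiceSplitter {

    // Returns one XML string per <invoice>: the full batch with all other invoices removed.
    static List<String> split(String batchXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document batch = builder.parse(new InputSource(new StringReader(batchXml)));
            int invoiceCount = batch.getElementsByTagName("invoice").getLength();

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            List<String> results = new ArrayList<>();
            for (int keep = 0; keep < invoiceCount; keep++) {
                // Step 1: copy the whole XML structure into a fresh document.
                Document copy = builder.newDocument();
                copy.appendChild(copy.importNode(batch.getDocumentElement(), true));

                // Step 2: delete every invoice except the one at index `keep`.
                // The NodeList is live, so collect the nodes before removing them.
                NodeList invoices = copy.getElementsByTagName("invoice");
                List<Node> toRemove = new ArrayList<>();
                for (int i = 0; i < invoices.getLength(); i++) {
                    if (i != keep) {
                        toRemove.add(invoices.item(i));
                    }
                }
                for (Node invoice : toRemove) {
                    invoice.getParentNode().removeChild(invoice);
                }

                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(copy), new StreamResult(out));
                results.add(out.toString());
            }
            return results;
        } catch (Exception e) {
            // Parsing and serialization errors are wrapped to keep the sketch short.
            throw new IllegalStateException("Could not split batch", e);
        }
    }

    public static void main(String[] args) {
        String batch = "<batch><header><headerStuff/></header><contents><timestamp/><invoices>"
                + "<invoice>INV-1</invoice><invoice>INV-2</invoice>"
                + "</invoices></contents></batch>";
        for (String part : split(batch)) {
            System.out.println(part);
        }
    }
}
```

Each returned string keeps the header and the rest of the batch structure and contains exactly one invoice. Copying the whole document once per invoice is what makes this approach slow for large batches.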

JAXB multiple mappings for attribute

I'm fixing design errors made in the past, but want to keep my software backwards compatible. For this I need some way to map two flavors of an XML file onto one Java bean. Can this be done using two JAXB annotations on one attribute/element? I understand that marshalling would be ambiguous, but unmarshalling could work. Is there a nice way of doing this?
p.s.: I don't care about marshalling.

You can map twice:
the first time using annotations
the second time using XML resources.
Or just two XML mappings instead of annotations.
For XML resource mappings, there are a number of options:
Annox: http://confluence.highsource.org/display/ANX/JAXB+User+Guide (I wrote it so it comes first :))
EclipseLink Moxy: http://eclipse.org/eclipselink/moxy.php
JAXB Introductions: http://community.jboss.org/wiki/JAXBIntroductions
With Annox you can easily map twice using XML mapping resources with different extensions like MyClass.ann1.xml or MyClass.ann2.xml. (It's MyClass.ann.xml by default, but the adjustment is trivial.)
Here's a sample of what mappings look like:
<class xmlns="http://annox.dev.java.net" xmlns:annox="http://annox.dev.java.net" xmlns:jaxb="http://annox.dev.java.net/javax.xml.bind.annotation">
<jaxb:XmlAccessorType value="FIELD"/>
<jaxb:XmlType name="" propOrder="productName quantity usPrice comment shipDate"/>
<field name="productName">
<jaxb:XmlElement required="true"/>
</field>
<field name="usPrice">
<jaxb:XmlElement name="USPrice" required="true"/>
</field>
<field name="shipDate">
<jaxb:XmlSchemaType name="date"/>
</field>
<field name="partNum">
<jaxb:XmlAttribute required="true"/>
</field>
</class>

What is document tag in data-config.xml of Solr

I am working with a Solr server. I want to import data from a MySQL database into Solr. The following is the db-data-config.xml:
<dataConfig>
<dataSource type="JdbcDataSource" driver="org.gjt.mm.mysql.Driver" url="jdbc:mysql://192.168.1.9:3306/angara" user="dev_user" password="ampliflex" />
<document>
<entity name="tdiamonds1" query="select UID_PK, ProductUID, name, price,Weight from tdiamonds">
</entity>
</document>
</dataConfig>
I want to know what the <document> tag indicates here. Can we have more than one such tag? If possible, please point me to a good example.

<document> encapsulates one or more entity elements. You can also specify a name for the given document.
