I'm using Elasticsearch 6.x with the ingest plugin so I can query inside documents.
I managed to insert a record with an attachment document, and I'm able to query it against various fields.
When I query the content of the file, I do this:
boolQuery.filter(new MatchPhrasePrefixQueryBuilder("attachment.content", "St. Anna Church"))
It works, but now I want to query with the text "Church Wall People", which is not a complete phrase: I want back all the documents that contain the words Church, Wall, and People.
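For reference, a minimal sketch of one way to do this with the Elasticsearch 6.x Java API (reusing the boolQuery from above): a match query with Operator.AND should match only documents that contain all of the given terms, in any order.

import org.elasticsearch.index.query.Operator;
import org.elasticsearch.index.query.QueryBuilders;

// Match documents whose attachment.content contains ALL the words
// Church, Wall and People, in any order (unlike a phrase query).
boolQuery.filter(QueryBuilders.matchQuery("attachment.content", "Church Wall People")
        .operator(Operator.AND));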
I am a non-programmer. I have an ontology in OWL format. I also have an Excel sheet (it contains numeric data with headers taken from the ontology). Now I have to connect the Excel headers with the ontology framework and extract the links in the Excel data from the ontology.
Do I understand correctly that you have an RDF knowledge base whose schema is described by an OWL ontology, and that you want to import this data from RDF into a spreadsheet?
The most straightforward way to transform RDF into a spreadsheet is a SPARQL SELECT query.
Prerequisites
If you don't already have the data in an application or endpoint where you can query it directly (e.g. Protégé may have a widget for SPARQL queries), there are three prerequisites; otherwise skip them:
1. Export/Convert the Data
If your data is in an application where you can't perform SPARQL queries, or in a file using a syntax such as OWL/XML, you need to convert it first: most SPARQL endpoints don't understand that format, but rather need an RDF serialization such as N-Triples, RDF Turtle, or RDF/XML, so export the data in one of those formats.
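As a rough sketch, the conversion could be done with Apache Jena (the file names here are placeholders; note that Jena reads RDF/XML and Turtle but not OWL/XML, so export to RDF/XML first, e.g. from Protégé):

import java.io.FileOutputStream;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.RDFFormat;

// Read the exported RDF/XML file and re-serialize it as Turtle,
// which any SPARQL endpoint should accept.
Model model = ModelFactory.createDefaultModel();
RDFDataMgr.read(model, "ontology.rdf");
RDFDataMgr.write(new FileOutputStream("ontology.ttl"), model, RDFFormat.TURTLE);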
2. Set Up a SPARQL Endpoint
Now you can install e.g. a Virtuoso SPARQL endpoint, either locally or on a server, or use the endpoint of someone else who gives you access credentials.
It can take a while to install, but you can use a Docker image if that is easier.
3. Upload the Data
In Virtuoso, you can now upload the ontology and the instance data in the Conductor under "Linked Data" -> "Quad Store Upload".
Querying
I don't know of any existing tool that automatically maps ontologies and downloads instances according to a given Excel sheet template, so I recommend creating a SPARQL SELECT query manually.
Example
Let's say your Excel sheet has the column headers "name", "age", and "height" (you said you have numeric data), and the ontology defines a person class like this in RDF Turtle:
:Person a owl:Class;
    rdfs:label "Person"@en.
:age a owl:DatatypeProperty;
    rdfs:label "age"@en;
    rdfs:domain :Person;
    rdfs:range xsd:nonNegativeInteger.
:height a owl:DatatypeProperty;
    rdfs:label "height"@en;
    rdfs:domain :Person;
    rdfs:range xsd:decimal.
Now you can write the following SPARQL SELECT query:
PREFIX : <http://my.prefix/>
SELECT ?person ?age ?height
{
  ?person a :Person;
          :age ?age;
          :height ?height.
}
This will generate a result table, which you can obtain in different formats. Choose the CSV format, and you can then import it into MS Excel, which solves your problem as I understand it.
In my Java web application (JSP + Servlet + Hibernate), users can request books. The request is stored in the database as text. After that I tokenize the text using Apache OpenNLP. Then I need to compare the tokenized text with the books table (the books table has Book ID, Book Name, Author, Description) and give the most relevant suggestions to the user. Mostly I need to compare against the book name and book description columns. Is this possible?
import opennlp.tools.tokenize.SimpleTokenizer;

public class SimpleTokenizerExample {
    public static void main(String[] args) {
        String sentence = "Hello Guys , I like to read horror stories. If you have any horror story books please share with us. Also my favorite author is Stephen King";
        // Instantiating the SimpleTokenizer class
        SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
        // Tokenizing the given sentence
        String[] tokens = simpleTokenizer.tokenize(sentence);
        // Printing the tokens
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}
Apache OpenNLP can do Natural Language Processing, but the task you describe is Information Retrieval. Take a look at http://lucene.apache.org/solr/.
If you really need to use only the DB, you can try running a query for each token using the SQL LIKE keyword:
SELECT DISTINCT * FROM mytable WHERE description LIKE '%token%';
and rank higher the rows with more matching tokens.
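A rough JDBC sketch of that idea (the books table, book_id column, conn connection, and tokens array are placeholders for your own schema and code):

import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.Map;

// Run one LIKE query per token and count how many tokens each book
// matches; books with more matching tokens rank higher.
Map<Integer, Integer> hits = new HashMap<>();
try (PreparedStatement ps = conn.prepareStatement(
        "SELECT book_id FROM books WHERE description LIKE ?")) {
    for (String token : tokens) {
        ps.setString(1, "%" + token + "%");
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                hits.merge(rs.getInt("book_id"), 1, Integer::sum);
            }
        }
    }
}
// hits now maps each book id to its match count; sort descending to rank.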
How can OpenNLP help you?
You can use the OpenNLP Stemmer. In that case you can stem the book description and title before adding them to the database columns. You also need to stem the query. This will help you with inflections: "car" will match "cars" and "car".
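A minimal sketch with OpenNLP's PorterStemmer (assuming a recent opennlp-tools version that includes the opennlp.tools.stemmer package):

import opennlp.tools.stemmer.PorterStemmer;

// Stem tokens the same way at index time and at query time,
// so inflected forms like "cars" and "car" reduce to one term.
PorterStemmer stemmer = new PorterStemmer();
for (String token : new String[] {"car", "cars", "stories"}) {
    System.out.println(token + " -> " + stemmer.stem(token));
}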
You could accomplish the same with the OpenNLP Lemmatizer, but it needs a trained model, which is not available for that module today.
Just to add to what @wcolen says: out-of-the-box stemmers also exist for various languages in Lucene.
Another thing OpenNLP could help with is recognizing book author names (e.g. Stephen King) via the NameFinder tool, so that your code can create a phrase query for such entities instead of a plain keyword query (with the result that you won't get results containing just Stephen or King, but only results containing Stephen King).
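A sketch of that idea (the en-ner-person.bin model path is a placeholder; pre-trained person-name models are distributed separately by the OpenNLP project):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

// Detect person names among the tokens so they can be turned into
// phrase queries ("Stephen King") rather than separate keywords.
try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
    TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
    NameFinderME nameFinder = new NameFinderME(model);
    String[] tokens = SimpleTokenizer.INSTANCE.tokenize(
            "my favorite author is Stephen King");
    for (Span span : nameFinder.find(tokens)) {
        String name = String.join(" ",
                Arrays.copyOfRange(tokens, span.getStart(), span.getEnd()));
        System.out.println("Found person: " + name);
    }
}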
I have two or more collections in MongoDB that are replicated to Solr indexes using the mongo-solr connector. For the sake of explaining the problem I am facing, let's take the traditional employee & department example (I know it's a document-oriented DB and I could embed the department in the employee document, but please allow me to explain my question with this trivial example):
Employee document:
{
    "_id": ObjectId(..),
    "firstName": "John",
    "lastName": "David",
    "departMent": ObjectId(..)  // a DBRef to the department document
}
Department document:
{
    "_id": ObjectId(..),
    "departmentName": "Marketing"
}
Let's say that the above two documents are linked in the employee document via the department's ObjectId reference. The mongo-solr connector replicates these structures as-is, and let's assume all of the fields are indexed and stored.
Now here is my question ( & the problem):
If I search the Solr index by employee firstName (or lastName), the search response should include the departmentName instead of the department ObjectId reference, and this should happen in a single search request from the client.
How do I do this using the Solr APIs?
Thanks in advance.
The ideal solution (from a Solr perspective) would of course be to store the data in Solr in denormalized form. However, if that is not a viable option, you could take a look at the Join Query Parser.
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-JoinQueryParser
You would be performing a query along the lines of (not tested):
q={!join from=department to=departmentName}lastName:David AND departmentName:Marketing
I have an archive of university theses and publications indexed (with BM25 similarity) in Lucene (Java version). I have English documents and Italian documents; for this reason I have duplicated fields like pdf and pdf_en, or titolo and titolo_en. When I have an Italian document I fill the Italian fields; otherwise I fill the English fields.
Now I have a BooleanQuery with MultiFieldQueryParser; this is my code:
String[] fieldsGEN={"url","autori","lingua","settore","pdfurl"};
String[] fieldsITA={"titolo","tipologia","abstract","pdf"};
String[] fieldsENG={"titolo_en","tipologia_en", "abstract_en","pdf_en"};
MultiFieldQueryParser parserGEN = new MultiFieldQueryParser(version, fieldsGEN, analyzerIT);
MultiFieldQueryParser parserITA = new MultiFieldQueryParser(version, fieldsITA, analyzerIT);
MultiFieldQueryParser parserENG = new MultiFieldQueryParser(version, fieldsENG, analyzerENG);
parserGEN.setDefaultOperator(QueryParser.Operator.OR);
parserITA.setDefaultOperator(QueryParser.Operator.OR);
parserENG.setDefaultOperator(QueryParser.Operator.OR);
Query query4 =parserGEN.parse(ricerca.ricerca);
bq.add(query4, Occur.SHOULD);
Query query2 =parserITA.parse(ricerca.ricerca);
bq.add(query2, Occur.SHOULD);
Query query3 =parserENG.parse(ricerca.ricerca);
bq.add(query3, Occur.SHOULD);
If I search "anna" (Name of an author) the 3 query are:
Query: [titolo:anna tipologia:anna abstract:anna pdf:anna]
Query: [titolo_en:anna tipologia_en:anna abstract_en:anna pdf_en:anna]
Query: [url:anna autori:anna lingua:anna settore:anna pdfurl:anna]
and I also get authors without the name "anna", even if they are in the last positions (about 3 documents out of the 21 returned, with 1000 indexed); I suppose the term is found in other fields.
Do you think the query is built well? Can the query be improved? How? How does a search engine like Google handle multi-field search?
Is there a better way to deal with multi-language fields?
Thanks,
Neptune.
Unless you have both translations for all documents, I would create two indexes, one for each language, using the same field names in each index. You would then use a MultiReader over both for the search queries.
The problem with this approach is words that are spelled the same in both languages but have different meanings in English and Italian. Apart from those words, I think this architecture will be easier to understand, and its results easier to interpret.
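A minimal sketch of the MultiReader setup (assuming a recent Lucene where FSDirectory.open takes a Path; the index directory names are placeholders):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

// One index per language, same field names in both; a single
// IndexSearcher over a MultiReader searches them together.
IndexReader itaReader = DirectoryReader.open(FSDirectory.open(Paths.get("index-ita")));
IndexReader engReader = DirectoryReader.open(FSDirectory.open(Paths.get("index-eng")));
IndexSearcher searcher = new IndexSearcher(new MultiReader(itaReader, engReader));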
I am trying to search data using Lucene indexing. I am using KeywordTokenizerFactory and LowerCaseFilterFactory. I am trying to get a record with the name "police name 25423", but I am not getting data. If I try "police", "name", "25423", or "police name" separately, then I get results. Why can't I get a result with the full name?
The problem is that you use KeywordTokenizerFactory. In this case, Lucene will search for documents with the single term "police name 25423". You should change the tokenizer factory to StandardTokenizerFactory; then you will search for documents with the terms "police", "name", and "25423".
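For illustration, a sketch using Lucene's CustomAnalyzer (available in Lucene 5+), which wires up the same factories by their SPI names:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

// "standard" = StandardTokenizerFactory: splits "police name 25423"
// into the terms "police", "name", "25423" before lowercasing.
// KeywordTokenizerFactory would keep the whole value as one term,
// which is why only the exact full string matched before.
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .build();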
Does any record exist containing all three words?
Check that first.