Elasticsearch missing some documents while creating an index from another index - Java

I have created an index in Elasticsearch named documents_local and loaded data into it from an Oracle database via Logstash. This index contains the following fields:
"FilePath" : "Path of the file",
"FileName" : "filename.pdf",
"Language" : "Language_name"
Then I want to index the file contents as well, so I have created another index in Elasticsearch named document_attachment. Using the Java code below, I fetch FilePath and FileName from the documents_local index, use the file path to read the file (which is available on my local drive), and index the file contents using the ingest-attachment plugin processor.
Please find my Java code below, where I index the files.
private final static String INDEX = "documents_local"; //Documents Table with file Path - Source Index
private final static String ATTACHMENT = "document_attachment"; // Documents with Attachment... -- Destination Index
private final static String TYPE = "doc";
public static void main(String args[]) throws IOException {
    RestHighLevelClient restHighLevelClient = null;
    Document doc = new Document();
    try {
        restHighLevelClient = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http"),
                new HttpHost("localhost", 9201, "http")));
    } catch (Exception e) {
        System.out.println(e.getMessage());
    }
    //Fetching Id, FilePath & FileName from Document Index.
    SearchRequest searchRequest = new SearchRequest(INDEX);
    searchRequest.types(TYPE);
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    QueryBuilder qb = QueryBuilders.matchAllQuery();
    searchSourceBuilder.query(qb);
    searchSourceBuilder.size(3000);
    searchRequest.source(searchSourceBuilder);
    SearchResponse searchResponse = null;
    try {
        searchResponse = restHighLevelClient.search(searchRequest);
    } catch (IOException e) {
        e.getLocalizedMessage();
    }
    SearchHit[] searchHits = searchResponse.getHits().getHits();
    long totalHits = searchResponse.getHits().totalHits;
    int line = 1;
    String docpath = null;
    Map<String, Object> jsonMap;
    for (SearchHit hit : searchHits) {
        String encodedfile = null;
        File file = null;
        Map<String, Object> sourceAsMap = hit.getSourceAsMap();
        doc.setId((int) sourceAsMap.get("id"));
        doc.setLanguage(sourceAsMap.get("language"));
        doc.setFilename(sourceAsMap.get("filename").toString());
        doc.setPath(sourceAsMap.get("path").toString());
        String filepath = doc.getPath().concat(doc.getFilename());
        System.out.println("Line Number--> " + line++ + "ID---> " + doc.getId() + "File Path --->" + filepath);
        file = new File(filepath);
        if (file.exists() && !file.isDirectory()) {
            try {
                FileInputStream fileInputStreamReader = new FileInputStream(file);
                byte[] bytes = new byte[(int) file.length()];
                fileInputStreamReader.read(bytes);
                encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
                fileInputStreamReader.close(); // close the stream so file handles are not leaked
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
        }
        jsonMap = new HashMap<>();
        jsonMap.put("id", doc.getId());
        jsonMap.put("language", doc.getLanguage());
        jsonMap.put("filename", doc.getFilename());
        jsonMap.put("path", doc.getPath());
        jsonMap.put("fileContent", encodedfile);
        String id = Long.toString(doc.getId());
        IndexRequest request = new IndexRequest(ATTACHMENT, "doc", id)
                .source(jsonMap)
                .setPipeline(ATTACHMENT);
        try {
            IndexResponse response = restHighLevelClient.index(request);
        } catch (ElasticsearchException e) {
            if (e.status() == RestStatus.CONFLICT) {
            }
            e.printStackTrace();
        }
    }
    System.out.println("Indexing done...");
}
Please find my ingest-attachment pipeline definition below (I created the pipeline first and then executed this Java code).
PUT _ingest/pipeline/document_attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "fileContent"
      }
    }
  ]
}
But during this process, some documents go missing.
My source index, documents_local, has 2910 documents.
I fetch all the documents, attach each PDF (after converting it to base64), and write it into the other index, document_attachment.
But the document_attachment index ends up with only 118 documents; it should have 2910. Some documents are missing. Also, indexing takes a very long time.
I am not sure why documents are missing from the second index (document_attachment). Is there any other way to improve this process?
Can we include a thread mechanism here? In the future I will have to index more than 100k PDFs in this same way.
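For illustration, here is a rough sketch of the kind of threading I have in mind. This is only a sketch: buildRequest(hit) is a placeholder for the jsonMap / IndexRequest construction shown above, and the pool size is an arbitrary example.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch only: submit each index call from a fixed thread pool instead of indexing sequentially.
ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is an arbitrary assumption
for (SearchHit hit : searchHits) {
    pool.submit(() -> {
        try {
            restHighLevelClient.index(buildRequest(hit)); // buildRequest(hit) is a hypothetical helper
        } catch (IOException | ElasticsearchException e) {
            e.printStackTrace(); // log failures so missing documents become visible
        }
    });
}
pool.shutdown();
try {
    pool.awaitTermination(1, TimeUnit.HOURS); // wait for all index calls to finish
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}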

Related

FreeMarker and HTML

I'm trying to use a FreeMarker template. Here's my code:
@GET
@Produces(MediaType.APPLICATION_JSON)
@Path("{queue}")
public String test(@PathParam("queue") String qy,
                   @Context HttpServletRequest request) {
    Map<String, Object> model = new HashMap<String, Object>();
    String listTemplate = "<HTML><HEAD><STYLE>${css}</STYLE></HEAD><BODY><UL> the queue includes : ${queues}</UL></BODY></HTML>";
    String listCss = inMemService.getWebCss("list");
    model.put("css", listCss);
    String result = null;
    IQueue queue = com.myproject.hc.Main.getHzInstance().getQueue(qy);
    // Build the data-model
    Object[] array = queue.toArray();
    int size = queue.size();
    String temp = ""; // declared here so the snippet compiles
    int i;
    for (i = 0; i < size; i++) {
        temp = temp + " " + array[i] + " ";
    }
    model.put("queues", temp);
    Template template;
    try {
        template = new Template("process", new StringReader(
                listTemplate), new Configuration());
        Writer stringWriter = new StringWriter();
        template.process(model, stringWriter);
        stringWriter.flush();
        result = stringWriter.toString();
    } catch (IOException e) {
    } catch (TemplateException e) {
    }
    return result;
}
However, when I add to the queue and go to the @Path, I get the following output, which is not rendered as HTML:
<HTML><HEAD><STYLE>body {background-color: lightgrey;} h1 {color: black;text-align: left;} p {font-family: verdana;font-size: 20px;}</STYLE></HEAD><BODY><UL> the queue includes : LA NYC </UL></BODY></HTML>
Your HTML is fine; you need to declare in code that you are returning HTML (and not JSON).
You should replace @Produces(MediaType.APPLICATION_JSON) with:
@Produces({MediaType.TEXT_HTML})
The @Produces annotation is used to specify the MIME media types or representations a resource can produce and send back to the client.
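For completeness, a sketch of the resource method with only the annotation changed (the body stays exactly as in the question):
@GET
@Produces(MediaType.TEXT_HTML) // tell JAX-RS this endpoint returns HTML, not JSON
@Path("{queue}")
public String test(@PathParam("queue") String qy,
                   @Context HttpServletRequest request) {
    String result = null;
    // ... build the model and process the FreeMarker template exactly as in the question ...
    return result;
}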

Elasticsearch Java API (SearchScroll) - search_context_missing_exception: "No search context found for id"

I am fetching more than 100k documents from one index using searchScroll, adding one more field to each of the 100k documents, and then inserting those documents into another, new index.
I am using the SearchScroll API and setting the page size with searchSourceBuilder.size(100); I also tried increasing it to searchSourceBuilder.size(1000). In both cases I get the error below after processing 18100 documents (with size(100)) and 21098 documents (with size(1000)):
search_context_missing_exception","reason":"No search context found for id
The error is thrown on this line: searchResponse = SearchEngineClient.getInstance().searchScroll(scrollRequest);
Please find my complete stack trace:
Exception in thread "main" ElasticsearchStatusException[Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]]; nested: ElasticsearchException[Elasticsearch exception [type=search_context_missing_exception, reason=No search context found for id [388]]];
    at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:177)
    at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:573)
    at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:549)
    at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:456)
    at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:429)
    at org.elasticsearch.client.RestHighLevelClient.searchScroll(RestHighLevelClient.java:387)
    at com.es.utility.DocumentIndex.main(DocumentIndex.java:101)
    Suppressed: org.elasticsearch.client.ResponseException: method [GET], host [http://localhost:9200], URI [/_search/scroll], status line [HTTP/1.1 404 Not Found]
{"error":{"root_cause":[{"type":"search_context_missing_exception","reason":"No search context found for id [390]"},{"type":"search_context_missing_exception","reason":"No search context found for id [389]"},{"type":"search_context_missing_exception","reason":"No search context found for id [392]"},{"type":"search_context_missing_exception","reason":"No search context found for id [391]"},{"type":"search_context_missing_exception","reason":"No search context found for id [388]"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":-1,"index":null,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [390]"}},{"shard":-1,"index":null,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [389]"}},{"shard":-1,"index":null,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [392]"}},{"shard":-1,"index":null,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [391]"}},{"shard":-1,"index":null,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [388]"}}],"caused_by":{"type":"search_context_missing_exception","reason":"No search context found for id [388]"}},"status":404}
        at org.elasticsearch.client.RestClient$1.completed(RestClient.java:357)
        at org.elasticsearch.client.RestClient$1.completed(RestClient.java:346)
        at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
        at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
        at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
        at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
        at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
        at java.lang.Thread.run(Unknown Source)
Caused by: ElasticsearchException[Elasticsearch exception [type=search_context_missing_exception, reason=No search context found for id [388]]]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:490)
    at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:406)
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:435)
    at org.elasticsearch.ElasticsearchException.failureFromXContent(ElasticsearchException.java:594)
    at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:169)
    ... 6 more
Please find my Java code:
public class DocumentIndex {

    private final static String INDEX = "documents";
    private final static String ATTACHMENT = "document_attachment";
    private final static String TYPE = "doc";
    private static final Logger logger = Logger.getLogger(Thread.currentThread().getStackTrace()[0].getClassName());

    public static void main(String args[]) throws IOException {
        RestHighLevelClient restHighLevelClient = null;
        Document doc = new Document();
        logger.info("Started Indexing the Document.....");
        try {
            restHighLevelClient = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http"),
                    new HttpHost("localhost", 9201, "http")));
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }

        //Fetching Id, FilePath & FileName from Document Index.
        SearchRequest searchRequest = new SearchRequest(INDEX);
        searchRequest.types(TYPE);
        final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(1L)); //part of Scroll API
        searchRequest.scroll(scroll); //part of Scroll API
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        QueryBuilder qb = QueryBuilders.matchAllQuery();
        searchSourceBuilder.query(qb);
        searchSourceBuilder.size(100);
        searchRequest.source(searchSourceBuilder);

        SearchResponse searchResponse = SearchEngineClient.getInstance().search(searchRequest);
        String scrollId = searchResponse.getScrollId(); //part of Scroll API
        SearchHit[] searchHits = searchResponse.getHits().getHits();
        long totalHits = searchResponse.getHits().totalHits;
        logger.info("Total Hits --->" + totalHits);

        //part of Scroll API -- Starts
        while (searchHits != null && searchHits.length > 0) {
            SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
            scrollRequest.scroll(scroll);
            searchResponse = SearchEngineClient.getInstance().searchScroll(scrollRequest);
            scrollId = searchResponse.getScrollId();
            searchHits = searchResponse.getHits().getHits();

            File all_files_path = new File("d:\\All_Files_Path.txt");
            File available_files = new File("d:\\Available_Files.txt");
            File missing_files = new File("d:\\Missing_Files.txt");
            int totalFilePath = 1;
            int totalAvailableFile = 1;
            int missingFilecount = 1;
            Map<String, Object> jsonMap;
            for (SearchHit hit : searchHits) {
                String encodedfile = null;
                File file = null;
                Map<String, Object> sourceAsMap = hit.getSourceAsMap();
                if (sourceAsMap != null) {
                    doc.setId((int) sourceAsMap.get("id"));
                    doc.setApp_language(String.valueOf(sourceAsMap.get("app_language")));
                }
                String filepath = doc.getPath().concat(doc.getFilename());
                logger.info("ID---> " + doc.getId() + "File Path --->" + filepath);
                try (PrintWriter out = new PrintWriter(new FileOutputStream(all_files_path, true))) {
                    out.println("FilePath Count ---" + totalFilePath + ":::::::ID---> " + doc.getId() + "File Path --->" + filepath);
                }
                file = new File(filepath);
                if (file.exists() && !file.isDirectory()) {
                    try {
                        try (PrintWriter out = new PrintWriter(new FileOutputStream(available_files, true))) {
                            out.println("Available File Count --->" + totalAvailableFile + ":::::::ID---> " + doc.getId() + "File Path --->" + filepath);
                            totalAvailableFile++;
                        }
                        FileInputStream fileInputStreamReader = new FileInputStream(file);
                        byte[] bytes = new byte[(int) file.length()];
                        fileInputStreamReader.read(bytes);
                        encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
                        fileInputStreamReader.close();
                    } catch (FileNotFoundException e) {
                        e.printStackTrace();
                    }
                } else {
                    System.out.println("Else block");
                    PrintWriter out = new PrintWriter(new FileOutputStream(missing_files, true));
                    out.println("Available File Count --->" + missingFilecount + ":::::::ID---> " + doc.getId() + "File Path --->" + filepath);
                    out.close();
                    missingFilecount++;
                }
                jsonMap = new HashMap<>();
                jsonMap.put("id", doc.getId());
                jsonMap.put("app_language", doc.getApp_language());
                jsonMap.put("fileContent", encodedfile);
                String id = Long.toString(doc.getId());
                IndexRequest request = new IndexRequest(ATTACHMENT, "doc", id)
                        .source(jsonMap)
                        .setPipeline(ATTACHMENT);
                PrintStream printStream = new PrintStream(new File("d:\\exception.txt"));
                try {
                    IndexResponse response = SearchEngineClient.getInstance2().index(request);
                } catch (ElasticsearchException e) {
                    if (e.status() == RestStatus.CONFLICT) {
                    }
                    e.printStackTrace(printStream);
                }
                totalFilePath++;
            }
        }

        ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
        clearScrollRequest.addScrollId(scrollId);
        ClearScrollResponse clearScrollResponse = restHighLevelClient.clearScroll(clearScrollRequest);
        boolean succeeded = clearScrollResponse.isSucceeded();
        //part of Scroll API -- Ends
        logger.info("Indexing done.....");
    }
}
I am using ES version 6.2.3.
You're getting that error because your search context expires before you have fetched and processed all the results. To solve this, keep the search context alive for a longer duration; see Keeping the search context alive.
Increase the time value of your scroll.
new Scroll(TimeValue.timeValueMinutes(new_value));
Increase new_value to whatever suits your requirement.
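For example, here is a rough sketch based on your code, changing only the keep-alive (60 minutes is just an example value; size it to how long you need to process one batch of hits):
// Keep the search context alive long enough to process each page of results.
final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(60L)); // 60 minutes is an arbitrary example
searchRequest.scroll(scroll);

SearchResponse searchResponse = SearchEngineClient.getInstance().search(searchRequest);
String scrollId = searchResponse.getScrollId();
SearchHit[] searchHits = searchResponse.getHits().getHits();

while (searchHits != null && searchHits.length > 0) {
    // ... process the current page of hits, as in your code ...
    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
    scrollRequest.scroll(scroll); // renews the keep-alive on every scroll request
    searchResponse = SearchEngineClient.getInstance().searchScroll(scrollRequest);
    scrollId = searchResponse.getScrollId();
    searchHits = searchResponse.getHits().getHits();
}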

How to use Wordnet Synonyms with Hibernate Search?

I've been trying to figure out how to use WordNet synonyms with a search function I'm developing which uses Hibernate Search 5.6.1. At first, I thought about using Hibernate Search annotations:
@TokenFilterDef(factory = SynonymFilterFactory.class, params = {
        @Parameter(name = "ignoreCase", value = "true"),
        @Parameter(name = "expand", value = "true"),
        @Parameter(name = "synonyms", value = "synonymsfile") })
However, this requires an actual file populated with synonyms. From WordNet I was only able to get ".pl" files. So I tried manually making a SynonymAnalyzer class which would read from the ".pl" file:
public class SynonymAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final Tokenizer source = new StandardTokenizer();
        TokenStream result = new StandardFilter(source);
        result = new LowerCaseFilter(result);
        SynonymMap wordnetSynonyms = null;
        try {
            wordnetSynonyms = loadSynonyms();
        } catch (IOException e) {
            e.printStackTrace();
        }
        result = new SynonymFilter(result, wordnetSynonyms, false);
        result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(source, result);
    }

    private SynonymMap loadSynonyms() throws IOException {
        File file = new File("synonyms\\wn_s.pl");
        InputStream stream = new FileInputStream(file);
        Reader reader = new InputStreamReader(stream);
        SynonymMap.Builder parser = null;
        parser = new WordnetSynonymParser(true, true, new StandardAnalyzer(CharArraySet.EMPTY_SET));
        try {
            ((WordnetSynonymParser) parser).parse(reader);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return parser.build();
    }
}
The problem with this method is that I'm getting java.lang.OutOfMemoryError, which I'm assuming is because there are too many synonyms or something. What is the proper way to do this? Everywhere I've looked online suggests using WordNet, but I can't seem to find an example with Hibernate Search annotations. Any help is appreciated, thanks!
The wordnet format is actually supported by SynonymFilterFactory. You're simply missing the "format" parameter in your annotation configuration; by default, the factory uses the Solr format.
Change your annotation to this:
@TokenFilterDef(
        factory = SynonymFilterFactory.class,
        params = {
                @Parameter(name = "ignoreCase", value = "true"),
                @Parameter(name = "expand", value = "true"),
                @Parameter(name = "synonyms", value = "synonymsfile"),
                @Parameter(name = "format", value = "wordnet") // Add this
        }
)
Also, make sure that the value of the "synonyms" parameter is the path of a file in your classpath (e.g. "com/acme/synonyms.pl", or just "synonyms.pl" if the file is at the root of your "resources" directory).
In general, when you have an issue with the parameters of a Lucene filter/tokenizer factory, your best bet is to have a look at the source code of that factory, or at this page.

How to get the server base path to find and read a JSON file

I have this REST endpoint and I'm trying to mock some responses.
I'm working on a WebSphere server with Spring Boot.
@RequestMapping(method = RequestMethod.GET, value = "", produces = {MediaType.APPLICATION_JSON_VALUE, "application/hal+json"})
public Resources<String> getAssemblyLines() throws IOException {
    String fullMockPath = servletContext.getContextPath() + "\\assets\\services-mocks\\assembly-lines\\get-assembly-lines.ok.json";
    List<String> result = new ArrayList<String>();
    result.add(fullMockPath);
    try {
        byte[] rawJson = Files.readAllBytes(Paths.get(fullMockPath));
        Map<String, String> mappedJson = new HashMap<String, String>();
        String jsonMock = new ObjectMapper().writeValueAsString(mappedJson);
        result = new ArrayList<String>();
        result.add(jsonMock);
    } catch (IOException e) {
        result.add("Not found");
        result.add(e.getMessage());
    }
    return new Resources<String>(result,
            linkTo(methodOn(this.getClass()).getAssemblyLines()).withSelfRel());
}
I get
FileNotFoundException
I tried Tushinov's solution:
System.getProperty("user.dir");
But that returns the path of my server, not of my document root (and yes, they're in different folders).
How can I find my base path?
Regarding your question "How can I find my base path?", you can use:
System.getProperty("user.dir")
System.getProperty("user.dir") will return the path to your project.
Example output:
C:\folder_with_java_projects\CURRENT_PROJECT
So if the file is inside your project folder you can just do the following:
System.getProperty("user.dir") + "\\somePackage\\someJson.json";

Lucene Index - single term and phrase querying

I've read some documents and built a Lucene index which looks like this:
Documents:
id 1
keyword foo bar
keyword john
id 2
keyword foo
id 3
keyword john doe
keyword bar foo
keyword what the hell
I want to query Lucene in a way where I can combine single terms and phrases.
Let's say my query is
foo bar
should give back doc ids 1, 2 and 3
The query
"foo bar"
should give back doc id 1
The query
john
should give back doc ids 1 and 3
The query
john "foo bar"
should give back doc id 1
My implementation in Java is not working. Reading tons of documentation didn't help either.
When I query my index with
"foo bar"
I get 0 hits
When I query my index with
foo "john doe"
I get back doc ids 1, 2 and 3 (I would expect only doc id 3, since the query is meant as foo AND "john doe"). The problem is that "john doe" gives back 0 hits, but foo gives back 3 hits.
My goal is to combine single terms and phrase terms. What am I doing wrong? I've also played around with the analyzers, with no luck.
My implementation looks like this:
Indexer
import ...

public class Indexer
{
    private static final Logger LOG = LoggerFactory.getLogger(Indexer.class);

    private final File indexDir;
    private IndexWriter writer;

    public Indexer(File indexDir)
    {
        this.indexDir = indexDir;
        this.writer = null;
    }

    private IndexWriter createIndexWriter()
    {
        try
        {
            Directory dir = FSDirectory.open(indexDir);
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, analyzer);
            iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
            iwc.setRAMBufferSizeMB(256.0);
            IndexWriter idx = new IndexWriter(dir, iwc);
            idx.deleteAll();
            return idx;
        } catch (IOException e)
        {
            throw new RuntimeException(String.format("Could create indexer on directory [%s]", indexDir.getAbsolutePath()), e);
        }
    }

    public void index(TestCaseDescription desc)
    {
        if (writer == null)
            writer = createIndexWriter();
        Document doc = new Document();
        addPathToDoc(desc, doc);
        addLastModifiedToDoc(desc, doc);
        addIdToDoc(desc, doc);
        for (String keyword : desc.getKeywords())
            addKeywordToDoc(doc, keyword);
        updateIndex(doc, desc);
    }

    private void addIdToDoc(TestCaseDescription desc, Document doc)
    {
        Field idField = new Field(LuceneConstants.FIELD_ID, desc.getId(), Field.Store.YES, Field.Index.ANALYZED);
        idField.setIndexOptions(IndexOptions.DOCS_ONLY);
        doc.add(idField);
    }

    private void addKeywordToDoc(Document doc, String keyword)
    {
        Field keywordField = new Field(LuceneConstants.FIELD_KEYWORDS, keyword, Field.Store.YES, Field.Index.ANALYZED);
        keywordField.setIndexOptions(IndexOptions.DOCS_ONLY);
        doc.add(keywordField);
    }

    private void addLastModifiedToDoc(TestCaseDescription desc, Document doc)
    {
        NumericField modifiedField = new NumericField(LuceneConstants.FIELD_LAST_MODIFIED);
        modifiedField.setLongValue(desc.getLastModified());
        doc.add(modifiedField);
    }

    private void addPathToDoc(TestCaseDescription desc, Document doc)
    {
        Field pathField = new Field(LuceneConstants.FIELD_PATH, desc.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
        pathField.setIndexOptions(IndexOptions.DOCS_ONLY);
        doc.add(pathField);
    }

    private void updateIndex(Document doc, TestCaseDescription desc)
    {
        try
        {
            if (writer.getConfig().getOpenMode() == OpenMode.CREATE)
            {
                // New index, so we just add the document (no old document can be there):
                LOG.debug(String.format("Adding testcase [%s] (%s)", desc.getId(), desc.getPath()));
                writer.addDocument(doc);
            } else
            {
                // Existing index (an old copy of this document may have been indexed) so
                // we use updateDocument instead to replace the old one matching the exact
                // path, if present:
                LOG.debug(String.format("Updating testcase [%s] (%s)", desc.getId(), desc.getPath()));
                writer.updateDocument(new Term(LuceneConstants.FIELD_PATH, desc.getPath()), doc);
            }
        } catch (IOException e)
        {
            throw new RuntimeException(String.format("Could not create or update index for testcase [%s] (%s)", desc.getId(),
                    desc.getPath()), e);
        }
    }

    public void store()
    {
        try
        {
            writer.close();
        } catch (IOException e)
        {
            throw new RuntimeException(String.format("Could not write index [%s]", writer.getDirectory().toString()));
        }
        writer = null;
    }
}
Searcher:
import ...

public class Searcher
{
    private static final Logger LOG = LoggerFactory.getLogger(Searcher.class);

    private final Analyzer analyzer;
    private final QueryParser parser;
    private final File indexDir;

    public Searcher(File indexDir)
    {
        this.indexDir = indexDir;
        analyzer = new StandardAnalyzer(Version.LUCENE_34);
        parser = new QueryParser(Version.LUCENE_34, LuceneConstants.FIELD_KEYWORDS, analyzer);
        parser.setAllowLeadingWildcard(true);
    }

    public List<String> search(String searchString)
    {
        List<String> testCaseIds = new ArrayList<String>();
        try
        {
            IndexSearcher searcher = getIndexSearcher(indexDir);
            Query query = parser.parse(searchString);
            LOG.info("Searching for: " + query.toString(parser.getField()));
            AllDocCollector results = new AllDocCollector();
            searcher.search(query, results);
            LOG.info("Found [{}] hit", results.getHits().size());
            for (ScoreDoc scoreDoc : results.getHits())
            {
                Document doc = searcher.doc(scoreDoc.doc);
                String id = doc.get(LuceneConstants.FIELD_ID);
                testCaseIds.add(id);
            }
            searcher.close();
            return testCaseIds;
        } catch (Exception e)
        {
            throw new RuntimeException(String.format("Could not search index [%s]", indexDir.getAbsolutePath()), e);
        }
    }

    private IndexSearcher getIndexSearcher(File indexDir)
    {
        try
        {
            FSDirectory dir = FSDirectory.open(indexDir);
            return new IndexSearcher(dir);
        } catch (IOException e)
        {
            LOG.error(String.format("Could not open index directory [%s]", indexDir.getAbsolutePath()), e);
            throw new RuntimeException(e);
        }
    }
}
Why are you using DOCS_ONLY?! If you only index doc IDs, then you only have a basic inverted index with term->document mappings, but no proximity information. That's why your phrase queries don't work.
I think you roughly want:
keyword:"foo bar"~1^2 OR keyword:"foo" OR keyword:"bar"
Which is to say, phrase match "foo bar" and boost it (prefer the full phrase), OR match "foo", OR match "bar".
The full query syntax is here: http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/queryparsersyntax.html
EDIT:
It looks like one thing you're missing is that the default operator is OR. So you probably want to do something like this:
+keyword:john AND +keyword:"foo bar"
The plus sign means "must contain". You put the AND explicitly so that the document must contain both (rather than the default, which translates to "must contain john OR must contain "foo bar").
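To tie both points to your code, here is a rough sketch against the Lucene 3.4 API you are already using: keep positions when indexing the keyword field so phrase queries can match, and make both clauses required when parsing the query.
// Indexing: keep positions so phrase queries ("...") can work.
// DOCS_AND_FREQS_AND_POSITIONS is the default for analyzed fields, so you can also just drop setIndexOptions.
Field keywordField = new Field(LuceneConstants.FIELD_KEYWORDS, keyword, Field.Store.YES, Field.Index.ANALYZED);
keywordField.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
doc.add(keywordField);

// Searching: make every clause required, either explicitly in the query string...
Query q = parser.parse("+keyword:john +keyword:\"foo bar\"");
// ...or by switching the parser's default operator from OR to AND.
parser.setDefaultOperator(QueryParser.AND_OPERATOR);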
The problem was solved by replacing
StandardAnalyzer
with
KeywordAnalyzer
for both the indexer and the searcher.
Since the StandardAnalyzer splits the input text into several words, I replaced it with the KeywordAnalyzer, which leaves the input (which can consist of one or more words) untouched. It will recognize a term like
bla foo
as a single keyword.
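In terms of the code above, the change boils down to swapping the analyzer in Indexer.createIndexWriter() and in the Searcher constructor (a sketch; everything else stays the same):
// Indexer.createIndexWriter(): treat each keyword value as a single token
Analyzer analyzer = new KeywordAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, analyzer);

// Searcher constructor: use the same analyzer so queries are tokenized consistently
analyzer = new KeywordAnalyzer();
parser = new QueryParser(Version.LUCENE_34, LuceneConstants.FIELD_KEYWORDS, analyzer);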
