Storing and retrieving JSON objects to/from Lucene indexes - java

I want to store a set of JSON objects in Lucene indexes and also retrieve them from the index. I am using lucene-3.4.
Is there any library or easy mechanism to make this happen in Lucene?
For example, a JSON object:
{
  "BOOKNAME1": {
    "id": 1,
    "name": "bname1",
    "price": "p1"
  },
  "BOOKNAME2": {
    "id": 2,
    "name": "bname2",
    "price": "p2"
  },
  "BOOKNAME3": {
    "id": 3,
    "name": "bname3",
    "price": "p3"
  }
}
Any sort of help will be appreciated.
Thanks in advance,

I would recommend indexing your JSON objects as follows:
1) Parse your JSON file. I usually use json-simple.
2) Open an index using IndexWriterConfig.
3) Add documents to the index.
4) Commit the changes and close the index.
5) Run your queries (see the sketch below).
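A minimal sketch of those steps, assuming Lucene 4.8 and the json-simple parser; the file name, index path and field names are illustrative and should be adapted to your data:

import java.io.File;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

public class JsonIndexer {
    public static void main(String[] args) throws Exception {
        // 1) Parse the JSON file with json-simple
        JSONObject books = (JSONObject) new JSONParser().parse(new FileReader("books.json"));

        // 2) Open an index using IndexWriterConfig
        Directory dir = FSDirectory.open(new File("book-index"));
        IndexWriterConfig config =
                new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
        IndexWriter writer = new IndexWriter(dir, config);

        // 3) Add one Lucene Document per book entry
        for (Object key : books.keySet()) {
            JSONObject book = (JSONObject) books.get(key);
            Document doc = new Document();
            doc.add(new StringField("id", String.valueOf(book.get("id")), Field.Store.YES));
            doc.add(new TextField("name", (String) book.get("name"), Field.Store.YES));
            doc.add(new StringField("price", (String) book.get("price"), Field.Store.YES));
            writer.addDocument(doc);
        }

        // 4) Commit changes and close the index
        writer.close();

        // 5) Run your queries against the "book-index" directory, e.g. with an IndexSearcher
    }
}

Here id and price use StringField so they are indexed as single tokens, while name goes through the analyzer for full-text search.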
If you would like to use Lucene Core instead of elasticsearch, I have created a sample project which takes a file with JSON objects as input and creates an index. I have also added a test that queries the index.
I am using the latest Lucene version (4.8); please have a look here:
http://ignaciosuay.com/getting-started-with-lucene-and-json-indexing/
If you have time, I think it is worth reading "Lucene in Action".
Hope it helps.

If you don't want to search within the JSON but only store it, you just need to extract the id, which will hopefully be unique. Then your Lucene document would have two fields:
the id (indexed, not necessarily stored)
the JSON itself, as it is (only stored)
Once you have stored your JSON in Lucene you can retrieve it by filtering on the id.
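A rough sketch of that two-field layout against the Lucene 3.4 API mentioned in the question; the RAMDirectory and field names are just for illustration:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class StoreJsonById {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();

        // Index: the id is searchable but not stored, the raw JSON is stored but not indexed
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34)));
        Document doc = new Document();
        doc.add(new Field("id", "1", Field.Store.NO, Field.Index.NOT_ANALYZED));
        doc.add(new Field("json", "{\"id\":1,\"name\":\"bname1\",\"price\":\"p1\"}",
                Field.Store.YES, Field.Index.NO));
        writer.addDocument(doc);
        writer.close();

        // Retrieve the stored JSON back by filtering on the id
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        TopDocs hits = searcher.search(new TermQuery(new Term("id", "1")), 1);
        if (hits.totalHits > 0) {
            System.out.println(searcher.doc(hits.scoreDocs[0].doc).get("json"));
        }
        searcher.close();
    }
}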
On the other hand, this is pretty much what elasticsearch does with your documents: you just send some JSON to it via a REST API. elasticsearch will keep the JSON as it is and also make it searchable by default. That means you can either retrieve the JSON by id or search against it, out of the box, without having to write any code.
Also, with Lucene your documents aren't visible to searches until you commit them or reopen the index reader, while elasticsearch adds a handy transaction log on top, so that a GET by id is always real-time.
Also, elasticsearch offers a lot more: a nice distributed infrastructure, faceting, scripting and more. Check it out!

Related

Jira REST API - Querying for issues with all of the fields in a flat JSON structure

In the backlog view of a project, I can select one or more issues and export them to Excel. Here's what I see when I open it.
Each issue takes up a row. Each field in an issue takes up a column in excel.
If I were to visualize this in JSON it would look something like
[
issue1:{
field1:value1,
field2:value2,
..
},
issue2:{
}
]
So the issue block has all of the attributes in a flat structure.
Is there a URL mapping in the JIRA api that can get me a response in a flat structure as above? Most of their documented apis return data in nested structures (there are different levels of complex objects for "Issues").
The REST API search and issue resources will let you extract the information. You can then normalize it into the flat JSON you want. Note that some fields, such as components, can contain multiple values.
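A hedged sketch of that normalization step, calling the /rest/api/2/search resource and flattening each issue's fields map with json-simple; the base URL, JQL and field list are placeholders, and authentication is omitted:

import java.io.InputStreamReader;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

public class FlattenIssues {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://your-jira/rest/api/2/search?jql=project=DEMO&fields=summary,status,components");
        JSONObject response = (JSONObject) new JSONParser()
                .parse(new InputStreamReader(url.openStream(), "UTF-8"));

        for (Object o : (JSONArray) response.get("issues")) {
            JSONObject issue = (JSONObject) o;
            JSONObject fields = (JSONObject) issue.get("fields");

            // One flat row per issue: the key plus each field collapsed to a string
            Map<String, String> row = new LinkedHashMap<String, String>();
            row.put("key", (String) issue.get("key"));
            for (Object name : fields.keySet()) {
                Object value = fields.get(name);
                // Multi-valued fields (e.g. components) arrive as arrays; join them as you see fit
                row.put((String) name, value == null ? "" : value.toString());
            }
            System.out.println(row);
        }
    }
}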

Where can we use ElasticSearch and where can we use MongoDB?

My question is: in which situations should we choose MongoDB, and in which situations should we choose ElasticSearch?
If you have a case where you want to search for a particular whole word and you know that word is present in your database, you can go with MongoDB directly. But if you have a case where you want to do partial searches, go for Elasticsearch.
Example: if you create a text index on some fields of your documents, Mongo text search works on whole-word matches. Suppose your collection has a test field with the value "I am testing it" and you created a text index on it. A text search for "testing" will return the document that contains the word "testing". But if you now search for "test", you will get no data.
In Elasticsearch, by contrast, even partial searches such as "tes" or "testi" will return data in the response.
reference: http://blog.mpayetta.com/elasticsearch/mongodb/2016/08/04/full-text-indexing-with-elastic-search-and-mongodb/

MarkLogic search and retrieve specific fields

I am fairly new to MarkLogic (and NoSQL) and am currently trying to learn the Java API client. My question is about searching, which returns search result snippets/matches: is it possible for the search result to include specific fields from the document?
For example, given this document:
{"id":"1", "type":"classified", "description": "This is a classified type."}
And I search using this:
QueryManager queryMgr = client.newQueryManager();
StringQueryDefinition query = queryMgr.newStringDefinition();
query.setCriteria("classified");
queryMgr.search(query, resultsHandle);
How can I get the JSON document's 3 defined fields (id, type, description) as part of the search result - so I can display them in my UI table?
Do I need to hit the DB again by loading the document via URI (thus if I have 1000 records, that means hitting the DB again 1000 times)?
You have several options for retrieving specific fields with your search results. You could use the Pojo Data Binding Interface. You could read multiple documents matching a query, which brings back the entirety of each document, which you can then get as a POJO, a String, or any other handle. Or you can use the same API you're using above but add search options that extract a portion of each matching document.
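For the second option (multi-document read by query), a rough sketch with the MarkLogic Java Client API; it assumes an existing DatabaseClient, and exact class names may vary slightly between client versions:

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.document.DocumentPage;
import com.marklogic.client.document.DocumentRecord;
import com.marklogic.client.document.JSONDocumentManager;
import com.marklogic.client.io.StringHandle;
import com.marklogic.client.query.QueryManager;
import com.marklogic.client.query.StringQueryDefinition;

public class BulkReadByQuery {
    public static void fetchMatches(DatabaseClient client) {
        QueryManager queryMgr = client.newQueryManager();
        StringQueryDefinition query = queryMgr.newStringDefinition();
        query.setCriteria("classified");

        // Bring back the full content of every matching document a page at a time,
        // instead of re-reading each URI individually after the search
        JSONDocumentManager docMgr = client.newJSONDocumentManager();
        DocumentPage page = docMgr.search(query, 1);   // 1 = start position
        while (page.hasNext()) {
            DocumentRecord record = page.next();
            String json = record.getContent(new StringHandle()).get();
            System.out.println(record.getUri() + " -> " + json);
            // Parse the JSON here and pull id/type/description for the UI table
        }
    }
}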
If you're bringing back thousands of matches, you're probably not showing all those snippets to end users, so you should probably disable snippeting with something like
<transform-results apply="empty-snippet" />
in your options.

Is it possible to create a multivalued polyfield in Solr that will allow custom logic at query time?

I'm working with a pretty niche requirement to model a relational structure within Solr and thought that a custom polyfield would be the most suitable solution to my problem. In short, each record in the index will have a number of embargo and expiry dates for when the content should be considered 'available'. These dates are grouped with another kind of categorisation (let's say by device), so for example, any given item in the index may be available for mobile users between two dates, but only available for desktop users between another two dates.
Much like the currency and the latlon types, I would index the values as a comma separated list representing each availability window, for example:
mobile,2013-09-23T00:00:00Z,2013-09-30T00:00:00Z
So, a single index record could look like
{
id: "1234",
text: ["foobarbaz"],
availability: [
"mobile,2013-09-23T00:00:00Z,2013-09-30T00:00:00Z",
"pc,2013-09-22T00:00:00Z,2013-09-30T00:00:00Z"
]
}
The custom type would do the job of parsing the incoming value and storing it accordingly. Is this a viable solution? How would I approach the custom logic required at query time to filter by device and then make sure that NOW is within the provided dates?
My attempt so far has been based on the Currency field type, but now I've dialled it back to just storing the string in its un-parsed state. If I could prove that the filtering I want is even possible before using the polyfield features, then I'll know if it's worth continuing.
Does anybody else have any experience writing custom (poly)fields, or doing anything similar to what I'm doing?
Thanks!
If you want to be able to filter and search on these ranges, I don't think you'll have much luck storing records like that. It would make more sense to me to have a more structured document, something like:
id: "1234",
text: ["foobarbaz"],
mobileavailabilitystart: "2013-09-23T00:00:00Z",
mobileavailabilityend: "2013-09-30T00:00:00Z",
pcavailabilitystart: "2013-09-22T00:00:00Z",
pcavailabilityend: "2013-09-30T00:00:00Z"
Indexing the full contents of a csv line in Lucene/Solr, in a single field, would allow you to perform full-text searches on it, but would not be a good way to support querying for a specific element of it.
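To illustrate the query-time side of that flattened schema, a hedged SolrJ sketch (Solr 4.x-era HttpSolrServer; the field names follow the example above and are assumptions about your schema). It restricts results to documents whose mobile availability window contains NOW:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AvailabilityQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery query = new SolrQuery("text:foobarbaz");
        // Only return documents whose mobile window contains the current time
        query.addFilterQuery("mobileavailabilitystart:[* TO NOW]");
        query.addFilterQuery("mobileavailabilityend:[NOW TO *]");

        QueryResponse response = solr.query(query);
        System.out.println(response.getResults().getNumFound() + " documents currently available on mobile");
    }
}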

Lucene indexing strategy for documents that change often

I'm integrating search functionality into a desktop application and I'm using vanilla Lucene to do so. The application handles potentially thousands of POJOs, each with its own set of key/value(s) properties.
When mapping models between my application and Lucene, I originally thought of assigning each POJO a Document and adding the properties as Fields. This approach works great as far as indexing and searching go, but the main downside is that whenever a POJO changes its properties I have to reindex ALL the properties again, even the ones that didn't change, in order to update the index.
I have been thinking of changing my approach and instead creating a Document per property, assigning the same id to all the Documents from the same POJO. This way, when a POJO property changes I only update its corresponding Document without reindexing all the other unchanged properties. I think the graph db Neo4j follows a similar approach when it comes to indexing, but I'm not completely sure. Could anyone comment on the possible impact on performance, querying, etc.?
It depends fundamentally on what you want to return as a Document in a search result.
But indexing is pretty cheap. Does a changed POJO really have so many properties that reindexing them all is a major problem?
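For reference, the whole-document reindex the question describes usually comes down to a single updateDocument call keyed on the POJO id. A minimal sketch, using the 3.x-style field API; the field handling is illustrative:

import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class PojoIndexer {
    // Delete-and-replace the single Document representing this POJO
    public static void reindex(IndexWriter writer, String pojoId, Map<String, String> properties) throws Exception {
        Document doc = new Document();
        doc.add(new Field("id", pojoId, Field.Store.YES, Field.Index.NOT_ANALYZED));
        for (Map.Entry<String, String> prop : properties.entrySet()) {
            doc.add(new Field(prop.getKey(), prop.getValue(), Field.Store.YES, Field.Index.ANALYZED));
        }
        writer.updateDocument(new Term("id", pojoId), doc);  // atomically replaces the old version
    }
}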
If you only search one field in every search request, splitting one POJO into several documents will speed up reindexing. But it will cause another problem if you search on multiple fields: a single POJO may appear many times in the results.
Actually, I agree with EJP: building the index is very fast for a small dataset.
