Internal working of GAE Search API (Java)

I would like to know how the Search API stores Documents internally. Does it store a Document in the Datastore with a "Document" kind, or something else? Also, where are the indexes stored? In memcache?

Documents and indexes are saved in a separate persistent store optimized for search operations, not in the HR Datastore and not in memcache. The Document class represents documents: each document has a document identifier and a list of fields.
It's all in Google's documentation.
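For reference, here is a minimal sketch of creating and indexing a document with the Java Search API (the index name, document id, and field names are made-up examples):

import com.google.appengine.api.search.Document;
import com.google.appengine.api.search.Field;
import com.google.appengine.api.search.Index;
import com.google.appengine.api.search.IndexSpec;
import com.google.appengine.api.search.SearchServiceFactory;

// Build a document with an identifier and a list of fields.
Document doc = Document.newBuilder()
        .setId("doc-1")
        .addField(Field.newBuilder().setName("title").setText("Hello"))
        .addField(Field.newBuilder().setName("body").setText("Search API demo"))
        .build();

// Putting the document into an index persists it in the Search API's
// own store, not in the Datastore and not in memcache.
Index index = SearchServiceFactory.getSearchService()
        .getIndex(IndexSpec.newBuilder().setName("my-index").build());
index.put(doc);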

Related

Tokenize textual content using Spark SQL?

I am working on implementing a requirement to create a dictionary of words to documents using Apache Spark and MongoDB.
In my scenario I have a Mongo collection in which each document has some text-type fields along with a field for the owner of the document.
I want to parse the text content in the collection's docs and create a dictionary which maps words to the document and owner fields. Basically, the key would be a word and the value would be the _id and owner fields.
The idea is to provide auto-suggestions specific to the user when he/she types in the text box in the UI, based on the user's documents.
A user can create multiple documents and a word can appear in multiple documents, but each document is created by only one user.
I used the MongoDB Spark connector and I am able to load the collection's docs into a DataFrame using Spark SQL.
I am not sure how to process the textual data, which is now in one of the DataFrame columns, to extract the words.
Is there a way using Spark SQL to process the text content in the DataFrame column to extract/tokenize words, map them to the _id and owner fields, and write the results to another collection?
If not, can someone please point me to the right approach/steps for achieving this?
Spark has support for tokenisation and other text-processing tasks, but it's not in the core library. Check out Spark MLlib:
https://spark.apache.org/docs/2.1.0/ml-guide.html
And more precisely the Transformers that work on DataFrames, like the Tokenizer:
https://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer
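A hedged sketch in Java using Spark ML's RegexTokenizer; the text column name, the target collection name, and the connector format string are assumptions that depend on your schema and connector version:

import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

SparkSession spark = SparkSession.builder().appName("word-dictionary").getOrCreate();

// Collection docs loaded via the MongoDB Spark connector; assumes the
// DataFrame has columns _id, owner, and text.
Dataset<Row> docs = spark.read()
        .format("com.mongodb.spark.sql.DefaultSource")
        .load();

// Split the text column into an array of lowercase tokens.
RegexTokenizer tokenizer = new RegexTokenizer()
        .setInputCol("text")
        .setOutputCol("words")
        .setPattern("\\W+");

// Explode the token array so there is one row per (word, _id, owner).
Dataset<Row> dictionary = tokenizer.transform(docs)
        .select(explode(col("words")).as("word"), col("_id"), col("owner"))
        .distinct();

// Write the dictionary back out to another collection.
dictionary.write()
        .format("com.mongodb.spark.sql.DefaultSource")
        .option("collection", "word_dictionary")
        .mode("append")
        .save();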

Couchbase Java SDK - Partially update document

I have a document that has a data model corresponding to a user.
The user has an addresses array, a phone array and an email array.
I make CRUD operations on these data using the Java SDK for Couchbase.
I have a constraint: I need to get all the document data in order to display the data associated with the user. On the UI I can modify everything except the data contained in the arrays (phone, email and addresses).
How can I update only those data when I update the document?
When I try to use the JsonIgnore annotation on the array accessor methods when serializing the user object, it removes them from the document when the Java Couchbase replace method takes place.
Is there a way to partially update documents with the Java SDK for Couchbase?
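One possible approach, assuming Couchbase Java SDK 2.2+, is the sub-document API, which mutates individual paths inside a document without rewriting the whole thing; the document id and field names below are assumptions for illustration:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;

Bucket bucket = CouchbaseCluster.create("localhost").openBucket("users");

// Mutate only the top-level fields edited in the UI; the phone, email
// and addresses arrays are never touched, so they keep their stored values.
bucket.mutateIn("user::123")
      .replace("firstName", "Alice")
      .replace("lastName", "Smith")
      .execute();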

Semantics Triple Store in Marklogic

I want to store graph data in MarkLogic using semantic triples. I am able to do that when I use a TTL file with a URI such as http://example.org/item/item22.
But I want to store these triples with respect to documents which are stored in MarkLogic.
That is, I have one document "Java KT" which is in relation to a Java class, and all this data is present in MarkLogic. How can I create a TTL file with a URI pointing to a document which is present in the MarkLogic DB?
Load your documents, load your triples, and just add extra triples with the document URI as subject or object, and some triple entity URI as the other side. You could express those in another TTL file, or create them via code.
The next question would be, though: how would you want to use documents and triples together?
HTH!
My question is: what IRI do I write in the TTL file for my document available in the DB? TTL files accept IRIs, so what is the IRI for my document?
@grtjn
It sounds like you want to link from some existing information to your document URI.
If it's item22 from your example, then it should be straightforward.
Let's say item22 is a book. Your TTL data might look like this:
PREFIX item: <http://example.org/item/>
PREFIX domain: <http://example.org/stuff/>

item:item22 a domain:Book ;
    domain:hasTitle "A tale of two cities" ;
    domain:hasAuthor "Charles Dickens" .
Let's say you have that book as a document in MarkLogic. You could simply add another triple:
item:item22 domain:contentsInUri "/books/Dickens/A-tale-of-two-cities.xml" .
Now you can use SPARQL and easily find the URI related to all the books by Dickens or books with the title "A tale of two cities".
If you are looking for more structure, you could look into semantic ontologies such as RDFS and OWL.
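As a sketch of that SPARQL step, such a query could be run through the MarkLogic Java Client API; the host, port, and credentials are placeholders, and the predicates reuse the example above:

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.io.JacksonHandle;
import com.marklogic.client.semantics.SPARQLQueryDefinition;
import com.marklogic.client.semantics.SPARQLQueryManager;

DatabaseClient client = DatabaseClientFactory.newClient(
        "localhost", 8000,
        new DatabaseClientFactory.DigestAuthContext("admin", "admin"));

SPARQLQueryManager sparqlMgr = client.newSPARQLQueryManager();

// Find the document URIs of all books by Dickens.
String sparql =
        "PREFIX domain: <http://example.org/stuff/> " +
        "SELECT ?uri WHERE { " +
        "  ?book domain:hasAuthor \"Charles Dickens\" ; " +
        "        domain:contentsInUri ?uri " +
        "}";

SPARQLQueryDefinition query = sparqlMgr.newQueryDefinition(sparql);
JacksonHandle results = sparqlMgr.executeSelect(query, new JacksonHandle());
System.out.println(results.get());

client.release();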

Lucene document object duplicates its unchanged fields

I have been working with Lucene for about a year, and today I suddenly noticed something weird about it.
I was updating my index using the normal Lucene mechanism of fetching the document, deleting the old document, and then re-indexing the document.
So:
1. Fetched the document to update from the Lucene index and kept this doc in a list.
2. Removed the document from the index.
3. Updated some of the fields of the doc from the list and then re-indexed this document.
But then I found that the updated document that was indexed had duplicate values for the original document's fields.
For example, suppose there was a field id:1; I didn't update this field, but updated the other content of the document and then indexed this doc.
I found that this id:1 appeared two times in the same document. Furthermore, each time I re-index the same document, the field is created that many more times within the single document.
How should I get rid of this duplication?
I had to make a modification for the document being re-indexed: using the document I fetched from the index, I took out all its fields, created a fresh new document, added those fields to that document, and then re-indexed this new document, which got indexed properly without any duplication.
I was not able to find the root cause, but the document fetched from the index carries its docId, and because of this, when it is re-indexed, some duplication seems to take place internally, which must have caused the problem.
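A minimal sketch of that workaround, assuming the index has a unique id field and an open IndexWriter; the field and variable names are illustrative:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;

// Build a fresh Document instead of re-adding the instance fetched from
// the index; the fetched instance carries state that can lead to
// duplicated fields when it is indexed again.
Document fresh = new Document();
fresh.add(new StringField("id", oldDoc.get("id"), Field.Store.YES));
fresh.add(new TextField("content", updatedContent, Field.Store.YES));

// updateDocument atomically deletes all documents matching the term and
// indexes the fresh one, so no separate delete step is needed.
writer.updateDocument(new Term("id", oldDoc.get("id")), fresh);
writer.commit();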

Does Lucene store the actual documents in its index?

I am planning to use Lucene to index a very large corpus of text documents. I know how an inverted index and all that works.
Question: Does Lucene store the actual source documents in its index (in addition to the terms)? So if I search for a term and want all the documents that contain the term, do the documents come out of Lucene, or does Lucene just return pointers (e.g. the file path to the matched documents)?
This is up to you. Lucene represents documents as collections of fields, and for each field you can configure whether it is stored. Typically, you would store the title fields, but not the body fields, when handling largish documents, and you'd add an identifier field (not indexed) that can be used to retrieve the actual document.
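For illustration, a sketch of that per-field choice at indexing time; the field names are examples, and StoredField holds a value without making it searchable:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));  // indexed and stored
doc.add(new TextField("body", body, Field.Store.NO));     // indexed, not stored
doc.add(new StoredField("path", filePath));               // stored only, not searchable
writer.addDocument(doc);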
