Indexing external text data to a Lucene index in GraphDB - Java

Is it possible to index data that is external to the RDF?
For example, suppose an RDF triple's object is a link to an external file. Can the content of that file be indexed instead of the link value itself?

I suspect that the other answer misunderstood the question. The question refers to external content, i.e., whether GraphDB's Lucene can index the content available at http://example.org, rather than the RDF literal associated with it (and then return in searches the triple pointing to that content).
From what I was able to try, no, this is not currently supported.

Absolutely. Lucene is a core part of GraphDB, and it offers the standard functionality that comes with standalone Lucene. The data has to be present as a string literal:
<http://www.example.org/> rdfs:label "An example webpage url"@en .
Then you can configure a Lucene index:
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
    luc:index luc:setParam "uris" .
    luc:include luc:setParam "literals" .
    luc:moleculeSize luc:setParam "1" .
    luc:includePredicates luc:setParam "http://www.w3.org/2000/01/rdf-schema#label" .
}
And once you have the configuration, you can create the index.
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
    luc:myTestIndex luc:createIndex "true" .
}
And, given the index and your data, you can query it.
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
SELECT * {
    ?subj luc:myTestIndex "web*"
}
Since you are asking for the subject of something that contains the string web*, you'll get <http://www.example.org/>. If you had other triples linking to this one, they might also have appeared.
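Since the question is tagged java, here is a minimal sketch of running the index creation and the search from Java over the RDF4J API (the repository URL and id are my assumptions, not part of GraphDB's Lucene setup):
import org.eclipse.rdf4j.query.QueryLanguage;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;

// Hypothetical repository URL; adjust host, port and repository id.
Repository repo = new HTTPRepository("http://localhost:7200/repositories/myRepo");
repo.init();
try (RepositoryConnection conn = repo.getConnection()) {
    // Create the index (same SPARQL update as above).
    conn.prepareUpdate(QueryLanguage.SPARQL,
        "PREFIX luc: <http://www.ontotext.com/owlim/lucene#> "
        + "INSERT DATA { luc:myTestIndex luc:createIndex \"true\" . }").execute();
    // Run the full-text query and print the matching subjects.
    try (TupleQueryResult result = conn.prepareTupleQuery(QueryLanguage.SPARQL,
            "PREFIX luc: <http://www.ontotext.com/owlim/lucene#> "
            + "SELECT * { ?subj luc:myTestIndex \"web*\" }").evaluate()) {
        while (result.hasNext()) {
            System.out.println(result.next().getValue("subj"));
        }
    }
} finally {
    repo.shutDown();
}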
More information about the way in which GraphDB interacts with Lucene and its Full-Text-Search capabilities can be found within the GraphDB documentation.

Related

Redisearch query with "begin with" instead of "contains"

I am trying to understand how to perform queries in Redisearch strictly with "begins with", but I keep getting "contains" behavior.
For example, if I have fields with values like 'football', 'myfootball', 'greenfootball' and provide a search term like this:
> FT.SEARCH myIdx #myfield:foot*
I want to get just 'football', but I keep getting other fields that contain the word rather than beginning with it.
Is there a way to avoid this?
I tried using VERBATIM and things like #myfield:^foot*, but nothing worked.
I am using JRedisearch as a client, but eventually I had to enter the DB and perform these queries manually to figure out what's happening. That being said, is this possible to do with this client at the moment?
Thanks
EDIT
A sample of my index setup:
Client client = new Client(INDEX_NAME, url, PORT);
Schema sc = new Schema().addSortableTextField("url", 1.0); // using this field for query
client.dropIndex(true);
client.createIndex(sc, Client.IndexOptions.Default());
return client;
Sample document:
id: // random uuid
urlPath: myfootbal
application: web
market: Europe
After checking the RDB provided, I see that when searching for foot* you are not getting myfootbal. The replies look like this: /dot-com/plp/football/x/index.html. You are getting those replies because the url is tokenized, and '/' is one of the tokenization characters. If you do not want those urls to be tokenized, you need to declare them as TAG fields rather than TEXT fields. That way the entire url is indexed as-is, and it will not appear in the results when you search for foot*.
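For illustration, a minimal sketch of that schema change with JRedisearch, reusing the setup from the question (INDEX_NAME, url and PORT as defined there; the @field:{value} tag filter syntax comes from the RediSearch query language):
import io.redisearch.Query;
import io.redisearch.Schema;
import io.redisearch.SearchResult;
import io.redisearch.client.Client;

Client client = new Client(INDEX_NAME, url, PORT);
// Declare url as a TAG field so the whole value is indexed as-is
// instead of being tokenized on characters such as '/'.
Schema sc = new Schema().addTagField("url");
client.dropIndex(true);
client.createIndex(sc, Client.IndexOptions.Default());

// Tag filters match the stored value verbatim.
SearchResult res = client.search(new Query("@url:{football}"));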
For more information about TAGS see the FT.CREATE documentation: https://oss.redislabs.com/redisearch/Commands.html

How to prepare training data for OpenNLP to tokenize tokens that contain more than one word?

In some languages (for example, Vietnamese), a vocabulary item may consist of multiple words, so some tokens that contain more than one word cannot be produced by splitting on whitespace alone.
I have following input:
Người dân địa phương đã nhiều lần báo Điện lực Bến Tre nhưng chưa được giải quyết .
Expected output:
["Người dân", "địa phương", "đã", "nhiều", "lần", "báo", "Điện lực", "Bến Tre", "nhưng", "chưa", "được", "giải quyết"]
In my training data, I use _ to connect the words that need to stick together into one token:
Người_dân địa_phương đã nhiều lần báo Điện_lực Bến_Tre nhưng chưa được giải_quyết .
Here is command line I use to train
opennlp TokenizerTrainer -model "model/vi-token.bin" -alphaNumOpt 1 -lang "vi" -data "data/merge_vlsp_removehtml" -encoding "UTF-8" -params param/wordseg.param
with the params file containing
Iterations=1000
However, the output does not join multiple words into one token; it still splits on whitespace.
Command I run to get output
opennlp TokenizerME model/vi-token.bin < sample/sample_text > sample/sample_text.out
What should I do with the training data or config params to train the tokenizer to produce multi-word tokens?
Rather than using the underscore for training, use tags. OpenNLP uses tags as the reference for training. Follow the instructions given for NER when training your tokenizer.
OpenNLP provides the TokenizerTrainer tool to train data. The OpenNLP format contains one sentence per line. You can also specify tokens either separated by whitespace or by a special tag.
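For illustration (my example, not from the original answer): in the OpenNLP tokenizer training format that special tag is <SPLIT>, marking a token boundary where no whitespace is present, e.g.:
Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board.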
You can follow this blog for a head start in OpenNLP for various purposes. The post will show you how to create a training file and build a new model.
You can easily create your own training data-set using the modelbuilder addon and follow the rules mentioned there to create a good NER model.
You can find some help on using the modelbuilder addon here.
Basically, you put all the information in a text file and the NER entities in another. The addon searches for a particular entity and replaces it with the required tag, hence producing the tagged data. It should be pretty easy to use this tool!
Also, follow markg's answer to get an understanding of creating new models on your own. This will help you build your own models, customized for your applications.
Hope this helps!

How to list all the indices' names of Elasticsearch using Java?

I want to get the names of all the indices in my Elasticsearch cluster. How can I do that using Java?
I searched the internet, but there's not much useful information.
You can definitely do it with the following simple Java code:
ImmutableOpenMap<String, IndexMetaData> indices = client.admin().cluster()
    .prepareState().get().getState()
    .getMetaData().getIndices();
The map you obtain contains the details of all the indices available in your ES cluster.
You can use:
client.admin().indices().prepareGetIndex().setFeatures().get().getIndices();
Use setFeatures() without parameters to get just the index names. Otherwise, other data, such as the MAPPINGS and SETTINGS of each index, will also be returned by default.
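For example, a minimal sketch wrapping that call (assuming the same transport-client API as the other answers, with client an already-connected Client):
// Returns only the index names because no features are requested.
String[] indexNames = client.admin().indices()
    .prepareGetIndex().setFeatures().get().getIndices();
for (String name : indexNames) {
    System.out.println(name);
}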
Thanks for @Val's answer. Following that method, this is what I use in my projects:
ClusterStateResponse response = transportClient.admin().cluster()
    .prepareState().execute().actionGet();
String[] indices = response.getState().getMetaData().getConcreteAllIndices();
This puts all the index names into a String array. The method works.
There's another method I think should work, but I have not tried it:
ImmutableOpenMap<String, IndexMetaData> indices = node.client().admin().cluster()
    .prepareState().execute().actionGet().getState()
    .getMetaData().getIndices();
Then we can get the keys of the map to list all the indices.
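A minimal sketch of that iteration (assuming the same transport client as above; keysIt() exposes the map keys as a plain Iterator):
import java.util.Iterator;
import org.elasticsearch.cluster.metadata.IndexMetaData;
import org.elasticsearch.common.collect.ImmutableOpenMap;

ImmutableOpenMap<String, IndexMetaData> indices = client.admin().cluster()
    .prepareState().execute().actionGet().getState()
    .getMetaData().getIndices();
// Print every index name in the cluster.
Iterator<String> names = indices.keysIt();
while (names.hasNext()) {
    System.out.println(names.next());
}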
Thanks again!

Using dbpedia spotlight with a local mediawiki (not instance of wikipedia)

I'm trying to use DBpedia Spotlight to spot special terms (which are not included in DBpedia) by using a local MediaWiki dump as input instead of the default index and spotter.dict.
Any ideas will be much appreciated.
DBpedia Spotlight requires five files to build the index, as follows:
N-Triples format:
Instance Types: list of URLs and their types (DBpedia, Freebase, etc.)
E.g.:
<YOUR_LINK> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <DBpedia:Type> .
Labels: list of URLs and labels
E.g.:
<YOUR_LINK> <http://www.w3.org/2000/01/rdf-schema#label> "Label"@en .
Redirects: list of URLs and their redirect pages
E.g.:
<YOUR_LINK> <http://dbpedia.org/ontology/wikiPageRedirects> <YOUR_LINK> .
Disambiguations: list of URLs and their disambiguation pages.
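E.g. (this example line is my assumption, based on the standard DBpedia wikiPageDisambiguates property; it is not from the original answer):
<YOUR_LINK> <http://dbpedia.org/ontology/wikiPageDisambiguates> <YOUR_LINK> .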
XML dump:
Wiki dump (like the Wikipedia dump).
After preparing these files with your own data, "just" follow the internationalization guide available in the DBpedia Spotlight wiki to create the index.
All the best,

Mibble MIB Parser - extracting comments from the mib

I am using the Mibble MIB Parser to extract all simple data types from a MIB file. I've been successful up until my attempt to extract comment text.
Take the following module as an example:
invBookList OBJECT-TYPE
    SYNTAX INTEGER {
        mobydick(1),     -- call me ishmael
        paradiselost(2), -- aComment
        1984(3),         -- aComment
        solaris(4)       -- aComment
    }
    MAX-ACCESS read-only
    STATUS current
    DESCRIPTION
        "A few Books for an example."
    ::= { invMasterList 43 }
According to Mibble's API, the OBJECT-TYPE can be accessed by extracting an SnmpObjectType and then calling the appropriate getter method, which I have done; I can successfully extract all of the text except the comments in the INTEGER syntax.
I have tried calling getSyntax().getComment() on the SnmpObjectType, but it always returns null. getSyntax() will extract the INTEGER syntax, e.g.:
mobydick(1), paradiselost(2), 1984(3), solaris(4)
but unfortunately strips out the comments.
Anyone out there have experience with Mibble Parser who knows how to extract the comments?
Many Thanks.
First, you need to use version 2.9 of Mibble. Then look into MibWriter.java to understand how to use the API:
https://github.com/cederberg/mibble/blob/master/src/java/net/percederberg/mibble/MibWriter.java
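As a starting point, a minimal sketch of loading a MIB with Mibble and walking its symbols (the file name is hypothetical; the exact comment accessors are best confirmed against MibWriter.java, as suggested above):
import java.io.File;
import net.percederberg.mibble.Mib;
import net.percederberg.mibble.MibLoader;
import net.percederberg.mibble.MibSymbol;

// Load the MIB from a local file (hypothetical path).
MibLoader loader = new MibLoader();
Mib mib = loader.load(new File("INVENTORY-MIB.txt"));

// Walk all symbols; MibWriter.java shows how it reads the loaded
// symbols (including comment text in 2.9) when writing a MIB out.
for (Object obj : mib.getAllSymbols()) {
    MibSymbol symbol = (MibSymbol) obj;
    System.out.println(symbol.getName());
}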
