DBPedia Spotlight on smaller texts (words) instead of paragraphs - java

I think this question was asked earlier but removed for reasons unknown. I am very new to DBpedia and have little knowledge of writing queries. The problem I am trying to solve is a natural language problem. I am able to extract entities from a given sentence, and I can classify some of them as Name, Organization, and Person, but I am unable to classify the rest correctly. So I wanted to add a lookup option where I check them against a database like DBpedia for a classification. Just yesterday a kind soul suggested I look at DBpedia Spotlight. I went through their documentation. The best way to integrate it into my Java code appears to be this (note that the snippet from their docs is Scala):
import org.dbpedia.spotlight.annotate.DefaultParagraphAnnotator
import org.dbpedia.spotlight.disambiguate.{TwoStepDisambiguator, ParagraphDisambiguatorJ}
import org.dbpedia.spotlight.model.SpotlightConfiguration
import org.dbpedia.spotlight.model.SpotlightFactory

val text = new String("Brazilian oil giant Petrobras and U.S. oilfield service company Halliburton have signed a technological cooperation agreement, Petrobras announced Monday. The two companies agreed on three projects: studies on contamination of fluids in oil wells, laboratory simulation of well production, and research on solidification of salt and carbon dioxide formations, said Petrobras. Twelve other projects are still under negotiation.")

// Build the spotter and disambiguator from the server configuration,
// then annotate the whole paragraph.
val configuration = new SpotlightConfiguration("conf/server.properties")
val factory = new SpotlightFactory(configuration)
val disambiguator = new ParagraphDisambiguatorJ(new TwoStepDisambiguator(factory.candidateSearcher, factory.contextSearcher))
val spotter = factory.spotter()
val annotator = new DefaultParagraphAnnotator(spotter, disambiguator)
println(annotator.annotate(text))
However, I do not want to annotate paragraphs, just run the annotation on words I extract from a sentence that could be possible entities. For example, in the sentence "Yahoo ceo Marissa Mayers said in a press conference yesterday...." I am able to extract Yahoo and Marissa Mayers. Now I want to use DBpedia to assign a classification to them.
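One workaround I am considering is calling the Spotlight REST endpoint on each extracted phrase instead, roughly like this (an untested sketch against the public api.dbpedia-spotlight.org service; a local server URL should work the same way):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SpotlightLookup {
    public static void main(String[] args) throws Exception {
        String phrase = "Marissa Mayers"; // one candidate entity from my extractor
        URL url = new URL("https://api.dbpedia-spotlight.org/en/annotate?text="
                + URLEncoder.encode(phrase, "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json"); // JSON instead of the HTML demo page
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // annotated resources carry "@types" with DBpedia classes
            }
        }
    }
}

Is that a reasonable way to classify single phrases, or is there a proper API-level entry point for it?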
Any help would be greatly appreciated.

Related

How to specify phonetic keywords for IBM Watson speech2text service?

While we have had good success with the Bluemix Java SDK in the general case, we've bumped into problems while trying to recognize occasional non-English words (e.g., foreign last names). Our hope was that one could specify the keyword list using SPR phonetic notation (which works great for text2speech), but that does not seem to be supported for speech2text. Any suggestions/workarounds?
SpeechToText service = new SpeechToText();
service.setUsernameAndPassword("USERNAME", "PASSWORD");
File audio = new File("C:\\Users\\AudioFiles\\euler.wav");
RecognizeOptions options = new RecognizeOptions.Builder()
        .contentType(HttpMediaType.AUDIO_WAV)
        .continuous(true)
        .inactivityTimeout(500)
        .keywords(new String[]{"Agarwal", "Euler", "Qin"}) // exact keywords signature varies by SDK version
        .keywordsThreshold(0.5)
        .build();
SpeechResults transcript = service.recognize(audio, options);
System.out.println(transcript);
The objective is to be able to say "My name is John Euler." and for the transcript not to return something like "My name is John Oyler." (which is what it does currently).
Thx.
Hmm, the three words that you pass are actually in the vocabulary, but maybe they are not found because they have very little weight in the language model. Have you tried relaxing the threshold? You can also try the Watson STT customization service to boost the probabilities of names if the task is name-focused.
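For example, a hedged tweak of the question's own builder (reusing its service and audio objects) with a much more permissive keyword threshold; the exact setters depend on your SDK version:

RecognizeOptions relaxed = new RecognizeOptions.Builder()
        .contentType(HttpMediaType.AUDIO_WAV)
        .continuous(true)
        .keywords(new String[]{"Agarwal", "Euler", "Qin"})
        .keywordsThreshold(0.1) // accept keyword spottings the engine is far less confident about
        .build();
SpeechResults relaxedTranscript = service.recognize(audio, relaxed);
System.out.println(relaxedTranscript);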

Weka java spreadsubsample filter

I'm quite familiar with Weka, as I've used the GUI. I'm doing some classification experiments that require the SpreadSubsample filter on both my training and testing data.
I'm learning Java and want to use the Weka API to do this. I've got to the point where I'm loading my training and testing data into Weka like so:
DataSource source = new DataSource("training.arff");
Instances trainingData = source.getDataSet();
if (trainingData.classIndex() == -1)
    trainingData.setClassIndex(trainingData.numAttributes() - 1);
and I'm getting an output; everything is working.
However, I have no idea how to implement a filter. I have the training and testing .arff files already produced and need to run them through the SpreadSubsample filter before feeding them to the classifier.
If anyone could help with a thorough explanation and answer, it'd be much appreciated. Thank you.
Here is some sample code:

import weka.core.Instances;
import weka.core.Utils;
import weka.filters.Filter;
import weka.filters.supervised.instance.SpreadSubsample;

SpreadSubsample ff = new SpreadSubsample();
String opt = " ";                                // any options you like, see the documentation
String[] optArray = Utils.splitOptions(opt);     // parse the options into the expected format
ff.setOptions(optArray);
ff.setInputFormat(trainingData);                 // 'trainingData' is the Instances object loaded above
Instances filteredInstances = Filter.useFilter(trainingData, ff);

Hope it helps.
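Since the question mentions filtering the testing data as well: the usual Weka batch-filtering idiom is to initialize the filter once on the training set and then pass both sets through the same filter instance. A hedged sketch, where 'testingData' is assumed to be loaded the same way as trainingData:

ff.setInputFormat(trainingData);                              // initialize once, on the training set
Instances filteredTrain = Filter.useFilter(trainingData, ff);
Instances filteredTest = Filter.useFilter(testingData, ff);   // same filter instance for the test set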

Naive Bayes Text Classification Algorithm

Hi there! I just need help implementing the Naive Bayes text classification algorithm in Java, to test my data set for research purposes. It is compulsory to implement the algorithm in Java, rather than using Weka or RapidMiner tools, to get the results!
My Data Set has the following type of Data:
Doc Words Category
This means that the words and the category of each training document (a String) are known in advance. Some of the data set is given below:
Doc Words Category
Training
1 Integration Communities Process Oriented Structures...(more string) A
2 Integration Communities Process Oriented Structures...(more string) A
3 Theory Upper Bound Routing Estimate global routing...(more string) B
4 Hardware Design Functional Programming Perfect Match...(more string) C
.
.
.
Test
5 Methodology Toolkit Integrate Technological Organisational
6 This test contain string naive bayes test text text test
So the data set comes from a MySQL database, and it may contain multiple training strings as well as test strings. The thing is, I just need to implement the Naive Bayes text classification algorithm in Java.
The algorithm should follow the example in Table 13.1, mentioned here.
Source: Read here
I can implement the algorithm in Java myself, but I just need to know whether there exists some Java library, with source code and documentation available, that would allow me to just test the results.
The problem is that I need the results only once; it is just a one-time test.
So, to come to the point: can somebody tell me about a good Java library that would help me code this algorithm and make it possible to process my dataset and get the results? Or can somebody give me good ideas on how to do it easily... anything that can help me.
I will be thankful for your help.
Thanks in advance
As per your requirement, you can use the machine learning library MLlib from Apache Spark. MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities. There is also a Java code template for implementing the algorithm using the library. So, to begin with, you can implement the Java skeleton for Naive Bayes provided on their site, as given below.
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import scala.Tuple2;
JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ... // test set
final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
JavaPairRDD<Double, Double> predictionAndLabel =
    test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
        @Override public Tuple2<Double, Double> call(LabeledPoint p) {
            return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
        }
    });

double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
    @Override public Boolean call(Tuple2<Double, Double> pl) {
        return pl._1().equals(pl._2());
    }
}).count() / (double) test.count();
For working with your datasets, there is no better fit here than Spark SQL; MLlib fits into Spark's APIs perfectly. To start, I would recommend going through the MLlib API first and implementing the algorithm according to your needs. This is pretty easy using the library.
For the next step, to make processing your datasets possible, just use Spark SQL.
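Since your data set lives in MySQL, here is a hedged sketch of pulling a table in through Spark SQL's JDBC source (Spark 1.4+ DataFrame API; the URL, table name, and credentials below are placeholders, and the MySQL JDBC driver must be on the classpath):

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SQLContext sqlContext = new SQLContext(sc); // 'sc' is your existing JavaSparkContext

DataFrame docs = sqlContext.read()
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/mydb") // placeholder URL
    .option("dbtable", "training_docs")                // placeholder table
    .option("user", "USERNAME")
    .option("password", "PASSWORD")
    .load();

docs.show(); // sanity check before converting rows to LabeledPoint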
I recommend you stick with this. I too hunted down multiple options before settling on this easy-to-use library and its seamless support for interoperation with other technologies. I would have posted the complete code here to fit your answer perfectly, but I think you are good to go.
You can use the Weka Java API and include it in your project if you do not want to use the GUI.
Here's a link to the documentation to incorporate a classifier in your code:
https://weka.wikispaces.com/Use+WEKA+in+your+Java+code
Please take a look at the Bow toolkit.
It has a GNU license and source code. Some of its features include:
setting word vector weights according to Naive Bayes, TFIDF, and several other methods;
performing test/train splits and automatic classification tests.
It's not a Java library, but you could compile the C code to check that your Java gives similar results for a given corpus.
I also spotted a decent Dr. Dobb's article that implements it in Perl. Once again, not the desired Java, but it will give you the one-time results that you are asking for.
Hi, I think Spark would help you a lot:
http://spark.apache.org/docs/1.2.0/mllib-naive-bayes.html
You can even choose the language you think is most appropriate to your needs: Java, Python, or Scala!
You may want to take a look at this.
https://mahout.apache.org/users/classification/bayesian.html
Please use scikit-learn from Python; there is already an implementation of what you need:
class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
You can use an algorithm platform like KNIME; it has a variety of classification algorithms (Naive Bayes included). You can run it with a GUI or through its Java API.
If you want to implement the Naive Bayes text classification algorithm in Java, then the WEKA Java API is a good solution. The data set has to be in .arff format, and creating an .arff file from a MySQL database is very easy. Below are the Java code for the classifier and a link to a sample .arff file.
Create a new text document, open it with Notepad, copy and paste all the text behind the link, and save it as DataSet.arff: http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.arff
Download the Weka Java API: http://www.java2s.com/Code/Jar/w/weka.htm
Code for the classifier:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

public static void main(String[] args) {
    try {
        StringBuilder txtAreaShow = new StringBuilder();
        // Read the .arff file.
        BufferedReader breader = new BufferedReader(new FileReader("DataSet.arff"));
        // If 40 attributes are available, index 39 (the last one) is the class attribute (yes/no).
        Instances train = new Instances(breader);
        train.setClassIndex(train.numAttributes() - 1);
        breader.close();

        NaiveBayes nB = new NaiveBayes();
        nB.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(nB, train, 10, new Random(1));

        System.out.println("Run Information\n=====================");
        System.out.println("Scheme: " + nB.getClass().getName());
        System.out.println("Relation: " + train.relationName());
        System.out.println("\nClassifier Model(full training set)\n===============================");
        System.out.println(nB);
        System.out.println(eval.toSummaryString("\nSummary Results\n==================", true));
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());

        // The same output, collected for a text area.
        txtAreaShow.append("\n\n\n");
        txtAreaShow.append("Run Information\n===================\n");
        txtAreaShow.append("Scheme: " + nB.getClass().getName());
        txtAreaShow.append("\n\nClassifier Model(full training set)"
                + "\n======================================\n");
        txtAreaShow.append("" + nB);
        txtAreaShow.append(eval.toSummaryString("\n\nSummary Results\n==================\n", true));
        txtAreaShow.append(eval.toClassDetailsString());
        txtAreaShow.append(eval.toMatrixString());
        txtAreaShow.append("\n\n\n");
        System.out.println(txtAreaShow.toString());
    } catch (FileNotFoundException ex) {
        System.err.println("File not found");
        System.exit(1);
    } catch (IOException ex) {
        System.err.println("Invalid input or output.");
        System.exit(1);
    } catch (Exception ex) {
        System.err.println("Exception occurred!");
        System.exit(1);
    }
}
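Since the data set comes from MySQL, a hedged aside: Weka can also load instances directly from a database via weka.experiment.InstanceQuery, skipping the hand-made .arff file. This assumes the MySQL JDBC driver is on the classpath and DatabaseUtils.props is configured for MySQL; the URL, credentials, and query below are placeholders:

import weka.core.Instances;
import weka.experiment.InstanceQuery;

InstanceQuery query = new InstanceQuery();
query.setDatabaseURL("jdbc:mysql://localhost:3306/mydb"); // placeholder URL
query.setUsername("USERNAME");
query.setPassword("PASSWORD");
query.setQuery("SELECT words, category FROM training_docs"); // placeholder query

Instances train = query.retrieveInstances();
train.setClassIndex(train.numAttributes() - 1);

For a free-text words column you would still need to run the StringToWordVector filter before training the classifier.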
You can take a look at Blayze - it's a pretty minimal Naive Bayes library for the JVM, written in Kotlin. It should be easy to follow.
Full disclosure: I'm one of the authors of Blayze

Has anyone used pubchemdb? Any similar API?

Update: The link in the answer is both interesting and useful, but unfortunately it does not address the need for a Java API, so I am still looking forward to any input.
I'm building a database of chemical compounds. I need all the synonyms (IUPAC and common names) as well as safety data for each.
I'll be using the freely available data at PubChem (http://pubchem.ncbi.nlm.nih.gov/)
There's an easy way of querying each compound with simple HTTP GETs. For example, to obtain glycerol data, the URL is:
http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=753
And the following URL would return an easy to parse format:
http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=753&disopt=DisplaySDF
but it returns only very basic info, lacking safety data and giving only a few common names.
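What I have so far is a plain HTTP fetch of that SDF record, along these lines (a minimal sketch using only java.net, no PubChem-specific library):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class PubChemFetch {
    public static void main(String[] args) throws Exception {
        // The same glycerol record as above, in SDF format.
        URL url = new URL("http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=753&disopt=DisplaySDF");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw SDF, still needs parsing downstream
            }
        }
    }
}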
There is one public-domain API for Java that seems very complete, developed by a group at Scripps (citation). The code is here.
Unfortunately, this API is not very well documented, and it's quite difficult to follow due to the complexity of the data involved.
From what I gathered, pubchemdb uses the PubChem Power User Gateway (PUG) XML API.
Has anyone used this API (or any other one available)? I would appreciate a short description or tutorial on how to start with it.
The Cactvs Chemoinformatics toolkit (free for academic/educational use) has full PubChem integration. Using the scripting environment, you can easily do something like
cactvs>ens create 753
ens0
cactvs>ens get ens0 E_NAMESET
PROPANE-1,2,3-TRIOL GLYCEROL 8043-29-6 29796-42-7 30049-52-6 37228-54-9 75398-78-6 78630-16-7 8013-25-0 175385-78-1 25618-55-7 64333-26-2 56-81-5 {Tegin M} LS-1377 G8773_SIGMA 15523_RIEDEL {Glycerin, natural} NCGC00090950-03 191612_ALDRICH 15524_RIEDEL {Glycerol solution} L-glycerol 49767_FLUKA {Biodiesel impurity} 49770_FLUKA 49771_FLUKA NCGC00090950-01 49927_FLUKA Glycerol-Gelatine G7757_SIAL GOL D-glycerol G9012_SIAL {Polyhydric alcohols} c0066 MOON {NSC 9230} G2025_SIGMA ZINC00895048 49781_FLUKA {Concentrated glycerin} {Concentrated glycerin (JP15)} D00028 {Glycerin (JP15/USP)} 44892U_SUPELCO {Glycerin, concentrated (JAN)} CRY 49782_FLUKA NCGC00090950-02 G6279_SIAL W252506_ALDRICH G7893_SIAL {Glycerin, concentrated} 33224_RIEDEL Bulbold Cristal Glyceol G9281_SIGMA Glycerol-1,2,3-3H G1901_SIGMA G7043_SIGMA 1,2,3-trihydroxypropane 1,2,3-trihydroxypropanol glycerin G2289_SIAL G9406_SIGMA {Glycerol-[2-3H]} CHEBI:17754 Glyzerin Oelsuess InChI=1/C3H8O3/c4-1-3(6)2-5/h3-6H,1-2H {90 Technical glycerine} Dagralax {Glycerin, anhydrous} {Glycerin, synthetic} Glycerine Glyceritol {Glycyl alcohol} Glyrol Glysanin NSC9230 Ophthalgan Osmoglyn Propanetriol {Synthetic glycerin} {Synthetic glycerine} Trihydroxypropane Vitrosupos {WLN: Q1YQ1Q} Glycerol-1,3-14C {4-01-00-02751 (Beilstein Handbook Reference)} AI3-00091 {BRN 0635685} {CCRIS 2295} {Caswell No. 469} {Citifluor AF 2} {Clyzerin, wasserfrei [German]} {EINECS 200-289-5} {EPA Pesticide Chemical Code 063507} {FEMA No. 2525} {Glicerina [DCIT]} {Glicerol [INN-Spanish]} {Glycerin (mist)} {Glycerin [JAN]} {Glycerin mist} {Glycerine mist} Glycerinum {Glycerolum [INN-Latin]} Grocolene {HSDB 492} IFP {Incorporation factor} 1,2,3-Propanetriol C00116 Optim {Propanetriol (VAN)} {1,2,3-PROPANETRIOL, HOMOPOLYMER} {Glycerol polymer} {Glycerol, polymers} {HL 80} {PGL 300} {PGL 500} {PGL 700} Polyglycerin Polyglycerine Polyglycerol {Unigly G 2} {Unigly G 6} G5516_SIGMA MolMap_000024
cactvs>
This hides all PUG ugliness - but in any case, I dare say that PUG is well documented. The toolkit goes much beyond simple data downloads - you can even open and query PubChem like a local SD file if you want to.
PubChem does not contain safety data, though. Safety data is country/region-dependent and strictly regulated, so you should be really careful not to be hit with liabilities. Have your approach checked by legal personnel!

XEP-0080 User Location in Smack Library

I would like to create a simple XMPP client in Java that shares its location (XEP-0080) with other clients.
I already know I can use the Smack library for XMPP and that it supports PEP, which is needed for XEP-0080.
Does anyone have an example of how to implement this, or any pointers? I can't find anything using Google.
Thanks in advance.
Kristof's right, the docs are sparse - but they are getting better. There is a good, albeit hard-to-find, set of docs on extensions, though. The PubSub one is at http://www.igniterealtime.org/fisheye/browse/~raw,r=11613/svn-org/smack/trunk/documentation/extensions/pubsub.html.
After going the from-scratch custom IQ provider route with an extension, I found it was easier to use the managers as much as possible. The developers who wrote the managers have abstracted away a lot of the pain points.
Example (a modified-for-geoloc version of one rcollier wrote on the Smack forum):
ConfigureForm form = new ConfigureForm(FormType.submit);
form.setPersistentItems(false);
form.setDeliverPayloads(true);
form.setAccessModel(AccessModel.open);

PubSubManager manager = new PubSubManager(connection, "pubsub.communitivity.com");
Node myNode = manager.createNode("http://jabber.org/protocol/geoloc", form);

StringBuilder body = new StringBuilder(); // whitespace for readability
body.append("<geoloc xmlns='http://jabber.org/protocol/geoloc' xml:lang='en'>");
body.append("  <country>Italy</country>");
body.append("  <lat>45.44</lat>");
body.append("  <locality>Venice</locality>");
body.append("  <lon>12.33</lon>");
body.append("  <accuracy>20</accuracy>");
body.append("</geoloc>");

SimplePayload payload = new SimplePayload(
        "geoloc",
        "http://jabber.org/protocol/geoloc",
        body.toString());

String itemId = "zz234";
Item<SimplePayload> item = new Item<SimplePayload>(itemId, payload);

// Required to receive the events being published
myNode.addItemEventListener(myEventHandler);

// Publish item
myNode.publish(item);
Or at least that's the hard way :). Just remembered there's a PEPManager now...
PEPProvider pepProvider = new PEPProvider();
pepProvider.registerPEPParserExtension(
        "http://jabber.org/protocol/tune", new TuneProvider());
ProviderManager.getInstance().addExtensionProvider(
        "event",
        "http://jabber.org/protocol/pubsub#event", pepProvider);

Tune tune = new Tune("jeff", "1", "CD", "My Title", "My Track");
pepManager.publish(tune);
You'd need to write the GeoLocProvider and GeoLoc classes; a rough sketch of the latter is below.
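A hedged sketch of what a minimal GeoLoc extension could look like, modeled on the Smack 3.x PacketExtension interface and trimmed to the XEP-0080 fields used above (the matching GeoLocProvider would parse this XML back into the object):

import org.jivesoftware.smack.packet.PacketExtension;

public class GeoLoc implements PacketExtension {
    private final double lat;
    private final double lon;
    private final String country;
    private final String locality;

    public GeoLoc(double lat, double lon, String country, String locality) {
        this.lat = lat;
        this.lon = lon;
        this.country = country;
        this.locality = locality;
    }

    public String getElementName() { return "geoloc"; }

    public String getNamespace() { return "http://jabber.org/protocol/geoloc"; }

    public String toXML() {
        StringBuilder xml = new StringBuilder();
        xml.append("<geoloc xmlns='").append(getNamespace()).append("' xml:lang='en'>");
        xml.append("<country>").append(country).append("</country>");
        xml.append("<lat>").append(lat).append("</lat>");
        xml.append("<locality>").append(locality).append("</locality>");
        xml.append("<lon>").append(lon).append("</lon>");
        xml.append("</geoloc>");
        return xml.toString();
    }
}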
I covered a pure PEP based approach as an alternative method in detail for Android here: https://stackoverflow.com/a/26719158/406920.
This will be very close to what you'd need to do with regular Smack.
Take a look at the existing code for implementations of other extensions. This will be your best example of how to develop with the current library. Unfortunately, there is no developers guide that I know of, so I just poked around to understand some of the basics myself until I felt comfortable with the environment. Hint: Use the providers extension facility to add custom providers for the extension specific stanzas.
You can ask questions on the developer forum for Smack, and contribute your code back to the project from here as well. If you produce an implementation of this extension, then you could potentially get commit privileges yourself if you want it.
