I am trying to train an en-ner-location.bin model file using OpenNLP in Java. My training text file is in the following format:
<START:location> Fontana <END>
<START:location> Palo Verde <END>
<START:location> Picacho <END>
and I trained the model using the following code:
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Collections;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;
public class TrainNames {
@SuppressWarnings("deprecation")
public void TrainNames() throws IOException{
File fileTrainer=new File("citytrain.txt");
File output=new File("en-ner-location.bin");
ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream(fileTrainer), "UTF-8");
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
System.out.println("lineStream = " + lineStream);
TokenNameFinderModel model = NameFinderME.train("en", "location", sampleStream, Collections.<String, Object>emptyMap(), 1, 0);
BufferedOutputStream modelOut = null;
try {
modelOut = new BufferedOutputStream(new FileOutputStream(output));
model.serialize(modelOut);
} finally {
if (modelOut != null)
modelOut.close();
}
}
}
I got no errors or warnings, but when I try to extract a city name from a string like this: cnt="John is planning to specialize in Electrical Engineering in UC Fontana and pursue a career with IBM."; it returns the whole string.
Can anybody tell me why?
Welcome to SO! It looks like you need more context around each location annotation. I believe that right now OpenNLP thinks you are training it to find words (any word), because each line of your training data contains nothing but the location itself. You need to annotate locations within whole sentences, and you will need at least a few hundred samples to start seeing good results.
See this answer as well:
How I train an Named Entity Recognizer identifier in OpenNLP?
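To illustrate what "annotating locations within whole sentences" could look like, here is a minimal sketch that uses only the JDK. It takes a whitespace-tokenized sentence and wraps a known location name in the same <START:location> ... <END> tags used in the question. The class and method names (LocationAnnotator, annotate) and the sample sentence are made up for illustration and are not part of OpenNLP.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of OpenNLP): produces one training line per sentence.
final class LocationAnnotator {

    // Wraps every whole-word occurrence of `location` in the sentence
    // with OpenNLP-style name tags, leaving the rest of the sentence as context.
    static String annotate(String sentence, String location) {
        Pattern wholeWord = Pattern.compile("\\b" + Pattern.quote(location) + "\\b");
        String tagged = "<START:location> " + location + " <END>";
        return wholeWord.matcher(sentence).replaceAll(Matcher.quoteReplacement(tagged));
    }

    public static void main(String[] args) {
        // Assumed sample sentence; real training data needs many such lines.
        System.out.println(annotate("She moved to Fontana last year .", "Fontana"));
    }
}
```

Each resulting line, for example "She moved to <START:location> Fontana <END> last year .", would go into the training file in place of the bare location entries.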
While performing a mail merge using Java, newline characters are converted into spaces in Microsoft Word, so line breaks are lost after the merge. I need to retain the line breaks in the text in Microsoft Word after the merge is done.
Java code to perform mailmerge using IXDocReport:
package sample;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import fr.opensagres.xdocreport.document.IXDocReport;
import fr.opensagres.xdocreport.document.registry.XDocReportRegistry;
import fr.opensagres.xdocreport.template.IContext;
import fr.opensagres.xdocreport.template.TemplateEngineKind;
public class Sample {
public static void main(String[] args) throws Exception{
// 1) Load the DOCX file, set the Velocity template engine, and cache it in the registry
InputStream in= new FileInputStream(new File("sample.docx"));
IXDocReport report = XDocReportRegistry.getRegistry().loadReport(in, TemplateEngineKind.Velocity);
// 2) Create Java model context
IContext context = report.createContext();
context.put("name", "new \n world");
// 3) Generate the report by merging the Java model with the DOCX template
OutputStream out = new FileOutputStream(new File("ODTHelloWordWithVelocity_Out.docx"));
report.process(context, out);
}
}
Here is my code:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
public class POI {
public POI() throws IOException, InvalidFormatException
{
XWPFDocument doc = new XWPFDocument(OPCPackage.open("input.docx"));
for (XWPFParagraph p : doc.getParagraphs()) {
for (XWPFRun r : p.getRuns()) {
String text = r.getText(0);
System.out.println(text);
if (text.contains("needle"))
{
text = text.replace("needle", "haystack");
r.setText(text);
System.out.println(text);
}
}
}
doc.write(new FileOutputStream("output.docx"));
}
}
This code is meant to replace text in a .docx document. My input to the program is input.docx, which contains the following data:
needle
game
system
My output was output.docx, which contained the following data:
needlehaystack
game
system
You can see the difference. Instead of "replacing" the word needle with haystack it has simply added haystack right next to needle.
I have no idea about what I am doing wrong here. How can I properly replace text in .docx files?
I have absolutely no experience with this, but by symmetry with r.getText(0), it should be:
r.setText(text, 0);
Try using text.replace("needle", "haystack");
replaceAll uses regex and "needle" is not intended as regex.
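The difference between the two String methods mentioned above is easiest to see with a target that contains a regex metacharacter. The following is a small JDK-only sketch; the class name and sample strings are invented for the illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: contrasts literal and regex-based replacement.
final class ReplaceVsReplaceAll {

    // String.replace treats the target as literal text.
    static String literal(String text, String target, String replacement) {
        return text.replace(target, replacement);
    }

    // String.replaceAll treats the first argument as a regular expression.
    static String regex(String text, String pattern, String replacement) {
        return text.replaceAll(pattern, replacement);
    }

    public static void main(String[] args) {
        System.out.println(literal("v1.0", ".", "-"));              // only the dot is replaced
        System.out.println(regex("v1.0", ".", "-"));                // "." matches every character
        System.out.println(regex("v1.0", Pattern.quote("."), "-")); // quoted, so literal again
    }
}
```

For a plain word such as "needle" both methods give the same result, so this matters only when the search text may contain characters like ".", "*", or "(".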
I've seen the answers provided here (Formatting XML file using StAX) and here (merge XML using STAX XMLStreamWriter).
In both cases it did not work, because my IDE, NetBeans, doesn't recognize the methods as valid. This is driving me crazy. Thanks in advance for your help.
Here's the code that doesn't work, I'm simply trying to wrap my writer in an IndentingWriter.
XMLOutputFactory outputFactory = XMLOutputFactory.newFactory();
XMLEventWriter writer = null;
try {
writer = outputFactory
.createXMLEventWriter(new FileOutputStream(args[1]), "UTF-8");
writer = new IndentingXMLEventWriter(writer);
} catch (XMLStreamException ex) {
Logger.getLogger(XMLReader.class.getName()).log(Level.SEVERE, null, ex);
}
Here is a list of my imports:
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Iterator;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.StringTokenizer;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartDocument;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
XMLOutputFactory.createXMLEventWriter has another signature, where it accepts a Writer instead of an OutputStream. If you use that, you can use the append method to add a newline wherever you want (shown here with a StringWriter, but should work for other types of Writers as well):
StringWriter outputXml = new StringWriter();
XMLEventWriter eventWriter = factory.createXMLEventWriter(outputXml);
XMLEvent event = eventFactory.createStartDocument();
eventWriter.add(event);
eventWriter.flush(); // important!
outputXml.append("\n"); // add newline after XML declaration
You can even output text inside XML elements, not just between elements. This can come in handy if you need to emit some legacy or non-standard XML format.
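For completeness, here is a self-contained sketch of the approach above that relies only on the JDK's javax.xml.stream classes. The class name, the method name, and the "root" element are assumptions made for the example, and the exact form of the XML declaration may differ between StAX implementations.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;

// Hypothetical demo class: writes a tiny document with a newline after the declaration.
final class NewlineAfterDeclaration {

    static String write() {
        try {
            StringWriter out = new StringWriter();
            XMLEventFactory events = XMLEventFactory.newFactory();
            XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(out);

            writer.add(events.createStartDocument());
            writer.flush();   // push the declaration into `out` before appending by hand
            out.append("\n"); // raw newline written directly to the underlying Writer

            writer.add(events.createStartElement("", "", "root"));
            writer.add(events.createCharacters("hello"));
            writer.add(events.createEndElement("", "", "root"));
            writer.add(events.createEndDocument());
            writer.flush();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(write());
    }
}
```

The flush before each manual append is the important part: without it, the event writer may still be holding buffered output, and the appended text could land in the wrong place.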
I have already downloaded the full WordNet 2.0, but I don't understand how to use it as a graph, because it consists of multiple RDF files. I want to use the WordNet 2.0 ontology as a graph in Eclipse. The following is the snippet of code that I am using to load an ontology as a graph. I would also like to know whether I am going in the right direction.
URIFactory factory = URIFactoryMemory.getSingleton();
URI graph_uri = factory.createURI("http://graph/");
G graph = new GraphMemory(graph_uri);
String fpath ="D:/Workspace/SSM/src/wordnet-wordsensesandwords.rdf";
GDataConf graphconf = new GDataConf(GFormat.RDF_XML, fpath);
GAction actionRerootConf = new GAction(GActionType.REROOTING);
GraphConf gConf = new GraphConf();
gConf.addGDataConf(graphconf);
gConf.addGAction(actionRerootConf);
// GraphLoaderGeneric.populate(graphconf, graph);
GraphLoaderGeneric.load(gConf, graph);
// General information about the graph
System.out.println(graph.toString());
http://wordnet.princeton.edu/wordnet/download/old-versions/
You can use this link to download WordNet, and you can use Apache Jena to query it.
Once you have the results, you can represent them in the form of a graph.
You can also download WordNet in RDF format and display it as a graph using the Protege tool.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.URL;
import nu.xom.Builder;
import nu.xom.ParsingException;
import nu.xom.ValidityException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;
import edu.mit.jwi.item.ISynset;
import edu.mit.jwi.item.IWord;
import edu.mit.jwi.item.IWordID;
import edu.mit.jwi.item.POS;
public class Main
{
public static void main(String[] args)
{
try
{
FileInputStream file = new FileInputStream(new File("c:\\employees.xml"));
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(file);
XPath xPath = XPathFactory.newInstance().newXPath();
System.out.println("*************************");
String expression = "/Employees/Employee[@emplid='3333']/job";
System.out.println(expression);
String job = xPath.compile(expression).evaluate(xmlDocument);
System.out.println(job);
System.out.println("*************************");
String path = "C:\\Program Files\\WordNet\\2.1\\dict";
URL url = new URL("file", null, path);
IDictionary dict = new Dictionary(url);
dict.open();
IIndexWord idxWord = dict.getIndexWord(job, POS.NOUN);
IWordID wordID = idxWord.getWordIDs().get(0);
IWord word = dict.getWord(wordID);
ISynset synset= word.getSynset();
for (IWord w : synset.getWords())
System.out.println(w.getLemma());
}
catch(Exception a)
{
System.out.println(a);
}
}
}
This sample code reads the job of an employee from an XML file, queries WordNet for the synonyms of that word, and the synonyms can then be used to find terms similar to the job in the RDF graph.
I have only worked with WordNet for capturing related terms and hypernyms. I hope this helps.
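The XPath lookup in the first half of that sample can be tried on its own with an in-memory document, using only JDK classes. This is a sketch under assumptions: the class name, the method name, and the XML content are invented here, and the structure of employees.xml is guessed from the expression in the sample.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: selects an element by attribute value with XPath.
final class XPathJobLookup {

    // Returns the text of <job> for the employee whose emplid attribute matches.
    // Assumes emplid is a trusted, simple value (it is pasted into the expression).
    static String jobFor(String xml, String emplid) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            // "@" selects an attribute in XPath
            String expression = "/Employees/Employee[@emplid='" + emplid + "']/job";
            return xPath.compile(expression).evaluate(doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Employees>"
                + "<Employee emplid='3333'><job>engineer</job></Employee>"
                + "<Employee emplid='4444'><job>teacher</job></Employee>"
                + "</Employees>";
        System.out.println(jobFor(xml, "3333"));
    }
}
```

The returned string is what the sample then passes to the WordNet dictionary lookup.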
How can I read data from solr/data/index with a simple console Java application? I found a solution,
but maybe there is a simpler way.
Please help, I really don't know what to do.
Here is my own solution. I took the index files from Solr 4.4 and used the lucene-core-4.4.0.jar library. Maybe it can help someone.
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.solr.client.solrj.SolrServerException;
public class SomeClass {
public static void main(String[] args) throws IOException {
Directory dirIndex = FSDirectory.open(new File("solr/home/data/index"));
IndexReader indexReader = IndexReader.open(dirIndex);
for(int i = 0; i < indexReader.numDocs(); i++) {
// Print every stored document instead of only the last one
Document doc = indexReader.document(i);
System.out.println(doc.toString());
}
indexReader.close();
dirIndex.close();
}
}
There is a project called Luke which is a GUI for users to inspect Lucene indices.
Here is a blog post with more detail.