Here is my code:
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class POI {
    public POI() throws IOException, InvalidFormatException {
        XWPFDocument doc = new XWPFDocument(OPCPackage.open("input.docx"));
        for (XWPFParagraph p : doc.getParagraphs()) {
            for (XWPFRun r : p.getRuns()) {
                String text = r.getText(0);
                System.out.println(text);
                if (text.contains("needle")) {
                    text = text.replace("needle", "haystack");
                    r.setText(text);
                    System.out.println(text);
                }
            }
        }
        doc.write(new FileOutputStream("output.docx"));
    }
}
This code is meant to replace text in a .docx document. My input to the program is input.docx, which contains the data below:
needle
game
system
My output was output.docx, and it contained the data below:
needlehaystack
game
system
You can see the difference. Instead of "replacing" the word needle with haystack, it has simply added haystack right next to needle.
I have no idea what I am doing wrong here. How can I properly replace text in .docx files?
Absolutely no experience with this, but by symmetry with r.getText(0) it should be:
r.setText(text, 0);
The one-argument setText appends to the run's existing text, while the two-argument overload replaces the text at the given position, which is why you saw "needlehaystack".
Also, keep using text.replace("needle", "haystack") rather than replaceAll: replaceAll treats its first argument as a regex, and "needle" is not intended as a regex.
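Putting it together, a minimal sketch of the corrected loop (the same approach as the question's code, plus a null guard, since getText(0) can return null for empty runs):

import java.io.FileOutputStream;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class ReplaceInDocx {
    public static void main(String[] args) throws Exception {
        XWPFDocument doc = new XWPFDocument(OPCPackage.open("input.docx"));
        for (XWPFParagraph p : doc.getParagraphs()) {
            for (XWPFRun r : p.getRuns()) {
                String text = r.getText(0);
                if (text != null && text.contains("needle")) {
                    // setText(String, int) replaces the text at the given position;
                    // the one-argument overload appends instead
                    r.setText(text.replace("needle", "haystack"), 0);
                }
            }
        }
        try (FileOutputStream out = new FileOutputStream("output.docx")) {
            doc.write(out);
        }
    }
}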
I am using Stanford CoreNLP to do some NER analysis on txt files. The problem so far is that I have not been able to read all the files in a directory; I have only been able to process simple Strings. What should be the next step to read several files? I tried with an Iterator, but it did not work.
Please see my code below:
import java.io.IOException;
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class NLPtest2 {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse, ner, dcoref, sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // how can we read all documents in a directory instead of just a String??
        String text = "I work at Lalalala Ltd. It is awesome";
        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
        List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            String sentiment = sentence.get(SentimentCoreAnnotations.SentimentClass.class);
            System.out.println(sentiment + "\t" + sentence);
            // traversing the words in the current sentence;
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                String word = token.get(TextAnnotation.class);         // the text of the token
                String pos = token.get(PartOfSpeechAnnotation.class);  // the POS tag of the token
                String ne = token.get(NamedEntityTagAnnotation.class); // the NER label of the token
                System.out.println("Text:" + word + "//" + "Part of Speech:" + pos + "//" + "Entity Recognition:" + ne);
            }
        }
    }
}
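One approach is to put the paths of the files you want to process in a text file, one per line, and read them with Stanford's IOUtils: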
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.util.*;
import java.util.*;

public class ReadFiles {
    public static void main(String[] args) {
        // args[0] is a text file listing one file path per line
        List<String> filePaths = IOUtils.linesFromFile(args[0]);
        for (String filePath : filePaths) {
            String fileContents = IOUtils.stringFromFile(filePath);
            // process fileContents here, e.g. hand it to the CoreNLP pipeline
        }
    }
}
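Alternatively, a minimal sketch that walks a directory directly with plain Java (the directory argument and the .txt filter are assumptions for illustration) and runs the pipeline on each file:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class AnnotateDirectory {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        File dir = new File(args[0]); // directory containing the .txt files
        for (File f : dir.listFiles((d, name) -> name.endsWith(".txt"))) {
            String text = new String(Files.readAllBytes(f.toPath()), "UTF-8");
            Annotation annotation = new Annotation(text);
            pipeline.annotate(annotation);
            // process sentences/tokens exactly as in the question's code
        }
    }
}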
I am developing a project which needs a docx file to be converted to PDF. I found the same question already posted and used the code provided by "Kishan C S". It uses docx4j 2.8.1.
The code is working fine and the PDF is generated, but the one problem I am facing is that the docx file contains a logo.jpg (an image in the header part) which is not converted. Only the textual content is converted to PDF.
I am posting the code I have used. Please let me know how I can solve this problem.
P.S.: the link I referred to: Convert docx file into PDF with Java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.List;
import org.apache.log4j.Level;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;
import org.docx4j.convert.out.pdf.viaXSLFO.PdfSettings;
import org.docx4j.fonts.IdentityPlusMapper;
import org.docx4j.fonts.Mapper;
import org.docx4j.fonts.PhysicalFont;
import org.docx4j.fonts.PhysicalFonts;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
public class DocxConverter {
    public static void main(String[] args) throws FileNotFoundException, Docx4JException, Exception {
        InputStream is = new FileInputStream(new File("D:\\Test\\C_IN0004_AppointmentLetter.docx"));
        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(is);

        // touch each section's page dimensions so they are initialized before conversion
        List sections = wordMLPackage.getDocumentModel().getSections();
        for (int i = 0; i < sections.size(); i++) {
            wordMLPackage.getDocumentModel().getSections().get(i).getPageDimensions();
        }

        // map fonts used in the document to physical fonts on this machine
        Mapper fontMapper = new IdentityPlusMapper();
        PhysicalFont font = PhysicalFonts.getPhysicalFonts().get("Comic Sans MS"); // set your desired font
        fontMapper.getFontMappings().put("Algerian", font);
        wordMLPackage.setFontMapper(fontMapper);

        PdfSettings pdfSettings = new PdfSettings();
        org.docx4j.convert.out.pdf.PdfConversion conversion =
                new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);

        // turn off the log4j loggers
        List<Logger> loggers = Collections.<Logger>list(LogManager.getCurrentLoggers());
        loggers.add(LogManager.getRootLogger());
        for (Logger logger : loggers) {
            logger.setLevel(Level.OFF);
        }

        OutputStream out = new FileOutputStream(new File("D:\\Test\\C_IN0004_AppointmentLetter.pdf"));
        conversion.output(out, pdfSettings);
        System.out.println("DONE!!");
    }
}
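If upgrading is an option: newer docx4j (3.x) replaced this viaXSLFO setup with a simpler Docx4J facade. I have not verified it against your document, so treat this as an assumption, but it may be worth trying whether it picks up the header image. A minimal sketch, assuming docx4j 3.x on the classpath:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.docx4j.Docx4J;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class DocxToPdfFacade {
    public static void main(String[] args) throws Exception {
        // load the same document as in the question
        WordprocessingMLPackage pkg =
                WordprocessingMLPackage.load(new File("D:\\Test\\C_IN0004_AppointmentLetter.docx"));
        try (OutputStream out = new FileOutputStream("D:\\Test\\C_IN0004_AppointmentLetter.pdf")) {
            // one call replaces the manual viaXSLFO conversion setup
            Docx4J.toPDF(pkg, out);
        }
    }
}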
I am trying to train an en-ner-location.bin file using OpenNLP in Java. The thing is, I got the training text file in the following format:
<START:location> Fontana <END>
<START:location> Palo Verde <END>
<START:location> Picacho <END>
and I trained it using the following code:
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Collections;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
public class TrainNames {
    @SuppressWarnings("deprecation")
    public void TrainNames() throws IOException {
        File fileTrainer = new File("citytrain.txt");
        File output = new File("en-ner-location.bin");
        ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream(fileTrainer), "UTF-8");
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
        System.out.println("lineStream = " + lineStream);
        TokenNameFinderModel model = NameFinderME.train("en", "location", sampleStream,
                Collections.<String, Object>emptyMap(), 1, 0);
        BufferedOutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(output));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null)
                modelOut.close();
        }
    }
}
I got no errors or warnings, but when I try to get a city name from a string like cnt = "John is planning to specialize in Electrical Engineering in UC Fontana and pursue a career with IBM.";, it returns the whole string.
Could anybody tell me why?
Welcome to SO! It looks like you need more context around each location annotation. I believe right now OpenNLP thinks you are training it to find words (any word), because your training data has only one word per line. You need to annotate locations within whole sentences, and you will need at least a few hundred samples to start seeing good results.
See this answer as well:
How I train an Named Entity Recognizer identifier in OpenNLP?
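For illustration, the training data should look more like this (made-up sentences on my part; the <START:location> ... <END> markup is the same, but each annotation sits inside a full sentence, one sentence per line):

<START:location> Fontana <END> is a city in San Bernardino County .
He moved from <START:location> Palo Verde <END> to <START:location> Picacho <END> last year .
The new plant opened near <START:location> Fontana <END> in 1998 .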
I have a .doc file, and I want to find superscript and subscript text in it using Apache POI.
The following example shows a way to read superscript/subscript from a .docx file; the approach for a binary .doc will be similar.
package demo.poi;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.VerticalAlign;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.junit.Test;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Iterator;
public class DocReaderTest {
    @Test
    public void showReadDocWithSubscriptAndSuperScript() throws IOException, InvalidFormatException {
        File docFile = new File("C:/temp/sample.docx");
        XWPFDocument hdoc = new XWPFDocument(OPCPackage.openOrCreate(docFile));
        Iterator<XWPFParagraph> paragraphsIterator = hdoc.getParagraphsIterator();
        while (paragraphsIterator.hasNext()) {
            XWPFParagraph next = paragraphsIterator.next();
            for (XWPFRun xwrun : next.getRuns()) {
                VerticalAlign subscript = xwrun.getSubscript();
                String smalltext = xwrun.getText(0);
                switch (subscript) {
                    case BASELINE:
                        System.out.println("smalltext, plain = " + smalltext);
                        break;
                    case SUBSCRIPT:
                        System.out.println("smalltext, subscript = " + smalltext);
                        break;
                    case SUPERSCRIPT:
                        System.out.println("smalltext, superscript = " + smalltext);
                        break;
                }
            }
        }
    }
}
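For an old binary .doc, the HWPF counterpart is CharacterRun.getSubSuperScriptIndex(), where 0 is baseline, 1 superscript, and 2 subscript. A minimal, untested sketch (the C:/temp/sample.doc path is a placeholder):

import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Range;

public class DocSubSuperScriptReader {
    public static void main(String[] args) throws IOException {
        HWPFDocument doc = new HWPFDocument(new FileInputStream("C:/temp/sample.doc"));
        Range range = doc.getRange();
        // iterate over every character run in the document body
        for (int i = 0; i < range.numCharacterRuns(); i++) {
            CharacterRun run = range.getCharacterRun(i);
            switch (run.getSubSuperScriptIndex()) {
                case 1:
                    System.out.println("superscript = " + run.text());
                    break;
                case 2:
                    System.out.println("subscript = " + run.text());
                    break;
                default: // 0 = baseline text
                    break;
            }
        }
    }
}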
How can I read data from solr/data/index with a simple console Java application? I found one solution, but maybe there is a simpler way.
Please help me with this; I really don't know what to do.
Here is my own solution. I took the index files from Solr 4.4, and I used the lucene-core-4.4.0.jar library. Maybe it can help someone.
import java.io.File;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SomeClass {
    public static void main(String[] args) throws IOException {
        Directory dirIndex = FSDirectory.open(new File("solr/home/data/index"));
        // in Lucene 4.x, index readers are opened via DirectoryReader
        IndexReader indexReader = DirectoryReader.open(dirIndex);
        for (int i = 0; i < indexReader.numDocs(); i++) {
            // print the stored fields of each document
            Document doc = indexReader.document(i);
            System.out.println(doc.toString());
        }
        indexReader.close();
        dirIndex.close();
    }
}
There is a project called Luke which is a GUI for users to inspect Lucene indices.
Here is a blog post with more detail.