Reading superscript and subscript in a .doc file using Apache POI? - java

I have a .doc file and want to find the superscript and subscript text in it using Apache POI.

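The following example shows one way to read superscript and subscript runs from a .docx file using the XWPF classes. A binary .doc file should work in a similar way, but through POI's HWPF classes (HWPFDocument and its character runs) rather than XWPF.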
package demo.poi;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.VerticalAlign;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.junit.Test;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Iterator;
public class DocReaderTest {
    @Test
    public void showReadDocWithSubscriptAndSuperScript() throws IOException, InvalidFormatException {
        File docFile = new File("C:/temp/sample.docx");
        XWPFDocument hdoc = new XWPFDocument(OPCPackage.openOrCreate(docFile));
        Iterator<XWPFParagraph> paragraphsIterator = hdoc.getParagraphsIterator();
        while (paragraphsIterator.hasNext()) {
            XWPFParagraph next = paragraphsIterator.next();
            for (XWPFRun xwrun : next.getRuns()) {
                VerticalAlign subscript = xwrun.getSubscript();
                String smalltext = xwrun.getText(0);
                switch (subscript) {
                    case BASELINE:
                        System.out.println("smalltext, plain = " + smalltext);
                        break;
                    case SUBSCRIPT:
                        System.out.println("smalltext, subscript = " + smalltext);
                        break;
                    case SUPERSCRIPT:
                        System.out.println("smalltext, superscript = " + smalltext);
                        break;
                }
            }
        }
    }
}

Related

How to insert data from an Excel sheet into a database (including blank cells, without editing them) using Java

I need Java code that reads every cell in an Excel sheet, including blank cells, and inserts the data unchanged into a PostgreSQL database.
I have already tried Apache POI, MissingCellPolicy, a replace function, and so on, but the code throws a NullPointerException.
When I run my code to print the cell type of each cell, it prints no type for blank cells and skips them completely.
Here is the code:
package poiexcel;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.text.Format;
import java.text.SimpleDateFormat;
//import java.util.ArrayList;
import java.util.Date;
//import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.CellStyle;
import org.apache.poi.ss.usermodel.CellType;
import org.apache.poi.ss.usermodel.DataFormat;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.DateUtil;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
@SuppressWarnings("unused")
public class ReadExcel {
    public static final String SAMPLE_XLSX_FILE_PATH = "C:\\Raj's Documents\\Documents\\studentdata.xlsx";
    public static void main(String[] args) throws IOException, InvalidFormatException {
        try {
            FileInputStream file = new FileInputStream(new File(SAMPLE_XLSX_FILE_PATH));// Read the spreadsheet that
            // needs to be uploaded
            XSSFWorkbook myExcelBook = new XSSFWorkbook(file);
            DataFormatter dataFormatter = new DataFormatter();
            for (Sheet myExcelsheet : myExcelBook) {
                // ArrayList<SheetDetails> details = new ArrayList<SheetDetails>();
                // for (Row row : myExcelsheet) {
                for (int i = 0; i < myExcelsheet.getPhysicalNumberOfRows(); i++) {
                    Row row = myExcelsheet.getRow(i);
                    for (int j = 0; j < row.getPhysicalNumberOfCells(); j++) {
                        Cell cell = row.getCell(j);
                        CellType ct = cell.getCellType();
                        System.out.println(ct);
                    }
                    System.out.println();
                }
            }
            myExcelBook.close();
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Here is the output:
NUMERIC
STRING
java.lang.NullPointerException
at poiexcel.ReadExcel.main(ReadExcel.java:52)
In the 'dependencies' section of build.gradle (if you use Gradle; for Maven, use the matching dependency), add the dependency we want:
compile group: 'org.apache.commons', name: 'commons-csv', version: '1.5'
Add the following imports:
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import org.apache.commons.csv.*;
Parsing a CSV file exported from Excel is then as easy as:
public Iterable<CSVRecord> parse(String csvPath) throws IOException {
    Reader in = new FileReader(csvPath);
    return CSVFormat.EXCEL.withFirstRecordAsHeader().parse(in);
}
EXCEL can be changed to DEFAULT if the CSV files are plain-text CSV files rather than Excel exports.
We can then iterate over the returned records and read each value by column index, by column header name, or by an enum (the enum has to be defined first). Finally, persist these values into the relevant database entity via the corresponding repository.
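
On the NullPointerException in the question itself: in Apache POI, row.getCell(j) returns null for a cell that was never created, and getPhysicalNumberOfCells() counts only the cells that exist. Looping up to that count therefore both runs into null cells and stops before the last column. The usual remedy is to loop up to row.getLastCellNum() and either check for null or call row.getCell(j, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK). The sketch below only models that logic in plain Java, with a TreeMap standing in for a row; SparseRowDemo and readAllColumns are hypothetical names for illustration, not POI API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class SparseRowDemo {

    // Hypothetical stand-in for a spreadsheet row: only cells that exist
    // are stored, keyed by their column index.
    static List<String> readAllColumns(TreeMap<Integer, String> row) {
        List<String> values = new ArrayList<>();
        if (row.isEmpty()) {
            return values;
        }
        // Iterate up to the last used column, not up to the number of stored cells.
        int lastColumn = row.lastKey();
        for (int col = 0; col <= lastColumn; col++) {
            String cell = row.get(col); // null when the cell was never created
            values.add(cell == null ? "" : cell); // treat a missing cell as blank
        }
        return values;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> row = new TreeMap<>();
        row.put(0, "1");
        row.put(1, "Raj");
        row.put(3, "A"); // column 2 is blank and therefore not stored
        // row.size() is 3, so a loop bounded by the size would never reach column 3.
        System.out.println(readAllColumns(row));
    }
}
```

With real POI rows the same loop shape applies, and the whole row can itself be null when sheet.getRow(i) is called for an empty row, so that needs a null check as well.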

Java - Stanford NLP - Process all files in directory

I am using Stanford CoreNLP to do some NER analysis on txt files. The problem so far is that I have not been able to read all the files in a directory; I have only been able to process simple Strings. What should be the next step to read several files? I tried using an Iterator, but it did not work.
Please see my code below:
import java.io.*;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.NERClassifierCombiner;
import edu.stanford.nlp.pipeline.SentimentAnnotator;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.process.PTBEscapingProcessor;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.commons.io.FileUtils;
import com.google.common.io.Files;
import org.apache.commons.io.*;
public class NLPtest2 {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse, ner, dcoref, sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        //how can we read all documents in a directory instead of just a String??
        String text = "I work at Lalalala Ltd. It is awesome";
        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
        // Annotation annotation = pipeline.process(text);
        List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            String sentiment = sentence.get(SentimentCoreAnnotations.SentimentClass.class);
            System.out.println(sentiment + "\t" + sentence);
            // System.out.println(annotation.get(CoreAnnotations.QuotationsAnnotation.class));// dont need it
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                // this is the text of the token
                String word = token.get(TextAnnotation.class);
                // this is the POS tag of the token
                String pos = token.get(PartOfSpeechAnnotation.class);
                // this is the NER label of the token
                String ne = token.get(NamedEntityTagAnnotation.class);
                System.out.println( "Text:"+ word +"//"+"Part of Speech:"+ pos + "//"+ "Entity Recognition:"+ ne);
            }
        }
    }
}
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.util.*;
import java.util.*;
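public class ReadFiles {
    public static void main(String[] args) {
        // args[0] is a text file that lists one input file path per line
        List<String> filePaths = IOUtils.linesFromFile(args[0]);
        for (String filePath : filePaths) {
            // read the whole file into a String; annotate fileContents here
            String fileContents = IOUtils.stringFromFile(filePath);
        }
    }
}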
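
If you would rather scan a directory directly than maintain a list of paths, the standard java.nio.file API can do it. The following is a minimal sketch, assuming the inputs are UTF-8 files ending in .txt that sit directly in one directory; DirectoryTextReader and readTextFiles are hypothetical names, and the call into the CoreNLP pipeline is only indicated in a comment.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class DirectoryTextReader {

    // Returns file name -> file contents for every regular *.txt file
    // located directly inside dir (subdirectories are not visited).
    static Map<String, String> readTextFiles(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> entries = Files.list(dir)) {
            files = entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".txt"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        Map<String, String> contents = new LinkedHashMap<>();
        for (Path file : files) {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            contents.put(file.getFileName().toString(), text);
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, String> entry : readTextFiles(dir).entrySet()) {
            // In the question's program, entry.getValue() would take the place of
            // the hard-coded text: pipeline.annotate(new Annotation(entry.getValue()))
            System.out.println(entry.getKey() + ": " + entry.getValue().length() + " chars");
        }
    }
}
```

Building the pipeline once and reusing it for every file keeps the model-loading cost out of the loop.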

Java-Can't pass Directory variable as an argument to IndexReader.open() in Apache Lucene 6.4.2

I'm trying to use the open method described in the Lucene documentation here: https://lucene.apache.org/core/3_5_0/api/core/org/apache/lucene/index/IndexReader.html (search the page for 'open'). However, NetBeans 8.1 with Apache Lucene 6.4.2 reports an inline error at the statement 'reader = IndexReader.open(indexDirectory);'. Here are the error and the code.
Cannot find symbol
symbol: method open(Directory)
location: class IndexReader
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class Indexing_Searching
{
    public static final String FIELD_CONTENTS = "contents";
    public int searchIndex(String instring, String Index_Dir_Path)
    {
        int numDocs =0;
        try
        {
            Path path = Paths.get(Index_Dir_Path);
            Directory indexDirectory = FSDirectory.open(path);
            IndexReader reader;
            reader = IndexReader.open(indexDirectory);
            Term term = new Term("content", instring);
            numDocs = reader.docFreq(term);
            //System.out.println("Number of documents for given key" + instring +" # docs" + numDocs);
        }
        catch (CorruptIndexException e)
        {
            e.printStackTrace();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        return(numDocs);
    }// End of one-words searching function
}
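According to the current IndexReader Javadoc for Lucene 6.4.2, you should use DirectoryReader.open instead. The page linked in the question documents Lucene 3.5.0, where IndexReader.open(Directory) still existed. With 6.4.2, import org.apache.lucene.index.DirectoryReader and write 'reader = DirectoryReader.open(indexDirectory);'. DirectoryReader is a subclass of IndexReader, so the existing 'IndexReader reader' declaration can stay.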

Convert docx file to pdf in java..issue

I am developing a project that needs to convert a docx file to PDF. I found the same question already posted and used the code provided by "Kishan C S". It uses docx4j 2.8.1.
The code works and the PDF is generated, but the docx file contains logo.jpg (an image in the header), and images are not converted. Only the text is converted to PDF.
I am posting the code I used. Please let me know how I can solve the problem.
P.S.: the question I referred to is "Convert docx file into PDF with Java".
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.List;
import org.apache.log4j.Level;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;
import org.docx4j.convert.out.pdf.viaXSLFO.PdfSettings;
import org.docx4j.fonts.IdentityPlusMapper;
import org.docx4j.fonts.Mapper;
import org.docx4j.fonts.PhysicalFont;
import org.docx4j.fonts.PhysicalFonts;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
public class DocxConverter {
    public static void main(String[] args) throws FileNotFoundException, Docx4JException, Exception {
        InputStream is = new FileInputStream(new File("D:\\Test\\C_IN0004_AppointmentLetter.docx"));
        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(is);
        List sections = wordMLPackage.getDocumentModel().getSections();
        for (int i = 0; i < sections.size(); i++) {
            wordMLPackage.getDocumentModel().getSections().get(i).getPageDimensions();
        }
        Mapper fontMapper = new IdentityPlusMapper();
        PhysicalFont font = PhysicalFonts.getPhysicalFonts().get("Comic Sans MS");//set your desired font
        fontMapper.getFontMappings().put("Algerian", font);
        wordMLPackage.setFontMapper(fontMapper);
        PdfSettings pdfSettings = new PdfSettings();
        org.docx4j.convert.out.pdf.PdfConversion conversion = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
        //To turn off logger
        List<Logger> loggers = Collections.<Logger> list(LogManager.getCurrentLoggers());
        loggers.add(LogManager.getRootLogger());
        for (Logger logger : loggers) {
            logger.setLevel(Level.OFF);
        }
        OutputStream out = new FileOutputStream(new File("D:\\Test\\C_IN0004_AppointmentLetter.pdf"));
        conversion.output(out, pdfSettings);
        System.out.println("DONE!!");
    }
}

Using `Replace()` method does not replace text in Apache POI

Here is my code:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
public class POI {
    public POI() throws IOException, InvalidFormatException
    {
        XWPFDocument doc = new XWPFDocument(OPCPackage.open("input.docx"));
        for (XWPFParagraph p : doc.getParagraphs()) {
            for (XWPFRun r : p.getRuns()) {
                String text = r.getText(0);
                System.out.println(text);
                if (text.contains("needle"))
                {
                    text = text.replace("needle", "haystack");
                    r.setText(text);
                    System.out.println(text);
                }
            }
        }
        doc.write(new FileOutputStream("output.docx"));
    }
}
This code is meant to replace text in a .docx document. The input to the program is input.docx, which contains the following data:
needle
game
system
The output was output.docx, which contained the following data:
needlehaystack
game
system
You can see the difference. Instead of replacing the word needle with haystack, it simply added haystack right after needle.
I have no idea what I am doing wrong here. How can I properly replace text in .docx files?
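I have no experience with this, but by symmetry with r.getText(0), it should be:
r.setText(text, 0);
The one-argument r.setText(text) appends to the run instead of replacing the text at position 0, which matches the "needlehaystack" output.
Also make sure you use text.replace("needle", "haystack"); rather than replaceAll.
replaceAll treats its first argument as a regex, and "needle" is not intended as a regex.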
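
The difference between literal and regex replacement can be seen with plain strings, independent of POI. This is a small sketch; ReplaceDemo and its method names are hypothetical. For a search term without regex metacharacters, such as "needle", both methods give the same result; they diverge as soon as the term contains a character like '.'.

```java
import java.util.regex.Pattern;

class ReplaceDemo {

    // String.replace treats the target as literal text.
    static String literal(String text, String target, String replacement) {
        return text.replace(target, replacement);
    }

    // String.replaceAll treats the first argument as a regular expression.
    static String regex(String text, String pattern, String replacement) {
        return text.replaceAll(pattern, replacement);
    }

    // Pattern.quote makes replaceAll behave literally for the search term.
    static String quotedRegex(String text, String target, String replacement) {
        return text.replaceAll(Pattern.quote(target), replacement);
    }

    public static void main(String[] args) {
        System.out.println(literal("needle", "needle", "haystack")); // haystack
        System.out.println(regex("needle", "needle", "haystack"));   // haystack
        System.out.println(literal("v1.0", ".", "_"));               // v1_0
        System.out.println(regex("v1.0", ".", "_"));                 // ____ ('.' matches any character)
        System.out.println(quotedRegex("v1.0", ".", "_"));           // v1_0
    }
}
```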
