I have a file saved in .doc format, and I need to extract the highlighted text. My code looks like the following:
HWPFDocument document = new HWPFDocument(fis);
Range r = document.getRange();
for (int i=0;i<5;i++) {
CharacterRun t = r.getCharacterRun(i);
System.out.println(t.isHighlighted());
System.out.println(t.getHighlightedColor());
System.out.println(r.getCharacterRun(i).SPRM_HIGHLIGHT);
System.out.println(r.getCharacterRun(i));
}
None of the above methods reports that the text is highlighted, but when I open the file in Word, it is highlighted.
What could be the reason, and how can I find out whether the text is highlighted?
Highlighting text in Word is possible using two different methods. The first is applying highlighting to text runs. The second is applying shading to words or paragraphs.
For the first, in *.doc, the Word binary file format, Apache POI provides methods in CharacterRun. For the second, Apache POI provides Paragraph.getShading. But this is only set if the shading applies to the whole paragraph. If the shading is applied only to single runs, then Apache POI provides nothing for it, so using the underlying SprmOperations is needed.
Microsoft's documentation, 2.6.1 Character Properties, describes sprmCShd80 (0x4866), which is "A Shd80 structure that specifies the background shading for the text." So we need to search for that.
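As background on that opcode: per the [MS-DOC] specification, the 16-bit sprm value itself packs four bit fields: ispmd (bits 0-8, the operation index), fSpec (bit 9), sgc (bits 10-12, where 2 means character property) and spra (bits 13-15, the operand size). A small standalone sketch decoding 0x4866, to show why it is a character-property sprm with a two-byte Shd80 operand:

```java
public class SprmOpcodeDecode {
    public static void main(String[] args) {
        int value = 0x4866; // sprmCShd80
        int ispmd = value & 0x01FF;       // bits 0-8: operation index
        int fSpec = (value >> 9) & 0x1;   // bit 9: special handling flag
        int sgc   = (value >> 10) & 0x7;  // bits 10-12: 2 = character property
        int spra  = (value >> 13) & 0x7;  // bits 13-15: 2 = two-byte operand (Shd80)
        System.out.printf("ispmd=0x%02X fSpec=%d sgc=%d spra=%d%n", ispmd, fSpec, sgc, spra);
    }
}
```

Running this prints `ispmd=0x66 fSpec=0 sgc=2 spra=2`, matching the sprmCShd80 definition.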
Example:
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.*;
import org.apache.poi.hwpf.sprm.*;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
public class HWPFInspectBgColor {
private static void showCharacterRunInternals(CharacterRun run) throws Exception {
Field _chpx = CharacterRun.class.getDeclaredField("_chpx");
_chpx.setAccessible(true);
SprmBuffer sprmBuffer = (SprmBuffer) _chpx.get(run);
for (SprmIterator sprmIterator = sprmBuffer.iterator(); sprmIterator.hasNext(); ) {
SprmOperation sprmOperation = sprmIterator.next();
System.out.println(sprmOperation);
}
}
static SprmOperation getCharacterRunShading(CharacterRun run) throws Exception {
SprmOperation shd80Operation = null;
Field _chpx = CharacterRun.class.getDeclaredField("_chpx");
_chpx.setAccessible(true);
Field _value = SprmOperation.class.getDeclaredField("_value");
_value.setAccessible(true);
SprmBuffer sprmBuffer = (SprmBuffer) _chpx.get(run);
for (SprmIterator sprmIterator = sprmBuffer.iterator(); sprmIterator.hasNext(); ) {
SprmOperation sprmOperation = sprmIterator.next();
short sprmValue = (short)_value.get(sprmOperation);
if (sprmValue == (short)0x4866) { // we have a Shd80 structure, see https://msdn.microsoft.com/en-us/library/dd947480(v=office.12).aspx
shd80Operation = sprmOperation;
}
}
return shd80Operation;
}
public static void main(String[] args) throws Exception {
HWPFDocument document = new HWPFDocument(new FileInputStream("sample.doc"));
Range range = document.getRange();
for (int p = 0; p < range.numParagraphs(); p++) {
Paragraph paragraph = range.getParagraph(p);
System.out.println(paragraph);
if (!paragraph.getShading().isEmpty()) {
System.out.println("Paragraph's shading: " + paragraph.getShading());
}
for (int r = 0; r < paragraph.numCharacterRuns(); r++) {
CharacterRun run = paragraph.getCharacterRun(r);
System.out.println(run);
if (run.isHighlighted()) {
System.out.println("Run's highlighted color: " + run.getHighlightedColor());
}
if (getCharacterRunShading(run) != null) {
System.out.println("Run's Shd80 structure: " + getCharacterRunShading(run));
}
}
}
}
}
I am trying to merge 2 docx files, each of which has its own bullet numbering. After merging the documents, the bullets are automatically updated.
E.g.:
Doc A has 1 2 3
Doc B has 1 2 3
After merging, the bullet numbering is updated to 1 2 3 4 5 6.
How do I stop this?
I am using the following code:
if(counter==1)
{
FirstFileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
FirstFileIS = new java.io.ByteArrayInputStream(FirstFileByteStream);
FirstWordFile = org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(FirstFileIS);
main = FirstWordFile.getMainDocumentPart();
//Add page break for Table of Content
main.addObject(objBr);
if (htmlCode != null) {
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html,htmlCode.toString().getBytes());
}
//Table of contents - End
}
else
{
FileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
FileIS = new java.io.ByteArrayInputStream(FileByteStream);
byte[] bytes = IOUtils.toByteArray(FileIS);
AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/part" + (chunkCount++) + ".docx"));
afiPart.setContentType(new ContentType(CONTENT_TYPE));
afiPart.setBinaryData(bytes);
Relationship altChunkRel = main.addTargetPart(afiPart);
CTAltChunk chunk = Context.getWmlObjectFactory().createCTAltChunk();
chunk.setId(altChunkRel.getId());
main.addObject(objBr);
htmlCode = new StringBuilder();
htmlCode.append("<html>");
htmlCode.append("<h2><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><p style=\"font-family:'Arial Black'; color: #f35b1c\">"+ReqName+"</p></h2>");
htmlCode.append("</html>");
if (htmlCode != null) {
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html,htmlCode.toString().getBytes());
}
//Add Page Break before new content
main.addObject(objBr);
//Add new content
main.addObject(chunk);
}
Looking at your code, you are adding HTML altChunks to your document.
For these to display in Word, the HTML is converted to normal docx content.
An altChunk is usually converted by Word when you open the docx.
(Alternatively, docx4j-ImportXHTML can do it for an altChunk of type XHTML)
The upshot is that what happens with the bullets (when Word converts your HTML) is largely outside your control. You could experiment with CSS but I think Word will mostly ignore it.
An alternative may be to use XHTML altChunks and have docx4j-ImportXHTML convert them via main.convertAltChunks().
If the same problem occurs when you try that, well, at least we can address it.
I was able to fix my issue using the following code. I found it at http://webapp.docx4java.org/OnlineDemo/forms/upload_MergeDocx.xhtml. You can also generate your own custom code; they have a nice demo that generates code according to your requirements. :)
public final static String DIR_IN = System.getProperty("user.dir")+ "/";
public final static String DIR_OUT = System.getProperty("user.dir")+ "/";
public static void main(String[] args) throws Exception
{
String[] files = {"part1docx_20200717t173750539gmt.docx", "part1docx_20200717t173750539gmt (1).docx", "part1docx_20200717t173750539gmt.docx"};
List<BlockRange> blockRanges = new ArrayList<>();
for (int i=0 ; i< files.length; i++) {
BlockRange block = new BlockRange(WordprocessingMLPackage.load(new File(DIR_IN + files[i])));
blockRanges.add( block );
block.setStyleHandler(StyleHandler.RENAME_RETAIN);
block.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);
block.setRestartPageNumbering(false);
block.setHeaderBehaviour(HfBehaviour.DEFAULT);
block.setFooterBehaviour(HfBehaviour.DEFAULT);
block.setSectionBreakBefore(SectionBreakBefore.NEXT_PAGE);
}
// Perform the actual merge
DocumentBuilder documentBuilder = new DocumentBuilder();
WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);
// Save the result
SaveToZipFile saver = new SaveToZipFile(output);
saver.save(DIR_OUT+"OUT_MergeWholeDocumentsUsingBlockRange.docx");
}
I have the following method (createAdditionalSheetsInExcel) in my code which tries to create additional sheets for a particular scenario (runOutputNumber > 1). It ends up creating an Excel file, but when you try to open that file you get errors.
The workBookObj.cloneSheet(index, sheetName) call throws no errors in the Java code, but when I try to open the Excel file I get the errors shown below.
I tried removing the formatting for the table in the sheet, and then the error disappears. So it must be something to do with the format of the table inside the sheet.
private static void createAdditionalSheetsInExcel(String tempOutputFileName, String outputFileName, int runOutputNumber) throws IOException {
FileInputStream fileIn = new FileInputStream(tempOutputFileName);
XSSFWorkbook workBookObj = new XSSFWorkbook(fileIn);
workBookObj.setWorkbookType(XSSFWorkbookType.XLSM);
runOutputNumber = 2;//Hard coded for clarification
if (runOutputNumber > 1) {
int initialNoOfSheets = workBookObj.getNumberOfSheets();
for (int runIndex = 2; runIndex <= runOutputNumber; runIndex++) {
for (int index = 0; index < initialNoOfSheets; index++) {
XSSFSheet sheet = workBookObj.getSheetAt(index);
String sheetName = sheet.getSheetName().trim()
.substring(0, sheet.getSheetName().length() - 1) + runIndex;
workBookObj.cloneSheet(index, sheetName);
}
}
}
FileOutputStream fileOut = new FileOutputStream(outputFileName);
workBookObj.write(fileOut);
fileOut.close();
workBookObj.close();
deleteTempExcel(tempOutputFileName);
}
Error when the Excel file is opened:
We found a problem with some content in 'abc.xlsm'. Do you want to try to recover as much as we can? If you trust the source of this workbook, click Yes.
Error after opening the Excel file:
Repaired Records: Table from /xl/tables/table1.xml part (Table)
Finally resolved the issue using the Jacob API and also by making changes to the template. The issue was with a local variable defined in Excel, which I was able to access via Formulas -> Name Manager, where I deleted the variable. Even after deleting the variable I was not able to get it working with Apache POI, so I ended up using the Jacob API. The code is as follows:
package com.ford.ltdrive.model.output.excel.excelenum;
import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Date;
import java.util.logging.Level;
import java.util.logging.Logger;
public class CloneSheet {
public void backup(String filepath) {
try {
Date d = new Date();
String dateString = (d.getYear() + 1900) + "_" + d.getMonth() + "_" + d.getDate();
// String backupfilepath = filepath.replace(".xlsm", "_backup_" + dateString + ".xlsm");
//Path copied = Paths.get(backupfilepath);
Path copied1 = Paths.get(filepath + "_tmp");
Path originalPath = Paths.get(filepath);
// Files.copy(originalPath, copied, StandardCopyOption.REPLACE_EXISTING);
Files.copy(originalPath, copied1, StandardCopyOption.REPLACE_EXISTING);
Files.delete(Paths.get(filepath));
} catch (IOException ex) {
Logger.getLogger(CloneSheet.class.getName()).log(Level.SEVERE, null, ex);
}
}
public void cloneSheets(String xlsfile, java.util.List<String> list,int copynum) {
ActiveXComponent app = new ActiveXComponent("Excel.Application");
try {
backup(xlsfile);
app.setProperty("Visible", new Variant(false));
Dispatch excels = app.getProperty("Workbooks").toDispatch();
Dispatch excel = Dispatch.invoke(
excels,
"Open",
Dispatch.Method,
new Object[]{xlsfile + "_tmp", new Variant(false),
new Variant(true)}, new int[1]).toDispatch();
//Dispatch sheets = Dispatch.get((Dispatch) excel, "Worksheets").toDispatch();
int sz = list.size();//"Angle_1pc_SBC_R1"
for (int i = 0; i < sz; i++) {
Dispatch sheet = Dispatch.invoke(excel, "Worksheets", Dispatch.Get,
new Object[]{list.get(i)}, new int[1]).toDispatch();//Whatever sheet you //wanted the new sheet inserted after
//Dispatch workbooksTest = app.getProperty("Sheets").toDispatch();//Get the workbook
//Dispatch sheet2 = Dispatch.call(workbooksTest, "Add").toDispatch();
for(int k=0;k<copynum -1;k++)
{
Dispatch.call(sheet, "Copy", sheet);
}
}
//Moves the sheet behind the desired sheet
Dispatch.invoke(excel, "SaveAs", Dispatch.Method, new Object[]{xlsfile, new Variant(52)}, new int[1]);
Variant f = new Variant(false);
Dispatch.call(excel, "Close", f);
Files.delete(Paths.get(xlsfile + "_tmp"));
} catch (Exception e) {
e.printStackTrace();
} finally {
app.invoke("Quit", new Variant[]{});
}
}
/* public static void main(String args[])
{
java.util.ArrayList<String> list = new java.util.ArrayList();
list.add("Angle_1pc_SBC_R1");
new CloneSheet().cloneSheets("C:\\LTDrive2_4\\Excel\\Test.xlsm", list, 2);
}*/
}
I have a similar issue.
I had threads writing to an Excel file.
I solved it by adding synchronized to the function that I use to write into the Excel file.
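A minimal sketch of that fix, with a plain list standing in for the POI sheet (POI workbook objects are not thread-safe, so every write must go through one lock; the class and method names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class SafeSheetWriter {
    private final List<String> rows = new ArrayList<>(); // stands in for the sheet

    // synchronized serializes all writers: only one thread appends at a time
    public synchronized void writeRow(String row) {
        rows.add(row);
    }

    public synchronized int rowCount() {
        return rows.size();
    }

    public static void main(String[] args) throws InterruptedException {
        SafeSheetWriter writer = new SafeSheetWriter();
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) writer.writeRow("thread" + id + "-row" + i);
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        // With the lock every append survives; without it rows could be lost or corrupted.
        System.out.println(writer.rowCount()); // 4000
    }
}
```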
I looked at the Apache POI documentation and created a function that redacts all the text in a PowerPoint file. The function works well in replacing text in slides, but not the text found in grouped text boxes. Is there a separate object that handles the grouped items?
private static void redactText(XMLSlideShow ppt) {
for (XSLFSlide slide : ppt.getSlides()) {
System.out.println("REDACT Slide: " + slide.getTitle());
XSLFTextShape[] shapes = slide.getPlaceholders();
for (XSLFTextShape textShape : shapes) {
List<XSLFTextParagraph> textparagraphs = textShape.getTextParagraphs();
for (XSLFTextParagraph para : textparagraphs) {
List<XSLFTextRun> textruns = para.getTextRuns();
for (XSLFTextRun incomingTextRun : textruns) {
String text = incomingTextRun.getRawText();
System.out.println(text);
if (text.toLowerCase().contains("test")) {
String newText = text.replaceAll("(?i)" + "test", "XXXXXXXX");
incomingTextRun.setText(newText);
}
}
}
}
}
}
If the need is simply getting all text content independent of what objects it is in, then one could simply do exactly that. Text content is contained in org.apache.xmlbeans.XmlString elements. In PowerPoint XML they are in a:t tags, with namespace a="http://schemas.openxmlformats.org/drawingml/2006/main".
So the following code gets all text in all objects in all slides and replaces the case-insensitive string "test" with "XXXXXXXX".
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xslf.usermodel.*;
import org.openxmlformats.schemas.presentationml.x2006.main.CTSlide;
import org.apache.xmlbeans.XmlObject;
import org.apache.xmlbeans.XmlString;
public class ReadPPTXAllText {
public static void main(String[] args) throws Exception {
XMLSlideShow slideShow = new XMLSlideShow(new FileInputStream("MicrosoftPowerPoint.pptx"));
for (XSLFSlide slide : slideShow.getSlides()) {
CTSlide ctSlide = slide.getXmlObject();
XmlObject[] allText = ctSlide.selectPath(
"declare namespace a='http://schemas.openxmlformats.org/drawingml/2006/main' " +
".//a:t"
);
for (int i = 0; i < allText.length; i++) {
if (allText[i] instanceof XmlString) {
XmlString xmlString = (XmlString)allText[i];
String text = xmlString.getStringValue();
System.out.println(text);
if (text.toLowerCase().contains("test")) {
String newText = text.replaceAll("(?i)" + "test", "XXXXXXXX");
xmlString.setStringValue(newText);
}
}
}
}
FileOutputStream out = new FileOutputStream("MicrosoftPowerPointChanged.pptx");
slideShow.write(out);
slideShow.close();
out.close();
}
}
If one doesn't like the approach of replacing via XML directly, it is possible to iterate over all slides and their shapes. If a shape is an XSLFTextShape, get the paragraphs and handle them as you did.
If you receive an XSLFGroupShape, iterate over its getShapes() as well. Since groups can contain different types of shapes, you might use recursion for that. You might also want to handle the shape type XSLFTable.
But the real trouble starts when you realize that something you want to replace is divided into several runs ;-)
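To illustrate that last point: when the search term is split across runs, one workable strategy is to join the run texts, do the replacement on the joined string, and write the result back, here by putting everything into the first run and blanking the rest (which loses per-run formatting, a known tradeoff). A sketch over plain strings standing in for the XSLFTextRun texts:

```java
import java.util.Arrays;
import java.util.List;

public class CrossRunReplace {
    // Replaces `target` with `mask`, even when `target` spans run boundaries.
    static List<String> redact(List<String> runTexts, String target, String mask) {
        String joined = String.join("", runTexts);
        if (!joined.toLowerCase().contains(target.toLowerCase())) {
            return runTexts; // nothing to do, keep runs untouched
        }
        String replaced = joined.replaceAll(
                "(?i)" + java.util.regex.Pattern.quote(target), mask);
        // Simplest write-back: whole text into the first run, blank the rest.
        String[] out = new String[runTexts.size()];
        Arrays.fill(out, "");
        out[0] = replaced;
        return Arrays.asList(out);
    }

    public static void main(String[] args) {
        // "test" is split across two runs: "This is a te" + "st case."
        List<String> runs = Arrays.asList("This is a te", "st case.");
        System.out.println(redact(runs, "test", "XXXXXXXX"));
        // → [This is a XXXXXXXX case., ]
    }
}
```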
I'm trying to extract text from a PDF which is full of tables.
In some cases, a column is empty.
When I extract the text from the PDF, the empty columns are skipped and replaced by a whitespace; therefore, my regular expressions can't figure out that there was a column with no information at this spot.
Image for a better understanding:
We can see that the columns aren't respected in the extracted text.
Sample of my code that extracts the text from the PDF:
PDFTextStripper reader = new PDFTextStripper();
reader.setSortByPosition(true);
reader.setStartPage(page);
reader.setEndPage(page);
String st = reader.getText(document);
List<String> lines = Arrays.asList(st.split(System.getProperty("line.separator")));
How can I maintain the full structure of the original PDF when extracting text from it?
Thanks a lot.
(This originally was the answer, dated Feb 6 '15, to another question which the OP deleted including all answers. Due to its age, the code in the answer is still based on PDFBox 1.8.x, so some changes might be necessary to make it run with PDFBox 2.0.x.)
In comments the OP showed interest in a solution to extend the PDFBox PDFTextStripper to return text lines which attempt to reflect the PDF file layout which might help in case of the question at hand.
A proof-of-concept for that would be this class:
public class LayoutTextStripper extends PDFTextStripper
{
public LayoutTextStripper() throws IOException
{
super();
}
@Override
protected void startPage(PDPage page) throws IOException
{
super.startPage(page);
cropBox = page.findCropBox();
pageLeft = cropBox.getLowerLeftX();
beginLine();
}
@Override
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
float recentEnd = 0;
for (TextPosition textPosition: textPositions)
{
String textHere = textPosition.getCharacter();
if (textHere.trim().length() == 0)
continue;
float start = textPosition.getTextPos().getXPosition();
boolean spacePresent = endsWithWS | textHere.startsWith(" ");
if (needsWS | spacePresent | Math.abs(start - recentEnd) > 1)
{
int spacesToInsert = insertSpaces(chars, start, needsWS & !spacePresent);
for (; spacesToInsert > 0; spacesToInsert--)
{
writeString(" ");
chars++;
}
}
writeString(textHere);
chars += textHere.length();
needsWS = false;
endsWithWS = textHere.endsWith(" ");
try
{
recentEnd = getEndX(textPosition);
}
catch (IllegalArgumentException | IllegalAccessException | NoSuchFieldException | SecurityException e)
{
throw new IOException("Failure retrieving endX of TextPosition", e);
}
}
}
@Override
protected void writeLineSeparator() throws IOException
{
super.writeLineSeparator();
beginLine();
}
@Override
protected void writeWordSeparator() throws IOException
{
needsWS = true;
}
void beginLine()
{
endsWithWS = true;
needsWS = false;
chars = 0;
}
int insertSpaces(int charsInLineAlready, float chunkStart, boolean spaceRequired)
{
int indexNow = charsInLineAlready;
int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
int spacesToInsert = indexToBe - indexNow;
if (spacesToInsert < 1 && spaceRequired)
spacesToInsert = 1;
return spacesToInsert;
}
float getEndX(TextPosition textPosition) throws IllegalArgumentException, IllegalAccessException, NoSuchFieldException, SecurityException
{
Field field = textPosition.getClass().getDeclaredField("endX");
field.setAccessible(true);
return field.getFloat(textPosition);
}
public float fixedCharWidth = 3;
boolean endsWithWS = true;
boolean needsWS = false;
int chars = 0;
PDRectangle cropBox = null;
float pageLeft = 0;
}
It is used like this:
PDDocument document = PDDocument.load(PDF);
LayoutTextStripper stripper = new LayoutTextStripper();
stripper.setSortByPosition(true);
stripper.fixedCharWidth = charWidth; // e.g. 5
String text = stripper.getText(document);
fixedCharWidth is the assumed character width. Depending on the writing in the PDF in question, a different value might be more appropriate. In my sample documents, values from 3 to 6 were of interest.
It essentially emulates the analogous solution for iText in this answer. Results differ a bit, though, as iText text extraction forwards text chunks and PDFBox text extraction forwards individual characters.
Please be aware that this is merely a proof-of-concept. In particular, it does not take any page rotation into account.
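The mapping at the heart of insertSpaces above is plain arithmetic: a glyph's x-coordinate, offset by the page's left edge and divided by the assumed character width, yields the character column it should occupy. A standalone sketch (pageLeft and the chunk x-positions are made-up values):

```java
public class ColumnMapping {
    public static void main(String[] args) {
        float pageLeft = 50f;       // cropBox lower-left x (assumed)
        float fixedCharWidth = 5f;  // assumed average glyph width in PDF units
        float[] chunkStarts = { 50f, 125f, 300f };
        for (float x : chunkStarts) {
            // Same formula as insertSpaces: indexToBe = (chunkStart - pageLeft) / fixedCharWidth
            int column = (int) ((x - pageLeft) / fixedCharWidth);
            System.out.println("x=" + x + " -> column " + column);
        }
    }
}
```

So a text chunk starting at x=125 lands in column 15; a narrower fixedCharWidth spreads the same chunks over more columns, which is why tuning it per document matters.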
How can I generate an AST from Java source code using ANTLR?
Any help?
OK, here are the steps:
Go to the ANTLR site and download the latest version.
Download the Java.g and JavaTreeParser.g files from here.
Run the following commands:
java -jar antlrTool Java.g
java -jar antlrTool JavaTreeParser.g
Five files will be generated:
Java.tokens
JavaLexer.java
JavaParser.java
JavaTreeParser.java
JavaTreeParser.tokens
Use this Java code to generate the abstract syntax tree and print it:
public static void main(String[] args) throws Exception {
String input = "public class HelloWord {"+
"public void print(String r){" +
"for(int i = 0;true;i+=2)" +
"System.out.println(r);" +
"}" +
"}";
CharStream cs = new ANTLRStringStream(input);
JavaLexer jl = new JavaLexer(cs);
CommonTokenStream tokens = new CommonTokenStream();
tokens.setTokenSource(jl);
JavaParser jp = new JavaParser(tokens);
RuleReturnScope result = jp.compilationUnit();
CommonTree t = (CommonTree) result.getTree();
CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
nodes.setTokenStream(tokens);
JavaTreeParser walker = new JavaTreeParser(nodes);
System.out.println("\nWalk tree:\n");
printTree(t,0);
System.out.println(tokens.toString());
}
public static void printTree(CommonTree t, int indent) {
if ( t != null ) {
StringBuffer sb = new StringBuffer(indent);
for ( int i = 0; i < indent; i++ )
sb = sb.append(" ");
for ( int i = 0; i < t.getChildCount(); i++ ) {
System.out.println(sb.toString() + t.getChild(i).toString());
printTree((CommonTree)t.getChild(i), indent+1);
}
}
}
The steps to generate a Java source AST using ANTLR 4 are:
Install antlr4 you can use this link to do that.
After installation download the JAVA grammar from here.
Now generate Java8Lexer and Java8Parser using the command:
antlr4 -visitor Java8.g4
This will generate several files, such as Java8BaseListener.java, Java8BaseVisitor.java, Java8Lexer.java, Java8Lexer.tokens, Java8Listener.java, Java8Parser.java, Java8.tokens and Java8Visitor.java.
Use this code to generate AST:
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.RuleContext;
import org.antlr.v4.runtime.tree.ParseTree;
public class ASTGenerator {
public static String readFile() throws IOException {
File file = new File("path/to/the/test/file.java");
byte[] encoded = Files.readAllBytes(file.toPath());
return new String(encoded, Charset.forName("UTF-8"));
}
public static void main(String args[]) throws IOException {
String inputString = readFile();
ANTLRInputStream input = new ANTLRInputStream(inputString);
Java8Lexer lexer = new Java8Lexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
Java8Parser parser = new Java8Parser(tokens);
ParserRuleContext ctx = parser.classDeclaration();
printAST(ctx, false, 0);
}
private static void printAST(RuleContext ctx, boolean verbose, int indentation) {
boolean toBeIgnored = !verbose && ctx.getChildCount() == 1 && ctx.getChild(0) instanceof ParserRuleContext;
if (!toBeIgnored) {
String ruleName = Java8Parser.ruleNames[ctx.getRuleIndex()];
for (int i = 0; i < indentation; i++) {
System.out.print(" ");
}
System.out.println(ruleName + " -> " + ctx.getText());
}
for (int i = 0; i < ctx.getChildCount(); i++) {
ParseTree element = ctx.getChild(i);
if (element instanceof RuleContext) {
printAST((RuleContext) element, verbose, indentation + (toBeIgnored ? 0 : 1));
}
}
}
}
After you are done coding you can use Gradle to build your project, or you can download antlr-4.7.1-complete.jar into your project directory and start compiling.
If you want the output in a DOT file so that you can visualize the AST, you can refer to this QnA post or directly refer to this repository, in which I have used Gradle to build the project.
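As a sketch of what such DOT output can look like, here is a tiny emitter over a hand-built two-node tree (an illustration only, not ANTLR's API; a real version would walk the RuleContext children just like printAST does):

```java
import java.util.ArrayList;
import java.util.List;

public class DotSketch {
    static class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
        Node add(Node child) { children.add(child); return this; }
    }

    static int counter = 0;

    // Emits one DOT declaration per node and one edge per parent-child pair;
    // returns this node's unique DOT id so the caller can draw the edge.
    static int toDot(Node node, StringBuilder out) {
        int id = counter++;
        out.append("  n").append(id).append(" [label=\"").append(node.label).append("\"];\n");
        for (Node child : node.children) {
            int childId = toDot(child, out);
            out.append("  n").append(id).append(" -> n").append(childId).append(";\n");
        }
        return id;
    }

    public static void main(String[] args) {
        Node root = new Node("classDeclaration").add(new Node("classBody"));
        StringBuilder dot = new StringBuilder("digraph AST {\n");
        toDot(root, dot);
        dot.append("}\n");
        System.out.println(dot); // valid input for `dot -Tpng`
    }
}
```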
Hope this helps. :)