Identify hidden text Word 2003/2007 using Apache POI

I am converting Word (2003 and 2007) documents to HTML. I have managed to read the text, formatting, etc. from the Word document, but the document contains some hidden text, such as 'Header Change History', which should not be displayed on the page. Is there any way to identify hidden text in a Word document?
Any help would be much appreciated.

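I am not sure if this is a complete (or even accurate) solution, but for files in the DOCX format it seems that you can check whether a character run is hidden like this:
XWPFRun cr; // a run taken from an XWPFParagraph
// getRPr() is null for runs without explicit run properties, so guard against that first
if (cr.getCTR().getRPr() != null && cr.getCTR().getRPr().getVanish() != null) {
    // it is hidden
}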
Got this from reverse-engineering the XML, and at least in my usage it seems to work. Would be very glad for additional (more informed) input, and a way to do the same thing in the old binary file format.
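To make the XML side of this concrete, here is a minimal sketch that does the same check directly on the WordprocessingML markup using only the JDK's DOM parser, without POI. The class and method names (VanishCheck, isRunHidden, countHiddenRuns) are made up for illustration, and the sample XML is a hand-written fragment, not a real document. It treats a run as hidden when its w:rPr contains a w:vanish element whose w:val is not false, 0 or off. It does not resolve w:vanish inherited from paragraph or character styles, so take it as a rough illustration rather than a complete answer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class VanishCheck {
    // WordprocessingML main namespace (the "w:" prefix in word/document.xml)
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the first direct child of parent named w:<localName>, or null if there is none.
    static Element child(Element parent, String localName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && W_NS.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    // True if the w:r element has a w:rPr/w:vanish that is not explicitly switched off.
    static boolean isRunHidden(Element run) {
        Element rPr = child(run, "rPr");
        if (rPr == null) return false; // run has no explicit properties
        Element vanish = child(rPr, "vanish");
        if (vanish == null) return false;
        String val = vanish.getAttributeNS(W_NS, "val"); // "" when the attribute is absent
        return !(val.equals("false") || val.equals("0") || val.equals("off"));
    }

    // Counts the runs in a document.xml string that are directly marked as hidden.
    static int countHiddenRuns(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
            NodeList runs = doc.getElementsByTagNameNS(W_NS, "r");
            int hidden = 0;
            for (int i = 0; i < runs.getLength(); i++) {
                if (isRunHidden((Element) runs.item(i))) hidden++;
            }
            return hidden;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample: one hidden run, one with vanish switched off, one plain run.
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
                + "<w:r><w:rPr><w:vanish/></w:rPr><w:t>Header Change History</w:t></w:r>"
                + "<w:r><w:rPr><w:vanish w:val=\"false\"/></w:rPr><w:t>shown</w:t></w:r>"
                + "<w:r><w:t>plain</w:t></w:r>"
                + "</w:p></w:body></w:document>";
        System.out.println("hidden runs: " + countHiddenRuns(xml));
    }
}
```

The second run in the sample is the reason for checking the w:val attribute: the mere presence of w:vanish does not always mean the text is hidden.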

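The following code snippet helps identify whether text in a binary .doc file is hidden:
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

boolean isHidden = false;
// try-with-resources closes the input stream even if reading fails
try (FileInputStream in = new FileInputStream(filesname)) {
    POIFSFileSystem fs = new POIFSFileSystem(in);
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    String[] paragraphs = we.getParagraphText();
    System.out.println("Word Document has " + paragraphs.length + " paragraphs");
    Range range = doc.getRange();
    for (int k = 0; k < range.numParagraphs(); k++) {
        Paragraph paragraph = range.getParagraph(k);
        // walk the character runs; formatting such as "hidden" is stored per run
        for (int j = 0; j < paragraph.numCharacterRuns(); j++) {
            CharacterRun cr = paragraph.getCharacterRun(j);
            if (cr.isVanished()) {
                // it is hidden
                System.out.println("text is hidden ");
                isHidden = true;
                break;
            }
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}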

Related

PDFBox search for text on specific page in new PDF

I'm looking for a way to check every page of my new PDF for a specific String.
The idea is to go through every page and, if the project name is missing from a page, add it before saving the PDF with doc.save(new FileOutputStream(new File(pathToFile))).
I have already tried:
document.save(new FileOutputStream(new File(pathToFile)));
PDDocument document = PDDocument.load(new File(pathToFile));
for (int i = 1; i <= document.getNumberOfPages(); i++) { // PDFTextStripper page numbers are 1-based
PDFTextStripper reader = new PDFTextStripper();
reader.setStartPage(i);
reader.setEndPage(i);
String pageText = reader.getText(document);
System.out.println(pageText);
}
The result is "Hello World", which is correct.
However, this works only if the document has already been saved and then loaded again.
In my case the document has not been saved yet:
for (int i = 1; i <= document.getNumberOfPages(); i++) { // PDFTextStripper page numbers are 1-based
PDFTextStripper reader = new PDFTextStripper();
reader.setStartPage(i);
reader.setEndPage(i);
String pageText = reader.getText(document);
System.out.println(pageText);
}
and the result is an empty String.

Apparently there is no way to find text before saving the document, so I tried a new approach:
int oldPagesCount = document.getNumberOfPages();
addTableInformation(informationToAdd); // informationToAdd is a List<String>
if (oldPagesCount < document.getNumberOfPages()) {
    // pages were generated automatically, so add the project name/number to each of them
    for (int i = oldPagesCount; i < document.getNumberOfPages(); i++) {
        page = document.getPage(i);
        addProjectInfo(project);
    }
}
With this approach, if the table information spills over onto multiple pages, the code visits every newly added page and adds the project information. I hope this helps anybody who needs to do something similar.

replace string with unicode text in pdf file using PDFbox?

I need to read strings from a PDF file and replace them with Unicode text. With ASCII characters everything is fine, but with Unicode characters the output shows question marks or junk text. The font file (TTF) is not the problem: I am able to write Unicode text to the PDF file with a different class (PDPageContentStream). With that class, however, there is no option to replace text; we can only add new text.
Sample unicode text
Bɐɑɒ
issue (Address column)
https://drive.google.com/file/d/1DbsApTCSfTwwK3txsDGW8sXtDG_u-VJv/view?usp=sharing
I am using PDFBox.
Please help me with this. Here is the code I am using:
public static PDDocument _ReplaceText(PDDocument document, String searchString, String replacement)
throws IOException {
if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
return document;
}
for (PDPage page : document.getPages()) {
PDResources resources = new PDResources();
PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
//PDFont font2 = PDType0Font.load(document, new File("avenir-next-regular.ttf"));
resources.add(font);
//resources.add(font2);
//resources.add(PDType1Font.TIMES_ROMAN);
page.setResources(resources);
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
Object next = tokens.get(j);
if (next instanceof Operator) {
Operator op = (Operator) next;
String pstring = "";
int prej = 0;
// Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj")) {
// Tj takes one operator and that is the string to display so lets update that
// operator
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
string = string.replaceFirst(searchString, replacement);
previous.setValue(string.getBytes());
} else if (op.getName().equals("TJ")) {
COSArray previous = (COSArray) tokens.get(j - 1);
for (int k = 0; k < previous.size(); k++) {
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString) {
COSString cosString = (COSString) arrElement;
String string = cosString.getString();
if (j == prej) {
pstring += string;
} else {
prej = j;
pstring = string;
}
}
}
if (searchString.equals(pstring.trim())) {
COSString cosString2 = (COSString) previous.getObject(0);
cosString2.setValue(replacement.getBytes());
int total = previous.size() - 1;
for (int k = total; k > 0; k--) {
previous.remove(k);
}
}
}
}
}
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(document);
OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
out.close();
page.setContents(updatedStream);
}
return document;
}

Your code utterly breaks the PDF, as Adobe Preflight confirms.
The cause is obvious: your code
PDResources resources = new PDResources();
PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
resources.add(font);
page.setResources(resources);
drops the pre-existing page Resources and your replacement contains only a single font the name of which you allow PDFBox to choose arbitrarily.
You must not drop existing resources as they are used in your document.
Inspecting the content of your PDF page it becomes obvious that the encoding of the originally used fonts T1_0 and T1_1 either is a single byte encoding or a mixed single/multi-byte encoding; the lower single byte values appear to be encoded ASCII-like.
I would assume that the encoding is WinAnsiEncoding or a subset thereof. As a corollary your task
to read the strings from PDF file and replace it with the Unicode text
cannot be implemented as a simple replacement, at least not with arbitrary Unicode code points in mind.
What you can implement instead is:
First, run your source PDF through a customized text stripper which, instead of extracting the plain text, searches for the strings to replace and returns their positions. There are numerous questions and answers here that show how to determine the coordinates of strings in text stripper subclasses.
Next, remove those original strings from your PDF. In your case an approach similar to your original code above (without dropping the resources, obviously), replacing the strings with equally long strings of spaces, might work even though it is a dirty hack.
Finally, add your replacements at the determined positions using a PDPageContentStream in append mode; for this, add your new font to the existing resources.
Please be aware, though, that PDF is not designed to be used like this. Template PDFs can be used as background for new content, but attempting to replace content therein usually is a bad design leading to trouble. If you need to mark positions in the template, use annotations which can easily be dropped during fill-in. Or use AcroForm forms, the native PDF form technology, to start with.
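To illustrate why an in-place replacement cannot carry arbitrary Unicode when the existing font uses a single-byte encoding, here is a small sketch using only the JDK. It uses Java's windows-1252 charset as a stand-in for WinAnsiEncoding; that is an approximation and an assumption on my part, as the actual encoding of T1_0 and T1_1 is only guessed above. The class and method names are made up for illustration. The second string in main is the sample text from the question (B followed by U+0250, U+0251 and U+0252), written with escapes.

```java
import java.nio.charset.Charset;

class SingleByteCheck {
    // Assumption: windows-1252 approximates the font's WinAnsiEncoding-like encoding.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // True if every character of text has a code in the single-byte encoding,
    // i.e. an in-place replacement of the string bytes could at least represent it.
    static boolean fitsSingleByte(String text) {
        return CP1252.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println("Address: " + fitsSingleByte("Address"));
        // Sample text from the question; these IPA letters have no windows-1252 code.
        System.out.println("sample:  " + fitsSingleByte("B\u0250\u0251\u0252"));
    }
}
```

A replacement that fails such a check has to be drawn with a different font, which is what the three steps above amount to.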

JSoup extract only specific parts from Wikipedia

I have managed to extract the information in the "tables" on the right side of a Wikipedia article. However I also want to get paragraphs from the main text of the articles.
The code I'm using at the moment works only about 60% of the time (otherwise I get NullPointerExceptions or no text at all). In the example below I'm only interested in the first two paragraphs, but that is irrelevant to my question.
In the picture below I show which parts I want the text from. I want to be able to iterate through all <p>...</p> parts in the <div id="mw-content-text" ... class="mw-content-ltr"> block.
StringBuilder sb = new StringBuilder();
String url = baseUrl + location;
Document doc = Jsoup.connect(url).get();
Elements paragraphs = doc.select(".mw-content-ltr p");
Element firstParagraph = paragraphs.first();
Element elementTwo = firstParagraph.nextElementSibling();
if (elementTwo == null) {
for (int i = 0; i < 2; i++) {
sb.append(paragraphs.get(i).text());
}
} else {
sb.append(elementTwo.text());
}
return sb.toString();

How to read font size of each word in a word document using POI?

I am trying to find out whether a Word document contains anything with a font size of 2. However, I have not been able to do this. To begin with, I tried to read the font of each word in a sample Word document that has only one line and 7 words, but I am not getting the correct results.
Here is my code:
HWPFDocument doc = new HWPFDocument(fileStream);
WordExtractor we = new WordExtractor(doc);
Range range = doc.getRange();
String[] paragraphs = we.getParagraphText();
for (int i = 0; i < paragraphs.length; i++) {
    Paragraph pr = range.getParagraph(i);
    int k = 0;
    while (true) {
        CharacterRun run = pr.getCharacterRun(k++);
        System.out.println("Color: " + run.getColor());
        System.out.println("Font: " + run.getFontName());
        System.out.println("Font Size: " + run.getFontSize());
        // stop once the last run of the paragraph has been printed
        if (run.getEndOffset() == pr.getEndOffset())
            break;
    }
}
However, the above code always doubles the font size, i.e. if the actual font size in the document is 12 it outputs 24, and if the actual font size is 8 it outputs 16.
Is this the correct way to read the font size from a Word document?

Yes, that's the correct way; the measurement is in half points.
In a docx, you'd have something like:
<w:rPr>
<w:sz w:val="28" />
</w:rPr>
The ECMA-376 spec defines the unit of sz as ST_HpsMeasure (measurement in half-points).
It's the same with the binary .doc format, which HWPF supports. If you look at [MS-DOC], you'll see that it also specifies the size of text in half-points.
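As a small sketch of the arithmetic (the helper names here are made up for illustration, not part of POI): halve the stored value to get points, or double the size you are looking for before comparing it with what getFontSize() returns.

```java
class HalfPoints {
    // Converts a stored size in half-points (e.g. w:sz, or HWPF's getFontSize()) to points.
    static double toPoints(int halfPoints) {
        return halfPoints / 2.0;
    }

    // Converts a size in points to the half-point value stored in the document.
    static int fromPoints(double points) {
        return (int) Math.round(points * 2);
    }

    public static void main(String[] args) {
        // The two values reported in the question: 24 and 16 half-points.
        System.out.println(toPoints(24));
        System.out.println(toPoints(16));
        // To look for 2 pt text, compare getFontSize() with this value.
        System.out.println(fromPoints(2));
    }
}
```

So to find text with a font size of 2, the loop in the question would test run.getFontSize() == 4.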

Apache POI HWPF - problem in convert doc file to pdf

I am currently working on a Java project that uses Apache POI.
In this project I want to convert a .doc file to a PDF file. The conversion succeeds, but the PDF contains only the text, without any text style or text colour.
My PDF file looks black and white, while my .doc file is coloured and uses different text styles.
This is my code:
POIFSFileSystem fs = null;
Document document = new Document();
try {
System.out.println("Starting the test");
fs = new POIFSFileSystem(new FileInputStream("/document/test2.doc"));
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
OutputStream file = new FileOutputStream(new File("/document/test.pdf"));
PdfWriter writer = PdfWriter.getInstance(document, file);
Range range = doc.getRange();
document.open();
writer.setPageEmpty(true);
document.newPage();
writer.setPageEmpty(true);
String[] paragraphs = we.getParagraphText();
for (int i = 0; i < paragraphs.length; i++) {
org.apache.poi.hwpf.usermodel.Paragraph pr = range.getParagraph(i);
// CharacterRun run = pr.getCharacterRun(i);
// run.setBold(true);
// run.setCapitalized(true);
// run.setItalic(true);
paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
System.out.println("Length:" + paragraphs[i].length());
System.out.println("Paragraph" + i + ": " + paragraphs[i].toString());
// add the paragraph to the document
document.add(new Paragraph(paragraphs[i]));
}
System.out.println("Document testing completed");
} catch (Exception e) {
System.out.println("Exception during test");
e.printStackTrace();
} finally {
// close the document
document.close();
}
}
Please help me.
Thanks in advance.

If you look at Apache Tika, there's a good example of reading some style information from a HWPF document. The code in Tika generates HTML based on the HWPF contents, but you should find that something very similar works for your case.
The Tika class is
https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
One thing to note about Word documents is that everything in any one Character Run has the same formatting applied to it. A Paragraph is therefore made up of one or more Character Runs. Some styling is applied to the Paragraph, and other styling is applied to the runs. Depending on what formatting interests you, it may therefore be on the paragraph or on the run.

If you use WordExtractor, you will get text only. Try using the CharacterRun class instead; it gives you the style along with the text. Please refer to the following sample code.
Range range = doc.getRange();
for (int i = 0; i < range.numParagraphs(); i++) {
org.apache.poi.hwpf.usermodel.Paragraph poiPara = range.getParagraph(i);
int j = 0;
while (true) {
CharacterRun run = poiPara.getCharacterRun(j++);
System.out.println("Color "+run.getColor());
System.out.println("Font size "+run.getFontSize());
System.out.println("Font Name "+run.getFontName());
System.out.println(run.isBold()+" "+run.isItalic()+" "+run.getUnderlineCode());
System.out.println("Text is "+run.text());
if (run.getEndOffset() == poiPara.getEndOffset()) {
break;
}
}
}
