Apache POI :- Get Headings from DOC file

I am experimenting with Apache POI to manipulate Word documents. Is there any way to get the headings from a doc file? I am able to get the plain text from the document, but I need to tell the headings apart from the rest of the text. Is there a function in the Apache POI API that returns only the headings from an MS Word file?

Promoting a comment to an answer
There are two ways to make a "Heading" in Word. The "proper" way, and the way that most people seem to do it...
1. In the styles dropdown, pick the appropriate header style, write your text, then go back to the normal paragraph style for the next line.
2. Highlight a line, and bump up the font size + make it bold or italic.
If your users are doing #2, you've basically no real hope of identifying the Headings. Short of writing some fuzzy matching logic to try to spot when the font size jumps, you're out of luck
For #1, it's fairly easy in Apache POI. What you'll want to do is grab the style description of the style that applies to a paragraph, then get the name of the style. If that starts with Heading (case insensitive), you know you've found a heading. Get the text of that paragraph, and move on through the document.
If you look at the Apache Tika MS-Word parser which is built on top of POI, you'll see a good example there of iterating over the paragraphs and checking the styles
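For case #2, the "fuzzy matching" idea could be sketched roughly as follows. This is only a heuristic under stated assumptions, not anything POI provides: the `Para` class, the 80-character limit, and the "most common size is the body size" rule are all made up for illustration. With POI you would presumably fill in the size and bold flag from each paragraph's runs (for example `XWPFRun.getFontSize()` and `XWPFRun.isBold()`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class HeadingGuesser {
    /** Hypothetical holder for what was read from one paragraph. */
    static final class Para {
        final String text;
        final int fontSize;
        final boolean bold;

        Para(String text, int fontSize, boolean bold) {
            this.text = text;
            this.fontSize = fontSize;
            this.bold = bold;
        }
    }

    /** Assumption: the font size covering the most characters is the body size. */
    static int bodyFontSize(List<Para> paras) {
        Map<Integer, Integer> charsPerSize = new TreeMap<Integer, Integer>();
        for (Para p : paras) {
            Integer soFar = charsPerSize.get(p.fontSize);
            charsPerSize.put(p.fontSize, (soFar == null ? 0 : soFar) + p.text.length());
        }
        int best = -1;
        int bestChars = -1;
        for (Map.Entry<Integer, Integer> e : charsPerSize.entrySet()) {
            if (e.getValue() > bestChars) {
                best = e.getKey();
                bestChars = e.getValue();
            }
        }
        return best;
    }

    /** Flags short, non-empty paragraphs that are larger than the body size or bold. */
    static List<String> guessHeadings(List<Para> paras) {
        int body = bodyFontSize(paras);
        List<String> headings = new ArrayList<String>();
        for (Para p : paras) {
            String text = p.text.trim();
            boolean shortLine = !text.isEmpty() && text.length() <= 80; // arbitrary cut-off
            if (shortLine && (p.fontSize > body || p.bold)) {
                headings.add(text);
            }
        }
        return headings;
    }
}
```

Expect false positives (bold warnings, table captions) and misses (long headings), which is why the styles-based approach is preferable whenever the document uses real heading styles.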

Just as Gagravarr says:
For #1, it's fairly easy in Apache POI. What you'll want to do is grab the style description of the style that applies to a paragraph, then get the name of the style. If that starts with Heading (case insensitive), you know you've found a heading. Get the text of that paragraph, and move on through the document.
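Using Apache POI, the code looks like this:
import java.io.FileInputStream;
import java.util.List;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFStyle;
import org.apache.poi.xwpf.usermodel.XWPFStyles;

public class HeadingFinder {
    public static void main(String[] args) throws Exception {
        try (FileInputStream fis = new FileInputStream("test.docx")) {
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            XWPFStyles styles = xdoc.getStyles();
            List<XWPFParagraph> paragraphs = xdoc.getParagraphs();
            for (int i = 0; i < paragraphs.size(); i++) {
                XWPFParagraph paragraph = paragraphs.get(i);
                String styleId = paragraph.getStyleID();
                System.out.println("paragraph style id " + (i + 1) + ":" + styleId);
                if (styleId == null) {
                    continue; // paragraph has no explicit style
                }
                XWPFStyle style = styles.getStyle(styleId);
                if (style == null || style.getName() == null) {
                    continue; // style id not found in the style table
                }
                System.out.println("Style name:" + style.getName());
                // Compare case-insensitively: the name may be "heading 1" or "Heading 1"
                if (style.getName().toLowerCase().startsWith("heading")) {
                    // this is a heading
                    System.out.println("Heading: " + paragraph.getText());
                }
            }
        }
    }
}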

At least for HWPF (i.e. the old binary doc format) and if you have a properly formatted file (so type #1 of the other answers) you should not rely exclusively on the style name - in fact, this may be a language-dependent value ("Heading" in English, "Titre" in French, etc.).
Paragraph.getLvl(), which encodes the level where the respective paragraph is shown in Word's outline view, often makes a good secondary source. 1 constitutes the most significant level, all subsequent numbers up to 8 stand for less significant heading candidates and 9 is the value that Word assigns to ordinary (non-heading) paragraphs by default.
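Combining both signals could look something like the sketch below. The class and method names are made up for illustration. The style name and level would come from POI (for HWPF, the paragraph's style description name and `Paragraph.getLvl()`). Because I would not rely on whether the top level is reported as 0 or 1, the check simply treats every non-negative value below 9 as a heading candidate.

```java
import java.util.Locale;

class HeadingCheck {
    /** Outline level that Word assigns to ordinary body text, as described above. */
    static final int BODY_TEXT_LEVEL = 9;

    /**
     * Returns true if either the style name or the outline level suggests a heading.
     * The name test only helps for English style names; the level test is the
     * language-independent fallback.
     */
    static boolean looksLikeHeading(String styleName, int outlineLevel) {
        boolean byName = styleName != null
                && styleName.toLowerCase(Locale.ROOT).startsWith("heading");
        boolean byLevel = outlineLevel >= 0 && outlineLevel < BODY_TEXT_LEVEL;
        return byName || byLevel;
    }
}
```

For example, a French document's "Titre 1" paragraph with outline level 1 is caught by the level test even though the name test fails.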

Related

iText: Compare font after processing string using FontSelector

tl;dr:
In iText, is there a way to access the font's name, or figure out the language of a font that has been applied to a Phrase from a FontSelector?
This question relates to an issue we've been having in the printwikipedia project (see the GitHub issue).
I have a body of text that I do not have control over coming in to a FontSelector to be processed. Some of that text is in Arabic and some is in Hebrew and I am trying to figure out the best way of detecting the type of font in order to have it print correctly as seen here: http://developers.itextpdf.com/question/how-create-persian-content-pdf
using
pdfCell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
After running the strings through the FontSelector to become Phrases and then placing them within the cell formatted with the code above all the text ends up on the right.
Formatting of this text is very important so I cannot have all of my text aligned to the right but only when it is meant to be read as right to left. So what I believe should be the best course of action is to detect the font that has been applied to the Phrase and then alter the cell if necessary.
public FontSelector fs = new FontSelector();
// add a whole lot of fonts to fs
// incoming line of some sort of text
Phrase ph = fs.process(line);
System.out.println(ph.getFont().toString());
The above code will output some extremely varied results. Pretty much a new id for every font created for each piece of text. I can't figure out a way to compare a font that exists in the fontselector object with a font that has been applied to the incoming text.
Is this the best method to figure out what fonts are in a phrase?
How can I access the font's name, or figure out the language of a font that has been applied to a Phrase from a FontSelector?

I went through the iText documentation a bit more carefully and discovered my answer.
Phrases are made up of Chunks, and it is the Chunks that hold the fonts. If you simply call .getChunks() on the output Phrase and iterate over the resulting list, you can call .getFont() on each Chunk, compare the fonts, and apply whatever styles you wish from there.
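A different angle, which I have not tried inside the project, is to look at the text itself rather than the font. The sketch below uses only the JDK (`Character.getDirectionality`) to guess whether a string starts with right-to-left content. The class and method names are made up. The result could then decide whether to call `setRunDirection(PdfWriter.RUN_DIRECTION_RTL)` on a given cell.

```java
class RunDirectionGuess {
    /**
     * Returns true if the first strongly directional character in the text is
     * right-to-left (Hebrew, Arabic, ...). Digits, spaces and punctuation are
     * skipped because they carry no strong direction. Text without any strong
     * character is treated as left-to-right, which is an assumption.
     */
    static boolean isRightToLeft(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            byte direction = Character.getDirectionality(codePoint);
            if (direction == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || direction == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                return true;
            }
            if (direction == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                return false;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }
}
```

This only looks at the first strong character, so mixed lines that begin with a Latin word but are mostly Hebrew or Arabic would need a smarter rule, such as counting characters per direction.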

iText PDF bad character conversion

I have a PDF to read that is driving me crazy.
The PDF represents the electricity bill (in Italian) of a customer, and he wants me to read the text from it.
Now the problem: when I copy and paste text from the PDF to Notepad, I get a bunch of incomprehensible characters...
After a lot of research I found my answer. The PDF contains all the fonts, but it does not contain the corresponding cmap that would allow the text to be exported. I found this link, which however refers to an older version of iText (I'm using version 5.5.5).
What I want to achieve, if possible, is the conversion of the text from glyph codes to Unicode.
I've found some references to Cmap-something, but I don't know how to use them, and there are apparently no examples on the net :(
This is what I've tried:
PdfReader reader = new PdfReader("MyFile.pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
TextExtractionStrategy strategy;
strategy = parser.processContent(1, new SimpleTextExtractionStrategy());
String encodedText = strategy.getResultantText();
String cmapFile = "UnicodeBigUnmarked";
byte[] bytes = encodedText.getBytes();
String cid = PdfEncodings.convertToString(bytes, cmapFile);
The resulting cid is a nice-looking sequence of Japanese characters.
I also tried:
FontFactory.registerDirectory("myDirectoryWithAllFonts");
just before attempting the conversion. This solution seems to give no results.
Any help will be appreciated.

You say: When i copy paste text from pdf to notepad i get a bunch of incomprehensible characters. I assume that you are talking about selecting text in Adobe Reader and trying to paste it in a text editor.
If this doesn't succeed, you have a PDF that doesn't allow you to extract text from the PDF because the text isn't stored in the PDF correctly. Watch this video for the full explanation.
Let's take a look at your PDF from the inside:
We see the start of a text object (where it says BT which stands for Begin Text). A font /C2_1 is defined with font size 1. At first sight, this may look odd, but the font will be scaled to size 6.9989 in a transformation. Then we see some text arrays containing strings of double byte characters such as I R H E Z M W M S R I H I P.
How should iText interpret these characters? To find out, we need to look at the encoding that is used for the font corresponding with /C2_1:
Aha, the Unicode characters stored in the content stream correspond with the actual characters we need: IRHE ZMWMSRI HIP and so on. That's exactly what we see when we convert the PDF to text using iText.
But wait a minute! How come that we see other characters when we look at the PDF using Adobe Reader? Well, characters such as I, R, H and so on are addresses that correspond with the "program" of a glyph. This program is responsible for drawing the character on the page. One would expect that in this case, the character I would correspond with the glyph (or "the drawing" if you prefer this word) of the letter I. No such luck in your PDF.
Now what does Adobe do when you use "Copy with formatting"? Plenty of magic that currently isn't implemented in iText. Why not? Hmm... I don't know the budget of Adobe, but it's probably much, much higher than the budget of the iText Group. Extracting text from documents that contain confusing information about fonts isn't on the technical roadmap of the iText Group.
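Purely as an observation on the fragment quoted above (this is not something iText does for you): if you shift each capital letter in IRHE ZMWMSRI HIP back by four positions in the alphabet, you get ENDA VISIONE DEL, which looks like Italian. A minimal sketch of such a remapping is shown below. The class and method names are made up, and whether the same shift holds for every glyph in the font (lowercase letters, digits, punctuation, wrap-around at the start of the alphabet) is an assumption you would have to verify against the document.

```java
class GlyphShift {
    /**
     * Shifts the capital letters A-Z by the given amount, wrapping around the
     * alphabet, and leaves every other character untouched.
     */
    static String shiftLetters(String text, int shift) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                out.append((char) ('A' + Math.floorMod(c - 'A' + shift, 26)));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "ENDA VISIONE DEL"
        System.out.println(shiftLetters("IRHE ZMWMSRI HIP", -4));
    }
}
```

You would apply this to the string returned by the text extraction strategy. It is a workaround specific to one badly encoded document, not a general fix.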

Program with PDFBox searching for words

I would like to make a program that search for words in a pdf
using PDFBox.
Here is my little program:
List<String> words ;// List of words
PDDocument document = PDDocument.load("D:\\INIT.pdf");
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
Comparing(content,words);// method that searches for those words in my text
System.out.println(content);
But is it possible to search the PDF directly, without first extracting the text with getText?
getText returns a String. If the PDF file contains a lot of text, can a String hold all of it, or is there another type to use when the text is too big for a String?

I hope you find a solution for this within PDFBox.
The whole process is rather more difficult than it seems. For example PDF text is broken into discontinuous fragments and spaces are often represented as gaps rather than space characters. There's a need both to abstract the fragments and also to retain the link between the human-readable text and the underlying fragments within the PDF. It is quite tricky.
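As a very rough illustration of the simplest part of this (irregular spacing and case), here is a sketch that searches extracted text line by line from a `Reader`, so the whole document never has to sit in one String. The class and method names are made up. With PDFBox you could presumably feed it the output of `PDFTextStripper.writeText(document, writer)`. It does plain substring matching and does not solve the fragment-to-position mapping described above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class WordSearch {
    /**
     * Returns the search terms that occur somewhere in the text. Runs of
     * whitespace are collapsed to a single space and case is ignored.
     * Matching is by substring, so "art" would also match "part".
     */
    static Set<String> findWords(Reader text, List<String> words) throws IOException {
        Set<String> found = new LinkedHashSet<String>();
        BufferedReader reader = new BufferedReader(text);
        String line;
        while ((line = reader.readLine()) != null) {
            String normalized = line.replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
            for (String word : words) {
                if (normalized.contains(word.toLowerCase(Locale.ROOT))) {
                    found.add(word);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        Reader text = new StringReader("Total   amount due\nInvoice\tNUMBER 42\n");
        // prints [amount due, invoice]
        System.out.println(findWords(text, Arrays.asList("invoice", "amount due", "refund")));
    }
}
```

Terms that span a line break are missed by this approach, which is one more reason real PDF search needs to work on the underlying fragments.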
Anyhow if you don't find a satisfactory solution within PDFBox ABCpdf will do this for you. For example the link below shows how to find and highlight keywords in a PDF.
http://www.websupergoo.com/helppdf9net/source/8-abcpdf.operations/8-textoperation/1-methods/group.htm
I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)

How to get given paragraph content of pdf file using iText library?

Is there any way to get the number of paragraphs, or the content of a given paragraph, in a PDF file using the iText library? I have seen classes such as Paragraph and Chunk in code that creates a new PDF file, but I cannot find any way to get at these classes when reading a file. Any ideas are appreciated.

Is the PDF you're talking about a Tagged PDF? If not, you are making the wrong assumptions about PDF. In a PDF, content is drawn on a page. For instance: an iText PdfPTable is converted into text state operators that draw snippets of text to a canvas, as well as graphics state operators that draw paths and shapes. If the PDF isn't tagged, the lines don't know that they are borders of a table; a word doesn't know to which cell it belongs.
The same goes for paragraphs: a snippet of text doesn't know whether it belongs to a sentence, to a paragraph, to a title line,...
Due to the very nature of PDF, what you're looking for may be impossible (using iText or any other software product), or may require heuristics (artificial intelligence) to examine all text state operators and the semantics of the content to get a result that mimics how humans would interpret text.
It's very easy to achieve if your PDF is tagged correctly. See the ParseTaggedPdf example.
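For an untagged PDF, the kind of heuristic mentioned above might start with something as crude as the sketch below: group consecutive text lines into one paragraph unless the vertical gap between their baselines exceeds a threshold. Everything here is an assumption for illustration (the `Line` class, the threshold, the idea that lines arrive in reading order). With iText 5 the baseline positions would have to be collected yourself, for example through a `RenderListener` that inspects each `TextRenderInfo`.

```java
import java.util.ArrayList;
import java.util.List;

class ParagraphGuesser {
    /** Hypothetical holder for one extracted line of text and its baseline position. */
    static final class Line {
        final double baselineY;
        final String text;

        Line(double baselineY, String text) {
            this.baselineY = baselineY;
            this.text = text;
        }
    }

    /**
     * Joins lines (given in reading order) into paragraphs. A new paragraph
     * starts whenever the distance to the previous baseline exceeds maxGap.
     */
    static List<String> group(List<Line> lines, double maxGap) {
        List<String> paragraphs = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        Line previous = null;
        for (Line line : lines) {
            boolean gapTooLarge = previous != null
                    && Math.abs(previous.baselineY - line.baselineY) > maxGap;
            if (gapTooLarge && current.length() > 0) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.text.trim());
            previous = line;
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

Multi-column layouts, indented first lines without extra spacing, and page breaks all defeat this simple rule, which is exactly why tagged PDF is the reliable route.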

Without having any form of GUI, what set of classes should be used to represent formatted text in memory?

I am writing an application that reads a docx file. It appears that I may need to read the formatting of the text, and not just the content. I have googled the matter, but I can't find a search term that gets me what I'm looking for; most results point me to formatted text inputs and the like.
Does anyone know what class I should be using?

Apache POI should give you (at least some) access to Excel styles, such as colors, fonts, etc. I'm not sure about exotic cases, but it's certainly possible to obtain a cell's font color, for example. The following code works for me:
XSSFCell cell = ...
if (IndexedColors.WHITE.getIndex() == cell.getCellStyle().getFont().getColor()) {
...
}
