Extracting heading and paragraphs from doc and docx files using apache-poi

I am trying to read Microsoft Word documents with apache-poi and found a couple of convenient methods for scanning a document, such as getText() and getParagraphList(). My use case is slightly different: while scanning a document, I want events/information such as heading, paragraph and table in the same sequence as they appear in the document. That would help me build a document structure like this:
<content>
<section>
<heading> ABC </heading>
<paragraph>xyz </paragraph>
<paragraph>scanning through APIs</paragraph>
</section>
.
.
.
</content>
The main intent is to preserve the relationship between headings and paragraphs as in the original document. I am not sure, but could something like this work for me?
Iterator<IBodyElement> itr = doc.getBodyElementsIterator();
while (itr.hasNext()) {
    IBodyElement ele = itr.next();
    System.out.println(ele.getElementType());
}
With this code I was able to get the paragraph list, but not the heading information. Just to mention, I am interested in all headings, whether they are explicitly marked as headings with a style or simply use a large font size.

Headers aren't stored inline in the main document; they live elsewhere, which is why you're not getting them as body elements. Body elements are things like sections, paragraphs and tables, not headers, so you have to fetch headers yourself.
If you look at this code in Apache Tika, you'll see an example of how to do so. Assuming you're iterating over the body elements and want the headers/footers that belong to paragraphs, you'll want code something like this (based on the Tika code):
// bodyElement is the IBody being walked; document is the XWPFDocument it belongs to.
for (IBodyElement element : bodyElement.getBodyElements()) {
    if (element instanceof XWPFParagraph) {
        XWPFParagraph paragraph = (XWPFParagraph) element;
        XWPFHeaderFooterPolicy headerFooterPolicy = null;
        // A paragraph that carries section properties marks the end of a section.
        if (paragraph.getCTP().getPPr() != null) {
            CTSectPr ctSectPr = paragraph.getCTP().getPPr().getSectPr();
            if (ctSectPr != null) {
                headerFooterPolicy = new XWPFHeaderFooterPolicy(document, ctSectPr);
                // Handle header
            }
        }
        // Handle paragraph
        if (headerFooterPolicy != null) {
            // Handle footer
        }
    } else if (element instanceof XWPFTable) {
        XWPFTable table = (XWPFTable) element;
        // Handle table
    } else if (element instanceof XWPFSDT) {
        XWPFSDT sdt = (XWPFSDT) element;
        // Handle SDT
    }
}
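
For the heading/paragraph structure the question describes, the grouping step can be sketched independently of POI. The following is a minimal sketch, not a POI API: it assumes each body element has already been reduced to a style name and its text (with POI these could come from XWPFParagraph.getStyle() and getText()), and it assumes heading styles have names starting with "Heading". The classes Block and Section are hypothetical helpers introduced only for illustration, and headings marked purely by a large font size are not covered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SectionBuilder {

    /** One body element reduced to its style name (may be null) and its text. */
    static final class Block {
        final String style;
        final String text;
        Block(String style, String text) {
            this.style = style;
            this.text = text;
        }
    }

    /** A heading plus the paragraphs that follow it, in document order. */
    static final class Section {
        final String heading;
        final List<String> paragraphs = new ArrayList<>();
        Section(String heading) {
            this.heading = heading;
        }
    }

    /** Assumption: heading styles have names that start with "Heading". */
    static boolean isHeading(Block block) {
        return block.style != null && block.style.startsWith("Heading");
    }

    /** Walks the blocks once, opening a new section at every heading. */
    static List<Section> group(List<Block> blocks) {
        List<Section> sections = new ArrayList<>();
        Section current = null;
        for (Block block : blocks) {
            if (isHeading(block)) {
                current = new Section(block.text);
                sections.add(current);
            } else {
                if (current == null) {
                    // Text before the first heading goes into an untitled section.
                    current = new Section("");
                    sections.add(current);
                }
                current.paragraphs.add(block.text);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block("Heading1", "ABC"),
                new Block(null, "xyz"),
                new Block(null, "scanning through APIs"));
        for (Section section : group(blocks)) {
            System.out.println("<section>");
            System.out.println("<heading>" + section.heading + "</heading>");
            for (String paragraph : section.paragraphs) {
                System.out.println("<paragraph>" + paragraph + "</paragraph>");
            }
            System.out.println("</section>");
        }
    }
}
```

Because the blocks are visited in the order the body elements iterator returns them, each paragraph ends up under the nearest preceding heading.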

Related

How to replace date field with some text in the ViewMaster (Vertical) for word/pdf using Aspose?

The Aspose code inserts the ViewMaster (Vertical) with the default date-selection text inside. I want to replace that with some text, as shown in the image.
I followed the code mentioned in "ViewMaster(vertical) using Aspose" to generate the ViewMaster (Vertical) in the Word/PDF output. Can someone help me get the right code to replace the date with text?

The date is stored in a structured document tag (SDT). You can use code like this to get and modify the value of this SDT:
// Get structured document tags from footer.
NodeCollection tags = doc.FirstSection.HeadersFooters[HeaderFooterType.FooterPrimary].GetChildNodes(NodeType.StructuredDocumentTag, true);
foreach (StructuredDocumentTag tag in tags)
{
    if (tag.Title.Equals("Date") && tag.SdtType == SdtType.Date)
    {
        tag.IsShowingPlaceholderText = false;
        tag.FullDate = DateTime.Now;
        // By default the SDT is mapped to XML. We can simply remove the mapping to use the value set in the FullDate property.
        tag.XmlMapping.Delete();
    }
}
If you do not need a date but want to insert some custom text, you can remove the tag and insert a simple paragraph with text instead. For example:
// DocumentBuilder is used below to write text into the new paragraph.
DocumentBuilder builder = new DocumentBuilder(doc);
// Get structured document tags from footer.
NodeCollection tags = doc.FirstSection.HeadersFooters[HeaderFooterType.FooterPrimary].GetChildNodes(NodeType.StructuredDocumentTag, true);
foreach (StructuredDocumentTag tag in tags)
{
    if (tag.Title.Equals("Date") && tag.SdtType == SdtType.Date)
    {
        // Put an empty paragraph after the structured document tag.
        Paragraph p = new Paragraph(doc);
        tag.ParentNode.InsertAfter(p, tag);
        // Remove the tag.
        tag.Remove();
        // Move the DocumentBuilder to the newly inserted paragraph and insert some text.
        builder.MoveTo(p);
        builder.Write("This is my custom vertical text");
    }
}

Java JSoup: article extraction with image links and paragraph

I am currently building an article content extraction application using Jsoup and Java. My problem is that when I scrape an article, Jsoup returns a list of Element objects grouped by selector rather than preserving the order of the article. For example, a normal article with more than one image could have an order like this: (Title, sapo, image, paragraph, image, paragraph, paragraph, image, paragraph). So how can I scrape the main content of the website (text and image links) without losing its order?
Below is my idea for doing that, but it doesn't work.
int cur = 0;
Document doc = Jsoup.connect(url).get();
Elements elements = doc.select("div");
for (Element element : elements) {
    if (element.select("div[type=\"Photo\"] img").hasAttr("src")) {
        Elements temp = element.select("div[type=\"Photo\"] img");
        System.out.println(temp.get(cur).attr("src"));
        cur++;
    }
    System.out.println(element.select("p span").text());
    System.out.println("");
}

If you wanted to extract the article data from the sites that you linked to in the comment, you could do something like this:
Document doc = Jsoup.connect(url).get();
// Full article
Elements elements = doc.select("div.sidebar-1");
System.out.println("## Article title:");
System.out.println(elements.select("h1.title-detail").text());
System.out.println("## Article summary:");
System.out.println(elements.select("p.description").text());
// Images and paragraphs
for (Element e : elements.select("article.fck_detail p,figure")) {
    if (e.is("p")) {
        System.out.println("## Paragraph");
        System.out.println(e.text());
    } else {
        System.out.println("## Image (image URL)");
        System.out.println(e.select("img[src]").attr("src"));
    }
}
The idea is:
1. Find the outermost container that holds the full article.
2. Extract the title and the summary.
3. Loop through the image (figure) and paragraph (p) elements of the article; the order is preserved automatically.
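
The point about order being preserved can be illustrated without Jsoup. The sketch below deliberately swaps in the JDK's built-in DOM parser and a small well-formed fragment, so it is not the answer's Jsoup code; the class name OrderedExtract, the "P:"/"IMG:" prefixes and the sample markup are assumptions made for illustration. It relies on getElementsByTagName("*") returning elements in document order, which mirrors what a combined selector such as "p,figure" does in the answer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class OrderedExtract {

    /** Returns "P:<text>" and "IMG:<src>" entries in the order they occur. */
    static List<String> extract(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        // getElementsByTagName("*") lists every element in document order.
        NodeList all = doc.getElementsByTagName("*");
        List<String> items = new ArrayList<>();
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.getTagName().equals("p")) {
                items.add("P:" + e.getTextContent().trim());
            } else if (e.getTagName().equals("img")) {
                items.add("IMG:" + e.getAttribute("src"));
            }
        }
        return items;
    }

    public static void main(String[] args) {
        String xml = "<article>"
                + "<figure><img src=\"a.jpg\"/></figure>"
                + "<p>First paragraph</p>"
                + "<figure><img src=\"b.jpg\"/></figure>"
                + "<p>Second paragraph</p>"
                + "</article>";
        // Prints the image and paragraph entries interleaved as in the markup.
        extract(xml).forEach(System.out::println);
    }
}
```

Selecting images and paragraphs in one pass is what keeps them interleaved; two separate queries (one for images, one for paragraphs) would each be ordered internally but would lose the relative order between the two kinds.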

Use JSoup to get all textual links

I'm using JSoup to grab content from web pages.
I want to get all the links on a page that contain some text. It doesn't matter what the text is; it just needs to be non-empty and not only an image.
Example of links I want:
Link to Some Page
Since it contains the text "Link to Some Page"
Links I don't want:
<img src="someimage.jpg"/>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}

You could do something like this.
It does its job, though it's probably not the fanciest solution out there.
Note: the function text() returns clean text, so if there are any HTML code fragments inside the link, it won't return them.
Document doc = // get the doc
Elements linksOnPage = doc.select("a");
for (Element pageElem : linksOnPage) {
    // Skip links without visible text (e.g. image-only links).
    if (pageElem.text().trim().isEmpty())
        continue;
    String link = pageElem.attr("abs:href");
    // do something with it
}

I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}
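
The regular expression inside that selector can be checked on its own with java.util.regex. This is a minimal sketch, assuming the :matches selector tests whether the pattern is found anywhere in the element's text; the class and method names are hypothetical and only the pattern itself comes from the answer.

```java
import java.util.regex.Pattern;

final class NonBlankText {

    // Same pattern as in the selector: one or more non-whitespace characters.
    private static final Pattern NON_WHITESPACE = Pattern.compile("([^\\s]+)");

    /** True when the text contains at least one non-whitespace character. */
    static boolean hasVisibleText(String text) {
        return NON_WHITESPACE.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasVisibleText("Link to Some Page")); // true
        System.out.println(hasVisibleText("   "));               // false
        System.out.println(hasVisibleText(""));                  // false
    }
}
```

Under that assumption, a link whose text is empty or whitespace-only (such as one wrapping just an image) is filtered out, which is the same outcome as the trim-and-compare check in the previous answer.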

Write text and tables in to word, with whitespaces/enters

I'm writing paragraph text and text from tables into a Word document.
With the following code, the tables are placed under the right paragraphs.
Iterator<IBodyElement> iter = xdoc.getBodyElementsIterator();
while (iter.hasNext()) {
    IBodyElement elem = iter.next();
    if (elem instanceof XWPFParagraph) {
        relevantText.setText(((XWPFParagraph) elem).getText());
    } else if (elem instanceof XWPFTable) {
        tabellen.setText(((XWPFTable) elem).getText());
    }
}
Now, when I try to add whitespace or a line break with addBreak() or addCarriageReturn(), the order of my document is wrong: the table text is placed after all the other text.
Does anyone have a solution for this?

I had the same problem a couple of days ago. Did you create two different runs, one for the paragraphs and one for the tables?
I did, and when I changed it to a single run it worked for me.
Like this:
XWPFRun text = paragraph.createRun();

Docx4j - Images in the document

How can we remove an image from a document using docx4j?
Say I have 10 images; I want to replace 8 of them with my own byte array/binary data and delete the remaining 2.
I am also having trouble locating the images.
Is it somehow possible to replace text placeholders in the document with images?

Refer to this post : http://vixmemon.blogspot.com/2013/04/docx4j-replace-text-placeholders-with.html
// getAllElementFromObject, createInlineImage and addInlineImageToParagraph are helper methods from the linked post.
for (Object obj : elements) {
    if (obj instanceof Tbl) {
        Tbl table = (Tbl) obj;
        List<Object> rows = getAllElementFromObject(table, Tr.class);
        for (Object trObj : rows) {
            Tr tr = (Tr) trObj;
            List<Object> cols = getAllElementFromObject(tr, Tc.class);
            for (Object tcObj : cols) {
                Tc tc = (Tc) tcObj;
                List<Object> texts = getAllElementFromObject(tc, Text.class);
                for (Object textObj : texts) {
                    Text text = (Text) textObj;
                    if (text.getValue().equalsIgnoreCase("${MY_PLACE_HOLDER}")) {
                        // Swap the cell's first content item for a paragraph holding the image.
                        File file = new File("C:\\image.jpeg");
                        P paragraphWithImage = addInlineImageToParagraph(createInlineImage(file));
                        tc.getContent().remove(0);
                        tc.getContent().add(paragraphWithImage);
                    }
                }
            }
        }
    }
}
wordMLPackage.save(new java.io.File("C:\\result.docx"));

See "docx4j checking checkboxes" for the two approaches to finding things (XPath, or non-XPath traversal).
VariableReplace allows you to replace text placeholders, but not with images. I think there may be code floating around (in the docx4j forums?) that extends it to do that.
But I'd suggest you use content control data binding instead. See "how to create a new word from template with docx4j".
You can use base64-encoded images in your XML data, and docx4j and/or Word will do the rest.
