How to extract font family from OOXML using Apache POI?

I am trying to extract the font style that is applied to a specific paragraph with Apache POI. The method getStyle() returns null on my XWPFParagraph object.
Calling getCTR().getRPr().getRStyle() on the first XWPFRun object also returns null.
Calling getStyle().getDocDefaults().getRPrDefault() on my XWPFDocument object returns this:
<w:rPr>
<w:rFonts w:asciiTheme="minorHAnsi"/>
<w:sz w:val="22"/>
<w:szCs w:val="22"/>
<w:lang w:val="en-GB" w:eastAsia="en-US" w:bidi="ar-SA"/>
</w:rPr>
There is no w:ascii attribute in the w:rFonts tag, but there is a w:asciiTheme attribute. How can I extract the information behind the given theme with Apache POI?
The font for this example is defined by the theme value minorHAnsi, and the theme itself can be found in the theme1.xml file. How can I, for example, extract the typeface attribute of the a:latin tag using Apache POI?
Here is a sample of what it looks like in the theme1.xml file:
<a:minorFont>
<a:latin typeface="Calibri"/>
<a:ea typeface=""/>
<a:cs typeface=""/>
<a:font script="Jpan" typeface="ＭＳ 明朝"/>
<a:font script="Hang" typeface="맑은 고딕"/>
<a:font script="Hans" typeface="宋体"/>
...
<a:font script="Viet" typeface="Arial"/>
<a:font script="Uigh" typeface="Microsoft Uighur"/>
<a:font script="Geor" typeface="Sylfaen"/>
</a:minorFont>

If the question is how to get /word/theme/theme1.xml out of the *.docx package, parse it, and then read <a:minorFont><a:latin ... from it, this can be solved as follows:
First, use the methods of OPCPackage to get the package part /word/theme/theme1.xml.
...
XWPFDocument document = new XWPFDocument(new FileInputStream("./WordExample.docx"));
OPCPackage oPCPackage = document.getPackage();
PackagePartName partName = PackagingURIHelper.createPartName("/word/theme/theme1.xml");
PackagePart themePart = oPCPackage.getPart(partName);
...
Then, once we have that PackagePart, parse it into an org.openxmlformats.schemas.drawingml.x2006.main.ThemeDocument and use the methods of ThemeDocument to reach its child elements.
...
ThemeDocument themeDocument = ThemeDocument.Factory.parse(themePart.getInputStream());
CTOfficeStyleSheet theme = themeDocument.getTheme();
CTBaseStyles themeElements = theme.getThemeElements();
CTFontScheme fontScheme = themeElements.getFontScheme();
CTFontCollection minorFont = fontScheme.getMinorFont();
CTTextFont latin = minorFont.getLatin();
...
Unfortunately, there is no publicly available API documentation for org.openxmlformats.schemas.*. To get one, download the sources of ooxml-schemas (for example from https://repo1.maven.org/maven2/org/apache/poi/ooxml-schemas/1.4/) and run javadoc on them to generate the API documentation.
Complete example:
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.openxml4j.opc.*;
import org.openxmlformats.schemas.drawingml.x2006.main.*;

public class WordGetThemeDocument {
    public static void main(String[] args) throws Exception {
        XWPFDocument document = new XWPFDocument(new FileInputStream("./WordExample.docx"));

        // Locate the theme part inside the *.docx package
        OPCPackage oPCPackage = document.getPackage();
        PackagePartName partName = PackagingURIHelper.createPartName("/word/theme/theme1.xml");
        PackagePart themePart = oPCPackage.getPart(partName);
        System.out.println(themePart);

        // Parse the part and walk down to <a:minorFont><a:latin>
        ThemeDocument themeDocument = ThemeDocument.Factory.parse(themePart.getInputStream());
        CTOfficeStyleSheet theme = themeDocument.getTheme();
        CTBaseStyles themeElements = theme.getThemeElements();
        CTFontScheme fontScheme = themeElements.getFontScheme();
        CTFontCollection minorFont = fontScheme.getMinorFont();
        CTTextFont latin = minorFont.getLatin();
        System.out.println(latin);

        // The typeface attribute holds the font family name
        String typeFace = latin.getTypeface();
        System.out.println(typeFace);

        document.close();
    }
}
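The example above always reads the Latin font of the minor font collection. A value such as w:asciiTheme="minorHAnsi" encodes both the collection (major or minor) and the script slot (latin, ea or cs), so the lookup can be generalized. The following is a minimal sketch of that mapping, not part of the original answer: it uses only JDK classes instead of POI, and the class name ThemeFontSketch, its method names, and the sample theme XML (including the "Cambria" major font) are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: resolves a theme font value to a typeface in theme XML.
class ThemeFontSketch {
    // DrawingML main namespace, bound to the "a" prefix in theme1.xml
    static final String A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Made-up sample for demonstration; only the minorFont/latin value comes from the question.
    static final String SAMPLE =
        "<a:theme xmlns:a=\"" + A_NS + "\"><a:themeElements><a:fontScheme name=\"Office\">"
        + "<a:majorFont><a:latin typeface=\"Cambria\"/><a:ea typeface=\"\"/><a:cs typeface=\"\"/></a:majorFont>"
        + "<a:minorFont><a:latin typeface=\"Calibri\"/><a:ea typeface=\"\"/><a:cs typeface=\"\"/></a:minorFont>"
        + "</a:fontScheme></a:themeElements></a:theme>";

    // Maps e.g. "minorHAnsi" to {"minorFont", "latin"}, "majorBidi" to {"majorFont", "cs"}.
    static String[] slotFor(String themeValue) {
        String collection = themeValue.startsWith("major") ? "majorFont" : "minorFont";
        String slot;
        if (themeValue.endsWith("EastAsia")) {
            slot = "ea";
        } else if (themeValue.endsWith("Bidi")) {
            slot = "cs";
        } else {
            slot = "latin"; // the Ascii and HAnsi values both use the Latin font
        }
        return new String[] {collection, slot};
    }

    // Returns the typeface attribute for the given theme value, or null if the element is absent.
    static String typeface(InputStream themeXml, String themeValue) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(themeXml);
        String[] target = slotFor(themeValue);
        NodeList collections = doc.getElementsByTagNameNS(A_NS, target[0]);
        if (collections.getLength() == 0) {
            return null;
        }
        NodeList slots = ((Element) collections.item(0)).getElementsByTagNameNS(A_NS, target[1]);
        if (slots.getLength() == 0) {
            return null;
        }
        return ((Element) slots.item(0)).getAttribute("typeface");
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = SAMPLE.getBytes(StandardCharsets.UTF_8);
        System.out.println(typeface(new ByteArrayInputStream(bytes), "minorHAnsi")); // prints "Calibri"
        System.out.println(typeface(new ByteArrayInputStream(bytes), "majorHAnsi")); // prints "Cambria"
    }
}
```

With a real document, the stream passed to typeface could be themePart.getInputStream() from the POI example above, and the theme value would be read from the w:rFonts element.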

Related

encoding issue after pdfbox

I want to extract text in PDF on Java, so I use pdfbox library. PDF file seems to have been written by hwp(korea word edit software) before it was converted to a PDF file.
This is my simple API.
@RestController
@RequiredArgsConstructor
public class QuestionController {
    private final QuestionParseService questionParseService;

    @GetMapping("/")
    public ResponseEntity<?> parsePDF() throws IOException {
        return ResponseEntity.ok(questionParseService.parsePDF());
    }
}

@Service
public class QuestionParseService {
    public String parsePDF() throws IOException {
        File file = new File("filePath");
        PDDocument document = PDDocument.load(file);
        PDFTextStripper s = new PDFTextStripper();
        String content = s.getText(document);
        return content;
    }
}
This is my PDF file.
But the API result for question 1 was:


×
 

의 값은? [2점]
①  ②  ③  ④  ⑤ 
How can I get correctly encoded text?

Not able to create package using docx4j WordprocessingMLPackage in java

Converting an HTML file or string with docx4j throws an error when I run this code:
public static void convertHtmltoWord2(String html) {
    WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
    NumberingDefinitionsPart ndp = new NumberingDefinitionsPart();
    wordMLPackage.getMainDocumentPart().addTargetPart(ndp);
    ndp.unmarshalDefaultNumbering();
    // Convert the HTML, and add it into the empty docx we made
    XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
    XHTMLImporter.setHyperlinkStyle("Hyperlink");
    wordMLPackage.getMainDocumentPart().getContent().addAll(
        XHTMLImporter.convert(html, baseURL));
    wordMLPackage.save(new java.io.File("C:\\Converted_Word.docx"));
}
The error is:
java.util.NoSuchElementException
at org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart.<init>(MainDocumentPart.java:76)
at org.docx4j.openpackaging.packages.WordprocessingMLPackage.createPackage(WordprocessingMLPackage.java:432)
at org.docx4j.openpackaging.packages.WordprocessingMLPackage.createPackage(WordprocessingMLPackage.java:421)
Any idea why it's not working?

PdfBox flatten pdf template fields but pdf is still editable

I have a pdf template and with the following code I open it, edit, and then save it with another name after flattening it. But when I open the new pdf file, the fields are still editable.
public static void main(String[] args) throws IOException {
    PDDocument doc = PDDocument.load(new File("template.pdf"));
    PDDocumentCatalog docCatalog = doc.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    for (PDField field : acroForm.getFields()) {
        if (field.getFieldType().equals("Tx")) {
            field.setValue(field.getPartialName());
        }
        System.out.println(field.getFieldType());
    }
    acroForm.flatten();
    doc.save("finalFile.pdf");
    doc.close();
}
I read other questions about flattening, but none of them describes my problem.
Am I missing anything?
I'm on PDFBox 2.0.12.

PDF parser text contains

I want to verify a PDF document using TestNG and PDFBox.
Is it possible to check whether the PDF contains some text, like this:
PDFParser parser = new PDFParser(stream);
parser.getDocument().contains("ABC")

Try the code below:
public void ReadPDF() throws Exception {
    URL TestURL = new URL("http://www.axmag.com/download/pdfurl-guide.pdf");
    BufferedInputStream TestFile = new BufferedInputStream(TestURL.openStream());
    PDFParser TestPDF = new PDFParser(TestFile);
    TestPDF.parse();
    // Extract the whole document as plain text, then search it
    String TestText = new PDFTextStripper().getText(TestPDF.getPDDocument());
    Assert.assertTrue(TestText.contains("Open the setting.xml, you can see it is like this"));
}
Download the libraries from https://pdfbox.apache.org/index.html
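Text returned by PDFTextStripper.getText() may contain line breaks and repeated spaces wherever the page layout wraps, so a plain contains() on a multi-word phrase can fail even though the phrase is visibly in the PDF. The sketch below shows one way to make the check tolerant of that. It is an illustration using only JDK classes; the class PdfTextCheckSketch and its method names are hypothetical, and the extracted string is hard-coded rather than read from a PDF.

```java
// Hypothetical helper for whitespace-tolerant text checks on extracted PDF text.
class PdfTextCheckSketch {
    // Collapses every run of whitespace (spaces, tabs, line breaks) into one space.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // True if the expected phrase occurs in the extracted text, ignoring whitespace differences.
    static boolean containsText(String extracted, String expected) {
        return normalize(extracted).contains(normalize(expected));
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by PDFTextStripper.getText(...)
        String extracted = "Open the setting.xml,\r\nyou can see   it is like this";
        System.out.println(containsText(extracted, "you can see it is like this")); // prints "true"
        System.out.println(containsText(extracted, "ABC")); // prints "false"
    }
}
```

In a TestNG test, the assertion would then read Assert.assertTrue(PdfTextCheckSketch.containsText(TestText, "...")).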

InputSource not working in Android while retrieving data from Assets?

After checking the answers from the other questions and applying them to my program, I still have problems loading assets from my Project.
Here is the block of code which has the problem and it seems to not go past the line with the comment //BUG
public void parseXmlFile(String xml_file, String object, String xml_class, String pointer, Activity activity, TextView test) throws ParserConfigurationException, SAXException, IOException {
    view = test;
    view.setText("2.1");
    categories_list = new ArrayList<Category>();
    // instantiate String objects to be used throughout class
    object_name = object;
    class_type = xml_class;
    value_pointer = pointer;
    // get factory
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    // get instance of document builder to build document from xml file
    DocumentBuilder builder = factory.newDocumentBuilder();
    // Create bytestream which allows a stream of data from xml to document builder
    ByteArrayInputStream byte_stream = new ByteArrayInputStream(xml_file.getBytes("utf-8"));
    // create an input source for the bytestream
    InputSource input_src = new InputSource(activity.getAssets().open(xml_file));
    dom = builder.parse(input_src); //BUG
    parseDocument();
    byte_stream.close();
}
My XML is in the assets folder and the app doesn't crash; it just doesn't do anything. The result I'm supposed to get is a ListView of buttons built from the parsed XML. The parsing algorithm works, as it has been tested in a normal Java application. I just want to apply it to the Android app.
If you want more info, just comment.
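In the code above, byte_stream is built from the bytes of the file name rather than the file contents and is never passed to the parser, so it plays no part in the parse. To narrow the problem down, the parsing step can be isolated so that any exception reaches the caller instead of being lost. The following is a minimal sketch using only JDK classes, not a confirmed fix: the class AssetXmlSketch and the sample XML are assumptions for illustration, and on Android the stream would come from activity.getAssets().open(xml_file) as in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: parses an XML stream into a DOM and always closes the stream.
class AssetXmlSketch {
    static Document parse(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        // try-with-resources closes the stream even if parsing throws
        try (InputStream stream = in) {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(stream));
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up XML standing in for the contents of the asset file
        String xml = "<categories><category name=\"a\"/></categories>";
        Document dom = parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(dom.getDocumentElement().getTagName()); // prints "categories"
    }
}
```

If the caller of parseXmlFile catches exceptions without logging them, a missing asset or malformed XML would look exactly like "nothing happens", so logging the exception at the call site is a reasonable first check.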
