encoding issue after pdfbox - java

I want to extract text from a PDF in Java, so I am using the PDFBox library. The PDF appears to have been written in HWP (a Korean word processor) before it was converted to PDF.
This is my simple API.
@RestController
@RequiredArgsConstructor
public class QuestionController {
    private final QuestionParseService questionParseService;
    @GetMapping("/")
    public ResponseEntity<?> parsePDF() throws IOException {
        return ResponseEntity.ok(questionParseService.parsePDF());
    }
}
@Service
public class QuestionParseService {
    public String parsePDF() throws IOException {
        File file = new File("filePath");
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document);
        }
    }
}
This is my PDF file.
But the API result for question 1 was:


×
 

의 값은? [2점]
①  ②  ③  ④  ⑤ 
How can I get correctly encoded text?
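One way to narrow this down is to look at which code points PDFBox actually returned. The sketch below assumes that the blank-looking symbols in the output are Unicode Private Use Area characters, which text extraction can yield when a font carries no usable Unicode mapping; that is an assumption about this file, not something confirmed from it. The class `ExtractedTextInspector` and its methods are hypothetical helpers, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

public class ExtractedTextInspector {

    /** Counts code points whose Unicode general category is "private use". */
    public static long countPrivateUse(String text) {
        return text.codePoints()
                .filter(cp -> Character.getType(cp) == Character.PRIVATE_USE)
                .count();
    }

    /** Lists each private-use code point as U+XXXX, in order of appearance. */
    public static List<String> describePrivateUse(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints()
                .filter(cp -> Character.getType(cp) == Character.PRIVATE_USE)
                .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by PDFTextStripper.getText(...)
        String sample = "\uF0B4 x \uF0B4";
        System.out.println("private-use characters: " + countPrivateUse(sample));
        System.out.println(describePrivateUse(sample));
    }
}
```

If the count is non-zero for the real extraction result, the text is not mis-encoded on the Java side; the PDF's fonts simply do not say which real characters those glyphs stand for.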

Related

Creating a new header in docx4j

I have a Maven project that uses docx4j. I have managed to convert an HTML file to docx successfully. However, I am interested in inserting a header into the docx file.
In the docx4j GitHub repository there is a sample (link), which I used and which worked as expected, i.e.
Relationship relationship = createHeaderPart(wordMLPackage);
public static Relationship createHeaderPart(
        WordprocessingMLPackage wordprocessingMLPackage)
        throws Exception {
    HeaderPart headerPart = new HeaderPart();
    Relationship rel = wordprocessingMLPackage.getMainDocumentPart()
            .addTargetPart(headerPart);
    // After addTargetPart, so image can be added properly
    headerPart.setJaxbElement(getHdr(wordprocessingMLPackage, headerPart));
    return rel;
}
public static Hdr getHdr(WordprocessingMLPackage wordprocessingMLPackage,
        Part sourcePart) throws Exception {
    Hdr hdr = objectFactory.createHdr();
    // I modified it for simplicity
    P headerParagraph = wordprocessingMLPackage.getMainDocumentPart().createParagraphOfText("hi there");
    hdr.getContent().add(headerParagraph);
    return hdr;
}
This works as expected.
However, I am interested in using dynamic content from HTML, so I used:
public static Hdr getHdr(WordprocessingMLPackage wordprocessingMLPackage,
        Part sourcePart) throws Exception {
    Hdr hdr = objectFactory.createHdr();
    String html = "<html><body><p>hi there</p></body></html>";
    XHTMLImporter XHTMLImporter = new XHTMLImporterImpl(wordprocessingMLPackage);
    hdr.getContent().add(XHTMLImporter.convert(html, null));
    return hdr;
}
This doesn't work at all. Any ideas?
I just noticed that XHTMLImporter is creating a list of objects, i.e.
public static Hdr getHdr(WordprocessingMLPackage wordprocessingMLPackage,
        Part sourcePart) throws Exception {
    Hdr hdr = objectFactory.createHdr();
    String html = "<html><body><p>hi there</p></body></html>";
    XHTMLImporter XHTMLImporter = new XHTMLImporterImpl(wordprocessingMLPackage);
    List<Object> list = XHTMLImporter.convert(html, null);
    hdr.getContent().add(list.get(0));
    return hdr;
}
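The difference between the two versions comes down to `List.add` versus `List.addAll`. The sketch below uses plain `java.util` lists as stand-ins for `hdr.getContent()` and for the list returned by `convert`, so it runs without docx4j; the class name `HeaderContentDemo` and the string placeholders are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class HeaderContentDemo {

    /** Mimics hdr.getContent().add(converted): the whole list becomes one element. */
    public static List<Object> addAsSingleElement(List<Object> converted) {
        List<Object> content = new ArrayList<>();
        content.add(converted);
        return content;
    }

    /** Mimics hdr.getContent().addAll(converted): each converted object is added. */
    public static List<Object> addEachElement(List<Object> converted) {
        List<Object> content = new ArrayList<>();
        content.addAll(converted);
        return content;
    }

    public static void main(String[] args) {
        // Placeholders for the objects an HTML-to-docx conversion might return
        List<Object> converted = new ArrayList<>();
        converted.add("paragraph 1");
        converted.add("paragraph 2");
        System.out.println(addAsSingleElement(converted).size()); // 1
        System.out.println(addEachElement(converted).size());     // 2
    }
}
```

Applied to the code in the question, that would suggest `hdr.getContent().addAll(list)` rather than `hdr.getContent().add(list.get(0))`, assuming every converted object belongs in the header; `get(0)` keeps only the first one.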

Adding hyperlink to inner PDF files

I have to create a PDF file by adding two PDF files inside a generated PDF file as a tree structure using iText in Java.
I have to create bookmarks with PDF file names and add a hyperlink to the bookmark. When the bookmark is clicked, the respective PDF should be opened in that PDF file itself, not as a separate PDF.
PDFTREE
    pdf1
    pdf2
Such bookmarks are referred to as outline elements in the PDF specification (PDF 32000-1:2008, p.367):
The outline consists of a tree-structured hierarchy of outline items (sometimes called bookmarks), which serve as a visual table of contents to display the document’s structure to the user.
If you merge the documents with PdfMerger, the outlines are copied to the resulting PDF by default. However, you want one main node per document rather than a flat list of bookmarks. Since cloning and copying outlines is no trivial task, it is best to let iText handle this. Unfortunately, we have little direct control over how outlines are merged.
We can build a SpecialMerger as a wrapper around PdfMerger to extract the cloned outlines (first step) and get them into a hierarchical structure afterwards (second step). The outline of each merged PDF is temporarily stored in the outlineList together with the desired name of the main node and its reference (page number in the merged PDF). After all the PDFs are merged, we can attach the temporarily stored outlines back to the root-node.
public static class SpecialMerger {
    private final PdfDocument outputPdf;
    private final PdfMerger merger;
    private final PdfOutline rootOutline;
    private final List<DocumentOutline> outlineList = new ArrayList<>();
    private int nextPageNr = 1;
    public SpecialMerger(final PdfDocument outputPdf) {
        if (outputPdf.getNumberOfPages() != 0) {
            throw new IllegalArgumentException("PDF must be empty");
        }
        this.outputPdf = outputPdf;
        this.merger = new PdfMerger(outputPdf, true, true);
        this.rootOutline = outputPdf.getOutlines(false);
    }
    public void merge(PdfDocument from, int fromPage, int toPage, String filename) {
        merger.merge(from, fromPage, toPage); // merge with normal PdfMerger
        // extract and clone outline of merged document
        final List<PdfOutline> children = new ArrayList<>(rootOutline.getAllChildren());
        rootOutline.getAllChildren().clear(); // clear root outline
        outlineList.add(new DocumentOutline(filename, nextPageNr, children));
        nextPageNr = outputPdf.getNumberOfPages() + 1; // update next page number
    }
    public void writeOutline() {
        outlineList.forEach(o -> {
            final PdfOutline outline = rootOutline.addOutline(o.getName()); // bookmark with PDF name
            outline.addDestination(PdfExplicitDestination.createFit(outputPdf.getPage(o.getPageNr())));
            outline.setStyle(PdfOutline.FLAG_BOLD);
            o.getChildren().forEach(outline::addOutline); // add all extracted child bookmarks
        });
    }
    private static class DocumentOutline {
        private final String name;
        private final int pageNr;
        private final List<PdfOutline> children;
        public DocumentOutline(final String pdfName, final int pageNr, final List<PdfOutline> children) {
            this.name = pdfName;
            this.pageNr = pageNr;
            this.children = children;
        }
        public String getName() {
            return name;
        }
        public int getPageNr() {
            return pageNr;
        }
        public List<PdfOutline> getChildren() {
            return children;
        }
    }
}
Now, we can use this custom merger to merge the PDFs and then add the outline with writeOutline:
public static void main(String[] args) throws IOException {
    String filename1 = "pdf1.pdf";
    String filename2 = "pdf2.pdf";
    try (
        PdfDocument generatedPdf = new PdfDocument(new PdfWriter("output.pdf"));
        PdfDocument pdfDocument1 = new PdfDocument(new PdfReader(filename1));
        PdfDocument pdfDocument2 = new PdfDocument(new PdfReader(filename2))
    ) {
        final SpecialMerger merger = new SpecialMerger(generatedPdf);
        merger.merge(pdfDocument1, 1, pdfDocument1.getNumberOfPages(), filename1);
        merger.merge(pdfDocument2, 1, pdfDocument2.getNumberOfPages(), filename2);
        merger.writeOutline();
    }
}
The result looks like this (Preview and Adobe Acrobat Reader on macOS):
Another option is to make a portfolio by embedding the PDFs. However, this is not supported by all PDF viewers and most users are not accustomed to these portfolios.
public static void main(String[] args) throws IOException {
    String filename1 = "pdf1.pdf";
    String filename2 = "pdf2.pdf";
    try (PdfDocument generatedPdf = new PdfDocument(new PdfWriter("portfolio.pdf"))) {
        Document doc = new Document(generatedPdf);
        doc.add(new Paragraph("This PDF contains embedded documents."));
        doc.add(new Paragraph("Use a compatible PDF viewer if you cannot see them."));
        PdfCollection collection = new PdfCollection();
        collection.setView(PdfCollection.TILE);
        generatedPdf.getCatalog().setCollection(collection);
        addAttachment(generatedPdf, filename1, filename1);
        addAttachment(generatedPdf, filename2, filename2);
    }
}
private static void addAttachment(PdfDocument doc, String attachmentPath, String name) throws IOException {
    PdfFileSpec fileSpec = PdfFileSpec.createEmbeddedFileSpec(doc, attachmentPath, name, name, null, null);
    doc.addFileAttachment(name, fileSpec);
}
The result in Adobe Acrobat Reader on macOS:

How to parse a big rdf file in rdf4j

I want to parse a huge file in RDF4J using the following code, but I get an exception due to a parser limit:
public class ConvertOntology {
    public static void main(String[] args) throws RDFParseException, RDFHandlerException, IOException {
        String file = "swetodblp_april_2008.rdf";
        File initialFile = new File(file);
        InputStream input = new FileInputStream(initialFile);
        RDFParser parser = Rio.createParser(RDFFormat.RDFXML);
        parser.setPreserveBNodeIDs(true);
        Model model = new LinkedHashModel();
        parser.setRDFHandler(new StatementCollector(model));
        parser.parse(input, initialFile.getAbsolutePath());
        FileOutputStream out = new FileOutputStream("swetodblp_april_2008.nt");
        RDFWriter writer = Rio.createWriter(RDFFormat.TURTLE, out);
        try {
            writer.startRDF();
            for (Statement st : model) {
                writer.handleStatement(st);
            }
            writer.endRDF();
        } catch (RDFHandlerException e) {
        } finally {
            out.close();
        }
    }
}
The parser has encountered more than "100,000" entity expansions in this document; this is the limit imposed by the application.
I run my code as follows, setting the two parameters suggested on the RDF4J website:
mvn -Djdk.xml.totalEntitySizeLimit=0 -DentityExpansionLimit=0 exec:java
Any help, please?
The error is caused by the Apache Xerces XML parser being used instead of the default JDK XML parser.
So just delete the Xerces folder from your .m2 repository and the code works fine.
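To check which XML parser implementation the JVM actually picks up, a small diagnostic can print the JAXP factory class. This is a sketch using only the JDK; the class name `XmlParserCheck` is hypothetical, and the package-prefix heuristic (`com.sun.org.apache.xerces.internal` for the JDK's built-in parser, `org.apache.xerces` for a separate Xerces jar) is an assumption about typical setups. RDF4J may obtain its parser through a different lookup, so treat the output as a hint rather than proof.

```java
import javax.xml.parsers.SAXParserFactory;

public class XmlParserCheck {

    /** Returns the class name of the SAX parser factory selected by JAXP. */
    public static String saxFactoryClassName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    /** Heuristic: the JDK's built-in implementation lives under com.sun.*. */
    public static boolean looksLikeJdkBuiltIn(String className) {
        return className.startsWith("com.sun.org.apache.xerces.internal.");
    }

    public static void main(String[] args) {
        String name = saxFactoryClassName();
        System.out.println("SAXParserFactory implementation: " + name);
        System.out.println("JDK built-in parser: " + looksLikeJdkBuiltIn(name));
    }
}
```

Running this with the same classpath as the failing program (for example through `mvn exec:java`) shows whether a third-party parser is shadowing the JDK one.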

Relative path to file | Springboot

I am new to Spring Boot and Java and am trying to read the contents of a file into a String.
What's the issue:
I'm getting a "File not found" exception and am unable to read the file. Apparently, I'm not giving the correct file path.
I've attached the directory structure and my code. I'm in the FeedProcessor file and want to read feed_template.php (see image)
public static String readFileAsString( ) {
    String text = "";
    try {
        // text = new String(Files.readAllBytes(Paths.get("/src/main/template/feed_template_head.php")));
        text = new String(Files.readAllBytes(Paths.get("../../template/feed_template_head.php")));
    } catch (IOException e) {
        e.printStackTrace();
    }
    return text;
}
You need to put the template folder inside the resources folder, and then use the following code.
@Configuration
public class ReadFile {
    private static final String FILE_NAME =
            "classpath:template/feed_template_head.php";
    @Bean
    public void initSegmentPerformanceReportRequestBean(
            @Value(FILE_NAME) Resource resource,
            ObjectMapper objectMapper) throws IOException {
        // BufferedReader needs a Reader, so wrap the InputStream first
        new BufferedReader(new InputStreamReader(resource.getInputStream()))
                .lines()
                .forEach(eachLine -> System.out.println(eachLine));
    }
}
I suggest reading through the Resources chapter of the Spring documentation once.
https://docs.spring.io/spring/docs/3.0.x/spring-framework-reference/html/resources.html
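Outside of Spring's `Resource` abstraction, the same idea can be sketched with the plain JDK class loader. This assumes the file ends up on the classpath at `template/feed_template_head.php` (for example under `src/main/resources/template/` in a standard Maven layout); the class name `ClasspathTemplateReader` is hypothetical, and `readAllBytes` requires Java 9 or later.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ClasspathTemplateReader {

    /** Reads a classpath resource fully into a String, decoding it as UTF-8. */
    public static String readResourceAsString(String resourcePath) throws IOException {
        ClassLoader loader = ClasspathTemplateReader.class.getClassLoader();
        // getResourceAsStream returns null when the resource is not on the classpath
        try (InputStream in = loader.getResourceAsStream(resourcePath)) {
            if (in == null) {
                throw new FileNotFoundException("Not on classpath: " + resourcePath);
            }
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        try {
            System.out.println(readResourceAsString("template/feed_template_head.php"));
        } catch (IOException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

Unlike `Paths.get("../../template/...")`, which is resolved against the process working directory, a classpath lookup does not depend on where the application is started from, and it keeps working when the application is packaged as a jar.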

PDF parser text contains

I want to verify a PDF document using TestNG and PDFBox.
Is it possible to check whether the PDF contains a given text, something like this:
PDFParser parser = new PDFParser(stream);
parser.getDocument().contains("ABC")
Try the code below:
public void ReadPDF() throws Exception {
    URL TestURL = new URL("http://www.axmag.com/download/pdfurl-guide.pdf");
    BufferedInputStream TestFile = new BufferedInputStream(TestURL.openStream());
    PDFParser TestPDF = new PDFParser(TestFile);
    TestPDF.parse();
    String TestText = new PDFTextStripper().getText(TestPDF.getPDDocument());
    Assert.assertTrue(TestText.contains("Open the setting.xml, you can see it is like this"));
}
Download the libraries: https://pdfbox.apache.org/index.html
