I am working on Java 8, JSF 2, Primefaces 5.1.
Conversation to PDF or Docx works, but when I am displaying file name, it just skips UTF-8 encoded letters, in my case, Lithuanian letters like ą,č,ę,ė,į,š,ų,ū
What I have tried so farm is :
<h:form enctype="multipart/form-data;charset=UTF-8">
Charset.forName("UTF-8").encode(myString)
or
byte[] bytes = templateTitle.getBytes(Charset.forName("UTF-8"));
String title = new String(bytes, Charset.forName("UTF-8"));
or
UTF-8 text is garbled when form is posted as multipart/form-data
checked some tuttorials about encoding, still, no use,
also checked this, but I just do not understand this example...
Primefaces fileDownload non-english file names corrupt
my code:
Download file as docx
public void downloadTemplateAsDocx() throws Exception {
try {
InputStream content = null;
String objID = this.actData.getMainActs().get(0).getId();
ContentStream cmisStream = folderCatalogue.getDocumentContentStream(objID);
content = cmisStream.getStream();
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/hw.html"));
afiPart.setBinaryData(content);
afiPart.setContentType(new ContentType("text/html"));
Relationship altChunkRel = wordMLPackage.getMainDocumentPart().addTargetPart(afiPart);
CTAltChunk ac = Context.getWmlObjectFactory().createCTAltChunk();
ac.setId(altChunkRel.getId());
wordMLPackage.getMainDocumentPart().addObject(ac);
wordMLPackage.getContentTypeManager().addDefaultContentType("html", "text/html");
File fileTmp = File.createTempFile("tempDocFile", "docx");
wordMLPackage.save(fileTmp);
streamedContent = new DefaultStreamedContent(new FileInputStream(fileTmp), cmisStream.getMimeType(),
templateTitle + ".docx", "UTF-8");
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (InvalidFormatException eInv) {
eInv.printStackTrace();
} catch (IOException ioEx) {
ioEx.printStackTrace();
} catch (Docx4JException docxEx) {
docxEx.printStackTrace();
}
}
code for .Pdf file download.
public void downloadTemplateAsPdf() {
try {
InputStream content = null;
String objID = this.actData.getMainActs().get(0).getId();
ContentStream cmisStream = folderCatalogue.getDocumentContentStream(objID);
content = cmisStream.getStream();
File fileTmp = File.createTempFile("tempFile", "pdf");
OutputStream fileStream = new FileOutputStream(fileTmp);
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, fileStream);
document.open();
XMLWorkerHelper worker = XMLWorkerHelper.getInstance();
worker.parseXHtml(writer, document, content, Charset.forName("UTF-8"));
document.close();
fileStream.close();
streamedContent = new DefaultStreamedContent(new FileInputStream(fileTmp), cmisStream.getMimeType(),
templateTitle + ".pdf");
} catch (FileNotFoundException e) {
e.printStackTrace();
System.out.println("File was not found");
} catch (IOException ex) {
ex.printStackTrace();
} catch (Exception exeption) {
exeption.printStackTrace();
}
}
EDIT:
<p:fileDownload value="#{controller.streamedContent}" />
private StreamedContent streamedContent;
Solution,
String title = URLEncoder.encode(templateTitle, "UTF-8");
StringBuilder fileName = new StringBuilder(title);
if (title.contains("+")) {
for (int i = 0; i < title.length(); i++) {
if (title.charAt(i) == '+') {
fileName.setCharAt(i, ' ');
}
}
}
This Encoding works fine, just it replaces all spaces to + that's why I loop over it.
Related
I am trying to merge pdf files but getting error while opening the file. My code is :
public void merge(){
byte[] pdf1 = tobyte("hello");
byte[] pdf2 = tobyte("world");
PDFMergerUtility merger = new PDFMergerUtility();
merger.addSource(new ByteArrayInputStream(pdf1));
merger.addSource(new ByteArrayInputStream(pdf2));
merger.setDestinationFileName("final.pdf");
merger.mergeDocuments();
}
static byte[] tobyte(String message) {
PDDocument doc = new PDDocument();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
doc.save(baos);
return baos.toByteArray();
}
Here is the code that works
//Loading an existing PDF document
File file1 = new File("sample1.pdf");
PDDocument doc1 = null;
try {
doc1 = PDDocument.load(file1);
} catch (IOException e1) {
e1.printStackTrace();
}
File file2 = new File("sample2.pdf");
PDDocument doc2 = null;
try {
doc2 = PDDocument.load(file2);
} catch (IOException e1) {
e1.printStackTrace();
}
//Instantiating PDFMergerUtility class
PDFMergerUtility PDFmerger = new PDFMergerUtility();
//Setting the destination file
PDFmerger.setDestinationFileName("merged.pdf");
//adding the source files
PDFmerger.addSource(file1);
PDFmerger.addSource(file2);
//Merging the two documents
try {
PDFmerger.mergeDocuments();
} catch (COSVisitorException | IOException e) {
e.printStackTrace();
}
System.out.println("Documents merged");
//Closing the documents
try {
doc1.close();
} catch (IOException e) {
e.printStackTrace();
}
try {
doc2.close();
} catch (IOException e) {
e.printStackTrace();
}
I want to render html code to docx. Instead of rendering html(i.e. tables in tabular format) it simply writes html code in it as plain text. I am using docx4j-ImportXHTML jar. I used the code from here and modified it to save in a file.
What am I doing wrong?
public static void xhtmlToDocx(String xhtml, String destinationPath, String fileName)
{
File dir = new File (destinationPath);
File actualFile = new File (dir, fileName);
WordprocessingMLPackage wordMLPackage = null;
try
{
wordMLPackage = WordprocessingMLPackage.createPackage();
}
catch (InvalidFormatException e)
{
e.printStackTrace();
}
XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
//XHTMLImporter.setDivHandler(new DivToSdt());
//OutputStream os = null;
OutputStream fos = null;
try
{
fos = new FileOutputStream(actualFile);
wordMLPackage.getMainDocumentPart().getContent().addAll(
XHTMLImporter.convert( xhtml, null) );
System.out.println(XmlUtils.marshaltoString(wordMLPackage
.getMainDocumentPart().getJaxbElement(), true, true));
// Back to XHTML
HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
htmlSettings.setWmlPackage(wordMLPackage);
// output to an OutputStream.
//os = new ByteArrayOutputStream();
// If you want XHTML output
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML",
true);
Docx4J.toHTML(htmlSettings, fos, Docx4J.FLAG_EXPORT_PREFER_XSL);
}
catch (Docx4JException | FileNotFoundException e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
finally{
try {
fos.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
I corrected my code as below:
Use ByteArrayStream instead of FileOutputStream i.e.
Instead of
fos = new FileOutputStream(actualFile);
wordMLPackage.getMainDocumentPart().getContent().addAll(
XHTMLImporter.convert( xhtml, null) );
Use:
fos = new ByteArrayOutputStream();
Add wordMLPackage.save(actualFile)
Full code:
public static void xhtmlToDocx1(String xhtml, String destinationPath, String fileName)
{
File dir = new File (destinationPath);
File actualFile = new File (dir, fileName);
WordprocessingMLPackage wordMLPackage = null;
try
{
wordMLPackage = WordprocessingMLPackage.createPackage();
}
catch (InvalidFormatException e)
{
e.printStackTrace();
}
XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
OutputStream fos = null;
try
{
fos = new ByteArrayOutputStream();
System.out.println(XmlUtils.marshaltoString(wordMLPackage
.getMainDocumentPart().getJaxbElement(), true, true));
HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
htmlSettings.setWmlPackage(wordMLPackage);
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML",
true);
Docx4J.toHTML(htmlSettings, fos, Docx4J.FLAG_EXPORT_PREFER_XSL);
wordMLPackage.save(actualFile);
}
catch (Docx4JException e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
finally{
try {
fos.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
String body = "<br>";
Document document = Jsoup.parseBodyFragment(body);
document.outputSettings().escapeMode(EscapeMode.xhtml);
String str = document.body().html();
System.out.println(str);
expect: <br />
result: <br>
Can Jsoup convert value HTML into XHTML?
See Document.OutputSettings.Syntax.xml:
private String toXHTML( String html ) {
final Document document = Jsoup.parse(html);
document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
return document.html();
}
You should tell that syntax you want to leave the string in HTML or XML.
public String parserXHtml(String html) {
org.jsoup.nodes.Document document = Jsoup.parseBodyFragment(html);
document.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml); //This will ensure the validity
document.outputSettings().charset("UTF-8");
return document.toString();
}
You can use JTidy API to do this. Use jtidy-r938.jar
You can use the following method to get xhtml from html
public static String getXHTMLFromHTML(String inputFile,
String outputFile) throws Exception {
File file = new File(inputFile);
FileOutputStream fos = null;
InputStream is = null;
try {
fos = new FileOutputStream(outputFile);
is = new FileInputStream(file);
Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.parse(is, fos);
} catch (FileNotFoundException e) {
e.printStackTrace();
}finally{
if(fos != null){
try {
fos.close();
} catch (IOException e) {
fos = null;
}
fos = null;
}
if(is != null){
try {
is.close();
} catch (IOException e) {
is = null;
}
is = null;
}
}
return outputFile;
}
I want to read the text from the following pdf file. I am using pdfbox version 1.8.8. I am getting the following error.
2014-12-18 15:02:59 WARN XrefTrailerResolver:203 - Did not found XRef object at specified startxref position 4268142
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSStream
at org.apache.pdfbox.pdmodel.common.COSStreamArray.<init>(COSStreamArray.java:68)
at org.apache.pdfbox.pdmodel.common.PDStream.createFromCOS(PDStream.java:185)
at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:639)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:380)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:275)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:288)
at com.algotree.pdf.test.PdfBoxTest.pdftoText(PdfBoxTest.java:53)
at com.algotree.pdf.test.PdfBoxTest.main(PdfBoxTest.java:71)
Yes,i have seen many posts about this error. Still i couldnt find the solution to read this file.
Thanks
file.pdf
This is my code:
static String pdftoText(String fileName) throws IOException {
PDFParser parser;
String parsedText = null;;
PDFTextStripper pdfStripper = new PDFTextStripper();
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File(fileName);
if (!file.isFile()) {
System.err.println("File " + fileName + " does not exist.");
return null;
}
try {
parser = new PDFParser(new FileInputStream(file));
} catch (IOException e) {
System.err.println("Unable to open PDF Parser. " + e.getMessage());
return null;
}
try {
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdfStripper.setSuppressDuplicateOverlappingText(false);
pdDoc = new PDDocument(cosDoc);
int endPage=pdDoc.getPageCount();
if(endPage>300)
endPage=300;
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(endPage);
parsedText = pdfStripper.getText(cosDoc);
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
if (cosDoc != null)
cosDoc.close();
if (pdDoc != null)
pdDoc.close();
} catch (Exception e) {
e.printStackTrace();
}
}
return parsedText;
}
This one works
static String pdftoText(String fileName) throws IOException {
String parsedText = null;;
PDFTextStripper pdfStripper = new PDFTextStripper();
PDDocument pdDoc = null;
File file = new File(fileName);
if (!file.isFile()) {
System.err.println("File " + fileName + " does not exist.");
return null;
}
try {
pdDoc=PDDocument.loadNonSeq(file, null);
} catch (IOException e) {
System.err.println("Unable to open PDF Parser. " + e.getMessage());
return null;
}
try {
pdfStripper = new PDFTextStripper();
int endPage=pdDoc.getPageCount();
if(endPage>300)
endPage=300;
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(endPage);
parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
if (pdDoc != null)
pdDoc.close();
} catch (Exception e) {
e.printStackTrace();
}
}
return parsedText;
}
I am using JTidy to convert from HTML to XHTML but I found in my XHTML file this tag .
Can i prevent it ?
this is my code
//from html to xhtml
try
{
fis = new FileInputStream(htmlFileName);
}
catch (java.io.FileNotFoundException e)
{
System.out.println("File not found: " + htmlFileName);
}
Tidy tidy = new Tidy();
tidy.setShowWarnings(false);
tidy.setXmlTags(false);
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setXHTML(true);//
tidy.setMakeClean(true);
Document xmlDoc = tidy.parseDOM(fis, null);
try
{
tidy.pprint(xmlDoc,new FileOutputStream("c.xhtml"));
}
catch(Exception e)
{
}
I had only success, when the input is treated as XML as well. So either set xmltags to true
tidy.setXmlTags(true);
and live with the errors and warnings or do the conversion twice.
First conversion to sanitize the html (html to xhtml) and a second conversion from xhtml to xhtml with set xmltags, thus no errors and warnings occur.
String htmlFileName = "test.html";
try( InputStream in = Thread.currentThread().getContextClassLoader().getResourceAsStream(htmlFileName);
FileOutputStream fos = new FileOutputStream("tmp.xhtml");) {
Tidy tidy = new Tidy();
tidy.setShowWarnings(true);
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setXHTML(true);
tidy.setMakeClean(true);
Document xmlDoc = tidy.parseDOM(in, fos);
} catch (Exception e) {
e.printStackTrace();
}
try( InputStream in = new FileInputStream("tmp.xhtml");
FileOutputStream fos = new FileOutputStream("c.xhtml");) {
Tidy tidy = new Tidy();
tidy.setShowWarnings(true);
tidy.setXmlTags(true);
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setXHTML(true);
tidy.setMakeClean(true);
Document xmlDoc = tidy.parseDOM(in, null);
tidy.pprint(xmlDoc, fos);
} catch (Exception e) {
e.printStackTrace();
}
I used the latest jtidy version 938.
i created a function that parse the the xhtml code and remove the unwelcome tags
and to add a link to the css File "tableStyle.css"
public static String xhtmlparser(){
String Cleanline="";
try {
// the file url
FileInputStream fstream = new FileInputStream("c.xhtml");
// Use DataInputStream to read binary NOT text.
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine = null;
int linescounter=0;
while ((strLine = br.readLine()) != null) {// read every line in the file
String m=strLine.replaceAll(" ", "");
linescounter++;
if(linescounter==5)
m=m+"\n"+ "<link rel="+ "\"stylesheet\" "+"type="+ "\"text/css\" "+"href= " +"\"tableStyle.css\""+ "/>";
Cleanline+=m+"\n";
}
}
catch(IOException e){}
return Cleanline;
}
but as a performance issue is it good?
by the way it works will
You can use the following method to get xhtml from html
public static String getXHTMLFromHTML(String inputFile,
String outputFile) throws Exception {
File file = new File(inputFile);
FileOutputStream fos = null;
InputStream is = null;
try {
fos = new FileOutputStream(outputFile);
is = new FileInputStream(file);
Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.parse(is, fos);
} catch (FileNotFoundException e) {
e.printStackTrace();
}finally{
if(fos != null){
try {
fos.close();
} catch (IOException e) {
fos = null;
}
fos = null;
}
if(is != null){
try {
is.close();
} catch (IOException e) {
is = null;
}
is = null;
}
}
return outputFile;
}