How to convert .doc/.docx to text using Apache Tika? - Java

I want to convert .doc/.docx files to plain text. Here is my code:
public DokumenExtractor(String filename) {
    context = new ParseContext();
    detector = new DefaultDetector();
    parser = new AutoDetectParser(detector);
    context.set(Parser.class, parser);
    outputstream = new ByteArrayOutputStream();
    metadata = new Metadata();
    try {
        process(filename);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
public void process(String filename) throws Exception {
    URL url;
    File file = new File(filename);
    if (file.isFile()) {
        url = file.toURI().toURL();
        this.PathFile = file.getPath();
    } else {
        url = new URL(filename);
    }
    this.input = TikaInputStream.get(url, metadata);
    ContentHandler handler = new BodyContentHandler(outputstream);
    parser.parse(input, handler, metadata, context);
    input.close();
}
But the output contains text like this:
PAGE * MERGEFORMAT 36
instead of the clean content of the document. How can I remove this page field text after extracting the string from the document?
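One post-processing workaround (a minimal sketch, not from the original question, assuming the unwanted text is always a Word field instruction of the form PAGE * MERGEFORMAT followed by a number) is to filter such lines out of the extracted string:

// Hypothetical helper: strips Word field residue such as "PAGE * MERGEFORMAT 36"
// from the text produced by BodyContentHandler.
public static String stripFieldCodes(String extractedText) {
    StringBuilder cleaned = new StringBuilder();
    for (String line : extractedText.split("\\r?\\n")) {
        // Assumption: the residue matches "PAGE [\]* MERGEFORMAT <number>".
        if (line.trim().matches("PAGE\\s*\\\\?\\*\\s*MERGEFORMAT\\s*\\d*")) {
            continue; // skip field instruction lines
        }
        cleaned.append(line).append('\n');
    }
    return cleaned.toString();
}

You could then run the result of outputstream.toString() through this helper after process() returns.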

Related

not able to read content pdf which is open in url

I want to read the content of a PDF which is opened from this URL: https://dms.careerbuilder.com/viewer?Token=4aeea5b52d6e48a7beca13a992540a66&key=7b6184962856e016a5cdfcb3e27c7c30b34b5caaa6607d7d4e408f4b2ebf9dfd
try {
    String pdfContent = readPdfContent(perfecturl);
    Assert.assertTrue(pdfContent.contains("Test Kumar"));
    Assert.assertTrue(pdfContent.contains("XXXXX"));
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
public static String readPdfContent(String url) throws IOException {
    URL pdfUrl = new URL(url);
    InputStream in = pdfUrl.openStream();
    BufferedInputStream bf = new BufferedInputStream(in);
    PDDocument doc = PDDocument.load(bf);
    int numberOfPages = getPageCount(doc);
    System.out.println("The total number of pages " + numberOfPages);
    String content = new PDFTextStripper().getText(doc);
    doc.close();
    return content;
}

public static int getPageCount(PDDocument doc) {
    // get the total number of pages in the pdf document
    int pageCount = doc.getNumberOfPages();
    return pageCount;
}
It throws this exception:
Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1093)
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2580)
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2551)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1228)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1128)
PDFBox is not able to read the PDF, even though the URL points to a valid PDF. Can anyone help me resolve this?
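The "Error: End-of-File, expected line" thrown from parsePDFHeader means PDFBox never found a %PDF header in the stream, which usually means the URL is returning something other than raw PDF bytes (for example an HTML viewer page or a login redirect). A small diagnostic sketch, using only plain java.net APIs and a placeholder URL, to check what the server actually sends before passing the stream to PDDocument.load:

import java.io.BufferedInputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class PdfUrlCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; substitute the viewer link from the question.
        URL pdfUrl = new URL("https://example.com/some.pdf");
        URLConnection conn = pdfUrl.openConnection();
        System.out.println("Content-Type: " + conn.getContentType());

        try (InputStream in = new BufferedInputStream(conn.getInputStream())) {
            byte[] head = new byte[5];
            int read = in.read(head);
            // A real PDF stream starts with the bytes "%PDF-".
            System.out.println("First bytes: " + new String(head, 0, Math.max(read, 0)));
        }
    }
}

If the content type comes back as text/html, or the first bytes are not %PDF-, the viewer URL has to be turned into a direct download link (or fetched with whatever session or authentication the viewer requires) before PDFBox can parse it.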

Tables and other HTML tags are saved as plain text in docx instead of being rendered, using docx4j-ImportXHTML

I want to render HTML code to docx. Instead of rendering the HTML (i.e. tables in tabular format), it simply writes the HTML code into the document as plain text. I am using the docx4j-ImportXHTML jar. I used the code from here and modified it to save to a file.
What am I doing wrong?
public static void xhtmlToDocx(String xhtml, String destinationPath, String fileName) {
    File dir = new File(destinationPath);
    File actualFile = new File(dir, fileName);
    WordprocessingMLPackage wordMLPackage = null;
    try {
        wordMLPackage = WordprocessingMLPackage.createPackage();
    } catch (InvalidFormatException e) {
        e.printStackTrace();
    }
    XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
    //XHTMLImporter.setDivHandler(new DivToSdt());
    //OutputStream os = null;
    OutputStream fos = null;
    try {
        fos = new FileOutputStream(actualFile);
        wordMLPackage.getMainDocumentPart().getContent().addAll(
                XHTMLImporter.convert(xhtml, null));
        System.out.println(XmlUtils.marshaltoString(wordMLPackage
                .getMainDocumentPart().getJaxbElement(), true, true));
        // Back to XHTML
        HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
        htmlSettings.setWmlPackage(wordMLPackage);
        // output to an OutputStream.
        //os = new ByteArrayOutputStream();
        // If you want XHTML output
        Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
        Docx4J.toHTML(htmlSettings, fos, Docx4J.FLAG_EXPORT_PREFER_XSL);
    } catch (Docx4JException | FileNotFoundException e) {
        e.printStackTrace();
    } finally {
        try {
            fos.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I corrected my code as below:
Use a ByteArrayOutputStream instead of a FileOutputStream, i.e.
Instead of
fos = new FileOutputStream(actualFile);
wordMLPackage.getMainDocumentPart().getContent().addAll(
        XHTMLImporter.convert(xhtml, null));
Use:
fos = new ByteArrayOutputStream();
And add wordMLPackage.save(actualFile) to write the docx file.
Full code:
public static void xhtmlToDocx1(String xhtml, String destinationPath, String fileName) {
    File dir = new File(destinationPath);
    File actualFile = new File(dir, fileName);
    WordprocessingMLPackage wordMLPackage = null;
    try {
        wordMLPackage = WordprocessingMLPackage.createPackage();
    } catch (InvalidFormatException e) {
        e.printStackTrace();
    }
    XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
    OutputStream fos = null;
    try {
        fos = new ByteArrayOutputStream();
        System.out.println(XmlUtils.marshaltoString(wordMLPackage
                .getMainDocumentPart().getJaxbElement(), true, true));
        HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
        htmlSettings.setWmlPackage(wordMLPackage);
        Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
        Docx4J.toHTML(htmlSettings, fos, Docx4J.FLAG_EXPORT_PREFER_XSL);
        wordMLPackage.save(actualFile);
    } catch (Docx4JException e) {
        e.printStackTrace();
    } finally {
        try {
            fos.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
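For reference, a hypothetical invocation (the XHTML fragment and output location are made up for illustration) could look like this:

// Illustrative call only; the markup and target path are placeholders.
String xhtml =
        "<html><body>"
      + "<table border=\"1\">"
      + "<tr><th>Name</th><th>Role</th></tr>"
      + "<tr><td>Alice</td><td>Editor</td></tr>"
      + "</table>"
      + "</body></html>";

xhtmlToDocx1(xhtml, "C:/temp", "tables.docx");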

JSF Primefaces p:fileDownload file name contains UTF-8 characters

I am working on Java 8, JSF 2, Primefaces 5.1.
Conversion to PDF or docx works, but when displaying the file name it just skips the UTF-8 encoded letters, in my case Lithuanian letters like ą, č, ę, ė, į, š, ų, ū.
What I have tried so far is:
<h:form enctype="multipart/form-data;charset=UTF-8">
Charset.forName("UTF-8").encode(myString)
or
byte[] bytes = templateTitle.getBytes(Charset.forName("UTF-8"));
String title = new String(bytes, Charset.forName("UTF-8"));
or
UTF-8 text is garbled when form is posted as multipart/form-data
I checked some tutorials about encoding, still no use.
I also checked this, but I just do not understand the example:
Primefaces fileDownload non-english file names corrupt
My code to download the file as docx:
public void downloadTemplateAsDocx() throws Exception {
    try {
        InputStream content = null;
        String objID = this.actData.getMainActs().get(0).getId();
        ContentStream cmisStream = folderCatalogue.getDocumentContentStream(objID);
        content = cmisStream.getStream();

        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
        AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/hw.html"));
        afiPart.setBinaryData(content);
        afiPart.setContentType(new ContentType("text/html"));
        Relationship altChunkRel = wordMLPackage.getMainDocumentPart().addTargetPart(afiPart);
        CTAltChunk ac = Context.getWmlObjectFactory().createCTAltChunk();
        ac.setId(altChunkRel.getId());
        wordMLPackage.getMainDocumentPart().addObject(ac);
        wordMLPackage.getContentTypeManager().addDefaultContentType("html", "text/html");

        File fileTmp = File.createTempFile("tempDocFile", "docx");
        wordMLPackage.save(fileTmp);

        streamedContent = new DefaultStreamedContent(new FileInputStream(fileTmp), cmisStream.getMimeType(),
                templateTitle + ".docx", "UTF-8");
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (InvalidFormatException eInv) {
        eInv.printStackTrace();
    } catch (IOException ioEx) {
        ioEx.printStackTrace();
    } catch (Docx4JException docxEx) {
        docxEx.printStackTrace();
    }
}
The code for the .pdf file download:
public void downloadTemplateAsPdf() {
    try {
        InputStream content = null;
        String objID = this.actData.getMainActs().get(0).getId();
        ContentStream cmisStream = folderCatalogue.getDocumentContentStream(objID);
        content = cmisStream.getStream();

        File fileTmp = File.createTempFile("tempFile", "pdf");
        OutputStream fileStream = new FileOutputStream(fileTmp);
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, fileStream);
        document.open();
        XMLWorkerHelper worker = XMLWorkerHelper.getInstance();
        worker.parseXHtml(writer, document, content, Charset.forName("UTF-8"));
        document.close();
        fileStream.close();

        streamedContent = new DefaultStreamedContent(new FileInputStream(fileTmp), cmisStream.getMimeType(),
                templateTitle + ".pdf");
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        System.out.println("File was not found");
    } catch (IOException ex) {
        ex.printStackTrace();
    } catch (Exception exeption) {
        exeption.printStackTrace();
    }
}
EDIT:
<p:fileDownload value="#{controller.streamedContent}" />
private StreamedContent streamedContent;
Solution:
String title = URLEncoder.encode(templateTitle, "UTF-8");
StringBuilder fileName = new StringBuilder(title);
if (title.contains("+")) {
    for (int i = 0; i < title.length(); i++) {
        if (title.charAt(i) == '+') {
            fileName.setCharAt(i, ' ');
        }
    }
}
This encoding works fine; it just replaces all spaces with +, which is why I loop over the string and put the spaces back.
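The loop can also be collapsed into a single call; a minimal equivalent sketch (same behaviour, including the caveat that a literal + in the original title would also become a space):

// URLEncoder encodes spaces as '+', so convert them back in one step.
String title = URLEncoder.encode(templateTitle, "UTF-8").replace('+', ' ');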

Resolving the issue: Illegal characters in path

I am developing an application that takes an XML file and an attachment to be sent to the following path. This path is for a fax appliance.
I keep getting this error message:
Problem processing drop file "\co1-aux01prd01.tampa.healthe\Fax_Drop\Outbox\FaxDropSample1.xml": Illegal characters in path.
The XML file and the attachment are both being created but not processed.
public class TestSender {

    public static void main(String[] args) {
        String outBox = "\\\\faxaux\\Fax_Drop\\Outbox";
        String filename = "FaxDrop" + ".xml";
        String filepath = outBox + "\\" + filename;
        Writer writer = null;
        try {
            BufferedImage image;
            URL url = new URL("http://colsolgrp.com/phone/jpg/fax8.jpg");
            image = ImageIO.read(url);
            //File newImage = new File("\\\\faxaux\\Fax_Drop\\Outbox\\AttachmentFolder\\attachment.jpg");
            File newImage = new File("\\\\faxaux\\Fax_Drop\\Outbox\\FaxDrop\\FaxDropImage.jpg");
            newImage.mkdirs();
            newImage.createNewFile();
            ImageIO.write(image, "jpg", newImage);
            System.out.println("File has been written");
        } catch (Exception e) {
            System.out.println("Could not create file");
        }
        try {
            File f = new File(filepath);
            f.createNewFile();
            FileOutputStream fileOutputStream = new FileOutputStream(f);
            OutputStreamWriter outputStreamWriter = new OutputStreamWriter(fileOutputStream);
            writer = new BufferedWriter(outputStreamWriter);
            // Create XML file here
        } catch (Throwable ex) {
            ex.printStackTrace();
        } finally {
            try {
                writer.close();
            } catch (Exception ex) {
                // Do nothing.
            }
        }
        System.out.println("Success");
    }
}
Try this:
URI u = new URI(URLEncoder.encode("\\co1-aux01prd01.tampa.healthe\\Fax_Drop\\Outbox\\FaxDropSample1.xml", "UTF-8"));
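As an alternative sketch (not from the original answer), building the same UNC path with java.nio.file.Paths makes the offending character easier to find, because Paths.get throws an InvalidPathException that names the character and its position:

import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathCheck {
    public static void main(String[] args) {
        try {
            // UNC share and file name taken from the question; adjust as needed.
            Path outBox = Paths.get("\\\\faxaux\\Fax_Drop\\Outbox");
            Path dropFile = outBox.resolve("FaxDrop.xml");
            System.out.println("Resolved path: " + dropFile);
        } catch (InvalidPathException e) {
            // Reports the illegal character and its index in the input string.
            System.out.println("Illegal character at index " + e.getIndex() + ": " + e.getMessage());
        }
    }
}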

JTidy Java API to convert HTML to XHTML

I am using JTidy to convert from HTML to XHTML, but my XHTML output contains an unwanted tag. Can I prevent it? This is my code:
// from html to xhtml
try {
    fis = new FileInputStream(htmlFileName);
} catch (java.io.FileNotFoundException e) {
    System.out.println("File not found: " + htmlFileName);
}
Tidy tidy = new Tidy();
tidy.setShowWarnings(false);
tidy.setXmlTags(false);
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setXHTML(true);
tidy.setMakeClean(true);
Document xmlDoc = tidy.parseDOM(fis, null);
try {
    tidy.pprint(xmlDoc, new FileOutputStream("c.xhtml"));
} catch (Exception e) {
}
I only had success when the input is treated as XML as well. So either set xmlTags to true
tidy.setXmlTags(true);
and live with the errors and warnings, or do the conversion twice: a first conversion to sanitize the HTML (HTML to XHTML), and a second conversion from XHTML to XHTML with xmlTags set, so that no errors and warnings occur.
String htmlFileName = "test.html";

try (InputStream in = Thread.currentThread().getContextClassLoader().getResourceAsStream(htmlFileName);
     FileOutputStream fos = new FileOutputStream("tmp.xhtml")) {
    Tidy tidy = new Tidy();
    tidy.setShowWarnings(true);
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setXHTML(true);
    tidy.setMakeClean(true);
    Document xmlDoc = tidy.parseDOM(in, fos);
} catch (Exception e) {
    e.printStackTrace();
}

try (InputStream in = new FileInputStream("tmp.xhtml");
     FileOutputStream fos = new FileOutputStream("c.xhtml")) {
    Tidy tidy = new Tidy();
    tidy.setShowWarnings(true);
    tidy.setXmlTags(true);
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setXHTML(true);
    tidy.setMakeClean(true);
    Document xmlDoc = tidy.parseDOM(in, null);
    tidy.pprint(xmlDoc, fos);
} catch (Exception e) {
    e.printStackTrace();
}
I used the latest JTidy version, 938.
I created a function that parses the XHTML code, removes the unwelcome tags, and adds a link to the CSS file "tableStyle.css".
public static String xhtmlparser() {
    String Cleanline = "";
    try {
        // open the generated XHTML file
        FileInputStream fstream = new FileInputStream("c.xhtml");
        BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
        String strLine = null;
        int linescounter = 0;
        // read every line in the file
        while ((strLine = br.readLine()) != null) {
            String m = strLine.replaceAll(" ", "");
            linescounter++;
            if (linescounter == 5)
                m = m + "\n" + "<link rel=" + "\"stylesheet\" " + "type=" + "\"text/css\" " + "href= " + "\"tableStyle.css\"" + "/>";
            Cleanline += m + "\n";
        }
    } catch (IOException e) {
    }
    return Cleanline;
}
But is this good from a performance point of view? By the way, it works well.
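On the performance question: Cleanline += m + "\n" inside the loop copies the whole accumulated string on every iteration, so for large files a StringBuilder is the usual fix. A minimal sketch keeping the original logic (including the replaceAll call and the hard-coded fifth line), assuming the usual java.io imports plus java.nio.charset.StandardCharsets; xhtmlparserFast is just an illustrative name:

// Same behaviour as xhtmlparser(), but accumulates into a StringBuilder
// instead of repeated String concatenation.
public static String xhtmlparserFast() throws IOException {
    StringBuilder cleanLines = new StringBuilder();
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream("c.xhtml"), StandardCharsets.UTF_8))) {
        String strLine;
        int linesCounter = 0;
        while ((strLine = br.readLine()) != null) {
            String m = strLine.replaceAll(" ", "");
            linesCounter++;
            if (linesCounter == 5) {
                // insert the stylesheet link after the fifth line, as in the original
                m = m + "\n<link rel=\"stylesheet\" type=\"text/css\" href=\"tableStyle.css\"/>";
            }
            cleanLines.append(m).append('\n');
        }
    }
    return cleanLines.toString();
}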
You can use the following method to get XHTML from HTML:
public static String getXHTMLFromHTML(String inputFile, String outputFile) throws Exception {
    File file = new File(inputFile);
    FileOutputStream fos = null;
    InputStream is = null;
    try {
        fos = new FileOutputStream(outputFile);
        is = new FileInputStream(file);
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        tidy.parse(is, fos);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } finally {
        if (fos != null) {
            try {
                fos.close();
            } catch (IOException e) {
                fos = null;
            }
            fos = null;
        }
        if (is != null) {
            try {
                is.close();
            } catch (IOException e) {
                is = null;
            }
            is = null;
        }
    }
    return outputFile;
}
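A hypothetical call (the file names are placeholders) would be:

// Illustrative invocation; input.html and output.xhtml are placeholder file names.
String result = getXHTMLFromHTML("input.html", "output.xhtml");
System.out.println("XHTML written to: " + result);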
