Not able to read the content of a PDF which is open in a URL - Java

I want to read the content of the PDF that is open at this URL: https://dms.careerbuilder.com/viewer?Token=4aeea5b52d6e48a7beca13a992540a66&key=7b6184962856e016a5cdfcb3e27c7c30b34b5caaa6607d7d4e408f4b2ebf9dfd
try {
    String pdfContent = readPdfContent(perfecturl);
    Assert.assertTrue(pdfContent.contains("Test Kumar"));
    Assert.assertTrue(pdfContent.contains("XXXXX"));
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}

public static String readPdfContent(String url) throws IOException {
    URL pdfUrl = new URL(url);
    InputStream in = pdfUrl.openStream();
    BufferedInputStream bf = new BufferedInputStream(in);
    PDDocument doc = PDDocument.load(bf);
    int numberOfPages = getPageCount(doc);
    System.out.println("The total number of pages " + numberOfPages);
    String content = new PDFTextStripper().getText(doc);
    doc.close();
    return content;
}

public static int getPageCount(PDDocument doc) {
    // get the total number of pages in the pdf document
    int pageCount = doc.getNumberOfPages();
    return pageCount;
}
It throws this exception:
Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1093)
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2580)
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2551)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1228)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1128)
PDFBox is not able to read the PDF, even though the URL points to a valid PDF. Can anyone help me resolve this?
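The "Error: End-of-File, expected line" thrown from COSParser.parseHeader typically means the bytes handed to PDDocument.load did not start with a %PDF- header, which usually happens when the server returns something other than the raw PDF, for example an HTML viewer page, a redirect, or an expired-token error page. A small diagnostic sketch (an assumption of mine, not part of the question; plain HttpURLConnection, with an illustrative method name) to check what the URL actually serves before involving PDFBox:

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public static void inspectUrl(String url) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestProperty("Accept", "application/pdf");
    System.out.println("HTTP status: " + conn.getResponseCode());
    System.out.println("Content-Type: " + conn.getContentType());
    try (InputStream in = conn.getInputStream()) {
        byte[] head = new byte[8];
        int read = in.read(head);
        // A real PDF always starts with the bytes "%PDF-"; anything else explains the parser error.
        System.out.println("First bytes: " + new String(head, 0, Math.max(read, 0), "ISO-8859-1"));
    }
}

If the first bytes are not %PDF-, the fix is on the HTTP side (follow the redirect, send the right cookies or token, or call the direct download endpoint), not in PDFBox.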

Related

Java: Manipulate docx document saved in Clob/Blob as text

I have a Word document saved in an Oracle Clob or MySQL Blob. I wrote the following code to read it from the DB, save it into a .docx file, and then manipulate the text inside the docx document. My question: is there any way to manipulate the text inside the docx document without writing the data out to a docx file?
Thanks :)
private static String url = "jdbc:mysql://localhost/test";
private static String username = "root";
private static String password = "root";

public static void main(String[] args) throws ClassNotFoundException, SQLException, IOException {
    Connection conn = null;
    Class.forName("com.mysql.jdbc.Driver");
    conn = DriverManager.getConnection(url, username, password);
    String sql = "SELECT name, description, data FROM documents ";
    PreparedStatement stmt = conn.prepareStatement(sql);
    ResultSet resultSet = stmt.executeQuery();
    while (resultSet.next()) {
        String name = resultSet.getString(1);
        System.out.println("Name = " + name);
        String description = resultSet.getString(2);
        System.out.println("Description = " + description);
        //
        // Get the binary stream of our BLOB data
        //
        Blob blob = resultSet.getBlob(3);
        // System.out.println(convertLOB(blob)); //convertLOB(blob).toString());
        OutputStream fwriter = new FileOutputStream("C:\\The Appfuce Primer.docx");
        readFromBlob(blob, fwriter);
        fwriter.close(); // close the file before parsing it, so all bytes are flushed to disk
        String target = "C:\\The Appfuce Primer.docx";
        File document = new File(target);
        Parser parser = new AutoDetectParser();
        ContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        try {
            parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (TikaException e) {
            e.printStackTrace();
        }
        System.out.println(metadata);
        System.out.println(handler.toString());
    }
}

final static int bBufLen = 4 * 8192;

public static long readFromBlob(Blob blob, OutputStream out) throws SQLException, IOException {
    InputStream in = blob.getBinaryStream();
    int length = -1;
    long read = 0;
    byte[] buf = new byte[bBufLen];
    while ((length = in.read(buf)) != -1) {
        out.write(buf, 0, length);
        read += length;
    }
    in.close();
    return read;
}
You can use the Apache POI project to get access to the content of your .docx document.
https://poi.apache.org/document/quick-guide-xwpf.html
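For example, a minimal sketch of pulling the paragraph text out with POI's XWPF classes (the method name and the idea of feeding it blob.getBinaryStream() are illustrative, and it assumes a reasonably recent POI version):

import java.io.InputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public static String readDocxText(InputStream in) throws Exception {
    // XWPFDocument reads the .docx directly from the stream, so no temporary file is needed.
    try (XWPFDocument docx = new XWPFDocument(in)) {
        StringBuilder text = new StringBuilder();
        for (XWPFParagraph paragraph : docx.getParagraphs()) {
            text.append(paragraph.getText()).append('\n');
        }
        return text.toString();
    }
}

Called as readDocxText(blob.getBinaryStream()), this gives you the document text without ever writing a .docx file.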
Maybe you can call parser.parse directly using blob.getBinaryStream():
...
parser.parse(blob.getBinaryStream(), handler, metadata, new ParseContext());
...
So you don't have to create a temporary file containing the docx document.
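Put together, the body of the result-set loop could look roughly like this (same Tika objects as in the question, only the stream handling changes):

Blob blob = resultSet.getBlob(3);
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
try (InputStream docxStream = blob.getBinaryStream()) {
    // Parse straight from the database stream; nothing is written to disk.
    parser.parse(docxStream, handler, metadata, new ParseContext());
}
System.out.println(metadata);
System.out.println(handler.toString());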

JSF Primefaces p:fileDownload file name contains UTF-8 characters

I am working on Java 8, JSF 2, Primefaces 5.1.
Conversion to PDF or Docx works, but when I display the file name, it just skips the UTF-8 encoded letters, in my case Lithuanian letters like ą,č,ę,ė,į,š,ų,ū.
What I have tried so far is:
<h:form enctype="multipart/form-data;charset=UTF-8">
Charset.forName("UTF-8").encode(myString)
or
byte[] bytes = templateTitle.getBytes(Charset.forName("UTF-8"));
String title = new String(bytes, Charset.forName("UTF-8"));
or
UTF-8 text is garbled when form is posted as multipart/form-data
I checked some tutorials about encoding, still no use.
I also checked this, but I just do not understand the example...
Primefaces fileDownload non-english file names corrupt
My code:
Download the file as .docx:
public void downloadTemplateAsDocx() throws Exception {
    try {
        InputStream content = null;
        String objID = this.actData.getMainActs().get(0).getId();
        ContentStream cmisStream = folderCatalogue.getDocumentContentStream(objID);
        content = cmisStream.getStream();

        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
        AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/hw.html"));
        afiPart.setBinaryData(content);
        afiPart.setContentType(new ContentType("text/html"));
        Relationship altChunkRel = wordMLPackage.getMainDocumentPart().addTargetPart(afiPart);
        CTAltChunk ac = Context.getWmlObjectFactory().createCTAltChunk();
        ac.setId(altChunkRel.getId());
        wordMLPackage.getMainDocumentPart().addObject(ac);
        wordMLPackage.getContentTypeManager().addDefaultContentType("html", "text/html");

        File fileTmp = File.createTempFile("tempDocFile", "docx");
        wordMLPackage.save(fileTmp);

        streamedContent = new DefaultStreamedContent(new FileInputStream(fileTmp), cmisStream.getMimeType(),
                templateTitle + ".docx", "UTF-8");
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (InvalidFormatException eInv) {
        eInv.printStackTrace();
    } catch (IOException ioEx) {
        ioEx.printStackTrace();
    } catch (Docx4JException docxEx) {
        docxEx.printStackTrace();
    }
}
Code for the .pdf file download:
public void downloadTemplateAsPdf() {
    try {
        InputStream content = null;
        String objID = this.actData.getMainActs().get(0).getId();
        ContentStream cmisStream = folderCatalogue.getDocumentContentStream(objID);
        content = cmisStream.getStream();

        File fileTmp = File.createTempFile("tempFile", "pdf");
        OutputStream fileStream = new FileOutputStream(fileTmp);
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, fileStream);
        document.open();
        XMLWorkerHelper worker = XMLWorkerHelper.getInstance();
        worker.parseXHtml(writer, document, content, Charset.forName("UTF-8"));
        document.close();
        fileStream.close();

        streamedContent = new DefaultStreamedContent(new FileInputStream(fileTmp), cmisStream.getMimeType(),
                templateTitle + ".pdf");
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        System.out.println("File was not found");
    } catch (IOException ex) {
        ex.printStackTrace();
    } catch (Exception exeption) {
        exeption.printStackTrace();
    }
}
EDIT:
<p:fileDownload value="#{controller.streamedContent}" />
private StreamedContent streamedContent;
Solution:
String title = URLEncoder.encode(templateTitle, "UTF-8");
StringBuilder fileName = new StringBuilder(title);
if (title.contains("+")) {
    for (int i = 0; i < title.length(); i++) {
        if (title.charAt(i) == '+') {
            fileName.setCharAt(i, ' ');
        }
    }
}
This encoding works fine; the only catch is that it turns every space into a +, which is why I loop over the string and change them back.
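The loop can be avoided entirely, since String.replace does the same substitution in one call (just a simplification of the workaround above; the UnsupportedEncodingException from URLEncoder.encode is already covered by the existing IOException catch):

// URL-encode the title, then turn the '+' that URLEncoder uses for spaces back into real spaces.
String fileName = URLEncoder.encode(templateTitle, "UTF-8").replace('+', ' ');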

Is it possible to convert HTML into XHTML with Jsoup 1.8.1?

String body = "<br>";
Document document = Jsoup.parseBodyFragment(body);
document.outputSettings().escapeMode(EscapeMode.xhtml);
String str = document.body().html();
System.out.println(str);
expect: <br />
result: <br>
Can Jsoup convert an HTML value into XHTML?
See Document.OutputSettings.Syntax.xml:
private String toXHTML(String html) {
    final Document document = Jsoup.parse(html);
    document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
    return document.html();
}
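Applied to the snippet from the question, that looks like this; with the XML syntax selected, the void element should serialize with a closing slash (expected output shown as a comment, not verified against 1.8.1 specifically):

String body = "<br>";
Document document = Jsoup.parseBodyFragment(body);
document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
System.out.println(document.body().html()); // expected: <br />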
You should tell it which syntax you want the string serialized in, HTML or XML:
public String parserXHtml(String html) {
    org.jsoup.nodes.Document document = Jsoup.parseBodyFragment(html);
    document.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml); // this will ensure the validity
    document.outputSettings().charset("UTF-8");
    return document.toString();
}
You can use the JTidy API to do this (use jtidy-r938.jar). You can use the following method to get XHTML from HTML:
public static String getXHTMLFromHTML(String inputFile, String outputFile) throws Exception {
    File file = new File(inputFile);
    FileOutputStream fos = null;
    InputStream is = null;
    try {
        fos = new FileOutputStream(outputFile);
        is = new FileInputStream(file);
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        tidy.parse(is, fos);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } finally {
        if (fos != null) {
            try {
                fos.close();
            } catch (IOException e) {
                fos = null;
            }
            fos = null;
        }
        if (is != null) {
            try {
                is.close();
            } catch (IOException e) {
                is = null;
            }
            is = null;
        }
    }
    return outputFile;
}
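Usage is a single call with the two file paths (the paths here are placeholders, and the method throws Exception, so handle or declare it):

String converted = getXHTMLFromHTML("C:\\input.html", "C:\\output.xhtml");
System.out.println("XHTML written to " + converted);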

How to convert .doc/.docx to text using Apache Tika?

I want to convert .doc and .docx files to text. Here is my code:
public DokumenExtractor(String filename) {
    context = new ParseContext();
    detector = new DefaultDetector();
    parser = new AutoDetectParser(detector);
    context.set(Parser.class, parser);
    outputstream = new ByteArrayOutputStream();
    metadata = new Metadata();
    try {
        process(filename);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public void process(String filename) throws Exception {
    URL url;
    File file = new File(filename);
    if (file.isFile()) {
        url = file.toURI().toURL();
        this.PathFile = (file.getPath()).toString();
    } else {
        url = new URL(filename);
    }
    this.input = TikaInputStream.get(url, metadata);
    ContentHandler handler = new BodyContentHandler(outputstream);
    parser.parse(input, handler, metadata, context);
    input.close();
}
But the output contains text like this:
PAGE * MERGEFORMAT 36
instead of the clean content of the document. How do I remove this PAGE formatting after getting the string from the document?
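That "PAGE * MERGEFORMAT" text is a Word field instruction (the page-number field from a header or footer), so Tika is extracting what is actually stored in the file rather than garbage. One pragmatic workaround is to post-process the extracted string and drop lines that contain only field instructions; the regular expression below is an assumption based on the sample output, not a Tika feature:

// Strip lines that consist only of a PAGE ... MERGEFORMAT field instruction.
String extracted = outputstream.toString("UTF-8");
String cleaned = extracted.replaceAll("(?m)^\\s*PAGE\\s*\\\\?\\*\\s*MERGEFORMAT.*$", "").trim();
System.out.println(cleaned);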

Can duplicating a pdf with PDFBox be small like with iText?

I am reading in a PDF and outputting a PDF with multiple copies of the original PDF in it. I test by doing the same thing for both PDFBox and iText. iText creates a much smaller output if I duplicate each page individually.
The question: is there another way to do this in PDFBox that results in smaller output PDFs?
For one example input file, generating two copies to the output with both tools:
Original PDF size: 30K
PDFBox (v 1.7.1) generated PDF: 84K
iText (v 5.3.4) generated PDF: 35K
Java code for PDFBox (sorry to inflict error handling on you). Notice how it reads the input over and over and duplicates it as a whole:
PDFMergerUtility merger = new PDFMergerUtility();
PDDocument workplace = null;
try {
    for (int cnt = 0; cnt < COPIES; ++cnt) {
        PDDocument document = null;
        InputStream stream = null;
        try {
            stream = new FileInputStream(new File(sourceFileName));
            document = PDDocument.load(stream);
            if (workplace == null) {
                workplace = document;
            } else {
                merger.appendDocument(workplace, document);
            }
        } finally {
            if (document != null && document != workplace) {
                document.close();
            }
            if (stream != null) {
                stream.close();
            }
        }
    }
    OutputStream out = null;
    try {
        out = new FileOutputStream(new File(destinationFileName));
        workplace.save(out);
    } finally {
        if (out != null) {
            out.close();
        }
    }
} catch (COSVisitorException e1) {
    e1.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (workplace != null) {
        try {
            workplace.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Code to do it with iText. Notice how it loads the input file page by page and transfers each page to the output:
Document document = null;
PdfReader reader = null;
InputStream inputStream = null;
FileOutputStream outputStream = null;
try {
    inputStream = new FileInputStream(new File(sourceFileName));
    outputStream = new FileOutputStream(new File(destinationFileName));
    document = new Document();
    PdfCopy copy = new PdfSmartCopy(document, outputStream);
    document.open();
    reader = new PdfReader(inputStream);
    // loop over the pages in that document
    int pdfPageNo = reader.getNumberOfPages();
    for (int page = 0; page < pdfPageNo;) {
        PdfImportedPage onePage = copy.getImportedPage(reader, ++page);
        // duplicate each page N times
        for (int i = 0; i < COPIES; ++i) {
            copy.addPage(onePage);
        }
    }
    copy.freeReader(reader);
} catch (DocumentException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (reader != null) {
        reader.close();
    }
    if (document != null) {
        document.close();
    }
    try {
        if (inputStream != null) {
            inputStream.close();
        }
        if (outputStream != null) {
            outputStream.close();
        }
    } catch (IOException e) {
        // do nothing
    }
}
Both are surrounded by this:
public class Duplicate {
    /** The original PDF file */
    private static final String sourceFileName = "PDF_CI_US2CA.pdf";
    /** The resulting PDF file. */
    private static final String destinationFileName = "itext_output.pdf";
    private static final int COPIES = 2;

    public static void main(String[] args) {
        ...
    }
}
Using the following solution, I was able to create a PDF file with many duplicate pages and have a minimal impact on storage.
PDDocument samplePdf = null;
try {
    samplePdf = PDDocument.load(PDF_PATH);
    PDPage page = (PDPage) samplePdf.getDocumentCatalog().getAllPages().get(0);
    for (int i = 0; i < COPIES; i++) {
        samplePdf.importPage(page);
    }
    samplePdf.save(SAVE_PATH);
} catch (IOException e) {
    e.printStackTrace();
} catch (COSVisitorException e) {
    e.printStackTrace();
}
In my first attempt I used samplePdf.addPage(page), but it didn't work as expected, so there is clearly a difference between the add and import functions; I'll have to check the source or documentation to see why. Anyway, this should help you devise a solution for your needs with PDFBox.
