Validate that an uploaded file is a PDF - Java

How can I validate that an uploaded file is really a PDF, checking not only the extension (.pdf) but also the content? If someone changes the extension of some other file to .pdf, the upload should fail.

You can use Apache Tika for this, available at http://tika.apache.org/
You can also find a practical example here: https://dzone.com/articles/determining-file-types-java

There are many ways to validate a PDF file. I used iText to check whether a PDF is corrupted.
try {
    // PdfReader parses the file and throws InvalidPdfException if it is not a readable PDF
    PdfReader pdfReader = new PdfReader(file);
    PdfTextExtractor.getTextFromPage(pdfReader, 1);
    pdfReader.close();
    LOGGER.info("pdfFileValidator ==> Exit");
    return true;
} catch (InvalidPdfException e) {
    LOGGER.error("pdfFileValidator ==> Exit. Error ==> " + e.getMessage());
    return false;
}
If the file is not a PDF, or if it is corrupted, this throws InvalidPdfException.
The example above requires the iText library.

There are many validation libraries that you can use to check whether a file is a compliant PDF, for instance veraPDF or PDFBox. Of course, you can use any other library that does the job. As already mentioned, Tika is another library that can inspect a file and tell you what type it is.
As a bare-bones example, you can do something like this with PDFBox. Keep in mind that this validates whether the file is PDF/A compliant, which is stricter than checking that it is a PDF at all: an ordinary PDF that is not PDF/A will fail this check.
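// Uses the PDFBox 2.x Preflight API (org.apache.pdfbox.preflight)
boolean validateImpl(File file) {
    try {
        PreflightParser parser = new PreflightParser(file);
        parser.parse(); // must be called before getPreflightDocument()
        try (PreflightDocument document = parser.getPreflightDocument()) {
            document.validate();
            ValidationResult validationResult = document.getResult();
            return validationResult.isValid();
        }
    } catch (Exception e) {
        // Not parseable as a PDF, or an error occurred while validating
        return false;
    }
}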
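Or, with Tika, you can do something like this:
public String tikaDetect(File file) throws IOException {
    Tika tika = new Tika();
    // Returns the detected MIME type as a String; "application/pdf" for a PDF
    String detectedType = tika.detect(file);
    return detectedType;
}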
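As a lightweight first check before handing the upload to one of the libraries above, you can look at the first bytes of the file: a PDF starts with the marker %PDF- followed by a version number. The sketch below uses only the JDK. PdfHeaderCheck and hasPdfHeader are hypothetical names chosen for illustration, and the check is deliberately strict: it only confirms the header, not that the rest of the file is a valid PDF, and it rejects files with stray bytes before the marker even though some readers tolerate them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Tika, iText, or PDFBox.
final class PdfHeaderCheck {

    // The marker that opens a PDF file, e.g. "%PDF-1.7"
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the file begins with the "%PDF-" marker, whatever its extension is. */
    static boolean hasPdfHeader(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            // readNBytes(int) needs Java 11+; it returns fewer bytes if the file is shorter
            byte[] head = in.readNBytes(PDF_MAGIC.length);
            return Arrays.equals(head, PDF_MAGIC);
        }
    }
}
```

A renamed text or image file fails this check immediately, so the more expensive parsing step only runs for files that at least claim to be PDFs.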

Related

Apache PDFBox to open temporary created PDF file

I'm using Apache PDFBox 2.x and I am trying to read a temporary file that I created.
Below is my code to create the temp file and read it:
Path mergedTempFile = null;
try {
mergedTempFile = Files.createTempFile("merge_", ".pdf");
PDDocument pdDocument = PDDocument.load(mergedTempFile.toFile());
But it gives this error:
java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1098)
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2577)
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2560)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1099)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1082)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1006)
at com.howtodoinjava.demo.PdfboxApi.test(PdfboxApi.java:326)
at com.howtodoinjava.demo.PdfboxApi.main(PdfboxApi.java:317)
I found a similar issue, but it did not help.
Please help me with this; I still cannot get rid of the error.
PDDocument.load(...) is used to parse an existing PDF.
The temporary file you pass (mergedTempFile) has just been created and is empty, hence the exception. Instead, create a new PDDocument with the constructor (it lives in memory) and save it later with PDDocument.save(...).
Path mergedTempFile = null;
try {
    mergedTempFile = Files.createTempFile("merge_", ".pdf");
    try (PDDocument pdDocument = new PDDocument()) {
        // add content
        pdDocument.addPage(new PDPage()); // empty page as an example
        pdDocument.save(mergedTempFile.toFile());
    }
} catch (IOException e) {
    // exception handling
}
// use mergedTempFile for further logic
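To see why parsing the fresh temp file fails, it may help to confirm that Files.createTempFile produces a zero-byte file. The following JDK-only sketch shows a guard you could run before trying to parse a file; TempFileCheck and hasContent are hypothetical names used for illustration only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of PDFBox.
final class TempFileCheck {

    /** Returns true if the path is a regular file containing at least one byte. */
    static boolean hasContent(Path file) throws IOException {
        return Files.isRegularFile(file) && Files.size(file) > 0;
    }
}
```

For a file that has only just been created with Files.createTempFile, hasContent returns false, which is exactly the situation in which the parser reports an unexpected end of file.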

Find the Mime-Type of a downloaded file

In the code below, how can I find the MIME type of the file that was just saved to the File object?
File fileBase = new File("classpath:test");
try {
FileUtils.copyURLToFile(
new URL(docUploadDto.getAssetFileUrl()),
fileBase);
} catch (MalformedURLException e) {
responseDto.setMessage("Malformed URL : "+e.getMessage());
responseDto.setStatus("500");
return responseDto;
} catch (IOException e) {
responseDto.setMessage("IO Error : "+e.getMessage());
responseDto.setStatus("500");
return responseDto;
}
The URL returned by docUploadDto.getAssetFileUrl() can point to any kind of file.
Short answer: You can't.
Long answer:
You can only guess which MIME type the file is supposed to have by looking at its extension and its contents. MIME types are just standardized identifiers for file formats; they group different file endings, such as jpg and jpeg, under one type. They aren't stored in the file itself.
Files.probeContentType(path) does indeed try its best to guess the MIME type, as Asier said in the comment above, but the implementation is OS-specific and not complete. It's recommended to use Apache Tika instead:
Tika tika = new Tika();
// Tika.detect returns the MIME type as a String, e.g. "application/pdf"
String type = tika.detect(fileBase);
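If adding Tika is not an option, the JDK can make a rougher guess on its own. The sketch below first sniffs the leading bytes with URLConnection.guessContentTypeFromStream, which only recognizes a small set of formats, and then falls back to Files.probeContentType, with the OS-specific caveats mentioned above. MimeGuess and guessMimeType are hypothetical names, and the result may be null when neither approach recognizes the file.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; a rough JDK-only guess, not a replacement for Tika.
final class MimeGuess {

    /** Returns a guessed MIME type, or null if neither the content nor the platform gives a hint. */
    static String guessMimeType(Path file) throws IOException {
        // BufferedInputStream supports mark/reset, which guessContentTypeFromStream requires
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            String fromContent = URLConnection.guessContentTypeFromStream(in);
            if (fromContent != null) {
                return fromContent;
            }
        }
        // Fallback: platform-dependent, often based on the file extension only
        return Files.probeContentType(file);
    }
}
```

Content sniffing comes first because the file name in the question is chosen by the program and says nothing about what was actually downloaded.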

PDF Box - Unable to renameTo or Delete files

I'm fairly new to programming and I've been trying to use PDFBox for a personal project. I'm basically trying to verify whether the PDF contains specific keywords; if it does, I want to move the file to an "approved" folder.
I know the code below is poorly written, but I'm not able to move or delete the file correctly:
try (Stream<Path> filePathStream = Files.walk(Paths.get("C://pdfbox_teste"))) {
filePathStream.forEach(filePath -> {
if (Files.isRegularFile(filePath)) {
String arquivo = filePath.toString();
File file = new File(arquivo);
try {
// Loading an existing document
PDDocument document = PDDocument.load(file);
// Instantiate PDFTextStripper class
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(document);
String[] words = text.split("\\.|,|\\s");
for (String word : words) {
// System.out.println(word);
if (word.equals("Revisão") || word.equals("Desenvolvimento")) {
// System.out.println(word);
if(file.renameTo(new File("C://pdfbox_teste//Aprovados//" + file.getName()))){
document.close();
System.out.println("Arquivo transferido corretamente");
file.delete();
};
}
}
System.out.println("Fim do documento: " + arquivo);
System.out.println("----------------------------");
document.close();
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
});
I wanted to have the files transferred into the new folder. Instead, sometimes they only get deleted and sometimes nothing happens. I imagine the error is probably on the foreach, but I can't seem to find a way to fix it.
You try to rename the file while it is still open, and only close it afterwards:
// your code, does not work
if(file.renameTo(new File("C://pdfbox_teste//Aprovados//" + file.getName()))){
document.close();
System.out.println("Arquivo transferido corretamente");
file.delete();
};
Try to close the document first, so the file is no longer accessed by your process, and then it should be possible to rename it:
// fixed code:
document.close();
if (file.renameTo(new File("C://pdfbox_teste//Aprovados//" + file.getName()))) {
    System.out.println("Arquivo transferido corretamente"); // "File transferred successfully"
}
And as Mahesh K pointed out, you don't have to delete the (original) file after you have renamed it. Renaming does not create a duplicate that leaves the original behind to be deleted; it simply renames the file.
After calling renameTo, you shouldn't call delete. As far as I understand, renameTo works like a move command.
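Since renameTo only returns false on failure, it can be hard to tell why a move did not happen. A possible alternative, sketched here with the JDK's java.nio.file API, is Files.move, which throws an IOException describing the problem. FileMover and moveInto are hypothetical names for illustration; as in the answer above, any open PDDocument for the file should be closed before calling it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not part of PDFBox.
final class FileMover {

    /** Moves 'source' into 'targetDir' (created if missing), keeping its file name. */
    static Path moveInto(Path source, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(source.getFileName());
        // Unlike File.renameTo, a failure here surfaces as an exception with a reason
        return Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

No delete call is needed afterwards: once the move succeeds, the file no longer exists at its old location.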

Apache POI fails to save (HWPFDocument.write) large word doc files

I want to remove Word metadata from .doc files. My .docx files work fine with XWPFDocument, but the following code for removing metadata fails for large (> 1MB) .doc files. For example, given a 6MB .doc file with images, it outputs a 4.5MB file in which some images are missing.
public static InputStream removeMetaData(InputStream inputStream) throws IOException {
POIFSFileSystem fss = new POIFSFileSystem(inputStream);
HWPFDocument doc = new HWPFDocument(fss);
// **it even fails on large files if you remove from here to 'until' below**
SummaryInformation si = doc.getSummaryInformation();
si.removeAuthor();
si.removeComments();
si.removeLastAuthor();
si.removeKeywords();
si.removeSubject();
si.removeTitle();
doc.getDocumentSummaryInformation().removeCategory();
doc.getDocumentSummaryInformation().removeCompany();
doc.getDocumentSummaryInformation().removeManager();
try {
doc.getDocumentSummaryInformation().removeCustomProperties();
} catch (Exception e) {
// can not remove above
}
// until
ByteArrayOutputStream os = new ByteArrayOutputStream();
doc.write(os);
os.flush();
os.close();
return new ByteArrayInputStream(os.toByteArray());
}
Related posts:
How to save the Word Document using POI API?
https://stackoverflow.com/questions/9758955/saving-poi-document-correctly
Which version of Apache POI are you using?
This seems to be Bug 46220 - Regression: Some embedded images being lost.
Please upgrade to the latest release of POI (3.8 at the time of writing) and try again.
Hope that helps.

How to write images, swf's, videos and anything else that is stored on a website to a file on my computer using streams

I'm trying to write a program that copies a website to my hard drive. This is easy enough to do by copying the source and saving it as an HTML file, but if you do that you can't access any of the pictures, videos, etc. offline. I was wondering if there is a way to do this using an input/output stream and, if so, how exactly to do it...
Thanks so much in advance
If you have the URL of the file to be downloaded, you can simply do it using Apache Commons IO:
org.apache.commons.io.FileUtils.copyURLToFile(URL, File);
EDIT:
This code will download a zip file to your desktop.
import static org.apache.commons.io.FileUtils.copyURLToFile;

import java.io.File;
import java.net.URL;

public static void download() {
    try {
        // Target file on the current user's desktop
        File fl = new File(System.getProperty("user.home"), "Desktop/Screenshots.zip");
        URL dl = new URL("http://example.com/uploads/Screenshots.zip");
        // Downloads the URL's content and writes it to the file
        copyURLToFile(dl, fl);
    } catch (Exception e) {
        System.out.println(e);
    }
}
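Since the question asks specifically about streams, here is a minimal JDK-only sketch of the same idea without Commons IO: open an InputStream on the URL and copy its bytes to a file. It works the same way for images, videos, or any other binary resource. StreamDownloader and download are hypothetical names, and a real site copier would also need to find the resource URLs in the HTML, which this sketch does not attempt.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; JDK only, no Commons IO required.
final class StreamDownloader {

    /** Copies the bytes behind 'source' to 'target' and returns the number of bytes written. */
    static long download(URL source, Path target) throws IOException {
        // openStream() yields the raw bytes, so the content type does not matter
        try (InputStream in = source.openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

You would call download once per resource, for example with the URL of each image found in the page and a target path under your local copy of the site.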
