How to read from PDF using Selenium webdriver and Java


I am trying to read the contents of a PDF file using Selenium with Java. Below is my code. getWebDriver is a custom method in the framework that returns the WebDriver.
URL urlOfPdf = new URL(this.getWebDriver().getCurrentUrl());
BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());
PDFParser parser = new PDFParser((RandomAccessRead) fileToParse);
parser.parse();
String output = new PDFTextStripper().getText(parser.getPDDocument());
The third line of the code gives a compile-time error if I don't cast fileToParse to the RandomAccessRead type.
And when I do cast it, I get this runtime error:
java.lang.ClassCastException: java.io.BufferedInputStream cannot be cast to org.apache.pdfbox.io.RandomAccessRead
I need help with getting rid of these errors.

First off, unless you want to interfere in the PDF loading process, there is no need to use the PDFParser class explicitly. You can use the static PDDocument.load method instead:
URL urlOfPdf = new URL(this.getWebDriver().getCurrentUrl());
BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());
PDDocument document = PDDocument.load(fileToParse);
String output = new PDFTextStripper().getText(document);
Otherwise, if you do want to interfere in the loading process, you have to create a RandomAccessRead instance for your BufferedInputStream. You cannot simply cast it, because the classes are not related.
You can do that like this:
URL urlOfPdf = new URL(this.getWebDriver().getCurrentUrl());
BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());
MemoryUsageSetting memUsageSetting = MemoryUsageSetting.setupMainMemoryOnly();
ScratchFile scratchFile = new ScratchFile(memUsageSetting);
PDFParser parser;
try
{
RandomAccessRead source = scratchFile.createBuffer(fileToParse);
parser = new PDFParser(source);
parser.parse();
}
catch (IOException ioe)
{
IOUtils.closeQuietly(scratchFile);
throw ioe;
}
String output = new PDFTextStripper().getText(parser.getPDDocument());
(This essentially is copied and pasted from the source of PDDocument.load.)
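The ClassCastException from the question can be reproduced with JDK classes alone, which may make the cause easier to see. The following is a minimal sketch and not PDFBox code: java.io.DataInput stands in for RandomAccessRead (both are interfaces that BufferedInputStream does not implement), DataInputStream stands in for the adapter that the ScratchFile buffer provides above, and the class and method names are made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class CastVersusWrap {

    // True if the runtime object behind 'stream' cannot be cast to DataInput.
    static boolean castFails(InputStream stream) {
        try {
            // This compiles because DataInput is an interface: the compiler cannot
            // rule out that some InputStream subclass implements it.
            DataInput ignored = (DataInput) stream;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // Adapts the stream by wrapping it in a class that does implement DataInput.
    static DataInput wrap(InputStream stream) {
        return new DataInputStream(stream);
    }

    // Reads one int through the adapter, converting the checked exception.
    static int readIntVia(DataInput input) {
        try {
            return input.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {0, 0, 0, 42};
        InputStream buffered = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println("cast fails: " + castFails(buffered));
        System.out.println("value read through wrapper: " + readIntVia(wrap(buffered)));
    }
}
```

The cast only tells the compiler to trust you; it never converts the object. Wrapping creates a new object of the required type that delegates to the original stream.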

Related

Save file from a website with java

I'm trying to build a jsoup based java app to automatically download English subtitles for films (I'm lazy, I know. It was inspired from a similar python based app). It's supposed to ask you the name of the film and then download an English subtitle for it from subscene.
I can make it reach the download link, but I get an "Unhandled content type" error when I try to 'go' to that link. Here's my code:
public static void main(String[] args) {
try {
String videoName = JOptionPane.showInputDialog("Title: ");
subscene(videoName);
}
catch (Exception e) {
System.out.println(e.getMessage());
}
}
public static void subscene(String videoName){
try {
String siteName = "http://www.subscene.com";
String[] splits = videoName.split("\\s+");
String codeName = "";
String text = "";
if(splits.length>1){
for(int i=0;i<splits.length;i++){
codeName = codeName+splits[i]+"-";
}
videoName = codeName.substring(0, videoName.length());
}
System.out.println("videoName is "+videoName);
// String url = "http://www.subscene.com/subtitles/"+videoName+"/english";
String url = "http://www.subscene.com/subtitles/title?q="+videoName+"&l=";
System.out.println("url is "+url);
Document doc = Jsoup.connect(url).get();
Element exact = doc.select("h2.exact").first();
Element yuel = exact.nextElementSibling();
Elements lis = yuel.children();
System.out.println(lis.first().children().text());
String hRef = lis.select("div.title > a").attr("href");
hRef = siteName+hRef+"/english";
System.out.println("hRef is "+hRef);
doc = Jsoup.connect(hRef).get();
Element nonHI = doc.select("td.a40").first();
Element papa = nonHI.parent();
Element link = papa.select("a").first();
text = link.text();
System.out.println("Subtitle is "+text);
hRef = link.attr("href");
hRef = siteName+hRef;
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
Jsoup.connect(hRef).get(); //<-- Here's where the problem lies
}
catch (java.io.IOException e) {
System.out.println(e.getMessage());
}
}
Can someone please help me so I don't have to manually download subs?
I just found out that using
java.awt.Desktop.getDesktop().browse(java.net.URI.create(hRef));
instead of
Jsoup.connect(hRef).get();
downloads the file after prompting me to save it. But I don't want to be prompted because this way I won't be able to read the name of the downloaded zip file (I want to unzip it after saving using java).

Assuming that your files are small, you can do it like this. Note that you can tell Jsoup to ignore the content type.
// get the file content
Connection connection = Jsoup.connect(path);
connection.timeout(5000);
Connection.Response resultImageResponse = connection.ignoreContentType(true).execute();
// save to file
FileOutputStream out = new FileOutputStream(localFile);
out.write(resultImageResponse.bodyAsBytes());
out.close();
I would recommend verifying the content before saving, because some servers just return an HTML page when the file cannot be found, e.g. behind a broken hyperlink.
...
String body = resultImageResponse.body();
if (body == null || body.toLowerCase().contains("<body>"))
{
throw new IllegalStateException("invalid file content");
}
...
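A complementary check, assuming the expected download is a ZIP archive as in the question, is to look at the first bytes of the response instead of searching for HTML markup. ZIP archives start with the ASCII bytes 'P' and 'K'. This is only a sketch: the class and method names are hypothetical, and in the answer's code the byte array would come from resultImageResponse.bodyAsBytes().

```java
import java.nio.charset.StandardCharsets;

class ZipSignatureCheck {

    // ZIP archives start with the two ASCII bytes 'P' and 'K'.
    static boolean looksLikeZip(byte[] body) {
        return body != null
                && body.length >= 2
                && body[0] == 'P'
                && body[1] == 'K';
    }

    public static void main(String[] args) {
        byte[] zipStart = {'P', 'K', 3, 4};
        byte[] htmlPage = "<html><body>Not found</body></html>".getBytes(StandardCharsets.UTF_8);
        System.out.println("zip bytes accepted: " + looksLikeZip(zipStart));
        System.out.println("html page accepted: " + looksLikeZip(htmlPage));
    }
}
```

A signature check like this only rules out obviously wrong content; it does not prove the archive is complete or uncorrupted.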

Here:
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
//specifically here
Jsoup.connect(hRef).get();
It looks like jsoup expects the result of Jsoup.connect(hRef) to be HTML or some other text that it is able to parse. That's why the message states:
Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml
I followed the execution of your code manually and the last URL you're trying to access returns a content type of application/x-zip-compressed, thus the cause of the exception.
In order to download this file, you should use a different approach. You could use the old but still useful URLConnection or URL classes, or a third-party library like Apache HttpComponents, to fire a GET request, retrieve the result as an InputStream, and copy it through a buffered output stream to a file on disk.
Here's an example of doing this using URL:
URL url = new URL(hRef);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\foo.zip"));
final int BUFFER_SIZE = 1024 * 4;
byte[] buffer = new byte[BUFFER_SIZE];
BufferedInputStream bis = new BufferedInputStream(in);
int length;
while ( (length = bis.read(buffer)) > 0 ) {
out.write(buffer, 0, length);
}
out.close();
in.close();
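On Java 7 and later, the copy loop above can probably be shortened with java.nio.file.Files.copy and try-with-resources, which also closes the stream if the copy fails. The following is a sketch under that assumption: the class name, method name, and target file name are made up, and an in-memory stream replaces the network stream so the example runs offline.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class StreamToFile {

    // Copies everything from 'in' to 'target' and returns the number of bytes written.
    // The stream is closed even if the copy fails.
    static long save(InputStream in, Path target) {
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In the answer above the stream would come from new URL(hRef).openStream();
        // an in-memory stream keeps this sketch runnable offline.
        InputStream in = new ByteArrayInputStream(new byte[] {'P', 'K', 3, 4});
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "stream-to-file-sketch.zip");
        System.out.println("bytes written: " + save(in, target));
    }
}
```

Because your code chooses the target path, the name of the saved file is known afterwards, which is what the question needs for unzipping it later.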

How to sign an InputStream from a PDF file with PDFBox 2.0.0

I want to sign an InputStream from a PDF file without using a temporary file.
Here I convert the InputStream to a File, and this works fine:
InputStream inputStream = this.signatureObjPAdES.getSignatureDocument().getInputStream();
OutputStream outputStream = new FileOutputStream(new File("C:/temp.pdf"));
int read = 0;
byte[] bytes = new byte[1024];
while ((read = inputStream.read(bytes)) != -1) {
outputStream.write(bytes, 0, read);
}
PDDocument document = PDDocument.load(new File("C:/temp.pdf"));
...
document.addSignature(new PDSignature(this.dts.getDocumentTimeStamp()), this);
document.saveIncremental(new FileOutputStream("C:/result.pdf"));
document.close();
But I want to do this directly:
PDDocument document = PDDocument.load(inputStream);
Problem: at runtime I get this exception:
Exception in thread "main" java.lang.NullPointerException
at java.io.RandomAccessFile.<init>(Unknown Source)
at org.apache.pdfbox.io.RandomAccessBufferedFileInputStream.<init>(RandomAccessBufferedFileInputStream.java:77)
at org.apache.pdfbox.pdmodel.PDDocument.saveIncremental(PDDocument.java:961)
All ideas are welcome.
Thank you.
EDIT:
It's now working with the release of PDFBox 2.0.0.

The cause
The immediate hindrance is in the method PDDocument.saveIncremental() itself:
public void saveIncremental(OutputStream output) throws IOException
{
InputStream input = new RandomAccessBufferedFileInputStream(incrementalFile);
COSWriter writer = null;
try
{
writer = new COSWriter(output, input);
writer.write(this, signInterface);
writer.close();
}
finally
{
if (writer != null)
{
writer.close();
}
}
}
(PDDocument.java)
The member incrementalFile used in the first line is only set during a PDDocument.load with a File parameter.
Thus, this method cannot be used.
A work-around
Fortunately the method PDDocument.saveIncremental() only uses methods and values publicly available with the sole exception of signInterface, but you know the value of it because you set it in your code in the line right before the saveIncremental call:
document.addSignature(new PDSignature(this.dts.getDocumentTimeStamp()), this);
document.saveIncremental(new FileOutputStream("C:/result.pdf"));
Thus, instead of calling PDDocument.saveIncremental() you can do the equivalent in your code.
To do so you furthermore need a replacement value for the InputStream input. It needs to return a stream with the identical content as inputStream in your
PDDocument document = PDDocument.load(inputStream);
So you need to use that stream twice. As you have not said whether that inputStream can be reset, we'll first copy it into a byte[] which we forward both to PDDocument.load and new COSWriter.
Thus, replace your
PDDocument document = PDDocument.load(inputStream);
...
document.addSignature(new PDSignature(this.dts.getDocumentTimeStamp()), this);
document.saveIncremental(new FileOutputStream("C:/result.pdf"));
document.close();
by
byte[] inputBytes = IOUtils.toByteArray(inputStream);
PDDocument document = PDDocument.load(new ByteArrayInputStream(inputBytes));
...
document.addSignature(new PDSignature(this.dts.getDocumentTimeStamp()), this);
saveIncremental(new FileOutputStream("C:/result.pdf"),
new ByteArrayInputStream(inputBytes), document, this);
document.close();
and add a new method saveIncremental to your class inspired by the original PDDocument.saveIncremental():
void saveIncremental(OutputStream output, InputStream input, PDDocument document, SignatureInterface signatureInterface) throws IOException
{
COSWriter writer = null;
try
{
writer = new COSWriter(output, input);
writer.write(document, signatureInterface);
writer.close();
}
finally
{
if (writer != null)
{
writer.close();
}
}
}
On the side
I said above
As you have not said whether that inputStream can be reset, we'll first copy it into a byte[] which we forward both to PDDocument.load and new COSWriter.
Actually there is another reason to do so: COSWriter.doWriteSignature() retrieves the length of the original PDF like this:
long inLength = incrementalInput.available();
(COSWriter.java)
The documentation of InputStream.available() states, though:
Note that while some implementations of InputStream will return the total number of bytes in the stream, many will not.
To re-use inputStream instead of using a byte[] and ByteArrayInputStreams as above, therefore, inputStream not only needs to support reset() but also needs to be one of the few InputStream implementations which return the total number of bytes in the stream as available.
FileInputStream and ByteArrayInputStream both do return the total number of bytes in the stream as available.
There may still be more issues when using generic InputStreams instead of these two.
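The "copy once, read twice" idea and the behaviour of available() can be tried with JDK classes only. This is a sketch, not PDFBox code: the class and method names are hypothetical, the placeholder bytes are not a real PDF, and readFully is a stand-in for IOUtils.toByteArray used above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadOnceUseTwice {

    // Drains the stream into a byte array so the content can be replayed.
    static byte[] readFully(InputStream in) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream original = new ByteArrayInputStream(
                "%PDF-1.4 placeholder".getBytes(StandardCharsets.US_ASCII));
        byte[] bytes = readFully(original);

        // Two independent streams over the same content: one for loading the
        // document, one for the writer that appends the incremental update.
        ByteArrayInputStream forLoading = new ByteArrayInputStream(bytes);
        ByteArrayInputStream forWriter = new ByteArrayInputStream(bytes);

        // For an unread ByteArrayInputStream, available() is the full length.
        System.out.println(forLoading.available() == bytes.length);
        System.out.println(forWriter.available() == bytes.length);
    }
}
```

Holding the whole file in memory is the price of this approach, so it fits documents of moderate size better than very large ones.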

Hey Cyril Bremaud, you can use this approach. Since PDDocument.load is overloaded, you can also provide only the file path if you like, and it will work as well. But for your requirement of passing an InputStream directly to PDDocument, use this code:
lStrInputPDFfile = "samples_pdf_signing\\Country Calendar.pdf";
lOsPDFInput = new java.io.FileInputStream(lStrInputPDFfile);
jPDFDocument = org.apache.pdfbox.pdmodel.PDDocument.load(lOsPDFInput);
But this also works in my case:
lStrInputPDFfile = "samples_pdf_signing\\Country Calendar.pdf";
jPDFDocument = org.apache.pdfbox.pdmodel.PDDocument.load(lStrInputPDFfile);
Note: InputStream is the parent class of FileInputStream, which is why the above code works.
Updated my code, please check again. Thanks to @mkl for pointing that out.

How to read Nutch content from Java/Scala?

I'm using Nutch to crawl some websites (as a process that runs separately from everything else), and I want to use a Java (Scala) program to analyse the HTML data of those websites using Jsoup.
I got Nutch to work by following the tutorial (without the script, only executing the individual instructions worked), and I think it's saving the websites' HTML in the crawl/segments/<time>/content/part-00000 directory.
The problem is that I cannot figure out how to actually read the website data (URLs and HTML) in a Java/Scala program. I read this document, but find it a bit overwhelming since I've never used Hadoop.
I tried to adapt the example code to my environment, and this is what I arrived at (mostly by guesswork):
val reader = new MapFile.Reader(FileSystem.getLocal(new Configuration()), ".../apache-nutch-1.8/crawl/segments/20140711115438/content/part-00000", new Configuration())
var key = null
var value = null
reader.next(key, value) // test for a single value
println(key)
println(value)
However, I am getting this exception when I run it:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1873)
at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
I am not sure how to work with a MapFile.Reader, specifically, what constructor parameters I am supposed to pass to it. What Configuration objects am I supposed to pass in? Is that the correct FileSystem? And is that the data file I'm interested in?

Scala:
val conf = NutchConfiguration.create()
val fs = FileSystem.get(conf)
val file = new Path(".../part-00000/data")
val reader = new SequenceFile.Reader(fs, file, conf)
val webdata = Stream.continually {
val key = new Text()
val content = new Content()
reader.next(key, content)
(key, content)
}
println(webdata.head)
Java:
public class ContentReader {
public static void main(String[] args) throws IOException {
Configuration conf = NutchConfiguration.create();
Options opts = new Options();
GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
String[] remainingArgs = parser.getRemainingArgs();
FileSystem fs = FileSystem.get(conf);
String segment = remainingArgs[0];
Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
Text key = new Text();
Content content = new Content();
// Loop through sequence files
while (reader.next(key, content)) {
try {
System.out.write(content.getContent(), 0,
content.getContent().length);
} catch (Exception e) {
}
}
}
}
Alternatively, you can use org.apache.nutch.segment.SegmentReader (example).

SVNKit to find diff between two files stored at separate locations with separate revision numbers

I am writing a Java program using the SVNKit API, and I need to use the correct class or call in the API that would allow me to find the diff between files stored in separate locations.
1st file:
https://abc.edc.xyz.corp/svn/di-edc/tags/ab-cde-fgh-axsym-1.0.0/src/site/apt/releaseNotes.apt
2nd file:
https://abc.edc.xyz.corp/svn/di-edc/tags/ab-cde-fgh-axsym-1.1.0/src/site/apt/releaseNotes.apt
I have used the API calls listed below to generate the diff output, but I have been unsuccessful so far.
DefaultSVNDiffGenerator diffGenerator = new DefaultSVNDiffGenerator();
diffGenerator.displayFileDiff("", file1, file2, "10983", "8971", "text", "text/plain", output);
diffClient.doDiff(svnUrl1, SVNRevision.create(10868), svnUrl2, SVNRevision.create(8971), SVNDepth.IMMEDIATES, false, System.out);
Can anyone provide guidance on the correct way to do this?

Your code looks correct. But prefer using the new API:
final SvnOperationFactory svnOperationFactory = new SvnOperationFactory();
try {
final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
final SvnDiffGenerator diffGenerator = new SvnDiffGenerator();
diffGenerator.setBasePath(new File(""));
final SvnDiff diff = svnOperationFactory.createDiff();
diff.setSources(SvnTarget.fromURL(url1, svnRevision1), SvnTarget.fromURL(url2, svnRevision2));
diff.setDiffGenerator(diffGenerator);
diff.setOutput(byteArrayOutputStream);
diff.run();
} finally {
svnOperationFactory.dispose();
}

Problems converting from an object to XML in java

What I'm trying to do is convert an object to XML, then transfer it as a String via a web service so another platform (.NET in this case) can read the XML and parse it back into the same object. I've been reading this article:
http://simple.sourceforge.net/download/stream/doc/tutorial/tutorial.php#start
And I've been able to do everything with no problems until here:
Serializer serializer = new Persister();
PacienteObj pac = new PacienteObj();
pac.idPaciente = "1";
pac.nomPaciente = "Sonia";
File result = new File("example.xml");
serializer.write(pac, result);
I know this will sound silly, but I can't find where Java creates the file for new File("example.xml"), so I can't check its contents.
I'd also like to know whether there is any way to convert that XML into a String instead of a File, because that's exactly what I need. I can't find that information in the article.
Thanks in advance.

I'd also like to know whether there is any way to convert that XML into a String instead of a File, because that's exactly what I need. I can't find that information in the article.
Check out the JavaDoc. There is a method that writes to a Writer, so you can hook it up to a StringWriter (which writes into a String):
StringWriter result = new StringWriter(expectedLength);
serializer.write(pac, result);
String s = result.toString();

You can use an instance of StringWriter:
Serializer serializer = new Persister();
PacienteObj pac = new PacienteObj();
pac.idPaciente = "1";
pac.nomPaciente = "Sonia";
StringWriter result = new StringWriter();
serializer.write(pac, result);
String xml = result.toString(); // xml now contains the serialized data
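The StringWriter trick is not specific to the Simple framework: any API that writes to a java.io.Writer can be captured the same way. The sketch below swaps in the JDK's own StAX writer (javax.xml.stream) in place of Persister so that it runs without extra libraries. The element names are assumptions for illustration, and the XML that Simple actually produces for PacienteObj may look different.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class XmlToString {

    // Writes a small XML fragment into a StringWriter and returns it as a String.
    static String toXml(String idPaciente, String nomPaciente) {
        StringWriter result = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(result);
            xml.writeStartElement("pacienteObj");
            xml.writeStartElement("idPaciente");
            xml.writeCharacters(idPaciente);
            xml.writeEndElement();
            xml.writeStartElement("nomPaciente");
            xml.writeCharacters(nomPaciente);
            xml.writeEndElement();
            xml.writeEndElement();
            xml.flush(); // push any buffered output into the StringWriter
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write XML", e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml("1", "Sonia"));
    }
}
```

The resulting String can then be passed to the web service call directly, with no file on disk involved.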

Logging or printing the statement below will tell you where the file is on the file system:
result.getAbsolutePath()
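A relative name such as "example.xml" is resolved against the JVM's working directory, which is exposed through the user.dir system property. The sketch below prints both; the class and method names are hypothetical, and no file needs to exist for the path to be resolved.

```java
import java.io.File;

class WhereIsMyFile {

    // Resolves a path the way new File(name) does: relative names are
    // interpreted against the current working directory.
    static String absolutePathOf(String name) {
        return new File(name).getAbsolutePath();
    }

    public static void main(String[] args) {
        System.out.println("working directory: " + System.getProperty("user.dir"));
        System.out.println("example.xml is at: " + absolutePathOf("example.xml"));
    }
}
```

When the program is started from an IDE, the working directory is often the project root, so that is usually the first place to look.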
