I am trying to export an HTML page to PDF using iText 7.1.0 and pdfHTML 2.0.0. The resulting pages have formatting issues: the pie chart images are aligned horizontally in the HTML but stacked vertically in the PDF, and the table on the left (titled "Features") is pushed down. The HTML code that the PDF renderer consumes is in a jsFiddle.
Below is the Java code used for rendering the PDF (Page1.html is the same HTML code in the fiddle):
/*
* Copyright 2016-2017, iText Group NV.
* This example was created by Bruno Lowagie.
* It was written in the context of the following book:
* https://leanpub.com/itext7_pdfHTML
* Go to http://developers.itextpdf.com for more info.
*/
package com.itextpdf.htmlsamples.chapter01;
import java.io.File;
import java.io.IOException;
import com.itextpdf.html2pdf.HtmlConverter;
import com.itextpdf.licensekey.LicenseKey;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* Converts a simple HTML file to PDF using File objects
* as arguments for the convertToPdf() method.
*/
public class C01E03_HelloWorld {
/** The Base URI of the HTML page. */
public static final String BASEURI = "src/main/resources/html/";
/** The path to the source HTML file. */
public static final String SRC = String.format("%sPage1.html", BASEURI);
/** The target folder for the result. */
public static final String TARGET = "target/results/ch01/";
/** The path to the resulting PDF file. */
public static final String DEST = String.format("%stest-03.pdf", TARGET);
/**
* The main method of this example.
*
* @param args no arguments are needed to run this example.
* @throws IOException Signals that an I/O exception has occurred.
*/
public static void main(String[] args) throws IOException {
LicenseKey.loadLicenseFile("C://Users//Sparks//Desktop//itextkey-0.xml");
File file = new File(TARGET);
file.mkdirs();
new C01E03_HelloWorld().createPdf(BASEURI, SRC, DEST);
}
/**
* Creates the PDF file.
*
* @param baseUri the base URI
* @param src the path to the source HTML file
* @param dest the path to the resulting PDF
* @throws IOException Signals that an I/O exception has occurred.
*/
public void createPdf(String baseUri, String src, String dest) throws IOException {
HtmlConverter.convertToPdf(new File(src), new File(dest));
}
}
The output PDF file is here. It should have formatting similar to the one in HTML page.
Any suggestions would be helpful.
I have a screenshot that I captured in the browser with JavaScript and converted into a string. I need a web service that can read this string and convert it into an image on the server side.
The question is:
How should I build a web service that can read the string from the browser and convert it into an image on the server side?
The code is shown below:
package com.myfirst.wsServer;
import java.io.FileOutputStream;
import java.util.UUID;
import javax.jws.WebService;
import org.apache.commons.codec.binary.Base64;
@WebService
public class ConvertToImage {
/**
* Convert input string image to image
* @param imageDataString
*/
public static void StringToImage(String imageDataString){
try {
// Converting a Base64 String into Image byte array
byte[] imageByteArray = decodeImage(imageDataString);
// Write a image byte array into file system
FileOutputStream imageOutFile = new FileOutputStream("D:\\" + getNewFileName() + ".png");
imageOutFile.write(imageByteArray);
imageOutFile.close();
System.out.println("Image Successfully Manipulated!");
}
catch (Exception e) {
// Report the failure instead of silently swallowing it
e.printStackTrace();
}
}
/**
* Encodes the byte array into base64 string
*
* @param imageByteArray - byte array
* @return String
*/
public static String encodeImage(byte[] imageByteArray) {
return Base64.encodeBase64URLSafeString(imageByteArray);
}
/**
* Decodes the base64 string into byte array
*
* @param imageDataString
* @return byte array
*/
public static byte[] decodeImage(String imageDataString) {
return Base64.decodeBase64(imageDataString);
}
/**
* Generate a UUID and convert it to a String
* @return uuid.toString()
*/
public static String getNewFileName(){
UUID uuid = UUID.randomUUID();
return uuid.toString();
}
}
Any questions or suggestions are welcome.
Thank you
PS: If you can help and are interested in the JavaScript code, I can share it.
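One detail worth checking on the server side: if the browser string comes from something like canvas.toDataURL() (an assumption, since the JavaScript is not shown), it starts with a "data:image/png;base64," prefix that has to be stripped before decoding. A minimal sketch using the JDK's java.util.Base64 (Java 8+) rather than commons-codec; the class name DataUrlDecoder is made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of the code above.
final class DataUrlDecoder {

    /** Decodes either a bare Base64 string or a "data:...;base64,..." URL. */
    static byte[] decode(String imageData) {
        String payload = imageData;
        int comma = imageData.indexOf(',');
        if (imageData.startsWith("data:") && comma >= 0) {
            // Everything up to and including the first comma is metadata
            payload = imageData.substring(comma + 1);
        }
        // Throws IllegalArgumentException if the payload is not valid Base64
        return Base64.getDecoder().decode(payload);
    }

    public static void main(String[] args) {
        String encoded = Base64.getEncoder()
                .encodeToString("fake-png".getBytes(StandardCharsets.UTF_8));
        byte[] bytes = decode("data:image/png;base64," + encoded);
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The returned bytes could then be written to disk with Files.write(path, bytes), in place of the FileOutputStream used above.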
I followed some previous advice on StackOverflow here (Extract specific parts of PDF documents) to extract data from PDFs. I have limited programming experience and virtually none with Java (used C++ when I was younger).
After overcoming some newbie difficulty getting the right JARs onto the build path, I have run into what seems to be a definition problem. I hope this isn't a stupid question; I've tried for quite some time to get this to work.
Here's the error I receive:
Exception in thread "main" java.lang.Error: Unresolved compilation problem:
The constructor RegionTextRenderFilter(Rectangle) is undefined
at PDFExtract.PDFExtract.parsePdf(PDFExtract.java:41)
at PDFExtract.PDFExtract.main(PDFExtract.java:59)
Here is my code:
package PDFExtract;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.FilteredTextRenderListener;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.RegionTextRenderFilter;
import com.itextpdf.text.pdf.parser.RenderFilter;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
public class PDFExtract {
/** The original PDF that will be parsed. */
public static final String PREFACE = "resources/pdfs/preface.pdf";
/** The resulting text file. */
public static final String RESULT = "results/output.txt";
/**
* Parses a specific area of a PDF to a plain text file.
* @param pdf the original PDF
* @param txt the resulting text
* @throws IOException
*/
public void parsePdf(String pdf, String txt) throws IOException {
PdfReader reader = new PdfReader(pdf);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
Rectangle rect = new Rectangle(70, 80, 420, 500);
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
}
out.flush();
out.close();
reader.close();
}
/**
* Main method.
* @param args no arguments needed
* @throws DocumentException
* @throws IOException
*/
public static void main(String[] args) throws IOException, DocumentException {
new PDFExtract().parsePdf(PREFACE, RESULT);
}
}
The error appears on this line "RenderFilter filter = new RegionTextRenderFilter(rect);"
Hopefully someone can help me! Again I apologize if this is a stupid question. I tried jumping from a Hello World to this and it's taxing my aged understanding of C++
Is there a way in JNA to load multiple dependent libraries with Java?
I usually use Native.loadLibrary(...) to load one DLL. But I guess it's not working this way because I assign the result of this function call to the instance member.
Let's say I have library foo and library bar. bar has a dependency on foo; it also has a dependency on baz, which we are not mapping with JNA:
public class Foo {
public static final boolean LOADED;
static {
Native.register("foo");
LOADED = true;
}
public static native void call_foo();
}
public class Bar {
static {
// Reference "Foo" so that it is loaded first
if (Foo.LOADED) {
System.loadLibrary("baz");
// Or System.load("/path/to/libbaz.so")
Native.register("bar");
}
}
public static native void call_bar();
}
The call to System.load/loadLibrary will only be necessary if baz is neither on your library load path (PATH/LD_LIBRARY_PATH, for windows/linux respectively) nor in the same directory as bar (windows only).
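The ordering trick above relies on plain Java class initialization, so it can be checked without JNA. A minimal sketch (all names here are hypothetical stand-ins, and the list entries replace the real Native.register calls): reading a static field that is not a compile-time constant forces the owning class to initialize first.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone illustration; no native libraries are involved.
final class InitOrderDemo {
    // Records the order in which the static initializers run.
    static final List<String> ORDER = new ArrayList<>();

    static final class First {
        // Blank final assigned in the static block: NOT a compile-time
        // constant, so reading it triggers initialization of First.
        static final boolean LOADED;
        static {
            ORDER.add("First"); // stands in for Native.register("foo")
            LOADED = true;
        }
    }

    static final class Second {
        static {
            if (First.LOADED) {      // forces First to initialize here
                ORDER.add("Second"); // stands in for Native.register("bar")
            }
        }
        static void touch() {
            // Calling any static method initializes Second.
        }
    }

    static List<String> run() {
        Second.touch();
        return ORDER;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [First, Second]
    }
}
```

Note that declaring the flag as `static final boolean LOADED = true;` would make it a compile-time constant, which the compiler inlines; the reference would then no longer trigger initialization of the first class.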
EDIT
You can also do this via interface mapping:
public interface Foo extends Library {
Foo INSTANCE = (Foo) Native.loadLibrary("foo", Foo.class);
}
public interface Bar extends Library {
// Reference Foo.INSTANCE first so that "foo" is loaded
// before the Bar instance is created
Foo FOO = Foo.INSTANCE;
Bar INSTANCE = (Bar) Native.loadLibrary("bar", Bar.class);
}
Loading a library's transitive dependencies with JNA from JAR resources.
My resources folder res:
res/
`-- linux-x86-64
    |-- libapi.so
    `-- libdependency.so
MyApiLibrary api = (MyApiLibrary) Native.loadLibrary("libapi.so", MyApiLibrary.class, options);
API Explodes:
Caused by: java.lang.UnsatisfiedLinkError: Error loading shared library libdependency.so: No such file or directory
This can be solved by loading the dependencies manually beforehand:
import com.sun.jna.Library;
import com.sun.jna.Native;
Native.loadLibrary("libdependency.so", Library.class);
MyApiLibrary api = (MyApiLibrary) Native.loadLibrary("libapi.so", MyApiLibrary.class, options);
Basically, you have to walk the dependency tree in reverse and load each library by hand.
To debug library loading, I recommend setting:
java -Djna.debug_load=true -Djna.debug_load.jna=true
Furthermore, pointing jna.library.path at a resource inside the JAR has no effect: JNA first extracts the library to the filesystem and then loads it, and a library on the filesystem cannot access other libraries inside the JAR.
Context class loader classpath. Deployed native libraries may be
installed on the classpath under ${os-prefix}/LIBRARY_FILENAME, where
${os-prefix} is the OS/Arch prefix returned by
Platform.getNativeLibraryResourcePrefix(). If bundled in a jar file,
the resource will be extracted to jna.tmpdir for loading, and later
removed (but only if jna.nounpack is false or not set).
Javadoc
RTFM and happy coding. JNA v.4.1.0
I was in a similar situation, dealing with multiplatform and several dependent libraries, but needing to load only one. Here is my take.
Suppose you have a set of 32/64-bit Windows/Linux libraries with dependencies.
Suppose you only need a JNA binding for libapi.
You'll need to organize them into your jar like this:
linux-x86-64
|-- libapi.so
|-- libdependency.so
linux-x86
|-- libapi.so
|-- libdependency.so
win32-x86-64
|-- libapi.dll
|-- libdependency.dll
win32-x86
|-- libapi.dll
|-- libdependency.dll
You can:
determine whether you are executing from a JAR file (this avoids performing the operation when running from your favorite IDE; see How to get the path of a running JAR file?)
use JNA to determine the platform you are currently running on
extract all appropriate library files into the Java temp folder (using elements from this answer: https://stackoverflow.com/a/58318009/7237062 (or related answers) should do the trick)
tell JNA to look into the newly created temp folder
and voilà!
The code example omits the directory cleanup at application shutdown; I leave that to you.
The main part should look like this:
MainClass.java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Optional;
import java.util.jar.JarFile;
import com.sun.jna.Platform;
public class MainClass {
private static final String JAVA_IO_TMPDIR = "java.io.tmpdir";
private static final String TEMP_DIR = System.getProperty(JAVA_IO_TMPDIR);
private static final String JNA_LIBRARY_PATH = "jna.library.path";
public static void main(String[] args) throws Exception {
// ...
// path management here may be suboptimal ... feel free to improve
// from https://stackoverflow.com/questions/320542/how-to-get-the-path-of-a-running-jar-file
URL current_jar_dir = MainClass.class.getProtectionDomain().getCodeSource().getLocation();
Path jar_path = Paths.get(current_jar_dir.toURI());
String folderContainingJar = jar_path.getParent().toString();
ResourceCopy r = new ResourceCopy(); // class from https://stackoverflow.com/a/58318009/7237062
Optional<JarFile> jar = r.jar(MainClass.class);
if (jar.isPresent()) {
try {
System.out.println("JAR detected");
File target_dir = new File(TEMP_DIR);
System.out.println(String.format("Trying copy from %s %s to %s", jar.get().getName(), Platform.RESOURCE_PREFIX, target_dir));
// perform dir copy
r.copyResourceDirectory(jar.get(), Platform.RESOURCE_PREFIX, target_dir);
// add created folders to JNA lib loading path
System.setProperty(JNA_LIBRARY_PATH, target_dir.getCanonicalPath().toString());
} catch(Exception e) {
e.printStackTrace(); // TODO: handle exception ?
}
} else {
System.out.println("NO JAR");
}
// ...
}
}
ResourceCopy.java (copied here for completeness; taken from https://stackoverflow.com/a/58318009)
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.nio.file.Files;
import java.util.Enumeration;
import java.util.Optional;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
/**
* A helper to copy resources from a JAR file into a directory. source :
* https://stackoverflow.com/a/58318009
*/
public final class ResourceCopy {
/**
* URI prefix for JAR files.
*/
private static final String JAR_URI_PREFIX = "jar:file:";
/**
* The default buffer size.
*/
private static final int BUFFER_SIZE = 8 * 1024;
/**
* Copies a set of resources into a temporary directory, optionally
* preserving the paths of the resources.
*
* @param preserve
* Whether the files should be placed directly in the directory
* or the source path should be kept
* @param paths
* The paths to the resources
* @return The temporary directory
* @throws IOException
* If there is an I/O error
*/
public File copyResourcesToTempDir(final boolean preserve, final String... paths) throws IOException {
final File parent = new File(System.getProperty("java.io.tmpdir"));
File directory;
do {
directory = new File(parent, String.valueOf(System.nanoTime()));
} while (!directory.mkdir());
return this.copyResourcesToDir(directory, preserve, paths);
}
/**
* Copies a set of resources into a directory, preserving the paths and
* names of the resources.
*
* @param directory
* The target directory
* @param preserve
* Whether the files should be placed directly in the directory
* or the source path should be kept
* @param paths
* The paths to the resources
* @return The target directory
* @throws IOException
* If there is an I/O error
*/
public File copyResourcesToDir(final File directory, final boolean preserve, final String... paths)
throws IOException {
for (final String path : paths) {
final File target;
if (preserve) {
target = new File(directory, path);
target.getParentFile().mkdirs();
} else {
target = new File(directory, new File(path).getName());
}
this.writeToFile(Thread.currentThread().getContextClassLoader().getResourceAsStream(path), target);
}
return directory;
}
/**
* Copies a resource directory from inside a JAR file to a target directory.
*
* @param source
* The JAR file
* @param path
* The path to the directory inside the JAR file
* @param target
* The target directory
* @throws IOException
* If there is an I/O error
*/
public void copyResourceDirectory(final JarFile source, final String path, final File target) throws IOException {
final Enumeration<JarEntry> entries = source.entries();
final String newpath = String.format("%s/", path);
while (entries.hasMoreElements()) {
final JarEntry entry = entries.nextElement();
if (entry.getName().startsWith(newpath) && !entry.isDirectory()) {
final File dest = new File(target, entry.getName().substring(newpath.length()));
final File parent = dest.getParentFile();
if (parent != null) {
parent.mkdirs();
}
this.writeToFile(source.getInputStream(entry), dest);
}
}
}
/**
* The JAR file containing the given class.
*
* @param clazz
* The class
* @return The JAR file, or an empty Optional if the class was not loaded from a JAR
* @throws IOException
* If there is an I/O error
*/
public Optional<JarFile> jar(final Class<?> clazz) throws IOException {
final String path = String.format("/%s.class", clazz.getName().replace('.', '/'));
final URL url = clazz.getResource(path);
Optional<JarFile> optional = Optional.empty();
if (url != null) {
final String jar = url.toString();
final int bang = jar.indexOf('!');
if (jar.startsWith(ResourceCopy.JAR_URI_PREFIX) && bang != -1) {
optional = Optional.of(new JarFile(jar.substring(ResourceCopy.JAR_URI_PREFIX.length(), bang)));
}
}
return optional;
}
/**
* Writes an input stream to a file.
*
* @param input
* The input stream
* @param target
* The target file
* @throws IOException
* If there is an I/O error
*/
private void writeToFile(final InputStream input, final File target) throws IOException {
// try-with-resources closes both streams even if the copy fails
try (InputStream in = input; OutputStream output = Files.newOutputStream(target.toPath())) {
final byte[] buffer = new byte[ResourceCopy.BUFFER_SIZE];
int length = in.read(buffer);
while (length > 0) {
output.write(buffer, 0, length);
length = in.read(buffer);
}
}
}
}
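The cleanup step that the answer leaves to the reader could be sketched as follows. This is only an illustration under assumptions: the class and method names are made up, and it assumes the libraries were extracted into a dedicated subdirectory rather than directly into java.io.tmpdir as in MainClass above (recursively deleting the shared temp folder would be wrong).

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper for removing an extraction directory at shutdown.
final class TempDirCleanup {

    /** Deletes a directory tree, files first and directories afterwards. */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(dir); // the directory is empty at this point
                return FileVisitResult.CONTINUE;
            }
        });
    }

    /** Registers a JVM shutdown hook that removes the given directory. */
    static void deleteOnShutdown(Path root) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                deleteRecursively(root);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder file standing in for an extracted native library
        Path dir = Files.createTempDirectory("jna-libs-");
        Files.createFile(dir.resolve("libdependency.so"));
        deleteOnShutdown(dir);
    }
}
```

One caveat: on Windows, a DLL that is still loaded by the process may refuse to be deleted, so the hook may leave files behind there.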
I'm trying to write my own Crawljax 3.6 plugin in Java. It should tell Crawljax, a well-known web crawler, to also download the files it finds on web pages (PDFs, images, and so on). I don't want only the HTML or the actual DOM tree; I would like access to the files (PDF, JPG) it finds.
How can I tell crawljax to download PDF files, images and so on?
Thanks for any help!
This is what I have so far: a new class using the default plugin (CrawlOverview):
import java.io.File;
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import org.apache.commons.io.FileUtils;
import com.crawljax.browser.EmbeddedBrowser.BrowserType;
import com.crawljax.condition.NotXPathCondition;
import com.crawljax.core.CrawlSession;
import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.BrowserConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;
import com.crawljax.core.configuration.Form;
import com.crawljax.core.configuration.InputSpecification;
import com.crawljax.plugins.crawloverview.CrawlOverview;
/**
* Example of running Crawljax with the CrawlOverview plugin on a single-page
* web app. The crawl will produce output using the {@link CrawlOverview}
* plugin.
*/
public final class Main {
private static final long WAIT_TIME_AFTER_EVENT = 200;
private static final long WAIT_TIME_AFTER_RELOAD = 20;
private static final String URL = "http://demo.crawljax.com";
/**
* Run this method to start the crawl.
*
* @throws IOException
* when the output folder cannot be created or emptied.
*/
public static void main(String[] args) throws IOException {
CrawljaxConfigurationBuilder builder = CrawljaxConfiguration
.builderFor(URL);
builder.addPlugin(new CrawlOverview());
builder.crawlRules().insertRandomDataInInputForms(false);
// click these elements
builder.crawlRules().clickDefaultElements();
builder.crawlRules().click("div");
builder.crawlRules().click("a");
builder.setMaximumStates(10);
builder.setMaximumDepth(3);
// Set timeouts
builder.crawlRules().waitAfterReloadUrl(WAIT_TIME_AFTER_RELOAD,
TimeUnit.MILLISECONDS);
builder.crawlRules().waitAfterEvent(WAIT_TIME_AFTER_EVENT,
TimeUnit.MILLISECONDS);
// Use a single Firefox instance (the second argument is the number of browsers).
builder.setBrowserConfig(new BrowserConfiguration(BrowserType.FIREFOX,
1));
CrawljaxRunner crawljax = new CrawljaxRunner(builder.build());
crawljax.call();
}
}
As far as images are concerned, I don't see any problem; Crawljax loads these just fine for me.
On the PDF topic:
Unfortunately Crawljax is hardcoded to skip links to PDF files.
See com.crawljax.core.CandidateElementExtractor:342:
/**
* @param href
* the string to check
* @return true if href has the pdf or ps pattern.
*/
private boolean isFileForDownloading(String href) {
final Pattern p = Pattern.compile(".+.pdf|.+.ps|.+.zip|.+.mp3");
Matcher m = p.matcher(href);
if (m.matches()) {
return true;
}
return false;
}
This could be solved by modifying the Crawljax source and introducing a configuration option for the pattern above.
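As a rough idea of what such a configuration option might look like, here is a sketch of a matcher built from a configurable extension list. DownloadLinkFilter is a hypothetical class, not part of Crawljax; unlike the hardcoded pattern, it escapes the dot before the extension and ignores case.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical replacement for the hardcoded check; not Crawljax API.
final class DownloadLinkFilter {
    private final Pattern pattern;

    /** @param extensions file extensions to skip, without the leading dot */
    DownloadLinkFilter(List<String> extensions) {
        // Quote each extension so it is matched literally
        String alternatives = extensions.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // "\\." matches a literal dot, unlike the bare "." in the original
        this.pattern = Pattern.compile(".+\\.(?:" + alternatives + ")",
                Pattern.CASE_INSENSITIVE);
    }

    boolean isFileForDownloading(String href) {
        return pattern.matcher(href).matches();
    }

    public static void main(String[] args) {
        // Leaving "pdf" out of the list lets PDF links through
        DownloadLinkFilter filter =
                new DownloadLinkFilter(Arrays.asList("ps", "zip", "mp3"));
        System.out.println(filter.isFileForDownloading("http://example.com/report.pdf"));  // false
        System.out.println(filter.isFileForDownloading("http://example.com/archive.zip")); // true
    }
}
```

Like the original, this uses matches(), so a link with a query string after the extension is not recognized; the sketch also assumes a non-empty extension list.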
After that, Selenium's limitations regarding non-HTML files apply: a PDF is either displayed in Firefox's JavaScript PDF viewer, triggers a download pop-up, or is downloaded directly. It is somewhat possible to interact with the JavaScript viewer. It is not possible to interact with the download pop-up, but if automatic download is enabled, the file is saved to disk.
If you would like to set Firefox to automatically download files without popping up a download dialog:
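import javax.inject.Provider;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxProfile;
import com.crawljax.browser.EmbeddedBrowser;
import com.crawljax.browser.WebDriverBackedEmbeddedBrowser;
static class MyFirefoxProvider implements Provider<EmbeddedBrowser> {
@Override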
public EmbeddedBrowser get() {
FirefoxProfile profile = new FirefoxProfile();
profile.setPreference("browser.download.folderList", 2);
profile.setPreference("browser.download.dir", "/tmp");
profile.setPreference("browser.helperApps.neverAsk.saveToDisk",
"application/octet-stream,application/pdf,application/x-gzip");
// disable Firefox's built-in PDF viewer
profile.setPreference("pdfjs.disabled", true);
// disable Adobe Acrobat PDF preview plugin
profile.setPreference("plugin.scan.plid.all", false);
profile.setPreference("plugin.scan.Acrobat", "99.0");
FirefoxDriver driver = new FirefoxDriver(profile);
return WebDriverBackedEmbeddedBrowser.withDriver(driver);
}
}
And use the newly created MyFirefoxProvider:
BrowserConfiguration bc =
new BrowserConfiguration(BrowserType.FIREFOX, 1, new MyFirefoxProvider());
Obtain the links manually using Jsoup: apply the CSS selector a[href] to the result of getStrippedDom(), iterate over the elements, and download each one with an HttpURLConnection / HttpsURLConnection.
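The download step could look roughly like this. It is a sketch using only the JDK: the Jsoup link extraction is left out, and LinkDownloader and its naming rule for the target file are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper; the links themselves would come from Jsoup.
final class LinkDownloader {

    /** Saves the resource behind the URL into targetDir and returns the file. */
    static Path download(URL url, Path targetDir) throws IOException {
        URLConnection connection = url.openConnection();
        if (connection instanceof HttpURLConnection) {
            // Also covers HttpsURLConnection, which is a subclass
            int status = ((HttpURLConnection) connection).getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status + " for " + url);
            }
        }
        // Use the last path segment as the file name
        String path = url.getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        if (name.isEmpty()) {
            name = "download.bin"; // arbitrary fallback for URLs ending in '/'
        }
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(name);
        try (InputStream in = connection.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: LinkDownloader [URL] [target directory]");
            return;
        }
        System.out.println(download(new URL(args[0]), Paths.get(args[1])));
    }
}
```

Relative href values would first have to be resolved against the page URL before being passed in, and files sharing the same last path segment overwrite each other in this simple version.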
I'm using an RDF crawler, and one of my classes has these imports:
import edu.unika.aifb.rdf.crawler.*;
import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.util.FileManager;
These imports are flagged as errors. I tried attaching the Jena packages, but it did not make any difference.
Update:
Full SampleCrawl.java class content:
import java.util.*;
import edu.unika.aifb.rdf.crawler.*;
/**
* Call this class with 3 arguments - URL to crawl to,
* depth and time in seconds
*/
public class SampleCrawl {
/**
* @param uRI
* @param depth
* @param time
*/
@SuppressWarnings("rawtypes")
public SampleCrawl(Vector uRI, Vector hf, int depth, int time){
// Initialize Crawling parameters
CrawlConsole c = new CrawlConsole(uRI,hf,depth,time);
// get an ontology file from its local location
// (OPTIONAL)
c.setLocalNamespace("http://www.daml.org/2000/10/daml-ont","c:\\temp\\rdf\\schemas\\daml-ont.rdf");
// set all the paths to get all the results
c.setLogPath("c:\\temp\\crawllog.xml");
c.setCachePath("c:\\temp\\crawlcache.txt");
c.setModelPath("c:\\temp\\crawlmodel.rdf");
try{
// crawl and get RDF model
c.start();
// This writes all three result files out
c.writeResults();
}catch(Exception e){
// Report the failure instead of silently swallowing it
e.printStackTrace();
}
}
/**
* @param args
* @throws Exception
*/
@SuppressWarnings({ "rawtypes", "unchecked" })
public static void main(String[] args) throws Exception {
if (args.length != 3) {
System.err.println("Usage: java -cp [JARs] SampleCrawl [URL] [depth:int] [time:int]");
System.exit(0);
}
Vector uris = new Vector();
uris.add(args[0]);
// no host filtering - crawl to all hosts
Vector hostfilter = null;
/* You may want to do something else to enable host filtering:
* Vector hostfilter = new Vector();
* hostfilter.add("http://www.w3.org");
*/
int depth = 2;
int time = 60;
try {
depth = Integer.parseInt(args[1]);
time = Integer.parseInt(args[2]);
}
catch (Exception e) {
System.err.println("Illegal argument types:");
System.err.println("Argument list: URI:String depth:int time(s):int");
System.exit(0);
}
new SampleCrawl(uris,hostfilter,depth,time);
}
}
Question:
How do I get import edu.unika.aifb.rdf.crawler.*; to resolve? The error occurs on this line.
I googled the package that you're trying to import, and it appears that you're using Kaon. Assuming that's so, you have made an error in your import declaration. You have:
import edu.unika.aifb.rdf.crawler.*;
whereas the download available on SourceForge would require:
import edu.unika.aifb.rdf.rdfcrawler.*;
As an aside, it would be helpful if you would include information, such as "I'm trying to use Kaon's rdfcrawler from ..." in your question. Otherwise, we have to try to guess important details in your setup.