Relative paths in Flying Saucer XHTML?

I am using Flying Saucer to render PDF documents from XHTML strings. My code is something like:
iTextRenderer.setDocument(documentGenerator.generate(xhtmlDocumentAsString));
iTextRenderer.layout();
iTextRenderer.createPDF(outputStream);
What I'm trying to understand is, when using this method, where are relative paths in the XHTML resolved from? For example, for images or stylesheets. I am able to use this method to successfully generate a text-based document, but I need to understand how to reference my images and CSS.

The setDocument() method takes two parameters: document and url.
The url parameter is the base URL that is prepended to relative paths appearing in the XHTML, such as in img tags.
Suppose you have:
<img src="images/img1.jpg">
Now suppose the folder "images" is located at:
C:/physical/route/to/app/images/
You may use setDocument() as:
renderer.setDocument(xhtmlDoc, "file:///C:/physical/route/to/app/");
Note the trailing slash; it won't work without it.
This is the way it worked for me. I assume you could use other types of URLs, such as "http://...".
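The trailing-slash requirement matches standard relative-URL resolution. The sketch below illustrates this with plain java.net.URI rather than Flying Saucer itself, so it is an analogy under that assumption and not the library's own code path. The class name BaseUrlDemo and the method resolveAgainst are made up for the illustration.

```java
import java.net.URI;

public class BaseUrlDemo {

    // Resolves a relative reference against a base URL using java.net.URI rules
    // and returns the path component of the result.
    public static String resolveAgainst(String baseUrl, String relative) {
        return URI.create(baseUrl).resolve(relative).getPath();
    }

    public static void main(String[] args) {
        String relative = "images/img1.jpg";
        // With the trailing slash, "app/" is kept as a directory.
        System.out.println(resolveAgainst("file:///C:/physical/route/to/app/", relative));
        // Without it, "app" is treated as a file name and replaced.
        System.out.println(resolveAgainst("file:///C:/physical/route/to/app", relative));
    }
}
```

With the slash the image resolves under .../app/images/; without it the last segment is dropped and the lookup goes to .../to/images/, which does not exist in the layout above.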

I worked on this recently; here is what worked for me.
In real life, your XHTML document points to multiple resources (images, CSS, etc.) with relative paths.
You also have to tell Flying Saucer where to find them. They can be on your classpath or in your file system. (If they are on the network, you can just set the base URL, so this approach isn't needed.)
So you have to extend ITextUserAgent like this:
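// Imports required at the top of the enclosing file:
// java.io.File, java.io.FileInputStream, java.io.InputStream,
// org.xhtmlrenderer.pdf.ITextOutputDevice, org.xhtmlrenderer.pdf.ITextUserAgent
private static class ResourceLoaderUserAgent extends ITextUserAgent {
    public ResourceLoaderUserAgent(ITextOutputDevice outputDevice) {
        super(outputDevice);
    }

    protected InputStream resolveAndOpenStream(String uri) {
        // Let Flying Saucer try its default resolution first.
        InputStream is = super.resolveAndOpenStream(uri);
        // Keep only the last path segment, e.g. "logo.png" from "images/logo.png".
        String fileName = "";
        try {
            String[] split = uri.split("/");
            fileName = split[split.length - 1];
        } catch (Exception e) {
            return null;
        }
        if (is == null) {
            // Fallback 1: the resource is on the classpath.
            try {
                is = ResourceLoaderUserAgent.class.getResourceAsStream("/etc/images/" + fileName);
            } catch (Exception e) {
                // Ignored: fall through to the file system lookup.
            }
        }
        if (is == null) {
            // Fallback 2: the resource is in the file system.
            try {
                is = new FileInputStream(new File("C:\\images\\" + fileName));
            } catch (Exception e) {
                // Ignored: null is returned, meaning "not found".
            }
        }
        return is;
    }
}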
And you use it like this:
// Output stream containing the result
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ITextRenderer renderer = new ITextRenderer();
ResourceLoaderUserAgent callback = new ResourceLoaderUserAgent(renderer.getOutputDevice());
callback.setSharedContext(renderer.getSharedContext());
renderer.getSharedContext().setUserAgentCallback(callback);
renderer.setDocumentFromString(htmlSourceAsString);
renderer.layout();
renderer.createPDF(baos);
renderer.finishPDF();
Cheers.

The best solution for me was:
renderer.setDocumentFromString(htmlContent, new ClassPathResource("/META-INF/pdfTemplates/").getURL().toExternalForm());
Then all the styles and images referenced in the HTML, such as
<img class="logo" src="images/logo.png" />
<link rel="stylesheet" type="text/css" media="all" href="css/style.css"></link>
were rendered as expected.

AtilaUy's answer is spot-on for the default way things work in Flying Saucer.
The more general answer is that it asks the UserAgentCallback. It will call setBaseURL() on the UserAgentCallback when the document is set. Then it will call resolveURI() to resolve relative URLs, and ultimately resolveAndOpenStream() when it wants to read the actual resource data.
This answer is probably too late for you to make use of it, but I needed an answer like this when I set out, and setting a custom user agent callback is the solution I ended up using.

You can use either file paths, which should be absolute, or http:// URLs. Relative paths can work but aren't portable, because they are resolved against the directory you ran your program from.
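One way to avoid depending on the working directory is to turn the resource directory into an absolute file: URL once, and pass that as the base URL. The following is a minimal sketch using only java.nio; the class name BaseUrlFromDirectory, the method toBaseUrl, and the example directory are assumptions made for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class BaseUrlFromDirectory {

    // Converts a directory path (relative or absolute) into an absolute
    // file: URL string that always ends with a slash.
    public static String toBaseUrl(Path directory) {
        String url = directory.toAbsolutePath().normalize().toUri().toString();
        // toUri() only appends the slash when the directory exists on disk,
        // so add it explicitly when it is missing.
        return url.endsWith("/") ? url : url + "/";
    }

    public static void main(String[] args) {
        // Hypothetical resource directory, resolved against the current working directory.
        System.out.println(toBaseUrl(Paths.get("templates", "pdf")));
    }
}
```

The resulting string could then be handed to setDocument() or setDocumentFromString() as the url argument. The relative path is still resolved against the working directory, but only once and in one visible place.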

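I think an easier approach would be to rewrite every img src to an absolute URL before rendering (this snippet appears to use HtmlUnit, with result being an HtmlPage):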
DomNodeList<DomElement> images = result.getElementsByTagName("img");
for (DomElement e : images) {
e.setAttribute("src", result.getFullyQualifiedUrl(e.getAttribute("src")).toString());
}

Another way to resolve paths is to override UserAgentCallback#resolveURI, which offers a more dynamic behavior than a fixed URL (as in AtilaUy's answer, which looks quite valid for most cases).
This is how I make an XHTMLPane use dynamically-generated stylesheets:
public static UserAgentCallback interceptCssResourceLoading(
    final UserAgentCallback defaultAgentCallback,
    final Map< URI, CSSResource > cssResources
) {
  return new UserAgentCallback() {
    @Override
    public CSSResource getCSSResource( final String uriAsString ) {
      final URI uri = uriQuiet( uriAsString ) ; // Just rethrow unchecked exception.
      final CSSResource cssResource = cssResources.get( uri ) ;
      if( cssResource == null ) {
        return defaultAgentCallback.getCSSResource( uriAsString ) ;
      } else {
        return cssResource ;
      }
    }
    @Override
    public String resolveURI( final String uriAsString ) {
      final URI uri = uriQuiet( uriAsString ) ;
      if( cssResources.containsKey( uri ) ) {
        return uriAsString ;
      } else {
        return defaultAgentCallback.resolveURI( uriAsString ) ;
      }
    }
    // Delegate all other methods to defaultAgentCallback.
  } ;
}
Then I use it like this:
final UserAgentCallback defaultAgentCallback =
xhtmlPanel.getSharedContext().getUserAgentCallback() ;
xhtmlPanel.getSharedContext().setUserAgentCallback(
interceptCssResourceLoading( defaultAgentCallback, cssResources ) ) ;
xhtmlPanel.setDocumentFromString( xhtml, null, new XhtmlNamespaceHandler() ) ;

Related

How to remove a specific image from a PDF with PDFBox

I need to remove a specific image from a PDF file according to its metadata. Sadly, all the examples I can find on the Internet use deprecated methods.
I wrote something like this:
try (PDDocument doc = PDDocument.load(new ByteArrayInputStream(pdf))) {
doc.getPages().forEach(page ->
{
PDResources resources = page.getResources();
List<COSName> itemsToRemove = new ArrayList<>();
resources.getXObjectNames().forEach(propertyName -> {
if(!resources.isImageXObject(propertyName)) {
return;
}
PDXObject pdxObject = resources.getXObject(propertyName);
PDImageXObject pdImageXObject = (PDImageXObject)pdxObject;
PDMetadata metadata = pdImageXObject.getMetadata();
if(checkMetadata(metadata)){
// What should I use here?
page.getCOSObject().removeItem(propertyName);
}
});
// Should I use page.setResources(resources); ?
});
doc.save(baos);
} catch (Exception e) {
//Code here
}

It works the same way as in the RemoveAllText.java example, just with a different operator.
Use the code from that example, but look for "Do" (the operator that paints an XObject) instead of "Tj".
Of course, if you need to load metadata and so on, you should enumerate and check the images through the page resources (like in my example).

Java How to Normalise a URL and Remove Fragment

How can I normalise a URL in Java to remove the fragment, i.e. go from https://www.website.com#something to https://www.website.com?
This is possible with the URL.Normalize code, although in this specific use case I've only got a full absolute URL which needs to remain intact.
I'd like to be able to modify this code slightly to remove the fragment from the URL:
//The website below is just an example. In reality, this URL is unknown and could be anything. Both with and without a fragment depending on the use case
URL absUrl = new URL("https://www.website.com#something");
My thinking so far is that this is only possible by breaking the URL down into protocol + domain + path and then joining it all back together. That does appear to work, but there must be a more elegant way of doing this.

Fragment removal is fairly simple using the conversion methods toURI and toURL. So to convert a URL to a URI:
URL url = /*what have you*/ …
URI u = url.toURI();
To remove any fragment from the URI:
if( u.getFragment() != null ) { // Remake with same parts, less the fragment:
u = new URI( u.getScheme(), u.getSchemeSpecificPart(), /*fragment*/null ); }
In reconstructing a URI from its parts like that, it’s important to use the decoded getters (as shown), not the corresponding raw ones. For authority on this usage, see e.g. the Identities section of the java.net.URI API documentation.
To convert the result back to a URL:
url = u.toURL();
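Put together, the three steps above might look like the following sketch. The class name FragmentStripper and the method name withoutFragment are made up for the example, and wrapping the checked exceptions in IllegalArgumentException is just one possible choice, not something the steps above prescribe.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class FragmentStripper {

    // Returns the same URI without its fragment; URIs without a fragment are returned unchanged.
    public static URI withoutFragment(URI u) {
        if (u.getFragment() == null) {
            return u;
        }
        try {
            // Rebuild from the decoded parts, passing null as the fragment.
            return new URI(u.getScheme(), u.getSchemeSpecificPart(), null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot rebuild " + u, e);
        }
    }

    // URL variant: converts to a URI, strips the fragment, and converts back.
    public static URL withoutFragment(URL url) {
        try {
            return withoutFragment(url.toURI()).toURL();
        } catch (URISyntaxException | MalformedURLException e) {
            throw new IllegalArgumentException("Cannot rebuild " + url, e);
        }
    }

    public static void main(String[] args) throws Exception {
        URL absUrl = URI.create("https://www.website.com#something").toURL();
        System.out.println(withoutFragment(absUrl));
    }
}
```

Because the scheme-specific part includes the authority, path, and query, a query string such as ?b=c survives the round trip and only the fragment is dropped.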

java.net.URL exposes the fragment only through the read-only getRef(), so it cannot be removed in place. But you can convert a URL into a URI and back to remove a fragment. I did it like this:
URL url;
...
if (url.toString().contains("#")) {
URI uri = null;
try {
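// This four-argument constructor takes no query or port, so the URI built here carries neither.
uri = new URI(url.getProtocol(), url.getHost(), url.getPath(), null);
String file = "";
if (uri.getPath() != null) {
file += uri.getPath();
}
// Take the query from the original URL, restoring the "?" separator.
if (url.getQuery() != null) {
file += "?" + url.getQuery();
}
// Likewise take the port from the original URL (-1 means the protocol default).
url = new URL(uri.getScheme(), uri.getHost(), url.getPort(), file);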
} catch (URISyntaxException e) {
...
} catch (MalformedURLException e) {
...
}
}

Error: org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage

I am trying to extract images from a PDF using PDFBox. I took help from this post. It worked for some of the PDFs, but for most others it did not. For example, I am not able to extract the figures in this file.
After doing some research I found that PDResources.getImages is deprecated, so I am using PDResources.getXObjects(). With this, I am not able to extract any image from the PDF and instead get this message on the console:
org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage
Now I am stuck and unable to find a solution. Please assist if you can.
Update, in reply to comments:
I am using pdfbox-1.8.10
Here is the code:
public void getimg ()throws Exception {
try {
String sourceDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/inputs/Yavaa.pdf";
String destinationDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/outputs/";
File oldFile = new File(sourceDir);
if (oldFile.exists()){
PDDocument document = PDDocument.load(sourceDir);
List<PDPage> list = document.getDocumentCatalog().getAllPages();
String fileName = oldFile.getName().replace(".pdf", "_cover");
int totalImages = 1;
for (PDPage page : list) {
PDResources pdResources = page.getResources();
Map pageImages = pdResources.getXObjects();
if (pageImages != null){
Iterator imageIter = pageImages.keySet().iterator();
while (imageIter.hasNext()){
String key = (String) imageIter.next();
Object obj = pageImages.get(key);
if(obj instanceof PDXObjectImage) {
PDXObjectImage pdxObjectImage = (PDXObjectImage) obj;
pdxObjectImage.write2file(destinationDir + fileName+ "_" + totalImages);
totalImages++;
}
}
}
}
} else {
System.err.println("File not exist");
}
}
catch (Exception e){
System.err.println(e.getMessage());
}
}
Partial solution:
I have solved the problem of the error message and have updated the post with the corrected code. However, the underlying problem remains: I am still not able to extract the images from a few of the files, such as the one mentioned in this post. Is there any solution for that?

The first problem with the original code is that XObjects can be PDXObjectImage or PDXObjectForm, so you need to check the type with instanceof before casting. The second problem is that the code doesn't walk PDXObjectForm recursively; forms can have resources too. The third problem (only in 1.8) is that you used getResources() instead of findResources(); getResources() doesn't check higher levels.
Code for 1.8 can be found here:
https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractImages.java?view=markup
Code for 2.0 can be found here:
https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup&sortby=date
(Even these are not always perfect, see this answer)
The fourth problem is that your file doesn't have any XObjects at all. All the "graphics" are really vector drawings, which can't be "extracted" like embedded images. All you could do is convert the PDF pages to images, and then mark and cut out what you need.

How to know whether a string path is Web URL or a File based

I have a text field that acquires location information (a String) from the user. It could be a file directory (e.g. C:\directory) or a web URL (e.g. http://localhost:8008/resouces). The system will read a predetermined metadata file from that location.
Given the input string, how can I effectively detect whether the location is file-based or a web URL?
So far I have tried.
URL url = new URL(location); // throws MalformedURLException if it is file-based
url.getProtocol().equalsIgnoreCase("http");
File file = new File(location); // does not throw if it is a URL
file.exists(); // returns false if it is a URL
I am still struggling to find the best way to handle both scenarios. :-(
I would prefer not to check the path explicitly for a prefix such as http:// or https://.
Is there an elegant and proper way of doing this?

You can check if the location starts with http:// or https://:
String s = location.trim().toLowerCase();
boolean isWeb = s.startsWith("http://") || s.startsWith("https://");
Or you can use the URI class instead of URL; unlike the URL class, URI does not throw MalformedURLException:
URI u = new URI(location);
boolean isWeb = "http".equalsIgnoreCase(u.getScheme())
|| "https".equalsIgnoreCase(u.getScheme());
However, new URI() may throw URISyntaxException, for example if location contains a backslash. The best way is either to use the prefix check (my first suggestion) or to create a URL and catch MalformedURLException; if it is thrown, you know the string cannot be a valid web URL.
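The URI-based check can be wrapped so that a URISyntaxException (for instance from a Windows path with backslashes) simply means "not a web URL". This is only a sketch of that idea; the class LocationClassifier, the enum Kind, and the method classify are hypothetical names, and treating everything that is not http/https as a file is an assumption that may not suit every application.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class LocationClassifier {

    public enum Kind { WEB, FILE }

    // Classifies a user-supplied location as a web URL (http/https) or a file path.
    public static Kind classify(String location) {
        String s = location.trim();
        try {
            String scheme = new URI(s).getScheme();
            if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                return Kind.WEB;
            }
        } catch (URISyntaxException e) {
            // Not parseable as a URI (e.g. it contains backslashes): treat it as a file path.
        }
        return Kind.FILE;
    }

    public static void main(String[] args) {
        System.out.println(classify("http://localhost:8008/resouces"));
        System.out.println(classify("C:\\directory"));
    }
}
```

This does not check that the file exists; a File.exists() call can follow for the FILE case.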

If you're open to the use of a try/catch scenario being "elegant", here is a way that is more specific:
try {
processURL(new URL(location));
}
catch (MalformedURLException ex){
File file = new File(location);
if (file.exists()) {
processFile(file);
}
else {
throw new PersonalException("Can't find the file");
}
}
This way, you're getting the automatic URL syntax checking and, that failing, the check for file existence.

You can try:
static public boolean isValidURL(String urlStr) {
try {
URI uri = new URI(urlStr);
// Constant-first comparison is null-safe: getScheme() returns null for a scheme-less string.
return "http".equals(uri.getScheme()) || "https".equals(uri.getScheme());
}
catch (Exception e) {
return false;
}
}
Note that this returns false for anything that invalidates the URL, and also for a non-http/https URL. A malformed URL is not necessarily an actual file name, and a well-formed file name can refer to a nonexistent file, so use it in conjunction with your file existence check.

public boolean urlIsFile(String input) {
if (input.startsWith("file:")) return true;
try { return new File(input).exists(); } catch (Exception e) {return false;}
}
This method is hassle-free and returns true whenever the input is a file reference. Other solutions don't, and cannot, cover the plethora of protocol schemes available, such as ftp, sftp, scp, or any future protocol implementation, whereas this one only asks whether the input is a file. The caveat is that the file must exist if the input doesn't begin with the file: scheme.
If you read the function's logic in light of its name, you will see that returning false for a nonexistent direct path is not a bug; it is simply the fact.

ResourceTool : to recover a jpg with ResourceNode [velocity]

While researching this, I would like some advice about the following situation:
A user on my website chooses a parcel to send; when he validates the choice, some carriers appear as results. Some carriers have different offers, with different logos located in a special directory.
The business logic I would like to use is:
If I find in the directory the particular logo corresponding to the special offer, I will display that logo on the web page with the special offer.
I chose to do this with the ResourceTool from Velocity.
I have to implement two methods, getLogo() and getLabel().
getLogo() will look for the special logo.
I am thinking of using this method to retrieve the object:
public static ResourceNode getResource(Context context, ResourceType resourceType, String...keys) {
try {
if (null != ResourcesTool.instance) {
ResourceNode resource = ResourcesTool.instance.getResourceSet(context, resourceType);
if (null != resource) {
Deque < String > keyDeque = new ArrayDeque < > ();
for (String key: keys) {
keyDeque.add(key);
}
return (ResourceNode) resource.sub(keyDeque);
}
}
} catch (Exception e) {
BoxtaleLogger.debug("[ResourcesTool.getResource] error: ", e);
}
return null;
}
Now I am looking for an example of how to use this method to retrieve the different .jpg files.
Question 2: I don't understand the meaning of Context context in this method.
Also, resourceType is an enum that is either a String or a picture (the logo, in fact).

All right, I found it:
public String getLogo(ResourceNode node){
//readable variables
String ope_code = (String)((Instance)get("operateur")).get("ope_code");
String paysDest = (String)((Instance)db.getEntity("tab_pays").fetch((Integer) get("pz_id"))).get("pz_iso");
// Most specific logo first: operator + service + destination country.
String path = node.get(ope_code+"_"+get("srv_code")+"_"+paysDest+".png");
if (path == null){
// Fall back to operator + service.
path = node.get(ope_code+"_"+get("srv_code")+".png");
if (path == null){
// Fall back to the operator's generic logo.
path = node.get(ope_code+".png");
}
}
return path ;
}
I am testing the method now and will report back.
