I am using the java2word library for generating a word document from IBM Notes Database data.
My problem is that the document I am getting is interpreted as containing errors by ms word and only recoverable by text.
When I click the "Go to" button on the Word Repair pop up window (once I opened my doc in recovery) nothing happens and from the dialog, I can't tell anything at all. (in German)
I have Successfully used 2 other libraries with my Agent Class so I am pretty sure it can't be the data Collection and parsing to my Document Writing class.
The following Code runs successfully.
DataRow Class used for temporarily storing data:
public class DataRow {
String Date;
String VorgangDesc;
String DayShort;
double Hours;
public DataRow(String Dayshort, String Vorgangdesc, double hours, String date1){
Date=date1;
VorgangDesc=Vorgangdesc;
DayShort=Dayshort;
Hours=hours;
}
}
BerichtsHeft Class used for implementing java2word:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import lotus.domino.Session;
import word.api.interfaces.IDocument;
import word.api.interfaces.IElement;
import word.utils.TestUtils;
import word.utils.Utils;
import word.w2004.Document2004;
import word.w2004.Document2004.Encoding;
import word.w2004.elements.BreakLine;
import word.w2004.elements.Table;
import word.w2004.elements.tableElements.TableEle;
public class BerichtsHeft {
public String Name;
public String startD;
public Session CurrentS;
public int TableCount=1;
public String Abteilung;
public int AusbildungsJahr;
String[] ItemsLastRow = new String[] {"..." , "...", "..."};
String[] ItemsFirstRow = new String[] {"Ausbildungsnachweis", "Nr." + TableCount, "Woche vom" + startD + "bis" + "e end"};
PrintWriter writer = null;
Table CurrentTable;
IDocument myDoc;
public BerichtsHeft(String strName, String startDate, Session CurrentSes, String abteilung){
this.Name=strName;
this.startD=startDate;
this.CurrentS=CurrentSes;
this.Abteilung=abteilung;
this.myDoc = new Document2004();
myDoc.encoding(Encoding.UTF_8);
}
public void Spacer(){
myDoc.addEle(BreakLine.times(1).create());
}
public void createTable(ArrayList<DataRow> DataList){
Table tbl = new Table();
CurrentTable = tbl;
String[] ItemsFlexible = new String[3];
AddFirstRow(ItemsFirstRow);
for(int ij2=0; ij2<DataList.size(); ij2++){
ItemsFlexible[0]=DataList.get(ij2).DayShort.toString();
ItemsFlexible[1]=DataList.get(ij2).VorgangDesc.toString();
ItemsFlexible[2]=Double.toString(DataList.get(ij2).Hours);
AddRow(ItemsFlexible);
}
AddLastRow(ItemsLastRow);
myDoc.addEle(CurrentTable);
TableCount++;
Spacer();
}
public void AddFirstRow(String[] Items){
CurrentTable.addTableEle(TableEle.TH, Items);
}
public void AddRow(String[] items){
CurrentTable.addTableEle(TableEle.TD, items);
}
public void AddLastRow(String[] items){
CurrentTable.addTableEle(TableEle.TD, items);
}
public void logNext(){
}
public void SaveDoc(){
File fileObj = new File("C:\\temp\\test2.doc");
PrintWriter writer = null;
try {
writer = new PrintWriter(fileObj);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
String myWord = myDoc.getContent();
writer.println(myWord);
writer.close();
}
}
Pastebin link to the text formated word document
PasteBin link to text Word Doc
What is the best technique to find the origin of these errors?
I found the Error in the Text document, it was caused by a special character, because the encoding was UTF_8 and not ISO8859_1.
I wrote a pattern. I have a list for conditions(gettin rules from json).Data(json) is coming form kafka server . I want to filter the data with this list. But it is not working. How can I do that?
I am not sure about keyedstream and alarms in for. Can flink work like this?
main program:
package cep_kafka_eample.cep_kafka;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import com.google.gson.Gson;
import com.google.gson.JsonArray;
import com.google.gson.JsonParser;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.JSONDeserializationSchema;
import util.AlarmPatterns;
import util.Rules;
import util.TypeProperties;
import java.io.FileReader;
import java.util.*;
public class MainClass {
public static void main( String[] args ) throws Exception
{
ObjectMapper mapper = new ObjectMapper();
JsonParser parser = new JsonParser();
Object obj = parser.parse(new FileReader(
"c://new 5.json"));
JsonArray array = (JsonArray)obj;
Gson googleJson = new Gson();
List<Rules> ruleList = new ArrayList<>();
for(int i = 0; i< array.size() ; i++) {
Rules jsonObjList = googleJson.fromJson(array.get(i), Rules.class);
ruleList.add(jsonObjList);
}
//apache kafka properties
Properties properties = new Properties();
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("bootstrap.servers", "localhost:9092");
//starting flink
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(1000).setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
//get kafka values
FlinkKafkaConsumer010<ObjectNode> myConsumer = new FlinkKafkaConsumer010<>("demo", new JSONDeserializationSchema(),
properties);
List<Pattern<ObjectNode,?>> patternList = new ArrayList<>();
DataStream<ObjectNode> dataStream = env.addSource(myConsumer);
dataStream.windowAll(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)));
DataStream<ObjectNode> keyedStream = dataStream;
//get pattern list, keyeddatastream
for(Rules rules : ruleList){
List<TypeProperties> typePropertiesList = rules.getTypePropList();
for (int i = 0; i < typePropertiesList.size(); i++) {
TypeProperties typeProperty = typePropertiesList.get(i);
if (typeProperty.getGroupType() != null && typeProperty.getGroupType().equals("group")) {
keyedStream = keyedStream.keyBy(
jsonNode -> jsonNode.get(typeProperty.getPropName().toString())
);
}
}
Pattern<ObjectNode,?> pattern = new AlarmPatterns().getAlarmPattern(rules);
patternList.add(pattern);
}
//CEP pattern and alarms
List<DataStream<Alert>> alertList = new ArrayList<>();
for(Pattern<ObjectNode,?> pattern : patternList){
PatternStream<ObjectNode> patternStream = CEP.pattern(keyedStream, pattern);
DataStream<Alert> alarms = patternStream.select(new PatternSelectFunction<ObjectNode, Alert>() {
private static final long serialVersionUID = 1L;
public Alert select(Map<String, List<ObjectNode>> map) throws Exception {
return new Alert("new message");
}
});
alertList.add(alarms);
}
env.execute("Flink CEP monitoring job");
}
}
getAlarmPattern:
package util;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.IterativeCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import com.fasterxml.jackson.databind.node.ObjectNode;
public class AlarmPatterns {
public Pattern<ObjectNode, ?> getAlarmPattern(Rules rules) {
//MySimpleConditions conditions = new MySimpleConditions();
Pattern<ObjectNode, ?> alarmPattern = Pattern.<ObjectNode>begin("first")
.where(new IterativeCondition<ObjectNode>() {
#Override
public boolean filter(ObjectNode jsonNodes, Context<ObjectNode> context) throws Exception {
for (Criterias criterias : rules.getCriteriaList()) {
if (criterias.getCriteriaType().equals("equals")) {
return jsonNodes.get(criterias.getPropName()).equals(criterias.getCriteriaValue());
} else if (criterias.getCriteriaType().equals("greaterThen")) {
if (!jsonNodes.get(criterias.getPropName()).equals(criterias.getCriteriaValue())) {
return false;
}
int count = 0;
for (ObjectNode node : context.getEventsForPattern("first")) {
count += node.get("value").asInt();
}
return Integer.compare(count, 5) > 0;
} else if (criterias.getCriteriaType().equals("lessThen")) {
if (!jsonNodes.get(criterias.getPropName()).equals(criterias.getCriteriaValue())) {
return false;
}
int count = 0;
for (ObjectNode node : context.getEventsForPattern("first")) {
count += node.get("value").asInt();
}
return Integer.compare(count, 5) < 0;
}
}
return false;
}
}).times(rules.getRuleCount());
return alarmPattern;
}
}
Thanks for using FlinkCEP!
Could you provide some more details about what exactly is the error message (if any)? This will help a lot at pinning down the problem.
From a first look at the code, I can make the following observations:
At first, the line:
dataStream.windowAll(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)));
will never be executed, as you never use this stream in the rest of your program.
Second, you should specify a sink to be taken after the select(), e.g. print() method on each of your PatternStreams. If you do not do so, then your output gets discarded. You can have a look here for examples, although the list is far from exhaustive.
Finally, I would recommend adding a within() clause to your pattern, so that you do not run out of memory.
Error was from my json object. I will fix it. When i am run job on intellij cep doesn't work. When submit from flink console it works.
I know this was not possible before, but now with the following update:
https://developers.google.com/web/updates/2017/04/devtools-release-notes#screenshots
this seems to be possible using Chrome Dev Tools.
Is it possible now using Selenium in Java?
Yes it possible to take a full page screenshot with Selenium since Chrome v59. The Chrome driver has two new endpoints to directly call the DevTools API:
/session/:sessionId/chromium/send_command_and_get_result
/session/:sessionId/chromium/send_command
The Selenium API doesn't implement these commands, so you'll have to send them directly with the underlying executor. It's not straightforward, but at least it's possible to produce the exact same result as DevTools.
Here's an example with python working on a local or remote instance:
from selenium import webdriver
import json, base64
capabilities = {
'browserName': 'chrome',
'chromeOptions': {
'useAutomationExtension': False,
'args': ['--disable-infobars']
}
}
driver = webdriver.Chrome(desired_capabilities=capabilities)
driver.get("https://stackoverflow.com/questions")
png = chrome_takeFullScreenshot(driver)
with open(r"C:\downloads\screenshot.png", 'wb') as f:
f.write(png)
, and the code to take a full page screenshot :
def chrome_takeFullScreenshot(driver) :
def send(cmd, params):
resource = "/session/%s/chromium/send_command_and_get_result" % driver.session_id
url = driver.command_executor._url + resource
body = json.dumps({'cmd':cmd, 'params': params})
response = driver.command_executor._request('POST', url, body)
return response.get('value')
def evaluate(script):
response = send('Runtime.evaluate', {'returnByValue': True, 'expression': script})
return response['result']['value']
metrics = evaluate( \
"({" + \
"width: Math.max(window.innerWidth, document.body.scrollWidth, document.documentElement.scrollWidth)|0," + \
"height: Math.max(innerHeight, document.body.scrollHeight, document.documentElement.scrollHeight)|0," + \
"deviceScaleFactor: window.devicePixelRatio || 1," + \
"mobile: typeof window.orientation !== 'undefined'" + \
"})")
send('Emulation.setDeviceMetricsOverride', metrics)
screenshot = send('Page.captureScreenshot', {'format': 'png', 'fromSurface': True})
send('Emulation.clearDeviceMetricsOverride', {})
return base64.b64decode(screenshot['data'])
With Java:
public static void main(String[] args) throws Exception {
ChromeOptions options = new ChromeOptions();
options.setExperimentalOption("useAutomationExtension", false);
options.addArguments("disable-infobars");
ChromeDriverEx driver = new ChromeDriverEx(options);
driver.get("https://stackoverflow.com/questions");
File file = driver.getFullScreenshotAs(OutputType.FILE);
}
import java.lang.reflect.Method;
import java.util.Map;
import com.google.common.collect.ImmutableMap;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeDriverService;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.CommandInfo;
import org.openqa.selenium.remote.HttpCommandExecutor;
import org.openqa.selenium.remote.http.HttpMethod;
public class ChromeDriverEx extends ChromeDriver {
public ChromeDriverEx() throws Exception {
this(new ChromeOptions());
}
public ChromeDriverEx(ChromeOptions options) throws Exception {
this(ChromeDriverService.createDefaultService(), options);
}
public ChromeDriverEx(ChromeDriverService service, ChromeOptions options) throws Exception {
super(service, options);
CommandInfo cmd = new CommandInfo("/session/:sessionId/chromium/send_command_and_get_result", HttpMethod.POST);
Method defineCommand = HttpCommandExecutor.class.getDeclaredMethod("defineCommand", String.class, CommandInfo.class);
defineCommand.setAccessible(true);
defineCommand.invoke(super.getCommandExecutor(), "sendCommand", cmd);
}
public <X> X getFullScreenshotAs(OutputType<X> outputType) throws Exception {
Object metrics = sendEvaluate(
"({" +
"width: Math.max(window.innerWidth,document.body.scrollWidth,document.documentElement.scrollWidth)|0," +
"height: Math.max(window.innerHeight,document.body.scrollHeight,document.documentElement.scrollHeight)|0," +
"deviceScaleFactor: window.devicePixelRatio || 1," +
"mobile: typeof window.orientation !== 'undefined'" +
"})");
sendCommand("Emulation.setDeviceMetricsOverride", metrics);
Object result = sendCommand("Page.captureScreenshot", ImmutableMap.of("format", "png", "fromSurface", true));
sendCommand("Emulation.clearDeviceMetricsOverride", ImmutableMap.of());
String base64EncodedPng = (String)((Map<String, ?>)result).get("data");
return outputType.convertFromBase64Png(base64EncodedPng);
}
protected Object sendCommand(String cmd, Object params) {
return execute("sendCommand", ImmutableMap.of("cmd", cmd, "params", params)).getValue();
}
protected Object sendEvaluate(String script) {
Object response = sendCommand("Runtime.evaluate", ImmutableMap.of("returnByValue", true, "expression", script));
Object result = ((Map<String, ?>)response).get("result");
return ((Map<String, ?>)result).get("value");
}
}
To do this with Selenium Webdriver in Java takes a bit of work.. As hinted by Florent B. we need to change some classes uses by the default ChromeDriver to make this work. First we need to make a new DriverCommandExecutor which adds the new Chrome commands:
import com.google.common.collect.ImmutableMap;
import org.openqa.selenium.remote.CommandInfo;
import org.openqa.selenium.remote.http.HttpMethod;
import org.openqa.selenium.remote.service.DriverCommandExecutor;
import org.openqa.selenium.remote.service.DriverService;
public class MyChromeDriverCommandExecutor extends DriverCommandExecutor {
private static final ImmutableMap<String, CommandInfo> CHROME_COMMAND_NAME_TO_URL;
public MyChromeDriverCommandExecutor(DriverService service) {
super(service, CHROME_COMMAND_NAME_TO_URL);
}
static {
CHROME_COMMAND_NAME_TO_URL = ImmutableMap.of("launchApp", new CommandInfo("/session/:sessionId/chromium/launch_app", HttpMethod.POST)
, "sendCommandWithResult", new CommandInfo("/session/:sessionId/chromium/send_command_and_get_result", HttpMethod.POST)
);
}
}
After that we need to create a new ChromeDriver class which will then use this thing. We need to create the class because the original has no constructor that lets us replace the command executor... So the new class becomes:
import com.google.common.collect.ImmutableMap;
import org.openqa.selenium.Capabilities;
import org.openqa.selenium.WebDriverException;
import org.openqa.selenium.chrome.ChromeDriverService;
import org.openqa.selenium.html5.LocalStorage;
import org.openqa.selenium.html5.Location;
import org.openqa.selenium.html5.LocationContext;
import org.openqa.selenium.html5.SessionStorage;
import org.openqa.selenium.html5.WebStorage;
import org.openqa.selenium.interactions.HasTouchScreen;
import org.openqa.selenium.interactions.TouchScreen;
import org.openqa.selenium.mobile.NetworkConnection;
import org.openqa.selenium.remote.FileDetector;
import org.openqa.selenium.remote.RemoteTouchScreen;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.openqa.selenium.remote.html5.RemoteLocationContext;
import org.openqa.selenium.remote.html5.RemoteWebStorage;
import org.openqa.selenium.remote.mobile.RemoteNetworkConnection;
public class MyChromeDriver extends RemoteWebDriver implements LocationContext, WebStorage, HasTouchScreen, NetworkConnection {
private RemoteLocationContext locationContext;
private RemoteWebStorage webStorage;
private TouchScreen touchScreen;
private RemoteNetworkConnection networkConnection;
//public MyChromeDriver() {
// this(ChromeDriverService.createDefaultService(), new ChromeOptions());
//}
//
//public MyChromeDriver(ChromeDriverService service) {
// this(service, new ChromeOptions());
//}
public MyChromeDriver(Capabilities capabilities) {
this(ChromeDriverService.createDefaultService(), capabilities);
}
//public MyChromeDriver(ChromeOptions options) {
// this(ChromeDriverService.createDefaultService(), options);
//}
public MyChromeDriver(ChromeDriverService service, Capabilities capabilities) {
super(new MyChromeDriverCommandExecutor(service), capabilities);
this.locationContext = new RemoteLocationContext(this.getExecuteMethod());
this.webStorage = new RemoteWebStorage(this.getExecuteMethod());
this.touchScreen = new RemoteTouchScreen(this.getExecuteMethod());
this.networkConnection = new RemoteNetworkConnection(this.getExecuteMethod());
}
#Override
public void setFileDetector(FileDetector detector) {
throw new WebDriverException("Setting the file detector only works on remote webdriver instances obtained via RemoteWebDriver");
}
#Override
public LocalStorage getLocalStorage() {
return this.webStorage.getLocalStorage();
}
#Override
public SessionStorage getSessionStorage() {
return this.webStorage.getSessionStorage();
}
#Override
public Location location() {
return this.locationContext.location();
}
#Override
public void setLocation(Location location) {
this.locationContext.setLocation(location);
}
#Override
public TouchScreen getTouch() {
return this.touchScreen;
}
#Override
public ConnectionType getNetworkConnection() {
return this.networkConnection.getNetworkConnection();
}
#Override
public ConnectionType setNetworkConnection(ConnectionType type) {
return this.networkConnection.setNetworkConnection(type);
}
public void launchApp(String id) {
this.execute("launchApp", ImmutableMap.of("id", id));
}
}
This is mostly a copy of the original class, but with some constructors disabled (because some of the needed code is package private). If you are in need of these constructors you must place the classes in the package org.openqa.selenium.chrome.
With these changes you are able to call the required code, as shown by Florent B., but now in Java with the Selenium API:
import com.google.common.collect.ImmutableMap;
import org.openqa.selenium.remote.Command;
import org.openqa.selenium.remote.Response;
import javax.annotation.Nonnull;
import javax.annotation.Nullable;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
public class ChromeExtender {
#Nonnull
private MyChromeDriver m_wd;
public ChromeExtender(#Nonnull MyChromeDriver wd) {
m_wd = wd;
}
public void takeScreenshot(#Nonnull File output) throws Exception {
Object visibleSize = evaluate("({x:0,y:0,width:window.innerWidth,height:window.innerHeight})");
Long visibleW = jsonValue(visibleSize, "result.value.width", Long.class);
Long visibleH = jsonValue(visibleSize, "result.value.height", Long.class);
Object contentSize = send("Page.getLayoutMetrics", new HashMap<>());
Long cw = jsonValue(contentSize, "contentSize.width", Long.class);
Long ch = jsonValue(contentSize, "contentSize.height", Long.class);
/*
* In chrome 61, delivered one day after I wrote this comment, the method forceViewport was removed.
* I commented it out here with the if(false), and hopefully wrote a working alternative in the else 8-/
*/
if(false) {
send("Emulation.setVisibleSize", ImmutableMap.of("width", cw, "height", ch));
send("Emulation.forceViewport", ImmutableMap.of("x", Long.valueOf(0), "y", Long.valueOf(0), "scale", Long.valueOf(1)));
} else {
send("Emulation.setDeviceMetricsOverride",
ImmutableMap.of("width", cw, "height", ch, "deviceScaleFactor", Long.valueOf(1), "mobile", Boolean.FALSE, "fitWindow", Boolean.FALSE)
);
send("Emulation.setVisibleSize", ImmutableMap.of("width", cw, "height", ch));
}
Object value = send("Page.captureScreenshot", ImmutableMap.of("format", "png", "fromSurface", Boolean.TRUE));
// Since chrome 61 this call has disappeared too; it does not seem to be necessary anymore with the new code.
// send("Emulation.resetViewport", ImmutableMap.of());
send("Emulation.setVisibleSize", ImmutableMap.of("x", Long.valueOf(0), "y", Long.valueOf(0), "width", visibleW, "height", visibleH));
String image = jsonValue(value, "data", String.class);
byte[] bytes = Base64.getDecoder().decode(image);
try(FileOutputStream fos = new FileOutputStream(output)) {
fos.write(bytes);
}
}
#Nonnull
private Object evaluate(#Nonnull String script) throws IOException {
Map<String, Object> param = new HashMap<>();
param.put("returnByValue", Boolean.TRUE);
param.put("expression", script);
return send("Runtime.evaluate", param);
}
#Nonnull
private Object send(#Nonnull String cmd, #Nonnull Map<String, Object> params) throws IOException {
Map<String, Object> exe = ImmutableMap.of("cmd", cmd, "params", params);
Command xc = new Command(m_wd.getSessionId(), "sendCommandWithResult", exe);
Response response = m_wd.getCommandExecutor().execute(xc);
Object value = response.getValue();
if(response.getStatus() == null || response.getStatus().intValue() != 0) {
//System.out.println("resp: " + response);
throw new MyChromeDriverException("Command '" + cmd + "' failed: " + value);
}
if(null == value)
throw new MyChromeDriverException("Null response value to command '" + cmd + "'");
//System.out.println("resp: " + value);
return value;
}
#Nullable
static private <T> T jsonValue(#Nonnull Object map, #Nonnull String path, #Nonnull Class<T> type) {
String[] segs = path.split("\\.");
Object current = map;
for(String name: segs) {
Map<String, Object> cm = (Map<String, Object>) current;
Object o = cm.get(name);
if(null == o)
return null;
current = o;
}
return (T) current;
}
}
This lets you use the commands as specified, and creates a file with a png format image inside it. You can of course also directly create a BufferedImage by using ImageIO.read() on the bytes.
In Selenium 4, FirefoxDriver provides a getFullPageScreenshotAs method that handles vertical and horizontal scrolling, as well as fixed elements (e.g. navbars). ChromeDriver may implement this method in later releases.
System.setProperty("webdriver.gecko.driver", "path/to/geckodriver");
final FirefoxOptions options = new FirefoxOptions();
// set options...
final FirefoxDriver driver = new FirefoxDriver(options);
driver.get("https://stackoverflow.com/");
File fullScreenshotFile = driver.getFullPageScreenshotAs(OutputType.FILE);
// File will be deleted once the JVM exits, so you should copy it
For ChromeDriver, the selenium-shutterbug library can be used.
Shutterbug.shootPage(driver, Capture.FULL, true).save();
// Take screenshot of the whole page using Chrome DevTools
I am using jsoup 1.7.3 with Whitelist custom configuration.
Apparently it sanitizes all the HTML comments (<!-- ... -->) inside the document.
It also sanitizes the <!DOCTYPE ...> element.
How can I get jsoup Whitelist to allow comments as is?
How can I define the !DOCTYPE element as allowed element with any attribute?
This is not possible by standard JSoup classes and its not dependent on WhiteList. Its the org.jsoup.safety.Cleaner. The cleaner uses a Node traverser that allows only elements and text nodes. Also only the body is parsed. So the head and doctype are ignored completely. So to achieve this you'll have to create a custom cleaner. For example if you have an html like
<!DOCTYPE html>
<html>
<head>
<!-- This is a script -->
<script type="text/javascript">
function newFun() {
alert(1);
}
</script>
</head>
<body>
<map name="diagram_map">
<area id="area1" />
<area id="area2" />
</map>
<!-- This is another comment. -->
<div>Test</div>
</body>
</html>
You will first create a custom cleaner copying the orginal one. However please note the package should org.jsoup.safety as the cleaner uses some of the protected method of Whitelist associated with. Also there is not point in extending the Cleaner as almost all methods are private and the inner node traverser is final.
package org.jsoup.safety;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.DocumentType;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Tag;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;
public class CustomCleaner {
private Whitelist whitelist;
public CustomCleaner(Whitelist whitelist) {
Validate.notNull(whitelist);
this.whitelist = whitelist;
}
public Document clean(Document dirtyDocument) {
Validate.notNull(dirtyDocument);
Document clean = Document.createShell(dirtyDocument.baseUri());
copyDocType(dirtyDocument, clean);
if (dirtyDocument.head() != null)
copySafeNodes(dirtyDocument.head(), clean.head());
if (dirtyDocument.body() != null) // frameset documents won't have a body. the clean doc will have empty body.
copySafeNodes(dirtyDocument.body(), clean.body());
return clean;
}
private void copyDocType(Document dirtyDocument, Document clean) {
dirtyDocument.traverse(new NodeVisitor() {
public void head(Node node, int depth) {
if (node instanceof DocumentType) {
clean.prependChild(node);
}
}
public void tail(Node node, int depth) { }
});
}
public boolean isValid(Document dirtyDocument) {
Validate.notNull(dirtyDocument);
Document clean = Document.createShell(dirtyDocument.baseUri());
int numDiscarded = copySafeNodes(dirtyDocument.body(), clean.body());
return numDiscarded == 0;
}
private final class CleaningVisitor implements NodeVisitor {
private int numDiscarded = 0;
private final Element root;
private Element destination; // current element to append nodes to
private CleaningVisitor(Element root, Element destination) {
this.root = root;
this.destination = destination;
}
public void head(Node source, int depth) {
if (source instanceof Element) {
Element sourceEl = (Element) source;
if (whitelist.isSafeTag(sourceEl.tagName())) { // safe, clone and copy safe attrs
ElementMeta meta = createSafeElement(sourceEl);
Element destChild = meta.el;
destination.appendChild(destChild);
numDiscarded += meta.numAttribsDiscarded;
destination = destChild;
} else if (source != root) { // not a safe tag, so don't add. don't count root against discarded.
numDiscarded++;
}
} else if (source instanceof TextNode) {
TextNode sourceText = (TextNode) source;
TextNode destText = new TextNode(sourceText.getWholeText(), source.baseUri());
destination.appendChild(destText);
} else if (source instanceof Comment) {
Comment sourceComment = (Comment) source;
Comment destComment = new Comment(sourceComment.getData(), source.baseUri());
destination.appendChild(destComment);
} else if (source instanceof DataNode) {
DataNode sourceData = (DataNode) source;
DataNode destData = new DataNode(sourceData.getWholeData(), source.baseUri());
destination.appendChild(destData);
} else { // else, we don't care about comments, xml proc instructions, etc
numDiscarded++;
}
}
public void tail(Node source, int depth) {
if (source instanceof Element && whitelist.isSafeTag(source.nodeName())) {
destination = destination.parent(); // would have descended, so pop destination stack
}
}
}
private int copySafeNodes(Element source, Element dest) {
CleaningVisitor cleaningVisitor = new CleaningVisitor(source, dest);
NodeTraversor traversor = new NodeTraversor(cleaningVisitor);
traversor.traverse(source);
return cleaningVisitor.numDiscarded;
}
private ElementMeta createSafeElement(Element sourceEl) {
String sourceTag = sourceEl.tagName();
Attributes destAttrs = new Attributes();
Element dest = new Element(Tag.valueOf(sourceTag), sourceEl.baseUri(), destAttrs);
int numDiscarded = 0;
Attributes sourceAttrs = sourceEl.attributes();
for (Attribute sourceAttr : sourceAttrs) {
if (whitelist.isSafeAttribute(sourceTag, sourceEl, sourceAttr))
destAttrs.put(sourceAttr);
else
numDiscarded++;
}
Attributes enforcedAttrs = whitelist.getEnforcedAttributes(sourceTag);
destAttrs.addAll(enforcedAttrs);
return new ElementMeta(dest, numDiscarded);
}
private static class ElementMeta {
Element el;
int numAttribsDiscarded;
ElementMeta(Element el, int numAttribsDiscarded) {
this.el = el;
this.numAttribsDiscarded = numAttribsDiscarded;
}
}
}
Once you have both you could do cleaning as normal. Like
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.CustomCleaner;
import org.jsoup.safety.Whitelist;
public class CustomJsoupSanitizer {
public static void main(String[] args) {
try {
Document doc = Jsoup.parse(new File("t2.html"), "UTF-8");
CustomCleaner cleaner = new CustomCleaner(Whitelist.relaxed().addTags("script"));
Document doc2 = cleaner.clean(doc);
System.out.println(doc2.html());
} catch (IOException e) {
e.printStackTrace();
}
}
}
This will give you the sanitized output for above html as
<!DOCTYPE html>
<html>
<head>
<!-- This is a script -->
<script>
function newFun() {
alert(1);
}
</script>
</head>
<body>
<!-- This is another comment. -->
<div>
Test
</div>
</body>
</html>
You can customize the cleaner to match your requirement. i.e to avoid head node or script tag etc...
The Jsoup Cleaner doesn't give you a chance here (l. 100):
} else { // else, we don't care about comments, xml proc instructions, etc
numDiscarded++;
}
Only instances of Element and TextNode may remain in the cleaned HTML.
Your only chance may be something horrible like parsing the document, replacing the comments and the doctype with a special whitelisted tag, cleaning the document and then parsing and replacing the special tags again.
Am new in using Xpath parsing in Java for Xmls. But I learnt it and it worked pretty well until this below issue am not sure how to go traverse to next node in this . Please find the below code and Let me know what needs to be corrected .
package test;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
public class CallTestcall {
public static void main(String[] args) throws Exception {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
String responsePath1 = "C:/Verizon/webserviceTestTool/generatedResponse/example.xml";
Document doc1 = builder.parse(responsePath1);
String responsePath0 = "C:/Verizon/webserviceTestTool/generatedResponse/response.xml";
Document doc0 = builder.parse(responsePath0);
example0(doc0);
example1(doc1);
}
private static void example0(Document example)
throws XPathExpressionException, TransformerException {
System.out.println("\n*** First example - namespacelookup hardcoded ***");
XPath xPath = XPathFactory.newInstance().newXPath();
xPath.setNamespaceContext(new HardcodedNamespaceResolver());
String result = xPath.evaluate("s:Envelope/s:Body/ns1:UpdateSessionResponse",
example);
// I tried all the Values to traverse further to UpdateSessionResult but am not able to I used the following xpath expressions
result = xPath.evaluate("s:Envelope/s:Body/ns1:UpdateSessionResponse/a:UpdateSessionResult",
example);
result = xPath.evaluate("s:Envelope/s:Body/ns1:UpdateSessionResponse/i:UpdateSessionResult",
example);
System.out.println("example0 : "+result);
}
private static void example1(Document example)
throws XPathExpressionException, TransformerException {
System.out.println("\n*** First example - namespacelookup hardcoded ***");
XPath xPath = XPathFactory.newInstance().newXPath();
xPath.setNamespaceContext(new HardcodedNamespaceResolver());
String result = xPath.evaluate("books:booklist/technical:book/:author",
example);
System.out.println("example1 : "+result);
}
}
Please find the class that implements nameSpaceContext where I have added the prefixes
package test;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
public class HardcodedNamespaceResolver implements NamespaceContext {
/**
* This method returns the uri for all prefixes needed. Wherever possible it
* uses XMLConstants.
*
* #param prefix
* #return uri
*/
public String getNamespaceURI(String prefix) {
if (prefix == null) {
throw new IllegalArgumentException("No prefix provided!");
} else if (prefix.equals(XMLConstants.DEFAULT_NS_PREFIX)) {
return "http://univNaSpResolver/book";
} else if (prefix.equals("books")) {
return "http://univNaSpResolver/booklist";
} else if (prefix.equals("fiction")) {
return "http://univNaSpResolver/fictionbook";
} else if (prefix.equals("technical")) {
return "http://univNaSpResolver/sciencebook";
} else if (prefix.equals("s")) {
return "http://schemas.xmlsoap.org/soap/envelope/";
} else if (prefix.equals("a")) {
return "http://channelsales.corp.cox.com/vzw/v1/data/";
} else if (prefix.equals("i")) {
return "http://www.w3.org/2001/XMLSchema-instance";
} else if (prefix.equals("ns1")) {
return "http://channelsales.corp.cox.com/vzw/v1/";
}
else {
return XMLConstants.NULL_NS_URI;
}
}
public String getPrefix(String namespaceURI) {
// Not needed in this context.
return null;
}
public Iterator getPrefixes(String namespaceURI) {
// Not needed in this context.
return null;
}
}
Please find my Xml ::::
String XmlString = "<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/"><s:Body><UpdateSessionResponse xmlns="http://channelsales.corp.cox.com/vzw/v1/"><UpdateSessionResult xmlns:a="http://channelsales.corp.cox.com/vzw/v1/data/" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<a:ResponseHeader>
<a:SuccessFlag>true</a:SuccessFlag>
<a:ErrorCode i:nil="true"/>
<a:ErrorMessage i:nil="true"/>
<a:Timestamp>2012-12-05T15:28:35.5363903-05:00</a:Timestamp>
</a:ResponseHeader>
<a:SessionId>cd3ce09e-eb33-48e8-b628-ecd406698aee</a:SessionId>
<a:CacheKey i:nil="true"/>
Try the following. It works for me.
result = xPath.evaluate("/s:Envelope/s:Body/ns1:UpdateSessionResponse/ns1:UpdateSessionResult",
example);
Since you are searching from the root of the document, precede the xpath expression with a forward slash (/)
Also, in the XML fragment below, the string xmlns="http... means you are setting that to be the default namespace. In your namespace resolver you are giving this the prefix ns1. So even though UpdateSessionResult is defining two namespace prefixes a and i, it does not use those prefixes itself (for example <a:UpdateSessionResult...) therefore it belongs to the default namespace (named 'ns1')
<UpdateSessionResponse xmlns="http://channelsales.corp.cox.com/vzw/v1/">
<UpdateSessionResult xmlns:a="http://channelsales.corp.cox.com/vzw/v1/data/" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
That's why you need to use ns1:UpdateSessionResult instead of either a:UpdateSessionResult or i:UpdateSessionResult