My question: What's a good way to parse the information below?
I have a Java program that gets its input from XML. I have a feature which will send an error email if there was any problem in the processing. Because parsing the XML could itself be the problem, I want a feature that can regex the email addresses out of the XML (if parsing was the problem, I couldn't get the error-email addresses out of the XML normally).
Requirements:
I want to be able to parse the to, cc, and bcc attributes separately
There are other elements which have to, cc, and bcc attributes
Whitespace does not matter, so my example may show the attributes on a newline, but that's not always the case.
The order of the attributes does not matter.
Here's an example of the XML:
<error_options
to="your_email#your_server.com"
cc="cc_error#your_server.com"
bcc="bcc_error#your_server.com"
reply_to="someone_else#their_server.com"
from="bo_error#some_server.org"
subject="Error running System at ##TIMESTAMP##"
force_send="false"
max_email_size="10485760"
oversized_email_action="zip;split_all"
>
I tried error_options.{0,100}?to="(.*?)", but that matched all the way down to reply_to. That made me think there are probably other cases I might miss, which is why I'm posting this as a question.
This piece will put all attributes from your String s="<error_options..." into a map:
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches name="value" pairs; DOTALL lets a match cross newlines.
Pattern p = Pattern.compile("\\s+?(.+?)=\"(.+?)\\s*?\"", Pattern.DOTALL);
Map<String, String> a = new HashMap<String, String>();
Matcher m = p.matcher(s);
while (m.find()) {
    String key = m.group(1).trim();
    String val = m.group(2).trim();
    a.put(key, val);
}
...then you can extract the values that you're interested in from that map.
This question is similar to RegEx match open tags except XHTML self-contained tags. Never ever parse XML or HTML with regular expressions. There are many XML parser implementations in Java that can do this task properly. Read the document and parse the attributes one by one.
Don't worry if the user's XML is not well-formed; the parsers can handle a lot of sloppiness.
/<error_options(?=\s)[^>]*?(?<=\n)\s*to="([^"]*)"/s;
/<error_options(?=\s)[^>]*?(?<=\n)\s*cc="([^"]*)"/s;
/<error_options(?=\s)[^>]*?(?<=\n)\s*bcc="([^"]*)"/s;
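If you need this in Java, here is a rough adaptation of the same idea (my own sketch, not taken from the answers above; it assumes attribute values contain no escaped quotes): first capture the <error_options ...> start tag, then pull to, cc, and bcc out of it.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ErrorOptionsEmails {

    // Grab everything inside the <error_options ...> start tag.
    private static final Pattern TAG =
            Pattern.compile("<error_options\\b([^>]*)>");
    // Then match the individual to/cc/bcc attributes; the leading \s keeps
    // reply_to from matching as "to".
    private static final Pattern ATTR =
            Pattern.compile("\\s(to|cc|bcc)\\s*=\\s*\"([^\"]*)\"");

    public static void main(String[] args) {
        String xml = "<error_options\n to=\"a@example.com\"\n cc=\"b@example.com\"\n"
                + " bcc=\"c@example.com\"\n reply_to=\"x@example.com\"\n>";
        Matcher tag = TAG.matcher(xml);
        if (tag.find()) {
            Matcher attr = ATTR.matcher(tag.group(1));
            while (attr.find()) {
                System.out.println(attr.group(1) + " -> " + attr.group(2));
            }
        }
    }
}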
I am using the JSoup library in Java to sanitize input to prevent XSS attacks. It works well for simple inputs like <script>alert('vulnerable')</script>.
Example:
String data = "<script>alert('vulnerable')</script>";
data = Jsoup.clean(data, Whitelist.none());
data = StringEscapeUtils.unescapeHtml4(data); //StringEscapeUtils from apache-commons lib
System.out.println(data);
Output: ""
However, if I tweak the input to the following, JSoup cannot sanitize the input.
String data = "<<b>script>alert('vulnerable');<</b>/script>";
data = Jsoup.clean(data, Whitelist.none());
data = StringEscapeUtils.unescapeHtml4(data);
System.out.println(data);
Output: <script>alert('vulnerable');</script>
This output is obviously still prone to XSS attacks. Is there a way to fully sanitize the input so that all HTML tags are removed from it?
Not sure if this is the best solution, but a temporary workaround would be parsing the raw text into a Document and then cleaning the combined text of the document element and all its children:
String unsafe = "<<b>script>alert('vulnerable');<</b>/script>";
Document doc = Jsoup.parse(unsafe);
String safe = Jsoup.clean(doc.text(), Whitelist.none());
System.out.println(safe);
Wait for someone else to come up with the best solution.
The problem is that you are unescaping the safe HTML that jsoup has made. The output of the Cleaner is HTML. The none safelist passes no tags, only the text nodes, as HTML.
So the input:
<<b>script>alert('vulnerable');<</b>/script>
Through the Cleaner returns:
&lt;script&gt;alert('vulnerable');&lt;/script&gt;
which is perfectly safe for presenting as HTML. See https://try.jsoup.org/~hfn2nvIglfl099_dVxLQEPxekqg
Just don't include the unescape line.
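For completeness, a minimal sketch of that advice (my own example, not from the answer): keep the Cleaner output and skip the unescape step.

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class SanitizeDemo {
    public static void main(String[] args) {
        String data = "<<b>script>alert('vulnerable');<</b>/script>";
        // Strip all tags; the remaining text is returned as escaped HTML.
        String safe = Jsoup.clean(data, Whitelist.none());
        // Prints something like: &lt;script&gt;alert('vulnerable');&lt;/script&gt;
        System.out.println(safe);
    }
}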
I get a stream of values as CSV. Based on some condition, I need to generate an XML document that includes only a subset of the values from the CSV. For example:
Input : a:value1, b:value2, c:value3, d:value4, e:value5.
if (condition1)
XML O/P = <Request><ValueOfA>value1</ValueOfA><ValueOfE>value5</ValueOfE></Request>
else if (condition2)
XML O/P = <Request><ValueOfB>value2</ValueOfB><ValueOfD>value4</ValueOfD></Request>
I want to externalize the process so that, given a template, the output XML is generated accordingly. String manipulation is the easiest way to implement this, but I do not want to mess up the XML if special characters appear in the input. Please suggest an approach.
Perhaps you could benefit from a templating engine, something like Apache Velocity.
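For what it's worth, a rough sketch of the Velocity route (the inline template is my own, not from the question). Note that Velocity does not XML-escape references by default, so you would still need to escape the values yourself (or configure an escaping event handler) to cover the special-character concern.

import java.io.StringWriter;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.Velocity;

public class VelocityDemo {
    public static void main(String[] args) {
        Velocity.init();
        VelocityContext ctx = new VelocityContext();
        ctx.put("a", "value1");
        ctx.put("e", "value5");

        StringWriter out = new StringWriter();
        // Evaluate an inline template; $a and $e are filled in from the context.
        Velocity.evaluate(ctx, out, "request-template",
                "<Request><ValueOfA>$a</ValueOfA><ValueOfE>$e</ValueOfE></Request>");
        System.out.println(out);  // <Request><ValueOfA>value1</ValueOfA><ValueOfE>value5</ValueOfE></Request>
    }
}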
I would suggest creating an XSD and using JAXB to create the Java binding classes that you can use to generate the XML.
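A hedged sketch of that JAXB route (the binding class and element names are made up for illustration): the marshaller escapes special characters in the values for you.

import java.io.StringWriter;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

public class JaxbDemo {

    @XmlRootElement(name = "Request")
    public static class Request {
        @XmlElement(name = "ValueOfA")
        public String valueOfA;
        @XmlElement(name = "ValueOfE")
        public String valueOfE;
    }

    public static void main(String[] args) throws Exception {
        Request r = new Request();
        r.valueOfA = "value1 & <special>";  // special characters get escaped for us
        r.valueOfE = "value5";

        Marshaller m = JAXBContext.newInstance(Request.class).createMarshaller();
        m.setProperty(Marshaller.JAXB_FRAGMENT, Boolean.TRUE);  // omit the XML declaration
        StringWriter out = new StringWriter();
        m.marshal(r, out);
        // Prints something like:
        // <Request><ValueOfA>value1 &amp; &lt;special&gt;</ValueOfA><ValueOfE>value5</ValueOfE></Request>
        System.out.println(out);
    }
}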
I recommend my own templating engine (JATL, http://code.google.com/p/jatl/). Although it's geared to (X)HTML, it's also very good at generating XML.
I didn't bother solving the whole problem for you (that is, double-splitting the input on "," and then ":"), but this is how you would use JATL:
final String a = "stuff";
HtmlWriter html = new HtmlWriter() {
    @Override
    protected void build() {
        // If condition1
        start("Request").start("ValueOfA").text(a).end().end();
    }
};
// Now write.
StringWriter writer = new StringWriter();
String results = html.write(writer).getBuffer().toString();
Which would generate
<Request><ValueOfA>stuff</ValueOfA></Request>
All the correct escaping is handled for you.
I have an XML file which contains many special symbols, like ® (HTML number &#174;), and HTML names, like &atilde; (HTML number &#227;), etc.
I am trying to replace these HTML symbols and HTML names with the corresponding HTML numbers using Java. For this, I first read the XML file into a string and then used the replaceAll method:
File fn = new File("myxmlfile.xml");
String content = FileUtils.readFileToString(fn);
content = content.replaceAll("®", "&\#174");
FileUtils.writeStringToFile(fn, content);
But this is not working.
Can anyone please tell me how to do it?
Thanks !!!
The signature for the replaceAll method is:
public String replaceAll(String regex, String replacement)
You have to be careful that your first parameter is a valid regular expression. The Java Pattern class describes the constructs used in a Java regular expression.
Based on what I see in the Pattern class description, I don't see what's wrong with:
content = content.replaceAll("®", "&\#174");
You could try:
content = content.replaceAll("\\p(®)", "&\#174");
and see if that works better.
I don't think that \# is a valid escape sequence.
BTW, what's wrong with "®" ?
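For what it's worth, a minimal sketch that sidesteps regex entirely (my own example, not from the answers above): plain String.replace works on literal text, and in XML the numeric reference needs the trailing semicolon.

String content = "Brand® and more®";
content = content.replace("®", "&#174;");  // literal replacement, no regex involved
System.out.println(content);  // Brand&#174; and more&#174;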
If you want HTML numbers, try escaping for XML first.
Use StringEscapeUtils from Apache Commons Lang.
Java may have trouble dealing with the raw characters, so I prefer to escape for Java first, and after that for XML or HTML.
String escapedStr = StringEscapeUtils.escapeJava(yourString);
escapedStr = StringEscapeUtils.escapeXml(escapedStr);   // for XML output
// or: escapedStr = StringEscapeUtils.escapeHtml(escapedStr);  // for HTML output
How can I convert HTML to text while keeping line breaks (produced by elements like br, p, div, ...), possibly using NekoHTML or any decent HTML parser?
Example:
Hello<br/>World
to:
Hello\n
World
Here is a function I made to output text (including line breaks) by iterating over the nodes using Jsoup.
public static String htmlToText(InputStream html) throws IOException {
    Document document = Jsoup.parse(html, null, "");
    Element body = document.body();
    return buildStringFromNode(body).toString();
}

private static StringBuffer buildStringFromNode(Node node) {
    StringBuffer buffer = new StringBuffer();
    if (node instanceof TextNode) {
        TextNode textNode = (TextNode) node;
        buffer.append(textNode.text().trim());
    }
    for (Node childNode : node.childNodes()) {
        buffer.append(buildStringFromNode(childNode));
    }
    if (node instanceof Element) {
        Element element = (Element) node;
        String tagName = element.tagName();
        if ("p".equals(tagName) || "br".equals(tagName)) {
            buffer.append("\n");
        }
    }
    return buffer;
}
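A quick usage sketch for the method above (the file name is hypothetical):

try (InputStream in = new FileInputStream("page.html")) {
    System.out.println(htmlToText(in));
}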
w3m -dump -no-cookie input.html > output.txt
I did find a relatively clever solution in html2txt: THE ASCIINATOR, which does an admirable job of producing nroff-like output (e.g. like man ls run in a terminal). It produces output in the Markdown style that StackOverflow uses as input.
For moderately complex pages like this page, the output is somewhat scattered as it tries mightily to turn non-linear layout into something linear. The output from less complicated markup is pretty readable.
If you don't mind hard-wrapped/designed-for-monospace output, lynx -dump produces good plain text from HTML.
HTML to Text:
I am taking this statement to mean that all HTML formatting, except line-breaks, will be abandoned.
What I have done for such a venture is use a regexp to detect any tag enclosure.
If the value within the tags is br or br/, a line break is inserted; otherwise the tag is discarded.
It works only for simple HTML pages. Tables will obviously be linearised.
I had been thinking of detecting the title value between the title tag enclosure, so that the converter automatically places the title at the top of the page. That needs a little more algorithm work, but my time is better spent with ...
I am reading up on using the Google Data APIs to upload a document to Google Docs and then using the same API to download/export it as text. Or why text, when I could do PDF? But you have to get a Google account if you don't already have one.
Google docs data download/export
Google docs data api for java
Does it matter what language you use? You could always use pattern matching. Basically, the HTML line-break tags (br, p, div, ...) can be replaced with "\n" and all the other tags removed. You could store the tags in an array so you can easily check them as you go through the HTML file. Then any other tags and all the end tags (/p, ...) can be replaced with an empty string, giving you your result.
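A rough sketch of that idea (my own example; only sensible for simple pages):

String html = "Hello<br/>World<p>Next paragraph</p>";
String text = html
        .replaceAll("(?i)<\\s*(br|p|div)[^>]*>", "\n")  // line-breaking start tags -> newline
        .replaceAll("<[^>]+>", "");                     // strip every other tag
System.out.println(text);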
What I am doing is validating URLs from my code. So I have a file with URLs in it and I want to see if they exist or not. If they exist, the web page contains XML code in which there will be an email address I want to extract.
I go round a while loop and, in each iteration, if the URL exists, the XML is added to a string, so one big string ends up containing all the XML code. What I want to do is extract the email addresses from this string. I can't use the methods in the String API as they require you to specify the starting index, which I don't know as it varies each time.
What I was hoping to do was search the string for a sub-string starting with (e.g. "<email id>") and ending with (e.g. "</email id>") and add the text between these strings to a separate string.
Does anyone know if this is possible to do or if there is an easier/different way of doing what I want to do?
Thanks.
If you know the structure of the XML document well, I'd recommend using XPath.
For example, with emails contained in <email>a@b.com</email>, you would use an XPath query like /root/email (depending on your XML structure).
By executing this XPath query on your XML file, you will automatically get all the <email> elements (Nodes) returned in a list. And if you have the XML element, you have its content (#getNodeValue).
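A hedged sketch with the standard javax.xml.xpath API (the document structure here is assumed from the example above):

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class EmailXPathDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<root><email>a@b.com</email><email>c@d.com</email></root>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select every <email> element under the root.
        NodeList emails = (NodeList) xpath.evaluate("/root/email", doc, XPathConstants.NODESET);
        for (int i = 0; i < emails.getLength(); i++) {
            System.out.println(emails.item(i).getTextContent());
        }
    }
}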
To answer your subject question: indexOf, or regular expressions.
But after a brief review of your question, you should really be processing the XML document properly.
A regular expression that will find and return strings between two " characters:
import java.util.regex.Pattern;
import java.util.regex.Matcher;

private final static Pattern pattern = Pattern.compile("\"(.*?)\"");

private void doStuffWithStringsBetweenQuotes(String source) {
    Matcher matcher = pattern.matcher(source);
    while (matcher.find()) {
        String match = matcher.group(1);  // do something with each quoted value here
    }
}
Have you tried using regex? A sample document would be very useful for this kind of question.
Check out the org.xml.sax API. It is very easy to use and allows you to parse through the XML and do whatever you want with the contents whenever you come across anything of interest. So you could easily add some logic to look for <email> start elements and then save the contents (characters), which will contain your email address.
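A hedged SAX sketch along those lines (the element name email is assumed):

import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EmailSaxDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<root><email>a@b.com</email><other>ignored</other></root>";
        DefaultHandler handler = new DefaultHandler() {
            private boolean inEmail;
            private final StringBuilder current = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("email".equals(qName)) {
                    inEmail = true;
                    current.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inEmail) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("email".equals(qName)) {
                    inEmail = false;
                    System.out.println(current);  // the collected email address
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }
}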
If I understand your question correctly you are extracting pieces of XML from multiple web pages and concatenating them into a big 'xml' string,
something that looks like
"<somedata>blah</somedata>
<email>a.b@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.c@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.d@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
"
I'd advise making that a somewhat valid xml document by including a root element.
"
<?xml version="1.0" encoding="ISO-8859-1"?>
<newRoot>
<somedata>blah</somedata>
<email>a.b@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.c@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.d@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
</newRoot>"
Then you could load that into an XML Document object and use XPath expressions to extract the email nodes and their values.
If you don't want to do that, you could use the indexOf(String str, int fromIndex) method to find the <email> and </email> (or whatever they are called) positions and then substring based on those. That's not a particularly clean or easy-to-read way of doing it, though.
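A rough indexOf sketch of that last approach (tag name assumed; attributes and nesting are not handled):

String xml = "...<email>a.b@c.com</email>...<email>a.c@c.com</email>...";
int start = xml.indexOf("<email>");
while (start != -1) {
    int end = xml.indexOf("</email>", start);
    if (end == -1) {
        break;  // unbalanced tag; stop rather than throw
    }
    String email = xml.substring(start + "<email>".length(), end);
    System.out.println(email);
    start = xml.indexOf("<email>", end);
}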