I want to replace < and > with &lt; and &gt; if they are not part of an HTML tag.
Input will be a string that may contain certain html tags. It can also contain less than & greater than signs (">" "<").
For example:
String example1 = "-> <b> Bold </b> <-";
String example2 = "< <i> Italic </i> >";
String example3 = "<i>foo >> </i>";
As output I want to get:
String output1 = "-&gt; <b> Bold </b> &lt;-";
String output2 = "&lt; <i> Italic </i> &gt;";
String output3 = "<i>foo &gt;&gt; </i>";
So replaceAll doesn't work, I have to use a regular expression match I guess. Any ideas? Some other way?
Note1: 3rd party library is not an option because of certain project requirements.
Note2: We support only a subset of HTML tags (text styling tags: italic, underline, bold, etc.)
This is a non-trivial task. HTML is not a regular language (perhaps it is irregular?), so you cannot parse it reliably using regular expressions. I suggest the following:
Option 1
Use this if you do not need to preserve white space.
Remove all whitespace from the input.
Split the input into tokens using "<" and ">" as the separators; preserve the separators.
Process as follows:
if the token is not a supported HTML tag and contains a "<", convert the "<" as desired.
if the token is not a supported HTML tag and contains a ">", convert the ">" as desired.
pass HTML tags unchanged.
Option 2
Process each input line using multi-character look-ahead.
For each character in the input (the convert characters are {">", "<"}):
Is the character a convert character?
if no, advance to next character.
if yes, look ahead to determine if this is a supported HTML tag (this is the tricky part).
if not part of a supported HTML tag, convert the character.
if part of a supported HTML tag, advance to the character following the HTML tag.
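Option 2 could be sketched roughly as follows. This is only an illustration under assumptions: the class name AngleBracketEscaper and the whitelist {b, i, u} are made up here, and tags are assumed to carry no attributes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class AngleBracketEscaper {
    // Hypothetical whitelist; the question only says "text styling tags".
    private static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList("b", "i", "u"));

    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        int i = 0;
        while (i < input.length()) {
            char ch = input.charAt(i);
            if (ch == '<') {
                int end = tagEnd(input, i);
                if (end >= 0) {
                    // Supported tag: copy it unchanged and jump past its '>'.
                    out.append(input, i, end + 1);
                    i = end + 1;
                    continue;
                }
                out.append("&lt;");
            } else if (ch == '>') {
                // Any '>' reached here is outside a supported tag.
                out.append("&gt;");
            } else {
                out.append(ch);
            }
            i++;
        }
        return out.toString();
    }

    // Look-ahead: returns the index of the closing '>' if a supported
    // opening or closing tag starts at 'start', otherwise -1.
    private static int tagEnd(String s, int start) {
        int p = start + 1;
        if (p < s.length() && s.charAt(p) == '/') {
            p++;
        }
        int nameStart = p;
        while (p < s.length() && Character.isLetter(s.charAt(p))) {
            p++;
        }
        if (p == nameStart || p >= s.length() || s.charAt(p) != '>') {
            return -1;
        }
        String name = s.substring(nameStart, p).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(name) ? p : -1;
    }
}
```

With the three inputs from the question, AngleBracketEscaper.escape(example1) should leave the <b> tags alone and escape only the arrow characters around them.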
If you only support five HTML tags, you could first remove those tags from the text, replace < and > with &lt; and &gt;, and then add the HTML tags again. You remove <b> from the text by replacing it with, for instance, [b]. Do the same with the other tags.
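A minimal sketch of this placeholder idea, under a few assumptions of mine: the tag list {b, i, u} is hypothetical, and instead of [b] it uses the control characters U+0001 and U+0002 as stand-ins for the brackets, so that a literal "[b]" typed by the user is not mistaken for a tag. It assumes the input never contains those two control characters.

```java
final class PlaceholderEscaper {
    // Hypothetical list of supported tags (not given in the question).
    private static final String[] TAGS = {"b", "i", "u"};

    static String escape(String input) {
        String s = input;
        // Step 1: hide the supported tags behind placeholder brackets.
        for (String t : TAGS) {
            s = s.replace("<" + t + ">", "\u0001" + t + "\u0002")
                 .replace("</" + t + ">", "\u0001/" + t + "\u0002");
        }
        // Step 2: every bracket that is left is plain text, so escape it.
        s = s.replace("<", "&lt;").replace(">", "&gt;");
        // Step 3: restore the hidden tags.
        return s.replace('\u0001', '<').replace('\u0002', '>');
    }
}
```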
If you can't be bothered to use an external library then you would need to make an array with all the html tags and run it against the string.
I don't really recommend it because there are libraries for that...
Assuming arbitrary HTML files, you have to isolate text nodes and run replace on those.
If you can't use existing libraries, I'd just write my own.
(JSoup can do this but it's an 'external library' -- that is, not included in the Java SE standard, but just re-implementing it is an option.)
Assuming that the strings contain only valid HTML tags, the following method could be used to parse them and get the result you are looking for:
private static String parse(String str)
{
    StringBuilder sBuilder = new StringBuilder();
    for (int i = 0; i < str.length(); i++)
    {
        char ch = str.charAt(i);
        if (ch == '>' && i != 0)
        {
            // A '>' closes a tag only if it directly follows a letter, as in "<b>" or "</i>".
            char c = str.charAt(i - 1);
            if (Character.isWhitespace(c) || !Character.isLetter(c))
            {
                sBuilder.append("&gt;");
            }
            else
                sBuilder.append(ch);
        }
        else if (ch == '>' && i == 0)
        {
            // A '>' at the very start cannot close a tag.
            sBuilder.append("&gt;");
        }
        else if (ch == '<' && i < str.length() - 1)
        {
            // A '<' opens a tag only if a letter or '/' follows it.
            char c = str.charAt(i + 1);
            if (!(c == '/' || Character.isLetter(c)))
            {
                sBuilder.append("&lt;");
            }
            else
                sBuilder.append(ch);
        }
        else if (ch == '<' && i == str.length() - 1)
        {
            // A '<' at the very end cannot open a tag.
            sBuilder.append("&lt;");
        }
        else
        {
            sBuilder.append(ch);
        }
    }
    return sBuilder.toString();
}
I am working on adding search / replace functionality to an android application.
I would like the user to be able to search and replace using regular expressions. The search functionality works correctly, however a newline character in the replacement string \n is interpreted as a literal 'n'.
Here is what my code is doing:
Pattern.compile(search.getText().toString()).matcher(text).replaceAll(replace.getText().toString())
Given
text is a CharSequence with contents A \n\n\nB (note the trailing space after 'A')
search being a textbox with the contents \s+\n
replace being a text box with contents \n.
I expect the result of the code above to be text = A\n\n\nB (trailing spaces removed).
Instead the result is text = An\n\nB. i.e. the \n in the replace text box is interpreted as a literal 'n'.
I would like to know what I can do in order to read the contents of replace, such that the \n is interpreted as a newline.
Note that I can achieve the desired result in the example by capturing the newline like with search = \s+(\n) and replace = $1. This is not, however, the issue.
For the purposes of this discussion I am only considering Unix line endings.
Edit:
using replace contents = \\n results in a literal '\n' being inserted.
i.e.
A
B
is transformed to
A\n
B
The approach suggested by Wiktor Stribiżew, found in stackoverflow.com/a/4298836, works for me.
Essentially, we need to parse the string and replace each escape sequence with the character it represents.
Here is the code I used:
private String unescape(final String input) {
final StringBuilder builder = new StringBuilder();
boolean isEscaped = false;
for (int i = 0; i < input.length(); i++) {
char current = input.charAt(i);
if (isEscaped) {
if (current == 't') {
builder.append('\t');
} else if (current == 'b') {
builder.append('\b');
} else if (current == 'r') {
builder.append('\r');
} else if (current == 'n') {
builder.append('\n');
} else if (current == 'f') {
builder.append('\f');
} else if (current == '\\' || current == '\'' || current == '"') {
builder.append(current);
} else {
throw new IllegalArgumentException("Illegal escape sequence.");
}
isEscaped = false;
} else if (current == '\\') {
isEscaped = true;
} else {
builder.append(current);
}
}
return builder.toString();
}
It isn't as complete or as correct as the solution in the answer linked above, but it appears to work correctly for my purposes.
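One thing to watch for: the result of unescape still goes through Matcher.replaceAll, where a backslash and $ keep their special meaning (that is what makes $1 work). A possible variant, sketched here with the hypothetical names ReplacementEscapes and translateControlEscapes, translates only \n, \t and \r and leaves every other backslash pair for the Matcher to interpret:

```java
import java.util.regex.Pattern;

final class ReplacementEscapes {
    // Turns the two-character sequences \n, \t and \r into the real control
    // characters. Any other backslash pair (such as \\ or \$) is copied
    // unchanged so that Matcher.replaceAll can still interpret it.
    static String translateControlEscapes(String replacement) {
        StringBuilder sb = new StringBuilder(replacement.length());
        for (int i = 0; i < replacement.length(); i++) {
            char c = replacement.charAt(i);
            if (c == '\\' && i + 1 < replacement.length()) {
                char next = replacement.charAt(++i);
                switch (next) {
                    case 'n': sb.append('\n'); break;
                    case 't': sb.append('\t'); break;
                    case 'r': sb.append('\r'); break;
                    default: sb.append('\\').append(next); break;
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    static String replaceAll(String text, String search, String replacement) {
        return Pattern.compile(search).matcher(text)
                .replaceAll(translateControlEscapes(replacement));
    }
}
```

Group references keep working with this variant, for example a replacement box containing $2\t$1 swaps two captured groups and puts a tab between them.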
The issue here seems to be that '\' is itself a special character. You have to double it for the regex engine to see it as a literal character. Doubling every '\' should prove useful, like this:
yourString = yourString.replace("\\", "\\\\");
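If the goal is to insert the replacement text exactly as typed, the standard library's Matcher.quoteReplacement does this doubling for you (it also escapes $). A small sketch, with LiteralReplacement and replaceLiterally being names I made up for the example:

```java
import java.util.regex.Matcher;

final class LiteralReplacement {
    // Replaces every match of 'regex' with 'replacement' taken literally:
    // backslashes and dollar signs in the replacement lose their special meaning.
    static String replaceLiterally(String text, String regex, String replacement) {
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }
}
```

Note that this inserts a backslash followed by 'n' rather than a newline, so it solves the "literal text" case, not the "interpret \n" case from the question.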
How do I remove comments that start with "//" or with "/**", "*", etc.? I haven't found any solutions on Stack Overflow that helped me very much; a lot of them were way above my head and I'm still at the very basics.
What I have thought about so far:
for (int i = 0; i < length; i++) {
for (j = i; j < length; j++) {
if (obj.charAt(j) == '/' && obj.charAt(j + 1) == '/')
But I'm not really sure how to replace the words following those characters, or how to know where to stop the replacement for a "//" comment. With the /* comments, at least conceptually, I know I should replace everything until "*/" shows up, though again I'm not sure how to limit the replacement to that point. To replace, I thought of replacing each charAt after the second "/" with an empty string until... where? I cannot figure out where to "end" the replacing.
I have looked at a few implementations on Stack, but I really didn't get it. Any help is appreciated, especially if it's at a basic level and understandable for someone not well versed in programming!
Thanks.
I have done something similar with regex (Java 9+):
// Checks for
// 1) Single char literal '"'
// 2) Single char literal '\"'
// 3) Strings; termination ignores \", \\\", ..., but allows ", \\", \\\\", ...
// 4) Single-line comment // ... to first \n
// 5) Multi-line comments /*[*] ... */
Pattern regex = Pattern.compile(
"(?s)('\"'|'\\\"'|\".*?(?<!\\\\)(?:\\\\\\\\)*\"|//[^\n]*|/\\*.*?\\*/)");
// Assuming 'text' contains your java text
// Leaves 1,2,3) unchanged and replaces comments 4,5) with ""
// Need quoteReplacement to prevent matcher processing $ and \
String textWithoutComments = regex.matcher(text).replaceAll(
m -> m.group().charAt(0) == '/' ? "" : Matcher.quoteReplacement(m.group()));
If you don't have Java 9+ then you could use this replace function:
String textWithoutComments = replaceAll(regex, text,
m -> m.group().charAt(0) == '/' ? "" : m.group());
public static String replaceAll(Pattern p, String s,
Function<MatchResult, String> replacer) {
Matcher m = p.matcher(s);
StringBuilder b = new StringBuilder();
int lastStart = 0;
while (m.find()) {
String replacement = replacer.apply(m);
b.append(s.substring(lastStart, m.start())).append(replacement);
lastStart = m.end();
}
return b.append(s.substring(lastStart)).toString();
}
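For someone who prefers a plain loop over a regex, the same idea can be written as a character scan, which is closer to the loop sketched in the question. This is a simplified sketch with a made-up class name (CommentStripper); it handles string and char literals, // comments and /* */ comments, but not Java text blocks (""").

```java
final class CommentStripper {
    static String strip(String src) {
        StringBuilder out = new StringBuilder(src.length());
        int i = 0;
        int n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            if (c == '"' || c == '\'') {
                // String or char literal: copy it verbatim up to the closing
                // quote, skipping over backslash-escaped characters.
                char quote = c;
                out.append(c);
                i++;
                while (i < n) {
                    char d = src.charAt(i);
                    out.append(d);
                    i++;
                    if (d == '\\' && i < n) {
                        out.append(src.charAt(i));
                        i++;
                    } else if (d == quote) {
                        break;
                    }
                }
            } else if (c == '/' && i + 1 < n && src.charAt(i + 1) == '/') {
                // Line comment: the "end" is the next newline, which is kept.
                while (i < n && src.charAt(i) != '\n') {
                    i++;
                }
            } else if (c == '/' && i + 1 < n && src.charAt(i + 1) == '*') {
                // Block comment: the "end" is just past the next "*/".
                int end = src.indexOf("*/", i + 2);
                i = (end < 0) ? n : end + 2;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

The answer to "where do I stop?" is visible in the two comment branches: a // comment stops at the next newline, and a /* comment stops right after the next */.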
I'm not sure if you're using an IDE like IntelliJ or Eclipse but you could do this without using code if you're just interested in removing all comments for the project. You can do this with "Replace in Path" tool. Notice how "Regex" is checked, allowing us to match lines based on regular expressions.
This configuration in the tool will delete all lines starting with a // and replace it with an empty line.
The command to get to this on a Mac is ctrl + shift + r.
I'm using XmlPullParser to parse some custom XML. I open the XML file like this...
XmlPullParser xpp = activity.getResources().getXml(R.xml.myXML);
And later I read the following XML node
<L> ####</L>
using the code
String str = "";
if( xpp.next() == XmlPullParser.TEXT )
str = xpp.getText();
return str;
Instead of returning
' ####'
I get
'####'
(single quotes added by me for clarity.) NOTE The missing leading space.
It appears getText is stripping the leading space? When the XML doesn't contain a leading space, my code works as expected.
I can't find any property of XMLPullParser that allows me to tell it to keep all whitespace. Nor can I change the XML to add double quotes around the text with leading whitespace.
XmlPullParser.next() and XmlPullParser.getText() can return the content in several pieces, in an unpredictable way. In your case, the very first space character may be returned as a separate first piece and then silently dropped if your program iterates on xpp.next() without concatenating the pieces. The algorithm should rather be:
String str = "";
while (xpp.next() == XmlPullParser.TEXT) {
str += xpp.getText();
}
return str;
What are kinds of whitespaces in Java?
I need to check in my code if the text contains any whitespaces.
My code is:
if (text.contains(" ") || text.contains("\t") || text.contains("\r")
|| text.contains("\n"))
{
//code goes here
}
I already know about \n ,\t ,\r and space.
For a non-regular expression approach, you can check Character.isWhitespace for each character.
boolean containsWhitespace(String s) {
    for (int i = 0; i < s.length(); ++i) {
        if (Character.isWhitespace(s.charAt(i))) {
            return true;
        }
    }
    return false;
}
Which are the white spaces in Java?
The documentation specifies what Java considers to be whitespace:
public static boolean isWhitespace(char ch)
Determines if the specified character is white space according to Java. A character is a Java whitespace character if and only if it satisfies one of the following criteria:
It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space
('\u00A0', '\u2007', '\u202F').
It is '\u0009', HORIZONTAL TABULATION.
It is '\u000A', LINE FEED.
It is '\u000B', VERTICAL TABULATION.
It is '\u000C', FORM FEED.
It is '\u000D', CARRIAGE RETURN.
It is '\u001C', FILE SEPARATOR.
It is '\u001D', GROUP SEPARATOR.
It is '\u001E', RECORD SEPARATOR.
It is '\u001F', UNIT SEPARATOR.
boolean containsWhitespace = false;
for (int i = 0; i < text.length() && !containsWhitespace; i++) {
    if (Character.isWhitespace(text.charAt(i))) {
        containsWhitespace = true;
    }
}
return containsWhitespace;
or, using Guava,
boolean containsWhitespace = CharMatcher.WHITESPACE.matchesAnyOf(text);
If you want to consider a regular-expression-based way of doing it (the limit of -1 keeps trailing empty strings, so trailing whitespace is detected too):
if(text.split("\\s", -1).length > 1){
//text contains whitespace
}
Use Character.isWhitespace() rather than creating your own.
If you can use apache.commons.lang in your project, the easiest way would be just to use the method provided there:
public static boolean containsWhitespace(CharSequence seq)
Check whether the given CharSequence contains any whitespace characters.
Parameters:
seq - the CharSequence to check (may be null)
Returns:
true if the CharSequence is not empty and contains at least 1 whitespace character
It handles empty and null parameters and provides the functionality at a central place.
From sun docs:
\s A whitespace character: [ \t\n\x0B\f\r]
The simplest way is to use it with regex.
boolean whitespaceSearchRegExp(String input) {
return java.util.regex.Pattern.compile("\\s").matcher(input).find();
}
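Note that the default regex \s and Character.isWhitespace do not agree on every character. The sketch below (WhitespaceDemo and its method names are mine, just for illustration) puts the two checks side by side so you can try the characters you care about:

```java
import java.util.regex.Pattern;

final class WhitespaceDemo {
    private static final Pattern REGEX_WS = Pattern.compile("\\s");

    // True if any char is whitespace according to Character.isWhitespace.
    static boolean containsJavaWhitespace(String s) {
        return s.chars().anyMatch(Character::isWhitespace);
    }

    // True if the default \s class, [ \t\n\x0B\f\r], matches anywhere.
    static boolean containsRegexWhitespace(String s) {
        return REGEX_WS.matcher(s).find();
    }
}
```

For example, an EM SPACE (U+2003) and the FILE SEPARATOR (U+001C) count as whitespace for Character.isWhitespace but are not matched by the default \s, while a non-breaking space (U+00A0) is rejected by both.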
Why don't you check whether text.trim() has a different length? Note that this only detects leading or trailing whitespace:
if(text.length() != text.trim().length() || otherConditions){
//your code
}
Is there a recommended way to escape <, >, " and & characters when outputting HTML in plain Java code? (Other than manually doing the following, that is).
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = source.replace("&", "&amp;").replace("<", "&lt;"); // ampersand first, otherwise the & of &lt; gets escaped again
StringEscapeUtils from Apache Commons Lang:
import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = escapeHtml(source);
For version 3:
import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
// ...
String escaped = escapeHtml4(source);
An alternative to Apache Commons: Use Spring's HtmlUtils.htmlEscape(String input) method.
Nice short method:
public static String escapeHTML(String s) {
StringBuilder out = new StringBuilder(Math.max(16, s.length()));
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c > 127 || c == '"' || c == '\'' || c == '<' || c == '>' || c == '&') {
out.append("&#");
out.append((int) c);
out.append(';');
} else {
out.append(c);
}
}
return out.toString();
}
Based on https://stackoverflow.com/a/8838023/1199155 (the amp is missing there). The four characters checked in the if clause are the only ones below 128, according to http://www.w3.org/TR/html4/sgml/entities.html
There is a newer version of the Apache Commons Lang library and it uses a different package name (org.apache.commons.lang3). The StringEscapeUtils now has different static methods for escaping different types of documents (http://commons.apache.org/proper/commons-lang/javadocs/api-3.0/index.html). So to escape HTML version 4.0 string:
import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
String output = escapeHtml4("The less than sign (<) and ampersand (&) must be escaped before using them in HTML");
For those who use Google Guava:
import com.google.common.html.HtmlEscapers;
[...]
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = HtmlEscapers.htmlEscaper().escape(source);
Be careful with this. There are a number of different 'contexts' within an HTML document: Inside an element, quoted attribute value, unquoted attribute value, URL attribute, javascript, CSS, etc... You'll need to use a different encoding method for each of these to prevent Cross-Site Scripting (XSS). Check the OWASP XSS Prevention Cheat Sheet for details on each of these contexts. You can find escaping methods for each of these contexts in the OWASP ESAPI library -- https://github.com/ESAPI/esapi-java-legacy.
On android (API 16 or greater) you can:
Html.escapeHtml(textToScape);
or for lower API:
TextUtils.htmlEncode(textToScape);
For some purposes, HtmlUtils:
import org.springframework.web.util.HtmlUtils;
[...]
HtmlUtils.htmlEscapeDecimal("&"); //gives &#38;
HtmlUtils.htmlEscape("&"); //gives &amp;
org.apache.commons.lang3.StringEscapeUtils is now deprecated. You should now use org.apache.commons.text.StringEscapeUtils by adding this dependency:
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
<version>${commons.text.version}</version>
</dependency>
While @dfa's answer of org.apache.commons.lang.StringEscapeUtils.escapeHtml is nice, and I have used it in the past, it should not be used for escaping HTML (or XML) attributes; otherwise the whitespace will be normalized (meaning all adjacent whitespace characters become a single space).
I know this because I have had bugs filed against my library (JATL) for attributes where whitespace was not preserved. Thus I have a drop-in (copy n' paste) class (of which I stole some from JDOM) that differentiates the escaping of attributes and element content.
While proper attribute escaping may not have mattered as much in the past, it is becoming increasingly relevant given the use of HTML5's data- attributes.
Java 8+ Solution:
public static String escapeHTML(String str) {
return str.chars().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
"&#" + c + ";" : String.valueOf((char) c)).collect(Collectors.joining());
}
String#chars returns an IntStream of the char values from the String. We can then use mapToObj to escape the characters with a character code greater than 127 (non-ASCII characters) as well as the double quote ("), single quote ('), left angle bracket (<), right angle bracket (>), and ampersand (&). Collectors.joining concatenates the Strings back together.
To better handle Unicode characters, String#codePoints can be used instead.
public static String escapeHTML(String str) {
return str.codePoints().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
"&#" + c + ";" : new String(Character.toChars(c)))
.collect(Collectors.joining());
}
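The difference between the two versions shows up with characters outside the Basic Multilingual Plane, which Java stores as two chars (a surrogate pair). The sketch below repeats both variants under the hypothetical names escapeByChar and escapeByCodePoint so they can be compared on the same input:

```java
import java.util.stream.Collectors;

final class NumericEscapeDemo {
    private static final String SPECIAL = "\"'<>&";

    // chars(): one numeric reference per UTF-16 code unit.
    static String escapeByChar(String str) {
        return str.chars()
                .mapToObj(c -> c > 127 || SPECIAL.indexOf(c) != -1
                        ? "&#" + c + ";" : String.valueOf((char) c))
                .collect(Collectors.joining());
    }

    // codePoints(): one numeric reference per Unicode code point.
    static String escapeByCodePoint(String str) {
        return str.codePoints()
                .mapToObj(c -> c > 127 || SPECIAL.indexOf(c) != -1
                        ? "&#" + c + ";" : new String(Character.toChars(c)))
                .collect(Collectors.joining());
    }
}
```

For U+1F600 (written "\uD83D\uDE00" in Java source), escapeByChar produces &#55357;&#56832;, two references to lone surrogates, which are not valid character references, while escapeByCodePoint produces the single reference &#128512;.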
Most libraries offer escaping of everything they can, including hundreds of symbols and thousands of non-ASCII characters, which is not what you want in a UTF-8 world.
Also, as Jeff Williams noted, there's no single “escape HTML” option; there are several contexts.
Assuming you never use unquoted attributes, and keeping in mind that different contexts exist, I've written my own version:
private static final long TEXT_ESCAPE =
1L << '&' | 1L << '<';
private static final long DOUBLE_QUOTED_ATTR_ESCAPE =
TEXT_ESCAPE | 1L << '"';
private static final long SINGLE_QUOTED_ATTR_ESCAPE =
TEXT_ESCAPE | 1L << '\'';
private static final long ESCAPES =
DOUBLE_QUOTED_ATTR_ESCAPE | SINGLE_QUOTED_ATTR_ESCAPE;
// 'quot' and 'apos' are 1 char longer than '#34' and '#39'
// which I've decided to use
private static final String REPLACEMENTS = "&#34;&amp;&#39;&lt;";
private static final int REPL_SLICES = /* [0, 5, 10, 15, 19) */
5<<5 | 10<<10 | 15<<15 | 19<<20;
// These 5-bit numbers packed into a single int
// are indices within REPLACEMENTS which is a 'flat' String[]
private static void appendEscaped(
Appendable builder, CharSequence content, long escapes) {
try {
int startIdx = 0, len = content.length();
for (int i = 0; i < len; i++) {
char c = content.charAt(i);
long one;
if (((c & 63) == c) && ((one = 1L << c) & escapes) != 0) {
// -^^^^^^^^^^^^^^^ -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
// | | take only dangerous characters
// | java shifts longs by 6 least significant bits,
// | e. g. << 0b110111111 is same as << 0b111111.
// | Filter out bigger characters
int index = Long.bitCount(ESCAPES & (one - 1));
builder.append(content, startIdx, i /* exclusive */).append(
REPLACEMENTS,
REPL_SLICES >>> (5 * index) & 31,
REPL_SLICES >>> (5 * (index + 1)) & 31
);
startIdx = i + 1;
}
}
builder.append(content, startIdx, len);
} catch (IOException e) {
// typically, our Appendable is StringBuilder which does not throw;
// also, there's no way to declare 'if A#append() throws E,
// then appendEscaped() throws E, too'
throw new UncheckedIOException(e);
}
}
Consider copy-pasting from Gist without line length limit.
UPD: As another answer suggests, > escaping is not necessary; also, " within attr='…' is allowed, too. I've updated the code accordingly.
You may check it out yourself:
<!DOCTYPE html>
<html lang="en">
<head><title>Test</title></head>
<body>
<p title="&lt;&#34;I'm double-quoted!&#34;>">&lt;"Hello!"></p>
<p title='&lt;"I&#39;m single-quoted!">'>&lt;"Goodbye!"></p>
</body>
</html>
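The index lookup inside appendEscaped can be checked in isolation. The sketch below copies the constants from the code above into a throwaway class (EscapeIndexDemo and replacementFor are names made up for this illustration) and returns the replacement for a single character, or null if the character is not escaped:

```java
final class EscapeIndexDemo {
    // One bit per escaped character, at the position of its char code.
    static final long ESCAPES = 1L << '"' | 1L << '&' | 1L << '\'' | 1L << '<';
    static final String REPLACEMENTS = "&#34;&amp;&#39;&lt;";
    // Slice boundaries [0, 5, 10, 15, 19) packed as 5-bit numbers.
    static final int REPL_SLICES = 5 << 5 | 10 << 10 | 15 << 15 | 19 << 20;

    static String replacementFor(char c) {
        // Characters above 63 would wrap around in the shift, so reject them first.
        if ((c & 63) != c || (1L << c & ESCAPES) == 0) {
            return null;
        }
        // Rank of c among the escaped characters = number of set bits below it.
        int index = Long.bitCount(ESCAPES & ((1L << c) - 1));
        int from = REPL_SLICES >>> (5 * index) & 31;
        int to = REPL_SLICES >>> (5 * (index + 1)) & 31;
        return REPLACEMENTS.substring(from, to);
    }
}
```

For '<' the three lower bits ('"', '&' and the single quote) give index 3, so the slice is [15, 19), which is "&lt;".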