Convert plain text to HTML text in Java - java

I have a Java program that receives plain text from a server. The plain text may contain URLs. Is there a class in the Java library that converts plain text to HTML text? Or any other library? If not, what is the solution?

You should do some replacements on the text programmatically. Here are some clues:
All newlines should be converted to "<br>\n" (the \n is for better readability of the output).
All CRs should be dropped (who uses DOS encoding anyway).
All pairs of spaces should be replaced with " &nbsp;"
Replace "<" with "&lt;"
Replace "&" with "&amp;"
All other characters < 128 should be left as they are.
All other characters >= 128 should be written as "&#"+((int)myChar)+";", to make them readable in every encoding.
To autodetect your links, you could use a regex like "http://[^ ]+" or "www.[^ ]" and convert them, as JB Nizet said, to "<a href='"+url+"'>"+url+"</a>", but only after having done all the other replacements.
The code to do this looks something like this:
public static String escape(String s) {
    StringBuilder builder = new StringBuilder();
    boolean previousWasASpace = false;
    for( char c : s.toCharArray() ) {
        if( c == ' ' ) {
            if( previousWasASpace ) {
                // second space of a pair: emit a non-breaking space instead
                builder.append("&nbsp;");
                previousWasASpace = false;
                continue;
            }
            previousWasASpace = true;
        } else {
            previousWasASpace = false;
        }
        switch(c) {
            case '<': builder.append("&lt;"); break;
            case '>': builder.append("&gt;"); break;
            case '&': builder.append("&amp;"); break;
            case '"': builder.append("&quot;"); break;
            case '\n': builder.append("<br>"); break;
            // We need Tab support here, because we print StackTraces as HTML
            case '\t': builder.append("&nbsp; &nbsp; &nbsp;"); break;
            default:
                if( c < 128 ) {
                    builder.append(c);
                } else {
                    // non-ASCII: write a numeric character reference
                    builder.append("&#").append((int)c).append(";");
                }
        }
    }
    return builder.toString();
}
However, the link conversion has yet to be added. If someone does it, please update the code.
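As a rough sketch of that missing link step, assuming the simple "http://[^ ]+" idea from the clues above (widened here to https and stopped at '<' so that a following <br> is not swallowed), the already-escaped text could be post-processed as below. The class and method names are made up for illustration, and the pattern is deliberately naive: it will, for instance, include trailing punctuation in the link.

```java
import java.util.regex.Pattern;

final class LinkSketch {
    // Assumption: a deliberately simple URL pattern. It stops at whitespace
    // and at '<', so a "<br>" produced by the escaping step stays outside the link.
    private static final Pattern URL = Pattern.compile("https?://[^\\s<]+");

    /** Wraps every URL found in already-escaped text in an anchor tag. */
    static String linkify(String escapedHtml) {
        // $0 refers to the whole match in the replacement string.
        return URL.matcher(escapedHtml).replaceAll("<a href=\"$0\">$0</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("see http://www.google.com<br>next line"));
    }
}
```

Running this after escape() rather than before matters: otherwise the angle brackets and quotes of the generated anchor tag would themselves be escaped.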

I found a solution using pattern matching. Here is my code -
String str = "(?i)\\b((?:https?://|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:\'\".,<>?«»“”‘’]))";
Pattern patt = Pattern.compile(str);
Matcher matcher = patt.matcher(plain);
plain = matcher.replaceAll("<a href=\"$1\">$1</a>");
And here are the input and output.
Input text (the variable plain):
some text and then the URL http://www.google.com and then some other text.
Output:
some text and then the URL <a href="http://www.google.com">http://www.google.com</a> and then some other text.

I just joined the code from all the answers:
private static String txtToHtml(String s) {
    StringBuilder builder = new StringBuilder();
    boolean previousWasASpace = false;
    for (char c : s.toCharArray()) {
        if (c == ' ') {
            if (previousWasASpace) {
                builder.append("&nbsp;");
                previousWasASpace = false;
                continue;
            }
            previousWasASpace = true;
        } else {
            previousWasASpace = false;
        }
        switch (c) {
            case '<':
                builder.append("&lt;");
                break;
            case '>':
                builder.append("&gt;");
                break;
            case '&':
                builder.append("&amp;");
                break;
            case '"':
                builder.append("&quot;");
                break;
            case '\n':
                builder.append("<br>");
                break;
            // We need Tab support here, because we print StackTraces as HTML
            case '\t':
                builder.append("&nbsp; &nbsp; &nbsp;");
                break;
            default:
                builder.append(c);
        }
    }
    String converted = builder.toString();
    // Link detection runs on the escaped text, so the generated tags are not escaped again
    String str = "(?i)\\b((?:https?://|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:\'\".,<>?«»“”‘’]))";
    Pattern patt = Pattern.compile(str);
    Matcher matcher = patt.matcher(converted);
    converted = matcher.replaceAll("<a href=\"$1\">$1</a>");
    return converted;
}

Use this:
public static String stringToHTMLString(String string) {
    StringBuilder sb = new StringBuilder(string.length());
    // true if last char was blank
    boolean lastWasBlankChar = false;
    int len = string.length();
    char c;
    for (int i = 0; i < len; i++) {
        c = string.charAt(i);
        if (c == ' ') {
            // blank gets extra work,
            // this solves the problem you get if you replace all
            // blanks with &nbsp;, if you do that you lose
            // word breaking
            if (lastWasBlankChar) {
                lastWasBlankChar = false;
                sb.append("&nbsp;");
            } else {
                lastWasBlankChar = true;
                sb.append(' ');
            }
        } else {
            lastWasBlankChar = false;
            //
            // HTML Special Chars
            if (c == '"')
                sb.append("&quot;");
            else if (c == '&')
                sb.append("&amp;");
            else if (c == '<')
                sb.append("&lt;");
            else if (c == '>')
                sb.append("&gt;");
            else if (c == '\n')
                // Handle Newline
                sb.append("<br/>");
            else {
                int ci = 0xffff & c;
                if (ci < 160)
                    // nothing special, below the Latin-1 printable range
                    sb.append(c);
                else {
                    // otherwise write a numeric character reference
                    sb.append("&#");
                    sb.append(ci);
                    sb.append(';');
                }
            }
        }
    }
    return sb.toString();
}

If your plain text is a URL (which is different from containing a hyperlink, as you wrote in your question), then transforming it into a hyperlink in HTML is simply done by
String hyperlink = "<a href='" + url + "'>" + url + "</a>";
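If the URL can contain characters such as & or quotes, concatenating it directly may produce broken markup. A minimal sketch that escapes the URL before embedding it could look like this; escapeHtml and toHyperlink are hypothetical helper names, not part of any library.

```java
final class HyperlinkSketch {
    /** Escapes the characters that are special in HTML text and attribute values. */
    static String escapeHtml(String s) {
        // '&' must be handled first, otherwise the entities added below get escaped again.
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }

    /** Builds an anchor tag whose href and label are both the (escaped) URL. */
    static String toHyperlink(String url) {
        String safe = escapeHtml(url);
        return "<a href='" + safe + "'>" + safe + "</a>";
    }

    public static void main(String[] args) {
        System.out.println(toHyperlink("http://example.com/?a=1&b=2"));
    }
}
```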

In an Android application I just implemented HTMLifying of content (see https://github.com/andstatus/andstatus/issues/375 ). The actual transformation was done in literally 3 lines of code using the Android system libraries. This has the advantage of picking up a better implementation with each subsequent version of the Android libraries.
private static String htmlifyPlain(String textIn) {
SpannableString spannable = SpannableString.valueOf(textIn);
Linkify.addLinks(spannable, Linkify.WEB_URLS);
return Html.toHtml(spannable);
}

Related

Regexp not matching extension [duplicate]

Is there a standard (preferably Apache Commons or similarly non-viral) library for doing "glob" type matches in Java? When I had to do similar in Perl once, I just changed all the "." to "\.", the "*" to ".*" and the "?" to "." and that sort of thing, but I'm wondering if somebody has done the work for me.
Similar question: Create regex from glob expression
Globbing is also planned for implementation in Java 7.
See FileSystem.getPathMatcher(String) and the "Finding Files" tutorial.
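For reference, java.nio.file has been part of the JDK since Java 7, so a small sketch of that API might look as follows. The class and method names (PathMatcherSketch, globMatches) are only illustrative; the "glob:" prefix selects the glob syntax of getPathMatcher.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

final class PathMatcherSketch {
    /** Returns true if the given file name matches the glob on the default file system. */
    static boolean globMatches(String glob, String name) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(name));
    }

    public static void main(String[] args) {
        System.out.println(globMatches("*.java", "Foo.java"));
        System.out.println(globMatches("*.{java,class}", "Foo.class"));
    }
}
```

Note that the matcher works on Path objects, so this fits file names better than arbitrary strings.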
There's nothing built-in, but it's pretty simple to convert something glob-like to a regex:
public static String createRegexFromGlob(String glob)
{
String out = "^";
for(int i = 0; i < glob.length(); ++i)
{
final char c = glob.charAt(i);
switch(c)
{
case '*': out += ".*"; break;
case '?': out += '.'; break;
case '.': out += "\\."; break;
case '\\': out += "\\\\"; break;
default: out += c;
}
}
out += '$';
return out;
}
this works for me, but I'm not sure if it covers the glob "standard", if there is one :)
Update by Paul Tomblin: I found a perl program that does glob conversion, and adapting it to Java I end up with:
private String convertGlobToRegEx(String line)
{
LOG.info("got line [" + line + "]");
line = line.trim();
int strLen = line.length();
StringBuilder sb = new StringBuilder(strLen);
// Remove beginning and ending * globs because they're useless
if (line.startsWith("*"))
{
line = line.substring(1);
strLen--;
}
if (line.endsWith("*"))
{
line = line.substring(0, strLen-1);
strLen--;
}
boolean escaping = false;
int inCurlies = 0;
for (char currentChar : line.toCharArray())
{
switch (currentChar)
{
case '*':
if (escaping)
sb.append("\\*");
else
sb.append(".*");
escaping = false;
break;
case '?':
if (escaping)
sb.append("\\?");
else
sb.append('.');
escaping = false;
break;
case '.':
case '(':
case ')':
case '+':
case '|':
case '^':
case '$':
case '#':
case '%':
sb.append('\\');
sb.append(currentChar);
escaping = false;
break;
case '\\':
if (escaping)
{
sb.append("\\\\");
escaping = false;
}
else
escaping = true;
break;
case '{':
if (escaping)
{
sb.append("\\{");
}
else
{
sb.append('(');
inCurlies++;
}
escaping = false;
break;
case '}':
if (inCurlies > 0 && !escaping)
{
sb.append(')');
inCurlies--;
}
else if (escaping)
sb.append("\\}");
else
sb.append("}");
escaping = false;
break;
case ',':
if (inCurlies > 0 && !escaping)
{
sb.append('|');
}
else if (escaping)
sb.append("\\,");
else
sb.append(",");
break;
default:
escaping = false;
sb.append(currentChar);
}
}
return sb.toString();
}
I'm editing into this answer rather than making my own because this answer put me on the right track.
Thanks to everyone here for their contributions. I wrote a more comprehensive conversion than any of the previous answers:
/**
 * Converts a standard POSIX Shell globbing pattern into a regular expression
 * pattern. The result can be used with the standard {@link java.util.regex} API to
 * recognize strings which match the glob pattern.
 * <p/>
 * See also, the POSIX Shell language:
 * http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_13_01
 *
 * @param pattern A glob pattern.
 * @return A regex pattern to recognize the given glob pattern.
 */
public static final String convertGlobToRegex(String pattern) {
StringBuilder sb = new StringBuilder(pattern.length());
int inGroup = 0;
int inClass = 0;
int firstIndexInClass = -1;
char[] arr = pattern.toCharArray();
for (int i = 0; i < arr.length; i++) {
char ch = arr[i];
switch (ch) {
case '\\':
if (++i >= arr.length) {
sb.append('\\');
} else {
char next = arr[i];
switch (next) {
case ',':
// escape not needed
break;
case 'Q':
case 'E':
// extra escape needed
sb.append('\\');
default:
sb.append('\\');
}
sb.append(next);
}
break;
case '*':
if (inClass == 0)
sb.append(".*");
else
sb.append('*');
break;
case '?':
if (inClass == 0)
sb.append('.');
else
sb.append('?');
break;
case '[':
inClass++;
firstIndexInClass = i+1;
sb.append('[');
break;
case ']':
inClass--;
sb.append(']');
break;
case '.':
case '(':
case ')':
case '+':
case '|':
case '^':
case '$':
case '#':
case '%':
if (inClass == 0 || (firstIndexInClass == i && ch == '^'))
sb.append('\\');
sb.append(ch);
break;
case '!':
if (firstIndexInClass == i)
sb.append('^');
else
sb.append('!');
break;
case '{':
inGroup++;
sb.append('(');
break;
case '}':
inGroup--;
sb.append(')');
break;
case ',':
if (inGroup > 0)
sb.append('|');
else
sb.append(',');
break;
default:
sb.append(ch);
}
}
return sb.toString();
}
And the unit tests to prove it works:
/**
 * @author Neil Traft
 */
public class StringUtils_ConvertGlobToRegex_Test {
    @Test
    public void star_becomes_dot_star() throws Exception {
        assertEquals("gl.*b", StringUtils.convertGlobToRegex("gl*b"));
    }
    @Test
    public void escaped_star_is_unchanged() throws Exception {
        assertEquals("gl\\*b", StringUtils.convertGlobToRegex("gl\\*b"));
    }
    @Test
    public void question_mark_becomes_dot() throws Exception {
        assertEquals("gl.b", StringUtils.convertGlobToRegex("gl?b"));
    }
    @Test
    public void escaped_question_mark_is_unchanged() throws Exception {
        assertEquals("gl\\?b", StringUtils.convertGlobToRegex("gl\\?b"));
    }
    @Test
    public void character_classes_dont_need_conversion() throws Exception {
        assertEquals("gl[-o]b", StringUtils.convertGlobToRegex("gl[-o]b"));
    }
    @Test
    public void escaped_classes_are_unchanged() throws Exception {
        assertEquals("gl\\[-o\\]b", StringUtils.convertGlobToRegex("gl\\[-o\\]b"));
    }
    @Test
    public void negation_in_character_classes() throws Exception {
        assertEquals("gl[^a-n!p-z]b", StringUtils.convertGlobToRegex("gl[!a-n!p-z]b"));
    }
    @Test
    public void nested_negation_in_character_classes() throws Exception {
        assertEquals("gl[[^a-n]!p-z]b", StringUtils.convertGlobToRegex("gl[[!a-n]!p-z]b"));
    }
    @Test
    public void escape_carat_if_it_is_the_first_char_in_a_character_class() throws Exception {
        assertEquals("gl[\\^o]b", StringUtils.convertGlobToRegex("gl[^o]b"));
    }
    @Test
    public void metachars_are_escaped() throws Exception {
        assertEquals("gl..*\\.\\(\\)\\+\\|\\^\\$\\#\\%b", StringUtils.convertGlobToRegex("gl?*.()+|^$#%b"));
    }
    @Test
    public void metachars_in_character_classes_dont_need_escaping() throws Exception {
        assertEquals("gl[?*.()+|^$#%]b", StringUtils.convertGlobToRegex("gl[?*.()+|^$#%]b"));
    }
    @Test
    public void escaped_backslash_is_unchanged() throws Exception {
        assertEquals("gl\\\\b", StringUtils.convertGlobToRegex("gl\\\\b"));
    }
    @Test
    public void slashQ_and_slashE_are_escaped() throws Exception {
        assertEquals("\\\\Qglob\\\\E", StringUtils.convertGlobToRegex("\\Qglob\\E"));
    }
    @Test
    public void braces_are_turned_into_groups() throws Exception {
        assertEquals("(glob|regex)", StringUtils.convertGlobToRegex("{glob,regex}"));
    }
    @Test
    public void escaped_braces_are_unchanged() throws Exception {
        assertEquals("\\{glob\\}", StringUtils.convertGlobToRegex("\\{glob\\}"));
    }
    @Test
    public void commas_dont_need_escaping() throws Exception {
        assertEquals("(glob,regex),", StringUtils.convertGlobToRegex("{glob\\,regex},"));
    }
}
There are a couple of libraries that do glob-like pattern matching and are more modern than the ones listed:
There's Ant's DirectoryScanner
And
Spring's AntPathMatcher
I recommend both over the other solutions, since Ant-style globbing has pretty much become the standard glob syntax in the Java world (Hudson, Spring, Ant and, I think, Maven).
I recently had to do it and used \Q and \E to escape the glob pattern:
private static Pattern getPatternFromGlob(String glob) {
return Pattern.compile(
"^" + Pattern.quote(glob)
.replace("*", "\\E.*\\Q")
.replace("?", "\\E.\\Q")
+ "$");
}
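A short usage sketch of this approach, with the method body copied from above into a hypothetical holder class (GlobQuoteSketch), shows that regex metacharacters in the glob are treated literally while * and ? keep their glob meaning:

```java
import java.util.regex.Pattern;

final class GlobQuoteSketch {
    // Same body as getPatternFromGlob above: quote everything, then
    // briefly leave the quoted region for each '*' and '?'.
    static Pattern getPatternFromGlob(String glob) {
        return Pattern.compile(
            "^" + Pattern.quote(glob)
                    .replace("*", "\\E.*\\Q")
                    .replace("?", "\\E.\\Q")
            + "$");
    }

    public static void main(String[] args) {
        System.out.println(getPatternFromGlob("*.txt").matcher("notes.txt").matches());
        // '+' is literal here, unlike in a hand-written regex
        System.out.println(getPatternFromGlob("a+c").matcher("aac").matches());
    }
}
```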
This is a simple Glob implementation which handles * and ? in the pattern
public class GlobMatch {
private String text;
private String pattern;
public boolean match(String text, String pattern) {
this.text = text;
this.pattern = pattern;
return matchCharacter(0, 0);
}
private boolean matchCharacter(int patternIndex, int textIndex) {
if (patternIndex >= pattern.length()) {
return false;
}
switch(pattern.charAt(patternIndex)) {
case '?':
// Match any character
if (textIndex >= text.length()) {
return false;
}
break;
case '*':
// * at the end of the pattern will match anything
if (patternIndex + 1 >= pattern.length() || textIndex >= text.length()) {
return true;
}
// Probe forward to see if we can get a match
while (textIndex < text.length()) {
if (matchCharacter(patternIndex + 1, textIndex)) {
return true;
}
textIndex++;
}
return false;
default:
if (textIndex >= text.length()) {
return false;
}
String textChar = text.substring(textIndex, textIndex + 1);
String patternChar = pattern.substring(patternIndex, patternIndex + 1);
// Note the match is case insensitive
if (textChar.compareToIgnoreCase(patternChar) != 0) {
return false;
}
}
// End of pattern and text?
if (patternIndex + 1 >= pattern.length() && textIndex + 1 >= text.length()) {
return true;
}
// Go on to match the next character in the pattern
return matchCharacter(patternIndex + 1, textIndex + 1);
}
}
Similar to Tony Edgecombe's answer, here is a short and simple globber that supports * and ? without using regex, if anybody needs one.
public static boolean matches(String text, String glob) {
String rest = null;
int pos = glob.indexOf('*');
if (pos != -1) {
rest = glob.substring(pos + 1);
glob = glob.substring(0, pos);
}
if (glob.length() > text.length())
return false;
// handle the part up to the first *
for (int i = 0; i < glob.length(); i++)
if (glob.charAt(i) != '?'
&& !glob.substring(i, i + 1).equalsIgnoreCase(text.substring(i, i + 1)))
return false;
// recurse for the part after the first *, if any
if (rest == null) {
return glob.length() == text.length();
} else {
for (int i = glob.length(); i <= text.length(); i++) {
if (matches(text.substring(i), rest))
return true;
}
return false;
}
}
This may be a slightly hacky approach. I figured it out from the code of NIO.2's Files.newDirectoryStream(Path dir, String glob). Note that a new Path object is created for every match. So far I have only been able to test this on a Windows file system, but I believe it should work on Unix as well.
// a file system hack to get a glob matching
PathMatcher matcher = ("*".equals(glob)) ? null
: FileSystems.getDefault().getPathMatcher("glob:" + glob);
if ("*".equals(glob) || matcher.matches(Paths.get(someName))) {
// do your stuff here
}
UPDATE: this works on both Mac and Linux.
The previous solution by Vincent Robert/dimo414 relies on Pattern.quote() being implemented in terms of \Q...\E, which is not documented in the API and therefore may not be the case for other/future Java implementations. The following solution removes that implementation dependency by escaping all occurrences of \E instead of using quote(). It also activates DOTALL mode ((?s)) in case the string to be matched contains newlines.
public static Pattern globToRegex(String glob)
{
return Pattern.compile(
"(?s)^\\Q" +
glob.replace("\\E", "\\E\\\\E\\Q")
.replace("*", "\\E.*\\Q")
.replace("?", "\\E.\\Q") +
"\\E$"
);
}
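To illustrate the two points made above (DOTALL mode and the handling of a literal \E in the glob), here is a small sketch that copies the method into a hypothetical holder class, GlobToRegexSketch, and exercises both cases:

```java
import java.util.regex.Pattern;

final class GlobToRegexSketch {
    // Same body as globToRegex above.
    static Pattern globToRegex(String glob) {
        return Pattern.compile(
            "(?s)^\\Q" +
            glob.replace("\\E", "\\E\\\\E\\Q")
                .replace("*", "\\E.*\\Q")
                .replace("?", "\\E.\\Q") +
            "\\E$");
    }

    public static void main(String[] args) {
        // '*' spans the newline because of (?s)
        System.out.println(globToRegex("a*b").matcher("a\nb").matches());
        // a backslash followed by 'E' in the glob is matched literally
        System.out.println(globToRegex("x\\Ey").matcher("x\\Ey").matches());
    }
}
```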
I don't know about a "standard" implementation, but I know of a sourceforge project released under the BSD license that implemented glob matching for files. It's implemented in one file, maybe you can adapt it for your requirements.
There is sun.nio.fs.Globs but it is not part of the public API.
You can use it indirectly via:
FileSystems.getDefault().getPathMatcher("glob:<myPattern>")
But it returns a PathMatcher, which is inconvenient to work with, since it accepts only a Path as a parameter (not a File).
One possible option is to convert the PathMatcher to a regex pattern (just call its 'toString()' method).
Another option is to use a dedicated glob library like glob-library-java.
Long ago I was doing a massive glob-driven text filtering so I've written a small piece of code (15 lines of code, no dependencies beyond JDK).
It handles only '*' (was sufficient for me), but can be easily extended for '?'.
It is several times faster than pre-compiled regexp, does not require any pre-compilation (essentially it is a string-vs-string comparison every time the pattern is matched).
Code:
public static boolean miniglob(String[] pattern, String line) {
if (pattern.length == 0) return line.isEmpty();
else if (pattern.length == 1) return line.equals(pattern[0]);
else {
if (!line.startsWith(pattern[0])) return false;
int idx = pattern[0].length();
for (int i = 1; i < pattern.length - 1; ++i) {
String patternTok = pattern[i];
int nextIdx = line.indexOf(patternTok, idx);
if (nextIdx < 0) return false;
else idx = nextIdx + patternTok.length();
}
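String lastTok = pattern[pattern.length - 1];
// the suffix must not overlap the part already matched (e.g. "ab*ba" vs "aba")
if (line.length() - lastTok.length() < idx || !line.endsWith(lastTok)) return false;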
return true;
}
}
Usage:
public static void main(String[] args) {
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
try {
// read from stdin space separated text and pattern
for (String input = in.readLine(); input != null; input = in.readLine()) {
String[] tokens = input.split(" ");
String line = tokens[0];
String[] pattern = tokens[1].split("\\*+", -1 /* want empty trailing token if any */);
// check matcher performance
long tm0 = System.currentTimeMillis();
for (int i = 0; i < 1000000; ++i) {
miniglob(pattern, line);
}
long tm1 = System.currentTimeMillis();
System.out.println("miniglob took " + (tm1-tm0) + " ms");
// check regexp performance
Pattern reptn = Pattern.compile(tokens[1].replace("*", ".*"));
Matcher mtchr = reptn.matcher(line);
tm0 = System.currentTimeMillis();
for (int i = 0; i < 1000000; ++i) {
mtchr.matches();
}
tm1 = System.currentTimeMillis();
System.out.println("regexp took " + (tm1-tm0) + " ms");
// check if miniglob worked correctly
if (miniglob(pattern, line)) {
System.out.println("+ >" + line);
}
else {
System.out.println("- >" + line);
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Copy/paste from here
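A smaller usage sketch than the benchmark above: the glob is split on runs of * once, and the resulting tokens are reused for every line. The holder class (MiniglobSketch) and the compile helper are hypothetical names; the copy of miniglob here is condensed and checks that the final token does not overlap text already consumed by earlier tokens.

```java
final class MiniglobSketch {
    /** Splits a glob on runs of '*'; -1 keeps an empty trailing token for a trailing '*'. */
    static String[] compile(String glob) {
        return glob.split("\\*+", -1);
    }

    static boolean miniglob(String[] pattern, String line) {
        if (pattern.length == 0) return line.isEmpty();
        if (pattern.length == 1) return line.equals(pattern[0]);
        if (!line.startsWith(pattern[0])) return false;
        int idx = pattern[0].length();
        // Middle tokens must appear in order, each after the previous one.
        for (int i = 1; i < pattern.length - 1; ++i) {
            int next = line.indexOf(pattern[i], idx);
            if (next < 0) return false;
            idx = next + pattern[i].length();
        }
        String last = pattern[pattern.length - 1];
        // The suffix has to start at or after idx, otherwise it overlaps earlier tokens.
        return line.length() - last.length() >= idx && line.endsWith(last);
    }

    public static void main(String[] args) {
        String[] pattern = compile("*.html");
        System.out.println(miniglob(pattern, "index.html"));
        System.out.println(miniglob(pattern, "index.htm"));
    }
}
```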
By the way, it seems as if you did it the hard way in Perl.
This does the trick in Perl:
my @files = glob("*.html");
# Or, if you prefer:
my @files = <*.html>;

Indent all lines after a certain character until another character

For an assignment we are creating a java program that accepts a java file, fixes messy code and outputs to a new file.
We are to assume there is only one bracket { } per line and that each bracket occurs at the end of the line. If/else statements also use brackets.
I am currently having trouble finding a way to indent every line after an opening bracket until next closing bracket, then decreasing indent after closing bracket until the next opening bracket. We are also required to use the methods below:
Updated code a bit:
public static void processJavaFile() {
}
}
This algorithm should get you started. I left a few glitches that you'll have to fix.
(For example it doesn't indent your { brackets } as currently written, and it adds an extra newline for every semicolon)
The indentation is handled by a 'depth' counter which keeps track of how many 'tabs' to add.
Consider using a conditional for loop instead of a foreach if you want more control over each iteration. (I wrote this quick n' dirty just to give you an idea of how it might be done)
public String parse(String input) {
StringBuilder output = new StringBuilder();
int depth = 0;
boolean isNewLine = false;
boolean wasSpaced = false;
boolean isQuotes = false;
String tab = " ";
for (char c : input.toCharArray()) {
switch (c) {
case '{':
output.append(c + "\n");
depth++;
isNewLine = true;
break;
case '}':
output.append("\n" + c);
depth--;
isNewLine = true;
break;
case '\n':
isNewLine = true;
break;
case ';':
output.append(c);
isNewLine = true;
break;
case '\'':
case '"':
if (!isQuotes) {
isQuotes = true;
} else {
isQuotes = false;
}
output.append(c);
break;
default:
if (c == ' ') {
if (!isQuotes) {
if (!wasSpaced) {
wasSpaced = true;
output.append(c);
}
} else {
output.append(c);
}
} else {
wasSpaced = false;
output.append(c);
}
break;
}
if (isNewLine) {
output.append('\n');
for (int i = 0; i < depth; i++) {
output.append(tab);
}
isNewLine = false;
}
}
return output.toString();
}
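Under the assignment's stated assumptions (at most one brace per line, at the end of the line), a line-based variant of the same depth-counter idea may be simpler than walking character by character. This is only a sketch: IndentSketch and indent are made-up names, it additionally assumes that a line closing a block starts with }, and it does not look for braces inside strings or comments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IndentSketch {
    private static final String TAB = "    "; // assumption: four spaces per level

    /** Re-indents the lines using a depth counter driven by trailing '{' and leading '}'. */
    static List<String> indent(List<String> lines) {
        List<String> out = new ArrayList<>();
        int depth = 0;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                out.add("");
                continue;
            }
            // A closing brace belongs one level further out than the block it ends.
            if (line.startsWith("}")) {
                depth = Math.max(0, depth - 1);
            }
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                sb.append(TAB);
            }
            out.add(sb.append(line).toString());
            // Everything after an opening brace is one level deeper.
            if (line.endsWith("{")) {
                depth++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> messy = Arrays.asList("class A {", "  void f() {", "int x = 1;", " }", "}");
        for (String line : indent(messy)) {
            System.out.println(line);
        }
    }
}
```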

Use a stack to check .txt file for ('s, {'s, ['s, etc - Java

I am trying to write a method in java to search a text file that I imported for specific characters. The file is actually a java program that I designed and converted to a .txt file.
When an opening brace/bracket is found, I am supposed to add (push) it to a stack and then when a corresponding closing brace/bracket is found I am supposed to remove (pop) it from the stack.
The purpose is to see if I have the correct number of ), }, ] and > to correspond with the (, {, [ and <. If they all match up, the method should return true; if they don't, it should return false.
Anyone know how I can write this?
This is the sample implementation for balancing the brackets in a input text file
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Stack;
public class BalanceBrackets {
private Stack<Character> symbolStack;
public void balance(String inputText) {
symbolStack = new Stack<Character>();
for (int index = 0; index < inputText.length(); index++) {
char currentSymbol = inputText.charAt(index);
switch (currentSymbol) {
case '(':
case '[':
case '{':
symbolStack.push(currentSymbol);
break;
case ')':
case ']':
case '}':
if (!symbolStack.isEmpty()) {
char symbolStackTop = symbolStack.pop();
if ((currentSymbol == '}' && symbolStackTop != '{')
|| (currentSymbol == ')' && symbolStackTop != '(')
|| (currentSymbol == ']' && symbolStackTop != '[')) {
System.out.println("Unmatched closing bracket while parsing " + currentSymbol + " at character " + (index + 1));
return;
}
} else {
System.out.println("Extra closing bracket while parsing " + currentSymbol + " at character " + (index + 1));
return;
}
break;
default:
break;
}
}
if (!symbolStack.isEmpty())
System.out.println("Insufficient closing brackets after parsing the entire input text");
else
System.out.println("Brackets are balanced");
}
public static void main(String[] args) throws IOException {
BufferedReader in = new BufferedReader(new FileReader("D://input.txt"));
String input = null;
StringBuilder sb = new StringBuilder();
while ((input = in.readLine()) != null) {
sb.append(input);
}
new BalanceBrackets().balance(sb.toString());
}
}
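Since the question asks for a method that returns true or false, here is a minimal sketch of the same stack idea as a boolean check. The names (BracketCheckSketch, isBalanced, opener) are illustrative. Angle brackets are left out on purpose, because < and > also appear as comparison operators in Java source, and brackets inside string literals or comments are not treated specially.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class BracketCheckSketch {
    /** Returns true if every (, [ and { is closed by the matching bracket in the right order. */
    static boolean isBalanced(String text) {
        Deque<Character> stack = new ArrayDeque<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '(':
                case '[':
                case '{':
                    stack.push(c);
                    break;
                case ')':
                case ']':
                case '}':
                    // A closer with nothing open, or with the wrong opener on top, is a mismatch.
                    if (stack.isEmpty() || stack.pop() != opener(c)) {
                        return false;
                    }
                    break;
                default:
                    break;
            }
        }
        // Anything left on the stack was opened but never closed.
        return stack.isEmpty();
    }

    private static char opener(char closer) {
        switch (closer) {
            case ')': return '(';
            case ']': return '[';
            default:  return '{';
        }
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("a(b[c]{d})"));
        System.out.println(isBalanced("(]"));
    }
}
```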

Is there an equivalent of java.util.regex for "glob" type patterns?

Is there a standard (preferably Apache Commons or similarly non-viral) library for doing "glob" type matches in Java? When I had to do similar in Perl once, I just changed all the "." to "\.", the "*" to ".*" and the "?" to "." and that sort of thing, but I'm wondering if somebody has done the work for me.
Similar question: Create regex from glob expression
Globbing is also planned for implemented in Java 7.
See FileSystem.getPathMatcher(String) and the "Finding Files" tutorial.
There's nothing built-in, but it's pretty simple to convert something glob-like to a regex:
public static String createRegexFromGlob(String glob)
{
String out = "^";
for(int i = 0; i < glob.length(); ++i)
{
final char c = glob.charAt(i);
switch(c)
{
case '*': out += ".*"; break;
case '?': out += '.'; break;
case '.': out += "\\."; break;
case '\\': out += "\\\\"; break;
default: out += c;
}
}
out += '$';
return out;
}
this works for me, but I'm not sure if it covers the glob "standard", if there is one :)
Update by Paul Tomblin: I found a perl program that does glob conversion, and adapting it to Java I end up with:
private String convertGlobToRegEx(String line)
{
LOG.info("got line [" + line + "]");
line = line.trim();
int strLen = line.length();
StringBuilder sb = new StringBuilder(strLen);
// Remove beginning and ending * globs because they're useless
if (line.startsWith("*"))
{
line = line.substring(1);
strLen--;
}
if (line.endsWith("*"))
{
line = line.substring(0, strLen-1);
strLen--;
}
boolean escaping = false;
int inCurlies = 0;
for (char currentChar : line.toCharArray())
{
switch (currentChar)
{
case '*':
if (escaping)
sb.append("\\*");
else
sb.append(".*");
escaping = false;
break;
case '?':
if (escaping)
sb.append("\\?");
else
sb.append('.');
escaping = false;
break;
case '.':
case '(':
case ')':
case '+':
case '|':
case '^':
case '$':
case '#':
case '%':
sb.append('\\');
sb.append(currentChar);
escaping = false;
break;
case '\\':
if (escaping)
{
sb.append("\\\\");
escaping = false;
}
else
escaping = true;
break;
case '{':
if (escaping)
{
sb.append("\\{");
}
else
{
sb.append('(');
inCurlies++;
}
escaping = false;
break;
case '}':
if (inCurlies > 0 && !escaping)
{
sb.append(')');
inCurlies--;
}
else if (escaping)
sb.append("\\}");
else
sb.append("}");
escaping = false;
break;
case ',':
if (inCurlies > 0 && !escaping)
{
sb.append('|');
}
else if (escaping)
sb.append("\\,");
else
sb.append(",");
break;
default:
escaping = false;
sb.append(currentChar);
}
}
return sb.toString();
}
I'm editing into this answer rather than making my own because this answer put me on the right track.
Thanks to everyone here for their contributions. I wrote a more comprehensive conversion than any of the previous answers:
/**
* Converts a standard POSIX Shell globbing pattern into a regular expression
* pattern. The result can be used with the standard {#link java.util.regex} API to
* recognize strings which match the glob pattern.
* <p/>
* See also, the POSIX Shell language:
* http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_13_01
*
* #param pattern A glob pattern.
* #return A regex pattern to recognize the given glob pattern.
*/
public static final String convertGlobToRegex(String pattern) {
StringBuilder sb = new StringBuilder(pattern.length());
int inGroup = 0;
int inClass = 0;
int firstIndexInClass = -1;
char[] arr = pattern.toCharArray();
for (int i = 0; i < arr.length; i++) {
char ch = arr[i];
switch (ch) {
case '\\':
if (++i >= arr.length) {
sb.append('\\');
} else {
char next = arr[i];
switch (next) {
case ',':
// escape not needed
break;
case 'Q':
case 'E':
// extra escape needed
sb.append('\\');
default:
sb.append('\\');
}
sb.append(next);
}
break;
case '*':
if (inClass == 0)
sb.append(".*");
else
sb.append('*');
break;
case '?':
if (inClass == 0)
sb.append('.');
else
sb.append('?');
break;
case '[':
inClass++;
firstIndexInClass = i+1;
sb.append('[');
break;
case ']':
inClass--;
sb.append(']');
break;
case '.':
case '(':
case ')':
case '+':
case '|':
case '^':
case '$':
case '#':
case '%':
if (inClass == 0 || (firstIndexInClass == i && ch == '^'))
sb.append('\\');
sb.append(ch);
break;
case '!':
if (firstIndexInClass == i)
sb.append('^');
else
sb.append('!');
break;
case '{':
inGroup++;
sb.append('(');
break;
case '}':
inGroup--;
sb.append(')');
break;
case ',':
if (inGroup > 0)
sb.append('|');
else
sb.append(',');
break;
default:
sb.append(ch);
}
}
return sb.toString();
}
And the unit tests to prove it works:
/**
* #author Neil Traft
*/
public class StringUtils_ConvertGlobToRegex_Test {
#Test
public void star_becomes_dot_star() throws Exception {
assertEquals("gl.*b", StringUtils.convertGlobToRegex("gl*b"));
}
#Test
public void escaped_star_is_unchanged() throws Exception {
assertEquals("gl\\*b", StringUtils.convertGlobToRegex("gl\\*b"));
}
#Test
public void question_mark_becomes_dot() throws Exception {
assertEquals("gl.b", StringUtils.convertGlobToRegex("gl?b"));
}
#Test
public void escaped_question_mark_is_unchanged() throws Exception {
assertEquals("gl\\?b", StringUtils.convertGlobToRegex("gl\\?b"));
}
#Test
public void character_classes_dont_need_conversion() throws Exception {
assertEquals("gl[-o]b", StringUtils.convertGlobToRegex("gl[-o]b"));
}
#Test
public void escaped_classes_are_unchanged() throws Exception {
assertEquals("gl\\[-o\\]b", StringUtils.convertGlobToRegex("gl\\[-o\\]b"));
}
#Test
public void negation_in_character_classes() throws Exception {
assertEquals("gl[^a-n!p-z]b", StringUtils.convertGlobToRegex("gl[!a-n!p-z]b"));
}
#Test
public void nested_negation_in_character_classes() throws Exception {
assertEquals("gl[[^a-n]!p-z]b", StringUtils.convertGlobToRegex("gl[[!a-n]!p-z]b"));
}
#Test
public void escape_carat_if_it_is_the_first_char_in_a_character_class() throws Exception {
assertEquals("gl[\\^o]b", StringUtils.convertGlobToRegex("gl[^o]b"));
}
#Test
public void metachars_are_escaped() throws Exception {
assertEquals("gl..*\\.\\(\\)\\+\\|\\^\\$\\#\\%b", StringUtils.convertGlobToRegex("gl?*.()+|^$#%b"));
}
#Test
public void metachars_in_character_classes_dont_need_escaping() throws Exception {
assertEquals("gl[?*.()+|^$#%]b", StringUtils.convertGlobToRegex("gl[?*.()+|^$#%]b"));
}
#Test
public void escaped_backslash_is_unchanged() throws Exception {
assertEquals("gl\\\\b", StringUtils.convertGlobToRegex("gl\\\\b"));
}
#Test
public void slashQ_and_slashE_are_escaped() throws Exception {
assertEquals("\\\\Qglob\\\\E", StringUtils.convertGlobToRegex("\\Qglob\\E"));
}
#Test
public void braces_are_turned_into_groups() throws Exception {
assertEquals("(glob|regex)", StringUtils.convertGlobToRegex("{glob,regex}"));
}
#Test
public void escaped_braces_are_unchanged() throws Exception {
assertEquals("\\{glob\\}", StringUtils.convertGlobToRegex("\\{glob\\}"));
}
#Test
public void commas_dont_need_escaping() throws Exception {
assertEquals("(glob,regex),", StringUtils.convertGlobToRegex("{glob\\,regex},"));
}
}
There are couple of libraries that do Glob-like pattern matching that are more modern than the ones listed:
Theres Ants Directory Scanner
And
Springs AntPathMatcher
I recommend both over the other solutions since Ant Style Globbing has pretty much become the standard glob syntax in the Java world (Hudson, Spring, Ant and I think Maven).
I recently had to do it and used \Q and \E to escape the glob pattern:
private static Pattern getPatternFromGlob(String glob) {
return Pattern.compile(
"^" + Pattern.quote(glob)
.replace("*", "\\E.*\\Q")
.replace("?", "\\E.\\Q")
+ "$");
}
This is a simple Glob implementation which handles * and ? in the pattern
public class GlobMatch {
private String text;
private String pattern;
public boolean match(String text, String pattern) {
this.text = text;
this.pattern = pattern;
return matchCharacter(0, 0);
}
private boolean matchCharacter(int patternIndex, int textIndex) {
if (patternIndex >= pattern.length()) {
return false;
}
switch(pattern.charAt(patternIndex)) {
case '?':
// Match any character
if (textIndex >= text.length()) {
return false;
}
break;
case '*':
// * at the end of the pattern will match anything
if (patternIndex + 1 >= pattern.length() || textIndex >= text.length()) {
return true;
}
// Probe forward to see if we can get a match
while (textIndex < text.length()) {
if (matchCharacter(patternIndex + 1, textIndex)) {
return true;
}
textIndex++;
}
return false;
default:
if (textIndex >= text.length()) {
return false;
}
String textChar = text.substring(textIndex, textIndex + 1);
String patternChar = pattern.substring(patternIndex, patternIndex + 1);
// Note the match is case insensitive
if (textChar.compareToIgnoreCase(patternChar) != 0) {
return false;
}
}
// End of pattern and text?
if (patternIndex + 1 >= pattern.length() && textIndex + 1 >= text.length()) {
return true;
}
// Go on to match the next character in the pattern
return matchCharacter(patternIndex + 1, textIndex + 1);
}
}
Similar to Tony Edgecombe's answer, here is a short and simple globber that supports * and ? without using regex, if anybody needs one.
public static boolean matches(String text, String glob) {
String rest = null;
int pos = glob.indexOf('*');
if (pos != -1) {
rest = glob.substring(pos + 1);
glob = glob.substring(0, pos);
}
if (glob.length() > text.length())
return false;
// handle the part up to the first *
for (int i = 0; i < glob.length(); i++)
if (glob.charAt(i) != '?'
&& !glob.substring(i, i + 1).equalsIgnoreCase(text.substring(i, i + 1)))
return false;
// recurse for the part after the first *, if any
if (rest == null) {
return glob.length() == text.length();
} else {
for (int i = glob.length(); i <= text.length(); i++) {
if (matches(text.substring(i), rest))
return true;
}
return false;
}
}
This may be a slightly hacky approach. I figured it out from the code of NIO2's Files.newDirectoryStream(Path dir, String glob). Note that a new Path object is created for every match. So far I have been able to test this only on a Windows file system, but I believe it should work on Unix as well.
// a file system hack to get a glob matching
PathMatcher matcher = ("*".equals(glob)) ? null
: FileSystems.getDefault().getPathMatcher("glob:" + glob);
if ("*".equals(glob) || matcher.matches(Paths.get(someName))) {
// do your stuff here
}
UPDATE
It works on both Mac and Linux.
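A self-contained version of the idea might look like the sketch below. The class name PathGlobDemo, the helper globMatches, and the sample names are hypothetical; only FileSystems, PathMatcher, and Paths come from the JDK. Keep in mind that Paths.get() can reject names containing characters that are illegal on the current file system.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class PathGlobDemo {
    // Match a bare name against a glob using the default file system's matcher.
    static boolean globMatches(String glob, String name) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(name));
    }

    public static void main(String[] args) {
        System.out.println(globMatches("*.java", "Foo.java"));          // true
        System.out.println(globMatches("*.java", "Foo.class"));         // false
        // The glob syntax also supports {a,b} alternatives.
        System.out.println(globMatches("*.{java,class}", "Foo.class")); // true
    }
}
```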
The previous solution by Vincent Robert/dimo414 relies on Pattern.quote() being implemented in terms of \Q...\E, which is not documented in the API and therefore may not be the case for other/future Java implementations. The following solution removes that implementation dependency by escaping all occurrences of \E instead of using quote(). It also activates DOTALL mode ((?s)) in case the string to be matched contains newlines.
public static Pattern globToRegex(String glob)
{
return Pattern.compile(
"(?s)^\\Q" +
glob.replace("\\E", "\\E\\\\E\\Q")
.replace("*", "\\E.*\\Q")
.replace("?", "\\E.\\Q") +
"\\E$"
);
}
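To illustrate the two points made above (a literal \E inside the glob, and DOTALL mode), here is a small sketch that exercises globToRegex. The class name GlobToRegexDemo and the sample strings are assumptions made for the example; the method body is the one from this answer.

```java
import java.util.regex.Pattern;

class GlobToRegexDemo {
    static Pattern globToRegex(String glob) {
        return Pattern.compile(
            "(?s)^\\Q" +
            glob.replace("\\E", "\\E\\\\E\\Q")
                .replace("*", "\\E.*\\Q")
                .replace("?", "\\E.\\Q") +
            "\\E$");
    }

    public static void main(String[] args) {
        // The glob itself contains the two characters \E; they are matched literally.
        System.out.println(globToRegex("a\\Eb*").matcher("a\\Eb.txt").matches());    // true
        // DOTALL lets * span a newline.
        System.out.println(globToRegex("line1*").matcher("line1\nline2").matches()); // true
        // Regex metacharacters in the glob stay literal.
        System.out.println(globToRegex("a.b").matcher("axb").matches());             // false
    }
}
```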
I don't know about a "standard" implementation, but I know of a SourceForge project released under the BSD license that implements glob matching for files. It's implemented in one file, so maybe you can adapt it to your requirements.
There is sun.nio.fs.Globs but it is not part of the public API.
You can use it indirectly via:
FileSystems.getDefault().getPathMatcher("glob:<myPattern>")
But it returns a PathMatcher, which is inconvenient to work with since it accepts only a Path as its parameter (not a File).
One possible option is to convert the PathMatcher to a regex pattern (just call its 'toString()' method).
Another option is to use a dedicated glob library like glob-library-java.
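If the only obstacle is having a File rather than a Path, one way around it is File.toPath(). The sketch below is an illustration under that assumption; the class name FileGlobDemo, the helper nameMatches, and the file names are hypothetical, and the file does not need to exist.

```java
import java.io.File;
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;

class FileGlobDemo {
    // Match only the file's name (the last path element) against the glob.
    static boolean nameMatches(File file, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(file.toPath().getFileName());
    }

    public static void main(String[] args) {
        File f = new File("reports", "summary-2020.csv");
        System.out.println(nameMatches(f, "summary-*.csv")); // true
        System.out.println(nameMatches(f, "*.txt"));         // false
    }
}
```

Matching on getFileName() keeps directory separators out of the picture, since * in this glob syntax does not cross directory boundaries.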
Long ago I was doing massive glob-driven text filtering, so I wrote a small piece of code (15 lines, no dependencies beyond the JDK).
It handles only '*' (which was sufficient for me), but can easily be extended for '?'.
It is several times faster than a precompiled regexp and does not require any precompilation (essentially it is a string-vs-string comparison every time the pattern is matched).
Code:
public static boolean miniglob(String[] pattern, String line) {
if (pattern.length == 0) return line.isEmpty();
else if (pattern.length == 1) return line.equals(pattern[0]);
else {
if (!line.startsWith(pattern[0])) return false;
int idx = pattern[0].length();
for (int i = 1; i < pattern.length - 1; ++i) {
String patternTok = pattern[i];
int nextIdx = line.indexOf(patternTok, idx);
if (nextIdx < 0) return false;
else idx = nextIdx + patternTok.length();
}
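String last = pattern[pattern.length - 1];
// the last token must not overlap text already consumed (e.g. "a*a" vs "a")
if (line.length() - last.length() < idx || !line.endsWith(last)) return false;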
return true;
}
}
Usage:
public static void main(String[] args) {
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
try {
// read from stdin space separated text and pattern
for (String input = in.readLine(); input != null; input = in.readLine()) {
String[] tokens = input.split(" ");
String line = tokens[0];
String[] pattern = tokens[1].split("\\*+", -1 /* want empty trailing token if any */);
// check matcher performance
long tm0 = System.currentTimeMillis();
for (int i = 0; i < 1000000; ++i) {
miniglob(pattern, line);
}
long tm1 = System.currentTimeMillis();
System.out.println("miniglob took " + (tm1-tm0) + " ms");
// check regexp performance
Pattern reptn = Pattern.compile(tokens[1].replace("*", ".*"));
Matcher mtchr = reptn.matcher(line);
tm0 = System.currentTimeMillis();
for (int i = 0; i < 1000000; ++i) {
mtchr.matches();
}
tm1 = System.currentTimeMillis();
System.out.println("regexp took " + (tm1-tm0) + " ms");
// check if miniglob worked correctly
if (miniglob(pattern, line)) {
System.out.println("+ >" + line);
}
else {
System.out.println("- >" + line);
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Copy/paste from here
By the way, it seems as if you did it the hard way in Perl
This does the trick in Perl:
my @files = glob("*.html");
# Or, if you prefer:
my @files = <*.html>;

htmlentities equivalent in JSP?

I'm a PHP guy, but I have to do a small project in JSP.
I'm wondering if there's an equivalent of PHP's htmlentities function in JSP.
public static String stringToHTMLString(String string) {...
A utility from the commons-lang library does the same thing:
org.apache.commons.lang.StringEscapeUtils.escapeHtml
Just expose it through a custom TLD and you will get a handy function for JSP.
public static String stringToHTMLString(String string) {
StringBuffer sb = new StringBuffer(string.length());
// true if last char was blank
boolean lastWasBlankChar = false;
int len = string.length();
char c;
for (int i = 0; i < len; i++)
{
c = string.charAt(i);
if (c == ' ') {
// blank gets extra work,
// this solves the problem you get if you replace all
// blanks with &nbsp;, if you do that you lose
// word breaking
if (lastWasBlankChar) {
lastWasBlankChar = false;
sb.append("&nbsp;");
}
else {
lastWasBlankChar = true;
sb.append(' ');
}
}
else {
lastWasBlankChar = false;
//
// HTML Special Chars
if (c == '"')
sb.append("&quot;");
else if (c == '&')
sb.append("&amp;");
else if (c == '<')
sb.append("&lt;");
else if (c == '>')
sb.append("&gt;");
else if (c == '\n')
// Handle Newline
sb.append("<br/>");
else {
int ci = 0xffff & c;
if (ci < 160 )
// nothing special only 7 Bit
sb.append(c);
else {
// Not 7 Bit use the unicode system
sb.append("&#");
sb.append(Integer.toString(ci));
sb.append(';');
}
}
}
}
return sb.toString();
}
I suggest using JSTL's c:out tag with its escapeXml attribute set to true, directly in the JSP:
<c:out value="${string}" escapeXml="true" />
