\u000b and other Control Unicode Characters not compatible with docx4j? [duplicate] - java

The list of valid XML characters is well known; as defined by the spec it's:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
My question is whether or not it's possible to make a PCRE regular expression for this (or its inverse) without actually hard-coding the codepoints, by using Unicode general categories. An inverse might be something like [\p{Cc}\p{Cs}\p{Cn}], except that it improperly covers linefeeds and tabs and misses some other invalid characters.

I know this isn't exactly an answer to your question, but it's helpful to have it here:
Regular Expression to match valid XML Characters:
[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]
So to remove invalid chars from XML, you'd do something like
// filters control characters but allows only properly-formed surrogate sequences
private static Regex _invalidXMLChars = new Regex(
    @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
    RegexOptions.Compiled);
/// <summary>
/// removes any unusual unicode characters that can't be encoded into XML
/// </summary>
public static string RemoveInvalidXMLChars(string text)
{
if (string.IsNullOrEmpty(text)) return "";
return _invalidXMLChars.Replace(text, "");
}
I had our resident regex / XML genius, he of the 4,400+ upvoted post, check this, and he signed off on it.

For systems that internally store the codepoints in UTF-16, it is common to use surrogate pairs (0xD800-0xDFFF) for codepoints above 0xFFFF, and in those systems you must verify whether you can really use, for example, \u12345, or must specify it as a surrogate pair. (I just found out that in C# you can use \u1234 (16-bit) and \U00001234 (32-bit).)
According to Microsoft, "the W3C recommendation does not allow surrogate characters inside element or attribute names." While searching the W3C website I found C079 and C078, which might be of interest.
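Not part of the original answers, but for Java (the question's tag) a minimal sketch that walks the string by code point keeps valid surrogate pairs (characters above U+FFFF) and drops unpaired surrogates along with the other invalid characters; the method name is mine:
// Sketch only: filter by code point so supplementary characters (stored as surrogate
// pairs in Java's UTF-16 strings) survive, while lone surrogates are dropped.
private static String stripInvalidXmlCodePoints(String text) {
    StringBuilder sb = new StringBuilder(text.length());
    int i = 0;
    while (i < text.length()) {
        int cp = text.codePointAt(i);
        boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
        if (valid) {
            sb.appendCodePoint(cp);
        }
        i += Character.charCount(cp); // advance by 1 or 2 chars depending on the code point
    }
    return sb.toString();
}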

I tried this in Java and it works:
private String filterContent(String content) {
return content.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
}
Thank you Jeff.

The above solutions didn't work for me if the hex code was already present in the XML, e.g.
<element>&#x8;</element>
The following code would break:
string xmlFormat = "<element>{0}</element>";
string invalid = "&#x8;";
string xml = string.Format(xmlFormat, invalid);
xml = Regex.Replace(xml, @"[\x01-\x08\x0B\x0C\x0E\x0F\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");
XDocument.Parse(xml);
It returns:
XmlException: '', hexadecimal value 0x08, is an invalid character.
Line 1, position 14.
The following is the improved regex, which fixes the problem mentioned above:
&#x([0-8BCEFbcef]|1[0-9A-Fa-f]);|[\x01-\x08\x0B\x0C\x0E\x0F\u0000-\u0008\u000B\u000C\u000E-\u001F]
Here is a unit test for the first 300 Unicode characters; it verifies that only invalid characters are removed:
[Fact]
public void validate_that_RemoveInvalidData_only_remove_all_invalid_data()
{
string xmlFormat = "<element>{0}</element>";
string[] allAscii = (Enumerable.Range('\x1', 300).Select(x => ((char)x).ToString()).ToArray());
string[] allAsciiInHexCode = (Enumerable.Range('\x1', 300).Select(x => "&#x" + (x).ToString("X") + ";").ToArray());
string[] allAsciiInHexCodeLowerCase = (Enumerable.Range('\x1', 300).Select(x => "&#x" + (x).ToString("x") + ";").ToArray());
bool hasParserError = false;
IXmlSanitizer sanitizer = new XmlSanitizer();
foreach (var test in allAscii.Concat(allAsciiInHexCode).Concat(allAsciiInHexCodeLowerCase))
{
bool shouldBeRemoved = false;
string xml = string.Format(xmlFormat, test);
try
{
XDocument.Parse(xml);
shouldBeRemoved = false;
}
catch (Exception e)
{
if (test != "<" && test != "&") //these char are taken care of automatically by my convertor so don't need to test. You might need to add these.
{
shouldBeRemoved = true;
}
}
int xmlCurrentLength = xml.Length;
int xmlLengthAfterSanitize = Regex.Replace(xml, @"&#x([0-8BCEF]|1[0-9A-F]);|[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "").Length;
if ((shouldBeRemoved && xmlCurrentLength == xmlLengthAfterSanitize) //it wasn't properly Removed
||(!shouldBeRemoved && xmlCurrentLength != xmlLengthAfterSanitize)) //it was removed but shouldn't have been
{
hasParserError = true;
Console.WriteLine(test + xml);
}
}
Assert.Equal(false, hasParserError);
}

Another way to remove invalid XML chars in C# is to use the XmlConvert.IsXmlChar method (available since .NET Framework 4.0):
public static string RemoveInvalidXmlChars(string content)
{
return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}
Or you may check that all characters are XML-valid:
public static bool CheckValidXmlChars(string content)
{
return content.All(ch => System.Xml.XmlConvert.IsXmlChar(ch));
}
.Net Fiddle - https://dotnetfiddle.net/v1TNus
For example, the vertical tab symbol (\v) is not valid for XML: it is valid UTF-8, but not valid XML 1.0, and many libraries (including libxml2) miss it and silently output invalid XML.
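To see the same point on the Java side, a quick check (reusing the filterContent method from the Java answer earlier in this thread) shows the vertical tab being stripped:
String s = "a\u000Bb";                  // contains a vertical tab, legal in UTF-8 but invalid in XML 1.0
System.out.println(filterContent(s));   // prints "ab": the U+000B character is removed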

In PHP the regex would look like this:
protected function isStringValid($string)
{
$regex = '/[^\x{9}\x{a}\x{d}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u';
return (preg_match($regex, $string, $matches) === 0);
}
This would handle all three ranges from the XML specification:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
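The same ranges can be written in Java's regex flavour as well; a sketch (the method name is mine), using the \x{...} syntax that java.util.regex.Pattern supports for code points above U+FFFF, so supplementary characters are kept rather than stripped:
// Sketch: same character class as the PHP pattern, expressed in Java regex syntax.
private static String filterContentKeepingSupplementary(String content) {
    return content.replaceAll(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]", "");
}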

Related

java JSON string formatting with regular expression

For given plain JSON data, do the following formatting:
replace all the special characters in key with underscore
remove the key double quote
replace the : with =
Example:
JSON Data: {"no/me": "139.82", "gc.pp": "\u0000\u000", ...}
After formatting: no_me="139.82", gc_pp="\u0000\u000"
Is it possible with a regular expression, or any other single-command execution?
A single regex for all the changes may be overkill. I think you could code something similar to this
(note: since I do not code in Java, my example is in JavaScript, just to give you the idea of it):
var json_data = '{"no/me": "139.82", "gc.pp": "0000000", "foo":"bar"}';
console.log(json_data);
var data = JSON.parse(json_data);
var out = '';
for (var x in data) {
var clean_x = x.replace(/[^a-zA-Z0-9]/g, "_");
if (out != '') out += ', ';
out += clean_x + '="' + data[x] + '"';
}
console.log(out);
Basically you loop through the keys and clean them (remove unwanted characters); with the new key and the original value you create a new string in the format you like.
Important: bear in mind overlapping ids. For example, both no/me and no#me will collapse into the same id no_me. This may not matter since you are not outputting JSON after all, but I mention it just in case.
I haven't done Java in a long time, but I think you need something like this.
I'm assuming that by special chars you mean 'all non-word characters' here.
import java.util.regex.*;

String JsonData = "{\"no/me\": \"139.82\", \"gc.pp\": \"\\u0000\\u000\", ...}";
// remove the surrounding { and }
JsonData = JsonData.substring(1, JsonData.length() - 1);
try {
    Pattern regex = Pattern.compile("\"([^\"]+)\"\\s*:\\s*"); // find the keys, including quotes and colon
    Matcher regexMatcher = regex.matcher(JsonData);
    while (regexMatcher.find()) {
        String temp = regexMatcher.group(0);                             // "no/me": (quotes, colon, trailing space)
        String key = regexMatcher.group(1).replaceAll("\\W", "_") + "="; // no_me=
        JsonData = JsonData.replace(temp, key);                          // Strings are immutable, so reassign
    }
} catch (PatternSyntaxException ex) {
    // the regex has a syntax error
}
System.out.println(JsonData);
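If adding a JSON parser is acceptable, a sketch along the same lines avoids regex edge cases entirely. This assumes the org.json library is on the classpath, that the input is well-formed JSON, and note that JSONObject does not preserve key order:
import org.json.JSONObject;
import java.util.Iterator;

// Sketch: parse the JSON, then rebuild each entry as cleaned_key="value".
static String reformatJson(String json) {
    JSONObject obj = new JSONObject(json);
    StringBuilder out = new StringBuilder();
    Iterator<String> keys = obj.keys();
    while (keys.hasNext()) {
        String key = keys.next();
        if (out.length() > 0) out.append(", ");
        out.append(key.replaceAll("\\W", "_"))   // no/me -> no_me, gc.pp -> gc_pp
           .append("=\"").append(obj.get(key)).append("\"");
    }
    return out.toString();
}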

how to detect base64 encoded strings? [duplicate]

I want to decode a Base64 encoded string, then store it in my database. If the input is not Base64 encoded, I need to throw an error.
How can I check if a string is Base64 encoded?
You can use the following regular expression to check if a string constitutes a valid base64 encoding:
^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$
In base64 encoding, the character set is [A-Z, a-z, 0-9, + and /]. If the final group is shorter than 4 characters, the string is padded with '=' characters.
^([A-Za-z0-9+/]{4})* means the string starts with 0 or more base64 groups.
([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$ means the string optionally ends in one of the two padded forms, [A-Za-z0-9+/]{3}= or [A-Za-z0-9+/]{2}== (an unpadded final group of four is already covered by the first part).
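A quick Java illustration of the padding rule (my addition, using java.util.Base64):
System.out.println(java.util.Base64.getEncoder().encodeToString("Man".getBytes())); // TWFu -> 3 bytes, no padding
System.out.println(java.util.Base64.getEncoder().encodeToString("Ma".getBytes()));  // TWE= -> 2 bytes, one '='
System.out.println(java.util.Base64.getEncoder().encodeToString("M".getBytes()));   // TQ== -> 1 byte, two '='
System.out.println("TQ==".matches("([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?")); // true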
If you are using Java, you can actually use the Apache Commons Codec library:
import org.apache.commons.codec.binary.Base64;
String stringToBeChecked = "...";
boolean isBase64 = Base64.isArrayByteBase64(stringToBeChecked.getBytes());
[UPDATE 1] Deprecation notice. Use instead:
Base64.isBase64(value);
/**
 * Tests a given byte array to see if it contains only valid characters within the Base64 alphabet. Currently the
 * method treats whitespace as valid.
 *
 * @param arrayOctet
 *            byte array to test
 * @return {@code true} if all bytes are valid characters in the Base64 alphabet or if the byte array is empty;
 *         {@code false}, otherwise
 * @deprecated 1.5 Use {@link #isBase64(byte[])}, will be removed in 2.0.
 */
@Deprecated
public static boolean isArrayByteBase64(final byte[] arrayOctet) {
    return isBase64(arrayOctet);
}
Well you can:
Check that the length is a multiple of 4 characters
Check that every character is in the set A-Z, a-z, 0-9, +, / except for padding at the end which is 0, 1 or 2 '=' characters
If you're expecting that it will be base64, then you can probably just use whatever library is available on your platform to try to decode it to a byte array, throwing an exception if it's not valid base 64. That depends on your platform, of course.
As of Java 8, you can simply use java.util.Base64 to try and decode the string:
String someString = "...";
Base64.Decoder decoder = Base64.getDecoder();
try {
decoder.decode(someString);
} catch(IllegalArgumentException iae) {
// That string wasn't valid.
}
Try it like this for PHP 5:
//where $json is some data that may be base64 encoded
$json = $some_data;
//this will check whether the data is base64 encoded or not
if (base64_decode($json, true) !== false)
{
echo "base64 encoded";
}
else
{
echo "not base64 encoded";
}
Use this for PHP7
//$string parameter can be base64 encoded or not
function is_base64_encoded($string){
//this will check if $string is base64 encoded and return true, if it is.
if (base64_decode($string, true) !== false){
return true;
}else{
return false;
}
}
var base64Regex = /^(?:[A-Z0-9+\/]{4})*(?:[A-Z0-9+\/]{2}==|[A-Z0-9+\/]{3}=|[A-Z0-9+\/]{4})$/i;
var isBase64Valid = base64Regex.test(base64Data); // base64Data is the base64 string
if (isBase64Valid) {
    // true if in base64 format
    console.log('It is base64');
} else {
    // false if not in base64 format
    console.log('It is not base64');
}
Try this:
public void checkForEncode(String string) {
String pattern = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(string);
if (m.find()) {
System.out.println("true");
} else {
System.out.println("false");
}
}
It is impossible to check if a string is base64 encoded or not. It is only possible to validate whether that string is in a base64-encoded string format, which would mean that it could be a string produced by base64 encoding (to check that, the string can be validated against a regexp, or a library can be used; many other answers to this question provide good ways to check this, so I won't go into details).
For example, the string flow is a valid base64-encoded string. But it is impossible to know whether it is just a simple string, the English word flow, or the base64 encoding of the string ~Z0.
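A quick Java illustration of that ambiguity:
byte[] decoded = java.util.Base64.getDecoder().decode("flow"); // decodes without error
System.out.println(new String(decoded));                       // prints "~Z0"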
There are many variants of Base64, so consider just determining if your string resembles the variant you expect to handle. As such, you may need to adjust the regex below with respect to the index and padding characters (i.e. +, /, =).
class String
def resembles_base64?
self.length % 4 == 0 && self =~ /^[A-Za-z0-9+\/=]+\Z/
end
end
Usage:
raise 'the string does not resemble Base64' unless my_string.resembles_base64?
Check to see if the string's length is a multiple of 4. Afterwards, use this regex to make sure all characters in the string are base64 characters:
\A[a-zA-Z\d\/+]+={,2}\z
If the library you use adds newlines as a way of observing the 76-max-chars-per-line rule, replace them with empty strings.
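A sketch of that pre-processing in Java (value is a placeholder for your input string):
String normalized = value.replace("\r", "").replace("\n", ""); // drop MIME line breaks
boolean looksLikeBase64 = normalized.length() % 4 == 0
        && normalized.matches("[A-Za-z0-9+/]+={0,2}");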
/^([A-Za-z0-9+\/]{4})*([A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==)$/
This regular expression helped me identify base64 strings in my Rails application. I only had one problem: it also matches the string "errorDescripcion", which generated an error; to solve it, just validate the length of the string as well.
For Flutter, I tested a couple of the above answers and translated them into a Dart function as follows:
static bool isBase64(dynamic value) {
if (value.runtimeType == String){
final RegExp rx = RegExp(r'^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$',
multiLine: true,
unicode: true,
);
final bool isBase64Valid = rx.hasMatch(value);
if (isBase64Valid == true) {return true;}
else {return false;}
}
else {return false;}
}
In Java, the below code worked for me:
public static boolean isBase64Encoded(String s) {
String pattern = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(s);
return m.find();
}
This works in Python:
import base64
def IsBase64(str):
    try:
        base64.b64decode(str)
        return True
    except Exception as e:
        return False
if IsBase64("ABC"):
    print("ABC is Base64-encoded and its result after decoding is: " + str(base64.b64decode("ABC")).replace("b'", "").replace("'", ""))
else:
    print("ABC is NOT Base64-encoded.")
if IsBase64("QUJD"):
    print("QUJD is Base64-encoded and its result after decoding is: " + str(base64.b64decode("QUJD")).replace("b'", "").replace("'", ""))
else:
    print("QUJD is NOT Base64-encoded.")
Summary: IsBase64("string here") returns true if string here is Base64-encoded, and it returns false if string here was NOT Base64-encoded.
C#
This is performing great:
static readonly Regex _base64RegexPattern = new Regex(BASE64_REGEX_STRING, RegexOptions.Compiled);
private const String BASE64_REGEX_STRING = @"^[a-zA-Z0-9\+/]*={0,3}$";
private static bool IsBase64(this String base64String)
{
var rs = (!string.IsNullOrEmpty(base64String) && !string.IsNullOrWhiteSpace(base64String) && base64String.Length != 0 && base64String.Length % 4 == 0 && !base64String.Contains(" ") && !base64String.Contains("\t") && !base64String.Contains("\r") && !base64String.Contains("\n")) && (base64String.Length % 4 == 0 && _base64RegexPattern.Match(base64String, 0).Success);
return rs;
}
There is no way to distinguish a plain string from a base64-encoded one, unless the strings in your system have some specific limitation or identification.
This snippet may be useful when you know the length of the original content (e.g. a checksum). It checks that the encoded form has the correct length.
public static boolean isValidBase64( final int initialLength, final String string ) {
final int padding ;
final String regexEnd ;
switch( ( initialLength ) % 3 ) {
case 1 :
padding = 2 ;
regexEnd = "==" ;
break ;
case 2 :
padding = 1 ;
regexEnd = "=" ;
break ;
default :
padding = 0 ;
regexEnd = "" ;
}
final int encodedLength = ( ( ( initialLength / 3 ) + ( padding > 0 ? 1 : 0 ) ) * 4 ) ;
final String regex = "[a-zA-Z0-9/\\+]{" + ( encodedLength - padding ) + "}" + regexEnd ;
return Pattern.compile( regex ).matcher( string ).matches() ;
}
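For example, if the original content is a 32-byte checksum (hypothetical usage; java.util.Base64 is only used here to produce a test value):
String encoded = java.util.Base64.getEncoder().encodeToString(new byte[32]);
System.out.println(encoded.length());                   // 44, i.e. ((32 / 3) + 1) * 4
System.out.println(isValidBase64(32, encoded));         // true
System.out.println(isValidBase64(32, "not-base64!!"));  // false: wrong length and characters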
If the regex does not work and you know the format style of the original string, you can reverse the logic by regexing for that format instead.
For example, I work with base64-encoded XML files and just check whether the file contains valid XML markup. If it does not, I can assume that it is still base64 encoded. This is not very dynamic but works fine for my small application.
This works in Python:
import re

def is_base64(string):
    if len(string) % 4 == 0 and re.match(r'^[A-Za-z0-9+/=]+\Z', string):
        return True
    else:
        return False
Try this using a previously mentioned regex:
String regex = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$";
if("TXkgdGVzdCBzdHJpbmc/".matches(regex)){
System.out.println("it's a Base64");
}
We can also do a simple validation: if it contains spaces, it cannot be Base64:
String myString = "Hello World";
if(myString.contains(" ")){
System.out.println("Not B64");
}else{
System.out.println("Could be B64 encoded, since it has no spaces");
}
If, when decoding, we get a string with non-ASCII characters, then the string was not base64 encoded.
(RoR) Ruby solution:
def encoded?(str)
Base64.decode64(str.downcase).scan(/[^[:ascii:]]/).count.zero?
end
def decoded?(str)
Base64.decode64(str.downcase).scan(/[^[:ascii:]]/).count > 0
end
Function Check_If_Base64(ByVal msgFile As String) As Boolean
Dim I As Long
Dim Buffer As String
Dim Car As String
Check_If_Base64 = True
Buffer = Leggi_File(msgFile)
Buffer = Replace(Buffer, vbCrLf, "")
For I = 1 To Len(Buffer)
Car = Mid(Buffer, I, 1)
If (Car < "A" Or Car > "Z") _
And (Car < "a" Or Car > "z") _
And (Car < "0" Or Car > "9") _
And (Car <> "+" And Car <> "/" And Car <> "=") Then
Check_If_Base64 = False
Exit For
End If
Next I
End Function
Function Leggi_File(PathAndFileName As String) As String
Dim FF As Integer
FF = FreeFile()
Open PathAndFileName For Binary As #FF
Leggi_File = Input(LOF(FF), #FF)
Close #FF
End Function
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public static String encodeBase64(String s) {
return Base64.getEncoder().encodeToString(s.getBytes());
}
public static String decodeBase64(String s) {
try {
if (isBase64(s)) {
return new String(Base64.getDecoder().decode(s));
} else {
return s;
}
} catch (Exception e) {
return s;
}
}
public static boolean isBase64(String s) {
String pattern = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(s);
return m.find();
}
For the Java flavour I actually use the following regex:
"([A-Za-z0-9+]{4})*([A-Za-z0-9+]{3}=|[A-Za-z0-9+]{2}(==){0,2})?"
This also has the == as optional in some cases.
Best!
I tried to use this one, and yes, it works:
^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$
but I added a condition to check that the string contains at least one '=' character:
string.lastIndexOf("=") >= 0

How to detect URL to different page (also in the same domain)

I have a question about detecting URLs in a page, and I'm trying to find the best way to solve it. For downloading the page I use Jsoup.
URI uri = new URI("http://www.niocchi.com/");
Document doc = Jsoup.connect(uri.toString()).get();
Elements links = doc.select("a");
And this page gets me some links, for example:
http://www.niocchi.com/#Package organization
http://www.niocchi.com/#Architecture
http://www.linkedin.com/in/ivanprado
http://www.niocchi.com/examples/
I need to get only the different pages, without references to fragments (anchors) within a page.
From the example I would like to get this:
http://www.linkedin.com/in/ivanprado
http://www.niocchi.com/examples/
It looks like you want to select only those <a> elements whose href attribute value is built from characters other than #. In that case you can use
doc.select("a[href~=^[^#]+$]")
attribute~=regex is the syntax used to check whether part of the attribute's value can be matched by regex.
A regex accepting one or more non-# characters can look like this: [^#]+
A regex accepting only the entire string (not just a part of it) needs to be surrounded with the ^ and $ anchors, which represent:
^ - the start of the string,
$ - the end of the string.
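Putting that selector together with the Jsoup snippet from the question (a sketch using the same Jsoup classes):
Document doc = Jsoup.connect("http://www.niocchi.com/").get();
Elements links = doc.select("a[href~=^[^#]+$]");   // only anchors whose href contains no '#'
for (Element link : links) {
    System.out.println(link.attr("abs:href"));     // absolute URL of each matching link
}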
You could convert them to strings and then split them based on the # mark.
for example:
public void stringSplitter() {
String result = null;
// example
String[] stringURL = {"http://www.niocchi.com/#Package organization", "http://www.niocchi.com/#Architecture",
"http://www.linkedin.com/in/ivanprado", "http://www.niocchi.com/examples/ "};
try {
for (int i = 0; i < stringURL.length; i++) {
String [] parts = stringURL[i].split("#");
result = parts[0];
System.out.println(result);
}
}catch (Exception ex) {
ex.printStackTrace();
}
}
The output is:
http://www.niocchi.com/
http://www.niocchi.com/
http://www.linkedin.com/in/ivanprado
http://www.niocchi.com/examples/
I would even think about extending the method to return only unique URLs.
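A sketch of that, reusing the stringURL array from the example above (LinkedHashSet keeps insertion order while dropping duplicates):
java.util.Set<String> unique = new java.util.LinkedHashSet<>();
for (String url : stringURL) {
    unique.add(url.split("#")[0].trim()); // strip the fragment part
}
unique.forEach(System.out::println);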

hex-Encoding in Java goes wrong

Several experienced Java developers and I have been working on this for about an hour now and we cannot get it to work. Does anyone have any tips for me?
Problem:
We have text in an Excel file which seems to be encoded completely inconsistently. Sometimes there are special chars, sometimes not, and sometimes they are shown and interpreted differently.
What I wanted to do now is write a little Java script that checks the given text in the Excel file and converts all the different char sequences into what we want them to be.
My Code:
while (iterator.hasNext()) {
Entity entity = (Entity) iterator.next();
Dataset dataset = produkt_store.getDataset(entity);
FormData formdata = dataset.getFormData();
DomElement dom = (DomElement) formdata.get(lang,
"cs_description_short").get();
String beschreibung = dom.toText(true);
System.out.println("Before: " + beschreibung);
String hexBeschreibung = StringToHex(beschreibung);
String newHexBeschreibung = hexBeschreibung.replaceAll("75 3F", "FC");
newHexBeschreibung = newHexBeschreibung.replaceAll("75 A8", "FC");
//beschreibung2 = beschreibung2.replaceAll("75A8", "FC");
System.out.println("After: " + HexToString(newHexBeschreibung));
System.out.println(hexBeschreibung.equals(newHexBeschreibung) + "\n");
// dom.set(beschreibung);
}
I also have these functions to encode / decode to hex:
private static String StringToHex(String s) {
if (s.length() == 0)
return "";
char c;
StringBuffer buff = new StringBuffer();
for (int i = 0; i < s.length(); i++) {
c = s.charAt(i);
buff.append(Integer.toHexString(c) + " ");
}
return buff.toString().trim();
}
private static String HexToString(String s) {
if (s.length() == 0)
return "";
String[] arr = s.split(" ");
StringBuffer buff = new StringBuffer();
int i;
for (String str : arr) {
i = Integer.valueOf(str, 16).intValue();
String hs = new Character((char) i).toString();
buff.append(hs);
}
return buff.toString();
}
Example:
Sometimes where there should be an "ü" it is shown as "u?", which we obviously want to avoid. When looking at it in a hex editor, we see those sequences represented sometimes as
753F and sometimes as 75A8. The same goes for "ä", "ö" and "ß". So even for "u?" it varies between 753F and 75A8. We tried to replace that with "ü", but it doesn't work. Does anyone have any tips?
We tried String.replaceAll() before that, using something like String.replaceAll("u\\?", "ü"), but that didn't work either, as nothing was changed at all.
Thanks for any tips on that encoding stuff! :)
EDIT:
This is the solution which works perfectly fine:
beschreibung = beschreibung.replace("U\u0308", "\u00DC"); // "Ü"
beschreibung = beschreibung.replace("u\u0308", "\u00FC"); // "ü"
beschreibung = beschreibung.replace("A\u0308", "\u00C4"); // "Ä"
beschreibung = beschreibung.replace("a\u0308", "\u00E4"); // "ä"
beschreibung = beschreibung.replace("O\u0308", "\u00D6"); // "Ö"
beschreibung = beschreibung.replace("o\u0308", "\u00F6"); // "ö"
beschreibung = beschreibung.replace("s\u0308", "\u00DF"); // "ß"
Somewhere there was ü represented not as one char U-UMLAUT but as SMALL-LETTER-U followed by COMBINING-DIACRITICAL-MARK-UMLAUT. This is valid.
Then there was some conversion back, maybe to ISO-8859-1 (or even US-ASCII?), and the combining umlaut mark got converted separately. There is no such character in ISO-8859-1, so you got a question mark instead.
A repair afterwards would be:
String s = ...
s = s.replace("U?", "\u00DC")); // "Ü"
s = s.replace("u?", "\u00FC"); // "ü"
...
(I have escaped the chars to prevent problems with a possibly different encoding between the Java compiler and the editor, which would otherwise be an error.)
That can also be done in a slightly more sophisticated way:
s = s.replaceAll("([aouAOU])\\?", "$1\u0308"); // Again ASCII + Umlaut separately
s = Normalizer.normalize(s, Normalizer.Form.NFC);
// Now single non-ASCII letters.
The java.text.Normalizer class might be a help here.
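A self-contained example of that normalization step (java.text.Normalizer has been part of the JDK since Java 6):
// assumes: import java.text.Normalizer;
String s = "u\u0308ber";                                   // 'u' + combining diaeresis: 5 chars
String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
System.out.println(nfc + " / " + nfc.length());            // prints "über / 4": one precomposed 'ü'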
Caveat: the '?' can also be shown in a console (e.g. when run from the IDE), as a conversion takes place there too.
Somewhere a conversion was done. This can happen implicitly, wherever an encoding parameter is optional and the like. You might try setting the system property file.encoding to UTF-8 or Cp1252 (Windows Latin-1).
First thing to check: is upper/lower case important? E.g. if your toHex produces "75 3f" you won't replace it with your given command. hexBeschreibung = hexBeschreibung.toLowerCase() would solve this issue (as long as you also lower-case the search strings).
Second (more of a hint): "u?" doesn't mean 'u' + '?', but 'u' + <not a Unicode character and definitely not '?'>.
I hope my first suggestion will help :)
--
Sorry I can't comment, so I have to edit:
Hex editors may show hex values in upper or lower case, because there it doesn't matter. You have to check the String you use yourself, because Java may represent hex in Strings with lowercase letters.

JavaCC: How can I specify which token(s) are expected in certain context?

I need to make JavaCC aware of a context (current parent token), and depending on that context, expect different token(s) to occur.
Consider the following pseudo-code:
TOKEN <abc> { "abc*" } // recognizes "abc", "abcd", "abcde", ...
TOKEN <abcd> { "abcd*" } // recognizes "abcd", "abcde", "abcdef", ...
TOKEN <element1> { "element1" "[" expectOnly(<abc>) "]" }
TOKEN <element2> { "element2" "[" expectOnly(<abcd>) "]" }
...
So when the generated parser is "inside" a token named "element1" and it encounters "abcdef", it recognizes it as <abc>, but when it's "inside" a token named "element2" it recognizes the same string as <abcd>.
element1 [ abcdef ] // aha! it can only be <abc>
element2 [ abcdef ] // aha! it can only be <abcd>
If I'm not wrong, it would behave similarly to more complex DTD definitions of an XML file.
So, how can one specify in which "context" which token(s) are valid/expected?
NOTE: It would not be enough for my real case to define a kind of "hierarchy" of tokens, so that "abcdef" is always first matched against <abcd> and then against <abc>. I really need context-aware tokens.
OK, it seems that you need a technique called lookahead here. Here is a very good tutorial:
Lookahead tutorial
My first attempt was wrong then, but as it works for distinct tokens which define a context I'll leave it here (Maybe it's useful for somebody ;o)).
Let's say we want to have some kind of markup language. All we want to "mark up" are:
Expressions consisting of letters (a...z, A...Z) and whitespace --> words
Expressions consisting of digits (0-9) --> numbers
We want to enclose words in <WORDS> tags and numbers in <NUMBER> tags. So if I got you right, that is what you want to do: if you're in the word context (between <WORDS> tags) the compiler should expect letters and whitespace, and in the number context it expects numbers.
I created the file WordNumber.jj which defines the grammar and the parser to be generated:
options
{
LOOKAHEAD= 1;
CHOICE_AMBIGUITY_CHECK = 2;
OTHER_AMBIGUITY_CHECK = 1;
STATIC = true;
DEBUG_PARSER = false;
DEBUG_LOOKAHEAD = false;
DEBUG_TOKEN_MANAGER = false;
ERROR_REPORTING = true;
JAVA_UNICODE_ESCAPE = false;
UNICODE_INPUT = false;
IGNORE_CASE = false;
USER_TOKEN_MANAGER = false;
USER_CHAR_STREAM = false;
BUILD_PARSER = true;
BUILD_TOKEN_MANAGER = true;
SANITY_CHECK = true;
FORCE_LA_CHECK = false;
}
PARSER_BEGIN(WordNumberParser)
/** Model-tree Parser */
public class WordNumberParser
{
/** Main entry point. */
public static void main(String args []) throws ParseException
{
WordNumberParser parser = new WordNumberParser(System.in);
parser.Input();
}
}
PARSER_END(WordNumberParser)
SKIP :
{
" "
| "\n"
| "\r"
| "\r\n"
| "\t"
}
TOKEN :
{
< WORD_TOKEN : (["a"-"z"] | ["A"-"Z"] | " " | "." | ",")+ > |
< NUMBER_TOKEN : (["0"-"9"])+ >
}
/** Root production. */
void Input() :
{}
{
( WordContext() | NumberContext() )* < EOF >
}
/** WordContext production. */
void WordContext() :
{}
{
"<WORDS>" (< WORD_TOKEN >)+ "</WORDS>"
}
/** NumberContext production. */
void NumberContext() :
{}
{
"<NUMBER>" (< NUMBER_TOKEN >)+ "</NUMBER>"
}
You can test it with a file like that:
<WORDS>This is a sentence. As you can see the parser accepts it.</WORDS>
<WORDS>The answer to life, universe and everything is</WORDS><NUMBER>42</NUMBER>
<NUMBER>This sentence will make the parser sad. Do not make the parser sad.</NUMBER>
The last line will cause the parser to throw an exception like this:
Exception in thread "main" ParseException: Encountered " <WORD_TOKEN> "This sentence will make the parser sad. Do not make the parser sad. "" at line 3, column 9.
Was expecting:
<NUMBER_TOKEN> ...
That is because the parser did not find what it expected.
I hope that helps.
Cheers!
P.S.: The parser can't "be" inside a token as a token is a terminal symbol (correct me if I'm wrong) which can't be replaced by production rules any further. So all the context aspects have to be placed inside a production rule (non terminal) like "WordContext" in my example.
You need to use lexer states. Your example becomes something like:
<DEFAULT> TOKEN: { <ELEMENT1: "element1">: IN_ELEMENT1 }
<DEFAULT> TOKEN: { <ELEMENT2: "element2">: IN_ELEMENT2 }
<IN_ELEMENT1> TOKEN: { <ABC: "abc" (...)*>: DEFAULT }
<IN_ELEMENT2> TOKEN: { <ABCD: "abcd" (...)*>: DEFAULT }
Please note that the (...)* parts are not proper JavaCC syntax, but neither is your example, so I can only guess.
