Why do I get "Not a hexadecimal character" when using tdbloader2 - java

I'm loading a recent DBPedia dump file, specifically short_abstracts_en.nt available from http://data.dws.informatik.uni-mannheim.de/dbpedia/2014/en/short_abstracts_en.nt.bz2 (warning, 409M file).
tdbloader2 fails to load, with:
org.apache.jena.riot.RiotException: [line: 1263473, col: 122] Not a hexadecimal character:
I can replicate this error with riot --validate:
$JENA_HOME/bin/riot --validate /var/data/uncompressed/short_abstracts_en.nt
20:04:36 ERROR riot :: [line: 1263473, col: 122] Not a hexadecimal character:
Line 1263473 of that file looks like this:
<http://dbpedia.org/resource/Taiwanese_kana> <http://www.w3.org/2000/01/rdf-schema#comment> "Taiwanese kana (\u30BF\u30A \u30F2\u30A1\u30CC \u30AE\u30A \u30AB\u30A \u30D3\u30A7\u30F ) is a katakana-based writing system once used to write Holo Taiwanese, when Taiwan was ruled by Japan. It functioned as a phonetic guide to hanzi, much like furigana in Japanese or Zhuyin fuhao in Chinese. There were similar systems for other languages in Taiwan as well, including Hakka and Formosan languages.The system was imposed by Japan at the time, and used in a few dictionaries, as well as textbooks."#en .
Column 122 falls inside the run of Unicode escapes (\u30BF\u30A \u30F2\u30A1\u30CC \u30AE\u30A \u30AB\u30A \u30D3\u30A7\u30F ), at \u30F2.
The error message is correct in the sense that \u30F2 is a (valid) Unicode character, not a hexadecimal character.
Why does Jena think it should be hex, and what do I do about it?
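One thing worth checking on the quoted line: N-Triples requires \u to be followed by exactly four hexadecimal digits, and several escapes in that literal (for example \u30A followed by a space) show only three. The sketch below is a plain-Java scan for such escapes, not part of Jena. The class and method names are made up for illustration, it only looks at the lowercase \u form, and it reports the column where the escape starts, whereas riot may report the column of the offending character.

```java
import java.util.ArrayList;
import java.util.List;

final class EscapeScan {
    // Returns the 1-based columns of every \u escape that is not
    // followed by exactly four hexadecimal digits.
    static List<Integer> malformedEscapeColumns(String line) {
        List<Integer> columns = new ArrayList<>();
        for (int i = 0; i + 1 < line.length(); i++) {
            if (line.charAt(i) != '\\') {
                continue;
            }
            if (line.charAt(i + 1) == 'u' && !hasHexDigits(line, i + 2, 4)) {
                columns.add(i + 1);
            }
            i++; // skip the escaped character so an escaped backslash is not misread
        }
        return columns;
    }

    private static boolean hasHexDigits(String s, int from, int count) {
        if (from + count > s.length()) {
            return false;
        }
        for (int k = from; k < from + count; k++) {
            char c = s.charAt(k);
            boolean hex = (c >= '0' && c <= '9')
                    || (c >= 'a' && c <= 'f')
                    || (c >= 'A' && c <= 'F');
            if (!hex) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A shortened stand-in for the literal in the question.
        System.out.println(malformedEscapeColumns("kana (\\u30BF\\u30A \\u30F2)"));
    }
}
```

If the scan flags escapes in the dump, the data itself is malformed at those positions, and the parser is complaining about the character that appears where a fourth hex digit should be.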

Related

Problems with encoding in IntellijIDEA on Windows

When I launch my program, it crashes on the line
String[] words = tdElement.text().replaceAll("[^a-zA-Zа-яА-Я ]", " ").split(" ");
with the following exception:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal character range near index 16
[^a-zA-ZР°-СЏРђ-РЇ ]
This "tdElement" contains both English and Russian letters. When there are no Russian letters in the "tdElement", everything works fine. I went to "Settings" -> "File Encodings" and set the "Global Encoding", "Project Encoding" and "Default encoding for properties files" fields to UTF-8, but it didn't help. Thank you in advance.
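The garbled character class in the exception suggests the source file was compiled with a different encoding than the one it was saved in, so the Cyrillic range endpoints turned into other characters. One way to take the source encoding out of the picture is to spell the Cyrillic endpoints as regex-level \uXXXX escapes. The sketch below assumes the intent is "keep Latin and basic Russian letters plus spaces"; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class CyrillicFilter {
    // Same class as [^a-zA-Zа-яА-Я ], but the Cyrillic endpoints are written
    // as regex escapes (U+0430..U+044F and U+0410..U+042F), so the source
    // file contains only ASCII.
    private static final Pattern NOT_LETTER_OR_SPACE =
            Pattern.compile("[^a-zA-Z\\u0430-\\u044F\\u0410-\\u042F ]");

    static String[] words(String text) {
        return NOT_LETTER_OR_SPACE.matcher(text).replaceAll(" ").split(" ");
    }

    public static void main(String[] args) {
        for (String word : words("abc, \u043F\u0440\u0438\u0432\u0435\u0442!")) {
            System.out.println("[" + word + "]");
        }
    }
}
```

Passing -encoding UTF-8 to javac is the other usual route, if the build is under your control.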

Java CSVeed library, option for quote inside unquoted field?

I'm benchmarking several Java libraries for parsing CSV files. I can't find a way to make the CSVeed library parse this line:
af,dekh"iykh'ya,Dekh"iykh'ya,13,,34.60345,69.2405
I get this error:
org.csveed.report.CsvException: Illegal state transition:
Parsing symbol QUOTE_SYMBOL [34] in state INSIDE_FIELD
19970: af,dekh
I understand what is happening, but I have tried various combinations of options without success. Is there a way?
In fact, a well-formed line of 7 columns would be:
af,dekh\"iykh\'ya,Dekh\"iykh\'ya,13,,34.60345,69.2405
af,dekh"iykh'ya,Dekh"iykh'ya,13,,34.60345,69.2405
To parse this line into the following fields, you'll have to turn quoting off in your parser:
af
dekh"iykh'ya
Dekh"iykh'ya
13
<null>
34.60345
69.2405
If quoting cannot be turned off, you could use setQuote(char symbol) and pass a character that never occurs in the data.
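To make "quoting off" concrete, here is a minimal plain-Java sketch that treats the comma as the only special character. It does not use CSVeed at all, the class and method names are hypothetical, and it yields an empty string rather than null for the empty fifth field.

```java
final class UnquotedSplit {
    // Splits on commas only; quote characters are ordinary data.
    // The limit of -1 keeps empty fields, including trailing ones.
    static String[] fields(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        String line = "af,dekh\"iykh'ya,Dekh\"iykh'ya,13,,34.60345,69.2405";
        for (String field : fields(line)) {
            System.out.println(field.isEmpty() ? "<empty>" : field);
        }
    }
}
```

This only works because no field in this file contains a comma; data with embedded delimiters needs a real CSV parser with quoting enabled.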

Convert from CL8ISO8859P5 encoding to ASCII in Java

Perhaps you can give me some hints about what I can do or look at in my case :)
There is an Oracle code that converts given hexadecimal input in AMERICAN_AMERICA.CL8ISO8859P5 to ASCII: utl_raw.cast_to_varchar2(utl_raw.convert(hextoraw('31383831303891353080853737303338385A5A'), 'AMERICAN_AMERICA.CL8ISO8859P5', 'AMERICAN_AMERICA.RU8PC866'))
Example input: 31383831303891353080853737303338385A5A, example output: 188108С50АЕ770388ZZ
My problem is how to do this in Java. Constraint: I have no connection to the database, so I can't execute a prepared SQL statement to call this function in an Oracle package.
I am able to parse everything except specific bytes (91 -> 'C1', 8085 -> 'B0B5', 5A -> 'Z') with the following code:
new String(DatatypeConverter.parseHexBinary("31383831303891353080853737303338385A5A"))
I've also tried all the standard encodings in the String constructor that takes an encoding, but with no success :(
Do you know whether Java has an encoding identical to AMERICAN_AMERICA.CL8ISO8859P5? Or do you know of any libraries or Java functions that can perform this conversion (AMERICAN_AMERICA.CL8ISO8859P5 to ASCII)?
Many thanks to you in advance!
The Oracle encoding AMERICAN_AMERICA.RU8PC866 corresponds to the IBM-866 encoding in Java (hint from @kfinity). My issue was solved by using
new String(DatatypeConverter.parseHexBinary(input), "IBM-866")
CP866 worked as well.
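A self-contained sketch of that fix follows. It replaces DatatypeConverter with a hand-written hex decoder, since javax.xml.bind is no longer bundled with recent JDKs; the class and method names are hypothetical, and it assumes the JDK in use ships the IBM866 charset (a reduced runtime image may not).

```java
import java.nio.charset.Charset;

final class Cp866Decode {
    // Converts a hex string such as "5A5A" into its raw bytes.
    static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Interprets the bytes as IBM866 (code page 866) text.
    static String decode(String hex) {
        return new String(hexToBytes(hex), Charset.forName("IBM866"));
    }

    public static void main(String[] args) {
        System.out.println(decode("31383831303891353080853737303338385A5A"));
    }
}
```

In code page 866 the bytes 0x91, 0x80 and 0x85 are the Cyrillic letters С, А and Е, which is why the example input decodes to the expected output only with this charset.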

Why is caret an illegal character in this regular expression?

The environment is Java 8, Saxon 9.8 processor, XSL Stylesheet Version 3, running from Eclipse.
Given the following xslt command in the stylesheet:
<xsl:variable name="output"
select="fn:replace($inputstring,
'^.*exec\s+sp_prepexec.+?N'([^#](?:[^'']|'''')+)''.*$', '$1', 'ism;j')" />
Produces the stacktrace:
net.sf.saxon.trans.XPathException: Invalid character '^' in expression
at net.sf.saxon.expr.parser.XPathParser.grumble(XPathParser.java:281)
at net.sf.saxon.expr.parser.XPathParser.grumble(XPathParser.java:238)
at net.sf.saxon.expr.parser.XPathParser.grumble(XPathParser.java:225)
at net.sf.saxon.expr.parser.XPathParser.nextToken(XPathParser.java:196)
at net.sf.saxon.expr.parser.XPathParser.parseDynamicFunctionCall(XPathParser.java:2358)
at net.sf.saxon.expr.parser.XPathParser.parseStepExpression(XPathParser.java:1974)
...
at org.eclipse.wst.xsl.jaxp.debug.invoker.internal.Main.main(Main.java:72)
I did not find any clue as to why a caret would not be allowed in that expression. Can you help me debug this?
I was wondering whether escaping is the problem. In the line above I doubled the single apostrophes in the expression, and I also tried &apos;, but the error message is always the same.
Given the flags, I assume Saxon would use the Java regex engine for this, but the stack trace does not show that.
This is an example of the input string I want to process:
declare #p1 int
set #p1=328
exec sp_prepexec #p1 output,N'#P1 int,#P2 char(1),#P3 char(1)',N'SELECT "Tbl1009"."RUN_NO" "Col1111","Tbl1009"."DEP_ID" "Col1114" FROM "run" "Tbl1009" WHERE #P1="Tbl1009"."RUN_ID" AND ("Tbl1009"."Profile_ID"=(1) AND #P2=''N'' OR "Tbl1009"."Profile_ID"=(5) AND #P3=''Y'') AND ("Tbl1009"."Profile_ID"=(1) OR "Tbl1009"."Profile_ID"=(5))',150,'N','N'
select #p1
and the required output:
SELECT "Tbl1009"."RUN_NO" "Col1111","Tbl1009"."DEP_ID" "Col1114" FROM "run" "Tbl1009" WHERE #P1="Tbl1009"."RUN_ID" AND ("Tbl1009"."Profile_ID"=(1) AND #P2=''N'' OR "Tbl1009"."Profile_ID"=(5) AND #P3=''Y'') AND ("Tbl1009"."Profile_ID"=(1) OR "Tbl1009"."Profile_ID"=(5))
@WillBarnwell has the correct diagnosis but the wrong solution. The problem with the ' isn't that it is special in regular expressions, the problem is that it is special in XPath, so you need to use XPath-level escaping, and the way to do that is to write it as two apostrophes. This can get pretty bewildering so the best thing is often to move the regex to a variable defined with content:
<xsl:variable name="regex" as="xs:string"
>^.*exec\s+sp_prepexec.+?N'([^#](?:[^']|'')+)'.*$</xsl:variable>
<xsl:variable name="output"
select="fn:replace($inputstring, $regex, '$1', 'ism;j')" />
(Check that carefully because I'm not sure I have fully understood your intent).
This is a syntax error: your regex ends at the first unescaped single quote, so it is interpreted as ^.*exec\s+sp_prepexec.+?N. This is then followed by ([^, with ^ being the first illegal character. Notice that the error originates from the XPath parser, not the regex engine.
Escaping your single quotes with \' is not the way to solve this; as @Michael-Kay shows, the way is to define your regex in a variable.
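Since the question assumes the ;j flag hands the pattern to the Java regex engine, the regex itself can be checked in plain Java, where there is no XPath quoting layer in the way. This is a sketch under that assumption: the class name is hypothetical, the i, s and m flags are mapped to CASE_INSENSITIVE, DOTALL and MULTILINE, and the input is a shortened stand-in for the one in the question.

```java
import java.util.regex.Pattern;

final class PrepexecExtract {
    // The regex from the variable-based answer, written as a Java string
    // literal: only the backslash needs escaping, the apostrophes do not.
    private static final Pattern STATEMENT = Pattern.compile(
            "^.*exec\\s+sp_prepexec.+?N'([^#](?:[^']|'')+)'.*$",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);

    static String extract(String input) {
        return STATEMENT.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String input = "declare #p1 int\n"
                + "set #p1=328\n"
                + "exec sp_prepexec #p1 output,N'#P1 int',"
                + "N'SELECT a FROM t WHERE #P2=''N''',150,'N'\n"
                + "select #p1";
        System.out.println(extract(input));
    }
}
```

If the pattern behaves as intended here, any remaining failure in the stylesheet points at the quoting inside the select attribute rather than at the regex.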

Java regex is working in my system but not in the server

The regular expression is
String regex = "^[\\p{IsHangul}\\p{IsDigit}]+";
And whenever I do
text.matches(regex);
It works fine on my system but not on some other systems.
I am not able to track down the issue.
Thank you in advance.
Exception:
Exception in thread "main" java.util.regex.PatternSyntaxException: Unknown character property name {Hangul} near index 13
^[\p{IsHangul}\p{IsDigit}]+
^
at java.util.regex.Pattern.error(Pattern.java:1713)
at java.util.regex.Pattern.charPropertyNodeFor(Pattern.java:2437)
at java.util.regex.Pattern.family(Pattern.java:2412)
at java.util.regex.Pattern.range(Pattern.java:2335)
at java.util.regex.Pattern.clazz(Pattern.java:2268)
at java.util.regex.Pattern.sequence(Pattern.java:1818)
at java.util.regex.Pattern.expr(Pattern.java:1752)
at java.util.regex.Pattern.compile(Pattern.java:1460)
at java.util.regex.Pattern.<init>(Pattern.java:1133)
at java.util.regex.Pattern.compile(Pattern.java:823)
at java.util.regex.Pattern.matches(Pattern.java:928)
at java.lang.String.matches(String.java:2090)
at com.mycompany.helper.ApplicationHelper.main(ApplicationHelper.java:200)
According to Using Regular Expressions in Java:
Java 5 fixes some bugs and adds support for Unicode blocks. ...
Make sure you're using Java 5+ on the server.
It seems that the Java version you are using does not recognise Hangul as a valid script name, so you can try to create your own character class that covers the same ranges as Hangul does in newer versions of Java.
From what I see in the source code of Character.UnicodeScript on Java 8, Hangul refers to these Unicode ranges:
1100..11FF
302E..302F
3131..318F
3200..321F
3260..327E
A960..A97F
AC00..D7FB
FFA0..FFDF
so maybe try a pattern like this:
Pattern.compile("^["
+ "\u1100-\u11FF"
+ "\u302E-\u302F"
+ "\u3131-\u318F"
+ "\u3200-\u321F"
+ "\u3260-\u327E"
+ "\uA960-\uA97F"
+ "\uAC00-\uD7FB"
+ "\uFFA0-\uFFDF"
+ "\\p{IsDigit}]+");
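A small self-contained check of the explicit-range pattern might look like the sketch below. The class and method names are hypothetical, and the ranges are written as regex-level escapes (double backslash) so the source file stays ASCII. It keeps \p{IsDigit} as in the question; if the older runtime rejects that name too, \p{Nd} is a possible substitute, though that is an assumption not verified on such a runtime.

```java
import java.util.regex.Pattern;

final class HangulCheck {
    // Explicit code point ranges listed for the Hangul script, plus digits.
    private static final Pattern HANGUL_OR_DIGITS = Pattern.compile("^["
            + "\\u1100-\\u11FF"
            + "\\u302E-\\u302F"
            + "\\u3131-\\u318F"
            + "\\u3200-\\u321F"
            + "\\u3260-\\u327E"
            + "\\uA960-\\uA97F"
            + "\\uAC00-\\uD7FB"
            + "\\uFFA0-\\uFFDF"
            + "\\p{IsDigit}]+");

    // True when the whole text consists of Hangul characters and digits.
    static boolean isHangulOrDigits(String text) {
        return HANGUL_OR_DIGITS.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHangulOrDigits("\uD55C\uAE00123"));
        System.out.println(isHangulOrDigits("abc"));
    }
}
```

Compiling the pattern once into a constant also avoids recompiling it on every text.matches(regex) call.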
