I am trying to copy files to an NFSv3-mounted volume during a Spark job. Some of the file names contain umlauts. For example:
Malformed input or input contains unmappable characters:
/import/nfsmountpoint/Währungszählmaske.pdf
The error occurs in the following line of scala code:
import java.nio.file.Paths

// targetPath is a String and looks OK
val target = Paths.get(targetPath)
The file encoding is reported as ANSI X3.4-1968 (i.e. US-ASCII), although the Linux locale on the Spark machines is set to en_US.UTF-8.
I already tried to change the locale for the Spark job itself using the following arguments:
--conf 'spark.executor.extraJavaOptions=-Dsun.jnu.encoding=UTF8 -Dfile.encoding=UTF8'
--conf 'spark.driver.extraJavaOptions=-Dsun.jnu.encoding=UTF8 -Dfile.encoding=UTF8'
This solves the error, but the filename on the target volume looks like this:
/import/nfsmountpoint/W?hrungsz?hlmaske.pdf
The volume mountpoint is:
hnnetapp666.mydomain:/vol/nfsmountpoint on /import/nfsmountpoint type nfs (rw,nosuid,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=4.14.1.36,mountvers=3,mountport=4046,mountproto=udp,local_lock=none,addr=4.14.1.36)
Is there a way to fix this?
Solved this by setting the encoding options as mentioned above and manually converting the file names from and to UTF-8:
Solution for encoding conversion
Just using NFSv4 with UTF-8 support would have been an easier solution.
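For illustration, the manual conversion amounts to a byte-level re-coding of the name, so that the bytes handed to the OS, and hence to the NFSv3 server (which stores names as raw bytes), are the UTF-8 bytes. A minimal sketch in Java, assuming the JVM maps file names through ISO-8859-1; utf8Path is a hypothetical helper, not the code from the linked solution:

import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileNameRecoder {
    // Hypothetical helper: re-interpret the UTF-8 bytes of the name as
    // ISO-8859-1. Since ISO-8859-1 maps each of the 256 byte values to
    // exactly one character, the byte sequence survives unchanged on its
    // way through the JVM's file-name encoding.
    public static Path utf8Path(String name) {
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        return Paths.get(new String(utf8, StandardCharsets.ISO_8859_1));
    }
}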
Related
We have a Java application (OpenJDK 1.8), a service generating payloads using FreeMarker templates (Maven dependency version 2.3.31). The content translations are handled using resource bundles (.properties files with translations, e.g. template.properties, template_fi.properties, template_bg.properties, ...). The properties files are UTF-8 encoded, and everything works fine.
When migrating to Java 11 (Zulu OpenJDK 11), we started to have an issue with translations that were not "Latin", i.e. contained characters outside the ISO-8859-1 charset. All characters outside that charset were changed to ?. (The resource files were still UTF-8 encoded; converting the content with native2ascii did not help.)
After some time / experiments we solved the encoding issue using the system property:
-Djava.util.PropertyResourceBundle.encoding=ISO-8859-1
I'm looking for an explanation - WHY? I find the property value counterintuitive and I'd like to understand the process.
According to the documentation, I understand that ResourceBundle is supposed to read the properties file using ISO-8859-1 and throw an exception when encountering an invalid character. The system property mentioned above should enable having the properties file encoded in UTF-8. Yet the working solution was explicitly setting the ISO-8859-1 encoding.
And indeed, testing a pure Java implementation, the proper output is achieved using the UTF-8 encoding:
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Locale;
import java.util.ResourceBundle;

System.setProperty("java.util.PropertyResourceBundle.encoding", "UTF-8");
// "ISO-8859-1" <- not working
// System.setProperty("java.util.PropertyResourceBundle.encoding", "ISO-8859-1");
Locale locale = Locale.forLanguageTag("bg-BG");
ResourceBundle testBundle = ResourceBundle.getBundle("test", locale);
System.out.println(testBundle.getString("name"));

// print Base64-encoded, so the terminal doesn't break the non-Latin characters
System.out.println(
    Base64.getEncoder()
          .encodeToString(testBundle.getString("name").getBytes(StandardCharsets.UTF_8)));
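For completeness, the bundle under test could be a file like the following (hypothetical test_bg.properties, saved as UTF-8; only the key "name" is taken from the code above):

name=Тестово име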
I assume that the FreeMarker library somehow makes some encoding changes internally, yet I am not sure what or why; FreeMarker's internal localized string handling is a simple bundle.
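That assumption is at least mechanically plausible: if a layer downstream re-encodes the bundle's strings, the ISO-8859-1 setting can "work" because reading UTF-8 bytes as ISO-8859-1 and writing them back out as ISO-8859-1 is lossless. A minimal sketch of that round-trip (an illustration only, not FreeMarker's actual code path):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    public static void main(String[] args) {
        // UTF-8 bytes of a non-Latin string, decoded as ISO-8859-1 ...
        byte[] utf8Bytes = "Тест".getBytes(StandardCharsets.UTF_8);
        String mojibake = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        // ... and encoded back as ISO-8859-1: the original bytes survive,
        // because ISO-8859-1 maps all 256 byte values to characters 1:1.
        byte[] roundTripped = mojibake.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(utf8Bytes, roundTripped)); // true
    }
}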
When trying to detect the existence of files whose names are encoded in UTF-8 with the FileExists function, the files could not be found.
I found that on the ColdFusion server the Java file encoding was originally set to "UTF-8". For some unknown reason it reverted to the default "ASCII". I suspect this is the issue.
For example, a user uploaded a photo named 云拼花.jpg while the server's Java file encoding was set to UTF-8. Now, with the server's Java file encoding set to ASCII, if I use:
<cfif FileExists("#currentpath##pic#")>
The result is "not found", i.e. the file does not exist. However, if I simply display it using:
<IMG SRC="/images/#pic#">
The image displays. This causes issues when I try to test for the existence of the images: the images are there but can't be found by FileExists.
Now the directory has a mix of file names encoded in either UTF-8 or ASCII. Is there any way to:
force any uploaded file to UTF-8 encoding
check for the existence of the file
regardless of CF Admin Java File Encoding setting?
Add this to your page:
<cfprocessingdirective pageencoding="utf-8">
This should fix the issue.
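Since ColdFusion runs on the JVM and FileExists ultimately goes through the JVM's file APIs, it can also help to check which charset the JVM is actually using for file names. A small diagnostic sketch in plain Java (the path is illustrative):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class EncodingDiagnostic {
    public static void main(String[] args) {
        // sun.jnu.encoding is the charset the JVM uses for file names when
        // talking to the OS; if it is US-ASCII, any non-ASCII name degrades
        // to '?' during lookup, which is why the file cannot be found even
        // though the web server can still serve it by its raw URL.
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));

        Path photo = Paths.get("/images/云拼花.jpg"); // illustrative path
        System.out.println("exists: " + Files.exists(photo));
    }
}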
I tried running the example command on Mac from the command line, but it fails to run. I'm wondering what file is not found. My desktop directory does, in fact, exist. Am I missing some configuration or something?
I just downloaded the dcm4che-5.12.0 code and sample scripts and executed the command from the example shown by the --help option.
The example command is what I tried, as shown in the attached screenshot. I'm not sure what is missing, and the error message is not exactly clear.
Any guidance will be appreciated, thanks!
According to the documentation, you have to provide a file and not a directory name to the --dicomdir option (with two "-" by the way):
--dicomdir <file> specify path to a DICOMDIR file of
a DICOM File-set into which
received objects are stored and
from which requested objects are
retrieved
Actually, the example from the documentation reads as follows:
Example: dcmqrscp -b DCMQRSCP:11112 --dicomdir /media/cdrom/DICOMDIR
=> Starts server listening on port 11112, accepting association requests
with DCMQRSCP as called AE title.
I get the following error:
org.dbpedia.spotlight.exceptions.ConfigurationException: Cannot find spotter file ../dist/src/deb/control/data/usr/share/dbpedia-spotlight/spotter.dict
at org.dbpedia.spotlight.model.SpotterConfiguration.<init>(SpotterConfiguration.java:54)
at org.dbpedia.spotlight.model.SpotlightConfiguration.<init>(SpotlightConfiguration.java:143)
at org.dbpedia.spotlight.web.rest.Server.main(Server.java:70)
Usage:
java -jar dbpedia-spotlight.jar org.dbpedia.spotlight.web.rest.Server [config file]
or:
mvn scala:run "-DaddArgs=[config file]"
Quick solution:
wget http://spotlight.dbpedia.org/download/release-0.5/dbpedia-spotlight-quickstart.zip
unzip dbpedia-spotlight-quickstart.zip
cd dbpedia-spotlight-quickstart/
./run.sh
Explanation:
DBpedia Spotlight looks for ~3.5M things of ~320 types in text and tries to disambiguate them to their globally unique identifiers in DBpedia. Therefore it needs data files to accompany its jar. A minuscule example is distributed along with the source, but for real use cases you may need the larger files. After you've downloaded the files, you need to modify the configuration in server.properties with the correct path to the files. The error message you got tells you that one of the necessary files (spotter.dict) could not be found at the path you indicated in your server.properties.
More information available here:
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Run-from-a-JAR
I am completely new to Jena/TDB. All I want to do is load data from a sample RDF, N3, etc. file using the TDB scripts or through the Java API.
I tried to use tdbloader on Cygwin to load data (TDB 0.9.0, on Windows XP with IBM Java 1.6). The following are the commands that I ran:
$ export TDBROOT=/cygdrive/d/Project/Store_DB/jena-tdb-0.9.0-incubating
$ export PATH=$TDBROOT/bin:$PATH
I also changed the Java classpath in the tdbloader script, as mentioned in "tdbloader on Cygwin: java.lang.NoClassDefFoundError":
exec java $JVM_ARGS $SOCKS -cp "PATH_OF_JAR_FILES" "tdb.$TDB_CMD" $TDB_SPEC "$@"
So when I run $ tdbloader --help it shows the help correctly.
But when I run
$ tdbloader --loc /cygdrive/d/Project/Store_DB/data1
OR
$ tdbloader --loc /cygdrive/d/Project/Store_DB/data1 test.rdf
I get the following exception:
com.hp.hpl.jena.tdb.base.file.FileException: Failed to open: d:\cygdrive\d\Project\Store_DB\data1\node2id.idn (mode=rw)
at com.hp.hpl.jena.tdb.base.file.ChannelManager.open$(ChannelManager.java:83)
at com.hp.hpl.jena.tdb.base.file.ChannelManager.openref$(ChannelManager.java:58)
at com.hp.hpl.jena.tdb.base.file.ChannelManager.acquire(ChannelManager.java:47)
at com.hp.hpl.jena.tdb.base.file.FileBase.<init>(FileBase.java:57)
at com.hp.hpl.jena.tdb.base.file.FileBase.<init>(FileBase.java:46)
at com.hp.hpl.jena.tdb.base.file.FileBase.create(FileBase.java:41)
at com.hp.hpl.jena.tdb.base.file.BlockAccessBase.<init>(BlockAccessBase.java:46)
at com.hp.hpl.jena.tdb.base.block.BlockMgrFactory.createStdFile(BlockMgrFactory.java:98)
at com.hp.hpl.jena.tdb.base.block.BlockMgrFactory.createFile(BlockMgrFactory.java:82)
at com.hp.hpl.jena.tdb.base.block.BlockMgrFactory.create(BlockMgrFactory.java:58)
at com.hp.hpl.jena.tdb.setup.Builder$BlockMgrBuilderStd.buildBlockMgr(Builder.java:196)
at com.hp.hpl.jena.tdb.setup.Builder$RangeIndexBuilderStd.createBPTree(Builder.java:165)
at com.hp.hpl.jena.tdb.setup.Builder$RangeIndexBuilderStd.buildRangeIndex(Builder.java:134)
at com.hp.hpl.jena.tdb.setup.Builder$IndexBuilderStd.buildIndex(Builder.java:112)
at com.hp.hpl.jena.tdb.setup.Builder$NodeTableBuilderStd.buildNodeTable(Builder.java:85)
at com.hp.hpl.jena.tdb.setup.DatasetBuilderStd$NodeTableBuilderRecorder.buildNodeTable(DatasetBuilderStd.java:389)
at com.hp.hpl.jena.tdb.setup.DatasetBuilderStd.makeNodeTable(DatasetBuilderStd.java:300)
at com.hp.hpl.jena.tdb.setup.DatasetBuilderStd._build(DatasetBuilderStd.java:167)
at com.hp.hpl.jena.tdb.setup.DatasetBuilderStd.build(DatasetBuilderStd.java:157)
at com.hp.hpl.jena.tdb.setup.DatasetBuilderStd.build(DatasetBuilderStd.java:70)
at com.hp.hpl.jena.tdb.StoreConnection.make(StoreConnection.java:132)
at com.hp.hpl.jena.tdb.transaction.DatasetGraphTransaction.<init>(DatasetGraphTransaction.java:46)
at com.hp.hpl.jena.tdb.sys.TDBMakerTxn._create(TDBMakerTxn.java:50)
at com.hp.hpl.jena.tdb.sys.TDBMakerTxn.createDatasetGraph(TDBMakerTxn.java:38)
at com.hp.hpl.jena.tdb.TDBFactory._createDatasetGraph(TDBFactory.java:166)
at com.hp.hpl.jena.tdb.TDBFactory.createDatasetGraph(TDBFactory.java:74)
at com.hp.hpl.jena.tdb.TDBFactory.createDataset(TDBFactory.java:53)
at tdb.cmdline.ModTDBDataset.createDataset(ModTDBDataset.java:95)
at arq.cmdline.ModDataset.getDataset(ModDataset.java:34)
at tdb.cmdline.CmdTDB.getDataset(CmdTDB.java:137)
at tdb.cmdline.CmdTDB.getDatasetGraph(CmdTDB.java:126)
at tdb.cmdline.CmdTDB.getDatasetGraphTDB(CmdTDB.java:131)
at tdb.tdbloader.loadQuads(tdbloader.java:163)
at tdb.tdbloader.exec(tdbloader.java:122)
at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
at tdb.tdbloader.main(tdbloader.java:53)
Caused by: java.io.FileNotFoundException: d:\cygdrive\d\Project\Store_DB\data1\node2id.idn (The system cannot find the path specified.)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:222)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:107)
at com.hp.hpl.jena.tdb.base.file.ChannelManager.open$(ChannelManager.java:80)
... 37 more
I am not sure what the node2id.idn file is or why tdbloader expects it.
The file node2id.idn is one of TDB's internal index files. It's not something that you have to create or manage yourself. I've just tried tdbloader on Cygwin myself, and it worked OK for me. I can think of two basic possibilities:
your disk is full
the TDB index is corrupted
If this is the first file you are loading into an otherwise empty TDB store, the second possibility is unlikely. If you are loading into a non-empty TDB store, try deleting the TDB image and starting again. Note that TDB by itself does not manage concurrent writes: if you have more than one process writing to a single TDB image, you must handle locking at the application level, or use TDB's transactions.
The final possibility, of course, is that your disk is flaky. You might want to try your code on another machine.
If none of these suggestions help, please send a complete minimal test case to the Jena users list.
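Separately, since you mentioned wanting to load data through the Java API as well: a minimal sketch using TDB 0.9's transactional API (the location and file name are taken from your question; adjust as needed):

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.ReadWrite;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.tdb.TDBFactory;

public class TdbLoadExample {
    public static void main(String[] args) {
        // Use a plain Windows path; TDB creates its internal index files
        // (node2id.idn and friends) inside this directory on first use.
        Dataset dataset = TDBFactory.createDataset("D:/Project/Store_DB/data1");
        dataset.begin(ReadWrite.WRITE);
        try {
            Model model = dataset.getDefaultModel();
            model.read("file:test.rdf"); // RDF/XML; pass a lang for N3/Turtle
            dataset.commit();
        } finally {
            dataset.end();
        }
        dataset.close();
    }
}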