I am trying to pull byte data out of a source, encrypt it, and then store it in the file system.
For encryption, I am using jasypt and the BasicTextEncryptor class, and for storing to the file system, I am using Apache Commons IOUtils.
When required, these files will be decrypted and then sent to the user's browser. This system works on my local machine where the default charset is MacRoman, but it fails on the server where the default charset is UTF-8.
When I explicitly set the encoding at each stage of the process to MacRoman, it works on the server as well, but I am skeptical about doing this as the rest of my code uses UTF-8.
Is there a way to make the code work without converting to MacRoman?
You should just use UTF8 everywhere.
As long as you use the same encoding at each end of an operation (and as long as the encoding can handle all of the characters you need), you'll be fine.
In your comments on another answer, you claim you're not using an encoding, but that's impossible. You're using the BasicTextEncryptor class, which according to this documentation only works on Strings and char arrays. That means that, at some point, you're converting from an encoding-agnostic byte array to an encoding-specific String or char array. That means that you're relying upon an encoding somewhere, whether you realize it or not. You need to track down where that conversion is happening and ensure it has the correct encoding.
Your question states, "When I explicitly set the encoding at each stage of the process", so you will need to know how it's encoded in the database. If that doesn't make sense, read on.
It's also possible that you are simply trying to encrypt a file that you're getting out of the database, and you don't care about the string representation; you want to treat it as plain bytes, not as text. In that case, BasicTextEncryptor ("Utility class for easily performing normal-strength encryption of texts.") is not a good fit for this task. It encrypts strings. The BasicBinaryEncryptor ("Utility class for easily performing normal-strength encryption of binaries (byte arrays).") is what you need.
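If that's your situation, a minimal sketch of the byte-only route could look like this (the file names and password handling are placeholders, and it assumes jasypt and Commons IO are on the classpath):

import java.io.File;
import org.apache.commons.io.FileUtils;
import org.jasypt.util.binary.BasicBinaryEncryptor;

public class BinaryEncryptionSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical input file; in your case this would be the bytes from your source.
        byte[] original = FileUtils.readFileToByteArray(new File("source.dat"));

        BasicBinaryEncryptor encryptor = new BasicBinaryEncryptor();
        encryptor.setPassword("secret-passphrase"); // assumption: password comes from your config

        // Encrypt and store: no String (and therefore no charset) is ever involved.
        byte[] encrypted = encryptor.encrypt(original);
        FileUtils.writeByteArrayToFile(new File("stored.enc"), encrypted);

        // Later: read back and decrypt, again purely as bytes.
        byte[] decrypted = encryptor.decrypt(FileUtils.readFileToByteArray(new File("stored.enc")));
        // 'decrypted' can now be streamed to the user's browser as-is.
    }
}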
Related
I have a Java WebAgent in Lotus-Domino which runs through the OpenURL command (https://link.com/db.nsf/agentName?openagent). This agent is created for receiving a POST with XML content. Before even parsing or saving the (XML) content, the webagent saves the content into an in-memory document:
For an agent run from a browser with the OpenAgent URL command, the in-memory document is a new document containing an item for each CGI (Common Gateway Interface) variable supported by Domino®. Each item has the name and current value of a supported CGI variable. (No design work on your part is needed; the CGI variables are available automatically.)
https://www.ibm.com/support/knowledgecenter/en/SSVRGU_9.0.1/basic/H_DOCUMENTCONTEXT_PROPERTY_JAVA.html
The content of the POST will be saved (by Lotus) into the request_content field. When receiving content containing the character é, like:
<Name xml:lang="en">tést</Name>
the é is changed by Lotus to ?®. This is also what I see when reading the request_content field in the document properties. Is it possible to save the é as é, and not as ?®, in Lotus?
Solution:
The way I fixed it is via this post:
Link which helped me solve this problem
Here is the solution, but in Java:
/****** INITIALIZATION ******/
Session session = getSession();
AgentContext agentContext = session.getAgentContext();

// Write REQUEST_CONTENT to a temp file as LMBCS (Domino's internal charset)...
Stream stream = session.createStream();
stream.open("C:\\Temp\\test.txt", "LMBCS");
stream.writeText(agentContext.getDocumentContext().getItemValueString("REQUEST_CONTENT"));
stream.close();

// ...then read the same file back as UTF-8 to get a correctly decoded Java String.
stream.open("C:\\Temp\\test.txt", "UTF-8");
String content = stream.readText();
stream.close();

System.out.println("Content: " + content);
I've dealt with this before, but I no longer have access to the code so I'm going to have to work from memory.
This looks like a UTF-8 vs UTF-16 issue, but there are up to five charsets that can come into play: the charset used in the code that does the POST, the charset of the JVM the agent runs in, the charset of the Domino server code, the charset of the NSF - which is always LMBCS, and the charset of the Domino server's host OS.
If I recall correctly, REQUEST_CONTENT is treated as raw data, not character data. To get it right, you have to handle the conversion of REQUEST_CONTENT yourself.
The Notes API calls that you use to save data in the Java agent will automatically convert from Unicode to LMBCS and vice versa, but this only works if Java has interpreted the incoming data stream correctly. I think in most cases, the JVM running under Domino is configured for UTF-16 - though that may not be the case. (I recall some issue with a server in Japan, and one of the charsets that came into play was one of the JIS standard charsets, but I don't recall if that was in the JVM.)
So if I recall correctly, you need to take the REQUEST_CONTENT String, get its bytes using getBytes("UTF-8"), and then construct a new String from that byte array using new String(bytes, "UTF-16") - that's assuming the JVM is treating the incoming stream as UTF-16. Then pass that string to NotesDocument.ReplaceItemValue() so the Notes API calls interpret it correctly.
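A rough sketch of that conversion as a fragment inside the agent's NotesMain (reconstructed from memory, so treat it as a starting point rather than a verified fix; the ParsedContent item name is made up):

Document doc = agentContext.getDocumentContext();
String raw = doc.getItemValueString("REQUEST_CONTENT");

// Re-interpret what Domino handed us: take the UTF-8 bytes of the String and
// decode them with the charset the JVM is assumed to have used for the stream.
byte[] bytes = raw.getBytes("UTF-8");
String converted = new String(bytes, "UTF-16");

// Let the Notes API convert from Java's Unicode string to LMBCS on save.
doc.replaceItemValue("ParsedContent", converted);
doc.save(true, false);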
I may have some details wrong here. It's been a while. Years ago I built a database that shows the LMBCS, UTF-8 and UTF-16 values for all Unicode characters. If you can get down to the byte values, it can be a useful tool for looking at data like this and figuring out what's really going on. It's downloadable from OpenNTF here. In a situation like this, I recall writing code that got the byte array, converted it to hex, and wrote it to a NotesItem so that I could see exactly what was coming in and compare it to the database entries.
And, yes, as per the comments, it's much better if you let the XML tools on both sides handle the charset issues and encoding, but it's not always foolproof: you're adding another layer of charset handling into the process, and you have to get it right. If the goal is to store data in NotesItems, you still have to make sure that the server-side XML tools decode into the correct charset, which may not be the default.
My heart breaks looking at this. I also just went through this hell and found the old advice, but... I just could not bring myself to write to disk to solve such a trivial matter.
// Pull the raw item value as bytes and decode it explicitly as UTF-8.
Item item = agentContext.getDocumentContext().getFirstItem("REQUEST_CONTENT");
byte[] bytes = item.getValueCustomDataBytes("");
String content = new String(bytes, Charset.forName("UTF-8"));
Edited in response to comment by OP: There is an old post on this theme:
http://www-10.lotus.com/ldd/nd85forum.nsf/DateAllFlatWeb/ab8a5283e5a4acd485257baa006bbef2?OpenDocument (the same thread that OP used for his workaround)
The poster claims that when he uses a particular HTTP header, the method fails.
Now, he was working with 8.5 and using LotusScript. In my case I cannot make it fail by sending an additional header (or depending on the string argument).
How I Learned to Stop Worrying and Love the Notes/Domino:
For what it's worth, getValueCustomDataBytes() works only with very short payloads, and the limit depends on the content! Starting your text with an accented character such as 'é' increases the length it still works with... but whatever I tried, I could not get past 195 characters. Am I surprised? After all these years with Notes, I must admit I still am...
Well, admittedly it should not have worked in the first place, as it is documented to be used only with User Defined Data fields.
Finally
Use IBM's icu4j and icu4j-charset packages - drop them in jvm/lib/ext. Then the code becomes:
// Encode the item text to LMBCS bytes via ICU, then decode those bytes as UTF-8.
byte[] bytes = item.getText().getBytes(CharsetICU.forNameICU("LMBCS"));
String content = new String(bytes, Charset.forName("UTF-8"));
and yes, it will need a permission in java.policy:
permission java.lang.RuntimePermission "charsetProvider";
Is this any better than passing through the file system? I don't know. But it kinda looks cleaner.
Possible Duplicate:
Java : How to determine the correct charset encoding of a stream
A user will upload a CSV file to the server, and the server needs to check whether the CSV file is encoded as UTF-8. If it is, the user needs to be informed that they uploaded a file with the wrong encoding. The problem is: how do I detect whether the uploaded file is UTF-8 encoded? The back end is written in Java. Does anyone have a suggestion?
At least in the general case, there's no way to be certain what encoding is used for a file -- the best you can do is a reasonable guess based on heuristics. You can eliminate some possibilities, but at best you're narrowing down the possibilities without confirming any one. For example, most of the ISO 8859 variants allow any byte value (or pattern of byte values), so almost any content could be encoded with almost any ISO 8859 variant (and I'm only using "almost" out of caution, not any certainty that you could eliminate any of the possibilities).
You can, however, make some reasonable guesses. For example, if a file starts out with the three bytes of a UTF-8 encoded BOM (EF BB BF), it's probably safe to assume it's really UTF-8. Likewise, if you see sequences like 110xxxxx 10xxxxxx, it's a pretty fair guess that what you're seeing is encoded as UTF-8. You can eliminate the possibility that something is (correctly) UTF-8 encoded if you ever see a sequence like 110xxxxx 110xxxxx (110xxxxx is a lead byte of a sequence, which must be followed by a continuation byte, not another lead byte, in properly encoded UTF-8).
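A minimal sketch of such a check in Java (the class and method names are made up; it can only tell you that the bytes could be UTF-8, not that they definitely are):

import java.nio.ByteBuffer;
import java.nio.charset.*;

public class Utf8Check {
    /** Returns true if the data starts with a UTF-8 byte order mark. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Returns true if every byte sequence is valid UTF-8 (necessary, but not sufficient). */
    static boolean decodesAsUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}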
You can try and guess the encoding using a 3rd party library, for example: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
Well, you can't. You could show kind of a "preview" (or should I say review?) with some sample data from the file so the user can check if it looks okay. Perhaps with the possibility of selecting different encoding options to help determine the correct one.
I have to save a file in any format (XLS, PDF, DOC, JPG, ...) in a database using Java. In my experience I would do this by storing the binary data of the file in a BLOB-type field, but someone told me that an alternative is encoding the binary data as text using Base64 and storing the string in a TEXT-type field. Which one is the better option for this task?
Thanks.
Paul Manjarres
BLOB would be better, simply because you can use a byte[] data type, and you don't have to encode/decode from BASE64. No reason to use BASE64 for simple storage.
The argument for using BLOB is that it takes fewer CPU cycles, less disk and network i/o, less code, and reduces the likelihood of bugs:
As Will Hartung says, using BLOB enables you to skip the encode/decode steps, which will reduce CPU cycles. Moreover, there are many Java libraries for Base64 encoding and decoding, and there are nuances in implementation (i.e., PEM line wrapping). This means that, to be safe, the same library should be used for encoding and decoding, which creates an unnecessary coupling between the application that creates the record and the application that reads it.
The encoded output will be larger than the raw bytes, which means it will take up more disk space (and network i/o).
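For illustration, a rough sketch of the BLOB route over plain JDBC might look like this (the table and column names are placeholders, and the content column is assumed to be a BLOB/VARBINARY type):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.PreparedStatement;

public class BlobStoreSketch {
    public static void storeFile(Connection conn, Path file) throws Exception {
        String sql = "INSERT INTO documents (name, content) VALUES (?, ?)";
        try (InputStream in = Files.newInputStream(file);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, file.getFileName().toString());
            // Stream the raw bytes straight into the BLOB column:
            // no Base64, no charset, no size inflation.
            ps.setBinaryStream(2, in, Files.size(file));
            ps.executeUpdate();
        }
    }
}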
Use BLOB to put them in the database.
FILE to BLOB = the DB will not query the content and treats it as... well... a meaningless binary blob, regardless of its content. The DB knows this field may be 1 KB or 1 GB and allocates resources accordingly.
FILE to TEXT = the DB can query this thing. Strings can be searched, replaced, and modified in the field. But this time the DBMS will spend more resources to make it work. A field may hold 100 characters of text, or it may be storing 1 million characters. Files can have any kind of text encoding, and invalid characters may be lost due to table/DB encoding settings. There is no need to use this if the content of the files will not be used in SQL queries.
BASE64 = converts any content to a lovely, super-valid text. A workaround to bypass every compatibility issue. Store it anywhere, print it, telegraph it, write it on paper, convert your favorite selfie to a private key. The output will be meaningless and larger, but it will be ordinary text.
I need to send a JSON packet across the wire with the contents of an arbitrary file. This may be a binary file (like a ZIP file), but most often it will be plain ASCII text.
I'm currently using base64 encoding, which handles all files, but it increases the size of the data significantly - even if the file is ASCII to begin with. Is there a more efficient way I can encode the data, other than manually checking for any non-ASCII characters and then deciding whether or not to base64-encode it?
I'm currently writing this in Python, but will probably need to do the same in Java, C# and C++, so an easily portable solution would be preferable.
Use quoted-printable encoding. Any language should support that.
http://en.wikipedia.org/wiki/Quoted-printable
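For the Java side, a small sketch using Apache Commons Codec's QuotedPrintableCodec (assuming commons-codec is on the classpath) could look like this; ASCII bytes pass through essentially unchanged, so mostly-ASCII payloads stay close to their original size:

import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.net.QuotedPrintableCodec;

public class QpSketch {
    public static void main(String[] args) throws Exception {
        byte[] payload = "plain ASCII stays small, é gets escaped".getBytes(StandardCharsets.UTF_8);

        QuotedPrintableCodec codec = new QuotedPrintableCodec();
        byte[] encoded = codec.encode(payload);   // only "unsafe" bytes become =XX escapes
        byte[] decoded = codec.decode(encoded);   // round-trips back to the original bytes

        System.out.println(new String(encoded, StandardCharsets.US_ASCII));
        System.out.println(new String(decoded, StandardCharsets.UTF_8));
    }
}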
I've written a little application that does some text manipulation and writes the output to a file (HTML, CSV, DOCX, XML), and this all appears to work fine on Mac OS X. On Windows, however, I seem to get character-encoding problems: a lot of '"' characters disappear and are replaced with some weird stuff, usually the closing '"' of a pair.
I use FreeMarker to create my output files, and there is a byte[] array, and in one case also a ByteArrayStream, between reading the templates and writing the output. I assume this is a character-encoding problem, so could someone give me advice or point me to a 'Best Practice' resource for dealing with character encoding in Java?
Thanks
There's really only one best practice: be aware that Strings and bytes are two fundamentally different things, and that whenever you convert between them, you are using a character encoding (either implicitly or explicitly), which you need to pay attention to.
Typical problematic spots in the Java API are:
new String(byte[])
String.getBytes()
FileReader, FileWriter
All of these implicitly use the platform default encoding, which depends on the OS and the user's locale settings. Usually, it's a good idea to avoid this and explicitly declare an encoding in the above cases (which FileReader/Writer unfortunately don't allow, so you have to use an InputStreamReader/Writer).
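For instance, the same operations with the charset spelled out explicitly might look like this (the file name is just a placeholder):

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsets {
    public static void main(String[] args) throws Exception {
        // Instead of String.getBytes() / new String(byte[]): name the charset.
        byte[] bytes = "Zürich".getBytes(StandardCharsets.UTF_8);
        String text = new String(bytes, StandardCharsets.UTF_8);

        // Instead of FileWriter: wrap the stream yourself and pick the charset.
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("report.html"), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }
}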
However, your problems with the quotation marks and your use of a template engine may have a much simpler explanation. What program are you using to write your templates? It sounds like it's one that inserts "smart quotes", which are part of the Windows-specific windows-1252 encoding but don't exist in the more global ISO-8859-1 encoding.
What you probably need to do is be aware which encoding your templates are saved in, and configure your template engine to use that encoding when reading in the templates. Also be aware that some text files, specifically XML, explicitly declare the encoding in a header, and if that header disagrees with the actual encoding used by the file, you'll invariably run into problems.
You can control which encoding your JVM will run with by supplying, for example,
-Dfile.encoding=utf-8
(for UTF-8, of course) as an argument to the JVM. Then you should get predictable results on all platforms. Example:
java -Dfile.encoding=utf-8 my.MainClass
Running the JVM with a 'standard' encoding via the confusingly named -Dfile.encoding will resolve a lot of problems.
Ensuring your app doesn't make use of byte[] <-> String conversions without an encoding specified is also important, since sometimes you can't enforce the VM encoding (e.g. if you have an app server used by multiple applications).
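If you're unsure what default a particular server actually ended up with, a tiny check along these lines (hypothetical class name) will print it:

import java.nio.charset.Charset;

public class ShowDefaultCharset {
    public static void main(String[] args) {
        // Quick way to see what the VM actually picked up on a given machine.
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset() = " + Charset.defaultCharset());
    }
}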
If you're confused by the whole encoding issue, or want to revise your knowledge, Joel Spolsky wrote a great article on this.
I had to make sure that the OutputStreamWriter uses the correct encoding
OutputStream out = ...
// The writer, not FreeMarker, decides how characters become bytes, so fix the charset here.
OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
template.process(model, writer);
Plus if you use a ByteArrayOutputStream also make sure to call toString with the correct encoding:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
...
baos.toString("UTF-8");