Special characters coming through as ? in SMPP and Java - java

I've spent a crazy amount of time trying to get special characters to come through properly in our application. Our provider told us to use "GSM0338, also known as ISO-8859". To me, this means ISO-8859-1, since we want Spanish characters.
The flow: (Telling you everything, since I've been playing around with this for a while.)
I used Notepad++ to create the message files in UTF-8 encoding (there was no option to save as ISO-8859-1).
I then sent each file through a quick Java program which converts and writes new files:
String text = readTheFile(....);
byte[] output = text.getBytes("ISO-8859-1");
FileOutputStream fos = new FileOutputStream(filesPathWithoutName + "\\converted\\" + filename);
fos.write(output);
fos.close();
An SMPP test class in another project reads these files:
private static String readMessageFile(final String filenameOfFirstMessage) throws IOException {
    BufferedReader br = new BufferedReader(new FileReader(filenameOfFirstMessage));
    String message;
    try {
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();
        while (line != null) {
            sb.append(line);
            sb.append("\n");
            line = br.readLine();
        }
        message = sb.toString();
    } finally {
        br.close();
    }
    return message;
}
The test class then calls send:
public void send(final String message, final String targetPhone) throws MessageException {
    SmppMessage smppMessage = toSmppMessage(message, targetPhone);
    smppSmsService.sendMessage(smppMessage);
}

private SmppMessage toSmppMessage(final String message, final String targetPhone) {
    SmppMessage smppMessage = new SmppMessage();
    smppMessage.setMessage(message);
    smppMessage.setRecipientAddress(toGsmAddress(targetPhone));
    smppMessage.setSenderAddress(getSenderGsmAddress());
    smppMessage.setMessageType(SmppMessage.MSG_TYPE_DATA);
    smppMessage.setMessageMode(SmppMessage.MSG_MODE_SAF);
    smppMessage.requestStatusReport(true);
    return smppMessage;
}
Problem:
SMSs containing letters ñ í ó are delivered, but with these letters displaying as question marks.
Configuration:
smpp.smsc.charset=ISO-8859-1
smpp.data.coding=0x03
Absolutely any help with this would be GREATLY appreciated. Thank you so much for reading.

Well, your provider is wrong. GSM 03.38 is not ISO-8859-1. They are the same up through "Z" (0x5A), but after that they diverge. For instance, in GSM 03.38, ñ is 0x7D, while in ISO-8859-1, it is 0xF1. Since GSM 03.38 is a 7-bit code, anything above 0x7F is going to come out as a "?". Anything after 0x5A is going to come out as something unexpected.
Since Java doesn't usually come with GSM 03.38 support, you're going to have to do the conversion by hand. It shouldn't be too difficult, and the following piece of software might already do most of what you need:
Java GSM 03.38 SMS Character Set Translator
You might also find this translation table between GSM 03.38 and Unicode useful.
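To illustrate the divergence, here is a minimal hand-encoding sketch for the GSM 03.38 default alphabet. The lookup covers only a handful of characters (values taken from the kind of translation table linked above); a real encoder needs the full 128-entry table plus the 0x1B escape extension:

```java
import java.util.HashMap;
import java.util.Map;

public class Gsm0338Sketch {

    // Partial Unicode -> GSM 03.38 default-alphabet lookup.
    // A real table has all 128 septets plus the escape-extension table.
    private static final Map<Character, Byte> GSM_MAP = new HashMap<>();
    static {
        GSM_MAP.put('\u00F1', (byte) 0x7D); // ñ
        GSM_MAP.put('\u00D1', (byte) 0x5D); // Ñ
        GSM_MAP.put('\u00E9', (byte) 0x05); // é
        GSM_MAP.put('\u00BF', (byte) 0x60); // ¿
        GSM_MAP.put('\u00A1', (byte) 0x40); // ¡
    }

    public static byte[] encode(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Byte mapped = GSM_MAP.get(c);
            if (mapped != null) {
                out[i] = mapped;
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || c == ' ') {
                out[i] = (byte) c; // letters, digits and space match ASCII
            } else {
                out[i] = 0x3F; // '?' for anything this sketch doesn't cover
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] gsm = encode("ma\u00F1ana"); // "mañana"
        System.out.printf("0x%02X%n", gsm[2]); // ñ comes out as 0x7D, not Latin-1's 0xF1
    }
}
```

Sending these bytes with the matching data_coding value is then the SMSC's concern; the point is only that the byte for ñ differs from the ISO-8859-1 value.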

Related

characters not appearing when I print when I import a file?

I'm importing a file into my code and trying to print it. The file contains:
i don't like cake.
pizza is good.
i don’t like "cookies" to.
17.
29.
The second "don't" has a right single quotation mark, and when I print it the output is
don�t
The question mark prints as a blank square. Is there a way to convert it to a regular apostrophe?
EDIT:
import java.io.*;
import java.util.ArrayList;

public class Somethingsomething {
    public static void main(String[] args) throws FileNotFoundException, IOException {
        ArrayList<String> list = new ArrayList<String>();
        File file = new File("D:\\project1Test.txt");
        if (file.exists()) { // checks if the file exists
            FileInputStream fileStream = new FileInputStream(file);
            InputStreamReader input = new InputStreamReader(fileStream);
            BufferedReader reader = new BufferedReader(input);
            String line;
            while ((line = reader.readLine()) != null) {
                list.add(line);
            }
            for (int i = 0; i < list.size(); i++) {
                System.out.println(list.get(i));
            }
        }
    }
}
It should print as normal, but the second "don't" shows a white block in place of the apostrophe.
this is the file I'm using https://www.mediafire.com/file/8rk7nwilpj7rn7s/project1Test.txt
Edit: if it helps, the full document where the character is found is here:
https://www.nytimes.com/2018/03/25/business/economy/labor-professionals.html
It's all about character encoding. The way characters are represented isn't always the same, and they tend to get misinterpreted.
Characters are stored as numbers that depend on the encoding standard (and there are many of them). For example, in ASCII "a" is 97 (hex 0x61); UTF-8 encodes it with that same byte, but the two schemes diverge outside the ASCII range, and other encodings disagree even more.
When you see odd characters such as � (the replacement character, sometimes shown as a question mark or a blank box), it usually means text in one encoding is being interpreted as another, and the replacement character stands in for the bytes that couldn't be decoded.
To fix your problem you need to tell your reader to read your file using a specific character encoding, say SOME-CHARSET.
Replace this:
InputStreamReader input = new InputStreamReader(fileStream);
with this:
InputStreamReader input = new InputStreamReader(fileStream, "SOME-CHARSET");
A list of charsets is available here. Unfortunately, you might want to go through them one by one. A short list of most common ones could be found here.
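Rather than trying charsets blindly, you can decode the same raw bytes with each candidate and compare the results by eye; a small sketch (the sample byte 0xE0 is chosen purely for illustration, being 'à' in both ISO-8859-1 and windows-1252):

```java
import java.nio.charset.Charset;

public class CharsetGuess {
    public static void main(String[] args) {
        // 0xE0 is 'à' in ISO-8859-1 and windows-1252,
        // but an incomplete multi-byte sequence in UTF-8.
        byte[] data = { (byte) 0xE0 };
        for (String name : new String[] { "UTF-8", "ISO-8859-1", "windows-1252" }) {
            System.out.println(name + " -> " + new String(data, Charset.forName(name)));
        }
    }
}
```

The UTF-8 line prints the replacement character, which is exactly the symptom described in the question.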
Your problem is almost certainly the encoding scheme you are using. You can read a file in almost any encoding scheme you want; just tell Java how the input was encoded. UTF-8 is common on Linux. The Windows Western default is Cp1252 (Cp1250 is the Central European variant).
This is the sort of problem you have all the time if you are processing files created on a different OS.
I'll give you a different approach...
Use the appropriate means for reading plain text files. Try this:
public static String getTxtContent(String path) {
    try (BufferedReader br = new BufferedReader(new FileReader(path))) {
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();
        while (line != null) {
            sb.append(line);
            sb.append(System.lineSeparator());
            line = br.readLine();
        }
        return sb.toString();
    } catch (IOException fex) {
        return null;
    }
}

use SmbFileInputStream to read data in utf-8 encoding

I have a file which contains the following string:
Vol conforme à la réglementation
However, when I read the file using SmbFileInputStream I get:
Vol conforme � la r�glementation
Could you please let me know the best way to read this file so that I get the string as it appears in the original file? I am converting it to UTF-8, which I am not sure is the correct approach. Here is the current code that I am using:
SmbFileInputStream smbFileInputStream = new SmbFileInputStream(fileURL);
BufferedReader bufferedFileReader = new BufferedReader(new InputStreamReader(smbFileInputStream, "UTF-8"));
String line = null;
StringBuilder stringBuilder = new StringBuilder();
try {
    while ((line = bufferedFileReader.readLine()) != null) {
        if (!line.trim().isEmpty()) {
            stringBuilder.append(line);
        }
    }
    return stringBuilder.toString();
} finally {
    bufferedFileReader.close();
}
Your file is not UTF-8 encoded. Based on the garbled output, it's probably ISO-8859-1, Windows cp1252, or possibly ISO-8859-15.
You should pass one of these encodings instead. It won't be obvious which one is right until your data contains a byte that maps to the wrong character.
The Euro symbol is a good test. It doesn't exist in ISO-8859-1 and is in different map positions in cp1252 and ISO-8859-15.
Notepad++ is an awesome tool for quickly checking files with different decodings.
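The Euro test can also be checked directly in Java (this assumes the JRE ships the windows-1252 and ISO-8859-15 charsets, which standard desktop and server JDKs do):

```java
import java.nio.charset.Charset;

public class EuroProbe {
    public static void main(String[] args) {
        String euro = "\u20AC"; // the Euro sign
        byte[] cp1252 = euro.getBytes(Charset.forName("windows-1252"));
        byte[] latin9 = euro.getBytes(Charset.forName("ISO-8859-15"));
        byte[] latin1 = euro.getBytes(Charset.forName("ISO-8859-1"));
        System.out.println("windows-1252: 0x" + Integer.toHexString(cp1252[0] & 0xFF)); // 80
        System.out.println("ISO-8859-15:  0x" + Integer.toHexString(latin9[0] & 0xFF)); // a4
        // ISO-8859-1 has no Euro sign, so the encoder substitutes '?'
        System.out.println("ISO-8859-1:   " + (char) latin1[0]);
    }
}
```

So if a file containing € decodes cleanly with one of these charsets but not the others, you have found your answer.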

Java misinterpreting apostrophes when parsing input

So I am trying to use the Wikipedia API to read the first paragraph of a given Wikipedia page. Unfortunately, Wikipedia uses a weird system to deal with special characters (http://www.mediawiki.org/wiki/API:Data_formats#JSON_parameters), and I was unable to parse the default response without getting the characters as escape sequences. Obviously the best solution would be to interpret these directly in Java, but I'm not sure there is a way to do that, so I force a UTF-8 response. This approach looks like it should work, but when I pass it through my parsing code, it returns:
Ella Marija Lani Yelich-O'Connor (born 7 November 1996).....named among Time?'?s most influential teenagers in the world, and in the following year, she made her way into Forbes?'?s "30 Under 30" list.
Notice that some apostrophes are kept and some aren't. I think the misinterpreted characters are the result of parsing already-parsed output (I want the plain text, so I strip the HTML tags out). Here is my parsing code; it's a bit messy, but it almost works:
public static String getWikiParagraph(String url) {
    try {
        //System.out.println(url.substring(url.lastIndexOf('/') + 1));
        URL apiURL = new URL("http://www.en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&utf8&exintro=&titles="
                + url.substring(url.lastIndexOf('/') + 1));
        BufferedReader br = new BufferedReader(new InputStreamReader(apiURL.openStream(), Charset.forName("UTF-8")));
        StringBuilder sb = new StringBuilder();
        String read = br.readLine();
        while (read != null) {
            sb.append(read);
            read = br.readLine();
        }
        String s = sb.toString();
        s = Arrays.toString(getTagValues(s).toArray());
        s = s.replace("<i>", "");
        s = s.replace("</i>", "");
        s = s.replace("?'?", "'"); // makes no difference in output
        s = s.replace("u200a", "");
        s = s.replace("<b>", "");
        s = s.replace("</b>", "");
        s = s.replace("\\", "");
        s = s.substring(1, s.length() - 1);
        return s;
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        System.out.println("Error fetching data from url");
    }
    return null;
}

private static List<String> getTagValues(final String str) {
    final Pattern TAG_REGEX = Pattern.compile("<p>(.+?)</p>");
    final List<String> tagValues = new ArrayList<String>();
    final Matcher matcher = TAG_REGEX.matcher(str);
    while (matcher.find()) {
        tagValues.add(matcher.group(1));
    }
    return tagValues;
}
Any help would be greatly appreciated.
Use a JSON parser and run the results you want to cleanup through something like JSoup. Sure, you could write your own brittle HTML parser, but this is a bit of a fool's errand. HTML is subtle, and quick to anger. Spend your time building your logic and let the utility classes do the grungy stuff.
And, yes. The comments are correct. This JSON has Unicode sequences in it, at least when I look at that URL, which will not render correctly in most terminals.
EDIT
The JSON format is (apparently) subject to change. I got cleaner output by specifying "&continue=" in the URL to go back to an older continuation format. You should probably find out what these continuation format changes mean for you.
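If you do keep the hand-rolled approach rather than a JSON parser, the backslash-u escape sequences can at least be decoded with a short stdlib routine. This is only a sketch, not a substitute for real JSON parsing (it ignores the other JSON escapes, and it blindly trusts that every match is a well-formed escape):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonUnescapeSketch {
    // Matches a backslash, 'u', and four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces each escape with the character it names. Surrogate pairs
    // come out as two adjacent chars, which is what Java strings expect.
    public static String unescape(String s) {
        Matcher m = UNICODE_ESCAPE.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The curly apostrophe and quotes come back as real characters.
        System.out.println(unescape("Time\\u2019s \\u201c30 Under 30\\u201d"));
    }
}
```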

Java Loses International Characters in Stream

I am having trouble reading international characters in Java.
The default character set being used is UTF-8 and my Eclipse workspace is also set to this.
I am reading a title of a video from the Internet (Gangam Style in fact ;) ) which contains Korean characters, I am doing this as follows:
BufferedReader stdInput = new BufferedReader(new InputStreamReader(shellCommand.getInputStream()));
String fileName = null, output = null;
while ((output = stdInput.readLine()) != null) {
    if (output.indexOf("Destination") > 0) {
        System.out.println(output);
    }
}
I know that the title it will read is: "PSY - GANGNAM STYLE (강남스타일) M/V", but the console displays the following instead: "PSY - GANGNAM STYLE () M V" which causes errors further along in my program.
It seems like the InputStream Reader isn't reading these characters correctly.
Does anyone have any ideas? I've spent the last hour scouring the Internet and haven't found any answers. Thanks in advance everyone.
The default character set being used is UTF-8
The default where? In Java itself, or in the video? It would be a much clearer if you specified this explicitly. You should check that's correct for the video data too.
It seems like the InputStream Reader isn't reading these characters correctly.
Well, all we know is that the text isn't showing properly on the console. Either it isn't being read correctly, or it's not being displayed correctly. You should print out each character's Unicode value so you can check the exact content of the string. For example:
static void logCharacters(String text) {
    for (int i = 0; i < text.length(); i++) {
        char c = text.charAt(i);
        System.out.println(c + " " + Integer.toHexString(c));
    }
}
You need to check the default charset using Charset.defaultCharset().name(); otherwise, use
InputStreamReader in = new InputStreamReader(shellCommand.getInputStream(), "UTF-8");
I tried a sample program and it prints correctly in Eclipse. It might be a problem with the Windows console, as AlexR has pointed out.
byte[] bytes = "PSY - GANGNAM STYLE (강남스타일) M/V".getBytes();
InputStreamReader reader = new InputStreamReader(new ByteArrayInputStream(bytes));
BufferedReader bufferedReader = new BufferedReader(reader);
String str = bufferedReader.readLine();
System.out.println(str);
Output:
PSY - GANGNAM STYLE (강남스타일) M/V

Converting UTF-8 to ISO-8859-1 in Java

I am reading an XML document (UTF-8) and ultimately displaying the content on a web page using ISO-8859-1. As expected, there are a few characters that are not displayed correctly, such as “, – and ’ (they display as ?).
Is it possible to convert these characters from UTF-8 to ISO-8859-1?
Here is a snippet of code I have written to attempt this:
BufferedReader br = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = br.readLine()) != null) {
    sb.append(line);
}
br.close();
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
I'm not quite sure what's going awry, but I believe it's readLine() that's causing the grief (since the strings would be Java/UTF-16 encoded?). Another variation I tried was to replace latin1 with
byte[] latin1 = new String(sb.toString().getBytes("UTF-8")).getBytes("ISO-8859-1");
I have read previous posts on the subject and I'm learning as I go. Thanks in advance for your help.
I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.
The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into a HTML page encoded as ISO-8859-1. You can encode them using escape sequences as shown here:
public final class HtmlEncoder {
    private HtmlEncoder() {}

    public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
            T out) throws java.io.IOException {
        for (int i = 0; i < sequence.length(); i++) {
            char ch = sequence.charAt(i);
            if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append(ch);
            } else {
                int codepoint = Character.codePointAt(sequence, i);
                // handle supplementary range chars
                i += Character.charCount(codepoint) - 1;
                // emit entity
                out.append("&#x");
                out.append(Integer.toHexString(codepoint));
                out.append(";");
            }
        }
        return out;
    }
}
Example usage:
String foo = "This is Cyrillic Ya: \u044F\n"
+ "This is fraktur G: \uD835\uDD0A\n" + "This is a smart quote: \u201C";
StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());
Above, the character LEFT DOUBLE QUOTATION MARK (U+201C “) is encoded as the numeric entity &#x201C;. A couple of other arbitrary code points are likewise encoded.
Care needs to be taken with this approach. If your text needs to be escaped for HTML, that needs to be done before the above code or the ampersands end up being escaped.
Depending on your default encoding, the following lines could cause the problem:
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
In Java, String/char data is always UTF-16 internally. A specific encoding is only involved when you convert characters to bytes or back. Say your default encoding is UTF-8: the latin1 buffer is then decoded as UTF-8, and some Latin-1 byte sequences form invalid UTF-8 sequences, so you get ?.
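This failure mode is easy to reproduce once the charsets are written out explicitly (StandardCharsets is used here to avoid depending on the platform default):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // "café" -- the é is the single byte 0xE9 in ISO-8859-1
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        // 0xE9 is not a complete sequence in UTF-8, so the decoder
        // substitutes U+FFFD, the replacement character.
        String wrong = new String(latin1, StandardCharsets.UTF_8);
        String right = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(wrong);
        System.out.println(right);
    }
}
```

The first line prints the mangled string; the second round-trips cleanly because the decode charset matches the encode charset.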
With Java 8, McDowell's answer can be simplified like this (while preserving correct handling of surrogate pairs):
public final class HtmlEncoder {
    private HtmlEncoder() {}

    public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
            T out) throws java.io.IOException {
        for (PrimitiveIterator.OfInt iterator = sequence.codePoints().iterator(); iterator.hasNext(); ) {
            int codePoint = iterator.nextInt();
            if (Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append((char) codePoint);
            } else {
                out.append("&#x");
                out.append(Integer.toHexString(codePoint));
                out.append(";");
            }
        }
        return out;
    }
}
When you instantiate your String object, you need to indicate which encoding to use.
So replace:
return new String(latin1);
by
return new String(latin1, "ISO-8859-1");
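To make the point concrete, here is a small sketch showing that the round trip is lossless only for characters that actually exist in ISO-8859-1 (the sample strings are illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        String original = "El ni\u00F1o comi\u00F3"; // "El niño comió" -- all Latin-1
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);
        String decoded = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(decoded.equals(original)); // true

        // Characters outside Latin-1, like the curly apostrophe from the
        // question, are still replaced with '?' during encoding.
        System.out.println((char) "\u2019".getBytes(StandardCharsets.ISO_8859_1)[0]);
    }
}
```

For the curly quotes and dashes in the question, you still need either the entity-escaping approach from the earlier answer or a page served as UTF-8.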
