I have a file (prueba.txt) and I would like to replace the characters 0xE1 (á) with 0x14, 0xE9 (é) with 0x15, 0xF3 (ó) with 0x16, and so on. On a String this is possible with String.replace(), but here I have a char.
import java.io.File;
import java.util.Scanner;

public class Reemplazar {
    public static void main(String[] args) throws Exception {
        Scanner archivo = new Scanner(new File("prueba.txt"));
        while (archivo.hasNextLine()) {
            String frase = archivo.nextLine();
            for (int i = 0; i < frase.length(); i++) {
                char current = frase.charAt(i);
                if (current == 0xe1) {
                    System.out.println("contiene la á: '" + frase + "'");
                }
                if (current == 0xe9) {
                    System.out.println("contiene la é: '" + frase + "'");
                }
            }
        }
    }
}
I guess this code could be much improved, but ...
Greetings.
First read the text file, then replace the characters.
Reading
A text file has some specific character set and encoding. You must either know exactly what it is, or know that it definitely is the system default ANSI character set and encoding. ANSI is not one specific encoding.
But, since you said ANSI, you probably meant the system default. The Scanner constructor you used is for Java's default. You can reasonably assume that Java's default correctly matches the system default.
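If you do know the file's encoding, you can name it explicitly instead of relying on the default. A minimal sketch, assuming windows-1252 (a common "ANSI" encoding); the byte values are illustrative, and the same constructor overload works with new File("prueba.txt"):

```java
import java.io.ByteArrayInputStream;
import java.util.Scanner;

public class CharsetDemo {
    public static void main(String[] args) {
        // In windows-1252, the byte 0xE1 is 'á' and 0xE9 is 'é'.
        byte[] datos = { (byte) 0xE1, (byte) 0xE9 };
        // Pass the charset name explicitly instead of using the platform default.
        Scanner archivo = new Scanner(new ByteArrayInputStream(datos), "windows-1252");
        String frase = archivo.nextLine();
        System.out.println(frase.equals("\u00E1\u00E9")); // true
        archivo.close();
    }
}
```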
Replacing characters
All "characters" in Java's String, char and Character datatypes and in an analyzed Java source file are UTF-16 code units, one or two of which encode a Unicode codepoint. Unescaped literal strings and characters are going to be in the encoding of the source file. (Of course, that should be UTF-8.) Regardless, if you type it, see it, save it and compile it with the same encoding, the characters will be what you think they are.
So, once you have text in a string, you can replace, replace, replace, like this:
frase
.replace('á', '►')
.replace('é', '☼')
.replace('ñ', '◄')
or
frase
    .replace('\u00E1', '\u25BA')
…
BTW: in the one encoding for the OEM437 character set, the glyphs ►, ☼ and ◄ are at 0x10, 0x0F and 0x11; the bytes 0x14, 0x15 and 0x16 render there as ¶, § and ▬.
If you'd rather iterate through the elements of the String, you could do it by each UTF-16 code unit, such as using charAt. That would work best if all your text was characters that UTF-16 encodes with just one code unit. Given that your file encoding is one of the ANSI character sets for a European language, that is likely the case. Or, you can iterate with a codepoint-aware technique as seen in the Java documentation on CharSequence.
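As a sketch, a codepoint-aware loop over the question's replacements might look like this (using String.codePoints from Java 8; the target bytes 0x14-0x16 come from the question):

```java
public class ReemplazarCodepoints {
    public static void main(String[] args) {
        String frase = "\u00E1 y \u00E9"; // "á y é"
        StringBuilder sb = new StringBuilder();
        // Walk the string codepoint by codepoint rather than code unit by code unit.
        frase.codePoints().forEach(cp -> {
            switch (cp) {
                case 0xE1: sb.appendCodePoint(0x14); break; // á -> 0x14
                case 0xE9: sb.appendCodePoint(0x15); break; // é -> 0x15
                case 0xF3: sb.appendCodePoint(0x16); break; // ó -> 0x16
                default:   sb.appendCodePoint(cp);
            }
        });
        System.out.println(sb.toString().indexOf('\u0014')); // 0: the á was replaced
    }
}
```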
It is even better that it is char, because you can do something like this:
yourStringToReplace.replace((char) 0xe1, (char) 0x14);
char is an integer type that is treated like a character instead of a number (simply speaking).
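A minimal illustration of that integer nature:

```java
public class CharEsEntero {
    public static void main(String[] args) {
        char a = 'a';
        System.out.println((int) a);        // 97
        System.out.println(a == 0x61);      // true: char compares as an integer
        System.out.println((char) (a + 1)); // b: arithmetic works on chars
    }
}
```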
This replaces the characters and creates a new file "nueva_prueba.txt" with the changed text:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

public class Reemplazar {
    public static void main(String[] args) throws IOException {
        File f = new File("nueva_prueba.txt");
        f.createNewFile();
        BufferedWriter out = new BufferedWriter(new FileWriter(f));
        Scanner archivo = new Scanner(new File("prueba.txt"));
        while (archivo.hasNextLine()) {
            String frase = archivo.nextLine();
            for (int i = 0; i < frase.length(); i++) {
                char current = frase.charAt(i);
                switch (current) {
                    case 0xe1:
                        System.out.println("contiene la á: '" + frase + "'");
                        frase = frase.replace((char) 0xe1, (char) 0x14);
                        System.out.println("nova frase: " + frase);
                        break;
                    case 0xe9:
                        System.out.println("contiene la é: '" + frase + "'");
                        frase = frase.replace((char) 0xe9, (char) 0x15);
                        System.out.println("nova frase: " + frase);
                        break;
                    case 0xf3:
                        System.out.println("contiene la ó: '" + frase + "'");
                        frase = frase.replace((char) 0xf3, (char) 0x16);
                        System.out.println("nova frase: " + frase);
                        break;
                    // ... others
                    default:
                        break;
                }
            }
            try {
                out.write(frase);
                out.newLine();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        archivo.close();
        out.close();
    }
}
Hope this helps!
I have a properties file that maps German characters to their hex values (e.g. 00E4). I had to encode this file with "iso-8859-1", as it was the only way to get the German characters to display. What I'm trying to do is go through German words, check whether these characters appear anywhere in the string, and if they do, replace that value with the hex format. For instance, replace the German character with \u00E4.
The code replaces the character fine, but instead of one backslash I'm getting two, like so: \\u00E4. You can see in the code that I'm using "\\u" to try to print \u, but that's not what happens. Any ideas where I'm going wrong?
private void createPropertiesMaps(String result) throws FileNotFoundException, IOException
{
Properties importProps = new Properties();
Properties encodeProps = new Properties();
// This props file contains a map of german strings
importProps.load(new InputStreamReader(new FileInputStream(new File(result)), "iso-8859-1"));
// This props file contains the german character mappings.
encodeProps.load(new InputStreamReader(
new FileInputStream(new File("encoding.properties")),
"iso-8859-1"));
// Loop through the german characters
encodeProps.forEach((k, v) ->
{
importProps.forEach((key, val) ->
{
String str = (String) val;
// Find the index of the character if it exists.
int index = str.indexOf((String) k);
if (index != -1)
{
// create new string, replacing the german character
String newStr = str.substring(0, index) + "\\u" + v + str.substring(index + 1);
// set the new property value
importProps.setProperty((String) key, newStr);
if (hasUpdated == false)
{
hasUpdated = true;
}
}
});
});
if (hasUpdated == true)
{
// Write new file
writeNewPropertiesFile(importProps);
}
}
private void writeNewPropertiesFile(Properties importProps) throws IOException
{
File file = new File("import_test.properties");
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
importProps.store(writer, "Unicode Translations");
writer.close();
}
The point is that you are not writing a simple text file but a Java properties file. In a properties file the backslash character is an escape character, so if your property value contains a backslash, Java is so kind as to escape it for you, which is not what you want in your case.
You might try to circumvent Java's properties-file mechanism by writing a plain text file that can be read back in as a properties file, but that would mean doing manually all the formatting that the Properties class provides automatically.
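A small sketch of that escaping behaviour; the key and value are made up for illustration:

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.Properties;

public class BackslashEscapeDemo {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // The value contains one literal backslash: \u00E4pfel
        props.setProperty("word", "\\u00E4pfel");
        StringWriter out = new StringWriter();
        props.store(out, null);
        // Properties.store doubles the backslash, so the stored line
        // reads: word=\\u00E4pfel
        System.out.println(out.toString().contains("\\\\u00E4pfel")); // true
    }
}
```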
A third-party library in our stack is munging strings containing emoji etc like so:
"Ben \240\159\144\144\240\159\142\169"
That is, decimal bytes, not hexadecimal shorts.
Surely there is an existing routine to turn this back into a proper Unicode string, but all the discussion I've found about this expects the format \u12AF, not \123.
I am not aware of any existing routine, but something simple like this should do the job (assuming the input is available as a string):
public static String unEscapeDecimal(String s) {
try {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(baos, "utf-8");
int pos = 0;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == '\\') {
writer.flush();
baos.write(Integer.parseInt(s.substring(i+1, i+4)));
i += 3;
} else {
writer.write(c);
}
}
writer.flush();
return new String(baos.toByteArray(), "utf-8");
} catch (IOException e) {
throw new RuntimeException(e);
}
}
The writer is just used to make sure existing characters in the string with code points > 127 are encoded correctly, should they occur unescaped. If all non-ascii characters are escaped, the byte array output stream should be sufficient.
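For instance, fed the kind of string from the question, the routine above recovers the emoji (the method is repeated here, condensed, so the demo is self-contained; the codepoint check just makes the result visible without depending on console encoding):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class UnEscapeDemo {
    // Same routine as above, condensed.
    static String unEscapeDecimal(String s) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            Writer writer = new OutputStreamWriter(baos, "utf-8");
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c == '\\') {
                    writer.flush();
                    baos.write(Integer.parseInt(s.substring(i + 1, i + 4)));
                    i += 3;
                } else {
                    writer.write(c);
                }
            }
            writer.flush();
            return new String(baos.toByteArray(), "utf-8");
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // \240\159\144\144 is the UTF-8 byte sequence F0 9F 90 90, i.e. U+1F410.
        String munged = "Ben \\240\\159\\144\\144";
        String fixed = unEscapeDecimal(munged);
        System.out.println(fixed.codePointAt(4) == 0x1F410); // true
    }
}
```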
I want to port some PHP code that uses the ord function. The PHP code follows:
$fh=fopen("../ip.data","r");
fseek($fh,327680);
$country_id=ord(fread($fh,1)); //return 137
I tried to write the equivalent code in Java:
FileReader ip = new FileReader("ip.data");
int i = 0;
int str;
while ((str = ip.read()) != -1) {
    if (i == 327681) {
        System.out.println(str); // return 0
        break;
    }
    i++;
}
But the two are not equal.
I know ord('a')==97 in PHP, (int)'a'==97 in Java.
The ip.data download here.
FileReader assumes the system's default charset for the input file, which might be UTF-8, for example.* In that case up to four bytes would be read to form the single character you get from FileReader.read().
So maybe (just guesswork at this point) that's the problem. Your PHP code doesn't assume any encoding; it just reads an (8-bit) byte, and its equivalent in Java would be, e.g.:
FileInputStream fis = new FileInputStream("ip.data");
fis.skip(327680);            // same offset as fseek in the PHP code
int country_id = fis.read(); // FileInputStream.read() reads a byte, not a character
*) To test that assumption, try:
FileReader fr = new FileReader("ip.data");
System.out.println( fr.getEncoding() );
What does it print?
Try this:
String str = "t";
byte b = str.getBytes()[0];
int ascii = (int) b;
Maybe you have to use char.
I have been unable to find the reason for this. The only problem I am having in this code is that when the FileWriter tries to put the new value into the text file, it instead puts a ?. I have no clue why, or even what it means. Here is the code:
if (secMessage[1].equalsIgnoreCase("add")) {
if (secMessage.length==2) {
try {
String deaths = readFile("C:/Users/Samboni/Documents/Stuff For Streaming/deaths.txt", Charset.defaultCharset());
FileWriter write = new FileWriter("C:/Users/Samboni/Documents/Stuff For Streaming/deaths.txt");
int comb = Integer.parseInt(deaths) + 1;
write.write(comb);
write.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
And here is the readFile method:
static String readFile(String path, Charset encoding) throws IOException {
byte[] encoded = Files.readAllBytes(Paths.get(path));
return new String(encoded, encoding);
}
Also, the secMessage array is an array of strings containing the words of an IRC message split into individual words, that way the program can react to the commands on a word-by-word basis.
You're calling Writer.write(int). That writes a single UTF-16 code unit to the file, taking just the bottom 16 bits of the int. If your platform default encoding isn't able to represent that character, it will write '?' as a replacement character.
I suspect you actually want to write out a text representation of the number, in which case you should use:
write.write(String.valueOf(comb));
In other words, turn the value into a string and then write it out. So if comb is 123, you'll get three characters ('1', '2', '3') written to the file.
Personally I'd avoid FileWriter though - I prefer using OutputStreamWriter wrapping FileOutputStream so you can control the encoding. Or in Java 7, you can use Files.newBufferedWriter to do it more simply.
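A sketch of that Java 7 route, with the charset stated explicitly; the file name is taken from the question but shortened, and the initial write just sets up the demo:

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DeathCounter {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("deaths.txt");
        Files.write(path, "41".getBytes(StandardCharsets.UTF_8)); // seed the counter

        // Read, increment, and write back as text with an explicit encoding.
        String deaths = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        int comb = Integer.parseInt(deaths.trim()) + 1;
        try (Writer write = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            write.write(String.valueOf(comb));
        }
        System.out.println(new String(Files.readAllBytes(path), StandardCharsets.UTF_8)); // 42
    }
}
```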
write.write(Integer.toString(comb));
You can convert the int into a string. Otherwise you would need the int to be a character, which only works for a small subset of numbers, 0-9, so it is not recommended.
I am reading an XML document (UTF-8) and ultimately displaying the content on a Web page using ISO-8859-1. As expected, there are a few characters are not displayed correctly, such as “, – and ’ (they display as ?).
Is it possible to convert these characters from UTF-8 to ISO-8859-1?
Here is a snippet of code I have written to attempt this:
BufferedReader br = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = br.readLine()) != null) {
sb.append(line);
}
br.close();
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
I'm not quite sure what's going awry, but I believe it's readLine() that's causing the grief (since the strings would be Java/UTF-16 encoded?). Another variation I tried was to replace latin1 with
byte[] latin1 = new String(sb.toString().getBytes("UTF-8")).getBytes("ISO-8859-1");
I have read previous posts on the subject and I'm learning as I go. Thanks in advance for your help.
I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.
The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into a HTML page encoded as ISO-8859-1. You can encode them using escape sequences as shown here:
public final class HtmlEncoder {
private HtmlEncoder() {}
public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
T out) throws java.io.IOException {
for (int i = 0; i < sequence.length(); i++) {
char ch = sequence.charAt(i);
if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
out.append(ch);
} else {
int codepoint = Character.codePointAt(sequence, i);
// handle supplementary range chars
i += Character.charCount(codepoint) - 1;
// emit entity
out.append("&#x");
out.append(Integer.toHexString(codepoint));
out.append(";");
}
}
return out;
}
}
Example usage:
String foo = "This is Cyrillic Ya: \u044F\n"
+ "This is fraktur G: \uD835\uDD0A\n" + "This is a smart quote: \u201C";
StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());
Above, the character LEFT DOUBLE QUOTATION MARK (U+201C “) is encoded as &#x201c;. A couple of other arbitrary code points are likewise encoded.
Care needs to be taken with this approach. If your text needs to be escaped for HTML, that needs to be done before the above code or the ampersands end up being escaped.
Depending on your default encoding, the following lines could cause a problem:
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
In Java, String/char is always UTF-16. A particular encoding is only involved when you convert the characters to or from bytes. Say your default encoding is UTF-8: the latin1 buffer is then treated as UTF-8, some Latin-1 sequences may form invalid UTF-8 sequences, and you will get ?.
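A small sketch of how mismatched charsets produce ? and replacement characters; the strings are made up for illustration:

```java
import java.io.UnsupportedEncodingException;

public class MojibakeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "\u201Cquoted\u201D"; // smart quotes around "quoted"
        // U+201C/U+201D are not representable in ISO-8859-1,
        // so getBytes substitutes '?' for them.
        byte[] latin1 = original.getBytes("ISO-8859-1");
        System.out.println(new String(latin1, "ISO-8859-1")); // ?quoted?

        // Bytes that ARE valid Latin-1, decoded with the wrong charset, also break:
        byte[] eAcute = "\u00E9".getBytes("ISO-8859-1"); // the single byte 0xE9
        // 0xE9 is an incomplete UTF-8 sequence, decoded as U+FFFD.
        System.out.println(new String(eAcute, "UTF-8").equals("\uFFFD")); // true
    }
}
```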
With Java 8, McDowell's answer can be simplified like this (while preserving correct handling of surrogate pairs):
public final class HtmlEncoder {
private HtmlEncoder() {
}
public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
T out) throws java.io.IOException {
for (PrimitiveIterator.OfInt iterator = sequence.codePoints().iterator(); iterator.hasNext(); ) {
int codePoint = iterator.nextInt();
if (Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.BASIC_LATIN) {
out.append((char) codePoint);
} else {
out.append("&#x");
out.append(Integer.toHexString(codePoint));
out.append(";");
}
}
return out;
}
}
When you instantiate your String object, you need to indicate which encoding to use.
So replace:
return new String(latin1);
by
return new String(latin1, "ISO-8859-1");