Java remove diacritic

I am trying to write a function that removes diacritics (I deliberately don't want to use Normalizer). The function looks like this:
private static String normalizeCharacter(Character curr) {
String sdiac = "áäčďéěíĺľňóôőöŕšťúůűüýřžÁÄČĎÉĚÍĹĽŇÓÔŐÖŔŠŤÚŮŰÜÝŘŽ";
String bdiac = "aacdeeillnoooorstuuuuyrzAACDEEILLNOOOORSTUUUUYRZ";
char[] s = sdiac.toCharArray();
char[] b = bdiac.toCharArray();
String ret;
for(int i = 0; i < sdiac.length(); i++){
if(curr == s[i])
curr = b[i];
}
ret = curr.toString().toLowerCase();
ret = ret.replace("\n", "").replace("\r","");
return ret;
}
The function is called like this (every character from the file is passed to it):
private static String readFile(String fName) {
File f = new File(fName);
StringBuilder sb = new StringBuilder();
try{
FileInputStream fStream = new FileInputStream(f);
Character curr;
while(fStream.available() > 0){
curr = (char) fStream.read();
sb.append(normalizeCharacter(curr));
System.out.print(normalizeCharacter(curr));
}
}catch(IOException e){
e.printStackTrace();
}
return sb.toString();
}
The file text.txt contains this: ľščťžýáíéúäôň and I expect lcstzyaieuaon in return from the program, but instead of the expected string I get this: ¾è yaieuaoò. I know that the problem is somewhere in the encoding, but I don't know where. Any ideas?

You are trying to convert bytes into characters.
However, the character ľ is not represented as a single byte. Its Unicode code point is U+013E, and its UTF-8 representation is C4 BE. Thus, it is represented by two bytes. The same is true for the other characters.
Suppose the encoding of your file is UTF-8. Then you read the byte value C4 and convert it to a char. This gives you the character U+00C4 (Ä), not U+013E. Then you read the BE, and it is converted to the character U+00BE (¾).
So don't confuse bytes and characters. Instead of using the InputStream directly, you should wrap it with a Reader. A Reader is able to read characters based on the encoding it is created with:
BufferedReader reader = new BufferedReader(
new InputStreamReader(
new FileInputStream(f), StandardCharsets.UTF_8
)
);
Now you'll be able to read characters or even whole lines, and the Reader takes care of decoding the bytes for you.
int readVal;
while ( ( readVal = reader.read() ) != -1 ) {
curr = (char)readVal;
// ... the rest of your code
}
Remember that you are still reading an int if you are going to use read() without parameters.
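To make the difference concrete, here is a minimal sketch that decodes the same two bytes (C4 BE) both ways. The class and method names are made up for illustration, and it assumes the data really is UTF-8, as supposed above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
final class DecodeDemo {

    // Correct: let a Reader decode the bytes as UTF-8.
    static String decodeWithReader(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int readVal;
            while ((readVal = reader.read()) != -1) {
                sb.append((char) readVal);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Reproduces the bug: every single byte is cast to a char.
    static String castBytesToChars(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (InputStream in = new ByteArrayInputStream(data)) {
            int b;
            while ((b = in.read()) != -1) {
                sb.append((char) b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC4, (byte) 0xBE}; // UTF-8 bytes of U+013E
        System.out.println(decodeWithReader(utf8).length());  // one character
        System.out.println(castBytesToChars(utf8).length());  // two characters
    }
}
```

With the Reader, the two bytes become the single character U+013E; with the cast, they become U+00C4 followed by U+00BE, which is the kind of garbage seen in the question.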

Related

Reading a String that has n length from InputStream or Reader

I know that I can do it the way shown below, but is there a shorter way? For example, why is there no method with a prototype like public String readString(int len); in the Reader class hierarchy that would do this in a single call?
InputStream in = new FileInputStream("abc.txt");
InputStreamReader inReader = new InputStreamReader(in);
char[] foo = new char[5];
inReader.read(foo);
System.out.println(new String(foo));
// I think this way is too long
// for reading a string that has only 5 character
// from InputStream or Reader
In Python 3, I can do this very easily for UTF-8 and other files. Consider the following code.
fl = open("abc.txt", mode="r", encoding="utf-8")
fl.read(1) # returns string that has 1 character
fl.read(3) # returns string that has 3 character
How can I do it in Java?
Thanks.

How can I do it in Java ?
The way you're already doing it.
I'd recommend doing it in a reusable helper method, e.g.
final class IOUtil {
public static String read(Reader in, int len) throws IOException {
char[] buf = new char[len];
int charsRead = in.read(buf);
return (charsRead == -1 ? null : new String(buf, 0, charsRead));
}
}
Then use it like this:
try (Reader in = Files.newBufferedReader(Paths.get("abc.txt"), StandardCharsets.UTF_8)) {
System.out.println(IOUtil.read(in, 5));
}

If you want to make a best effort to read as many as the specified number of characters, you may use
int len = 4;
String result;
try(Reader r = new FileReader("abc.txt")) {
CharBuffer b = CharBuffer.allocate(len);
do {} while(b.hasRemaining() && r.read(b) > 0);
result = b.flip().toString();
}
System.out.println(result);
While the Reader may read fewer characters than requested (depending on the underlying stream), it will read at least one character before returning, or return -1 to signal the end of the stream. So the code above loops until it has either read the requested number of characters or reached the end of the stream.
That said, a FileReader will usually read all requested characters in one go and return fewer only when it reaches the end of the file.
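The same best-effort loop can also be sketched with a plain char[] instead of a CharBuffer. The helper name readUpTo is hypothetical; it only relies on Reader.read(char[], int, int), and it wraps IOException in an UncheckedIOException purely to keep the example short.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class for illustration.
final class ReadN {

    // Reads up to len characters, looping until the buffer is full
    // or the end of the stream is reached.
    static String readUpTo(Reader in, int len) {
        char[] buf = new char[len];
        int total = 0;
        try {
            while (total < len) {
                int n = in.read(buf, total, len - total);
                if (n == -1) {
                    break; // end of stream
                }
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buf, 0, total);
    }

    public static void main(String[] args) {
        System.out.println(readUpTo(new StringReader("abcdefgh"), 5));
        System.out.println(readUpTo(new StringReader("ab"), 5));
    }
}
```

Unlike a single read call, this returns a shorter string only when the stream really has fewer characters left.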

How to convert escape-decimal text back to unicode in Java

A third-party library in our stack is munging strings containing emoji etc like so:
"Ben \240\159\144\144\240\159\142\169"
That is, decimal bytes, not hexadecimal shorts.
Surely there is an existing routine to turn this back into a proper Unicode string, but all the discussion I've found about this expects the format \u12AF, not \123.

I am not aware of any existing routine, but something simple like this should do the job (assuming the input is available as a string):
public static String unEscapeDecimal(String s) {
try {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(baos, "utf-8");
int pos = 0;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == '\\') {
writer.flush();
baos.write(Integer.parseInt(s.substring(i+1, i+4)));
i += 3;
} else {
writer.write(c);
}
}
writer.flush();
return new String(baos.toByteArray(), "utf-8");
} catch (IOException e) {
throw new RuntimeException(e);
}
}
The writer is only used to make sure that existing characters in the string with code points > 127 are encoded correctly, should they occur unescaped. If all non-ASCII characters are escaped, the byte array output stream alone should be sufficient.
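For the simpler case mentioned last, a sketch without the writer might look like this. It assumes that every non-ASCII character is escaped and that each escape has exactly three decimal digits, as in the sample input; the class and method names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical variant of unEscapeDecimal for input that is pure ASCII
// apart from the \ddd escapes.
final class DecimalUnescape {

    static String unescapeAscii(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                // Assumption: the backslash is followed by exactly three digits.
                bytes.write(Integer.parseInt(s.substring(i + 1, i + 4)));
                i += 3;
            } else {
                // Assumption: c is ASCII, so it fits in a single byte.
                bytes.write(c);
            }
        }
        // Interpret the collected bytes as UTF-8.
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = unescapeAscii("Ben \\240\\159\\144\\144");
        // The four bytes F0 9F 90 90 form the single code point U+1F410.
        System.out.println(Integer.toHexString(decoded.codePointAt(4)));
    }
}
```

The bytes 240 159 144 144 are F0 9F 90 90 in hex, which is the UTF-8 encoding of U+1F410.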

Java copy entire file without the double quotes

I have a method to copy the entire file from one destination to another destination using buffer:
InputStream in = new FileInputStream(src);
OutputStream out = new FileOutputStream(dest);
byte[] buf = new byte[1024];
int len;
while ((len = in.read(buf)) > 0) {
out.write(buf, 0, len);
}
in.close();
out.close();
The file is in csv format:
"2280B_TJ1400_001","TJ1400_Type-7SR","192.168.50.76","Aries SDH","6.0","192.168.0.254",24,"2280B Cyberjaya","Mahadzir Ibrahim"
But as you can see, it has quotes in it. Is it possible to remove them based on my existing code?
Output should be like this:
2280B_TJ1400_001,TJ1400_Type-7SR,192.168.50.76,Aries SDH,6.0,192.168.0.254,24,2280B Cyberjaya,Mahadzir Ibrahim

If you use a BufferedReader you can use the readLine() function to read the contents of the file as a String. Then you can use the normal functions on String to manipulate it before writing it to the output. By using an OutputStreamWriter you can write the Strings directly.
An advantage of the above is that you never have to bother with the raw bytes, this makes your code easier to read and less prone to mistakes in special cases.
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(src)));
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(dest));
String line;
while ((line = in.readLine()) != null) {
    // replace() takes a literal string, so no regex is needed here
    String stringOut = line.replace("\"", "");
    out.write(stringOut);
    // readLine() strips the line terminator, so write one back
    out.write(System.lineSeparator());
}
in.close();
out.close();
Note that this removes all " characters, not just the ones at the start and end of each field. To remove only the enclosing quotes, you can use a StringTokenizer or a more complex replace.
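One possible shape for such a "more complex replace" is a regular expression that only matches quotes wrapping a whole field. This is a sketch, not a CSV parser: the names are hypothetical, and it assumes fields contain no escaped quotes. A quoted field that contains a comma would also lose its quotes and so change the column count.

```java
// Hypothetical helper for illustration.
final class CsvQuotes {

    // Removes the quotes around fields such as "abc", leaving unquoted
    // fields (for example numbers) untouched.
    static String stripFieldQuotes(String line) {
        // (^|,)      start of line or the separating comma (kept as $1)
        // "([^"]*)"  a quoted field without inner quotes (content kept as $2)
        // (?=,|$)    must be followed by a comma or the end of the line
        return line.replaceAll("(^|,)\"([^\"]*)\"(?=,|$)", "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(stripFieldQuotes("\"Aries SDH\",\"6.0\",24,\"2280B Cyberjaya\""));
    }
}
```

In the loop above, this would be called on each line instead of the plain replace.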

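I'm not sure whether it's a good idea, but you could do something like this:
while ((len = in.read(buf)) > 0) {
    // Decode only the bytes actually read, not the whole buffer.
    // Caveat: a multi-byte character split across two reads would be decoded incorrectly.
    String temp = new String(buf, 0, len);
    temp = temp.replace("\"", "");
    // Use a separate array so that buf keeps its size for the next read
    byte[] outBytes = temp.getBytes();
    out.write(outBytes, 0, outBytes.length);
}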

Personally, I would first read the whole file into a String, strip out the " characters, and then write the result to the destination file.
Read the file into a String
I found this simple solution. It may not be the best, depending on how much error handling you need, but it works well enough ;)
String content = new Scanner(new File("filename")).useDelimiter("\\Z").next();
Strip out the " characters
content = content.replace("\"", "");
Write it to the destination file
Files.write(Paths.get("./duke.txt"), content.getBytes());
This is for Java 7+.
I did not test it, but it should work!

Not necessarily good style, filtering quotes in binary data, but very solid.
Wrap the original InputStream with your own InputStream, filtering out the double quote.
I have added a quirk: in MS Excel a quoted field may contain a quote, which then is self-escaped, represented as two double quotes.
InputStream in = new UnquotingInputStream(new FileInputStream(src));
/**
* Removes ASCII double quotes from an InputStream.
* Two consecutive quotes stand for one quote: self-escaping as used
* by MS Excel.
*/
public class UnquotingInputStream extends InputStream {
private final InputStream in;
private boolean justHadAQuote;
public UnquotingInputStream(InputStream in) {
this.in = in;
}
@Override
public int read() throws IOException {
int c = in.read();
if (c == '\"') {
if (!justHadAQuote) {
justHadAQuote = true;
return read(); // Skip quote
}
}
justHadAQuote = false;
return c;
}
}
Works for all encodings that use ASCII as subset. So not: UTF-16 or EBCDIC.

How to Access string in file by position in Java

I have a text file with the following contents:
one
two
three
four
I want to access the string "three" by its position in the text file in Java. I found the substring concept on Google but was unable to use it.
So far I am able to read the file contents:
import java.io.*;
class FileRead
{
public static void main(String args[])
{
try{
// Open the file that is the first
// command line parameter
FileInputStream fstream = new FileInputStream("textfile.txt");
// Get the object of DataInputStream
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
//Read File Line By Line
while ((strLine = br.readLine()) != null) {
// Print the content on the console
System.out.println (strLine);
}
//Close the input stream
in.close();
}catch (Exception e){//Catch exception if any
System.err.println("Error: " + e.getMessage());
}
}
}
I want to apply the substring concept to the file: it should ask for the position and display the string.
String Str = new String("Welcome to Tutorialspoint.com");
System.out.println(Str.substring(10, 15) );

If you know the byte offsets within the file that you are interested in then it's straightforward:
RandomAccessFile raFile = new RandomAccessFile("textfile.txt", "r");
raFile.seek(startOffset);
byte[] bytes = new byte[length];
raFile.readFully(bytes);
raFile.close();
String str = new String(bytes, "Windows-1252"); // or whatever encoding
But for this to work you have to use byte offsets, not character offsets. If the file is encoded in a variable-width encoding such as UTF-8, there is no way to seek directly to the nth character; you have to start at the top of the file and read and discard the first n-1 characters.

Look for \r\n (line breaks) in your text file. That way you can count the lines until you reach the one containing your string.
In reality, your file looks like this:
one\r\n
two\r\n
three\r\n
four\r\n
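A minimal sketch of this line-counting idea, assuming "position" means a zero-based line index. The helper name lineAt is hypothetical, and BufferedReader.readLine() does the work of recognising \r\n (or \n) for us.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration.
final class LineAccess {

    // Returns the line at the given zero-based index, or null if the
    // input has fewer lines.
    static String lineAt(Reader source, int index) {
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            int current = 0;
            while ((line = br.readLine()) != null) {
                if (current == index) {
                    return line;
                }
                current++;
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Reader r = new StringReader("one\r\ntwo\r\nthree\r\nfour\r\n");
        System.out.println(lineAt(r, 2));
    }
}
```

For a real file, a FileReader or Files.newBufferedReader could be passed in place of the StringReader.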

You seem to be looking for this. The code I posted there works on the byte level, so it may not work for you. Another option is to use the BufferedReader and just read a single character in a loop like this:
String getString(String fileName, int start, int end) throws IOException {
    int len = end - start;
    if (len <= 0) {
        throw new IllegalArgumentException("Length of string to output is zero or negative.");
    }
    char[] buffer = new char[len];
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
        for (int i = 0; i < start; i++) {
            reader.read(); // Skip the characters before the start position
        }
        int read = reader.read(buffer, 0, len);
        // read is -1 at end of file; otherwise keep only the characters actually read
        return read <= 0 ? "" : new String(buffer, 0, read);
    }
}

println(char), characters turn into Chinese?

Please help me to troubleshoot this problem.
I have an input file 'Trial.txt' with content "Thanh Le".
Here is the function I used in an attempt to read from the file:
public char[] importSeq(){
File file = new File("G:\\trial.txt");
char temp_seq[] = new char[100];
try{
FileInputStream fis = new FileInputStream(file);
BufferedInputStream bis = new BufferedInputStream(fis);
DataInputStream dis = new DataInputStream(bis);
int i = 0;
//Try to read all character till the end of file
while(dis.available() != 0){
temp_seq[i]=dis.readChar();
i++;
}
System.out.println(" imported");
} catch (FileNotFoundException e){
e.printStackTrace();
} catch (IOException e){
e.printStackTrace();
}
return temp_seq;
}
And the main function:
public static void main(String[] args) {
Sequence s1 = new Sequence();
char result[];
result = s1.importSeq();
int i = 0;
while(result[i] != 0){
System.out.println(result[i]);
i++;
}
}
And this is the output.
run:
imported
瑨
慮
栠
汥
BUILD SUCCESSFUL (total time: 0 seconds)

Honestly, that's a pretty clumsy way to read a text file into a char[].
Here's a better example, assuming that the text file contains only ASCII characters.
File file = new File("G:/trial.txt");
char[] content = new char[(int) file.length()];
Reader reader = null;
try {
reader = new FileReader(file);
reader.read(content);
} finally {
if (reader != null) try { reader.close(); } catch (IOException ignore) {}
}
return content;
And then to print the char[], just do:
System.out.println(content);
Note that InputStream#available() doesn't necessarily do what you're expecting.
See also:
Java IO tutorial

In Java, a char consists of 2 bytes, so when you use readChar, it reads the bytes of two letters at a time and combines each pair into a single Unicode character.
You can avoid this by using readByte() instead.
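A small sketch of the difference, using an in-memory byte array instead of the file. The class and method names are made up, and the input is assumed to be plain ASCII.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
final class ReadCharDemo {

    // readChar() consumes two bytes and combines them into one char.
    static char firstViaReadChar(String asciiText) {
        byte[] data = asciiText.getBytes(StandardCharsets.US_ASCII);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readChar();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // readByte() consumes one byte, which for ASCII is one character.
    static char firstViaReadByte(String asciiText) {
        byte[] data = asciiText.getBytes(StandardCharsets.US_ASCII);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return (char) (in.readByte() & 0xFF);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println((int) firstViaReadChar("th")); // 't' and 'h' merged into one char
        System.out.println(firstViaReadByte("th"));       // just 't'
    }
}
```

Note that readByte() only helps for single-byte encodings such as ASCII; for anything else, a Reader with the right charset is the safer choice.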

Here is some code to demonstrate what exactly is happening. A char in Java is two bytes wide and holds one UTF-16 code unit, which for most characters is the whole character you see on the screen. Java uses UTF-16 for char and String values in memory, and readChar() reads two bytes and combines them into one such char. Your file uses one byte per character, probably ASCII. So each readChar() call consumes two bytes, and therefore two ASCII characters, from your file.
The following code tries to explain how the single ASCII bytes 't' and 'h' become one Chinese UTF-16 character.
public class Main {
public static void main(String[] args) {
System.out.println((int)'t'); // 116 == x74 (116 is 74 in Hex)
System.out.println((int)'h'); // 104 == x68
System.out.println((int)'瑨'); // 29800 == x7468
// System.out.println('\u0074'); // t
// System.out.println('\u0068'); // h
// System.out.println('\u7468'); // 瑨
char th = (('t' << 8) + 'h'); //x74 x68
System.out.println(th); //瑨 == 29800 == '\u7468'
}
}
