How to convert escape-decimal text back to Unicode in Java

A third-party library in our stack is munging strings containing emoji etc like so:
"Ben \240\159\144\144\240\159\142\169"
That is, decimal bytes, not hexadecimal shorts.
Surely there is an existing routine to turn this back into a proper Unicode string, but all the discussion I've found about this expects the format \u12AF, not \123.

I am not aware of any existing routine, but something simple like this should do the job (assuming the input is available as a string):
public static String unEscapeDecimal(String s) {
    try {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(baos, "utf-8");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                // Flush pending characters first so the bytes stay in order
                writer.flush();
                // The three digits after the backslash are one byte in decimal
                baos.write(Integer.parseInt(s.substring(i + 1, i + 4)));
                i += 3;
            } else {
                writer.write(c);
            }
        }
        writer.flush();
        return new String(baos.toByteArray(), "utf-8");
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
The writer is only there to make sure that any unescaped characters with code points > 127 are encoded correctly, should they occur. If all non-ASCII characters are escaped, the ByteArrayOutputStream alone is sufficient.
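As a quick check of the byte-only variant described above, here is a minimal sketch that decodes the sample string from the question. The class name DecimalEscapes and the method name decode are made up for illustration. It assumes every escape is a backslash followed by exactly three decimal digits and that all unescaped characters are plain ASCII; anything else (for example a backslash followed by non-digits) is not handled.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class DecimalEscapes {
    // Assumption: escapes are always "\ddd" (three decimal digits) and
    // every unescaped character is ASCII, so it fits in a single byte.
    static String decode(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 3 < s.length()) {
                // Three digits after the backslash form one byte value
                bytes.write(Integer.parseInt(s.substring(i + 1, i + 4)));
                i += 3;
            } else {
                bytes.write(c);
            }
        }
        // Interpret the collected bytes as UTF-8
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decode("Ben \\240\\159\\144\\144\\240\\159\\142\\169");
        // "Ben " plus two emoji (U+1F410 and U+1F3A9)
        System.out.println(decoded.codePointCount(0, decoded.length()));
    }
}
```

The two four-byte groups in the sample are the UTF-8 encodings of U+1F410 and U+1F3A9, so the decoded string has six code points.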

Related

Java remove diacritic

I am trying to write a function that removes diacritics (I deliberately don't want to use Normalizer). The function looks like this:
private static String normalizeCharacter(Character curr) {
String sdiac = "áäčďéěíĺľňóôőöŕšťúůűüýřžÁÄČĎÉĚÍĹĽŇÓÔŐÖŔŠŤÚŮŰÜÝŘŽ";
String bdiac = "aacdeeillnoooorstuuuuyrzAACDEEILLNOOOORSTUUUUYRZ";
char[] s = sdiac.toCharArray();
char[] b = bdiac.toCharArray();
String ret;
for(int i = 0; i < sdiac.length(); i++){
if(curr == s[i])
curr = b[i];
}
ret = curr.toString().toLowerCase();
ret = ret.replace("\n", "").replace("\r","");
return ret;
}
The function is called like this (every character from the file is passed to it):
private static String readFile(String fName) {
File f = new File(fName);
StringBuilder sb = new StringBuilder();
try{
FileInputStream fStream = new FileInputStream(f);
Character curr;
while(fStream.available() > 0){
curr = (char) fStream.read();
sb.append(normalizeCharacter(curr));
System.out.print(normalizeCharacter(curr));
}
}catch(IOException e){
e.printStackTrace();
}
return sb.toString();
}
The file text.txt contains this: ľščťžýáíéúäôň and I expect lcstzyaieuaon in return from the program, but instead of the expected string I get this: ¾è yaieuaoò. I know that the problem is somewhere in the encoding, but I don't know where. Any ideas?
You are trying to convert bytes into characters.
However, the character ľ is not represented as a single byte. Its Unicode code point is U+013E, and its UTF-8 representation is C4 BE. Thus, it is represented by two bytes. The same is true for the other characters.
Suppose the encoding of your file is UTF-8. Then you read the byte value C4, and then you convert it to a char. This will give you the character U+00C4 (Ä), not U+013E. Then you read the BE, and it is converted to the character U+00BE (¾).
So don't confuse bytes and characters. Instead of using the InputStream directly, you should wrap it with a Reader. A Reader is able to read characters based on the encoding it is created with:
BufferedReader reader = new BufferedReader(
new InputStreamReader(
new FileInputStream(f), StandardCharsets.UTF_8
)
);
Now you'll be able to read characters or even whole lines, and the decoding will be handled for you.
int readVal;
while ( ( readVal = reader.read() ) != -1 ) {
curr = (char)readVal;
// ... the rest of your code
}
Remember that read() without parameters still returns an int, so check for -1 (end of stream) before casting the value to char.
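To make the difference concrete, here is a small sketch that reads the two UTF-8 bytes of ľ both ways. The class and method names are invented for this example, and an in-memory byte array stands in for the file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class BytesVersusChars {
    // Casts every raw byte to a char, as the code in the question does.
    static String readByteByByte(byte[] data) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (InputStream in = new ByteArrayInputStream(data)) {
            int b;
            while ((b = in.read()) != -1) {
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    // Lets a Reader decode the bytes as UTF-8 before they become chars.
    static String readWithReader(byte[] data) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "\u013E".getBytes(StandardCharsets.UTF_8); // the letter ľ, bytes C4 BE
        System.out.println(readByteByByte(utf8).length()); // 2 chars: U+00C4 and U+00BE
        System.out.println(readWithReader(utf8).length()); // 1 char: U+013E
    }
}
```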

Java copy entire file without the double quotes

I have a method to copy an entire file from one location to another using a buffer:
InputStream in = new FileInputStream(src);
OutputStream out = new FileOutputStream(dest);
byte[] buf = new byte[1024];
int len;
while ((len = in.read(buf)) > 0) {
out.write(buf, 0, len);
}
in.close();
out.close();
The file is in csv format:
"2280B_TJ1400_001","TJ1400_Type-7SR","192.168.50.76","Aries SDH","6.0","192.168.0.254",24,"2280B Cyberjaya","Mahadzir Ibrahim"
But as you can see, it has quotes inside it. Is it possible to remove them based on my existing code?
Output should be like this:
2280B_TJ1400_001,TJ1400_Type-7SR,192.168.50.76,Aries SDH,6.0,192.168.0.254,24,2280B Cyberjaya,Mahadzir Ibrahim
If you use a BufferedReader you can use the readLine() function to read the contents of the file as a String. Then you can use the normal functions on String to manipulate it before writing it to the output. By using an OutputStreamWriter you can write the Strings directly.
An advantage of the above is that you never have to bother with the raw bytes, this makes your code easier to read and less prone to mistakes in special cases.
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(src)));
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(dest));
String line;
while ((line = in.readLine()) != null) {
    String stringOut = line.replaceAll("\"", "");
    out.write(stringOut);
    // readLine() strips the line terminator, so write it back
    out.write(System.lineSeparator());
}
in.close();
out.close();
Note that this removes all " characters, not just the ones at the start and end of each field. To remove only the enclosing quotes, you can use a StringTokenizer or a more complex replace.
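One way to drop only the enclosing quotes is a small character-by-character pass over each line. The sketch below is an illustration with made-up names (CsvQuotes, stripFieldQuotes); it assumes quoted fields escape an embedded quote by doubling it, and it does not protect commas inside quoted fields, so the result is only still valid CSV if no field contains a comma.

```java
final class CsvQuotes {
    // Removes the quotes that open and close a field; a doubled quote
    // inside a quoted field is kept as a single literal quote.
    static String stripFieldQuotes(String line) {
        StringBuilder out = new StringBuilder(line.length());
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c != '"') {
                out.append(c);
            } else if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                out.append('"'); // escaped quote inside a quoted field
                i++;             // skip the second quote of the pair
            } else {
                inQuotes = !inQuotes; // opening or closing quote: drop it
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripFieldQuotes("\"Aries SDH\",\"6.0\",24"));
    }
}
```

In the loop from the answer above, this method could be called on each line in place of replaceAll.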
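I'm not sure whether it's a good idea, but you can do something like this (note that a multi-byte character split across two buffer reads would be decoded incorrectly):
while ((len = in.read(buf)) > 0) {
    // Decode only the bytes actually read, not the whole buffer
    String temp = new String(buf, 0, len);
    temp = temp.replaceAll("\"", "");
    // Use a separate array so the read buffer keeps its size
    byte[] outBytes = temp.getBytes();
    out.write(outBytes, 0, outBytes.length);
}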
Personally, I would first read the whole file into a String, strip the double quotes out of it, and then write it to the destination file.
Read the file into a String
I found this simple solution. It may not be the best, depending on the level of error handling you need, but it works well enough ;)
String content = new Scanner(new File("filename")).useDelimiter("\\Z").next();
Strip out the double quotes
content = content.replace("\"", "");
Write it to the destination file from here
Files.write(Paths.get("./duke.txt"), content.getBytes());
This is for Java 7+.
I did not test it, but it should work!
Not necessarily good style, filtering quotes in binary data, but very solid.
Wrap the original InputStream with your own InputStream, filtering out the double quote.
I have added a quirk: in MS Excel a quoted field may contain a quote, which then is self-escaped, represented as two double quotes.
InputStream in = new UnquotingInputStream(new FileInputStream(src));
/**
 * Removes ASCII double quotes from an InputStream.
 * Two consecutive quotes stand for one quote: self-escaping as used
 * by MS Excel.
 */
public class UnquotingInputStream extends InputStream {
    private final InputStream in;
    private boolean justHadAQuote;
    public UnquotingInputStream(InputStream in) {
        this.in = in;
    }
    @Override
    public int read() throws IOException {
        int c = in.read();
        if (c == '\"') {
            if (!justHadAQuote) {
                justHadAQuote = true;
                return read(); // Skip quote
            }
        }
        justHadAQuote = false;
        return c;
    }
}
This works for all encodings that have ASCII as a subset, so not for UTF-16 or EBCDIC.

Reading from InflaterInputStream and parsing the result

I am quite new to Java; I just started yesterday. Since I am a big fan of learning by doing, I am making a small project with it, but I am stuck on this part. I have written a file using this function:
public static boolean writeZippedFile(File destFile, byte[] input) {
    try {
        // the "create the file if it doesn't exist" part was here
        try (OutputStream out = new DeflaterOutputStream(new FileOutputStream(destFile))) {
            out.write(input);
        }
        return true;
    } catch (IOException e) {
        // error handling was here
        return false;
    }
}
Now that I have successfully written a compressed file using the method above, I want to read it back to the console. First, I need to be able to read the decompressed content and write a string representation of it to the console. Second, I don't want to write the characters up to the first \0 null character. Here is how I attempt to read the compressed file:
try (InputStream is = new InflaterInputStream(new FileInputStream(destFile))) {
}
and I am completely stuck here. The question is how to discard the first few characters up to '\0' and then write the rest of the decompressed file to the console.
I understand that your data contains text, since you want to print a string representation. I further assume that the text contains Unicode characters. If this is true, then your console should also support Unicode for the characters to be displayed correctly.
So you should first read the data byte by byte until you encounter the \0 character and then you can use a BufferedReader to print the rest of the data as lines of text.
try (InputStream is = new InflaterInputStream(new FileInputStream(destFile))) {
// read the stream a single byte each time until we encounter '\0'
int aByte = 0;
while ((aByte = is.read()) != -1) {
if (aByte == '\0') {
break;
}
}
// from now on we want to print the data
BufferedReader b = new BufferedReader(new InputStreamReader(is, "UTF8"));
String line = null;
while ((line = b.readLine()) != null) {
System.out.println(line);
}
b.close();
} catch (IOException e) {
// handle
}
Skip the first few characters using InputStream#read()
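int b;
// Stop at end of stream too; otherwise a stream without '\0' would loop forever
while ((b = is.read()) != -1 && b != '\0');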
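Putting the pieces together, here is a self-contained round trip: compress some bytes, then skip everything up to and including the first NUL byte and decode the rest. It works on in-memory byte arrays instead of files so it can be run as-is; the class and method names are invented for this sketch, and it assumes the text after the NUL byte is UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class SkipHeaderDemo {
    // Same idea as writeZippedFile, but writing into memory.
    static byte[] compress(byte[] input) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new DeflaterOutputStream(sink)) {
            out.write(input);
        }
        return sink.toByteArray();
    }

    // Decompresses, discards everything up to and including the first
    // NUL byte, and returns the remainder decoded as UTF-8 (an assumption).
    static String textAfterFirstNul(byte[] compressed) throws IOException {
        try (InputStream is = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            int b;
            while ((b = is.read()) != -1 && b != 0) {
                // skipping the header
            }
            ByteArrayOutputStream rest = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = is.read(buf)) != -1) {
                rest.write(buf, 0, n);
            }
            return new String(rest.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] packed = compress("header\0hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(textAfterFirstNul(packed));
    }
}
```

If the data contains no NUL byte at all, the skip loop runs to the end of the stream and the method returns an empty string.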

println(char), characters turn into Chinese?

Please help me to troubleshoot this problem.
I have an input file 'Trial.txt' with content "Thanh Le".
Here is the function I used in an attempt to read from the file:
public char[] importSeq(){
File file = new File("G:\\trial.txt");
char temp_seq[] = new char[100];
try{
FileInputStream fis = new FileInputStream(file);
BufferedInputStream bis = new BufferedInputStream(fis);
DataInputStream dis = new DataInputStream(bis);
int i = 0;
//Try to read all character till the end of file
while(dis.available() != 0){
temp_seq[i]=dis.readChar();
i++;
}
System.out.println(" imported");
} catch (FileNotFoundException e){
e.printStackTrace();
} catch (IOException e){
e.printStackTrace();
}
return temp_seq;
}
And the main function:
public static void main(String[] args) {
Sequence s1 = new Sequence();
char result[];
result = s1.importSeq();
int i = 0;
while(result[i] != 0){
System.out.println(result[i]);
i++;
}
}
And this is the output.
run:
imported
瑨
慮
栠
汥
BUILD SUCCESSFUL (total time: 0 seconds)
That is, honestly, a pretty clumsy way to read a text file into a char[].
Here's a better example, assuming that the text file contains only ASCII characters.
File file = new File("G:/trial.txt");
char[] content = new char[(int) file.length()];
Reader reader = null;
try {
reader = new FileReader(file);
reader.read(content);
} finally {
if (reader != null) try { reader.close(); } catch (IOException ignore) {}
}
return content;
And then to print the char[], just do:
System.out.println(content);
Note that InputStream#available() doesn't necessarily do what you're expecting.
See also:
Java IO tutorial
Because a char in Java is 2 bytes, readChar reads two bytes at a time and combines each pair of letters into a single Unicode character.
You can avoid this by using readByte() instead.
Some code to demonstrate what exactly is happening. A char in Java consists of two bytes and holds one UTF-16 code unit, since Java represents chars and Strings internally as UTF-16. Your file uses one byte per character, probably ASCII. DataInputStream.readChar() reads two bytes and combines them, high byte first, into one char, so each call consumes two ASCII characters from your file.
The following code tries to explain how the single ASCII bytes 't' and 'h' become one Chinese UTF-16 character.
public class Main {
public static void main(String[] args) {
System.out.println((int)'t'); // 116 == x74 (116 is 74 in Hex)
System.out.println((int)'h'); // 104 == x68
System.out.println((int)'瑨'); // 29800 == x7468
// System.out.println('\u0074'); // t
// System.out.println('\u0068'); // h
// System.out.println('\u7468'); // 瑨
char th = (('t' << 8) + 'h'); //x74 x68
System.out.println(th); //瑨 == 29800 == '\u7468'
}
}
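The same effect can be reproduced with the stream classes themselves. The following sketch (class and method names are made up) feeds the ASCII bytes of a short string to DataInputStream.readChar() and to an InputStreamReader and compares the first char each one returns.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class ReadCharDemo {
    // readChar() consumes two bytes and combines them, high byte first.
    static char firstViaReadChar(byte[] data) throws IOException {
        try (DataInputStream dis = new DataInputStream(new ByteArrayInputStream(data))) {
            return dis.readChar();
        }
    }

    // A Reader decodes according to the charset, here one byte per char.
    static char firstViaReader(byte[] data) throws IOException {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.US_ASCII)) {
            return (char) reader.read();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] ascii = "thanh le".getBytes(StandardCharsets.US_ASCII);
        System.out.println((int) firstViaReadChar(ascii)); // 29800, i.e. 0x7468
        System.out.println(firstViaReader(ascii));         // t
    }
}
```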

Java : Read last n lines of a HUGE file

I want to read the last n lines of a very big file without reading the whole file into any buffer/memory area using Java.
I looked around the JDK APIs and Apache Commons I/O and am not able to locate one which is suitable for this purpose.
I was thinking of the way tail or less does it in UNIX. I don't think they load the entire file and then show the last few lines of the file. There should be similar way to do the same in Java too.
I found that the simplest way is to use ReversedLinesFileReader from the Apache Commons IO API.
It gives you the lines from the bottom of the file to the top, and you can set n_lines to the number of lines you want.
import org.apache.commons.io.input.ReversedLinesFileReader;
File file = new File("D:\\file_name.xml");
int n_lines = 10;
int counter = 0;
// try-with-resources closes the reader when done
try (ReversedLinesFileReader object = new ReversedLinesFileReader(file)) {
    String line;
    // readLine() returns null once the start of the file is reached
    while (counter < n_lines && (line = object.readLine()) != null) {
        System.out.println(line);
        counter++;
    }
}
If you use a RandomAccessFile, you can use length and seek to get to a specific point near the end of the file and then read forward from there.
If you find there weren't enough lines, back up from that point and try again. Once you've figured out where the Nth last line begins, you can seek to there and just read-and-print.
An initial best-guess assumption can be made based on your data properties. For example, if it's a text file, it's possible the line lengths won't exceed an average of 132 so, to get the last five lines, start 660 characters before the end. Then, if you were wrong, try again at 1320 (you can even use what you learned from the last 660 characters to adjust that - example: if those 660 characters were just three lines, the next try could be 660 / 3 * 5, plus maybe a bit extra just in case).
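The guess-and-retry idea above can be sketched roughly as follows. This is an illustration, not a tuned implementation: the names GuessingTail and lastLines are invented, the retry simply doubles the window instead of using the smarter estimate described above, the file is assumed to be UTF-8 with "\n" or "\r\n" line endings, and the window is assumed to fit in a byte array.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GuessingTail {
    static List<String> lastLines(File file, int n, int guessedLineLength) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long length = raf.length();
            long window = Math.max(1L, (long) n * guessedLineLength);
            while (true) {
                long start = Math.max(0L, length - window);
                byte[] buf = new byte[(int) (length - start)];
                raf.seek(start);
                raf.readFully(buf);
                int from = 0;
                if (start > 0) {
                    // The window probably starts mid-line: skip to just after the first newline
                    while (from < buf.length && buf[from] != '\n') {
                        from++;
                    }
                    from = Math.min(from + 1, buf.length);
                }
                String text = new String(buf, from, buf.length - from, StandardCharsets.UTF_8);
                List<String> lines = new ArrayList<>(Arrays.asList(text.split("\r?\n", -1)));
                // A trailing line terminator leaves one empty element at the end
                if (lines.get(lines.size() - 1).isEmpty()) {
                    lines.remove(lines.size() - 1);
                }
                if (lines.size() >= n || start == 0) {
                    return new ArrayList<>(lines.subList(Math.max(0, lines.size() - n), lines.size()));
                }
                window *= 2; // guess was too small: look further back
            }
        }
    }
}
```

Starting the decode right after a '\n' byte is safe in UTF-8 because that byte value never occurs inside a multi-byte character.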
RandomAccessFile is a good place to start, as described by the other answers. There is one important caveat though.
If your file is not encoded with a one-byte-per-character encoding, the readLine() method is not going to work for you. And readUTF() won't work in any circumstances. (It reads a string in modified UTF-8 preceded by a two-byte length ...)
Instead, you will need to make sure that you look for end-of-line markers in a way that respects the encoding's character boundaries. For fixed length encodings (e.g. flavors of UTF-16 or UTF-32) you need to extract characters starting from byte positions that are divisible by the character size in bytes. For variable length encodings (e.g. UTF-8), you need to search for a byte that must be the first byte of a character.
In the case of UTF-8, the first byte of a character will be 0xxxxxxx or 110xxxxx or 1110xxxx or 11110xxx. Anything else is either a second / third byte, or an illegal UTF-8 sequence. See The Unicode Standard, Version 5.2, Chapter 3.9, Table 3-7. This means, as the comment discussion points out, that any 0x0A and 0x0D bytes in a properly encoded UTF-8 stream will represent a LF or CR character. Thus, simply counting the 0x0A and 0x0D bytes is a valid implementation strategy (for UTF-8) if we can assume that the other kinds of Unicode line separator (U+2028, U+2029 and U+0085) are not used. If you can't assume that, the code becomes more complicated.
Having identified a proper character boundary, you can then just call new String(...) passing the byte array, offset, count and encoding, and then repeatedly call String.lastIndexOf(...) to count end-of-lines.
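For UTF-8, the boundary test can be written as a one-line bit check. The sketch below uses invented names (Utf8Boundaries, isCharStart, nextCharStart) and takes a shortcut: it treats every byte that is not a 10xxxxxx continuation byte as a possible start, so it does not reject illegal lead bytes such as 11111xxx.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Boundaries {
    // True unless the byte has the form 10xxxxxx (a continuation byte).
    static boolean isCharStart(byte b) {
        return (b & 0xC0) != 0x80;
    }

    // Moves pos forward to the next byte that can start a character,
    // or to data.length if there is none.
    static int nextCharStart(byte[] data, int pos) {
        while (pos < data.length && !isCharStart(data[pos])) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        // 'a', then the three bytes E2 82 AC of U+20AC, then 'b'
        byte[] data = "a\u20ACb".getBytes(StandardCharsets.UTF_8);
        int start = nextCharStart(data, 2); // position 2 is inside the three-byte character
        System.out.println(new String(data, start, data.length - start, StandardCharsets.UTF_8));
    }
}
```

After seeking to an arbitrary byte offset, calling nextCharStart on the bytes read from there gives a position from which new String(...) can decode safely.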
The ReversedLinesFileReader can be found in the Apache Commons IO java library.
int n_lines = 1000;
ReversedLinesFileReader object = new ReversedLinesFileReader(new File(path));
String result="";
for(int i=0;i<n_lines;i++){
String line=object.readLine();
if(line==null)
break;
result+=line;
}
return result;
I found RandomAccessFile and the other buffered reader classes too slow for me. Nothing can be faster than tail -<#lines>, so this was the best solution for me.
public String getLastNLogLines(File file, int nLines) {
StringBuilder s = new StringBuilder();
try {
Process p = Runtime.getRuntime().exec("tail -"+nLines+" "+file);
java.io.BufferedReader input = new java.io.BufferedReader(new java.io.InputStreamReader(p.getInputStream()));
String line = null;
//Here we first read the next line into the variable
//line and then check for the EOF condition, which
//is the return value of null
while((line = input.readLine()) != null){
s.append(line+'\n');
}
} catch (java.io.IOException e) {
e.printStackTrace();
}
return s.toString();
}
Use CircularFifoBuffer from Apache Commons. This is an answer from a similar question: How to read last 5 lines of a .txt file into java
Note that in Apache Commons Collections 4 this class seems to have been renamed to CircularFifoQueue
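The circular-buffer idea does not strictly need the Commons class. As a stand-in, the sketch below uses java.util.ArrayDeque from the standard library to keep only the most recent n lines; it is not the Commons Collections API, and the names LastLines and lastLines are made up. Note that this approach still reads the whole input from the start, it only bounds the memory used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class LastLines {
    // Keeps a sliding window of at most n lines while reading forward.
    static List<String> lastLines(Reader source, int n) throws IOException {
        Deque<String> window = new ArrayDeque<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                window.addLast(line);
                if (window.size() > n) {
                    window.removeFirst(); // drop the oldest line
                }
            }
        }
        return new ArrayList<>(window);
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a FileReader here
        System.out.println(lastLines(new StringReader("a\nb\nc\nd\n"), 2));
    }
}
```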
package com.uday;
import java.io.File;
import java.io.RandomAccessFile;
public class TailN {
public static void main(String[] args) throws Exception {
long startTime = System.currentTimeMillis();
TailN tailN = new TailN();
File file = new File("/Users/udakkuma/Documents/workspace/uday_cancel_feature/TestOOPS/src/file.txt");
tailN.readFromLast(file);
System.out.println("Execution Time : " + (System.currentTimeMillis() - startTime));
}
public void readFromLast(File file) throws Exception {
int lines = 3;
int readLines = 0;
StringBuilder builder = new StringBuilder();
try (RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r")) {
long fileLength = file.length() - 1;
// Set the pointer at the last of the file
randomAccessFile.seek(fileLength);
for (long pointer = fileLength; pointer >= 0; pointer--) {
randomAccessFile.seek(pointer);
char c;
// read from the last, one char at the time
c = (char) randomAccessFile.read();
// break when end of the line
if (c == '\n') {
readLines++;
if (readLines == lines)
break;
}
builder.append(c);
fileLength = fileLength - pointer;
}
// Since line is read from the last so it is in reverse order. Use reverse
// method to make it correct order
builder.reverse();
System.out.println(builder.toString());
}
}
}
A RandomAccessFile allows for seeking (http://download.oracle.com/javase/1.4.2/docs/api/java/io/RandomAccessFile.html). The File.length method will return the size of the file. The problem is determining number of lines. For this, you can seek to the end of the file and read backwards until you have hit the right number of lines.
I had a similar problem, but I didn't understand the other solutions.
I used this. I hope it is simple code.
// String filePathName = (directory and file name).
File f = new File(filePathName);
long fileLength = f.length(); // Size of the file in bytes.
long fileLength_toRead = 0;
if (fileLength > 2000) {
    // My file content is a table; I know one row has about 100 bytes / characters.
    // I start reading 1000 bytes before the end of the file.
    // If you don't know the line length, use @paxdiablo's advice.
    fileLength_toRead = fileLength - 1000;
}
try (RandomAccessFile raf = new RandomAccessFile(filePathName, "r")) { // try-with-resources opens and closes the file.
    raf.seek(fileLength_toRead); // Reading starts at this byte.
    String rowInFile = raf.readLine(); // The first line read is usually incomplete, so I discard it.
    rowInFile = raf.readLine();
    while (rowInFile != null) {
        // Here I can add the lines read (rowInFile) to a String[] array or an ArrayList<String>.
        // Later I can work with the rows from the array; the last row is sometimes empty, etc.
        rowInFile = raf.readLine();
    }
}
catch (IOException e) {
    //
}
Here is working code for this.
private static void printLastNLines(String filePath, int n) {
File file = new File(filePath);
StringBuilder builder = new StringBuilder();
try {
RandomAccessFile randomAccessFile = new RandomAccessFile(filePath, "r");
long pos = file.length() - 1;
randomAccessFile.seek(pos);
for (long i = pos - 1; i >= 0; i--) {
randomAccessFile.seek(i);
char c = (char) randomAccessFile.read();
if (c == '\n') {
n--;
if (n == 0) {
break;
}
}
builder.append(c);
}
builder.reverse();
System.out.println(builder.toString());
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
Here is the best way I've found to do it. It is simple, pretty fast, and memory efficient: it keeps only the last maxLines lines in memory, although it still reads the whole file.
public static void tail(File src, OutputStream out, int maxLines) throws FileNotFoundException, IOException {
    String[] lines = new String[maxLines]; // ring buffer holding the most recent lines
    long total = 0; // number of lines read so far
    try (BufferedReader reader = new BufferedReader(new FileReader(src))) {
        for (String line = reader.readLine(); line != null; line = reader.readLine()) {
            lines[(int) (total++ % maxLines)] = line;
        }
    }
    OutputStreamWriter writer = new OutputStreamWriter(out);
    // Start at the oldest retained line and walk forward in file order
    for (long n = Math.max(0, total - maxLines); n < total; n++) {
        writer.write(lines[(int) (n % maxLines)]);
        writer.write("\n");
    }
    writer.flush();
}
(See comment)
public String readFromLast(File file, int howMany) throws IOException {
int numLinesRead = 0;
StringBuilder builder = new StringBuilder();
try (RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r")) {
try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
long fileLength = file.length() - 1;
/*
* Set the pointer at the end of the file. If the file is empty, an IOException
* will be thrown
*/
randomAccessFile.seek(fileLength);
for (long pointer = fileLength; pointer >= 0; pointer--) {
randomAccessFile.seek(pointer);
byte b = (byte) randomAccessFile.read();
if (b == '\n') {
numLinesRead++;
// (Last line often terminated with a line separator)
if (numLinesRead == (howMany + 1))
break;
}
baos.write(b);
fileLength = fileLength - pointer;
}
/*
* Since line is read from the last so it is in reverse order. Use reverse
* method to make it ordered correctly
*/
byte[] a = baos.toByteArray();
int start = 0;
int mid = a.length / 2;
int end = a.length - 1;
while (start < mid) {
byte temp = a[end];
a[end] = a[start];
a[start] = temp;
start++;
end--;
}// End while
return new String(a).trim();
} // End inner try-with-resources
} // End outer try-with-resources
} // End method
I tried RandomAccessFile first, and it was tedious to read the file backwards, repositioning the file pointer upon every read operation. So I tried @Luca's solution and got the last few lines of the file as a string in just two lines of code, in a few minutes.
InputStream inputStream = Runtime.getRuntime().exec("tail " + path.toFile()).getInputStream();
String tail = new BufferedReader(new InputStreamReader(inputStream)).lines().collect(Collectors.joining(System.lineSeparator()));
The code is only 2 lines:
// Please specify correct Charset
ReversedLinesFileReader rlf = new ReversedLinesFileReader(file, StandardCharsets.UTF_8);
// read last 2 lines
System.out.println(rlf.toString(2));
Gradle:
implementation group: 'commons-io', name: 'commons-io', version: '2.11.0'
Maven:
<dependency>
<groupId>commons-io</groupId><artifactId>commons-io</artifactId><version>2.11.0</version>
</dependency>
