Remove illegal xml characters from UTF-16LE encoded file - java

I have a Java application that parses an XML file encoded in UTF-16LE. Parsing has been failing because the file contains illegal XML characters. My workaround is to read the file into a Java String, remove the illegal characters, and then parse it. This works 99% of the time, but the output of the process differs slightly from the input, and not because of the removed characters; I think it happens going from the UTF-16LE encoding to Java's internal UTF-16 String representation.
BufferedReader reader = null;
String fileText = ""; // stored as UTF-16
try {
    reader = new BufferedReader(new InputStreamReader(in, "UTF-16LE"));
    for (String line; (line = reader.readLine()) != null; ) {
        fileText += line;
    }
} catch (Exception ex) {
    logger.log(Level.WARNING, "Error removing illegal xml characters", ex);
} finally {
    if (reader != null) {
        reader.close();
    }
}
// code to remove illegal chars from string here, irrelevant to problem
ByteArrayInputStream inStream = new ByteArrayInputStream(fileText.getBytes("UTF-16LE"));
Document doc = XmlUtil.openDocument(inStream, XML_ROOT_NODE_ELEM);
Do characters get changed or lost when going from UTF-16LE to UTF-16? Is there a way to do this in Java while ensuring the output is exactly the same as the input?

Certainly one problem is that readLine throws away the line ending. You would need to do something like:
fileText += line + "\r\n";
Otherwise XML attributes, DTD entities, or something else could get glued together where at least a space was required. You also do not want text content to be altered where it contains a line break.
Performance (speed and memory) can be improved by using a StringBuilder:
StringBuilder fileText = new StringBuilder();
... fileText.append(line).append("\n");
... fileText.toString();
Then there might be a problem with the first character of the file: a BOM character that sometimes is redundantly added. You can strip it with:
line = line.replace("\uFEFF", "");
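Putting the pieces above together, a minimal sketch of the whole read step might look like this (assuming "\n" is an acceptable normalization of the original line separators, and using java.nio.charset.StandardCharsets; in, fileText, and the illegal-character removal are from the question):
StringBuilder fileText = new StringBuilder();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_16LE))) {
    for (String line; (line = reader.readLine()) != null; ) {
        // strip a redundant BOM if the decoder left one in the text
        line = line.replace("\uFEFF", "");
        fileText.append(line).append("\n");
    }
}
// remove illegal XML characters from fileText here, then re-encode
ByteArrayInputStream inStream = new ByteArrayInputStream(
        fileText.toString().getBytes(StandardCharsets.UTF_16LE));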

Related

Java CSV Reader - Replace Quotes

I am using CsvReader to read a CSV file in Java. In my case, the CSV file will contain double quotes (") and single quotes ('). Something like this:
SL 12" WIR TREE ASST CD
Below is the code I am using to read the file:
CsvReader reader = null;
reader = readFile(fileName, delimiter, encoding);
while (reader.readRecord()) {
    // Code Part
}
Whenever it reaches reader.readRecord(), it throws this exception: "Maximum column length of 100,000 exceeded in column 0 in record 0. Set the SafetySwitch property to false if you're expecting column lengths greater than 100,000 characters to avoid this error."
What I am trying to do, and what I need, is this: since I can't make any changes to the file, I am trying to replace the double quotes and single quotes with an empty string in Java. But the exception above is thrown before I get the chance.
I don't know what CsvReader is (it is not part of the standard JDK), but the problem occurs in readRecord(), and thus before you have a chance to replace any characters. So CsvReader is not usable here, and you should use a less specialised reader such as java.io.BufferedReader.
Given that the delimiter is neither a quote nor a double quote (for obvious reasons), this code snippet works:
File file = new File(fileName);
InputStream is = new FileInputStream(file);
BufferedReader reader = new BufferedReader(new InputStreamReader(is, encoding));
try {
    String line = reader.readLine();
    while (line != null) {
        // replace quotes
        line = line.replace("\"", "");
        line = line.replace("'", "");
        // split line according to given delimiter
        String[] items = line.split(delimiter);
        // handle items...
        line = reader.readLine();
    }
}
catch (IOException e) {
    // handle exception...
}
finally {
    reader.close();
}

Characters not appearing when I print an imported file?

I'm importing a file into my code and trying to print it. The file contains:
i don't like cake.
pizza is good.
i don’t like "cookies" to.
17.
29.
The second "don't" has a right single quotation mark, and when I print it the output is
don�t
where the quotation mark is printed as a blank square. Is there a way to convert it to a regular apostrophe?
EDIT:
public class Somethingsomething {
    public static void main(String[] args) throws FileNotFoundException,
            IOException {
        ArrayList<String> list = new ArrayList<String>();
        File file = new File("D:\\project1Test.txt");
        if (file.exists()) { // checks if file exists
            FileInputStream fileStream = new FileInputStream(file);
            InputStreamReader input = new InputStreamReader(fileStream);
            BufferedReader reader = new BufferedReader(input);
            String line;
            while ((line = reader.readLine()) != null) {
                list.add(line);
            }
            for (int i = 0; i < list.size(); i++) {
                System.out.println(list.get(i));
            }
        }
    }
}
It should print normally, but the second "don't" has a white block in place of the apostrophe.
this is the file I'm using https://www.mediafire.com/file/8rk7nwilpj7rn7s/project1Test.txt
Edit: if it helps even more, the full document where the character is found is here:
https://www.nytimes.com/2018/03/25/business/economy/labor-professionals.html
It’s all about character encoding. The way characters are represented isn't always the same, and they tend to get misinterpreted.
Characters are usually stored as numbers that depend on the encoding standard (and there are many of them). For example, in ASCII "a" is 97 (hex 0x61), and UTF-8 encodes it as that same single byte.
Now when you see funny characters such as this question-mark box (called the replacement character), it's usually a sign that text in one encoding is being interpreted as another, and the replacement character is substituted for any unknown or misinterpreted character.
To fix your problem you need to tell your reader to read your file using a specific character encoding, say SOME-CHARSET.
Replace this:
InputStreamReader input = new InputStreamReader(fileStream);
with this:
InputStreamReader input = new InputStreamReader(fileStream, "SOME-CHARSET");
A list of supported charsets is available in the documentation of java.nio.charset.Charset. Unfortunately, you might have to try the likely candidates one by one; UTF-8, windows-1252, and ISO-8859-1 are among the most common.
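Since the file in question contains a right single quotation mark (U+2019), UTF-8 is the most likely candidate, so a reasonable first attempt is (using java.nio.charset.StandardCharsets to avoid typos in the charset name):
InputStreamReader input = new InputStreamReader(fileStream, StandardCharsets.UTF_8);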
Your problem is almost certainly the encoding scheme you are using. You can read a file in almost any encoding scheme you want; just tell Java how your input was encoded. UTF-8 is common on Linux, while Windows natively uses legacy code pages such as windows-1250 or windows-1252, depending on the locale.
This is the sort of problem you have all the time if you are processing files created on a different OS.
I'll give you a different approach...
Use the appropriate means for reading plain text files. Try this:
public static String getTxtContent(String path)
{
    // note: FileReader decodes with the platform default charset
    try (BufferedReader br = new BufferedReader(new FileReader(path)))
    {
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();
        while (line != null) {
            sb.append(line);
            sb.append(System.lineSeparator());
            line = br.readLine();
        }
        return sb.toString();
    } catch (IOException fex) { return null; }
}
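Note that FileReader as used above decodes with the platform default charset, which is exactly what causes the replacement characters in the question. Since Java 11, FileReader can take an explicit charset; a sketch assuming Java 11+ and a UTF-8 file:
try (BufferedReader br = new BufferedReader(
        new FileReader(path, StandardCharsets.UTF_8)))
{
    // same read loop as above
}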

Java replace line in a text file

I found this code in another question:
private void updateLine(String toUpdate, String updated) throws IOException {
    BufferedReader file = new BufferedReader(new FileReader(data));
    String line;
    String input = "";
    while ((line = file.readLine()) != null)
        input += line + "\n";
    input = input.replace(toUpdate, updated);
    FileOutputStream os = new FileOutputStream(data);
    os.write(input.getBytes());
    file.close();
    os.close();
}
This is my file before I replace some lines
example1
example2
example3
But when I replace a line, the file now looks like this
example1example2example3
This makes the file impossible to read when there are a lot of lines in it.
How would I go about editing the code above to make my file look what it looked like at the start?
Use System.lineSeparator() instead of \n.
while ((line = file.readLine()) != null)
    input += line + System.lineSeparator();
The issue is that on Unix systems, the line separator is \n while on Windows systems, it's \r\n.
In Java versions older than Java 7, you would have to use System.getProperty("line.separator") instead.
As pointed out in the comments, if you have concerns about memory usage, it would be wise to not store the entire output in a variable, but write it out line-by-line in the loop that you're using to process the input.
If you read and modify line by line, you get the advantage that you don't need to fit the whole file in memory. I'm not sure whether that is possible in your case, but it is generally a good thing to aim for streaming. Here it additionally removes the need to concatenate strings, and you don't need to pick a line terminator, because you can write each transformed line with println(). It requires writing to a different file, which is generally a good thing as it is crash safe: you would lose data if you rewrote the file in place and were aborted.
private void updateLine(String toUpdate, String updated) throws IOException {
    BufferedReader file = new BufferedReader(new FileReader(data));
    PrintWriter writer = new PrintWriter(new File(data + ".out"), "UTF-8");
    String line;
    while ((line = file.readLine()) != null) {
        line = line.replace(toUpdate, updated);
        writer.println(line);
    }
    file.close();
    if (writer.checkError()) {
        throw new IOException("cannot write");
    }
    writer.close();
}
In this case, it assumes that the replacement only needs to be done on complete lines, not across multiple lines. I also added an explicit encoding and used a writer, as you have a string to output.
This is because you use an OutputStream, which is better suited for handling binary data. Try using a PrintWriter instead, and don't add any line terminator at the end of the lines; println() does that for you.
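A minimal sketch of that suggestion, assuming data refers to the file being edited as in the question (requires java.io and java.util imports):
private void updateLine(String toUpdate, String updated) throws IOException {
    // read all lines first, then rewrite the file line by line
    List<String> lines = new ArrayList<>();
    try (BufferedReader file = new BufferedReader(new FileReader(data))) {
        for (String line; (line = file.readLine()) != null; ) {
            lines.add(line.replace(toUpdate, updated));
        }
    }
    // println() appends the platform line separator, so nothing gets glued together
    try (PrintWriter writer = new PrintWriter(data)) {
        for (String line : lines) {
            writer.println(line);
        }
    }
}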

GZIP eats newlines

I have the following code for compressing and decompressing a string:
public static byte[] compress(String str)
{
    try
    {
        ByteArrayOutputStream obj = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(obj);
        gzip.write(str.getBytes("UTF-8"));
        gzip.close();
        return obj.toByteArray();
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return null;
}

public static String decompress(byte[] bytes)
{
    try
    {
        GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(bytes));
        BufferedReader bf = new BufferedReader(new InputStreamReader(gis, "UTF-8"));
        StringBuilder outStr = new StringBuilder();
        String line;
        while ((line = bf.readLine()) != null)
        {
            outStr.append(line);
        }
        return outStr.toString();
    }
    catch (IOException e)
    {
        return e.getMessage();
    }
}
I compress into a byte array on Windows, send the byte array through a socket to Linux, and decompress it there. However, upon decompression all my newline characters seem to be gone.
At first I thought the problem was a Windows-to-Linux issue, but I then wrote a simple program on Windows that uses the same code and found that the newlines are still gone.
Can anyone shed any light on what causes this? I can't figure out an explanation.
I think the problem is here:
while ((line = bf.readLine()) != null)
{
    outStr.append(line);
}
readLine() sees the newline character but does not include it in the returned value of line.
The problem is worse than you think, perhaps.
readLine() gets all the characters up to, but not including, a newline (or some variety of carriage return and line feed characters) OR the end of file. So you don't know whether the last line you got had a newline at the end or not.
This might not matter, and if so, you can just add the following after the other append:
outStr.append('\n');
Some files might end up with an extra line ending at the end of the file.
If it does matter, you will need to use read() and then output all the characters you receive. In that case, you might run into the infamous "what's at the end of the line?" problem you allude to, where Windows, Linux, and macOS use different combinations of carriage return and newline characters to end lines.
It is not GZIP that is "eating" newlines.
It is this code:
while ((line = bf.readLine()) != null)
{
    outStr.append(line);
}
The readLine() method reads a line (up to and including a line termination sequence) and then returns it without a newline. You then append it to outStr ... without replacing the line termination that was stripped.
But even if you replaced the line termination, you can't guarantee to preserve the actual line termination sequence that was used ... if you do it that way.
I recommend that you replace the readLine() calls with read() calls; i.e. read and then buffer the data one character at a time. It solves two problems at once. It may even be faster, because you are avoiding the unnecessary overhead of assembling line Strings.
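A sketch of that read()-based variant, using a small char buffer rather than a single character at a time (equivalent, but faster); because characters are copied verbatim, every line terminator survives exactly as it was:
public static String decompress(byte[] bytes) throws IOException {
    GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(bytes));
    Reader reader = new InputStreamReader(gis, "UTF-8");
    StringBuilder outStr = new StringBuilder();
    char[] buf = new char[4096];
    // read() copies characters, including \r and \n, straight through
    for (int n; (n = reader.read(buf)) != -1; ) {
        outStr.append(buf, 0, n);
    }
    return outStr.toString();
}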

Error reading UTF-8 file in Java

I am trying to read in some sentences from a file that contains Unicode characters. It does print out a string, but for some reason it messes up the Unicode characters.
This is the code I have:
public static String readSentence(String resourceName) {
    String sentence = null;
    try {
        InputStream refStream = ClassLoader
                .getSystemResourceAsStream(resourceName);
        BufferedReader br = new BufferedReader(new InputStreamReader(
                refStream, Charset.forName("UTF-8")));
        sentence = br.readLine();
    } catch (IOException e) {
        throw new RuntimeException("Cannot read sentence: " + resourceName);
    }
    return sentence.trim();
}
The problem is probably in the way that the string is being output.
I suggest that you confirm that you are correctly reading the Unicode characters by doing something like this:
for (char c : sentence.toCharArray()) {
    System.err.println("char '" + c + "' is unicode codepoint " + ((int) c));
}
and see whether the Unicode codepoints are correct for the characters that are being messed up. If they are correct, then the problem is on the output side; if not, then on the input side.
First, you could create the InputStreamReader as
new InputStreamReader(refStream, "UTF-8")
Also, you should verify if the resource really contains UTF-8 content.
One of the most annoying causes could be your IDE settings.
If your IDE's default console encoding is something like Latin-1, you can struggle for a long time with different variations of Java code, and nothing will help until you set the relevant IDE options correctly.
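One way to rule out the output side (assuming the console itself is configured to display UTF-8) is to force the encoding of the stream you print to; a minimal sketch:
// wrap stdout so strings are encoded as UTF-8 on the way out
// (throws UnsupportedEncodingException if the charset name is unknown)
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(sentence);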
