GZIP eats newlines

I have the following code for compressing and decompressing a string.
public static byte[] compress(String str)
{
try
{
ByteArrayOutputStream obj = new ByteArrayOutputStream();
GZIPOutputStream gzip = new GZIPOutputStream(obj);
gzip.write(str.getBytes("UTF-8"));
gzip.close();
return obj.toByteArray();
}
catch (IOException e)
{
e.printStackTrace();
}
return null;
}
public static String decompress(byte[] bytes)
{
try
{
GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(bytes));
BufferedReader bf = new BufferedReader(new InputStreamReader(gis, "UTF-8"));
StringBuilder outStr = new StringBuilder();
String line;
while ((line = bf.readLine()) != null)
{
outStr.append(line);
}
return outStr.toString();
}
catch (IOException e)
{
return e.getMessage();
}
}
I compress into a byte array on Windows, then send the byte array through a socket to a Linux machine and decompress it there. However, after decompression all my newline characters seem to be gone.
At first I thought the problem was the Windows-to-Linux transfer. However, I wrote a simple program that compresses and decompresses on Windows alone, and the newlines are still gone.
Can anyone shed any light on what causes this? I can't find an explanation.

I think the problem is here:
while ((line = bf.readLine()) != null)
{
outStr.append(line);
}
readLine() sees the newline character but doesn't include it in the value returned for line.
The problem is worse than you think, perhaps.
readLine() gets all the characters up to, but not including, a newline (or some variety of returns and linefeed characters) OR the end of file. So you don't know if the last line you get had a newline on the end or not.
This might not matter, and if so, you can just add this following the other append:
outStr.append('\n');
Some files might end up with an extra line ending at the end of file.
If it does matter, you will need to use read() and then output all the characters you receive. In that case, you might run into the infamous "What's at the end of the line?" problem you allude to: Windows, Linux and macOS use different combinations of carriage-return and line-feed characters to end lines.

It is not GZIP that is "eating" newlines.
It is this code:
while ((line = bf.readLine()) != null)
{
outStr.append(line);
}
The readLine() method reads a line (up to and including a line termination sequence) and then returns it without a newline. You then append it to outStr ... without replacing the line termination that was stripped.
But even if you replaced the line termination, you can't guarantee to preserve the actual line termination sequence that was used ... if you do it that way.
I recommend that you replace the readLine() calls with read() calls; i.e. read and then buffer the data one character at a time. It solves two problems at once. It may even be faster, because you are avoiding the unnecessary overhead of assembling line Strings.
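As a rough sketch of the read()-based approach, the round trip below copies characters in chunks with read(char[]) instead of one at a time, so every line terminator passes through untouched. The class name GzipRoundTrip is made up for illustration, and the methods throw IOException instead of catching it, which differs from the question's code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; the name is not from the original question.
class GzipRoundTrip {

    static byte[] compress(String str) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer, so close before toByteArray().
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(str.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    static String decompress(byte[] bytes) throws IOException {
        StringBuilder outStr = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(bytes)),
                StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int n;
            // read(char[]) returns the number of chars read, or -1 at end of stream.
            // Nothing is stripped, so "\r\n" and "\n" both survive as written.
            while ((n = reader.read(buffer)) != -1) {
                outStr.append(buffer, 0, n);
            }
        }
        return outStr.toString();
    }
}
```

With this version, decompress(compress(s)) should give back exactly s, whichever line endings s contains.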

Related

BufferedReader.readLine() hangs sometimes

In my application there is a separate thread, run by ScheduledExecutorService.scheduleAtFixedRate() every minute, which parses RSS feeds from multiple websites. I am using Apache HttpClient to receive the XML.
Sample code:
InputStream inputStream = HTTPClient.get(url);
String xml = inputStreamToString(inputStream, encoding, websiteName);
public static String inputStreamToString(InputStream inputStream, String encoding, String websiteName)
{
BufferedReader bufferedReader = null;
PrintWriter printWriter = null;
StringBuilder stringBuilder = new StringBuilder();
int letter;
try
{
bufferedReader = new BufferedReader(new InputStreamReader(inputStream, encoding));
printWriter = new PrintWriter(new File("src/doclog/"
+ websiteName + "_"
+ new SimpleDateFormat("MM_dd_yyyy_hh_mm_ss").format(new Date(System.currentTimeMillis()))
+ "_" + encoding + ".txt"), encoding);
while((letter = bufferedReader.read()) != -1)
{
char character = (char) letter;
printWriter.print(character);
stringBuilder.append(character);
}
}
catch(IOException e)
{
throw new RuntimeException(e);
}
finally
{
try
{
if(bufferedReader != null)
{
bufferedReader.close();
}
if(printWriter != null)
{
printWriter.close();
}
}
catch(IOException e)
{
e.printStackTrace();
}
}
System.out.println("String built");
return stringBuilder.toString();
}
And HTTPClient class:
public class HTTPClient
{
private static final HttpClient CLIENT = HttpClientBuilder.create().build();
public static InputStream get(String url)
{
try
{
HttpGet request = new HttpGet(url);
HttpResponse response = CLIENT.execute(request);
System.out.println("Response Code: " + response.getStatusLine().toString());
return response.getEntity().getContent();
}
catch(IOException | IllegalArgumentException e)
{
throw new RuntimeException(e);
}
}
}
As the title says, there is a chance that bufferedReader.readLine() will sometimes hang forever. I've seen other answers on this topic, and they suggest checking whether bufferedReader.ready() returns true. The problem is that for some websites bufferedReader.ready() always returns false while they are being processed, yet they parse just fine.
How can I prevent my thread from hanging on bufferedReader.readLine()?
If it matters, response.getStatusLine().toString() always returns HTTP/1.1 200 OK
EDIT
I just found out that bufferedReader.ready() is actually true when the hang happens.
EDIT 2
BufferedReader.read() hangs as well. Strangely, the hang happens with only one website, and its occurrence is completely random. The application can work for 15 hours, receiving hundreds of non-problematic responses, or hang just 10 minutes after launch. I've started writing all characters of every single update into a separate file, and found that nothing special really happens. The XML reading simply stops forever in the middle of the document; the last characters were <p dir="ltr"&g. Updated the code.
It's also worth mentioning that there can't be any unhandled exceptions, because at the highest level of my ScheduledExecutorService.scheduleAtFixedRate() runnable I catch Throwable and print its stack trace.

The ready() method returns true telling you that there are characters available for reading. The problem is that readLine() blocks until it finds an end-of-line in the input.
public String readLine()
throws IOException
Reads a line of text. A line is considered to be terminated by any one
of a line feed ('\n'), a carriage return ('\r'), or a carriage return
followed immediately by a linefeed.
As you are reading from a stream, there is no guarantee that the data will arrive at line boundaries, so the readLine() call blocks.
You can use the read method instead, which returns as soon as some characters are available rather than waiting for a complete line, but you will have to check for EOL yourself.
public int read(char[] cbuf, int off, int len) throws IOException
Reads characters into a portion of an array.
This method implements the general contract of the corresponding read
method of the Reader class. As an additional convenience, it attempts
to read as many characters as possible by repeatedly invoking the read
method of the underlying stream. This iterated read continues until
one of the following conditions becomes true:
The specified number of characters have been read,
The read method of the underlying stream returns -1, indicating end-of-file, or
The ready method of the underlying stream returns false, indicating that further input requests would block.
If the first read on the underlying stream returns -1 to indicate
end-of-file then this method returns -1. Otherwise this method returns
the number of characters actually read.
Also, you will have to reconstruct the line from the characters read. It is not as convenient as reading the entire line at once, but it is the way it must be done.
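To illustrate what "checking for EOL yourself" could look like, here is a minimal sketch of a helper that is fed the chunks returned by read(char[], int, int) and rebuilds lines from them, treating \n, \r and \r\n as terminators. LineAssembler, feed and finish are hypothetical names, not part of any library. Note that this only covers line reconstruction; read() still blocks until at least some input arrives, so it does not by itself cure a connection that stalls mid-document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: rebuilds lines from arbitrary character chunks.
class LineAssembler {
    private final StringBuilder current = new StringBuilder();
    private boolean lastWasCr = false; // remembers a '\r' that ended the previous chunk

    /** Consumes len chars from buf and returns the lines completed by this chunk. */
    List<String> feed(char[] buf, int len) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < len; i++) {
            char c = buf[i];
            if (c == '\n' && lastWasCr) {
                // Second half of "\r\n": the line was already emitted at '\r'.
                lastWasCr = false;
                continue;
            }
            lastWasCr = (c == '\r');
            if (c == '\n' || c == '\r') {
                lines.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        return lines;
    }

    /** Returns trailing text that was not terminated by a line break, or null if none. */
    String finish() {
        return current.length() == 0 ? null : current.toString();
    }
}
```

A caller would loop on reader.read(buf, 0, buf.length), pass each chunk to feed(), and call finish() once read() returns -1.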

Java replace line in a text file

I found this code from another question
private void updateLine(String toUpdate, String updated) throws IOException {
BufferedReader file = new BufferedReader(new FileReader(data));
String line;
String input = "";
while ((line = file.readLine()) != null)
input += line + "\n";
input = input.replace(toUpdate, updated);
FileOutputStream os = new FileOutputStream(data);
os.write(input.getBytes());
file.close();
os.close();
}
This is my file before I replace some lines
example1
example2
example3
But when I replace a line, the file now looks like this
example1example2example3
Which makes it impossible to read the file when there are a lot of lines in it.
How would I go about editing the code above so that my file looks like it did at the start?

Use System.lineSeparator() instead of \n.
while ((line = file.readLine()) != null)
input += line + System.lineSeparator();
The issue is that on Unix systems, the line separator is \n while on Windows systems, it's \r\n.
In Java versions older than Java 7, you would have to use System.getProperty("line.separator") instead.
As pointed out in the comments, if you have concerns about memory usage, it would be wise to not store the entire output in a variable, but write it out line-by-line in the loop that you're using to process the input.

Reading and modifying line by line has the advantage that you don't need to fit the whole file in memory. I'm not sure whether that is possible in your case, but streaming is generally a good thing to aim for. Here it also removes the need to concatenate strings, and you don't have to choose a line terminator, because you can write each transformed line with println(). It does require writing to a different file, which is generally a good thing because it is crash safe: if you rewrite a file in place and the program is aborted, you lose data.
private void updateLine(String toUpdate, String updated) throws IOException {
BufferedReader file = new BufferedReader(new FileReader(data));
PrintWriter writer = new PrintWriter(new File(data+".out"), "UTF-8");
String line;
while ((line = file.readLine()) != null)
{
line = line.replace(toUpdate, updated);
writer.println(line);
}
file.close();
if (writer.checkError())
throw new IOException("cannot write");
writer.close();
}
This assumes that each replacement falls within a single line and never spans multiple lines. I also added an explicit encoding and used a writer, since the output is a string.
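If the edited content should end up under the original file name, one possible follow-up is to move the temporary file over the original once writing has finished. The sketch below uses java.nio.file instead of the File/PrintWriter combination above. The class name LineReplacer and the choice to pass data as a Path are assumptions made for the example, and so is UTF-8 as the file encoding.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class; "data" is assumed to be a Path to a UTF-8 text file.
class LineReplacer {

    static void updateLine(Path data, String toUpdate, String updated) throws IOException {
        // Write to a sibling file first so the original stays intact if something fails.
        Path tmp = data.resolveSibling(data.getFileName() + ".out");
        try (BufferedReader in = Files.newBufferedReader(data, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line.replace(toUpdate, updated));
                out.newLine(); // platform line separator, restored after readLine() stripped it
            }
        }
        // Both streams are closed here; only now replace the original.
        Files.move(tmp, data, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Because newLine() writes the platform separator, a file that originally used a different line ending will be normalized to the local one.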

This happens because you write through an OutputStream, which is better suited to binary data. Try using a PrintWriter and don't add any line terminator at the end of the lines yourself.

Reading from InflaterInputStream and parsing the result

I am quite new to Java; I just started yesterday. Since I am a big fan of learning by doing, I am making a small project with it. But I am stuck on this part. I have written a file using this function:
public static boolean writeZippedFile(File destFile, byte[] input) {
try {
// create file if doesn't exist part was here
try (OutputStream out = new DeflaterOutputStream(new FileOutputStream(destFile))) {
out.write(input);
}
return true;
} catch (IOException e) {
// error handling was here
}
}
Now that I have successfully written a compressed file using the above method, I want to read it back to the console. First, I need to be able to read the decompressed content and write a string representation of that content to the console. However, I have a second problem: I don't want to write the characters up to the first \0 null character. Here is how I attempt to read the compressed file:
try (InputStream is = new InflaterInputStream(new FileInputStream(destFile))) {
}
and I am completely stuck here. The question is how to discard the first few characters up to '\0' and then write the rest of the decompressed file to the console.

I understand that your data contains text, since you want to print a string representation. I further assume that the text contains Unicode characters. If this is true, then your console should also support Unicode for the characters to be displayed correctly.
So you should first read the data byte by byte until you encounter the \0 character, and then you can use a BufferedReader to print the rest of the data as lines of text.
try (InputStream is = new InflaterInputStream(new FileInputStream(destFile))) {
// read the stream a single byte each time until we encounter '\0'
int aByte = 0;
while ((aByte = is.read()) != -1) {
if (aByte == '\0') {
break;
}
}
// from now on we want to print the data
BufferedReader b = new BufferedReader(new InputStreamReader(is, "UTF8"));
String line = null;
while ((line = b.readLine()) != null) {
System.out.println(line);
}
b.close();
} catch(IOException e) {
// handle
}

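Skip the first few characters using InputStream#read(), stopping at end of stream as well so the loop cannot spin forever if there is no '\0':
int b;
while ((b = is.read()) != -1 && b != '\0');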
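Putting the pieces together, a self-contained sketch could look like the following. It works on in-memory streams so it can be tried without a file. AfterNulReader, deflate and textAfterNul are made-up names, and UTF-8 is assumed as the text encoding. Unlike the BufferedReader version above, it copies the remaining bytes as they are, so line endings in the payload are preserved.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper class for illustration only.
class AfterNulReader {

    /** Compresses the input the same way the question's writeZippedFile does. */
    static byte[] deflate(byte[] input) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink)) {
            out.write(input);
        }
        return sink.toByteArray();
    }

    /** Inflates the stream, discards everything up to the first NUL byte, returns the rest. */
    static String textAfterNul(InputStream compressed) throws IOException {
        try (InputStream is = new InflaterInputStream(compressed)) {
            int b;
            while ((b = is.read()) != -1 && b != 0) {
                // discard header bytes; stop at NUL or at end of stream
            }
            ByteArrayOutputStream rest = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = is.read(buffer)) != -1) {
                rest.write(buffer, 0, n);
            }
            // Decode once at the end so multi-byte characters are never split.
            return new String(rest.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

For a file, the caller would pass new FileInputStream(destFile) and print the returned string with System.out.print.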

Java: Copy strings from a file to another without losing the 'newline format'

Sorry in advance if the title is misleading or wrong, but this is the best I can do after a really long day spent practicing Java. (My brain is melting.)
I put this code together to read a file and copy it into another file, skipping any line that begins with a given string (BeginOfTheLineToRemove). It actually works and removes the desired line, but for some reason it forgets about the \n (newline). Spacing and symbols are copied. I can't figure it out. I really hope someone will help. Cheers from a Java newb from Italy ;)
public void Remover(String file, String BeginOfTheLineToRemove) throws IOException {
File StartingFile = new File(file);
File EndingFile = new File(StartingFile.getAbsolutePath() + ".tmp");
BufferedReader br = new BufferedReader(new FileReader(file));
PrintWriter pw = new PrintWriter(new FileWriter(EndingFile));
String line;
while ((line = br.readLine()) != null) {
if (line.startsWith(BeginOfTheLineToRemove)) {
continue;
}
pw.write(line);
}
pw.close();
br.close();
}

Use pw.println instead of pw.write. println adds a line separator after it writes the content.

You are using PrintWriter.write() to write the lines, which does not write a newline at the end. Use println() instead.

This will probably help you.
The BufferedReader.readLine() method does not return any line-termination characters, so your line will not contain any.

BufferedReader#readLine documentation says:
Returns: A String containing the contents of the line, not including any line-termination characters, or null if the end of the stream has been reached
That is, the reader strips the line termination characters from your Strings, so you need to manually add them again:
// \n on Linux/Mac, \r\n on Windows
String lineSep = System.getProperty("line.separator");
pw.write(line);
pw.write(lineSep);

BufferedReader.readLine() uses the newline to identify the end of the line, and the string that it returns does not contain this newline. The newline is a separator, so it is not considered part of the data.
To compensate for this, you can add a newline to your output, like so:
while((line = br.readLine()) != null) {
if(line.startsWith(BeginOfTheLineToRemove)) continue;
pw.write(line);
pw.println();
}
The extra call to PrintWriter.println() will print a newline after you write out your line of text.

Outside the loop, get the system's line separator:
String lineSeparator = System.getProperty("line.separator");
Then append that to the line you've read in:
pw.write(line+lineSeparator);
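For reference, the println() variant suggested in the answers can be condensed into a small sketch like this one. It takes a Reader and a Writer instead of file names so it can be tried in memory. LineFilter and copyWithout are hypothetical names chosen for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper class for illustration only.
class LineFilter {

    /** Copies every line from source to target except those starting with prefix. */
    static void copyWithout(Reader source, Writer target, String prefix) throws IOException {
        BufferedReader br = new BufferedReader(source);
        PrintWriter pw = new PrintWriter(target);
        String line;
        while ((line = br.readLine()) != null) {
            if (line.startsWith(prefix)) {
                continue; // skip the unwanted line
            }
            // readLine() stripped the terminator; println() writes the platform one back.
            pw.println(line);
        }
        pw.flush(); // the caller owns the streams and is responsible for closing them
    }
}
```

With files, the caller would pass a FileReader and a FileWriter, ideally inside try-with-resources so both are closed even when an exception is thrown.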

Read multiple lines from InputStreamReader (JAVA)

I have an InputStreamReader object. I want to read multiple lines into a buffer/array using one function call (without creating a mass of String objects). Is there a simple way to do so?

First of all, note that reading directly from an InputStreamReader is not very efficient; you should wrap it in a BufferedReader for better performance.
With that in mind, you can do something like this:
public String readLines(InputStreamReader in)
{
BufferedReader br = new BufferedReader(in);
// you should estimate buffer size
StringBuffer sb = new StringBuffer(5000);
try
{
int linesPerRead = 100;
for (int i = 0; i < linesPerRead; ++i)
{
sb.append(br.readLine());
// placing newlines back because readLine() removes them
sb.append('\n');
}
}
catch (Exception e)
{
e.printStackTrace();
}
return sb.toString();
}
Mind that readLine() returns null if EOF is reached, so you should check for that and handle it.

If you have some delimiter for multiple lines, you can read that many characters using the read method with a length and an offset. Otherwise, using a StringBuilder to append each line read by a BufferedReader should work well for you without eating up too much temporary memory.
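A minimal sketch of the StringBuilder approach with the end-of-stream check in place might look like this. BatchLineReader and its parameters are hypothetical names. The method takes an already constructed BufferedReader, on the assumption that the caller wants to invoke it repeatedly: wrapping the same InputStreamReader in a new BufferedReader on every call could lose characters the previous wrapper had already buffered.

```java
import java.io.BufferedReader;
import java.io.IOException;

// Hypothetical helper class for illustration only.
class BatchLineReader {

    /** Reads up to maxLines lines and returns them joined with '\n'. */
    static String readLines(BufferedReader br, int maxLines) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < maxLines; i++) {
            String line = br.readLine();
            if (line == null) {
                break; // end of stream: stop instead of appending the text "null"
            }
            // readLine() removed the terminator, so put a newline back.
            sb.append(line).append('\n');
        }
        return sb.toString();
    }
}
```

An empty return value then signals that the stream is exhausted, as long as maxLines is positive.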
