I want to incrementally process the text written to an OutputStream as it is written.
For example, suppose we have this program:
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
public class Streaming {
// Writes file, incrementally, to OutputStream.
static void dump(File file, OutputStream out) throws IOException {
// Implementation omitted
}
static int sum = 0;
public static void main(String[] args) throws IOException {
Charset charSet = Charset.defaultCharset(); // Interpret the file as having this encoding.
dump(new File("file.txt"), new OutputStream() {
@Override
public void write(int b) throws IOException {
// Add b to bytes already read,
// Determine if we have reached the end of the token (using
// the default encoding),
// And parse the token and add it to `sum`
}
});
System.out.println("Sum: " + sum);
}
}
Suppose file.txt is a text file containing a space-delimited list of ints. In this program, I wish to find the sum of the ints in file.txt, accumulating the sum in the sum variable. I would like to avoid building up a String that is millions of characters long.
I'm interested in a way that I can accomplish this using the dump function, which writes the contents of a file to an output stream. I'm not interested in reading the file in another way (e.g. creating a Scanner for file.txt and repeatedly calling nextInt on the scanner). I'm imposing this restriction because I'm using a library that has an API similar to dump, where the client must provide an OutputStream, and the library subsequently writes a lot of text to the output stream.
How can I implement the write method to correctly perform the steps as outlined? I would like to avoid doing the tokenization by hand, since utilities like Scanner are already capable of doing tokenization, and I want to be able to handle any encoding of text (as specified by charSet). However, I can't use Scanner directly, because there's no way of checking (in a non-blocking way) if a token is available:
public static void main(String[] args) throws IOException {
Charset charSet = Charset.defaultCharset();
PipedInputStream in = new PipedInputStream();
try (Scanner sc = new Scanner(in, charSet)) {
dump(new File("file.txt"), new PipedOutputStream(in) {
@Override
public void write(byte[] b, int off, int len) throws IOException {
super.write(b, off, len);
// This will deadlock, because `hasNextInt` will block
// waiting for more piped input if there is no complete int token currently available.
if (sc.hasNextInt()) {
sum += sc.nextInt();
}
}
});
}
System.out.println("Sum: " + sum);
System.out.println(charSet);
}
Is there a non-blocking utility that can perform the tokenization for me as data is written to the output stream?
If I understand your question correctly, FilterOutputStream is what you want to subclass. DigestOutputStream extends FilterOutputStream and does something somewhat similar to what you want to do: it monitors the bytes as they come through and passes them to a different class for processing.
One solution that comes to mind is for the FilterOutputStream to pass the bytes to a PipedOutputStream, connected to a PipedInputStream which a different thread reads in order to create your sum:
PipedOutputStream sumSink = new PipedOutputStream();
Callable<Long> sumCalculator = new Callable<Long>() {
@Override
public Long call()
throws IOException {
long sum = 0;
PipedInputStream source = new PipedInputStream(sumSink);
try (Scanner scanner = new Scanner(source, charSet)) {
while (scanner.hasNextInt()) {
sum += scanner.nextInt();
}
}
return sum;
}
};
Future<Long> sumTask = ForkJoinPool.commonPool().submit(sumCalculator);
OutputStream dest = getTrueDestinationOutputStream();
dest = new FilterOutputStream(dest) {
@Override
public void write(int b)
throws IOException {
super.write(b);
sumSink.write(b);
}
@Override
public void write(byte[] b)
throws IOException {
super.write(b);
sumSink.write(b);
}
@Override
public void write(byte[] b,
int offset,
int len)
throws IOException {
super.write(b, offset, len);
sumSink.write(b, offset, len);
}
@Override
public void flush()
throws IOException {
super.flush();
sumSink.flush();
}
@Override
public void close()
throws IOException {
super.close();
sumSink.close();
}
};
dump(file, dest);
long sum = sumTask.get();
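Note that Future.get() throws the checked InterruptedException and ExecutionException, so the last line would typically be wrapped; a sketch (how the failure is surfaced depends on the enclosing method):
long sum;
try {
sum = sumTask.get();
} catch (InterruptedException | ExecutionException e) {
throw new IOException("sum calculation failed", e);
}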
As "idiomatic" approach, you might want a FilterOutputStream:
These streams sit on top of an already existing output stream (the underlying output stream) which it uses as its basic sink of data, but possibly transforming the data along the way or providing additional functionality.
At least to me, it sounds something like what you describe.
It is a concrete class (unlike OutputStream), so the absolute minimum you can get away with is to provide your constructor and an implementation for the single-byte write() (which is going to be invoked by the default implementations of other write() methods):
import java.io.FileOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
public class SumOutputStream extends FilterOutputStream {
public int sum = 0;
public SumOutputStream(OutputStream os) {
super(os);
}
private int num = 0;
@Override
public void write(int b) throws IOException {
if (b >= '0' && b <= '9') {
sum -= num;
num = num * 10 + b - '0';
sum += num;
} else {
num = 0;
}
out.write(b);
}
public static void main(String[] args) throws IOException {
try (SumOutputStream sos = new SumOutputStream(new FileOutputStream("test.txt"))) {
sos.write("123 456 78".getBytes());
System.out.println(sos.sum);
sos.write('9');
System.out.println(sos.sum);
}
}
}
This will sum whatever numbers pass through, keeping sum up to date at all times, even with partial results (that is what writing the 9 separately is supposed to show).
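To tie this back to the question's dump method, a usage sketch might look like this (OutputStream.nullOutputStream() requires Java 11; any destination stream will do):
SumOutputStream sos = new SumOutputStream(OutputStream.nullOutputStream());
dump(new File("file.txt"), sos);
System.out.println("Sum: " + sos.sum);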
Based on @tevemadar's answer. This reads in strings and tries to parse them as ints. When that fails, you know the number is done, and it is then added to the sum. The only problem is that my method doesn't add the last number when the input ends with it. To solve this, you could add a one-line method containing if(!currNumber.isEmpty()) sum += Integer.parseInt(currNumber); and call it once the file finishes (see the sketch after the code below).
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Objects;
class SumOutputStream extends FilterOutputStream {
public int sum = 0;
String currNumber = "";
String lastChar = "";
public SumOutputStream(OutputStream os){
super(os);
}
public void write(byte b[], int off, int len) throws IOException {
Objects.checkFromIndexSize(off, len, b.length);
for (int i = 0 ; i < len ; i++) {
try {
if(!lastChar.isEmpty()) {
Integer.parseInt(lastChar);
currNumber += lastChar;
}
} catch(NumberFormatException e) {
if(!currNumber.isEmpty()) sum += Integer.parseInt(currNumber);
currNumber = "";
} catch(NullPointerException e) {
e.printStackTrace();
}
write(b[off + i]);
lastChar = new String(b, off + i, 1); // decode only the current byte, not the whole buffer
}
}
}
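One way to flush that trailing number, as a sketch, is to override close() in the same class so the last token is added when the stream is closed:
@Override
public void close() throws IOException {
if (!currNumber.isEmpty()) {
sum += Integer.parseInt(currNumber);
currNumber = "";
}
super.close();
}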
Related
How can I write to a given Writer all the rows in the list that consist only of Latin letters or digits, and return the number of bytes successfully written before the first exception, using a lambda expression?
My code:
public int writeAllCountingBytesTransferred(Writer writer, List<String> list) {
return list.stream().filter(x -> x.matches("^[a-zA-Z0-9]+$") ).forEach(x-> {
try {
writer.write(x);
// how do I count bytes?
} catch (IOException e) {
// how do I return bytes?
}
});
}
I would map to an IntStream of the string lengths and sum() it:
return list.stream()
.filter(x -> x.matches("^[a-zA-Z0-9]+$"))
.mapToInt(x-> {
try {
writer.write(x);
return x.length();
} catch (IOException e) {
return 0;
}
})
.sum();
Note that when the strings are all Latin letters and digits, the char length and the byte length are the same (for ASCII-compatible encodings).
public class CountingWriter extends FilterWriter {
private long count;
public CountingWriter(Writer out) {
super(out);
}
/**
* The number of chars written. For ASCII this is the number of bytes.
* @return the char count.
*/
public long getCount() {
return count;
}
@Override
public void write(int c) throws IOException {
super.write(c);
++count;
}
@Override
public void write(char[] cbuf, int off, int len) throws IOException {
super.write(cbuf, off, len);
count += len;
}
@Override
public void write(String str, int off, int len) throws IOException {
super.write(str, off, len);
count += len;
}
}
CountingWriter cWriter = new CountingWriter(writer);
...
long count = cWriter.getCount();
A FilterWriter is a nice wrapper class that delegates to the original writer, and it can be used for this kind of filtering purpose.
ASCII implies that every char is written as a single byte (as long as the file's encoding is not UTF-16LE or similar).
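If the exact byte count matters for a non-ASCII encoding, a sketch of an alternative is to count bytes below the encoder: put a counting stream between the OutputStreamWriter and the destination. CountingOutputStream and destination are hypothetical names here; the class is a byte-level counterpart of the CountingWriter above:
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
class CountingOutputStream extends FilterOutputStream {
private long count;
CountingOutputStream(OutputStream out) {
super(out);
}
long getCount() {
return count;
}
@Override
public void write(int b) throws IOException {
super.write(b);
++count;
}
@Override
public void write(byte[] b, int off, int len) throws IOException {
// Write directly to the underlying stream so the bytes are not also counted by write(int).
out.write(b, off, len);
count += len;
}
}
Wrapping it under a Writer then counts encoded bytes:
CountingOutputStream countingOut = new CountingOutputStream(destination); // destination: your real OutputStream
Writer writer = new OutputStreamWriter(countingOut, StandardCharsets.UTF_8);
// ... write strings ...
writer.flush();
long bytesWritten = countingOut.getCount();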
I want to take input from a txt file and put all the characters into an array so I can run some regex functions on it. But when I try to read the array with a simple loop to check it, nothing appears. What is wrong here?
import java.io.BufferedReader;
import java.io.FileReader;
public class Main
{
public static void main(String[] args)
{
try
{
Task2.doTask2();
}catch(Exception e){};
}
}
class Task2
{
public static void doTask2() throws Exception
{
FileReader fr = new FileReader("F:\\Filip\\TextTask2.txt");
BufferedReader br = new BufferedReader(fr);
char[] sentence = null;
int i;
int j = 0;
while((i = br.read()) != -1)
{
sentence[j] = (char)i;
j++;
}
for(int g = 0; g < sentence.length; g++)
{
System.out.print(sentence[g]);
}
br.close();
fr.close();
}
}
You can read a file simply using Files.readAllBytes. Then it's not necessary to create separate readers.
String text = new String(
Files.readAllBytes(Paths.get("F:\\Filip\\TextTask2.txt"))
);
In the original snippet, the file reading function is throwing a NullPointerException because sentence was initialized to null and then dereferenced: sentence[j] = (char)i;
The exception was swallowed by the calling function and not printed, which is why you're not seeing it when you run the program: }catch(Exception e){};
Instead of swallowing the exception, declare the calling function as throwing the appropriate checked exception. That way you'll see the stack trace when you run it: public static void main(String[] args) throws Exception {
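Putting those two fixes together, a minimal corrected sketch of doTask2, using a StringBuilder instead of a fixed-size array since the number of characters isn't known up front:
public static void doTask2() throws Exception
{
try (BufferedReader br = new BufferedReader(new FileReader("F:\\Filip\\TextTask2.txt")))
{
StringBuilder sentence = new StringBuilder();
int i;
while ((i = br.read()) != -1)
{
sentence.append((char) i);
}
System.out.print(sentence);
}
}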
You are using the wrong index; use g instead of i here:
System.out.println(sentence[g]);
Also, the best and simplest way to do this is:
package io;
import java.nio.file.*;
public class ReadTextAsString
{
public static String readFileAsString(String fileName)throws Exception
{
return new String(Files.readAllBytes(Paths.get(fileName)));
}
public static void main(String[] args) throws Exception
{
String data = readFileAsString("F:\\Filip\\TextTask2.txt");
System.out.println(data); //or iterate through data if you want to print each character.
}
}
Today when I submitted a solution to Codeforces, I used an int[] array and my submission got TLE (Time Limit Exceeded); after changing it to an Integer[] array it surprisingly got AC. I don't understand how the performance improved.
import java.io.*;
import java.lang.reflect.Array;
import java.util.*;
public class Main {
static class Task {
public void solve(InputReader in, PrintWriter out) throws Exception {
int n = in.nextInt();
Integer[] a = new Integer[n];
for (int i = 0; i < n; i++) a[i] = in.nextInt();
Arrays.sort(a);
long count = 0;
for (int i = 0; i < n; i++) count += Math.abs(i + 1 - a[i]);
out.println(count);
}
}
public static void main(String[] args) throws Exception{
InputStream inputStream = System.in;
OutputStream outputStream = System.out;
InputReader in = new InputReader(inputStream);
PrintWriter out = new PrintWriter(outputStream);
Task task = new Task();
task.solve(in, out);
out.close();
}
static class InputReader {
public BufferedReader reader;
public StringTokenizer tokenizer;
public InputReader(InputStream stream) {
reader = new BufferedReader(new InputStreamReader(stream), 32768);
tokenizer = null;
}
public String next() {
while (tokenizer == null || !tokenizer.hasMoreTokens()) {
try {
tokenizer = new StringTokenizer(reader.readLine());
} catch (IOException e) {
throw new RuntimeException(e);
}
}
return tokenizer.nextToken();
}
public int nextInt() {
return Integer.parseInt(next());
}
}
}
The reason for it is quite simple: the time complexity of the solution with Integer is better.
Sounds strange, doesn't it?
Arrays.sort uses a dual-pivot quicksort for primitives, which runs in O(N^2) time in the worst case. The test case your solution fails on is a specially constructed anti-quicksort test.
However, the version for objects uses merge sort, which runs in O(N * log N) time for any possible input.
Note that this is not part of the Java language specification (it doesn't say how the sort method should be implemented), but it works like this in most real implementations (for example, it is the case for OpenJDK 8).
P.S. Things like this happen fairly often in competitive programming, so I'd recommend either sorting arrays of objects or using collections.
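Another common workaround in competitive programming is to shuffle the primitive array before sorting it, which defeats anti-quicksort tests with high probability; a sketch, assuming a plain int[] (uses java.util.Random and java.util.Arrays):
// Fisher-Yates shuffle followed by the usual primitive sort.
static void randomizedSort(int[] a) {
Random rnd = new Random();
for (int i = a.length - 1; i > 0; i--) {
int j = rnd.nextInt(i + 1);
int tmp = a[i];
a[i] = a[j];
a[j] = tmp;
}
Arrays.sort(a);
}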
I would like to send an entire integer array from Java to another program which is coded in C, and vice versa for receiving.
I have read here that I should use a short array in Java while using a normal int array in the C program.
Being new to Java, I'm still not very sure how to do this correctly. Below is my code:
package tcpcomm;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.Socket;
import java.net.UnknownHostException;
import java.util.Scanner;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import datatypes.Message;
public class PCClient {
public static final String C_ADDRESS = "192.168.1.1";
public static final int C_PORT = 8888;
private static PCClient _instance;
private Socket _clientSocket;
private ByteArrayOutputStream _toCProgram;
private ByteArrayInputStream _fromCProgram;
private PCClient() {
}
public static PCClient getInstance() {
if (_instance == null) {
_instance = new PCClient();
}
return _instance;
}
public static void main (String[] args) throws UnknownHostException, IOException {
int msg[] = {Message.READ_SENSOR_VALUES};
ByteArrayOutputStream out = new ByteArrayOutputStream();
PCClient pcClient = PCClient.getInstance();
pcClient.setUpConnection(C_ADDRESS, C_PORT);
System.out.println("C program successfully connected");
while (true) {
pcClient.sendMessage(out, msg);
ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());
int[] msgReceived = pcClient.readMessage(in);
}
}
public void setUpConnection (String IPAddress, int portNumber) throws UnknownHostException, IOException{
_clientSocket = new Socket(C_ADDRESS, C_PORT);
_toCProgram = new PrintWriter(_clientSocket.getOutputStream());
_fromCProgram = new Scanner(_clientSocket.getInputStream());
}
public void closeConnection() throws IOException {
if (!_clientSocket.isClosed()) {
_clientSocket.close();
}
}
public void sendMessage(OutputStream out, int[] msg) throws IOException {
int count = 0;
DataOutputStream dataOut = new DataOutputStream(out);
dataOut.writeInt(msg.length);
System.out.println("Message sent: ");
for (int e : msg) {
dataOut.writeInt(e);
System.out.print(e + " ");
if(count % 2 == 1)
System.out.print("\n");
count++;
}
dataOut.flush();
}
public int[] readMessage(InputStream in) throws IOException {
int count = 0;
DataInputStream dataIn = new DataInputStream(in);
int[] msg = new int[dataIn.readInt()];
System.out.println("Message received: ");
for (int i = 0; i < msg.length; ++i) {
msg[i] = dataIn.readInt();
System.out.print(msg[i] + " ");
if(count % 2 == 1)
System.out.print("\n");
count++;
}
return msg;
}
}
Any guidance on how to correct the code is appreciated!
Do I need to change the stream from a byte array to an int array? That sounds like a lot of conversion (based on answers I can find here).
To answer your question: No, you don't need to change streams from byte to int. Part of the nature of I/O streams in Java is that they are byte-based. So when you want to write an int to a stream, you need to split it into its bytes and write each of them.
Java uses 4-byte signed integers and you cannot change that. For your counterpart (your C program) you need to make sure that it uses an equivalent type, such as int32_t from stdint.h. This can be considered part of your "protocol".
Java already provides the means to read and write int, long, etc. values from and to streams with java.io.DataInputStream and java.io.DataOutputStream. It is important to keep in mind that DataOutputStream#writeInt(int) writes the four bytes of an int from high to low (big-endian). This is called endianness, but I'm pretty sure you know that already. It can also be considered part of your "protocol".
Lastly, when transmitting a structure, the endpoint that's reading must know how much it has to read. For fixed-size structures both sides (client and server) know the size already, but arrays can vary. So when sending, you need to tell your counterpart how much you will be sending. In the case of arrays it's pretty simple: just send the array length (another int) first. This can be considered yet another part of your "protocol".
I've written an example for writing and reading int arrays to and from streams.
package com.acme;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;
public class SendAndReceive {
public static void main(String[] args) throws IOException {
int[] ints = new int[] {Integer.MIN_VALUE, -1, 0, 1, Integer.MAX_VALUE};
ByteArrayOutputStream out = new ByteArrayOutputStream();
writeInts(out, ints);
ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());
int[] results = readInts(in);
if (!Arrays.equals(ints, results)) System.out.println("Damn!");
else System.out.println("Aaall's well!");
}
private static void writeInts(OutputStream out, int[] ints) throws IOException {
DataOutputStream dataOut = new DataOutputStream(out);
dataOut.writeInt(ints.length);
for (int e : ints) dataOut.writeInt(e);
dataOut.flush();
}
private static int[] readInts(InputStream in) throws IOException {
DataInputStream dataIn = new DataInputStream(in);
int[] ints = new int[dataIn.readInt()];
for (int i = 0; i < ints.length; ++i) ints[i] = dataIn.readInt();
return ints;
}
}
All you need to do is use the methods writeInts(OutputStream, int[]) and readInts(InputStream) and call them with your socket streams.
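If the C side reads raw ints in native byte order (typically little-endian) instead of converting from network byte order with ntohl, one possible variant is a ByteBuffer with an explicit ByteOrder; a sketch using the same length-prefix protocol (uses java.nio.ByteBuffer and java.nio.ByteOrder in addition to the imports above):
private static void writeIntsLittleEndian(OutputStream out, int[] ints) throws IOException {
// Length prefix plus payload, all encoded as 4-byte little-endian ints.
ByteBuffer buf = ByteBuffer.allocate(4 * (ints.length + 1)).order(ByteOrder.LITTLE_ENDIAN);
buf.putInt(ints.length);
for (int e : ints) buf.putInt(e);
out.write(buf.array());
out.flush();
}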
I've not implemented a C program for demonstration purposes, because there's no standard socket implementation (although there's the Boost library, but it's not standard and not C but C++).
Have fun!
Java has LineNumberReader which lets me keep track of the line I am on, but how do I keep track of the byte (or char) position in a stream?
I want something similar to lseek(<fd>,0,SEEK_CUR) for files in C.
EDIT:
I am reading a file using LineNumberReader in = new LineNumberReader(new FileReader(file)) and I want to be able to print something like "processed XX% of the file" every now and then. The easiest way I know is to look at the file.length() first and divide the current file position by it.
I suggest extending FilterInputStream as follows:
public class ByteCountingInputStream extends FilterInputStream {
private long position = 0;
public ByteCountingInputStream(InputStream in) {
super(in);
}
public long getPosition() {
return position;
}
@Override
public int read() throws IOException {
int byteRead = super.read();
if (byteRead > 0) {
position++;
}
return byteRead;
}
@Override
public int read(byte[] b) throws IOException {
int bytesRead = super.read(b);
if (bytesRead > 0) {
position += bytesRead;
}
return bytesRead;
}
@Override
public int read(byte[] b, int off, int len) throws IOException {
int bytesRead = super.read(b, off, len);
if (bytesRead > 0) {
position += bytesRead;
}
return bytesRead;
}
@Override
public long skip(long n) throws IOException {
long skipped;
skipped = super.skip(n);
position += skipped;
return skipped;
}
@Override
public synchronized void mark(int readlimit) {
return;
}
@Override
public synchronized void reset() throws IOException {
return;
}
@Override
public boolean markSupported() {
return false;
}
}
And you would use it like this:
File f = new File("filename.txt");
ByteCountingInputStream bcis = new ByteCountingInputStream(new FileInputStream(f));
LineNumberReader lnr = new LineNumberReader(new InputStreamReader(bcis));
int chars = 0;
String line;
while ((line = lnr.readLine()) != null) {
chars += line.length() + 2;
System.out.println("Chars read: " + chars);
System.out.println("Bytes read: " + bcis.getPosition());
}
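To get the "processed XX% of the file" output from the question, a sketch of one option is to divide the byte position by the file length inside the loop:
long percent = 100 * bcis.getPosition() / f.length();
System.out.println("Processed " + percent + "% of the file");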
You will notice a few things:
This version counts bytes because it is an InputStream.
It might just be easier to count the characters or bytes yourself in the client code.
This code will count bytes as soon as they are read from the filesystem into a buffer, even if they haven't been processed by the LineNumberReader. You could count characters in a subclass of LineNumberReader instead to get around this. Unfortunately, you can't easily produce a percentage that way because, unlike bytes, there is no cheap way to know the number of characters in a file.
The ByteCountingInputStream solution has the drawback that it counts input bytes even before they are processed by the LineNumberReader. That was not what I needed for my reporting, so I came up with an alternative. I assume the input file to be ASCII text with Unix-style line endings (a single LF character).
I have built a wrapper that exposes a subset of LineNumberReader and adds position reporting:
import java.io.*;
public class FileLineNumberReader {
private final LineNumberReader lnr;
private final long length;
private long pos;
public FileLineNumberReader(String path) throws IOException {
lnr = new LineNumberReader(new FileReader(path));
length = new File(path).length();
}
public long getLineNumber() {
return lnr.getLineNumber();
}
public String readLine() throws IOException {
String res = lnr.readLine();
if (res != null) {
pos += res.length() + 1;
}
return res;
}
public long getPercent() {
return 100*pos/length;
}
}
Note that this class hides many methods defined for the encapsulated LineNumberReader, which are not relevant for my purposes.
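A possible usage sketch (the file name is made up):
FileLineNumberReader reader = new FileLineNumberReader("/tmp/data.txt");
String line;
while ((line = reader.readLine()) != null) {
if (reader.getLineNumber() % 1000 == 0) {
System.out.println("processed " + reader.getPercent() + "% of the file");
}
}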