Remove or ignore character from reader

Remove or ignore character from reader - java

i am reading all characters into stream. I am reading it with inputStream.read. This is java.io.Reader inputStream.
How can i ignore special characters like # when reading into buffer.
code
private final void FillBuff() throws java.io.IOException
{
int i;
if (maxNextCharInd == 4096)
maxNextCharInd = nextCharInd = 0;
try {
if ((i = inputStream.read(nextCharBuf, maxNextCharInd,
4096 - maxNextCharInd)) == -1)
{
inputStream.close();
throw new java.io.IOException();
}
else
maxNextCharInd += i;
return;
}
catch(java.io.IOException e) {
if (bufpos != 0)
{
--bufpos;
backup(0);
}
else
{
bufline[bufpos] = line;
bufcolumn[bufpos] = column;
}
throw e;
}
}

You can use a custom FilterReader.
class YourFilterReader extends FilterReader{
#Override
public int read() throws IOException{
int read;
do{
read = super.read();
} while(read == '#');
return read;
}
#Override
public int read(char[] cbuf, int off, int len) throws IOException{
int read = super.read(cbuf, off, len);
if (read == -1) {
return -1;
}
int pos = off - 1;
for (int readPos = off; readPos < off + read; readPos++) {
if (read == '#') {
continue;
} else {
pos++;
}
if (pos < readPos) {
cbuf[pos] = cbuf[readPos];
}
}
return pos - off + 1;
}
}
Resources :
Javadoc - FilterReader
BCRDF - Skipping Invalid XML Character with ReaderFilter
On the same topic :
filter/remove invalid xml characters from stream

All those readers, writers and streams implement the Decorator pattern. Each decorator adds additional behaviour and functionality to the underlying implementation.
A solution for you requirement could be a FilterReader:
public class FilterReader implements Readable, Closeable {
private Set<Character> blacklist = new HashSet<Character>();
private Reader reader;
public FilterReader(Reader reader) {
this.reader = reader;
}
public void addFilter(char filtered) {
blacklist.add(filtered);
}
#Override
public void close() throws IOException {reader.close();}
#Override
public int read(char[] charBuf) {
char[] temp = new char[charBuf.length];
int charsRead = reader.read(temp);
int index = -1;
if (!(charsRead == -1)) {
for (char c:temp) {
if (!blacklist.contains(c)) {
charBuf[index] = c;
index++;
}
}
}
return index;
}
}
Note - the class java.io.FilterReader is a decorator with zero functionality. You can extend it or just ignore it and create your own decorator (which I prefer in this case).

You could implement an own inputstream derived from InputStream. Then override the read methods so that they filter a special character out of the stream.

private final void FillBuff() throws java.io.IOException
{
int i;
if (maxNextCharInd == 4096)
maxNextCharInd = nextCharInd = 0;
try {
Reader filterReader = new FilterReader(inputStream) {
public int read() {
do {
result = super.read();
} while (specialCharacter(result));
return result;
}
};
if ((i = filterReader.read(nextCharBuf, maxNextCharInd,
4096 - maxNextCharInd)) == -1)
{
inputStream.close();
throw new java.io.IOException();
}
else
maxNextCharInd += i;
return;
}
catch(java.io.IOException e) {
if (bufpos != 0)
{
--bufpos;
backup(0);
}
else
{
bufline[bufpos] = line;
bufcolumn[bufpos] = column;
}
throw e;
}
}

Related

Write in StringBuilder - a value that is not char

I do not usually ask here.
I have a problem with the code I wrote down - I built a compression code - implementation of LZ77.
Everything works in code - when I use files that are based on ascii text in English.
When I use a bmp file - which works differently from a plain text file - I have a problem.
In a text file - I can write the character as it is - it works.
In the bmp file - when I try to compress it - I come across characters that are not English text letters - so I can not compress the file
That I'm trying to write a letter in English into a String builder it works - but in other characters - I can not write them inside the stringBuilder - as I try to write them - it performs null.
code:
main:
import java.io.IOException;
public class Main {
public static void main(String[] args) throws IOException, ClassNotFoundException
{
String inPath = "C:\\Users\\avraam\\Documents\\final-assignment\\LZ77\\Tiny24bit.bmp";
String outPath = "C:\\Users\\avraam\\Documents\\final-assignment\\LZ77\\encoded.txt";
String decompressedPath = "C:\\Users\\avraam\\Documents\\final-assignment\\LZ77\\decoded.bmp";
int windowSize = 512;
int lookaheadBufferSize = 200;
LZ77 compress = new LZ77(inPath,outPath,windowSize,lookaheadBufferSize);
compress.compress();
LZ77 decompress = new LZ77(outPath,decompressedPath,windowSize,lookaheadBufferSize);
decompress.decompress();
System.out.println("DONE!");
}
}
LZ77
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Writer;
import java.nio.file.Files;
import java.util.BitSet;
public class LZ77 {
private String inPath = null;
private String outPath = null;
private File inFile;
private File outFile;
private final int windowSize;
private final int lookaheadBufferSize;
private final int searchBufferSize;
private int nextByteIndex = 0;
private int nextBitIndex = 0;
private int currentSearchBufferSize = 0;
private int currentLookaheadBufferSize = 0;
private int appendToWindowBuffer = 0;
private byte[] source = null;
public LZ77(String inPath,String outPath,int windowSize,int lookaheadBufferSize) throws IOException
{
this.inPath = inPath;
this.outPath = outPath;
this.inFile = new File(inPath);
this.outFile = new File(outPath);
this.windowSize = windowSize;
this.lookaheadBufferSize = lookaheadBufferSize;
this.searchBufferSize = windowSize - lookaheadBufferSize;
this.source = Files.readAllBytes(inFile.toPath());
}
public void compress() throws IOException
{
StringBuilder dictionary = new StringBuilder();
bufferInitialize(dictionary);
StringBuilder compressed = new StringBuilder();
encode(dictionary,compressed);
addSizeBitsMod64(compressed);
//System.out.println(compressed);
writeFile(compressed);
}
public void bufferInitialize(StringBuilder dictionary)
{
for (int i = 0; i < lookaheadBufferSize; i++) {
if(source.length>nextByteIndex) {
dictionary.append((char)Byte.toUnsignedInt(source[nextByteIndex]));
nextByteIndex++;
currentLookaheadBufferSize++;
}
else
{
break;
}
}
}
public void encode(StringBuilder dictionary,StringBuilder compressed)
{
while(currentLookaheadBufferSize > 0)
{
Match match = findMatch(dictionary);
WriteMatch(compressed,match.offset,match.length,dictionary.charAt(currentSearchBufferSize + match.length));
appendToWindowBuffer = increaseBuffer(match.length);
appendBuffer(dictionary);
}
}
public Match findMatch(StringBuilder dictionary)
{
Match match= new Match(0,0, "");
String matchedString = null;
int offset;
int matchLookAheadIndex = currentSearchBufferSize;
if(!haveAnyMatch(dictionary))
{
}
else {
matchedString = "" + dictionary.charAt(matchLookAheadIndex);
offset = findMatchIndex(dictionary,matchedString);
while(offset != -1 && matchLookAheadIndex < dictionary.length() - 1)
{
match.SetLength(match.length + 1);
match.SetOffset(offset);
match.SetValue(matchedString);
matchLookAheadIndex++;
matchedString +=dictionary.charAt(matchLookAheadIndex);
offset = findMatchIndex(dictionary,matchedString);
}
}
return match;
}
public int findMatchIndex(StringBuilder dictionary,String value)
{
int stringLength = value.length();
String tmpMatch = null;
int offsetMatch;
for (int i = currentSearchBufferSize - 1; i >=0; i--)
{
tmpMatch = dictionary.substring(i, i +stringLength );
offsetMatch = currentSearchBufferSize - i;
if(tmpMatch.equals(value))
{
return offsetMatch;
}
}
return -1;
}
public boolean haveAnyMatch(StringBuilder dictionary)
{
if (currentSearchBufferSize == 0)
{
return false;
}
if(!isExistInSearchBuffer(dictionary,dictionary.charAt(currentSearchBufferSize)))
{
return false;
}
return true;
}
public boolean isExistInSearchBuffer(StringBuilder dictionary, char isCharAtDictionary)
{
for (int i = 0; i < currentSearchBufferSize; i++) {
if(dictionary.charAt(i) == isCharAtDictionary)
{
return true;
}
}
return false;
}
public int increaseBuffer(int matchLength)
{
return 1 + matchLength;
}
public int findBitSize(int decimalNumber) {
if(decimalNumber >= 256)
{
return 16;
}
else
{
return 8;
}
}
public void convertStringToBitSet(StringBuilder compressed,BitSet encodedBits)
{
for (int i = 0; i < compressed.length(); i++) {
if(compressed.charAt(i)==1)
{
encodedBits.set(i);
}
}
}
public BitSet ConvertToBits(StringBuilder compressed)
{
BitSet encodedBits = new BitSet(compressed.length());
int nextIndexOfOne = compressed.indexOf("1", 0);
while( nextIndexOfOne != -1)
{
encodedBits.set(nextIndexOfOne);
nextIndexOfOne++;
nextIndexOfOne = compressed.indexOf("1", nextIndexOfOne);
}
return encodedBits;
}
public void writeFile(StringBuilder compressed) throws IOException
{
BitSet encodedBits = new BitSet(compressed.length());
encodedBits = ConvertToBits(compressed);
FileOutputStream writer = new FileOutputStream(this.outPath);
ObjectOutputStream objectWriter = new ObjectOutputStream(writer);
objectWriter.writeObject(encodedBits);
objectWriter.close();
}
public void appendBuffer(StringBuilder dictionary)
{
for (int i = 0; i < appendToWindowBuffer && i < source.length; i++) {
if(ableDeleteChar(dictionary))
{
dictionary.deleteCharAt(0);
}
if(nextByteIndex<source.length)
{
char nextByte = (char)Byte.toUnsignedInt(source[nextByteIndex]);
dictionary.append(nextByte);
nextByteIndex++;
}
else
{
currentLookaheadBufferSize--;
}
if(currentSearchBufferSize < searchBufferSize)
{
currentSearchBufferSize++;
}
}
appendToWindowBuffer = 0;
}
public void WriteMatch(StringBuilder compressed,int offset, int length, char character)
{
/*int offsetBitSizeCheck, lengthBitSizeCheck;
offsetBitSizeCheck = findBitSize(offset);
lengthBitSizeCheck = findBitSize(length);
*/
String offsetInBits = writeInt(offset);
String LengthInBits = writeInt(length);
String characterInBits = writeChar(character);
String totalBits = offsetInBits + LengthInBits + characterInBits;
compressed.append(totalBits);
//compressed.append("<"+ offset + ","+ length +","+ character + ">");
System.out.print("<"+ offset + ","+ length +","+ character + ">");
}
public String writeInt(int decimalNumber)
{
int BitSizeCheck = findBitSize(decimalNumber);
StringBuilder binaryString = new StringBuilder();
binaryString.append(convertNumToBinaryString(decimalNumber));
while (binaryString.length() < BitSizeCheck)
{
binaryString.insert(0, "0");
}
if(BitSizeCheck == 8)
{
binaryString.insert(0, "0");
}
else
{
binaryString.insert(0, "1");
}
return binaryString.toString();
}
public String convertNumToBinaryString(int decimalNumber)
{
return Integer.toString(decimalNumber, 2);
}
public String writeChar(char character)
{
StringBuilder binaryString = new StringBuilder();
binaryString.append(convertNumToBinaryString((int)character));
while (binaryString.length() < 8)
{
binaryString.insert(0, "0");
}
return binaryString.toString();
}
public boolean ableDeleteChar(StringBuilder dictionary)
{
if(dictionary.length() == windowSize )
{
return true;
}
if(currentLookaheadBufferSize < lookaheadBufferSize)
{
if(currentSearchBufferSize == searchBufferSize)
{
return true;
}
}
return false;
}
public void addSizeBitsMod64(StringBuilder compressed)
{
int bitsLeft = compressed.length()%64;
String bitsLeftBinary = writeInt(bitsLeft);
compressed.insert(0, bitsLeftBinary);
}
public void decompress () throws ClassNotFoundException, IOException
{
BitSet source = readObjectFile();
//System.out.println(source.toString());
StringBuilder decompress = new StringBuilder ();
int bitSetLength = findLengthBitSet(source);
decode(decompress,bitSetLength,source);
writeDecode(decompress);
}
public BitSet readObjectFile() throws IOException, ClassNotFoundException
{
FileInputStream input = new FileInputStream(this.inPath);
ObjectInputStream objectInput = new ObjectInputStream(input);
BitSet restoredDataInBits = (BitSet) objectInput.readObject();
objectInput.close();
return restoredDataInBits;
}
public void decode(StringBuilder decompress, int bitSetLength,BitSet source)
{
System.out.println("decode: ");
System.out.println();
while(nextBitIndex < bitSetLength)
{
Match match = convertBitsToMatch(source);
//System.out.print("<"+ match.offset + ","+ match.length +","+ match.value + ">");
addDecode(decompress, match);
}
}
public void addDecode(StringBuilder decompress, Match match)
{
int RelativeOffset;
char decodeChar;
if(match.length == 0 && match.offset == 0)
{
decompress.append(match.value);
}
else
{
RelativeOffset = decompress.length() - match.offset;
System.out.println(RelativeOffset);
for (int i = 0; i < match.length; i++) {
decodeChar = decompress.charAt(RelativeOffset);
decompress.append(decodeChar);
RelativeOffset++;
}
decompress.append(match.value);
}
}
public Match convertBitsToMatch(BitSet source)
{
int offset;
int length;
char character;
if(source.get(nextBitIndex) == false)
{
nextBitIndex++;
offset = findOffsetLengthMatch(8,source);
}
else
{
nextBitIndex++;
offset = findOffsetLengthMatch(16,source);
}
if(source.get(nextBitIndex) == false)
{
nextBitIndex++;
length = findOffsetLengthMatch(8,source);
}
else
{
nextBitIndex++;
length = findOffsetLengthMatch(16,source);
}
character = findCharacterMatch(source);
//System.out.println("offset: " + offset + " length: " + length);
Match match = new Match(length,offset,""+character);
System.out.print("<"+ match.offset + ","+ match.length +","+ match.value + ">");
return match;
}
public int findOffsetLengthMatch(int index, BitSet source)
{
StringBuilder offsetLengthBinary = new StringBuilder();
for (int i = 0; i < index; i++) {
if(source.get(nextBitIndex) == false)
{
offsetLengthBinary.append('0');
nextBitIndex++;
}
else
{
offsetLengthBinary.append('1');
nextBitIndex++;
}
}
int offsetLengthDecimal = convertBinaryStringToDecimal(offsetLengthBinary);
//System.out.println("problem here: " + offsetLengthDecimal + " the binary is : " + offsetLengthBinary);
return offsetLengthDecimal;
}
public char findCharacterMatch(BitSet source)
{
StringBuilder charBinary = new StringBuilder();
for (int i = 0; i < 8; i++) {
if(source.get(nextBitIndex) == false)
{
charBinary.append('0');
nextBitIndex++;
}
else
{
charBinary.append('1');
nextBitIndex++;
}
}
char charDecimal = (char)convertBinaryStringToDecimal(charBinary);
return charDecimal;
}
public int findLengthBitSet(BitSet source)
{
StringBuilder lengthBinary = new StringBuilder();
for (int i = 0; i < 9; i++) {
if(source.get(i) == false)
{
lengthBinary.append('0');
nextBitIndex++;
}
else
{
lengthBinary.append('1');
nextBitIndex++;
}
}
int lengthModule = convertBinaryStringToDecimal(lengthBinary);
int lengthNotUsed = 64 - lengthModule;
int fullLength = source.size() - lengthNotUsed + 9 ;
return fullLength;
}
public int convertBinaryStringToDecimal(StringBuilder lengthBinary)
{
int length = Integer.parseInt(lengthBinary.toString(), 2);
//System.out.println("length: " + length + "lengthBinary: " + lengthBinary);
return length;
}
public void writeDecode (StringBuilder decompress) throws IOException
{
Writer write = new FileWriter(this.outFile);
write.write(decompress.toString());
write.close();
}
}
Match
public class Match {
protected int length;
protected int offset;
protected String value;
public Match(int length, int offset, String value)
{
this.length=length;
this.offset=offset;
this.value = value;
}
public void SetOffset(int offset) { this.offset = offset; }
public void SetLength(int length) { this.length = length; }
public void SetValue(String value) { this.value = value; }
public void AddValue(char value) { this.value += value; }
public void Reset()
{
this.offset = 0;
this.length = 0;
this.value = "";
}
}

How to fix Stack Overflow from recursive getHeight method

When I run my code for a method calculating the height of a binary search tree, it results in a Stack Overflow error, but only for trees with more than one node (BSTElements in my program). I have read that this is due to a faulty recursive call, but cannot identify the problem in my code.
public int getHeight() {
return getHeight(this.getRoot());
}
private int getHeight(BSTElement<String,MorseCharacter> element) {
int height=0;
if (element == null) {
return -1;
}
int leftHeight = getHeight(element.getLeft());
int rightHeight = getHeight(element.getRight());
if (leftHeight > rightHeight) {
height = leftHeight;
} else {
height = rightHeight;
}
return height +1;
}
Here is full code:
public class MorseCodeTree {
private static BSTElement<String, MorseCharacter> rootElement;
public BSTElement<String, MorseCharacter> getRoot() {
return rootElement;
}
public static void setRoot(BSTElement<String, MorseCharacter> newRoot) {
rootElement = newRoot;
}
public MorseCodeTree(BSTElement<String,MorseCharacter> element) {
rootElement = element;
}
public MorseCodeTree() {
rootElement = new BSTElement("Root", "", new MorseCharacter('\0', null));
}
public int getHeight() {
return getHeight(this.getRoot());
}
private int getHeight(BSTElement<String,MorseCharacter> element) {
if (element == null) {
return -1;
} else {
int leftHeight = getHeight(element.getLeft());
int rightHeight = getHeight(element.getRight());
if (leftHeight < rightHeight) {
return rightHeight + 1;
} else {
return leftHeight + 1;
}
}
}
public static boolean isEmpty() {
return (rootElement == null);
}
public void clear() {
rootElement = null;
}
public static void add(BSTElement<String,MorseCharacter> newElement) {
BSTElement<String, MorseCharacter> target = rootElement;
String path = "";
String code = newElement.getKey();
for (int i=0; i<code.length(); i++) {
if (code.charAt(i)== '.') {
if (target.getLeft()!=null) {
target=target.getLeft();
} else {
target.setLeft(newElement);
target=target.getLeft();
}
} else {
if (target.getRight()!=null) {
target=target.getRight();
} else {
target.setRight(newElement);
target=target.getRight();
}
}
}
MorseCharacter newMorseChar = newElement.getValue();
newElement.setLabel(Character.toString(newMorseChar.getLetter()));
newElement.setKey(Character.toString(newMorseChar.getLetter()));
newElement.setValue(newMorseChar);
}
public static void main(String[] args) {
MorseCodeTree tree = new MorseCodeTree();
BufferedReader reader;
try {
reader = new BufferedReader(new FileReader(file));
String line = reader.readLine();
while (line != null) {
String[] output = line.split(" ");
String letter = output[0];
MorseCharacter morseCharacter = new MorseCharacter(letter.charAt(0), output[1]);
BSTElement<String, MorseCharacter> bstElement = new BSTElement(letter, output[1], morseCharacter);
tree.add(bstElement);
line = reader.readLine();
System.out.println(tree.getHeight());
}
reader.close();
} catch (IOException e) {
System.out.println("Exception" + e);
}

There doesn't appear to be anything significantly wrong1 with the code that you have shown us.
If this code is giving a StackOverflowException for a small tree, that most likely means that your tree has been created incorrectly and has a cycle (loop) in it. If your recursive algorithm encounters a cycle in the "tree" it will loop until the stack overflows2.
To be sure of this diagnosis, we need to see an MVCE which includes all code needed to construct an example tree that exhibits thie behavior.
1 - There is possibly an "off by one" error in the height calculation, but that won't cause a stack overflow.
2 - Current Java implementations do not do tail-call optimization.

Lucene 6.1 Custom Tokenizer and Analyzer

I'm asking for some help with Lucene 6.1 API.
I tried to extend Lucene's Tokenizer and Analyzer, but I don't understand all guides. In all tutorials, User's Tokenizer overrides the increment. In constructor they have Reader class and in User's Analyzer class they override createComponents method. But in Lucene it has only 1 String argument, so how can I add Reader to my Analyzer?
My code:
public class ChemTokenizer extends Tokenizer{
protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
protected String stringToTokenize;
protected int position = 0;
protected List<int[]> chemicals = new ArrayList<>();
#Override
public boolean incrementToken() throws IOException {
// Clear anything that is already saved in this.charTermAttribute
this.charTermAttribute.setEmpty();
// Get the position of the next symbol
int nextIndex = -1;
Pattern p = Pattern.compile("[^A-zА-я]");
Matcher m = p.matcher(stringToTokenize.substring(position));
nextIndex = m.start();
// Did we lose chemicals?
for (int[] pair: chemicals) {
if (pair[0] < nextIndex && pair[1] > nextIndex) {
//We are in the chemical name
if (position == pair[0]) {
nextIndex = pair[1];
}
else {
nextIndex = pair[0];
}
}
}
// Next separator was found
if (nextIndex != -1) {
String nextToken = stringToTokenize.substring(position, nextIndex);
charTermAttribute.append(nextToken);
position = nextIndex + 1;
return true;
}
// Last part of text
else if (position < stringToTokenize.length()) {
String nextToken = stringToTokenize.substring(position);
charTermAttribute.append(nextToken);
position = stringToTokenize.length();
return true;
}
else {
return false;
}
}
public ChemTokenizer(Reader reader,List<String> additionalKeywords) {
int numChars;
char[] buffer = new char[1024];
StringBuilder stringBuilder = new StringBuilder();
try {
while ((numChars =
reader.read(buffer, 0, buffer.length)) != -1) {
stringBuilder.append(buffer, 0, numChars);
}
}
catch (IOException e) {
throw new RuntimeException(e);
}
stringToTokenize = stringBuilder.toString();
//Checking for keywords
//Doesnt work properly if text has chemical synonyms
for (String keyword: additionalKeywords) {
int[] tmp = new int[2];
//Start of keyword
tmp[0] = stringToTokenize.indexOf(keyword);
tmp[1] = tmp[0] + keyword.length() - 1;
chemicals.add(tmp);
}
}
/* Reset the stored position for this object when reset() is called.
*/
#Override
public void reset() throws IOException {
super.reset();
position = 0;
chemicals = new ArrayList<>();
}
}
And code for Analyzer:
public class ChemAnalyzer extends Analyzer{
List<String> additionalKeywords;
public ChemAnalyzer(List<String> ad) {
additionalKeywords = ad;
}
#Override
protected TokenStreamComponents createComponents(String s, Reader reader) {
Tokenizer tokenizer = new ChemTokenizer(reader,additionalKeywords);
TokenStream filter = new LowerCaseFilter(tokenizer);
return new TokenStreamComponents(tokenizer, filter);
}
}
The problem is that this code doesn't work with Lucene 6

This is what I found in github search, guess you have to create a new tokenizer with out read.
#Override
protected TokenStreamComponents createComponents(String fieldName) {
return new TokenStreamComponents(new WhitespaceTokenizer()); }

Cache file in memory and read in parallel

I've a program (simple log parser) that's so slow couse in some cases it had to full scan input file. So I think to pre-cache the entire file (~100MB) in and read it with multiple thread.
With actual configuration I use the BufferedReader to do the "main read" and RandomAccessFile to goto onto specific offset and read what I need.
I've tried this way:
..
Reader reader = null;
if (cache) {
// caching file in memory
br = new BufferedReader(new FileReader(file));
buffer = new StringBuilder();
for (String line = br.readLine(); line != null; line = br.readLine()) {
buffer.append(line).append(CR);
}
br.close();
reader = new StringReader(buffer.toString());
} else {
reader = new FileReader(file);
}
br = new BufferedReader(reader);
for (String line = br.readLine(); line != null; line = br.readLine()) {
offset += line.length() + 1; // Il +1 è per il line.separator
matcher = Constants.PT_BEGIN_COMPOSITION.matcher(line);
if (matcher.matches()) {
linecount++;
record = new Record();
record.setCompositionCode(matcher.group(1));
matcher = Constants.PT_PREFIX.matcher(line);
if (matcher.matches()) {
record.setBeginComposition(Constants.SDF_DATE.parse(matcher.group(1)));
record.setProcessId(matcher.group(2));
if (cache) {
executor.submit(new PubblicationParser(buffer, offset, record));
} else {
executor.submit(new PubblicationParser(file, offset, record));
}
records.add(record);
} else {
br.close();
throw new ParseException(line, 0);
}
}
}
In the PubblicationParser there is a init() method that choose what custom reader to use. A RandomAccessFileReader:
if (file != null) {
this.logReader = new RandomAccessFileReader(file, offset);
} else if (sb != null) {
this.logReader = new StringBuilderReader(sb, (int) offset);
}
And this is my 2 custom reader:
//
public class StringBuilderReader implements LogReader {
public static final String CR = System.getProperty("line.separator");
private final StringBuilder sb;
private int offset;
public StringBuilderReader(StringBuilder sb, int offset) {
super();
this.sb = sb;
this.offset = offset;
}
#Override
public String readLine() throws IOException {
if (offset >= sb.length()) {
return null;
}
int indexOf = sb.indexOf(CR, offset);
if (indexOf < 0) {
indexOf = sb.length();
}
String substring = sb.substring(offset, indexOf);
offset = indexOf + CR.length();
return substring;
}
#Override
public void close() throws IOException {
// TODO Auto-generated method stub
}
}
//
public class RandomAccessFileReader implements LogReader {
private static final String FILEMODE_R = "r";
private final RandomAccessFile raf;
public RandomAccessFileReader(File file, long offset) throws IOException {
this.raf = new RandomAccessFile(file, FILEMODE_R);
this.raf.seek(offset);
}
#Override
public void close() throws IOException {
raf.close();
}
#Override
public String readLine() throws IOException {
return raf.readLine();
}
}
The problem is that the "cache way" is so slow and I understand why!

You should be making sure that it is indeed the I/O making your application slow, not something else (e.g inefficient logic in your parser). For that, you could use a Java profiler (JProfiler, for example).
If it is indeed I/O, then it might be better to use some ready-made solution to load the file into memory - essentially that's what you are trying to implement yourself.
Have a look at MappedByteBuffer and ByteBuffer.

How to determine File Format? DOS/Unix/MAC

I have written the following method to detemine whether file in question is formatted with DOS/ MAC, or UNIX line endings.
I see at least 1 obvious issue:
1. i am hoping that i will get the EOL on the first run, say within first 1000 bytes. This may or may not happen.
I ask you to review this and suggest improvements which will lead to hardening the code and making it more generic.
THANK YOU.
new FileFormat().discover(fileName, 0, 1000);
and then
public void discover(String fileName, int offset, int depth) throws IOException {
BufferedInputStream in = new BufferedInputStream(new FileInputStream(fileName));
FileReader a = new FileReader(new File(fileName));
byte[] bytes = new byte[(int) depth];
in.read(bytes, offset, depth);
a.close();
in.close();
int thisByte;
int nextByte;
boolean isDos = false;
boolean isUnix = false;
boolean isMac = false;
for (int i = 0; i < (bytes.length - 1); i++) {
thisByte = bytes[i];
nextByte = bytes[i + 1];
if (thisByte == 10 && nextByte != 13) {
isDos = true;
break;
} else if (thisByte == 13) {
isUnix = true;
break;
} else if (thisByte == 10) {
isMac = true;
break;
}
}
if (!(isDos || isMac || isUnix)) {
discover(fileName, offset + depth, depth + 1000);
} else {
// do something clever
}
}

Your method seems unnecessarily complicated. Why not:
public class FileFormat {
public enum FileType { WINDOWS, UNIX, MAC, UNKNOWN }
private static final char CR = '\r';
private static final char LF = '\n';
public static FileType discover(String fileName) throws IOException {
Reader reader = new BufferedReader(new FileReader(fileName));
FileType result = discover(reader);
reader.close();
return result;
}
private static FileType discover(Reader reader) throws IOException {
int c;
while ((c = reader.read()) != -1) {
switch(c) {
case LF: return FileType.UNIX;
case CR: {
if (reader.read() == LF) return FileType.WINDOWS;
return FileType.MAC;
}
default: continue;
}
}
return FileType.UNKNOWN;
}
}
Which puts this in a static method that you can then call and use as:
switch(FileFormat.discover(fileName) {
case WINDOWS: ...
case MAC: ...
case UNKNOWN: ...
}

Here's a rough implementation that guesses the line ending type based on a simple majority and falls back on unknown in a worst-case scenario:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.EnumMap;
import java.util.Map;
import java.util.Scanner;
class LineEndings
{
private enum ExitState
{
SUCCESS, FAILURE;
}
public enum LineEndingType
{
DOS("Windows"), MAC("Mac OS Classic"), UNIX("Unix/Linux/Mac OS X"), UNKNOWN("Unknown");
private final String name;
private LineEndingType(String name)
{
this.name = name;
}
public String toString()
{
if (null == this.name) {
return super.toString();
}
else {
return this.name;
}
}
}
public static void main(String[] arguments)
{
ExitState exitState = ExitState.SUCCESS;
File inputFile = getInputFile();
if (null == inputFile) {
exitState = ExitState.FAILURE;
System.out.println("Error: No input file specified.");
}
else {
System.out.println("Determining line endings for: " + inputFile.getName());
try {
LineEndingType lineEndingType = getLineEndingType(inputFile);
System.out.println("Determined line endings: " + lineEndingType);
}
catch (java.io.IOException exception) {
exitState = ExitState.FAILURE;
System.out.println("Error: " + exception.getMessage());
}
}
switch (exitState) {
case SUCCESS:
System.exit(0);
break;
case FAILURE:
System.exit(1);
break;
}
}
private static File getInputFile()
{
File inputFile = null;
Scanner stdinScanner = new Scanner(System.in);
while (true) {
System.out.println("Enter the input file name:");
System.out.print(">> ");
if (stdinScanner.hasNext()) {
String inputFileName = stdinScanner.next();
inputFile = new File(inputFileName);
if (!inputFile.exists()) {
System.out.println("File not found.\n");
}
else if (!inputFile.canRead()) {
System.out.println("Could not read file.\n");
}
else {
break;
}
}
else {
inputFile = null;
break;
}
}
System.out.println();
return inputFile;
}
private static LineEndingType getLineEndingType(File inputFile)
throws java.io.IOException, java.io.FileNotFoundException
{
EnumMap<LineEndingType, Integer> lineEndingTypeCount =
new EnumMap<LineEndingType, Integer>(LineEndingType.class);
BufferedReader inputReader = new BufferedReader(new FileReader(inputFile));
LineEndingType currentLineEndingType = null;
while (inputReader.ready()) {
int token = inputReader.read();
if ('\n' == token) {
currentLineEndingType = LineEndingType.UNIX;
}
else if ('\r' == token) {
if (inputReader.ready()) {
int nextToken = inputReader.read();
if ('\n' == nextToken) {
currentLineEndingType = LineEndingType.DOS;
}
else {
currentLineEndingType = LineEndingType.MAC;
}
}
}
if (null != currentLineEndingType) {
incrementLineEndingType(lineEndingTypeCount, currentLineEndingType);
currentLineEndingType = null;
}
}
return getMostFrequentLineEndingType(lineEndingTypeCount);
}
private static void incrementLineEndingType(Map<LineEndingType, Integer> lineEndingTypeCount, LineEndingType targetLineEndingType)
{
Integer targetLineEndingCount = lineEndingTypeCount.get(targetLineEndingType);
if (null == targetLineEndingCount) {
targetLineEndingCount = 0;
}
else {
targetLineEndingCount++;
}
lineEndingTypeCount.put(targetLineEndingType, targetLineEndingCount);
}
private static LineEndingType getMostFrequentLineEndingType(Map<LineEndingType, Integer> lineEndingTypeCount)
{
Integer maximumEntryCount = Integer.MIN_VALUE;
Map.Entry<LineEndingType, Integer> mostFrequentEntry = null;
for (Map.Entry<LineEndingType, Integer> entry : lineEndingTypeCount.entrySet()) {
int entryCount = entry.getValue();
if (entryCount > maximumEntryCount) {
mostFrequentEntry = entry;
maximumEntryCount = entryCount;
}
}
if (null != mostFrequentEntry) {
return mostFrequentEntry.getKey();
}
else {
return LineEndingType.UNKNOWN;
}
}
}

There is a whole lot wrong with this. You need to understand the FileInputStream class better. Note that read is not guaranteed to read all the bytes you requested. offset is the offset into the array, not the file. And so on.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Remove or ignore character from reader - java

You could implement an own inputstream derived from InputStream. Then override the read methods so that they filter a special character out of the stream.

Related

Write in StringBuilder - a value that is not char

How to fix Stack Overflow from recursive getHeight method

Lucene 6.1 Custom Tokenizer and Analyzer

Cache file in memory and read in parallel

How to determine File Format? DOS/Unix/MAC

Categories

Resources