I have a program (a simple log parser) that is very slow because in some cases it has to do a full scan of the input file. So my idea is to pre-cache the entire file (~100 MB) in memory and read it with multiple threads.
With the current configuration I use a BufferedReader for the "main read" and a RandomAccessFile to seek to a specific offset and read what I need.
I've tried it this way:
..
Reader reader = null;
if (cache) {
// caching file in memory
br = new BufferedReader(new FileReader(file));
buffer = new StringBuilder();
for (String line = br.readLine(); line != null; line = br.readLine()) {
buffer.append(line).append(CR);
}
br.close();
reader = new StringReader(buffer.toString());
} else {
reader = new FileReader(file);
}
br = new BufferedReader(reader);
for (String line = br.readLine(); line != null; line = br.readLine()) {
offset += line.length() + 1; // the +1 accounts for the line separator
matcher = Constants.PT_BEGIN_COMPOSITION.matcher(line);
if (matcher.matches()) {
linecount++;
record = new Record();
record.setCompositionCode(matcher.group(1));
matcher = Constants.PT_PREFIX.matcher(line);
if (matcher.matches()) {
record.setBeginComposition(Constants.SDF_DATE.parse(matcher.group(1)));
record.setProcessId(matcher.group(2));
if (cache) {
executor.submit(new PubblicationParser(buffer, offset, record));
} else {
executor.submit(new PubblicationParser(file, offset, record));
}
records.add(record);
} else {
br.close();
throw new ParseException(line, 0);
}
}
}
In PubblicationParser there is an init() method that chooses which custom reader to use, a RandomAccessFileReader or a StringBuilderReader:
if (file != null) {
this.logReader = new RandomAccessFileReader(file, offset);
} else if (sb != null) {
this.logReader = new StringBuilderReader(sb, (int) offset);
}
And these are my two custom readers:
//
public class StringBuilderReader implements LogReader {
public static final String CR = System.getProperty("line.separator");
private final StringBuilder sb;
private int offset;
public StringBuilderReader(StringBuilder sb, int offset) {
super();
this.sb = sb;
this.offset = offset;
}
@Override
public String readLine() throws IOException {
if (offset >= sb.length()) {
return null;
}
int indexOf = sb.indexOf(CR, offset);
if (indexOf < 0) {
indexOf = sb.length();
}
String substring = sb.substring(offset, indexOf);
offset = indexOf + CR.length();
return substring;
}
@Override
public void close() throws IOException {
// TODO Auto-generated method stub
}
}
//
public class RandomAccessFileReader implements LogReader {
private static final String FILEMODE_R = "r";
private final RandomAccessFile raf;
public RandomAccessFileReader(File file, long offset) throws IOException {
this.raf = new RandomAccessFile(file, FILEMODE_R);
this.raf.seek(offset);
}
@Override
public void close() throws IOException {
raf.close();
}
@Override
public String readLine() throws IOException {
return raf.readLine();
}
}
The problem is that the "cache way" is very slow, and I don't understand why!
You should first make sure that it is indeed the I/O making your application slow, not something else (e.g. inefficient logic in your parser). For that, you could use a Java profiler (JProfiler, for example).
If it is indeed I/O, then it might be better to use a ready-made solution to load the file into memory; essentially that's what you are trying to implement yourself.
Have a look at MappedByteBuffer and ByteBuffer.
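For illustration, a minimal sketch of what the memory-mapped approach could look like (MappedLogReader, mapWholeFile and readChunk are made-up names for this example; the offsets would be the ones your main scan computes):
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MappedLogReader {

    // Map the whole ~100 MB file once; the OS pages it in on demand, and every
    // worker thread can read through its own read-only view of the buffer.
    public static ByteBuffer mapWholeFile(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r");
             FileChannel channel = raf.getChannel()) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    // What a worker could do with an offset computed by the main scan:
    public static String readChunk(ByteBuffer mapped, int offset, int length) {
        ByteBuffer view = mapped.asReadOnlyBuffer(); // independent position per thread
        view.position(offset);
        byte[] chunk = new byte[Math.min(length, view.remaining())];
        view.get(chunk);
        return new String(chunk, StandardCharsets.UTF_8);
    }
}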
Related
I have 4 large files (around 1.5 GB each) that I want to process: read each line of the file and convert it to a customer object. I have the following implementation.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.zip.GZIPInputStream;
import static java.nio.charset.StandardCharsets.UTF_8;
public class CustomerDataAccess {
public static void main(String[] args) throws IOException {
CustomerFileItem john = new CustomerFileItem("CustFile1", "http://w.customer1.com");
CustomerFileItem sarah = new CustomerFileItem("CustFile2", "http://w.customer2.com");
CustomerFileItem charles = new CustomerFileItem("CustFile3", "http://w.customer3.com");
List<CustomerFileItem> customers = Arrays.asList(john, sarah, charles);
Iterator<CustomerFileLineItem> custList = new CustIterator(customers);
}
public static class CustIterator implements Iterator<CustomerFileLineItem> {
private static final int HEADER_LINES = 9; // 8 + 1 blank line
BufferedReader bufferedReader;
private int index = 0;
private final List<CustomerFileItem> custFileItems = new ArrayList<>();
public CustIterator(final List<CustomerFileItem> custFileItems) throws IOException {
this.custFileItems.addAll(custFileItems);
processNext();
}
private void processNext() throws IOException {
if (bufferedReader != null) {
bufferedReader.close();
}
if (index < custFileItems.size()) { // only update if there's another file
CustomerFileItem custFileItem = custFileItems.get(index);
GZIPInputStream gis = new GZIPInputStream(new URL(custFileItem.url).openStream());
// default buffer size is 8 KB
bufferedReader = new BufferedReader(new InputStreamReader(gis, UTF_8));
// read the first few lines
for (int i = 0; i < HEADER_LINES; i++) {
bufferedReader.readLine();
}
}
index++;
}
@Override
public boolean hasNext() {
try {
boolean currentReaderStatus = bufferedReader.ready();
if (currentReaderStatus) {
return true;
} else if (index < custFileItems.size()) {
// at end of current file, try to get the next one
processNext();
return hasNext();
} else { // no more files left
return false;
}
} catch (IOException e) {
try {
bufferedReader.close();
} catch (IOException e1) {
throw new UncheckedIOException(e1);
}
throw new UncheckedIOException(e);
}
}
@Override
public CustomerFileLineItem next() {
try {
String line = bufferedReader.readLine();
if (line != null) {
return new CustomerFileLineItem(line);
} else {
return null;
}
} catch (IllegalArgumentException exception) {
return null;
} catch (IOException e) {
try {
bufferedReader.close();
} catch (IOException e1) {
throw new UncheckedIOException(e1);
}
throw new UncheckedIOException(e);
}
}
@Override
public void remove() {
throw new UnsupportedOperationException();
}
@Override
public void forEachRemaining(final Consumer<? super CustomerFileLineItem> action) {
throw new UnsupportedOperationException();
}
}
public static class CustomerFileLineItem {
private static final int NUMBER_OF_FIELDS = 4;
final String id;
final String productNumber;
final String usageType;
final String operation;
public CustomerFileLineItem(final String line) {
String[] strings = line.split(",");
if (strings.length != NUMBER_OF_FIELDS) {
throw new IllegalArgumentException(String.format("Malformed customer file line: %s", line));
}
this.id = strings[0];
this.productNumber = strings[1];
this.usageType = strings[3];
this.operation = strings[4];
}
}
static class CustomerFileItem {
private String fileName;
private String url;
public CustomerFileItem(String fileName, String url) {
this.fileName = fileName;
this.url = url;
}
}
}
In one use case I want to use streams on the output list (custList). But I know I can't use streams with an Iterator. How can I convert it to a Spliterator? Or how can I implement with a Spliterator what I implemented with the Iterator?
TL;DR You don’t need to implement an Iterator or Spliterator, you can simply use a Stream in the first place:
private static final int HEADER_LINES = 9; // 8 + 1 blank line
Stream<CustomerFileLineItem> stream = customers.stream()
.flatMap(custFileItem -> {
try {
GZIPInputStream gis
= new GZIPInputStream(new URL(custFileItem.url).openStream());
BufferedReader br = new BufferedReader(new InputStreamReader(gis, UTF_8));
// read the first few lines
for (int i = 0; i < HEADER_LINES; i++) br.readLine();
return br.lines().onClose(() -> {
try { br.close(); }
catch(IOException ex) { throw new UncheckedIOException(ex); }
});
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.map(CustomerFileLineItem::new);
But for completeness, addressing the question literally:
First of all, you should not add a method definition like
@Override
public void forEachRemaining(final Consumer<? super CustomerFileLineItem> action) {
throw new UnsupportedOperationException();
}
This method will surely backfire when you use the Stream API, as that’s where most non-short-circuiting operations will end up.
There is not even a reason to add it. When you don’t declare the method, you’ll get a reasonable default method from the Iterator interface.
Once you have fixed this issue, you can easily convert the Iterator to a Spliterator using Spliterators.spliteratorUnknownSize(Iterator, int).
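For reference, that conversion is a short wrapper around the existing iterator (a sketch reusing the CustIterator and customers list from the question):
// Wrap the existing Iterator and stream over it; ORDERED and NONNULL describe
// how the iterator behaves, and the size is unknown up front.
Iterator<CustomerFileLineItem> custList = new CustIterator(customers);
Spliterator<CustomerFileLineItem> spliterator =
        Spliterators.spliteratorUnknownSize(custList, Spliterator.ORDERED | Spliterator.NONNULL);
Stream<CustomerFileLineItem> stream = StreamSupport.stream(spliterator, false);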
But there is no reason to do so. Your code becomes simpler when implementing Spliterator in the first place:
public static class CustIterator
extends Spliterators.AbstractSpliterator<CustomerFileLineItem> {
private static final int HEADER_LINES = 9; // 8 + 1 blank line
BufferedReader bufferedReader;
private final ArrayDeque<CustomerFileItem> custFileItems;
public CustIterator(final List<CustomerFileItem> custFileItems) throws IOException {
super(Long.MAX_VALUE, ORDERED|NONNULL);
this.custFileItems = new ArrayDeque<>(custFileItems);
processNext();
}
@Override
public boolean tryAdvance(Consumer<? super CustomerFileLineItem> action) {
if(bufferedReader == null) return false;
try {
String line = bufferedReader.readLine();
while(line == null) {
processNext();
if(bufferedReader == null) return false;
line = bufferedReader.readLine();
}
action.accept(new CustomerFileLineItem(line));
return true;
}
catch(IOException ex) {
if(bufferedReader != null) try {
bufferedReader.close();
bufferedReader = null;
}
catch(IOException ex2) {
ex.addSuppressed(ex2);
}
throw new UncheckedIOException(ex);
}
}
private void processNext() throws IOException {
if (bufferedReader != null) {
bufferedReader.close();
bufferedReader = null;
}
if (!custFileItems.isEmpty()) { // only update if there's another file
CustomerFileItem custFileItem = custFileItems.remove();
GZIPInputStream gis
= new GZIPInputStream(new URL(custFileItem.url).openStream());
// default buffer size is 8 KB
bufferedReader = new BufferedReader(new InputStreamReader(gis, UTF_8));
// read the first few lines
for (int i = 0; i < HEADER_LINES; i++) {
bufferedReader.readLine();
}
}
}
}
But, as said at the beginning, you don’t even need to implement a Spliterator here.
Every Iterable<T> object has the following methods:
Iterator<T> iterator() (abstract)
default Spliterator<T> spliterator() (a default method)
Therefore, you want to create an Iterable<T> back from the Iterator<T>, which only requires overriding the single non-default, abstract method:
Iterable<CustomerFileLineItem> iterable = new Iterable<CustomerFileLineItem>() {
@Override
public Iterator<CustomerFileLineItem> iterator() {
return custList;
}
};
This can be shortened into a lambda expression resulting in:
Iterable<CustomerFileLineItem> iterable = () -> custList;
Spliterator<CustomerFileLineItem> spliterator = iterable.spliterator();
... so the Stream is easily created:
Stream<CustomerFileLineItem> stream = StreamSupport.stream(spliterator, false);
I use LanguageTool for some spellchecking and spell correction functionality in my application.
The LanguageTool documentation describes how to exclude words from spell checking (by calling the addIgnoreTokens(...) method of the spell checking rule you're using).
How do you add some words (e.g., from a specific dictionary) to spell checking? That is, can LanguageTool fix misspelled words and suggest replacements from my specific dictionary?
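To show what I mean, the exclusion described in the documentation looks roughly like this (a sketch; locating the rule via instanceof SpellingCheckRule is just one way to get hold of it). What I'm after is the opposite: adding words that also show up as suggestions.
import java.util.List;
import org.languagetool.JLanguageTool;
import org.languagetool.language.AmericanEnglish;
import org.languagetool.rules.Rule;
import org.languagetool.rules.spelling.SpellingCheckRule;

public class IgnoreWordsExample {
    public static JLanguageTool toolIgnoring(List<String> words) {
        JLanguageTool langTool = new JLanguageTool(new AmericanEnglish());
        for (Rule rule : langTool.getAllActiveRules()) {
            if (rule instanceof SpellingCheckRule) {
                // these words are no longer flagged as misspelled,
                // but they are NOT offered as suggestions
                ((SpellingCheckRule) rule).addIgnoreTokens(words);
            }
        }
        return langTool;
    }
}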
Unfortunately, I don't think the API supports this. Without the API, you can add words to spelling.txt to get them accepted and used as suggestions. With the API, you might need to extend MorfologikSpellerRule and change this place in the code. (Disclosure: I'm the maintainer of LanguageTool.)
I had a similar requirement, which was to load some custom words into the dictionary as "suggested words", not just "ignored words". In the end I extended MorfologikSpellerRule to do this:
Create a class MorfologikSpellerRuleEx extending MorfologikSpellerRule, override the match() method, and write my own initSpellerEx() for creating the spellers.
Then, for the language tool, register this custom speller rule to replace the existing one.
Code:
Language lang = new AmericanEnglish();
JLanguageTool langTool = new JLanguageTool(lang);
langTool.disableRule("MORFOLOGIK_RULE_EN_US");
try {
MorfologikSpellerRuleEx spellingRule = new MorfologikSpellerRuleEx(JLanguageTool.getMessageBundle(), lang);
spellingRule.setSpellingFilePath(spellingFilePath);
// spellingFilePath is the file that has my own words + the words from /hunspell/spelling_en-US.txt
langTool.addRule(spellingRule);
} catch (IOException e) {
e.printStackTrace();
}
The code of my custom MorfologikSpellerRuleEx:
public class MorfologikSpellerRuleEx extends MorfologikSpellerRule {
private String spellingFilePath = null;
private boolean ignoreTaggedWords = false;
public MorfologikSpellerRuleEx(ResourceBundle messages, Language language) throws IOException {
super(messages, language);
}
@Override
public String getFileName() {
return "/en/hunspell/en_US.dict";
}
@Override
public String getId() {
return "MORFOLOGIK_SPELLING_RULE_EX";
}
@Override
public void setIgnoreTaggedWords() {
ignoreTaggedWords = true;
}
public String getSpellingFilePath() {
return spellingFilePath;
}
public void setSpellingFilePath(String spellingFilePath) {
this.spellingFilePath = spellingFilePath;
}
private void initSpellerEx(String binaryDict) throws IOException {
String plainTextDict = null;
if (JLanguageTool.getDataBroker().resourceExists(getSpellingFileName())) {
plainTextDict = getSpellingFileName();
}
if (plainTextDict != null) {
BufferedReader br = null;
if (this.spellingFilePath != null) {
try {
br = new BufferedReader(new FileReader(this.spellingFilePath));
}
catch (Exception e) {
br = null;
}
}
if (br != null) {
speller1 = new MorfologikMultiSpeller(binaryDict, br, plainTextDict, 1);
speller2 = new MorfologikMultiSpeller(binaryDict, br, plainTextDict, 2);
speller3 = new MorfologikMultiSpeller(binaryDict, br, plainTextDict, 3);
br.close();
}
else {
speller1 = new MorfologikMultiSpeller(binaryDict, plainTextDict, 1);
speller2 = new MorfologikMultiSpeller(binaryDict, plainTextDict, 2);
speller3 = new MorfologikMultiSpeller(binaryDict, plainTextDict, 3);
}
setConvertsCase(speller1.convertsCase());
} else {
throw new RuntimeException("Could not find ignore spell file in path: " + getSpellingFileName());
}
}
private boolean canBeIgnored(AnalyzedTokenReadings[] tokens, int idx, AnalyzedTokenReadings token)
throws IOException {
return token.isSentenceStart() || token.isImmunized() || token.isIgnoredBySpeller() || isUrl(token.getToken())
|| isEMail(token.getToken()) || (ignoreTaggedWords && token.isTagged()) || ignoreToken(tokens, idx);
}
@Override
public RuleMatch[] match(AnalyzedSentence sentence) throws IOException {
List<RuleMatch> ruleMatches = new ArrayList<>();
AnalyzedTokenReadings[] tokens = getSentenceWithImmunization(sentence).getTokensWithoutWhitespace();
// lazy init
if (speller1 == null) {
String binaryDict = null;
if (JLanguageTool.getDataBroker().resourceExists(getFileName())) {
binaryDict = getFileName();
}
if (binaryDict != null) {
initSpellerEx(binaryDict); //here's the change
} else {
// should not happen, as we only configure this rule (or rather its subclasses)
// when we have the resources:
return toRuleMatchArray(ruleMatches);
}
}
int idx = -1;
for (AnalyzedTokenReadings token : tokens) {
idx++;
if (canBeIgnored(tokens, idx, token)) {
continue;
}
// if we use token.getToken() we'll get ignored characters inside and speller
// will choke
String word = token.getAnalyzedToken(0).getToken();
if (tokenizingPattern() == null) {
ruleMatches.addAll(getRuleMatches(word, token.getStartPos(), sentence));
} else {
int index = 0;
Matcher m = tokenizingPattern().matcher(word);
while (m.find()) {
String match = word.subSequence(index, m.start()).toString();
ruleMatches.addAll(getRuleMatches(match, token.getStartPos() + index, sentence));
index = m.end();
}
if (index == 0) { // tokenizing char not found
ruleMatches.addAll(getRuleMatches(word, token.getStartPos(), sentence));
} else {
ruleMatches.addAll(getRuleMatches(word.subSequence(index, word.length()).toString(),
token.getStartPos() + index, sentence));
}
}
}
return toRuleMatchArray(ruleMatches);
}
}
I'm asking for some help with the Lucene 6.1 API.
I tried to extend Lucene's Tokenizer and Analyzer, but I don't understand all the guides. In every tutorial the user's Tokenizer overrides incrementToken and takes a Reader in its constructor, and the user's Analyzer overrides the createComponents method. But in Lucene 6 createComponents takes only a single String argument, so how can I pass a Reader to my Analyzer?
My code:
public class ChemTokenizer extends Tokenizer{
protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
protected String stringToTokenize;
protected int position = 0;
protected List<int[]> chemicals = new ArrayList<>();
@Override
public boolean incrementToken() throws IOException {
// Clear anything that is already saved in this.charTermAttribute
this.charTermAttribute.setEmpty();
// Get the position of the next symbol
int nextIndex = -1;
Pattern p = Pattern.compile("[^A-zА-я]");
Matcher m = p.matcher(stringToTokenize);
// match on the full string starting at the current position, so the index is absolute
if (m.find(position)) {
nextIndex = m.start();
}
// Did we lose chemicals?
for (int[] pair: chemicals) {
if (pair[0] < nextIndex && pair[1] > nextIndex) {
//We are in the chemical name
if (position == pair[0]) {
nextIndex = pair[1];
}
else {
nextIndex = pair[0];
}
}
}
// Next separator was found
if (nextIndex != -1) {
String nextToken = stringToTokenize.substring(position, nextIndex);
charTermAttribute.append(nextToken);
position = nextIndex + 1;
return true;
}
// Last part of text
else if (position < stringToTokenize.length()) {
String nextToken = stringToTokenize.substring(position);
charTermAttribute.append(nextToken);
position = stringToTokenize.length();
return true;
}
else {
return false;
}
}
public ChemTokenizer(Reader reader,List<String> additionalKeywords) {
int numChars;
char[] buffer = new char[1024];
StringBuilder stringBuilder = new StringBuilder();
try {
while ((numChars =
reader.read(buffer, 0, buffer.length)) != -1) {
stringBuilder.append(buffer, 0, numChars);
}
}
catch (IOException e) {
throw new RuntimeException(e);
}
stringToTokenize = stringBuilder.toString();
// Checking for keywords
// Doesn't work properly if the text has chemical synonyms
for (String keyword: additionalKeywords) {
int[] tmp = new int[2];
//Start of keyword
tmp[0] = stringToTokenize.indexOf(keyword);
tmp[1] = tmp[0] + keyword.length() - 1;
chemicals.add(tmp);
}
}
/* Reset the stored position for this object when reset() is called.
*/
@Override
public void reset() throws IOException {
super.reset();
position = 0;
chemicals = new ArrayList<>();
}
}
And the code for the Analyzer:
public class ChemAnalyzer extends Analyzer{
List<String> additionalKeywords;
public ChemAnalyzer(List<String> ad) {
additionalKeywords = ad;
}
@Override
protected TokenStreamComponents createComponents(String s, Reader reader) {
Tokenizer tokenizer = new ChemTokenizer(reader,additionalKeywords);
TokenStream filter = new LowerCaseFilter(tokenizer);
return new TokenStreamComponents(tokenizer, filter);
}
}
The problem is that this code doesn't work with Lucene 6.
This is what I found searching GitHub; I guess you have to create the tokenizer without the Reader:
@Override
protected TokenStreamComponents createComponents(String fieldName) {
return new TokenStreamComponents(new WhitespaceTokenizer());
}
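To make that concrete, here is a sketch of how the ChemTokenizer/ChemAnalyzer pair could be wired up for Lucene 6: the Reader is no longer a constructor argument, the framework injects it into the inherited input field, and the tokenizer consumes it in reset(). The token-splitting logic is simplified to whitespace here; the original regex/keyword handling would slot into incrementToken() instead.
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter; // moved to org.apache.lucene.analysis in Lucene 7
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ChemTokenizer extends Tokenizer {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final List<String> additionalKeywords; // would drive the chemical-name offsets
    private String stringToTokenize = "";
    private int position = 0;

    public ChemTokenizer(List<String> additionalKeywords) {
        this.additionalKeywords = additionalKeywords; // note: no Reader parameter any more
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        // Consume the Reader that Lucene assigned to the inherited `input` field
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        int numChars;
        while ((numChars = input.read(buffer, 0, buffer.length)) != -1) {
            sb.append(buffer, 0, numChars);
        }
        stringToTokenize = sb.toString();
        position = 0;
        // the keyword-offset bookkeeping ("chemicals") from the question would be rebuilt here
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        // Simplified: split on whitespace; the original regex/keyword logic goes here instead
        while (position < stringToTokenize.length()
                && Character.isWhitespace(stringToTokenize.charAt(position))) {
            position++;
        }
        if (position >= stringToTokenize.length()) {
            return false;
        }
        int start = position;
        while (position < stringToTokenize.length()
                && !Character.isWhitespace(stringToTokenize.charAt(position))) {
            position++;
        }
        termAtt.append(stringToTokenize, start, position);
        return true;
    }
}

public class ChemAnalyzer extends Analyzer {

    private final List<String> additionalKeywords;

    public ChemAnalyzer(List<String> additionalKeywords) {
        this.additionalKeywords = additionalKeywords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new ChemTokenizer(additionalKeywords);
        TokenStream filter = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, filter);
    }
}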
I use Java for file reading. Here's my code:
public static String[] fajlbeolvasa(String s) throws IOException
{
ArrayList<String> list = new ArrayList<>();
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(s), "UTF8"));
while(true)
{
String line = reader.readLine();
if (line == null)
{
break;
}
list.add(line);
}
reader.close();
return list.toArray(new String[0]);
}
However, when I read the file, the output comes out incorrectly shaped.
For example: "Farkasgyep\305\261". Maybe something is wrong with the BOM.
How can I solve this problem in Java? I will be grateful for any help.
You can check for the BOM in the following way. This treats the file content as a byte[], so you shouldn't have problems using it with your file (Hex comes from Apache Commons Codec, and LOGGER is assumed to be defined elsewhere):
private static final String BOM_HEX_ENCODE = "efbbbf"; // hex of the UTF-8 BOM bytes
private static boolean isBOMPresent(byte[] content){
boolean result = false;
byte[] bom = new byte[3];
try (ByteArrayInputStream is = new ByteArrayInputStream(content)) {
int bytesReaded = is.read(bom);
if(bytesReaded != -1) {
String stringContent = new String(Hex.encodeHex(bom));
if (BOM_HEX_ENCODE.equalsIgnoreCase(stringContent)) {
result = true;
}
}
} catch (Exception e) {
LOGGER.error(e);
}
return result;
}
Then, if you need to remove it you can use this:
public static byte[] removeBOM(byte[] fileWithBOM) {
if (isBOMPresent(fileWithBOM)) {
ByteBuffer bb = ByteBuffer.wrap(fileWithBOM);
byte[] bom = new byte[3];
bb.get(bom, 0, bom.length);
byte[] contentAfterFirst3Bytes = new byte[fileWithBOM.length - 3];
bb.get(contentAfterFirst3Bytes, 0, contentAfterFirst3Bytes.length);
return contentAfterFirst3Bytes;
} else {
return fileWithBOM;
}
}
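A minimal usage sketch, assuming the two helpers above are in scope ("input.txt" is a placeholder path; Files, Paths and StandardCharsets are standard JDK classes):
// Read the raw bytes, strip a UTF-8 BOM if present, and only then decode as UTF-8.
byte[] raw = Files.readAllBytes(Paths.get("input.txt"));
byte[] withoutBom = removeBOM(raw);
String text = new String(withoutBom, StandardCharsets.UTF_8);
List<String> lines = Arrays.asList(text.split("\\R"));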
In my servlet I am running a few command-line commands in the background, and I've successfully printed their output on the console.
My doGet():
public void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
{
String[] command =
{
"zsh"
};
Process p = Runtime.getRuntime().exec(command);
new Thread(new SyncPipe(p.getErrorStream(), response.getOutputStream())).start();
new Thread(new SyncPipe(p.getInputStream(), response.getOutputStream())).start();
PrintWriter stdin = new PrintWriter(p.getOutputStream());
stdin.println("source ./taxenv/bin/activate");
stdin.println("python runner.py");
stdin.close();
int returnCode = 0;
try {
returnCode = p.waitFor();
}
catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println("Return code = " + returnCode);
}
class SyncPipe implements Runnable
{
public SyncPipe(InputStream istrm, OutputStream ostrm) {
istrm_ = istrm;
ostrm_ = ostrm;
}
public void run() {
try
{
final byte[] buffer = new byte[1024];
for (@SuppressWarnings("unused")
int length = 0; (length = istrm_.read(buffer)) != -1; )
{
// ostrm_.write(buffer, 0, length);
((PrintStream) ostrm_).println();
}
}
catch (Exception e)
{
e.printStackTrace();
}
}
private final OutputStream ostrm_;
private final InputStream istrm_;
}
Now I want to save the ostrm_ output to a string or a list, and use that inside doGet().
How can I achieve this?
==============================EDIT============================
Based on the answers below, I've edited my code as follows:
for (int length = 0; (length = istrm_.read(buffer)) != -1; )
{
// ostrm_.write(buffer, 0, length);
String str = IOUtils.toString(istrm_, "UTF-8");
//((PrintStream) ostrm_).println();
System.out.println(str);
}
Now, how do I get the str from the runnable class into my doGet()?
You can use Apache Commons IO.
Here is the documentation of IOUtils.toString() from their javadocs:
Gets the contents of an InputStream as a String using the specified character encoding. This method buffers the input internally, so there is no need to use a BufferedInputStream.
Parameters: input - the InputStream to read from; encoding - the encoding to use, null means platform default.
Returns: the requested String.
Throws: NullPointerException - if the input is null; IOException - if an I/O error occurs.
Example Usage:
String str = IOUtils.toString(yourInputStream, "UTF-8");
You can call something like the following (EDIT: also added the client calls):
public void run() {
try
{
String out = getAsString(istrm_);
((PrintStream) ostrm_).println(out);
} catch (Exception e) {
e.printStackTrace();
}
}
public static String getAsString(InputStream is) throws Exception {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
int cur = -1;
while((cur = is.read()) != -1 ){
baos.write(cur);
}
return getAsString(baos.toByteArray());
}
public static String getAsString(byte[] arr) throws Exception {
String res = "";
for(byte b : arr){
res+=(char)b;
}
return res;
}
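To get the output back into doGet() itself, one option is to let the pipes write into in-memory streams and read them after waitFor(). This sketch assumes SyncPipe forwards bytes with the commented-out ostrm_.write(buffer, 0, length) line instead of casting to PrintStream:
// Capture the process output in memory so doGet() can use it as a String.
ByteArrayOutputStream stdout = new ByteArrayOutputStream();
ByteArrayOutputStream stderr = new ByteArrayOutputStream();

Process p = Runtime.getRuntime().exec(new String[] { "zsh" });
Thread outPump = new Thread(new SyncPipe(p.getInputStream(), stdout));
Thread errPump = new Thread(new SyncPipe(p.getErrorStream(), stderr));
outPump.start();
errPump.start();

PrintWriter stdin = new PrintWriter(p.getOutputStream());
stdin.println("source ./taxenv/bin/activate");
stdin.println("python runner.py");
stdin.close();

try {
    int returnCode = p.waitFor();
    outPump.join(); // make sure all output has been copied before reading it
    errPump.join();
    String output = stdout.toString("UTF-8"); // the string you can now use in doGet()
    System.out.println("Return code = " + returnCode);
    response.getWriter().println(output);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}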