I want to use a Stream to parallelize processing of a heterogeneous set of remotely stored JSON files whose number is not known upfront. The files can vary widely in size, from 1 JSON record per file up to 100,000 records in others. A JSON record in this case means a self-contained JSON object represented as one line in the file.
I really want to use Streams for this and so I implemented this Spliterator:
public abstract class JsonStreamSpliterator<METADATA, RECORD> extends AbstractSpliterator<RECORD> {
abstract protected JsonStreamSupport<METADATA> openInputStream(String path);
abstract protected RECORD parse(METADATA metadata, Map<String, Object> json);
private static final int ADDITIONAL_CHARACTERISTICS = Spliterator.IMMUTABLE | Spliterator.DISTINCT | Spliterator.NONNULL;
private static final int MAX_BUFFER = 100;
private final Iterator<String> paths;
private JsonStreamSupport<METADATA> reader = null;
public JsonStreamSpliterator(Iterator<String> paths) {
this(Long.MAX_VALUE, ADDITIONAL_CHARACTERISTICS, paths);
}
private JsonStreamSpliterator(long est, int additionalCharacteristics, Iterator<String> paths) {
super(est, additionalCharacteristics);
this.paths = paths;
}
private JsonStreamSpliterator(long est, int additionalCharacteristics, Iterator<String> paths, String nextPath) {
this(est, additionalCharacteristics, paths);
open(nextPath);
}
@Override
public boolean tryAdvance(Consumer<? super RECORD> action) {
if(reader == null) {
String path = takeNextPath();
if(path != null) {
open(path);
}
else {
return false;
}
}
Map<String, Object> json = reader.readJsonLine();
if(json != null) {
RECORD item = parse(reader.getMetadata(), json);
action.accept(item);
return true;
}
else {
reader.close();
reader = null;
return tryAdvance(action);
}
}
private void open(String path) {
reader = openInputStream(path);
}
private String takeNextPath() {
synchronized(paths) {
if(paths.hasNext()) {
return paths.next();
}
}
return null;
}
@Override
public Spliterator<RECORD> trySplit() {
String nextPath = takeNextPath();
if(nextPath != null) {
return new JsonStreamSpliterator<METADATA,RECORD>(Long.MAX_VALUE, ADDITIONAL_CHARACTERISTICS, paths, nextPath) {
@Override
protected JsonStreamSupport<METADATA> openInputStream(String path) {
return JsonStreamSpliterator.this.openInputStream(path);
}
@Override
protected RECORD parse(METADATA metaData, Map<String,Object> json) {
return JsonStreamSpliterator.this.parse(metaData, json);
}
};
}
else {
List<RECORD> records = new ArrayList<RECORD>();
while(tryAdvance(records::add) && records.size() < MAX_BUFFER) {
// loop
}
if(records.size() != 0) {
return records.spliterator();
}
else {
return null;
}
}
}
}
The problem I'm having is that while the Stream parallelizes beautifully at first, eventually the largest file is left processing in a single thread. I believe the proximal cause is well documented: the spliterator is "unbalanced".
More concretely, it appears that the trySplit method is not called after a certain point in the Stream.forEach lifecycle, so the extra logic to distribute small batches at the end of trySplit is rarely executed.
Notice how all the spliterators returned from trySplit share the same paths iterator. I thought this was a really clever way to balance the work across all spliterators, but it hasn't been enough to achieve full parallelism.
I would like the parallel processing to proceed first across files, and then, when only a few large files are left still spliterating, to parallelize across chunks of the remaining files. That was the intent of the else block at the end of trySplit.
Is there an easy / simple / canonical way around this problem?
Your trySplit should output splits of equal size, regardless of the size of the underlying files. You should treat all the files as a single unit and fill up the ArrayList-backed spliterator with the same number of JSON objects each time. The number of objects should be such that processing one split takes between 1 and 10 milliseconds: lower than 1 ms and you start approaching the costs of handing off the batch to a worker thread, higher than that and you start risking uneven CPU load due to tasks which are too coarse-grained.
The spliterator is not obliged to report a size estimate, and you are already doing this correctly: your estimate is Long.MAX_VALUE, which is a special value meaning "unbounded". However, if you have many files with a single JSON object, resulting in batches of size 1, this will hurt your performance in two ways: the overhead of opening-reading-closing the file may become a bottleneck and, if you manage to escape that, the cost of thread handoff may be significant compared to the cost of processing one item, again causing a bottleneck.
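For illustration, here is a minimal sketch of that batching idea (not your exact code; BATCH_SIZE is an assumed tuning constant, sized so one batch takes roughly 1-10 ms to process):
@Override
public Spliterator<RECORD> trySplit() {
    // Fill a fixed-size batch from whatever file is currently open,
    // so every split carries roughly the same amount of work.
    List<RECORD> batch = new ArrayList<>(BATCH_SIZE);
    while (batch.size() < BATCH_SIZE && tryAdvance(batch::add)) {
        // keep filling
    }
    return batch.isEmpty() ? null : batch.spliterator();
}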
Five years ago I was solving a similar problem, you can have a look at my solution.
After much experimentation, I was still not able to get any added parallelism by playing with the size estimates. Basically, any value other than Long.MAX_VALUE will tend to cause the spliterator to terminate too early (and without any splitting), while on the other hand a Long.MAX_VALUE estimate will cause trySplit to be called relentlessly until it returns null.
The solution I found is to internally share resources among the spliterators and let them rebalance amongst themselves.
Working code:
public class AwsS3LineSpliterator<LINE> extends AbstractSpliterator<AwsS3LineInput<LINE>> {
public final static class AwsS3LineInput<LINE> {
final public S3ObjectSummary s3ObjectSummary;
final public LINE lineItem;
public AwsS3LineInput(S3ObjectSummary s3ObjectSummary, LINE lineItem) {
this.s3ObjectSummary = s3ObjectSummary;
this.lineItem = lineItem;
}
}
private final class InputStreamHandler {
final S3ObjectSummary file;
final InputStream inputStream;
InputStreamHandler(S3ObjectSummary file, InputStream is) {
this.file = file;
this.inputStream = is;
}
}
private final Iterator<S3ObjectSummary> incomingFiles;
private final Function<S3ObjectSummary, InputStream> fileOpener;
private final Function<InputStream, LINE> lineReader;
private final Deque<S3ObjectSummary> unopenedFiles;
private final Deque<InputStreamHandler> openedFiles;
private final Deque<AwsS3LineInput<LINE>> sharedBuffer;
private final int maxBuffer;
private AwsS3LineSpliterator(Iterator<S3ObjectSummary> incomingFiles, Function<S3ObjectSummary, InputStream> fileOpener,
Function<InputStream, LINE> lineReader,
Deque<S3ObjectSummary> unopenedFiles, Deque<InputStreamHandler> openedFiles, Deque<AwsS3LineInput<LINE>> sharedBuffer,
int maxBuffer) {
super(Long.MAX_VALUE, 0);
this.incomingFiles = incomingFiles;
this.fileOpener = fileOpener;
this.lineReader = lineReader;
this.unopenedFiles = unopenedFiles;
this.openedFiles = openedFiles;
this.sharedBuffer = sharedBuffer;
this.maxBuffer = maxBuffer;
}
public AwsS3LineSpliterator(Iterator<S3ObjectSummary> incomingFiles, Function<S3ObjectSummary, InputStream> fileOpener, Function<InputStream, LINE> lineReader, int maxBuffer) {
this(incomingFiles, fileOpener, lineReader, new ConcurrentLinkedDeque<>(), new ConcurrentLinkedDeque<>(), new ArrayDeque<>(maxBuffer), maxBuffer);
}
@Override
public boolean tryAdvance(Consumer<? super AwsS3LineInput<LINE>> action) {
AwsS3LineInput<LINE> lineInput;
synchronized(sharedBuffer) {
lineInput=sharedBuffer.poll();
}
if(lineInput != null) {
action.accept(lineInput);
return true;
}
InputStreamHandler handle = openedFiles.poll();
if(handle == null) {
S3ObjectSummary unopenedFile = unopenedFiles.poll();
if(unopenedFile == null) {
return false;
}
handle = new InputStreamHandler(unopenedFile, fileOpener.apply(unopenedFile));
}
for(int i=0; i < maxBuffer; ++i) {
LINE line = lineReader.apply(handle.inputStream);
if(line != null) {
synchronized(sharedBuffer) {
sharedBuffer.add(new AwsS3LineInput<LINE>(handle.file, line));
}
}
else {
return tryAdvance(action);
}
}
openedFiles.addFirst(handle);
return tryAdvance(action);
}
@Override
public Spliterator<AwsS3LineInput<LINE>> trySplit() {
synchronized(incomingFiles) {
if (incomingFiles.hasNext()) {
unopenedFiles.add(incomingFiles.next());
return new AwsS3LineSpliterator<LINE>(incomingFiles, fileOpener, lineReader, unopenedFiles, openedFiles, sharedBuffer, maxBuffer);
} else {
return null;
}
}
}
}
This is not a direct answer to your question, but I think it is worth trying the Stream in the abacus-common library:
void test_58601518() throws Exception {
final File tempDir = new File("./temp/");
// Prepare the test files:
// if (!(tempDir.exists() && tempDir.isDirectory())) {
// tempDir.mkdirs();
// }
//
// final Random rand = new Random();
// final int fileCount = 1000;
//
// for (int i = 0; i < fileCount; i++) {
// List<String> lines = Stream.repeat(TestUtil.fill(Account.class), rand.nextInt(1000) * 100 + 1).map(it -> N.toJSON(it)).toList();
// IOUtil.writeLines(new File("./temp/_" + i + ".json"), lines);
// }
N.println("Xmx: " + IOUtil.MAX_MEMORY_IN_MB + " MB");
N.println("total file size: " + Stream.listFiles(tempDir).mapToLong(IOUtil::sizeOf).sum() / IOUtil.ONE_MB + " MB");
final AtomicLong counter = new AtomicLong();
final Consumer<Account> yourAction = it -> {
counter.incrementAndGet();
it.toString().replace("a", "bbb");
};
long startTime = System.currentTimeMillis();
Stream.listFiles(tempDir) // the file/data source could be local file system or remote file system.
.parallel(2) // thread number used to load the file/data and convert the lines to Java objects.
.flatMap(f -> Stream.lines(f).map(line -> N.fromJSON(Account.class, line))) // only a certain number of lines (fewer than 1024) will be loaded into memory at a time.
.parallel(8) // thread number used to execute your action.
.forEach(yourAction);
N.println("Took: " + ((System.currentTimeMillis()) - startTime) + " ms" + " to process " + counter + " lines/objects");
// IOUtil.deleteAllIfExists(tempDir);
}
From start to end, the CPU usage on my laptop was pretty high (about 70%), and it took about 70 seconds to process 51,899,100 lines/objects from 1000 files on an Intel(R) Core(TM) i5-8365U CPU with 256 MB of JVM memory (Xmx256m). The total file size was about 4524 MB. If yourAction is not a heavy operation, a sequential stream could even be faster than a parallel stream.
FYI: I'm the developer of abacus-common.
Related
I have a regex pattern of words like welcome1|welcome2|changeme... which I need to search for in thousands of files (anywhere between 100 and 8000), ranging from 1 KB to 24 MB each in size.
I would like to know if there's a faster way of pattern matching than doing what I have been trying.
Environment:
jdk 1.8
Windows 10
Unix4j Library
Here's what I have tried so far:
try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(FilePredicates.isFileAndNotDirectory())) {
List<String> obviousStringsList = Strings_PASSWORDS.stream()
.map(s -> ".*" + s + ".*").collect(Collectors.toList()); //because Unix4j apparently needs this
Pattern pattern = Pattern.compile(String.join("|", obviousStringsList));
GrepOptions options = new GrepOptions.Default(GrepOption.count,
GrepOption.ignoreCase,
GrepOption.lineNumber,
GrepOption.matchingFiles);
Instant startTime = Instant.now();
final List<Path> filesWithObviousStringss = stream
.filter(path -> !Unix4j.grep(options, pattern, path.toFile()).toStringResult().isEmpty())
.collect(Collectors.toList());
System.out.println("Time taken = " + Duration.between(startTime, Instant.now()).getSeconds() + " seconds");
}
I get Time taken = 60 seconds which makes me think I'm doing something really wrong.
I've tried different ways with the stream, and on average every method takes about a minute to process my current folder of 6660 files.
Grep on msys2/mingw64 takes about 15 seconds and exec('grep...') in node.js takes about 12 seconds, consistently.
I chose Unix4j because it provides java native grep and clean code.
Is there a way to produce better results in Java, that I'm sadly missing?
The main reason why native tools can process such text files much faster is their assumption of one particular charset, especially when it has an ASCII-based 8-bit encoding, whereas Java performs a byte-to-character conversion whose abstraction is capable of supporting arbitrary charsets.
When we similarly assume a single charset with the properties named above, we can use low-level tools which may increase the performance dramatically.
For such an operation, we define the following helper methods:
private static char[] getTable(Charset cs) {
if(cs.newEncoder().maxBytesPerChar() != 1f)
throw new UnsupportedOperationException("Not an 8 bit charset");
byte[] raw = new byte[256];
IntStream.range(0, 256).forEach(i -> raw[i] = (byte)i);
char[] table = new char[256];
cs.newDecoder().onUnmappableCharacter(CodingErrorAction.REPLACE)
.decode(ByteBuffer.wrap(raw), CharBuffer.wrap(table), true);
for(int i = 0; i < 128; i++)
if(table[i] != i) throw new UnsupportedOperationException("Not ASCII based");
return table;
}
and
private static CharSequence mapAsciiBasedText(Path p, char[] table) throws IOException {
try(FileChannel fch = FileChannel.open(p, StandardOpenOption.READ)) {
long actualSize = fch.size();
int size = (int)actualSize;
if(size != actualSize) throw new UnsupportedOperationException("file too large");
MappedByteBuffer mbb = fch.map(FileChannel.MapMode.READ_ONLY, 0, actualSize);
final class MappedCharSequence implements CharSequence {
final int start, size;
MappedCharSequence(int start, int size) {
this.start = start;
this.size = size;
}
public int length() {
return size;
}
public char charAt(int index) {
if(index < 0 || index >= size) throw new IndexOutOfBoundsException();
byte b = mbb.get(start + index);
return b<0? table[b+256]: (char)b;
}
public CharSequence subSequence(int start, int end) {
int newSize = end - start;
if(start<0 || end < start || end-start > size)
throw new IndexOutOfBoundsException();
return new MappedCharSequence(start + this.start, newSize);
}
public String toString() {
return new StringBuilder(size).append(this).toString();
}
}
return new MappedCharSequence(0, size);
}
}
This allows us to map a file into virtual memory and project it directly onto a CharSequence without copy operations, assuming that the mapping can be done with a simple table; for ASCII-based charsets, the majority of the characters do not even need a table lookup, as their numerical value is identical to the Unicode codepoint.
With these methods, you may implement the operation as
// You need this only once per JVM.
// Note that running inside IDEs like Netbeans may change the default encoding
char[] table = getTable(Charset.defaultCharset());
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream//.parallel()
.filter(path -> {
try {
return pattern.matcher(mapAsciiBasedText(path, table)).find();
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
This runs much faster than the normal text conversion, but still supports parallel execution.
Besides requiring an ASCII-based single-byte encoding, there’s the restriction that this code doesn’t support files larger than 2 GiB. While it is possible to extend the solution to support larger files, I wouldn’t add this complication unless really needed.
I don’t know what “Unix4j” provides that isn’t already in the JDK, as the following code does everything with built-in features:
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream
.filter(path -> {
try(Scanner s = new Scanner(path)) {
return s.findWithinHorizon(pattern, 0) != null;
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
One important property of this solution is that it doesn’t read the whole file, but stops at the first encountered match. Also, it doesn’t deal with line boundaries, which is suitable for the words you’re looking for, as they never contain line breaks anyway.
After analyzing the findWithinHorizon operation, I consider that line-by-line processing may be better for larger files, so you may try
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream
.filter(path -> {
try(Stream<String> s = Files.lines(path)) {
return s.anyMatch(pattern.asPredicate());
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
instead.
You may also try to turn the stream to parallel mode, e.g.
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream
.parallel()
.filter(path -> {
try(Stream<String> s = Files.lines(path)) {
return s.anyMatch(pattern.asPredicate());
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
It’s hard to predict whether this has a benefit, as in most cases, the I/O dominates such an operation.
I have never used Unix4j, but Java provides nice file APIs as well nowadays. Also, Unix4j#grep seems to return all the found matches (as you're using .toStringResult().isEmpty()), while you seem to just need to know whether at least one match was found (which means you should be able to break as soon as one match is found). Maybe this library provides another method that better suits your needs, e.g. something like #contains? Without Unix4j, Stream#anyMatch could be a good candidate here. Here is a vanilla Java solution if you want to compare it with yours:
private boolean lineContainsObviousStrings(String line) {
return Strings_PASSWORDS // <-- weird naming BTW
.stream()
.anyMatch(line::contains);
}
private boolean fileContainsObviousStrings(Path path) {
    try (Stream<String> stream = Files.lines(path)) {
        return stream.anyMatch(this::lineContainsObviousStrings);
    } catch (IOException e) {
        // Files.lines throws a checked IOException; wrap it so this method can be used as a predicate
        throw new UncheckedIOException(e);
    }
}
public List<Path> findFilesContainingObviousStrings() throws IOException {
Instant startTime = Instant.now();
try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))) {
return stream
.filter(FilePredicates.isFileAndNotDirectory())
.filter(this::fileContainsObviousStrings)
.collect(Collectors.toList());
} finally {
Instant endTime = Instant.now();
System.out.println("Time taken = " + Duration.between(startTime, endTime).getSeconds() + " seconds");
}
}
Please try this out too (if possible); I am curious how it performs on your files.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class Filescan {
public static void main(String[] args) throws IOException {
Filescan sc = new Filescan();
sc.findWords("src/main/resources/files", new String[]{"author", "book"}, true);
}
// kind of Tuple/Map.Entry
static class Pair<K,V>{
final K key;
final V value;
Pair(K key, V value){
this.key = key;
this.value = value;
}
@Override
public String toString() {
return key + " " + value;
}
}
public void findWords(String directory, String[] words, boolean ignorecase) throws IOException{
final String[] searchWords = ignorecase ? toLower(words) : words;
try (Stream<Path> stream = Files.walk(Paths.get(directory)).filter(Files::isRegularFile)) {
long startTime = System.nanoTime();
List<Pair<Path,Map<String, List<Integer>>>> result = stream
// you can test it with parallel execution, maybe it is faster
.parallel()
// searching
.map(path -> findWordsInFile(path, searchWords, ignorecase))
// filtering out empty optionals
.filter(Optional::isPresent)
// unwrap optionals
.map(Optional::get).collect(Collectors.toList());
System.out.println("Time taken = " + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()
- startTime) + " seconds");
System.out.println("result:");
result.forEach(System.out::println);
}
}
private String[] toLower(String[] words) {
String[] ret = new String[words.length];
for (int i = 0; i < words.length; i++) {
ret[i] = words[i].toLowerCase();
}
return ret;
}
private static Optional<Pair<Path,Map<String, List<Integer>>>> findWordsInFile(Path path, String[] words, boolean ignorecase) {
try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path.toFile())))) {
String line = br.readLine();
line = ignorecase & line != null ? line.toLowerCase() : line;
Map<String, List<Integer>> map = new HashMap<>();
int linecount = 0;
while(line != null){
for (String word : words) {
if(line.contains(word)){
if(!map.containsKey(word)){
map.put(word, new ArrayList<Integer>());
}
map.get(word).add(linecount);
}
}
line = br.readLine();
line = ignorecase & line != null ? line.toLowerCase() : line;
linecount++;
}
if(map.isEmpty()){
// returning empty optional when nothing in the map
return Optional.empty();
}else{
// returning a path-map pair with the words and the rows where each word has been found
return Optional.of(new Pair<Path,Map<String, List<Integer>>>(path, map));
}
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
}
}
We are looking to have a StringBuilder that holds references to some events in the device.
We considered writing to and reading from a file, but the cost of opening and closing a file every time we write to it seems too high.
The issue is that sometimes we get a StackOverflow exception, even though we try to keep the StringBuilder at a defined size:
public class DiagnosticUtil {
private static final int DIAGNOSTIC_SIZE = 5000;
public static StringBuilder DIAGNOSTICS_HOLDER = new StringBuilder(DIAGNOSTIC_SIZE);
public static void addDiagnosticLine(String message){
try {
// Limits the size of the diagnostics collection by keeping only the last 2000 characters
if (DiagnosticUtil.DIAGNOSTICS_HOLDER.length() > DIAGNOSTIC_SIZE - 300) {
DiagnosticUtil.DIAGNOSTICS_HOLDER.delete(0, DiagnosticUtil.DIAGNOSTICS_HOLDER.length() - 2000);
}
DIAGNOSTICS_HOLDER.append(TimeUtils.getCurrentDate()).append(message).append("\n");
}catch (Exception e){
Timber.d("Error saving additional data");
}
}
}
The question is: is this a good approach, or should we save these logs to an external file?
Thanks!
When you create a StringBuilder with a preferred size, you consume memory immediately, because StringBuilder internally creates a char[] of that size before you pass any String to it, so you may want to use the default constructor there. Also, why did you decide to use a StringBuilder instead of a List? I don't see the whole picture, but I think you may prefer to combine the two approaches (an in-memory log and a file log): when you have collected a certain number of messages, simply write them to a file. This way you don't touch the filesystem for every message and you don't fill memory with that amount of log data. You need code something like this:
public class DiagnosticUtil {
private final static int threshold = 1000;
private static List<String> messages = new ArrayList<>();
private static final File log = new File("path to your file");
public static void addDiagnosticLine(String message) {
    messages.add(TimeUtils.getCurrentDate() + message + "\n");
    if (messages.size() > threshold) {
        // Append to the file so earlier batches are not overwritten
        try (BufferedWriter file = new BufferedWriter(new FileWriter(log, true))) {
            for (String msg : messages) {
                file.write(msg);
            }
            file.flush();
        } catch (IOException e) {
            Timber.d("Error saving additional data " + e);
        }
        messages = new ArrayList<>();
    }
}
}
Pay attention: this is procedural code, not OOP, and util classes like this are generally considered bad practice.
I wrote a method processTrainDirectory which is supposed to import and process all the text files from a given directory. Individually processing the files takes about the same time for each file (90ms), but when I use the method for batch importing a given directory, the time per file increases incrementally (from 90ms to over 4000ms after 300 files). The batch importing method is as follows:
public void processTrainDirectory(String folderPath, Category category) {
File folder = new File(folderPath);
File[] listOfFiles = folder.listFiles();
if (listOfFiles != null) {
for (File file : listOfFiles) {
if (file.isFile()) {
processTrainText(file.getPath(), category);
}
}
}
else {
System.out.println("Could not list files in " + folderPath);
}
}
As I said, the method processTrainText is called per text file in the directory. This method takes incrementally longer when used inside processTrainDirectory. The method processTrainText is as follows:
public void processTrainText(String path, Category category){
trainTextAmount++;
Map<String, Integer> text = prepareText(path);
update(text, category);
}
I called processTrainText manually 200 times on 200 different texts, and the total time was about 200 * 90 ms. But when I have a directory of 200 files and use processTrainDirectory, the per-file times go 90-92-96-104...3897-3940-4002 ms, which is WAY longer.
The problem persists when I call processTrainText a second time; it does not reset. Do you have any idea why this happens, what the cause is, and how I can solve it?
Any help is greatly appreciated!
EDIT: Somebody asked what the other called methods do, so here are all the methods used from my class BayesianClassifier (all others are deleted for clarity); underneath you can find the class Category:
public class BayesianClassifier {
private Map<String, Integer> vocabulary;
private List<Category> categories;
private int trainTextAmount;
private int testTextAmount;
private GUI gui;
public Map<String, Integer> prepareText(String path) {
String text = readText(path);
String normalizedText = normalizeText(text);
String[] tokenizedText = tokenizeText(normalizedText);
return countText(tokenizedText);
}
public String readText(String path) {
BufferedReader br;
String result = "";
try {
br = new BufferedReader(new FileReader(path));
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null) {
sb.append(line);
sb.append("\n");
line = br.readLine();
}
result = sb.toString();
br.close();
} catch (IOException e) {
e.printStackTrace();
}
return result;
}
public Map<String, Integer> countText(String[] words){
Map<String, Integer> result = new HashMap<>();
for(int i=0; i < words.length; i++){
if (!result.containsKey(words[i])){
result.put(words[i], 1);
}
else {
result.put(words[i], result.get(words[i]) + 1);
}
}
return result;
}
public void processTrainText(String path, Category category){
trainTextAmount++;
Map<String, Integer> text = prepareText(path);
update(text, category);
}
public void update(Map<String, Integer> text, Category category) {
category.addText();
for (Map.Entry<String, Integer> entry : text.entrySet()){
if(!vocabulary.containsKey(entry.getKey())){
vocabulary.put(entry.getKey(), entry.getValue());
category.updateFrequency(entry);
category.updateProbability(entry);
category.updatePrior();
}
else {
vocabulary.put(entry.getKey(), vocabulary.get(entry.getKey()) + entry.getValue());
category.updateFrequency(entry);
category.updateProbability(entry);
category.updatePrior();
}
for(Category cat : categories){
if (!cat.equals(category)){
cat.addWord(entry.getKey());
cat.updatePrior();
}
}
}
}
public void processTrainDirectory(String folderPath, Category category) {
File folder = new File(folderPath);
File[] listOfFiles = folder.listFiles();
if (listOfFiles != null) {
for (File file : listOfFiles) {
if (file.isFile()) {
processTrainText(file.getPath(), category);
}
}
}
else {
System.out.println("Could not list files in " + folderPath);
}
}
}
This is my Category class (all the methods that are not needed are deleted for clarity):
public class Category {
private String categoryName;
private double prior;
private Map<String, Integer> frequencies;
private Map<String, Double> probabilities;
private int textAmount;
private BayesianClassifier bc;
public Category(String categoryName, BayesianClassifier bc){
this.categoryName = categoryName;
this.bc = bc;
this.frequencies = new HashMap<>();
this.probabilities = new HashMap<>();
this.textAmount = 0;
this.prior = 0.00;
}
public void addWord(String word){
this.frequencies.put(word, 0);
this.probabilities.put(word, 0.0);
}
public void updateFrequency(Map.Entry<String, Integer> entry){
if(!this.frequencies.containsKey(entry.getKey())){
this.frequencies.put(entry.getKey(), entry.getValue());
}
else {
this.frequencies.put(entry.getKey(), this.frequencies.get(entry.getKey()) + entry.getValue());
}
}
public void updateProbability(Map.Entry<String, Integer> entry){
double chance = ((double) this.frequencies.get(entry.getKey()) + 1) / (sumFrequencies() + bc.getVocabulary().size());
this.probabilities.put(entry.getKey(), chance);
}
public Integer sumFrequencies(){
Integer sum = 0;
for (Integer integer : this.frequencies.values()) {
sum = sum + integer;
}
return sum;
}
}
It looks like the times per file are growing linearly and the total time quadratically. This means that with each file you're processing the data of all previous files. Indeed, you are:
updateProbability calls sumFrequencies, which runs through the entire frequencies map, which grows with each file. That's the culprit. Simply create a field int sumFrequencies and update it in updateFrequency.
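A minimal sketch of that fix (same names as your Category class; only the field is new):
private int sumFrequencies = 0;   // running total, maintained incrementally

public void updateFrequency(Map.Entry<String, Integer> entry) {
    frequencies.merge(entry.getKey(), entry.getValue(), Integer::sum);
    sumFrequencies += entry.getValue();   // O(1) instead of re-summing the whole map
}

public Integer sumFrequencies() {
    return sumFrequencies;
}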
As a further improvement, consider using Guava Multiset, which does the counting in a simpler and more efficient way (no autoboxing). After fixing your code, consider letting it be reviewed on CR; there are quite a few minor problems with it.
what is this method doing?
update(text, category);
If it is doing what I guess it is doing, then this may be your bottleneck.
If you call it standalone, without additional context, and it only updates some fixed data structure, then yes, it will always take the same time.
If it updates something that holds data from your past iterations, then I am pretty sure it will take more and more time; check the complexity of the update() method and reduce your bottleneck.
Update:
Your method updateProbability works on all the data you have gathered so far when it calculates the sum of frequencies, thus taking more and more time the more files you process. This is your bottleneck.
There is no need to calculate it every time; just save it and update it whenever something changes, to minimize the amount of calculation.
I've got a .mod file and I can run it in Java (using NetBeans).
The file gets its data from another .dat file, because the guy who developed it used GUSEK. Now we need to implement it in Java, but I don't know how to put data into the set K in the .mod file.
The way doesn't matter; it can be through database queries or file reading.
I don't know anything about mathematical programming; I just need to add values to the already-made GLPK model.
Here's the .mod function:
# OPRE
set K;
param mc {k in K};
param phi {k in K};
param cman {k in K};
param ni {k in K};
param cesp;
param mf;
var x {k in K} binary;
minimize custo: sum {k in K} (mc[k]*phi[k]*(1-x[k]) + cman[k]*phi[k]*x[k]);
s.t. recursos: sum {k in K} (cman[k]*phi[k]*x[k]) - cesp <= 0;
s.t. ocorrencias: sum {k in K} (ni[k] + (1-x[k])*phi[k]) - mf <= 0;
end;
And here's the java code:
package br.com.genera.service.otimi;
import org.gnu.glpk.*;
public class Gmpl implements GlpkCallbackListener, GlpkTerminalListener {
private boolean hookUsed = false;
public static void main(String[] arg) {
String[] nomeArquivo = new String[2];
nomeArquivo[0] = "C:\\PodaEquipamento.mod";
System.out.println(nomeArquivo[0]);
GLPK.glp_java_set_numeric_locale("C");
System.out.println(nomeArquivo[0]);
new Gmpl().solve(nomeArquivo);
}
public void solve(String[] arg) {
glp_prob lp = null;
glp_tran tran;
glp_iocp iocp;
String fname;
int skip = 0;
int ret;
// listen to callbacks
GlpkCallback.addListener(this);
// listen to terminal output
GlpkTerminal.addListener(this);
fname = arg[0];
lp = GLPK.glp_create_prob();
System.out.println("Problem created");
tran = GLPK.glp_mpl_alloc_wksp();
ret = GLPK.glp_mpl_read_model(tran, fname, skip);
if (ret != 0) {
GLPK.glp_mpl_free_wksp(tran);
GLPK.glp_delete_prob(lp);
throw new RuntimeException("Model file not found: " + fname);
}
// generate model
GLPK.glp_mpl_generate(tran, null);
// build model
GLPK.glp_mpl_build_prob(tran, lp);
// set solver parameters
iocp = new glp_iocp();
GLPK.glp_init_iocp(iocp);
iocp.setPresolve(GLPKConstants.GLP_ON);
// do not listen to output anymore
GlpkTerminal.removeListener(this);
// solve model
ret = GLPK.glp_intopt(lp, iocp);
// postsolve model
if (ret == 0) {
GLPK.glp_mpl_postsolve(tran, lp, GLPKConstants.GLP_MIP);
}
// free memory
GLPK.glp_mpl_free_wksp(tran);
GLPK.glp_delete_prob(lp);
// do not listen for callbacks anymore
GlpkCallback.removeListener(this);
// check that the hook function has been used for terminal output.
if (!hookUsed) {
System.out.println("Error: The terminal output hook was not used.");
System.exit(1);
}
}
@Override
public boolean output(String str) {
hookUsed = true;
System.out.print(str);
return false;
}
@Override
public void callback(glp_tree tree) {
int reason = GLPK.glp_ios_reason(tree);
if (reason == GLPKConstants.GLP_IBINGO) {
System.out.println("Better solution found");
}
}
}
And I'm getting this in the console:
Reading model section from C:\PodaEquipamento.mod...
33 lines were read
Generating custo...
C:\PodaEquipamento.mod:24: no value for K
glp_mpl_build_prob: invalid call sequence
Hope someone can help, thanks.
The best way would be to read the data file the same way you read the model file.
ret = GLPK.glp_mpl_read_data(tran, fname_data, skip);
if (ret != 0) {
GLPK.glp_mpl_free_wksp(tran);
GLPK.glp_delete_prob(lp);
throw new RuntimeException("Data file not found: " + fname_data);
}
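For reference, the data read slots in between the model read and the generate/build calls in your solve() method. A rough sketch of the call order (fname_data would be the path to your .dat file, e.g. taken from arg[1]; check the exact glp_mpl_read_data signature against your GLPK Java binding):
ret = GLPK.glp_mpl_read_model(tran, fname, skip);
// ... error handling as above ...
ret = GLPK.glp_mpl_read_data(tran, fname_data, skip);
// ... error handling as above ...
GLPK.glp_mpl_generate(tran, null);
GLPK.glp_mpl_build_prob(tran, lp);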
I resolved it by just copying the data block from the .dat file into the .mod file.
Anyway, thanks puhgee.
I have an InputStream over an HTML file. I have to get the bytes from the input stream.
I have a string: "XYZ". I'd like to convert this string to byte format and check if there is a match for it in the byte sequence I obtained from the InputStream. If there is, I have to replace the match with the byte sequence for some other string.
Is there anyone who could help me with this? I have used regex to find and replace in strings; however, I don't know how to find and replace in a byte stream.
Previously, I used jsoup to parse the HTML and replace the string; however, due to some UTF encoding problems, the file appears corrupted when I do that.
TL;DR: My question is:
Is there a way to find and replace a string in byte format in a raw InputStream in Java?
Not sure you have chosen the best approach to solve your problem.
That said, I don't like to (and have as policy not to) answer questions with "don't" so here goes...
Have a look at FilterInputStream.
From the documentation:
A FilterInputStream contains some other input stream, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality.
It was a fun exercise to write it up. Here's a complete example for you:
import java.io.*;
import java.util.*;
class ReplacingInputStream extends FilterInputStream {
LinkedList<Integer> inQueue = new LinkedList<Integer>();
LinkedList<Integer> outQueue = new LinkedList<Integer>();
final byte[] search, replacement;
protected ReplacingInputStream(InputStream in,
byte[] search,
byte[] replacement) {
super(in);
this.search = search;
this.replacement = replacement;
}
private boolean isMatchFound() {
Iterator<Integer> inIter = inQueue.iterator();
for (int i = 0; i < search.length; i++)
if (!inIter.hasNext() || search[i] != inIter.next())
return false;
return true;
}
private void readAhead() throws IOException {
// Work up some look-ahead.
while (inQueue.size() < search.length) {
int next = super.read();
inQueue.offer(next);
if (next == -1)
break;
}
}
@Override
public int read() throws IOException {
// Next byte already determined.
if (outQueue.isEmpty()) {
readAhead();
if (isMatchFound()) {
for (int i = 0; i < search.length; i++)
inQueue.remove();
for (byte b : replacement)
outQueue.offer((int) b);
} else
outQueue.add(inQueue.remove());
}
return outQueue.remove();
}
// TODO: Override the other read methods.
}
Example Usage
class Test {
public static void main(String[] args) throws Exception {
byte[] bytes = "hello xyz world.".getBytes("UTF-8");
ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
byte[] search = "xyz".getBytes("UTF-8");
byte[] replacement = "abc".getBytes("UTF-8");
InputStream ris = new ReplacingInputStream(bis, search, replacement);
ByteArrayOutputStream bos = new ByteArrayOutputStream();
int b;
while (-1 != (b = ris.read()))
bos.write(b);
System.out.println(new String(bos.toByteArray()));
}
}
Given the bytes for the string "hello xyz world." it prints:
hello abc world.
The following approach will work, but I don't know how big the impact on performance is.
Wrap the InputStream with a InputStreamReader,
wrap the InputStreamReader with a FilterReader that replaces the strings, then
wrap the FilterReader with a ReaderInputStream.
It is crucial to choose the appropriate encoding, otherwise the content of the stream will become corrupted.
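A minimal sketch of that wrapping chain, assuming a hypothetical ReplacingReader (a FilterReader subclass you would implement yourself) and Apache Commons IO's ReaderInputStream for the last step; originalIn stands for your original InputStream:
Charset cs = StandardCharsets.UTF_8;  // must match the actual encoding of the stream
Reader decoded = new InputStreamReader(originalIn, cs);                  // step 1: bytes -> chars
Reader replacing = new ReplacingReader(decoded, "XYZ", "replacement");   // step 2: hypothetical FilterReader
InputStream result = new ReaderInputStream(replacing, cs);               // step 3: chars -> bytes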
If you want to use regular expressions to replace the strings, then you can use Streamflyer, a tool of mine, which is a convenient alternative to FilterReader. You will find an example for byte streams on the webpage of Streamflyer. Hope this helps.
I needed something like this as well and decided to roll my own solution instead of using the example above by @aioobe. Have a look at the code. You can pull the library from maven central, or just copy the source code.
This is how you use it. In this case, I'm using a nested instance to replace two patterns, to fix DOS and Mac line endings.
new ReplacingInputStream(new ReplacingInputStream(is, "\n\r", "\n"), "\r", "\n");
Here's the full source code:
/**
* Simple FilterInputStream that can replace occurrences of bytes with something else.
*/
public class ReplacingInputStream extends FilterInputStream {
// while matching, this is where the bytes go.
int[] buf=null;
int matchedIndex=0;
int unbufferIndex=0;
int replacedIndex=0;
private final byte[] pattern;
private final byte[] replacement;
private State state=State.NOT_MATCHED;
// simple state machine for keeping track of what we are doing
private enum State {
NOT_MATCHED,
MATCHING,
REPLACING,
UNBUFFER
}
/**
* @param is input
* @return nested replacing stream that replaces \n\r (DOS) and \r (MAC) line endings with UNIX ones "\n".
*/
public static InputStream newLineNormalizingInputStream(InputStream is) {
return new ReplacingInputStream(new ReplacingInputStream(is, "\n\r", "\n"), "\r", "\n");
}
/**
* Replace occurrences of pattern in the input. Note: input is assumed to be UTF-8 encoded. If that is not the case, use the byte[] based pattern and replacement.
* @param in input
* @param pattern pattern to replace
* @param replacement the replacement or null
*/
public ReplacingInputStream(InputStream in, String pattern, String replacement) {
this(in,pattern.getBytes(StandardCharsets.UTF_8), replacement==null ? null : replacement.getBytes(StandardCharsets.UTF_8));
}
/**
* Replace occurrences of pattern in the input.
* @param in input
* @param pattern pattern to replace
* @param replacement the replacement or null
*/
public ReplacingInputStream(InputStream in, byte[] pattern, byte[] replacement) {
super(in);
Validate.notNull(pattern);
Validate.isTrue(pattern.length>0, "pattern length should be > 0", pattern.length);
this.pattern = pattern;
this.replacement = replacement;
// we will never match more than the pattern length
buf = new int[pattern.length];
}
@Override
public int read(byte[] b, int off, int len) throws IOException {
// copy of parent logic; we need to call our own read() instead of super.read(), which delegates instead of calling our read
if (b == null) {
throw new NullPointerException();
} else if (off < 0 || len < 0 || len > b.length - off) {
throw new IndexOutOfBoundsException();
} else if (len == 0) {
return 0;
}
int c = read();
if (c == -1) {
return -1;
}
b[off] = (byte)c;
int i = 1;
try {
for (; i < len ; i++) {
c = read();
if (c == -1) {
break;
}
b[off + i] = (byte)c;
}
} catch (IOException ee) {
}
return i;
}
@Override
public int read(byte[] b) throws IOException {
// call our own read
return read(b, 0, b.length);
}
@Override
public int read() throws IOException {
// use a simple state machine to figure out what we are doing
int next;
switch (state) {
case NOT_MATCHED:
// we are not currently matching, replacing, or unbuffering
next=super.read();
if(pattern[0] == next) {
// clear whatever was there
buf=new int[pattern.length]; // clear whatever was there
// make sure we start at 0
matchedIndex=0;
buf[matchedIndex++]=next;
if(pattern.length == 1) {
// edgecase when the pattern length is 1 we go straight to replacing
state=State.REPLACING;
// reset replace counter
replacedIndex=0;
} else {
// pattern is longer than 1, keep matching
state=State.MATCHING;
}
// recurse to continue matching
return read();
} else {
return next;
}
case MATCHING:
// the previous bytes matched part of the pattern
next=super.read();
if(pattern[matchedIndex]==next) {
buf[matchedIndex++]=next;
if(matchedIndex==pattern.length) {
// we've found a full match!
if(replacement==null || replacement.length==0) {
// the replacement is empty, go straight to NOT_MATCHED
state=State.NOT_MATCHED;
matchedIndex=0;
} else {
// start replacing
state=State.REPLACING;
replacedIndex=0;
}
}
} else {
// mismatch -> unbuffer
buf[matchedIndex++]=next;
state=State.UNBUFFER;
unbufferIndex=0;
}
return read();
case REPLACING:
// we've fully matched the pattern and are returning bytes from the replacement
next=replacement[replacedIndex++];
if(replacedIndex==replacement.length) {
state=State.NOT_MATCHED;
replacedIndex=0;
}
return next;
case UNBUFFER:
// we partially matched the pattern before encountering a non matching byte
// we need to serve up the buffered bytes before we go back to NOT_MATCHED
next=buf[unbufferIndex++];
if(unbufferIndex==matchedIndex) {
state=State.NOT_MATCHED;
matchedIndex=0;
}
return next;
default:
throw new IllegalStateException("no such state " + state);
}
}
@Override
public String toString() {
return state.name() + " " + matchedIndex + " " + replacedIndex + " " + unbufferIndex;
}
}
There isn't any built-in functionality for search-and-replace on byte streams (InputStream).
And, a method for completing this task efficiently and correctly is not immediately obvious. I have implemented the Boyer-Moore algorithm for streams, and it works well, but it took some time. Without an algorithm like this, you have to resort to a brute-force approach where you look for the pattern starting at every position in the stream, which can be slow.
Even if you decode the HTML as text, using a regular expression to match patterns might be a bad idea, since HTML is not a "regular" language.
So, even though you've run into some difficulties, I suggest you pursue your original approach of parsing the HTML as a document. While you are having trouble with the character encoding, it will probably be easier, in the long run, to fix the right solution than it will be to jury-rig the wrong solution.
I needed a solution to this, but found the answers here incurred too much memory and/or CPU overhead. The below solution significantly outperforms the others here in these terms based on simple benchmarking.
This solution is especially memory-efficient, incurring no measurable cost even with >GB streams.
That said, this is not a zero-CPU-cost solution. The CPU/processing-time overhead is probably reasonable for all but the most demanding/resource-sensitive scenarios, but the overhead is real and should be considered when evaluating the worthiness of employing this solution in a given context.
In my case, our max real-world file size that we are processing is about 6MB, where we see added latency of about 170ms with 44 URL replacements. This is for a Zuul-based reverse-proxy running on AWS ECS with a single CPU share (1024). For most of the files (under 100KB), the added latency is sub-millisecond. Under high-concurrency (and thus CPU contention), the added latency could increase, however we are currently able to process hundreds of the files concurrently on a single node with no humanly-noticeable latency impact.
The solution we are using:
import java.io.IOException;
import java.io.InputStream;
public class TokenReplacingStream extends InputStream {
private final InputStream source;
private final byte[] oldBytes;
private final byte[] newBytes;
private int tokenMatchIndex = 0;
private int bytesIndex = 0;
private boolean unwinding;
private int mismatch;
private int numberOfTokensReplaced = 0;
public TokenReplacingStream(InputStream source, byte[] oldBytes, byte[] newBytes) {
assert oldBytes.length > 0;
this.source = source;
this.oldBytes = oldBytes;
this.newBytes = newBytes;
}
@Override
public int read() throws IOException {
if (unwinding) {
if (bytesIndex < tokenMatchIndex) {
return oldBytes[bytesIndex++];
} else {
bytesIndex = 0;
tokenMatchIndex = 0;
unwinding = false;
return mismatch;
}
} else if (tokenMatchIndex == oldBytes.length) {
if (bytesIndex == newBytes.length) {
bytesIndex = 0;
tokenMatchIndex = 0;
numberOfTokensReplaced++;
} else {
return newBytes[bytesIndex++];
}
}
int b = source.read();
if (b == oldBytes[tokenMatchIndex]) {
tokenMatchIndex++;
} else if (tokenMatchIndex > 0) {
mismatch = b;
unwinding = true;
} else {
return b;
}
return read();
}
@Override
public void close() throws IOException {
source.close();
}
public int getNumberOfTokensReplaced() {
return numberOfTokensReplaced;
}
}
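For completeness, a usage sketch (the input and URLs are made-up placeholders; readAllBytes requires Java 9+):
byte[] html = "<a href=\"http://old.example.com/x\">link</a>".getBytes(StandardCharsets.UTF_8);
InputStream in = new TokenReplacingStream(
        new ByteArrayInputStream(html),
        "http://old.example.com".getBytes(StandardCharsets.UTF_8),
        "https://new.example.com".getBytes(StandardCharsets.UTF_8));
String result = new String(in.readAllBytes(), StandardCharsets.UTF_8);  // old URL replaced with the new one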
I came up with this simple piece of code when I needed to serve a template file in a Servlet, replacing a certain keyword with a value. It should be pretty fast and low on memory. Using piped streams, I guess you can use it for all sorts of things (see the sketch after the code).
/JC
public static void replaceStream(InputStream in, OutputStream out, String search, String replace) throws IOException
{
replaceStream(new InputStreamReader(in), new OutputStreamWriter(out), search, replace);
}
public static void replaceStream(Reader in, Writer out, String search, String replace) throws IOException
{
char[] searchChars = search.toCharArray();
int[] buffer = new int[searchChars.length];
int x, r, si = 0, sm = searchChars.length;
while ((r = in.read()) != -1) {
if (searchChars[si] == r) {
// The char matches our pattern
buffer[si++] = r;
if (si == sm) {
// We have reached a matching string
out.write(replace);
si = 0;
}
} else if (si > 0) {
// No match and buffered char(s), empty buffer and pass the char forward
for (x = 0; x < si; x++) {
out.write(buffer[x]);
}
si = 0;
out.write(r);
} else {
// No match and nothing buffered, just pass the char forward
out.write(r);
}
}
// Empty buffer
for (x = 0; x < si; x++) {
out.write(buffer[x]);
}
}
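As mentioned above, a rough sketch of hooking replaceStream up to piped streams, so a consumer can read the replaced content back as an InputStream (originalIn and the search/replace values are placeholders; thread handling is kept minimal):
PipedInputStream replacedIn = new PipedInputStream();
PipedOutputStream pipeOut = new PipedOutputStream(replacedIn);
new Thread(() -> {
    try (Reader r = new InputStreamReader(originalIn, StandardCharsets.UTF_8);
         Writer w = new OutputStreamWriter(pipeOut, StandardCharsets.UTF_8)) {
        replaceStream(r, w, "${keyword}", "value");   // closing w signals end-of-stream to the reader side
    } catch (IOException e) {
        e.printStackTrace();
    }
}).start();
// replacedIn can now be handed to any API that expects an InputStream.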