I have the following code, which performs several different lookups on one stream.
private void getBuildInformation(Stream<String> lines)
{
Supplier<Stream<String>> streamSupplier = () -> lines;
String buildNumber = null;
String scmRevision = null;
String timestamp = null;
String buildTag = null;
Optional<String> hasBuildNumber = streamSupplier.get().filter(s -> s.contains(LogProps.PLM_BUILD)).findFirst();
if (hasBuildNumber.isPresent())
{
buildNumber = hasBuildNumber.get();
String[] temp = buildNumber.split("=");
if (temp.length >= 2)
buildNumber = temp[1].trim();
}
Optional<String> hasSCMRevision = streamSupplier.get().filter(s -> s.contains(LogProps.SCM_REVISION_50)).findFirst();
if (hasSCMRevision.isPresent())
{
scmRevision = hasSCMRevision.get();
String[] temp = scmRevision.split(":");
if (temp.length >= 4)
scmRevision = temp[3].trim();
}
Optional<String> hasBuildTag = streamSupplier.get().filter(s -> s.contains(LogProps.BUILD_TAG_50)).findFirst();
if (hasBuildTag.isPresent())
{
buildTag = hasBuildTag.get();
String[] temp = buildTag.split(":");
if (temp.length >= 4)
buildTag = temp[3].trim();
}
Optional<String> hasTimestamp = streamSupplier.get().filter(s -> s.contains(LogProps.BUILD_TIMESTAMP_50)).findFirst();
if (hasTimestamp.isPresent())
{
timestamp = hasTimestamp.get();
String[] temp = timestamp.split(":");
if (temp.length >= 4)
timestamp = temp[3].trim();
}
}
Now the problem is that the first call,
Optional<String> hasBuildNumber = streamSupplier.get().filter(s -> s.contains(LogProps.PLM_BUILD)).findFirst();
works properly, but the next call,
Optional<String> hasSCMRevision = streamSupplier.get().filter(s -> s.contains(LogProps.SCM_REVISION_50)).findFirst();
throws the following exception:
Exception in thread "Thread-21" java.lang.IllegalStateException: stream has already been operated upon or closed
at java.util.stream.AbstractPipeline.<init>(AbstractPipeline.java:203)
at java.util.stream.ReferencePipeline.<init>(ReferencePipeline.java:94)
at java.util.stream.ReferencePipeline$StatelessOp.<init>(ReferencePipeline.java:618)
at java.util.stream.ReferencePipeline$2.<init>(ReferencePipeline.java:163)
at java.util.stream.ReferencePipeline.filter(ReferencePipeline.java:162)
at com.dscsag.dscxps.model.analysis.Analysis.getECTRBuildInformation(Analysis.java:205)
at com.dscsag.dscxps.model.analysis.Analysis.parseLogFile(Analysis.java:153)
at com.dscsag.dscxps.model.analysis.Analysis.analyze(Analysis.java:135)
at com.dscsag.dscxps.model.XPSModel.lambda$startAnalysis$0(XPSModel.java:467)
at com.dscsag.dscxps.model.XPSModel$$Lambda$1/12538894.run(Unknown Source)
at java.lang.Thread.run(Thread.java:745)
After reading this page http://winterbe.com/posts/2014/07/31/java8-stream-tutorial-examples/ I thought this should work, because the supplier is supposed to provide a new stream on each get().
If you rewrite your supplier as an anonymous pre-Java 8 class, it would be equivalent to:
Supplier<Stream<String>> streamSupplier = new Supplier<Stream<String>>() {
@Override
public Stream<String> get() {
return lines;
}
};
Maybe here it becomes more obvious that you are returning the same stream instance each time you call get() on your supplier, not a brand-new Stream; hence the exception thrown on the second call, because findFirst is a short-circuiting terminal operation.
In the web page example you linked, the author uses Stream.of, which creates a brand-new Stream each time get() is called; that is why it works.
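A small runnable sketch of the difference (the example values here are made up, not your actual data):
import java.util.function.Supplier;
import java.util.stream.Stream;

public class SupplierDemo {
    public static void main(String[] args) {
        // Fresh stream per call: both terminal operations succeed.
        Supplier<Stream<String>> fresh = () -> Stream.of("a", "b", "c");
        System.out.println(fresh.get().count());      // 3
        System.out.println(fresh.get().findFirst());  // Optional[a]

        // Captured stream: the second terminal operation fails.
        Stream<String> captured = Stream.of("a", "b", "c");
        Supplier<Stream<String>> reused = () -> captured;
        System.out.println(reused.get().count());     // 3
        System.out.println(reused.get().findFirst()); // IllegalStateException
    }
}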
AFAIK there is no way to duplicate an existing Stream. So one workaround is to pass in the object the Stream comes from and obtain a fresh Stream inside the supplier.
public class Test {
public static void main(String[] args) {
getBuildInformation(Arrays.asList("TEST", "test"));
}
private static void getBuildInformation(List<String> lines) {
Supplier<Stream<String>> streamSupplier = () -> lines.stream();
Optional<String> hasBuildNumber = streamSupplier.get().filter(s -> s.contains("t")).findFirst();
System.out.println(hasBuildNumber);
Optional<String> hasSCMRevision = streamSupplier.get().filter(s -> s.contains("T")).findFirst();
System.out.println(hasSCMRevision);
}
}
Which outputs:
Optional[test]
Optional[TEST]
Since you get the lines from a Path object, handling the exception in the Supplier itself can get quite ugly, so you can create a helper method that deals with the checked exception. It would then look like this:
private static void getBuildInformation(Path path) {
Supplier<Stream<String>> streamSupplier = () -> lines(path);
//do your stuff
}
private static Stream<String> lines(Path path) {
try {
return Files.lines(path);
}
catch (IOException e) {
throw new UncheckedIOException(e);
}
}
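A hedged sketch of how this supplier would then be used inside getBuildInformation, reusing the LogProps constants from the question. Each get() call opens a fresh Stream over the file (note that streams returned by Files.lines should ideally be closed, e.g. with try-with-resources):
Supplier<Stream<String>> streamSupplier = () -> lines(path);
// Each terminal operation below runs on its own, freshly opened pipeline.
Optional<String> hasBuildNumber = streamSupplier.get()
        .filter(s -> s.contains(LogProps.PLM_BUILD))
        .findFirst();
Optional<String> hasSCMRevision = streamSupplier.get()
        .filter(s -> s.contains(LogProps.SCM_REVISION_50))
        .findFirst();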
I want to use a Stream to parallelize processing of a heterogeneous set of remotely stored JSON files of unknown number (the number of files is not known upfront). The files can vary widely in size, from 1 JSON record per file up to 100,000 records in some other files. A JSON record in this case means a self-contained JSON object represented as one line in the file.
I really want to use Streams for this and so I implemented this Spliterator:
public abstract class JsonStreamSpliterator<METADATA, RECORD> extends AbstractSpliterator<RECORD> {
abstract protected JsonStreamSupport<METADATA> openInputStream(String path);
abstract protected RECORD parse(METADATA metadata, Map<String, Object> json);
private static final int ADDITIONAL_CHARACTERISTICS = Spliterator.IMMUTABLE | Spliterator.DISTINCT | Spliterator.NONNULL;
private static final int MAX_BUFFER = 100;
private final Iterator<String> paths;
private JsonStreamSupport<METADATA> reader = null;
public JsonStreamSpliterator(Iterator<String> paths) {
this(Long.MAX_VALUE, ADDITIONAL_CHARACTERISTICS, paths);
}
private JsonStreamSpliterator(long est, int additionalCharacteristics, Iterator<String> paths) {
super(est, additionalCharacteristics);
this.paths = paths;
}
private JsonStreamSpliterator(long est, int additionalCharacteristics, Iterator<String> paths, String nextPath) {
this(est, additionalCharacteristics, paths);
open(nextPath);
}
@Override
public boolean tryAdvance(Consumer<? super RECORD> action) {
if(reader == null) {
String path = takeNextPath();
if(path != null) {
open(path);
}
else {
return false;
}
}
Map<String, Object> json = reader.readJsonLine();
if(json != null) {
RECORD item = parse(reader.getMetadata(), json);
action.accept(item);
return true;
}
else {
reader.close();
reader = null;
return tryAdvance(action);
}
}
private void open(String path) {
reader = openInputStream(path);
}
private String takeNextPath() {
synchronized(paths) {
if(paths.hasNext()) {
return paths.next();
}
}
return null;
}
@Override
public Spliterator<RECORD> trySplit() {
String nextPath = takeNextPath();
if(nextPath != null) {
return new JsonStreamSpliterator<METADATA,RECORD>(Long.MAX_VALUE, ADDITIONAL_CHARACTERISTICS, paths, nextPath) {
@Override
protected JsonStreamSupport<METADATA> openInputStream(String path) {
return JsonStreamSpliterator.this.openInputStream(path);
}
@Override
protected RECORD parse(METADATA metaData, Map<String,Object> json) {
return JsonStreamSpliterator.this.parse(metaData, json);
}
};
}
else {
List<RECORD> records = new ArrayList<RECORD>();
while(tryAdvance(records::add) && records.size() < MAX_BUFFER) {
// loop
}
if(records.size() != 0) {
return records.spliterator();
}
else {
return null;
}
}
}
}
The problem I'm having is that while the Stream parallelizes beautifully at first, eventually the largest file is left processing in a single thread. I believe the proximal cause is well documented: the spliterator is "unbalanced".
More concretely, it appears that the trySplit method is not called after a certain point in Stream.forEach's lifecycle, so the extra logic that distributes small batches at the end of trySplit is rarely executed.
Notice how all the spliterators returned from trySplit share the same paths iterator. I thought this was a really clever way to balance the work across all spliterators, but it hasn't been enough to achieve full parallelism.
I would like the parallel processing to proceed first across files, and then when few large files are still left spliterating, I want to parallelize across chunks of the remaining files. That was the intent of the else block at the end of trySplit.
Is there an easy / simple / canonical way around this problem?
Your trySplit should output splits of equal size, regardless of the size of the underlying files. You should treat all the files as a single unit and fill up the ArrayList-backed spliterator with the same number of JSON objects each time. The number of objects should be such that processing one split takes between 1 and 10 milliseconds: lower than 1 ms and you start approaching the costs of handing off the batch to a worker thread, higher than that and you start risking uneven CPU load due to tasks which are too coarse-grained.
The spliterator is not obliged to report a size estimate, and you are already doing this correctly: your estimate is Long.MAX_VALUE, which is a special value meaning "unbounded". However, if you have many files with a single JSON object, resulting in batches of size 1, this will hurt your performance in two ways: the overhead of opening-reading-closing the file may become a bottleneck and, if you manage to escape that, the cost of thread handoff may be significant compared to the cost of processing one item, again causing a bottleneck.
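For illustration, a minimal sketch of this batching idea. The names here are hypothetical, and in real code the records would be read lazily from the open files rather than from a pre-flattened iterator:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators.AbstractSpliterator;
import java.util.function.Consumer;

class BatchingSpliterator<T> extends AbstractSpliterator<T> {
    private static final int BATCH_SIZE = 128; // tune so one batch takes roughly 1-10 ms to process
    private final Iterator<T> source;          // all records from all files, flattened into one iterator

    BatchingSpliterator(Iterator<T> source) {
        super(Long.MAX_VALUE, 0); // size unknown, no extra characteristics reported
        this.source = source;
    }

    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        T next;
        synchronized (source) {
            next = source.hasNext() ? source.next() : null;
        }
        if (next == null) {
            return false;
        }
        action.accept(next);
        return true;
    }

    @Override
    public Spliterator<T> trySplit() {
        // Always hand off a fixed-size, ArrayList-backed batch, so every split
        // carries the same amount of work regardless of which file it came from.
        List<T> batch = new ArrayList<>(BATCH_SIZE);
        synchronized (source) {
            while (batch.size() < BATCH_SIZE && source.hasNext()) {
                batch.add(source.next());
            }
        }
        return batch.isEmpty() ? null : batch.spliterator();
    }
}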
Five years ago I was solving a similar problem; you can have a look at my solution.
After much experimentation, I was still not able to get any added parallelism by playing with the size estimates. Basically, any value other than Long.MAX_VALUE will tend to cause the spliterator to terminate too early (and without any splitting), while on the other hand a Long.MAX_VALUE estimate will cause trySplit to be called relentlessly until it returns null.
The solution I found is to internally share resources among the spliterators and let them rebalance amongst themselves.
Working code:
public class AwsS3LineSpliterator<LINE> extends AbstractSpliterator<AwsS3LineInput<LINE>> {
public final static class AwsS3LineInput<LINE> {
final public S3ObjectSummary s3ObjectSummary;
final public LINE lineItem;
public AwsS3LineInput(S3ObjectSummary s3ObjectSummary, LINE lineItem) {
this.s3ObjectSummary = s3ObjectSummary;
this.lineItem = lineItem;
}
}
private final class InputStreamHandler {
final S3ObjectSummary file;
final InputStream inputStream;
InputStreamHandler(S3ObjectSummary file, InputStream is) {
this.file = file;
this.inputStream = is;
}
}
private final Iterator<S3ObjectSummary> incomingFiles;
private final Function<S3ObjectSummary, InputStream> fileOpener;
private final Function<InputStream, LINE> lineReader;
private final Deque<S3ObjectSummary> unopenedFiles;
private final Deque<InputStreamHandler> openedFiles;
private final Deque<AwsS3LineInput<LINE>> sharedBuffer;
private final int maxBuffer;
private AwsS3LineSpliterator(Iterator<S3ObjectSummary> incomingFiles, Function<S3ObjectSummary, InputStream> fileOpener,
Function<InputStream, LINE> lineReader,
Deque<S3ObjectSummary> unopenedFiles, Deque<InputStreamHandler> openedFiles, Deque<AwsS3LineInput<LINE>> sharedBuffer,
int maxBuffer) {
super(Long.MAX_VALUE, 0);
this.incomingFiles = incomingFiles;
this.fileOpener = fileOpener;
this.lineReader = lineReader;
this.unopenedFiles = unopenedFiles;
this.openedFiles = openedFiles;
this.sharedBuffer = sharedBuffer;
this.maxBuffer = maxBuffer;
}
public AwsS3LineSpliterator(Iterator<S3ObjectSummary> incomingFiles, Function<S3ObjectSummary, InputStream> fileOpener, Function<InputStream, LINE> lineReader, int maxBuffer) {
this(incomingFiles, fileOpener, lineReader, new ConcurrentLinkedDeque<>(), new ConcurrentLinkedDeque<>(), new ArrayDeque<>(maxBuffer), maxBuffer);
}
@Override
public boolean tryAdvance(Consumer<? super AwsS3LineInput<LINE>> action) {
AwsS3LineInput<LINE> lineInput;
synchronized(sharedBuffer) {
lineInput=sharedBuffer.poll();
}
if(lineInput != null) {
action.accept(lineInput);
return true;
}
InputStreamHandler handle = openedFiles.poll();
if(handle == null) {
S3ObjectSummary unopenedFile = unopenedFiles.poll();
if(unopenedFile == null) {
return false;
}
handle = new InputStreamHandler(unopenedFile, fileOpener.apply(unopenedFile));
}
for(int i=0; i < maxBuffer; ++i) {
LINE line = lineReader.apply(handle.inputStream);
if(line != null) {
synchronized(sharedBuffer) {
sharedBuffer.add(new AwsS3LineInput<LINE>(handle.file, line));
}
}
else {
return tryAdvance(action);
}
}
openedFiles.addFirst(handle);
return tryAdvance(action);
}
@Override
public Spliterator<AwsS3LineInput<LINE>> trySplit() {
synchronized(incomingFiles) {
if (incomingFiles.hasNext()) {
unopenedFiles.add(incomingFiles.next());
return new AwsS3LineSpliterator<LINE>(incomingFiles, fileOpener, lineReader, unopenedFiles, openedFiles, sharedBuffer, maxBuffer);
} else {
return null;
}
}
}
}
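For completeness, a hedged usage sketch (listS3Files, openS3Object, readJsonLine, MyRecord and handle are hypothetical placeholders, not part of the class above): the spliterator is turned into a parallel Stream with StreamSupport.stream, which is what ultimately drives trySplit and tryAdvance:
Iterator<S3ObjectSummary> files = listS3Files();                  // hypothetical listing helper
AwsS3LineSpliterator<MyRecord> spliterator = new AwsS3LineSpliterator<>(
        files,
        summary -> openS3Object(summary),  // S3ObjectSummary -> InputStream (hypothetical)
        in -> readJsonLine(in),            // InputStream -> MyRecord, null at end of stream (hypothetical)
        100);                              // maxBuffer

StreamSupport.stream(spliterator, true)    // true = parallel
        .forEach(input -> handle(input.s3ObjectSummary, input.lineItem)); // hypothetical handler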
This is not a direct answer to your question, but I think it is worth trying Stream in the abacus-common library:
void test_58601518() throws Exception {
final File tempDir = new File("./temp/");
// Prepare the test files:
// if (!(tempDir.exists() && tempDir.isDirectory())) {
// tempDir.mkdirs();
// }
//
// final Random rand = new Random();
// final int fileCount = 1000;
//
// for (int i = 0; i < fileCount; i++) {
// List<String> lines = Stream.repeat(TestUtil.fill(Account.class), rand.nextInt(1000) * 100 + 1).map(it -> N.toJSON(it)).toList();
// IOUtil.writeLines(new File("./temp/_" + i + ".json"), lines);
// }
N.println("Xmx: " + IOUtil.MAX_MEMORY_IN_MB + " MB");
N.println("total file size: " + Stream.listFiles(tempDir).mapToLong(IOUtil::sizeOf).sum() / IOUtil.ONE_MB + " MB");
final AtomicLong counter = new AtomicLong();
final Consumer<Account> yourAction = it -> {
counter.incrementAndGet();
it.toString().replace("a", "bbb");
};
long startTime = System.currentTimeMillis();
Stream.listFiles(tempDir) // the file/data source could be local file system or remote file system.
.parallel(2) // thread number used to load the file/data and convert the lines to Java objects.
.flatMap(f -> Stream.lines(f).map(line -> N.fromJSON(Account.class, line))) // only a limited number of lines (fewer than 1024) will be loaded into memory at a time.
.parallel(8) // thread number used to execute your action.
.forEach(yourAction);
N.println("Took: " + ((System.currentTimeMillis()) - startTime) + " ms" + " to process " + counter + " lines/objects");
// IOUtil.deleteAllIfExists(tempDir);
}
Throughout the run, the CPU usage on my laptop stays pretty high (about 70%), and it took about 70 seconds to process 51,899,100 lines/objects from 1000 files with an Intel(R) Core(TM) i5-8365U CPU and Xmx256m JVM memory. The total file size is about 4524 MB. If yourAction is not a heavy operation, a sequential stream could be even faster than a parallel stream.
FYI: I'm the developer of abacus-common.
I'm learning Java 8 and I'm trying to process a CSV file in Java:
List<Catalogo> catalogos = new ArrayList<>();
try (Stream<String> lines = Files.lines(Paths.get("src\\main\\resources\\productos.csv"), Charset.forName("Cp1252"))) {
List<String[]> data = lines.map(s -> s.split(","))
.collect(Collectors.toList());
createCatalog(catalogos, data);
catalogos.forEach(System.out::println);
} catch (IOException e) {
e.printStackTrace();
}
}
public static void createCatalog(List<Catalogo> catalogos, List<String[]> data) {
for (String[] x : data) {
for (int i = 0; i < x.length; i++) {
Catalogo catalogo = new Catalogo();
catalogo.setCodigo(x[0]);
catalogo.setProducto(x[1]);
catalogo.setTipo(x[2]);
catalogo.setPrecio(x[3]);
catalogo.setInventario(x[4]);
catalogos.add(catalogo);
}
}
}
I would like to know if it's possible to improve this code; I don't like the way I have done it.
You can map directly to your object using a constructor that accepts all your attributes, such as:
try...
List<Catalogo> catalogos = lines.map(s -> s.split(","))
.map(s -> new Catalogo(s[0], s[1], s[2], s[3], s[4]))
.collect(Collectors.toList());
catch...
where, based on the existing code, the constructor would have the signature:
Catalogo(String codigo, String producto, String tipo, String precio, String inventario)
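A minimal sketch of such a constructor (the field names are assumed from the setters used in the question):
public Catalogo(String codigo, String producto, String tipo, String precio, String inventario) {
    this.codigo = codigo;
    this.producto = producto;
    this.tipo = tipo;
    this.precio = precio;
    this.inventario = inventario;
}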
I have a regex pattern of words like welcome1|welcome2|changeme... which I need to search for in thousands of files (anywhere between 100 and 8000), each ranging from 1 KB to 24 MB in size.
I would like to know if there's a faster way of pattern matching than doing what I have been trying.
Environment:
jdk 1.8
Windows 10
Unix4j Library
Here's what I have tried so far:
try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(FilePredicates.isFileAndNotDirectory())) {
List<String> obviousStringsList = Strings_PASSWORDS.stream()
.map(s -> ".*" + s + ".*").collect(Collectors.toList()); //because Unix4j apparently needs this
Pattern pattern = Pattern.compile(String.join("|", obviousStringsList));
GrepOptions options = new GrepOptions.Default(GrepOption.count,
GrepOption.ignoreCase,
GrepOption.lineNumber,
GrepOption.matchingFiles);
Instant startTime = Instant.now();
final List<Path> filesWithObviousStringss = stream
.filter(path -> !Unix4j.grep(options, pattern, path.toFile()).toStringResult().isEmpty())
.collect(Collectors.toList());
System.out.println("Time taken = " + Duration.between(startTime, Instant.now()).getSeconds() + " seconds");
}
I get Time taken = 60 seconds, which makes me think I'm doing something really wrong.
I've tried different approaches with the stream, and on average every method takes about a minute to process my current folder of 6660 files.
Grep on msys2/mingw64 takes about 15 seconds, and exec('grep...') in node.js takes about 12 seconds consistently.
I chose Unix4j because it provides a native Java grep and clean code.
Is there a way to produce better results in Java that I'm sadly missing?
The main reason native tools can process such text files much faster is their assumption of one particular charset, especially one with an ASCII-based 8-bit encoding, whereas Java performs a byte-to-character conversion whose abstraction is capable of supporting arbitrary charsets.
When we similarly assume a single charset with the properties named above, we can use low-level tools that may increase the performance dramatically.
For such an operation, we define the following helper methods:
private static char[] getTable(Charset cs) {
if(cs.newEncoder().maxBytesPerChar() != 1f)
throw new UnsupportedOperationException("Not an 8 bit charset");
byte[] raw = new byte[256];
IntStream.range(0, 256).forEach(i -> raw[i] = (byte)i);
char[] table = new char[256];
cs.newDecoder().onUnmappableCharacter(CodingErrorAction.REPLACE)
.decode(ByteBuffer.wrap(raw), CharBuffer.wrap(table), true);
for(int i = 0; i < 128; i++)
if(table[i] != i) throw new UnsupportedOperationException("Not ASCII based");
return table;
}
and
private static CharSequence mapAsciiBasedText(Path p, char[] table) throws IOException {
try(FileChannel fch = FileChannel.open(p, StandardOpenOption.READ)) {
long actualSize = fch.size();
int size = (int)actualSize;
if(size != actualSize) throw new UnsupportedOperationException("file too large");
MappedByteBuffer mbb = fch.map(FileChannel.MapMode.READ_ONLY, 0, actualSize);
final class MappedCharSequence implements CharSequence {
final int start, size;
MappedCharSequence(int start, int size) {
this.start = start;
this.size = size;
}
public int length() {
return size;
}
public char charAt(int index) {
if(index < 0 || index >= size) throw new IndexOutOfBoundsException();
byte b = mbb.get(start + index);
return b<0? table[b+256]: (char)b;
}
public CharSequence subSequence(int start, int end) {
int newSize = end - start;
if(start<0 || end < start || end-start > size)
throw new IndexOutOfBoundsException();
return new MappedCharSequence(start + this.start, newSize);
}
public String toString() {
return new StringBuilder(size).append(this).toString();
}
}
return new MappedCharSequence(0, size);
}
}
This allows mapping a file into virtual memory and projecting it directly to a CharSequence, without copy operations, assuming the mapping can be done with a simple table; for ASCII-based charsets, the majority of characters do not even need a table lookup, as their numerical value is identical to the Unicode code point.
With these methods, you may implement the operation as
// You need this only once per JVM.
// Note that running inside IDEs like Netbeans may change the default encoding
char[] table = getTable(Charset.defaultCharset());
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream//.parallel()
.filter(path -> {
try {
return pattern.matcher(mapAsciiBasedText(path, table)).find();
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
This runs much faster than the normal text conversion, but still supports parallel execution.
Besides requiring an ASCII-based single-byte encoding, there's the restriction that this code doesn't support files larger than 2 GiB. While it is possible to extend the solution to support larger files, I wouldn't add this complication unless really needed.
I don’t know what “Unix4j” provides that isn’t already in the JDK, as the following code does everything with built-in features:
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream
.filter(path -> {
try(Scanner s = new Scanner(path)) {
return s.findWithinHorizon(pattern, 0) != null;
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
One important property of this solution is that it doesn’t read the whole file, but stops at the first encountered match. Also, it doesn’t deal with line boundaries, which is suitable for the words you’re looking for, as they never contain line breaks anyway.
After analyzing the findWithinHorizon operation, I think that line-by-line processing may be better for larger files, so you may try
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream
.filter(path -> {
try(Stream<String> s = Files.lines(path)) {
return s.anyMatch(pattern.asPredicate());
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
instead.
You may also try turning the stream to parallel mode, e.g.
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream
.parallel()
.filter(path -> {
try(Stream<String> s = Files.lines(path)) {
return s.anyMatch(pattern.asPredicate());
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
It’s hard to predict whether this has a benefit, as in most cases, the I/O dominates such an operation.
I have never used Unix4j, but Java provides nice file APIs as well nowadays. Also, Unix4j#grep seems to return all the matches found (as you're using .toStringResult().isEmpty()), while you only seem to need to know whether at least one match was found (which means you should be able to break as soon as one match is found). Maybe this library provides another method that better suits your needs, e.g. something like #contains? Without Unix4j, Stream#anyMatch could be a good candidate here. Here is a vanilla Java solution if you want to compare it with yours:
private boolean lineContainsObviousStrings(String line) {
return Strings_PASSWORDS // <-- weird naming BTW
.stream()
.anyMatch(line::contains);
}
private boolean fileContainsObviousStrings(Path path) {
try (Stream<String> stream = Files.lines(path)) {
return stream.anyMatch(this::lineContainsObviousStrings);
} catch (IOException e) {
throw new UncheckedIOException(e); // Files.lines throws a checked IOException
}
}
public List<Path> findFilesContainingObviousStrings() throws IOException {
Instant startTime = Instant.now();
try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))) {
return stream
.filter(FilePredicates.isFileAndNotDirectory())
.filter(this::fileContainsObviousStrings)
.collect(Collectors.toList());
} finally {
Instant endTime = Instant.now();
System.out.println("Time taken = " + Duration.between(startTime, endTime).getSeconds() + " seconds");
}
}
Please try this out too (if possible); I am curious how it performs on your files.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class Filescan {
public static void main(String[] args) throws IOException {
Filescan sc = new Filescan();
sc.findWords("src/main/resources/files", new String[]{"author", "book"}, true);
}
// kind of Tuple/Map.Entry
static class Pair<K,V>{
final K key;
final V value;
Pair(K key, V value){
this.key = key;
this.value = value;
}
@Override
public String toString() {
return key + " " + value;
}
}
public void findWords(String directory, String[] words, boolean ignorecase) throws IOException{
final String[] searchWords = ignorecase ? toLower(words) : words;
try (Stream<Path> stream = Files.walk(Paths.get(directory)).filter(Files::isRegularFile)) {
long startTime = System.nanoTime();
List<Pair<Path,Map<String, List<Integer>>>> result = stream
// you can test it with parallel execution, maybe it is faster
.parallel()
// searching
.map(path -> findWordsInFile(path, searchWords, ignorecase))
// filtering out empty optionals
.filter(Optional::isPresent)
// unwrap optionals
.map(Optional::get).collect(Collectors.toList());
System.out.println("Time taken = " + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()
- startTime) + " seconds");
System.out.println("result:");
result.forEach(System.out::println);
}
}
private String[] toLower(String[] words) {
String[] ret = new String[words.length];
for (int i = 0; i < words.length; i++) {
ret[i] = words[i].toLowerCase();
}
return ret;
}
private static Optional<Pair<Path,Map<String, List<Integer>>>> findWordsInFile(Path path, String[] words, boolean ignorecase) {
try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path.toFile())))) {
String line = br.readLine();
line = ignorecase & line != null ? line.toLowerCase() : line;
Map<String, List<Integer>> map = new HashMap<>();
int linecount = 0;
while(line != null){
for (String word : words) {
if(line.contains(word)){
if(!map.containsKey(word)){
map.put(word, new ArrayList<Integer>());
}
map.get(word).add(linecount);
}
}
line = br.readLine();
line = ignorecase & line != null ? line.toLowerCase() : line;
linecount++;
}
if(map.isEmpty()){
// returning empty optional when nothing in the map
return Optional.empty();
}else{
// returning a path-map pair with the words and the rows where each word has been found
return Optional.of(new Pair<Path,Map<String, List<Integer>>>(path, map));
}
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
}
}
I am new to Java 8 concurrency features such as CompletableFuture, and I hope you can help me get started with the following use case.
There is a service called TimeConsumingService that provides time-consuming operations which I'd like to run in parallel, since all of them are independent.
interface TimeConsumingService {
default String hello(String name) {
System.out.println(System.currentTimeMillis() + " > hello " + name);
return "Hello " + name;
}
default String planet(String name) {
System.out.println(System.currentTimeMillis() + " > planet " + name);
return "Planet: " + name;
}
default String echo(String name) {
System.out.println(System.currentTimeMillis() + " > echo " + name);
return name;
}
default byte[] convert(String hello, String planet, String echo) {
StringBuilder sb = new StringBuilder();
sb.append(hello);
sb.append(planet);
sb.append(echo);
return sb.toString().getBytes();
}
}
So far I have implemented the following example and managed to call all three service methods in parallel.
public class Runner implements TimeConsumingService {
public static void main(String[] args) {
new Runner().doStuffAsync();
}
public void doStuffAsync() {
CompletableFuture<String> future1 = CompletableFuture.supplyAsync(() -> this.hello("Friend"));
CompletableFuture<String> future2 = CompletableFuture.supplyAsync(() -> this.planet("Earth"));
CompletableFuture<String> future3 = CompletableFuture.supplyAsync(() -> this.echo("Where is my echo?"));
CompletableFuture.allOf(future1, future2, future3).join();
}
}
Is there a way to collect the return values of each service call and invoke the method byte[] convert(String, String, String)?
To combine the results after they have all been returned, you can do something like this:
CompletableFuture<byte[]> byteFuture = CompletableFuture.allOf(cf1, cf2, cf3)
.thenApplyAsync(aVoid -> convert(cf1.join(), cf2.join(), cf3.join()));
byte[] bytes = byteFuture.join();
This will run all of your futures, wait for them all to complete, and then, as soon as they are all finished, call the convert method you mention.
After the join you can simply get() the value from each future, like:
String s1 = future1.get();
and so on.
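A minimal sketch of this approach (join to wait for all futures, then get() on each; note that get() declares checked exceptions that have to be handled):
CompletableFuture.allOf(future1, future2, future3).join();
try {
    // All futures are already complete here, so each get() returns immediately.
    byte[] result = convert(future1.get(), future2.get(), future3.get());
} catch (InterruptedException | ExecutionException e) {
    // Handle or rethrow as appropriate for your application.
    throw new IllegalStateException(e);
}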
You can combine them using the thenCombine() method if there are only 3 futures to complete:
final CompletableFuture<byte[]> byteFuture = future1.thenCombine(future2, (t, u) -> {
StringBuilder sb = new StringBuilder();
sb.append(t);
sb.append(u);
return sb.toString();
}).thenCombine(future3, (t, u) -> {
StringBuilder sb = new StringBuilder();
sb.append(t);
sb.append(u);
return sb.toString();
}).thenApply(s -> s.getBytes());
try {
final byte[] get = byteFuture.get();
} catch (InterruptedException | ExecutionException ex) {
}
I am trying to learn how to use Java 8 features (such as lambdas and streams) in my daily programming, since they make for much cleaner code.
Here's what I am currently working on:
I get a stream of strings from a local file with some data, which I turn into objects later. The input file structure looks something like this:
Airport name; Country; Continent; some number;
And my code looks like this:
public class AirportConsumer implements AirportAPI {
List<Airport> airports = new ArrayList<Airport>();
@Override
public Stream<Airport> getAirports() {
Stream<String> stream = null;
try {
stream = Files.lines(Paths.get("resources/planes.txt"));
stream.forEach(line -> createAirport(line));
} catch (IOException e) {
e.printStackTrace();
}
return airports.stream();
}
public void createAirport(String line) {
String airport, country, continent;
int length;
airport = line.substring(0, line.indexOf(';')).trim();
line = line.replace(airport + ";", "");
country = line.substring(0,line.indexOf(';')).trim();
line = line.replace(country + ";", "");
continent = line.substring(0,line.indexOf(';')).trim();
line = line.replace(continent + ";", "");
length = Integer.parseInt(line.substring(0,line.indexOf(';')).trim());
airports.add(new Airport(airport, country, continent, length));
}
}
And in my main class I iterate over the object stream and print out the results:
public class Main {
public void toString(Airport t){
System.out.println(t.getName() + " " + t.getContinent());
}
public static void main(String[] args) throws IOException {
Main m = new Main();
m.whatever();
}
private void whatever() throws IOException {
AirportAPI k = new AirportConsumer();
Stream<Airport> s;
s = k.getAirports();
s.forEach(this::toString);
}
}
My question is this: how can I optimize this code so that I don't have to parse the lines from the file separately, but instead create a stream of Airport objects straight from the source file? Or is this the extent to which I can do this?
You need to use map() to transform the data as it comes past.
Files.lines(Paths.get("resources/planes.txt"))
.map(line -> createAirport(line));
This will return a Stream<Airport> - if you want to return a List, then you'll need to use the collect method at the end.
This approach is also stateless, which means you won't need the instance-level airports value.
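For instance, collecting straight into a List might look like this (a minimal sketch, assuming createAirport returns an Airport as shown below):
// Files.lines throws IOException; wrap this in try-with-resources in real code.
List<Airport> airports = Files.lines(Paths.get("resources/planes.txt"))
        .map(line -> createAirport(line))
        .collect(Collectors.toList());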
You'll need to update your createAirport method to return something:
public Airport createAirport(String line) {
String airport = line.substring(0, line.indexOf(';')).trim();
line = line.replace(airport + ";", "");
String country = line.substring(0,line.indexOf(';')).trim();
line = line.replace(country + ";", "");
String continent = line.substring(0,line.indexOf(';')).trim();
line = line.replace(continent + ";", "");
int length = Integer.parseInt(line.substring(0,line.indexOf(';')).trim());
return new Airport(airport, country, continent, length);
}
If you're looking for a more functional approach to your code, you may want to consider rewriting createAirport so it doesn't mutate line. Builders are also nice for this kind of thing (a sketch follows the next snippet).
public Airport createAirport(final String line) {
final String[] fields = line.split(";");
return new Airport(fields[0].trim(),
fields[1].trim(),
fields[2].trim(),
Integer.parseInt(fields[3].trim()));
}
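As an aside, a minimal sketch of the builder idea mentioned above (the Airport field names and constructor arguments are assumed from the question's code):
public final class AirportBuilder {
    private String name;
    private String country;
    private String continent;
    private int length;

    public AirportBuilder name(String name) { this.name = name; return this; }
    public AirportBuilder country(String country) { this.country = country; return this; }
    public AirportBuilder continent(String continent) { this.continent = continent; return this; }
    public AirportBuilder length(int length) { this.length = length; return this; }

    public Airport build() {
        // Assumes the same constructor used in createAirport above.
        return new Airport(name, country, continent, length);
    }
}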
Putting it all together, your class now looks like this:
public class AirportConsumer implements AirportAPI {
@Override
public Stream<Airport> getAirports() {
Stream<String> stream = null;
try {
stream = Files.lines(Paths.get("resources/planes.txt"))
.map(line -> createAirport(line));
} catch (IOException e) {
stream = Stream.empty();
e.printStackTrace();
}
return stream;
}
private Airport createAirport(final String line) {
final String[] fields = line.split(";");
return new Airport(fields[0].trim(),
fields[1].trim(),
fields[2].trim(),
Integer.parseInt(fields[3].trim()));
}
}
The code posted by Steve looks great, but there are still two places that can be improved:
1. How the string is split.
2. It may cause issues if people forget (or don't know) to close the stream created by calling the getAirports() method, so it's better to finish the task (toList() or whatever) in place.
Here is the code using abacus-common:
try(Reader reader = IOUtil.createBufferedReader(file)) {
List<Airport> airportList = Stream.of(reader).map(line -> {
String[] strs = Splitter.with(";").trim(true).splitToArray(line);
return new Airport(strs[0], strs[1], strs[2], Integer.valueOf(strs[3]));
}).toList();
} catch (IOException e) {
throw new RuntimeException(e);
}
// Or by using Try:
List<Airport> airportList = Try.stream(file).call(s -> s.map(line -> {
String[] strs = Splitter.with(";").trim(true).splitToArray(line);
return new Airport(strs[0], strs[1], strs[2], Integer.valueOf(strs[3]));
}).toList());
Disclosure: I'm the developer of abacus-common.