I have ~10M rows of data, each containing ~1000 columns (String & Numeric). What I need is to be able to apply simple filters (>, <, RANGE, ==) to this data set as quickly as possible (less than a second to get a 10K slice of this data).
What kind of production-ready technology that can be used from Java exists for this?
Where is your data coming from? This sounds like a task for a database.
A SQL database with an index on the fields you're filtering. The index can be based on the numeric value, which will make range and equality queries pretty quick.
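As a rough illustration, that kind of filter can be pushed down to an indexed column via plain JDBC (a minimal sketch; the table name, column name, and connection URL are assumptions, not from the question):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
public class IndexedFilterExample {
    public static void main(String[] args) throws SQLException {
        // Assumes a table "rows" with an indexed numeric column "price":
        //   CREATE INDEX idx_rows_price ON rows (price);
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT * FROM rows WHERE price BETWEEN ? AND ? LIMIT 10000")) {
            ps.setDouble(1, 10.0);
            ps.setDouble(2, 20.0);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process each matching row here
                }
            }
        }
    }
}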
If it's not coming from a database, you can do the filtering in a few threads and then combine the results to improve performance.
For example (here AMOUNT is the number of elements in your map):
package com.stackoverflow.test;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
public class Test6 {
    private static final int AMOUNT = 10000000;
    private static final int CORES = Runtime.getRuntime().availableProcessors();
    private static final int PART = AMOUNT / CORES;
    private static final class MapFilterTask implements Callable<Map<String, Number>> {
        private final Integer fromElement;
        private final Integer toElement;
        private final Map<String, Number> map;
        private MapFilterTask(Map<String, Number> map, Integer fromElement, Integer toElement) {
            this.map = map;
            this.fromElement = fromElement;
            this.toElement = toElement;
        }
        public Map<String, Number> call() throws Exception {
            Map<String, Number> result = new HashMap<String, Number>();
            for (int i = fromElement; i <= toElement; i++) {
                // filter your slice of the map here and add matching entries to result
            }
            return result;
        }
    }
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        Map<String, Number> yourMap = new HashMap<String, Number>();
        ExecutorService taskExecutor = Executors.newFixedThreadPool(CORES);
        List<Callable<Map<String, Number>>> tasks = new ArrayList<Callable<Map<String, Number>>>();
        for (int i = 0; i < CORES; i++) {
            tasks.add(new MapFilterTask(yourMap, i * PART, (i + 1) * PART - 1));
        }
        List<Future<Map<String, Number>>> futures = taskExecutor.invokeAll(tasks);
        Map<String, Number> newMap = new HashMap<String, Number>();
        for (Future<Map<String, Number>> future : futures) {
            newMap.putAll(future.get());
        }
        taskExecutor.shutdown();
    }
}
For me it runs about 4 times faster, but only with the VM args -Xms2048M -Xmx2048M.
Without those VM args I got about a 1.7x improvement on my laptop with a 4-core processor running Linux Mint.
Context
I want to iterate over a Spark Dataset and update a HashMap for each row.
Here is the code I have:
// At this point, I have a my_dataset variable containing 300 000 rows and 10 columns
// - my_dataset.count() == 300 000
// - my_dataset.columns().length == 10
// Declare my HashMap
HashMap<String, Vector<String>> my_map = new HashMap<String, Vector<String>>();
// Initialize the map
for(String col : my_dataset.columns())
{
my_map.put(col, new Vector<String>());
}
// Iterate over the dataset and update the map
my_dataset.foreach( (ForeachFunction<Row>) row -> {
for(String col : my_map.keySet())
{
my_map.get(col).add(row.get(row.fieldIndex(col)).toString());
}
});
Issue
My issue is that the foreach doesn't iterate at all: the lambda is never executed, and I don't know why.
I implemented it as indicated here: How to traverse/iterate a Dataset in Spark Java?
At the end, all the inner Vectors remain empty (as they were initialized) even though the Dataset is not (take a look at the first comments in the code sample above).
I know that the foreach never iterates because I did two tests:
Add an AtomicInteger to count the iterations and increment it right at the beginning of the lambda with the incrementAndGet() method. => The counter value remains 0 at the end of the process.
Print a debug message right at the beginning of the lambda. => The message is never displayed.
I'm not used to Java (even less to Java lambdas), so maybe I missed an important point, but I can't find it.
I am probably a little old school, but I have never liked lambdas too much, as they can get pretty complicated.
Here is a full example of a foreach():
package net.jgp.labs.spark.l240_foreach.l000;
import java.io.Serializable;
import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class ForEachBookApp implements Serializable {
private static final long serialVersionUID = -4250231621481140775L;
private final class BookPrinter implements ForeachFunction<Row> {
private static final long serialVersionUID = -3680381094052442862L;
@Override
public void call(Row r) throws Exception {
System.out.println(r.getString(2) + " can be bought at " + r.getString(
4));
}
}
public static void main(String[] args) {
ForEachBookApp app = new ForEachBookApp();
app.start();
}
private void start() {
SparkSession spark = SparkSession.builder().appName("For Each Book").master(
"local").getOrCreate();
String filename = "data/books.csv";
Dataset<Row> df = spark.read().format("csv").option("inferSchema", "true")
.option("header", "true")
.load(filename);
df.show();
df.foreach(new BookPrinter());
}
}
As you can see, this example reads a CSV file and prints a message from the data. It is fairly simple.
The foreach() instantiates a new class, where the work is done.
df.foreach(new BookPrinter());
The work is done in the call() method of the class:
private final class BookPrinter implements ForeachFunction<Row> {
@Override
public void call(Row r) throws Exception {
...
}
}
As you are new to Java, make sure you have the right signature (for classes and methods) and the right imports.
You can also clone the example from https://github.com/jgperrin/net.jgp.labs.spark/tree/master/src/main/java/net/jgp/labs/spark/l240_foreach/l000. This should help you with foreach().
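For comparison, here is roughly what the lambda form from the question would look like in the same setup (a sketch reusing the file and column indexes of the example above, so it assumes the same books.csv):
import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class ForEachBookLambdaApp {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("For Each Book (lambda)").master("local").getOrCreate();
        Dataset<Row> df = spark.read().format("csv")
                .option("inferSchema", "true")
                .option("header", "true")
                .load("data/books.csv");
        // The cast tells the compiler which foreach() overload to pick and keeps the lambda serializable.
        df.foreach((ForeachFunction<Row>) r ->
                System.out.println(r.getString(2) + " can be bought at " + r.getString(4)));
    }
}
Either way, the function runs on the executors, not in the driver, which is why driver-side collections are not updated by it.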
I am inserting into a database with batchUpdate. I need to generate a 10-digit id in which the first six digits stay the same, i.e. a yyMMdd prefix. I am trying to append four unique digits to this yyMMdd block taken from the local date.
The problem is, it generates lots of duplicates because the FOR loop runs faster than the millisecond clock advances.
Expected pattern: 210609xxxx, where 210609 is taken from the yyMMdd pattern of LocalDate in Java, and xxxx needs to be unique even if the FOR loop calls this method multiple times per millisecond.
public Long getUniquedeltaId() {
    final Long LIMIT = 10000L;
    final Long deltaId = Long.parseLong(Long.toString(Long.parseLong(java.time.LocalDate.now()
                    .format(DateTimeFormatter.ofPattern("yyMMdd"))))
            .concat(Long.toString(System.currentTimeMillis() % LIMIT)));
    System.out.println("deltaId " + deltaId);
    return deltaId;
}
I tried using System.nanoTime(), but it's returning only one unique id.
If you just need a plain list of unique IDs generated at one point in time, you can use the following method:
package example;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Collectors;
public class Main {
public static void main(String[] args) {
final String prefix = "210609";// hardcoded yyMMdd prefix for the example
final List<String> uniqueIds = ThreadLocalRandom.current().ints(0, 10_000)// ints from 0 to 10000 (exclusive) -> every possible 4 digit number
.distinct()// only distinct numbers
.limit(1000L)// exactly 1000 (up to 10000 possible)
.mapToObj(v -> String.format("%04d", v))// always 4 digits (format as string, left-pad with 0s)
.map(v -> prefix + v)// add our prefix
.collect(Collectors.toList());
System.out.println(uniqueIds);
}
}
If you need a component that provides you with one unique ID at a time, you can use this class:
package example;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
public class IdGenerator {
private static final int LENGTH = 10_000;
private static final DateTimeFormatter DTF = DateTimeFormatter.ofPattern("yyMMdd");// unlike SimpleDateFormat, this is thread-safe
private final Object monitor;
private final AtomicInteger offset;
private final AtomicBoolean generationInProgress;
private volatile int[] ids;
private volatile LocalDate lastGeneratedDate;
public IdGenerator() {
this.monitor = new Object();
this.offset = new AtomicInteger(0);
this.generationInProgress = new AtomicBoolean(false);
this.ids = new int[LENGTH];
this.lastGeneratedDate = LocalDate.MIN;
}
public String nextId() throws InterruptedException {
final LocalDate currentDate = LocalDate.now();
while (this.lastGeneratedDate.isBefore(currentDate)) {
if (this.generationInProgress.compareAndSet(false, true)) {
this.ids = ThreadLocalRandom.current().ints(0, LENGTH)
.distinct()
.limit(LENGTH)
.toArray();
this.offset.set(0);
this.lastGeneratedDate = currentDate;
this.generationInProgress.set(false);
synchronized (this.monitor) {
this.monitor.notifyAll();
}
}
while (this.generationInProgress.get()) {
synchronized (this.monitor) {
this.monitor.wait();
}
}
}
final int myIndex = this.offset.getAndIncrement();
if (myIndex >= this.ids.length) {
throw new IllegalStateException("no more ids today");
}
return currentDate.format(DTF) + String.format("%04d", this.ids[myIndex]);
}
}
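A quick usage sketch of the class above (my addition, purely for illustration):
public class IdGeneratorDemo {
    public static void main(String[] args) throws InterruptedException {
        IdGenerator generator = new IdGenerator();
        for (int i = 0; i < 5; i++) {
            // Prints something like 2106094821: the yyMMdd prefix plus 4 unique digits for today.
            System.out.println(generator.nextId());
        }
    }
}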
Note that your pattern allows only 10,000 unique IDs per day: you limit yourself to 4 random digits per day (10^4 = 10000).
I am new to parallel streams and am trying to write a sample program that calculates value * 100 (for values 1 to 100) and stores the results in a map.
While executing the code I get a different count on each run.
I may be wrong somewhere, so please guide me if anyone knows the proper way to do this.
code:
import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.stream.Collectors;
public class Main{
static int l = 0;
public static void main (String[] args) throws java.lang.Exception {
letsGoParallel();
}
public static int makeSomeMagic(int data) {
l++;
return data * 100;
}
public static void letsGoParallel() {
List<Integer> dataList = new ArrayList<>();
for(int i = 1; i <= 100 ; i++) {
dataList.add(i);
}
Map<Integer, Integer> resultMap = new HashMap<>();
dataList.parallelStream().map(f -> {
Integer xx = 0;
{
xx = makeSomeMagic(f);
}
resultMap.put(f, xx);
return 0;
}).collect(Collectors.toList());
System.out.println("Input Size: " + dataList.size());
System.out.println("Size: " + resultMap.size());
System.out.println("Function Called: " + l);
}
}
Runnable Code
Last Output
Input Size: 100
Size: 100
Function Called: 98
The output differs on each run.
I want to use parallel streams in my own application, but because of this confusion/issue I can't.
In my application I have 100-200 unique numbers on which the same operation needs to be performed. In short, there's a function which processes something.
Your accesses to both the HashMap and the l variable are not thread-safe, which is why the output is different on each run.
The correct way to do what you are trying to do is collecting the Stream elements into a Map:
Map<Integer, Integer> resultMap =
    dataList.parallelStream()
            .collect(Collectors.toMap(Function.identity(), Main::makeSomeMagic));
EDIT: The l variable is still updated in a non-thread-safe way with this code, so you'll have to add your own thread safety if the final value of that variable is important to you.
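If counting the invocations matters to you, one option is an AtomicInteger instead of the plain int field (my sketch, not part of the original answer):
import java.util.concurrent.atomic.AtomicInteger;
public class MagicCounter {
    static final AtomicInteger calls = new AtomicInteger();
    public static int makeSomeMagic(int data) {
        calls.incrementAndGet(); // safe to call from multiple parallel-stream threads
        return data * 100;
    }
}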
By putting some values in resultMap you're using a side-effect:
dataList.parallelStream().map(f -> {
Integer xx = 0;
{
xx = makeSomeMagic(f);
}
resultMap.put(f, xx);
return 0;
})
The API states:
Stateless operations, such as filter and map, retain no state from
previously seen element when processing a new element -- each element
can be processed independently of operations on other elements.
It goes on:
Stream pipeline results may be nondeterministic or incorrect if the
behavioral parameters to the stream operations are stateful. A
stateful lambda (or other object implementing the appropriate
functional interface) is one whose result depends on any state which
might change during the execution of the stream pipeline.
This is followed by an example similar to yours, showing:
... if the mapping operation is performed in parallel, the results for
the same input could vary from run to run, due to thread scheduling
differences, whereas, with a stateless lambda expression the results
would always be the same.
That explains your observation that the output differs on each run.
The right approach is shown by @Eran.
Hopefully this works fine: make the makeSomeMagic function synchronized, use the thread-safe data structure ConcurrentHashMap,
and write a simple statement:
dataList.parallelStream().forEach(f -> resultMap.put(f, makeSomeMagic(f)));
The whole code is here:
import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.concurrent.ConcurrentHashMap;
public class Main{
static int l = 0;
public static void main (String[] args) throws java.lang.Exception {
letsGoParallel();
}
public synchronized static int makeSomeMagic(int data) { // make it synchronized
l++;
return data * 100;
}
public static void letsGoParallel() {
List<Integer> dataList = new ArrayList<>();
for(int i = 1; i <= 100 ; i++) {
dataList.add(i);
}
Map<Integer, Integer> resultMap = new ConcurrentHashMap<>();// use ConcurrentHashMap
dataList.parallelStream().forEach(f -> resultMap.put(f, makeSomeMagic(f)));
System.out.println("Input Size: " + dataList.size());
System.out.println("Size: " + resultMap.size());
System.out.println("Function Called: " + l);
}
}
There is no need to count how many times the method is invoked.
The Stream does the looping for you.
Pass your logic (a function) to the Stream, and do not use non-thread-safe variables in multi-threaded code (this includes parallelStream).
Like this:
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
public class ParallelStreamClient {
// static int l = 0;---> no need to count times.
public static void main(String[] args) throws java.lang.Exception {
letsGoParallel();
}
public static int makeSomeMagic(int data) {
// l++;-----> this is no thread-safe way
return data * 100;
}
public static void letsGoParallel() {
List<Integer> dataList = new ArrayList<>();
for (int i = 1; i <= 100; i++) {
dataList.add(i);
}
Map<Integer, Integer> resultMap =
dataList.parallelStream().collect(Collectors.toMap(i -> i,ParallelStreamClient::makeSomeMagic));
System.out.println("Input Size: " + dataList.size());
System.out.println("Size: " + resultMap.size());
//System.out.println("Function Called: " + l);
    }
}
Is there a way to get random lines from a Trove (TObjectIntHashMap)? I'm using Random to test how fast a Trove can seek/load 10,000 lines. Specifically, I'd like to pass in a random integer and have the Trove seek/load that line. I've tried using the get() method, but it requires that I pass a string rather than a random int. I've also considered using keys() to return an array and reading from that array, but that would defeat the purpose as I wouldn't be reading directly from the Trove. Here's my code:
import java.io.IOException;
import java.util.List;
import java.util.Random;
import com.comScore.TokenizerTests.Methods.TokenizerUtilities;
import gnu.trove.TObjectIntHashMap;
public class Trove {
public static TObjectIntHashMap<String> lines = new TObjectIntHashMap<String>();
public static void TroveMethod(List<String> fileInArrayList)
throws IOException {
TObjectIntHashMap<String> lines = readToTrove(fileInArrayList);
TokenizerUtilities.writeOutTrove(lines);
}
public static TObjectIntHashMap<String> readToTrove(
List<String> fileInArrayList) {
int lineCount = 0;
for (int i = 0; i < fileInArrayList.size(); i++) {
lines.adjustOrPutValue(fileInArrayList.get(i), 1, 1);
lineCount++;
}
TokenizerUtilities.setUrlInput(lineCount);
return lines;
}
public static void loadRandomMapEntries() {
Random rnd = new Random(lines.size());
int loadCount = 10000;
for (int i = 0; i < loadCount; i++) {
lines.get(rnd);
}
TokenizerUtilities.setLoadCount(loadCount);
}
}
The method in question is loadRandomMapEntries(), specifically the for-loop. Any help is appreciated. Thanks!
I would do the following (sketched below):
Create an array of the values you want to insert.
Loop through the array and insert those keys.
Pick a random index from the array and do the lookup for that key.
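A rough sketch of that approach (assuming the older gnu.trove.TObjectIntHashMap API used in the question; the keys and values are illustrative):
import java.util.Random;
import gnu.trove.TObjectIntHashMap;
public class RandomLookupSketch {
    public static void main(String[] args) {
        // 1. An array of the values you want to insert.
        String[] keys = {"alpha", "beta", "gamma", "delta"};
        // 2. Loop through the array and insert those keys.
        TObjectIntHashMap<String> map = new TObjectIntHashMap<String>();
        for (int i = 0; i < keys.length; i++) {
            map.put(keys[i], i);
        }
        // 3. Pick a random index into the array and look that key up directly in the map.
        Random rnd = new Random();
        long sum = 0;
        for (int i = 0; i < 10000; i++) {
            sum += map.get(keys[rnd.nextInt(keys.length)]);
        }
        System.out.println(sum); // use the result so the lookups aren't optimized away
    }
}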
There are benchmarks that come bundled with Trove that essentially do that already, so you could take a look at those.
Keep in mind that benchmarking is tricky to get right. I'd recommend that you use a framework like JMH for your benchmarking and be sure to always test in your application to see real-world performance.
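If you go the JMH route, a minimal benchmark skeleton might look like this (a sketch under the same assumptions as above, not a tuned benchmark):
import java.util.Random;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import gnu.trove.TObjectIntHashMap;
@State(Scope.Thread)
public class TroveLookupBenchmark {
    TObjectIntHashMap<String> map;
    String[] keys;
    Random rnd;
    @Setup
    public void setup() {
        map = new TObjectIntHashMap<String>();
        keys = new String[10000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = "key-" + i;
            map.put(keys[i], i);
        }
        rnd = new Random(42);
    }
    @Benchmark
    public int randomLookup() {
        // JMH measures how long one random get() takes, averaged over many invocations.
        return map.get(keys[rnd.nextInt(keys.length)]);
    }
}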
I have a group of Strings which represent product sizes, most of which are duplicated in meaning but not in name. (I.e. the size Large has at least 14 possible spellings, each of which needs to be preserved.) I need to sort these based on the size they represent. Any possible Small value should come before any possible Medium value, etc.
The only way I see this being possible is to implement a specific Comparator which contains different Sets grouping each spelling under the base size it represents. Then I can implement the -1, 0, 1 relationship by determining which Set a particular size falls into.
Is there a more robust way to accomplish this? Specifically, I'm worried about two weeks from now, when someone comes up with yet another way to spell Large.
Edit: to be clear, it's not the actual comparator I have a question about, it's the setup with the Sets containing each group. Is this a normal way to handle this situation? How do I future-proof it so each new size addition doesn't require a full recompile/deploy?
A custom comparator is the solution. I do not understand why you worry that this is not robust enough.
A simple approach would be to load the size aliases from a ResourceBundle. Some example code (put all the files in the same package):
An interface to encapsulate the size property
public interface Sized {
public String getSize();
}
A product class
public class Product implements Sized {
private final String size;
public Product(String size) {
this.size = size;
}
public String getSize() {
return size;
}
@Override
public String toString() {
return size;
}
}
A comparator that does the magic:
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.ResourceBundle;
public class SizedComparator implements Comparator<Sized> {
// maps size aliases to canonical sizes
private static final Map<String, String> sizes = new HashMap<String, String>();
static {
// create the lookup map from a resourcebundle
ResourceBundle sizesBundle = ResourceBundle
.getBundle(SizedComparator.class.getName());
for (String canonicalSize : sizesBundle.keySet()) {
String[] aliases = sizesBundle.getString(canonicalSize).split(",");
for (String alias : aliases) {
sizes.put(alias, canonicalSize);
}
}
}
@Override
public int compare(Sized s1, Sized s2) {
int result;
String c1 = getCanonicalSize(s1);
String c2 = getCanonicalSize(s2);
if (c1 == null && c2 == null) {
result = 0;
} else if (c1 == null) {
result = -1;
} else if (c2 == null) {
result = 1;
} else {
result = c1.compareTo(c2);
}
return result;
}
private String getCanonicalSize(Sized s1) {
String result = null;
if (s1 != null && s1.getSize() != null) {
result = sizes.get(s1.getSize());
}
return result;
}
}
SizedComparator.properties:
1 = Small,tiny
2 = medium,Average
3 = Large,big,HUGE
A unit test (just for the happy flow):
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;
public class FieldSortTest {
private static final String SMALL = "tiny";
private static final String LARGE = "Large";
private static final String MEDIUM = "medium";
private Comparator<Sized> instance;
@Before
public void setup() {
instance = new SizedComparator();
}
@Test
public void testHappy() {
List<Product> products = new ArrayList<Product>();
products.add(new Product(MEDIUM));
products.add(new Product(LARGE));
products.add(new Product(SMALL));
Collections.sort(products, instance);
Assert.assertSame(SMALL, products.get(0).getSize());
Assert.assertSame(MEDIUM, products.get(1).getSize());
Assert.assertSame(LARGE, products.get(2).getSize());
}
}
Note that ResourceBundles are cached automatically. You can reload the ResourceBundle programmatically with:
ResourceBundle.clearCache();
(since Java 1.6). Alternatively you could use some Spring magic to create an auto-reloading message resource.
If reading from a rickety properties file is not cool enough, you could quite easily keep your size aliases in a database too.
To impose an arbitrary ordering on a collection of strings (or objects in general), the standard approach is to implement a Comparator, as you suggest.
Apart from the 'manual' solution you suggest, you could consider comparing the relative edit distance of strings to canonical examples. This will be more flexible in the sense that it will work on alternatives you haven't thought of. But in terms of the work involved, it might be overkill for your application.
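For what it's worth, a rough sketch of that edit-distance idea, using a plain Levenshtein distance against canonical examples (all names here are illustrative, and this is only a starting point, not a drop-in solution):
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
public class FuzzySizeComparator implements Comparator<String> {
    // Canonical examples, listed in their intended sort order.
    private static final List<String> CANONICAL = Arrays.asList("small", "medium", "large");
    @Override
    public int compare(String a, String b) {
        return Integer.compare(rank(a), rank(b));
    }
    // Rank a spelling by the canonical size it is closest to (smallest edit distance).
    private static int rank(String s) {
        String lower = s.toLowerCase();
        int best = 0;
        int bestDistance = Integer.MAX_VALUE;
        for (int i = 0; i < CANONICAL.size(); i++) {
            int d = levenshtein(lower, CANONICAL.get(i));
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }
    // Standard dynamic-programming Levenshtein distance.
    private static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1), dp[i - 1][j - 1] + cost);
            }
        }
        return dp[a.length()][b.length()];
    }
}
Whether this ends up more or less fragile than an explicit alias list depends on how odd the spellings get ("XL" is closer to "small" than to "large" by raw edit distance), so it would need testing against real data.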