How to report memory leaks in Android/Java using PhantomReference

After reconstructing an old piece of code I once wrote, then forgot, and have now rewritten... I am putting it here as a wiki for all to use :-)
So, basically: if you have memory leaks in a complex Android app containing images and cross-references, how would you go about finding which (type of) objects are leaking? There are a few (very hard to learn and use) tools provided with the Android SDK, and probably more that I don't know about. Yet Java does provide PhantomReference as a means to do this, even though setting up the required classes can be a fair amount of work (and nasty too... see JDK-8034946).
But what is the simplest/most effective way of doing so? My solution is below.

LeakCanary is a third-party library that automatically detects memory leaks. After adding the dependency, you can add the following line to your Application class:
LeakCanary.install(this);
The library provides a nice notification & trace of the leak. You can also define your own reference watchers (although the default ones seem to work fairly well).
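For completeness, here is a minimal sketch of that setup with the older LeakCanary 1.x API quoted above. The Gradle coordinates and version in the comment are illustrative, not taken from the question; check the project's README for the current setup (newer LeakCanary 2.x versions install themselves automatically and need no code at all):
// build.gradle (app module) - illustrative coordinates for LeakCanary 1.x:
//   debugImplementation 'com.squareup.leakcanary:leakcanary-android:1.6.3'
//   releaseImplementation 'com.squareup.leakcanary:leakcanary-android-no-op:1.6.3'

import android.app.Application;
import com.squareup.leakcanary.LeakCanary;

public class MyApplication extends Application {
    @Override
    public void onCreate() {
        super.onCreate();
        if (LeakCanary.isInAnalyzerProcess(this)) {
            // This process is dedicated to LeakCanary's heap analysis; skip normal app init.
            return;
        }
        LeakCanary.install(this);
    }
}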

My solution in one class: "MemCheck"
To monitor any object, just call:
MemCheck.add( this ); // in any class constructor
This will "monitor" the number of objects allocated, and most importantly - Deallocated.
To log the leaks at any required time, call:
MemCheck.countAndLog();
As an alternative, set the periodic constant in MemCheck to 5 (number of seconds).
This will report the number of monitored objects in memory every 5 seconds. It will also conveniently log the used/free memory.
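To make the intended usage concrete, here is a minimal sketch (the ThumbnailCache class is purely illustrative) of how the MemCheck class below is meant to be wired in:
// Hypothetical class whose instances we suspect of leaking.
public class ThumbnailCache {
    public ThumbnailCache() {
        MemCheck.add( this ); // register every instance for leak monitoring
    }
}

// Later, e.g. from a debug menu or after tearing down a screen:
// forces a GC, then logs how many registered objects per class are still in memory.
MemCheck.countAndLog();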
So, MemCheck.java:
package com.xyz.util;
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map.Entry;
import java.util.TreeMap;
import android.os.Handler;
import android.util.Log;
public class MemCheck
{
private static final boolean enabled = true;
private static final int periodic = 0; // seconds, 0 == disabled
private static final String tag = MemCheck.class.getName();
private static TreeMap<String, RefCount> mObjectMap = new TreeMap<String, RefCount>();
private static Runnable mPeriodicRunnable = null;
private static Handler mPeriodicHandler = null;
public static void add( Object object )
{
if( !enabled )
return;
synchronized( mObjectMap )
{
String name = object.getClass().getName();
RefCount queue = mObjectMap.get( name );
if( queue == null )
{
queue = new RefCount();
mObjectMap.put( name, queue );
queue.add( object );
}
else
queue.add( object );
}
}
public static void countAndLog()
{
if( !enabled )
return;
System.gc();
Log.d( tag, "Log report starts" );
Iterator<Entry<String, RefCount>> entryIter = mObjectMap.entrySet().iterator();
while( entryIter.hasNext() )
{
Entry<String, RefCount> entry = entryIter.next();
String name = entry.getKey();
RefCount refCount = entry.getValue();
Log.d( tag, "Class " + name + " has " + refCount.countRefs() + " objects in memory." );
}
logMemoryUsage();
Log.d( tag, "Log report done" );
}
public static void logMemoryUsage()
{
if( !enabled )
return;
Runtime runtime = Runtime.getRuntime();
Log.d( tag, "Max Heap: " + runtime.maxMemory() / 1048576 + " MB, Used: " + runtime.totalMemory() / 1048576 +
" MB, Free: " + runtime.freeMemory() / 1024 + " MB" );
if( periodic > 0 )
{
if( mPeriodicRunnable != null )
mPeriodicHandler.removeCallbacks( mPeriodicRunnable );
if( mPeriodicHandler == null )
mPeriodicHandler = new Handler();
mPeriodicRunnable = new Runnable()
{
@Override
public void run()
{
mPeriodicRunnable = null;
countAndLog();
logMemoryUsage(); // re-posts this runnable, scheduling the next periodic report
}
};
mPeriodicHandler.postDelayed( mPeriodicRunnable, periodic * 1000 );
}
}
private static class RefCount
{
private ReferenceQueue<Object> mQueue = new ReferenceQueue<Object>();
private HashSet<Object> mRefHash = new HashSet<Object>();
private int mRefCount = 0;
public void add( Object o )
{
synchronized( this )
{
mRefHash.add( new PhantomReference<Object>( o, mQueue ) ); // Note: the PhantomReference objects themselves MUST be kept strongly reachable, or they will never be enqueued
mRefCount++;
}
}
public int countRefs()
{
synchronized( this )
{
Object ref;
while( ( ref = mQueue.poll() ) != null )
{
mRefHash.remove( ref );
mRefCount--;
}
return mRefCount;
}
}
}
}

Related

Java Completable Future - better to have loop of futures or future of a loop?

I need to run a method hundreds of times for different data, and at various points of the method it is waiting on data from the DB or a response from a web call. It seems to make sense to run this asynchronously so that processing can occur during the wait times; however, I must wait on the results returning from all the runs before moving on. My question is: what's the difference between:
Creating a single completableFuture and running the loop within this. Then ensuring the completableFuture has finished before moving on.
OR
Creating a loop of completable futures each with a single method call, then using allOf to then wait on the last one finishing.
Thanks
Running the for loop inside a single CompletableFuture is not a good idea because that for loop will be executed synchronously. Having multiple CompletableFutures that call the same method multiple times is a better idea, but you should make sure that all methods that are blocking are executed asynchronously.
List<CompletableFuture> futures = Arrays.asList("1", "2", "3")
.stream()
.map(a -> CompletableFuture.supplyAsync(() -> method1(),
executorService))
.map(a -> a.thenCompose(b -> CompletableFuture.supplyAsync(() -> dbcall(b),
dbExecutorService)))
.collect(Collectors.toList());
This way method1 and dbcall are executed on different ExecutorService and blocking call to DB on dbcall method in dbExecutorService does not lead to threads being exhausted in executorService.
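To address the "wait on the results from all the runs" part of the question, the usual approach is CompletableFuture.allOf. A minimal sketch, reusing the futures list built above:
// allOf completes only when every future in the array has completed.
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();

// All futures are now complete, so join() returns each result immediately.
for (CompletableFuture<?> future : futures) {
    Object result = future.join(); // handle per-task failures with handle()/exceptionally() if needed
    // ... aggregate or process the result ...
}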
Project Loom
This work will be simpler when Project Loom technology arrives in Java.
This project is adding virtual threads (fibers) to the Java concurrency toolbox. Many running virtual threads can be mapped to run on top of platform/kernel threads. When a virtual thread blocks, it is “parked”, and another virtual thread is assigned to execute on the platform/kernel thread. This switching between virtual threads is done very quickly, making thread-blocking extremely cheap in terms of its impact on performance.
Virtual threads are also extremely cheap in terms of their use of memory. Whereas platform/kernel threads are allocated rather large stack sizes no matter the need, virtual threads have a stack that expands as needed… and shrinks when no longer needed.
Virtual threads promise to eliminate the risks of using thread pools. Every thread is fresh with its own ThreadLocal values.
Experimental builds based on early-access Java 17 are available now. The Project Loom team is soliciting feedback.
AutoCloseable
In Loom, the ExecutorService interface becomes AutoCloseable. So we can use try-with-resources syntax. The flow-of-control leaves the try block only after all submitted tasks are done/failed/canceled. When leaving the try block, the executor service is automatically closed.
No need for CompletableFuture
You can simply launch many virtual threads, millions even, and let them run. Most of the need for the many methods on CompletableFuture evaporates. For more info, see the more recent presentations and interviews with Ron Pressler of Oracle.
We can simply spin off all the tasks on virtual threads, collecting Future objects along the way. Then simply wait for all those tasks to finish.
Example code
Establish an ExecutorService instance. Submit to that executor service your Callable tasks. Capture the returned Future objects to track successful completion.
int countTasks = 1_000 ; // Number of tasks to spin off into threads.
List < Future < YourResultClass > > futures = new ArrayList <>( countTasks );
try (
ExecutorService executorService = Executors.newVirtualThreadExecutor() ;
)
{
for ( int i = 0 ; i < countTasks ; i++ )
{
// Submit a Callable object to the executor service.
Future < YourResultClass > future = executorService.submit( ( ) -> {
// The work to be done in each thread.
YourResultClass yourResultObject = … ;
return yourResultObject;
} );
futures.add( future );
}
}
// At this point, flow-of-control blocks until all submitted tasks are done/failed/canceled.
// After this point, the executor service is automatically closed.
After the work is done, we can examine the collected Future objects to verify the results.
// Report on all the futures, all the results of the thousand tasks.
for ( Future < YourResultClass > future : futures )
{
try
{
System.out.println(
"future.isDone(): " + ( future.isDone() + " | future.isCompletedNormally(): " + future.isCompletedNormally() + " | future.isCancelled(): " + future.isCancelled() + " | result: " + future.get().toString() )
);
}
catch ( InterruptedException e )
{
e.printStackTrace();
}
catch ( ExecutionException e )
{
e.printStackTrace();
}
}
Example app
Here is a complete app. Not what I would do in production of course, but it makes for a decent demonstration I hope.
This code spins off a thousand tasks. Each task makes a REST call to ask Wikipedia for a random page. The contents of that page are then written to a H2 database. We collect the Future objects returned when submitting Callable tasks to the executor service, and we examine those after the work is done.
I configured an in-memory database, but you could just as well put the database in storage.
For simplicity, I defined my WikipediaPage class as a record. This new feature in Java 16 is a brief way to write a class whose main purpose is to immutably and transparently carry data. The compiler implicitly creates the constructor, getters, equals & hashCode, and toString. Not important to this Answer; you could just as well use a conventional class.
package work.basil.example.loopingfutures;
import org.h2.jdbcx.JdbcDataSource;
import javax.sql.DataSource;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.*;
import java.time.*;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.UUID;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
public class App
{
public static void main ( String[] args )
{
System.out.println( Runtime.version() );
System.out.println( Runtime.getRuntime().maxMemory() );
App app = new App();
app.demo();
}
private void demo ( )
{
DataSource dataSource = this.configureDataSource();
this.destroyDatabaseContents( dataSource );
this.createDatabase( dataSource );
this.work( dataSource , 1_000 );
this.dumpTable( dataSource );
}
private DataSource configureDataSource ( )
{
JdbcDataSource dataSource = Objects.requireNonNull( new JdbcDataSource() ); // Implementation of `DataSource` bundled with H2.
dataSource.setURL( "jdbc:h2:mem:looping_loom_example_db;DB_CLOSE_DELAY=-1" ); // Set `DB_CLOSE_DELAY` to `-1` to keep in-memory database in existence after connection closes.
dataSource.setUser( "scott" );
dataSource.setPassword( "tiger" );
return dataSource;
}
private void destroyDatabaseContents ( DataSource dataSource )
{
try (
Connection conn = dataSource.getConnection() ;
)
{
String sql = """
DROP TABLE IF EXISTS wikipedia_page_
;
""";
System.out.println( "sql: \n" + sql );
try ( Statement stmt = conn.createStatement() ; )
{
stmt.execute( sql );
}
}
catch ( SQLException e )
{
e.printStackTrace();
}
}
private void createDatabase ( final DataSource dataSource )
{
try (
Connection conn = dataSource.getConnection() ;
)
{
String sql = """
CREATE TABLE wikipedia_page_
(
id_ UUID NOT NULL PRIMARY KEY ,
url_ VARCHAR NOT NULL ,
content_ CLOB NOT NULL ,
when_fetched_ TIMESTAMP WITH TIME ZONE NOT NULL ,
row_created_ TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP()
)
;
""";
System.out.println( "sql: \n" + sql );
try ( Statement stmt = conn.createStatement() ; )
{
stmt.execute( sql );
}
}
catch ( SQLException e )
{
e.printStackTrace();
}
}
private void work ( final DataSource dataSource , final int countPagesToFetchFromWikipedia )
{
List < Future < WikipediaPage > > futures = new ArrayList <>( countPagesToFetchFromWikipedia );
long start = System.nanoTime();
try (
ExecutorService executorService = Executors.newVirtualThreadExecutor() ;
)
{
for ( int i = 0 ; i < countPagesToFetchFromWikipedia ; i++ )
{
// Submit a Callable object to the executor service.
Future < WikipediaPage > future = executorService.submit( ( ) -> {
// To meet Wikipedia's limit of 200 requests per second, let's throttle by sleeping the worker thread.
try {Thread.sleep( Duration.ofMillis( 100 ) ); } catch ( InterruptedException e ) { e.printStackTrace(); }
WikipediaPage page = this.fetchPage();
this.persistPage( dataSource , page );
return page;
} );
futures.add( future );
}
}
// At this point, flow-of-control blocks until all submitted tasks are done/failed/canceled. The executor service is automatically closed.
Duration duration = Duration.ofNanos( System.nanoTime() - start );
System.out.println( "duration = " + duration + " for a count of " + countPagesToFetchFromWikipedia );
// Report on all the futures, all the results of the thousand tasks.
for ( Future < WikipediaPage > future : futures )
{
try
{
System.out.println(
"future.isDone(): " + ( future.isDone() + " | future.isCompletedNormally(): " + future.isCompletedNormally() + " | future.isCancelled(): " + future.isCancelled() + " | result: " + future.get().toString() )
);
}
catch ( InterruptedException e )
{
e.printStackTrace();
}
catch ( ExecutionException e )
{
e.printStackTrace();
}
}
System.out.println( "INFO - End of `work` method. Message # 26893b25-b09c-40d5-8cee-60e6a1d53852." );
}
private WikipediaPage fetchPage ( )
{
WikipediaPage page = null; // To be returned.
HttpClient client =
HttpClient
.newBuilder()
.followRedirects( HttpClient.Redirect.NORMAL )
.build();
HttpRequest request =
HttpRequest
.newBuilder()
.uri( URI.create( "https://en.wikipedia.org/api/rest_v1/page/random/summary" ) )
.build();
try
{
HttpResponse < String > response = client.send( request , HttpResponse.BodyHandlers.ofString() );
OffsetDateTime whenFetched = OffsetDateTime.now( ZoneOffset.UTC );
URI uri = response.uri();
String content = response.body();
System.out.println( "response = " + response );
System.out.println( "content = " + content );
page = new WikipediaPage( UUID.randomUUID() , uri.toString() , content , whenFetched );
}
catch ( IOException e )
{
e.printStackTrace();
}
catch ( InterruptedException e )
{
e.printStackTrace();
}
return Objects.requireNonNull( page );
}
private void persistPage ( final DataSource dataSource , final WikipediaPage wikipediaPage )
{
Objects.requireNonNull( dataSource );
Objects.requireNonNull( wikipediaPage );
String sql = """
INSERT INTO wikipedia_page_ ( id_ , url_ , content_ , when_fetched_ )
VALUES ( ? , ? , ? , ? )
;
""";
System.out.println( "sql: \n" + sql );
try (
Connection conn = dataSource.getConnection() ;
PreparedStatement pstmt = Objects.requireNonNull( conn ).prepareStatement( sql ) ;
)
{
pstmt.setObject( 1 , wikipediaPage.id );
pstmt.setString( 2 , wikipediaPage.url );
pstmt.setString( 3 , wikipediaPage.content );
pstmt.setObject( 4 , wikipediaPage.whenFetched );
int countRowsAffected = pstmt.executeUpdate();
if ( countRowsAffected != 1 )
{
System.out.println( "ERROR - Failed to insert row. Message # 4c3503d9-8cad-4e21-a625-c58054d9ca78." );
}
}
catch ( SQLException e )
{
e.printStackTrace();
}
}
private void dumpTable ( final DataSource dataSource )
{
String sql = "SELECT * FROM wikipedia_page_ ;";
System.out.println( "sql: \n" + sql );
try (
Connection conn = dataSource.getConnection() ;
Statement stmt = conn.createStatement() ;
ResultSet rs = stmt.executeQuery( sql ) ;
)
{
while ( rs.next() )
{
//Retrieve by column name
UUID id = rs.getObject( "id_" , UUID.class );
String url = rs.getString( "url_" );
String content = rs.getString( "content_" );
OffsetDateTime whenFetched = rs.getObject( "when_fetched_" , OffsetDateTime.class );
WikipediaPage wp = new WikipediaPage( id , url , content , whenFetched );
System.out.println( "wp = " + wp );
}
}
catch ( SQLException e )
{
e.printStackTrace();
}
}
record WikipediaPage(UUID id , String url , String content , OffsetDateTime whenFetched)
{
@Override
public String toString ( )
{
// Omitting the `content` field for brevity.
return "WikipediaPage{ " +
"id=" + id +
" | url='" + url + '\'' +
" | whenFetched=" + whenFetched +
" }";
}
}
}
Results
All of these results are for 1,000 tasks. Run on a Mac mini (2018) 3 GHz Intel Core i5 with 32 gigs of RAM, macOS Mojave 10.14.6, Java 17-loom+2-42 assigned maximum memory of 8 gigs (8589934592).
When running with Executors.newSingleThreadExecutor(), takes 7 minutes.
When running with Executors.newFixedThreadPool( 10 ), takes 1 minute.
When running with Executors.newFixedThreadPool( 100 ), takes half a minute.
When running with Executors.newVirtualThreadExecutor(), takes a quarter minute.

Initialize multiple numeric fields at once in JAVA that begin with certain values

I am working on a Java class that contains a ton of numeric fields. Most of them begin with something like 'CMth' or 'FYTD'. Is it possible to initialize all fields of the same type that begin or end with a certain value? For example, I have the following fields:
CMthRepCaseACR CMthRepUnitACR CMthRecCaseACR CMthRecUnitACR CMthHecCaseACR CMthHecUnitACR FYTDHecCaseACR FYTDHecUnitACR CMthBBKCaseACR CMthBBKUnitACR CMthPIHCaseACR .
I am trying to figure out if it is possible to initialize all fields to zero that end with 'ACR' or begin with 'CMth'.
I know I can do something like cmtha = cmthb = cmthc = 0, but I was wondering if there is some kind of mask you can use to initialize them.
Thanks
Assuming that you cannot change said Java class (and, e.g., use a collection or map to store the values), your best bet is probably reflection (see also: Trail: The Reflection API). Reflection gives you access to all fields of the class and you can then implement whatever matching you'd like.
Here's a short demo to get you started, minus error handling, sanity checks and adaptations to your actual class:
import java.util.stream.Stream;
public class Demo {
private static class DemoClass {
private int repCaseACR = 1;
private int CMthRepUnit = 2;
private int foo = 3;
private int bar = 4;
@Override
public String toString() {
return "DemoClass [repCaseACR=" + repCaseACR + ", CMthRepUnit=" + CMthRepUnit + ", foo=" + foo + ", bar="
+ bar + "]";
}
}
public static void main(String[] args) {
DemoClass demoClass = new DemoClass();
System.out.println("before: " + demoClass);
resetFields(demoClass, "CMth", null);
System.out.println("after prefix reset: " + demoClass);
resetFields(demoClass, null, "ACR");
System.out.println("after suffix reset: " + demoClass);
}
private static void resetFields(DemoClass instance, String prefix, String suffix) {
Stream.of(instance.getClass().getDeclaredFields())
.filter(field ->
(prefix != null && field.getName().startsWith(prefix))
|| (suffix != null && field.getName().endsWith(suffix)))
.forEach(field -> {
field.setAccessible(true);
try {
field.set(instance, 0);
} catch (IllegalArgumentException | IllegalAccessException e) {
// TODO handle me
}
});
}
}
Output:
before: DemoClass [repCaseACR=1, CMthRepUnit=2, foo=3, bar=4]
after prefix reset: DemoClass [repCaseACR=1, CMthRepUnit=0, foo=3, bar=4]
after suffix reset: DemoClass [repCaseACR=0, CMthRepUnit=0, foo=3, bar=4]
Note: Both links are seriously dated but the core functionality of reflection is still the same.

Java Watch Service - Wait until modifications are finished

I have a Watcher that updates my data structures when a change is heard. However, if the change is not instantaneous (i.e. if a large file is being copied from another file system, or a big part of the file is modified), the data-structure tries to update too early and throws an error.
How can I modify my code so that updateData() is called only after the last ENTRY_MODIFY, rather than after every single ENTRY_MODIFY?
private static boolean processWatcherEvents () {
WatchKey key;
try {
key = watcher.poll( 10, TimeUnit.MILLISECONDS );
} catch ( InterruptedException e ) {
return false;
}
Path directory = keys.get( key );
if ( directory == null ) {
return false;
}
for ( WatchEvent <?> event : key.pollEvents() ) {
WatchEvent.Kind eventKind = event.kind();
WatchEvent <Path> watchEvent = (WatchEvent<Path>)event;
Path child = directory.resolve( watchEvent.context() );
if ( eventKind == StandardWatchEventKinds.ENTRY_MODIFY ) {
//TODO: Wait until modifications are "finished" before taking these actions.
if ( Files.isDirectory( child ) ) {
updateData( child );
}
}
boolean valid = key.reset();
if ( !valid ) {
keys.remove( key );
}
}
return true;
}
As @TT suggested, you can do this pretty easily with file locks.
When you get an event, use the blocking method lock() for read and write access. Since the operation is blocking, the code automatically waits until the write operation is finished.
FileChannel channel = new RandomAccessFile(file, "rw").getChannel();
try (channel) { // auto closable, uses channel.close() in finally block
channel.lock(); // wait until file modifications are finished
channel.read(...); // now you can safely read the file
}
However, this won't work between different JVM processes, because they don't share the same lock.
Your problem can be solved by using timestamps.
Create a map for storing the last-modified timestamp of each path:
Map<Path, Long> fileTimeStamps;
When processing an event, check the last-modified timestamp:
long oldFileModifiedTimeStamp = fileTimeStamps.get(filePath);
long newFileModifiedTimeStamp = filePath.toFile().lastModified();
if (newFileModifiedTimeStamp > oldFileModifiedTimeStamp)
{
fileTimeStamps.remove(filePath);
onEventOccurred();
fileTimeStamps.put(filePath, filePath.toFile().lastModified());
}
I ended up writing a thread that keeps a list of things I want updated and delays actually updating them until 80 milliseconds have passed. Whenever an ENTRY_MODIFY event happens, it resets the counter. I think this is a good solution, but there may be a better one?
@SuppressWarnings({ "rawtypes", "unchecked" })
private static boolean processWatcherEvents () {
WatchKey key;
try {
key = watcher.poll( 10, TimeUnit.MILLISECONDS );
} catch ( InterruptedException e ) {
return false;
}
Path directory = keys.get( key );
if ( directory == null ) {
return false;
}
for ( WatchEvent <?> event : key.pollEvents() ) {
WatchEvent.Kind eventKind = event.kind();
WatchEvent <Path> watchEvent = (WatchEvent<Path>)event;
Path child = directory.resolve( watchEvent.context() );
if ( eventKind == StandardWatchEventKinds.ENTRY_CREATE ) {
if ( Files.isDirectory( child ) ) {
loadMe.add( child );
} else {
loadMe.add( child.getParent() );
}
} else if ( eventKind == StandardWatchEventKinds.ENTRY_DELETE ) {
//Handled by removeMissingFiles(), can ignore.
} else if ( eventKind == StandardWatchEventKinds.ENTRY_MODIFY ) {
System.out.println( "Modified: " + child.toString() ); //TODO: DD
if ( Files.isDirectory( child ) ) {
modifiedFileDelayedUpdater.addUpdateItem( child );
} else {
modifiedFileDelayedUpdater.addUpdateItem( child );
}
} else if ( eventKind == StandardWatchEventKinds.OVERFLOW ) {
for ( Path path : musicSourcePaths ) {
updateMe.add( path );
}
}
boolean valid = key.reset();
if ( !valid ) {
keys.remove( key );
}
}
return true;
}
...
class UpdaterThread extends Thread {
public static final int DELAY_LENGTH_MS = 80;
public int counter = DELAY_LENGTH_MS;
Vector <Path> updateItems = new Vector <Path> ();
public void run() {
while ( true ) {
long sleepTime = 0;
try {
long startSleepTime = System.currentTimeMillis();
Thread.sleep ( 20 );
sleepTime = System.currentTimeMillis() - startSleepTime;
} catch ( InterruptedException e ) {} //TODO: Is this OK to do? Feels like a bad idea.
if ( counter > 0 ) {
counter -= sleepTime;
} else if ( updateItems.size() > 0 ) {
Vector <Path> copyUpdateItems = new Vector<Path> ( updateItems );
for ( Path path : copyUpdateItems ) {
Library.requestUpdate ( path );
updateItems.remove( path );
}
}
}
}
public void addUpdateItem ( Path path ) {
counter = DELAY_LENGTH_MS;
if ( !updateItems.contains( path ) ) {
updateItems.add ( path );
}
}
};

Scala actors inefficiency issue

Let me start out by saying that I'm new to Scala; however, I find the Actor based concurrency model interesting, and I tried to give it a shot for a relatively simple application. The issue that I'm running into is that, although I'm able to get the application to work, the result is far less efficient (in terms of real time, CPU time, and memory usage) than an equivalent Java based solution that uses threads that pull messages off an ArrayBlockingQueue. I'd like to understand why. I suspect that it's likely my lack of Scala knowledge, and that I'm causing all the inefficiency, but after several attempts to rework the application without success, I decided to reach out to the community for help.
My problem is this:
I have a gzipped file with many lines in the format of:
SomeID comma_separated_list_of_values
For example:
1234 12,45,82
I'd like to parse each line and get an overall count of the number of occurrences of each value in the comma separated list.
This file may be pretty large (several GB compressed), but the number of unique values per file is pretty small (at most 500). I figured this would be a pretty good opportunity to try to write an Actor-based concurrent Scala application. My solution involves a main driver that creates a pool of parser Actors. The main driver then reads lines from stdin, passes the line off to an Actor that parses the line and keeps a local count of the values. When the main driver has read the last line, it passes a message to each actor indicating that all lines have been read. When the actor receive the 'done' message, they pass their counts to an aggregator that sums the counts from all actors. Once the counts from all parsers have been aggregated, the main driver prints out the statistics.
The problem:
The main issue that I'm encountering is the incredible amount of inefficiency of this application. It uses far more CPU and far more memory than an "equivalent" Java application that uses threads and an ArrayBlockingQueue. To put this in perspective, here are some stats that I gathered for a 10 million line test input file:
Scala 1 Actor (parser):
real 9m22.297s
user 235m31.070s
sys 21m51.420s
Java 1 Thread (parser):
real 1m48.275s
user 1m58.630s
sys 0m33.540s
Scala 5 Actors:
real 2m25.267s
user 63m0.730s
sys 3m17.950s
Java 5 Threads:
real 0m24.961s
user 1m52.650s
sys 0m20.920s
In addition, top reports that the Scala application has about 10x the resident memory size. So we're talking about orders of magnitude more CPU and memory here for orders of magnitude worse performance, and I just can't figure out what is causing this. Is it a GC issue, or am I somehow creating far more copies of objects than I realize?
Additional details that may or may not be of importance:
The scala application is wrapped by a Java class so that I could
deliver a self-contained executable JAR file (I don't have the Scala
jars on every machine that I might want to run this app).
The application is being invoked as follows: gunzip -c gzFilename |
java -jar StatParser.jar
Here is the code:
Main Driver:
import scala.actors.Actor._
import scala.collection.{ immutable, mutable }
import scala.io.Source
class StatCollector (numParsers : Int ) {
private val parsers = new mutable.ArrayBuffer[StatParser]()
private val aggregator = new StatAggregator()
def generateParsers {
for ( i <- 1 to numParsers ) {
val parser = new StatParser( i, aggregator )
parser.start
parsers += parser
}
}
def readStdin {
var nextParserIdx = 0
var lineNo = 1
for ( line <- Source.stdin.getLines() ) {
parsers( nextParserIdx ) ! line
nextParserIdx += 1
if ( nextParserIdx >= numParsers ) {
nextParserIdx = 0
}
lineNo += 1
}
}
def informParsers {
for ( parser <- parsers ) {
parser ! true
}
}
def printCounts {
val countMap = aggregator.getCounts()
println( "ID,Count" )
/*
for ( key <- countMap.keySet ) {
println( key + "," + countMap.getOrElse( key, 0 ) )
//println( "Campaign '" + key + "': " + countMap.getOrElse( key, 0 ) )
}
*/
countMap.toList.sorted foreach {
case (key, value) =>
println( key + "," + value )
}
}
def processFromStdIn {
aggregator.start
generateParsers
readStdin
process
}
def process {
informParsers
var completedParserCount = aggregator.getNumParsersAggregated
while ( completedParserCount < numParsers ) {
Thread.sleep( 250 )
completedParserCount = aggregator.getNumParsersAggregated
}
printCounts
}
}
The Parser Actor:
import scala.actors.Actor
import collection.mutable.HashMap
import scala.util.matching
class StatParser( val id: Int, val aggregator: StatAggregator ) extends Actor {
private var countMap = new HashMap[String, Int]()
private val sep1 = "\t"
private val sep2 = ","
def getCounts(): HashMap[String, Int] = {
return countMap
}
def act() {
loop {
react {
case line: String =>
{
val idx = line.indexOf( sep1 )
var currentCount = 0
if ( idx > 0 ) {
val tokens = line.substring( idx + 1 ).split( sep2 )
for ( token <- tokens ) {
if ( !token.equals( "" ) ) {
currentCount = countMap.getOrElse( token, 0 )
countMap( token ) = ( 1 + currentCount )
}
}
}
}
case doneProcessing: Boolean =>
{
if ( doneProcessing ) {
// Send my stats to Aggregator
aggregator ! this
}
}
}
}
}
}
The Aggregator Actor:
import scala.actors.Actor
import collection.mutable.HashMap
class StatAggregator extends Actor {
private var countMap = new HashMap[String, Int]()
private var parsersAggregated = 0
def act() {
loop {
react {
case parser: StatParser =>
{
val cm = parser.getCounts()
for ( key <- cm.keySet ) {
val currentCount = countMap.getOrElse( key, 0 )
val incAmt = cm.getOrElse( key, 0 )
countMap( key ) = ( currentCount + incAmt )
}
parsersAggregated += 1
}
}
}
}
def getNumParsersAggregated: Int = {
return parsersAggregated
}
def getCounts(): HashMap[String, Int] = {
return countMap
}
}
Any help that could be offered in understanding what is going on here would be greatly appreciated.
Thanks in advance!
---- Edit ---
Since many people responded and asked for the Java code, here is the simple Java app that I created for comparison purposes. I realize that this is not great Java code, but when I saw the performance of the Scala application, I just whipped up something quick to see how a Java Thread-based implementation would perform as a base-line:
Parsing Thread:
import java.util.Hashtable;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;
public class JStatParser extends Thread
{
private ArrayBlockingQueue<String> queue;
private Map<String, Integer> countMap;
private boolean done;
public JStatParser( ArrayBlockingQueue<String> q )
{
super( );
queue = q;
countMap = new Hashtable<String, Integer>( );
done = false;
}
public Map<String, Integer> getCountMap( )
{
return countMap;
}
public void alldone( )
{
done = true;
}
@Override
public void run( )
{
String line = null;
while( !done || queue.size( ) > 0 )
{
try
{
// line = queue.take( );
line = queue.poll( 100, TimeUnit.MILLISECONDS );
if( line != null )
{
int idx = line.indexOf( "\t" ) + 1;
for( String token : line.substring( idx ).split( "," ) )
{
if( !token.equals( "" ) )
{
if( countMap.containsKey( token ) )
{
Integer currentCount = countMap.get( token );
currentCount++;
countMap.put( token, currentCount );
}
else
{
countMap.put( token, new Integer( 1 ) );
}
}
}
}
}
catch( InterruptedException e )
{
// TODO Auto-generated catch block
System.err.println( "Failed to get something off the queue: "
+ e.getMessage( ) );
e.printStackTrace( );
}
}
}
}
Driver:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Hashtable;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.concurrent.ArrayBlockingQueue;
public class JPS
{
public static void main( String[] args )
{
if( args.length <= 0 || args.length > 2 || args[0].equals( "-?" ) )
{
System.err.println( "Usage: JPS [filename]" );
System.exit( -1 );
}
int numParsers = Integer.parseInt( args[0] );
ArrayBlockingQueue<String> q = new ArrayBlockingQueue<String>( 1000 );
List<JStatParser> parsers = new ArrayList<JStatParser>( );
BufferedReader reader = null;
try
{
if( args.length == 2 )
{
reader = new BufferedReader( new FileReader( args[1] ) );
}
else
{
reader = new BufferedReader( new InputStreamReader( System.in ) );
}
for( int i = 0; i < numParsers; i++ )
{
JStatParser parser = new JStatParser( q );
parser.start( );
parsers.add( parser );
}
String line = null;
while( (line = reader.readLine( )) != null )
{
try
{
q.put( line );
}
catch( InterruptedException e )
{
// TODO Auto-generated catch block
System.err.println( "Failed to add line to q: "
+ e.getMessage( ) );
e.printStackTrace( );
}
}
// At this point, we've put everything on the queue, now we just
// need to wait for it to be processed.
while( q.size( ) > 0 )
{
try
{
Thread.sleep( 250 );
}
catch( InterruptedException e )
{
}
}
Map<String,Integer> countMap = new Hashtable<String,Integer>( );
for( JStatParser jsp : parsers )
{
jsp.alldone( );
Map<String,Integer> cm = jsp.getCountMap( );
for( String key : cm.keySet( ) )
{
if( countMap.containsKey( key ))
{
Integer currentCount = countMap.get( key );
currentCount += cm.get( key );
countMap.put( key, currentCount );
}
else
{
countMap.put( key, cm.get( key ) );
}
}
}
System.out.println( "ID,Count" );
for( String key : new TreeSet<String>(countMap.keySet( )) )
{
System.out.println( key + "," + countMap.get( key ) );
}
for( JStatParser parser : parsers )
{
try
{
parser.join( 100 );
}
catch( InterruptedException e )
{
// TODO Auto-generated catch block
e.printStackTrace();
}
}
System.exit( 0 );
}
catch( IOException e )
{
System.err.println( "Caught exception: " + e.getMessage( ) );
e.printStackTrace( );
}
}
}
I'm not sure this is a good test case for actors. For one thing, there's almost no interaction between actors. This is a simple map/reduce, which calls for parallelism, not concurrency.
The overhead of the actors is also pretty heavy, and I don't know how many actual threads are being allocated. Depending on how many processors you have, you might have fewer threads than in the Java program -- which seems to be the case, given that the speed-up is 4x instead of 5x.
And the way you wrote the actors is optimized for idle actors, the kind of situation where you have hundreds or thousands of actors but only a few of them doing actual work at any time. If you wrote the actors with while/receive instead of loop/react, they'd perform better.
Now, actors would make it easy to distribute the application over many computers, except that you violated one of the tenets of actors: you are calling methods on the actor object. You should never do that with actors and, in fact, Akka prevents you from doing so. A more actor-ish way of doing this would be for the aggregator to ask each actor for their key sets, compute their union, and then, for each key, ask all actors to send their count for that key.
I'm not sure, however, that the actor overhead is what you are seeing. You provided no information about the Java implementation, but I daresay you use mutable maps, and maybe even a single concurrent mutable map -- a very different implementation than what you are doing in Scala.
There's also no information on how the file is read (such a big file might have buffering issues), or how it is parsed in Java. Since most of the work is reading and parsing the file, not counting the tokens, differences in implementation there can easily overcome any other issue.
Finally, about resident memory size, Scala has a 9 MB library (in addition to what JVM brings), which might be what you are seeing. Of course, if you are using a single concurrent map in Java vs 6 immutable maps in Scala, that will certainly make a big difference in memory usage patterns.
Scala actors are giving way to Akka actors these days... and more is coming - Viktor is hAkking further to make the latest the best: https://twitter.com/viktorklang/status/229694698397257728
BTW: Open source is a great power! This day should be a holiday for the whole JVM-based community:
http://www.marketwire.com/press-release/azul-systems-announces-new-initiative-support-open-source-community-with-free-zing-jvm-1684899.htm

Fast CSV parsing

I have a Java server app that downloads a CSV file and parses it. The parsing can take from 5 to 45 minutes, and happens each hour. This method is a bottleneck of the app, so it's not premature optimization. The code so far:
client.executeMethod(method);
InputStream in = method.getResponseBodyAsStream(); // this is http stream
String line;
String[] record;
reader = new BufferedReader(new InputStreamReader(in), 65536);
try {
// read the header line
line = reader.readLine();
// some code
while ((line = reader.readLine()) != null) {
// more code
line = line.replaceAll("\"\"", "\"NULL\"");
// Now remove all of the quotes
line = line.replaceAll("\"", "");
if (!line.startsWith("ERROR")) {
//bla bla
continue;
}
record = line.split(",");
//more error handling
// build the object and put it in HashMap
}
//exceptions handling, closing connection and reader
Is there any existing library that would help me speed things up? Can I improve the existing code?
Apache Commons CSV
Have you seen Apache Commons CSV?
Caveat On Using split
Bear in mind that split only returns a view of the data, meaning that the original line object is not eligible for garbage collection whilst there is a reference to any of its views. Perhaps making a defensive copy will help (see the sketch below)? (Java bug report)
It is also not reliable in grouping escaped CSV columns containing commas.
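A minimal sketch of the defensive-copy idea just mentioned (this only matters on older JVMs, before 7u6, where substring and split shared the parent string's backing char array):
String[] record = line.split(",");
for (int i = 0; i < record.length; i++) {
    // Copy the characters into a fresh String so the (possibly huge) original
    // line is not pinned in memory by the retained tokens.
    record[i] = new String(record[i]);
}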
opencsv
Take a look at opencsv.
This blog post, opencsv is an easy CSV parser, has example usage.
The problem with your code is that it's using replaceAll and split, which are very costly operations. You should definitely consider using a CSV parser/reader that does one-pass parsing.
There is a benchmark on GitHub,
https://github.com/uniVocity/csv-parsers-comparison
which unfortunately was run under Java 6. The numbers are slightly different under Java 7 and 8. I'm trying to get more detailed data for different file sizes, but it's a work in progress;
see https://github.com/arnaudroger/csv-parsers-comparison
Apart from the suggestions made above, I think you can try improving your code by using some threading and concurrency.
Following is a brief analysis and a suggested solution.
From the code it seems that you are reading the data over the network (most likely with the Apache Commons HttpClient lib).
You need to make sure that the bottleneck you are describing is not in the data transfer over the network.
One way to check is to just dump the data to a file (without parsing) and see how long that takes. This will give you an idea of how much time is actually spent in parsing (compared to your current observation).
Now have a look at how the java.util.concurrent package is used. Some of the links that you can use are (1,2).
What you can do is execute the tasks that you currently do inside the for loop on worker threads, as sketched below.
Using a thread pool and concurrency will greatly improve your performance.
Though the solution involves some effort, in the end it will surely help you.
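A minimal sketch of that idea, assuming each line can be parsed independently and the per-line results are merged into a thread-safe map. The Record type, parseRecord method and getKey accessor are placeholders for the question's own parsing logic, not real APIs:
import java.io.BufferedReader;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Reader thread stays cheap: it only hands lines off to the pool.
void parseConcurrently(BufferedReader reader) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());
    ConcurrentMap<String, Record> results = new ConcurrentHashMap<>();
    String line;
    while ((line = reader.readLine()) != null) {
        final String current = line; // effectively final copy for the lambda
        pool.submit(() -> {
            Record record = parseRecord(current);   // placeholder: the existing per-line parsing
            results.put(record.getKey(), record);   // placeholder accessor for the map key
        });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS); // wait for all queued parse tasks to finish
}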
opencsv
You should have a look at OpenCSV. I would expect that they have performance optimizations.
A little late here, but there are now a few benchmarking projects for CSV parsers. Your selection will depend on the exact use case (i.e. raw data vs. data binding etc.).
SimpleFlatMapper
uniVocity
sesseltjonna-csv (disclaimer: I wrote this parser)
Quirk-CSV
The new kid on the block. It uses Java annotations and is built on apache-csv, which is one of the faster libraries out there for CSV parsing.
This library is also thread safe, so if you want to re-use the CSVProcessor, you can and should.
Example:
Pojo
@CSVReadComponent(type = CSVType.NAMED)
@CSVWriteComponent(type = CSVType.ORDER)
public class Pojo {
@CSVWriteBinding(order = 0)
private String name;
@CSVWriteBinding(order = 1)
@CSVReadBinding(header = "age")
private Integer age;
@CSVWriteBinding(order = 2)
@CSVReadBinding(header = "money")
private Double money;
@CSVReadBinding(header = "name")
public void setA(String name) {
this.name = name;
}
@Override
public String toString() {
return "Name: " + name + System.lineSeparator() + "\tAge: " + age + System.lineSeparator() + "\tMoney: "
+ money;
}}
Main
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.*;
public class SimpleMain {
public static void main(String[] args) {
String csv = "name,age,money" + System.lineSeparator() + "Michael Williams,34,39332.15";
CSVProcessor processor = new CSVProcessor(Pojo.class);
List<Pojo> list = new ArrayList<>();
try {
list.addAll(processor.parse(new StringReader(csv)));
list.forEach(System.out::println);
System.out.println();
StringWriter sw = new StringWriter();
processor.write(list, sw);
System.out.println(sw.toString());
} catch (IOException e) {
}
}}
Since this is built on top of apache-csv, you can use the powerful CSVFormat tool. Let's say the delimiter for the CSV is a pipe (|) instead of a comma (,); you could, for example, do:
CSVFormat csvFormat = CSVFormat.DEFAULT.withDelimiter('|');
List<Pojo> list = processor.parse(new StringReader(csv), csvFormat);
Another benefit is that inheritance is also considered.
For other examples on handling reading/writing non-primitive data, see the project's examples.
For speed you do not want to use replaceAll, and you don't want to use regex either. What you basically always want to do in critical cases like this is write a state-machine parser that goes character by character. I've done that, rolling the whole thing into an Iterable function. It also takes in the stream and parses it without saving it out or caching it, so if you can abort early, that will likely work fine as well. It should also be short enough and well-coded enough to make it obvious how it works.
public static Iterable<String[]> parseCSV(final InputStream stream) throws IOException {
return new Iterable<String[]>() {
@Override
public Iterator<String[]> iterator() {
return new Iterator<String[]>() {
static final int UNCALCULATED = 0;
static final int READY = 1;
static final int FINISHED = 2;
int state = UNCALCULATED;
ArrayList<String> value_list = new ArrayList<>();
StringBuilder sb = new StringBuilder();
String[] return_value;
public void end() {
end_part();
return_value = new String[value_list.size()];
value_list.toArray(return_value);
value_list.clear();
}
public void end_part() {
value_list.add(sb.toString());
sb.setLength(0);
}
public void append(int ch) {
sb.append((char) ch);
}
public void calculate() throws IOException {
boolean inquote = false;
while (true) {
int ch = stream.read();
switch (ch) {
default: //regular character.
append(ch);
break;
case -1: //read has reached the end.
if ((sb.length() == 0) && (value_list.isEmpty())) {
state = FINISHED;
} else {
end();
state = READY;
}
return;
case '\r':
case '\n': //end of line.
if (inquote) {
append(ch);
} else {
end();
state = READY;
return;
}
break;
case ',': //comma
if (inquote) {
append(ch);
} else {
end_part();
break;
}
break;
case '"': //quote.
inquote = !inquote;
break;
}
}
}
@Override
public boolean hasNext() {
if (state == UNCALCULATED) {
try {
calculate();
} catch (IOException ex) {
}
}
return state == READY;
}
@Override
public String[] next() {
if (state == UNCALCULATED) {
try {
calculate();
} catch (IOException ex) {
}
}
state = UNCALCULATED;
return return_value;
}
};
}
};
}
You would typically process this quite helpfully like:
for (String[] csv : parseCSV(stream)) {
//<deal with parsed csv data>
}
The beauty of that API is worth the rather cryptic-looking function.
Apache Commons CSV ➙ 12 seconds for million rows
Is there any existing library that would help me to speed up things?
Yes, the Apache Commons CSV project works very well in my experience.
Here is an example app that uses Apache Commons CSV library to write and read rows of 24 columns: An integer sequential number, an Instant, and the rest are random UUID objects.
For 10,000 rows, the writing and the reading each take about half a second. The reading includes reconstituting the Integer, Instant, and UUID objects.
My example code lets you toggle on or off the reconstituting of objects. I ran both with a million rows. This creates a file of 850 megs. I am using Java 12 on a MacBook Pro (Retina, 15-inch, Late 2013), 2.3 GHz Intel Core i7, 16 GB 1600 MHz DDR3, Apple built-in SSD.
For a million rows, ten seconds for reading plus two seconds for parsing:
Writing: PT25.994816S
Reading only: PT10.353912S
Reading & parsing: PT12.219364S
The source code is a single .java file. It has a write method and a read method, both called from a main method.
I opened a BufferedReader by calling Files.newBufferedReader.
package work.basil.example;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;
import java.util.UUID;
public class CsvReadingWritingDemo
{
public static void main ( String[] args )
{
CsvReadingWritingDemo app = new CsvReadingWritingDemo();
app.write();
app.read();
}
private void write ()
{
Instant start = Instant.now();
int limit = 1_000_000; // 10_000 100_000 1_000_000
Path path = Paths.get( "/Users/basilbourque/IdeaProjects/Demo/csv.txt" );
try (
Writer writer = Files.newBufferedWriter( path, StandardCharsets.UTF_8 );
CSVPrinter printer = new CSVPrinter( writer , CSVFormat.RFC4180 );
)
{
printer.printRecord( "id" , "instant" , "uuid_01" , "uuid_02" , "uuid_03" , "uuid_04" , "uuid_05" , "uuid_06" , "uuid_07" , "uuid_08" , "uuid_09" , "uuid_10" , "uuid_11" , "uuid_12" , "uuid_13" , "uuid_14" , "uuid_15" , "uuid_16" , "uuid_17" , "uuid_18" , "uuid_19" , "uuid_20" , "uuid_21" , "uuid_22" );
for ( int i = 1 ; i <= limit ; i++ )
{
printer.printRecord( i , Instant.now() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() );
}
} catch ( IOException ex )
{
ex.printStackTrace();
}
Instant stop = Instant.now();
Duration d = Duration.between( start , stop );
System.out.println( "Wrote CSV for limit: " + limit );
System.out.println( "Elapsed: " + d );
}
private void read ()
{
Instant start = Instant.now();
int count = 0;
Path path = Paths.get( "/Users/basilbourque/IdeaProjects/Demo/csv.txt" );
try (
Reader reader = Files.newBufferedReader( path , StandardCharsets.UTF_8) ;
)
{
CSVFormat format = CSVFormat.RFC4180.withFirstRecordAsHeader();
CSVParser parser = CSVParser.parse( reader , format );
for ( CSVRecord csvRecord : parser )
{
if ( true ) // Toggle parsing of the string data into objects. Turn off (`false`) to see strictly the time taken by Apache Commons CSV to read & parse the lines. Turn on (`true`) to get a feel for real-world load.
{
Integer id = Integer.valueOf( csvRecord.get( 0 ) ); // Annoying zero-based index counting.
Instant instant = Instant.parse( csvRecord.get( 1 ) );
for ( int i = 3 - 1 ; i <= 22 - 1 ; i++ ) // Subtract one for annoying zero-based index counting.
{
UUID uuid = UUID.fromString( csvRecord.get( i ) );
}
}
count++;
if ( count % 1_000 == 0 ) // Every so often, report progress.
{
//System.out.println( "# " + count );
}
}
} catch ( IOException e )
{
e.printStackTrace();
}
Instant stop = Instant.now();
Duration d = Duration.between( start , stop );
System.out.println( "Read CSV for count: " + count );
System.out.println( "Elapsed: " + d );
}
}
