Issues with Dynamic Destinations in Dataflow - java

I have a Dataflow job that reads data from Pub/Sub and, based on the time and filename, writes the contents to GCS, where the folder path is based on the date (YYYY/MM/DD). This allows files to be generated in folders based on date, and it uses Apache Beam's FileIO and Dynamic Destinations.
About two weeks ago, I noticed an unusual buildup of unacknowledged messages. Upon restarting the Dataflow job, the errors disappeared and new files were being written to GCS.
After a couple of days, writing stopped again, except this time there were errors claiming that processing was stuck. After some trusty SO research, I found out that this was likely caused by a deadlock issue in pre-2.9.0 Beam, which used the Conscrypt library as the default security provider. So I upgraded from Beam 2.8 to Beam 2.11.
Once again, it worked, until it didn't. I looked more closely at the error and noticed that it had a problem with a SimpleDateFormat object, which isn't thread-safe. So I switched to java.time and DateTimeFormatter, which is thread-safe. It worked until it didn't. However, this time the error was slightly different and didn't point to anything in my code:
Processing stuck in step FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles for at least 05m00s without outputting or completing in state process
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(
at org.apache.beam.runners.dataflow.worker.MetricTrackingWindmillServerStub.getStateData(
at org.apache.beam.runners.dataflow.worker.WindmillStateReader.startBatchAndBlock(
at org.apache.beam.runners.dataflow.worker.WindmillStateReader$WrappedFuture.get(
at org.apache.beam.runners.dataflow.worker.WindmillStateReader$BagPagingIterable$1.computeNext(
at$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)
This error started occurring approximately 5 hours after job deployment and at an increasing rate over time. Writing slowed significantly within 24 hours. I have 60 workers and I suspect that one worker fails every time there is an error, which eventually kills the job.
In my writer, I parse the lines for certain keywords (may not be the best way) in order to determine which folder it belongs in. I then proceed to insert the file to GCS with the determined filename. Here is the code I use for my writer:
The partition function is provided as the following:
public static class datePartition implements SerializableFunction<String, String> {
    // dtf (a DateTimeFormatter) and LOG are defined elsewhere and are not shown in the question
    private String filename;

    public datePartition(String filename) {
        this.filename = filename;
    }

    public String apply(String input) {
        String folder_name = "NaN";
        String date_dtf = "NaN";
        String date_literal = "NaN";
        try {
            // prefer "foldername" and fall back to "folderid"
            Matcher foldernames = Pattern.compile("\"foldername\":\"(.*?)\"").matcher(input);
            if (foldernames.find()) {
                folder_name = foldernames.group(1);
            } else {
                Matcher folderid = Pattern.compile("\"folderid\":\"(.*?)\"").matcher(input);
                if (folderid.find()) {
                    folder_name = folderid.group(1);
                }
            }
            Matcher date_long = Pattern.compile("\"timestamp\":\"(.*?)\"").matcher(input);
            if (date_long.find()) {
                date_literal = date_long.group(1);
                if (Utilities.isNumeric(date_literal)) {
                    // numeric timestamps are epoch milliseconds
                    LocalDateTime date = LocalDateTime.ofInstant(Instant.ofEpochMilli(Long.valueOf(date_literal)), ZoneId.systemDefault());
                    date_dtf = date.format(dtf);
                } else {
                    // otherwise derive the path from a textual timestamp
                    date_dtf = date_literal.split(":")[0].replace("-", "/").replace("T", "/");
                }
            }
            return folder_name + "/" + date_dtf + "h/" + filename;
        } catch (Exception e) {
            LOG.error("ERROR with either foldername or date");
            LOG.error("Line : " + input);
            LOG.error("folder : " + folder_name);
            LOG.error("Date : " + date_dtf);
            return folder_name + "/" + date_dtf + "h/" + filename;
        }
    }
}
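The date handling in the partition function above can be isolated into a small helper. The following is a minimal sketch, not the asker's code: the class name DateFolders, the pattern yyyy/MM/dd/HH, and the use of UTC instead of ZoneId.systemDefault() are assumptions made for illustration. It shows that a single DateTimeFormatter can be shared across threads because it is immutable, and that pinning the zone keeps folder names identical on every worker.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper (not from the original post).
final class DateFolders {
    // DateTimeFormatter is immutable and thread-safe, so one shared instance is fine.
    // The pattern and the UTC zone are assumptions for this sketch.
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    private DateFolders() {}

    // Converts an epoch-millisecond timestamp into a "yyyy/MM/dd/HH" folder path.
    static String folderFor(long epochMillis) {
        return HOURLY.format(Instant.ofEpochMilli(epochMillis));
    }
}
```

Calling DateFolders.folderFor(timestampMillis) from the partition function would replace both the LocalDateTime conversion and the format call.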
And the actual place where the pipeline is deployed and run can be found below:
public void streamData() {
Pipeline pipeline = Pipeline.create(options);
pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
.apply(options.getWindowDuration() + " Window",
.apply(new GenericFunctions.extractMsg())
.apply(FileIO.<String, String>writeDynamic()
.by(new datePartition(options.getOutputFilenamePrefix()))
.withNaming(type -> FileIO.Write.defaultNaming(type, ".txt"))

The 'Processing stuck ...' message indicates that some particular operation took longer than 5 minutes, not that the job is permanently stuck. However, since the stuck step is FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles and the job ends up cancelled or killed, I suspect an issue while the job is writing temp files.
I found issue BEAM-7689, which concerns the second-granularity timestamp (yyyy-MM-dd_HH-mm-ss) used when writing temporary files. Several concurrent jobs can end up sharing the same temporary directory, so one job may delete it before the others finish.
According to that issue, upgrading to SDK 2.14 mitigates the problem. Please upgrade and let us know if the error is gone.

Since posting this question, I've optimized the dataflow job to dodge bottlenecks and increase parallelization. Much like rsantiago explained, processing stuck isn't an error, but simply a way dataflow communicates that a step is taking significantly longer than other steps, which is essentially a bottleneck that can't be cleared with the given resources. The changes I made seem to have addressed them. The new code is as follows:
public void streamData() {
try {
Pipeline pipeline = Pipeline.create(options);
pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
.apply(options.getWindowDuration() + " Window",
.by(new datePartition(options.getOutputFilenamePrefix()))
(SerializableFunction<PubsubMessage, String>) inputMsg -> new String(inputMsg.getPayload(), StandardCharsets.UTF_8)),
.withNaming(type -> new CrowdStrikeFileNaming(type))
catch(Exception e) {
LOG.error("Unable to deploy pipeline");
LOG.error(e.toString(), e);
The biggest change involved removing the extractMsg() function and changing partitioning to only use metadata. Both of these steps forced deserialization/reserialization of messages and heavily impacted performance.
Additionally, since my data set was unbounded, I had to set a non-zero number of shards. I wanted to simplify my filenaming policy, so I set it to 1 without knowing how much it hurt performance. Since then, I've found a good balance of workers/shards/machine type for my job (mostly based on guess & check, unfortunately).
Although a bottleneck might still be observed with a large enough data load, the pipeline has been performing well despite heavy load (3-5 TB per day). The changes also significantly improved autoscaling, though I'm not sure why. The Dataflow job now reacts to spikes and valleys a lot quicker.


Move already processed files from one folder to another in Flink

I am a newbie to Flink and am facing some challenges with the use case below.
Use case description:
Every day I receive a CSV file with a timestamp in its name in some folder, say input. The file name format is file_name_dd-mm-yy-hh-mm-ss.csv.
My Flink pipeline reads this CSV file row by row and writes each row to my Kafka topic.
Immediately after the file has been read completely, it needs to be moved to another folder, historic.
The reason I need this: suppose the Ververica server stops, either abruptly or manually. If all the processed files are still in the same location, then after the restart Flink will re-read every file it had already processed. To prevent this, files that have already been read need to be moved to another location immediately.
I googled a lot but did not find anything. Can you guide me on how to achieve this?
Let me know if anything else is required.
Out of the box, Flink provides a facility to monitor a directory for new files and read them, via StreamExecutionEnvironment.getExecutionEnvironment.readFile (see similar Stack Overflow threads for examples: How to read newly added file in a directory in Flink / Monitoring directory for new files with Flink for data streams, etc.).
Looking into the source code of the readFile function, it calls the createFileInput() method, which simply instantiates ContinuousFileMonitoringFunction and ContinuousFileReaderOperatorFactory and configures the source:
addSource(monitoringFunction, sourceName, null, boundedness)
.transform("Split Reader: " + sourceName, typeInfo, factory);
ContinuousFileMonitoringFunction is where most of the logic actually happens.
So, if I were to implement your requirement, I would extend ContinuousFileMonitoringFunction with my own logic for moving processed files into the historic folder, and construct the source from this function.
Given that the run method performs the read and forwarding inside the checkpointLock -
synchronized (checkpointLock) {
    monitorDirAndForwardSplits(fileSystem, context);
}
I would say it's safe, on checkpoint completion, to move to the historic folder those files whose modification time is older than globalModificationTime, which is updated in monitorDirAndForwardSplits when the splits are collected.
Accordingly, I would extend the ContinuousFileMonitoringFunction class, implement the CheckpointListener interface, and move the already processed files to the historic folder in notifyCheckpointComplete:
public class ArchivingContinuousFileMonitoringFunction<OUT> extends ContinuousFileMonitoringFunction<OUT> implements CheckpointListener {

    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        Map<Path, FileStatus> eligibleFiles = listEligibleForArchiveFiles(fs, new Path(path));
        // do move logic
    }

    /**
     * Returns the paths of the files already processed.
     *
     * @param fileSystem The filesystem where the monitored directory resides.
     */
    private Map<Path, FileStatus> listEligibleForArchiveFiles(FileSystem fileSystem, Path path) {
        final FileStatus[] statuses;
        try {
            statuses = fileSystem.listStatus(path);
        } catch (IOException e) {
            // we may run into an IOException if files are moved while listing their status
            // delay the check for eligible files in this case
            return Collections.emptyMap();
        }
        if (statuses == null) {
            LOG.warn("Path does not exist: {}", path);
            return Collections.emptyMap();
        } else {
            Map<Path, FileStatus> files = new HashMap<>();
            // handle the new files
            for (FileStatus status : statuses) {
                if (!status.isDir()) {
                    Path filePath = status.getPath();
                    long modificationTime = status.getModificationTime();
                    if (shouldIgnore(filePath, modificationTime)) {
                        files.put(filePath, status);
                    }
                } else if (format.getNestedFileEnumeration() && format.acceptFile(status)) {
                    files.putAll(listEligibleForArchiveFiles(fileSystem, status.getPath()));
                }
            }
            return files;
        }
    }
}
and then define the data stream manually with the custom function:
ContinuousFileMonitoringFunction<OUT> monitoringFunction =
    new ArchivingContinuousFileMonitoringFunction<>(
        inputFormat, monitoringMode, getParallelism(), interval);
ContinuousFileReaderOperatorFactory<OUT, TimestampedFileInputSplit> factory = new ContinuousFileReaderOperatorFactory<>(inputFormat);
final Boundedness boundedness = Boundedness.CONTINUOUS_UNBOUNDED;
env.addSource(monitoringFunction, sourceName, null, boundedness)
.transform("Split Reader: " + sourceName, typeInfo, factory);
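The "// do move logic" placeholder above is left open. As a rough sketch only, and assuming the monitored directory lives on a local filesystem reachable through java.nio.file (the answer itself works with Flink's FileSystem abstraction, which would need its own move or rename call), the move step could look like this. The class name HistoricMover and the method moveToHistoric are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not part of Flink.
final class HistoricMover {
    private HistoricMover() {}

    // Moves each processed file into historicDir, keeping its file name.
    // Returns the new locations of the files that were actually moved.
    static List<Path> moveToHistoric(Collection<Path> processed, Path historicDir) throws IOException {
        Files.createDirectories(historicDir);
        List<Path> moved = new ArrayList<>();
        for (Path source : processed) {
            if (!Files.isRegularFile(source)) {
                continue; // already moved or deleted, e.g. by an earlier checkpoint notification
            }
            Path target = historicDir.resolve(source.getFileName());
            // Overwrite a same-named file left in historicDir by an earlier run.
            moved.add(Files.move(source, target, StandardCopyOption.REPLACE_EXISTING));
        }
        return moved;
    }
}
```

Skipping sources that no longer exist keeps the method safe to call again for the same file, which matters because checkpoint notifications can repeat after a restart.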
Flink itself does not provide a solution for doing this. You might need to build something yourself, or find a workflow tool that can be configured to handle this.
You can ask about this on the flink user mailing list. I know others have written scripts to do this; perhaps someone can share a solution.

AWS Lambda / AWS Batch workflow

I have written a Lambda function, triggered from an S3 bucket, that unzips a zip file and processes a text document inside it. Because of Lambda's memory limits, I need to move the processing over to something like AWS Batch. Correct me if I am wrong, but my workflow should look something like this.
[Image: workflow]
I believe I need to write a Lambda that puts the location of the S3 object on Amazon SQS, where AWS Batch can read the location and do all the unzipping and data processing, since more memory is available there.
Here is my current Lambda. It takes the event triggered by the S3 bucket, checks whether the object is a zip file, and then pushes that S3 key to SQS.
Should I tell AWS Batch to start reading the queue here in my Lambda?
I am totally new to AWS in general and not sure where to go from here.
public class dockerEventHandler implements RequestHandler<S3Event, String> {
private static BigData app = new BigData();
private static DomainOfConstants CONST = new DomainOfConstants();
private static Logger log = Logger.getLogger(S3EventProcessorUnzip.class);
private static AmazonSQS SQS;
private static CreateQueueRequest createQueueRequest;
private static Matcher matcher;
private static String srcBucket, srcKey, extension, myQueueUrl;
public String handleRequest(S3Event s3Event, Context context)
try {
for (S3EventNotificationRecord record : s3Event.getRecords())
srcBucket = record.getS3().getBucket().getName();
srcKey = record.getS3().getObject().getKey().replace('+', ' ');
srcKey = URLDecoder.decode(srcKey, "UTF-8");
matcher = Pattern.compile(".*\\.([^\\.]*)").matcher(srcKey);
if (!matcher.matches())
{ + srcKey);
return "";
extension =;
if (!"zip".equals(extension))
{"Skipping non-zip file " + srcKey + " with extension " + extension);
return "";
}"Sending object location to key" + srcBucket + "//" + srcKey);
//pass in only the reference of where the object is located
createQue(CONST.getQueueName(), srcKey);
catch (IOException e)
return "Ok";
* Setup connection to amazon SQS
* TODO - Find updated api for sqs connection to eliminate depreciation
* */
public static void sQSConnection() {
app.setAwsCredentials(CONST.getAccessKey(), CONST.getSecretKey());
SQS = new AmazonSQSClient(app.getAwsCredentials());
Region usEast1 = Region.getRegion(Regions.US_EAST_1);
catch(Exception e){
//Create new Queue
public static void createQue(String queName, String message){
createQueueRequest = new CreateQueueRequest(queName);
myQueueUrl = SQS.createQueue(createQueueRequest).getQueueUrl();
//Send reference to the s3 objects location to the queue
public static void sendMessage(String SIMPLE_QUE_URL, String S3KeyName){
SQS.sendMessage(new SendMessageRequest(SIMPLE_QUE_URL, S3KeyName));
//Fire AWS batch to pull from que
private static void initializeBatch(){
I have set up Docker and understand Docker images. I believe my Docker image should contain all the code to read the queue, unzip, process, and kick the file off to RDS, all in one image/container.
I am looking for someone who has done something similar and could share it to help. Something along the lines of:
Mr. S3: Hey Lambda, I have a file
Mr. Lambda: Okay S3, I see you. Hey AWS Batch, could you unzip this and do stuff to it?
Mr. Batch: Gotcha Mr. Lambda, I'll take care of that and put it in RDS or some database after.
I have not written the class/Docker image yet, but all the code to unzip, process, and kick off to RDS is done. Lambda is just limited in memory, since some of the files are 1 GB or bigger.
Okay so after looking through the AWS docs on Batch, you don't need an SQS queue. Batch has a concept called Job Queue which is similar to an SQS FIFO queue, but different in that these job queues have priorities, and jobs within them can have dependencies on other jobs. The basic process is:
First, the weird part is setting up IAM roles so that container agents can talk to the container service and AWS Batch is able to launch instances when it needs to (a separate role is also needed if you use spot instances). The details on the required permissions can be found in this doc (PDF) at around page 54.
Once that's done, you set up a compute environment. This consists of EC2 on-demand or spot instances that hold your containers. Jobs operate at the container level. The idea is that the compute environment is the maximum resource allocation your job containers can use; once that limit is hit, your jobs have to wait for resources to be freed up.
Next, you create a job queue. This associates jobs with the compute environment you created.
Then you create a job definition. Technically you don't have to, and can do it through Lambda, but this makes things a bit easier. The job definition indicates what container resources your job needs (you can of course override this in Lambda as well).
With all of that done, you'll want to create a Lambda function, triggered by your S3 bucket event. The function needs the IAM permissions required to call SubmitJob against the Batch service (as well as any other permissions it needs). Basically, all the Lambda has to do is submit a job to AWS Batch. The basic parameters you'll want are the job queue and the job definition. You'll also pass the S3 key of the zip file as a parameter to the job.
Now when the appropriate S3 event fires, it invokes the Lambda, which submits the job to the AWS Batch job queue. Assuming the setup is all good, Batch will happily bring up resources to process your job. Note that, depending on EC2 instance size and the container resources allocated, this may take a bit (much longer than prepping a Lambda function).
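The Lambda side of this flow boils down to turning the S3 event into the arguments of a submit-job call. The sketch below covers only that step and deliberately leaves out the AWS SDK call itself, so it runs with the JDK alone. The class BatchJobArgs, the parameter names bucket and key, and the job-name scheme are assumptions for illustration. The only AWS rule relied on is that Batch job names start with an alphanumeric character, contain only letters, digits, hyphens and underscores, and are at most 128 characters long.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical holder for the arguments a Lambda would pass to Batch's SubmitJob.
final class BatchJobArgs {
    final String jobName;
    final String jobQueue;
    final String jobDefinition;
    final Map<String, String> parameters;

    private BatchJobArgs(String jobName, String jobQueue, String jobDefinition,
                         Map<String, String> parameters) {
        this.jobName = jobName;
        this.jobQueue = jobQueue;
        this.jobDefinition = jobDefinition;
        this.parameters = parameters;
    }

    // Builds the arguments for one uploaded zip file; rejects non-zip keys.
    static BatchJobArgs forZip(String jobQueue, String jobDefinition, String bucket, String key) {
        if (!key.toLowerCase(Locale.ROOT).endsWith(".zip")) {
            throw new IllegalArgumentException("not a zip file: " + key);
        }
        Map<String, String> params = new LinkedHashMap<>();
        params.put("bucket", bucket); // assumed parameter name
        params.put("key", key);       // assumed parameter name
        return new BatchJobArgs(jobNameFor(key), jobQueue, jobDefinition,
                Collections.unmodifiableMap(params));
    }

    // Replaces characters outside [A-Za-z0-9_-] and caps the length at 128.
    static String jobNameFor(String key) {
        String name = "unzip-" + key.replaceAll("[^A-Za-z0-9_-]", "_");
        return name.length() > 128 ? name.substring(0, 128) : name;
    }
}
```

Passing only the bucket and key keeps the job itself responsible for downloading and unzipping, so the Lambda stays small and never touches the large file.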

How to automatically collapse repetitive log output in log4j

Every once in a while, a server or database error causes thousands of the same stack trace in the server log files. It might be a different error/stacktrace today than a month ago. But it causes the log files to rotate completely, and I no longer have visibility into what happened before. (Alternately, I don't want to run out of disk space, which for reasons outside my control right now is limited--I'm addressing that issue separately). At any rate, I don't need thousands of copies of the same stack trace--just a dozen or so should be enough.
I would like it if I could have log4j/log4j2/another system automatically collapse repetitive errors, so that they don't fill up the log files. For example, a threshold of maybe 10 or 100 exceptions from the same place might trigger log4j to just start counting, and wait until they stop coming, then output a count of how many more times they appeared.
What pre-made solutions exist (a quick survey with links is best)? If this is something I should implement myself, what is a good pattern to start with and what should I watch out for?
Will the BurstFilter do what you want? If not, please create a Jira issue with the algorithm that would work for you and the Log4j team would be happy to consider it. Better yet, if you can provide a patch it would be much more likely to be incorporated.
Log4j's BurstFilter will certainly help prevent you filling your disks. Remember to configure it so that it applies in as limited a section of code as you can, or you'll filter out messages you might want to keep (that is, don't use it on your appender, but on a particular logger that you isolate in your code).
I wrote a simple utility class at one point that wrapped a logger and filtered based on n messages within a given Duration. I used instances of it around most of my warning and error logs to protect against the off chance that I'd run into problems like you did. It worked pretty well for my situation, especially because it was easy to adapt quickly to different situations.
Something like:
public DurationThrottledLogger(Logger logger, Duration throttleDuration, int maxMessagesInPeriod) {
public void info(String msg) {
getMsgAddendumIfNotThrottled().ifPresent(addendum-> + addendum));
private synchronized Optional<String> getMsgAddendumIfNotThrottled() {
LocalDateTime now =;
String msgAddendum;
if (throttleDuration.compareTo(Duration.between(lastInvocationTime, now)) <= 0) {
// last one was sent longer than throttleDuration ago - send it and reset everything
if (throttledInDurationCount == 0) {
msgAddendum = " [will throttle future msgs within throttle period]";
} else {
msgAddendum = String.format(" [previously throttled %d msgs received before %s]",
throttledInDurationCount = 0;
numMessagesSentInCurrentPeriod = 1;
lastInvocationTime = now;
return Optional.of(msgAddendum);
} else if (numMessagesSentInCurrentPeriod < maxMessagesInPeriod) {
msgAddendum = String.format(" [message %d of %d within throttle period]", numMessagesSentInCurrentPeriod + 1, maxMessagesInPeriod);
// within throttle period, but haven't sent max messages yet - send it
return Optional.of(msgAddendum);
} else {
// throttle it
return emptyOptional;
I'm pulling this from an old version of the code, unfortunately, but the gist is there. I wrote a bunch of static factory methods that I mainly used because they let me write a single line of code to create one of these for that one log message:
} catch (IOException e) {
DurationThrottledLogger.error(logger, Duration.ofSeconds(1), "Received IO Exception. Exiting current reader loop iteration.", e);
This probably won't be as important in your case; for us, we were using a somewhat underpowered graylog instance that we could hose down fairly easily.
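For readers who want something runnable, here is a minimal, self-contained sketch of the same idea. It is not the class quoted above: the name ThrottledMessages, the fixed-window policy, the injected millisecond clock, and the Consumer<String> sink (standing in for a real logger call such as logger::error) are all assumptions made for illustration. It lets through at most maxPerWindow messages per window and, when the next window starts, reports how many were suppressed.

```java
import java.util.function.Consumer;
import java.util.function.LongSupplier;

// Hypothetical sketch of a fixed-window message throttle.
final class ThrottledMessages {
    private final Consumer<String> sink;
    private final long windowMillis;
    private final int maxPerWindow;
    private final LongSupplier clock; // supplies the current time in milliseconds

    private boolean started;
    private long windowStart;
    private int sentInWindow;
    private int suppressedInWindow;

    ThrottledMessages(Consumer<String> sink, long windowMillis, int maxPerWindow, LongSupplier clock) {
        this.sink = sink;
        this.windowMillis = windowMillis;
        this.maxPerWindow = maxPerWindow;
        this.clock = clock;
    }

    synchronized void log(String msg) {
        long now = clock.getAsLong();
        if (!started || now - windowStart >= windowMillis) {
            // A new window begins: report what the previous one swallowed, then reset.
            if (suppressedInWindow > 0) {
                sink.accept("[suppressed " + suppressedInWindow + " similar messages]");
            }
            started = true;
            windowStart = now;
            sentInWindow = 0;
            suppressedInWindow = 0;
        }
        if (sentInWindow < maxPerWindow) {
            sentInWindow++;
            sink.accept(msg);
        } else {
            suppressedInWindow++;
        }
    }
}
```

In production the clock would typically be System::currentTimeMillis. Note that the suppressed count is only reported when a later message arrives; a fuller version would probably flush it from a timer as well.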

file.lastModified() is never what was set with file.setLastModified()

I have a problem with the millisecond timestamps I set and read back on Android 2.3.4 on a Nexus One. This is the code:
File fileFolder = new File(Environment.getExternalStorageDirectory(), appName + "/"
+ URLDecoder.decode(folder.getUrl()));
if (fileFolder != null && !fileFolder.exists()) {
if (fileFolder != null && fileFolder.exists()) {
long l = fileFolder.lastModified();
In this small test I write 1310198774 but the result that is returned from lastModified() is 1310199771000.
Even if I cut the trailing "000" there's a difference of several minutes.
I need to sync files between a webservice and the Android device. The lastmodification millis are part of the data sent by this service. I do set the millis to the created/copied files and folders to check if the file/folder needs to be overwritten.
Everything is working BUT the millis that are returned from the filesystem are different from the values that were set.
I'm pretty sure there's something wrong with my code - but I can't find it.
Many thanks in advance.
On Jelly Bean and later it's different (so far mostly on Nexus devices, and on others that use the new FUSE layer for /mnt/shell/emulated sdcard emulation):
It's a VFS permission problem: the utimensat() syscall fails with EPERM due to inappropriate permissions (e.g. ownership).
In platform/system/core/sdcard/sdcard.c:
/* all files owned by root.sdcard */
attr->uid = 0;
attr->gid = AID_SDCARD_RW;
From utimensat()'s syscall man page:
2. the caller's effective user ID must match the owner of the file; or
3. the caller must have appropriate privileges.
To make any change other than setting both timestamps to the current time
(i.e., times is not NULL, and both tv_nsec fields are not UTIME_NOW and both
tv_nsec fields are not UTIME_OMIT), either condition 2 or 3 above must apply.
Old FAT offers an override of the iattr->valid flag via a mount option to allow anyone to change timestamps. FUSE and Android's sdcard FUSE daemon don't do this at the moment (so the inode_change_ok() call fails), and the attempt gets rejected with -EPERM. Here's FAT's ./fs/fat/file.c:
/* Check for setting the inode time. */
ia_valid = attr->ia_valid;
if (ia_valid & TIMES_SET_FLAGS) {
if (fat_allow_set_time(sbi, inode))
attr->ia_valid &= ~TIMES_SET_FLAGS;
error = inode_change_ok(inode, attr);
I also added this info to this open bug.
So maybe I'm missing something, but I see some problems with your code above. Your specific problem may be due (as @JB mentioned) to Android issues, but for posterity I thought I'd provide an answer.
First off, File.setLastModified() takes the time in milliseconds. Here are the javadocs. You seem to be trying to set it in seconds. So your code should be something like:
As mentioned in the javadocs, many filesystems only support seconds granularity for last-modification time. So if you need to see the same modification time in a file then you should do something like the following:
private void changeModificationFile(File file, long time) {
    // round the value down to the nearest second
    file.setLastModified((time / 1000) * 1000);
}
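To make the two points above concrete, here is a small sketch. The class ModTimes and its method names are hypothetical; the only API relied on is File.setLastModified(long) and File.lastModified(), which both work in milliseconds since the epoch. It converts a seconds value such as the 1310198774 from the question into milliseconds and compares timestamps at second granularity.

```java
import java.io.File;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for working with file modification times.
final class ModTimes {
    private ModTimes() {}

    // A web service that reports seconds must be converted before calling setLastModified.
    static long secondsToMillis(long epochSeconds) {
        return TimeUnit.SECONDS.toMillis(epochSeconds);
    }

    // Drops the millisecond part, since many filesystems store whole seconds only.
    static long truncateToSecond(long epochMillis) {
        return (epochMillis / 1000) * 1000;
    }

    // Two timestamps count as equal if they fall within the same second.
    static boolean sameSecond(long millisA, long millisB) {
        return truncateToSecond(millisA) == truncateToSecond(millisB);
    }

    // Sets the timestamp from a seconds value and reports whether the filesystem kept it.
    static boolean setAndVerify(File file, long epochSeconds) {
        long millis = secondsToMillis(epochSeconds);
        return file.setLastModified(millis) && sameSecond(file.lastModified(), millis);
    }
}
```

Checking the boolean returned by setLastModified matters here: on the devices discussed in this thread the call can fail without throwing, and the return value is the only signal.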
If this all doesn't work try this (ugly) workaround quoted from
//As a workaround, this ugly hack will set the last modified date to now:
RandomAccessFile raf = new RandomAccessFile(file, "rw");
long length = raf.length();
raf.setLength(length + 1);
Works on some devices but not on others. Do not design a solution that relies on it working. See
Here is a simple test to see if it works.
public void testSetLastModified() throws IOException {
long time = 1316137362000L;
File file = new File("/mnt/sdcard/foo");
assertEquals(time, file.lastModified());
If you only want to change the date/time of a directory to the current date/time (i.e., "now"), then you can create some sort of temporary file inside that directory, write something into it, then immediately delete it. This has the effect of changing the 'lastModified()' date/time of the directory to the present date/time. This won't work though, if you want to change the directory date/time to some other random value, and can't be applied to a file, obviously.

Query Windows Search from Java

I would like to query the Windows Vista Search service directly (or indirectly) from Java.
I know it is possible to query using the search-ms: protocol, but I would like to consume the result within the app.
I have found good information in the Windows Search API but none related to Java.
I would mark as accepted the answer that provides useful and definitive information on how to achieve this.
Thanks in advance.
Does anyone have a JACOB sample, before I can mark this as accepted?
You may want to look at one of the Java-COM integration technologies. I have personally worked with JACOB (JAva COm Bridge):
Which was rather cumbersome (think working exclusively with reflection), but got the job done for me (quick proof of concept, accessing MapPoint from within Java).
The only other such technology I'm aware of is Jawin, but I don't have any personal experience with it:
Update 04/26/2009:
Just for the heck of it, I did more research into Microsoft Windows Search, and found an easy way to integrate with it using OLE DB. Here's some code I wrote as a proof of concept:
public static void main(String[] args) {
DispatchPtr connection = null;
DispatchPtr results = null;
try {
connection = new DispatchPtr("ADODB.Connection");
"Provider=Search.CollatorDSO;" +
"Extended Properties='Application=Windows';");
results = (DispatchPtr)connection.invoke("Execute",
"select System.Title, System.Comment, System.ItemName, System.ItemUrl, System.FileExtension, System.ItemDate, System.MimeType " +
"from SystemIndex " +
"where contains('Foo')");
int count = 0;
while(!((Boolean)results.get("EOF")).booleanValue()) {
++ count;
DispatchPtr fields = (DispatchPtr)results.get("Fields");
int numFields = ((Integer)fields.get("Count")).intValue();
for (int i = 0; i < numFields; ++ i) {
DispatchPtr item =
(DispatchPtr)fields.get("Item", new Integer(i));
item.get("Name") + ": " + item.get("Value"));
System.out.println("\nCount:" + count);
} catch (COMException e) {
} finally {
try {
} catch (COMException e) {
try {
} catch (COMException e) {
try {
} catch (COMException e) {
To compile this, you'll need to make sure that the JAWIN JAR is in your classpath, and that jawin.dll is in your path (or java.library.path system property). This code simply opens an ADO connection to the local Windows Desktop Search index, queries for documents with the keyword "Foo," and prints out a few key properties of the resulting documents.
Let me know if you have any questions, or need me to clarify anything.
Update 04/27/2009:
I tried implementing the same thing in JACOB as well, and will be doing some benchmarks to compare performance differences between the two. I may be doing something wrong in JACOB, but it seems to consistently be using 10x more memory. I'll be working on a jcom and com4j implementation as well, if I have some time, and try to figure out some quirks that I believe are due to the lack of thread safety somewhere. I may even try a JNI based solution. I expect to be done with everything in 6-8 weeks.
Update 04/28/2009:
This is just an update for those who've been following and are curious. It turns out there are no threading issues; I just needed to explicitly close my database resources, since the OLE DB connections are presumably pooled at the OS level (I probably should have closed the connections anyway...). I don't think there will be any further updates to this. Let me know if anyone runs into any problems with this.
Update 05/01/2009:
Added JACOB example per Oscar's request. This goes through the exact same sequence of calls from a COM perspective, just using JACOB. While it's true JACOB has been much more actively worked on in recent times, I also notice that it's quite a memory hog (uses 10x as much memory as the Jawin version)
public static void main(String[] args) {
    Dispatch connection = null;
    Dispatch results = null;
    try {
        connection = new Dispatch("ADODB.Connection");
        Dispatch.call(connection, "Open",
            "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';");
        results = Dispatch.call(connection, "Execute",
            "select System.Title, System.Comment, System.ItemName, System.ItemUrl, System.FileExtension, System.ItemDate, System.MimeType " +
            "from SystemIndex " +
            "where contains('Foo')").toDispatch();
        int count = 0;
        while (!Dispatch.get(results, "EOF").getBoolean()) {
            ++count;
            Dispatch fields = Dispatch.get(results, "Fields").toDispatch();
            int numFields = Dispatch.get(fields, "Count").getInt();
            for (int i = 0; i < numFields; ++i) {
                Dispatch item = Dispatch.call(fields, "Item", new Integer(i)).toDispatch();
                System.out.println(
                    Dispatch.get(item, "Name") + ": " +
                    Dispatch.get(item, "Value"));
            }
            System.out.println();
            Dispatch.call(results, "MoveNext");
        }
    } finally {
        // close the recordset first, then the connection
        try {
            Dispatch.call(results, "Close");
        } catch (JacobException e) {
        }
        try {
            Dispatch.call(connection, "Close");
        } catch (JacobException e) {
        }
    }
}
As a few posts here suggest, you can bridge between Java and .NET or COM using commercial or free frameworks like JACOB, JNBridge, J-Integra, etc.
Actually, I had an experience with one of these third-party products (an expensive one :-) ) and I must say I will do my best to avoid repeating this mistake in the future. The reason is that it involves a lot of "voodoo" you can't really debug; it's very hard to understand what the problem is when things go wrong.
The solution I would suggest is to create a simple .NET application that makes the actual calls to the Windows Search API. After doing so, you need to establish a communication channel between this component and your Java code. This can be done in various ways, for example by messaging through a small DB that your application periodically polls, or by registering this component on the machine's IIS (if it exists) and exposing a simple WS API to communicate with it.
I know that it may sound cumbersome but the clear advantages are: a) you communicate with windows search API using the language it understands (.NET or COM) , b) you control all the application paths.
Any reason why you couldn't just use Runtime.exec() to query via search-ms and read the BufferedReader with the result of the command? For example:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ExecTest {
    public static void main(String[] args) throws IOException {
        Process process = Runtime.getRuntime().exec("search-ms:query=microsoft&");
        BufferedReader output = new BufferedReader(new InputStreamReader(process.getInputStream()));
        StringBuffer outputSB = new StringBuffer(40000);
        String s = null;
        while ((s = output.readLine()) != null) {
            outputSB.append(s + "\n");
        }
        // take the text from the buffer, not from the reader
        String result = outputSB.toString();
    }
}
There are several libraries out there for calling COM objects from java, some are opensource (but their learning curve is higher) some are closed source and have a quicker learning curve. A closed source example is EZCom. The commercial ones tend to focus on calling java from windows as well, something I've never seen in opensource.
In your case, what I would suggest you do is front the call in your own .NET class (I guess use C# as that is closest to Java without getting into the controversial J#), and focus on making the interoperability with the .NET dll. That way the windows programming gets easier, and the interface between Windows and java is simpler.
If you are looking for how to use a java com library, the MSDN is the wrong place. But the MSDN will help you write what you need from within .NET, and then look at the com library tutorials about invoking the one or two methods you need from your .NET objects.
Given the discussion in the answers about using a Web Service, you could (and probably will have better luck) build a small .NET app that calls an embedded java web server rather than try to make .NET have the embedded web service, and have java be the consumer of the call. For an embedded web server, my research showed Winstone to be good. Not the smallest, but is much more flexible.
The way to get that to work is to launch the .NET app from java, and have the .NET app call the web service on a timer or a loop to see if there is a request, and if there is, process it and send the response.

