Finding min/max values in a KafkaStream (KStream) object - java

I have a Kafka Streams application and Avro schemas for each of the topics, plus one for the key. The key schema is the same for all topics.
Now, there is a KStream object whose key is the known key object and whose value is an object (generated from an Avro schema) that extends org.apache.avro.specific.SpecificRecordBase, but it could be any of my Avro schemas for the topic content.
KStream<CustomKey, ? extends SpecificRecordBase> myStream = ...
What I want to achieve is to run min and max functions on this stream. The problem is that I don't know what the ? type is, and since there are 30+ topics (a number that will grow), I don't want to write a switch-case. So I have the following:
public KStream<CustomKey, ? extends SpecificRecordBase> max(
        final KStream<CustomKey, ? extends SpecificRecordBase> myStream,
        final String attributeName) {
    SpecificRecordBase maxValue = ...;
    myStream.foreach((key, value) -> {
        value.get(attributeName); // I want to find the max value for this attribute,
                                  // but at this point we don't know its type, and
                                  // we can't assign maxValue = value, because this is
                                  // a lambda function.
    });
    // find and return the max value
}
My question is: how can I calculate the max value of the attributeName attribute over myStream?

it could be any of my avro schemas for the topic content
Then you need your values to extend something like a common ClassWithMinMaxFields; otherwise, you will be unable to extract those fields from a generic SpecificRecordBase object.
Also, your method returns a stream. You cannot return the min/max. If that is your objective, you need a plain consumer to scan the whole topic, beginning to (eventual) end.
To do this (correctly) with the Streams API, you would either:
1. build a KTable for every value, grouped by key, then do a table scan for the min/max when you need them, or
2. create a new topic using the aggregate DSL function, initialized with {"min": +Inf, "max": -Inf}; on each new record you compare the old and new values, and if you have a new min and/or max you set them and return the updated record (see the sketch after this list). Then you still need an external consumer to fetch the most recent min/max events.
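A minimal sketch of that aggregate approach (an illustration only, not the poster's code: it widens the value type to SpecificRecordBase, assumes the tracked attribute is numeric, and uses a hypothetical MinMax POJO plus serdes you would supply yourself):

public KTable<CustomKey, MinMax> minMaxByKey(
        final KStream<CustomKey, SpecificRecordBase> stream,
        final String attributeName,
        final Serde<CustomKey> keySerde,
        final Serde<MinMax> minMaxSerde) {
    return stream
        .groupByKey()
        .aggregate(
            // start from +Inf/-Inf so the first record replaces both bounds
            () -> new MinMax(Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY),
            (key, value, agg) -> {
                double v = ((Number) value.get(attributeName)).doubleValue();
                return new MinMax(Math.min(agg.getMin(), v), Math.max(agg.getMax(), v));
            },
            Materialized.with(keySerde, minMaxSerde));
}

The resulting KTable (backed by a changelog topic) always holds the latest min/max per key, which is what the external consumer would then read.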
If you had a consistent Avro type, you could use ksqlDB functions.

Related

How to get private vendor attribute tag in C_FIND from pixelmed?

I'm trying to read a private vendor tag from a DICOM server.
The only tags I'm able to read successfully are the standard DICOM TagFromName entries.
The tag is (2001,100b), and in my example set of files they definitely have that entry in their header.
Here is the code for calling the C-FIND request:
SpecificCharacterSet specificCharacterSet = new SpecificCharacterSet((String[])null);
AttributeList identifier = new AttributeList();
//specify attributes to retrieve and pass in any search criteria
//query root of "study" to retrieve studies
studies.removeAllElements();
identifier.putNewAttribute(TagFromName.QueryRetrieveLevel).addValue("STUDY");
identifier.putNewAttribute(TagFromName.PatientName,specificCharacterSet).addValue("*");
identifier.putNewAttribute(TagFromName.PatientID,specificCharacterSet);
identifier.putNewAttribute(TagFromName.StudyID);
identifier.putNewAttribute(TagFromName.PatientAge);
identifier.putNewAttribute(TagFromName.PatientSex);
identifier.putNewAttribute(TagFromName.ModalitiesInStudy);
identifier.putNewAttribute(TagFromName.AccessionNumber);
identifier.putNewAttribute(TagFromName.StudyInstanceUID);
identifier.putNewAttribute(TagFromName.StudyDescription);
identifier.putNewAttribute(TagFromName.StudyDate).addValue(date);
identifier.putNewAttribute(TagFromName.StudyTime);
AttributeTag at = new com.pixelmed.dicom.AttributeTag("0x2001,0x100b");
identifier.putNewAttribute(at);
IdentifierHandler ih = new IdentifierHandler() {
    @Override
    public void doSomethingWithIdentifier(AttributeList id) throws DicomException {
        studies.add(new Study(id, reportfolder));
        // Attempt to read private dicom tag from received identifier
        System.out.println(id.get(at));
    }
};
new FindSOPClassSCU(serv.getAddress(),serv.getPort(), serv.getAetitle(), "ISPReporter",SOPClass.StudyRootQueryRetrieveInformationModelFind,identifier,ih);
However, the query returns 7 identifiers that match the date; when I try to read the (2001,100b) tag, the error I get reads:
DicomException: No such data element as (0x2001,0x100b) in dictionary
If I use this line instead:
identifier.put(new com.pixelmed.dicom.TextAttribute(at) {
    public int getMaximumLengthOfEntireValue() { return 20; }
});
Then I get:
null
null
null
null
null
null
null
(null for each identifier returned)
Two things (the second is moot, because this won't work anyway due to the first):
C-FIND SCPs query against a database of a subset of data elements previously extracted from the DICOM image headers and indexed. Only a (small) subset of the data elements present in images is actually indexed: the standard requires very few in the Query Information Models, and the IHE Scheduled Workflow (SWF) profile a few more (Query Images Transaction, Table 4.14-1). Implementers could index every data element (or at least every standard data element), but this is rarely done. PixelMed doesn't, though I have considered doing it adaptively as data elements are encountered, now that hsqldb supports adding columns; NoSQL-based implementations might find this easier.
When you encode a private data element, whether it be in a query identifier/response, or in an image header, you need to include its creator; i.e., for (2001,100b), you need to include (2001,0010); otherwise the creator of the private data element is not specified.
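For example, a rough sketch of adding the creator element to the identifier before requesting the private tag (the use of LongStringAttribute and the creator string shown here are assumptions; use the creator ID your vendor actually writes):

// Add the private creator (2001,0010) so that (2001,100b) is properly qualified.
AttributeTag creatorTag = new com.pixelmed.dicom.AttributeTag(0x2001, 0x0010);
Attribute creator = new com.pixelmed.dicom.LongStringAttribute(creatorTag);
creator.addValue("Philips Imaging DD 001");   // placeholder creator string; check your file headers
identifier.put(creator);

// Then request the private tag itself, as before.
AttributeTag at = new com.pixelmed.dicom.AttributeTag(0x2001, 0x100b);
identifier.put(new com.pixelmed.dicom.TextAttribute(at) {
    public int getMaximumLengthOfEntireValue() { return 20; }
});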
David

Converting Linq queries to Java 8

I'm translating an old enterprise app that uses C# with LINQ queries to Java 8. There are some queries I'm not able to reproduce using lambdas, as I don't know how C# handles them.
For example, in this Linq:
from register in registers
group register by register.muleID into groups
select new Petition
{
Data = new PetitionData
{
UUID = groups.Key
},
Registers = groups.ToList<AuditRegister>()
}).ToList<Petition>()
I understand this as a groupingBy in Java 8 lambdas, but what is the "select new PetitionData" inside the query? I don't know how to code it in Java.
I have this at this moment:
Map<String, List<AuditRegister>> groupByMuleId =
        registers.stream().collect(Collectors.groupingBy(AuditRegister::getMuleID));
Thank you and regards!
The select LINQ operation is similar to the map method of Stream in Java. They both transform each element of the sequence into something else.
collect(Collectors.groupingBy(AuditRegister::getMuleID)) returns a Map<String, List<AuditRegister>> as you know. But the groups variable in the C# version is an IEnumerable<IGrouping<string, AuditRegister>>. They are quite different data structures.
What you need is the entrySet method of Map. It turns the map into a Set<Map.Entry<String, List<AuditRegister>>>. Now, this data structure is more similar to IEnumerable<IGrouping<string, AuditRegister>>. This means that you can create a stream from the return value of entrySet, call map, and transform each element into a Petition.
groups.Key is simply x.getKey(), groups.ToList() is simply x.getValue(). It should be easy.
I suggest you create a separate method to pass to map:
// you can probably come up with a more meaningful name
public static Petition mapEntryToPetition(Map.Entry<String, List<AuditRegister>> entry) {
    Petition petition = new Petition();
    PetitionData data = new PetitionData();
    data.setUUID(entry.getKey());
    petition.setData(data);
    petition.setRegisters(entry.getValue());
    return petition;
}
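Putting it together, a sketch of the full pipeline (the enclosing class name PetitionMapper for the helper is an assumption):

List<Petition> petitions = registers.stream()
        .collect(Collectors.groupingBy(AuditRegister::getMuleID))
        .entrySet()
        .stream()
        .map(PetitionMapper::mapEntryToPetition)
        .collect(Collectors.toList());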

How to apply map function in Spark DataFrame using Java?

I am trying to use the map function on a DataFrame in Spark using Java. I am following the documentation, which says:
map(scala.Function1 f, scala.reflect.ClassTag evidence$4)
Returns a new RDD by applying a function to all rows of this DataFrame.
When using Function1 in map, I need to implement all of its methods. I have seen some questions related to this, but the solutions provided convert the DataFrame into an RDD.
How can I use the map function on a DataFrame without converting it into an RDD, and what is the second parameter of map, i.e. scala.reflect.ClassTag<R> evidence$4?
I am using Java 7 and Spark 1.6.
I know your question is about Java 7 and Spark 1.6, but in Spark 2 (and obviously Java 8), you can have a map function as part of a class, so you do not need to manipulate Java lambdas.
The call would look like:
Dataset<String> dfMap = df.map(
        new CountyFipsExtractorUsingMap(),
        Encoders.STRING());
dfMap.show(5);
The class would look like:
/**
 * Returns a substring of the values in the id2 column.
 *
 * @author jgp
 */
private final class CountyFipsExtractorUsingMap
        implements MapFunction<Row, String> {
    private static final long serialVersionUID = 26547L;

    @Override
    public String call(Row r) throws Exception {
        String s = r.getAs("id2").toString().substring(2);
        return s;
    }
}
You can find more details in this example on GitHub.
I think map is not the right method to use on a DataFrame. Maybe you should have a look at the examples in the API docs; there they show how to operate on DataFrames.
You can use the Dataset directly; there is no need to convert the data you read into an RDD, which is an unnecessary consumption of resources.
dataset.map(mapFunction, encoder); should suffice for your needs.
Because you don't describe a specific problem, note that there are some common alternatives to map on a DataFrame, such as select, selectExpr, and withColumn. If the Spark SQL built-in functions can't handle your task, you can use a UDF.
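For instance, a sketch of the substring example above done without map, using selectExpr (this assumes the Spark 2 Dataset API as in the answers above and the id2 column from the earlier example; in Spark 1.6 the same call exists on DataFrame):

// Let Catalyst handle the transformation instead of an opaque map function.
Dataset<Row> result = df.selectExpr("substring(id2, 3) AS fips");
result.show(5);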

Is map reduce suitable for invoking web services and transforming xml data?

We have a job that runs on a single node and takes up to 40 minutes to complete. With MapReduce we hope to get that down to less than 2 minutes, but we're not sure which parts of the process belong in map() and which in reduce().
Current Process:
For a list of keys, call a web service for each key and get an XML response; transform the XML into pipe-delimited format; and write a single output file at the end:
def keys = 100..9999
def output = new StringBuffer()
keys.each() { key ->
    def xmlResponse = callRemoteService(key)
    def transformed = convertToPipeDelimited(xmlResponse)
    output.append(transformed)
}
file.write(output)
Map/Reduce Model
Here's how I modeled it with map/reduce, just want to make sure I'm on the right path...
Mapper
The keys get pulled from keys.txt; I call the remote service for each key and store key/xml pair...
public static class XMLMapper extends Mapper<Text, Text, Text, Text> {
    private Text xml = new Text();

    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        String xmlResponse = callRemoteService(key);
        xml.set(xmlResponse);
        context.write(key, xml);
    }
}
Reducer
For each key/xml pair, I transform the xml to pipe-delimited format, then write out the result...
public static class XMLToPipeDelimitedReducer extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String xml = values.iterator().next().toString();
        String transformed = convertToPipeDelimited(xml);
        result.set(transformed);
        context.write(key, result);
    }
}
Questions
1. Is it good practice to call the web service in map() while doing the transform in reduce()? Are there any benefits to doing both operations in map()?
2. I don't check for duplicates in reduce() because keys.txt contains no duplicate keys; is that safe?
3. How can I control the format of the output file? TextOutputFormat looks interesting; I want it to read like this:
100|foo bar|$456,098
101|bar foo|$20,980
You should do the transform map-side, for a couple of reasons:
Turning from xml to pipe-delimited will reduce the amount of data you're serializing and transmitting into the reducer.
You will be running multiple map jobs, but a single reduce job, so you want to transform map-side to take advantage of that parallelism.
Since all the work is map-side, you can just use the provided IdentityReducer and not have to write your own code for that.
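A sketch of what that map-side-only version might look like (callRemoteService and convertToPipeDelimited are the asker's own helpers and are assumed to be available):

public static class XMLToPipeMapper extends Mapper<Text, Text, Text, Text> {
    private final Text out = new Text();

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // Fetch the XML and transform it in the same task, so only the small
        // pipe-delimited record travels to the reducer.
        String xmlResponse = callRemoteService(key);
        out.set(convertToPipeDelimited(xmlResponse));
        context.write(key, out);
    }
}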
If you want a single output file, you'll want to use a single reducer; map-reduce produces one output file per reducer.
If you're sure there are no duplicate keys, then yes, it should be safe to ignore duplicates reduce-side.
I believe TextOutputFormat will by default write your (key, value) pairs to file as a tab-separated string, so not quite the format you want. See here for how you might change that.
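For reference, one common way to change the separator is via the job configuration; the property name below assumes the newer org.apache.hadoop.mapreduce API (older releases use mapred.textoutputformat.separator):

Configuration conf = new Configuration();
// Make TextOutputFormat write "key|value" instead of the default key<TAB>value.
conf.set("mapreduce.output.textoutputformat.separator", "|");
Job job = Job.getInstance(conf, "xml-to-pipe-delimited");
job.setNumReduceTasks(1);   // single reducer -> single sorted output file, as noted above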
Your webservice is going to be one limiting factor here. Assuming you want your 40-minute job to run in 2 minutes, you'll probably want 40 or so map jobs reading from it. Can it handle 40 concurrent readers?
Your other limiting factor is going to be the reduce-side. Assuming you want a single output file sorted by key, you're going to have to use a single reducer, and it will have to sort all your input, which could take a little bit.
Once you have your code working, you'll have to run some experiments and see what settings give you a reasonable run-time. Good luck.

Reading large sets of data in Java

I am using Java to read and process some datasets from the UCI Machine Learning Repository.
I started out by making a class for each dataset and working with that particular class file. Every attribute in the dataset was represented by a corresponding data member of the required type. This approach worked fine as long as the number of attributes stayed below 10-15: I just added or removed data members of the class and changed their types to model new datasets, and made the required changes to the functions.
The problem:
I have to work with much larger datasets now. Ones with more than 20-30 attributes are very tedious to handle in this manner. I don't need to query the data; my discretization algorithm just needs 4 scans of the data to discretize it, and my work ends right after the discretization. What would be an effective strategy here?
I hope I have been able to state my problem clearly.
Some options:
Write a code generator to read the meta-data of the file and generate the equivalent class file.
Don't bother with classes; keep the data in arrays of Object or String and cast them as needed.
Create a class that contains a collection of DataElements and subclass DataElements for all the types you need and use the meta-data to create the right class at runtime.
Create a simple DataSet class that contains a member like the following:
public class DataSet {
    private List<Column> columns = new ArrayList<Column>();
    private List<Row> rows = new ArrayList<Row>();

    public void parse(File file) {
        // routines to read CSV data into this class
    }
}
public class Row {
    private Object[] data;

    public void parse(String row, List<Column> columns) {
        String[] fields = row.split(",");
        data = new Object[fields.length];
        int i = 0;
        for (Column column : columns) {
            data[i] = column.convert(fields[i]);
            i++;
        }
    }
}
public class Column {
    private String name;
    private int index;
    private DataType type;

    public Object convert(String data) {
        if (type == DataType.NUMERIC) {
            return Double.parseDouble(data);
        } else {
            return data;
        }
    }
}

public enum DataType {
    CATEGORICAL, NUMERIC
}
That'll handle any dataset you wish to use. The only requirement is that the user must define the dataset by declaring the columns and their respective data types to the DataSet; you can do that in code or read it from a file, whichever you find easier. You might be able to default a lot of that configuration (say, to CATEGORICAL), or attempt to parse each field as a number: if parsing fails the column must be CATEGORICAL, otherwise it's NUMERIC. Normally the file contains a header you can parse to find the column names, so you only need to figure out each column's data type by looking at its data. A simple algorithm that guesses the data type goes a long way. Essentially this is the same data structure every other package uses for data like this (e.g. R, Weka, etc.).
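A tiny sketch of that guessing step (a hypothetical helper, not part of the classes above):

// Try to read a sample value as a number; if that fails, call the column CATEGORICAL.
public static DataType guessType(String sample) {
    try {
        Double.parseDouble(sample.trim());
        return DataType.NUMERIC;
    } catch (NumberFormatException e) {
        return DataType.CATEGORICAL;
    }
}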
I did something like that in one of my projects: lots of variable data, and in my case I obtained the data from the Internet. Since I needed to query, sort, etc., I spent some time designing a database to accommodate all the variations of the data (not all entries had the same number of properties). It took a while, but in the end I used the same code to get the data for any entry (using JPA in my case). My IDE (NetBeans) generated most of that code straight from the database schema.
From your question, it is not clear on how you plan to use the data so I'm answering based on personal experience.
