Unit testing ElasticSearch search result converter - java

In our project I have written a small class which is designed to take the result from an ElasticSearch query containing a named aggregation and return information about each of the buckets returned in the result in a neutral format, suitable for passing on to our UI.
public class AggsToSimpleChartBasicConverter {

    private SearchResponse searchResponse;
    private String aggregationName;

    private static final Logger logger = LoggerFactory.getLogger(AggsToSimpleChartBasicConverter.class);

    public AggsToSimpleChartBasicConverter(SearchResponse searchResponse, String aggregationName) {
        this.searchResponse = searchResponse;
        this.aggregationName = aggregationName;
    }

    public void setChartData(SimpleChartData chart,
                             BucketExtractors.BucketNameExtractor keyExtractor,
                             BucketExtractors.BucketValueExtractor valueExtractor) {
        Aggregations aggregations = searchResponse.getAggregations();
        Terms termsAggregation = aggregations.get(aggregationName);
        if (termsAggregation != null) {
            for (Terms.Bucket bucket : termsAggregation.getBuckets()) {
                chart.add(keyExtractor.extractKey(bucket),
                          Long.parseLong(valueExtractor.extractValue(bucket).toString()));
            }
        } else {
            logger.warn("Aggregation " + aggregationName + " could not be found");
        }
    }
}
I want to write a unit test for this class by calling setChartData() and performing some assertions against the object passed in, since the mechanics of it are reasonably simple. However, in order to do so I need to construct an instance of org.elasticsearch.action.search.SearchResponse containing some test data, which is required by my class's constructor.
I looked at implementing a solution similar to this existing question, but the process for adding aggregation data to the result is more involved and requires the use of private internal classes which would likely change in a future version, even if I could get it to work initially.
I reviewed the ElasticSearch docs on unit testing, and there is a mention of a class org.elasticsearch.test.ESTestCase (source), but there is no guidance on how to use it, and I'm not convinced it is intended for this scenario.
How can I easily unit test this class in a manner that is not likely to break in future ES releases?
Note, I do not want to have to start up an instance of ElasticSearch, embedded or otherwise, since that is overkill for this simple unit test and would significantly slow down its execution.
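For reference, the direction I am currently leaning towards is stubbing only the pieces of the response that my converter actually reads, e.g. with Mockito. This is just a sketch, assuming Mockito is on the test classpath and that SearchResponse, Aggregations, Terms and Terms.Bucket can all be mocked in the ES version in use:

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.util.Collections;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.aggregations.Aggregations;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.junit.Test;

public class AggsToSimpleChartBasicConverterTest {

    @Test
    public void setChartDataAddsOneEntryPerBucket() {
        SearchResponse searchResponse = mock(SearchResponse.class);
        Aggregations aggregations = mock(Aggregations.class);
        Terms terms = mock(Terms.class);
        Terms.Bucket bucket = mock(Terms.Bucket.class);

        when(searchResponse.getAggregations()).thenReturn(aggregations);
        when(aggregations.get("myAgg")).thenReturn(terms);
        // depending on the ES version, the generics on getBuckets() may
        // require doReturn(..).when(..) instead of when(..).thenReturn(..)
        when(terms.getBuckets()).thenReturn(Collections.singletonList(bucket));

        AggsToSimpleChartBasicConverter converter =
                new AggsToSimpleChartBasicConverter(searchResponse, "myAgg");

        // chart, keyExtractor and valueExtractor are our own classes, so
        // they are straightforward to construct or stub here:
        // converter.setChartData(chart, keyExtractor, valueExtractor);
        // ... then assert against chart ...
    }
}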

Related

Can I get the field value as a String into a custom TokenFilter in Apache Solr?

I need to write a custom LemmaTokenFilter, which replaces and indexes words with their lemmatized (base) forms. The problem is that I get the base forms from an external API, meaning I need to call the API, send my text, parse the response, and send it as a Map<String, String> to my LemmaTokenFilter. The map contains pairs of <originalWord, baseFormOfWord>. However, I cannot figure out how I can access the full value of the text field which is being processed by the TokenFilters.
One idea is to go through the tokenStream one by one when the LemmaTokenFilter is created by the LemmaTokenFilterFactory. However, I would need to watch out not to edit anything in the tokenStream, and to somehow reset the current token (since I would need to call the .increment() method on it to get all the tokens). Most importantly, this seems unnecessary, since the field value is already there somewhere and I don't want to spend time trying to piece it together again from the tokens. This implementation would probably be too slow anyway.
Another idea would be to just process every token separately, but calling an external API with only one word and then parsing the response is definitely too inefficient.
I have found something about using the ResourceLoaderAware interface, but I don't really understand how I could use it to my advantage. I could probably save the map in a text file before every indexing run, but writing to a file, then opening and reading from it before every document is indexed, seems too slow as well.
So the best way would be to just pass the value of the field as a String to the constructor of LemmaTokenFilter; however, I don't know how to access it from the create() method of the LemmaTokenFilterFactory.
I could not find any help by googling, so any ideas are welcome.
Here's what I have so far:
public final class LemmaTokenFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private Map<String, String> lemmaMap;

    protected LemmaTokenFilter(TokenStream input, Map<String, String> lemmaMap) {
        super(input);
        this.lemmaMap = lemmaMap;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            String term = termAtt.toString();
            String lemma;
            if ((lemma = lemmaMap.get(term)) != null) {
                termAtt.setEmpty();
                termAtt.copyBuffer(lemma.toCharArray(), 0, lemma.length());
            }
            return true;
        } else {
            return false;
        }
    }
}
public class LemmaTokenFilterFactory extends TokenFilterFactory implements ResourceLoaderAware {

    public LemmaTokenFilterFactory(Map<String, String> args) {
        super(args);
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    @Override
    public TokenStream create(TokenStream input) {
        return new LemmaTokenFilter(input, getLemmaMap(getFieldValue(input)));
    }

    private String getFieldValue(TokenStream input) {
        //TODO: how?
        return "Šach je desková hra pro dva hráče, v dnešní soutěžní podobě zároveň považovaná i za odvětví sportu.";
    }

    private Map<String, String> getLemmaMap(String data) {
        return UdPipeService.getLemma(data);
    }

    @Override
    public void inform(ResourceLoader loader) throws IOException {
    }
}
1. API-based approach:
You can create an analysis chain with the custom lemmatizer on top. To design this lemmatizer, I guess you can look at the implementation of the Keyword Tokenizer,
so that you can read the entire input and then call your API;
replace the tokens in the input text with those from the API response;
and after that, further down the analysis chain, use the standard or whitespace tokenizer to tokenize your data.
2. File-based approach:
It follows all the same steps, except that instead of calling the API it can use a hashmap loaded from the files specified when defining the TokenStream.
Now coming to ResourceLoaderAware:
It is needed when you have to tell your TokenStream that a resource has changed; its inform() method takes care of that. For reference, you can look into StemmerOverrideFilter.
Keyword Tokenizer: emits the entire input as a single token.
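For illustration, the "read everything, then call your API once" idea can be sketched as a buffering TokenFilter. This is only a rough sketch: it reuses the UdPipeService from the question, and captureState()/restoreState() come from Lucene's AttributeSource:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class BufferingLemmaTokenFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final List<State> states = new ArrayList<>();
    private final List<String> terms = new ArrayList<>();
    private Map<String, String> lemmaMap;
    private int position = 0;

    public BufferingLemmaTokenFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (lemmaMap == null) {
            // First call: drain the whole stream, remembering each token's state
            while (input.incrementToken()) {
                terms.add(termAtt.toString());
                states.add(captureState());
            }
            // One API call for the whole (approximate) field value
            lemmaMap = UdPipeService.getLemma(String.join(" ", terms));
        }
        if (position < states.size()) {
            // Replay the buffered tokens, swapping in lemmas where known
            restoreState(states.get(position));
            String lemma = lemmaMap.get(termAtt.toString());
            if (lemma != null) {
                termAtt.setEmpty().append(lemma);
            }
            position++;
            return true;
        }
        return false;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        states.clear();
        terms.clear();
        lemmaMap = null;
        position = 0;
    }
}

Note the trade-off the question already anticipated: the text sent to the API is rebuilt from tokens, so punctuation and original spacing are lost.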
So I think I found the answer, or actually two answers.
One would be to write my client application in such a way that incoming requests are first pre-processed: the field value is sent to the external API and the response is stored in some global variable, which can then be accessed from the custom TokenFilters.
Another one would be to use a custom UpdateRequestProcessor, which allows us to modify the content of the incoming document, calling the external API and again saving the response so it's somehow globally accessible from the custom TokenFilters. Here Erik Hatcher talks about the use of the ScriptUpdateProcessor, which I believe can be used in my case too.
Hope this helps anyone stumbling upon a similar problem, because I had a hard time looking for a solution to this (I could not find any similar threads on SO).
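To make the second idea a bit more concrete, a custom processor could look roughly like the sketch below. The field name "text" and the LemmaCache store are hypothetical; UdPipeService is the external-API client from my code above:

import java.io.IOException;
import java.util.Map;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class LemmaLookupProcessorFactory extends UpdateRequestProcessorFactory {

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                Object text = doc.getFieldValue("text"); // hypothetical field name
                if (text != null) {
                    // One API call per document, before analysis runs
                    Map<String, String> lemmaMap = UdPipeService.getLemma(text.toString());
                    // LemmaCache is a hypothetical globally accessible store
                    // that the custom TokenFilter reads during analysis
                    LemmaCache.put(cmd.getPrintableId(), lemmaMap);
                }
                super.processAdd(cmd);
            }
        };
    }
}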

JUnit test case for my API

I am new to writing JUnit tests. I have the Java API below, which gets a unique value from the database each time; it contains just a single query. I need to write a JUnit test for it. Can anybody give some suggestions on how I should approach this?
public static int getUniqueDBCSequence() throws Exception
{
    int w_seq = 0;
    QueryData w_ps = null;
    ResultSet w_rs = null;
    try
    {
        w_ps = new QueryData("SELECT GETUNIQUENUMBER.NEXTVAL FROM DUAL");
        w_rs = SQLService.executeQuery(w_ps);
        while ( w_rs.next() )
        {
            w_seq = w_rs.getInt(1);
        }
    }
    catch (Exception a_ex)
    {
        LOGGER.fatal("Error occurred : " + a_ex.getMessage());
    }
    finally
    {
        SQLService.closeResultSet(w_rs);
    }
    return w_seq;
}
You are using only static methods, both in the class under test and in its dependencies.
That is really not testable code with JUnit.
Besides, what do you actually want to test here?
Your test has no substantive logic.
You could make SQLService.executeQuery() an instance method in order to mock it, but what would be the point of mocking it?
To assert that the result of w_seq = w_rs.getInt(1); is returned?
These are technical assertions of little value, and maintaining unit tests of little value should be avoided.
Now, you could test with DBUnit or similar tools that populate an in-memory database and execute the code against it, as sketched below.
But the query is strongly coupled to Oracle sequences,
so you could have some difficulties doing that.
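If you do go the in-memory route, a minimal sketch with H2 in Oracle compatibility mode (assuming H2 is on the test classpath, and bypassing the SQLService wrapper to keep the example self-contained) could look like this:

import static org.junit.Assert.assertTrue;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.junit.Test;

public class UniqueSequenceTest {

    @Test
    public void nextvalReturnsIncreasingValues() throws Exception {
        // Oracle compatibility mode provides DUAL and sequence.NEXTVAL syntax
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:seqtest;MODE=Oracle")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE SEQUENCE GETUNIQUENUMBER START WITH 1");
            }
            int first = nextVal(conn);
            int second = nextVal(conn);
            assertTrue("sequence values should increase", second > first);
        }
    }

    private int nextVal(Connection conn) throws Exception {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT GETUNIQUENUMBER.NEXTVAL FROM DUAL")) {
            rs.next();
            return rs.getInt(1);
        }
    }
}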

Concurrent inserting to DB

I made a parser based on Jsoup. This parser handles a page with pagination. Each page contains, for example, 100 links to be parsed. I created a main loop that goes over the pagination, and I need to run async tasks to parse each of the 100 items on each page. As I understand it, Jsoup does not support async request handling. After handling each item I need to save it to the DB. I want to avoid errors during inserts into the DB's table (if threads use the same id for different items at the same time, if that's possible). What can you suggest?
Could I use simple Thread instance to parse each item:
public class ItemParser extends Thread {

    private String url;
    private MySpringDataJpaRepository repo;

    public ItemParser(String url, MySpringDataJpaRepository repoReference) {
        this.url = url;
        this.repo = repoReference;
    }

    @Override
    public void run() {
        final MyItem item = jsoupParseItem();
        repo.save(item);
    }
}
And run this like:
public class Parser {

    @Autowired
    private MySpringDataJpaRepository repoReference; // <-- SINGLETON

    public static void main(String[] args) {
        int pages = 10000;
        for (int i = 0; i < pages; i++) {
            Document currentPage = Jsoup.parse();
            List<String> links = currentPage.extractLinks(); // contains 100 links to be parsed on each for-loop iteration
            links.forEach(link -> new ItemParser(link, repoReference).start());
        }
    }
}
I know that this code is not compilable, I just want to show you my idea.
Or maybe it's better to use Spring Batch?
What is best practice to solve this?
What do you think?
If you use row-level locking, you should be fine. It might save you problems to have each insert be its own transaction, but this has implications given the whole notion of a transaction as a unit of work (i.e. if a single insert fails, do you want the whole run to fail and roll back?).
Also, if you use UUIDs or db-generated ids, you won't have any collision issues.
As to how to structure the code, I'd look at using Runnables for each task and a thread pool executor; with too many threads, the system loses efficiency just trying to manage them all. I notice you're using Spring, so take a look at https://docs.spring.io/spring/docs/current/spring-framework-reference/html/scheduling.html
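A minimal sketch of the Runnable-plus-thread-pool idea, reusing MyItem and MySpringDataJpaRepository from the question (parseItem is a hypothetical stand-in for the Jsoup parsing logic):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PageProcessor {

    private final MySpringDataJpaRepository repo;
    // A bounded pool keeps the number of threads and concurrent DB writes under control
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public PageProcessor(MySpringDataJpaRepository repo) {
        this.repo = repo;
    }

    public void processPage(List<String> links) {
        for (String link : links) {
            pool.submit(() -> {
                MyItem item = parseItem(link); // Jsoup.connect(link).get() + extraction would go here
                repo.save(item);               // Spring Data runs each save in its own transaction by default
            });
        }
    }

    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private MyItem parseItem(String url) {
        throw new UnsupportedOperationException("sketch only");
    }
}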

Web Service Return Function Specification Instead of Object?

Apologies if this question is a duplicate (or if it has an obvious answer that I'm missing) -->
Is there a practice or pattern that involves a web service returning a function definition to the client, instead of a value or object?
For an extra rough outlining example:
I'm interested in the results of some statistical model. I have a dataset of 100,000 objects of class ClientSideClass.
The statistical model sits on a server, where it has to have constant access to a large database and be re-calibrated/re-estimated frequently.
The statistical model takes some mathematical form, like RESULT = function(ClientSideClass) = AX + BY + anotherFunction(List(Z))
The service in question takes requests that have a ClientSideClass object, performs the calculation using the most recent statistical model, and then returns a result object of class ModelResultClass.
In pseudo OOP (again, sorry for the gnarly example):
My program as a client:

static void main() {
    /* assume that this assignment is meaningful and that all
       the objects in allTheThings have the same identifying kerjigger */
    SomeIdentifier id = new SomeIdentifier("kerjigger");
    ClientSideClass[100000] allTheThings = GrabThoseThings(id);

    for (ClientSideClass c : allTheThings) {
        ModelResult mr = Service.ServerSideMethod(c);
        // more interesting things
    }
}
With my client-side class:

ClientSideClass {
    SomeIdentifier ID {}
    int A {}
    double[] B {}
    HashTable<String, SomeSimpleClass> SomeHash {}
}
On the server, my main service:

Service {
    HashTable<SomeIdentifier, ModelClass> currentModels {}

    ModelClass GetCurrentModel(SomeIdentifier id) {
        return currentModels.get(id);
    }

    ModelResultClass ServerSideMethod(ClientSideClass clientObject) {
        ModelClass mc = GetCurrentModel(clientObject.ID);
        return mc.Calculate(clientObject);
    }
}
ModelClass {
    FormulaClass ModelFormula {}

    ModelResultClass Calculate(ClientSideClass clientObject) {
        // apply formula to client object in whatever way
        ModelResult mr = ModelFormula.Execute(clientObject);
        return mr;
    }
}

FormulaClass {
    /* no idea what this would look like, just assume
       that it is mutable and can change when the model
       is updated */
    ModelResultClass Execute(clientObject) {
        /* do whatever operations on the client object
           to get the forecast result
           !!! this method is mutable, it could change in
           functional form and/or parameter values */
        return someResult;
    }
}
This form results in a lot of network chatter, and it seems like it could make parallel processing problematic because there's a potential bottleneck in the number of requests the server can process simultaneously and/or how blocking those calls might be.
In a contrasting form, instead of returning a result object, could the service return a function specification? I'm thinking along the lines of a Lisp macro or an F# quotation or something. Those could be sent back to the client as simple text and then processed client-side, right?
So the ModelClass would instead look something like this? -->
ModelClass {
    FormulaClass ModelFormula {}

    String FunctionSpecification {
        /* some algorithm to transform the current model form
           to a recognizable text-formatted form */
        string myFuncForm = FeelTheFunc();
        return myFuncForm;
    }
}
And the ServerSideMethod might look like this -->
String ServerSideMethod(SomeIdentifier id) {
    ModelClass mc = GetCurrentModel(id);
    return mc.FunctionSpecification;
}
As a client, I guess I would call the new service like this -->
static void main() {
    /* assume that this assignment is meaningful and that all
       the objects in allTheThings have the same identifier */
    SomeIdentifier id = new SomeIdentifier("kerjigger");
    ClientSideClass[100000] allTheThings = GrabThoseThings(id);

    string functionSpec = Service.ServerSideMethod(id);

    for (ClientSideClass c : allTheThings) {
        ModelResult mr = SomeExecutionFramework.Execute(functionSpec, c);
    }
}
This seems like an improvement in terms of cutting the network bottleneck, but it should also be readily modified so that it could be sped up by simply throwing threads at it.
Is this approach reasonable? Are there existing resources or frameworks that do this sort of thing or does anyone have experience with it? Specifically, I'm very interested in a use-case where an "interpretable" function can be utilized in a large web service that's written in an OO language (i.e. Java or C#).
I would be interested in specific implementation suggestions (e.g. use Clojure with a Java service or F# with a C#/WCF service) but I'd also be stoked on any general advice or insight.
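For what it's worth, one concrete JVM-side option I've been eyeing for executing a server-supplied function specification is the javax.script API. A minimal sketch, under the assumption that the spec is shipped as JavaScript and a JavaScript engine (e.g. Nashorn) is available:

import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class SpecExecutor {

    public static void main(String[] args) throws Exception {
        // In the real flow this string would come from Service.ServerSideMethod(id)
        String functionSpec = "function model(a, b) { return 2 * a + 3 * b; }";

        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
        engine.eval(functionSpec); // compile the server-supplied definition once

        Invocable invocable = (Invocable) engine;
        // Evaluate the model entirely client-side; this loop parallelizes trivially
        Object result = invocable.invokeFunction("model", 10, 4);
        System.out.println(result); // 32
    }
}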

Java pattern for parameters of which only one needs to be non-null?

Lately I often write long functions that have several parameters but use only one of them, and whose functionality differs only at a few key points scattered around the function. Splitting the function would thus create too many small functions without a purpose. Is this good style, or is there a good general refactoring pattern for this? To be more clear, an example:
public void performSearch(DataBase dataBase, List<List<String>> segments) { performSearch(dataBase, null, null, segments); }
public void performSearch(DataBaseCache dataBaseCache, List<List<String>> segments) { performSearch(null, dataBaseCache, null, segments); }
public void performSearch(DataBase dataBase, List<String> keywords) { performSearch(dataBase, null, keywords, null); }
public void performSearch(DataBaseCache dataBaseCache, List<String> keywords) { performSearch(null, dataBaseCache, keywords, null); }

/** either dataBase or dataBaseCache may be null; dataBaseCache is used if it is non-null, else dataBase is used (slower). */
private void performSearch(DataBase dataBase, DataBaseCache dataBaseCache, List<String> keywords, List<List<String>> segments)
{
    SearchObject search = new SearchObject();
    search.setFast(true);
    ...
    search.setNumberOfResults(25);
    if (dataBaseCache != null) { search.setSource(dataBaseCache); }
    else { search.setSource(dataBase); }
    ... do some stuff ...
    if (segments == null)
    {
        // create segments from keywords
        ....
        segments = ...
    }
}
This style of code works, but I don't like all those null parameters and the possibilities of calling these methods wrongly (what happens if both parameters are null, or if both are non-null?), but I don't want to write 4 separate functions either... I know this may be too general, but maybe someone has a general solution to this class of problem :-)
P.S.: I don't like to split up a long function if there is no reason for it other than it being long (i.e. if the subfunctions are only ever called in that order and only by this one function), especially if they are tightly interwoven and would need a large number of parameters passed around between them.
I think it is very bad procedural style. Try to avoid such coding. Since you already have a bulk of such code, it may be very hard to refactor, because each method contains its own logic that is slightly different from the others. BTW, the fact that it is hard to refactor is evidence that the style is bad.
I think you should use behavioral patterns like
Chain of Responsibility
Command
Strategy
Template Method
which can help you change your procedural code into object-oriented code.
Could you use something like this?

public static <T> T firstNonNull(T... parameters) {
    for (T parameter : parameters) {
        if (parameter != null) {
            return parameter;
        }
    }
    throw new IllegalArgumentException("At least one argument must be non null");
}

It does not check whether more than one parameter is non-null, and the parameters must all be of the same type, but you could use it like this:
search.setSource(firstNonNull(dataBaseCache, dataBase));
Expecting nulls is an anti-pattern because it litters your code with NullPointerExceptions waiting to happen. Use the builder pattern to construct the SearchObject. This is the signature you want; I'll let you figure out the implementation:

class SearchBuilder {
    SearchObject search = new SearchObject();
    List<String> keywords = new ArrayList<String>();
    List<List<String>> segments = new ArrayList<List<String>>();

    public SearchBuilder(DataBase dataBase) {}
    public SearchBuilder(DataBaseCache dataBaseCache) {}
    public void addKeyword(String keyword) {}
    public void addSegment(String... segment) {}
    public void performSearch();
}
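Usage might then look like this (a sketch; the method bodies above are deliberately left unimplemented):

SearchBuilder builder = new SearchBuilder(dataBaseCache); // or new SearchBuilder(dataBase)
builder.addKeyword("foo");
builder.addSegment("bar", "baz");
builder.performSearch();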
I agree with what Alex said. Without knowing the problem, I would recommend the following structure based on what was in the example:

public interface SearchEngine {
    public SearchEngineResult findByKeywords(List<String> keywords);
}

public class JDBCSearchEngine implements SearchEngine {

    private DataSource dataSource;

    public JDBCSearchEngine(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public SearchEngineResult findByKeywords(List<String> keywords) {
        // Find from JDBC datasource
        // It might be useful to use a DAO instead of datasource, if you have database operations other than searching
    }
}

public class CachingSearchEngine implements SearchEngine {

    private SearchEngine searchEngine;

    public CachingSearchEngine(SearchEngine searchEngine) {
        this.searchEngine = searchEngine;
    }

    public SearchEngineResult findByKeywords(List<String> keywords) {
        // First check from cache
        ...
        // If not found, then fetch from real search engine
        SearchEngineResult result = searchEngine.findByKeywords(keywords);
        // Then add to cache
        // Return the result
        return result;
    }
}
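Wiring them together is then plain decorator composition, e.g. (a sketch, assuming a configured DataSource):

SearchEngine engine = new CachingSearchEngine(new JDBCSearchEngine(dataSource));
SearchEngineResult results = engine.findByKeywords(keywords);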
