How to use WordNet Synonyms with Hibernate Search?

I've been trying to figure out how to use WordNet synonyms with a search function I'm developing which uses Hibernate Search 5.6.1. At first, I thought about using Hibernate Search annotations:
@TokenFilterDef(factory = SynonymFilterFactory.class, params = {
    @Parameter(name = "ignoreCase", value = "true"),
    @Parameter(name = "expand", value = "true"),
    @Parameter(name = "synonyms", value = "synonymsfile") })
However, this requires an actual file populated with synonyms. From WordNet I was only able to get ".pl" files. So I tried manually making a SynonymAnalyzer class which would read from the ".pl" file:
public class SynonymAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final Tokenizer source = new StandardTokenizer();
        TokenStream result = new StandardFilter(source);
        result = new LowerCaseFilter(result);
        SynonymMap wordnetSynonyms = null;
        try {
            wordnetSynonyms = loadSynonyms();
        } catch (IOException e) {
            e.printStackTrace();
        }
        result = new SynonymFilter(result, wordnetSynonyms, false);
        result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(source, result);
    }

    private SynonymMap loadSynonyms() throws IOException {
        File file = new File("synonyms\\wn_s.pl");
        InputStream stream = new FileInputStream(file);
        Reader reader = new InputStreamReader(stream);
        SynonymMap.Builder parser = null;
        parser = new WordnetSynonymParser(true, true, new StandardAnalyzer(CharArraySet.EMPTY_SET));
        try {
            ((WordnetSynonymParser) parser).parse(reader);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return parser.build();
    }
}
The problem with this approach is that I'm getting a java.lang.OutOfMemoryError, which I'm assuming is because there are too many synonyms or something? What is the proper way to do this? Everywhere I've looked online suggests using WordNet, but I can't seem to find an example that uses Hibernate Search annotations. Any help is appreciated, thanks!

The wordnet format is actually supported by SynonymFilterFactory. You're simply missing the "format" parameter in your annotation configuration; by default, the factory uses the Solr format.
Change your annotation to this:
@TokenFilterDef(
        factory = SynonymFilterFactory.class,
        params = {
                @Parameter(name = "ignoreCase", value = "true"),
                @Parameter(name = "expand", value = "true"),
                @Parameter(name = "synonyms", value = "synonymsfile"),
                @Parameter(name = "format", value = "wordnet") // Add this
        }
)
Also, make sure that the value of the "synonyms" parameter is the path of a file in your classpath (e.g. "com/acme/synonyms.pl", or just "synonyms.pl" if the file is at the root of your "resources" directory).
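For reference, the wn_s.pl file from the WordNet Prolog distribution (the format that WordnetSynonymParser and the "wordnet" setting expect) contains entries roughly like the following, where lines sharing the first number belong to the same synset; these sample values are illustrative only:
s(100000001,1,'woods',n,1,0).
s(100000001,2,'wood',n,1,0).
s(100000001,3,'forest',n,1,0).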
In general, when you have an issue with the parameters of a Lucene filter/tokenizer factory, your best bet is to have a look at the source code of that factory, or at this page.
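For completeness, here is a minimal sketch of how the whole analyzer definition might look on an indexed entity with Hibernate Search 5.x annotations (the entity, field, and synonym file names are made up for illustration; the synonym file is assumed to sit at the root of src/main/resources):
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.apache.lucene.analysis.synonym.SynonymFilterFactory;
import org.hibernate.search.annotations.*;

import javax.persistence.Entity;
import javax.persistence.Id;

@Entity
@Indexed
@AnalyzerDef(name = "synonymAnalyzer",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
                @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                @TokenFilterDef(factory = SynonymFilterFactory.class, params = {
                        @Parameter(name = "ignoreCase", value = "true"),
                        @Parameter(name = "expand", value = "true"),
                        @Parameter(name = "synonyms", value = "wn_s.pl"),  // classpath resource
                        @Parameter(name = "format", value = "wordnet")
                })
        })
public class Book {

    @Id
    private Long id;

    // the synonym-aware analyzer is applied to this field at indexing time
    @Field(analyzer = @Analyzer(definition = "synonymAnalyzer"))
    private String title;
}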

Related

Spring Boot exporting huge database to csv via REST endpoint

I need to build a Spring Boot application which exposes a REST endpoint to export a huge database table as a CSV file with different filter parameters. I am trying to find an efficient solution to this problem.
Currently, I am using spring-data-jpa to query the database table, which returns a list of POJOs. I then write this list to the HttpServletResponse as a CSV file using Apache Commons CSV. There are a couple of issues with this approach: first, it loads all the data into memory; second, it is slow.
Since I am not doing any business logic with the data, is it necessary to use JPA and entities (POJOs) in this case? I feel this is the area causing the problem.
You can try the new Spring WebFlux introduced with Spring 5:
https://www.baeldung.com/spring-webflux
First, in the controller, build a Flux of DataBuffer and write it to the response:
@GetMapping(path = "/report/detailReportFile/{uid}", produces = "text/csv")
public Mono<Void> getWorkDoneReportDetailSofkianoFile(@PathVariable(name = "uid") String uid,
        @RequestParam(name = "startDate", required = false, defaultValue = "0") long start,
        @RequestParam(name = "endDate", required = false, defaultValue = "0") long end,
        ServerHttpResponse response) {

    var startDate = start == 0 ? GenericData.GENERIC_DATE : new Date(start);
    var endDate = end == 0 ? new Date() : new Date(end);

    response.getHeaders().set(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=" + uid + ".csv");
    response.getHeaders().add("Accept-Ranges", "bytes");

    Flux<DataBuffer> df = queryWorkDoneUseCase.findWorkDoneByIdSofkianoAndDateBetween(uid, startDate, endDate)
            .collectList()
            .flatMapMany(workDoneList -> WriteCsvToResponse.writeWorkDone(workDoneList));

    return response.writeWith(df);
}
Next, the DataBuffer must be created; in my case I create it using OpenCSV, writing the beans to a StringWriter:
public static Flux<DataBuffer> writeWorkDone(List<WorkDone> workDoneList) {
    try {
        StringWriter writer = new StringWriter();

        ColumnPositionMappingStrategy<WorkDone> mapStrategy = new ColumnPositionMappingStrategy<>();
        mapStrategy.setType(WorkDone.class);

        String[] columns = new String[]{"idSofkiano", "nameSofkiano", "idProject", "nameProject",
                "description", "hours", "minutes", "type"};
        mapStrategy.setColumnMapping(columns);

        StatefulBeanToCsv<WorkDone> btcsv = new StatefulBeanToCsvBuilder<WorkDone>(writer)
                .withQuotechar(CSVWriter.NO_QUOTE_CHARACTER)
                .withMappingStrategy(mapStrategy)
                .withSeparator(',')
                .build();

        btcsv.write(workDoneList);

        return Flux.just(stringBuffer(writer.getBuffer().toString()));
    } catch (CsvException ex) {
        return Flux.error(ex.getCause());
    }
}

private static DataBuffer stringBuffer(String value) {
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    NettyDataBufferFactory nettyDataBufferFactory = new NettyDataBufferFactory(ByteBufAllocator.DEFAULT);
    DataBuffer buffer = nettyDataBufferFactory.allocateBuffer(bytes.length);
    buffer.write(bytes);
    return buffer;
}

How to get server base path to find and read a json

I have this REST endpoint and I'm trying to mock some responses. I'm working on a WebSphere server with Spring Boot:
@RequestMapping(method = RequestMethod.GET, value = "", produces = {MediaType.APPLICATION_JSON_VALUE, "application/hal+json"})
public Resources<String> getAssemblyLines() throws IOException {
    String fullMockPath = servletContext.getContextPath() + "\\assets\\services-mocks\\assembly-lines\\get-assembly-lines.ok.json";
    List<String> result = new ArrayList<String>();
    result.add(fullMockPath);
    try {
        byte[] rawJson = Files.readAllBytes(Paths.get(fullMockPath));
        Map<String, String> mappedJson = new HashMap<String, String>();
        String jsonMock = new ObjectMapper().writeValueAsString(mappedJson);
        result = new ArrayList<String>();
        result.add(jsonMock);
    } catch (IOException e) {
        result.add("Not found");
        result.add(e.getMessage());
    }
    return new Resources<String>(result,
            linkTo(methodOn(this.getClass()).getAssemblyLines()).withSelfRel());
}
I get a
FileNotFoundException
I tried Tushinov's solution:
System.getProperty("user.dir");
But that returns the path of my server, not of my document root (and yes, they're in different folders).
How can I find my base path?
To your question "How can I find my base path?": you can use
System.getProperty("user.dir")
System.getProperty("user.dir") will return the path to your project.
Example output:
C:\folder_with_java_projects\CURRENT_PROJECT
So if the file is inside your project folder you can just do the following:
System.getProperty("user.dir") + "\\somePackage\\someJson.json";

How to read a file using Groovy and store its contents as variables?

I'm looking for a Groovy-specific way to read a file and store its contents as different variables. Here's an example of my properties file:
#Local credentials:
postgresql.url = xxxx.xxxx.xxxx
postgresql.username = xxxxxxx
postgresql.password = xxxxxxx
console.url = xxxxx.xxxx.xxx
At the moment I'm using this Java code to read the file and use the variables:
Properties prop = new Properties();
InputStream input = null;
try {
    input = new FileInputStream("config.properties");
    prop.load(input);
    this.postgresqlUser = prop.getProperty("postgresql.username");
    this.postgresqlPass = prop.getProperty("postgresql.password");
    this.postgresqlUrl = prop.getProperty("postgresql.url");
    this.consoleUrl = prop.getProperty("console.url");
} catch (IOException ex) {
    ex.printStackTrace();
} finally {
    if (input != null) {
        try {
            input.close();
        } catch (IOException e) {
        }
    }
}
My colleague recommended a more Groovy way of dealing with this and mentioned streams, but I can't seem to find much information on how to store the data in separate variables. What I know so far is that def text = new FileInputStream("config.properties").getText("UTF-8") can read the whole file and store it in one variable, but not in separate ones. Any help would be appreciated.
If you're willing to make your property file keys and class properties abide by a naming convention, then you can apply the property file values quite easily. Here's an example:
def config = '''
#Local credentials:
postgresql.url = xxxx.xxxx.xxxx
postgresql.username = xxxxxxx
postgresql.password = xxxxxxx
console.url = xxxxx.xxxx.xxx
'''

def props = new Properties().with {
    load(new StringBufferInputStream(config))
    delegate
}

class Foo {
    def postgresqlUsername
    def postgresqlPassword
    def postgresqlUrl
    def consoleUrl

    Foo(Properties props) {
        props.each { key, value ->
            def propertyName = key.replaceAll(/\../) { it[1].toUpperCase() }
            setProperty(propertyName, value)
        }
    }
}

def a = new Foo(props)

assert a.postgresqlUsername == 'xxxxxxx'
assert a.postgresqlPassword == 'xxxxxxx'
assert a.postgresqlUrl == 'xxxx.xxxx.xxxx'
assert a.consoleUrl == 'xxxxx.xxxx.xxx'
In this example, the property keys are converted by dropping the '.' and capitalizing the following letter, so postgresql.url becomes postgresqlUrl. Then it's just a matter of iterating through the keys and calling setProperty() to apply each value.
Take a look at the ConfigSlurper:
http://mrhaki.blogspot.de/2009/10/groovy-goodness-using-configslurper.html

How to read Nutch content from Java/Scala?

I'm using Nutch to crawl some websites (as a process that runs separate of everything else), while I want to use a Java (Scala) program to analyse the HTML data of websites using Jsoup.
I got Nutch to work by following the tutorial (without the script, only executing the individual instructions worked), and I think it's saving the websites' HTML in the crawl/segments/<time>/content/part-00000 directory.
The problem is that I cannot figure out how to actually read the website data (URLs and HTML) in a Java/Scala program. I read this document, but find it a bit overwhelming since I've never used Hadoop.
I tried to adapt the example code to my environment, and this is what I arrived at (mostly by guesswork):
val reader = new MapFile.Reader(FileSystem.getLocal(new Configuration()), ".../apache-nutch-1.8/crawl/segments/20140711115438/content/part-00000", new Configuration())
var key = null
var value = null
reader.next(key, value) // test for a single value
println(key)
println(value)
However, I am getting this exception when I run it:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1873)
at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
I am not sure how to work with a MapFile.Reader, specifically, what constructor parameters I am supposed to pass to it. What Configuration objects am I supposed to pass in? Is that the correct FileSystem? And is that the data file I'm interested in?
The NullPointerException comes from passing null as the key and value: reader.next(key, value) expects pre-allocated Writable instances that it can fill in. Here is a working approach in Scala:
val conf = NutchConfiguration.create()
val fs = FileSystem.get(conf)
val file = new Path(".../part-00000/data")
val reader = new SequenceFile.Reader(fs, file, conf)

val webdata = Stream.continually {
  val key = new Text()
  val content = new Content()
  reader.next(key, content)
  (key, content)
}

println(webdata.head)
Java:
public class ContentReader {
    public static void main(String[] args) throws IOException {
        Configuration conf = NutchConfiguration.create();
        Options opts = new Options();
        GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
        String[] remainingArgs = parser.getRemainingArgs();
        FileSystem fs = FileSystem.get(conf);
        String segment = remainingArgs[0];
        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        Text key = new Text();
        Content content = new Content();
        // Loop through sequence files
        while (reader.next(key, content)) {
            try {
                System.out.write(content.getContent(), 0,
                        content.getContent().length);
            } catch (Exception e) {
            }
        }
    }
}
Alternatively, you can use org.apache.nutch.segment.SegmentReader (example).

Specify indexName and type of index in elasticsearch in a properties file

I'm using Elasticsearch and Spring in my application. For each index type, I have a document mapping. Using the @Document annotation I have specified the indexName and type of the index, for example: @Document(indexName = "myproject", type = "user"). But for writing unit tests, I want to create indexes with a different indexName. Hence I want the indexName to be read from a properties file. How can I do this in Spring?
You can solve this problem by using SpEL. It lets you set an expression that Spring evaluates when it processes the annotation, so the index name can be resolved from the environment at runtime.
@Document(indexName = "#{@environment.getProperty('index.access-log')}")
public class AccessLog {
    ...
}
Before Spring 5.x:
Note that there is no # before environment in the SpEL expression:
@Document(indexName = "#{environment.getProperty('index.access-log')}")
public class AccessLog {
    ...
}
I also found that Spring supports a much simpler expression, @Document(indexName = "${index.access-log}"), but I had mixed results with it.
After setting up the annotation as above you can use either
application.properties
index.access-log=index_access
or application.yaml
index:
  access-log: index_access
Just use ElasticsearchTemplate from your unit tests to create the index with a different name, and then use the "index" or "bulkIndex" method to index documents into the new index you just created.
esTemplate.createIndex(newIndexName, loadfromFromFile(settingsFileName));
esTemplate.putMapping(newIndexName, "user", loadfromFromFile(userMappingFileName));

List<IndexQuery> indexes = users.parallelStream().map(user -> {
    IndexQuery index = new IndexQuery();
    index.setIndexName(newIndexName);
    index.setType("user");
    index.setObject(user);
    index.setId(String.valueOf(user.getId()));
    return index;
}).collect(Collectors.toList());

esTemplate.bulkIndex(indexes);

// Load file from src/main/resources or src/test/resources
public String loadfromFromFile(String fileName) throws IllegalStateException {
    StringBuilder buffer = new StringBuilder(2048);
    try {
        InputStream is = getClass().getResourceAsStream(fileName);
        LineNumberReader reader = new LineNumberReader(new InputStreamReader(is));
        while (reader.ready()) {
            buffer.append(reader.readLine());
            buffer.append(' ');
        }
    } catch (Exception e) {
        throw new IllegalStateException("couldn't load file " + fileName, e);
    }
    return buffer.toString();
}
This should work, as it's working for me in the same scenario.
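If the tests should also read their index name from a properties file, a minimal sketch could look like the following (the property key, file names, and test wiring are assumptions, not part of the original answer; loadfromFromFile is the helper shown above):
// src/test/resources would contain a properties file with:
// index.name=myproject_test

@RunWith(SpringRunner.class)
@SpringBootTest
@TestPropertySource(properties = "index.name=myproject_test") // hypothetical test index name
public class UserIndexTest {

    @Autowired
    private ElasticsearchTemplate esTemplate;

    @Value("${index.name}")
    private String indexName;

    @Test
    public void createsAndFillsTestIndex() throws Exception {
        esTemplate.createIndex(indexName, loadfromFromFile("/user_settings.json"));
        esTemplate.putMapping(indexName, "user", loadfromFromFile("/user_mapping.json"));
        // ...build IndexQuery objects with setIndexName(indexName) and call esTemplate.bulkIndex(...)
    }
}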
