Advice on reading indexes - java

I'm trying to figure out the right way to read a Lucene index only once while the application is run multiple times. How can I do that in Java?
The indexed data will not change, so reading it on every run should not be necessary. Can someone explain the logic of reading it only once? Thank you.
UPDATE:
public List<String> initTableObject() throws IOException {
    Directory fSDirectory = FSDirectory.open(new File(INDEX_NAME));
    List<String> termList = new ArrayList<String>();
    // Loaded here but not used yet -- this is where I'm stuck with RAMDirectory.
    RAMDirectory directory = new RAMDirectory(fSDirectory);
    IndexReader iReader = IndexReader.open(fSDirectory);
    FilterIndexReader fReader = new FilterIndexReader(iReader);
    // int numOfDocs = fReader.numDocs();
    TermEnum terms = fReader.terms();
    while (terms.next()) {
        Term term = terms.term();
        String termText = term.text();
        termList.add(termText);
    }
    terms.close();
    iReader.close();
    return termList;
}
I'm really new to Lucene, so here is what I've got so far; I'm just not there yet with RAMDirectory.
This method returns a list because I need it to compare against some files that I have. How can I keep this list in RAM so I can use it in other parts of the application for the comparison?

I think the answer on this question might be of use.
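Within a single run of the application, the usual pattern is to load the term list once and keep it in memory for every later comparison, for example behind a lazily initialized static holder. Below is a rough sketch under that assumption; IndexTermCache and YourIndexClass are made-up names, and initTableObject() is the method from the question. Note that a plain in-memory cache only helps within one run of the program; if the application is restarted, the index has to be opened again unless you serialize the list to disk.
import java.io.IOException;
import java.util.Collections;
import java.util.List;

// Hypothetical holder that reads the index only once per application run.
public final class IndexTermCache {

    // volatile so the fully built list is safely published to other threads
    private static volatile List<String> cachedTerms;

    private IndexTermCache() {
    }

    public static List<String> getTerms() throws IOException {
        List<String> terms = cachedTerms;
        if (terms == null) {
            synchronized (IndexTermCache.class) {
                if (cachedTerms == null) {
                    // initTableObject() is the method from the question;
                    // it is invoked only on the very first call.
                    cachedTerms = Collections.unmodifiableList(new YourIndexClass().initTableObject());
                }
                terms = cachedTerms;
            }
        }
        return terms;
    }
}
Any other part of the application can then call IndexTermCache.getTerms() and will receive the same list without touching the index again.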

Related

Lucene 6 - How to influence ranking with numeric value?

I am new to Lucene, so apologies for any unclear wording. I am working on an author search engine. The search query is the author name. The default search results are good - they return the names that match the most. However, we want to rank the results by author popularity as well, a blend of both the default similarity and a numeric value representing the circulations their titles have. The problem with the default results is it returns authors nobody is interested in, and while I can rank by circulation alone, the top result is generally not a great match in terms of name. I have been looking for days for a solution for this.
This is how I am building my index:
IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(INDEX_LOCATION)),
        new IndexWriterConfig(new StandardAnalyzer()));
writer.deleteAll();
for (Contributor contributor : contributors) {
    Document doc = new Document();
    doc.add(new TextField("name", contributor.getName(), Field.Store.YES));
    doc.add(new StoredField("contribId", contributor.getContribId()));
    doc.add(new NumericDocValuesField("sum", sum));
    writer.addDocument(doc);
}
writer.close();
The name is the field we want to search on, and the sum is the field we want to weight our search results with (but still taking into account the best match for the author name). I'm not sure if adding the sum to the document is the correct thing to do in this situation. I know that there will need to be some experimentation to figure out how to best blend the weighting of the two factors, but my problem is I don't know how to do it in the first place.
Any examples I've been able to find are either pre-Lucene 4 or don't seem to work. I thought this was what I was looking for, but it doesn't seem to work. Help appreciated!
As demonstrated in the blog post you linked, you could use a CustomScoreQuery; this would give you a lot of flexibility and influence over the scoring process, but it is also a bit overkill. Another possibility is to use a FunctionScoreQuery; since they behave differently, I will explain both.
Using a FunctionScoreQuery
A FunctionScoreQuery can modify a score based on a field.
Let's say you are usually performing a search like this:
Query q = .... // pass the user input to the QueryParser or similar
TopDocs hits = searcher.search(q, 10); // Get 10 results
Then you can modify the query in between like this:
Query q = .....
// Note that a Float field would work better.
DoubleValuesSource boostByField = DoubleValuesSource.fromLongField("sum");
// Create a query based on the old query and the boost
FunctionScoreQuery modifiedQuery = new FunctionScoreQuery(q, boostByField);
// Search as usual, but with the modified query
TopDocs hits = searcher.search(modifiedQuery, 10);
This will modify the score based on the value of the field. Sadly, however, there is no way to control the influence of the DoubleValuesSource (besides scaling the values during indexing) - at least none that I know of.
To have more control, consider using the CustomScoreQuery.
Using a CustomScoreQuery
Using this kind of query will allow you to modify a score of each result any way you like. In this context we will use it to alter the score based on a field in the index. First, you will have to store your value during indexing:
doc.add(new StoredField("sum", sum));
Then we will have to create our very own query class:
private static class MyScoreQuery extends CustomScoreQuery {

    public MyScoreQuery(Query subQuery) {
        super(subQuery);
    }

    // The CustomScoreProvider is what actually alters the score
    private class MyScoreProvider extends CustomScoreProvider {

        private LeafReader reader;
        private Set<String> fieldsToLoad;

        public MyScoreProvider(LeafReaderContext context) {
            super(context);
            reader = context.reader();
            // We create a HashSet which contains the name of the field
            // which we need. This allows us to retrieve the document
            // with only this field loaded, which is a lot faster.
            fieldsToLoad = new HashSet<>();
            fieldsToLoad.add("sum");
        }

        @Override
        public float customScore(int doc_id, float currentScore, float valSrcScore) throws IOException {
            // Get the result document from the index
            Document doc = reader.document(doc_id, fieldsToLoad);
            // Get the boost value from the index
            IndexableField field = doc.getField("sum");
            Number number = field.numericValue();
            // This is just an example of how to alter the current score
            // based on the value of "sum". You will have to experiment
            // here.
            float influence = 0.01f;
            float boost = number.floatValue() * influence;
            // Return the new score for this result, based on the
            // original Lucene score.
            return currentScore + boost;
        }
    }

    // Make sure that our CustomScoreProvider is being used.
    @Override
    public CustomScoreProvider getCustomScoreProvider(LeafReaderContext context) {
        return new MyScoreProvider(context);
    }
}
Now you can use your new Query class to modify an existing query, similar to the FunctionScoreQuery:
Query q = .....
// Create a query based on the old query and the boost
MyScoreQuery modifiedQuery = new MyScoreQuery(q);
// Search as usual, but with the modified query
TopDocs hits = searcher.search(modifiedQuery, 10);
Final remarks
Using a CustomScoreQuery, you can influence the scoring process in all kinds of ways. Remember however that the method customScore is called for each search result - so don't perform any expensive computations there, as this would severely slow down the search process.
I've created a small gist with a full working example of the CustomScoreQuery here: https://gist.github.com/philippludwig/14e0d9b527a6522511ae79823adef73a

How to implement Word2Vec in Java?

I installed word2vec using this tutorial on my Ubuntu laptop. Is it completely necessary to install DL4J in order to use word2vec vectors in Java? I'm comfortable working in Eclipse, and I'm not sure that I want all the other prerequisites that DL4J wants me to install.
Ideally there would be a really easy way for me to just use the Java code I've already written (in Eclipse) and change a few lines -- so that the word look-ups I am doing would retrieve a word2vec vector instead of using my current retrieval process.
Also, I've looked into using GloVe; however, I do not have MATLAB. Is it possible to use GloVe without MATLAB? (I got an error while installing it because of this.) If so, the same question as above applies: I have no idea how to implement it in Java.
What is preventing you from saving the word2vec (the C program) output in text format, then reading the file with a piece of Java code and loading the vectors into a hashmap keyed by the word string?
Some code snippets:
// Class to store a hashmap of word vectors
public class WordVecs {

    HashMap<String, WordVec> wordvecmap;
    ....

    void loadFromTextFile() {
        String wordvecFile = prop.getProperty("wordvecs.vecfile");
        wordvecmap = new HashMap<>();
        try (FileReader fr = new FileReader(wordvecFile);
             BufferedReader br = new BufferedReader(fr)) {
            String line;
            while ((line = br.readLine()) != null) {
                WordVec wv = new WordVec(line);
                wordvecmap.put(wv.word, wv);
            }
        }
        catch (Exception ex) { ex.printStackTrace(); }
    }
    ....
}
// Class for each word vector
public class WordVec implements Comparable<WordVec> {

    public WordVec(String line) {
        String[] tokens = line.split("\\s+");
        word = tokens[0];
        vec = new float[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++)
            vec[i - 1] = Float.parseFloat(tokens[i]);
        norm = getNorm();
    }
    ....
}
If you want to get the nearest neighbours for a given word, you can keep a list of N nearest pre-computed neighbours associated with each WordVec object.
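If you also need to compute neighbours on the fly rather than pre-computing them, a brute-force cosine-similarity scan over the map is usually good enough for moderate vocabularies. A rough sketch, assuming WordVec exposes the word, vec and norm fields shown above (findNearest is a made-up helper, not part of any library):
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class NearestNeighbours {

    // Cosine similarity between two vectors of equal length.
    static float cosine(WordVec a, WordVec b) {
        float dot = 0f;
        for (int i = 0; i < a.vec.length; i++) {
            dot += a.vec[i] * b.vec[i];
        }
        return dot / (a.norm * b.norm);
    }

    // Returns the n words most similar to the query word (excluding the word itself).
    static List<String> findNearest(Map<String, WordVec> wordvecmap, String word, int n) {
        WordVec query = wordvecmap.get(word);
        if (query == null) {
            return Collections.emptyList();
        }
        return wordvecmap.values().stream()
                .filter(wv -> !wv.word.equals(word))
                .sorted(Comparator.comparingDouble((WordVec wv) -> cosine(query, wv)).reversed())
                .limit(n)
                .map(wv -> wv.word)
                .collect(Collectors.toList());
    }
}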
DL4J author here. Our word2vec implementation is targeted at people who need custom pipelines, so I don't blame you for going the simple route here.
It is meant for when you want to do something with the vectors, not for messing around. The C word2vec format is pretty straightforward.
Here is the parsing logic in Java if you'd like:
https://github.com/deeplearning4j/deeplearning4j/blob/374609b2672e97737b9eb3ba12ee62fab6cfee55/deeplearning4j-scaleout/deeplearning4j-nlp/src/main/java/org/deeplearning4j/models/embeddings/loader/WordVectorSerializer.java#L113
Hope that helps a bit

Check for new files in a loop - java

I have a program that needs to read files. I need to check every 10 seconds whether there are new files.
To do that, I've made this :
ArrayList<File> oldFiles = new ArrayList<File>();
ArrayList<File> files = new ArrayList<File>();
while (!isFinished) {
    files = listFilesForFolder(folder);
    if (oldFiles.size() != files.size()) {
        System.out.println("Here is when a new file(s) is(are) in the folder");
    }
    Thread.sleep(10000);
}
Basically, listFilesForFolder takes a folder destination and checks the files in there.
My problem: on every loop my program runs my reading function on every file. I want to run my reading function ONLY on new files.
How can I do something like :
new files - old files = my files I want to read.
Rather than your approach, why not store the DateTime of the last time that you checked,
then compare this time to the File.lastModified value (a minimal sketch of this is shown below).
The problem with your approach is that the array sizes will also differ if a file is deleted, and will be the same if one file is deleted and another is added.
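A rough sketch of the timestamp idea (the folder path is a placeholder; note that lastModified also changes for files that were merely modified, not only newly created ones):
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class NewFileWatcher {

    public static void main(String[] args) throws InterruptedException {
        File folder = new File("/path/to/folder"); // placeholder folder
        long lastCheck = System.currentTimeMillis();

        while (true) {
            long now = System.currentTimeMillis();
            List<File> newFiles = new ArrayList<>();
            File[] entries = folder.listFiles();
            if (entries != null) {
                for (File f : entries) {
                    // Anything modified (or created) since the last check counts as new.
                    if (f.lastModified() > lastCheck) {
                        newFiles.add(f);
                    }
                }
            }
            lastCheck = now;
            for (File f : newFiles) {
                System.out.println("New or modified file: " + f.getName());
                // run the reading function here
            }
            Thread.sleep(10000);
        }
    }
}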
Rather than comparing old and new files, why not write a method to just return Last Modified Files.
public static ArrayList<File> listLastModifiedFiles(File folder,
        long sleepDuration) throws Exception {
    ArrayList<File> newFileList = new ArrayList<File>();
    for (File fileEntry : folder.listFiles())
        if ((System.currentTimeMillis() - fileEntry.lastModified()) <= sleepDuration)
            newFileList.add(fileEntry);
    return newFileList;
}
// Sample usage:
long sleepDuration = 10000;
ArrayList<File> newFileList;
int counter = 10;
while (counter-- > 0) {
    newFileList = listLastModifiedFiles(folder, sleepDuration);
    for (File file : newFileList)
        System.out.println(file.getName());
    Thread.sleep(sleepDuration);
}
You can use sets: instead of returning an ArrayList, return a Set.
newFiles.removeAll(oldFiles);
would then give you all the files that are not in the old set. I'm not saying that working with the modification date, as Scary Wombat has pointed out, is a worse idea; I'm just offering another solution.
Additionally, you have to modify your oldFiles to hold all files you've already encountered. The following example, I think, does what you're trying to achieve.
private static Set<File> findFilesIn(File directory) {
    // Or whatever logic you have for finding files
    return new HashSet<File>(Arrays.asList(directory.listFiles()));
}

public static void main(String[] args) throws Throwable {
    Set<File> allFiles = new HashSet<File>(); // Renamed from oldFiles
    Set<File> newFiles = new HashSet<File>();
    File dir = new File("/tmp/stackoverflow/");
    while (true) {
        allFiles.addAll(newFiles); // Add files from the last round to the collection of all files
        newFiles = findFilesIn(dir);
        newFiles.removeAll(allFiles); // Remove all the ones we already know
        System.out.println(String.format("Found %d new files: %s", newFiles.size(), newFiles));
        System.out.println("Sleeping...");
        Thread.sleep(5000);
    }
}
Sets are a more appropriate data structure for your case, since you don't need any ordering of your collection of files and you benefit from faster lookups (when using a HashSet).
Assuming that you only need to detect new files, not modified ones, and no file will be removed while your code is running:
ArrayList implements removeAll(Collection c), which does exactly what you want:
Removes from this list all of its elements that are contained in the specified collection.
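A small sketch of that idea applied to the loop from the question (listFilesForFolder, folder and isFinished come from the question's own code):
ArrayList<File> oldFiles = new ArrayList<File>();
while (!isFinished) {
    ArrayList<File> files = listFilesForFolder(folder);
    ArrayList<File> newFiles = new ArrayList<File>(files);
    newFiles.removeAll(oldFiles);   // only the files we have not seen before
    for (File f : newFiles) {
        // run the reading function on new files only
    }
    oldFiles = files;               // remember everything we have seen so far
    Thread.sleep(10000);
}
File equality is based on the pathname, so removeAll removes exactly the entries that were already present in the previous listing.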
You might want to consider using the Java WatchService API, which uses low-level operating system facilities to notify you of changes to the file system. It's more efficient and faster than repeatedly listing the files in the directory.
There is a tutorial at Watching a Directory for Changes and the API is documented here: Interface WatchService
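A minimal sketch of the WatchService approach, assuming you only care about newly created files in a single directory (the folder path is a placeholder):
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class DirectoryWatcher {

    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/path/to/folder"); // placeholder folder
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take(); // blocks until something happens
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                        Path created = dir.resolve((Path) event.context());
                        System.out.println("New file: " + created);
                        // read the new file here
                    }
                }
                if (!key.reset()) {
                    break; // directory is no longer accessible
                }
            }
        }
    }
}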

JHDF5 - How to avoid dataset being overwritten

I am using JHDF5 to log a collection of values to an HDF5 file. I am currently using two ArrayLists to do this, one with the values and one with the names of the values.
ArrayList<String> valueList = new ArrayList<String>();
ArrayList<String> nameList = new ArrayList<String>();
valueList.add("Value1");
valueList.add("Value2");
nameList.add("Name1");
nameList.add("Name2");
IHDF5Writer writer = HDF5Factory.configure("My_Log").keepDataSetsIfTheyExist().writer();
HDF5CompoundType<List<?>> type = writer.compound().getInferredType("", nameList, valueList);
writer.compound().write("log1", type, valueList);
writer.close();
This will log the values in the correct way to the file My_Log and in the dataset "log1". However, this example always overwrites the previous log of the values in the dataset "log1". I want to be able to log to the same dataset everytime, adding the latest log to the next line/index of the dataset. For example, if I were to change the value of "Name2" to "Value3" and log the values, and then change "Name1" to "Value4" and "Name2" to "Value5" and log the values, the dataset should look like this:
I thought the keepDataSetsIfTheyExist() option would prevent the dataset from being overwritten, but apparently it doesn't work that way.
Something similar to what I want can be achieved in some cases with writer.compound().writeArrayBlock(), where you can specify at which index the array block shall be written. However, this solution doesn't seem to be compatible with my current code, where I have to use lists to handle my data.
Is there some option to achieve this that I have overlooked, or can't this be done with JHDF5?
I don't think that will work. It is not quite clear to me, but I believe the getInferredType() you are using is creating a data set with 2 name -> value entries. So it is effectively creating an object inside the hdf5. The best solution I could come up with was to read the previous values add them to the valueList before outputting:
ArrayList<String> valueList = new ArrayList<>();
valueList.add("Value1");
valueList.add("Value2");

try (IHDF5Reader reader = HDF5Factory.configure("My_Log.h5").reader()) {
    String[] previous = reader.string().readArray("log1");
    for (int i = 0; i < previous.length; i++) {
        valueList.add(i, previous[i]);
    }
} catch (HDF5FileNotFoundException ex) {
    // Nothing to do here.
}

MDArray<String> values = new MDArray<>(String.class, new long[]{valueList.size()});
for (int i = 0; i < valueList.size(); i++) {
    values.set(valueList.get(i), i);
}

try (IHDF5Writer writer = HDF5Factory.configure("My_Log.h5").writer()) {
    writer.string().writeMDArray("log1", values);
}
If you call this code a second time with "Value3" and "Value4" instead, you will get 4 values. This sort of solution might become unpleasant if you start to have hierarchies of datasets however.
To solve your issue, you need to define the dataset log1 as extendible so that it can store an unknown number of log entries (that are generated over time) and write these using a point or hyperslab selection (otherwise, the dataset will be overwritten).
If you are not bound to a specific technology to handle HDF5 files, you may wish to take a look at HDFql, which is a high-level language to manage HDF5 files easily. A possible solution for your use case using HDFql (in Java) is:
public class Example
{
    public static class Log
    {
        String name1;
        String name2;
    }

    public static boolean doSomething(Log log)
    {
        log.name1 = "Value1";
        log.name2 = "Value2";
        return true;
    }

    public static void main(String args[])
    {
        // declare variables
        Log log = new Log();
        int variableNumber;

        // create an HDF5 file named 'My_Log.h5' and use (i.e. open) it
        HDFql.execute("CREATE AND USE FILE My_Log.h5");

        // create an extendible HDF5 dataset named 'log1' of data type compound
        HDFql.execute("CREATE DATASET log1 AS COMPOUND(name1 AS VARCHAR, name2 AS VARCHAR)(0 TO UNLIMITED)");

        // register variable 'log' for subsequent usage (by HDFql)
        variableNumber = HDFql.variableRegister(log);

        // call function 'doSomething' that does something and populates variable 'log' with an entry
        while (doSomething(log))
        {
            // alter (i.e. extend) dataset 'log1' to +1 (i.e. add a new row)
            HDFql.execute("ALTER DIMENSION log1 TO +1");

            // insert (i.e. write) data stored in variable 'log' into dataset 'log1' using a point selection
            HDFql.execute("INSERT INTO log1(-1) VALUES FROM MEMORY " + variableNumber);
        }
    }
}

Delete all files in 'folder' or with prefix in Google Cloud Bucket from Java

I know the idea of 'folders' is sort of non-existent or different in Google Cloud Storage, but I need a way to delete all objects in a 'folder' or with a given prefix from Java.
The GcsService has a delete function, but as far as I can tell it only takes one GcsFilename object and does not honor wildcards (i.e., "folderName/**" did not work).
Any tips?
The API only supports deleting a single object at a time. You can only request many deletions using many HTTP requests or by batching many delete requests. There is no API call to delete multiple objects using wildcards or the like. In order to delete all of the objects with a certain prefix, you'd need to list the objects, then make a delete call for each object that matches the pattern.
The command-line utility, gsutil, does exactly that when you ask it to delete the path "gs://bucket/dir/**". It fetches a list of objects matching that pattern, then it makes a delete call for each of them.
If you need a quick solution, you could always have your Java program exec gsutil.
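For example, a quick-and-dirty way to shell out to gsutil from Java (this assumes gsutil is installed and on the PATH; the bucket and prefix are placeholders):
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class GsutilDelete {

    public static void main(String[] args) throws Exception {
        // -m enables parallel operations; rm -r removes everything under the prefix.
        ProcessBuilder pb = new ProcessBuilder("gsutil", "-m", "rm", "-r", "gs://my-bucket/folderName");
        pb.redirectErrorStream(true);
        Process process = pb.start();

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new RuntimeException("gsutil exited with code " + exitCode);
        }
    }
}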
Here is the code that corresponds to the list-and-delete approach from the answer above, in case anyone else wants to use it:
public void deleteFolder(String bucket, String folderName) throws CouldNotDeleteFile {
    try
    {
        ListResult list = gcsService.list(bucket, new ListOptions.Builder().setPrefix(folderName).setRecursive(true).build());
        while (list.hasNext())
        {
            ListItem item = list.next();
            gcsService.delete(new GcsFilename(bucket, item.getName()));
        }
    }
    catch (IOException e)
    {
        // Error handling
    }
}
Extremely late to the party, but here's for current google searches. We can delete multiple blobs efficiently by leveraging com.google.cloud.storage.StorageBatch.
Like so:
public static void rmdir(Storage storage, String bucket, String dir) {
    StorageBatch batch = storage.batch();
    Page<Blob> blobs = storage.list(bucket, Storage.BlobListOption.currentDirectory(),
            Storage.BlobListOption.prefix(dir));
    for (Blob blob : blobs.iterateAll()) {
        batch.delete(blob.getBlobId());
    }
    batch.submit();
}
This should run MUCH faster than deleting one object at a time when your bucket/folder contains a non-trivial number of items.
Edit: since this is getting a little attention, I'll demonstrate error handling:
public static boolean rmdir(Storage storage, String bucket, String dir) {
    List<StorageBatchResult<Boolean>> results = new ArrayList<>();
    StorageBatch batch = storage.batch();
    try {
        Page<Blob> blobs = storage.list(bucket, Storage.BlobListOption.currentDirectory(),
                Storage.BlobListOption.prefix(dir));
        for (Blob blob : blobs.iterateAll()) {
            results.add(batch.delete(blob.getBlobId()));
        }
    } finally {
        batch.submit();
        return results.stream().allMatch(r -> r != null && r.get());
    }
}
This method will delete every blob in the given 'folder' of the given bucket, returning true if everything was deleted and false otherwise. One can look into the return value of batch.delete() for a better understanding and error-proofing.
To ensure ALL items are deleted, you could call this like:
boolean success = false;
while (!success) {
    success = rmdir(storage, bucket, dir);
}
I realise this is an old question, but I just stumbled upon the same issue and found a different way to resolve it.
The Storage class in the Google Cloud Java Client for Storage includes a method to list the blobs in a bucket, which can also accept an option to set a prefix to filter results to blobs whose names begin with the prefix.
For example, deleting all the files with a given prefix from a bucket can be achieved like this:
Storage storage = StorageOptions.getDefaultInstance().getService();
Iterable<Blob> blobs = storage.list("bucket_name", Storage.BlobListOption.prefix("prefix")).iterateAll();
for (Blob blob : blobs) {
    blob.delete(Blob.BlobSourceOption.generationMatch());
}
