I need to get the total size of an index in Apache Solr using Java. The following code gets the total number of documents, but I am looking for the size of the index on disk. I thought I could get the index size through the ReplicationHandler, as suggested here: http://lucene.472066.n3.nabble.com/cheking-the-size-of-the-index-using-solrj-API-s-td692686.html, but I am not getting the index size.
BufferedWriter out1 = null;
FileWriter fstream1 = new FileWriter("src/test/resources/solr-document-id-desc.txt");
out1 = new BufferedWriter(fstream1);

ApplicationContext context = new ClassPathXmlApplicationContext("application-context.xml");
CommonsHttpSolrServer solrServer = (CommonsHttpSolrServer) context.getBean("solrServer");

SolrQuery solrQuery = new SolrQuery().setQuery("*:*");
QueryResponse rsp = solrServer.query(solrQuery);

// I am trying to use the ReplicationHandler, but I am not able to get the index size from its statistics. Is there any way to get the index size?
ReplicationHandler handler2 = new ReplicationHandler();
System.out.println(handler2.getDescription());
NamedList statistics = handler2.getStatistics();
System.out.println("Statistics " + statistics);

System.out.println(rsp.getResults().getNumFound());
Iterator<SolrDocument> iter = rsp.getResults().iterator();
while (iter.hasNext()) {
    SolrDocument resultDoc = iter.next();
    System.out.println(resultDoc.getFieldNames());
    String id = (String) resultDoc.getFieldValue("numFound");
    String description = (String) resultDoc.getFieldValue("description");
    System.out.println(id + "~~" + description);
    out1.write(id + "~~" + description);
    out1.newLine();
}
out1.close();
Any suggestions will be appreciated.
Updated code:
ReplicationHandler handler2 = new ReplicationHandler();
System.out.println(handler2.getDescription());
NamedList statistics = handler2.getStatistics();
System.out.println("Statistics " + statistics.get("indexSize"));
The index size is available from the statistics of the ReplicationHandler. From org.apache.solr.handler.ReplicationHandler:
public NamedList getStatistics() {
    NamedList list = super.getStatistics();
    if (core != null) {
        list.add("indexSize", NumberUtils.readableSize(getIndexSize()));
    }
    return list;
}
You can use the URL http://localhost:8983/solr/replication?command=details, which returns the index size.
<lst name="details">
<str name="indexSize">26.13 KB</str>
.....
</lst>
I am not sure this works by instantiating ReplicationHandler directly, as the handler would need a reference to the core and its index.
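Rather than instantiating the handler yourself, you can ask the running Solr instance for the same details over HTTP from SolrJ. A rough sketch (untested; it assumes the ReplicationHandler is registered at /replication, as in the default solrconfig.xml, and reuses the solrServer from the question):
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("command", "details");

QueryRequest request = new QueryRequest(params);
request.setPath("/replication"); // hit the ReplicationHandler instead of /select

NamedList<Object> response = solrServer.request(request);
NamedList<?> details = (NamedList<?>) response.get("details");
System.out.println("Index size: " + details.get("indexSize"));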
You can also run the following command in the data directory:
du -kx
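If you would rather stay in Java, a crude equivalent is to sum the sizes of the files in the index directory yourself. This is only a sketch and the path is a placeholder; point it at your core's data/index directory:
import java.io.File;

File indexDir = new File("/path/to/solr/data/index"); // placeholder path
long totalBytes = 0L;
File[] indexFiles = indexDir.listFiles();
if (indexFiles != null) {
    for (File f : indexFiles) {
        if (f.isFile()) {
            totalBytes += f.length(); // add up every segment file
        }
    }
}
System.out.println("Index size: " + (totalBytes / 1024) + " KB");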
As said in this post, you can use the MAT tool to see the memory consumption. I think you could use it in your code. Enjoy Solr!
I'm implementing a solution using Watson's Retrieve & Rank service.
When I use the tooling interface, I upload my documents and they appear as a list, where I can click on any of them to open up all the titles that are inside the document (answer units), as you can see in Picture 1 and Picture 2.
When I try to upload documents via Java, it won't recognize the documents as wholes; they get uploaded in parts (answer units as documents), each part as a new document.
How can I upload each of my documents as an entire document and not only as parts of it?
Here's the code for the upload functions in Java:
public Answers ConvertToUnits(File doc, String collection) throws ParseException, SolrServerException, IOException {
    DC.setUsernameAndPassword(USERNAME, PASSWORD);
    Answers response = DC.convertDocumentToAnswer(doc).execute();
    SolrInputDocument newdoc = new SolrInputDocument();
    WatsonProcessing wp = new WatsonProcessing();
    Collection<SolrInputDocument> newdocs = new ArrayList<SolrInputDocument>();
    for (int i = 0; i < response.getAnswerUnits().size(); i++) {
        String titulo = response.getAnswerUnits().get(i).getTitle();
        String id = response.getAnswerUnits().get(i).getId();
        newdoc.addField("title", titulo);
        for (int j = 0; j < response.getAnswerUnits().get(i).getContent().size(); j++) {
            String texto = response.getAnswerUnits().get(i).getContent().get(j).getText();
            newdoc.addField("body", texto);
        }
        wp.IndexDocument(newdoc, collection);
        newdoc.clear();
    }
    wp.ComitChanges(collection);
    return response;
}

public void IndexDocument(SolrInputDocument newdoc, String collection) throws SolrServerException, IOException {
    UpdateRequest update = new UpdateRequest();
    update.add(newdoc);
    UpdateResponse addResponse = solrClient.add(collection, newdoc);
}
You can specify config options in this line:
Answers response = DC.convertDocumentToAnswer(doc).execute();
I think something like this should do the trick:
String configAsString = "{ \"conversion_target\":\"answer_units\", \"answer_units\": { \"selector_tags\": [] } }";
JsonParser jsonParser = new JsonParser();
JsonObject customConfig = jsonParser.parse(configAsString).getAsJsonObject();
Answers response = DC.convertDocumentToAnswer(doc, null, customConfig).execute();
I've not tried it out, so might not have got the syntax exactly right, but hopefully this will get you on the right track.
Essentially, what I'm trying to do here is use the selector_tags option in the config (see https://www.ibm.com/watson/developercloud/doc/document-conversion/customizing.shtml#htmlau for the doc on this) to specify which tags the document should be split on. By specifying an empty list with no tags in it, the document is not split at all and comes out as a single answer unit, as you want.
(Note that you can do this through the tooling interface, too - by unticking the "Split my documents up into individual answers for me" option when you upload the document)
I have a project where I want to load in a given shapefile, and pick out polygons above a certain size before writing the results to a new shapefile. Maybe not the most efficient, but I've got code that successfully does all of that, right up to the point where it is supposed to write the shapefile. I get no errors, but the resulting shapefile has no usable data in it. I've followed as many tutorials as possible, but still I'm coming up blank.
The first bit of code is where I read in a shapefile, pick out the polygons I want, and put them into a feature collection. This part seems to work fine as far as I can tell.
public class ShapefileTest {
    public static void main(String[] args) throws MalformedURLException, IOException, FactoryException, MismatchedDimensionException, TransformException, SchemaException {
        File oldShp = new File("Old.shp");
        File newShp = new File("New.shp");

        //Get data from the original ShapeFile
        Map<String, Object> map = new HashMap<String, Object>();
        map.put("url", oldShp.toURI().toURL());

        //Connect to the dataStore
        DataStore dataStore = DataStoreFinder.getDataStore(map);

        //Get the typeName from the dataStore
        String typeName = dataStore.getTypeNames()[0];

        //Get the FeatureSource from the dataStore
        FeatureSource<SimpleFeatureType, SimpleFeature> source = dataStore.getFeatureSource(typeName);
        SimpleFeatureCollection collection = (SimpleFeatureCollection) source.getFeatures(); //Get all of the features - no filter

        //Start creating the new Shapefile
        final SimpleFeatureType TYPE = createFeatureType(); //Calls a method that builds the feature type - tested and works.
        DefaultFeatureCollection newCollection = new DefaultFeatureCollection(); //To hold my new collection

        try (FeatureIterator<SimpleFeature> features = collection.features()) {
            while (features.hasNext()) {
                SimpleFeature feature = features.next(); //Get next feature
                SimpleFeatureBuilder fb = new SimpleFeatureBuilder(TYPE); //Create a new SimpleFeature based on the original
                Integer level = (Integer) feature.getAttribute(1); //Get the level for this feature
                MultiPolygon multiPoly = (MultiPolygon) feature.getDefaultGeometry(); //Get the geometry collection

                //First count how many new polygons we will have
                int numNewPoly = 0;
                for (int i = 0; i < multiPoly.getNumGeometries(); i++) {
                    double area = getArea(multiPoly.getGeometryN(i));
                    if (area > 20200) {
                        numNewPoly++;
                    }
                }

                //Now build an array of the larger polygons
                Polygon[] polys = new Polygon[numNewPoly]; //Array of new geometries
                int iPoly = 0;
                for (int i = 0; i < multiPoly.getNumGeometries(); i++) {
                    double area = getArea(multiPoly.getGeometryN(i));
                    if (area > 20200) { //Write the new data
                        polys[iPoly] = (Polygon) multiPoly.getGeometryN(i);
                        iPoly++;
                    }
                }

                GeometryFactory gf = new GeometryFactory(); //Create a geometry factory
                MultiPolygon mp = new MultiPolygon(polys, gf); //Create the MultiPolygon
                fb.add(mp); //Add the geometry collection to the feature builder
                fb.add(level);
                fb.add("dBA");
                SimpleFeature newFeature = SimpleFeatureBuilder.build(TYPE, new Object[]{mp, level, "dBA"}, null);
                newCollection.add(newFeature); //Add it to the collection
            }
        } //End of try-with-resources over the FeatureIterator
At this point I have a collection that looks right - it has the correct bounds and everything. The next bit of code is where I put it into a new Shapefile.
        //Time to put together the new Shapefile
        Map<String, Serializable> newMap = new HashMap<String, Serializable>();
        newMap.put("url", newShp.toURI().toURL());
        newMap.put("create spatial index", Boolean.TRUE);

        DataStore newDataStore = DataStoreFinder.getDataStore(newMap);
        newDataStore.createSchema(TYPE);
        String newTypeName = newDataStore.getTypeNames()[0];
        SimpleFeatureStore fs = (SimpleFeatureStore) newDataStore.getFeatureSource(newTypeName);

        Transaction t = new DefaultTransaction("add");
        fs.setTransaction(t);
        fs.addFeatures(newCollection);
        t.commit();
        ReferencedEnvelope env = fs.getBounds();
    }
}
I put in that very last line to check the bounds of the FeatureStore fs, and it comes back null. Sure enough, when I load the newly created shapefile (which DOES get created and is about the right size), nothing shows up.
The solution actually had nothing to do with the code I posted - it had everything to do with my FeatureType definition. I did not include "the_geom" in my polygon feature type, so no geometry was getting written to the file.
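For reference, a createFeatureType() that does include the geometry attribute might look roughly like this (a sketch only; the non-geometry attribute names are made up, and it assumes the com.vividsolutions JTS packages used by older GeoTools versions):
import org.geotools.feature.simple.SimpleFeatureTypeBuilder;
import org.geotools.referencing.crs.DefaultGeographicCRS;
import org.opengis.feature.simple.SimpleFeatureType;
import com.vividsolutions.jts.geom.MultiPolygon;

private static SimpleFeatureType createFeatureType() {
    SimpleFeatureTypeBuilder builder = new SimpleFeatureTypeBuilder();
    builder.setName("Contours");
    builder.setCRS(DefaultGeographicCRS.WGS84); // or the CRS of the source shapefile
    builder.add("the_geom", MultiPolygon.class); // the geometry attribute that was missing
    builder.add("level", Integer.class);
    builder.add("units", String.class);
    return builder.buildFeatureType();
}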
I believe you are missing the step to finalize/close the file. Try adding this after the t.commit line.
fs.close();
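Depending on the GeoTools version there may be no close() on the feature store itself; in that case the usual cleanup is to close the transaction and dispose of the data store. A hedged sketch using the variables from your second code block:
t.close();              // release the transaction
newDataStore.dispose(); // flush and release the shapefile data store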
As an expedient alternative, you might try out the Shapefile dumper utility mentioned in the Shapefile DataStores docs. Using that may simplify your second code block into two or three lines.
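I have not run it against your data, but based on the GeoTools docs the dumper usage would be roughly as follows (the output directory is a placeholder):
import java.io.File;
import org.geotools.data.shapefile.ShapefileDumper;

ShapefileDumper dumper = new ShapefileDumper(new File("/path/to/output/dir")); // placeholder
dumper.dump(newCollection); // writes the .shp/.shx/.dbf files, named after the feature type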
I am sending a POST request using HTMLUnit that sends keywords as parameters. An example of the URL is:
website.com/foo/bar/api?keywords=word1,word2,word3&language=en
The problem is that my application picks these words dynamically, and the number of words can go up to 10 or 20 or even more. How do you append a set of words as values to an HTTP request? My code at the moment is:
requestSettings = new WebRequest(new URL("website.com/foo/bar/api?"),
        HttpMethod.POST);

Iterator<String> itr = list.iterator();
while (itr.hasNext()) {
    requestSettings.getRequestParameters()
            .add(new NameValuePair("keywords[]", itr.next()));
}
requestSettings.getRequestParameters().add(new NameValuePair("language", "en"));

System.out.println(requestSettings.getUrl().toString());
response = webClient.getPage(requestSettings).getWebResponse();
This code does not return a valid response. What am I doing wrong here?
Give this a try:
using (var client = new WebClient())
{
    var dataObject = new {
        KeyWords = "one, two, three"
    };
    var serializer = new JavaScriptSerializer();
    var json = serializer.Serialize(dataObject);
    var response = client.UploadString("yourUrl", json);
}
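If you want to stay with HtmlUnit in Java, one option is to join the keywords into a single comma-separated value, matching the keywords=word1,word2,word3 format of the example URL, and set the parameter list explicitly. A sketch only (untested; it assumes the API accepts one comma-separated keywords parameter rather than repeated keywords[] entries, and reuses list and webClient from the question):
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.util.NameValuePair;

String joined = String.join(",", list); // "word1,word2,word3"

List<NameValuePair> params = new ArrayList<NameValuePair>();
params.add(new NameValuePair("keywords", joined));
params.add(new NameValuePair("language", "en"));

WebRequest requestSettings = new WebRequest(new URL("http://website.com/foo/bar/api"), HttpMethod.POST);
requestSettings.setRequestParameters(params); // replaces the default (possibly unmodifiable) parameter list
WebResponse response = webClient.getPage(requestSettings).getWebResponse();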
This is my first question on Stack Overflow, so wish me luck.
I am doing a classification process over a Lucene index with Java, and I need to update a document field named category. I have been using Lucene 4.2 with the IndexWriter updateDocument() function for that purpose, and it's working very well, except for the deletion part. Even if I use the forceMergeDeletes() function after the update, the index shows me some already deleted documents. For example, if I run the classification over an index with 1000 documents, the final number of documents in the index remains the same and works as expected, but when I increase the index to 10000 documents, the index shows some already deleted documents, but not all. So, how can I actually erase those deleted documents from the index?
Here are some snippets of my code:
public static void main(String[] args) throws IOException, ParseException {
    /////////////////////// Preparing config data ////////////////////////////
    File indexDir = new File("/indexDir");
    Directory fsDir = FSDirectory.open(indexDir);
    IndexWriterConfig iwConf = new IndexWriterConfig(Version.LUCENE_42, new WhitespaceSpanishAnalyzer());
    iwConf.setOpenMode(IndexWriterConfig.OpenMode.APPEND);
    IndexWriter indexWriter = new IndexWriter(fsDir, iwConf);

    IndexReader reader = DirectoryReader.open(fsDir);
    IndexSearcher indexSearcher = new IndexSearcher(reader);

    KNearestNeighborClassifier classifier = new KNearestNeighborClassifier(100);
    AtomicReader ar = new SlowCompositeReaderWrapper((CompositeReader) reader);
    classifier.train(ar, "text", "category", new WhitespaceSpanishAnalyzer());

    System.out.println("***Before***");
    showIndexedDocuments(reader);
    System.out.println("***Before***");

    int maxdoc = reader.maxDoc();
    int j = 0;
    for (int i = 0; i < maxdoc; i++) {
        Document doc = reader.document(i);
        String clusterClasif = doc.get("category");
        String text = doc.get("text");
        String docid = doc.get("doc_id");
        ClassificationResult<BytesRef> result = classifier.assignClass(text);
        String classified = result.getAssignedClass().utf8ToString();
        if (!classified.isEmpty() && clusterClasif.compareTo(classified) != 0) {
            Term term = new Term("doc_id", docid);
            doc.removeField("category");
            doc.add(new StringField("category", classified, Field.Store.YES));
            indexWriter.updateDocument(term, doc);
            j++;
        }
    }
    indexWriter.forceMergeDeletes(true);
    indexWriter.close();

    System.out.println("Classified documents count: " + j);
    System.out.println();

    reader.close();
    reader = DirectoryReader.open(fsDir);
    System.out.println("Deleted docs: " + reader.numDeletedDocs());
    System.out.println("***After***");
    showIndexedDocuments(reader);
}

private static void showIndexedDocuments(IndexReader reader) throws IOException {
    int maxdoc = reader.maxDoc();
    for (int i = 0; i < maxdoc; i++) {
        Document doc = reader.document(i);
        String idDoc = doc.get("doc_id");
        String text = doc.get("text");
        String category = doc.get("category");
        System.out.println("Id Doc: " + idDoc);
        System.out.println("Category: " + category);
        System.out.println("Text: " + text);
        System.out.println();
    }
    System.out.println("Total: " + maxdoc);
}
I have spent many hours looking for a solution to this. Some say that the deleted documents in the index are not important and that they will eventually be erased as we keep adding documents to the index, but I need to control that process in such a way that I can iterate over the index documents at any time and be sure that the documents I retrieve are actually the live ones. Lucene versions prior to 4.0 had a method in the IndexReader class named isDeleted(docId) that tells whether a document has been marked as deleted; that could be half of the solution to my problem, but I have not found a way to do that with version 4.2 of Lucene. If you know how to do that, I would really appreciate it if you shared it.
You can check whether a document is deleted via the MultiFields class, like:
Bits liveDocs = MultiFields.getLiveDocs(reader);
if (!liveDocs.get(docID)) ...
So, working this into your code, perhaps something like:
int maxdoc = reader.maxDoc();
Bits liveDocs = MultiFields.getLiveDocs(reader); // null when the reader has no deletions
for (int i = 0; i < maxdoc; i++) {
    if (liveDocs != null && !liveDocs.get(i)) continue; // skip deleted documents
    Document doc = reader.document(i);
    String idDoc = doc.get("doc_id");
    ....
}
By the way, it sounds like you have previously been working with 3.x and are now on 4.x. The Lucene Migration Guide is very helpful for understanding these sorts of changes between versions and how to resolve them.
I'm just a Lucene starter, and I got stuck on a problem while changing from a RAMDirectory to an FSDirectory.
First, my code:
private static IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,
        new StandardAnalyzer(Version.LUCENE_43));

Directory DIR = FSDirectory.open(new File(INDEXLOC)); // INDEXLOC = "path/to/dir/"
// RAMDirectory DIR = new RAMDirectory();

// Index some made up content
IndexWriter writer = new IndexWriter(DIR, iwc);

// Store both position and offset information
FieldType type = new FieldType();
type.setStored(true);
type.setStoreTermVectors(true);
type.setStoreTermVectorOffsets(true);
type.setStoreTermVectorPositions(true);
type.setIndexed(true);
type.setTokenized(true);

IDocumentParser p = DocumentParserFactory.getParser(f);
ArrayList<ParserDocument> DOCS = p.getParsedDocuments();

for (int i = 0; i < DOCS.size(); i++) {
    Document doc = new Document();
    Field id = new StringField("id", "doc_" + i, Field.Store.YES);
    doc.add(id);
    Field text = new Field("content", DOCS.get(i).getContent(), type);
    doc.add(text);
    writer.addDocument(doc);
}
writer.close();

// Get a searcher
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(DIR));

// Do a search using SpanQuery
SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "zahl"));
TopDocs results = searcher.search(fleeceQ, 10);
for (int i = 0; i < results.scoreDocs.length; i++) {
    ScoreDoc scoreDoc = results.scoreDocs[i];
    System.out.println("Score Doc: " + scoreDoc);
}

IndexReader reader = searcher.getIndexReader();
AtomicReader wrapper = SlowCompositeReaderWrapper.wrap(reader);
Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
Spans spans = fleeceQ.getSpans(wrapper.getContext(), new Bits.MatchAllBits(reader.numDocs()), termContexts);

int window = 2; // get the words within two of the match
while (spans.next() == true) {
    Map<Integer, String> entries = new TreeMap<Integer, String>();
    System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end());
    int start = spans.start() - window;
    int end = spans.end() + window;
    Terms content = reader.getTermVector(spans.doc(), "content");
    TermsEnum termsEnum = content.iterator(null);
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        // could store the BytesRef here, but String is easier for this example
        String s = new String(term.bytes, term.offset, term.length);
        DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null);
        if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            int i = 0;
            int position = -1;
            while (i < positionsEnum.freq() && (position = positionsEnum.nextPosition()) != -1) {
                if (position >= start && position <= end) {
                    entries.put(position, s);
                }
                i++;
            }
        }
    }
    System.out.println("Entries:" + entries);
}
It's just some code I found on a great website and wanted to try. Everything works great using the RAMDirectory, but if I change it to my FSDirectory it gives me a NullPointerException like:
Exception in thread "main" java.lang.NullPointerException at
com.org.test.TextDB.myMethod(TextDB.java:184) at
com.org.test.Main.main(Main.java:31)
The statement Terms content = reader.getTermVector(spans.doc(), "content"); seems to get no result and returns null, hence the exception. But why? With my RAMDirectory everything works fine.
It seems that the IndexWriter or the reader (I really don't know which) didn't write or didn't read the field "content" properly from the index. But I really don't know why it is 'written' in a RAMDirectory and not in an FSDirectory.
Does anybody have an idea about that?
Gave this a quick test run, and I can't reproduce your issue.
I think the most likely issue here is old documents in your index. The way this is written, every time it is run, more documents are added to your index. Old documents from previous runs won't get deleted or overwritten; they'll just stick around. So, if you have run this before on the same directory, perhaps before you added the line type.setStoreTermVectors(true);, some of your results may be these old documents indexed without term vectors, and reader.getTermVector(...) returns null if a document does not store term vectors.
Of course, anything indexed in a RAMDirectory will be dropped as soon as execution finishes, so the issue would not occur in that case.
Simple solution would be to try deleting the index directory and run it again.
If you want to start with a fresh index when you run this, you can set that up through the IndexWriterConfig:
private static IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,
new StandardAnalyzer(Version.LUCENE_43));
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
That's a guess, of course, but seems consistent with the behavior you've described.