Batch indexing to Solr - Java

I have a Java class that sends HTTP POST requests to a Solr instance to index JSON files; it is implemented in a multithreaded manner. However, I have realized that sending so many HTTP requests (close to 20,000) is making the network a bottleneck. I read online that I can do batch indexing, but I can't find any clear examples. Is there any advice on how to batch index in Solr?
Thank you.

For generic JSON, you must have a configuration somewhere in solrconfig.xml that defines how it is treated.
One of the parameters is split. You might be able to use it to combine your JSON documents into one bigger document that Solr would then split and process as separate documents. Note that the specific format may differ a little between Solr versions; get the reference guide PDF for your version if something is not working.
Or, if you can generate it, use the JSON format Solr understands directly, which has full support for multiple documents.
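For the second option, a minimal sketch of what batching could look like from Java (using the Java 11+ HttpClient; the host, port, core name and commitWithin value are assumptions, and each string in jsonDocs is assumed to already be a single Solr-style JSON document):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class SolrBatchIndexer {

    // Hypothetical endpoint - adjust host, port and core name to your setup.
    // commitWithin asks Solr to commit within 60s instead of per request.
    private static final String UPDATE_URL =
            "http://localhost:8983/solr/mycore/update?commitWithin=60000";

    private final HttpClient client = HttpClient.newHttpClient();

    /**
     * Sends many documents in a single HTTP request instead of one request
     * per document. Each element of jsonDocs is assumed to already be one
     * JSON object in Solr's standard document format.
     */
    public void indexBatch(List<String> jsonDocs) throws Exception {
        // Solr's /update handler indexes each element of a JSON array
        // as a separate document.
        String body = "[" + String.join(",", jsonDocs) + "]";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(UPDATE_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Batch indexing failed: " + response.body());
        }
    }
}
```

With something like this, the ~20,000 files can be grouped into batches of a few hundred to a thousand documents per request, which usually removes the per-request network overhead that is the bottleneck here.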

Elastic Stack - REST API logging with full JSON request and response

Background
We have a web server written in Java that communicates with thousands of mobile apps via HTTPS REST APIs.
For investigation purposes we have to log all API calls. Currently this is implemented as an @Aspect, and for each API call we save an api_call_log object into a MySQL table with the following attributes:
tenant_id
username
device_uuid
api_method
api_version
api_start_time
api_processing_duration
request_parameters
full_request (JSON)
full_response (JSON)
response_code
Problem
As you can imagine, after reaching a certain throughput this solution doesn't scale well, and querying this table is very slow even with the right MySQL indices.
Approach
That's why we want to use the Elastic Stack to re-implement this solution, however I am a bit stuck at the moment.
Question
I couldn't find any Logstash plugins yet that would suit my needs - should I output this api_call_log object into a log file instead and use Logstash to parse, filter and transform that file?
Exactly this is what I would do in this case. Write your log to a file using a framework like logback and rotate it. If you want easy parsing, use JSON as the logging format (also available in logback). Then use Filebeat to ingest the log file as it gets written. If you need to transform or parse the messages, do that in Elasticsearch ingest nodes using pipelines.
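As an illustration of the one-JSON-object-per-line idea, here is a rough sketch of what the logging aspect could write, assuming Jackson and SLF4J/logback are on the classpath; the method signature and the subset of fields are illustrative only, and the logback configuration (a rolling file appender for the "api_call_log" logger) is left out:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.LinkedHashMap;
import java.util.Map;

public class ApiCallLogger {

    // Dedicated logger; in logback.xml, point it at a rolling file appender
    // whose output Filebeat tails.
    private static final Logger API_LOG = LoggerFactory.getLogger("api_call_log");
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** Writes one API call as a single JSON line that Filebeat can ship as-is. */
    public void log(String tenantId, String username, String apiMethod,
                    long processingDurationMs, int responseCode, String fullRequestJson) {
        try {
            Map<String, Object> entry = new LinkedHashMap<>();
            entry.put("tenant_id", tenantId);
            entry.put("username", username);
            entry.put("api_method", apiMethod);
            entry.put("api_processing_duration", processingDurationMs);
            entry.put("response_code", responseCode);
            entry.put("full_request", fullRequestJson);

            API_LOG.info(MAPPER.writeValueAsString(entry));
        } catch (Exception e) {
            // Never let logging break the actual API call.
            LoggerFactory.getLogger(ApiCallLogger.class).warn("Could not log API call", e);
        }
    }
}
```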
Consider tagging/enriching the log files read by Filebeat with machine- or environment-specific information so you can filter on them in your visualizations, reports, etc.
The Filebeat-to-Elasticsearch approach is the simplest one. Try this first. If you can't get your parsing done in Elasticsearch pipelines, put Logstash in between.
Using Filebeat you get a lot of things for free, like backpressure handling and daily indices, which come in very handy in the logging scenario we are discussing here.
When you need a visualization or search UI, have a look at Kibana or Grafana.
And if you have more questions, raise a new question here.
Have Fun!
https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-installation.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html

Apache Solr, SolrJ vs Data Import Handler for parsing XML

I'm hoping to use Solr to run searches from info parsed from XML files.
These XML files are not in Solr's document format, as such I have to parse them and get the fields I need that way.
I am familiar with Java programming and was wondering if SolrJ would be an easier method than using the Data Import Handler. I'm considering running through each XML file I have and parsing the fields that I need from each. Is there any downside to one method over the other? I imagine since I have familiarity with Java it may be easier to parse the XML that way?
I will probably need multiple conditions and regular expressions - in any case, a reliable way to get my fields out of relatively unstructured XML.
How would SolrJ work with the interface? That is, if I index using SolrJ, can I do my queries through the interface still?
DIH was designed for prototyping, though some people do use it in production. You can start with it, but be ready to jump to SolrJ or other methods if you hit its limitations. And if you have very complex mappings, you may be better off starting with SolrJ.
You can also apply an XSLT transform to an incoming XML document to map it to the Solr format.
And as said elsewhere, search is a separate issue from indexing.
How you index your content into Solr is orthogonal to how you query it. You can index any way you want, as long as it produces the right docs in the index.
Now, regarding indexing: if DIH gets you what you need without much tweaking, go for it. But if you need to do a lot of tweaking of the data, in the end you might finish faster if you just write some Java with SolrJ. With SolrJ you have all the flexibility; with DIH you are more constrained (think of the 80/20 rule).
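For the SolrJ route, a minimal sketch of parsing an XML file and indexing the extracted fields in one batch might look like this (the core URL, element names and field names are assumptions you would adapt to your schema; older SolrJ versions use HttpSolrServer instead of HttpSolrClient):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class XmlToSolr {

    public static void main(String[] args) throws Exception {
        // Hypothetical core URL and element/field names - adjust to your schema.
        SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build();

        Document xml = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File(args[0]));

        List<SolrInputDocument> batch = new ArrayList<>();
        NodeList records = xml.getElementsByTagName("record");
        for (int i = 0; i < records.getLength(); i++) {
            Element record = (Element) records.item(i);
            // This is where your conditions and regular expressions would go.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", record.getAttribute("id"));
            doc.addField("title",
                    record.getElementsByTagName("title").item(0).getTextContent());
            batch.add(doc);
        }

        // One round trip for the whole file, then commit and clean up.
        solr.add(batch);
        solr.commit();
        solr.close();
    }
}
```

Documents indexed this way look the same to Solr as any others, so you can still run your queries through the admin interface afterwards.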

Proper way to migrate documents in couchbase (API 1.4.x -> 2.0.x)

I would like to migrate documents persisted in Couchbase via API 1.4.10 to the new document types provided by API 2.0.5, like JsonDocument. I found that it is possible to add custom transcoders to a Bucket, so when decoding documents I can check the flags and decide which transcoder to use. But that doesn't seem like a very good solution to me. Are there any other ways to do this properly? Thanks.
Migration can only be done at runtime upon user request, since there are too many documents; we cannot migrate them all at once in the background.
You don't need to use a custom transcoder to read documents created with the 1.x SDK. Instead, use the LegacyDocument type to read (and write) documents in legacy format.
More importantly, you shouldn't continue running with a mix of legacy and new documents in the database for very long. The LegacyDocument type is provided to facilitate the migration from the old format to the new SDK. The best practice in this case is to deploy an intermediate version of your application which attempts to read documents in one format, then falls back on trying to read them in the other - legacy to new or vice versa, depending on which type of document is accessed more frequently at first.

Once you have the intermediate version deployed, you should run a background task that reads and converts all documents from the old format to the new. This is pretty straightforward: you just try to read documents as LegacyDocument and, if it succeeds, you store the document right back as a JsonDocument using the CAS value you got earlier. If you can't read the document as legacy, then it's already in the new format. The task should be throttled enough that it doesn't cause a large increase in database load.

After the task finishes, remove the fallback code from the application and just read and write everything as JsonDocument.
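A rough sketch of that conversion step with the 2.x Java SDK might look like the following; it assumes the legacy content is a JSON string (adapt the decoding if you stored serialized objects or raw bytes), and error handling and throttling are omitted:

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.LegacyDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class DocumentMigrator {

    private final Bucket bucket;

    public DocumentMigrator(Bucket bucket) {
        this.bucket = bucket;
    }

    /**
     * Reads a document written by the 1.4.x SDK and writes it back in the
     * new JsonDocument format, passing the CAS value so a concurrent update
     * is not silently overwritten.
     */
    public void migrate(String id) {
        LegacyDocument legacy = bucket.get(id, LegacyDocument.class);
        if (legacy == null) {
            return; // document does not exist (or was already handled)
        }
        // Assumes the legacy content is a JSON string; adapt the decoding if
        // you stored serialized Java objects or raw byte arrays instead.
        JsonObject content = JsonObject.fromJson(String.valueOf(legacy.content()));
        bucket.replace(JsonDocument.create(id, content, legacy.cas()));
    }
}
```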
You mention having too many documents - how many is that? We've successfully migrated datasets with multiple billions of documents this way. This, admittedly, took several days to run. If you have a database that's larger than that, or has a very low resident ratio, it might not be practical to attempt to convert all documents.

Nutch + Solr on top level page only

I've been trying to use Nutch to crawl over the first page of the domains in my urls file and then use Solr to make keywords in the crawled data searchable. So far I haven't been able to get anything working this way, unless the two pages are linked together.
I realize this is probably an issue of the pages having no incoming links, and therefore the PageRank algorithm discards the page content. I tried adjusting the parameters so that the default score is higher for urls not in the graph, but I'm still getting the same results.
Is there anything people know of that can build an index over pages with no incoming links?
Thanks!
Try a nutch inject command to insert the "no-incoming-link" URLs into the Nutch DB.
I guess that if you don't see anything in your Solr indexes, it is because no data for those URLs is stored in the Nutch DB (since Nutch takes care of syncing its DB with the indexes). The missing data in the DB may be explained by the fact that the URLs are isolated, so you can try the inject command to include those sites.
I would also try to look at the internal DB to verify Nutch's behavior, since Nutch stores data inside its DBs before inserting values into the indexes.
Assigning a higher score has no effect, since Lucene will give you a result as long as the data is in the index.
Solr now reads HTML files using Tika by default, so that's not a problem.
http://wiki.apache.org/solr/TikaEntityProcessor
If all you want is the listed pages, is there a specific reason to use the Nutch crawler? Or could you just feed the URLs to Solr directly and go from there?

Questions about SOLR documents and some more

Website: Classifieds website (users may put ads, search ads etc)
I plan to use SOLR for searching and have it return results as IDs only, then use those IDs to query MySQL, and finally display the results for those IDs.
Currently I have around 30 tables in MySQL, one for each category.
1- Do you think I should do it differently than above?
2- Should I use only one SOLR document, or multiple documents? Also, is a document the same as a SOLR index?
3- Would it be better to only use SOLR and skip MySQL, knowing that I have a lot of columns in each table? Personally I am much better at using MySQL than SOLR.
4- Say the user wants to search for cars in a specific region, how is this type of querying done in SOLR? Is something like q=cars&region=washington possible?
You may think there is a lot of info about SOLR out there, but there isn't, and especially not about using PHP with SOLR and a SOLR PHP client... Maybe I will write something when I have learned all this... Or maybe one of you could write something up!
Thanks again for all help...
First, the definitions: a Solr/Lucene document is roughly the equivalent of a database row. An index is roughly the same as a database table.
I recommend trying to store all the classified-related information in Solr. Querying Solr and then the database is inefficient and very likely unnecessary.
Querying in a specific region would be something like q=cars+region:washington assuming you have a region field in Solr.
The Solr wiki has tons of good information and a pretty good basic tutorial. Of course this can always be improved, so if you find anything that isn't clear please let the Solr team know about it.
I can't comment on the PHP client since I don't use PHP.
Solr is going to return its results in a syntax easily parsable using SimpleXML. You could also use the SolPHP client library: http://wiki.apache.org/solr/SolPHP.
Solr is really quite efficient. I suggest putting as much data into your Solr index as necessary to retrieve everything in one hit from Solr. This could mean much less database traffic for you.
If you've installed the example Solr application (comes with Jetty), then you can develop Solr queries using the admin interface. The URI of the result is pretty much what you'd be constructing in PHP.
The most difficult part when beginning with Solr is getting the solrconfig.xml and the schema.xml files correct. I suggest starting with a very basic config, and restart your web app each time you add a field. Starting off with the whole schema.xml can be confusing.
2- Should I use only one SOLR document, or multiple documents? Also, is a document the same as a SOLR index?
3- Would it be better to only use SOLR and skip MySQL, knowing that I have a lot of columns in each table? Personally I am much better at using MySQL than SOLR.
A document is "an instance" of solr index. Take into account that you can build only one solr index per solr Core. A core acts as an independent solr Server into the same solr insallation.
http://wiki.apache.org/solr/CoreAdmin
Yo can build one index merging some table contents and some other indexes to perform second level searches...
would you give more details about your architecture and data??
As suggested by others, you can store and index your MySQL data in Solr and run queries against the Solr index, making MySQL unnecessary.
You don't need to store and index only the IDs, query Solr for the IDs, and then run a MySQL query to fetch additional data for each ID. You can store the other data corresponding to the IDs in Solr itself.
Regarding the Solr PHP client: you don't need one; it is recommended to use the REST-like Solr web API directly. You can use a PHP function like file_get_contents("http://IP:port/solr/corename/select?q=query&start=0&rows=100&wt=json") or use curl with PHP if you need to. Both ways are almost the same and efficient. This returns the data as JSON because of wt=json. Then use the PHP function json_decode($returned_data) to get that data as an object.
If you need to ask anything just reply.
