I am trying to understand how to perform queries in RediSearch that strictly match "begins with", but I keep getting "contains" behaviour.
For example, if I have fields with values like 'football', 'myfootball', 'greenfootball' and provide a search term like this:
> FT.SEARCH myIdx @myfield:foot*
I want to get just 'football', but I keep getting other values that contain the word instead of beginning with it.
Is there a way to avoid this?
I was trying to use VERBATIM and things like @myfield:^foot*, but nothing worked.
I am using JRedisearch as a client, but eventually I had to go into the DB and run these queries manually in order to figure out what was happening. That said, is this possible to do with that client at the moment?
Thanks
EDIT
A sample of my index setup:
Client client = new Client(INDEX_NAME, url, PORT);
Schema sc = new Schema().addSortableTextField("url", 1.0); // using this field for query
client.dropIndex(true);
client.createIndex(sc, Client.IndexOptions.Default());
return client;
Sample document:
id: // random uuid
urlPath: myfootbal
application: web
market: Europe
After checking the RDB you provided, I see that when searching for foot* you are not getting myfootbal. The replies look like this: /dot-com/plp/football/x/index.html. You are getting those replies because the URL is tokenized, and '/' is one of the tokenization characters. If you do not want those URLs to be tokenized, you need to declare the field as TAG and not as TEXT. That way the entire URL is indexed as is, and it will not appear in the results when you search for foot*.
For more information about TAGS see the FT.CREATE documentation: https://oss.redislabs.com/redisearch/Commands.html
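As a minimal sketch of that change with JRedisearch (assuming a client version that exposes Schema.addTagField, and a RediSearch version that supports prefix matching inside TAG queries; the index and field names are taken from your sample):
import io.redisearch.Query;
import io.redisearch.Schema;
import io.redisearch.SearchResult;
import io.redisearch.client.Client;

Client client = new Client(INDEX_NAME, url, PORT);

// Index the url as a TAG so the whole value is stored as a single, untokenized term
Schema sc = new Schema().addTagField("url");
client.dropIndex(true);
client.createIndex(sc, Client.IndexOptions.Default());

// Prefix search against the TAG field: only urls whose value begins with "foot" match
SearchResult res = client.search(new Query("@url:{foot*}"));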
I have a list of around 1 million webpages, and I want to efficiently extract just the text from those pages. Currently I am using the BeautifulSoup library in Python to get text from the HTML, and the requests library to fetch the HTML of each webpage. This approach extracts some extra information in addition to the text, for example any JavaScript listed in the body.
Could you please suggest a suitable and efficient way to do this task? I looked at Scrapy, but it looks like it crawls a specific website. Can we pass it a list of specific webpages to get information from?
Thank you in advance.
Yes, you can use Scrapy to crawl a set of URLs in a generic fashion.
You simply need to set them on the start_urls list attribute of your spider, or reimplement the start_requests spider method to yield requests from any data source, and then implement your parse callback to perform the generic content extraction you want.
You can use html-text to extract text from them, and regular Scrapy selectors to extract additional data like the one you mention.
In Scrapy you can set up your own parser, e.g. BeautifulSoup, and call it from your parse method.
To extract text from generic pages I traverse only the body, excluding comments and the content of some tags like script and style:
import re
import bs4

# Strings that are not visible text
EXCLUDED_STRING_TYPES = (bs4.Comment, bs4.CData, bs4.ProcessingInstruction, bs4.Declaration)
# Tags whose text content should be skipped
EXCLUDED_TAGS = ['script', 'noscript', 'style', 'pre', 'code']
# Any run of unicode whitespace, collapsed to a single space below
UNICODE_WHITESPACES = re.compile(u'[\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004'
                                 u'\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]+')

soup = bs4.BeautifulSoup(html, 'html.parser')  # html is the page source you fetched
snippets = []
for snippet in soup.find('body').descendants:
    if isinstance(snippet, bs4.element.NavigableString) \
            and not isinstance(snippet, EXCLUDED_STRING_TYPES) \
            and snippet.parent.name not in EXCLUDED_TAGS:
        snippet = re.sub(UNICODE_WHITESPACES, ' ', snippet)
        snippet = snippet.strip()
        if snippet != '':
            snippets.append(snippet)
This is a link generated from Google Alerts, and I would like to get the URL you are redirected to. I need to retrieve that URL with Java. I have checked the response, but there is no Location header redirect.
https://www.google.com/url?rct=j&sa=t&url=http://naija247news.com/2016/03/nigerian-bond-yields-rise-after-cbns-interest-rate-hike-aimed-at-luring-investors/&ct=ga&cd=CAIyGjA3ZmJiYzk0ZDM0N2U2MjU6Y29tOmVuOlVT&usg=AFQjCNGs7HsYSodEUnECfdAatG6KgY18DA
Maybe something like this:
String URL = "https://www.google.com/url?rct=j&sa=t&url=http://naija247news.com/2016/03/nigerian-bond-yields-rise-after-cbns-interest-rate-hike-aimed-at-luring-investors/&ct=ga&cd=CAIyGjA3ZmJiYzk0ZDM0N2U2MjU6Y29tOmVuOlVT&usg=AFQjCNGs7HsYSodEUnECfdAatG6KgY18DA";
// take everything between "url=" and "&ct"
String subStr = URL.substring(URL.indexOf("url=") + "url=".length(), URL.indexOf("&ct"));
You would have to verify that this creates the substring at the right position. The basic idea is to cut out just the URL you need and nothing more. This is an example for the link you forwarded; with a different URL you may have to search for something else to find the end of the substring (in the provided example I look for &ct, which may not be present in another URL). You will have to look at several of your URLs to know how to cut the URL out.
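A slightly more robust sketch (my own variation, not part of the answer above): split the query string into parameters and decode the url value, so you do not depend on &ct coming right after it:
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class GoogleRedirectUrl {

    // Returns the value of the "url" parameter, or null if it is not present
    static String extractTarget(String googleUrl) throws UnsupportedEncodingException {
        String query = googleUrl.substring(googleUrl.indexOf('?') + 1);
        for (String param : query.split("&")) {
            if (param.startsWith("url=")) {
                return URLDecoder.decode(param.substring("url=".length()), "UTF-8");
            }
        }
        return null;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String link = "https://www.google.com/url?rct=j&sa=t&url=http://naija247news.com/2016/03/"
                + "nigerian-bond-yields-rise-after-cbns-interest-rate-hike-aimed-at-luring-investors/"
                + "&ct=ga&cd=CAIyGjA3ZmJiYzk0ZDM0N2U2MjU6Y29tOmVuOlVT&usg=AFQjCNGs7HsYSodEUnECfdAatG6KgY18DA";
        System.out.println(extractTarget(link));
    }
}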
In my Elasticsearch cluster I want to get the names of all the indices. How can I do that using Java?
I searched the internet but there is not much useful information.
You can definitely do it with the following simple Java code:
ImmutableOpenMap<String, IndexMetaData> indices = client.admin().cluster()
        .prepareState().get().getState()
        .getMetaData().getIndices();
The map you obtain contains the details of all the indices available in your ES cluster, keyed by index name.
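A small sketch of pulling the names out of that map (assuming the transport-client ImmutableOpenMap, which exposes keysIt()):
// indices is the ImmutableOpenMap from the snippet above; its keys are the index names
for (java.util.Iterator<String> it = indices.keysIt(); it.hasNext(); ) {
    System.out.println(it.next());
}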
You can use:
client.admin().indices().prepareGetIndex().setFeatures().get().getIndices();
Use setFeatures() without parameters to get just the index names; otherwise other data, such as the MAPPINGS and SETTINGS of each index, is also returned by default.
Thanks for @Val's answer. Based on that method, the code I use in my project is:
ClusterStateResponse response = transportClient.admin().cluster()
        .prepareState()
        .execute().actionGet();
String[] indices = response.getState().getMetaData().getConcreteAllIndices();
This method puts all the index names into a String array, and it works.
There is another method, I think, but I have not tried it:
ImmutableOpenMap<String, IndexMetaData> mappings = node.client().admin().cluster()
        .prepareState().execute().actionGet().getState().getMetaData().getIndices();
Then we can get the keys of that map to obtain all the index names.
Thanks again!
I am trying to extract data out of a website access log as part of a Java program. Every entry in the log has a URL, and I have successfully extracted the URL out of each record.
Within the URL there is a parameter that I want to capture so that I can use it to query a database. Unfortunately, it doesn't seem that the web developers followed any one standard for the parameter's name.
The parameter is usually called "course_id", but I have also seen "courseId", "course%3DId", "course%253Did", etc. The format for the parameter name and value is usually course_id=_22222_1, where the number I want is between the "_" and "_1". (The value is always the same, even if the parameter name varies.)
So, my idea was to use the regex /^.*course_id[^_]*_(\d*)_1.*$/i to find and extract the number.
In Java, my code is:
java.util.regex.Pattern courseIDPattern = java.util.regex.Pattern.compile(
        ".*course[^i]*id[^_]*_(\\d*)_1.*", java.util.regex.Pattern.CASE_INSENSITIVE);
java.util.regex.Matcher courseIDMatcher = courseIDPattern.matcher(_url);
_courseID = "";
if (courseIDMatcher.matches())
{
    _courseID = retrieveCourseID(courseIDMatcher.group(1));
    return;
}
This works for a lot of the records. However, for some records the course_id is not captured, even though the parameter is in the URL. One such example is the record:
/webapps/contentDetail?course_id=_223629_1&content_id=_3641164_1&rich_content_level=RICH&language=en_US&v=1&ver=4.1.2
However, I used Notepad++ to do a regex replace on this (in fact, every) URL using the regex above, and the URL was successfully replaced by the course ID, which suggests that the regex itself is correct.
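For reference, a minimal standalone test of that regex against the quoted record (the class name here is just for illustration):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CourseIdRegexCheck {
    public static void main(String[] args) {
        String url = "/webapps/contentDetail?course_id=_223629_1&content_id=_3641164_1"
                + "&rich_content_level=RICH&language=en_US&v=1&ver=4.1.2";
        Pattern p = Pattern.compile(".*course[^i]*id[^_]*_(\\d*)_1.*", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(url);
        System.out.println(m.matches() ? m.group(1) : "no match");   // prints 223629 for this string
    }
}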
Am I doing something wrong in the Java code, or is the Java matcher broken?
I haven't found the answer to my problem, so I decided to write this question to get some help.
I use Lucene to index objects in memory (they exist only in my Java code). During processing I index (using WhitespaceAnalyzer) a field with the value objA/4.
My problem starts when I want to find it after indexing (also using WhitespaceAnalyzer).
When I create a query obj*, I find all objects that start with obj; if I create a query objA/4 I can also find this object.
However, I don't know how to find all objects starting with objA/. When I create a query objA/*, Lucene changes it to obja/* and finds nothing.
I've checked, and '/' is not a special character, so I don't need any '\' preceding it.
So my question is: how do I ask for all objects that start with objA/ (for example objA/0, objA/1, objA/2, objA/3)?
Are you using QueryParser.escape(String) to escape everything correctly?
The code I'm using:
String node = "objA/*";
Query node_query = MultiFieldQueryParser.parse(node, "nodeName", new WhitespaceAnalyzer());
BooleanQuery bq = new BooleanQuery();
bq.add(node_query, BooleanClause.Occur.MUST);
System.out.println("We're asking for - " + bq);
IndexSearcher looker = new IndexSearcher(rep_index);
Hits hits = looker.search(bq);