How to prevent solr from decoding a url while indexing? - java

I'm using SolrJ to index documents in Solr, and one of the fields is a URL. While creating the Solr document and subsequently passing it to a SolrServer, I'm not doing any explicit decoding, in order to keep the original format of the URL. But once it's indexed, the URLs are decoded.
Here's a test example which contains apostrophe.
http://test.com/test/Help/What%e2%80%99s_N1
In solr index, it's being decoded to
http://test.com/test/Help/What's_N1
Here's a code sample:
SolrServer solrServer = new StreamingUpdateSolrServer(solrPostUrl, solrQueueSize, solrThreads);
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.addField("url", "http://test.com/test/Help/What%e2%80%99s_N1");
UpdateResponse solrResponse = solrServer.add(solrDoc);
I looked into the SolrInputDocument object; it does have the right format, i.e. the encoded version.
I'd appreciate it if someone could provide pointers on this.
Thanks

I think it's because of your tokenizer. This is what the Solr documentation says about StandardTokenizer:
"A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware of the same token types. There aren't any filters that use StandardTokenizer's types."
You can change all of this behaviour in solr/schema.xml.
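If you just need the URL stored and searched verbatim, the simplest route is a string field, which performs no analysis at all. A sketch of the schema.xml change, assuming your field is named url and your schema already defines the standard string type:

```xml
<!-- solr.StrField performs no tokenization or filtering,
     so the value is indexed and stored byte-for-byte -->
<field name="url" type="string" indexed="true" stored="true"/>
```

If you need partial matching on URLs as well, you could instead keep a tokenized copy in a second field via copyField.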

Related

How to URL-encode the whole xml value of a query param using Spring's rest template?

I am working on a Spring Boot application.
I need to make a request to an external service that is old and ill-conceived. The request takes the form of an HTTP GET (or POST) call, but the payload, an XML document, needs to be passed as a query parameter. For example,
GET http://ill-service.com/plain.cgi?XML_DATA=<request attribute="attributeValue"><content contentAttribute="plain"/></request>
Of course, the value of the query param XML_DATA needs to be URL-encoded, and normally Spring Boot's RestTemplate handles that well, following RFC 3986 (see http://www.ietf.org/rfc/rfc3986.txt).
Except that, as allowed by this RFC, the '/' and '=' characters are left as-is in the param value, giving me the following query:
GET http://ill-service.com/plain.cgi?XML_DATA=%3Crequest%20attribute=%22attributeValue%22%3E%3Ccontent%20contentAttribute=%22plain%22/%3E%3C/request%3E
In a perfect world, this would be fine, but do you remember when I said that the service I am trying to call is ill-conceived? It needs to have the full content of XML_DATA URL-encoded. In other words, it needs the following query:
GET http://ill-service.com/plain.cgi?XML_DATA=%3Crequest%20attribute%3D%22attributeValue%22%3E%3Ccontent%20contentAttribute%3D%22plain%22%2F%3E%3C%2Frequest%3E%0A
I am quite lost on how to instruct the RestTemplate or the UriComponentsBuilder I am using to do so. Any help would be greatly appreciated.
You can probably use Spring's UriUtils class.
Use java.net.URLEncoder to encode your XML payload first and then append the encoded payload.
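For reference, a quick sketch of what java.net.URLEncoder does with markup like this (the tag and attribute names are made up). One caveat: URLEncoder does form encoding, so a space becomes '+' rather than '%20'; whether the ill-conceived service accepts that is worth checking.

```java
import java.net.URLEncoder;

public class EncodeDemo {
    public static void main(String[] args) throws Exception {
        // URLEncoder percent-encodes '=', '/', '"', '<' and '>',
        // which is the aggressive encoding the service expects
        String encoded = URLEncoder.encode("<a b=\"c\"/>", "UTF-8");
        System.out.println(encoded); // %3Ca+b%3D%22c%22%2F%3E
    }
}
```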
Following Vasif's suggestion and some information about UriComponentsBuilder, I found the following solution:
String xmlContent = "<request attribute=\"attributeValue\"><content contentAttribute=\"plain\"/></request>";
URI uri = UriComponentsBuilder.fromHttpUrl("http://ill-service.com/plain.cgi")
        // This sets the query param as a fully encoded value,
        // not merely query-value encoded
        .queryParam("XML_DATA", UriUtils.encode(xmlContent, "UTF-8"))
        // build(true) tells the builder the URI is already encoded
        .build(true).toUri();
String responseStr = restTemplate.getForObject(uri, String.class);

Jackson JSON Handling of Unicode symbols

I'm calling a webservice that returns text including non-ASCII characters such as the ® symbol. For example:
ACME Corp® Services
I use spring to return this textual data directly as a JSON object, and by the time it gets into the browser the json data remains correct:
"service": "ACME Corp® Services"
But upon being rendered via a Handlebars template and written into the page I get:
ACME Corp® Services
Do I need to sanitize the JSON data before using it? If so, what are the best practices for doing that? Otherwise, perhaps there is a change I should make on the back-end but I am not sure what that would be.
You do not need to sanitize the content, but you must make sure it uses a valid encoding allowed by the JSON specification: typically UTF-8 (the alternatives being UTF-16 and UTF-32).
If the content is not encoded as UTF-8 but as something else (like ISO-8859-1, aka "Latin-1"), you will need to construct a Reader to decode it properly:
Reader r = new InputStreamReader(in, StandardCharsets.ISO_8859_1);
MyPOJO pojo = mapper.readValue(r, MyPOJO.class);
The problem you seem to be having is that the encoding used is incorrect.
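The "®"-style garbage is the classic symptom of UTF-8 bytes being decoded as Latin-1 somewhere in the pipeline. A minimal reproduction (the product string is just the example from the question):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "ACME Corp\u00AE Services"; // ® is U+00AE
        // Take the UTF-8 bytes but decode them as ISO-8859-1:
        // the two-byte UTF-8 sequence C2 AE becomes the two characters "®"
        String mangled = new String(original.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(mangled); // ACME Corp® Services
    }
}
```

So the fix is to find the step that reads or writes the text with the wrong charset, not to escape the JSON itself.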

Documenting JSON in URL not possible

In my REST API it should be possible to retrieve data which lies inside a bounding box. Because the bounding box has four coordinates, I want to design the GET requests in such a way that they accept the bounding box as JSON. Therefore I need to be able to send and document JSON strings as URL parameters.
The test itself works, but I cannot document these requests with Spring REST Docs (1.0.0.RC1). I reproduced the problem with a simpler method. See below:
@Test
public void ping_username() throws Exception {
    String query = "name={\"user\":\"Müller\"}";
    String encodedQuery = URLEncoder.encode(query, "UTF-8");
    mockMvc.perform(get(URI.create("/ping?" + encodedQuery)))
           .andExpect(status().isOk())
           .andDo(document("ping_username"));
}
When I remove .andDo(document("ping_username")) the test passes.
Stacktrace:
java.lang.IllegalArgumentException: Illegal character in query at index 32: http://localhost:8080/ping?name={"user":"Müller"}
at java.net.URI.create(URI.java:852)
at org.springframework.restdocs.mockmvc.MockMvcOperationRequestFactory.createOperationRequest(MockMvcOperationRequestFactory.java:79)
at org.springframework.restdocs.mockmvc.RestDocumentationResultHandler.handle(RestDocumentationResultHandler.java:93)
at org.springframework.test.web.servlet.MockMvc$1.andDo(MockMvc.java:158)
at application.rest.RestApiTest.ping_username(RestApiTest.java:65)
After I received the suggestion to encode the URL I tried it, but the problem remains.
The String which is used to create the URI in my test is now /ping?name%3D%7B%22user%22%3A%22M%C3%BCller%22%7D.
I checked the class MockMvcOperationRequestFactory which appears in the stacktrace, and in line 79 the following code is executed:
URI.create(getRequestUri(mockRequest)
+ (StringUtils.hasText(queryString) ? "?" + queryString : ""))
The problem here is that a non-encoded String is used (in my case http://localhost:8080/ping?name={"user":"Müller"}) and the creation of the URI fails.
Remark:
Andy Wilkinson's answer is the solution to the problem. Although I think that David Sinfield is right and JSON should be avoided in the URL to keep it simple. For my bounding box I will use a comma-separated string, as in WMS 1.1: BBOX=x1,y1,x2,y2
You haven't mentioned the version of Spring REST Docs that you're using, but I would guess that the problem is with URIUtil. I can't tell for certain as I can't see where URIUtil is from.
Anyway, using the JDK's URLEncoder works for me with Spring REST Docs 1.0.0.RC1:
String query = "name={\"user\":\"Müller\"}";
String encodedQuery = URLEncoder.encode(query, "UTF-8");
mockMvc.perform(get(URI.create("/baz?" + encodedQuery)))
.andExpect(status().isOk())
.andDo(document("ping_username"));
You can then use URLDecoder.decode on the server side to get the original JSON:
URLDecoder.decode(request.getQueryString(), "UTF-8")
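Sketched end to end, the encode/decode pair is symmetric, so the JSON survives the round trip intact (the value below is just the example from the question):

```java
import java.net.URLDecoder;
import java.net.URLEncoder;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        String query = "{\"user\":\"M\u00FCller\"}";
        // Encode for the request URL...
        String encoded = URLEncoder.encode(query, "UTF-8");
        // ...and decode on the server side, recovering the original JSON
        String decoded = URLDecoder.decode(encoded, "UTF-8");
        System.out.println(decoded.equals(query)); // true
    }
}
```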
The problem is that URIs have to be encoded as ASCII, and ü is not a valid ASCII character, so it must be escaped in the URL with % escaping.
If you are using Tomcat, you can set URIEncoding="UTF-8" on the Connector element of server.xml to make UTF-8 escaping the default. If you do this, ü will automatically be converted to %C3%BC, which is the percent-encoded UTF-8 representation of the Unicode code point U+00FC (ü).
Edit: It seems that I have missed the exact point of the error, but it is still the same error. Curly braces are invalid in a URI. Only the following characters are acceptable according to RFC 3986:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=%
So these must be escaped too.

XSS VULNERABILITY FOR XML -- response.getWriter().write(xml.toString());

I need to fix an issue with an XSS vulnerability. The code segment is below.
StringBuffer xml = new StringBuffer();
xml.append("<?xml version=\"1.0\"?>");
xml.append("<parent>");
xml.append("<child>");
for (int cntr = 0; cntr < dataList.size(); cntr++) {
    AAAAA obj = (AAAAA) dataList.get(cntr);
    if (obj.getStatus().equals(Constants.ACTIVE)) {
        xml.append("<accountNumber>");
        xml.append(obj.getAccountNumber());
        xml.append("</accountNumber>");
        xml.append("<partnerName>");
        xml.append(obj.getPartnerName());
        xml.append("</partnerName>");
        xml.append("<accountType>");
        xml.append(obj.getAccountType());
        xml.append("</accountType>");
        xml.append("<priority>");
        xml.append(obj.getPriority());
        xml.append("</priority>");
    }
}
xml.append("</child>");
xml.append("</parent>");
response.getWriter().write(xml.toString());
response.setContentType("text/xml");
response.setHeader("Cache-Control", "no-cache");
The issue is with the line response.getWriter().write(xml.toString()); — it is flagged as vulnerable to an XSS attack. I have done sufficient homework and also installed ESAPI 2.0, but I do not know how to implement a solution.
Please suggest one.
You should always escape any text and attribute nodes you insert into an XML document, so I would expect to see
xml.append("<accountType>");
xml.append(escape(obj.getAccountType()));
xml.append("</accountType>");
where escape() looks after characters that need special treatment, eg. "<", "&", "]]>", and surrogate pairs.
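The escape() helper is left abstract above; a minimal sketch covering only the three characters XML text nodes always require (a production version should also handle attribute quotes, "]]>", and surrogate pairs as noted):

```java
public class XmlEscape {
    // Hypothetical helper, not part of the original code.
    // Order matters: '&' must be replaced first,
    // or it would re-escape the entities produced for '<' and '>'.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(escape("<script>alert(1) && true</script>"));
        // &lt;script&gt;alert(1) &amp;&amp; true&lt;/script&gt;
    }
}
```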
Better still, don't construct XML by string concatenation. Use a serialization library that allows you to write
out.startElement("accountType");
out.text(obj.getAccountType());
out.endElement();
(I use a Saxon serializer with the StAX XMLStreamWriter interface when I need to do this, but there are plenty of alternatives available.)
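One such alternative ships with the JDK itself: the StAX XMLStreamWriter in javax.xml.stream escapes text content automatically. A sketch using the accountType example, with a deliberately hostile value:

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class StaxDemo {
    public static void main(String[] args) throws Exception {
        StringWriter sw = new StringWriter();
        XMLStreamWriter out = XMLOutputFactory.newInstance().createXMLStreamWriter(sw);
        out.writeStartElement("accountType");
        // The writer escapes markup characters, so attacker-controlled
        // input is emitted as text, never as live markup
        out.writeCharacters("<script>alert(1)</script>");
        out.writeEndElement();
        out.flush();
        System.out.println(sw); // the "<script>" arrives as &lt;script&gt;...
    }
}
```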
As far as I understand, in:
AAAAA obj = (AAAAA) dataList.get(cntr);
you have got some data from an external source.
You then have to validate this data. Otherwise anyone can put any data there, which could cause destruction on the client side (cookies could be stolen, for example).
ANSWER: the code using ESAPI is below.
xml.append(ESAPI.encoder().encodeForXML(desc));
It escapes the data in the variable 'desc'. With this in place, the content of 'desc' is read as data, not executable code, and hence does not get executed in the browser when the back-end Java code responds.

Lucene 3.5 Custom Payloads

Working with a Lucene index, I have a standard document format that looks something like this:
Name: John Doe
Job: Plumber
Hobby: Fishing
My goal is to append a payload to the job field that would hold additional information about plumbing, for instance a Wikipedia link to the plumbing article. I do not want to put payloads anywhere else. Initially, I found an example that covered what I'd like to do, but it used Lucene 2.2 and has not been updated to reflect the changes in the token stream API.
After some more research, I came up with this little monstrosity to build a custom token stream for that field.
public static TokenStream tokenStream(final String fieldName, Reader reader, Analyzer analyzer, final String item) {
    final TokenStream ts = analyzer.tokenStream(fieldName, reader);
    TokenStream res = new TokenStream() {
        CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);

        public boolean incrementToken() throws IOException {
            while (true) {
                boolean hasNext = ts.incrementToken();
                if (hasNext) {
                    termAtt.append("test");
                    payAtt.setPayload(new Payload(item.getBytes()));
                }
                return hasNext;
            }
        }
    };
    return res;
}
When I take the token stream and iterate over all the results, prior to adding it to a field, I see it successfully paired the term and the payload. After calling reset() on the stream, I add it to a document field and index the document. However, when I print out the document and look at the index with Luke, my custom token stream didn't make the cut. The field name appears correctly, but the term value from the token stream does not appear, nor does either indicate the successful attachment of a payload.
This leads me to two questions. First, did I use the token stream correctly, and if so, why doesn't it tokenize when I add it to the field? Secondly, if I didn't use the stream correctly, do I need to write my own analyzer? This example was cobbled together using the Lucene standard analyzer to generate the token stream and write the document. I'd like to avoid writing my own analyzer if possible because I only wish to append the payload to one field!
Edit:
Calling code
TokenStream ts = tokenStream("field", new StringReader("value"), a, docValue);
CharTermAttribute cta = ts.getAttribute(CharTermAttribute.class);
PayloadAttribute payload = ts.getAttribute(PayloadAttribute.class);
while (ts.incrementToken()) {
    System.out.println("Term = " + cta.toString());
    System.out.println("Payload = " + new String(payload.getPayload().getData()));
}
ts.reset();
It's very hard to tell why the payloads are not saved; the reason may lie in the code that uses the method you presented.
The most convenient way to set payloads is in a TokenFilter -- I think that taking this approach will give you much cleaner code and, in turn, make your scenario work correctly. I think it's most illustrative to take a look at a filter of this type in the Lucene source, e.g. TokenOffsetPayloadTokenFilter. You can find an example of how it should be used in the test for this class.
Please also consider if there is no better place to store these hyperlinks than in payloads. Payloads have very special application for e.g. boosting some terms depending on their location or formatting in the original document, part of speech... Their main purpose is to affect how the search is performed, so they are normally numeric values, efficiently packed to cut down the index size.
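For illustration, a sketch of what that TokenFilter might look like against the Lucene 3.5 API (the class name is made up, and this is untested without a Lucene dependency on the classpath):

```java
// Sketch only: wraps an existing TokenStream and attaches the same
// payload (e.g. a Wikipedia link) to every token that passes through.
class LinkPayloadFilter extends TokenFilter {
    private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);
    private final Payload payload;

    LinkPayloadFilter(TokenStream input, String link) {
        super(input);
        this.payload = new Payload(link.getBytes());
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        payAtt.setPayload(payload);
        return true;
    }
}
```

You would then wrap the analyzer's stream for the job field only, leaving all other fields untouched.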
I might be missing something, but...
You don't need a custom tokenizer to associate additional information with a Lucene document. Just store it as an unanalyzed field.
doc.add(new Field("fname", "Joe", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("job", "Plumber", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("link", "http://www.example.com", Field.Store.YES, Field.Index.NO));
You can then get the "link" field just like any other field.
Also, if you did need a custom tokenizer, then you would definitely need a custom analyzer to implement it, for both the index building and searching.
