Ignore punctuation in query in Sqlite


I'm using Sqlite with Android (Java).
I have a database that contains texts with hebrew punctuation.
My problem is that when I run a SELECT for a certain value (without punctuation), I don't get all the results. I guess the DB is not ignoring the punctuation in the punctuated records and is treating it as normal characters.
After searching, I found some answers which say I should register a collation for it (sqlite3_create_collation).
As I've never used collations, I would appreciate a hint on how to register one and use it to get the full result I want.
For example:
SELECT * FROM sometable WHERE punctuated_field LIKE '%re%'
I would like to get both the following:
dream
drém
Currently I'm getting just:
dream
I read this relevant answer but didn't manage to understand how to implement it within my query or the Java code.
I would be happy if someone could write the full query I need to use in my code.
Thanks in advance!

The Android API does not allow registering custom collations.
You have to make do with the built-in collations, or with Android's LOCALIZED and UNICODE collations.

Since the Android sqlite API doesn't expose anything to set up custom collations, you'll have to figure some other way to solve the problem.
One is to add another column where you store the strings normalized, i.e. with accent marks ("punctuation" as you call it) removed. Then do your LIKE matching on this normalized column and use the original column for display purposes. The cost of this is larger data size and some extra code when inserting into the database.
I've described one such normalization approach here:
How to ignore accent in SQLite query (Android) - I have no idea how well that works with Hebrew chars though.
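
A minimal sketch of such a normalization step, assuming that the Hebrew "punctuation" in question consists of Unicode combining marks (niqqud), which is not confirmed by the question. The class and method names are made up for illustration. It decomposes the text with java.text.Normalizer and then drops every character in the Unicode "Mark" category:

```java
import java.text.Normalizer;

public class SearchNormalizer {

    /**
     * Returns the text with combining marks (accents, Hebrew niqqud) removed.
     * Assumption: the marks to ignore are in the Unicode "M" category.
     */
    public static String stripMarks(String text) {
        // NFD splits precomposed letters such as U+00E9 into base letter + mark.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches any combining mark, which is then dropped.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "dr\u00e9m" is "drem" with an acute accent on the e.
        System.out.println(stripMarks("dr\u00e9m"));
    }
}
```

The result would be stored in the extra column at insert time, and the search term would be passed through the same method before it is used in the LIKE pattern, so both sides are compared in the same form.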

Related

Lucene get list of matched keywords

I have a Java (Lucene 4) based application and a set of keywords fed into the application as a search query (a term may consist of more than one word, e.g. “memory”, “old house”, “European Union law”, etc).
I need a way to get the list of matched keywords out of an indexed document and possibly also get keyword positions in the document (also for the multi-word keywords).
I tried with the lucene highlight package but I need to get only the keywords without any surrounding portion of text. It also returns multi-word keywords in separate fragments.
I would greatly appreciate any help.

There's a similar (possibly same) question here:
Get matched terms from Lucene query
Did you see this?
The solution suggested there is to break a complicated query down into simpler queries until you get a TermQuery, and then check each one via searcher.explain(query, docId) (because if it matches, you know that's the term).
I think it's not very efficient, but it worked for me until I ran into SpanQueries. It might be enough for you.

Get diacritic insensitive results from Realm database query

I'm having trouble with a simple query to get strings from the Realm engine in Java for an Android app.
As the title says, I want to get diacritic-insensitive results from my query.
Example:
If the user types the word "securite", I want my query to return "securite" and "sécurité".
How can I do that?
Thanks a lot in advance for your help!

Realm doesn't support that currently. Depending on how much of the data you control, you can add a "normalized" field to use in your search. There is an approach described here: Remove diacritics from string in Java

This is not possible in Realm at the moment. Your only option is to manage tables containing all the possibilities for each letter of the alphabet you are interested in, something like [a, á, à, å, etc], and then for each string compute all the possible permutations and build a huge query with equalTo() and or(). It would probably take longer to build such a query than to execute it, but that's a very interesting use case! If you end up implementing it I would love to know the results!
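
A rough sketch of the permutation step described above, in plain Java with no Realm calls. The class name and the tiny variant table are hypothetical; a real table would cover every letter of interest. Each resulting string would then become one equalTo() clause joined with or():

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VariantExpander {

    // Hypothetical, deliberately small table: base letter -> all accepted forms.
    private static final Map<Character, String> VARIANTS = new HashMap<>();
    static {
        VARIANTS.put('e', "e\u00e9\u00e8\u00ea"); // e, e-acute, e-grave, e-circumflex
        VARIANTS.put('a', "a\u00e1\u00e0\u00e5"); // a, a-acute, a-grave, a-ring
    }

    /** Returns every spelling of the word obtained by swapping in letter variants. */
    public static List<String> expand(String word) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (char c : word.toCharArray()) {
            String options = VARIANTS.containsKey(c) ? VARIANTS.get(c) : String.valueOf(c);
            List<String> next = new ArrayList<>();
            // Extend every prefix built so far with every form of the current letter.
            for (String prefix : results) {
                for (char option : options.toCharArray()) {
                    next.add(prefix + option);
                }
            }
            results = next;
        }
        return results;
    }
}
```

With this table, "securite" has two letters with four forms each, so it expands to 16 strings. The count multiplies with every additional accented letter, which is why the query grows so quickly.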

In Java & MYSQL, Is it a good practice to encode text when we insert text into DB?

Let's take a look at this scenario: you have a textbox that allows the user to paste any kind of text (UTF-8, Chinese or Arabic characters), then a Submit button to insert that text into a MySQL DB.
Normally, I use URLEncoder.encode(text,"UTF-8") and my app runs really stably. I never worried about users inserting special characters, since the text was encoded; when I read the text, I just decoded it and it came out exactly the way it was before.
But some people said that we can set UTF-8 in MySQL and the Tomcat server so we don't need to encode, but this solution requires configuration and I hate configuration as it is not a very sound solution.
Besides, users can enter junk code to hack the DB.
So, In Java & MYSQL, is it good practice to encode text when it is inserted into the DB?
Some people in other forums said it is very bad to store encoded text in the DB, but they don't say why it is bad.
So this question is for people who have a lot of experience in Java and MySQL to answer!

The problem with putting URL or XML encoded text into the database is that it makes life difficult for querying and doing other processing of that text.
The other problem is that there are different types of escaping that are required in different contexts.
... but this solution requires configuration & I hate configuration as it is not a very sound solution.
Ermm, asserting that configuration is "not a very sound solution" is not a rational argument. The vast majority of applications with a database component require some kind of database configuration.
Besides, users can enter junk code to hack the DB.
The real solution to SQL injection is to use PreparedStatement and fixed SQL query, insert, update, etc strings. Use placeholders for all of the query parameters and use the PreparedStatement set parameter methods to supply their values. This will properly quote the text in the parameters to remove the possibility of SQL injection attacks.
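
A minimal sketch of that pattern using plain JDBC. The table name `messages` and its columns are hypothetical, and the Connection is assumed to be opened elsewhere:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class MessageDao {

    // Fixed SQL string: user input never becomes part of it.
    // "messages", "author" and "body" are made-up names for illustration.
    public static final String INSERT_SQL =
            "INSERT INTO messages (author, body) VALUES (?, ?)";

    /** Inserts one row and returns the number of rows affected. */
    public static int insertMessage(Connection conn, String author, String body)
            throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(INSERT_SQL)) {
            // Values are bound to the placeholders, not concatenated into the SQL.
            stmt.setString(1, author);
            stmt.setString(2, body);
            return stmt.executeUpdate();
        }
    }
}
```

The text is passed unencoded; whatever the user typed is stored as-is and comes back as-is.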
The other thing you need to worry about is people using unescaped XML / HTML metacharacters (like <, > and quotes) to effect XSS attacks against other users. The way to defeat that is to escape the text at the point you are creating the HTML. For instance, you can use the JSTL <c:out> tag to escape the text.
Finally, HTML URL encoded text can't be inserted directly into an HTML page. The URL encoding scheme (using %'s and +'s) is not the correct encoding scheme for text in an HTML page. There you need to use &...; character entities to encode things. A %xx in text will appear as exactly that when you display your web page in a browser. Try it and see!
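
To make the difference concrete, here is a small sketch that runs the same string through URL encoding and through HTML escaping. The `escapeHtml` helper is a hand-written illustration covering only five characters, not a library API; in a real JSP page <c:out> or an established escaping library would do this job:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EscapingDemo {

    /** URL-encodes text as UTF-8 (the scheme that uses % and +). */
    public static String urlEncode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Illustrative HTML escaping of the five usual metacharacters. */
    public static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "a<b & c";
        System.out.println(urlEncode(raw));  // prints a%3Cb+%26+c
        System.out.println(escapeHtml(raw)); // prints a&lt;b &amp; c
    }
}
```

Only the second form renders as the original text in a browser; the first would show the % sequences literally.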
Answering the questions in the comments:
iamthepiguy said "encode everything before putting it into Db", but u said "No". Suppose i put Html text into DB, there a lot of special characters & many other stuffs, how can we let Db to handle all of them, for example, if mysql doesn't recognize a char, it will turn to "?" & it means the text got corrupted, it mean the users lost that text. How Mysql handle all kind of special characters?
If you use a PreparedStatement with SQL that has placeholders for all of the text parameters, then the JDBC driver takes care of the escaping automatically.
Also, since there is a very diversity of UTF & special chars, so how many other things we need to worry if we do not encode text to make sure the system run stably?
Same answer.
Encoded text make the system run a bit slower, but we are headache-free.
There are no headaches if you use prepared statements and <c:out> (or the equivalent).
you sid "The way to defeat that is to escape the text at the point you are creating the HTML." so we have to use Java to encode right?
Yes, but you ONLY HTML encode the text when you output it for inclusion in a web page. If you output it as JSON, you encode using JSON escaping ... or more likely, you let the JSON serializer do it for you. If you send the text in other formats, or include it in other things, you encode it as required ... or not at all.
But the point is that you don't store it in the database in encoded form. If you do, then in nearly all cases (including HTML!!) you'd need to decode the HTML URL-encoded text before encoding it in the correct way.

It is somewhat better in terms of stability and configuration, as well as safety from XSS attacks, to encode everything before putting it in the database. The disadvantages are that it takes slightly longer and uses slightly more space in the DB. You could instead escape everything again each time a page is created, but it's easier to escape everything once.

Approach for Automating localized Web application in Selenium using Java Bindings

I am automating test cases for a web application using Selenium 2.0 and Java. My application supports multiple languages. Some of the test cases require me to validate the text that appears in the UI, like success/error messages. I am using a properties file to store whatever text I refer to in my tests from the UI, currently only English. For example, there is locale_english.properties (see below) that contains all references in English. I am going to have multiple properties files like this for different locales, like locale_chinese.properties, locale_french.properties and so on. For locales other than English, the corresponding properties file would have Unicode escapes (e.g. \u30ed) representing the native characters (see below). So if I want to test, say, the Chinese UI, I would load "locale_chinese.properties" instead of "locale_english.properties". I am going to convert the native characters for non-English locales using perhaps native2ascii from the JDK or some other way. I tested that the Selenium API works well with UTF-8 characters for non-English locales.
---locale_english.properties------
user.login.error= Please verify username/password
---locale_chinese.properties------
user.login.error= \u30ed\u30ef\u30eg\u30eh\u30ed
and so on.
The problem is that my locale_english.properties is growing and going out of control. It is becoming hard to manage a single properties file for one locale let alone for multiple locales. Is there a better way of handling localization in Java, particularly in situations like I am in?
Thanks!

You're right that there is a problem managing the files, but you're also right that this is the best approach. Some things are just hard :-(
Selenium (at least the Selenium RC API) does indeed support Unicode input and output; we have lots of tests that enter and confirm Cyrillic and Simplified Chinese characters from C#. Since Java strings are Unicode at the core (just like C#), I expect you could simply create the files in a UTF-8-friendly editor like Notepad++, read them straight into strings and use them directly in the Selenium API.

This is how I solved the issue for those who are interested.

A database would work better for many reasons: growth, a central location, and being kept outside of the app, where it can be edited and maintained independently. We used a table with these columns:
id (int) auto increment
id_text -- this and other columns are varchar ... except for date time for last 2
lang
translation
created_by
updated_by
created_date
updated_date
The id_text is a short English description of the text, like 'hello' or 'error1msg'; it is the key in your map.
In Java we had a function to get the text for a particular id_text, plus an app-level property for the default language (usually en, but it is good to keep it configurable).
The function would scan an already loaded HashMap for the requested language, say "ch".
If no translation was found for that language, we would return the default-language translation, and if that was not found either, we would return "[" + id + "]" so the tester knows something is missing in the database and can go to the web screen to edit the translation table and add it.
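
A minimal sketch of that lookup with its two fallbacks. The class and method names are hypothetical, and loading the map from the database table is left out:

```java
import java.util.HashMap;
import java.util.Map;

public class Translations {

    // language code -> (id_text -> translation), assumed to be loaded from the table
    private final Map<String, Map<String, String>> byLang = new HashMap<>();
    private final String defaultLang;

    public Translations(String defaultLang) {
        this.defaultLang = defaultLang;
    }

    /** Registers one translation row. */
    public void put(String lang, String idText, String translation) {
        Map<String, String> texts = byLang.get(lang);
        if (texts == null) {
            texts = new HashMap<>();
            byLang.put(lang, texts);
        }
        texts.put(idText, translation);
    }

    /** Requested language first, then the default language, then "[id]". */
    public String get(String lang, String idText) {
        String text = lookup(lang, idText);
        if (text == null) {
            text = lookup(defaultLang, idText);
        }
        return text != null ? text : "[" + idText + "]";
    }

    private String lookup(String lang, String idText) {
        Map<String, String> texts = byLang.get(lang);
        return texts == null ? null : texts.get(idText);
    }
}
```

The bracketed fallback shows up directly in the UI under test, which makes a missing row easy to spot.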

Google App Engine and SQL LIKE

Is there any way to query GAE datastore with filter similar to SQL LIKE statement? For example, if a class has a string field, and I want to find all classes that have some specific keyword in that string, how can I do that?
It looks like JDOQL's matches() doesn't work... Am I missing something?
Any comments, links or code fragments are welcome

As the GAE/J docs say, BigTable doesn't have such native support. You can use JDOQL String.matches for "something%" (i.e startsWith). That's all there is. Evaluate it in-memory otherwise.

If you have a lot of items to examine, you want to avoid loading them at all. The best way would probably be to break down the inputs at write time. If you are only searching by whole words, then that is easy.
For example, "Hello world" becomes "Hello", "world" - just add both to a multi-valued property. If you have a lot of text, you want to avoid loading the multi-valued property because you only need it for the index lookup. You can do this by creating a "Relation Index Entity" - see Brett Slatkin's Google I/O talk for details.
You may also want to break down the input into 3 character, 4 character etc strings or stem the words - perhaps with a lucene stemmer.
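
A small sketch of the write-time breakdown in plain Java, with no datastore calls. The class name is hypothetical, and lower-casing the tokens is an added assumption so that lookups become case-insensitive. The resulting sets are what would be stored in the multi-valued property:

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class SearchTokens {

    /** Splits text into distinct lower-cased words, in order of first appearance. */
    public static Set<String> words(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        // Split on anything that is not a Unicode letter or digit.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    /** Returns every substring of length n, e.g. the 3-character pieces of a word. */
    public static Set<String> ngrams(String word, int n) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }
}
```

A query for a keyword then becomes an equality filter on the stored tokens, after the search term has been lower-cased the same way.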
