This time we will talk about how to build a simple query for selecting data in a MySQL database, of which charset was set to utf8_general_ci but we need to distinguish between data with, lets say, accents and data without accents (or others UTF-8 characters, like German umlauts), and going a little further, if possible that not distinguish between words that begin with uppercase or lowercase but have accents. Let’s see how we could implement it.
Defining our problem
Let us explain a little better what is our problem. We have a MySQL database where we have a table, for example, “users”, with several fields: id, name, etc. This database was created with utf8_general_ci charset. But what does this mean? Obviously, “utf8” stands for the charset that our database will manage. In second place we have “general”, which indicates that we are not going to a specific set of UTF-8 characters for any specific language (we could use “utf8_spanish_ci” for the character set of the Spanish language, so we couldn not use Chineese characters in that case), but generically UTF-8. Finally, “ci” means “Case Insensitive”, so that in our searches, the MySQL engine will not distinguish between “UPPERCASE”, “Uppercase” or “uppercase”. What is not so well known is that being Case Insensitive implies no distinction between Spanish accents or German umlauts, so that it does not distinguish between “ESPAÑA” or “España”.
So if there are 3 different records in our users table, in which we have the following values for the field name “José”, “Jose” and “JOSÉ”, and we execute the following query:
SELECT * FROM `users` WHERE `name` = 'jose';
This will return 3 records in our database, without any distinction.
Distinguishing UTF-8, uppercase and lowercase characters in MySQL
To distinguish UTF-8 characters being case sensitive in our queries to a MySQL database with charset utf8_general_ci we need to tell to the MySQL engine that, specifically for this query, it does not have to apply charset utf8_general_ci, but use another, known as utf8_bin, and for this we use the COLLATE keyword. With this charset we are telling MySQL that the comparison in the query must be done at the binary level, so that, now I will be able to distinguish between words with accents and without accents, being case sensitive because (obviously) an accent character is binary different from another character without accent and capitalized. So that our previous query, to distinguish accents, would look something like:
SELECT * FROM `users` WHERE `name` = 'josé' COLLATE utf8_bin;
This query would return us a null record, since there is not exactly a “jose” record in our table. In order to obtain a value from this query we should run the same query but compared to the value “José” or “JOSÉ”, since with COLLATE utf8_bin: “josé” <> “jose” <> “José” <> “JOSÉ” etc.
But, what if we need to distinguish between accents or UTF-8 characters but in case insensitive mode? How can we do that?
Distinguishing UTF-8 in case insensitive mode
In order to make MySQL to distinguish accents or UTF-8 characters but not case sensitive, we do not have to do anything special. Just what we need to do is to convert to all uppercase the characters of the value stored in the name field and also to convert to all uppercase the characters of the parameter to be compared, so that internally at the time of the comparison there are not lowercase characters and, thus, the comparison is safe. For that we will use the MySQL keyword UPPER. So that to make our previous query not case-sensitive but accents, it would be:
SELECT * FROM `users` WHERE UPPER(`name`) = UPPER('josé') COLLATE utf8_bin;
This query will return a result set with the names “José” and “JOSÉ”, since we make no distinction of capital, as we wanted to get.