Create an account

Very important

  • To access the important data of the forums, you must be active in each forum and especially in the leaks and database leaks section, send data and after sending the data and activity, data and important content will be opened and visible for you.
  • You will only see chat messages from people who are at or below your level.
  • More than 500,000 database leaks and millions of account leaks are waiting for you, so access and view with more activity.
  • Many important data are inactive and inaccessible for you, so open them with activity. (This will be done automatically)


Thread Rating:
  • 455 Vote(s) - 3.44 Average
  • 1
  • 2
  • 3
  • 4
  • 5
What's the difference between utf8_general_ci and utf8_unicode_ci?

#1
Between `utf8_general_ci` and `utf8_unicode_ci`, are there any differences in terms of performance?
Reply

#2
See the mysql manual, [Unicode Character Sets][1] section:
> For any Unicode character set,
> operations performed using the
> _general_ci collation are faster than those for the _unicode_ci collation.
> For example, comparisons for the
> utf8_general_ci collation are faster,
> but slightly less correct, than
> comparisons for utf8_unicode_ci. The
> reason for this is that
> utf8_unicode_ci supports mappings such
> as expansions; that is, when one
> character compares as equal to
> combinations of other characters. For
> example, in German and some other
> languages “ß” is equal to “ss”.
> utf8_unicode_ci also supports
> contractions and ignorable characters.
> utf8_general_ci is a legacy collation
> that does not support expansions,
> contractions, or ignorable characters.
> It can make only one-to-one
> comparisons between characters.

So to summarize, utf_general_ci uses a smaller and less correct (according to the standard) set of comparisons than utf_unicode_ci which *should* implement the entire standard. The general_ci set will be faster because there is less computation to do.

[1]:

[To see links please register here]

Reply

#3
**In brief words:**

If you need better sorting order - use `utf8_unicode_ci` (this is the preferred method),

but if you utterly interested in performance - use `utf8_general_ci`, but know that it is a little outdated.

The differences in terms of performance are very slight.
Reply

#4
Some details (PL)
-

As we can read [here][2] (**Peter Gulutzan**) there is difference on sorting/comparing polish letter "Ł" (L with stroke - html esc: `Ł`) (lower case: "ł" - html esc: `ł`) - we have following assumption:

utf8_polish_ci Ł greater than L and less than M
utf8_unicode_ci Ł greater than L and less than M
utf8_unicode_520_ci Ł equal to L
utf8_general_ci Ł greater than Z

In polish language letter `Ł` is after letter `L` and before `M`. No one of this coding is better or worse - it depends of your needs.

[1]:

[To see links please register here]

[2]:

[To see links please register here]

Reply

#5
I wanted to know what is the performance difference between using `utf8_general_ci` and `utf8_unicode_ci`, but I did not find any benchmarks listed on the internet, so I decided to create benchmarks myself.

I created a very simple table with 500,000 rows:

CREATE TABLE test(
ID INT(11) DEFAULT NULL,
Description VARCHAR(20) DEFAULT NULL
)
ENGINE = INNODB
CHARACTER SET utf8
COLLATE utf8_general_ci;

Then I filled it with random data by running this stored procedure:

CREATE PROCEDURE randomizer()
BEGIN
DECLARE i INT DEFAULT 0;
DECLARE random CHAR(20) ;
theloop: loop
SET random = CONV(FLOOR(RAND() * 99999999999999), 20, 36);
INSERT INTO test VALUES (i+1, random);
SET i=i+1;
IF i = 500000 THEN
LEAVE theloop;
END IF;
END LOOP theloop;
END

Then I created the following stored procedures to benchmark simple `SELECT`, `SELECT` with `LIKE`, and sorting (`SELECT` with `ORDER BY`):

CREATE PROCEDURE benchmark_simple_select()
BEGIN
DECLARE i INT DEFAULT 0;
theloop: loop
SELECT *
FROM test
WHERE Description = 'test' COLLATE utf8_general_ci;
SET i = i + 1;
IF i = 30 THEN
LEAVE theloop;
END IF;
END LOOP theloop;
END;

CREATE PROCEDURE benchmark_select_like()
BEGIN
DECLARE i INT DEFAULT 0;
theloop: loop
SELECT *
FROM test
WHERE Description LIKE '%test' COLLATE utf8_general_ci;
SET i = i + 1;
IF i = 30 THEN
LEAVE theloop;
END IF;
END LOOP theloop;
END;

CREATE PROCEDURE benchmark_order_by()
BEGIN
DECLARE i INT DEFAULT 0;
theloop: loop
SELECT *
FROM test
WHERE ID > FLOOR(1 + RAND() * (400000 - 1))
ORDER BY Description COLLATE utf8_general_ci LIMIT 1000;
SET i = i + 1;
IF i = 10 THEN
LEAVE theloop;
END IF;
END LOOP theloop;
END;

In the stored procedures above `utf8_general_ci` collation is used, but of course during the tests I used both `utf8_general_ci` and `utf8_unicode_ci`.

I called each stored procedure 5 times for each collation (5 times for `utf8_general_ci` and 5 times for `utf8_unicode_ci`) and then calculated the average values.

My results are:

**`benchmark_simple_select()`**

* with `utf8_general_ci`: 9,957 ms
* with `utf8_unicode_ci`: 10,271 ms

In this benchmark using `utf8_unicode_ci` is slower than `utf8_general_ci` by 3.2%.

**`benchmark_select_like()`**

* with `utf8_general_ci`: 11,441 ms
* with `utf8_unicode_ci`: 12,811 ms

In this benchmark using `utf8_unicode_ci` is slower than `utf8_general_ci` by 12%.

**`benchmark_order_by()`**

* with `utf8_general_ci`: 11,944 ms
* with `utf8_unicode_ci`: 12,887 ms

In this benchmark using `utf8_unicode_ci` is slower than `utf8_general_ci` by 7.9%.
Reply

#6
According to this post, there is a considerably large performance benefit on MySQL 5.7 when using utf8mb4_general_ci in stead of utf8mb4_unicode_ci:

[To see links please register here]

Reply

#7
There are two big difference the sorting and the character matching:

**Sorting**:

- `utf8mb4_general_ci` removes all accents and sorts one by one which may create incorrect sort results.
- `utf8mb4_unicode_ci` sorts accurate.

**Character Matching**

They match characters differently.

For example, in `utf8mb4_unicode_ci` you have `i != ı`, but in `utf8mb4_general_ci` it holds `ı=i`.

For example, imagine you have a row with `name="Yılmaz"`. Then

select id from users where name='Yilmaz';

would return the row if collocation is `utf8mb4_general_ci`, but if it is collocated with `utf8mb4_unicode_ci` it would **not** return the row!

On the other hand we have that `a=ª` and `ß=ss` in `utf8mb4_unicode_ci` which is not the case in `utf8mb4_general_ci`. So imagine you have a row with `name="ªßi"`, then

select id from users where name='assi';

would return the row if collocation is `utf8mb4_unicode_ci`, but would **not** return a row if collocation is set to `utf8mb4_general_ci`.

A full list of matches for each collocation may be found [here](

[To see links please register here]

).


Reply

#8
[This post][1] describes it very nicely.

In short: `utf8_unicode_ci` uses the Unicode Collation Algorithm as defined in the Unicode standards, whereas `utf8_general_ci` is a more simple sort order which results in "less accurate" sorting results.

[1]:

[To see links please register here]

Reply

#9
**For those people still arriving at this question in 2020 or later, there are newer options that may be better than *both* of these. For example, `utf8_unicode_520_ci`.**

All these collations are for the UTF-8 character encoding. The differences are in how text is sorted and compared.

`_unicode_ci` and `_general_ci` are two different sets of rules for sorting and comparing text according to the way we expect. Newer versions of MySQL introduce new sets of rules, too, such as `_unicode_520_ci` for equivalent rules based on Unicode 5.2, or the MySQL 8.x specific `_0900_ai_ci` for equivalent rules based on Unicode 9.0 (and with no equivalent `_general_ci` variant). People reading this now should probably use one of these newer collations instead of either `_unicode_ci` or `_general_ci`. The description of those older collations below is provided for interest only.

*MySQL is currently transitioning away from an older, flawed UTF-8 implementation. For now, you need to use `utf8mb4` instead of `utf8` for the character encoding part, to ensure you are getting the fixed version. The flawed version remains for backward compatibility, though it is being deprecated.*

**Key differences**

* `utf8mb4_unicode_ci` is based on the official Unicode rules for universal sorting and comparison, which sorts accurately in a wide range of languages.

* `utf8mb4_general_ci` is a simplified set of sorting rules which aims to do as well as it can while taking many short-cuts designed to improve speed. It does not follow the Unicode rules and will result in undesirable sorting or comparison in some situations, such as when using particular languages or characters.

On modern servers, this performance boost will be all but negligible. It was devised in a time when servers had a tiny fraction of the CPU performance of today's computers.

**Benefits of `utf8mb4_unicode_ci` over `utf8mb4_general_ci`**

`utf8mb4_unicode_ci`, which uses the Unicode rules for sorting and comparison, employs a fairly complex algorithm for correct sorting in a wide range of languages and when using a wide range of special characters. These rules need to take into account language-specific conventions; not everybody sorts their characters in what we would call 'alphabetical order'.

As far as Latin (ie "European") languages go, there is not much difference between the Unicode sorting and the simplified `utf8mb4_general_ci` sorting in MySQL, but there are still a few differences:

* For examples, the Unicode collation sorts "ß" like "ss", and "Œ" like "OE" as people using those characters would normally want, whereas `utf8mb4_general_ci` sorts them as single characters (presumably like "s" and "e" respectively).

* Some Unicode characters are defined as ignorable, which means they shouldn't count toward the sort order and the comparison should move on to the next character instead. `utf8mb4_unicode_ci` handles these properly.

In non-latin languages, such as Asian languages or languages with different alphabets, there may be a lot *more* differences between Unicode sorting and the simplified `utf8mb4_general_ci` sorting. The suitability of `utf8mb4_general_ci` will depend heavily on the language used. For some languages, it'll be quite inadequate.

**What should you use?**

There is almost certainly no reason to use `utf8mb4_general_ci` anymore, as we have left behind the point where CPU speed is low enough that the performance difference would be important. Your database will almost certainly be limited by other bottlenecks than this.

In the past, some people recommended to use `utf8mb4_general_ci` except when accurate sorting was going to be important enough to justify the performance cost. Today, that performance cost has all but disappeared, and developers are treating internationalization more seriously.

There's an argument to be made that if speed is more important to you than accuracy, you may as well not do any sorting at all. It's trivial to make an algorithm faster if you do not need it to be accurate. So, `utf8mb4_general_ci` is a compromise that's probably not needed for speed reasons and probably also not suitable for accuracy reasons.

One other thing I'll add is that even if you know your application only supports the English language, it may still need to deal with people's names, which can often contain characters used in other languages in which it is just as important to sort correctly. Using the Unicode rules for everything helps add peace of mind that the very smart Unicode people have worked very hard to make sorting work properly.

**What the parts mean**

Firstly, `ci` is for *case-insensitive* sorting and comparison. This means it's suitable for textual data, and case is not important. The other types of collation are `cs` (case-sensitive) for textual data where case is important, and `bin`, for where the encoding needs to match, bit for bit, which is suitable for fields which are really encoded binary data (including, for example, Base64). Case-sensitive sorting leads to some weird results and case-sensitive comparison can result in duplicate values differing only in letter case, so case-sensitive collations are falling out of favor for textual data - if case is significant to you, then otherwise ignorable punctuation and so on is probably also significant, and a binary collation might be more appropriate.

Next, `unicode` or `general` refers to the specific sorting and comparison rules - in particular, the way text is normalized or compared. There are many different sets of rules for the utf8mb4 character encoding, with `unicode` and `general` being two that attempt to work well in all possible languages rather than one specific one. The differences between these two sets of rules are the subject of this answer. Note that `unicode` uses rules from Unicode 4.0. Recent versions of MySQL and MariaDB add the rulesets `unicode_520` using rules from Unicode 5.2, and MySQL 8.x adds `0900` (dropping the "unicode_" part) using rules from Unicode 9.0.

And lastly, `utf8mb4` is of course the character encoding used internally. In this answer I'm talking only about Unicode based encodings.
Reply

#10
The above comments suggest that there is no reason to use `utf8_general*`. However, for Japanese, this may not be true.

With MariaDB `utf8mb4_ja_0900_as_cs` is not available, so you have to use one of the unicode or general options. However, `unicode` treats voiced and unvoiced etc. the same. For example, びよういん (beautician) is treated as equal to びょういん (hospital). This is clearly incorrect behavior.

> select strcmp('が', 'か' collate utf8mb4_unicode_ci); #0
> strcmp('びよういん', 'びょういん' collate utf8mb4_unicode_520_ci); #0





whereas general gives


> select strcmp('が', 'か' collate utf8mb4_general_ci); #1

In other words unicode treats voiced and unvoiced kana the same. Imo, this is not desirable.

Edit: It may be better to use `uca1400_ai_cs` which is available on newer versions of MariaDB, and get the above collations correct.
Reply



Forum Jump:


Users browsing this thread:
1 Guest(s)

©0Day  2016 - 2023 | All Rights Reserved.  Made with    for the community. Connected through