Forums

loaddata duplicate entry - encoding issue?

I have a model similar to the following:

class Example(models.Model):
    name = models.CharField(max_length=200, unique=True)

After running dumpdata locally (SQLite) I have:

[
  {
    "model": "Example",
    "pk": 1,
    "fields": {
      "name": "Alaves"
    }
  },
  {
    "model": "Example",
    "pk": 2,
    "fields": {
      "name": "Alavés"
    }
  }
]

When deploying to PA (MySQL), running loaddata results in:

django.db.utils.IntegrityError: Problem installing fixture '/x/y/z': 
(1062, "Duplicate entry 'Alav\xc3\xa9s' for key 'name'")

Both local and PA databases are using UTF8. Can't reproduce locally.

Any hints how to debug/fix? Thanks!

That's probably because SQLite doesn't enforce the unique constraint that you set in your model. Isn't MySQL doing what you want it to do?

SQLite does enforce the unique constraint - the two values are not the same:

Alaves & Alavés
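
A quick check in a sqlite3 shell shows why - SQLite compares strings with its default BINARY collation, so the accented and unaccented names are distinct values (assuming a UTF-8 terminal):

SELECT 'Alaves' = 'Alavés';                -- 0: distinct under the default BINARY collation
SELECT 'Alaves' = 'Alavés' COLLATE NOCASE; -- still 0: NOCASE only folds ASCII A-Z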

Open a mysql console via the database tab:

SHOW FULL COLUMNS FROM <table>;

You should see the collation value per column, e.g.

| Field | Type         | Collation       | Null | Key | Default | Extra | Privileges                      | Comment |
| name  | varchar(200) | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |

To see all collation values:

SHOW COLLATION;
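
The full list is long; you can filter it, e.g. to see just the utf8 collations:

SHOW COLLATION WHERE Charset = 'utf8';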

To update the column:

ALTER TABLE <table> MODIFY name VARCHAR(200) CHARACTER SET <charset> COLLATE <collation>;

e.g. to switch to a binary collation:

ALTER TABLE <table> MODIFY name VARCHAR(200) CHARACTER SET utf8 COLLATE utf8_bin;

Via: http://stackoverflow.com/a/29570832
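
If you'd rather have the change tracked in code than applied by hand in the console, the same ALTER can be run from a Django migration. A minimal sketch, assuming a hypothetical app named myapp (table myapp_example) and reverting to utf8_general_ci on rollback:

from django.db import migrations

class Migration(migrations.Migration):

    dependencies = [
        ('myapp', '0001_initial'),  # hypothetical previous migration
    ]

    operations = [
        migrations.RunSQL(
            # Switch the column to a binary collation so accented and
            # unaccented names count as different values for the unique index.
            sql=(
                "ALTER TABLE myapp_example MODIFY name VARCHAR(200) "
                "CHARACTER SET utf8 COLLATE utf8_bin;"
            ),
            reverse_sql=(
                "ALTER TABLE myapp_example MODIFY name VARCHAR(200) "
                "CHARACTER SET utf8 COLLATE utf8_general_ci;"
            ),
        ),
    ]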

The table collation was set to utf8_general_ci, which was the default for both the MySQL server and the schema.

Collation names in MySQL 5.5 follow three naming conventions:

  • A name ending in _ci indicates a case-insensitive collation.
  • A name ending in _cs indicates a case-sensitive collation.
  • A name ending in _bin indicates a binary collation. Character comparisons are based on character binary code values.

The collation had to be changed to utf8_bin.
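
You can see the difference directly in the MySQL console (assuming the connection character set is utf8):

SELECT 'Alaves' = 'Alavés' COLLATE utf8_general_ci; -- 1: the values compare equal
SELECT 'Alaves' = 'Alavés' COLLATE utf8_bin;        -- 0: the values are distinct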

Aha! That makes sense. Just to make sure I've understood (and in case someone else reading this doesn't quite get the whole "collation" thing): your database was previously set up to treat accented characters as identical to their unaccented equivalents for the purposes of identifying duplicates, so you changed the DB setup to treat accented and unaccented characters as different. Is that a reasonable summary?

Correct