Discussion:
Collation for an ISO-8859-1 website: swedish or general, and is it safe to change from utf8 to latin1?
Casper
2009-10-14 21:38:16 UTC
Hello,
I'm using mysql to store content for a PHP website. All my webpages
are ISO-8859-1 so I'm using the latin1_swedish_ci collation which is
the default.

I mainly use Italian, plus some English, on my website, and I have no
i18n needs.

First question: is it best to stick with latin1_swedish_ci, or would I
be better off using latin1_general_ci?

Second question:
when I moved my website to a host provider, I noticed they have htf8
as default collation on their mysql.
I want to change it to latin1_swedish_ci (or general, whichever is
best; see question one).
Since I already have data stored in it, is it dangerous to change
from utf8 to latin1? Is there anything I should pay attention to?

Thank you very much for any help!
regards.
Taliesin Nuin
2009-10-15 07:51:48 UTC
Post by Casper
Hello,
I'm using mysql to store content for a PHP website. All my webpages
are ISO-8859-1 so I'm using the latin1_swedish_ci collation which is
the default.
I mainly use Italian, plus some English, on my website, and I have no
i18n needs.
First question: is it best to stick with latin1_swedish_ci, or would I
be better off using latin1_general_ci?
It depends on your needs. Annoying answer I know, but it probably
doesn't make a great deal of difference. How are Italian users used to
sorting strings? To me, General looks like a more intuitive way of
sorting, but then I grew up in the UK with its boring 26 characters and
no umlauts. Assuming you have some Italian background, take a look at
the collation charts for both collations and see which looks correct
to you.

http://www.collation-charts.org/mysql60/mysql604.latin1_general_ci.html

http://www.collation-charts.org/mysql60/mysql604.latin1_swedish_ci.html

The character sets are the same so there should be no functional
difference, just the ordering (but see my answer to your second question).
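
To illustrate what "same characters, different ordering" means, here is a minimal Python sketch. It does not reproduce MySQL's latin1_swedish_ci or latin1_general_ci tables; the `fold` helper and the word list are made up for the example. It only shows that the same strings sort differently once the comparison rule changes.

```python
import unicodedata

def fold(word: str) -> str:
    """Hypothetical sort key: drop accents and ignore case."""
    decomposed = unicodedata.normalize("NFD", word)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold()

words = ["zebra", "èra", "eroe", "Etna"]

# Raw code point order: upper case first, accented letters last.
by_code_point = sorted(words)

# Accent- and case-insensitive order, closer to what a "_ci" collation does.
by_folded_key = sorted(words, key=fold)
```

The stored data is identical in both cases; only the result of ORDER BY (and of comparisons) would change.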
Post by Casper
when I moved my website to a host provider, I noticed they have htf8
as default collation on their mysql.
Okay. That's new to me, I've never heard of that one. What the Hell is htf8?
Post by Casper
I want to change it to latin1_swedish_ci (or general, whichever is
best; see question one).
Since I already have data stored in it, is it dangerous to change
from utf8 to latin1? Is there anything I should pay attention to?
Ah, you meant utf8. Third rule of being a programmer - don't make typos!

Yes. It can be dangerous to go from utf8 to latin1. The reason is that
you are not merely changing a collation here, but an actual character
set. And utf8 contains characters that latin1 does not! So if there's
something in there that latin1 can't represent, the conversion will
change your data. You could check the contents of the database for any
rogue characters, and once you've proved that every character in there
exists in latin1, you can convert.
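
One way to do that check, sketched in Python on text you have already pulled out of the database. The function name is made up for the example. It assumes Python's "latin-1" codec (strict ISO-8859-1); MySQL's latin1 is closer to Windows cp1252, which also has the euro sign, so you may prefer to pass "cp1252" instead.

```python
def chars_outside(text: str, encoding: str = "latin-1") -> list:
    """Return (position, character) pairs that the target encoding cannot hold."""
    rogue = []
    for position, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            rogue.append((position, ch))
    return rogue

# Italian accents fit in Latin-1; the euro sign does not fit in ISO-8859-1.
report = chars_outside("caffè €5")
```

An empty list for every row means the conversion should not lose anything.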

But fourth rule of being a programmer, never give up flexibility without
good reason. Why would you not want to use the utf8 character set?
Unicode is the modern way.

Should you change the collations on your databases, then aside from
conversion problems the only things to make sure of are (a) that you
convert everything at once, to avoid "Illegal mix of collations" errors
when you JOIN tables on fields with different collations, and (b) if you are
changinc character sets (not just collation) you don't have any code
that presumes a specific character set coming from MySQL. The latter is
pretty esoteric for most people, but some of us do that sometimes.
(Usually when we're feeling masochistic).
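
For point (a), a small Python sketch that writes out one ALTER TABLE statement per table so that nothing gets left behind. The table names are hypothetical, and the target collation is only an example; ALTER TABLE ... CONVERT TO CHARACTER SET is standard MySQL syntax, but take a backup before running anything like this.

```python
def conversion_statements(tables, charset="latin1", collation="latin1_general_ci"):
    """Build one ALTER TABLE statement per table, all with the same target."""
    return [
        f"ALTER TABLE `{name}` CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        for name in tables
    ]

# Hypothetical table names, for illustration only.
statements = conversion_statements(["articles", "comments"])
script = "\n".join(statements)
```

Running the whole script in one go keeps every table on the same collation, which is what avoids the mixed-collation JOIN errors.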

Hope that helps,

Taliesin Nuin.
Post by Casper
Thank you very much for any help!
regards.
Captain Paralytic
2009-10-15 10:22:22 UTC
Post by Taliesin Nuin
changinc character sets (not just collation) you don't have any code
First rule of pointing out someone's typos, don't make typos!
Taliesin Nuin
2009-10-15 12:08:39 UTC
Post by Captain Paralytic
Post by Taliesin Nuin
changinc character sets (not just collation) you don't have any code
First rule of pointing out someone's typos, don't make typos!
Guilty! :) But if you only take advice from perfect people, you'll never
learn anything. ;)

T
Álvaro G. Vicario
2009-10-15 09:23:56 UTC
Post by Casper
I'm using mysql to store content for a PHP website. All my webpages
are ISO-8859-1 so I'm using the latin1_swedish_ci collation which is
the default.
I use mainly italian and also some english language on my website, and
no i18n needs.
First question: is it best to stick with latin1_swedish_ci, or would I
be better off using latin1_general_ci?
when I moved my website to a host provider, I noticed they have htf8
as default collation on their mysql.
I want to change it to latin1_swedish_ci (or general, whichever is
best; see question one).
Since I already have data stored in it, is it dangerous to change
from utf8 to latin1? Is there anything I should pay attention to?
My advice is to ignore default collations and *always* specify yours.
However, if you're already storing data in UTF-8 you are not guaranteed
to be able to convert it into Latin-1 since the former is a superset of
the latter (i.e., you might have stuff like the euro symbol that doesn't
exist in Latin-1).

A possible procedure could be:

1. Create a dump of your data with mysqldump
2. Convert the *.sql file from UTF-8 to ISO-8859-1 with iconv
-> Check whether there're conversion errors
3. Recreate your tables using latin1_general_ci or whatever
4. Load the converted SQL dump into the new tables

iconv is quite a standard tool in Linux but you can probably find
Windows binaries if you need them.
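
If iconv is not at hand, step 2 can be approximated in Python. This is only a sketch: the function name and file paths are made up, and it does not touch the charset declarations a dump usually carries (SET NAMES lines, DEFAULT CHARSET clauses), which you would still have to adjust by hand.

```python
from pathlib import Path

def convert_dump(src: Path, dst: Path) -> list:
    """Re-encode a UTF-8 dump as ISO-8859-1.

    Returns the line numbers that cannot be converted; the output
    file is only written when that list is empty.
    """
    with open(src, encoding="utf-8", newline="") as fin:
        lines = fin.readlines()

    failed = []
    for lineno, line in enumerate(lines, start=1):
        try:
            line.encode("iso-8859-1")
        except UnicodeEncodeError:
            failed.append(lineno)

    if not failed:
        with open(dst, "w", encoding="iso-8859-1", newline="") as fout:
            fout.writelines(lines)
    return failed
```

Something like `convert_dump(Path("dump.sql"), Path("dump_latin1.sql"))` would then either produce the converted file or tell you which lines to inspect.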
--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- Mi sitio sobre programación web: http://borrame.com
-- Mi web de humor satinado: http://www.demogracia.com
--
Casper
2009-10-15 16:00:44 UTC
Hi,

First of all, I want to thank you for your very useful suggestions! (And
sorry for the typos; of course htf-8 does not exist :-)

I am currently using latin1_swedish_ci in my home development
environment, and my whole website is set to ISO-8859-1 (I set that
content type in the HTML meta tags).

It is the new hosting provider that uses UTF-8 as the default for their
MySQL, and it was my fault that I failed to specify the collation in my
table DDL scripts :-(

Since all my data originates from ISO-8859-1, I guess it doesn't contain
any characters outside Latin-1, so hopefully the conversion to latin1
won't lose anything. Nevertheless, I'll follow your valuable advice.

As a general question, would you suggest sticking with latin1,
or planning a migration to UTF-8 instead?

I've been told that latin1 (ie. ISO-8859-1) is more compatible with
browsers, simpler to handle (less hassle), and at least requires less
memory space.
So, since I have no need for i18n I thought I would go for latin1....
what I'm saying is: I hope I'm not actually making a wrong move here
by re-adjusting the host mysql database back to latin1 :-)

As for the choice between latin1 swedish and general, I'll check
the sorting in those collation charts.

thank you again
Álvaro G. Vicario
2009-10-15 16:48:53 UTC
Post by Casper
As a general question, would you suggest sticking with latin1,
or planning a migration to UTF-8 instead?
I've been told that latin1 (ie. ISO-8859-1) is more compatible with
browsers
I don't think there are any major browsers out there that do not
support UTF-8. I'd dare say that most of the important web apps you can
download already use it.
Post by Casper
simpler to handle (less hassle)
I agree with this. However, hassle comes from various sources and your
mileage may vary:

- Some programming languages (such as PHP) have poor Unicode support.
You might end up calculating the length of a string with the wrong
one-byte function.

- If your editor prepends the byte order mark (BOM) to files, you may
find little squares all around your web site. If it doesn't, you can
easily ruin the file the next time you edit it, by saving it with the
default ANSI encoding.
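
Both pitfalls are easy to reproduce. The sketch below uses Python rather than PHP, purely as an illustration; the sample word is arbitrary.

```python
import codecs

word = "città"  # 5 characters, one of them accented

char_count = len(word)                      # counts characters
utf8_bytes = len(word.encode("utf-8"))      # counts bytes: "à" takes two
latin1_bytes = len(word.encode("latin-1"))  # one byte per character

# A file saved "with BOM" starts with three extra bytes.
raw = codecs.BOM_UTF8 + word.encode("utf-8")

# Read as Latin-1, those three bytes show up as stray characters.
misread_prefix = raw[:3].decode("latin-1")

# "utf-8-sig" drops a leading BOM; plain "utf-8" keeps it as U+FEFF.
clean = raw.decode("utf-8-sig")
with_bom = raw.decode("utf-8")
```

A byte-counting length function corresponds to `utf8_bytes` here, which is where the off-by-one surprises with accented text come from.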
Post by Casper
and at least requires less memory space.
Nothing serious, I presume. Accented letters will use two bytes but
that's it. You'll probably use more memory if you indent your code.
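
To put a number on it, a quick Python sketch comparing the two encodings on an arbitrary Italian sentence (the sentence and the function name are made up for the example):

```python
def encoded_sizes(text: str) -> tuple:
    """Return (latin1_bytes, utf8_bytes) for text that fits in Latin-1."""
    return len(text.encode("latin-1")), len(text.encode("utf-8"))

sample = "Perché la città è più bella"
latin1_size, utf8_size = encoded_sizes(sample)

# Each accented letter costs exactly one extra byte in UTF-8.
extra = utf8_size - latin1_size
```

For this sentence the difference is a handful of bytes, one per accented letter.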
Post by Casper
So, since I have no need for i18n I thought I would go for latin1....
what I'm saying is: I hope I'm not actually making a wrong move here
by re-adjusting the host mysql database back to latin1 :-)
If unsure, I don't think that Latin-1 will hurt. But, of course, this
can be a nice excuse to train yourself with UTF-8.

It's also worth noting that you don't really _need_ any special encoding
to write texts in exotic languages in an HTML document thanks to HTML
entities. But, of course, this doesn't mean it is practical:

முதற்
பக்கம்
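
For what it's worth, such text can be turned into numeric character references mechanically. A minimal Python sketch, using the standard "xmlcharrefreplace" error handler (the function name is made up for the example):

```python
import html

def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal &#NNNN; reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

encoded = to_numeric_entities("முதற் பக்கம்")

# html.unescape turns the references back into the original text.
decoded = html.unescape(encoded)
```

The result is pure ASCII, so it survives any page encoding, but it is several times longer than the text it stands for.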
--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- Mi sitio sobre programación web: http://borrame.com
-- Mi web de humor satinado: http://www.demogracia.com
--
Taliesin Nuin
2009-10-16 10:12:36 UTC
Post by Casper
Hi,
First of all, I want to thank you for your very useful suggestions! (And
sorry for the typos; of course htf-8 does not exist :-)
See, Captain - he forgives me! :)
Post by Casper
As a general question, would you suggest sticking with latin1,
or planning a migration to UTF-8 instead?
What I will say on this subject is this. I can't give you a good reason
why you shouldn't go back to the latin1 character set. I can't say that
your customers need Unicode or that your server will explode if you use
latin1. ;) I'm just someone who has had to handle migration problems
between character sets before and has had to deal with code bases that
later developed a need for Unicode but had specific settings dependent
on single-byte character sets. My feeling is that if you're faced with a
choice between a system that is more capable and modern, and one that is
less capable and will die sooner than the other, then I go for the
first. It of course depends on whether the other system involves more
work for you or not, and you know that better than I. I'm just saying
that if it's a simple choice, then I would choose utf8. I freely admit
that this is based on personal preference without specific reference to
your code, but for what it's worth, that's my view.

Anyway, best of luck with it,

Taliesin Nuin,
Captain Paralytic
2009-10-16 11:11:31 UTC
Post by Taliesin Nuin
Post by Casper
Hi,
First of all, I want to thank you for your very useful suggestions! (And
sorry for the typos; of course htf-8 does not exist :-)
See, Captain - he forgives me! :)
Oh I forgive you too. I have to, I do it all too often myself!

As a famous person once said: "Let he who is without skin cast the
first scone."
Taliesin Nuin
2009-10-16 11:42:03 UTC
Post by Captain Paralytic
Post by Taliesin Nuin
Post by Casper
Hi,
First of all, I want to thank you for your very useful suggestions! (And
sorry for the typos; of course htf-8 does not exist :-)
See, Captain - he forgives me! :)
Oh I forgive you too. I have to, I do it all too often myself!
As a famous person once said: "Let he who is without skin cast the
first scone."
Mmmmmm. Scones... Haven't had them in ages. It's amazing how a random
comment from someone on the Internet will sell an extra packet of scones
from Waitrose tonight. :D

Taliesin.
