Always use UTF-8 encoding

I'm setting up a new server and want full UTF-8 support in my web application. I've tried this before on existing servers, but always seemed to end up falling back to ISO-8859-1.

Where do I need to set the encoding/charset? I know I need to configure Apache, MySQL and PHP to achieve this. Is there a standard checklist I can follow, or use to troubleshoot mismatches?

This is a new Linux server running MySQL 5, PHP 5 and Apache 2.
P粉548512637, 531 days ago


  • P粉138871485, 2023-07-25 16:40:26

    I would like to add to chazomaticus' excellent answer:

    Also don't forget the META tags (like this, or the HTML4 or XHTML versions):

    <meta charset="utf-8">

    This may seem trivial, but IE7 has given me problems before.

    I was doing everything correctly: the database, the database connection, and the Content-Type HTTP header were all set to UTF-8, and the page worked fine in every other browser, but Internet Explorer still insisted on using the "Western European" encoding.

    It turned out that the page was missing the META tag. After adding it, the problem was solved.


    Edit:

    The W3C actually has a sizeable section dedicated to internationalization (I18N) issues. They have a number of articles related to this issue, covering HTTP, (X)HTML, and CSS.

    They recommend using both the HTTP header and the HTML meta tag (or the XML declaration, in the case of XHTML served as XML).
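
    As a minimal sketch of that recommendation, the PHP page below declares UTF-8 in both places. The helper name renderPage() and the sample strings are made up for illustration, and the file itself is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper (not part of any library): builds a small HTML page
// whose markup declares UTF-8 via the META tag.
function renderPage($title, $body)
{
    // Escape with an explicit encoding so multibyte input is treated as UTF-8.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    $safeBody  = htmlspecialchars($body, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
         . "<html>\n<head>\n"
         . "<meta charset=\"utf-8\">\n"
         . "<title>{$safeTitle}</title>\n"
         . "</head>\n<body>\n<p>{$safeBody}</p>\n</body>\n</html>\n";
}

// Declare the encoding in the HTTP header too; this only works if no
// output has been sent yet.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

echo renderPage('Grüße', 'Zażółć gęślą jaźń');
```

    With both declarations in place, a browser that ignores one of them (as IE7 apparently did with the header in my case) still has the other to go on.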

  • P粉381463780, 2023-07-25 09:11:32

    Data storage:

    • Specify the utf8mb4 character set on all tables and text columns in the database. This makes MySQL physically store and retrieve values encoded natively in UTF-8. Note that if a utf8mb4_* collation is specified (without any explicit character set), MySQL will implicitly use the utf8mb4 encoding.

    • In older versions of MySQL (< 5.5.3), you are unfortunately limited to plain utf8, which supports only a subset of Unicode characters (those that fit in three bytes, so nothing outside the Basic Multilingual Plane).
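
    To make the storage step concrete, here is a rough sketch of the statements involved, kept as PHP strings so they can be passed to whichever database API you use. The table name articles, its columns, and the utf8mb4_unicode_ci collation are only example choices, not something the setup requires.

```php
<?php
// Hypothetical table and column names, for illustration only.
// New table: declare the character set (and a matching collation) up front.
$createTable = <<<'SQL'
CREATE TABLE articles (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    body  TEXT NOT NULL
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
SQL;

// Existing table: convert the table default and its text columns in one go.
// Back up first; this rewrites the stored data.
$convertTable = 'ALTER TABLE articles '
              . 'CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci';

// Either string can then be executed, e.g. $pdo->exec($createTable);
```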

    Data access:

    • In your application code (e.g. PHP), whatever database access method you use, you need to set the connection character set to utf8mb4. That way, MySQL does no conversion from its native UTF-8 when it hands data to your application, and vice versa.

    • Some drivers provide their own mechanism for configuring the connection character set, which both updates its own internal state and informs MySQL of the encoding to use on the connection - this is usually the preferred approach. In PHP:

      • If you are using the PDO abstraction layer for PHP ≥ 5.3.6, you can specify the character set in the DSN:

        $dbh = new PDO('mysql:charset=utf8mb4');
      • If you're using mysqli, you can call set_charset():

        $mysqli->set_charset('utf8mb4');       // object-oriented style
        mysqli_set_charset($link, 'utf8mb4');  // procedural style
      • If you're stuck with the plain mysql extension but are running PHP ≥ 5.2.3, you can call the mysql_set_charset() function.

    • If the driver does not provide its own mechanism to set the connection character set, you may need to issue a query to tell MySQL how your application wants the data on the connection to be encoded: SET NAMES 'utf8mb4'.

    • The same utf8mb4 versus utf8 caveat applies here as in the data storage section above.
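
    Putting the data access points together, a PDO-based setup might look roughly like the sketch below. The helper names (utf8Dsn, connectUtf8, forceUtf8, connectionCharsets) are invented for this example, and host, database name and credentials are placeholders you would supply yourself.

```php
<?php
// Hypothetical helper: builds a PDO DSN that always carries charset=utf8mb4.
function utf8Dsn($host, $dbName)
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

// Hypothetical helper: opens the connection with exceptions enabled.
function connectUtf8($host, $dbName, $user, $pass)
{
    return new PDO(utf8Dsn($host, $dbName), $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

// Fallback for setups where the driver has no charset mechanism of its own.
function forceUtf8($pdo)
{
    $pdo->exec("SET NAMES 'utf8mb4'");
}

// Troubleshooting aid: returns the character_set_* variables of this session
// as name => value pairs, so mismatches are easy to spot.
function connectionCharsets($pdo)
{
    return $pdo->query("SHOW VARIABLES LIKE 'character_set_%'")
               ->fetchAll(PDO::FETCH_KEY_PAIR);
}
```

    After connecting, connectionCharsets() should report utf8mb4 for character_set_client, character_set_connection and character_set_results if the connection was configured as intended.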

    Output:

    • UTF-8 should be set in the HTTP header, for example Content-Type: text/html; charset=utf-8. You can do this by setting default_charset in php.ini (preferred) or manually using the header() function.
    • If your application transfers text to other systems, they will also need to be told the character encoding. For web applications, the browser must be told the encoding in which the data is sent (via HTTP response headers or HTML metadata).
    • When using json_encode() for output encoding, add JSON_UNESCAPED_UNICODE as the second parameter.
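
    A small sketch of the output side, assuming PHP ≥ 5.4 (for JSON_UNESCAPED_UNICODE) and a source file saved as UTF-8. The payload is made-up sample data.

```php
<?php
// Make PHP advertise UTF-8 in its default Content-Type header.
// php.ini is the preferred place; ini_set() is a per-script fallback.
ini_set('default_charset', 'UTF-8');

$payload = ['name' => 'Zoë', 'city' => '東京'];

// Default: non-ASCII characters are escaped as \uXXXX sequences.
$escaped = json_encode($payload);

// With the flag: characters are emitted as literal UTF-8.
$literal = json_encode($payload, JSON_UNESCAPED_UNICODE);

echo $escaped, "\n"; // {"name":"Zo\u00eb","city":"\u6771\u4eac"}
echo $literal, "\n"; // {"name":"Zoë","city":"東京"}
```

    Both forms decode to the same data; the unescaped one is simply shorter and easier to read, which is why it only makes sense once the rest of the chain really is UTF-8.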

    Input:

    • The browser will submit the data in the character set specified by the document, so no special processing is required on input.
    • If you have doubts about the request encoding (it could have been tampered with), you can verify that every received string is valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the job, but you have to apply it consistently to every input. There's really no way around this, since a malicious client can submit data in whatever encoding it wants, and I haven't found a trick to get PHP to do this for you reliably.
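
    One way to apply that check consistently is a small recursive validator run once at the top of each request. This is only a sketch: isValidUtf8Input() is a made-up name, and rejecting with a 400 response is one possible policy among several.

```php
<?php
// Hypothetical helper: returns true only if every string in the input
// (array keys and nested values included) is valid UTF-8.
function isValidUtf8Input($value)
{
    if (is_array($value)) {
        foreach ($value as $key => $item) {
            if (is_string($key) && !mb_check_encoding($key, 'UTF-8')) {
                return false;
            }
            if (!isValidUtf8Input($item)) {
                return false;
            }
        }
        return true;
    }

    // Non-string scalars (ints, null, ...) carry no encoding to check.
    return !is_string($value) || mb_check_encoding($value, 'UTF-8');
}

// Usage sketch: refuse the request before anything is stored or used.
if (!isValidUtf8Input($_POST)) {
    http_response_code(400);
    exit('Invalid UTF-8 in request');
}
```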

    Other code notes:

    • Obviously, all files you provide (PHP, HTML, JavaScript, etc.) should be encoded in valid UTF-8.

    • You need to make sure that every time you process a UTF-8 string, you do so safely. Unfortunately, this is the hardest part. You will probably need to make extensive use of PHP's mbstring extension.

    • PHP's built-in string operations are not UTF-8 safe by default. There are a few things you can safely do with normal PHP string operations (such as concatenation), but for most operations you should use the equivalent mbstring function.

    • In order to know what you're doing (i.e. not screw up), you really need to understand UTF-8 and how it works at the lowest level. Check out any of the links on utf8.com which provide some great resources to learn everything you need to know.
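
    To illustrate the difference between the byte-oriented built-ins and their mbstring counterparts, here is a short sketch. It assumes the mbstring extension is loaded and the source file is saved as UTF-8; the sample string is arbitrary.

```php
<?php
// Tell mbstring which encoding to assume when none is passed explicitly.
mb_internal_encoding('UTF-8');

$s = 'héllo wörld';

echo strlen($s), "\n";    // 13: counts bytes (é and ö take two bytes each)
echo mb_strlen($s), "\n"; // 11: counts characters

// substr() cuts at a byte offset and can split a character in half.
$broken = substr($s, 0, 2);
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)

// mb_substr() cuts at a character offset.
echo mb_substr($s, 0, 2), "\n"; // hé

// Case conversion that understands non-ASCII letters.
echo mb_strtoupper($s), "\n"; // HÉLLO WÖRLD

// Concatenating two valid UTF-8 strings always yields valid UTF-8.
var_dump(mb_check_encoding('é' . 'ö', 'UTF-8')); // bool(true)
```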
