More on BOMs
A byte-order mark (BOM) is the Unicode character at code point U+FEFF (zero-width no-break space) when that character is used to denote the endianness of a string of UCS/Unicode characters encoded in UTF-16 or UTF-32.
It is conventionally used as a marker to indicate that text is encoded in UTF-8, UTF-16 or UTF-32.
In most character encodings, the BOM is a byte pattern that is unlikely to appear in other contexts (it typically looks like a sequence of obscure control codes). If a BOM is misinterpreted as an actual character within Unicode text, it is generally invisible because it is a zero-width no-break space. Use of the U+FEFF character for non-BOM purposes was deprecated in Unicode 3.2 (which provides an alternative, U+2060, for those other purposes), so that U+FEFF can be used solely with the semantics of a BOM.
In UTF-16, a BOM (U+FEFF) is placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.
- If the 16-bit units are represented in big-endian byte order, this BOM character appears in the sequence of bytes as 0xFE followed by 0xFF (where "0x" indicates hexadecimal);
- If the 16-bit units use little-endian order, the sequence of bytes has 0xFF followed by 0xFE.
The code point U+FFFE is a noncharacter, guaranteed never to be assigned as a Unicode character. This implies that, in a UTF-16 context, the byte pattern 0xFF, 0xFE can only be interpreted as the U+FEFF character expressed in little-endian byte order (since it cannot be a U+FFFE character expressed in big-endian byte order).
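The two byte orders described above can be checked with a short Python sketch. The helper name utf16_byte_order is hypothetical and used only for illustration; it looks only at the first two bytes and assumes the data is already known to be UTF-16.

```python
import codecs
from typing import Optional

BOM = "\ufeff"

# Encoding U+FEFF with an explicit byte order exposes the raw byte sequences.
big_endian = BOM.encode("utf-16-be")     # bytes FE FF
little_endian = BOM.encode("utf-16-le")  # bytes FF FE


def utf16_byte_order(data: bytes) -> Optional[str]:
    """Return 'big' or 'little' based on a leading UTF-16 BOM, else None.

    Hypothetical helper: it assumes the input is UTF-16 and does not
    try to distinguish UTF-16 from UTF-32.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    return None


print(big_endian.hex(), little_endian.hex())
print(utf16_byte_order("\ufeffhi".encode("utf-16-le")))
```

Without a BOM the helper returns None, in which case the byte order has to come from somewhere else, such as a declared character set.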
While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may be used to mark text as UTF-8. It only identifies a file as UTF-8 and does not indicate anything about byte order.
Many Windows programs (including Windows Notepad) add BOMs to UTF-8 files. However, on Unix systems (which make heavy use of text files for file formats as well as for interprocess communication) this practice is not recommended, because it interferes with correct processing of important codes such as the hash-bang (#!) at the start of an interpreted script. It may also interfere with source code for programming languages that do not recognize it. For example, gcc reports stray characters at the beginning of a source file, and in PHP, if output buffering is disabled, the BOM has the subtle effect of causing the page to start being sent to the browser, which prevents the PHP script from specifying custom headers. The UTF-8 representation of the BOM is the byte sequence EF BB BF, which appears as the ISO-8859-1 characters ï»¿ in most text editors and web browsers not prepared to handle UTF-8.
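As a minimal sketch of how a tool might cope with a UTF-8 BOM, the Python below removes a leading EF BB BF from raw bytes and compares the standard "utf-8" and "utf-8-sig" codecs. The function name strip_utf8_bom and the sample script contents are assumptions made for illustration.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical shell script saved by an editor that prepends a BOM.
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

cleaned = strip_utf8_bom(script)         # now starts with "#!" again
text_sig = script.decode("utf-8-sig")    # this codec drops a leading BOM
text_plain = script.decode("utf-8")      # this codec keeps it as U+FEFF

# Decoding the BOM bytes as ISO-8859-1 gives the three visible characters.
mojibake = codecs.BOM_UTF8.decode("iso-8859-1")

print(cleaned[:2], repr(text_plain[0]), mojibake)
```

Stripping at the byte level suits files that must begin with exact bytes, such as scripts with a hash-bang line, while "utf-8-sig" is convenient when the content is being decoded to text anyway.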
A BOM can also be used with UTF-32, although this encoding is rarely used for transmission; the same rules as for UTF-16 apply. For the IANA-registered character sets UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE, a byte-order mark must not be used. Because the names of these character sets already determine the byte order, an initial U+FEFF has to be interpreted as a (deprecated) zero-width no-break space. For the registered character sets UTF-16 and UTF-32, an initial U+FEFF indicates the byte order.
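The difference between the generic and the byte-order-specific character sets can be illustrated in Python, whose "utf-16" codec consumes an initial BOM while "utf-16-le" keeps it as a character. The sniff_bom helper is a hypothetical sketch, not a standard API, and it relies only on the BOM constants in the standard codecs module.

```python
import codecs
from typing import Optional

# Order matters: the UTF-32 little-endian BOM (FF FE 00 00) begins with the
# UTF-16 little-endian BOM (FF FE), so the UTF-32 patterns are tested first.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess an encoding from a leading BOM; return None if there is none.

    Heuristic only: UTF-16 little-endian text whose first character after
    the BOM is U+0000 would be reported as UTF-32 little-endian.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


payload = "\ufeffA".encode("utf-16-le")  # bytes FF FE 41 00

generic = payload.decode("utf-16")       # BOM selects the byte order and is consumed
explicit = payload.decode("utf-16-le")   # initial U+FEFF is kept as a character

print(sniff_bom(payload), repr(generic), repr(explicit))
```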