Range of Unicode characters GHC accepts

Question

This may sound a bit ridiculous, but GHC fails to compile my string containing bacon, a croissant, a cucumber, and a potato:

main = putStrLn "🥓  🥐  🥒  🥔"

I realize I could easily write

main = putStrLn "\x1F953  \x1F950  \x1F952  \x1F954"

to the same effect, but I had always assumed GHC would accept any Unicode in its source. So: what, exactly, are the restrictions on the Unicode characters GHC accepts in source files?
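One wrinkle with the escaped form: a numeric escape greedily consumes any hex digits that follow it, so Haskell provides the empty escape `\&` as a separator. A hedged sketch (the helper `escapeNonAscii` is my own invention, not something from GHC) that rewrites every non-ASCII character into this always-safe escaped form:

```haskell
-- Sketch: rewrite each non-ASCII character as a numeric escape followed
-- by the empty escape \&, so the source stays pure ASCII regardless of
-- which Unicode version the compiler ships with. The \& prevents a
-- following literal digit from being absorbed into the hex escape
-- (e.g. "\x1F9530" would otherwise parse as a single escape).
import Data.Char (isAscii, ord, toUpper)
import Numeric (showHex)

escapeNonAscii :: String -> String
escapeNonAscii = concatMap esc
  where
    esc c
      | isAscii c = [c]
      | otherwise = "\\x" ++ map toUpper (showHex (ord c) "") ++ "\\&"

main :: IO ()
main = putStrLn (escapeNonAscii "\x1F953  \x1F950")
```

Running this through a working GHC prints the ASCII-safe form `\x1F953\&  \x1F950\&`, which can be pasted back into a string literal.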


BTW: I realize that supporting this sort of thing is hell for the GHC lexer (I actually ran across the above problem while writing test cases for a lexer of my own), but I'm still a tad disappointed.


Tags: haskell, ghc, string, unicode · 2017-01-03 08:01

Answers (1)

  1. 2017-01-03 15:01

    Saving main = putStrLn "🥓 🥐 🥒 🥔" as UTF-8 and running it with GHC 8.0.1 on macOS, I got:

    lexical error in string/character literal at character '\129365'
    

    I found a related (but closed) GHC bug report. The cause, in both cases, was that older versions of GHC ship with an older version of Unicode:

    $ ghc-7.0.3 -e "Data.Char.generalCategory '\8342'"
    NotAssigned
    

    So the problem seems to be that the version of GHC we're using doesn't support the newer emoji yet: it thinks the code point is unassigned and errors out, even though that code point is assigned in newer versions of Unicode.
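    You can ask the GHC you are running the same question directly, by querying the Unicode tables bundled with its base library (a small sketch; on a GHC new enough to know these code points, each category comes back as OtherSymbol rather than NotAssigned):

```haskell
-- Print the general category this GHC's Unicode tables assign to each
-- of the four emoji from the question.
import Data.Char (GeneralCategory (..), generalCategory, toUpper)
import Numeric (showHex)

main :: IO ()
main = mapM_ report "\x1F953\x1F950\x1F952\x1F954"
  where
    report c =
      putStrLn ("U+" ++ map toUpper (showHex (fromEnum c) "")
                     ++ ": " ++ show (generalCategory c))
```
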

    There is also a related open GHC ticket, though it mostly discusses which whitespace characters are allowed.

    Finally, the lit_error function in Lexer.x seems to be where the error is surfaced. Several functions in that file call it, though, so it's not clear exactly which path triggers it.
