A small amount of preprocessing happens before the lexer runs. Two steps are performed before lexical analysis starts, in order:
- First, the start of the file is checked for the UTF-8 BOM (\xEF\xBB\xBF). If the sequence is found, those bytes are discarded. Note that it is strongly recommended to save Coyote files without the BOM. This matches (although more sternly) the Unicode Consortium recommendation. However, the BOM is supported for compatibility with editors that insist on writing one.
- Then the start of the (possibly remaining) file is checked for the shebang sequence (#!). If found, the entire line is discarded. This step repeats until the file no longer starts with a shebang.
The reason multiple shebangs are supported is to allow VM parameters to be passed in a portable manner: most operating systems recognize only a single parameter in a shebang, and that parameter is already needed by env to name the interpreter. For example:

    #!/usr/bin/env coyote
    #! -Dbackend=SDL2 -Dreal_size=32
    #! -O3
    ... rest of code ...
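The two preprocessing steps can be sketched in a few lines of Python. This is only an illustration, not the Coyote implementation: the function name `strip_preamble` is hypothetical, and the sketch assumes that a discarded shebang line takes its line terminator with it, which the text above does not state.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"

# Line terminators recognized by the compiler: CRLF, CR, or LF.
_LINE_END = re.compile(rb"\r\n?|\n")


def strip_preamble(source: bytes) -> bytes:
    """Drop a leading UTF-8 BOM, then every leading shebang line."""
    # Step 1: the BOM is only looked for at the very start of the file.
    if source.startswith(UTF8_BOM):
        source = source[len(UTF8_BOM):]

    # Step 2: discard shebang lines until the input no longer starts with one.
    while source.startswith(b"#!"):
        match = _LINE_END.search(source)
        # Assumption: the terminator is discarded together with the line.
        # A shebang with no terminator consumes the rest of the input.
        source = source[match.end():] if match else b""

    return source
```

Because the BOM check runs only once and before the shebang loop, a BOM that appears after a shebang line is left in place and reaches the lexer as ordinary input.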
The compiler assumes the following things about the source code, regardless of the operating system or the presence of byte order marks:
- The source is always opened in binary mode, for portability, and to prevent the operating system's newline translation from confusing the parser.
- The only valid input encoding is UTF-8, with or without the BOM. The compiler may or may not verify this, and it may issue a warning if a BOM is present, since the BOM is strongly discouraged. Note that (standard, 7-bit) ASCII is indirectly supported, as it is a proper subset of UTF-8. However, Latin-1, Windows-1252, and similar encodings are not supported. Furthermore, the input is expected to be normalized UTF-8. (TBD: Loosen this restriction?) Normalization is unlikely to be verified, as the verification would depend on the Unicode version.
- Line endings are recognized as CRLF, CR, or LF (regular expression /\r\n?|\n/). This covers Windows (CRLF), old Mac OS (CR), and most other systems (LF). These line endings are converted to LF (\n). This implies that a string literal that spans multiple lines in the source will have LF at the end of each line, regardless of the file's actual line endings. Note that LFCR is not recognized as a single line ending (it counts as two instead); only a handful of systems used it anyway. Other types of line endings (such as those defined by Unicode) are valid if found within comments and strings, but may create a mismatch between line numbers in compiler messages and in the editor (depending on the editor).
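The assumptions above can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch makes one choice the text leaves open: it decodes strictly, so invalid UTF-8 is rejected, whereas the compiler "may or may not" verify the encoding. It also assumes any BOM has already been removed.

```python
import re

# The line-ending pattern quoted in the text: CRLF, CR, or LF.
_LINE_END = re.compile(r"\r\n?|\n")


def read_source_bytes(path: str) -> bytes:
    """Open the file in binary mode so the OS performs no newline translation."""
    with open(path, "rb") as handle:
        return handle.read()


def load_source_text(data: bytes) -> str:
    """Decode as UTF-8 and convert every recognized line ending to LF."""
    # Strict decoding raises UnicodeDecodeError on input that is not UTF-8
    # (for example, Latin-1 text containing bytes above 0x7F).
    text = data.decode("utf-8")
    return _LINE_END.sub("\n", text)


def count_line_endings(text: str) -> int:
    """Count line endings the way the pattern sees them."""
    return len(_LINE_END.findall(text))
```

With this pattern, `count_line_endings("\r\n")` is 1 while `count_line_endings("\n\r")` is 2, which is the LFCR behavior described above: the LF and the CR are each matched as a separate line ending.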
A few other notes about lexical analysis:
- By design, lexical analysis will be completely independent of parsing. No lexer hack or similar will be allowed. The sole exception is a debugging context, to allow for better error messages (if needed).
- An effort will be made to keep the lexical grammar completely regular (to allow DFA-based implementations). There is one potential issue here; we'll see how it works out. Keeping the grammar regular might, however, eliminate the option of nested comments, since arbitrarily deep nesting cannot be matched by a regular grammar.
- Whitespace consists of the characters /[\r\n\t\v\f ]/. This is the complete list. Comments also count as whitespace, for the sake of separating tokens.
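As a small illustration of the whitespace rule, the following Python sketch skips exactly the six listed characters and nothing else. The names `WHITESPACE` and `skip_whitespace` are hypothetical, and comments are deliberately left out, since their syntax is not defined in this section.

```python
# The complete whitespace set from the text: CR, LF, tab, vertical tab,
# form feed, and space.
WHITESPACE = frozenset("\r\n\t\v\f ")


def skip_whitespace(text: str, pos: int) -> int:
    """Return the index of the first non-whitespace character at or after pos."""
    while pos < len(text) and text[pos] in WHITESPACE:
        pos += 1
    return pos
```

An explicit set is used instead of a general helper such as Python's `str.isspace`, because the latter also accepts Unicode spaces like U+00A0, which are not in the list above.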