
1.1 Notation

The descriptions of lexical analysis and syntax use a modified BNF grammar notation where necessary. This uses the following style of definition:

name: lc_letter (lc_letter | "_")*
lc_letter: "a"..."z"

The first line says that a name is an lc_letter followed by a sequence of zero or more lc_letters and underscores. An lc_letter in turn is any of the single characters "a" through "z". (This rule is actually adhered to for the names defined in lexical and grammar rules in this document.)
Each rule begins with a name (which is the name defined by the rule) and a colon. A vertical bar (|) is used to separate alternatives; it is the least binding operator in this notation. A star (*) means zero or more repetitions of the preceding item; likewise, a plus (+) means one or more repetitions, and a phrase enclosed in square brackets ([ ]) means zero or one occurrences (in other words, the enclosed phrase is optional). The * and + operators bind as tightly as possible; parentheses are used for grouping. Literal strings are enclosed in quotes. White space is only meaningful to separate tokens. Rules are normally contained on a single line; rules with many alternatives may be formatted alternatively with each line after the first beginning with a vertical bar.

In lexical definitions (such as the example above), two more conventions are used: Two literal characters separated by three dots mean a choice of any single character in the given (inclusive) range of ASCII characters. A phrase between angle brackets (<...>) gives an informal description of the symbol defined; e.g., this could be used to describe the notion of "control character" if needed.
Even though the notation used is almost the same, there is a big difference between the meaning of lexical and syntactic definitions: a lexical definition operates on the individual characters of the input source, while a syntax definition operates on the stream of tokens generated by the lexical analysis. All uses of BNF in the next chapter “Lexical Analysis” are lexical definitions; uses in subsequent chapters are syntactic definitions.
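
As an illustration only, the name rule described above can be transcribed into a regular expression. The sketch below is written in Python, which is not part of emBASIC, and the names NAME_RULE and is_name are made up for this example.

```python
import re

# name: lc_letter (lc_letter | "_")*
# lc_letter: "a"..."z"
NAME_RULE = re.compile(r"[a-z][a-z_]*")

def is_name(text):
    """Return True if the whole of text matches the 'name' rule."""
    return NAME_RULE.fullmatch(text) is not None
```

Under this rule lc_letter is itself a valid name, while a string that starts with an underscore or contains an upper-case letter is not.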

2 Lexical Analysis

An emBASIC program is read by a parser. Input to the parser is a stream of tokens, generated by the lexical analyzer. This chapter describes how the lexical analyzer breaks entered text into tokens.
emBASIC uses the 7-bit ASCII character set for program text and string literals. 8-bit characters may be used in string literals and comments but their interpretation is platform dependent; the proper way to insert 8-bit characters in string literals is by using octal or hexadecimal escape sequences.
The run-time character set depends on the I/O devices connected to the target but is generally a superset of ASCII.
Future compatibility note: It may be tempting to assume that the character set for 8-bit characters is ISO Latin-1 (an ASCII superset that covers most western languages that use the Latin alphabet), but it is possible that in the future a Unicode text editor used in the emBASIC WorkShop application will become common. Future implementations will allow use of the UTF-8 encoding, which is also an ASCII superset, but with a very different use for the characters with ordinals 128-255.

2.1 Line structure

An emBASIC program is divided into a number of logical lines. Each line is parsed and interpreted once it has been read from the input stream. The input stream can be either serial line input or a BitBUS message sent by emBASIC WorkShop.

2.1.1 Logical lines

The end of a logical line is represented by the token EOL. Statements cannot cross logical line boundaries except where EOL is allowed by the syntax (e.g., between statements in compound statements). A logical line is constructed from one or more physical lines by following the explicit line joining rules.

2.1.2 Physical lines

A physical line ends in whatever the current platform's convention is for terminating lines. On the current platform, this is the ASCII LF (linefeed) character.

2.1.3 Comments

A comment starts with a REM token that is not part of a string literal, and ends at the end of the physical line. A comment entered after program code in the same line ends the logical line. Comments are ignored by the syntax; they are not tokens. A line starting with * or // is also regarded as a comment line.

2.1.4 Explicit line joining

Two or more physical lines may be joined into logical lines using backslash characters (\), as follows: when a physical line ends in a backslash that is not part of a string literal or comment, it is joined with the following line, forming a single logical line; the backslash and the following end-of-line character are deleted. For example:

if 1900 < year < 2100 and 1 <= month <= 12 \
and 1 <= day <= 31 and 0 <= hour < 24 \
and 0 <= minute < 60 and 0 <= second < 60 \
then return 1

A line ending in a backslash cannot carry a comment. A backslash does not continue a comment. Tokens or string literals cannot be split such that part of the token or literal is placed on the continuation line.
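
The joining rule can be sketched as follows. This is a Python illustration, not the emBASIC implementation; the function name join_physical_lines is hypothetical, the input is assumed to be a list of physical lines with their line terminators already removed, and a trailing backslash is assumed never to sit inside a string literal or comment.

```python
def join_physical_lines(physical_lines):
    """Join physical lines ending in a backslash into logical lines.

    Assumption: a trailing backslash is never inside a string literal
    or a comment; a real lexer would have to check that first.
    """
    logical_lines = []
    pending = ""
    for line in physical_lines:
        if line.endswith("\\"):
            # Drop the backslash and the line end, keep accumulating.
            pending += line[:-1]
        else:
            logical_lines.append(pending + line)
            pending = ""
    if pending:
        # Input ended on a continuation line; emit what was collected.
        logical_lines.append(pending)
    return logical_lines
```

Applied to the four physical lines of the example above, this yields a single logical line that starts with "if 1900 < year" and ends with "then return 1".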

2.1.5 Line numbering

emBASIC does not need line numbers. It may generate them on demand and signal errors with line numbers (or line offsets from the last label or procedure start) but it uses line numbers only internally. Branches depend on labels and function/procedure names.

2.1.6 Blank lines

A logical line that contains only spaces, tabs, formfeeds and possibly a comment is not ignored (i.e., an eol element is generated). During interactive input of statements, handling of a blank line may differ depending on the implementation of the WorkShop. In the current implementation, an entirely blank logical line (i.e., one containing not even whitespace or a comment) instructs the TCE generator to emit an eol into the code segment.
This is necessary for line counting and to make sure an offline edited program matches the one uploaded to the target controller.

2.1.7 Indentation

Leading whitespace (spaces and tabs) at the beginning of a logical line is used to provide the indentation level of the line.
Tabs and spaces are ignored by the parser but remain in the source code held in the WorkShop window. If the program is later uploaded from the target, the indentation is lost.
Here is an example of a correctly indented piece of emBASIC code:

declare i as long
declare j as long
declare s as double

print at (1,1) clear;

for k=1 to 20
   s=0.
   j = k * 1000
   for i=1 to j
      s = s + 1 / i
   next i
   print at (k, k) "sum("; j; ")="; s
next k

2.1.8 Whitespace between tokens

Except at the beginning of a logical line or in string literals, the whitespace characters space, tab and formfeed should be used to separate tokens. Whitespace is always needed between two tokens if their concatenation could otherwise be interpreted as a different token (e.g., ab is one token, but a b is two tokens).

2.1.9 Other tokens

Besides EOL, the following categories of tokens exist: identifiers, keywords, literals, operators, and delimiters. Whitespace characters (other than line terminators, discussed earlier) are not tokens, but serve to delimit tokens. Where ambiguity exists, a token comprises the longest possible string that forms a legal token, when read from left to right.

2.2 Identifiers and keywords

Identifiers can contain the letters A..Z or a..z, the digits 0..9, and the underscore “_”, dollar “$” or percent “%” characters. The length of an identifier is limited to 32 characters. An identifier must not begin with a digit and must not be identical to a reserved word. Case is not significant.
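
These constraints can be checked mechanically. The following Python sketch is an illustration under assumptions, not the emBASIC lexer: RESERVED_SUBSET holds only a handful of the keywords from section 2.2.1, the function name is hypothetical, and identifiers are allowed to start with “_”, “$” or “%” because the text only rules out a leading digit.

```python
import re

# Small, illustrative subset of the reserved words listed in section 2.2.1.
RESERVED_SUBSET = {"AND", "FOR", "NEXT", "PRINT", "REM", "THEN"}

# Letters, digits, "_", "$" and "%"; the first character must not be a digit.
IDENTIFIER = re.compile(r"[A-Za-z_$%][A-Za-z0-9_$%]*")

MAX_LENGTH = 32

def is_valid_identifier(text):
    """Apply the length, character and reserved-word checks in turn."""
    if len(text) > MAX_LENGTH:
        return False
    if IDENTIFIER.fullmatch(text) is None:
        return False
    # Case is not significant, so compare in upper case.
    return text.upper() not in RESERVED_SUBSET
```

With these rules count_1 and name$ are accepted, while 1count (leading digit) and print (reserved word, in any case) are rejected.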

2.2.1 Keywords

The following identifiers are used as reserved words, or keywords of the language, and cannot be used as ordinary identifiers. They must be spelled exactly as written here:

ALL AND ASM BAND BNAND
BNOR BNOT BNXOR BOR BREAK
BUS BXOR CALL CASE CLEAR
CMD COMMAND CONST CONTINUE DATA
DECL DECLARE DEF DEFAULT DEFINE
DIV ELSE ELSEIF ELSIF END
ENDDO ENDFUNC ENDPROC ENDTASK ENDTYPE
ERROR FMT FOR FUNCTION GOSUB
GOTO INCHAR INKEY INPUT LET
LOCAL LOOP MAX MESSAGE MOD
MSG NAND NEW NEXT NOR
NOT NXOR POKE PRINT PRIO
PROCEDURE PUBLIC REF REM REPEAT
REPLY RETURN SEND SEQUENCE STATIC
STEP STRUCT STRUCTURE SWITCH TASK
THEN TYPE UNTIL USING WAIT
WHILE WITH XOR INTEGER INCHAR$
INKEY$ FALSE TRUE    

 

2.2.2 Reserved classes of identifiers

Certain classes of identifiers (besides keywords) have special meanings. These are system variables and system constants. The prefix character "@" identifies system variables and the prefix character "$" identifies system constants.
The following identifiers belong to these reserved classes:

@CONSOLE @TIMEFAC @ERRCODE @TASKNAME @DEBUG @YYDEBUG @SER0_SPEED
@SER0_MODE @SER0_PARITY @SER0_HANDSHAKE @SER1_SPEED @SER1_MODE
@SER1_PARITY @SER1_HANDSHAKE @TASKLINE @MEMORY8 @MEMORY16 @MEMORY32

$VERSION $BUILD $COUNTRY $TERMINAL $TIMEZONE $CPUBUS $TSMBUS $ECBBUS $CANBUS $I2CBUS $DIN $DOUT $AIN $AOUT $PWM $EVTCNT $FREQ $POS

$CFG_START_SIMULATION $CFG_STOP_SIMULATION $CFG_DOWN $CFG_ENABLE $CFG_SET_CHANNEL_RANGE $CFG_SET_CONV_SPEED $CFG_SET_GAIN $CFG_SET_ATTENTUATE $CFG_SET_OFFSET $CFG_SET_LINTAB $CFG_SET_PWM_FREQ $CFG_SET_INC_MODE $CFG_SET_DIR $CFG_SSI_SET_TURNS $CFG_SSI_SET_STEPS
XP_CFG_SET_LOWTIME
$RANGE_RAW $RANGE_RAW_U_10000 $RANGE_RAW_U_5000 $RANGE_RAW_S_10000 $RANGE_RAW_S_5000 $RANGE_U_10000 $RANGE_U_5000 $RANGE_S_10000 $RANGE_S_5000 $RANGE_UB $RANGE_020mA $RANGE_420mA $RANGE_PT100V4 $RANGE_KTY $RANGE_KTY10 $RANGE_KTY81 $RANGE_LM34 $RANGE_THERMO_K $RANGE_THERMO_J

$BGM_MODE_FIFO $BGM_MODE_LIFO $BGM_MODE_RING $BGM_MODE_RANDOM $BGM_MODE_FIT $BGM_CMD_GETSIZE $BGM_CMD_SETSIZE $BGM_CMD_FORMAT

$COM_BR76800 $COM_BR57600 $COM_BR38400 $COM_BR19200 $COM_BR9600 $COM_BR4800 $COM_BR2400 $COM_BR1200 $COM_BR300 $COM_NONE $COM_EVEN $COM_ODD $COM_MARK $COM_SPACE $COM_7BPC $COM_8BPC $COM_9MARK $COM_9SPACE $COM_9MARKONFIRST $COM_NOHS $COM_RTS $COM_XON

2.2.3 Literals

Literals are notations for constant values of some built-in types.

2.2.4 String literals

String literals can be enclosed in matching single quotes (') or double quotes ("). They can also be enclosed in matching groups of three single or double quotes (these are generally referred to as triple-quoted strings). The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character.
Non-printing characters can be specified by their hex representation after a backslash, as in "test\13\10", or by the function CHAR(9), where the number in parentheses is interpreted as decimal unless it carries a B or H suffix.

2.2.5 String literal concatenation

Multiple adjacent string literals (delimited by “+”), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation.
Thus, "hello" + 'world' is equivalent to "helloworld". Note that this feature is defined at the syntactical level, but implemented at run time. The `+' operator must be used to concatenate string expressions at run time.


2.2.6 Numeric literals

There are two types of numeric literals: plain integers and floating point numbers.
Numeric literals are interpreted as decimal by default. Literals ending in “B” (01010101B) are interpreted as binary. Literals ending in “H” with a leading zero, or prefixed with 0x, are interpreted as hexadecimal (0F7H or 0xF7).
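
The three notations can be told apart by prefix and suffix. The Python sketch below illustrates one possible order of checks and is not taken from the emBASIC sources: the function name is hypothetical, the 0x prefix is tested before the suffixes so that a literal such as 0x1B is read as hexadecimal rather than binary, and accepting lower-case suffix letters is an assumption.

```python
def parse_integer_literal(text):
    """Return the value of a decimal, binary or hexadecimal literal."""
    upper = text.upper()
    if upper.startswith("0X"):
        return int(upper[2:], 16)    # 0xF7
    if upper.endswith("H"):
        return int(upper[:-1], 16)   # 0F7H
    if upper.endswith("B"):
        return int(upper[:-1], 2)    # 01010101B
    return int(upper, 10)            # plain decimal
```

For the examples in the text, 01010101B evaluates to 85, and both 0F7H and 0xF7 evaluate to 247.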

2.2.7 Integer and long integer literals

Plain integer decimal literals must be at most 32767 (i.e., the largest positive integer, using 16-bit arithmetic). To use a long integer literal explicitly, a cast can be applied:

val = (long)2147483647 // the largest positive long integer

2.2.8 Floating point literals

If a numeric literal contains a dot (.319), it is considered a float literal. An “e” or “E” in a float literal indicates that the number is multiplied by 10 raised to the power of the number following the E (-0.4567e-13).
Some examples of floating point literals:

3.14 10. .001 1e100 3.14e-10 0e0

Note that numeric literals do not include a sign; a phrase like -1 is actually an expression composed of the operator - and the literal 1.
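
One way to recognise the example spellings above is a regular expression. The pattern below is a Python sketch inferred from those six examples only; it is an assumption, not the grammar used by emBASIC, and the names FLOAT_LITERAL and is_float_literal are hypothetical.

```python
import re

FLOAT_LITERAL = re.compile(
    r"(\d+\.\d*|\.\d+)([eE][+-]?\d+)?"   # digits with a dot, optional exponent
    r"|\d+[eE][+-]?\d+"                  # digits without a dot, exponent required
)

def is_float_literal(text):
    """Return True if the whole of text looks like a float literal."""
    return FLOAT_LITERAL.fullmatch(text) is not None

EXAMPLES = ["3.14", "10.", ".001", "1e100", "3.14e-10", "0e0"]
```

All six examples match. A plain 42 does not, because it has neither a dot nor an exponent, and -1.0 does not, because the sign is an operator rather than part of the literal.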


2.2.9 Operators

The following tokens are operators:

+
-
*
/
%
^
**
<<
>>
&
|
~
||
&&
<
>
<=
>=
==
!=
<>
+=
-=
*=
/=
^=

The comparison operators <> and != are alternate spellings of the same operator. != is the preferred spelling; <> is deprecated.
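
Several operators share a prefix with a shorter one (for example * and **, or < and <=), so the longest-match rule from section 2.1.9 decides how they are split. The Python sketch below shows that rule applied to the operator list above; it is an illustration with a hypothetical function name, it handles only operators and spaces, and it is not the emBASIC lexer.

```python
# Operator tokens from the list above.
OPERATORS = [
    "+", "-", "*", "/", "%", "^", "**", "<<", ">>", "&", "|", "~",
    "||", "&&", "<", ">", "<=", ">=", "==", "!=", "<>",
    "+=", "-=", "*=", "/=", "^=",
]

# Try longer operators first so that "**" is not read as "*" "*".
BY_LENGTH = sorted(OPERATORS, key=len, reverse=True)

def split_operators(text):
    """Split a string made up only of operators and spaces into tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] == " ":
            pos += 1
            continue
        for op in BY_LENGTH:
            if text.startswith(op, pos):
                tokens.append(op)
                pos += len(op)
                break
        else:
            raise ValueError("no operator at position %d" % pos)
    return tokens
```

Under this scheme *** is split into ** followed by *, while * * with a space in between gives two separate * tokens.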