Lexer component documentation
=============================

The lexer is responsible for providing tokens to the parser. The project comes with two lexers: `PhpParser\Lexer` and
`PhpParser\Lexer\Emulative`. The latter is an extension of the former, which adds the ability to emulate tokens of
newer PHP versions and thus allows parsing of new code on older versions.

This documentation discusses options available for the default lexers and explains how lexers can be extended.

Lexer options
-------------

The two default lexers accept an `$options` array in the constructor. Currently, only the `'usedAttributes'` option is
supported, which allows you to specify which attributes will be added to the AST nodes. The attributes can then be
accessed using the `$node->getAttribute()`, `$node->setAttribute()`, `$node->hasAttribute()` and `$node->getAttributes()`
methods. A sample options array:

```php
$lexer = new PhpParser\Lexer(array(
    'usedAttributes' => array(
        'comments', 'startLine', 'endLine'
    )
));
```

The attributes used in this example match the default behavior of the lexer. The following attributes are supported:

 * `comments`: Array of `PhpParser\Comment` or `PhpParser\Comment\Doc` instances, representing all comments that occurred
   between the previous non-discarded token and the current one. Use of this attribute is required for the
   `$node->getComments()` and `$node->getDocComment()` methods to work. The attribute is also needed if you want the pretty
   printer to retain comments present in the original code.
 * `startLine`: Line in which the node starts. This attribute is required for `$node->getLine()` to work. It is also
   required if syntax errors should contain line number information.
 * `endLine`: Line in which the node ends. Required for `$node->getEndLine()`.
 * `startTokenPos`: Offset into the token array of the first token in the node. Required for `$node->getStartTokenPos()`.
 * `endTokenPos`: Offset into the token array of the last token in the node. Required for `$node->getEndTokenPos()`.
 * `startFilePos`: Offset into the code string of the first character that is part of the node. Required for `$node->getStartFilePos()`.
 * `endFilePos`: Offset into the code string of the last character that is part of the node. Required for `$node->getEndFilePos()`.
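
As a rough illustration of how the position attributes can be used, the following sketch extracts the original source text of a node from its `startFilePos` and `endFilePos` attributes. The sample code string and the variable names are made up for this example. Since `endFilePos` points at the last character of the node, the length is `end - start + 1`.

```php
use PhpParser\Lexer;
use PhpParser\ParserFactory;

$code = '<?php echo 1 + 2;';

// Request the file position attributes in addition to the line attributes.
$lexer = new Lexer(array(
    'usedAttributes' => array(
        'startLine', 'endLine', 'startFilePos', 'endFilePos'
    )
));
$parser = (new ParserFactory)->create(ParserFactory::ONLY_PHP7, $lexer);
$stmts = $parser->parse($code);

// $stmts[0] is the echo statement; its first expression is `1 + 2`.
$expr = $stmts[0]->exprs[0];
$start = $expr->getAttribute('startFilePos');
$end = $expr->getAttribute('endFilePos');

// endFilePos is inclusive, hence the + 1.
$source = substr($code, $start, $end - $start + 1);
```

With this input, `$source` should hold the text `1 + 2`, exactly as it was written in the original code.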

### Using token positions

> **Note:** The example in this section is outdated in that this information is directly available in the AST: While
> `$property->isPublic()` does not distinguish between `public` and `var`, directly checking
> `($property->flags & Class_::VISIBILITY_MODIFIER_MASK) === 0` allows making this distinction without resorting to
> tokens. However, the general idea behind the example still applies in other cases.
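
The flag check mentioned in the note could look roughly as follows. The helper name `hasExplicitVisibility()` is hypothetical. The check only tells you that no visibility keyword was written, which is the case for `var`, but also for a property declared with only `static`.

```php
use PhpParser\Node\Stmt\Class_;
use PhpParser\Node\Stmt\Property;

// Hypothetical helper: true if the property was declared with
// public, protected or private; false for `var $x;` or `static $x;`.
function hasExplicitVisibility(Property $prop): bool {
    return ($prop->flags & Class_::VISIBILITY_MODIFIER_MASK) !== 0;
}
```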

The token offset information is useful if you wish to examine the exact formatting used for a node. For example, the AST
does not distinguish whether a property was declared using `public` or using `var`, but you can retrieve this
information based on the token position:

```php
function isDeclaredUsingVar(array $tokens, PhpParser\Node\Stmt\Property $prop) {
    $i = $prop->getAttribute('startTokenPos');
    return $tokens[$i][0] === T_VAR;
}
```

In order to make use of this function, you will have to provide the tokens from the lexer to your node visitor using
code similar to the following:

```php
class MyNodeVisitor extends PhpParser\NodeVisitorAbstract {
    private $tokens;
    public function setTokens(array $tokens) {
        $this->tokens = $tokens;
    }

    public function leaveNode(PhpParser\Node $node) {
        if ($node instanceof PhpParser\Node\Stmt\Property) {
            var_dump(isDeclaredUsingVar($this->tokens, $node));
        }
    }
}

$lexer = new PhpParser\Lexer(array(
    'usedAttributes' => array(
        'comments', 'startLine', 'endLine', 'startTokenPos', 'endTokenPos'
    )
));
$parser = (new PhpParser\ParserFactory)->create(PhpParser\ParserFactory::ONLY_PHP7, $lexer);

$visitor = new MyNodeVisitor();
$traverser = new PhpParser\NodeTraverser();
$traverser->addVisitor($visitor);

try {
    $stmts = $parser->parse($code);
    $visitor->setTokens($lexer->getTokens());
    $stmts = $traverser->traverse($stmts);
} catch (PhpParser\Error $e) {
    echo 'Parse Error: ', $e->getMessage();
}
```

The same approach can also be used to perform specific modifications in the code without changing the formatting in
other places, which is what happens when the code is regenerated with the pretty printer.
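
A minimal sketch of such a targeted modification is shown below. The helper `replaceTokenRange()` is hypothetical and not part of the library. It assumes that concatenating all tokens reproduces the original code, and it uses PHP's built-in `token_get_all()` with hard-coded positions to stay self-contained. In practice, the token array would come from `$lexer->getTokens()` and the positions from the `startTokenPos` and `endTokenPos` attributes of the node to be replaced.

```php
/**
 * Hypothetical helper: rebuilds the source code from a token array in
 * token_get_all() format, replacing the tokens from $startPos to $endPos
 * (inclusive) with $replacement. All other tokens are emitted unchanged.
 */
function replaceTokenRange(array $tokens, int $startPos, int $endPos, string $replacement): string {
    $code = '';
    foreach ($tokens as $i => $token) {
        if ($i === $startPos) {
            $code .= $replacement;
        }
        if ($i >= $startPos && $i <= $endPos) {
            continue; // drop the original tokens of the replaced range
        }
        // Tokens are either [id, text, line] arrays or single-character strings.
        $code .= is_array($token) ? $token[1] : $token;
    }
    return $code;
}

$tokens = token_get_all('<?php $a = 1;');
// Token 5 is the integer literal `1`; whitespace around it is left untouched.
$newCode = replaceTokenRange($tokens, 5, 5, '42');
```

Because untouched tokens are copied verbatim, whitespace and comments outside the replaced range keep their original form.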

Lexer extension
---------------

A lexer has to define the following public interface:

```php
function startLexing(string $code, ErrorHandler $errorHandler = null): void;
function getTokens(): array;
function handleHaltCompiler(): string;
function getNextToken(string &$value = null, array &$startAttributes = null, array &$endAttributes = null): int;
```

The `startLexing()` method is invoked whenever the `parse()` method of the parser is called and is passed the source
code that is to be lexed (including the opening tag). It can be used to reset state or preprocess the source code or tokens. The
passed `ErrorHandler` should be used to report lexing errors.

The `getTokens()` method returns the current token array, in the usual `token_get_all()` format. This method is not
used by the parser (which uses `getNextToken()`), but is useful in combination with the token position attributes.
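
For reference, each element of a `token_get_all()`-style array is either an array holding the token ID, the token text and the line number, or a plain string for single-character tokens. The sketch below calls PHP's built-in `token_get_all()` directly rather than the lexer, and the helper name `describeTokens()` is made up for this example.

```php
// Hypothetical helper: maps every token to a readable name.
function describeTokens(array $tokens): array {
    $names = array();
    foreach ($tokens as $token) {
        // Array tokens carry a numeric ID at index 0; string tokens are the character itself.
        $names[] = is_array($token) ? token_name($token[0]) : $token;
    }
    return $names;
}

$names = describeTokens(token_get_all('<?php $a = 1;'));
// $names now lists T_OPEN_TAG, T_VARIABLE, T_WHITESPACE, "=", T_WHITESPACE, T_LNUMBER and ";".
```

Note that whitespace tokens are part of the array, which is why token positions cannot be computed by counting only the tokens the parser sees.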

The `handleHaltCompiler()` method is called whenever a `T_HALT_COMPILER` token is encountered. It has to return the
remaining string after the construct (not including `();`).

The `getNextToken()` method returns the ID of the next token (as defined by the `Parser::T_*` constants). If no more
tokens are available, it must return `0`, which is the ID of the `EOF` token. Furthermore, the string content of the
token should be written into the by-reference `$value` parameter (which will then be available as `$n` in the parser).
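
The contract of `getNextToken()` can be observed by driving one of the default lexers by hand. This is only a sketch under the assumption that the default lexer is used without options and that it skips whitespace, comments and the opening tag. The variable names are chosen for illustration.

```php
use PhpParser\Lexer;

$lexer = new Lexer();
$lexer->startLexing('<?php echo 1;');

$values = array();
// getNextToken() returns 0 (the EOF token) once the input is exhausted.
// The token text is written into $value by reference on every call.
while (0 !== ($id = $lexer->getNextToken($value))) {
    $values[] = $value;
}
```

After the loop, `$values` should contain the text of the tokens that were handed out, in source order.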

### Attribute handling

The other two by-ref variables `$startAttributes` and `$endAttributes` define which attributes will eventually be
assigned to the generated nodes: The parser will take the `$startAttributes` from the first token which is part of the
node and the `$endAttributes` from the last token that is part of the node.

For example, if the tokens `T_FUNCTION T_STRING ... '{' ... '}'` constitute a node, then the `$startAttributes` from the
`T_FUNCTION` token will be taken and the `$endAttributes` from the `'}'` token.

An application of custom attributes is storing the exact original formatting of literals: While the parser does retain
some information about the formatting of integers (like decimal vs. hexadecimal) or strings (like used quote type), it
does not preserve the exact original formatting (e.g. leading zeros for integers or escape sequences in strings). This
can be remedied by storing the original value in an attribute:

```php
use PhpParser\Lexer;
use PhpParser\Parser\Tokens;

class KeepOriginalValueLexer extends Lexer // or Lexer\Emulative
{
    // The return type matches the getNextToken() signature of the lexer interface shown above.
    public function getNextToken(&$value = null, &$startAttributes = null, &$endAttributes = null): int {
        $tokenId = parent::getNextToken($value, $startAttributes, $endAttributes);

        if ($tokenId === Tokens::T_CONSTANT_ENCAPSED_STRING   // non-interpolated string
            || $tokenId === Tokens::T_ENCAPSED_AND_WHITESPACE // interpolated string
            || $tokenId === Tokens::T_LNUMBER                 // integer
            || $tokenId === Tokens::T_DNUMBER                 // floating point number
        ) {
            // could also use $startAttributes, doesn't really matter here
            $endAttributes['originalValue'] = $value;
        }

        return $tokenId;
    }
}
```
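
Once such a lexer is passed to the parser, the stored value should be available through `getAttribute()` on the corresponding scalar nodes. The helper below is a hypothetical sketch, not part of the library. It accepts any lexer instance and expects code consisting of a single `echo` statement with one literal.

```php
use PhpParser\Lexer;
use PhpParser\ParserFactory;

// Hypothetical helper: parses an `echo <literal>;` statement with the given lexer and
// returns the 'originalValue' attribute of the echoed expression (null if it is not set).
function originalValueOfEchoedLiteral(Lexer $lexer, string $code) {
    $parser = (new ParserFactory)->create(ParserFactory::ONLY_PHP7, $lexer);
    $stmts = $parser->parse($code);
    return $stmts[0]->exprs[0]->getAttribute('originalValue');
}

// Assuming the KeepOriginalValueLexer class from above has been loaded:
// originalValueOfEchoedLiteral(new KeepOriginalValueLexer(), '<?php echo 0x1F;');
```

With the default lexer the helper returns `null`, since no `originalValue` attribute is ever set.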