Package details

dt-sql-parser

DTStack · 78.1k · MIT · 4.3.1

SQL Parsers for BigData, built with antlr4

antlr4, sql, parser, code-completion

Readme

dt-sql-parser


English | 简体中文

dt-sql-parser is a SQL parser project built with ANTLR4, aimed mainly at the BigData field. ANTLR4 generates the basic Parser, Visitor, and Listener, which makes it easy to implement features such as lexing, parsing, and traversing the AST.

Additionally, it provides advanced features such as SQL validation, code completion, and collecting tables and columns from SQL.

Supported SQL:

  • MySQL
  • Flink
  • Spark
  • Hive
  • PostgreSQL
  • Trino
  • Impala

[!TIP] This project targets TypeScript by default, but you can also try compiling it to other languages if you need.


Integrating SQL Parser with Monaco Editor

We also provide monaco-sql-languages to easily integrate dt-sql-parser with monaco-editor.


Installation

# use npm
npm i dt-sql-parser --save

# use yarn
yarn add dt-sql-parser


Usage

We recommend learning the fundamental usage before continuing. The dt-sql-parser library provides SQL classes for different SQL dialects.

import { MySQL, FlinkSQL, SparkSQL, HiveSQL, PostgreSQL, TrinoSQL, ImpalaSQL } from 'dt-sql-parser';

Before using syntax validation, code completion, or other features, you need to instantiate the Parser of the relevant SQL dialect. Take MySQL as an example:

const mysql = new MySQL();

The following usage examples use MySQL; the Parsers for other SQL dialects are used in the same way.

Syntax Validation

First create a Parser instance, then call the validate method on the SQL instance to validate the SQL content. If validation fails, it returns an array of error messages.

import { MySQL } from 'dt-sql-parser';

const mysql = new MySQL();
const incorrectSql = 'selec id,name from user1;';
const errors = mysql.validate(incorrectSql);

console.log(errors);

output:

/*
[
  {
    endCol: 5,
    endLine: 1,
    startCol: 0,
    startLine: 1,
    message: "..."
  }
]
*/
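The error objects can be mapped straight onto editor markers or log lines. A minimal sketch; `ParseError` is a hypothetical narrowed shape of the objects shown above, not a type exported by the library:

```typescript
// Minimal shape of a validation error, as seen in the output above.
interface ParseError {
    startLine: number;
    startCol: number;
    message: string;
}

// Render each validation error as one human-readable line.
function formatErrors(errors: ParseError[]): string[] {
    return errors.map((e) => `line ${e.startLine}, col ${e.startCol}: ${e.message}`);
}
```

Each resulting string can then be fed to a logger or shown in the editor gutter.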

Tokenizer

Call the getAllTokens method on the SQL instance:

import { MySQL } from 'dt-sql-parser';

const mysql = new MySQL()
const sql = 'select id,name,sex from user1;'
const tokens = mysql.getAllTokens(sql)

console.log(tokens)

output:

/*
[
  {
    channel: 0,
    column: 0,
    line: 1,
    source: [SqlLexer, InputStream],
    start: 0,
    stop: 5,
    tokenIndex: -1,
    type: 137,
    _text: null
  },
  ...
]
*/
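Each token's `start`/`stop` fields are inclusive character offsets into the input, so the original text of a token can be recovered by slicing. A small sketch; `TokenLike` is a hypothetical narrowed shape, not the library's full Token type:

```typescript
// Minimal shape we rely on; the real ANTLR4 Token carries more fields.
interface TokenLike {
    start: number; // inclusive start offset into the input
    stop: number;  // inclusive end offset into the input
}

// Recover the raw text of a token from the original SQL string.
function tokenText(sql: string, token: TokenLike): string {
    // `stop` is inclusive, so extend the slice by one character.
    return sql.slice(token.start, token.stop + 1);
}
```

For the first token above (`start: 0`, `stop: 5`), `tokenText` returns `select`.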

Visitor

Traverse tree nodes with a Visitor:

import { MySQL, MySqlParserVisitor } from 'dt-sql-parser';

const mysql = new MySQL();
const sql = `select id, name from user1;`;
const parseTree = mysql.parse(sql);

class MyVisitor extends MySqlParserVisitor<string> {
    defaultResult(): string {
        return '';
    }
    aggregateResult(aggregate: string, nextResult: string): string {
        return aggregate + nextResult;
    }
    visitProgram = (ctx) => {
        return this.visitChildren(ctx);
    };
    visitTableName = (ctx) => {
        return ctx.getText();
    };
}
const visitor = new MyVisitor();
const result = visitor.visit(parseTree);

console.log(result);

output:

/*
user1
*/

Listener

Access a specified node in the AST with a Listener:

import { MySQL, MySqlParserListener } from 'dt-sql-parser';

const mysql = new MySQL();
const sql = 'select id, name from user1;';
const parseTree = mysql.parse(sql);

class MyListener extends MySqlParserListener {
    result = '';
    enterTableName = (ctx): void => {
        this.result = ctx.getText();
    };
}

const listener = new MyListener();
mysql.listen(listener, parseTree);

console.log(listener.result)

output:

/*
user1
*/

Splitting SQL statements

Take FlinkSQL as an example, call the splitSQLByStatement method on the SQL instance:

import { FlinkSQL } from 'dt-sql-parser';

const flink = new FlinkSQL();
const sql = 'SHOW TABLES;\nSELECT * FROM tb;';
const sqlSlices = flink.splitSQLByStatement(sql);

console.log(sqlSlices)

output:

/*
[
  {
    startIndex: 0,
    endIndex: 11,
    startLine: 1,
    endLine: 1,
    startColumn: 1,
    endColumn: 12,
    text: 'SHOW TABLES;'
  },
  {
    startIndex: 13,
    endIndex: 29,
    startLine: 2,
    endLine: 2,
    startColumn: 1,
    endColumn: 17,
    text: 'SELECT * FROM tb;'
  }
]
*/
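The slice positions make it easy to map an editor location back to its statement. A sketch, assuming the slices carry the 1-based `startLine`/`endLine` fields shown above; `StmtSlice` is a hypothetical narrowed shape:

```typescript
// Minimal shape of a statement slice, as seen in the output above.
interface StmtSlice {
    startLine: number;
    endLine: number;
    text: string;
}

// Find the statement slice that covers a given (1-based) line number.
function statementAtLine(slices: StmtSlice[], line: number): StmtSlice | undefined {
    return slices.find((s) => line >= s.startLine && line <= s.endLine);
}
```

With the two slices above, line 2 resolves to the `SELECT * FROM tb;` statement.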

Code Completion

Obtain code completion information at a specified position in the SQL.

Call the getSuggestionAtCaretPosition method on the SQL instance, passing the SQL content and the line and column numbers of the position where code completion is desired. See the CaretPosition Of Code Completion section below for additional explanations about CaretPosition.

  • keyword candidates list

      import { FlinkSQL } from 'dt-sql-parser';
    
      const flink = new FlinkSQL();
      const sql = 'CREATE ';
      const pos = { lineNumber: 1, column: 8 }; // the end position
      const keywords = flink.getSuggestionAtCaretPosition(sql, pos)?.keywords;
    
      console.log(keywords);

    output:

      /*
      [ 'CATALOG', 'FUNCTION', 'TEMPORARY', 'VIEW', 'DATABASE', 'TABLE' ] 
      */
  • Obtaining information related to grammar completion

     import { FlinkSQL } from 'dt-sql-parser';
    
     const flink = new FlinkSQL();
     const sql = 'SELECT * FROM tb';
     const pos = { lineNumber: 1, column: 16 }; // after 'tb'
     const syntaxSuggestions = flink.getSuggestionAtCaretPosition(sql, pos)?.syntax;
    
     console.log(syntaxSuggestions);

    output:

     /*
     [
       {
         syntaxContextType: 'table',
         wordRanges: [
           {
             text: 'tb',
             startIndex: 14,
             stopIndex: 15,
             line: 1,
             startColumn: 15,
             stopColumn: 16
           }
         ]
       },
       {
         syntaxContextType: 'view',
         wordRanges: [
           {
             text: 'tb',
             startIndex: 14,
             stopIndex: 15,
             line: 1,
             startColumn: 15,
             stopColumn: 16
           }
         ]
       }
     ]
     */

    The grammar-related completion information is an array, where each item represents what grammar can be filled in at that position. The output in the example above indicates that the position can be filled with either a table name or a view name. Here, syntaxContextType is the type of grammar that can be completed, and wordRanges is the content that has already been entered.
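One way to consume these suggestions is to map each `syntaxContextType` to your own name list and prefix-filter it by the already-typed word. A sketch under that assumption; only the suggestion shape comes from the output above, and the name lists are hypothetical:

```typescript
interface WordRange {
    text: string;
}

interface SyntaxSuggestion {
    syntaxContextType: string;
    wordRanges: WordRange[];
}

// Turn syntax suggestions into concrete completion items, given name
// lists per context type (e.g. fetched from your own metadata service).
function completionItems(
    suggestions: SyntaxSuggestion[],
    names: Record<string, string[]>
): string[] {
    const items: string[] = [];
    for (const s of suggestions) {
        // The last word range is the word the caret is still inside of.
        const typed = s.wordRanges.at(-1)?.text ?? "";
        for (const name of names[s.syntaxContextType] ?? []) {
            if (name.startsWith(typed)) items.push(name);
        }
    }
    return items;
}
```

For the `tb` example above, only names beginning with `tb` survive the filter.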

Get all entities in SQL (e.g. table, column)

Call the getAllEntities method on the SQL instance, passing the SQL text and, optionally, the line and column numbers of a specified position.

  import { FlinkSQL } from 'dt-sql-parser';

  const flink = new FlinkSQL();
  const sql = 'SELECT * FROM tb;';
  const pos = { lineNumber: 1, column: 16 }; // after 'tb'
  const entities = flink.getAllEntities(sql, pos);

  console.log(entities);

output:

/*
  [
    {
      entityContextType: 'table',
      text: 'tb',
      position: {
        line: 1,
        startIndex: 14,
        endIndex: 15,
        startColumn: 15,
        endColumn: 17
      },
      belongStmt: {
        stmtContextType: 'selectStmt',
        position: [Object],
        rootStmt: [Object],
        parentStmt: [Object],
        isContainCaret: true
      },
      relatedEntities: null,
      columns: null,
      isAlias: false,
      origin: null,
      alias: null
    }
  ]
*/

The position is optional. If a position is passed and a collected entity is located within the statement containing that position, the statement object the entity belongs to is marked with isContainCaret. Combined with the code completion feature, this helps you quickly filter out the entities you need.
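For example, keeping only the entities whose enclosing statement contains the caret. A sketch; `EntityLike` is a hypothetical narrowed shape of the entity objects shown above:

```typescript
// Minimal shape of a collected entity, as seen in the output above.
interface EntityLike {
    text: string;
    belongStmt: { isContainCaret?: boolean };
}

// Keep only the entities from the statement under the caret.
function entitiesAtCaret<T extends EntityLike>(entities: T[]): T[] {
    return entities.filter((e) => e.belongStmt.isContainCaret === true);
}
```

The surviving entities are the natural candidates to resolve against your metadata when serving a completion request.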

Get semantic context information

Call the getSemanticContextAtCaretPosition method on the SQL instance, passing in the sql text and the line and column numbers at the specified position, for example:

import { HiveSQL } from 'dt-sql-parser';

const hive = new HiveSQL();
const sql = 'SELECT * FROM tb;';
const pos = { lineNumber: 1, column: 18 }; // after 'tb;'
const semanticContext = hive.getSemanticContextAtCaretPosition(sql, pos);

console.log(semanticContext);

output:

/*
{
  isStatementBeginning: true,
}
*/

Currently, the semantic context information that can be collected is as follows. If there are more requirements, please submit an issue.

  • isStatementBeginning Whether the current input position is the beginning of a statement

The default strategy for isStatementBeginning is SqlSplitStrategy.STRICT

There are two optional strategies:

  • SqlSplitStrategy.STRICT Strict strategy, only the statement delimiter ; is used as the identifier for the end of the previous statement
  • SqlSplitStrategy.LOOSE Loose strategy, based on the syntax parsing tree to split SQL

The difference between the two strategies: For example, if the input SQL is:

CREATE TABLE tb (id INT)

SELECT

In the SqlSplitStrategy.STRICT strategy, isStatementBeginning is false, because the CREATE statement is not terminated by a semicolon.

In the SqlSplitStrategy.LOOSE strategy, isStatementBeginning is true, because the syntax parsing tree splits the SQL into two independent statements: CREATE and SELECT.

You can set the strategy through the third options parameter:

hive.getSemanticContextAtCaretPosition(sql, pos, { splitSqlStrategy: SqlSplitStrategy.LOOSE });

Other API

  • createLexer Creates and returns an ANTLR4 Lexer instance;
  • createParser Creates and returns an ANTLR4 Parser instance;
  • parse Parses the input SQL and returns the parse tree.


Position and Range

Some results returned by the dt-sql-parser APIs contain text position information, and the ranges and starting values of line numbers, column numbers, and indexes may cause some confusion.

Index

The index starts at 0, the convention most familiar in programming.

index-image

For an index range, the start index begins at 0 and the end index is n-1. As shown in the figure above, the index range of the blue text is represented as follows:

{
    startIndex: 0,
    endIndex: 3
}

Line

The line starts at 1.

line-image

For a range of multiple lines, the line number starts from 1 and ends with n. A range of the first and second lines is represented as follows:

{
    startLine: 1,
    endLine: 2
}

Column

The column also starts at 1.

column-image

Column numbers are easiest to understand by comparison with an editor's cursor position. For a multi-column range, the column number starts at 1 and ends at n+1; as shown in the figure above, the column range of the blue text is represented as follows:

{
    startColumn: 1,
    endColumn: 5
}

CaretPosition Of Code Completion

The code completion of dt-sql-parser was designed for use in an editor, so the second parameter (CaretPosition) of the getSuggestionAtCaretPosition method takes a line and column number rather than a character index. This makes it easier to integrate code completion into an editor: the editor only needs to read its text content and cursor position at a given moment to call dt-sql-parser, without any additional calculation.

In other scenarios, however, you may need to derive the caret position through conversion or calculation. In that case, there are a few details you should be aware of.
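If you start from a plain character offset, the conversion to the 1-based `lineNumber`/`column` shape is only a few lines. A sketch, assuming `\n` line endings; `CaretPosition` here is written out locally for illustration:

```typescript
// The 1-based caret position shape described in Position and Range.
interface CaretPosition {
    lineNumber: number; // 1-based
    column: number;     // 1-based
}

// Convert a 0-based character offset into a 1-based caret position.
function toCaretPosition(text: string, offset: number): CaretPosition {
    const before = text.slice(0, offset).split("\n");
    return {
        lineNumber: before.length,
        column: before[before.length - 1].length + 1,
    };
}
```

For example, offset 13 in `'SHOW TABLES;\nSELECT'` lands at the start of the second line, i.e. `{ lineNumber: 2, column: 1 }`.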

The code completion of dt-sql-parser depends on antlr4-c3, which is a great library. dt-sql-parser's completion is a wrapper around antlr4-c3, converting the line and column information into the token index that antlr4-c3 requires, as shown in the figure below:

column-image

Regard the columns in the figure as cursor positions: put this text into an editor and you get 13 possible cursor positions, while dt-sql-parser parses the same text into 4 Tokens. An important strategy of the code completion is: when the caret (CaretPosition) has not completely left a Token, dt-sql-parser considers that Token unfinished, and the code completion infers what can be filled in at that Token's position.

For example, if you want to know what to fill in after SHOW through the code completion, the caret position should be:

{
    lineNumber: 1,
    column: 6
}

At this time, dt-sql-parser will think that SHOW is already a complete Token, and it should infer what can be filled in after SHOW. If the column in the passed-in caret position is 5, then dt-sql-parser will think that SHOW has not been completed, and then infer what can be filled in the position of SHOW. In other words, in the figure above, column: 5 belongs to token: 0, and column: 6 belongs to token: 1.

For the editor, this strategy is also intuitive. After the user enters SHOW but before pressing the space key, they probably have not finished typing; maybe they want to enter something like SHOWS. When the user presses the space key, the editor knows the user is moving on to the next Token, and it is time to ask dt-sql-parser what can be filled in at the next Token's position.
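The membership rule above can be modeled as: a caret column still belongs to a token until it has moved one column past the token's end. The following is an illustrative model only, not the library's actual implementation; it assumes the 4 tokens of `SHOW TABLES;` are `SHOW`, the whitespace, `TABLES`, and `;`:

```typescript
// 1-based, inclusive column range of a token (hypothetical shape).
interface Tok {
    startColumn: number;
    stopColumn: number;
}

// Index of the token the caret column still "belongs to", or -1.
function tokenAtCaret(tokens: Tok[], column: number): number {
    // The caret stays with a token up to one column past its end.
    return tokens.findIndex(
        (t) => column >= t.startColumn && column <= t.stopColumn + 1
    );
}
```

With `SHOW` at columns 1-4, this model puts caret column 5 with token 0 and caret column 6 with token 1, matching the behavior described above.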


Contributing

Refer to CONTRIBUTING

License

MIT


Changelog

All notable changes to this project will be documented in this file. See standard-version for commit guidelines.

4.3.1 (2025-06-05)

Bug Fixes

  • common: #424 allTokens slice when caretTokenIndex use tokenIndexOffset (#426) (616dc11)

4.3.0 (2025-05-16)

Features

Bug Fixes

4.2.0 (2025-05-09)

Features

  • postgre: #412 update merge's syntax, with clause syntax and copy syntax (#413) (52b9a5a)

Bug Fixes

  • ci: use project.reject when just check mode (#404) (7324c27)
  • flink: #415 fix supporting json function's different type (#416) (03a53d3)
  • split whiteSpace into multiple tokens to fix findCaretTokenIndex may undefined (#385) (8e11012)

4.1.1 (2025-02-17)

Bug Fixes

4.1.0 (2025-02-13)

Features

  • add alter table stmt (#312) (5aade9e)
  • add benchmark test suite (#273) (de1bd9d)
  • basicSql: remove judge splitListener/collectListener, all sqlParser implements it (#316) (eb2e920)
  • sync some useful syntax from antlr/grammars-v4 (95a1087)
  • upgrade trino to 450 (#323) (2b0de6a)
  • use common sql to run benchmark (#326) (76d0900)

Bug Fixes

  • #351 antlr4 command optimize (74d6435)
  • #381 antlr4 flink grammar (74be81c)
  • add hash partition table keywords MODULUS and REMAINDER (#384) (f2e6b60)
  • alert to alterView (#346) (9ba5100)
  • benchmark: add reports dir judge and remove plsql and include pgsql (9c534c2)
  • build mysql (5d6ff46)
  • ci: add antlr4 all sql in ci (2b30e78)
  • ci: change ci and add hash judge (276cc34)
  • ci: change crypto to devDependencies (b788e1c)
  • createFunction and createFunctionLoadable (e83449a)
  • flinksql function params add more time functions (#347) (b835c4b)
  • hive: add select into configPropertiesItem (#365) (bdb4b96)
  • impala: fix alter table change statement (#332) (4a9681e)
  • mysql case when (#317) (fea1ad1)
  • postgresql: change func_application to add column_name and paren (#359) (9a5eda8)
  • postgresql: combine plsql_unreserved_keyword to unreserved_keyword and remove unused rules (7884cbe)
  • trino: update timezone grammar to avoid ambiguity (#394) (05134bc)
  • update isContainCaret judgment when caret position token is whit… (#390) (20f065d)

4.1.0-beta.0 (2024-08-27)

Features

Bug Fixes

  • alert to alterView (#346) (9ba5100)
  • benchmark: add reports dir judge and remove plsql and include pgsql (9c534c2)
  • build mysql (5d6ff46)
  • createFunction and createFunctionLoadable (e83449a)
  • flinksql function params add more time functions (#347) (b835c4b)
  • impala: fix alter table change statement (#332) (4a9681e)
  • mysql case when (#317) (fea1ad1)
  • postgresql: combine plsql_unreserved_keyword to unreserved_keyword and remove unused rules (7884cbe)
  • spell check (#337) (694b0cd)

4.0.2 (2024-06-19)

Features

Bug Fixes

  • benchmark: add reports dir judge and remove plsql and include pgsql (9c534c2)
  • mysql case when (#317) (fea1ad1)

4.0.1 (2024-04-28)

Features

4.0.0 (2024-04-28)

Bug Fixes

4.0.0-beta.4.15 (2024-04-21)

Bug Fixes

4.0.0-beta.4.14 (2024-04-19)

Bug Fixes

4.0.0-beta.4.13 (2024-04-19)

Features

Bug Fixes

4.0.0-beta.4.12 (2024-03-28)

Features

Bug Fixes

4.0.0-beta.4.11 (2024-01-04)

Bug Fixes

  • rename createLexerFormCharStream to createLexerFromCharStream (#243) (4d19f05)

4.0.0-beta.4.10 (2023-12-19)

Features

  • export visitor and listener as type (d4c97ef)

4.0.0-beta.4.9 (2023-12-13)

Features

Bug Fixes

  • add missing expression predicate to impala parser(#225) (484c881)
  • correct import path (96b0376)
  • impala: update impala nonReserved and keywords use as functionName (#229) (8c594cf)

4.0.0-beta.4.8 (2023-11-29)

Features

Bug Fixes

  • add partitioned by sql unit test and compile impala (#221) (4be3640)

4.0.0-beta.4.7 (2023-11-24)

Features

  • add column to syntaxContextType (00b66cc)

Bug Fixes

4.0.0-beta.4.6 (2023-11-20)

Features

  • improve pgsql grammar and add unit tests(#201) (2e6d18e)
  • pgsql: optimize drop rule's content (adeb26b)
  • pgsql: update routine's name (a5757a3)
  • pgsql: upgrade rules' name: pgsql: upgrade rules' name (4736a65)
  • upgrade antlr4-c3 and fix the version (1711feb)

4.0.0-beta.4.5 (2023-11-02)

Features

  • add sideEffects field to trigger treeShaking (#198) (fc4fdbf)

Bug Fixes

4.0.0-beta.4.4 (2023-10-24)

Bug Fixes

  • correct import path of CaretPosition (9a93be6)

4.0.0-beta.4.3 (2023-10-20)

Features

  • declare copyright ownership (b8c47d0)
  • export trino sql (917cb98)
  • format hive grammar files name (1d890e5)
  • hive Authorization syntax and unit tests (#159) (2ec03f1)
  • hiveSQL support create/drop/alter index grammar (ebd9a1e)
  • improve hive keywords rule (#154) (ded9f28)
  • recomplie hive sql grammar and fix parser about (e97e0b2)
  • spark sql DDL test (#170) (d13a929)
  • support DT_ADD_CONFIG_FILE syntax (13b58b2)
  • support spark sql auto complete (#179) (4d1dfa6)

Bug Fixes

  • allow ROW and ROWS as function name (82d097b)
  • flinkSqlParser should not import src (f413a29)

4.0.0-beta.4.2 (2023-08-31)

Features

  • format hive grammar files name (1d890e5)
  • hiveSQL support create/drop/alter index grammar (ebd9a1e)
  • recomplie hive sql grammar and fix parser about (e97e0b2)
  • support DT_ADD_CONFIG_FILE syntax (13b58b2)

Bug Fixes

  • allow ROW and ROWS as function name (82d097b)
  • flinkSqlParser should not import src (f413a29)

4.0.0-beta.4.1 (2023-06-16)

4.0.0-beta.4 (2023-06-14)

Features

  • complete Query statements of FlinkSQL (#93) (a026ae0)
  • extracting the common readSQL function (4af913c)
  • FlinkSQL supports auto complete (#115) (1b02ff5)
  • flink: support insert multipe tables statement for FlinkSQL 1.15(#104) (c409b80)
  • flink: support NOT NULL syntax (#103) (4b824fb)
  • improve flinksql createStatement (#91) (370cccf)
  • improve flinksql insertStatement (dc417e1)
  • replace fliePath with slashText to resolve conflict between filePath and arithmetic (#112) (793ff6e)
  • support trino(presto) sql language (#105) (9c82a5d)
  • update alter and drop statements (#94) (fbee70c)
  • update the builtin funcs of flinksql (#102) (0924acf)
  • upgrade antlr4 to 4.12.0 (#88) (c1c72de)

Bug Fixes

  • correct the grammar usage, especially in the parts targeting javascript (#109) (eeb111b)
  • ingore the unused errors of libs (b39825e)
  • rename to TrinoSqlParser (46a8a7a)
  • rename TrinoSqlParser to TrinoSql (8084c79)
  • rename TrinoSqlVisitor (76d7ee5)
  • update the exports path of trino (1442ad6)

4.0.0-beta.3.2 (2023-01-09)

4.0.0-beta.3.1 (2022-12-30)

Bug Fixes

  • fix bracket match (e446fc0)
  • fix flinksql create table grammar, add test (#65) (0c9a831)

4.0.0-beta.3 (2022-12-16)

4.0.0-beta.2.2 (2021-09-10)

Features

  • add Bracket, Comma, and FunctionArguments tokens (b1ae454)
  • remove the impala (dd38dda)
  • remove the impala (f9cf65d)
  • support custom variables ${} (df18e6c)
  • tokenizer for the function arugments (bb6e8ac)

4.0.0-beta.2.1 (2021-07-16)

4.0.0-beta.2 (2021-05-11)

Bug Fixes

4.0.0-beta.1 (2021-01-05)

Features

  • fink: refactoring selectStatement (d7d32a3)
  • flin: add split sql test (3054e90)
  • flink: add createTable grammar (b7df08f)
  • flink: add describe/explain/use/show statement and some test (0ef8069)
  • flink: add drop/alter test add add part of queryStatement (9fc91a5)
  • flink: add grammar rules that pass the test (afef8e6)
  • flink: add inset\drop\alter grammar (158e235)
  • flink: add performance test (cc1d5ab)
  • flink: add queryStatement (ba29949)
  • flink: adjust lexer position to fix test error (da9660c)
  • flink: merge parser and lexer in order to java test (0735269)
  • flink: modify grammar to match keywords (e67f991)
  • flink: perfect query statement (1b9efdc)
  • flink: update createTable grammar (d1259b4)
  • flink: update flink SQL grammar (04c4c40)
  • flink: update flinkSql grammar and add some test (c9d88d0)
  • flink: update grammar to match special columnName (a9b1e90)
  • utils: add cleanSql、splitSql、lexer func and test (7d6c753)
  • add generic and plsql basic parser file (f851638)
  • add some query grammar (a5ea7be)
  • create hive lexer and hive parser (ec41432)
  • refactoring hive SQL lexer (557e9a3)
  • flinksql: add some lexer (6082c2b)

Bug Fixes