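I want to parse a SQL SELECT statement that supports all the features of a common SQL dialect such as MySQL. I looked for parsing libraries written in Python but could not find one that does the job. I did find some, but they only handle basic SELECT statements (FROM and WHERE, not even ORDER BY). So, as an alternative, I wrote my own parser (I know this is not a great solution). I have spent several hours on it, but I keep getting a strange result and do not know how to deal with it. Before I show the code, I just want to mention that if you know of a Python library that can parse SQL statements (not only SELECT, but also CREATE TABLE, INSERT, and so on), please let me know.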

Here is my grammar string:

from lark import Lark, Token, Transformer, Tree, v_args

select_grammar = """
    start: select_statement ";"

    select_statement: "SELECT" column_list "FROM" table_list join_list? where_clause? groupby_clause? having_clause? orderby_clause?

    column_list: "*" | column_expr ("," column_expr)*

    column_expr: function_call | column_name | subquery
    
    column_name: (table_name ".")? NAME ("AS" NAME)?
    
    table_name: NAME ("AS" NAME)?

    function_call: NAME "(" function_args ")" ("AS" NAME)?

    function_args: expression ("," expression)*

    where_clause: "WHERE" condition

    groupby_clause: "GROUP BY" column_expr ("," column_expr)*

    having_clause: "HAVING" logical_expr

    orderby_clause: "ORDER BY" order_column ("," order_column)*

    order_column: column_expr ["ASC" | "DESC"]?

    condition: logical_expr

    logical_expr: logical_term
                | logical_expr "AND" logical_term
                | logical_expr "OR" logical_term
                | "NOT" logical_term

    logical_term: comparison_expr
                | "(" logical_expr ")"
                | subquery

    comparison_expr: expression OPERATOR expression
                    | expression "IS" ("NULL" | "NOT NULL")

    expression: (table_name ".")? NAME | INT | string | function_call | subquery

    table_list: table_name ("," table_name)* | subquery

    subquery: "(" select_statement ")"

    join_list: join_expr+

    join_expr: join_type (table_name | subquery) "ON" condition

    join_type: "INNER JOIN" | "LEFT JOIN" | "RIGHT JOIN" | "FULL JOIN"

    string: ESCAPED_STRING | /'[^']*'/

    OPERATOR: ">" | "<" | ">=" | "<=" | "=" | "!="

    %import common.CNAME -> NAME
    %import common.INT
    %import common.ESCAPED_STRING
    %import common.WS
    %ignore WS
"""

I also created a Transformer class, as follows:

@v_args(inline=True)
class SelectTransformer(Transformer):
    def start(self, *args):
        print("start result: ", args)
        return Tree("SELECT statement", args)

    def column_list(self, *args):
        return args

    def column_expr(self, *args):
        return args[0] if len(args) == 1 else args

    def function_call(self, name, args, alias=None):
        return (name, args, alias)

    def where_clause(self, condition=None):
        return condition

    def groupby_clause(self, *args):
        return args

    def having_clause(self, condition=None):
        return condition

    def orderby_clause(self, *args):
        return args

    def order_column(self, *args):
        return args

    def condition(self, *args):
        return args

    def logical_expr(self, *args):
        return args

    def logical_term(self, *args):
        return args

    def comparison_expr(self, *args):
        return args

    def expression(self, *args):
        return args[0] if len(args) == 1 else args

    def column_name(self, *args):
        if len(args) == 1:
            return args[0]  # No alias present
        elif len(args) == 3:
            return args[0], args[2]  # Alias present, return a tuple
        else:
            return args

    def table_list(self, *args):
        return args

    def join_list(self, *args):
        return args

    def join_expr(self, *args):
        return args

    def join_type(self, *args):
        return args

    def subquery(self, *args):
        return args

    def string(self, value):
        return value.strip("'")

    def table_name(self, *args):
        if len(args) == 1:
            return args[0]  # No alias present
        elif len(args) == 3:
            return args[0], args[2]  # Alias present, return a tuple
        else:
            return args

I don't know whether it matters, but I also wrote a small function to display the final tree nicely:

def format_ast(ast, level=0):
    result = ""
    indent = "  " * level

    if isinstance(ast, tuple):
        for item in ast:
            result += format_ast(item, level + 1)
    elif isinstance(ast, Token):
        result += f"{indent}{ast.type}, Token('{ast.value}')\n"
    elif isinstance(ast, Tree):
        result += f"{indent}Tree({ast.data}), [\n"
        for child in ast.children:
            result += format_ast(child, level + 1)
        result += f"{indent}]\n"
    else:
        result += f"{indent}{ast}\n"

    return result

Here is the statement I am parsing:

sql_query = 'SELECT ' \
        'name AS alias, ' \
        'COUNT(age) AS age_alias, ' \
        '(SELECT department_name FROM departments WHERE department_id = employees.department_id) ' \
        'FROM employees AS emp, department ' \
        'INNER JOIN departments AS dep ON employees.department_id = departments.id ' \
        'LEFT JOIN other_table AS ot ON other_table.id = employees.table_id ' \
        'WHERE age > 25 ' \
        'GROUP BY age, name ' \
        'HAVING COUNT(age) > 1 ' \
        'ORDER BY name ASC, age DESC;'

The code I run is this:

parser = Lark(select_grammar, parser='lalr', transformer=SelectTransformer())
tree = parser.parse(sql_query)

# Print the custom export format
print(format_ast(tree))

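The problem is with the join_type() method of my SelectTransformer class. Somehow *args is always empty, although in theory it should contain (as defined in the rule) "INNER JOIN", "LEFT JOIN", "RIGHT JOIN", or "FULL JOIN". My output looks like this: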

  Tree(SELECT statement), [
  Tree(select_statement), [
        NAME, Token('name')
        NAME, Token('alias')
        NAME, Token('COUNT')
        Tree(function_args), [
          NAME, Token('age')
        ]
        NAME, Token('age_alias')
        Tree(select_statement), [
            NAME, Token('department_name')
            NAME, Token('departments')
                  NAME, Token('department_id')
                  OPERATOR, Token('=')
                    NAME, Token('employees')
                    NAME, Token('department_id')
        ]
        NAME, Token('employees')
        NAME, Token('emp')
      NAME, Token('department')
          NAME, Token('departments')
          NAME, Token('dep')
                  NAME, Token('employees')
                  NAME, Token('department_id')
                OPERATOR, Token('=')
                  NAME, Token('departments')
                  NAME, Token('id')
          NAME, Token('other_table')
          NAME, Token('ot')
                  NAME, Token('other_table')
                  NAME, Token('id')
                OPERATOR, Token('=')
                  NAME, Token('employees')
                  NAME, Token('table_id')
            NAME, Token('age')
            OPERATOR, Token('>')
            INT, Token('25')
      NAME, Token('age')
      NAME, Token('name')
            NAME, Token('COUNT')
            Tree(function_args), [
              NAME, Token('age')
            ]
            None
          OPERATOR, Token('>')
          INT, Token('1')
        NAME, Token('name')
        NAME, Token('age')
  ]
]

As you can see, no join type is shown. I am fairly new to parsing, so I really don't know what to try.

Recommended answer

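The answer is that case matters. A grammar defines a combination of rules and terminals. Rule names are lowercase, while terminal names are uppercase. It appears that only named terminals have their matched text captured as tokens; by default Lark drops unnamed string literals such as "INNER JOIN" from the tree when they appear inside a rule, which is why join_type ends up with no children. (There may be a more formal way of putting this, but it is accurate enough for this discussion.)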

So instead of:

    join_expr: join_type (table_name | subquery) "ON" condition

    join_type : "INNER JOIN" | "LEFT JOIN" | "RIGHT JOIN" | "FULL JOIN"

try:

    join_expr: JOIN_TYPE (table_name | subquery) "ON" condition

    JOIN_TYPE: "INNER JOIN" | "LEFT JOIN" | "RIGHT JOIN" | "FULL JOIN"

This produces results that contain entries such as JOIN_TYPE, Token('INNER JOIN') and JOIN_TYPE, Token('LEFT JOIN').

Another approach is to give each join type its own rule, as follows:

    join_type: inner_join | left_join | right_join | full_join

    inner_join: "INNER"? "JOIN"

    left_join: "LEFT" "OUTER"? "JOIN"

    right_join: "RIGHT" "OUTER"? "JOIN"

    full_join: "FULL" "OUTER"? "JOIN"

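The above generates distinct nodes rather than tokens in the parsed tree, which appear in the output as Tree(inner_join) and Tree(left_join). The exact syntax variant used ("JOIN" versus "INNER JOIN") is not captured.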

There may be other ways to capture the join type, but this is not my area of expertise.

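Note that the grammar still has a long way to go, such as allowing "JOIN", "LEFT OUTER JOIN", "FULL OUTER JOIN", "CROSS APPLY", "OUTER APPLY", and nested joins of the form TableA A LEFT JOIN (TableB B JOIN TableC C ON C.X = B.X) ON B.Y = A.Y. You have taken on quite a challenge.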
