Formal Languages and Compilers Notes & Tutorial, Chapter 2: Context-Free Languages

Self-study notes and tutorial for Formal Languages and Compilers.
About the author: 大爽歌, a small-time uploader on Bilibili and a one-on-one programming tutor.

2.1 Context-Free Grammars

A context-free grammar (CFG) is a structure $G = (N, \sum, P, S)$ where

$N$ : a finite set, the non-terminals.
Each element of $N$ is called a non-terminal character or a variable.
Each variable represents a different type of phrase or clause in the sentence. Variables are also sometimes called syntactic categories.
Each variable defines a sub-language of the language defined by $G$ .
$\sum$ : a finite set, the terminals, disjoint from $N$ .
The set of terminals is the alphabet of the language defined by the grammar $G$ .
$P \subseteq N \times (N \cup \sum)^*$ : a finite set of productions (a finite relation).
The members of $P$ are called the rules or productions of the grammar.
$S \in N$ : the start symbol (or start variable), used to represent the whole sentence (or program).

Productions are denoted as follows:

A production $(A, w) \in P$ is written $A \rightarrow w$ .
Several productions $A \rightarrow w_1, \ldots, A \rightarrow w_n$ are written $A \rightarrow w_1 | \ldots | w_n$ .
The right-hand side may be empty: an $\varepsilon$ -production is written $A \rightarrow \varepsilon$ .

Example 1
A grammar $G = (N, \sum, P, E)$ that generates all arithmetic expressions:

$N=\{E, T, F\}$
$\sum=\{+, -, *, /, (, ), n\}$
$E$ : start symbol
$P$ :
$E \rightarrow T | T + E | T - E$
$T \rightarrow F | T * F | T / F$
$F \rightarrow (E) | n$

Here $n$ stands for number,
$E$ stands for Expression,
$T$ stands for Term,
$F$ stands for Factor.
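The four components of Example 1 can be written down as plain data, which makes the side conditions of the definition easy to check. The sketch below is only one possible encoding: the names `N`, `SIGMA`, `P`, `START` and the helper `is_well_formed` are assumptions made for illustration and do not come from the notes.

```python
# Example 1 encoded as plain Python data (the layout is an assumption for illustration).
N = {"E", "T", "F"}
SIGMA = {"+", "-", "*", "/", "(", ")", "n"}
START = "E"
P = {
    "E": [["T"], ["T", "+", "E"], ["T", "-", "E"]],
    "T": [["F"], ["T", "*", "F"], ["T", "/", "F"]],
    "F": [["(", "E", ")"], ["n"]],
}

def is_well_formed(nonterminals, terminals, productions, start):
    """Check the side conditions from the definition of a CFG."""
    if nonterminals & terminals:          # N and Sigma must be disjoint
        return False
    if start not in nonterminals:         # the start symbol must be a non-terminal
        return False
    symbols = nonterminals | terminals
    for lhs, alternatives in productions.items():
        if lhs not in nonterminals:       # left-hand sides come from N
            return False
        for rhs in alternatives:          # right-hand sides are strings over N and Sigma
            if any(symbol not in symbols for symbol in rhs):
                return False
    return True

print(is_well_formed(N, SIGMA, P, START))  # True
```

Each alternative is stored as a list of symbols, so an $\varepsilon$ -production would simply be the empty list.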

Example 2
Palindromes: the following grammar generates all palindromes over $a$ and $b$ .
$G_2 = (\{S\}, \{a,b\}, P, S)$
$P$ :

$S \rightarrow aSa$
$S \rightarrow bSb$
$S \rightarrow a$
$S \rightarrow b$
$S \rightarrow \varepsilon$

A typical derivation in this grammar is
$S \rightarrow aSa \rightarrow aaSaa \rightarrow aabSbaa \rightarrow aabbaa$

This language is context-free; however, it can be proved that it is not regular.
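The five productions of Example 2 translate almost literally into a recursive check. The following is a minimal sketch, assuming strings over the alphabet $\{a, b\}$; the function name `derives_palindrome` is made up for illustration.

```python
def derives_palindrome(s: str) -> bool:
    """Return True if S =>* s for the productions S -> aSa | bSb | a | b | ε."""
    # Base cases: S -> ε, S -> a, S -> b
    if s in ("", "a", "b"):
        return True
    # Recursive cases: S -> aSa and S -> bSb strip one matching symbol from each end
    if len(s) >= 2 and s[0] == s[-1] and s[0] in "ab":
        return derives_palindrome(s[1:-1])
    return False

print(derives_palindrome("aabbaa"))  # True
print(derives_palindrome("ab"))      # False
```

Each recursive call undoes one application of $S \rightarrow aSa$ or $S \rightarrow bSb$, so the recursion depth equals the number of such steps in the derivation.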

Derivability

Consider a CFG $G = (N, \sum, P, S)$

Let $u, v \in (N \cup \sum )^*$ and $A \rightarrow w \in P$ .
Then $uAv$ yields $uwv$ in one step, by replacing $A$ with $w$ ; this is denoted $uAv \Rightarrow_G^1 uwv$ .
Alternatively, $uwv$ is derivable from $uAv$ in one step.
Each of $u$ , $v$ , $w$ may be $\varepsilon$ .
$A$ can be replaced with $w$ irrespective of the context $u$ , $v$ in which $A$ occurs.

In short, $uAv \Rightarrow_G^1 uwv$ means that $uwv$ is derivable from $uAv$ in one step.

Extending this to several steps:
Derivability is a relation on $(N \cup \sum)^*$ .

Let $x_i \in (N \cup \sum)^*$ for each $i \in \mathbb N$ .
$x_n$ is derivable from $x_0$ in $n$ steps if $x_i \Rightarrow_G^1 x_{i+1}$ for each $0 \leq i < n$ ; this is denoted $x_0 \Rightarrow_G^n x_n$ .
$x \Rightarrow_G^0 y$ if and only if $x = y$ .
$y$ is derivable from $x$ if it is derivable in any number of steps.
Alternatively, $x$ generates $y$ or $x$ yields $y$ .
$x \Rightarrow_G^* y$ if $x \Rightarrow_G^n y$ for some $n \in \mathbb N$ .
The relation $\Rightarrow_G^*$ is the reflexive-transitive closure of the relation $\Rightarrow_G^1$ .

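Example
Take the CFG $G_2$ from Example 2.
Applying the production $S \rightarrow b$ gives the one-step derivation
$aSa \rightarrow aba$
so $aSa \Rightarrow_G^1 aba$ .

Likewise, from the derivation
$S \rightarrow aSa \rightarrow aaSaa \rightarrow aabSbaa \rightarrow aabbaa$

we get
$aaSaa \Rightarrow_G^2 aabbaa$
$aSa \Rightarrow_G^3 aabbaa$
$S \Rightarrow_G^4 aabbaa$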
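The relations $\Rightarrow_G^1$ and $\Rightarrow_G^n$ can be made concrete with a small sketch. It assumes every symbol is a single character and encodes the palindrome grammar of Example 2 as a dictionary; the names `one_step` and `derives_in` are hypothetical helpers, not part of the notes.

```python
# Productions of the palindrome grammar from Example 2; "" stands for ε.
P = {"S": ["aSa", "bSb", "a", "b", ""]}

def one_step(x: str, productions: dict) -> set:
    """All y with x =>^1 y: replace one occurrence of a non-terminal A by some w with A -> w."""
    result = set()
    for i, symbol in enumerate(x):
        for w in productions.get(symbol, []):
            result.add(x[:i] + w + x[i + 1:])   # u A v becomes u w v
    return result

def derives_in(x: str, y: str, n: int, productions: dict) -> bool:
    """Decide x =>^n y by applying one_step exactly n times."""
    forms = {x}
    for _ in range(n):
        forms = {z for f in forms for z in one_step(f, productions)}
    return y in forms

print(derives_in("aSa", "aba", 1, P))     # True
print(derives_in("S", "aabbaa", 4, P))    # True
print(derives_in("S", "aabbaa", 3, P))    # False
```

With `n = 0` the loop body never runs, so `derives_in(x, y, 0, P)` is true exactly when `x == y`, matching the definition of $\Rightarrow_G^0$ . The set of forms can grow exponentially with `n`, so this is only suitable for tiny examples.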

Context-Free Languages

A context-free grammar generates a context-free language (CFL).

A sentential form is any $x \in (N \cup \sum)^*$ derivable from the start symbol $S$ , that is, $S \Rightarrow_G^* x$ .
A sentence is a sentential form that consists only of terminal symbols: $x \in \sum^*$ .
$L(G) = \{x \in \sum^* | S \Rightarrow_G^* x \}$ is the language generated by $G$ .
$A \subseteq \sum^*$ is context-free if $A = L (G)$ for some CFG $G$ .

In short, the language generated by a CFG $G$ is a CFL.

Example 3
$A = \{ a^nb^n | n \in \mathbb N \}$ is context-free.

$A = L (G)$ for the CFG $G=(\{S\}, \{a, b\}, P, S)$
with productions $P$ :

$S \rightarrow aSb$
$S \rightarrow \varepsilon$
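To see why this grammar produces exactly the strings $a^nb^n$ , it helps to list the sentential forms of one derivation. The sketch below applies $S \rightarrow aSb$ $n$ times and then $S \rightarrow \varepsilon$ ; the function name `derive_anbn` is an assumption made for illustration.

```python
def derive_anbn(n: int) -> list:
    """Sentential forms of the derivation that uses S -> aSb n times, then S -> ε."""
    forms = ["S"]
    for _ in range(n):
        # Every form contains exactly one S, so replace() rewrites just that occurrence.
        forms.append(forms[-1].replace("S", "aSb"))
    forms.append(forms[-1].replace("S", ""))   # final step: S -> ε
    return forms

print(derive_anbn(2))  # ['S', 'aSb', 'aaSbb', 'aabb']
```

The derivation of $a^nb^n$ therefore has $n + 1$ steps, and the last form is the only one that is a sentence.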

Further examples of CFLs:
$A=\{a^ib^jc^k | (i=j\;or\; j=k)\; and\; i, j, k \geq 1 \}$ is generated by $G=(\{S, T, U, A, C\}, \{a, b, c\}, P, S)$ with productions $P$ :

$S \rightarrow TC | AU$
$T \rightarrow aTb | ab$
$U \rightarrow bUc | bc$
$A \rightarrow a | aA$
$C \rightarrow c | cC$

The language of balanced parentheses is generated by $G=(\{S\}, \{(, )\}, P, S)$ with productions $P$ :

$S \rightarrow (S)|SS|\varepsilon$
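Membership in the language of balanced parentheses can also be tested without the grammar. The sketch below uses a depth counter, which is a different technique from deriving with $S \rightarrow (S)|SS|\varepsilon$ but accepts the same strings; the function name `is_balanced` is an assumption.

```python
def is_balanced(s: str) -> bool:
    """Counter-based test: the depth never drops below zero and ends at zero."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a closing parenthesis with no matching opener
                return False
        else:
            return False         # symbol outside the alphabet { (, ) }
    return depth == 0

print(is_balanced("(()())"))  # True
print(is_balanced(")("))      # False
```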

Regular Grammars

A regular grammar is a CFG $G = (N, \sum, P, S)$ where for each $A \rightarrow w \in P$

$w=\varepsilon$ , or
$w \in \sum N$
The right-hand side of each production is either $\varepsilon$ or a terminal followed by a non-terminal.

Example 4
An example of a regular grammar $G = (N, \sum, P, S)$

$N=\{S, A\}$
$\sum=\{a, b, c\}$
$P$ :
$S \rightarrow aS | bA$
$A \rightarrow cA | \varepsilon$

This grammar describes the same language as the regular expression $a^*bc^*$ , viz. the set of all strings consisting of arbitrarily many "a"s, followed by a single "b", followed by arbitrarily many "c"s.
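The claim that this grammar and $a^*bc^*$ describe the same language can be spot-checked. The sketch below assumes the productions $S \rightarrow aS | bA$ and $A \rightarrow cA | \varepsilon$ and follows derivations symbol by symbol, then compares with Python's `re.fullmatch`; the helper name `generates` is hypothetical.

```python
import re

# Example 4 as data: each right-hand side is "" (for ε) or a terminal followed by a non-terminal.
P = {"S": ["aS", "bA"], "A": ["cA", ""]}

def generates(w: str, start: str = "S") -> bool:
    """Decide start =>* w by tracking which non-terminals can remain after each consumed symbol."""
    current = {start}
    for ch in w:
        # Use a production X -> ch Y to consume ch and move on to Y.
        current = {rhs[1] for X in current for rhs in P[X] if rhs and rhs[0] == ch}
    # The derivation can end only by applying an ε-production.
    return any("" in P[X] for X in current)

for w in ["b", "aabcc", "", "ac", "bb"]:
    print(repr(w), generates(w), bool(re.fullmatch(r"a*bc*", w)))
```

A finite list of test strings is of course no proof of equality; it only illustrates the correspondence.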

Every regular language is generated by a regular grammar.
Proof:

Consider the DFA $M = (Q, \sum, \delta, q_0, F)$ .
Construct the regular grammar $G = (Q, \sum, P, q_0)$ with the following productions $P$ :
$q_i \rightarrow aq_j \quad \text{if} \; \delta(q_i, a) = q_j$
$q_i \rightarrow \varepsilon \quad \text{if} \; q_i \in F$
Then for each $w \in \sum^*$ , writing $q_1, \ldots, q_n$ for the states that $M$ passes through while reading $w$ ,

$\begin{aligned} &w = a_1a_2 \ldots a_{n-1}a_n \in L(M) \\ \Leftrightarrow \quad &\hat \delta (q_0, w) \in F \\ \Leftrightarrow \quad &\delta (q_{i-1}, a_i) = q_i \; \text{for each} \; 1 \leq i \leq n \; \text{and} \; q_n \in F \\ \Leftrightarrow \quad & q_{i-1} \rightarrow a_iq_i \in P \; \text{for each} \; 1 \leq i \leq n \; \text{and} \; q_n \rightarrow \varepsilon \in P \\ \Leftrightarrow \quad & q_0 \Rightarrow a_1q_1 \Rightarrow a_1a_2q_2 \Rightarrow \ldots \Rightarrow a_1a_2 \ldots a_{n-1}a_nq_n \Rightarrow a_1a_2 \ldots a_{n-1}a_n \\ \Leftrightarrow \quad & w = a_1a_2 \ldots a_{n-1}a_n \in L(G) \end{aligned}$

Hence $L (M) = L (G)$
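The construction in the proof is mechanical, so it can be sketched in a few lines. The DFA used here (strings over $\{a, b\}$ that end in $b$ ) is a hypothetical example chosen for illustration, and `dfa_to_grammar` is an assumed name.

```python
def dfa_to_grammar(states, alphabet, delta, finals):
    """Build the productions P of the regular grammar G = (Q, Sigma, P, q0) from a DFA."""
    productions = {q: [] for q in states}
    for q in states:
        for a in alphabet:
            # q_i -> a q_j  if  delta(q_i, a) = q_j
            productions[q].append((a, delta[(q, a)]))
        if q in finals:
            # q_i -> ε  if  q_i is an accepting state
            productions[q].append(())
    return productions

# Hypothetical DFA over {a, b} accepting the strings that end in b.
states = ["q0", "q1"]
delta = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "a"): "q0", ("q1", "b"): "q1"}
P = dfa_to_grammar(states, "ab", delta, finals={"q1"})
print(P["q1"])  # [('a', 'q0'), ('b', 'q1'), ()]
```

The states of the DFA become the non-terminals of the grammar, and the start state $q_0$ becomes the start symbol.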

Chomsky Normal Form (CNF)

A CFG $G = (N, \sum, P, S)$ is in Chomsky normal form if every production has the form

$A \rightarrow BC$ , where $B, C \in N$ , or
$A \rightarrow a$ , where $a \in \sum$ .
The right-hand side of each production is either two non-terminals (from $N$ ) or one terminal (from $\sum$ ).

RHS: right-hand side
LHS: left-hand side

Example
$G_1 = (\{S, A, B\}, \{a,b,c\}, P, S)$ with $P$ :

$S \rightarrow AB$
$S \rightarrow c$
$A \rightarrow a$
$B \rightarrow b$

$G_2 = (\{S, A, B\}, \{a,b,c\}, P, S)$ with $P$ :

$S \rightarrow aA$
$A \rightarrow a$
$B \rightarrow c$

$G_1$ is in CNF.
$G_2$ is not in CNF, because the right-hand side $aA$ is a terminal followed by a non-terminal.
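The two conditions of the definition can be checked production by production. The sketch below assumes the productions $S \rightarrow AB | c$ , $A \rightarrow a$ , $B \rightarrow b$ for $G_1$ and $S \rightarrow aA$ , $A \rightarrow a$ , $B \rightarrow c$ for $G_2$ , with every symbol a single character; `is_cnf` is a hypothetical helper name.

```python
def is_cnf(nonterminals, terminals, productions):
    """True if every right-hand side is two non-terminals or exactly one terminal."""
    for lhs, alternatives in productions.items():
        for rhs in alternatives:
            two_nonterminals = len(rhs) == 2 and all(s in nonterminals for s in rhs)
            one_terminal = len(rhs) == 1 and rhs[0] in terminals
            if not (two_nonterminals or one_terminal):
                return False
    return True

N, SIGMA = {"S", "A", "B"}, {"a", "b", "c"}
G1 = {"S": ["AB", "c"], "A": ["a"], "B": ["b"]}
G2 = {"S": ["aA"], "A": ["a"], "B": ["c"]}
print(is_cnf(N, SIGMA, G1), is_cnf(N, SIGMA, G2))  # True False
```

An $\varepsilon$ -production would be stored as the empty string and is rejected by both conditions, in line with the requirement $\varepsilon \notin L(G)$ .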

For every CFG $G$ with $\varepsilon \notin L(G)$ there is a CFG $G^{'}$ in Chomsky normal form with $L (G) = L (G^{'})$ . The conversion proceeds in the following steps:

Eliminate $\varepsilon$ -productions of the form $A \rightarrow \varepsilon$ .
Eliminate unit-productions of the form $A \rightarrow B$ .
Eliminate non-generating non-terminals.
Eliminate non-reachable non-terminals.
Eliminate terminals from right-hand sides of length at least 2.
Eliminate right-hand sides of length at least 3.
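The last two steps of the conversion (removing terminals from long right-hand sides, then splitting right-hand sides of length at least 3) can be sketched as follows. This is only an illustration under stated assumptions: the first four steps are taken as already done, terminals are lowercase letters, non-terminals contain an uppercase letter, and the fresh names `T_a` and `X1`, `X2`, ... are invented here and assumed not to clash with existing non-terminals.

```python
def to_binary_form(productions: dict) -> dict:
    """Apply the two final steps to productions given as {lhs: [tuple of symbols, ...]}."""
    result = {}
    counter = 0

    def add(lhs, rhs):
        alternatives = result.setdefault(lhs, [])
        if rhs not in alternatives:
            alternatives.append(rhs)

    for lhs, alternatives in productions.items():
        for rhs in alternatives:
            if len(rhs) >= 2:
                # Replace each terminal a by a fresh non-terminal T_a with T_a -> a.
                replaced = []
                for s in rhs:
                    if s.islower():
                        add("T_" + s, (s,))
                        replaced.append("T_" + s)
                    else:
                        replaced.append(s)
                rhs = tuple(replaced)
            # Split A -> B1 B2 ... Bk into A -> B1 X, X -> B2 X', ... until length 2 remains.
            head = lhs
            while len(rhs) > 2:
                counter += 1
                fresh = f"X{counter}"
                add(head, (rhs[0], fresh))
                head, rhs = fresh, rhs[1:]
            add(head, rhs)
    return result

# Hypothetical input grammar: S -> aSb | ab
print(to_binary_form({"S": [("a", "S", "b"), ("a", "b")]}))
```

For the hypothetical input above, the result contains `S -> T_a X1 | T_a T_b`, `X1 -> S T_b`, `T_a -> a` and `T_b -> b`, so every right-hand side is either two non-terminals or one terminal.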

Example

https://www.javatpoint.com/automata-chomskys-normal-form

Convert the given CFG to CNF. Consider the given grammar $G_1$ :

$G_1 = (\{S, A, B\}, \{a,b,c\}, P, S)$ with $P$ :

$\rightarrow TbT$
$\rightarrow aU$
$\rightarrow U$
$\rightarrow V$
$\rightarrow \varepsilon$
$\rightarrow b$

2.2 The Cocke-Younger-Kasami Algorithm

Given a string $w \in \sum^*$ and a CFL $A$ , is $w \in A$ ?

Analysis

This is the test for membership in a CFL.
Checking all derivations does not work, since there might be infinitely many.
It suffices to consider derivations that introduce up to $|w|$ non-terminals.
This gives an upper bound on the length of derivations that need to be checked.
The number of derivations might still be exponential in the length of $w$ .

A technique to improve the running time is dynamic programming.

The CYK algorithm solves the membership problem $w \in L(G)$ .

Assume that $G$ is in Chomsky normal form, for example:
$S \rightarrow BB\; |\; AS \; |\; a$
$A \rightarrow BC$
$B \rightarrow BS \; | \; b$
$C \rightarrow a$

Let $n$ be the length of $w$ : for example, $n = 6$ for $w = bbabab$ .
Mark the positions that separate symbols in $w$ : ${}_0\, b\, {}_1\, b\, {}_2\, a\, {}_3\, b\, {}_4\, a\, {}_5\, b\, {}_6$
Let $w_{ij}$ be the substring of $w$ between positions $i$ and $j$ : for example, $w_{25}=aba$ and $w_{06} = w$ .
$N_{ij} = \{A \in N \;|\; A \Rightarrow ^* w_{ij}\}$ contains the non-terminals that generate $w_{ij}$ .
The CYK algorithm calculates $N_{ij}$ for each $0 \leq i < j \leq n$ .
Then $w \in L(G)$ if and only if $S \in N_{0n}$ .

The CYK algorithm fills a table with $N_{ij}$ in column $i$ , row $j$ .
The calculation proceeds in increasing order of substring length.
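The position-based indexing coincides with Python slicing, and the fill order can be listed explicitly. The sketch below only illustrates the bookkeeping, not the algorithm itself; the helper names `substring` and `fill_order` are assumptions.

```python
w = "bbabab"

def substring(w: str, i: int, j: int) -> str:
    """w_ij: the symbols between positions i and j, which is exactly the slice w[i:j]."""
    return w[i:j]

def fill_order(n: int) -> list:
    """Pairs (i, j) in the order the table is filled: by increasing substring length j - i."""
    return [(i, i + length)
            for length in range(1, n + 1)
            for i in range(n - length + 1)]

print(substring(w, 2, 5))   # aba
print(fill_order(3))        # [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3), (0, 3)]
```

For a string of length $n$ there are $n(n+1)/2$ table entries, and the last pair in the order is always $(0, n)$ .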

Example

Wikipedia example

CFG $G_2 = (\{S, A, B, C\}, \{a, b\}, P, S)$ with $P$ :
$S \rightarrow AB\; |\; BC$
$A \rightarrow BA\; |\; a$
$B \rightarrow CC\; |\; b$
$C \rightarrow AB\; |\; a$

Is $bbabaa \in L(G_2)$ ?

Next, we fill in the table using the CYK algorithm.

Initial table

i	0	1	2	3	4	5
$a_i$	b	b	a	b	a	a
$j = 1$		-	-	-	-	-
$j = 2$			-	-	-	-
$j = 3$				-	-	-
$j = 4$					-	-
$j = 5$						-
$j = 6$

First come the substrings of length 1, that is, $j = i + 1$ .

The 1-symbol substring $w_{i, i+1}$ can be generated from $A$ if $A \rightarrow w_{i, i+1} \in P$ .
For each production $A \rightarrow a$ where $a = w_{i, i+1}$ , add $A$ to the entry at column $i$ , row $j$ .
These entries form the main diagonal of the table.

i	0	1	2	3	4	5
$a_i$	b	b	a	b	a	a
$j = 1$	$\{B\}$	-	-	-	-	-
$j = 2$		$\{B\}$	-	-	-	-
$j = 3$			$\{A, C\}$	-	-	-
$j = 4$				$\{B\}$	-	-
$j = 5$					$\{A, C\}$	-
$j = 6$						$\{A, C\}$

Then come the substrings of length 2, that is, $j = i + 2$ .

The 2-symbol substring $w_{i, i+2}$ is broken up into two 1-symbol substrings $w_{i, i+1}$ and $w_{i+1, i+2}$ .
If $B \in N_{i, i+1}$ and $C \in N_{i+1, i+2}$ and $A \rightarrow BC \in P$ , then add $A$ to $N_{i, i+2}$ .
These entries form the diagonal below the main diagonal.

i	0	1	2	3	4	5
$a_i$	b	b	a	b	a	a
$j = 1$	$\{B\}$	-	-	-	-	-
$j = 2$	-	$\{B\}$	-	-	-	-
$j = 3$		$\{S,A\}$	$\{A, C\}$	-	-	-
$j = 4$			$\{S, C\}$	$\{B\}$	-	-
$j = 5$				$\{S,A\}$	$\{A, C\}$	-
$j = 6$					$\{B\}$	$\{A, C\}$

For $w_{02}$ , split it into $w_{01}$ and $w_{12}$ .
$N_{01} = \{B\}$ , $N_{12} = \{B\}$ , and no production in $P$ has the right-hand side $BB$ , so $N_{02}$ is empty.
For $w_{13}$ , split it into $w_{12}$ and $w_{23}$ .
$N_{12} = \{B\}$ , $N_{23} = \{A, C\}$ , $A \rightarrow BA \in P$ , $S \rightarrow BC \in P$ , so $N_{13} = \{S, A\}$ .
For $w_{24}$ , split it into $w_{23}$ and $w_{34}$ .
$N_{23} = \{A, C\}$ , $N_{34} = \{B\}$ , $S \rightarrow AB \in P$ , $C \rightarrow AB \in P$ , so $N_{24} = \{S, C\}$ .

Next come the substrings of length 3.
The 3-symbol substring $w_{i, i+3}$ can be broken up in two ways.

Consider how to generate $w_{i, i+3}$ using a production $A\rightarrow BC$ , that is, $A \Rightarrow BC \Rightarrow^* w_{i, i+3}$ .
This follows from $B \Rightarrow^* w_{i, i+1}$ and $C \Rightarrow^* w_{i+1, i+3}$ or from $B \Rightarrow^* w_{i, i+2}$ and $C \Rightarrow^* w_{i+2, i+3}$ .
If $B \in N_{i, i+1}$ and $C \in N_{i+1, i+3}$ and $A \rightarrow BC \in P$ , then add $A$ to $N_{i, i+3}$ .
If $B \in N_{i, i+2}$ and $C \in N_{i+2, i+3}$ and $A \rightarrow BC \in P$ , then add $A$ to $N_{i, i+3}$ .

i	0	1	2	3	4	5
$a_i$	b	b	a	b	a	a
$j = 1$	$\{B\}$	-	-	-	-	-
$j = 2$	-	$\{B\}$	-	-	-	-
$j = 3$	$\{A\}$	$\{S,A\}$	$\{A, C\}$	-	-	-
$j = 4$		$\{S, C\}$	$\{S, C\}$	$\{B\}$	-	-
$j = 5$			$\{B\}$	$\{S,A\}$	$\{A, C\}$	-
$j = 6$				-	$\{B\}$	$\{A, C\}$

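$w_{03}$ can be split into $w_{01}$ + $w_{13}$ , which yields $A$ (via $A \rightarrow BA$ ),
or into $w_{02}$ + $w_{23}$ ; $N_{02}$ is empty, so this split yields nothing.
$w_{14}$ can be split into $w_{12}$ + $w_{24}$ , which yields $S$ (via $S \rightarrow BC$ ),
or into $w_{13}$ + $w_{34}$ , which yields $S$ and $C$ (via $S \rightarrow AB$ and $C \rightarrow AB$ ).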

Next come the substrings of length 4.
Each $w_{i, i+4}$ can now be split in three ways:

$w_{i, i+1}$ and $w_{i+1, i+4}$
$w_{i, i+2}$ and $w_{i+2, i+4}$
$w_{i, i+3}$ and $w_{i+3, i+4}$

i	0	1	2	3	4	5
$a_i$	b	b	a	b	a	a
$j = 1$	$\{B\}$	-	-	-	-	-
$j = 2$	-	$\{B\}$	-	-	-	-
$j = 3$	$\{A\}$	$\{S,A\}$	$\{A, C\}$	-	-	-
$j = 4$	$\{S, C\}$	$\{S, C\}$	$\{S, C\}$	$\{B\}$	-	-
$j = 5$		$\{B\}$	$\{B\}$	$\{S,A\}$	$\{A, C\}$	-
$j = 6$			$\{A, S\}$	-	$\{B\}$	$\{A, C\}$

Next come the substrings of length 5.

i	0	1	2	3	4	5
$a_i$	b	b	a	b	a	a
$j = 1$	$\{B\}$	-	-	-	-	-
$j = 2$	-	$\{B\}$	-	-	-	-
$j = 3$	$\{A\}$	$\{S,A\}$	$\{A, C\}$	-	-	-
$j = 4$	$\{S, C\}$	$\{S, C\}$	$\{S, C\}$	$\{B\}$	-	-
$j = 5$	$\{B\}$	$\{B\}$	$\{B\}$	$\{S,A\}$	$\{A, C\}$	-
$j = 6$		$\{A, S\}$	$\{A, S\}$	-	$\{B\}$	$\{A, C\}$

Finally comes the single substring of length 6, which is $w$ itself.

i	0	1	2	3	4	5
$a_i$	b	b	a	b	a	a
$j = 1$	$\{B\}$	-	-	-	-	-
$j = 2$	-	$\{B\}$	-	-	-	-
$j = 3$	$\{A\}$	$\{S,A\}$	$\{A, C\}$	-	-	-
$j = 4$	$\{S, C\}$	$\{S, C\}$	$\{S, C\}$	$\{B\}$	-	-
$j = 5$	$\{B\}$	$\{B\}$	$\{B\}$	$\{S,A\}$	$\{A, C\}$	-
$j = 6$	$\{A, S\}$	$\{A, S\}$	$\{A, S\}$	-	$\{B\}$	$\{A, C\}$

In the end,
$S \in N_{06}$ ,
so $bbabaa$ can be generated by this grammar.
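The table above can be reproduced with a short program. This is a minimal sketch of the CYK table computation, assuming the grammar $S \rightarrow AB | BC$ , $A \rightarrow BA | a$ , $B \rightarrow CC | b$ , $C \rightarrow AB | a$ ; the function name `cyk_table` and the two lookup dictionaries `unit` and `binary` are encoding choices made for illustration.

```python
def cyk_table(w, unit, binary):
    """Return {(i, j): N_ij} for all 0 <= i < j <= len(w).

    unit maps a terminal a to the set of A with A -> a;
    binary maps a pair (B, C) to the set of A with A -> BC.
    """
    n = len(w)
    table = {}
    for i in range(n):                                   # substrings of length 1
        table[(i, i + 1)] = set(unit.get(w[i], set()))
    for length in range(2, n + 1):                       # longer substrings, shortest first
        for i in range(n - length + 1):
            j = i + length
            cell = set()
            for k in range(i + 1, j):                    # split w_ij into w_ik and w_kj
                for B in table[(i, k)]:
                    for C in table[(k, j)]:
                        cell |= binary.get((B, C), set())
            table[(i, j)] = cell
    return table

# S -> AB | BC, A -> BA | a, B -> CC | b, C -> AB | a
unit = {"a": {"A", "C"}, "b": {"B"}}
binary = {("A", "B"): {"S", "C"}, ("B", "C"): {"S"},
          ("B", "A"): {"A"}, ("C", "C"): {"B"}}

table = cyk_table("bbabaa", unit, binary)
print("S" in table[(0, 6)])   # True
```

The three nested loops over `length`, `i` and `k` give a running time cubic in the length of $w$ for a fixed grammar, which is the gain dynamic programming brings over enumerating derivations.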