Hutool-DFA: Multi-Keyword Search Based on the DFA Model

I. Introduction

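Text processing often requires checking whether any of several keywords appear in a piece of text, for example in sensitive-word filtering or keyword matching. The Hutool-DFA module (package cn.hutool.dfa) provides multi-keyword search modeled on the deterministic finite automaton (DFA). A DFA is a state machine: a state-transition table is built in advance from the keywords, and the text is then scanned against it, so all keywords are checked together rather than one at a time. A fully built DFA scans the text in $O(n)$ time, where $n$ is the length of the text. Hutool's WordTree stores the keywords as a character tree and restarts the walk at each text position, so its scan cost is bounded by $O(n \cdot m)$, where $m$ is the length of the longest keyword. Because $m$ is usually small, the scan stays close to linear and does not grow with the number of keywords, which is what makes it efficient for large texts and large keyword sets.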
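To make the model concrete, here is a minimal sketch of a keyword tree and a scan over it. It is an illustration only, not Hutool's implementation: the class KeywordTrie and its methods add and findAll are hypothetical names introduced here, and the "word@start" result format is an assumption chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, minimal keyword tree for illustration; this is not Hutool's WordTree. */
class KeywordTrie {
    // One child node per possible next character
    private final Map<Character, KeywordTrie> children = new HashMap<>();
    // True if some keyword ends exactly at this node
    private boolean terminal;

    /** Inserts one keyword, creating one node per character as needed. */
    void add(String word) {
        KeywordTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new KeywordTrie());
        }
        node.terminal = true;
    }

    /** Returns every keyword occurrence (overlaps included) as "word@start". */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            KeywordTrie node = this;
            // The inner walk ends as soon as no keyword continues with this character,
            // so it never runs longer than the longest keyword
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break;
                }
                if (node.terminal) {
                    hits.add(text.substring(start, i + 1) + "@" + start);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        KeywordTrie trie = new KeywordTrie();
        trie.add("苹果");
        trie.add("香蕉");
        System.out.println(trie.findAll("我喜欢吃苹果和香蕉。")); // [苹果@4, 香蕉@7]
    }
}
```

Every keyword shares the same tree, so adding more keywords makes the tree larger but does not add passes over the text.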

II. Adding the Dependency

For a Maven project, add the following dependency to pom.xml:

<dependency>
    <groupId>cn.hutool</groupId>
    <artifactId>hutool-all</artifactId>
    <version>5.8.16</version>
</dependency>

For a Gradle project, add the following to build.gradle:

implementation 'cn.hutool:hutool-all:5.8.16'

III. Basic Usage

1. Create the DFA matcher

import cn.hutool.dfa.WordTree;
import java.util.ArrayList;
import java.util.List;

public class DFAExample {
    public static void main(String[] args) {
        // Create the WordTree that holds the DFA model
        WordTree wordTree = new WordTree();
        // Collect the keywords, then add them to the tree in one call
        List<String> keywords = new ArrayList<>();
        keywords.add("苹果");
        keywords.add("香蕉");
        keywords.add("葡萄");
        wordTree.addWords(keywords);
    }
}

The code above first creates a WordTree, the core class Hutool-DFA uses to build the DFA model. It then fills a list with keywords and passes the list to addWords, which inserts every keyword into the WordTree and completes the model.

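2. Search for keywords

import cn.hutool.dfa.FoundWord;
import cn.hutool.dfa.WordTree;
import java.util.ArrayList;
import java.util.List;

public class DFAExample {
    public static void main(String[] args) {
        // Create the WordTree that holds the DFA model
        WordTree wordTree = new WordTree();
        // Collect the keywords, then add them to the tree in one call
        List<String> keywords = new ArrayList<>();
        keywords.add("苹果");
        keywords.add("香蕉");
        keywords.add("葡萄");
        wordTree.addWords(keywords);

        // Text to search
        String text = "我喜欢吃苹果和香蕉。";
        // matchAllWords returns FoundWord objects; matchAll returns plain strings
        List<FoundWord> foundWords = wordTree.matchAllWords(text);
        for (FoundWord foundWord : foundWords) {
            System.out.println("Found keyword: " + foundWord.getWord() + ", start index: " + foundWord.getStartIndex() + ", end index: " + foundWord.getEndIndex());
        }
    }
}

This snippet defines the text to search and then calls matchAllWords to look for the keywords added earlier. matchAllWords returns a list of FoundWord objects, each holding the matched keyword together with its start and end positions in the text; both are zero-based character offsets, and in the 5.8.x sources the end index points at the keyword's last character (inclusive). The similarly named matchAll returns a List<String> of the matched keywords without positions, so it cannot be assigned to a List<FoundWord>. Iterating over the list prints each keyword and where it was found.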

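IV. Advanced Usage

1. Case-insensitive matching

import cn.hutool.dfa.FoundWord;
import cn.hutool.dfa.WordTree;
import java.util.List;
import java.util.Locale;

public class CaseInsensitiveDFAExample {
    public static void main(String[] args) {
        WordTree wordTree = new WordTree();
        // WordTree compares characters exactly, so store the keyword in lower case
        wordTree.addWord("Apple".toLowerCase(Locale.ROOT));

        String text = "I like apple.";
        // Match against a lower-cased copy of the text
        List<FoundWord> foundWords = wordTree.matchAllWords(text.toLowerCase(Locale.ROOT));
        for (FoundWord foundWord : foundWords) {
            System.out.println("Found keyword: " + foundWord.getWord() + ", start index: " + foundWord.getStartIndex() + ", end index: " + foundWord.getEndIndex());
        }
    }
}

As of Hutool 5.8.x, WordTree has no ignore-case switch: the second parameter of matchAll and matchAllWords is an int limit on the number of matches, not a boolean. Case-insensitive matching is therefore done by normalizing both sides, as above: lower-case each keyword before adding it and lower-case the text before matching. The keyword is then found even when its case in the text differs from the case in which it was written. Note that getWord returns the normalized (lower-case) keyword; to recover the original spelling, use the reported offsets to slice the original text.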
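When text is lower-cased before matching, match offsets refer to the lower-cased copy. The sketch below shows one way to map them back to the original text safely. It uses only the JDK; the class CaseFold and its methods are hypothetical helpers introduced for illustration, and indexOf merely stands in for a matcher hit. It assumes inclusive end offsets.

```java
import java.util.Locale;

/** Hypothetical helper for case-insensitive matching; not part of Hutool. */
class CaseFold {
    /** Lower-cases with a fixed locale so results do not depend on the JVM default locale. */
    static String fold(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    /** True if lower-casing keeps the length, so offsets in the folded copy are valid in the original. */
    static boolean offsetsPreserved(String original) {
        return fold(original).length() == original.length();
    }

    /** Returns the original-cased slice for inclusive offsets found in the folded copy. */
    static String originalSlice(String original, int start, int endInclusive) {
        if (!offsetsPreserved(original)) {
            throw new IllegalArgumentException("lower-casing changed the text length; offsets cannot be mapped back");
        }
        return original.substring(start, endInclusive + 1);
    }

    public static void main(String[] args) {
        String text = "I like APPLE.";
        String folded = fold(text);
        // Stand-in for a matcher hit on the folded text
        int start = folded.indexOf(fold("Apple"));
        int end = start + "Apple".length() - 1;
        System.out.println(originalSlice(text, start, end)); // APPLE
    }
}
```

The length check matters because a few characters change length when lower-cased (for example U+0130, the dotted capital I, becomes two characters), which would shift every later offset.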

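2. Longest-match rule

import cn.hutool.dfa.FoundWord;
import cn.hutool.dfa.WordTree;
import java.util.ArrayList;
import java.util.List;

public class LongestMatchDFAExample {
    public static void main(String[] args) {
        WordTree wordTree = new WordTree();
        List<String> keywords = new ArrayList<>();
        keywords.add("苹果");
        keywords.add("红苹果");
        wordTree.addWords(keywords);

        String text = "我喜欢吃红苹果。";
        // Arguments: text, limit (-1 = no limit), isDensityMatch, isGreedMatch
        List<FoundWord> foundWords = wordTree.matchAllWords(text, -1, false, true);
        for (FoundWord foundWord : foundWords) {
            System.out.println("Found keyword: " + foundWord.getWord() + ", start index: " + foundWord.getStartIndex() + ", end index: " + foundWord.getEndIndex());
        }
    }
}

Longest matching is controlled by the last argument of the four-parameter overload, matchAllWords(text, limit, isDensityMatch, isGreedMatch); there is no (text, boolean, boolean) form. A limit of -1 places no cap on the number of matches, isDensityMatch controls whether matches may overlap, and isGreedMatch enables greedy (longest) matching, in which the scan continues past a hit to look for longer keywords from the same start. In this example the text contains 红苹果, which begins one character before 苹果. With density matching off, the scan resumes after the end of each hit, so the call reports 红苹果 and not 苹果.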
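The interaction between overlapping and longest matching is easier to see in a small stand-alone model. The sketch below is not Hutool code: MatchModes, build, and scan are hypothetical names, and the flag semantics are an assumption modeled on a reading of the 5.8.x behaviour, so results should be checked against the Hutool version actually in use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of density/greedy matching; behaviour is an assumption, not Hutool's code. */
class MatchModes {
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean end; // a keyword ends at this node
    }

    /** Builds a keyword tree from the given words. */
    static Node build(String... words) {
        Node root = new Node();
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.end = true;
        }
        return root;
    }

    /** Scans text; density allows overlapping hits, greedy keeps extending a hit from the same start. */
    static List<String> scan(Node root, String text, boolean density, boolean greedy) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            Node node = root;
            for (int j = i; j < text.length(); j++) {
                node = node.next.get(text.charAt(j));
                if (node == null) {
                    break;
                }
                if (node.end) {
                    hits.add(text.substring(i, j + 1));
                    if (!density) {
                        i = j; // resume after this hit, so hits never overlap
                        break;
                    }
                    if (!greedy) {
                        break; // keep only the shortest hit for this start
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Node fruit = build("苹果", "红苹果");
        String text = "我喜欢吃红苹果。";
        System.out.println(scan(fruit, text, false, false)); // [红苹果]
        System.out.println(scan(fruit, text, true, false));  // [红苹果, 苹果]

        Node ab = build("a", "ab");
        System.out.println(scan(ab, "ab", true, false)); // [a]
        System.out.println(scan(ab, "ab", true, true));  // [a, ab]
    }
}
```

In this model the greedy flag only changes the result when density matching is also on, because a non-density scan jumps past each hit before the greedy check is reached.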

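V. Notes

Keyword insertion order: The order in which keywords are added does not affect the match results, because all keywords are merged into the same state-transition structure.
Performance: Matching stays fast for long texts and large keyword sets, but building the model costs time and memory that grow with the total number of keyword characters. Build the WordTree once, reuse it across calls, and keep the keyword set to what is actually needed.
Character encoding: Java strings are already Unicode in memory, so encoding problems arise when keywords or text are read from files or the network. Decode both with the same, explicitly specified charset (for example UTF-8); otherwise the characters differ and matching fails.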
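As an illustration of the encoding and build-once notes, the following sketch reads a keyword list from a file with an explicit charset. It uses only the JDK; the class KeywordLoader, the one-keyword-per-line file layout, and the choice of UTF-8 are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical loader for a one-keyword-per-line file; not part of Hutool. */
class KeywordLoader {
    /** Reads keywords as UTF-8, trimming whitespace and dropping blank lines and duplicates. */
    static Set<String> load(Path file) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        // The charset is passed explicitly so the result does not depend on the platform default
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("keywords", ".txt");
        Files.write(file, "苹果\n香蕉\n\n苹果\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(load(file)); // [苹果, 香蕉]
        Files.delete(file);
    }
}
```

The returned set can be passed to WordTree.addWords once at startup, and the resulting tree kept in a long-lived field so it is not rebuilt for every search.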

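With Hutool-DFA, developers can implement efficient multi-keyword search with little code, whether for sensitive-word filtering, information retrieval, or other text-processing tasks.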