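Harbin Institute of Technology, Faculty of Computing
Lab Report
Course: Data Structures and Algorithms
Lab project: Tree Structures and Their Applications
Lab topic: Huffman Encoding and Decoding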
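I. Objectives
Huffman coding is a variable-length coding method based on the Huffman tree (the optimal binary tree, i.e. the binary tree with the minimum weighted path length). The basic idea is to give frequently used symbols short codewords and rarely used symbols longer ones, while keeping the code uniquely decodable. It is a consistent coding method (also called "entropy coding") that is widely used in computing for lossless data compression. This lab requires implementing a complete Huffman encoding and decoding system.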
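II. Requirements and Environment
Requirements:
1. Read an arbitrary English text file and count how often each character (including punctuation and spaces) is used;
2. Build the Huffman tree from the character frequencies and give the Huffman code of every character (the Huffman code table of the character set);
3. Encode the text file with the Huffman tree and store it as a compressed file (the Huffman-encoded file);
4. Compute the compression rate of the Huffman-encoded file;
5. Decode the Huffman-encoded file back into a text file and compare it with the original.
Environment:
The C++ program for this lab was written in Visual Studio Code.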
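III. Design (definitions of all data types used in the program, the flowchart of the main program, the call relationships between modules, and the main steps of the core algorithms)
1. Data structures used:
a. HTNODE struct
A node of the Huffman tree. Members: weight stores the node's weight; lchild and rchild store the array indices of the left and right children; parent stores the array index of the parent node; ch stores the character the node represents.
b. Huffcode struct
The Huffman code of one character. Members: the array code[MAXBIT] stores the bits of the character's Huffman code; start stores the index in code at which the code begins.
c. Char struct
One distinct character of the text. Members: ch stores the character; count stores the number of times it occurs in the text.
2. Algorithm design and program output:
(1) int main()
The main function defines the variables the program needs and calls the functions below to carry out each step. The input text is as follows.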
In the quaint village where Oliver lived, his adventurous spirit often led him on unexpected journeys. One day, while exploring the woods, he discovered an ancient map hidden inside a hollow tree. Guided by curiosity, he followed its intricate pathways to a mysterious cave rumored to hold a long-lost treasure. Braving dark tunnels and overcoming perilous obstacles, Oliver ventured deeper into the cave, solving riddles left behind by its enigmatic guardians. With each challenge conquered, he grew more determined to uncover the treasure's secrets. At last, in the heart of the cave, Oliver found a glittering chest. With trembling hands, he opened it to reveal not gold or jewels, but a collection of ancient scrolls filled with wisdom and secrets of the world. As he returned home, Oliver realized that the true treasure was not material wealth, but the knowledge and experiences gained on his extraordinary journey. Forever changed by his adventure in the enchanted cave, he embraced the wisdom he had unearthed and embarked on a new chapter of his life with renewed purpose.
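(2) int Statistic(Char words[], int charnums)
Counts the distinct characters of the text, the number of occurrences of each, and the total number of distinct characters. After opening passage.txt, the function sets the count member of every entry of words to zero, then reads the file one character at a time until end of file. For each character, the flag variable found records whether the character has been seen before: if so, its count is incremented; otherwise the character is stored in a new entry whose count is set to 1. Finally the file is closed and the number of distinct characters is returned.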
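The linear search described above costs one pass over the table per input character. As a minimal sketch of an alternative, assuming the text is already in memory and consists of single-byte characters, the byte value itself can serve as the table index. The names countBytes and distinctCount are hypothetical and do not appear in the report's program.

```cpp
#include <array>
#include <cassert>
#include <string>

// Hypothetical helper (not part of the report's program): counts how often
// each byte value occurs in text, using the byte itself as the table index.
std::array<int, 256> countBytes(const std::string &text)
{
    std::array<int, 256> counts{}; // value-initialised: every slot starts at 0
    for (unsigned char c : text)   // unsigned char keeps the index in 0..255
        ++counts[c];
    return counts;
}

// Number of distinct byte values that occur (the role charnums plays in the report).
int distinctCount(const std::array<int, 256> &counts)
{
    int kinds = 0;
    for (int c : counts)
        if (c > 0)
            ++kinds;
    return kinds;
}
```

Unlike the words array, this table is ordered by byte value rather than by first appearance, so the lines of Statistic.txt would come out in a different order.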
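(3) void Display(Char words[], int charnums)
Writes each character and its relative frequency to Statistic.txt by looping over the array that holds the characters and their counts. The output is shown below; the line whose character column appears empty is the space character.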
I 0.000925
n 0.056429
0.160962
t 0.065680
h 0.043478
e 0.125809
q 0.001850
u 0.024977
a 0.052729
i 0.056429
v 0.018501
l 0.039778
g 0.015726
w 0.016651
r 0.058279
O 0.004625
d 0.047179
, 0.012951
s 0.040703
o 0.050879
p 0.011101
f 0.009251
m 0.013876
x 0.003700
c 0.025902
j 0.002775
y 0.009251
. 0.008326
G 0.000925
b 0.009251
- 0.000925
B 0.000925
k 0.002775
W 0.001850
' 0.000925
A 0.001850
z 0.000925
F 0.000925
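(4) void CreateHT(HTNODE T[], int n, Char words[])
Builds the Huffman tree. The function first initializes the first 2n-1 entries of T: the n leaves take their weights and characters from words, and the remaining n-1 entries are cleared so that they can be filled in later. It then repeatedly selects the two parentless nodes with the smallest weights and merges them under a new parent, filling in the rest of T until the tree is complete.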
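The report's program finds the two smallest weights by scanning the array (SelectMin), which takes time proportional to n for each merge. As a sketch of a common alternative, and not the method used in this lab, the same merge loop can be driven by a min-heap. The names Node and buildHuffman are hypothetical; Node mirrors HTNODE without the ch field.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Array-based node mirroring the report's HTNODE (the ch field is omitted).
struct Node
{
    int weight;
    int lchild;
    int rchild;
    int parent;
};

// Hypothetical alternative to CreateHT: builds the tree for the given leaf
// weights and returns all nodes, leaves first, root last.
std::vector<Node> buildHuffman(const std::vector<int> &weights)
{
    std::vector<Node> tree;
    // Min-heap of (weight, node index): the lightest node is always on top.
    using Item = std::pair<int, int>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;

    for (int w : weights)
    {
        heap.push({w, static_cast<int>(tree.size())});
        tree.push_back({w, -1, -1, -1});
    }
    // Each round merges the two lightest parentless nodes under a new parent.
    while (heap.size() > 1)
    {
        Item a = heap.top();
        heap.pop();
        Item b = heap.top();
        heap.pop();
        int parent = static_cast<int>(tree.size());
        tree.push_back({a.first + b.first, a.second, b.second, -1});
        tree[a.second].parent = parent;
        tree[b.second].parent = parent;
        heap.push({a.first + b.first, parent});
    }
    return tree;
}
```

For n leaves the result has 2n-1 nodes with the root at index 2n-2, the same layout the report's Decoding function relies on. Ties between equal weights may be broken differently than in SelectMin, so individual codes can differ while the total encoded length stays the same.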
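(5) void Coding(HTNODE T[], Huffcode huffcode[], int n)
Generates the Huffman code of every character. For each leaf, the function walks from the leaf up to the root; at each step parent refers to the parent of the current node, and the bit is 0 if the current node is the parent's left child and 1 otherwise. Because this walk produces the bits in reverse order, each huffcode entry is filled from the last array position backwards, and start records where the code begins.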
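The leaf-to-root walk can also be written with a growable string that is reversed at the end, instead of filling a fixed array from the back. The following is a minimal sketch under that assumption; Node and codeOf are hypothetical names, and the tree uses the same index-based layout as HTNODE.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Index-based node as in the report's HTNODE (weight and ch are omitted).
struct Node
{
    int lchild;
    int rchild;
    int parent;
};

// Hypothetical variant of Coding: walks from the given leaf up to the root,
// collecting one bit per edge, then reverses to obtain root-to-leaf order.
std::string codeOf(const std::vector<Node> &tree, int leaf)
{
    assert(leaf >= 0 && leaf < static_cast<int>(tree.size()));
    std::string bits;
    int child = leaf;
    int parent = tree[leaf].parent;
    while (parent != -1)
    {
        // 0 for a left child, 1 for a right child, as in the report
        bits.push_back(tree[parent].lchild == child ? '0' : '1');
        child = parent;
        parent = tree[parent].parent;
    }
    std::reverse(bits.begin(), bits.end());
    return bits;
}
```

With three leaves 0, 1, 2, an internal node 3 joining leaves 0 and 1, and a root 4 whose children are 2 (left) and 3 (right), the codes are "10", "11" and "0".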
(6) void DisplayCode(Huffcode huffcode[], HTNODE T[], int n)
Writes each character and its Huffman code to Code.txt. The output is shown below.
I 1000110000
n 0101
110
t 1001
h 11111
e 101
q 000000111
u 00001
a 0011
i 0110
v 111000
l 11101
g 100000
w 100010
r 0111
O 10001101
d 0001
, 010010
s 11110
o 0010
p 000001
f 1000111
m 010011
x 10000100
c 01000
j 00000001
y 1110010
. 1000011
G 1000110001
b 1110011
- 1000110010
B 1000110011
k 00000010
W 100001010
' 000000000
A 100001011
z 000000001
F 000000110
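(7) void CodeFile(HTNODE T[], Huffcode huffcode[], int n)
Converts the text in passage.txt to its Huffman code and writes the result to encode.txt. Until end of file is reached, the function reads one character, looks up that character's Huffman code, and writes the code to the output file, one text digit (0 or 1) per bit. The encoded file is shown below.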
10001100000101110100111111101110000000111000010011011001011001110111000011011101111010011100000101110100010111111010111101110100011011110101101110001010111110111010110111000101000101001011011111011011110110001100011110001010101100100001011100100000111110110111100000010110011101101001110001010001111001101010111011101101000111011111011001001111000100101110000010101101100001000000011010100010011010001110000000010010000010111010110111100101111010000111101000110101011011100001001111100100100101101000101111101101110110111010110000100000001111010010011101100101100000110100111111101110100010001000100001111100100101101111110111000010110111100100000101110001010111101000111000110101110001101010100001101010101100111001001100110000011101111101100001000110101011100110010111110011000011011100011110111110010111011110100101000101101001011110110110000111101000110001000010110000110100011101110011111001011001000000010111011000101111001101001111001001001011011111101110100011100101110111101001010001010100011100110100111110110011001011001011101100100000111001101110000001001110011111110001000111110010111101101001001011000111100100111110010111101001101011101100010000011111011001000001111100010111001110000101001100100111101000111010010010110111110010111010001110001111011101001001011000001000110010111010010111101001110100101111010011111100000101111011000011110100011001101110011111000011001011000001100001001101110000001011010010000101010101101111011111011000110101000111000101110001010111010000010010011011001011000001100000011010111011011101001000001111101100010111001111110100100110100011101101111100100101101000110111101011011100010101111101110001010101100100001011110100011100001101101000001101011111001100101100100101101001111111011100100000111110001010100101101111000101110111100001100101100000110011101100001000111101101111101101110110110001111001110111001110111111011001010001110111001111100101100110100111110110101010101101000000100110011100101100100011010000000001001101110001011000110101111101
00001111010000101001101001111111101010011010001111111001000111110011111011110110101011000001011100100000100101000000111000011010111101000101001011011111101110100000011110110001011001001100100111101110000110110011010111010011011001011010001110100100101100000101010100000101110001010111110100111111101110100101111010011111100000101111010000000001111011011110101010000111101100111110100001111010000101110011101110100111111010010100101100110010111010011111110111011111101001101111001110001010001111101001111111011100100000111110001010100101101000110111101011011100010101111101000111001000001010100011100011110100000111010110100110011010111011001011000001100100011111101111101001100001111010000101001101001111111101001011110101001111100111110101100101100000110111110011010100011111001001011011111101110001000000110101011010001110011010011101001001011001111011110001010011111011100101001010011101000000010111010001110001001111100000000110110001010111101111100100101101110011000011001110001111001000001011101111011010100010010110001001011100010100011111000110101010000110101010110011101111001000011100101110111101111101101000111011011101111011010001110100010011010011111111010001001101111000010010010011110001101010001110111101010100001111011001111101100010100011111010011111110111010001000100111111010001100001111010000101111110110111111011100111101100100001011101011010001110111110010010011101010010110100011011110101101110001010111110011110100111110101100000000011010001110100111111001110011101001111111011101001011100001101110100101111010011111100000101111011101000100011111101100101001010011100100110011100110101110110001111101110100010101001111101100111111010010110111001100001100111010011111110111000000010010100101000101110110100011000001011100011010100011101011000010000000110101110110101010101000101111101101000000011011001011010001110001001011101111101101111011010110000100100101110011001001110001011001010011011111100101100000000100100000101110101101111001010000111100000001100010011110111100010
10111110010001111100110101100000101000111011100111110010110111110110111101100011000111100010101011001000010111101110011001011101001111111011101010101010001111100110101100110100011100100000111110001010100101101111110111010101001111100110111001101000101000111010011111110111010001001101111000010010010011110111111011101111100110001110000010101101001101111001111111010001110001101010001110101010011111001100110111000000101010001110001001011100011110010110110001011001000111110011000001100110101111100010100011111011111011011110110111010110100011110111010001001101001111111100111101010110110001010100011100000010000101110000010010111101011000011
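Because encode.txt stores every bit as one text character, the file on disk is about eight times larger than the bit stream it represents. As a sketch of how the digits could be packed into real bytes (an extension that the report's program does not implement), the hypothetical helpers packBits and unpackBits below convert between the two forms.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: packs a string of '0'/'1' characters into bytes,
// most significant bit first. The last byte is zero-padded on the right.
std::vector<unsigned char> packBits(const std::string &bits)
{
    std::vector<unsigned char> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i)
        if (bits[i] == '1')
            bytes[i / 8] |= static_cast<unsigned char>(0x80u >> (i % 8));
    return bytes;
}

// Inverse of packBits; bitCount says how many bits are real (not padding).
std::string unpackBits(const std::vector<unsigned char> &bytes, std::size_t bitCount)
{
    assert(bitCount <= bytes.size() * 8);
    std::string bits;
    for (std::size_t i = 0; i < bitCount; ++i)
        bits.push_back(((bytes[i / 8] >> (7 - i % 8)) & 1) ? '1' : '0');
    return bits;
}
```

A packed file would also have to record the number of valid bits, since the padding in the last byte could otherwise be decoded as extra characters.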
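(8) void Calculate(Char words[], Huffcode huffcode[], int n)
Computes the compression rate of the Huffman-encoded file and writes it to Compression_rate.txt. The rate is measured against a fixed-length code, which needs ⌈log2 n⌉ bits per character for n distinct characters: rate = (fixed length - average Huffman code length) / fixed length. The result is shown below.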
28.445879%
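For the 38 distinct characters listed in Statistic.txt, a fixed-length code needs 6 bits per character, so the reported rate corresponds to an average Huffman code length of about 4.29 bits. The calculation can be sketched as follows; compressionRate is a hypothetical name, and taking the counts and code lengths as two parallel vectors is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical restatement of Calculate: compares the average Huffman code
// length with a fixed-length code of ceil(log2(n)) bits per character.
// counts[i] and lengths[i] describe the i-th distinct character.
double compressionRate(const std::vector<int> &counts, const std::vector<int> &lengths)
{
    assert(counts.size() == lengths.size());
    long total = 0, huffmanBits = 0;
    for (std::size_t i = 0; i < counts.size(); ++i)
    {
        total += counts[i];
        huffmanBits += static_cast<long>(counts[i]) * lengths[i];
    }
    int fixedBits = 0; // smallest b with 2^b >= number of distinct characters
    for (std::size_t span = 1; span < counts.size(); span *= 2)
        ++fixedBits;
    if (total == 0 || fixedBits == 0)
        return 0.0; // nothing to compare against
    double average = static_cast<double>(huffmanBits) / total;
    return (fixedBits - average) / fixedBits;
}
```

Note that this baseline is not the size of the original text file: measured against 8 bits per character, the same average code length would give a larger saving.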
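(9) void Decoding(HTNODE T[], int n)
Decodes the Huffman-encoded file and writes the result to decode.txt. The encoded file is first read into an array, and a loop finds the position i just past its last bit. Two nested loops then decode it: the inner loop starts at the root and moves to the left child on 0 and to the right child on 1 until it reaches a leaf, whose character is then written out; the outer loop repeats this until the whole code array has been consumed. The decoded output file is shown below.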
In the quaint village where Oliver lived, his adventurous spirit often led him on unexpected journeys. One day, while exploring the woods, he discovered an ancient map hidden inside a hollow tree. Guided by curiosity, he followed its intricate pathways to a mysterious cave rumored to hold a long-lost treasure. Braving dark tunnels and overcoming perilous obstacles, Oliver ventured deeper into the cave, solving riddles left behind by its enigmatic guardians. With each challenge conquered, he grew more determined to uncover the treasure's secrets. At last, in the heart of the cave, Oliver found a glittering chest. With trembling hands, he opened it to reveal not gold or jewels, but a collection of ancient scrolls filled with wisdom and secrets of the world. As he returned home, Oliver realized that the true treasure was not material wealth, but the knowledge and experiences gained on his extraordinary journey. Forever changed by his adventure in the enchanted cave, he embraced the wisdom he had unearthed and embarked on a new chapter of his life with renewed purpose.
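The root-to-leaf walk used for decoding can be sketched on an in-memory bit string as follows. Node and decode are hypothetical names; the sketch assumes a well-formed tree in the report's index-based layout and a bit string made only of the characters 0 and 1.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Index-based node as in the report's HTNODE (weight and parent are omitted).
struct Node
{
    int lchild;
    int rchild;
    char ch; // meaningful for leaves only
};

// Hypothetical variant of Decoding that works on an in-memory bit string.
// root is the index of the root node (2n-2 in the report's layout).
std::string decode(const std::vector<Node> &tree, int root, const std::string &bits)
{
    assert(root >= 0 && root < static_cast<int>(tree.size()));
    std::string text;
    if (tree[root].lchild == -1 && tree[root].rchild == -1)
        return text; // a one-node tree has no edges to follow
    int node = root;
    for (char bit : bits)
    {
        node = (bit == '0') ? tree[node].lchild : tree[node].rchild;
        if (tree[node].lchild == -1 && tree[node].rchild == -1)
        {
            text.push_back(tree[node].ch); // reached a leaf: emit its character
            node = root;                   // and restart from the root
        }
    }
    return text;
}
```

Bits left over after the last complete codeword are silently ignored in this sketch.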
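(10) bool Judge()
Checks whether the contents of passage.txt and decode.txt are the same by comparing them character by character. It returns 1 if they match and 0 otherwise. The run result shows that the two files are identical.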
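A character-by-character comparison has to treat a file that ends early as a mismatch, otherwise a truncated decode.txt would still pass. One way to sketch the check, assuming both files fit in memory, is to read each file whole and compare the strings. The names readAll and sameContent are hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: reads a whole file into a string (binary mode, so no
// newline translation takes place). ok is set to false if it cannot be opened.
std::string readAll(const std::string &path, bool &ok)
{
    std::ifstream in(path, std::ios::binary);
    ok = in.is_open();
    return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
}

// True only if both files can be opened and hold exactly the same bytes.
bool sameContent(const std::string &pathA, const std::string &pathB)
{
    bool okA = false, okB = false;
    std::string a = readAll(pathA, okA);
    std::string b = readAll(pathB, okB);
    return okA && okB && a == b;
}
```

For this lab the call would be sameContent("passage.txt", "decode.txt").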
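IV. Test Results
The run results are given in Section III alongside the function descriptions.
V. Lessons Learned and Shortcomings
Through this lab I became more familiar with tree operations and gained a deeper understanding of the whole Huffman encoding and decoding process. I also found gaps in my earlier study of tree structures. Because the program is fairly long, a change in one place can affect everything else, so I learned how important it is to settle the overall architecture before filling in the details, and to check whether each modification affects other parts of the program. The problems encountered during the lab are briefly described below:
- I was not familiar with building a Huffman tree and made repeated mistakes while writing the code. After consolidating the lecture material by reading the textbook and watching tutorial videos, I completed the corresponding code.
- When storing each character's code, characters sit in the leaf nodes, so a character can only be reached through its leaf, whereas a Huffman code is read from the root down to the leaf. At first I could not see how to reconcile the two; the solution was to walk upward from the leaf and store the bits from the end of the array backwards.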
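#include <cstdio>  // FILE, fopen, fscanf, fprintf, printf
#include <cstdlib> // exit
#include <iostream>
#define N 100 // maximum number of leaf nodes (distinct characters)
#define M (2 * N - 1) // maximum number of tree nodes
#define MAXBIT 200
#define MAXWEIGHT 1000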
typedef struct
{ // Huffman tree node
    int weight;
    int lchild;
    int rchild;
    int parent;
    char ch;
} HTNODE;
typedef struct character
{ // one distinct character and its number of occurrences
    char ch;
    int count;
} Char;
typedef struct huffcode
{ // Huffman code of one character
    int code[MAXBIT];
    int start;
} Huffcode;
void SelectMin(HTNODE T[], int n, int &p1, int &p2);
void CreateHT(HTNODE T[], int n, Char words[]);
int Statistic(Char words[], int charnums);
void Display(Char words[], int charnums);
void Coding(HTNODE T[], Huffcode huffcode[], int n);
void DisplayCode(Huffcode huffcode[], HTNODE T[], int n);
void CodeFile(HTNODE T[], Huffcode huffcode[], int n);
void Calculate(Char words[], Huffcode huffcode[], int n);
void Decoding(HTNODE T[], int n);
bool Judge();
int main()
{
    Char words[N];
    Huffcode huffcode[N];
    HTNODE T[M];
    int charnums = 0;
    charnums = Statistic(words, charnums);
    Display(words, charnums);
    CreateHT(T, charnums, words);
    Coding(T, huffcode, charnums);
    DisplayCode(huffcode, T, charnums);
    CodeFile(T, huffcode, charnums);
    Calculate(words, huffcode, charnums);
    Decoding(T, charnums);
    printf("%d", Judge());
    return 0;
}
void SelectMin(HTNODE T[], int n, int &p1, int &p2)
{
    int i, j;
    // take the first two parentless nodes as the initial candidates
    for (i = 0; i < n; i++)
    {
        if (T[i].parent == -1)
        {
            p1 = i;
            break;
        }
    }
    for (j = i + 1; j < n; j++)
    {
        if (T[j].parent == -1)
        {
            p2 = j;
            break;
        }
    }
    // p1: lightest parentless node other than p2
    for (i = 0; i < n; i++)
    {
        if (i != p2 && T[i].parent == -1 && T[i].weight < T[p1].weight)
            p1 = i;
    }
    // p2: lightest parentless node other than p1
    for (j = 0; j < n; j++)
    {
        if (j != p1 && T[j].parent == -1 && T[j].weight < T[p2].weight)
            p2 = j;
    }
}
void CreateHT(HTNODE T[], int n, Char words[])
{
    int i, p1, p2;
    // initialization: leaves first, then the empty internal nodes
    for (i = 0; i < n; i++)
    {
        T[i].weight = words[i].count;
        T[i].lchild = -1;
        T[i].rchild = -1;
        T[i].parent = -1;
        T[i].ch = words[i].ch;
    }
    for (i = n; i < 2 * n - 1; i++)
    {
        T[i].weight = 0;
        T[i].lchild = -1;
        T[i].rchild = -1;
        T[i].parent = -1;
        T[i].ch = '\0'; // internal nodes carry no character
    }
    // merge the forest: join the two lightest trees under node i
    for (i = n; i < 2 * n - 1; i++)
    {
        SelectMin(T, i, p1, p2);
        T[p1].parent = T[p2].parent = i;
        T[i].lchild = p1;
        T[i].rchild = p2;
        T[i].weight = T[p1].weight + T[p2].weight;
    }
}
int Statistic(Char words[], int charnums)
{
    FILE *fp;
    if ((fp = fopen("passage.txt", "r")) == NULL)
    {
        printf("Fail to open file");
        exit(EXIT_FAILURE);
    }
    char ch;
    // reset all counts to 0
    for (int i = 0; i < N; i++)
        words[i].count = 0;
    while (fscanf(fp, "%c", &ch) == 1)
    {
        int found = 0;
        for (int i = 0; i < charnums; i++)
        {
            if (ch == words[i].ch)
            {
                words[i].count++;
                found = 1;
                break;
            }
        }
        if (found == 0)
        {
            if (charnums >= N) // words[] holds at most N distinct characters
            {
                printf("Too many distinct characters");
                exit(EXIT_FAILURE);
            }
            words[charnums].ch = ch;
            words[charnums].count = 1;
            charnums++;
        }
    }
    fclose(fp);
    return charnums;
}
void Display(Char words[], int charnums)
{
    int sum = 0;
    FILE *fp;
    fp = fopen("Statistic.txt", "w");
    if (fp == NULL)
        return;
    // total number of characters, used to turn counts into frequencies
    for (int i = 0; i < charnums; i++)
        sum += words[i].count;
    for (int i = 0; i < charnums; i++)
        fprintf(fp, "%c %f\n", words[i].ch, (float)words[i].count / (float)sum);
    fclose(fp);
}
void Coding(HTNODE T[], Huffcode huffcode[], int n)
{
    int child, parent;
    for (int i = 0; i < n; i++)
    {
        // fill code[] from the last position backwards while walking up
        huffcode[i].start = MAXBIT - 1;
        child = i;
        parent = T[i].parent;
        while (parent != -1)
        {
            if (child == T[parent].lchild)
                huffcode[i].code[huffcode[i].start] = 0;
            else
                huffcode[i].code[huffcode[i].start] = 1;
            child = parent;
            parent = T[parent].parent;
            huffcode[i].start--;
        }
        huffcode[i].start++; // step back to the first bit actually written
    }
}
void DisplayCode(Huffcode huffcode[], HTNODE T[], int n)
{
    FILE *fp;
    fp = fopen("Code.txt", "w");
    for (int i = 0; i < n; i++)
    {
        fprintf(fp, "%c ", T[i].ch);
        for (int j = huffcode[i].start; j <= MAXBIT - 1; j++)
        {
            fprintf(fp, "%d", huffcode[i].code[j]);
        }
        fprintf(fp, "\n");
    }
    fclose(fp);
}
void CodeFile(HTNODE T[], Huffcode huffcode[], int n)
{
    FILE *fp1 = fopen("passage.txt", "r");
    FILE *fp2 = fopen("encode.txt", "w");
    char ch;
    while (fscanf(fp1, "%c", &ch) == 1)
    {
        for (int i = 0; i < n; i++)
        {
            if (T[i].ch == ch)
            {
                // write the code as text digits, one character per bit
                for (int j = huffcode[i].start; j <= MAXBIT - 1; j++)
                    fprintf(fp2, "%d", huffcode[i].code[j]);
                break; // each character matches exactly one leaf
            }
        }
    }
    fclose(fp1);
    fclose(fp2);
}
void Calculate(Char words[], Huffcode huffcode[], int n)
{
    int result = 0;  // total number of characters in the text
    float ave = 0.0; // average Huffman code length
    int m = 1;
    int add = 0; // bits per character of a fixed-length code: ceil(log2(n))
    for (int i = 0; i < n; i++)
    {
        result += words[i].count;
    }
    for (int i = 0; i < n; i++)
        ave += (MAXBIT - huffcode[i].start) * ((float)words[i].count / (float)result);
    while (m < n)
    {
        m *= 2;
        add++;
    }
    FILE *fp = fopen("Compression_rate.txt", "w");
    fprintf(fp, "%f%%", (float)(add - ave) / (float)add * 100);
    fclose(fp);
}
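void Decoding(HTNODE T[], int n)
{
    FILE *fp1 = fopen("encode.txt", "r");
    FILE *fp2 = fopen("decode.txt", "w");
    static char code[100000]; // static keeps the large buffer off the stack
    code[0] = '\0';
    fscanf(fp1, "%99999s", code); // the width limit prevents a buffer overflow
    int i = 0, j = 0, temp;
    for (i = 0; code[i] != '\0'; i++)
        ;
    while (j < i)
    {
        temp = 2 * n - 2; // root of the Huffman tree
        while (j < i && T[temp].lchild != -1 && T[temp].rchild != -1)
        {
            if (code[j] == '0')
                temp = T[temp].lchild;
            else
                temp = T[temp].rchild;
            j++;
        }
        // print only when a leaf was reached, not on a truncated codeword
        if (T[temp].lchild == -1 && T[temp].rchild == -1)
            fprintf(fp2, "%c", T[temp].ch);
    }
    fclose(fp1);
    fclose(fp2);
}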
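bool Judge()
{
    FILE *fp1 = fopen("passage.txt", "r");
    FILE *fp2 = fopen("decode.txt", "r");
    if (fp1 == NULL || fp2 == NULL)
    {
        if (fp1 != NULL) fclose(fp1);
        if (fp2 != NULL) fclose(fp2);
        return false;
    }
    int ch1, ch2; // int, so that EOF can be told apart from any character
    bool same = true;
    do
    {
        ch1 = fgetc(fp1);
        ch2 = fgetc(fp2);
        if (ch1 != ch2) // also catches one file ending before the other
            same = false;
    } while (same && ch1 != EOF);
    fclose(fp1);
    fclose(fp2);
    return same;
}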