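SQL Practice Exercises
- Exercise 1: Highest-paid employee in each department (difficulty: medium)
- Exercise 2: Swapping seats (difficulty: medium)
- Exercise 3: Score ranking (difficulty: medium)
- Exercise 4: Consecutive numbers (difficulty: medium)
- Exercise 5: Tree nodes (difficulty: medium)
- Exercise 6: Managers with at least five direct reports (difficulty: medium)
- Exercise 7: Question with the highest answer rate (difficulty: medium)
- Exercise 8: Top 3 salaries in each department (difficulty: medium)
- Exercise 9: Shortest distance in a plane (difficulty: hard)
- Exercise 10: Trips and users (difficulty: hard)
- Exercise 1: Rows to columns
- Exercise 2: Columns to rows
- Exercise 3: Live-commerce anchors
- Exercise 4: How do you view the execution plan of a SQL statement in MySQL, and what information does it show?
- Exercise 5: Explain what ACID means in SQL databases
- Exercise 1: Rows to columns
- Exercise 2: Columns to rows
- Exercise 3: Consecutive logins
- Exercise 4: Causes of data skew in Hive and optimization strategies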
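Exercise 1: Highest-paid employee in each department (difficulty: medium)
Create the Employee table, which holds all employees. Each employee has an Id, a salary and a department Id.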
+----+-------+--------+--------------+
| Id | Name | Salary | DepartmentId |
+----+-------+--------+--------------+
| 1 | Joe | 70000 | 1 |
| 2 | Henry | 80000 | 2 |
| 3 | Sam | 60000 | 2 |
| 4 | Max | 90000 | 1 |
+----+-------+--------+--------------+
Create the Department table, which holds all departments of the company.
+----+----------+
| Id | Name |
+----+----------+
| 1 | IT |
| 2 | Sales |
+----+----------+
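Write a SQL query that finds the highest-paid employee in each department. For the tables above, Max has the highest salary in the IT department and Henry has the highest salary in the Sales department.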
+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT | Max | 90000 |
| Sales | Henry | 80000 |
+------------+----------+--------+
-- Autumn recruitment handbook A
CREATE TABLE employee (
id CHAR(4) NOT NULL,
name VARCHAR(20) NOT NULL,
salary INTEGER,
departmentid char(4),
PRIMARY KEY (id)
);
CREATE TABLE department (
id CHAR(4) NOT NULL,
name VARCHAR(20) NOT NULL,
PRIMARY KEY (id)
);
insert into employee values ('1','joe',70000,'1'),
('2','henry',80000,'2'),
('3','sam',60000,'2'),
('4','max',90000,'1');
insert into department values ('1', 'it'),
('2', 'sales');
select * from department;
select * from employee;
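-- Solution 1 (unreliable): GROUP BY returns one row per department, so tied top salaries collapse into a single row.
-- p1.name is not aggregated, so the name shown is not guaranteed to belong to the employee with MAX(salary), and the query is rejected when ONLY_FULL_GROUP_BY is enabled.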
----------------------------------------------------------------------------------
SELECT
p2.name AS department,
p1.name AS employee,
MAX(p1.salary) AS salary
FROM
employee AS p1
LEFT OUTER JOIN
department AS p2 ON p1.departmentid = p2.id
GROUP BY department;
--------------------------------------------------------------------------------
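-- Solution 2: handles ties for the highest salary. A WHERE ... IN subquery first finds each department's maximum salary and then matches on (departmentid, salary), so no tied employee is dropped. LEFT JOIN also keeps employees without a matching department (their department shows as NULL); use an inner JOIN if such rows should be excluded.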
SELECT
p2.name AS department,
p1.name AS employee,
p1.salary AS salary
FROM
employee AS p1
LEFT OUTER JOIN
department AS p2 ON p1.departmentid = p2.id
WHERE
(p1.departmentid , p1.salary) IN (SELECT
departmentid, MAX(salary)
FROM
employee
GROUP BY departmentid);
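As a self-contained check of the tie handling in Solution 2, the sketch below runs the same `(departmentid, salary) IN (...)` pattern against an in-memory SQLite database through Python's `sqlite3` module. SQLite stands in for MySQL only so that the example runs on its own; the function name `top_paid_per_department` and the extra tied employee mentioned afterwards are illustrative assumptions, and row-value `IN` requires SQLite 3.15 or newer.

```python
import sqlite3


def top_paid_per_department(employees, departments):
    """Return (department, employee, salary) rows, keeping every tied top earner."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE department (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE employee (id TEXT PRIMARY KEY, name TEXT NOT NULL, "
        "salary INTEGER, departmentid TEXT)"
    )
    conn.executemany("INSERT INTO department VALUES (?, ?)", departments)
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", employees)
    sql = """
        SELECT d.name, e.name, e.salary
        FROM employee AS e
        JOIN department AS d ON e.departmentid = d.id
        WHERE (e.departmentid, e.salary) IN (
            SELECT departmentid, MAX(salary)
            FROM employee
            GROUP BY departmentid)
        ORDER BY d.name, e.name
    """
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows
```

Calling it with the exercise data plus a hypothetical fifth employee who also earns 90000 in department 1 returns both tied employees for that department, which is the behaviour Solution 1 cannot guarantee.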
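Exercise 2: Swapping seats (difficulty: medium)
Xiaomei is an information technology teacher at a middle school. She has a seat table that stores the students' names and their corresponding seat ids.
The id column increases continuously.
Xiaomei wants to swap the seats of every two adjacent students.
Can you write a SQL query that outputs the result she wants?
Create the seat table shown below:
Example: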
+---------+---------+
| id | student |
+---------+---------+
| 1 | Abbot |
| 2 | Doris |
| 3 | Emerson |
| 4 | Green |
| 5 | Jeames |
+---------+---------+
If the input is the table above, the output is:
+---------+---------+
| id | student |
+---------+---------+
| 1 | Doris |
| 2 | Abbot |
| 3 | Green |
| 4 | Emerson |
| 5 | Jeames |
+---------+---------+
Note:
If the number of students is odd, the last student's seat does not need to change.
SELECT
(CASE
WHEN MOD(id, 2) != 0 AND counts != id THEN id + 1
WHEN MOD(id, 2) != 0 AND counts = id THEN id
ELSE id - 1
END) AS id,
student
FROM
seat,
(SELECT
COUNT(*) AS counts
FROM
seat) AS seat_counts
ORDER BY id;
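Exercise 3: Score ranking (difficulty: medium)
Suppose that in a final exam the average scores of the four classes in grade two are 93, 93, 93 and 91. How many different rankings can be produced? Which functions are used, and what does each ranking look like?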
+-------+-----------+
| class | score_avg |
+-------+-----------+
| 1 | 93 |
| 2 | 93 |
| 3 | 93 |
| 4 | 91 |
+-------+-----------+
CREATE TABLE avg_score (
class CHAR(4) NOT NULL,
score_avg INTEGER,
PRIMARY KEY (class)
);
insert into avg_score values('1', 93),('2',93),('3',93),('4',91);
select * from avg_score;
---------------------------------------------------------------------------
-- Option 1: sort directly with ORDER BY
SELECT
*
FROM
avg_score
ORDER BY score_avg desc;
-----------------------------------------------------------------------------
-- Option 2: window functions. RANK() yields 1, 1, 1, 4; DENSE_RANK() yields 1, 1, 1, 2; ROW_NUMBER() yields 1, 2, 3, 4
select
class,
score_avg,
rank() over (order by score_avg desc) as ranking
from
avg_score;
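To compare the three ranking functions side by side, here is a minimal sketch that evaluates them on the same four class averages. It uses SQLite through Python's `sqlite3` module instead of MySQL so that it is runnable on its own (window functions need SQLite 3.25 or newer). The function name `rank_classes` is an assumption, and `class` is added as a tie-breaker for `ROW_NUMBER()` only to make its output deterministic.

```python
import sqlite3


def rank_classes(rows):
    """Return (class, row_number, rank, dense_rank) for each class average."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE avg_score (class TEXT PRIMARY KEY, score_avg INTEGER)")
    conn.executemany("INSERT INTO avg_score VALUES (?, ?)", rows)
    sql = """
        SELECT class,
               ROW_NUMBER() OVER (ORDER BY score_avg DESC, class) AS row_num,
               RANK()       OVER (ORDER BY score_avg DESC)        AS rnk,
               DENSE_RANK() OVER (ORDER BY score_avg DESC)        AS dense_rnk
        FROM avg_score
        ORDER BY class
    """
    result = conn.execute(sql).fetchall()
    conn.close()
    return result
```

`RANK()` leaves a gap after the three-way tie, `DENSE_RANK()` does not, and `ROW_NUMBER()` ignores ties altogether.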
Exercise 4: Consecutive numbers (difficulty: medium)
Write a SQL query that finds all numbers that appear at least three times in a row.
+----+-----+
| Id | Num |
+----+-----+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
| 5 | 1 |
| 6 | 2 |
| 7 | 2 |
+----+-----+
For example, given the Logs table above, 1 is the only number that appears at least three times in a row.
+-----------------+
| ConsecutiveNums |
+-----------------+
| 1 |
+-----------------+
-- Generalized to n consecutive occurrences; this example uses n = 5
CREATE TABLE number5 (
id integer NOT NULL,
num INTEGER,
PRIMARY KEY (id)
);
insert into number5 values ('1', 1),
('2', 1),
('3', 1),
('4', 1),
('5', 1),
('6', 2),
('7', 3),
('8', 3),
('9', 3),
('10', 3),
('11', 4),
('12', 4),
('13', 4),
('14', 4),
('15', 4);
select * from number5;
---------------------------------------------------------------------------------
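-- Select the target num; each qualifying num should be listed only once, so DISTINCT removes duplicates
select distinct Num as ConsecutiveNums
from
-- the derived table sub holds num plus a group marker; the query around it counts the rows per (num, group)
(select num, count(1) as rowcount
from
(select id, num,
-- row_number() numbers the rows; partition by splits them into groups (similar to group by)
-- within one run of consecutive equal nums, (overall row number - row number within num) is constant; use it as rowcountgroup
row_number() over (order by id) -
row_number() over (partition by num order by id) as rowcountgroup
from number5) as sub
-- group by num and rowcountgroup; having keeps the runs that repeat 5 times or more
group by num, rowcountgroup
having count(1) >=5) as result;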
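The query above works on a separate table with n = 5. As a hedged sketch of the same row-number-difference technique applied to the original Logs data with n = 3, the following runs it on SQLite via Python's `sqlite3` module (SQLite 3.25+ for window functions). The function name `consecutive_nums` and the `grp` column alias are assumptions made for this illustration.

```python
import sqlite3


def consecutive_nums(rows, n=3):
    """Return the numbers that appear at least n times in a row, ordered by id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, num INTEGER)")
    conn.executemany("INSERT INTO logs VALUES (?, ?)", rows)
    sql = """
        SELECT DISTINCT num
        FROM (
            SELECT num,
                   -- constant within one run of equal consecutive values
                   ROW_NUMBER() OVER (ORDER BY id)
                 - ROW_NUMBER() OVER (PARTITION BY num ORDER BY id) AS grp
            FROM logs
        ) AS sub
        GROUP BY num, grp
        HAVING COUNT(*) >= ?
        ORDER BY num
    """
    result = [row[0] for row in conn.execute(sql, (n,))]
    conn.close()
    return result
```

Passing a different `n` changes only the `HAVING` threshold, which is how the author's n = 5 variant differs from the stated problem.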
Exercise 5: Tree nodes (difficulty: medium)
In the tree table, id identifies a tree node and p_id is the id of its parent node.
+----+------+
| id | p_id |
+----+------+
| 1 | null |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
| 5 | 2 |
+----+------+
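Each node is one of the following three types:
- Root: the node is the root of the tree.
- Leaf: the node is a leaf.
- Inner: the node is neither the root nor a leaf.
Write a query that prints each node id with its node type, ordered by node id. For the example above the result is: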
+----+------+
| id | Type |
+----+------+
| 1 | Root |
| 2 | Inner|
| 3 | Leaf |
| 4 | Leaf |
| 5 | Leaf |
+----+------+
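Explanation
- Node '1' is the root because its parent is NULL; it has two children, '2' and '3'.
- Node '2' is an inner node because its parent is '1' and it has children '4' and '5'.
- Nodes '3', '4' and '5' are leaves because they have a parent but no children.
Below is a drawing of the tree: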
1
/ \
2 3
/ \
4 5
Note
If the tree has only one node, only the root type needs to be output.
CREATE TABLE tree (
id INTEGER,
p_id INTEGER,
PRIMARY KEY (id)
);
insert into tree values (1, null),
(2, 1),
(3, 1),
(4, 2),
(5, 2);
--------------------------------------------------------------------
SELECT
*
FROM
tree as t1
left outer JOIN
tree as t2
ON t1.id = t2.p_id;
-- The result is as follows
/*
id, p_id, id, p_id
'1', NULL, '3', '1'
'1', NULL, '2', '1'
'2', '1', '5', '2'
'2', '1', '4', '2'
'3', '1', NULL, NULL
'4', '2', NULL, NULL
'5', '2', NULL, NULL
*/
---------------------------------------------------------------------
SELECT
distinct t1.id AS 'id',
CASE
WHEN t1.p_id IS NULL THEN 'root'
WHEN t2.id IS NULL THEN 'leaf'
ELSE 'inner'
END AS 'type'
FROM
tree AS t1
LEFT OUTER JOIN
tree AS t2 ON t1.id = t2.p_id
ORDER BY id;
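Exercise 6: Managers with at least five direct reports (difficulty: medium)
The Employee table holds all employees and their managers. Each employee has an Id and also the Id of the corresponding manager (ManagerId).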
+------+----------+-----------+----------+
|Id |Name |Department |ManagerId |
+------+----------+-----------+----------+
|101 |John |A |null |
|102 |Dan |A |101 |
|103 |James |A |101 |
|104 |Amy |A |101 |
|105 |Anne |A |101 |
|106 |Ron |B |101 |
+------+----------+-----------+----------+
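For the Employee table, write a SQL statement that finds the managers with at least 5 direct reports. For the table above, the output should be: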
+-------+
| Name |
+-------+
| John |
+-------+
Note:
No one reports to themselves.
CREATE TABLE employee1 (
id CHAR(4) NOT NULL,
name VARCHAR(32),
department VARCHAR(32),
managerid CHAR(32),
PRIMARY KEY (id)
);
insert into employee1 values ('101', 'john', 'A', null),
('102', 'dan', 'A', '101'),
('103', 'james', 'A', '101'),
('104', 'amy', 'A', '101'),
('105', 'anne', 'A', '101'),
('106', 'ron', 'B', '101');
select * from employee1;
-------------------------------------------------------------
-- Filter with WHERE ... IN
SELECT
name
FROM
employee1
WHERE
id in (SELECT
managerid
FROM
employee1
GROUP BY managerid
having COUNT(1) > 4);
---------------------------------------------------------------
-- Filter with a JOIN; unlike LEFT OUTER JOIN, an inner JOIN produces no NULL rows
SELECT
name
FROM
employee1
JOIN
(SELECT
managerid, COUNT(1) AS cnt
FROM
employee1
GROUP BY managerid
HAVING cnt > 4) AS e1 ON id = e1.managerid;
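Exercise 7: Question with the highest answer rate (difficulty: medium)
Find the question with the highest answer rate in the survey_log table. The table has the columns uid, action, question_id, answer_id, q_num and timestamp.
uid is the user id; action takes the values "show", "answer" and "skip"; answer_id is non-null when action is "answer" and null when action is "show" or "skip"; q_num is the sequence number of the question.
Write a SQL statement that finds the question with the highest answer rate.
Example:
Input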
uid | action | question_id | answer_id | q_num | timestamp |
---|---|---|---|---|---|
5 | show | 285 | null | 1 | 123 |
5 | answer | 285 | 124124 | 1 | 124 |
5 | show | 369 | null | 2 | 125 |
5 | skip | 369 | null | 2 | 126 |
Output
survey_log |
---|
285 |
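Explanation
Question 285 has an answer rate of 1/1, whereas question 369 has an answer rate of 0/1, so the output is 285.
Note:
The answer rate is the proportion of times a question was answered out of the times it was shown.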
CREATE TABLE survey_log (
uid CHAR(4) NOT NULL,
action VARCHAR(32) NOT NULL,
question_id INTEGER,
answer_id INTEGER,
q_num INTEGER,
timestamp INTEGER
);
insert into survey_log values ('5', 'show', 285, null, 1, 123),
('5', 'answer', 285, 1, 1, 124),
('5', 'show', 369, null, 2, 125),
('5', 'skip', 369, null, 2, 126);
select * from survey_log;
----------------------------------------------------------------------
SELECT
question_id AS survey_log
FROM
(SELECT
question_id,
SUM(CASE
WHEN action = 'answer' THEN 1
ELSE 0
END) / SUM(CASE
WHEN action = 'show' THEN 1
ELSE 0
END) AS ratio
FROM
survey_log
GROUP BY question_id
ORDER BY ratio DESC
LIMIT 1) AS x;
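-- LIMIT chooses which rows to return: LIMIT 1 returns the first row, LIMIT 10 returns the first 10 rows, and LIMIT 1, 10 skips 1 row and returns 10 rows starting from the second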
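Exercise 8: Top 3 salaries in each department (difficulty: medium)
Empty the employee table from item 7 and re-insert the data below (in effect, rows 5 and 6 are added):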
+----+-------+--------+--------------+
| Id | Name | Salary | DepartmentId |
+----+-------+--------+--------------+
| 1 | Joe | 70000 | 1 |
| 2 | Henry | 80000 | 2 |
| 3 | Sam | 60000 | 2 |
| 4 | Max | 90000 | 1 |
| 5 | Janet | 69000 | 1 |
| 6 | Randy | 85000 | 1 |
+----+-------+--------+--------------+
Write a SQL query that finds the employees with the three highest salaries in each department. For the table above, the query should return:
+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT | Max | 90000 |
| IT | Randy | 85000 |
| IT | Joe | 70000 |
| Sales | Henry | 80000 |
| Sales | Sam | 60000 |
+------------+----------+--------+
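In addition, consider how to generalize this to the top N salaries in each department.
-- To use the window function row_number: the first level joins the two tables, the second level ranks the rows within each department, and the third level keeps the top three salaries (for the top N, filter on row_num <= N)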
select department, employee, salary
from
(
select department,
employee,
salary,
row_number() over (partition by department order by salary desc) as row_num
from
(select p1.id, p1.name as employee, p1.departmentid, p1.salary, p2.name as department from employee as p1
left outer join
department as p2
on p1.departmentid = p2.id) as x) as y
where row_num < 4;
-- Rewritten with two levels of nesting
select department, employee, salary
from
(select p1.id, p1.name as employee, p1.departmentid, p1.salary, p2.name as department,
row_number() over (partition by p2.name order by salary desc) as row_num
from employee as p1
left outer join
department as p2
on p1.departmentid = p2.id) as x
where row_num < 4;
--------------------------------------------------------------
-- Filter with a correlated subquery in WHERE
SELECT
d.name AS department, e.name AS employee, e.salary AS salary
FROM
employee AS e
INNER JOIN
department AS d ON d.id = e.departmentid
WHERE
(SELECT
COUNT(DISTINCT p4.salary)
FROM
employee AS p4
WHERE
e.departmentid = p4.departmentid
AND p4.salary >= e.salary) < 4
ORDER BY e.departmentid , e.salary DESC;
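Exercise 9: Shortest distance in a plane (difficulty: hard)
The point_2d table holds the coordinates (x, y) of some points (more than two) in a plane.
Write a query that finds the shortest distance between these points, rounded to 2 decimal places.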
|x | y |
|----|----|
| -1 | -1 |
| 0 | 0 |
| -1 | -2 |
The shortest distance is 1, from point (-1,-1) to point (-1,-2), so the output is:
+--------+
|shortest|
+--------+
|1.00 |
+--------+
**Note:** the maximum distance between any two points is less than 10000.
SELECT
ROUND(MIN(SQRT(POWER(p1.x - p2.x, 2) + POWER(p1.y - p2.y, 2))),
2) AS shortest
FROM
point_2d AS p1
LEFT OUTER JOIN
point_2d AS p2 ON (p1.x , p1.y) != (p2.x , p2.y);
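Exercise 10: Trips and users (difficulty: hard)
The Trips table stores the trip records of all taxis. Each trip has a unique key Id; Client_Id and Driver_Id are foreign keys to Users_Id in the Users table. Status is an enum with the members ('completed', 'cancelled_by_driver', 'cancelled_by_client').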
Id | Client_Id | Driver_Id | City_Id | Status | Request_at |
---|---|---|---|---|---|
1 | 1 | 10 | 1 | completed | 2013-10-1 |
2 | 2 | 11 | 1 | cancelled_by_driver | 2013-10-1 |
3 | 3 | 12 | 6 | completed | 2013-10-1 |
4 | 4 | 13 | 6 | cancelled_by_client | 2013-10-1 |
5 | 1 | 10 | 1 | completed | 2013-10-2 |
6 | 2 | 11 | 6 | completed | 2013-10-2 |
7 | 3 | 12 | 6 | completed | 2013-10-2 |
8 | 2 | 12 | 12 | completed | 2013-10-3 |
9 | 3 | 10 | 12 | completed | 2013-10-3 |
10 | 4 | 13 | 12 | cancelled_by_driver | 2013-10-3 |
The Users table stores all users. Each user has a unique key Users_Id. Banned indicates whether the user is banned, and Role is an enum with the members ('client', 'driver', 'partner').
+----------+--------+--------+
| Users_Id | Banned | Role |
+----------+--------+--------+
| 1 | No | client |
| 2 | Yes | client |
| 3 | No | client |
| 4 | No | client |
| 10 | No | driver |
| 11 | No | driver |
| 12 | No | driver |
| 13 | No | driver |
+----------+--------+--------+
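Write a SQL statement that finds the cancellation rate of unbanned users for each day from October 1, 2013 to October 3, 2013. Based on the tables above, your statement should return the following result, with the cancellation rate rounded to two decimal places.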
+------------+-------------------+
| Day | Cancellation Rate |
+------------+-------------------+
| 2013-10-01 | 0.33 |
| 2013-10-02 | 0.00 |
| 2013-10-03 | 0.50 |
+------------+-------------------+
drop table if exists trips;
CREATE TABLE trips (
id integer NOT NULL,
client_id CHAR(4) NOT NULL,
driver_id CHAR(4) NOT NULL,
city_id CHAR(4) NOT NULL,
status VARCHAR(32),
request_at DATE,
PRIMARY KEY (id)
);
insert into trips values ('1', '1', '10', '1', 'completed', '2013-10-1'),
('2', '2', '11', '1', 'cancelled_by_driver', '2013-10-1'),
('3', '3', '12', '6', 'completed', '2013-10-1'),
('4', '4', '13', '6', 'cancelled_by_client', '2013-10-1'),
('5', '1', '10', '1', 'completed', '2013-10-2'),
('6', '2', '11', '6', 'completed', '2013-10-2'),
('7', '3', '12', '6', 'completed', '2013-10-2'),
('8', '2', '12', '12', 'completed', '2013-10-3'),
('9', '3', '10', '12', 'completed', '2013-10-3'),
('10', '4', '13', '12', 'cancelled_by_driver', '2013-10-3');
SELECT
*
FROM
trips;
drop table if exists users;
CREATE TABLE users (
user_id INTEGER NOT NULL,
banned VARCHAR(32) NOT NULL,
role VARCHAR(32) NOT NULL,
PRIMARY KEY (user_id)
);
insert into users values ('1', 'no', 'client'),
('2', 'yes', 'client'),
('3', 'no', 'client'),
('4', 'no', 'client'),
('10', 'no', 'driver'),
('11', 'no', 'driver'),
('12', 'no', 'driver'),
('13', 'no', 'driver');
SELECT
*
FROM
users;
-------------------------------------------------------------------------
-- Inner join the two tables and inspect the result
select *
from trips
inner join
users
on trips.client_id=users.user_id and users.banned='no';
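-- This approach never produced the right answer. It turned out that the denominator in the problem counts all trips, not only the completed ones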
SELECT
request_at,
(SUM(CASE
WHEN banned = 'no' AND (status = 'cancelled_by_driver' or status='cancelled_by_client') THEN 1
ELSE 0
END) /
SUM(CASE
WHEN
banned = 'no'
AND status = 'completed'
THEN
1
ELSE 0
END)) AS 'cancellation rate'
FROM
(SELECT
*
FROM
trips
INNER JOIN users ON trips.client_id = users.user_id) AS x
GROUP BY request_at;
-----------------------------------------------------------------
-- Change the denominator to COUNT(*), which counts every row in the group. If there were many kinds of cancellation, chaining them with OR would clearly be clumsy.
SELECT
request_at,
(SUM(CASE
WHEN
(status = 'cancelled_by_driver'
OR status = 'cancelled_by_client')
THEN
1
ELSE 0
END) / COUNT(*)) AS 'cancellation rate'
FROM
(SELECT
*
FROM
trips
INNER JOIN users ON trips.client_id = users.user_id
AND users.banned = 'no') AS x
GROUP BY request_at;
-------------------------------------------------------------------
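-- Use LIKE for a fuzzy match instead. The problem also restricts the dates, so add a date filter: IN matches only the values listed in the parentheses, while BETWEEN ... AND covers a whole range. ROUND(..., 2) keeps the two decimal places the problem asks for.
SELECT
request_at,
ROUND(SUM(CASE
WHEN
status like 'cancel%'
THEN
1
ELSE 0
END) / COUNT(*), 2) AS 'cancellation rate'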
FROM
(SELECT
*
FROM
trips
INNER JOIN users ON trips.client_id = users.user_id
AND users.banned = 'no') AS x
WHERE
request_at BETWEEN '2013-10-01' AND '2013-10-03'
GROUP BY request_at;
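The queries above filter only on banned clients, which is enough for this data set because no driver is banned. A minimal sketch of a variant that requires both the client and the driver to be unbanned is given below, run on SQLite through Python's `sqlite3` module so that it is self-contained. The function name `cancellation_rates`, the two joins on `users`, and the zero-padded ISO date strings are assumptions of this sketch; `1.0 *` forces floating-point division, which SQLite would otherwise perform as integer division.

```python
import sqlite3

USERS = [(1, "no", "client"), (2, "yes", "client"), (3, "no", "client"),
         (4, "no", "client"), (10, "no", "driver"), (11, "no", "driver"),
         (12, "no", "driver"), (13, "no", "driver")]
TRIPS = [(1, 1, 10, 1, "completed", "2013-10-01"),
         (2, 2, 11, 1, "cancelled_by_driver", "2013-10-01"),
         (3, 3, 12, 6, "completed", "2013-10-01"),
         (4, 4, 13, 6, "cancelled_by_client", "2013-10-01"),
         (5, 1, 10, 1, "completed", "2013-10-02"),
         (6, 2, 11, 6, "completed", "2013-10-02"),
         (7, 3, 12, 6, "completed", "2013-10-02"),
         (8, 2, 12, 12, "completed", "2013-10-03"),
         (9, 3, 10, 12, "completed", "2013-10-03"),
         (10, 4, 13, 12, "cancelled_by_driver", "2013-10-03")]


def cancellation_rates(trips, users, start, end):
    """Return (day, rate) per day, counting only trips whose client and driver are unbanned."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, banned TEXT, role TEXT)")
    conn.execute(
        "CREATE TABLE trips (id INTEGER PRIMARY KEY, client_id INTEGER, driver_id INTEGER, "
        "city_id INTEGER, status TEXT, request_at TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", users)
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?, ?, ?, ?)", trips)
    sql = """
        SELECT t.request_at,
               ROUND(1.0 * SUM(CASE WHEN t.status LIKE 'cancelled%' THEN 1 ELSE 0 END)
                     / COUNT(*), 2) AS rate
        FROM trips AS t
        JOIN users AS c ON c.user_id = t.client_id AND c.banned = 'no'
        JOIN users AS d ON d.user_id = t.driver_id AND d.banned = 'no'
        WHERE t.request_at BETWEEN ? AND ?
        GROUP BY t.request_at
        ORDER BY t.request_at
    """
    result = conn.execute(sql, (start, end)).fetchall()
    conn.close()
    return result
```

On the exercise data this gives the same three daily rates as the expected result table; the second join only matters once a banned driver appears.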
Exercise 1: Rows to columns
Suppose the final exam scores of three children, A, B and C, are as follows:
+-----+-----------+------|
| name| subject |score |
+-----+-----------+------|
| A | chinese | 99 |
| A | math | 98 |
| A | english | 97 |
| B | chinese | 92 |
| B | math | 91 |
| B | english | 90 |
| C | chinese | 88 |
| C | math | 87 |
| C | english | 86 |
+-----+-----------+------|
Use SQL to convert the scores above into the following format:
+-----+-----------+------|---------|
| name| chinese | math | english |
+-----+-----------+------|---------|
| A | 99 | 98 | 97 |
| B | 92 | 91 | 90 |
| C | 88 | 87 | 86 |
+-----+-----------+------|---------|
- When the column being pivoted is numeric, aggregate functions such as SUM, AVG, MAX and MIN can be used;
- When the column being pivoted is text, aggregate functions such as MAX and MIN can be used.
SELECT
name,
SUM(CASE WHEN subject = 'chinese' THEN score ELSE NULL END) AS 'chinese',
SUM(CASE WHEN subject = 'math' THEN score ELSE NULL END) AS 'math',
SUM(CASE WHEN subject = 'english' THEN score ELSE NULL END) AS 'english'
FROM
score1
GROUP BY name;
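The pivot query reads from a table score1 whose definition is not shown. The sketch below assumes a three-column layout (name, subject, score) for it and runs the same conditional-aggregation pivot on SQLite through Python's `sqlite3` module; the function name `pivot_scores` and the column types are assumptions made for this illustration.

```python
import sqlite3


def pivot_scores(rows):
    """Pivot (name, subject, score) rows into one row per name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE score1 (name TEXT, subject TEXT, score INTEGER)")
    conn.executemany("INSERT INTO score1 VALUES (?, ?, ?)", rows)
    sql = """
        SELECT name,
               SUM(CASE WHEN subject = 'chinese' THEN score END) AS chinese,
               SUM(CASE WHEN subject = 'math'    THEN score END) AS math,
               SUM(CASE WHEN subject = 'english' THEN score END) AS english
        FROM score1
        GROUP BY name
        ORDER BY name
    """
    result = conn.execute(sql).fetchall()
    conn.close()
    return result
```

A `CASE` without `ELSE` yields NULL, which `SUM` ignores, so each output cell picks up only the score of its own subject.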
Exercise 2: Columns to rows
Suppose the final exam scores of three children, A, B and C, are as follows:
+-----+-----------+------|---------|
| name| chinese | math | english |
+-----+-----------+------|---------|
| A | 99 | 98 | 97 |
| B | 92 | 91 | 90 |
| C | 88 | 87 | 86 |
+-----+-----------+------|---------|
Use SQL to convert the scores above into the following format:
+-----+-----------+------|
| name| subject |score |
+-----+-----------+------|
| A | chinese | 99 |
| A | math | 98 |
| A | english | 97 |
| B | chinese | 92 |
| B | math | 91 |
| B | english | 90 |
| C | chinese | 88 |
| C | math | 87 |
| C | english | 86 |
+-----+-----------+------|
First store the pivoted result from the previous exercise as table score3:
create table score3 as
select
*
from
(SELECT
name,
SUM(CASE
WHEN subject = 'chinese' THEN score
ELSE NULL
END) AS 'chinese',
SUM(CASE
WHEN subject = 'math' THEN score
ELSE NULL
END) AS 'math',
SUM(CASE
WHEN subject = 'english' THEN score
ELSE NULL
END) AS 'english'
FROM
score1
GROUP BY name) as score2;
-----------------------------------------------------------------------------------------------------------------
SELECT
name, 'chinese' AS subject, chinese AS score
FROM
score3
UNION ALL SELECT
name, 'math' AS subject, math AS score
FROM
score3
UNION ALL SELECT
name, 'english' AS subject, english AS score
FROM
score3
ORDER BY name;
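Exercise 3: Live-commerce anchors
Suppose a platform's daily sales figures for its live-commerce anchors in 2021 are as follows:
Table name: anchor_sales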
+-------------+------------+---------|
| anchor_name | date | sales |
+-------------+------------+---------|
| A | 20210101 | 40000 |
| B | 20210101 | 80000 |
| A | 20210102 | 10000 |
| C | 20210102 | 90000 |
| A | 20210103 | 7500 |
| C | 20210103 | 80000 |
+-------------+------------+---------|
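Definition: if an anchor's sales on a given day account for 90% or more of the platform's total sales on that day, the anchor is called a star anchor and that day is called a star-anchor day.
Use SQL to compute the following:
a. How many star-anchor days were there in 2021?
b. How many star anchors were there in 2021?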
CREATE TABLE anchor_sales (
anchor_name VARCHAR(32) NOT NULL,
date DATE,
sales INTEGER
);
insert into anchor_sales values ('A', '20210101', 40000),
('B', '20210101', 80000),
('A', '20210102', 10000),
('C', '20210102', 90000),
('A', '20210103', 7500),
('C', '20210103', 80000);
select * from anchor_sales;
-----------------------------------------------------------------------------------------------
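-- a. Number of star-anchor days: days on which the top seller reached at least 90% of that day's total
SELECT COUNT(*) AS star_days
FROM (SELECT date FROM anchor_sales
WHERE YEAR(date) = 2021
GROUP BY date
HAVING MAX(sales) / SUM(sales) >= 0.9) AS d;
-- b. Number of star anchors: distinct anchors who reached the 90% share on at least one day
SELECT COUNT(DISTINCT s.anchor_name) AS star_anchors
FROM anchor_sales AS s
JOIN (SELECT date, SUM(sales) AS total FROM anchor_sales GROUP BY date) AS t ON s.date = t.date
WHERE YEAR(s.date) = 2021 AND s.sales / t.total >= 0.9;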
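Exercise 4: How do you view the execution plan of a SQL statement in MySQL, and what information does it show?
explain select * from <table_name>;
The output contains the following columns: id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra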
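Exercise 5: Explain what ACID means in SQL databases
ACID is the collective name for Atomicity, Consistency, Isolation and Durability.
Atomicity
A transaction is an indivisible unit: either all of its operations complete or none of them do, and it cannot stop partway. If an error occurs while the transaction is executing, it is rolled back to the state before the transaction began, as if it had never run.
Every single T-SQL statement, such as an INSERT or UPDATE statement, is a transaction in itself. Users can also define their own transactions that group several statements with BEGIN TRANSACTION ... COMMIT, typically wrapped in TRY...CATCH so that an error triggers a ROLLBACK. A bank transfer is an example: debiting account A and crediting account B form one user-defined transaction.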
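The bank transfer can be sketched as a small runnable example. This is an illustration of atomicity only, not SQL Server code: it uses SQLite through Python's `sqlite3` module, where using the connection as a context manager commits on success and rolls back if an exception escapes the block. The `account` table, the `transfer` function and the "no negative balance" rule are all assumptions made for the sketch.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates persist or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def demo_transfer():
    """Run one successful and one failing transfer and report the final balances."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 0)])
    conn.commit()
    first = transfer(conn, "A", "B", 60)   # enough funds: committed
    second = transfer(conn, "A", "B", 60)  # would overdraw A: rolled back
    balances = dict(conn.execute("SELECT name, balance FROM account"))
    conn.close()
    return first, second, balances
```

In the failing transfer both UPDATE statements have already run when the check fails, yet neither change survives the rollback.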
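Consistency
Consistency means that the database's integrity constraints (unique constraints, foreign key constraints, check constraints and so on) are intact both before a transaction starts and after it ends. Business-level consistency can be expressed as database-level consistency.
Isolation
Transactions execute in isolation and do not interfere with one another. A transaction should not be able to see the intermediate data of another transaction. SQL Server uses locking, and the blocking it causes, to provide different levels of isolation between transactions.
Transactions can affect one another in several ways: dirty reads, non-repeatable reads and phantom reads.
A dirty read means that a transaction reads data another transaction has not yet committed, data that may still be rolled back.
A non-repeatable read means that a transaction runs the same query twice and gets different results, because another transaction modified the data between the two queries.
A phantom read is a phenomenon that occurs when transactions do not run independently. For example, the first transaction modifies the data in a table, and the modification covers every row in the table. At the same time, a second transaction also changes the table by inserting a new row. The user of the first transaction then finds that the table still contains unmodified rows, as if hallucinating.
To avoid these effects, SQL Server offers several isolation levels that prevent them to different degrees. A higher isolation level means more locks and therefore lower performance, so the choice is left to the user according to the actual requirements. The default isolation level, Read Committed, is sufficient for most practical needs.
Durability
Once a transaction completes, the changes it made are stored permanently in the database and will not be rolled back.
Even if an accident such as a power failure occurs, a committed transaction remains persisted in the database.
SQL Server guarantees durability with a write-ahead transaction log. Write-ahead means that changes made by a transaction are written to the transaction log before they are written to the database. The log records are numbered sequentially (LSN, log sequence number). When the database crashes or the server loses power, SQL Server checks the log sequence numbers on restart and applies to the database the changes that should have been made but were not, which guarantees durability. (So after a transaction commits, can its operation history also be queried from the WAL?)
The central idea of WAL (write-ahead logging) is that changes to data files may be made only after those changes have been logged, that is, after the log records describing them have been flushed to permanent storage. If this procedure is followed, data pages do not need to be flushed to disk on every transaction commit, because after a crash the database can be recovered from the log: any changes that have not yet been applied to the data pages are first redone from the log records (roll-forward recovery, also called REDO), and then the changes made by uncommitted transactions are removed from the data pages (roll-backward recovery, UNDO).
Original source: https://blog.csdn.net/Michaelia_hu/article/details/75339924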
Exercise 1: Rows to columns
Suppose the match results are as follows (胜 = win, 负 = loss):
+------------+-----------+
| cdate | result |
+------------+-----------+
| 2021-1-1 | 胜 |
| 2021-1-1 | 负 |
| 2021-1-3 | 胜 |
| 2021-1-3 | 负 |
| 2021-1-1 | 胜 |
| 2021-1-3 | 负 |
+------------+-----------+
Use SQL to convert the match results into the following form (比赛日期 = match date):
+------------+-----+-----+
| 比赛日期   | 胜  | 负  |
+------------+-----+-----+
| 2021-1-1   | 2   | 1   |
| 2021-1-3   | 1   | 2   |
+------------+-----+-----+
create table competition
(cdate date,
result varchar(32));
insert into competition values ('2021-1-1', '胜'),
('2021-1-1', '负'),
('2021-1-3', '胜'),
('2021-1-3', '负'),
('2021-1-1', '胜'),
('2021-1-3', '负');
select cdate,
sum(case when result='胜' then 1 else 0 end) as '胜',
sum(case when result='负' then 1 else 0 end) as '负'
from competition
group by cdate;
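Exercise 2: Columns to rows
Suppose the match results are as follows (比赛日期 = match date, 胜 = win, 负 = loss):
+------------+-----+-----+
| 比赛日期   | 胜  | 负  |
+------------+-----+-----+
| 2021-1-1   | 2   | 1   |
| 2021-1-3   | 1   | 2   |
+------------+-----+-----+
Use SQL to convert the match results into the following form: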
+------------+-----------+
| cdate | result |
+------------+-----------+
| 2021-1-1 | 胜 |
| 2021-1-1 | 负 |
| 2021-1-3 | 胜 |
| 2021-1-3 | 负 |
| 2021-1-1 | 胜 |
| 2021-1-3 | 负 |
+------------+-----------+
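-----------------------------------------------------------------------------------------------------------------
-- competition1 is assumed to hold the pivoted table above. The first half emits one 胜 row and one 负 row per date; the second half adds the remaining rows by hard-coding the dates, so this query fits only this particular data set.
select * from (SELECT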
cdate, '胜' AS result
FROM competition1
UNION ALL SELECT
cdate, '负' AS result
FROM
competition1) x
union all
select * from (SELECT
cdate, '胜' AS result
FROM competition1 as p1
where cdate='2021-1-1'
UNION ALL SELECT
cdate, '负' AS result
FROM
competition1
where cdate='2021-1-3') y
order by cdate, result;
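Because the query above hard-codes the dates, it cannot be reused when the counts change. One possible general approach, sketched here as an assumption rather than as the author's solution, expands each count into that many rows with a recursive common table expression. The sketch runs on SQLite through Python's `sqlite3` module; the column names `win` and `lose` for the pivoted table and the function name `unpivot_results` are hypothetical.

```python
import sqlite3


def unpivot_results(rows):
    """Expand (cdate, win, lose) counts into one (cdate, result) row per match."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE competition1 (cdate TEXT, win INTEGER, lose INTEGER)")
    conn.executemany("INSERT INTO competition1 VALUES (?, ?, ?)", rows)
    sql = """
        WITH RECURSIVE
        counts(cdate, result, cnt) AS (
            -- one row per date and outcome, carrying how often it occurred
            SELECT cdate, '胜', win FROM competition1
            UNION ALL
            SELECT cdate, '负', lose FROM competition1
        ),
        expanded(cdate, result, cnt, n) AS (
            -- repeat each row cnt times by counting n from 1 up to cnt
            SELECT cdate, result, cnt, 1 FROM counts WHERE cnt >= 1
            UNION ALL
            SELECT cdate, result, cnt, n + 1 FROM expanded WHERE n < cnt
        )
        SELECT cdate, result FROM expanded
    """
    result = sorted(conn.execute(sql).fetchall())
    conn.close()
    return result
```

MySQL 8.0 also supports `WITH RECURSIVE`, so the same idea should carry over with little change, although that has not been verified here.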
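Exercise 3: Consecutive logins
The user activity table t_act_records has two columns: uid (user ID) and imp_date (date).
- For each month of 2021, compute the longest run of consecutive login days for each user
- For each month of 2021, list the users who logged in on 2 consecutive days
- For each month of 2021, count the users who logged in on 5 consecutive days
The MySQL statements for building the table are as follows: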
DROP TABLE if EXISTS t_act_records;
CREATE TABLE t_act_records
(uid VARCHAR(20),
imp_date DATE);
INSERT INTO t_act_records VALUES('u1001', 20210101);
INSERT INTO t_act_records VALUES('u1002', 20210101);
INSERT INTO t_act_records VALUES('u1003', 20210101);
INSERT INTO t_act_records VALUES('u1003', 20210102);
INSERT INTO t_act_records VALUES('u1004', 20210101);
INSERT INTO t_act_records VALUES('u1004', 20210102);
INSERT INTO t_act_records VALUES('u1004', 20210103);
INSERT INTO t_act_records VALUES('u1004', 20210104);
INSERT INTO t_act_records VALUES('u1004', 20210105);
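-- For each month of 2021, the longest run of consecutive login days per user
-- (assumes at most one row per user per day; imp_date minus the row number is constant within a run, so it is part of the grouping)
select uid, mon, max(num) as max_days from
(select uid, month(imp_date) as mon, count(1) as num from
(select uid, imp_date, date_sub(imp_date, interval sort day) as dsub
from (select uid, imp_date, row_number() over (partition by uid order by imp_date) as sort
from t_act_records where year(imp_date) = 2021) as a) b
group by uid, month(imp_date), dsub) c
group by uid, mon;
-------------------------------------------------------------------------------
-- For each month of 2021, the users who logged in on 2 consecutive days
select distinct mon, uid from
(select uid, month(imp_date) as mon, count(1) as num from
(select uid, imp_date, date_sub(imp_date, interval sort day) as dsub from (select uid, imp_date, row_number() over (partition by uid order by imp_date) as sort
from t_act_records where year(imp_date) = 2021) as a) b
group by uid, month(imp_date), dsub) c
where num >= 2;
-- For each month of 2021, the number of users who logged in on 5 consecutive days
select mon, count(distinct uid) as num_id from
(select uid, month(imp_date) as mon, count(1) as num from
(select uid, imp_date, date_sub(imp_date, interval sort day) as dsub from (select uid, imp_date, row_number() over (partition by uid order by imp_date) as sort
from t_act_records where year(imp_date) = 2021) as a) b
group by uid, month(imp_date), dsub) c
where num >= 5
group by mon;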
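Exercise 4: Causes of data skew in Hive and optimization strategies
1. Unevenly distributed keys
2. Characteristics inherent in the business data
3. SQL statements that themselves cause skew
Solutions
1. Set hive.map.aggr=true and hive.groupby.skewindata=true in Hive.
2. Balance the load when the data is skewed: when hive.groupby.skewindata is set to true, the generated query plan contains two MR jobs.
In the first MR job, the map output is distributed randomly across the reducers; each reducer performs a partial aggregation and emits its result. Rows with the same GROUP BY key may therefore land on different reducers, which balances the load. The second MR job then distributes the pre-aggregated results to the reducers by GROUP BY key (this guarantees that identical keys reach the same reducer) and completes the final aggregation.
3. Adjust the SQL statements:
1. Choose the table whose join key is most evenly distributed as the driving table. Apply column pruning and filters early so that less data takes part in the join.
2. Joining a large table with a small table: use a map join so that the small dimension table (fewer than 1000 rows) is loaded into memory first and the join is completed on the map side.
3. count distinct over many identical special values: handle NULL values separately.
If only count distinct is being computed, the NULL rows need no special processing: filter them out and add 1 to the final result. If other computations require a group by, process the NULL rows separately first and then union the result with the other results.
4. Joining two large tables: turn the NULL keys into a string plus a random number,
so that the skewed rows are spread over different reducers. Because NULL keys never match in the join, this does not affect the final result.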
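The two-job plan described for hive.groupby.skewindata can be imitated in a few lines of plain Python. This is only a toy model of the idea (random distribution with partial aggregation, then a merge by key), not Hive's implementation; the function name `two_stage_count`, the number of reducers and the fixed random seed are assumptions of the sketch.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers=4, seed=0):
    """Count keys in two stages and report how many rows each stage-1 reducer received."""
    rng = random.Random(seed)
    # Stage 1: rows are spread randomly over the reducers, so a hot key is
    # split across all of them; each reducer keeps a partial count per key.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1
    loads = [sum(partial.values()) for partial in partials]
    # Stage 2: the partial counts are merged by key (in Hive this would be a
    # shuffle on the GROUP BY key), which yields the final aggregation.
    final = Counter()
    for partial in partials:
        final.update(partial)
    return final, loads
```

With one dominant key, a plain shuffle by key would send almost every row to a single reducer, whereas the stage-1 loads here are spread roughly evenly and the final counts are unchanged.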