b.example.com: Name or service not known:
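Applications generally resolve hostnames according to the server's configuration, reading files such as /etc/nsswitch.conf, /etc/hosts, and /etc/resolv.conf to decide which nameserver to use. See "How the system handles name resolution".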
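As a rough illustration of the first step, the `hosts:` line in /etc/nsswitch.conf lists the lookup sources in order. The sketch below parses such a line from a sample string. The sample content (`hosts: files dns`) and the helper name `hosts_sources` are assumptions made for illustration, not taken from the affected server.

```python
import re


def hosts_sources(nsswitch_text):
    """Return the ordered lookup sources from the `hosts:` line, or []."""
    for raw in nsswitch_text.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop trailing comments
        if line.startswith('hosts:'):
            rest = line[len('hosts:'):]
            # Remove bracketed action items such as [NOTFOUND=return].
            rest = re.sub(r'\[[^\]]*\]', ' ', rest)
            return rest.split()
    return []


# Assumed sample content, not the real file from the affected server.
SAMPLE_NSSWITCH = """
passwd: files
hosts:  files dns   # /etc/hosts first, then DNS via /etc/resolv.conf
"""

print(hosts_sources(SAMPLE_NSSWITCH))
```

With `files` listed before `dns`, /etc/hosts is consulted first and /etc/resolv.conf only comes into play for the `dns` source.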
A log alert came in: `a.example.com:未知的名称或服务` (Name or service not known). Tests on the server show that resolution fails some of the time:
[root@v_yunweikaifa246 ~]# ping a.example.com -c 2 -w 0.1
ping: a.example.com: Name or service not known
[root@v_yunweikaifa246 ~]# ping a.example.com -c 2 -w 0.1
ping: a.example.com: Name or service not known
[root@v_yunweikaifa246 ~]# ping a.example.com -c 2 -w 0.1
PING a.example.com (10.0.0.4) 56(84) bytes of data.
64 bytes from bogon (10.0.0.4): icmp_seq=1 ttl=64 time=0.291 ms
64 bytes from bogon (10.0.0.4): icmp_seq=2 ttl=64 time=0.402 ms
--- a.example.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.291/0.346/0.402/0.058 ms
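Likely cause: the rotate option:
Running strace -e trace=connect,write getent hosts a.example.com several times and tracing the connections shows that the name resolves only when the query goes to nameserver 10.128.2.130, yet the server appears to pick the nameserver at random.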
[root@v_yunweikaifa246 ~]# strace -e trace=connect,write getent hosts a.example.com
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("219.141.140.10")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("219.141.136.10")}, 16) = 0
+++ exited with 2 +++
[root@v_yunweikaifa246 ~]# strace -e trace=connect,write getent hosts a.example.com
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("219.141.136.10")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.128.2.130")}, 16) = 0
write(1, "10.0.0.4 a.example."..., 3610.0.0.4 a.example.com
) = 36
+++ exited with 0 +++
[root@v_yunweikaifa246 ~]# strace -e trace=connect,write getent hosts a.example.com
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("219.141.140.10")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("219.141.136.10")}, 16) = 0
+++ exited with 2 +++
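To compare runs more easily, the DNS servers contacted in each trace can be pulled out of the strace output. The following is a minimal sketch. The helper name `dns_servers_contacted` is made up for illustration, and the pattern only matches `connect` lines in the format shown in the traces above.

```python
import re

# Matches connect() calls to port 53 as printed in the traces above.
CONNECT_RE = re.compile(
    r'connect\(\d+, \{sa_family=AF_INET, sin_port=htons\(53\), '
    r'sin_addr=inet_addr\("([0-9.]+)"\)\}'
)


def dns_servers_contacted(strace_output):
    """Return the nameserver IPs in the order they were contacted."""
    return CONNECT_RE.findall(strace_output)


# Lines copied from the first (failed) run above.
failed_run = '''
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("219.141.140.10")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("219.141.136.10")}, 16) = 0
'''

print(dns_servers_contacted(failed_run))
```

In the failed runs 10.128.2.130 never appears in the list, while the successful run includes it.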
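**Note:** tools such as dig, host, and nslookup do not call the resolver library. They only parse /etc/resolv.conf themselves and use its first nameserver entry, so they cannot be used to test how the rotate option cycles through the nameservers.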
Check the /etc/resolv.conf configuration:
[root@v_yunweikaifa141 ~]# cat /etc/resolv.conf
options timeout:1 attempts:1 rotate
nameserver 10.128.2.130
nameserver 219.141.140.10
nameserver 219.141.136.10
nameserver 202.106.0.20
According to man resolv.conf, options timeout:1 attempts:1 rotate means a 1-second timeout, a single attempt, and rotate mode. The man page describes rotate as follows:
sets RES_ROTATE in _res.options, which causes round-robin selection of nameservers from among those listed. This has the effect of spreading the query load among all listed servers, rather than having all clients try the first listed server first every time.
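Roughly, this means the configured nameservers are used in round-robin order instead of the first one always being tried first. The trouble starts when a nameserver that does not hold the record answers the query.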
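The configuration above can be read programmatically to check which options are in effect. This is a small sketch that parses resolv.conf-style text from a string. The function name `parse_resolv_conf` is hypothetical, and it assumes every valued option is an integer, as `timeout` and `attempts` are.

```python
def parse_resolv_conf(text):
    """Return (nameservers, options) parsed from resolv.conf-style text."""
    nameservers, options = [], {}
    for raw in text.splitlines():
        fields = raw.split()
        if not fields or fields[0].startswith(('#', ';')):
            continue  # skip blank lines and comments
        if fields[0] == 'nameserver' and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == 'options':
            for opt in fields[1:]:
                name, _, value = opt.partition(':')
                # Flags such as rotate carry no value; store True for them.
                options[name] = int(value) if value else True
    return nameservers, options


# The configuration shown above, copied into a string.
RESOLV_CONF = """\
options timeout:1 attempts:1 rotate
nameserver 10.128.2.130
nameserver 219.141.140.10
nameserver 219.141.136.10
nameserver 202.106.0.20
"""

servers, opts = parse_resolv_conf(RESOLV_CONF)
print(servers)
print(opts)
```

Here `opts` contains `rotate`, which is the flag that makes the resolver cycle through the listed servers.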
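Solution:
- The rotate option exists to balance load across servers, so when the nameservers do not all serve the same data, remove rotate and go back to querying them in the listed order.
- Put the self-hosted servers in the first and second positions and the public server in the third.
Both changes increase the load on the first nameserver.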
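The first fix amounts to deleting one word from the options line. As a sketch, the helper below (a hypothetical name, `drop_rotate`) performs that edit on a string rather than on the real file, so the result can be reviewed before anything is written back.

```python
def drop_rotate(text):
    """Return resolv.conf text with `rotate` removed from options lines."""
    out = []
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == 'options':
            kept = [f for f in fields[1:] if f != 'rotate']
            # Keep the options line only if other options remain.
            if kept:
                out.append('options ' + ' '.join(kept))
        else:
            out.append(line)
    return '\n'.join(out) + '\n'


before = "options timeout:1 attempts:1 rotate\nnameserver 10.128.2.130\n"
print(drop_rotate(before), end='')
```

All other lines, including the nameserver order, are passed through unchanged.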
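Further question: why does the rotate option in resolv.conf keep picking the second nameserver as the first one to query?
Repeated tests show that the second nameserver has a high chance of receiving the first request, because the resolver has already performed a rotation before sending the request.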
/*
 * Some resolvers want to even out the load on their nameservers.
 * Note that RES_BLAST overrides RES_ROTATE.
 */
if ((statp->options & RES_ROTATE) != 0 &&
    (statp->options & RES_BLAST) == 0) {
    struct sockaddr_in6 *ina;
    unsigned int map;
    n = 0;
    while (n < MAXNS && EXT(statp).nsmap[n] == MAXNS)
        n++;
    if (n < MAXNS) {
        ina = EXT(statp).nsaddrs[n];
        map = EXT(statp).nsmap[n];
        for (;;) {
            ns = n + 1;
            while (ns < MAXNS
                   && EXT(statp).nsmap[ns] == MAXNS)
                ns++;
            if (ns == MAXNS)
                break;
            /* move the second IP address into the first slot */
            EXT(statp).nsaddrs[n] = EXT(statp).nsaddrs[ns];
            EXT(statp).nsmap[n] = EXT(statp).nsmap[ns];
            n = ns;
        }
        EXT(statp).nsaddrs[n] = ina;
        EXT(statp).nsmap[n] = map;
    }
}
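The effect of the loop in the excerpt can be modelled in a few lines. This is a simplified sketch that ignores the `nsmap` bookkeeping and only tracks the order of addresses. The function name `rotate_once` is made up for illustration.

```python
def rotate_once(servers):
    """Mimic one RES_ROTATE pass: shift left, first entry goes last."""
    if len(servers) < 2:
        return list(servers)
    return servers[1:] + servers[:1]


# The first three nameservers from the configuration in this article.
listed = ['10.128.2.130', '219.141.140.10', '219.141.136.10']

after_one = rotate_once(listed)
print(after_one[0])  # the server that is queried first after one rotation
```

After a single pass the second listed server sits in the first slot, which matches the observation that it often receives the first request.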
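Python test script:
import socket

# Resolve the same name several times to observe intermittent failures.
for attempt in range(5):
    try:
        print(socket.getaddrinfo('a.example.com', 80))
    except socket.gaierror as exc:
        # Report the failure instead of silently ignoring it.
        print('attempt %d failed: %s' % (attempt, exc))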
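Further question: why do only three nameserver entries in /etc/resolv.conf take effect?
By default only three nameservers in /etc/resolv.conf are used, and any additional entries are never queried, because `MAXNS` is defined as 3 in /usr/include/resolv.h. The value can be changed by recompiling, but this is not officially recommended.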
nameserver Name server IP address
Internet address (in dot notation) of a name server that the resolver should query. Up to MAXNS (currently 3, see <resolv.h>) name servers may be listed,
one per keyword. If there are multiple servers, the resolver library queries them in the order listed. If no nameserver entries are present, the default is
to use the name server on the local machine. (The algorithm used is to try a name server, and if the query times out, try the next, until out of name
servers, then repeat trying all the name servers until a maximum number of retries are made.)
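Applied to the configuration in this article, the cap means the fourth entry is never used. A minimal sketch, with the hypothetical helper `effective_nameservers` and `MAXNS` set to the value quoted in the man page text above:

```python
MAXNS = 3  # value stated in the man page excerpt above


def effective_nameservers(listed, maxns=MAXNS):
    """Split listed nameservers into (used, ignored) under the MAXNS cap."""
    return listed[:maxns], listed[maxns:]


listed = ['10.128.2.130', '219.141.140.10', '219.141.136.10', '202.106.0.20']
used, ignored = effective_nameservers(listed)
print('used:', used)
print('ignored:', ignored)
```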
Further question: there is no failover when a nameserver cannot resolve the host:
For example:
nameserver 10.0.0.1 # handles queries for some internal zones
nameserver 10.0.0.2 # handles queries for zones that .1 nameserver doesn't know about
nameserver 10.0.0.3 # handles queries out to the global internet
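The reason is that the first nameserver is tried first. If it is down and does not respond within the configurable timeout, the resolver moves on to the next nameserver, and then the next. If the first DNS server is up and responds, the resolver never goes on to try the second or third nameserver.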
219.141.140.10 is unreachable and connect returns -1, so the resolver goes on to query the next nameserver:
[root@v_yunweikaifa141 ~]# strace -e trace=connect ping a.example.com -c 1 -w 1
...
PING a.example.com (10.0.0.4) 56(84) bytes of data.
connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("219.141.140.10")}, 16) = -1 ENETUNREACH (Network is unreachable)
connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.128.2.130")}, 16) = 0
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
64 bytes from 10.0.0.4: icmp_seq=1 ttl=64 time=0.610 ms
--- a.example.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.610/0.610/0.610/0.000 ms
+++ exited with 0 +++
219.141.140.10 responds but has no record for the name, so the lookup fails without another nameserver being tried:
[root@v_yunweikaifa246 ~]# strace -e trace=connect ping a.example.com -c 1 -w 1
connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("219.141.140.10")}, 16) = 0
ping: a.example.com: Name or service not known
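The two cases above can be captured in a toy model of the failover rule. This is only a sketch of the behaviour described in this section, not glibc's actual implementation. The names `resolve`, `NO_RESPONSE`, and `NOT_FOUND` are invented for the example, and retries and timeouts are left out.

```python
NO_RESPONSE = 'no response'  # timeout or unreachable: try the next server
NOT_FOUND = 'not found'      # server answered, but has no such record


def resolve(servers, behaviour):
    """Toy failover model.

    `behaviour` maps a server IP to what it returns for the name:
    an address string, NOT_FOUND, or NO_RESPONSE.
    Returns (server_that_answered, result).
    """
    for server in servers:
        result = behaviour[server]
        if result == NO_RESPONSE:
            continue  # only a missing response triggers failover
        return server, result  # any answer, even a negative one, is final
    return None, NO_RESPONSE


order = ['219.141.140.10', '10.128.2.130']

# Case 1: the public server is unreachable, so the internal one is asked.
unreachable = {'219.141.140.10': NO_RESPONSE, '10.128.2.130': '10.0.0.4'}
print(resolve(order, unreachable))

# Case 2: the public server answers "not found", and the lookup stops there.
answers = {'219.141.140.10': NOT_FOUND, '10.128.2.130': '10.0.0.4'}
print(resolve(order, answers))
```

The second case is the failure seen in the alert: the internal server holds the record, but it is never asked because an earlier server already gave an answer.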
Takeaways:
Through this testing I learned how the system resolves names, and how resolv.conf is configured and used.