游戏内有一个系统对接收消息顺序是有严格要求的,一旦没有按照发送的顺序接收,就会引起错误,为确定消息是客户端没按顺序发送还是客户端接收时候出错,需要前端到后端的每一个关键阶段都加上了日志去定位问题。
消息顺序: 客户端-> 服务端端口 -> 游戏解包进程 -> 业务进程
日志记录点:
- 客户端(发包内容记录到日志)
- 服务端(通过tcpdump抓包)
- 游戏读socket(记录解包日志)
- 业务进程记录每个收到的消息
通过分析抓包内容分析,发现前端发送的每个消息后端都是按序收到,也能按序解包,出错地方在接收进程在收到某个消息时候,前面会有几个消息丢失了。
为确定消息究竟有没有从解包进程发送到业务进程,使用dbg监控业务进程接收进程发现消息是有的。
而且在官方文档里面进程A往进程B的发送的消息,是可以保证接收顺序的,在Academic and Historical Questions 10.8里面可以证实。
10.8 Is the order of message reception guaranteed?
Yes, but only within one process.If there is a live process and you send it message A and then message B, it’s guaranteed that if message B arrived, message A arrived before it.
On the other hand, imagine processes P, Q and R. P sends message A to Q, and then message B to R. There is no guarantee that A arrives before B. (Distributed Erlang would have a pretty tough time if this was required!)
最奇怪的是解包进程往另外一个日志进程发送的同样消息是正确,就只是这个业务进程不正确。问题到这里就断开了查不下去,后面一次聊天和一个大神聊起这个问题他说遇到过,是和发送socket接口使用不当引起的,回去翻代码果然发现了,使用了erts_internal:port_command,从文档里面发现这段说明:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33%%
%% Port BIFs
%%
%% Currently all port BIFs calls the corresponding
%% erts_internal:port_*() native function which perform
%% most of the actual work. These native functions should
%% *never* be called directly by other functionality. The
%% native functions may be changed, or removed without any
%% notice whatsoever!
%%
%% IMPORTANT NOTE:
%% When the erts_internal:port_*() native functions return
%% a reference, they have also internally prepared the
%% message queue of the caller for a receive that will
%% unconditionally wait for a message containing this
%% reference. If the erlang code calling these native
%% functions do not do this, subsequent receives will not
%% work as expected! That is, it is of *vital importance*
%% that the receive is performed as described above!
%%
-spec port_command(Port, Data) -> 'true' when
Port :: port() | atom(),
Data :: iodata().
port_command(Port, Data) ->
case case erts_internal:port_command(Port, Data, []) of
Ref when erlang:is_reference(Ref) -> receive {Ref, Res} -> Res end;
Res -> Res
end of
true -> true;
Error -> erlang:error(Error, [Port, Data])
end.
不建议直接使用erts_internal:port_*()接口,可能引起后面消息乱序,把erts_internal:port_command换成erlang:port_command后暂没发现问题。
另外记一下发现问题版本是17.5,还有dbg好用。
DBG追踪:1
2
3
4
5
6
7> dbg:start(). % start dbg
> dbg:tracer(). % start a simple tracer process
> dbg:p(PID, r). % trace receive message.
... trace here
> dbg:stop_clear(). % stop tracer and clear effect of tp and p calls.