
Tor's Vegas Mechanism

An analysis of the congestion_control_vegas_process_sendme function

Introduction

This function implements the SENDME-processing logic of the Vegas congestion control algorithm. Its core goal is to gauge congestion on a circuit from the relationship between the congestion window (cwnd) and the bandwidth-delay product (BDP), and to adjust cwnd dynamically so as to optimize data transfer, avoiding congestion while improving bandwidth utilization.

Goals of the function

/**
* Process a SENDME and update the congestion window according to the
* rules specified in TOR_VEGAS of Proposal #324.
*
* Essentially, this algorithm attempts to measure queue lengths on
* the circuit by subtracting the bandwidth-delay-product estimate
* from the current congestion window.
*
* If the congestion window is larger than the bandwidth-delay-product,
* then data is assumed to be queuing. We reduce the congestion window
* in that case.
*
* If the congestion window is smaller than the bandwidth-delay-product,
* then there is spare bandwidth capacity on the circuit. We increase the
* congestion window in that case.
*
* The congestion window is updated only once every congestion window worth of
* packets, even if the signal persists. It is also updated whenever the
 * upstream orconn blocks, or unblocks. This minimizes local client queues.
*/
  • Dynamically adjust the congestion window (cwnd) according to current network conditions.
  • Compare the bandwidth-delay product (BDP) with cwnd to decide whether cwnd should grow or shrink.
  • Apply different adjustment strategies in different phases (slow start vs. steady state); the simplified sketch below summarizes the decision rule.
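
Before walking through the code, here is a minimal sketch of the Vegas decision rule as this article describes it. It is illustrative only, not the Tor implementation; the threshold names mirror the vegas_params_t fields shown later, and everything else is simplified.

#include <stdint.h>

/* Simplified sketch of the Vegas decision rule walked through below.
 * NOT the Tor implementation. */
static uint64_t
vegas_next_cwnd(uint64_t cwnd, uint64_t bdp,
                int in_slow_start, int blocked_chan, int cwnd_full,
                uint64_t alpha, uint64_t beta, uint64_t gamma,
                uint64_t delta, uint64_t cwnd_inc, uint64_t ss_inc)
{
  /* Estimated number of cells queued on the circuit. */
  uint64_t queue_use = (cwnd > bdp) ? cwnd - bdp : 0;

  if (in_slow_start) {
    if (queue_use < gamma && !blocked_chan)
      return cwnd_full ? cwnd + ss_inc : cwnd; /* grow only if window was filled */
    return bdp + gamma;                /* queue built up: clamp (and exit slow start) */
  }

  if (queue_use > delta)
    return bdp + delta - cwnd_inc;     /* heavy queuing: hard clamp */
  if (queue_use > beta || blocked_chan)
    return cwnd - cwnd_inc;            /* mild queuing: back off */
  if (cwnd_full && queue_use < alpha)
    return cwnd + cwnd_inc;            /* spare capacity: grow */
  return cwnd;                         /* in the dead band: hold */
}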

Detailed code walkthrough

1. Entry checks

tor_assert(cc && cc->cc_alg == CC_ALG_VEGAS);
tor_assert(circ);
  • Assert that the congestion control state cc and the circuit circ are non-NULL.
  • Assert that the Vegas congestion control algorithm is in use.

2. Update the event counters

if (cc->next_cc_event)
  cc->next_cc_event--;

if (cc->next_cwnd_event)
  cc->next_cwnd_event--;
  • next_cc_event and next_cwnd_event throttle how often the congestion window is updated.
  • Each SENDME decrements the counters; when a counter reaches zero, the corresponding event may be processed.

3. Update the circuit estimates

if (!congestion_control_update_circuit_estimates(cc, circ)) {
  cc->inflight = cc->inflight - cc->sendme_inc;
  return 0;
}
  • congestion_control_update_circuit_estimates(cc, circ) refreshes the circuit's RTT and BDP estimates.
  • If the update fails, inflight is decremented by sendme_inc and the function returns early.
/**
 * Upon receipt of a SENDME, pop the oldest timestamp off the timestamp
 * list, and use this to update RTT.
 *
 * Returns true if circuit estimates were successfully updated, false
 * otherwise.
 */
bool
congestion_control_update_circuit_estimates(congestion_control_t *cc,
                                            const circuit_t *circ)
{
  uint64_t now_usec = monotime_absolute_usec();

  /* Update RTT first, then BDP. BDP needs fresh RTT */
  uint64_t curr_rtt_usec = congestion_control_update_circuit_rtt(cc, now_usec);
  return congestion_control_update_circuit_bdp(cc, circ, curr_rtt_usec);
}
  • The current RTT is updated first.
  • The fresh RTT is then used to update the BDP estimate.
/**
 * Called when we get a SENDME. Updates circuit RTT by pulling off a
 * timestamp of when we sent the CIRCWINDOW_INCREMENT-th cell from
 * the queue of such timestamps, and comparing that to current time.
 *
 * Also updates min, max, and EWMA of RTT.
 *
 * Returns the current circuit RTT in usecs, or 0 if it could not be
 * measured (due to clock jump, stall, etc).
 */
STATIC uint64_t
congestion_control_update_circuit_rtt(congestion_control_t *cc,
                                      uint64_t now_usec)
{
  uint64_t rtt, ewma_cnt;
  uint64_t sent_at_timestamp;

  tor_assert(cc);

  /* Get the time that we sent the cell that resulted in the other
   * end sending this sendme. Use this to calculate RTT */
  sent_at_timestamp = dequeue_timestamp(cc->sendme_pending_timestamps);

  rtt = now_usec - sent_at_timestamp;

  /* Do not update RTT at all if it looks fishy */
  if (time_delta_stalled_or_jumped(cc, cc->ewma_rtt_usec, rtt)) {
    num_clock_stalls++; /* Accounting */
    return 0;
  }

  ewma_cnt = n_ewma_count(cc);

  cc->ewma_rtt_usec = n_count_ewma(rtt, cc->ewma_rtt_usec, ewma_cnt);

  if (rtt > cc->max_rtt_usec) {
    cc->max_rtt_usec = rtt;
  }

  if (cc->min_rtt_usec == 0) {
    // If we do not have a min_rtt yet, use current ewma
    cc->min_rtt_usec = cc->ewma_rtt_usec;
  } else if (cc->cwnd == cc->cwnd_min && !cc->in_slow_start) {
    // Raise min rtt if cwnd hit cwnd_min. This gets us out of a wedge state
    // if we hit cwnd_min due to an abnormally low rtt.
    uint64_t new_rtt = percent_max_mix(cc->ewma_rtt_usec, cc->min_rtt_usec,
                                       rtt_reset_pct);

    static ratelim_t rtt_notice_limit = RATELIM_INIT(300);
    log_fn_ratelim(&rtt_notice_limit, LOG_NOTICE, LD_CIRC,
      "Resetting circ RTT from %"PRIu64" to %"PRIu64" due to low cwnd",
      cc->min_rtt_usec/1000, new_rtt/1000);

    cc->min_rtt_usec = new_rtt;
    num_rtt_reset++; /* Accounting */
  } else if (cc->ewma_rtt_usec < cc->min_rtt_usec) {
    // Using the EWMA for min instead of current RTT helps average out
    // effects from other conns
    cc->min_rtt_usec = cc->ewma_rtt_usec;
  }

  return rtt;
}
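
The EWMA update above goes through n_count_ewma() with a window count from n_ewma_count(); neither is defined in this excerpt. To the best of my reading it is the standard "N-count" exponential moving average sketched below; treat the exact rounding as an assumption.

#include <stdint.h>

/* Assumed form of the N-EWMA used above (illustrative, not Tor source):
 * a new sample gets weight 2/(N+1), the classic N-sample exponential
 * average. The real implementation may round differently. */
static uint64_t
n_count_ewma_sketch(uint64_t curr, uint64_t prev, uint64_t N)
{
  if (prev == 0)   /* no history yet: seed with the first sample */
    return curr;
  return (2*curr + (N - 1)*prev) / (N + 1);
}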
/**
 * Called when we get a SENDME. Updates the bandwidth-delay-product (BDP)
 * estimates of a circuit. Several methods of computing BDP are used,
 * depending on scenario. While some congestion control algorithms only
 * use one of these methods, we update them all because it's quick and easy.
 *
 * - now_usec is the current monotime in usecs.
 * - curr_rtt_usec is the current circuit RTT in usecs. It may be 0 if no
 *   RTT could be measured.
 *
 * Returns true if we were able to update BDP, false otherwise.
 */
static bool
congestion_control_update_circuit_bdp(congestion_control_t *cc,
                                      const circuit_t *circ,
                                      uint64_t curr_rtt_usec)
{
  int chan_q = 0;
  unsigned int blocked_on_chan = 0;

  tor_assert(cc);

  if (CIRCUIT_IS_ORIGIN(circ)) {
    /* origin circs use n_chan */
    chan_q = circ->n_chan_cells.n;
    blocked_on_chan = circ->circuit_blocked_on_n_chan;
  } else {
    /* Both onion services and exits use or_circuit and p_chan */
    chan_q = CONST_TO_OR_CIRCUIT(circ)->p_chan_cells.n;
    blocked_on_chan = circ->circuit_blocked_on_p_chan;
  }

  /* If we have no EWMA RTT, it is because monotime has been stalled
   * or messed up the entire time so far. Set our BDP estimates directly
   * to current cwnd */
  if (!cc->ewma_rtt_usec) {
    uint64_t cwnd = cc->cwnd;

    tor_assert_nonfatal(cc->cwnd <= cwnd_max);

    /* If the channel is blocked, keep subtracting off the chan_q
     * until we hit the min cwnd. */
    if (blocked_on_chan) {
      /* Cast is fine because we're less than int32 */
      if (chan_q >= (int64_t)cwnd) {
        log_notice(LD_CIRC,
                   "Clock stall with large chanq: %d %"PRIu64, chan_q, cwnd);
        cwnd = cc->cwnd_min;
      } else {
        cwnd = MAX(cwnd - chan_q, cc->cwnd_min);
      }
      cc->blocked_chan = 1;
    } else {
      cc->blocked_chan = 0;
    }

    cc->bdp = cwnd;

    static ratelim_t dec_notice_limit = RATELIM_INIT(300);
    log_fn_ratelim(&dec_notice_limit, LOG_NOTICE, LD_CIRC,
      "Our clock has been stalled for the entire lifetime of a circuit. "
      "Performance may be sub-optimal.");

    return blocked_on_chan;
  }

  /* ... (remainder of the function, where the actual BDP estimators are
   * computed, is omitted from this excerpt) ... */
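
The omitted remainder is where the RTT-based BDP estimators live. Proposal #324 describes a cwnd-based estimator of roughly the form below; this sketch is a paraphrase of that formula, not the Tor code, so treat the helper name and the exact rounding as assumptions.

#include <stdint.h>

/* Sketch of the cwnd-based BDP estimator from Proposal #324: if cwnd
 * worth of data returns in ewma_rtt, the amount that would fit in
 * min_rtt (i.e. with no queuing) is cwnd * min_rtt / ewma_rtt.
 * Hypothetical helper, not the Tor implementation. */
static uint64_t
bdp_from_cwnd_sketch(uint64_t cwnd, uint64_t min_rtt_usec,
                     uint64_t ewma_rtt_usec)
{
  if (ewma_rtt_usec == 0)  /* no usable RTT: fall back to cwnd itself */
    return cwnd;
  return cwnd * min_rtt_usec / ewma_rtt_usec;
}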

4. Compute the queue length

if (vegas_bdp(cc) > cc->cwnd)
  queue_use = 0; // This should not happen anymore..
else
  queue_use = cc->cwnd - vegas_bdp(cc);
  • vegas_bdp(cc) returns the bandwidth-delay product (BDP) estimate, which is the ideal congestion window size.
  • queue_use is the amount by which cwnd exceeds the BDP, i.e. an estimate of how much data is queued on the circuit; for example, with cwnd = 200 cells and BDP = 170 cells, queue_use is 30 cells. If cwnd is below the BDP, nothing is queuing and queue_use is 0.

5. The slow start phase

if (cc->in_slow_start) {
  • Slow start: in this phase, cwnd is increased step by step, gated by how much queuing is observed on the circuit.

a. Decide whether to grow the congestion window

if (queue_use < cc->vegas_params.gamma && !cc->blocked_chan) {
  if (cc->cwnd_full) {
    uint64_t inc = rfc3742_ss_inc(cc);
    cc->cwnd += inc;
  • If queue_use is below gamma and the channel is not blocked, cwnd keeps growing, provided the window was actually being fully used (cwnd_full is set).
  • rfc3742_ss_inc(cc) computes the slow-start increment.
/**
 * Implements RFC3742: Limited Slow Start.
 * https://datatracker.ietf.org/doc/html/rfc3742#section-2
 */
static inline uint64_t
rfc3742_ss_inc(const congestion_control_t *cc)
{
  if (cc->cwnd <= cc->vegas_params.ss_cwnd_cap) {
    /* If less than the cap, round and always grow by at least 1 sendme_inc. */
    return ((uint64_t)cc->cwnd_inc_pct_ss*cc->sendme_inc + 50)/100;
  } else {
    // K = int(cwnd/(0.5 max_ssthresh));
    //  => K = 2*cwnd/max_ssthresh
    // cwnd += int(MSS/K);
    //  => cwnd += MSS*max_ssthresh/(2*cwnd)
    // Return at least 1 for inc.
    return MAX(
      ((uint64_t)cc->sendme_inc*cc->vegas_params.ss_cwnd_cap + cc->cwnd)/
        (2*cc->cwnd),
      1);
  }
}
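
To make the two regimes concrete, here is a worked example with hypothetical parameter values (cwnd_inc_pct_ss = 100, sendme_inc = 31, ss_cwnd_cap = 500; these are illustrative, not necessarily Tor's consensus defaults).

#include <stdio.h>
#include <stdint.h>

/* Worked example of the two RFC3742 regimes, with hypothetical
 * parameters (not necessarily Tor's consensus defaults). */
int main(void)
{
  uint64_t sendme_inc = 31, cwnd_inc_pct_ss = 100, ss_cwnd_cap = 500;

  /* Below the cap (any cwnd <= 500): grow by a full sendme_inc. */
  uint64_t inc_below = (cwnd_inc_pct_ss*sendme_inc + 50)/100;

  /* Above the cap (cwnd = 1000): growth tapers off as cwnd rises. */
  uint64_t cwnd = 1000;
  uint64_t inc_above = (sendme_inc*ss_cwnd_cap + cwnd)/(2*cwnd);

  printf("inc below cap: %llu\n", (unsigned long long)inc_below); /* 31 */
  printf("inc above cap: %llu\n", (unsigned long long)inc_above); /* 8  */
  return 0;
}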

b. Exit slow start when cwnd reaches the maximum

if (cc->cwnd >= cc->vegas_params.ss_cwnd_max) {
  cc->cwnd = cc->vegas_params.ss_cwnd_max;
  congestion_control_vegas_exit_slow_start(circ, cc);
}
  • If cwnd reaches the slow-start maximum (ss_cwnd_max), it is clamped to that maximum and slow start is exited.

c. If the queue has built up, reduce cwnd

else {
  uint64_t old_cwnd = cc->cwnd;
  cc->cwnd = vegas_bdp(cc) + cc->vegas_params.gamma;
}
  • If queue_use reaches gamma (or the channel is blocked), congestion has appeared: cwnd is set back to BDP + gamma to relieve it. Per the vegas_params_t comment shown below, gamma is "the queue use allowed before we exit slow start", so this branch is also where the full source leaves slow start; the unused old_cwnd suggests this excerpt is abridged.

6. The steady-state phase (congestion avoidance)

} else if (cc->next_cc_event == 0) {
  • Steady state: when next_cc_event == 0, a cwnd update is due, and the normal congestion-avoidance logic runs.

a. Queue length above delta: reduce cwnd sharply

if (queue_use > cc->vegas_params.delta) {
  cc->cwnd = vegas_bdp(cc) + cc->vegas_params.delta - CWND_INC(cc);
}
  • If queue_use exceeds delta, congestion is severe: cwnd is cut back to BDP + delta - CWND_INC(cc).
#define CWND_INC(cc)           ((cc)->cwnd_inc)
struct vegas_params_t {
  /** The slow-start cwnd cap for RFC3742 */
  uint32_t ss_cwnd_cap;
  /** The maximum slow-start cwnd */
  uint32_t ss_cwnd_max;
  /** The queue use allowed before we exit slow start */
  uint16_t gamma;
  /** The queue use below which we increment cwnd */
  uint16_t alpha;
  /** The queue use above which we decrement cwnd */
  uint16_t beta;
  /** The queue use at which we cap cwnd in steady state */
  uint16_t delta;
  /** Weighted average (percent) between cwnd estimator and
   * piecewise estimator. */
  uint8_t bdp_mix_pct;
};
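
The bdp_mix_pct field above says the final BDP is a weighted average, in percent, of the cwnd-based estimator and a piecewise estimator. A minimal sketch of such a blend, assuming a straightforward percentage mix (the helper name and rounding are mine, not Tor's):

#include <stdint.h>

/* Hypothetical sketch of blending two BDP estimators by percentage,
 * as the bdp_mix_pct comment above describes. Not the Tor code. */
static uint64_t
vegas_bdp_mix_sketch(uint64_t cwnd_estimate, uint64_t piecewise_estimate,
                     uint8_t bdp_mix_pct)
{
  return (bdp_mix_pct*cwnd_estimate
          + (100 - bdp_mix_pct)*piecewise_estimate) / 100;
}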

b. Queue length above beta, or channel blocked: decrease cwnd

else if (queue_use > cc->vegas_params.beta || cc->blocked_chan) {
  cc->cwnd -= CWND_INC(cc);
}
  • If queue_use exceeds beta, or the channel is blocked (blocked_chan is set), cwnd is decreased by one increment to relieve the congestion.

c. Queue length below alpha and cwnd full: increase cwnd

else if (cc->cwnd_full && queue_use < cc->vegas_params.alpha) {
  cc->cwnd += CWND_INC(cc);
}
  • If the window is being fully used (cwnd_full) and queue_use is below alpha, cwnd is increased by one increment to take up the spare bandwidth.

7. Remaining bookkeeping

a. Enforce the minimum cwnd

cc->cwnd = MAX(cc->cwnd, cc->cwnd_min);
  • cwnd is never allowed to drop below cwnd_min, so the window cannot collapse to an unusably small size.

b. Logging

congestion_control_vegas_log(circ, cc);
  • Log the cwnd change, for debugging and analysis.

c. Re-arm the event counters

if (cc->next_cwnd_event == 0) {
  cc->next_cwnd_event = SENDME_PER_CWND(cc);
}
if (cc->next_cc_event == 0) {
  cc->next_cc_event = CWND_UPDATE_RATE(cc);
}
  • Expired counters are re-armed so the next update happens only after another window's worth of SENDMEs; a sketch of what these macros count follows.
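
The excerpt does not define SENDME_PER_CWND or CWND_UPDATE_RATE, but the function's doc comment says the window is updated "only once every congestion window worth of packets". A sketch consistent with that, assuming one SENDME per sendme_inc cells (the exact rounding is my assumption):

#include <stdint.h>

/* Assumed meaning of the counter re-arm values: the number of SENDMEs
 * in one congestion window, i.e. cwnd/sendme_inc rounded to nearest.
 * Illustrative only; the real macros may differ in detail. */
static uint64_t
sendmes_per_cwnd_sketch(uint64_t cwnd, uint64_t sendme_inc)
{
  return (cwnd + sendme_inc/2) / sendme_inc;
}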

d. Reset the cwnd_full flag

if (cwnd_full_reset(cc))
  cc->cwnd_full = 0;
  • If cwnd_full_reset(cc) says the flag should be reset, cwnd_full is cleared to 0.

e. Update the in-flight count

cc->inflight = cc->inflight - cc->sendme_inc;
  • inflight, the amount of data currently in transit, is decremented by the sendme_inc worth of cells that this SENDME acknowledges.

Summary: how cwnd evolves

1. Slow start

  • Grow cwnd: while queue use is below gamma and the channel is not blocked, cwnd grows.
  • Exit slow start: when cwnd reaches the maximum (ss_cwnd_max), slow start ends.
  • Reduce cwnd: once queue use reaches gamma, cwnd is cut back to BDP + gamma.

2. Steady state

  • Reduce cwnd sharply: when queue use exceeds delta, cwnd is cut to BDP + delta - CWND_INC(cc).
  • Reduce cwnd mildly: when queue use exceeds beta, or the channel is blocked, cwnd shrinks by one increment.
  • Grow cwnd: when queue use is below alpha and the window is full, cwnd grows by one increment.

3. Floor protection

  • cwnd never drops below cwnd_min, preventing a window so small that the circuit becomes unstable.
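
As a closing illustration, the toy loop below drives the simplified steady-state rule from the beginning of this article against a fixed BDP. All numbers are invented, and cwnd_full is assumed true throughout; the point is only to show cwnd settling into the [BDP + alpha, BDP + beta] dead band that Vegas steers toward.

#include <stdio.h>
#include <stdint.h>

/* Toy steady-state Vegas loop with invented numbers: fixed BDP of 170
 * cells, alpha=3, beta=6, delta=9, increment of 2 cells. */
int main(void)
{
  const uint64_t bdp = 170, alpha = 3, beta = 6, delta = 9, inc = 2;
  uint64_t cwnd = 200; /* start above the band, as if exiting slow start */

  for (int i = 0; i < 12; i++) {
    uint64_t queue_use = (cwnd > bdp) ? cwnd - bdp : 0;

    if (queue_use > delta)
      cwnd = bdp + delta - inc;  /* hard clamp: 200 -> 177 on step 0 */
    else if (queue_use > beta)
      cwnd -= inc;               /* mild backoff: 177 -> 175 */
    else if (queue_use < alpha)
      cwnd += inc;               /* grow into spare capacity */
    /* else: within [alpha, beta], hold (settles at 175, queue_use 5) */

    printf("update %2d: cwnd=%3llu queue_use=%llu\n",
           i, (unsigned long long)cwnd, (unsigned long long)queue_use);
  }
  return 0;
}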