Why Beaver Errors Happened, Emblems
- They've passed along feedback about separating Ghost shells from their perks before, and they'll continue to let the team know this is something the community wants to see. (Reddit)
- They'll have news for PC Game Pass members later in 2021. (Reddit)
- Further info on Game Pass or Xbox can be found here.
- Feedback about the Kill 150 Fallen bounty has been passed along. (Reddit)
- Guided Games Helping Hand emblems for the Leviathan raid and its Raid Lairs are no longer available. (Reddit)
- On emblem tracking/accolades for difficult feats such as solos:
A little bit of inside baseball: From what I understand, we have a limited amount of things that we can display in-game via Emblem Trackers. Team has to prioritize which stats are worth tracking to the largest amount of players, while also leaving space for high-skill feats.
We have started creating alternate rewards for specific "achievements" - like the solo / flawless / soloflawless[sic] rewards for the recent dungeon. While I don't know if rewards like this can be made for every end-game activity, it is cool to see players showing off these rewards!
That said, for many things that can't make it into the game (based on time, work, space, or other), an entire community of 3rd party app developers are finding really fun ways to display these sorts of things. Really happy to see raid report adding these badges. (Twitter)
- The team is definitely working on improvements to Destiny 2 which should combat cheating. (Twitter)
- They'll definitely have info out sometime over the summer on how Gambit is changing. Probably won't be in the near future, but definitely before Beyond Light. Until then, hold tight, and bank those motes. (Twitter)
Fletcher Dunn, of Valve.
On why Beavers:
Background: For the past few months, [Destiny 2] has been using some Steam peer-to-peer networking tech I’ve been working on. The tech is software-based routing: route through general-compute Linux boxes using custom protocol, with clients exercising significant routing autonomy. Since peers do not share their IP addresses, script kiddies cannot DDoS other players. [Link].
Since it launched, the disconnection rate (#beaver errors) was higher than expected. Some players would get it often; others never saw it. Restarting the game would often make it go away. We could never reproduce it. (It works on my machine!) Each connection involves 4 hosts: 2 clients (that we cannot access) and 2 relays. We dug through countless examples, correlating Bungie’s records, session summaries in our backend, and logs from the 2 relays. Forensic debugging of problems you cannot reproduce is frustrating.
Furthermore, it’s difficult to tell if a specific example is even a bug with packets getting lost in our system; it might just be a “normal” Internet problem. Over a month or two, I fixed some minor bugs, and Bungie fixed some issues with the API integration. Each time, we thought this might be the underlying cause, but nothing moved the needle. Towards the end, I was spending entire days playing Destiny with tons of extra logging.
The breakthrough came while looking at an example involving two relays in Virginia. One relay had our experimental XDP path enabled. XDP is Linux technology that enables you to bypass the kernel and receive Ethernet frames directly in userspace. The kernel does a heroic amount of work for you to deliver a single UDP packet, most of which is bypassed by XDP. So XDP is *insanely* fast. In our case, the XDP code can process 5x-10x the packets for the same CPU cost as the plain BSD socket code.
Unfortunately, we cannot deploy XDP everywhere, due to a bug in the Intel NIC driver. We’re using a “Dell” Intel NIC, so Intel told us to take it up with them, and Dell can’t help us. This relay had an AMD processor and Mellanox NIC, so we could enable XDP. So this relay was an “oddball” in the fleet, and for it to be implicated was suspicious. I ran a query on our connection analytics to see if it was an outlier.
YES! One host in Virginia had a significantly higher rate of disconnections than the otheWAITASECOND. The outlier was not the relay using XDP, it was the *other* one....WAT.
OK, so here's the bug: With XDP, you are serializing raw Ethernet frames, including the next hop MAC address. If the final destination is local, the next hop is that host’s MAC, otherwise it’s a switch. (This is usually the kernel’s “heroic work” I mentioned earlier.) Our XDP code is *not* heroic. It assumes that all traffic goes to the same switch. To forward a packet, we rewrite the header (importantly, the destination IP address) and swap the source/destination MAC addresses, sending it back to the switch, which knows what to do next.
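The header rewrite described above can be sketched at the byte level. This is only an illustration in Python, not Valve's XDP code (which is not shown in the thread): the function names `forward_frame` and `ipv4_checksum` are hypothetical, the frame is assumed to be untagged Ethernet carrying IPv4, and the UDP checksum (which a real forwarder would also have to update or zero) is ignored.

```python
import socket
import struct

ETH_HEADER_LEN = 14    # dst MAC (6) + src MAC (6) + EtherType (2)
IPV4_CSUM_OFFSET = 10  # header checksum offset inside the IPv4 header
IPV4_DST_OFFSET = 16   # destination address offset inside the IPv4 header


def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of the header's 16-bit words, inverted."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def forward_frame(frame: bytes, new_dst_ip: str) -> bytes:
    """Hypothetical forwarder: rewrite the IPv4 destination, swap the MACs."""
    dst_mac, src_mac, ethertype = frame[0:6], frame[6:12], frame[12:14]
    ip = bytearray(frame[ETH_HEADER_LEN:])
    ihl = (ip[0] & 0x0F) * 4  # IPv4 header length in bytes

    # Point the packet at its new destination and refresh the header checksum.
    ip[IPV4_DST_OFFSET:IPV4_DST_OFFSET + 4] = socket.inet_aton(new_dst_ip)
    ip[IPV4_CSUM_OFFSET:IPV4_CSUM_OFFSET + 2] = b"\x00\x00"
    struct.pack_into("!H", ip, IPV4_CSUM_OFFSET, ipv4_checksum(bytes(ip[:ihl])))

    # Swapping the MACs sends the frame back to whoever delivered it.
    # The assumption baked in here is that the sender is always the switch.
    return src_mac + dst_mac + ethertype + bytes(ip)
```

Note that nothing in `forward_frame` looks up the next hop: the outgoing destination MAC is simply the incoming source MAC, which is only correct when every frame arrives from the switch.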
This topology assumption was violated in Virginia. One relay was on the same subnet as the XDP relay. So the source MAC was of the relay, not a switch. XDP code gets a client-bound packet, fills in the client’s IP in the IP header, and swaps the source/dest MAC addresses, which sends it back to the relay whence it came (where it is then dropped). This was the cause of broken disconnections in Virginia. If two peers selected these two relays, they could not communicate. Once I knew to look for this type of outlier, I found three other examples exhibiting similar behaviour[sic] (but for less interesting technical reasons). Bungie confirmed that the disconnection rate is now back to expected levels.
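A toy model makes the failure easier to see. The sketch below is an assumption-laden simplification, not the real relay topology: the host names (`relay_a`, `relay_b`, `relay_c`), the subnet labels, and the single `SWITCH_MAC` are all hypothetical, and "same subnet" is taken to mean the frame arrives carrying the sender's own MAC rather than the switch's.

```python
SWITCH_MAC = "switch"  # hypothetical stand-in for the switch's MAC address


def source_mac_seen_by(receiver: str, sender: str, subnet_of: dict) -> str:
    """Which MAC does the receiver see as the frame's source?"""
    # Same subnet: the frame carries the sender's own MAC.
    # Different subnet: it arrives via the switch, so it carries the switch's MAC.
    if subnet_of[sender] == subnet_of[receiver]:
        return sender
    return SWITCH_MAC


def naive_next_hop(xdp_relay: str, sender: str, subnet_of: dict) -> str:
    """MAC-swapping forwarder: the next hop is whoever the frame came from."""
    return source_mac_seen_by(xdp_relay, sender, subnet_of)


def is_delivered(xdp_relay: str, sender: str, subnet_of: dict) -> bool:
    """A client-bound packet only gets out if it is handed to the switch."""
    return naive_next_hop(xdp_relay, sender, subnet_of) == SWITCH_MAC


# Hypothetical layout: relay_a runs the XDP path, relay_b shares its subnet.
subnets = {"relay_a": "subnet-1", "relay_b": "subnet-1", "relay_c": "subnet-2"}

for peer in ("relay_b", "relay_c"):
    status = "delivered" if is_delivered("relay_a", peer, subnets) else "bounced back"
    print(f"{peer} -> relay_a -> client: {status}")
```

In this model, traffic from `relay_c` reaches the switch and is delivered, while traffic from `relay_b` is sent straight back to `relay_b`, which matches the thread's description of packets returning to the relay they came from.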
Why wasn’t the XDP relay an outlier? Because it wasn’t reporting those disconnected sessions. We don’t log stats for “uninteresting” sessions (sessions used only briefly or not at all). A disconnection is usually considered “interesting”, no matter how brief. But a bug triggered by the asymmetric nature of the loss broke this. A function call had bool and int parameters swapped, and gcc and MSVC both performed the implicit conversions without complaint!
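The thread does not show the function involved, so the following is only a Python analogue of this class of bug, with every name (`is_interesting`, `packets_received`, `MIN_PACKETS`) and the threshold invented for illustration. In C++ the swap compiles because `bool` and `int` convert implicitly; in Python it runs because `bool` is a subclass of `int` and any integer can be tested for truth.

```python
MIN_PACKETS = 10  # hypothetical threshold for a session that was "really used"


def is_interesting(packets_received: int, disconnected: bool) -> bool:
    """Log a session if it disconnected, or if it carried enough traffic."""
    return bool(disconnected) or packets_received >= MIN_PACKETS


def should_log_buggy(disconnected: bool, packets_received: int) -> bool:
    # Arguments passed in the wrong order; nothing complains.
    return is_interesting(disconnected, packets_received)


def should_log_fixed(disconnected: bool, packets_received: int) -> bool:
    # Keyword arguments make the order irrelevant.
    return is_interesting(packets_received=packets_received,
                          disconnected=disconnected)


# A disconnected session that received nothing (one-sided loss, as assumed here).
print(should_log_buggy(disconnected=True, packets_received=0))
print(should_log_fixed(disconnected=True, packets_received=0))
```

With the swapped call, `disconnected=True` is read as a packet count of 1 and `packets_received=0` is read as "not disconnected", so the session is silently treated as uninteresting. A session that received a few packets would still be logged, since any non-zero count is truthy, which is one way such a bug could stay hidden until the loss pattern is lopsided.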
Why did it take so long to find/fix? Because we were myopic, looking for a software bug. Each time we found something, we thought “this is it!” We *were* finding real problems, they just were very rare in practice. Also: networking is complicated. Why didn’t monitoring catch this? We do have significant monitoring, especially for problems *between* data centers. But it did not detect problems between hosts in the same data center. Also: networking is complicated. (Twitter)