Amazon's DNS Problem Knocked Out Half the Web, Likely Costing Billions 103

An anonymous reader quotes a report from Ars Technica: On Monday afternoon, Amazon confirmed that an outage affecting Amazon Web Services' cloud hosting, which had impacted millions across the Internet, had been resolved. Considered the worst outage since last year's CrowdStrike chaos, Amazon's outage caused "global turmoil," Reuters reported. AWS is the world's largest cloud provider and, therefore, the "backbone of much of the Internet," ZDNet noted. Ultimately, more than 28 AWS services were disrupted, causing perhaps billions in damages, one analyst estimated for CNN.

[...] Amazon's problems originated at a US site that is its "oldest and largest for web services" and often "the default region for many AWS services," Reuters noted. The same site has experienced two outages before in 2020 and 2021, but while the tech giant had confirmed that those prior issues had been "fully mitigated," apparently the fixes did not ensure stability into 2025. ZDNet noted that Amazon's first sign of the outage was "increased error rates and latency across numerous key services" tied to its cloud database technology. Although "engineers later identified a Domain Name System (DNS) resolution problem" as the root of these issues and quickly fixed it, "other AWS services began to fail in its wake, leaving the platform still impaired" as more than two dozen AWS services shut down. At the peak of the outage on Monday, Down Detector tracked more than 8 million reports globally from users panicked by the outage, ZDNet reported.
Ken Birman, a computer science professor at Cornell University, told Reuters that "software developers need to build better fault tolerance."

"When people cut costs and cut corners to try to get an application up, and then forget that they skipped that last step and didn't really protect against an outage, those companies are the ones who really ought to be scrutinized later."

  • by flippy ( 62353 ) on Tuesday October 21, 2025 @03:47PM (#65741344) Homepage
    it's a GOOD thing that one company controls that much of the internet, right? I mean, super efficient.
    • by Brain-Fu ( 1274756 ) on Tuesday October 21, 2025 @04:07PM (#65741392) Homepage Journal

      Or maybe he was quoted out of context.

      When you use AWS to host your business's website, and/or all the data that your business processes, and/or whatever back-end web-facing APIs your business uses, no amount of "fault tolerance" is going to keep you afloat when AWS goes down.

      If we want to blame the victim, the correct accusation is: "you shouldn't outsource your critical business infrastructure to a huge megacorp that can survive without you."

      • by suutar ( 1860506 )

        I think in this case since it was only one region we can fall back to "you shouldn't outsource anything significantly important without the multi-region failover plan"
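
        For a sense of what one piece of such a plan looks like in practice, here is a minimal sketch of DNS-level failover between two regions using Route 53 failover routing via boto3. The hosted zone ID, health check ID, and endpoint IPs are placeholders, and a real plan also has to replicate the data layer, not just flip DNS.

```python
import boto3

route53 = boto3.client("route53")

# Placeholder identifiers -- substitute your own hosted zone, health check,
# and endpoint addresses. Route 53 serves the PRIMARY record while its
# health check passes and flips to SECONDARY when it fails.
ZONE_ID = "Z0000000EXAMPLE"
PRIMARY_HEALTH_CHECK = "11111111-2222-3333-4444-555555555555"

def upsert_failover_pair():
    changes = []
    for set_id, role, ip, health_check in [
        ("use1-primary", "PRIMARY", "203.0.113.10", PRIMARY_HEALTH_CHECK),
        ("usw2-secondary", "SECONDARY", "203.0.113.20", None),
    ]:
        rrset = {
            "Name": "app.example.com",
            "Type": "A",
            "SetIdentifier": set_id,
            "Failover": role,
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check:
            rrset["HealthCheckId"] = health_check
        changes.append({"Action": "UPSERT", "ResourceRecordSet": rrset})

    route53.change_resource_record_sets(
        HostedZoneId=ZONE_ID,
        ChangeBatch={"Comment": "Cross-region DNS failover", "Changes": changes},
    )
```

        The catch, as later comments note, is that DNS failover only helps if the standby region actually has synchronized data and enough capacity behind it.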

        • I think in this case since it was only one region we can fall back to "you shouldn't outsource anything significantly important without the multi-region failover plan"

          Given that the outage was claimed to be in Eastern US, why did I suffer multiple service outages in Idaho?

          Oh, and by the way - one of the services I lost was Amazon shopping :-)

          Regional my ass...

          • Were your outages as an end-user or as a direct customer of AWS?

            US-East-1 is normally the default region (and historically also the cheapest of the alternatives).

            https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Ftechnical.ly%2Fentrepren... [technical.ly]

            As a result, even if as a consumer you think you'd be better served by a more local region, whatever service you're using probably has a number of critical components predominantly or wholly served out of US-East-1, usually for cost reasons.

            • by ls671 ( 1122017 )

              Well, on the plus side, our own infrastructure got a performance boost at exactly the same time because bots running on Amazon cloud couldn't resolve our hostnames to scan them and attempt to break in as they do daily!

          • Given that the outage was claimed to be in Eastern US, why did I suffer multiple service outages in Idaho?

            I'm in the Puget Sound area - you're certainly to the east of me.

            (Actually a bunch of services at the University of Washington were down because of this... although admittedly UW is relying heavily on outsourced services nowadays)

          • by mysidia ( 191772 ) on Wednesday October 22, 2025 @01:22AM (#65742158)

            Given that the outage was claimed to be in Eastern US, why did I suffer multiple service outages in Idaho?

            Clearly because you used services that depend on the affected network.

            US-EAST-1 outages also have a way of cascading to the other regions, because it's the most populated region with the largest amount of resources.
            When US-East-1 has issues, the other regions receive a huge volume of additional load. They had EC2 launch issues and deliberately throttled (slowed down) new launches, likely because every customer with resources in US-East-1 was attempting to deploy instances into other regions to get around the outage affecting their East-1 resources. That surge of customers shifting traffic to route around an East-1 outage can cause major network degradation across all regions.
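
            A rough sketch of the "shift capacity to another region" scramble described above, with exponential backoff so the retries themselves don't pile onto the throttling; the AMI ID is a placeholder (in reality AMI IDs are region-specific, so you'd map region to AMI).

```python
import time
import boto3
from botocore.exceptions import ClientError

REGIONS = ["us-east-1", "us-east-2", "us-west-2"]
AMI_ID = "ami-0123456789abcdef0"  # placeholder; real AMI IDs differ per region

def launch_somewhere(instance_type="t3.micro", attempts_per_region=4):
    """Try to launch one instance, falling back across regions with backoff."""
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)
        for attempt in range(attempts_per_region):
            try:
                resp = ec2.run_instances(
                    ImageId=AMI_ID,
                    InstanceType=instance_type,
                    MinCount=1,
                    MaxCount=1,
                )
                return region, resp["Instances"][0]["InstanceId"]
            except ClientError as err:
                code = err.response["Error"]["Code"]
                if code in ("RequestLimitExceeded", "InsufficientInstanceCapacity"):
                    time.sleep(2 ** attempt)  # back off instead of hammering the API
                    continue
                raise  # anything else is a real error, not throttling
    raise RuntimeError("Could not launch an instance in any region")
```

            Of course, if thousands of customers run something like this at once, the fallback regions see exactly the surge described above.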

      • by Tony Isaac ( 1301187 ) on Tuesday October 21, 2025 @06:46PM (#65741682) Homepage

        I don't think it's fair to blame cloud hosting at all.

        If you shrink back from the cloud because you're worried about using "a huge megacorp that can survive without you", what are you going to gain by self-hosting? You'll still be dependent on huge megacorps that can survive without you:
        - The electric utility company
        - The ISP that connects you to the internet
        - The landlord that owns the building where you are hosting

        Sure, you can mitigate each of these risks.
        - You can connect to multiple power utilities, add standby generators and UPSes.
        - You can connect to multiple ISPs, each of which is capable of handling your full load of data.
        - You can host duplicate copies of your infrastructure in different locations to make it possible to fail over.

        But how many companies that choose to self-host will go to all that expense? Not. Very. Many.

        • Yes it's true, we can't escape the interconnected web of dependencies.

          I guess my real gripe is that there are too few cloud hosting companies, and the few that exist are too big. We need many more medium-sized ones so that a single outage doesn't do so much damage, and so they have to compete against each other for business to keep their incentives properly oriented.

        • by Xarius ( 691264 )

          There are such things as smaller independent ISPs where, if something goes wrong, you can speak to a human who will likely care. The same goes for landlords. Electricity supply, well, that depends on the country. Some things are unavoidable.

          If half the internet were not on AWS but spread across 15-20 different providers, an outage at one of them would not be a big story.

          • That's all good; deal with smaller companies. But that doesn't help you when the whole data center burns, as happened with the (smaller) EV1 data center in Houston a few years ago. https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fwww.datacenterknowledg... [datacenterknowledge.com] For situations like that, good customer service isn't going to get your system up and running right now. You need failover capabilities to data centers located in some other region. How many companies that self-host do that?

        • by flippy ( 62353 )

          I used to work for a company that did all those things. We contracted with multiple geographically diverse datacenters, each of which had multiple redundant ISP connections and diesel generators the size of garbage trucks in case grid power went out, and we ran replication software to keep the backup DC servers in (as close to) real-time sync with servers in the primary DC. At our office, we installed our own generator to power the entire office if necessary. Each server had 4 copies (primary in primary DC, backup

          • Yes, indeed, it's possible to self-host in a resilient way. It's just highly unusual. Those who complain about the costs of cloud hosting often forget how difficult and expensive it is to get the equivalent disaster preparedness that you get with big cloud providers. As we have seen, though, even the big boys fail sometimes.

        • Doesn't even matter if you self-host - the critical failure here was a Domain Name System (DNS) resolution problem, which presents a single point of failure to EVERY hosting provider and EVERY service on the internet (public / private / commercial / self-hosted). The general public takes for granted that DNS will continue to work - acting as a database to keep track of which hostname should point to which IP address. So many potential attacks against DNS exist (poisoning, redirection, DoS, etc.) that unless and un
        • No matter how you slice it, vertical integration sucks. You shouldn't embrace it.

      • by mysidia ( 191772 )

        the correct accusation is: "you shouldn't outsource your critical business infrastructure to a huge megacorp that can survive without you."

        Perhaps you should not, but most businesses DID NOT and WILL NOT build a resilient in-house infrastructure that provides anywhere near the average uptime of AWS.

        For example, at 99% of companies -- even large corporations -- the internal email the whole company relies on would typically be on a single MS Exchange 2016 server. You would have a hard drive crash, and the server wou

        • You realize everyone making the in-house argument has a job maintaining in-house systems?
        • "but of course it has the advantrage that your outage will typically not happen at the exact same time as a thousand other corporations' outages."

          For many companies, that isn't an advantage. When AWS or Microsoft goes down, you can at least reassure yourself that, OK, your IT systems are all down for the day, but so are all your customers' - you're not actually missing any work, and a lot (most) of it will get caught up later/tomorrow when everyone's back online.

          If it's

    • This is why you don't host your critical service / product / infrastructure in a single AZ
      • And if it's really, really super critical, you should probably have it hosted with more than one cloud provider as well. 'Cause multiple AZs aren't going to save your ass if there's some kind of billing/accounting mishap at AWS and your cloud account goes poof.
        • This.

          People need to be set up on multiple CSPs. Not just for resilient service, but also so that when one provider is charging you too much, you can quickly shift more of your traffic to the cheaper one. And since it's a "cloud," you ought to be able to quickly spin up and spin down instances in response to price, demand, and availability.

          Just slotting an app into AWS-West and walking away means you aren't going to survive a disruption and you also likely are paying too much.

      • Come back and comment when you know the difference between a region and an AZ. Also be sure to understand the huge leap in complexity/expense between a multi-AZ implementation and a multi-region implementation.
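
        To make the distinction concrete, here's a rough boto3 sketch (placeholder AMI IDs): multi-AZ is mostly a placement parameter against one regional endpoint, while multi-region means a second endpoint, a second region-specific AMI, and data replication and failover you have to build yourself.

```python
import boto3

# Multi-AZ: one region, one client, instances spread across availability zones.
ec2_east = boto3.client("ec2", region_name="us-east-1")
for az in ("us-east-1a", "us-east-1b"):
    ec2_east.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": az},
    )

# Multi-region: a separate client against a different endpoint and a
# different (region-specific) AMI -- plus the cross-region data replication,
# DNS failover, and traffic management that are entirely on you.
ec2_west = boto3.client("ec2", region_name="us-west-2")
ec2_west.run_instances(
    ImageId="ami-0fedcba9876543210",  # placeholder
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
```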

    • East went down. West was up the whole time. Any company not run by idiots had it in the contract that they could easily switch to their West instances when East went down.
      • Re: (Score:3, Interesting)

        by alvinrod ( 889928 )
        For large failures that won't save you. Does Amazon have enough infrastructure to run all of the East instances on their West hardware? That's doubtful and if they tried it would degrade performance if not outright take down the West due to the load.

        Having someone to pick up the slack is only possible if there's excess infrastructure in place to handle it. If there were dozens of smaller players this isn't a problem, but if there are only two or three major providers, none of them will overprovision enou
        • by Cyberax ( 705495 )

          For large failures that won't save you. Does Amazon have enough infrastructure to run all of the East instances on their West hardware? That's doubtful and if they tried it would degrade performance if not outright take down the West due to the load.

          The replacement for us-east-1 is us-east-2. Amazon has been pushing companies to use it for quite a while. They even have discounted traffic between us-east-1 and us-east-2; it costs exactly the same as traffic within us-east-1.

          • For large failures that won't save you. Does Amazon have enough infrastructure to run all of the East instances on their West hardware? That's doubtful and if they tried it would degrade performance if not outright take down the West due to the load.

            The replacement for us-east-1 is us-east-2. Amazon has been pushing companies to use it for quite a while. They even have discounted traffic between us-east-1 and us-east-2; it costs exactly the same as traffic within us-east-1.

            What do you mean “pushing for”? Whose network is it anyway? Do ISPs also ask users what DNS they want to use, or is that assigned to them? Why is Amazon having to convince their users to use what should be automatic redundancy?

            This is like having to manually route traffic through redundant firewalls wearing a traffic cop outfit sitting behind a keyboard, while pretending we haven’t invented DNS aliasing or load balancers yet.

            A redundant solution, is a capable backup. A replacement solu

            • by Cyberax ( 705495 )

              What do you mean “pushing for”?

              Advising customers, providing discounted traffic and instances, defaulting to us-east-2 in the console for new accounts, etc.

              Why is Amazon having to convince their users to use what should be automatic redundancy?

              It's not the question of redundancy. It's the problem with customers just building everything in us-east-1, so it now dwarfs everything else.

          • by kriston ( 7886 )

            While this has been true since us-east-2 came online, the idea that it would replace us-east-1 is apparently no more. That's because Amazon announced they're building seven more AZs in us-east-1.

            Not to mention the many Local Zones popping up across the US.

            • by Cyberax ( 705495 )
              The push is still there, it's just that us-east-1 has so much inertia that they _have_ to keep expanding it.
      • As I said in a prior post, I lost multiple services in Idaho.
        • What are you saying here? AWS doesn't host services in Idaho. It would not be unusual for you to connect to services in Northern Virginia.

          Were you thinking that an outage in Northern Virginia should only affect users in Northern Virginia? That's not how the cloud works.

    • Only one problem with that idea: my systems were up/online all day yesterday, connecting flawlessly, and I didn't even know AWS was down or having problems till late afternoon when I had a chance to browse the news. As in all things, diversification is the best disaster protection. When the DNS root servers are all hosted on AWS, then we're in trouble. But with Cloudflare and Control-D over DoT, my systems' connectivity was fine yesterday and my Tor relay node was up and online all day long. AWS went down yesterday? Re
    • When one source dominates a market, that source becomes "orgullosísima" (very proud, in Spanish). When companies become very proud, they tend to do stupid things that can cause the whole thing to come crashing down, or, just because of the sheer scale of their operations, they expose their customers to too much risk. The problem is that everyone gravitates to the "market leader" thinking they are the most competent in the market, when in reality they might not be. The market leader is actually getti

      • by gweihir ( 88907 )

        Yep. Arrogance, greed and stupidity. It always takes over when people do not stay careful and at least somewhat humble.

    • by gweihir ( 88907 )

      Indeed. Gives you super-efficient and effective outages as well. I am just waiting for the day they cannot recover or where it takes weeks.

    • by mysidia ( 191772 )

      The thing is, it cost billions in revenue that Amazon created the opportunity to earn in the first place.

      It is not as if AWS centralization is this critical threat that caused billions in damage. They enabled many billions in revenue generation, which was slightly reduced during a short outage -- which is extremely minor compared to the value AWS provides. I mean, a 24-hour outage is not even a concern. Come back when they have a real catastrophe and it's a major 7-day outage. Even that, quite honestly, may not

  • by AmazingRuss ( 555076 ) on Tuesday October 21, 2025 @03:56PM (#65741368)
    ... suddenly disappearing with a 'poof'.
  • by sinij ( 911942 ) on Tuesday October 21, 2025 @04:03PM (#65741382)
    Just like that saying goes: if you owe the bank a million, they own you; if you owe the bank a billion, you own them.
  • by Anonymous Coward on Tuesday October 21, 2025 @04:11PM (#65741408)

    A website I use for work was down, but I just worked on other stuff and then later when it was back up I did the stuff I would have earlier. Nothing of value was lost. And I don't mean the website that was down isn't valuable, it's important to a lot of my work, but it's back up and the dollar value of the downtime to me or to my employer is basically zero. If we had to replace it entirely, the cost would be substantial, but just not being able to use it for a few hours or an entire day costs nothing.

    I have a hard time believing anything worth billions was destroyed. Maybe some purchases got delayed. Maybe some things got bought from a different source. Maybe some people worked on restoring service instead of what they would have been working on, but then got around to doing the work that they would have done.

    Maybe some people had an easy day of sitting around accomplishing less than they would have but then followed that with a busy day of working faster than usual to catch up.

    Anybody have any anecdotes about things that actually got damaged or destroyed that could possibly account for a claim of "billions"?

    • by nightflameauto ( 6607976 ) on Tuesday October 21, 2025 @04:31PM (#65741452)

      A website I use for work was down, but I just worked on other stuff and then later when it was back up I did the stuff I would have earlier. Nothing of value was lost. And I don't mean the website that was down isn't valuable, it's important to a lot of my work, but it's back up and the dollar value of the downtime to me or to my employer is basically zero. If we had to replace it entirely, the cost would be substantial, but just not being able to use it for a few hours or an entire day costs nothing.

      I have a hard time believing anything worth billions was destroyed. Maybe some purchases got delayed. Maybe some things got bought from a different source. Maybe some people worked on restoring service instead of what they would have been working on, but then got around to doing the work that they would have done.

      Maybe some people had an easy day of sitting around accomplishing less than they would have but then followed that with a busy day of working faster than usual to catch up.

      Anybody have any anecdotes about things that actually got damaged or destroyed that could possibly account for a claim of "billions"?

      You're looking at this logically. Stop that.

      I'm sure that what they've done here is calculate how much revenue is generated via the hosted services per hour, multiply it by the downtime, and just shove that number out as the number of dollars lost. As you say, it won't actually be that high, but everything now has to be about profit gained or lost. And really the only way to even begin to get people to have a conversation about whether throwing all your digital eggs in one basket is a good idea is to scare the shit out of them by showing them potential lost profits.

      I don't personally love that this is the timeline we're in, but the scare should be real here. And big decision makers need big numbers flashing in their face or they won't even think about changing their methodology.

      • by irchans ( 527097 )

        You're looking at this logically. Stop that.

        LOL :)

      • And big decision makers need big numbers flashing in their face or they won't even think about changing their methodology.

        But why change? Ultimately the people making these decisions don't base them on ZDNet articles or the media reports of users "panicked" because the Downdetector plugin showed they weren't able to access something.

        The GP is on the money: a delayed sale is not a lost sale. Business transactions conducted via the internet on AWS don't just evaporate. I don't stop needing ${thing_I_want_to_buy} simply because the website was unavailable.

        • And big decision makers need big numbers flashing in their face or they won't even think about changing their methodology.

          But why change? Ultimately the people making these decisions don't base them on ZDNet articles or the media reports of users "panicked" because the Downdetector plugin showed they weren't able to access something.

          The GP is on the money: a delayed sale is not a lost sale. Business transactions conducted via the internet on AWS don't just evaporate. I don't stop needing ${thing_I_want_to_buy} simply because the website was unavailable.

          So, is your take that we should just ignore these giant down periods until one becomes "the big one" because nobody took them seriously? I honestly don't think that's the right view, because the corner-cutting that led to it will continue until there are no further corners to cut. When one of these giant clusterfuck hosts drops for three to five days and nobody knows how to fix it, real damage will be done. I'd prefer we see the issues that led to this outage addressed as something other than a sh

    • Money not earned is money lost. Imagine you run a bakery - you pay for the employees and other operating costs of that store, but the doors won't open because it broke for 2 days. This means you have lost 2 days' worth of profit.
      • Are you claiming that bakers couldn't bake without AWS? Or that people went hungry?

        Even if a bakery did shut down due to an AWS outage, which I doubt, the people who would have bought the baked goods almost certainly bought something else. No money was lost, it was just spent somewhere else.

        When things are destroyed value is lost. The broken window fallacy is a fallacy because the work that went into replacing a broken window could have gone into installing a window in a new location. Two windows is more th

      • by jbengt ( 874751 )

        Money not earned is money lost. Imagine you run a bakery - you pay for the employees and other operating costs of that store, but the doors won't open because it broke for 2 days. This means you have lost 2 days' worth of profit.

        If you've only lost 2 days of profit, that means you broke even on those 2 days, which would be doubtful. But you likely had no income while still stuck with ongoing expenses like utilities, rent, salaries/benefits, etc. So you've actually lost more than just the money not earned.

      • Real example from my life:

        AC didn't work. Nobody wanted to come into my place and shoot pool because it was around 120 degrees in there. Landlord fixed the problem, but it took 3 days.

        Landlord's remedy was to not charge me rent for 3 days (rent was $6440/month).

        That hardly made me whole and I definitely lost money.

    • by Anonymous Coward on Tuesday October 21, 2025 @05:43PM (#65741578)

      Use as many words as possible to tell me you're a tiny fish.

      Here's something that cost people I know money. Toast (The SaaS POS that underpins a TON of restaurants' operations) was down all damn day.

      Part of Toast is 3rd-party integrations (DoorDash, Uber Eats, GrubHub, Postmates, etc.), and that was down until the early hours of this morning. A lot of restaurants get 25-40% of their income from 3rd-party integrations, but those were down until the very end. My friends lost about a thousand dollars in sales for the day, plus had to take on extra risk because Toast was "offline" most of the day, so anyone who passed an over-limit credit card got away with it. And that's just a single-location QSR.

      And you do realize that "work[ing] on restoring service instead of what they would have been working on" also costs money, right? Man hours spent waiting on AWS, or re-checking AWS, or reporting up the chain that AWS is still having issues, costs companies money.

      • Are you claiming that people went hungry because of this AWS outage? I find that hard to believe. But even if some people did skip some unhealthy delivered meals, the money they would have spent is still in their pocket waiting to be spent on something else. You could just as well claim that billions were gained as a result of the AWS outage because people who couldn't order food delivered bought something later with the money they didn't spend on food delivery.

        And I seriously doubt AWS or other companies s

    • by mysidia ( 191772 )

      What was actually damaged/destroyed

      The damage was that additional revenue-generating opportunities normally enabled by AWS were lost.

      For example: if your e-commerce website is down for an hour because of an outage, there is a certain volume of sales -- a revenue opportunity -- which you lose.
      You calculate that loss by using past data to estimate your expected revenue during those particular hours of the day, multiplied over the number of hours you were down, giving an estimated order count and dollar sales volume lost.
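
      A back-of-the-envelope version of that calculation, with made-up numbers: take historical average revenue for each affected hour of the day and sum it over the outage window.

```python
# Hypothetical historical averages: revenue (USD) by hour of day,
# derived from past sales data.
avg_revenue_by_hour = {11: 4200.0, 12: 6800.0, 13: 6100.0, 14: 5000.0}

def estimated_lost_revenue(outage_hours):
    """Sum expected revenue over the hours the site was down."""
    return sum(avg_revenue_by_hour.get(hour, 0.0) for hour in outage_hours)

# Outage from 11:00 to 15:00 local time:
print(estimated_lost_revenue(range(11, 15)))  # -> 22100.0
```

      As the replies below point out, this treats every delayed sale as a lost sale, so it's an upper bound at best.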

      • If your e-commerce website is down for an hour because of an outage, there is a certain volume of sales -- a revenue opportunity -- which you lose.

        Only if during that hour those customers make the decision that they really didn't want what you were selling. If they buy it later today or tomorrow or next week, you didn't lose anything. If a container ship carrying your product across the ocean sank, that's a loss. The item you didn't sell during your e-commerce website outage is still in a warehouse waiting to be sold. When it sells, you'll collect the money that you didn't collect when your website was down.

        And if the outage did allow them to realize

  • by xack ( 5304745 ) on Tuesday October 21, 2025 @04:12PM (#65741412)
    Think about how much you rely on your DNS server, your DHCP lease, your clock being right to validate certificates, and having the right combination of browser features so you are considered "human" and not a bot. In the old days of the internet you just dialed up and fetched simple static HTML pages; now we have vibe-coded contraptions with huge dependency graphs. I still think it's inevitable that Microsoft will corrupt the Secure Boot process somehow and render all Windows PCs unbootable as the ultimate screw-up.
    • by Bongo ( 13261 )

      So I'm actually saving your post, for that day. Maybe print it out, put it in a frame, and hang it in a gallery.

      Because on that day, that's the only way anyone is gonna see it.

  • House of Cards. (Score:5, Insightful)

    by Fly Swatter ( 30498 ) on Tuesday October 21, 2025 @04:14PM (#65741420) Homepage
    Between AWS, Cloudflare, Google, Microsoft, and whoever else, there are too many single points of failure. But then, that is the design philosophy for any modern infrastructure or manufacturing project.
    • Between AWS, Cloudflare, Google, Microsoft, and whoever else, there are too many single points of failure. But then, that is the design philosophy for any modern infrastructure or manufacturing project.

      Let's be fair and remember that greybeards are well aware of the inherent problem. DNS has always been the Achilles' heel of the internet, since before any of those entities even started thriving on that newfangled World Wide Web and eventually grew into single points of Too Big To Fail in the S&P 500.

      Single points of failure only reinforce the stupidity. Even a static record for all.eggs.basket won't save you. But hey, we grew one hell of a house-of-cards stock market in the meantime. Cough. S&P5

  • by YuppieScum ( 1096 ) on Tuesday October 21, 2025 @04:15PM (#65741424) Journal

    Ken Birman, a computer science professor at Cornell University, told Reuters that "software developers need to build better fault tolerance."

    Ken is missing the point: management need to budget for better fault tolerance, then the developers can build it.

    "When management cut costs and cut corners to try to get an application up, and then don't care that they skipped that last step and didn't really protect against an outage, those managers are the ones who really ought to be sacked later."

    FTFY, Ken.

    • Ken is missing the point: management need to budget for better fault tolerance, then the developers can build it.

      When I worked in the enterprise, my experience was that fault tolerance was more often deficient due to inadequate hardware budgets than to lacking software capabilities, but I know nothing about this particular event.

      • It's almost rudimentary today to have fault tolerance in the design. Horizontal scaling automatically gains some level of fault tolerance unless you specifically build it without.

        Which is usually a budget constraint, not a developer constraint.

        My sites are on AWS, and they all stayed up yesterday. But they aren't multi-region. That's the risk we take.

      • ...fault tolerance was more often deficient due to inadequate hardware budgets than to lacking software capabilities...

        That's fair as far as it goes, but adding hardware doesn't magically make fault tolerance happen, unless your platform is Tandem's NonStop.

        Ultimately, unless there is sufficient money to pay for the software development, hardware suite (dev, test and prod), testing and ongoing maintenance of a fault-tolerant system, then you get a system that stops working if, or rather when, there are faults.

    • But do they? We're accepting the cost of $billions as fact despite it seeming to be largely pulled out of someone's arse. What is a cost here? A delay doesn't necessarily incur any cost. If a website is down and an order can't be placed, that order will likely be placed an hour or two later. If your production-critical system generates actual dollars based on time and sits in the cloud, then you may incur a loss (and your stupid arse should be fired), but the reality was the story from CrowdStrike w

    • I agree that management is often the problem. I've worked on many projects and it's the managers/marketing/sales that are pushing to release without fully completing the project. Cutting corners is standard and wrong.

  • by ebunga ( 95613 ) on Tuesday October 21, 2025 @04:22PM (#65741440)

    At some point AWS is going to delete all customer data, or otherwise cause it to be unrecoverable. AWS is too large and complicated for it to not suffer a catastrophic failure at this scale. It's inevitable.

    • No way! That could only happen if the management were self-serving shortsighted corrupt nincompoops. How likely is that in a big organisation?

    • I mean, this should be basic recovery scenario number one in any continuity and disaster recovery plan.

      And it doesn't have to be AWS that has the oops. It could be an undetected corruption that borks the database during the next major engine/schema upgrade. A disgruntled employee wipes the encryption keys. A ransomware gang infiltrates your accounts, exfiltrates your customer data, and then locks you out.

      Having AWS be the god-tier admin layer just creates one additional source of heartache if this happens on a

    • by gweihir ( 88907 )

      Or it happens at Google or Microsoft first. My money would be on MS Azure collapsing or getting fully compromised first (it has already happened partially, and several times now), then Amazon, then Google. (Caveat: I know one of the security experts at Google, and what they do seems at least reasonable. But who knows.)

    • At some point AWS is going to delete all customer data, or otherwise cause it to be unrecoverable. AWS is too large and complicated for it to not suffer a catastrophic failure at this scale. It's inevitable.

      I really should stop laughing at this problem of theirs.

      But then I think about Facebook trying to sell ads to a few hundred million dead people, and it makes me laugh even harder.

      Don't you go worrying about Too Big To Fail ever failing. Any large enough network problem will become a taxpayer problem to fund anyway. "Critical" infrastructure and all, in the Land of the Free consumer-spending GDP. And if a member of the Magnificent Seven ever actually collapses? Our crashing financial m

    • by Bongo ( 13261 )

      Sounds hyperbolic.

      Now, where did I leave my car keys...

  • Business as Usual (Score:5, Informative)

    by RossCWilliams ( 5513152 ) on Tuesday October 21, 2025 @05:32PM (#65741554)
    This is typical of the tech world in general. There will be no real consequences for Amazon compared to the costs. The billions in damages are going to be left for others to pick up.
    • And rightfully so. The $billions number was pulled out of the arse of some random CEO's quotebook. There's no meaningful figure attached to that. It has all the relevance of claiming the office coffee machine costs $billions because people get up and go use it instead of working.

      The reality is people don't stop working because a couple of websites or internal services are down. God my employer fucking lives in the cloud, but at no point through the couple of outages we've had over the past 5 years

      • The reality is people don't stop working because a couple of websites or internal services are down.

        That depends on their job doesn't it? But whatever the actual damages, Amazon won't be paying them. The same is true for other tech failures.

        When VISA goes down thousands of businesses and their customers may have to stand around trying to figure out how to pay for things. But VISA isn't going to pay that bill. If a Windows update crashes someone's computer causing them to miss a deadline, Microsoft isn't going to pay them for the tax penalty or lost business.

        In fact it's likely that most of the people in

      • +1 Insightful (out of mod points for now).

        It may be possible to measure the real impact on some people/companies if we consider the cost of delayed transactions and some opportunity costs. But then, to be fair, you would have to include the gains of those who benefited from this situation. More likely than not, those billions are bullshit...

    • by gweihir ( 88907 )

      Yes. And that is because without IT liability, IT will never be a mature, dependable and secure discipline. And the damage is rising.

        • Yes. And that is because without IT liability, IT will never be a mature, dependable and secure discipline. And the damage is rising.

        Realize how many laws we have created around the banking industry alone, because banks are the very entities responsible for handling money. Now realize how many financial crashes we've still suffered, regardless of ALL that.

        IT isn’t the liability. Greed is.

        Given a reasonable budget, the IT professional can recommend and build incredibly resilient networks. But when the cheap-ass greedy executives are more focused on chopping heads and stock buybacks to maximize THEIR returns, those highly re

        • by gweihir ( 88907 )

          And how many banks have lost all their customer data or had all their money stolen over the Internet because of IT problems? Right. As far as I know, that number is zero. And that comes from regulation and liability.

          You are talking about the business side of things. That one is different. It is not engineering and it is insufficiently secured with liability and regulation, exactly because of greed. But it is a different problem.

  • blaming the brain drain at Amazon [theregister.com]. I like the comment imagining that "HR will roll out a 'Resilience Recognition' badge on the intranet," among its description of management's $0-expenditure response.
    • by gweihir ( 88907 )

      Would not surprise me. Get rid of the smart people and everything crumbles after a while. Amazon is not the only place that is happening.

  • Great real-world test of services failing. I think this has less to do with AWS screwing up and more with people putting all their eggs in US-East-1...

    This was a temporary, resolvable issue. What happens if there is an irreversible issue that takes out a chunk of the servers sitting in US-East-1?

    If you had warm standby systems synchronized and ready to go in a different region, this shouldn't have had much impact (other than maybe having to scale to handle a transfer of load from clients normally localiz

  • And then something like this happens. Or any of the by-now-large number of other costly screw-ups.

  • Businesses can choose to not have an outage by having distributed infrastructure. AWS makes it insanely easy to do that today compared to 15 years ago. But no, many businesses would rather suffer the damage from outages now and then while praying us-east-1 stays online indefinitely.

    If you are using a vendor that has an outage due to us-east-1 issues, that's a red flag - it means they are cutting costs, not employing skilled engineers, and/or just don't give a damn about your service.

  • The blame for the downtime is on AWS, not anyone in the company. Not even the CTO for choosing AWS...
  • Many people take DNS for granted, but they sure get up in arms when it fails. Well, at least it happened to Amazon. Fuck you, Amazon.
  • How do you test for what happens when you have a billion users at once?

  • I don't use many services anymore. I don't get on many social media sites. I only knew there was a problem when I read about it on lemmy.
