redundancy - Do browsers re-try DNS when a page load fails?


After Amazon's failure, and after reading many articles on what redundant/distributed means in practice, DNS seems like a weak point. For example, if DNS is set up round-robin among data centers and one of the data centers fails, it seems many browsers will have cached the DNS result and will continue to hit the failed node.

I understand time-to-live (TTL), but of course that may be set to a long time.

So my question is: if the browser gets no response from an IP address, is it smart enough to refresh its DNS in the hope of being routed to another node?
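For a concrete picture, here is a minimal Python sketch (my own illustration, with a placeholder hostname) of what a round-robin name looks like from the client side: one lookup can return several A records, and it is then up to the browser or OS which one actually gets connected to, and whether it falls back when that one fails.

    import socket

    HOSTNAME = "www.example.com"  # placeholder; substitute any round-robin name

    # getaddrinfo returns every address the resolver knows for the name,
    # in the order the DNS server handed them back.
    for family, socktype, proto, canonname, sockaddr in socket.getaddrinfo(
            HOSTNAME, 443, type=socket.SOCK_STREAM):
        print(sockaddr[0])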

Handling of round-robin DNS is a per-browser thing. Here is how Mozilla does it:

A single host name may resolve to multiple IP addresses, each of which is stored in the host entity returned after a successful lookup. Netlib preserves the order in which the DNS server returns the IP addresses. If at any point during a connection the IP address in use for a host name fails, netlib will use the next IP address stored in the host entity. If that one fails, the next is queried, and so on. This progression through the available IP addresses is accomplished in the NET_FinishConnect() function. Before a URL load is considered complete because its connection went foul, its host entity is consulted to determine whether or not another IP address should be tried for the given host. Once an IP address fails, it's out, removed from the host entity in the cache. If all IP addresses in the host entity fail, netlib propagates a "server not responding" error up the call chain.
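To make that progression concrete, here is a rough Python sketch of the same idea. It is my own illustration, not Mozilla's code: the function name and error message are placeholders, and the real netlib logic lives in C inside NET_FinishConnect().

    import socket

    def connect_with_failover(hostname, port, timeout=5.0):
        """Try each IP address returned for hostname, in DNS order."""
        addresses = [info[4][0] for info in
                     socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)]
        for addr in list(addresses):
            try:
                return socket.create_connection((addr, port), timeout=timeout)
            except OSError:
                # This address failed: drop it from the cached list and try the next.
                addresses.remove(addr)
        # Every address failed: surface the equivalent of "server not responding".
        raise ConnectionError(f"server not responding: {hostname}")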

As for Amazon's failure, there was nothing wrong with DNS during Amazon's downtime. The DNS servers correctly reported the IP addresses, and the browsers used those IP addresses. The screw-up was on Amazon's side: the re-routed traffic overwhelmed a cluster. DNS was dead-on; the clusters simply couldn't handle the huge load of traffic.

Amazon says it best themselves:

EC2 provides two important availability building blocks: Regions and Availability Zones. By design, Regions are separate deployments of our infrastructure. Regions are isolated from each other and provide the highest degree of independence. Many users utilize multiple EC2 Regions to achieve extremely-high levels of fault tolerance. However, if you want to move data between Regions, you need to do it via your applications, as we don't replicate any data between Regions on our users' behalf.

In other words: "Remember all that high-availability we told you we have? Yeah, it's still up to you." Due to their own bumbling, they took out both the primary and secondary nodes in the cluster, and there was nothing left to fail over to. And when they brought it back up, there was a sudden "re-mirroring storm" as nodes tried to synchronize simultaneously, causing further denial of service. DNS had nothing to do with it.
