We received the following notification from the University Information Systems’ Directory Services department about the PIN System.
Our monitors picked up an outage on the PIN system at 10:42 AM this
morning. At this point we believe that just one of our PIN servers is
unavailable, and that users connected to the other servers, or making new
connections to the PIN system, should be authenticating successfully.
However, connections to the failed server will remain cached on client
machines for 5 minutes from the previous login attempt. The PIN team is
actively working with NSS to resolve the problem and hopes to
bring the failed server back online ASAP.
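The 5-minute caching behavior described above can be sketched roughly as follows. This is a hypothetical illustration of the client-side effect only (the class, TTL handling, and server-selection logic are assumptions, not the actual PIN client code): a client that has a server cached keeps using it until the entry expires, so users bound to the failed server can see errors for up to 5 minutes even while the other servers are healthy.

```python
import time

# Hypothetical sketch of the caching behavior described above: a chosen PIN
# server is cached for 5 minutes, so a client bound to the failed server keeps
# going back to it until the cache entry expires.
CACHE_TTL = 5 * 60  # seconds

class ServerCache:
    def __init__(self):
        self._entry = None  # (server, timestamp) or None

    def pick_server(self, servers, now=None):
        """Return the cached server if still fresh, else pick a new one."""
        now = time.time() if now is None else now
        if self._entry is not None:
            server, stamp = self._entry
            if now - stamp < CACHE_TTL:
                return server  # still cached, even if that server is down
        # Cache empty or expired: choose from the currently available servers.
        server = servers[0]
        self._entry = (server, now)
        return server
```

In this sketch, a client that cached the failed server at time 0 keeps getting it back for 300 seconds, and only picks a healthy server once the entry ages out, which matches the "5 minutes from the previous login attempt" window described in the notice.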
I apologize for the inconvenience this is causing your customers. More
details on the outage will be made available as soon as we have them.
Stay tuned for further updates.
Update on February 23, 2011:
The PIN system has been fully functional since 12:08 PM. We continue to investigate the root cause and ensuing impact of the partial outage and slowness of the service. At no time was the PIN service fully down, but the reduced capacity, very high traffic, and the resulting slow response were, understandably, construed as an outage by many.
Here’s what we know:

- At 10:42 our monitors alerted us to a problem on one of our production PIN servers.
- At 10:44 we had taken that PIN server off the content switch, so traffic was going to the other PIN servers.
- At 10:56 the misbehaving server was restarted, looked healthy, and was placed back on the content switch.
- At 11:33 we saw the same PIN server having problems and pulled it off the content switch again.
- At 12:08, having bounced the box and the software components on it, we put the server back online, and the service has been stable since.
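The timeline above repeats one pattern: probe each server, pull the unhealthy one out of the pool the content switch sends traffic to, and restore it once it looks healthy again. A minimal sketch of that pattern, with entirely hypothetical names (this is not the actual NSS or content-switch API):

```python
# Hypothetical sketch of the rotation logic in the timeline above: given the
# pool behind the content switch and each server's health-check result, keep
# only the healthy servers in rotation and report which ones were removed.

def update_rotation(pool, health):
    """Split the pool into servers kept in rotation and servers removed.

    pool   -- list of server names currently behind the content switch
    health -- dict mapping server name -> True (healthy) / False (failing)
    """
    in_rotation = [s for s in pool if health.get(s, False)]
    removed = [s for s in pool if not health.get(s, False)]
    return in_rotation, removed
```

With three servers and one failing probe, `update_rotation(["pin1", "pin2", "pin3"], {"pin1": False, "pin2": True, "pin3": True})` keeps `pin2` and `pin3` in rotation and removes `pin1`, which is the 10:44 and 11:33 step in the timeline; putting the server back at 10:56 and 12:08 is the same call with its health restored to `True`.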
Customers would have begun experiencing issues with PIN around 10:42 AM and would have seen “problems” with the PIN service until 12:12 PM (1 hour 30 minutes). Here are some of the things that complicated connectivity for end users:

- We were seeing loads on the PIN service that we haven’t seen in years. We are investigating this, but suspect that help desks contributed to the load as they “tested” connectivity.
- That load overwhelmed the already-crippled service when the errant machine was pulled off the switch, making it REALLY slow (with a number of timeouts), which some users reported as an outage (which, to them, it was).
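The overload effect described above can be illustrated with a rough back-of-the-envelope model (the numbers and function are hypothetical, not measurements from the PIN service): once requests arrive faster than the reduced pool can serve them, a backlog grows for as long as the overload lasts, and requests stuck in that backlog are the ones that time out.

```python
# Rough illustration (hypothetical numbers) of why pulling one server from a
# small pool during unusually high load produces timeouts: when demand exceeds
# the remaining capacity, unserved requests pile up over time.

def backlog_after(seconds, arrival_rate, capacity):
    """Requests queued after `seconds` of overload (0 if capacity keeps up).

    arrival_rate -- incoming requests per second
    capacity     -- requests per second the remaining servers can handle
    """
    return max(0, (arrival_rate - capacity) * seconds)

# e.g. if three servers each handled 50 req/s and one is pulled from rotation,
# 120 req/s of helpdesk "testing" traffic against 100 req/s of capacity leaves
# a growing queue of slow or timed-out requests.
```

For example, `backlog_after(60, 120, 100)` yields 1200 queued requests after one minute of that load, while `backlog_after(60, 90, 100)` yields 0 because the remaining capacity keeps up. This is only a sketch of the mechanism, not a model of the actual traffic.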
Once again, I apologize for the inconvenience this caused you and your customers. We continue to investigate the root cause of this problem and any information that you can add to this would be helpful. As always, please contact me should you have any questions.