Norm figures a way out of the NTF conundrum
Norm was feeling the heat. Maybe bigger heat sinks would help, he thought desperately; but he had no way of knowing. He doubted that heat was causing failures of the line cards in his company’s routers. But how could he know for sure?
He was staring at the eighth instance of this problem in the last month. And he was supposed to be the company’s crack field service engineer on the case. He had heard that the customer, Verizon, was complaining. The trouble ticket he was working had been escalated from ‘severe’ to ‘critical’. He noticed the closed-door meetings his boss was having with the division’s vice president. Was that an involuntary tic, he wondered as he squinted at the screen in front of him. Norm was nervous.
Just last week his boss, a laid-back, likeable enough guy, poked his head over the wall of Norm’s cube and asked if he could talk to him. Norm gulped and invited him in. His boss wondered if “it might not be a good idea to assign this trouble ticket to one of the other field service reps.” Over the course of the conversation, Norm realized that his boss still had confidence in his ability to solve the most difficult cases. Norm relaxed his white-knuckle grip on the arms of his chair. By the end of the conversation they concluded that the time it would take to bring another field service engineer up to speed would be wasted since Norm would probably have the problem solved by then. There goes that tic in my eye again, Norm thought.
That conversation was a week ago, and Norm was starting to think that they had made the wrong decision. The problem was that he had no diagnostic capability for the failure he was seeing. His best guess was that it was some sort of complex software/hardware interaction. Just before the systems failed, logs were output indicating some sort of unrecoverable software error. Some rogue process must be trampling a protected data store and corrupting the OS. He had sent the logs and software debug information to the engineers in development, and he was told that they were making some headway on the software problem, but that was only part of the issue.
When one of the blades went down, it would lose communication with the outside world and the system’s ‘shelf manager’ would automatically alter the board’s status to System Busy (SysBsy). Alarms would sound at the Verizon Network Operations Center and the router’s messaging throughput would spike as 10 gigabits per second of Internet traffic was rerouted. At the same time the system would initiate a boot sequence to attempt to restore the blade to working condition. And here’s the problem that so perplexed Norm: the board would not boot! Of course, any of a number of different issues could cause this. So after much fruitless remote troubleshooting, he would tell the local craftsperson to yank the blade and send it to the factory. But Norm had come to expect that the blade would function just fine in the lab. And, of course, each one invariably did. He was beginning to see the logic behind his ironic moniker, “No Trouble Found”.
As Norm sat in his cube woolgathering, he felt a tap on his shoulder. It was his boss. “Come over to Tech Center 5,” he said. “We’ve got another one.”
Norm’s heart sank. Not again! He was as clueless now about how to proceed as he had been a couple of months ago. What now? he thought.
As they walked, his manager spoke up. “We might have better luck this time. The Dallas West Park #3 router has that special instrumented software load, Release 7.1. It has a pre-boot diagnostics engine from a company called ASSET InterTech. You’ve heard about it, haven’t you?”
Norm’s eyes flashed to the alarm console as they entered the Tech Center. Yes, he’d heard about the instrumented load. “That’s the one with the low-level control functions? BootROMBusTest, IOBusTest, CPUAddressData, and all those?”
“Yes,” said his manager. “You know how to use them?”
Without saying a word Norm sat down at the console and took control, entering commands quickly and efficiently. He could feel his old confidence coursing through his bloodstream. He quickly located the failed card and feverishly went into CPU debug mode. Now that he had direct control of the processor he finally felt that he was getting somewhere. It didn’t matter that the board didn’t boot – it didn’t have to.
“There!” he said triumphantly, pointing to the screen some 10 minutes later. “We have a failure on this SPI bus to the BIOS flash. No wonder the board couldn’t recover. We weren’t even launching the BIOS.”
His manager beamed, but then frowned. “But why don’t the boards fail all the time? And why couldn’t you duplicate the failure in the lab here?”
“I don’t know,” said Norm. “It might have something to do with environmental conditions. Maybe when the boards heat up and run operationally for a while something latent manifests itself. Maybe a tin whisker, solder void, marginal part, or something like that. And when we get the board back here the defect has gone into remission. But at least now I know where to look!”
Norm walked proudly out of the Tech Center and headed to the weekly staff meeting with his manager. His chest swelled almost imperceptibly with his returning self-worth. The meeting promised to be a good one. Today, he thought, NTF stands for Norman Thomas Follett, not No Trouble Found!
And Norm lived happily ever after…