CONNECT NEWSLETTER

INSIDE ASSET

It takes a village to live down initials like 'NTF'

Tim Dehne

For the umpteenth time, Norm cursed his mother's twisted sense of humor under his breath. Christened Norman Thomas Follett, he had spent his early youth in blissful ignorance of the meaning of what had become his nickname. Then, in his early 20s, he realized that there was a whole industry that thought NTF stood for No Trouble Found, not Norman Thomas Follett. By then Norm, or NTF as he was known to his intimate friends, was already in customer support for a telecom company, and it was too late to do anything about it. Even more ironic, he found himself in a sticky situation: he couldn't fix a particularly nettlesome problem with his company's routers. It had gotten to the point where the halls would ring with the shouts of his always supportive colleagues saying: "Hey NTF, so you couldn't fix it again, huh?"

Something had to change.

As Norm sat in the Tech Support lab at his remote terminal and watched the Verizon central Dallas router OC-192 blade go offline again, he mused about how he had gotten to this point…

Norm's mother started her high-tech career working in field service engineering for Nortel Networks in the early '80s. Right out of school, she turned into a crack troubleshooter for Nortel's flagship voice switch, the DMS (Digital Multiplex Switch). As a young fellow Norm enjoyed his mother's stories when she finally came home from work. "Solved a tough one today," she would say. "One of the CPU cards went offline right in the middle of a software upgrade. Would have been a disaster if its mate had died too." Although Norm didn't understand what she was saying, he knew that she was doing something important. There was an air of great responsibility about her when she spoke of her work.

It turned out that NTF, as his mother affectionately referred to him, shared his mother's problem-solving skills. He graduated with honors from college with a degree in Electrical Engineering. It seemed only natural at the time that he should follow in her footsteps. She, of course, was delighted and once his career choice was made, she again regaled him with stories from her past. He would listen raptly to her tales:

"Did I tell you about the time when the Nortel installers couldn't get the new DMS in downtown Chicago running in time for the big cutover from the analog crossbar switch? It was a crazy time. The crossbar guys didn't want to lose their jobs, so they didn't cut all of the voice trunks over from analog to digital. I couldn't figure out what was wrong with the new Nortel switch, so in desperation I ran upstairs and found the guys with the bolt-cutters sitting around having coffee!"

The stories were practically endless.

"The Alaska disaster was one of my favorites. We cut over a new DMS and everything went fine for the first few days. Then a huge flock of Canadian geese flew in front of the microwave tower which carried all of the voice and data traffic out of the office. Well, the DMS software started to thrash as it tried to return 10,000 trunks to service all at the same time. It took us a week to solve that puppy!"

Although little NTF enjoyed her stories, he was smart enough to want to learn from the past. So he asked her many questions about the technologies of the day. As it turned out, Nortel's DMS switch was built from the ground up on a proprietary hardware and software foundation. Because it handled voice calls, including emergency service 911 calls, it had to run continuously with very little downtime. If someone were having a heart attack and dialing for help, it was completely unacceptable for the DMS to have an outage. At some point in the early 1970s Nortel coined the phrase "our system will be offline for only 2 hours every 40 years." From that, the rest of the telecom industry took up the cry of 99.999 percent, or 'five nines', availability for all high-availability systems. Now, almost all mission-critical systems in telecom, computing, storage, military/aerospace, medical, industrial controls and other markets are expected to support 'five nines' availability.
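For readers who want to check the arithmetic behind those numbers, here is a minimal sketch. It assumes an average year of 365.25 days, and the function names are illustrative only, not taken from any Nortel or ASSET tool.

```python
# Assumption: an average year of 365.25 days (leap days included).
HOURS_PER_YEAR = 365.25 * 24


def availability(downtime_hours: float, period_years: float) -> float:
    """Fraction of time a system is up, given total downtime over a period."""
    return 1.0 - downtime_hours / (period_years * HOURS_PER_YEAR)


def downtime_minutes_per_year(avail: float) -> float:
    """Yearly downtime budget, in minutes, implied by an availability figure."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60


# The slogan quoted above: 2 hours offline every 40 years.
dms = availability(downtime_hours=2, period_years=40)

# The downtime budget implied by 'five nines'.
five_nines_budget = downtime_minutes_per_year(0.99999)

print(f"2 hours in 40 years -> {dms:.6%} available")
print(f"five nines allows about {five_nines_budget:.2f} minutes of downtime per year")
```

Under that assumption, two hours in 40 years works out to about three minutes of downtime a year, which sits comfortably inside the five-nines budget of a little over five minutes a year.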

With a shake of his head Norm came back to the present. He had come to understand that hardware and software systems achieved high levels of availability by being very reliable. And these systems became reliable through an ongoing process of continuous improvement, where sophisticated root cause analysis of field problems led to hardware and software improvements. And finding the root cause of a major problem is what totally absorbed him right now.

His console continued to flash the alarm that told him the OC-192 board was offline. What could be the problem? he wondered. The unit was powered up, but the shelf manager could not communicate with it. He knew the processor must be hung, but it was impossible for him to tell whether it was a software or hardware problem. Attempts to boot the OS or even run the BIOS for a rudimentary hardware check failed repeatedly. And what made the whole thing worse was that this was the eighth time in the last month that this particular problem had manifested itself in Verizon's routers. Although it had yet to cause an outage, Verizon was getting irritated. An escalation call had been made to Norm's VP and people were looking for answers.

Norm considered the possibilities. It was an intermittent problem. When a board was replaced, the new board worked fine. The faulty boards were sent back to the Failure Analysis lab where, of course, the dreaded NTF problem raised its head. Why would the boards fail in the field, but work in the lab? He subjected the boards to as many exhaustive tests as he could, but they always appeared to work fine in the lab. Each board eventually was sent out to a different Verizon site. (Company policy called for an NTF board to be returned to the field three times; if it failed in the field three times and still came up NTF in the lab, it was scrapped.) This particular vintage of board would work fine for a week and then the darn thing would fail again.

If Norm couldn't even run the BIOS, he knew there was some underlying hardware fault with these boards. But where was the problem? Somehow the CPU could not get through to the BIOS flash. Maybe there was something generically wrong with the SPI interface? Could it be intermittent problems with memory access? A bad batch of CPUs? Cosmic rays???

"Damn!" Norm thought. "I wish I had some decent tests I could run to diagnose these kinds of problems! Why aren't the 'carrier-grade' systems of today as reliable as the systems of two decades ago?"

Don't miss Part 2 in the next issue of Connect when Norm, with ASSET's help, solves the NTF problems of his circuit boards and resigns himself to his initials.