State of NPUs: a roundtable discussion
Network processors (or "network-processor units"—NPUs) have been touted as an important tool for designers looking to combine off-the-shelf flexibility and performance when developing systems and line cards. But like most new technologies, NPUs have stumbled against unrealized expectations, changing requirements, and market upheaval.
At last October's Network Processor Conference West, Lightwave assembled a panel of NPU experts to discuss current NPU dynamics and how they foresee the technology evolving to meet current and future requirements. Participants included:
Johan Borje, CEO, Xelerated 
Gregg Cook, CEO, Fast Chip 
Thomas Eklund, founder and vice president, business development, Xelerated
Hideo Kikuchi, vice president, Technology Products Group, Fujitsu Microelectronics America 
Michael Miller, CTO and vice president, System Technology Group, IDT 
Keith Morris, director of marketing, AMCC
Misha Nossik, chairman of the Network Processor Forum (NPF) and director of network processing, IDT 
Tom Riordan, vice president and general manager, MIPS Processor Division, PMC-Sierra 
Rupan Roy, chairman, founder, and CTO, Cognigine.
Roy: In cases where you have a single protocol that you want to deal with, you could probably go with a hardwired solution. But when you talk about multiple encapsulations, you're talking about packets that can be encapsulated within IPv4, within IPv6, and so on—that's where the protocols are so significantly different from packet to packet that you end up having to have programmable architectures to try to figure out exactly what kind of processing you need to perform on a particular packet. For example, what depth in the packet you need to process to, what kind of terminations need to be performed, and so on. Developing a purely hardwired architecture for IP is enormously complex, because the variances are so large, and would be quite impractical. We see this in storage processing as well. Ideally, theoretically, an NPU is really a very fast microprocessor. The reason these particular processors are faster (at processing packets) than conventional microprocessors is that they've been architected and designed specifically for protocol processing. They've been architected to understand what packets are and the way packet data flows take place, and they are provided with I/Os that are line-rate deterministic.
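Roy's point about encapsulation depth can be illustrated with a minimal sketch. It assumes plain IP-in-IP tunnelling only and ignores IPv6 extension headers; the function and helper names are hypothetical and do not come from any panelist's product. The only fixed facts it relies on are the standard header layouts (version nibble, IPv4 IHL and Protocol fields, IPv6 Next Header field) and the IANA protocol numbers 4 and 41.

```python
# Hypothetical sketch: walk nested IP headers to find how deep processing must go.
IPPROTO_IPV4_ENCAP = 4    # IANA protocol number for IPv4 carried inside IP
IPPROTO_IPV6_ENCAP = 41   # IANA protocol number for IPv6 carried inside IP


def walk_encapsulation(packet: bytes):
    """Return (versions, offset): the IP versions seen, outermost first, and
    the byte offset at which the innermost payload starts."""
    versions = []
    offset = 0
    while True:
        if offset >= len(packet):
            raise ValueError("truncated packet")
        version = packet[offset] >> 4                 # top nibble is the IP version
        if version == 4:
            header_len = (packet[offset] & 0x0F) * 4  # IHL counts 32-bit words
            if header_len < 20:
                raise ValueError("invalid IPv4 header length")
            next_proto_at = offset + 9                # IPv4 Protocol field
        elif version == 6:
            header_len = 40                           # fixed header only (assumption)
            next_proto_at = offset + 6                # IPv6 Next Header field
        else:
            raise ValueError(f"unsupported IP version {version}")
        if offset + header_len > len(packet):
            raise ValueError("truncated header")
        next_proto = packet[next_proto_at]
        versions.append(version)
        offset += header_len
        if next_proto not in (IPPROTO_IPV4_ENCAP, IPPROTO_IPV6_ENCAP):
            return versions, offset


def fake_ipv4(protocol: int) -> bytes:
    """Illustrative helper: 20-byte IPv4 header with only version, IHL, and protocol set."""
    header = bytearray(20)
    header[0] = 0x45
    header[9] = protocol
    return bytes(header)


def fake_ipv6(next_header: int) -> bytes:
    """Illustrative helper: 40-byte IPv6 header with only version and next header set."""
    header = bytearray(40)
    header[0] = 0x60
    header[6] = next_header
    return bytes(header)


# IPv4 carrying UDP (protocol 17), tunnelled inside IPv6.
packet = fake_ipv6(IPPROTO_IPV4_ENCAP) + fake_ipv4(17) + b"payload"
versions, payload_offset = walk_encapsulation(packet)
```

Because the number of loop iterations depends on each packet's contents, the amount of work varies from packet to packet, which is the variability Roy argues is hard to hardwire.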
Cook: There's a wide breadth of product types that could potentially use a network processor—from core routers to gateways in the home. And I think we can agree that this is a replacement market relative to the ASICs that are out there. Because even at the low end, the complexities and therefore the performance requirements are quite substantial. And the ability to have the flexibility to meet the evolving protocols or new protocols that are being introduced constantly in this field is very important.
Miller: I think another thing that distinguishes them from a CPU, of course, is that their particular interfaces are optimized for streaming data. The types of data that typically you're working on are often a repetitive sort of format today; in the future, they'll evolve more toward dynamic formats when you start to look at the content sort of area. But it's within that space that I think distinguishes them from the general-purpose CPU.
Morris: There are some other forces at work, obviously: business forces, in that the cost of doing ASICs is getting much, much higher. The burden of doing an ASIC and then amortizing that cost over whatever volume you have is not as appealing when volumes are lower.
Miller: So on one side of it is the ASIC, which is a hardwired device, and on the other side is the general-purpose CPU. So NPUs have carved out a special niche that's optimized for communications and streaming data.
Borje: I think also in the current market climate, there's a high focus on extending the life of existing platforms. So adding additional features that weren't there before is a very important dimension. We've heard IPv6 being mentioned; MPLS is also a feature that is being introduced but not rolled out immediately in large volumes. So in extending the life of existing platforms, network processors can get a foothold into the major platforms.
Riordan: What we see based on today's technology—say, 0.13-µm CMOS—is that if you want to build a router that can handle greater than 2.5 Gbits/sec, then you would use a network processor. If you want to build something to do from 2.5 Gbits/sec and below, you would use a general-purpose microprocessor, because in 0.13, a general-purpose microprocessor can handle up to 2.5 Gbits/sec. So you can just draw a graph and draw a line at 2.5 Gbits/sec, and that line will just move over time. It will just continue to go up as you go from 0.13 to 90 nm. General-purpose processors will be able to go from 2.5 to 5 Gbits/sec, and then above the 5-gig line, you'll use some sort of network processor.
Morris: But performance is only one parameter. The other parameter is the efficiency of the solution—the power, the footprint. If you take an NPU approach to something that is more targeted toward the networking application, you'll still have the programmability, but it will probably be in a smaller form factor with more efficiency for less power, which is important.
Roy: It's way too expensive. First of all, you wouldn't be able to get the functionality that we see being included in an NPU in an FPGA. Even with the largest FPGAs, you'd need a bunch of them. It would work very inefficiently. FPGAs have a 10:1 ratio in terms of gate density. So it really would be very difficult to do that from a cost point of view.
Morris: The other thing with an FPGA is that it's just a faster-time-to-market version of an ASIC. So you still have all of the architecture design and development work to design that product for that particular application. Then on top of that, you have to roll the software infrastructure to support that application. The dirty little secret about ASICs is that there is still a lot of software development that needs to go on. When you move into more of a programmable model, you separate the silicon technology from the application. And that allows you to have a lot more leverage in terms of focusing on moving forward with the processing power and the platform that runs the software, then in parallel you can develop the software and leverage that software from platform to platform. So you get a lot of leverage both in terms of doing the first product, but then in the follow-on products you get to reuse a lot of the design and architecture.
Miller: If the particular application goes beyond what the NPU that you want can do at a particular price point, you've got a question now whether to go up to the next NPU, double up on the current level of NPU that you have, or put a co-processor there. And part of what drives you along that path is wanting to add more services, going beyond just Layers 2, 3, and 4. When you start to push more complex sorts of ACLs or policy-based routing, or you start to push into other more extended sorts of accounting and billing, that is where those things start to happen. Another area that starts leading you down that path is when you start going into the content inspection area. So co-processor engines then offer a way for these particular types of application to be added on without necessarily having to add another complete NPU or do some sort of custom design to get there.
Cook: It has a lot to do with the vendor's decision of how he partitioned his solution. Some vendors have separate traffic management components versus packet classification and admin components. So the decision for a customer to use a co-processor or not is fairly straightforward; if they need one function or the other they can pick and choose.
Morris: For the moment, it's really a process of roll up your sleeves with your engineers and analyze what you want to do. Memory bandwidth is key—that's the ultimate bottleneck today, it's not necessarily the performance. Then it's the ease of programmability, the viability of the supplier—are they going to be around? Can the supplier meet more than just one point product?
Miller: I think that because there's been this experience with previous generations not living up to what people were hoping them to be, there's clearly a need where customers are asking suppliers to "show it to me in terms of a reference design so I can see it work"—which is good if you have something that matches up with what they want to do. But then I think you get into an area where people are trying to push forward and they want some other thing that's not in the reference designs. And now they're having to try to reprogram the reference design or maybe build something of their own. But that can be really costly in terms of time, because if you go in the wrong direction, there could be a reset. So I think one of the other aspects that is going to become very important is simulation tools, so that people can do a mock-up and try that out.
Cook: I think it's a difficult sell to go in and talk architectures and your part to the customer. I know customers really don't care about the number of CPU cores on your chip. All they care about is their application and the problem they're trying to solve. And I think to just dump a bunch of tools in their lap and say, "Go figure it out," is a non-starter, at least in my experience. Everyone has to have a reference design where you go, and you code it up for the customer and give it back to them. That's the only way a customer is going to commit to you. It's really a "show me" crowd out there.
Roy: We've found customers like that as well. But we have found customers who are quite educated. It's quite surprising; they actually will dig deep into the architecture, they will try to understand it. Generally, we have found that with customers that are really ready to go, those who have a design and are ready to start, you don't quite have to sell as hard. If they are able to figure out pretty quickly that they like your architecture in general, they will start putting resources on it. Those that are months away, those customers that are just kicking the tires, they tend to ask you to do a lot of homework.
Borje: I agree; you have these types of customers. The fact that we have been able to get design wins before we have silicon is because we have a 100% deterministic architecture. And that actually means you don't need to performance-optimize your code. Wire speed is guaranteed in a 100% deterministic architecture, and that is really a key selling argument for us.
Morris: I completely agree that determinism is one of the big selling points. In addition to the deterministic architecture, we think the programmer shouldn't need to understand what hardware the software is running on. So our big focus has been on not having to have the end user understand the minutiae of the architecture, and that's how we can get to a point where people can make a buy decision without having to be educated to such a high level.
Roy: "Determinism" is kind of an interesting thing because it's a very vague term. Anything can be made to break if you don't map it correctly, and it's not deterministic after that. You can make a general-purpose CPU deterministic, providing you with the required line-rate throughput if you stay within budget given a particular set of parameters. It really depends on the mapping and working within processor resource constraints. The amount of bandwidth you have in your interconnect structures and the amount of competition for resources you have in your processors for a particular application will determine the line rate to which your processor will be deterministic. People will take different tacks in solving this problem, which is fine. I'm just saying that you have to be a little careful with that term.
Eklund: I need to respond to that. When you have a general-purpose processor, it has a certain instruction budget to use, and predicting the behavior is very difficult since it depends upon the different traffic scenarios. You then try to optimize the code in a profiler, but it's very difficult to predict the throughput, and in reality you can never get a deterministic architecture with a general-purpose processor.
Roy: I absolutely disagree—I'm sorry.
Nossik: Let me don my NPF hat here. I hear this all the time. One of the possible solutions to this problem and this argument could be a functional benchmark. This is something the NPF is working on and that would eliminate this discussion. Once all of the companies that have different views of what the performance is and how to measure it agree and vote on a meaningful benchmark, then it becomes a matter of just measuring the performance and everything's clear. Of course, benchmarks are not absolute in the sense that it very much depends on what it is you do. However, we all have gone through the same experience with microprocessors, and today, with all the well-known flaws in the benchmarks, we can objectively compare microprocessors.
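The "budget" that Roy and Eklund argue over can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the 1-GHz clock, the 40-byte minimum packet, and the per-path cycle counts are assumptions chosen for round numbers, not figures from any panelist. Only the 2.5-Gbit/sec and 10-Gbit/sec line rates echo numbers mentioned in this article.

```python
def cycles_per_packet(line_rate_bps: float, packet_bytes: int, clock_hz: float) -> float:
    """Clock cycles available per packet for one processing engine at a given line rate."""
    packets_per_second = line_rate_bps / (packet_bytes * 8)
    return clock_hz / packets_per_second


def meets_line_rate(path_costs: dict, budget: float) -> bool:
    """Line rate holds only if the most expensive code path fits the budget."""
    return max(path_costs.values()) <= budget


CLOCK_HZ = 1e9            # assumed 1-GHz engine
MIN_PACKET_BYTES = 40     # assumed worst-case (smallest) packet

budget_2g5 = cycles_per_packet(2.5e9, MIN_PACKET_BYTES, CLOCK_HZ)   # 128 cycles
budget_10g = cycles_per_packet(10e9, MIN_PACKET_BYTES, CLOCK_HZ)    # 32 cycles

# Hypothetical worst-case cycle counts for three code paths.
paths = {"ipv4_forward": 60, "mpls_pop": 45, "ipv6_tunnel": 150}

all_paths_ok = meets_line_rate(paths, budget_2g5)
without_tunnel_ok = meets_line_rate(
    {name: cost for name, cost in paths.items() if name != "ipv6_tunnel"}, budget_2g5
)
```

Under these assumptions, a single path that exceeds the budget is enough to break the line-rate guarantee, which is one way to read Roy's caution that determinism holds only within a particular set of parameters.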
Miller: I think there are different aspects. You've got to have a set of standard drivers with APIs that fold into people's existing systems. So you reduce that barrier. Typically, that's more of the control plane. Then you've got to deal with the data plane. You need a set of example applications that you can run. That's traditional. Then I think the third piece that we're seeing start to evolve is the simulators—and the more accurate they are, the better they address code development and performance tuning. So I think that each of the products that we produce in terms of silicon has to have these other pieces built around it. And that's really the total solution, not just the piece of silicon.
Cook: It's very key to have a platform environment running protocol stacks, just giving them an example of how you partition your control plane software versus your data plane software; 99% of the time, they're not going to be using the control plane software you're using, but that's okay. They can see how you do it and provide the appropriate API library function calls and see what those calls look like when they move through the control plane.
Nossik: I actually believe that the control plane software is key. Data plane software is something that should not concern the customer; the less the customer deals with it, the better. There are solutions that don't have any data plane software at all. You just configure the machinery that rolls the packets through without any code being executed at runtime. So all the software does is configure the machinery, and that's the control plane. Actually, that's an ideal solution from the point of view of clock rate—the economy of clock cycles. Just configure your policy in such a way that it takes care of a lot of different protocols. So at the runtime, the machinery just switches states, performing a very limited number of instructions. The Network Processing Forum, by the way, in its standards does not look at forwarding plane software at all. It's only control plane APIs that are being standardized.
Eklund: Which is good because you can't define a standard low-level API. Compare our architecture, for instance, with a RISC-based one. To configure the architecture, they're totally different.
Morris: You're basically pre-computing all the different events and outcomes. Whereas the other approach is to figure it out at runtime. And there's a tradeoff. It depends on how elaborate all those different permutations are going to be.
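The tradeoff Morris describes can be sketched in a few lines. This is a toy model, not any vendor's design: the policy format, the action names, and the function names are all hypothetical. One function evaluates the rules for every packet at runtime; the other evaluates them once, in a control-plane step, for every possible input, so that the per-packet work reduces to a single table lookup.

```python
def decide_at_runtime(ip_version: int, protocol: int, policy: list) -> str:
    """Walk the rules in order for every packet; the first match wins."""
    for rule in policy:
        version_ok = rule["version"] in (None, ip_version)    # None means "any"
        protocol_ok = rule["protocol"] in (None, protocol)
        if version_ok and protocol_ok:
            return rule["action"]
    return "drop"


def precompute_table(policy: list, versions=(4, 6), protocols=range(256)) -> dict:
    """Control-plane step: evaluate the policy once for every possible input."""
    return {
        (version, protocol): decide_at_runtime(version, protocol, policy)
        for version in versions
        for protocol in protocols
    }


# Hypothetical policy; protocol 6 is TCP and 17 is UDP.
POLICY = [
    {"version": 4, "protocol": 6, "action": "forward"},
    {"version": 6, "protocol": None, "action": "tunnel"},
    {"version": None, "protocol": 17, "action": "police"},
]

TABLE = precompute_table(POLICY)

# Data-plane step: no rule evaluation, just a lookup.
action = TABLE[(4, 17)]
```

Here the table has only 512 entries. Each additional field in the lookup key multiplies the number of entries, which is the growth in permutations that Morris warns about.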
Roy: So let's move to the data plane side of it here. So the issue again is how good are your tools; is your simulator really accurate; is it possible for me to program in a high-level language; if that's the case, how easy is it for me to architect my program and my application? Are the tools standardized enough—are they so esoteric that it takes a long time for me to come up to speed on them? Those issues are really important, because at the end of the day, they should not need to have PhDs sit down and try to figure out how to program. It has to be very simple. It could be standard C, and then it's just like programming a straight processor.
Eklund: It sounds very good having a standard C language and everything. But you have to remember C was designed for von Neumann machines, which is a uniprocessor environment. What we're essentially talking about now is multiprocessor architectures, which means, for instance, once the customer has coded the first prototype, then he's still got many, many hours of optimizing the code, which you do in assembly language. In the data plane application, Layer 2 to Layer 4, where you optimize for performance, you always need to go down to assembly language to really squeeze out the last bits. You need to utilize at least 80-90% of your chip resources, and if you don't utilize enough of them, you can never compete with a fixed-function ASIC. When you do profiling of the code to resolve race conditions and the data dependencies, that's the most time-consuming part of developing the data plane code. The way you overcome that is to go with a deterministic architecture. Then you have minimized the trickiest part of developing the code, which will simplify the tools.
Roy: It depends. I guess what we're trying to say is, given my personal experience with customers and the complexity of the algorithms we have encountered in the data pipeline, it is not just straightforward DiffServ-type applications that we are dealing with; there's much more to these applications. These complex multiservice applications have large code bases with numerous decision paths, each of which has to run at line rate. So that's why we have tried to focus on tools and methodologies that enable our users to develop complex code bases in a high-level language such as C and still maintain line-rate performance. There will be cases where you may have to go in and tweak some assembly, but we need to keep that to a minimum. In general, one should be able to develop 80% of the code in C. At the end of the day, it really boils down to whether your architecture has enough horsepower and bandwidth and whether your tools are efficient enough to enable high-level programming methodologies and still provide deterministic line-rate performance. It is absolutely possible to develop such technologies, and we've done it.
Eklund: That means that your chip would be at least 30-40% bigger than my chip.
Roy: No, I disagree. So now we're getting into great detail, and unless you go in and start counting gates, which we're not going to do...
Riordan: From our perspective, it's because—since I represent the general-purpose network processor here—whenever we go to a customer, what we hear the software and even the hardware people say is that I'm going to use your processor up to the point where it won't work anymore. And it's for all these reasons, right? They just want to run generic C. They've got a billion lines of code to run—and that's why they keep telling us to build it faster and faster, because they're trying to move up that bar.
Cook: That's the path of least resistance, right? They don't want to partition this code into control plane and session plane and then port that thing and the data plane stuff down to the new hardware. Again, it comes back to the software of the customer; it always comes back to the software with these guys. And you can have the architecture discussions here, but I think our architecture is something between a Solidum architecture and an MMC architecture, where C is a familiar syntax, but the machine is built with respect to packets and doing things with packets, looking at arbitrary fields within packets and doing arbitrary edit operations such as overwrites and deletes, inserts, appends, pre-pends, these types of things with the packet, and having a hardware architecture that operates at that layer of abstraction, that is as close as possible to the layer of abstraction of software you can get. So the lower that kind of disconnect between the software abstraction layer and the hardware abstraction layer, the better the performance will be.
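The packet-level edit operations Cook lists (overwrites, deletes, inserts, appends, and prepends) can be mimicked in software to show the layer of abstraction he has in mind. The sketch below uses a Python bytearray as a stand-in for a packet buffer; the function names are hypothetical, and it says nothing about how any of the architectures he mentions implement these operations in hardware.

```python
def overwrite(pkt: bytearray, offset: int, data: bytes) -> None:
    """Replace len(data) bytes in place, starting at offset."""
    if offset < 0 or offset + len(data) > len(pkt):
        raise ValueError("overwrite runs past the end of the packet")
    pkt[offset:offset + len(data)] = data


def delete(pkt: bytearray, offset: int, length: int) -> None:
    """Remove length bytes starting at offset."""
    if offset < 0 or length < 0 or offset + length > len(pkt):
        raise ValueError("delete runs past the end of the packet")
    del pkt[offset:offset + length]


def insert(pkt: bytearray, offset: int, data: bytes) -> None:
    """Insert data before the byte at offset, shifting the rest back."""
    if offset < 0 or offset > len(pkt):
        raise ValueError("insert offset is outside the packet")
    pkt[offset:offset] = data


def append(pkt: bytearray, data: bytes) -> None:
    """Add data at the tail of the packet."""
    pkt.extend(data)


def prepend(pkt: bytearray, data: bytes) -> None:
    """Add data at the head of the packet."""
    pkt[0:0] = data


pkt = bytearray(b"HEADERpayload")
overwrite(pkt, 0, b"header")   # rewrite a field in place
delete(pkt, 0, 6)              # strip the old header
insert(pkt, 3, b"--")          # splice bytes into the middle
append(pkt, b"!")              # add a trailer
prepend(pkt, b"TAG:")          # push a new outer tag
```

In software each of these edits may copy or shift bytes; Cook's argument is that a machine built around such primitives narrows the gap between what the programmer writes and what the hardware does.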
Borje: I would take a little bit of an odd twist. Why haven't we really been more successful in replacing ASICs? I think the answer to that is that there is a programmability tax with the network processors. And we know that really to get the big, big design wins, for this industry to take off, we have to become as efficient as an ASIC approach and yet maintain the programmability. So how do we get to a point where we have that flexibility of NPUs but don't have the programmability tax? And I think that's the challenge and that's where the architectural evolution will have to take place.
As this roundtable discussion indicates, the ability to program, and even reprogram, network-processor units (NPUs) is often touted as a competitive differentiator. However, not every vendor agrees. Startup Teradiant Networks (San Jose, CA) says that NPUs should be "configurable" instead of programmable to meet future requirements for line-rate operation.
The company's TeraPacket "super-pipelined" configurable architecture serves as the foundation for several upcoming Multi-Service Packet Engines (MPEs) and Multi-Service Traffic Managers (MTMs). For example, the TN200 MPE will pair with the TN201 MTM to provide full-duplex 2 × 10-Gbit/sec performance with a power consumption of 18-20 W per 0.13-µm CMOS chip. The TN400 and TN401 will combine to provide 4 × 10-Gbit/sec half-duplex processing (or full-duplex with an additional two chips), while the TN100 and TN101 will address single-stream 10-Gbit/sec applications.
Most NPU architectures use tens if not hundreds of microprocessor cores as a foundation. Such an approach has proven unsuitable for high-speed line rates, according to Teradiant chief executive Satchit Jain. "These programmable microprocessor cores have not delivered line rate," he asserts. "The programmable network processors are extremely hard to program, and they have not scaled. The worst problem is that these performance problems have been detected very late by the system vendors in the system development cycle."
The super-pipelined architecture obviates cores and their attendant programming requirements. The fact that microcode instructions do not have to be executed speeds the processor's performance and ensures line-rate performance, says Subhash Bal, Teradiant's vice president of marketing. However, the devices still provide flexibility through their ability to be configured via device drivers.
The company has targeted the devices at Layer 1-4 applications within multiservice switches and routers for core and metro/edge networks. Simulation tools are already available for the chips; samples and a reference board are expected to be available by the third quarter of this year—perhaps late second quarter, says Bal.
—Stephen Hardy