Using AI Chips to Design Better AI Chips

Chip design is as much an art as it is a technical feat. With all the possible layouts of logic and memory blocks and the wires that connect them, there are seemingly endless combinations of placement, and often, believe it or not, the best chip floor planners work from experience and hunches, and they can’t always give you a good answer as to why a particular layout works and others don’t.

The stakes are high in chip design, and researchers have tried to take the human guesswork out of this configuration task and move toward more optimal designs. The task also doesn’t go away as we move to chiplet designs, because all of those chiplets on a compute engine will need to be interconnected to act as a virtual monolithic chip, and all of the latencies and power consumption will need to be taken into account for such circuit complexes.

It’s a natural job, it seems, for AI techniques to help with chip design. This is something we talked about a few years ago with Google engineers, and the cloud giant continues to pursue it: In March, scientists at Google Research presented PRIME, a deep learning approach that leverages existing data, such as accelerator designs and their power and latency metrics, to create accelerator designs that are faster and smaller than chips designed using traditional tools.

“Perhaps the easiest way to use a database of previously designed accelerators for hardware design is to use supervised machine learning to train a prediction model that can predict the performance target of an accelerator given as input,” they wrote in a report. “Then one could potentially design new accelerators by optimizing the output performance of this learned model relative to the input accelerator design.”
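To make that idea concrete, here is a minimal sketch of the workflow the Google researchers describe, using made-up design knobs and a simple least-squares surrogate rather than anything from PRIME itself: fit a model on logged designs, then search new candidates by scoring them against that model.

```python
import numpy as np

# Hypothetical logged dataset: each row is an accelerator configuration
# (e.g. compute units, on-chip buffer size, bus width, all normalized)
# plus a measured objective (latency) from a prior design run.
rng = np.random.default_rng(0)
configs = rng.uniform(0.0, 1.0, size=(500, 3))
latency = 5.0 - 2.0 * configs[:, 0] + 1.5 * configs[:, 1] ** 2 + rng.normal(0, 0.1, 500)

# Supervised surrogate: fit a simple quadratic model latency ~ f(config).
features = np.hstack([configs, configs ** 2, np.ones((500, 1))])
weights, *_ = np.linalg.lstsq(features, latency, rcond=None)

def predict_latency(candidate: np.ndarray) -> float:
    """Score a candidate design with the learned surrogate."""
    feats = np.concatenate([candidate, candidate ** 2, [1.0]])
    return float(feats @ weights)

# "Design" a new accelerator by searching the surrogate for a low-latency config.
candidates = rng.uniform(0.0, 1.0, size=(10_000, 3))
best = min(candidates, key=predict_latency)
print("best predicted config:", best, "predicted latency:", predict_latency(best))
```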

This came a year after Google used a technique called reinforcement learning (RL) to design the layouts of its TPU AI accelerators. And it’s not just Google doing this: chip design tool makers like Synopsys and Cadence are both implementing AI techniques in their portfolios.

Now comes Nvidia with an approach that three of its deep learning scientists recently wrote “uses AI to design smaller, faster, and more efficient circuits to deliver more performance to each generation of chips. Large arrays of arithmetic circuits have enabled Nvidia GPUs to achieve unprecedented acceleration for AI, high-performance computing, and computer graphics, so improving the design of these arithmetic circuits would be key to improving the performance and efficiency of GPUs.”

The company took a run at RL with its own version, calling it PrefixRL and claiming the technique proved that AI can not only learn to design circuits from scratch, but that those circuits are smaller and faster than circuits designed using the latest EDA tools. Nvidia’s “Hopper” GPU architecture, introduced in March and expanding the company’s already extensive focus on AI, machine learning, and neural networks, contains nearly 13,000 instances of circuits designed using AI techniques.

In a six-page research paper on PrefixRL, the researchers said they focus on a class of arithmetic circuits called parallel prefix circuits, which encompasses circuits such as adders, incrementers, and encoders, all of which can be defined at a higher level as prefix graphs. Nvidia wanted to know if an AI agent could design good prefix graphs, adding that “the state space of all prefix graphs is large, O(2^n^n), and cannot be explored using brute force methods.”
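For readers who have not met prefix circuits before, the connection to adders is that the carry chain is a prefix scan over (generate, propagate) pairs under an associative operator, and different prefix graphs trade node count against depth. The short sketch below is illustrative only, not Nvidia’s code: it shows a serial graph and a divide-and-conquer graph computing the same scan.

```python
# The carry chain of an adder is a prefix scan over (generate, propagate)
# pairs with an associative operator. Different prefix graphs compute the
# same scan with different size and depth, which is the design space
# PrefixRL searches. Function names here are illustrative.

def combine(lower, upper):
    """Associative operator on (generate, propagate) pairs."""
    g_l, p_l = lower
    g_u, p_u = upper
    return (g_u or (p_u and g_l), p_u and p_l)

def ripple_prefix(pairs):
    """Serial prefix graph: minimal node count, depth O(n)."""
    out, acc = [pairs[0]], pairs[0]
    for pair in pairs[1:]:
        acc = combine(acc, pair)
        out.append(acc)
    return out

def sklansky_prefix(pairs):
    """Divide-and-conquer prefix graph: more nodes, depth O(log n)."""
    n = len(pairs)
    if n == 1:
        return list(pairs)
    mid = n // 2
    left = sklansky_prefix(pairs[:mid])
    right = sklansky_prefix(pairs[mid:])
    # Fan the last left-half result into every right-half result.
    return left + [combine(left[-1], r) for r in right]

# Both graphs produce identical prefix (carry) values for a 4-bit add.
a, b = 0b1011, 0b0110
bits = [((a >> i) & 1, (b >> i) & 1) for i in range(4)]
pairs = [(bool(ai & bi), bool(ai ^ bi)) for ai, bi in bits]
assert ripple_prefix(pairs) == sklansky_prefix(pairs)
```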

“A prefix graph is converted into a circuit with wires and logic gates using a circuit generator,” they wrote. “These generated circuits are then optimized by a physical synthesis tool using physical synthesis optimizations such as gate sizing, duplication and buffer insertion.”
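As a rough illustration of what such a generator does (this is a toy, not the tool Nvidia uses), each prefix node can be lowered to a handful of named gates in a flat netlist, which a physical synthesis tool would then optimize:

```python
# Toy "circuit generator": lower a prefix graph, given as a list of
# (lo, hi) combine nodes, into a flat netlist of named AND/OR gates.
# Real generators emit technology-mapped netlists; this is only a sketch.

def lower_prefix_graph(nodes):
    """nodes: list of (lo, hi) pairs meaning 'combine prefix[lo] into position hi'."""
    netlist = []
    for lo, hi in nodes:
        # new_g[hi] = g[hi] OR (p[hi] AND g[lo])
        netlist.append(("AND", f"t_{lo}_{hi}", f"p{hi}", f"g{lo}"))
        netlist.append(("OR",  f"g{hi}",       f"g{hi}", f"t_{lo}_{hi}"))
        # new_p[hi] = p[hi] AND p[lo]
        netlist.append(("AND", f"p{hi}",       f"p{hi}", f"p{lo}"))
    return netlist

# Serial (ripple) prefix graph for a 4-bit adder: 3 combine nodes, 9 gates.
print(lower_prefix_graph([(0, 1), (1, 2), (2, 3)]))
```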

Arithmetic circuits are built from logic gates such as NAND, NOR, and XOR and many wires. They need to be small so more can fit on a chip, fast so they don’t add delays that hurt performance, and they need to consume as little power as possible. With PrefixRL, the researchers focused on circuit size and speed (to reduce delay), which they believe tend to be competing properties. The challenge was to find designs that most effectively navigate the trade-offs between the two. “Simply put, we want the minimum-area circuit at every delay,” they wrote.
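That “minimum area at every delay” requirement is a Pareto-frontier condition over the two competing objectives. A minimal sketch, with made-up circuit numbers, of filtering candidate designs down to that frontier:

```python
# Keep only circuits that are not dominated, i.e. no other circuit has
# both lower-or-equal area and lower-or-equal delay. The candidates and
# their numbers are invented for illustration.

def pareto_frontier(circuits):
    frontier = []
    for name, area, delay in circuits:
        dominated = any(a <= area and d <= delay and (a, d) != (area, delay)
                        for _, a, d in circuits)
        if not dominated:
            frontier.append((name, area, delay))
    return sorted(frontier, key=lambda c: c[2])  # order by delay

candidates = [("A", 120, 9.0), ("B", 100, 10.5), ("C", 95, 12.0),
              ("D", 130, 9.0), ("E", 110, 10.5)]
print(pareto_frontier(candidates))  # A dominates D, B dominates E
```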

“The final circuit properties (delay, area, and power) do not translate directly from the original prefix graph properties, such as level and number of nodes, due to these physical synthesis optimizations,” the researchers wrote. “That is why the AI agent learns to design prefix graphs but optimizes the properties of the final circuit generated from the prefix graph. We pose arithmetic circuit design as a reinforcement learning (RL) task, where we train an agent to optimize the area and delay properties of arithmetic circuits. For prefix circuits, we design an environment where the RL agent can add or remove a node from the prefix graph.”

The design process then legalizes the prefix graph to ensure that it always maintains a correct prefix sum computation, and a circuit is then generated from the legalized prefix graph. A physical synthesis tool then optimizes the circuit, and the area and delay properties of the circuit are measured. Throughout this process, the RL agent builds the prefix graph through a series of steps by adding or removing nodes.
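Pulling those pieces together, the loop looks roughly like the sketch below. The class and method names are hypothetical, the legalization rule is a placeholder, the reward is a placeholder scalarization of the two objectives, and the synthesis step is a stub standing in for a real physical synthesis run, which is where the actual area and delay measurements come from.

```python
class PrefixGraphEnv:
    """Schematic RL environment: actions add or remove a prefix-graph node."""

    def __init__(self, width: int):
        self.width = width
        # Node (lo, hi) combines prefix[lo] into position hi; start serial.
        self.nodes = {(i - 1, i) for i in range(1, width)}

    def step(self, action):
        kind, node = action                  # ("add", (lo, hi)) or ("remove", (lo, hi))
        if kind == "add":
            self.nodes.add(node)
        else:
            self.nodes.discard(node)
        self._legalize()
        area, delay = self._synthesize_stub()
        reward = -(area + delay)             # placeholder scalarization of area and delay
        return self.nodes, reward

    def _legalize(self):
        # Placeholder rule: every bit position must still be produced by some node.
        for i in range(1, self.width):
            if not any(hi == i for _, hi in self.nodes):
                self.nodes.add((i - 1, i))

    def _synthesize_stub(self):
        # Stand-in for the physical synthesis tool: node count stands in for
        # area, and a crude inverse-size estimate stands in for delay.
        return len(self.nodes), self.width / max(len(self.nodes), 1) + 1

env = PrefixGraphEnv(width=8)
print(env.step(("add", (0, 3))))             # (new graph, placeholder reward)
```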

Nvidia researchers used a fully convolutional neural network and the Q-learning algorithm, an RL algorithm, for their work. The algorithm trained the circuit design agent using a grid representation for prefix graphs, with each grid element mapped to a prefix node. The grid representation was used both at the input and the output of the Q-network, with each element of the output grid representing the Q-values for adding or removing a node, and the neural network predicted separate Q-values for the area and delay properties.
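A minimal PyTorch sketch of that shape, with hypothetical channel counts and grid sizes, looks like this: a fully convolutional body maps the grid encoding of the prefix graph to per-cell Q-values, with separate output channels for the add/remove actions and the area/delay objectives.

```python
import torch
import torch.nn as nn

class PrefixQNet(nn.Module):
    def __init__(self, in_channels: int = 4, hidden: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # 4 output channels per grid cell: {add, remove} x {area, delay}.
        self.head = nn.Conv2d(hidden, 4, kernel_size=1)

    def forward(self, grid: torch.Tensor) -> torch.Tensor:
        # grid: (batch, in_channels, n, n) encoding of the prefix graph.
        return self.head(self.body(grid))    # (batch, 4, n, n) Q-values

net = PrefixQNet()
q = net(torch.zeros(1, 4, 32, 32))           # e.g. a 32x32 grid for a 32b circuit
print(q.shape)                               # torch.Size([1, 4, 32, 32])
```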

The computational demands of running PrefixRL were significant. The physical simulation required 256 CPUs for each GPU, and training consumed over 32,000 GPU hours, according to the researchers. To meet those demands, Nvidia created a distributed reinforcement learning platform called “Raptor” that leverages Nvidia hardware specifically for this kind of reinforcement learning.
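The pattern those numbers imply, many CPU workers computing slow synthesis rewards that feed a learner, can be sketched generically in Python. This is the general actor/learner shape, not Raptor itself, and the synthesis call here is a stub.

```python
from concurrent.futures import ProcessPoolExecutor
import random
import time

def synthesize(graph_id: int) -> tuple[int, float, float]:
    """Stand-in for a physical synthesis run; the real one takes minutes of CPU time."""
    time.sleep(0.01)
    return graph_id, random.uniform(90, 130), random.uniform(8, 12)  # (id, area, delay)

if __name__ == "__main__":
    replay_buffer = []
    # The paper pairs 256 CPUs with each GPU; 8 workers here for illustration.
    with ProcessPoolExecutor(max_workers=8) as pool:
        for graph_id, area, delay in pool.map(synthesize, range(64)):
            replay_buffer.append((graph_id, area, delay))  # learner would train from this
    print(len(replay_buffer), "synthesized rewards collected")
```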

“Raptor has several features that improve scalability and training speed, such as job scheduling, custom networking, and GPU-aware data structures,” they wrote. “In the context of PrefixRL, Raptor makes it possible to distribute work across a combination of CPUs, GPUs, and Spot Instances. Networking in this reinforcement learning application is diverse and benefits from… Raptor’s ability to switch between NCCL [Nvidia Collective Communications Library] for point-to-point transfer to transfer model parameters directly from the learner GPU to an inference GPU.”

The network also benefited from a Redis store used for asynchronous and smaller messages like rewards and statistics, and a JIT-compiled RPC for high-volume, low-latency requests like experience data uploads. Raptor also includes GPU-aware data structures for tasks such as parallel batch processing of data and prefetching it to the GPU.
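The Redis side of that pattern can be sketched with the redis-py client; the key name and message fields are made up, not Raptor’s actual schema. Actors push small reward and stat messages, and the learner drains them asynchronously.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# Actor side: fire-and-forget a small reward/stats message.
r.rpush("prefixrl:rewards", json.dumps({"env": 7, "area": 112.0, "delay": 9.4}))

# Learner side: drain whatever has arrived so far without blocking on any actor.
while (msg := r.lpop("prefixrl:rewards")) is not None:
    stats = json.loads(msg)
    print("got reward message:", stats)
```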

The researchers said the RL agent was able to design circuits purely from feedback on the properties of the synthesized circuits, with the reported results using 64b adder circuits designed by PrefixRL. The best of these adders offered 25 percent less area than the EDA tool’s adder at the same delay.

“To our knowledge, this is the first method using a deep reinforcement learning agent to design arithmetic circuits,” the researchers wrote. “We hope this method can be a model for applying AI to real-world circuit design problems: building action spaces, state representations, and RL agent models, optimizing for multiple competing goals, and overcoming slow reward computation processes such as physical synthesis.”
