Dr. Dobb's Digest December 2009

The One Instruction Wonder

A single instruction CPU that does everything via a move instruction

By Al Williams

Al Williams straddles the hardware and software line and has authored books ranging from MFC Black Book to Build Your Own Printed Circuit Board.. He can be contacted at al.williams@awce.com.


One of the original "Star Trek" television episodes involves patients at a facility for the criminally insane. One of the inmates quotes some lines from Shakespeare and announces that she wrote it yesterday. Another character tells her that it had been written by the bard in the past. The woman replies, "Which does not alter the fact that I wrote it again yesterday!"

I suppose in the computer industry it is particularly difficult these days to have a truly original idea, even if you arrive at your idea independent of prior work. I had that experience several years ago. I had just finished a 16-bit CPU design based loosely on Caxton Foster's Blue machine in his excellent (albeit dated) book Computer Architecture. (Yes, I do have strange hobbies.) Like Foster's original, my machine has what I think of as a 1970's minicomputer architecture -- its very similar to a DEC or DG or HP machine from that era. I was contemplating starting a new project using some sort of RISC (Reduced Instruction Set Computer) architecture. RISC's advantages are well known. Simplifying the CPU core by reducing the complexity of the instruction set allows faster speeds, more registers, and pipelining to provide the appearance of single cycle execution. RISC has been so popular that even your PC today probably uses a RISC core that is emulating a non-RISC processor!

So I thought if the "R" in RISC is for reduced, how far can you reduce the instruction set of a computer and still make it do useful work? I realized that you could, in fact, make a perfectly functional computer with only a single simple instruction. I drew up several instruction set architectures and became very enamored of the idea. However, a search on the Internet showed that I wasn't the first person to have this realization. Although they aren't extremely common, OISC (short for "One Instruction Set Computers") have been proposed and built before. The type I had designed was known as a TTA (Transfer Triggered Architecture). There have been a few academic designs using this architecture as well as at least one commercial microcontroller (the Maxim MAXQ).

However, I still did not see some of the design features I wanted to explore in any existing implementation so I pressed ahead with the design of my CPU, the One-Der (with apologies to Tom Hanks). In this article, I describe One-Der -- a 32-bit TTA CPU that operates at roughly 10 MIPS. I also show you how and why you might actually want to use it. The complete source code and related files are available here. A demonstration of how you program One-Der is presented in this video.

What's the One Instruction?

When I tell this story in person, people are usually squirming with the inevitable question: What's the one instruction? It turns out there's several ways to construct a single instruction CPU, but the method I had stumbled on does everything via a move instruction (hence the name, "Transfer Triggered Architecture").

In a way, it is almost object oriented. Instead of normal instructions, you have functional units that expose addressable registers. Moving information into a "register" causes the functional unit to change. So as a simple example, consider a math functional unit -- what would pass for the accumulator on a normal processor. The functional unit might expose three registers. Register 0 would allow you to read and write data to the accumulator. Writing to register 1 would cause the unit to add the value you wrote to the accumulator while register 2 would subtract the value.

That's an overly simple example, but by building complex functional units you can build up an "instruction set" that does anything you want. Meanwhile, the basic CPU architecture remains very simple and stable. It simply orchestrates gating the contents of a register on the bus and then gating the bus into another register.

Fundamental Architecture and Advantages

In practice, One-Der is a bit more complex, but not much. The main CPU is about 140 lines of Verilog code -- less if you deduct the comments. However, this main portion is really just defining the bus between functional units. All the real work is in one of the functional units -- even the program memory and the program counter are functional units.

The main bus has two other functions that augment the transfer of data between functional units. First, it allows for conditional transfers. That is, based on a set of conditional tests, a transfer may not be allowed and the bus cycle becomes a NOP instruction.

The second augmentation is the ability to load constants. The program counter functional unit has a special operation: load the next 32-bit word into a special constant register (which is also part of the PC's functional unit). Because many constants don't require the full 32-bit word, there is also a way to use illegal conditional instructions to load shorter (28-bit) constants in a single 32-bit bus cycle (using the same constant register). In addition, one or more "constant" functional units can make common constants (like zero, for example) easily accessible to any instruction.

This points out one of the major advantages of the One-Der architecture. Need a special set of constants that I don't use? It is trivial to add a new functional unit with your constant. A practical functional unit might be more difficult, but the interface of your logic to the CPU is trivial. Each functional unit has an 8-bit address, so you simply need to create a module for your functional unit and instantiate it in the CPU at an unused address. By convention, all the functional units have the same module signature to a point and then custom parameters follow the standard ones. It is often simpler to test individual functional units separately before integrating them with the main CPU, another advantage.

For example, Listing 1 shows the roughly 70 lines of Verilog that make up the current constant unit. The address is one of the parameters passed in when you instantiate the module.


'timescale 1ns / 1ps 'default_nettype none /********************************************************************** One-Der Copyright 2006, 2007, 2008, 2009 by Al Williams (alw@al-williams.com).

This file is part of One-Der.

One-Der is free software: you can redistribute it and/or modify it under the terms of the GNU General Public Licenses as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

One-Der is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY: without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with One-Der (see LICENSE.TXT). If not, see http://www.gnu.org/licenses/.

If a non-GPL license is desired, contact the author.

This is the FU that contains 16 useful constants.

***********************************************************************/ module FConstant(input wire xreset, // reset input wire clk0, // clock input wire invclk, input wire phase, inout wire [31:0] dbus, // data input wire [12:0] srcadd, // source add input wire [12:0] dstadd, // dest add (not used) input wire [12:0] cfg, // configured address output wire dta, // data transfer ack input wire dtain); // dta input wire

wire oe; // tri state control wire [3:0] subunit; // match addresses assign oe=srcadd[7:0]==cfg[7:0]&& srcadd[12]==cfg[12]; assign subunit=srcadd[11:8]; assign dta=1'b1; reg [31:0] const; assign dbus=oe?const:32'bz;

// set up the constants always @(subunit) begin case (subunit) 4'b0000 : const=32'b0; 4'b0001 : const=32'b1; 4'b0010 : const=32'h2; 4'b0011 : const=32'h80000000; 4'b0100 : const=32'h4; 4'b0101 : const=32'hff; 4'b0110 : const=32'hff00; 4'b0111 : const=32'hff0000; 4'b1000 : const=32'hff000000; 4'b1001 : const=32'h00000010; 4'b1010 : const=32'ha; 4'b1011 : const=32'hf; 4'b1100 : const=32'hf0; 4'b1101 : const=32'h80; 4'b1110 : const=32'haaaaaaaa; 4'b1111 : const=32'hffffffff; default: const=32'h0; endcase end

endmodule

Listing 1

To add a new set of constants, simply copy and paste this module to a new Verilog file, rename the module, and instantiate it with a new address in the main CPU file (oneder.v) and you can put 16 more constants in your new custom instruction set. For example, here's the line that includes the existing unit:


// Constant source (unit 0) FConstant constant(xreset, clk0,invclk,phase, dbus, src, dst, 13'h0, dta, dta);

Constants are a simple example, but perhaps your program would benefit from three accumulators or four different loop counters. No problem. Just instantiate the existing module multiple times, each with a different address. It is just as easy to remove modules to save space on the FPGA if you find your program never uses a particular module.

While it is handy to be able to add, multiply, and subtract functional units by hand, a ripe area of research is using this architectural style in reconfigurable computing applications. A compiler could determine the optimum number and types of functional units and literally build a custom CPU on the fly optimized for the task at hand.

There is one other feature that functional units can use to manipulate the bus. The bus manages a line known as dta (data transfer acknowledge). This bit allows either the sender or the reciever to stall the bus if necessary. This can be useful for slow external memory access for example. The FIO functional unit takes advantage of the bus stall to implement a delay subfunction.

In addition to functional units, One-Der has a set of registers that can act as the source or destination of a transfer. These act like registers on any garden-variety CPU; they simply store values until needed.

Instruction Set Basics

As you might expect, the instruction set is very simple. The way it is laid out it is very easy to read and write the hex op codes directly (although I have a cross assembler I'll tell you about shortly). Figure 1 shows the basic format. Note that the functional units have an 8-bit address and a 4-bit subfunction code.


Figure 1: Instruction Format

So an instruction that doesn't use registers and has no conditions looks like:


00 sSS dDD

Where SS is the source unit, s is the source sub unit, and the D's represent the destination unit and sub unit. Constants are easy to identify also. The short form takes advantage of the fact that if the condition mask is zero, there's no meaningful reason to have the condition match bits set. So a short constant will have a 1 in the first position and you simply have to mask the top bit off. So 1000003F is the constant 3F hex (note that the constants are sign extended to 32-bits so 18000000 is the constant F8000000). Full 32-bit constants start with 00000F01 followed by the full 32-bit constant (this requires two cycles to execute and does not respect condition codes).

Of course, the real meat to this processor isn't the format of the instructions, it is the possible values for the source and destination. Table 1 shows the standard functional units and their addresses. Note the constant functional unit is unit zero which allows for some mnemonic subfunctions. So 00000001 (source unit 0, subunit 0) loads a zero into the program counter while 00A00001 loads a constant A. Of course, that only goes so far and some of the subfunctions are just arbitrarily assigned.

Table 1. Default Functional Units

Although every instruction is technically a move, you might prefer to give some mnemonic aliases to special moves (my cross assembler does this). For example, moving something into the program counter is a jump. A subroutine return involves moving the top of the stack to the program counter.

CPU Tools

You might be wondering: This all sounds interesting, but how can I implement a custom CPU? The answer is to use a Field Programmable Gate Array (FPGA) device. (For more on programmable logic devices like FPGAs, see my article Programmable Logic and Hardware).

In a nutshell, an FPGA implements a large number of logic cells and a programmable way to interconnect them. You can create your design by drawing schematics (only feasible for small designs) or by using a hardware description language (like Verilog, the language used to implement One-Der). A program running on a PC takes your description and converts it into a configuration of the FPGA's logic cells. The result is downloaded to the FPGA by the PC's parallel port. You can also download the result to a EEPROM which can automatically configure the FPGA on startup once you are happy with your design.

Working with FPGAs used to be very expensive. Today, vendors offer development boards that work with free software for well under US$100. My prototype of the One-Der architecture works on a development board available from Digilent (see Resources) that has the equivalent of about 1,000,000 logic gates (it costs a bit more than $100, but was well under $200). In addition to the Xilinx Spartan 3 FPGA chip and assorted support functions, the board also contains some memory devices along with a handful of I/O devices such as switches, LEDs, and a serial port. Of course, none of these things do anything unless your logic makes them do something. The One-Der prototype can read the switches, light the LEDs, and implements a serial port you can use in your programs. It also provides 16 bits of general-purpose outputs and another 16 bits of input available on the board's edge connector.

Verilog

If you aren't familiar with Verilog, you'll notice it strongly resembles C. However, the way it is handled is very different since the result isn't object code, its a hardware configuration. Consider the following C snippet:


x=a|b|~c; y=a&b;

This actually generates code that sets x at some discrete time, then sets y a short time later. After this code executes, x and y won't change unless the program reexecutes this code, or executes some other code that modifies x and y.

In Verilog you might write:




   assign x=a|b|~c;

   assign y=a&b;

This will create several logic gates; one that drives the x output and one that drives the y output. These gates will constantly compute the equations in hardware and the value of x (or y) at any instant will reflect the current values of the inputs. There's no sequence of computations implied by the line order. The resulting logic is all in parallel.

You can specify gates directly or use other methods, but I find them all clunky compared to using assign. For completeness though, here's two other ways you could drive the x output equivalently:


or(x,a,b,~c);

always @(a,b,c) x=a|b|~c;

All of these build asynchronous logic in Verilog. You can also create synchronous logic, which makes up the bulk of the CPU. With synchronous logic, everything is referenced to a clock pulse. This way outputs have until the next clock pulse to "settle" and you can build very complex logic without worrying too much about the different delays encountered by the various signals.

For example, consider generating parity on a serial bit stream. The idea is to use a flip flop to remember the current parity value. When the serial data arrives (at the rising edge of the clock) the new parity will be the old parity exclusive-ored with the data bit. Here's a Verilog module to implement this logic:


module serialparity(input clk, input reset, input data, output reg even, output odd)

assign odd=~even; always @(posedge clk) if (reset) even=1'b0; else even=even^data; endmodule

This defines a module (like a parallel subroutine) that "executes" on each positive clock edge. If you read C, you can probably puzzle out most of the syntax. The 1'b0 means a literal zero that is 1-bit long (the b is for binary). In English, the module takes each data bit and exclusive or's it with the previous parity value to form the next parity value.

Of course, a complete Verilog tutorial is beyond the scope of this article, but there are plenty of resources on the Internet. The CPU is relatively simple, but it probably isn't the ideal first Verilog project. On the other hand, adding functional units and manipulating existing ones is well within the reach of the Verilog neophyte.

Tool Chain

Assuming you have a bunch of Verilog, how do you get it to the hardware? First, most designs will require more than just straight Verilog. You also need constraints which tell the software how to handle certain parts of the Verilog. For example, the parts of One-Der that read the push buttons or drive the LEDs must connect to certain external pins on the FPGA. Without constraints, the translator software will assign them to arbitrary pins which could cause results ranging from things not working, to physically damaging the device by shorting out something on the board.

There are several ways to include constraints. You can put special comments in your Verilog, but I prefer to use a constraint file. The Xilinx tools even include a GUI editor for constraint files to simplify the task.

There are other files you may need. The Xilinx tools have a variety of wizards and code generators that allow you to use GUIs to generate special functions (for example, memory or clock circuits). These files will also be a part of your project.

Xilinx provides several versions of their ISE software which is what you'll use to convert your project into a file that you can download to the FPGA. One version -- WebPack -- is available for free and is all you'll need for many projects. The tool performs several steps to arrive at the final product. First, it does synthesis, mapping your logic descriptions into primitive logic. When this is successful, that logic is mapped to the available FPGA structures. Finally, the structures are arranged on the FPGA and the tool selects how the structures will be interconnected (a process called place and route). Each logic block and interconnect has a certain delay associated with it, so the tool may have to try many different possibilities to arrive at an arrangement that will meet your desired clock speed (the clock speed is specified as a constraint and arriving at a workable solution is known as "meeting timing." For a complex design like One-Der, these steps can take quite some time, even on a fast PC. Once place and route completes, the tools generate a bit file -- an image that you can download to the FPGA using the Xilinx-provided tools (usually using a JTAG cable connected to a printer port or a USB port).

To map these operations to the typical software development flow, you can consider ISE a compiler and linker. The synthesis is roughly analogous to compiling source code to assembly language and the mapping is a little like an assembler converting to binary codes. The place and route step is somewhat like linking everything together, although that analogy doesn't cover everything since a real linker can just put all the output objects in any order. The FPGA router has to use the limited interconnects to get everything that needs to communicate connected.

For smaller designs, you can also use ISE to simulate your work without downloading to the actual IC. In fact, I often do this with small parts of One-Der, but the entire CPU is so complex it is difficult (and slow) to simulate well so I usually try to debug directly on the hardware (although simulating an individual functional unit is relatively easy). Xilinx also provides a tool (but not for free) called ChipScope that allows you to perform basic logic analysis by building a dedicated logic analyzer into the FPGA alongside your design. The analyzer accepts commands and sends data to your PC via the FPGA's JTAG port (the same port you use to program the FPGA during development).

For creating One-Der programs I have a simple cross assembler that uses a combination of awk, and the system's C compiler to output binary in several formats. The mem2flash script (available here) lets you load binary from the assembler directly to the chip's on board flash without having to rebuild the processor just to get a new program loaded.

Road Map

The complete source code for One-Der is available here. Here's a road map to the source code and a summary of what you'll find in each one:

More About Units

There are many ways to solve any problem using the One-Der architecture. For example, conditional bus transfers allow you to make decisions. But functional units might use other methods. For example, the loop unit requires you to move a subunit into the program counter. The value of the subunit will vary depending on if the loop is complete or not.

Each unit receives a clock signal and the inverted clock (useful for the Xilinx stock memory which always looks for a rising edge). There is also an alternating phase signal that provides two distinct clock pulses (four clock edges, since the CPU works on both edges of the clock). The clock cycles have the following purposes:

The CPU uses program information in clock B for some units (mostly those requiring time to set up memory transfers) and in clock C for others. This isn't strictly necessary, but it helps the Xilinx tools perform better since there s more time for signals to propagate for most functional units.

Functional units can be simple or complex. If a functional unit needs more than one bus cycle you can easily stall the bus by asserting the dta line in the functional unit. You can implement bus stalls in several ways. For example, suppose you develop a pipelined multiplier functional unit that requires 64 clock cycles to operate. One approach would be to stall the bus when the final operand is sent to the unit. Or you might prefer to let the unit calculate and only stall the bus if the program attempts to read the result before the pipeline completes.

The Future

Listing 2 shows a simple monitor program that allows direct exploration of the One-Der architecture. Of course, you have to share resources with the monitor software. The monitor lets you manually key in small programs or manually set and read functional units.


;; Simple "monitor" for One-Der ;; Williams 1 March 2009 ;; assume default stack is ok ORG 0 ;; Registers used for fast subroutine calls ##define CRLF 60 ##define HEXO8 61 ##define USEND 62 ##define READHEX 63 ;; data registers ##define HEXNUM 59

;; Put something on the display so we know we are running LDRIQ 0x1234,FIO_DISP ;; init common subroutine calls LDRIQ crlf,R(CRLF) LDRIQ hexout8,R(HEXO8) LDRIQ uartsend,R(USEND) LDRIQ readhex,R(READHEX) ;; Print a string to the terminal (kill this for more room) CALLQ banner top: LDRIQ '?', FACC ; prompt CALLR USEND LDRIQ ' ', FACC ; note that call clobbers FIMM CALLR USEND CALLQ uartrx ; get a character CALLR USEND ; echo ;; we know about several commands ;; fu: ;; p x - print x ;; s xy - set y=x (note no spaces so s 0fe1fe for example) ;; c n y - set y=n (constant) ;; reg: ;; d r - print r ;; r n r - set r=n (constant) ;; f add data...<esc> - Write to flash (working) ;; g add - Call address ;; v add count - view flash add ;; you can backspace in a number as long as you haven't ;; entered a character <'0' yet ;; Vector for each command ;; check for p MOV FACC,FACC2 LDRIQ 'p',FACC_SUB LDIQ print MOVZ FIMMV,FPC ;; check for v MOV FACC2,FACC LDRIQ 'v',FACC_SUB LDIQ view MOVZ FIMMV,FPC ;; check for f MOV FACC2, FACC LDRIQ 'f',FACC_SUB LDIQ flash MOVZ FIMMV,FPC ;; check for s MOV FACC2, FACC LDRIQ 's',FACC_SUB LDIQ setfu MOVZ FIMMV,FPC ;; check for c MOV FACC2, FACC LDRIQ 'c', FACC_SUB LDIQ setfucon MOVZ FIMMV,FPC ;; check for d MOV FACC2, FACC LDRIQ 'd', FACC_SUB LDIQ dispreg MOVZ FIMMV,FPC ;; check for r MOV FACC2, FACC LDRIQ 'r', FACC_SUB LDIQ setreg MOVZ FIMMV,FPC ;; check for g MOV FACC2, FACC LDRIQ 'g', FACC_SUB LDIQ go MOVZ FIMMV,FPC ;; not found JMPQ top ;; View flash view: CALLR READHEX ; read address MOV FACC2, FPC_PGMRDADD CALLR READHEX ; read count ;; Set up loop MOV FACC2, FLOOP_IEND MOV FZERO, FLOOP_I MOV FONE, FLOOP_IINC LDRIQ viewloop, FLOOP_IADD viewloop: MOV FZERO, FPC_PGMRD ; read flash MOV FIMMV, FACC ; print CALLR HEXO8 LDRIQ ' ', FACC ; space between words CALLR USEND MOV FLOOP_IADD,FPC ; loop CALLR CRLF ; done JMPQ top

;; Call a user subroutine go: CALLR READHEX ; get address MOV FACC2,FPC_CALL ; do it JMPQ top ; done

;; Set a register to a constant setreg: CALLR READHEX ; get constant PUSH FACC2 CALLR READHEX ; get register # LDRIQ 0xFFF,FACC2_AND LDRIQ 0x1008000,FACC2_OR ; build instruction to exec JMPQ setfu0 ; from here out, same as setting a FU ;; printa register dispreg: CALLR READHEX ; which register? LDRIQ 12,FACC2_SHL LDRIQ 0xFFF000,FACC2_AND ; get legit part LDRIQ 0x2000002, FACC2_OR ; build instruction JMPQ dispexec ; from here out, same as printing FU

;; Set a Functional unit ;; Note this just sort of execs one instruction setfu: CALLR READHEX ; get src and dest (e.g., 0fe002 is switch->acc) JMPQ setfu0 ; from here out, same as set FU constant

setfucon: CALLR READHEX ; get constant PUSH FACC2 CALLR READHEX ; get destination LDRIQ 0xFFF,FACC2_AND ; build instruction LDRIQ 0x008000,FACC2_OR setfu0: ; execute instruction MOV FACC2,FPC_WDATA LDRIQ execute,FPC_WADD MOV FZERO,FPC_WRITE POP FACC2 CALLQ execute JMPQ top

;; Print an FU print: CALLR READHEX ; which FU? MOV FACC2,FACC ; print fu address:value CALLR HEXO8 LDRIQ ':',FACC CALLR USEND LDRIQ 12,FACC2_SHL LDRIQ 0xFFF000,FACC2_AND ; get legit part LDRIQ 0x002, FACC2_OR ; build instruction dispexec: MOV FACC2,FPC_WDATA ; execute instruction and print LDRIQ execute,FPC_WADD MOV FZERO,FPC_WRITE CALLQ execute CALLR HEXO8 CALLR CRLF JMPQ top

;; write to flash flash: CALLR READHEX ; Get address MOV FACC2,FPC_WADD flashloop: MOV FPC_WADD,FIO_DISP ; echo address to LED display CALLR READHEX ; get data value (FACC2 is value FACC is terminator) LDRIQ 0x1B,FACC_SUB ; test for Esc JMPZQ top MOV FACC2,FPC_WDATA ; write data MOV FZERO,FPC_WRITE JMPQ flashloop ; and loop ;; Several commands have to execute a dynamic instruction ;; So the monitor builds a command and places it here ;; and then calls execute execute: ; this is how we overwrite flash MOV FZERO,FZERO RET

;; Canned routines for UART, etc. ##include library.inc

;; print a stored banner banner: LDRIQ message,FPC_PGMRDADD bann1: MOV FZERO,FPC_PGMRD MOV FIMMV,FACC RETZ CALLR USEND JMPQ bann1 message: ; note stringpack crams text into words ;; while string outputs individual bytes STRING "1Mon" DATA 13 DATA 10 DATA 0 END

Listing 2

What's next for One-Der? There's an endless number of functional units that could be written, of course. In addition, I have plans to add time-based, external, and debugging interrupts. An external memory interface is also in the works. However, I'm close to shifting gears to think more about tools. A simple example would be a "wizard" style interface to select functional units and generate a custom instance of One-Der. I'm even more interested in working on compiler support to automatically select the optimum number and types of units. Perhaps one day, a One-Der savvy compiler would even take advantage of the ability to partially reconfigure the FPGA to change the functional unit distribution on the fly!

Even so, One-Der is imminently usable as it is. Unlike many other FPGA CPU cores, this one is very simple to customize even if you aren't an expert on its internals. Applications that can benefit from custom instruction in hardware -- things like digital signal processing, for example -- are ideal for One-Der since you can implement parts of your algorithm in hardware and then easily integrate those parts with the CPU.

Resources