Design of Very Deep Pipelined Multipliers for FPGAs

Alex Panato, Sandro Silva, Flávio Wagner, Marcelo Johann, Ricardo Reis, Sergio Bampi
Universidade Federal do Rio Grande do Sul - Instituto de Informática
Av. Bento Gonçalves, 9500, Bloco IV, Porto Alegre, RS, Brazil
e-mail: inf.ufrgs.br

Proceedings of the Design, Automation and Test in Europe Conference and Exhibition Designers' Forum (DATE'04) 1530-1591/04 $20.00 © 2004 IEEE

Abstract

This work investigates the use of very deep pipelines for implementing circuits in FPGAs, where each pipeline stage is limited to a single FPGA logic element (LE). The architecture and VHDL design of a parameterized integer array multiplier are presented, as well as an IEEE 754 compliant 32-bit floating-point multiplier. We show how to write VHDL cells that implement such an approach, and how the array multiplier architecture was adapted. Synthesis and simulation were performed for Altera Apex20KE devices, although the VHDL code should be portable to other devices. For this family, a 16 bit integer multiplier achieves a frequency of 266 MHz, while the floating point unit reaches 235 MHz, performing 235 MFLOPS in an FPGA. Additional cells are inserted to synchronize data, which imposes significant area penalties. This and other considerations for applying the technique in real designs are also addressed.

1. Introduction

Pipelines are widely used to improve the performance of digital circuits, since they provide a simple way of implementing parallelism from streams of sequential operations. As more stages are inserted in the pipeline, each stage becomes shorter, and ideally presents a smaller delay. So, the resulting circuit will exhibit bigger latency but higher sustained performance when the pipeline is fully utilized.

Theoretically, we can push the pipeline depth to the level of using a single gate between two registers. But usually, there is a compromise between the performance improvements obtained with increased pipeline depth and the penalties imposed by the additional memory elements inserted between the stages.

An 8-bit, full custom integer multiplier using pipeline stages of a single half adder is presented in [1]. The same methodology is used in [2] to implement a two's complement multiplier. Despite the high throughput achieved, these techniques need very complex manual work, since they employ full custom design, and they work only for specific technologies and bit widths, so they are not accessible to regular ASIC designs.

In this work we investigate a methodology to design the deepest pipelined circuits in FPGAs, starting from VHDL. FPGA devices have some specific characteristics that allow the designer to implement a "gate level" pipeline with optimal performance, the only remark being that the word gate here means any 4-input function with a single output. Longer stages will present twice the delay of a logic element and will use an outside connection. Shorter stages do not take advantage of the fact that the FPGA cell can implement any function with the same delay. The idea already appears in an Altera Application Brief [3], but we did not find descriptions and results of a methodology or implementation anywhere else.

Despite the fact that FPGA architectures differ from vendor to vendor, they still present a set of basic common features that allow building gate level pipelines in VHDL. By doing so, it is possible to map the design to many different devices and reuse it at the press of a button, in contrast to the full custom approach.

As a case study, we developed an architecture for integer multiplication that exploits the deepest pipelines, and then we built a floating point multiplication unit that is able to perform 235 MFLOPS in an Altera Apex20KE device. The integer architecture is parameterized to any number of bits, which increases its applicability. The floating-point unit, as presented in this paper, is restricted to single precision (32-bit), but can be easily extended to larger widths.

The rest of the paper is organized as follows. Section 2 explains how the technique is employed, and presents some information about the Altera APEX FPGA architecture. Section 3 describes the design of a parameterized array integer multiplier, along with its performance, which is compared to the default multipliers offered in the Altera library. In section 4 a complete IEEE 754 compliant floating-point multiplier is presented, which achieves not only a higher absolute frequency, but also a better performance/area tradeoff. Finally, section 5 presents some discussion and our concluding remarks.

2. Deep pipelines in FPGA

An FPGA device is generally an array of configurable basic blocks called logic cells (LCs) or logic elements (LEs). The second term is preferred and used throughout this paper to avoid misinterpretation. Abstracting implementation details, one can think of an LE as being composed of a look-up table (LUT), a register cell (latch, flip-flop), and a multiplexer, as shown in Fig. 1. The LUT can implement any truth table up to a given number of inputs, 4 in this case, although there are other simple gates outside the LUT that let the LE implement some functions with a larger number of inputs. The multiplexer selects the output from the LUT or the register as the output of the LE.

The pipeline depth of a circuit implemented in an FPGA can be pushed to the level of using the register cell in every single LE that is necessary. We may still try to put as much logic as we can inside a single LE, basically in the LUT part. But every time a combinational circuit does not fit into one LE, an additional stage is introduced. This guarantees that there is no path longer than a single LE between any two storage elements, which is the shortest possible path between them. Therefore, our main goal is to design circuits that have this property, and to check the performance limits that can be achieved in these devices.

Figure 1. A simplified FPGA Logic Element

In many cases, a synthesis process with a VHDL description as input can produce gate net-lists that are not exactly what the designer wanted. But to implement pipelines at the level of logic elements, this situation must not occur, and the designer must have full control of the resulting implementation. Fortunately, using the register cells is easier than one might think. It is sufficient to include a clause that depends on the clock signal to make the output assignment of small partial functions, just as is done in normal situations to describe a memory element. These partial functions will be contained in basic entities that may be instantiated to build the circuit under consideration.

Fig. 2 shows an example of a half adder description where the outputs are registered, so that it can be used as a stage of the deep pipeline. The synthesis generates two logic elements for this entity, one for the sum output S, and the other for the carry output COUT. Such an entity can be instantiated anywhere in a bigger design, and the result of its synthesis will still be the same pair of LEs.

This explains how to describe the basic functions into which the design must be decomposed. The architecture of the circuit must also be adapted to allow this decomposition, since not all possible logic functions fit into a single LE. The rule when defining basic entities is that for every output, there should be at most four inputs that may affect its result, because LUTs have 4 inputs.

    architecture arch_1 of mcell1 is
      signal soma, carry : std_logic;
    begin
      -- combinational half adder: sum and carry of PP and CIN
      soma  <= PP xor CIN;
      carry <= PP and CIN;
      -- register both outputs on the rising clock edge
      process begin
        wait until CLK = '1';
        S    <= soma;
        COUT <= carry;
      end process;
    end arch_1;

Figure 2. VHDL of a basic block

There is another point that must be observed and that significantly affects the design of the circuit. All the paths from the inputs to the outputs must pass through the same number of LEs. This is necessary to synchronize data in the pipeline.
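Figure 2 lists only the architecture body of the cell. As a minimal sketch of what a complete, self-contained design unit could look like, the block below adds a library clause and an entity declaration. The port list is an assumption inferred from the signal names used in Fig. 2, and the clocked process is written with `rising_edge` instead of `wait until`; neither detail is taken from the paper.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Half-adder pipeline stage with registered outputs.
-- Port list is an assumption: Fig. 2 shows only the architecture body.
entity mcell1 is
  port (
    CLK  : in  std_logic;
    PP   : in  std_logic;  -- partial result from the previous stage
    CIN  : in  std_logic;  -- carry from the previous stage
    S    : out std_logic;  -- registered sum
    COUT : out std_logic   -- registered carry
  );
end entity mcell1;

architecture arch_1 of mcell1 is
begin
  -- Each output is a function of two inputs followed by a register,
  -- so it stays within the four-input limit of one LUT.
  process (CLK)
  begin
    if rising_edge(CLK) then
      S    <= PP xor CIN;
      COUT <= PP and CIN;
    end if;
  end process;
end architecture arch_1;
```

Written this way, every output is assigned only inside the clocked branch, which is what lets a synthesis tool pair each small function with the register of the same logic element. Whether a given tool actually produces exactly two LEs should be checked in its fitting report.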

Whenever a path from an input to an output is shorter than the longest one, additional delay elements must be inserted to make the data flow at the same (logic) speed. These delay cells can be declared just as the other entities were. As might be expected, the insertion of LEs whose sole purpose is to produce delays greatly affects circuit area, increasing device usage and possibly limiting the application of this approach.

In our implementations, basic entities are grouped together using structural VHDL to form bigger building blocks. This hierarchy allows us to adapt the circuit structure to the FPGA architecture and to perform some placement optimizations that are described later on.

In the Apex20K family of devices, each set of 10 LEs is grouped in a structure called Logic Array Block (LAB), which in turn is grouped in sets of 10 or 16 to form the MegaLab structure. Communication between LEs in the same LAB is extremely fast, with minimal delays caused by interconnects. Connections between LEs in different LABs inside the same MegaLab have bigger delays, but are still very fast. Interconnection delays between LEs start to become critical when the LEs are placed in separate MegaLabs. There is still, however, a set of "fast interconnects" between neighbor MegaLabs that can be used to keep the signals with minimal delays. But they are limited, in the sense that they are restricted to neighbor MegaLabs and that there are only a few of these fast lines. Given that a MegaLab is not square, the system will run out of horizontal connections first.

3. Design of an Integer Multiplier

In order to test the gate level pipeline technique just explained, an integer array multiplier was first designed. The fastest types of multipliers are the parallel ones, such as the Wallace [5] and Dadda [6] architectures. However, these architectures do not have the regular structure already present in the Array multiplier [7]. As Fig. 3a shows, each line of logic cells computes basically one partial product, and could be a separate pipeline stage. A comparison between this approach and the parallel architectures for an ASIC is found in [8]. We chose the Array architecture as the starting point. By inspection one can see that delays are also propagated horizontally. Therefore, if partial products were used as pipeline stages, each stage would have to wait for the propagation of carry signals and would then have the delay of many LEs.

Figure 3. Classic (A) and adapted (B) carry propagation

So, in order to implement the gate level pipeline, there were two options. The first one was to consider the circuit as running diagonally from top-right to bottom-left, and the other was to adapt the carry propagation so that each carry is taken into account only at the next stage, as Fig. 3b shows. We chose the second approach, as we expected it to minimize the amount of additional delay elements to be inserted.

3.1 Multiplier architecture

Fig. 4 shows an example of a 4-bit integer multiplier where it is possible to observe the elaborated architecture. The next two sections explain the basic blocks and the intermediate level structure, respectively.

Figure 4. 4-bit array multiplier (rows mrow1, mrow_middle, mrow_pre_last and mrow_last; inputs A3..A0 and B0..B3; outputs S7..S0)

3.2 Basic building blocks

Six types of basic blocks are needed to implement the integer multiplier (see Fig. 5):
a) A half adder that adds results and carry, called mcell1;
b) A multiplication block that computes the sum of two single bit multiplications, called mcell2a;
c) A multiplication block that computes a single bit multiplication and runs the result into a full adder, called mcell2;
d) A block for propagation only, called mcell3;
e) A half adder without carry, called mcell4;
f) A double delay block, to propagate input B and results, called mcell5.

Figure 5. Basic building blocks (panels A to F)

3.3 Intermediate level structure

Instances of the basic blocks are used to assemble regular, intermediate level blocks that perform specific tasks in the multiplier (see Fig. 4). The four intermediate level structures are:

mrow1: Corresponds to the first stage of the pipeline and is composed of n mcell2a cells. Two bit multiplications are possible at this stage since it does not have a previous one, and bit multiplications are implemented with AND gates. The output of this structure is a vector of n+1 bits for results and a vector of n-1 bits for carry out.

mrow_middle: Corresponds to the next n-2 stages. Each stage uses n blocks of mcell2 and some blocks of mcell5 for propagation of previous results.

mrow_pre_last: The propagation of inputs A and B is no longer needed. So, the nth stage uses n-1 instances of mcell1 for carry adjust and n instances of mcell3 for propagating the previous result.

mrow_last: Implements the last n-1 stages of the pipeline, performing only carry adjust in the most significant bits. Each line uses one instance of mcell4, except the first, which uses an mcell3 instead, and instances of mcell1 and mcell3, starting with n-2 instances of mcell1 and n+1 instances of mcell3. In the following lines, the number of mcell1 instances is decreased by 1, and the number of mcell3 instances is increased by 1.

There are also adjacent structures for mrow1 and mrow_middle described in the top entity to compute the least significant bit of the first stages. The first one uses a simple AND gate, and the next ones synchronize B inputs and propagate the result of the least significant bit.
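To make the block descriptions of Section 3.2 more concrete, the following is a hedged sketch of how two of the cells might be written: a propagation-only cell in the role of mcell3 and a multiply-and-add cell in the role of mcell2. The paper does not list their source, so the port names (P, PO, A, B, PP, CIN, S, COUT) and the architecture name rtl are hypothetical. The registers that forward the A and B operand bits to the next row are left out.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Propagation-only cell (role of mcell3): a single register that keeps
-- a signal aligned with paths crossing more logic elements.
-- Port names are hypothetical.
entity mcell3 is
  port (
    CLK : in  std_logic;
    P   : in  std_logic;
    PO  : out std_logic
  );
end entity mcell3;

architecture rtl of mcell3 is
begin
  process (CLK)
  begin
    if rising_edge(CLK) then
      PO <= P;
    end if;
  end process;
end architecture rtl;

library ieee;
use ieee.std_logic_1164.all;

-- Single-bit multiplication (AND) fed into a full adder (role of mcell2).
-- Both outputs depend on exactly four inputs: A, B, PP and CIN.
-- Port names are hypothetical; operand forwarding is omitted.
entity mcell2 is
  port (
    CLK  : in  std_logic;
    A, B : in  std_logic;
    PP   : in  std_logic;  -- partial sum from the previous row
    CIN  : in  std_logic;  -- carry saved by the previous row
    S    : out std_logic;
    COUT : out std_logic
  );
end entity mcell2;

architecture rtl of mcell2 is
begin
  process (CLK)
    variable ab : std_logic;
  begin
    if rising_edge(CLK) then
      ab   := A and B;                                   -- one-bit product
      S    <= ab xor PP xor CIN;                         -- full-adder sum
      COUT <= (ab and PP) or (ab and CIN) or (PP and CIN);  -- majority carry
    end if;
  end process;
end architecture rtl;
```

The second cell sits exactly at the limit stated in Section 2: with four inputs per output, adding one more operand to either function would no longer fit in a single 4-input LUT and would require another pipeline stage.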

3.4 Latency and Logic Elements prediction

We described the architecture of the multiplier in a parameterized way, so that it can be instantiated for any required number of bits. Since the implementation is highly regular, both latency and circuit size can be predicted. The latency will always be (2*n-1) cycles. The number of LEs used in each basic block is well known, and corresponds to the number of small squares shown in Fig. 5. So, it is possible to estimate the total size of the resulting circuit in number of LEs by the following equation, for a multiplier of n bits:

\[ \#\mathrm{LEs} = 5n^2 - 3n - 2 + 3\sum_{i=1}^{n-2} i \]

Table 1 presents the latency and circuit size prediction for commonly used data widths. The 24 and 54 bit widths are used in the floating point IEEE 754 standard, which will be discussed in section 4.

Table 1. Latency and size prediction.

#bits   Latency   #LEs
4       7         75
8       15        357
16      31        1545
24      47        3565
32      63        6417
54      107       18550
64      127       25915

3.5 Implementation and Simulation Results

We tested the resulting performance of the integer multiplier synthesized for a range of bit widths from 4 to 64. Table 2 shows in the second column the operating frequency achieved in each circuit. The first thing to note is that the performance obtained is very high for this kind of device. In fact, we investigated why the 4 and 8 bit circuits presented the same performance, and found out that the
