Branch target buffer design for embedded processors

Nadav Levison, Shlomo Weiss *

Dept. of Electrical Engineering Systems, Tel Aviv University, Tel Aviv 69978, Israel

Abstract

The demand for embedded application processors that support a multi-tasking operating system and can execute complex applications brings them closer to general-purpose processors. These strong processors have a limited power source because they are usually found in portable devices such as smartphones and other PDAs, and are powered by batteries. The Branch Target Buffer (BTB), which is commonly used in general-purpose processors, is becoming prevalent in high-end embedded processors in order to support long pipelines and mitigate high miss penalties. However, the BTB is a major power consumer because it is a large SRAM structure that is accessed almost every cycle. We propose two BTB designs that fit the tight power budgets of embedded processors. In the first design, the power consumption of a single BTB access is reduced by reading only the lower part of the predicted target address bits. This design saves up to 25% of dynamic power, with effectively no performance degradation. In the second design, we avoid
redundant BTB accesses to the same set by using a small buffer that holds the most recently accessed set. This design results in 75% dynamic power savings at the cost of up to 0.64% system slowdown in a 2-way BTB, and 80% dynamic power savings at the cost of up to 0.58% system slowdown in a 4-way BTB.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

In 1994 IBM and BellSouth launched a new mobile phone called the Simon Personal Communicator. Apart from the common mobile phone capabilities, the Simon had additional features such as a calendar, sending and receiving E-mails and faxes, games, and an address book. Although it was big, heavy, and costly, the Simon is considered to be the first smartphone. Since then smartphones have become powerful, easy-to-use, popular devices that support a wide range of functions and replace several special-purpose gadgets with a single highly integrated device.

A smartphone is usually defined as a common mobile phone that can also function as a personal digital assistant (PDA). The features that might be expected in a modern smartphone include the ability to run a multi-tasking operating system, a large display, internet access, E-mail, SMS, personal information management, voice communication, WiFi, still and video camera, music player, GPS, and more. In order to support this large variety of tasks, today's smartphones are usually built with an application processor, along with the ubiquitous digital signal processor (DSP) and other ad-hoc hardware accelerators.

1.1. ARM Cortex-A8 integrated in the TI-OMAP3

The Texas Instruments OMAP3 family, found in the newest Palm Pre and Samsung smartphones, is an example of an integrated circuit that incorporates an application processor and other hardware accelerators. The TI-OMAP3 architecture (Fig. 1) has four basic blocks: the ARM Cortex-A8 application processor, a 2D/3D Graphic Accelerator, an Image Video Audio Accelerator (IVA 2+), and an Image Signal Processor (ISP). The ARM Cortex-A8 processor [2] runs the operating system and a variety of applications. It is a dual-issue superscalar processor with a
13-stage pipeline, an integrated L2 cache, and advanced dynamic branch prediction. A powerful and low-power processor, it is produced in a 65 nm fabrication process and can run at a maximum speed of 1.1 GHz.

The evolution of the ARM Cortex-A8's predecessors demonstrates the increasing demand for stronger embedded processors. ARM processors are widely used in cellular phones and PDAs, and it is estimated that 99 percent of the world's smartphones employ ARM technology [31]. Table 1 shows four selected ARM embedded processors from the last 15 years.

As Table 1 illustrates, embedded processors are becoming stronger: wider issue, longer pipelines, larger execution windows, and bigger on-chip cache memories. Looking a few years ahead, the next generation of embedded processors will likely be multicore processors. One of the latest examples is the ARM Cortex-A9 MPCore, a dual-core SMP processor
integrated in the Texas Instruments OMAP4.

1.2. Reducing power in embedded processors

Any portable electronic device such as a cellular phone, and especially a smartphone, must manage power consumption wisely because of the limited capacity of the battery. Battery capacity does not improve as fast as microelectronics technology, and the system energy budget is very limited [29]. Therefore a major effort is required to reduce the power consumption of every element in portable devices. In this work we focus on reducing power in the application processor component.

One of the major disadvantages of a long pipeline in superscalar processors, such as the Cortex-A8, is the high branch misprediction penalty. When a branch misprediction is detected, the pipeline must be flushed and all the instructions that follow the mispredicted branch must be canceled. The misprediction penalty is higher in longer pipelines. To minimize the misprediction penalty, a powerful branch prediction mechanism is usually used [2,32].

Most processors that use dynamic branch prediction implement two kinds of mechanisms: direction prediction, used to predict whether a branch is taken or not, and target address prediction, which predicts the target address of taken branches. The address prediction is usually implemented using a Branch Target Buffer (BTB), a structure that holds the target addresses of branches that were recently executed. The ARM Cortex-A8 processor, which has a 13-cycle branch misprediction penalty, uses a 512-entry, 2-way BTB and a 4096-entry global history buffer [2]. However, these structures contribute to the total processor power consumption because they are SRAM structures that are accessed almost every cycle. Hence a low-power BTB is essential in high-end embedded processors.

The research presented in this paper targets the BTB power consumption problem. We propose two different mechanisms. The first one, the Split Data Array (SDA) BTB, is based on the observation that most branch instructions are short-distance, and therefore dynamic power can be saved by not accessing all of the predicted target address bits. The BTB data array in this design is divided into two arrays: a low data array to hold the lower part of the predicted target address, and a high data array to hold the remaining bits. The low data array is accessed in every BTB access, while the high data array is accessed only
when needed.

The second mechanism, the BTB with a set-buffer, is based on the locality-of-reference property of branch addresses. In this design, the index field of the branch address is shifted left when accessing the BTB. This shift increases the probability that two successive BTB accesses are to the same BTB set, and therefore it is worthwhile to buffer the entire set when it is accessed. A set-buffer is provided for this purpose. If, as expected, the next BTB reference is to the same set, the prediction can be read from the set-buffer, saving an access to the BTB.
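The set-buffer scheme described above can be modeled in a few lines of code. The following is an illustrative sketch only, not the design evaluated in this paper; the class name, the 256-set 2-way organization, and the 5-bit index shift are assumptions made for the example:

```python
# Illustrative sketch of a BTB with a set-buffer (not the paper's design).
# Assumption: 256 sets, 2-way; the index is taken from higher-order
# address bits (the "shifted left" index) so that nearby branches
# map to the same set.

class SetBufferBTB:
    def __init__(self, num_sets=256, ways=2, index_shift=5):
        self.num_sets = num_sets
        self.ways = ways
        self.index_shift = index_shift        # extra left shift of the index
        self.sets = [[None] * ways for _ in range(num_sets)]
        self.last_index = None                # index of the buffered set
        self.set_buffer = None                # most recently accessed set
        self.btb_reads = 0                    # full BTB array accesses

    def index_of(self, branch_addr):
        # Drop the 2 byte-offset bits, then apply the extra left shift.
        return (branch_addr >> (2 + self.index_shift)) % self.num_sets

    def lookup(self, branch_addr):
        idx = self.index_of(branch_addr)
        if idx != self.last_index:
            # Set-buffer miss: read the whole set from the BTB array
            # and buffer it for subsequent accesses.
            self.set_buffer = self.sets[idx]
            self.last_index = idx
            self.btb_reads += 1
        # The prediction is served from the set-buffer.
        for entry in self.set_buffer:
            if entry is not None and entry[0] == branch_addr:
                return entry[1]               # predicted target address
        return None
```

With this model, a run of lookups that map to the same set performs a single read of the BTB arrays; the `btb_reads` counter makes the savings visible.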
1.3. Paper outline

In Section 2 we present the first BTB mechanism, the Split Data Array (SDA) BTB, including motivation, design, and results. The design and motivation of the second mechanism, the BTB with a set-buffer, are presented in Section 3. The simulation setup for both designs is described in Section 2.4. Related work is described in Section 4, and the paper ends with a summary and conclusions in Section 5.

2. SDA BTB

In this section we give a description of the Split Data Array (SDA) BTB. In Section 2.1 we discuss the design motivation, and in Section 2.2 we present the general structure and the way the BTB is accessed. In Section 2.3
we discuss power and timing issues related to this design. Results are presented and discussed in Section 2.5.

2.1. Motivation

In order to determine the relation between the branch instruction address (BA) and the branch target address (TA), we define the Highest Relevant Bit (HRB) of the branch target address using the two following equations:

S = { i | BA_i ≠ TA_i, 2 ≤ i ≤ 31 }
HRB = max(S)    (1)

where S is the set of all the bit positions in which the branch address differs from the target address, assuming a 32-bit address space (bit 0 is the LSB). The bits in positions 0 and 1 are not stored in the BTB because we assume all instructions are 4 bytes long and aligned in memory. HRB is defined as the maximum of S, i.e., the leftmost bit of the target address that differs from the corresponding bit of the branch address. Note that HRB does not indicate the distance of the branch. For example, if the branch address is 0x0000FFFC and the target address is 0x00010000, the branch distance is only one instruction forward, yet the HRB is 16.
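Given the definitions above, HRB can be computed with a single XOR; the following sketch (the function name is ours) reproduces the example:

```python
def hrb(branch_addr, target_addr):
    """Highest Relevant Bit: the leftmost bit position (>= 2) in which
    the 32-bit branch and target addresses differ, or None if the two
    addresses agree on all stored bits (bits 0-1 are never stored)."""
    diff = (branch_addr ^ target_addr) & 0xFFFFFFFC  # ignore bits 0-1
    if diff == 0:
        return None
    return diff.bit_length() - 1  # position of the leftmost set bit
```

For the example in the text, `hrb(0x0000FFFC, 0x00010000)` returns 16 even though the branch is only one instruction long.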
In the simulations we ran on the SPEC2000 programs [1], we discovered that the average HRB is very low. In at least 47% of the BTB accesses the HRB is less than or equal to eight, and in at least 75% of the BTB accesses it is less than or equal to 12. The reasons for this behavior are listed below.

Text size: program text size is usually small. According to our simulations, the average SPEC2000 [1] text size is only 194 k instructions.

The proposed BTB mechanism is based on the observation that when accessing the BTB, a significant part of the predicted target address is already known. The higher bits of the branch address are identical to the higher bits of the target address, and therefore these bits can be bypassed directly from the address of the branch instruction (that is, from the program counter) instead of being read from the BTB. Dynamic power is reduced by accessing only the lower part of the target address.

2.2. SDA-BTB design

In a traditional BTB design, each line is composed of a few fields: the tag field, which holds the branch identifier extracted from the address of the branch; the data or
target address field, which holds the predicted target address bits; and a valid bit to indicate the validity of the entry. There might be other fields, such as a branch direction history counter. All these fields are usually stored in two arrays, one for the tag and one for all the remaining fields. In the SDA-BTB design, each entry is composed of the following fields:

1. Tag: holds branch identifiers, as in a traditional BTB.
2. Target Address Low (TAL): holds the n lower bits of the predicted target address.
3. Valid Low (VL): holds a valid bit that indicates whether this entry is valid, as in a traditional BTB.
4. Target Address High (TAH): holds the (30 - n) higher bits of the predicted target address. We assume a 32-bit address space and a 4-byte-aligned instruction word, and therefore we do not store the two least significant bits.
5. Valid High (VH): holds information regarding TAH validity. If this bit is 1, then there is valid information for this branch in the TAH data array.

TAL, VL, and VH are accessed in every BTB access, and are practically the same array. TAH is accessed only when needed, according to the VH bit. We will not refer to VL in the rest of this paper because the use of this bit is exactly as in
a traditional BTB.

When a lookup operation is initiated (Fig. 2), the branch address is decoded and sent to the tag array. If there is a hit, i.e., there is a valid prediction for this branch, the TAL array is read. If the VH bit is 0, no further operation is performed, and the predicted target address is the concatenation of the higher bits of the branch address (BAH) with the bits that were read from the TAL array. We refer to this case as partial prediction. Alternatively, if the VH bit is 1, the TAH array is read, and the predicted target address is the concatenation of the bits read from the TAH array with the bits read from the TAL array. We refer to this case as full prediction. For both partial and full prediction the SDA BTB provides the same level of accuracy as the traditional BTB. Partial prediction, however, is more desirable, since in this case only a subset of the target address bits is read from the BTB data array, and consequently the BTB dynamic power consumption is reduced.

When the outcome of a branch instruction that missed the BTB becomes available, the BTB is updated (Fig. 3). Depending on the BTB level of associativity and replacement policy, a new entry in the BTB
is allocated using the branch address. The low target address bits (TAL) are stored in the appropriate entry in the TAL array. A comparison is made between the upper bits of the branch and target addresses, that is, between BAH and TAH. If they are equal, a partial prediction can be used, and therefore there is no need to store the upper predicted address bits. Consequently, the TAH array is not accessed and the VH bit in the appropriate entry is set to 0. On the other hand, if BAH and TAH are not equal, then a full prediction is needed for this branch. In this case the VH bit is set to 1 and the upper target address bits are stored in the TAH data array.
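The lookup and update paths can be summarized in a short behavioral model. This is a sketch under simplifying assumptions (direct-mapped rather than set-associative, n = 12 TAL bits, illustrative names), not the hardware implementation:

```python
# Behavioral sketch of the SDA BTB lookup/update paths.
# n low target-address bits live in the TAL array; the remaining
# (30 - n) bits live in the TAH array, guarded by the VH bit.

N = 12                                       # TAL width (assumed)
LOW_MASK = ((1 << N) - 1) << 2               # target bits 2 .. N+1 (TAL)
HIGH_MASK = 0xFFFFFFFF & ~(LOW_MASK | 0x3)   # target bits N+2 .. 31 (TAH)

class SDABTB:
    def __init__(self, num_entries=512):
        self.num_entries = num_entries
        self.tag = [None] * num_entries
        self.tal = [0] * num_entries    # low predicted-target bits
        self.tah = [0] * num_entries    # high predicted-target bits
        self.vh = [0] * num_entries     # 1 -> TAH entry is valid

    def _index(self, addr):
        return (addr >> 2) % self.num_entries

    def lookup(self, branch_addr):
        i = self._index(branch_addr)
        if self.tag[i] != branch_addr:
            return None                 # BTB miss
        if self.vh[i] == 0:
            # Partial prediction: high bits (BAH) bypassed from the PC.
            high = branch_addr & HIGH_MASK
        else:
            # Full prediction: high bits read from the TAH array.
            high = self.tah[i]
        return high | self.tal[i]

    def update(self, branch_addr, target_addr):
        i = self._index(branch_addr)
        self.tag[i] = branch_addr
        self.tal[i] = target_addr & LOW_MASK
        if (target_addr & HIGH_MASK) == (branch_addr & HIGH_MASK):
            self.vh[i] = 0              # BAH == TAH: partial suffices
        else:
            self.vh[i] = 1              # store the high bits
            self.tah[i] = target_addr & HIGH_MASK
```

A short forward branch leaves VH at 0 and is served by a partial prediction; only a branch whose upper bits differ from the program counter's pays for a TAH write and later TAH reads.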
2.3. Power and timing considerations

2.3.1. Power

The dynamic power dissipation of the BTB was calculated as follows:

P_SDA-BTB = (P_TAG + P_L) + (N_H / N_OVERALL) · P_H    (2)

where P_TAG, P_L, and P_H are the per-access dynamic power of the tag array, the TAL array, and the TAH array, respectively, and N_H / N_OVERALL is the fraction of BTB accesses in which the TAH array is read. This equation describes the way the SDA BTB works: the tag array and TAL are accessed for each instruction, while TAH is accessed only a fraction of the time. For calculating P_TAG, we assume that instructions are 4 bytes long and aligned, and therefore the two lowest bits of the instruction address are always zero. The BTB has 256 sets, which are accessed with an eight-bit index. After removing the two lowest address bits and the index bits from the 32-bit address, the remaining 22 bits are used as tag bits in the BTB.

Dynamic power values were calculated using CACTI 5.2 [35], assuming a 65 nm technology. CACTI was designed to estimate power in cache memories and SRAM arrays, usually with large line sizes. We used the scaling methods that were used in [19] and in [37].

2.3.2. Timing

The relative timing of accessing the tag and data components in a cache (and BTB)
is a design choice that impacts the basic structure of the cache. There are different set-associative cache organizations that support three kinds of cache accesses: serial, fast, and normal [35]. In a serial access, the tag array is accessed first; only if there is a hit in one of the cache ways is the data array accessed and the appropriate line read out. In a fast access, the tag and data arrays are accessed in parallel. Because the desired way is unknown at the beginning of the access, all the lines in the set are read out of the data array, and the way selection is made outside the array at the end of the cache access. In a normal access, the tag and data arrays are accessed in parallel, but the line select is made only when the desired way is known. A serial access is characterized by a long access time and low energy, while a f