During this stage, single-cycle latency instructions simply have their results forwarded to the next stage. This forwarding ensures that both single-cycle and two-cycle instructions always write their results in the same stage of the pipeline, so that a single write port to the register file suffices and is always available.
For direct-mapped, virtually tagged data caching, by far the simplest of the numerous data cache organizations, two SRAMs are used: one storing data and the other storing tags.
If the instruction is a load, data is read from the data cache, and both SRAMs are read in parallel during the access stage of the instruction. The single tag read from the tag SRAM is compared with the virtual address specified in the load instruction; if the two are equal, the datum just retrieved from the data SRAM is the desired data. Finding the matching tag in the tag SRAM is called a cache hit, and it allows the load instruction to complete the writeback stage normally. If the tag from the tag SRAM and the virtual address from the load instruction are not equal, the data is not in the cache and the datum retrieved from the data SRAM is useless; this is called a cache miss. The CPU pipeline must suspend operation (described below) while a state machine updates the cache from memory, reading the required data from memory into the cache and optionally writing any dirty data evicted from the cache back into memory.
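The load path described above can be sketched as follows. This is a minimal illustration, not a description of any particular processor: the line size, the number of lines, and the names `DirectMappedCache`, `split`, `load`, and `fill` are all assumptions made for the example, and the refill state machine is reduced to a single `fill` call.

```python
from dataclasses import dataclass, field

LINE_BYTES = 16   # assumed line size, chosen only for illustration
NUM_LINES = 256   # assumed number of lines in the cache


@dataclass
class DirectMappedCache:
    # Two parallel arrays stand in for the tag SRAM and the data SRAM.
    tags: list = field(default_factory=lambda: [None] * NUM_LINES)
    data: list = field(default_factory=lambda: [None] * NUM_LINES)

    @staticmethod
    def split(vaddr):
        """Split a virtual address into (tag, index)."""
        index = (vaddr // LINE_BYTES) % NUM_LINES
        tag = vaddr // (LINE_BYTES * NUM_LINES)
        return tag, index

    def load(self, vaddr):
        """Read both arrays at the same index, then compare the tag."""
        tag, index = self.split(vaddr)
        stored_tag = self.tags[index]   # tag SRAM read
        line = self.data[index]         # data SRAM read, done in parallel
        hit = stored_tag == tag
        # On a miss the line that was read is useless and is discarded.
        return hit, (line if hit else None)

    def fill(self, vaddr, line):
        """Stand-in for the refill state machine: install a line."""
        tag, index = self.split(vaddr)
        self.tags[index] = tag
        self.data[index] = line
```

Two addresses that differ by a multiple of `LINE_BYTES * NUM_LINES` map to the same index with different tags, so in this sketch each one evicts the other.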
During a store, the tag SRAM is read to determine whether the store is a cache hit or a cache miss. If a cache miss occurs, the previously described cache update state machine brings the correct datum into the cache. Note that this means store data cannot be written to the cache data SRAM during the access stage, because the processor does not yet know whether the correct line is resident. Instead, the store data is held in a Store Data Queue until it can be written to the cache data SRAM during the next store instruction. In a classic RISC pipeline, the Store Data Queue is just a single 32-bit register. For later reference, the virtual address written is held in the Store Address Queue, also a single 32-bit register. On more complicated pipelines, these queues can have multiple hardware registers and variable length.
To complicate things further, a load instruction immediately after a store instruction could reference the same memory address, in which case the data must come from the Store Data Queue rather than from the cache's data SRAM. For this reason, during a load instruction the virtual address included in the instruction is checked against the Store Address Queue in addition to the cache tag SRAM. If the load's virtual address matches an entry in the Store Address Queue, the associated data from the Store Data Queue is forwarded during the writeback stage in place of the data from the cache data SRAM. The flow of the load operation through the pipeline does not change.
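A minimal sketch of the single-entry store queue and the load-time address check follows. The class name `StoreQueue`, its methods, and the use of a dictionary as a stand-in for the cache data SRAM are assumptions for illustration only; cache misses and partial-word stores are ignored.

```python
class StoreQueue:
    """Single-entry Store Address Queue and Store Data Queue (one register each)."""

    def __init__(self):
        self.addr = None   # Store Address Queue register
        self.data = None   # Store Data Queue register

    def push(self, addr, data, data_sram):
        """Handle a store: drain the pending store, then hold the new one."""
        if self.addr is not None:
            # The previous store's data reaches the data SRAM only now,
            # during the next store instruction.
            data_sram[self.addr] = self.data
        self.addr, self.data = addr, data

    def load(self, addr, data_sram):
        """Handle a load: forward from the queue on an address match."""
        if self.addr == addr:
            return self.data            # bypass the stale SRAM contents
        return data_sram.get(addr)      # normal read from the data SRAM
```

The point of the sketch is that a load to a pending store's address returns the queued value even though the data SRAM still holds the old one.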
During this stage, both single-cycle and two-cycle instructions write their results into the register file.
Structural hazards occur when two instructions might attempt to use the same resources at the same time. Classic RISC pipelines avoided these hazards by replicating hardware. In particular, branch instructions could have used the ALU to compute the target address of the branch. If the ALU were used in the decode stage for that purpose, an ALU instruction followed by a branch would have seen both instructions attempt to use the ALU simultaneously. It is simple to resolve this conflict by designing a specialized branch target adder into the decode stage.
Data hazards occur when an instruction, scheduled blindly, would attempt to use data before the data is available in the register file.
Data hazards are avoided in one of two ways:
Suppose the CPU is executing the following piece of code:
The instruction fetch and decode stages will send the second instruction one cycle after the first. They flow down the pipeline as shown in this diagram:
In a naïve pipeline, without hazard consideration, the data hazard progresses as follows:
In cycle 3, the SUB instruction calculates the new value for r10. In the same cycle, the AND operation is decoded, and the value of r10 is fetched from the register file. However, the SUB instruction has not yet written its result to r10; write-back normally occurs in cycle 5 (green box). Therefore, the value read from the register file and passed to the ALU (in the Execute stage of the AND operation, red box) is incorrect.
Instead, we must pass the data that was computed by SUB back to the Execute stage (i.e. to the red circle in the diagram) of the AND operation before it is normally written back. The solution to this problem is a pair of bypass multiplexers. These multiplexers sit at the end of the decode stage, and their flopped outputs are the inputs to the ALU. Each multiplexer selects between:
Decode stage logic compares the registers written by instructions in the execute and access stages of the pipeline to the registers read by the instruction in the decode stage, and causes the multiplexers to select the most recent data. These bypass multiplexers make it possible for the pipeline to execute simple instructions with just the latency of the ALU, the multiplexer, and a flip-flop. Without the multiplexers, the latency of writing and then reading the register file would have to be included in the latency of these instructions.
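The selection performed by one bypass multiplexer can be sketched as a small function. The function name, its parameters, and the use of `None` for "this stage writes no register" are assumptions for illustration; the sketch assumes the multiplexer chooses among the register file read port, the execute stage result, and the access stage result, and that the execute stage holds the more recent instruction.

```python
def select_operand(src_reg, regfile, ex_dest, ex_value, mem_dest, mem_value):
    """Pick the most recent value of src_reg for the instruction in decode.

    ex_dest / mem_dest are the registers written by the instructions
    currently in the execute and access stages (None if they write none).
    """
    if ex_dest is not None and src_reg == ex_dest:
        # The instruction in execute is the youngest producer: it wins.
        return ex_value
    if mem_dest is not None and src_reg == mem_dest:
        # Otherwise take the result sitting in the access stage.
        return mem_value
    # No in-flight producer: the register file already holds the value.
    return regfile[src_reg]
```

Each of the two ALU inputs would use its own copy of this selection, one per source register.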
Note that the data can only be passed forward in time: it cannot be bypassed back to an earlier stage if it has not been processed yet. In the case above, the data is passed forward (by the time the AND is ready for the register in the ALU, the SUB has already computed it).
However, consider the following instructions:
The data read from the address adr is not available from the data cache until after the Memory Access stage of the LD instruction. By this time, the AND instruction will already be through the ALU. Resolving this would require the data from memory to be passed backwards in time to the input of the ALU, which is not possible. The solution is to delay the AND instruction by one cycle. The data hazard is detected in the decode stage, and the fetch and decode stages are stalled: they are prevented from flopping their inputs and so stay in the same state for a cycle. The execute, access, and write-back stages downstream see an extra no-operation instruction (NOP) inserted between the LD and AND instructions.
This NOP is termed a "pipeline bubble" since it floats in the pipeline, like an air bubble, occupying resources but not producing useful results. The hardware to detect a data hazard and stall the pipeline until the hazard is cleared is called a pipeline interlock. 
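The interlock's decision can be sketched in a few lines. The `Instr` tuple, the `needs_stall` and `issue_stream` helpers, and the register names used in any example program are hypothetical; the sketch models only the load-use case described above and assumes every other dependency is covered by bypassing.

```python
from collections import namedtuple

# op: mnemonic, dest: register written (or None), srcs: registers read
Instr = namedtuple("Instr", "op dest srcs")
NOP = Instr("NOP", None, ())


def needs_stall(decode_instr, execute_instr):
    """True when the instruction in decode reads the result of a load
    that is still in the execute stage (a load-use hazard)."""
    return (
        execute_instr.op == "LD"
        and execute_instr.dest is not None
        and execute_instr.dest in decode_instr.srcs
    )


def issue_stream(program):
    """Return the instruction sequence seen by the execute stage,
    with a bubble inserted after each load-use hazard."""
    issued = []
    for instr in program:
        if issued and needs_stall(instr, issued[-1]):
            issued.append(NOP)   # fetch and decode hold for one cycle
        issued.append(instr)
    return issued
```

For a load into some register followed immediately by an instruction that reads that register, `issue_stream` returns the load, a NOP, and then the dependent instruction; an ALU instruction followed by a dependent instruction passes through unchanged.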
A pipeline interlock does not have to be used with any data forwarding, however. The first example of the SUB followed by AND and the second example of LD followed by AND can be solved by stalling the first stage by three cycles until write-back is achieved, and the data in the register file is correct, causing the correct register value to be fetched by the AND's Decode stage. This causes quite a performance hit, as the processor spends a lot of time processing nothing, but clock speeds can be increased as there is less forwarding logic to wait for.
This data hazard can be detected quite easily when the program's machine code is written by the compiler. The original Stanford MIPS machine relied on the compiler to add the NOP instructions in this case, rather than having the circuitry to detect and (more taxingly) stall the first two pipeline stages. Hence the name MIPS: Microprocessor without Interlocked Pipeline Stages. It turned out that the extra NOP instructions added by the compiler expanded the program binaries enough that the instruction cache hit rate was reduced. The stall hardware, although expensive, was put back into later designs to improve instruction cache hit rate, at which point the acronym no longer made sense.
Control hazards are caused by conditional and unconditional branching. The classic RISC pipeline resolves branches in the decode stage, which means the branch resolution recurrence is two cycles long. There are three implications:
There are four schemes to solve this performance problem with branches:
Delayed branches were controversial, first, because their semantics are complicated. A delayed branch specifies that the jump to a new location happens after the next instruction. That next instruction is the one unavoidably loaded by the instruction cache after the branch.
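The delayed-branch semantics can be illustrated with a toy interpreter. The program encoding (a list of `(op, arg)` pairs), the `"jump"` opcode, and the `run` function are inventions for this example; the sketch assumes a single delay slot and does not treat a branch placed in a delay slot specially.

```python
def run(program, max_steps=100):
    """Execute a toy program with one branch delay slot.

    program is a list of (op, arg) pairs; ("jump", target) is the only
    control-flow instruction. Returns the list of executed indices.
    """
    trace = []
    pc = 0
    pending_target = None   # jump that takes effect after the delay slot
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op, arg = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if pending_target is not None:
            # This instruction was the delay slot; the jump happens now.
            next_pc = pending_target
            pending_target = None
        if op == "jump":
            # The jump is recorded, but the next instruction still runs.
            pending_target = arg
        pc = next_pc
    return trace
```

With a jump at index 1 targeting index 4, the instruction at index 2 still executes before control transfers, while index 3 is skipped.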