Computer Systems And Concepts..

Sunday, August 23, 2015

What is the difference between Shallow copy and Deep copy? Which is default?

When you do a shallow copy, all the fields of the source object is copied to target object as it is. That means, if there is a dynamically created field in the source object, shallow copy will copy the same pointer to target object. So you will have two objects with fields that are pointing to same memory location which is not what you usually want.

In case of deep copy, instead of copying the pointer, the object itself is copied to target. In this case if you modify the target object, it will not affect the source. By default copy constructors and assignment operators do shallow copy. To make it as deep copy, you need to create a custom copy constructor and override assignment operator.

Monday, March 11, 2013

RISC PIPELINE

In the history of computer hardware, some early reduced instruction set computer central processing units (RISC CPUs) used a very similar architectural solution, now called a classic RISC pipeline.

Those CPUs were: MIPS, SPARC, Motorola 88000, and later DLX.

The main common concept of each design was a five-stage execution instruction pipeline. During operation, each pipeline stage would work on one instruction at a time.

Each of these stages consisted of an initial set of flip-flops, and combinational logic which operated on the outputs of those flops.

The classic five stage RISC pipelineInstruction fetchDecode
Execute

Register-Register Operation (Single cycle latency): Add, subtract, compare, and logical operations. During the execute stage, the two arguments were fed to a simple ALU, which generated the result by the end of the execute stage.

Memory Reference (Two cycle latency). All loads from memory. During the execute stage, the ALU added the two arguments (a register and a constant offset) to produce a virtual address by the end of the cycle.

Multi Cycle Instructions (Many cycle latency). Integer multiply and divide and all floating-point operations. During the execute stage, the operands to these operations were fed to the multi-cycle multiply/divide unit. The rest of the pipeline was free to continue execution while the multiply/divide unit did its work. To avoid complicating the writeback stage and issue logic, multicycle instruction wrote their results to a separate set of registers.

Memory Access
Writeback
Hazards
Hennessy and Patterson coined the term hazard for situations in which a completely trivial pipeline would produce wrong answers.
Structural Hazards
Data Hazards
Solution A. Bypassing
SUB r3,r4 -> r10 ; Writes r3-r4 to r10
AND r10,r3 -> r11 ; Writes r10 & r3 to r11

A register file read port (i.e. the output of the decode stage, as in the naïve pipeline): red arrow

The current register pipeline of the ALU (to bypass by one stage): blue arrow

The current register pipeline of the access stage (which is either a loaded value or a forwarded ALU result, this provides bypassing of two stages): purple arrow. Note that this requires the data to be passed backwards in time by one cycle. If this occurs, a bubble must be inserted to stall the AND operation until the data is ready.

Control hazards

The branch resolution recurrence goes through quite a bit of circuitry: the instruction cache read, register file read, branch condition compute (which involves a 32-bit compare on the MIPS CPUs), and the next instruction address multiplexer.

Because branch and jump targets are calculated in parallel to the register read, RISC ISAs typically do not have instructions which branch to a register+offset address. Jump to register is supported.

On any branch taken, the instruction immediately after the branch is always fetched from the instruction cache. If this instruction is ignored, there is a one cycle per taken branch IPC penalty, which is quite large.

Predict Not Taken: Always fetch the instruction after the branch from the instruction cache, but only execute it if the branch is not taken. If the branch is not taken, the pipeline stays full. If the branch is taken, the instruction is flushed (marked as if it were a NOP), and one cycle's opportunity to finish an instruction is lost.

Branch Likely: Always fetch the instruction after the branch from the instruction cache, but only execute it if the branch was taken. The compiler can always fill the branch delay slot on such a branch, and since branches are more often taken than not, such branches have a smaller IPC penalty than the previous kind.

Branch Delay Slot: Always fetch the instruction after the branch from the instruction cache, and always execute it, even if the branch is taken. Instead of taking an IPC penalty for some fraction of branches either taken (perhaps 60%) or not taken (perhaps 40%), branch delay slots take an IPC penalty for those branches into which the compiler could not schedule the branch delay slot. The SPARC, MIPS, and MC88K designers designed a branch delay slot into their ISAs.

Branch Prediction: In parallel with fetching each instruction, guess if the instruction is a branch or jump, and if so, guess the target. On the cycle after a branch or jump, fetch the instruction at the guessed target. When the guess is wrong, flush the incorrectly fetched target.

Compilers typically have some difficulty finding logically independent instructions to place after the branch (the instruction after the branch is called the delay slot), so that they must insert NOPs into the delay slots.

Superscalar processors, which fetch multiple instructions per cycle and must have some form of branch prediction, do not benefit from delayed branches. The Alpha ISA left out delayed branches, as it was intended for superscalar processors.

The most serious drawback to delayed branches is the additional control complexity they entail. If the delay slot instruction takes an exception, the processor has to be restarted on the branch, rather than that next instruction. Exceptions then have essentially two addresses, the exception address and the restart address, and generating and distinguishing between the two correctly in all cases has been a source of bugs for later designs.

The Instruction Cache on these machines had a latency of one cycle. During the Instruction Fetch stage, a 32-bit instruction was fetched from the cache.

The PC predictor sends the Program Counter (PC) to the Instruction Cache to read the current instruction. At the same time, the PC predictor predicts the address of the next instruction by incrementing the PC by 4 (all instructions were 4 bytes long). This prediction was always wrong in the case of a taken branch, jump, or exception (see delayed branches, below). Later machines would use more complicated and accurate algorithms (branch prediction and branch target prediction) to guess the next instruction address.

Unlike earlier microcoded machines, the first RISC machines had no microcode. Once fetched from the instruction cache, the instruction bits were shifted down the pipeline, so that simple combinational logic in each pipeline stage could produce the control signals for the datapath directly from the instruction bits. As a result, very little decoding is done in the stage traditionally called the decode stage.

All MIPS, SPARC, and DLX instructions have at most two register inputs. During the decode stage, these two register names are identified within the instruction, and the two registers named are read from the register file. In the MIPS design, the register file had 32 entries.

At the same time the register file was read, instruction issue logic in this stage determined if the pipeline was ready to execute the instruction in this stage. If not, the issue logic would cause both the Instruction Fetch stage and the Decode stage to stall. On a stall cycle, the stages would prevent their initial flops from accepting new bits.

If the instruction decoded was a branch or jump, the target address of the branch or jump was computed in parallel with reading the register file. The branch condition is computed after the register file is read, and if the branch is taken or if the instruction is a jump, the PC predictor in the first stage is assigned the branch target, rather than the incremented PC that has been computed.

The decode stage ended up with quite a lot of hardware: the MIPS instruction set had the possibility of branching if two registers were equal, so a 32-bit-wide AND tree ran in series after the register file read, making a very long critical path through this stage. Also, the branch target computation generally required a 16 bit add and a 14 bit incrementer. Resolving the branch in the decode stage made it possible to have just a single-cycle branch mispredict penalty. Since branches were very often taken (and thus mispredicted), it was very important to keep this penalty low.

Instructions on these simple RISC machines can be divided into three latency classes according to the type of the operation:

During this stage, single cycle latency instructions simply have their results forwarded to the next stage. This forwarding ensures that both single and two cycle instructions always write their results in the same stage of the pipeline, so that just one write port to the register file can be used, and it is always available.

For direct mapped and virtually tagged data caching, the simplest by far of the numerous data cache organizations, two SRAMs are used, one storing data and the other storing tags.

If the instruction is a load, data is read from the data cache, so both SRAMs are read in parallel during the access stage of the instruction. The single tag read from the tag SRAM is compared with the virtual address specified in the load instruction, and if the two are equal then the datum recently retrieved from the data SRAM is the desired element of data. The success of finding the tag immediately in the tag SRAM is called a cache hit and allows the load instruction to complete the writeback stage normally. If the tag from the tag SRAM and virtual address from the load instruction are not equal, then the data is not in the cache and the datum retrieved from the data SRAM is useless, referred to as a cache miss. The CPU pipeline must suspend operation (described below) while a state machine updates the cache from memory, reading the required data from memory into the cache and optionally writing any dirty data evicted from the cache back into memory.

During a store, the tag SRAM is read to determine if the store is a cache hit or cache miss. If a cache miss occurs, the previously described cache update state machine brings the correct datum into the cache. Note that this means store data cannot be written to the cache data SRAM during the access stage because the processor does not yet know if the correct line is resident. Instead, the store data is held in a Store Data Queue, until it can be written to the cache data SRAM during the next store instruction. In a classic RISC pipeline, the Store Data Queue is just a single 32 bit register. For later reference, the virtual address written is held in the Store Address Queue, also a single 32 bit register. On more complicated pipelines, these queues can have multiple hardware registers and variable length.

To complicate things further, a load instruction immediately after a store instruction could reference the same memory address, in which case the data must come from the Store Data Queue rather than from the cache's data SRAM. For this reason, during a load instruction the virtual address included in the instruction is checked against the Store Address Queue in addition to the tag cache SRAM. Should the virtual address included in the load instruction be matched by an entry in the Store Address Queue, the associated data from the Store Data Queue is forwarded during the writeback stage rather than any data from the data cache SRAM without changing the flow of the load operation through the pipeline.

During this stage, both single cycle and two cycle instructions write their results into the register file.

Structural hazards are when two instructions might attempt to use the same resources at the same time. Classic RISC pipelines avoided these hazards by replicating hardware. In particular, branch instructions could have used the ALU to compute the target address of the branch. If the ALU were used in the decode stage for that purpose, an ALU instruction followed by a branch would have seen both instructions attempt to use the ALU simultaneously. It is simple to resolve this conflict by designing a specialized branch target adder into the decode stage.

Data hazards are when an instruction, scheduled blindly, would attempt to use data before the data is available in the register file.

Data hazards are avoided in one of two ways:

Suppose the CPU is executing the following piece of code:

The instruction fetch and decode stages will send the second instruction one cycle after the first. They flow down the pipeline as shown in this diagram:

In a naïve pipeline, without hazard consideration, the data hazard progresses as follows:

In cycle 3, the SUB instruction calculates the new value for r10. In the same cycle, the AND operation is decoded, and the value of r10 is fetched from the register file. However, the SUBinstruction has not yet written its result to r10. Write-back of this normally occurs in cycle 5 (green box). Therefore, the value read from the register file and passed to the ALU (in the Execute stage of the AND operation, red box) is incorrect.

Instead, we must pass the data that was computed by SUB back to the Execute stage (i.e. to the red circle in the diagram) of the AND operation before it is normally written-back. The solution to this problem is a pair of bypass multiplexers. These multiplexers sit at the end of the decode stage, and their flopped outputs are the inputs to the ALU. Each multiplexer selects between:

Decode stage logic compares the registers written by instructions in the execute and access stages of the pipeline to the registers read by the instruction in the decode stage, and cause the multiplexers to select the most recent data. These bypass multiplexers make it possible for the pipeline to execute simple instructions with just the latency of the ALU, the multiplexer, and a flip-flop. Without the multiplexers, the latency of writing and then reading the register file would have to be included in the latency of these instructions.

Note that the data can only be passed forward in time - the data cannot be bypassed back to an earlier stage if it has not been processed yet. In the case above, the data is passed forward (by the time the AND is ready for the register in the ALU, the SUB has already computed it).

However, consider the following instructions:

The data read from the address adr will not be present in the data cache until after the Memory Access stage of the LD instruction. By this time, the AND instruction will already be through the ALU. To resolve this would require the data from memory to be passed backwards in time to the input to the ALU. This is not possible. The solution is to delay the AND instruction by one cycle. The data hazard is detected in the decode stage, and the fetch and decode stages are stalled - they are prevented from flopping their inputs and so stay in the same state for a cycle. The execute, access, and write-back stages downstream see an extra no-operation instruction (NOP) inserted between the LD and AND instructions.

This NOP is termed a "pipeline bubble" since it floats in the pipeline, like an air bubble, occupying resources but not producing useful results. The hardware to detect a data hazard and stall the pipeline until the hazard is cleared is called a pipeline interlock.

A pipeline interlock does not have to be used with any data forwarding, however. The first example of the SUB followed by AND and the second example of LD followed by AND can be solved by stalling the first stage by three cycles until write-back is achieved, and the data in the register file is correct, causing the correct register value to be fetched by the AND's Decode stage. This causes quite a performance hit, as the processor spends a lot of time processing nothing, but clock speeds can be increased as there is less forwarding logic to wait for.

This data hazard can be detected quite easily when the program's machine code is written by the compiler. The original Stanford RISC machine relied on the compiler to add the NOP instructions in this case, rather than having the circuitry to detect and (more taxingly) stall the first two pipeline stages. Hence the name MIPS: Microprocessor without Interlocked Pipeline Stages. It turned out that the extra NOP instructions added by the compiler expanded the program binaries enough that the instruction cache hit rate was reduced. The stall hardware, although expensive, was put back into later designs to improve instruction cache hit rate, at which point the acronym no longer made sense.

Control hazards are caused by conditional and unconditional branching. The classic RISC pipeline resolves branches in the decode stage, which means the branch resolution recurrence is two cycles long. There are three implications:

There are four schemes to solve this performance problem with branches:

Delayed branches were controversial, first, because their semantics are complicated. A delayed branch specifies that the jump to a new location happens after the next instruction. That next instruction is the one unavoidably loaded by the instruction cache after the branch.

Friday, July 6, 2012

MEMORY CONTROLLER

The memory controller is a digital circuit which manages the flow of data going to and from the main memory. It can be a separate chip or integrated into another chip, such as on the die of a microprocessor. This is also called a Memory Chip Controller (MCC).

Computers using Intel microprocessors have traditionally had a memory controller implemented on their motherboard's northbridge, but many modern microprocessors, such as DEC/Compaq'sAlpha 21364, AMD's Athlon 64 and Opteron processors, IBM's POWER5, Sun Microsystems's UltraSPARC T1, and more recently Intel's Core i7 have an integrated memory controller (IMC) on the microprocessor in order to reduce memory latency. While this has the potential to increase the system's performance, it locks the microprocessor to a specific type (or types) of memory, forcing a redesign in order to support newer memory technologies. When DDR2 SDRAM was introduced, AMD released new Athlon 64 CPUs. These new models, with a DDR2 controller, use a different physical socket (known as Socket AM2), so that they will only fit in motherboards designed for the new type of RAM. When the memory controller is not on-die, the same CPU may be installed on a new motherboard, with an updated northbridge.

The integration of the memory controller onto the die of the microprocessor is not a new concept. Some microprocessors in the 1990s such as the DEC Alpha 21066 and HP PA-7300LC had integrated memory controllers, but rather than for performance gains, this was implemented to reduce the cost of systems by eliminating the need for an external memory controller.

Purpose

Memory controllers contain the logic necessary to read and write to DRAM, and to "refresh" the DRAM by sending current through the entire device. Without constant refreshes, DRAM will lose the data written to it as the capacitors leak their charge within a fraction of a second (not less than 64 milliseconds according to JEDEC standards).

Reading and writing to DRAM is performed by selecting the row and column data addresses of the DRAM as the inputs to the multiplexer circuit, where the demultiplexer on the DRAM uses the converted inputs to select the correct memory location and return the data, which is then passed back through a multiplexer to consolidate the data in order to reduce the required bus width for the operation.

Bus width is the number of parallel lines available to communicate with the memory cell. Memory controllers' bus widths range from 8-bit in earlier systems, to 512-bit in more complicated systems and video cards (typically implemented as four 64-bit simultaneous memory controllers operating in parallel, though some are designed to operate in "gang mode" where two 64-bit memory controllers can be used to access a 128-bit memory device).

Double data rate memory

Double Data Rate DDR memory controllers are used to drive DDR SDRAM, where data is transferred on the rising and falling access of the memory clock of the system. DDR memory controllers are significantly more complicated than Single Data Rate controllers, but allow for twice the data to be transferred without increasing the clock rate or increasing the bus width to the memory cell.

Dual-channel memory

Dual Channel memory controllers are memory controllers where the DRAM devices are separated on to two different buses to allow two memory controllers to access them in parallel. This doubles the theoretical amount of bandwidth of the bus. In theory, more channels can be built (a channel for every DRAM cell would be the ideal solution), but due to wire count, line capacitance, and the need for parallel access lines to have identical lengths, more channels are very difficult to add.

Fully buffered memory

Fully buffered memory systems place a memory buffer device on every memory module (called an FB-DIMM when Fully Buffered RAM is used), which unlike traditional memory controller devices, use a serial data link to the memory controller instead of the parallel link used in previous RAM designs. This decreases the number of the wires necessary to place the memory devices on a motherboard (allowing for a smaller number of layers to be used, meaning more memory devices can be placed on a single board), at the expense of increasing latency (the time necessary to access a memory location). This increase is due to the time required to convert the parallel information read from the DRAM cell to the serial format used by the FB-DIMM controller, and back to a parallel form in the memory controller on the motherboard. In theory, the FB-DIMM's memory buffer device could be built to access any DRAM cells, allowing for memory cell agnostic memory controller design, but this has not been demonstrated, as the technology is in its infancy.

Friday, June 10, 2011

TRANSLATION LOOKASIDE BUFFER

A translation lookaside buffer (TLB) is a CPU cache that memory management hardware uses to improve virtual address translation speed. All current desktop and server processors (such as x86) use a TLB to map virtual and physical address spaces, and it is ubiquitous in any hardware which utilizes virtual memory.

The TLB is typically implemented as content-addressable memory (CAM). The CAM search key is the virtual address and the search result is a physical address. If the requested address is present in the TLB, the CAM search yields a match quickly and the retrieved physical address can be used to access memory. This is called a TLB hit. If the requested address is not in the TLB, it is a miss, and the translation proceeds by looking up the page table in a process called a page walk. The page walk is an expensive process, as it involves reading the contents of multiple memory locations and using them to compute the physical address. After the physical address is determined by the page walk, the virtual address to physical address mapping is entered into the TLB.

A TLB has a fixed number of slots that contain page table entries, which map virtual addresses to physical addresses. The virtual memory is the space seen from a process. This space is segmented in pages of a prefixed size. The page table (generally loaded in memory) keeps track of where the virtual pages are loaded in the physical memory. The TLB is a cache of the page table; that is, only a subset of its content are stored.

The TLB references physical memory addresses in its table. It may reside between the CPU and the CPU cache, between the CPU cache and primary storage memory, or between levels of a multi-level cache. The placement determines whether the cache uses physical or virtual addressing. If the cache is virtually addressed, requests are sent directly from the CPU to the cache, and the TLB is accessed only on a cache miss. If the cache is physically addressed, the CPU does a TLB lookup on every memory operation and the resulting physical address is sent to the cache. There are pros and cons to both implementations. Caches that use virtual addressing have for their key part of the virtual address plus, optionally, a key called an "address space identifier" (ASID). Caches that don't have ASIDs must be flushed every context switch in a multiprocessing environment.

Implication For Performance:

The CPU has to access main memory for a:

instruction cache miss
data cache miss
TLB miss

The third case (the simplest case) is where the desired information itself actually is in a cache, but the information for virtual-to-physical translation is not in a TLB. These are all about equally slow, so a program "thrashing" the TLB will run just as poorly as one thrashing an instruction or data cache. That is why a well functioning TLB is important.

Wednesday, April 27, 2011

VIRTUAL FUNCTION

A virtual function or virtual method is a function or method whose behaviour can be overridden within an inheriting class by a function with the same signature. This concept is a very important part of the polymorphism portion of object-oriented programming (OOP).

The concept of the virtual function solves the following problem:

In OOP when a derived class inherits a base class, an object of the derived class may be referred to (or cast) as either being the base class type or the derived class type. If there are base class methods overridden by the derived class, the method call behaviour is ambiguous.

The distinction between virtual and non-virtual resolves this ambiguity. If the function in question is designated "virtual" in the base class then the derived class's function would be called (if it exists). If it is not virtual, the base class's function would be called.

Virtual functions overcome the problems with the type-field solution by allowing the programmer to declare functions in a base class that can be redefined in each derived class.

In C++ virtual functions or virtual methods are declared by using the virtual keyword to the function's declaration.

For example, a base class Animal could have a virtual function eat. Subclass Fish would implement eat() differently than subclass Wolf, but you can invoke eat() on any class instance referred to as Animal, and get the eat() behaviour of the specific subclass.

This allows a programmer to process a list of objects of class Animal, telling each in turn to eat (by calling eat()), with no knowledge of what kind of animal may be in the list. You also do not need to have knowledge of how each animal eats, or what the complete set of possible animal types might be.

The following is an example in C++. Note that this example is not exception-safe. In particular, it may leak resources if new or vector::push_back throws an exception.

#include <iostream>
#include <vector>
 
class Animal {
    public:
        virtual void eat() const { 
            std::cout << "I eat like a generic Animal." << std::endl; 
        }
        virtual ~Animal() { 
        }
};
 
class Wolf : public Animal {
    public:
        void eat() const { 
            std::cout << "I eat like a wolf!" << std::endl; 
        }
        virtual ~Wolf() { 
        }
};
 
class Fish : public Animal {
    public:
        void eat() const { 
            std::cout << "I eat like a fish!" << std::endl; 
        }
        virtual ~Fish() { 
        }
};
 
class GoldFish : public Fish {
    public:
        void eat() const { 
            std::cout << "I eat like a goldfish!" << std::endl; 
        }
        virtual ~GoldFish() { 
        }
};
 
class OtherAnimal : public Animal {
        virtual ~OtherAnimal() { 
        }
};
 
int main() {
    std::vector<Animal*> animals;
    animals.push_back(new Animal());
    animals.push_back(new Wolf());
    animals.push_back(new Fish());
    animals.push_back(new GoldFish());
    animals.push_back(new OtherAnimal());
 
    for (std::vector<Animal*>::const_iterator it = animals.begin(); it != animals.end(); ++it) {
        (*it)->eat();
        delete *it;
    }
 
    return 0;
}

Output with the virtual function Animal::eat():

I eat like a generic Animal.
I eat like a wolf!
I eat like a fish!
I eat like a goldfish!
I eat like a generic Animal.

Output if Animal::eat() were not declared as virtual:

I eat like a generic Animal.
I eat like a generic Animal.
I eat like a generic Animal.
I eat like a generic Animal.
I eat like a generic Animal.

Sunday, April 24, 2011

COST BENEFIT

A cost benefit analysis is done to determine how well, or how poorly, a planned action will turn out. Although a cost benefit analysis can be used for almost anything, it is most commonly done on financial questions. Since the cost benefit analysis relies on the addition of positive factors and the subtraction of negative ones to determine a net result, it is also known as running the numbers.

Cost Benefit Analysis

A cost benefit analysis finds, quantifies, and adds all the positive factors. These are the benefits. Then it identifies, quantifies, and subtracts all the negatives, the costs. The difference between the two indicates whether the planned action is advisable. The real trick to doing a cost benefit analysis well is making sure you include all the costs and all the benefits and properly quantify them.

Should we hire an additional sales person or assign overtime? Is it a good idea to purchase the new stamping machine? Will we be better off putting our free cash flow into securities rather than investing in additional capital equipment? Each of these questions can be answered by doing a proper cost benefit analysis.

Example Cost Benefit Analysis

As the Production Manager, you are proposing the purchase of a $1 Million stamping machine to increase output. Before you can present the proposal to the Vice President, you know you need some facts to support your suggestion, so you decide to run the numbers and do a cost benefit analysis.

You itemize the benefits. With the new machine, you can produce 100 more units per hour. The three workers currently doing the stamping by hand can be replaced. The units will be higher quality because they will be more uniform. You are convinced these outweigh the costs.

There is a cost to purchase the machine and it will consume some electricity. Any other costs would be insignificant.

You calculate the selling price of the 100 additional units per hour multiplied by the number of production hours per month. Add to that two percent for the units that aren't rejected because of the quality of the machine output. You also add the monthly salaries of the three workers. That's a pretty good total benefit.

Then you calculate the monthly cost of the machine, by dividing the purchase price by 12 months per year and divide that by the 10 years the machine should last. The manufacturer's specs tell you what the power consumption of the machine is and you can get power cost numbers from accounting so you figure the cost of electricity to run the machine and add the purchase cost to get a total cost figure.

You subtract your total cost figure from your total benefit value and your analysis shows a healthy profit. All you have to do now is present it to the VP, right? Wrong. You've got the right idea, but you left out a lot of detail.