Sunday, September 29, 2013

Process synchronization in Linux Kernel

As we discussed process synchronization earlier, let's look at the process synchronization primitives offered by the Linux kernel.



This article is a summary of the details mentioned here.

Synchronization Primitives

  1. Per-CPU variables
  2. Atomic Operation
  3. Memory barrier
  4. Spin Lock
  5. Semaphore
  6. SeqLocks
  7. Local Interrupt disabling
  8. Local softirq disabling
  9. Read-Copy-Update

Summary of Synchronization Primitives


Technique                  Description                                                  Scope
Per-CPU variables          Duplicate a data structure among the CPUs                    All CPUs
Atomic operation           Atomic read-modify-write instruction to a counter            All CPUs
Memory barrier             Avoid instruction reordering                                 Local CPU or All CPUs
Spin lock                  Lock with busy wait                                          All CPUs
Semaphore                  Lock with blocking wait (sleep)                              All CPUs
Seqlocks                   Lock based on an access counter                              All CPUs
Local interrupt disabling  Forbid interrupt handling on a single CPU                    Local CPU
Local softirq disabling    Forbid deferrable function handling on a single CPU          Local CPU
Read-copy-update (RCU)     Lock-free access to shared data structures through pointers  All CPUs



Per-CPU variables

The best synchronization technique consists in designing the kernel so as to avoid
the need for synchronization in the first place.

Basically, a per-CPU variable is an array of data structures, one element for each CPU in the system.
A CPU should not access the elements of the array corresponding to other CPUs.

Pros

  • Each CPU can freely read and modify its own element without fear of race conditions.
  • It avoids cache line snooping and invalidation, which are costly operations.
[Snooping and Invalidations] Snooping is the process where individual caches monitor address lines for accesses to memory locations they have cached. When a write to a location that a cache holds a copy of is observed, the cache controller invalidates its own copy of the snooped memory location.

Cons
  • It can only be used when it makes sense to logically split the data across the CPUs of the system.
  • It does not provide protection against access from asynchronous functions such as interrupt handlers and deferrable functions.
  • Per-CPU variables are prone to race conditions caused by kernel preemption, both on uniprocessor and multiprocessor systems.
[Problem] Suppose a kernel control path gets the address of its local copy of a per-CPU variable and is then preempted and moved to another CPU: the address still refers to the element of the previous CPU.

As a general rule, a kernel control path should access a per-CPU variable with kernel preemption disabled.
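In the kernel this is what `get_cpu_var()`/`put_cpu_var()` do. As a hedged userspace sketch of the same idea (the names `pcpu_counter`, `percpu_demo`, and the "one thread = one CPU" mapping are illustrative assumptions, not kernel API), each thread touches only its own padded slot, so no lock is needed and no cache line bounces between cores:

```c
#include <pthread.h>

#define NCPU 2

/* One padded slot per "CPU" so the counters never share a cache line
 * (roughly what DEFINE_PER_CPU achieves in the kernel). */
struct pcpu_counter {
    long count;
    char pad[64 - sizeof(long)];
} counters[NCPU];

/* Each thread plays the role of one CPU and touches only its own slot,
 * so no locking is needed. */
static void *worker(void *arg)
{
    int cpu = *(int *)arg;
    for (int i = 0; i < 100000; i++)
        counters[cpu].count++;   /* race-free: this slot is private */
    return NULL;
}

long percpu_demo(void)
{
    pthread_t t[NCPU];
    int ids[NCPU];
    for (int i = 0; i < NCPU; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, worker, &ids[i]);
    }
    long total = 0;
    for (int i = 0; i < NCPU; i++) {
        pthread_join(t[i], NULL);
        total += counters[i].count;  /* summing is the only cross-slot read */
    }
    return total;
}
```

Userspace threads can migrate just like preempted kernel paths, which is exactly why the kernel pairs per-CPU access with preemption disabling.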


Atomic Operations

[Problem] Several assembly language instructions are of type “read-modify-write”, i.e. the memory location is accessed twice: first to read the old value, and then to write a new value.

Detailed explanation
Suppose that two kernel control paths running on two CPUs try to “read-modify-write” the same memory location at the same time using nonatomic operations.
At first, both CPUs try to read the same location, but the memory arbiter (a hardware circuit that serializes accesses to the RAM chips) steps in to grant access to one of them and delay the other. However, when the first read operation has completed, the delayed CPU reads exactly the same (old) value from the memory location.
Both CPUs then try to write the same (new) value to the memory location; again the bus access is serialized by the memory arbiter, and eventually both write operations succeed. However, the global result is incorrect because both CPUs write the same (new) value. Thus, the two interleaving “read-modify-write” operations act as a single one.

[Solution] The root cause is that both CPUs access the memory location at the same time. If the “read-modify-write” operation is made atomic, the problem is solved: each such operation must execute as a single step, without being interrupted in the middle and without other CPUs accessing the same memory location.
  • Read-modify-write assembly language instructions are atomic if no other processor takes the memory bus between the read and the write. Memory bus stealing never happens in a uniprocessor system.
  • For multiprocessor systems, read-modify-write assembly language instructions whose opcode is prefixed by the lock byte (0xf0) are atomic. 
  • Other processors cannot access the memory location while the locked instruction is being executed.
Sample functions: atomic_read(v), atomic_set(v,i), atomic_add(i,v), atomic_sub(i,v) etc…
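The kernel's atomic_t functions compile down to such locked instructions. As a hedged userspace analogue (using C11 `<stdatomic.h>` rather than the kernel API; `atomic_demo` and `inc_worker` are illustrative names), two threads hammering one counter with `atomic_fetch_add` never lose an update:

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_int counter = 0;

static void *inc_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        atomic_fetch_add(&counter, 1);  /* one locked read-modify-write */
    return NULL;
}

int atomic_demo(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, inc_worker, NULL);
    pthread_create(&b, NULL, inc_worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&counter);
}
```

With a plain `counter++` instead, the interleaving described above would routinely lose increments.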


Optimization Barriers & Memory Barriers

Compilers optimize code for efficient use of CPU time and memory, but these optimizations can be disastrous when dealing with shared data.

It can never be guaranteed that instructions will be performed in the same order in which they appear in the source code, mainly because the compiler reorders them while optimizing and the CPU executes several instructions in parallel. This can reorder the memory access pattern.

To avoid this behavior we need a synchronization primitive at two levels

  • Synchronization primitive at Compiler level - Optimization barrier
  • Synchronization primitive at CPU level        - Memory barrier
In Linux, the barrier() macro acts as an optimization barrier. It ensures that the assembly language instructions corresponding to C statements placed before barrier() and those placed after it remain in the same sequence.

However, an optimization barrier cannot control the order in which instructions are executed by the CPU.

A memory barrier primitive ensures that the operations placed before the primitive are finished before the operations placed after it start. Read memory barriers act only on instructions that read from memory, while write memory barriers act only on instructions that write to memory.
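The two levels can be sketched in userspace C. This is a hedged analogue, not kernel code: the empty-asm `barrier()` matches the kernel macro, while C11 `atomic_thread_fence` stands in for the kernel's smp_wmb()/smp_rmb(); `barrier_demo` is an illustrative name:

```c
/* Optimization barrier: empty asm with a "memory" clobber. The compiler
 * may not move memory accesses across it, but the CPU still can. */
#define barrier() __asm__ __volatile__("" ::: "memory")

#include <stdatomic.h>

int barrier_demo(void)
{
    int data = 0;
    _Atomic int ready = 0;

    data = 42;
    barrier();                                  /* compiler-level only */
    atomic_thread_fence(memory_order_release);  /* CPU-level write barrier */
    atomic_store(&ready, 1);

    /* A reader on another CPU would pair this with a read (acquire)
     * barrier before touching data. */
    if (atomic_load(&ready)) {
        atomic_thread_fence(memory_order_acquire);
        return data;
    }
    return -1;
}
```

The pairing matters: the writer's release fence guarantees `data` is visible before `ready`, and the reader's acquire fence guarantees it does not read `data` before seeing `ready`.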


Spin Locks

When a kernel control path must access a shared data structure or enter a critical section, it needs to acquire a lock.
Spin locks are a special kind of lock designed to work in a multiprocessor environment. If a kernel control path finds the spin lock “open”, it acquires the lock and continues execution; otherwise it “spins”, repeatedly executing a tight instruction loop until the lock is released.

The waiting kernel control path keeps running on the CPU, even though it has nothing to do but waste time. Kernel preemption is disabled inside critical regions protected by spin locks.
Spin locks are usually convenient because many kernel resources are locked for only a fraction of a millisecond; it would be far more time-consuming to release the CPU and reacquire it later.
Kernel preemption is still enabled during the busy-wait phase, so a process waiting for a spin lock to be released can be replaced by a higher-priority process.


Macro               Description
spin_lock_init()    Set the spin lock to 1 (unlocked)
spin_lock()         Cycle until the spin lock becomes 1 (unlocked), then set it to 0 (locked)
spin_unlock()       Set the spin lock to 1 (unlocked)
spin_unlock_wait()  Wait until the spin lock becomes 1 (unlocked)
spin_is_locked()    Return 0 if the spin lock is set to 1 (unlocked); 1 otherwise
spin_trylock()      Set the spin lock to 0 (locked), and return 1 if the previous value of the lock was 1; 0 otherwise


Spin Lock with kernel preemption




In the preemptible kernel case, the first step is to disable kernel preemption via preempt_disable().
In this example, “locked” is 1 and “unlocked” is 0.

  • Load 1 into the EAX register, since we wish to acquire the lock.
  • The xchg instruction atomically swaps the contents of EAX and the lock variable. After this step, EAX holds the previous value of the lock and the lock variable holds 1.
  • Step 3: test the previous value of the lock (now in EAX). If it is 1, the lock was already held and is not available.
  • In that case, re-enable kernel preemption and jump (spin) back.
  • If in step 3 the previous value (in EAX) is 0, the lock variable now holds 1, i.e. the lock has been acquired.

A process might spin for a long time. If the break_lock field is set, the owning process can learn that other processes are waiting for the lock, and it may decide to release it prematurely to let them proceed.
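The steps above can be sketched as a userspace test-and-set spin lock, where C11 `atomic_flag_test_and_set` plays the role of the xchg instruction (a hedged sketch: `spin_lock_`, `spin_unlock_`, and `spinlock_demo` are illustrative names, and userspace has no preempt_disable()):

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static long shared_count = 0;

static void spin_lock_(void)
{
    /* Atomically set the flag and get its previous value, like xchg:
     * keep spinning until the previous value was 0 (unlocked). */
    while (atomic_flag_test_and_set_explicit(&lock_flag,
                                             memory_order_acquire))
        ;   /* busy-wait */
}

static void spin_unlock_(void)
{
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}

static void *spin_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock_();
        shared_count++;          /* critical region */
        spin_unlock_();
    }
    return NULL;
}

long spinlock_demo(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, spin_worker, NULL);
    pthread_create(&b, NULL, spin_worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared_count;
}
```

Note the convention here matches the x86 listing (1 = locked), not the spin_lock() macro table above (1 = unlocked); only the test against the previous value matters.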



Read/Write Spin Locks


Read/write spin locks are used to increase concurrency within the kernel. The idea is to allow several kernel control paths to read the same data structure simultaneously by acquiring the read spin lock; no kernel control path can modify the data structure while holding only the read lock. 


To modify the shared data structure, a kernel control path needs to acquire the write spin lock. The write spin lock is granted only when no kernel control path holds the read spin lock.

Readers and writers have the same priority in this case.



SeqLocks

With read/write spin locks, readers and writers have the same priority: a reader must wait until the writer has finished and vice versa.
SeqLocks are similar to read/write locks except that they give a much higher priority to writers: in fact, a writer is allowed to proceed even when readers are active.

The writer never waits, but a reader may be forced to read the data several times until it gets a valid copy. Each reader must read the sequence counter twice, before and after reading the data, and check whether the two values are the same. If they are not equal, a writer has become active and increased the sequence counter, implicitly telling the reader that the data just read is not valid.

Every time a writer acquires and releases the sequence lock, it increments the sequence counter. An odd counter means a write is in progress; an even counter means the writer is done.
When a reader enters the critical region, it does not need to disable kernel preemption; the writer, on the other hand, automatically disables kernel preemption when entering the critical region, because it acquires a spin lock.

SeqLocks are safe to use only when both of the following conditions hold:

  • The data structure to be protected does not include pointers that are modified by the writers and dereferenced by the readers (otherwise, a writer could change a pointer under the nose of the readers).
  • The code in the readers' critical regions has no side effects (otherwise, multiple reads would have different effects from a single read).
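The odd/even counter protocol can be sketched in a few lines of C. This is a hedged single-writer sketch (`seq_write`/`seq_read` are illustrative names): it omits the spin lock that serializes writers in the kernel, and it glosses over the formal C11 data-race rules that kernel seqlocks handle with explicit memory barriers:

```c
#include <stdatomic.h>

/* The writer makes the counter odd while updating and even again when
 * done; a reader retries if the counter was odd or changed under it. */
static _Atomic unsigned seq = 0;
static int shared_a, shared_b;

void seq_write(int a, int b)
{
    atomic_fetch_add_explicit(&seq, 1, memory_order_release); /* odd: busy */
    shared_a = a;
    shared_b = b;
    atomic_fetch_add_explicit(&seq, 1, memory_order_release); /* even: done */
}

void seq_read(int *a, int *b)
{
    unsigned start;
    do {
        do {
            start = atomic_load_explicit(&seq, memory_order_acquire);
        } while (start & 1);        /* a writer is in progress: wait */
        *a = shared_a;
        *b = shared_b;
        /* If the counter moved while we were reading, the snapshot may
         * be torn: throw it away and retry. */
    } while (atomic_load_explicit(&seq, memory_order_acquire) != start);
}
```

The writer never blocks on readers, which is exactly the priority inversion of read/write locks described above.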



Read Copy Update (RCU)

This technique is designed to protect data structures that are mostly accessed for reading by several CPUs. 
The key idea consists of limiting the scope of RCU as follows:

  1. Only data structures that are dynamically allocated and referenced by means of pointers can be protected by RCU.
  2. No kernel control path can sleep inside a critical region protected by RCU.
Writer
  • When a writer wants to update the data structure, it dereferences the pointer and makes a copy of the whole data structure. 
  • Next, the writer modifies the copy. 
  • Once finished, the writer changes the pointer to the data structure so as to make it point to the updated copy. 

Because changing the value of the pointer is an atomic operation, each reader or writer sees either the old copy or the new one: no corruption in the data structure may occur.

Memory barrier is required to ensure that the updated pointer is seen by the other CPUs only after the data structure has been modified. 

Spin lock is coupled with RCU to forbid the concurrent execution of writers.

[Problem] Old copy of the data structure cannot be freed right away when the writer updates the pointer. In fact, the readers that were accessing the data structure when the writer started its update could still be reading the old copy. The old copy can be freed only after all (potential) readers on the CPUs have executed.
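The copy-then-publish sequence can be sketched with a C11 atomic pointer. This is a hedged userspace analogue, not the kernel API (which uses rcu_dereference(), rcu_assign_pointer(), and synchronize_rcu(); `rcu_style_update`, `rcu_style_read`, and `struct config` are illustrative names). Freeing the old copy is deliberately left to the caller, mirroring the problem described above:

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct config { int threshold; };

/* The shared pointer that every reader dereferences. */
static struct config *_Atomic current_cfg;

/* Writer side: copy the structure, modify the copy, then publish it
 * with a release store (the memory barrier the text describes). */
struct config *rcu_style_update(int new_threshold)
{
    struct config *old  = atomic_load(&current_cfg);
    struct config *copy = malloc(sizeof(*copy));
    if (old)
        memcpy(copy, old, sizeof(*copy));
    copy->threshold = new_threshold;
    atomic_store_explicit(&current_cfg, copy, memory_order_release);
    return old;  /* free only once no pre-existing reader can hold it */
}

/* Reader side: one atomic pointer load, then lock-free access. */
int rcu_style_read(void)
{
    struct config *c =
        atomic_load_explicit(&current_cfg, memory_order_acquire);
    return c ? c->threshold : -1;
}
```

Because the pointer swap is atomic, a reader sees either the old copy or the new one, never a half-updated structure.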


Semaphore

Please visit Semaphore Page







Saturday, September 21, 2013

Binary number operations

Binary Number Operations


Addition

0 + 0 → 0
0 + 1 → 1
1 + 0 → 1
1 + 1 → 0, carry 1

For e.g

  1 1 1 1 1    (carried digits)
    0 1 1 0 1
+   1 0 1 1 1
-------------
= 1 0 0 1 0 0 = 36

Subtraction

0 − 0 → 0
0 − 1 → 1, borrow 1
1 − 0 → 1
1 − 1 → 0

Subtracting a "1" digit from a "0" digit produces the digit "1", while
1 must be subtracted from the next column. This is known as
borrowing.

    *   * * *   (starred columns are borrowed from)
  1 1 0 1 1 1 0
−     1 0 1 1 1
----------------
= 1 0 1 0 1 1 1

Multiplication

           1 0 1 1   (A)
         × 1 0 1 0   (B)
         ---------
           0 0 0 0   ← Corresponds to the rightmost 'zero' in B
   +     1 0 1 1     ← Corresponds to the next 'one' in B
   +   0 0 0 0
   + 1 0 1 1
   ---------------
   = 1 1 0 1 1 1 0

Binary Multiplication for binary point
                 1 0 1 . 1 0 1       A (5.625 in decimal)
                 × 1 1 0 . 0 1       B (6.25  in decimal)
             -------------------
                   1 0 1 1 0 1    ← Corresponds to a 'one' in B
     +           0 0 0 0 0 0      ← Corresponds to a 'zero' in B
     +         0 0 0 0 0 0
     +       1 0 1 1 0 1
     +     1 0 1 1 0 1
     ---------------------------
     = 1 0 0 0 1 1 . 0 0 1 0 1  (35.15625 in decimal)
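The worked examples above (integer ones) can be checked in C by parsing the binary strings with the standard strtol in base 2; `bin` is just an illustrative helper name:

```c
#include <stdlib.h>

/* Parse a binary digit string into a long, so the worked examples can
 * be verified in plain C. */
long bin(const char *s)
{
    return strtol(s, NULL, 2);
}
```

For instance, bin("01101") + bin("10111") == bin("100100") reproduces the 13 + 23 = 36 addition above.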


Negative Binary number

How can we represent a negative number? We cannot use a ‘-‘ sign because all we can store in the computer is zeros and ones.

There are three methods

  • Signed Magnitude
  • 1’s Complement
  • 2’s complement
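Of the three, 2's complement is what virtually all hardware uses: invert every bit and add one. A small sketch for an 8-bit value (`twos_complement` is an illustrative name):

```c
#include <stdint.h>

/* Two's complement of an 8-bit value: invert all bits, then add one.
 * E.g. 5 = 0000 0101 -> ~5 = 1111 1010 -> +1 = 1111 1011 = 0xFB,
 * which is how -5 is stored in a byte. */
uint8_t twos_complement(uint8_t x)
{
    return (uint8_t)(~x + 1);
}
```

A handy property: a value plus its two's complement always wraps to zero in the byte, which is what makes subtraction reusable as addition.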

Tuesday, September 17, 2013

World of Bits and Bytes



Recently, I have realized that no matter how much C, C++, or other high-level languages you know, it all comes down to the bits-and-bytes view of a program. 
I have also realized that it's not that easy to actually think and write in bit/byte manipulation. This is my attempt to learn and become more comfortable thinking in bits and bytes.




Basics

Introduction link


 0 1 0 0 0 1 1 1 1 0 0 0 0 1 1 1 0 1 1 1 0 1 0 0 0 1 1 1 1 0 0 0
||              |               |               |              ||
|+- bit 31      |               |               |       bit 0 -+|
|               |               |               |               |
+-- BYTE 3 -----+--- BYTE 2 ----+--- BYTE 1 ----+-- BYTE 0 -----+
|                               |                               |
+----------- WORD 1 ------------+----------- WORD 0 ------------+
|                                                               |
+--------------------------- DWORD -----------------------------+

Hexadecimal Numbers
0 1 2 3 4 5 6 7 8 9 A B C D E F


Bitwise Operators
The & operator (AND)
1   &   1   ==   1
1   &   0   ==   0
0   &   1   ==   0
0   &   0   ==   0

The | operator (OR)
1  |   1   ==   1
1  |   0   ==   1
0  |   1   ==   1
0  |   0   ==   0

The ^ operator (XOR)
1  ^   1   ==   0
1  ^   0   ==   1
0  ^   1   ==   1
0  ^   0   ==   0

The ~ operator
The ~ (Ones Complement or inversion) operator acts only on one value and it inverts it.

The << (Left Shift)
00001100  - b
00110000  - b << 2

The >> (Right Shift)
00001100  - b
00000011  - b >> 2

Another example is
1<<4; 0001 0000



Bit Fields 



struct date_struct {
    unsigned int day   : 5,   // 1 to 31
                 month : 4,   // 1 to 12
                 year  : 14;  // 0 to 9999
} date, *dateptr;             // bit fields must use int/unsigned int

|0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|
  |                           |       |         |
  +------ year ---------------+ month +-- day --+

date.day = 12;
dateptr = &date;
dateptr->year = 1852;


Basics of Binary numbers and operations (link) 


Problems 

How to set a single bit in a byte?

 E.g. in byte 0000 1000, setting bit no. 6 produces 0100 1000
 (remember bit numbers run from 0 to 7)
         
//For problems where certain bit values needs to be changed, first we
//need to create a bit mask.
//Bit mask is a temporary variable with some value. Using this value
//we will access and change specific bits in a byte of data.

//For e.g.
//To set 6th bit in a byte 0000 1000
//We have             MASK 0100 0000  (OR)
//                    ——————————————
//                         0100 1000

//To turn on certain bit in a byte (OR) is used.

int set_bit(int val, int num, bool bitval)
{
    return (val | (bitval << num));
}
//Here, val    = 0000 1000 = 8
//      num    = 6 (set 6th bit)
//      bitval = 1 (set to 1)

//      0000 1000
//(OR)  0100 0000  (1 << 6)
//      ————————-
//      0100 1000

How to unset a single bit in a byte?

E.g. in byte 0100 1000, unsetting bit no. 6 produces 0000 1000
(remember bit numbers run from 0 to 7)        

//To unset a specific bit we use the (AND) operation.
//The mask value needs to be '0' for the bit to unset, but the rest of
//the bits need to be '1'. The reason the rest of the bits are set to
//'1' is that, since we are doing (AND), we don't want to clear other
//bits which are already set.
//
//For e.g.
//To unset 6th bit 0100 1000
//            MASK 1011 1111 (AND)
//                ——————————-
//                 0000 1000   (Result)
int unset_bit(int val, int num)
{
    return (val & ~(1 << num));
}

//Here, val = 0100 1000
//      num = 6
//      (1 << num) = 0100 0000
//     ~(1 << num) = 1011 1111
//
//
//      0100 1000
//      1011 1111 (AND)
//      —————————-
//      0000 1000    (Result)


One function to set and unset

int change_bit(int val, int num, bool bitval)
{
    // first clear the bit, then OR in the desired value
    return ((val & ~(1 << num)) | (bitval << num));
}
}


Unset range of bits
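The body of this section appears to be missing, so here is a sketch of how it could work (the name `unset_range` and the inclusive lo..hi convention are my assumptions): build a mask with 1s across the range, invert it, and AND, just like unset_bit but for several bits. It assumes the range is narrower than the word, since shifting by a full word width is undefined in C.

```c
/* Clear bits lo..hi (inclusive, 0-indexed from the least significant
 * bit) of val. Assumes 0 <= lo <= hi < 31. */
unsigned unset_range(unsigned val, int lo, int hi)
{
    unsigned width = hi - lo + 1;
    unsigned mask  = ((1u << width) - 1) << lo;  /* 1s across the range */
    return val & ~mask;                          /* AND with 0s there   */
}
```

E.g. clearing bits 2..5 of 1111 1111 gives 1100 0011 (0xC3).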



Set range of bits
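This section's body also appears to be missing; a sketch under the same assumptions as above (`set_range` is an assumed name, range inclusive, narrower than the word): build the same range mask and OR it in, just like set_bit but for several bits.

```c
/* Set bits lo..hi (inclusive, 0-indexed from the least significant
 * bit) of val. Assumes 0 <= lo <= hi < 31. */
unsigned set_range(unsigned val, int lo, int hi)
{
    unsigned width = hi - lo + 1;
    unsigned mask  = ((1u << width) - 1) << lo;  /* 1s across the range */
    return val | mask;                           /* OR turns them on    */
}
```

E.g. setting bits 2..5 of 0000 0000 gives 0011 1100 (0x3C).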





Links



Sunday, September 15, 2013

Process synchronization in OS

This post takes a very simplistic approach to processes and process synchronization.



What is a Process?

The OS objective is to keep as many of the computer's resources as busy as possible. A process is used to keep track of all the things an OS must remember about the state of a user program.

Process = Code + Allocated Resources + Bookkeeping information

A process is like a box, a complete entity in itself, which performs the step-by-step task written in the program. More formally, it is called a program in execution.

Let's consider a very basic operating system with the least complexity: it can run only one process at a time. Since only one process is running at a time, it may happen that not all the resources occupied by the process are used at the same time.


To maximize resource utilization, we need multiple entities running at the same time. Logically, that means either multiple processes running at the same time, or lightweight entities running inside a process.


Let's explore the second option. Consider the process as a box with resources inside it. We create multiple children of the process, which are called threads.


A thread is a child of a process and hence uses the resources the process has. Theoretically, there is no limit on the number of child threads a process can have, but the process should have enough resources for the administrative needs of these threads. 


Once there are multiple threads, they are going to ask for the same resource at the same time. For example, if two children are in one room, they will always fight for the same toy. The same applies to threads. 


3 Issues with Sharing



  1. How to share data?
  2. How to ensure threads in a process execute one at a time?
  3. How to ensure proper sequencing of events?

To understand it better, let's take a real-world example.


Carl’s Jr. Restaurant


Process


  1. Customer arrives
  2. Employee takes order
  3. Employee cooks food
  4. Employee bags food
  5. Employee takes money
  6. Customer gets food and leaves
If a single employee performs steps 1-6, all other customers have to wait in line, and it is going to be a long wait.
Instead, let's have multiple employees for taking orders, cooking food, bagging food, and taking money. Each of these 'employees' is a thread in the process 'Restaurant'. Each thread is responsible for a specialized task.

Let's relate the 3 issues to the current situation:
  1. What is the shared data? In steps 2-3, the quantity of food; in steps 3-4, how much food to bag.
  2. Does sequence matter? The cook can't cook food until the order arrives. The employee can't bag food until it is cooked. So sequencing matters.
Shared data can be distributed either by message passing or by storing the data in the global memory of the process, with each thread reading from that memory location.

The next logical question is: how to ensure threads in a process execute one at a time?
More formally, there are three categories of solutions:
  1. Algorithmic approach
  2. Software Primitives
  3. Concurrent programming construct

Algorithmic approach

The algorithmic approach to process synchronization does not use any assistance from the computer architecture or the OS kernel. Instead it uses an arrangement of logical conditions to satisfy the desired synchronization requirements. [Dhamdhere]

Software Primitives

A set of software primitives for mutual exclusion, e.g. semaphores, locks, etc., was developed to overcome the logical complexity of algorithmic implementations. They are implemented using special architectural features of computer systems. But ease of use and correctness still remained major obstacles in the development of large concurrent systems.

Semaphores

It is a shared integer variable with non-negative values that has initialization, wait, and signal as indivisible operations.

Semaphore Constructor
Semaphore Destructor
Semaphore Wait
Semaphore Signal
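The four operations listed above map directly onto POSIX semaphores: sem_init is the constructor, sem_destroy the destructor, sem_wait the wait, and sem_post the signal. A hedged sketch (`semaphore_demo` and `producer_thr` are illustrative names) in which the consumer sleeps, rather than spins, until each item is signaled:

```c
#include <semaphore.h>
#include <pthread.h>

static sem_t items;     /* counts available items, starts at 0 */
static int produced;

static void *producer_thr(void *arg)
{
    (void)arg;
    for (int i = 0; i < 5; i++) {
        produced++;        /* safe: consumer reads only after the post */
        sem_post(&items);  /* signal: one more item available */
    }
    return NULL;
}

int semaphore_demo(void)
{
    pthread_t p;
    sem_init(&items, 0, 0);          /* thread-shared, initial count 0 */
    pthread_create(&p, NULL, producer_thr, NULL);
    for (int i = 0; i < 5; i++)
        sem_wait(&items);            /* block (sleep) until an item is posted */
    pthread_join(p, NULL);
    sem_destroy(&items);
    return produced;
}
```

Because sem_wait blocks instead of busy-waiting, this is the "lock with blocking wait (sleep)" behaviour, in contrast to a spin lock.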

Locks

The basic idea is to close a lock at the start of critical section or an indivisible operation and open it at the end of the critical section or the indivisible operation.

Locks solve "how to ensure threads in a process execute one at a time?" but not the sequencing problem.


Lock Constructor

Lock Destructor

Lock Acquire

Lock Release

Lock Owner
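The lock operations listed above map naturally onto a POSIX mutex: the static initializer (or pthread_mutex_init) is the constructor, pthread_mutex_destroy the destructor, pthread_mutex_lock acquire, and pthread_mutex_unlock release. A hedged sketch (`mutex_demo` and `depositor` are illustrative names):

```c
#include <pthread.h>

static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;  /* constructor */
static long balance = 0;

static void *depositor(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lk);    /* acquire: close the lock */
        balance++;                  /* critical section */
        pthread_mutex_unlock(&lk);  /* release: open the lock */
    }
    return NULL;
}

long mutex_demo(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, depositor, NULL);
    pthread_create(&b, NULL, depositor, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return balance;
}
```

This solves mutual exclusion, but nothing here can make one thread wait for an event caused by another, which is exactly the sequencing gap monitors fill below.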




Concurrent Programming Construct


Locks can only solve the mutual exclusion problem; they cannot solve the sequencing problem. We need another mechanism: monitors.

A monitor is a programming language construct that supports both data access synchronization and control synchronization.


Monitors have 3 parts



  1. Lock for mutual exclusion
  2. 1 or more condition variables for sequencing
  3. Monitor variables to make sequencing decisions -> shared data 

Condition Variables

Each condition variable is only associated with one lock.

CV Constructor

CV Destructor
CV Wait
CV Signal

Producer-Consumer Problem

Let's assume we have an infinite buffer.

monitor variable  => int itemCount = 0;
monitor lock      => monitorLock;
monitor condition => needItem;
Producer
Consumer
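The monitor above can be sketched with pthreads: monitorLock becomes a mutex, needItem a condition variable, and itemCount the monitor variable. This is a hedged sketch (`producer_consumer_demo` is an illustrative name); the buffer is "infinite", so only the consumer ever waits:

```c
#include <pthread.h>

static pthread_mutex_t monitorLock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  needItem    = PTHREAD_COND_INITIALIZER;
static int itemCount = 0;   /* monitor variable: items in the buffer */
static int consumed  = 0;

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10; i++) {
        pthread_mutex_lock(&monitorLock);
        itemCount++;                     /* add an item to the buffer */
        pthread_cond_signal(&needItem);  /* wake a waiting consumer */
        pthread_mutex_unlock(&monitorLock);
    }
    return NULL;
}

int producer_consumer_demo(void)
{
    pthread_t p;
    pthread_create(&p, NULL, producer, NULL);
    for (int i = 0; i < 10; i++) {
        pthread_mutex_lock(&monitorLock);
        while (itemCount == 0)                        /* sequencing */
            pthread_cond_wait(&needItem, &monitorLock);
        itemCount--;                                  /* take an item */
        consumed++;
        pthread_mutex_unlock(&monitorLock);
    }
    pthread_join(p, NULL);
    return consumed;
}
```

Note the `while` (not `if`) around pthread_cond_wait: the condition must be rechecked after every wakeup, since a signal only hints that the monitor variable may have changed.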



Recommended reading