Chapter 2. Understanding Shared Buffer (PostgreSQL 9.6 Performance Story)

I am working on publishing the English version of “PostgreSQL 9.6 Performance Story” on Amazon. Below is a preview of Chapter 2.

The shared buffer is an essential component for efficient IO processing.

Because of this importance, the DBMS buffer manager is highly optimized. In other words, if you set the Shared Buffer to a reasonable size (several gigabytes, tens of gigabytes, or even hundreds of gigabytes), the Shared Buffer itself may never cause a performance problem.

Then why do we need to know how the Shared Buffer works? It is a fair question.

This is also the question I had when I started studying the Shared Buffer.

However, as I repeated my research on the Shared Buffer, I found that it has several distinctive characteristics: the IO strategy, the ring buffer, and the Clock Sweep algorithm. Without understanding these characteristics accurately, it can be difficult to determine the cause of a sudden slowdown in IO processing.

So let’s get to the point.

Three goals of Shared Buffer for performance improvement


The purpose of the Shared Buffer is to improve IO performance by minimizing disk IO. To do so, it must achieve the following three goals.

  1. Very large (tens, hundreds of gigabytes) buffers should be accessed quickly.
  2. Contention should be minimized when many users access at the same time.
  3. Frequently used blocks should be in the buffer for as long as possible.

The first and second goals are related to the Shared Buffer structure, and the third goal is related to the Shared Buffer algorithm. Let’s take a closer look.

Very large (tens, hundreds of gigabytes) buffers should be accessed quickly.

To accomplish this goal, you should be able to find blocks in the shared buffer very quickly. Also, if the block is not in the Shared Buffer, you should be able to quickly see that the block does not exist. The details are described in ‘Shared Buffer Architecture’.

Contention should be minimized when many users access at the same time.

Shared Buffer is a shared resource used by many users simultaneously. A lock mechanism is required to protect shared resources. In order to minimize the contention at the time of acquiring the lock, a method using several locks is required. The details are described in ‘Shared Buffer Architecture’.

Frequently used blocks should be in the buffer for as long as possible.

This goal is related to the efficiency of Shared Buffer management. In terms of performance, it is very important to keep frequently accessed blocks in shared buffers for as long as possible. To do this, PostgreSQL uses the Clock Sweep algorithm. The details are described in ‘Clock Sweep Algorithm’.

Shared Buffer Architecture


Shared buffers consist of four major components.

  1. Hash table
  2. Hash elements (and element keys)
  3. Buffer descriptor to manage buffer status
  4. Buffer pool to store the actual block

Shared Buffer structure is as follows.
Figure 2-1. Shared Buffer structure

Hash Table

Hash tables are a very effective data structure when managing (retrieving, inputting) buffers in memory. However, a hash table has a problem in that performance is degraded when a hash collision occurs.

Note   A hash collision is a computer-science term meaning that different input values produce the same hash value. From a buffer perspective, this means that different blocks are chained to the same hash bucket.

One way to reduce the hash collision problem is to divide the hash table into logical N segments. This is called a ‘segmented hash table’, and PostgreSQL uses this method. (See Figure 2-2)
Figure 2-2. Segmented hash table structure

What is DIRECTORY?

A segmented hash table is a hash table divided into N logical hash segments. Therefore, another array is needed to indicate the starting position of each of the N segments. This array is called a ‘directory’. The default directory size is 256 entries, and as you set the Shared Buffer larger, the directory size increases. For more information, see ‘Hash table size and number of hash segments?’.

What is the size of the hash segment?

One hash segment consists of 256 buckets.

Note   A bucket is an array element. That is, the number of buckets is the length of the array.

Hash table size and number of hash segments?

Through the source, let’s see how to calculate the hash table size. (Readers who are not interested in the internal structure can skip this part)

Src 2-1. src/backend/utils/hash/dynahash.c

#define DEF_SEGSIZE                256
#define DEF_SEGSIZE_SHIFT          8    /* must be log2(DEF_SEGSIZE) */
#define DEF_DIRSIZE                256
#define DEF_FFACTOR                1    /* default fill factor */

Src 2-2. src/include/storage/lwlock.h

#define NUM_BUFFER_PARTITIONS  128

Src 2-3. src/backend/utils/hash/dynahash.c

nbuckets = next_pow2_int((nelem - 1) / hctl->ffactor + 1);
hctl->max_bucket = hctl->low_mask = nbuckets - 1;
nsegs = (nbuckets - 1) / hctl->ssize + 1;
nsegs = next_pow2_int(nsegs);

The variables needed to understand the calculation method are as follows.

  • nelem: Number of buffers + NUM_BUFFER_PARTITIONS
  • hctl->ffactor: Means fill factor. The default setting is 1
  • hctl->max_bucket: Maximum bucket index of the hash table (nbuckets - 1)
  • nsegs: Number of hash segments

Let’s take a closer look at the formula. (Assuming that the block size is 8 KiB and the shared buffer is 1 GiB)

nelem

The nelem value is ‘number of buffers + NUM_BUFFER_PARTITIONS’.
This value is used as an input value to calculate the number of hash table buckets. In the case of 1 GiB, the number of buffers is 131,072 and the value of NUM_BUFFER_PARTITIONS is 128. Therefore, the value of nelem is 131,200.

nelem = (1 GiB / 8KiB = 131,072) + (NUM_BUFFER_PARTITIONS=128) 
= 131,200

nbuckets

The nbuckets value represents the number of hash table buckets.

If you apply the following formula, the number of hash table buckets is 262,144. This value is twice the actual buffer count of 131,072. The reason why the number of hash table buckets is set to be large is to minimize hash collisions.

nbuckets = next_pow2_int((nelem - 1) / hctl->ffactor + 1);
= next_pow2_int((131,200 - 1) / 1 + 1) 
= next_pow2_int(131,200)
= 262,144

In the above equation, hctl->ffactor means the fill factor, and the default value is 1. next_pow2_int() is a function that returns the smallest power of 2 greater than or equal to the input value. Therefore, it returns 2^18 (262,144), which is the first power of 2 larger than the input value 131,200.

Note   The formula shows that the larger the fill factor, the smaller the number of buckets, and the smaller the fill factor, the greater the number of buckets. The number of buckets can be increased or decreased by adjusting the fill factor. To minimize the hash collision, it is preferable to use a default value of ‘1’.

hctl->max_bucket

The hctl->max_bucket value is the maximum bucket index of the hash table. Since array indexes are zero-based, it is set to nbuckets - 1.

nsegs

nsegs means the number of hash segments. From the results below, it can be seen that the number of hash segments is 1,024 when the shared buffer is 1 GiB. Since the number of hash segments is 1,024, the directory size is also set to 1,024.

nsegs = (nbuckets - 1) / hctl->ssize + 1
= (262,144 - 1) / 256 + 1
= 1,024
nsegs = next_pow2_int(nsegs)
= next_pow2_int(1,024)
= 1,024

In the above formula, hctl->ssize represents the number of buckets per hash segment, and the default value is 256.
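The sizing walkthrough above can be condensed into a small self-contained sketch. The constants match the quoted defines, but next_pow2_int() and the two helper functions are simplified re-implementations for illustration, not the actual dynahash.c code:

```c
#include <stdint.h>

#define DEF_SEGSIZE           256
#define DEF_FFACTOR           1
#define NUM_BUFFER_PARTITIONS 128

/* smallest power of 2 >= n (assumes n > 0) */
static uint32_t next_pow2_int(uint32_t n)
{
    uint32_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* number of hash buckets for a given element count (nelem) */
static uint32_t calc_nbuckets(uint32_t nelem)
{
    return next_pow2_int((nelem - 1) / DEF_FFACTOR + 1);
}

/* number of hash segments (and thus directory entries) */
static uint32_t calc_nsegs(uint32_t nbuckets)
{
    return next_pow2_int((nbuckets - 1) / DEF_SEGSIZE + 1);
}
```

For a 1 GiB shared buffer, calc_nbuckets(131,072 + 128) returns 262,144 and calc_nsegs(262,144) returns 1,024, matching the walkthrough.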

What is a buffer partition?

Directories, hash segments, and hash tables are all shared resources in the shared buffer. Shared resources are protected using LW (Light Weight) locks. That is, the backend process must acquire LW locks to access shared memory. Although there is a positive aspect that LW lock protects shared resources, there is a problem that performance degradation may occur due to LW lock contention.

For example, if there is one LW lock that manages a hash table, most processes accessing the hash table will wait for the LW lock. To solve this problem, PostgreSQL uses a method of dividing the hash table into N ‘buffer partitions’ and allocating one LW lock per buffer partition.

What is the number of buffer partitions?

Up to version 9.4, there were only 16 buffer partitions. As a result, LW lock contention was likely to occur during buffer partition access in high-concurrency environments, and this actually caused performance problems. To solve this, the number of buffer partitions was increased to 128 in version 9.5. As a result, the number of LW locks that manage buffer partitions also increased from 16 to 128. The number of buffer partitions is determined by the NUM_BUFFER_PARTITIONS constant.

Note   LW lock is an acronym for Light Weight lock. As the name suggests, LW locks are very lightweight locks used to protect access to shared memory. LW locks are different from the locks that protect data in tables. LW locks are described in the section ‘Spin lock and LW lock’. ORACLE calls this kind of lock a latch.

Hash Element

Let’s take a closer look through source analysis. (You may skip this part if you are not interested in the internal structure.) A hash element consists of an element and an element key. Let’s look at the element first.

Element component 

The element consists of a pointer to the next element and a hashvalue. (See Src 2-4) The hashvalue is calculated from the BufferTag.

Src 2-4. src/include/utils/hsearch.h

typedef struct HASHELEMENT {
  struct HASHELEMENT *link; /* link to next entry in same bucket */
  uint32 hashvalue;         /* hash function result for this entry */
} HASHELEMENT;

Element KEY component

The element KEY is constructed as follows. (See Figure 2-3)

  • The element KEY consists of a BufferTag structure and a buffer descriptor array index. (See Src 2-5)
  • The BufferTag structure consists of the RelFileNode structure, forkNum, and blockNum. (See Src 2-6)
  • The RelFileNode structure consists of a tablespace number, a database number, and an object number. (See Src 2-7)

Src 2-5. /src/backend/storage/buffer/buf_table.c

typedef struct {
  BufferTag   key;      /* Tag of a disk page */
  int         id;       /* Associated buffer ID */
} BufferLookupEnt;

Src 2-6. /src/include/storage/buf_internals.h

typedef struct buftag {
  RelFileNode rnode;         /* physical relation identifier */
  ForkNumber  forkNum;
  BlockNumber blockNum;      /* blknum relative to begin of reln */
} BufferTag;

Src 2-7. /src/include/storage/relfilenode.h

typedef struct RelFileNode {
  Oid         spcNode;       /* tablespace */
  Oid         dbNode;        /* database */
  Oid         relNode;       /* relation */
} RelFileNode;

Figure 2-3. Element KEY component

BufferTag

A BufferTag is, so to speak, the social security number of a block. In other words, it consists of data that uniquely identifies each block within the database cluster. To do this, the following structure is used.

  • The BufferTag is stored in the BufferTag structure.
  • The BufferTag structure consists of the RelFileNode structure, forkNum, and blockNum.
  • The RelFileNode structure consists of a tablespace number, a database number, and an object number. Together, these uniquely identify an object within the cluster database.
  • forkNum refers to the object type: 0 is the table (or index), 1 is the FSM, 2 is the VM. (See Src 2-8)
  • blockNum means the block number.

Src 2-8. /src/include/common/relpath.h

typedef enum ForkNumber {
  InvalidForkNumber   = -1,
  MAIN_FORKNUM  =  0,
  FSM_FORKNUM,
  VISIBILITYMAP_FORKNUM,
  INIT_FORKNUM
} ForkNumber;
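To make the structures above concrete, here is a small self-contained sketch. The type names mirror Src 2-5 through 2-8, but the integer typedefs, the helper functions, and the sample OID values are simplified stand-ins for illustration, not the actual PostgreSQL definitions:

```c
#include <stdint.h>
#include <string.h>

/* simplified stand-ins for the PostgreSQL scalar types */
typedef uint32_t Oid;
typedef uint32_t BlockNumber;
typedef int32_t  ForkNumber;   /* 0 = MAIN, 1 = FSM, 2 = VM */

typedef struct RelFileNode {
    Oid spcNode;               /* tablespace */
    Oid dbNode;                /* database */
    Oid relNode;               /* relation */
} RelFileNode;

typedef struct BufferTag {
    RelFileNode rnode;         /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;      /* block number within the fork */
} BufferTag;

/* fill in a tag identifying one specific block */
static void init_buffer_tag(BufferTag *tag, RelFileNode rnode,
                            ForkNumber forkNum, BlockNumber blockNum)
{
    tag->rnode = rnode;
    tag->forkNum = forkNum;
    tag->blockNum = blockNum;
}

/* two tags are equal iff they identify the same block
 * (all fields are 4-byte, so memcmp is safe in this sketch) */
static int buffer_tags_equal(const BufferTag *a, const BufferTag *b)
{
    return memcmp(a, b, sizeof(BufferTag)) == 0;
}
```

Two backends reading the same table block build identical BufferTags, which is what lets the hash table find the already-cached buffer.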

Hash element memory allocation

Let’s take a quiz before proceeding. At what point is the hash element memory structure allocated?

  1. Pre-allocate a certain number of times at DB start
  2. Allocation at the time of request after DB startup

The correct answer is number one.

PostgreSQL pre-allocates an array of ‘number of buffers + NUM_BUFFER_PARTITIONS’ hash elements at DB startup. Each time a buffer is allocated, an element is taken from the tail of the array.

Multiple freeList

Up to version 9.5, one freeList managed the entire hash element array. Therefore, when many processes requested buffers at the same time, the freeList became a bottleneck. To solve this problem, the number of freeLists was increased to 32 in version 9.6.

Therefore, if the shared buffer is 1 GiB, 4,100 hash elements are managed per freeList. The calculation is as follows.

Shared Buffer = 1 GiB
=> Number of hash elements that are initially allocated
(131,200) = Number of buffers (131,072) + NUM_BUFFER_PARTITIONS (128)

=> Number of hash elements managed by one freeList
(4,100) = 131,200 / NUM_FREELISTS (32)

I mentioned above that elements are ‘taken from the tail’. This is related to how the hash elements are linked. As shown in Figure 2-4, each of the 32 hash element arrays is linked from back to front, and each freeList in the HASHHDR structure points to the hash element at the tail of its array. In other words, 32 freeLists manage 32 hash element arrays, which reduces contention when allocating buffers.
Figure 2-4. The hash element array status immediately after DB startup
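As a rough sketch of the idea: the element pool is split evenly across the freeLists, and in the 9.6 dynahash.c a hash code is mapped to a freeList by a simple modulo (the FREELIST_IDX macro). The helper names below are illustrative, not the real identifiers:

```c
#include <stdint.h>

#define NUM_FREELISTS 32

/* a partitioned hash table picks the freeList from the hash code,
 * so concurrent backends usually touch different freeLists */
static int freelist_idx(uint32_t hashcode)
{
    return (int) (hashcode % NUM_FREELISTS);
}

/* elements initially managed per freeList
 * (for 1 GiB: 131,200 elements / 32 lists = 4,100) */
static int elems_per_freelist(int nelem)
{
    return nelem / NUM_FREELISTS;
}
```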

Buffer Descriptor

The buffer descriptor is a structure for managing buffer metadata. Let’s take a closer look at the components through the source. (Readers who are not interested in the internal structure can skip this part)

Buffer descriptor component

Src 2-9. src/include/storage/buf_internals.h

typedef struct BufferDesc {
  BufferTag  tag;               /* ID of page contained in buffer */
  int        buf_id;            /* buffer's index number (from 0) */

/* state of the tag, containing flags, refcount and usagecount */
  pg_atomic_uint32 state;

  int        wait_backend_pid;  /* backend PID of pin-count waiter */
  int        freeNext;          /* link in freelist chain */
  LWLock     content_lock;      /* to lock access to buffer contents */
} BufferDesc;

The main components of the buffer descriptor structure are as follows.

  • tag: Stores the BufferTag.
  • buf_id: The index into the ‘buffer pool’ array where the actual buffer is stored.
  • wait_backend_pid: To access a buffer, the buffer PIN must be obtained; if it cannot be obtained, the process must wait. This field stores the process ID of the process waiting for the buffer PIN.
  • content_lock: The LW lock needed to access the buffer contents. LW lock acquisition is described in the section ‘Spin lock and LW lock’.

A detailed description of the state and freeNext variable follows.

state variable

Until version 9.5, the number of processes pinning the buffer (refcount), the number of buffer accesses (usage_count), and the buffer status (flags) were stored in three separate variables.

However, since 9.6, these three variables are provided as one state variable. (See Figure 2-5)
Figure 2-5. State variable component

Bit operation for information acquisition

Since the data is stored in units of bits, it is necessary to perform bit mask and bit shift operations to extract the data. The bit operation method is as follows.

Src 2-10. src/include/storage/buf_internals.h

#define BUF_REFCOUNT_ONE 1
#define BUF_REFCOUNT_MASK ((1U << 18) - 1)
#define BUF_USAGECOUNT_MASK 0x003C0000U
#define BUF_USAGECOUNT_ONE (1U << 18)
#define BUF_USAGECOUNT_SHIFT 18
#define BUF_FLAG_MASK 0xFFC00000U

/* Get refcount and usage_count from buffer state */
#define BUF_STATE_GET_REFCOUNT(state) ((state) & BUF_REFCOUNT_MASK)
#define BUF_STATE_GET_USAGECOUNT(state) (((state) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT)

For example, a bit mask operation and a right shift operation are used to extract the usage_count value with BUF_STATE_GET_USAGECOUNT. (See Figure 2-6)
Figure 2-6. Procedure for obtaining usage_count using bit operation
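The macros from Src 2-10 can be exercised directly. The make_state() helper below is an illustrative addition (not in the PostgreSQL source) that packs a state word the opposite way:

```c
#include <stdint.h>

/* layout of the 32-bit state word:
 * refcount: bits 0-17, usage_count: bits 18-21, flags: bits 22-31 */
#define BUF_REFCOUNT_MASK    ((1U << 18) - 1)
#define BUF_USAGECOUNT_MASK  0x003C0000U
#define BUF_USAGECOUNT_SHIFT 18
#define BUF_FLAG_MASK        0xFFC00000U

/* same logic as BUF_STATE_GET_REFCOUNT */
static uint32_t get_refcount(uint32_t state)
{
    return state & BUF_REFCOUNT_MASK;
}

/* same logic as BUF_STATE_GET_USAGECOUNT: mask, then shift right */
static uint32_t get_usagecount(uint32_t state)
{
    return (state & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT;
}

/* illustrative: pack the three fields into one state word */
static uint32_t make_state(uint32_t flags, uint32_t usagecount,
                           uint32_t refcount)
{
    return (flags & BUF_FLAG_MASK)
         | ((usagecount << BUF_USAGECOUNT_SHIFT) & BUF_USAGECOUNT_MASK)
         | (refcount & BUF_REFCOUNT_MASK);
}
```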

flags

The flags field shows 10 buffer states using 10 bits.

For example, the BM_IO_IN_PROGRESS bit, the fifth flag bit (the 27th bit of the 32-bit word), indicates the ‘disk IO in progress’ state. When the buffer is undergoing disk IO, this bit must be set to 1; to do so, 1 is shifted left 26 times. (1U means the unsigned integer 1.)

Src 2-11. src/include/storage/buf_internals.h

#define BM_LOCKED                       (1U << 22)
#define BM_DIRTY                        (1U << 23)
#define BM_VALID                        (1U << 24)
#define BM_TAG_VALID                    (1U << 25)
#define BM_IO_IN_PROGRESS               (1U << 26)
#define BM_IO_ERROR                     (1U << 27)
#define BM_JUST_DIRTIED                 (1U << 28)
#define BM_PIN_COUNT_WAITER             (1U << 29)
#define BM_CHECKPOINT_NEEDED            (1U << 30)
#define BM_PERMANENT                    (1U << 31)

freeNext variable

freeNext is a link to the next empty buffer.

Immediately after DB startup all buffers are empty, so each freeNext points to the next array element. When a buffer allocation request is received, it operates as shown in Figure 2-7.

That is, the firstFreeBuffer entry in the BufferStrategyControl structure points to the freeNext link header, and the lastFreeBuffer entry points to the freeNext link tail.

After all buffers have been allocated, there is no empty buffer. Therefore, if a buffer allocation request is received after this point, a victim buffer should be selected. (Here, firstFreeBuffer has a value of -1.) The details will be explained in the ‘Clock Sweep Algorithm’ section.
Figure 2-7. FreeNext link changes due to buffer allocation
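The freeNext chain can be modeled with a few lines of C. This is a deliberately simplified, single-threaded model (no locks, a tiny buffer count, hypothetical names); only the chain manipulation mirrors the idea in the real freelist code:

```c
#define NBUFFERS    4
#define END_OF_LIST (-1)   /* models FREENEXT_END_OF_LIST */

typedef struct {
    int freeNext;          /* index of the next free buffer */
} MiniBufferDesc;

typedef struct {
    int            firstFreeBuffer;   /* head of the free chain */
    MiniBufferDesc descs[NBUFFERS];
} MiniStrategyControl;

/* right after startup, every buffer is free and chained in order */
static void init_freelist(MiniStrategyControl *ctl)
{
    int i;

    ctl->firstFreeBuffer = 0;
    for (i = 0; i < NBUFFERS; i++)
        ctl->descs[i].freeNext = (i + 1 < NBUFFERS) ? i + 1 : END_OF_LIST;
}

/* pop the head of the free chain; returns buf_id, or -1 if exhausted */
static int pop_free_buffer(MiniStrategyControl *ctl)
{
    int buf = ctl->firstFreeBuffer;

    if (buf != END_OF_LIST)
        ctl->firstFreeBuffer = ctl->descs[buf].freeNext;
    return buf;
}
```

Once pop_free_buffer() returns -1 (firstFreeBuffer is -1), every subsequent request must select a victim buffer via the Clock Sweep algorithm.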

Spin lock and LW lock

Spin locks and LW locks must be acquired when accessing the Shared Buffer. If you are familiar with ORACLE, think of a spin lock as a mutex and an LW lock as a latch.
The reason for distinguishing spin locks from LW locks is performance. An LW lock is a very light lock, and a spin lock is lighter still. Therefore, it is advantageous to use a spin lock whenever possible. The differences between the two are summarized in Table 2-1.

Table 2-1. Difference between spin lock and LW lock

Item                 Spin lock                                LW lock
Workload             Very, very low                           Very low
Context switching    Does not occur                           May occur
How it works         Spin                                     Queueing & posting
Purpose of use       Accessing variables within a structure   Accessing a structure
Lock mode            EXCLUSIVE                                SHARE & EXCLUSIVE

Spin lock

A spin lock is a very lightweight lock used for operations that can be completed in a very short time. Even if another process has already acquired the spin lock, the waiting process is very likely to acquire it after a few spin loops. Since a spinning process does not go to sleep, no context switching occurs.

Spin lock implementation method

There are two ways to implement spin locks.

  • Use mutex
  • Use TAS (Test & Set) with an inline assembly language

PostgreSQL uses the second method.
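As an illustration of the TAS approach, here is a minimal spin lock sketch using the GCC/Clang __sync builtins. The real s_lock.c uses per-platform inline assembly plus spin-delay and stuck-spinlock handling, none of which is shown here:

```c
typedef volatile int slock_t;

/* test-and-set: atomically write 1 and return the previous value;
 * a return of 0 means we acquired the lock */
static int tas(slock_t *lock)
{
    return __sync_lock_test_and_set(lock, 1);
}

static void s_lock(slock_t *lock)
{
    while (tas(lock))
    {
        /* spin; the real code adds a CPU-relax instruction,
         * random backoff, and a stuck-spinlock timeout here */
    }
}

static void s_unlock(slock_t *lock)
{
    __sync_lock_release(lock);   /* atomically write 0 */
}
```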

LW lock

PostgreSQL uses the ‘queueing & posting’ approach to acquire LW locks.

  • Suppose you want to retrieve a hash element with a hash bucket. In this case, LW lock should be acquired in SHARED mode because it is in read mode.
  • Assume that you enter BufferTag information in a hash element. In this case, LW lock must be acquired in EXCLUSIVE mode because it is write mode.

LW lock acquisition procedure

The pseudo code for LW lock acquisition is as follows.

LWLockAcquire(LWLock *lock, LWLockMode mode)
LOOP
  Attempt to acquire the LW lock by calling LWLockAttemptLock().

  If the lock is acquired, escape the LOOP.

  If the lock cannot be acquired, register in the wait queue.

  Call LWLockAttemptLock() one more time.

  If the lock is acquired, delete the entry registered in the
  wait queue and escape the LOOP.

  If the lock cannot be acquired, start waiting.
  (Wait until another process wakes us up.)
END LOOP

Reading buffers from Shared Buffer


So far we have looked at the major components of Shared Buffer. The main points are summarized as follows.

  • Shared Buffer consists of hash table, hash element, buffer descriptor and buffer pool.
  • The hash table is an array structure, and uses a segmented hash table structure to minimize hash collisions.
  • The hash table is also divided into logical partitions to increase concurrency.
  • At this time, there is one LW lock per partition.
  • The hash element consists of an element and an element key.
  • The element consists of a hashvalue for the BufferTag and a pointer to the next element.
  • The element key stores the BufferTag and the buffer id.
  • The BufferTag is, so to speak, the social security number of a block.
  • BufferTag consists of database number, table space number, object number, fork number and block number.
  • MAIN object fork number is 0, FSM is 1, VM is 2.
  • Beginning with version 9.6, hash elements are managed as 32 freeLists. This is for the purpose of eliminating contention.
  • The buffer descriptor is a structure array that manages the buffer meta information. At this time, the number of arrays is equal to the number of buffers.
  • The main meta information managed by the buffer descriptor is refcount, usage_count, flags, freeNext, and so on.

In this section, let’s look at the following two flow charts.

  1. Reading a block in the Shared Buffer
  2. When Disk Read occurs

When reading blocks in Shared Buffer

It is assumed that the shared buffer size is 256 MiB and that the number of blocks currently loaded in the shared buffer is three. The BufferTag of the buffer to read is assumed to be ‘Tag_B’. (See Figure 2-8)
Figure 2-8. Current Shared Buffer Status

Note   The values stored in an actual BufferTag are not values like ‘Tag_A’ and ‘Tag_B’. As described earlier, the BufferTag stores values that uniquely identify a block. However, for convenience of explanation, simple values such as ‘Tag_A’ are used.

The size of each element shown in Figure 2-8 is shown in Table 2-2.

Table 2-2. Size of each element

Item                          Calculation method                                          Value
Number of buffers             256 MiB / 8 KiB                                             32,768
Number of buffer partitions   #define NUM_BUFFER_PARTITIONS                               128
nelem                         Number of buffers + number of buffer partitions             32,896
hctl->ffactor                 #define DEF_FFACTOR                                         1
Hash table array              nbuckets = next_pow2_int((nelem - 1) / hctl->ffactor + 1)   65,536
(number of buckets)
Hash segment size             #define DEF_SEGSIZE                                         256
Number of hash segments       Hash table array size / hash segment size                   256
Number of freeLists           #define NUM_FREELISTS                                       32
Hash element pool size        nelem (= 1,028 * 32)                                        32,896
Buffer descriptor array       Number of buffers                                           32,768
Buffer pool array             Number of buffers                                           32,768

Processing order

The order of reading the buffers in the Shared Buffer is as follows. (See Figure 2-9)

  1. Create a BufferTag. (‘Tag_B’ is generated)
    1. Calculate the hashvalue using the generated BufferTag.
    2. Calculate the buffer partition number using the generated hashvalue.
  2. Obtain the LW lock for the buffer partition # in SHARE mode.
  3. Compute the hash table bucket number using hashvalue.
    1. Replace the bucket number with the hash segment number and index number.
    2. Search for ‘Tag_B’ along the hash chain.
  4. Since the buffer ID of Tag_B is 1, set the PIN in the buffer descriptor array index [1]. (In this case, increase refcount and usage_count by 1 respectively)
  5. Release LW lock on buffer partition #.
  6. Read buffer pool array index [1].
  7. Release the PIN. (At this time, reduce refcount by 1)


Figure 2-9. Block read processing order in Shared Buffer
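Step 1-2 above, mapping a hash value to a buffer partition, is a simple modulo in the PostgreSQL source (the BufTableHashPartition macro). A sketch:

```c
#include <stdint.h>

#define NUM_BUFFER_PARTITIONS 128

/* mirrors BufTableHashPartition() in buf_internals.h:
 * the partition (and thus the LW lock to take) is chosen
 * by a modulo on the hash value */
static uint32_t buf_hash_partition(uint32_t hashcode)
{
    return hashcode % NUM_BUFFER_PARTITIONS;
}
```

Since two different BufferTags usually hash to different partitions, backends working on unrelated blocks rarely compete for the same partition LW lock.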

CALC_BUCKET function

Let’s look closely at the source to see how the bucket number is found in the hash table during this process. The function that finds the hash bucket is calc_bucket(). (See Src 2-12)

Src 2-12. src/backend/utils/hash/dynahash.c

/* Convert a hash value to a bucket number */
static uint32
calc_bucket(HASHHDR *hctl, uint32 hash_val)
{
    uint32      bucket;

    bucket = hash_val & hctl->high_mask;
    if (bucket > hctl->max_bucket)
        bucket = bucket & hctl->low_mask;
    return bucket;
}

The processing logic of calc_bucket() is very simple. The bucket number is obtained by a bitwise AND of the input hash_val and hctl->high_mask. If the resulting bucket value is greater than hctl->max_bucket, a bitwise AND with hctl->low_mask is performed. The values used here are shown in Table 2-3.

Table 2-3. HASHHDR value

Item                Value       Calculation formula
hctl->high_mask     131,071     (nbuckets << 1) - 1
hctl->low_mask      65,535      nbuckets - 1

The caller of calc_bucket() computes the hash segment number and the index number as follows. (See Table 2-4)

HASHHDR    *hctl = hashp->hctl;
bucket      = calc_bucket(hctl, hashvalue);
segment_num = bucket >> hashp->sshift;
segment_ndx = MOD(bucket, hashp->ssize);

Table 2-4. HASHHDR value

Item            Value    Calculation formula
hashp->sshift   8        #define DEF_SEGSIZE_SHIFT 8
hashp->ssize    256      #define DEF_SEGSIZE 256

To obtain the hash segment number, the bucket value is shifted right by 8 bits, which is equivalent to dividing by 256. The index within the hash segment is the remainder of dividing the bucket value by 256. This can be expressed as follows. (See Figure 2-10)
Figure 2-10. Procedure for finding bucket numbers in a table
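The whole path from hash value to segment and index can be sketched as follows, hard-coding the 256 MiB values from Tables 2-3 and 2-4 (the real code reads these from the HASHHDR):

```c
#include <stdint.h>

#define DEF_SEGSIZE       256
#define DEF_SEGSIZE_SHIFT 8

/* values from Table 2-3: 256 MiB shared buffer, 65,536 buckets */
#define MAX_BUCKET  65535U
#define LOW_MASK    65535U    /* nbuckets - 1        */
#define HIGH_MASK   131071U   /* (nbuckets << 1) - 1 */

/* same logic as calc_bucket(), with the header fields
 * replaced by the constants above */
static uint32_t calc_bucket(uint32_t hash_val)
{
    uint32_t bucket = hash_val & HIGH_MASK;

    if (bucket > MAX_BUCKET)
        bucket = bucket & LOW_MASK;
    return bucket;
}

/* split a bucket number into (segment number, index within segment),
 * as the calc_bucket() caller does */
static void bucket_to_segment(uint32_t bucket,
                              uint32_t *segment_num,
                              uint32_t *segment_ndx)
{
    *segment_num = bucket >> DEF_SEGSIZE_SHIFT;   /* bucket / 256   */
    *segment_ndx = bucket % DEF_SEGSIZE;          /* bucket mod 256 */
}
```

For example, bucket 1,000 lands in segment 3 (1,000 / 256) at index 232 (1,000 mod 256).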

When DISK Read occurs

To read a block that does not exist in the Shared Buffer, a disk read must be performed. That is, the block is first loaded into the shared buffer via a disk read, and then the corresponding buffer is read. Assume that the current shared buffer state is as shown in Figure 2-8. The order of reading the block whose BufferTag is ‘Tag_D’ is as follows.

Processing Order (Part 1)

  1.  Create a BufferTag. (‘Tag_D’ is generated)
    1. Calculate the hashvalue using the generated BufferTag.
    2. Calculate the buffer partition number using the generated hashvalue.
  2. Obtain the LW lock for the buffer partition # in SHARE mode.
  3. Compute the hash table bucket number using hashvalue.
    1. Replace the bucket number with the hash segment number and index number.
    2. Search for ‘Tag_D’ along the hash chain; the search fails.
  4. Release LW lock on buffer partition #.

This is the same as reading a block in the Shared Buffer. The only difference is that there is no block in the hash chain. The processing procedure so far is shown in Figure 2-11.
Figure 2-11. Processing sequence when a disk read occurs (Part 1)

Processing Order (Part 2)

The processing continues as follows. (See Figure 2-12)

  1. Obtain the firstFreeBuffer value of the BufferStrategyControl structure.
    1. Obtain a spin lock to change the value of firstFreeBuffer.
    2. Change the value of firstFreeBuffer from 3 to 4.
    3. Release the spin lock.
  2. Set the PIN in the buffer descriptor array index [3]. (Set refcount to 1 at this time)

After that, the hash element is connected to the hash table.

  1. Obtain the LW lock for the buffer partition # in EXCLUSIVE mode.
  2. Allocate one element from the hash element pool.
    1. Move the freeList pointer to the element preceding the element assigned in step 2.
  3. Attach the hash element to the hash chain and copy the record.
  4. Set the buffer header lock on buffer descriptor array index [3].
    1. Increase usage_count by 1.
    2. Release the buffer header lock.
  5. Release the LW lock set on the buffer partition #.


Figure 2-12. Processing sequence when a disk read occurs (Part 2)

Processing Order (Part 3)

The processing continues as follows. (See Figure 2-13)

  1. Acquire the LW lock in EXCLUSIVE mode to access BufferIOLockArray index [3].
  2. Set the buffer header lock on buffer descriptor array index [3].
    1. Set the BM_IO_IN_PROGRESS flag bit to 1.
    2. Release the buffer header lock.
  3. Load the block into buffer pool array index [3].
  4. Set the buffer header lock on buffer descriptor array index [3].
    1. Set the BM_IO_IN_PROGRESS flag bit to 0.
    2. Release the buffer header lock.
  5. Release the LW lock acquired to access BufferIOLockArray index [3].
  6. Read buffer pool array index [3].
  7. Release the PIN. (At this time, reduce refcount by 1.)


Figure 2-13. Processing sequence when a disk read occurs (Part 3)

Clock Sweep Algorithm for Buffer Replacement

In the previous example, there was an empty buffer in the shared buffer. If there is no empty buffer, one of the existing buffers must be evicted (and written to disk first if it is dirty).

The buffer chosen for eviction is called the victim buffer, and the algorithm for selecting the victim buffer is called the buffer replacement algorithm.

Since the purpose of Shared Buffer is to improve performance by minimizing DISK Read, it is very important to use a Buffer Replacement algorithm that can manage buffers efficiently.

Description of the Clock Sweep algorithm

PostgreSQL uses a clock sweep algorithm for buffer replacement.

This algorithm is a kind of NFU (Not Frequently Used) algorithm that selects less used buffers as victim buffers.

To select a less-used buffer, the number of accesses per buffer must be tracked. As described earlier, the access count is kept as usage_count inside the state variable of the buffer descriptor, and usage_count is incremented by 1 each time the buffer is accessed.

More precisely, it only increases up to the value defined by BM_MAX_USAGE_COUNT; beyond that it does not increase. The default value of BM_MAX_USAGE_COUNT is 5. Some readers may find 5 surprisingly small; I thought so, too.

However, once you understand the Clock Sweep algorithm, you will see that it is a very reasonable value. The reason for capping the count at such a small value is ‘fair competition’. A more detailed explanation follows the description of the Clock Sweep algorithm.

The ‘clock’ in Clock Sweep is an intuitive representation of the algorithm’s operating principle: the algorithm searches the buffers clockwise to find a victim. Since the buffer descriptor is an array, index 0 and the maximum index are joined into a logical circle. The search for a victim buffer starts at array index 0; when a victim buffer is found, it is returned and the search stops. The next search starts from the following index.

The ‘sweep’ in Clock Sweep means that the search cleans as it goes: looking at the source code, usage_count is decremented by 1 for each buffer the clock hand passes. This is analogous to sweeping.

In short, the Clock Sweep algorithm selects a buffer whose refcount and usage_count are both zero as the victim buffer. If the victim buffer is dirty (the BM_DIRTY bit is 1), the contents of the buffer are written to disk first.

Clock Sweep Algorithm

The Clock Sweep algorithm consists of very simple code. An excerpt from the actual source follows. (For convenience of explanation, only a part is shown.)

Src 2-13. src/backend/storage/buffer/freelist.c

for (;;) {
 (1)    buf = GetBufferDescriptor(ClockSweepTick());
 (2)    local_buf_state = LockBufHdr(buf);

 (3)    if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
 (4)        if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
                local_buf_state -= BUF_USAGECOUNT_ONE;
 (5)        else  return buf;
 (6)    UnlockBufHdr(buf, local_buf_state);
 }

A description of the source follows.

  1. Call the ClockSweepTick() function to determine where to start searching for a victim buffer. The function returns the nextVictimBuffer value from the BufferStrategyControl structure.
  2. Acquire the buffer header lock.
  3. If refcount is 0, perform step (4). Otherwise, perform step (6).
  4. If usage_count is not 0, decrease usage_count by 1.
  5. If usage_count is 0, return the buffer.
  6. Release the buffer header lock and repeat from step (1).

Now readers have understood the basic principles of the Clock Sweep algorithm. The clock sweep algorithm is illustrated in Figure 2-14.


Figure 2-14. Clock Sweep Procedure

Step by step explanation is as follows.

Step-1

nextVictimBuffer points to buffer_id = 5, which is not pinned. However, since its usage_count is 2, the algorithm decreases usage_count to 1 and moves to the next buffer.

Step-2

nextVictimBuffer points to buffer_id = 6, which is a pinned buffer. Therefore, the algorithm moves on to the next buffer.

Step-3

nextVictimBuffer points to buffer_id = 7, which is unpinned and has a usage_count of zero. Therefore, it is selected as the victim buffer.

What is fair competition?

I mentioned earlier that the BM_MAX_USAGE_COUNT default is deliberately set somewhat small because of ‘fair competition’. If you understand the clock sweep algorithm and are familiar with ORACLE’s touch count algorithm, you may already have a sense of what ‘fair competition’ means.

To understand ‘fair competition’, think about when ‘unfair competition’ occurs. For example, suppose a particular buffer is accessed a few hundred thousand times in a very short time, and never again afterwards. In this case, is it fair for the buffer to remain in the Shared Buffer for a long time? Of course not. Therefore, a buffer replacement algorithm that uses an access count should include logic to eliminate such unfair competition.

ORACLE, for example, does not increment touch_count for accesses to the same buffer occurring within N seconds. Also, when a buffer moves to the HOT area, its touch_count is reset to zero.

PostgreSQL achieves the same effect by setting BM_MAX_USAGE_COUNT to the rather small value of 5. In other words, usage_count is still only 5 even if the same buffer is accessed thousands of times within N seconds. Thus, no matter how many accesses a buffer received at some point, it is chosen as the victim buffer if it is not accessed again before the clock hand has passed it six times. (See Figure 2-15)


Figure 2-15. Fair competition

From another perspective, decreasing usage_count by 1 on every clock sweep is itself fair: a buffer that really is accessed regularly and frequently will always keep a usage_count greater than 1.

IO Strategy and Ring Buffer for Bulk IO Processing


PostgreSQL uses the clock sweep algorithm to improve the efficiency of the shared buffer. But how does it solve the problem that a Seq Scan can flood the shared buffer?

PostgreSQL solves this problem with the IO strategy and the Ring Buffer.

What is IO strategy?

The IO strategy means applying a different strategy depending on the IO type. PostgreSQL defines four IO types. (See Src 2-14)

Src 2-14. src/include/storage/bufmgr.h

typedef enum BufferAccessStrategyType {
  BAS_NORMAL,      /* Normal random access */
  BAS_BULKREAD,    /* Large read-only scan */
  BAS_BULKWRITE,   /* Large multi-block write (e.g. COPY IN) */
  BAS_VACUUM       /* VACUUM */
} BufferAccessStrategyType;

That is, IO requests are classified as NORMAL (random access), BULK READ (large read-only scans), BULK WRITE (bulk writes such as COPY IN), and VACUUM. All types except NORMAL use a Ring Buffer.

Ring Buffer

A Ring Buffer is a logically circular array.

That is, by using a circular array of a fixed size, it prevents a Seq Scan from flooding the shared buffer. Note that the Ring Buffer itself is allocated within the Shared Buffer.

Ring Buffer Size

The size of the ring buffer differs slightly depending on the IO type. (See Src 2-15)

Src 2-15. src/backend/storage/buffer/freelist.c

switch (btype)
{
    case BAS_NORMAL:
        return NULL;
    case BAS_BULKREAD:
        ring_size = 256 * 1024 / BLCKSZ;
        break;
    case BAS_BULKWRITE:
        ring_size = 16 * 1024 * 1024 / BLCKSZ;
        break;
    case BAS_VACUUM:
        ring_size = 256 * 1024 / BLCKSZ;
        break;
}

The Ring Buffer size for BULK READ is 256 KiB (32 blocks).

BULK READ Definition

So what exactly qualifies as a BULK READ? Is every Seq Scan a BULK READ?
No. PostgreSQL uses the BULK READ strategy only for Seq Scans on tables larger than one-quarter of the shared buffer. The table size is taken from the statistics (relpages).

How Ring Buffer works in BULK READ

The ring buffer size for BULK READ is 256 KiB (32 blocks). But if you think about it a bit more, some questions arise. Let’s answer the following questions.

Question 1

There is a table on which BULK READs are performed very frequently.
If the table is read 100 times in a row, is the ring buffer for that table 32 blocks, or 3,200 blocks?

Question 2

If the same table is read twice in one query with UNION ALL, is the ring buffer 32 blocks, or 64 blocks?

The correct answer is as follows.

1) 3,200 blocks

The reason another 32-block ring buffer is allocated on every SQL execution is fairness of buffer usage. The Ring Buffer was introduced to prevent a Seq Scan on a large table from flooding the shared buffer, but if the table is accessed frequently by many SQL statements, limiting it to a single 32-block ring would be unfair in the other direction.

2) 64 blocks

Each time a BULK READ target table appears in a single SQL statement, a separate 32-block ring buffer is allocated.

Let’s test it.

Check Ring Buffer operation principle

1. Install the pg_buffercache extension for testing.

create extension pg_buffercache;

2. Since the Shared Buffer is 256 MiB, create a table larger than one-quarter of 256 MiB (64 MiB).

drop table b1;
create table b1 (c1 char(1000), c2 char(1000));
insert into b1 select i, i from generate_series(1,35000) a(i);
analyze b1;

select relname, relpages, relpages*8192/1024/1024 as size
from   pg_class where relname='b1';
 relname | relpages | size
---------+----------+------
 b1      |     8750 |   68 

3. Restart the DB for testing.

4. Seq Scan the b1 table.

select count(*) from b1;

5. Running the check script shows that 32 blocks have been loaded into the shared buffer.

Shared Buffer check script execution result (see below for script contents)
buffers
---------
 32

6. Execute the query once more.

select count(*) from b1;

7. Running the check script shows that 64 blocks have been loaded into the Shared Buffer.

buffers
---------
 64

If you repeat the test several times, you can see that the number of blocks loaded into the shared buffer is ‘number of executions * 32 blocks’.

8. Restart the DB for testing.

9. Perform a UNION ALL statement.

select count(*) from b1
union all
select count(*) from b1;

10. Running the check script shows that 64 blocks have been loaded into the Shared Buffer.

buffers
---------
 64

11. Perform one more time.

select count(*) from b1
union all
select count(*) from b1;

12. Running the check script shows that 128 blocks have been loaded into the shared buffer.

buffers
---------
 128

Script. Query for Shared Buffer check

select count(*) as buffers
from   pg_buffercache a, pg_class b
where  a.relfilenode = pg_relation_filenode(b.oid)
and    a.reldatabase in (0, (select oid
from   pg_database
where  datname=current_database()))
and    b.relname='b1';

Risk of Ring Buffer

So far, we have seen that PostgreSQL uses the Ring Buffer to prevent large BULK READs from damaging the shared buffer, and that repeating a BULK READ enlarges the cached portion, improving table access efficiency.

But there is still a risk.

For example, suppose the data suddenly grows and the table becomes larger than one-quarter of the shared buffer. In that case, after a DB restart (or after the table’s buffers have aged out of the shared buffer), Seq Scans on the table are performed using the BULK READ strategy, that is, through the ring buffer. Query performance may therefore be slow. Let’s test it.

1. Create a table slightly smaller than one-quarter of the shared buffer.

drop table s1;
create table s1 (c1 char(1000), c2 char(1000));
insert into s1 select i, i from generate_series(1,32000) a(i);
analyze s1;

select relname, relpages, relpages*8192/1024/1024 "SIZE"
from   pg_class where relname='s1';
 relname | relpages | SIZE
---------+----------+------
 s1      |     8000 |   62 

2. Restart the DB for testing.

3. For testing, create a function that performs the Seq Scan repeatedly.

CREATE or replace FUNCTION loop_fullscan(v_begin integer, v_end integer)
RETURNS VOID AS $$
DECLARE
rval integer;
BEGIN
   FOR i in v_begin..v_end LOOP
      SELECT COUNT(*) FROM s1 INTO rval;
   END LOOP;
END;
$$ LANGUAGE plpgsql;

4. Measure the time to Seq Scan the table 100 times.

select loop_fullscan(1,100);
 loop_fullscan
 ---------------
 (1 row)
Time: 605.653 ms

5. Insert an additional 1,000 rows so that the table becomes larger than one-quarter of the shared buffer.

insert into s1 select * from s1 limit 1000;
analyze s1;

select relname, relpages, relpages*8192/1024/1024 "SIZE"
from   pg_class where relname='s1';
 relname | relpages | SIZE
---------+----------+------
 s1      |     8250 |   64 

6. Restart the DB for testing.

7. Measure the time to Seq Scan the table 100 times.

select loop_fullscan(1,100);
 loop_fullscan
 ---------------
 (1 row)
Time: 1689.020 ms

The test shows that using the Ring Buffer degrades performance.

Of course, results will vary somewhat with table size and test environment. Nevertheless, it is clear that the Ring Buffer slows down execution.

So how can we solve this problem?

Fortunately, PostgreSQL provides the pg_prewarm extension to solve this problem.

pg_prewarm Extension

The pg_prewarm extension loads tables and indexes into the shared buffer. (It can even load a table larger than one-quarter of the shared buffer.) However, the loaded buffers do not reside permanently in the shared buffer; they can still be evicted.

Let’s run the previous test again using pg_prewarm().

1. Create the pg_prewarm extension.

create extension pg_prewarm;

2. Restart the DB for testing.

3. Perform pg_prewarm.

select pg_prewarm('s1');
 pg_prewarm
 ------------
 8250
Time: 51.789 ms

4. Measure the time to Seq Scan the table 100 times.

select loop_fullscan(1,100);
 loop_fullscan
 ---------------
Time: 545.811 ms 

The test shows that pg_prewarm() greatly improves query performance. You should therefore use pg_prewarm() judiciously when improving the performance of batch programs.

Summary


Let’s briefly summarize what we learned in this chapter.

First, you learned the Shared Buffer structure. The Shared Buffer consists of a hash table, hash elements, buffer descriptors, and the buffer pool.

Next, we learned the order of memory reads and disk reads. I hope you can now close your eyes, picture the Shared Buffer structure, and trace the IO processing flow.

We also learned the clock sweep algorithm for efficient shared buffer management. Understanding the algorithm explains why BM_MAX_USAGE_COUNT is not set too large and what fair competition means.

Finally, you learned about IO strategies and ring buffers, which protect the shared buffer from BULK READs, as well as how the ring buffer works, its drawbacks, and the pg_prewarm() extension that overcomes them.
