# CDDL HEADER START
#
# The contents of this file are subject to the terms of the
# Common Development and Distribution License (the "License").
# You may not use this file except in compliance with the License.
#
# You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
# or http://www.opensolaris.org/os/licensing.
# See the License for the specific language governing permissions
# and limitations under the License.
#
# When distributing Covered Code, include this CDDL HEADER in each
# file and include the License file at usr/src/OPENSOLARIS.LICENSE.
# If applicable, add the following below this CDDL HEADER, with the
# fields enclosed by brackets "[]" replaced with your own identifying
# information: Portions Copyright [yyyy] [name of copyright owner]
#
# CDDL HEADER END
#
#
# Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
# Use is subject to license terms.
#

TITLE: Dynamic Memory Implementation Overview

DATE:  10/13/2000

AUTHOR: Jim Guerrera (james.guerrera@east)


1.0 Dynamic Memory Implementation in the SCM Module

The system memory allocation performed by the Storage Cache Manager (SCM)
has been modified to conform more fully to the requirements of the Solaris
OS. The previous implementation allocated the package's total memory
requirement 'up front' during bootup and never released it. The current
implementation performs 'on demand' allocations, piecemeal, at the time
memory is required. In addition, the requisitioned memory is released back
to the system at some later time.

2.0 Implementation

2.1 Memory Allocation

Memory allocation involves modifications primarily to sd_alloc_buf() in
module sd_bcache.c. When a request is received for cache and system
resources, it is broken down and each piece is categorized both as an
independent entity and as a member of a group with close neighbors. Cache
resources comprise cache control entries (ccent), write control entries
(wctrl, for FWC support) and system memory. The current allocation algorithm
for ccent and wctrl remains the same. The memory allocation has been modified
and falls into two general categories - single page and multi-page
allocations.

2.1.1 A single page allocation means exactly that - the ccent points to and
owns one page of system memory. In a multi-page allocation, two or more
ccent are requisitioned to support the caching request, and only the first
entry in the group actually owns the allocated memory of two or more pages.
The secondary entries simply point to page boundaries within this larger
piece of contiguous memory. The first entry is termed a host and the
secondaries are termed parasites.

The process for determining what is a host, a parasite or anything else is
done in three phases. Phase one determines whether the caching request
references a disk area already in cache and marks it as such. If it is not
in cache it is typed as eligible - i.e. needing memory allocation. Phase
two scans this list of typed cache entries and, based on its immediate
neighbors, categorizes each entry as host or pest (parasite), or downgrades
it to other. A host can only exist if one or more eligible entries
immediately follow it and it itself either starts the list or immediately
follows a non-eligible entry. If either condition proves false the category
remains eligible (i.e. needs memory allocation) but the type is cleared to
not host (i.e. other). Phase three scans the cache entry list and allocates
multipage memory for hosts, allocates single pages for others, and sets up
pointers in each parasitic entry into its corresponding host's multipage
memory allocation block.

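As an illustration of phase two, the following C sketch classifies a list
of entries that phase one has already typed. It is not the sd_alloc_buf()
code: the enum, the function name and the rule that a chain is cut once it
reaches max_list entries (modeling the adjustable limit of Sec. 2.1.2) are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry types; the names are illustrative, not the driver's. */
enum entry_type {
	ENT_IN_CACHE,	/* phase one: already cached, no allocation needed */
	ENT_ELIGIBLE,	/* phase one: needs memory */
	ENT_HOST,	/* phase two: owns a multipage block */
	ENT_PEST,	/* phase two: points into its host's block */
	ENT_OTHER	/* phase two: gets its own single page */
};

/*
 * Phase two: rewrite every ENT_ELIGIBLE entry in place as host, pest or
 * other. max_list is the longest permitted host/parasite chain, counting
 * the host (assumed here to play the role of sdbc_max_dyn_list).
 */
void
classify_entries(enum entry_type *ents, size_t n, size_t max_list)
{
	size_t i = 0;

	assert(ents != NULL || n == 0);
	if (max_list < 1)
		max_list = 1;

	while (i < n) {
		size_t run = 0, j;

		if (ents[i] != ENT_ELIGIBLE) {
			i++;
			continue;
		}
		/* Measure the run of eligible entries, capped at max_list. */
		while (i + run < n && run < max_list &&
		    ents[i + run] == ENT_ELIGIBLE)
			run++;

		if (run == 1) {
			/* No eligible neighbor follows: single page. */
			ents[i] = ENT_OTHER;
		} else {
			ents[i] = ENT_HOST;
			for (j = 1; j < run; j++)
				ents[i + j] = ENT_PEST;
		}
		i += run;
	}
}
```

With max_list set to 1 every eligible entry becomes an other, which matches
the statement that a list length of 1 removes the host/parasite concept.
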
2.1.2 The maximum number of parasitic entries following a host memory
allocation is adjustable by the system administrator. The details are given
in the description of the KSTAT interface (Sec. 3.0).

2.2 Memory Deallocation

Memory deallocation is implemented in sd_dealloc_dm() in module sd_io.c.
This possibly overly complicated routine works as follows:

In general the routine sleeps for a specified amount of time, then wakes and
examines the entire centry list. If an entry is available (i.e. not in use
by another thread and has memory which may be deallocated) it takes
possession and ages the centry by one tick. It then determines whether the
centry has aged sufficiently to have its memory deallocated and to be placed
at the top of the lru.

2.3 There are two general deallocation schemes in place, depending on
whether the centry is a single page allocation centry or a member of a
host/parasite multipage allocation chain.

2.3.1 The behavior for a single page allocation centry is as follows:

If the given centry is selected as a 'holdover' it will age normally;
however, at full aging it will only be placed at the head of the lru.
Its memory will not be deallocated until a further aging level has
been reached. The entries selected for this behavior are governed by
counting the number of these holdovers in existence on each wakeup
and comparing it to a specified percentage. This comparison is always
one cycle out of date, so the holdover count floats in the vicinity of
the specified number.

In addition there is a placeholder for centries identified as 'sticky
meta-data', with its own aging counter. It operates exactly like the holdover
entries with regard to aging but is absolute - i.e. no percentage governs
the number of such entries.

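The single page aging rules can be pictured with the C sketch below. It is
a model only, not sd_dealloc_dm(): the structure, the function names and
the reading of the 'further aging level' as a second counter that starts
once the entry is fully aged are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct centry_model {		/* hypothetical, not the driver's centry */
	unsigned age;		/* wake cycles counted toward full aging */
	unsigned extra_age;	/* further aging once held over */
	bool has_memory;	/* entry still owns its page */
	bool holdover;		/* selected as a holdover */
	bool requeued;		/* placed at the head of the lru */
};

/*
 * One wake cycle for an available single page entry. full_ct models the
 * fully aged count and hold_ct the additional holdover count. Returns
 * true when the entry's memory is released on this cycle.
 */
bool
age_single_page(struct centry_model *ce, unsigned full_ct, unsigned hold_ct)
{
	assert(ce != NULL);
	if (!ce->has_memory)
		return (false);

	if (ce->age < full_ct) {
		ce->age++;
		if (ce->age < full_ct)
			return (false);		/* not yet fully aged */
		ce->requeued = true;		/* fully aged: head of lru */
		if (!ce->holdover) {
			ce->has_memory = false;
			return (true);
		}
		return (false);			/* holdover keeps its page */
	}

	/* Already fully aged; only holdovers reach here with memory. */
	if (++ce->extra_age >= hold_ct) {
		ce->has_memory = false;
		return (true);
	}
	return (false);
}

/*
 * Holdover selection: compare the holdover count taken on the previous
 * wakeup against the permitted percentage of all cache entries.
 */
bool
holdover_allowed(unsigned holds_last_cycle, unsigned total_entries,
    unsigned max_holds_pcnt)
{
	return ((unsigned long long)holds_last_cycle * 100 <
	    (unsigned long long)total_entries * max_holds_pcnt);
}
```

With both counts set to 3, an ordinary entry gives up its memory on the
third wake cycle and a holdover on the sixth. A permitted percentage of 0
means no entry is ever selected as a holdover.
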
2.3.2 The percentage and additional aging count are adjustable by the
system administrator. The details are given in the description of the
KSTAT interface (Sec. 3.0).

2.3.3 The behavior for a host/parasite chain is as follows:

The host/parasite subchain is examined. If all entries are fully aged the
entire chain is removed - i.e. memory is deallocated from the host centry,
all centry fields are cleared and each entry is requeued onto the lru.

There are three sleep times and two percentage levels specifiable by the
system administrator. A meaningful relationship between these variables
is:

sleeptime1 >= sleeptime2 >= sleeptime3 and
100% >= pcntfree1 >= pcntfree2 >= 0%

sleeptime1 is honored between 100% free and pcntfree1. sleeptime2 is
honored between pcntfree1 and pcntfree2. sleeptime3 is honored between
pcntfree2 and 0% free. The general thrust here is to automatically
adjust sleep time to centry load.

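A minimal C sketch of this selection follows. The function name is
hypothetical and the treatment of the exact boundary values (a free
percentage equal to pcntfree1 or pcntfree2) is an assumption, since the
text does not say which side they fall on.

```c
/*
 * Pick the sleep time, in seconds, for the current percentage of free
 * cache entries. Boundary values are assigned to the longer sleep time.
 */
unsigned
select_sleep_secs(unsigned pcnt_free, unsigned pcnt1, unsigned pcnt2,
    unsigned sec1, unsigned sec2, unsigned sec3)
{
	if (pcnt_free >= pcnt1)
		return (sec1);		/* 100% free down to pcntfree1 */
	if (pcnt_free >= pcnt2)
		return (sec2);		/* pcntfree1 down to pcntfree2 */
	return (sec3);			/* pcntfree2 down to 0% free */
}
```

For example, with sleep times of 10, 5 and 1 seconds and levels of 50% and
25%, a cache that is 40% free is serviced every 5 seconds.
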
In addition there exists an accelerated aging flag which mimics hysteresis
behavior. If the percentage of available centries falls between pcntfree1
and pcntfree2 an 8 bit counter is switched on. The effect is to keep the
timer value at sleeptime2 for 8 cycles even if the number of available cache
entries drifts above pcntfree1. If it falls below pcntfree2 an additional
8 bit counter is switched on. This causes the sleep timer to remain at
sleeptime3 for at least 8 cycles even if it floats above pcntfree2 or even
pcntfree1. The overall effect is to accelerate the release of system
resources under what the thread thinks is a heavy load, as measured by the
number of used cache entries.

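The hysteresis behavior can be modeled as two countdowns, as in the C
sketch below. This is an interpretation, not the driver's code: the state
structure, the names and the choice to count the triggering cycle as the
first of the 8 are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define	HOLD_CYCLES	8	/* cycles a shorter sleep time persists */

struct hysteresis {		/* hypothetical state, zero-initialised */
	unsigned level2_left;	/* cycles left held at sleeptime2 */
	unsigned level3_left;	/* cycles left held at sleeptime3 */
};

/*
 * Return the sleep time for this wake cycle and advance the countdowns.
 */
unsigned
next_sleep_secs(struct hysteresis *h, unsigned pcnt_free,
    unsigned pcnt1, unsigned pcnt2,
    unsigned sec1, unsigned sec2, unsigned sec3)
{
	unsigned secs;

	assert(h != NULL);

	/* Re-arm a countdown from the current load. */
	if (pcnt_free < pcnt2)
		h->level3_left = HOLD_CYCLES;
	else if (pcnt_free < pcnt1)
		h->level2_left = HOLD_CYCLES;

	/* The shortest armed sleep time wins. */
	if (h->level3_left > 0)
		secs = sec3;
	else if (h->level2_left > 0)
		secs = sec2;
	else
		secs = sec1;

	/* Each wake cycle consumes one tick of every armed countdown. */
	if (h->level3_left > 0)
		h->level3_left--;
	if (h->level2_left > 0)
		h->level2_left--;
	return (secs);
}
```

Once the free percentage has dipped below pcntfree2, the routine keeps
returning sleeptime3 for 8 consecutive cycles even if the cache is mostly
free again on the very next wakeup.
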
3.0 Dynamic Memory Tuning

A number of behavior modification variables are accessible via system calls
to the kstat library. A sample program exercising the various features can
be found in ./src/cmd/ns/sdbc/sdbc_dynmem.c. In addition the behavior variable
identifiers can be placed in the sdbc.conf file and will take effect on bootup.
There are also a number of dynamic memory statistics available to gauge its
current state.

3.1 Behavior Variables

sdbc_monitor_dynmem --- D0=monitor thread shutdown in the console window
                        D1=print deallocation thread stats to the console
                        window
                        D2=print more deallocation thread stats to the console
                        window
                        (usage: setting a value of 6 = 2+4 sets D1 and D2)
sdbc_max_dyn_list ----- 1 to ?: sets the maximum host/parasite list length
                        (A length of 1 prevents any multipage allocations from
                        occurring and effectively removes the concept of
                        host/parasite.)
sdbc_cache_aging_ct1 -- 1 to 255: fully aged count (everything but meta and
                        holdover)
sdbc_cache_aging_ct2 -- 1 to 255: fully aged count for meta-data entries
sdbc_cache_aging_ct3 -- 1 to 255: fully aged count for holdovers
sdbc_cache_aging_sec1 - 1 to 255: sleep level 1 for 100% to pcnt1 free cache
                        entries
sdbc_cache_aging_sec2 - 1 to 255: sleep level 2 for pcnt1 to pcnt2 free cache
                        entries
sdbc_cache_aging_sec3 - 1 to 255: sleep level 3 for pcnt2 to 0% free cache
                        entries
sdbc_cache_aging_pcnt1- 0 to 100: cache free percent for transition from
                        sleep1 to sleep2
sdbc_cache_aging_pcnt2- 0 to 100: cache free percent for transition from
                        sleep2 to sleep3
sdbc_max_holds_pcnt --- 0 to 100: max percent of cache entries to be maintained
                        as holdovers

3.2 Statistical Variables

Cache Stats (per wake cycle) (r/w):
sdbc_alloc_ct --------- total allocations performed
sdbc_dealloc_ct ------- total deallocations performed
sdbc_history ---------- current hysteresis flag setting
sdbc_nodatas ---------- cache entries w/o memory assigned
sdbc_candidates ------- cache entries ready to be aged or released
sdbc_deallocs --------- cache entries w/memory deallocated and requeued
sdbc_hosts ------------ number of host cache entries
sdbc_pests ------------ number of parasitic cache entries
sdbc_metas ------------ number of meta-data cache entries
sdbc_holds ------------ number of holdovers (fully aged w/memory and requeued)
sdbc_others ----------- number of entries that are not hosts, pests or metas
sdbc_notavail --------- number of cache entries to bypass (nodatas+'in use by
                        other processes')
sdbc_process_directive- D0=1 wake thread
                        D1=1 temporarily accelerate aging (set the hysteresis
                        flag)
sdbc_simplect --------- simple count of the number of times the kstat update
                        routine has been called


3.3 Range Checks and Limits

Only range limits are checked. Internal inconsistencies are not checked
(e.g. pcnt2 > pcnt1). Inconsistencies won't break the system; you just won't
get meaningful behavior.

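Since internal consistency is not enforced, an administration tool may want
to check it before writing new values. The C sketch below shows one way to
do so, using the relationship given in Sec. 2.3.3. The structure and
function are hypothetical and are not part of the driver or of the sample
program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dynmem_tunables {	/* hypothetical holder for the settings */
	unsigned aging_sec1, aging_sec2, aging_sec3;	/* sleep times */
	unsigned aging_pcnt1, aging_pcnt2;		/* percent free */
};

/*
 * True when sec1 >= sec2 >= sec3 and 100 >= pcnt1 >= pcnt2, i.e. when
 * the settings describe a meaningful load-to-sleep-time mapping.
 */
bool
dynmem_tunables_consistent(const struct dynmem_tunables *t)
{
	assert(t != NULL);
	return (t->aging_sec1 >= t->aging_sec2 &&
	    t->aging_sec2 >= t->aging_sec3 &&
	    t->aging_pcnt1 <= 100 &&
	    t->aging_pcnt1 >= t->aging_pcnt2);
}
```
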
The aging counter and sleep timer limits are arbitrarily limited to a byte
wide counter. This can be expanded. However, maxing out the values under the
current implementation yields about 18 hours for full aging.

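The 18 hour figure can be checked with simple arithmetic, assuming that
full aging takes one sleep interval per aging count: 255 counts of 255
seconds each is 65025 seconds, or roughly 18.06 hours. The helper below is
a hypothetical illustration of that product.

```c
/*
 * Seconds until an untouched entry is fully aged, assuming one aging
 * tick per wake cycle and a constant sleep time between cycles.
 */
unsigned long
full_aging_seconds(unsigned aging_ct, unsigned sleep_secs)
{
	return ((unsigned long)aging_ct * sleep_secs);
}
```

By the same reasoning, an aging count of 3 with a 10 second sleep time
gives full aging after about 30 seconds on a lightly loaded cache.
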
3.4 Kstat Lookup Name

The kstat_lookup() module name is "sdbc:dynmem" with an instance of 0.

3.5 Defaults

Default values are:
sdbc_max_dyn_list = 8
sdbc_monitor_dynmem = 0
sdbc_cache_aging_ct1 = 3
sdbc_cache_aging_ct2 = 3
sdbc_cache_aging_ct3 = 3
sdbc_cache_aging_sec1 = 10
sdbc_cache_aging_sec2 = 5
sdbc_cache_aging_sec3 = 1
sdbc_cache_aging_pcnt1 = 50
sdbc_cache_aging_pcnt2 = 25
sdbc_max_holds_pcnt = 0

To make the dynmem act for all intents and purposes like the static model
beyond the initial startup the appropriate values are:
sdbc_max_dyn_list = 1,
sdbc_cache_aging_ct1/2/3=255,
sdbc_cache_aging_sec1/2/3=255
The remaining variables are irrelevant.

4.0 KSTAT Implementation for Existing Statistics

The existing cache statistical reporting mechanism has been replaced by
the kstat library reporting mechanism. The statistics fall into two general
categories - global and shared. The global stats reflect gross behavior
over all cached volumes and the shared stats reflect behavior particular
to each cached volume.

4.1 Global KSTAT lookup_name

The kstat_lookup() module name is "sdbc:gstats" with an instance of 0. The
identifying ascii strings and associated values matching the sd_stats driver
structure are:

sdbc_dirty -------- net_dirty
sdbc_pending ------ net_pending
sdbc_free --------- net_free
sdbc_count -------- st_count            - number of opens for device
sdbc_loc_count ---- st_loc_count        - number of open devices
sdbc_rdhits ------- st_rdhits           - number of read hits
sdbc_rdmiss ------- st_rdmiss           - number of read misses
sdbc_wrhits ------- st_wrhits           - number of write hits
sdbc_wrmiss ------- st_wrmiss           - number of write misses
sdbc_blksize ------ st_blksize          - cache block size
sdbc_num_memsize -- SD_MAX_MEM          - number of defined blocks
                                          (currently 6)
To find the size of each memory block append the numbers 0 to 5 to
'sdbc_memsize'.
sdbc_memsize0 ----- local memory
sdbc_memsize1 ----- cache memory
sdbc_memsize2 ----- iobuf memory
sdbc_memsize3 ----- hash memory
sdbc_memsize4 ----- global memory
sdbc_memsize5 ----- stats memory
sdbc_total_cmem --- st_total_cmem       - memory used by cache structs
sdbc_total_smem --- st_total_smem       - memory used by stat structs
sdbc_lru_blocks --- st_lru_blocks
sdbc_lru_noreq ---- st_lru_noreq
sdbc_lru_req ------ st_lru_req
sdbc_num_wlru_inq - MAX_CACHE_NET       - number of nets (currently 4)
To find the size of the least recently used write cache per net append
the numbers 0 to 3 to 'sdbc_wlru_inq'.
sdbc_wlru_inq0 ---- net 0
sdbc_wlru_inq1 ---- net 1
sdbc_wlru_inq2 ---- net 2
sdbc_wlru_inq3 ---- net 3
sdbc_cachesize ---- st_cachesize        - cache size
sdbc_numblocks ---- st_numblocks        - cache blocks
sdbc_num_shared --- MAXFILES*2          - number of shared structures (one for
                                          each cached volume)
                                          This number dictates the maximum
                                          index size for shared stats and
                                          names given below.
sdbc_simplect ----- simple count of the number of times the kstat update routine
                    has been called

All fields are read only.


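The indexed names above (sdbc_memsize0 to sdbc_memsize5 and sdbc_wlru_inq0
to sdbc_wlru_inq3) are formed by appending a decimal index to a base
string. A small C helper for building them might look like the sketch
below; the helper name is hypothetical and no kstat call is made.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write base followed by idx into buf, e.g. "sdbc_memsize" and 5 give
 * "sdbc_memsize5". Returns 0 on success, -1 if buf is too small.
 */
int
indexed_stat_name(char *buf, size_t len, const char *base, unsigned idx)
{
	int n;

	assert(buf != NULL && base != NULL);
	n = snprintf(buf, len, "%s%u", base, idx);
	return ((n >= 0 && (size_t)n < len) ? 0 : -1);
}
```

A reader would typically loop from 0 up to the value reported in
sdbc_num_memsize or sdbc_num_wlru_inq and look up each generated name.
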
4.2 Shared Structures KSTAT lookup_name

The kstat_lookup() module names are "sdbc:shstats" and "sdbc:shname", both
with an instance of 0. The identifying ascii strings and associated values
matching the sd_shared driver structure are:

sdbc:shstats module
sdbc_index ------- structure index number
sdbc_alloc ------- sh_alloc             - is this allocated?
sdbc_failed ------ sh_failed            - Disk failure status (0=ok,
                                          1=i/o error, 2=open failed)
sdbc_cd ---------- sh_cd                - the cache descriptor. (for stats)
sdbc_cache_read -- sh_cache_read        - Number of bytes read from cache
sdbc_cache_write - sh_cache_write       - Number of bytes written to cache
sdbc_disk_read --- sh_disk_read         - Number of bytes read from disk
sdbc_disk_write -- sh_disk_write        - Number of bytes written to disk
sdbc_filesize ---- sh_filesize          - Filesize
sdbc_numdirty ---- sh_numdirty          - Number of dirty blocks
sdbc_numio ------- sh_numio             - Number of blocks on way to disk
sdbc_numfail ----- sh_numfail           - Number of blocks failed
sdbc_flushloop --- sh_flushloop         - Loops delayed so far
sdbc_flag -------- sh_flag              - Flags visible to user programs
sdbc_simplect ---- simple count of the number of times the kstat update routine
                   has been called

sdbc:shname module
Read in as raw bytes and interpreted as a nul terminated ASCII string.

These two modules operate hand in hand based on information obtained from the
"sdbc:gstats" module. "sdbc:gstats - sdbc_num_shared" gives the maximum
possible number of shared devices. It does not tell how many devices are
actually cached - just the maximum possible. In order to determine the number
present and retrieve the statistics for each device the user must:

1. open and read "sdbc:shstats"
2. set the index "sdbc_index" to a starting value (presumably 0)
3. write the kstat module (the only writable item in the module is sdbc_index)

What this does is set a starting index for all subsequent reads.

4. to get the device count and associated statistics the user now simply
reads the modules "sdbc:shstats" and "sdbc:shname" as a group repeatedly -
the index will auto increment

To reset the index set "sdbc:shstats - sdbc_index" to the required value
and write the module.

The first entry returning a nul string in "sdbc:shname" signifies no more
configured devices.
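
The enumeration procedure can be outlined in C as below. The kstat reads
and writes are hidden behind two hypothetical callbacks so that the loop
itself is visible; struct sh_access, the fake_* stand-ins and the buffer
size are assumptions for the example and do not exist in the driver or in
the kstat library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define	SH_NAME_MAX	64	/* hypothetical name buffer size */

/* Hypothetical accessors standing in for the kstat write and reads. */
struct sh_access {
	/* steps 2-3: set sdbc_index and write "sdbc:shstats" */
	int (*set_index)(void *ctx, unsigned idx);
	/* step 4: read shstats and shname as a group; index increments */
	int (*read_next)(void *ctx, char *name, size_t len);
	void *ctx;
};

/*
 * Count configured devices: reset the index, then read until a nul name
 * comes back or max_shared (sdbc_num_shared) entries have been read.
 * Returns the device count, or -1 if an accessor fails.
 */
int
count_cached_devices(const struct sh_access *a, unsigned max_shared)
{
	char name[SH_NAME_MAX];
	unsigned i;

	assert(a != NULL);
	if (a->set_index(a->ctx, 0) != 0)
		return (-1);
	for (i = 0; i < max_shared; i++) {
		name[0] = '\0';
		if (a->read_next(a->ctx, name, sizeof (name)) != 0)
			return (-1);
		if (name[0] == '\0')
			break;		/* nul name: no more devices */
	}
	return ((int)i);
}

/* In-memory stand-in for the driver, used only to exercise the loop. */
struct fake_kstat {
	const char **names;
	unsigned count;
	unsigned idx;
};

int
fake_set_index(void *ctx, unsigned idx)
{
	((struct fake_kstat *)ctx)->idx = idx;
	return (0);
}

int
fake_read_next(void *ctx, char *name, size_t len)
{
	struct fake_kstat *fk = ctx;

	if (fk->idx < fk->count)
		(void) snprintf(name, len, "%s", fk->names[fk->idx]);
	else
		name[0] = '\0';
	fk->idx++;			/* index auto increments */
	return (0);
}
```

A real reader would also collect the "sdbc:shstats" values on each pass;
the sketch only counts names to keep the control flow in view.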
 352