    
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4104 ::spa_space no longer works
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
    
      
    
    
          --- old/usr/src/uts/common/fs/zfs/vdev_label.c
          +++ new/usr/src/uts/common/fs/zfs/vdev_label.c
   1    1  /*
   2    2   * CDDL HEADER START
   3    3   *
   4    4   * The contents of this file are subject to the terms of the
   5    5   * Common Development and Distribution License (the "License").
   6    6   * You may not use this file except in compliance with the License.
   7    7   *
   8    8   * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9    9   * or http://www.opensolaris.org/os/licensing.
  10   10   * See the License for the specific language governing permissions
  11   11   * and limitations under the License.
  12   12   *
  13   13   * When distributing Covered Code, include this CDDL HEADER in each
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  
  22   22  /*
  23   23   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  24   24   * Copyright (c) 2013 by Delphix. All rights reserved.
  25   25   */
  26   26  
  27   27  /*
  28   28   * Virtual Device Labels
  29   29   * ---------------------
  30   30   *
  31   31   * The vdev label serves several distinct purposes:
  32   32   *
  33   33   *      1. Uniquely identify this device as part of a ZFS pool and confirm its
  34   34   *         identity within the pool.
  35   35   *
  36   36   *      2. Verify that all the devices given in a configuration are present
  37   37   *         within the pool.
  38   38   *
  39   39   *      3. Determine the uberblock for the pool.
  40   40   *
  41   41   *      4. In case of an import operation, determine the configuration of the
  42   42   *         toplevel vdev of which it is a part.
  43   43   *
  44   44   *      5. If an import operation cannot find all the devices in the pool,
  45   45   *         provide enough information to the administrator to determine which
  46   46   *         devices are missing.
  47   47   *
  48   48   * It is important to note that while the kernel is responsible for writing the
  49   49   * label, it only consumes the information in the first three cases.  The
  50   50   * latter information is only consumed in userland when determining the
  51   51   * configuration to import a pool.
  52   52   *
  53   53   *
  54   54   * Label Organization
  55   55   * ------------------
  56   56   *
  57   57   * Before describing the contents of the label, it's important to understand how
  58   58   * the labels are written and updated with respect to the uberblock.
  59   59   *
  60   60   * When the pool configuration is altered, either because it was newly created
  61   61   * or a device was added, we want to update all the labels such that we can deal
  62   62   * with fatal failure at any point.  To this end, each disk has two labels which
  63   63   * are updated before and after the uberblock is synced.  Assuming we have
  64   64   * labels and an uberblock with the following transaction groups:
  65   65   *
  66   66   *              L1          UB          L2
  67   67   *           +------+    +------+    +------+
  68   68   *           |      |    |      |    |      |
  69   69   *           | t10  |    | t10  |    | t10  |
  70   70   *           |      |    |      |    |      |
  71   71   *           +------+    +------+    +------+
  72   72   *
  73   73   * In this stable state, the labels and the uberblock were all updated within
  74   74   * the same transaction group (10).  Each label is mirrored and checksummed, so
  75   75   * that we can detect when we fail partway through writing the label.
  76   76   *
  77   77   * In order to identify which labels are valid, the labels are written in the
  78   78   * following manner:
  79   79   *
  80   80   *      1. For each vdev, update 'L1' to the new label
  81   81   *      2. Update the uberblock
  82   82   *      3. For each vdev, update 'L2' to the new label
  83   83   *
  84   84   * Given arbitrary failure, we can determine the correct label to use based on
  85   85   * the transaction group.  If we fail after updating L1 but before updating the
  86   86   * UB, we will notice that L1's transaction group is greater than the uberblock,
  87   87   * so L2 must be valid.  If we fail after writing the uberblock but before
  88   88   * writing L2, we will notice that L2's transaction group is less than L1, and
  89   89   * therefore L1 is valid.
  90   90   *
  91   91   * Another added complexity is that not every label is updated when the config
  92   92   * is synced.  If we add a single device, we do not want to have to re-write
  93   93   * every label for every device in the pool.  This means that both L1 and L2 may
  94   94   * be older than the pool uberblock, because the necessary information is stored
  95   95   * on another vdev.
  96   96   *
  97   97   *
  98   98   * On-disk Format
  99   99   * --------------
 100  100   *
 101  101   * The vdev label consists of two distinct parts, and is wrapped within the
 102  102   * vdev_label_t structure.  The label includes 8k of padding to permit legacy
 103  103   * VTOC disk labels; this padding is otherwise ignored.
 104  104   *
 105  105   * The first half of the label is a packed nvlist which contains pool wide
 106  106   * properties, per-vdev properties, and configuration information.  It is
 107  107   * described in more detail below.
 108  108   *
 109  109   * The latter half of the label consists of a redundant array of uberblocks.
 110  110   * These uberblocks are updated whenever a transaction group is committed,
 111  111   * or when the configuration is updated.  When a pool is loaded, we scan each
 112  112   * vdev for the 'best' uberblock.
 113  113   *
 114  114   *
 115  115   * Configuration Information
 116  116   * -------------------------
 117  117   *
 118  118   * The nvlist describing the pool and vdev contains the following elements:
 119  119   *
 120  120   *      version         ZFS on-disk version
 121  121   *      name            Pool name
 122  122   *      state           Pool state
 123  123   *      txg             Transaction group in which this label was written
 124  124   *      pool_guid       Unique identifier for this pool
 125  125   *      vdev_tree       An nvlist describing the vdev tree.
 126  126   *      features_for_read
 127  127   *                      An nvlist of the features necessary for reading the MOS.
 128  128   *
 129  129   * Each leaf device label also contains the following:
 130  130   *
 131  131   *      top_guid        Unique ID for top-level vdev in which this is contained
 132  132   *      guid            Unique ID for the leaf vdev
 133  133   *
 134  134   * The 'vdev_tree' configuration follows the format described in 'spa_config.c'.
 135  135   */
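The label ordering rules in the comment above can be illustrated with a small standalone sketch. It is not part of this change and is not the kernel's implementation: `sketch_pick_label` is a hypothetical helper that assumes each label is summarized by its transaction group alone, and it picks the newest label whose txg does not exceed the uberblock's txg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in vdev_label.c): given the txg recorded in each
 * label and the txg of the chosen uberblock, return the index of the label
 * with the greatest txg that does not exceed the uberblock txg, or -1 if
 * every label is newer than the uberblock.
 */
int
sketch_pick_label(const uint64_t *label_txg, size_t nlabels, uint64_t ub_txg)
{
	int best = -1;
	uint64_t best_txg = 0;

	assert(label_txg != NULL);

	for (size_t i = 0; i < nlabels; i++) {
		/* A label newer than the uberblock was written before a crash. */
		if (label_txg[i] > ub_txg)
			continue;
		/* Keep the first label seen, or a strictly newer one. */
		if (best == -1 || label_txg[i] > best_txg) {
			best = (int)i;
			best_txg = label_txg[i];
		}
	}
	return (best);
}
```

With labels at txg {11, 10} and an uberblock at txg 10 (a crash after L1 was written but before the uberblock), the sketch selects the second label; with the uberblock at txg 11 (a crash before L2 was written), it selects the first.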
 136  136  
 137  137  #include <sys/zfs_context.h>
 138  138  #include <sys/spa.h>
 139  139  #include <sys/spa_impl.h>
 140  140  #include <sys/dmu.h>
 141  141  #include <sys/zap.h>
 142  142  #include <sys/vdev.h>
 143  143  #include <sys/vdev_impl.h>
 144  144  #include <sys/uberblock_impl.h>
 145  145  #include <sys/metaslab.h>
 146  146  #include <sys/zio.h>
 147  147  #include <sys/dsl_scan.h>
 148  148  #include <sys/fs/zfs.h>
 149  149  
 150  150  /*
 151  151   * Basic routines to read and write from a vdev label.
 152  152   * Used throughout the rest of this file.
 153  153   */
 154  154  uint64_t
 155  155  vdev_label_offset(uint64_t psize, int l, uint64_t offset)
 156  156  {
 157  157          ASSERT(offset < sizeof (vdev_label_t));
 158  158          ASSERT(P2PHASE_TYPED(psize, sizeof (vdev_label_t), uint64_t) == 0);
 159  159  
 160  160          return (offset + l * sizeof (vdev_label_t) + (l < VDEV_LABELS / 2 ?
 161  161              0 : psize - VDEV_LABELS * sizeof (vdev_label_t)));
 162  162  }
 163  163  
 164  164  /*
 165  165   * Returns the number of the vdev label associated with the passed-in offset.
 166  166   */
 167  167  int
 168  168  vdev_label_number(uint64_t psize, uint64_t offset)
 169  169  {
 170  170          int l;
 171  171  
 172  172          if (offset >= psize - VDEV_LABEL_END_SIZE) {
 173  173                  offset -= psize - VDEV_LABEL_END_SIZE;
 174  174                  offset += (VDEV_LABELS / 2) * sizeof (vdev_label_t);
 175  175          }
 176  176          l = offset / sizeof (vdev_label_t);
 177  177          return (l < VDEV_LABELS ? l : -1);
 178  178  }
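As a rough illustration of the offset arithmetic in the two routines above, here is a standalone sketch that mirrors them outside the kernel. The constants are assumptions standing in for `VDEV_LABELS`, `sizeof (vdev_label_t)` and `VDEV_LABEL_END_SIZE`, whose definitions are not shown in this diff: four labels of 256 KiB each, two at the front of the device and two at the end. The `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-ins for definitions that are not shown in this diff. */
#define	SKETCH_LABELS		4
#define	SKETCH_LABEL_SIZE	((uint64_t)256 * 1024)
#define	SKETCH_LABEL_END_SIZE	(2 * SKETCH_LABEL_SIZE)

/* Mirror of vdev_label_offset(): byte offset of 'offset' within label 'l'. */
uint64_t
sketch_label_offset(uint64_t psize, int l, uint64_t offset)
{
	assert(offset < SKETCH_LABEL_SIZE);
	assert(psize % SKETCH_LABEL_SIZE == 0);

	/* The first half of the labels sits at the front, the rest at the end. */
	return (offset + (uint64_t)l * SKETCH_LABEL_SIZE +
	    (l < SKETCH_LABELS / 2 ?
	    0 : psize - SKETCH_LABELS * SKETCH_LABEL_SIZE));
}

/* Mirror of vdev_label_number(): label index for a device offset, or -1. */
int
sketch_label_number(uint64_t psize, uint64_t offset)
{
	uint64_t l;

	if (offset >= psize - SKETCH_LABEL_END_SIZE) {
		/* Rebase trailing-label offsets to follow the leading labels. */
		offset -= psize - SKETCH_LABEL_END_SIZE;
		offset += (SKETCH_LABELS / 2) * SKETCH_LABEL_SIZE;
	}
	l = offset / SKETCH_LABEL_SIZE;
	return (l < SKETCH_LABELS ? (int)l : -1);
}
```

Under these assumptions, on a 1 GiB device labels 0 and 1 start at offsets 0 and 256 KiB, while labels 2 and 3 start 512 KiB and 256 KiB before the end of the device. Feeding any of those offsets back into `sketch_label_number` returns the original label index, and an offset in the data area returns -1.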
 179  179  
 180  180  static void
 181  181  vdev_label_read(zio_t *zio, vdev_t *vd, int l, void *buf, uint64_t offset,
 182  182          uint64_t size, zio_done_func_t *done, void *private, int flags)
 183  183  {
 184  184          ASSERT(spa_config_held(zio->io_spa, SCL_STATE_ALL, RW_WRITER) ==
 185  185              SCL_STATE_ALL);
 186  186          ASSERT(flags & ZIO_FLAG_CONFIG_WRITER);
 187  187  
 188  188          zio_nowait(zio_read_phys(zio, vd,
 189  189              vdev_label_offset(vd->vdev_psize, l, offset),
 190  190              size, buf, ZIO_CHECKSUM_LABEL, done, private,
 191  191              ZIO_PRIORITY_SYNC_READ, flags, B_TRUE));
 192  192  }
 193  193  
 194  194  static void
 195  195  vdev_label_write(zio_t *zio, vdev_t *vd, int l, void *buf, uint64_t offset,
 196  196          uint64_t size, zio_done_func_t *done, void *private, int flags)
 197  197  {
 198  198          ASSERT(spa_config_held(zio->io_spa, SCL_ALL, RW_WRITER) == SCL_ALL ||
 199  199              (spa_config_held(zio->io_spa, SCL_CONFIG | SCL_STATE, RW_READER) ==
 200  200              (SCL_CONFIG | SCL_STATE) &&
 201  201              dsl_pool_sync_context(spa_get_dsl(zio->io_spa))));
 202  202          ASSERT(flags & ZIO_FLAG_CONFIG_WRITER);
 203  203  
 204  204          zio_nowait(zio_write_phys(zio, vd,
 205  205              vdev_label_offset(vd->vdev_psize, l, offset),
 206  206              size, buf, ZIO_CHECKSUM_LABEL, done, private,
 207  207              ZIO_PRIORITY_SYNC_WRITE, flags, B_TRUE));
 208  208  }
 209  209  
 210  210  /*
 211  211   * Generate the nvlist representing this vdev's config.
 212  212   */
 213  213  nvlist_t *
 214  214  vdev_config_generate(spa_t *spa, vdev_t *vd, boolean_t getstats,
 215  215      vdev_config_flag_t flags)
 216  216  {
 217  217          nvlist_t *nv = NULL;
 218  218  
 219  219          nv = fnvlist_alloc();
 220  220  
 221  221          fnvlist_add_string(nv, ZPOOL_CONFIG_TYPE, vd->vdev_ops->vdev_op_type);
 222  222          if (!(flags & (VDEV_CONFIG_SPARE | VDEV_CONFIG_L2CACHE)))
 223  223                  fnvlist_add_uint64(nv, ZPOOL_CONFIG_ID, vd->vdev_id);
 224  224          fnvlist_add_uint64(nv, ZPOOL_CONFIG_GUID, vd->vdev_guid);
 225  225  
 226  226          if (vd->vdev_path != NULL)
 227  227                  fnvlist_add_string(nv, ZPOOL_CONFIG_PATH, vd->vdev_path);
 228  228  
 229  229          if (vd->vdev_devid != NULL)
 230  230                  fnvlist_add_string(nv, ZPOOL_CONFIG_DEVID, vd->vdev_devid);
 231  231  
 232  232          if (vd->vdev_physpath != NULL)
 233  233                  fnvlist_add_string(nv, ZPOOL_CONFIG_PHYS_PATH,
 234  234                      vd->vdev_physpath);
 235  235  
 236  236          if (vd->vdev_fru != NULL)
 237  237                  fnvlist_add_string(nv, ZPOOL_CONFIG_FRU, vd->vdev_fru);
 238  238  
 239  239          if (vd->vdev_nparity != 0) {
 240  240                  ASSERT(strcmp(vd->vdev_ops->vdev_op_type,
 241  241                      VDEV_TYPE_RAIDZ) == 0);
 242  242  
 243  243                  /*
 244  244                   * Make sure someone hasn't managed to sneak a fancy new vdev
 245  245                   * into a crufty old storage pool.
 246  246                   */
 247  247                  ASSERT(vd->vdev_nparity == 1 ||
 248  248                      (vd->vdev_nparity <= 2 &&
 249  249                      spa_version(spa) >= SPA_VERSION_RAIDZ2) ||
 250  250                      (vd->vdev_nparity <= 3 &&
 251  251                      spa_version(spa) >= SPA_VERSION_RAIDZ3));
 252  252  
 253  253                  /*
 254  254                   * Note that we'll add the nparity tag even on storage pools
 255  255                   * that only support a single parity device -- older software
 256  256                   * will just ignore it.
 257  257                   */
 258  258                  fnvlist_add_uint64(nv, ZPOOL_CONFIG_NPARITY, vd->vdev_nparity);
 259  259          }
 260  260  
 261  261          if (vd->vdev_wholedisk != -1ULL)
 262  262                  fnvlist_add_uint64(nv, ZPOOL_CONFIG_WHOLE_DISK,
 263  263                      vd->vdev_wholedisk);
 264  264  
 265  265          if (vd->vdev_not_present)
 266  266                  fnvlist_add_uint64(nv, ZPOOL_CONFIG_NOT_PRESENT, 1);
 267  267  
 268  268          if (vd->vdev_isspare)
 269  269                  fnvlist_add_uint64(nv, ZPOOL_CONFIG_IS_SPARE, 1);
 270  270  
 271  271          if (!(flags & (VDEV_CONFIG_SPARE | VDEV_CONFIG_L2CACHE)) &&
 272  272              vd == vd->vdev_top) {
 273  273                  fnvlist_add_uint64(nv, ZPOOL_CONFIG_METASLAB_ARRAY,
 274  274                      vd->vdev_ms_array);
 275  275                  fnvlist_add_uint64(nv, ZPOOL_CONFIG_METASLAB_SHIFT,
  
  
 276  276                      vd->vdev_ms_shift);
 277  277                  fnvlist_add_uint64(nv, ZPOOL_CONFIG_ASHIFT, vd->vdev_ashift);
 278  278                  fnvlist_add_uint64(nv, ZPOOL_CONFIG_ASIZE,
 279  279                      vd->vdev_asize);
 280  280                  fnvlist_add_uint64(nv, ZPOOL_CONFIG_IS_LOG, vd->vdev_islog);
 281  281                  if (vd->vdev_removing)
 282  282                          fnvlist_add_uint64(nv, ZPOOL_CONFIG_REMOVING,
 283  283                              vd->vdev_removing);
 284  284          }
 285  285  
 286      -        if (vd->vdev_dtl_smo.smo_object != 0)
      286 +        if (vd->vdev_dtl_sm != NULL) {
 287  287                  fnvlist_add_uint64(nv, ZPOOL_CONFIG_DTL,
 288      -                    vd->vdev_dtl_smo.smo_object);
      288 +                    space_map_object(vd->vdev_dtl_sm));
      289 +        }
 289  290  
 290  291          if (vd->vdev_crtxg)
 291  292                  fnvlist_add_uint64(nv, ZPOOL_CONFIG_CREATE_TXG, vd->vdev_crtxg);
 292  293  
 293  294          if (getstats) {
 294  295                  vdev_stat_t vs;
 295  296                  pool_scan_stat_t ps;
 296  297  
 297  298                  vdev_get_stats(vd, &vs);
 298  299                  fnvlist_add_uint64_array(nv, ZPOOL_CONFIG_VDEV_STATS,
 299  300                      (uint64_t *)&vs, sizeof (vs) / sizeof (uint64_t));
 300  301  
 301  302                  /* provide either current or previous scan information */
 302  303                  if (spa_scan_get_stats(spa, &ps) == 0) {
 303  304                          fnvlist_add_uint64_array(nv,
 304  305                              ZPOOL_CONFIG_SCAN_STATS, (uint64_t *)&ps,
 305  306                              sizeof (pool_scan_stat_t) / sizeof (uint64_t));
 306  307                  }
 307  308          }
 308  309  
 309  310          if (!vd->vdev_ops->vdev_op_leaf) {
 310  311                  nvlist_t **child;
 311  312                  int c, idx;
 312  313  
 313  314                  ASSERT(!vd->vdev_ishole);
 314  315  
 315  316                  child = kmem_alloc(vd->vdev_children * sizeof (nvlist_t *),
 316  317                      KM_SLEEP);
 317  318  
 318  319                  for (c = 0, idx = 0; c < vd->vdev_children; c++) {
 319  320                          vdev_t *cvd = vd->vdev_child[c];
 320  321  
 321  322                          /*
 322  323                           * If we're generating an nvlist of removing
 323  324                           * vdevs then skip over any device which is
 324  325                           * not being removed.
 325  326                           */
 326  327                          if ((flags & VDEV_CONFIG_REMOVING) &&
 327  328                              !cvd->vdev_removing)
 328  329                                  continue;
 329  330  
 330  331                          child[idx++] = vdev_config_generate(spa, cvd,
 331  332                              getstats, flags);
 332  333                  }
 333  334  
 334  335                  if (idx) {
 335  336                          fnvlist_add_nvlist_array(nv, ZPOOL_CONFIG_CHILDREN,
 336  337                              child, idx);
 337  338                  }
 338  339  
 339  340                  for (c = 0; c < idx; c++)
 340  341                          nvlist_free(child[c]);
 341  342  
 342  343                  kmem_free(child, vd->vdev_children * sizeof (nvlist_t *));
 343  344  
 344  345          } else {
 345  346                  const char *aux = NULL;
 346  347  
 347  348                  if (vd->vdev_offline && !vd->vdev_tmpoffline)
 348  349                          fnvlist_add_uint64(nv, ZPOOL_CONFIG_OFFLINE, B_TRUE);
 349  350                  if (vd->vdev_resilver_txg != 0)
 350  351                          fnvlist_add_uint64(nv, ZPOOL_CONFIG_RESILVER_TXG,
 351  352                              vd->vdev_resilver_txg);
 352  353                  if (vd->vdev_faulted)
 353  354                          fnvlist_add_uint64(nv, ZPOOL_CONFIG_FAULTED, B_TRUE);
 354  355                  if (vd->vdev_degraded)
 355  356                          fnvlist_add_uint64(nv, ZPOOL_CONFIG_DEGRADED, B_TRUE);
 356  357                  if (vd->vdev_removed)
 357  358                          fnvlist_add_uint64(nv, ZPOOL_CONFIG_REMOVED, B_TRUE);
 358  359                  if (vd->vdev_unspare)
 359  360                          fnvlist_add_uint64(nv, ZPOOL_CONFIG_UNSPARE, B_TRUE);
 360  361                  if (vd->vdev_ishole)
 361  362                          fnvlist_add_uint64(nv, ZPOOL_CONFIG_IS_HOLE, B_TRUE);
 362  363  
 363  364                  switch (vd->vdev_stat.vs_aux) {
 364  365                  case VDEV_AUX_ERR_EXCEEDED:
 365  366                          aux = "err_exceeded";
 366  367                          break;
 367  368  
 368  369                  case VDEV_AUX_EXTERNAL:
 369  370                          aux = "external";
 370  371                          break;
 371  372                  }
 372  373  
 373  374                  if (aux != NULL)
 374  375                          fnvlist_add_string(nv, ZPOOL_CONFIG_AUX_STATE, aux);
 375  376  
 376  377                  if (vd->vdev_splitting && vd->vdev_orig_guid != 0LL) {
 377  378                          fnvlist_add_uint64(nv, ZPOOL_CONFIG_ORIG_GUID,
 378  379                              vd->vdev_orig_guid);
 379  380                  }
 380  381          }
 381  382  
 382  383          return (nv);
 383  384  }
 384  385  
 385  386  /*
 386  387   * Generate a view of the top-level vdevs.  If we currently have holes
 387  388   * in the namespace, then generate an array which contains a list of holey
 388  389   * vdevs.  Additionally, add the number of top-level children that currently
 389  390   * exist.
 390  391   */
 391  392  void
 392  393  vdev_top_config_generate(spa_t *spa, nvlist_t *config)
 393  394  {
 394  395          vdev_t *rvd = spa->spa_root_vdev;
 395  396          uint64_t *array;
 396  397          uint_t c, idx;
 397  398  
 398  399          array = kmem_alloc(rvd->vdev_children * sizeof (uint64_t), KM_SLEEP);
 399  400  
 400  401          for (c = 0, idx = 0; c < rvd->vdev_children; c++) {
 401  402                  vdev_t *tvd = rvd->vdev_child[c];
 402  403  
 403  404                  if (tvd->vdev_ishole)
 404  405                          array[idx++] = c;
 405  406          }
 406  407  
 407  408          if (idx) {
 408  409                  VERIFY(nvlist_add_uint64_array(config, ZPOOL_CONFIG_HOLE_ARRAY,
 409  410                      array, idx) == 0);
 410  411          }
 411  412  
 412  413          VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_VDEV_CHILDREN,
 413  414              rvd->vdev_children) == 0);
 414  415  
 415  416          kmem_free(array, rvd->vdev_children * sizeof (uint64_t));
 416  417  }
 417  418  
 418  419  /*
 419  420   * Returns the configuration from the label of the given vdev. For vdevs
 420  421   * which don't have a txg value stored on their label (i.e. spares/cache)
 421  422   * or have not been completely initialized (txg = 0) just return
 422  423   * the configuration from the first valid label we find. Otherwise,
 423  424   * find the most up-to-date label that does not exceed the specified
 424  425   * 'txg' value.
 425  426   */
 426  427  nvlist_t *
 427  428  vdev_label_read_config(vdev_t *vd, uint64_t txg)
 428  429  {
 429  430          spa_t *spa = vd->vdev_spa;
 430  431          nvlist_t *config = NULL;
 431  432          vdev_phys_t *vp;
 432  433          zio_t *zio;
 433  434          uint64_t best_txg = 0;
 434  435          int error = 0;
 435  436          int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL |
 436  437              ZIO_FLAG_SPECULATIVE;
 437  438  
 438  439          ASSERT(spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL);
 439  440  
 440  441          if (!vdev_readable(vd))
 441  442                  return (NULL);
 442  443  
 443  444          vp = zio_buf_alloc(sizeof (vdev_phys_t));
 444  445  
 445  446  retry:
 446  447          for (int l = 0; l < VDEV_LABELS; l++) {
 447  448                  nvlist_t *label = NULL;
 448  449  
 449  450                  zio = zio_root(spa, NULL, NULL, flags);
 450  451  
 451  452                  vdev_label_read(zio, vd, l, vp,
 452  453                      offsetof(vdev_label_t, vl_vdev_phys),
 453  454                      sizeof (vdev_phys_t), NULL, NULL, flags);
 454  455  
 455  456                  if (zio_wait(zio) == 0 &&
 456  457                      nvlist_unpack(vp->vp_nvlist, sizeof (vp->vp_nvlist),
 457  458                      &label, 0) == 0) {
 458  459                          uint64_t label_txg = 0;
 459  460  
 460  461                          /*
 461  462                           * Auxiliary vdevs won't have txg values in their
 462  463                           * labels and newly added vdevs may not have been
 463  464                           * completely initialized so just return the
 464  465                           * configuration from the first valid label we
 465  466                           * encounter.
 466  467                           */
 467  468                          error = nvlist_lookup_uint64(label,
 468  469                              ZPOOL_CONFIG_POOL_TXG, &label_txg);
 469  470                          if ((error || label_txg == 0) && !config) {
 470  471                                  config = label;
 471  472                                  break;
 472  473                          } else if (label_txg <= txg && label_txg > best_txg) {
 473  474                                  best_txg = label_txg;
 474  475                                  nvlist_free(config);
 475  476                                  config = fnvlist_dup(label);
 476  477                          }
 477  478                  }
 478  479  
 479  480                  if (label != NULL) {
 480  481                          nvlist_free(label);
 481  482                          label = NULL;
 482  483                  }
 483  484          }
 484  485  
 485  486          if (config == NULL && !(flags & ZIO_FLAG_TRYHARD)) {
 486  487                  flags |= ZIO_FLAG_TRYHARD;
 487  488                  goto retry;
 488  489          }
 489  490  
 490  491          zio_buf_free(vp, sizeof (vdev_phys_t));
 491  492  
 492  493          return (config);
 493  494  }
 494  495  
 495  496  /*
 496  497   * Determine if a device is in use.  The 'spare_guid' parameter will be filled
 497  498   * in with the device guid if this spare is active elsewhere on the system.
 498  499   */
 499  500  static boolean_t
 500  501  vdev_inuse(vdev_t *vd, uint64_t crtxg, vdev_labeltype_t reason,
 501  502      uint64_t *spare_guid, uint64_t *l2cache_guid)
 502  503  {
 503  504          spa_t *spa = vd->vdev_spa;
 504  505          uint64_t state, pool_guid, device_guid, txg, spare_pool;
 505  506          uint64_t vdtxg = 0;
 506  507          nvlist_t *label;
 507  508  
 508  509          if (spare_guid)
 509  510                  *spare_guid = 0ULL;
 510  511          if (l2cache_guid)
 511  512                  *l2cache_guid = 0ULL;
 512  513  
 513  514          /*
 514  515           * Read the label, if any, and perform some basic sanity checks.
 515  516           */
 516  517          if ((label = vdev_label_read_config(vd, -1ULL)) == NULL)
 517  518                  return (B_FALSE);
 518  519  
 519  520          (void) nvlist_lookup_uint64(label, ZPOOL_CONFIG_CREATE_TXG,
 520  521              &vdtxg);
 521  522  
 522  523          if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_STATE,
 523  524              &state) != 0 ||
 524  525              nvlist_lookup_uint64(label, ZPOOL_CONFIG_GUID,
 525  526              &device_guid) != 0) {
 526  527                  nvlist_free(label);
 527  528                  return (B_FALSE);
 528  529          }
 529  530  
 530  531          if (state != POOL_STATE_SPARE && state != POOL_STATE_L2CACHE &&
 531  532              (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_GUID,
 532  533              &pool_guid) != 0 ||
 533  534              nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_TXG,
 534  535              &txg) != 0)) {
 535  536                  nvlist_free(label);
 536  537                  return (B_FALSE);
 537  538          }
 538  539  
 539  540          nvlist_free(label);
 540  541  
 541  542          /*
 542  543           * Check to see if this device indeed belongs to the pool it claims to
 543  544           * be a part of.  The only way this is allowed is if the device is a hot
 544  545           * spare (which we check for later on).
 545  546           */
 546  547          if (state != POOL_STATE_SPARE && state != POOL_STATE_L2CACHE &&
 547  548              !spa_guid_exists(pool_guid, device_guid) &&
 548  549              !spa_spare_exists(device_guid, NULL, NULL) &&
 549  550              !spa_l2cache_exists(device_guid, NULL))
 550  551                  return (B_FALSE);
 551  552  
 552  553          /*
 553  554           * If the transaction group is zero, then this is an initialized (but
 554  555           * unused) label.  This is only an error if the create transaction
 555  556           * on-disk is the same as the one we're using now, in which case the
 556  557           * user has attempted to add the same vdev multiple times in the same
 557  558           * transaction.
 558  559           */
 559  560          if (state != POOL_STATE_SPARE && state != POOL_STATE_L2CACHE &&
 560  561              txg == 0 && vdtxg == crtxg)
 561  562                  return (B_TRUE);
 562  563  
 563  564          /*
 564  565           * Check to see if this is a spare device.  We do an explicit check for
 565  566           * spa_has_spare() here because it may be on our pending list of spares
 566  567           * to add.  We also check if it is an l2cache device.
 567  568           */
 568  569          if (spa_spare_exists(device_guid, &spare_pool, NULL) ||
 569  570              spa_has_spare(spa, device_guid)) {
 570  571                  if (spare_guid)
 571  572                          *spare_guid = device_guid;
 572  573  
 573  574                  switch (reason) {
 574  575                  case VDEV_LABEL_CREATE:
 575  576                  case VDEV_LABEL_L2CACHE:
 576  577                          return (B_TRUE);
 577  578  
 578  579                  case VDEV_LABEL_REPLACE:
 579  580                          return (!spa_has_spare(spa, device_guid) ||
 580  581                              spare_pool != 0ULL);
 581  582  
 582  583                  case VDEV_LABEL_SPARE:
 583  584                          return (spa_has_spare(spa, device_guid));
 584  585                  }
 585  586          }
 586  587  
 587  588          /*
 588  589           * Check to see if this is an l2cache device.
 589  590           */
 590  591          if (spa_l2cache_exists(device_guid, NULL))
 591  592                  return (B_TRUE);
 592  593  
 593  594          /*
 594  595           * We can't rely on a pool's state if it's been imported
 595  596           * read-only.  Instead we look to see if the pool is marked
 596  597           * read-only in the namespace and set the state to active.
 597  598           */
 598  599          if ((spa = spa_by_guid(pool_guid, device_guid)) != NULL &&
 599  600              spa_mode(spa) == FREAD)
 600  601                  state = POOL_STATE_ACTIVE;
 601  602  
 602  603          /*
 603  604           * If the device is marked ACTIVE, then this device is in use by another
 604  605           * pool on the system.
 605  606           */
 606  607          return (state == POOL_STATE_ACTIVE);
 607  608  }
 608  609  
 609  610  /*
 610  611   * Initialize a vdev label.  We check to make sure each leaf device is not in
 611  612   * use, and writable.  We put down an initial label which we will later
 612  613   * overwrite with a complete label.  Note that it's important to do this
 613  614   * sequentially, not in parallel, so that we catch cases of multiple use of the
 614  615   * same leaf vdev in the vdev we're creating -- e.g. mirroring a disk with
 615  616   * itself.
 616  617   */
 617  618  int
 618  619  vdev_label_init(vdev_t *vd, uint64_t crtxg, vdev_labeltype_t reason)
 619  620  {
 620  621          spa_t *spa = vd->vdev_spa;
 621  622          nvlist_t *label;
 622  623          vdev_phys_t *vp;
 623  624          char *pad2;
 624  625          uberblock_t *ub;
 625  626          zio_t *zio;
 626  627          char *buf;
 627  628          size_t buflen;
 628  629          int error;
 629  630          uint64_t spare_guid, l2cache_guid;
 630  631          int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL;
 631  632  
 632  633          ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
 633  634  
 634  635          for (int c = 0; c < vd->vdev_children; c++)
 635  636                  if ((error = vdev_label_init(vd->vdev_child[c],
 636  637                      crtxg, reason)) != 0)
 637  638                          return (error);
 638  639  
 639  640          /* Track the creation time for this vdev */
 640  641          vd->vdev_crtxg = crtxg;
 641  642  
 642  643          if (!vd->vdev_ops->vdev_op_leaf)
 643  644                  return (0);
 644  645  
 645  646          /*
 646  647           * Dead vdevs cannot be initialized.
 647  648           */
 648  649          if (vdev_is_dead(vd))
 649  650                  return (SET_ERROR(EIO));
 650  651  
 651  652          /*
 652  653           * Determine if the vdev is in use.
 653  654           */
 654  655          if (reason != VDEV_LABEL_REMOVE && reason != VDEV_LABEL_SPLIT &&
 655  656              vdev_inuse(vd, crtxg, reason, &spare_guid, &l2cache_guid))
 656  657                  return (SET_ERROR(EBUSY));
 657  658  
 658  659          /*
 659  660           * If this is a request to add or replace a spare or l2cache device
 660  661           * that is in use elsewhere on the system, then we must update the
 661  662           * guid (which was initialized to a random value) to reflect the
 662  663           * actual GUID (which is shared between multiple pools).
 663  664           */
 664  665          if (reason != VDEV_LABEL_REMOVE && reason != VDEV_LABEL_L2CACHE &&
 665  666              spare_guid != 0ULL) {
 666  667                  uint64_t guid_delta = spare_guid - vd->vdev_guid;
 667  668  
 668  669                  vd->vdev_guid += guid_delta;
 669  670  
 670  671                  for (vdev_t *pvd = vd; pvd != NULL; pvd = pvd->vdev_parent)
 671  672                          pvd->vdev_guid_sum += guid_delta;
 672  673  
 673  674                  /*
 674  675                   * If this is a replacement, then we want to fall through to the
 675  676                   * rest of the code.  If we're adding a spare, then it's already
 676  677                   * labeled appropriately and we can just return.
 677  678                   */
 678  679                  if (reason == VDEV_LABEL_SPARE)
 679  680                          return (0);
 680  681                  ASSERT(reason == VDEV_LABEL_REPLACE ||
 681  682                      reason == VDEV_LABEL_SPLIT);
 682  683          }
 683  684  
 684  685          if (reason != VDEV_LABEL_REMOVE && reason != VDEV_LABEL_SPARE &&
 685  686              l2cache_guid != 0ULL) {
 686  687                  uint64_t guid_delta = l2cache_guid - vd->vdev_guid;
 687  688  
 688  689                  vd->vdev_guid += guid_delta;
 689  690  
 690  691                  for (vdev_t *pvd = vd; pvd != NULL; pvd = pvd->vdev_parent)
 691  692                          pvd->vdev_guid_sum += guid_delta;
 692  693  
 693  694                  /*
 694  695                   * If this is a replacement, then we want to fall through to the
 695  696                   * rest of the code.  If we're adding an l2cache, then it's
 696  697                   * already labeled appropriately and we can just return.
 697  698                   */
 698  699                  if (reason == VDEV_LABEL_L2CACHE)
 699  700                          return (0);
 700  701                  ASSERT(reason == VDEV_LABEL_REPLACE);
 701  702          }
 702  703  
 703  704          /*
 704  705           * Initialize its label.
 705  706           */
 706  707          vp = zio_buf_alloc(sizeof (vdev_phys_t));
 707  708          bzero(vp, sizeof (vdev_phys_t));
 708  709  
 709  710          /*
 710  711           * Generate a label describing the pool and our top-level vdev.
 711  712           * We mark it as being from txg 0 to indicate that it's not
 712  713           * really part of an active pool just yet.  The labels will
 713  714           * be written again with a meaningful txg by spa_sync().
 714  715           */
 715  716          if (reason == VDEV_LABEL_SPARE ||
 716  717              (reason == VDEV_LABEL_REMOVE && vd->vdev_isspare)) {
 717  718                  /*
 718  719                   * For inactive hot spares, we generate a special label that
 719  720                   * identifies as a mutually shared hot spare.  We write the
 720  721                   * label if we are adding a hot spare, or if we are removing an
 721  722                   * active hot spare (in which case we want to revert the
 722  723                   * labels).
 723  724                   */
 724  725                  VERIFY(nvlist_alloc(&label, NV_UNIQUE_NAME, KM_SLEEP) == 0);
 725  726  
 726  727                  VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_VERSION,
 727  728                      spa_version(spa)) == 0);
 728  729                  VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_POOL_STATE,
 729  730                      POOL_STATE_SPARE) == 0);
 730  731                  VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_GUID,
 731  732                      vd->vdev_guid) == 0);
 732  733          } else if (reason == VDEV_LABEL_L2CACHE ||
 733  734              (reason == VDEV_LABEL_REMOVE && vd->vdev_isl2cache)) {
 734  735                  /*
 735  736                   * For level 2 ARC devices, add a special label.
 736  737                   */
 737  738                  VERIFY(nvlist_alloc(&label, NV_UNIQUE_NAME, KM_SLEEP) == 0);
 738  739  
 739  740                  VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_VERSION,
 740  741                      spa_version(spa)) == 0);
 741  742                  VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_POOL_STATE,
 742  743                      POOL_STATE_L2CACHE) == 0);
 743  744                  VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_GUID,
 744  745                      vd->vdev_guid) == 0);
 745  746          } else {
 746  747                  uint64_t txg = 0ULL;
 747  748  
 748  749                  if (reason == VDEV_LABEL_SPLIT)
 749  750                          txg = spa->spa_uberblock.ub_txg;
 750  751                  label = spa_config_generate(spa, vd, txg, B_FALSE);
 751  752  
 752  753                  /*
 753  754                   * Add our creation time.  This allows us to detect multiple
 754  755                   * vdev uses as described above, and automatically expires if we
 755  756                   * fail.
 756  757                   */
 757  758                  VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_CREATE_TXG,
 758  759                      crtxg) == 0);
 759  760          }
 760  761  
 761  762          buf = vp->vp_nvlist;
 762  763          buflen = sizeof (vp->vp_nvlist);
 763  764  
 764  765          error = nvlist_pack(label, &buf, &buflen, NV_ENCODE_XDR, KM_SLEEP);
 765  766          if (error != 0) {
 766  767                  nvlist_free(label);
 767  768                  zio_buf_free(vp, sizeof (vdev_phys_t));
 768  769                  /* EFAULT means nvlist_pack ran out of room */
 769  770                  return (error == EFAULT ? ENAMETOOLONG : EINVAL);
 770  771          }
 771  772  
 772  773          /*
 773  774           * Initialize uberblock template.
 774  775           */
 775  776          ub = zio_buf_alloc(VDEV_UBERBLOCK_RING);
 776  777          bzero(ub, VDEV_UBERBLOCK_RING);
 777  778          *ub = spa->spa_uberblock;
 778  779          ub->ub_txg = 0;
 779  780  
 780  781          /* Initialize the 2nd padding area. */
 781  782          pad2 = zio_buf_alloc(VDEV_PAD_SIZE);
 782  783          bzero(pad2, VDEV_PAD_SIZE);
 783  784  
 784  785          /*
 785  786           * Write everything in parallel.
 786  787           */
 787  788  retry:
 788  789          zio = zio_root(spa, NULL, NULL, flags);
 789  790  
 790  791          for (int l = 0; l < VDEV_LABELS; l++) {
 791  792  
 792  793                  vdev_label_write(zio, vd, l, vp,
 793  794                      offsetof(vdev_label_t, vl_vdev_phys),
 794  795                      sizeof (vdev_phys_t), NULL, NULL, flags);
 795  796  
 796  797                  /*
 797  798                   * Skip the 1st padding area.
 798  799                   * Zero out the 2nd padding area, which might hold
 799  800                   * leftover data from a previous filesystem format.
 800  801                   */
 801  802                  vdev_label_write(zio, vd, l, pad2,
 802  803                      offsetof(vdev_label_t, vl_pad2),
 803  804                      VDEV_PAD_SIZE, NULL, NULL, flags);
 804  805  
 805  806                  vdev_label_write(zio, vd, l, ub,
 806  807                      offsetof(vdev_label_t, vl_uberblock),
 807  808                      VDEV_UBERBLOCK_RING, NULL, NULL, flags);
 808  809          }
 809  810  
 810  811          error = zio_wait(zio);
 811  812  
 812  813          if (error != 0 && !(flags & ZIO_FLAG_TRYHARD)) {
 813  814                  flags |= ZIO_FLAG_TRYHARD;
 814  815                  goto retry;
 815  816          }
 816  817  
 817  818          nvlist_free(label);
 818  819          zio_buf_free(pad2, VDEV_PAD_SIZE);
 819  820          zio_buf_free(ub, VDEV_UBERBLOCK_RING);
 820  821          zio_buf_free(vp, sizeof (vdev_phys_t));
 821  822  
 822  823          /*
 823  824           * If this vdev hasn't been previously identified as a spare, then we
 824  825           * mark it as such only if a) we are labeling it as a spare, or b) it
 825  826           * exists as a spare elsewhere in the system.  Do the same for
 826  827           * level 2 ARC devices.
 827  828           */
 828  829          if (error == 0 && !vd->vdev_isspare &&
 829  830              (reason == VDEV_LABEL_SPARE ||
 830  831              spa_spare_exists(vd->vdev_guid, NULL, NULL)))
 831  832                  spa_spare_add(vd);
 832  833  
 833  834          if (error == 0 && !vd->vdev_isl2cache &&
 834  835              (reason == VDEV_LABEL_L2CACHE ||
 835  836              spa_l2cache_exists(vd->vdev_guid, NULL)))
 836  837                  spa_l2cache_add(vd);
 837  838  
 838  839          return (error);
 839  840  }
 840  841  
 841  842  /*
 842  843   * ==========================================================================
 843  844   * uberblock load/sync
 844  845   * ==========================================================================
 845  846   */
 846  847  
 847  848  /*
 848  849   * Consider the following situation: txg is safely synced to disk.  We've
 849  850   * written the first uberblock for txg + 1, and then we lose power.  When we
 850  851   * come back up, we fail to see the uberblock for txg + 1 because, say,
 851  852   * it was on a mirrored device and the replica to which we wrote txg + 1
 852  853   * is now offline.  If we then make some changes and sync txg + 1, and then
 853  854   * the missing replica comes back, then for a few seconds we'll have two
 854  855   * conflicting uberblocks on disk with the same txg.  The solution is simple:
 855  856   * among uberblocks with equal txg, choose the one with the latest timestamp.
 856  857   */
 857  858  static int
 858  859  vdev_uberblock_compare(uberblock_t *ub1, uberblock_t *ub2)
 859  860  {
 860  861          if (ub1->ub_txg < ub2->ub_txg)
 861  862                  return (-1);
 862  863          if (ub1->ub_txg > ub2->ub_txg)
 863  864                  return (1);
 864  865  
 865  866          if (ub1->ub_timestamp < ub2->ub_timestamp)
 866  867                  return (-1);
 867  868          if (ub1->ub_timestamp > ub2->ub_timestamp)
 868  869                  return (1);
 869  870  
 870  871          return (0);
 871  872  }
 872  873  
 873  874  struct ubl_cbdata {
 874  875          uberblock_t     *ubl_ubbest;    /* Best uberblock */
 875  876          vdev_t          *ubl_vd;        /* vdev associated with the above */
 876  877  };
 877  878  
 878  879  static void
 879  880  vdev_uberblock_load_done(zio_t *zio)
 880  881  {
 881  882          vdev_t *vd = zio->io_vd;
 882  883          spa_t *spa = zio->io_spa;
 883  884          zio_t *rio = zio->io_private;
 884  885          uberblock_t *ub = zio->io_data;
 885  886          struct ubl_cbdata *cbp = rio->io_private;
 886  887  
 887  888          ASSERT3U(zio->io_size, ==, VDEV_UBERBLOCK_SIZE(vd));
 888  889  
 889  890          if (zio->io_error == 0 && uberblock_verify(ub) == 0) {
 890  891                  mutex_enter(&rio->io_lock);
 891  892                  if (ub->ub_txg <= spa->spa_load_max_txg &&
 892  893                      vdev_uberblock_compare(ub, cbp->ubl_ubbest) > 0) {
 893  894                          /*
 894  895                           * Keep track of the vdev in which this uberblock
 895  896                           * was found. We will use this information later
 896  897                           * to obtain the config nvlist associated with
 897  898                           * this uberblock.
 898  899                           */
 899  900                          *cbp->ubl_ubbest = *ub;
 900  901                          cbp->ubl_vd = vd;
 901  902                  }
 902  903                  mutex_exit(&rio->io_lock);
 903  904          }
 904  905  
 905  906          zio_buf_free(zio->io_data, zio->io_size);
 906  907  }
 907  908  
 908  909  static void
 909  910  vdev_uberblock_load_impl(zio_t *zio, vdev_t *vd, int flags,
 910  911      struct ubl_cbdata *cbp)
 911  912  {
 912  913          for (int c = 0; c < vd->vdev_children; c++)
 913  914                  vdev_uberblock_load_impl(zio, vd->vdev_child[c], flags, cbp);
 914  915  
 915  916          if (vd->vdev_ops->vdev_op_leaf && vdev_readable(vd)) {
 916  917                  for (int l = 0; l < VDEV_LABELS; l++) {
 917  918                          for (int n = 0; n < VDEV_UBERBLOCK_COUNT(vd); n++) {
 918  919                                  vdev_label_read(zio, vd, l,
 919  920                                      zio_buf_alloc(VDEV_UBERBLOCK_SIZE(vd)),
 920  921                                      VDEV_UBERBLOCK_OFFSET(vd, n),
 921  922                                      VDEV_UBERBLOCK_SIZE(vd),
 922  923                                      vdev_uberblock_load_done, zio, flags);
 923  924                          }
 924  925                  }
 925  926          }
 926  927  }
 927  928  
 928  929  /*
 929  930   * Reads the 'best' uberblock from disk along with its associated
 930  931   * configuration. First, we read the uberblock array of each label of each
 931  932   * vdev, keeping track of the uberblock with the highest txg in each array.
 932  933   * Then, we read the configuration from the same vdev as the best uberblock.
 933  934   */
 934  935  void
 935  936  vdev_uberblock_load(vdev_t *rvd, uberblock_t *ub, nvlist_t **config)
 936  937  {
 937  938          zio_t *zio;
 938  939          spa_t *spa = rvd->vdev_spa;
 939  940          struct ubl_cbdata cb;
 940  941          int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL |
 941  942              ZIO_FLAG_SPECULATIVE | ZIO_FLAG_TRYHARD;
 942  943  
 943  944          ASSERT(ub);
 944  945          ASSERT(config);
 945  946  
 946  947          bzero(ub, sizeof (uberblock_t));
 947  948          *config = NULL;
 948  949  
 949  950          cb.ubl_ubbest = ub;
 950  951          cb.ubl_vd = NULL;
 951  952  
 952  953          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
 953  954          zio = zio_root(spa, NULL, &cb, flags);
 954  955          vdev_uberblock_load_impl(zio, rvd, flags, &cb);
 955  956          (void) zio_wait(zio);
 956  957  
 957  958          /*
 958  959           * It's possible that the best uberblock was discovered on a label
 959  960           * that has a configuration which was written in a future txg.
 960  961           * Search all labels on this vdev to find the configuration that
 961  962           * matches the txg for our uberblock.
 962  963           */
 963  964          if (cb.ubl_vd != NULL)
 964  965                  *config = vdev_label_read_config(cb.ubl_vd, ub->ub_txg);
 965  966          spa_config_exit(spa, SCL_ALL, FTAG);
 966  967  }
 967  968  
 968  969  /*
 969  970   * On success, increment root zio's count of good writes.
 970  971   * We only get credit for writes to known-visible vdevs; see spa_vdev_add().
 971  972   */
 972  973  static void
 973  974  vdev_uberblock_sync_done(zio_t *zio)
 974  975  {
 975  976          uint64_t *good_writes = zio->io_private;
 976  977  
 977  978          if (zio->io_error == 0 && zio->io_vd->vdev_top->vdev_ms_array != 0)
 978  979                  atomic_add_64(good_writes, 1);
 979  980  }
 980  981  
 981  982  /*
 982  983   * Write the uberblock to all labels of all leaves of the specified vdev.
 983  984   */
 984  985  static void
 985  986  vdev_uberblock_sync(zio_t *zio, uberblock_t *ub, vdev_t *vd, int flags)
 986  987  {
 987  988          uberblock_t *ubbuf;
 988  989          int n;
 989  990  
 990  991          for (int c = 0; c < vd->vdev_children; c++)
 991  992                  vdev_uberblock_sync(zio, ub, vd->vdev_child[c], flags);
 992  993  
 993  994          if (!vd->vdev_ops->vdev_op_leaf)
 994  995                  return;
 995  996  
 996  997          if (!vdev_writeable(vd))
 997  998                  return;
 998  999  
 999 1000          n = ub->ub_txg & (VDEV_UBERBLOCK_COUNT(vd) - 1);
1000 1001  
1001 1002          ubbuf = zio_buf_alloc(VDEV_UBERBLOCK_SIZE(vd));
1002 1003          bzero(ubbuf, VDEV_UBERBLOCK_SIZE(vd));
1003 1004          *ubbuf = *ub;
1004 1005  
1005 1006          for (int l = 0; l < VDEV_LABELS; l++)
1006 1007                  vdev_label_write(zio, vd, l, ubbuf,
1007 1008                      VDEV_UBERBLOCK_OFFSET(vd, n), VDEV_UBERBLOCK_SIZE(vd),
1008 1009                      vdev_uberblock_sync_done, zio->io_private,
1009 1010                      flags | ZIO_FLAG_DONT_PROPAGATE);
1010 1011  
1011 1012          zio_buf_free(ubbuf, VDEV_UBERBLOCK_SIZE(vd));
1012 1013  }
1013 1014  
1014 1015  /* Sync the uberblocks to all vdevs in svd[] */
1015 1016  int
1016 1017  vdev_uberblock_sync_list(vdev_t **svd, int svdcount, uberblock_t *ub, int flags)
1017 1018  {
1018 1019          spa_t *spa = svd[0]->vdev_spa;
1019 1020          zio_t *zio;
1020 1021          uint64_t good_writes = 0;
1021 1022  
1022 1023          zio = zio_root(spa, NULL, &good_writes, flags);
1023 1024  
1024 1025          for (int v = 0; v < svdcount; v++)
1025 1026                  vdev_uberblock_sync(zio, ub, svd[v], flags);
1026 1027  
1027 1028          (void) zio_wait(zio);
1028 1029  
1029 1030          /*
1030 1031           * Flush the uberblocks to disk.  This ensures that the odd labels
1031 1032           * are no longer needed (because the new uberblocks and the even
1032 1033           * labels are safely on disk), so it is safe to overwrite them.
1033 1034           */
1034 1035          zio = zio_root(spa, NULL, NULL, flags);
1035 1036  
1036 1037          for (int v = 0; v < svdcount; v++)
1037 1038                  zio_flush(zio, svd[v]);
1038 1039  
1039 1040          (void) zio_wait(zio);
1040 1041  
1041 1042          return (good_writes >= 1 ? 0 : EIO);
1042 1043  }
1043 1044  
1044 1045  /*
1045 1046   * On success, increment the count of good writes for our top-level vdev.
1046 1047   */
1047 1048  static void
1048 1049  vdev_label_sync_done(zio_t *zio)
1049 1050  {
1050 1051          uint64_t *good_writes = zio->io_private;
1051 1052  
1052 1053          if (zio->io_error == 0)
1053 1054                  atomic_add_64(good_writes, 1);
1054 1055  }
1055 1056  
1056 1057  /*
1057 1058   * If there weren't enough good writes, indicate failure to the parent.
1058 1059   */
1059 1060  static void
1060 1061  vdev_label_sync_top_done(zio_t *zio)
1061 1062  {
1062 1063          uint64_t *good_writes = zio->io_private;
1063 1064  
1064 1065          if (*good_writes == 0)
1065 1066                  zio->io_error = SET_ERROR(EIO);
1066 1067  
1067 1068          kmem_free(good_writes, sizeof (uint64_t));
1068 1069  }
1069 1070  
1070 1071  /*
1071 1072   * We ignore errors for log and cache devices; simply free the private data.
1072 1073   */
1073 1074  static void
1074 1075  vdev_label_sync_ignore_done(zio_t *zio)
1075 1076  {
1076 1077          kmem_free(zio->io_private, sizeof (uint64_t));
1077 1078  }
1078 1079  
1079 1080  /*
1080 1081   * Write all even or odd labels to all leaves of the specified vdev.
1081 1082   */
1082 1083  static void
1083 1084  vdev_label_sync(zio_t *zio, vdev_t *vd, int l, uint64_t txg, int flags)
1084 1085  {
1085 1086          nvlist_t *label;
1086 1087          vdev_phys_t *vp;
1087 1088          char *buf;
1088 1089          size_t buflen;
1089 1090  
1090 1091          for (int c = 0; c < vd->vdev_children; c++)
1091 1092                  vdev_label_sync(zio, vd->vdev_child[c], l, txg, flags);
1092 1093  
1093 1094          if (!vd->vdev_ops->vdev_op_leaf)
1094 1095                  return;
1095 1096  
1096 1097          if (!vdev_writeable(vd))
1097 1098                  return;
1098 1099  
1099 1100          /*
1100 1101           * Generate a label describing the top-level config to which we belong.
1101 1102           */
1102 1103          label = spa_config_generate(vd->vdev_spa, vd, txg, B_FALSE);
1103 1104  
1104 1105          vp = zio_buf_alloc(sizeof (vdev_phys_t));
1105 1106          bzero(vp, sizeof (vdev_phys_t));
1106 1107  
1107 1108          buf = vp->vp_nvlist;
1108 1109          buflen = sizeof (vp->vp_nvlist);
1109 1110  
1110 1111          if (nvlist_pack(label, &buf, &buflen, NV_ENCODE_XDR, KM_SLEEP) == 0) {
1111 1112                  for (; l < VDEV_LABELS; l += 2) {
1112 1113                          vdev_label_write(zio, vd, l, vp,
1113 1114                              offsetof(vdev_label_t, vl_vdev_phys),
1114 1115                              sizeof (vdev_phys_t),
1115 1116                              vdev_label_sync_done, zio->io_private,
1116 1117                              flags | ZIO_FLAG_DONT_PROPAGATE);
1117 1118                  }
1118 1119          }
1119 1120  
1120 1121          zio_buf_free(vp, sizeof (vdev_phys_t));
1121 1122          nvlist_free(label);
1122 1123  }
1123 1124  
1124 1125  int
1125 1126  vdev_label_sync_list(spa_t *spa, int l, uint64_t txg, int flags)
1126 1127  {
1127 1128          list_t *dl = &spa->spa_config_dirty_list;
1128 1129          vdev_t *vd;
1129 1130          zio_t *zio;
1130 1131          int error;
1131 1132  
1132 1133          /*
1133 1134           * Write the new labels to disk.
1134 1135           */
1135 1136          zio = zio_root(spa, NULL, NULL, flags);
1136 1137  
1137 1138          for (vd = list_head(dl); vd != NULL; vd = list_next(dl, vd)) {
1138 1139                  uint64_t *good_writes = kmem_zalloc(sizeof (uint64_t),
1139 1140                      KM_SLEEP);
1140 1141  
1141 1142                  ASSERT(!vd->vdev_ishole);
1142 1143  
1143 1144                  zio_t *vio = zio_null(zio, spa, NULL,
1144 1145                      (vd->vdev_islog || vd->vdev_aux != NULL) ?
1145 1146                      vdev_label_sync_ignore_done : vdev_label_sync_top_done,
1146 1147                      good_writes, flags);
1147 1148                  vdev_label_sync(vio, vd, l, txg, flags);
1148 1149                  zio_nowait(vio);
1149 1150          }
1150 1151  
1151 1152          error = zio_wait(zio);
1152 1153  
1153 1154          /*
1154 1155           * Flush the new labels to disk.
1155 1156           */
1156 1157          zio = zio_root(spa, NULL, NULL, flags);
1157 1158  
1158 1159          for (vd = list_head(dl); vd != NULL; vd = list_next(dl, vd))
1159 1160                  zio_flush(zio, vd);
1160 1161  
1161 1162          (void) zio_wait(zio);
1162 1163  
1163 1164          return (error);
1164 1165  }
1165 1166  
1166 1167  /*
1167 1168   * Sync the uberblock and any changes to the vdev configuration.
1168 1169   *
1169 1170   * The order of operations is carefully crafted to ensure that
1170 1171   * if the system panics or loses power at any time, the state on disk
1171 1172   * is still transactionally consistent.  The in-line comments below
1172 1173   * describe the failure semantics at each stage.
1173 1174   *
1174 1175   * Moreover, vdev_config_sync() is designed to be idempotent: if it fails
1175 1176   * at any time, you can just call it again, and it will resume its work.
1176 1177   */
1177 1178  int
1178 1179  vdev_config_sync(vdev_t **svd, int svdcount, uint64_t txg, boolean_t tryhard)
1179 1180  {
1180 1181          spa_t *spa = svd[0]->vdev_spa;
1181 1182          uberblock_t *ub = &spa->spa_uberblock;
1182 1183          vdev_t *vd;
1183 1184          zio_t *zio;
1184 1185          int error;
1185 1186          int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL;
1186 1187  
1187 1188          /*
1188 1189           * Normally, we don't want to try too hard to write every label and
1189 1190           * uberblock.  If there is a flaky disk, we don't want the rest of the
1190 1191           * sync process to block while we retry.  But if we can't write a
1191 1192           * single label out, we should retry with ZIO_FLAG_TRYHARD before
1192 1193           * bailing out and declaring the pool faulted.
1193 1194           */
1194 1195          if (tryhard)
1195 1196                  flags |= ZIO_FLAG_TRYHARD;
1196 1197  
1197 1198          ASSERT(ub->ub_txg <= txg);
1198 1199  
1199 1200          /*
1200 1201           * If this isn't a resync due to I/O errors,
1201 1202           * and nothing changed in this transaction group,
1202 1203           * and the vdev configuration hasn't changed,
1203 1204           * then there's nothing to do.
1204 1205           */
1205 1206          if (ub->ub_txg < txg &&
1206 1207              uberblock_update(ub, spa->spa_root_vdev, txg) == B_FALSE &&
1207 1208              list_is_empty(&spa->spa_config_dirty_list))
1208 1209                  return (0);
1209 1210  
1210 1211          if (txg > spa_freeze_txg(spa))
1211 1212                  return (0);
1212 1213  
1213 1214          ASSERT(txg <= spa->spa_final_txg);
1214 1215  
1215 1216          /*
1216 1217           * Flush the write cache of every disk that's been written to
1217 1218           * in this transaction group.  This ensures that all blocks
1218 1219           * written in this txg will be committed to stable storage
1219 1220           * before any uberblock that references them.
1220 1221           */
1221 1222          zio = zio_root(spa, NULL, NULL, flags);
1222 1223  
1223 1224          for (vd = txg_list_head(&spa->spa_vdev_txg_list, TXG_CLEAN(txg)); vd;
1224 1225              vd = txg_list_next(&spa->spa_vdev_txg_list, vd, TXG_CLEAN(txg)))
1225 1226                  zio_flush(zio, vd);
1226 1227  
1227 1228          (void) zio_wait(zio);
1228 1229  
1229 1230          /*
1230 1231           * Sync out the even labels (L0, L2) for every dirty vdev.  If the
1231 1232           * system dies in the middle of this process, that's OK: all of the
1232 1233           * even labels that made it to disk will be newer than any uberblock,
1233 1234           * and will therefore be considered invalid.  The odd labels (L1, L3),
1234 1235           * which have not yet been touched, will still be valid.  We flush
1235 1236           * the new labels to disk to ensure that all even-label updates
1236 1237           * are committed to stable storage before the uberblock update.
1237 1238           */
1238 1239          if ((error = vdev_label_sync_list(spa, 0, txg, flags)) != 0)
1239 1240                  return (error);
1240 1241  
1241 1242          /*
1242 1243           * Sync the uberblocks to all vdevs in svd[].
1243 1244           * If the system dies in the middle of this step, there are two cases
1244 1245           * to consider, and the on-disk state is consistent either way:
1245 1246           *
1246 1247           * (1)  If none of the new uberblocks made it to disk, then the
1247 1248           *      previous uberblock will be the newest, and the odd labels
1248 1249           *      (which had not yet been touched) will be valid with respect
1249 1250           *      to that uberblock.
1250 1251           *
1251 1252           * (2)  If one or more new uberblocks made it to disk, then they
1252 1253           *      will be the newest, and the even labels (which had all
1253 1254           *      been successfully committed) will be valid with respect
1254 1255           *      to the new uberblocks.
1255 1256           */
1256 1257          if ((error = vdev_uberblock_sync_list(svd, svdcount, ub, flags)) != 0)
1257 1258                  return (error);
1258 1259  
1259 1260          /*
1260 1261           * Sync out odd labels for every dirty vdev.  If the system dies
1261 1262           * in the middle of this process, the even labels and the new
1262 1263           * uberblocks will suffice to open the pool.  The next time
1263 1264           * the pool is opened, the first thing we'll do -- before any
1264 1265           * user data is modified -- is mark every vdev dirty so that
1265 1266           * all labels will be brought up to date.  We flush the new labels
1266 1267           * to disk to ensure that all odd-label updates are committed to
1267 1268           * stable storage before the next transaction group begins.
1268 1269           */
1269 1270          return (vdev_label_sync_list(spa, 1, txg, flags));
1270 1271  }
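
Reviewer sketch (not part of this change): the uberblock selection rule used by vdev_uberblock_compare() and vdev_uberblock_load_done(), and the ring-slot computation in vdev_uberblock_sync(), can be modelled in a few lines of standalone C. Everything prefixed with sketch_ is hypothetical and exists only for illustration; the struct keeps just the two fields the comparison reads, and the power-of-two slot count is an assumption implied by the mask in the listing rather than something this file states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for uberblock_t: only the fields compared above. */
typedef struct sketch_ub {
	uint64_t ub_txg;
	uint64_t ub_timestamp;
} sketch_ub_t;

/* Order by txg first; among equal txgs, the later timestamp wins. */
static int
sketch_ub_compare(const sketch_ub_t *a, const sketch_ub_t *b)
{
	if (a->ub_txg < b->ub_txg)
		return (-1);
	if (a->ub_txg > b->ub_txg)
		return (1);
	if (a->ub_timestamp < b->ub_timestamp)
		return (-1);
	if (a->ub_timestamp > b->ub_timestamp)
		return (1);
	return (0);
}

/*
 * Ring slot for a txg, mirroring "txg & (count - 1)" in the listing.
 * Assumption: slot_count is a non-zero power of two, so the mask is
 * equivalent to txg % slot_count.
 */
static uint64_t
sketch_ub_slot(uint64_t txg, uint64_t slot_count)
{
	assert(slot_count != 0 && (slot_count & (slot_count - 1)) == 0);
	return (txg & (slot_count - 1));
}

/*
 * Return the index of the best candidate whose txg does not exceed
 * max_txg, or -1 if none qualifies.  As in the load path, the running
 * best starts out zeroed and is replaced only by a strictly greater
 * candidate.
 */
static int
sketch_ub_best(const sketch_ub_t *ubs, size_t n, uint64_t max_txg)
{
	sketch_ub_t best = { 0, 0 };
	int best_idx = -1;

	for (size_t i = 0; i < n; i++) {
		if (ubs[i].ub_txg <= max_txg &&
		    sketch_ub_compare(&ubs[i], &best) > 0) {
			best = ubs[i];
			best_idx = (int)i;
		}
	}
	return (best_idx);
}
```

Given candidates with (txg, timestamp) pairs (10, 100), (11, 200), (11, 250) and (12, 300), calling sketch_ub_best() with max_txg = 11 selects the third entry: the txg 12 uberblock is excluded by the cap, and the two txg 11 entries are separated by timestamp. The model leaves out uberblock_verify(), locking, and all I/O.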
  