## Abstract

Design researchers have struggled to produce quantitative predictions for exactly why and when diversity might help or hinder design search efforts. This paper addresses that problem by studying one ubiquitously used search strategy—Bayesian optimization (BO)—on a 2D test problem with modifiable convexity and difficulty. Specifically, we test how providing diverse versus non-diverse initial samples to BO affects its performance during search and introduce a fast ranked-determinantal point process method for computing diverse sets, which we need to detect sets of highly diverse or non-diverse initial samples. We initially found, to our surprise, that diversity did not appear to affect BO, neither helping nor hurting the optimizer’s convergence. However, follow-on experiments illuminated a key trade-off. Non-diverse initial samples hastened posterior convergence for the underlying model hyper-parameters—a *model building* advantage. In contrast, diverse initial samples accelerated exploring the function itself—a *space exploration* advantage. Both advantages help BO, but in different ways, and the initial sample diversity directly modulates how BO trades those advantages. Indeed, we show that fixing the BO hyper-parameters removes the model building advantage, causing diverse initial samples to always outperform models trained with non-diverse samples. These findings shed light on why, at least for BO-type optimizers, the use of diversity has mixed effects, and they caution against the ubiquitous use of space-filling initializations in BO. To the extent that humans use explore-exploit search strategies similar to BO, our results provide a testable conjecture for why and when diversity may affect human-subject or design team experiments.

## 1 Introduction and Related Work

One open question within design research is when or under what conditions providing diverse stimuli or starting solutions to either humans or algorithms can improve their designs’ final performance. Researchers have struggled to produce quantitative predictions or explanations for exactly why and when diversity might help or hinder design search efforts. In studies of human designers or teams, there have been numerous empirical results on the effect of diverse stimuli or sets of stimuli on designers, typically referred to under the topic of *design fixation* (for recent reviews, see Refs. [1,2]). In general, available empirical results are mixed and it is difficult to quantitatively predict, for a new problem or person, whether or not diversity in problem stimuli will or will not help. For instance, there are a number of empirical demonstrations of positive effects of example diversity on novelty and diversity of ideas [3–5], but substantially more mixed results on the effects of diversity on solution *quality*, with some observations of positive effects [6–8], some null or contingent effects [4,9–14], and even some negative effects on solution quality [15,16].

Likewise, in research focused purely on optimization, common academic and industrial practice initializes search algorithms with different strategies like Latin hypercube sampling (LHS) [17] and others in an attempt to “fill” or “cover” a space as uniformly as possible [18] or via quasi-random methods [19–21]. Some methods build diversity-encouraging loss functions directly into their core search algorithms, as in common meta-heuristic optimizers [22] like particle swarm optimization, simulated annealing, and genetic algorithms, with one of the best-known diversity-inducing examples being the non-dominated sorting genetic algorithm-II (NSGA-II) [23]. For Bayesian optimization (BO) specifically, a common strategy is to build diversity directly into the acquisition function used in sampling new points from the Gaussian process (GP) posterior [24]. As with human-subjects experiments, the precise effect of diversity on optimization performance is often problem dependent [22] and difficult to predict a priori. Nevertheless, optimization practitioners take these steps to improve initial sample diversity with the hope that the optimizer will converge faster or find better global optima.

But is encouraging initial diversity in this way always a good idea? If so, when and why is it good? Are there any times or conditions when diversity might hurt rather than help our search for good designs?

To address the above questions, this paper studies one type of commonly used search strategy—BO—and how the diversity of its initialization points affects its performance on a search task. We uncover a fascinating dance that occurs between two competing advantages that initial samples endow upon BO—a *model building* versus *space exploration* advantage that we define later—and how the initial samples’ diversity directs the choreography. While the fundamental reason for this interplay will later appear straightforward (and perhaps even discernible through thought experiments rather than numerical experiments), it nevertheless flies in the face of how most practitioners initialize their BO routines or conduct optimal experimental design studies. It also posits a testable prediction about how to induce greater effects of diversity on novice human designers or the conditions under which there may be mixed or even negative effects (see Sec. 6).

Before describing our particular experiment and results, we will first review why BO is a meaningful and generalizable class of search algorithm to use, as well as past work that has tried to understand how diversity affects search processes such as optimization.

*Why model design search as Bayesian optimization?* While this paper addresses only BO, this is an important algorithm in that it plays an out-sized role within the design research and optimization community. For example, BO underlies a vast number of industrially relevant gradient-free surrogate modeling approaches implemented in major design or analysis packages, where it is referred to under a variety of names, including Kriging methods or meta-modeling [25,26]. Its use in applications of computationally expensive multidisciplinary optimization problems is, while not universal [27], quite widespread. Likewise, researchers studying human designers often use BO as a proxy model [28] to understand human search, due to the interplay between exploration and exploitation that lies at the heart of most BO acquisition functions like expected improvement (EI). More generally, there is a robust history of fruitful research in cognitive science modeling human cognition as Bayesian processing [29], such as concept learning in cognitive development [30], causal learning [31], and analogical reasoning [32].

While the bulk of BO-related papers focus on new algorithms or acquisition functions, few papers focus on how BO is initialized, preferring instead the general use of space-filling initializations that have a long history in the field of optimal experiment design [27]. In contrast, this paper shows that, in certain situations, faith in space-filling designs might be misplaced, particularly when the BO kernel hyper-parameters are adjusted or fit during search.

*What does it even mean for samples to be diverse?* As a practical matter, if we wish to study how diverse samples impact BO, we face a subtle but surprisingly non-trivial problem: how exactly do you quantify whether one set of samples is more or less diverse than another? This is a set-based (i.e., combinatorially large) problem with its own rich history too large to cover extensively here; however, our past work on diversity measurement [33–35], computation [36], and optimization [37] provides further pointers for interested readers. In particular, the thesis of Ahmed provides a good starting point for the broader literature and background in this area [38].

For the purposes of understanding how this paper relates to existing approaches, it suffices to know the following regarding common approaches to quantifying diversity: (1) most diversity measurement approaches focus on some variant of a hyper-volume objective spanned by the set of selected points; (2) since this measure depends on *a set* rather than individual points, it becomes combinatorially expensive, necessitating fast polynomial-time approximation, one common tool for which is a determinantal point process (DPP) [39]; however, (3) while sampling the most diverse set via DPPs is easy, sampling percentile sets from the DPP distribution to get the top 5%, median, or lowest 5% of diverse sets becomes exceedingly slow for a large sample pool.

In contrast, for this paper, we created a faster DPP-type sampling method to extract different percentiles of the distribution without actually needing to observe the entire DPP distribution and whose sampling error we can bound using concentration inequalities. Section 2 provides further mathematical background, including information on DPP hyper-parameters and how to select them intelligently, and the Supplementary Material on the ASME Digital Collection provides further algorithmic details. With an understanding of diversity distribution measures in hand, we can now address diversity’s specific effects on optimization more generally.

*How does diversity in initial inputs affect optimizers?* While there are a number of papers that propose either different initialization strategies or benchmarking of existing strategies for optimization, there is limited prior work addressing the direct effect of initial sample diversity.

For general reviews and benchmarking on how to initialize optimizers and the effects of different strategies, papers such as Refs. [20,22] compare initialization strategies for particular optimizers and quantify performance differences. An overall observation across these contributions is the inability of a single initialization method to improve performance across functions of varying complexity. These studies also do not directly measure or address the role of sample diversity, only noting such behavior as it correlates indirectly with the sampling strategy.

A second body of work tries to customize initialization strategies on a per-problem basis, often achieving faster convergence on domain-specific problems [18,19,40–42]. While useful in their designed domain, these studies do not directly address the role of diversity either. In contrast, this paper addresses diversity directly using properties of BO that are sufficiently general to apply across multiple domains and applications.

Lastly, how to initialize optimizers has garnered new interest from the machine learning community, for example, in the initial settings of weights and biases in a neural network and the downstream effects on network performance [43,44]. There is also general interest in how to collect diverse samples during learning, either in an active learning [45] or reinforcement learning context [46,47]; however, those lines of work address only diversity throughout data collection, rather than the impact of initial samples considered in this paper.

*What does this paper contribute beyond past work?* This paper’s specific contributions are:

- *To compute diversity*: we describe a fast DPP-based diversity scoring method for selecting diverse initial examples with a fixed size *k*. Any set of size *k* of these initial examples can then be used to approximate the percentile of diversity to which the set belongs. This method requires selecting a hyper-parameter relating to the DPP measure. We describe a principled method for selecting this parameter in Sec. 2.1 and provide numerical evidence of the improved sampling performance in the Supplementary Material. Compared to prior work, this makes percentile sampling of DPP distributions computationally tractable.
- *To study effects on BO*: we empirically evaluate how diverse initial samples affect the convergence rate of a Bayesian optimizer. Section 4 finds that low-diversity samples provide a *model building* advantage to BO, while diverse samples provide a *space exploration* advantage that helps BO converge faster. Section 5 shows that removing the model building advantage makes diverse initial samples uniformly better than non-diverse samples.^{2}

We will next describe our overall experimental approach and common procedures used across all three of our main experiments. We will introduce individual experiment-specific methods only when relevant in each separate experiment section. To fully reproduce the below results, the link in the footnote provides the code associated with this paper.^{3}

## 2 Overall Experimental Approach

This section will first describe how we compute diverse initial samples, including how we set a key hyper-parameter that controls the DPP kernel needed to measure sample set diversity. It then briefly describes the controllable 2D test problem that we use in our experiments. It ends with a description of how we set up the BO search process and the hyper-parameters that we study more deeply in each individual experiment.

### 2.1 Measuring and Sampling From Diverse Sets Using Determinantal Point Processes.

Determinantal point processes measure diversity through an *L-ensemble* (as seen in Eq. (1)) that correlates with the volume spanned by a collection or set of samples ($Y$) taken from all possible sets ($\mathcal{Y}$) given a diversity/similarity (feature) metric:

$$
P(Y) \propto \det(L_Y) \tag{1}
$$

where $L_Y$ is the ensemble matrix restricted to the rows and columns indexed by $Y$. For sampling diverse examples, this positive semi-definite matrix is typically chosen as a kernel matrix ($K$) that defines the similarity measure between pairs of data points. For this paper, we use a standard and commonly used similarity measure defined using a radial basis function (RBF) kernel matrix [48]. Specifically, each entry in $L_Y$ for two data points with indices $i$ and $j$ is

$$
L_{ij} = \exp\left(-\gamma\,\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2\right) \tag{2}
$$

The hyper-parameter $\gamma$ in the DPP kernel can be set in the interval $(0, \infty)$ and will turn out to be quite important in how well we can measure diversity. The next section explores this choice in more depth, but to provide some initial intuition: set $\gamma$ too high and any selection of points looks equally diverse compared to any other set, essentially destroying the discriminative power of the DPP, while setting $\gamma$ too low causes the determinant of $L$ to collapse to zero for any set of cardinality greater than the feature-length of $\mathbf{x}$.

Let $I$ be an identity matrix of the same shape as the ensemble matrix $L$. Then, using Theorem 2.2 from Ref. [39], we can write $P(Y \in \mathcal{Y})$ as follows:

$$
P(Y) = \frac{\det(L_Y)}{\det(L + I)} \tag{3}
$$

This is the probability that a given set of points (*Y*) is highly diverse compared to other possible sets ($Y$)—that is, the higher *P*(*Y*), the more diverse the set. The popularity of DPP-type measures is due to their ability to efficiently sample diverse sets of fixed size *k*. Sampling a set of *k* samples from a DPP is done using a conditional DPP called a *k*-DPP [49]. *k*-DPPs can compute marginal and conditional probabilities with polynomial complexity, in turn allowing sampling from the DPP in polynomial time. *k*-DPPs are also well researched, and several methods exist to speed up the sampling process [50,51]. Our approach allows sampling in constant complexity; however, there is a trade-off in the complexity of generating the DPP distribution: generating a traditional DPP distribution is independent of *k*, while our approach has linear dependence on *k*. Existing *k*-DPP approaches lack the ability to efficiently sample from different percentiles of diversity, making it computationally expensive to regenerate the distribution each time one wishes to sample from a different percentile.

To tackle this problem, our approach is designed to sample efficiently from different percentiles of diversity. This is made possible by creating an absolute diversity score, generated by taking the log *determinant* of the kernel matrix defined over the set *Y*. Randomly sampling the *k*-DPP space allows us to bound errors in generating this absolute score through the use of concentration inequalities. The details of how to sample from this distribution and calculate the score are provided in the Supplementary Material, so as not to disrupt the paper’s main narrative. Additionally, the Supplementary Material provides empirical results to support our earlier claims regarding efficient sampling from our approach versus the traditional *k*-DPP approach, as well as the trade-off in complexity when generating the DPP distribution. Figure 2 shows example sets of five points and their corresponding DPP score, where the diversity score is monotonic and a positive score corresponds to a more diverse subset.
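To make the scoring concrete, the log-determinant score and a Monte Carlo percentile estimate can be sketched as follows (a simplified illustration under the RBF ensemble kernel of Eq. (2); the function names and constants here are our own, not the paper's implementation):

```python
import numpy as np

def rbf_ensemble(X, gamma):
    """RBF L-ensemble: L_ij = exp(-gamma * ||x_i - x_j||^2), as in Eq. (2)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def diversity_score(X, gamma=1.0):
    """Log-determinant of the ensemble kernel over the set X; higher means
    more diverse (larger spanned volume). slogdet avoids underflow when the
    determinant is tiny for clustered sets."""
    _, logdet = np.linalg.slogdet(rbf_ensemble(X, gamma))
    return logdet

def percentile_of_set(X_set, pool, k, gamma=1.0, n_draws=1000, seed=0):
    """Estimate which diversity percentile X_set occupies among random size-k
    subsets of the pool (Monte Carlo; the estimation error shrinks with
    n_draws, which is what concentration inequalities bound)."""
    rng = np.random.default_rng(seed)
    score = diversity_score(X_set, gamma)
    draws = np.array([
        diversity_score(pool[rng.choice(len(pool), size=k, replace=False)], gamma)
        for _ in range(n_draws)
    ])
    return 100.0 * np.mean(draws < score)

# A spread-out set of five points should out-score a tightly clustered one.
spread = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5]], dtype=float)
clustered = 0.5 + 0.01 * np.random.default_rng(1).standard_normal((5, 2))
assert diversity_score(spread) > diversity_score(clustered)
```

Because the score is absolute rather than relative to a fully enumerated distribution, a new set can be placed into a percentile without regenerating the DPP distribution.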

#### 2.1.1 Selecting the Hyper-parameter for the Determinantal Point Process Kernel.

As mentioned above, the choice of *γ* impacts the accuracy of the DPP score, and when we initially fixed *γ* to |*Y*_{i}|/10, where *Y*_{i} is the set of data points over which the RBF kernel is calculating the DPP score as suggested by Ref. [52], the DPP seemed to be producing largely random scores. To select an appropriate *γ*, we designed a kernel-independent diagnostic method for assessing the DPP kernel with four steps.

First, we randomly generate *M* sets of size *k* (think of these as random *k*-sized samples from $Y$). Second, we compute their DPP scores for different possible *γ* values and then sort those *M* sets by that score. Third, we compute the rank correlation among these sets for different pairs of *γ*—intuitively, if the rank correlation is high (toward 1), then either choice of *γ* would produce the same rank ordering of which sets were considered diverse, meaning the (relative) DPP scores are insensitive to *γ*. In contrast, if the rank correlation is 0, then the two *γ* values produce essentially random orderings. This rank correlation between two different *γ* settings is the color/value shown in each cell of the matrix in Fig. 1. Large ranges of *γ* with high rank correlation mean that the rankings of DPP scores are stable or robust to small perturbations in *γ*. Lastly, we exploit this “robust *γ*” region by choosing the largest range of *γ* values that have a relative correlation index of 0.95 or higher, computing the mean of this range, and using that as our selected *γ* in later experiments. We note that the functional range of *γ* depends on the sample size (*k*), and so this “robust *γ*” needs to be recomputed for different initialization sizes.

The detailed settings for the results seen in Fig. 1 are as follows: $M = 10{,}000$; $k = 10$; and $\gamma \in [e^{-7}, e^{-2}]$. The correlation matrix shows a range of $\gamma$ with strongly correlated relative orderings of the test sets; all $\gamma$ within this range provide a consistent ranking.
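The four-step diagnostic can be sketched as follows (a simplified illustration using a Spearman-style rank correlation on much smaller grids than the $M = 10{,}000$ used in the paper; all names here are our own):

```python
import numpy as np

def dpp_logdet(X, gamma):
    """Log-determinant diversity score under an RBF ensemble kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.linalg.slogdet(np.exp(-gamma * sq))[1]

def _ranks(v):
    """Ranks of v (ties broken arbitrarily), for a Spearman-style correlation."""
    return np.argsort(np.argsort(v))

def gamma_rank_correlations(pool, k, gammas, M=200, seed=0):
    """Steps 1-3: score M random size-k sets under each gamma, then return the
    matrix of pairwise rank correlations between gamma settings."""
    rng = np.random.default_rng(seed)
    sets = [pool[rng.choice(len(pool), size=k, replace=False)] for _ in range(M)]
    scores = np.array([[dpp_logdet(S, g) for S in sets] for g in gammas])
    n = len(gammas)
    C = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            r = np.corrcoef(_ranks(scores[i]), _ranks(scores[j]))[0, 1]
            C[i, j] = C[j, i] = r
    return C

def robust_gamma(pool, k, gammas, thresh=0.95, **kw):
    """Step 4: return the midpoint of the longest run of consecutive gammas
    whose neighboring rank correlations stay above the threshold."""
    C = gamma_rank_correlations(pool, k, gammas, **kw)
    best, cur = (0, 0), [0]
    for i in range(1, len(gammas)):
        cur = cur + [i] if C[i - 1, i] >= thresh else [i]
        if cur[-1] - cur[0] > best[1] - best[0]:
            best = (cur[0], cur[-1])
    return float(np.mean(gammas[best[0]:best[1] + 1]))
```

In practice one would sweep `gammas` on a log grid and inspect the correlation matrix visually, as in Fig. 1, before trusting the automatically chosen midpoint.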

### 2.2 A Test Function With Tunable Complexity.

A problem that is common across the study of initialization methods is their inconsistency across problems of varying difficulty, motivating the need to test BO’s search behavior on a problem class with variable complexity. Synthetic objective functions are often used to test the efficiency of different optimizers, and there are several online libraries from which to choose such functions [53], though these functions are largely static, in the sense that there is only a single test function definition. There has been research into developing objective function *generators*; for example, in Ref. [54], the author uses a mixture of four features to generate synthetic objective functions. These have been well categorized, and the relative performance of different optimizers is documented on each landscape. Similarly, Ref. [55] uses a mixture of different sinusoidal functions to create a noisy 1D function. Both generators can produce complicated landscapes, but because the complexity arises from mixing randomly generated sinusoids, they cannot control or quantify a measure of complexity of the generated landscapes.

To address this controllable complexity problem directly, we created a simple 2D test function generator with tunable complexity parameters that allow us to instantiate multiple random surfaces of similar optimization difficulty. We modified this function from the one used in Ref. [56] where it was referred to as “Wildcat Wells,” though the landscape is functionally just a normal distribution with additive noise of different frequency spectra. We used four factors, plus a random seed, to control the synthetic objective functions: (1) the number of peaks, (2) the noise amplitude, (3) the smoothness, and (4) the distance between peaks. The number of peaks controls the number of superimposed single-peak multivariate normal layers. The noise amplitude, in the range [0, 1], controls the height of the noise relative to the height of the peaks; setting it to 1 would make the noise as tall as the peaks and effectively give the function infinitely many peaks. Smoothness, in the range [0, 1], controls the weighted contribution of the smooth Gaussian function relative to the rugged noise in the wildcat wells landscape; setting it to 1 removes the noise from the function because the normal distribution then completely dominates. The last parameter, the distance between peaks, can be tuned in the range [0, 1] and prevents overlap of peaks when the function is generated with more than one peak.

Some of these parameters overlap in their effects. For example, *N* controls the number of peaks, while the ruggedness amplitude controls the height of the noise in the function, so increasing the noise automatically increases the number of peaks; we therefore only vary the ruggedness amplitude. Likewise, ruggedness frequency (the distance between peaks) plays the same role as smoothness (the radius of influence of each individual peak on its neighbors). Thus, for the numerical experiments presented in Secs. 3–5, only the ruggedness amplitude and smoothness are varied within [0.2, 0.8] in increments of 0.2. To provide some visual examples of the effect of these parameters on the generated functions, Fig. 3 visualizes an example random surface generated with different smoothness and ruggedness amplitude parameters.
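As a rough illustration of the blending idea (a simplified, single-peak stand-in of our own, not the actual Wildcat Wells code), a tunable surface can be built by mixing a smooth Gaussian peak with smoothed random noise, where `smoothness` weights the two terms and `noise_amplitude` scales the noise height:

```python
import numpy as np

def make_surface(n=64, smoothness=0.5, noise_amplitude=0.5, seed=0):
    """Blend of a single Gaussian peak and blurred noise on an n x n grid.

    smoothness in [0, 1] weights the smooth peak against the noise term;
    noise_amplitude in [0, 1] scales the noise height relative to the peak.
    """
    rng = np.random.default_rng(seed)
    xs = np.linspace(0, 1, n)
    X, Y = np.meshgrid(xs, xs)
    cx, cy = rng.uniform(0.2, 0.8, size=2)        # random peak location
    peak = np.exp(-((X - cx) ** 2 + (Y - cy) ** 2) / 0.05)
    noise = rng.standard_normal((n, n))
    kernel = np.ones(5) / 5.0                      # cheap box blur for the noise
    noise = np.apply_along_axis(lambda r: np.convolve(r, kernel, "same"), 1, noise)
    noise = np.apply_along_axis(lambda c: np.convolve(c, kernel, "same"), 0, noise)
    noise = (noise - noise.min()) / (noise.max() - noise.min())  # rescale to [0, 1]
    return smoothness * peak + (1 - smoothness) * noise_amplitude * noise

# smoothness = 1 removes the noise entirely, leaving the pure Gaussian peak.
s = make_surface(smoothness=1.0)
assert 0.0 <= s.min() and s.max() <= 1.0
```

Varying the two knobs while holding the seed fixed produces families of surfaces of comparable difficulty, mirroring how the experiments below sweep smoothness and ruggedness amplitude.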

### 2.3 Bayesian Optimization.

BO has emerged as a popular sample-efficient approach for optimizing expensive black-box functions. BO models the black-box function using a surrogate model, typically a GP. The next design to evaluate is then selected according to an acquisition function, which uses the GP posterior to recommend the next function evaluation by balancing exploration and exploitation: it explores regions with high uncertainty in the objective function and exploits regions where the mean of the objective function is optimal. At each iteration, the GP is updated with the selected sample, and this process continues iteratively according to the available budget.

In the context of Bayesian optimization, each data point is extremely expensive to obtain; thus, there is a need to select an informative set of initial samples for the optimization process. Toward this end, this paper investigates how the level of initial diverse coverage of the input space affects the convergence of Bayesian optimization policies.

For the purpose of numerical experiments, we use the optimizer from the BoTorch library [57]. The optimizer uses a single-task GP model with expected improvement; the kernel used is a Matérn kernel.
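To make the loop concrete, here is a minimal, self-contained 1D sketch of what such an optimizer does internally (our own toy stand-in, not BoTorch code: an RBF kernel substitutes for the Matérn kernel, the hyper-parameters are fixed rather than refit, and the acquisition is maximized over a grid):

```python
import numpy as np
from math import erf

def gp_posterior(Xq, X, y, ls=0.2, noise=1e-6):
    """Posterior mean/std of a zero-mean GP with an RBF kernel."""
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks = k(Xq, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, best):
    """EI for maximization: E[max(f - best, 0)] under the GP posterior."""
    z = (mu - best) / sigma
    Phi = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))
    phi = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return (mu - best) * Phi + sigma * phi

f = lambda x: np.exp(-10.0 * (x - 0.7) ** 2)  # toy objective, optimum at x = 0.7
grid = np.linspace(0.0, 1.0, 201)
X = np.array([0.1, 0.4, 0.9])                  # initial samples
y = f(X)
for _ in range(10):  # BO loop: fit posterior, maximize acquisition, evaluate
    mu, sigma = gp_posterior(grid, X, y)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.max()))]
    X, y = np.append(X, x_next), np.append(y, f(x_next))
assert abs(X[np.argmax(y)] - 0.7) < 0.1  # best sample lands near the optimum
```

The role the initial samples play is visible here: they determine both the posterior the first acquisition step sees (space exploration) and, in the full BoTorch setup, the data used to refit the kernel hyper-parameters (model building).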

The GP surrogate places a prior over the objective function

$$
f(\mathbf{x}) \sim \mathcal{GP}\big(\mu(\cdot),\, k(\cdot,\cdot)\big) \tag{4}
$$

where $\mu(\cdot)$ and $k(\cdot,\cdot)$ are the mean function and a real-valued kernel function encoding the prior belief on the correlation among the samples in the design space. In Gaussian process regression, the kernel function dictates the structure of the surrogate model we can fit. An important kernel for Bayesian optimization is the Matérn kernel, which incorporates a smoothness parameter $\nu$ to permit greater flexibility in modeling functions:

$$
k_{\nu}(\mathbf{x}, \mathbf{x}') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\,\lVert \mathbf{x} - \mathbf{x}' \rVert}{\boldsymbol{\theta}} \right)^{\!\nu} H_{\nu}\!\left( \frac{\sqrt{2\nu}\,\lVert \mathbf{x} - \mathbf{x}' \rVert}{\boldsymbol{\theta}} \right) \tag{5}
$$

where $\Gamma(\cdot)$ and $H_{\nu}(\cdot)$ are the Gamma function and the Bessel function of order $\nu$, and $\boldsymbol{\theta}$ is the length scale hyper-parameter, which denotes the correlation between the points within each dimension and specifies the distance over which points in the design space influence one another. Here, we use a constant mean for the mean function. The *model building* advantage that we refer to in this paper corresponds to learning these hyper-parameters. The hyper-parameters of the Gaussian process, namely, the parameters of the kernel function and the mean function, are:

*Lengthscale of the Matérn Kernel*: In Eq. (5), *θ* is the lengthscale parameter of the kernel. This parameter controls the ruggedness expected by the Bayesian optimizer in the black box function being studied.

The effects of this parameter are similar to those of *ν*, but *ν* is not learned during the optimization process while the lengthscale is. Hence, *ν* is not studied as a parameter that influences the modeling behavior but rather as an additional parameter for sensitivity analysis.

*Output Scale of Scale Kernel*: The output scale controls how the Matérn kernel is scaled for each batch. Since our Bayesian optimizer uses a single-task GP, we do not use batch optimization; thus, this parameter is a single value for us, and the way it is implemented in BoTorch can be seen in Eq. (6).

*Noise for Likelihood Calculations*: The noise parameter models measurement error or noise in the data. As the Gaussian process gets more data, the noise term decreases; ideally, this term should converge to 0 once the Bayesian optimizer has found an optimal value, since our test functions have no added noise.

*Constant for Mean Module*: This constant is used as the mean for the normal distribution that forms the prior of the Gaussian process as shown in Eq. (4).

Further studies and results regarding the effects of the hyper-parameters are available in the Supplementary Material.

We now describe the first experiment where we explore the effects of diversity of initial samples on the convergence of Bayesian optimizers.

## 3 Experiment 1: Does Diversity Affect Optimization Convergence?

### 3.1 Methods.

To test the effects of initial-sample diversity on optimizer convergence, we first generated sets of initial training samples of size *k* = 10 from either the low-diversity (5th) or high-diversity (95th) percentile using our procedure in Sec. 2.1. Next, we created 100 different instances of the wildcat wells function with different randomly generated seeds for each cell in a 4 × 4 grid of the smoothness and ruggedness amplitude parameters (each ranging from 0.2 to 0.8 in steps of 0.2). For simplicity, we refer to these combinations as families of the wildcat wells function. This resulted in 1600 function instances.

Our experiment consisted of 200 runs of the Bayesian optimizer within each of the smoothness–ruggedness function families, where each run consisted of 100 iterations, and half of the runs were initialized with a low-diversity training sample, and half were initialized with a high-diversity training sample.

We then compared the cumulative optimality gap (COG) across iterations for runs with low-diversity and high-diversity initializations within each smoothness–ruggedness family. We did this by computing bootstrapped means and confidence intervals within the low-diversity and high-diversity sets of runs in each family. Given the full convergence data, the COG is simply the area under the optimality-gap curve, computed for both the 5th- and 95th-percentile diversity curves. Intuitively, a larger COG corresponds to worse overall performance by the optimizer. Using these COG values, we can numerically calculate the improvement of the optimizer initialized at the 95th percentile; the net COG improvement between the 5th and 95th percentiles is also presented as text in each subplot of Fig. 4.
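The COG and its bootstrapped confidence interval can be computed as follows (a sketch with hypothetical data; `runs` stands in for an array of optimality-gap curves, one row per run):

```python
import numpy as np

def cumulative_optimality_gap(gaps):
    """Discrete area under the optimality-gap curve: sum over iterations."""
    return np.sum(gaps, axis=-1)

def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrapped mean and (1 - alpha) confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    means = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_boot)
    ])
    return values.mean(), np.quantile(means, [alpha / 2, 1 - alpha / 2])

# Hypothetical example: 100 runs x 100 iterations of noisy, decaying gaps.
rng = np.random.default_rng(0)
runs = np.exp(-0.05 * np.arange(100)) * (1 + 0.1 * rng.standard_normal((100, 100)))
cogs = cumulative_optimality_gap(runs)
mean, (lo, hi) = bootstrap_mean_ci(cogs)
assert lo <= mean <= hi  # the interval brackets the sample mean
```

Comparing the mean COG (and overlap of the intervals) between the 5th- and 95th-percentile runs in a family gives the per-cell improvement numbers reported in Fig. 4.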

### 3.2 Results.

As Fig. 4 shows, diversity does not appear to have a consistent effect on the cumulative optimality gap across the grid. Diversity produces a positive convergence effect in some cells but a negative one in others. Moreover, there are wide empirical confidence bounds on the mean effect overall, indicating that should an effect exist at all, it likely does not have a large effect size. Changing the function ruggedness or smoothness did not significantly modulate the overall effect. As expected, given sufficient samples (far right on the *x*-axis), both diverse and non-diverse initializations reach the same optimality gap, since by that point the initial samples have been crowded out by the new samples gathered by BO during its search.

### 3.3 Discussion.

Overall, the results from Fig. 4 indicate that diversity helps in some cases, hurts in others, and regardless has limited impact either way. This runs counter to the widespread practice of diversely sampling the initial input space using techniques like LHS, which Fig. 4 shows has little effect here.

Why would this be? Given decades of research into initialization schemes for BO and optimal experiment design, we expected diversity to have at least some (perhaps small but at least consistent) positive effect on convergence rates, and not the mixed bag that we see in Fig. 4. How were the non-diverse samples gaining such an upper hand when the diverse samples had a head start on exploring the space—what we call a *space exploration* advantage?

The next section details an experiment we conducted to test a hypothesis regarding a potential implicit advantage that non-diverse samples might endow to BO that would impact the convergence of BO’s hyper-parameter posteriors. As we will see next, this accelerated hyper-parameter posterior convergence caused by non-diverse initialization is the Achilles’ heel of diversely initialized BO that allows the non-diverse samples to keep pace and even exceed diverse BO.

## 4 Experiment 2: Do Lower Diversity Samples Improve Hyper-parameter Posterior Convergence?

After reviewing the results from Fig. 4, we tried to determine why the space exploration advantage of diversity was not helping BO as we thought it should. We considered as a thought experiment the one instance where a poorly initialized BO model with the same acquisition function might outperform another: if one model’s kernel hyper-parameter settings were so grossly incorrect that the model would waste many samples exploring areas that it did not need to if it had the correct hyper-parameters.

Could this misstep be happening in the diversely sampled BO but not in the non-diverse case? If so, this might explain how non-diverse BO was able to keep pace: while diverse samples might give BO a head start, they might unintentionally blindfold BO to the true function posteriors, sending it chasing directions it need not explore. If this hypothesis were true, we would see it reflected in the comparative accuracy of the kernel hyper-parameters learned from the diverse versus non-diverse BO samples. This experiment set out to test that hypothesis.

### 4.1 Methods.

The key difference from experiment 1 is that rather than comparing the overall optimization convergence, we instead focus on how the initial samples’ diversity affects BO’s hyper-parameter posterior convergence, and compare how far each is from the “ground truth” optimal hyper-parameters.

As with experiment 1, we used the same smoothness and ruggedness amplitude families of the wildcat wells function. To generate the data for each instance in one of these families, we sampled 20 sets of initial samples: half from the low-diversity (5th) percentile and half from the high-diversity (95th) percentile.

For each initial sample, we then maximized the GP kernel’s marginal log likelihood (via BoTorch’s GP fit method) and recorded the hyper-parameters obtained for all 20 initial samples. The mean of the ten low-diversity samples was then used as one point in the box plot’s low-diversity distribution, as seen in Fig. 5. We repeated this process for the high-diversity initial samples. Each point in the box plot can then be understood as the mean hyper-parameter learned by BoTorch given just the initial sample of *k* = 10 points. To obtain the full box plot distribution for each family, the above process was repeated over 100 seeds; Fig. 5 provides the resulting box plots for both diverse and non-diverse initial samples for all 16 families of the wildcat wells function described in experiment 1.
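The fitting step amounts to type-II maximum likelihood: choosing the kernel hyper-parameters that maximize the GP's log marginal likelihood given only the initial samples. A minimal 1D sketch with a grid search over an RBF lengthscale (BoTorch instead uses gradient-based optimization of a Matérn kernel, so this is an illustration of the principle, not the actual fit):

```python
import numpy as np

def log_marginal_likelihood(X, y, lengthscale, noise=1e-4):
    """log p(y | X, lengthscale) for a zero-mean GP with an RBF kernel."""
    d2 = (X[:, None] - X[None, :]) ** 2
    K = np.exp(-0.5 * d2 / lengthscale**2) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha                 # data-fit term
            - np.sum(np.log(np.diag(L)))     # complexity penalty: 0.5 * log|K|
            - 0.5 * len(X) * np.log(2 * np.pi))

def fit_lengthscale(X, y, grid=np.logspace(-2, 1, 60)):
    """Type-II MLE: pick the lengthscale maximizing the marginal likelihood."""
    lls = [log_marginal_likelihood(X, y, ls) for ls in grid]
    return grid[int(np.argmax(lls))]

# A smooth periodic function sampled at 10 points should recover a
# moderate lengthscale rather than a degenerate (tiny or huge) one.
X = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * X)
ls = fit_lengthscale(X, y)
assert 0.05 < ls < 1.0
```

The experiment below asks how this fitted value shifts when the 10 points are clustered (non-diverse) versus spread out (diverse), relative to the value obtained with many more points.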

To provide a ground truth for the true hyper-parameter settings, we ran a binary search to find the sample size (*k*_{optimal}) at which BO’s kernel hyper-parameters converged for all families. The hyper-parameter found by providing *k*_{optimal} points for each instance in the family was then plotted as a horizontal line in each box plot. An interesting observation is that some families have non-overlapping horizontal lines. This is because some families have more than one mode of “optimal hyper-parameters”; in those cases, the mode chosen as the “optimal hyper-parameter” is the more frequently observed one. The process for finding the “optimal hyper-parameter,” and which mode is chosen as optimal, is described in the Supplementary Material.
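The binary search itself is generic. A minimal sketch follows, with a toy convergence predicate standing in for the real check (that the hyper-parameters learned from *k* points have stabilized within tolerance); the bounds and threshold are illustrative assumptions:

```python
def smallest_converged_k(converged, k_lo=5, k_hi=512):
    """Binary search for the smallest sample size k whose fitted
    hyper-parameters have converged, assuming `converged` is monotone
    in k (False below some threshold, True at and above it)."""
    assert converged(k_hi), "upper bound must already be converged"
    while k_lo < k_hi:
        mid = (k_lo + k_hi) // 2
        if converged(mid):
            k_hi = mid       # converged: threshold is at or below mid
        else:
            k_lo = mid + 1   # not converged: threshold is above mid
    return k_lo

# Toy predicate standing in for "hyper-parameters stable within tolerance".
k_opt = smallest_converged_k(lambda k: k >= 37)  # → 37
```

Because each predicate evaluation requires refitting the GP across seeds, the binary search’s logarithmic number of evaluations is what makes finding *k*_{optimal} affordable.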

If an initial sample provides a good initial estimate of the kernel hyper-parameter posterior, then the box plot should align closely with the horizontal lines of the true posterior. Figure 5 only shows the results for the Matérn kernel’s lengthscale parameter, given its out-sized importance in controlling the GP function posteriors compared to the other hyper-parameters (e.g., output scale, noise, etc.), which we do not plot here for space reasons. We provide further details and plots for all hyper-parameters in the Supplementary Material for interested readers.

To quantify the average distance between the learned and true hyper-parameters, we also plot in Fig. 5 the mean absolute error (MAE) for both highly diverse (95th percentile) and less diverse (5th percentile) points. The MAE is the mean absolute distance of each predicted hyper-parameter from the optimal hyper-parameter for the particular surface of each wildcat wells function. The range shown in each cell in Fig. 5 corresponds to a 95% confidence bound on the MAE across all 100 runs.
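This error metric is straightforward to compute; the sketch below uses a bootstrap interval for concreteness, and the exact interval construction behind Fig. 5 may differ:

```python
import numpy as np

def mae_with_ci(predicted, optimal, n_boot=2000, seed=0):
    """Mean absolute error of learned hyper-parameters against the
    optimal value, with a bootstrap 95% confidence interval on the mean."""
    errs = np.abs(np.asarray(predicted, float) - optimal)
    rng = np.random.default_rng(seed)
    # Resample the per-run errors and take the mean of each resample.
    boots = rng.choice(errs, size=(n_boot, errs.size), replace=True).mean(axis=1)
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return errs.mean(), (lo, hi)

# Hypothetical learned lengthscales from four seeds, true value 1.0:
mae, (lo, hi) = mae_with_ci([0.9, 1.2, 1.1, 0.8], optimal=1.0)
```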

### 4.2 Results and Discussion.

The results in Fig. 5 show that the MAE values for low diversity samples are always lower compared to the MAE for high-diversity samples. This general behavior is also qualitatively visible in the box plot. This means that after only the initial samples, the non-diverse samples provided much more accurate estimates of the kernel hyper-parameters compared to diverse samples. Moreover, BO systematically *underestimates* the correct lengthscale with diverse samples—this corresponds to the diverse BO modeling function posteriors that have higher frequency components than the true function actually does (as shown via the pedagogical examples in the Supplementary Material).

This provides evidence for the *model building* advantage of non-diverse samples that we defined in Sec. 2.3. It also confirms our previous conjecture from the thought experiment that diverse samples might be impacting BO by causing slower or less accurate convergence to the right BO hyper-parameters. The space exploration advantage of the diverse samples helps it compensate somewhat for its poor hyper-parameters, but BO trained with non-diverse samples can leverage the better hyper-parameters to make more judicious choices about what points to select next.

We did not see major differences in the other three kernel hyper-parameters (output scale, noise, and the mean function; see Supplementary Material); however, this is not surprising, since BO is not highly sensitive to any of these parameters, whereas the lengthscale parameter drives large changes in BO behavior.

Comparing the different smoothness and ruggedness settings, when the function is more complex (the top right of the grid at low smoothness and high ruggedness amplitude values), the function’s lengthscale is lower and closer to the value learned by the diverse samples. Looking at the low diversity MAE values (“MAE 5”), we can see they are much closer to those of the high-diversity samples (“MAE 95”), in contrast to when the function is less complex (bottom left side of the grid). Under such conditions, low-diversity samples lose some of the relative model building advantage they have over high-diversity samples. This conjecture aligns with experiment 1 (Fig. 4) where the COG values on the top right part are positive while those on the bottom left are negative.

Figure 5 demonstrated our hypothesized model building advantage that non-diverse initial samples confer to BO. But how do we know that this is the actual causal factor that accelerates BO convergence, and not just correlated with some other effect? If correct, our conjecture posits a natural testable hypothesis: if we fix the values of the hyper-parameter posteriors to identical values between the non-diverse and diverse samples and do not allow the BO to update or optimize them, then this should effectively eliminate the model building advantage, and diverse samples should always outperform non-diverse samples. Metaphorically, if we were to take away the arrow that Paris used against Achilles, would the Battle of Troy have ended differently? Our next experiment finds this out.

## 5 Experiment 3: Does Diversity Affect Optimization Convergence If Hyper-parameters Are Fixed to Optimal Values?

### 5.1 Methods.

This experiment is identical to experiment 1, with two key differences: (1) we now fix the kernel hyper-parameters to the “optimal hyper-parameter” values we found in experiment 2 for all the instances in each family of the wildcat wells function, and (2) we do not allow either BO model to further optimize the kernel hyper-parameters. This should remove the hypothesized model building advantage of non-diverse samples without altering any other aspect of experiment 1 or the results in Fig. 4.
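To make the fixed-hyper-parameter condition concrete, here is a minimal GP posterior with a hard-coded RBF lengthscale (illustrative only; our experiments used BOTorch’s Matérn kernel). Conditioning on new samples still updates the posterior mean and variance, but the lengthscale itself is never re-fit:

```python
import numpy as np

def gp_posterior(X_train, y_train, X_test, lengthscale, noise=1e-4):
    """GP posterior mean/variance with a *fixed* RBF lengthscale --
    mirroring experiment 3, where hyper-parameters are never re-optimized."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)
    K = k(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = k(X_train, X_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(k(X_test, X_test)) - (v**2).sum(axis=0)
    return mean, var

# Sanity check: the posterior should interpolate the observed points.
X_train = np.array([[0.2, 0.2], [0.8, 0.8]])
y_train = np.array([1.0, -1.0])
mean, var = gp_posterior(X_train, y_train, X_train, lengthscale=0.3)
```

With the lengthscale frozen, the only thing either BO variant can improve during search is its coverage of the function itself, which is exactly what isolates the space exploration advantage.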

### 5.2 Results and Discussion.

Figure 6 shows that once the kernel hyper-parameters are fixed—removing the model building advantage of non-diverse samples—diverse samples consistently and robustly outperform non-diverse initial samples. This holds for both the initial optimality gap at the beginning of the search as well as the cumulative optimality gap and is not qualitatively affected by the function smoothness or roughness amplitude. Unlike in experiment 1 where diversity could either help or hurt the optimizer, once we remove the model building advantage, diversity only helps.

## 6 General Discussion and Conclusions

### 6.1 Summary and Interpretation of Findings.

This paper’s original goal was to investigate how and when diverse initial samples help or hurt Bayesian optimizers. Overall, we found that the initial diversity of the provided samples created two competing effects. First, experiment 2 showed that non-diverse samples improved BO’s ability to quickly converge to optimal hyper-parameters—we called this a *model building* advantage. Second, experiment 3 showed that, conditioned on the same fixed hyper-parameters, diverse samples improved BO’s convergence to the optima through faster exploration of the space—we called this a *space exploration* advantage. In experiment 1, diversity had mixed-to-negligible effects since both of these advantages were in play and competed with one another. This interaction provides insight for academic or industrial BO users, since common practice recommends initializing BO with space-filling samples (to take advantage of the space exploration advantage) while ignoring the model building advantage of non-diverse samples.

Beyond our main empirical result, our improvements to existing diverse sampling approaches (Sec. 2.1) provide new methods for studying how different percentile diversity sets affect phenomena. Researchers may find this contribution of separate technical and scientific interest for related studies that investigate the impact of diversity.

### 6.2 Implications and Future Work.

Beyond the individual results we observed and summarized in each experiment, there are some overall implications and limitations that may guide future work or interpretation of our results more broadly, which we address below.

*Where does this model building advantage induced by non-diverse samples come from?* As we conjectured in experiment 2 (Sec. 4) and confirmed in experiment 3 (Sec. 5), the key advantage of using non-diverse initial samples lies in their ability to induce faster and more accurate posterior convergence when inferring the optimal kernel hyper-parameters, such as length scale and others. This allowed the BO to make more judicious and aggressive choices about what points to sample next, so while the diversely initialized models might get a head start on exploring the space, non-diversely initialized models needed to explore less of the space overall, owing to tighter posteriors of possible functions under the Gaussian process.

While we do not have space to include it in the main paper, the Supplementary Material document’s Sec. 5 shows how this model building advantage occurs as we provide BO with a greater number of initial samples. Briefly, there are three “regimes”: (1) sample-deficient, where there are too few samples to induce a modeling advantage regardless of how diversely we sample the initial points; (2) the “modeling advantage” region, where low-diversity samples induce better hyper-parameter convergence than high-diversity samples; and (3) sample-saturated, where there are enough initial samples to induce accurate hyper-parameter posteriors regardless of how diversely we sample initial points. We direct interested readers to Sec. 5 of the Supplementary Material for a deeper discussion on this.

What this behavior implies more broadly is that non-diverse samples, whether given to an algorithm or a person, have a unique and perhaps underrated value in cases where we have high entropy priors over the Gaussian process hyper-parameters or kernel. In such cases, sacrificing a few initial non-diverse points to better infer key length scales in the GP model may well be a worthwhile trade.

We also saw that in cases where the BO hyper-parameters were not further optimized (as in experiment 3, where hyper-parameters were fixed to optimal values), using diverse points only helped BO. Researchers or practitioners using BO would benefit from carefully reviewing what kernel optimization strategy their library or implementation of choice actually uses, since that will affect whether or not the model building advantage of non-diverse samples is actually in play.

*What if hyper-parameters are fixed to non-optimal values?* We showed in experiment 3 that fixing BO hyper-parameters to their optimal values ahead of time, using an oracle, allowed diverse initial samples to consistently outperform non-diverse samples. An interesting avenue of future work that we did not explore here for scope reasons would be to test whether this holds when hyper-parameters are fixed to non-optimal values. In practical problems, we will rarely know the optimal hyper-parameters ahead of time as we did in experiment 3, which is what produced diversity’s across-the-board advantage, so we do not have evidence to generalize beyond this setting. However, our explanation of the model building advantage would predict that, so long as the hyper-parameters remain fixed (to any value), BO would not have a practical mechanism to benefit much from non-diverse samples, on average.

*What are the implications for how we currently initialize BO?* One of our results’ most striking implications is how it might influence BO initialization procedures that are often considered standard practice. For example, it is common to initialize a BO procedure with a small number of space-filling designs, using techniques like Latin hypercube sampling (LHS), before allowing BO to optimize its acquisition function for future samples. In cases where the BO hyper-parameters will remain fixed, experiment 3 implies that this standard practice is excellent advice and far better than non-diverse samples. However, in cases where you plan to optimize the BO kernel during search, using something like LHS becomes more suspect.

In principle, from experiment 1, we see that diverse samples may help or hurt BO, depending on how much leverage the model building advantage of the non-diverse samples can provide. For example, in the upper right of Fig. 4, the function is effectively random noise, and so there is not a strong model building advantage to be gained. In contrast, in the lower left, the smooth and well-behaved functions allowed non-diverse initialization to gain an upper hand.

Our results propose a perhaps now obvious initialization strategy: if you plan on optimizing the BO hyper-parameters, use some non-diverse samples to strategically provide an early model building advantage, while leveraging the rest of the samples to diversely cover the space.
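As a concrete (hypothetical) sketch of that strategy, the initializer below spends a few budget points on a tight local cluster for the model building advantage and fills the rest with an LHS-style space-filling design; the budget split and cluster radius are arbitrary assumptions, not tuned recommendations:

```python
import numpy as np

def hybrid_init(rng, budget=10, n_cluster=3, cluster_radius=0.05, dim=2):
    """Hypothetical hybrid initializer: a tight non-diverse cluster (for
    hyper-parameter learning) plus stratified space-filling points (for
    exploration), all on the unit hypercube."""
    # Non-diverse cluster: a random center with small jittered offsets.
    center = rng.uniform(cluster_radius, 1 - cluster_radius, size=dim)
    cluster = center + rng.uniform(-cluster_radius, cluster_radius,
                                   size=(n_cluster, dim))
    # LHS-style filler: one point per stratum, per dimension, shuffled.
    n_fill = budget - n_cluster
    strata = (np.arange(n_fill)[:, None]
              + rng.uniform(0, 1, (n_fill, dim))) / n_fill
    filler = np.stack([rng.permutation(strata[:, d]) for d in range(dim)],
                      axis=1)
    return np.vstack([cluster, filler])

X0 = hybrid_init(np.random.default_rng(0))
```

How to split the budget between the cluster and the filler is exactly the open question raised at the end of Sec. 6.2.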

*How might other acquisition functions modulate diversity’s effect?* While we have been referring to BO as though it is a single method throughout this paper, individual BO implementations can vary, both in terms of their kernel structure and their choice of acquisition function—or how BO uses information about the underlying fitted Gaussian process to select subsequent points. In this paper’s experiments, we used expected improvement (EI) since it is one of the most widespread choices and behaves qualitatively like other common improvement-based measures such as probability of improvement, posterior mean, and upper confidence bound functions. Indeed, we hypothesize that part of the reason non-diverse initial samples are able to gain a model building advantage over diverse samples is a faster collapse in the posterior distribution of possible GP functions, which serves as a strong input to EI and related variants.
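For reference, the standard closed-form EI for maximization, computed from the GP posterior mean and standard deviation (a generic textbook formulation, not our BOTorch configuration):

```python
import numpy as np
from math import erf, sqrt, pi

def expected_improvement(mu, sigma, f_best, xi=0.0):
    """Closed-form EI for maximization:
    EI = (mu - f_best - xi) * Phi(z) + sigma * phi(z), z = imp / sigma."""
    mu = np.asarray(mu, float)
    sigma = np.asarray(sigma, float)
    imp = mu - f_best - xi
    # Guard against division by zero at fully observed points.
    z = np.divide(imp, sigma, out=np.zeros_like(imp), where=sigma > 0)
    Phi = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))  # normal CDF
    phi = np.exp(-0.5 * z**2) / sqrt(2.0 * pi)            # normal PDF
    ei = imp * Phi + sigma * phi
    return np.where(sigma > 0, ei, np.maximum(imp, 0.0))
```

Note how both terms depend directly on the posterior `sigma`: a faster collapse of the GP posterior (the model building advantage) immediately sharpens what EI considers worth sampling.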

Yet EI and its cousins are only one class of acquisition function; would our results hold under an acquisition function that directly attacks the GP’s posterior variance, such as entropy-based or active-learning-based acquisition functions? This paper did not test this, and it would be a logical and valuable future study. Our experimental results and proposed explanation predict the following: the model building advantage seen by non-diverse samples should shrink or disappear when the acquisition function explicitly samples new points to minimize the posterior variance over GP function classes, since in such cases BO itself would select samples that reduce overall GP variance, reducing its dependence on what the initial samples provide.

*To what extent should we expect these results to generalize to other types of problems?* We selected a simple 2D function with controllable complexity in this paper to aid experimental simplicity, speed, replicability, and ease of visualization; however, this raises the question of whether these results would transfer to more complex problems of engineering interest. While future work would have to address more complex problems, we performed two additional experiments studying how the above phenomena change as we (1) increased the wildcat wells function from two to three dimensions and (2) swapped in other types of common optimization test functions—specifically, the N-dimensional sphere, Rastrigin, and Rosenbrock functions from two to five dimensions. While paper length limits did not allow us to include all of these additional results in the main body, we direct interested readers to Secs. 6 and 7 of the Supplementary Material document. Briefly, those results align overall with what we described above for the 2D wildcat wells function, and we do not believe the phenomena we observed are restricted to our chosen test function or dimension, although future research would clearly need to test other problems to say this with certainty. Beyond these supplemental results, we can also look at a few critical problem-specific factors and ask what our proposed explanatory model would predict.

For higher dimensional problems, standard GP kernel choices like RBF or Matérn begin to face exponential cost increases due to how hyper-volumes expand. In such cases, having strong constraints (via hyper-parameter priors or posteriors) over possible GP functions becomes increasingly important for fast BO convergence. Our results would posit that any model building advantages from non-diverse sampling would become increasingly important or impactful in cases where it helped BO rapidly collapse the hyper-parameter posteriors.

For discontinuous functions (or GP kernels that assume as much), the model building advantage of non-diverse samples would decrease, since large sudden jumps in the GP posterior mean and variance would make it harder for BO to exploit that advantage. However, in discontinuous cases where common global smoothness parameters still govern the continuous portions, the model building advantage would still accelerate BO convergence.

*How might the results guide human-subject experiments or understanding of human designers?* One possible implication of our results for human designers is that the effects of example diversity on design outcomes may vary as a function of the designer’s prior knowledge of the design problem. More specifically, the model building advantage observed in experiment 2 (and subsequent removal in experiment 3) suggests that when designers have prior knowledge of how quickly the function changes in a local area of the design space, they can more reliably benefit from the space exploration advantage of diverse examples. This leads to a potentially counter-intuitive prediction that domain experts may benefit more from diverse examples compared to domain novices since domain experts would tend to have prior knowledge of the nature of the design problem (a model building advantage). Additionally, perhaps under conditions of uncertainty about the nature of the design problem, it would be useful to combine the strengths of diverse and non-diverse examples; this could be accomplished with a cluster-sampling approach, where we sample diverse points of the design space, but include local non-diverse clusters of examples that are nearby, to facilitate learning of the shape of the design function.

While these implications might be counter-intuitive in that common guidance suggests that the most informative method is to only diversely sample initial points, the crux of our paper’s argument is that non-diverse points *can*, surprisingly, be informative to Bayesian optimization due to their ability to quickly concentrate the posterior distribution of the kernel hyper-parameters, and thus accelerate later optimization. Given this tension, a natural question is “how many non-diverse samples do I really need to take advantage of the modeling advantage without giving up the space exploration advantage?” If I have, say, a budget of ten experiments, should I spend only one low-diversity sample? Or do I need two? Half of my budget? We did not explore these practical questions in this work, due to space constraints, but we think this would be an excellent avenue for continued study.

## Footnotes

For grammatical simplicity and narrative flow, we will use the phrase “non-diverse” throughout the paper to refer to cases where samples are taken from the 5th percentile of diverse sets—these are technically “low-diversity” rather than being absolutely “non-diverse” which would occur when all points in the set are identical, but we trust that readers can keep this minor semantic distinction in mind.

See Note ^{3}.

## Acknowledgment

This research was supported by funding from the National Science Foundation through award #1826083.

## Conflict of Interest

There are no conflicts of interest.

## Data Availability Statement

The data and information that support the findings of this article are freely available.^{4}