Understanding how humans decompose design problems will yield insights that can be applied to develop better support for human designers. However, there are few established methods for identifying the decompositions that human designers use. This paper discusses a method for identifying subproblems by analyzing when design variables were discussed concurrently by human designers. Four clustering techniques for grouping design variables were tested on a range of synthetic datasets designed to resemble data collected from design teams, and the accuracy of the clusters created by each algorithm was evaluated. A spectral clustering method was accurate for most problems and generally performed better than hierarchical (with Euclidean distance metric), Markov, or association rule clustering methods. The method's success should enable researchers to gain new insights into how human designers decompose complex design problems.

## Introduction

Decomposing problems is a critical part of the design process, and the choice of a particular design problem decomposition may influence both the design process and the final solution. Human designers, when confronting a complex system design problem, often decompose the problem into simpler subproblems [1–4]. This is similar, but not identical, to decomposing a formal optimization problem into subproblems to improve the performance of a design optimization algorithm [5]. The design approaches of human designers may be improved by selecting more appropriate decompositions, but it is clear from previous research that humans decompose problems in very different ways [1–4], so the best decompositions are not obvious to humans. A first step to supporting human designers is to understand how human designers intuitively decompose design problems. This effort is both important and difficult. It is important in order to identify potential weaknesses, create guidelines for helping human designers to use better decompositions, and develop methods and tools to support better design. It is difficult, however, to understand how human designers decompose problems because there are few established and straightforward methods for identifying decompositions based on data from human designers.

This paper describes an approach for identifying the subproblems used by human designers during a design process, based on data that describes which topics, decisions, or variables were being discussed during each time segment of the process. We studied data drawn from observation of design teams as they solved design problems (this type of data motivated this work), but similar data could be created by coding verbal protocols, daily journal entries, weekly progress reports, or any other record that identifies the decisions that were being considered over time in a design process. The method involves clustering the variables (or topics or decisions) to determine which ones were typically considered together. The method thus involves a combination of qualitative and quantitative approaches: it first codes qualitative data and then uses clustering algorithms to identify likely subproblems in order to facilitate data exploration. By focusing on the latter step, our work complements existing work in the design community by providing a new approach for analyzing coded data from human designers [6].

Specifically, this paper's contribution is the systematic evaluation of different clustering techniques for use in the latter step of this approach: identifying subproblems based on data describing human design activity. We evaluated the performance of the clustering algorithms by creating a range of synthetic datasets that were similar to real data (from observations of small teams of professional designers solving two facility layout problems) and assessing the accuracy of the clusters that the algorithms generated. We added various levels of noise to the data and assessed and compared the ability of the clustering algorithms to identify the true subproblems. The results suggest approaches for identifying subproblems based on coded observations of design processes. Identifying such subproblems enables research on how human designers decompose problems because it transforms a design team's discussions into structured data about their decompositions.

Previous papers about our related research have described our data collection and analysis methods, some aspects of the clustering algorithms, and the results from our studies of design teams [7–12]; details are given in Sec. 3.1. Unlike these previous papers, this paper discusses a study to evaluate multiple clustering techniques on how well they group the variables that the design team discusses into clusters that describe the subproblems.

The remainder of this paper proceeds as follows: Sec. 2 discusses previous work on problem decomposition and clustering algorithms. Section 3 describes the clustering algorithms and evaluation approaches. Section 4 presents the results of the evaluation of the clustering algorithms. Section 5 discusses these results. Section 6 summarizes and concludes the paper.

## Related Literature

### Decomposition and Human Designers.

A product or system design process can be viewed as a series of design decisions [13–18]. The sequence of decisions is shaped by the way that the designers decompose the system design problem. Decomposition produces subgoals [1], which are desired problem states [19], and achieving those states requires solving design subproblems. A subproblem is a combination of related topics or ideas that are distinct from other subproblems in that they have fewer significant relationships across subproblems than within them [20].

Researchers have used empirical studies of design decision-making to understand how human designers decompose problems. Ho [2] identified two types of decomposition: explicit and implicit. An explicit decomposition is created at the beginning of the design process, whereas an implicit decomposition is created as the designers solve the design problem. Although an explicit decomposition can be more effective [4], designers, especially novice designers, appear to use implicit decomposition more often [1]. Tobias et al. [8] also found that designers infrequently discussed their decomposition strategy.

Li et al. [21] identified examples of design process decomposition in a study of mechanical engineers redesigning a gear drive. Austin-Breneman et al. [22] found that student design teams, although they decomposed the system design problem given to them and solved the resulting subproblems, failed to consider the system-level performance, which led to inferior designs. Gero and Song [3] found that professional engineers decomposed a system design problem more often than engineering students. In their study, decomposition occurred when the designer moved from considering a higher level to a lower level of the design problem. The decompositions (and associated subproblems) used by design teams in facility layout design challenges have been described by Gralla et al. [7,9].

Many of these studies used methods that employ observation of human designers and, in particular, verbal protocol analysis. Protocol analysis is a common technique for studying design activity empirically. Protocol analysis collects observations of design activity and then classifies (codes) these observations into categories relevant to the research questions [23–26]. Protocol analysis includes process-oriented studies that consider information processing and content-oriented studies that consider the artefact that is created [27]. Both descriptive and inferential statistics have been used to analyze the coded segments [26].

The connections between coded segments have been used to create linkographs to describe design activity [26,28–30]. Some studies have used a Markov chain to model the design process and estimated the transition probabilities of the Markov chain from the frequency of the activities or states that were coded [26,31]. McComb et al. [32,33] used a hidden Markov model to represent the design process.

These diverse examples show that methods for analyzing data from verbal protocols—and other forms of qualitative data describing design processes—are still evolving. Many studies of design decomposition used coding schemes that focus on the types of activities being performed or information being processed [1,2,4]. We developed an approach that utilizes codes describing the design variables or content of the decision in order to identify design subproblems that show how different variables or content are linked.

### Clustering.

Clustering algorithms are important tools in many domains, and many studies proposing and evaluating clustering approaches have appeared in the literature. A comprehensive review is beyond the scope of this paper. Some key texts include Anderberg [34], Jain and Dubes [35], Kaufman and Rousseeuw [36], and Everitt et al. [37]. Both hierarchical clustering and partitional clustering algorithms have been developed [38,39]. Ng et al. [40] reviewed spectral clustering techniques and observed that using multiple eigenvectors to find multiple clusters is more effective than spectral graph partitioning, which creates multiple clusters by recursive bisection.

One approach to evaluating a clustering algorithm is to evaluate the validity of the clusters that it generates using external criteria, internal criteria, and relative criteria [35,41]. In addition, individual clusters can be evaluated based on their compactness and isolation [35,42,43].

Previous reviews discussed the use of clustering in different applications, including data mining [44] and the traveling salesman problem and bioinformatics [39]. There have also been applications to topics in engineering design, including VLSI design [45] and grouping design concepts [46–48]. To cluster the variables and functions in a design optimization problem, Sarkar et al. [49] used singular value decomposition to generate a distance metric and then used *k*-means clustering to group coupled variables and functions. Sarkar et al. [50] used a similar approach to find the modules in a system architecture. Wilschut et al. [51] used an approach based on Markov clustering to group the rows and columns of a design structure matrix and compared their approach with two other algorithms developed for clustering design structure matrices.

Despite 50 years of clustering research, Jain [52] concluded that “there is no single clustering algorithm that has been shown to dominate other algorithms across all application domains. …A clustering method that satisfies the requirements for one group of users may not satisfy the requirements of another …. ‘clustering is in the eye of the beholder’ so indeed data clustering must involve the user or application needs.” Therefore, it is important to evaluate clustering algorithms for each new application domain, as we do here for identifying clusters based on the variables discussed by human designers.

The current paper complements the authors' previous work on studying how design teams decompose design problems [10,11] and adds to our growing knowledge of how to study design teams [6]. In particular, this study evaluated multiple clustering techniques on how well they group the variables that the design team discusses into clusters that describe the subproblems. Previous work on clustering algorithms has not systematically evaluated these techniques on datasets based on observations of design team activity. In these datasets, the items to be clustered are described by 50 or more binary attributes; thus, these data do not resemble datasets with continuous value attributes or a small number of categorical attributes.

## Approach and Methods

### Background and Previous Work.

The research described in this paper is part of a larger study of human designers decomposing problems, in which the decisions of design teams solving facility layout problems were observed and analyzed [7–12]. In particular, Azhar et al. [10] and Morency et al. [11] described the data collection and analysis processes and introduced the clustering algorithms. This study focuses on evaluating clustering algorithms that could be used to cluster the variables that the design teams discussed.

In brief, the larger study involved observing small teams of professional designers solving two kinds of facility layout problems. Teams of public health professionals designed a point of distribution (POD) for rapidly distributing medication to the local population in a public health emergency. Teams of manufacturing professionals redesigned a factory layout for a moderately complex product assembly process. Each team, while working in a small room with a large printed version of the facility layout, had three to four hours to complete the exercise. We made audio and video recordings of the teams' discussions and their drawings on the facility layout, reviewed these recordings, and identified the variables that the teams discussed: one set of variables for the POD design problem and one for the factory redesign problem. Finally, we coded each 2 min segment of each team's video by determining which, if any, of the variables the team discussed during that segment. The codes are represented for each team as a matrix of values. Each row corresponds to a variable, and each column corresponds to a time segment. The value in an entry is 1 if the team discussed that variable during that segment and is 0 otherwise. Figures 7 and 8 in the Appendix show the data for one factory redesign team and one POD design team.

To identify the decompositions used by design teams, we wanted to identify which variables were considered together as subproblems by the human designers. The premise was that variables discussed at the same time are more likely to be part of the same subproblem. Therefore, we investigated clustering algorithms that could identify groupings of variables based on how often they were coded at the same time. Because clustering is an effective exploratory data analysis approach [36], we decided to use clustering to help us explore the data from each design team. Identifying accurate subproblems for each team is important, but it is only one step. In our larger study, we used these to seek patterns across multiple design teams, but this paper focuses on evaluating the accuracy of the clustering algorithms for identifying the decompositions of human designers.

### Approach Overview.

This study aimed to evaluate the accuracy of four different clustering algorithms for finding subproblems in the type of data described above by clustering concurrently discussed variables. Although our observations of design teams have provided rich information about how design teams decompose facility design problems, it is difficult to evaluate the accuracy of the clustering algorithms because we cannot determine precisely the true subproblems that the design team considered. Therefore, for this study, we introduced synthetic data sets that replicate important characteristics of the real data. We tested the clustering algorithms against noisy versions of these synthetic data and compared the results to the true subproblems. Section 3.3 describes the generation of the synthetic data and the procedure for adding noise, and Sec. 3.4 describes the computational experiments that tested the sensitivity of the clustering algorithms' performance to changes in characteristics of the instance data.

Four clustering algorithms were evaluated by using them to group the variables in each timeline. Section 3.5 presents the clustering algorithms. Because we found no previous research on using clustering algorithms with data about human design activity, we decided to use both hierarchical and partitioning methods (the two major classes of clustering algorithms). Thus, we chose a straightforward hierarchical clustering algorithm using the distances between items and a novel hierarchical clustering algorithm that first used a dimensionality reduction technique from spectral clustering. We chose Markov clustering as a partitioning method because it does not require one to set the number of clusters (unlike approaches such as *k*-means). We also used association rules as a partitioning method in order to explore the data by considering the relationships within each time segment (instead of comparing the variables' timelines). Both spectral clustering and Markov clustering are relatively new approaches, so studying them can generate new insights that could not be obtained by looking at only the most well-known approaches.

We evaluated the results by comparing the clusters to the true subproblems to calculate their accuracy, as described in Sec. 3.6.

### Synthetic Data.

We generated synthetic data sets that resemble in important ways the timelines that we collected from our design studies. Recall that the data consist of matrices, in which each row corresponds to a variable and each column corresponds to a time segment. The value in an entry is 1 if the team discussed that variable during that time segment and 0 otherwise. The clustering algorithm aims to identify subproblems, or groups of variables (rows), based on how often the variables are coded concurrently (in the same column).

To generate a synthetic dataset, we first randomly generated a “true” set of subproblems (groupings of variables). Then, for each subproblem, we randomly generated its “work sessions,” which are sets of consecutive time segments in which the variables in the subproblem are coded “1” to indicate that the variables in that subproblem were discussed during those time segments. Finally, noise was added to the data, as described below.

To test the sensitivity of clustering algorithms to a variety of types of input data, five characteristics were varied to generate different types of “problems.” These characteristics are the total number of variables (NUMVAR), the number of segments in the timeline (LENGTH), the distribution of the number of variables in a subproblem (SIZE), the distribution of the number of time segments between work sessions (TBET), and the distribution of the number of time segments in a work session (WORK). The characteristics are listed in the left-hand column of Table 1.

The baseline values for each of these characteristics (shown in the second column of Table 1) were selected for similarity to the data collected in our design studies. We computed the values of each characteristic for the POD and factory study data we collected, based on the clusters identified by the hierarchical clustering algorithm. For the POD and factory studies, respectively, the average numbers of variables were 38 and 32, so we chose a baseline of 35; the average timeline lengths were 48 and 64 segments, so we chose a baseline of 50; the average and standard deviation of subproblem size were (4.5, 2.9) and (3.9, 2.3), so we chose a baseline of (4, 2.5); the average and standard deviation of time between work sessions were (11.5, 11.4) and (12, 9.5), so we chose a baseline of (12, 10); and the average and standard deviation of the length of work sessions were (2.7, 2.8) and (5.3, 7.1), so we chose a baseline of (4, 4). In the computational experiments described below, we varied the values around the baseline to explore a wider range of problem characteristics.

For a given set of problem set parameter values, generating a no-noise instance involved the following steps: (1) determine the (consecutive) variables that belong to each subproblem by repeatedly drawing an integer number from the subproblem size distribution (SIZE) and assigning that number of variables to the next subproblem until the number of variables is at least NUMVAR, and then remove any variables after the first NUMVAR; (2) for every subproblem, draw a number of time segments from the distribution for the number of time segments between work sessions (TBET) and a number of time segments from the distribution for the length of a work session (WORK), code the variables in this subproblem “1” in every time segment in the work session, and repeat until all LENGTH time segments have been considered.
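The two generation steps can be illustrated with a minimal Python sketch. The text gives only a mean and standard deviation for SIZE, TBET, and WORK, so the sketch assumes rounded normal draws truncated at a minimum value; that choice, the function name `generate_instance`, and the helper `draw` are hypothetical and are not taken from the study.

```python
import numpy as np

def generate_instance(numvar=35, length=50, size=(4, 2.5),
                      tbet=(12, 10), work=(4, 4), seed=0):
    """Return (X, labels): X is a numvar-by-length 0/1 matrix and labels[i]
    is the index of the true subproblem that contains variable i."""
    rng = np.random.default_rng(seed)

    def draw(mean, sd, minimum):
        # Assumption: rounded normal draw, truncated below at `minimum`.
        return max(minimum, int(round(rng.normal(mean, sd))))

    # Step 1: assign consecutive variables to subproblems until NUMVAR is reached.
    labels = []
    subproblem = 0
    while len(labels) < numvar:
        labels.extend([subproblem] * draw(*size, minimum=1))
        subproblem += 1
    labels = np.array(labels[:numvar])  # remove variables after the first NUMVAR

    # Step 2: for each subproblem, alternate gaps (TBET) and work sessions (WORK).
    X = np.zeros((numvar, length), dtype=int)
    for s in range(labels.max() + 1):
        rows = np.flatnonzero(labels == s)
        t = 0
        while t < length:
            t += draw(*tbet, minimum=0)                 # segments before the session
            end = min(length, t + draw(*work, minimum=1))
            X[rows, t:end] = 1                          # code every variable in the subproblem
            t = end
    return X, labels
```

With the default arguments the sketch uses the baseline values (NUMVAR = 35, LENGTH = 50), and every row that belongs to the same subproblem is identical because no noise has been added yet.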

Real timelines are not as orderly as the synthetic timelines; they have fewer clear subproblems because not every variable in a subproblem is discussed and variables from other subproblems may be discussed. Therefore, we added noise to the synthetic data to create instances with noise at several levels: 0%, 5%, and 10% (this is a sixth characteristic, NOISE). To add noise, we created a new, noisy instance from a no-noise instance by randomly determining, for each variable in each time segment, whether to flip that value from coded to uncoded (or vice versa), with a probability equal to the NOISE value.
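The noise step amounts to flipping each entry independently with probability NOISE. A minimal Python sketch follows; the function name `add_noise` and the seeding scheme are illustrative assumptions.

```python
import numpy as np

def add_noise(X, noise=0.05, seed=0):
    """Return a copy of the 0/1 matrix X in which each entry is flipped
    (coded <-> uncoded) independently with probability `noise`."""
    rng = np.random.default_rng(seed)
    flip = rng.random(X.shape) < noise   # True where the entry will be flipped
    return np.where(flip, 1 - X, X)
```

Because most entries of a timeline are uncoded, flipping at a uniform rate mostly adds spurious codes, while occasionally removing codes from a work session.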

### Computational Experiments.

To test the sensitivity of the clustering algorithms to variations in the data set characteristics, we ran three experiments, each with its own set of test instances. Experiment A was a tuning experiment to determine the impact of the clustering algorithm parameters and to select the best algorithm parameters for the remaining experiments. Experiment B explored the impact of instance size and subproblem size. Experiment C explored the impact of subproblem “overlap,” the extent to which multiple subproblems were considered at the same time. These experiments required 2650 instances.

Each experiment included multiple problem sets, where a problem set is a set of instances generated using the same values for the five problem set characteristics (NUMVAR, LENGTH, SIZE, TBET, and WORK). Because three of these parameters are distributions, we generated 50 different no-noise instances using the same characteristics (we confirmed that 50 instances were sufficient, since results for 50 instances were the same as those for 500). Depending on the number of noise levels included in the experiment (0% and 10%; or 0%, 5%, and 10%), the total number of instances in each problem set was either 100 or 150.

Experiment A (the tuning experiment) included four problem sets that were created by varying the subproblem size (SIZE) and the length of the work sessions (WORK), to determine whether these parameters interacted with the algorithm parameters. These two characteristics were chosen because they were most likely to interact: the clustering algorithms' performance may vary with the number of subproblems (controlled by SIZE) or with how often subproblems overlapped with others (partially controlled by WORK). In each problem set, we generated 50 instances with no noise. For every no-noise instance, we also generated a corresponding 10% noisy instance, for a total of 100 instances in each problem set. Table 1 lists the problem set parameter values used to generate the instances for experiment A.

From these instances, we generated clusters by varying the key parameters of the clustering algorithms. For the hierarchical clustering algorithm, we varied the threshold used to stop clustering. For the spectral clustering algorithm, we varied the threshold used to stop clustering and the number of eigenvectors used to compute each variable's projection. For the Markov clustering algorithm, we varied the inflation value. For the association rules, we varied the minimum support and minimum confidence level. Further information on the role of each parameter is provided in Sec. 3.5.

Experiment B included twelve problem sets that were created by varying the number of variables (NUMVAR), the number of segments in the timeline (LENGTH), and the distribution of the number of variables in a subproblem (SIZE). In each problem set, we generated 50 instances with no noise. For every no-noise instance, we also generated two corresponding noisy instances at the 5% and 10% noise levels, for a total of 150 instances in each problem set. Table 1 lists the problem set parameter values for experiment B.

Experiment C included three problem sets (low overlap, medium overlap, and high overlap) that were created by varying (together) the distribution of the number of time segments between work sessions (TBET) and the distribution of the number of time segments in a work session (WORK). The low overlap cases had high time between sessions (TBET = 25, 21) and short work sessions (WORK = 2, 2). The medium overlap cases had moderate time between work sessions (TBET = 12, 10) and moderate work sessions (WORK = 4, 4). The high overlap cases had low time between sessions (TBET = 3, 3) and long work sessions (WORK = 10, 10). In each problem set, we generated 50 instances with no noise. For every no-noise instance, we also generated two corresponding noisy instances at the 5% and 10% levels, for a total of 150 instances in each problem set. Table 1 lists the problem set parameter values for experiment C.
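As a consistency check, the instance counts stated for the three experiments (four problem sets of 100 instances in experiment A, twelve problem sets of 150 instances in experiment B, and three problem sets of 150 instances in experiment C) add up to the 2650 instances reported above:

$$4 \times 100 + 12 \times 150 + 3 \times 150 = 400 + 1800 + 450 = 2650$$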

### Clustering Algorithms.

To determine whether clustering would enable the identification of subproblems in the timeline data, and to determine which of several clustering algorithms would perform best, we tested four clustering algorithms: hierarchical, spectral, Markov, and association rule clustering. The reasons for choosing these four algorithms were discussed in Sec. 3.2. The following subsections describe these algorithms.

Every clustering algorithm processes an instance that specifies which variables were discussed in which time segments, and the output is a set of clusters. Every variable is assigned to exactly one cluster because the true subproblems have disjoint sets of variables (see Everitt et al. [37] for a review of clustering techniques that allow overlapping clusters).

#### Hierarchical Clustering.

The first clustering method is a hierarchical clustering method. The variables are clustered together using a distance measure that is based on dissimilarity. Although there are multiple options for specifying the dissimilarity (or distance) between data points, we used the Euclidean distance. The method creates clusters by progressively merging variables into clusters based on their similarity or “closeness” as defined by the distance metric.

The Euclidean distance between variables *i* and *j* can be calculated as follows:

$$d(i,j)=\sqrt{\sum_{t=1}^{T}\left(x_{it}-x_{jt}\right)^{2}}$$

where $x_{it}$ is 1 if variable *i* was coded to indicate that it was discussed at time *t*, and 0 otherwise, and *T* is the total number of time segments.

In this work, because each value of $x_{it}$ is either 0 or 1, the distance *d*(*i*, *j*) is the square root of the number of segments in which one variable is coded and the other is not. After determining the distances for every pair of variables, we used hierarchical clustering [53] to cluster the variables. In particular, we used the MATLAB functions pdist and linkage. This generated a dendrogram that progressively clusters variables based on their distances from one another. A single set of clusters can be obtained using a “threshold” to select a particular height on the dendrogram. In this manner, we kept all clusters that were combined with a distance less than this threshold. The appropriate threshold was determined from the results of experiment A, the tuning experiment. (The threshold may need to be adjusted for very different input data, but experiment A shows that this threshold is best for all the types of input data we examined.)
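A minimal Python sketch of this procedure uses SciPy's `pdist`, `linkage`, and `fcluster`, which play the same roles as the MATLAB functions named above. The linkage method is not stated in the text, so the sketch assumes single linkage (the default in both SciPy and MATLAB); the function name `hierarchical_clusters` is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def hierarchical_clusters(X, threshold, method="single"):
    """Cluster the rows (variables) of the 0/1 matrix X.

    Returns one integer cluster label per variable."""
    d = pdist(np.asarray(X, dtype=float), metric="euclidean")  # pairwise d(i, j)
    Z = linkage(d, method=method)                               # dendrogram
    # Cut the dendrogram at `threshold`. SciPy keeps merges whose distance is
    # at most `threshold`, which is slightly looser than a strict "less than".
    return fcluster(Z, t=threshold, criterion="distance")
```

For two variables coded in disjoint sets of four segments each, the distance is $\sqrt{8} \approx 2.83$, so any threshold below that value keeps them in separate clusters.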

#### Spectral Clustering.

We developed a novel clustering method that uses spectral clustering for identifying subproblems. It is similar to hierarchical clustering (described above) except that it uses a different distance measure, described below. (This distance measure is different from the ones used by Sarkar et al. [49,50].)

Let *T* be the total number of segments. For variable *i* that was coded with another variable in at least one time segment, let *n*(*i*) be the number of time segments in which the team discussed variable *i*, so $n(i)=\sum_{t=1}^{T} x_{it}$. Let *n*(*i*, *j*) be the number of time segments in which the team discussed both variables *i* and *j*, so $n(i,j)=\sum_{t=1}^{T} x_{it}x_{jt}$. Then, $n(i)+n(j)-n(i,j)$ equals the number of time segments in which the team discussed variable *i*, variable *j*, or both variables *i* and *j*.

Let $N_v$ be the number of variables that were coded with other variables. Let *R* be an $N_v \times N_v$ matrix in which $r_{ij}$ is an element of *R*. The relative count $r_{ij}$ was determined using the following equation:

$$r_{ij}=\frac{n(i,j)}{n(i)+n(j)-n(i,j)}$$

We found the $N_v$ eigenvalues and eigenvectors of *R* and identified the *k* largest eigenvalues in the spectrum of eigenvalues (*k* is one of the two key parameters that must be set for this clustering algorithm). If there are *k* clusters of variables, there should be a significant gap between the *k*th largest eigenvalue and the (*k* + 1)st largest eigenvalue. We created an $N_v \times k$ matrix *U* that contains the eigenvectors for the *k* largest eigenvalues and created a *k* × *k* matrix *E* that contains the *k* largest eigenvalues (the sequence of columns of *U* and of *E* is in the same order). Each row of the product *UE* represents one of the variables as a point in a *k*-dimension space.

We used hierarchical clustering to create a dendrogram of the variables using the distances between the points in the *k*-dimension space (the rows of *UE*). In particular, we used the MATLAB functions pdist and linkage. (Note that this distance does not equal the distance *d*(*i*, *j*) used in the hierarchical clustering method described in Sec. 3.5.1.) Each point is the transformation of one variable into the *k*-dimension space, and variables that often occur together in the data will be near each other in this new space because they will have similar concurrency values with every other variable.

As in the hierarchical clustering, a threshold is needed to generate clusters from the dendrogram. The tuning experiment, experiment A, examined the accuracy resulting from various values of this threshold, along with various values of the other key parameter, *k*.
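The spectral method can be sketched in Python as follows. The sketch assumes that the relative count is the ratio of joint to union counts, $r_{ij} = n(i,j)/(n(i)+n(j)-n(i,j))$, that *E* is diagonal, and that single linkage is used; these choices and the function name `spectral_clusters` are assumptions made for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def spectral_clusters(X, k, threshold):
    """Return (keep, labels): indices of the variables that were coded with
    at least one other variable, and one cluster label for each of them."""
    X = np.asarray(X, dtype=float)
    co = X @ X.T                            # co[i, j] = n(i, j); diagonal holds n(i)
    n = np.diag(co)
    off_diag = co - np.diag(n)
    keep = np.flatnonzero(off_diag.sum(axis=1) > 0)   # coded with another variable
    co_k = co[np.ix_(keep, keep)]
    union = n[keep][:, None] + n[keep][None, :] - co_k
    R = co_k / union                        # assumed relative count r_ij
    vals, vecs = np.linalg.eigh(R)          # R is symmetric; eigenvalues ascending
    top = np.argsort(vals)[::-1][:k]        # indices of the k largest eigenvalues
    points = vecs[:, top] * vals[top]       # rows of the product U E
    Z = linkage(pdist(points), method="single")
    labels = fcluster(Z, t=threshold, criterion="distance")
    return keep, labels
```

Inspecting the sorted eigenvalues of `R` before choosing `k` shows whether the expected gap after the *k*th largest eigenvalue is present.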

#### Markov Clustering.

For the Markov clustering approach, we used the Markov Clustering Algorithm [54,55]. In particular, we used the MATLAB implementation posted by Hartmann [56]. The algorithm finds clusters in a network graph by simulating random walks, which are more likely to stay within clusters than to cross between clusters. According to the algorithm's developer [55], the key parameter to set for this algorithm is the inflation value; various settings were evaluated in experiment A.

The input to the algorithm is a concurrency matrix *C* that has one row and one column for each variable. The concurrency matrix is not symmetric and has an empty diagonal ($c_{ii} = 0$). For $i \neq j$, entry $c_{ij}$ is determined by

#### Association Rule Clustering.

In machine learning, association rules are utilized to discern relationships between sets of items that often occur together [57]. For example, an association rule might state that if variable *i* is coded in a time segment, variable *j* is also likely to be coded in the same time segment. We used association rule learning on the coded variables (the values of $x_{it}$) to generate association rules, then created clusters of variables using those rules. To create the association rules, we used the ARMADA package [58]. In this package, two measures guide the algorithm's identification of rules: minimum support and minimum confidence. In our approach, the support is the number of time segments in which a variable *i* is coded in the timeline: $\mathrm{Supp}(i)=n(i)$. The confidence is the proportion of time segments in which, if variable *i* was coded, then *j* was also coded.

In our experiments, the maximum rule length was set to three to improve the algorithm's efficiency, but clusters could contain more than three variables because we created clusters based on all the rules created by the algorithm. Specifically, each association rule established a relationship between two or three variables. We clustered the variables by the following policy: if variables *i* and *j* are together in an association rule, then variables *i* and *j* are in the same cluster.
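The clustering policy can be illustrated with a simplified Python sketch that considers only two-variable rules (the study allowed rules of up to three variables and used the ARMADA package, which is not reproduced here). The sketch assumes that the confidence of the rule from *i* to *j* is $n(i,j)/n(i)$ and that the support threshold is a count of segments; the function name and default thresholds are hypothetical.

```python
import numpy as np

def association_rule_clusters(X, min_support=2, min_confidence=0.8):
    """Return one cluster label per variable (row of the 0/1 matrix X)."""
    X = np.asarray(X, dtype=int)
    n_var = X.shape[0]
    co = X @ X.T                  # co[i, j] = n(i, j); diagonal holds n(i)
    n = np.diag(co)
    parent = list(range(n_var))   # union-find forest over the variables

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n_var):
        if n[i] == 0 or n[i] < min_support:
            continue                         # variable i lacks support
        for j in range(n_var):
            if i != j and co[i, j] / n[i] >= min_confidence:
                parent[find(i)] = find(j)    # rule i -> j: same cluster
    roots = [find(i) for i in range(n_var)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels
```

Merging through union-find makes the policy transitive: if one rule links *i* with *j* and another links *j* with *k*, all three variables end up in one cluster, which is how clusters can grow beyond the maximum rule length.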

We conducted a sensitivity analysis to determine how increasing the support and confidence thresholds affected the accuracy of the clusters; we considered low and high values for both the support and confidence parameters in experiment A.

### Evaluating Clustering Accuracy.

We used a cluster accuracy metric to assess how well the clusters match the true subproblems. Generating more accurate clusters provides better insights into how a designer (or design team) decomposed the design problem. In the following description, “cluster” refers to the clusters identified by the algorithm for an instance, while “subproblem” refers to the true subproblems used to generate that instance.

Given a set of clusters, we computed the cluster accuracy by first determining the accuracy from each variable's perspective and then averaging those values. This metric equals the external validity metric described by Rand [41].

Let $N_v$ be the total number of variables in the subproblems. For variable *i*, let $n^*(i)$ be the number of variables in the subproblem that contains variable *i*, let $n_c(i)$ be the number of variables in the cluster that contains variable *i*, and let $n^+(i)$ be the number of variables from the subproblem that contains variable *i* that are also in the cluster that contains variable *i*. Note that $n^+(i)$ counts variable *i*. Then, $N_v - n^*(i)$ is the number of variables in the problem that are not in the subproblem that contains variable *i*, and $n_c(i) - n^+(i)$ is the number of variables in that cluster that are not in the subproblem that contains variable *i*.

We computed $tp(i)$, the number of true positives, $tn(i)$, the number of true negatives, and the clustering accuracy $ac(i)$ associated with variable *i* and calculated the average accuracy $\bar{A}$ as follows:
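The definitions above suggest the following per-variable quantities, stated here as an assumption derived from those definitions rather than as a quotation of the original equations: $tp(i) = n^+(i)$, $tn(i) = (N_v - n^*(i)) - (n_c(i) - n^+(i))$, $ac(i) = (tp(i) + tn(i))/N_v$, and $\bar{A} = \frac{1}{N_v}\sum_i ac(i)$. A minimal Python sketch under that assumption, with a hypothetical function name and with subproblems and clusters given as lists of variable labels:

```python
def cluster_accuracy(subproblems, clusters):
    """Average per-variable accuracy of clusters against true subproblems.

    Assumes every variable belongs to exactly one subproblem and one cluster.
    """
    sub_of = {v: set(s) for s in subproblems for v in s}
    clu_of = {v: set(c) for c in clusters for v in c}
    n_v = len(sub_of)
    total = 0.0
    for i in sub_of:
        n_star = len(sub_of[i])               # size of i's true subproblem
        n_c = len(clu_of[i])                  # size of i's cluster
        n_plus = len(sub_of[i] & clu_of[i])   # shared members (includes i)
        tp = n_plus                           # assumed: tp(i) = n+(i)
        tn = (n_v - n_star) - (n_c - n_plus)  # outside both subproblem and cluster
        total += (tp + tn) / n_v
    return total / n_v


truth = [[0, 1], [2, 3]]
found = [[0, 1, 2], [3]]
print(cluster_accuracy(truth, found))
```

In this example, variable 2 is placed with the wrong group, so its own accuracy is 1/4 while the other three variables each score 3/4, giving an average of 0.625.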

## Results

After generating the instances for all the experiments, we evaluated the clustering algorithms by determining how well they created clusters that matched the true subproblems. To assess this, we used the cluster accuracy metric described in Sec. 3.6.

### Experiment A (Tuning).

The purpose of experiment A was to identify the best parameters for each of the clustering algorithms. Recall from Table 1 that we tested each algorithm on four problem sets (varying the subproblem size and length of work sessions) and two levels of noise (0% and 10%), with 50 different instances of each. On these problem sets, we evaluated a number of different settings for the parameters of each algorithm, as described below.

For the hierarchical clustering algorithm, the key parameter is the threshold level. Figure 1 shows a boxplot of the accuracy values across all tested problem instances for various values of the threshold. Using the value of three generated the most accurate clusters, so this value was used in experiments B and C.

For the spectral clustering algorithm, there are two key parameters: the threshold level and the value of *k* (see Sec. 3.5.2 for details). Figure 2 plots the accuracy obtained for several values of the threshold and of *k*. The left-hand plot shows that at a noise level of 10%, the best threshold value is 0.25. The right-hand plot shows that the best values of *k* were from 4 to 6. At a noise level of 0%, several values for the threshold and *k* led to good performance (results not shown). Based on both sets of results, experiments B and C used a threshold of 0.25 and a *k* value of 6.

For the Markov clustering algorithm, the key parameter is the inflation coefficient. We tested four values for this parameter, as advised in Ref. [55]. As shown in Fig. 3, the best performance was achieved at a value of 6. The same trend was visible across all problem sets and noise levels. Therefore, for experiments B and C, we set the inflation value to 6.

For the association rules clustering algorithm, we varied the minimum support and minimum confidence. The minimum support was set to 5 or 10 time segments, which were close to the values used for WORK. The minimum confidence was set to 0.50 or 0.95. Figure 4 shows the results for all four combinations of these two parameters. The accuracy is highest for a support of 10 and a confidence of 0.95. (These values were best for all values of noise and all problem types.)
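A tuning experiment of this kind can be organized as a simple search over candidate settings. The sketch below is a generic illustration, not the procedure used to produce the figures: `cluster_fn` and `score_fn` are hypothetical callables standing in for a clustering algorithm and the accuracy metric, and each instance is assumed to be a pair of input data and true subproblems.

```python
from statistics import mean


def tune(cluster_fn, score_fn, instances, settings):
    """Return the setting with the highest mean score, plus all mean scores."""
    results = {}
    for setting in settings:
        # Score the clusters produced with this setting on every instance.
        results[setting] = mean(
            score_fn(cluster_fn(data, setting), truth)
            for data, truth in instances)
    best = max(results, key=results.get)
    return best, results
```

For an algorithm with two parameters, such as the minimum support and minimum confidence above, each setting can be a tuple such as `(10, 0.95)`.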

### Experiment B.

The purpose of experiment B was to explore the impact of instance size and subproblem size on algorithm performance. As described in Table 1, the experiment included twelve problem sets: two values for NUMVAR, two values for LENGTH, and three sets of values for SIZE; three noise levels were also tested. There are 36 combinations of problem set and noise, and 50 instances in each combination.

Figure 5 shows the results for experiment B for problem sets with NUMVAR = 35; the results for NUMVAR = 100 were similar. The figure plots 95% confidence intervals for the accuracy resulting from each combination of problem set characteristics and noise. In 21 combinations, the spectral clustering algorithm generated more accurate clusters than the other algorithms, and in 13 combinations, spectral clustering and hierarchical clustering generated essentially equally accurate clusters; in the other two combinations, spectral clustering generated clusters that were less accurate than the clusters generated by hierarchical clustering. Both of these latter combinations involved 0% noise, which is likely unrealistic for real data.

In general, the spectral clustering algorithm created the most accurate clusters, followed by the hierarchical clustering algorithm. The Markov clustering algorithm and association rules generated the least accurate clusters.

As LENGTH increased (from 50 to 200), the spectral clustering algorithm generated more accurate clusters, but the hierarchical clustering algorithm generated less accurate clusters in instances with noise. In the hierarchical clustering algorithm, the additional time segments contained more noise and so generally increased the distance between variables; fewer variables were clustered together, which reduced accuracy. For the spectral clustering algorithm, however, the additional time segments acted as more samples, so the relative count values were more correlated with subproblem membership, which led to more accurate clusters. The additional time segments also led to more rules, so association rules clustering generated larger, less accurate clusters.

For variations in SIZE, the spectral clustering and hierarchical clustering algorithms generated more accurate clusters when the mean subproblem size was small (SIZE *μ* = 4) and less accurate clusters when it was large (*μ* = 10). In noisy instances, larger subproblems make it more likely that some of the variables are more distant from the others in the same subproblem, which makes it more likely that they would not be grouped together by the spectral clustering or hierarchical clustering algorithms. The Markov clustering algorithm generated more accurate clusters when subproblems were large, however, because of this algorithm's tendency to group more variables together. Despite this improvement in performance, its accuracy remained lower than that of the best algorithm (spectral or hierarchical) in all cases. The accuracy of the clusters created by the association rules was not significantly affected by changes to SIZE.

### Experiment C.

The purpose of experiment C was to explore the impact of subproblem “overlap,” meaning the extent to which multiple subproblems were worked on at the same time. Figure 6 shows the results for experiment C, which has three problem sets: low overlap, medium overlap, and high overlap. As the overlap increases, the number of subproblems discussed in each time segment increases, and the number of variables coded likewise increases.

There are nine combinations of problem set and noise, and 50 instances in each combination. In five combinations, the spectral clustering algorithm generated more accurate clusters than the other algorithms, and in two combinations, spectral clustering and hierarchical clustering generated essentially equally accurate clusters; in the other two combinations, spectral clustering generated clusters that were less accurate than the clusters generated by hierarchical clustering. Thus, the spectral clustering algorithm generally created more accurate clusters, except for the high overlap instances with noise, where it created less accurate clusters. The hierarchical clustering algorithm generated more accurate clusters for the high overlap instances. The Markov clustering and association rules generated less accurate clusters generally.

Each algorithm's performance clearly differs based on the level of overlap. Hierarchical clustering performs well except in the low overlap cases. In the low overlap instances, the Euclidean distances between variables in different subproblems are generally lower because variables are coded less often. Thus, the hierarchical clustering, which uses this distance, tended to put more variables into the same cluster, which is inaccurate. In the medium and high overlap instances, the distances between variables in different subproblems are generally higher because variables are coded more often, so hierarchical clustering found the right clusters more often.
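The effect of coding frequency on Euclidean distance can be seen in a small worked example. It assumes that the distance is computed between the binary coding rows of two variables (one entry per time segment); the rows below are illustrative, not taken from the experiments.

```python
from math import sqrt


def euclidean(row_a, row_b):
    """Euclidean distance between two variables' coding rows."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(row_a, row_b)))


# Two variables from different subproblems, each coded in one segment only.
sparse_distance = euclidean([1, 0, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 0, 1, 0, 0, 0])

# Two variables from different subproblems, each coded in four segments.
dense_distance = euclidean([1, 1, 1, 1, 0, 0, 0, 0],
                           [0, 0, 0, 0, 1, 1, 1, 1])

print(sparse_distance, dense_distance)
```

For binary rows, the squared distance equals the number of segments in which exactly one of the two variables is coded, so rarely coded variables from different subproblems end up close together (here $\sqrt{2}$ versus $\sqrt{8}$).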

Spectral clustering performs well in all cases except the high overlap cases with noise. As overlap increases, the relative count values (used by spectral clustering) increase toward one, and the variables look more similar and are closer in the reduced dimensional space, so spectral clustering created fewer, larger clusters, which reduced accuracy. With no noise, spectral clustering generated very accurate clusters for all levels of overlap. In general, spectral clustering generated less accurate clusters as noise increased; this reduction grew as the overlap increased.

Markov clustering performs better at low levels of overlap; its performance decreases as overlap and noise increase. As overlap increases, the values in the concurrency matrix (used by Markov clustering) increase toward one; thus, more variables have larger values in the matrix created by Markov clustering, so the algorithm created fewer, larger clusters, which reduced accuracy.

The association rules clustering approach performs better at low and medium overlap, and poorly at high overlap. As overlap increases, each time segment has more coded variables, so more rules are generated, and the approach created fewer, larger clusters, which reduced accuracy.

## Discussion

### Comparing Algorithm Performance.

In principle, all four clustering algorithms are appropriate for the clustering task. Each one considers the similarity of two variables (based on the time segments when they are coded) and forms groups of similar variables. In practice, however, it appears that some perform better for the task considered here.

The association rules clustering approach is one of the two worst performers on the instances in our experiments. It performs adequately when the overlap is low or moderate, but high overlap leads to poor performance (see Fig. 6). Its performance is also worse for longer timelines (see Fig. 5). The association rules clustering technique relies upon the association rules that are constructed from the data; more coded data (due to more time segments, as in experiment B, or more overlap, as in experiment C) led to more rules that met the minimum support and confidence thresholds and larger clusters, and therefore less accuracy.

The Markov clustering approach is the other of the two worst performers on the data sets we investigated. Its accuracy is low in nearly all problem sets, except when overlap and noise are both low (see Fig. 6). Even with longer timelines and therefore more data, performance does not consistently increase (see Fig. 5). Noise leads to worse performance on all problem sets. The Markov clustering technique uses the concurrency of the variables; a pair of variables with large concurrency will be grouped into a cluster. Thus, when overlap is high, the concurrency values are higher, so the Markov clustering algorithm grouped more variables into larger clusters (in some cases, only one cluster). The algorithm's performance was less sensitive to other changes in the problem parameters, although it also generated less accurate clusters when the subproblems were smaller.

The hierarchical clustering algorithm with a Euclidean distance metric generated the best results among all the algorithms in a small number of problem sets: those with high overlap and noise, as discussed in Sec. 4. Its performance was relatively good in medium- and high-overlap problem sets but poor in low-overlap problem sets. Hierarchical clustering performed well when the timelines were relatively short but less well when the timelines were longer and there was noise. Also, it performed better when subproblems were smaller. As discussed in Sec. 4, these decreases in performance were due to increased similarity in the Euclidean distances between variables in the long noisy and low-overlap problem sets.

Spectral clustering is also imperfect, but overall it is the most accurate of the techniques tested here in nearly all the problem sets. Its performance suffered only in the high overlap instances with noise (see Fig. 6) and somewhat in the instances with large subproblems and a short timeline (see Fig. 5). Because the spectral clustering algorithm positions each variable in a new space before clustering, it is less sensitive to changes in the problem size, although increasing overlap with noise did reduce clustering accuracy.

Despite these problems, the spectral clustering approach can identify the true subproblems with a high degree of accuracy in most cases. With low or medium overlap, the worst performance was in the range of 85–90% accuracy, with results for most cases above 90%. In the high-overlap cases with noise, the worst performance was in the range of 60–70%; however, data with high overlap can probably be identified before clustering, and the hierarchical approach can be used instead, to reach accuracy values above 90%. Although some subproblems might be grouped together and others might be missing a variable or two, the “gist” of the design problem decomposition remains clear. Therefore, we suggest that spectral clustering be used to analyze such data in the future, except in cases with high overlap, for which we suggest hierarchical clustering with a Euclidean distance metric.

### Using Clusters to Analyze Human Designers' Decomposition Patterns.

The significance of the work described in this paper depends upon both the assertion that *clustering can identify the subproblems used by human designers* and the *potential of identifying subproblems to advance design research*.

The assertion that *clustering can identify subproblems* rests on two claims: (1) The clustering algorithm can accurately identify groups of variables that are discussed concurrently in the type of noisy data produced by human designers when analyzed using the approach we have described. (2) The variables that a design team discusses concurrently (within a short time period) are components of the same subproblem.

This paper has provided evidence in support of the first claim. The many synthetic datasets we produced were modeled to represent data collected from human designers, then varied to investigate datasets with alternative characteristics and various levels of noise (as discussed in Sec. 3.3). Using synthetic data enables us to compare our results to the “true” set of subproblems. Our results show that, indeed, the spectral and/or hierarchical clustering algorithms can reproduce the true clusters with a high degree of accuracy, lending credence to the first claim.

The second claim is more difficult to validate explicitly because a design team's discussions can move quickly among multiple topics and design teams rarely explicitly discuss the subproblems that they are considering [8]. It is likely, however, that a design team will discuss the multiple variables in a subproblem concurrently (within a short time period) as they generate and evaluate solutions to that subproblem. Variables considered at very different times are unlikely to form part of the same subproblem because their interactions could not then easily be considered and their values determined together. Still, it is possible that variables that a design team discussed concurrently are not in the same subproblem; instead, the design team may have switched from one subproblem to another, and some of the variables belong to the first subproblem, and the other variables belong to another subproblem. In that case, the concurrency of the variables from multiple subproblems is an accidental byproduct. On the other hand, when a team discusses a subproblem over a longer period of time, the concurrency of variables over multiple time segments provides evidence that they are in the same subproblem.

To examine this phenomenon, we considered several passages from the video data collected from human designers (as described in Sec. 3.1). We chose some segments where the clusters indicate a transition between two subproblems, and some segments where the clusters indicate a single subproblem was being discussed. As discussed by Morency [12], reviewing the teams' conversations (not merely the variables discussed) provided evidence that the teams did consider subproblems that corresponded to the concurrent variables. For example, during time segment 12, POD Team *α*'s conversation revolved around the general flow through the POD, which included specifying the locations of the entry, exit, and medication distribution. In the next time segment, the team then considered attributes of the greeting station and the forms distribution stations, which are near each other. It was clear from the video that these were two separate conversations: the first conversation was about the entire POD; the second one was about the details (location and staffing) of two specific stations. Although the first conversation extended into the time segment when the second one began, it was possible to note the transition from one to the other. (The timeline shown in Fig. 8 shows a transition between subproblems from segment 12 to 13 that aligns with these topics of conversation.)

Therefore, variables considered close together in time are likely, although not certainly, in the same subproblem. Thus, it is reasonable to accept the second claim, and we argue that one can use clustering to identify subproblems, with an understanding of the limitations of the method.

The *potential of identifying subproblems to advance design research* depends upon using the clustering results to generate and test hypotheses about how humans decompose design problems. Data from human designers who are allowed to decompose problems intuitively (without specific and enforced guidance) do not specify the subproblems that the designers were using, only the variables that they discussed. Therefore, identifying the subproblems must rely on post hoc analysis of their actions and discussions. (The results of such an analysis cannot be easily validated because their “true” set of subproblems is unknown.) Our approach enables an analyst to use clustering to identify the subproblems into which human designers decomposed a design problem (and this paper has shown that the results are often accurate). Then, one can consider the characteristics of these subproblems: the number of subproblems, the number of variables in the subproblems, which types of variables are in the same subproblem, the sequence of subproblems, and iteration between subproblems, for example. After this, one can use these results to develop hypotheses about the teams' decomposition approaches and then, with a larger dataset, to test these hypotheses (and other existing hypotheses).

For example, when considering how humans design manufacturing facilities, we might hypothesize that human designers' subproblems reflect the perceived coupling among the design variables relevant to the same manufacturing step (e.g., the location and size of the area for the assembly step) or that the subproblems reflect the perceived coupling among design variables relevant to the same attribute across multiple steps (e.g., the locations, but not the sizes, of the areas for the painting, drying, assembly, and packaging steps). Such hypotheses could be evaluated by examining whether the clusters of variables discussed by different teams tend to group multiple variables of different attributes for one manufacturing step or variables with the same attribute for multiple steps, respectively. Determining which, if either, of these hypotheses explains the observed behavior of the design teams could provide insights into human decomposition strategies and the importance of different types of coupling among the design variables.

As an illustration of this approach and its potential, we used the spectral clustering approach to cluster the variables discussed by factory team *α* and analyzed the resulting clusters for evidence of these two hypotheses. The algorithm identified ten clusters. Two of these contained just one variable, so they are ignored in this analysis. Of the remaining eight clusters, shown in Table 2, all but the one large cluster are dominated by variables with the same attribute, either the location of several manufacturing steps or the staffing for several manufacturing steps. The clustering results, therefore, support the hypothesis that this team's subproblems were driven by perceived coupling among design variables relevant to the same attribute across multiple manufacturing steps.

Future work will use the same approach to analyze patterns in coupling and other subproblem features across many design teams, and we hope that the method we have developed will enable other researchers to perform similar analyses. This type of analysis is beyond the scope of this paper, which focuses on the clustering methods that enable transformation of a design team's discussions into structured data about how they decomposed a design problem. The approach thus provides opportunities for additional research that can generate new knowledge about how humans decompose design problems.

### Applicability.

We note that clustering is only appropriate for certain types of design activity analysis. It is appropriate for identifying subproblems based on data that indicate the variables discussed over time. Critically, however, the method operates on the premise that when two variables are discussed at the same time, they are being considered or determined together. If it is impossible to tell whether the variables were actually considered together—for example, if the data are based on a meeting in which multiple subteams are reporting out and the subteams' variables cannot be allocated to specific subteams—this method is not appropriate.

The discretization of time should match the “pace” of the discussion: if the time-step is too small, only one variable will be captured in each time-step and subproblems will not be discernible; if it is too large, multiple subproblems could be discussed in each time-step (which will lead to high overlap). We found that 1- or 2-min segments worked well for observations of small teams solving moderately complex design problems over 3–4 h, and we expect similar or slightly longer segments would be appropriate for individual verbal protocols of similar scope.
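The effect of the segment length can be illustrated by binning timestamped codes into fixed-length segments. The event format (a time in seconds paired with a variable name) and the function name are assumptions made for this sketch.

```python
def code_segments(events, segment_seconds, variables):
    """Build a binary matrix x[i][t]: variable i coded in time segment t."""
    n_segments = int(max(t for t, _ in events) // segment_seconds) + 1
    index = {v: k for k, v in enumerate(variables)}
    x = [[0] * n_segments for _ in variables]
    for t, v in events:
        # Mark the segment that contains this mention of the variable.
        x[index[v]][int(t // segment_seconds)] = 1
    return x


events = [(10, "a"), (70, "b"), (75, "a"), (200, "c")]
one_minute = code_segments(events, 60, ["a", "b", "c"])
five_minutes = code_segments(events, 300, ["a", "b", "c"])
print(one_minute)
print(five_minutes)
```

With 1-min segments the three variables have distinct coding patterns, whereas a single 5-min segment codes all of them together and no subproblem structure remains visible.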

Furthermore, we note that clustering is appropriate for identifying subproblems in many kinds of design situations. Our approach was developed to process data based on individual verbal protocols or observations of small teams, which are a common type of data in research on human designers [6]. It can also be used to examine related qualitative data sources. For example, similar data could be extracted from design journals identifying the variables an individual or team is considering [59] or from meeting minutes that indicate the topic considered by a team for the week. With the latter type of data, it may not be possible to extract subproblems of the team's subteams, but it may still be of interest to understand the higher-level subproblems of the team as a whole and their sequencing over time. When the data characteristics differ substantially from those studied here, it will be useful to conduct a tuning experiment (similar to our experiment A) to determine the best parameter settings for the clustering algorithms.

The clustering approach described here does not consider how some types of subproblems might be more likely early or late in the design process or how subproblems might change (by becoming larger or smaller). Future research should consider how subproblems might differ or change throughout a design process.

## Conclusions

This paper has presented a study to evaluate clustering algorithms that can identify subproblems based on coded time segments that describe the variables (decisions) being considered during a design process. We tested four clustering algorithms by using them to group variables from a variety of synthetic instances. The baseline synthetic instances were designed to be similar to the real-world data that we collected in previous observations of human designers, but the problem characteristics were then varied to test how the algorithms' performance differed for different types of input data. The results show that the spectral clustering algorithm was more accurate than the others in most cases, and that hierarchical clustering was best when subproblems frequently “overlapped” in that multiple subproblems were worked on at once. The hierarchical clustering algorithm was also fairly accurate in many cases, but the Markov and association rules clustering algorithms performed poorly.

Beyond the subproblems identified by the approach, the hierarchical and spectral clustering techniques construct a dendrogram that one can use to see the relationships within and between the subproblems to gain additional insight into the design decomposition.

Although the four clustering algorithms considered in this study represent the key types of clustering, many potentially useful clustering algorithms were not considered, and future research is needed to evaluate their performance on data about human design activities.

The paper provides guidance for design researchers who wish to understand how a designer (or design team) decomposes a system design problem during a design process. It presents a general approach for collecting, analyzing, and exploring these data, and it provides specific recommendations on which clustering algorithms may be useful for identifying subproblems based on coded data.

This study is part of a larger effort to understand how designers decompose system design problems, determine the relationship between design decomposition and the quality of the design created, and explain the factors that affect this relationship. Future work is needed to identify the subproblems used by designers in a variety of design domains and examine how they affect design quality. If certain decompositions are shown to yield consistently better designs, then students and practitioners can use more effective design processes that follow superior decompositions. The study described here is designed to improve our ability to explore the problem decompositions used by designers and should be useful to other design researchers exploring similar data.

## Acknowledgment

The authors acknowledge the assistance of David Rizzardo, who organized and led the facility design course, and Connor Tobias, who assisted with some of the data collection and analysis.

## Funding Data

Directorate for Engineering, National Science Foundation (Grant Nos. CMMI-1435074 and CMMI-1435449).