---
title: "Optimizing the calculation of Hopkins statistic"
author: "Kevin Wright"
date: "2021-04-19"
bibliography: clustertend.bib
output:
  rmarkdown::html_vignette:
    md_extensions: [
      "-autolink_bare_uris"
    ]
vignette: >
  %\VignetteIndexEntry{Optimizing the calculation of Hopkins statistic}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The `clustertend::hopkins` function used two nested `for()` loops:

```
for (i in 1:n) {
  distp[1] <- dist(rbind(p[i, ], data[1, ]))
  minqi <- dist(rbind(q[i, ], data[1, ]))
  for (j in 2:nrow(data)) {
    distp[j] <- dist(rbind(p[i, ], data[j, ]))
    error <- q[i, ] - data[j, ]
    if (sum(abs(error)) != 0) {
      distq <- dist(rbind(q[i, ], data[j, ]))
      if (distq < minqi) minqi <- distq
    }
  }
}
```

Whenever nested `for()` loops are used, you should immediately consider whether it is possible to vectorize one or both loops. In this case, we can define a function that calculates the Euclidean distance from a vector to the nearest row of a matrix, and then use this function to eliminate the inner loop:

```
DistVecToMat <- function(x, Y){
  min(apply(Y, 1, function(yi) sqrt(sum((x-yi)^2))))
}
# DistVecToMat(c(1,2), matrix(c(4,5,6,7), nrow=2, byrow=TRUE))
# Row distances are sqrt(c(18,50)) = 4.242641 7.071068, so the minimum is 4.242641

# For each ith row of the sample, calculate the distance of:
#   U[i,] to X
#   W[i,] to X[-k[i],], i.e. omitting W[i,] from X
dux <- rep(NA, n)
dwx <- rep(NA, n)
for(i in 1:n) {
  dux[i] <- DistVecToMat(U[i,], X)
  dwx[i] <- DistVecToMat(W[i,], X[-k[i],])
}
```
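To make the loop above concrete, here is a minimal, self-contained run with tiny made-up inputs. The names `X`, `U`, `W`, `k`, and `n` follow the snippet above, but their values here are invented for illustration, not taken from the package internals:

```r
# Distance from vector x to the nearest row of matrix Y (as defined above)
DistVecToMat <- function(x, Y){
  min(apply(Y, 1, function(yi) sqrt(sum((x-yi)^2))))
}

set.seed(42)
X <- matrix(runif(10), ncol = 2)     # made-up "data" matrix with 5 rows
n <- 2
k <- c(1, 3)                         # indices of the sampled rows of X
U <- matrix(runif(2 * n), ncol = 2)  # n new uniform points
W <- X[k, , drop = FALSE]            # n sampled rows of X

dux <- rep(NA, n)
dwx <- rep(NA, n)
for(i in 1:n) {
  dux[i] <- DistVecToMat(U[i,], X)
  dwx[i] <- DistVecToMat(W[i,], X[-k[i],])
}
dux  # distance from each U row to its nearest row of X
dwx  # distance from each W row to its nearest *other* row of X
```

Note that `X[-k[i],]` drops row `k[i]`, so each sampled point is never matched to itself.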

When thinking about manipulating two vectors or two matrices, you should always keep in mind that there are R functions like `crossprod()`, `outer()`, and `apply()` that might come in handy. I played around with these functions but was having trouble getting the results I wanted. I then used Google to search for ideas and discovered the `pdist` package, which can efficiently compute the distance between all pairs of rows of two matrices. This is exactly what we need.

```
library(pdist)

# pdist(W,X) computes distances between rows of W and rows of X
tmp <- as.matrix(pdist(W, X))
dwx <- apply(tmp, 1, min)

# pdist(U,X) computes distances between rows of U and rows of X
tmp <- as.matrix(pdist(U, X))
dux <- apply(tmp, 1, min)
```
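For small data, the all-pairs distance matrix that `pdist` returns can be cross-checked in base R via the identity |w − x|² = |w|² + |x|² − 2 w·x. The matrices `W` and `X` below are made-up toy inputs:

```r
# Base-R cross-check of the all-pairs Euclidean distance matrix
W <- matrix(c(0, 0,
              1, 1), ncol = 2, byrow = TRUE)
X <- matrix(c(3, 4,
              0, 1), ncol = 2, byrow = TRUE)

# Squared distance between every row of W and every row of X
D2 <- outer(rowSums(W^2), rowSums(X^2), "+") - 2 * tcrossprod(W, X)
D  <- sqrt(pmax(D2, 0))   # pmax guards against tiny negative rounding error

dwx <- apply(D, 1, min)   # nearest row of X for each row of W
D                         # row 1 of W is distance 5 and 1 from the rows of X
dwx                       # c(1, 1)
```

This is the same quantity `as.matrix(pdist(W, X))` computes, just without the package's optimized C code.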

Finally, there are two different ways to change some elements of the distance matrix to `Inf`:

```
library(microbenchmark)

# Method 1. Loop over the vector 1:n
#   for(i in 1:n) dwx[i, k[i]] <- Inf
microbenchmark(hopkins(X12, 500))
## Unit: milliseconds
##              expr     min      lq     mean  median       uq      max neval
## hopkins(X12, 500) 38.9493 42.8045 50.73876 45.0668 47.69945 120.9366   100

# Method 2. Build a matrix of indexes to the cells that need changing
#   dwx[ matrix( c(1:n, k), nrow=n) ] <- Inf
microbenchmark(hopkins(X12, 500))
## Unit: milliseconds
##              expr     min       lq     mean   median      uq      max neval
## hopkins(X12, 500) 39.2668 42.41565 50.21628 43.74215 46.9462 126.3522   100
```

The median times across 100 calls to the function are virtually identical for this test case. Results could differ for smaller or larger datasets. To our (purely subjective) taste, the loop method is a bit easier to understand.

How good is the optimization? In one simulated-data example with 1000 rows and 2 columns, sampling 500 rows, the non-optimized function took about 17 seconds, while the optimized function took approximately 0.05 seconds.