In-class Exercise 7

Author

Eugene Toh

Cluster analysis

  • Data preparation and exploration

    • Make sure the data is normalised (e.g., divide counts such as drug abuse cases by population, since raw counts are driven by population size)

      • Multiply the resulting rate by a constant (e.g., report cases per 10,000 people) so the values are not tiny decimals and are easier to read (see the sketch below)
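A minimal sketch of this normalisation step, assuming a data frame `df` with hypothetical columns `drug_abuse_cases` and `population`:

```r
library(dplyr)

# Convert the raw count into a population-normalised rate and scale it
# to cases per 10,000 people so the values are easier to read
df <- df %>%
  mutate(drug_abuse_rate = drug_abuse_cases / population * 10000)
```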
  • Data analysis and selecting clustering variables

    • Checking the distribution of the cluster variables

    • If the variables are on very different scales or ranges, apply a standardisation technique (sketched after this sub-list)

      • Z-score (if your data is normally distributed)

        • Positive and negative, centre is 0
      • Min-max (better suited to highly skewed data)

        • Outputs numbers between 0 to 1
      • Decimal scaling

        • Divides values by a power of 10; similar in spirit to the scaling-by-a-constant step described above
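A minimal sketch of the three standardisation options in base R, assuming `df` is a hypothetical data frame containing only the numeric clustering variables:

```r
# Z-score: centre each variable on 0 and scale to unit variance
df_z <- as.data.frame(scale(df))

# Min-max: rescale each variable to the [0, 1] range
min_max <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
df_mm <- as.data.frame(lapply(df, min_max))

# Decimal scaling: divide by a power of 10 large enough to bring values into roughly [-1, 1]
decimal_scale <- function(x) x / 10^ceiling(log10(max(abs(x), na.rm = TRUE)))
df_ds <- as.data.frame(lapply(df, decimal_scale))
```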
    • Create a correlation matrix to visualise linear relationships between the candidate variables (see the sketch after this sub-list)

      • Ellipses oriented like a forward slash (/) indicate a positive correlation; ellipses oriented like a backslash (\) indicate a negative correlation

      • 0 means no linear relationship

      • Ranges from -1 to 1

      • The narrower the ellipse, the stronger the relationship
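One way to draw the ellipse-style correlation matrix described above is with the corrplot package; a minimal sketch, again assuming the hypothetical data frame `df` of candidate clustering variables:

```r
library(corrplot)

# Pairwise Pearson correlations, visualised as ellipses:
# narrow ellipses = strong relationships, slash/backslash orientation = sign
corr_mat <- cor(df, use = "pairwise.complete.obs")
corrplot(corr_mat, method = "ellipse", type = "upper", diag = FALSE)
```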

  • Decide on the clustering technique

    • Agglomerative (bottom-up merging; more common)

    • Divisive (top-down splitting)

  • Perform cluster analysis

    • Hierarchical clustering is performed on the aspatial (attribute) data

    • For each pair of observations, calculate a distance (dissimilarity) measure from the clustering variables

      • Euclidean distance

      • City-block (Manhattan) distance

      • Chebyshev distance

    • All of them use squares or absolute values so that positive and negative differences cannot cancel out and the distance is always non-negative (see the sketch below)
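A minimal sketch of the three proximity measures using base R's dist(), assuming `df_std` is the hypothetical standardised data frame from the previous step:

```r
# Each call returns a matrix of pairwise distances between observations
d_euclidean <- dist(df_std, method = "euclidean")  # square root of summed squared differences
d_cityblock <- dist(df_std, method = "manhattan")  # sum of absolute differences
d_chebyshev <- dist(df_std, method = "maximum")    # largest absolute difference across variables
```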

    • Visualising it

      • Nested clusters

      • Dendrogram

        • The lower the height at which two items are joined, the more similar they are
    • Since we are using hierarchical agglomerative clustering, each iteration merges clusters from the previous iteration, so earlier clusters are nested inside later ones (see the sketch below)
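A minimal sketch of hierarchical agglomerative clustering on the Euclidean distance matrix, plotted as a dendrogram (Ward's method is used here only as an example):

```r
hc <- hclust(d_euclidean, method = "ward.D2")

# Dendrogram: the lower the height at which two items merge, the more similar they are
plot(hc, cex = 0.6)
rect.hclust(hc, k = 5, border = 2:6)  # optionally outline, say, 5 clusters
```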

    • Comparing agglomerative clustering (linkage) methods

      • Methods

        • Average

        • Single

        • Complete

        • Ward

      • The method with the agglomerative coefficient closest to 1 typically gives the strongest clustering structure (see the comparison sketch below)
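A minimal sketch of this comparison using cluster::agnes(), where the agglomerative coefficient is read from the $ac component; `df_std` is the hypothetical standardised data frame:

```r
library(cluster)
library(purrr)

methods <- c(average = "average", single = "single",
             complete = "complete", ward = "ward")

# Agglomerative coefficient for each linkage method
ac <- map_dbl(methods, function(m) agnes(df_std, method = m)$ac)
ac  # keep the method with the coefficient closest to 1
```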

  • Decide on the number of clusters

    • Gap statistic method

      • The B argument sets the number of Monte Carlo (bootstrap) reference samples used in the simulation

      • K.max sets the maximum number of clusters to consider (choose a value that makes sense for the number of observations you have)

        • Feel free to experiment with the value

        • Use gap statistics to figure out the best number of clusters

          • Only analyse the gap values from 3 clusters onwards

          • Then take the k with the maximum gap statistic (see the sketch below)
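A minimal sketch of the gap statistic using cluster::clusGap() with hierarchical clustering via factoextra::hcut(); B and K.max are the arguments described above, and `df_std` is the hypothetical standardised data frame:

```r
library(cluster)
library(factoextra)

set.seed(1234)
gap_stat <- clusGap(df_std, FUN = hcut, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)  # plot the gap statistic against the number of clusters

# Only look at k >= 3, then take the k with the largest gap statistic
gap_tab <- as.data.frame(gap_stat$Tab)
best_k  <- which.max(gap_tab$gap[3:nrow(gap_tab)]) + 2
best_k
```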

  • Validate and interpret the clusters

    • Check if there are any clusters which have only one member

      • Data errors

      • One extreme outlier (unlikely)

        • Such an observation should be excluded from the analysis
    • Methods

      • Parallel coordinates

        • You can try to make it interactive with Plotly

        • You can also split them up into separate graphs for each cluster using facets with GGally

        • You can then compare how the facets differ to characterise each cluster and reach a conclusion (a sketch is given at the end of these notes)

      • Dendrogram

      • If, at the end, only one variable remains useful (for example, because the others are highly correlated with it), you can use LISA on that variable instead of cluster analysis.
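A minimal sketch of the faceted parallel-coordinates plot with GGally, made interactive with plotly; it assumes a hypothetical data frame `df_clustered` holding the standardised clustering variables plus a `cluster` factor column:

```r
library(ggplot2)
library(GGally)
library(plotly)

p <- ggparcoord(df_clustered,
                columns = 1:4,            # indices of the clustering variables (adjust to your data)
                groupColumn = "cluster",
                scale = "globalminmax") +
  facet_wrap(~ cluster)                   # one panel per cluster

ggplotly(p)  # interactive version of the same plot
```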