# Unsupervised Learning

Eduardo Muñoz

## Determining the number of clusters in a data set

### 1. Distance Matrix

Normalize the variables before computing the distance matrix, so that features on larger scales do not dominate the distances.

```python
# Import libraries
from scipy.spatial import distance_matrix  # To calculate the distance matrix
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from mpl_toolkits.mplot3d import Axes3D  # To show a 3D plot

# Import dataset
data = pd.read_csv('../datasets/movies/movies.csv', sep=';')
data
```
| user_id | star_wars | lord_of_the_rings | harry_potter |
|---|---|---|---|
| 1 | 1.2 | 4.9 | 2.1 |
| 2 | 2.1 | 8.1 | 7.9 |
| 3 | 7.4 | 3 | 9.9 |
| 4 | 5.6 | 0.5 | 1.8 |
| 5 | 1.5 | 8.3 | 2.6 |
| 6 | 2.5 | 3.7 | 6.5 |
| 7 | 2 | 8.2 | 8.5 |
| 8 | 1.8 | 9.3 | 4.5 |
| 9 | 2.6 | 1.7 | 3.1 |
| 10 | 1.5 | 4.7 | 2.3 |
```python
movies = data.columns.values.tolist()[1:]  # List of movie columns: star_wars, lord_of_the_rings, harry_potter

dm1 = distance_matrix(data[movies], data[movies], p=1)    # Manhattan distance
dm2 = distance_matrix(data[movies], data[movies], p=2)    # Euclidean distance
dm10 = distance_matrix(data[movies], data[movies], p=10)  # Minkowski distance with large p (approaches the Chebyshev/maximum distance as p grows)

# Function to convert a distance matrix to a DataFrame
def distance_matrix_to_df(dd, col_name):
    return pd.DataFrame(dd, index=col_name, columns=col_name)

distance_matrix_to_df(dm1, data['user_id'])
distance_matrix_to_df(dm2, data['user_id'])
distance_matrix_to_df(dm10, data['user_id'])
```
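
As noted above, the variables should be normalized before computing distances. A minimal sketch, reusing the `data`, `movies`, and `distance_matrix_to_df` names defined above; the z-score normalization here is an illustrative choice, not part of the original notebook:

```python
# Z-score each movie column so all features share a comparable scale
normalized = (data[movies] - data[movies].mean()) / data[movies].std()

# Recompute the Euclidean distance matrix on the standardized features
dm2_norm = distance_matrix(normalized, normalized, p=2)
distance_matrix_to_df(dm2_norm, data['user_id'])
```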

Example: Distance Matrix (Manhattan Distance)

| user_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 9.9 | 15.9 | 9.1 | 4.2 | 6.9 | 10.5 | 7.4 | 5.6 | 0.7 |
| 2 | 9.9 | 0 | 12.4 | 17.2 | 6.1 | 6.2 | 0.8 | 4.9 | 11.7 | 9.6 |
| 3 | 15.9 | 12.4 | 0 | 12.4 | 18.5 | 9 | 12 | 17.3 | 12.9 | 15.2 |
| 4 | 9.1 | 17.2 | 12.4 | 0 | 12.7 | 11 | 18 | 15.3 | 5.5 | 8.8 |
| 5 | 4.2 | 6.1 | 18.5 | 12.7 | 0 | 9.5 | 6.5 | 3.2 | 8.2 | 3.9 |
| 6 | 6.9 | 6.2 | 9 | 11 | 9.5 | 0 | 7 | 8.3 | 5.5 | 6.2 |
| 7 | 10.5 | 0.8 | 12 | 18 | 6.5 | 7 | 0 | 5.3 | 12.5 | 10.2 |
| 8 | 7.4 | 4.9 | 17.3 | 15.3 | 3.2 | 8.3 | 5.3 | 0 | 9.8 | 7.1 |
| 9 | 5.6 | 11.7 | 12.9 | 5.5 | 8.2 | 5.5 | 12.5 | 9.8 | 0 | 4.9 |
| 10 | 0.7 | 9.6 | 15.2 | 8.8 | 3.9 | 6.2 | 10.2 | 7.1 | 4.9 | 0 |
```python
# 3D scatter plot of the three rating variables
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(xs=data['star_wars'], ys=data['lord_of_the_rings'], zs=data['harry_potter']);
```

### 2. Hierarchical Clustering

#### 2.1 Linkage Criteria

##### 2.1.1 Single Linkage

- The distance between two clusters is the minimum of the distances between all pairs of observations, one from cluster 1 and one from cluster 2

##### 2.1.2 Complete Linkage

- The distance between two clusters is the maximum of the distances between all pairs of observations, one from cluster 1 and one from cluster 2

##### 2.1.3 Average Linkage

- The distance between two clusters is the average of the distances between all pairs of observations, one from cluster 1 and one from cluster 2

##### 2.1.4 Centroid Linkage

- The distance between two clusters is the distance between the centroid (mean point) of cluster 1 and the centroid of cluster 2

##### 2.1.5 Ward Linkage

- Merges the pair of clusters that minimizes the variance of the merged cluster, and hence the total within-cluster variance of the dataset

```python
# Hierarchical Clustering
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Linkage: WARD   Distance: EUCLIDEAN
Z = linkage(data[movies], method='ward', metric='euclidean')  # data[movies] defined above

# Plot dendrogram
plt.figure(figsize=(25, 10))
plt.title('Dendrogram for Hierarchical Clustering')
plt.xlabel('Netflix user IDs')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=0, leaf_font_size=10)
plt.show();
```

```python
# Linkage: CENTROID   Distance: EUCLIDEAN
Z = linkage(data[movies], method='centroid', metric='euclidean')  # data[movies] defined above

# Plot dendrogram
plt.figure(figsize=(25, 10))
plt.title('Dendrogram for Hierarchical Clustering')
plt.xlabel('Netflix user IDs')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=0, leaf_font_size=10)
plt.show();
```
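
The linkage criteria above can be compared on the same data. A minimal sketch, assuming `data[movies]` from above: it computes the cophenetic correlation coefficient for each method, a common (though not definitive) way to gauge how faithfully a linkage preserves the original pairwise distances. The comparison itself is an addition, not part of the original notebook:

```python
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

# Pairwise Euclidean distances between all observations
dists = pdist(data[movies])

# Higher cophenetic correlation = dendrogram distances track the original distances more closely
for method in ['single', 'complete', 'average', 'centroid', 'ward']:
    Z = linkage(data[movies], method=method, metric='euclidean')
    c, _ = cophenet(Z, dists)
    print(f'{method:>8}: cophenetic correlation = {c:.3f}')
```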

Example: Hierarchical Clustering

- X: dataset (n x m array) of points to cluster
- n: number of observations (rows)
- m: number of features (columns)
- Z: cluster linkage array, holding the merge history
- k: desired number of clusters, used to cut the tree into flat clusters as sketched below
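
A minimal sketch of how these pieces fit together, reusing the Ward linkage from above; `fcluster` is scipy's call for cutting a linkage tree into flat clusters, and `k = 3` is an illustrative choice:

```python
from scipy.cluster.hierarchy import fcluster

k = 3  # illustrative choice of cluster count
Z = linkage(data[movies], method='ward', metric='euclidean')

# Cut the tree so that exactly k flat clusters remain
labels = fcluster(Z, t=k, criterion='maxclust')
print(labels)  # one cluster label (1..k) per row of data[movies]
```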

Notebook: Example Hierarchical Clustering using Python

#### 2.2 Pros and Cons

Pros

- We don't need to specify the number of clusters in advance
- It makes no assumptions about the shape of the data, so it is suitable for clusters of any shape

Cons

- We need to choose a distance threshold (or a cluster count) to cut the dendrogram, as sketched below
- The result is very sensitive to the distance metric and the linkage criterion
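
A minimal sketch of the threshold choice mentioned above, reusing `fcluster` with a distance criterion instead of a cluster count; the threshold value is a hypothetical assumption you would read off the dendrogram:

```python
from scipy.cluster.hierarchy import fcluster

max_d = 10.0  # hypothetical distance threshold, read off the dendrogram's y-axis

# Every merge above max_d is undone, leaving the flat clusters below that height
labels = fcluster(Z, t=max_d, criterion='distance')
print(labels)
```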

### 3. KMeans