Classifying feedback into categories
Given a set of interviews, reviews, or other user feedback, you'd like to classify the feedback into broad categories and get a summary of each category.
To do this, we need to:
- Convert the feedback to numbers that we can work with
- Find the best number of categories
- Put the feedback into the categories
- Have a summary of every category
The first step can easily be done using embedding models. The second and third steps require a clustering algorithm.
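As a minimal sketch of the embedding step, assuming the sentence-transformers library and an off-the-shelf model (both arbitrary choices for illustration, not necessarily what this project uses). It produces the embeddings array used by the clustering code below:

# Sketch of the embedding step; the library and model name are
# illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
feedback = [
    "The app crashes when I upload a photo.",
    "Love the new dark mode!",
    "Support took three days to answer my ticket.",
]
# One vector per piece of feedback, as an (n_samples, n_dims) numpy array
embeddings = model.encode(feedback)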
Because the vectors produced by the models do not form
convex clusters, we cannot use the KMeans algorithm (which assumes roughly
convex, centroid-centered clusters), so we use the DBSCAN algorithm, as it
is able to deduce the number of clusters on its own. To find the right
value for its epsilon parameter, we look for a "phase transition" (in the
percolation theory sense).
If epsilon is too high, there is only one big cluster. If epsilon is too low,
all the points are marked as outliers (and there is no cluster at all). Using
a dichotomy (binary) search, we find the epsilon that maximizes the number of
clusters, which automatically gives us the categories.
import numpy as np
from sklearn.cluster import DBSCAN

# 5 is a hyper parameter of the algorithm: the minimum number of
# points required to form a dense region.
MIN_SAMPLES = 5

def count_clusters(eps, embeddings):
    # We use the L2 norm (p=2) as it is the only rotationally invariant norm.
    dbscan = DBSCAN(eps=eps, min_samples=MIN_SAMPLES, p=2)
    labels = dbscan.fit_predict(embeddings)
    unique = np.unique(labels)
    cluster_count = len(unique)
    if -1 in unique:
        cluster_count -= 1  # DBSCAN labels outliers -1; do not count them
    return cluster_count

def count_outliers(eps, embeddings):
    # Number of points DBSCAN marks as noise (label -1) for this eps
    dbscan = DBSCAN(eps=eps, min_samples=MIN_SAMPLES, p=2)
    labels = dbscan.fit_predict(embeddings)
    return int(np.sum(labels == -1))

# Bracket the phase transition: an upper eps with at least one cluster
# and a lower eps where every point is an outlier.
upper_eps = 0.5
while count_clusters(upper_eps, embeddings) < 1:
    upper_eps *= 2
lower_eps = 0.001
while count_clusters(lower_eps, embeddings) > 0:
    lower_eps /= 2

tmp_opti = lower_eps
# Narrow the bounds using a dichotomy until we reach at least 2 clusters
while True:
    test_eps = (upper_eps + lower_eps) / 2
    ccount = count_clusters(test_eps, embeddings)
    if ccount == 0:  # Too low, no clusters
        lower_eps = test_eps
    elif ccount == 1:  # Too high, only one cluster
        upper_eps = test_eps
    else:  # Just right: we have at least 2 clusters
        lower_eps = test_eps
        tmp_opti = test_eps
        critical_step = upper_eps - lower_eps
        break

# We perform a second optimization round to fine-tune the value of eps:
# try to reduce the number of outliers while keeping at least 2 clusters.
while (
    count_outliers(tmp_opti, embeddings) > len(embeddings) / 2
    and count_clusters(tmp_opti, embeddings) > 2
):
    tmp_opti += critical_step / 16
# The loop stops one step past the last value for which the condition
# above was true, so step back to it.
tmp_opti -= critical_step / 16
In practice, it finds about 5 categories that usually contain reviews about similar themes.
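With the optimal eps in hand, the last two steps are mechanical: run DBSCAN once more, group the feedback by cluster label, and summarize each group. A minimal sketch, reusing tmp_opti, MIN_SAMPLES, embeddings, and feedback from above; summarize is a hypothetical stand-in for whatever summarization model you use:

# Sketch of the final two steps; summarize() is a hypothetical helper
# standing in for an LLM or other summarization model.
from collections import defaultdict
from sklearn.cluster import DBSCAN

labels = DBSCAN(eps=tmp_opti, min_samples=MIN_SAMPLES, p=2).fit_predict(embeddings)

categories = defaultdict(list)
for text, label in zip(feedback, labels):
    if label != -1:  # DBSCAN marks outliers with -1; leave them uncategorized
        categories[label].append(text)

for label, texts in sorted(categories.items()):
    print(f"Category {label} ({len(texts)} items)")
    print(summarize("\n".join(texts)))  # Hypothetical summarization helper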
I have tested replacing the homemade optimisation with functions from
scipy.optimize such as minimize, but they did not work well, stopping with
only 1 or 2 clusters. This is because count_clusters is a piecewise-constant
function of epsilon, so we have access to more information about its
behaviour than scipy, which has to perform black-box optimisation.
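The scipy attempt itself is not shown here; my reconstruction of what it might have looked like is below (an assumption, not the original code). Because count_clusters is flat almost everywhere, the optimizer sees no direction of improvement and stalls:

# Assumed sketch of a scipy-based replacement for the dichotomy search.
# minimize_scalar treats count_clusters as a black box; the step-function
# shape of the objective gives it nothing to follow.
from scipy.optimize import minimize_scalar

result = minimize_scalar(
    lambda eps: -count_clusters(eps, embeddings),  # Maximize the cluster count
    bounds=(lower_eps, upper_eps),
    method="bounded",
)
best_eps = result.x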
I have tested another classification algorithm in Category fracking, and tried dimensionality reduction in PCA before classification.