Classifying feedback into categories
Given a set of interviews, reviews, or other user feedback, you'd like to classify the feedback into broad categories and get a summary of each category.
To do this, we need to:
- Convert the feedback to numbers that we can work with
- Find the best number of categories
- Put the feedback into the categories
- Have a summary of every category
The first step can easily be done using embedding models. The second and third steps require a clustering algorithm.
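As a minimal sketch of the embedding step, assuming the sentence-transformers library and an off-the-shelf model (both arbitrary choices for illustration, not necessarily what this project uses). It produces the embeddings array used by the clustering code below:

# Sketch of the embedding step; the library and model name are
# illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
feedback = [
    "The app crashes when I upload a photo.",
    "Love the new dark mode!",
    "Support took three days to answer my ticket.",
]
# One vector per piece of feedback, as an (n_samples, n_dims) numpy array
embeddings = model.encode(feedback)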
Because the vectors produced by the models do not form
convex clusters, we cannot use the KMeans algorithm (which assumes roughly
convex, centroid-centered clusters), so we use the DBSCAN algorithm, as it
is able to deduce the number of clusters on its own. To find the right
value for its epsilon parameter, we look for a "phase transition" (in the
percolation theory sense).
If epsilon is too high, there is only one big cluster. If epsilon is too low,
all the points are marked as outliers (and there is no cluster at all). Using
a dichotomy (binary) search, we find the epsilon that maximizes the number of
clusters, which automatically gives us the categories.
import numpy as np
from sklearn.cluster import DBSCAN

# 5 is a hyper parameter of the algorithm: the minimum number of
# points required to form a dense region.
MIN_SAMPLES = 5

def count_clusters(eps, embeddings):
    # We use the L2 norm (p=2) as it is the only rotationally invariant norm.
    dbscan = DBSCAN(eps=eps, min_samples=MIN_SAMPLES, p=2)
    labels = dbscan.fit_predict(embeddings)
    unique = np.unique(labels)
    cluster_count = len(unique)
    if -1 in unique:
        cluster_count -= 1  # DBSCAN labels outliers -1; do not count them
    return cluster_count

def count_outliers(eps, embeddings):
    # Number of points DBSCAN marks as noise (label -1) for this eps
    dbscan = DBSCAN(eps=eps, min_samples=MIN_SAMPLES, p=2)
    labels = dbscan.fit_predict(embeddings)
    return int(np.sum(labels == -1))

# Bracket the phase transition: an upper eps with at least one cluster
# and a lower eps where every point is an outlier.
upper_eps = 0.5
while count_clusters(upper_eps, embeddings) < 1:
    upper_eps *= 2
lower_eps = 0.001
while count_clusters(lower_eps, embeddings) > 0:
    lower_eps /= 2

tmp_opti = lower_eps
# Narrow the bounds using a dichotomy until we reach at least 2 clusters
while True:
    test_eps = (upper_eps + lower_eps) / 2
    ccount = count_clusters(test_eps, embeddings)
    if ccount == 0:  # Too low, no clusters
        lower_eps = test_eps
    elif ccount == 1:  # Too high, only one cluster
        upper_eps = test_eps
    else:  # Just right: we have at least 2 clusters
        lower_eps = test_eps
        tmp_opti = test_eps
        critical_step = upper_eps - lower_eps
        break

# We perform a second optimization round to fine-tune the value of eps:
# try to reduce the number of outliers while keeping at least 2 clusters.
while (
    count_outliers(tmp_opti, embeddings) > len(embeddings) / 2
    and count_clusters(tmp_opti, embeddings) > 2
):
    tmp_opti += critical_step / 16
# The loop stops one step past the last value for which the condition
# above was true, so step back to it.
tmp_opti -= critical_step / 16
In practice, it finds about 5 categories that usually contain reviews about similar themes.
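With the optimal eps in hand, the last two steps are mechanical: run DBSCAN once more, group the feedback by cluster label, and summarize each group. A minimal sketch, reusing tmp_opti, MIN_SAMPLES, embeddings, and feedback from above; summarize is a hypothetical stand-in for whatever summarization model you use:

# Sketch of the final two steps; summarize() is a hypothetical helper
# standing in for an LLM or other summarization model.
from collections import defaultdict
from sklearn.cluster import DBSCAN

labels = DBSCAN(eps=tmp_opti, min_samples=MIN_SAMPLES, p=2).fit_predict(embeddings)

categories = defaultdict(list)
for text, label in zip(feedback, labels):
    if label != -1:  # DBSCAN marks outliers with -1; leave them uncategorized
        categories[label].append(text)

for label, texts in sorted(categories.items()):
    print(f"Category {label} ({len(texts)} items)")
    print(summarize("\n".join(texts)))  # Hypothetical summarization helper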
I have tested replacing the homemade optimisation with functions from
scipy.optimize such as minimize, but they did not work well, stopping with
only 1 or 2 clusters. This is because count_clusters is a piecewise-constant
function of epsilon, so we have access to more information about its
behaviour than scipy, which has to perform black-box optimisation.
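The scipy attempt itself is not shown here; my reconstruction of what it might have looked like is below (an assumption, not the original code). Because count_clusters is flat almost everywhere, the optimizer sees no direction of improvement and stalls:

# Assumed sketch of a scipy-based replacement for the dichotomy search.
# minimize_scalar treats count_clusters as a black box; the step-function
# shape of the objective gives it nothing to follow.
from scipy.optimize import minimize_scalar

result = minimize_scalar(
    lambda eps: -count_clusters(eps, embeddings),  # Maximize the cluster count
    bounds=(lower_eps, upper_eps),
    method="bounded",
)
best_eps = result.x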
I have tested another classification algorithm in Category fracking, and tried dimensionality reduction in PCA before classification.