3 Ways To Cluster Data Using Unsupervised Learning
This article was written by Julia Masch, a Financial Analyst at I Know First.
Highlights:
- Understanding Unsupervised Learning
- Types of Clustering and How To Implement Them
- How I Know First Uses Clustering
The I Know First algorithm builds relationships between the various assets in the stock market in its prediction process. For example, on May 27, 2018, the machine learning algorithm gave a bullish prediction for Clean Energy Fuels Corp. for a one month time period. The natural gas provider’s stock increased over this time period in accordance with the forecast mostly thanks to climbing diesel prices. While creating the forecast, the algorithm considered the relationship between CLNE and diesel and incorporated this information into the forecast to make it more accurate. While it seems like common sense to a human that if diesel, a substitute for clean energy, increases in price then there will be an increased demand in natural gas and natural gas providers like Clean Energy Fuels Corp. will benefit. However, the I Know First algorithm does not know what goods and services Clean Energy provides, it simply knows the ticker symbol. Yet, the algorithm deduces the relationship using unsupervised learning.
Building relationships between data points (Source: Wikimedia Commons)
One of the most popular methods to implement this is cluster analysis which uses a layered method in order to group the data into relevant groups of clusters (hence the name) to reveal a previously unseen structure in the system. There are many different similarity metrics that can be used to create these clusters such as probabilistic or Euclidean distance and the ideal one depends on what the end goal is. Additionally, some clustering methods change greatly depending on which metric is used, whereas some are less susceptible. Clustering can be applied to many different fields.
Types of Clustering
There are many different types of clustering, but we will discuss 3 of the main ones in depth: K-Means clustering, hierarchical clustering, and density-based spatial clustering of applications with noise (DBSCAN).
K-Means clustering is one of the most popular and works by separating the data into k distinct groups based on each data point’s distance to the centroid of the cluster.
Steps:
- Select how many groups (k) you would like to separate the data into and randomly initialize a center for each group.
- Compute the distance between each point and group center and assign each point to the nearest mean.
- Recalculate the centroid using all the points that are now in the group.
- Repeat steps 2 and 3 until the group centers barely change between repetitions or based on a set number of iterations.
(Pictures from Wikimedia Commons, GIF by I Know First)
Another variation of K-Means clustering is K-Medians clustering which follows the same steps, but instead of using the mean of each group, the median is calculated and used for assigning the points. This method is less sensitive to outliers.
Hierarchical clustering builds a cluster tree that depicts a multilevel hierarchy of the data. There are two types of hierarchical clustering: top-down and bottom-up. Top-down algorithms begin with all of the data in one group and branch out as differences become apparent whereas bottom-up algorithms begin with each point as an individual cluster and then agglomerates pairs of clusters until there is only one. Bottom-up hierarchical clustering is also known as hierarchical agglomerative clustering (HAC). The hierarchy can then be depicted as a tree with the agglomerated group of all clusters as the root and each smaller cluster as a branch and each cluster of a single data point as a leaf.
Steps:
- Assign each data point as a single cluster or leaf and choose which metric for distance will be used to merge clusters (most often average linkage, which calculates the average distance between points in both clusters, is used).
- Find the smallest distance between clusters and merge two clusters into one
- Repeat step 2 until a root is formed that contains all of the leaves and is the root of the tree
(Source: Wikimedia Commons)
Top-down hierarchical clustering is the opposite and instead begins with a root, separates the 2 clusters with the greatest distance between them, and then repeats until each cluster is an individual data point.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is similar to Mean Shift Clustering which uses a sliding window to create clusters based on dense areas of data points. DBSCAN works by selecting a distance, epsilon ε, and then assigns all points within this distance from the starting point or other points in the group to a cluster or neighborhood. A minimum number of points can be assigned so that all neighborhoods have a minimum density. Points that are not within epsilon distance of any cluster can then be classified as noise or outliers.
Steps:
- Choose a random starting point and find all points within epsilon of the point and add them to the cluster. If it does not satisfy the minimum number of points choose another randomly selected point and classify the original point as noise (it may be reclassified as part of another cluster later).
- Add all points within epsilon of the newly added points to the cluster until all points within epsilon of the neighborhood have been checked. The points in the neighborhood make up the cluster.
- Repeat steps 1 and 2 until no more clusters can be created and every point has been classified as part of a cluster or as noise.
In the picture below, DBSCAN captures the nonlinear cluster much better than K-Means clustering could. The gray points are those that were classified as noise.
(Source: Wikimedia Commons)
Some other types of clustering include self-organizing maps which utilize neural nets to make observations about the distribution, hidden Markov models which recover the sequence of states based on the data, the aforementioned mean shift clustering, Gaussian mixture models which assume that the data points are Gaussian distributed, and many more.
So which clustering method is best?
Each clustering method has advantages and disadvantages. For example, since K-Means clustering has linear complexity and runs in O(n), it is quicker than some of the other options. This means the algorithm can be run multiple times and the best option can be picked, but the fact that different clusters may be formed also means that results may be inconsistent and not replicable. However, there are also limitations to K-Means clustering. It cannot identify shapes well and two circular clusters with differing radii may not be classified as clusters based on the shape. Additionally, K-Means modeling can be sensitive to outliers, thus if there are many outliers in the data set it may be more logical to use K-Median clustering.
Unlike in K-Means clustering where a set amount of clusters are chosen in advance, hierarchical clustering allows the viewer to choose any number of clusters by picking any level on the tree. Additionally, HAC is less sensitive to the distance metric than other clustering methods. However, hierarchical clustering does take longer to run at O(n3).
Density-Based Spatial Clustering of Applications with Noise filters out the noise which can be beneficial. Additionally, DBSCAN will choose the optimal number of clusters based on how many neighborhoods is identifies. The drawback to this method is that it struggles when the clusters have greatly varying densities. Additionally, border points can be categorized incorrectly and this method is sensitive to different distance metrics.
Gaussian mixture models are unique from many of the other clustering methods because the end result is a probability that a point lies in a cluster. Therefore, a point could have a probability of being in more than one cluster and so the clusters are not exclusive.
There is no clustering method that is inherently better than others, it simply depends on the context of the data and what will be most useful to the viewer for analysis.
How I Know First Implements Clustering
I Know First uses clustering to group the multivariate time series of data for all of the assets we cover. Clustering allows the aggregation of variables in a model to represent an underlying relationships or connections, making it easier to understand the data. There are three main purposes for this: to simplify the model by automatically creating virtual indices with different degrees of belonging, to increase the precision and robustness of individual assets, and to separate all the assets into stock groups forecast. This is different from the conventional somewhat arbitrary division of companies into different market sectors or industry groups. This way we can separate forecasts into packages for aggressive stocks, small cap stocks, etc. We use custom modified K-means because the speed of this method is essential for such a large dataset and if we run it multiple times the algorithm can discover even more relationships. Clustering helps us uncover the ways assets interact with and affect each other and the algorithm uses this information to improve the accuracy of the predictions it makes. It also allows a macro view, to see where the market is heading as a whole. Moreover, the parameters of each cluster help in constructing an efficient portfolio with maximized opportunity and minimized risk.
Subscribe to I Know First’s Daily Market Forecast