Data mining. Textbook



But what is the difference between clustering and simply splitting the data into separate datasets?

The methods of implicit clustering and managed clustering are actually very similar. The only real difference is which parameter we use to decide in which direction to split the data. Take as an example a set of points on a sphere that together define an interconnected network. Both methods try to keep the resulting network as close as possible to the network defined by the two nearest points, because we do not care how far we end up from either of those two points. Using the implicit clustering algorithm (cluster distance), we would divide the sphere into two parts that define very different networks: one is the network defined by the two closest points, the other is the network defined by the two farthest points. The result is two completely separate networks. This is not a good approach, however, because the further we move away from the two closest points, the smaller the distances between points become and the harder it is to find connections between them, since only a limited number of points are connected by a small distance.
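
The text does not name a concrete algorithm, but the nearest-point criterion it describes behaves much like single-linkage agglomerative clustering, in which the distance between two clusters is the distance between their two closest points. The sketch below is a minimal illustration under that assumption; the sphere sampling, the library calls and the choice of two clusters are illustrative rather than prescribed by the text.

```python
# Minimal sketch: splitting points on a sphere with a nearest-point
# (single-linkage) criterion, used here as a stand-in for
# "implicit clustering (cluster distance)".
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Points on the unit sphere, playing the role of the interconnected network.
points = rng.normal(size=(200, 3))
points /= np.linalg.norm(points, axis=1, keepdims=True)

# Single linkage merges (and therefore splits) clusters based on the
# distance between their two closest points.
tree = linkage(pdist(points), method="single")

# Cut the hierarchy into two parts, i.e. divide the sphere into two networks.
labels = fcluster(tree, t=2, criterion="maxclust")
print(np.bincount(labels)[1:])  # sizes of the two resulting parts
```

Because the nearest-point rule looks only at the closest pair, such a cut often peels off a thin chain of points rather than two balanced halves (the well-known chaining behaviour of single linkage).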

The method of managed clustering (cluster distance), on the other hand, would require us to measure the length between every pair of points and then perform calculations that bring the closest networks as near to each other as possible. The result is likely to be two separate networks that are close to each other but not exactly the same. Since we need the two networks to be similar to each other in order to detect a relationship, this method is likely to fail: the two clusters will instead end up completely different.
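
Again no specific algorithm is named, but the requirement to measure the length between every pair of points is exactly what an all-pairs linkage rule needs. The sketch below uses complete linkage as a stand-in for managed clustering; that substitution, like everything else in the snippet, is an assumption rather than something the text specifies.

```python
# Minimal sketch: measure every pairwise length explicitly, then split so
# that each group's farthest pair stays as close together as possible
# (complete linkage, a stand-in for "managed clustering").
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 3))
points /= np.linalg.norm(points, axis=1, keepdims=True)  # same sphere as before

# Full n x n matrix of pairwise lengths - the measurement the method requires.
pairwise = squareform(pdist(points))
print(pairwise.shape)  # (200, 200)

tree = linkage(squareform(pairwise), method="complete")
labels = fcluster(tree, t=2, criterion="maxclust")
print(np.bincount(labels)[1:])  # typically two groups of comparable size
```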

The difference between these two methods comes down to how we define a "cluster". In the first method (cluster distance) we define a cluster as a set of points belonging to a network similar to the network defined by the two nearest points. By this definition the networks will always be connected (they will be the same distance apart) no matter how many points we include. In the second method (managed clustering), we define clusters as pairs of points that are the same distance from all other points in the network. This definition can make finding connected points very difficult, because it requires us to find every point that is similar to the other points in the network.

However, this is an understandable trade-off. By focusing on clusters whose points are the same distance from each other, we are likely to get more useful data: if we find connections between such points, we can use that information to uncover the relationship between them. In other words, we have more opportunities to find connections, which makes it easier to identify relationships. Defining clusters through distance measurements also ensures that we can find a relationship between two points even when there is no way to measure the distance between them directly. The price is that the data often ends up with very few connections.
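
To make the two definitions concrete, the sketch below writes each of them as a small scoring function: minimum pairwise distance for the first definition, and the spread of a point's distances to the rest for the "same distance from all other points" criterion. Both formulas are my reading of the informal descriptions above and should be treated as assumptions.

```python
# The two cluster definitions written out as simple scoring functions.
# Both formulas are interpretations of the informal text, not a quoted method.
import numpy as np

def nearest_pair_distance(A, B):
    """First definition: a cluster is judged by its two closest points, so the
    distance between point sets A and B is their minimum pairwise distance."""
    diffs = A[:, None, :] - B[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min()

def equidistance_spread(point, others):
    """Second definition: a point belongs with points that sit at the same
    distance from all other points, so we score it by the spread of its
    distances to the rest (smaller means more uniform)."""
    d = np.linalg.norm(others - point, axis=1)
    return d.std()

A = np.array([[0.0, 0.0], [0.5, 0.0]])
B = np.array([[2.0, 0.0], [3.0, 0.0]])
print(nearest_pair_distance(A, B))                       # 1.5
print(equidistance_spread(A[0], np.vstack([A[1:], B])))  # spread of 0.5, 2.0, 3.0
```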

Looking at the example of creating two datasets – one with implicit clustering and one with managed clustering – we can easily see the difference between the two methods. In some cases the results will be the same, and in others they will differ. If the method is good at finding interesting relationships (as it usually is), it will give us useful information about the overall structure of the data; if it is not good at identifying relationships, it will give us very little information.
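
A quick way to see this on actual data is to run both stand-in criteria on the same point set and measure how much the two partitions agree. The sketch below does exactly that; using single and complete linkage as proxies for the two methods, and the adjusted Rand index as the agreement measure, are my choices rather than the textbook's.

```python
# Compare the two stand-in criteria on the same dataset: identical
# partitions score 1.0 on the adjusted Rand index, unrelated ones score ~0.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
points = rng.normal(size=(300, 3))
points /= np.linalg.norm(points, axis=1, keepdims=True)  # points on a sphere

distances = pdist(points)
labels_implicit = fcluster(linkage(distances, method="single"),
                           t=2, criterion="maxclust")
labels_managed = fcluster(linkage(distances, method="complete"),
                          t=2, criterion="maxclust")

print(adjusted_rand_score(labels_implicit, labels_managed))
```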