Data mining. Textbook - page 2
In these circumstances, an anomaly says something about the process that generated the data. A pattern of deviations or outliers is unlikely to be a purely random fluctuation of the underlying probability distribution, which suggests that the deviation is associated with a specific process. Under this assumption, anomalies can be treated as data generated by that process. However, an anomaly is not necessarily related to the process that generated the data.
Understanding Data Anomalies
When evaluating data anomalies, it is important to understand the underlying probability distribution and the probability it assigns to a given deviation. It is also important to know whether the data approximately follow the assumed distribution. If they do, the estimated probability is likely to be close to the true probability. If they do not, the estimated probability of a deviation may be slightly greater than the true probability. With this in hand, larger deviations can be interpreted as more significant anomalies. The probability of an anomaly can be assessed with any suitable measure, such as a sample probability, a likelihood, or a confidence interval. Even if the anomaly is not associated with a specific process, it is still possible to estimate the probability of a deviation.
These probabilities must be compared with the natural (baseline) distribution. If the estimated probability is much greater than the natural probability, the deviation may not be of the same magnitude as ordinary variation. Such a large difference is itself unlikely, however, because its probability under the natural distribution is very small. A difference that stays close to the natural probability therefore does not indicate an actual deviation from the probability distribution.
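The comparison with the natural distribution can be sketched in a few lines. This is a minimal illustration, not a method taken from the text: it assumes the baseline data are roughly normally distributed, and the function name `tail_probability` and the sample values are hypothetical.

```python
import math
import statistics


def tail_probability(value, baseline):
    """Return (z, p) for `value` against a baseline sample.

    z is the deviation in baseline standard deviations; p is the two-sided
    probability of a deviation at least that large, ASSUMING the baseline
    is approximately normally distributed.
    """
    mean = statistics.fmean(baseline)
    std = statistics.stdev(baseline)  # sample standard deviation
    z = (value - mean) / std
    # P(|Z| >= |z|) for a standard normal variable Z
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical baseline measurements and one suspicious observation
baseline = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.0]
z, p = tail_probability(12.0, baseline)
print(f"z = {z:.2f}, two-sided tail probability = {p:.3g}")
```

A very small tail probability suggests that the observation is hard to explain as ordinary variation of the baseline. If the normality assumption does not hold, the number should be read as a rough guide only.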
Revealing the Significance of Data Anomalies
When evaluating data anomalies, it is useful to identify the relevant circumstances. For example, an apparent anomaly in the number of delayed flights may turn out to be a quite small deviation. If many flights are routinely delayed, the observed number of delays is likely to be very close to the natural rate. If only several flights are delayed, the deviation is unlikely to be much greater than the natural rate. Neither case indicates a significantly higher deviation, which suggests that the anomaly is not important.
If the percentage deviation from the normal distribution is significantly higher, the data anomaly may be process related. A large deviation is additional evidence that the anomaly is a genuine departure from the normal distribution rather than random variation.
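The delayed-flights example can be made concrete with a small calculation. The sketch below is only an illustration under stated assumptions: each flight is treated as independently delayed with a fixed baseline rate (a binomial model), and the flight count, baseline rate, and function name `binomial_upper_tail` are all hypothetical.

```python
import math


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed by direct summation."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )


n_flights = 200       # hypothetical number of flights in the period
baseline_rate = 0.20  # hypothetical natural probability of a delay

# Compare a mild excess of delays with a large one
for delayed in (45, 70):
    prob = binomial_upper_tail(delayed, n_flights, baseline_rate)
    print(f"{delayed} delayed flights: P(X >= {delayed}) = {prob:.3g}")
```

Under these assumptions, a count close to the expected 40 delays has a sizeable tail probability and is consistent with ordinary variation, while a count far above it has a tiny tail probability and is a candidate for a process-related anomaly.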
After analyzing the significance of the anomaly, it is important to find out what the cause of the anomaly is. Is it related to the process that generated the data, or is it unrelated? Did the data anomaly arise in response to an external influence, or did it originate internally? This information is useful in determining what the prospects for obtaining more information about the process are.
The reason is that not all deviations are related to process variability, and different deviations affect the process in different ways. In the absence of a clearly defined process, determining the impact of a data anomaly can be challenging.
Analysis of the importance of data anomalies