Data Integrity Checks
Last updated September 6, 2024
Ensuring the quality of your data is crucial for building robust and reliable machine learning models. Data integrity checks help you identify and address issues in your data that can degrade model performance and produce inaccurate predictions. Evidently AI provides tools for running these checks.
- Missing Values: Evidently AI can detect missing values in your data, allowing you to understand the extent of the issue and identify features with missing data. This is important because missing data can lead to biased models and inaccurate predictions.
- Duplicate Rows: Detect duplicate rows to ensure that your dataset is free from redundant entries. Duplicates can skew model training and lead to inaccurate model estimates.
- Data Type Consistency: Ensuring that all data points in a specific feature adhere to the correct data type is essential. This includes checks for numerical features (integers, floats), categorical features (text, strings), and timestamp data.
- Incorrect Format: Validate that formatted values such as dates, times, and currency amounts are stored in the expected format. Invalid formats can cause errors during data processing or analysis.
- Outlier Detection: Identify extreme values in your data that may be errors or unusual events. Outliers can bias your model and reduce its accuracy, especially in statistical modeling or when they are fed to models that are sensitive to extreme inputs.
- Data Uniqueness: Check that features expected to be unique, such as identification numbers or other unique identifiers, contain no repeated values.
- Feature Distributions: Analyze the distributions of your features using histograms or box plots. This helps you identify potentially problematic patterns.
- Value Range: Validate that values within features fall within expected ranges. This is important for numerical features where unrealistic values might indicate data errors.
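To make several of the checks above concrete, here is a minimal sketch written with plain pandas. It is not Evidently AI's API. The function `integrity_summary`, its parameters (`unique_cols`, `ranges`, `date_formats`, `iqr_factor`), and the sample columns are all hypothetical names chosen for illustration. The interquartile-range rule is only one common way to flag outliers.

```python
import pandas as pd


def integrity_summary(df, unique_cols=(), ranges=None, date_formats=None,
                      iqr_factor=1.5):
    """Return a dict of basic integrity findings (illustrative sketch only)."""
    ranges = ranges or {}
    date_formats = date_formats or {}
    report = {
        # Missing values: number of null entries per column.
        "missing": {col: int(n) for col, n in df.isna().sum().items()},
        # Duplicate rows: rows identical to an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
        # Data type consistency: the dtype pandas assigned to each column.
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "non_unique": {},
        "invalid_format": {},
        "out_of_range": {},
        "outliers": {},
    }

    # Data uniqueness: repeated values in columns that should be unique.
    for col in unique_cols:
        report["non_unique"][col] = int(df[col].duplicated().sum())

    # Incorrect format: non-null values that fail to parse with the given
    # date format (assumed to be supplied by the caller).
    for col, fmt in date_formats.items():
        parsed = pd.to_datetime(df[col], format=fmt, errors="coerce")
        report["invalid_format"][col] = int((parsed.isna() & df[col].notna()).sum())

    # Value range: non-null values outside the inclusive [low, high] bounds.
    for col, (low, high) in ranges.items():
        values = df[col].dropna()
        report["out_of_range"][col] = int(((values < low) | (values > high)).sum())

    # Outlier detection: values beyond iqr_factor * IQR from the quartiles.
    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        spread = iqr_factor * (q3 - q1)
        mask = (values < q1 - spread) | (values > q3 + spread)
        report["outliers"][col] = int(mask.sum())

    return report


# Small made-up dataset with one problem of each kind.
df = pd.DataFrame({
    "id": [1, 2, 2, 4, 5],
    "age": [25, 30, 30, None, 200],
    "signup": ["2024-01-05", "2024-02-10", "2024-02-10", "not a date", "2024-03-01"],
})

report = integrity_summary(
    df,
    unique_cols=["id"],
    ranges={"age": (0, 120)},
    date_formats={"signup": "%Y-%m-%d"},
)
print(report)
```

A summary like this is a starting point for deciding which columns need attention. The thresholds (the age bounds and the IQR factor here) are assumptions that should be set from domain knowledge rather than taken from this sketch.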