Advanced KNIME Techniques for Data Processing and Analysis
With experience and knowledge, you can put together the most effective recipe for each KNIME situation. Data preparation falls into two broad types: preparing data for KPI calculation, and preparing raw data so that data science algorithms can extract information from it. In the first case, the procedure is dictated by the industry and the domain; in the second, it is fairly standard across data science projects.
In this article, we introduce four fundamental and broad steps for preparing data to be used by machine learning algorithms. We will walk through these transformations with a concrete example: predicting customer churn.
- Normalization
- Conversion of categorical features into numbers
- Imputation of missing values
- Resampling to fix class imbalance
This is a binary classification problem: the churn output class takes only two values, yes and no. For the classification you could use logistic regression, a decision tree, or a random forest. Logistic regression is a classic algorithm that trains quickly and whose results are simple to interpret, so that is the solution we will implement.
Logistic regression works best on normalized data. More generally, any calculation involving distances or variances should run on standardized inputs: features with large ranges would otherwise dominate those calculations. Normalizing the data ensures that all input features are treated equally.
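In a KNIME workflow this step is typically handled by a normalization node; as an illustrative sketch outside KNIME (the function name and sample values are our own), z-score standardization looks like:

```python
import statistics

def z_score_normalize(column):
    """Standardize a numeric column to zero mean and unit standard deviation."""
    mean = statistics.fmean(column)
    stdev = statistics.pstdev(column)  # population standard deviation
    return [(x - mean) / stdev for x in column]

# Two features with very different ranges...
ages = z_score_normalize([25, 35, 45, 55])
incomes = z_score_normalize([30_000, 50_000, 70_000, 90_000])
# ...end up on the same scale after standardization,
# so neither dominates a distance or variance calculation.
```

After this transformation both columns have mean 0 and standard deviation 1, which is exactly the "equal treatment" the paragraph above asks for.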
Logistic regression, in its original formulation, works on numerical attributes only. Some packages do accept categorical and nominal input features, but only because a preparation step inside the learning function converts them into numbers first. Most of the time, then, you need to convert categorical features into numerical ones yourself before applying logistic regression.
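A common way to perform this conversion is one-hot encoding, where each category becomes a 0/1 indicator column. A minimal sketch (the function name and example categories are hypothetical):

```python
def one_hot_encode(values):
    """Map each distinct category to a 0/1 indicator vector.

    Columns are ordered by the sorted list of distinct categories.
    """
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# A nominal column with three contract types
plans = ["basic", "premium", "basic", "standard"]
encoded = one_hot_encode(plans)
# columns: basic, premium, standard
# "basic" -> [1, 0, 0], "premium" -> [0, 1, 0], "standard" -> [0, 0, 1]
```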
- Imputation of Missing Values
Logistic regression cannot analyze data containing missing values. Some logistic regression learning functions include their own missing value strategy; nevertheless, we prefer to stay in control of this step, so a missing value imputation strategy must also be chosen.
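One of the simplest imputation strategies for numeric columns is to replace each missing entry with the column mean. A minimal sketch (function name and data are our own):

```python
import statistics

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    mean = statistics.fmean(observed)
    return [mean if x is None else x for x in column]

# A numeric column with two missing values; the observed mean is 20.0
calls = impute_mean([10.0, None, 30.0, None, 20.0])
# -> [10.0, 20.0, 30.0, 20.0, 20.0]
```

Mean imputation is only one option; depending on the feature, a fixed value, the median, or the most frequent value may be more appropriate.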
The last fundamental question is: are the target classes evenly distributed? They are not: 85 percent of the records belong to the no-churn class and only 15 percent to the churn class. A class with far fewer samples than the other risks being overlooked by the training algorithm. If the imbalance is not too severe, stratified sampling in the Partitioning node should be sufficient; if it is more severe, as in this case, it can help to resample the data before feeding the training algorithm.
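Stratified sampling keeps the 85/15 class proportions identical in the training and test partitions. A minimal sketch of the idea behind the Partitioning node's stratified option (the function name and split fraction are assumptions for illustration):

```python
import random

def stratified_split(labels, train_frac=0.8, seed=42):
    """Split row indices so each class keeps the same proportion
    in the training and the test partition."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_frac)
        train_idx += indices[:cut]
        test_idx += indices[cut:]
    return train_idx, test_idx

# 85 "no" rows and 15 "yes" rows, as in the churn example
labels = ["no"] * 85 + ["yes"] * 15
train_idx, test_idx = stratified_split(labels)
# the 85/15 ratio is preserved: 68 "no" + 12 "yes" in training
```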
When resampling, you can either undersample the majority class or oversample the minority class. The choice depends on how much data you have and whether you can afford to discard records from the majority class. Because our dataset is small, we oversampled the minority class using the SMOTE algorithm.
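SMOTE creates synthetic minority samples by interpolating between an existing minority sample and one of its k nearest minority neighbors. The following is a simplified sketch of that core idea, not a full SMOTE implementation (all names and data are our own):

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic samples by interpolating between a random
    minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x among the other minority samples
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

# Three minority points in 2D; generate five synthetic ones between them
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote_like_oversample(minority, n_new=5)
```

Each synthetic point lies on a line segment between two real minority points, so the new samples stay inside the region the minority class already occupies instead of simply duplicating existing rows.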