  • Handling Missing Values: One of the Most Important Issues in Data Preprocessing

    Handling missing values (MVs) is an important issue in data preprocessing for data mining. Missing values arise for two main reasons. One is that data attributes are often aggregated from different sources, and a given case may not appear in every source. The other is that values are simply left unreported. The simplest way to deal with MVs is to eliminate every case that contains at least one MV. However, this is practical only when few cases contain MVs and when analyzing the complete cases alone does not seriously bias inference. For example, in our study, 10-30% of students have no recorded high school GPA or SAT score. We cannot simply drop these students, because most of them are international or transfer students, who make up a significant subset of the population. Nor is it practical to discard the variables themselves, as they have been shown to be important predictors of student performance. It is therefore important to apply an appropriate imputation strategy to the data.

    A variety of data mining methods is also available. Unlike traditional explanatory models, where the goal is to explore the relationship between an outcome variable and explanatory variables, the goal of a data mining model is to make predictions on a new data set. There is a target variable, which can be continuous or categorical, and there are predictors, called features, that measure a set of characteristics of the sample members. By applying different data mining models, a prediction model can be built from the current data and then applied to new data, where a new set of feature values is used to make predictions. Different data mining methods rely on different algorithms and therefore yield different prediction performance.
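To make the trade-off between discarding incomplete cases and imputing them concrete, here is a minimal sketch in Python with NumPy. The data are invented for illustration (two hypothetical columns, GPA and SAT score), and mean imputation is used only as the simplest possible stand-in; it is not necessarily one of the imputation strategies applied in this thesis.

```python
import numpy as np

# Hypothetical toy data: columns are high-school GPA and SAT score.
# np.nan marks a missing value; all numbers are invented for illustration.
X = np.array([
    [3.6, 1250.0],
    [np.nan, 1100.0],
    [3.1, np.nan],
    [2.8, 980.0],
    [np.nan, np.nan],
    [3.9, 1400.0],
])

# Listwise deletion: keep only the rows that contain no missing value.
complete_rows = ~np.isnan(X).any(axis=1)
X_deleted = X[complete_rows]

# Mean imputation: replace each missing entry with the mean of the
# observed values in the same column.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

print("rows kept by deletion:  ", X_deleted.shape[0], "of", X.shape[0])
print("rows kept by imputation:", X_imputed.shape[0], "of", X.shape[0])
```

In this toy example deletion discards half of the cases, while imputation keeps all of them at the cost of filling in values that were never observed. In practice a library routine such as scikit-learn's `SimpleImputer` would typically replace the hand-written `np.where` step.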
According to Luengo, imputation methods can improve the performance of different categories of data mining methods, because imputation strategies and data mining methods can interact. We would like to explore how this plays out on our data. In this chapter, we first present the imputation strategies applied in this thesis. Next, we present the data mining methods applied to our data. Finally, we introduce SMOTE, a commonly used oversampling method, to address the problem of data imbalance.

Imbalanced data generally refers to a classification problem in which the classes are not represented equally. For example, our dataset contains approximately 3,000 students, of whom 90% are labeled as passing and the remaining 10% as failing. Most machine learning methods do not work well on imbalanced data, so techniques are needed to address the imbalance. SMOTE is one of them.
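The core idea behind SMOTE can be sketched in a few lines: each synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its nearest minority-class neighbours. The function `smote_sketch`, its parameters, and the toy data below are assumptions made for illustration; this is a simplified sketch of the technique, not the implementation used in this thesis.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    minority sample and one of its k nearest minority-class neighbours.
    Hypothetical helper for illustration; needs at least two samples."""
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)            # sample to start from
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                     # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Invented data with a 90/10 class split, scaled down for the example.
rng = np.random.default_rng(42)
X_major = rng.normal(loc=0.0, scale=1.0, size=(270, 2))
X_minor = rng.normal(loc=2.0, scale=1.0, size=(30, 2))

# Generate enough synthetic minority samples to match the majority class.
X_new = smote_sketch(X_minor, n_new=len(X_major) - len(X_minor))
X_minor_balanced = np.vstack([X_minor, X_new])
print(X_major.shape[0], X_minor_balanced.shape[0])
```

Because every synthetic point lies between two real minority samples, the oversampled class stays inside the region the minority class already occupies. For real work, a maintained implementation such as the `SMOTE` class in the imbalanced-learn package is a more reasonable choice than this sketch, and oversampling should be applied to the training data only.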