In the past I haven’t been a big fan of Jupyter notebooks. As someone who came into the world of Python already proficient in C++ and far more used to a standard flat file structure, using notebooks seemed strange and unnecessary. Fast forward to today and I really enjoy using notebooks for most data science challenges. More importantly, I enjoy writing notebooks designed to teach something, combining code blocks with markdown text explaining each stage in the project.

This project is a simple example of such a notebook, taking a dive into the UCI Machine Learning Repository's Credit Approval dataset. This dataset is great for learning, with categorical features, numeric features, and missing values across multiple columns. The data has 16 columns:

  • Gender
  • Age
  • Debt
  • Married
  • BankCustomer
  • EducationLevel
  • Ethnicity
  • YearsEmployed
  • PriorDefault
  • Employed
  • CreditScore
  • DriversLicense
  • Citizen
  • ZipCode
  • Income
  • ApprovalStatus

The goal of this exercise is to predict the final column, ApprovalStatus, where a + denotes an approved application and a - denotes a rejected one. This makes it a binary classification problem.
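To make the setup concrete, here is a minimal sketch of loading the data and converting the +/- target into a numeric label. The column names match the list above; the raw UCI file (crx.data) has no header row and marks missing values with '?'. The two sample rows below are toy stand-ins for the real file, purely for illustration.

```python
from io import StringIO

import pandas as pd

# Column names as used in this notebook; the raw crx.data file has no header.
cols = ["Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
        "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
        "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus"]

# Two toy rows standing in for crx.data; in practice you would pass the
# file path to pd.read_csv instead of this StringIO buffer.
raw = StringIO(
    "b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,00202,0,+\n"
    "a,?,4.46,u,g,q,h,3.04,t,t,6,f,g,00043,560,-\n"
)

# '?' entries become NaN, so missing values are easy to count and impute later
df = pd.read_csv(raw, header=None, names=cols, na_values="?")

# Map the target to a numeric binary label: '+' (approved) -> 1, '-' -> 0
df["ApprovalStatus"] = df["ApprovalStatus"].map({"+": 1, "-": 0})
```

Reading the '?' markers as NaN up front means the same pandas tools (`isna`, `fillna`, and so on) work for every column during cleaning.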

This notebook is not an exhaustive guide, and instead provides an introduction to the concepts of data inspection and cleaning, data visualisation, feature selection, and hyperparameter tuning for multiple classification models.
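As a taste of the hyperparameter tuning step, here is a minimal sketch using scikit-learn's GridSearchCV. The data here is a synthetic stand-in generated with make_classification; in the notebook itself the inputs would be the cleaned credit card features and the binary ApprovalStatus target, and the model and grid shown are just one illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the cleaned credit card features and target
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Search over the regularisation strength C with 5-fold cross-validation
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)

print(grid.best_params_)  # the C value with the best cross-validated accuracy
```

The same pattern extends to multiple classifiers: swap in a different estimator and parameter grid, and compare the resulting cross-validation scores.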

You can find the notebook on Kaggle and on my GitHub.