| Type: | Package | 
| Title: | Collection of Machine Learning Datasets for Supervised Machine Learning | 
| Version: | 1.0.1 | 
| Maintainer: | Gary Hutson <hutsons-hacks@outlook.com> | 
| Description: | Contains a collection of datasets for working with machine learning tasks. It will contain datasets for supervised machine learning Jiang (2020)<doi:10.1016/j.beth.2020.05.002> and will include datasets for classification and regression. The aim of this package is to use data generated around health and other domains. | 
| License: | MIT + file LICENSE | 
| Encoding: | UTF-8 | 
| LazyData: | true | 
| BugReports: | https://github.com/StatsGary/MLDataR/issues | 
| Imports: | ConfusionTableR, dplyr, parsnip, rsample, recipes, workflows, ranger, caret, varhandle, OddsPlotty, ggplot2 | 
| RoxygenNote: | 7.1.2 | 
| Suggests: | rmarkdown, knitr | 
| VignetteBuilder: | knitr | 
| Depends: | R (≥ 2.10) | 
| NeedsCompilation: | no | 
| Packaged: | 2022-10-03 14:44:47 UTC; garyh | 
| Author: | Gary Hutson | 
| Repository: | CRAN | 
| Date/Publication: | 2022-10-03 15:10:02 UTC | 
PreDiabetes dataset
Description
PreDiabetes dataset
Usage
PreDiabetes
Format
A data frame with 3059 rows and 9 variables:
- Age
- age of the patient presenting with diabetes 
- Sex
- sex of the patient with diabetes 
- IMD_Decile
- Index of Multiple Deprivation Decile 
- BMI
- Body Mass Index of patient 
- Age_PreDiabetes
- age at pre diabetes diagnosis 
- HbA1C
- average blood glucose mmol/mol 
- Time_Pre_To_Diabetes
- time in years between pre-diabetes and diabetes diagnosis 
- Age_Diabetes
- age at diabetes diagnosis 
- PreDiabetes_Checks_Before_Diabetes
- number of pre-diabetes related primary care appointments before diabetes diagnosis 
Source
Generated by Asif Laldin a.laldin@nhs.net, Jan-2022
Examples
library(dplyr)
data(PreDiabetes)
# Inspect the structure of the pre-diabetes data
diabetes_data <- PreDiabetes %>%
 glimpse()
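The package description mentions regression as well as classification, so a regression baseline on this dataset might look like the sketch below. To stay self-contained it uses a small hand-made data frame, `toy_prediabetes`, whose values are hypothetical and whose column names are borrowed from the format above. With the package installed, `PreDiabetes` could be substituted. The choice of predictors is an assumption made purely for illustration.

```r
# Hypothetical toy rows that mimic some PreDiabetes columns (values are made up)
toy_prediabetes <- data.frame(
  Age_PreDiabetes = c(45, 52, 60, 38, 70, 55, 48, 63),
  BMI = c(31.2, 28.4, 35.1, 26.9, 29.8, 33.0, 27.5, 30.4),
  HbA1C = c(44, 46, 47, 42, 45, 47, 43, 46),
  Time_Pre_To_Diabetes = c(4, 2, 1, 6, 3, 1, 5, 2)
)

# Fit a simple linear regression baseline (predictor choice is illustrative only)
baseline_fit <- lm(Time_Pre_To_Diabetes ~ Age_PreDiabetes + BMI + HbA1C,
                   data = toy_prediabetes)

# Root mean squared error on the rows used for fitting
baseline_rmse <- sqrt(mean(residuals(baseline_fit)^2))
print(coef(baseline_fit))
print(baseline_rmse)
```

The error here is measured on the same rows used for fitting, so it is optimistic. A held-out split would be needed for an honest estimate.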
Care Home Incidents
Description
An NHS patient safety incidents dataset (https://www.england.nhs.uk/patient-safety/report-patient-safety-incident/) that has been synthetically generated against real data.
Usage
care_home_incidents
Format
A data frame with 1216 rows and 12 variables:
- CareHomeFail
- a binary indicator to specify whether a certain care home is failing 
- WeightLoss
- aggregation of incidents indicating weight loss in patient 
- Medication
- aggregation of missed medication incidents 
- Falls
- Recorded number of patient falls 
- Choking
- Number of patient choking incidents 
- UnexpectedDeaths
- unexpected deaths in the care home 
- Bruising
- Number of bruising incidents in the care home 
- Absconsion
- Absconding from the care home setting 
- ResidentAbuseByResident
- Abuse conducted by one care home resident against another 
- ResidentAbuseByStaff
- Incidents of resident abuse by staff 
- ResidentAbuseOnStaff
- Incidents of residents abusing staff 
- Wounds
- Unexplained wounds against staff 
Source
Collected by Gary Hutson hutsons-hacks@outlook.com, Jan-2022
Examples
library(dplyr)
data(care_home_incidents)
# Convert the CareHomeFail label to a factor
ch_incs <- care_home_incidents %>%
 mutate(CareHomeFail = as.factor(CareHomeFail))
ch_incs %>% glimpse()
# Check that the label is now a factor
is.factor(ch_incs$CareHomeFail)
csgo
Description
csgo
Usage
csgo
Format
A data frame with 1,133 rows and 17 variables:
- map
- Map on which the match was played 
- day
- Day of the month 
- month
- Month of the year 
- year
- Year 
- date
- Date of match DD/MM/YYYY 
- wait_time_s
- Time waited to find match 
- match_time_s
- Total match length in seconds 
- team_a_rounds
- Number of rounds played as Team A 
- team_b_rounds
- Number of rounds played as Team B 
- ping
- Maximum ping in milliseconds; ping is the signal that's sent from one computer to another on the same network 
- kills
- Number of kills accumulated in match; max 5 per round 
- assists
- Number of assists accumulated in match; an assist is awarded for inflicting more than 50 percent damage on an opponent who is then killed by another player; max 5 per round 
- deaths
- Number of times player died during match; max 1 per round 
- mvps
- Most Valuable Player award 
- hs_percent
- Percentage of kills that were a result from a shot to opponent's head 
- points
- Number of points accumulated during match. Points are gained from kills, assists, bomb defuses and bomb plants. Points are lost for suicide and friendly kills 
- result
- The result of the match, Win, Loss, Draw 
Source
Extracted by Asif Laldin a.laldin@nhs.net, March-2019
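This entry has no example, so the following is a minimal sketch of how the columns above might be prepared for a classification task. It builds a small hypothetical data frame, `toy_csgo`, with made-up values and a subset of the documented column names, so it runs without the package. With the package installed, `csgo` could be used instead. The derived columns `kd_ratio` and `rounds_played` are illustrative names and are not part of the dataset.

```r
# Hypothetical toy rows that mimic some csgo columns (values are made up)
toy_csgo <- data.frame(
  kills = c(20, 13, 25, 9),
  deaths = c(15, 18, 10, 0),
  team_a_rounds = c(16, 10, 16, 15),
  team_b_rounds = c(12, 16, 7, 15),
  result = c("Win", "Loss", "Win", "Draw"),
  stringsAsFactors = FALSE
)

# Kill/death ratio; treat zero deaths as one to avoid dividing by zero
toy_csgo$kd_ratio <- toy_csgo$kills / pmax(toy_csgo$deaths, 1)

# Total rounds in each match, summing the Team A and Team B columns
toy_csgo$rounds_played <- toy_csgo$team_a_rounds + toy_csgo$team_b_rounds

# Treat the match outcome as a factor label for classification
toy_csgo$result <- factor(toy_csgo$result, levels = c("Loss", "Draw", "Win"))

print(toy_csgo)
print(table(toy_csgo$result))
```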
Diabetes datasets
Description
Diabetes datasets
Usage
diabetes_data
Format
A data frame with 520 rows and 17 variables:
- Age
- age of the patient presenting with diabetes 
- Gender
- gender of the patient with diabetes 
- ExcessUrination
- if the patient has a history of excessive urination 
- Polydipsia
- abnormal thirst, accompanied by the excessive intake of water or fluid 
- WeightLossSudden
- Sudden weight loss that has recently occurred 
- Fatigue
- Fatigue or weakness 
- Polyphagia
- excessive or extreme hunger 
- GenitalThrush
- patient has thrush fungus on or near their genital region 
- BlurredVision
- history of blurred vision 
- Itching
- skin itching 
- Irritability
- general irritability and mood issues 
- DelayHealing
- delayed healing of wounds 
- PartialPsoriasis
- partial psoriasis on the body 
- MuscleStiffness
- stiffness of the muscles 
- Alopecia
- scalp alopecia and hair shedding 
- Obesity
- Classified as obese 
- DiabeticClass
- Class label to indicate whether the patient is diabetic or not 
Source
Collected by Gary Hutson hutsons-hacks@outlook.com, Dec-2021
Examples
library(dplyr)
data(diabetes_data)
# Convert the DiabeticClass label to a factor
diabetes_data <- diabetes_data %>%
 glimpse() %>%
 mutate(DiabeticClass = as.factor(DiabeticClass))
# Check that the label is now a factor
is.factor(diabetes_data$DiabeticClass)
Heart disease dataset
Description
The dataset is to be used with a supervised classification ML model to classify heart disease.
Usage
heartdisease
Format
A data frame with 918 rows and 10 variables:
- Age
- age of the patient presenting with heart disease 
- Sex
- gender of the patient 
- RestingBP
- blood pressure for resting heart beat 
- Cholesterol
- Cholesterol reading 
- FastingBS
- blood sample of glucose after a patient fasts https://www.diabetes.co.uk/diabetes_care/fasting-blood-sugar-levels.html 
- RestingECG
- Resting electrocardiogram (ECG) result; can be an indicator of previous myocardial infarction e.g. heart attack 
- MaxHR
- Maximum heart rate 
- Angina
- chest pain caused by decreased blood flow https://www.nhs.uk/conditions/angina/ 
- HeartPeakReading
- reading at the peak of the heart rate 
- HeartDisease
- the classification label of whether patient has heart disease or not 
Source
Collected by Gary Hutson hutsons-hacks@outlook.com, Dec-2021
Examples
library(dplyr)
library(ConfusionTableR)
data(heartdisease)
# Convert the HeartDisease label to a factor
hd <- heartdisease %>%
 glimpse() %>%
 mutate(HeartDisease = as.factor(HeartDisease))
# Check that the label is now a factor
is.factor(hd$HeartDisease)
# Dummy encoding
# Get categorical columns
hd_cat <- hd %>%
 dplyr::select_if(is.character)
# Dummy encode the categorical variables
# Specify the columns to encode
cols <- c("RestingECG", "Angina", "Sex")
# Dummy encode using dummy_encoder in ConfusionTableR package
coded <- ConfusionTableR::dummy_encoder(hd_cat, cols, remove_original = TRUE)
coded <- coded %>%
 select(RestingECG_ST, RestingECG_LVH, Angina = Angina_Y,
        Sex = Sex_F)
# Remove column names we have encoded from original data frame
hd_one <- hd[,!names(hd) %in% cols]
# Bind the numerical data on to the categorical data
hd_final <- bind_cols(coded, hd_one)
# Output the final encoded data frame for the ML task
glimpse(hd_final)
Long stayers dataset
Description
Classification dataset of long-staying patients. Contains patients who have been registered as an inpatient with a length of stay longer than 7 days (https://www.england.nhs.uk/south/wp-content/uploads/sites/6/2016/12/rig-reviewing-stranded-patients-hospital.pdf).
Usage
long_stayers
Format
A data frame with 768 rows and 9 variables:
- stranded.label
- binary classification label indicating whether stranded = 1 or not stranded=0 
- age
- age of the patient 
- care.home.referral
- flag indicating whether referred from a private care home - 1=Care Home Referral and 0=Not a care home referral 
- medicallysafe
- flag indicating whether they are medically safe for discharge - 1=Medically safe and 0=Not medically safe 
- hcop
- flag indicating health care for older person triage - 1=Yes triaged from HCOP and 0=Triaged from different department 
- mental_health_care
- flag indicating whether they require mental health care - 1=MH assistance needed and 0=No history of mental health 
- periods_of_previous_care
- Count of the number of times they have been in hospital in last 12 months 
- admit_date
- date the patient was admitted as an inpatient 
- frailty_index
- indicates the type of frailty - nominal variable 
Source
Prepared, acquired and adapted by Gary Hutson hutsons-hacks@outlook.com, Dec-2021. Synthetic data, based on live patient data from various NHS secondary health care trusts.
Examples
library(dplyr)
library(ggplot2)
library(caret)
library(rsample)
library(varhandle)
data("long_stayers")
glimpse(long_stayers)
# Examine class imbalance
prop.table(table(long_stayers$stranded.label))
# Feature engineering
long_stayers <- long_stayers %>%
 dplyr::mutate(stranded.label = factor(stranded.label)) %>%
 dplyr::select(everything(), -c(admit_date))
# Feature encoding
cats <- select_if(long_stayers, is.character)
# Convert the frailty index column to dummy variables with a "frail_ind" prefix
cat_dummy <- varhandle::to.dummy(cats$frailty_index, "frail_ind")
cat_dummy <- cat_dummy %>%
 as.data.frame() %>%
 dplyr::select(-frail_ind.No_index_item) # Drop one level to act as the reference category
long_stayers <- long_stayers %>%
 dplyr::select(-frailty_index) %>%
 bind_cols(cat_dummy) %>% na.omit(.)
# Split the data (seed set first so the split is reproducible)
set.seed(123)
split <- rsample::initial_split(long_stayers, prop = 3/4)
train <- rsample::training(split)
test <- rsample::testing(split)
# Fit a logistic regression classifier
set.seed(123)
glm_class_mod <- caret::train(factor(stranded.label) ~ ., data = train,
                             method = "glm")
print(glm_class_mod)
# Predict classes and class probabilities on the test set
preds <- predict(glm_class_mod, newdata = test) # Predict class
pred_prob <- predict(glm_class_mod, newdata = test, type="prob") #Predict probs
predicted <- data.frame(preds, pred_prob)
test <- test %>%
 bind_cols(predicted) %>%
 dplyr::rename(pred_class=preds)
#Evaluate with ConfusionTableR
library(ConfusionTableR)
cm <- ConfusionTableR::binary_class_cm(test$stranded.label, test$pred_class, positive="Stranded")
cm$record_level_cm
# Visualise odds ratios
library(OddsPlotty)
plotty <- OddsPlotty::odds_plot(glm_class_mod$finalModel,
                               title = "Odds Plot ",
                               subtitle = "Showing odds of patient stranded",
                               point_col = "#00f2ff",
                               error_bar_colour = "black",
                               point_size = .5,
                               error_bar_width = .8,
                               h_line_color = "red")
print(plotty)
Stroke Classification dataset
Description
This dataset has been obtained from a Stroke department within the NHS and is a traditional supervised ML classification dataset.
Usage
stroke_classification
Format
A data frame with 5110 rows and 11 variables:
- pat_id
- unique patient identifier index 
- stroke
- outcome variable as a flag - 1 for stroke and 0 for no stroke 
- gender
- patient gender description 
- age
- age of the patient 
- hypertension
- binary flag to indicate whether patient has hypertension: https://www.nhs.uk/conditions/high-blood-pressure-hypertension/ 
- heart_disease
- binary flag to indicate whether patient has heart disease: 1 or no heart disease history: 0 
- work_related_stress
- binary flag to indicate whether patient has history of work related stress 
- urban_residence
- binary flag indicating whether patient lives in an urban area or not 
- avg_glucose_level
- average blood glucose readings of the patient 
- bmi
- body mass index of the patient: https://www.nhs.uk/live-well/healthy-weight/bmi-calculator/ 
- smokes
- binary flag to indicate if the patient smokes - 1 for current smoker and 0 for smoking cessation 
Source
Prepared and compiled by Gary Hutson hutsons-hacks@outlook.com, Apr-2022.
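This entry has no example, so the block below is a minimal sketch of the first preparation steps one might take before modelling: checking class balance, converting the outcome flag to a factor, and dropping the identifier. It uses a small hypothetical data frame, `toy_stroke`, with made-up values and a subset of the documented columns. With the package installed, `stroke_classification` could be used instead. The factor labels `No_Stroke` and `Stroke` are illustrative choices, not values from the dataset.

```r
# Hypothetical toy rows that mimic some stroke_classification columns (values are made up)
toy_stroke <- data.frame(
  pat_id = 1:8,
  stroke = c(0, 0, 1, 0, 0, 0, 1, 0),
  age = c(34, 51, 78, 45, 29, 62, 81, 57),
  hypertension = c(0, 0, 1, 0, 0, 1, 1, 0)
)

# Proportion of each outcome class, to gauge class imbalance
class_balance <- prop.table(table(toy_stroke$stroke))
print(class_balance)

# Convert the outcome flag to a labelled factor for classification models
toy_stroke$stroke <- factor(toy_stroke$stroke, levels = c(0, 1),
                            labels = c("No_Stroke", "Stroke"))

# Drop the identifier so it is not used as a predictor
model_data <- toy_stroke[, names(toy_stroke) != "pat_id"]
str(model_data)
```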
Thyroid disease dataset
Description
The dataset is to be used with a supervised classification ML model to classify thyroid disease. The dataset was sourced and adapted from the UCI Machine Learning repository https://archive.ics.uci.edu/ml/index.php.
Usage
thyroid_disease
Format
A data frame with 3772 rows and 28 variables:
- ThryroidClass
- binary classification label indicating whether sick = 1 or negative=0 
- patient_age
- age of the patient 
- patient_gender
- flag indicating gender of patient - 1=Female and 0=Male 
- presc_thyroxine
- flag to indicate whether thyroxine replacement prescribed 1=Thyroxine prescribed 
- queried_why_on_thyroxine
- flag to indicate query has been actioned 
- presc_anthyroid_meds
- flag to indicate whether anti-thyroid medicine has been prescribed 
- sick
- flag to indicate sickness due to thyroxine depletion or over activity 
- pregnant
- flag to indicate whether the patient is pregnant 
- thyroid_surgery
- flag to indicate whether the patient has had thyroid surgery 
- radioactive_iodine_therapyI131
- indicates whether patient has had radioactive iodine treatment: https://www.nhs.uk/conditions/thyroid-cancer/treatment/ 
- query_hypothyroid
- flag to indicate under active thyroid query https://www.nhs.uk/conditions/underactive-thyroid-hypothyroidism/ 
- query_hyperthyroid
- flag to indicate over active thyroid query https://www.nhs.uk/conditions/overactive-thyroid-hyperthyroidism/ 
- lithium
- Lithium carbonate administered to decrease the level of thyroid hormones 
- goitre
- flag to indicate swelling of the thyroid gland https://www.nhs.uk/conditions/goitre/ 
- tumor
- flag to indicate a tumor 
- hypopituitarism
- flag to indicate a diagnosed under active pituitary gland 
- psych_condition
- indicates whether a patient has a psychological condition 
- TSH_measured
- a TSH level lower than normal indicates there is usually more than enough thyroid hormone in the body and may indicate hyperthyroidism 
- TSH_reading
- the reading result of the TSH blood test 
- T3_measured
- linked to TSH reading - when free triiodothyronine rise above normal this indicates hyperthyroidism 
- T3_reading
- the reading result of the T3 blood test looking for above normal levels of free triiodothyronine 
- T4_measured
- free thyroxine, also known as T4, is used with T3 and TSH tests to diagnose hyperthyroidism 
- T4_reading
- the reading result of the T4 test 
- thyrox_util_rate_T4U_measured
- flag indicating the thyroxine utilisation rate https://pubmed.ncbi.nlm.nih.gov/1685967/ 
- thyrox_util_rate_T4U_reading
- the result of the test 
- FTI_measured
- flag to indicate measurement on the Free Thyroxine Index (FTI) https://endocrinology.testcatalog.org/show/FRTUP 
- FTI_reading
- the result of the test mentioned above 
- ref_src
- [nominal] indicating the referral source of the patient 
Source
Prepared and adapted by Gary Hutson hutsons-hacks@outlook.com, Dec-2021 and sourced from Garavan Institute and J. Ross Quinlan.
References
Thyroid disease records supplied by the Garavan Institute and J. Ross Quinlan.
Examples
library(dplyr)
library(ConfusionTableR)
library(parsnip)
library(rsample)
library(recipes)
library(ranger)
library(workflows)
data("thyroid_disease")
td <- thyroid_disease
# Create a factor of the class label to use in ML model
td$ThryroidClass <- as.factor(td$ThryroidClass)
# Check the structure of the data to make sure factor has been created
str(td)
# Remove missing values, or choose a more advanced imputation option
td <- td[complete.cases(td),]
#Drop the column for referral source
td <- td %>%
 dplyr::select(-ref_src)
# Analyse class imbalance
class_imbalance <- prop.table(table(td$ThryroidClass))
class_imbalance
#Divide the data into a training test split
set.seed(123)
split <- rsample::initial_split(td, prop=3/4)
train_data <- rsample::training(split)
test_data <- rsample::testing(split)
# Create recipe to normalise the predictors and remove zero-variance columns
set.seed(123)
td_recipe <-
 recipe(ThryroidClass ~ ., data=train_data) %>%
  step_normalize(all_predictors()) %>%
  step_zv(all_predictors())
# Instantiate the model
set.seed(123)
rf_mod <-
  parsnip::rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")
# Create the model workflow
td_wf <-
  workflow() %>%
  workflows::add_model(rf_mod) %>%
  workflows::add_recipe(td_recipe)
# Fit the workflow to our training data
set.seed(123)
td_rf_fit <-
  td_wf %>%
  fit(data = train_data)
# Extract the fitted model
td_fitted <- td_rf_fit %>%
   extract_fit_parsnip()
# Predict on the test set to see model performance
class_pred <- predict(td_rf_fit, test_data)
td_preds <- test_data %>%
 bind_cols(class_pred)
# Convert both to factors
td_preds$.pred_class <- as.factor(td_preds$.pred_class)
td_preds$ThryroidClass <- as.factor(td_preds$ThryroidClass)
# Evaluate the data with ConfusionTableR
cm <- ConfusionTableR::binary_class_cm(td_preds$ThryroidClass,
                                       td_preds$.pred_class,
                                       positive="sick")
#View Confusion matrix
cm$confusion_matrix
#View record level
cm$record_level_cm