Title: | Predictive Power Score |
---|---|
Description: | The Predictive Power Score (PPS) is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two variables. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). PPS can be useful for data exploration purposes, in the same way correlation analysis is. For more information on PPS, see <https://github.com/paulvanderlaken/ppsr>. |
Authors: | Paul van der Laken [aut, cre, cph] |
Maintainer: | Paul van der Laken <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.0.5 |
Built: | 2024-11-14 03:28:54 UTC |
Source: | https://github.com/paulvanderlaken/ppsr |
Lists all algorithms currently supported
available_algorithms()
a list of all available parsnip engines
available_algorithms()
Lists all evaluation metrics currently supported
available_evaluation_metrics()
a list of all available evaluation metrics and their implementation in functional form
available_evaluation_metrics()
Normalizes the original score compared to a naive baseline score. The calculation that is performed depends on the type of model.
normalize_score(baseline_score, model_score, type)
baseline_score |
float, the evaluation metric score for a naive baseline (model) |
model_score |
float, the evaluation metric score for a statistical model |
type |
character, type of model |
numeric vector of length one, normalized score
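To illustrate why the calculation depends on the model type, the sketch below normalizes an error metric (lower is better, such as MAE) differently from a score metric (higher is better, such as a weighted F1). The helper name `normalize_sketch` and the exact formulas are assumptions made for illustration; they are not taken from the package source.

```r
# Hypothetical helper, not part of ppsr: one plausible way to normalize a
# model score against a naive baseline so that the result lands in [0, 1].
normalize_sketch <- function(baseline_score, model_score, type) {
  if (type == "regression") {
    # Assumes an error metric such as MAE, where lower is better.
    if (model_score >= baseline_score) return(0)
    return(1 - model_score / baseline_score)
  }
  if (type == "classification") {
    # Assumes a score metric such as F1 in [0, 1], where higher is better.
    if (model_score <= baseline_score) return(0)
    return((model_score - baseline_score) / (1 - baseline_score))
  }
  stop("type must be 'regression' or 'classification'")
}

# A model that cuts the baseline error from 2.0 to 0.5
normalize_sketch(baseline_score = 2.0, model_score = 0.5, type = "regression")

# A model that lifts the baseline score from 0.4 to 0.85
normalize_sketch(baseline_score = 0.4, model_score = 0.85, type = "classification")
```

In both branches a model that does no better than the baseline is mapped to 0, which matches the documented lower bound of the PPS.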
The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to the correlation (matrix).
Calculate predictive power score for x on y
score(
  df,
  x,
  y,
  algorithm = "tree",
  metrics = list(regression = "MAE", classification = "F1_weighted"),
  cv_folds = 5,
  seed = 1,
  verbose = TRUE
)
df |
data.frame containing columns for x and y |
x |
string, column name of predictor variable |
y |
string, column name of target variable |
algorithm |
string, see |
metrics |
named list of |
cv_folds |
float, number of cross-validation folds |
seed |
float, seed to ensure reproducibility/stability |
verbose |
boolean, whether to print notifications |
a named list, potentially containing
the name of the predictor variable
the name of the target variable
text showing how to interpret the resulting score
the predictive power score
the evaluation metric used to compute the PPS
the score of a naive model on the evaluation metric
the score of the predictive model on the evaluation metric
how many cross-validation folds were used
the seed that was set
text showing what algorithm was used
text showing whether classification or regression was used
score(iris, x = 'Petal.Length', y = 'Species')
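Because the score is asymmetric, swapping x and y generally gives a different result. The following sketch scores the same pair of `iris` columns in both directions, using only the arguments listed above; the variable names `forward` and `backward` are illustrative.

```r
library(ppsr)

# Score the same pair of columns in both directions.
forward  <- score(iris, x = "Petal.Length", y = "Species", verbose = FALSE)
backward <- score(iris, x = "Species", y = "Petal.Length", verbose = FALSE)

# Inspect the returned named lists; the two results generally differ,
# since one direction is a classification task and the other a regression.
str(forward)
str(backward)
```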
Calculate correlation coefficients for whole dataframe
score_correlations(df, ...)
df |
data.frame containing columns for x and y |
... |
arguments to pass to |
a data.frame with x-y correlation coefficients
score_correlations(iris)
Calculate predictive power scores for whole dataframe
Iterates through the columns of the dataframe, calculating the predictive power score for every possible combination of x and y.
score_df(df, ..., do_parallel = FALSE, n_cores = -1)
df |
data.frame containing columns for x and y |
... |
any arguments passed to |
do_parallel |
bool, whether to perform |
n_cores |
numeric, number of cores to use, defaults to maximum minus 1 |
a data.frame containing
the name of the predictor variable
the name of the target variable
text showing how to interpret the resulting score
the predictive power score
the evaluation metric used to compute the PPS
the score of a naive model on the evaluation metric
the score of the predictive model on the evaluation metric
how many cross-validation folds were used
the seed that was set
text showing what algorithm was used
text showing whether classification or regression was used
score_df(iris)
score_df(mtcars, do_parallel = TRUE, n_cores = 2)
Calculates a matrix of predictive power scores for every possible combination of x and y. Note that the targets are on the rows, and the features on the columns.
score_matrix(df, ...)
df |
data.frame containing columns for x and y |
... |
any arguments passed to |
a matrix of numeric values, representing predictive power scores
score_matrix(iris)
score_matrix(mtcars, do_parallel = TRUE, n_cores = 2)
Calculates out-of-sample model performance of a statistical model
score_model(train, test, model, x, y, metric)
train |
df, training data, containing variable y |
test |
df, test data, containing variable y |
model |
parsnip model object, with mode preset |
x |
character, column name of predictor variable |
y |
character, column name of target variable |
metric |
character, name of evaluation metric being used, see |
numeric vector of length one, evaluation score for predictions using the statistical model
Calculate out-of-sample model performance of a naive baseline model. The calculation that is performed depends on the type of model. For regression models, the mean is used as the prediction. For classification, a model predicting random values and a model predicting modal values are both evaluated, and the better of the two is taken as the baseline score.
score_naive(train, test, x, y, type, metric)
train |
df, training data, containing variable y |
test |
df, test data, containing variable y |
x |
character, column name of predictor variable |
y |
character, column name of target variable |
type |
character, type of model |
metric |
character, evaluation metric being used |
numeric vector of length one, evaluation score for predictions using naive model
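The baseline logic described above can be sketched in base R. This is a simplified illustration and not the package's implementation: the helper names (`mae`, `accuracy`, `naive_sketch`) are hypothetical, plain accuracy stands in for the weighted F1 metric, and the "random" model is assumed to sample from the observed training labels.

```r
# Hypothetical helpers, not part of ppsr.
mae <- function(truth, estimate) mean(abs(truth - estimate))
accuracy <- function(truth, estimate) {
  mean(as.character(truth) == as.character(estimate))
}

naive_sketch <- function(train, test, y, type, seed = 1) {
  if (type == "regression") {
    # Regression: predict the training mean for every test row.
    pred <- rep(mean(train[[y]]), nrow(test))
    return(mae(test[[y]], pred))
  }
  # Classification, baseline 1: always predict the most frequent class.
  counts <- table(train[[y]])
  modal <- names(counts)[which.max(counts)]
  score_modal <- accuracy(test[[y]], rep(modal, nrow(test)))
  # Classification, baseline 2: predict labels sampled at random
  # from the training data (an assumption about what "random" means).
  set.seed(seed)
  random <- sample(as.character(train[[y]]), nrow(test), replace = TRUE)
  score_random <- accuracy(test[[y]], random)
  # Keep the better of the two baselines.
  max(score_modal, score_random)
}

# Simple train/test split on iris for demonstration.
set.seed(42)
idx <- sample(nrow(iris), 100)
naive_sketch(iris[idx, ], iris[-idx, ], y = "Sepal.Length", type = "regression")
naive_sketch(iris[idx, ], iris[-idx, ], y = "Species", type = "classification")
```

Taking the better of the two classification baselines makes the baseline harder to beat, so a predictor only earns a positive score when it adds information beyond the class frequencies.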
Calculate predictive power scores for y
Calculates the predictive power scores for the specified y variable using every column in the dataset as x, including itself.
score_predictors(df, y, ..., do_parallel = FALSE, n_cores = -1)
df |
data.frame containing columns for x and y |
y |
string, column name of target variable |
... |
any arguments passed to |
do_parallel |
bool, whether to perform |
n_cores |
numeric, number of cores to use, defaults to maximum minus 1 |
a data.frame containing
the name of the predictor variable
the name of the target variable
text showing how to interpret the resulting score
the predictive power score
the evaluation metric used to compute the PPS
the score of a naive model on the evaluation metric
the score of the predictive model on the evaluation metric
how many cross-validation folds were used
the seed that was set
text showing what algorithm was used
text showing whether classification or regression was used
score_predictors(df = iris, y = 'Species')
score_predictors(df = mtcars, y = 'mpg', do_parallel = TRUE, n_cores = 2)
Visualize the PPS & correlation matrices
visualize_both(
  df,
  color_value_positive = "#08306B",
  color_value_negative = "#8b0000",
  color_text = "#FFFFFF",
  include_missings = TRUE,
  nrow = 1,
  ...
)
df |
data.frame containing columns for x and y |
color_value_positive |
color used for upper limit of gradient (high positive correlation) |
color_value_negative |
color used for lower limit of gradient (high negative correlation) |
color_text |
string, hex value or color name used for text, best to pick high contrast with |
include_missings |
bool, whether to include the variables without correlation values in the plot |
nrow |
numeric, number of rows, either 1 or 2 |
... |
any arguments passed to |
a grob object, a grid with two ggplot2 heatmap visualizations
visualize_both(iris)
visualize_both(mtcars, do_parallel = TRUE, n_cores = 2)
Visualize the correlation matrix
visualize_correlations(
  df,
  color_value_positive = "#08306B",
  color_value_negative = "#8b0000",
  color_text = "#FFFFFF",
  include_missings = FALSE,
  ...
)
df |
data.frame containing columns for x and y |
color_value_positive |
color used for upper limit of gradient (high positive correlation) |
color_value_negative |
color used for lower limit of gradient (high negative correlation) |
color_text |
color used for text, best to pick high contrast with |
include_missings |
bool, whether to include the variables without correlation values in the plot |
... |
arguments to pass to |
a ggplot object, a heatmap visualization
visualize_correlations(iris)
If y is specified, visualize_pps returns a barplot of the PPS of every predictor on the specified target variable.
If y is not specified, visualize_pps returns a heatmap visualization of the PPS for all X-Y combinations in a dataframe.
visualize_pps(
  df,
  y = NULL,
  color_value_high = "#08306B",
  color_value_low = "#FFFFFF",
  color_text = "#FFFFFF",
  include_target = TRUE,
  ...
)
df |
data.frame containing columns for x and y |
y |
string, column name of target variable,
can be left |
color_value_high |
string, hex value or color name used for upper limit of PPS gradient (high PPS) |
color_value_low |
string, hex value or color name used for lower limit of PPS gradient (low PPS) |
color_text |
string, hex value or color name used for text, best to pick high contrast with |
include_target |
boolean, whether to include the target variable in the barplot |
... |
any arguments passed to |
a ggplot object, a vertical barplot or heatmap visualization
visualize_pps(iris, y = 'Species')
visualize_pps(iris)
visualize_pps(mtcars, do_parallel = TRUE, n_cores = 2)