isaricanalytics.analytics¶
- isaricanalytics.analytics.convert_categorical_to_onehot(data: DataFrame, dictionary: DataFrame, categorical_columns: Iterable[str], sep: str = '___', missing_val: str = 'nan', drop_first: bool = False) DataFrame[source]¶
pandas.DataFrame: Returns the given dataframe with categorical variable columns converted to one-hot-encoded variable columns.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- dictionary : pandas.DataFrame
The data dictionary.
- categorical_columns : typing.Iterable
An iterable of categorical column names.
- sep : str, default="___"
Optional field-value separator, defaults to "___".
- missing_val : str, default="nan"
Optional value with which to replace missing values, defaults to "nan".
- drop_first : bool, default=False
Optional boolean indicating whether to drop the first one-hot-encoded column for each categorical variable, defaults to False.
- Returns:
- pandas.DataFrame
The original dataframe with the categorical -> one-hot-encoded variable columns.
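The underlying transformation can be sketched with `pandas.get_dummies` using the documented `"___"` field-value separator (a minimal illustration only; the library's implementation also consults the data dictionary, which is omitted here):

```python
import pandas as pd

# Minimal sketch of categorical -> one-hot conversion with the "___"
# separator; column names are illustrative, not from the library.
data = pd.DataFrame({"outcome": ["death", "discharge", "death"]})
onehot = pd.get_dummies(data, columns=["outcome"], prefix_sep="___")
print(list(onehot.columns))  # -> ['outcome___death', 'outcome___discharge']
```

Each resulting column name is the original field name, the separator, then the category value, which is what allows the reverse conversion below to recover the field.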
- isaricanalytics.analytics.convert_onehot_to_categorical(data: DataFrame, dictionary: DataFrame, categorical_columns: Iterable[str], sep: str = '___', missing_val: str = 'nan') DataFrame[source]¶
pandas.DataFrame: Returns the given dataframe with one-hot-encoded variable columns converted to categorical variable columns.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- dictionary : pandas.DataFrame
The data dictionary.
- categorical_columns : typing.Iterable
An iterable of categorical column names.
- sep : str, default="___"
Optional field-value separator, defaults to "___".
- missing_val : str, default="nan"
Optional value with which to replace missing values, defaults to "nan".
- Returns:
- pandas.DataFrame
The original dataframe with the one-hot-encoded -> categorical variable columns.
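The reverse mapping can be sketched by taking, for each row, the label of the active one-hot column and splitting on the assumed `"___"` separator (illustrative only; the library's version handles missing values and the data dictionary as well):

```python
import pandas as pd

# Sketch of one-hot -> categorical recovery via idxmax over the
# one-hot columns; column names are illustrative.
onehot = pd.DataFrame({
    "outcome___death": [1, 0, 1],
    "outcome___discharge": [0, 1, 0],
})
categorical = onehot.idxmax(axis=1).str.split("___").str[1]
print(categorical.tolist())  # -> ['death', 'discharge', 'death']
```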
- isaricanalytics.analytics.create_grouped_results(selected_features: Iterable[str], feature_importance: dict[str, float], sep: str = '___') tuple[DataFrame, Iterable[str], Iterable[str]][source]¶
tuple: Creates and returns grouped feature results. The main dataframe lists categories under their main fields, with main fields sorted by their maximum coefficient magnitude.
- Parameters:
- selected_features : typing.Iterable
An iterable of selected feature names.
- feature_importance : dict
A dict of feature names and coefficient weights.
- sep : str, default="___"
Optional field-value separator, defaults to "___".
- Returns:
- tuple
Feature results.
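The grouping described above (categories gathered under their main field, fields ordered by maximum coefficient magnitude) can be sketched as follows; the feature names and the exact output shape are assumptions, not the library's actual return value:

```python
import pandas as pd

# Sketch: split one-hot feature names on "___" to recover the main
# field, then rank fields by their largest absolute coefficient.
feature_importance = {
    "outcome___death": 0.8,
    "outcome___discharge": -0.2,
    "age": 0.5,
}
df = pd.DataFrame({
    "feature": list(feature_importance),
    "coef": list(feature_importance.values()),
})
df["field"] = df["feature"].str.split("___").str[0]
order = (df.groupby("field")["coef"]
           .apply(lambda s: s.abs().max())
           .sort_values(ascending=False))
print(order.index.tolist())  # -> ['outcome', 'age']
```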
- isaricanalytics.analytics.descriptive_comparison_table(data: DataFrame, dictionary: DataFrame, by_column: str | None = None, include_totals: bool = True, column_reorder: Iterable[str] | None = None, sep: str = '___', pvalue_significance: dict[str, float] = {'*': 0.05, '**': 0.01}) tuple[DataFrame, str][source]¶
tuple: Returns the descriptive comparison table and table key for binary (including one-hot-encoded categorical) and numerical variables in the data. The descriptive table will have separate columns for each category of the by_column variable, if provided.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- dictionary : pandas.DataFrame
The data dictionary.
- by_column : str, default=None
Optional name of the column whose categories define the comparison columns, defaults to None.
- include_totals : bool, default=True
Optional boolean indicating whether to include totals, defaults to True.
- column_reorder : typing.Iterable, default=None
Optional iterable of names of columns to reorder by, defaults to None.
- sep : str, default="___"
Optional field-value separator, defaults to "___".
- pvalue_significance : dict, default={"*": 0.05, "**": 0.01}
A dict of significance levels, defaults to {"*": 0.05, "**": 0.01}.
- Returns:
- tuple
Returns the descriptive comparison table and table key for binary and numerical variables in the data.
- isaricanalytics.analytics.descriptive_table(data: DataFrame, dictionary: DataFrame, by_column: str | None = None, include_totals: bool = True, column_reorder: Iterable[str] | None = None, include_raw_variable_name: bool = False, sep: str = '___') tuple[DataFrame, str][source]¶
tuple: Returns the descriptive table and table key for binary and numerical variables in the data. The descriptive table will have separate columns for each category of the by_column variable, if provided.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- dictionary : pandas.DataFrame
The data dictionary.
- by_column : str, default=None
Optional name of the column whose categories define the table columns, defaults to None.
- include_totals : bool, default=True
Optional boolean indicating whether to include totals, defaults to True.
- column_reorder : typing.Iterable, default=None
Optional iterable of names of columns to reorder by, defaults to None.
- include_raw_variable_name : bool, default=False
Optional boolean indicating whether to include the raw variable name, defaults to False.
- sep : str, default="___"
Optional field-value separator, defaults to "___".
- Returns:
- tuple
Returns the descriptive table and table key for binary and numerical variables in the data.
- isaricanalytics.analytics.execute_cox_model(data: DataFrame, duration_col: str, event_col: str, predictors: Iterable[str], labels: dict[str, str] | None = None) DataFrame[source]¶
pandas.DataFrame: Executes a Cox proportional hazards model without weights and returns a summary of the results.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- duration_col : str
Name of the time variable.
- event_col : str
Name of the outcome variable (binary event).
- predictors : typing.Iterable
Names of predictor variables.
- labels : dict, default=None
Optional dictionary mapping variable names to readable labels, defaults to None.
- Returns:
- pandas.DataFrame
The model results.
- isaricanalytics.analytics.execute_glm_regression(elr_dataframe_df: DataFrame, elr_outcome: str, elr_predictors: Iterable, model_type: str = 'linear', print_results: bool = True, labels: dict[str, str] | None = None, reg_type: str = 'Multi') DataFrame[source]¶
pandas.DataFrame: Executes a GLM (generalized linear model) for linear or logistic regression.
- Parameters:
- elr_dataframe_df : pandas.DataFrame
The incoming data.
- elr_outcome : str
Name of the response variable.
- elr_predictors : typing.Iterable
Iterable of predictor variable names.
- model_type : str, default="linear"
Optional regression model type: use "linear" for linear regression (Gaussian) or "logistic" for logistic regression (binomial); defaults to "linear".
- print_results : bool, default=True
Optional indicator of whether to print the results table, defaults to True.
- labels : dict, default=None
Optional map of variable names to readable labels, defaults to None.
- reg_type : str, default="Multi"
Optional regression type: "uni" for univariate, "multi" for multivariate; defaults to "Multi".
- Returns:
- pandas.DataFrame
The model results.
- isaricanalytics.analytics.execute_glmm_regression(elr_dataframe_df: DataFrame, elr_outcome: str, elr_predictors: Iterable[str], elr_groups: str, model_type: str = 'linear', print_results: bool = True, labels: dict[str, str] | None = None, reg_type: str = 'multi') DataFrame[source]¶
pandas.DataFrame: Executes a mixed-effects model for linear or logistic regression.
- Parameters:
- elr_dataframe_df : pandas.DataFrame
The incoming data.
- elr_outcome : str
Name of the response variable.
- elr_predictors : typing.Iterable
Iterable of predictor variable names.
- elr_groups : str
Name of the variable that defines the groups (random effect).
- model_type : str, default="linear"
Optional regression model type: use "linear" for linear regression or "logistic" for logistic regression; defaults to "linear".
- print_results : bool, default=True
Optional indicator of whether to print the results summary, defaults to True.
- labels : dict, default=None
Optional map of variable names to readable labels, defaults to None.
- reg_type : str, default="multi"
Optional regression type: "uni" for univariate, "multi" for multivariate; defaults to "multi".
- Returns:
- pandas.DataFrame
The model results.
- isaricanalytics.analytics.execute_kaplan_meier(data: DataFrame, duration_col: str, event_col: str, group_col: str, alpha=0.05, n_times=5) tuple[DataFrame, DataFrame, float][source]¶
tuple: Executes the Kaplan-Meier model and returns the results.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- duration_col : str
Name of the time variable.
- event_col : str
Name of the outcome variable (binary event).
- group_col : str
Name of the grouping column.
- alpha : float, default=0.05
Optional significance level, defaults to \(0.05\).
- n_times : int, default=5
Optional. No description available, defaults to \(5\).
- Returns:
- tuple
A tuple consisting of the model results, the risk table and the \(p\)-value.
- isaricanalytics.analytics.extend_dictionary(dictionary: DataFrame, new_variable_dict: dict[str, Any], data: DataFrame, sep: str = '___') DataFrame[source]¶
pandas.DataFrame: Returns the VERTEX dictionary with new custom variables added.
- Parameters:
- dictionary : pandas.DataFrame
VERTEX dictionary containing the columns "field_name", "form_name", "field_type", "field_label", "parent" and "branching_logic".
- new_variable_dict : dict
A dict with the same keys as the dictionary columns; the value for each item can be a string or a list.
- data : pandas.DataFrame
Dataframe containing the data for the project. The columns of this dataframe must include the variables in new_variable_dict["field_type"].
- sep : str
Separator for creating new one-hot-encoded variable names.
- Returns:
- pandas.DataFrame
VERTEX dictionary containing the original variables, plus the new variables and any one-hot-encoded variables derived from this.
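Extending such a dictionary amounts to appending rows with the documented column set; a minimal sketch (the variable names and values below are illustrative, not part of the library):

```python
import pandas as pd

# Sketch of appending a custom variable to a VERTEX-style dictionary.
# The column names come from the docstring; the row contents are made up.
dictionary = pd.DataFrame([{
    "field_name": "age", "form_name": "demog", "field_type": "numeric",
    "field_label": "Age", "parent": "", "branching_logic": "",
}])
new_row = {
    "field_name": "length_of_stay", "form_name": "custom",
    "field_type": "numeric", "field_label": "Length of stay",
    "parent": "", "branching_logic": "",
}
extended = pd.concat([dictionary, pd.DataFrame([new_row])], ignore_index=True)
print(extended["field_name"].tolist())  # -> ['age', 'length_of_stay']
```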
- isaricanalytics.analytics.format_descriptive_table_variables(dictionary: DataFrame, max_len: int = 100, add_key: bool = True, sep: str = '___', binary_symbol: str = '*', numeric_symbol: str = '+') str[source]¶
str: Returns a formatted string of the descriptive table variable field names.
- Parameters:
- dictionary : pandas.DataFrame
The data dictionary.
- max_len : int, default=100
Optional maximum length of field names, defaults to \(100\).
- add_key : bool, default=True
Optional. No description available, defaults to True.
- sep : str, default="___"
Optional field-value separator, defaults to "___".
- binary_symbol : str, default="*"
Optional symbol used to mark binary variables, defaults to "*".
- numeric_symbol : str, default="+"
Optional symbol used to mark numeric variables, defaults to "+".
- Returns:
- str
A formatted string of the descriptive table variable field names.
- isaricanalytics.analytics.format_pvalue(pvalue: float, dp: int = 3, min_val: float = 0.001, significance: dict[str, float] = {'*': 0.05, '**': 0.01}) str[source]¶
str: Returns a formatted \(p\)-value string.
- Parameters:
- pvalue : float
The \(p\)-value.
- dp : int, default=3
Optional number of decimal places, defaults to \(3\).
- min_val : float, default=0.001
Optional minimum reportable value, defaults to \(0.001\).
- significance : dict, default={"*": 0.05, "**": 0.01}
Dict of significance levels, defaults to {"*": 0.05, "**": 0.01}.
- Returns:
- str
The formatted \(p\)-value string.
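The formatting rule can be sketched from the documented defaults; this is an illustrative reimplementation, not the library's code, and its exact rounding and marker rules are assumptions:

```python
# Sketch of p-value formatting with significance markers, assuming
# dp decimal places, a "<min_val" floor, and the strictest passed
# significance level supplying the marker.
def format_pvalue_sketch(pvalue, dp=3, min_val=0.001, significance=None):
    significance = significance or {"*": 0.05, "**": 0.01}
    marker = ""
    # attach the marker of the strictest significance level passed
    for symbol, level in sorted(significance.items(), key=lambda kv: kv[1]):
        if pvalue < level:
            marker = symbol
            break
    if pvalue < min_val:
        return f"<{min_val}{marker}"
    return f"{round(pvalue, dp)}{marker}"

print(format_pvalue_sketch(0.0005))  # -> <0.001**
print(format_pvalue_sketch(0.03))    # -> 0.03*
```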
- isaricanalytics.analytics.format_variables(dictionary: DataFrame, max_len: int = 40, sep: str = '___') str[source]¶
str: Returns a formatted string of the descriptive table variable field names.
- Parameters:
- dictionary : pandas.DataFrame
The data dictionary.
- max_len : int, default=40
Optional maximum length of field names, defaults to \(40\).
- sep : str, default="___"
Optional field-value separator, defaults to "___".
- Returns:
- str
A formatted string of the descriptive table variable field names.
- isaricanalytics.analytics.from_timeA_to_timeB(data: DataFrame, dictionary: DataFrame, timeA_column: str, timeB_column: str, timediff_column: str, timediff_label: str, time_unit: str = 'days') tuple[DataFrame][source]¶
tuple: Returns the data and data dictionary updated with the time difference calculated between two given data columns.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- dictionary : pandas.DataFrame
The data dictionary.
- timeA_column : str
The name of the first time column.
- timeB_column : str
The name of the second time column.
- timediff_column : str
The name of the column in which to store the time difference between the two given time columns.
- timediff_label : str
A label to attach to the time difference column.
- time_unit : str, default="days"
Optional time unit to use for the time difference calculation, defaults to "days".
- Returns:
- tuple
The data and data dictionary updated with time difference information.
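The core calculation can be sketched with pandas datetime arithmetic in the default unit of days (column names below are illustrative assumptions):

```python
import pandas as pd

# Sketch of a time-difference column between two date columns, in days.
data = pd.DataFrame({
    "dates_admdate": ["2024-01-01", "2024-01-05"],
    "dates_dischdate": ["2024-01-10", "2024-01-07"],
})
t_a = pd.to_datetime(data["dates_admdate"])
t_b = pd.to_datetime(data["dates_dischdate"])
data["length_of_stay"] = (t_b - t_a).dt.days
print(data["length_of_stay"].tolist())  # -> [9, 2]
```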
- isaricanalytics.analytics.get_chi2_pvalue(x: Series, y: Series, x_cat: Iterable[Any] = [True, False], y_cat: Iterable[Any] = [True, False]) float[source]¶
float: Returns the \(p\)-value for a chi-squared test.
- Parameters:
- x : pandas.Series
The first series/factor.
- y : pandas.Series
The second series/factor.
- x_cat : typing.Iterable
An iterable of categories by which to group the first series.
- y_cat : typing.Iterable
An iterable of categories by which to group the second series.
- Returns:
- float
The \(p\)-value for the test.
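A chi-squared \(p\)-value for two boolean series can be sketched with a contingency table, matching the docstring's [True, False] category defaults (illustrative only; the library may build the table differently):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Sketch: cross-tabulate two boolean series and run a chi-squared test.
x = pd.Series([True, True, False, False, True, False])
y = pd.Series([True, False, False, True, True, False])
table = pd.crosstab(x, y)
_, pvalue, _, _ = chi2_contingency(table)
print(0.0 <= pvalue <= 1.0)  # -> True
```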
- isaricanalytics.analytics.get_counts(data: DataFrame, dictionary: DataFrame, max_n_variables: int = 10, sep: str = '___') DataFrame[source]¶
pandas.DataFrame: Returns a dataframe of variable column counts.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- dictionary : pandas.DataFrame
The data dictionary.
- max_n_variables : int, default=10
Optional maximum number of variables for which to take counts, defaults to \(10\).
- sep : str, default="___"
Optional field-value separator, defaults to "___".
- Returns:
- pandas.DataFrame
The variable column counts dataframe.
- isaricanalytics.analytics.get_descriptive_data(data: DataFrame, dictionary: DataFrame, by_column: str | None = None, include_sections: Iterable[str] = ['demog'], include_types: Iterable[str] = ['binary', 'categorical', 'numeric'], exclude_suffix: Iterable[str] = ['_units', 'addi', 'otherl2', 'item', '_oth', '_unlisted', 'otherl3'], include_subjid: bool = False, exclude_negatives: bool = True, sep: str = '___') DataFrame[source]¶
pandas.DataFrame: Returns descriptive data.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- dictionary : pandas.DataFrame
The data dictionary.
- by_column : str, default=None
Optional. No description available, defaults to None.
- include_sections : typing.Iterable, default=["demog"]
Optional iterable of names of sections to include, defaults to ["demog"].
- include_types : typing.Iterable, default=["binary", "categorical", "numeric"]
Optional iterable of variable type names, defaults to ["binary", "categorical", "numeric"].
- exclude_suffix : typing.Iterable, default=["_units", "addi", "otherl2", "item", "_oth", "_unlisted", "otherl3"]
Optional iterable of suffixes to exclude, defaults to ["_units", "addi", "otherl2", "item", "_oth", "_unlisted", "otherl3"].
- include_subjid : bool, default=False
Optional boolean indicating whether to include the subject ID, defaults to False.
- exclude_negatives : bool, default=True
Optional boolean indicating whether to drop negatives, defaults to True.
- sep : str, default="___"
Optional field-value separator, defaults to "___".
- Returns:
- pandas.DataFrame
Returns the descriptive data.
- isaricanalytics.analytics.get_fisher_exact_pvalue(x: Series, y: Series, x_cat: Iterable[Any] = [True, False], y_cat: Iterable[Any] = [True, False])[source]¶
float: Returns the \(p\)-value for a Fisher exact test.
- Parameters:
- x : pandas.Series
The first series/factor.
- y : pandas.Series
The second series/factor.
- x_cat : typing.Iterable
An iterable of categories by which to group the first series.
- y_cat : typing.Iterable
An iterable of categories by which to group the second series.
- Returns:
- float
The \(p\)-value for the test.
- isaricanalytics.analytics.get_mean_and_stdev(series: Series, add_spaces: bool = False, dp: int = 1, mfw: int = 4, min_n: int = 3) str[source]¶
str: Returns the mean and standard deviation of a series of values as a formatted string.
- Parameters:
- series : pandas.Series
The input series for which to calculate the mean and standard deviation.
- add_spaces : bool, default=False
Optional boolean to add spacing in the string, defaults to False.
- dp : int, default=1
No description available.
- mfw : int, default=4
No description available.
- min_n : int, default=3
No description available.
- Returns:
- str
The mean and st. dev. as a formatted string.
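A common "mean (stdev)" rendering can be sketched as below, assuming `dp` controls decimal places and `min_n` is the minimum count below which nothing is reported; both assumptions are inferred from the parameter names, not confirmed by the source:

```python
import pandas as pd

# Illustrative sketch only: format a series as "mean (stdev)".
def mean_stdev_sketch(series, dp=1, min_n=3):
    if series.count() < min_n:
        return ""  # too few observations to report
    return f"{series.mean():.{dp}f} ({series.std():.{dp}f})"

s = pd.Series([1.0, 2.0, 3.0, 4.0])
print(mean_stdev_sketch(s))  # -> 2.5 (1.3)
```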
- isaricanalytics.analytics.get_median_interquartile_range(series: Series, add_spaces: bool = False, dp: int = 1, mfw: int = 4, min_n: int = 3) str[source]¶
str: Returns the median and interquartile range (IQR) of a given series of values as a string.
- Parameters:
- series : pandas.Series
The input series for which to calculate the median and IQR.
- add_spaces : bool, default=False
Optional boolean to add spacing in the string, defaults to False.
- dp : int, default=1
No description available.
- mfw : int, default=4
No description available.
- min_n : int, default=3
No description available.
- Returns:
- str
The IQR as a string.
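A typical "median (Q1-Q3)" rendering can be sketched with pandas quantiles; the exact layout is an assumption inferred from the docstring, not the library's verified output:

```python
import pandas as pd

# Illustrative sketch only: format a series as "median (Q1-Q3)".
def median_iqr_sketch(series, dp=1, min_n=3):
    if series.count() < min_n:
        return ""  # too few observations to report
    q1, q2, q3 = series.quantile([0.25, 0.5, 0.75])
    return f"{q2:.{dp}f} ({q1:.{dp}f}-{q3:.{dp}f})"

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
print(median_iqr_sketch(s))  # -> 3.0 (2.0-4.0)
```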
- isaricanalytics.analytics.get_modelling_data(data: DataFrame, dictionary: DataFrame, outcome_columns: str | Iterable[str], include_sections: Iterable[str] = ['demog', 'comor', 'adsym', 'vacci', 'vital', 'sympt', 'labs'], required_variables: Iterable[str] | None = None, include_types: Iterable[str] = ['binary', 'categorical', 'numeric'], exclude_suffix: Iterable[str] = ['_units', 'addi', 'otherl2', 'item', '_oth', '_unlisted', 'otherl3'], include_subjid: bool = False, exclude_negatives: bool = True, fillna: bool = True, drop_first: bool = False, sep: str = '___') DataFrame[source]¶
pandas.DataFrame: Returns modelling data.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- dictionary : pandas.DataFrame
The data dictionary.
- outcome_columns : str or typing.Iterable
The outcome columns.
- include_sections : typing.Iterable, default=["demog", "comor", "adsym", "vacci", "vital", "sympt", "labs"]
Optional iterable of names of sections to include, defaults to ["demog", "comor", "adsym", "vacci", "vital", "sympt", "labs"].
- required_variables : typing.Iterable, default=None
Optional iterable of required variable column names, defaults to None.
- include_types : typing.Iterable, default=["binary", "categorical", "numeric"]
Optional iterable of variable type names, defaults to ["binary", "categorical", "numeric"].
- exclude_suffix : typing.Iterable, default=["_units", "addi", "otherl2", "item", "_oth", "_unlisted", "otherl3"]
Optional iterable of suffixes to exclude, defaults to ["_units", "addi", "otherl2", "item", "_oth", "_unlisted", "otherl3"].
- include_subjid : bool, default=False
Optional boolean indicating whether to include the subject ID, defaults to False.
- exclude_negatives : bool, default=True
Optional boolean indicating whether to drop negatives, defaults to True.
- fillna : bool, default=True
Optional boolean indicating whether to fill nulls, defaults to True.
- drop_first : bool, default=False
Optional boolean indicating whether to drop the first one-hot-encoded column for each categorical variable, defaults to False.
- sep : str, default="___"
Optional field-value separator, defaults to "___".
- Returns:
- pandas.DataFrame
Returns the modelling data.
- isaricanalytics.analytics.get_n_percent_value(series: Series, add_spaces: bool = False, dp: int = 1, mfw: int = 4, min_n: int = 1) str[source]¶
str: Returns the n-percent value of a series as a string.
- Parameters:
- series : pandas.Series
The input series.
- add_spaces : bool, default=False
Optional boolean to add spacing around the string, defaults to False.
- dp : int, default=1
No description available.
- mfw : int, default=4
No description available.
- min_n : int, default=1
No description available.
- Returns:
- str
The n-percent value of the series as a string.
- isaricanalytics.analytics.get_parameter_ranking(logistic: Any, n_top: int = 10, threshold: float = 0.001) DataFrame[source]¶
pandas.DataFrame: Returns a dataframe of rankings of parameter combinations using stored scores and coefficient paths.
- Parameters:
- logistic : typing.Any
The logistic model.
- n_top : int, default=10
Optional number of top-ranking features to select, defaults to \(10\).
- threshold : float, default=0.001
Optional ranking threshold, defaults to \(0.001\).
- Returns:
- pandas.DataFrame
The dataframe of parameter rankings.
- isaricanalytics.analytics.get_proportions(data: DataFrame, dictionary: DataFrame, max_n_variables: int = 10, ignore_branching_logic: bool = False, branching_logic: str = '', sep: str = '___') DataFrame[source]¶
pandas.DataFrame: Returns a dataframe of proportional counts for variable columns.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- dictionary : pandas.DataFrame
The data dictionary.
- max_n_variables : int, default=10
Optional maximum number of variables for which to take counts, defaults to \(10\).
- ignore_branching_logic : bool, default=False
Optional. No description available, defaults to False.
- branching_logic : str, default=""
Optional. No description available, defaults to "".
- sep : str, default="___"
Optional field-value separator, defaults to "___".
- Returns:
- pandas.DataFrame
A dataframe of proportional counts for variable columns.
- isaricanalytics.analytics.get_pyramid_data(data: DataFrame, column_dict: dict[str, str], left_side: str = 'Female', right_side: str = 'Male') DataFrame[source]¶
pandas.DataFrame: Returns dual-stack pyramid data.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- column_dict : dict
A dict of pyramid keys and data column names.
- left_side : str, default="Female"
Optional label for the left side of the pyramid, defaults to "Female".
- right_side : str, default="Male"
Optional label for the right side of the pyramid, defaults to "Male".
- Returns:
- pandas.DataFrame
Dual stack pyramid data.
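Pyramid data is conventionally built by counting per group and negating one side so the bars extend in opposite directions; a sketch under that assumption (column names and output layout are illustrative, not the library's):

```python
import pandas as pd

# Sketch: counts per age group, left-side category negated for plotting.
data = pd.DataFrame({
    "age_group": ["0-9", "0-9", "10-19", "10-19", "10-19"],
    "sex": ["Female", "Male", "Female", "Female", "Male"],
})
counts = data.groupby(["age_group", "sex"]).size().unstack(fill_value=0)
counts["Female"] *= -1  # left side plots as negative values
print(counts)
```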
- isaricanalytics.analytics.get_upset_counts_intersections(data: DataFrame, dictionary: DataFrame, variables: list[str] | None = None, n_variables: int = 5, sep: str = '___') tuple[DataFrame][source]¶
tuple: Returns a dataframe of upset counts and intersections.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- dictionary : pandas.DataFrame
The data dictionary.
- variables : list, default=None
Optional list of names of variable columns for which to take counts, defaults to None.
- n_variables : int, default=5
Optional limit on the number of variable columns, defaults to \(5\).
- sep : str, default="___"
Optional field-value separator, defaults to "___".
- Returns:
- tuple
The upset counts and intersections.
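Upset-style intersection counting can be sketched with a pandas groupby over binary columns (illustrative only; the variable names are assumptions and the library's output shape may differ):

```python
import pandas as pd

# Sketch: count every observed combination of two binary variables,
# as an upset plot's intersection bars would.
data = pd.DataFrame({
    "comor_diabetes": [1, 1, 0, 1],
    "comor_asthma":   [0, 1, 0, 1],
})
intersections = data.groupby(["comor_diabetes", "comor_asthma"]).size()
print(intersections.to_dict())
```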
- isaricanalytics.analytics.get_variables_by_section_and_type(data: DataFrame, dictionary: DataFrame, required_variables: Iterable[str] | None = None, include_sections: Iterable[str] = ['demog'], include_types: Iterable[str] = ['binary', 'categorical', 'numeric'], exclude_suffixes: Iterable[str] = ['_units', 'addi', 'otherl2', 'item', '_oth', '_unlisted', 'otherl3'], include_subjid: bool = False) list[str][source]¶
list: Returns a list of all variables in the dataframe from the specified sections and types, plus any required variables.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- dictionary : pandas.DataFrame
The data dictionary.
- required_variables : typing.Iterable, default=None
Optional iterable of required variable names, defaults to None.
- include_sections : typing.Iterable, default=["demog"]
Optional iterable of names of sections to include, defaults to ["demog"].
- include_types : typing.Iterable, default=["binary", "categorical", "numeric"]
Optional iterable of variable type names, defaults to ["binary", "categorical", "numeric"].
- exclude_suffixes : typing.Iterable, default=["_units", "addi", "otherl2", "item", "_oth", "_unlisted", "otherl3"]
Optional iterable of suffixes to exclude, defaults to ["_units", "addi", "otherl2", "item", "_oth", "_unlisted", "otherl3"].
- include_subjid : bool, default=False
Optional boolean indicating whether to include the subject ID, defaults to False.
- Returns:
- list
A list of all variables in the dataframe from specified sections and types, plus any required variables.
- isaricanalytics.analytics.impute_miss_val(data: DataFrame, dictionary: DataFrame, outcome_column: str = 'outco_binary_outcome', missing_threshold: float = 0.7, verbose: bool = False) DataFrame[source]¶
pandas.DataFrame: Returns the data with missing values imputed. Imputes missing values or drops columns based on the missing-value proportion and the column median.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- dictionary : pandas.DataFrame
The data dictionary.
- outcome_column : str, default="outco_binary_outcome"
Optional outcome column, defaults to "outco_binary_outcome".
- missing_threshold : float, default=0.7
A proportional imputation threshold for missing values, defaults to \(0.7\).
- verbose : bool, default=False
Optional indicator of whether to print an imputation summary, defaults to False.
- Returns:
- pandas.DataFrame
Data with missing values imputed or columns dropped.
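The documented rule (drop columns whose missing proportion exceeds the threshold, otherwise impute with the median) can be sketched as follows; this is an inference from the docstring, and the library may apply additional logic around the outcome column:

```python
import pandas as pd

# Sketch: drop mostly-missing columns, median-impute the rest.
def impute_sketch(data, missing_threshold=0.7):
    out = data.copy()
    for col in list(out.columns):
        if out[col].isna().mean() > missing_threshold:
            out = out.drop(columns=col)
        else:
            out[col] = out[col].fillna(out[col].median())
    return out

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [None, None, None]})
print(impute_sketch(df).columns.tolist())  # -> ['a']
```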
- isaricanalytics.analytics.lasso_var_sel_binary(data: DataFrame, outcome_column: str = 'mapped_outcome', metric: str = 'balanced_accuracy', threshold: float = 0.001, gridsearch_params: dict[str, Iterable[float]] = {'Cs': [0.001, 0.00316, 0.01, 0.0316, 0.1, 0.316, 1, 3.16, 10], 'l1_ratios': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}, random_state: int = 42, verbose: bool = False, sep: str = '___') tuple[Any][source]¶
tuple: Prepares data and selects features using binary logistic regression with an elastic net penalty. Specifically designed for binary outcomes only.
- Parameters:
- data : pandas.DataFrame
The incoming data.
- outcome_column : str, default="mapped_outcome"
Optional outcome column, defaults to "mapped_outcome".
- metric : str, default="balanced_accuracy"
Optional metric, defaults to "balanced_accuracy".
- threshold : float, default=0.001
Optional threshold, defaults to \(0.001\).
- gridsearch_params : dict, default={"Cs": [1e-3, 3.16e-3, 1e-2, 3.16e-2, 1e-1, 3.16e-1, 1, 3.16, 10], "l1_ratios": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}
Optional grid search parameters, defaults to {"Cs": [1e-3, 3.16e-3, 1e-2, 3.16e-2, 1e-1, 3.16e-1, 1, 3.16, 10], "l1_ratios": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}.
- random_state : int, default=42
Optional random state, defaults to \(42\).
- verbose : bool, default=False
Optional indicator of whether to print the analysis, defaults to False.
- sep : str, default="___"
Optional field-value separator, defaults to "___".
- Returns:
- tuple
Results tuple.
- isaricanalytics.analytics.mannwhitneyu(x, y, use_continuity=True, alternative='two-sided', axis=0, method='auto', *, nan_policy='propagate', keepdims=False)[source]¶
Perform the Mann-Whitney U rank test on two independent samples.
The Mann-Whitney U test is a nonparametric test of the null hypothesis that the distribution underlying sample x is the same as the distribution underlying sample y. It is often used as a test of difference in location between distributions.
- Parameters:
- x, y : array-like
N-d arrays of samples. The arrays must be broadcastable except along the dimension given by axis.
- use_continuity : bool, optional
Whether a continuity correction (1/2) should be applied. Default is True when method is 'asymptotic'; has no effect otherwise.
- alternative : {'two-sided', 'less', 'greater'}, optional
Defines the alternative hypothesis. Default is 'two-sided'. Let SX(u) and SY(u) be the survival functions of the distributions underlying x and y, respectively. Then the following alternative hypotheses are available:
'two-sided': the distributions are not equal, i.e. SX(u) ≠ SY(u) for at least one u.
'less': the distribution underlying x is stochastically less than the distribution underlying y, i.e. SX(u) < SY(u) for all u.
'greater': the distribution underlying x is stochastically greater than the distribution underlying y, i.e. SX(u) > SY(u) for all u.
Under a more restrictive set of assumptions, the alternative hypotheses can be expressed in terms of the locations of the distributions; see [5] section 5.1.
- axis : int or None, default: 0
If an int, the axis of the input along which to compute the statistic. The statistic of each axis-slice (e.g. row) of the input will appear in a corresponding element of the output. If None, the input will be raveled before computing the statistic.
- method : {'auto', 'asymptotic', 'exact'} or PermutationMethod instance, optional
Selects the method used to calculate the p-value. Default is 'auto'. The following options are available:
'asymptotic': compares the standardized test statistic against the normal distribution, correcting for ties.
'exact': computes the exact p-value by comparing the observed \(U\) statistic against the exact distribution of the \(U\) statistic under the null hypothesis. No correction is made for ties.
'auto': chooses 'exact' when the size of one of the samples is less than or equal to 8 and there are no ties; chooses 'asymptotic' otherwise.
PermutationMethod instance: the p-value is computed using permutation_test with the provided configuration options and other appropriate settings.
- nan_policy : {'propagate', 'omit', 'raise'}
Defines how to handle input NaNs.
'propagate': if a NaN is present in the axis slice (e.g. row) along which the statistic is computed, the corresponding entry of the output will be NaN.
'omit': NaNs will be omitted when performing the calculation. If insufficient data remains in the axis slice along which the statistic is computed, the corresponding entry of the output will be NaN.
'raise': if a NaN is present, a ValueError will be raised.
- keepdims : bool, default: False
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.
- Returns:
- res : MannwhitneyuResult
An object containing attributes:
- statistic : float
The Mann-Whitney U statistic corresponding with sample x. See Notes for the test statistic corresponding with sample y.
- pvalue : float
The associated p-value for the chosen alternative.
Notes
If U1 is the statistic corresponding with sample x, then the statistic corresponding with sample y is U2 = x.shape[axis] * y.shape[axis] - U1.
mannwhitneyu is for independent samples. For related / paired samples, consider scipy.stats.wilcoxon.
method='exact' is recommended when there are no ties and when either sample size is less than 8 [1]. The implementation follows the algorithm reported in [3]. Note that the exact method is not corrected for ties, but mannwhitneyu will not raise errors or warnings if there are ties in the data. If there are ties and either sample is small (fewer than ~10 observations), consider passing an instance of PermutationMethod as the method to perform a permutation test.
The Mann-Whitney U test is a non-parametric version of the t-test for independent samples. When the means of samples from the populations are normally distributed, consider scipy.stats.ttest_ind.
Beginning in SciPy 1.9, np.matrix inputs (not recommended for new code) are converted to np.ndarray before the calculation is performed. In this case, the output will be a scalar or np.ndarray of appropriate shape rather than a 2D np.matrix. Similarly, while masked elements of masked arrays are ignored, the output will be a scalar or np.ndarray rather than a masked array with mask=False.
Array API Standard Support
mannwhitneyu has experimental support for Python Array API Standard compatible backends in addition to NumPy. Please consider testing these features by setting the environment variable SCIPY_ARRAY_API=1 and providing CuPy, PyTorch, JAX, or Dask arrays as array arguments. The following combinations of backend and device (or other capability) are supported:

Library   | CPU        | GPU
NumPy     | ✅         | n/a
CuPy      | n/a        | ⛔
PyTorch   | ✅         | ⛔
JAX       | ⚠️ no JIT  | ⛔
Dask      | ⛔         | n/a
See Support for the array API standard for more information.
References
[1] H.B. Mann and D.R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other", The Annals of Mathematical Statistics, Vol. 18, pp. 50-60, 1947.
[2] Mann-Whitney U Test, Wikipedia, http://en.wikipedia.org/wiki/Mann-Whitney_U_test
[3] Andreas Löffler, "Über eine Partition der nat. Zahlen und ihr Anwendung beim U-Test", Wiss. Z. Univ. Halle, XXXII'83 pp. 87-89.
[4] Rosie Shier, "Statistics: 2.3 The Mann-Whitney U Test", Mathematics Learning Support Centre, 2004.
[5] Michael P. Fay and Michael A. Proschan, "Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules", Statistics Surveys, Vol. 4, pp. 1-39, 2010. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2857732/
Examples
We follow the example from [4]: nine randomly sampled young adults were diagnosed with type II diabetes at the ages below.
>>> males = [19, 22, 16, 29, 24]
>>> females = [20, 11, 17, 12]
We use the Mann-Whitney U test to assess whether there is a statistically significant difference in the diagnosis age of males and females. The null hypothesis is that the distribution of male diagnosis ages is the same as the distribution of female diagnosis ages. We decide that a confidence level of 95% is required to reject the null hypothesis in favor of the alternative that the distributions are different. Since the number of samples is very small and there are no ties in the data, we can compare the observed test statistic against the exact distribution of the test statistic under the null hypothesis.
>>> from scipy.stats import mannwhitneyu
>>> U1, p = mannwhitneyu(males, females, method="exact")
>>> print(U1)
17.0
mannwhitneyu always reports the statistic associated with the first sample, which, in this case, is males. This agrees with \(U_M = 17\) reported in [4]. The statistic associated with the second sample can be calculated:
>>> nx, ny = len(males), len(females)
>>> U2 = nx*ny - U1
>>> print(U2)
3.0
This agrees with \(U_F = 3\) reported in [4]. The two-sided p-value can be calculated from either statistic, and the value produced by mannwhitneyu agrees with \(p = 0.11\) reported in [4].
>>> print(p)
0.1111111111111111
The exact distribution of the test statistic is asymptotically normal, so the example continues by comparing the exact p-value against the p-value produced using the normal approximation.
>>> _, pnorm = mannwhitneyu(males, females, method="asymptotic")
>>> print(pnorm)
0.11134688653314041
Here mannwhitneyu’s reported p-value appears to conflict with the value \(p = 0.09\) given in [4]. The reason is that [4] does not apply the continuity correction performed by mannwhitneyu; mannwhitneyu reduces the distance between the test statistic and the mean \(\mu = n_x n_y / 2\) by 0.5 to correct for the fact that the discrete statistic is being compared against a continuous distribution. Here, the \(U\) statistic used is less than the mean, so we reduce the distance by adding 0.5 in the numerator.
>>> import numpy as np
>>> from scipy.stats import norm
>>> U = min(U1, U2)
>>> N = nx + ny
>>> z = (U - nx*ny/2 + 0.5) / np.sqrt(nx*ny * (N + 1)/ 12)
>>> p = 2 * norm.cdf(z)  # use CDF to get p-value from smaller statistic
>>> print(p)
0.11134688653314041
If desired, we can disable the continuity correction to get a result that agrees with that reported in [4].
>>> _, pnorm = mannwhitneyu(males, females, use_continuity=False,
...                         method="asymptotic")
>>> print(pnorm)
0.0864107329737
Regardless of whether we perform an exact or asymptotic test, the probability of the test statistic being as extreme or more extreme by chance exceeds 5%, so we do not consider the results statistically significant.
Suppose that, before seeing the data, we had hypothesized that females would tend to be diagnosed at a younger age than males. In that case, it would be natural to provide the female ages as the first input, and we would have performed a one-sided test using alternative='less': females are diagnosed at an age that is stochastically less than that of males.

>>> res = mannwhitneyu(females, males, alternative="less", method="exact")
>>> print(res)
MannwhitneyuResult(statistic=3.0, pvalue=0.05555555555555555)
Again, the probability of getting a sufficiently low value of the test statistic by chance under the null hypothesis is greater than 5%, so we do not reject the null hypothesis in favor of our alternative.
If it is reasonable to assume that the means of samples from the populations are normally distributed, we could have used a t-test to perform the analysis.
>>> from scipy.stats import ttest_ind
>>> res = ttest_ind(females, males, alternative="less")
>>> print(res)
TtestResult(statistic=-2.239334696520584, pvalue=0.030068441095757924, df=7.0)
Under this assumption, the p-value would be low enough to reject the null hypothesis in favor of the alternative.
- isaricanalytics.analytics.mean_std_str(series: Series, add_spaces: bool = False, dp: int = 1, mfw: int = 4, min_n: int = 3) str[source]¶
str: Returns the mean and standard deviation of a series of values as a formatted string.

Warning

DEPRECATED. Please use get_mean_and_stdev() instead.

- Parameters:
- seriespandas.Series
The input series for which to calculate the mean and st. dev.
- add_spacesbool, default=False
Add spacing in the string.
- dpint, default=1
No description available.
- mfwint, default=4
No description available.
- min_nint, default=3
No description available.
- Returns:
- str
The mean and st. dev. as a formatted string.
- isaricanalytics.analytics.median_iqr_str(series: Series, add_spaces: bool = False, dp: int = 1, mfw: int = 4, min_n: int = 3) str[source]¶
str: Returns the median and interquartile range (IQR) of a given series of values as a formatted string.

Warning

DEPRECATED. Please use get_median_interquartile_range() instead.

- Parameters:
- seriespandas.Series
The input series for which to calculate the median and IQR.
- add_spacesbool, default=False
Add spacing in the string.
- dpint, default=1
No description available.
- mfwint, default=4
No description available.
- min_nint, default=3
No description available.
- Returns:
- str
The median and IQR as a formatted string.
- isaricanalytics.analytics.n_percent_str(series: Series, add_spaces: bool = False, dp: int = 1, mfw: int = 4, min_n: int = 1) str[source]¶
str: Returns the n-percent value of a series as a string.

Warning

DEPRECATED. Please use get_n_percent_value() instead.

- Parameters:
- seriespandas.Series
The input series.
- add_spacesbool, default=False
Add spacing around the string.
- dpint, default=1
No description available.
- mfwint, default=1
No description available.
- min_nint, default=1
No description available.
- Returns:
- str
The n-percent value of the series as a string.
- isaricanalytics.analytics.regression_summary_table(table: DataFrame, dictionary: DataFrame, highlight_predictors: dict[str, Iterable[str]] | None = None, pvalue_significance: float | None = None, result_type: str = 'OddsRatio', sep: str = '___') DataFrame[source]¶
pandas.DataFrame: Returns a regression summary table.

- Parameters:
- tablepandas.DataFrame
The incoming table.
- dictionarypandas.DataFrame
The data dictionary.
- highlight_predictorsdict, default=None
Optional. No description available, defaults to None.
- pvalue_significancefloat, default=None
Optional \(p\)-value significance level, defaults to None.
- result_typestr, default="OddsRatio"
Optional. No description available, defaults to "OddsRatio".
- sepstr, default="___"
Optional field-value separator, defaults to "___".
- Returns:
- pandas.DataFrame
Regression summary table.
- isaricanalytics.analytics.remove_single_binary_outcome_predictors(data: DataFrame, dictionary: DataFrame, predictors: Iterable[str], outcome: str) Iterable[str][source]¶
typing.Iterable: Returns a list of retained columns in the data after removing single binary outcome predictor columns.

Removes binary predictors that are associated with only one outcome, e.g. if all patients with some_variable=1 have outcome=1.

- Parameters:
- data: pandas.DataFrame
The incoming data.
- dictionary: pandas.DataFrame
The data dictionary.
- predictors: typing.Iterable
Iterable of predictor variable column names.
- outcomestr
The outcome string.
- Returns:
- typing.Iterable:
List of predictor variable column names excluding any that can’t be used in the logistic regression model.
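The rule described above can be sketched in plain pandas. The helper name keep_informative_binary_predictors and the toy data below are hypothetical illustrations, not the library's implementation:

```python
import pandas as pd

def keep_informative_binary_predictors(data, binary_predictors, outcome):
    # Hypothetical sketch of the rule above: drop any binary predictor for
    # which some level co-occurs with only one outcome level (e.g. every row
    # with some_variable=1 has outcome=1), since such perfect separation
    # makes logistic-regression coefficients diverge.
    retained = []
    for col in binary_predictors:
        ct = pd.crosstab(data[col], data[outcome])
        # number of distinct outcome levels observed for each predictor level
        if int((ct > 0).sum(axis=1).min()) > 1:
            retained.append(col)
    return retained

df = pd.DataFrame({
    "a": [1, 1, 1, 0],        # a=0 only ever occurs with outcome=0 -> drop
    "c": [1, 1, 0, 0],        # both levels of c see both outcomes -> keep
    "outcome": [1, 0, 1, 0],
})
print(keep_informative_binary_predictors(df, ["a", "c"], "outcome"))
# prints ['c']
```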
- isaricanalytics.analytics.rmv_high_corr(data: DataFrame, dictionary: DataFrame, outcome_column: str = 'outco_binary_outcome', correlation_threshold: float = 0.5, verbose: bool = False) DataFrame[source]¶
pandas.DataFrame: Removes variables in the data with high multicollinearity.

One variable from each highly correlated pair is arbitrarily selected for removal when the correlation between the two variables exceeds the threshold.
- Parameters:
- datapandas.DataFrame
The incoming data.
- dictionarypandas.DataFrame
The data dictionary.
- outcome_columnstr, default="outco_binary_outcome"
Optional outcome column, defaults to "outco_binary_outcome".
- correlation_thresholdfloat, default=0.5
Optional correlation threshold, defaults to \(0.5\).
- verbosebool, default=False
Optional indicator of whether to print correlation summary.
- Returns:
- pandas.DataFrame
The data with high correlation variables removed.
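A minimal pandas sketch of the greedy removal described above; the helper drop_high_corr and the toy data are hypothetical, and the real function also consults the data dictionary:

```python
import pandas as pd

def drop_high_corr(data: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    # Hypothetical sketch: scan the upper triangle of the absolute
    # correlation matrix and arbitrarily drop the second column of any
    # pair whose correlation exceeds the threshold.
    corr = data.corr().abs()
    cols = list(data.columns)
    dropped = set()
    for i, a in enumerate(cols):
        if a in dropped:
            continue
        for b in cols[i + 1:]:
            if b not in dropped and corr.loc[a, b] > threshold:
                dropped.add(b)
    return data.drop(columns=sorted(dropped))

df = pd.DataFrame({
    "x": range(1, 11),
    "y": [2 * v for v in range(1, 11)],        # perfectly correlated with x
    "z": [1, 10, 1, 10, 1, 10, 1, 10, 1, 10],  # weakly correlated with x
})
print(list(drop_high_corr(df).columns))  # y removed, x and z retained
```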
- isaricanalytics.analytics.rmv_low_var(data: DataFrame, dictionary: DataFrame, mad_threshold: float = 0.1, freq_threshold: float = 0.05, outcome_column: str = 'outco_binary_outcome', verbose: bool = False) DataFrame[source]¶
pandas.DataFrame: Removes numerical variables from the data with Median Absolute Deviation (MAD) below a given threshold.

Excludes binary columns from the MAD calculation. Removes binary columns with very low frequencies.
- Parameters:
- datapandas.DataFrame
The incoming data.
- dictionarypandas.DataFrame
The data dictionary.
- mad_thresholdfloat, default=0.1
Optional MAD threshold, defaults to \(0.1\).
- freq_thresholdfloat, default=0.05
Optional frequency threshold, defaults to \(0.05\).
- outcome_columnstr, default="outco_binary_outcome"
Optional outcome column, defaults to "outco_binary_outcome".
- verbosebool, default=False
Optional indicator of whether to print MAD analysis summary.
- Returns:
- pandas.DataFrame
The data with low MAD columns removed.
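The MAD-based part of this filter can be sketched as follows. The helpers mad and drop_low_mad and the toy frame are hypothetical; the real function additionally treats binary columns by frequency rather than by MAD:

```python
import pandas as pd

def mad(series: pd.Series) -> float:
    # Median Absolute Deviation: median distance from the median.
    return float((series - series.median()).abs().median())

def drop_low_mad(data: pd.DataFrame, mad_threshold: float = 0.1) -> pd.DataFrame:
    # Hypothetical sketch of the numeric filter described above:
    # keep only columns whose MAD meets the threshold.
    keep = [c for c in data.columns if mad(data[c]) >= mad_threshold]
    return data[keep]

df = pd.DataFrame({
    "nearly_constant": [5.0, 5.0, 5.0, 5.01, 5.0],  # MAD = 0.0 -> dropped
    "varying": [1.0, 4.0, 2.0, 8.0, 3.0],           # MAD = 1.0 -> kept
})
print(list(drop_low_mad(df).columns))  # prints ['varying']
```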
- isaricanalytics.analytics.trim_field_label(x: str, max_len: int = 40) str[source]¶
str: Trims a field label to a maximum length, which defaults to 40 characters.

- Parameters:
- xstr
The input field label.
- max_lenint, default=40
An optional maximum length to use for trimming, defaults to \(40\).
- Returns:
- str
The trimmed field label.
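A possible re-implementation of this trimming, assuming truncation with a trailing ellipsis; the helper trim_label is hypothetical and may differ from the library's behaviour:

```python
def trim_label(x: str, max_len: int = 40) -> str:
    # Hypothetical sketch: labels longer than max_len are truncated,
    # with a three-character ellipsis marking the cut.
    return x if len(x) <= max_len else x[: max_len - 3] + "..."

print(trim_label("age at hospital admission"))  # short labels pass through
print(trim_label("x" * 60))                     # trimmed to 40 characters
```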
- isaricanalytics.analytics.variance_influence_factor_backwards_elimination(data: DataFrame, dictionary: DataFrame, predictors_list: Iterable[str], sep: str = '___') tuple[Iterable[str], DataFrame][source]¶
tuple: Returns an iterable of retained columns and the variance inflation factor (VIF) backwards elimination data.

- Parameters:
- datapandas.DataFrame
Incoming data.
- dictionarypandas.DataFrame
Data dictionary.
- predictors_listtyping.Iterable
Iterable of predictor variable column names.
- sepstr, default=”___”
Optional field-value separator, defaults to
"___".
- Returns:
- tuple
An iterable of retained columns and the VIF backwards elimination data.
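The procedure can be sketched with NumPy least squares. The functions vif and vif_backwards_elimination below are hypothetical re-implementations of the standard technique, not the library's code:

```python
import numpy as np
import pandas as pd

def vif(X: pd.DataFrame) -> pd.Series:
    # Variance inflation factor per column: VIF_j = 1 / (1 - R^2_j), where
    # R^2_j comes from regressing column j on all the other columns.
    out = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)), others])  # design with intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(out)

def vif_backwards_elimination(X: pd.DataFrame, threshold: float = 5.0):
    # Hypothetical sketch: repeatedly drop the predictor with the largest
    # VIF until all remaining VIFs fall at or below the threshold.
    X = X.copy()
    while X.shape[1] > 2:
        scores = vif(X)
        worst = scores.idxmax()
        if scores[worst] <= threshold:
            break
        X = X.drop(columns=worst)
    return list(X.columns), X

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
df = pd.DataFrame({"x1": x1, "x2": x2,
                   "x3": x1 + x2 + 0.01 * rng.normal(size=50)})
retained, reduced = vif_backwards_elimination(df)
print(retained)  # one of the three collinear predictors has been eliminated
```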