isaricanalytics.analytics

isaricanalytics.analytics.convert_categorical_to_onehot(data: DataFrame, dictionary: DataFrame, categorical_columns: Iterable[str], sep: str = '___', missing_val: str = 'nan', drop_first: bool = False) DataFrame[source]

pandas.DataFrame : Returns the given dataframe with categorical variable columns converted to onehot-encoded variable columns.

Parameters:
data : pandas.DataFrame

The incoming data.

dictionary : pandas.DataFrame

The data dictionary.

categorical_columns : typing.Iterable

An iterable of categorical column names.

sep : str, default="___"

Optional field-value separator, defaults to "___".

missing_val : str, default="nan"

Optional value with which to replace missing values, defaults to "nan".

drop_first : bool, default=False

Optional boolean indicating whether to drop the first category level of each one-hot-encoded variable, defaults to False.

Returns:
pandas.DataFrame

The original dataframe with the categorical -> one-hot-encoded variable columns.
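The underlying transformation can be sketched with plain pandas (the "colour" column here is hypothetical, and the library function additionally consults the data dictionary; this shows only the naming convention):

```python
import pandas as pd

# Hypothetical input: one categorical column "colour" with a missing value.
data = pd.DataFrame({"colour": ["red", "blue", None]})

# One-hot encode using the same "___" field-value separator the library
# uses, after replacing missing values with the "nan" placeholder.
onehot = pd.get_dummies(
    data["colour"].fillna("nan"), prefix="colour", prefix_sep="___"
)
print(sorted(onehot.columns))
```

Each resulting column name is `field___value`, which is what allows the inverse conversion below to recover the original categorical column.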

isaricanalytics.analytics.convert_onehot_to_categorical(data: DataFrame, dictionary: DataFrame, categorical_columns: Iterable[str], sep: str = '___', missing_val: str = 'nan') DataFrame[source]

pandas.DataFrame : Returns the given dataframe with onehot-encoded variable columns converted to categorical variable columns.

Parameters:
data : pandas.DataFrame

The incoming data.

dictionary : pandas.DataFrame

The data dictionary.

categorical_columns : typing.Iterable

An iterable of categorical column names.

sep : str, default="___"

Optional field-value separator, defaults to "___".

missing_val : str, default="nan"

Optional value with which to replace missing values, defaults to "nan".

Returns:
pandas.DataFrame

The original dataframe with the one-hot-encoded -> categorical variable columns.
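The inverse mapping can be sketched in plain pandas (the column names are hypothetical; the library call also uses the data dictionary):

```python
import pandas as pd

# Hypothetical one-hot columns produced with the "___" separator.
onehot = pd.DataFrame({
    "colour___red":  [1, 0, 0],
    "colour___blue": [0, 1, 0],
    "colour___nan":  [0, 0, 1],
})

# Recover the category per row: take the column holding the 1 and
# strip the "field___" prefix.
sep = "___"
colour = onehot.idxmax(axis=1).str.split(sep).str[-1]
print(list(colour))
```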

isaricanalytics.analytics.create_grouped_results(selected_features: Iterable[str], feature_importance: dict[str, float], sep: str = '___') tuple[DataFrame, Iterable[str], Iterable[str]][source]

tuple : Creates and returns grouped feature results.

The main DataFrame lists categories under their main fields, with main fields sorted by their maximum coefficient magnitude.

Parameters:
selected_features : typing.Iterable

An iterable of selected feature names.

feature_importance : dict

A dict of feature names and coefficient weights.

sep : str, default="___"

Optional field-value separator, defaults to "___".

Returns:
tuple

Feature results.
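The field-grouping and ordering step described above can be sketched with pandas (the feature names and coefficients are hypothetical, and this shows only the grouping logic, not the full returned tuple):

```python
import pandas as pd

# Hypothetical selected features and their coefficient weights.
feature_importance = {
    "colour___red": 0.2,
    "colour___blue": -0.9,
    "age": 0.5,
}
sep = "___"

df = pd.DataFrame({
    "feature": list(feature_importance),
    "coef": list(feature_importance.values()),
})
# The main field is the part of the name before the separator.
df["field"] = df["feature"].str.split(sep).str[0]
# Sort main fields by their maximum coefficient magnitude, descending.
order = (
    df.assign(mag=df["coef"].abs())
    .groupby("field")["mag"].max()
    .sort_values(ascending=False)
    .index
)
print(list(order))
```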

isaricanalytics.analytics.descriptive_comparison_table(data: DataFrame, dictionary: DataFrame, by_column: str | None = None, include_totals: bool = True, column_reorder: Iterable[str] | None = None, sep: str = '___', pvalue_significance: dict[str, float] = {'*': 0.05, '**': 0.01}) tuple[DataFrame, str][source]

tuple : Returns the descriptive comparison table and table key for binary (including onehot-encoded categorical) and numerical variables in data.

The descriptive table will have separate columns for each category that exists for the by_column variable, if this is provided.

Parameters:
data : pandas.DataFrame

Incoming data.

dictionary : pandas.DataFrame

Data dictionary.

by_column : str, default=None

Optional name of a column whose categories define the table's comparison columns, defaults to None.

include_totals : bool, default=True

Optional boolean indicating whether to include a totals column, defaults to True.

column_reorder : typing.Iterable, default=None

Optional iterable of names of columns to reorder by, defaults to None.

sep : str, default="___"

Optional field-value separator, defaults to "___".

pvalue_significance : dict, default={"*": 0.05, "**": 0.01}

A dict of significance levels, defaults to {"*": 0.05, "**": 0.01}.

Returns:
tuple

Returns the descriptive comparison table and table key for binary and numerical variables in the data.

isaricanalytics.analytics.descriptive_table(data: DataFrame, dictionary: DataFrame, by_column: str | None = None, include_totals: bool = True, column_reorder: Iterable[str] | None = None, include_raw_variable_name: bool = False, sep: str = '___') tuple[DataFrame, str][source]

tuple : Returns the descriptive table and table key for binary and numerical variables in the data.

The descriptive table will have separate columns for each category that exists for the by_column variable, if this is provided.

Parameters:
data : pandas.DataFrame

Incoming data.

dictionary : pandas.DataFrame

Data dictionary.

by_column : str, default=None

Optional name of a column whose categories define the table's comparison columns, defaults to None.

include_totals : bool, default=True

Optional boolean indicating whether to include a totals column, defaults to True.

column_reorder : typing.Iterable, default=None

Optional iterable of names of columns to reorder by, defaults to None.

include_raw_variable_name : bool, default=False

Optional boolean indicating whether to include the raw variable name, defaults to False.

sep : str, default="___"

Optional field-value separator, defaults to "___".

Returns:
tuple

Returns the descriptive table and table key for binary and numerical variables in the data.

isaricanalytics.analytics.execute_cox_model(data: DataFrame, duration_col: str, event_col: str, predictors: Iterable[str], labels: dict[str, str] | None = None) DataFrame[source]

pandas.DataFrame : Executes a Cox Proportional Hazards model without weights and returns a summary of the results.

Parameters:
data : pandas.DataFrame

The incoming data.

duration_col : str

Name of the time variable.

event_col : str

Name of the outcome variable (binary event).

predictors : typing.Iterable

Names of predictor variables.

labels : dict, default=None

Optional dictionary mapping variable names to readable labels, defaults to None.

Returns:
pandas.DataFrame

The model results.

isaricanalytics.analytics.execute_glm_regression(elr_dataframe_df: DataFrame, elr_outcome: str, elr_predictors: Iterable, model_type: str = 'linear', print_results: bool = True, labels: dict[str, str] | None = None, reg_type: str = 'Multi') DataFrame[source]

pandas.DataFrame : Executes a GLM (Generalized Linear Model) for linear or logistic regression.

Parameters:
elr_dataframe_df : pandas.DataFrame

The incoming data.

elr_outcome : str

Name of the response variable.

elr_predictors : typing.Iterable

Iterable of predictor variable names.

model_type : str, default="linear"

Optional regression model type - use "linear" for linear regression (Gaussian) or "logistic" for logistic regression (Binomial); defaults to "linear".

print_results : bool, default=True

Optional indicator of whether to print the results table, defaults to True.

labels : dict, default=None

Optional map of variable names to readable labels, defaults to None.

reg_type : str, default="Multi"

Optional regression type - "uni" for univariate, "multi" for multivariate. Defaults to "Multi".

Returns:
pandas.DataFrame

The model results.
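For model_type="linear", a GLM with a Gaussian family and identity link is equivalent to ordinary least squares; that special case can be sketched with NumPy alone (synthetic data, not the library's implementation, which also handles the logistic case and formats a results table):

```python
import numpy as np

# Synthetic data: y depends linearly on x with a little noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=100)

# Design matrix with an intercept column, as a GLM would build it.
X = np.column_stack([np.ones_like(x), x])
# For Gaussian family + identity link, the GLM fit is the
# least-squares solution.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [1.0, 2.0]
```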

isaricanalytics.analytics.execute_glmm_regression(elr_dataframe_df: DataFrame, elr_outcome: str, elr_predictors: Iterable[str], elr_groups: str, model_type: str = 'linear', print_results: bool = True, labels: dict[str, str] | None = None, reg_type: str = 'multi') DataFrame[source]

pandas.DataFrame : Executes a mixed effects model for linear or logistic regression.

Parameters:
elr_dataframe_df : pandas.DataFrame

The incoming data.

elr_outcome : str

Name of the response variable.

elr_predictors : typing.Iterable

Iterable of predictor variable names.

elr_groups : str

Name of the variable that defines the groups (random effect).

model_type : str, default="linear"

Optional regression model type - use "linear" for linear regression or "logistic" for logistic regression; defaults to "linear".

print_results : bool, default=True

Optional indicator of whether to print the results summary, defaults to True.

labels : dict, default=None

Optional map of variable names to readable labels, defaults to None.

reg_type : str, default="multi"

Optional regression type - "uni" for univariate, "multi" for multivariate. Defaults to "multi".

Returns:
pandas.DataFrame

The model results.

isaricanalytics.analytics.execute_kaplan_meier(data: DataFrame, duration_col: str, event_col: str, group_col: str, alpha=0.05, n_times=5) tuple[DataFrame, DataFrame, float][source]

tuple : Executes the Kaplan-Meier model and returns the results.

Parameters:
data : pandas.DataFrame

The incoming data.

duration_col : str

Name of the time variable.

event_col : str

Name of the outcome variable (binary event).

group_col : str

Name of the grouping column.

alpha : float, default=0.05

Optional alpha, defaults to \(0.05\).

n_times : int, default=5

Optional. No description available, defaults to \(5\).

Returns:
tuple

A tuple consisting of the model results, risk table and the \(p\)-value.
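The Kaplan-Meier estimate itself can be sketched with pandas (synthetic data; the library additionally produces a risk table and a \(p\)-value for the group comparison):

```python
import pandas as pd

# Synthetic survival data: follow-up time and event indicator
# (1 = event observed, 0 = censored).
df = pd.DataFrame({
    "time":  [1, 2, 2, 3, 5],
    "event": [1, 1, 0, 1, 0],
})

# Kaplan-Meier: at each event time t, S(t) *= 1 - d_t / n_t, where
# d_t is the number of events at t and n_t the number still at risk.
surv = 1.0
curve = {}
for t in sorted(df.loc[df["event"] == 1, "time"].unique()):
    at_risk = (df["time"] >= t).sum()
    deaths = ((df["time"] == t) & (df["event"] == 1)).sum()
    surv *= 1 - deaths / at_risk
    curve[t] = surv
print(curve)
```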

isaricanalytics.analytics.extend_dictionary(dictionary: DataFrame, new_variable_dict: dict[str, Any], data: DataFrame, sep: str = '___') DataFrame[source]

pandas.DataFrame : Returns the VERTEX dictionary with new custom variables added.

Parameters:
dictionary : pandas.DataFrame

VERTEX dictionary containing columns "field_name", "form_name", "field_type", "field_label", "parent", "branching_logic".

new_variable_dict : dict

A dict with the same keys as the dictionary columns; the values for each item can be a string or a list.

data : pandas.DataFrame

Pandas dataframe containing the data for the project. The columns of this dataframe must include the variables in new_variable_dict["field_type"].

sep : str, default="___"

Optional separator for creating new one-hot-encoded variable names, defaults to "___".

Returns:
pandas.DataFrame

VERTEX dictionary containing the original variables, plus the new variables and any one-hot-encoded variables derived from this.

isaricanalytics.analytics.format_descriptive_table_variables(dictionary: DataFrame, max_len: int = 100, add_key: bool = True, sep: str = '___', binary_symbol: str = '*', numeric_symbol: str = '+') str[source]

str : Returns a formatted string of the descriptive table variable field names.

Parameters:
dictionary : pandas.DataFrame

The data dictionary.

max_len : int, default=100

Optional maximum length of field names, defaults to \(100\).

add_key : bool, default=True

Optional. No description available, defaults to True.

sep : str, default="___"

Optional field-value separator, defaults to "___".

binary_symbol : str, default="*"

Optional. No description available, defaults to "*".

numeric_symbol : str, default="+"

Optional. No description available, defaults to "+".

Returns:
str

A formatted string of the descriptive table variable field names.

isaricanalytics.analytics.format_pvalue(pvalue: float, dp: int = 3, min_val: float = 0.001, significance: dict[str, float] = {'*': 0.05, '**': 0.01}) str[source]

str : Returns a formatted \(p\)-value string.

Parameters:
pvalue : float

The \(p\)-value.

dp : int, default=3

Optional. No description available, defaults to \(3\).

min_val : float, default=0.001

Optional. No description available, defaults to \(0.001\).

significance : dict, default={"*": 0.05, "**": 0.01}

Dict of significance levels, defaults to {"*": 0.05, "**": 0.01}.

Returns:
str

The formatted \(p\)-value string.
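A hypothetical formatter illustrating the convention the parameters suggest (round to dp decimal places, floor the display at min_val, and append the significance marker); the library's exact output format may differ:

```python
def fmt_p(p, dp=3, min_val=0.001, significance=None):
    """Format a p-value with significance stars (illustrative sketch)."""
    if significance is None:
        significance = {"*": 0.05, "**": 0.01}
    # Pick the most specific marker whose threshold the p-value is under.
    stars = ""
    for marker, level in sorted(significance.items(), key=lambda kv: -kv[1]):
        if p < level:
            stars = marker
    # Very small p-values are displayed as "<min_val".
    if p < min_val:
        return f"<{min_val}{stars}"
    return f"{round(p, dp)}{stars}"

print(fmt_p(0.0004), fmt_p(0.03), fmt_p(0.2))
```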

isaricanalytics.analytics.format_variables(dictionary: DataFrame, max_len: int = 40, sep: str = '___') str[source]

str : Returns a formatted string of the descriptive table variable field names.

Parameters:
dictionary : pandas.DataFrame

The data dictionary.

max_len : int, default=40

Optional maximum length of field names, defaults to \(40\).

sep : str, default="___"

Optional field-value separator, defaults to "___".

Returns:
str

A formatted string of the descriptive table variable field names.

isaricanalytics.analytics.from_timeA_to_timeB(data: DataFrame, dictionary: DataFrame, timeA_column: str, timeB_column: str, timediff_column: str, timediff_label: str, time_unit: str = 'days') tuple[DataFrame][source]

tuple : Returns the data and data dictionary updated with time difference calculation between two given data columns.

Parameters:
data : pandas.DataFrame

The incoming data.

dictionary : pandas.DataFrame

The data dictionary.

timeA_column : str

The name of the first time column.

timeB_column : str

The name of the second time column.

timediff_column : str

The name of the column for storing the time difference of the two given time columns.

timediff_label : str

A label to attach to the time difference column.

time_unit : str, default="days"

An optional indicator of which time unit to use for the time difference calculation, defaults to "days".

Returns:
tuple

The data and data dictionary updated with time difference information.
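The core calculation can be sketched with pandas (the column names here are hypothetical; the library call also updates the data dictionary with the new column):

```python
import pandas as pd

# Hypothetical admission and discharge date columns.
data = pd.DataFrame({
    "dates_admdate": ["2023-01-01", "2023-01-04"],
    "dates_disdate": ["2023-01-10", "2023-01-05"],
})

a = pd.to_datetime(data["dates_admdate"])
b = pd.to_datetime(data["dates_disdate"])
# Time difference in days, matching the time_unit="days" default.
data["los_days"] = (b - a).dt.days
print(list(data["los_days"]))
```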

isaricanalytics.analytics.get_chi2_pvalue(x: Series, y: Series, x_cat: Iterable[Any] = [True, False], y_cat: Iterable[Any] = [True, False]) float[source]

float : Returns the \(p\)-value for a Chi-squared test.

Parameters:
x : pandas.Series

The first series/factor.

y : pandas.Series

The second series/factor.

x_cat : typing.Iterable

An iterable of categories by which to group the first series.

y_cat : typing.Iterable

An iterable of categories by which to group the second series.

Returns:
float

The \(p\)-value for the test.
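The test can be sketched with pandas and SciPy (hypothetical binary series; the library wraps this pattern with its default [True, False] categories):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Two hypothetical binary series.
x = pd.Series([True, True, False, False, True, False])
y = pd.Series([True, False, False, True, True, False])

# Build the 2x2 contingency table over the given categories, then
# run the chi-squared test on it.
table = pd.crosstab(x, y).reindex(
    index=[True, False], columns=[True, False], fill_value=0
)
_, pvalue, _, _ = chi2_contingency(table)
print(pvalue)
```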

isaricanalytics.analytics.get_counts(data: DataFrame, dictionary: DataFrame, max_n_variables: int = 10, sep: str = '___') DataFrame[source]

pandas.DataFrame : Returns a dataframe of variable column counts.

Parameters:
data : pandas.DataFrame

The incoming data.

dictionary : pandas.DataFrame

The data dictionary.

max_n_variables : int, default=10

Optional number of variables for which to take counts, defaults to \(10\).

sep : str, default="___"

Optional field-value separator, defaults to "___".

Returns:
pandas.DataFrame

The variable column counts dataframe.

isaricanalytics.analytics.get_descriptive_data(data: DataFrame, dictionary: DataFrame, by_column: str | None = None, include_sections: Iterable[str] = ['demog'], include_types: Iterable[str] = ['binary', 'categorical', 'numeric'], exclude_suffix: Iterable[str] = ['_units', 'addi', 'otherl2', 'item', '_oth', '_unlisted', 'otherl3'], include_subjid: bool = False, exclude_negatives: bool = True, sep: str = '___') DataFrame[source]

pandas.DataFrame : Returns descriptive data.

Parameters:
data : pandas.DataFrame

Incoming data.

dictionary : pandas.DataFrame

Data dictionary.

by_column : str, default=None

Optional name of a column by which to group the data, defaults to None.

include_sections : typing.Iterable, default=["demog"]

Optional list of names of sections to include, defaults to ["demog"].

include_types : typing.Iterable, default=["binary", "categorical", "numeric"]

Optional iterable of variable type names, defaults to ["binary", "categorical", "numeric"].

exclude_suffix : typing.Iterable, default=["_units", "addi", "otherl2", "item", "_oth", "_unlisted", "otherl3"]

Optional iterable of suffixes to exclude, defaults to ["_units", "addi", "otherl2", "item", "_oth", "_unlisted", "otherl3"].

include_subjid : bool, default=False

Optional boolean to indicate whether to include subject ID, defaults to False.

exclude_negatives : bool, default=True

Optional boolean to indicate whether to drop negatives, defaults to True.

sep : str, default="___"

Optional field-value separator, defaults to "___".

Returns:
pandas.DataFrame

Returns the descriptive data.

isaricanalytics.analytics.get_fisher_exact_pvalue(x: Series, y: Series, x_cat: Iterable[Any] = [True, False], y_cat: Iterable[Any] = [True, False])[source]

float : Returns the \(p\)-value for a Fisher exact test.

Parameters:
x : pandas.Series

The first series/factor.

y : pandas.Series

The second series/factor.

x_cat : typing.Iterable

An iterable of categories by which to group the first series.

y_cat : typing.Iterable

An iterable of categories by which to group the second series.

Returns:
float

The \(p\)-value for the test.

isaricanalytics.analytics.get_mean_and_stdev(series: Series, add_spaces: bool = False, dp: int = 1, mfw: int = 4, min_n: int = 3) str[source]

str : Returns the mean and standard deviation of a series of values as a formatted string.

Parameters:
series : pandas.Series

The input series for which to calculate the mean and st. dev.

add_spaces : bool, default=False

Add spacing in the string.

dp : int, default=1

No description available.

mfw : int, default=4

No description available.

min_n : int, default=3

No description available.

Returns:
str

The mean and st. dev. as a formatted string.
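One plausible reading of such a formatter, sketched with pandas (the exact spacing, field-width handling, and bracket style of the library's output are assumptions):

```python
import pandas as pd

series = pd.Series([1.0, 2.0, 3.0, 4.0])

# Format "mean (st. dev.)" to dp decimal places, but only when at
# least min_n non-missing values are present.
dp, min_n = 1, 3
if series.count() >= min_n:
    text = f"{series.mean():.{dp}f} ({series.std():.{dp}f})"
else:
    text = ""
print(text)
```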

isaricanalytics.analytics.get_median_interquartile_range(series: Series, add_spaces: bool = False, dp: int = 1, mfw: int = 4, min_n: int = 3) str[source]

str : Returns the median and interquartile range (IQR) of a given series of values as a string.

Parameters:
series : pandas.Series

The input series for which to calculate the IQR.

add_spaces : bool, default=False

Add spacing in the string.

dp : int, default=1

No description available.

mfw : int, default=4

No description available.

min_n : int, default=3

No description available.

Returns:
str

The IQR as a string.

isaricanalytics.analytics.get_modelling_data(data: DataFrame, dictionary: DataFrame, outcome_columns: str | Iterable[str], include_sections: Iterable[str] = ['demog', 'comor', 'adsym', 'vacci', 'vital', 'sympt', 'labs'], required_variables: Iterable[str] | None = None, include_types: Iterable[str] = ['binary', 'categorical', 'numeric'], exclude_suffix: Iterable[str] = ['_units', 'addi', 'otherl2', 'item', '_oth', '_unlisted', 'otherl3'], include_subjid: bool = False, exclude_negatives: bool = True, fillna: bool = True, drop_first: bool = False, sep: str = '___') DataFrame[source]

pandas.DataFrame : Returns modelling data.

Parameters:
data : pandas.DataFrame

Incoming data.

dictionary : pandas.DataFrame

Data dictionary.

outcome_columns : str or typing.Iterable

Outcome columns.

include_sections : typing.Iterable, default=["demog", "comor", "adsym", "vacci", "vital", "sympt", "labs"]

Optional list of names of sections to include, defaults to ["demog", "comor", "adsym", "vacci", "vital", "sympt", "labs"].

required_variables : typing.Iterable, default=None

Optional iterable of required variable column names, defaults to None.

include_types : typing.Iterable, default=["binary", "categorical", "numeric"]

Optional iterable of variable type names, defaults to ["binary", "categorical", "numeric"].

exclude_suffix : typing.Iterable, default=["_units", "addi", "otherl2", "item", "_oth", "_unlisted", "otherl3"]

Optional iterable of suffixes to exclude, defaults to ["_units", "addi", "otherl2", "item", "_oth", "_unlisted", "otherl3"].

include_subjid : bool, default=False

Optional boolean to indicate whether to include subject ID, defaults to False.

exclude_negatives : bool, default=True

Optional boolean to indicate whether to drop negatives, defaults to True.

fillna : bool, default=True

Optional boolean to fill nulls, defaults to True.

drop_first : bool, default=False

Optional boolean indicating whether to drop the first category level of each one-hot-encoded variable, defaults to False.

sep : str, default="___"

Optional field-value separator, defaults to "___".

Returns:
pandas.DataFrame

Returns the modelling data.

isaricanalytics.analytics.get_n_percent_value(series: Series, add_spaces: bool = False, dp: int = 1, mfw: int = 4, min_n: int = 1) str[source]

str : Returns the n-percent value of a series as a string.

Parameters:
series : pandas.Series

The input series.

add_spaces : bool, default=False

Add spacing around the string.

dp : int, default=1

No description available.

mfw : int, default=4

No description available.

min_n : int, default=1

No description available.

Returns:
str

The n-percent value of the series as a string.

isaricanalytics.analytics.get_parameter_ranking(logistic: Any, n_top: int = 10, threshold: float = 0.001) DataFrame[source]

pandas.DataFrame : Returns a dataframe of rankings of parameter combinations using stored scores and coefficient paths.

Parameters:
logistic : typing.Any

The logistic model.

n_top : int, default=10

Optional number of top ranking features to select, defaults to \(10\).

threshold : float, default=1e-3

Optional ranking threshold, defaults to \(0.001\).

Returns:
pandas.DataFrame

The dataframe of parameter rankings.

isaricanalytics.analytics.get_proportions(data: DataFrame, dictionary: DataFrame, max_n_variables: int = 10, ignore_branching_logic: bool = False, branching_logic: str = '', sep: str = '___') DataFrame[source]

pandas.DataFrame : Returns a dataframe of proportional counts for variable columns.

Parameters:
data : pandas.DataFrame

The incoming data.

dictionary : pandas.DataFrame

The data dictionary.

max_n_variables : int, default=10

Optional number of variables for which to take counts, defaults to \(10\).

ignore_branching_logic : bool, default=False

Optional. No description available, defaults to False.

branching_logic : str, default=""

Optional. No description available, defaults to "".

sep : str, default="___"

Optional field-value separator, defaults to "___".

Returns:
pandas.DataFrame

A dataframe of proportional counts for variable columns.

isaricanalytics.analytics.get_pyramid_data(data: DataFrame, column_dict: dict[str, str], left_side: str = 'Female', right_side: str = 'Male') DataFrame[source]

pandas.DataFrame : Returns dual stack pyramid data.

Parameters:
data : pandas.DataFrame

The incoming data.

column_dict : dict

Dict of pyramid keys and data column names.

left_side : str, default="Female"

Optional label for the left side of the pyramid, defaults to "Female".

right_side : str, default="Male"

Optional label for the right side of the pyramid, defaults to "Male".

Returns:
pandas.DataFrame

Dual stack pyramid data.

isaricanalytics.analytics.get_upset_counts_intersections(data: DataFrame, dictionary: DataFrame, variables: list[str] | None = None, n_variables: int = 5, sep: str = '___') tuple[DataFrame][source]

pandas.DataFrame : Returns a dataframe of upset counts intersections.

Parameters:
data : pandas.DataFrame

The incoming data.

dictionary : pandas.DataFrame

The data dictionary.

variables : list, default=None

Optional list of names of variable columns for which to take counts, defaults to None.

n_variables : int, default=5

Optional limit for the number of variable columns, defaults to \(5\).

sep : str, default="___"

Optional field-value separator, defaults to "___".

Returns:
pandas.DataFrame

A dataframe of upset counts intersections.
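The counts behind an UpSet plot can be sketched with pandas (hypothetical one-hot symptom columns; the library derives the variable list from the data dictionary):

```python
import pandas as pd

# Hypothetical one-hot symptom columns.
data = pd.DataFrame({
    "sympt___cough": [1, 1, 0, 1],
    "sympt___fever": [1, 0, 0, 1],
})

# Count how often each combination of the binary variables occurs -
# the intersection sizes an UpSet plot displays.
counts = data.value_counts()
print(counts.to_dict())
```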

isaricanalytics.analytics.get_variables_by_section_and_type(data: DataFrame, dictionary: DataFrame, required_variables: Iterable[str] | None = None, include_sections: Iterable[str] = ['demog'], include_types: Iterable[str] = ['binary', 'categorical', 'numeric'], exclude_suffixes: Iterable[str] = ['_units', 'addi', 'otherl2', 'item', '_oth', '_unlisted', 'otherl3'], include_subjid: bool = False) list[str][source]

list : Returns a list of all variables in the dataframe from specified sections and types, plus any required variables.

Parameters:
data : pandas.DataFrame

Incoming data.

dictionary : pandas.DataFrame

Data dictionary.

required_variables : typing.Iterable, default=None

Optional iterable of required variable names, defaults to None.

include_sections : typing.Iterable, default=["demog"]

Optional iterable of names of sections to include, defaults to ["demog"].

include_types : typing.Iterable, default=["binary", "categorical", "numeric"]

Optional iterable of variable type names, defaults to ["binary", "categorical", "numeric"].

exclude_suffixes : typing.Iterable, default=["_units", "addi", "otherl2", "item", "_oth", "_unlisted", "otherl3"]

Optional iterable of suffixes to exclude, defaults to ["_units", "addi", "otherl2", "item", "_oth", "_unlisted", "otherl3"].

include_subjid : bool, default=False

Optional boolean to indicate whether to include subject ID, defaults to False.

Returns:
list

A list of all variables in the dataframe from specified sections and types, plus any required variables.

isaricanalytics.analytics.impute_miss_val(data: DataFrame, dictionary: DataFrame, outcome_column: str = 'outco_binary_outcome', missing_threshold: float = 0.7, verbose: bool = False) DataFrame[source]

pandas.DataFrame : The data with missing values imputed.

Imputes missing values or drops columns based on missing value proportion and median.

Parameters:
data : pandas.DataFrame

The incoming data.

dictionary : pandas.DataFrame

The data dictionary.

outcome_column : str, default="outco_binary_outcome"

Optional outcome column, defaults to "outco_binary_outcome".

missing_threshold : float, default=0.7

A proportional imputation threshold for missing values, defaults to \(0.7\).

verbose : bool, default=False

Optional indicator of whether to print the imputations summary, defaults to False.

Returns:
pandas.DataFrame

Data with missing values imputed or columns dropped.
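One plausible reading of the rule described above, sketched with pandas (the column names are hypothetical, and the library additionally consults the data dictionary and outcome column):

```python
import pandas as pd

# Hypothetical numeric columns with missing values.
data = pd.DataFrame({
    "age": [30, None, 50, None, 40],
    "bmi": [None, None, None, None, 22],
})

missing_threshold = 0.7
for col in list(data.columns):
    frac_missing = data[col].isna().mean()
    if frac_missing > missing_threshold:
        # Too sparse to impute reliably: drop the column.
        data = data.drop(columns=col)
    else:
        # Otherwise fill missing values with the column median.
        data[col] = data[col].fillna(data[col].median())
print(list(data.columns), list(data["age"]))
```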

isaricanalytics.analytics.lasso_var_sel_binary(data: DataFrame, outcome_column: str = 'mapped_outcome', metric: str = 'balanced_accuracy', threshold: float = 0.001, gridsearch_params: dict[str, Iterable[float]] = {'Cs': [0.001, 0.00316, 0.01, 0.0316, 0.1, 0.316, 1, 3.16, 10], 'l1_ratios': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}, random_state: int = 42, verbose: bool = False, sep: str = '___') tuple[Any][source]

tuple : Prepares data and selects features using binary logistic regression with elastic net penalty.

Specifically designed for binary outcomes only.

Parameters:
data : pandas.DataFrame

The incoming data.

outcome_column : str, default="mapped_outcome"

Optional outcome column, defaults to "mapped_outcome".

metric : str, default="balanced_accuracy"

Optional metric, defaults to "balanced_accuracy".

threshold : float, default=1e-3

Optional threshold, defaults to \(0.001\).

gridsearch_params : dict, default={"l1_ratios": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], "Cs": [1e-3, 3.16e-3, 1e-2, 3.16e-2, 1e-1, 3.16e-1, 1, 3.16, 10]}

Optional grid search params, defaults to:

{
    "l1_ratios": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "Cs": [1e-3, 3.16e-3, 1e-2, 3.16e-2, 1e-1, 3.16e-1, 1, 3.16, 10]
}

random_state : int, default=42

Optional random state, defaults to \(42\).

verbose : bool, default=False

Optional indicator of whether to print the analysis, defaults to False.

sep : str, default="___"

Optional field-value separator, defaults to "___".

Returns:
tuple

Results tuple.

isaricanalytics.analytics.mannwhitneyu(x, y, use_continuity=True, alternative='two-sided', axis=0, method='auto', *, nan_policy='propagate', keepdims=False)[source]

Perform the Mann-Whitney U rank test on two independent samples.

The Mann-Whitney U test is a nonparametric test of the null hypothesis that the distribution underlying sample x is the same as the distribution underlying sample y. It is often used as a test of difference in location between distributions.

Parameters:
x, y : array-like

N-d arrays of samples. The arrays must be broadcastable except along the dimension given by axis.

use_continuity : bool, optional

Whether a continuity correction (1/2) should be applied. Default is True when method is 'asymptotic'; has no effect otherwise.

alternative : {'two-sided', 'less', 'greater'}, optional

Defines the alternative hypothesis. Default is 'two-sided'. Let SX(u) and SY(u) be the survival functions of the distributions underlying x and y, respectively. Then the following alternative hypotheses are available:

  • 'two-sided': the distributions are not equal, i.e. SX(u) ≠ SY(u) for at least one u.

  • 'less': the distribution underlying x is stochastically less than the distribution underlying y, i.e. SX(u) < SY(u) for all u.

  • 'greater': the distribution underlying x is stochastically greater than the distribution underlying y, i.e. SX(u) > SY(u) for all u.

Under a more restrictive set of assumptions, the alternative hypotheses can be expressed in terms of the locations of the distributions; see [5] section 5.1.

axis : int or None, default: 0

If an int, the axis of the input along which to compute the statistic. The statistic of each axis-slice (e.g. row) of the input will appear in a corresponding element of the output. If None, the input will be raveled before computing the statistic.

method : {'auto', 'asymptotic', 'exact'} or PermutationMethod instance, optional

Selects the method used to calculate the p-value. Default is 'auto'. The following options are available.

  • 'asymptotic': compares the standardized test statistic against the normal distribution, correcting for ties.

  • 'exact': computes the exact p-value by comparing the observed \(U\) statistic against the exact distribution of the \(U\) statistic under the null hypothesis. No correction is made for ties.

  • 'auto': chooses 'exact' when the size of one of the samples is less than or equal to 8 and there are no ties; chooses 'asymptotic' otherwise.

  • PermutationMethod instance. In this case, the p-value is computed using permutation_test with the provided configuration options and other appropriate settings.

nan_policy : {'propagate', 'omit', 'raise'}

Defines how to handle input NaNs.

  • propagate: if a NaN is present in the axis slice (e.g. row) along which the statistic is computed, the corresponding entry of the output will be NaN.

  • omit: NaNs will be omitted when performing the calculation. If insufficient data remains in the axis slice along which the statistic is computed, the corresponding entry of the output will be NaN.

  • raise: if a NaN is present, a ValueError will be raised.

keepdims : bool, default: False

If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.

Returns:
res : MannwhitneyuResult

An object containing attributes:

statistic : float

The Mann-Whitney U statistic corresponding with sample x. See Notes for the test statistic corresponding with sample y.

pvalue : float

The associated p-value for the chosen alternative.

Notes

If U1 is the statistic corresponding with sample x, then the statistic corresponding with sample y is U2 = x.shape[axis] * y.shape[axis] - U1.

mannwhitneyu is for independent samples. For related / paired samples, consider scipy.stats.wilcoxon.

method 'exact' is recommended when there are no ties and when either sample size is less than 8 [1]. The implementation follows the algorithm reported in [3]. Note that the exact method is not corrected for ties, but mannwhitneyu will not raise errors or warnings if there are ties in the data. If there are ties and either sample is small (fewer than ~10 observations), consider passing an instance of PermutationMethod as the method to perform a permutation test.

The Mann-Whitney U test is a non-parametric version of the t-test for independent samples. When the means of samples from the populations are normally distributed, consider scipy.stats.ttest_ind.

Beginning in SciPy 1.9, np.matrix inputs (not recommended for new code) are converted to np.ndarray before the calculation is performed. In this case, the output will be a scalar or np.ndarray of appropriate shape rather than a 2D np.matrix. Similarly, while masked elements of masked arrays are ignored, the output will be a scalar or np.ndarray rather than a masked array with mask=False.

Array API Standard Support

mannwhitneyu has experimental support for Python Array API Standard compatible backends in addition to NumPy. Please consider testing these features by setting an environment variable SCIPY_ARRAY_API=1 and providing CuPy, PyTorch, JAX, or Dask arrays as array arguments. The following combinations of backend and device (or other capability) are supported.

Library   CPU   GPU
NumPy     ✅    n/a
CuPy      n/a   ✅
PyTorch   ✅    ✅
JAX       ✅    ⚠️ no JIT
Dask      ✅    n/a

See Support for the array API standard for more information.

References

[1]

H.B. Mann and D.R. Whitney, “On a test of whether one of two random variables is stochastically larger than the other”, The Annals of Mathematical Statistics, Vol. 18, pp. 50-60, 1947.

[2]

Mann-Whitney U Test, Wikipedia, http://en.wikipedia.org/wiki/Mann-Whitney_U_test

[3]

Andreas Löffler, “Über eine Partition der nat. Zahlen und ihr Anwendung beim U-Test”, Wiss. Z. Univ. Halle, XXXII’83 pp. 87-89.

[4] (1,2,3,4,5,6,7)

Rosie Shier, “Statistics: 2.3 The Mann-Whitney U Test”, Mathematics Learning Support Centre, 2004.

[5]

Michael P. Fay and Michael A. Proschan. “Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules.” Statistics surveys, Vol. 4, pp. 1-39, 2010. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2857732/

Examples

We follow the example from [4]: nine randomly sampled young adults were diagnosed with type II diabetes at the ages below.

>>> males = [19, 22, 16, 29, 24]
>>> females = [20, 11, 17, 12]

We use the Mann-Whitney U test to assess whether there is a statistically significant difference in the diagnosis age of males and females. The null hypothesis is that the distribution of male diagnosis ages is the same as the distribution of female diagnosis ages. We decide that a confidence level of 95% is required to reject the null hypothesis in favor of the alternative that the distributions are different. Since the number of samples is very small and there are no ties in the data, we can compare the observed test statistic against the exact distribution of the test statistic under the null hypothesis.

>>> from scipy.stats import mannwhitneyu
>>> U1, p = mannwhitneyu(males, females, method="exact")
>>> print(U1)
17.0

mannwhitneyu always reports the statistic associated with the first sample, which, in this case, is males. This agrees with \(U_M = 17\) reported in [4]. The statistic associated with the second sample can be calculated:

>>> nx, ny = len(males), len(females)
>>> U2 = nx*ny - U1
>>> print(U2)
3.0

This agrees with \(U_F = 3\) reported in [4]. The two-sided p-value can be calculated from either statistic, and the value produced by mannwhitneyu agrees with \(p = 0.11\) reported in [4].

>>> print(p)
0.1111111111111111
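With no ties and such small samples, the exact two-sided p-value can also be reproduced from first principles by enumerating all \(\binom{9}{5} = 126\) ways of splitting the pooled ages into the two groups. A minimal pure-Python sketch (illustrative only; scipy computes the exact distribution far more efficiently):

```python
from itertools import combinations

ages = [19, 22, 16, 29, 24, 20, 11, 17, 12]  # males then females
nx, ny = 5, 4

def u_stat(males, females):
    # U1 = number of (male, female) pairs with male age > female age
    return sum(1 for m in males for f in females if m > f)

observed = u_stat(ages[:nx], ages[nx:])  # 17, as above
mu = nx * ny / 2                         # mean of U under the null hypothesis

# count group assignments whose U is at least as far from the mean
# (two-sided "as extreme or more extreme") as the observed statistic
splits = list(combinations(range(nx + ny), nx))
extreme = 0
for idx in splits:
    males = [ages[i] for i in idx]
    females = [ages[i] for i in range(nx + ny) if i not in idx]
    if abs(u_stat(males, females) - mu) >= abs(observed - mu):
        extreme += 1

p_exact = extreme / len(splits)
print(observed, p_exact)  # 17 0.1111111111111111
```

This matches the exact p-value reported by mannwhitneyu above: 14 of the 126 equally likely assignments are at least as extreme as the observed one, and 14/126 = 1/9 ≈ 0.111.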

The exact distribution of the test statistic is asymptotically normal, so the example continues by comparing the exact p-value against the p-value produced using the normal approximation.

>>> _, pnorm = mannwhitneyu(males, females, method="asymptotic")
>>> print(pnorm)
0.11134688653314041

Here mannwhitneyu’s reported p-value appears to conflict with the value \(p = 0.09\) given in [4]. The reason is that [4] does not apply the continuity correction performed by mannwhitneyu; mannwhitneyu reduces the distance between the test statistic and the mean \(\mu = n_x n_y / 2\) by 0.5 to correct for the fact that the discrete statistic is being compared against a continuous distribution. Here, the \(U\) statistic used is less than the mean, so we reduce the distance by adding 0.5 in the numerator.

>>> import numpy as np
>>> from scipy.stats import norm
>>> U = min(U1, U2)
>>> N = nx + ny
>>> z = (U - nx*ny/2 + 0.5) / np.sqrt(nx*ny * (N + 1)/ 12)
>>> p = 2 * norm.cdf(z)  # use CDF to get p-value from smaller statistic
>>> print(p)
0.11134688653314041

If desired, we can disable the continuity correction to get a result that agrees with that reported in [4].

>>> _, pnorm = mannwhitneyu(males, females, use_continuity=False,
...                         method="asymptotic")
>>> print(pnorm)
0.0864107329737

Regardless of whether we perform an exact or asymptotic test, the probability of the test statistic being as extreme or more extreme by chance exceeds 5%, so we do not consider the results statistically significant.

Suppose that, before seeing the data, we had hypothesized that females would tend to be diagnosed at a younger age than males. In that case, it would be natural to provide the female ages as the first input, and we would have performed a one-sided test using alternative = 'less': females are diagnosed at an age that is stochastically less than that of males.

>>> res = mannwhitneyu(females, males, alternative="less", method="exact")
>>> print(res)
MannwhitneyuResult(statistic=3.0, pvalue=0.05555555555555555)

Again, the probability of getting a sufficiently low value of the test statistic by chance under the null hypothesis is greater than 5%, so we do not reject the null hypothesis in favor of our alternative.

If it is reasonable to assume that the means of samples from the populations are normally distributed, we could have used a t-test to perform the analysis.

>>> from scipy.stats import ttest_ind
>>> res = ttest_ind(females, males, alternative="less")
>>> print(res)
TtestResult(statistic=-2.239334696520584,
            pvalue=0.030068441095757924,
            df=7.0)

Under this assumption, the p-value would be low enough to reject the null hypothesis in favor of the alternative.

isaricanalytics.analytics.mean_std_str(series: Series, add_spaces: bool = False, dp: int = 1, mfw: int = 4, min_n: int = 3) str[source]

str : Returns the mean and standard deviation of a series of values as a formatted string.

Warning

DEPRECATED. Please use get_mean_and_stdev() instead.

Parameters:
series : pandas.Series

The input series for which to calculate the mean and st. dev.

add_spaces : bool, default=False

Add spacing in the string.

dp : int, default=1

No description available.

mfw : int, default=4

No description available.

min_n : int, default=3

No description available.

Returns:
str

The mean and st. dev. as a formatted string.
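For reference, the general shape of a "mean (SD)" formatter can be sketched in plain Python. The semantics of dp (assumed: decimal places) and min_n (assumed: minimum number of observations) are guesses inferred from the parameter names, not the library's documented API; prefer get_mean_and_stdev() as the deprecation notice advises.

```python
from statistics import mean, stdev

def mean_sd_sketch(values, dp=1, min_n=3):
    """Hypothetical 'mean (SD)' formatter; the `dp` and `min_n`
    semantics are assumptions inferred from the parameter names."""
    if len(values) < min_n:
        return ""  # too few observations to summarise
    return f"{mean(values):.{dp}f} ({stdev(values):.{dp}f})"

print(mean_sd_sketch([19, 22, 16, 29, 24]))  # 22.0 (4.9)
```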

isaricanalytics.analytics.median_iqr_str(series: Series, add_spaces: bool = False, dp: int = 1, mfw: int = 4, min_n: int = 3) str[source]

str : Returns the median and interquartile range (IQR) of a given series of values as a formatted string.

Warning

DEPRECATED. Please use get_median_interquartile_range() instead.

Parameters:
series : pandas.Series

The input series for which to calculate the median and IQR.

add_spaces : bool, default=False

Add spacing in the string.

dp : int, default=1

No description available.

mfw : int, default=4

No description available.

min_n : int, default=3

No description available.

Returns:
str

The median and IQR as a formatted string.

isaricanalytics.analytics.n_percent_str(series: Series, add_spaces: bool = False, dp: int = 1, mfw: int = 4, min_n: int = 1) str[source]

str : Returns the n-percent value of a series as a string.

Warning

DEPRECATED. Please use get_n_percent_value() instead.

Parameters:
series : pandas.Series

The input series.

add_spaces : bool, default=False

Add spacing around the string.

dp : int, default=1

No description available.

mfw : int, default=4

No description available.

min_n : int, default=1

No description available.

Returns:
str

The n-percent value of the series as a string.

isaricanalytics.analytics.regression_summary_table(table: DataFrame, dictionary: DataFrame, highlight_predictors: dict[str, Iterable[str]] | None = None, pvalue_significance: float | None = None, result_type: str = 'OddsRatio', sep: str = '___') DataFrame[source]

pandas.DataFrame : Returns a regression summary table.

Parameters:
table : pandas.DataFrame

The incoming table.

dictionary : pandas.DataFrame

The data dictionary.

highlight_predictors : dict, default=None

Optional. No description available, defaults to None.

pvalue_significance : float, default=None

Optional \(p\)-value significance level, defaults to None.

result_type : str, default="OddsRatio"

Optional. No description available, defaults to "OddsRatio".

sep : str, default="___"

Optional field-value separator, defaults to "___".

Returns:
pandas.DataFrame

Regression summary table.

isaricanalytics.analytics.remove_single_binary_outcome_predictors(data: DataFrame, dictionary: DataFrame, predictors: Iterable[str], outcome: str) Iterable[str][source]

typing.Iterable : Returns a list of retained columns in the data after removing single binary outcome predictor columns.

Removes binary predictors that are associated with only one outcome, e.g. if all patients with some_variable=1 have outcome=1.

Parameters:
data : pandas.DataFrame

The incoming data.

dictionary : pandas.DataFrame

The data dictionary.

predictors : typing.Iterable

Iterable of predictor variable column names.

outcome : str

The outcome column name.

Returns:
typing.Iterable

List of predictor variable column names excluding any that can’t be used in the logistic regression model.
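The idea can be illustrated in plain Python. This is a sketch of the described behaviour, not the library's implementation, and the row/column representation and example fields are hypothetical:

```python
def keep_informative_binary_predictors(rows, predictors, outcome):
    """Drop binary predictors for which every row with the flag set
    shares a single outcome value (perfect separation)."""
    kept = []
    for col in predictors:
        outcomes_when_set = {row[outcome] for row in rows if row[col] == 1}
        if len(outcomes_when_set) > 1:
            kept.append(col)
    return kept

rows = [
    {"smoker": 1, "diabetic": 1, "outcome": 1},
    {"smoker": 1, "diabetic": 0, "outcome": 1},
    {"smoker": 0, "diabetic": 1, "outcome": 0},
    {"smoker": 0, "diabetic": 0, "outcome": 1},
]
# every row with smoker=1 has outcome=1, so "smoker" is dropped
print(keep_informative_binary_predictors(rows, ["smoker", "diabetic"], "outcome"))
# ['diabetic']
```

Such predictors cause complete separation in logistic regression, where the maximum-likelihood coefficient estimates diverge, which is why they must be removed before fitting.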

isaricanalytics.analytics.rmv_high_corr(data: DataFrame, dictionary: DataFrame, outcome_column: str = 'outco_binary_outcome', correlation_threshold: float = 0.5, verbose: bool = False) DataFrame[source]

pandas.DataFrame : Removes variables in the data with high multicollinearity.

If the correlation between two variables exceeds the threshold, one variable of the pair is arbitrarily selected for removal.

Parameters:
data : pandas.DataFrame

The incoming data.

dictionary : pandas.DataFrame

The data dictionary.

outcome_column : str, default="outco_binary_outcome"

Optional outcome column, defaults to "outco_binary_outcome".

correlation_threshold : float, default=0.5

Optional correlation threshold, defaults to \(0.5\).

verbose : bool, default=False

Optional indicator of whether to print correlation summary.

Returns:
pandas.DataFrame

The data with high correlation variables removed.

isaricanalytics.analytics.rmv_low_var(data: DataFrame, dictionary: DataFrame, mad_threshold: float = 0.1, freq_threshold: float = 0.05, outcome_column: str = 'outco_binary_outcome', verbose: bool = False) DataFrame[source]

pandas.DataFrame : Removes numerical variables from the data with Median Absolute Deviation (MAD) below a given threshold.

Excludes binary columns from MAD calculation. Removes binary columns with very low frequencies.

Parameters:
data : pandas.DataFrame

The incoming data.

dictionary : pandas.DataFrame

The data dictionary.

mad_threshold : float, default=0.1

Optional MAD threshold, defaults to \(0.1\).

freq_threshold : float, default=0.05

Optional frequency threshold, defaults to \(0.05\).

outcome_column : str, default="outco_binary_outcome"

Optional outcome column, defaults to "outco_binary_outcome".

verbose : bool, default=False

Optional indicator of whether to print MAD analysis summary.

Returns:
pandas.DataFrame

The data with low MAD columns removed.
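The screening logic described above can be sketched in plain Python. This is illustrative only; the library's exact detection of binary columns and handling of the outcome column may differ:

```python
from statistics import median

def mad(values):
    """Median absolute deviation."""
    m = median(values)
    return median([abs(v - m) for v in values])

def drop_low_variation(columns, mad_threshold=0.1, freq_threshold=0.05):
    """columns: mapping of name -> list of values. Binary columns are
    screened on minority-class frequency instead of MAD (a sketch of
    the described behaviour, not the library's implementation)."""
    kept = {}
    for name, values in columns.items():
        if set(values) <= {0, 1}:  # binary column: check minority frequency
            freq = sum(values) / len(values)
            if min(freq, 1 - freq) >= freq_threshold:
                kept[name] = values
        elif mad(values) >= mad_threshold:  # numerical column: check MAD
            kept[name] = values
    return kept

cols = {
    "age": [30, 45, 50, 62],           # MAD = 8.5 -> kept
    "constant": [5.0, 5.0, 5.0, 5.0],  # MAD = 0 -> dropped
    "rare_flag": [1] + [0] * 20,       # minority frequency 1/21 -> dropped
}
print(sorted(drop_low_variation(cols)))  # ['age']
```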

isaricanalytics.analytics.trim_field_label(x: str, max_len: int = 40) str[source]

str : Trims field label using an optional max. length parameter that defaults to 40 characters.

Parameters:
x : str

The input field label.

max_len : int, default=40

An optional maximum length parameter to use for trimming, defaults to \(40\).

Returns:
str

The trimmed field label.
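A plausible sketch of such a trimmer follows. Whether the real function appends a truncation marker is not documented here, so the ellipsis is an assumption:

```python
def trim_label_sketch(x, max_len=40):
    """Hypothetical label trimmer: slice long labels and mark truncation."""
    return x if len(x) <= max_len else x[:max_len - 3] + "..."

print(trim_label_sketch("Short label"))              # unchanged
print(len(trim_label_sketch("x" * 100)))             # capped at 40
```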

isaricanalytics.analytics.variance_influence_factor_backwards_elimination(data: DataFrame, dictionary: DataFrame, predictors_list: Iterable[str], sep: str = '___') tuple[Iterable[str], DataFrame][source]

tuple : Returns an iterable of retained columns and the VIF backwards elimination data.

Parameters:
data : pandas.DataFrame

Incoming data.

dictionary : pandas.DataFrame

Data dictionary.

predictors_list : typing.Iterable

Iterable of predictor variable column names.

sep : str, default="___"

Optional field-value separator, defaults to "___".

Returns:
tuple

An iterable of retained columns and the VIF backwards elimination data.
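The standard VIF backwards-elimination loop (VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors) can be sketched in plain Python. The threshold of 5 and the tie-breaking rule are common conventions assumed here, not this function's documented behaviour:

```python
def _solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def _r2(y, X):
    """R^2 from an OLS regression of y on the columns of X (with intercept)."""
    n = len(y)
    Xi = [[1.0] + row for row in X]
    k = len(Xi[0])
    XtX = [[sum(Xi[r][i] * Xi[r][j] for r in range(n)) for j in range(k)]
           for i in range(k)]
    Xty = [sum(Xi[r][i] * y[r] for r in range(n)) for i in range(k)]
    beta = _solve(XtX, Xty)
    resid = [y[r] - sum(b * x for b, x in zip(beta, Xi[r])) for r in range(n)]
    ybar = sum(y) / n
    ss_tot = sum((v - ybar) ** 2 for v in y)
    return 1.0 - sum(e * e for e in resid) / ss_tot

def vif_backwards_elimination(columns, threshold=5.0):
    """Repeatedly drop the predictor with the largest variance inflation
    factor until all VIFs fall below `threshold` (a sketch; the threshold
    and tie-breaking are assumptions)."""
    names = list(columns)
    n_rows = len(next(iter(columns.values())))
    while len(names) > 1:
        vifs = {}
        for target in names:
            others = [c for c in names if c != target]
            X = [[columns[c][r] for c in others] for r in range(n_rows)]
            r2 = _r2(columns[target], X)
            vifs[target] = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
        worst = max(vifs, key=vifs.get)
        if vifs[worst] < threshold:
            break
        names.remove(worst)
    return names

cols = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 10.1, 11.9],   # ~= 2 * x1: collinear
    "x3": [1.0, -1.0, 1.0, 1.0, -1.0, -1.0],  # roughly independent
}
kept = vif_backwards_elimination(cols)
print(len(kept), "x3" in kept)  # 2 True
```

One of the collinear pair x1/x2 is eliminated; the loop then stops because the remaining pairwise VIFs fall below the threshold.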