BookmarkSubscribeRSS Feed
jorgelobin
Fluorite | Level 6

Hello everyone,

I have written some code in SAS to calculate the Population Stability Index (PSI) in order to compare the distributions of a decision engine between two datasets (a base model and a current dataset). I would like to ask if the implementation is correct and if the PSI calculation is done appropriately. Any feedback or suggestions would be greatly appreciated!

Data description:

  • I am using two datasets: BBDD_MODELO (the base model) and BBDD_OOT (the current dataset).
  • Both datasets contain the variable MOTOR_DE_DECISION, which indicates different decision categories. I aim to compare the proportions of these categories between the two datasets using the PSI.

Steps already tried:

Here is the SAS code I’ve implemented so far. The script performs the following steps:

  1. Import the two Excel datasets (BBDD_MODELO and BBDD_OOT).
  2. Calculate counts and proportions for each category of the variable MOTOR_DE_DECISION in both datasets.
  3. Calculate the PSI by comparing the proportions between the base model and the current dataset.
  4. Create a summary table with the PSI values and classify the stability result as Green (G), Amber (A), or Red (R) based on the PSI threshold.
/*Lectura del archivo BBDD_MODELO.xlsx*/
PROC IMPORT OUT=BBDD_MODELO
    DATAFILE = "C:\Users\user\Desktop\BBDD_MODELO.xlsx"
    DBMS = XLSX;
    GETNAMES = YES;
RUN;

/*Lectura del archivo BBDD_OOT.xlsx*/
PROC IMPORT OUT=BBDD_OOT
    DATAFILE = "C:\Users\user\Desktop\BBDD_OOT.xlsx"
    DBMS = XLSX;
    GETNAMES = YES;
RUN;

/*Recuentos y Proporciones de la tabla base (BBDD_MODELO)*/
/*Obtenemos las frecuencias de los distintos valores de la variable MOTOR_DE_DECISION*/
PROC SQL;
	CREATE TABLE BASE_AUX AS SELECT
		MOTOR_DE_DECISION,
		COUNT(*) AS RECUENTO_BASE
	FROM BBDD_MODELO
	GROUP BY MOTOR_DE_DECISION;
QUIT;

/*Sumamos todas las frecuencias de cada categoría para obtener el total de observaciones*/
PROC SQL NOPRINT;
	SELECT SUM(RECUENTO_BASE) INTO :TOT_RECUENTO_BASE
	FROM BASE_AUX;
QUIT;

/*Calculamos la proporción de cada categoría*/
DATA BASE;
	SET BASE_AUX;
	PROPORCION_BASE = (RECUENTO_BASE/&TOT_RECUENTO_BASE);
RUN;

/*Sumamos todas las proporciones de cada categoría. Esta suma tiene que ser igual a 1*/
PROC SQL NOPRINT;
	SELECT SUM(PROPORCION_BASE) INTO :TOT_PROPORCION_BASE
	FROM BASE;
QUIT;
/*Recuentos y Proporciones de la tabla actual (BBDD_OOT)*/
/*Obtenemos las frecuencias de los distintos valores de la variable MOTOR_DE_DECISION*/
PROC SQL;
	CREATE TABLE ACTUAL_AUX AS SELECT
		MOTOR_DE_DECISION,
		COUNT(*) AS RECUENTO_ACTUAL
	FROM BBDD_OOT
	GROUP BY MOTOR_DE_DECISION;
QUIT;

/*Sumamos todas las frecuencias de cada categoría para obtener el total de observaciones*/
PROC SQL NOPRINT;
	SELECT SUM(RECUENTO_ACTUAL) INTO :TOT_RECUENTO_ACTUAL
	FROM ACTUAL_AUX;
QUIT;

/*Calculamos la proporción de cada categoría*/
DATA ACTUAL;
	SET ACTUAL_AUX;
	PROPORCION_ACTUAL = (RECUENTO_ACTUAL/&TOT_RECUENTO_ACTUAL);
RUN;

/*Sumamos todas las proporciones de cada categoría. Esta suma tiene que ser igual a 1*/
PROC SQL NOPRINT;
	SELECT SUM(PROPORCION_ACTUAL) INTO :TOT_PROPORCION_ACTUAL
	FROM ACTUAL;
QUIT;
/*Cálculo del PSI*/
DATA RESUMEN;
	MERGE BASE ACTUAL;
	BY MOTOR_DE_DECISION;
	IF RECUENTO_BASE = . THEN RECUENTO_BASE = 0;
	IF RECUENTO_ACTUAL = . THEN RECUENTO_ACTUAL = 0;
	IF RECUENTO_BASE = 0 THEN PROPORCION_BASE = (1/&TOT_RECUENTO_BASE);
	IF RECUENTO_ACTUAL = 0 THEN PROPORCION_ACTUAL = (1/&TOT_RECUENTO_ACTUAL);
	PSI = (PROPORCION_ACTUAL-PROPORCION_BASE)*LOG(PROPORCION_ACTUAL/PROPORCION_BASE);
RUN;
/*Guardamos el valor del PSI en la macro variable PSI*/
PROC SQL NOPRINT;
	SELECT SUM(PSI) INTO :PSI
	FROM RESUMEN;
QUIT;
/*Se crea la fila de totales*/
DATA TOTALES;
	MOTOR_DE_DECISION = "TOTAL";
	PSI = Ψ
	RECUENTO_BASE = &TOT_RECUENTO_BASE;
	RECUENTO_ACTUAL = &TOT_RECUENTO_ACTUAL;
	PROPORCION_BASE = &TOT_PROPORCION_BASE;
	PROPORCION_ACTUAL = &TOT_PROPORCION_ACTUAL;
	IF PSI <= 0.1 THEN RAG = "G";
		ELSE IF 0.1 < PSI <= 0.25 THEN RAG = "A";
		ELSE IF PSI > 0.25 THEN RAG = "R";
RUN;
/*Añadimos los totales a la tabla resumen*/
DATA RESULTADO;
	RETAIN MOTOR_DE_DECISION RECUENTO_BASE RECUENTO_ACTUAL PROPORCION_BASE PROPORCION_ACTUAL PSI RAG;
	SET RESUMEN TOTALES;
	FORMAT PROPORCION_BASE 6.4;
	FORMAT PROPORCION_ACTUAL 6.4;
RUN;

 

You can also check a more detailed explanation and implementation of the Population Stability Index (PSI) on my website, where you can download the datasets to test and verify the code:

PSI Implementation in SAS

Thank you in advance for your help!

4 REPLIES 4
ChrisNZ
Tourmaline | Level 20

Vetting one's code is tedious, and even more so without data.

You might get more engagement if you supply and small sample, and someone will check that the PSI results are the same as what they calculate.

PaigeMiller
Diamond | Level 26

Agreeing with @ChrisNZ , without actual data (or made up data that are representative of the problem), I wouldn't really want to say if your code is correct. Most of us refuse to download Excel files (particularly from sites we don't know) because they can be a security threat; data needs to be provided following these examples and instructions.

 

Using count(*) will produce results that will not account for missing values in RECUENTO_BASE, maybe that's what you want or maybe that's not what you want.

 

There are better ways to compute percents, PROC FREQ will do that for you in one step, replacing the first two PROC SQL's and the data step DATA BASE. PROC FREQ also specifically allows you to handle missing values in RECUENTO_BASE whichever way you want. Same for your calculation of ACTUAL. Macro variables are not needed here.

--
Paige Miller
mkeintz
PROC Star

When SAS was first published it was called Statistical Analysis System, because it offers a lot of statistical procedures.   They are still there.  And instead of programming data steps and SQL steps, you should take advantage of them, especially in the generation of frequency proportions.

 

I looked up the definition of Population Stability Index, which, according to http://www.stat.wmich.edu/naranjo/PSI.pdf , is defined as

 

         mkeintz_0-1724863483416.png

 

where each of two populations have a variable with the same B non-empty bins, and phat{i} and qhat{i} are estimates of proportions in bin number i (p for population 1 and q for pop 2).

 

Generating the phat{i} and qhat{i} are trivial with proc freq, which directly generate percentages, not only for a report, but also to an output dataset.  Then you can follow with a merge (by bin category) to generate each of the B components of the psi index, followed by a total of those components.  I've made an example below using the variable TYPE from sashelp.cars as motor_de_decision:

 

data bbdd_modelo bbdd_oot;
  set sashelp.cars (keep=type rename=(type=motor_de_decision));
  /* TYPE has values like "SUV", "Sedan", "Truck", etc. */
  if   1<=mod(_n_,10)<=5 then output bbdd_modelo;
  else output bbdd_oot;
run;

proc freq data=bbdd_modelo noprint;
  tables motor_de_decision
   / out=base_pop (rename=(count=base_count percent=base_pct));
run;
proc freq data=bbdd_oot noprint;
  table motor_de_decision
   / out=actual_pop (rename=(count=actual_count percent=actual_pct));
run;

data psi_results (label='One obs per bin, an extra obs for all bins');
  merge base_pop actual_pop    end=end_of_merge;
  by motor_de_decision;
  base_proportion=base_pct/100;
  actual_proportion=actual_pct/100;
  psi_index = (base_proportion-actual_proportion)
            * (log(base_proportion)-log(actual_proportion)) ;

  array total_count {2} _temporary_;
  total_count{1} + base_count;
  total_count{2} + actual_count;
  array total_psi {1}   _temporary_;
  total_psi{1} + psi_index;

  output;
  /* Now output a final, extra obs with the overall PSI index value */
  if end_of_merge then do;
    call missing(of _all_);
    psi_index=total_psi{1};
    base_count=total_count{1};
    actual_count=total_count{2};
    output;
  end;
run;

Dataset PSI_RESULTS will have B+1 observations, one for each bin. followed by an observation containing the psi_index over all bins.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 1078 views
  • 1 like
  • 5 in conversation