Statistical Procedures

amrora · Posted 04-07-2023 03:56 PM

Hello,

I have aggregate data (no person-level data; only counts and percentages), including predictor variables with multiple levels (age in the screenshot below has 5 levels), across 3 levels of an outcome (n% across the top of the screenshot below). My mentor is asking me to calculate standard differences for each categorical predictor variable across the 3 levels of the outcome (realizing that we'll have to compare 2 vs 1 and 3 vs 1, instead of 1 vs 2 vs 3). I see that to use proc psmatch, the predictor variables have to be binary (0/1), but I don't think I can do this with the aggregated data that I have.

Does anyone know how I can calculate standard differences for categorical predictor variables as shown below?

Thank you in advance for your help!

ballardw · Posted 04-08-2023 05:57 PM

Please post a link to your definition of "standard difference".

My searches turn up too many radically different things containing that phrase to want to spend any time guessing which is applicable.

If you are unable to post data step code describing your data then please at least post simple text in a text box opened on the forum with the </> icon that appears above the message window.

And indicate which are "outcomes". Four identical column headings obscure which columns serve which purpose.

BTW there are procedures that do tests across multiple levels but the data needs to be in a reasonable form for specific tests.

Rick_SAS · Posted 04-09-2023 06:41 AM

I'm guessing that the OP knows about and has read Yang and Dalton ("A unified approach to measuring the effect size between two groups using SAS", SAS...), which shows how to calculate "standardized difference scores" for two groups. The definition is on pp. 2-3. Their macro is available from the Cleveland Clinic at https://www.lerner.ccf.org/quantitative-health/documents/stddiff.sas

It sounds like the OP wants to compute a similar measure for more than two groups.
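For readers who do not have the paper at hand, the two-group standardized difference is commonly written as below. This is a sketch from the general literature, not a quotation of pp. 2-3. The notation is chosen here for illustration: T and C denote the two groups being compared, \bar{x} and s^2 are group means and variances, and p are group proportions.

```latex
% Continuous variable
d = \frac{\bar{x}_T - \bar{x}_C}{\sqrt{(s_T^2 + s_C^2)/2}}

% Binary variable
d = \frac{p_T - p_C}{\sqrt{\left[\,p_T(1-p_T) + p_C(1-p_C)\,\right]/2}}

% Categorical variable with K levels: drop one level and use a Mahalanobis distance
T = (p_{T2}, \dots, p_{TK})^\top, \qquad C = (p_{C2}, \dots, p_{CK})^\top

d = \sqrt{(T - C)^\top S^{-1} (T - C)}, \qquad
S_{kl} =
\begin{cases}
\dfrac{p_{Tk}(1-p_{Tk}) + p_{Ck}(1-p_{Ck})}{2}, & k = l \\[2ex]
-\dfrac{p_{Tk}\,p_{Tl} + p_{Ck}\,p_{Cl}}{2}, & k \neq l
\end{cases}
```

With more than two groups, one such d is computed for each pair of groups (for example 2 vs 1 and 3 vs 1).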

amrora · Posted 04-10-2023 12:46 PM

Hi, I have read that article and had been trying to use the macro without success until this morning. I think I got it to work with my data (count data, see below). I knew that I could only compare outcomes 2 vs 1 and 3 vs 1, but I was trying to figure out how to calculate a single standardized difference for all of age (5 levels). I realized this morning that I had to calculate an SMD for each level of age (as binary 0/1). I'm waiting on feedback from my mentor.

Here is the data:

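/* Aggregate counts: one row per age level (0-4), binary indicator (age_grp), and severity (0-2) */
data age;
input level age_grp severity count;
datalines;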
0 0 0 34967
0 1 0 109368
0 0 1 674
0 1 1 5992
0 0 2 133
0 1 2 4790
1 0 0 44727
1 1 0 99608
1 0 1 1293
1 1 1 5373
1 0 2 1342
1 1 2 3581
2 0 0 29442
2 1 0 114893
2 0 1 1074
2 1 1 5592
2 0 2 2035
2 1 2 2888
3 0 0 30257
3 1 0 114078
3 0 1 3077
3 1 1 3589
3 0 2 1314
3 1 2 3609
4 0 0 4942
4 1 0 139393
4 0 1 548
4 1 1 6118
4 0 2 99
4 1 2 4824
;
run;
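The per-level approach described above (treating each age level as a 0/1 variable) can be sketched directly from these counts without a macro. This is a minimal sketch, not the Yang and Dalton macro. It assumes the AGE data set above already exists in WORK, and that age_grp=0 rows hold the count inside the level while age_grp=1 rows hold the count in all other levels (an inference from the fact that the age_grp=0 counts add up to each severity total). The names age_prop, age_wide, age_smd, p_sev0-p_sev2, smd_1_vs_0, and smd_2_vs_0 are chosen here for illustration.

```sas
/* Proportion of each severity group that falls in each age level */
proc sql;
  create table age_prop as
  select level,
         severity,
         sum(case when age_grp = 0 then count else 0 end) / sum(count) as p
  from age
  group by level, severity
  order by level, severity;
quit;

/* One row per age level, one column per severity group: p_sev0, p_sev1, p_sev2 */
proc transpose data=age_prop out=age_wide(drop=_name_) prefix=p_sev;
  by level;
  id severity;
  var p;
run;

/* Binary standardized difference for each level, severity 0 as the reference */
data age_smd;
  set age_wide;
  smd_1_vs_0 = (p_sev1 - p_sev0)
               / sqrt((p_sev1*(1 - p_sev1) + p_sev0*(1 - p_sev0)) / 2);
  smd_2_vs_0 = (p_sev2 - p_sev0)
               / sqrt((p_sev2*(1 - p_sev2) + p_sev0*(1 - p_sev0)) / 2);
run;

proc print data=age_smd noobs;
run;
```

This gives one standardized difference per age level and comparison. It does not give a single value for age as a whole.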

amandiyliwo · Posted 01-23-2025 04:16 PM

I am working on doing the same and need help. How did you end up resolving this issue?

The macro, as stated before, works for two groups only. Are there any other macros we can use for multiple groups, or any Base SAS code we can write to calculate the SMD? Can someone help answer this?

pink_poodle · Posted 01-23-2025 04:42 PM

If by "standard difference" you mean a difference standardized relative to one group, you can use a formula. For example, suppose there are two buckets, "1" and "2", with some stuff in each. What is the standard difference between the stuff in bucket "2" vs "1"? To standardize relative to bucket 1, we divide by the amount of stuff in that bucket. So the standard difference is the amount in bucket "2" minus the amount in bucket "1", divided by the amount in bucket "1". For example, if the standard difference in the amount of 🥔 🥔 (potatoes) between buckets 2 and 1 is 0.2, then bucket 2 has 20% more potatoes than bucket 1:

Std_diff_2_to_1 = (b2 - b1)/b1;

We can multiply by 100 to get the percent.

quickbluefish · Posted 01-23-2025 05:21 PM

The macro referenced above is flawed in the way it calculates standard diffs for variables with more than two levels. The specific cause is the way it stores an array of floating point numbers in a single macro variable string. The result is that values get rounded pretty substantially, and in certain cases this can affect the SMD quite a lot. I've had an email exchange with the authors of that paper and they are aware of the problem. At some point, I rewrote the macro so that it avoids the two steps that use the 'select ... into :var separated by...' syntax, and the results of that macro exactly match one written in a completely different way with PROC IML. The process of calculating these SMDs is much more involved than doing it for variables with only two levels.
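To show why the multi-level case is more involved, here is a minimal PROC IML sketch of the usual Mahalanobis-distance form of the SMD for a categorical variable. It is not the rewritten macro or the authors' code, and it requires a SAS/IML license. The counts are the in-level age counts posted earlier in the thread (severity 0 and severity 1). The names n0, n1, p0, p1, pC, pT, S, and d are chosen here for illustration. Because everything stays in numeric matrices, nothing passes through a macro variable string.

```sas
proc iml;
  /* Counts per age level (levels 0-4), taken from the AGE data earlier in the thread */
  n0 = {34967, 44727, 29442, 30257, 4942};   /* severity 0 (reference group) */
  n1 = {674, 1293, 1074, 3077, 548};         /* severity 1 */

  /* Within-group proportions for each level */
  p0 = n0 / sum(n0);
  p1 = n1 / sum(n1);

  /* Drop the first level: the proportions sum to 1, so the full */
  /* covariance matrix would be singular                         */
  k  = nrow(p0);
  pC = p0[2:k];
  pT = p1[2:k];

  /* Average of the two multinomial covariance matrices:          */
  /* diagonal p(1-p), off-diagonal -p_k*p_l                       */
  S = ( (diag(pT) - pT*pT`) + (diag(pC) - pC*pC`) ) / 2;

  /* Mahalanobis distance between the two proportion vectors */
  d = sqrt( (pT - pC)` * inv(S) * (pT - pC) );

  print d[label="SMD for age, severity 1 vs 0"];
quit;
```

Swapping n1 for the severity 2 counts gives the second comparison against the reference group.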

amandiyliwo · Posted 01-23-2025 05:38 PM

Would you be willing to share your corrected macro?

quickbluefish · Posted 01-23-2025 06:59 PM

You're welcome to try using this 'table1' macro, which, among other things, will calculate SMDs for continuous, 2 level and >2 level categorical variables. If you describe a bit what your data look like and what the result is you're trying to achieve, I can help you set up the call to the macro correctly. It's the program called 'table1.sas' in this github repo:

https://github.com/Jeremy-Smith5/CEP-public/tree/main/SAS

quickbluefish · Posted 01-23-2025 07:01 PM

...and if you'd rather avoid that, there's an R package (not written by me) that will calculate SMDs like this. I think I've used that once or twice.

amandiyliwo · Posted 01-24-2025 02:34 PM

Thank you so much for sharing! I tried the macro but encountered some errors, and the standardized difference (the SD column below) was not being calculated for one of my variables, so it turned up as missing. Regarding my data structure, here is an example where I would like to get the standardized difference for the continuous variables and the proportions. Thank you so much for all your help so far with this. What am I missing? Please help me understand why it's not working.

Variable        TRT1          TRT2           TRT3           SD
Age             72.5 (5.9)    73.1 (6.1)     72.5 (5.8)
Gender, n (%)
. Female        406 (9.09)    1256 (28.12)   2804 (62.79)
. Male          403 (9.03)    1183 (26.50)   2878 (64.47)
Race, n (%)
. Asian         10 (8.26)     26 (21.49)     85 (70.25)
. Black         184 (8.90)    580 (28.06)    1303 (63.04)
. Hispanic      5 (8.77)      12 (21.05)     40 (70.18)
. White         610 (9.12)    1821 (27.24)   4254 (63.64)

I got the SMD for everything but the race categories. What could be the issue?

var	lvl	ALL	ALL_2	TRT_1	TRT_1_2	TRT_2	TRT_2_2	TRT_3	TRT_3_2	test_stat	test_stat_val	pval	has_missing	SMD_MNA2_2_vs_MNA2_1	SMD_MNA2_3_vs_MNA2_1	SMD_MNA2_3_vs_MNA2_2
Pop		8758	1	784	0.089518	2382	0.27198	5592	0.638502			.
Gender										Chi2	2.637272	0.2675		-0.03101	0.008678663	0.039693716
	Female	4414	0.503996	394	0.502551	1234	0.518052	2786	0.498212			.
	Male	4344	0.496004	390	0.497449	1148	0.481948	2806	0.501788			.
Race										Chi2	5.701208	0.4575		0.041915	0.023022193	0.058672359
	Asian	121	0.013816	10	0.012755	26	0.010915	85	0.0152			.
	Black	1949	0.222539	172	0.219388	558	0.234257	1219	0.21799			.
	Hispanic	57	0.006508	5	0.006378	12	0.005038	40	0.007153			.
	White	6631	0.757136	597	0.76148	1786	0.74979	4248	0.759657			.
Age	Mean (StdDev)	72.63896	5.883168	72.52296	5.937476	73.08564	6.039716	72.46495	5.798456			.		0.093956	-0.009885058	-0.104841103

Here is the macro call:

/* Include the macro file */
%INCLUDE 'C:\FOLDER\table 1 macro.txt';

/* Call the %table1 macro */
%table1(
personfile=test, /* Input dataset */
stratvars=trt, /* Stratify by treatment group */
rowvars=
gender | /* Gender as a row variable */
race | /* Race as a row variable */
age/mean std, /* Age as a continuous variable with mean and std */
uselabels=1, /* Use variable labels if available */
pvalues=1, /* Include p-values for group comparisons */
printSMD=1 /* Calculate and display standardized mean differences */
);

amandiyliwo · Posted 01-24-2025 02:44 PM

Just want to add that I reran it and yes the macro worked! No missing. Thank you so much!

quickbluefish · Posted 01-24-2025 03:15 PM

Great! One thing - if your STRATVAR has more than 2 levels (looks like you have 3), the macro will not calculate p-values (only SMDs). If you really need p-values for each pair-wise comparison of the STRATVAR, you would need to run the macro multiple times (once for each pair of cohorts) and merge them together. Very awkward, but works. But SMDs should work regardless of how many columns - there will be one for each pair.
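The "run it once per pair" workaround could be sketched as a small wrapper like the one below. This is only a sketch under several assumptions: %table1 has already been %INCLUDEd, the stratifying variable is numeric with values 1, 2, and 3, and the parameters are copied from the call posted earlier in the thread. The wrapper name %pairwise_table1, its parameters, and the pair_ data set names are hypothetical. Merging the results is left out because the macro's output data set names are not shown in this thread.

```sas
%macro pairwise_table1(data=test, strat=trt, levels=1 2 3);
  /* Distinctive counter names, in case %table1 uses &i or &j internally */
  %local _pw_i _pw_j _pw_n _pw_a _pw_b;
  %let _pw_n = %sysfunc(countw(&levels, %str( )));

  %do _pw_i = 1 %to %eval(&_pw_n - 1);
    %do _pw_j = %eval(&_pw_i + 1) %to &_pw_n;
      %let _pw_a = %scan(&levels, &_pw_i, %str( ));
      %let _pw_b = %scan(&levels, &_pw_j, %str( ));

      /* Keep only the two cohorts being compared */
      data pair_&_pw_a._&_pw_b;
        set &data;
        where &strat in (&_pw_a, &_pw_b);
      run;

      /* Two-level stratifier, so p-values can be requested */
      %table1(
        personfile=pair_&_pw_a._&_pw_b,
        stratvars=&strat,
        rowvars=gender | race | age/mean std,
        uselabels=1,
        pvalues=1,
        printSMD=1
      );
    %end;
  %end;
%mend pairwise_table1;

%pairwise_table1(data=test, strat=trt, levels=1 2 3);
```

For three cohorts this makes three calls (1 vs 2, 1 vs 3, 2 vs 3), each on a two-cohort subset.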

How to calculate standard difference of 4-level predictor variables with aggregate data
