Hello. I have a dataset that contains 5 variables. "ID" is a unique identifier of a person in my dataset. "Gender" is male (1) or female (2). AgeRange contains only 1 through 6, which indicates which age range they fall into based on a categorical scheme for my report. "Screen" is if they were ever medically screened. "Status" has 3 entries, positive, negative, or blank. It is blank when the patient has not been screened. Here is the sample dataset:
data JANFEB.SAMPLEUNIQUE;
infile datalines dsd truncover;
input ID:BEST. Gender:BEST. AgeRange:BEST. Screen:$3. Result:$8.;
datalines4;
1,1,1,Yes,Positive
2,2,2,No,
3,1,3,Yes,Negative
4,2,4,No,
5,1,5,Yes,Positive
6,2,6,No,
7,1,1,Yes,Negative
8,2,2,No,
9,1,3,Yes,Positive
10,2,4,No,
11,1,5,Yes,Negative
12,2,6,Yes,Positive
13,1,1,Yes,Negative
14,2,3,Yes,Positive
;;;;
How do I create a table from this dataset that groups screening and status results in this following way? I would like to primarily group my new data table by age range (1-6) and gender (1-2).
Age Range | Gender | Positive | Screened | All | All Screened Difference | Screened Positive Difference | Screening rate | Prevalence Rate |
1 | 1 | 10 | 2780 | 3935 | 1155 | 2770 | 71% | 0.4% |
1 | 2 | 6 | 2325 | 3302 | 977 | 2319 | 70% | 0.3% |
2 | 1 | 25 | 1172 | 1386 | 214 | 1147 | 85% | 2.1% |
2 | 2 | 22 | 3098 | 3592 | 494 | 3076 | 86% | 0.7% |
3 | 1 | 91 | 951 | 1149 | 198 | 860 | 83% | 9.6% |
3 | 2 | 39 | 1835 | 2221 | 386 | 1796 | 83% | 2.1% |
4 | 1 | 189 | 1073 | 1317 | 244 | 884 | 81% | 17.6% |
4 | 2 | 27 | 945 | 1190 | 245 | 918 | 79% | 2.9% |
5 | 1 | 198 | 1586 | 1921 | 335 | 1388 | 83% | 12.5% |
5 | 2 | 42 | 1132 | 1413 | 281 | 1090 | 80% | 3.7% |
6 | 1 | 179 | 3902 | 4793 | 891 | 3723 | 81% | 4.6% |
6 | 2 | 175 | 3932 | 4862 | 930 | 3757 | 81% | 4.5% |
Here are the explanation of the columns:
- Positive: Unique count under each age range and specific gender where the status variable had 'Positive' in it.
- Screened: Unique count under each age range and specific gender where the screened variable had 'Yes' in it.
- All: Unique count under each age range and specific gender in the dataset.
- All Screened Difference: All - Screened
- Screened Positive Difference: Screened - Positive
- Screening rate: Screened/Total
- Positive prevalence; Positive/Screen
Thanks so much for your help!
I did a roundabout way of getting to your solution. I separated genders into two different tables and then merged them together and proc sort them according to age range and gender. I'm not sure what exactly happened though. The output of your code and mine are off. Do you know what is wrong with the code? I added a bigger dataset so that all age ranges for all genders would have some kind of entry.
/*extended sample dataset of 30 entries*/
data sampleunique;
infile datalines dsd truncover;
input ID:BEST. Gender:BEST. AgeRange:BEST. Screen:$3. Result:$8.;
datalines4;
1,1,1,Yes,Positive
2,2,2,No,
3,1,3,Yes,Negative
4,2,4,No,
5,1,5,Yes,Positive
6,2,6,No,
7,1,1,Yes,Negative
8,2,2,No,
9,1,3,Yes,Positive
10,2,4,No,
11,1,5,Yes,Negative
12,2,6,Yes,Positive
13,1,1,Yes,Negative
14,2,3,Yes,Positive
15,1,1,Yes,Negative
16,1,2,Yes,Negative
17,1,3,Yes,Negative
18,1,4,No,
19,1,5,No,
20,1,6,No,
21,2,1,No,
22,2,2,No,
23,2,3,No,
24,2,4,No,
25,2,5,Yes,Positive
26,2,6,Yes,Positive
27,2,1,Yes,Positive
28,2,6,Yes,Negative
29,1,3,Yes,Negative
30,2,1,Yes,Positive
;;;;
/*this was your code*/
proc sql;
create table WANT as
select AGERANGE,
GENDER,
count(case when char(RESULT,1)="P" then 1 else 0 end) as POSITIVE,
count(case when char(SCREEN,1)="Y" then 1 else 0 end) as SCREENED,
count(*) as TOTAL,
CALCULATED TOTAL - CALCULATED SCREENED as TOTAL_SCREEN_DIFF,
CALCULATED SCREENED - CALCULATED POSITIVE as SCREEN_POS_DIFF,
(CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREEN_RATE,
(CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POS_PREV
from SAMPLEUNIQUE
group by AGERANGE,
GENDER;
quit;
/*this is mine*/
proc sql;
create table sampleMALE as
select agerange,
sum(case when char(result,1)="P" then 1 else 0 end) as POSITIVE,
sum(case when screen="Yes" then 1 else 0 end) as SCREENED,
count(*) as TOTAL,
(CALCULATED TOTAL - CALCULATED SCREENED) as TotalScreenedDifference,
(CALCULATED SCREENED - CALCULATED POSITIVE) as ScreenedPositiveDifference,
(CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREENINGRATE,
(CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POSITIVEPREVALENCE
from sampleunique
where gender=1
group by agerange;
quit;
data samplemale;
set samplemale;
gender=1;
run;
proc sql;
create table sampleFEMALE as
select agerange,
sum(case when char(result,1)="P" then 1 else 0 end) as POSITIVE,
sum(case when screen="Yes" then 1 else 0 end) as SCREENED,
count(*) as TOTAL,
(CALCULATED TOTAL - CALCULATED SCREENED) as TotalScreenedDifference,
(CALCULATED SCREENED - CALCULATED POSITIVE) as ScreenedPositiveDifference,
(CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREENINGRATE,
(CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POSITIVEPREVALENCE
from sampleunique
where gender=2
group by agerange;
quit;
data sampleFEMALE;
set sampleFEMALE;
Gender=2;
run;
data sampleMERGE;
retain agerange gender positive;
set sampleMALE sampleFEMALE;
run;
proc sort data=sampleMERGE;
by agerange gender;
run;
Exactly the same as in this post:
You just change the groups?
I got most of the code to work, but not quite sure how to group genders in. Whenenver I select agerange and gender and group by the same two variables, the table creates 31,000+ entries.
I am using my real dataset in here, so the variables are a bit different
proc sql;
create table janfeb.Janhosptypescreeningresults as
select agerange,
sum(case when char(hcvresult,1)="P" then 1 else 0 end) as POSITIVE,
sum(case when hcvscreening="Yes" then 1 else 0 end) as SCREENED,
count(*) as TOTAL,
(CALCULATED TOTAL - CALCULATED SCREENED) as TotalScreenedDifference,
(CALCULATED SCREENED - CALCULATED POSITIVE) as ScreenedPositiveDifference,
(CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREENINGRATE,
(CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POSITIVEPREVALENCE
from janfeb.jfgmi
where dcdeathdate >= dhms(mdy(02,01,2017),0,0,0) and dcdeathdate <= dhms(mdy(02,28,2017),0,0,0)
group by agerange;
quit;
How do I include the count for agerange and gender into this new table? Thanks!
Mabye restart your question by asking what it is your actually trying to do. From what you have said it sounds like your creating a table based on demography data. The simplest way to do this would be to do a proc freq on your data, then post process the resulting output from that, something like:
data SAMPLEUNIQUE; infile datalines dsd truncover; input ID:BEST. Gender:BEST. AgeRange:BEST. Screen:$3. Result:$8.; datalines4; 1,1,1,Yes,Positive 2,2,2,No, 3,1,3,Yes,Negative 4,2,4,No, 5,1,5,Yes,Positive 6,2,6,No, 7,1,1,Yes,Negative 8,2,2,No, 9,1,3,Yes,Positive 10,2,4,No, 11,1,5,Yes,Negative 12,2,6,Yes,Positive 13,1,1,Yes,Negative 14,2,3,Yes,Positive ;;;; run; proc freq data=sampleunique noprint; tables gender * agerange * screen * result / out=want; run;
The want table can then be transposed up if needed, and you can add screening counts to that and then calculate the percentages.
Thanks for your suggestions. How do I transpose it and add the difference and percentages? I proc freq that table and got this dataset:
data WORK.WANT;
infile datalines dsd truncover;
input agerange:32. gender:$1. HCVScreening:$9. HCVresult:$27. COUNT:32. PERCENT:32.;
datalines4;
1,1,No,,1114,
1,1,No,Negative,41,0.1619209352
1,1,Yes,Negative,2770,10.939536353
1,1,Yes,Positive,10,0.039492911
1,2,No,,931,
1,2,No,Negative,46,0.1816673907
1,2,Yes,Negative,2319,9.1584060661
1,2,Yes,Positive,6,0.0236957466
2,1,,,1,
2,1,No,,193,
2,1,No,Negative,18,0.0710872398
2,1,No,Positive,2,0.0078985822
2,1,Yes,Negative,1147,4.5298368943
2,1,Yes,Positive,25,0.0987322776
2,2,No,,389,
2,2,No,Negative,105,0.4146755657
2,2,Yes,Negative,3076,12.148019431
2,2,Yes,Positive,22,0.0868844042
3,1,,,1,
3,1,No,,173,
3,1,No,Negative,23,0.0908336954
3,1,No,Positive,1,0.0039492911
3,1,Yes,Negative,860,3.3963903479
3,1,Yes,Positive,91,0.3593854903
3,2,No,,314,
3,2,No,Negative,72,0.2843489594
3,2,Yes,Negative,1796,7.0929268196
3,2,Yes,Positive,39,0.154022353
4,1,,,4,
4,1,No,,221,
4,1,No,Negative,18,0.0710872398
4,1,No,Positive,1,0.0039492911
4,1,Yes,Negative,884,3.4911733344
4,1,Yes,Positive,189,0.7464160183
4,2,No,,195,
4,2,No,Negative,50,0.1974645551
4,2,Yes,Negative,918,3.6254492319
4,2,Yes,Positive,27,0.1066308598
5,1,,,3,
5,1,No,,305,
5,1,No,Negative,27,0.1066308598
5,1,Yes,Negative,1388,5.4816160499
5,1,Yes,Positive,198,0.7819596382
5,2,No,,241,
5,2,No,Negative,40,0.1579716441
5,2,Yes,Negative,1090,4.3047273014
5,2,Yes,Positive,42,0.1658702263
6,1,,,2,
6,1,No,,825,
6,1,No,Negative,62,0.2448560483
6,1,Yes,Negative,3725,14.711109356
6,1,Yes,Positive,179,0.7069231073
6,2,,,1,
6,2,No,,846,
6,2,No,Negative,82,0.3238418704
6,2,Yes,,1,
6,2,Yes,Negative,3757,14.837486671
6,2,Yes,Positive,175,0.6911259429
;;;;
Is there no way to do this on proc sql? I tried to select and group by agerange and gender, but it gave me 31,000 entries. Not quite sure what happened.
Well, you could do it in SQL, it just becomes a bit fiddly see below. Thats why its easier to do a proc freq, then in datastep move your data up (see methods to reshape data, a ton of results on Google). Note your test data doesn't seem to have any differences in it or maybe I misunderstand the total?
data SAMPLEUNIQUE; infile datalines dsd truncover; input ID:BEST. Gender:BEST. AgeRange:BEST. Screen:$3. Result:$8.; datalines4; 1,1,1,Yes,Positive 2,2,2,No, 3,1,3,Yes,Negative 4,2,4,No, 5,1,5,Yes,Positive 6,2,6,No, 7,1,1,Yes,Negative 8,2,2,No, 9,1,3,Yes,Positive 10,2,4,No, 11,1,5,Yes,Negative 12,2,6,Yes,Positive 13,1,1,Yes,Negative 14,2,3,Yes,Positive ;;;; run; proc sql; create table WANT as select AGERANGE, GENDER, count(case when char(RESULT,1)="P" then 1 else 0 end) as POSITIVE, count(case when char(SCREEN,1)="Y" then 1 else 0 end) as SCREENED, count(*) as ALL, CALCULATED ALL - CALCULATED SCREENED as ALL_DIFF, CALCULATED SCREENED - CALCULATED POSITIVE as SCREEN_POS_DIFF, (CALCULATED SCREENED / CALCULATED ALL) * 100 as SCREEN_RATE, (CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POS_PREV from SAMPLEUNIQUE group by AGERANGE, GENDER; quit;
There are no differences in the test data. The difference variables are basically the following:
All_Screened_Diff= Distinct count of EVERYONE in that agerange and gender minus distinct count of those SCREENED in that agerange and gender
Screened_Pos_Diff= Distinct count of SCREENED in that agerange and gender minus distinct count of those POSITIVE in result in that agerange and gender
I did a roundabout way of getting to your solution. I separated genders into two different tables and then merged them together and proc sort them according to age range and gender. I'm not sure what exactly happened though. The output of your code and mine are off. Do you know what is wrong with the code? I added a bigger dataset so that all age ranges for all genders would have some kind of entry.
/*extended sample dataset of 30 entries*/
data sampleunique;
infile datalines dsd truncover;
input ID:BEST. Gender:BEST. AgeRange:BEST. Screen:$3. Result:$8.;
datalines4;
1,1,1,Yes,Positive
2,2,2,No,
3,1,3,Yes,Negative
4,2,4,No,
5,1,5,Yes,Positive
6,2,6,No,
7,1,1,Yes,Negative
8,2,2,No,
9,1,3,Yes,Positive
10,2,4,No,
11,1,5,Yes,Negative
12,2,6,Yes,Positive
13,1,1,Yes,Negative
14,2,3,Yes,Positive
15,1,1,Yes,Negative
16,1,2,Yes,Negative
17,1,3,Yes,Negative
18,1,4,No,
19,1,5,No,
20,1,6,No,
21,2,1,No,
22,2,2,No,
23,2,3,No,
24,2,4,No,
25,2,5,Yes,Positive
26,2,6,Yes,Positive
27,2,1,Yes,Positive
28,2,6,Yes,Negative
29,1,3,Yes,Negative
30,2,1,Yes,Positive
;;;;
/*this was your code*/
proc sql;
create table WANT as
select AGERANGE,
GENDER,
count(case when char(RESULT,1)="P" then 1 else 0 end) as POSITIVE,
count(case when char(SCREEN,1)="Y" then 1 else 0 end) as SCREENED,
count(*) as TOTAL,
CALCULATED TOTAL - CALCULATED SCREENED as TOTAL_SCREEN_DIFF,
CALCULATED SCREENED - CALCULATED POSITIVE as SCREEN_POS_DIFF,
(CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREEN_RATE,
(CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POS_PREV
from SAMPLEUNIQUE
group by AGERANGE,
GENDER;
quit;
/*this is mine*/
proc sql;
create table sampleMALE as
select agerange,
sum(case when char(result,1)="P" then 1 else 0 end) as POSITIVE,
sum(case when screen="Yes" then 1 else 0 end) as SCREENED,
count(*) as TOTAL,
(CALCULATED TOTAL - CALCULATED SCREENED) as TotalScreenedDifference,
(CALCULATED SCREENED - CALCULATED POSITIVE) as ScreenedPositiveDifference,
(CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREENINGRATE,
(CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POSITIVEPREVALENCE
from sampleunique
where gender=1
group by agerange;
quit;
data samplemale;
set samplemale;
gender=1;
run;
proc sql;
create table sampleFEMALE as
select agerange,
sum(case when char(result,1)="P" then 1 else 0 end) as POSITIVE,
sum(case when screen="Yes" then 1 else 0 end) as SCREENED,
count(*) as TOTAL,
(CALCULATED TOTAL - CALCULATED SCREENED) as TotalScreenedDifference,
(CALCULATED SCREENED - CALCULATED POSITIVE) as ScreenedPositiveDifference,
(CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREENINGRATE,
(CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POSITIVEPREVALENCE
from sampleunique
where gender=2
group by agerange;
quit;
data sampleFEMALE;
set sampleFEMALE;
Gender=2;
run;
data sampleMERGE;
retain agerange gender positive;
set sampleMALE sampleFEMALE;
run;
proc sort data=sampleMERGE;
by agerange gender;
run;
You should not put where clauses with group by's. If you need to restrict it then (select * from have where xyz) subquery it. As for different counts, sure your doing different grouping, you will get different output?
Thanks. How would you put the where statement here?
proc sql;
create table janfeb.Febscreeningresults as
select agerange, gender,
sum(case when char(hcvresult,1)="P" then 1 else 0 end) as POSITIVE,
sum(case when hcvscreening="Yes" then 1 else 0 end) as SCREENED,
count(*) as TOTAL,
(CALCULATED TOTAL - CALCULATED SCREENED) as TotalScreenedDifference,
(CALCULATED SCREENED - CALCULATED POSITIVE) as ScreenedPositiveDifference,
(CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREENINGRATE,
(CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POSITIVEPREVALENCE
from janfeb.jfgmi
where dcdeathdate >= dhms(mdy(02,01,2017),0,0,0) and dcdeathdate <= dhms(mdy(02,28,2017),0,0,0)
group by agerange, gender;
quit;
It seems to work fine on sas when i put the where statement below the from statement. Thanks again.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.