BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
byeh2017
Quartz | Level 8

Hello. I have a dataset that contains 5 variables. "ID" is a unique identifier of a person in my dataset. "Gender" is male (1) or female (2). AgeRange contains only 1 through 6, which indicates which age range they fall into based on a categorical scheme for my report. "Screen" is if they were ever medically screened. "Status" has 3 entries, positive, negative, or blank. It is blank when the patient has not been screened. Here is the sample dataset:

 

 

data JANFEB.SAMPLEUNIQUE;
  infile datalines dsd truncover;
  input ID:BEST. Gender:BEST. AgeRange:BEST. Screen:$3. Result:$8.;
datalines4;
1,1,1,Yes,Positive
2,2,2,No,
3,1,3,Yes,Negative
4,2,4,No,
5,1,5,Yes,Positive
6,2,6,No,
7,1,1,Yes,Negative
8,2,2,No,
9,1,3,Yes,Positive
10,2,4,No,
11,1,5,Yes,Negative
12,2,6,Yes,Positive
13,1,1,Yes,Negative
14,2,3,Yes,Positive
;;;;

 

How do I create a table from this dataset that groups screening and status results in this following way? I would like to primarily group my new data table by age range (1-6) and gender (1-2).

 

Age RangeGenderPositiveScreenedAllAll Screened DifferenceScreened Positive DifferenceScreening ratePrevalence Rate
1110278039351155277071%0.4%
12623253302977231970%0.3%
212511721386214114785%2.1%
222230983592494307686%0.7%
3191951114919886083%9.6%
323918352221386179683%2.1%
411891073131724488481%17.6%
4227945119024591879%2.9%
5119815861921335138883%12.5%
524211321413281109080%3.7%
6117939024793891372381%4.6%
6217539324862930375781%4.5%

 

Here are the explanation of the columns:

- Positive: Unique count under each age range and specific gender where the status variable had 'Positive' in it.

- Screened: Unique count under each age range and specific gender where the screened variable had 'Yes' in it.

- All: Unique count under each age range and specific gender in the dataset.

- All Screened Difference: All - Screened

Screened Positive Difference: Screened - Positive

- Screening rate: Screened/Total

- Positive prevalence; Positive/Screen

 

Thanks so much for your help!

1 ACCEPTED SOLUTION

Accepted Solutions
byeh2017
Quartz | Level 8

I did a roundabout way of getting to your solution. I separated genders into two different tables and then merged them together and proc sort them according to age range and gender. I'm not sure what exactly happened though. The output of your code and mine are off. Do you know what is wrong with the code? I added a bigger dataset so that all age ranges for all genders would have some kind of entry.

 

/*extended sample dataset of 30 entries*/

data sampleunique;
  infile datalines dsd truncover;
  input ID:BEST. Gender:BEST. AgeRange:BEST. Screen:$3. Result:$8.;
datalines4;
1,1,1,Yes,Positive
2,2,2,No,
3,1,3,Yes,Negative
4,2,4,No,
5,1,5,Yes,Positive
6,2,6,No,
7,1,1,Yes,Negative
8,2,2,No,
9,1,3,Yes,Positive
10,2,4,No,
11,1,5,Yes,Negative
12,2,6,Yes,Positive
13,1,1,Yes,Negative
14,2,3,Yes,Positive
15,1,1,Yes,Negative
16,1,2,Yes,Negative
17,1,3,Yes,Negative
18,1,4,No,
19,1,5,No,
20,1,6,No,
21,2,1,No,
22,2,2,No,
23,2,3,No,
24,2,4,No,
25,2,5,Yes,Positive
26,2,6,Yes,Positive
27,2,1,Yes,Positive
28,2,6,Yes,Negative
29,1,3,Yes,Negative
30,2,1,Yes,Positive
;;;;

/*this was your code*/

proc sql;
  create table WANT as
  select  AGERANGE,
          GENDER,
          count(case when char(RESULT,1)="P" then 1 else 0 end) as POSITIVE,
          count(case when char(SCREEN,1)="Y" then 1 else 0 end) as SCREENED,
          count(*) as TOTAL,
          CALCULATED TOTAL - CALCULATED SCREENED as TOTAL_SCREEN_DIFF,
          CALCULATED SCREENED - CALCULATED POSITIVE as SCREEN_POS_DIFF,
          (CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREEN_RATE,
          (CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POS_PREV
  from    SAMPLEUNIQUE
  group by AGERANGE,
           GENDER;
quit;

/*this is mine*/

proc sql;
  create table sampleMALE as
  select  agerange,
  		  sum(case when char(result,1)="P" then 1 else 0 end) as POSITIVE,
          sum(case when screen="Yes" then 1 else 0 end) as SCREENED,
          count(*) as TOTAL,
          (CALCULATED TOTAL - CALCULATED SCREENED) as TotalScreenedDifference,
          (CALCULATED SCREENED - CALCULATED POSITIVE) as ScreenedPositiveDifference,
          (CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREENINGRATE,
          (CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POSITIVEPREVALENCE
  from    sampleunique
  where gender=1
  group by agerange;
quit;

data samplemale;
set samplemale;
gender=1;
run;

proc sql;
  create table sampleFEMALE as
  select  agerange,
  		  sum(case when char(result,1)="P" then 1 else 0 end) as POSITIVE,
          sum(case when screen="Yes" then 1 else 0 end) as SCREENED,
          count(*) as TOTAL,
          (CALCULATED TOTAL - CALCULATED SCREENED) as TotalScreenedDifference,
          (CALCULATED SCREENED - CALCULATED POSITIVE) as ScreenedPositiveDifference,
          (CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREENINGRATE,
          (CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POSITIVEPREVALENCE
  from    sampleunique
  where gender=2
  group by agerange;
quit;

data sampleFEMALE;
set sampleFEMALE;
Gender=2;
run;

data sampleMERGE;
retain agerange gender positive;
set sampleMALE sampleFEMALE;
run;

proc sort data=sampleMERGE;
by agerange gender;
run;

View solution in original post

9 REPLIES 9
byeh2017
Quartz | Level 8

I got most of the code to work, but not quite sure how to group genders in. Whenenver I select agerange and gender and group by the same two variables, the table creates 31,000+ entries.

 

I am using my real dataset in here, so the variables are a bit different

 

proc sql;
  create table janfeb.Janhosptypescreeningresults as
  select  agerange,
  		  sum(case when char(hcvresult,1)="P" then 1 else 0 end) as POSITIVE,
          sum(case when hcvscreening="Yes" then 1 else 0 end) as SCREENED,
          count(*) as TOTAL,
          (CALCULATED TOTAL - CALCULATED SCREENED) as TotalScreenedDifference,
          (CALCULATED SCREENED - CALCULATED POSITIVE) as ScreenedPositiveDifference,
          (CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREENINGRATE,
          (CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POSITIVEPREVALENCE
  from    janfeb.jfgmi
  where dcdeathdate >= dhms(mdy(02,01,2017),0,0,0) and dcdeathdate <= dhms(mdy(02,28,2017),0,0,0)
  group by agerange;
quit;

How do I include the count for agerange and gender into this new table? Thanks!

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Mabye restart your question by asking what it is your actually trying to do.  From what you have said it sounds like your creating a table based on demography data.  The simplest way to do this would be to do a proc freq on your data, then post process the resulting output from that, something like:

data SAMPLEUNIQUE;
  infile datalines dsd truncover;
  input ID:BEST. Gender:BEST. AgeRange:BEST. Screen:$3. Result:$8.;
datalines4;
1,1,1,Yes,Positive
2,2,2,No,
3,1,3,Yes,Negative
4,2,4,No,
5,1,5,Yes,Positive
6,2,6,No,
7,1,1,Yes,Negative
8,2,2,No,
9,1,3,Yes,Positive
10,2,4,No,
11,1,5,Yes,Negative
12,2,6,Yes,Positive
13,1,1,Yes,Negative
14,2,3,Yes,Positive
;;;;
run;

proc freq data=sampleunique noprint;
  tables gender * agerange * screen * result / out=want;
run;

The want table can then be transposed up if needed, and you can add screening counts to that and then calculate the percentages.

byeh2017
Quartz | Level 8

Thanks for your suggestions. How do I transpose it and add the difference and percentages? I proc freq that table and got this dataset:

 

data WORK.WANT;
  infile datalines dsd truncover;
  input agerange:32. gender:$1. HCVScreening:$9. HCVresult:$27. COUNT:32. PERCENT:32.;
datalines4;
1,1,No,,1114,
1,1,No,Negative,41,0.1619209352
1,1,Yes,Negative,2770,10.939536353
1,1,Yes,Positive,10,0.039492911
1,2,No,,931,
1,2,No,Negative,46,0.1816673907
1,2,Yes,Negative,2319,9.1584060661
1,2,Yes,Positive,6,0.0236957466
2,1,,,1,
2,1,No,,193,
2,1,No,Negative,18,0.0710872398
2,1,No,Positive,2,0.0078985822
2,1,Yes,Negative,1147,4.5298368943
2,1,Yes,Positive,25,0.0987322776
2,2,No,,389,
2,2,No,Negative,105,0.4146755657
2,2,Yes,Negative,3076,12.148019431
2,2,Yes,Positive,22,0.0868844042
3,1,,,1,
3,1,No,,173,
3,1,No,Negative,23,0.0908336954
3,1,No,Positive,1,0.0039492911
3,1,Yes,Negative,860,3.3963903479
3,1,Yes,Positive,91,0.3593854903
3,2,No,,314,
3,2,No,Negative,72,0.2843489594
3,2,Yes,Negative,1796,7.0929268196
3,2,Yes,Positive,39,0.154022353
4,1,,,4,
4,1,No,,221,
4,1,No,Negative,18,0.0710872398
4,1,No,Positive,1,0.0039492911
4,1,Yes,Negative,884,3.4911733344
4,1,Yes,Positive,189,0.7464160183
4,2,No,,195,
4,2,No,Negative,50,0.1974645551
4,2,Yes,Negative,918,3.6254492319
4,2,Yes,Positive,27,0.1066308598
5,1,,,3,
5,1,No,,305,
5,1,No,Negative,27,0.1066308598
5,1,Yes,Negative,1388,5.4816160499
5,1,Yes,Positive,198,0.7819596382
5,2,No,,241,
5,2,No,Negative,40,0.1579716441
5,2,Yes,Negative,1090,4.3047273014
5,2,Yes,Positive,42,0.1658702263
6,1,,,2,
6,1,No,,825,
6,1,No,Negative,62,0.2448560483
6,1,Yes,Negative,3725,14.711109356
6,1,Yes,Positive,179,0.7069231073
6,2,,,1,
6,2,No,,846,
6,2,No,Negative,82,0.3238418704
6,2,Yes,,1,
6,2,Yes,Negative,3757,14.837486671
6,2,Yes,Positive,175,0.6911259429
;;;;

Is there no way to do this on proc sql? I tried to select and group by agerange and gender, but it gave me 31,000 entries. Not quite sure what happened.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, you could do it in SQL, it just becomes a bit fiddly see below.  Thats why its easier to do a proc freq, then in datastep move your data up (see methods to reshape data, a ton of results on Google).  Note your test data doesn't seem to have any differences in it or maybe I misunderstand the total?

data SAMPLEUNIQUE;
  infile datalines dsd truncover;
  input ID:BEST. Gender:BEST. AgeRange:BEST. Screen:$3. Result:$8.;
datalines4;
1,1,1,Yes,Positive
2,2,2,No,
3,1,3,Yes,Negative
4,2,4,No,
5,1,5,Yes,Positive
6,2,6,No,
7,1,1,Yes,Negative
8,2,2,No,
9,1,3,Yes,Positive
10,2,4,No,
11,1,5,Yes,Negative
12,2,6,Yes,Positive
13,1,1,Yes,Negative
14,2,3,Yes,Positive
;;;;
run;

proc sql;
  create table WANT as
  select  AGERANGE,
          GENDER,
          count(case when char(RESULT,1)="P" then 1 else 0 end) as POSITIVE,
          count(case when char(SCREEN,1)="Y" then 1 else 0 end) as SCREENED,
          count(*) as ALL,
          CALCULATED ALL - CALCULATED SCREENED as ALL_DIFF,
          CALCULATED SCREENED - CALCULATED POSITIVE as SCREEN_POS_DIFF,
          (CALCULATED SCREENED / CALCULATED ALL) * 100 as SCREEN_RATE,
          (CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POS_PREV
  from    SAMPLEUNIQUE
  group by AGERANGE,
           GENDER;
quit;
byeh2017
Quartz | Level 8

There are no differences in the test data. The difference variables are basically the following:

 

All_Screened_Diff= Distinct count of EVERYONE in that agerange and gender minus distinct count of those SCREENED in that agerange and gender

 

Screened_Pos_Diff= Distinct count of SCREENED in that agerange and gender minus distinct count of those POSITIVE in result in that agerange and gender

byeh2017
Quartz | Level 8

I did a roundabout way of getting to your solution. I separated genders into two different tables and then merged them together and proc sort them according to age range and gender. I'm not sure what exactly happened though. The output of your code and mine are off. Do you know what is wrong with the code? I added a bigger dataset so that all age ranges for all genders would have some kind of entry.

 

/*extended sample dataset of 30 entries*/

data sampleunique;
  infile datalines dsd truncover;
  input ID:BEST. Gender:BEST. AgeRange:BEST. Screen:$3. Result:$8.;
datalines4;
1,1,1,Yes,Positive
2,2,2,No,
3,1,3,Yes,Negative
4,2,4,No,
5,1,5,Yes,Positive
6,2,6,No,
7,1,1,Yes,Negative
8,2,2,No,
9,1,3,Yes,Positive
10,2,4,No,
11,1,5,Yes,Negative
12,2,6,Yes,Positive
13,1,1,Yes,Negative
14,2,3,Yes,Positive
15,1,1,Yes,Negative
16,1,2,Yes,Negative
17,1,3,Yes,Negative
18,1,4,No,
19,1,5,No,
20,1,6,No,
21,2,1,No,
22,2,2,No,
23,2,3,No,
24,2,4,No,
25,2,5,Yes,Positive
26,2,6,Yes,Positive
27,2,1,Yes,Positive
28,2,6,Yes,Negative
29,1,3,Yes,Negative
30,2,1,Yes,Positive
;;;;

/*this was your code*/

proc sql;
  create table WANT as
  select  AGERANGE,
          GENDER,
          count(case when char(RESULT,1)="P" then 1 else 0 end) as POSITIVE,
          count(case when char(SCREEN,1)="Y" then 1 else 0 end) as SCREENED,
          count(*) as TOTAL,
          CALCULATED TOTAL - CALCULATED SCREENED as TOTAL_SCREEN_DIFF,
          CALCULATED SCREENED - CALCULATED POSITIVE as SCREEN_POS_DIFF,
          (CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREEN_RATE,
          (CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POS_PREV
  from    SAMPLEUNIQUE
  group by AGERANGE,
           GENDER;
quit;

/*this is mine*/

proc sql;
  create table sampleMALE as
  select  agerange,
  		  sum(case when char(result,1)="P" then 1 else 0 end) as POSITIVE,
          sum(case when screen="Yes" then 1 else 0 end) as SCREENED,
          count(*) as TOTAL,
          (CALCULATED TOTAL - CALCULATED SCREENED) as TotalScreenedDifference,
          (CALCULATED SCREENED - CALCULATED POSITIVE) as ScreenedPositiveDifference,
          (CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREENINGRATE,
          (CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POSITIVEPREVALENCE
  from    sampleunique
  where gender=1
  group by agerange;
quit;

data samplemale;
set samplemale;
gender=1;
run;

proc sql;
  create table sampleFEMALE as
  select  agerange,
  		  sum(case when char(result,1)="P" then 1 else 0 end) as POSITIVE,
          sum(case when screen="Yes" then 1 else 0 end) as SCREENED,
          count(*) as TOTAL,
          (CALCULATED TOTAL - CALCULATED SCREENED) as TotalScreenedDifference,
          (CALCULATED SCREENED - CALCULATED POSITIVE) as ScreenedPositiveDifference,
          (CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREENINGRATE,
          (CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POSITIVEPREVALENCE
  from    sampleunique
  where gender=2
  group by agerange;
quit;

data sampleFEMALE;
set sampleFEMALE;
Gender=2;
run;

data sampleMERGE;
retain agerange gender positive;
set sampleMALE sampleFEMALE;
run;

proc sort data=sampleMERGE;
by agerange gender;
run;
RW9
Diamond | Level 26 RW9
Diamond | Level 26

You should not put where clauses with group by's.  If you need to restrict it then (select * from have where xyz) subquery it.  As for different counts, sure your doing different grouping, you will get different output?

byeh2017
Quartz | Level 8

Thanks. How would you put the where statement here?

 

proc sql;
  create table janfeb.Febscreeningresults as
  select  agerange, gender,
  		  sum(case when char(hcvresult,1)="P" then 1 else 0 end) as POSITIVE,
          sum(case when hcvscreening="Yes" then 1 else 0 end) as SCREENED,
          count(*) as TOTAL,
          (CALCULATED TOTAL - CALCULATED SCREENED) as TotalScreenedDifference,
          (CALCULATED SCREENED - CALCULATED POSITIVE) as ScreenedPositiveDifference,
          (CALCULATED SCREENED / CALCULATED TOTAL) * 100 as SCREENINGRATE,
          (CALCULATED POSITIVE / CALCULATED SCREENED) * 100 as POSITIVEPREVALENCE
  from    janfeb.jfgmi
  where dcdeathdate >= dhms(mdy(02,01,2017),0,0,0) and dcdeathdate <= dhms(mdy(02,28,2017),0,0,0)
  group by agerange, gender;
quit;

It seems to work fine on sas when i put the where statement below the from statement. Thanks again.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 3058 views
  • 0 likes
  • 2 in conversation