SAS community (1st time messaging)
I am stuck trying to determine the last part of my code. Following are made up medical data were patients sought specialty care in the surrounding community (city, state). Desired output should array data by CLINIC, PRODUCT_LINE, and VISITS. I want to list only (by-CLINIC and by PRODUCT_LINE) the top three rows including VISITS (descending order); thereafter, row four should sum the remaining PRODUCT_LINE VISITS.
DATA CARE ;
INPUT PRODUCT_LINE$12. CITY$12. ST$3. CLINIC$5. VISITS 4.0;
LENGTH CITYST $20. ;
CITYST=CITY||'_'||ST ;
CARDS ;
CARDIOLOGY TEMPLE TX 0110 200
CARDIOLOGY TEMPLE TX 0110 300
CARDIOLOGY LUBBOCK TX 0110 50
CARDIOLOGY BUDA TX 0110 300
CARDIOLOGY BANDERA TX 0110 200
CARDIOLOGY SAN ANTONIO TX 0110 20
CARDIOLOGY TEMPLE TX 0110 100
CARDIOLOGY SAN ANTONIO TX 0110 50
CARDIOLOGY SAN ANTONIO TX 0110 10
CARDIOLOGY TEMPLE TX 0110 100
CARDIOLOGY TEMPLE TX 0110 100
;
RUN ;
PROC SUMMARY data=CARE nway ;
var VISITS ;
class CLINIC PRODUCT_LINE CITYST / missing ;
output out=CARE_TEST3 (drop=_TYPE_ _FREQ_) sum=VISITS ;
run ;
PROC SORT DATA=CARE_TEST3 ;
BY CLINIC ;
RUN ;
PROC RANK DATA=CARE_TEST3 OUT=V_TEST DESCENDING ;
BY CLINIC PRODUCT_LINE ;
VAR VISITS ;
RANKS R_VISITS ;
RUN ;
PROC SORT DATA=V_TEST ;
BY CLINIC PRODUCT_LINE R_VISITS CITYST ;
WHERE R_VISITS <=3 ;
RUN ;
SORT V_TEST RAW OUTPUT
Obs CLINIC PRODUCT_LINE CITYST VISITS R_VISITS
1 0110 CARDIOLOGY TEMPLE _TX 800 1
2 0110 CARDIOLOGY BUDA _TX 300 2
3 0110 CARDIOLOGY BANDERA _TX 200 3
DESIRED CARDIOLOGY OUTPUT:
Obs CLINIC PRODUCT_LINE CITYST VISITS R_VISITS
1 0110 CARDIOLOGY TEMPLE _TX 800 1
2 0110 CARDIOLOGY BUDA _TX 300 2
3 0110 CARDIOLOGY BANDERA _TX 200 3
3 0110 CARDIOLOGY CARE_ELSEWHERE 130 4
Thanks in advance for any help!
Edit note: I've combined the original answer plus the follow-up, so that the response marked as "solution" makes sense:
yesterday - last edited yesterday
I don't think you need proc rank. And I also suggest you let proc summary not only provide the 3-way (clinic*product_line*cityst) total VISITS (with _type_=7), but also the other marginal totals - in particular _TYPE_=6 (marginal totals for clinic*product_line). Then you can
DATA CARE ;
INPUT PRODUCT_LINE$12. CITY$12. ST$3. CLINIC$5. VISITS 4.0;
LENGTH CITYST $20. ;
CITYST=CITY||'_'||ST ;
CARDS ;
CARDIOLOGY TEMPLE TX 0110 200let
CARDIOLOGY TEMPLE TX 0110 300
CARDIOLOGY LUBBOCK TX 0110 50
CARDIOLOGY BUDA TX 0110 300
CARDIOLOGY BANDERA TX 0110 200
CARDIOLOGY SAN ANTONIO TX 0110 20
CARDIOLOGY TEMPLE TX 0110 100
CARDIOLOGY SAN ANTONIO TX 0110 50
CARDIOLOGY SAN ANTONIO TX 0110 10
CARDIOLOGY TEMPLE TX 0110 100
CARDIOLOGY TEMPLE TX 0110 100
RUN ;
PROC SUMMARY data=CARE ;
var VISITS ;
class CLINIC PRODUCT_LINE CITYST / missing ;
output out=CARE_TEST3 (drop=_FREQ_) sum=VISITS ;
run ;
proc sort data=care_test3 (where=(_type_=6 or _type_=7));
by clinic product_line _type_ descending visits;
run;
data want (drop=_:);
set care_test3 ;
retain _total_visits r_visits;
if _type_=6 then do;
_total_visits=visits;
r_visits=0;
end;
if _type_=7;
r_visits+1;
if r_visits<=3 then do;
output;
_total_visits=_total_visits-visits;
end;
else if r_visits=4 then do;
visits=_total_visits;
cityst='CARE_ELSEWHERE';
output;
end;
run;
Note the PROC SORT reads in only the _TYPE_=6 and _TYPE_=7 records produced by PROC SUMMARY. And the sort sorts the data by CLINIC/PRODUCT_LINE. Within each CLINIC/PRODUCT_LINE there will be only one _TYPE_=6 record, followed by the _TYPE_=7 records, which are sorted by descending visit.
The sort output if read, and every time there is a _TYPE_=6 record, _TOTAL_VISITS is reset and r_visit is set to 0. For each _type_=7 record, r_visit is incremented, the data is output (for r_visit<=3), and _TOTAL_VISITS is decremented by VISITS. So when r_visit=4 is encountered it just outputs the updated _TOTAL_VISTS (copied to VISITS).
Response to "Since VISITS can/will reach in the thousands for multiple CITYST throughout the US, if CITYST is >20+ will the aforementioned code still work?"
The code will work no matter how many levels of CITYST (or CLINIC or PRODUCTLINE) is in your data. The "_TYPE_" variables only designate the "type" of aggregation, not the number of levels a class variable might have.
Let's say you have 3 CLASS variables in proc summary, A, B, and C (with Na, Nb, and Nc levels - including missing in your case). Then you can have 2**3=8 "types" of totals/means/other stats. For instance if you have
CLASS A B C, then the _type_ levels would be
I don't think you need proc rank. And I also suggest you let proc summary not only provide the 3-way (clinic*product_line*cityst) total VISITS (with _type_=7), but also the other marginal totals - in particular _TYPE_=6 (marginal totals for clinic*product_line). Then you can
DATA CARE ;
INPUT PRODUCT_LINE$12. CITY$12. ST$3. CLINIC$5. VISITS 4.0;
LENGTH CITYST $20. ;
CITYST=CITY||'_'||ST ;
CARDS ;
CARDIOLOGY TEMPLE TX 0110 200let
CARDIOLOGY TEMPLE TX 0110 300
CARDIOLOGY LUBBOCK TX 0110 50
CARDIOLOGY BUDA TX 0110 300
CARDIOLOGY BANDERA TX 0110 200
CARDIOLOGY SAN ANTONIO TX 0110 20
CARDIOLOGY TEMPLE TX 0110 100
CARDIOLOGY SAN ANTONIO TX 0110 50
CARDIOLOGY SAN ANTONIO TX 0110 10
CARDIOLOGY TEMPLE TX 0110 100
CARDIOLOGY TEMPLE TX 0110 100
RUN ;
PROC SUMMARY data=CARE ;
var VISITS ;
class CLINIC PRODUCT_LINE CITYST / missing ;
output out=CARE_TEST3 (drop=_FREQ_) sum=VISITS ;
run ;
proc sort data=care_test3 (where=(_type_=6 or _type_=7));
by clinic product_line _type_ descending visits;
run;
data want (drop=_:);
set care_test3 ;
retain _total_visits r_visits;
if _type_=6 then do;
_total_visits=visits;
r_visits=0;
end;
if _type_=7;
r_visits+1;
if r_visits<=3 then do;
output;
_total_visits=_total_visits-visits;
end;
else if r_visits=4 then do;
visits=_total_visits;
cityst='CARE_ELSEWHERE';
output;
end;
run;
Note the PROC SORT reads in only the _TYPE_=6 and _TYPE_=7 records produced by PROC SUMMARY. And the sort sorts the data by CLINIC/PRODUCT_LINE. Within each CLINIC/PRODUCT_LINE there will be only one _TYPE_=6 record, followed by the _TYPE_=7 records, which are sorted by descending visit.
The sort output if read, and every time there is a _TYPE_=6 record, _TOTAL_VISITS is reset and r_visit is set to 0. For each _type_=7 record, r_visit is incremented, the data is output (for r_visit<=3), and _TOTAL_VISITS is decremented by VISITS. So when r_visit=4 is encountered it just outputs the updated _TOTAL_VISTS (copied to VISITS).
mkeintz, thank you for the prompt response.
Edit note: I've combined the original answer plus the follow-up, so that the response marked as "solution" makes sense:
yesterday - last edited yesterday
I don't think you need proc rank. And I also suggest you let proc summary not only provide the 3-way (clinic*product_line*cityst) total VISITS (with _type_=7), but also the other marginal totals - in particular _TYPE_=6 (marginal totals for clinic*product_line). Then you can
DATA CARE ;
INPUT PRODUCT_LINE$12. CITY$12. ST$3. CLINIC$5. VISITS 4.0;
LENGTH CITYST $20. ;
CITYST=CITY||'_'||ST ;
CARDS ;
CARDIOLOGY TEMPLE TX 0110 200let
CARDIOLOGY TEMPLE TX 0110 300
CARDIOLOGY LUBBOCK TX 0110 50
CARDIOLOGY BUDA TX 0110 300
CARDIOLOGY BANDERA TX 0110 200
CARDIOLOGY SAN ANTONIO TX 0110 20
CARDIOLOGY TEMPLE TX 0110 100
CARDIOLOGY SAN ANTONIO TX 0110 50
CARDIOLOGY SAN ANTONIO TX 0110 10
CARDIOLOGY TEMPLE TX 0110 100
CARDIOLOGY TEMPLE TX 0110 100
RUN ;
PROC SUMMARY data=CARE ;
var VISITS ;
class CLINIC PRODUCT_LINE CITYST / missing ;
output out=CARE_TEST3 (drop=_FREQ_) sum=VISITS ;
run ;
proc sort data=care_test3 (where=(_type_=6 or _type_=7));
by clinic product_line _type_ descending visits;
run;
data want (drop=_:);
set care_test3 ;
retain _total_visits r_visits;
if _type_=6 then do;
_total_visits=visits;
r_visits=0;
end;
if _type_=7;
r_visits+1;
if r_visits<=3 then do;
output;
_total_visits=_total_visits-visits;
end;
else if r_visits=4 then do;
visits=_total_visits;
cityst='CARE_ELSEWHERE';
output;
end;
run;
Note the PROC SORT reads in only the _TYPE_=6 and _TYPE_=7 records produced by PROC SUMMARY. And the sort sorts the data by CLINIC/PRODUCT_LINE. Within each CLINIC/PRODUCT_LINE there will be only one _TYPE_=6 record, followed by the _TYPE_=7 records, which are sorted by descending visit.
The sort output if read, and every time there is a _TYPE_=6 record, _TOTAL_VISITS is reset and r_visit is set to 0. For each _type_=7 record, r_visit is incremented, the data is output (for r_visit<=3), and _TOTAL_VISITS is decremented by VISITS. So when r_visit=4 is encountered it just outputs the updated _TOTAL_VISTS (copied to VISITS).
Response to "Since VISITS can/will reach in the thousands for multiple CITYST throughout the US, if CITYST is >20+ will the aforementioned code still work?"
The code will work no matter how many levels of CITYST (or CLINIC or PRODUCTLINE) is in your data. The "_TYPE_" variables only designate the "type" of aggregation, not the number of levels a class variable might have.
Let's say you have 3 CLASS variables in proc summary, A, B, and C (with Na, Nb, and Nc levels - including missing in your case). Then you can have 2**3=8 "types" of totals/means/other stats. For instance if you have
CLASS A B C, then the _type_ levels would be
I need to add two more measure to my input statement (weight & cost). How can I incorporate the two additional measures into the 'nth sum by-group' code below? The main descending dimension should continue to be 'visits'.
DATA CARE ;
INPUT PRODUCT_LINE$12. CITY$12. ST$3. CLINIC$5. VISITS 4.0 WEIGHT 4.0 COST 10.2 ;
LENGTH CITYST $20. ;
CITYST=CITY||'_'||ST ;
CARDS ;
CARDIOLOGY TEMPLE TX 0110 200 1.56 225.00
CARDIOLOGY TEMPLE TX 0110 300 1.56 225.00
CARDIOLOGY LUBBOCK TX 0110 50 2.00 275.00
CARDIOLOGY BUDA TX 0110 300 1.23 175.00
CARDIOLOGY BANDERA TX 0110 200 2.01 250.00
CARDIOLOGY SAN ANTONIO TX 0110 20 2.89 375.00
CARDIOLOGY TEMPLE TX 0110 100 1.56 225.00
CARDIOLOGY SAN ANTONIO TX 0110 50 2.89 375.00
CARDIOLOGY SAN ANTONIO TX 0110 10 2.89 375.00
CARDIOLOGY TEMPLE TX 0110 100 1.56 225.00
CARDIOLOGY TEMPLE TX 0110 100 1.56 225.00
RUN ;
By "incorporate the two additional measures into the 'nth sum by-group' code", do you mean that you want the sum of the WEIGHT and COST for each of the locations with the most visits?
If so, why not add the WEIGHT and COST to the VAR statement in proc summary, and also adjust the OUTPUT statement? You can review the documentation on the OUTPUT statement in PROC SUMMARY to see how to get WEIGHT and COST to hold totals.
Then in the DATA WANT step, create and manage the variables _total_weight and _total_cost the same way as you manage _total_visits.
Apologies for my lack of specificity. I did tried several versions of adding the new measure 'PAID' (same as COST) to the SAS code (VAR and OUTPUT statement) but the summation of PAID appears to be inaccurate. In PROC SORT 2 (see below results) amount PAID for Cardiology CARE_ELSEWHERE should be 1400, not 1125. First, I tried sum=VISITS sum=PAID ; run ; ... but PAID seemed to copy over exact values as VISITS. Secondly, I tried sum= ; run; ... this version worked best but PAID summation for CARE_ELSEWHERE is suspect. What am I doing wrong? Thanks in advance!
DATA HOLLYWOODSQ ;
INPUT PRODUCT_LINE$12. CITY$12. ST$3. CLINIC$5. VISITS 4.0 PAID 10.2 ;
LENGTH CITYST $20. ;
CITYST=CATS(CITY,'_',ST) ;
CARDS ;
CARDIOLOGY TEMPLE TX 0110 200 225.00
DERMATOLOGY AUSTIN TX 0110 200 300.00
INTERNAL MED SAN ANTONIO TX 0110 200 375.00
CARDIOLOGY TEMPLE TX 0110 300 225.00
DERMATOLOGY AUSTIN TX 0110 200 300.00
INTERNAL MED AUSTIN TX 0110 500 300.00
CARDIOLOGY LUBBOCK TX 0110 50 275.00
DERMATOLOGY SAN ANTONIO TX 0110 400 375.00
INTERNAL MED LUBBOCK TX 0110 100 275.00
CARDIOLOGY BUDA TX 0110 300 175.00
DERMATOLOGY AUSTIN TX 0110 200 300.00
INTERNAL MED SAN ANTONIO TX 0110 200 375.00
CARDIOLOGY BANDERA TX 0110 200 250.00
DERMATOLOGY CIBOLO TX 0110 200 250.00
INTERNAL MED SAN JOSE TX 0110 200 150.00
CARDIOLOGY SAN ANTONIO TX 0110 20 375.00
DERMATOLOGY AUSTIN TX 0110 200 300.00
INTERNAL MED SAN ANTONIO TX 0110 20 375.00
CARDIOLOGY TEMPLE TX 0110 100 225.00
DERMATOLOGY TEMPLE TX 0110 100 225.00
INTERNAL MED TEMPLE TX 0110 100 225.00
CARDIOLOGY SAN ANTONIO TX 0110 50 375.00
CARDIOLOGY SAN ANTONIO TX 0110 10 375.00
INTERNAL MED AUSTIN TX 0110 50 300.00
INTERNAL MED AUSTIN TX 0110 200 300.00
DERMATOLOGY AUSTIN TX 0110 100 300.00
DERMATOLOGY AUSTIN TX 0110 100 300.00
CARDIOLOGY TEMPLE TX 0110 100 300.00
CARDIOLOGY TEMPLE TX 0110 100 225.00
;
RUN ;
PROC SUMMARY data=HOLLYWOODSQ ;
var VISITS PAID;
class CLINIC PRODUCT_LINE CITYST / missing ;
output out=CARE_TEST3 (drop=_FREQ_) sum= ;
run ;
proc sort data=care_test3 (where=(_type_=6 or _type_=7));
by clinic product_line _type_ descending visits;
run;
PROC PRINT DATA=CARE_TEST3 ;
TITLE 'PROC SORT 1' ;
RUN ;
data want (drop=_:);
set care_test3 ;
retain _total_visits r_visits;
if _type_=6 then do;
_total_visits=visits;
r_visits=0;
end;
if _type_=7;
r_visits+1;
if r_visits<=3 then do;
output;
_total_visits=_total_visits-visits;
end;
else if r_visits=4 then do;
visits=_total_visits;
cityst='CARE_ELSEWHERE';
output;
end;
run;
PROC PRINT DATA=WANT ;
TITLE 'PROC SORT 2' ;
RUN ;
PROC SORT 1
Obs CLINIC PRODUCT_LINE CITYST _TYPE_ VISITS PAID
1 0110 CARDIOLOGY 6 1430 3025
2 0110 CARDIOLOGY TEMPLE_TX 7 800 1200
3 0110 CARDIOLOGY BUDA_TX 7 300 175
4 0110 CARDIOLOGY BANDERA_TX 7 200 250
5 0110 CARDIOLOGY SAN ANTONIO_TX 7 80 1125
6 0110 CARDIOLOGY LUBBOCK_TX 7 50 275
7 0110 DERMATOLOGY 6 1700 2650
8 0110 DERMATOLOGY AUSTIN_TX 7 1000 1800
9 0110 DERMATOLOGY SAN ANTONIO_TX 7 400 375
10 0110 DERMATOLOGY CIBOLO_TX 7 200 250
11 0110 DERMATOLOGY TEMPLE_TX 7 100 225
12 0110 INTERNAL MED 6 1570 2675
13 0110 INTERNAL MED AUSTIN_TX 7 750 900
14 0110 INTERNAL MED SAN ANTONIO_TX 7 420 1125
15 0110 INTERNAL MED SAN JOSE_TX 7 200 150
16 0110 INTERNAL MED LUBBOCK_TX 7 100 275
17 0110 INTERNAL MED TEMPLE_TX 7 100 225
PROC SORT 2
Obs CLINIC PRODUCT_LINE CITYST VISITS PAID r_visits
1 0110 CARDIOLOGY TEMPLE_TX 800 1200 1
2 0110 CARDIOLOGY BUDA_TX 300 175 2
3 0110 CARDIOLOGY BANDERA_TX 200 250 3
4 0110 CARDIOLOGY CARE_ELSEWHERE 130 1125 4
5 0110 DERMATOLOGY AUSTIN_TX 1000 1800 1
6 0110 DERMATOLOGY SAN ANTONIO_TX 400 375 2
7 0110 DERMATOLOGY CIBOLO_TX 200 250 3
8 0110 DERMATOLOGY CARE_ELSEWHERE 100 225 4
9 0110 INTERNAL MED AUSTIN_TX 750 900 1
10 0110 INTERNAL MED SAN ANTONIO_TX 420 1125 2
11 0110 INTERNAL MED SAN JOSE_TX 200 150 3
12 0110 INTERNAL MED CARE_ELSEWHERE 200 275 4
I think you overlooked this part of my response:
Then in the DATA WANT step, create and manage the variables _total_weight and _total_cost the same way as you manage _total_visits.
Don't you intend to do the same for WEIGHT and COST as you did for VISITs - namely take a total and then keep subtracting each subtotal maintain a residual, to be assigned when r_visits=4?
Also, I can't get your initial DATA step to work - perhaps you've collapsed tab characters in the data lines, which defeats the field alignments implied by the input formats. I'm too lazy to fix each data line one at a time.
Thank you, I am finally able to find time to get back to testing/coding my program (too much distraction at work). As recommended, I added the necessary 'measure' (PAID) code and the program worked like a charm ..thank you! I have one last tweak however! I must add one more 'variable' (LOCATION) because I realized I need to process 'PCwant' calculation only for those healthcare records that occur outside Market-A. To flag outside records, I added the necessary code (LOCATION/PSA); however, I get back no observations for PCwant when I execute my code. What am I missing? … attached is my program.
OMG, I finally figure out the issue. I went back and ready your notes dated 10/14/2019. It's working now, please disregard need for further assistance.
thanks!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.