Hello,
I have multiple years of datasets (2006-2019). I would like to sort all the data by patient id before appending so as to reduce the size and remove unwanted records. I found a macro code but its not doing what I expected. I was able to sort the first year (2006) if i replace ( out=) with claims_rx_2006
but its not sorting other years.
data _null_;
input datasets $char50. ;
call execute
("proc sort data=work."||strip(datasets)||
/* "(keep=cum_freq cum_date rename=(cum_freq="||strip(datasets)||"))"|| */
"out="||strip(datasets)||";"||
"by pat_id; run;");
cards;
claims_rx_2006
claims_rx_2007
claims_rx_2008
claims_rx_2009
claims_rx_2010
claims_rx_2011
claims_rx_2012
claims_rx_2013
claims_rx_2014
claims_rx_2015
claims_rx_2016
claims_rx_2017
claims_rx_2018
claims_rx_2019
;run;
Try this macro:
%macro append;
%do i=2006 %to 2019;
proc sort data=claims_rx_&i out= _claims_rx_&i;
by pat_id;
run;
%end;
data combined;
set %do i=2006 %to 2019;
_claims_rx_&i
%end;
;
run;
%mend;
%append;
Hi @CathyVI
A word of advice, from efficiency perspective, you will be better served if the code you use does this
Hope this helps
Here is an example of what I was talking about
data _null_;
input datasets $char50. ;
call execute
("proc sort data=work."||strip(datasets)||
/* if you can use where= and keep= options, the would come here. Just uncomment the line below and populate it as needed */
/* '(KEEP=pat_id WHERE=(1=1))' || */
" out="||strip(datasets)||"; by pat_id; run; PROC APPEND BASE=combined DATA="||strip(datasets)||"; run;"
);
cards;
claims_rx_2006
claims_rx_2007
claims_rx_2008
claims_rx_2009
claims_rx_2010
claims_rx_2011
claims_rx_2012
claims_rx_2013
claims_rx_2014
claims_rx_2015
claims_rx_2016
claims_rx_2017
claims_rx_2018
claims_rx_2019
;run;
Hope this helps
But the code did a lot of work to sort the individual datasets.
So use SET so that the combined dataset is also sorted.
data want;
set claims_rx_2006 - claims_rx_2019;
by pat_id;
run;
Save some code. If you do not want to create an additional data set then don't bother with the OUT= option.
With Call Execute I would also not pass the commented out section as your code does.
When I run into issues with Call Execute I will create a character value with the code created to examine the results such as
data junk;
input datasets $char50. ;
length code $ 300;
code = "proc sort data=work."||strip(datasets)||
/* "(keep=cum_freq cum_date rename=(cum_freq="||strip(datasets)||"))"|| */
"out="||strip(datasets)||";"||
"by pat_id; run;";
cards;
claims_rx_2006
claims_rx_2007
claims_rx_2008
claims_rx_2009
;
When I look that the results for the Code variable above I get:
proc sort data=work.claims_rx_2006out=claims_rx_2006;by pat_id; run;
notice that there is no space before OUT. This would likely throw like this:
34 proc sort data=work.claims_rx_2006out=claims_rx_2006;by pat_id; run; - 22 76 ERROR 22-322: Syntax error, expecting one of the following: DATA, DUPOUT, IN, OUT, SIZE, SORTSEQ, SORTSIZE, T, TECH, TECHNIQUE, UNIOUT, UNIQUEOUT. ERROR 76-322: Syntax error, statement will be ignored.
What this error doesn't tell is that the source data set likely doesn't exist because it saw the OUT as part of the dataset name.
The solution would be include a SPACE before the Out= (or drop the Out entirely as not needed from the code shown)
data junk; input datasets $char50. ; length code $ 500; code = "proc sort data=work."||strip(datasets)|| /* "(keep=cum_freq cum_date rename=(cum_freq="||strip(datasets)||"))"|| */ " out="||strip(datasets)||";"|| "by pat_id; run;"; cards; claims_rx_2006 claims_rx_2007 claims_rx_2008 claims_rx_2009 ;
What did your LOG show. The log typically shows the submitted code from call execute.
You are working way too hard. Copy and paste is your friend. It would be just as easy to create the proc sort steps directly as it would be to make the data step you showed.
proc sort data=claims_rx_2006 ;
by pat_id;
run;
proc sort data=claims_rx_2007 ;
by pat_id;
run;
proc sort data=claims_rx_2008 ;
by pat_id;
run;
proc sort data=claims_rx_2009 ;
by pat_id;
run;
proc sort data=claims_rx_2010 ;
by pat_id;
run;
proc sort data=claims_rx_2011 ;
by pat_id;
run;
proc sort data=claims_rx_2012 ;
by pat_id;
run;
proc sort data=claims_rx_2013 ;
by pat_id;
run;
proc sort data=claims_rx_2014 ;
by pat_id;
run;
proc sort data=claims_rx_2015 ;
by pat_id;
run;
proc sort data=claims_rx_2016 ;
by pat_id;
run;
proc sort data=claims_rx_2017 ;
by pat_id;
run;
proc sort data=claims_rx_2018 ;
by pat_id;
run;
proc sort data=claims_rx_2019 ;
by pat_id;
run;
If you did want to do it with code then just use a do loop from 2006 to 2019 rather than typing in all of the dataset names. Also use the CAT... series of functions to make your code easier to read.
data _null_;
do year=2006 to 2019;
call execute(catx(' '
,'proc sort data=',cats('claims_rx_',year)
,';', 'by pat_id;','run;'
));
end;
run;
NOTE: CALL EXECUTE generated line. 1 + proc sort data= claims_rx_2006 ; by pat_id; run; 2 + proc sort data= claims_rx_2007 ; by pat_id; run; 3 + proc sort data= claims_rx_2008 ; by pat_id; run; ...
And once you have them sorted there is no need for copy and paste or other code generation methods to combine all of the datasets.
data want;
set claims_rx_2006 - claims_rx_2019;
by pat_id;
run;
What do you mean by duplicate records?
If you want to eliminate duplicates from the individual files then include the NODUPKEY option in the generated PROC SORT code. Placement should be obvious, or easy to fix once you try it and get an error message because you put it in the wrong place.
If you want to eliminate duplicates caused by the same PAT_ID appearing in multiple datasets then you can use subsetting IF statement in the data step that combines them. For example to keep the earliest observation you could do:
data want;
set claims_rx_2006 - claims_rx_2019 ;
by pat_id;
if first.pat_id;
run;
@Tom The placement seems obvious but i have tried to insert in every possible area with no luck.
data _null_;
do year=2006 to 2019;
call execute(catx(' '
,'proc sort data=', cats('claims_rx_',year)
,';', 'by pat_id;','run;'
));
end;
run;
data _null_;
do year=2006 to 2019;
call execute(catx(' '
,'proc sort data=', cats('claims_rx_',year)
,' nodupkey;', 'by pat_id;','run;'
));
end;
run;
If it errors out, check if the nodupkey is being added to the end of the data set name. May need to switch from CATS to a different concatenate operator that won't steal your spaces.
It is part of the PROC statement itself.
Try any of these locations.
data _null_;
do year=2006 to 2019;
call execute(catx(' '
,'proc sort nodupkey data=', cats('claims_rx_',year)
,';', 'by pat_id;','run;'
));
end;
run;
Or
data _null_;
do year=2006 to 2019;
call execute(catx(' '
,'proc sort data=', cats('claims_rx_',year)
,'nodupkey;', 'by pat_id;','run;'
));
end;
run;
or
data _null_;
do year=2006 to 2019;
call execute(catx(' '
,'proc sort data=', cats('claims_rx_',year)
,'nodupkey',';', 'by pat_id;','run;'
));
end;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.