- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
I have multiple years of datasets (2006-2019). I would like to sort all the data by patient id before appending so as to reduce the size and remove unwanted records. I found a macro code but its not doing what I expected. I was able to sort the first year (2006) if i replace ( out=) with claims_rx_2006
but its not sorting other years.
data _null_;
input datasets $char50. ;
call execute
("proc sort data=work."||strip(datasets)||
/* "(keep=cum_freq cum_date rename=(cum_freq="||strip(datasets)||"))"|| */
"out="||strip(datasets)||";"||
"by pat_id; run;");
cards;
claims_rx_2006
claims_rx_2007
claims_rx_2008
claims_rx_2009
claims_rx_2010
claims_rx_2011
claims_rx_2012
claims_rx_2013
claims_rx_2014
claims_rx_2015
claims_rx_2016
claims_rx_2017
claims_rx_2018
claims_rx_2019
;run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Try this macro:
%macro append;
%do i=2006 %to 2019;
proc sort data=claims_rx_&i out= _claims_rx_&i;
by pat_id;
run;
%end;
data combined;
set %do i=2006 %to 2019;
_claims_rx_&i
%end;
;
run;
%mend;
%append;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @CathyVI
A word of advice, from efficiency perspective, you will be better served if the code you use does this
- While sorting, exclude the records and columns you don't need to make it into the Proc Sort. (Hint: use Where= & Keep= Data set options on the Input data set)
- Use Proc Append instead of SET statement. Proc Append reads records in Blocks, while the SET statement reads one record at a time. (Hint: As you loop through the macro, issue Proc Append if you can)
Hope this helps
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is an example of what I was talking about
data _null_;
input datasets $char50. ;
call execute
("proc sort data=work."||strip(datasets)||
/* if you can use where= and keep= options, the would come here. Just uncomment the line below and populate it as needed */
/* '(KEEP=pat_id WHERE=(1=1))' || */
" out="||strip(datasets)||"; by pat_id; run; PROC APPEND BASE=combined DATA="||strip(datasets)||"; run;"
);
cards;
claims_rx_2006
claims_rx_2007
claims_rx_2008
claims_rx_2009
claims_rx_2010
claims_rx_2011
claims_rx_2012
claims_rx_2013
claims_rx_2014
claims_rx_2015
claims_rx_2016
claims_rx_2017
claims_rx_2018
claims_rx_2019
;run;
Hope this helps
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
But the code did a lot of work to sort the individual datasets.
So use SET so that the combined dataset is also sorted.
data want;
set claims_rx_2006 - claims_rx_2019;
by pat_id;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Save some code. If you do not want to create an additional data set then don't bother with the OUT= option.
With Call Execute I would also not pass the commented out section as your code does.
When I run into issues with Call Execute I will create a character value with the code created to examine the results such as
data junk;
input datasets $char50. ;
length code $ 300;
code = "proc sort data=work."||strip(datasets)||
/* "(keep=cum_freq cum_date rename=(cum_freq="||strip(datasets)||"))"|| */
"out="||strip(datasets)||";"||
"by pat_id; run;";
cards;
claims_rx_2006
claims_rx_2007
claims_rx_2008
claims_rx_2009
;
When I look that the results for the Code variable above I get:
proc sort data=work.claims_rx_2006out=claims_rx_2006;by pat_id; run;
notice that there is no space before OUT. This would likely throw like this:
34 proc sort data=work.claims_rx_2006out=claims_rx_2006;by pat_id; run; - 22 76 ERROR 22-322: Syntax error, expecting one of the following: DATA, DUPOUT, IN, OUT, SIZE, SORTSEQ, SORTSIZE, T, TECH, TECHNIQUE, UNIOUT, UNIQUEOUT. ERROR 76-322: Syntax error, statement will be ignored.
What this error doesn't tell is that the source data set likely doesn't exist because it saw the OUT as part of the dataset name.
The solution would be include a SPACE before the Out= (or drop the Out entirely as not needed from the code shown)
data junk; input datasets $char50. ; length code $ 500; code = "proc sort data=work."||strip(datasets)|| /* "(keep=cum_freq cum_date rename=(cum_freq="||strip(datasets)||"))"|| */ " out="||strip(datasets)||";"|| "by pat_id; run;"; cards; claims_rx_2006 claims_rx_2007 claims_rx_2008 claims_rx_2009 ;
What did your LOG show. The log typically shows the submitted code from call execute.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You are working way too hard. Copy and paste is your friend. It would be just as easy to create the proc sort steps directly as it would be to make the data step you showed.
proc sort data=claims_rx_2006 ;
by pat_id;
run;
proc sort data=claims_rx_2007 ;
by pat_id;
run;
proc sort data=claims_rx_2008 ;
by pat_id;
run;
proc sort data=claims_rx_2009 ;
by pat_id;
run;
proc sort data=claims_rx_2010 ;
by pat_id;
run;
proc sort data=claims_rx_2011 ;
by pat_id;
run;
proc sort data=claims_rx_2012 ;
by pat_id;
run;
proc sort data=claims_rx_2013 ;
by pat_id;
run;
proc sort data=claims_rx_2014 ;
by pat_id;
run;
proc sort data=claims_rx_2015 ;
by pat_id;
run;
proc sort data=claims_rx_2016 ;
by pat_id;
run;
proc sort data=claims_rx_2017 ;
by pat_id;
run;
proc sort data=claims_rx_2018 ;
by pat_id;
run;
proc sort data=claims_rx_2019 ;
by pat_id;
run;
If you did want to do it with code then just use a do loop from 2006 to 2019 rather than typing in all of the dataset names. Also use the CAT... series of functions to make your code easier to read.
data _null_;
do year=2006 to 2019;
call execute(catx(' '
,'proc sort data=',cats('claims_rx_',year)
,';', 'by pat_id;','run;'
));
end;
run;
NOTE: CALL EXECUTE generated line. 1 + proc sort data= claims_rx_2006 ; by pat_id; run; 2 + proc sort data= claims_rx_2007 ; by pat_id; run; 3 + proc sort data= claims_rx_2008 ; by pat_id; run; ...
And once you have them sorted there is no need for copy and paste or other code generation methods to combine all of the datasets.
data want;
set claims_rx_2006 - claims_rx_2019;
by pat_id;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
What do you mean by duplicate records?
If you want to eliminate duplicates from the individual files then include the NODUPKEY option in the generated PROC SORT code. Placement should be obvious, or easy to fix once you try it and get an error message because you put it in the wrong place.
If you want to eliminate duplicates caused by the same PAT_ID appearing in multiple datasets then you can use subsetting IF statement in the data step that combines them. For example to keep the earliest observation you could do:
data want;
set claims_rx_2006 - claims_rx_2019 ;
by pat_id;
if first.pat_id;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Tom The placement seems obvious but i have tried to insert in every possible area with no luck.
data _null_;
do year=2006 to 2019;
call execute(catx(' '
,'proc sort data=', cats('claims_rx_',year)
,';', 'by pat_id;','run;'
));
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data _null_;
do year=2006 to 2019;
call execute(catx(' '
,'proc sort data=', cats('claims_rx_',year)
,' nodupkey;', 'by pat_id;','run;'
));
end;
run;
If it errors out, check if the nodupkey is being added to the end of the data set name. May need to switch from CATS to a different concatenate operator that won't steal your spaces.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It is part of the PROC statement itself.
Try any of these locations.
data _null_;
do year=2006 to 2019;
call execute(catx(' '
,'proc sort nodupkey data=', cats('claims_rx_',year)
,';', 'by pat_id;','run;'
));
end;
run;
Or
data _null_;
do year=2006 to 2019;
call execute(catx(' '
,'proc sort data=', cats('claims_rx_',year)
,'nodupkey;', 'by pat_id;','run;'
));
end;
run;
or
data _null_;
do year=2006 to 2019;
call execute(catx(' '
,'proc sort data=', cats('claims_rx_',year)
,'nodupkey',';', 'by pat_id;','run;'
));
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content