BookmarkSubscribeRSS Feed
CathyVI
Pyrite | Level 9

Hello,

I have multiple years of datasets (2006-2019). I would like to sort all the data by patient id before appending so as to reduce the size and remove unwanted records. I found a macro code but its not doing what I expected. I was able to sort the first year (2006)  if i replace ( out=) with claims_rx_2006 but its not sorting other years.

 

data _null_;
   input datasets $char50. ;
   call execute 
      ("proc sort data=work."||strip(datasets)||
/*  "(keep=cum_freq cum_date rename=(cum_freq="||strip(datasets)||"))"||  */
      "out="||strip(datasets)||";"||
       "by pat_id; run;");
   cards;
claims_rx_2006
claims_rx_2007
claims_rx_2008
claims_rx_2009
claims_rx_2010
claims_rx_2011
claims_rx_2012
claims_rx_2013
claims_rx_2014
claims_rx_2015
claims_rx_2016
claims_rx_2017
claims_rx_2018
claims_rx_2019
 ;run;

 

12 REPLIES 12
A_Kh
Lapis Lazuli | Level 10

Try this macro:

 %macro append;
 	%do i=2006 %to 2019;
		proc sort data=claims_rx_&i out= _claims_rx_&i;
			by pat_id;
		run;
	%end; 
	data combined;
		set %do i=2006 %to 2019;
				_claims_rx_&i
			%end;
			;
	run; 
%mend; 
%append; 
AhmedAl_Attar
Ammonite | Level 13

Hi @CathyVI 

A word of advice, from efficiency perspective, you will be better served if the code you use does this

  1. While sorting, exclude the records and columns you don't need to make it into the Proc Sort. (Hint: use Where= & Keep= Data set options on the Input data set)
  2. Use Proc Append instead of SET statement. Proc Append reads records in Blocks, while the SET statement reads one record at a time. (Hint: As you loop through the macro, issue Proc Append if you can)

Hope this helps

AhmedAl_Attar
Ammonite | Level 13

@CathyVI 

Here is an example of what I was talking about

data _null_;
   input datasets $char50. ;
   call execute 
      ("proc sort data=work."||strip(datasets)||
	  /* if you can use where= and keep= options, the would come here. Just uncomment the line below and populate it as needed */
	  /* '(KEEP=pat_id  WHERE=(1=1))' || */
      " out="||strip(datasets)||"; by pat_id; run; PROC APPEND BASE=combined DATA="||strip(datasets)||"; run;"
	);
   cards;
claims_rx_2006
claims_rx_2007
claims_rx_2008
claims_rx_2009
claims_rx_2010
claims_rx_2011
claims_rx_2012
claims_rx_2013
claims_rx_2014
claims_rx_2015
claims_rx_2016
claims_rx_2017
claims_rx_2018
claims_rx_2019
 ;run;

Hope this helps

Tom
Super User Tom
Super User

But the code did a lot of work to sort the individual datasets.

So use SET so that the combined dataset is also sorted.

data want;
  set claims_rx_2006 - claims_rx_2019;
  by pat_id;
run;
ballardw
Super User

Save some code. If you do not want to create an additional data set then don't bother with the OUT= option.

With Call Execute I would also not pass the commented out section as your code does.

When I run into issues with Call Execute I will create a character value with the code created to examine the results such as

 

data junk;
input datasets $char50. ;
length code $ 300;
code = "proc sort data=work."||strip(datasets)||
/* "(keep=cum_freq cum_date rename=(cum_freq="||strip(datasets)||"))"|| */
"out="||strip(datasets)||";"||
"by pat_id; run;";
cards;
claims_rx_2006
claims_rx_2007
claims_rx_2008
claims_rx_2009
;

When I look that the results for the Code variable above I get:

proc sort data=work.claims_rx_2006out=claims_rx_2006;by pat_id; run;

notice that there is no space before OUT. This would likely throw like this:

34   proc sort data=work.claims_rx_2006out=claims_rx_2006;by pat_id; run;
                                          -
                                          22
                                          76

ERROR 22-322: Syntax error, expecting one of the following: DATA, DUPOUT, IN, OUT, SIZE, SORTSEQ,
              SORTSIZE, T, TECH, TECHNIQUE, UNIOUT, UNIQUEOUT.

ERROR 76-322: Syntax error, statement will be ignored.

What this error doesn't tell is that the source data set likely doesn't exist because it saw the OUT as part of the dataset name.

The solution would be include a SPACE before the Out= (or drop the Out entirely as not needed from the code shown)

data junk;
   input datasets $char50. ;
   length code $ 500;
   code = "proc sort data=work."||strip(datasets)||
/*  "(keep=cum_freq cum_date rename=(cum_freq="||strip(datasets)||"))"||  */
      " out="||strip(datasets)||";"||
       "by pat_id; run;";
   cards;
claims_rx_2006
claims_rx_2007
claims_rx_2008
claims_rx_2009
;

 

 

 

What did your LOG show. The log typically shows the submitted code from call execute.

 

Tom
Super User Tom
Super User

You are working way too hard. Copy and paste is your friend. It would be just as easy to create the proc sort steps directly as it would be to make the data step you showed.  

proc sort data=claims_rx_2006 ;
  by pat_id;
run;
proc sort data=claims_rx_2007 ;
  by pat_id;
run;
proc sort data=claims_rx_2008 ;
  by pat_id;
run;
proc sort data=claims_rx_2009 ;
  by pat_id;
run;
proc sort data=claims_rx_2010 ;
  by pat_id;
run;
proc sort data=claims_rx_2011 ;
  by pat_id;
run;
proc sort data=claims_rx_2012 ;
  by pat_id;
run;
proc sort data=claims_rx_2013 ;
  by pat_id;
run;
proc sort data=claims_rx_2014 ;
  by pat_id;
run;
proc sort data=claims_rx_2015 ;
  by pat_id;
run;
proc sort data=claims_rx_2016 ;
  by pat_id;
run;
proc sort data=claims_rx_2017 ;
  by pat_id;
run;
proc sort data=claims_rx_2018 ;
  by pat_id;
run;
proc sort data=claims_rx_2019 ;
  by pat_id;
run;

If you did want to do it with code then just use a do loop from 2006 to 2019 rather than typing in all of the dataset names.  Also use the CAT... series of functions to make your code easier to read.

data _null_;
  do year=2006 to 2019;
    call execute(catx(' '
      ,'proc sort data=',cats('claims_rx_',year)
      ,';', 'by pat_id;','run;'
    ));
  end;
run;
NOTE: CALL EXECUTE generated line.
1    + proc sort data= claims_rx_2006 ; by pat_id; run;
2    + proc sort data= claims_rx_2007 ; by pat_id; run;
3    + proc sort data= claims_rx_2008 ; by pat_id; run;
...

And once you have them sorted there is no need for copy and paste or other code generation methods to combine all of the datasets.

data want;
  set claims_rx_2006 - claims_rx_2019;
  by pat_id;
run;
CathyVI
Pyrite | Level 9

@Tom  @Reeza @AhmedAl_Attar @ballardw @A_Kh 

Thank you all. I tested some of the codes and they all worked. However @Tom If i want to remove duplicate records using your code here will I insert NODUKEY.

Tom
Super User Tom
Super User

What do you mean by duplicate records?

If you want to eliminate duplicates from the individual files then include the NODUPKEY option in the generated PROC SORT code.  Placement should be obvious, or easy to fix once you try it and get an error message because you put it in the wrong place.

 

If you want to eliminate duplicates caused by the same PAT_ID appearing in multiple datasets then you can use subsetting IF statement in the data step that combines them.  For example to keep the earliest observation you could do:

data want;
  set claims_rx_2006 - claims_rx_2019 ;
  by pat_id;
  if first.pat_id;
run;
CathyVI
Pyrite | Level 9

@Tom  The placement seems obvious but i have tried to insert in every possible area with no luck.

data _null_;
do year=2006 to 2019;
call execute(catx(' '
,'proc sort data=', cats('claims_rx_',year)
,';', 'by pat_id;','run;'
));
end;
run;
Reeza
Super User
data _null_;
do year=2006 to 2019;
call execute(catx(' '
,'proc sort data=', cats('claims_rx_',year)
,' nodupkey;', 'by pat_id;','run;'
));
end;
run;

If it errors out, check if the nodupkey is being added to the end of the data set name. May need to switch from CATS to a different concatenate operator that won't steal your spaces. 

 

Tom
Super User Tom
Super User

It is part of the PROC statement itself.

Try any of these locations.

data _null_;
  do year=2006 to 2019;
    call execute(catx(' '
      ,'proc sort nodupkey data=', cats('claims_rx_',year)
      ,';', 'by pat_id;','run;'
    ));
  end;
run;

Or

data _null_;
  do year=2006 to 2019;
    call execute(catx(' '
      ,'proc sort data=', cats('claims_rx_',year)
      ,'nodupkey;', 'by pat_id;','run;'
    ));
  end;
run;

or

data _null_;
  do year=2006 to 2019;
    call execute(catx(' '
      ,'proc sort data=', cats('claims_rx_',year)
      ,'nodupkey',';', 'by pat_id;','run;'
    ));
  end;
run;
Reeza
Super User
How big are the data sets? I wonder if an index may be a better approach.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1892 views
  • 2 likes
  • 6 in conversation