Conditional programming (macro?) to identify & merge data sets?

SharonZS · Posted 10-31-2018 06:11 PM

I need to write a program to identify and selectively merge SAS data sets. I’m working with 13 years of survey data from 1999-2012 – each is a separate data set. Complex survey data – each year of data has 100 strata containing 2 or 3 PSUs (grouped PSUs). I’ve created separate replicate data sets for each PSU by strata (replicates for jackknife standard error estimation): have one for each of the strata having 2 PSUs and 2 for all strata having 3 PSUs. The number of strata having 3 PSUs varies by year of data collection (range: 13-31 strata/year have 3 PSUs). The strata ID numbers (1-100) for those having 3 PSUs aren’t a continuous series. For example, for one year strata numbers 3, 7, 9, 13, 21, 22 and 37 might have 3 PSUs. The replicate data sets are indexed by the strata number. So for example rep2_3, rep2_7, rep2_9…. would identify the replicate data sets for my example.

I now need to write syntax to identify and merge these replicate data sets for strata having 3 PSUs across years of observation by strata index number. For example, Strata # 5 (Rep2_5) might have had 3 PSUs and now corresponding replicate data sets for 1999, 2001, 2002, 2007 and 2011. I want to find the rep2_5 data sets in folders for their respective years and concatenate them. Is this a situation for %GOTO?

Any suggestions as to how I can use some sort of conditional programming and indices to identify and concatenate replicate data sets for each strata ID across years? Thank you so much for reading this! I’m truly flummoxed!

ballardw · Posted 10-31-2018 07:39 PM

How does this data set get used in your analysis?

Do you know what the final data set should look like for what you are attempting? If so then provide some example of the actual variable names and values for a few replicates (3 or 4 should work), preferably as a data step for each and an example what you want the final data set to look like using that example input.

Folders are not the issue if you build SAS data sets, LIBRARY names are.

If each of the data sets is structured the same combining them may be as simple as

data allreps;
   set lib1.rep: lib2.rep: lib3.rep: ;
run;

assuming each year's data is in a separate library. The above would combine all data sets in each of the 3 dummy libraries that begin with REP (that's what the colon character ending the data set name does, a wildcard list builder as it were).

Of course if you have other sets that start with rep that you don't want that might be an issue and the set statement gets more

complicated.

What I am not sure of is why you built so many different data sets.

Your description goes into a lot of details that actually have little to do with selecting or merging. If there is data actually contained in the data set names, as implied by rep2_7 and such, I hope the corresponding information is stored in variables inside each set.

One thing that might be an issue: are the meanings of the strata across years the same? Assuming geographical meaning does strata 1 always represent the same geography (or other subpopulation) when it appears? If not you may need to do some additional adjustments on the strata variables.

Reeza · Posted 10-31-2018 07:45 PM

When doing Ballardw suggestion of a few variations, can you also show what that would look like on your original data source? You may need a DoW loop but when I see a lot of subset data I start to smell issues and lots of places for errors.

SharonZS · Posted 10-31-2018 08:46 PM

Thank you so much for your responses! My apologies for any lack of clarity -- trying to be concise and missed! This is a long-standing project (10 year grant) and I'm revising and extending macros written by another statistician... Looking at changes in cost for 78 medical conditions from 1999-2012. The data sets that I'm trying to identify and merge are actually separate data sets containing cost estimates, prevalence, and coefficient estimates for each replicate (1-100 for reps and the non-continuous index for strata which contain a third PSU) for each year of observation.

I'm merging the cost, prevalence and beta estimates to create a merged set of data sets to examine the extent to which changes in medical spending are attributable to changes in the cost of treatment versus changes in the prevalence of the health condition over time.

All of the replicates for each year are in separate SAS libraries by year: est99.replicate2_&j est00.replicate2_&j

The problem I have is that the second replicate for the strata having 3 PSUs isn't consistent for all years. I somehow need to create indices for both year and strata number. I've used indices for variables and imputed data sets in the same library (er.g imputed data sets) but don't know how to do this for data sets across libraries. I also don't know how to do this for a non-continuous index that changes across years.... First set of replicates was very straightforward (do j =1 to 100).

My existing syntax is run by strata number (other statistician didn't make it this far to deal with this...). For example, I want to identify and merge together all of the cost estimates for strata 3 for the years when this strata had 3 PSUs. I've done the easy case where the strata has 3 PSUs for all years. I'm now trying to figure out how to do it for the remaining strata without doing incredibly laborious and error prone hard coding.... Existing code looked like this:

The names here are the macro call are the data sets containing means for each replicate (final_est_replicate2_&j for all 13 years)

I use another macro to run this for all years and strata...

%macro cgar_from_year_a
(year=,rep=,master=,coeff=,cost=, name1=,name2=,name3=,name4=,name5=,name6=,name7=,......name13= );

data final_estimates99;
set est99.&name1;
year = 1999;
run;

data final_estimates00;
set est00.&name2;
year = 2000;
run;

data final_estimates01;
set est01.&name3;
year = 2001;
run;

Goes up to 2012. Problem now is that SAS of course stops when it encounters a year/library that doesn't contain a data set for that replicate (strata that doesn't have a third PSU for that year). The existing syntax concatenates all of the cost data sets:

Data attrib_cost_long;
set final_estimates99 final_estimates00 final_estimates01 final_estimates02 final_estimates03
final_estimates04 final_estimates05 final_estimates06 final_estimates07 final_estimates08
final_estimates09 final_estimates10 final_estimates11 final_estimates12;
run;

I need to figure out how to get something similar when the replicate data sets only exist for a subset of years of observation.

At this point the only variable for number of PSUs for strata per year exists in another data set that I made. The names of the data sets are really the only indicator as to whether there was a third PSU in that strata for the year (if rep2_&j exists in libraryYR).

Thank you so much for your responses! I've been asking around at my institution and no one has any idea of what to do (I'm more of a modeler...).

Reeza · Posted 10-31-2018 08:56 PM

Yeah, you'd be better off refactoring this from scratch and get it working. I think he replicates are a problem. How many data sets do you have in total and how many rows in the data set if you were to combine it all together?

SharonZS · Posted 11-01-2018 11:39 AM

The replicates are definitely the problem! I have over 1600+ of them (other secret detail that I didn't even mention is that in 2012 they changed the PSU variable so I have 525 replicates for that year alone!!). I may end up hard coding this...Ugh.. Thank you so much for your responses! Seriously the most useful feedback I've gotten in weeks!

Couple thoughts I had after reading your responses. Is there a way to use the %GOTO or %RETURN to do this? For example, I've created variables for the number of PSUs in each stratum by year: psuYR equal to 2 or 3 for each stratum.

%macro combdat (psuYR);

%if &psuYR ne 3 %then %return; /*Only creates data set for strata having 3 PSUs*/

data final_estimatesYR;

set estYR.final_estimates_rep2_&j;

run;

%mend combdat;

Does this look like a direction worth pursuing? Would run for each year within each stratum number.

The other piece that I thought might be useful that I do have is a list of strata with 3 PSUs created for each year:

data stratlist;
set strats;
by psu strat;
if psu = 3 then list = '&j='||trim(left(strat))||' or ';
if last.psu and psu=3 then list= '&j='||trim(left(strat));
run;

*creates macro variable;
%global list1;
%let list1=;

data replist;
Set stratlist;
if psu = 3 then
call symput("list1", trim(resolve('&list1'))||' '||trim(list));
run;

%put list1=&list1; /* Need to redo this and add in a suffix for year */

I then use this list of strata having 3 PSUs to run the next set of computations for these replicates:

%do j = 1 %to 100;
%if &list1 %then %do;

%let num = %eval(&num+1);

Any useful way to use this?

Reeza · Posted 11-01-2018 11:42 AM

I rarely do this, and not sure I have time for it, but can you email me the full file via DM and I'll take a look? Include one or two variations and I can make some recommendations on how to refactor it for efficiency.

Conditional programming (macro?) to identify & merge data sets?

Re: Conditional programming (macro?) to identify & merge data sets?

Re: Conditional programming (macro?) to identify & merge data sets?

Re: Conditional programming (macro?) to identify & merge data sets?

Re: Conditional programming (macro?) to identify & merge data sets?

Re: Conditional programming (macro?) to identify & merge data sets?

Re: Conditional programming (macro?) to identify & merge data sets?

Classroom Training Available!