Re: macro Do loop for a set of programs

pensarchem · Posted 07-28-2020 05:25 PM

Hi everyone,

I have not used do loop very much. however I have the following codes/logic need to be developed, please kindly help.

Let's say,

1, By running programs 1, 2, 3, I created a cohort about 3 million of clients for my cohort_2019 ;

2, then I use this cohort_2019, run the same program, 1, 2, 3, (need to change the dates though, should also in the loop) create another cohort_2018;

......

Create back to cohort_2010;

I personally would do it without 'do loop' if it's less than 4 circles. For 10, I think it's too much. As I lack experience in do loop, so I ask your kind help. thanks.

PaigeMiller · Posted 07-28-2020 05:29 PM

The obvious answer is to create all the years in one sequence of programs, and then you wind up with a data set that contains years 2010 through 2019. Simple! No need for a loop!

--
Paige Miller

pensarchem · Posted 07-28-2020 05:34 PM

Yeah, I understand you, just feel it's not a smart way to do it. I will do this if I could not get the answer here or figure out a way..

PaigeMiller · Posted 07-28-2020 05:38 PM

And I feel it is the smartest way to do this, and the one that will take the least programming effort.

--
Paige Miller

ballardw · Posted 07-28-2020 06:23 PM

@pensarchem wrote:

Yeah, I understand you, just feel it's not a smart way to do it. I will do this if I could not get the answer here or figure out a way..

Not quite following what the main objection is.

Typically with something like this, assuming the data comes from different data sets (if it is already in one set you are doing WAY too much work), I would combine the data sets keeping the variables needed.

If there is some minor difference in the data set variable names then use data set rename options to rename them to a consistent name.

If there are minor differences in coding values then use the INDSNAME option to get the data set names an conditionally apply coding to make the values standard.

If you don't have appropriate date values to identify records then perhaps that INDSNAME can help supply them as a standard value for Year of some sort (DataYear, CollectionYear, ReportYear or what have you)

Then with one data set run the "Programs 1, 2, 3" which might want some tweaking to use the modified data set.

I could show you code doing this for 20 yearly data sets with changes of variable names or scale of the variable of interest but it is just plain routine and tedious. But still simpler than "fixing" each year's data individually. Partially because having in one set forced me consider some of the scale changes collectively. If 20 separate sets had been made without that consideration then likely there would have been fixes needed when the years still needed to be combined for modeling or graphing purposes (which is why I needed all the values in one set).

pensarchem · Posted 07-28-2020 06:40 PM

I just did copied the codes for 10 times and made very minor changes for each set. The coding does little time like 10 mins, which is great. I am not sure if I misunderstood both of you.

I say 'smart way' means the code is short and uses whatever short macro/ loop with only one set of main program, like simple macro..

I normally use the combination of sas/sql (under oracle), so don't know lots of commands under pure SAS.

PaigeMiller · Posted 07-28-2020 07:13 PM

@pensarchem wrote:
I just did copied the codes for 10 times and made very minor changes for each set. The coding does little time like 10 mins, which is great. I am not sure if I misunderstood both of you.

Obviously, you did misunderstand. Again, I state (and agree with @ballardw ) that creating 10 data sets, whether it is by macro loop or replicating the code as you did, is more effort and a poorer coding process than creating one big data set. Furthermore, the programming after you have the 10 data sets is more difficult than if you had all the data in one data set.

I say 'smart way' means the code is short and uses whatever short macro/ loop with only one set of main program, like simple macro..

The smart way is to avoid macro loops when not necessary. It is more coding, and is harder to do for people who are not fluent in SAS macro. The smart way is to create one large data set with all years, and then use other SAS structures (not macro loops) to analyze all the years with one analysis.

--
Paige Miller

pensarchem · Posted 07-28-2020 08:03 PM

Well, I think I understand both of your points now.
However, I just want to point out just in case if I did not explain what I need clearly.
1, we cannot create a large dataset, without set the client cohort from previous year, we are going to deal with the complete database for 10 year, so it's not practical; Most importantly, I need the list from a specific year (a, let's 2017) to run the previous year (a-1, 2016) results, but date is changing for each run and should include all the years after the new a-1, 2016 to the year 2019 .
2, Let's say I got 4 millions patients in 2019, create a special list with 3 millions patients in 2019 as list_2019;
3, then use list_2019, run the similar codes, but time from 2018-2019 two years, generate a list 2 millions list_2018;
4, then use list_2018, time from 2017-2019; generate a list 1.5 millions;
....
10, list_2011, date 2010-2019, generate 200K patients.

regardless, thank you both for the discussion.

mkeintz · Posted 07-28-2020 08:32 PM

You're saying that each year of data is (partly or completely?) derived from the data of the following year, correct? If so, and if the logic used for each year unchanged, then you have a case for a macro, in which the primary (possibly only) argument would be the source year, as in:

%macro build_prior_year(inyear=);
    %let prior_year= %eval(&inyear-1);

    data mydata_&prior_year ;
      set mydate_&inyear;
       .... other code ....;
    run;
%mend build_prior_year;

%build_prior_year(inyear=2019);

Of course, this is just the most basic, uncomplicated example. In which you create dataset MYDATA_2018 from MYDATA_2019. You could then call it again be invoking:

%build_prior_year(inyear=2018);

etc., etc.

You could get fancier and modify the BUILD_PRIOR_YEAR macro to loop through those years:

%macro build_prior_years(inyearstart=,inyearend=);
  %do inyear=&inyearstart %to &inyearend %by -1;
    %let prior_year= %eval(&inyear-1);
    data mydata_&prior_year ;
      set mydate_&inyear;
       .... other code ....;
    run;
  %end;
%mend build_prior_year;

%build_prior_years(inyearbeg=2019,inyearend=2015);

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ballardw · Posted 07-29-2020 05:47 AM

@pensarchem wrote:
Well, I think I understand both of your points now.
However, I just want to point out just in case if I did not explain what I need clearly.
1, we cannot create a large dataset, without set the client cohort from previous year, we are going to deal with the complete database for 10 year, so it's not practical; Most importantly, I need the list from a specific year (a, let's 2017) to run the previous year (a-1, 2016) results, but date is changing for each run and should include all the years after the new a-1, 2016 to the year 2019 .
2, Let's say I got 4 millions patients in 2019, create a special list with 3 millions patients in 2019 as list_2019;
3, then use list_2019, run the similar codes, but time from 2018-2019 two years, generate a list 2 millions list_2018;
4, then use list_2018, time from 2017-2019; generate a list 1.5 millions;
....
10, list_2011, date 2010-2019, generate 200K patients.

regardless, thank you both for the discussion.

Please rephrase this part of your response

we cannot create a large dataset, without set the client cohort from previous year, we are going to deal with the complete database for 10 year, so it's not practical;

I don't understand the "without set the client cohort" bit.

And maybe provide some example data of the "rules" involved.

A very typical activity with cohort studies is identify people that either drop out or are collected in every iteration of the data collection.

If you are looking to find every person active in all of the collection cycles the step is easy:

1) combine the data

2) sort by the individual identification informat PLUS the date of collection (or the year if the collection is once per calendar year)

3) in a data step count and compare years to see if a person is in all of the cycles

4) keep the ones with all the cycles.

The example below shows identifying ID from a study with 5 collection "years" and get the Id's that appear in all of the years.

data example;
   input id year;
datalines;
1   1
2   1
3   1
4   1
1   2
3   2
4   2
1   3
2   3
3   3
4   3
2   4
3   4
4   4
1   5
3   5
4   5
;

proc sort data=example;
   by id year;
run;

data want;
   set example;
   retain yrcount;
   by id;
   if first.id then yrcount=1;
   else yrcount+1;
   /* keep the id and year of ID that 
      occur in all 5 years of the
      dummy study
   */
   if last.id and yrcount=5;
run;

I could then use the data in WANT to merge with the Example data to get any other variables that were needed for analysis for ONLY the ids from all the cycles.

If your existing data doesn't have the collection cycle but has a date of collection it should be VERY easy to create the collection cycle indicator and use that instead of year.

I have done this with studies that had 3 month, 6 month and multiple year intervals in the collection "cycle".

pensarchem · Posted 08-01-2020 02:50 PM

Thanks everyone for your input.

In the end, I did use macro.

1, generate a cohort for the first year first;

2, then use the macro and almost the same code (need some changes to join the cohort with database though) to generate the cohort for the last 9 years.

3, then label and put them together to give me the whole 10 year list.

The most easiest way for coding will be:

1) extract all the data for last 10 years;

2) then rank the visits or service date..

However, the extraction will take too long for the first step and is not practical in my case.

thanks everyone.

SAS Innovate 2025: Call for Content

Classroom Training Available!