DATA Step, Macro, Functions and more

Need some help how to do this in data step statement

Reply
N/A
Posts: 0

Need some help how to do this in data step statement

hi,

I have one issue that I want to work on.

Say for example I have in a dataset variables like ID, start date and End date. And there are multiple values for an ID, how can I output values for a particular ID with the earliest Start date and lastest End date.

On Input data set , Values are like below-

12345, 1/1/2008, 8/31/2008
12345,9/1/2008, 9/1/2009

On Output data set I want something like this -

12345,1/1/2008, 9/1/2009

Please let me know if I have to retain ID into a hold variable iand then used first.var /last.var after sorting my dataset.

thanks,
gina
Super Contributor
Super Contributor
Posts: 3,174

Re: Need some help how to do this in data step statement

Posted in reply to deleted_user
The ID variable will be the same for your process - you need to retain the MIN value and end up with the last observation, that having your MAX value for a "sorted" file by ID. Of course, a PROC MEANS/SUMMARY would also do quite nicely with minimal coding.

Scott Barry
SBBWorks, Inc.
SAS Super FREQ
Posts: 8,866

Re: Need some help how to do this in data step statement

Posted in reply to deleted_user
Hi:
It seems that someone has told you about FIRST. and LAST. variables in DATA step processing and that would be one way to get the output dataset you need. In my opinion, I think that a DATA step program is the hardest way to get what you want -- when there are multiple procedure methods to accomplish the same thing.

However, before you go down the DATA step road, there are 4 other ways to get what you need without writing a program. As Scott said, you can use PROC MEANS/SUMMARY, but there are other methods as well:
1) PROC MEANS (SUMMARY) method
2) PROC REPORT method
3) PROC TABULATE method
4) PROC SQL method
5) Data step program using FIRST. and LAST. (requires a PROC SORT step)

The methods are not listed in any particular order, except all the procedure methods are listed before the DATA step method.

The PROC MEANS/SUMMARY method is one way to go. It will require the use of some options and extra statements to create an output dataset. PROC REPORT can also create an output dataset and the syntax might be a bit more straightforward than PROC MEANS. PROC TABULATE can create an output dataset, also; and last, but not least in the procedure department, there's PROC SQL that can create an output dataset.

There may be a reason you want to use a DATA step program. This approach teaches some basic concepts that are very useful for the beginning programmer to know: the BY statement, the use of FIRST.byvar and LAST.byvar, the IF statement, the RETAIN statement, the OUTPUT statement, as well as the KEEP= dataset option.

Along the way, you'll also have to understand that SAS date variables are NUMERIC variables. For example 1/1/2008 or 11/15/1950 are date values that you can read, but inside a SAS dataset, those values are stored as: 17532 and -3334 respectively (because in the SAS world, dates are stored as an offset from 1/11960). So 1/1/2008 is 17532 days AFTER 1/1/1960 and 11/15/1950 is 3334 days BEFORE 1/1/1960 (hence the - sign in -3334 value). This means that you will want to use a SAS pre-defined date format to display your dates and your MIN/MAX values -- or else you'll see numbers that won't make much sense to you. The way you apply formats to variables is with a FORMAT statement or in some procedures with the F= option.

The approach you use depends on the type of output you need and whether you want a "report" along the way or just a dataset to pass to subsequent processes. As I said, all 5 methods will give you an output dataset. They are all shown in the program below. The first part is where I make some fake data -- you would not need the step that creates WORK.MYDATA -- that's just a name I made up. In each of the 5 steps, you would have to plug in the name of -YOUR- dataset after the DATA= option or in the SET statement.

The procedure methods all have their own way of naming the min and the max -- although PROC MEANS and PROC SQL give you control over the name you use for the calculated variables. You'd have to use a RENAME= option with PROC REPORT and PROC TABULATE to get different names. Also, some of the procedures create "extra" variables when they make output datasets. For example, PROC MEANS will create _TYPE_ and _FREQ_ varables; PROC REPORT will create a _BREAK_ variable; PROC TABULATE will create _TABLE_, _TYPE_ and _PAGE_ variables. These extra variables are part of how the output datasets are built by these procedures. You can read about them and decide if you want them by looking at the documentation for each procedure/method.


cynthia

[pre]
** make some data to use;
** if your data are in SAS format already, you will not need this step;
data work.mydata;
infile datalines dlm=',' dsd;
input idval sdate : mmddyy10. edate : mmddyy10.;
datalines;
12345, 1/1/2008, 8/31/2008
12345, 9/1/2008, 9/1/2009
23456, 11/15/1950, 12/31/1951
23456, 6/18/1953, 11/30/1953
;
run;

proc print data=work.mydata;
title 'A) What the data looks like when stored internally';
title2 'Without any formats used for the display of dates';
run;

proc print data=work.mydata;
title 'B) What the data looks like WITH formats used for display of dates';
format sdate edate mmddyy10.;
run;

**1) PROC MEANS approach;
proc means data=work.mydata min max nway;
title '1) Create an OUTPUT dataset using PROC MEANS';
class idval;
var sdate edate;
output out=work.means_minmax min(sdate)=min_sdate max(edate)=max_edate;
run;

proc print data=work.means_minmax;
title '1) This is the new dataset from PROC MEANS';
format min_sdate max_edate mmddyy10.;
run;

** 2) PROC REPORT approach -- can get REPORT and dataset;
** with one pass through the data;
proc report data=work.mydata nowd
out=work.rep_minmax;
title '2) Using PROC REPORT';
column idval sdate edate;
define idval / group;
define sdate / min f=mmddyy10.;
define edate / max f=mmddyy10.;
run;

proc print data=tab_minmax;
title '2) Data created in REPORT step';
format sdate_min edate_max mmddyy10.;
run;

** 3) PROC TABULATE approach can get REPORT and dataset;
** with one pass through the data;
proc tabulate data=work.mydata f=mmddyy10.
out=work.tab_minmax;
title '3) Using PROC TABULATE for REPORT and data';
class idval;
var sdate edate;
table idval,
min*sdate max*edate;
run;

proc print data=tab_minmax;
title '3) Data created in TABULATE step';
format sdate_min edate_max mmddyy10.;
run;



** 4) Proc SQL approach;
proc sql;
create table work.sql_minmax as
select idval, min(sdate) as min_sdate, max(edate) as max_edate
from work.mydata
group by idval
order by idval;
quit;

proc print data=work.sql_minmax;
title '4) Data Created With PROC SQL';
format min_sdate max_edate mmddyy10.;
run;


** 5) Data Step approach;
proc sort data=work.mydata;
by idval;
run;

data ds_minmax(keep=idval min_sdate max_edate);
set work.mydata;
by idval;
retain min_sdate max_edate;

** clear min/max holding variables;
if first.idval then do;
min_sdate = 999999;
max_edate = .;
end;
** check every variable for min/max;
if sdate lt min_sdate then min_sdate = sdate;
if edate gt max_edate then max_edate = edate;

** on last observation for idval, output;
if last.idval then output;
format min_sdate max_edate mmddyy10.;
run;

proc print data=ds_minmax;
title '5) Data created in Data Step program';
format min_sdate max_edate mmddyy10.;
run;


[/pre]
N/A
Posts: 0

Re: Need some help how to do this in data step statement

Posted in reply to Cynthia_sas
Greta..Thanks Cynthia..So many options. I will test my data using one of these proc steps. I have lots of other variables in my input data set, where I think Proc means a better option than proc tabulate/proc means.

Thank you so much again!!!
Valued Guide
Posts: 2,177

Re: Need some help how to do this in data step statement

Posted in reply to deleted_user
proc means seems simplest to me
supposing in dataset work.history the underlying information is in variables "id" and "event", then this will create work.summary dataset with columns ID, Earlest, Latest
[pre]proc means nway data= work.history missing noprint ;
var event ;
class ID ;
output out= work.summary(drop= _type_ _freq_ )
min= earliest max= latest ;
run ;[/pre]
N/A
Posts: 0

Re: Need some help how to do this in data step statement

Hi Cynthia,

If I have more valiable inmy input data set along with idval, do I need to include them in "CLASS" option in proc means if I wanted to out all of other valiables to output dataset.

proc means data=mydata min max nway noprint ;
class idval ; /***** do I need to iclude all variables in here if I want to have them on output dataset ? ***/
var cov_startdt cov_termdt ;
output out=means_minmax (drop= _type_ _freq_)
min(cov_startdt)=min_sdate max(cov_termdt)=max_edate;
run;


thanks,
Gina.
SAS Super FREQ
Posts: 8,866

Re: Need some help how to do this in data step statement

Posted in reply to deleted_user
Hi:
That depends...usually your CLASS variables define subgroups for the analysis that you are doing with PROC MEANS, as described in the documentation:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/a000146731.htm

Depending on what the variable is, you may or may not want it used for analysis (as with a STREET ADDRESS variable, for example) but if you had a STATE variable, you might want it used for subgrouping of your Means information.

Some PROC MEANS examples and usage are here:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/a000146738.htm
There are 12 examples here....with descriptions of the analysis being done.

PROC MEANS is a summary procedure. The reason you use it is to get summary/summarized information and certain calculated statistics for your analysis variables (listed in the VAR statement) based on the various combinations of variables listed on the CLASS statement.

This would be an instance where knowing a bit more about your data would help. What are the variables in your INPUT dataset and what do you need in the OUTPUT dataset??

cynthia
Respected Advisor
Posts: 3,799

Re: Need some help how to do this in data step statement

Posted in reply to Cynthia_sas
I would think ID would probably work. But I prefer the IDGROUP option on the OUTPUT statement. The OP should be able to summarize the data and let other variables "path-through" to the output data.
N/A
Posts: 0

Re: Need some help how to do this in data step statement

Posted in reply to data_null__
proc means data=mydata min max nway classdata=mydata exclusive noprint;
/*by subs_idn cov_startdt cov_termdt ; */
/*
Class subs_idn ;
var cov_startdt cov_termdt ;*/
output out=means_minmax (drop= _type_ _freq_)
min(cov_startdt)=min_sdate max(cov_termdt)=max_edate;
run;

I tried above option including dataset in CLASSDATA option, but looks like it didn't worked. My input data set had some +20 valriables and I can't give them all in either Var list or Class list either (tried and program failed saying format of variables doesn't match list type)

thanks,
gina
SAS Super FREQ
Posts: 8,866

Re: Need some help how to do this in data step statement

Posted in reply to data_null__
Yes, I agree, either ID or IDGROUP option of the OUTPUT statement will "carry" variables over to the output dataset:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/a000146733.htm
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/a000146734.htm

I don't think CLASSDATA will work for what you descibe as needing -- CLASSDATA a way of limiting or providing a "set" of data for which you want analysis. So imagine that your data file had 10 years of data and you wanted to only limit the data to certain combinations of years, and people based on some complex selection criteria ....you could pass a dataset to PROC MEANS via CLASSDATA that would exclusively run the analysis based on the rows in the CLASSDATA dataset. It doesn't sound to me like that's what you want to do. CLASSDATA will not cause extra variables to be put into the output data set.

cynthia
Ask a Question
Discussion stats
  • 9 replies
  • 224 views
  • 0 likes
  • 5 in conversation