Choose most recent record from multiple data set

Reply
Regular Contributor
Posts: 150

Choose most recent record from multiple data set

I am trying to pull in the most recent data from multiple datasets. In short, I get a montly update of peoples status on multiple variables that comes in a new data set every month. I have these dataset for the last 5 years and need to choose each persons most recent record. Some people may be in every dataset while others might drop out of the datasets after a year, or 2 years, and so on. I basically need the last record a person has before they drop out or their most recent record if they have not dropped out. Is there an easy way to do this?

 

Also, the datasets are stored in folders by year. Short of creating multiple libnames for each dataset is there an easy way to read them all in? 

 

Any help would be greatly apprecaited! 

Super User
Super User
Posts: 7,401

Re: Choose most recent record from multiple data set

Well, you will need to libname to each place, otherwise you can't access them.  Its a problem with keeping your data in separate bits.  

Once you have your libnames in place then you can use some generic code to put it all together (note, it would need the data to be pretty similar structure):

data total;
  /* set your specifics for the template dataset here */
run;

data _null_;
  set sashelp.vtable (where=(libname in ("put your libnames in quotes here"));
  call execute(cats('proc append base=total data=',catx('.',libname,memname),'; run;'));
run;

The above appends all the data into one dataset.  Now if your data does not contain date which we can sort the data, then you may need to change from append to a datastep and do;

data _null_;
  set sashelp.vtable (where=(libname in ("put your libnames in quotes here"));
  call execute(cats('data total; set total ',catx('.',libname,memname),' indsname=tmp; year=tmp; run;'));
run;

You now have a big dataset with all your data in one place.  Now you can sort the data by person year and then do:

data want;
  set have;
  by person;
  if last.person then output;
run;

 

Regular Contributor
Posts: 150

Re: Choose most recent record from multiple data set

Thank you for the quick response. I had a couple questions:

1) I am not sure what you mean by set your specifics for the template dataset here. 

 

2) With 5 different libnames how do I separate them out in the second step? Do I just put all of the libnames in their own quotes next to each other? 

 

Thank you again! 

 

 

Super User
Super User
Posts: 7,401

Re: Choose most recent record from multiple data set

1) Create an empty dataset, which has tpyes set, and lengths greater than anything you will encounter.  This will help avoid the whole different lengths between datasets.  If your dataset match exactly then you can drop this step.

 

2) You can add any information you need into this step:

data _null_;
  set sashelp.vtable (where=(libname in ("put your libnames in quotes here"));
  call execute(cats('data total; set total ',catx('.',libname,memname),' indsname=tmp; year=tmp; run;'));
run;

In the above I take the dataset name and set that to year - as I don't know your data I couldn't say, but if you want another variable for libname then:

data _null_;
  set sashelp.vtable (where=(libname in ("put your libnames in quotes here"));
  call execute(cats('data total; set total ',catx('.',libname,memname),' indsname=tmp; lname="',libname,'"; year=tmp; run;'));
run;

 

Regular Contributor
Posts: 150

Re: Choose most recent record from multiple data set

Thank you.

 

Another question. I just created a macro to call in the various libnames that gives me ds2001-ds2016, below. Is there a way I can just integrate those libnames into this code instead of having to put every pathway? I tried to incorporate but keep getting an error with the call function. Also, what does the year=tmp refer to? Thank you for your help!  

 

options mprint;
%macro demo_check;
%global start end;
%let start=%sysfunc(time(),time8.0);

%do i=2000 %to 2016;
libname ds&i "pathway\&i";
%do j=1 %to 12;
%if &j=1 %then %let m=jan;
%if &j=2 %then %let m=feb;
%if &j=3 %then %let m=mar;
%if &j=4 %then %let m=apr;
%if &j=5 %then %let m=may;
%if &j=6 %then %let m=jun;
%if &j=7 %then %let m=jul;
%if &j=8 %then %let m=aug;
%if &j=9 %then %let m=sep;
%if &j=10 %then %let m=oct;
%if &j=11 %then %let m=nov;
%if &j=12 %then %let m=dec;

%let end=%sysfunc(time(),time8.0);
%if &i.=2016 and &j>=8 %then %return;
%end;
%end;
%mend demo_check;
%demo_check;
%put &start &end;

Super User
Posts: 17,829

Re: Choose most recent record from multiple data set

You don't need a macro. Unless you really really want to, I suppose.

 

There's a libname function. 
If you have a standard way to reference your folders and can create that string in a data step and then pass it to the libname function. 

 

You can also use the MONNAME3 format to generate the 3 char string that references month.

 

Here's an example fo generating all your libname paths using a dataset, 

 

data libnames;
main_path = 'C:\storage';
do year=2000 to 2016;
    do month=1 to 12;
        path = catx("\", main_path, put(year, 4.), put(mdy(month, 1, year), monname3.));
    end;
end;
run;

 

 

Regular Contributor
Posts: 150

Re: Choose most recent record from multiple data set

Thank you. This seems much better than the macro. I ran the code and came up with a dataset with 4 varibles but now how do I use that dataset to read in my other datasets? 

Respected Advisor
Posts: 3,892

Re: Choose most recent record from multiple data set

[ Edited ]

You do know that you can have a single library definition which points to multiple folders?

http://support.sas.com/documentation/cdl/en/hostwin/69955/HTML/default/viewer.htm#chloptfmain.htm

 

You can either use existing librefs or you can just use multiple physical paths to do so - or a combination of both.

 

If you have a concatenated library definition and you have same named tables in different folders then SAS will pick the one which comes first according to the path order in your libname statement (but I believe to understand that your table names are unique so that's not an issue).

 

Once you've got a single libref you then can write code for picking your data out of multiple tables. Is there a pattern to your table names which we could use?

What I would be doing is to just generate a set statement from most recent to oldest table and then process through the data from "top to bottom". Whenever I find a new person I'd add the key to a hash table, if the person is already in the hash table then I'd skip the current input record for output.

Ask a Question
Discussion stats
  • 7 replies
  • 236 views
  • 1 like
  • 4 in conversation