Create a dataset from multiple datasets/criteria

bolore · Posted 08-30-2011 12:30 PM

Hi

I'm a SAS programming beginner!

I've a got a dataset with a list of ID numbers and dates. I want to

open a different dataset corresponding to the month and year of each ID

opening date + one year on and get some data for each ID.

For example (dataset1)

ID open_date

123 01may2004

124 16aug2010

244 20jan2009

In the end I I want to create a dataset (from a different data set with further additional data) for each of

the ID that has a corresponding open_date + 1 year.

ID open_date current_date additional_data (from dataset2)

123 01may2004 01may2005 1

124 16aug2010 16aug2011 0

244 20jan2009 20jan2010 0

I guess this is similar to a lookup table in excel, but a little stuck in SAS - Hope you can help.

art297 · Posted 08-30-2011 12:46 PM

Are they both already SAS datasets? Does the date have to be precisely 1 year apart? What happens if the initial date is February 29th of a given year?

If they are both SAS datasets, and dependent upon the answers for the above questions, then you can merge the two files in either a datastep or proc sql, and control the other factors.

Whether a datastep will be sufficient will be a function of what kind of match there is between the two files, i.e. one to one, one to many, or many to many.

bolore · Posted 08-30-2011 01:10 PM

Yes, they are SAS datasets.

I want to chose the second dataset from lots of possible datasets depending on the ID open date e.g. if opendate is 1aug2010 then open dataset aug2010. The datasets are monthly datasets that I need the data from so single day precision is not required, just month.

art297 · Posted 08-30-2011 01:40 PM

You didn't answer regarding the type of matching that would exist. If it is only one to one, and since you are new to SAS, depending upon the number of records involved (e.g., if it is less than 10 million or so) I, personally, would combine all of the monthly files into one dataset (which you can probably do in a datastep using the filename statement .. possibly with a wild card), sort the resulting and main datasets by ID and month, and then merge then together in a datastep.

If you do it with a datastep, you will have to first insure that the two sets of files (let's call them have and wanttomerge) only have one variable that has the same name, namely id. If they do share any other variable names, you can always use the datastep rename option to change the name(s) in the wanttomerge dataset.

Without knowing more about the files one can't write the code for you, but it would probably look something like:

data want;

merge have (in=ina) wanttomerge (in=inb);

by id;

if ina and month(date) eq month(datefromwanttomerge) and

year(date)+1 eq year(datefromwanttomerge);

run;

Code like that would select all of the records from have, and combine them with records from wanttomerge, where the records exist in wanttomerge that share the same month but are one year later than those in have.

HTH,

Art

art297 · Posted 08-30-2011 07:14 PM

You never responded to my last post, thus the following may be totally irrelevant. However, if not, I think that you are looking for something like:

libname monthly "c:\art\monthly";

data have;

input ID open_date date9.;

cards;

123 01may2004

124 16aug2010

244 20jan2009

;

data monthly.may2005;

input ID (open_date current_date) (date9.) x y z;

cards;

123 01may200407may2005 1 2 3

;

data monthly.aug2011;

input ID (open_date current_date) (date9.) x y z;

cards;

124 16aug201012aug2011 4 5 6

;

data monthly.jan2010;

input ID (open_date current_date) (date9.) x y z;

cards;

244 20jan200918jan2010 7 8 9

;

proc sql noprint;

select "monthly."||memname

into :monthly_files

separated by " "

from dictionary.tables

where libname="MONTHLY"

;

quit;

data all_monthly (drop=open_date);

set &monthly_files. INDSNAME=dsn;

fdate=input(scan(dsn,2,"."),monyy7.);

run;

data have;

set have;

fdate=input(put(intnx('year',open_date,1,'s'),monyy7.),monyy7.);

run;

proc sort data=have;

by id fdate;

run;

proc sort data=all_monthly;

by id fdate;

run;

data want;

merge have (in=ina) all_monthly (in=inb);

by id fdate;

if ina;

run;

R_A_G_ · Posted 08-30-2011 08:07 PM

Sorry for the late reply I finally found what the problem was, 'bs' is telling sas to read from right to left and I wanted 2nd column from left to right, found it by try and error. Below is the code for reading the correct column (2nd column from the left) But I have no idea what to use if I wanted to read from left to right that's why I've used -32 , I just counted the column.

Thanks

filename cc "C:\CDM1\5K\5K_5000obs\5K_5000obs_RIT\IN\dat1.dat";

Data class2_1;

infile cc lrecl=10000 filename=filename;

input;

file2_2=input(scan(_infile_,-32,,'bs'),best12.);

run;

art297 · Posted 08-30-2011 08:27 PM

R.A.G,

You posted your message in response to the wrong thread, but I'll try to answer it anyway since I probably confused the earlier message.

As I recall you wanted to read the second column. You really only needed to use:

file2_2=input(scan(_infile_,2,,'s'),best12.);

Art

R_A_G_ · Posted 08-30-2011 10:18 PM

thanks

RG

bolore · Posted 08-31-2011 04:51 AM

Hi Art,

Thanks for your replies. I have about 10 years worth of data split up

into monthly SAS data files. The files are labelled yyyymm and have

about 1million IDs (rows) and about 125 variables (col) in each

dataset.

I would like to find out information on 80000 IDs which have an open

date variable and is held in a SAS dataset. I want to pull the same

variables for each ID from the monthly tables corresponding to the

table which holds data 1 year on from the open date. The IDs have

different open dates over the 10year period.

The matching would be 1 to 1 on the IDs. I want a snapshot of how the

ID is doing after 1 year.

If I join all the monthly files together, won't this be quite

inefficient? 10x12x1million or 10x12x80000 if I only keep those IDs

I'm interested in.

Many thanks,

Bo

art297 · Posted 08-31-2011 08:38 AM

Define efficient! Yes, it may be using a sledge hammer to drive in a nail, but a very accurate sledge hammer. But, you are talking about less than 10 million records, which will only take up seconds of cpu time.

The alternative is to develop complex macros that can easily take hours/days to write and add in numerous instances where you could error by writing something that looks like it will work but does something totally unexpected. Since I recall you saying that you are a SAS newbee, I'd strongly recommend NOT going there too early in your learning curve.

Regarding processing, if you are not going to use all 125 variables, dropping as many as possible, as early in the process as you can, will drastically improve efficiency.

Ksharp · Posted 08-31-2011 11:05 PM

It looks like a query problem. En... Hash Table or proc format will be a good choice.

Finding in a dataset once, then use Macro to run across every dataset.

Ksharp

bolore · Posted 09-02-2011 06:40 AM

Thanks all. I finally managed to do this with a looped proc SQL and where statement.

Registration is open

SAS Training: Just a Click Away