SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

How do I compare and transform variables from multiple monthly datasets?

Accepted Solution Solved
Reply
New Contributor
Posts: 3
Accepted Solution

How do I compare and transform variables from multiple monthly datasets?

Hi,

I am working on a credit risk database using SAS Enterprise Guide 4.3. I have 40 monthly files with the same variables, of which two are of interest for this issue: one variable is a binary one called 'bad', who takes the value 1 when an account is in default and 0 otherwise, and another one called 'account-exposure'. What I need to do is retain the value of the account_exposure for each account_number at the moment of default, i.e. when the bad variable turns from 0 to 1 from one month to the next. The primary key in all the datasets is the account_number, and I have compiled a dataset with all the bad=1 from the last available month. (these are the accounts that need to be checked)

 

I am a bit stuck as I would basically need to look at two consecutive datasets at a time to compare the values of 'bad' and retain the account_exposure for each account_number if bad(dataset_month1)=1 and bad(dataset_month0)=0 but I am not sure how to do that going through all 40 datasets. The name of all datasets are something like blablabla_yyyymm, and have tought of something like below but I am not too familiar with macros:

%macro default(data);

Proc sql;

Create table defaultflag as

              Select &data.account_number

                             &data.account_exposure

              Substr(put(&data), length(put(&data))-3, 4) as default_date  /*how do I transform the dataset name into a character string?*/

from &data

Where &data.bad=0 and &data.bad=1;    /*this condition should be formulated in another way*/

Quit;

%mend default;

Any help on the above or any other ideas would be much appreciated!!


Accepted Solutions
Solution
‎09-18-2016 06:56 AM
Respected Advisor
Posts: 3,902

Re: How do I compare and transform variables from multiple monthly datasets?

[ Edited ]

For your data and using SAS datasets I prefer to have monthly files over one big table. What you could do when creating a new monthly file is to also (re-) create a view over all the existing tables.

 

I believe I haven't fully understood how you want to create your new variable but I believe a variation of below could still allow for a "straight forward" solution approach without the need to create a big table or sorting of all the data.

/* create sample data */
data crl_201606;
  input account_number account_exposure bad account_limit;
  datalines;
574476 56.8 0 23
245245 67.5 0 24
252450 45.9 0 25
252457 90.5 1 26
252456 45.6 1 76
252351 23.3 0 5
568576 45.9 1 3
324553 0 1 32
857843 34.7 0 2
;
run;
data crl_201607;
  input account_number account_exposure bad account_limit;
  datalines;
574476 53.5 1 56
245245 67.8 0 45
252450 43.78 0 34
252457 74.7 1 56
252456 89.4 1 456
252351 21.5 0 23
568576 39.6 1 4
324553 2.5 1 53
857843 37.9 1 4
;
run;
data crl_201608;
  input account_number account_exposure bad account_limit;
  datalines;
574476 67.5 1 56
245245 68 1 45
252450 43.2 0 67
252457 74.7 1 4
252456 89.4 1 34
252351 21.5 0 23
568576 39.9 1 4
324553 2.9 1 53
857843 37.9 1 4
;
run;

/* create macro variable with list of available source tables */
proc sql noprint;
  select catt(libname,'.',memname,' (keep=account_number account_exposure bad account_limit)') into :source_tbl_list separated by ' '
  from dictionary.tables
  where libname='WORK' and memname like 'CRL^_%' escape '^'
  order by memname
  ;
quit;

/* analyse source tables and create new variable "want" */
data want /*(drop=_:)*/ ;
  length source_tbl $ 41;
  set &source_tbl_list indsname=_source_tbl;
  source_tbl=_source_tbl;
  if _n_=1 then 
    do;
      length _account_exposure_time_of_bad 8;
      dcl hash h1();
      _rc=h1.defineKey('account_number');
      _rc=h1.defineData('_account_exposure_time_of_bad');
      _rc=h1.defineDone();
    end;

   if h1.check() ne 0 then
    do;
      if bad=1 then
        do;
          _account_exposure_time_of_bad=account_exposure;
          _rc=h1.add();
        end;
    end;
  else if h1.find()=0 then
    do;
      want=account_limit/_account_exposure_time_of_bad;
    end;
run;

What I wasn't sure about:

1. Do you want to use the exposure value the first time things go "bad" or should it be the most recent "bad" one.

2. You talk about a selected list of accounts you need to look at (only the ones with a "bad" in the latest month or so). I haven't implemented this so the current code looks at all accounts. It wouldn't be to hard though to implement such an account selection as well. What you could do: create and load another hash table with only the accounts you need to look at in the very first iteration of the "want" data step. Then in every iteration of the data step look up the account_number in the hash (using check() method) and if not found in the hash go straight to the next iteration (e.g. by using "return" statement).

 

Adding a variation for table selection in case you have other tables or views starting with CRL_ in your library.

/* create macro variable with list of available source tables */
%let source_tbl_list=;
proc sql noprint;
  select catt(libname,'.',memname,' (keep=account_number account_exposure bad account_limit)') into :source_tbl_list separated by ' '
  from dictionary.tables
  where libname='WORK' and memtype='DATA' and prxmatch('/^crl_\d{6}\s*$/oi',memname)>0 
  order by memname
  ;
quit;
%put &=source_tbl_list; 

 

View solution in original post


All Replies
Super User
Super User
Posts: 7,421

Re: How do I compare and transform variables from multiple monthly datasets?

Put all your data together, then you can sort it, retain values etc.

Example:

data temp_201608;
  id=1;var1=0;output;
  id=3;var1=1;output;
run;
data temp_201607;
  id=2;var1=1;output;
  id=3;var1=0;output;
run;

data have;
  set temp: indsname=n;
  dte=scan(n,2,"_");
run;

proc sort data=have;
  by id dte;
run;

Its never a good idea to work with many datasets each with "data" in the name.

Super User
Posts: 5,260

Re: How do I compare and transform variables from multiple monthly datasets?

My first question as an architect would be if it's possible to store all files in the same data set/table? The would minimize advanced (unnecassary) macro coding.

That said, please attach some sample data of what you have, and what you want?

Data never sleeps
New Contributor
Posts: 3

Re: How do I compare and transform variables from multiple monthly datasets?

Thanks,

 

I was avoinding a concatenation or append procedure as the datasets are many (40) and pretty massive (about 3m entries each). However, in case this solution is the easiest one I am happy to listen. I guess it might work by creating a new variable with the data (yyyymm) into the concatenation and comparing the bad variable for each consecutive months given the same account number? I cannot give any sample of the data as it is confidential, however I have attached a very raw example in Excel- hope it is sufficiently clear.

Solution
‎09-18-2016 06:56 AM
Respected Advisor
Posts: 3,902

Re: How do I compare and transform variables from multiple monthly datasets?

[ Edited ]

For your data and using SAS datasets I prefer to have monthly files over one big table. What you could do when creating a new monthly file is to also (re-) create a view over all the existing tables.

 

I believe I haven't fully understood how you want to create your new variable but I believe a variation of below could still allow for a "straight forward" solution approach without the need to create a big table or sorting of all the data.

/* create sample data */
data crl_201606;
  input account_number account_exposure bad account_limit;
  datalines;
574476 56.8 0 23
245245 67.5 0 24
252450 45.9 0 25
252457 90.5 1 26
252456 45.6 1 76
252351 23.3 0 5
568576 45.9 1 3
324553 0 1 32
857843 34.7 0 2
;
run;
data crl_201607;
  input account_number account_exposure bad account_limit;
  datalines;
574476 53.5 1 56
245245 67.8 0 45
252450 43.78 0 34
252457 74.7 1 56
252456 89.4 1 456
252351 21.5 0 23
568576 39.6 1 4
324553 2.5 1 53
857843 37.9 1 4
;
run;
data crl_201608;
  input account_number account_exposure bad account_limit;
  datalines;
574476 67.5 1 56
245245 68 1 45
252450 43.2 0 67
252457 74.7 1 4
252456 89.4 1 34
252351 21.5 0 23
568576 39.9 1 4
324553 2.9 1 53
857843 37.9 1 4
;
run;

/* create macro variable with list of available source tables */
proc sql noprint;
  select catt(libname,'.',memname,' (keep=account_number account_exposure bad account_limit)') into :source_tbl_list separated by ' '
  from dictionary.tables
  where libname='WORK' and memname like 'CRL^_%' escape '^'
  order by memname
  ;
quit;

/* analyse source tables and create new variable "want" */
data want /*(drop=_:)*/ ;
  length source_tbl $ 41;
  set &source_tbl_list indsname=_source_tbl;
  source_tbl=_source_tbl;
  if _n_=1 then 
    do;
      length _account_exposure_time_of_bad 8;
      dcl hash h1();
      _rc=h1.defineKey('account_number');
      _rc=h1.defineData('_account_exposure_time_of_bad');
      _rc=h1.defineDone();
    end;

   if h1.check() ne 0 then
    do;
      if bad=1 then
        do;
          _account_exposure_time_of_bad=account_exposure;
          _rc=h1.add();
        end;
    end;
  else if h1.find()=0 then
    do;
      want=account_limit/_account_exposure_time_of_bad;
    end;
run;

What I wasn't sure about:

1. Do you want to use the exposure value the first time things go "bad" or should it be the most recent "bad" one.

2. You talk about a selected list of accounts you need to look at (only the ones with a "bad" in the latest month or so). I haven't implemented this so the current code looks at all accounts. It wouldn't be to hard though to implement such an account selection as well. What you could do: create and load another hash table with only the accounts you need to look at in the very first iteration of the "want" data step. Then in every iteration of the data step look up the account_number in the hash (using check() method) and if not found in the hash go straight to the next iteration (e.g. by using "return" statement).

 

Adding a variation for table selection in case you have other tables or views starting with CRL_ in your library.

/* create macro variable with list of available source tables */
%let source_tbl_list=;
proc sql noprint;
  select catt(libname,'.',memname,' (keep=account_number account_exposure bad account_limit)') into :source_tbl_list separated by ' '
  from dictionary.tables
  where libname='WORK' and memtype='DATA' and prxmatch('/^crl_\d{6}\s*$/oi',memname)>0 
  order by memname
  ;
quit;
%put &=source_tbl_list; 

 

New Contributor
Posts: 3

Re: How do I compare and transform variables from multiple monthly datasets?

Thanks a lot, that sounds like a good alternative! Computing the variable is easy, the below is the part I was struggling with. 

Regarding your two point, yes, I need the first instance of 'bad', while the second one I belive I can solve with either a where subsetting of the most recent dataset or with yout idea, to save some processing time.

Just not sure what the if _n_=1 stands for?

Will try it tomorrow on the database to see how it goes Smiley Happy

Respected Advisor
Posts: 3,902

Re: How do I compare and transform variables from multiple monthly datasets?

A hash table gets instantiated during execution time but we want to only instantiate the hash table once. "_n_" is an automatic variable which contains the count of the iteration of the data step (so 1 in the first iteration, 2 in the 2nd itereation and so on). 

"if _n_=1" lets us do things only once in the very first iteration of the data step (here: instantiate the hash table).

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 583 views
  • 0 likes
  • 4 in conversation