Help using Base SAS procedures

Removing records in a time series that do not satisfy certain conditions_phaseII

Accepted Solution Solved
Reply
Super Contributor
Posts: 338
Accepted Solution

Removing records in a time series that do not satisfy certain conditions_phaseII

Hi SAS Discussion Forum,

In the data set (belwo), bank_number, Account_number and current_date are the

variables that would uniquely identify a record. 

You will see account_number 111 and 777 in two places but they are unique as they belong to 2 different banks.

data a;

informat Current_date date9.;

Input Current_date   Account_number $12-14 Arrears_Band $17-23 bank_number;

Format Current_date date9.;

cards;

28FEB2010  111  NPNA    10

31MAR2010  111  Current 10

31MAY2010  111  NPNA    10

31MAR2010  222  NPNA    20

31MAY2010  222  Current 20

31OCT2010  222  Current 20

30JUN2011  222  90 +    20

31JUL2011  222  NPNA    20

28FEB2010  444  30 - 60 30

31MAR2010  444  Current 30

30APR2010  444  30 - 60 50

31MAY2010  444  Current 50

28FEB2010  555  60 - 90 70

30JUN2011  555  90 +    70

31MAR2011  666  60 - 90 80

30JUN2011  666  90 +    80

28FEB2010  777  90 +    90

30JUN2011  777  Current 90

31MAR2011  111  90 +    100

30JUN2011  111  Current 100

31DEC2010  999  Current 10

31JAN2011  999  NPNA    10

28FEB2010  123  NPNA    20

31MAR2010  123  NPNA    20

30APR2010  123  NPNA    20

31DEC2010  777  Current 90

31JAN2011  777  NPNA    90

28FEB2011  777  NPNA    90

31MAR2011  777  NPNA    90

;

run;

/*Question:

I need to remove all the records that do not satisfy the below 2 conditions.

1) For a given 'account_number', when first occurrence of 'current_date'

=28FEB2010, 'Arrears_Band' could take any value

2) For a given 'account_number', when first occurrence of 'current_date'

^=28FEB2010, "Arrears_Band" must be "Current"

Answer:

All Records pertaining to combinations below should be completely removed from our data set using a suitable code (pink color records should be removed from data set).

Account_number   Bank_number

222                  20

444                  50

666                  80

111                  100

My effort:

I have tried the following code (Bolotin, Haikuo, Dan et

al. gratefully acknowledged) but doesn’t work.

proc sort data=a out=TT;

by bank_number account_number current_date;

run;

data want;

    set tt;

    by bank_number account_number current_date;

    if first.current_date

      then _f1 = current_date ne '28FEB2010'd and Arrears_Band

ne 'Current' ;

        retain _f1;

    if _f1 ne 1 then output;

drop _:;

run;

Could anyone of you please help me to code this.

Thank you

Mirisage


Accepted Solutions
Solution
‎08-20-2012 10:26 PM
Respected Advisor
Posts: 3,124

Re: Removing records in a time series that do not satisfy certain conditions_phaseII

All you need may only involve one small change in your original code:

data want;

    set tt;

    by bank_number account_number current_date;

    if first.account_number

      then _f1 = (current_date ne '28FEB2010'd and Arrears_Band

                    ne 'Current') ;

        retain _f1;

    if _f1 ne 1 then output;

drop _:;

run;

Haikuo

View solution in original post


All Replies
Solution
‎08-20-2012 10:26 PM
Respected Advisor
Posts: 3,124

Re: Removing records in a time series that do not satisfy certain conditions_phaseII

All you need may only involve one small change in your original code:

data want;

    set tt;

    by bank_number account_number current_date;

    if first.account_number

      then _f1 = (current_date ne '28FEB2010'd and Arrears_Band

                    ne 'Current') ;

        retain _f1;

    if _f1 ne 1 then output;

drop _:;

run;

Haikuo

Super User
Posts: 9,671

Re: Removing records in a time series that do not satisfy certain conditions_phaseII

If I understand what you  mean.

data a;
informat Current_date date9.;
Input Current_date   Account_number $12-14 Arrears_Band $17-23 bank_number;
Format Current_date date9.;
cards;
28FEB2010  111  NPNA    10
31MAR2010  111  Current 10
31MAY2010  111  NPNA    10
31MAR2010  222  NPNA    20
31MAY2010  222  Current 20
31OCT2010  222  Current 20
30JUN2011  222  90 +    20
31JUL2011  222  NPNA    20
28FEB2010  444  30 - 60 30
31MAR2010  444  Current 30
30APR2010  444  30 - 60 50
31MAY2010  444  Current 50
28FEB2010  555  60 - 90 70
30JUN2011  555  90 +    70
31MAR2011  666  60 - 90 80
30JUN2011  666  90 +    80
28FEB2010  777  90 +    90
30JUN2011  777  Current 90
31MAR2011  111  90 +    100
30JUN2011  111  Current 100
31DEC2010  999  Current 10
31JAN2011  999  NPNA    10
28FEB2010  123  NPNA    20
31MAR2010  123  NPNA    20
30APR2010  123  NPNA    20
31DEC2010  777  Current 90
31JAN2011  777  NPNA    90
28FEB2011  777  NPNA    90
31MAR2011  777  NPNA    90
;
run;

data a temp(keep=group);
 set a ;
 by bank_number Account_number notsorted;
 if first.account_number then do;
                                group+1;
                                if Current_date ne '28FEB2010'd and Arrears_Band ne "Current" then output temp;
                              end;
 output a;
run;
data want;
 merge a temp(in=inb);
 by group;
 if not inb;
run;
 

Ksharp

Super Contributor
Posts: 338

Re: Removing records in a time series that do not satisfy certain conditions_phaseII

Hi Haikuo and Ksharp.

Both of your codes are ideally working for the given sample data set.

However, when I applied these 2 codes to my 1 million data set, one code selected more records than the other.

To check, which code works for my original data set, I have used the procedure below and found that Haikuo's code works as I intended. I was not clever enough to layout the representative small dataset correctly.

Thanks again to both of you for this great help.

Regards

Mirisage

/*Checking if selected records satisfy my two conditions*/

data delete_this;

set want;

by bank_number account_number current_date;

if first.account_number;

run;

proc freq data=delete_this;

tables current_date *Arrears_Band;

run;

/*END. Checking if selected records satisfy my two conditions*/

/*Checking if non-selected records do not satisfy my two conditions*/

/*Sorting original data set*/

proc sort data=a out=b;

  by bank_number account_number current_date; /*always I ought to go by this order of vars*/

run;

/*sorting created data set */

proc sort data=want out=c;

  by bank_number account_number current_date; /*always I ought to go by this order of vars*/

run;

/*merging to seperate out non selected data set  */

data merged;

   merge b (in=x)

         c (in=y)

            ;

            by bank_number account_number current_date;

            if x and not y;

run;

/*by group processing*/

data any_name;

set merged;

by bank_number account_number current_date;

if first.account_number;

run;

/*Checking. I found we are correct*/

/*not a single account had its first currnet_date =28FEB2010*/

/* but not a single account had "current" as its arrers_band*/

/*So, our removing of records is correct*/

proc freq data=any_name;

  tables current_date*Arrears_Band;

run;

/*END*/

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 3 replies
  • 172 views
  • 3 likes
  • 3 in conversation