BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Mirisage
Obsidian | Level 7

Hi SAS Discussion Forum,

In the data set (belwo), bank_number, Account_number and current_date are the

variables that would uniquely identify a record. 

You will see account_number 111 and 777 in two places but they are unique as they belong to 2 different banks.

data a;

informat Current_date date9.;

Input Current_date   Account_number $12-14 Arrears_Band $17-23 bank_number;

Format Current_date date9.;

cards;

28FEB2010  111  NPNA    10

31MAR2010  111  Current 10

31MAY2010  111  NPNA    10

31MAR2010  222  NPNA    20

31MAY2010  222  Current 20

31OCT2010  222  Current 20

30JUN2011  222  90 +    20

31JUL2011  222  NPNA    20

28FEB2010  444  30 - 60 30

31MAR2010  444  Current 30

30APR2010  444  30 - 60 50

31MAY2010  444  Current 50

28FEB2010  555  60 - 90 70

30JUN2011  555  90 +    70

31MAR2011  666  60 - 90 80

30JUN2011  666  90 +    80

28FEB2010  777  90 +    90

30JUN2011  777  Current 90

31MAR2011  111  90 +    100

30JUN2011  111  Current 100

31DEC2010  999  Current 10

31JAN2011  999  NPNA    10

28FEB2010  123  NPNA    20

31MAR2010  123  NPNA    20

30APR2010  123  NPNA    20

31DEC2010  777  Current 90

31JAN2011  777  NPNA    90

28FEB2011  777  NPNA    90

31MAR2011  777  NPNA    90

;

run;

/*Question:

I need to remove all the records that do not satisfy the below 2 conditions.

1) For a given 'account_number', when first occurrence of 'current_date'

=28FEB2010, 'Arrears_Band' could take any value

2) For a given 'account_number', when first occurrence of 'current_date'

^=28FEB2010, "Arrears_Band" must be "Current"

Answer:

All Records pertaining to combinations below should be completely removed from our data set using a suitable code (pink color records should be removed from data set).

Account_number   Bank_number

222                  20

444                  50

666                  80

111                  100

My effort:

I have tried the following code (Bolotin, Haikuo, Dan et

al. gratefully acknowledged) but doesn’t work.

proc sort data=a out=TT;

by bank_number account_number current_date;

run;

data want;

    set tt;

    by bank_number account_number current_date;

    if first.current_date

      then _f1 = current_date ne '28FEB2010'd and Arrears_Band

ne 'Current' ;

        retain _f1;

    if _f1 ne 1 then output;

drop _:;

run;

Could anyone of you please help me to code this.

Thank you

Mirisage

1 ACCEPTED SOLUTION

Accepted Solutions
Haikuo
Onyx | Level 15

All you need may only involve one small change in your original code:

data want;

    set tt;

    by bank_number account_number current_date;

    if first.account_number

      then _f1 = (current_date ne '28FEB2010'd and Arrears_Band

                    ne 'Current') ;

        retain _f1;

    if _f1 ne 1 then output;

drop _:;

run;

Haikuo

View solution in original post

3 REPLIES 3
Haikuo
Onyx | Level 15

All you need may only involve one small change in your original code:

data want;

    set tt;

    by bank_number account_number current_date;

    if first.account_number

      then _f1 = (current_date ne '28FEB2010'd and Arrears_Band

                    ne 'Current') ;

        retain _f1;

    if _f1 ne 1 then output;

drop _:;

run;

Haikuo

Ksharp
Super User

If I understand what you  mean.

data a;
informat Current_date date9.;
Input Current_date   Account_number $12-14 Arrears_Band $17-23 bank_number;
Format Current_date date9.;
cards;
28FEB2010  111  NPNA    10
31MAR2010  111  Current 10
31MAY2010  111  NPNA    10
31MAR2010  222  NPNA    20
31MAY2010  222  Current 20
31OCT2010  222  Current 20
30JUN2011  222  90 +    20
31JUL2011  222  NPNA    20
28FEB2010  444  30 - 60 30
31MAR2010  444  Current 30
30APR2010  444  30 - 60 50
31MAY2010  444  Current 50
28FEB2010  555  60 - 90 70
30JUN2011  555  90 +    70
31MAR2011  666  60 - 90 80
30JUN2011  666  90 +    80
28FEB2010  777  90 +    90
30JUN2011  777  Current 90
31MAR2011  111  90 +    100
30JUN2011  111  Current 100
31DEC2010  999  Current 10
31JAN2011  999  NPNA    10
28FEB2010  123  NPNA    20
31MAR2010  123  NPNA    20
30APR2010  123  NPNA    20
31DEC2010  777  Current 90
31JAN2011  777  NPNA    90
28FEB2011  777  NPNA    90
31MAR2011  777  NPNA    90
;
run;

data a temp(keep=group);
 set a ;
 by bank_number Account_number notsorted;
 if first.account_number then do;
                                group+1;
                                if Current_date ne '28FEB2010'd and Arrears_Band ne "Current" then output temp;
                              end;
 output a;
run;
data want;
 merge a temp(in=inb);
 by group;
 if not inb;
run;
 

Ksharp

Mirisage
Obsidian | Level 7

Hi Haikuo and Ksharp.

Both of your codes are ideally working for the given sample data set.

However, when I applied these 2 codes to my 1 million data set, one code selected more records than the other.

To check, which code works for my original data set, I have used the procedure below and found that Haikuo's code works as I intended. I was not clever enough to layout the representative small dataset correctly.

Thanks again to both of you for this great help.

Regards

Mirisage

/*Checking if selected records satisfy my two conditions*/

data delete_this;

set want;

by bank_number account_number current_date;

if first.account_number;

run;

proc freq data=delete_this;

tables current_date *Arrears_Band;

run;

/*END. Checking if selected records satisfy my two conditions*/

/*Checking if non-selected records do not satisfy my two conditions*/

/*Sorting original data set*/

proc sort data=a out=b;

  by bank_number account_number current_date; /*always I ought to go by this order of vars*/

run;

/*sorting created data set */

proc sort data=want out=c;

  by bank_number account_number current_date; /*always I ought to go by this order of vars*/

run;

/*merging to seperate out non selected data set  */

data merged;

   merge b (in=x)

         c (in=y)

            ;

            by bank_number account_number current_date;

            if x and not y;

run;

/*by group processing*/

data any_name;

set merged;

by bank_number account_number current_date;

if first.account_number;

run;

/*Checking. I found we are correct*/

/*not a single account had its first currnet_date =28FEB2010*/

/* but not a single account had "current" as its arrers_band*/

/*So, our removing of records is correct*/

proc freq data=any_name;

  tables current_date*Arrears_Band;

run;

/*END*/

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 771 views
  • 3 likes
  • 3 in conversation