BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
noemi_b
Obsidian | Level 7

Hi all!

I am stuck with the process of removing missing values from my dataset that I hope you will help me to solve. I am using SAS University Edition v9.4 . 

Dataset Description:


I have a confidential dataset consisting of 135000 bank customers and 172 variables that describe the customer's demographics, the accounts' transactions volume, the client's geographical coordinates, dummies for having a mortgage or not, etc...

All observations are sorted by customer ID and collected across the time period that goes from April 2014 to August 2015. 
Well..this dataset is a panel data and among the 172 variables I have the variable "customer_status_numeric" that is 1 (if customer is Passive) and 0 if customer is Active. The peculiarity of this variable is that as soon as the customer turns from Active into Passive there are missing values for the rest of the observation period. It means that if the dummy "customer_status_numeric" is 1 (Passive) then all the other 169 variables (except for customer ID and month) become missing values. This is because when the customer becomes Passive it means that his/her bank account gets closed and therefore no data will be collected for the remaining time period.

Problem to be solved:

Now my problem is as follows. There are certain customers that are passive across all the observation period, and, thus, they are meaningless for my analysis given the fact that they present no observations but customer ID and Month. 
Therefore, I have to find a way to tell SAS to delete those customers that are Passive both on 01-Apr-2014 and on 01-Aug-2015. 
However, there are also customers that are Active until a certain date- i.e. 01-May-2015- and then they become Passive for the remaining period- i.e- from 01-May-2015 to 01-Aug-2015-. I need to keep those customers as they will be relevant for my study.

In other words, I need to delete customers whose dummy variable "customer_status_numeric" appears to be 1 both on 01-Apr-2014 and on 01-Aug-2015. In this way I will remove both customers who are Passive from the beginning to the end of the time period and those clients who are active first, then passive for a certain time interval, and then active again and finally passive again. 

Attached a small snapshot of my dataset.

 

I hope I gave you a bit of an overview of my problem and that you can help. Please, reply in simple terms. I am still a beginner of SAS so I will not understand if you will use too many difficult statistical terms in your answer. I only need a code that hopefully will help! 

Thank you in advance!


Screen Shot 2017-07-03 at 17.05.49.png
1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

One piece that's not clear:  Is your date variable a character string, or is it a numeric?  PROC CONTENTS will tell you that.

 

You would identify the observations slightly differently.  For character strings:

 

set have (where=(date='2014-04-01'));

 

vs. for numerics:

 

set have (where=(date='01Apr2014'd));

 

Using the character strings to illustrate, here is a way you could do this:

 

data want;

set have (where=(date='2014-04-01') in=in1)

   have (where=(date='2015-08-01') in=in2)

   have (in=in3);

by customer_id;

if in1 then total_passive = customer_status_numeric;

else if in2 then total_passive + customer_status_numeric;

if in3 and total_passive < 2;

run;

 

This logic does assume that you have data for the complete time period for all customers.

View solution in original post

5 REPLIES 5
Astounding
PROC Star

One piece that's not clear:  Is your date variable a character string, or is it a numeric?  PROC CONTENTS will tell you that.

 

You would identify the observations slightly differently.  For character strings:

 

set have (where=(date='2014-04-01'));

 

vs. for numerics:

 

set have (where=(date='01Apr2014'd));

 

Using the character strings to illustrate, here is a way you could do this:

 

data want;

set have (where=(date='2014-04-01') in=in1)

   have (where=(date='2015-08-01') in=in2)

   have (in=in3);

by customer_id;

if in1 then total_passive = customer_status_numeric;

else if in2 then total_passive + customer_status_numeric;

if in3 and total_passive < 2;

run;

 

This logic does assume that you have data for the complete time period for all customers.

noemi_b
Obsidian | Level 7

@Astounding Thank you for the kind reply!

First of all, my date variable is numeric. Secondly, when you write "This logic does assume that you have data for the complete time period for all customers" what do you mean? Do you mean that the logic assumes I have data in the variable "customer_status_numeric" for the complete time period and for all customers? If yes, then sure I have data for all the customers and I will try your code out. 

Astounding
PROC Star

Actually, it means you have customer_status_numeric for both date values (starting and ending point).

 

Since date is numeric, you need to switch to the other form for referring to specific dates:

 

set have (where=(date=('01apr2014'd) in=in1)

   have (where=(date=('1Aug2015'd) in=in2)

   have (in=in3);

 

noemi_b
Obsidian | Level 7

@Astounding Yes, I do have customer_status_numeric for both date values. Then your logic is right!

I only have one more question to address you. By writing in=in1, in=in2 and in=in3, do we tell to SAS that if the variable customer_status_numeric is 1 both on 01-Apr-2014 and on 01-Aug-2015 then delete customer_id from the dataset? What I actually need to solve my issue. 

Astounding
PROC Star

The in= variables determine where the observation came from.  So customer by customer ...

 

The in=in1 observations bring in just the April 1, 2014 observation, and sets up the initial total status value for that customer.

 

The in=in2 observations bring in just the August 1, 2015 observation, and possibly increments the total status value.

 

The in=in3 observations bring in all observations for the customer (including the April 1 and August 1 observations).

 

The IF statement subsets observations.  It looks for in3 and the previously computed total status to decide whether or not to keep the observation or delete it.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 1063 views
  • 1 like
  • 2 in conversation