Re: Create new variable based on first occurrence in data set

ColonelSanders · Posted 03-14-2018 11:24 AM

New to SAS and this site

I have a data set that I need code to help pull back first relationships manager based on the account number.

Example data set:
Date ACCOUNT RM
1-1-18 123 Gabe
1-2-18. 123 Gabe
1-3-18. 123. Mary

New variable would pull back ‘Gabe’ in each row.

Data set is large and will account for each day of the year and 1000’s of accounts.

Thanks in advance

Reeza · Posted 03-14-2018 11:28 AM

What does 'pull back' mean?

What are you looking for as output?

If you have multiple accounts it would be helpful if you included more than one in your example.

@ColonelSanders wrote:
New to SAS and this site

I have a data set that I need code to help pull back first relationships manager based on the account number.

Example data set:
Date ACCOUNT RM
1-1-18 123 Gabe
1-2-18. 123 Gabe
1-3-18. 123. Mary

New variable would pull back ‘Gabe’ in each row.

Data set is large and will account for each day of the year and 1000’s of accounts.

Thanks in advance

ColonelSanders · Posted 03-14-2018 11:49 AM

Wow that was fast and Sorry having to use my cell as site is blocked on work pc so made the ask to brief.

In the end I would like the following result. Keeping the data set as is with a new variable or ‘Original RM’ being added.

Date ACCOUNT RM Original RM
1-1-18 123 Gabe Gabe
1-2-18. 123 Gabe Gabe
1-3-18 123 Mary Gabe
1-1-18 234 Bob Bob
1-2-18 234 Bob Bob
1-3-18 234 Bob Bob
1-1-18 789 Sam Sam
1-2-18 789 Mike Sam
1-3-18 789 Betty Sam

I would be using this report to capture who added the account originaly not who manages the acccount today.

Reeza · Posted 03-14-2018 11:54 AM

data want;
set have;
by account date;

retain original_RM;

if first.account then original_RM = RM;
run;

You need a couple of different options here:

1. BY group processing to identify the FIRST. of each account (first.account)

2. RETAIN to keep the value until it's reset.

3. I included date in the BY to ensure the first date is used, but this also means you need to sort the data first.

@ColonelSanders wrote:
Wow that was fast and Sorry having to use my cell as site is blocked on work pc so made the ask to brief.

In the end I would like the following result. Keeping the data set as is with a new variable or ‘Original RM’ being added.

Date ACCOUNT RM Original RM
1-1-18 123 Gabe Gabe
1-2-18. 123 Gabe Gabe
1-3-18 123 Mary Gabe
1-1-18 234 Bob Bob
1-2-18 234 Bob Bob
1-3-18 234 Bob Bob
1-1-18 789 Sam Sam
1-2-18 789 Mike Sam
1-3-18 789 Betty Sam

I would be using this report to capture who added the account originaly not who manages the acccount today.

RW9 · Posted 03-14-2018 11:55 AM

Something like:

proc sort data=have out=first;
  by account date;
run;
data first (keep=account rm_first);
  set first (rename=(rm_original=rm_first));
  by account;
  if first.account;
run;
data want;
  merge have first;
  by account;
run;

Note, post test data in the form of a datastep in future, I am not here to type out test data, hence the above is not tested.

mkeintz · Posted 03-14-2018 11:52 AM

Is the data already sorted by ACCOUNT/DATE? If so, then:

data want;
  set have (drop=rm);
  by account;
  if first.account then set have (keep=rm) point=_n_;
run;

The first SET statement reads each incoming record from HAVE, but does not read in variable RM. As a result RM IS NOT REPLACED by that SET statement.
The BY ACCT statement tells sas (a) to expect the incoming data to be sorted, and (b) create a dummy variable (first.account and last.account) indicating whether the record-in-hand is the start or end of a BY-group.
The IF first.account then set have reads in only one variable (RM) and does not read sequentially, but points to observation _N_ of HAVE. And _N_ is an automatic variable that counts iterations of the DATA step, which in this case is also a count of observations in HAVE.

Take home message: Make the program-data-vector (better known as PDV) your friend. PDV is a list of variables in the data step. All but one of those variables are replaced with the first set. And the other one (RM) is replaced only occasionally.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Create new variable based on first occurrence in data set