How to backfill missing values using the same table

Scooby3g · Posted 11-11-2024 05:15 PM

Hi, is it possible to backfill the obs in row 1 , 2 and 7 with the customer_ID that became available in the later months?

data Have;
	input customer_ID $ Na $ month $ cost;
	datalines;
	.	Joe	202401 100
	.	Joe	202402 150
	123 Joe	202403 200
	123	Joe	202404 250
	123 Joe	202405 300
	123 Joe	202406 375
	.	Jane 202401 100
	456 Jane 202402 175
	456 Jane 202403 200
	456 Jane 202404 275
	456 Jane 202405 300
	456 Jane 202406 375
	;
run;

Patrick · Posted 11-11-2024 06:07 PM

That's logically only possible if for each name there is exactly one matching ID value.

Even though "ugly" but because SAS tables are sequential it's also possible that your table is in a a sort order that we could always use the first non-blank ID for the same name.

You have to tell us what's the case.

Ksharp · Posted 11-11-2024 07:59 PM

data Have;
	input customer_ID $ Na $ month : yymmn6. cost;
	format month yymmn6.;
	datalines;
	.	Joe	202401 100
	.	Joe	202402 150
	123 Joe	202403 200
	123	Joe	202404 250
	123 Joe	202405 300
	123 Joe	202406 375
	.	Jane 202401 100
	456 Jane 202402 175
	456 Jane 202403 200
	456 Jane 202404 275
	456 Jane 202405 300
	456 Jane 202406 375
	;
run;
proc sort data=have out=temp;
by Na descending month;
run;
data want;
 length customer_ID $ 80;
 set temp(rename=(customer_ID=_customer_ID));
 by Na;
 retain customer_ID ;
 if first.Na then call missing(customer_ID);
 if not missing(_customer_ID) then customer_ID=_customer_ID;
 drop _customer_ID;
run;
proc sort data=want;by Na  month;run;

Ksharp · Posted 11-11-2024 08:15 PM

If you have a big table,could try DOW skill:

data Have;
	input customer_ID $ Na $ month : yymmn6. cost;
	format month yymmn6.;
	datalines;
	.	Joe	202401 100
	.	Joe	202402 150
	123 Joe	202403 200
	123	Joe	202404 250
	123 Joe	202405 300
	123 Joe	202406 375
	.	Jane 202401 100
	456 Jane 202402 175
	456 Jane 202403 200
	456 Jane 202404 275
	456 Jane 202405 300
	456 Jane 202406 375
	;
run;

data want;
if 0 then set have;
 do until(not missing(customer_ID) or last.Na);
  set have;
  by Na notsorted;
 end;
 _customer_ID=customer_ID;
 do until(not missing(customer_ID) or last.Na);
  set have;
  by Na notsorted;
  output;
 end;
drop customer_ID;
run;

mkeintz · Posted 11-12-2024 01:11 PM

No sorting is required. If each name has at least one non-missing customer_id, then

data Have;
	input customer_ID $ Na $ month $ cost;
	datalines;
	.	Joe	202401 100
	.	Joe	202402 150
	123 Joe	202403 200
	123	Joe	202404 250
	123 Joe	202405 300
	123 Joe	202406 375
	.	Jane 202401 100
	456 Jane 202402 175
	456 Jane 202403 200
	456 Jane 202404 275
	456 Jane 202405 300
	456 Jane 202406 375
run;
data backfill/view=backfill;
  set have (keep=customer_id na where=(customer_id^=.));
  by na notsorted;
  if last.na;
run;

data want;
  set have (drop=customer_id);
  by na notsorted;
  if first.na then set backfill;
run;

This is two data steps, but only one retrieval of disk data, so should be very efficient.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Patrick · Posted 11-13-2024 08:32 PM

And here an approach using a hash table lookup.

data Have;
  input customer_ID Na $ month : yymmn6. cost;
  format month yymmn6.;
  datalines;
. Joe 202401 100
. Joe 202402 150
123 Joe 202403 200
123 Joe 202404 250
123 Joe 202405 300
123 Joe 202406 375
. Jane 202401 100
456 Jane 202402 175
456 Jane 202403 200
456 Jane 202404 275
456 Jane 202405 300
456 Jane 202406 375
;
run;

data want;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'have(where=(not missing(customer_id)))');
      h1.defineKey('na');
      h1.defineData('customer_id');
      h1.defineDone();
    end;
  set have;
  if missing(customer_id) then h1.find();
run;

quickbluefish · Posted 11-13-2024 09:53 PM

This is really interesting - almost like creating a name-to-customer ID lookup format and then applying it, but much more compact. Almost 20 years of using SAS and I think I've used a hash table maybe twice.

Patrick · Posted 11-14-2024 06:19 PM

@quickbluefish

It's very worthwhile getting up-to-speed with hash tables. Like with formats hash table lookups can be very efficient because they don't require sorting of the main table. Unlike with formats you can have a composite key and lookup multiple variables. You can also load multiple rows (items) per key and then iterate over them using the do_over() method.

How to backfill missing values using the same table

Re: How to backfill missing values using the same table

Re: How to backfill missing values using the same table

Re: How to backfill missing values using the same table

Re: How to backfill missing values using the same table

Re: How to backfill missing values using the same table

Re: How to backfill missing values using the same table

Re: How to backfill missing values using the same table

Registration is open

SAS Training: Just a Click Away