Solved: Re: Filling empty cells based on unique id

raddad34 · Posted 08-13-2021 10:25 AM

So i have a dataset like this that has a few million rows:

Unique_Id          Age        Segment
1                  56         0
1                  57         4       
1                  57         0   
2                  23         0
2                  24         0        
2                  24         1
3                  77         0
3                  77         2
3                  78         0
3                  78         0
4                  86         0
4                  88         1.1
4                  88         0
5                  12         0
5                  12         0
5                  14         0
5                  14         0

I want to fill the segment portion up by matching it with age and unique_id. So unique_id 2 is in segment 1 when age is 24, i want both unqiue id 2 at age 24s to have segment 1 but not have it at age 23.

For a more clear picture, heres what i am looking for the dataset to resemble:

Unique_Id          Age        Segment
1                  56         0
1                  57         4       
1                  57         4  
2                  23         0
2                  24         1        
2                  24         1
3                  77         2
3                  77         2
3                  78         0
3                  78         0
4                  86         0
4                  88         1.1
4                  88         1.1
5                  12         0
5                  12         0
5                  14         0
5                  14         0

My problem is that I don't even know where to start to do this. If anyone can provide help or tips it would be much appreciated. Thanks.

jimbarbour · Posted 08-13-2021 10:58 AM

I think a Data step would be a good approach; a few million rows is nothing to a Data step.

Here's some sample code, below, followed by the results. Are these results what you want?

Jim

DATA	Have;
	INPUT
	Unique_ID          Age        Segment;
DATALINES;
1                  56         0
1                  57         4       
1                  57         0   
2                  23         0
2                  24         0        
2                  24         1
3                  77         0
3                  77         2
3                  78         0
3                  78         0
4                  86         0
4                  88         1.1
4                  88         0
5                  12         0
5                  12         0
5                  14         0
5                  14         0
;
RUN;

PROC	SORT	DATA=Have;
	BY	Unique_ID	Age	DESCENDING	Segment;
RUN;

DATA	Want;
	DROP	_:;
	SET	Have;
		BY	Unique_ID	Age;
	RETAIN	_Prior_Segment	0;

	IF	Segment					>	0	THEN
		DO;
			_Prior_Segment		=	Segment;
		END;
	ELSE
		DO;
			Segment				=	_Prior_Segment;
		END;

	OUTPUT;

	IF	LAST.Age						OR
		LAST.Unique_ID					THEN
		_Prior_Segment			=	0;
RUN;

View solution in original post

PaigeMiller · Posted 08-13-2021 10:46 AM

Compute the max segment and merge in with the other columns using PROC SQL. Does not work if segment is a negative number.

proc sql;
    create table want as select a.unique_id,a.age,b.segment 
    from have as a left join 
    (select unique_id,age,max(segment) as segment from have group by unique_id,age) as b
    on a.unique_id=b.unique_id and a.age=b.age;
quit;

If you really have to do this on a million records, it could be slow, and then maybe PROC SUMMARY is a better solution.

Please, for you future benefit, let's get the terminology right, a cell with a 0 is not an empty cell. It is a cell with a zero.

--
Paige Miller

jimbarbour · Posted 08-13-2021 10:58 AM

I think a Data step would be a good approach; a few million rows is nothing to a Data step.

Here's some sample code, below, followed by the results. Are these results what you want?

Jim

DATA	Have;
	INPUT
	Unique_ID          Age        Segment;
DATALINES;
1                  56         0
1                  57         4       
1                  57         0   
2                  23         0
2                  24         0        
2                  24         1
3                  77         0
3                  77         2
3                  78         0
3                  78         0
4                  86         0
4                  88         1.1
4                  88         0
5                  12         0
5                  12         0
5                  14         0
5                  14         0
;
RUN;

PROC	SORT	DATA=Have;
	BY	Unique_ID	Age	DESCENDING	Segment;
RUN;

DATA	Want;
	DROP	_:;
	SET	Have;
		BY	Unique_ID	Age;
	RETAIN	_Prior_Segment	0;

	IF	Segment					>	0	THEN
		DO;
			_Prior_Segment		=	Segment;
		END;
	ELSE
		DO;
			Segment				=	_Prior_Segment;
		END;

	OUTPUT;

	IF	LAST.Age						OR
		LAST.Unique_ID					THEN
		_Prior_Segment			=	0;
RUN;

PaigeMiller · Posted 08-13-2021 11:00 AM

This doesn't work for unique_id=3 and age=77

--
Paige Miller

jimbarbour · Posted 08-13-2021 11:09 AM

@PaigeMiller, oh, I see what you're saying. They want previous occurrences of an ID - Age combination to have the same Segment. Thank you. I think that could be corrected by a descending Sort on Segment. Prior post adjusted accordingly. The only "fly in the ointment" here is if there are multiple segments in a given age.

Jim

Patrick · Posted 08-13-2021 11:32 AM

Something like below should work.

data	have;
	input	unique_id age segment;
datalines;
1 56 0
1 57 4 
1 57 0 
2 23 0
2 24 0 
2 24 1
3 77 0
3 77 2
3 78 0
3 78 0
4 86 0
4 88 1.1
4 88 0
5 12 0
5 12 0
5 14 0
5 14 0
;

proc sort data=have out=want;
  by unique_id age descending segment;
run;

data want;
  set want(rename=(segment=_segment));
  by unique_id age;
  retain segment;
  if first.age then segment=_segment;
  drop _segment;
run;

Ksharp · Posted 08-15-2021 08:21 AM

DATA	Have;
	INPUT
	Unique_ID          Age        Segment;
DATALINES;
1                  56         0
1                  57         4       
1                  57         0   
2                  23         0
2                  24         0        
2                  24         1
3                  77         0
3                  77         2
3                  78         0
3                  78         0
4                  86         0
4                  88         1.1
4                  88         0
5                  12         0
5                  12         0
5                  14         0
5                  14         0
;
RUN;
proc sql;
create table want(drop=dummy) as
select unique_id,age,segment as dummy,max(segment) as segment
 from have
   group by unique_id,age;
quit;

andreas_lds · Posted 08-16-2021 02:36 AM

Using a hash-object is missing the collection of solutions:

data want;
   set have;
   
   if _n_ = 1 then do;
      declare hash h(dataset: 'have(where=(Segment ^= 0))');
      h.defineKey('Unique_Id', 'Age');
      h.defineData('Segment');
      h.defineDone();
   end;
   
   rc = h.find();
   /* no need to check rc */
   drop rc;
run;

Registration is open

SAS Training: Just a Click Away