BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
raddad34
Fluorite | Level 6

So i have a dataset like this that has a few million rows:

Unique_Id          Age        Segment
1                  56         0
1                  57         4       
1                  57         0   
2                  23         0
2                  24         0        
2                  24         1
3                  77         0
3                  77         2
3                  78         0
3                  78         0
4                  86         0
4                  88         1.1
4                  88         0
5                  12         0
5                  12         0
5                  14         0
5                  14         0

I want to fill the segment portion up by matching it with age and unique_id. So unique_id 2 is in segment 1 when age is 24, i want both unqiue id 2 at age 24s to have segment 1 but not have it at age 23.

 

For a more clear picture, heres what i am looking for the dataset to resemble:

Unique_Id          Age        Segment
1                  56         0
1                  57         4       
1                  57         4  
2                  23         0
2                  24         1        
2                  24         1
3                  77         2
3                  77         2
3                  78         0
3                  78         0
4                  86         0
4                  88         1.1
4                  88         1.1
5                  12         0
5                  12         0
5                  14         0
5                  14         0

My problem is that I don't even know where to start to do this. If anyone can provide help or tips it would be much appreciated. Thanks.

 

1 ACCEPTED SOLUTION

Accepted Solutions
jimbarbour
Meteorite | Level 14

I think a Data step would be a good approach; a few million rows is nothing to a Data step.

 

Here's some sample code, below, followed by the results.  Are these results what you want?

 

Jim

 

DATA	Have;
	INPUT
	Unique_ID          Age        Segment;
DATALINES;
1                  56         0
1                  57         4       
1                  57         0   
2                  23         0
2                  24         0        
2                  24         1
3                  77         0
3                  77         2
3                  78         0
3                  78         0
4                  86         0
4                  88         1.1
4                  88         0
5                  12         0
5                  12         0
5                  14         0
5                  14         0
;
RUN;

PROC	SORT	DATA=Have;
	BY	Unique_ID	Age	DESCENDING	Segment;
RUN;

DATA	Want;
	DROP	_:;
	SET	Have;
		BY	Unique_ID	Age;
	RETAIN	_Prior_Segment	0;

	IF	Segment					>	0	THEN
		DO;
			_Prior_Segment		=	Segment;
		END;
	ELSE
		DO;
			Segment				=	_Prior_Segment;
		END;

	OUTPUT;

	IF	LAST.Age						OR
		LAST.Unique_ID					THEN
		_Prior_Segment			=	0;
RUN;

jimbarbour_0-1628867423497.png

 

 

View solution in original post

7 REPLIES 7
PaigeMiller
Diamond | Level 26

Compute the max segment and merge in with the other columns using PROC SQL. Does not work if segment is a negative number.

 

proc sql;
    create table want as select a.unique_id,a.age,b.segment 
    from have as a left join 
    (select unique_id,age,max(segment) as segment from have group by unique_id,age) as b
    on a.unique_id=b.unique_id and a.age=b.age;
quit;

 

If you really have to do this on a million records, it could be slow, and then maybe PROC SUMMARY is a better solution.

 

Please, for you future benefit, let's get the terminology right, a cell with a 0 is not an empty cell. It is a cell with a zero.

--
Paige Miller
jimbarbour
Meteorite | Level 14

I think a Data step would be a good approach; a few million rows is nothing to a Data step.

 

Here's some sample code, below, followed by the results.  Are these results what you want?

 

Jim

 

DATA	Have;
	INPUT
	Unique_ID          Age        Segment;
DATALINES;
1                  56         0
1                  57         4       
1                  57         0   
2                  23         0
2                  24         0        
2                  24         1
3                  77         0
3                  77         2
3                  78         0
3                  78         0
4                  86         0
4                  88         1.1
4                  88         0
5                  12         0
5                  12         0
5                  14         0
5                  14         0
;
RUN;

PROC	SORT	DATA=Have;
	BY	Unique_ID	Age	DESCENDING	Segment;
RUN;

DATA	Want;
	DROP	_:;
	SET	Have;
		BY	Unique_ID	Age;
	RETAIN	_Prior_Segment	0;

	IF	Segment					>	0	THEN
		DO;
			_Prior_Segment		=	Segment;
		END;
	ELSE
		DO;
			Segment				=	_Prior_Segment;
		END;

	OUTPUT;

	IF	LAST.Age						OR
		LAST.Unique_ID					THEN
		_Prior_Segment			=	0;
RUN;

jimbarbour_0-1628867423497.png

 

 

PaigeMiller
Diamond | Level 26

This doesn't work for unique_id=3 and age=77

--
Paige Miller
jimbarbour
Meteorite | Level 14

@PaigeMiller, oh, I see what you're saying.  They want previous occurrences of an ID - Age combination to have the same Segment.  Thank you.  I think that could be corrected by a descending Sort on Segment.  Prior post adjusted accordingly.  The only "fly in the ointment" here is if there are multiple segments in a given age.

 

Jim

 

 

Patrick
Opal | Level 21

Something like below should work.

data	have;
	input	unique_id age segment;
datalines;
1 56 0
1 57 4 
1 57 0 
2 23 0
2 24 0 
2 24 1
3 77 0
3 77 2
3 78 0
3 78 0
4 86 0
4 88 1.1
4 88 0
5 12 0
5 12 0
5 14 0
5 14 0
;

proc sort data=have out=want;
  by unique_id age descending segment;
run;

data want;
  set want(rename=(segment=_segment));
  by unique_id age;
  retain segment;
  if first.age then segment=_segment;
  drop _segment;
run;
Ksharp
Super User
DATA	Have;
	INPUT
	Unique_ID          Age        Segment;
DATALINES;
1                  56         0
1                  57         4       
1                  57         0   
2                  23         0
2                  24         0        
2                  24         1
3                  77         0
3                  77         2
3                  78         0
3                  78         0
4                  86         0
4                  88         1.1
4                  88         0
5                  12         0
5                  12         0
5                  14         0
5                  14         0
;
RUN;
proc sql;
create table want(drop=dummy) as
select unique_id,age,segment as dummy,max(segment) as segment
 from have
   group by unique_id,age;
quit;
andreas_lds
Jade | Level 19

Using a hash-object is missing the collection of solutions:

data want;
   set have;
   
   if _n_ = 1 then do;
      declare hash h(dataset: 'have(where=(Segment ^= 0))');
      h.defineKey('Unique_Id', 'Age');
      h.defineData('Segment');
      h.defineDone();
   end;
   
   rc = h.find();
   /* no need to check rc */
   drop rc;
run;

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1807 views
  • 3 likes
  • 6 in conversation