BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
raddad34
Fluorite | Level 6

So i have a dataset like this that has a few million rows:

Unique_Id          Age        Segment
1                  56         0
1                  57         4       
1                  57         0   
2                  23         0
2                  24         0        
2                  24         1
3                  77         0
3                  77         2
3                  78         0
3                  78         0
4                  86         0
4                  88         1.1
4                  88         0
5                  12         0
5                  12         0
5                  14         0
5                  14         0

I want to fill the segment portion up by matching it with age and unique_id. So unique_id 2 is in segment 1 when age is 24, i want both unqiue id 2 at age 24s to have segment 1 but not have it at age 23.

 

For a more clear picture, heres what i am looking for the dataset to resemble:

Unique_Id          Age        Segment
1                  56         0
1                  57         4       
1                  57         4  
2                  23         0
2                  24         1        
2                  24         1
3                  77         2
3                  77         2
3                  78         0
3                  78         0
4                  86         0
4                  88         1.1
4                  88         1.1
5                  12         0
5                  12         0
5                  14         0
5                  14         0

My problem is that I don't even know where to start to do this. If anyone can provide help or tips it would be much appreciated. Thanks.

 

1 ACCEPTED SOLUTION

Accepted Solutions
jimbarbour
Meteorite | Level 14

I think a Data step would be a good approach; a few million rows is nothing to a Data step.

 

Here's some sample code, below, followed by the results.  Are these results what you want?

 

Jim

 

DATA	Have;
	INPUT
	Unique_ID          Age        Segment;
DATALINES;
1                  56         0
1                  57         4       
1                  57         0   
2                  23         0
2                  24         0        
2                  24         1
3                  77         0
3                  77         2
3                  78         0
3                  78         0
4                  86         0
4                  88         1.1
4                  88         0
5                  12         0
5                  12         0
5                  14         0
5                  14         0
;
RUN;

PROC	SORT	DATA=Have;
	BY	Unique_ID	Age	DESCENDING	Segment;
RUN;

DATA	Want;
	DROP	_:;
	SET	Have;
		BY	Unique_ID	Age;
	RETAIN	_Prior_Segment	0;

	IF	Segment					>	0	THEN
		DO;
			_Prior_Segment		=	Segment;
		END;
	ELSE
		DO;
			Segment				=	_Prior_Segment;
		END;

	OUTPUT;

	IF	LAST.Age						OR
		LAST.Unique_ID					THEN
		_Prior_Segment			=	0;
RUN;

jimbarbour_0-1628867423497.png

 

 

View solution in original post

7 REPLIES 7
PaigeMiller
Diamond | Level 26

Compute the max segment and merge in with the other columns using PROC SQL. Does not work if segment is a negative number.

 

proc sql;
    create table want as select a.unique_id,a.age,b.segment 
    from have as a left join 
    (select unique_id,age,max(segment) as segment from have group by unique_id,age) as b
    on a.unique_id=b.unique_id and a.age=b.age;
quit;

 

If you really have to do this on a million records, it could be slow, and then maybe PROC SUMMARY is a better solution.

 

Please, for you future benefit, let's get the terminology right, a cell with a 0 is not an empty cell. It is a cell with a zero.

--
Paige Miller
jimbarbour
Meteorite | Level 14

I think a Data step would be a good approach; a few million rows is nothing to a Data step.

 

Here's some sample code, below, followed by the results.  Are these results what you want?

 

Jim

 

DATA	Have;
	INPUT
	Unique_ID          Age        Segment;
DATALINES;
1                  56         0
1                  57         4       
1                  57         0   
2                  23         0
2                  24         0        
2                  24         1
3                  77         0
3                  77         2
3                  78         0
3                  78         0
4                  86         0
4                  88         1.1
4                  88         0
5                  12         0
5                  12         0
5                  14         0
5                  14         0
;
RUN;

PROC	SORT	DATA=Have;
	BY	Unique_ID	Age	DESCENDING	Segment;
RUN;

DATA	Want;
	DROP	_:;
	SET	Have;
		BY	Unique_ID	Age;
	RETAIN	_Prior_Segment	0;

	IF	Segment					>	0	THEN
		DO;
			_Prior_Segment		=	Segment;
		END;
	ELSE
		DO;
			Segment				=	_Prior_Segment;
		END;

	OUTPUT;

	IF	LAST.Age						OR
		LAST.Unique_ID					THEN
		_Prior_Segment			=	0;
RUN;

jimbarbour_0-1628867423497.png

 

 

PaigeMiller
Diamond | Level 26

This doesn't work for unique_id=3 and age=77

--
Paige Miller
jimbarbour
Meteorite | Level 14

@PaigeMiller, oh, I see what you're saying.  They want previous occurrences of an ID - Age combination to have the same Segment.  Thank you.  I think that could be corrected by a descending Sort on Segment.  Prior post adjusted accordingly.  The only "fly in the ointment" here is if there are multiple segments in a given age.

 

Jim

 

 

Patrick
Opal | Level 21

Something like below should work.

data	have;
	input	unique_id age segment;
datalines;
1 56 0
1 57 4 
1 57 0 
2 23 0
2 24 0 
2 24 1
3 77 0
3 77 2
3 78 0
3 78 0
4 86 0
4 88 1.1
4 88 0
5 12 0
5 12 0
5 14 0
5 14 0
;

proc sort data=have out=want;
  by unique_id age descending segment;
run;

data want;
  set want(rename=(segment=_segment));
  by unique_id age;
  retain segment;
  if first.age then segment=_segment;
  drop _segment;
run;
Ksharp
Super User
DATA	Have;
	INPUT
	Unique_ID          Age        Segment;
DATALINES;
1                  56         0
1                  57         4       
1                  57         0   
2                  23         0
2                  24         0        
2                  24         1
3                  77         0
3                  77         2
3                  78         0
3                  78         0
4                  86         0
4                  88         1.1
4                  88         0
5                  12         0
5                  12         0
5                  14         0
5                  14         0
;
RUN;
proc sql;
create table want(drop=dummy) as
select unique_id,age,segment as dummy,max(segment) as segment
 from have
   group by unique_id,age;
quit;
andreas_lds
Jade | Level 19

Using a hash-object is missing the collection of solutions:

data want;
   set have;
   
   if _n_ = 1 then do;
      declare hash h(dataset: 'have(where=(Segment ^= 0))');
      h.defineKey('Unique_Id', 'Age');
      h.defineData('Segment');
      h.defineDone();
   end;
   
   rc = h.find();
   /* no need to check rc */
   drop rc;
run;

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1097 views
  • 3 likes
  • 6 in conversation