So i have a dataset like this that has a few million rows:
Unique_Id Age Segment
1 56 0
1 57 4
1 57 0
2 23 0
2 24 0
2 24 1
3 77 0
3 77 2
3 78 0
3 78 0
4 86 0
4 88 1.1
4 88 0
5 12 0
5 12 0
5 14 0
5 14 0
I want to fill the segment portion up by matching it with age and unique_id. So unique_id 2 is in segment 1 when age is 24, i want both unqiue id 2 at age 24s to have segment 1 but not have it at age 23.
For a more clear picture, heres what i am looking for the dataset to resemble:
Unique_Id Age Segment
1 56 0
1 57 4
1 57 4
2 23 0
2 24 1
2 24 1
3 77 2
3 77 2
3 78 0
3 78 0
4 86 0
4 88 1.1
4 88 1.1
5 12 0
5 12 0
5 14 0
5 14 0
My problem is that I don't even know where to start to do this. If anyone can provide help or tips it would be much appreciated. Thanks.
I think a Data step would be a good approach; a few million rows is nothing to a Data step.
Here's some sample code, below, followed by the results. Are these results what you want?
Jim
DATA Have;
INPUT
Unique_ID Age Segment;
DATALINES;
1 56 0
1 57 4
1 57 0
2 23 0
2 24 0
2 24 1
3 77 0
3 77 2
3 78 0
3 78 0
4 86 0
4 88 1.1
4 88 0
5 12 0
5 12 0
5 14 0
5 14 0
;
RUN;
PROC SORT DATA=Have;
BY Unique_ID Age DESCENDING Segment;
RUN;
DATA Want;
DROP _:;
SET Have;
BY Unique_ID Age;
RETAIN _Prior_Segment 0;
IF Segment > 0 THEN
DO;
_Prior_Segment = Segment;
END;
ELSE
DO;
Segment = _Prior_Segment;
END;
OUTPUT;
IF LAST.Age OR
LAST.Unique_ID THEN
_Prior_Segment = 0;
RUN;
Compute the max segment and merge in with the other columns using PROC SQL. Does not work if segment is a negative number.
proc sql;
create table want as select a.unique_id,a.age,b.segment
from have as a left join
(select unique_id,age,max(segment) as segment from have group by unique_id,age) as b
on a.unique_id=b.unique_id and a.age=b.age;
quit;
If you really have to do this on a million records, it could be slow, and then maybe PROC SUMMARY is a better solution.
Please, for you future benefit, let's get the terminology right, a cell with a 0 is not an empty cell. It is a cell with a zero.
I think a Data step would be a good approach; a few million rows is nothing to a Data step.
Here's some sample code, below, followed by the results. Are these results what you want?
Jim
DATA Have;
INPUT
Unique_ID Age Segment;
DATALINES;
1 56 0
1 57 4
1 57 0
2 23 0
2 24 0
2 24 1
3 77 0
3 77 2
3 78 0
3 78 0
4 86 0
4 88 1.1
4 88 0
5 12 0
5 12 0
5 14 0
5 14 0
;
RUN;
PROC SORT DATA=Have;
BY Unique_ID Age DESCENDING Segment;
RUN;
DATA Want;
DROP _:;
SET Have;
BY Unique_ID Age;
RETAIN _Prior_Segment 0;
IF Segment > 0 THEN
DO;
_Prior_Segment = Segment;
END;
ELSE
DO;
Segment = _Prior_Segment;
END;
OUTPUT;
IF LAST.Age OR
LAST.Unique_ID THEN
_Prior_Segment = 0;
RUN;
This doesn't work for unique_id=3 and age=77
@PaigeMiller, oh, I see what you're saying. They want previous occurrences of an ID - Age combination to have the same Segment. Thank you. I think that could be corrected by a descending Sort on Segment. Prior post adjusted accordingly. The only "fly in the ointment" here is if there are multiple segments in a given age.
Jim
Something like below should work.
data have;
input unique_id age segment;
datalines;
1 56 0
1 57 4
1 57 0
2 23 0
2 24 0
2 24 1
3 77 0
3 77 2
3 78 0
3 78 0
4 86 0
4 88 1.1
4 88 0
5 12 0
5 12 0
5 14 0
5 14 0
;
proc sort data=have out=want;
by unique_id age descending segment;
run;
data want;
set want(rename=(segment=_segment));
by unique_id age;
retain segment;
if first.age then segment=_segment;
drop _segment;
run;
DATA Have;
INPUT
Unique_ID Age Segment;
DATALINES;
1 56 0
1 57 4
1 57 0
2 23 0
2 24 0
2 24 1
3 77 0
3 77 2
3 78 0
3 78 0
4 86 0
4 88 1.1
4 88 0
5 12 0
5 12 0
5 14 0
5 14 0
;
RUN;
proc sql;
create table want(drop=dummy) as
select unique_id,age,segment as dummy,max(segment) as segment
from have
group by unique_id,age;
quit;
Using a hash-object is missing the collection of solutions:
data want;
set have;
if _n_ = 1 then do;
declare hash h(dataset: 'have(where=(Segment ^= 0))');
h.defineKey('Unique_Id', 'Age');
h.defineData('Segment');
h.defineDone();
end;
rc = h.find();
/* no need to check rc */
drop rc;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.