So i have a dataset like this that has a few million rows:
Unique_Id Age Segment
1 56 0
1 57 4
1 57 0
2 23 0
2 24 0
2 24 1
3 77 0
3 77 2
3 78 0
3 78 0
4 86 0
4 88 1.1
4 88 0
5 12 0
5 12 0
5 14 0
5 14 0
I want to fill the segment portion up by matching it with age and unique_id. So unique_id 2 is in segment 1 when age is 24, i want both unqiue id 2 at age 24s to have segment 1 but not have it at age 23.
For a more clear picture, heres what i am looking for the dataset to resemble:
Unique_Id Age Segment
1 56 0
1 57 4
1 57 4
2 23 0
2 24 1
2 24 1
3 77 2
3 77 2
3 78 0
3 78 0
4 86 0
4 88 1.1
4 88 1.1
5 12 0
5 12 0
5 14 0
5 14 0
My problem is that I don't even know where to start to do this. If anyone can provide help or tips it would be much appreciated. Thanks.
I think a Data step would be a good approach; a few million rows is nothing to a Data step.
Here's some sample code, below, followed by the results. Are these results what you want?
Jim
DATA Have;
INPUT
Unique_ID Age Segment;
DATALINES;
1 56 0
1 57 4
1 57 0
2 23 0
2 24 0
2 24 1
3 77 0
3 77 2
3 78 0
3 78 0
4 86 0
4 88 1.1
4 88 0
5 12 0
5 12 0
5 14 0
5 14 0
;
RUN;
PROC SORT DATA=Have;
BY Unique_ID Age DESCENDING Segment;
RUN;
DATA Want;
DROP _:;
SET Have;
BY Unique_ID Age;
RETAIN _Prior_Segment 0;
IF Segment > 0 THEN
DO;
_Prior_Segment = Segment;
END;
ELSE
DO;
Segment = _Prior_Segment;
END;
OUTPUT;
IF LAST.Age OR
LAST.Unique_ID THEN
_Prior_Segment = 0;
RUN;
Compute the max segment and merge in with the other columns using PROC SQL. Does not work if segment is a negative number.
proc sql;
create table want as select a.unique_id,a.age,b.segment
from have as a left join
(select unique_id,age,max(segment) as segment from have group by unique_id,age) as b
on a.unique_id=b.unique_id and a.age=b.age;
quit;
If you really have to do this on a million records, it could be slow, and then maybe PROC SUMMARY is a better solution.
Please, for you future benefit, let's get the terminology right, a cell with a 0 is not an empty cell. It is a cell with a zero.
I think a Data step would be a good approach; a few million rows is nothing to a Data step.
Here's some sample code, below, followed by the results. Are these results what you want?
Jim
DATA Have;
INPUT
Unique_ID Age Segment;
DATALINES;
1 56 0
1 57 4
1 57 0
2 23 0
2 24 0
2 24 1
3 77 0
3 77 2
3 78 0
3 78 0
4 86 0
4 88 1.1
4 88 0
5 12 0
5 12 0
5 14 0
5 14 0
;
RUN;
PROC SORT DATA=Have;
BY Unique_ID Age DESCENDING Segment;
RUN;
DATA Want;
DROP _:;
SET Have;
BY Unique_ID Age;
RETAIN _Prior_Segment 0;
IF Segment > 0 THEN
DO;
_Prior_Segment = Segment;
END;
ELSE
DO;
Segment = _Prior_Segment;
END;
OUTPUT;
IF LAST.Age OR
LAST.Unique_ID THEN
_Prior_Segment = 0;
RUN;
This doesn't work for unique_id=3 and age=77
@PaigeMiller, oh, I see what you're saying. They want previous occurrences of an ID - Age combination to have the same Segment. Thank you. I think that could be corrected by a descending Sort on Segment. Prior post adjusted accordingly. The only "fly in the ointment" here is if there are multiple segments in a given age.
Jim
Something like below should work.
data have;
input unique_id age segment;
datalines;
1 56 0
1 57 4
1 57 0
2 23 0
2 24 0
2 24 1
3 77 0
3 77 2
3 78 0
3 78 0
4 86 0
4 88 1.1
4 88 0
5 12 0
5 12 0
5 14 0
5 14 0
;
proc sort data=have out=want;
by unique_id age descending segment;
run;
data want;
set want(rename=(segment=_segment));
by unique_id age;
retain segment;
if first.age then segment=_segment;
drop _segment;
run;
DATA Have;
INPUT
Unique_ID Age Segment;
DATALINES;
1 56 0
1 57 4
1 57 0
2 23 0
2 24 0
2 24 1
3 77 0
3 77 2
3 78 0
3 78 0
4 86 0
4 88 1.1
4 88 0
5 12 0
5 12 0
5 14 0
5 14 0
;
RUN;
proc sql;
create table want(drop=dummy) as
select unique_id,age,segment as dummy,max(segment) as segment
from have
group by unique_id,age;
quit;
Using a hash-object is missing the collection of solutions:
data want;
set have;
if _n_ = 1 then do;
declare hash h(dataset: 'have(where=(Segment ^= 0))');
h.defineKey('Unique_Id', 'Age');
h.defineData('Segment');
h.defineDone();
end;
rc = h.find();
/* no need to check rc */
drop rc;
run;
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.