Hi all
I have this real life problem to solve which feels very much like a standard "out of the book" problem (something like a parts list). But I just don't manage to find an easy coding solution for it.
What I have:
A couple of ID's which change over time.
What I need:
Chain up these ID's and figure out which one is the latest one (the ID_TO in the chain with the latest date).
I then want to create a data set which can be used as input for a format (cntlin) to recode all ID's to the latest one (where needed)
ID's in chains don't overlap (=an ID is a member of exactly one chain).
I don't have a lot of data so simple code is more important than performance optimised one.
data have;
input date:date9. id_from id_to;
format date date9.;
datalines;
01may2012 001 002
02may2012 002 005
03may2012 005 002
01may2012 012 019
02may2012 019 012
03may2012 012 017
04may2012 017 016
;
run;
Want:
start label
001 002
005 002
017 016
019 016
012 016
Any suggestions - some code or eventually just the necessary keywords to find some guidance for this assumed schoolbook problem - highly appreciated.
For code: SAS9.3 version under Windows.
Thanks
Patrick
Thanks. It looks like those were the only corrections needed. Double-checking the original post, it's a relief to see that there won't be a tremendous amount of data to run through.
The hashing solution is intriguing. It seems like this ought to be a good problem for hashing. My hashing skills aren't up to the task, however.
Here is one possibility:
data have;
input date:date9. id_from id_to;
format date date9.;
datalines;
01may2012 001 002
02may2012 002 005
03may2012 005 002
01may2012 012 019
02may2012 019 012
03may2012 012 017
04may2012 017 016
;
data temp;
do until (last_record);
set have end=last_record;
last_id_to=lag(id_to);
if id_from ne last_id_to then group+1;
output temp;
end;
run;
data want (keep=start label);
do until (last.group);
set temp;
by group;
if last.group then label=id_to;
end;
do until (last.group);
set temp (rename=(id_from=start));
by group;
if start ne label then output;
end;
run;
proc sort data=want nodupkey;
by start;
run;
Art
Thanks a lot for your suggestion.
When I've created the sample data I was asking myself if it is a good idea to provide it in this sorted order (by date and "groups"). This is in reality of course not the case and the data for a specific group is not in sequence.
That's why your code doesn't work for me.
I'll provide at the end of this thread better sample data with some more explanation.
Thanks
Patrick
Sounds like you're playing around with what's called a 'slowly changing dimension' or a 'Type 2' dimension from a BI perspective.
Some words if you felt like googling, b/c I'm sure Art's solution will work .
Here's another solution that doesn't use loops.
data have;
input date:date9. id_from id_to;
format date date9.;
datalines;
01may2012 001 002
02may2012 002 005
03may2012 005 002
01may2012 012 019
02may2012 019 012
03may2012 012 017
04may2012 017 016
;
run;
*Identify Chains;
data id_chains;
set have;
retain chain 0;
if id_from ne lag(id_to) then chain+1;
run;
*Get last ID per chain;
proc sort data=id_chains; by chain date; run;
proc sort data=id_chains nodupkey out=latest_id (keep= chain id_to); by chain; run;
*Merge in the latest ID;
data want;
merge id_chains latest_id (rename= id_to=latest_id);
by chain;
run;
Hi Reeza
Thanks for your code. It has the same basic issue than Art's (see comment there).
I believe there is also something else not working as expected as when I run it on my laptop it gives me as latest_ID '2' and '19' - but it should be '2' and '16'.
Thanks
Patrick
Hi , Patrick.
If I understand what you mean .
Check out if it is what you want.
data have; input date:date9. id_from $ id_to $; format date date9.; datalines; 01may2012 001 002 02may2012 002 005 03may2012 005 002 01may2012 012 019 02may2012 019 012 03may2012 012 017 04may2012 017 016 ; run; data want(drop=_:); if _n_ eq 1 then do; if 0 then set have; declare hash ha(); ha.definekey('id_from'); ha.definedone(); end; set have; if ha.check() ne 0 then do; ha.add(); do i=_n_+1 to nobs ; set have(keep=id_from id_to rename=(id_from=_id_from id_to=_id_to)) point=i nobs=nobs; if id_to eq _id_from then id_to=_id_to; else leave; end; output; id_from=id_to;ha.replace(); end; run;
Ksharp
Hi Ksharp
I would have been disappointed if you wouldn't have come up with a hash approach :-))
Your code works with the sample data but also falls short when I feed it with a bit a more elaborate sample.
I haven't really analysed your code good enough to tell why this happens.
I'll add better sample data and some more explanation right now.
Thanks
Patrick
Below some better sample data (which broke all of the above solutions) together with the expected (wanted) result.
What this is about in real life:
Product codes which change during promotion periods - and I don't have some parent product code I can trust due to data quality issues. So what I need to do is to "chain-up" all these product codes and then create my own parent code (using the last code in the chain). I know that this means the parent code will change over time and that's why I don't want to actually store a parent code but use a format (...and the most current code will then link to the most current entries in ref tables - that's why I must use a "dynamic" parent code).
Sample data:
01may2012 001 022
01may2012 032 019
02may2012 019 032
02may2012 022 005
03may2012 005 022
03may2012 032 017
04may2012 017 016
Resulting chains:
001/022/005/022
032/019/032/017/016
Wanted DS:
Start Label
001 022
005 022
032 016
019 016
017 016
Patrick,
That changes the complexity of the problem significantly. We had a similar thread sometime back both here and on SAS-L. I can't find the one here, but take a look at:
The problem I found is that to find a complete chain you must pass through the entire dataset once. So if there are multiple chains it requires multiple passes. My solution is a macro that recursively passes through the dataset until all chains are found. Basically, each pass through the dataset 'have' creates 2 datasets: 1) the obs belonging to the current chain are output to dataset 'chain', 2) all other obs are output to 'have', so 'have' reduces in size through each pass. The recursion is controlled by the value of the variable _have_left that contains the number of obs that are output to 'have' and still require further processing. When _have_left = 0, the recursion stops.
%macro recurse;
%do %until ( &have_left eq 0 );
data have (keep=id_from id_to)
chain (keep=start label);
set have end=eof;
retain _1st_rec 1 _last_id_to;
if _1st_rec or id_from = _last_id_to then do;
_1st_rec = 0;
_last_id_to = id_to;
start = id_from;
label = id_to;
output chain;
end;
else do;
_have_left + 1;
output have;
end;
if eof then do;
call symput( 'chain_end', _last_id_to ); * final product at end of chain;
call symput( 'have_left', _have_left ); * number of records still to process;
end;
run;
data chain;
set chain;
label = "&chain_end"; * set to the final product;
if start = label then delete;
run;
proc append base=chain_fmt data=chain;
run;
%end;
* remove dupe pairs like 032,016 from sample;
proc sort data=chain_fmt nodup;
by _all_;
run;
%mend;
%recurse;
SAS_Bigot,
I think that you're right ... you need two passes through the data. But it shouldn't be that complicated. Here's a simpler way.
data want;
set have (rename=(id_from=start id_to=label)) nobs=_nobs_;
if _n_ < _nobs_ then do _i_=_n_+1 to _nobs_;
set have point=_i_;
if id_from=label then label=id_to;
end;
if start ne end;
retain fmtname '$id_to';
run;
** if needed;
proc sort data=want NODUPKEY;
by start end;
run;
It still takes some processing time, and it assumes that the ID changes should be applied in the order that they appear within the HAVE data set.
I like it. It's very compact and easy to understand. I made a small correction to get it to work: start ne label.
Ooops. That's what happens when you don't test your code. Thanks for catching it. I guess that applies to the PROC SORT as well.
I have a minor concern ... how long will it take to run. The number of SET statement executions is proportional to the square of the number of observations. Some jobs were meant to run overnight. :smileylaugh:
Astounding,
I ran your code:
data have;
input date:date9. id_from $ id_to $;
format date date9.;
datalines;
01may2012 001 022
01may2012 032 019
02may2012 019 032
02may2012 022 005
03may2012 005 022
03may2012 032 017
04may2012 017 016
;
run;
data want(keep=start label);
set have (rename=(id_from=start id_to=label)) nobs=_nobs_;
if _n_ < _nobs_ then do _i_=_n_+1 to _nobs_;
set have point=_i_;
if id_from=label then label=id_to;
end;
if start ne label;
retain fmtname '$id_to';
run;
proc sort data=want NODUPKEY;
by start label;
run;
below is the log file:
2767 data have;
2768 input date:date9. id_from $ id_to $;
2769 format date date9.;
2770 datalines;
NOTE: The data set WORK.HAVE has 7 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
2778 ;
2779 run;
2780 data want(keep=start label);
2781 set have (rename=(id_from=start id_to=label)) nobs=_nobs_;
2782 if _n_ < _nobs_ then do _i_=_n_+1 to _nobs_;
2783 set have point=_i_;
2784 if id_from=label then label=id_to;
2785 end;
2786 if start ne label;
2787 retain fmtname '$id_to';
2788 run;
NOTE: There were 7 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 6 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
2789
2790 proc sort data=want NODUPKEY;
2791 by start label;
2792 run;
NOTE: There were 6 observations read from the data set WORK.WANT.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.WANT has 5 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
Thanks. It looks like those were the only corrections needed. Double-checking the original post, it's a relief to see that there won't be a tremendous amount of data to run through.
The hashing solution is intriguing. It seems like this ought to be a good problem for hashing. My hashing skills aren't up to the task, however.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.