Solved: Re: Find last child

Patrick · Posted 05-14-2012 08:51 AM

Hi all

I have this real life problem to solve which feels very much like a standard "out of the book" problem (something like a parts list). But I just don't manage to find an easy coding solution for it.

What I have:

A couple of ID's which change over time.

What I need:

Chain up these ID's and figure out which one is the latest one (the ID_TO in the chain with the latest date).

I then want to create a data set which can be used as input for a format (cntlin) to recode all ID's to the latest one (where needed)

ID's in chains don't overlap (=an ID is a member of exactly one chain).

I don't have a lot of data so simple code is more important than performance optimised one.

data have;
input date:date9. id_from id_to;
format date date9.;
datalines;
01may2012 001 002
02may2012 002 005
03may2012 005 002
01may2012 012 019

02may2012 019 012
03may2012 012 017
04may2012 017 016

;
run;

Want:
start label
001 002

005 002

017 016

019 016

012 016

Any suggestions - some code or eventually just the necessary keywords to find some guidance for this assumed schoolbook problem - highly appreciated.

For code: SAS9.3 version under Windows.

Thanks

Patrick

Astounding · Posted 05-15-2012 02:33 PM

Thanks. It looks like those were the only corrections needed. Double-checking the original post, it's a relief to see that there won't be a tremendous amount of data to run through.

The hashing solution is intriguing. It seems like this ought to be a good problem for hashing. My hashing skills aren't up to the task, however.

View solution in original post

art297 · Posted 05-14-2012 10:10 AM

Here is one possibility:

data have;

input date:date9. id_from id_to;

format date date9.;

datalines;

01may2012 001 002

02may2012 002 005

03may2012 005 002

01may2012 012 019

02may2012 019 012

03may2012 012 017

04may2012 017 016

;

data temp;

do until (last_record);

set have end=last_record;

last_id_to=lag(id_to);

if id_from ne last_id_to then group+1;

output temp;

end;

run;

data want (keep=start label);

do until (last.group);

set temp;

by group;

if last.group then label=id_to;

end;

do until (last.group);

set temp (rename=(id_from=start));

by group;

if start ne label then output;

end;

run;

proc sort data=want nodupkey;

by start;

run;

Patrick · Posted 05-15-2012 06:42 AM

Art

Thanks a lot for your suggestion.

When I've created the sample data I was asking myself if it is a good idea to provide it in this sorted order (by date and "groups"). This is in reality of course not the case and the data for a specific group is not in sequence.

That's why your code doesn't work for me.

I'll provide at the end of this thread better sample data with some more explanation.

Thanks

Patrick

Reeza · Posted 05-14-2012 06:22 PM

Sounds like you're playing around with what's called a 'slowly changing dimension' or a 'Type 2' dimension from a BI perspective.

Some words if you felt like googling, b/c I'm sure Art's solution will work .

Here's another solution that doesn't use loops.

data have;

input date:date9. id_from id_to;

format date date9.;

datalines;

01may2012 001 002

02may2012 002 005

03may2012 005 002

01may2012 012 019

02may2012 019 012

03may2012 012 017

04may2012 017 016

;

run;

*Identify Chains;

data id_chains;

set have;

retain chain 0;

if id_from ne lag(id_to) then chain+1;

run;

*Get last ID per chain;

proc sort data=id_chains; by chain date; run;

proc sort data=id_chains nodupkey out=latest_id (keep= chain id_to); by chain; run;

*Merge in the latest ID;

data want;

merge id_chains latest_id (rename= id_to=latest_id);

by chain;

run;

Patrick · Posted 05-15-2012 06:46 AM

Hi Reeza

Thanks for your code. It has the same basic issue than Art's (see comment there).

I believe there is also something else not working as expected as when I run it on my laptop it gives me as latest_ID '2' and '19' - but it should be '2' and '16'.

Thanks

Patrick

Ksharp · Posted 05-15-2012 02:45 AM

Hi , Patrick.

If I understand what you mean .

Check out if it is what you want.

data have;
  input date:date9. id_from $ id_to $;
  format date date9.;
  datalines;
01may2012 001 002
02may2012 002 005
03may2012 005 002
01may2012 012 019
02may2012 019 012
03may2012 012 017
04may2012 017 016
;
run;

data want(drop=_:);
 if _n_ eq 1 then do;
  if 0 then set have;
  declare hash ha();
   ha.definekey('id_from');
   ha.definedone();
 end;
set have;
if ha.check() ne 0 then do;
     ha.add();
     do i=_n_+1 to nobs ;
      set have(keep=id_from id_to rename=(id_from=_id_from id_to=_id_to)) point=i nobs=nobs;
      if id_to eq _id_from then id_to=_id_to;
       else leave;
     end;
 output;
 id_from=id_to;ha.replace();
end;
run;

Ksharp

Patrick · Posted 05-15-2012 06:49 AM

Hi Ksharp

I would have been disappointed if you wouldn't have come up with a hash approach :-))

Your code works with the sample data but also falls short when I feed it with a bit a more elaborate sample.

I haven't really analysed your code good enough to tell why this happens.

I'll add better sample data and some more explanation right now.

Thanks

Patrick

Patrick · Posted 05-15-2012 07:06 AM

Below some better sample data (which broke all of the above solutions) together with the expected (wanted) result.

What this is about in real life:

Product codes which change during promotion periods - and I don't have some parent product code I can trust due to data quality issues. So what I need to do is to "chain-up" all these product codes and then create my own parent code (using the last code in the chain). I know that this means the parent code will change over time and that's why I don't want to actually store a parent code but use a format (...and the most current code will then link to the most current entries in ref tables - that's why I must use a "dynamic" parent code).

Sample data:
01may2012 001 022
01may2012 032 019
02may2012 019 032
02may2012 022 005
03may2012 005 022
03may2012 032 017
04may2012 017 016

Resulting chains:
001/022/005/022
032/019/032/017/016

Wanted DS:
Start Label
001   022
005   022
032   016
019   016
017   016

art297 · Posted 05-15-2012 08:52 AM

Patrick,

That changes the complexity of the problem significantly. We had a similar thread sometime back both here and on SAS-L. I can't find the one here, but take a look at:

SAS-L archives -- September 2011, week 2 (#94)</title><style type="text/css"><!--BODY { font-family:...

FloydNevseta · Posted 05-15-2012 09:30 AM

The problem I found is that to find a complete chain you must pass through the entire dataset once. So if there are multiple chains it requires multiple passes. My solution is a macro that recursively passes through the dataset until all chains are found. Basically, each pass through the dataset 'have' creates 2 datasets: 1) the obs belonging to the current chain are output to dataset 'chain', 2) all other obs are output to 'have', so 'have' reduces in size through each pass. The recursion is controlled by the value of the variable _have_left that contains the number of obs that are output to 'have' and still require further processing. When _have_left = 0, the recursion stops.

%macro recurse;
%do %until ( &have_left eq 0 );
   data have (keep=id_from id_to)
      chain (keep=start label);
   set have end=eof;
   retain _1st_rec 1 _last_id_to;
   if _1st_rec or id_from = _last_id_to then do;
      _1st_rec = 0;
      _last_id_to = id_to;
      start = id_from;
      label = id_to;
      output chain;
   end;
   else do;
      _have_left + 1;
      output have;
   end;
   if eof then do;
      call symput( 'chain_end', _last_id_to ); * final product at end of chain;
      call symput( 'have_left', _have_left ); * number of records still to process;
   end;
   run;

   data chain;
   set chain;
   label = "&chain_end"; * set to the final product;
   if start = label then delete;
   run;

proc append base=chain_fmt data=chain;
run;
%end;

* remove dupe pairs like 032,016 from sample;
proc sort data=chain_fmt nodup;
by _all_;
run;
%mend;

%recurse;

Astounding · Posted 05-15-2012 09:58 AM

SAS_Bigot,

I think that you're right ... you need two passes through the data. But it shouldn't be that complicated. Here's a simpler way.

data want;

set have (rename=(id_from=start id_to=label)) nobs=_nobs_;

if _n_ < _nobs_ then do _i_=_n_+1 to _nobs_;

set have point=_i_;

if id_from=label then label=id_to;

end;

if start ne end;

retain fmtname '$id_to';

run;

** if needed;

proc sort data=want NODUPKEY;

by start end;

run;

It still takes some processing time, and it assumes that the ID changes should be applied in the order that they appear within the HAVE data set.

FloydNevseta · Posted 05-15-2012 12:05 PM

I like it. It's very compact and easy to understand. I made a small correction to get it to work: start ne label.

Astounding · Posted 05-15-2012 01:57 PM

Ooops. That's what happens when you don't test your code. Thanks for catching it. I guess that applies to the PROC SORT as well.

I have a minor concern ... how long will it take to run. The number of SET statement executions is proportional to the square of the number of observations. Some jobs were meant to run overnight. :smileylaugh:

Linlin · Posted 05-15-2012 02:23 PM

Astounding,

I ran your code:

data have;
input date:date9. id_from $ id_to $;
format date date9.;
datalines;
01may2012 001 022
01may2012 032 019
02may2012 019 032
02may2012 022 005
03may2012 005 022
03may2012 032 017
04may2012 017 016
;
run;
data want(keep=start label);
   set have (rename=(id_from=start id_to=label)) nobs=_nobs_;
   if _n_ < _nobs_ then do _i_=_n_+1 to _nobs_;
     set have point=_i_;
     if id_from=label then label=id_to;
end;
if start ne label;
retain fmtname '$id_to';
run;

proc sort data=want NODUPKEY;
by start label;
run;

below is the log file:
2767 data have;
2768    input date:date9. id_from $ id_to $;
2769    format date date9.;
2770    datalines;

NOTE: The data set WORK.HAVE has 7 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

2778 ;
2779 run;
2780 data want(keep=start label);
2781     set have (rename=(id_from=start id_to=label)) nobs=_nobs_;
2782     if _n_ < _nobs_ then do _i_=_n_+1 to _nobs_;
2783       set have point=_i_;
2784       if id_from=label then label=id_to;
2785    end;
2786    if start ne label;
2787    retain fmtname '$id_to';
2788 run;

NOTE: There were 7 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 6 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

2789
2790 proc sort data=want NODUPKEY;
2791 by start label;
2792 run;

NOTE: There were 6 observations read from the data set WORK.WANT.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.WANT has 5 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

Astounding · Posted 05-15-2012 02:33 PM

Thanks. It looks like those were the only corrections needed. Double-checking the original post, it's a relief to see that there won't be a tremendous amount of data to run through.

The hashing solution is intriguing. It seems like this ought to be a good problem for hashing. My hashing skills aren't up to the task, however.

Classroom Training Available!