BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Patrick
Opal | Level 21

Hi all

I have this real life problem to solve which feels very much like a standard "out of the book" problem (something like a parts list). But I just don't manage to find an easy coding solution for it.

What I have:

A couple of ID's which change over time.

What I need:

Chain up these ID's and figure out which one is the latest one (the ID_TO in the chain with the latest date).

I then want to create a data set which can be used as input for a format (cntlin) to recode all ID's to the latest one (where needed)

ID's in chains don't overlap (=an ID is a member of exactly one chain).

I don't have a lot of data so simple code is more important than performance optimised one.

data have;
  input date:date9. id_from id_to;
  format date date9.;
  datalines;
01may2012 001 002
02may2012 002 005
03may2012 005 002
01may2012 012 019

02may2012 019 012
03may2012 012 017
04may2012 017 016

;
run;


Want:
start  label
001   002

005   002

017   016

019   016

012   016

Any suggestions - some code or eventually just the necessary keywords to find some guidance for this assumed schoolbook problem - highly appreciated.

For code: SAS9.3 version under Windows.

Thanks

Patrick

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Thanks.  It looks like those were the only corrections needed.  Double-checking the original post, it's a relief to see that there won't be a tremendous amount of data to run through.

The hashing solution is intriguing.  It seems like this ought to be a good problem for hashing.  My hashing skills aren't up to the task, however.

View solution in original post

19 REPLIES 19
art297
Opal | Level 21

Here is one possibility:

data have;

  input date:date9. id_from id_to;

  format date date9.;

  datalines;

01may2012 001 002

02may2012 002 005

03may2012 005 002

01may2012 012 019

02may2012 019 012

03may2012 012 017

04may2012 017 016

;

data temp;

  do until (last_record);

    set have end=last_record;

    last_id_to=lag(id_to);

    if id_from ne last_id_to then group+1;

    output temp;

  end;

run;

data want (keep=start label);

  do until (last.group);

    set temp;

    by group;

    if last.group then label=id_to;

  end;

  do until (last.group);

    set temp (rename=(id_from=start));

    by group;

    if start ne label then output;

  end;

run;

proc sort data=want nodupkey;

  by start;

run;

Patrick
Opal | Level 21

Art

Thanks a lot for your suggestion.

When I've created the sample data I was asking myself if it is a good idea to provide it in this sorted order (by date and "groups"). This is in reality of course not the case and the data for a specific group is not in sequence.

That's why your code doesn't work for me.

I'll provide at the end of this thread better sample data with some more explanation.

Thanks

Patrick

Reeza
Super User

Sounds like you're playing around with what's called a 'slowly changing dimension'  or a 'Type 2' dimension from a BI perspective.

Some words if you felt like googling, b/c I'm sure Art's solution will work Smiley Happy.

Here's another solution that doesn't use loops.

data have;

  input date:date9. id_from id_to;

  format date date9.;

  datalines;

01may2012 001 002

02may2012 002 005

03may2012 005 002

01may2012 012 019

02may2012 019 012

03may2012 012 017

04may2012 017 016

;

run;

*Identify Chains;

data id_chains;

    set have;

    retain chain 0;

    if id_from ne lag(id_to) then chain+1;

run;

*Get last ID per chain;

proc sort data=id_chains; by chain date; run;

proc sort data=id_chains nodupkey out=latest_id (keep= chain id_to); by chain; run;

*Merge in the latest ID;

data want;

    merge id_chains latest_id (rename= id_to=latest_id);

    by chain;

run;

Patrick
Opal | Level 21

Hi Reeza

Thanks for your code. It has the same basic issue than Art's (see comment there).

I believe there is also something else not working as expected as when I run it on my laptop it gives me as latest_ID '2' and '19' - but it should be '2' and '16'.

Thanks

Patrick

Ksharp
Super User

Hi , Patrick.

If I understand what you mean .

Check out if it is what you want.

data have;
  input date:date9. id_from $ id_to $;
  format date date9.;
  datalines;
01may2012 001 002
02may2012 002 005
03may2012 005 002
01may2012 012 019
02may2012 019 012
03may2012 012 017
04may2012 017 016
;
run;

data want(drop=_:);
 if _n_ eq 1 then do;
  if 0 then set have;
  declare hash ha();
   ha.definekey('id_from');
   ha.definedone();
 end;
set have;
if ha.check() ne 0 then do;
     ha.add();
     do i=_n_+1 to nobs ;
      set have(keep=id_from id_to rename=(id_from=_id_from id_to=_id_to)) point=i nobs=nobs;
      if id_to eq _id_from then id_to=_id_to;
       else leave;
     end;
 output;
 id_from=id_to;ha.replace();
end;
run;

Ksharp

Patrick
Opal | Level 21

Hi Ksharp

I would have been disappointed if you wouldn't have come up with a hash approach :-))

Your code works with the sample data but also falls short when I feed it with a bit a more elaborate sample.

I haven't really analysed your code good enough to tell why this happens.

I'll add better sample data and some more explanation right now.

Thanks

Patrick

Patrick
Opal | Level 21

Below some better sample data (which broke all of the above solutions) together with the expected (wanted) result.

What this is about in real life:

Product codes which change during promotion periods - and I don't have some parent product code I can trust due to data quality issues. So what I need to do is to "chain-up" all these product codes and then create my own parent code (using the last code in the chain). I know that this means the parent code will change over time and that's why I don't want to actually store a parent code but use a format (...and the most current code will then link to the most current entries in ref tables - that's why I must use a "dynamic" parent code).

Sample data:
01may2012 001 022
01may2012 032 019
02may2012 019 032
02may2012 022 005
03may2012 005 022
03may2012 032 017
04may2012 017 016


Resulting chains:
001/022/005/022
032/019/032/017/016

Wanted DS:
Start Label
001   022
005   022
032   016
019   016
017   016

art297
Opal | Level 21

Patrick,

That changes the complexity of the problem significantly.  We had a similar thread sometime back both here and on SAS-L.  I can't find the one here, but take a look at:

SAS-L archives -- September 2011, week 2 (#94)</title><style type="text/css"><!--BODY { font-family:...

FloydNevseta
Pyrite | Level 9

The problem I found is that to find a complete chain you must pass through the entire dataset once. So if there are multiple chains it requires multiple passes. My solution is a macro that recursively passes through the dataset until all chains are found. Basically, each pass through the dataset 'have' creates 2 datasets: 1) the obs belonging to the current chain are output to dataset 'chain', 2) all other obs are output to 'have', so 'have' reduces in size through each pass. The recursion is controlled by the value of the variable _have_left that contains the number of obs that are output to 'have' and still require further processing. When _have_left = 0, the recursion stops.

%macro recurse;
%do %until ( &have_left eq 0 );
   data have (keep=id_from id_to)
      chain (keep=start label);
   set have end=eof;
   retain _1st_rec 1 _last_id_to;
   if _1st_rec or id_from = _last_id_to then do;
      _1st_rec = 0;
      _last_id_to = id_to;
      start = id_from;
      label = id_to;
      output chain;
   end;
   else do;
      _have_left + 1;
      output have;
   end;
   if eof then do;
      call symput( 'chain_end', _last_id_to ); * final product at end of chain;
      call symput( 'have_left', _have_left ); * number of records still to process;
   end;
   run;

   data chain;
   set chain;
   label = "&chain_end"; * set to the final product;
   if start = label then delete;
   run;

   proc append base=chain_fmt data=chain;
   run;
%end;

* remove dupe pairs like 032,016 from sample;
proc sort data=chain_fmt nodup;
by _all_;
run;
%mend;

%recurse;

Astounding
PROC Star

SAS_Bigot,

I think that you're right ... you need two passes through the data.  But it shouldn't be that complicated.  Here's a simpler way.

data want;

   set have (rename=(id_from=start id_to=label)) nobs=_nobs_;

   if _n_ < _nobs_ then do _i_=_n_+1 to _nobs_;

     set have point=_i_;

     if id_from=label then label=id_to;

  end;

  if start ne end;

  retain fmtname '$id_to';

run;

** if needed;

proc sort data=want NODUPKEY;

   by start end;

run;

It still takes some processing time, and it assumes that the ID changes should be applied in the order that they appear within the HAVE data set.

FloydNevseta
Pyrite | Level 9

I like it. It's very compact and easy to understand. I made a small correction to get it to work: start ne label.

Astounding
PROC Star

Ooops.  That's what happens when you don't test your code.  Thanks for catching it.  I guess that applies to the PROC SORT as well.

I have a minor concern ... how long will it take to run.  The number of SET statement executions is proportional to the square of the number of observations.  Some jobs were meant to run overnight.   :smileylaugh:

Linlin
Lapis Lazuli | Level 10

Astounding,

I ran your code:

data have;
  input date:date9. id_from $ id_to $;
  format date date9.;
  datalines;
01may2012 001 022
01may2012 032 019
02may2012 019 032
02may2012 022 005
03may2012 005 022
03may2012 032 017
04may2012 017 016
;
run;
data want(keep=start label);
   set have (rename=(id_from=start id_to=label)) nobs=_nobs_;
   if _n_ < _nobs_ then do _i_=_n_+1 to _nobs_;
     set have point=_i_;
     if id_from=label then label=id_to;
  end;
  if start ne label;
  retain fmtname '$id_to';
run;

proc sort data=want NODUPKEY;
   by start label;
run;


below is the log file:
2767  data have;
2768    input date:date9. id_from $ id_to $;
2769    format date date9.;
2770    datalines;

NOTE: The data set WORK.HAVE has 7 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


2778  ;
2779  run;
2780  data want(keep=start label);
2781     set have (rename=(id_from=start id_to=label)) nobs=_nobs_;
2782     if _n_ < _nobs_ then do _i_=_n_+1 to _nobs_;
2783       set have point=_i_;
2784       if id_from=label then label=id_to;
2785    end;
2786    if start ne label;
2787    retain fmtname '$id_to';
2788  run;

NOTE: There were 7 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 6 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


2789
2790   proc sort data=want NODUPKEY;
2791     by start label;
2792  run;

NOTE: There were 6 observations read from the data set WORK.WANT.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.WANT has 5 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Astounding
PROC Star

Thanks.  It looks like those were the only corrections needed.  Double-checking the original post, it's a relief to see that there won't be a tremendous amount of data to run through.

The hashing solution is intriguing.  It seems like this ought to be a good problem for hashing.  My hashing skills aren't up to the task, however.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 19 replies
  • 5903 views
  • 10 likes
  • 10 in conversation