This article presents a tool called allChains for dealing with data representing single linked chains. Single linked chains are often used to record successive changes in a system. For instance, you might create a new profile record each time a user changes his profile. This change could be recorded in a table as OldRecordId - NewRecordId - DateOfChange. Following the OldRecordId - NewRecordId links, you can recreate the evolution of a given profile. To help this task, allChains produces an augmented table containing OldRecordId - NewRecordId - DateOfChange - OriginalRecordId - CurrentRecordId - OrderOfLinkInChain.
The allChains macro identifies the first (original) and last (current) record in every chain and the order of every link in its chain. The processing is performed efficiently with a SAS hash object and therefore requires that your data fit into memory.
A word of caution. It is expected that single linked chains follow certain rules.
Failure to comply with these rules will likely generate incorrect results.
The macro and a small example are attached.
Hello PGStats, thank you so much for putting this out here. Great solution. I do have a two questions though. If I run this on a small sample the _last_ field does give me what I'm expecting and suits my needs great (thanks again). BUT I'm getting a repeated ERROR. Duplicate key. in the log. The code does run but I'd still like to correct that error. Any thoughts?
Also when I run the entire dataset of 33 million rows I get the following error after about 5-10 minutes, is this a issue with my available resources?
An error occured executing the workspace job "All Chains".
The server is disconnected.
Hi @Steelers_In_DC - I can't answer the first question but as far as "server is disconnected" -- in EG that means the session crashed, became unresponsive, or the connection was interrupted. There are ways to collect more diagnostics, but if you're running an intense job that might cause the session to go down, I guess that's where you look first.
Chris, thanks. Can you give direction on how I would go about getting more diagnositcs?
ERROR: Duplicate key. signals a split chain, i.e. the presence of many links with the same from value. Try, for example:
data have;
input DATE :date9. to from;
format date yymmdd10.;
datalines;
27-Sep-16 10 6 <-- Duplicate from value signals split chain
27-Sep-16 7 5
17-Jun-16 6 4
10-Sep-16 9 8
5-Mar-16 4 2
27-Sep-16 15 6 <-- Duplicate from value signals split chain
;
%allChains(have, want, id=date);
For a large dataset like yours, you could modify the macro statement
declare hash links(); to declare hash links(hashexp : 12);
to increase the number of hash bins from the default 256 to 2^12 = 4096 and improve performance.
@Steelers_In_DC - You would need to turn on SAS Workspace Logging (usually an admin task if you have a shared SAS environment) to capture the SAS session log.
I'm have two columns deduped and I still get the duplicate key error. I tried hashexp : 12 but still get the same results. I tried running the 16 digit fields as character and numeric but no change.
The 'from' column should contain unique values. Remember: the chains are assumed to be disjoint.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.