BookmarkSubscribeRSS Feed

Finding all single linked chains: the allChains macro

Started ‎01-25-2017 by
Modified ‎01-25-2017 by
Views 2,346

This article presents a tool called allChains for dealing with data representing single linked chains. Single linked chains are often used to record successive changes in a system. For instance, you might create a new profile record each time a user changes his profile. This change could be recorded in a table as OldRecordId - NewRecordId - DateOfChange. Following the OldRecordId - NewRecordId links, you can recreate the evolution of a given profile. To help this task, allChains produces an augmented table containing OldRecordId - NewRecordId - DateOfChange - OriginalRecordId - CurrentRecordId - OrderOfLinkInChain.

 

The allChains macro identifies the first (original) and last (current) record in every chain and the order of every link in its chain. The processing is performed efficiently with a SAS hash object and therefore requires that your data fit into memory.

 

A word of caution. It is expected that single linked chains follow certain rules.

  1. Chains must be disjoint. Links cannot be duplicated and chains cannot split or merge.
  2. Chains cannot contain cycles.
  3. Chains are limited to 100 links.

Failure to comply with these rules will likely generate incorrect results.

The macro and a small example are attached.

Comments

Hello PGStats, thank you so much for putting this out here.  Great solution.  I do have a two questions though.  If I run this on a small sample the _last_ field does give me what I'm expecting and suits my needs great (thanks again).  BUT I'm getting a repeated ERROR. Duplicate key. in the log.  The code does run but I'd still like to correct that error.  Any thoughts?

 

Also when I run the entire dataset of 33 million rows I get the following error after about 5-10 minutes, is this a issue with my available resources?

An error occured executing the workspace job "All Chains".
The server is disconnected.

Hi @Steelers_In_DC - I can't answer the first question but as far as "server is disconnected" -- in EG that means the session crashed, became unresponsive, or the connection was interrupted.  There are ways to collect more diagnostics, but if you're running an intense job that might cause the session to go down, I guess that's where you look first.

Chris, thanks.  Can you give direction on how I would go about getting more diagnositcs?

ERROR: Duplicate key. signals a split chain, i.e. the presence of many links with the same from value. Try, for example:

 

data have;
input DATE :date9. to from;
format date yymmdd10.;
datalines;
27-Sep-16	10	6  <-- Duplicate from value signals split chain
27-Sep-16	7	5
17-Jun-16	6	4
10-Sep-16	9	8
5-Mar-16	4	2
27-Sep-16	15	6  <-- Duplicate from value signals split chain
;

%allChains(have, want, id=date);

 

For a large dataset like yours, you could modify the macro statement

 

declare hash links();

to

declare hash links(hashexp : 12);

to increase the number of hash bins from the default 256 to 2^12 = 4096 and improve performance.

 

 

 

@Steelers_In_DC - You would need to turn on SAS Workspace Logging (usually an admin task if you have a shared SAS environment) to capture the SAS session log.

I'm have two columns deduped and I still get the duplicate key error.  I tried hashexp : 12 but still get the same results.  I tried running the 16 digit fields as character and numeric but no change.

The 'from' column should contain unique values. Remember: the chains are assumed to be disjoint. 

Version history
Last update:
‎01-25-2017 12:58 AM
Updated by:
Contributors

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Labels
Article Tags