Can Array be used to group distinct linked-numbers from two columns o...

KachiM · Posted 05-03-2019 01:13 PM

Bharath_aavas has opened a thread with the subject "create common group id for
multiple link id's".

He solved the problem using 8 steps, answered his question
himself and closed the thread.

In reply to me
he wanted to solve the problem using Arrays.

He said later:
" I have inserted my SAS Code and tested and it is working in
all scenarios. Yet to automate for N number of level in macro
this is working until 6 relation id's."

Since his thread is marked as "Answered", my
Array solution offered to him seems to have not been
seen by others. For the benefit of him
and others, I am posting the array solution using this thread.

This is his Data Set:

data have; 
input p1 p2;
cards;
9528 9534
9534 9056
8675 7890
8675 87677
7002 7001
7001 7005
7005 7009
5422 22333
22333 78878
78878 21121
21121 787999
787999 3422
1 2
1 3
1 4
1 11
5 7
6 7
8 7
;
run;

This is the expected answer I arrived.

group_id 	Loan1 	Loan2 	Loan3 	Loan4 	Loan5 	Loan6
1 	9528 	9534 	9056 	. 	. 	.
2 	8675 	7890 	87677 	. 	. 	.
3 	7002 	7001 	7005 	7009 	. 	.
4 	5422 	22333 	78878 	21121 	787999 	3422
5 	1 	2 	3 	4 	11 	.
6 	5 	7 	6 	8 	. 	.

The correct specification will be to get the list of
distinct numbers from both P1 and P2 subject to
having a linkage between elements of both columns.
Such elements are placed into one group.

9528 9534 9056 form one group as other numbers are far apart.

The search then continues for distinct-linked numbers
for another group.

Initially we don't know how many groups are there and how many
distinct numbers will fall into a group.

Approach:

We need to decide when a group begins and when it ends.

Externally we found the number of observations of the Data Set.

Two temporary arrays are created with size equal to the
number of observations. They are named as PID[ ] and LID[ ].
Elements from P1 are copied to PID and from P2 to LID.

A character variable (LIST) with an initial length 100 is
used. It must be increased to larger size when needed. One
can use a length of 32767 initially but for this example it
is a waste of memory.

Elements PID[1] and LID[1], if they are different, are placed
in the LIST. Next, the end of group (or beginning of group)
is decided by the following 2 conditions:

[1] PID[i] is compared with either of LID[i-1] or of PID[i-1].
and
[2] LID[i] is compared with either of PID[i-1] or of LID[i-1].

Their values are negated to get COND1 and COND2. If they
take 1 each, then PID[i] and LID[i] are no longer linked
to previous elements. It is the beginning of next group.
If one of them is zero, then the same group continues.

Here are two conditions:

cond1 = ^(pid[i] = lid[i-1] or pid[i] = pid[i-1]);
cond2 = ^(lid[i] = pid[i-1] or lid[i] = lid[i-1]);

Also we count the number of elements placed in the LIST
and the maximum number of elements is found. This Maximum
number will be used in the next Data Step to size an array.

Here is the Data Step which gets the LIST of distinct
elements for each group in TEMP Data Set:

data _null_;
   call symputx('size', num);
   if 0 then set have nobs = num;
   stop;
run;


data temp;
   array pid[&size] _temporary_;
   array lid[&size] _temporary_;
   length list $100;
   if _n_ = 1 then do i = 1 by 1 until(eof);
      set have end = eof;
      pid[i] = p1;
      lid[i] = p2;
   end;
   if pid[1] = lid[1] then list = catx(' ',list, pid[1]);
   else list = catx(' ',list, pid[1], lid[1]);
   do i = 2 to dim(pid);
      cond1 = ^(pid[i] = lid[i-1] or pid[i] = pid[i-1]);
      cond2 = ^(lid[i] = pid[i-1] or lid[i] = lid[i-1]);
      if cond1 * cond2 then do; count = countw(list); output; list = ' '; end; 
      if cond1 then list = catx(' ',list, pid[i]);
      if cond2 then list = catx(' ',list, lid[i]);
      Max = max(Max, count);
      if i = dim(pid) then do; 
         Max = max(Max,countw(list)); 
         output;
         call symputx('Max', Max);
         stop; 
      end;
   end;
keep list;
run;

The following Data Step copies the elements from the LIST,
converting them into numbers, into an array L[ ] sized to
the Maximum found in the previous Step.

data want;
   retain group_id 0;
   array L Loan1 - Loan&Max;
   set temp;
   group_id + 1;
   n = countw(list);
   do i = 1 to n;
      L[i] = input(scan(list,i), best.);
   end;
keep group_id Loan:;
run;

Everything is dynamically done except the length of LIST.

DataSP

Can Array be used to group distinct linked-numbers from two columns of a Data Set?

Registration is open

SAS Training: Just a Click Away