Hi everyone,
This is an extension to a question i posted earlier under the topic - Deduplication across rows. This time, the data set contains both levels and names as below:
This can be created with the data step:
data Hierarchy;
infile datalines dlm='' truncover;
length CustID $10 L1 $20 L1_Name $20 L2 $20 L2_Name $20 L3 $20 L3_Name $20 L4 $20 L4_Name $20 L5 $20 L5_Name $20 L6 $20 L6_Name $20 L7 $20 L7_Name $20 L8 $20 L8_Name $20;
input CustID $ L1 $ L1_Name $ L2 $ L2_Name $ L3 $ L3_Name $ L4 $ L4_Name $ L5 $ L5_Name $ L6 $ L6_Name $ L7 $ L7_Name $ L8 $ L8_Name $;
cards;
1 A20 James A20 James A40 Peter A50 Andrew A60 Bartholomew A70 Thomas A80 Judas
2 A40 Peter A70 Thomas A80 Judas A80 Judas
3 A70 Thomas A80 Judas A90 Steve A100 Luke
4 A30 John A60 Bartholomew A70 Thomas
;
Notice:
You may notice duplicates across rows. e.g on row 1, the values of L1 and L1_name equals the values of L2 and L2_names respectively. Also, on row 2, the values of L3 and L3_Name equals those on L4 and L4_name respectively.
OBJECTIVE:
The objective is to:
1. Delete duplicate values completely e.g delete values in L2 and L2_Names since they are duplicates
2. Ensure blanks are only at the left end of each cell. E.g we don't want blanks in the L2 and L2_Names (see point 3)
3. To deal with blanks, we want to shift cells leftwards where a blank has been formed due to us deleting duplicates.
TARGET DATASET:
Thanks
data want;
set Hierarchy;
/*ARRAY to group the variables together*/
array lhs(8) L1-L8;
array lhsn(8) L1_name L2_name L3_name L4_name L5_name L6_name L7_name L8_name;
do i=1 to 8;
do j=1 to 8;
/*IF logic to determine if the values are the same and if so, set
the repeat value to missing*/
if i ne j and lhs(j)=lhs(i) then lhs(j)=' ';
if i ne j and lhsn(j)=lhsn(i) then lhsn(j)=' ';
end;
end;
/*this section of the code checks to see if a value is missing and if so,
shifts the non-missing values to the left*/
/*if the values are not to be shifted over, then leave this section of the
out*/
do k=1 to 8;
if lhs[k] = '' then do;
do l= k+1 to 8;
if lhs[l] ne '' then do;
lhs[k] = lhs[l];
lhs[l] = '';
leave;
end;
end;
end;
end;
do k=1 to 8;
if lhsn[k] = '' then do;
do l= k+1 to 8;
if lhsn[l] ne '' then do;
lhsn[k] = lhsn[l];
lhsn[l] = '';
leave;
end;
end;
end;
end;
drop i j k l;
run;
data want;
set Hierarchy;
/*ARRAY to group the variables together*/
array lhs(8) L1-L8;
array lhsn(8) L1_name L2_name L3_name L4_name L5_name L6_name L7_name L8_name;
do i=1 to 8;
do j=1 to 8;
/*IF logic to determine if the values are the same and if so, set
the repeat value to missing*/
if i ne j and lhs(j)=lhs(i) then lhs(j)=' ';
if i ne j and lhsn(j)=lhsn(i) then lhsn(j)=' ';
end;
end;
/*this section of the code checks to see if a value is missing and if so,
shifts the non-missing values to the left*/
/*if the values are not to be shifted over, then leave this section of the
out*/
do k=1 to 8;
if lhs[k] = '' then do;
do l= k+1 to 8;
if lhs[l] ne '' then do;
lhs[k] = lhs[l];
lhs[l] = '';
leave;
end;
end;
end;
end;
do k=1 to 8;
if lhsn[k] = '' then do;
do l= k+1 to 8;
if lhsn[l] ne '' then do;
lhsn[k] = lhsn[l];
lhsn[l] = '';
leave;
end;
end;
end;
end;
drop i j k l;
run;
Hi,
Your solutions works but I got lost in the arrays.
Can you recommend any material on arrays please?
Thanks
Once again, the approach with a long dataset format:
data Hierarchy;
infile datalines dlm='' truncover;
length CustID $10 L1 $20 L1_Name $20 L2 $20 L2_Name $20 L3 $20 L3_Name $20 L4 $20 L4_Name $20 L5 $20 L5_Name $20 L6 $20 L6_Name $20 L7 $20 L7_Name $20 L8 $20 L8_Name $20;
input CustID $ L1 $ L1_Name $ L2 $ L2_Name $ L3 $ L3_Name $ L4 $ L4_Name $ L5 $ L5_Name $ L6 $ L6_Name $ L7 $ L7_Name $ L8 $ L8_Name $;
cards;
1 A20 James A20 James A40 Peter A50 Andrew A60 Bartholomew A70 Thomas A80 Judas
2 A40 Peter A70 Thomas A80 Judas A80 Judas
3 A70 Thomas A80 Judas A90 Steve A100 Luke
4 A30 John A60 Bartholomew A70 Thomas
;
run;
proc transpose
data=hierarchy
out=trans1
;
by custid;
var l:;
run;
data trans2;
set trans1;
by custid;
retain count;
if first.custid
then count = 1;
else count + 1;
if count ge 2 and lag2(col1) = col1 then delete;
drop count;
run;
data trans3;
set trans2;
by custid;
retain count;
if first.custid then count = 1;
substr(_name_,2,1) = put(count,1.);
if index(_name_,'Name') then count + 1;
drop count;
run;
proc transpose
data=trans3
out=trans4 (drop=_name_)
;
by custid;
id _name_;
var col1;
run;
data want;
set
hierarchy (obs=0)
trans4
;
run;
If you kept your data in a sensible format from the beginning:
data Hierarchy;
infile datalines dlm='' truncover;
length
CustID $10
sequence 3
L $20
L_Name $20
;
input CustID sequence $ L $ L_Name;
cards;
1 1 A20 James
1 2 A20 James
1 3 A40 Peter
1 4 A50 Andrew
1 5 A60 Bartholomew
1 6 A70 Thomas
1 7 A80 Judas
2 1 A40 Peter
2 2 A70 Thomas
2 3 A80 Judas
2 4 A80 Judas
3 1 A70 Thomas
3 2 A80 Judas
3 3 A90 Steve
3 4 A100 Luke
4 1 A30 John
4 2 A60 Bartholomew
4 3 A70 Thomas
;
run;
coding against it would be MUCH easier:
data want;
format custid sequence l l_name;
set hierarchy (drop=sequence);
by custid;
l_l = lag(l);
if first.custid
then sequence = 1;
else do;
if l_l = l
then delete;
else sequence + 1;
end;
drop l_l;
run;
data Hierarchy;
infile datalines dlm='' truncover;
length CustID $10 L1 $20 L1_Name $20 L2 $20 L2_Name $20 L3 $20 L3_Name $20 L4 $20 L4_Name $20 L5 $20 L5_Name $20 L6 $20 L6_Name $20 L7 $20 L7_Name $20 L8 $20 L8_Name $20;
input CustID $ L1 $ L1_Name $ L2 $ L2_Name $ L3 $ L3_Name $ L4 $ L4_Name $ L5 $ L5_Name $ L6 $ L6_Name $ L7 $ L7_Name $ L8 $ L8_Name $;
cards;
1 A20 James A20 James A40 Peter A50 Andrew A60 Bartholomew A70 Thomas A80 Judas
2 A40 Peter A70 Thomas A80 Judas A80 Judas
3 A70 Thomas A80 Judas A90 Steve A100 Luke
4 A30 John A60 Bartholomew A70 Thomas
;
data want;
set Hierarchy;
array x1{*} $ L1-L8;
array x2{*} $ L1_name L2_name L3_name L4_name L5_name L6_name L7_name L8_name;
array y{9999} $ 100 _temporary_;
call missing(of y{*});
j=0;
do i=1 to dim(x1);
temp=catx(' ',x1{i},x2{i});
if temp not in y then do;j+1;y{j}=temp;end;
end;
do i=1 to dim(x1);
x1{i}=scan(y{i},1); x2{i}=scan(y{i},2);
end;
drop i j;
run;
Sorry I am always late, but fun:
data Hierarchy;
infile datalines dlm='' truncover;
length CustID $10 L1 $20 L1_Name $20 L2 $20 L2_Name $20 L3 $20 L3_Name $20 L4 $20 L4_Name $20 L5 $20 L5_Name $20 L6 $20 L6_Name $20 L7 $20 L7_Name $20 L8 $20 L8_Name $20;
input CustID $ L1 $ L1_Name $ L2 $ L2_Name $ L3 $ L3_Name $ L4 $ L4_Name $ L5 $ L5_Name $ L6 $ L6_Name $ L7 $ L7_Name $ L8 $ L8_Name $;
cards;
1 A20 James A20 James A40 Peter A50 Andrew A60 Bartholomew A70 Thomas A80 Judas
2 A40 Peter A70 Thomas A80 Judas A80 Judas
3 A70 Thomas A80 Judas A90 Steve A100 Luke
4 A30 John A60 Bartholomew A70 Thomas
;
data want;
if _n_=1 then do;
dcl hash H (ordered:'y') ;
length kn $20 k_name $20;
h.definekey ('i') ;
h.definedata ('kn',"k_name") ;
h.definedone () ;
call missing(kn,k_name);
end;
set Hierarchy end=z;
array t(*) l:;
do i=1 to dim(t) by 2;
rc=h.find(key:i-2);
if t(i) ne kn then rc=h.add(key:i,data:t(i),data:t(i+1));
end;
dcl hiter it('h');
rc = it.first();
do i=1 by 2 while (rc = 0);
t(i)=kn; t(i+1)=k_name;
rc = it.next();
end;
h.clear();
keep custid l:;
run;
data Hierarchy;
infile datalines dlm='' truncover;
length CustID $10 L1 $20 L1_Name $20 L2 $20 L2_Name $20 L3 $20 L3_Name $20 L4 $20 L4_Name $20 L5 $20 L5_Name $20 L6 $20 L6_Name $20 L7 $20 L7_Name $20 L8 $20 L8_Name $20;
input CustID $ L1 $ L1_Name $ L2 $ L2_Name $ L3 $ L3_Name $ L4 $ L4_Name $ L5 $ L5_Name $ L6 $ L6_Name $ L7 $ L7_Name $ L8 $ L8_Name $;
cards;
1 A20 James A20 James A40 Peter A50 Andrew A60 Bartholomew A70 Thomas A80 Judas
2 A40 Peter A70 Thomas A80 Judas A80 Judas
3 A70 Thomas A80 Judas A90 Steve A100 Luke
4 A30 John A60 Bartholomew A70 Thomas
;
data temp;
set Hierarchy;
array t(*) l:;
do i=1 to dim(t) by 2;
kn=t(i);
k_name=t(i+1);
output;
end;
keep custid kn k_name;
run;
data want;
set temp;
by custid kn k_name notsorted;
if 0 then set Hierarchy;
array t(*) l1--l8_name;
retain t;
if first.custid then do; n=1;call missing(of t(*));end;
else if first.kn then n+2;
t(n)=kn;
t(n+1)=k_name;
if last.custid;
keep custid l:;
run;
HI @frupaul For some reason, when i tested the solutions with million + records the two step solution previous apparently is the fastest(sighs). I was intrigued to test as @FreelanceReinh diligently pointed out a similar two step solution in the previous thread of yours was much faster and of course he knows i try to pay attention to those finer details to learn more. On my college lab machine, that holds true and in other machines it could be otherwise. FWIW, that was fun testing.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.