Hi everyone,
A bit of an interesting issue. I have the following data set:
Use the below code to generate table above:
data Hierarchy;
INFILE DATALINES DLM='' missover;
length CustID $10 L1 $20 L2 $20 L3 $20 L4 $20 L5 $20 L6 $20 L7 $20 L8 $20;
input CustID $ L1 $ L2 $ L3 $ L4 $ L5 $ L6 $ L7 $ L8 $;
cards;
1 James James Luke Tom Anthony Matthew Jake Brudhy
2 Peter Peter Raymond Raymond Peter Luke
;
Objective:
My objective is to deduplicate across rows. Across each row, if the values for any successive levels (L1,L2...) are the same, I would like to delete the higher level i.e. if L1 and L2 contain same values, then delete L2. However, since blanks can only be present after the last value in each row and not within the row, where duplicates are found and deleted, I would like to shift the values on the right of the empty cell to the left to fill up the deleted cell (arrow in above diagram shows this).
After doing that, final table should look like this.
All empty cells are to the left end of each row.
Does anyone know how to accomplish this?
Thanks,
Some fun
data want;
if _n_=1 then do;
length k $8;
dcl hash H (multidata:'y') ;
h.definekey ("custid") ;
h.definedata ("custid","k") ;
h.definedone () ;
call missing(k);
end;
set Hierarchy ;
array t(*) l:;
do i=1 to dim(t);
if not missing(t(i)) then
if i=1 or t(i) ne t(i-1) then h.add(key:custid,data:custid,data:t(i));
end;
do i=1 by 1 while(h.do_over() eq 0);
t(i)=k;
end;
do i=i to dim(t);
call missing(t(i));
end;
keep custid l:;
run;
Let's go into pedagogical mode:
For each row (each "observation" in data step lingo) you want to iteratively do the same process to every successive pairs of neighboring values, namely
To implement iterative processes, examine the do loop. For 8 variables, you will want to loop through this process up to 7 times (L1 vs L2, then L2 vs L3, .... L7 vs L8).
And to facilitate support of the do loop for your task, learn about declaring the list of variables L1, L2, ... L8 as an array, via the ARRAY statement. Once associated with an array, the variables L1-L8 can not only be accessed by their own names, but also as element 1 of the array, element 2 of the array, etc. I.e. you can dismiss using specific variables names, and instead use a loop where you compare element i vs element i+1, and then just increment i in the loop.
data Hierarchy;
INFILE DATALINES DLM='' missover;
length CustID $10 L1 $20 L2 $20 L3 $20 L4 $20 L5 $20 L6 $20 L7 $20 L8 $20;
input CustID $ L1 $ L2 $ L3 $ L4 $ L5 $ L6 $ L7 $ L8 $;
cards;
1 James James Luke Tom Anthony Matthew Jake Brudhy
2 Peter Peter Raymond Raymond Peter Luke
;
proc transpose data=Hierarchy out=_Hierarchy;
by custid;
var l:;
run;
data want;
set _Hierarchy;
by custid col1 notsorted;
array L(8) $ ;
retain l;
if first.custid then do; n=0;call missing(of l(*));end;
if first.col1 then n+1; L(n)=col1;
if last.custid;
drop n col1 _name_ ;
run;
Hi Novinosrin,
Thanks for the solution. I am dealing with 100000 records. Will this proposed approach work on such high volume data?
Please test and let the community know. If it's not robust, I am sure you will have get more efficient solutions, however until then play with it
HI Novinosrin
Thanks for the solution. I am dealing with 100,000 records. Will this proposed approach work on such high volume data?
Some fun
data want;
if _n_=1 then do;
length k $8;
dcl hash H (multidata:'y') ;
h.definekey ("custid") ;
h.definedata ("custid","k") ;
h.definedone () ;
call missing(k);
end;
set Hierarchy ;
array t(*) l:;
do i=1 to dim(t);
if not missing(t(i)) then
if i=1 or t(i) ne t(i-1) then h.add(key:custid,data:custid,data:t(i));
end;
do i=1 by 1 while(h.do_over() eq 0);
t(i)=k;
end;
do i=i to dim(t);
call missing(t(i));
end;
keep custid l:;
run;
Cleaned further and edited some redundant step:
data want;
if _n_=1 then do;
length k $8;
dcl hash H (multidata:'y') ;
h.definekey ("custid") ;
h.definedata ("k") ;
h.definedone () ;
call missing(k);
end;
set Hierarchy ;
array t(*) l:;
do i=1 to dim(t);
if not missing(t(i)) then
if i=1 or t(i) ne t(i-1) then h.add(key:custid,data:t(i));
end;
call missing(of t(*));
do i=1 by 1 while(h.do_over() eq 0);
t(i)=k;
end;
keep custid l:;
run;
@novinosrin: Well done. The revised version is more efficient, but interestingly it seems to be still a bit slower than your two-step solution using PROC TRANSPOSE (at least on my computer; probably also depending on input data): With an input dataset containing 10 million observations the latter took 24.0 s vs. 26.4 s for the hash solution, which in turn was faster than the below solution using the (slow) PRXCHANGE function (27.3 s) -- all run times averaged over three runs.
data want(drop=c i);
length c $176;
set Hierarchy;
array l[8];
c=prxchange('s/(~[^~]+~)(\1)+/\1/', -1, '~'||catx('~~',of l:)||'~');
do i=1 to dim(l);
l[i]=scan(c,i,'~');
end;
run;
(To avoid truncation, I increased the length specifications for variable k in your hash solution and array L in your two-step solution to $20.)
Respect mate. This will have to send me back to school. Short and sleek. Although I have no idea how it works, it works.
Thanks
Thanks, @frupaul.
I don't use the PRX functions frequently, so it took me some time to (hopefully) get the regex right.
The third argument of PRXCHANGE encloses the values of L1, ..., L8 in '~' characters (which must not occur in any of the values) and concatenates the results. The resulting string (with a maximum length of (20+2)*8=176) is then searched for a string of the form "~more characters~" (where "more characters" may contain anything but the '~': "[^~]+" means "one or more non-tilde characters") which is followed by one or more occurrences of the same string (this is the "(\1)+" part). If such a duplicate (or triplicate, ...) string is found, it is replaced by only one copy (the final "\1" in the regex). The search/replace operation is continued until the end of the input string is reached (second argument -1).
The enclosing characters (I chose '~' because it is not one of the PRX metacharacters) were necessary to avoid partial matches such as "James" in "James Miller".
(I was tempted to use CALL POKELONG to write the final result to L1-L8 -- which could have shortened the code further [no array, no DO loop] -- but since CATX does not provide an option to preserve trailing blanks, that was not successful.)
Hi FreelanceReinhard,
Thanks for the explanation, it makes sense but will become much clearer once I have read the PRX functions.
I will be putting up another question in 5 minutes, along the same line of reasoning as this. But that one will be really challenging. Please share your thoughts when it comes.
Thanks
@FreelanceReinh Thank you Sir, as always for the details. Your diligence teaches me a lot. And I can't thank you enough for your time in a lot of thread to help me learn a lot. Just waking up in Chicago. Good morning to you. I hope you have a nice day.
You're welcome, @novinosrin. It's a mutual learning experience. So, thank you, too.
Hi Novinosrin,
I will be posting an extension to this problem. Just realised the initial post was an over simplification of the issue. Please do chip in.
The post will have a similar subject to this one.
Thanks
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.