BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
frupaul
Quartz | Level 8

Hi everyone,

 

A bit of an interesting issue. I have the following data set:

Capture.PNG

 

Use the below code to generate table above:

  data Hierarchy;
 INFILE DATALINES DLM='' missover;
 length CustID $10 L1 $20 L2 $20 L3 $20 L4 $20 L5 $20 L6 $20 L7 $20 L8 $20;
 input CustID  $ L1 $ L2 $ L3 $ L4 $ L5 $ L6 $  L7 $ L8 $;
 cards;
 1 James James Luke Tom Anthony Matthew Jake Brudhy
 2 Peter Peter Raymond Raymond Peter Luke
 ;

Objective:

 

 

My objective is to deduplicate across rows. Across each row, if the values for any successive levels (L1,L2...) are the same, I would like to delete the higher level i.e. if L1 and L2 contain same values, then delete L2. However, since blanks can only be present after the last value in each row and not within the row, where duplicates are found and deleted, I would like to shift the values on the right of the empty cell to the left to fill up the deleted cell (arrow in above diagram shows this).

 

After doing that, final table should look like this.

 

Capture.PNG

 

All empty cells are to the left end of each row.

 

Does anyone know how to accomplish this?

 

Thanks,

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

Some fun 

 

data want;
if _n_=1 then do;
length k $8;
  dcl hash H (multidata:'y') ;
   h.definekey  ("custid") ;
   h.definedata ("custid","k") ;
   h.definedone () ;
   call missing(k);
end;
set Hierarchy ;
array t(*) l:;
do i=1 to dim(t);
if not missing(t(i)) then 
if  i=1 or  t(i) ne t(i-1) then h.add(key:custid,data:custid,data:t(i));
end;
do i=1 by 1 while(h.do_over() eq 0);
t(i)=k;
end;
do i=i to dim(t);
 call missing(t(i));
end;
keep custid l:;
run;

View solution in original post

22 REPLIES 22
mkeintz
PROC Star

Let's go into pedagogical mode:

 

For each row (each "observation" in data step lingo) you want to iteratively do the same process to every successive pairs of neighboring values, namely

  1. if "neighboring" values match, then shift to the left all subsequent (further right) neighbors, making sure to replace the right-most value with a blank.
  2. either recheck the current pair of neighbors or (if you know you never have triplets) immediately go to the next pair.  By "next pair" I mean step one name to the right - i.e. right-side of current pair becomes left-side next pair.

 

To implement iterative processes, examine the do loop.  For 8 variables, you will want to loop through this process up to 7 times (L1 vs L2, then L2 vs L3, .... L7 vs L8). 

 

And to facilitate support of the do loop for your task, learn about declaring the list of variables L1, L2, ... L8 as an array, via the ARRAY statement.  Once associated with an array, the variables L1-L8 can not only be accessed by their own names, but also as element 1 of the array, element 2 of the array, etc.  I.e. you can dismiss using specific variables names, and instead use a loop where you compare element i vs element i+1, and then just increment i in the loop.

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
novinosrin
Tourmaline | Level 20
data Hierarchy;
 INFILE DATALINES DLM='' missover;
 length CustID $10 L1 $20 L2 $20 L3 $20 L4 $20 L5 $20 L6 $20 L7 $20 L8 $20;
 input CustID  $ L1 $ L2 $ L3 $ L4 $ L5 $ L6 $  L7 $ L8 $;
 cards;
 1 James James Luke Tom Anthony Matthew Jake Brudhy
 2 Peter Peter Raymond Raymond Peter Luke
 ;


proc transpose data=Hierarchy out=_Hierarchy;
by custid;
var l:;
run;
data want;
set _Hierarchy;
by custid col1 notsorted;
array L(8) $ ;
retain l;
if first.custid then do; n=0;call missing(of l(*));end;
if first.col1 then n+1; L(n)=col1;
if last.custid;
drop n col1 _name_ ;
run;
frupaul
Quartz | Level 8

Hi Novinosrin,

 

Thanks for the solution. I am dealing with 100000 records. Will this proposed approach work on such high volume data?

 

 

novinosrin
Tourmaline | Level 20

Please test and let the community know. If it's not robust, I am sure you will have get more efficient solutions, however until then play with it

frupaul
Quartz | Level 8

HI Novinosrin

 

Thanks for the solution. I am dealing with 100,000 records. Will this proposed approach work on such high volume data?

 

 

novinosrin
Tourmaline | Level 20

Some fun 

 

data want;
if _n_=1 then do;
length k $8;
  dcl hash H (multidata:'y') ;
   h.definekey  ("custid") ;
   h.definedata ("custid","k") ;
   h.definedone () ;
   call missing(k);
end;
set Hierarchy ;
array t(*) l:;
do i=1 to dim(t);
if not missing(t(i)) then 
if  i=1 or  t(i) ne t(i-1) then h.add(key:custid,data:custid,data:t(i));
end;
do i=1 by 1 while(h.do_over() eq 0);
t(i)=k;
end;
do i=i to dim(t);
 call missing(t(i));
end;
keep custid l:;
run;

novinosrin
Tourmaline | Level 20

Cleaned further and edited some redundant step:

 

data want;
if _n_=1 then do;
length k $8;
  dcl hash H (multidata:'y') ;
   h.definekey  ("custid") ;
   h.definedata ("k") ;
   h.definedone () ;
   call missing(k);
end;
set Hierarchy ;
array t(*) l:;
do i=1 to dim(t);
if not missing(t(i)) then 
if  i=1 or  t(i) ne t(i-1) then h.add(key:custid,data:t(i));
end;
call missing(of t(*));
do i=1 by 1 while(h.do_over() eq 0);
t(i)=k;
end;
keep custid l:;
run;
FreelanceReinh
Jade | Level 19

@novinosrin: Well done. The revised version is more efficient, but interestingly it seems to be still a bit slower than your two-step solution using PROC TRANSPOSE (at least on my computer; probably also depending on input data): With an input dataset containing 10 million observations the latter took 24.0 s vs. 26.4 s for the hash solution, which in turn was faster than the below solution using the (slow) PRXCHANGE function (27.3 s) -- all run times averaged over three runs.

 

data want(drop=c i);
length c $176;
set Hierarchy;
array l[8];
c=prxchange('s/(~[^~]+~)(\1)+/\1/', -1, '~'||catx('~~',of l:)||'~');
do i=1 to dim(l);
  l[i]=scan(c,i,'~');
end;
run;

(To avoid truncation, I increased the length specifications for variable k in your hash solution and array L in your two-step solution to $20.)

 

frupaul
Quartz | Level 8

Respect mate. This will have to send me back to school. Short and sleek. Although I have no idea how it works, it works.

 

Thanks

FreelanceReinh
Jade | Level 19

Thanks, @frupaul.

 

I don't use the PRX functions frequently, so it took me some time to (hopefully) get the regex right.

 

The third argument of PRXCHANGE encloses the values of L1, ..., L8 in '~' characters (which must not occur in any of the values) and concatenates the results. The resulting string (with a maximum length of (20+2)*8=176) is then searched for a string of the form "~more characters~" (where "more characters" may contain anything but the '~': "[^~]+" means "one or more non-tilde characters") which is followed by one or more occurrences of the same string (this is the "(\1)+" part). If such a duplicate (or triplicate, ...) string is found, it is replaced by only one copy (the final "\1" in the regex). The search/replace operation is continued until the end of the input string is reached (second argument -1).

 

The enclosing characters (I chose '~' because it is not one of the PRX metacharacters) were necessary to avoid partial matches such as "James" in "James Miller".

 

(I was tempted to use CALL POKELONG to write the final result to L1-L8 -- which could have shortened the code further [no array, no DO loop] -- but since CATX does not provide an option to preserve trailing blanks, that was not successful.)

frupaul
Quartz | Level 8

Hi FreelanceReinhard,

 

Thanks for the explanation, it makes sense but will become much clearer once I have read the PRX functions.

 

I will be putting up another question in 5 minutes, along the same line of reasoning as this. But that one will be really challenging. Please share your thoughts when it comes.

 

Thanks

novinosrin
Tourmaline | Level 20

@FreelanceReinh Thank you Sir, as always for the details. Your diligence teaches me a lot. And I can't thank you enough for your time in a lot of thread to help me learn a lot. Just waking up in  Chicago. Good morning to you. I hope you have a nice day. 

FreelanceReinh
Jade | Level 19

You're welcome, @novinosrin. It's a mutual learning experience. So, thank you, too.

frupaul
Quartz | Level 8

Hi Novinosrin,

 

I will be posting an extension to this problem. Just realised the initial post was an over simplification of the issue. Please do chip in.

The post will have a similar subject to this one.

 

Thanks

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 22 replies
  • 3509 views
  • 7 likes
  • 5 in conversation