Solved: Re: Duduplication accross rows

frupaul · Posted 08-20-2018 12:00 PM

Hi everyone,

A bit of an interesting issue. I have the following data set:

Use the below code to generate table above:

  data Hierarchy;
 INFILE DATALINES DLM='' missover;
 length CustID $10 L1 $20 L2 $20 L3 $20 L4 $20 L5 $20 L6 $20 L7 $20 L8 $20;
 input CustID  $ L1 $ L2 $ L3 $ L4 $ L5 $ L6 $  L7 $ L8 $;
 cards;
 1 James James Luke Tom Anthony Matthew Jake Brudhy
 2 Peter Peter Raymond Raymond Peter Luke
 ;

Objective:

My objective is to deduplicate across rows. Across each row, if the values for any successive levels (L1,L2...) are the same, I would like to delete the higher level i.e. if L1 and L2 contain same values, then delete L2. However, since blanks can only be present after the last value in each row and not within the row, where duplicates are found and deleted, I would like to shift the values on the right of the empty cell to the left to fill up the deleted cell (arrow in above diagram shows this).

After doing that, final table should look like this.

All empty cells are to the left end of each row.

Does anyone know how to accomplish this?

Thanks,

novinosrin · Posted 08-20-2018 04:11 PM

Some fun

data want;
if _n_=1 then do;
length k $8;
  dcl hash H (multidata:'y') ;
   h.definekey  ("custid") ;
   h.definedata ("custid","k") ;
   h.definedone () ;
   call missing(k);
end;
set Hierarchy ;
array t(*) l:;
do i=1 to dim(t);
if not missing(t(i)) then 
if  i=1 or  t(i) ne t(i-1) then h.add(key:custid,data:custid,data:t(i));
end;
do i=1 by 1 while(h.do_over() eq 0);
t(i)=k;
end;
do i=i to dim(t);
 call missing(t(i));
end;
keep custid l:;
run;

View solution in original post

mkeintz · Posted 08-20-2018 01:21 PM

Let's go into pedagogical mode:

For each row (each "observation" in data step lingo) you want to iteratively do the same process to every successive pairs of neighboring values, namely

if "neighboring" values match, then shift to the left all subsequent (further right) neighbors, making sure to replace the right-most value with a blank.
either recheck the current pair of neighbors or (if you know you never have triplets) immediately go to the next pair. By "next pair" I mean step one name to the right - i.e. right-side of current pair becomes left-side next pair.

To implement iterative processes, examine the do loop. For 8 variables, you will want to loop through this process up to 7 times (L1 vs L2, then L2 vs L3, .... L7 vs L8).

And to facilitate support of the do loop for your task, learn about declaring the list of variables L1, L2, ... L8 as an array, via the ARRAY statement. Once associated with an array, the variables L1-L8 can not only be accessed by their own names, but also as element 1 of the array, element 2 of the array, etc. I.e. you can dismiss using specific variables names, and instead use a loop where you compare element i vs element i+1, and then just increment i in the loop.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

novinosrin · Posted 08-20-2018 01:24 PM

data Hierarchy;
 INFILE DATALINES DLM='' missover;
 length CustID $10 L1 $20 L2 $20 L3 $20 L4 $20 L5 $20 L6 $20 L7 $20 L8 $20;
 input CustID  $ L1 $ L2 $ L3 $ L4 $ L5 $ L6 $  L7 $ L8 $;
 cards;
 1 James James Luke Tom Anthony Matthew Jake Brudhy
 2 Peter Peter Raymond Raymond Peter Luke
 ;


proc transpose data=Hierarchy out=_Hierarchy;
by custid;
var l:;
run;
data want;
set _Hierarchy;
by custid col1 notsorted;
array L(8) $ ;
retain l;
if first.custid then do; n=0;call missing(of l(*));end;
if first.col1 then n+1; L(n)=col1;
if last.custid;
drop n col1 _name_ ;
run;

frupaul · Posted 08-20-2018 02:33 PM

Hi Novinosrin,

Thanks for the solution. I am dealing with 100000 records. Will this proposed approach work on such high volume data?

novinosrin · Posted 08-20-2018 02:35 PM

Please test and let the community know. If it's not robust, I am sure you will have get more efficient solutions, however until then play with it

frupaul · Posted 08-20-2018 02:34 PM

HI Novinosrin

Thanks for the solution. I am dealing with 100,000 records. Will this proposed approach work on such high volume data?

novinosrin · Posted 08-20-2018 04:11 PM

Some fun

data want;
if _n_=1 then do;
length k $8;
  dcl hash H (multidata:'y') ;
   h.definekey  ("custid") ;
   h.definedata ("custid","k") ;
   h.definedone () ;
   call missing(k);
end;
set Hierarchy ;
array t(*) l:;
do i=1 to dim(t);
if not missing(t(i)) then 
if  i=1 or  t(i) ne t(i-1) then h.add(key:custid,data:custid,data:t(i));
end;
do i=1 by 1 while(h.do_over() eq 0);
t(i)=k;
end;
do i=i to dim(t);
 call missing(t(i));
end;
keep custid l:;
run;

novinosrin · Posted 08-20-2018 08:19 PM

Cleaned further and edited some redundant step:

data want;
if _n_=1 then do;
length k $8;
  dcl hash H (multidata:'y') ;
   h.definekey  ("custid") ;
   h.definedata ("k") ;
   h.definedone () ;
   call missing(k);
end;
set Hierarchy ;
array t(*) l:;
do i=1 to dim(t);
if not missing(t(i)) then 
if  i=1 or  t(i) ne t(i-1) then h.add(key:custid,data:t(i));
end;
call missing(of t(*));
do i=1 by 1 while(h.do_over() eq 0);
t(i)=k;
end;
keep custid l:;
run;

FreelanceReinh · Posted 08-21-2018 05:16 AM

@novinosrin: Well done. The revised version is more efficient, but interestingly it seems to be still a bit slower than your two-step solution using PROC TRANSPOSE (at least on my computer; probably also depending on input data): With an input dataset containing 10 million observations the latter took 24.0 s vs. 26.4 s for the hash solution, which in turn was faster than the below solution using the (slow) PRXCHANGE function (27.3 s) -- all run times averaged over three runs.

data want(drop=c i);
length c $176;
set Hierarchy;
array l[8];
c=prxchange('s/(~[^~]+~)(\1)+/\1/', -1, '~'||catx('~~',of l:)||'~');
do i=1 to dim(l);
  l[i]=scan(c,i,'~');
end;
run;

(To avoid truncation, I increased the length specifications for variable k in your hash solution and array L in your two-step solution to $20.)

frupaul · Posted 08-21-2018 05:27 AM

Respect mate. This will have to send me back to school. Short and sleek. Although I have no idea how it works, it works.

Thanks

FreelanceReinh · Posted 08-21-2018 06:19 AM

Thanks, @frupaul.

I don't use the PRX functions frequently, so it took me some time to (hopefully) get the regex right.

The third argument of PRXCHANGE encloses the values of L1, ..., L8 in '~' characters (which must not occur in any of the values) and concatenates the results. The resulting string (with a maximum length of (20+2)*8=176) is then searched for a string of the form "~more characters~" (where "more characters" may contain anything but the '~': "[^~]+" means "one or more non-tilde characters") which is followed by one or more occurrences of the same string (this is the "(\1)+" part). If such a duplicate (or triplicate, ...) string is found, it is replaced by only one copy (the final "\1" in the regex). The search/replace operation is continued until the end of the input string is reached (second argument -1).

The enclosing characters (I chose '~' because it is not one of the PRX metacharacters) were necessary to avoid partial matches such as "James" in "James Miller".

(I was tempted to use CALL POKELONG to write the final result to L1-L8 -- which could have shortened the code further [no array, no DO loop] -- but since CATX does not provide an option to preserve trailing blanks, that was not successful.)

frupaul · Posted 08-21-2018 06:30 AM

Hi FreelanceReinhard,

Thanks for the explanation, it makes sense but will become much clearer once I have read the PRX functions.

I will be putting up another question in 5 minutes, along the same line of reasoning as this. But that one will be really challenging. Please share your thoughts when it comes.

Thanks

novinosrin · Posted 08-21-2018 08:02 AM

@FreelanceReinh Thank you Sir, as always for the details. Your diligence teaches me a lot. And I can't thank you enough for your time in a lot of thread to help me learn a lot. Just waking up in Chicago. Good morning to you. I hope you have a nice day.

FreelanceReinh · Posted 08-21-2018 09:17 AM

You're welcome, @novinosrin. It's a mutual learning experience. So, thank you, too.

frupaul · Posted 08-22-2018 02:10 AM

Hi Novinosrin,

I will be posting an extension to this problem. Just realised the initial post was an over simplification of the issue. Please do chip in.

The post will have a similar subject to this one.

Thanks

Register Today!

SAS Training: Just a Click Away