Solved: Re: How do I dedupe observations with similar pattern?

rush_milo · Posted 06-24-2021 03:49 PM

Dear SAS communities,

Below is the 'Have' table and I would like to dedup it into the 'Want' table as below. The end goal is to have a smaller 'Have' table since the current one has multiple observations with different month/date.

Have	Remark
/sas/table_hist_m01.sas7bdat	Keep, but rename _m01 to mXX
/sas/table_hist_m02.sas7bdat	Remove
/sas/table_hist_m03.sas7bdat	Remove
/sas/table_hist_m04.sas7bdat	Remove
/sas/ww0.sas7bdat	Keep, no change due to single table
/sas/table_new_20210301.sas7bdat	Keep, but rename _20210301 to _YYYYMMDD
/sas/table_new_20210302.sas7bdat	Remove
/sas/table_new_20210303.sas7bdat	Remove
/sas/table_new_20210304.sas7bdat	Remove

Want
/sas/table_hist_mXX.sas7bdat
/sas/ww0.sas7bdat
/sas/table_new_YYYYMMDD.sas7bdat

May I ask if there's an efficient way of doing this in SAS? Appreciate your kind advice. Thanks!

ChrisNZ · Posted 06-25-2021 02:03 AM

Another way

data WANT;
  set HAVE(keep=PATH);
  PATH = prxchange('s/_m\d\d(\.sas7bdat)/_mXX\1/'     , 1, PATH);
  PATH = prxchange('s/_\d{8}(\.sas7bdat)/_YYYYMMDD\1/', 1, PATH);
  if PATH ne lag(PATH); 
  ID+1;
run;

PATH	ID
/sas/table_hist_mXX.sas7bdat	1
/sas/ww0.sas7bdat	2
/sas/table_new_YYYYMMDD.sas7bdat	3

High-Performance SAS Coding - Third Edition

View solution in original post

tarheel13 · Posted 06-24-2021 05:31 PM

Can you please post your data as datalines? It looks like prxmatch might work.

rush_milo · Posted 06-24-2021 09:39 PM

Hi @tarheel13

Sorry here's the datalines:

DATA HAVE;
	  INFILE DATALINES DELIMITER=',' DSD;
	  INPUT id PATH & $50. REMARK & $50. ;
	  DATALINES;
  1, "/sas/table_hist_m01.sas7bdat", "Keep but rename _m01 to mXX"
  2, "/sas/table_hist_m02.sas7bdat", "Remove"
  3, "/sas/table_hist_m03.sas7bdat", "Remove"
  4, "/sas/table_hist_m04.sas7bdat", "Remove"
  5, "/sas/ww0.sas7bdat", "Keep, no change due to single table"
  6, "/sas/table_new_20210301.sas7bdat", "Keep, but rename _20210301 to _YYYYMMDD"
  7, "/sas/table_new_20210302.sas7bdat", "Remove"
  8, "/sas/table_new_20210303.sas7bdat", "Remove"
  9, "/sas/table_new_20210304.sas7bdat", "Remove"
RUN;

DATA WANT;
	  INFILE DATALINES DELIMITER=',' DSD;
	  INPUT id PATH & $50.;
	  DATALINES;
  1, "/sas/table_hist_m01.sas7bdat"
  2, "/sas/ww0.sas7bdat"
  3, "/sas/table_new_YYYYMMDD.sas7bdat"
RUN;

rush_milo · Posted 06-24-2021 09:48 PM

Sorry @tarheel13 the 'Want' datalines should be as below. Corrected m01 to mXX.

DATA WANT;
	  INFILE DATALINES DELIMITER=',' DSD;
	  INPUT id PATH & $50.;
	  DATALINES;
  1, "/sas/table_hist_mXX.sas7bdat"
  2, "/sas/ww0.sas7bdat"
  3, "/sas/table_new_YYYYMMDD.sas7bdat"
RUN;

SASKiwi · Posted 06-24-2021 05:43 PM

So where does the title come into what you want? Do you want to concatenate the rows in m01 - m04 into dataset mxx for example?

rush_milo · Posted 06-24-2021 09:45 PM

Hi @SASKiwi,

Sorry for not including the datalines. Basically I want the 'Path' column to be deduped into a single line only. That is from:

m01, m02, m03 --> To: mXX
/sas/table_new_20210301.sas7bdat, /sas/table_new_20210302.sas7bdat --> /sas/table_new_YYYYMMDD.sas7bdat

Below are the datalines for reference:

DATA HAVE;
	  INFILE DATALINES DELIMITER=',' DSD;
	  INPUT id PATH & $50. REMARK & $50. ;
	  DATALINES;
  1, "/sas/table_hist_m01.sas7bdat", "Keep but rename _m01 to mXX"
  2, "/sas/table_hist_m02.sas7bdat", "Remove"
  3, "/sas/table_hist_m03.sas7bdat", "Remove"
  4, "/sas/table_hist_m04.sas7bdat", "Remove"
  5, "/sas/ww0.sas7bdat", "Keep, no change due to single table"
  6, "/sas/table_new_20210301.sas7bdat", "Keep, but rename _20210301 to _YYYYMMDD"
  7, "/sas/table_new_20210302.sas7bdat", "Remove"
  8, "/sas/table_new_20210303.sas7bdat", "Remove"
  9, "/sas/table_new_20210304.sas7bdat", "Remove"
RUN;

DATA WANT;
	  INFILE DATALINES DELIMITER=',' DSD;
	  INPUT id PATH & $50.;
	  DATALINES;
  1, "/sas/table_hist_mXX.sas7bdat"
  2, "/sas/ww0.sas7bdat"
  3, "/sas/table_new_YYYYMMDD.sas7bdat"
RUN;

andreas_lds · Posted 06-25-2021 12:29 AM

The following code could have/cause some issues, if the regular expression match wrong parts of Path.

data work.replaced;
   set have(drop= Remark);
   
   Path = prxchange('s/_m\d\d/_mXX/', 1, Path);
   Path = prxchange('s/_\d{8}/_YYYYMMDD/', 1, Path);   
run;


proc sort data=work.replaced out=work.nodups nodupkey;
   by Path;
run;

ChrisNZ · Posted 06-25-2021 02:03 AM

Another way

data WANT;
  set HAVE(keep=PATH);
  PATH = prxchange('s/_m\d\d(\.sas7bdat)/_mXX\1/'     , 1, PATH);
  PATH = prxchange('s/_\d{8}(\.sas7bdat)/_YYYYMMDD\1/', 1, PATH);
  if PATH ne lag(PATH); 
  ID+1;
run;

PATH	ID
/sas/table_hist_mXX.sas7bdat	1
/sas/ww0.sas7bdat	2
/sas/table_new_YYYYMMDD.sas7bdat	3

High-Performance SAS Coding - Third Edition

rush_milo · Posted 06-25-2021 06:34 AM

Hi @ChrisNZ,

Your code works great! Thank you so much.

However, I noticed that it doesn't seem to work for observations that are not in sequence. Kindly refer to my 'Have' datalines below for id 3 - 8. The sequence is sas7bdat, followed by sas7bvew. I would like to keep one each for both and remove the rest as per my 'Want'.

DATA HAVE;
	  INFILE DATALINES DELIMITER=',' DSD;
	  INPUT id PATH & $50. REMARK & $50. ;
	  DATALINES;
  1, "/sas/table_new_dummy.sas7bdat", "Dummy"
  2, "/sas/table_old_dummy.sas7bdat", "Dummy"
  3, "/sas/table_new_20210301.sas7bdat", "Keep"
  4, "/sas/table_new_20210301.sas7bvew", "Keep"
  5, "/sas/table_new_20210302.sas7bdat", "Remove"
  6, "/sas/table_new_20210302.sas7bvew", "Remove"
  7, "/sas/table_new_20210303.sas7bdat", "Remove"
  8, "/sas/table_new_20210303.sas7bvew", "Remove"
RUN;

DATA WANT;
	  INFILE DATALINES DELIMITER=',' DSD;
	  INPUT id PATH & $50.;
	  DATALINES;
  1, "/sas/table_new_dummy.sas7bdat"
  2, "/sas/table_old_dummy.sas7bdat"
  3, "/sas/table_new_YYYYMMDD.sas7bdat"
  4, "/sas/table_new_YYYYMMDD.sas7bvew"
RUN;

Really appreciate your advice. Thanks!