devarayalu
Fluorite | Level 6

data best;
  input patient 1-2 arm $ 4-5 bestres $ 6-7 delay 9-10;
datalines;
01 A CR 0
02 A PD 1
03 B PR 1
04 B CR 2
05 C SD 1
06 C SD 3
07 C PD 2
01 A CR 0
03 B PD 1
;
run;

proc sort data=best nodup out=ex2;
  by arm patient;
run;

7 REPLIES
Keith
Obsidian | Level 7

Not easy with an unsorted dataset. You could store the concatenated values of arm and patient in a temporary array, adding one element for each row read in, and compare the value of the current row with the values already stored in the array. If it matches, delete the row. This means the maximum array size will be the number of rows in the dataset, so I imagine this could cause memory issues for large datasets.

The main question is: why do you want to do this?
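
A minimal sketch of the temporary-array idea described above, assuming the data contain at most 1000 distinct (arm, patient) pairs. The array size, the output name ex2_key and the helper variables key, n, i and found are illustrative assumptions, not part of the original post. It keeps the first row seen for each pair without sorting, so it behaves like NODUPKEY rather than NODUP:

```sas
data ex2_key;
  set best;
  /* assumed upper bound on the number of distinct keys; size it to your data */
  array seen[1000] $ 20 _temporary_;
  length key $ 20;
  key = catx('|', arm, patient);      /* concatenated key for the current row */
  found = 0;
  do i = 1 to n while (not found);    /* compare with the keys stored so far */
    if seen[i] = key then found = 1;
  end;
  if found then delete;               /* key already seen: drop the row */
  n + 1;                              /* sum statement: n is retained and starts at 0 */
  seen[n] = key;                      /* remember the new key */
  drop key found i n;
run;
```

With the sample data this should keep 7 rows, dropping the 8th and 9th input rows. Every row is compared against all stored keys, so run time grows quickly with the number of distinct keys, in addition to the memory concern.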

devarayalu
Fluorite | Level 6

Sorry for the misunderstanding. My question is:

Is there an alternative approach to removing duplicates in a DATA step after sorting?

Thank you

Alpay
Fluorite | Level 6

Are you trying to de-dup by values (arm, patient) or observations (arm, patient, bestres, delay)?

Keith
Obsidian | Level 7

Apologies, I was thinking 'nodupkey' was in place, not 'nodup'. I have another method of achieving your output dataset, using PROC SUMMARY. This works fine for datasets with few variables and rows, but I wouldn't recommend it for large datasets. You won't need to presort the data this way.

proc summary data=best nway;
  class _all_;
  output out=ex2 (drop=_:);
run;

Alpay
Fluorite | Level 6

Since the "nodup" option (an alias for NODUPRECS) is used on the PROC SORT statement, there will be 8 records in the final data set.

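data _null_;
  infile datalines eof=last;
  if _n_ = 1 then do;
    /* hash keyed on every variable, so each distinct record is stored once */
    declare hash h(ordered:'a');
    h.defineKey('arm','patient','bestres','delay');
    h.defineDone();
  end;
  input patient 1-2 arm $ 4-5 bestres $ 6-7 delay 9-10;
  if h.check() ne 0 then h.replace();
  return; /* go back for the next record; without this, h.output() would run on every row */
  last: h.output(dataset:'ex2'); /* reached only at end of file via eof=last */
  stop;
datalines;
01 A CR 0
02 A PD 1
03 B PR 1
04 B CR 2
05 C SD 1
06 C SD 3
07 C PD 2
01 A CR 0
03 B PD 1
;
run;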

Haikuo
Onyx | Level 15

Hi,

Your sample data presents a complex situation, which makes me wonder what your true intention is:

1. You used 'nodup' instead of 'nodupkey'. 'nodup' compares whole observations, so it will NOT remove duplicates BY KEYS, such as the 3rd and 9th obs (same arm and patient, different bestres).

2. If you just want no duplicates in terms of all variables, you can use 'nodup' plus 'by _all_'.

3. If you want your input and output as is, then, like Keith said, that would be difficult using just a data step. However, in case what you really want is 'nodupkey' or no duplicates for all variables, the data step hash object has its own edge:

/*nodupkey equivalent; removing duplicates across all variables needs only a minor tweak to the following code*/
data best;
  input patient 1-2 arm $ 4-5 bestres $ 6-7 delay 9-10;
datalines;
01 A CR 0
02 A PD 1
03 B PR 1
04 B CR 2
05 C SD 1
06 C SD 3
07 C PD 2
01 A CR 0
03 B PD 1
;
run;

data _null_;
  if 0 then set best; /* never executed; only defines the variables at compile time */
  dcl hash h(dataset:'best', ordered:'a'); /* loading keeps the first record for each key */
  h.definekey('arm','patient');
  h.definedata(all:'y');
  h.definedone();
  _rc=h.output(dataset:'ex2');
  stop; /* end the step explicitly, since no SET statement ever reads to end of file */
run;

proc print;run;

Good Luck,

Haikuo
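
The two whole-record variants mentioned above might look like the following. This is a sketch rather than tested code from the thread, and the output names ex2_sort and ex2_hash are made up for illustration. The first step is 'nodup' plus 'by _all_'; the second is the hash step with every variable of BEST used as a key, through the ALL: argument tag of DEFINEKEY:

```sas
/* NODUP with BY _ALL_: identical records always sort next to each other */
proc sort data=best nodup out=ex2_sort;
  by _all_;
run;

/* hash variant: every variable is a key, so only distinct records are stored */
data _null_;
  if 0 then set best;               /* defines the variables, reads nothing */
  dcl hash h(dataset:'best', ordered:'a');
  h.definekey(all:'yes');           /* keys follow the variable order of BEST */
  h.definedata(all:'yes');
  h.definedone();
  _rc = h.output(dataset:'ex2_hash');
  stop;
run;
```

For the sample data, both steps should leave 8 records, since only the 1st and 8th rows match on every variable.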

pali
Fluorite | Level 6

It seems you are looking for a way to remove duplicates on some key variables (a method other than proc sort nodupkey):

See if it helps:

data best;
    input patient 1-2 arm $ 4-5 bestres $ 6-7 delay 9-10;
datalines;
01 A CR 0
02 A PD 1
03 B PR 1
04 B CR 2
05 C SD 1
06 C SD 3
07 C PD 2
01 A CR 0
03 B PD 1
;

proc sort data=best ;
by arm patient;
run;


data uniq;
set best;
by arm patient;
if first.patient;
run;
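
If the goal is instead to mimic NODUP in a data step, dropping only records that match on every variable, the same BY-group pattern can be extended. A sketch, assuming the four variables from the sample data; the names best_all and uniq_all are hypothetical:

```sas
proc sort data=best out=best_all;
  by arm patient bestres delay;
run;

data uniq_all;
  set best_all;
  by arm patient bestres delay;
  /* first.delay is 1 whenever any BY variable changes, i.e. on each new distinct record */
  if first.delay;
run;
```

For the sample data this should keep 8 records, the same count as the original proc sort with nodup.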
