
Can we get the same ex2 dataset in data step also? If so please give me the code

Contributor
Posts: 49

Can the ex2 dataset produced by the code below also be created with a data step? If so, please share the code.

data best;
    input patient 1-2 arm $ 4-5 bestres $ 6-7 delay 9-10;
  datalines;
01 A CR 0
02 A PD 1
03 B PR 1
04 B CR 2
05 C SD 1
06 C SD 3
07 C PD 2
01 A CR 0
03 B PD 1
  ;
run;

proc sort data=best nodup out=ex2;
  by arm patient;
run;

Regular Contributor
Posts: 151

Re: Can we get the same ex2 dataset in data step also? If so please give me the code

Not easy with an unsorted dataset. You could store the concatenated values of arm and patient in a temporary array that grows by one element for each row read, then compare the current row's value against the values already stored and delete the row if it matches. The array could grow as large as the number of rows in the dataset, so I imagine this could cause memory issues for large datasets.

The main question is why you want to do this?

Contributor
Posts: 49

Re: Can we get the same ex2 dataset in data step also? If so please give me the code

Sorry for the misunderstanding. My question is:

Is there an alternative way to remove duplicates in a data step after sorting?

Thank you

Frequent Contributor
Posts: 95

Re: Can we get the same ex2 dataset in data step also? If so please give me the code

Are you trying to de-dup by values (arm, patient) or by observations (arm, patient, bestres, delay)?

Regular Contributor
Posts: 151

Re: Can we get the same ex2 dataset in data step also? If so please give me the code

Apologies, I was thinking the 'nodupkey' was in place, not 'nodup'.  I have another method of achieving your output dataset, using PROC SUMMARY.  This works fine for datasets with not many variables and rows, but I wouldn't recommend using it for large datasets.  You won't need to presort the data this way.

proc summary data=best nway;
  class _all_;
  output out=ex2 (drop=_:);
run;

Frequent Contributor
Posts: 95

Re: Can we get the same ex2 dataset in data step also? If so please give me the code

Since the NODUP option (an alias for NODUPRECS) is used on the PROC SORT statement, there will be 8 records in the final data set.

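data _null_;
  infile datalines eof=last;
  /* create the hash object once, keyed on every variable */
  if _n_ = 1 then do;
    declare hash h(ordered:'a');
    h.defineKey('arm','patient','bestres','delay');
    h.defineDone();
  end;
  input patient 1-2 arm $ 4-5 bestres $ 6-7 delay 9-10;
  /* store the record only if this combination is not already in the hash */
  if h.check() ne 0 then h.replace();
  /* skip the labeled statement on ordinary iterations */
  return;
  /* reached at end of file through eof=last: write the unique records */
  last: h.output(dataset:'ex2');
  stop;
  datalines;
01 A CR 0
02 A PD 1
03 B PR 1
04 B CR 2
05 C SD 1
06 C SD 3
07 C PD 2
01 A CR 0
03 B PD 1
;
run;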

Respected Advisor
Posts: 3,124

Re: Can we get the same ex2 dataset in data step also? If so please give me the code

Hi,

Your sample data presents a complicated situation, which makes me wonder what your true intention is:

1. You used 'nodup' instead of 'nodupkey', which will NOT remove observations that duplicate only the BY KEYS, such as the 3rd and 9th obs (same arm and patient, different bestres).

2. If you just want no duplicates in terms of all variables, you can use 'nodup' plus 'by _all_'.

3. If you want your input and output as is, then as Keith said, that would be difficult using just a data step. However, if what you really want is 'nodupkey' or no duplicates across all variables, the data step hash object has its native edge:

/* nodupkey equivalent; removing duplicates across all variables needs only a minor tweak to the following code */

data best;
    input patient 1-2 arm $ 4-5 bestres $ 6-7 delay 9-10;
  datalines;
01 A CR 0
02 A PD 1
03 B PR 1
04 B CR 2
05 C SD 1
06 C SD 3
07 C PD 2
01 A CR 0
03 B PD 1
  ;
run;

data _null_;
  /* if 0 then set: adds the variables of best to the PDV without reading any rows */
  if 0 then set best;
    /* loading from the dataset keeps the first row found for each key */
    dcl hash h(dataset:'best', ordered:'a');
    h.definekey('arm','patient');
    h.definedata(all:'y');
    h.definedone();
  _rc=h.output(dataset:'ex2');
run;

proc print data=ex2; run;

Good Luck,

Haikuo
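
Point 2 in the reply above, NODUP combined with BY _ALL_, could look like the sketch below. This is an illustration and not code from the thread; the output name ex2_all is a hypothetical name chosen here so that ex2 is not overwritten. Note that BY _ALL_ sorts in variable order (patient, arm, bestres, delay), so the row order may differ from a sort by arm and patient.

```sas
/* Same sample data as in the thread */
data best;
  input patient 1-2 arm $ 4-5 bestres $ 6-7 delay 9-10;
  datalines;
01 A CR 0
02 A PD 1
03 B PR 1
04 B CR 2
05 C SD 1
06 C SD 3
07 C PD 2
01 A CR 0
03 B PD 1
;
run;

/* Sorting by every variable places identical records next to each other,
   so NODUP (NODUPRECS) can drop them wherever they sat in the input.
   ex2_all is a hypothetical output name. */
proc sort data=best nodup out=ex2_all;
  by _all_;
run;
```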

Occasional Contributor
Posts: 5

Re: Can we get the same ex2 dataset in data step also? If so please give me the code

It seems you are looking for a way to remove duplicates on some key variables (a method other than proc sort nodupkey):

See if it helps:

data best;
    input patient 1-2 arm $ 4-5 bestres $ 6-7 delay 9-10;
datalines;
01 A CR 0
02 A PD 1
03 B PR 1
04 B CR 2
05 C SD 1
06 C SD 3
07 C PD 2
01 A CR 0
03 B PD 1
;

proc sort data=best ;
by arm patient;
run;


data uniq;
set best;
by arm patient;
if first.patient;
run;
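
If the goal is instead to drop only records that match on every variable, as NODUP did in the original question, the same FIRST. pattern might be extended to all four variables. The following is a sketch under that assumption, not code from the thread; best_sorted and uniq_all are hypothetical dataset names.

```sas
/* Same sample data as in the thread */
data best;
  input patient 1-2 arm $ 4-5 bestres $ 6-7 delay 9-10;
  datalines;
01 A CR 0
02 A PD 1
03 B PR 1
04 B CR 2
05 C SD 1
06 C SD 3
07 C PD 2
01 A CR 0
03 B PD 1
;
run;

/* Sort on every variable so that identical records become adjacent */
proc sort data=best out=best_sorted;
  by arm patient bestres delay;
run;

data uniq_all;
  set best_sorted;
  by arm patient bestres delay;
  /* first.delay is 1 whenever delay, or any BY variable listed before it,
     changes, so only the first copy of each full record is kept */
  if first.delay;
run;
```

With the sample data this should keep 8 observations, the same count reported earlier in the thread for the NODUP sort.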
