Re: how to delete few records in a dataset at once

alepage · Posted 03-19-2024 04:00 PM

Hello,

I have been using the dataset Class from sashelp.class as data source to illustrate what I would like to do.

So, here's the case scenario. We have many datasets, from dataset1 to dataset4. In each dataset, we have 5 records except the last one (dataset4) which has 4 records.

The dataset class3 is generated randomly to provide a list of records to be deleted from the original dataset, and we also need to keep a trace of the information that was deleted somewhere else. (dataset1A -- dataset4A).

Also, I would like to open a dataset, delete all the requested records in one shot then closed it.

So, I wonder if the script below is deleting all the requested records per dataset in one shot.

data lookup;
input  name $8. dsname $9.;
datalines;
Alfred  dataset1
Alice   dataset1
Barbara dataset1
Carol   dataset1
Henry   dataset1 
James   dataset2
Jane    dataset2
Janet   dataset2
Jeffrey dataset2
John    dataset2
Joyce   dataset3
Judy    dataset3
Louise  dataset3
Mary    dataset3
Philip  dataset3
Robert  dataset4
Ronald  dataset4
Thomas  dataset4
William dataset4
;
run;

data class;
set sashelp.class;
run;
proc sql;
create table class2 as
select a.*,
       b.dsname
from class as a
inner join lookup as b
on(a.name=b.name);
quit;

%macro multipleds;

%do j=1 %to 4;
	%let name=%sysfunc(cats(dataset,&j));
	%put &name.;
	Data &name.;
	set class2;
	if compress(dsname) eq %tslit(dataset&j.) then output dataset&j.;
	run; 
%end;

%mend multipleds;
%multipleds;

data temp (keep=name); 
  if 0 then set SASHELP.CLASS nobs=NOBS;
  do I=1 to 10;
    N=rand('uniform',1,NOBS); 
    OBS=N;
    set SASHELP.CLASS point=N;
    output; 
  end ;
  stop  ;
run;
/**** class3 contains the list of records to be deleted ****/
proc sql;
create table class3 as
select a.*
from lookup as a
inner join temp as b
on(a.name=b.name);
quit;

/********* put deleted records into new datasets *************/
data _null_;
   set class3;
   dsname2=cats(dsname,"A");
   call execute
   (compbl(cat(
   "proc sql;",
    "create table ", dsname2, " as ",
	"select a.*  ",
	"from ",dsname," as a  ",
	"where name in (select name from class3) ",
	"having dsname eq ",%tslit(dsname),";quit;"

    )));
run;

/********* Deleting few records per datasets *************/

data _null_;
   set class3;   
   call execute
   (compbl(cat(
   "proc sql;",
    "delete ", 
	"from ",dsname,	
	"where name in (select name from class3 having dsname eq ",%tslit(dsname),");quit;"

    )));
run;


%macro multipleds2;

%do j=1 %to 4;
	%let name=%sysfunc(cats(dataset,&j));
	%put &name.;
	Data &name.;
	set &name.;	
	run; 
%end ;

%mend multipleds2;
%multipleds2;

PaigeMiller · Posted 03-19-2024 04:32 PM

So, I wonder if the script below is deleting all the requested records per dataset in one shot.

You could try it and see. Maxim 4

--
Paige Miller

ballardw · Posted 03-19-2024 05:17 PM

I'm not quite sure I know what you want. This would be my take on what I think is going on.

The first data step combines data sets to filter and adds the name of the source data set, library and data set, for each record.

The sorts are to get the combined and temp set into the same order.

The last data step uses the In= options to add temporary variables that indicate which data sets contribute to the current observation when merged and then use conditional logic to write to two different data sets. The removed has the data that matches the Temp data. This does assume that one of the data sets (really should be temp) does not have repeats of the matching variable(s).

data combined;
   set set1 set2 set3 set4 indsname=dsn;
   source = dsn;
run;

proc sort data=combined;
   by name;  /* what ever matching criteria */
run;
proc sort data=temp;
   by name;
run;

data removed
     kept 
;
   merge combined (in=in1)
         temp     (in=in2)
   ;
   by name;  /* same matching criteria as sorts*/
   if in2 then output removed;
   else output kept;
run;

If you don't want the Source variable in the Kept data set add the option (drop=source).

alepage · Posted 03-20-2024 01:56 PM

Imagine that we have a list of policies to delete from the original datasets (removepolicieslst)

Removepolicieslst look like that:

PolicyNumber dataset path …

A1111001 Dataset1 p:/path1

A1111011 Dataset1 p:/path1

A1111101 Dataset1 p:/path1

A1111051 Dataset2 p:/path1

A1111061 Dataset2 p:/path1

And so on….

Now we want to remove from dataset1, all the transactions associated to the policyNumber A1111001, A1111011, A1111101 in one shot, then send this information into the dataset1A, instead of opening dataset1, remove the records associated to A1111001, then update the dataset1 and at the same time send the information to dataset1A. Do the same steps for the transactions associated to A1111011 and to A1111101.

A global dataset containing all the information on thousands of policies is not possible.

So my question is, with the example done with sashelp.class are we removing the transaction from dataset1, then dataset2, then dataset3 thereafter the dataset4.

Tom · Posted 03-19-2024 11:41 PM

"in one shot" ? Depends a lot on what you mean by that.

It is a little hard to tell what part of your code is setting up the example and which part is the actual attempt to produce the output from that example input.

Let's assume that LOOKUP was your attempt to make datasets named DATASET1 to DATASET4 from SASHELP.CLASS. So something like:

data dataset1 dataset2 dataset3 dataset4;
  set sashelp.class;
  if findw('Alfred Alice Barbara Carol Henry',Name,' ','st') then output dataset1 ;
  if findw('James Jane Janet Jeffrey John',Name,' ','st') then output dataset2 ;
  if findw('Joyce Judy Louise Mary Philip',Name,' ','st') then output dataset3 ;
  if findw('Robert Ronald Thomas William',Name,' ','st') then output dataset4 ;
run;

And that TEMP is your attempt to make a list of keys to be excluded. So something like:

data temp;
  do i=1 to 10;
    p=rand('uniform',1,nobs);
    set sashelp.class point=p nobs=nobs;
    output;
  end;
  stop;
  keep name;
run;
proc sort;
  by name;
run;

Does that dataset actually need to have DSNAME variable? Why? Could the same NAME appear in more than one of the datasets? Let's just assume it is not important yet.

So to then split your four input datasets into eight output datasets you might so something like:

data in1 in2 in3 in4 out1 out2 out3 out4;
  merge dataset1(in=in1) dataset2(in=in2) dataset3(in=in3) dataset4(in=in4) temp(in=exclude);
  by name;
  if in1 then if exclude then output out1; else output in1;
  if in2 then if exclude then output out2; else output in2;
  if in3 then if exclude then output out3; else output in3;
  if in4 then if exclude then output out4; else output in4;
run;

Result will look something like this:

179  data in1 in2 in3 in4 out1 out2 out3 out4;
180    merge dataset1(in=in1) dataset2(in=in2) dataset3(in=in3) dataset4(in=in4) temp(in=exclude);
181    by name;
182    if in1 then if exclude then output out1; else output in1;
183    if in2 then if exclude then output out2; else output in2;
184    if in3 then if exclude then output out3; else output in3;
185    if in4 then if exclude then output out4; else output in4;
186  run;

NOTE: There were 5 observations read from the data set WORK.DATASET1.
NOTE: There were 5 observations read from the data set WORK.DATASET2.
NOTE: There were 5 observations read from the data set WORK.DATASET3.
NOTE: There were 4 observations read from the data set WORK.DATASET4.
NOTE: There were 10 observations read from the data set WORK.TEMP.
NOTE: The data set WORK.IN1 has 3 observations and 5 variables.
NOTE: The data set WORK.IN2 has 2 observations and 5 variables.
NOTE: The data set WORK.IN3 has 4 observations and 5 variables.
NOTE: The data set WORK.IN4 has 3 observations and 5 variables.
NOTE: The data set WORK.OUT1 has 3 observations and 5 variables.
NOTE: The data set WORK.OUT2 has 5 observations and 5 variables.
NOTE: The data set WORK.OUT3 has 1 observations and 5 variables.
NOTE: The data set WORK.OUT4 has 1 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds

So if you know there are FOUR datasets to read in and EIGHT datasets to write out then a few well placed %DO loops will allow you to generate that code pattern. So perhaps something like this:


%macro exclude
(dslist
,excludeds
);
%local i n dsname;
%let n=%sysfunc(countw(&dslist,%str( )));
data 
%do i=1 %to &n ;
  %let dsname=%scan(&dslist,&i,%str( ));
  &dsname._out &dsname._in
%end;
;
  merge 
%do i=1 %to &n ;
  %let dsname=%scan(&dslist,&i,%str( ));
  &dsname(in=in&i)
%end;
  &excludeds(in=exclude)
;
  by NAME;
%do i=1 %to &n ;
  %let dsname=%scan(&dslist,&i,%str( ));
  if in&i then if exclude then output &dsname._out; else output &dsname._in;
%end;
run;
%mend exclude;

Let's try it:

249  options mprint;
250  %exclude(dataset1 dataset2 dataset3 dataset4,temp);
MPRINT(EXCLUDE):   data dataset1_out dataset1_in dataset2_out dataset2_in dataset3_out dataset3_in dataset4_out dataset4_in ;
MPRINT(EXCLUDE):   merge dataset1(in=in1) dataset2(in=in2) dataset3(in=in3) dataset4(in=in4) temp(in=exclude) ;
MPRINT(EXCLUDE):   by NAME;
MPRINT(EXCLUDE):   if in1 then if exclude then output dataset1_out;
MPRINT(EXCLUDE):   else output dataset1_in;
MPRINT(EXCLUDE):   if in2 then if exclude then output dataset2_out;
MPRINT(EXCLUDE):   else output dataset2_in;
MPRINT(EXCLUDE):   if in3 then if exclude then output dataset3_out;
MPRINT(EXCLUDE):   else output dataset3_in;
MPRINT(EXCLUDE):   if in4 then if exclude then output dataset4_out;
MPRINT(EXCLUDE):   else output dataset4_in;
MPRINT(EXCLUDE):   run;

NOTE: There were 5 observations read from the data set WORK.DATASET1.
NOTE: There were 5 observations read from the data set WORK.DATASET2.
NOTE: There were 5 observations read from the data set WORK.DATASET3.
NOTE: There were 4 observations read from the data set WORK.DATASET4.
NOTE: There were 10 observations read from the data set WORK.TEMP.
NOTE: The data set WORK.DATASET1_OUT has 3 observations and 5 variables.
NOTE: The data set WORK.DATASET1_IN has 3 observations and 5 variables.
NOTE: The data set WORK.DATASET2_OUT has 5 observations and 5 variables.
NOTE: The data set WORK.DATASET2_IN has 2 observations and 5 variables.
NOTE: The data set WORK.DATASET3_OUT has 1 observations and 5 variables.
NOTE: The data set WORK.DATASET3_IN has 4 observations and 5 variables.
NOTE: The data set WORK.DATASET4_OUT has 1 observations and 5 variables.
NOTE: The data set WORK.DATASET4_IN has 3 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds

mkeintz · Posted 03-20-2024 03:44 PM

Instead of randomly generating a list of delete requests in dataset CLASS3. I just made an arbitrary dataset DELETE_GROUP below. The main point is that you can let SAS do all the logical work using the WHERE constraints in the second DATA step below. Just make a hash object out of DELETE_GROUP keyed on name and dsname.

data delete_group;
  input dsname $  name $;
datalines;
DATASET1 Barbara
DATASET1 Henry
DATASET2 James
DATASET2 Jeffrey
DATASET3 Mary
DATASET3 Philip
DATASET4 Thomas
;
run;

data keep_dataset1 (where=(dsname='DATASET1' and delete_code=0))
     keep_dataset2 (where=(dsname='DATASET2' and delete_code=0))
     keep_dataset3 (where=(dsname='DATASET3' and delete_code=0))
     keep_dataset4 (where=(dsname='DATASET4' and delete_code=0))
     del_dataset1  (where=(dsname='DATASET1' and delete_code=1))
     del_dataset2  (where=(dsname='DATASET2' and delete_code=1))
     del_dataset3  (where=(dsname='DATASET3' and delete_code=1))
     del_dataset4  (where=(dsname='DATASET4' and delete_code=1))
     ;
  set dataset1-dataset4 indsname=dsn;
  dsname=scan(dsn,2);
  if _n_=1 then do;
    declare hash h (dataset:'delete_group');
      h.definekey('name','dsname');
      h.definedone();
  end;
  delete_code=(h.check()=0);
run;

And you don't have to macro-ize the logic to make it more dynamic. Once you have the DELETE_GROUPS dataset, you can use PROC SQL to make some macro vars naming the input and output datasets.

proc sql noprint;
  select distinct dsname into :dsname_list separated by ' ' from delete_group;
  select distinct cats('keep_',dsname)||' (where=(dsname='||quote(dsname)||' and delete_code=0))'
    into :keep_list separated by ' ' from delete_group;
  select distinct cats('del_',dsname)||' (where=(dsname='||quote(dsname)||' and delete_code=1))'
    into :delete_list separated by ' ' from delete_group;
quit;
%put &=dsname_list;
%put &=keep_list;
%put &=delete_list;

data &keep_list  &delete_list ;
  set &dsname_list indsname=dsn;
  dsname=scan(dsn,2);
  if _n_=1 then do;
    declare hash h (dataset:'delete_group');
      h.definekey('name','dsname');
      h.definedone();
  end;
  delete_code=(h.check()=0);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Patrick · Posted 03-20-2024 11:10 PM

@alepage You've got already solution proposals so just adding a few comments.

SAS is not a database and it does not support cascading deletes.

It's rarely done with SAS tables but you could define foreign key constraints to enforce that you never delete a master record while there are still transactions linked to it. One reason such constraints aren't used often with SAS tables is that it requires deletion directly on the source table and deletions are only logical meaning the table doesn't reduce in size.

If you need things like cascading deletes then I'd recommend to store the data in a database that supports such functionality.

how to delete few records in a dataset at once