alepage
Barite | Level 11

Hello,

 

Imagine that you bought an auto insurance policy (duration = 12 months), let's say in January 2023, with number x123456. This policy will have transactions (observations) in Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec 2023, in Jan 2024, and maybe in Feb 2024.

 

Therefore the transactions (observations) related to that policy will be spread across 12 or 13 datasets such as:

 

&co._auto_prmjan2023, &co._auto_prmfeb2023, ..., &co._auto_prmdec2023, &co._auto_prmjan2024, and maybe &co._auto_prmfeb2024.

 

Now imagine that I need to remove those transactions (observations) from the original datasets

 

&co._auto_prmjan2023, &co._auto_prmfeb2023, ..., &co._auto_prmdec2023, &co._auto_prmjan2024, and maybe &co._auto_prmfeb2024, and put them into new datasets like:

 

&co._auto_prmjan2023_dr, &co._auto_prmfeb2023_dr, ..., &co._auto_prmdec2023_dr, &co._auto_prmjan2024_dr, and maybe &co._auto_prmfeb2024_dr.

(co = company abbreviation)

 

For only one policy, this task seems simple.

 

But here is the problem: I have 1 200 000 policies spread across approximately 7128 datasets. I have written SAS code that reads the policy list from a CSV file,

then scans those 7128 datasets and tells me, for each policy in the list, in which datasets its transactions can be found.

 

So now I have a 12 000 000-observation dataset, giving me the agreement_nbr, the path, and the filename where each policy's transactions are hosted.

I could do a CALL EXECUTE to remove the observations related to each policy, one by one, and put them into datasets with the same name but with the suffix _dr, e.g.:

move the transactions for policy x123456 from

dataset be._auto_prmjan2023 to dataset be._auto_prmjan2023_dr

dataset be._auto_prmfeb2023 to dataset be._auto_prmfeb2023_dr

...

dataset be._auto_prmfeb2024 to dataset be._auto_prmfeb2024_dr

 

and that is only for policy x123456

 

But from my 12 000 000-observation dataset, I know which policies need to be moved from dataset be._auto_prmjan2023 to be._auto_prmjan2023_dr, from dataset be._auto_prmfeb2023 to be._auto_prmfeb2023_dr, and so on.

 

Which is more efficient: moving those transactions one by one, or grouping them per dataset and moving them in one shot?

And how do I do that? Please provide an example (SAS snippets).

 

Ex: dataset be._auto_prmjan2023

 

x123456

x123457

x123458

x123459

 

need to move x123456 and x123459

 

After removing those policies' transactions:

 

be._auto_prmjan2023

 

x123457

x123458

 

dataset be._auto_prmjan2023_dr

 

x123456

x123459

 

I need to do this task for approximately 1 000 000 policies.

I hope you understand what I need to do.

 

So, given my 12 000 000-observation dataset, how can I do those transaction moves by groups of policies per dataset?

14 REPLIES
Kurt_Bremser
Super User

Create a DATA step view combining all datasets, using the INDSNAME= option to keep track of which dataset is currently being read.

Use this view in a single DATA step that creates all the wanted new datasets; load your policy numbers into a hash object. When a policy is found in the hash, run a SELECT block with OUTPUT statements for all targets; use the variable derived from INDSNAME= to control the SELECT.
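As a rough sketch, such a view could look like the following (the libref BE is taken from later posts in this thread; the view name all_prm and the variable src_ds are placeholders):

```sas
/* A DATA step view over the monthly datasets; INDSNAME= records the source */
data all_prm / view=all_prm;
  set
    be._auto_prmjan2023
    be._auto_prmfeb2023
    /* ... remaining months ... */
    be._auto_prmfeb2024
    indsname=in_ds
  ;
  src_ds = in_ds;  /* the INDSNAME= variable is temporary; copy it to keep it */
run;
```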

alepage
Barite | Level 11

Hello MR. Bremser,

 

I am not familiar with hash methods. On top of that, putting all the datasets into one is not possible due to dataset size.

Any other solution? I still have my dataset which contains the agreement number, the dataset name, and its path. Can we do something with this information?

Kurt_Bremser
Super User

See this prototype code:

data
  a
  b
  c
  d
  e
  a_dt
  b_dt
  c_dt
  d_dt
  e_dt
;
set
  a
  b
  c
  d
  e
  indsname= in_ds
;
if _n_ = 1
then do;
  declare hash dt (dataset:"delete");
  dt.definekey("policy");
  dt.definedata("policy"); * reduces size of hash object;
  dt.definedone();
end;
if dt.check() = 0
then do;
  /* INDSNAME= stores the complete two-level name in uppercase, e.g. WORK.A */
  select(in_ds);
    when ("WORK.A") output a_dt;
    when ("WORK.B") output b_dt;
    when ("WORK.C") output c_dt;
    when ("WORK.D") output d_dt;
    when ("WORK.E") output e_dt;
  end;
end;
else do;
  select(in_ds);
    when ("WORK.A") output a;
    when ("WORK.B") output b;
    when ("WORK.C") output c;
    when ("WORK.D") output d;
    when ("WORK.E") output e;
  end;
end;
run;

You can see the repeating patterns in the code, which allow you to create these with %DO macro loops.

 

Note that the finer the granularity of the splits, the more complex this becomes. Your situation is a fine illustration of why splitting datasets is considered counter-productive, and you should seriously reconsider your current practice.

And if you use timestamped dataset names, use a YMD order with numeric month designations; the datasets then sort chronologically on their own, and wildcards become much easier to use.
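For instance, with YMD-style names a colon (name-prefix) wildcard on the SET statement picks up a whole group at once; a sketch, assuming a libref BE and members named auto_prm_YYYYMM:

```sas
/* The colon wildcard reads every dataset whose name starts with the prefix; */
/* YMD names make the prefix select chronological ranges naturally.          */
data all_auto_prm / view=all_auto_prm;
  set be.auto_prm_2023: indsname=in_ds;  /* all of 2023 in one statement */
  src_ds = in_ds;                        /* keep the source dataset name */
run;
```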

alepage
Barite | Level 11

Hello Mr. Bremser,

 

Thanks for the SAS code. Where do I put my policy list? Into dt.definedata?
ex: dt.definedata(x123456 x123457 x123458 and so on). How does the condition when ("A") work? It is not clear to me.

Kurt_Bremser
Super User

The list of policies to be removed is in the dataset "delete", from which the code builds the hash object.

The strings in the WHEN statements of the SELECT blocks are the dataset names as they are stored in the variable in_ds, which is created by the INDSNAME= option. In a real world case, you must also include the library, as INDSNAME stores the complete 2-level dataset name in the variable. Everything must be uppercase.

Please consult the documentation of the SET, SELECT and WHEN data step statements.
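As a side note, the "delete" dataset can be loaded from the CSV policy list mentioned in the original post; a minimal sketch, assuming a one-column CSV with a header row (the path, column name, and length here are assumptions):

```sas
/* Read the policy numbers to remove into WORK.DELETE for the hash object */
data work.delete;
  infile '/path/to/policies.csv' dsd firstobs=2 truncover;
  input policy :$20.;
run;
```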

mkeintz
PROC Star

Questions:

 

  1. Are you saying that you have a file containing the policy numbers to remove and the names of the pre-existing datasets from which those policies are to be removed?
        OR
  2. If a policy is to be removed, is it to be removed from ALL datasets in which it may be found?

  3. Are the pre-existing datasets sorted by policy ID?  Answering this will help in determining whether to use hash methods for removing policies.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
alepage
Barite | Level 11

Good Morning,

 

  1. I have a file containing the policy numbers to remove, plus the names and paths of the pre-existing datasets from which those policies are to be removed.

  2. Yes, if a policy is to be removed, it is to be removed from ALL datasets in which it may be found.

  3. If the pre-existing datasets are not sorted by policy number, I can run a PROC SORT on them before manipulating the dataset.
Tom
Super User

So to remove policies from a single dataset and at the same time write them to another dataset you could use something like this:

data be._auto_prmjan2023 be._auto_prmjan2023_dr;
  set be._auto_prmjan2023;
  if POLICY in ("x123456" "x123459") 
    then output be._auto_prmjan2023_dr;
  else output be._auto_prmjan2023 ;
run;

So if you have a dataset with the list of policies to remove from each dataset, then generating such code is trivial with a DATA step.

 

Let's assume you have built a dataset with two variables: DATASET and POLICY. Sort by DATASET and use BY-group processing to generate the code:

filename code temp;
data _null_;
  file code lrecl=75 ;
  set policy_list;
  by dataset;
  if first.dataset then put 'data ' dataset dataset +(-1) '_dr;'
  /  'set ' dataset ';'
  /  'if policy in (' @
  ;
  put policy :$quote. @ ;
  if last.dataset then put ')'
   / '    then output ' dataset +(-1) '_dr;' 
   / '  else output ' dataset ';'
   / 'run;'
  ;
run;
%include code / source2;

If these datasets are large you might want to instead use MODIFY in the generated data steps.

data be._auto_prmjan2023 be._auto_prmjan2023_dr;
  modify be._auto_prmjan2023;
  if POLICY in ("x123456" "x123459") then do;
    output be._auto_prmjan2023_dr;
    remove be._auto_prmjan2023;
  end;
run;
alepage
Barite | Level 11

Hello, I have been testing the following script to gather all the agreement_nbr values per filename into one variable, as below:

 

data found_agreements ;
set found_agreements ;
run;

data policiesgroupbyfilename (keep=cie lob year month libname filename policieslist);
length policieslist $32767.;
do until(last.filename);
    set found_agreements; by cie lob year mm filename agreement_nbr;
    policieslist = catx(", ", policieslist, "'"||strip(agreement_nbr)||"'");
    end;
run;

Then the idea was to use a call execute to proceed with your script

 

data filename filename_dr;
  modify filename;
  if POLICY in ("x123456" "x123459") then do;
    output filename_dr;
    remove filename;
  end;
run;

The issue I am facing: imagine an agreement number such as '1111-11111'; multiplied by 25 000 agreement numbers, that gives a length of about 250 000 characters, which is over the maximum length of policieslist.

 

How do I solve that issue? Is there a way to overcome it, or maybe another approach?

Any suggestions ?

 

Tom
Super User

Are you trying to generate this statement from a list of values in a dataset?

if POLICY in ("x123456" "x123459") then do;

If so, then generate the statement to a FILE, not to a variable or a macro variable. Then you will not be bothered by the 32K or 64K byte limits on those.

filename code temp;
data _null_;
  file code lrecl=80;
  if _n_=1 then put 'if POLICY in (' ;
  if eof then put / ') then do;';
  set found_agreements end=eof;
  put agreement_nbr :$quote. @ ;
run;
data filename filename_dr;
  modify filename;
%include code / source2;
    output filename_dr;
    remove filename;
  end;
run;
ballardw
Super User

If you have the agreement numbers in a DATASET, ONE observation per value then something like:

Sort both datasets by agreement number prior to this step:

 

data match nomatch;
   merge filename (in=inbase)
              agreementdata (in=inlist)
  ;
  by agreementnumber;
  if inlist and inbase then output match;
  else if inbase then output nomatch;
run;

This matches multiple records in "filename" if there are any matching ones in the agreementdata set.

The IN= option creates temporary variables that indicate whether an observation from that dataset is contributing to the current observation. The values are 1/0 for yes/no. So "if inlist and inbase then ..." means that both datasets have the same agreement number, and the observation is written to the Match dataset. The observations from the "filename" dataset that do not find a match in agreementdata are written to Nomatch.

 

Based on some of your other "call execute" code, I wouldn't recommend that approach.

 

Note: if for some silly reason the names of the variables holding the agreement number values are not the same in both data sets, you can use the RENAME= option on agreementdata to match the name in Filename:

 

data match nomatch;
   merge filename (in=inbase)
              agreementdata (in=inlist rename=(agreementnumber=nameinotherdataset) )
  ;
  by nameinotherdataset;
  if inlist and inbase then output match;
  else if inbase then output nomatch;
run;

Sticking multiple values, especially thousands of them, into a single variable is almost always a poor choice. Spreading them across multiple variables is generally just as bad.

 

Generating a data set of agreement numbers might look like:

 

proc sql;
   create table agreementnumbers as
   select distinct policy as agreementnumber
  from found_agreements
  ;
quit;

 

alepage
Barite | Level 11

Hello, imagine that I have a dataset MasterList which contains the policy number (agreement number is the same thing), the filename, and the path.

This dataset can be ordered by filename and policy number.
This dataset may be order by filenames and policy number.

 

The idea is to say: in dataset A, I have x policies to remove from A and put into dataset A_dr.

Then in dataset B, I have y policies to remove from B and put into dataset B_dr, and so on.

 

So, as MasterList gives me the path and the filename, I can create a libname and read dataset A, for example.

I need to find a way to remove the policies; what follows is just an example.

Please note that we may need to remove 20 000 policies in one shot per dataset:

 

dataset A: remove policies 1, 2, 4, 5, 6 in one shot and put them into dataset A_dr

dataset B: remove policies 7, 12, 15, 18 in one shot and put them into dataset B_dr

dataset C: remove policies 30, 32, 45, 79, 110 in one shot and put them into dataset C_dr

and so on

 

How do we do that? Please provide an example.

alepage
Barite | Level 11

Hello,

 

I did some tests with sashelp.class and your inlist/inbase approach, and it seems to do what I am expecting to get.

Regards,

 

data test1;
rownumber=_N_;
set sashelp.class;

run;
/* Creating the synthetic masterlist using test1 and applying the following criterion (content of test1 minus people whose age eq 12 years old) */

data masterlist;
set test1;
where age ne 12;
run;

data test2_org;
set test1;
where sex eq 'M';
run;

/*data test2 test2_dr;*/
/*   merge test2_org (in=inbase)*/
/*              masterlist (in=inlist)*/
/*  ;*/
/*  by rownumber;*/
/*  if inlist and inbase then output test2_dr;*/
/*  else if inbase then output test2;*/
/*run;*/

/* As my master list contains many policies per filename, the idea is to create a sub-dataset with just the filename and path, then do a CALL EXECUTE with the
   path and filename, and execute the previous SAS code */

data pathNfilename;
input libname1 :$10. libname2 :$10. fname :$10.;
datalines;
source1 dest1 test2
;
run;

data _null_;
   set pathNfilename;
   call execute
   (compbl(cat(
   '%nrstr(libname ',strip(libname1),' "%sysfunc(pathname(work))"; )',
   '%nrstr(libname ',strip(libname2),' "%sysfunc(pathname(work))"; )',
   "data ",strip(fname)," ",strip(fname),"_dr;",
   "merge ",strip(fname),"_org (in=inbase) ","masterlist (in=inlist);",
   "by rownumber;",
   "if inlist and inbase then output ",strip(fname),"_dr;",
   "else if inbase then output ",strip(fname),";run;"
    )));
run;
Quentin
Super User

Any chance you can completely re-architect this? 

 

Managing data in 7000 datasets seems like it must be a nightmare. Do all these tables have the same structure (i.e., the same variables)? Could all this data be put into one SAS dataset? Or, maybe better yet, loaded into one database table? Then, instead of moving rows between tables, you either delete them in place or add a variable/column to do logical deletes.
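A sketch of the logical-delete idea, assuming everything were consolidated into one table (the table, flag, and list names here are assumptions, not from the thread):

```sas
/* Flag the policies to remove instead of physically moving rows between tables */
proc sql;
  update work.all_transactions
    set del_flag = 1
    where policy in (select policy from work.to_remove);
quit;
```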


Discussion stats
  • 14 replies
  • 968 views
  • 1 like
  • 6 in conversation