Re: How to remove observations by group from one dataset and put it in...

alepage · Posted 07-31-2024 04:44 PM

Hello,

Imagine that you bought an auto insurance policy (duration = 12 months), let's say in January 2023 having number x123456. This policy will have transactions (observations) in Jan, Fev, Mar, Apr, May, Jun, Jul, Aug, Sep, oct, nov, dec 2023 and Jan 2024 and maybe in Fev 2024.

Therefore those transactions (observations) related to that policy will be spead into 12 or 13 datasets such as:

&co._auto_prmjan2023, &co._auto_prmfeb2023, ..., &co._auto_prmdec2023, &co._auto_prmjan2024, and maybe &co._auto_prmfeb2024.

Now imagine that I need to remove those transactions (observations) from the original datasets

&co._auto_prmjan2023, &co._auto_prmfeb2023, ..., &co._auto_prmdec2023, &co._auto_prmjan2024, and maybe &co._auto_prmfeb2024 and to put those into new datasets like:

&co._auto_prmjan2023_dr, &co._auto_prmfeb2023_dr, ..., &co._auto_prmdec2023_dr, &co._auto_prmjan2024_dr, and maybe &co._auto_prmfeb2024_dr.

(co = company abreviation)

For only one policy, this task seems simple.

But the problem, I have a 1 200 000 policies spread into approximatively 7128 datasets. I have produced a SAS code able to read the policies list from a csv file,

then to scan at these 7128 datasets and tell me for each policy into that list, in which dataset I can find transactions related to each policy.

So now, I have a 12 000 000 observations datset, giving me the agreement_nbr, the path, the filename where each policy host transactions.

I could do a call execute to remove the observations related to each policy, one by one and to put these into datasets with the same name but with the extension ...._dr ex:

move transaction for policy x123456 from

dataset be._auto_prmjan2023 to dataset be_auto_prmjan2023_dr

dataset be._auto_prmfeb2023 to dataset be_auto_prmfeb2023_dr

...

dataset be._auto_prmfeb2024 to dataset be_auto_prmfeb2024_dr

and that only for the policy x123456

But from my 12 000 000 observations dataset, I know which policies need to be move from dataset be._auto_prmjan2023 to be._auto_prmjan2023_dt, from dataset be._auto_prmfeb2023 to be._auto_prmfeb2023 and so on.

What's the more efficient way to move those transactions, one by one, or to group those per dataset and move those in one shot.

And how to do that? Please provide an example (SAS snippets).

Ex: dataset be._auto_prmjan2023

x123456

x123457

x123458

x123459

need to move x123456 and 123459

After removing those policies transactions

be._auto_prmjan2023

x123457

x123458

dataset be._auto_prmjan2023_dr

x123456

x123459

I need to do this task but for approx. 1 000 000 policies

I hope you understand what I need to do.

So how having my data 12 000 000 dataset, how can I do those transactions shifts by policies group ?

Kurt_Bremser · Posted 07-31-2024 05:14 PM

Create a DATA step view combining all datasets, using the INDSNAME= option to keep track which dataset is currently read.

Use this view in a single DATA step which creates all new wanted datasets; load your policy numbers into a hash object. When a policy is found in the hash, run a SELECT block with OUTPUT statements for all targets; use the variable derived from the INDSNAME to control the SELECT.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

alepage · Posted 07-31-2024 08:53 PM

Hello MR. Bremser,

I not familiar with hash methods. On top of that, putting all the datasets in one is not possible due to dataset size.

any other solution? I still have my dataset which contains, the agreement number, the dataset name , its path. Can we do something with this information

Kurt_Bremser · Posted 08-01-2024 03:48 AM

See this prototype code:

data
  a
  b
  c
  d
  e
  a_dt
  b_dt
  c_dt
  d_dt
  e_dt
;
set
  a
  b
  c
  d
  e
  indsname= in_ds
;
if _n_ = 1
then do;
  declare hash dt (dataset:"delete");
  dt.definekey("policy");
  dt.definedata("policy"); * reduces size of hash object;
  dt.definedone();
end;
if dt.check() = 0
then do;
  select(in_ds);
    when ("A") output a_dt;
    when ("B") output b_dt;
    when ("C") output c_dt;
    when ("D") output d_dt;
    when ("E") output e_dt;
  end;
end;
else do;
  select(in_ds);
    when ("A") output a;
    when ("B") output b;
    when ("C") output c;
    when ("D") output d;
    when ("E") output e;
  end;
end;
run;

You can see the repeating patterns in the code, which allow you to create these with %DO macro loops.

Note that, the finer granularity of the splits, this will become more complex. Your situation is a fine illustration why splitting datasets is considered counter-productive, and you must seriously reconsider your current practice.

And if you use timestamped datasets, use a YMD date order and numeric month designations, which sorts the datasets chronologically on its own and makes the use of wildcards much easier.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

alepage · Posted 08-01-2024 08:50 AM

Hello Mr. Bremser,

thanks for the sas code. Where do I put my policy list? Into dt.definedata?
ex: dt.definedata(x123456 x123457 x123458 and so on). How does work the condition when (« A »). It is not clear for me

Kurt_Bremser · Posted 08-01-2024 12:41 PM

The list of policies to be removed is in the dataset "delete", from which the code builds the hash object.

The strings in the WHEN statements of the SELECT blocks are the dataset names as they are stored in the variable in_ds, which is created by the INDSNAME= option. In a real world case, you must also include the library, as INDSNAME stores the complete 2-level dataset name in the variable. Everything must be uppercase.

Please consult the documentation of the SET, SELECT and WHEN data step statements.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 07-31-2024 10:20 PM

Questions:

Are you saying that you have a file containing the policy numbers to remove and the names of the pre-existing datasets from which those policies are to be removed?
OR
If a policy is to be removed, is it to be removed from ALL datasets in which it may be found?
Are the pre-existing datasets sorted by policy ID? Answering this will help in determining whether to use hash methods for removing policies.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

alepage · Posted 08-01-2024 06:07 AM

Good Morning,

I have a file containing the policy numbers to remove and the names of the pre-existing datasets and the path from which those policies are to be removed?
Yes if a policy is to be removed, is it to be removed from ALL datasets in which it may be found?
If the pre-existing datasets are not sort by policy number, I can do a proc sort on it before manipulating the dataset.

Tom · Posted 08-01-2024 10:42 AM

So to remove policies from a single dataset and at the same time write them to another dataset you could use something like this:

data be._auto_prmjan2023 be._auto_prmjan2023_dr;
  set be._auto_prmjan2023;
  if POLICY in ("x123456" "x123459") 
    then output be._auto_prmjan2023_dr;
  else output be._auto_prmjan2023 ;
run;

So if you have dataset with the list of policies to remove from each dataset then generating such code will be trivial to do with a DATA step.

Let's assume you have build a dataset with two variables: DATASET and POLICY. So sort by DATASET and use BY group processing to generate the code:

filename code temp;
data _null_;
  file code lrecl=75 ;
  set policy_list;
  by dataset;
  if first.dataset then put 'data ' dataset dataset +(-1) '_dr;'
  /  'set ' dataset ';'
  /  'if policy in (' @
  ;
  put policy :$quote. @ ;
  if last.dataset then put ')'
   / '    then output ' dataset +(-1) '_dr;' 
   / '  else output ' dataset ';'
   / 'run;'
  ;
run;
%include code / source2;

If these datasets are large you might want to instead use MODIFY in the generated data steps.

data be._auto_prmjan2023 be._auto_prmjan2023_dr;
  modify be._auto_prmjan2023;
  if POLICY in ("x123456" "x123459") then do;
    output be._auto_prmjan2023_dr;
    remove be._auto_prmjan2023;
  end;
run;

How to remove observations by group from one dataset and put it into another one

Re: How to remove observations by group from one dataset and put it into another one

Re: How to remove observations by group from one dataset and put it into another one

Re: How to remove observations by group from one dataset and put it into another one

Re: How to remove observations by group from one dataset and put it into another one

Re: How to remove observations by group from one dataset and put it into another one

Re: How to remove observations by group from one dataset and put it into another one

Re: How to remove observations by group from one dataset and put it into another one

Re: How to remove observations by group from one dataset and put it into another one

SAS Innovate 2025: Call for Content

Classroom Training Available!