Hello,
Imagine that you bought an auto insurance policy (duration = 12 months) in January 2023, let's say with number x123456. This policy will have transactions (observations) in Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec 2023, in Jan 2024, and maybe in Feb 2024.
Therefore the transactions (observations) related to that policy will be spread across 13 or 14 datasets such as:
&co._auto_prmjan2023, &co._auto_prmfeb2023, ..., &co._auto_prmdec2023, &co._auto_prmjan2024, and maybe &co._auto_prmfeb2024.
Now imagine that I need to remove those transactions (observations) from the original datasets
&co._auto_prmjan2023, &co._auto_prmfeb2023, ..., &co._auto_prmdec2023, &co._auto_prmjan2024, and maybe &co._auto_prmfeb2024, and put them into new datasets like:
&co._auto_prmjan2023_dr, &co._auto_prmfeb2023_dr, ..., &co._auto_prmdec2023_dr, &co._auto_prmjan2024_dr, and maybe &co._auto_prmfeb2024_dr.
(co = company abbreviation)
For only one policy, this task seems simple.
But here is the problem: I have 1 200 000 policies spread across approximately 7128 datasets. I have written SAS code that reads the policy list from a CSV file,
then scans these 7128 datasets and tells me, for each policy in that list, in which datasets I can find transactions related to it.
So now I have a 12 000 000-observation dataset giving me the agreement_nbr, the path and the filename where each policy's transactions are stored.
I could do a CALL EXECUTE to remove the observations related to each policy, one by one, and put them into datasets with the same name but with the suffix _dr, e.g.:
move the transactions for policy x123456 from
dataset be._auto_prmjan2023 to dataset be._auto_prmjan2023_dr
dataset be._auto_prmfeb2023 to dataset be._auto_prmfeb2023_dr
...
dataset be._auto_prmfeb2024 to dataset be._auto_prmfeb2024_dr
and that only for policy x123456.
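Just to illustrate the one-by-one idea, each generated step would look something like this (only an illustration; the dataset and variable names are examples, and note that such a step recreates the _dr dataset for every policy):

data be._auto_prmjan2023 be._auto_prmjan2023_dr;
  set be._auto_prmjan2023;
  if agreement_nbr = "x123456" then output be._auto_prmjan2023_dr; /* move this policy */
  else output be._auto_prmjan2023;                                 /* keep the rest    */
run;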
But from my 12 000 000-observation dataset, I know which policies need to be moved from dataset be._auto_prmjan2023 to be._auto_prmjan2023_dr, from dataset be._auto_prmfeb2023 to be._auto_prmfeb2023_dr, and so on.
What's the more efficient way to move those transactions: one by one, or grouped per dataset and moved in one shot?
And how to do that? Please provide an example (SAS snippets).
Ex: dataset be._auto_prmjan2023
x123456
x123457
x123458
x123459
I need to move x123456 and x123459.
After removing those policies' transactions:
be._auto_prmjan2023
x123457
x123458
dataset be._auto_prmjan2023_dr
x123456
x123459
I need to do this task but for approx. 1 000 000 policies
I hope you understand what I need to do.
So, having my 12 000 000-observation dataset, how can I do those transaction moves by group of policies per dataset?
Create a DATA step view combining all datasets, using the INDSNAME= option to keep track of which dataset is currently being read.
Use this view in a single DATA step which creates all new wanted datasets; load your policy numbers into a hash object. When a policy is found in the hash, run a SELECT block with OUTPUT statements for all targets; use the variable derived from the INDSNAME to control the SELECT.
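A minimal sketch of such a view could look like this (library and member names are placeholders, not your real names):

data all_prm / view=all_prm;
  set
    be._auto_prmjan2023
    be._auto_prmfeb2023
    /* ... all further monthly datasets ... */
    be._auto_prmfeb2024
    indsname=in_ds
  ;
  /* the INDSNAME= variable is not written to the output, so copy it */
  source_ds = in_ds;
run;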
Hello Mr. Bremser,
I am not familiar with hash methods. On top of that, putting all the datasets into one is not possible due to their size.
Any other solution? I still have my dataset which contains the agreement number, the dataset name and its path. Can we do something with this information?
See this prototype code:
data
  a
  b
  c
  d
  e
  a_dt
  b_dt
  c_dt
  d_dt
  e_dt
;
set
  a
  b
  c
  d
  e
  indsname=in_ds
;
if _n_ = 1
then do;
  declare hash dt (dataset:"delete");
  dt.definekey("policy");
  dt.definedata("policy"); * reduces size of hash object;
  dt.definedone();
end;
if dt.check() = 0
then do;
  select (in_ds);
    when ("A") output a_dt;
    when ("B") output b_dt;
    when ("C") output c_dt;
    when ("D") output d_dt;
    when ("E") output e_dt;
  end;
end;
else do;
  select (in_ds);
    when ("A") output a;
    when ("B") output b;
    when ("C") output c;
    when ("D") output d;
    when ("E") output e;
  end;
end;
run;
You can see the repeating patterns in the code, which allow you to create these with %DO macro loops.
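A rough sketch of such a macro, reusing the names from the prototype and assuming the datasets live in WORK (the macro name and its parameter are invented):

%macro split(ds_list); /* ds_list = blank-separated list of dataset names */
%local i ds;
data
  %do i = 1 %to %sysfunc(countw(&ds_list));
    %let ds = %scan(&ds_list, &i);
    &ds &ds._dt
  %end;
;
set
  %do i = 1 %to %sysfunc(countw(&ds_list));
    %scan(&ds_list, &i)
  %end;
  indsname=in_ds
;
if _n_ = 1
then do;
  declare hash dt (dataset:"delete");
  dt.definekey("policy");
  dt.definedata("policy");
  dt.definedone();
end;
if dt.check() = 0
then do;
  select (in_ds); /* INDSNAME stores the two-level name in uppercase */
    %do i = 1 %to %sysfunc(countw(&ds_list));
      %let ds = %scan(&ds_list, &i);
      when ("WORK.%upcase(&ds)") output &ds._dt;
    %end;
  end;
end;
else do;
  select (in_ds);
    %do i = 1 %to %sysfunc(countw(&ds_list));
      %let ds = %scan(&ds_list, &i);
      when ("WORK.%upcase(&ds)") output &ds;
    %end;
  end;
end;
run;
%mend split;

%split(a b c d e)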
Note that the finer the granularity of the splits, the more complex this becomes. Your situation is a fine illustration of why splitting datasets is considered counter-productive, and you must seriously reconsider your current practice.
And if you use timestamped datasets, use a YMD date order and numeric month designations, which sorts the datasets chronologically on its own and makes the use of wildcards much easier.
Hello Mr. Bremser,
Thanks for the SAS code. Where do I put my policy list? Into dt.definedata?
E.g. dt.definedata(x123456 x123457 x123458 and so on)? How does the condition when ("A") work? It is not clear to me.
The list of policies to be removed is in the dataset "delete", from which the code builds the hash object.
The strings in the WHEN statements of the SELECT blocks are the dataset names as they are stored in the variable in_ds, which is created by the INDSNAME= option. In a real world case, you must also include the library, as INDSNAME stores the complete 2-level dataset name in the variable. Everything must be uppercase.
Please consult the documentation of the SET, SELECT and WHEN data step statements.
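For example, assuming a permanent library BE as in your examples, the SELECT block would look like this (fragment only; the _dr targets must of course also be listed in the DATA statement):

select (in_ds); /* in_ds holds the full two-level name, in uppercase */
  when ("BE._AUTO_PRMJAN2023") output be._auto_prmjan2023_dr;
  when ("BE._AUTO_PRMFEB2023") output be._auto_prmfeb2023_dr;
  /* ... one WHEN per monthly dataset ... */
end;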
Good Morning,
So to remove policies from a single dataset and at the same time write them to another dataset you could use something like this:
data be._auto_prmjan2023 be._auto_prmjan2023_dr;
  set be._auto_prmjan2023;
  if POLICY in ("x123456" "x123459")
    then output be._auto_prmjan2023_dr;
    else output be._auto_prmjan2023;
run;
So if you have a dataset with the list of policies to remove from each dataset, then generating such code is trivial to do with a DATA step.
Let's assume you have built a dataset with two variables: DATASET and POLICY. Sort it by DATASET and use BY-group processing to generate the code:
filename code temp;
data _null_;
  file code lrecl=75;
  set policy_list;
  by dataset;
  if first.dataset then put 'data ' dataset dataset +(-1) '_dr;'
    / 'set ' dataset ';'
    / 'if policy in (' @
  ;
  put policy :$quote. @ ;
  if last.dataset then put ')'
    / ' then output ' dataset +(-1) '_dr;'
    / ' else output ' dataset ';'
    / 'run;'
  ;
run;
%include code / source2;
If these datasets are large you might want to instead use MODIFY in the generated data steps.
data be._auto_prmjan2023 be._auto_prmjan2023_dr;
  modify be._auto_prmjan2023;
  if POLICY in ("x123456" "x123459") then do;
    output be._auto_prmjan2023_dr;
    remove be._auto_prmjan2023;
  end;
run;
Hello, I have been testing the following script to gather all the agreement_nbr values per filename into one variable, as below:
data found_agreements;
  set found_agreements;
run;

data policiesgroupbyfilename (keep=cie lob year month libname filename policieslist);
  length policieslist $32676;
  do until (last.filename);
    set found_agreements;
    by cie lob year mm filename agreement_nbr;
    policieslist = catx(", ", policieslist, "'" || strip(agreement_nbr) || "'");
  end;
run;
Then the idea was to use a CALL EXECUTE to run your script:
data filename filename_dr;
  modify filename;
  if POLICY in ("x123456" "x123459") then do;
    output filename_dr;
    remove filename;
  end;
run;
The issue I am facing: imagine an agreement number such as '1111-11111'; multiplied by 25 000 agreement numbers, this gives a length of 250 000, which is over the maximum length of policieslist.
How can I solve that issue? Is there a way to overcome it, or maybe another approach?
Any suggestions?
Are you trying to generate this statement from a list of values in a dataset?
if POLICY in ("x123456" "x123459") then do;
If so then generate the statement to a FILE, not a variable nor a macro variable. Then you will not be bothered by the 32K or 64K byte limits on those.
filename code temp;
data _null_;
  file code lrecl=80;
  if _n_=1 then put 'if POLICY in (' ;
  if eof then put / ') then do;' ;
  set found_agreements end=eof;
  put agreement_nbr :$quote. @ ;
run;

data filename filename_dr;
  modify filename;
  %include code / source2;
  output filename_dr;
  remove filename;
  end;
run;
If you have the agreement numbers in a DATASET, ONE observation per value, then use something like this.
Sort both datasets by agreement number prior to this step:
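For example, using the names from the merge step below:

proc sort data=filename;      /* the monthly transaction dataset */
  by agreementnumber;
run;
proc sort data=agreementdata; /* the list of agreement numbers to remove */
  by agreementnumber;
run;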
data match nomatch;
  merge
    filename (in=inbase)
    agreementdata (in=inlist)
  ;
  by agreementnumber;
  if inlist and inbase then output match;
  else if inbase then output nomatch;
run;
This matches multiple records in the "filename" dataset if there are any matching records in the agreementdata set.
The IN= option creates temporary variables that indicate whether an observation from that set is contributing to the current observation. The values are 1/0 for Yes/No. So "if inlist and inbase then ..." means that both data sets have the same agreement number, and the observation is written to the Match data set. The ones from the "filename" data set that do not find a match in agreementdata are written to Nomatch.
Based on some of your other "call execute" code, I wouldn't recommend that approach.
Note: if for some silly reason the names of the variables holding the agreement number values are not the same in both data sets you can use the rename option on the agreementdata to match that of the Filename:
data match nomatch;
  merge
    filename (in=inbase)
    agreementdata (in=inlist rename=(agreementnumber=nameinotherdataset))
  ;
  by nameinotherdataset;
  if inlist and inbase then output match;
  else if inbase then output nomatch;
run;
Sticking multiple values, especially if thinking of 1000's of them, into a single variable is almost always a poor choice. Just as bad is placing them into multiple variables generally.
Generating a data set with the agreement numbers could look like this:
proc sql;
  create table agreementnumbers as
  select distinct policy as agreementnumber
  from found_agreements
  ;
quit;
Hello, imagine that I have a dataset MasterList which contains the policy number (agreement number is the same thing), the filename and the path.
This dataset can be ordered by filename and policy number.
The idea is to say: in dataset A, I have x policies to remove from A and to put into dataset A_dr.
Then in dataset B, I have y policies to remove from B and to put into dataset B_dr, and so on.
Since MasterList gives me the path and the filename, I can create a libname and read dataset A, for example.
I need to find a way to remove them; what follows is just an example.
Please note that we may need to remove 20 000 policies in one shot per dataset.
dataset A: remove policies 1, 2, 4, 5, 6 in one shot and put them into dataset A_dr
dataset B: remove policies 7, 12, 15, 18 in one shot and put them into dataset B_dr
dataset C: remove policies 30, 32, 45, 79, 110 in one shot and put them into dataset C_dr
and so on
How do we do that? Please provide an example.
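Just to illustrate what I mean (the path and all names here are made up):

libname src "/some/path";  /* path taken from MasterList */
data src.A src.A_dr;
  set src.A;
  if policy in ("1" "2" "4" "5" "6") then output src.A_dr; /* the policies listed in MasterList for dataset A */
  else output src.A;
run;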
Hello,
I did some tests with sashelp.class and your inlist and inbase approach, and it seems to do what I am expecting.
Regards,
data test1;
  rownumber = _N_;
  set sashelp.class;
run;
/* Creating the synthetic masterlist using test1 and applying the following criteria (content of test1 minus people whose age is 12 years old) */
data masterlist;
  set test1;
  where age ne 12;
run;

data test2_org;
  set test1;
  where sex eq 'M';
run;
/*data test2 test2_dr;*/
/* merge test2_org (in=inbase)*/
/* masterlist (in=inlist)*/
/* ;*/
/* by rownumber;*/
/* if inlist and inbase then output test2_dr;*/
/* else if inbase then output test2;*/
/*run;*/
/* As my master list contains many policies per filename, the idea is to create a sub-dataset with just the filename and path, then make a call execute with the path and filename and execute the last SAS code */
data pathNfilename;
  input libname1 :$10. libname2 :$10. fname :$10.;
  datalines;
source1 dest1 test2
;
run;
data _null_;
  set pathNfilename;
  call execute(compbl(cat(
    '%nrstr(libname ', strip(libname1), ' "%sysfunc(pathname(work))"; )',
    '%nrstr(libname ', strip(libname2), ' "%sysfunc(pathname(work))"; )',
    "data ", strip(fname), " ", strip(fname), "_dr;",
    "merge ", strip(fname), "_org (in=inbase) ", "masterlist (in=inlist);",
    "by rownumber;",
    "if inlist and inbase then output ", strip(fname), "_dr;",
    "else if inbase then output ", strip(fname), ";run;"
  )));
run;
Any chance you can completely re-architect this?
Managing data in 7000 datasets seems like it must be a nightmare. Do all these tables have the same structure (i.e. same variables)? Could all this data be put into one SAS dataset? Or, maybe better yet, could it all be loaded into one database table? Then, instead of moving rows between tables, you could either delete rows in place or add a variable/column to do logical deletes.
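For example, a logical delete with a flag column might look like this (the table and column names here are made up):

proc sql;
  /* mark the rows instead of physically moving them */
  update transactions
    set deleted_flag = 1
    where policy in (select policy from delete_list);
quit;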