I have a basic append program that I run monthly to build a master historical dataset. It takes the previous month's file, writes it out as the current month, and then appends the current month's claims to it; pretty basic and straightforward. This month, however, the size of the resulting data grew disproportionately to what the original and new claims combined should have created.
/*=============================================================================
Append SEPTEMBER Adjudicated claims to CLM_DATA_ASO 10/11/2024
IATLAS.CLM_DATA_ADJD_GLX_SEP created, with 11,667,014 rows and 254 columns.
===============================================================================*/
PROC APPEND BASE = ATLAS.CLM_DATA_ASO_SEP
DATA = ATLAS.CLM_DATA_ASO_AUG; RUN;
/***************Appending ATLAS.CLM_DATA_ASO_AUG to ATLAS.CLM_DATA_ASO_SEP.
NOTE: There were 466715507 observations read from the data set ATLAS.CLM_DATA_ASO_AUG.
NOTE: The data set ATLAS.CLM_DATA_ASO_SEP has 466715507 observations and 254 variables.*/
PROC APPEND BASE = ATLAS.CLM_DATA_ASO_SEP
DATA = IATLAS.CLM_DATA_ADJD_GLX_SEP; RUN;
/***************Appending IATLAS.CLM_DATA_ADJD_GLX_SEP to ATLAS.CLM_DATA_ASO_SEP.
NOTE: There were 11667014 observations read from the data set IATLAS.CLM_DATA_ADJD_GLX_SEP.
NOTE: The data set ATLAS.CLM_DATA_ASO_SEP has 478382521 observations and 254 variables.*/
The resulting datasets looked like this:
Dataset                        Observations   File Size
ATLAS.CLM_DATA_ASO_AUG          466,715,507      519 GB
IATLAS.CLM_DATA_ADJD_GLX_SEP     11,667,014       14 GB
ATLAS.CLM_DATA_ASO_SEP          478,382,521      1.1 TB
What would cause this to happen, and is there anything I should be doing in my code to prevent this? It has never happened before and I have been running this same process for over a year.
Accepted Solutions
@Tom of course you are correct! I talked with our IT lead yesterday and he agreed. We cannot explain why I've NEVER had to use this before in this append process that I've been running for over a year, but alas, here we are.
Additionally, I also had to add FORCE, which had also never been an issue prior.
Once again thank you to this community for always being willing to find the answers!
OPTIONS SET=TRUNCATE_BIGINT 'YES' COMPRESS=YES;
PROC APPEND BASE = ATLAS.CLM_DATA_ASO_SEP
DATA = ATLAS.CLM_DATA_ASO_AUG FORCE; RUN;
PROC APPEND BASE = ATLAS.CLM_DATA_ASO_SEP
DATA = IATLAS.CLM_DATA_ADJD_GLX_SEP FORCE; RUN;
I don't know, but has the length of variables in one of the data sets drastically increased, radically changing the size of the output data set? I would start by running PROC CONTENTS on both data sets and looking for something unexpected.
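For example, something like this would surface the attributes worth comparing (dataset names taken from the code above):

PROC CONTENTS DATA = ATLAS.CLM_DATA_ASO_AUG; RUN;
PROC CONTENTS DATA = IATLAS.CLM_DATA_ADJD_GLX_SEP; RUN;

Pay particular attention to the Observation Length, the Compressed attribute, and the individual variable lengths in the two listings.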
Good point Warren -
Oddly, the observation length of both the previous-month and the monthly-append files is 2,329, while the resulting final file is 2,336; that makes no sense to me. Not a huge difference, but curious.
Unfortunately I only keep the latest two months, so I can't see what might have been going on historically.
Can you post the results of running PROC CONTENTS on the three datasets?
I wonder if one of them might be compressed? Or some other difference?
I agree, the results you report are surprising.
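If it is easier than posting three full listings, the same attributes can also be pulled side by side from DICTIONARY.TABLES; a minimal sketch, assuming a SAS release where the FILESIZE column is available:

PROC SQL;
  SELECT LIBNAME, MEMNAME, NOBS, OBSLEN, COMPRESS, FILESIZE
  FROM DICTIONARY.TABLES
  WHERE (LIBNAME = 'ATLAS'  AND MEMNAME IN ('CLM_DATA_ASO_AUG', 'CLM_DATA_ASO_SEP'))
     OR (LIBNAME = 'IATLAS' AND MEMNAME = 'CLM_DATA_ADJD_GLX_SEP');
QUIT;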
How big was it BEFORE you appended the contents of the other two datasets? It does not seem unreasonable that it should be 1,100 GB if just one of the files you ADDED to it was 519 GB. So it was probably about 500 GB before you started making it larger.
And if it did not exist before, then check the COMPRESS system option and the compression settings of the three files.
If A and B both use compression to make their file sizes smaller, and you combine them to make C without compression, then the size of C will be larger than the sum of the sizes of A and B, since it is not compressed.
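One way to guard against that, sketched here with the dataset names from the original post: state the compression explicitly when the new BASE dataset is created, rather than relying on the session default.

/* COMPRESS= takes effect only when PROC APPEND creates the BASE dataset;
   an existing dataset keeps the compression it was created with. */
PROC APPEND BASE = ATLAS.CLM_DATA_ASO_SEP (COMPRESS = CHAR)
            DATA = ATLAS.CLM_DATA_ASO_AUG;
RUN;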
Fascinating! I have never specified a compression type in this process, but you are correct! Both my BASE and monthly datasets are Compressed=CHAR, and the resulting dataset is Compressed=NO. I have no idea how or why that happened; I have literally been running the exact same process for over a year and have never specified any kind of compression. I am talking with one of our IT folks later today to see what's up.
@Patrick I agree, but I'm not that savvy, and this is project space I'm working in, not a whole datamart. It's standard process for us to be good data stewards and delete what we are not using as we go.
@kbinan wrote:
Fascinating! I have never specified a compression type in this process, but you are correct! Both my BASE and monthly datasets are Compressed=CHAR, and the resulting dataset is Compressed=NO. I have no idea how or why that happened; I have literally been running the exact same process for over a year and have never specified any kind of compression. I am talking with one of our IT folks later today to see what's up.
@Patrick I agree, but I'm not that savvy, and this is project space I'm working in, not a whole datamart. It's standard process for us to be good data stewards and delete what we are not using as we go.
If you have never set any COMPRESS options, I wonder if you are working with a SAS server where the SAS admin changed the COMPRESS system option.
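An easy way to check is PROC OPTIONS, which with the VALUE keyword reports both the current setting and how it was set (shipped default, config file, autoexec, etc.):

PROC OPTIONS OPTION = COMPRESS VALUE; RUN;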
Something changed the default setting for the COMPRESS option for your jobs.
Either you changed your personal autoexec.sas file, or perhaps some system-level settings file changed.
In general, if option settings are critical to your program, then make sure to set them in your program.
The only reason that FORCE would be required in your program is if the structures (the set of variables, their types, and their storage lengths) of the datasets involved are DIFFERENT. If you did not need the FORCE option before and now you do, then something has changed in the datasets you are appending. They have added an extra variable or changed the storage length of a variable.
The first step will only need the FORCE option if you are actually appending to an existing ATLAS.CLM_DATA_ASO_SEP dataset, which is not what you said you wanted to do. Perhaps you should add code to check whether the target dataset already exists?
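One possible guard, a sketch only (the macro name is illustrative):

%MACRO BUILD_SEP_BASE;
  %IF %SYSFUNC(EXIST(ATLAS.CLM_DATA_ASO_SEP)) %THEN %DO;
    /* Stop rather than silently append to a leftover dataset */
    %PUT ERROR: ATLAS.CLM_DATA_ASO_SEP already exists - delete or rename it first.;
  %END;
  %ELSE %DO;
    PROC APPEND BASE = ATLAS.CLM_DATA_ASO_SEP
                DATA = ATLAS.CLM_DATA_ASO_AUG;
    RUN;
  %END;
%MEND BUILD_SEP_BASE;
%BUILD_SEP_BASE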
Assuming the first step created the ATLAS.CLM_DATA_ASO_SEP dataset, the only reason you would need FORCE is that the structure of IATLAS.CLM_DATA_ADJD_GLX_SEP is different from the structure of ATLAS.CLM_DATA_ASO_AUG.
So what changed in the creation of IATLAS.CLM_DATA_ADJD_GLX_SEP ?
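One way to answer that is to dump the variable attributes of both sources and diff them; a sketch (the WORK dataset names are illustrative):

PROC CONTENTS DATA = ATLAS.CLM_DATA_ASO_AUG
              OUT = WORK.VARS_AUG (KEEP = NAME TYPE LENGTH) NOPRINT;
RUN;
PROC CONTENTS DATA = IATLAS.CLM_DATA_ADJD_GLX_SEP
              OUT = WORK.VARS_SEP (KEEP = NAME TYPE LENGTH) NOPRINT;
RUN;
PROC SORT DATA = WORK.VARS_AUG; BY NAME; RUN;
PROC SORT DATA = WORK.VARS_SEP; BY NAME; RUN;

/* Any variable flagged here differs in presence, type, or storage length */
PROC COMPARE BASE = WORK.VARS_AUG COMPARE = WORK.VARS_SEP LISTALL;
  ID NAME;
RUN;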
Based on your description the most likely cause is as @Tom already guessed: Your source tables were compressed but your target table is not. Proc Contents will tell you.
But... are you really creating a single 500+ GB file? That's certainly sub-optimal, and I strongly suggest that you store your data in multiple slices, either by using the SPDE engine or by creating monthly (or weekly) files and then just creating views over these files.
Happy to elaborate further if you want to pursue such an approach. It would certainly make it much easier to add monthly data, and it would also increase performance a lot for cases where you only need data for certain months. Plus, your infrastructure people will be grateful if they don't have to back up and version such a huge single file (especially given that you change the name every single month).
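To make the monthly-slices idea concrete, here is a minimal sketch of the view approach (the view name is illustrative; the member datasets are the ones from this thread):

/* Keep one permanent dataset per month and stack them behind a view */
DATA ATLAS.CLM_DATA_ASO / VIEW = ATLAS.CLM_DATA_ASO;
  SET ATLAS.CLM_DATA_ASO_AUG
      ATLAS.CLM_DATA_ASO_SEP;
RUN;

Each month you would then only load the new monthly dataset and re-create the view, and downstream code reads ATLAS.CLM_DATA_ASO as if it were one table.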