SAS Programming

DATA Step, Macro, Functions and more
This topic is solved.
kbinan
Obsidian | Level 7

I have a basic append program that I run monthly to build a master historical dataset.  It takes the previous month, writes it out as the current month and then appends current month claims to it, pretty basic and straightforward.  This month however the size of the resulting dataset grew disproportionately to what the expected original and new claims combined should have created.

 

/*=============================================================================
	Append SEPTEMBER Adjudicated claims to CLM_DATA_ASO 10/11/2024

IATLAS.CLM_DATA_ADJD_GLX_SEP 	 created, with 11,667,014 rows and 254 columns.
===============================================================================*/

PROC APPEND BASE = ATLAS.CLM_DATA_ASO_SEP 
		    DATA = ATLAS.CLM_DATA_ASO_AUG; RUN;

/***************Appending ATLAS.CLM_DATA_ASO_AUG to ATLAS.CLM_DATA_ASO_SEP.
NOTE: There were 466715507 observations read from the data set ATLAS.CLM_DATA_ASO_AUG.
NOTE: The data set ATLAS.CLM_DATA_ASO_SEP has 466715507 observations and 254 variables.*/


PROC APPEND BASE = ATLAS.CLM_DATA_ASO_SEP   
		    DATA = IATLAS.CLM_DATA_ADJD_GLX_SEP; RUN;

/***************Appending IATLAS.CLM_DATA_ADJD_GLX_SEP to ATLAS.CLM_DATA_ASO_SEP.
NOTE: There were 11667014 observations read from the data set IATLAS.CLM_DATA_ADJD_GLX_SEP.
NOTE: The data set ATLAS.CLM_DATA_ASO_SEP has 478382521 observations and 254 variables.*/

The resulting datasets looked like this:

Dataset                          Observations    File Size
ATLAS.CLM_DATA_ASO_AUG            466,715,507    519 GB
IATLAS.CLM_DATA_ADJD_GLX_SEP       11,667,014     14 GB
ATLAS.CLM_DATA_ASO_SEP            478,382,521    1.1 TB

 

What would cause this to happen, and is there anything I should be doing in my code to prevent this?  It has never happened before and I have been running this same process for over a year. 

1 ACCEPTED SOLUTION

Accepted Solutions
kbinan
Obsidian | Level 7

@Tom of course you are correct!  I talked with our IT lead yesterday and he agreed.  We cannot explain why I've NEVER had to use this before in this append process that I've been running for over a year, but alas, here we are.

Additionally, I also had to add FORCE, which had also never been an issue prior.

 

Once again thank you to this community for always being willing to find the answers!

 

OPTIONS SET=TRUNCATE_BIGINT 'YES' COMPRESS=YES;

PROC APPEND BASE=ATLAS.CLM_DATA_ASO_SEP
            DATA=ATLAS.CLM_DATA_ASO_AUG FORCE;
RUN;

PROC APPEND BASE=ATLAS.CLM_DATA_ASO_SEP
            DATA=IATLAS.CLM_DATA_ADJD_GLX_SEP FORCE;
RUN;

View solution in original post

9 REPLIES 9
WarrenKuhfeld
Ammonite | Level 13

I don't know, but has the length of variables in one of the data sets drastically increased, radically changing the size of the output data set? I would start by running PROC CONTENTS on both data sets and looking for something unexpected.
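A minimal check along those lines (a sketch; dataset names taken from the original post) would be to run PROC CONTENTS on both source datasets and compare the variable lists, storage lengths, and the Compressed attribute in the header section:

```sas
/* Compare attributes of the two source datasets. Look at the
   variable list, lengths, and the Compressed attribute in the
   header portion of each report. */
PROC CONTENTS DATA=ATLAS.CLM_DATA_ASO_AUG;
RUN;

PROC CONTENTS DATA=IATLAS.CLM_DATA_ADJD_GLX_SEP;
RUN;
```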

kbinan
Obsidian | Level 7

Good point Warren - 

Oddly, the observation length on both the previous month and the monthly append files is 2,329, and the resulting final file is  2,336; that makes no sense to me.  Not a huge difference but curious. 

 

Unfortunately I only keep the latest 2 months, so I can't see what might have been going on historically.

Quentin
Super User

Can you post the results of running PROC CONTENTS on the three datasets?  

 

I wonder if one of them might be compressed?  Or some other difference?  

 

I agree, the results you report are surprising.

Tom
Super User Tom
Super User

How big was it BEFORE you appended the contents of the other two datasets?  It does not seem unreasonable that it should be 1.1 TB if just one of the files you ADDED to it was 519 GB.  So it was probably about 500 GB before you started making it larger.

 

And if it did not exist before then check the system option COMPRESS and the compression settings of the three files.

If A and B are both using compression to make the file size smaller and you combine them to make C without compression then size of C will be larger than the sum of the sizes of A and B since it is not compressed.
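One way to compare the settings side by side (a sketch; the librefs are from the original post, and DICTIONARY.TABLES exposes COMPRESS, NOBS, and FILESIZE columns) is a quick SQL query:

```sas
/* List row counts, file size, and compression setting
   for the three datasets in one result table. */
PROC SQL;
  SELECT libname, memname, nobs, filesize, compress
  FROM dictionary.tables
  WHERE (libname = 'ATLAS'
         AND memname IN ('CLM_DATA_ASO_AUG', 'CLM_DATA_ASO_SEP'))
     OR (libname = 'IATLAS'
         AND memname = 'CLM_DATA_ADJD_GLX_SEP');
QUIT;
```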

kbinan
Obsidian | Level 7

Fascinating!  I never have specified Compressed type in this process, but you are correct!  Both my BASE and Monthly datasets are Compressed=CHAR, and the resulting dataset is Compressed=NO.  I have no idea how/why that happened, I literally have been running the exact process for over a year and have never specified any kind of compression.  I am talking with one of our IT folks later today to see what's up.

 

@Patrick I agree, but I'm not that savvy and this is project space I'm working in, not a whole datamart.   It's standard process for us to be good data stewards and delete what we are not using as we go.  

ballardw
Super User

@kbinan wrote:

Fascinating!  I never have specified Compressed type in this process, but you are correct!  Both my BASE and Monthly datasets are Compressed=CHAR, and the resulting dataset is Compressed=NO.  I have no idea how/why that happened, I literally have been running the exact process for over a year and have never specified any kind of compression.  I am talking with one of our IT folks later today to see what's up.

 

@Patrick I agree, but I'm not that savvy and this is project space I'm working in, not a whole datamart.   It's standard process for us to be good data stewards and delete what we are not using as we go.  


If you have never set any Compress options I wonder if you are working with a SAS server that the SAS Admin changed the system option for Compress.
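An easy way to see what the session is actually using, and where the value came from, is to ask PROC OPTIONS directly:

```sas
/* Show the current COMPRESS setting; the VALUE option also
   reports how/where the option was set (e.g. config file,
   autoexec, OPTIONS statement). */
PROC OPTIONS OPTION=COMPRESS VALUE;
RUN;
```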

kbinan
Obsidian | Level 7

@Tom of course you are correct!  I talked with our IT lead yesterday and he agreed.  We cannot explain why I've NEVER had to use this before in this append process that I've been running for over a year, but alas, here we are.

Additionally, I also had to add FORCE, which had also never been an issue prior.

 

Once again thank you to this community for always being willing to find the answers!

 

OPTIONS SET=TRUNCATE_BIGINT 'YES' COMPRESS=YES;

PROC APPEND BASE=ATLAS.CLM_DATA_ASO_SEP
            DATA=ATLAS.CLM_DATA_ASO_AUG FORCE;
RUN;

PROC APPEND BASE=ATLAS.CLM_DATA_ASO_SEP
            DATA=IATLAS.CLM_DATA_ADJD_GLX_SEP FORCE;
RUN;

Tom
Super User Tom
Super User

Something changed the default setting for the COMPRESS option for your jobs.

Either you changed your personal autoexec.sas file.  Or perhaps some system level setting file changed.

 

In general if option settings are critical to your program then make sure to set them in your program.  
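For this job that could be as simple as pinning the option at the top of the program (a sketch; COMPRESS=CHAR matches the setting the source datasets reportedly had, and it only takes effect when the BASE dataset is created by the first append):

```sas
/* Pin the compression option so a changed server or autoexec
   default cannot silently alter how the output is written. */
OPTIONS COMPRESS=CHAR;

PROC APPEND BASE=ATLAS.CLM_DATA_ASO_SEP
            DATA=ATLAS.CLM_DATA_ASO_AUG;
RUN;
```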

 

The only reason that FORCE would be required in your program is if the structure (the set of variables, their types and storage lengths) of the datasets involved is DIFFERENT.  If you did not need the FORCE option before and now you do, then something has changed in the datasets you are appending: an extra variable was added, or the storage length of a variable changed.

 

The first step will only need the FORCE option if you are actually appending to an existing ATLAS.CLM_DATA_ASO_SEP dataset, which is not what you said you wanted to do.  Perhaps you should add code to check whether the target dataset already exists?
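Such an existence check could look like this (a sketch; the macro name and the %ABORT behavior are one choice among several):

```sas
/* Stop the job if the target already exists, since the first
   APPEND is expected to create it fresh each month. */
%MACRO check_target(ds);
  %IF %SYSFUNC(EXIST(&ds)) %THEN %DO;
    %PUT ERROR: &ds already exists -- investigate before appending.;
    %ABORT CANCEL;
  %END;
%MEND check_target;

%check_target(ATLAS.CLM_DATA_ASO_SEP)
```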

 

Assuming the first step created the ATLAS.CLM_DATA_ASO_SEP dataset then the only reason that you would need FORCE is that the structure of  IATLAS.CLM_DATA_ADJD_GLX_SEP is different than the structure of ATLAS.CLM_DATA_ASO_AUG.  

 

So what changed in the creation of  IATLAS.CLM_DATA_ADJD_GLX_SEP ?

 

Patrick
Opal | Level 21

Based on your description the most likely cause is as @Tom already guessed: Your source tables were compressed but your target table is not. Proc Contents will tell you.

 

But... are you really creating a single 500GB + file? That's certainly sub-optimal and I strongly suggest that you store your data in multiple slices by either using the SPDE engine or by creating monthly (or weekly) files and then just create views over these files.

Happy to elaborate further if you want to pursue such an approach. It would certainly make it much easier to add monthly data and would also increase performance a lot for cases where you only need data for certain months.  ....plus your infrastructure people will also be grateful if they don't have to backup and version such a huge single file (especially given you change the name every single month).
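A sketch of the monthly-slice idea (the dataset names here are hypothetical): keep one small dataset per month and define a view that concatenates them, so downstream code still sees a single table.

```sas
/* Monthly slices stay small and easy to back up; the view
   behaves like one big table and is trivial to redefine
   when a new month arrives. */
DATA ATLAS.CLM_DATA_ASO_ALL / VIEW=ATLAS.CLM_DATA_ASO_ALL;
  SET ATLAS.CLM_DATA_ASO_2024_08
      ATLAS.CLM_DATA_ASO_2024_09;
RUN;
```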


Discussion stats
  • 9 replies
  • 1395 views
  • 1 like
  • 6 in conversation