Good Morning,
I am working on pulling data for our vendor and it requires me to retrieve claims data from a very huge dataset. I am pulling claims data from a dataset with 2.5B observations. I am retrieving all claims records associated with a set of unique claims_case_numbers with approx. 11.5M observations. The resulting dataset is about 52M obs with 47 variables and runtime is 1.5 hrs. I would like to know if my coding approach below utilizing Hash object is already efficient for this task or if it can still be improved. Appreciate any suggestions.
Also, any recommendations on how to get the dataset results with approx 13GB in size to our vendor . My initial thought is exporting in csv with zip. Thank you as always.
data vendor_claims;
length clm_case_key 8.;
if _N_ =1 then do;
DECLARE HASH H(dataset:'CLMS_CASES_LKUP');
H.DEFINEKEY ('clm_case_key');
H.DEFINEDONE();
end;
SET CLAIMS_MASTER (keep=&ON_FILE_HEADER where=((serv_from_date ge '1Jan2016'd AND serv_from_Date le '31Dec2017'd)
and prod_cd not in ('PROD_COM','PROD_MDR')
and payment_cd = 'C'));
if H.find() = 0 then output;
run;
> If both your source tables are in Oracle then definitely use SQL as this will push all the processing to Oracle.
Yes this. Keep all the subsetting in the source system and only transfer the result. Any other way is wasteful.
As for transferring, ask your vendor what they expect.
A SAS dataset is ODBC compliant. If they are willing to install the reader, I would leave it in sas7bdat format. If not, CSV is probably as good as it gets but I would use tabs vs commas as delimiter.
Try doing it with SQL vs hash. SQL will optimize your code.
You would be best to send this to 1 of 2 other places:
- SAS consulting forum on Google (declining size but really good SAS people)
- SAS-L. Venerable, best coders including Paul Dorfman who is the expert on hash object. Roger D will also weigh in who is damn good on volume issues.
If both your source tables are in Oracle then definitely use SQL as this will push all the processing to Oracle. If you want to further speed up the process then use explicit pass-through SQL (=formulated in the Oracle SQL flavour) with Oracle hints to parallelize processing.
...But: I guess the bottleneck is transferring the data from Oracle to SAS over the network so once you rewrite your code so that the data volume gets already reduced on the Oracle side end-to-end elapsed time should become quite a bit shorter.
Once you've got your data in SAS: If your vendor uses SAS as well then I'd first try and zip the SAS dataset. If that makes it small enough for an FTP transfer then good, else I'd split it up into chunks and zip these chunks in separate archives.
> If both your source tables are in Oracle then definitely use SQL as this will push all the processing to Oracle.
Yes this. Keep all the subsetting in the source system and only transfer the result. Any other way is wasteful.
As for transferring, ask your vendor what they expect.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.