Solved: Re: Advice on retrieving from a large dataset

lmtamina · Posted 03-26-2018 10:30 AM

Good Morning,

I am working on pulling data for our vendor and it requires me to retrieve claims data from a very huge dataset. I am pulling claims data from a dataset with 2.5B observations. I am retrieving all claims records associated with a set of unique claims_case_numbers with approx. 11.5M observations. The resulting dataset is about 52M obs with 47 variables and runtime is 1.5 hrs. I would like to know if my coding approach below utilizing Hash object is already efficient for this task or if it can still be improved. Appreciate any suggestions.

Also, any recommendations on how to get the dataset results with approx 13GB in size to our vendor . My initial thought is exporting in csv with zip. Thank you as always.

data vendor_claims;

length clm_case_key 8.;

if _N_ =1 then do;

DECLARE HASH H(dataset:'CLMS_CASES_LKUP');

H.DEFINEKEY ('clm_case_key');

H.DEFINEDONE();

end;

SET CLAIMS_MASTER (keep=&ON_FILE_HEADER where=((serv_from_date ge '1Jan2016'd AND serv_from_Date le '31Dec2017'd)

and prod_cd not in ('PROD_COM','PROD_MDR')

and payment_cd = 'C'));

if H.find() = 0 then output;

run;

ChrisNZ · Posted 03-26-2018 10:26 PM

> If both your source tables are in Oracle then definitely use SQL as this will push all the processing to Oracle.

Yes this. Keep all the subsetting in the source system and only transfer the result. Any other way is wasteful.

As for transferring, ask your vendor what they expect.

High-Performance SAS Coding - Third Edition

View solution in original post

AlanC · Posted 03-26-2018 12:08 PM

A SAS dataset is ODBC compliant. If they are willing to install the reader, I would leave it in sas7bdat format. If not, CSV is probably as good as it gets but I would use tabs vs commas as delimiter.

Try doing it with SQL vs hash. SQL will optimize your code.

You would be best to send this to 1 of 2 other places:

- SAS consulting forum on Google (declining size but really good SAS people)

- SAS-L. Venerable, best coders including Paul Dorfman who is the expert on hash object. Roger D will also weigh in who is damn good on volume issues.

https://github.com/savian-net

lmtamina · Posted 03-26-2018 01:42 PM

Spoiler

Thank you . I will try to post to the other forums.

LinusH · Posted 03-26-2018 12:45 PM

Just for my understanding, where is your two original data sets stored (DBMS, local/remote) and where do you need to store/analyze the result?

Data never sleeps

lmtamina · Posted 03-26-2018 01:42 PM

The results will be sent to our vendor via FTP site. And I am accessing the 2 database from DBMS(Oracle) and via client setup. We are using SAS enterprise guide. Hope this is helpful.

Patrick · Posted 03-26-2018 04:42 PM

@lmtamina

If both your source tables are in Oracle then definitely use SQL as this will push all the processing to Oracle. If you want to further speed up the process then use explicit pass-through SQL (=formulated in the Oracle SQL flavour) with Oracle hints to parallelize processing.

...But: I guess the bottleneck is transferring the data from Oracle to SAS over the network so once you rewrite your code so that the data volume gets already reduced on the Oracle side end-to-end elapsed time should become quite a bit shorter.

Once you've got your data in SAS: If your vendor uses SAS as well then I'd first try and zip the SAS dataset. If that makes it small enough for an FTP transfer then good, else I'd split it up into chunks and zip these chunks in separate archives.

ChrisNZ · Posted 03-26-2018 10:26 PM

> If both your source tables are in Oracle then definitely use SQL as this will push all the processing to Oracle.

Yes this. Keep all the subsetting in the source system and only transfer the result. Any other way is wasteful.

As for transferring, ask your vendor what they expect.

High-Performance SAS Coding - Third Edition

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away