Advice on retrieving from a large dataset

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 19
Accepted Solution

Advice on retrieving from a large dataset

Good Morning,

 

I am working on pulling data for our vendor and it requires me to retrieve claims data  from a very huge dataset. I am pulling claims data from a dataset with 2.5B observations.  I am retrieving all claims records associated with a set of unique claims_case_numbers with approx. 11.5M observations. The resulting dataset is about 52M obs with 47 variables and runtime is 1.5 hrs. I would like to know if my coding approach  below  utilizing Hash object is already efficient for this task or if it can still be improved. Appreciate any suggestions.

Also, any recommendations on how to get the dataset results with  approx 13GB in size to our vendor . My initial thought is exporting in csv with zip. Thank you as always.

 

data vendor_claims;

length clm_case_key 8.;

 

if _N_ =1 then do;

DECLARE HASH H(dataset:'CLMS_CASES_LKUP');

H.DEFINEKEY ('clm_case_key');

 H.DEFINEDONE();

end;

 

SET CLAIMS_MASTER (keep=&ON_FILE_HEADER where=((serv_from_date ge '1Jan2016'd AND serv_from_Date le '31Dec2017'd)

and prod_cd not in ('PROD_COM','PROD_MDR')

and payment_cd = 'C'));

if H.find() = 0 then output;

run;

 


Accepted Solutions
Solution
‎03-27-2018 06:48 AM
PROC Star
Posts: 2,375

Re: Advice on retrieving from a large dataset

> If both your source tables are in Oracle then definitely use SQL as this will push all the processing to Oracle. 

 

Yes this. Keep all the subsetting in the source system and only transfer the result. Any other way is wasteful.

 

As for transferring, ask your vendor what they expect.

 

 

 

 

View solution in original post


All Replies
Regular Contributor
Posts: 150

Re: Advice on retrieving from a large dataset

A SAS dataset is ODBC compliant. If they are willing to install the reader, I would leave it in sas7bdat format. If not, CSV is probably as good as it gets but I would use tabs vs commas as delimiter.

 

Try doing it with SQL vs hash. SQL will optimize your code.

 

You would be best to send this to 1 of 2 other places:

 

- SAS consulting forum on Google (declining size but really good SAS people)

- SAS-L. Venerable, best coders including Paul Dorfman who is the expert on hash object. Roger D will also weigh in who is damn good on volume issues.

 

 

Occasional Contributor
Posts: 19

Re: Advice on retrieving from a large dataset

Spoiler
 

Thank you . I will try to  post to the other forums.

 

Super User
Posts: 5,890

Re: Advice on retrieving from a large dataset

Just for my understanding, where is your two original data sets stored (DBMS, local/remote) and where do you need to store/analyze the result?
Data never sleeps
Occasional Contributor
Posts: 19

Re: Advice on retrieving from a large dataset

The results will be sent to our vendor via FTP site. And I am accessing the 2 database from DBMS(Oracle) and via client setup. We are using SAS enterprise guide. Hope this is helpful.
Respected Advisor
Posts: 4,743

Re: Advice on retrieving from a large dataset

[ Edited ]

@lmtamina

If both your source tables are in Oracle then definitely use SQL as this will push all the processing to Oracle. If you want to further speed up the process then use explicit pass-through SQL (=formulated in the Oracle SQL flavour) with Oracle hints to parallelize processing.

...But: I guess the bottleneck is transferring the data from Oracle to SAS over the network so once you rewrite your code so that the data volume gets already reduced on the Oracle side end-to-end elapsed time should become quite a bit shorter.

 

Once you've got your data in SAS: If your vendor uses SAS as well then I'd first try and zip the SAS dataset. If that makes it small enough for an FTP transfer then good, else I'd split it up into chunks and zip these chunks in separate archives.

Solution
‎03-27-2018 06:48 AM
PROC Star
Posts: 2,375

Re: Advice on retrieving from a large dataset

> If both your source tables are in Oracle then definitely use SQL as this will push all the processing to Oracle. 

 

Yes this. Keep all the subsetting in the source system and only transfer the result. Any other way is wasteful.

 

As for transferring, ask your vendor what they expect.

 

 

 

 

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 128 views
  • 1 like
  • 5 in conversation