Problem: I have two datasets: The first is detail records from a very large dataset (1.2 TB) and the second is row IDs from an only slightly smaller "header" dataset (110 GB). The relation between line and header is many-to-one. I am trying to select the obs in the line that have a match in the header. The header dataset only contains the key variable. What I've done so far: The smaller "header" dataset is too small to fit in a hash dataset even if I increased the memsize to 115 GB – almost all of the available memory on the box! I sorted and indexed the smaller header dataset by the key variable. I selected 1/20th of the large dataset using the firstobs and obs dataset option I use proc because I was advised that it is multi-threaded. Read post Efficient Way of Merging Very Large Datasets. Result: I started the script 8 days ago and my best guess from the looking at the size of the output lck file in Windows File Explorer is that it is only one tenth through. The help I need: What would I need to do to access this dataset in a reasonable amount of time -- a couple of days? Should I try to break the line input dataset into chunks, sort and interleave by clm_id and then try a data step merge? If I were to request a more memory and processors for this virtual machine, how much would I need? SAS Versions: The large dataset was created under SAS ver 9.0401M7 but the small dataset was created under 9.0401M5. They are being accessed under 9.0401M5. Large Line Dataset: taf_other_services_line (16) Size on disk: 1.22 TB Obs: 5,398,943,292 Vars: 59 Observation Length: 525 Page Size: 65,536 / Pages: 19,749,411 Indexes: 0 / Sorted: NO / Point to Observations: YES Smaller Header Dataset: Dataset size on disk: 110 GB Index size on disk: 126 GB Obs: 1,849,842,886 Vars: 1 Observation Length: 64 Page Size: 65,536 / Pages: 1,811,797 Indexes: 1 / Sorted: YES Query: proc sql stimer ;
create table saslibrary.outputdataset as
select t.bene_id, t.clm_id, <26 other variables>
from
saslibrary.lineinputdataset (firstobs=4859048953 obs=5128996116) as t
inner join saslibrary.headerinputdataset as c on (t.clm_id = c.clm_id)
;
quit; OS: MS Windows Server 2016 Standard V 10.0.14393 Build 14393 Hardware according to Windows Task Manager: Memory Installed: 128 GB Virtual Memory: 46 GB Page File Space: 18.0 GB Maximum Speed: 2.90 GHz Sockets: 6 Virtual processors: 12 L1 cache: n/a Processor: Intel Xeon Gold 6542Y For those of you familiar with Medicaid data this is the TAF data from CMS/MACBIS. Thank you for reading.
... View more