My sincere apologies for not replying in a more timely manner. The size of the data, and my desire to craft intelligent responses to all the great suggestions, delayed my response until now.

@mkeintz wrote:
I would not use firstobs/obs to divide the join into subgroup joins, because a given CLM_ID may be in more than one of those subgroup joins. Instead, examine each CLM_ID once by choosing a restricted range of CLM_ID in both datasets for each subgroup join. This can work because CLM_ID is the join variable. Let's say you divide your CLM_ID values into 5 ranges, each range with a lower limit (LLLLLLL) and upper limit (UUUUUUU), where LLLLLLL and UUUUUUU are quintile values. Of course, the lowest range doesn't need a specified LLLLLLL, and the highest range doesn't need a specified UUUUUUU. Then you could run five programs, such as the one below - just put in values in place of LLLLLLL and UUUUUUU.

@RichardAD wrote:
Let T = duration for a flat read of detail table D. Let K = number of header keys that _can_ fit in a hash table. Do 1,849,842,886 / K data step reads through D with hash lookup selection. Append the selections of each run-through.

Many, many thanks to @mkeintz and @RichardAD. Their technique was probably the least glamorous, but in the end it carried the day!

Steps I took:
1. Through trial and error, I found how many key values could comfortably fit in memory. That became my chunk size.
2. Dividing the dataset size by the chunk size gave the number of chunks: 7.
3. I created a dataset of the sorted, unique key values (only).
4. From this dataset, I created 7 ranges of low and high key values.
5. I read through the entire big dataset, sorting all obs into one of 7 line chunks. I saved space on this step by keeping only the variables of interest (29 of 59), even though all obs were selected. Each year ran approximately 4 hours and created 7 chunks.
6. I loaded the first header/key chunk into an in-memory hash and used it to select from the first line chunk.
7. This was repeated for each of the 7 chunks, and the 7 chunks were repeated for each of the 6 years, 2016 through 2021. On average each chunk took 75 minutes: 75 minutes * 7 chunks * 6 years ~ 53 hours.
8. The result: 64% of the obs were selected. Problem solved.
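For anyone who wants to try the same thing, step 6 for a single chunk looked roughly like the sketch below. This is only a minimal sketch, not my production code; the library, dataset, and variable names (work.hdr_keys_c1, work.line_chunk_c1, clm_id) are simplified stand-ins for the real TAF names.

/* Load one chunk of header keys into a hash and keep only the
   line observations whose CLM_ID appears in that chunk. */
data work.line_selected_c1;
   if _n_ = 1 then do;
      declare hash hk(dataset:'work.hdr_keys_c1', hashexp:20);
      hk.defineKey('clm_id');
      hk.defineDone();
   end;
   set work.line_chunk_c1;      /* the pre-split line chunk           */
   if hk.find() = 0;            /* subsetting IF: keep matches only   */
run;

The same step was then pointed at each of the remaining key/line chunk pairs, for each of the 6 years.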
Now that this unwieldy dataset has been cut down to size, I have come up with a number of enhancements inspired by many of these answers:
- Convert each unique 64-character claim header key into a sequence number from 1 to ~2.1B.
- Convert that sequence number into a base36 string (license-plate style). This lets a 64-character key be stored in 6 characters, since 36**6 is about 2.2 billion. A sketch of this conversion is at the end of this reply.
- Repeat the same process for the beneficiary (or patient) ID.

Now that the keys are cut down to size and the obs are cut down to size, the data can be indexed. Also, other storage formats such as SPDE and access methods such as FedSQL can be researched and perhaps employed. Thanks again @mkeintz and @RichardAD!

Responses to most who responded:

Follow-up question to @RichardAD: can you direct PROC DS2 to make use of THREADs if you are just looking up a huge number of sequential observations in a hash? I did read this page, but the answer wasn't clear to me. Also see @SAS_Jedi's comment.

@SASKiwi wrote:
Are either of the datasets compressed? Compressing the 1.2TB dataset would likely speed up joining as it will improve IO.

@Patrick wrote:
I also would use the SPDE engine for storing such a huge SAS table.
data spde_saslibrary.want(compress=yes);

Both the smaller (header) and large (line) datasets are compressed with the binary option. Also, the "point to observations" option (POINTOBS=YES) is set. I feel that this is a big part of what is slowing down processing. It would seem that trying to point into a compressed dataset leads to a lot of decompression, plus computation about where to seek the obs pointer on disk. I think this may actually _increase_ IO and will definitely increase CPU; I know it definitely increases run time. Is the compression of big datasets worth the overhead? Once I extracted the data into an uncompressed format, everything ran _much_ faster.

@FreelanceReinh wrote:
It should be possible to use a much smaller key item for the hash object, e.g. md5(clm_id), which takes only 16 bytes, instead of the 64-byte clm_id itself.

This is an innovative approach. However, I would be calling the MD5 function billions to trillions of times, and I'm not sure what overhead that might add. As @Patrick pointed out, there is also a risk -- even if slight -- of collision. I am skittish about this approach.

@Stu_SAS: Because FedSQL does not allow the SAS data set options obs= and firstobs= to limit observations, I was not able to use this solution. The data is just too d--n big. @SAS_Jedi also questioned whether the multi-threading would be useful here.

@KachiM: Thank you for your stand-alone code example. However, the proc sort data=tempbig; is just not possible in my environment.

@whymath: I downloaded Paul Dorfman's paper. It is very complex, and I do not profess to have understood it thoroughly. However, the good Dr. Dorfman does say on page 3 of the referenced paper that bitmapping is suitable for "no-matter-how-many-short-keys." On the bottom of page 2, he says I would need to allocate an array of 10**60/53 elements, which is 18,867,924,528,301,900,000,000,000,000,000,000,000,000,000,000,000,000,000,000. So bitmapping is not a practical solution here.

@quickbluefish wrote:
Why do you want to join these?

I get this a lot. This answer is specific to the topic area and not really of interest to SAS users per se. The TAF is a collection of Medicaid Management Information System (MMIS) data. Half of this file contains financial-only transactions that are not of current interest to the researchers I work with. Plowing through half of all this data only to delete it is a fantastic waste of both people and computer resources. I am trying to get rid of this financial-only half to make the data more usable. The dataset is static, and once this is done, I won't need to do it again.
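As promised above, here is what the base36 re-keying could look like. This is a minimal sketch under my assumptions: work.unique_hdr_keys is a placeholder dataset holding one row per distinct clm_id, and the resulting crosswalk would then be joined (or hashed) back onto both the header and line tables.

/* Assign each distinct 64-character key a sequence number and encode
   that number as a 6-character base-36 string (0-9, A-Z).
   36**6 = 2,176,782,336, enough for ~2.1B distinct keys. */
data work.key_xwalk(keep=clm_id clm_seq clm_key36);
   set work.unique_hdr_keys;          /* one row per distinct clm_id   */
   length clm_key36 $6 digits $36;
   retain digits '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ';
   clm_seq = _n_;                     /* sequence number 1 to ~2.1B    */
   n = clm_seq;
   do i = 6 to 1 by -1;               /* fill the string from the right */
      substr(clm_key36, i, 1) = char(digits, mod(n, 36) + 1);
      n = floor(n / 36);
   end;
run;

Storing a 6-byte surrogate key instead of the 64-byte original should also make the planned indexes far smaller.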