mkeintz
PROC Star
Member since 08-02-2012
- 4,500 Posts
- 504 Likes Given
- 716 Solutions
- 2,365 Likes Received
Activity Feed for mkeintz
- Posted Re: Check non matching string dynamic datasets on SAS Programming. Saturday
- Posted Re: how to calculate by using value from another observation on SAS Programming. a week ago
- Got a Like for Re: Join vs Merge 1.2 TB with 110 GB Datasets. a week ago
- Got a Like for Re: Join vs Merge 1.2 TB with 110 GB Datasets. a week ago
- Got a Like for Re: Join vs Merge 1.2 TB with 110 GB Datasets. a week ago
- Liked Re: %num_to_char - Formatted number to character conversion for Ksharp. a week ago
- Liked Re: Sort and label data in Groups for PaigeMiller. a week ago
- Posted Re: %num_to_char - Formatted number to character conversion on SAS Programming. a week ago
- Posted Re: Two table statements in PROC Tabulate due to different statistics and we want merge them into on on SAS Programming. a week ago
- Posted Re: Sort and label data in Groups on SAS Programming. a week ago
- Posted Re: Sort and label data in Groups on SAS Programming. 2 weeks ago
- Liked Re: timestamp for Tom. 2 weeks ago
- Posted Re: Help with Demographics Table on Statistical Procedures. 2 weeks ago
- Liked Re: Do until and Do while can give the same results ...please check below confusing that both are t for PaigeMiller. 2 weeks ago
- Posted Re: Matching of cases and controls 1:3 on SAS Programming. 3 weeks ago
- Got a Like for Re: How to get difference for all the numerical value in two rows?. 3 weeks ago
- Posted Re: How to get difference for all the numerical value in two rows? on SAS Programming. 3 weeks ago
- Posted Re: How to get difference for all the numerical value in two rows? on SAS Programming. 3 weeks ago
- Posted Re: Join vs Merge 1.2 TB with 110 GB Datasets on SAS Programming. 3 weeks ago
- Liked Re: Join vs Merge 1.2 TB with 110 GB Datasets for RichardAD. 3 weeks ago
Saturday
You could compare two daily datasets at a time, but that would mean processing most of the datasets twice, once as the "before" date, and once as the "after".
But if each of the datasets is sorted by TKT, then you could process all of them in a single pass. Something like this (I have changed the daily dataset names to DATA_20250401, DATA_20250402, ..., DATA_20250430):
data want;
   set data_202504: ;
   by tkt descending date;
   /* a gap between consecutive dates within a TKT */
   if first.tkt=0 and dif(date)^=-1 then output;
   /* or a TKT whose most recent date is before the last day */
   else if first.tkt=1 and date^='30apr2025'd then output;
run;
If the data are not sorted by TKT and sorting would be expensive, then read the datasets in reverse chronological order. You could use two hash objects to hold the current and next days' data (CURRDAY and NEXTDAY in the code below). If an incoming observation has a TKT not found in the NEXTDAY object, then output it. At the end of each day, clear the NEXTDAY object and copy the CURRDAY data into it, in preparation for the new current date.
data want;
   set data_202504: ;
   by descending date;   /* interleave reads the newest day first */
   if _n_=1 then do;
      declare hash currday();
      currday.definekey('tkt');
      currday.definedata('tkt','date');
      currday.definedone();
      declare hiter i ('currday');
      declare hash nextday();
      nextday.definekey('tkt');
      nextday.definedata('tkt','date');
      nextday.definedone();
   end;
   if date='30apr2025'd then do;   /* the last day only seeds NEXTDAY */
      nextday.add();
      return;
   end;
   currday.add();
   if nextday.check()^=0 then output;   /* TKT absent on the following day */
   if last.date then do;
      /* replace NEXTDAY contents with the CURRDAY contents */
      nextday.clear();
      do while (i.next()=0);
         nextday.add();
      end;
      currday.clear();
   end;
run;
Note these programs assume there are no duplicate TKT values within each daily dataset.
a week ago
In the absence of sample data in the form of a working DATA step, here is untested code.
This code assumes every EOS record is preceded by a matching WBC record:
data want;
   set have;
   array var aval anrlo anrhi;
   array pct lborres lbornrlo lborrnrhi;
   do over var;
      /* LAG3 rather than LAG: this call executes three times per
         observation (once per array element), so a lag depth of 3
         reaches the same element of the prior observation */
      var=ifn(param='EOS',sum(0,lag3(var))*pct/100,var);
   end;
run;
The "sum(0,lag3(var))" expression avoids a missing-value note for the earliest observations, where the lagged value is missing and would otherwise yield a missing result when multiplied by "pct/100".
You could test for matching WBC record with something like:
data want;
   set have;
   array var aval anrlo anrhi;
   array pct lborres lbornrlo lborrnrhi;
   /* these LAGs execute once per observation, so plain LAG is correct */
   _param  = lag(param);
   _subjid = lag(subjid);
   _avisit = lag(avisit);
   do over var;
      var=ifn(param='EOS' and _param='WBC'
              and subjid=_subjid and avisit=_avisit
             ,sum(0,lag3(var))*pct/100   /* LAG3: see note above */
             ,var);
   end;
   drop _param _subjid _avisit;
run;
Again, untested.
a week ago
I agree that much of the time the order of the variables I produce in a data set is not important. But there are two situations in which I do care about variable order:
- I want a quick onscreen view of the data.
- It lets me use the double-dash syntax to declare a positional list of variables, as in the sketch below.
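A minimal sketch, with hypothetical names (a dataset WANT whose variables are stored in the order id q1 q2 q3 wt):
proc means data=want;
   var q1--q3;   /* positional list: resolves to q1 q2 q3 */
run;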
a week ago
You want to know whether you can form a TABLE statement that generates one set of statistics for some rows (the class variables) and a different set of statistics for other rows (the continuous variables).
The answer is no.
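For illustration, a minimal example (SASHELP.CLASS as stand-in data): two TABLE statements are legal in one step, but each produces its own separate table, and their rows cannot be merged into one.
proc tabulate data=sashelp.class;
   class sex;
   var height weight;
   table sex, n colpctn;            /* stats for the class variable */
   table height weight, mean std;   /* stats for the analysis variables */
run;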
But the greater question is why? Are you trying to generate a specific sequence of variables in your report?
a week ago
Let's say you want exactly 10 equal-size groups, subject to:
- you don't know exactly how many obs are in the dataset
- you may have tied price values
- you want to keep tied prices in the same group
The program below is corrected by inserting an explicit OUTPUT statement, which prevents a premature increment of the GROUP variable.
proc sort data=have out=need;
   by descending price;
run;

data want;
   set need nobs=n_need;
   by descending price;
   retain group 1;
   output;   /* output before any increment of GROUP */
   /* advance GROUP at the end of a run of tied prices, once the
      running count passes the next tenth of the dataset */
   if last.price=1 and _n_ > group*(n_need/10) then group+1;
run;
2 weeks ago
If you have tied prices, do you have a preferred order among those ties?
Can a set of observations with tied prices be assigned to different (consecutive) groups? Or must tied prices all be in the same group?
Are you using a fixed group size (100,000), or are you aiming for 10 groups of equal size (or as near equal as tied prices might allow)?
2 weeks ago
"it did not run" does not provide even the minimal amount of information needed to diagnose the problem you have encountered. We don't even know the problem description.
Help us help you. Please start with showing the sas log generated when you submitted the program code.
3 weeks ago
@PRoul
I see two issues with your solution:
You are not solving the problem as stated by the OP. You are doing EXACT matches of AGE between case and control, but the OP requested matches in which the control age falls within a range of the case age. If the OP would be equally satisfied with a "close" match as with an "exact" match, this program does not fully satisfy the initial request. Of course, the OP might prefer to exclude close matches.
More importantly, your solution is unlikely to generate anything like a random sample of matches (unlike PSMATCH). In particular, records near the end of the control dataset will not have the same probability of being selected as records near the start, even when they have the same match values. That is because:
- PROC SORT, by default, does not change the relative order of records having identical sort keys.
- Hash objects declared with the multidata:'Y' option preserve, within each key, the order in which data items were retrieved from the source dataset. This means your technique of a FIND method followed by a series of FIND_NEXT methods will always consider the data items nearest the start of the dataset first. This is true even though you are sampling without replacement.
Now if your control dataset is in truly random order to begin with (i.e., before the PROC SORT), you would effectively have a random solution. But with real datasets, it's pretty heroic to assume they are in random order.
You could resolve this problem by randomizing the order of data within each match key as follows:
data controls;
   set controls;
   matchkey = catx('_', gender_control, age_control);
   call streaminit(0598666);
   rnum = rand('uniform');
run;

proc sort data=controls out=controls;
   by matchkey rnum;
run;
followed by your data step with hash code.
This would effectively mean that every observation with a given matchkey value would have the same probability of being selected.
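For reference, here is a minimal sketch of the kind of hash matching step being discussed. The names ID_CASE, AGE_CASE, GENDER_CASE, ID_CONTROL and the 1:3 ratio are assumptions, not the OP's actual code; CONTROLS is the randomized dataset built above.
data matched;
   if _n_=1 then do;
      if 0 then set controls;   /* defines host variables for the hash */
      declare hash ctl (dataset:'controls', multidata:'y');
      ctl.definekey('matchkey');
      ctl.definedata('id_control','age_control','gender_control');
      ctl.definedone();
   end;
   set cases;   /* assumed variables: id_case, age_case, gender_case */
   matchkey = catx('_', gender_case, age_case);
   /* take up to 3 controls per case; re-issue FIND after each
      REMOVEDUP so the hash item pointer stays valid */
   do _k = 1 to 3 while (ctl.find() = 0);
      output;            /* one row per case/control pair */
      ctl.removedup();   /* without replacement: remove the used item */
   end;
   drop _k;
run;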
BTW, for purposes of matching cases and controls, there is no need to sort the CASES file. It's just wasted resources.
3 weeks ago
1 Like
I would say that if you can have more than two obs for a given ID, just implement a minor modification to @Tom 's code:
data have;
input ID $ A B C D;
cards;
1 1 2 3 4
1 2 3 4 5
2 1 3 4 5
2 2 6 3 8
3 1 3 4 7
3 9 9 9 9 /*extra obs for ID 3 */
3 3 4 3 6
;
data want;
   set have;
   by id;
   if first.id or last.id;   /* no further processing of "middle" obs */
   array _n _numeric_;
   do over _n;
      /* DIF4 rather than DIF: this call executes four times per kept
         observation (once per array element), so the lag depth must
         equal the element count to reach the same variable on the
         prior kept observation */
      _n=dif4(_n);
   end;
   if not (first.id);   /* avoid instances of 1 obs per id */
run;
3 weeks ago
Does every ID have exactly 2 records? Then the solution offered by @Tom is the way to go.
3 weeks ago
1 Like
@RichardAD wrote:
Let T = duration for flat read of detail table D
Let K = number of header keys that _can_ fit in a hash table.
Do 1,849,842,886 / K data step reads through D with hash lookup selections.
Append selections of each run-through.
Guessing conservatively, and presuming 70 GB set aside for a session, there should be enough memory for a hash table that holds 500K 64-byte keys. So maybe 6 reads through D. Make that a worst case of 10 read-throughs, so 10 * (T + T-out) (writing key-matched records).
If this needs to be done more than once, you might want to also code a solution in PROC DS2 that uses THREADs and compare resource and time consumption to the DATA step.
You don't need that much input/output activity.
If, after examining the header dataset in comparison to your available memory for the hash object, you find that you have to do 10 subgroup joins, you don't have to generate input/output totaling 10*(T + Tout). It can be reduced to about 3*T + Tout. Divide the header into 10 subgroups based on the value of the join variable. Then, given that you have the disk space, also divide the detail dataset into 10 smaller datasets using the same join variable; that can be done in one DATA step totaling 2*T of input/output, as sketched below. Each of the 10 subgroup joins then needs only about 0.1*T of input plus its share of the output. You can save even more by creating the detail subgroups with only the variables of interest.
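Here is a hedged sketch of that one-pass split. The names are assumptions: DETAIL for the big dataset, CLM_ID as the join variable (per this thread), macro variables &C1-&C9 for precomputed cutpoints, and a placeholder KEEP list.
data d01 d02 d03 d04 d05 d06 d07 d08 d09 d10;
   set detail (keep=clm_id /* plus the variables of interest */);
   /* route each record by its join-variable range */
   select;
      when (clm_id < &c1) output d01;
      when (clm_id < &c2) output d02;
      when (clm_id < &c3) output d03;
      when (clm_id < &c4) output d04;
      when (clm_id < &c5) output d05;
      when (clm_id < &c6) output d06;
      when (clm_id < &c7) output d07;
      when (clm_id < &c8) output d08;
      when (clm_id < &c9) output d09;
      otherwise           output d10;
   end;
run;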
a month ago
@DerylHollick wrote:
Will half of the header table fit into memory? A third? I'd first try to chunk the header table and run a hash lookup for each chunk. Plus, it might give you a decent idea as to how much memory would be needed for the whole header table.
This is what my suggested code does. But instead of just loading any old half of the header (or, in my example, a fifth), I proposed selecting a half (or a fifth) for a given range of join-variable values - on both datasets. This makes the number of needed join comparisons about one fourth (in the case of halves) or one twenty-fifth (for fifths) for each subgroup join - a very effective reduction, especially when the process "outsources" the subgroup selection to the data engine. True, one has to process the big dataset for each subgroup join, but that burden is significantly reduced by outsourcing the subgroup filtering to the data engine via the WHERE options.
a month ago
You can set up a HISTORY array (one element per date, from the earliest possible to the latest possible date) and pass through each patient twice. Initialize each patient to class='Seq' and flag = treatment of the first record.
During the first pass, update the history array. If a date is encountered that has more than one treatment, set class to 'Combi' and flag to 'C' ... and stop monitoring dates - you won't be going back from Combi to Seq.
During the second pass, do nothing but permit the observations to be output, using the CLASS and FLAG values retained from the first pass:
data have;
   input patient $2. treatment :$1. start :date9. end :date9.
         _class :$5. _flag :$1. ;   /* _class/_flag hold expected results */
   format start end date9.;
datalines;
E1 A 10Apr2017 26Jun2017 Seq A
E1 B 07Jun2018 08Aug2018 Seq A
E2 B 06Sep2016 20Oct2016 Seq B
E2 A 15Nov2017 04Oct2018 Seq B
E3 A 07Dec2010 08Feb2011 Seq A
E3 A 06Sep2016 20Oct2016 Seq A
E3 B 15Nov2017 04Oct2018 Seq A
E4 B 07Dec2010 08Feb2011 Seq B
E4 B 06Sep2016 20Oct2016 Seq B
E4 A 15Nov2017 04Oct2018 Seq B
E5 A 27Feb2018 20Nov2018 Combi C
E5 B 22May2018 30Oct2018 Combi C
E7 A 01Feb2016 28Apr2016 Seq A
E7 A 20Apr2017 16May2017 Seq A
E7 B 21Aug2017 02Jan2019 Seq A
E7 A 27May2019 29Jul2019 Seq A
E8 B 01Feb2016 28Apr2016 Seq B
E8 B 20Apr2017 16May2017 Seq B
E8 A 21Aug2017 02Jan2019 Seq B
E8 B 27May2019 29Jul2019 Seq B
run;
%let beg=01jan2010;
%let end=31dec2019;
data want (drop=d);
   set have (in=firstpass) have (in=secondpass);
   by patient;
   length class $5 flag $1;   /* lengths must accommodate 'Combi' and 'C' */
   retain class flag;
   array history {%sysevalf("&beg"d):%sysevalf("&end"d)} _temporary_;
   if first.patient then do;
      call missing(of history{*});
      class='Seq';
      flag=treatment;
   end;
   /* first pass: mark each treated date; a date with more than one
      treatment means concurrent treatment, i.e. Combi */
   if firstpass=1 and class='Seq' then do d=start to end while (class='Seq');
      history{d}+1;
      if history{d}>1 then do;
         class='Combi';
         flag='C';
      end;
   end;
   /* second pass: output with the retained CLASS and FLAG values */
   if secondpass;
run;
04-12-2025 04:02 PM
6 Likes
I would not use FIRSTOBS/OBS to divide the join into subgroup joins, because a given CLM_ID may appear in more than one of those subgroup joins. Instead, examine each CLM_ID once by choosing a restricted range of CLM_ID in both datasets for each subgroup join. This can work because CLM_ID is the join variable.
Let's say you divide your CLM_ID values into 5 ranges, each range with a lower limit (LLLLLLL) and upper limit (UUUUUUU), where LLLLLLL and UUUUUUU are quintile values. Of course, the lowest range doesn't need a specified LLLLLLL, and the highest range doesn't need a specified UUUUUUU.
Then you could run five programs, such as the one below - just put values in place of LLLLLLL and UUUUUUU:
data want1;
   set bigdataset (keep= list of variables);
   where LLLLLLL <= clm_id < UUUUUUU;   /* filtering done by the data engine */
   if _n_=1 then do;
      declare hash h (dataset:'header (where=(LLLLLLL <= clm_id < UUUUUUU))');
      h.definekey('clm_id');
      h.definedata('clm_id');
      h.definedone();
   end;
   if h.check()=0;   /* keep detail obs whose CLM_ID is in the header */
run;
Notes:
- Limiting the range in the hash object this way allows you to avoid requiring more memory than is available.
- Using the WHERE statement after the SET statement outsources the filtering of the big dataset to the data engine, which saves a lot of resources.
- Then it's just a matter of seeing whether each filtered CLM_ID from the big dataset is also found in the hash object.
- I've coded the above ("<=" for the lower limit, "<" for the upper) to avoid double inclusion of the quintile values. So either drop the upper limit for the highest range, or change "<" to "<=" there.
Of course, this requires generating the quintile CLM_ID values. You could do something like this to find the quintiles:
proc sort data=header out=header_sorted nodup;
   by clm_id;
run;

data limits (drop=clm_id);
   set header_sorted nobs=nclm;
   retain quintile 1;
   if _n_=1 then LLLLLLL=clm_id;
   retain LLLLLLL;
   if _n_ = ceil(nclm*(quintile/5));   /* keep only the quintile boundaries */
   UUUUUUU=clm_id;
   output;
   quintile+1;
   LLLLLLL=UUUUUUU;   /* the next range starts where this one ends */
run;
proc print;
run;
04-09-2025 03:22 PM
@RobPratt
It appears to me that your OPTMODEL code maximizes the total number of SEATS available in unused rooms. Do I have that correct?
But the OP appears to ask to maximize the number of unused ROOMS (minimize the number of rooms used, actually). Could there not be a configuration of room sizes that satisfies your objective but does not satisfy the OP's original request?
For instance, suppose there are six rooms with sizes 7, 6, 6, 5, 2, and 2.
For a total load of 13, I believe your code would select the 7 and a 6, maximizing the number of seats in unused rooms (6+5+2+2=15). That result would also satisfy the objective of maximizing unassigned rooms.
For a total load of 11, I think your code would use three rooms (the 7 and both 2's), leaving 3 rooms (6+6+5=17 seats) unused, which maximizes seats in unused rooms. However, the minimum number of ROOMS needed for that load would be two (the 7 and either the 5 or a 6), leaving 4 rooms free, but with smaller total capacity.
Maybe the OP doesn't care about this issue. But is there code that would do two levels of optimizing? First, minimize the count of used rooms. Then, among all combinations satisfying that objective, find one that maximizes your objective - total seats in unused rooms.
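As a hedged sketch (my assumption of one way to do it, not @RobPratt's actual code), the two levels might look like this in PROC OPTMODEL, using the example sizes above and a load of 11:
proc optmodel;
   set ROOMS = 1..6;
   num size {ROOMS} = [7 6 6 5 2 2];
   num load = 11;
   var Use {ROOMS} binary;
   con MeetLoad: sum {r in ROOMS} size[r]*Use[r] >= load;

   /* level 1: minimize the number of rooms used */
   min NumUsed = sum {r in ROOMS} Use[r];
   solve;
   num fewest;
   fewest = NumUsed;   /* objective value of the first solve */

   /* level 2: holding the room count at its minimum,
      maximize the seats left in unused rooms */
   con HoldCount: sum {r in ROOMS} Use[r] <= fewest;
   max FreeSeats = sum {r in ROOMS} size[r]*(1 - Use[r]);
   solve obj FreeSeats;
   print Use;
quit;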