I have a huge dataset with several million records, where IDs (customer identifier) are repeating many times, as I have historical data for same customers (lets say 10 years of historical data).
I need to remove duplicates and remain with only one record for each ID.
This must be a random exclusion, as in the end I need to have a representation of historical data.
Also, random selection of IDs must be repeatable - meaning, if someone else will run this logic on the same dataset, it will come back with exactly same Point-In-Time random selection every time (such as, if customer identifier 99900123 comes with 120 records (120 historical months) and was chosen to keep Aug 2012 in random selection - that historical month must come back every time for 99900123 when executing logic.
Hopefully, someone can come up with a quick resolution, as I need it relatively urgent 🙂
I looked at surveyselect, however my understanding any of the methods require sample size, which does not apply in my case
Hello @_MVB_ and welcome to the SAS Support Communities!
Use PROC SURVEYSELECT. The sample size is 1 -- per ID:
/* Create test data for demonstration */
data have;
input ID $ month;
cards;
1 4
1 7
1 2
2 3
2 1
3 9
3 5
;
/* Select randomly one observation per ID */
proc surveyselect data=have
method=srs n=1
seed=1732 out=want(drop=selectionprob samplingweight);
strata id;
run;
The STRATA statement requires that the input dataset (HAVE) is sorted by ID. The selection is repeatable thanks to the random seed (arbitrary integer).
Hello @_MVB_ and welcome to the SAS Support Communities!
Use PROC SURVEYSELECT. The sample size is 1 -- per ID:
/* Create test data for demonstration */
data have;
input ID $ month;
cards;
1 4
1 7
1 2
2 3
2 1
3 9
3 5
;
/* Select randomly one observation per ID */
proc surveyselect data=have
method=srs n=1
seed=1732 out=want(drop=selectionprob samplingweight);
strata id;
run;
The STRATA statement requires that the input dataset (HAVE) is sorted by ID. The selection is repeatable thanks to the random seed (arbitrary integer).
@FreelanceReinh - Thank you!
Question, can you elaborate on this:
drop=selectionprob samplingweight
What is it doing exactly?
@_MVB_ wrote:
Question, can you elaborate on this:
drop=selectionprob samplingweight
What is it doing exactly?
This DROP= dataset option excludes two variables (SelectionProb and SamplingWeight) from being written to the output dataset (WANT) which are created automatically by PROC SURVEYSELECT due to the STRATA statement. I assumed that you wouldn't want to add these "unexpected," technical new variables to the existing variables in the random sample. Feel free to omit the "(drop=...)" part of my example code and then look at the additional columns in dataset WANT to decide if they might be useful for your purposes.
@FreelanceReinh That is a good point - I do not want to have any extra variable created, as it would add to the total size of the dataset.
Surveyselect is the tool. Your sample size is 1 per strata, where ID is the strata.
proc sort data=have; by id date; /* if there a multiple records for the it with the same date then you need to provide enough variables to duplicate the sort again later */ run; Proc surveyselect data=have out=want sampsize=1 noprint seed=23456 /* this value initializing the sequence of random numbers used and is critical for duplicating the same selection HOWEVER: upgrades to SAS software, different OS make it impossible to guarantee same results afterwards */ ; strata id; /*data must be sorted by the strata variable which is why your "run again later" must be able to duplicate the sort order */ run;
Any change that affects the number of variables or the input order will render "duplicating" the selection later useless. Though with that restriction make sure that you don't lose the first created data set. Not that hard to place into a permanent library and not delete it.
@ballardw - Thank you!
My data is already sorted (repeating records with the same date already excluded), so pretty much I got customer ID, account/s (might be multiple) and date9.
So 'strata' ID will ensure none of the IDs will be completely removed in the 'want' dataset, while only one record kept for an ID, and this is because n=1? Is it because of the combination of 'strata' and 'n' used in the same logic? What would it be, if i keep n=1, but not including 'strata' at all? Also, what would it be if n=2 and 'strata' ID?
@_MVB_ wrote:
@ballardw - Thank you!
My data is already sorted (repeating records with the same date already excluded), so pretty much I got customer ID, account/s (might be multiple) and date9.
So 'strata' ID will ensure none of the IDs will be completely removed in the 'want' dataset, while only one record kept for an ID, and this is because n=1? Is it because of the combination of 'strata' and 'n' used in the same logic? What would it be, if i keep n=1, but not including 'strata' at all? Also, what would it be if n=2 and 'strata' ID?
I did not use N= anything, so I am not sure what you refer to here. If you mean "number of records selected per strata" please use either the proper term "sampsize" or state that is what you mean. The SAMPSIZE is the number of records to select per strata. If there is only one number that is applied to all the strata.
The sampsize can be set to specific number of records per strata but it would require either a list in the order the strata appear in the data, such as Sampsize = (25 15 30) which would select 25 records from the first stratum, 15 from the second, and 30 from the third. I suspect such is not at all desirable from your data description, SAMPSIZE could also reference a data set of specific structure that has the strata values and the desired sample size per strata, which would likely be the way to go if you want to control 100's of strata.
If you indicate that you want a SAMPSIZE of n for each strata then you need to consider what to do for strata that have fewer than n records. If you specify more than 1 for the SAMPSIZE and you want all the records if n is larger than the count for that strata you use the SELECTALL option (limits based on selection method), otherwise there will be an error because you requested more than exist in the data.
If you do not include a STRATA statement then your entire data set is considered. If you set SAMPSIZE=1 then your output data set will consist of 1 record, SAMPSIZE=10 would be 10 records.
Since you say that a value of Id variable may have multiple accounts then you should sort by Id and account. And if the Account is duplicated within Id yet another variable to force a repeatable unique order of the data. As I said, anything that changes the order of the values in the data will affect the output. So unless you can 100 percent state "this data set will never ever be resorted" you need to have a repeatable order from your sort.
Also you can have more than one variable on the STRATA statement. If you specific Strata Id Account, then each combination of Id and Account is a strata and the Sampsize would be applied. So Sampsize=1 would get one record for each Id and Account.
If your Strata variable is not missing you will get one, with the example code, per record Id. You want to be careful just using the word ID as that is also a statement in Proc surveyselect (and quite a few other procedures).
@ballardw -this is a detailed explanations, thanks!
I was referring to Sampsize (instead of N).
I called customer identifier ID here, however in my data this field has a different name, so there will not be an issue using it in a proc surveyselect statement.
My data already sorted by customer identifier, account, and historical month in ascending order, and this is my reference dataset that I am not going to change, hence if in the future this needs to be replicated, I can take this dataset and run surveyselect again with same seed number.
So if STRATA specified in the proc surveyselect and has reference to one variable only (in my case, lets call it Customer Identifier), while SAMPSIZE=1 - will this come back with unique records for each of the Customer Identifier, while a record will be chosen randomly?
To make sure we are on the same page, lets assume my dataset in 100M records, which are historical data for 1M customers and 3M accounts. Each customer has at least 1 account. In my output dataset (post surveyselect), I would want to have exactly 1M records. I do not need to capture a record for each of the accounts (which would make it 3M records in the output), but only one record per customer identifier.
Here is how I am about to run this:
PROC SURVEYSELECT DATA=HAVE
METHOD=SRS
SAMPSIZE=1
SEED=12345
OUT=WANT;
STRATA CUSTOMER_IDENTIFIER;
RUN;
Since SAS will not disturb you starting data set why not run the code and check the output?
With as many strata as you are using I suggest adding NOPRINT to the Proc statement so the details of the selection do not get sent to the results window.
That is what I am doing - wanted to double check here as well 🙂
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.