I have a dataset with IDs that are both valid and invalid. All valid IDs are numeric only and contain exactly 11 digits, which also MAY include 0’s in the front but not an absolute requirement. How do I write a sas code to delete all entries that do not fall into the 11 numeric only scheme? I have included a sample dataset on excel. Thank you so much.
@byeh2017 wrote:
Okay, I have a modified version of my original file.
data MYDATA.ORIGINALSAMPLE; infile datalines dsd truncover; input DCDeathDate:DATETIME16. NationalID:$20.; datalines4; 01NOV16:00:00:00, 17JAN17:00:00:00,0000000000 28NOV16:00:00:00,00000000000 16NOV16:00:00:00,004131103 21DEC16:00:00:00,00438162 12DEC16:00:00:00,01001000012 30NOV16:00:00:00,01001000029 23DEC16:00:00:00,01001000142 06NOV16:00:00:00,01001000234 18DEC16:00:00:00,01001000275 23NOV16:00:00:00,01001000320 22NOV16:00:00:00,01001000392 26DEC16:00:00:00,01001000399 18NOV16:00:00:00,01001000554 09DEC16:00:00:00,01001000608 30NOV16:00:00:00,01001000657 17NOV16:00:00:00,01001000881 27DEC16:00:00:00,01001000951 28DEC16:00:00:00,01001001048 04NOV16:00:00:00,01001001804 10DEC16:00:00:00,01001001853 04NOV16:00:00:00,01001001978 20DEC16:00:00:00,01001002032 27DEC16:00:00:00,01001002219 05DEC16:00:00:00,01001002348 04NOV16:00:00:00,01001002451 16DEC16:00:00:00,01001002597 19NOV16:00:00:00,01001002603 10NOV16:00:00:00,01001002666 26NOV16:00:00:00,01001002761 04NOV16:00:00:00,01001002854 25DEC16:00:00:00,01001002870 18NOV16:00:00:00,01001002882 25NOV16:00:00:00,01001002950 31DEC16:00:00:00,01001003003 01NOV16:00:00:00,01001003057 21NOV16:00:00:00,01001003198 02NOV16:00:00:00,01001003231 02DEC16:00:00:00,01001003234 05DEC16:00:00:00,01001003253 26DEC16:00:00:00,01001003284 05NOV16:00:00:00,01001003312 15DEC16:00:00:00,01001003370 02DEC16:00:00:00,01001003414 14DEC16:00:00:00,01001003476 04DEC16:00:00:00,01001003583 10DEC16:00:00:00,01001003609 28DEC16:00:00:00,01001003788 30NOV16:00:00:00,01001003803 10DEC16:00:00:00,01001003913 ;;;;
Use this:
data mydata.clean;
set MYDATA.ORIGINALSAMPLE;
if verify(trim(NationalID),"1234567890") = 0 and length(trim(NationalID)) = 11;
run;
Please post your example data in a data step.
An easy to use macro for converting your SAS dataset into a data step is found here:
Why?
- Excel files can carry malware
- therefore many institutional firewalls block their download
- and they cannot preserve SAS data as it is (attributes - like variable types and formats - are lost)
Posting the data step enables your possible helpers to exactly recreate your data with a simple copy/paste and run, without any kind of danger.
Sorry. Attached is the sas dataset.
Please reread my post. Use the macro provided in the link to create the datastep code in one easy step, and then post the resulting code here, using the "little running man" or {i} button.
Sorry, here it is:
data MYDATA.SAMPLEINVALIDID; infile datalines dsd truncover; input Date:MMDDYY10. ID:$11. Screened:$3.; datalines4; 11/01/2017,14522778554,Yes 12/01/2017,A,No 01/01/2018,14522778556,Yes 02/01/2018,14522778557,No 03/01/2018,14522778558,Yes 04/01/2018,14522778559,No 05/01/2018,14522778560,Yes 06/01/2018,,No 07/01/2018,14522778562,Yes 08/01/2018,14522778563,No 09/01/2018,14522778564,Yes 10/01/2018,14522778565,No 11/01/2018,14522778566,Yes ;;;;
I seem to have found a way to do this on this sample dataset, but when I apply it to a bigger dataset, it somehow erases all entries.
I am using this particular code:
Data mydata.sampleidsCLEAN; Set mydata.sampleids; If verify(ID,"0123456789") NE 0 then delete; Run;
This seems to work on my sample dataset:
data MYDATA.SAMPLEIDs; infile datalines dsd truncover; input Date:MMDDYY10. ID:$11.; datalines4; 11/01/2017,14522778554 12/01/2017,A 01/01/2018,14522778556 02/01/2018,14522778557 03/01/2018,14522778558 04/01/2018,14522778559 05/01/2018,14522778560 06/01/2018, 07/01/2018,14522778562 08/01/2018,14522778563 09/01/2018,14522778564 10/01/2018,14522778565 11/01/2018,14522778566 12/01/2018, 01/01/2019, 02/01/2019,sdf 03/01/2019,12a 04/01/2019,32r234a 05/01/2019,443bbb 06/01/2019,bbbbbbb ;;;; Data mydata.sampleidsCLEAN; Set mydata.sampleids; If verify(ID,"0123456789") NE 0 then delete; Run;
But when I apply it to a much bigger dataset, it is just erasing all entries:
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 61 62 Data mydata.originaldatasetDUPIDCLEAN; 63 Set mydata.originaldatasetDUPCLEAN; 64 If verify(nationalID,"0123456789") NE 0 then delete; 65 Run; NOTE: There were 74691 observations read from the data set MYDATA.ORIGINALDATASETDUPCLEAN. NOTE: The data set MYDATA.ORIGINALDATASETDUPIDCLEAN has 0 observations and 45 variables. NOTE: DATA statement used (Total process time): real time 1.92 seconds cpu time 1.14 seconds 66 67 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 80
Are you sure that nationalID contains the same type of data as ID in your example dataset?
You can use the fourth parameter in the macro to restrict the number of observations that gets into the example dataset datastep, so you could post just the first 10 obs from your real dataset MYDATA.ORIGINALDATASETDUPCLEAN.
Thanks. My main dataset MYDATA.ORIGINALDATASETDUPCLEAN has the following parameters
Label: NationalID
Name: NationalID
Length: 20
Type: Char
Format: $20
Informat: $20
This appears to be different than my sample dataset which has the following properties:
Label: ID
Name: ID
Length: 11
Type: Char
Format:
Informat:
There is some sensitive patient information from my original dataset, let me spend a few minutes of making a new dataset out of this to delete some of these sensitive variables. Thank you.
Okay, I have a modified version of my original file.
data MYDATA.ORIGINALSAMPLE; infile datalines dsd truncover; input DCDeathDate:DATETIME16. NationalID:$20.; datalines4; 01NOV16:00:00:00, 17JAN17:00:00:00,0000000000 28NOV16:00:00:00,00000000000 16NOV16:00:00:00,004131103 21DEC16:00:00:00,00438162 12DEC16:00:00:00,01001000012 30NOV16:00:00:00,01001000029 23DEC16:00:00:00,01001000142 06NOV16:00:00:00,01001000234 18DEC16:00:00:00,01001000275 23NOV16:00:00:00,01001000320 22NOV16:00:00:00,01001000392 26DEC16:00:00:00,01001000399 18NOV16:00:00:00,01001000554 09DEC16:00:00:00,01001000608 30NOV16:00:00:00,01001000657 17NOV16:00:00:00,01001000881 27DEC16:00:00:00,01001000951 28DEC16:00:00:00,01001001048 04NOV16:00:00:00,01001001804 10DEC16:00:00:00,01001001853 04NOV16:00:00:00,01001001978 20DEC16:00:00:00,01001002032 27DEC16:00:00:00,01001002219 05DEC16:00:00:00,01001002348 04NOV16:00:00:00,01001002451 16DEC16:00:00:00,01001002597 19NOV16:00:00:00,01001002603 10NOV16:00:00:00,01001002666 26NOV16:00:00:00,01001002761 04NOV16:00:00:00,01001002854 25DEC16:00:00:00,01001002870 18NOV16:00:00:00,01001002882 25NOV16:00:00:00,01001002950 31DEC16:00:00:00,01001003003 01NOV16:00:00:00,01001003057 21NOV16:00:00:00,01001003198 02NOV16:00:00:00,01001003231 02DEC16:00:00:00,01001003234 05DEC16:00:00:00,01001003253 26DEC16:00:00:00,01001003284 05NOV16:00:00:00,01001003312 15DEC16:00:00:00,01001003370 02DEC16:00:00:00,01001003414 14DEC16:00:00:00,01001003476 04DEC16:00:00:00,01001003583 10DEC16:00:00:00,01001003609 28DEC16:00:00:00,01001003788 30NOV16:00:00:00,01001003803 10DEC16:00:00:00,01001003913 ;;;;
@byeh2017 wrote:
Okay, I have a modified version of my original file.
data MYDATA.ORIGINALSAMPLE; infile datalines dsd truncover; input DCDeathDate:DATETIME16. NationalID:$20.; datalines4; 01NOV16:00:00:00, 17JAN17:00:00:00,0000000000 28NOV16:00:00:00,00000000000 16NOV16:00:00:00,004131103 21DEC16:00:00:00,00438162 12DEC16:00:00:00,01001000012 30NOV16:00:00:00,01001000029 23DEC16:00:00:00,01001000142 06NOV16:00:00:00,01001000234 18DEC16:00:00:00,01001000275 23NOV16:00:00:00,01001000320 22NOV16:00:00:00,01001000392 26DEC16:00:00:00,01001000399 18NOV16:00:00:00,01001000554 09DEC16:00:00:00,01001000608 30NOV16:00:00:00,01001000657 17NOV16:00:00:00,01001000881 27DEC16:00:00:00,01001000951 28DEC16:00:00:00,01001001048 04NOV16:00:00:00,01001001804 10DEC16:00:00:00,01001001853 04NOV16:00:00:00,01001001978 20DEC16:00:00:00,01001002032 27DEC16:00:00:00,01001002219 05DEC16:00:00:00,01001002348 04NOV16:00:00:00,01001002451 16DEC16:00:00:00,01001002597 19NOV16:00:00:00,01001002603 10NOV16:00:00:00,01001002666 26NOV16:00:00:00,01001002761 04NOV16:00:00:00,01001002854 25DEC16:00:00:00,01001002870 18NOV16:00:00:00,01001002882 25NOV16:00:00:00,01001002950 31DEC16:00:00:00,01001003003 01NOV16:00:00:00,01001003057 21NOV16:00:00:00,01001003198 02NOV16:00:00:00,01001003231 02DEC16:00:00:00,01001003234 05DEC16:00:00:00,01001003253 26DEC16:00:00:00,01001003284 05NOV16:00:00:00,01001003312 15DEC16:00:00:00,01001003370 02DEC16:00:00:00,01001003414 14DEC16:00:00:00,01001003476 04DEC16:00:00:00,01001003583 10DEC16:00:00:00,01001003609 28DEC16:00:00:00,01001003788 30NOV16:00:00:00,01001003803 10DEC16:00:00:00,01001003913 ;;;;
Use this:
data mydata.clean;
set MYDATA.ORIGINALSAMPLE;
if verify(trim(NationalID),"1234567890") = 0 and length(trim(NationalID)) = 11;
run;
This works! Thank you very much for your help.
Just trying to understand how this works. The original code was 0123456789, how come moving the code to 1234567890 made all the difference? And what exactly does the trim function do? Thank you again.
The difference between 0123456789 and 1234567890 is just the order of digits on the keyboard, it makes no difference with regard to the verify() function.
The trim() function removes all trailing blanks from strings, strip() removes leading and trailing blanks.
So, after using trim(), one can easily check for length and digits-only.
You could also use the notdigit() function instead of verify().
Another solution is to use a regular expression
if prxmatch("/^\d{11}$/",trim(ID));
If your NationalID only contains the same 11-digit numbers like your sample ID, the verify() function will cough up on the blanks that pad the longer variable.
Then your condition should look like
if verify(trim(NationalID),"1234567890") ne 0 or length(trim(NationalID)) ne 11 then delete;
To make it prettier, I'd use a subsetting if:
if verify(trim(NationalID),"1234567890") = 0 and length(trim(NationalID)) = 11;
If the ID's are not left-aligned, use strip() instead of trim().
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.