Hi SAS Community,
I am importing pipe delimited .txt and .csv files into SAS and want to run a series of checks:
1. Count the number of records
2. Count the number of distinct records for ID
3. Whether the variables match a list of set variables
4. Are the values expected or in the correct format
Is there a way to output 1 dataset with all this information? As it currently stands,I produce these results into a text file that are the results of multiple proc frequencies and proc sql.
These are my expected variables:
data VARS;
input vars $;
datalines;
ID
BDAY
GENDER
ADDRESS1
ADDRESS2
CITY
STATE
ZIP
ZIP4
COUNTY_CODE
;
run;
This is my sample data:
data sample;
infile datalines dlm="|" missover dsd;
input ID $2 BDAY $10 GENDER $ ADDRESS1 $ ADDRESS2 $10. CITY $ STATE $ ZIP $ ZIP4 $ COUNTY_CODE $;
datalines;
A1|20200420|M|123 Main St.|Suite 201|Juneau|AK|99802||02112
B2|4/20/2020|Male|124 Main St.||Juneau|AK|99802|Juneau
C3|4/20/2020|M|125 Main St.||Juneau|AK|99802||02112
4-1|20200420|Male|126 Main St.|Suite 101|Juneau|AK|99802||Juneau
;
run;
Conditions:
ID:should be alpha numeric and 2 characters
BDAY: should be YYYYMMDD format
Gender: should be 1 characters
Address1 and Address 2 should be character
City should character
State should be 2 characters
zip should be 5 characters
zip4 should be no more than 4 characters
county should be the fips code which is 5 characters
Results:
Line 1: accurate and meets all criteria
Line 2: dob is not in correct format, gender is not 2 characters, does not have zip4 variable, county does not have 5 character
Line 3: dob is not in correct format
Line 4: id is not alpha numeric with no special characters, and gender is not 2 characters
So far I have:
/*Variables match list? */
proc sql;
create table plan_vars as
select strip(upcase(name)) as vars
from sashelp.vcolumn
where libname='WORK' and memname =SAMPLE;
quit;
proc sql;
create table comparevar_&file. as
select a.*, b.*,
case when a.vars = b.varsm then 'Match'
else 'No' end as var_match
from VARS as a
full join plan_vars(rename=(vars=varsm)) as b
on a.vars=b.varsm;
quit;
/*output records */
ods rtf file="&output.\FileReview.rtf";
ods noptitle;
options nodate nonumber;
proc sql;
create table ctrltot_&file. as
select count(*) as Total_number_of_records,
count(distinct ID) as Total_number_of_unique_ID,
count (ID) as tot_ID
from sample;
quit;
proc freq data = sample;
tables _all_;
format _numeric_ _character_ $miss.;
run;
proc freq data = ctrltot_&file.;
tables _all_;
run;
proc freq data=comparevar_&file.;
tables _all_;
where var_match = 'No';
run;
Something like this?
proc sql;
select count(*) as NB_RECORDS
,count(distinct ID) as NB_UNQ_ID
,sum(^prxmatch('/^\w\w$/',ID)) as NB_BAD_ID
,sum(^input(BDAY,yymmdd8.)) as NB_BAD_BDAY
,sum(^prxmatch('/^[MF]$/i',GENDER)) as NB_BAD_GENDER
,sum(^prxmatch('/^[A-Z ]*$/i',CITY)) as NB_BAD_CITY
,sum(^prxmatch('/^\d{5} *$/i',ZIP)) as NB_BAD_ZIP
,sum(^prxmatch('/^\d{1,4} *$/i',ZIP4)) as NB_BAD_ZIP4
,sum(^prxmatch('/^\d{5} *$/i',COUNTY_CODE )) as NB_BAD_CC
from TABLE;
> Line 1: accurate and meets all criteria
How?
ID=A1 is not 7 characters, GENDER=M is not 2 characters etc
Sorry, my apologies. It was a sample and I forgot to change the length. It should be 2 and 1, respectively.
Something like this?
proc sql;
select count(*) as NB_RECORDS
,count(distinct ID) as NB_UNQ_ID
,sum(^prxmatch('/^\w\w$/',ID)) as NB_BAD_ID
,sum(^input(BDAY,yymmdd8.)) as NB_BAD_BDAY
,sum(^prxmatch('/^[MF]$/i',GENDER)) as NB_BAD_GENDER
,sum(^prxmatch('/^[A-Z ]*$/i',CITY)) as NB_BAD_CITY
,sum(^prxmatch('/^\d{5} *$/i',ZIP)) as NB_BAD_ZIP
,sum(^prxmatch('/^\d{1,4} *$/i',ZIP4)) as NB_BAD_ZIP4
,sum(^prxmatch('/^\d{5} *$/i',COUNTY_CODE )) as NB_BAD_CC
from TABLE;
Yes, this is exactly what I'm trying to accomplish. Thank you for the help.
I don't have much experience with Perl Regular Expressions but is there a way to find a series of values that is the length of the variable and if it matches then identify as bad? I want to match repeating but instead of n it's the length of the variable? I also want to identify an ID that is not alphanumeric. For example the ID is a description instead of the A235324.
Case 1:
ID: 0000000.
Case 2:
ID: Physician
Case 3:
ID: A235324
I was trying to present another example so I'm clear about the syntax involved with perl expressions since I'm new to those expressions.
If I have another dataset where I'm trying to identify
data sample;
infile datalines dlm="|" missover dsd;
input CATEGORY $ ID $;
datalines;
Physician|A123242
|0000000
PS220|A123456
run;
Case 3 has the proper format for each variable while case 2 has both wrong, and case 1 has CATEGORY wrong.
> I want to match repeating but instead of n it's the length of the variable
prxmatch('/(\d)\1{4}/',STR);
matches a digit repeated 4 times (ie 5 identical digits)
Another way could be testing:
repeat(first(STR),4)=STR
For the rest you can look at the ANYxxxx and NOTxxxx functions
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.