Hi SAS Community,
I am importing pipe delimited .txt and .csv files into SAS and want to run a series of checks:
1. Count the number of records
2. Count the number of distinct records for ID
3. Whether the variables match a list of set variables
4. Are the values expected or in the correct format
Is there a way to output 1 dataset with all this information? As it currently stands,I produce these results into a text file that are the results of multiple proc frequencies and proc sql.
These are my expected variables:
data VARS;
input vars $;
datalines;
ID
BDAY
GENDER
ADDRESS1
ADDRESS2
CITY
STATE
ZIP
ZIP4
COUNTY_CODE
;
run;
This is my sample data:
data sample;
infile datalines dlm="|" missover dsd;
input ID $2 BDAY $10 GENDER $ ADDRESS1 $ ADDRESS2 $10. CITY $ STATE $ ZIP $ ZIP4 $ COUNTY_CODE $;
datalines;
A1|20200420|M|123 Main St.|Suite 201|Juneau|AK|99802||02112
B2|4/20/2020|Male|124 Main St.||Juneau|AK|99802|Juneau
C3|4/20/2020|M|125 Main St.||Juneau|AK|99802||02112
4-1|20200420|Male|126 Main St.|Suite 101|Juneau|AK|99802||Juneau
;
run;
Conditions:
ID:should be alpha numeric and 2 characters
BDAY: should be YYYYMMDD format
Gender: should be 1 characters
Address1 and Address 2 should be character
City should character
State should be 2 characters
zip should be 5 characters
zip4 should be no more than 4 characters
county should be the fips code which is 5 characters
Results:
Line 1: accurate and meets all criteria
Line 2: dob is not in correct format, gender is not 2 characters, does not have zip4 variable, county does not have 5 character
Line 3: dob is not in correct format
Line 4: id is not alpha numeric with no special characters, and gender is not 2 characters
So far I have:
/*Variables match list? */
proc sql;
create table plan_vars as
select strip(upcase(name)) as vars
from sashelp.vcolumn
where libname='WORK' and memname =SAMPLE;
quit;
proc sql;
create table comparevar_&file. as
select a.*, b.*,
case when a.vars = b.varsm then 'Match'
else 'No' end as var_match
from VARS as a
full join plan_vars(rename=(vars=varsm)) as b
on a.vars=b.varsm;
quit;
/*output records */
ods rtf file="&output.\FileReview.rtf";
ods noptitle;
options nodate nonumber;
proc sql;
create table ctrltot_&file. as
select count(*) as Total_number_of_records,
count(distinct ID) as Total_number_of_unique_ID,
count (ID) as tot_ID
from sample;
quit;
proc freq data = sample;
tables _all_;
format _numeric_ _character_ $miss.;
run;
proc freq data = ctrltot_&file.;
tables _all_;
run;
proc freq data=comparevar_&file.;
tables _all_;
where var_match = 'No';
run;
Something like this?
proc sql;
select count(*) as NB_RECORDS
,count(distinct ID) as NB_UNQ_ID
,sum(^prxmatch('/^\w\w$/',ID)) as NB_BAD_ID
,sum(^input(BDAY,yymmdd8.)) as NB_BAD_BDAY
,sum(^prxmatch('/^[MF]$/i',GENDER)) as NB_BAD_GENDER
,sum(^prxmatch('/^[A-Z ]*$/i',CITY)) as NB_BAD_CITY
,sum(^prxmatch('/^\d{5} *$/i',ZIP)) as NB_BAD_ZIP
,sum(^prxmatch('/^\d{1,4} *$/i',ZIP4)) as NB_BAD_ZIP4
,sum(^prxmatch('/^\d{5} *$/i',COUNTY_CODE )) as NB_BAD_CC
from TABLE;
> Line 1: accurate and meets all criteria
How?
ID=A1 is not 7 characters, GENDER=M is not 2 characters etc
Sorry, my apologies. It was a sample and I forgot to change the length. It should be 2 and 1, respectively.
Something like this?
proc sql;
select count(*) as NB_RECORDS
,count(distinct ID) as NB_UNQ_ID
,sum(^prxmatch('/^\w\w$/',ID)) as NB_BAD_ID
,sum(^input(BDAY,yymmdd8.)) as NB_BAD_BDAY
,sum(^prxmatch('/^[MF]$/i',GENDER)) as NB_BAD_GENDER
,sum(^prxmatch('/^[A-Z ]*$/i',CITY)) as NB_BAD_CITY
,sum(^prxmatch('/^\d{5} *$/i',ZIP)) as NB_BAD_ZIP
,sum(^prxmatch('/^\d{1,4} *$/i',ZIP4)) as NB_BAD_ZIP4
,sum(^prxmatch('/^\d{5} *$/i',COUNTY_CODE )) as NB_BAD_CC
from TABLE;
Yes, this is exactly what I'm trying to accomplish. Thank you for the help.
I don't have much experience with Perl Regular Expressions but is there a way to find a series of values that is the length of the variable and if it matches then identify as bad? I want to match repeating but instead of n it's the length of the variable? I also want to identify an ID that is not alphanumeric. For example the ID is a description instead of the A235324.
Case 1:
ID: 0000000.
Case 2:
ID: Physician
Case 3:
ID: A235324
I was trying to present another example so I'm clear about the syntax involved with perl expressions since I'm new to those expressions.
If I have another dataset where I'm trying to identify
data sample;
infile datalines dlm="|" missover dsd;
input CATEGORY $ ID $;
datalines;
Physician|A123242
|0000000
PS220|A123456
run;
Case 3 has the proper format for each variable while case 2 has both wrong, and case 1 has CATEGORY wrong.
> I want to match repeating but instead of n it's the length of the variable
prxmatch('/(\d)\1{4}/',STR);
matches a digit repeated 4 times (ie 5 identical digits)
Another way could be testing:
repeat(first(STR),4)=STR
For the rest you can look at the ANYxxxx and NOTxxxx functions
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.