SAS Innovation: Programatically clean excel data, determine type and length on the excel side before import
1. You were are given an excel file from headquarters and asked to import the data to SAS.
2. You are concerned about type, length and bad data
3. You would like to clean the data on the excel side and determine all types and lengths before importing
With the libname and connection engines opart of base SAS.
It may be possible, and I have not tried this, to create an excel
view of any database that you have an ODBC driver and use the code below to import from
Oracle, MS-Access, SQL-Server...
HAVE EXCEL NAMED RANGE 'CLASS' FROM HEADQUARTERS
==================================================
WE HAVE NOT IMPORTED THE DATA TO SAS YET
EXCEL.class total obs=19
Obs WGTMIX NAME SEX AGE HEIGHT
1 113 Alfred M 14 69.0
2 A0084 Alice F 13 56.5
3 98 Barbara F 13 65.3
4 A0103 Carol F 14 62.8
5 103 Henry M 14 63.5
6 A0083 James M 12 57.3
WHAT WE WANT
1. Determine type and length
2. Fix WGTMIX by converting A0084 to 84, A0103 to 103 ...
BEFORE IMPORTING TO SAS
Number of rows with character data (from excel data not imported)
NumObs NumChrName NumChrSex NumChrAge NumChrHgt NumChrWgtMix
----------------------------------------------------------------------
19 19 19 0 0 9
From this we know
Name and Sex are all character
Age and Height are all numeric
Weight is both numeric and non-numeric (show you how to get nulls later)
Max length for each variable
MaxLenName MaxLenSex MaxLenAge MaxLenWgt MaxLenHgt
---------------------------------------------------------------
7 1 2 5 4
* This what we need to import cleaned data;
libname xls "d:/xls/vueclass.xlsx";
data class;
set xls.class(dbsastype=
(
Name = 'char(7)'
Sex = 'char(1)'
Age = 'numeric'
Weight = 'numeric'
Height = 'numeric'
));
run;quit;
libame xls clear;
SOLUTION ( CREATE THE EXCEL INPUT)
==================================
libname xel "d:\xls\class.xlsx" scan_text=no;
data xel.class;
retainname sex age height wgtmix;
length wgtmix $12;
set sashelp.class;
if mod(_n_,2)=0 then wgtmix=cats("A",put(weight,z4.));
else wgtmix=put(weight,4.);
keep name sex age height wgtmix;
run;quit;
libname xel clear;
/* how many columns have at least one character cell */
proc sql dquote=ansi;
connect to excel (Path="d:\xls\class.xlsx" mixed=yes);
select * from connection to Excel
(
Select
count(*) as NumObs
,count(*) + sum(isnumeric(name)) as NumChrName
,count(*) + sum(isnumeric(sex)) as NumChrSex
,count(*) + sum(isnumeric(age)) as NumChrAge
,count(*) + sum(isnumeric(height)) as NumChrHgt
,count(*) + sum(isnumeric(wgtmix)) as NumChrWgtMix
from
class
);
disconnect from Excel;
Quit;
Number of observations that have at least one character variable.
NumObs NumChrName NumChrSex NumChrAge NumChrWgtMix
---------------------------------------------------------
19 19 19 0 9
/* what is the max length of the columns */
proc sql dquote=ansi;
connect to excel (Path="d:\xls\class.xlsx" mixed=yes);
select * from connection to Excel
(
Select
max(len(name)) as MaxLenName
,max(len(sex)) as MaxLenSex
,max(len(age)) as MaxLenAge
,max(len(wgtmix)) as MaxLenWgtMix
,max(len(height)) as MaxLenHgt
from
class
);
disconnect from Excel;
Quit;
/*
MaxLenName MaxLenSex MaxLenAge MaxLenWgtMix MaxLenHgt
--------------------------------------------------------------
7 1 2 5 4
*/
HERE IS THE FINAL IMPORT
proc sql dquote=ansi;
connect to excel (Path="c:\xls\class.xlsx" mixed=yes);
create table xls_class as
select
ChrName length=7
,ChrSex length=1
,ChrWgtMix length=5
,input(ChrAge,4.) as NumAge
,input(ChrHgt,5.1) as NumHgt
,input(ChrWgtFix,4.) as NumWgt
from connection to Excel
(
/* make the changes on the excel size - send this to headquarters for confirmation */
/* formats provide documentation on what should be in the cells */
/* all fields ar character here - which is what they really are originally? */
Select
Name as ChrName
,Sex as ChrSex
,format(Age,'###') as ChrAge
,WgtMix as ChrWgtMix
,format(Height,'###.#') as ChrHgt
/* remove the leading A, make it an integer and format to 3 digits -99 to 999 */
,format(int(iif(mid(WgtMix,1,1)="A",mid(WgtMix,2,4),WgtMix)),'###') as ChrWgtFix
from
class
);
disconnect from Excel;
Quit;
proc sql;
drop table xel.class; /* only useful if rerunning */
quit;
... View more