Hi all,
I work with banking data for customers and need to build a datamart for different purposes. We read data from several sources, and it may not be properly cleaned or inserted accurately.
I also derive variables from existing ones.
My question is about reading the data in a way that keeps it as accurate as possible.
For example, when deriving variables from character input that may arrive in upper or lower case, I apply the UPCASE function while reading.
What would be your recommendations?
The most basic things you might want to check:
- missings
- data out of a valid range (if there is a valid range defined)
- invalid characters (eg. control characters)
- duplicate keys or rows - (only if this is still possible even though you are sourcing from dbms - eg. when combining tables from 2 different dbms).
And for look-ups
- no match over look-up key (if there must be a match, eg. to a row in a reference table).
To get a feel for the data quality, Proc Freq and Proc Univariate are quite helpful. E.g. run a simple frequency count over your variables and then "eyeball" the cases with low frequency (say product names, where you find names that occur only a few times in a data set with millions of rows).
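As a minimal sketch of such a profiling step (the table WORK.CUSTOMERS and the variables product_name and balance are illustrative, not from the original post):

```sas
/* Frequency count, rarest categories last, to spot possible misspellings */
proc freq data=work.customers order=freq;
   tables product_name / missing;
run;

/* Distribution check: extremes and missing counts for a numeric variable */
proc univariate data=work.customers;
   var balance;
run;
```

The ORDER=FREQ option sorts the frequency table so the low-count (suspicious) values are easy to eyeball at the bottom.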
"But I would need more examples like this to avoid any future possibilities to commit any errors."
If data quality is not good then it's very hard to foresee all future possible cases. I would concentrate on the business keys so that bad data there can't mess up relationships (and you suddenly end up with many-to-many joins).
Look at functions COMPLEV, COMPGED, SPEDIS and SOUNDEX as ways to find approximate matches.
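For instance, a hedged sketch of using COMPLEV to flag near-duplicate names (the dataset, the ref_name variable, and the cutoff of 2 are assumptions for illustration):

```sas
data near_matches;
   set work.products;
   /* Levenshtein edit distance between the raw name and a reference name */
   dist = complev(upcase(strip(product_name)), upcase(strip(ref_name)));
   /* Close but not identical: worth a manual review */
   if 0 < dist <= 2 then output;
run;
```

COMPGED works similarly but with configurable costs per edit operation, which can give better rankings for typical data-entry errors.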
PG
Hi Bhupinder,
I also work with big data. I always use the following approach to read raw data:
data test;
   infile "file_name.csv" delimiter=',' missover dsd lrecl=32767 firstobs=2;
   input
      name       : $10.
      occupation : $50.
      item       : $25.
   ;
run;
This way you can always be sure that all the data are properly read with the appropriate informat.
Rgds.
The questions I would ask:
- What is the impact if data is not "clean"?
- Do I have a requirement to cleanse the data (and is there time/money planned for this)?
- Where should/can the data cleansing happen (as early as possible - but what is realistic)?
The first thing I would look at are the business keys. If these are not clean then you must find a way to cleanse them (or get the source systems to cleanse them). If they are character fields, I normally compress and upcase such keys for joining (and then add a surrogate key to the target table so I don't have to care about mismatching business keys later on).
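A minimal sketch of that standardization, assuming a character business key cust_id (the table and variable names are illustrative):

```sas
/* Standardize a character business key before joining:
   strip blanks and force a single case so 'ab 12' and 'AB12' match */
data stg_accounts;
   set src.accounts;
   length join_key $20;
   join_key = upcase(compress(cust_id));
run;
```

Joining on join_key instead of the raw cust_id avoids spurious non-matches caused by casing or embedded blanks.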
If things are really bad then you can also consider using match codes.
If you need to cleanse the data: Be prepared that this can be quite a bit of work.
There is Ron Cody's book "Cody's Data Cleaning Techniques Using SAS" which might be helpful (see the SAS Press Ron Cody author page), and there is of course also DataFlux for such tasks.
And for your ETL design: Design for Data Validation steps creating Error and Exception tables to capture issues. And also plan for the actions to be taken if issues are found (eg. stop load and send emails/reports on issues, or continue load or ....). So you need here to plan for technical processes but also for business processes (who monitors these tables, who needs to take action and which action).
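One way to sketch such a validation step that routes bad rows to an exception table (the tables, variables, and rules here are assumptions for illustration):

```sas
/* Split the load: rows failing basic checks go to EXCEPTIONS
   with a reason code instead of being loaded silently */
data target exceptions;
   set staging;
   length reason $20;
   if missing(cust_id) or balance < 0 then do;
      reason = ifc(missing(cust_id), 'MISSING KEY', 'NEGATIVE BALANCE');
      output exceptions;
   end;
   else output target;
run;
```

The exceptions table then feeds the business process you describe: someone monitors it, decides on the action, and the load can stop or continue based on its row count.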
Thanks Patrick. Basically, I work on creating a SAS process for datamart production which fetches historical data of up to 100 million rows, sometimes more, from different sources (never txt or csv, always a DBMS). But I have realized there are loopholes and data issues.
While reading the data we must apply some checks to make sure that whatever we read or derive from it cannot go wrong. The Upcase example I mentioned comes from a recent realization: I found an error in the output and needed to apply the UPCASE function as well.
But I would need more examples like this to avoid committing any errors in the future.
I will look into the book you referred to.
Bhupinder
Thanks for the reply. It makes a lot of sense to first analyse your input data with basic procedures like FREQ and UNIVARIATE before blindly processing it.
Thanks