I need to analyze data fields in some large flat files and I was thinking about using PROC FREQ to get a list of unique values for every column for starters.
We have the Enterprise version of SAS, but I haven't used SAS in several years, so I'm wondering what SAS tools may be available to get a "bird's-eye view" of the data, or at least to get a list of unique values for every column in these files?
Right now, I'm thinking about running a PROC FREQ on every column and comparing the unique values in those outputs to determine data relationships, but before doing that, I wanted to check with you all and see if you could suggest a more effective approach to this kind of data analysis?
Long ago, in a galaxy far, far away, when I was a young SAS newbie (no comments from the peanut gallery, those of you who know how -long- ago that was when I was "young"), "characterize your data" meant doing a bunch of PROC FREQs on all the character variables in the dataset and a PROC UNIVARIATE on all the numeric variables (to find the extreme obs, the mean and the median values).
And, sure enough, if you check out the EG "Characterize Data" task, it's doing just that -- the equivalent of PROC CONTENTS, PROC FREQ and PROC UNIVARIATE -- along with some GCHARTS thrown in to graphically show you the data and some PROC PRINTS of the data.
When I worked for lawyers, we'd give them that big stack of paper and then they'd formulate the questions (let's see the salary history of these 20 people; who makes over the median salary in these job categories, etc, etc.) but they'd always want the reams and reams of initial "paper" first.
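For anyone who wants to reproduce that old-style characterization by hand, here is a minimal sketch of the three steps described above. It assumes the flat file has already been read into a dataset; WORK.LOANS is a made-up name, and this is not the code that the EG task itself generates.

```sas
/* WORK.LOANS is a hypothetical dataset name; substitute your own. */

/* Variable names, types, lengths and formats */
proc contents data=work.loans;
run;

/* One-way frequency table for every character variable,
   counting missing values as a level of their own */
proc freq data=work.loans;
   tables _character_ / missing;
run;

/* Extreme observations, mean, median, etc. for every numeric variable */
proc univariate data=work.loans;
   var _numeric_;
run;
```

The `_character_` and `_numeric_` variable lists save typing every column name, but on a wide file with free-text or ID columns this will produce exactly the "reams and reams" of paper mentioned above.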
My recommendation above is because I have never ever found any value in the PROC FREQ/PROC MEANS-on-everything approach. If the researcher cannot specify in advance some things he might want to know about the data, if it's just a huge fishing expedition, then I don't expect much success other than by random chance.
So, if someone wants to compute means, or frequencies, on phone numbers, please go right ahead. I'll pass ...
Thanks Cynthia. I'm having the Enterprise Guide software installed this week for the first time, so I'm looking forward to giving it a try, especially the "Characterize Data" feature. I'm not expecting any miracles, but if I can just get lists of the finite, higher-level control-type values, that would be a good start. For example, from a list of mortgage types I would be able to say that this file only contains ALT-A, JUMBO and SUBPRIME data, etc.
Cynthia, I do remember when I ran a PROC FREQ on a dollar value in a billion-record file and the output was huge. So I was wondering if there may be a feature in PROC FREQ or "Characterize Data" that will allow me to set a limit of, let's say, 10000, so that if the number of unique values collected by a PROC FREQ or CD exceeds 10000, the PROC FREQ or CD would abort?
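One possible way to get a guard like this in plain PROC FREQ is a two-pass sketch: the NLEVELS option reports the number of distinct values per variable without printing the tables, and the full frequencies are then run only on the variables under the cutoff. This does not abort mid-run as asked; it still reads the whole file in the first pass. WORK.LOANS, WORK.LEVEL_COUNTS, the macro variable KEEPVARS and the 10000 cutoff are all placeholders.

```sas
/* Pass 1: count distinct levels per variable, no frequency tables printed. */
/* WORK.LOANS and WORK.LEVEL_COUNTS are hypothetical names.                 */
ods output nlevels=work.level_counts;
proc freq data=work.loans nlevels;
   tables _all_ / noprint;
run;

/* Collect the names of the variables at or under the cutoff */
/* into a macro variable (KEEPVARS is a made-up name).       */
proc sql noprint;
   select TableVar
      into :keepvars separated by ' '
      from work.level_counts
      where NLevels <= 10000;
quit;

/* Pass 2: full frequency tables for the low-cardinality variables only. */
/* If no variable qualifies, KEEPVARS is never set and this step fails.  */
proc freq data=work.loans;
   tables &keepvars / missing;
run;
```

The WORK.LEVEL_COUNTS dataset is useful on its own: sorted by NLevels, it shows at a glance which columns look like codes (a handful of levels) and which look like amounts or identifiers.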
"Characterize data" does allow you to set a limit for the number of distinct values that it shows (the default is 30). It displays the **first** k distinct values, which can be misleading if there is some inherent order in the data that you don't know about. A separate limitation to "Characterize data" is that it only does the frequencies on character variables; it does mean/medians/etc. on numerics. Maybe EGuide 4.2 will be smarter.