- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Greetings SAS people,
I have a dataset with a number of different alphanumeric variables in it, something like this:
Record | Seeded | Country | Centre - Candidate | Task | Language Code | Section | Date | Month | Year | Criteria | Score | YearMonth | Qtr | Format | Version | Centre | KeyMark | |
1 | n | United Arab Emirates | AX001170389525345BPF | W2 | AC | 0 | 74326428 | 17-Mar-18 | 3 | 2018 | 1 | 3 | 20183 | 1 | CD | 45R67FFD | BC | n |
2 | n | United Arab Emirates | AX001170389525345BPF | W2 | AC | 0 | 74326428 | 17-Mar-18 | 3 | 2018 | 2 | 8 | 20183 | 1 | CD | 45R67FFD | BC | n |
3 | n | United Arab Emirates | AX001170389525345BPF | W2 | AC | 0 | 74326428 | 17-Mar-18 | 3 | 2018 | 3 | 5 | 20183 | 1 | CD | 45R67FFD | BC | n |
4 | n | United Arab Emirates | AX001170389525345BPF | W2 | AC | 0 | 74326428 | 17-Mar-18 | 3 | 2018 | 4 | 2 | 20183 | 1 | CD | 45R67FFD | BC | n |
5 | n | United Arab Emirates | AX001170389525345BPF | W2 | AC | 0 | 74326428 | 17-Mar-18 | 3 | 2018 | 5 | 4 | 20183 | 1 | CD | 45R67FFD | BC | n |
6 | n | United Arab Emirates | AX001170389525345BPF | W2 | AC | 0 | 74326428 | 17-Mar-18 | 3 | 2018 | 6 | 8 | 20183 | 1 | PB | 45R67FFD | BC | y |
7 | n | United Arab Emirates | AX001170389525345BPF | W1 | AC | 0 | 74326428 | 17-Mar-18 | 3 | 2018 | 1 | 4 | 20183 | 1 | PB | 45R67FFD | BC | y |
8 | n | United Arab Emirates | AX001170389525345BPF | W1 | AC | 0 | 74326428 | 17-Mar-18 | 3 | 2018 | 2 | 1 | 20183 | 1 | PB | 45R67FFD | BC | n |
9 | n | United Arab Emirates | AX001170389525345BPF | W1 | AC | 0 | 74326428 | 17-Mar-18 | 3 | 2018 | 3 | 2 | 20183 | 1 | PB | 45R67FFD | BC | n |
10 | n | United Arab Emirates | AX001170389525345BPF | W1 | AC | 0 | 74326428 | 17-Mar-18 | 3 | 2018 | 4 | 7 | 20183 | 1 | PB | 45R67FFD | BC | y |
This is a massive dataset (65m rows) and all the values change, sometimes there are only two different values, sometimes there are thousands. My task is to transform the data so that all of the content of every variable is numeric only. I could go through and identify all of the variations in data for each variable that is not already a numeric and change them all individually but that feels quite painful. Is there a clever way in SAS to transform a variable so that the transformed value is numeric only, and the system provides some kind of key to show what each original and transformed variable is?
Note I am not trying to convert the formats here, it's the contents that need to change.
Thanks
A
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This paper outlines some of the base method you can use here:
http://support.sas.com/resources/papers/proceedings11/145-2011.pdf
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The key to show a value that has been assigned a numeric value is custom format such as:
Proc format;
value yesno
1 = 'Yes'
0 = 'No'
;
run;
And then associate the format with a variable that takes numeric values of 0 and 1.
A similar INVALUE statement can create a numeric value when reading a variable.
So the tasks involved are
1) getting UNIQUE values
2) decide on format VALUE and INVALUE names (31 character limit for character variables as you need a $ in the name )
3) create the formats. Proc Format can use a specially structured data set to create a format so creating that set
4) use a data step to build a new dataset with the numeric variables and assign the formats.
Here is an example using a data set you should have available as a start.
data junk; set sashelp.class ; array c _character_; length vn $ 32. val $ 100; do i= 1 to dim(c); vn=vname(c[i]); val = upcase(c[i]); vl = vlength(c[i]); output; end; keep vn val vl; run; proc sql; create table use as select distinct vn, val, vl from junk order by vn, val; run; /* create a data set with numeric values from 1 to n for each unique value of the variable by name */ data makefmts; set use; by vn; retain num; if first.vn then num=1 ; else num+1; run; data cntlin; set makefmts; length fmtname $ 32 start label $ 100 hlo $ 3; fmtname=vn; /* characteristics of the informat*/ type='I' ; start = upcase(strip(val)); label= strip(put(num,best5.)); hlo='U'; output; /* characteristics of the FORMAT*/ type='N'; start= strip(put(num,best5.)); label= upcase(strip(val)); default=vl; output; run; proc sort data=cntlin; by fmtname type num; run; proc format library=work cntlin=cntlin cntlout=work.cntlout; run; data work.newclass; set sashelp.class; numname = input(name,name.); numsex = input(sex,sex.); format numname name. numsex sex.; run;
Warnings:
If you keep adding values there is no guarantee that a specific value will maintain the same numeric value. You likely should place the CNTLIN and generated CNTLOUT data sets in permanent library.
If you add variables then you have another start from scratch scenario.
Note that the SORTED order of the CNTLIN data set is very critical.
This uses the name of the variable to create both and informat and format with the same name. If you already have such then you may need to play around with the FMTSEARCH system option and where to place the results of this Proc Format code.
All of the character values will be treated in uppercase. If this is not acceptable then you will need to write a lot of manual code. The purpose of the uppercase options are so that values like "UNITED ARAB EMIRATES" "United Arab Emirates" and "united arab emirates" all would get the same numeric value. If you want the same spelling but different combinations of upper and lower case to get different numeric values you are on your own as there are obviously many more rules involved than you have specified.
The length for VAL in the first data step should be length of the longest expected value of any of the character variables. If you have some sort of open text variable that would contain a paragraph of description you should really consider if it is needed at all for whatever the process may be.
Anything involving dates will in the long run benefit from creating a actual SAS date values you can generate Month, Year, yearmonth and Qtr (assuming it is a calendar quarter) all from the single date as needed if you have that as a date value and sorting and analysis will be much cleaner. By the way a year month like 20183 will not sort properly in either numeric or character form. You would have to make sure that the months were all two digits so that 201803 comes before 201812 if character and if you use a numeric 201812 it will come after 20191.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Do you want to treat each variable separately or all of them together. So if TASK and FORMAT both have the value 'XX' should they get the same number to replace it? Or is it ok if XX in TASK is mapped to a different number than XX in FORMAT?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks to everyone who's helped with this. Where I've got to is I am going to create a numeric variable with a numeric value for each distinct value in my original data. What I need now is to be able to show a comparison between the two variables clearly.
So for example, if I have a table that includes this:
Var 1 | Var 2 |
abc123 | 1 |
abc123 | 1 |
abc123 | 1 |
def345 | 2 |
def345 | 2 |
def345 | 2 |
zxc334 | 3 |
zxc334 | 3 |
bnm567 | 4 |
bnm567 | 4 |
Then I want to be able to create a report, that looks like this, it's a bit like a cut-down Proc Freq:
Var 1 | Var 2 |
abc123 | 1 |
def345 | 2 |
zxc334 | 3 |
bnm567 | 4 |
Can anyone help me with a way to create this kind of report?
Thanks
A
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content