09-20-2015 11:42 PM
I'm trying to do some research with a professor and I've been tasked with learning SAS. I'm trying to use the proc sort command to sort some of the data and would like some help doing this:
1) Sorting data between years, i.e. from 1985-2005
2) Sorting data so it filters out/eliminates data with a certain character in the first slot of a cell. I.e., for SIC codes, I want to filter out any that start with 6 (6123, 6149, etc)
3) I want to add a column that shows how many firms are in a SIC 1-digit code. For example, I want to count the firms that start with 5 (5124, 5655), then sort them by Year. So an example would look like this:
1) A 5100 2
2) B 5101 2
3) C 4231 1
Thank you! If anyone can help me find some beginner tutorials/videos too, that would be great.
09-21-2015 03:08 AM
As you suggest yourself, as a beginner, you need some training.
If you don't have the possibility to attend a physical training location, there are e-learning options. You can start with the free tutorials, and see what that brings you:
There are probably a lot of stuff on youtube as well.
If you find SAS documentation hard to navigate, try to read SAS literature, like The Little SAS Book, or SAS for Dummies:
Proc Sort does what the name implies, sorts data, and not much more.
You can filter data there using the almost global WHERE statement.
For manipulating data you could use the Data Step, or SQL.
For doing statistical summaries, use (again) SQL, or some other SAS Procedures (Summary, Tabulate, Report).
09-21-2015 12:21 PM
09-21-2015 11:34 PM
1) So it would only show the data from 1985-2005, for this, would I use proc sort or would I use data conditional;? For this, I've posted what I've figured out so far below.
if fyear GT 1985 and fyear LT 2005 then;
I know this might not be correct, but could someone nudge me in the right direction? I don't want it to be an if statement, I just don't know how to filter so it ONLY shows between 1985 and 2005.
2) One column of data is SIC codes, so...
I'm trying to eliminate any observations that begin with a "6" (as they're more complicated) so after running my sort or filter, it would only show the "7043" data row.
09-21-2015 05:24 PM
PROC FREQ, one of the basic summarization procedures, will by default generate output in order, numeric or alphabetic and the data does not need to be sorted first.
Proc freq data= have;
tables year*siccode/ list;
Will generate a single output table with year first and then the SIC codes. You can add a where clause to filter data without sorting.
proc freq data=have;
where substr(siccode,1,1) = '5'; /*if the sic code is character, which it really should be unless you are doing arithmetic with it*/
where floor(siccode/1000) = 5; /* if numeric*/
tables year*siccode/ list;
09-21-2015 11:38 PM
Thank you so much! Modified your tables year*sic/list code and got my sort from 1985-2005. If I wanted to add more of my data columns to that table, would I just add them like this?
09-22-2015 11:22 AM
Yes adding the additional variables with the * says to show the combinations in the data. Note that if a variable has missing data then that does not appear in the output unless you include an instruction MISSING in the option list after the / separator.
Also if you have many levels for each variable you might exceed the display output or it will take a long time to produce. If you have 5 variables with 20 levels each you are asking for 3,200,000 rows of output. So be careful. I have seen an errant Proc freq table output generate more than 5 stack feet of not-quite-as-intended output from a line printer.