@emaguin wrote:
Yes, I think I understand what you are saying. Ok, the file is a 200 variable csv export from qualtricx, a survey program like surveymonkey. Excel is not involved. Record 1 is var names, record 2 is item text, which i've manually deleted. As you would judge correctly, I used proc import to read it. I think the alternative you are suggesting is something like this
proc file1; infile='file name string' dsd;
input v1 v2 $ v3 v4 v5 v6 $ a1-a5 b1-b20 c1-c34;
One question on what i wrote above. Let v3 and v4 be known numeric and v5 and v6 be known character. How do i distinguish v3 and v4 from v5 and v5 with respect to variable type?
You have a lot of experience. What are the tradeoffs between proc import and input? How would you choose one over the other?
I actually use a lot of Proc Import to create a basic input program to customize, especially when given the stuff SurveyMonkey and similar will produce.
You had a good idea about reducing the header rows to one row before using proc import.
One thing that helps is to set the GUESSINGROWS option in proc import very high so more than 20 rows of data are examined before setting the variable types.
The second step is that when importing a CSV file SAS will write the generated data step code to the log. You can copy that code back into the editor, save and modify it. Most of the Format statements can be done away with unless you have preferences.
There will be an informat statement for each variable. You can examine that to see if the variable is of the expected type: $ indicates character, most numeric values will be Best32. (overkill but works). If one of the informats looks incorrect, change it. This may also be a time to use the editor search and replace to rename variables.
I will typically also had Label statements, likely using that information from the second row of the header if it was like the question text or similar.
The tradeoff, between the two procs, at least with the data sources I use is that the import will often create wrong variable types for frequently missing values (skip patterns in surveys or uncommon conditions in other data) vs the time to write data step code. For delimited files the latter is usually not much of problem using the proc import to generate the skeleton unless I have/want to rename lots of variables with names like Client_characteristic_first_visit_for__ (something that gets truncated because the actual column header exceeded 32 characters). I have had data files with literally 100+ column headers that all started with the exact same 20+ characters. Which means shortening them is well worthwhile. And why I add labels.
With many of my recurring data sources I actually create custom informats for variables to indicate such things as errors for variables that should not be missing, or validate against an expected list of known values (Race should not be Purple for instance) or to read some text into numeric values such as "Yes" or "No" to 1/0 codes.
@emaguin wrote:
You have a lot of experience. What are the tradeoffs between proc import and input? How would you choose one over the other?
As soon as a process needs to be repeatable, proc import is out of the question, period. See Maxims 22, 31 and 45.
KEEP has no effect of ordering PDV variables, so it won't work for the purpose. The most reliable method is to use an unvalued RETAIN as the first statement in the DATA step. This is because it makes the compiler defer the determination of the data types for the variables listed under the RETAIN till the end of the step during the compilation. (In fact, I'd even say that this is the only fully reliable method.)
Surely you'd need to prepare the RETAIN list beforehand, but it's easy to extract from the dictionary tables:
data infile ;
array aa ID x1-x10 (11*1) ;
x11 = "11" ;
array zz x12-x20 (9*2) ;
run ;
proc sql noprint ;
select name into :invars separated by " "
from dictionary.columns
where libname = "WORK" and memname = "INFILE"
order varnum
;
quit ;
data outfile (drop = ___) ;
retain &invars ;
set infile (rename = x11 = ___) ;
x11 = input (___, 32.) ;
run ;
Kind regards
Paul D.
I want to try to do a wrap-up of this topic. First, I presented a simple (trivial) made up example when the real problem was more involved, which was my mistake. Many people responded and repeatedly. Thank you.
I'll use this code structure, which several suggested. So i think this part of the problem is solved.
data W1Friday_2; set W1Friday_1 (rename=
(V3=V3x fr1rr7=fr1rr7x fr1rr21=fr1rr21x));
V3=input(V3x,best32.);
fr1rr7=input(fr1rr7x,best32.);
fr1rr21x=input(fr1rr21,best32.);
drop V3x fr1rr7x fr1rr21x;
run;
With respect to the part about preserving variable order, several suggested either Length or Retain.
I've read the documentation about Retain and its seems to be about keeping values in memory across cases (and this is the use that Cody shows in his book). So, if i used Retain where would it be placed in the little example code shown above? Would the statement be this (this now the exact variable order):
Retain v1 v2 v3 v4-v10 Q_TotalDuration Q256 fr1alc1--fr1rr6 fr1rr7 f1fr1rr8-fr1rr20 fr1rr21 fr1rr22--end_1;
or this
Retain v1--end_1;
or this (which someone mentioned)
Retain
Now Length, where would it be placed?
And, how would the statement read?
Again, thank you.
Gene Maguin
RETAIN does have the use that you mention: "its seems to be about keeping values in memory across cases" — but it also forces the ordering of variables, which seems to be what you want. It can go anywhere in your DATA step, but it usually goes as the first command under the DATA command.
LENGTH can also force the ordering of the variables, it must go before the SET or MERGE or UPDATE statement in the DATA step.
You don't need both RETAIN and LENGTH to force the ordering of variables, one will do; although for other programming reasons you may need both.
Maybe I haven't been reading carefully, but it's still not clear to me why you need to put effort into re-ordering the variables. In your example, any code you write subsequently that uses the structure
v1-v10
will work properly, even if you don't re-order the variables.
Both RETAIN and LENGTH as like: FORMAT are declarative statements and you can place then
anywhere: either between DATA and SET or after the SET statement.
But to force variables order it should precede SET statement.
SAS defines the variables in the order that it sees them when compiling the data step. So the suggestions for using RETAIN or other statements are really suggestions for adding code that explicitly names the variables in the order that you want them to appear.
The advantage of RETAIN over LENGTH for this is that it does not actually define the variables type (and length for character variables) and only defines the name. The disadvantage is that it does mark the variable to not be reset to missing at the beginning of each data step iteration. Which could cause issues for variables that are not assigned a value during every iteration. Your example does not sound like one where that would be a concern.
The advantage of LENGTH over RETAIN is that it DOES explicitly define the variables. So if how the variables are defined is important then it is probably worth it to do this explicitly with a LENGTH statement. (or by using the LENGTH= option of the ATTRIB statement).
There are other ways to make sure the order is what you want. For example you can use KEEP= or DROP= options on the incoming datasets. So if you know that your incoming dataset has variable X,y,Z,a,b,c in that order and you want to change the type of A then you could do something like. Note you just need to know what is the FIRST variable and what is the one your want to convert.
data want;
set have(keep=x--a rename=(a=old_a));
length a 8 ;
set have ;
a = input(old_a,32.);
drop old_a;
run;
If you want to convert more than one then your are better off using some type of code generation. For example to generate a LENGTH statement and appropriate RENAME= dataset options, conversion code, and DROP statements.
While LENGTH and RETAIN have some kind of overhead I have used FORMAT statement to define the order,
without defining the format itself.
@Shmuel wrote:
While LENGTH and RETAIN have some kind of overhead I have used FORMAT statement to define the order,
without defining the format itself.
I knew there was another command that could be used.
Personally I still think if the order is important then the types and lengths are also likely just as, if not more, important so it is probably best to explicitly define the variables. Then if the code did try to read in a dataset that had a variable defined with the wrong type you will get an error instead of creating an output dataset with the variable defined with the wrong type.
@Tom, you emphasized "...so it is probably best to explicitly define the variables ..." .
I agree except that for most variables the attributes (type, length, format, label) are inherited from the input dataset(s).
In this case the emphasis is on the internal order of variables, that have impact when you browse the dataset or
use proc print with _ALL_ vars.
True; and a valuable addition to the discussion. An empty format or informat statement can be used without the RETAIN side effect, too.
Plus it's less busy to code than LABEL. In light of which I retract my assertion that an unvalued RETAIN is the most sensible way of ordering variables - unless, of course, one really wants the variables to be retained to begin with.
Kind regards
Paul D.
1. Your original inquiry is how to convert the type of a variable in the middle of a list and keep the new opposite type variable with the same name in the original position. The code you show here places the converted variables fr1rr7 and fr1rr21 at the end of the output variable list.
2. If you really want to impose the desired output order on a number of variables regardless anything else transpiring in the DATA step, using an unvalued RETAIN as the first statement where the compiler sees the first reference to all these variables is the only sensible option. This is because the unvalued RETAIN imposes no attributes on the variables other than their names. The rest of the attributes for each variable (data type, length, etc.) are determined downstream in the step. If the latter doesn't occur, i.e. if the variable isn't referenced in the step before its end, it will remain uninitialized (you'll get a log message to this effect) and it will be dropped from the output, as a variable with its data type undetermined cannot be stored in the descriptor (header) of the output data set, nor can it be stored as part of the general metadata structure in the SAS session.
3. As a corollary, a valued RETAIN may not work because it imposes the data type and length on the variable in question. For the same reason, LENGTH, FORMAT, INFORMAT, and ATTRIB (see an exception for ATTRIB below) may not work, either, as they may create data type and/or length conflicts down the line.
4. The only other statements that can have the effect of an unvalued RETAIN without side effects down the line are LABEL or ATTRIB with only the label attribute listed. Again, this is because all three impose neither a data type nor length on a variable under its control. However, LABEL and ATTRIB impose a label, which is not what one may want. Thus, from the standpoint of variable ordering, using any of the following three statements as the first LOC in the step has the same effect:
retain x y z ;
label x="x" y="y" z="z" ;
attrib x label="x" y label="y" z label="z" ;
You be the judge which one is simpler to code - this is why I said that the unvalued RETAIN is the only sensible option. However, in at least one respect LABEL/ATTRIB have an advantage explained as part of the point below.
5. The main intended effect of the RETAIN statement is to prevent the variables listed in it from being auto-reset to missing values at the top of the implied DATA step loop (valuing its variables at compile time and thus setting their type and length via a valued RETAIN gives it extra functionality; and imposing the variable order is just a useful side effect). However, the main effect of RETAIN may not be what one wants to be in effect (sorry for the tautology). In fact, in certain situations auto-resetting a variable to missing at the top of the implied loop is exactly what one needs (summarization via the DoW loop is a typical example). In this case, if pre-ordering of the output variables is still desired, using LABEL or ATTRIB with nothing but a label attribute (though the latter is a bit busy to code) is a better option than an unvalued RETAIN, because they don't have the "retaining" effect of RETAIN. If having the labels thus imposed is undesirable, they can be wiped out after the data set is written and closed via proc DATASETS or the proc SQL's ALTER statement.
6. RETAIN V1--END_1 as the first statement where the compiler sees END_1 referenced will result in an error because this list syntax notation assumes that END_1 has already been seen by the compiler. But coding such RETAIN after the compiler has already seen END_1 renders its ordering effect void, as the variable order with respect to END_1 has already been established.
Kind regards
Paul D.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.