Hi Group,
I'm using SAS 9.4.
I have a dataset with 3254 variables. There is missing data by design for all variables, but the pattern of missing data is different for each observation. I want to remove the missing to create a new data set with complete data on 200 variables for 300 observations.
Below is a small scale example of the data set with missing data by design. There are 5 observations and 5 variables. I want to create a new dataset with 5 observations and 3 variables.
Data I have:
id var1 var2 var3 var4 var5
1 A B C
2 C A A
3 A A B
4 B B C
5 C A C
Data I want:
id Nvar1 Nvar2 Nvar3
1 A B C
2 C A A
3 A A B
4 B B C
5 C A C
Can someone please post some code to help me solve this problem?
Thanks,
Aaron
Something like this?
data have;
input id$ var1$ var2$ var3$ var4$ var5$;
infile datalines dsd;
datalines;
1,A,,B,,C
2,C,,A,A,
3,A,,,A,B
4,B,B,C,,
5,C,,A,,C
;
data want;
set have;
array old var1-var5 ;
array new $ Nvar1-Nvar3;
j=1;
do i=1 to 5;
if old[i] ne '' then do;
new[j]=old[i];
j+1;
end;
end;
drop var1-var5 j i;
run;
So you have the exact same number of missing values in each of your records?
Good question. Actually no.
There will be observations with entries on 200 new variables or 150 new variables.
If you can post code that creates a file that works for observations that will have values for 200 new variables or 150 new variables, that would be a big help.
I was going to try to modify the code on my own for a file with 200 and 150.
Thanks,
Aaron
1. create an array that references your list of variables of interest.
2. Use CMISS to calculate the number of missing
3. If missing then delete.
4. Use a KEEP statement to limit it to your specific variables.
array myList(*) $ <list of variables goes here>;
x=cmiss(of myList(*));
if x >0 then delete;
keep <your list of variables here>;
Thanks,
I'll also study this.
Aaron
Are your variables numeric, character or mixed?
Are there any rules as to which variable ends in a specific order? It appears that you are just shifting them left but small examples don't always convey the complexity involved.
Hi Thanks,
All variables are character variables.
No rules about ordering of original variables assigned to new variables.
The new variables are stand-in columns of data. Across the 300 observations, data from the original variables measure different concepts, but I want to dump that data to a smaller set of 200 stand-in columns of data that will be read during a mail-merge process that populates a template for a letter that is customized for 300 people. The template only supports 200 entries.
Is that clear?
Thanks,
Aaron
Something like this?
data have;
input id$ var1$ var2$ var3$ var4$ var5$;
infile datalines dsd;
datalines;
1,A,,B,,C
2,C,,A,A,
3,A,,,A,B
4,B,B,C,,
5,C,,A,,C
;
data want;
set have;
array old var1-var5 ;
array new $ Nvar1-Nvar3;
j=1;
do i=1 to 5;
if old[i] ne '' then do;
new[j]=old[i];
j+1;
end;
end;
drop var1-var5 j i;
run;
Thanks, I will try this now.
Regards,
Aaron
@ADouglas wrote:
Thanks, I will try this now.
Regards,
Aaron
Something to check for with your data: the New variables should all have a length large enough to hold the longest value that may be encountered. This is needed because you basically say any value could end up in any position. If you define your new variables as 8 characters (default with way that @PeterClemmensen defined the new variables) and attempt to assign a value with 20 characters the value will be truncated to fit 8.
So the new array definition should be set after looking at your variables (proc contents) for the largest length. then use
Array new $ 20 Nvar1 - Nvar3;
using the determined length in place of 20 above.
You can also create the variables without listing the names as:
Array nvar {3} $ 20;
to create 3 variables nvar1 nvar2 and nvar3.
Good point @ballardw 🙂
Thanks,
That is right. Below my code now. You can see that the original variables were set to $1000. and the new variables set to $1000.
The code seems to be working, but I am testing it by making comparisons between a sample of the data produced by the SAs code and the result obtained using TEXTJOIN in Excel. I'm not done testing. The macro variable &NUMBER value is 3254.
/***Apply conditional set statements & assign the VADs for incorrect items***/
PROC IMPORT OUT=PEQ.Examinee_Data_Scored /***Imports the .csv file of examinee item-level test data into SAS***/
DATAFILE=
'C:\Users\ADouglas\Desktop\Fall_MOC_Performance_Reports\Processing_PEQs\Main_File_Fall_2017_MOC_Results.csv'
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
DATA PEQ.Examinee_VADs;
IF _N_ = 1 THEN SET PEQ.MOTHER_PEQ_KEY_OUT;
IF _N_ = 1 THEN SET PEQ.MOTHER_PEQ_VAD_OUT;
SET PEQ.Examinee_Data_Scored;
INFORMAT ReportVAD1-ReportVAD&NUMBER $1000.;
FORMAT ReportVAD1-ReportVAD&NUMBER $1000.;
ARRAY ANS[&NUMBER] $ _&FirstAN-character-_&LastAN; /***Array of examinee answers***/
ARRAY KEY[&NUMBER] $ KEY_&FirstAN-character-KEY_&LastAN; /***Array of answer keys***/
ARRAY VAD[&NUMBER] $ VAD_&FirstAN-character-VAD_&LastAN; /***Array of VADs***/
ARRAY ReportVAD[&NUMBER] $ ReportVAD1-character-ReportVAD&NUMBER; /***Array of Report VADs***/
DO I=1 to &NUMBER;
IF ANS[I] ^= ' ' AND ANS[I] ^= KEY[I] THEN ReportVAD[I] = VAD[I]; /***Assign VAD to Report VAD variables for items answered incorrectly***/
IF ANS[I] ^= ' ' AND ANS[I] = KEY[I] THEN ReportVAD[I] = 'Nothing'; /***Assign nothing to Report VAD variables for items answered correctly***/
END;
KEEP External_ID _&FirstAN-character-_&LastAN ReportVAD1-character-ReportVAD&NUMBER;
RUN;
PROC CONTENTS DATA=PEQ.Examinee_VADs;
RUN;
/***Code to create the smaller dataset with not presented items not included***/
DATA PEQ.Feedback_Report;
RETAIN External_ID Y1-Y200;
SET PEQ.Examinee_VADs;
INFORMAT Y1-Y200 $1000.;
FORMAT Y1-Y200 $1000.;
ARRAY ReportVAD[&NUMBER] $ ReportVAD1-character-ReportVAD&NUMBER; /***Array of Report VADs***/
ARRAY Y[200] $ Y1-Y200; /***Array of Report VADs for Template***/
J=1;
DO I=1 to &NUMBER;
IF ReportVAD[I] ^= ' ' THEN DO;
Y[J]=ReportVAD[I];
J+1;
END;
END;
KEEP External_ID Y1-Y200;
RUN;
Actually there may be an easier way since you are starting with reading a CSV file.
The default behavior for reading delimited data treats multiple delimiters as a single delimiter.
If you look at the code generated by the Proc Import you will likely see the option DSD on the infile statement. This changes the read to force each comma to delineate a variable, which I will say most of the time is the desired behavior.
But with what your are doing if you 1) copy the data step from the log to the editor and 2) remove the DSD you will likely get the appearance you want.
Here's a brief example with a small data set example:
data junk; infile datalines dlm=',' truncover ; informat var1-var5 $10.; input var1 - var5; datalines; ab,,cd,e,f a,b,c,d,e, ,,c,d,e run; data junk2; infile datalines dlm=',' truncover dsd; informat var1-var5 $10.; input var1 - var5; datalines; ab,,cd,e,f a,b,c,d,e, ,,c,d,e run;
Junk2 is likely similar to your data and Junk is what you want.
This may be very helpful if you are going to be doing this with multiple files.
Something like - note not tested - Post Test data in the form of a datastep - we are not here to type it in for you!:
data want; length temp $2000; set have; temp=catx(',',of var:); array var{*}; call missing(of var:); do i=1 to countw(temp,","); var{i}=scan(temp,i,","); end; run;
Thanks, I appreciate your help.
Aaron
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.