How to remove missing data for many variables from dataset to make new data set with complete data.

Accepted Solution Solved
Reply
Contributor
Posts: 27
Accepted Solution

How to remove missing data for many variables from dataset to make new data set with complete data.

[ Edited ]

Hi Group,

 

I'm using SAS 9.4.

I have a dataset with 3254 variables. There is missing data by design for all variables, but the pattern of missing data is different for each observation. I want to remove the missing to create a new data set with complete data on 200 variables for 300 observations.

 

Below is a small scale example of the data set with missing data by design. There are 5 observations and 5 variables. I want to create a new dataset with 5 observations and 3 variables.

 

Data I have:

id    var1  var2  var3  var4  var5

1       A               B               C

2       C      A               A

3       A                       A       B

4       B      B       C

5       C               A               C

 

 

Data I want:

id    Nvar1  Nvar2  Nvar3

1       A          B          C

2       C          A          A

3       A          A          B

4       B          B          C

5       C          A          C

 

Can someone please post some code to help me solve this problem?

 

Thanks,

Aaron


Accepted Solutions
Solution
‎11-02-2017 05:10 PM
PROC Star
Posts: 1,217

Re: How to remove missing data for many variables from dataset to make new data set with complete da

Something like this?

 

data have;
input id$ var1$ var2$ var3$ var4$ var5$;
infile datalines dsd;
datalines;
1,A,,B,,C
2,C,,A,A,
3,A,,,A,B
4,B,B,C,,
5,C,,A,,C
;

data want;
  set have;
  array old var1-var5 ;
  array new $ Nvar1-Nvar3;
  j=1;
  do i=1 to 5;
    if old[i] ne '' then do;
      new[j]=old[i];
      j+1;
    end;
  end;
  drop var1-var5 j i;
run;

View solution in original post


All Replies
PROC Star
Posts: 1,217

Re: How to remove missing data for many variables from dataset to make new data set with complete da

So you have the exact same number of missing values in each of your records?

Contributor
Posts: 27

Re: How to remove missing data for many variables from dataset to make new data set with complete da

Good question. Actually no.

 

There will be observations with entries on 200 new variables or 150 new variables.

 

If you can post code that creates a file that works for observations that will have values for 200 new variables or 150 new variables, that would be a big help.

 

I was going to try to modify the code on my own for a file with 200 and 150.

 

Thanks,

Aaron

Super User
Posts: 23,323

Re: How to remove missing data for many variables from dataset to make new data set with complete da

1. create an array that references your list of variables of interest.

2. Use CMISS to calculate the number of missing

3. If missing then delete. 

4. Use a KEEP statement to limit it to your specific variables. 

 

array myList(*) $ <list of variables goes here>;

x=cmiss(of myList(*));

if x >0 then delete;

keep <your list of variables here>;
Contributor
Posts: 27

Re: How to remove missing data for many variables from dataset to make new data set with complete da

Thanks,

 

I'll also study this.

 

Aaron

Super User
Posts: 13,338

Re: How to remove missing data for many variables from dataset to make new data set with complete da

Are your variables numeric, character or mixed?

Are there any rules as to which variable ends in a specific order? It appears that you are just shifting them left but small examples don't always convey the complexity involved.

Contributor
Posts: 27

Re: How to remove missing data for many variables from dataset to make new data set with complete da

Hi Thanks,

 

All variables are character variables.

 

No rules about ordering of original variables assigned to new variables.

 

The new variables are stand-in columns of data. Across the 300 observations, data from the original variables measure different concepts, but I want to dump that data to a smaller set of 200 stand-in columns of data that will be read during a mail-merge process that populates a template for a letter that is customized for 300 people. The template only supports 200 entries.

 

Is that clear?

 

Thanks,

Aaron

Solution
‎11-02-2017 05:10 PM
PROC Star
Posts: 1,217

Re: How to remove missing data for many variables from dataset to make new data set with complete da

Something like this?

 

data have;
input id$ var1$ var2$ var3$ var4$ var5$;
infile datalines dsd;
datalines;
1,A,,B,,C
2,C,,A,A,
3,A,,,A,B
4,B,B,C,,
5,C,,A,,C
;

data want;
  set have;
  array old var1-var5 ;
  array new $ Nvar1-Nvar3;
  j=1;
  do i=1 to 5;
    if old[i] ne '' then do;
      new[j]=old[i];
      j+1;
    end;
  end;
  drop var1-var5 j i;
run;
Contributor
Posts: 27

Re: How to remove missing data for many variables from dataset to make new data set with complete da

Thanks, I will try this now.

 

Regards,

Aaron

Super User
Posts: 13,338

Re: How to remove missing data for many variables from dataset to make new data set with complete da


ADouglas wrote:

Thanks, I will try this now.

 

Regards,

Aaron


Something to check for with your data: the New variables should all have a length large enough to hold the longest value that may be encountered. This is needed because you basically say any value could end up in any position. If you define your new variables as 8 characters (default with way that @draycut defined the new variables) and attempt to assign a value with 20 characters the value will be truncated to fit 8.

So the new array definition should be set after looking at your variables (proc contents) for the largest length. then use

 

Array new $ 20 Nvar1 - Nvar3;

using the determined length in place of 20 above.

 

You can also create the variables without listing the names as:

Array nvar {3} $ 20;

to create 3 variables nvar1 nvar2 and nvar3.

PROC Star
Posts: 1,217

Re: How to remove missing data for many variables from dataset to make new data set with complete da

Good point @ballardw Smiley Happy

Contributor
Posts: 27

Re: How to remove missing data for many variables from dataset to make new data set with complete da

Thanks,

 

That is right. Below my code now. You can see that the original variables were set to $1000. and the new variables set to $1000.

 

The code seems to be working, but I am testing it by making comparisons between a sample of the data produced by the SAs code and the result obtained using TEXTJOIN in Excel. I'm not done testing. The macro variable &NUMBER value is 3254.

 

 

/***Apply conditional set statements & assign the VADs for incorrect items***/

 

PROC IMPORT OUT=PEQ.Examinee_Data_Scored   /***Imports the .csv file of examinee item-level test data into SAS***/
  DATAFILE=

'C:\Users\ADouglas\Desktop\Fall_MOC_Performance_Reports\Processing_PEQs\Main_File_Fall_2017_MOC_Results.csv' 
     DBMS=CSV REPLACE;
     GETNAMES=YES;
     DATAROW=2;
RUN;

 

DATA PEQ.Examinee_VADs;
   IF _N_ = 1 THEN SET PEQ.MOTHER_PEQ_KEY_OUT;
   IF _N_ = 1 THEN SET PEQ.MOTHER_PEQ_VAD_OUT;
   SET PEQ.Examinee_Data_Scored;
   INFORMAT ReportVAD1-ReportVAD&NUMBER $1000.;
   FORMAT ReportVAD1-ReportVAD&NUMBER $1000.;
   ARRAY ANS[&NUMBER] $ _&FirstAN-character-_&LastAN;                    /***Array of examinee answers***/
   ARRAY KEY[&NUMBER] $ KEY_&FirstAN-character-KEY_&LastAN;              /***Array of answer keys***/
   ARRAY VAD[&NUMBER] $ VAD_&FirstAN-character-VAD_&LastAN;              /***Array of VADs***/
   ARRAY ReportVAD[&NUMBER] $ ReportVAD1-character-ReportVAD&NUMBER;               /***Array of Report VADs***/
DO I=1 to &NUMBER;
   IF ANS[I] ^= ' ' AND ANS[I] ^= KEY[I] THEN ReportVAD[I] = VAD[I];   /***Assign VAD to Report VAD variables for items answered incorrectly***/
   IF ANS[I] ^= ' ' AND ANS[I]  = KEY[I] THEN ReportVAD[I] = 'Nothing';    /***Assign nothing to Report VAD variables for items answered correctly***/
END;
KEEP External_ID _&FirstAN-character-_&LastAN ReportVAD1-character-ReportVAD&NUMBER;
RUN;

 

PROC CONTENTS DATA=PEQ.Examinee_VADs;
RUN;

 

/***Code to create the smaller dataset with not presented items not included***/

 

DATA PEQ.Feedback_Report;
     RETAIN External_ID Y1-Y200;
     SET PEQ.Examinee_VADs;
  INFORMAT Y1-Y200 $1000.;
     FORMAT   Y1-Y200 $1000.;
     ARRAY ReportVAD[&NUMBER] $ ReportVAD1-character-ReportVAD&NUMBER;               /***Array of Report VADs***/
     ARRAY Y[200] $ Y1-Y200;                                                         /***Array of Report VADs for Template***/
     J=1;
  DO I=1 to &NUMBER;
        IF ReportVAD[I] ^= ' ' THEN DO;
           Y[J]=ReportVAD[I];
     J+1;
  END;
     END;
     KEEP External_ID Y1-Y200;
RUN;

 

 

Super User
Posts: 13,338

Re: How to remove missing data for many variables from dataset to make new data set with complete da

Actually there may be an easier way since you are starting with reading a CSV file.

The default behavior for reading delimited data treats multiple delimiters as a single delimiter.

If you look at the code generated by the Proc Import you will likely  see the option DSD on the infile statement. This changes the read to force each comma to delineate a variable, which  I will say most of the time is the desired behavior.

But with what your are doing if you 1) copy the data step from the log to the editor and 2) remove the DSD you will likely get the appearance you want.

 

Here's a brief example with a small data set example:

data junk;
   infile datalines dlm=',' truncover ;
   informat var1-var5 $10.;
   input var1 - var5;
datalines;
ab,,cd,e,f
a,b,c,d,e,
,,c,d,e
run;

data junk2;
   infile datalines dlm=',' truncover dsd;
   informat var1-var5 $10.;
   input var1 - var5;
datalines;
ab,,cd,e,f
a,b,c,d,e,
,,c,d,e
run;

Junk2 is likely similar to your data and Junk is what you want.

 

This may be very helpful if you are going to be doing this with multiple files.

Super User
Super User
Posts: 9,437

Re: How to remove missing data for many variables from dataset to make new data set with complete da

Something like - note not tested - Post Test data in the form of a datastep - we are not here to type it in for you!:

data want;
  length temp $2000;
  set have;
  temp=catx(',',of var:);
  array var{*};
  call missing(of var:);
  do i=1 to countw(temp,",");
    var{i}=scan(temp,i,",");
  end;
run;
  
Contributor
Posts: 27

Re: How to remove missing data for many variables from dataset to make new data set with complete da

Thanks, I appreciate your help.

 

Aaron

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 14 replies
  • 558 views
  • 6 likes
  • 5 in conversation