BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ADouglas
Obsidian | Level 7

Hi Group,

 

I'm using SAS 9.4.

I have a dataset with 3254 variables. There is missing data by design for all variables, but the pattern of missing data is different for each observation. I want to remove the missing to create a new data set with complete data on 200 variables for 300 observations.

 

Below is a small scale example of the data set with missing data by design. There are 5 observations and 5 variables. I want to create a new dataset with 5 observations and 3 variables.

 

Data I have:

id    var1  var2  var3  var4  var5

1       A               B               C

2       C      A               A

3       A                       A       B

4       B      B       C

5       C               A               C

 

 

Data I want:

id    Nvar1  Nvar2  Nvar3

1       A          B          C

2       C          A          A

3       A          A          B

4       B          B          C

5       C          A          C

 

Can someone please post some code to help me solve this problem?

 

Thanks,

Aaron

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

Something like this?

 

data have;
input id$ var1$ var2$ var3$ var4$ var5$;
infile datalines dsd;
datalines;
1,A,,B,,C
2,C,,A,A,
3,A,,,A,B
4,B,B,C,,
5,C,,A,,C
;

data want;
  set have;
  array old var1-var5 ;
  array new $ Nvar1-Nvar3;
  j=1;
  do i=1 to 5;
    if old[i] ne '' then do;
      new[j]=old[i];
      j+1;
    end;
  end;
  drop var1-var5 j i;
run;

View solution in original post

14 REPLIES 14
PeterClemmensen
Tourmaline | Level 20

So you have the exact same number of missing values in each of your records?

ADouglas
Obsidian | Level 7

Good question. Actually no.

 

There will be observations with entries on 200 new variables or 150 new variables.

 

If you can post code that creates a file that works for observations that will have values for 200 new variables or 150 new variables, that would be a big help.

 

I was going to try to modify the code on my own for a file with 200 and 150.

 

Thanks,

Aaron

Reeza
Super User

1. create an array that references your list of variables of interest.

2. Use CMISS to calculate the number of missing

3. If missing then delete. 

4. Use a KEEP statement to limit it to your specific variables. 

 

array myList(*) $ <list of variables goes here>;

x=cmiss(of myList(*));

if x >0 then delete;

keep <your list of variables here>;
ballardw
Super User

Are your variables numeric, character or mixed?

Are there any rules as to which variable ends in a specific order? It appears that you are just shifting them left but small examples don't always convey the complexity involved.

ADouglas
Obsidian | Level 7

Hi Thanks,

 

All variables are character variables.

 

No rules about ordering of original variables assigned to new variables.

 

The new variables are stand-in columns of data. Across the 300 observations, data from the original variables measure different concepts, but I want to dump that data to a smaller set of 200 stand-in columns of data that will be read during a mail-merge process that populates a template for a letter that is customized for 300 people. The template only supports 200 entries.

 

Is that clear?

 

Thanks,

Aaron

PeterClemmensen
Tourmaline | Level 20

Something like this?

 

data have;
input id$ var1$ var2$ var3$ var4$ var5$;
infile datalines dsd;
datalines;
1,A,,B,,C
2,C,,A,A,
3,A,,,A,B
4,B,B,C,,
5,C,,A,,C
;

data want;
  set have;
  array old var1-var5 ;
  array new $ Nvar1-Nvar3;
  j=1;
  do i=1 to 5;
    if old[i] ne '' then do;
      new[j]=old[i];
      j+1;
    end;
  end;
  drop var1-var5 j i;
run;
ballardw
Super User

@ADouglas wrote:

Thanks, I will try this now.

 

Regards,

Aaron


Something to check for with your data: the New variables should all have a length large enough to hold the longest value that may be encountered. This is needed because you basically say any value could end up in any position. If you define your new variables as 8 characters (default with way that @PeterClemmensen defined the new variables) and attempt to assign a value with 20 characters the value will be truncated to fit 8.

So the new array definition should be set after looking at your variables (proc contents) for the largest length. then use

 

Array new $ 20 Nvar1 - Nvar3;

using the determined length in place of 20 above.

 

You can also create the variables without listing the names as:

Array nvar {3} $ 20;

to create 3 variables nvar1 nvar2 and nvar3.

ADouglas
Obsidian | Level 7

Thanks,

 

That is right. Below my code now. You can see that the original variables were set to $1000. and the new variables set to $1000.

 

The code seems to be working, but I am testing it by making comparisons between a sample of the data produced by the SAs code and the result obtained using TEXTJOIN in Excel. I'm not done testing. The macro variable &NUMBER value is 3254.

 

 

/***Apply conditional set statements & assign the VADs for incorrect items***/

 

PROC IMPORT OUT=PEQ.Examinee_Data_Scored   /***Imports the .csv file of examinee item-level test data into SAS***/
  DATAFILE=

'C:\Users\ADouglas\Desktop\Fall_MOC_Performance_Reports\Processing_PEQs\Main_File_Fall_2017_MOC_Results.csv' 
     DBMS=CSV REPLACE;
     GETNAMES=YES;
     DATAROW=2;
RUN;

 

DATA PEQ.Examinee_VADs;
   IF _N_ = 1 THEN SET PEQ.MOTHER_PEQ_KEY_OUT;
   IF _N_ = 1 THEN SET PEQ.MOTHER_PEQ_VAD_OUT;
   SET PEQ.Examinee_Data_Scored;
   INFORMAT ReportVAD1-ReportVAD&NUMBER $1000.;
   FORMAT ReportVAD1-ReportVAD&NUMBER $1000.;
   ARRAY ANS[&NUMBER] $ _&FirstAN-character-_&LastAN;                    /***Array of examinee answers***/
   ARRAY KEY[&NUMBER] $ KEY_&FirstAN-character-KEY_&LastAN;              /***Array of answer keys***/
   ARRAY VAD[&NUMBER] $ VAD_&FirstAN-character-VAD_&LastAN;              /***Array of VADs***/
   ARRAY ReportVAD[&NUMBER] $ ReportVAD1-character-ReportVAD&NUMBER;               /***Array of Report VADs***/
DO I=1 to &NUMBER;
   IF ANS[I] ^= ' ' AND ANS[I] ^= KEY[I] THEN ReportVAD[I] = VAD[I];   /***Assign VAD to Report VAD variables for items answered incorrectly***/
   IF ANS[I] ^= ' ' AND ANS[I]  = KEY[I] THEN ReportVAD[I] = 'Nothing';    /***Assign nothing to Report VAD variables for items answered correctly***/
END;
KEEP External_ID _&FirstAN-character-_&LastAN ReportVAD1-character-ReportVAD&NUMBER;
RUN;

 

PROC CONTENTS DATA=PEQ.Examinee_VADs;
RUN;

 

/***Code to create the smaller dataset with not presented items not included***/

 

DATA PEQ.Feedback_Report;
     RETAIN External_ID Y1-Y200;
     SET PEQ.Examinee_VADs;
  INFORMAT Y1-Y200 $1000.;
     FORMAT   Y1-Y200 $1000.;
     ARRAY ReportVAD[&NUMBER] $ ReportVAD1-character-ReportVAD&NUMBER;               /***Array of Report VADs***/
     ARRAY Y[200] $ Y1-Y200;                                                         /***Array of Report VADs for Template***/
     J=1;
  DO I=1 to &NUMBER;
        IF ReportVAD[I] ^= ' ' THEN DO;
           Y[J]=ReportVAD[I];
     J+1;
  END;
     END;
     KEEP External_ID Y1-Y200;
RUN;

 

 

ballardw
Super User

Actually there may be an easier way since you are starting with reading a CSV file.

The default behavior for reading delimited data treats multiple delimiters as a single delimiter.

If you look at the code generated by the Proc Import you will likely  see the option DSD on the infile statement. This changes the read to force each comma to delineate a variable, which  I will say most of the time is the desired behavior.

But with what your are doing if you 1) copy the data step from the log to the editor and 2) remove the DSD you will likely get the appearance you want.

 

Here's a brief example with a small data set example:

data junk;
   infile datalines dlm=',' truncover ;
   informat var1-var5 $10.;
   input var1 - var5;
datalines;
ab,,cd,e,f
a,b,c,d,e,
,,c,d,e
run;

data junk2;
   infile datalines dlm=',' truncover dsd;
   informat var1-var5 $10.;
   input var1 - var5;
datalines;
ab,,cd,e,f
a,b,c,d,e,
,,c,d,e
run;

Junk2 is likely similar to your data and Junk is what you want.

 

This may be very helpful if you are going to be doing this with multiple files.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Something like - note not tested - Post Test data in the form of a datastep - we are not here to type it in for you!:

data want;
  length temp $2000;
  set have;
  temp=catx(',',of var:);
  array var{*};
  call missing(of var:);
  do i=1 to countw(temp,",");
    var{i}=scan(temp,i,",");
  end;
run;
  

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 9331 views
  • 6 likes
  • 5 in conversation