Solved: inconsistent number of variables per subject - reading the data in?

mhein61 · Posted 01-20-2017 12:12 PM

I've got data in excel files that looks like this:

_recordId		QID122_TEXT	QID124_TEXT
Response ID	Number of Responses	Q122 - What is the name of this item?	Q124 - In the space below, list as many creative uses as you can think of for this...	Category
R_3imYisBaKj1DltD	1	bar	stretching	4
R_2q7PvEtAwpPZf9F	2	barre	ballet	2
			clothes	0
R_3QVKnlbMIE7CHZO	4	a bar	to hold while squatting	1
			to balance your leg	1
			to steady your balance	1
			to use as a prop	9
R_eRGndjYN0wEIFkR	3	barre	horse tie station	10
			clothes rack,	5
			barricade,	8
R_2zBKJc4IOSw1bhb	2	dance bar	stretching	4
			dancing	2
R_1CaC6q2dVvHAtyN	4	balance bean	do stretches	4
			squats	4
			help hold balance	1
			clothes rack	5

The number of responses to the stimulus is a variable.

Based on the number of responses each subject has that many rows of data.

Two sheets per excel file with the same subjects responding to different stimuli.

What is the easiest way to get this data from excel to SAS?

SAS 9.4

ballardw · Posted 01-20-2017 01:15 PM

What is the result supposed to look like? Does each Excel file represent a different topic? Does each only have the response identification, number of responses, two text fields and a category or do they change content layout between files?

Are you planning on combining this data together? Are the response IDs the same across the data sets?

If yes to all of those then it may be possible to generate a similar program to read each of those files but the first step is going to be to save them as CSV files so you control reading them. The Number of responses and categores may need a different variable name in each to differentiate when combined. You will also want to identify the longest (most characters used) in each of the columns response id, and the two text columns. The second row would provide LABELS for the variables.

The generic plan would be to read all of the files and then merge or join on the response ID field.

data QID122;
   infile "path/qid122Page1.csv" dlm=',' lrecl=500 obs=3;
   informat _recordid   $25.
            QID122_Num  best5.
            QID122_text $50.   /* the 50 and 100 are "guesses" as to possible maximum lengths*/
            QID124_text $100.  /* for these to values, set as needed */
            QID122_Cat  best5.
   ;
   input
      _recordid  
      QID122_Num 
      QID122_text
      QID124_text
      QID122_Cat 
   ;
   label
      _recordid    ='Response Id'
      QID122_Num   ="Number of Q122 Responses"
      QID122_text  ="Q122 - What is the name of this item?"
      QID124_text  ="Q124 - In the space below, list as many creative uses as you can think of for this..."
      QID122_Cat   ="Q122 category"
   ;
run;

Replace the dataset name, infile name to reflect the saved CSV file and various 122 and 124 pieces as needed to read each file. This is the part that is simplified by identifying the longest value across the files.

One way to combine the data (if the response Ids are matching between the sheets):

Proc sort data=QID122;
   by responseId;
run;
/* repeat sort for each data set*/
data combined;
   merge
       QID122
       QID125
       /* place all of the data set names here*/
   ;
   by responseid;
run;

View solution in original post

ballardw · Posted 01-20-2017 01:15 PM

What is the result supposed to look like? Does each Excel file represent a different topic? Does each only have the response identification, number of responses, two text fields and a category or do they change content layout between files?

Are you planning on combining this data together? Are the response IDs the same across the data sets?

If yes to all of those then it may be possible to generate a similar program to read each of those files but the first step is going to be to save them as CSV files so you control reading them. The Number of responses and categores may need a different variable name in each to differentiate when combined. You will also want to identify the longest (most characters used) in each of the columns response id, and the two text columns. The second row would provide LABELS for the variables.

The generic plan would be to read all of the files and then merge or join on the response ID field.

data QID122;
   infile "path/qid122Page1.csv" dlm=',' lrecl=500 obs=3;
   informat _recordid   $25.
            QID122_Num  best5.
            QID122_text $50.   /* the 50 and 100 are "guesses" as to possible maximum lengths*/
            QID124_text $100.  /* for these to values, set as needed */
            QID122_Cat  best5.
   ;
   input
      _recordid  
      QID122_Num 
      QID122_text
      QID124_text
      QID122_Cat 
   ;
   label
      _recordid    ='Response Id'
      QID122_Num   ="Number of Q122 Responses"
      QID122_text  ="Q122 - What is the name of this item?"
      QID124_text  ="Q124 - In the space below, list as many creative uses as you can think of for this..."
      QID122_Cat   ="Q122 category"
   ;
run;

Replace the dataset name, infile name to reflect the saved CSV file and various 122 and 124 pieces as needed to read each file. This is the part that is simplified by identifying the longest value across the files.

One way to combine the data (if the response Ids are matching between the sheets):

Proc sort data=QID122;
   by responseId;
run;
/* repeat sort for each data set*/
data combined;
   merge
       QID122
       QID125
       /* place all of the data set names here*/
   ;
   by responseid;
run;

inconsistent number of variables per subject - reading the data in?

Re: inconsistent number of variables per subject - reading the data in?

Re: inconsistent number of variables per subject - reading the data in?

inconsistent number of variables per subject - reading the data in?

Re: inconsistent number of variables per subject - reading the data in?

Re: inconsistent number of variables per subject - reading the data in?

Registration is open