Solved: Transposing dataset by creating variables from substring

lydiawawa · Posted 10-17-2022 03:29 PM

Hi,

I have a dataset that I need to transpose by a specific design that involves substring.

The original dataset:

ID	NT1	NT2	NT3	NT4
1	NOTES 12:13:44 03-16-2018 CODE: ABC	AML	NOTES 09:13:11 03-12-2018 CODE: OPI	TEST
2	NOTES 04:25:09 01-04-2018 CODE: FDS	IMD	NOTES 03:25:10 01-09-2018 CODE: FGH	TEST
3	NOTES 12:22:49 11-12-2018 CODE: DGH	TEST	NOTES 08:02:49 11-11-2018 CODE: LKO	AML
4	NOTES 22:02:21 01-14-2018 CODE: MKL	TEST	NOTES 07:02:21 01-10-2018 CODE: LOP	IMD
5	NOTES 09:01:36 01-23-2018 CODE: HJK	TEST	NOTES 09:01:56 01-23-2018 CODE: UIY	TEST

Transpose by ID to split out time , date, code and notes:

ID	time	date	code	notes
1	12:13:44	03-16-2018	ABC	AML
1	09:13:11	03-12-2018	OPI	TEST
2	04:25:09	01-04-2018	FDS	IMD
2	03:25:10	01-09-2018	FGH	TEST
3	12:22:49	11-12-2018	DGH	TEST
3	08:02:49	11-11-2018	LKO	AML
4	22:02:21	01-14-2018	MKL	TEST
4	07:02:21	01-10-2018	LOP	IMD
5	09:01:36	01-23-2018	HJK	TEST
5	09:01:56	01-23-2018	UIY	TEST

After transposing, ID will have multiple levels based on NT variables.

The code I have so far does not yield the desirable output:

data note11;
 length note0 $50;
 set note10;

  array t{*} nt:;;
    do _i = 1 to dim(t);
/*make sure notes are not blank*/
      if not missing(t[_i]) then note0 = catx('/',note0,vname(t(_i)));
/*timestamp, date, code*/
      if find(upcase(t[_i]),"NOTES") then do;
	 timestamp=substr(left(nt0), index(left(nt0), 'NOTES')+5, 9);                                                                                                              
         date=substr(left(nt0), 15, 10);  
         code = substr(left(nt0), index(left(nt0), 'CODE:')+5); 	
	  end;

    end;
   drop _i;
run;

Thank you for any help!

mkeintz · Posted 10-17-2022 04:20 PM

@lydiawawa wrote:

In reality, there are about 75 nt variables, the 'notes' always have an even suffix , such as nt2, nt4, nt6....I cannot just call them independently

"about 75"? Are you saying that you don't have complete pairs (one with the parseable text, the other with the NOTES result)?

If you do have complete pairs, and there are many of them, then a loop is the way to go.

Edtted addition:

Let's say you have 36 complete pairs, saved in variables NT1-NT72. Then the loop approach (untested) could be:

data want (drop=nt: row);
  set have;
  length code $4;
  array ntvars {36,2} $ nt1-nt72;

  do row=1 to 36;
    code=scan(ntvars{row,1},-1,' ');
    time= input(scan(ntvars{row,1},2,' '),time8.0);
    format time time8.0;
    date= input(scan(ntvars{row,1},3,' '),mmddyy10.);
    format date mmddyy10. ;

    length notes $4;
    notes=ntvars{row,2};
    output;
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 10-17-2022 04:03 PM

You can "read" each original observation twice. The first time keeping id nt1 and nt2, the second time keeping id nt3 and nt4.

Rename nt2 and nt4 to NOTES is simple. Rename nt1 and nt3 to note_text, parse that text for TIME, DATE, and CODE.

Untested, in the absence of a working sample data step:

data want (drop=_note_text);
  set have (keep=id nt1 nt2  rename=(nt1=_note_text nt2=notes))
      have (keep=id nt3 nt4  rename=(nt3=_note_text nt4=notes));
  by id;
  length code $4;
  code=scan(_note_text,-1,' ');

  time= input(scan(_note_text,2,' '),time8.0);
  format time time8.0;

  date= input(scan(_note_text,3,' '),mmddyy10.);
  format date mmddyy10. ;
run;

This program assumes

time is always the second "word" in _NOTE_TEXT,
date is always the third
code is always the last

The set statement has two references to HAVE, one for NT1 and NT2, the other for NT3 and NT4. This eliminates the need to create a loop to first process NT1 and NT2, then NT3 and NT4.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ballardw · Posted 10-17-2022 04:09 PM

If this were my data, assuming this is as read, I would go back to the step where I read the data and modify the input statement and basically read the two records from each line.

lydiawawa · Posted 10-17-2022 04:15 PM

In reality, there are about 75 nt variables, the 'notes' always have an even suffix , such as nt2, nt4, nt6....I cannot just call them independently

mkeintz · Posted 10-17-2022 04:20 PM

@lydiawawa wrote:

In reality, there are about 75 nt variables, the 'notes' always have an even suffix , such as nt2, nt4, nt6....I cannot just call them independently

"about 75"? Are you saying that you don't have complete pairs (one with the parseable text, the other with the NOTES result)?

If you do have complete pairs, and there are many of them, then a loop is the way to go.

Edtted addition:

Let's say you have 36 complete pairs, saved in variables NT1-NT72. Then the loop approach (untested) could be:

data want (drop=nt: row);
  set have;
  length code $4;
  array ntvars {36,2} $ nt1-nt72;

  do row=1 to 36;
    code=scan(ntvars{row,1},-1,' ');
    time= input(scan(ntvars{row,1},2,' '),time8.0);
    format time time8.0;
    date= input(scan(ntvars{row,1},3,' '),mmddyy10.);
    format date mmddyy10. ;

    length notes $4;
    notes=ntvars{row,2};
    output;
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Transposing dataset by creating variables from substring

Re: Transposing dataset by creating variables from substring

Re: Transposing dataset by creating variables from substring

Re: Transposing dataset by creating variables from substring

Re: Transposing dataset by creating variables from substring

Re: Transposing dataset by creating variables from substring

Transposing dataset by creating variables from substring

Re: Transposing dataset by creating variables from substring

Re: Transposing dataset by creating variables from substring

Re: Transposing dataset by creating variables from substring

Re: Transposing dataset by creating variables from substring

Re: Transposing dataset by creating variables from substring

2025 SAS Hackathon: There is still time!

SAS Training: Just a Click Away