Solved: Re: Merging data by range

confooseddesi89 · Posted 08-20-2019 05:36 PM

I have a cognitive battery dataset that I'd like to merge with a survey dataset. The survey may have followed the battery by anywhere from 10 to 30 minutes; thus, the timestamp in each dataset for the same cognitive battery/survey administration will be different.

For example, for one administration of a cognitive battery, I may have the following cognitive data:

Date_Task_Start Time_Task_Start Cog_Outcome_1 Cog_Outcome_2

20160404 15:16:42 3.4 4.2

and the following survey data:

Date_Task_Start Time_Task_Start Surv_Outcome_1 Surv_Outcome_2

20160404 15:26:12 6 3

I want the following dataset:

Date_Task_Start Cog_Time_Task_Start Cog_Outcome_1 Cog_Outcome_2 Surv_Time_Task_Start Surv_Outcome_1 Surv_Outcome_2

20160404 15:16:42 3.4 4.2 15:26:12 6 3

Can anyone help me with this? Thank you.

Reeza · Posted 08-26-2019 11:30 AM

What are the next steps to get the other variables from the original cognitive and survey datasets into this linked dataset?

Start listing the variables you want from each data set in the PROC SQL statement and it will include them. Like I mentioned earlier, if you have variables with the same name, which you do, you need to rename one of them at minimum.

Look at the current code to see how a rename works.

Also, there's one thing I noticed - in ONE instance, a survey was administered the next day after the cognitive battery, because the battery was administered late at ~23:45 (the survey then was administered 0:03 the next day). Is there any way to rectify this so the records are linked?

Rather than link on date as exact and time as the calculation, you can create a variable that is a datetime and then use that variable instead of the time as currently being used.

Here's a great, but longer and in depth, reference for dates and times in SAS - it should answer your second question if you get stuck anywhere.
https://communities.sas.com/t5/SAS-Communities-Library/Working-with-Dates-and-Times-in-SAS-Tutorial/...

Unfortunately I don't have further time to spend on this today, Good Luck!

View solution in original post

Reeza · Posted 08-20-2019 05:44 PM

Is it only time based? What happens if there are multiple matches within the time frames? Did you administer the test to multiple people within those 10 to 30 minutes?

I think this can be done via SQL relatively easily but I think you'll definitely need to provide a more robust data example that has more types of issues that you may be dealing with. Given what you've posted a simple MERGE would work, but obviously would be incorrect because it wouldn't scale and work on your actual problem.

Here are instructions on how to provide sample data as a data step:
https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

@confooseddesi89 wrote:

I have a cognitive battery dataset that I'd like to merge with a survey dataset. The survey may have followed the battery by anywhere from 10 to 30 minutes; thus, the timestamp in each dataset for the same cognitive battery/survey administration will be different.

For example, for one administration of a cognitive battery, I may have the following cognitive data:

Date_Task_Start            Time_Task_Start            Cog_Outcome_1            Cog_Outcome_2

20160404            15:16:42            3.4            4.2

and the following survey data:

Date_Task_Start            Time_Task_Start            Surv_Outcome_1            Surv_Outcome_2

20160404            15:26:12            6            3

I want the following dataset:

Date_Task_Start            Cog_Time_Task_Start            Cog_Outcome_1            Cog_Outcome_2            Surv_Time_Task_Start            Surv_Outcome_1            Surv_Outcome_2

20160404            15:16:42            3.4            4.2            15:26:12            6            3

Can anyone help me with this? Thank you.

confooseddesi89 · Posted 08-21-2019 09:45 AM

Hi Reeza,

Thanks, I used the link to create the following code for each of my datasets which need to be merged (I hope I did it right!):

Cognitive battery data:

data WORK.DSST_MEANS_SAMPLE;
  infile datalines dsd truncover;
  input Subject_ID_byDate:$6. Time_Task_Start:TIME8. Date_Task_Start:32. DSST_RT_Median:32. DSST_Correct_Responses:32.;
  format Time_Task_Start TIME8.;
  label Time_Task_Start="Time_Task_Start" Date_Task_Start="Date_Task_Start" DSST_Correct_Responses="Correct Responses";
datalines;
1605SR 12:37:50 20160808 0.1486427303 68
1708SR 20:18:27 20170703 1.0708801293 69
1705SR 11:33:04 20170515 0.9818222043 72
1724SR 11:38:19 20171204 1.0473129293 71
1705SR 14:16:17 20170515 1.0315467793 72
;;;;

Survey data:

data WORK.REDCAP_SAMPLE;
  infile datalines dsd truncover;
  input Date_Task_Start:32. Time_Task_Start:TIME8. performance_peers:32. effort_peers:32.;
  format Time_Task_Start TIME8.;
  label Date_Task_Start="Date_Task_Start" Time_Task_Start="Time_Task_Start" performance_peers="performance_peers" effort_peers="effort_peers";
datalines;
20160404 17:28:36 4 3
20160404 19:21:03 4 3
20160404 21:24:00 4 2
20160405 8:57:15 5 2
20160405 10:54:24 4 3
;;;;

For the sake of simplicity I have provided only some of the variables in each dataset, but note that each dataset may have 10-20 extra columns (variables). I will need to have all of these variables in the merged dataset.

As to your questions - there was only ever one participant at a time (it was an intensive laboratory study) with one survey taken after each cognitive battery. Thus, it should be one cognitive battery, one survey (10-30 minutes after the cognitive battery). It is both time AND date based - the date format is YYYYMMDD. The date will always match exactly, but as I mentioned, the times can be off within 30 minutes. Your help is much appreciated.

Reeza · Posted 08-21-2019 12:52 PM

This is possibly dangerous, but I think if you sort each by Date and Time and then do a MERGE directly - no BY statement you should in theory get what you need. It's really too bad you don't have a common ID across the data sets.

I'll try it and let you know in a few minutes.

Reeza · Posted 08-21-2019 12:54 PM

Would anything in your sample data actually match? They don't look like they have the same dates at all and from what you've said, it should be the same date. I can't actually test it with this data since there would be no matches 😞

confooseddesi89 · Posted 08-21-2019 03:42 PM

Hi Reeza,

I realize I didn't sort my datasets before creating the 5-row samples, which is why there appeared to be no date/time matches between the two datasets. I've sorted the datasets and included the first five rows of each dataset below.

Cognitive data:

data WORK.DSST_MEANS_SAMPLE;
  infile datalines dsd truncover;
  input Subject_ID_byDate:$6. Time_Task_Start:TIME8. Date_Task_Start:MMDDYY10. DSST_RT_Median:32. DSST_Correct_Responses:BEST12.;
  format Time_Task_Start TIME8. Date_Task_Start MMDDYY10. DSST_Correct_Responses BEST12.;
datalines;
1601SR 15:18:28 04/04/2016 0.781493399 95
1601SR 17:13:32 04/04/2016 0.748181999 99
1601SR 19:06:48 04/04/2016 0.765006255 97
1601SR 21:08:41 04/04/2016 0.731959533 100
1601SR 8:42:43 04/05/2016 0.781481933 94
;;;;

Survey data:

data WORK.REDCAP_SAMPLE;
  infile datalines dsd truncover;
  input performance_peers:BEST12. effort_peers:BEST12. date_task_start:MMDDYY10. Time_Task_Start:TIME8.;
  format performance_peers BEST12. effort_peers BEST12. date_task_start MMDDYY10. Time_Task_Start TIME8.;
datalines;
4 3 04/04/2016 17:28:00
4 3 04/04/2016 19:21:00
4 2 04/04/2016 21:24:00
5 2 04/05/2016 8:57:00
4 3 04/05/2016 10:54:00
;;;;

I should note that even though each cognitive battery SHOULD be followed by a survey (and each survey should be preceded by a battery), this is not always the case. For example, a research assistant may have failed to give a survey after a battery, and (more rarely) a survey may have been administered without a battery. As you can see, row 1 in the survey data should be matched to row 2 in the cognitive data (I believe this was a practice battery; this is why no survey was administered.) Thus, a simple merge would not work. Thanks for your help.

-confooseddesi

Reeza · Posted 08-23-2019 11:50 AM

I think this can get you started.

I suspect it may return multiple records though, so you may need to add another filter or add another layer of logic if that is happening.

Try running it on your actual data and see if you run into issues with duplicates. If that's the case post back with some sample data that shows a few duplicates. You may also not be 100% accurate here because of the missing data.

It depends a lot on your data but if some one didn't do a survey but did do a battery and there was a survey in that time period it would link. This also isn't testing that there's only a unique match to each one. I would see how it works though and then post back with issues, just giving you some ideas of what you'll need to check out. You may get lucky with mostly clean data 🙂

data WORK.DSST_MEANS_SAMPLE;
  infile datalines dsd truncover dlm=',';
  input Subject_ID_byDate:$6. Time_Task_Start:TIME8. Date_Task_Start:MMDDYY10. DSST_RT_Median:32. DSST_Correct_Responses:BEST12.;
  format Time_Task_Start TIME8. Date_Task_Start MMDDYY10. DSST_Correct_Responses BEST12.;
datalines;
1601SR, 15:18:28, 04/04/2016, 0.781493399, 95
1601SR, 17:13:32, 04/04/2016, 0.748181999, 99
1601SR, 19:06:48, 04/04/2016, 0.765006255, 97
1601SR, 21:08:41, 04/04/2016, 0.731959533, 100
1601SR, 8:42:43, 04/05/2016, 0.781481933, 94
;;;;


data WORK.REDCAP_SAMPLE;
  infile datalines dsd truncover dlm=',';
  input performance_peers:BEST12. effort_peers:BEST12. date_task_start:MMDDYY10. Time_Task_Start:TIME8.;
  format performance_peers BEST12. effort_peers BEST12. date_task_start MMDDYY10. Time_Task_Start TIME8.;
datalines;
4, 3, 04/04/2016, 17:28:00
4, 3, 04/04/2016, 19:21:00
4, 2, 04/04/2016, 21:24:00
5, 2, 04/05/2016, 8:57:00
4, 3, 04/05/2016, 10:54:00
;;;;

data redcap1;
set redcap_sample;

*adds ids;
id = _n_;
run;

data dsst1;
set dsst_means_sample;
*adds ids;
id = _n_;
run;

proc sql;
create table linked as
select a.id as task_id, 
		a.date_task_start as battery_date_start, 
		a.time_task_start as battery_time_start,
		 b.id as survey_id, 
		 b.date_task_start as survey_date_start, 
		 b.time_task_start as survey_time_start,
		 a.time_task_start as interval_start,
		 a.time_task_start + 30*60 as interval_end format=time8.,
		 case when missing(task_id) then 'No Battery'
		      when missing(survey_id) then 'No Survey'
		      else 'Match'
		 end as status
from dsst1 as a
full outer join redcap1 as b
on a.date_task_start = b.date_task_start
and b.time_task_start between a.time_task_start and (a.time_task_start + 30*60)
order by a.id, b.time_task_start;
quit;

proc sort data=linked;
by task_id survey_date_start;
run;

confooseddesi89 · Posted 08-24-2019 12:47 PM

Hi Reeza,

Please see below the data sample (first 10 observations) for the "linked" dataset (I renamed it to DSST_Means_RedCap):

data WORK.DSST_MEANS_REDCAP;
  infile datalines dsd truncover;
  input task_id:32. battery_date_start:MMDDYY10. battery_time_start:TIME8. survey_id:32. survey_date_start:MMDDYY10. survey_time_start:TIME8. interval_start:TIME8. interval_end:TIME8. status:$10.;
  format battery_date_start MMDDYY10. battery_time_start TIME8. survey_date_start MMDDYY10. survey_time_start TIME8. interval_start TIME8. interval_end TIME8.;
datalines;
. . . 64 05/16/2016 15:31:00 . . No Battery
. . . 120 05/26/2016 8:34:00 . . No Battery
. . . 121 06/14/2016 11:37:00 . . No Battery
. . . 123 06/14/2016 16:25:00 . . No Battery
. . . 124 06/14/2016 17:54:00 . . No Battery
. . . 210 07/23/2016 8:59:00 . . No Battery
. . . 304 10/05/2016 18:08:00 . . No Battery
. . . 305 01/27/2017 16:51:00 . . No Battery
. . . 306 03/07/2017 15:55:00 . . No Battery
. . . 333 03/18/2017 0:03:00 . . No Battery
;;;;

Below are the first 10 observations of the "redcap1" (renamed):

data WORK.REDCAP_FOR_MERGE_SAMPLE;
  infile datalines dsd truncover;
  input kss:BEST12. date_task_start:MMDDYY10. Time_Task_Start:TIME8. id:32.;
  format kss BEST12. date_task_start MMDDYY10. Time_Task_Start TIME8.;
datalines;
6 04/04/2016 17:28:00 1
7 04/04/2016 19:21:00 2
7 04/04/2016 21:24:00 3
5 04/05/2016 8:57:00 4
4 04/05/2016 10:54:00 5
3 04/05/2016 13:02:00 6
4 04/05/2016 14:50:00 7
4 04/05/2016 16:50:00 8
3 04/05/2016 19:01:00 9
3 04/05/2016 21:23:00 10
;;;;

Below are the first 10 observations of the "dsst1" (renamed):

data WORK.DSST_FOR_MERGE_SAMPLE;
  infile datalines dsd truncover;
  input Time_Task_Start:TIME8. Date_Task_Start:MMDDYY10. DSST_RT_Mean:32. id:32.;
  format Time_Task_Start TIME8. Date_Task_Start MMDDYY10.;
datalines;
15:18:28 04/04/2016 0.8297960572 1
17:13:32 04/04/2016 0.7964202534 2
19:06:48 04/04/2016 0.8120929878 3
21:08:41 04/04/2016 0.7753463086 4
8:42:43 04/05/2016 0.8330229793 5
10:39:27 04/05/2016 0.8386569351 6
12:47:11 04/05/2016 0.8897907703 7
14:35:50 04/05/2016 0.7906762696 8
16:36:32 04/05/2016 0.8850391059 9
18:47:32 04/05/2016 0.8465832596 10
;;;;

Am I meant to merge redcap1 and dsst1 by ID? Thanks for your help.

Tom · Posted 08-24-2019 01:03 PM

Perhaps you don't want to MERGE at all. Instead you might want to just interleave the records from the two datasets.

data DSST_MEANS_SAMPLE;
  infile datalines truncover;
  input Subject_ID_byDate:$6. Time:TIME8. Date:MMDDYY10. 
        DSST_RT_Median DSST_Correct_Responses;
  format Time TIME8. Date yymmdd10. ;
datalines;
1601SR 15:18:28 04/04/2016 0.781493399 95
1601SR 17:13:32 04/04/2016 0.748181999 99
1601SR 19:06:48 04/04/2016 0.765006255 97
1601SR 21:08:41 04/04/2016 0.731959533 100
1601SR 8:42:43 04/05/2016 0.781481933 94
;;;;

data REDCAP_SAMPLE;
  infile datalines  truncover;
  input performance_peers:BEST12. effort_peers:BEST12. date:MMDDYY10. Time:TIME8.;
  format date yymmdd10. Time TIME8.;
datalines;
4 3 04/04/2016 17:28:00
4 3 04/04/2016 19:21:00
4 2 04/04/2016 21:24:00
5 2 04/05/2016 8:57:00
4 3 04/05/2016 10:54:00
;;;;

data want ;
  format Date yymmdd10. Time TIME8. ;
  set  DSST_MEANS_SAMPLE REDCAP_SAMPLE;
  by date time ;
run;
proc print;
run;

                                 Subject_
                                  ID_by      DSST_RT_    DSST_Correct_    performance_    effort_
Obs          Date        Time      Date       Median       Responses          peers        peers

  1    2016-04-04    15:18:28     1601SR      0.78149          95               .            .
  2    2016-04-04    17:13:32     1601SR      0.74818          99               .            .
  3    2016-04-04    17:28:00                  .                .               4            3
  4    2016-04-04    19:06:48     1601SR      0.76501          97               .            .
  5    2016-04-04    19:21:00                  .                .               4            3
  6    2016-04-04    21:08:41     1601SR      0.73196         100               .            .
  7    2016-04-04    21:24:00                  .                .               4            2
  8    2016-04-05     8:42:43     1601SR      0.78148          94               .            .
  9    2016-04-05     8:57:00                  .                .               5            2
 10    2016-04-05    10:54:00                  .                .               4            3

confooseddesi89 · Posted 08-24-2019 04:21 PM

Hi Tom,

Interleaving keeps the cognitive data and survey data from a given battery/survey administration on separate rows. These need to be on the same rows to allow for analyses (e.g., whether survey variable predicts cognitive variable).

Reeza · Posted 08-25-2019 12:40 PM

@confooseddesi89 did you run the code I posted and did it work on the original data you posted?

Did you modify the SQL code posted in any manner?

What I did was add IDs to each record and then did the merge to get the ID's into one data set. Then once you're sure it works you can add in the other variables you need by listing them. You have variables that have the same name in each data set which will not work for your final data set since each variable needs a unique name. So you'll have to rename them into the SQL code.

EDIT:

I ran your NEW sample data (why did you change data set names?) with the code I originally used and it works fine but I don't get the same output you do. I'm including the output attached. You need to post the code you're running now and the log - did you make any changes, did the log have errors?

confooseddesi89 · Posted 08-25-2019 02:21 PM

Hi Reeza,

It seems like the "linked" file is creating the records appropriately. Attached are the log (.rtf) and the "linked" data I created, and below is the code I used. There were no errors in the log. I kept it exactly the same as in the code your provided, except the datasets are in the library "SR" and not in "work":

*use Reeza's EXACT code;
data redcap1;
set SR.redcap;
*adds ids;
id = _n_;
run;
data dsst1;
set SR.dsst_means;
*adds ids;
id = _n_;
run;
proc sql;
create table SR.linked as
select a.id as task_id, 
		a.date_task_start as battery_date_start, 
		a.time_task_start as battery_time_start,
		 b.id as survey_id, 
		 b.date_task_start as survey_date_start, 
		 b.time_task_start as survey_time_start,
		 a.time_task_start as interval_start,
		 a.time_task_start + 30*60 as interval_end format=time8.,
		 case when missing(task_id) then 'No Battery'
		      when missing(survey_id) then 'No Survey'
		      else 'Match'
		 end as status
from dsst1 as a
full outer join redcap1 as b
on a.date_task_start = b.date_task_start
and b.time_task_start between a.time_task_start and (a.time_task_start + 30*60)
order by a.id, b.time_task_start;
quit;

proc sort data=SR.linked;
by task_id survey_date_start;
run;

The linked file you produced is likely different than the linked file I produced because I am using the "real" data (not the sample data which I created to post on this forum which has only a subset of all the data). The observations in the linked data you posted are contained in the linked file I'm attaching here. What are the next steps to get the other variables from the original cognitive and survey datasets into this linked dataset?

Also, there's one thing I noticed - in ONE instance, a survey was administered the next day after the cognitive battery, because the battery was administered late at ~23:45 (the survey then was administered 0:03 the next day). Is there any way to rectify this so the records are linked?

Thanks.

Reeza · Posted 08-26-2019 11:30 AM

What are the next steps to get the other variables from the original cognitive and survey datasets into this linked dataset?

Start listing the variables you want from each data set in the PROC SQL statement and it will include them. Like I mentioned earlier, if you have variables with the same name, which you do, you need to rename one of them at minimum.

Look at the current code to see how a rename works.

Also, there's one thing I noticed - in ONE instance, a survey was administered the next day after the cognitive battery, because the battery was administered late at ~23:45 (the survey then was administered 0:03 the next day). Is there any way to rectify this so the records are linked?

Rather than link on date as exact and time as the calculation, you can create a variable that is a datetime and then use that variable instead of the time as currently being used.

Here's a great, but longer and in depth, reference for dates and times in SAS - it should answer your second question if you get stuck anywhere.
https://communities.sas.com/t5/SAS-Communities-Library/Working-with-Dates-and-Times-in-SAS-Tutorial/...

Unfortunately I don't have further time to spend on this today, Good Luck!

Classroom Training Available!