Hi I have an excel sheet where results of multiple angiograms of a given patient are concatenated into a single cell. The variable name of this cell is "cardiac_cath". There are thousands of patients in this excel sheet, with several thousand other variables.
In "cardiac cath" of this patient, there are results from 3 angiograms, separated by "|".
Each angiogram contains stenosis percentages of up to 4 coronary vessels:
I would like to extract this into variables:
date1, LAD1, RCA1, LM1, LCX1 (for first angiogram), where they have the values of 2003-07-15, 60, 0, 0, 70, respectively
date2, LAD2, RCA2, LM2, LCX2 (for second angiogram), where they have the values of 2003-11-24, 40, 0, 90, 95, respectively
would like to learn from you a code that can do this for all the patients on the excel
thanks so much in advance
Here is something that may get you started. I am using LINE as the name of the variable because I'm lazy.
There are some details that your description is missing and the examples you show indicate
1) not every variable will have any of the keywords for the stenosis values
2) any observation with stenosis values is not required to have all 4
3) that at any collection time point there is not requirement to have the same number of stenosis values.
The following therefore finds the individual key words indicating the stenosis to my best guess and gets each of a type in order. There is no guarantee that RCA1 is at the same time point as LAD1.
Some key elements. The FINDW function is using the characters :^| (space) as word separators, the 'i' is an Ignore case instruction. After we find one value of the word and assign a value we search the string again for another occurrence starting after the current word.
We are using SUBSTR to extract the "rest of the line" following the key word and really hoping the next thing following is the stenosis value.
data want; set have; array lad(30); array rca(30); count=0; pos= findw(line,'stenosis_lad_cath',' :^|','i'); if pos>0 then do until(pos=0); count=count+1; lad[count]= input(scan(substr(line,pos+length('stenosis_lad_cath')),1,' :^|'),32.); pos = findw(line,'stenosis_lad_cath',' :^|','i',pos+length('stenosis_lad_cath')); end; count=0; pos= findw(line,'stenosis_rca_cath',' :^|','i'); if pos>0 then do until(pos=0); count=count+1; rca[count]= input(scan(substr(line,pos+length('stenosis_rca_cath')),1,' :^|'),32.); pos = findw(line,'stenosis_rca_cath',' :^|','i',pos+length('stenosis_rca_cath')); end; drop pos count; run;
(I attempted to add similar code prior but it seems to have disappeared)
Arrays, if you aren't familiar with them, are short hand names to reference related variables. This code will create 30 variables whether they are used or not.
A similar approach could be used for just about any "keyword" that could appear in your data. If you actually have more "date" values then you would want to make sure that the INPUT uses an appropriate INFORMAT such as YYMMDD10. instead of 32. to read the date value. Assign a format as well.
It doesn't look too difficult but could you provide that example data as TEXT instead of a picture? Copy your 3 or 4 rows of example from the spreadsheet then on the forum open a text box using the </> icon above the message window and paste the text.
It appears that SCAN with '|' as a delimiter will find the different different angiograms and : and ^ to separate the measurements and date.
If you want to add horizontally creating multiple variables you would have tell us the maximum number of these that would ever across multiple files occur if you want this in one step. (do able)
Another approach is to extract just the identification information plus this variable and then manipulate and merge back if you want to add bunches of variables to the existing data which doesn't require knowing the number of angiograms performed.
Caveat: The maximum length for single variable in SAS is 32K. If there are enough of these multiples for one individual you might have truncated data
many thanks
datecoronaryangiography_days_study:11^cor_angio_results:Cardiac catheterization showed severe RCA obstructions and nonobstructing left lesion^datecoronaryangiography_year:2002 "rhc_cath:true^daterhc_cath_days_study:3291^daterhc_cath_year:2019 | stenosis_lad_cath:100.0^stenosis_rca_cath:100.0^rhc_cath:true^daterhc_cath_days_study:3237^daterhc_cath_year:2019 | stenosis_lad_cath:100.0^stenosis_rca_cath:100.0^stenosis_lcx_cath:100.0^datecoronaryangiography_days_study:2878^datecoronaryangiography_year:2018 | stenosis_lcx_cath:100.0^datecoronaryangiography_days_study:742^cor_angio_results:Plan: Post-PCI Care: ASA: indefinitely Post-PCI Care: Plavix x 12 months minimum - DES Please note: Due to the risk of subacute stent thrombosis, Plavix (and Aspirin) should not be held or discontinued prior to completing the minimum duration of therapy without first discussing with a member of the Brigham & Women's Hospital interventional cardiology staff. ^datecoronaryangiography_year:2012" datecoronaryangiography_days_study:861^datecoronaryangiography_year:2014 | datecoronaryangiography_days_study:210^datecoronaryangiography_year:2012 | datecoronaryangiography_days_study:244^datecoronaryangiography_year:2012 | datecoronaryangiography_days_study:2380^datecoronaryangiography_year:2018 | rhc_cath:true^daterhc_cath_days_study:2708^daterhc_cath_year:2019 | rhc_cath:true^daterhc_cath_days_study:2735^daterhc_cath_year:2019 | rhc_cath:true^daterhc_cath_days_study:2793^daterhc_cath_year:2019
I copied and pasted 3 patients'. The maximum number of angiograms per patient is 30. Do these work?
This newer example shows a lot more information than the previous.
If this new example is 3 patients it appears that only one of them has stenosis values.
Plus there is no date information as shown in the previous example.The first example had a text value of datecoronaryangiography. None of these 3 examples do.
This sort of multiple values in one field triggers my: Can I request a different file that just has identification information and patients with actual Stenosis measurements in this field? and just the stenosis values? questions from whoever provides the data.
The reason I would ask is that you may end up spending a lot of time (read man-power-hours and a cost to the organization) undoing what apparently has been done by the data providers. If this is a "general" report type file that is merging multiple values then perhaps it would be simpler to get the values as a separate request. Include enough identification information and merge back onto this data based on those identifiers.
Otherwise you are in a pretty free-form text search mode that you have to debug and may take a bit more work. So is there any chance of requesting your stenosis data? If there is we would suggest a different file structure such as ONE set of readings per line with the identification information needed. With that it would be relatively easy to generate your multiple variables if really needed and merge them back with the data.
Here is something that may get you started. I am using LINE as the name of the variable because I'm lazy.
There are some details that your description is missing and the examples you show indicate
1) not every variable will have any of the keywords for the stenosis values
2) any observation with stenosis values is not required to have all 4
3) that at any collection time point there is not requirement to have the same number of stenosis values.
The following therefore finds the individual key words indicating the stenosis to my best guess and gets each of a type in order. There is no guarantee that RCA1 is at the same time point as LAD1.
Some key elements. The FINDW function is using the characters :^| (space) as word separators, the 'i' is an Ignore case instruction. After we find one value of the word and assign a value we search the string again for another occurrence starting after the current word.
We are using SUBSTR to extract the "rest of the line" following the key word and really hoping the next thing following is the stenosis value.
data want; set have; array lad(30); array rca(30); count=0; pos= findw(line,'stenosis_lad_cath',' :^|','i'); if pos>0 then do until(pos=0); count=count+1; lad[count]= input(scan(substr(line,pos+length('stenosis_lad_cath')),1,' :^|'),32.); pos = findw(line,'stenosis_lad_cath',' :^|','i',pos+length('stenosis_lad_cath')); end; count=0; pos= findw(line,'stenosis_rca_cath',' :^|','i'); if pos>0 then do until(pos=0); count=count+1; rca[count]= input(scan(substr(line,pos+length('stenosis_rca_cath')),1,' :^|'),32.); pos = findw(line,'stenosis_rca_cath',' :^|','i',pos+length('stenosis_rca_cath')); end; drop pos count; run;
(I attempted to add similar code prior but it seems to have disappeared)
Arrays, if you aren't familiar with them, are short hand names to reference related variables. This code will create 30 variables whether they are used or not.
A similar approach could be used for just about any "keyword" that could appear in your data. If you actually have more "date" values then you would want to make sure that the INPUT uses an appropriate INFORMAT such as YYMMDD10. instead of 32. to read the date value. Assign a format as well.
many thanks for the great suggestion
I will try this
Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.
Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.