DATA Step, Macro, Functions and more

Reading data

Contributor
Posts: 24

My dataset contains a variable called RACE whose values run from 1 to 4, each number indicating one race. The number 9 indicates the race OTHERS, meaning any race other than those four. A variable RACEOTHER holds the description for those cases.

Data looks like this

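DATA RACE;
  /* TRUNCOVER stops INPUT from reading the next line when RACEOTHER is absent */
  INFILE DATALINES TRUNCOVER;
  /* :$20. allows up to 20 characters, so NON-HISPANIC is not cut off at 8 */
  INPUT PATIENT CENTER RACE RACEOTHER :$20.;
  DATALINES;
001 01 1
002 01 3
003 01 4
004 01 9  NON-ASIAN
005 02 2
006 03 9  NON-HISPANIC
;
RUN;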

 

I will create a user defined format like

 

PROC FORMAT;
  /* A VALUE statement needs a format name; race_fmt matches the name used in the replies */
  VALUE race_fmt
    1='HISPANIC'
    2='ASIAN'
    3='CAUCASIAN'
    4='BLACK'
    9='OTHERS'
  ;
RUN;

My requirement is to change the numbers in the RACE column to their respective names.

Super User
Posts: 23,237

Re: Reading data

Posted in reply to Bhargav_Movva
if race = 9 then race_description = raceother;
else race_description = put(race, race_fmt.);
Contributor
Posts: 24

Re: Reading data

Can you please elaborate on that code?

Contributor
Posts: 24

Re: Reading data

Thanks for responding, but I don't need a RACEOTH column. I want to see the value that is in the RACEOTH variable in the RACE column itself.

PRESENT DATASET

RACE RACEOTH
1
2
3
4
9  NON-HISPANIC

OUTPUT REQUIRED

ASIAN
BLACK
CAUSI
XYYII
NON-HISPANIC

 

That means I need only the RACE column, and when RACE=9 I want to see whatever is in the RACEOTH variable in that same RACE variable.

Super User
Posts: 13,293

Re: Reading data

Posted in reply to Bhargav_Movva

You will need a format that can show all of the text values that you want. Each displayed value generally needs to have a different starting value.

You will have to process your data once to find all of the "9 plus <something else>" combinations, decide on a coding scheme, and assign new values to REPLACE the 9 so that they match a new format. It would likely be much easier to create a new variable as @Reeza suggests.
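A minimal sketch of that first pass, assuming the RACE dataset and the RACEOTHER variable from the original post. It only lists the distinct free-text values recorded with code 9 (including blank ones); deciding on a coding scheme is still a manual step.

```sas
/* List every distinct "other" description recorded with RACE = 9.     */
/* Assumes dataset RACE and variable RACEOTHER as in the original post. */
proc freq data=race;
  where race = 9;
  /* MISSING shows records where no description was entered */
  tables raceother / missing nocum nopercent;
run;
```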

 

Additionally you might have to decide what to do when the 9 does not have an accompanying other description.
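One possible way to handle that case, shown only as a sketch: fall back to the formatted label when the description is blank. It assumes the race_fmt format named in the earlier reply already exists and that RACEOTHER was read with enough length to hold the full text.

```sas
data want;
  set race;
  length race_description $20;
  /* COALESCEC returns the first non-missing character argument, */
  /* so a blank RACEOTHER falls back to the label for code 9.    */
  if race = 9 then race_description = coalescec(raceother, put(race, race_fmt.));
  else race_description = put(race, race_fmt.);
run;
```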

 

I have worked with similar data in the past and have had as many as 35 "other" categories out of a response file of 500 subjects. And the next month there would be more new "other" responses.

 

In most practical terms the approach you are suggesting is fragile and likely to be a maintenance headache if applied to a different dataset as each set would require a separate custom format. So any comparison across data sets might get very complicated for little gain.

Respected Advisor
Posts: 4,668

Re: Reading data

Posted in reply to Bhargav_Movva

@Bhargav_Movva

Here you go:

PROC FORMAT;
  VALUE race_fmt
    1='HISPANIC'
    2='ASIAN'
    3='CAUCASIAN'
    4='BLACK'
    9='OTHERS'
  ;
RUN;

DATA RACE;
  infile datalines truncover;
  INPUT PATIENT CENTER RACE RACEOTHER $20.;
  DATALINES;
001 01 1 
002 01 3
003 01 4
004 01 9  NON-ASIAN
005 02 2
006 03 9  NON-HISPANIC
;
RUN;

data want;
  set RACE;
  if race = 9 then race_description=RACEOTHER;
  else race_description = put(race, race_fmt.);
run;
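If only a single column named RACE is wanted, a variation along these lines may work. A SAS variable cannot change from numeric to character, so this sketch builds a character variable and renames it; race_c and want_single are illustrative names, not from the thread.

```sas
data want_single;
  set RACE;
  length race_c $20;
  /* Use the free-text description for code 9 when one was entered */
  if race = 9 and not missing(RACEOTHER) then race_c = RACEOTHER;
  else race_c = put(race, race_fmt.);
  /* Drop the original variables, then give the new one the name RACE */
  drop race RACEOTHER;
  rename race_c = race;
run;
```

The resulting RACE variable is character, so any later step that expects the numeric codes would need the original dataset.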

And OT as my very personal opinion:

Looking at your race categories, I came once more to realize what an abused and poisonous term race is when used on humans in a non-scientific way. I live in Australia and I believe Aborigines are genetically as closely related to me ("Caucasian") as to people of African descent; and it's likely that I'm genetically more closely related to many people of African descent than to Australian Aborigines. So what does "Black" stand for? And what purpose does such a category serve?


Contributor
Posts: 24

Re: Reading data

@Patrick Hey, there is nothing related to discrimination or categorization here. It is just dummy data related to clinical trials.

Super User
Posts: 23,237

Re: Reading data

Posted in reply to Bhargav_Movva

Bhargav_Movva wrote:

@Patrick Hey, there is nothing related to discrimination or categorization here. It is just dummy data related to clinical trials.


Why exactly do you think the variable is included in the data? It's to see if there is a difference by race, i.e. categorizing people.  It's commonly done, but race is not well defined and people don't answer it well, so it's hard to measure. There are racial differences in physiology but the categories here are likely too broad to be useful. 

 

Even if this is 'dummy data' for practice, the distinction is one you should be aware of, since I'm assuming you're practicing for one day actually doing this in real life. 

 

 

Contributor
Posts: 24

Re: Reading data

@Reeza Hey, what you describe is not true in my case. I know that it is collected as part of demographics to check whether the drug has a different PK profile or therapeutic effect in different races. But I don't have any personal intention to discriminate. I just started learning SDTM mapping and I work on whatever data is given to me, and personally I feel that everyone is human. I come from a mixed, diverse culture, so you can believe that I am not a racist.

Super User
Posts: 23,237

Re: Reading data

Posted in reply to Bhargav_Movva

Bhargav_Movva wrote:

@Reeza Hey, what you describe is not true in my case. I know that it is collected as part of demographics to check whether the drug has a different PK profile or therapeutic effect in different races. But I don't have any personal intention to discriminate. I just started learning SDTM mapping and I work on whatever data is given to me, and personally I feel that everyone is human. I come from a mixed, diverse culture, so you can believe that I am not a racist.


 

No one is saying you're racist; I don't believe that was Patrick's statement at all. It was: be careful with what you're measuring/analysing. We do need to be very aware of the analysis we do as statisticians/analysts and of the effects we report.

What happens if race is a significant factor? Would the company market the drug to different demographics - probably.  

 

You/We do have the responsibility to ensure that the analysis is appropriate and to consider the effects it may have on people's lives.

There are tons of examples of 'bad data analysis' having significant effects on people's lives, such as the models that predict the probability of re-offence for a person who's charged with a crime, the idiots who decided to do record linkage without considering middle names, or the brilliant analysts who chose to bundle the subprime market into a product and caused the recession in the US.

 

In the medical field, we included the doctor in the analysis. It was a significant factor, the most significant factor in the end. We had to end up having an investigation and they lost their job in the end, but the patients are better off. I have more examples but it's bed time. 

 

Good luck :)

Super User
Posts: 13,293

Re: Reading data


Reeza wrote:

Bhargav_Movva wrote:

@Patrick Hey, there is nothing related to discrimination or categorization here. It is just dummy data related to clinical trials.


Why exactly do you think the variable is included in the data? It's to see if there is a difference by race, i.e. categorizing people.  It's commonly done, but race is not well defined and people don't answer it well, so it's hard to measure. There are racial differences in physiology but the categories here are likely too broad to be useful. 

 

Even if this is 'dummy data' for practice, the distinction is one you should be aware of, since I'm assuming you're practicing for one day actually doing this in real life. 

 

 


@Reeza, while I agree completely with your analysis of race (and ethnicity) reporting, I know exactly why I include "race" in a similar manner: the funding source for the project requires it in the reports.

 

Almost every US governmental agency funded project gets stuck with this or a similar requirement regardless of actual relevance.

 
