Cleaning Data

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 15
Accepted Solution

Cleaning Data

Hi,

I appreciate any help/suggestions

I have the following data which is a list of diseases. Here is a small sample;

3;N;X;01;A00;A00.-;A00;A00;Cholera;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.0;A00.0;A000;Cholera due to Vibrio cholera 01, biovar cholerae;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.1;A00.1;A001;Cholera due to Vibrio cholerae 01, biovar eltor;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.9;A00.9;A009;Cholera, unspecified;001;4-002;3-003;2-001;1-002
3;N;X;01;A00;A01.-;A01;A01;Typhoid and paratyphoid fevers;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.0;A01.0;A010;Typhoid fever;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.1;A01.1;A011;Paratyphoid fever A;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.2;A01.2;A012;Paratyphoid fever B;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.3;A01.3;A013;Paratyphoid fever C;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.4;A01.4;A014;Paratyphoid fever, unspecified;002;4-002;3-003;2-003

The problem is I am only after the the actual diseases and not the other information;

For example;

Row 1 should be Cholera

Row 2 should be Cholera due to Vibrio cholera, biovar cholerae

I can't really see a regular pattern so wasn't sure what to do.

Any help is appreciated!


Accepted Solutions
Solution
‎08-26-2014 04:33 AM
Super User
Super User
Posts: 7,977

Re: Cleaning Data

Posted in reply to wschnell1

You are trying to do the import in two different ways, choose one or the other.  Personally I would use the datastep method as you have more control, so adding in PhillipC's logic:

data diseaselist_2 (keep=var9);              

     length var1-var14 $40.;     

     infile "D:\Users\Disease List\codes.txt" dlm=';';

     Input var1-var14 $; 

     if strip(var1)="3" then result=var8;   

     if strip(var1)="4" then result=var9;

run;

View solution in original post


All Replies
Trusted Advisor
Posts: 3,215

Re: Cleaning Data

Posted in reply to wschnell1

9-th variable in a ':' semicolon separated file

---->-- ja karman --<-----
Contributor
Posts: 45

Re: Cleaning Data

Posted in reply to wschnell1

data have;               

infile datalines4 dlm=';';

length var1-var14 $40.;  

Input var1-var14 $;      

datalines4;

3;N;X;01;A00;A00.-;A00;A00;Cholera;001;4-002;3-003;2-001;1-002

4;T;X;01;A00;A00.0;A00.0;A000;Cholera due to Vibrio cholera 01, biovar cholerae;001;4-002;3-003;2-001;1-002

4;T;X;01;A00;A00.1;A00.1;A001;Cholera due to Vibrio cholerae 01, biovar eltor;001;4-002;3-003;2-001;1-002

4;T;X;01;A00;A00.9;A00.9;A009;Cholera, unspecified;001;4-002;3-003;2-001;1-002

3;N;X;01;A00;A01.-;A01;A01;Typhoid and paratyphoid fevers;002;4-002;3-003;2-003;1-004

4;T;X;01;A00;A01.0;A01.0;A010;Typhoid fever;002;4-002;3-003;2-003;1-004

4;T;X;01;A00;A01.1;A01.1;A011;Paratyphoid fever A;002;4-002;3-003;2-003;1-004

4;T;X;01;A00;A01.2;A01.2;A012;Paratyphoid fever B;002;4-002;3-003;2-003;1-004

4;T;X;01;A00;A01.3;A01.3;A013;Paratyphoid fever C;002;4-002;3-003;2-003;1-004

4;T;X;01;A00;A01.4;A01.4;A014;Paratyphoid fever, unspecified;002;4-002;3-003;2-003

;;;;



           

Super User
Posts: 11,343

Re: Cleaning Data

Posted in reply to RaviKommuri

And to keep just the text with the disease in it:

Keep var9;

Respected Advisor
Posts: 3,799

Re: Cleaning Data

Or read just enough to get move the pointer to field 9 and no need to read field 10 and up.

data have;              
  
infile datalines4 dlm=';' dsd;
  
length disease $60.
  
Input (8*d) (: $1.) disease;     
  
drop d;
   list;
  
datalines4;
3;N;X;01;A00;A00.-;A00;A00;Cholera;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.0;A00.0;A000;Cholera due to Vibrio cholera 01, biovar cholerae;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.1;A00.1;A001;Cholera due to Vibrio cholerae 01, biovar eltor;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.9;A00.9;A009;Cholera, unspecified;001;4-002;3-003;2-001;1-002
3;N;X;01;A00;A01.-;A01;A01;Typhoid and paratyphoid fevers;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.0;A01.0;A010;Typhoid fever;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.1;A01.1;A011;Paratyphoid fever A;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.2;A01.2;A012;Paratyphoid fever B;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.3;A01.3;A013;Paratyphoid fever C;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.4;A01.4;A014;Paratyphoid fever, unspecified;002;4-002;3-003;2-003
;;;;
   run;
proc print;
  
run;
Occasional Contributor
Posts: 15

Re: Cleaning Data

Posted in reply to RaviKommuri

Hi Ravi,

Thanks so much for your reply.

I run into the following problem when I apply your code to my data set.

PROC IMPORT OUT= diseaselist_1 

            DATAFILE= "D:\Users\Disease List\codes.txt"

  DBMS=TAB REPLACE;

  GETNAMES=NO;

      DATAROW=2;

data diseaselist_2;              

infile diseaselist_1 dlm=';';

length var1-var14 $40.; 

Input var1-var14 $;     

diseaselist_1;

run;

I receive an error message... no infile or datalines statement?

Sorry to bother you.

Solution
‎08-26-2014 04:33 AM
Super User
Super User
Posts: 7,977

Re: Cleaning Data

Posted in reply to wschnell1

You are trying to do the import in two different ways, choose one or the other.  Personally I would use the datastep method as you have more control, so adding in PhillipC's logic:

data diseaselist_2 (keep=var9);              

     length var1-var14 $40.;     

     infile "D:\Users\Disease List\codes.txt" dlm=';';

     Input var1-var14 $; 

     if strip(var1)="3" then result=var8;   

     if strip(var1)="4" then result=var9;

run;

New User
Posts: 1

Re: Cleaning Data

Posted in reply to wschnell1

Hi

There is a pattern. Rows starting with a 3 have the text you want after the 8th semi colon the ones starting with 4 the text is after the 9th one. Read each line as a string. Extract the first character and based on its value ectract either the 8th or 9th word using findw or scanw function (sorry cannt rember whic it is.)

Philip

Super User
Posts: 7,832

Re: Cleaning Data

Posted in reply to wschnell1

data have (keep=disease_name);

infile 'path_to_your_file' truncover;

length disease_name $ 100;

input whole_line $254.;

disease_name = scan(whole_line,9,';');

run;

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 370 views
  • 12 likes
  • 8 in conversation