Cleaning Data

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 15
Accepted Solution

Cleaning Data

Hi,

I appreciate any help/suggestions

I have the following data which is a list of diseases. Here is a small sample;

3;N;X;01;A00;A00.-;A00;A00;Cholera;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.0;A00.0;A000;Cholera due to Vibrio cholera 01, biovar cholerae;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.1;A00.1;A001;Cholera due to Vibrio cholerae 01, biovar eltor;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.9;A00.9;A009;Cholera, unspecified;001;4-002;3-003;2-001;1-002
3;N;X;01;A00;A01.-;A01;A01;Typhoid and paratyphoid fevers;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.0;A01.0;A010;Typhoid fever;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.1;A01.1;A011;Paratyphoid fever A;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.2;A01.2;A012;Paratyphoid fever B;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.3;A01.3;A013;Paratyphoid fever C;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.4;A01.4;A014;Paratyphoid fever, unspecified;002;4-002;3-003;2-003

The problem is I am only after the the actual diseases and not the other information;

For example;

Row 1 should be Cholera

Row 2 should be Cholera due to Vibrio cholera, biovar cholerae

I can't really see a regular pattern so wasn't sure what to do.

Any help is appreciated!


Accepted Solutions
Solution
‎08-26-2014 04:33 AM
Super User
Super User
Posts: 7,401

Re: Cleaning Data

You are trying to do the import in two different ways, choose one or the other.  Personally I would use the datastep method as you have more control, so adding in PhillipC's logic:

data diseaselist_2 (keep=var9);              

     length var1-var14 $40.;     

     infile "D:\Users\Disease List\codes.txt" dlm=';';

     Input var1-var14 $; 

     if strip(var1)="3" then result=var8;   

     if strip(var1)="4" then result=var9;

run;

View solution in original post


All Replies
Valued Guide
Posts: 3,208

Re: Cleaning Data

9-th variable in a ':' semicolon separated file

---->-- ja karman --<-----
Contributor
Posts: 45

Re: Cleaning Data

data have;               

infile datalines4 dlm=';';

length var1-var14 $40.;  

Input var1-var14 $;      

datalines4;

3;N;X;01;A00;A00.-;A00;A00;Cholera;001;4-002;3-003;2-001;1-002

4;T;X;01;A00;A00.0;A00.0;A000;Cholera due to Vibrio cholera 01, biovar cholerae;001;4-002;3-003;2-001;1-002

4;T;X;01;A00;A00.1;A00.1;A001;Cholera due to Vibrio cholerae 01, biovar eltor;001;4-002;3-003;2-001;1-002

4;T;X;01;A00;A00.9;A00.9;A009;Cholera, unspecified;001;4-002;3-003;2-001;1-002

3;N;X;01;A00;A01.-;A01;A01;Typhoid and paratyphoid fevers;002;4-002;3-003;2-003;1-004

4;T;X;01;A00;A01.0;A01.0;A010;Typhoid fever;002;4-002;3-003;2-003;1-004

4;T;X;01;A00;A01.1;A01.1;A011;Paratyphoid fever A;002;4-002;3-003;2-003;1-004

4;T;X;01;A00;A01.2;A01.2;A012;Paratyphoid fever B;002;4-002;3-003;2-003;1-004

4;T;X;01;A00;A01.3;A01.3;A013;Paratyphoid fever C;002;4-002;3-003;2-003;1-004

4;T;X;01;A00;A01.4;A01.4;A014;Paratyphoid fever, unspecified;002;4-002;3-003;2-003

;;;;



           

Super User
Posts: 10,500

Re: Cleaning Data

And to keep just the text with the disease in it:

Keep var9;

Respected Advisor
Posts: 3,777

Re: Cleaning Data

Or read just enough to get move the pointer to field 9 and no need to read field 10 and up.

data have;              
  
infile datalines4 dlm=';' dsd;
  
length disease $60.
  
Input (8*d) (: $1.) disease;     
  
drop d;
   list;
  
datalines4;
3;N;X;01;A00;A00.-;A00;A00;Cholera;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.0;A00.0;A000;Cholera due to Vibrio cholera 01, biovar cholerae;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.1;A00.1;A001;Cholera due to Vibrio cholerae 01, biovar eltor;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.9;A00.9;A009;Cholera, unspecified;001;4-002;3-003;2-001;1-002
3;N;X;01;A00;A01.-;A01;A01;Typhoid and paratyphoid fevers;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.0;A01.0;A010;Typhoid fever;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.1;A01.1;A011;Paratyphoid fever A;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.2;A01.2;A012;Paratyphoid fever B;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.3;A01.3;A013;Paratyphoid fever C;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.4;A01.4;A014;Paratyphoid fever, unspecified;002;4-002;3-003;2-003
;;;;
   run;
proc print;
  
run;
Occasional Contributor
Posts: 15

Re: Cleaning Data

Hi Ravi,

Thanks so much for your reply.

I run into the following problem when I apply your code to my data set.

PROC IMPORT OUT= diseaselist_1 

            DATAFILE= "D:\Users\Disease List\codes.txt"

  DBMS=TAB REPLACE;

  GETNAMES=NO;

      DATAROW=2;

data diseaselist_2;              

infile diseaselist_1 dlm=';';

length var1-var14 $40.; 

Input var1-var14 $;     

diseaselist_1;

run;

I receive an error message... no infile or datalines statement?

Sorry to bother you.

Solution
‎08-26-2014 04:33 AM
Super User
Super User
Posts: 7,401

Re: Cleaning Data

You are trying to do the import in two different ways, choose one or the other.  Personally I would use the datastep method as you have more control, so adding in PhillipC's logic:

data diseaselist_2 (keep=var9);              

     length var1-var14 $40.;     

     infile "D:\Users\Disease List\codes.txt" dlm=';';

     Input var1-var14 $; 

     if strip(var1)="3" then result=var8;   

     if strip(var1)="4" then result=var9;

run;

New User
Posts: 1

Re: Cleaning Data

Hi

There is a pattern. Rows starting with a 3 have the text you want after the 8th semi colon the ones starting with 4 the text is after the 9th one. Read each line as a string. Extract the first character and based on its value ectract either the 8th or 9th word using findw or scanw function (sorry cannt rember whic it is.)

Philip

Super User
Posts: 6,936

Re: Cleaning Data

data have (keep=disease_name);

infile 'path_to_your_file' truncover;

length disease_name $ 100;

input whole_line $254.;

disease_name = scan(whole_line,9,';');

run;

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 360 views
  • 12 likes
  • 8 in conversation