Solved: Re: entering data in billions of dollars

Damon1 · Posted 02-18-2020 07:46 PM

Hi Everyone,

Been learning for about a week. So, i have a data set that follows the following pattern:

rank (numeric), company ($), country ($), sales (num), profits (num), assets (num), market value (num)

the data under the last 4 numeric variables are presented in billions of dollars, as "$1.5B" or "$200.1B", and vary in length

Im using the following code:

DATA bigcos;
INFILE 'C:\Users\dlobsien\Documents\bigcompanies.txt';
INPUT ranking company_name $26. +1 country $18. sales_in_billions :dollar10. profits_in_billions :dollar10. assets_in_billions :dollar10. +1 profits_in_billions :dollar10.;
title companies data;
RUN;

——————————-

PROC PRINT data=bigcos (obs=15);
Run;

the first three variables go in fine, but it wont read anything after that with the Bs in the way (i assume that's the issue). whenever i add a line that says "compress(var, 'B')", after the input lines it gives me 0 observations—maybe I'm entering the syntax wrong or putting it in the wrong spot.

I've also included the data set in case that is useful.

Any help would be greatly appreciated.

Shmuel · Posted 02-19-2020 03:43 PM

Read the so called numeric data as char type first:

DATA bigcos;
INFILE 'C:\Users\dlobsien\Documents\bigcompanies.txt';
INPUT ranking company_name $26. 
      +1 country $18. sales $ profits $ assets $ ;
	
      drop sales profits assets ; /* the char type numeric variables */	
	  
      sales_in_billions  = input(compress(sales,'B'),dollar10.);
      profits_in_billions = inpt(compress(profits,'B'),dollar10.); 
      assets_in_billions  = input(compress(assets,'B'),dollar10.);
	  
      /* +1 profits_in_billions :dollar10.; <<< you can't have two variables with same name ! */

RUN;

View solution in original post

Shmuel · Posted 02-19-2020 03:43 PM

Read the so called numeric data as char type first:

DATA bigcos;
INFILE 'C:\Users\dlobsien\Documents\bigcompanies.txt';
INPUT ranking company_name $26. 
      +1 country $18. sales $ profits $ assets $ ;
	
      drop sales profits assets ; /* the char type numeric variables */	
	  
      sales_in_billions  = input(compress(sales,'B'),dollar10.);
      profits_in_billions = inpt(compress(profits,'B'),dollar10.); 
      assets_in_billions  = input(compress(assets,'B'),dollar10.);
	  
      /* +1 profits_in_billions :dollar10.; <<< you can't have two variables with same name ! */

RUN;

Damon1 · Posted 02-19-2020 06:11 PM

yeah this is what i did. i then used compress to take out the $,B and converted it to numeric in a new data set using the best. format

Tom · Posted 02-19-2020 06:49 PM

@Damon1 wrote:

yeah this is what i did. i then used compress to take out the $,B and converted it to numeric in a new data set using the best. format

You convert from text to values using an INFORMAT, not a FORMAT. BEST is a format but if you try to use it as in informat SAS will just go ahead and use the normal numeric informat instead. If you use the COMMA informat there is no need to remove the dollar signs or commas as that informat will strip those for you.

Damon1 · Posted 02-19-2020 06:51 PM

thanks!

Tom · Posted 02-19-2020 04:51 PM

Is it always a B? If so then just ignore it.

First thing to do is figure out where the data is on the line. The LIST statement is good for that.

data _null_;
  infile "&path\bigcompanies.txt" obs=5;
  input;
  list;
run;

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+--
1         1    Exxon Mobil                  United States     $433.5B   $41.1B   $331.1B   $407.4B 88
2         2    JPMorgan Chase               United States     $110.8B     $19B $2,265.8B   $170.1B 88
3         3    General Electric             United States     $147.3B   $14.2B   $717.2B   $213.7B 88
4         4    Royal Dutch Shell            Netherlands       $470.2B   $30.9B   $340.5B   $227.6B 88
5         5    ICBC                         China              $82.6B   $25.1B $2,039.1B   $237.4B 88
NOTE: 5 records were read from the infile "xxx\bigcompanies.txt".

Then you can write a program to read the lines. Make sure to give each variable a UNIQUE name.

data bigcos (label='companies data');
  infile "&path\bigcompanies.txt" truncover;
  input ranking 1-5
        company_name $ 6-34
        country $ 35-49 
    @51 sales_in_billions dollar8. 
    @60 profits_in_billions dollar8. 
    @69 assets_in_billions dollar8. 
    @79 profits_in_billions2 dollar9.
  ;
  format sales_in_billions -- profits_in_billions2 dollar10.1;
run;

TomKari · Posted 02-19-2020 05:51 PM

I believe that your problem is that several of the company names include accents, which are probably taking up more than one byte if you're not reading them correctly. This is pushing your columns out of line.

This second comment is a little advanced, but file it away for later. When I'm processing formatted data like this, I like to test the variables to ensure that they're well formed. The PRX routines are an easy way to do this (not completely tested).

Tom

data TestData;
	length SalesC $15;
	input;
	SalesC = _infile_;
	cards;
$433.5B
$110.8B
147.3B
 $470.2B
 $82.6B
  $102B
$310.1B
$310.B
 143.7B
 $87.6B
 $145.9
$375.5B
$.3B
 $68.7B
run;

data Problems(drop=_:);
	retain _RX1;

	if _n_ = 1 then
		_RX1 = prxparse("~^[[:space:]]*\$[[:digit:]]+(\.[[:digit:]])?B[[:space:]]*$~");
	set TestData;

	if ^prxmatch(_RX1, SalesC) then
		output;
run;

Register Today!

SAS Training: Just a Click Away