Saturday, February 25, 2006

RE: st: Encode/destring

Despite the title, the issue here is a one-to-one mapping from string identifiers to numeric identifiers.

As Giorgia points out, -destring, ignore- is quite wrong for her problem, as ignoring the non-numeric characters throws away important information.
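A minimal sketch of the problem, using the two identifiers from Giorgia's post. The variable names firm_id and firm_num are made up for illustration:

```stata
clear
input str7 firm_id
"FR12345"
"GB12345"
end

* strip the letters F, R, G and B and convert what is left to numeric
destring firm_id, generate(firm_num) ignore("FRGB")

* both firms now carry the same number, 12345
list
assert firm_num[1] == firm_num[2]
```

The country prefix is the only thing that distinguishes the two firms, so once it is ignored they cannot be told apart.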

Joseph's solution is a reinvention of -egen, group()-. It shows the logic to follow, but for convenience you can do it directly:

egen numeric_panel_id = group(string_panel_id)
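To see that the two routes agree, here is a small sketch on made-up data. The identifiers and the names by_hand_id and numeric_panel_id are illustrative only; the by-hand part follows the logic of Joseph's code quoted below, except that it declares the new variable -long- from the start:

```stata
clear
input str7 string_panel_id
"GB12345"
"FR12345"
"DE00001"
"FR12345"
"GB12345"
end

* by hand: flag the first observation of each panel unit, then cumulate
bysort string_panel_id: generate long by_hand_id = _n == 1
replace by_hand_id = sum(by_hand_id)

* directly
egen long numeric_panel_id = group(string_panel_id)

* the two identifiers should be identical
assert by_hand_id == numeric_panel_id

* each string identifier should map to exactly one numeric identifier ...
bysort string_panel_id (numeric_panel_id): assert numeric_panel_id[1] == numeric_panel_id[_N]
* ... and vice versa
bysort numeric_panel_id (string_panel_id): assert string_panel_id[1] == string_panel_id[_N]

list, sepby(numeric_panel_id)
```

The new variable can then be given as the panel identifier for the FE and RE models.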

(Incidentally, keeping track of all the non-numeric characters in a string variable is not that difficult. A utility -charlist- on SSC is dedicated to this small question.)
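A sketch of how that might look, assuming an internet connection for the download and the made-up data below. See the help file for the options and saved results:

```stata
clear
input str7 string_panel_id
"FR12345"
"GB12345"
end

* install from SSC, then list the characters present in the variable
ssc install charlist
charlist string_panel_id
```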

(Giorgia: the Statalist FAQ explains the Statalist convention of using -cmdname- to refer to a command of that name.)


Joseph Coveney

> First, generate a numeric variable that takes the value one at the first
> observation of a (sorted) panel unit, and zero at all succeeding
> observations of that panel unit. Then -sum()- the numeric variable across
> the dataset. The technique is illustrated below with dummy data of about
> 150 000 panel units.
>
> clear
> set more off
> set seed `=date("2006-02-25", "ymd")'
> set obs 150000
> generate str panel_unit = string(uniform(), "%19.18g")
> *
> * Begin here
> *
> bysort panel_unit: generate byte panel_number = _n == 1
> replace panel_number = sum(panel_number)
> exit

Giorgia Maffini

> I am working with a panel of more than 70,000 firms.
> When running FE and RE I need to specify the panel unit (firms in my
> dataset). The panel unit has to be recorded as a numeric variable, as I
> understand.
>
> In my data the firm identifier is a STRING variable with both numbers and
> letters. Example: firm with identifier FR12345 is different
> from firm with identifier GB12345.
>
> I used DESTRING-IGNORE but
> 1) it is difficult to track down all the characters present
> in the firm identifier variable
> 2) Different firms will get the same id number. Example: FR12345 and
> GB12345.
>
> I used ENCODE but I got the following error message (134):
> You attempted to encode a string variable that takes on more than
> 65,536 unique values.

