Phantom Types and How to Enhance Them

Posted on August 15, 2018

In an earlier post about config-joiner-json I briefly mentioned phantom types and the reason they are used in config-joiner-json. The goal in this post is to explain a little more about phantom types and small ways they can be enhanced to help create better defined programs.

If you are already familiar with phantom types, just jump ahead to Enhancing Phantom Types.

Phantom Types

What are they again?

A phantom type is just a type parameter that appears in the type constructor, but not in the data constructor for that type. That is to say, it appears only at the type level. This is more obvious in an example.

newtype NotSpookyType a = NotSpookyData a -- Not a phantom type! 

Here NotSpookyType is a very normal Haskell newtype declaration. The type parameter a appears in the type constructor, NotSpookyType, and in the data constructor, NotSpookyData, as well. At run time a value of type a is actually present in every non-bottom NotSpookyType that is constructed.

newtype SpookyType a = SpookyType Int -- Phantom Type!

Here SpookyType is making use of a phantom type, a. Notice that all constructed SpookyTypes will actually be wrapping an Int, regardless of what type is supplied as the argument to SpookyType. All the a is doing here is allowing the programmer to tag an arbitrary type on any constructed SpookyTypes for some additional information.

In addition, newtypes that include a phantom type parameter have the added benefit of being able to support many type classes, such as Functor:

instance Functor SpookyType where
  -- the function is never applied: there is no value of type a to apply it to
  fmap _ (SpookyType n) = SpookyType n

Obviously, these instances aren’t going to actually do much, but it does give the programmer access to a lot of extra tools for these data types.
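As a small sketch of what such an instance does in practice, the snippet below tags one value with String and then uses fmap to retag it with Int. The names tagged and retagged are made up for illustration. The wrapped Int is untouched, and the function passed to fmap never runs.

```haskell
newtype SpookyType a = SpookyType Int
  deriving (Show, Eq)

instance Functor SpookyType where
  -- The function is ignored: only the phantom parameter changes.
  fmap _ (SpookyType n) = SpookyType n

-- Hypothetical example value, tagged with String.
tagged :: SpookyType String
tagged = SpookyType 3

-- Same wrapped Int, now tagged with Int; length is never evaluated.
retagged :: SpookyType Int
retagged = fmap length tagged
```

Both values wrap the same 3. Only the type-level tag differs.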

What’s the point?

Phantom types can be used for many things, although the canonical and frequent use case is to encode a state on the type carrying the phantom around. This is exactly how config-joiner-json is making use of phantom types today, so that will be the example here.

Starting Off

In our library, config-joiner-json, we make extensive use of JSON values - reading them, joining them, etc. Due to the nature of the library, functions will frequently take two JSON values as arguments to process them, even though the two values are definitely different primitives.

type JSON            = ... 

-- | combines a common configuration with a specific configuration into a full configuration 
combine :: JSON -> JSON -> JSON

At first glance this function doesn’t offer a whole lot of insight (admittedly the documentation is sparse). What’s worse, we can trivially make mistakes with this API by swapping the order of the arguments.

Since we’re careful programmers and want to differentiate between the common and environment JSON values, we create newtypes for them.

type JSON            = ... 

-- |A config containing common values used across multiple environments
newtype CommonConfig = CommonConfig JSON

-- |A config containing some or all environment specific values
newtype EnvConfig    = EnvConfig JSON

-- | combines a common configuration with a specific configuration into a full configuration 
combine :: CommonConfig -> EnvConfig -> EnvConfig
combine = ...

So far so good: the ordering problem is gone, and now we can create a fully processed environment config in a more obvious manner. Notice, though, that we still have EnvConfig in a couple of places. This type describes the primitive “a configuration for a specific environment”, but not whether it is ready to be used in that environment (containing all required values). A user could still make odd mistakes with this type as well.

readCommonConfig :: IO CommonConfig
readCommonConfig = ...

readEnvConfig :: IO EnvConfig
readEnvConfig = ...

writeEnvConfig :: EnvConfig -> IO ()
writeEnvConfig = ...

main = do
  commonConfig <- readCommonConfig
  envConfig <- readEnvConfig
  let fullConfig = combine commonConfig envConfig
  writeEnvConfig envConfig -- oops! Breaks at runtime
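
To make the problem concrete, here is a minimal runnable sketch of the newtype version. It rests on two assumptions that are not from the library: a flat list of key/value pairs stands in for the elided JSON type, and combine lets environment values override common ones. The example values are hypothetical as well.

```haskell
-- Assumption: a flat key/value list stands in for the elided JSON type.
type JSON = [(String, String)]

newtype CommonConfig = CommonConfig JSON
newtype EnvConfig    = EnvConfig JSON deriving (Show, Eq)

-- Assumption: environment-specific values win over common ones.
combine :: CommonConfig -> EnvConfig -> EnvConfig
combine (CommonConfig c) (EnvConfig e) =
  EnvConfig (e ++ [kv | kv@(k, _) <- c, k `notElem` map fst e])

-- Hypothetical example data.
common :: CommonConfig
common = CommonConfig [("timeout", "30"), ("host", "localhost")]

partial :: EnvConfig
partial = EnvConfig [("host", "prod.example.com")]

full :: EnvConfig
full = combine common partial

-- combine partial common  -- rejected: the argument types do not line up
-- partial and full share the type EnvConfig, so the compiler cannot tell
-- a caller that it passed the unprocessed one by mistake.
```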

One solution to this would just be to add another type. Types are free after all, so let’s try it.

type JSON                   = ... 

-- |A config containing common values used across multiple environments
newtype CommonConfig        = CommonConfig JSON

-- |A config containing some or all environment specific values that still needs to be processed
newtype PreProcessEnvConfig = PreProcessEnvConfig JSON

-- |A config containing all environment specific values that is ready to be used
newtype PostProcessEnvConfig = PostProcessEnvConfig JSON

-- | combines a common configuration with a specific configuration into a full configuration 
combine :: CommonConfig -> PreProcessEnvConfig -> PostProcessEnvConfig
combine = ...

This solution actually works fine. All arguments have separate types for their specific use cases, and the user can’t accidentally reuse an environment JSON value that hasn’t been processed when they really want the fully processed value. There are a few drawbacks though. First, we can no longer write functions that operate on all environment config types regardless of their processing status (unless we write a type class for that, but we would still need an instance per type). Second, as the number of states grows, the number of newtypes must grow in lockstep with it. Here we are encoding each state as a new EnvConfig type, when we really have a single EnvConfig primitive whose state we want to track at the type level.
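
The type class workaround mentioned above might look roughly like the sketch below. The class IsEnvConfig, its methods, and dropKey are hypothetical names, and a key/value list again stands in for the elided JSON type. Note that every new state needs both a newtype and an instance.

```haskell
-- Assumption: a flat key/value list stands in for the elided JSON type.
type JSON = [(String, String)]

newtype PreProcessEnvConfig  = PreProcessEnvConfig JSON
newtype PostProcessEnvConfig = PostProcessEnvConfig JSON

-- Hypothetical class for "anything that wraps environment JSON".
class IsEnvConfig c where
  unwrapEnv :: c -> JSON
  wrapEnv   :: JSON -> c

-- One instance per state: this is the boilerplate that grows with each state.
instance IsEnvConfig PreProcessEnvConfig where
  unwrapEnv (PreProcessEnvConfig j) = j
  wrapEnv = PreProcessEnvConfig

instance IsEnvConfig PostProcessEnvConfig where
  unwrapEnv (PostProcessEnvConfig j) = j
  wrapEnv = PostProcessEnvConfig

-- A function that works for every state and keeps the state it was given.
dropKey :: IsEnvConfig c => String -> c -> c
dropKey key = wrapEnv . filter ((/= key) . fst) . unwrapEnv
```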

Phantom Type Approach

Instead of encoding our states as unique newtypes, we want to tag our one type with different states.

{-# LANGUAGE EmptyDataDecls #-}
type JSON                   = ... 

-- |A config containing common values used across multiple environments
newtype CommonConfig        = CommonConfig JSON

-- |The tagged type is not yet processed
data PreProcess              -- note, this has no constructor

-- |The tagged type is processed
data PostProcess             -- note, this has no constructor

-- |A config containing some or all environment specific value
newtype EnvConfig a          = EnvConfig JSON

-- | combines a common configuration with a specific configuration into a full configuration 
combine :: CommonConfig -> EnvConfig PreProcess -> EnvConfig PostProcess
combine = ...

Above is the encoding using phantom types. There are a few changes here. First, we are back to just a single newtype for environment configurations, EnvConfig, which takes a type parameter a for encoding its state. Second, there are the new state-tracking types, PreProcess and PostProcess. These make use of the EmptyDataDecls extension so that they do not include any data constructor at all - they only exist at the type level.

This approach turns out to be rather convenient. First, we can write functions that operate on all EnvConfigs trivially:

envConfigTransform :: EnvConfig a -> EnvConfig a
envConfigTransform = ... 

Second, callers can’t accidentally reuse a PreProcess config when a PostProcess config is required, assuming the API is making use of these types:

readCommonConfig :: IO CommonConfig
readCommonConfig = ...

readEnvConfig :: IO (EnvConfig PreProcess)
readEnvConfig = ...

writeEnvConfig :: EnvConfig PostProcess -> IO ()
writeEnvConfig = ...

main = do
  commonConfig <- readCommonConfig
  envConfig <- readEnvConfig
  let fullConfig = combine commonConfig envConfig
  writeEnvConfig envConfig -- doesn't compile! Success!

Finally, if we add new states that must be supported, we just make a new empty data declaration and all the generic functions we wrote just work! Not too bad.
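
Putting the pieces together, the following is a self-contained sketch of the phantom type approach. As before, the key/value list standing in for JSON and the merge behaviour of combine are assumptions. The bodies of envConfigTransform and render are hypothetical stand-ins for a state-agnostic function and a consumer that requires a processed config.

```haskell
{-# LANGUAGE EmptyDataDecls #-}

-- Assumption: a flat key/value list stands in for the elided JSON type.
type JSON = [(String, String)]

newtype CommonConfig = CommonConfig JSON

data PreProcess   -- no constructors: exists only at the type level
data PostProcess  -- no constructors: exists only at the type level

newtype EnvConfig a = EnvConfig JSON deriving (Show, Eq)

-- Assumption: environment-specific values win over common ones.
combine :: CommonConfig -> EnvConfig PreProcess -> EnvConfig PostProcess
combine (CommonConfig c) (EnvConfig e) =
  EnvConfig (e ++ [kv | kv@(k, _) <- c, k `notElem` map fst e])

-- Works for any state and preserves it (here: drop entries with empty values).
envConfigTransform :: EnvConfig a -> EnvConfig a
envConfigTransform (EnvConfig j) = EnvConfig (filter (not . null . snd) j)

-- Hypothetical consumer that only accepts fully processed configs.
render :: EnvConfig PostProcess -> String
render (EnvConfig j) = unlines [k ++ "=" ++ v | (k, v) <- j]
```

Passing an EnvConfig PreProcess to render is rejected by the type checker, while envConfigTransform accepts either state.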

Enhancing Phantom Types

By now you should hopefully have a decent idea about what phantom types are and when you might use them. An attentive reader may notice something at odds with phantom types and the way we typically make use of newtypes in Haskell programs.

newtype EnvConfig a = EnvConfig JSON

In Haskell we typically attempt to minimize the number of valid states in a program as much as possible, leveraging newtypes and data types along the way to capture the very specific states desired. Type parameters, on the other hand, are generalized extremely broadly. Typically that makes sense: there isn’t usually a use case for a List that only contains Num instances, for example.

-- hypothetical syntax, not actually supported in standard Haskell
data [] (Num a) => a = [] | a : [a] 

However, in the use case above for phantom types we are attempting to have a very small number of types that can be used as an encoding. This is where DataKinds comes in.

DataKinds

For those who haven’t bumped into DataKinds before, it is an extension that lifts type constructors to the kind level and data constructors to the type level. Let’s break that down really quick.

Kinds

A kind is the type of a type. The kind of ordinary types that have values is written * in Haskell (also spelled Type, exported from Data.Kind, since GHC 8.0). So Int has kind *. Types that take other types as arguments have a different kind. For example, Maybe has kind * -> *.

Type Constructors

A type constructor is a type that takes other types as arguments to create a concrete type. Maybe and [] are type constructors.

Data Constructors

A data constructor constructs a value of a type. Just is a data constructor of Maybe a, with type Just :: a -> Maybe a.

The DataKinds extension basically lets us use Data Constructors where we are using types, and Type Constructors where we are using kinds. Like most extensions in GHC, this has many applications and I definitely encourage you to try this extension out.
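
As a rough illustration of these three definitions, the sketch below annotates type parameters with their kinds. All of the names (Light, Box, Wrap, Switch, and the example values) are made up for this example. Type from Data.Kind is used in place of *.

```haskell
{-# LANGUAGE DataKinds      #-}
{-# LANGUAGE KindSignatures #-}

import Data.Kind (Type)

-- An ordinary data type: Light is a type, On and Off are data constructors.
data Light = On | Off

-- A parameter of the ordinary kind: Box Int, Box Bool, ...
newtype Box (a :: Type) = Box a

-- A parameter that is itself a type constructor: Wrap Maybe, Wrap [], ...
newtype Wrap (f :: Type -> Type) = Wrap (f Int)

-- With DataKinds, Light can be used as a kind, and 'On and 'Off as types.
newtype Switch (s :: Light) = Switch String

boxed :: Box Int
boxed = Box 1

wrapped :: Wrap Maybe
wrapped = Wrap (Just 1)

kitchen :: Switch 'On
kitchen = Switch "kitchen"

-- bad :: Switch Int  -- rejected: Int has kind Type, not Light

switchName :: Switch s -> String
switchName (Switch name) = name
```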

Phantom Types with Data Kinds

So how does DataKinds figure into our state encoding? We’re going to refactor EnvConfig to be parameterized on a type with a more limited kind than *, which is the kind of all ordinary types.

{-# LANGUAGE DataKinds      #-}
{-# LANGUAGE EmptyDataDecls #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE GADTs          #-}

-- |Current processing state of the file. 
data ProcessState :: * where
  -- |Indicates that the tagged type has not been processed
  PreProcess  :: ProcessState
  -- |Indicates that the tagged type has been processed
  PostProcess :: ProcessState
  
newtype EnvConfig (a :: ProcessState) = EnvConfig { envValue :: Value }

There are really only two changes here. First, notice that PreProcess and PostProcess have been combined into one single type, ProcessState. These are also no longer empty data declarations: we could create a ProcessState value using the constructors PreProcess and PostProcess if we really needed to (we still don’t need that though). Second, the parameter a on EnvConfig now has this extra weird annotation on it: a :: ProcessState. This can be read as “a has kind ProcessState”. Remember that typically the kind of a type parameter like this would be *, and with the above extensions turned on we can actually annotate the type parameter of our previous implementation with that kind signature:

-- old newtype with kind signature
newtype EnvConfig (a :: *) = EnvConfig { envValue :: Value }

With the type parameter a limited to the kind ProcessState, there are only two possible types that can be provided as type arguments: 'PreProcess and 'PostProcess, the promoted data constructors of the type ProcessState. The gain here is that whenever we write a general function, the generality of the type argument is limited to exactly one of these two states. Additionally, we only have to change a little syntax on our combine function.

combine :: CommonConfig -> EnvConfig 'PreProcess -> EnvConfig 'PostProcess

The ticks here just indicate that these data constructors are in fact being used as types. We also know in any general function that we write for arbitrary EnvConfigs

envConfigTransform :: EnvConfig a -> EnvConfig a
envConfigTransform = ... 

that the EnvConfig has a very precise state for a as well.
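
For reference, a compilable sketch of the DataKinds version follows. It declares ProcessState with ordinary data syntax rather than the GADT syntax used above, which is equivalent here. It also assumes a key/value list in place of the real JSON value and a merge in which environment values win. The body of envConfigTransform is hypothetical.

```haskell
{-# LANGUAGE DataKinds      #-}
{-# LANGUAGE KindSignatures #-}

-- Assumption: a flat key/value list stands in for the real JSON value type.
type JSON = [(String, String)]

-- Promoted by DataKinds: ProcessState becomes a kind, its constructors types.
data ProcessState = PreProcess | PostProcess

newtype CommonConfig = CommonConfig JSON

newtype EnvConfig (a :: ProcessState) = EnvConfig { envValue :: JSON }
  deriving (Show, Eq)

-- Assumption: environment-specific values win over common ones.
combine :: CommonConfig -> EnvConfig 'PreProcess -> EnvConfig 'PostProcess
combine (CommonConfig c) (EnvConfig e) =
  EnvConfig (e ++ [kv | kv@(k, _) <- c, k `notElem` map fst e])

-- Here a can only be 'PreProcess or 'PostProcess.
envConfigTransform :: EnvConfig a -> EnvConfig a
envConfigTransform (EnvConfig j) = EnvConfig (filter (not . null . snd) j)

-- bogus :: EnvConfig Int  -- rejected: Int has kind *, not ProcessState
```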

Finally, extending the supported states is still simple: instead of adding new empty data declarations, we add new data constructors to the type ProcessState.

Should I use this?

In general, the DataKinds approach is not needed, and in many cases it is overly restrictive. We could emulate the DataKinds version of our code here by not exporting the data constructor of EnvConfig and only exporting smart constructors and functions that include the types we want to support in our public / consumer API. The smart constructor approach also allows us as library writers to use any type as needed for a, which could be useful in some cases.
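
A minimal sketch of the smart constructor alternative is shown below. The module name, the export list in the comment, and the function mkEnvConfig are hypothetical, and a key/value list again stands in for JSON. The point is only that consumers never see the EnvConfig data constructor, so the exported functions decide which states can exist.

```haskell
-- In a library this would live in its own module whose export list omits
-- the EnvConfig data constructor, for example (hypothetical):
--
--   module Config
--     ( CommonConfig (..), EnvConfig, PreProcess, PostProcess
--     , mkEnvConfig, combine, envValue
--     ) where

-- Assumption: a flat key/value list stands in for the elided JSON type.
type JSON = [(String, String)]

data PreProcess
data PostProcess

newtype CommonConfig = CommonConfig JSON
newtype EnvConfig a  = EnvConfig JSON

-- Smart constructor: the only exported way to build an EnvConfig from raw
-- JSON, and it always yields the PreProcess state.
mkEnvConfig :: JSON -> EnvConfig PreProcess
mkEnvConfig = EnvConfig

-- The only exported way to reach the PostProcess state.
-- Assumption: environment-specific values win over common ones.
combine :: CommonConfig -> EnvConfig PreProcess -> EnvConfig PostProcess
combine (CommonConfig c) (EnvConfig e) =
  EnvConfig (e ++ [kv | kv@(k, _) <- c, k `notElem` map fst e])

-- Read-only accessor, exported in place of the constructor.
envValue :: EnvConfig a -> JSON
envValue (EnvConfig j) = j
```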

The smart constructor approach works well, and is used by many libraries today. In this case I like the DataKinds approach since it allows us to worry less about what is hidden and exported from the library. In this library DataKinds aren’t tying our hands enough to cause problems, instead giving us enough rope to implement the library, and use it, without worrying about weird states that our EnvConfigs could be in.