Interesting take. The article describes ENV variables and the inherent danger of depending on untyped values, but it doesn't really offer an alternative solution.
A good solution I've just seen is to validate the environment variables using a schema library like zod [1], which guarantees that the ENV variables are the exact types you expect when the program starts, or throws an error.
[1] Example -> https://github.com/t3-oss/create-t3-app/blob/bc57d02789209f1...
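For illustration, here's a minimal hand-rolled sketch of the same idea. The helper names (`requireInt`, `requireUrl`, `loadConfig`) are hypothetical, not zod's API; a library like zod expresses the same checks declaratively.

```typescript
// Minimal sketch of schema-validated environment variables.
// Helper names are hypothetical; zod does this declaratively.
type Env = Record<string, string | undefined>;

function requireInt(env: Env, name: string): number {
  const raw = env[name];
  const n = Number(raw);
  if (raw === undefined || !Number.isInteger(n)) {
    throw new Error(`env var ${name} must be an integer, got "${raw}"`);
  }
  return n;
}

function requireUrl(env: Env, name: string): string {
  const raw = env[name];
  if (raw === undefined) throw new Error(`missing env var ${name}`);
  new URL(raw); // throws if not a well-formed URL
  return raw;
}

// Validate once at startup; everything downstream sees typed values.
function loadConfig(env: Env) {
  return {
    port: requireInt(env, "PORT"),
    databaseUrl: requireUrl(env, "DATABASE_URL"),
  };
}
```

Calling `loadConfig(process.env)` at startup fails fast with a descriptive error, instead of letting a missing or malformed variable surface deep inside the program.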
Typing is nice, and when you're working within an existing system where environment variables are the main way of communicating data between parts of the system (as in many popular systems today), this kind of validation looks like it can add some nice benefits.
The blog post linked here is thinking about how the systems themselves could be designed differently, whether that's OSes, frameworks, platforms, languages, clusters, networks, or other things.
Animats 57 days ago [-]
Hm. So, like enums for the web?
Or, key/value stores where all key values are known.
Who maintains the master list?
sunfish 47 days ago [-]
In systems designed for it, communication channels can transmit handles without using a shared namespace or master list. A real-world example is Unix-domain sockets, which can send file descriptors across the socket without any namespace or master list.
josephg 58 days ago [-]
The normal alternatives to environment variables are:
- Command line arguments
- Or a configuration file. (Optionally at a path specified on a command-line argument)
jakewins 58 days ago [-]
I was pondering this while reading as well... What makes
FOO=bar myapp
Different from
myapp --foo=bar
Or
myfunc(foo="bar")
?
All of these are handled by various underlying mechanisms that then make the data available on the other side.
gwd 58 days ago [-]
I think probably the main thing is that environment variables are a distinct and well-defined channel; they don't have to interact with anything else syntactically. This makes them nice for, say, passing secret API keys to your program in a way that is less likely to leak (apart from the "log all env variables" problem discussed in the post).
Consider, say, aws-cli. If I have to pass the API key as an argument, I have to copy-and-paste it into every command I write, or write a shell wrapper, thus storing the key in a file, which could accidentally get checked in or leak some other way. Same with a configuration file. But if, once per shell session, I put the API key into an environment variable, then it's only stored in memory; there's virtually zero chance it will make it into a git tree, and it's not terribly inconvenient.
It's similar if you need to pass an API key to an app running in a docker container. You don't want to rely on the container image itself being secret; and since it's often very convenient to keep your Dockerfile in a git tree, you don't want the command in the Dockerfile to explicitly include your key. And do you really want to keep your config file on the data volume? Or create a special separate volume just for the configuration file (when in fact you want the vast majority of configuration stored in your docker image)?
Much easier to just tell the system running your container, "When you start this container, set this environment variable to this secret value." Then the system knows to treat that environment variable with discretion.
The alternative would be to tell the system running your container, "Add this secret rune to the command line before running", or "Add this secret config file". Environment variables are much easier to reason about.
jakewins 58 days ago [-]
I agree with this.
My question was about the blog post's assertion that something using environment variables to pass information is using a "ghost", while something like a command-line option isn't.
I guess, maybe the difference is in when and where the env var is specified. If you do something like
```
FOO=bar
<hundreds of lines of bash>
myapp # uses FOO
```
Then I think the blog post author has a point - there's "hidden" information in the environment/context that the callee expects and will use. If you copied just the `myapp` line and stuck it somewhere else, it would lose the implicit `FOO` it expects to have.
Vs something like
```
FOO=bar
<hundreds of lines of bash>
myapp --foo="${FOO}"
```
or
```
FOO=bar
<hundreds of lines of bash>
FOO="${FOO}" myapp # at least this way it's explicit!
```
kec 57 days ago [-]
Environment variables can have spooky action at a distance when inherited by subprocesses.
For your example specifically, though, there's no check against a variable you misspelled or neglected to specify, which could be even worse, as the variable may already exist in the environment with some other value you don't expect.
eithed 58 days ago [-]
From my understanding, it's the explicitness of passing and usage - if your application, for example, logs all env variables, `FOO=bar myapp` would log `FOO=bar XYZ=asd`, but `myapp --foo=bar` would log only `foo=bar`.
If two applications are setting FOO, then it results in undesired behaviour:
1. myapp1 runs, FOO=bar
2. myapp2 runs, FOO=xyz
3. myapp1 tries to use FOO, gets xyz
Instead:
1. myapp1 --foo=bar runs
2. myapp2 --foo=xyz runs
3. myapp1 uses foo, gets bar
In my example I set the env var on the same line as the command, so the env var is only set in the new process, per the Open Group Base Specifications:
> If the command name is not a special built-in utility or function, the variable assignments shall be exported for the execution environment of the command and shall not affect the current execution environment
E.g.
FOO=bar myapp
FOO=xyz myapp
echo $FOO # prints empty string/newline
bkfunk 57 days ago [-]
The issue is, can myapp know that it was set that way, vs being set somewhere else? The command line argument uses syntactical constraints to ensure that the value of the argument was set specifically with the intention of using it in myapp.
eptcyka 57 days ago [-]
Technically, environment variables can be changed at runtime, and so can the config file, but I don't believe that's the case for arguments.
gpderetta 58 days ago [-]
Scoping.
fefe23 58 days ago [-]
I have a hard time understanding what the article is trying to tell me.
As soon as you have components talk to each other over IPC, you pass strings of bytes. It does not matter whether you create a nice abstraction class around a string. Over the wire it is still a string.
What is he proposing we do to prevent confusion between strings?
Adding namespaces does not appear to be a useful idea because it boils down to "do input validation", which you are hopefully already doing.
sunfish 57 days ago [-]
There are IPC mechanisms today which aren't just bytes. For example, the ability to send file descriptors over unix-domain sockets. Strings of bytes fundamentally can't do that. And in programs that pass file descriptors, it doesn't require any ghostly assumptions about what namespace the strings need to be resolved in.
To be sure, Unix-domain sockets aren't the answer to everything, but they are an example of a different way to think about communication.
ivanbakel 57 days ago [-]
As a programmer in a high-level language, you do not need to have your ear to the wire. Just because whatever network-communication abstraction you use ends up serialized as a sequence of bytes doesn't mean you should be thinking about byte sequences when communicating over a network.
>What is he proposing we do to prevent confusion between strings?
Don't use a paradigm where your code can see the strings. If your high-level language only manipulates resources, it can't confuse strings with each other. It also can't manipulate strings to try to access different resources.
Sure, at the low level, you'll still be doing communication via bytestrings. But you can keep the footprint of that "core" low-level component small and trustworthy, and let the high level work confidently.
An analogy is using a memory-safe language like Java vs C. Java still relies on fast-but-scary code like pointer arithmetic for fast execution, but Java programmers don't need to know about pointers and aren't constantly on the lookout for buffer overflows or other bugs that plague C programs. The guarantees of the JVM let Java programmers do more expressive and saner things, like "take all the even numbers in a list greater than 100".
skybrian 58 days ago [-]
This article seems to be recommending the use of capabilities, but a question is how you represent a capability, if not as a string or a number like a file descriptor. And how do you send it over a network, if not as a byte sequence?
gpderetta 58 days ago [-]
For capabilities to work they need to be unforgeable. So you need some opaque handle (and a memory/type-safe language), and on the wire you need some sort of cryptographic signing.
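As a rough sketch of the on-the-wire half, a bearer token can be made unforgeable with an HMAC. This illustrates only the signing idea, not a complete capability system; the function names are made up.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sketch of an unforgeable, string-serializable capability token.
// Only holders of `key` (e.g. the issuing service) can mint valid
// tokens; tampering with the capability text invalidates the MAC.
function mint(key: Buffer, capability: string): string {
  const mac = createHmac("sha256", key).update(capability).digest("hex");
  return `${capability}.${mac}`;
}

// Returns the capability if the token is authentic, or null if forged.
function verify(key: Buffer, token: string): string | null {
  const dot = token.lastIndexOf(".");
  if (dot < 0) return null;
  const capability = token.slice(0, dot);
  const mac = Buffer.from(token.slice(dot + 1), "hex");
  const expected = createHmac("sha256", key).update(capability).digest();
  if (mac.length !== expected.length || !timingSafeEqual(mac, expected)) {
    return null;
  }
  return capability;
}
```

A real system would also bind tokens to an audience and expiry, but the core property is visible here: without the signing key, no amount of string manipulation produces a valid token.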
skybrian 57 days ago [-]
Seems like you could store an unforgeable hash in an environment variable, though, or put it in a constant in source code. This only works if you treat it as a secret. Also, maybe they should expire?
unsafecast 58 days ago [-]
The idea is that you don't look. A handle is opaque. You don't care or depend on what's in it. Doesn't matter if it's actually a number or a string.
As for sending over a network, that's the low level details. You can keep high-level types most of the way.
skybrian 57 days ago [-]
Types work within a single process, but that's pretty limiting. Most people are dealing with multi-process systems, one way or another.
There is a trend towards more sandboxing - the sandboxes and pinholes model described in the article.
tonto 58 days ago [-]
This is an important article. It's a bit hard to grok and requires some experience to understand, but after you end up with a messy system doing too much ghosty stuff, you yearn for stability.
zwkrt 58 days ago [-]
Agreed. This is one of the few articles I’ve read this month that has really caused me to reflect on my own systems.
The insight that
open("foo.txt")
is, from the perspective of component analysis, actually secretly
open(filesystem, access_level, "foo.txt")
is worth its weight in gold. I feel like I understand more about system design, capability-based permissions, and my own code.
Realistically it won’t immediately change the way I program and I won’t be running to the terminal to refactor my current codebase, but it’s like I’ve been clued in to a whole new class of code smells.
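One way to make that difference concrete is to compare an ambient-authority call with one where the filesystem is an explicit parameter. The `Filesystem` interface below is a hypothetical toy, just to show the shape:

```typescript
// Hypothetical capability-style filesystem: a component can only open
// files through a handle it was explicitly given.
interface Filesystem {
  open(path: string): string; // returns file contents in this toy sketch
}

// Ambient style would be `open("foo.txt")`: the global filesystem and
// the caller's access level are ghosts the signature never mentions.
// Capability style makes the dependency visible and substitutable:
function readGreeting(fs: Filesystem): string {
  return fs.open("foo.txt");
}

// The caller decides which filesystem the component gets -- here, a
// fake one, which also makes the component trivially testable.
const fakeFs: Filesystem = {
  open: (path) => (path === "foo.txt" ? "hello" : ""),
};
```

The component can now only touch what it was handed, and its full set of dependencies is readable from its signature.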
jeffparsons 58 days ago [-]
The parts about accidentally leaking secrets reminded me of an idea I had a while back that builds on the idea of capabilities / unforgeable handles, and making a distinction between different kinds of trust.
An outer component, in this example a web service, might be provided with a handle to a secret that it needs for connecting to some other system — let's take Consul, for example. The web service is trusted to basically _intend_ to do the right thing, but it is not trusted to be vigilant enough to avoid leaking the secret, so it is not allowed to ever actually resolve that secret itself.
What it _can_ do is provide the secret-handle to another component whose job is to establish the connection to Consul for it. That second component has to be given a handle to a secret (it can't just look them up by itself) but once it has one, it can resolve it to the actual secret string. This second component does as little as possible, and is trusted to not accidentally leak the actual secret string — not even to the outer component that is using it.
The Wasm component model makes this sort of scheme really easy to implement because capabilities / unforgeable handles are a first-class concept and they are available for all components to create and communicate between each other.
I guess this might already be a well-established pattern elsewhere, but I don't remember seeing it anywhere.
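A rough sketch of that split (all names hypothetical; in the Wasm component model the boundary would be enforced by the runtime rather than a closure):

```typescript
// Sketch of the secret-handle pattern: the outer component passes
// opaque handles around but can never read the secret string; only
// the small connector component resolves them.
type Handle = { readonly id: symbol };

function makeVault() {
  // Secrets live only inside this closure, keyed by opaque handles.
  const secrets = new WeakMap<Handle, string>();

  function issue(secret: string): Handle {
    const handle: Handle = { id: Symbol("secret-handle") };
    secrets.set(handle, secret);
    return handle;
  }

  // The "connector": the only code that ever sees the real string.
  // Here it simulates using the secret to establish a connection.
  function connect(handle: Handle): { connected: boolean } {
    const secret = secrets.get(handle);
    if (secret === undefined) throw new Error("unknown handle");
    // ...authenticate with `secret`; never return or log it...
    return { connected: true };
  }

  return { issue, connect };
}
```

Accidentally logging a handle prints an opaque object, not the token; `JSON.stringify` on a handle yields `{}` because symbol values are dropped from JSON output.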
gpderetta 57 days ago [-]
Doesn't this just move the problem to leaking the handle instead?
sunfish 57 days ago [-]
It does, but that opens up much more powerful tools to work with.
As an example of one such tool in practice, compare the task of "list all open file descriptors in an arbitrary Unix process" with "list all strings an arbitrary Unix process incorporates some knowledge of". One is a one-liner (`lsof -p <pid>`) and one is really tricky at best, and probably can't be done reliably.
jeffparsons 57 days ago [-]
Yes, but that's much less likely to happen accidentally/implicitly, whereas accidentally leaking secret strings into logs etc. is extremely common.
It's not a silver bullet, but it doesn't need to be to radically improve security in practice.
AtlasBarfed 57 days ago [-]
So they want either:
- "fully qualified names" ... to some arbitrary obvious-after-the-fact degree, because fully FULL qualification becomes one of the heavyweight barriers he doesn't like: a central name validator/registry, reduced ability to reuse data because of the funny wrapping/name
- universal data typing, but that doesn't exist, and would be a barrier if it did
- universal data formats: oh god, that means standards bodies, doesn't it.
I totally agree about the "make services interact --> they interact, but there are security holes --> impose security" loop.
Animats 57 days ago [-]
"By “ghost” here, I mean any situation where resources are referenced by plain data."
OK, whatever.
What he's railing against is canned strings which identify things. URLs, names in key/value stores, etc.
Attempts to get rid of that include the Windows Registry. That may not be a good example.
Another attempt is identifying everything with an arbitrary GUID or UUID. Pixar moved away from that in their USD format for animation data.
Guthur 58 days ago [-]
Is this not just context-free vs context-sensitive?
nivertech 56 days ago [-]
Isn't "No Ghosts" the same as "Make the implicit - explicit"?
nemo1618 57 days ago [-]
tl;dr: A ghost is the opposite of a capability. Use capabilities.