The crew package has unavoidable risks, and the user is
responsible for safety, security, and computational resources. This
vignette describes known risks and safeguards, but is by no means
exhaustive. Please read the software
license.
The crew package launches external worker processes to
run tasks, which may include expensive jobs on cloud services like AWS
Batch or traditional clusters like SLURM. In the event of a poorly-timed
crash or network error, these processes may not terminate properly. If
that happens, they will continue to run, which may strain traditional
clusters or incur heavy expenses on the cloud.
Please monitor the platforms you use and manually terminate defunct or
hanging processes as needed. To list and terminate local processes,
please use crew_monitor_local() as explained in the
introduction vignette. To manage and monitor non-local high-performance
computing workers such as those on SLURM and AWS Batch, please
familiarize yourself with the given computing platform, and consider
using the monitor objects in the relevant third-party plugin packages
such as crew.cluster
or crew.aws.batch.
Example: https://wlandau.github.io/crew.aws.batch/index.html#job-management.
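As a minimal sketch of the local case, the snippet below lists local worker processes with a monitor object. It assumes the monitor exposes workers() and terminate() methods as described in the introduction vignette. The terminate() call is commented out because it kills processes, so review the process IDs before running it.

```r
library(crew)

# Create a monitor for crew processes on the local machine.
monitor <- crew_monitor_local()

# Process IDs of local crew workers (empty if none are running).
worker_pids <- monitor$workers()
print(worker_pids)

# After confirming that these processes are defunct, terminate them:
# monitor$terminate(worker_pids)
```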
The local R process could crash if resources are exhausted. A common
cause of crashes is running out of computer memory. The “Resources”
section of the introduction
explains how to monitor memory usage. If you are running
crew in a targets
pipeline (as explained in the
targets user manual), consider setting
storage = "worker" and retrieval = "worker" in
tar_option_set() to minimize memory consumption of the
local processes (see also the performance
chapter).
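A _targets.R file with these settings might look like the sketch below. The target names, the simulated data, and the choice of two workers are illustrative assumptions, not recommendations.

```r
# _targets.R (illustrative sketch; target names are hypothetical)
library(targets)
library(crew)

tar_option_set(
  controller = crew_controller_local(workers = 2L),
  storage = "worker",   # workers save their own results
  retrieval = "worker"  # workers load their own dependencies
)

list(
  tar_target(raw_data, rnorm(1e6)),
  tar_target(summary_stats, summary(raw_data))
)
```

With both options set to "worker", the local process does not need to hold each target's data in memory in order to pass it along.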
In addition, crew worker processes may crash silently at
runtime, or they may fail to launch or connect at all. The reasons may
be platform-specific, but here are some common possibilities:
crew.aws.batch
and crew.cluster
expose special platform-specific parameters in the controllers to do
this.
In addition, crew occupies one TCP port per controller.
TCP ports range from 0 to 65535, and only around 16000 of these ports
are considered ephemeral or dynamic, so please be careful not to run too
many controllers simultaneously on shared machines, especially in controller
groups. The terminate() method frees these ports again for
other processes to use.
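The sketch below illustrates both points. It assumes the IANA dynamic port range of 49152 to 65535 (operating systems may configure a different range), and it shows a controller releasing its port with terminate() once the work is done. The worker count is arbitrary.

```r
library(crew)

# Size of the IANA dynamic (ephemeral) port range, 49152 through 65535.
ephemeral_ports <- 65535L - 49152L + 1L

# Each started controller occupies one TCP port until it is terminated.
controller <- crew_controller_local(workers = 2L)
controller$start()

# ... push tasks and collect results here ...

# Shut down the workers and free the port for other processes.
controller$terminate()
```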
By default, crew uses unencrypted TCP connections for
transactions among workers. In a compromised network, an attacker can
read the data in transit, and even gain direct access to the client or
host.
It is best to avoid persistent direct connections between your local
computer and the public internet. The host argument of the
controller should not be a public IP address. Instead, please try to
operate entirely within a perimeter such as a firewall, a virtual
private network (VPN), or an Amazon Web Services (AWS) security group.
In the case of AWS, your security group can open ports to itself. That
way, the crew workers on e.g. AWS Batch jobs can connect to
a crew client running in the same security group on an AWS
Batch job or EC2 instance.
In the age of Zero Trust, perimeters alone are seldom sufficient. Transport layer security (TLS) encrypts data to protect it from eavesdroppers while it travels over a network. TLS is the state of the art of encryption for network communications, and it is responsible for security in popular protocols such as HTTPS. TLS is based on public key cryptography, which requires two files: a private key, which stays secret, and a public key, which is shared with the other party.
To use TLS in crew with automatic configuration, simply
set tls = crew_tls(mode = "automatic") in the controller,
e.g. crew_controller_local().[1] mirai generates a
one-time key pair and encrypts data for the current crew
client. The key pair expires when the client terminates, which reduces
the risk of a breach. In addition, the public key is a self-signed
certificate, which somewhat protects against tampering on its way from
the client to the server.
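For illustration, a local controller with automatic TLS could be set up roughly as follows. The worker count is an arbitrary assumption, and the same tls argument applies to other controllers that expose it.

```r
library(crew)

# Request a one-time key pair and encrypted connections to the workers.
controller <- crew_controller_local(
  workers = 2L,
  tls = crew_tls(mode = "automatic")
)
controller$start()

# ... push tasks and collect results here ...

# The key pair expires when the client terminates.
controller$terminate()
```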
[1] Launcher plugins should expose the tls argument of
crew_client().