The Apache Arrow project is implemented in multiple languages, and
the R package depends on the Arrow C++ library (referred to from here on
as libarrow). This means that when you install arrow, you need both the
R and C++ versions. If you install arrow from CRAN on a machine running
Windows or MacOS, when you call
a precompiled binary containing both the R package and libarrow will be
downloaded. However, CRAN does not host R package binaries for Linux,
and so you must choose from one of the alternative approaches.
This vignette outlines the recommend approaches to installing arrow on Linux, starting from the simplest and least customisable to the most complex but with more flexbility to customise your installation.
The intended audience for this document is arrow R package
users on Linux, and not Arrow developers. If you’re
contributing to the Arrow project, see
vignette("developing", package = "arrow") for resources to
help you on set up your development environment. You can also find a
more detailed discussion of the code run during the installation process
in the developers’
Having trouble installing arrow? See the “Troubleshooting” section below.
As mentioned above, on macOS and Windows, when you run
install.packages("arrow"), and install arrow from CRAN, you
get an R binary package that contains a precompiled version of libarrow,
though CRAN does not host binary packages for Linux. This means that the
default behaviour when you run
install.packages() on Linux
is to retrieve the source version of the R package that has to be
compiled locally, including building libarrow from source. See method 2
below for details of this.
For a faster installation, we recommend that you instead use one of the methods below for installing arrow with a precompiled libarrow binary.
If you want a quicker installation process, and by default a more fully-featured build, you could install arrow from RStudio’s public package manager, which hosts binaries for both Windows and Linux.
For example, if you are using Ubuntu 20.04 (Focal):
install.packages("arrow", repos = "https://packagemanager.rstudio.com/all/__linux__/focal/latest")
For other Linux distributions, to get the relevant URL, you can visit the RSPM site, click on ‘binary’, and select your preferred distribution.
Similarly, if you use
conda to manage your R
environment, you can get the latest official release of the R package
including libarrow via:
conda install -c conda-forge --strict-channel-priority r-arrow
Another way of achieving faster installation with all key features
enabled is to use our self-hosted libarrow binaries. You can do this by
NOT_CRAN environment variable before you call
Sys.setenv("NOT_CRAN" = TRUE) install.packages("arrow")
This installs the source version of the R package, but during the installation process will check for compatible libarrow binaries that we host and use those if available. If no binary is available or can’t be found, then this option falls back onto method 2 below, but results in a more fully-featured build than default.
When you install libarrow, its dependencies will be automatically
The environment variable
whether the libarrow installation also downloads or installs all
dependencies (when set to
BUNDLED), uses only
system-installed dependencies (when set to
checks system-installed dependencies first and only installs
dependencies which aren’t already present (when set to
These dependencies vary by platform; however, if you wish to install these yourself prior to libarrow installation, we recommend that you take a look at the docker file for whichever of our CI builds (the ones ending in “cpp” are for building Arrow’s C++ libaries aka libarrow) corresponds most closely to your setup. This will contain the most up-to-date information about dependencies and minimum versions.
The arrow package allows you to work with data in AWS S3 or in other
cloud storage system that emulate S3. However, support for working with
S3 is not enabled in the default build, and it has additional system
requirements. To enable it, set the environment variable
choose the full-featured build, or more selectively set
ARROW_S3=ON. You also need the following system
gcc>= 4.9 or
clang>= 3.3; note that the default compiler on CentOS 7 is gcc 4.8.5, which is not sufficient
The prebuilt libarrow binaries come with S3 support enabled, so you will need to meet these system requirements in order to use them–the package will not install without them (and will error with a message that explains this).If you’re building everything from source, the install script will check for the presence of these dependencies and turn off S3 support in the build if the prerequisites are not met–installation will succeed but without S3 functionality. If afterwards you install the missing system requirements, you’ll need to reinstall the package in order to enable S3 support.
Generally compiling and installing R packages with C++ dependencies, requires either installing system packages, which you may not have privileges to do, or building the C++ dependencies separately, which introduces all sorts of additional ways for things to go wrong, which is why we recommend method 1 above.
However, if you wish to fine-tune or customise your Linux installation, the instructions in this section explain how to do that.
If you wish to install libarrow from source instead of looking for
pre-compiled binaries, you can set the
Sys.setenv("LIBARROW_BINARY" = FALSE)
By default, this is set to
TRUE, and so libarrow will
only be built from source if this environment variable is set to
FALSE or no compatible binary for your OS can be found.
When compiling libarrow from source, you have the power to really
fine-tune which features to install. You can set the environment
FALSE to enable a
more full-featured build including S3 support and alternative memory
Sys.setenv("LIBARROW_MINIMAL" = FALSE)
By default this variable is unset; if set to
trimmed-down version of arrow is installed with many features
Note that in this guide, you will have seen us mention the
NOT_CRAN - this is a convenience
variable, which when set to
TRUE, automatically sets
Building libarrow from source requires more time and resources than
installing a binary. We recommend that you set the environment variable
TRUE for more verbose output
during the installation process if anything goes wrong.
Sys.setenv("ARROW_R_DEV" = TRUE)
Once you have set these variables, call
install.packages() to install arrow using this
The section below discusses environment variables you can set before
install.packages("arrow") to build from source and
customise your configuration.
In this section, we describe how to fine-tune your installation at a more granular level.
Some features are optional when you build Arrow from source - you can configure whether these components are built via the use of environment variables. The names of the environment variables which control these features and their default values are shown below.
||S3 support (if dependencies are met)*||
||The JSON parsing library||
||The RE2 regular expression library, used in some string compute functions||
||The UTF8Proc string library, used in many other string compute functions||
There are a number of other variables that affect the
configure script and the bundled build script. All boolean
variables are case-insensitive.
||Allow building from source||
||Try to install
||Build with minimal features enabled||(unset)|
||More verbose messaging and regenerates some code||
||Directory to save source build logs||(unset)|
||Alternative CMake path||(unset)|
See below for more in-depth explanations of these environment variables.
LIBARROW_BINARY: If set to
true, the script will try to download a binary C++ library built for your operating system. You may also set it to some other string, a related “distro-version” that has binaries built that work for your OS. See the distro map for compatible binaries and OSs. If no binary is found, installation will fall back to building C++ dependencies from source.
LIBARROW_BUILD: If set to
false, the build script will not attempt to build the C++ from source. This means you will only get a working arrow R package if a prebuilt binary is found. Use this if you want to avoid compiling the C++ library, which may be slow and resource-intensive, and ensure that you only use a prebuilt binary.
LIBARROW_MINIMAL: If set to
false, the build script will enable some optional features, including S3 support and additional alternative memory allocators. This will increase the source build time but results in a more fully functional library. If set to
trueturns off Parquet, Datasets, compression libraries, and other optional features. This is not commonly used but may be helpful if needing to compile on a platform that does not support these features, e.g. Solaris.
NOT_CRAN: If this variable is set to
true, as the
devtoolspackage does, the build script will set
LIBARROW_MINIMAL=falseunless those environment variables are already set. This provides for a more complete and fast installation experience for users who already have
NOT_CRAN=trueas part of their workflow, without requiring additional environment variables to be set.
ARROW_R_DEV: If set to
true, more verbose messaging will be printed in the build script.
arrow::install_arrow(verbose = TRUE)sets this. This variable also is needed if you’re modifying C++ code in the package: see the developer guide vignette.
ARROW_USE_PKG_CONFIG: If set to
false, the configure script won’t look for Arrow libraries on your system and instead will look to download/build them. Use this if you have a version mismatch between installed system libraries and the version of the R package you’re installing.
LIBARROW_DEBUG_DIR: If the C++ library building from source fails (
cmake), there may be messages telling you to check some log file in the build directory. However, when the library is built during R package installation, that location is in a temp directory that is already deleted. To capture those logs, set this variable to an absolute (not relative) path and the log files will be copied there. The directory will be created if it does not exist.
CMAKE: When building the C++ library from source, you can specify a
/path/to/cmaketo use a different version than whatever is found on the
Daily development builds, which are not official releases, can be installed from the Ursa Labs repository:
Sys.setenv(NOT_CRAN = TRUE) install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
or for conda users via:
conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
You can also install the R package from a git checkout:
git clone https://github.com/apache/arrow cd arrow/r R CMD INSTALL .
If you don’t already have libarrow on your system, when installing the R package from source, it will also download and build libarrow for you. See the section above on build environment variables for options for configuring the build source and enabled features.
The previous instructions are useful for a fresh arrow installation,
but arrow provides the function
install_arrow(), which you
can use if you:
install_arrow() provides some convenience wrappers
around the various environment variables described below.
Although this function is part of the arrow package, it is also available as a standalone script, so you can access it for convenience without first installing the package:
install_arrow(nightly = TRUE)
install_arrow(verbose = TRUE)
install_arrow() does not require environment variables
to be set in order to satisfy C++ dependencies.
Note that, unlike packages like
blogdown, and others that require external dependencies, you do not need to run
install_arrow()after a successful arrow installation.
install-arrow.R file also includes the
create_package_with_all_dependencies() function. Normally,
when installing on a computer with internet access, the build process
will download third-party dependencies as needed. This function provides
a way to download them in advance.
Doing so may be useful when installing Arrow on a computer without
internet access. Note that Arrow can be installed on a computer
without internet access without doing this, but many useful features
will be disabled, as they depend on third-party components. More
arrow::arrow_info()$capabilities() will be
FALSE for every capability. One approach to add more
capabilities in an offline install is to prepare a package with
pre-downloaded dependencies. The
create_package_with_all_dependencies() function does this
If you’re using binary packages you shouldn’t need to follow these steps. You should download the appropriate binary from your package repository, transfer that to the offline computer, and install that. Any OS can create the source bundle, but it cannot be installed on Windows. (Instead, use a standard Windows binary package.)
Note if you’re using RStudio Package Manager on Linux: If you still
want to make a source bundle with this function, make sure to set the
first repo in
options("repos") to be a mirror that contains
source packages (that is: something other than the RSPM binary mirror
my_arrow_pkg.tar.gzto the computer without internet access
install.packages("my_arrow_pkg.tar.gz", dependencies = c("Depends", "Imports", "LinkingTo"))
cmakemust be available
arrow_info()to check installed capabilities
cpp/thirdparty/download_dependencies.shmay be helpful)
ARROW_THIRDPARTY_DEPENDENCY_DIRon the offline computer, pointing to the copied directory.
The intent is that
install.packages("arrow") will just
work and handle all C++ dependencies, but depending on your system, you
may have better results if you tune one of several parameters. Here are
some known complications and ways to address them.
If you see a message like
------------------------- NOTE --------------------------- There was an issue preparing the Arrow C++ libraries. See https://arrow.apache.org/docs/r/articles/install.html ---------------------------------------------------------
in the output when the package fails to install, that means that installation failed to retrieve or build the libarrow version compatible with the current version of the R package.
Please check the “Known installation issues” below to see if any
apply, and if none apply, set the environment variable
ARROW_R_DEV=TRUE for more verbose output and try installing
again. Then, please report an
issue and include the full installation output.
If a system library or other installed Arrow is found but it doesn’t
match the R package version (for example, you have libarrow 1.0.0 on
your system and are installing R package 2.0.0), it is likely that the R
bindings will fail to compile. Because the Apache Arrow project is under
active development, it is essential that versions of libarrow and the R
package matches. When
install.packages("arrow") has to
download libarrow, the install script ensures that you fetch the
libarrow version that corresponds to your R package version. However, if
you are using a version of libarrow already on your system, version
match isn’t guaranteed.
To fix version mismatch, you can either update your libarrow system
packages to match the R package version, or set the environment variable
ARROW_USE_PKG_CONFIG=FALSE to tell the configure script not
to look for system version of libarrow. (The latter is the default of
install_arrow().) System libarrow versions are available
corresponding to all CRAN releases but not for nightly or dev versions,
so depending on the R package version you’re installing, system libarrow
version may not be an option.
Note also that once you have a working R package installation based
on system (shared) libraries, if you update your system libarrow
installation, you’ll need to reinstall the R package to match its
version. Similarly, if you’re using libarrow system libraries, running
update.packages() after a new release of the arrow package
will likely fail unless you first update the libarrow system
If the R package finds and downloads a prebuilt binary of libarrow, but then the arrow package can’t be loaded, perhaps with “undefined symbols” errors, please report an issue. This is likely a compiler mismatch and may be resolvable by setting some environment variables to instruct R to compile the packages to match libarrow.
A workaround would be to set the environment variable
LIBARROW_BINARY=FALSE and retry installation: this value
instructs the package to build libarrow from source instead of
downloading the prebuilt binary. That should guarantee that the compiler
If a prebuilt libarrow binary wasn’t found for your operating system
but you think it should have been, check the logs for a message that
*** Unable to identify current OS/version, or a
message that says
*** No C++ binaries found for an invalid
OS. If you see either, please report an
issue. You may also set the environment variable
ARROW_R_DEV=TRUE for additional debug messages.
A workaround would be to set the environment variable
LIBARROW_BINARY to a
exists in the Ursa Labs repository. Setting
is also an option when there’s not an exact match for your OS but a
similar version would work, such as if you’re on
ubuntu-18.10 and there’s only a binary for
If that workaround works for you, and you believe that it should work for everyone else too, you may propose adding an entry to this lookup table. This table is checked during the installation process and tells the script to use binaries built on a different operating system/version because they’re known to work.
If building libarrow from source fails, check the error message. (If
you don’t see an error message, only the
----- NOTE -----,
set the environment variable
ARROW_R_DEV=TRUE to increase
verbosity and retry installation.) The install script should work
everywhere, so if libarrow fails to compile, please report an
issue so that we can improve the script.
On CentOS, if you are using a more modern
devtoolset, you may need to set the environment variables
CXX either in the shell or in R’s
Makeconf. For CentOS 7 and above, both the Arrow system
packages and the C++ binaries for R are built with the default system
compilers. If you want to use either of these and you have a
devtoolset installed, set
CC=/usr/bin/gcc CXX=/usr/bin/g++ to use the system
compilers instead of the
devtoolset. Alternatively, if you
want to build arrow with the newer
false so that you build the
Arrow C++ from source using those compilers. Compiler mismatch between
the arrow system libraries and the R package may cause R to segfault
when arrow package functions are used. See discussions here and here.
If you have multiple versions of
zstd installed on
your system, installation by building libarrow from source may fail with
an “undefined symbols” error. Workarounds include (1) setting
LIBARROW_BINARY to use a C++ binary; (2) setting
ARROW_WITH_ZSTD=OFF to build without
(3) uninstalling the conflicting
zstd. See discussion here.
As mentioned above, please report an issue if you encounter ways to improve this. If you find that your Linux distribution or version is not supported, we welcome the contribution of Docker images (hosted on Docker Hub) that we can use in our continuous integration. These Docker images should be minimal, containing only R and the dependencies it requires. (For reference, see the images that R-hub uses.)
You can test the arrow R package installation using the
docker-compose setup included in the
apache/arrow git repository. For example,
R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose build r R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose run r
installs the arrow R package, including libarrow, on the rhub/ubuntu-gcc-release image.