The epitweetr package allows you to automatically monitor trends of tweets by time, place and topic. This automated monitoring aims at the early detection of public health threats through the detection of signals (e.g. an unusual increase in the number of tweets for a specific time, place and topic). The
epitweetr package was designed to focus on infectious diseases, and it can be extended to all hazards or other fields of study by modifying the topics and keywords.
The general principle behind
epitweetr is that it collects tweets and related metadata from the Twitter Standard API versions 1.1 (https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/overview) and 2 (https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent) according to specified topics, and stores these tweets on your computer in a database that can be used to calculate statistics or act as a search engine.
epitweetr geolocalises the tweets and collects information on keywords, URLs and hashtags within a tweet, as well as entities and context detected by the Twitter API version 2. Tweets are aggregated by topic and geographical location. Next, a signal detection algorithm identifies when the number of tweets (by topic and geographical location) exceeds what is expected for a given day. If the number of tweets exceeds what is expected,
epitweetr sends out email alerts to notify those who need to further investigate these signals following the epidemic intelligence processes (filtering, validation, analysis and preliminary assessment).
The package includes an interactive web application (Shiny app) with six pages:

- the dashboard, where a user can visualise and explore tweets (Fig 1)
- the alerts page, where you can view the current alerts and train machine learning models for alert classification on user-defined categories (Fig 2)
- the geotag page, where you can evaluate the geolocation algorithm and provide annotations for improving its performance (Fig 3)
- the data protection page, where the user can search, anonymise and delete tweets from the epitweetr database to support data deletion requests (Fig 4)
- the configuration page, where you can change settings and check the status of the underlying processes (Fig 5)
- the troubleshoot page, with automatic checks and hints for using epitweetr with all its functionalities (Fig 6)
On the dashboard, users can view the aggregated number of tweets over time, the location of these tweets on a map, and the most frequent elements found in or extracted from these tweets (words, hashtags, URLs, contexts and entities). These visualisations can be filtered by the topic, location and time period you are interested in. Other filters include the time unit of the timeline, whether retweets/quotes should be included, the geolocation types you are interested in, the sensitivity of the prediction interval for the signal detection, and the number of days used to calculate the threshold for signals. This information is also downloadable directly from this interface in the form of data, pictures and/or reports.
More information on the methodology used is available in the epitweetr peer-reviewed publication. In addition, you can visit the general post in the discussion forum of the GitHub epitweetr repository for additional materials and training.
Shiny app dashboard:
Shiny app alerts page:
Shiny app geotag evaluation page:
Shiny app data protection page:
Shiny app configuration page:
Shiny app troubleshoot page:
Article 3 of the European Centre for Disease Prevention and Control (ECDC) funding regulation and the Decision No 1082/2013/EU on serious cross-border threats to health have established the detection of public health threats as a core activity of ECDC.
ECDC performs Epidemic Intelligence (EI) activities aiming at rapidly detecting and assessing public health threats, focusing on infectious diseases, to ensure the EU’s health security. ECDC uses social media as part of its sources for the early detection of signals of public health threats. Until 2020, the monitoring of social media was mainly performed through the screening and analysis of posts from pre-selected experts or organisations, mainly on Twitter and Facebook.
More information and an online tutorial are available:
The primary objective of
epitweetr is to use the Twitter Standard Search API version 1.1 and Twitter Recent Search API version 2 in order to detect early signals of potential threats by topic and by geographical unit.
Its secondary objective is to enable the user through an interactive web interface to explore the trend of tweets by time, geographical location and topic, including information on top words and numbers of tweets from trusted users, using charts and tables.
More information on epitweetr is available in the epitweetr GitHub discussions. This post contains a summary of links and materials of relevance for new users.
The minimum and suggested hardware requirements for the computer are in the table below:
| Requirement | Minimum | Suggested |
|---|---|---|
| RAM needed | 8GB | 16GB recommended |
| CPU needed | 4 cores | 12 cores |
| Space needed for 3 years of storage | 3TB | 5TB |
The CPU and RAM usage can be configured in the Shiny app configuration page (see section The interactive user application (Shiny app)>The configuration page). The RAM, CPU and space needed may depend on the amount and size of the topics you request in the collection process.
epitweetr is conceived to be platform independent, working on Windows, Linux and Mac. We recommend that you use
epitweetr on a computer that can be run continuously. You can switch the computer off, but you may miss some tweets if the downtime is large enough, which will have implications for the alert detection.
If you need to upgrade or reinstall epitweetr after activating its tasks, you must stop the tasks from the Shiny app or restart the machine running epitweetr first.
You can find below a summary of the steps required to install
epitweetr. Further detailed information is available in the corresponding sections.
Launch the epitweetr Shiny app (ensure to indicate the full path to your data directory)
To run epitweetr, the following items need to be installed:
R version 3.6.3 or higher
Java 1.8, e.g. openjdk version “1.8”: https://www.java.com/download/. The 64-bit rather than the 32-bit version is preferred, due to memory limitations. On Mac, you also need the Java Development Kit: https://docs.oracle.com/javase/9/install/installation-jdk-and-jre-macos.htm
If you are running it in Windows, you will also need Microsoft Visual C++, which in most cases is likely to be pre-installed:
Pandoc, for exporting PDFs and Markdown
A TeX installation (TinyTeX, MiKTeX or another TeX distribution) for exporting PDFs
Easiest: TinyTeX (https://yihui.org/tinytex/), installed from R; logoff/logon required after installation
Alternatively, MiKTeX (https://miktex.org/download); full installation required, logoff/logon required after installation
Machine learning optimisation (only for advanced users)
OpenBLAS (a BLAS optimiser), which will speed up some of the geolocation processes: https://www.openblas.net/. Installation instructions: https://github.com/fommil/netlib-Java
or Intel MKL (https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html)
If using Windows, you need to install the R package: taskscheduleR
If using Linux, you need to plan the tasks manually
If using a Mac, you need to plan the tasks manually
If you would like to develop
epitweetr further, then the following development tools are needed:
Git (source code control) https://git-scm.com/downloads
Sbt (compiling scala code) https://www.scala-sbt.org/download.html
If you are using Windows, then you will additionally need Rtools: https://cran.r-project.org/bin/windows/Rtools/
epitweetr will need to download some dependencies in order to work. The tool will do this automatically the first time the alert detection process is launched. The Shiny app configuration page will allow you to change the target URLs of these dependencies, which are the following:
CRAN JARs: Transitive dependencies for running Spark, Lucene and embedded Scala code (https://repo1.maven.org/maven2)
Winutils.exe (Windows only): This is a Hadoop binary necessary for running Spark locally on Windows (https://github.com/steveloughran/winutils/raw/master/hadoop-3.0.0/bin/winutils.exe).
Please note that during the dependencies download, you will be prompted first to stop the embedded database and then to enable it again. If you are on Windows and you have activated the tasks using the ‘activate’ buttons on the configuration page, you can perform these tasks by disabling and enabling the tasks in the ‘Windows Task Scheduler’. For more information, see the section ‘Setting up tweet collection and the alert detection loop’.
After installing all required dependencies listed in the section “Prerequisites for running epitweetr”, you can install epitweetr.
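epitweetr is distributed on CRAN, so a typical installation from the R console is:

```r
# Install epitweetr and its R dependencies from CRAN
install.packages("epitweetr")
```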
Additionally, the R environment needs to know where the Java installation home is. To check this, type in the R console:
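A standard way to check is to query the `JAVA_HOME` environment variable from R:

```r
# Print the Java home known to the R session;
# an empty string means the variable is not set
Sys.getenv("JAVA_HOME")
```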
If the command returns null or empty, then you will need to set the Java Home environment variable, for your operating system (OS). Please see your specific OS instructions. In some cases,
epitweetr can work without setting the Java Home environment variable.
The first time you run the application, if the tool cannot identify a secure password store provided by the operating system, you will see a pop-up window requesting a keyring password (Linux and Mac). This is a password necessary for storing encrypted Twitter credentials. Please choose a strong password and remember it. You will be asked for this password each time you run the tool. You can avoid this by setting a system environment variable named ecdc_twitter_tool_kr_password containing the chosen password.
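One way to set this variable for the current R session before launching epitweetr (you can also set it permanently at the operating-system level) is:

```r
# Store the keyring password in an environment variable so that
# epitweetr does not prompt for it (replace with your own strong password)
Sys.setenv(ecdc_twitter_tool_kr_password = "my-strong-password")
```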
You can launch the
epitweetr Shiny app from the R session by typing in the R console. Replace “data_dir” with the designated data directory (full path) which is a local folder you choose to store tweets, time series and configuration files in:
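For example (the path shown is a placeholder to replace with your own data directory):

```r
library(epitweetr)
# Use the full path to your data directory, with forward slashes
epitweetr_app("C:/user/name/Documents/epitweetr_data")
```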
Please note that the data directory entered in R should have ‘/’ instead of ‘\’ (an example of a correct path would be ‘C:/user/name/Documents’). This applies especially in Windows if you copy the path from the File Explorer.
Alternatively, you can use a launcher: In an executable .bat or shell file type the following (replacing “data_dir” with the designated data directory):
R --vanilla -e "epitweetr::epitweetr_app('data_dir')"
You can check that all requirements are properly installed in the troubleshoot page. More information is available in the section The interactive user application (Shiny app) > Dashboard: The interactive user interface for visualisation > The troubleshoot page.
Migrating epitweetr from previous versions (before January 2022) to version 2.0.0 or higher is possible without any data loss. In this section, we will describe the necessary steps to perform the migration.
This migration is not necessary if you are installing epitweetr for the first time.
In epitweetr v2, we redesigned how tweets and series are stored. In previous versions, tweets were saved as compressed JSON files and series as RDS data frames in the ‘tweets’ and ‘series’ folders, respectively. In epitweetr v2 or higher, we have moved to a different storage system that allows epitweetr to work as a search engine and supports efficient updates, deletions and faster aggregation of data. To do so, data is stored using Apache Lucene indexes in the ‘fs’ folder. Note that during migration, Twitter data are moved to the ‘fs’ folder and series are left as they are. epitweetr reports will combine data from both the old and new storage systems.
If you have an existing installation that contains data in the previous format, you have to migrate it following the steps detailed in this section. This applies to any epitweetr version before v2.0.0. You can also check this by looking in ‘tweets/geo’ or ‘tweets/search’ folders. If there is a json.gz file, migration is needed.
The migration steps are the following:
In order to use
epitweetr, you will need to collect and process tweets, run the ‘epitweetr database’ and ‘Requirements & alerts’ pipelines. Further details are also available in subsequent sections of this user documentation. A summary of the steps needed is as follows:
Set up the Twitter authentication using a Twitter account or a Twitter developer app, see section Collection of tweets>Twitter authentication for more details
If you want to use the Twitter API v2, you have to request access from the Twitter developer portal. More information at https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api
Activate the embedded database
Windows: Click on the “Epitweetr database” activate button
Other platforms: In a new R session, run the following command
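For example, using the loop functions documented in the package reference (function names may vary slightly between epitweetr versions; check `help(package = "epitweetr")` if in doubt):

```r
library(epitweetr)
# Point epitweetr at your data directory, then start the embedded database loop
setup_config("data_dir")
fs_loop()
```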
You can confirm that the embedded database is running if the ‘epitweetr database’ status is “Running” in the Shiny app configuration page and “true” in the Shiny app troubleshoot page.
Activate the tweet collection and data processing
Windows: Click on the “Data collection & processing” activate button
Other platforms: In a new R session, run the following command
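For example, using the package's documented search loop (verify the function name against your installed version with `help(package = "epitweetr")`):

```r
library(epitweetr)
# Point epitweetr at your data directory, then start the tweet collection loop
setup_config("data_dir")
search_loop()
```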
You can confirm that the tweet collection is running, if the ‘Data collection & processing’ status is “Running” in the Shiny app configuration page (green text in screenshot above) and “true” in the Shiny app troubleshoot page.
Activate the ‘Requirements & alerts’ pipeline:
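On platforms other than Windows, this can be started from a new R session with the package's detection loop (as with the other tasks, verify the function name against your installed version):

```r
library(epitweetr)
# Point epitweetr at your data directory, then start the alert detection loop
setup_config("data_dir")
detect_loop()
```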
You can confirm that the ‘Requirements & alerts’ pipeline is running, if the ‘Requirements & alerts’ pipeline status is “Running” in the Shiny app configuration page and “true” in the Shiny app troubleshoot page.
You will be able to visualise tweets after ‘Data collection & processing’ and ‘epitweetr database’ are activated and the languages task has finished successfully.
You can start working with the generated signals. Happy signal detection!
For more details, you can go through the section How does it work? General architecture behind
epitweetr, which describes the underlying processes behind the tweet collection and the signal detection. Also, the section “The interactive Shiny application (Shiny app)>The configuration page” describes the different settings available in the configuration page.
The following sections describe in detail the above general principles. The settings of many of these elements can be configured in the Shiny app configuration page, which is explained in the section The interactive Shiny application (Shiny app)>The configuration page.
epitweetr uses the Twitter Standard Search API version 1.1 and/or the Twitter Recent Search API version 2. The advantage of these APIs is that they are a free service provided by Twitter, enabling users of
epitweetr to access tweets free of charge. The search API is not meant to be an exhaustive source of tweets. It searches against a sample of recent tweets published in the past 7 days and it focuses on relevance and not completeness. This means that some tweets and users may be missing from search results.
While this may be a limitation in other fields of public health or research, the
epitweetr development team believes that, for the objective of signal detection, a sample of tweets is sufficient to detect potential threats of importance in combination with other types of sources.
Other attributes of the Twitter Standard Search API version 1.1 include:
Only tweets from the last 5–8 days are indexed by Twitter
A maximum of 180 requests every 15 minutes are supported by the Twitter Standard Search API (450 requests every 15 minutes if you are using the Twitter developer app credentials; see next section)
Each request returns a maximum of 100 tweets and/or retweets
Other attributes of the Twitter Recent Search API version 2 include:
Only tweets from the last 7 days are indexed by Twitter
A maximum of 300 requests every 15 minutes are supported
Each request returns a maximum of 100 tweets and/or retweets
500,000 tweets per month are allowed at the Essential access level. You can upgrade for free to the Elevated access level, which allows up to 2 million tweets per month.
If you are using both endpoints,
epitweetr will alternate between them when the limits are reached.
You can authenticate the collection of tweets by using a Twitter account (this approach uses the rtweet package app) or by using a Twitter application. For the latter, you will need a Twitter developer account, which can take some time to obtain, due to verification procedures. We recommend using a Twitter account via the rtweet package for testing purposes and short-term use, and the Twitter developer application for long-term use.
Using a Twitter account: delegated via rtweet (user authentication)
You will need a Twitter account (username and password)
The rtweet package will send a request to Twitter, so it can access your Twitter account on your behalf
A pop-up window will appear where you can enter your Twitter username and password to confirm that the application can access Twitter on your behalf. You will send this token each time you access tweets. If you are already logged in to Twitter, this pop-up window may not appear, and the credentials of the ‘active’ Twitter account on the machine will be used automatically
You can only use Twitter API version 1.1
Using a Twitter developer app: via
epitweetr (app authentication)
You will need to create a Twitter developer account, if you have not created it yet: https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api
Follow the instructions and answer the questions to activate the Twitter API v2 using Essential or Elevated access.
Next, you need to create a project and an associated developer app during the onboarding process, which will provide you a set of credentials that you will use to authenticate all requests to the API.
Save your OAuth settings
Add them to the configuration page in the Shiny app (see image below)
With this information,
epitweetr can request a token at any time directly to Twitter. The advantage of this method is that the token is not connected to any user information and tweets are returned independently of any user context.
With this app, you can perform 450 requests every 15 minutes instead of the 180 requests every 15 minutes that authenticating using Twitter account allows.
You can activate Twitter API version 2 in the configuration page
If you have rtweet 1.0.2+, you will need to enter your bearer token. For previous versions the information to enter is: App Name, API key, API key secret, access token and access token secret
After the Twitter authentication, you need to specify a list of topics in
epitweetr to indicate which tweets to collect. For each topic, you have one or more queries that
epitweetr uses to collect the relevant tweets (e.g. several queries for a topic using different terminology and/or languages).
A query consists of keywords and operators that are used to match tweet attributes. Keywords separated by a space indicate an AND clause. You can also use an OR operator. A minus sign immediately before a keyword (with no space between the sign and the keyword) indicates the keyword should not be in the tweet attributes. While queries can be up to 512 characters long, best practice is to limit your query to 10 keywords and operators and to limit its complexity, meaning that sometimes you need more than one query per topic. If a query surpasses this limit, it is recommended to split the topic into several queries.
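For illustration, some hypothetical queries (the disease terms are placeholders) and how the operators combine:

```
dengue fever          matches tweets containing both "dengue" AND "fever"
dengue OR #dengue     matches tweets containing either term
dengue -vaccine       matches tweets containing "dengue" but not "vaccine"
```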
epitweetr comes with a default list of topics as used by the ECDC Epidemic Intelligence team. You can view details of the list of topics in the Shiny app configuration page (see screenshot below). In addition, the colour coding in the downloadable file allows users to see if the query for a topic is too long (red colour) and the topic should be split in several queries.
In the configuration page, you can also download the list of topics, modify and upload it to
epitweetr. The new list of topics will then be used for tweet collection and visible in the Shiny app. The list of topics is an Excel file (*.xlsx) as it handles user-specific regional settings (e.g. delimiters) and special characters well. You can create your own list of topics and upload it too, noting that the structure should include at least:
The name of the topic, with the header “Topic” in the Excel spreadsheet. This name should include alphanumeric characters, spaces, dashes and underscores only. Note that it should start with a letter.
The query, with the header “Query” in the Excel spreadsheet. This is the query
epitweetr uses in its requests to obtain tweets from the Twitter Standard Search API. See above for syntax and constraints of queries.
The topics.xlsx file additionally includes the following fields:
An ID, with the header “#” in the Excel spreadsheet, noting a running integer identifier for the topic.
A label, with the header “Label” in the Excel spreadsheet, which is displayed in the drop-down topic menu of the Shiny app tabs.
An alpha parameter, with the header “Signal alpha (FPR)” in the Excel spreadsheet. FPR stands for “false positive rate”. Increasing the alpha will decrease the threshold for signal detection, resulting in an increased sensitivity and possibly obtaining more signals. Setting this alpha can be done empirically and according to the importance and nature of the topic.
“Length_charact” is an automatically generated field that calculates the length of all characters used in the query. This field is helpful as a request should not exceed 500 characters.
“Length_word” indicates the number of words used in a request, including operators. Best practice is to limit your number of keywords to 10.
An alpha parameter, with the header “Outlier alpha (FPR)” in the Excel spreadsheet. FPR stands for “false positive rate”. This alpha sets the false positive rate for determining what an outlier is when downweighting previous outliers/signals. The lower the value, the fewer previous outliers will potentially be included. A higher value will potentially include more previous outliers.
“Rank” is the number of queries per topic
When uploading your own file, please modify the topic and query fields, but do not modify the column titles.
As a reminder,
epitweetr is scheduled to make 180 requests (queries) to the Twitter API every 15 minutes with user authentication, or 450 (v1.1) or 300 (v2) requests every 15 minutes with Twitter developer app credentials, depending on the API version you use. Each request can return up to 100 tweets. The requests return tweets and retweets in JSON format, a lightweight data format.
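As a rough capacity estimate based on the limits above (a sketch only; actual volumes depend on your topics and the API's sampling):

```r
requests_per_window <- 180  # user authentication, per 15-minute window
tweets_per_request  <- 100
# Maximum tweets collectable per 15-minute window
requests_per_window * tweets_per_request        # 18,000
# Maximum per day (96 fifteen-minute windows)
requests_per_window * tweets_per_request * 96   # 1,728,000
```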
In order to collect the maximum number of tweets, given the API limitations, and in order for popular topics not to prevent other topics from being adequately collected,
epitweetr uses “search plans” for each query.
The first “search plan” for a query collects tweets from the current date-time backwards to 7 days before the plan was created (7 days being the Standard Search API limitation). The first “search plan” is the biggest, as no tweets have been collected so far.
All subsequent “search plans” are done in scheduled intervals that are set up in the configuration page of the
epitweetr Shiny app (see section The interactive Shiny app > the configuration page > General). For illustration purposes, let us consider the search plans are scheduled at four-hour intervals. The plans collect tweets for a specific query from the current date-time back until four hours before the date-time when the current “search plan” is implemented (see image below).
epitweetr will make as many requests (each returning up to 100 tweets) during the four-hour interval as needed to obtain all tweets created within that four-hour interval.
For example, if the “search plan” begins at 4 am on the 10th of November 2021,
epitweetr will launch requests for tweets corresponding to its queries for the four-hour period between midnight and 4 am on the 10th of November 2021.
epitweetr starts by collecting the most recent tweets (the ones closest to 4 am) and continues backwards. If during this four-hour period the API does not return any more results, the “search plan” for this query is considered completed.
However, if topics are very popular (e.g. COVID-19 in 2020 and 2021), then the “search plan” for a query in a given four-hour window may not be completed. If this happens,
epitweetr will move on to the “search plans” for the subsequent four-hour window, and put any previous incomplete “search plan” in a queue to execute when “search plans” for this new four-hour window are completed.
Each “search plan” stores the following information:
| Field | Type | Description |
|---|---|---|
| expected_end | Timestamp | End DateTime of the current search window |
| scheduled_for | Timestamp | The scheduled DateTime for the next request. On plan creation this is the current DateTime; after each request it is set to a future DateTime. To establish the future DateTime, the application estimates the number of requests necessary to finish: if N requests are necessary, the next schedule will be in 1/N of the remaining time. |
| start_on | Timestamp | The DateTime when the first request of the plan was finished |
| end_on | Timestamp | The DateTime when the last request of the plan was finished, if that request reached 100% plan progress |
| max_id | Long | The max Twitter id targeted by this plan, which is defined after the first request |
| since_id | Long | The last tweet id returned by the last request of this plan. The next request will start collecting tweets before this value. It is updated after each request and allows the Twitter API to return tweets before min_time(pi) |
| since_target | Long | If a previous plan exists, this value stores the first tweet id that was downloaded for that plan. The current plan will not collect tweets before that id. This value allows the Twitter API to return tweets after pi-time_back |
| requests | Int | Number of requests performed as part of the plan |
| progress | Double | Progress of the current plan as a percentage, calculated as (current$max_id - current$since_id)/(current$max_id - current$since_target). If the Twitter API returns no tweets, the progress is set to 100% (this only applies to non-error responses containing an empty list of tweets) |
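The progress formula can be illustrated with made-up ids (hypothetical values, for illustration only):

```r
# Hypothetical tweet ids for one plan
max_id       <- 1000000  # newest tweet targeted by the plan
since_id     <- 700000   # last tweet id returned so far
since_target <- 400000   # oldest id the plan will reach back to
# Share of the id range already covered
(max_id - since_id) / (max_id - since_target)   # 0.5, i.e. 50% complete
```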
epitweetr will execute plans according to these rules:
epitweetr will detect the newest unfinished plan for each search query with the scheduled_for variable located in the past.
epitweetr will execute the plans with the minimum number of requests already performed. This ensures that all scheduled plans perform the same number of requests.
As a result of the two previous rules, requests for topics under the 180 limit of the Twitter Standard Search API (or 450 if you are using Twitter developer app authentication) will be executed first and will produce higher progress than topics over the limit.
The rationale behind this is that topics with such a large number of tweets that the 4-hour search window is not sufficient to collect them are likely to already be a known topic of interest. Therefore, priority should be given to smaller and possibly less well-known topics.
An example is the COVID-19 pandemic in 2020. In early 2020, there was limited information available regarding COVID-19, which allowed detecting signals with meaningful information or updates (e.g. new countries reporting cases or confirming that it was caused by a coronavirus). However, throughout the pandemic, this topic became more popular and the broad topic of COVID-19 was not effective for signal detection and was taking up a lot of time and requests for
epitweetr. In such a case it is more relevant to prioritise the collection of smaller topics such as sub-topics related to COVID-19 (e.g. vaccine AND COVID-19), or to make sure you do not miss other events with less social media attention.
If search plans cannot be finished, several search plans per query may be in a queue:
This design can have the drawback of slowing down the collection of large topics, since
epitweetr is trying to rebuild the last 7 days of history. If you are not interested in rebuilding history at a particular point in time, you can click on the “Dismiss past tweets” button, which will discard all previous/historical plans and start collecting new data.
In a parallel process to the collection of tweets,
epitweetr attempts to geolocate all collected tweets using a supervised machine learning process. This process runs automatically after tweets are collected.
epitweetr stores two types of geolocation for a tweet: tweet location, which is geolocation information within the text of a tweet (or a retweeted or quoted tweet), and user location from the available metadata. For signal detection, the preferred location is used (i.e., tweet location) while in the dashboard both types can be visualised.
The tweet location is extracted and stored by
epitweetr based on the geolocation information found within a tweet text. In case of a retweet or quoted tweet, it will extract the geolocation information from the original tweet text that was retweeted or quoted. If neither are available, no tweet location is stored based on tweet text.
epitweetr identifies if a tweet text contains reference to a particular location by breaking down the tweet text into sets of words and evaluating those which are more likely to be a location by using a machine learning model. If several parts of the text are likely to be a location,
epitweetr will choose the one closest to a topic. After the location candidate has been identified,
epitweetr matches these words against a reference database, which is geonames.org. This is a geographical database available and accessible through various web services, under a Creative Commons attribution license. The GeoNames.org database contains over 25,000,000 geographical names.
epitweetr uses by default those limited to currently existing ones and those with a known population (so just over 500,000 names). You can change this default parameter in the Shiny app configuration page, by unchecking “Simplified geonames”. The database also contains longitude and latitude attributes of localities and variant spellings (cross-references), which are useful for finding purposes, as well as non-Roman script spellings of many of these names.
The matches can be performed at any level of administrative geography. The matching is powered by Apache Lucene, which is an open-source high-performance full-featured text search engine library.
To validate the candidate against geonames, a score is associated with the probability that a match is correct. A score is:
Higher if unusual parts of the name are matched
Higher if several administrative levels are matched
Higher if location population is bigger
Higher for countries an