Firing up an analytics stack

Create the stack using the existing stack's management network and security group:

openstack stack create \
    --template heat-templates/hot/edx-analytics-server.yaml \
    --parameter name=<server_name> \
    --parameter image=<image> \
    --parameter flavor=<flavor> \
    --parameter key_name=<key_name> \
    --parameter network=<existing_network> \
    --parameter security_group=<existing_security_group> \
    --parameter public_net_id=<uuid> \
    <analytics_stack_name>

The analytics server's default internal IP is 192.168.122.120. Deploy the stack's SSH key pair to it from the deploy node:

ip=192.168.122.120
ssh-keyscan $ip >> ~/.ssh/known_hosts
ssh-copy-id -i ~/.ssh/id_rsa $ip
scp ~/.ssh/id_rsa* $ip:.ssh/

Create a static inventory file, analytics.ini, listing the existing backend servers under backend_servers and only the analytics server under analytics_servers:

vim /var/tmp/edx-configuration-secrets/analytics.ini
[analytics_servers]
192.168.122.120

[backend_servers]
192.168.122.111
192.168.122.112
192.168.122.113

You can list the existing backend servers by running:

/var/tmp/edx-configuration-secrets/openstack.py --list
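Ansible dynamic inventory scripts like openstack.py emit JSON with a hosts list per group. A self-contained sketch of how to pull just the backend_servers hosts out of that output (the JSON literal below stands in for the real script output, which will differ on your deployment):

```shell
# Simulated dynamic-inventory output; on the deploy node, pipe the real
# output of openstack.py --list through the same filter instead.
INVENTORY_JSON='{"backend_servers": {"hosts": ["192.168.122.111", "192.168.122.112", "192.168.122.113"]}}'
BACKEND_HOSTS=$(echo "$INVENTORY_JSON" | python3 -c '
import json, sys
for host in json.load(sys.stdin)["backend_servers"]["hosts"]:
    print(host)
')
echo "$BACKEND_HOSTS"
```

The resulting host list is what goes under [backend_servers] in analytics.ini.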

Now, run the openstack-analytics.yml playbook against the analytics_servers group using this inventory file:

cd /var/tmp/edx-configuration/playbooks
ansible-playbook \
    -i ../../edx-configuration-secrets/analytics.ini \
    -e migrate_db=yes \
    openstack-analytics.yml

SSH into the analytics node for the following steps:

ssh 192.168.122.120

Disable the Hadoop NodeManager virtual-memory check:

sudo vim /edx/app/hadoop/hadoop-2.3.0/etc/hadoop/yarn-site.xml
...
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
...
sudo service yarn restart
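Before restarting, it's worth confirming the property actually landed in the file. A self-contained sketch of the check, run here against a temp copy of the snippet so the example stands alone; on the server, point grep at the real yarn-site.xml instead:

```shell
# Write the expected snippet to a temp file purely for illustration.
cat > /tmp/yarn-site-snippet.xml <<'EOF'
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
EOF
# Show the property name together with the line that follows it (the value).
grep -A1 'vmem-check-enabled' /tmp/yarn-site-snippet.xml
```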

Set up the pipeline in a new virtual env:

# Create a new virtualenv for the pipeline, and activate it
virtualenv pipeline
. pipeline/bin/activate

# Clone the repository and bootstrap it
git clone https://github.com/edx/edx-analytics-pipeline
cd edx-analytics-pipeline
make bootstrap

Copy the sample devstack.cfg configuration file and change it as follows:

sudo cp ~/edx-analytics-pipeline/config/devstack.cfg /edx/etc/edx-analytics-pipeline/override.cfg
sudo vim /edx/etc/edx-analytics-pipeline/override.cfg
[elasticsearch]
host = http://192.168.122.111:9201/

Test it with a simple task that counts daily events. The first run works through the full installation procedure and may take a while; subsequent invocations can skip it by passing --skip-setup.

FROM_DATE="2 days ago"
TO_DATE=now
remote-task \
    --wait TotalEventsDailyTask \
    --interval $(date +%Y-%m-%d -d "${FROM_DATE}")-$(date +%Y-%m-%d -d "${TO_DATE}") \
    --output-root hdfs://localhost:9000/output/ \
    --override-config /edx/etc/edx-analytics-pipeline/override.cfg \
    --repo https://github.com/edx/edx-analytics-pipeline \
    --host localhost \
    --user ubuntu \
    --remote-name analyticstack \
    --local-scheduler \
    --n-reduce-tasks 1
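The --interval argument above is just two ISO dates joined by a hyphen, built with GNU date's relative-date parsing. A quick sketch of the expansion on its own:

```shell
# Build the interval string the same way the remote-task invocation does;
# GNU date (as on Ubuntu) is assumed for the -d relative-date syntax.
FROM_DATE="2 days ago"
TO_DATE=now
INTERVAL="$(date +%Y-%m-%d -d "${FROM_DATE}")-$(date +%Y-%m-%d -d "${TO_DATE}")"
echo "$INTERVAL"
```

The output has the shape YYYY-MM-DD-YYYY-MM-DD, with the exact dates depending on when you run it.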

Now process enrollment data for the last 2 days, skipping the installation:

FROM_DATE="2 days ago"
TO_DATE=now
OVERWRITE_DAYS=2
remote-task \
    --wait ImportEnrollmentsIntoMysql \
    --interval $(date +%Y-%m-%d -d "${FROM_DATE}")-$(date +%Y-%m-%d -d "${TO_DATE}") \
    --overwrite-n-days $OVERWRITE_DAYS \
    --override-config /edx/etc/edx-analytics-pipeline/override.cfg \
    --skip-setup \
    --host localhost \
    --user ubuntu \
    --remote-name analyticstack \
    --local-scheduler \
    --n-reduce-tasks 1
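The --overwrite-n-days flag tells the task to rebuild only the most recent N days of intermediate event data rather than everything in the interval. A sketch of where that rebuilt window starts, assuming GNU date:

```shell
# With OVERWRITE_DAYS=2, only data from this date onward is recomputed.
OVERWRITE_DAYS=2
OVERWRITE_FROM=$(date +%Y-%m-%d -d "${OVERWRITE_DAYS} days ago")
echo "$OVERWRITE_FROM"
```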

The following tasks should also be run on a regular basis:

TO_DATE=now
WEEKS=24
remote-task \
    --wait CourseActivityWeeklyTask \
    --end-date $(date +%Y-%m-%d -d "$TO_DATE") \
    --weeks $WEEKS \
    --override-config /edx/etc/edx-analytics-pipeline/override.cfg \
    --skip-setup \
    --host localhost \
    --user ubuntu \
    --remote-name analyticstack \
    --local-scheduler \
    --n-reduce-tasks 1

FROM_DATE="1 month ago"
TO_DATE=now
remote-task \
    --wait ModuleEngagementIntervalTask \
    --interval $(date +%Y-%m-%d -d "${FROM_DATE}")-$(date +%Y-%m-%d -d "${TO_DATE}") \
    --overwrite-from-date $(date +%Y-%m-%d -d "$TO_DATE") \
    --overwrite-mysql \
    --override-config /edx/etc/edx-analytics-pipeline/override.cfg \
    --skip-setup \
    --host localhost \
    --user ubuntu \
    --remote-name analyticstack \
    --local-scheduler \
    --n-reduce-tasks 1
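To actually run these on a schedule, one option is a crontab entry on the analytics node. The wrapper script name below is an assumption, not part of the stack; it would simply contain the remote-task invocations above:

```shell
# Hypothetical crontab fragment for the ubuntu user (crontab -e).
# run-analytics-tasks.sh is a placeholder wrapper you would write yourself.
# m h dom mon dow  command
0 2 * * *  /home/ubuntu/run-analytics-tasks.sh >> /var/log/analytics-tasks.log 2>&1
```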

Finally, log in to Insights, making sure you're logged into the LMS with a staff account. If the OAUTH variables were set correctly when deploying the LMS, the edx-multi-node.yaml playbook should already have created the correct Insights token on it.

If there's a 500 error on the /courses page, you're likely hitting a migration bug where some tables weren't created properly. To fix this, run the following:

sudo -Hu insights bash
cd
. venvs/insights/bin/activate
. insights_env
cd edx_analytics_dashboard
./manage.py migrate --run-syncdb --settings=analytics_dashboard.settings.production
exit