Data Best Practices Workshop

Paul Nuyujukian

(Stanford University)


Date: March 27, 2019


This new workshop covers how to set up a free data storage and processing pipeline built on Stanford's unique service offerings.

You will learn how to store data on cloud repositories securely and rapidly via command-line tools. You will also perform data analysis directly from these cloud repositories. All tools covered in this workshop are free for Stanford students, staff, and faculty. They provide unlimited data storage at line-saturating transfer rates.

The workshop will work through an example dataset.

  1. Participants will redundantly store data across two cloud repositories (Stanford Google Drive & Stanford Box/Medicine Box) and Stanford Sherlock via command-line tools (e.g., rclone).
  2. Participants will log in to Stanford GitLab, a private git repository hosting service for Stanford affiliates.
  3. Participants will directly analyze the dataset stored on Stanford Google Drive without downloading a copy via Google Colaboratory using a Python package hosted on Stanford GitLab.
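As a preview of step 1, the rclone workflow looks roughly like the following sketch. The remote names (`gdrive`, `box`) and the dataset path are hypothetical placeholders, not necessarily the names used in the workshop:

```shell
# One-time setup: create a remote for each cloud repository.
# rclone walks you through OAuth authorization interactively.
rclone config

# Redundantly copy a local dataset to both remotes.
# "gdrive:" and "box:" are hypothetical remote names chosen during config.
rclone copy ./my-dataset gdrive:my-dataset --progress
rclone copy ./my-dataset box:my-dataset --progress

# Verify that source and destination contents match.
rclone check ./my-dataset gdrive:my-dataset
```

Because rclone talks to both repositories through one uniform interface, the same `copy`/`check` pair works unchanged for Google Drive, Box, or an SSH-reachable cluster such as Sherlock.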

At the end of the workshop, participants will be given time to repeat the data storage process, but with their own data. Participants are encouraged to bring some of their own datasets with them for this final exercise.

Target audience:

This hands-on workshop is geared towards experimental researchers who routinely work with datasets.


  1. Familiarity with UNIX command-line operation is a prerequisite.
  2. Workshop enrollment is limited to 50 people. To manage enrollment, there is an application followed by registration. Selected applicants will be invited to register on March 20, 2019.
  3. This workshop is hands-on, so participants are required to bring their own laptops.
  4. Completion of the Preparatory Assignment is required to apply (included in form).
