A Brief Git Tutorial for Collaborative Research

Mark Gottscho | Jan 24, 2015

Continuing in the trend I’ve established with some technical tutorials that have proven to be useful to others, I’ve decided to write another brief one, this time focusing on Git version control for researchers. Why the emphasis on researchers? One often comes across academic code that is poorly documented, or there are various copies of the code being developed simultaneously by different people, none of whom communicate their changes to one another, making collaboration and follow-up work harder.

As a researcher, I use version control to keep track of active research progress and to enable better collaboration without making a mess of code, benchmarks, papers, and other important files. This also makes it easier to release my work publicly.

Important
Version control is essential to maximizing the utility, impact, and maintainability of academic code. In this tutorial, I will provide you with a crash course on collaborating with Git, hopefully encouraging you to investigate it further and use it in your future projects. However, this tutorial is not about learning the basics of Git itself — I link to a few good resources for this purpose below.

Unlike many alternatives, Git is distributed and lightweight, yet highly capable. It was originally created by Linus Torvalds to maintain the Linux kernel, and has since become one of the most popular version control systems. I think you’ll find it easier to use than most alternatives. Git is available for pretty much every platform you are likely to use.

In this tutorial, I assume you are familiar with the shell and command line on your platform. I also assume you are using a Linux-based OS, although the Git command syntax is the same across systems. GUIs are available as well.

1. Getting Started

First things first: make sure you can run git. There are many ways to get it installed on your system. If you’re running Ubuntu, you can install via sudo apt-get install git. Git can be installed for all users on the system or as a local binary. You can also build it from source. After installing, try running:

git --help

If the shell can’t locate the binary, you haven’t properly set up your path. Once you have the git binary working, you’re ready to begin.
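If you are unsure which binary your shell is picking up, a quick check along these lines can help. This is only a sketch; the output will differ from system to system.

```shell
# Confirm that the shell can find git and report which build it is.
command -v git      # prints the path of the binary that will run
git --version       # prints the installed version string
```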

You want to set up your personal information, so that your colleagues can see the history of changes that you make to files under version control. This information will stay embedded in any repositories that you work with, so make sure you use your real name and a public email address in case you release your version-controlled content publicly in the future. From any directory, run the following commands:

git config --global user.name "Your Full Name"
git config --global user.email "your_public_email@website.com"

This will add entries to your global Git configuration file on your machine, which is typically located at ~/.gitconfig. If you open this file now, you can see the changes made by the above commands. The settings in your .gitconfig file are used as defaults for any Git repository you use on the system. You can always edit this configuration file manually, or override its settings on a per-repository basis as needed. I recommend keeping identical copies of your .gitconfig file on each machine that you work on to make your life easier.
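As a rough illustration of a per-repository override, the sketch below creates a throwaway repository and gives it its own email address. The sandbox directory and the address are hypothetical placeholders, not part of the setup above.

```shell
# Create a scratch repository so nothing important is touched.
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet

# Inside a repository, git config without --global writes to .git/config,
# which takes precedence over ~/.gitconfig for this repository only.
git config user.email "lab_address@example.edu"   # hypothetical address

# Show the value that commits in this repository would use.
git config user.email
```

Your global settings are left untouched, so other repositories keep using the defaults from ~/.gitconfig.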

Next, you will want to set up line-ending conventions for cross-system compatibility. This is because most Unix-like OSes use the linefeed character \n to designate the end-of-line in text files, while Windows uses a carriage return plus line feed by convention (\r\n). Since it is often the case that version controlled files are used on multiple platforms by multiple people, it will save everyone a headache if line-endings are handled properly.

On Unix/Linux/Mac systems, you usually want to do the following:

git config --global core.autocrlf input
git config --global core.safecrlf true

On Windows systems, you usually want this:

git config --global core.autocrlf true
git config --global core.safecrlf true

See this article for more information about how Git handles line endings.
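To see what the Unix-style setting does in practice, here is a small experiment you could run in a scratch directory. It is a sketch under a few assumptions: it sets the options per repository rather than globally, it relaxes core.safecrlf to warn so that the add is allowed to proceed, and it relies on git ls-files --eol, which needs a reasonably recent Git. The file name is made up.

```shell
# Scratch repository with the Unix-style line-ending settings applied locally.
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet
git config core.autocrlf input
git config core.safecrlf warn     # warn instead of refusing, for this demo only

# Write a file with Windows-style CRLF line endings and stage it.
printf 'line one\r\nline two\r\n' > notes.txt
git add notes.txt                 # Git warns that CRLF will be replaced by LF

# Compare the line endings stored in the index with those on disk.
git ls-files --eol notes.txt
```

The index column should report LF endings while the working-tree column still reports CRLF, which is the normalization that core.autocrlf input is meant to provide.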

2. Follow a Git Tutorial

Practice using basic Git features in a sandbox directory that does NOT contain anything of importance. I recommend you use the fantastic tutorial at Git Immersion, or use one listed at this link. I assume you have the basics of Git down before we proceed.

3. Setting up a Bare Repository for Your Project

If you’ve done your Git homework, you’ll know that each repository is entirely self-contained: it holds both your working files and the complete version control information needed to reconstruct those files, track the whole change history, jump to an arbitrary revision, be cloned by others, and so on.

As you collaborate, you typically want to choose a shared location as the central hub for all files under version control, e.g., a directory on a shared lab server. There are several advantages to this approach:

  • It is much, much easier to locate relevant project files

  • It is easier to collaborate with others on a project

  • It gives everyone a single master copy of the project revisions that is "golden"

  • The backup time and size should be dramatically reduced, and "junk" files do not need to be version controlled or backed up

Thus, I recommend using bare repositories, which contain ONLY the Git version control information for the project, and NO working files. This cuts down on the size of the repository, but more importantly, makes it a hub for cloning working repositories only. Unlike regular Git repositories, bare repositories cannot be used for development-in-place.
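If you want to convince yourself of the difference, the following sketch creates a bare repository in a temporary directory and pokes at it. The temporary path merely stands in for a real shared location.

```shell
# Create a bare repository in a throwaway location.
hub="$(mktemp -d)/foo.git"
git init --quiet --bare "$hub"

# A bare repository holds the version control data directly (HEAD, config,
# objects/, refs/, ...) rather than inside a hidden .git/ subdirectory.
ls "$hub"

# Git itself can confirm that the repository is bare.
git -C "$hub" rev-parse --is-bare-repository

# Commands that need working files are refused in a bare repository.
if ! git -C "$hub" status >/dev/null 2>&1; then
    echo "git status refused: no working tree here"
fi
```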

Let’s say you have an existing project called foo somewhere on your local system, and it isn’t yet under Git version control.

Caution
Back up your project before starting!

Then change into that directory:

# On local machine
cd /path/to/foo/

Now, assuming you’ve set up your Git user information properly, initialize a Git repository and check in the current state of the files to create an initial revision.

# On local machine
git init

Your project may have build files (e.g., *.o object files for C programs), logs and other items (e.g., experiment1.log, test_output.txt, test_input6.csv, etc.) that you don’t want under version control. You have two primary ways of excluding these files from version control in your project. The first method is to manually add files in one-by-one that should be tracked:

# On local machine
git add foo.c
git add bar.h
...
git add my_useful_subdirectory/
git add README

However, this could prove cumbersome with many files to be included, or many files to be excluded. The preferred alternative method is to create a .gitignore file in the top level of your project directory, and add rules for the types of files you want Git to ignore by default. See the gitignore documentation and some example .gitignore files.
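As a starting point, a .gitignore for the kinds of files mentioned above might look like the sketch below. The patterns and file names are only illustrative, so adapt them to whatever your project actually generates.

```shell
# Scratch project standing in for foo.
proj="$(mktemp -d)"
cd "$proj"
git init --quiet

# Ignore rules: one pattern per line, with '#' starting a comment.
cat > .gitignore <<'EOF'
# Build products
*.o
# Experiment logs and generated output
*.log
test_output.txt
EOF

# Some example files: one source file and two that should be ignored.
touch foo.c foo.o experiment1.log

# Ask Git which rule, if any, causes each path to be ignored.
git check-ignore -v foo.o experiment1.log

# Ignored files no longer show up as untracked.
git status --short
```

Remember to commit the .gitignore file itself, so that your collaborators get the same rules.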

If you have a properly crafted .gitignore for your project, you could probably add everything in your project just like this:

# On local machine
git add .

This puts all files in the current directory and below into the "staging area", aside from ignored files that you defined earlier. Check the Git status if you like, make changes, and when you’re ready, commit them for your initial revision.

# On local machine
git commit -m "My initial commit of this amazing project."

Looking at the Git revision history or status should show you that it worked. Now, you want to create the bare repository for your project at the shared location mentioned earlier, e.g., a directory on your shared lab server. Without loss of generality, I assume that your local working machine is not the same as the shared lab server. There are multiple ways of doing this. Here is one way, for your foo project:

# On local machine
ssh YOU@your-lab-server
# Type the password for YOU@your-lab-server when prompted
# Now on lab server
cd /some/shared/path/
mkdir version_control/
cd version_control/
mkdir foo.git/
cd foo.git/
git init --bare

We name bare repositories with the .git suffix by convention so that it’s clear what the directory contains, and that it is not meant to be used as a working directory.

Now on your local machine in a different terminal, change to your original project folder and push your initial commit to the master shared-and-bare repository on the lab server so that your collaborators can see your changes.

# On local machine
cd /path/to/foo/
git remote add origin YOU@your-lab-server:/some/shared/path/version_control/foo.git
git push -u origin master
# Type the password for YOU@your-lab-server when prompted

This may take some time depending on the size of your project. Note that origin is the conventional name for the shared-and-bare remote repository, so we always know that origin points to the "golden" copy. When the push is done, there should be only Git version control data at your-lab-server:/some/shared/path/version_control/foo.git/. If you go there and run git log, it should display the changes you made. Remember that there are no working files there, so you can’t do any development in place.

For a collaborator to work on the project, she must clone your repository into a private working repository, which is just like your original project directory. In fact, the two are functionally identical. If you accidentally delete your entire working copy, it can be perfectly reconstructed from either the master shared-and-bare repository or a colleague’s working copy (assuming their private repository had pulled your updates before you deleted your own copy).

Before having your colleagues pull your changes, check the bare repository’s config file from your shell on the shared lab server:

# On lab server
cd /some/shared/path/version_control/foo.git/
vim config

The file should look like this:

# On lab server
[core]
repositoryformatversion = 0
filemode = true
bare = true

In particular, there should be no [remote] section. If there is one, the bare repository thinks it is tracking another remote repository, which you usually don’t want.
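If you would rather not open an editor, roughly the same check can be done with Git commands. In this sketch a freshly created bare repository in a temporary directory stands in for the one on the lab server.

```shell
# Stand-in for /some/shared/path/version_control/foo.git on the lab server.
hub="$(mktemp -d)/foo.git"
git init --quiet --bare "$hub"

# Should print "true" for a bare repository.
git -C "$hub" config --get core.bare

# Lists configured remotes; no output means there is no [remote] section.
git -C "$hub" remote
```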

Now, when you clone the master repository to your working repository, you can check the working repository’s .git/config to make sure "remote" points to the master bare repository. Now you can pull and push as you like to the lab server along with your colleagues.
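Here is a minimal sketch of that check from the collaborator's side. To keep it runnable on a single machine, a local bare repository takes the place of YOU@your-lab-server:/some/shared/path/version_control/foo.git, and both paths are temporary stand-ins.

```shell
# Local stand-in for the shared-and-bare repository on the lab server.
hub="$(mktemp -d)/foo.git"
git init --quiet --bare "$hub"

# Clone it into a private working repository.
# (Git warns if the repository being cloned is still empty.)
work="$(mktemp -d)/foo"
git clone --quiet "$hub" "$work"

# Both commands report where "origin" points.
git -C "$work" remote -v
git -C "$work" config --get remote.origin.url
```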

Note that git commit tracks changes only in your current working Git repository! It does not automatically synchronize/push to the remote lab server’s Git repository. If you want the lab server master repository to track what you commit in your local repository, you must first commit to your local repository and then git push to update the lab server’s version.
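The sketch below makes the distinction visible by counting the branches on the shared repository before and after a push. It assumes a local bare repository as a stand-in for the lab server, a made-up file name, and it looks up the current branch name rather than hard-coding master, since newer Git versions may use a different default.

```shell
# Local stand-in for the lab server's bare repository, plus a working clone.
base="$(mktemp -d)"
git init --quiet --bare "$base/foo.git"
git clone --quiet "$base/foo.git" "$base/work"
cd "$base/work"

# Identity used for commits in this scratch repository only.
git config user.name "Your Full Name"
git config user.email "your_public_email@website.com"

# Commit a change locally.
branch="$(git symbolic-ref --short HEAD)"
echo "first result" > results.txt     # hypothetical file
git add results.txt
git commit --quiet -m "Add first result"

# The shared repository has not seen the commit yet.
before="$(git ls-remote --heads origin | wc -l | tr -d ' ')"

# Publish the commit, then look again.
git push --quiet -u origin "$branch"
after="$(git ls-remote --heads origin | wc -l | tr -d ' ')"

echo "branches on origin before push: $before, after push: $after"
```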

  • Each unique project should be maintained in its own, descriptively named directory, with a .git suffix, e.g. myProject.git/. This is the preferred naming convention for bare repositories, to distinguish them from working directories.

  • Do not include the year, target publication, or author names as part of the repository directory name. This is because the project might eventually have a broader scope in the future as it is developed. Changing the repository name later will break working copies that try to push and pull to the shared-and-bare repository.

  • Do not use separate repositories for different development branches of the same project, unless their purposes have diverged considerably and there is no reason to keep them related. Instead, use Git’s branching capabilities.

  • Do not manually put anything in the bare repository.

  • Make sure file permissions in the shared-and-bare repository allow your collaborators to read and modify the project.

  • For clarity, it is recommended that you either nest related components of a project within a single project directory, or keep them "flat" at the top level but use a common naming convention. For example, suppose my project foo has a primary C component and a set of Python post-processing scripts. You could maintain them under one repository foo with subdirectories src for the C source and scripts for the Python components. Alternatively, you may want two repositories, one called foo for the C core component and one called foo-scripts for the Python scripts. This decision may depend on which components require collaboration, or whether you want to release only certain parts of your project to the research community.
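On the file permissions point above, one option is to let Git manage group access when the bare repository is created. This is a sketch only: the temporary path stands in for the real shared location, and the group name in the comment is hypothetical.

```shell
# Create the bare repository with group-writable permissions.
hub="$(mktemp -d)/foo.git"
git init --quiet --bare --shared=group "$hub"

# The choice is recorded in the repository's own config file.
git -C "$hub" config --get core.sharedRepository

# On a real lab server you would typically also hand the directory to a
# Unix group that all collaborators belong to, for example:
#   chgrp -R research_group /some/shared/path/version_control/foo.git
# (research_group is a hypothetical group name)
```

With this setting, files that Git creates during later pushes should stay writable by the group, so one collaborator's push does not lock out the others.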

5. Useful Resources

And of course, you can always use man git and git --help. =]
