Due to the recent COVID-19 outbreak, millions of employees are having to embrace remote working, including thousands of data scientists. As data science teams grow larger and more complex, supporting remote working becomes harder too. Data science is now done by cross-functional teams that bring together subject-matter experts, modellers, data visualisation experts, machine learning engineers, product managers and designers.
Unsurprisingly, when these teams shift toward remote working, effective collaboration often becomes a major challenge, especially since communication is harder when teams are not co-located. Read on to see how the right tooling can help data science teams perform at their best while working remotely.
Avoiding Frustrations
High-functioning data science teams strive for consistent tooling and infrastructure across the team. It is much easier to call and help a colleague if you both work with similar tools, and to write reusable, shareable code if the environment in which that code runs is more constrained.
There are many components that make up a data science infrastructure. Faculty have found the following practices to be effective:
- Give everyone the same hardware.
- Enforce common tooling. If everyone in the team uses Python, sharing code and models is greatly simplified. Similarly, if everyone uses the same text editor, the organisation can develop processes or even software (e.g. editor plugins) to facilitate collaboration.
- Enforce common environment management. If everyone in the team uses the same environment management system (e.g. Conda environments or Docker containers), data scientists can collaborate both on the code and on the environment the code runs in.
- Store data in shared, online storage rather than on individual machines. It is then accessible to everyone, which makes it much easier to share code or to debug issues together.
- Have an easy way to declare reproducible workflows that other team members can run. For instance, having a well-documented store of Docker containers that can run particular parts of the team’s data processing pipeline means that not everyone in the team needs to know every part of that pipeline at the same level of detail. In turn, this reduces the need for long phone calls explaining how to run the pipeline (see the first sketch after this list).
- As much as possible, deploy models behind documented APIs. Having a clear interface allows more people to leverage the team’s work (see the second sketch after this list).
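To make the reproducible-workflow point concrete, here is a minimal sketch of a thin Python wrapper around a registry of pipeline containers. The step names, image names and mount paths are illustrative assumptions rather than anything prescribed in this post.

```python
import subprocess

# Hypothetical registry mapping each pipeline step to the Docker image that runs it.
# A real team would document this registry alongside the pipeline itself.
PIPELINE_STEPS = {
    "ingest": "registry.example.com/team/ingest:1.4.0",
    "clean": "registry.example.com/team/clean:2.1.0",
    "train": "registry.example.com/team/train:0.9.2",
}


def run_step(step: str, data_dir: str) -> None:
    """Run a single pipeline step in its container, mounting the shared data directory."""
    image = PIPELINE_STEPS[step]
    subprocess.run(
        ["docker", "run", "--rm", "-v", f"{data_dir}:/data", image],
        check=True,  # fail loudly so a broken step is immediately visible
    )


if __name__ == "__main__":
    # Anyone on the team can rerun the 'clean' step without knowing its internals.
    run_step("clean", "/mnt/shared/project-data")
```

The point is not the wrapper itself, but that each step is runnable from a documented, versioned container rather than from one person's memory.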
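Similarly, here is a minimal sketch of deploying a model behind a documented API, assuming (purely for illustration, as neither is named in this post) a scikit-learn model serialised with joblib and Flask as the web framework.

```python
from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)

# Hypothetical artefact produced by the training pipeline.
model = joblib.load("model.joblib")


@app.route("/predict", methods=["POST"])
def predict():
    """Documented interface: POST {"features": [[...], ...]} and
    receive {"predictions": [...]} in response."""
    features = request.get_json()["features"]
    predictions = model.predict(features).tolist()
    return jsonify(predictions=predictions)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

A colleague then only needs the endpoint and the request format to build on the model, not the training code or its environment.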
Avoid Isolation by Making Work Visible
Sharing a goal is one of the most motivating elements of teamwork. However, when working remotely, it is hard to know what other team members are working on and to generate a sense of shared purpose. This can lead to team members feeling isolated.
Making the work visible is the best way to reduce this feeling. With a good source code manager like GitHub, GitLab or Bitbucket, the steady flow of pull requests, comments and reviews gives the team a sense of motion. Continuous integration pipelines running from this source code manager also increase visibility. As well as exposing activity through a source code manager, organisations can make other sources of activity visible to the team.
Avoid Silos by Sharing Best Practices
When teams work remotely, there is less space for ad-hoc knowledge sharing. The hallway conversations that can spark new ideas are less likely to happen. Teams therefore need to be much more deliberate about knowledge sharing.
Faculty have seen teams build knowledge repositories in Confluence or the open-source Knowledge Repo. In Faculty Platform, a data science workbench, Faculty are working on a way to easily share blueprints for common data science tasks, letting data scientists gradually build an organisation-specific knowledge centre with a single, consistent view of best practices.
Conclusion
Building cross-functional teams that deliver on the promise of machine learning is hard, as is recruiting the right people and getting them to speak the same language. Doing this with a remote team is even harder.
Good tools will not guarantee success, but they will make your team members more productive and foster a feeling of collaboration and shared goals.
Faculty Platform gives remote data science teams a shared infrastructure for collaborating on model development and deployment – and all the best practices from this post are built into it.
If you are interested in finding out more about the Faculty Platform, contact us today.
This is an extract from an article written by Pascal Bugnion, Data Engineer at Faculty. To see the full details click here.