Jannis Teunissen


Becoming a computational scientist

Note: the text below is copied from Appendix A of my PhD thesis from 2015.

I have now been working on computational science and computational physics problems for more than six years. What has surprised me, in hindsight, is the great number of things that one has to be familiar with in order to be productive. The reason is probably the relatively large amount of DIY (do it yourself) in computing, compared to other disciplines. Below I will try to summarize what skills I have found to be generally important.

Selecting problems

For doing research, the most important skill is perhaps the ability to pick the right problems. This is a rather difficult skill to master, and I am confident that I have not done so yet. Still, there are a couple of simple questions that I find useful for selecting problems:

  • Are you interested in the problem?
  • Are others interested in the problem?
  • Do you expect to learn something useful when studying the problem?
  • Does the problem seem feasible to you? If this is not clear, roughly how much time would you have to invest to find out?
  • Suppose that everything works out: you solve the problem. What would that mean to you? And what could you do next?
  • How long do you give yourself? And if you are unable to solve the full problem, is there an intermediate result that could be of value?
  • How hard will it be to write about the results? For example, for certain types of results a carefully written introduction, motivation, discussion or analysis might be required.

Another question that becomes more relevant towards the end of a PhD is whether it is possible to obtain (future) grants or funding for a topic.

Theoretical skills

Below, I briefly discuss some of the theoretical topics that I believe to be important for a computational scientist. The most important topic is missing, however: knowledge of the domain that you are working in. Such knowledge will help in selecting the right problems and in making the right approximations.

Applied mathematics

These are some of the topics in applied mathematics that I think are important for a computational scientist:

  • Linear algebra: many problems can be written as a system of linear equations.
  • Calculus: ordinary differential equations and Taylor series are important for many numerical methods. Calculus also helps in knowing what can be calculated analytically and in constructing reference solutions.
  • Statistics: Monte Carlo methods are quite common; to work with them, at least a basic understanding of statistics is required. The same goes for problems that are probabilistic in nature or contain data with noise.
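
To illustrate why some statistics is needed for Monte Carlo work, here is a minimal sketch that estimates pi by random sampling and reports a standard error along with the estimate. The function name estimate_pi and the fixed seed are choices made for this example only.

```python
import math
import random


def estimate_pi(n_samples, seed=0):
    """Estimate pi by sampling points uniformly in the unit square.

    Returns the estimate and its standard error.
    """
    rng = random.Random(seed)  # fixed seed, so that a run is reproducible
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    p = hits / n_samples  # fraction of points inside the quarter circle
    # Each sample is a Bernoulli trial, so the standard error of p is
    # sqrt(p * (1 - p) / n); the factor 4 maps it to the estimate of pi.
    std_err = 4.0 * math.sqrt(p * (1.0 - p) / n_samples)
    return 4.0 * p, std_err


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        estimate, std_err = estimate_pi(n)
        print(f"n = {n:>7}: pi ~ {estimate:.4f} +/- {std_err:.4f}")
```

The standard error shrinks only as one over the square root of the number of samples, so each extra digit of accuracy costs about a hundred times more samples.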

Computer science

When we want to solve a problem on a computer, we have to select the appropriate algorithm. Algorithms can be classified by their 'difficulty' or computational cost, which is the main topic of computational complexity theory. Knowing and understanding the computational cost of algorithms is not only important for efficiently solving a problem, but also for predicting what problems are feasible. For example, if you recognize that you are trying to solve an NP-hard problem, then you immediately know that exact solutions are likely to be feasible only for small problem sizes. With parallel computing, it is usually possible to go to larger problem sizes. To what extent this is the case depends on how well the algorithmic components can be parallelized, i.e., on the amount of local computation versus global communication.
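
As a rough illustration of how quickly an exhaustive search becomes infeasible, the sketch below counts the round trips a brute-force search would have to check for the traveling salesman problem, a well-known NP-hard problem. The rate of one billion tours per second is an assumption made purely for illustration, and count_tours is a name chosen for this example.

```python
import math


def count_tours(n_cities):
    """Number of distinct round trips through n_cities.

    The start city is fixed and a tour and its reverse are counted once,
    which gives (n - 1)! / 2 tours for n >= 3.
    """
    if n_cities < 3:
        return 1
    return math.factorial(n_cities - 1) // 2


# Assumption for illustration: one billion tours checked per second.
TOURS_PER_SECOND = 1e9
SECONDS_PER_YEAR = 3600 * 24 * 365

for n in (5, 10, 15, 20, 25):
    tours = count_tours(n)
    years = tours / TOURS_PER_SECOND / SECONDS_PER_YEAR
    print(f"N = {n:2d}: {tours:.3e} tours, about {years:.1e} years")
```

Under this assumption, 20 cities already take about two years and 25 cities on the order of ten million years, which is why heuristics and approximations are used for larger problems.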

The practical cost of algorithms also has to do with the device that performs the algorithmic steps or computations. Modern processors operate in a rather complicated way, but knowledge of the cost of typical operations is important when you have to develop an efficient numerical method. The hardware in a processor also determines what integer and floating point numbers you can use. Understanding floating point arithmetic and its subtleties can save you a lot of time debugging 'weird' behavior.
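
A few lines of Python are enough to show the kind of floating point behavior meant here. This is a small sketch using double precision numbers; naive_sum is a helper written for this example.

```python
import math

# 0.1, 0.2 and 0.3 have no exact binary representation, so the sum is
# only approximately equal to 0.3.
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True

# Absorption: near 1e16 the spacing between doubles is 2, so adding 1.0
# has no effect.
print(1e16 + 1.0 == 1e16)            # True


def naive_sum(values):
    """Add the values from left to right in ordinary floating point."""
    total = 0.0
    for value in values:
        total += value
    return total


# The order of operations matters: the 1.0 is lost in the naive sum,
# whereas math.fsum tracks the rounding errors and recovers it.
values = [1e16, 1.0, -1e16]
print(naive_sum(values))   # 0.0
print(math.fsum(values))   # 1.0
```

Comparing floating point results with a tolerance, as math.isclose does, is usually safer than testing for exact equality.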

Computational science

Although there are many types of computations, most of them fall into just a few categories:

  • Solving linear systems of equations, i.e., solving $A x = b$ for $x$, given a matrix $A$ and a vector $b$. Surprisingly many problems can be transformed into such a linear system.
  • Optimization, for example: find the shortest path between $N$ cities, find the ground state energy of a quantum system or find the minimum of a function.
  • Ordinary and partial differential equations. Many (physical) systems can be described by such equations. Different types of partial differential equations require quite different solution strategies.
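
As one example of how a differential equation turns into a linear system, the sketch below discretizes $-u''(x) = 1$ on $(0, 1)$ with $u(0) = u(1) = 0$ using central differences, and solves the resulting tridiagonal system $A u = b$ with the Thomas algorithm. The test problem and the function names are choices made for this illustration; in practice a library routine would normally be used instead of a hand-written solver.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    In equation i, lower[i] multiplies x[i-1], diag[i] multiplies x[i]
    and upper[i] multiplies x[i+1]; lower[0] and upper[-1] play no role.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):  # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def solve_poisson_1d(n_interior):
    """Solve -u'' = 1 on (0, 1) with u(0) = u(1) = 0 on a uniform grid."""
    h = 1.0 / (n_interior + 1)
    # Central differences: (-u[i-1] + 2 u[i] - u[i+1]) / h^2 = 1
    lower = [-1.0] * n_interior
    diag = [2.0] * n_interior
    upper = [-1.0] * n_interior
    rhs = [h * h] * n_interior
    u = solve_tridiagonal(lower, diag, upper, rhs)
    x = [(i + 1) * h for i in range(n_interior)]
    return x, u


if __name__ == "__main__":
    x, u = solve_poisson_1d(9)
    # Reference solution: u(x) = x (1 - x) / 2
    error = max(abs(ui - xi * (1.0 - xi) / 2.0) for xi, ui in zip(x, u))
    print(f"maximum error: {error:.2e}")
```

Because the exact solution is a quadratic, the central difference formula has no truncation error here, so the remaining error is due to rounding only. This makes the problem a convenient reference case.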

A computational scientist should probably be familiar with the basic methods for solving problems from these categories, so that one is able to find and select the best methods when the need arises. To prevent reinventing the wheel, some knowledge of the available libraries and codes is valuable.

Practical skills

The best strategy for solving a problem depends on what tools are already available. If sufficiently many other people have worked on a (similar) problem, software might be available that you can directly use. Take for example CFD (computational fluid dynamics), for which there are many different simulation tools. Selecting the right one then becomes one of the most important aspects of solving your problem.

The other extreme would be that no software exists for your problem, so that you have to develop everything yourself. There are of course also many cases in between, for example when existing tools have to be modified to suit your needs. This means that it is often necessary to write computer code. Below, some of the practical aspects of writing your own code and reusing others' code are discussed.

Computer basics

For computing, the *nix operating systems appear to be most popular. Being familiar with a variant such as GNU/Linux, BSD or OS X is therefore quite helpful, as it allows you to quickly use the code and tools that others have written.

Good command of a text editor such as vim or emacs, or a suitable IDE (integrated development environment) will speed up your code and text editing. This might also reduce the risk of developing RSI (repetitive strain injury), because most editors can be operated without a mouse 1). There are many useful tools included in a *nix system, but ssh gets a special mention, because it allows you to work on remote systems.

There exist a number of software suites for doing numerical or symbolic computations. Commercial packages include Matlab and Mathematica, whereas Octave and SageMath are examples of free software alternatives. The many built-in functions can help you to quickly develop a computational method. Even if you eventually have to implement this solver in a different environment, it can be helpful to start from a simple proof-of-concept. The generality of such suites is also their drawback: typically they will not be as efficient as a special purpose solution.

Programming

When you develop a method from scratch, you can use your preferred programming language; this is of course not possible when you have to modify an existing method. The traditional languages for computing are C and Fortran. C in particular is quite 'low-level', so experience with C is useful for understanding how a computer and other languages work. Fortran was specifically designed for numerical computing, which can make code development more convenient. Another popular compiled language is C++, which allows for many programming styles. This flexibility can be good for the expert but is sometimes hard for the beginner. Performance-wise, there are no major differences between these languages as long as you know what you are doing.

For certain tasks, scripting or interpreted languages such as Python can be more convenient. Such languages can for example be used to glue together other programs, to process data or to visualize results. Python can also be used for computations, although the numerical work is then typically performed by routines written in C or Fortran, which are made available by Python modules such as numpy.
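
As a sketch of what 'gluing together' programs can look like, the script below runs an external program for several parameter values and collects the numbers it prints. The 'solver' here is only a stand-in (a Python one-liner that squares its argument), used so that the example is self-contained; a real script would call an actual executable and parse its actual output format.

```python
import subprocess
import sys

# Stand-in for an external solver (an assumption for this sketch): a
# Python one-liner that prints the square of its first argument.
SOLVER = [sys.executable, "-c", "import sys; print(float(sys.argv[1]) ** 2)"]


def run_solver(parameter):
    """Run the external program once and parse the number it prints."""
    result = subprocess.run(
        SOLVER + [str(parameter)],
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode the output as text
        check=True,           # raise an error if the program fails
    )
    return float(result.stdout)


# Parameter scan: run the program for each value and store the results.
results = {p: run_solver(p) for p in (0.5, 1.0, 2.0)}
for parameter, value in results.items():
    print(f"parameter = {parameter}: result = {value}")
```

With check=True, a failing run raises an exception instead of silently producing missing data, which is usually what you want in a parameter scan.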

Numerical code is no different from other code: many things can go wrong. Sometimes a program simply does not compile or run, but at other times it might not be clear whether there is a bug or whether there is a failure for another reason. Code often depends on (particular versions of) libraries, which is a source of compilation errors; understanding how code is compiled will help in figuring out what is required. The Makefiles 2) included with numerical software are another example: they might not work on your machine, in which case you need to know how to modify them. As most programs contain bugs, basic debugging skills are very valuable. The larger a project grows, the more important these skills become.

Being familiar with a version control system such as git has various benefits: you can keep track of your changes, get the latest version of a code, collaborate with others, etc. Perhaps even more important is being able to visualize your results. Many tools exist for this; popular open source packages include gnuplot, VisIt and ParaView.

1) In my experience, the combination of stress and mouse usage is most likely to cause physical discomfort.
2) Makefiles contain rules that describe how a collection of source files should be compiled. Another common build system is CMake.