Notes on writing distributed systems

This note summarizes some of my reflections on writing distributed systems for 6.824 at MIT. I had a horrible time debugging a concurrency bug last Friday. After some discussions with friends, I reflected on the mistakes I had made. As a result, I completed the next distributed systems lab much faster and with far less pain.

  1. Read the problem statement carefully and write out the pseudocode before jumping into the implementation.
  2. Think about what state each participant in the system needs to store.
    1. Keep as little state as possible, and favor a functional style over lots of state. The problem with keeping a lot of state is that sometimes you forget to update it at the right time, so always aim to track a minimal set of state.
    2. Keep the state of different participants as separate as possible. Aim to never query another participant’s state directly. Instead, call that participant’s method when interaction is needed (an RPC call in many cases). This helps prevent deadlock.
  3. Synchronization
    1. Use as few locks as possible. This helps prevent deadlocks.
    2. Unlock as soon as you have finished using a shared resource.
    3. Be very careful when acquiring a second lock while already holding one. Try to avoid this situation almost always. Ideally, you should be able to release the first lock before you grab the second.
    4. Try to write methods or functions that guard specific resources instead of using locks everywhere in the code.
  4. Debugging
    1. I cannot emphasize enough the importance of well-designed print statements. No matter how well the system is designed, concurrency bugs will occur.
      1. Good debugging messages save you a tremendous amount of time when narrowing down a bug. Sometimes it takes a while to reproduce a concurrency bug, and it is too late when you realize that you don’t have the necessary information in the log. You then have to go back, modify the print statements, and rerun the program a number of times to reproduce the bug. Therefore, it is very important to design the print messages to be informative enough from the start.
    2. The key components that have to be present in a print statement:
      1. Who: the sender of the message, possibly the sender’s port number, address, etc.
      2. Where: the function being invoked. Preferably, there should be a print statement in every function.
      3. Input: the arguments passed to the call.
      4. Output: the result the function returns. Together with the input, this helps you evaluate whether the function is doing what you think it should do.
      5. Success or failure: always output failure or error messages; they help you track down the problem.