OpenMP 執行緒同步 Construct 實現原理以及原始碼分析（下）

前言

在上面文章當中我們主要分析了 flush, critical, master 這三個 construct 的實現原理。在本篇文章當中我們將主要分析另外兩個 construct : barrier 和 single 。

Barrier Construct

編譯器角度分析

在本小節當中我們主要介紹 #pragma omp barrier 的使用，事實上這個 construct 在編譯器的處理上非常簡單，只是將這條編譯指導語句變成了一個函數呼叫。

void GOMP_barrier (void)

每一條 #pragma omp barrier 都會變成呼叫函數 GOMP_barrier 。我們來看一個範例程式：


#include <stdio.h>
#include <omp.h>

int main()
{
#pragma omp parallel num_threads(4) default(none)
  {
    printf("tid = %d start\n", omp_get_thread_num());
#pragma omp barrier
    printf("tid = %d end\n", omp_get_thread_num());
  }
  return 0;
}

在前面的文章當中我們已經提到了編譯器會將一個 parallel construct 編譯成一個函數，上面的 parallel construct 被編譯的之後的結果如下所示，可以看到確實編譯成了呼叫函數 GOMP_barrier 。

000000000040118a <main._omp_fn.0>:
  40118a:       55                      push   %rbp
  40118b:       48 89 e5                mov    %rsp,%rbp
  40118e:       48 83 ec 10             sub    $0x10,%rsp
  401192:       48 89 7d f8             mov    %rdi,-0x8(%rbp)
  401196:       e8 a5 fe ff ff          callq  401040 <omp_get_thread_num@plt>
  40119b:       89 c6                   mov    %eax,%esi
  40119d:       bf 10 20 40 00          mov    $0x402010,%edi
  4011a2:       b8 00 00 00 00          mov    $0x0,%eax
  4011a7:       e8 a4 fe ff ff          callq  401050 <printf@plt>
  4011ac:       e8 7f fe ff ff          callq  401030 <GOMP_barrier@plt>
  4011b1:       e8 8a fe ff ff          callq  401040 <omp_get_thread_num@plt>
  4011b6:       89 c6                   mov    %eax,%esi
  4011b8:       bf 20 20 40 00          mov    $0x402020,%edi
  4011bd:       b8 00 00 00 00          mov    $0x0,%eax
  4011c2:       e8 89 fe ff ff          callq  401050 <printf@plt>
  4011c7:       c9                      leaveq 
  4011c8:       c3                      retq   
  4011c9:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

原始碼分析

void
GOMP_barrier (void)
{
  // 得到當前執行緒的相關資料
  struct gomp_thread *thr = gomp_thread ();
  // 得到當前執行緒的執行緒組
  struct gomp_team *team = thr->ts.team;

  /* It is legal to have orphaned barriers.  */
  if (team == NULL)
    return;
  // 使用執行緒組內部的 barrier 只有所有的執行緒都到達這個同步點之後才能夠繼續往後執行
  // 否則就需要進入核心掛起 
  gomp_team_barrier_wait (&team->barrier);
}

上面的程式碼就是使用當前執行緒執行緒組內部的 barrier ，讓執行緒組當中的所有執行緒都到達同步點之後才繼續往後執行，如果你使用過 pthread 中的執行緒同步工具路障 pthread_barrier_t 的話就很容易理解了。

在繼續往後分析程式之前我們首先需要了解兩個資料型別：

typedef struct
{
  /* Make sure total/generation is in a mostly read cacheline, while
     awaited in a separate cacheline.  */
  unsigned total __attribute__((aligned (64)));
  unsigned generation;
  unsigned awaited __attribute__((aligned (64)));
} gomp_barrier_t;
typedef unsigned int gomp_barrier_state_t;

我們重點分析一下 gomp_barrier_t ，team->barrier 就是這個變數型別，在這個結構體當中一共有三個變數我們重點分析第一個和第三個變數的含義：

total，這個變數表示一個需要等待多少個執行緒到達同步點之後才能夠繼續往後執行。
awaited，這個變數表示還需要等待多少個執行緒。
初始化的時候 total 和 awaited 這兩個變數是相等的，當有一個執行緒到達之後 awaited 就減去 1 。
generation 這個變數與 OpenMP 當中的 task 有關，這個變數稍微有點複雜，由於我們的分析不涉及到 OpenMP 當中的任務，因此這類對這個變數不做分析，這個變數的初始值等於 0。

結構體 gomp_barrier_t 初始化函數如下所示：

static inline void gomp_barrier_init (gomp_barrier_t *bar, unsigned count)
{
  bar->total = count;
  bar->awaited = count;
  bar->generation = 0;
}

現在我們來對函數 gomp_team_barrier_wait 進行分析，關於程式碼的詳細都在程式碼的對應位置：

void
gomp_team_barrier_wait (gomp_barrier_t *bar)
{
  gomp_team_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
}

static inline gomp_barrier_state_t
gomp_barrier_wait_start (gomp_barrier_t *bar)
{
  // 因為我們不分析 OpenMP 當中的 task ,因此在這裡可能認為 generation 始終等於 0 
  // 那麼 ret 也等於 0
  unsigned int ret = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE) & ~3;
  /* A memory barrier is needed before exiting from the various forms
     of gomp_barrier_wait, to satisfy OpenMP API version 3.1 section
     2.8.6 flush Construct, which says there is an implicit flush during
     a barrier region.  This is a convenient place to add the barrier,
     so we use MEMMODEL_ACQ_REL here rather than MEMMODEL_ACQUIRE.  */
  // 這裡將 awaited 還需要等待的執行緒數 -1 並且判斷 awaited 是否等於 0
  // 如果等於 0 則返回 1 反之則返回 0 如果不考慮 task 只有最後一個到達同步點的執行緒
  // 才會返回 1
  ret += __atomic_add_fetch (&bar->awaited, -1, MEMMODEL_ACQ_REL) == 0;
  return ret;
}


// 為了方便閱讀下面的程式碼已經刪除了與 task 相關的部分
void
gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
{
  unsigned int generation, gen;
	// 如果 state 等於 1 將會進入下面的 if 語句
  if (__builtin_expect ((state & 1) != 0, 0))
    {
    	// 如果是最後一個執行緒到達這裡，那麼將會重新將 awaited 變成 total
      /* Next time we'll be awaiting TOTAL threads again.  */
      struct gomp_thread *thr = gomp_thread ();
      struct gomp_team *team = thr->ts.team;

      bar->awaited = bar->total;
    // 如果還有需要執行的任務 那麼將進入 if 語句
      if (__builtin_expect (team->task_count, 0))
	{
	  gomp_barrier_handle_tasks (state);
	  state &= ~1;
	}
      else
	{
    // 如果沒有需要執行的任務 那麼則需要將之前被掛起的執行緒全部喚醒
	  __atomic_store_n (&bar->generation, state + 3, MEMMODEL_RELEASE);
	  futex_wake ((int *) &bar->generation, INT_MAX);
	  return;
	}
    }
  // 如果 if 條件不滿足，也就是說到達 barrier 的執行緒不是最後一個執行緒
  // 那麼將會執行到這裡進行掛起
  
  // 這裡省略了程式碼 如果程式執行到這裡將會被繼續掛起 直到上面的 futex_wake 被執行
}

技巧分析

在上面的結構體 gomp_barrier_t 當中有語句 unsigned total __attribute__((aligned (64))); 後面的 attribute((aligned (64))) 表示這個欄位需要使用 64 位元組對齊，那麼這個欄位也佔 64 位元組，一般來說一個快取行有 64 個位元組的資料，也就是說這三個欄位的資料不會儲存在同一個快取行，這樣的話多個執行緒在操作這三個資料的時候不會產生假共用 (false sharing) 的問題，這可以很提高程式的效率。
我們在前面討論 critical construct 的時候談到啦 critical 有匿名和命令兩種方式：

#pragma omp critical
#pragma omp critical(name)

那麼按道理來說 barrier 也應該有兩種方式啊，那麼為什麼會沒有呢？根據前面的程式分析，我們可以知道，最重要的一行程式碼是 gomp_team_barrier_wait (&team->barrier); 因為每一個執行緒都屬於一個執行緒組，每個執行緒組內部都有一個 barrier ，因此當進行同步的時候只需要使用執行緒組內部的 barrier 即可，因此不需要使用命名的 barrier。

Single Construct

#pragma omp single

在本小節當中我們主要分析 single construct ，他的一半形式如下所示：

#pragma omp single
{
  body;
}

類似於上面的結構的程式碼會被編譯器編譯成如下形式：

if (GOMP_single_start ())
  body;
GOMP_barrier ();

關於 GOMP_barrier 函數我們在前面的內容當中已經進行了詳細的分析，他的功能就是使用一個執行緒組內部的 barrier 變數，當所有的執行緒都到達這個位置之後才放行所有執行緒，讓他們繼續執行，如果執行緒組的執行緒沒有全部到達同步點，則到達同步點的執行緒會被掛起。

我們使用一個實際的例子進行分析，看一下最終被編譯成的程式是什麼樣子：

#include <stdio.h>
#include <omp.h>

int main()
{
#pragma omp parallel num_threads(4) default(none)
  {
#pragma omp single
    {
      printf("Hello World\n");
    }

    printf("tid = %d\n", omp_get_thread_num());
  }
  return 0;
}

上面的 parallel 程式碼塊被編譯之後的反組合程式如下所示：

00000000004011aa <main._omp_fn.0>:
  4011aa:       55                      push   %rbp
  4011ab:       48 89 e5                mov    %rsp,%rbp
  4011ae:       48 83 ec 10             sub    $0x10,%rsp
  4011b2:       48 89 7d f8             mov    %rdi,-0x8(%rbp)
  4011b6:       e8 c5 fe ff ff          callq  401080 <GOMP_single_start@plt>
  4011bb:       3c 01                   cmp    $0x1,%al
  4011bd:       74 1d                   je     4011dc <main._omp_fn.0+0x32>
  4011bf:       e8 7c fe ff ff          callq  401040 <GOMP_barrier@plt>
  4011c4:       e8 87 fe ff ff          callq  401050 <omp_get_thread_num@plt>
  4011c9:       89 c6                   mov    %eax,%esi
  4011cb:       bf 10 20 40 00          mov    $0x402010,%edi
  4011d0:       b8 00 00 00 00          mov    $0x0,%eax
  4011d5:       e8 86 fe ff ff          callq  401060 <printf@plt>
  4011da:       eb 0c                   jmp    4011e8 <main._omp_fn.0+0x3e>
  4011dc:       bf 1a 20 40 00          mov    $0x40201a,%edi
  4011e1:       e8 4a fe ff ff          callq  401030 <puts@plt>
  4011e6:       eb d7                   jmp    4011bf <main._omp_fn.0+0x15>
  4011e8:       c9                      leaveq 
  4011e9:       c3                      retq   
  4011ea:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

從上面的組合程式我們可以看到，被編譯的程式確實呼叫了函數 GOMP_single_start，如果這個函數的返回值不等於 true 的時候就會執行函數 GOMP_barrier 。這和我們上面的分析是一樣的。

現在最主要的函數就是 GOMP_single_start ，他的原始碼如下所示：

bool
GOMP_single_start (void)
{
  struct gomp_thread *thr = gomp_thread ();
  struct gomp_team *team = thr->ts.team;
  unsigned long single_count;
	
  if (__builtin_expect (team == NULL, 0))
    return true;
	// 首先獲得執行緒本地儲存的遇到的 single construct 數量
  // 並且將這個數量進行加一操作 因為又遇到了一次
  single_count = thr->ts.single_count++;
  // 如果下面的操作還沒有完成 執行緒組中儲存的 single_count 和 執行緒
  // 原生的 single_count 是相等的，因此才可以進行下面的比較並交換
  // 操作，當有一個執行緒成功之後 後面的執行緒執行下面的語句都會返回 false
  return __sync_bool_compare_and_swap (&team->single_count, single_count,
				       single_count + 1L);
}

上面函數只有一個執行緒會執行返回 true ，其他的執行緒執行都會返回 false，因此可以保證只有一個執行緒執行，single construct 程式碼塊，上面的執行的主要原理就是依賴比較並交換指令 (compare and swap , CAS) 指令實現的。

在分析上面的程式碼的時候需要注意 team->single_count 和 thr->ts.single_count，這是兩個不同的資料。__sync_bool_compare_and_swap 是編譯器內建的一個函數，這個函數的主要作用是將 &team->single_count 指向的資料和 single_count 進行比較，如果這兩個資料相等則進行交換操作，如果操作成功就返回 true，否則就返回 false 。

#pragma omp single copyprivate(...)

在這一小節當中我們將介紹一個比較少用的子句 copyprivate，並且分析 single construct 在處理這個子句的時候是如何進行處理的。

我們首先來了解一下這個子句改如何使用，這個是用於在 single construct 當中，當一個變數在每個執行緒當中都有一個副本的時候，在執行完成 single construct 之後只有一個執行緒的資料會被修改，如果想讓所有執行緒知道這個修改，那麼就需要使用 copyprivate ，比如下面的例子：

#include <stdio.h>
#include <omp.h>

int x = 100;
int y = -100;
#pragma omp threadprivate(x, y)

int main()
{
#pragma omp parallel num_threads(4) default(none) copyin(x)
  {
    x = omp_get_thread_num();
    printf("tid = %d x = %d\n", omp_get_thread_num(), x);
#pragma omp single copyprivate(x, y)
    {
      x = 200;
      y = -200;
    }
    printf("tid = %d x = %d y = %d\n", omp_get_thread_num(), x, y);
  }
  return 0;
}

在上面的程式當中 x 是一個全域性變數，#pragma omp threadprivate(x) 會讓每個執行緒都會有一個全域性變數 x 的執行緒原生的副本，copyin(x) 是將全域性變數 x 的值拷貝到每個執行緒原生的變數副本當中。我們知道只會有一個執行緒執行 single construct ，那麼只會有執行 single 程式碼的執行緒當中的 x 會變成 200，但是因為有 copyprivate，線上程執行完 single 程式碼塊之後會將修改之後的 x 值賦給其他的執行緒，這樣的話其他執行緒的 x 的值也變成 200 啦。上面的程式碼執行結果如下：

tid = 2 x = 2
tid = 3 x = 3
tid = 0 x = 0
tid = 1 x = 1
tid = 3 x = 200 y = -200
tid = 0 x = 200 y = -200
tid = 2 x = 200 y = -200
tid = 1 x = 200 y = -200

如果我們寫的程式碼如下所示：

#pragma omp single copyprivate(x, y)
  body;

上面的程式碼會被編譯器翻譯成下面的樣子：

datap = GOMP_single_copy_start ();
if (datap == NULL)
  {
    body;
    data = allocate memory;
    data.x = x;
    data.y = y;
    GOMP_single_copy_end (&data);
  }
else
{
  x = datap->x;
  y = datap->y;
}
GOMP_barrier ();

首先我們來了解一下 GOMP_single_copy_start 的返回值：

如果這個執行緒的返回值是 NULL，那麼就說明這個執行緒會執行 single construct 中的程式碼，反之執行緒就不會執行 single 中的程式碼。
如果執行緒沒有獲得 single 程式碼塊的執行權的話，那麼這個執行緒將會被阻塞在函數 GOMP_single_copy_start 當中，只有 single 中的程式碼被執行完成之後執行緒才會被喚醒，具體來說是執行 single 程式碼塊的執行緒進入到 GOMP_single_copy_end 中之後才會喚醒其他的執行緒，之所以這麼做的原因是首先要得到最終的 x 的值，然後將這個值通過執行緒組之間的共用變數讓沒有執行 single 程式碼塊的執行緒能夠獲得執行 single 程式碼塊的執行緒當中的 x 的值，因為在沒有執行完成 single 程式碼塊之後是不能夠知道 x 的最終的值的，而不知道 x 的最終的值，是不能夠執行 x = datap->x; 的，因此需要將執行緒阻塞在 GOMP_single_copy_start 當中。
如果執行緒的返回值不等於 NULL，那麼就說明這個執行緒沒有獲取到 single 程式碼塊的執行權，這個返回值 datap 是指向 threadprivate 資料的指標，比如上面的例子就是指向 x 的指標，因為可以做到申請 x, y 記憶體空間的時候是連續的，知道 x 的指標和 x 的大小就可以計算出執行緒私有變數 y 的地址，這是編譯器可以做到的。

上面的兩個動態庫函數的原始碼如下所示（詳細的說明已經在註釋當中）：

/* This routine is called when first encountering a SINGLE construct that
   does have a COPYPRIVATE clause.  Returns NULL if this is the thread
   that should execute the clause; otherwise the return value is pointer
   given to GOMP_single_copy_end by the thread that did execute the clause.  */

void *
GOMP_single_copy_start (void)
{
  struct gomp_thread *thr = gomp_thread ();

  bool first;
  void *ret;
	
  // 這個函數可以返回 true 或者 false 如果執行緒需要執行 single 程式碼塊
  // 則返回 true, 否則返回 false
  first = gomp_work_share_start (0);
  
  if (first)
    {
      gomp_work_share_init_done ();
      ret = NULL;
    }
  else
    {
    	// 我們在前面提到了 沒有執行 single 程式碼塊的執行緒會被阻塞在這個函數當中
    	// 實際就是在這個位置進行阻塞的，以保證 copyprivate 當中的變數的值已經被更新啦
      gomp_team_barrier_wait (&thr->ts.team->barrier);
			// 這裡就是沒執行 single 程式碼塊的執行緒的函數返回值
    	// 執行 single 程式碼塊的執行緒會將 x, y 拷貝一份並且將指向 x, y 記憶體地址的
    	// 指標賦值給變數 thr->ts.work_share->copyprivate; （在函數 GOMP_single_copy_end 當中可以看到具體的程式碼）
      ret = thr->ts.work_share->copyprivate;
      gomp_work_share_end_nowait ();
    }

  return ret;
}

/* This routine is called when the thread that entered a SINGLE construct
   with a COPYPRIVATE clause gets to the end of the construct.  */

void
GOMP_single_copy_end (void *data)
{
  struct gomp_thread *thr = gomp_thread ();
  struct gomp_team *team = thr->ts.team;

  if (team != NULL)
    {
    	// 這個函數只有執行了 single 程式碼塊的執行緒才會執行
    	// 我們在前面已經提到了傳給這個函數的引數是指向 x, y 
    	// 記憶體地址的指標，現在將這個指標賦值給 thr->ts.work_share->copyprivate
    	// 那麼其他的執行緒就能夠通過 thr->ts.work_share->copyprivate 獲取到 x, y 
    	// 的值啦
      thr->ts.work_share->copyprivate = data;
    	// 因為前面執行緒都被阻塞了 需要等待所有的執行緒都到達之後才能夠繼續往後執行
    	// 因此這個執行緒需要進入 barrier ，當所有的執行緒都到達之後那麼就能夠繼續往後執行了
      gomp_team_barrier_wait (&team->barrier);
    }

  gomp_work_share_end_nowait ();
}

上面的整個流程如下圖所示：

我們在來看一下前面提到的使用 single copyprivate(x, y) 的程式

#pragma omp parallel num_threads(4) default(none) copyin(x)
  {
    x = omp_get_thread_num();
    printf("tid = %d x = %d\n", omp_get_thread_num(), x);
#pragma omp single copyprivate(x, y)
    {
      x = 200;
      y = -200;
    }
    printf("tid = %d x = %d y = %d\n", omp_get_thread_num(), x, y);
  }

編譯之後的組合程式是怎麼樣的（重要的部分已在程式碼當中進行標出）：

00000000004011bb <main._omp_fn.0>:
  4011bb:       55                      push   %rbp
  4011bc:       48 89 e5                mov    %rsp,%rbp
  4011bf:       41 54                   push   %r12
  4011c1:       53                      push   %rbx
  4011c2:       48 83 ec 20             sub    $0x20,%rsp
  4011c6:       48 89 7d d8             mov    %rdi,-0x28(%rbp)
  4011ca:       e8 81 fe ff ff          callq  401050 <omp_get_thread_num@plt>
  4011cf:       85 c0                   test   %eax,%eax
  4011d1:       0f 85 c2 00 00 00       jne    401299 <main._omp_fn.0+0xde>
  4011d7:       e8 74 fe ff ff          callq  401050 <omp_get_thread_num@plt>
  4011dc:       64 89 04 25 f8 ff ff    mov    %eax,%fs:0xfffffffffffffff8
  4011e3:       ff 
  4011e4:       64 8b 1c 25 f8 ff ff    mov    %fs:0xfffffffffffffff8,%ebx
  4011eb:       ff 
  4011ec:       e8 5f fe ff ff          callq  401050 <omp_get_thread_num@plt>
  4011f1:       89 da                   mov    %ebx,%edx
  4011f3:       89 c6                   mov    %eax,%esi
  4011f5:       bf 10 20 40 00          mov    $0x402010,%edi
  4011fa:       b8 00 00 00 00          mov    $0x0,%eax
  4011ff:       e8 5c fe ff ff          callq  401060 <printf@plt>
  401204:       e8 87 fe ff ff          callq  401090 <GOMP_single_copy_start@plt>
  401209:       48 85 c0                test   %rax,%rax
  40120c:       74 4c                   je     40125a <main._omp_fn.0+0x9f>
  40120e:       eb 33                   jmp    401243 <main._omp_fn.0+0x88>
  401210:       e8 1b fe ff ff          callq  401030 <GOMP_barrier@plt>
  401215:       64 44 8b 24 25 fc ff    mov    %fs:0xfffffffffffffffc,%r12d
  40121c:       ff ff 
  40121e:       64 8b 1c 25 f8 ff ff    mov    %fs:0xfffffffffffffff8,%ebx
  401225:       ff 
  401226:       e8 25 fe ff ff          callq  401050 <omp_get_thread_num@plt>
  40122b:       44 89 e1                mov    %r12d,%ecx
  40122e:       89 da                   mov    %ebx,%edx
  401230:       89 c6                   mov    %eax,%esi
  401232:       bf 21 20 40 00          mov    $0x402021,%edi
  401237:       b8 00 00 00 00          mov    $0x0,%eax
  40123c:       e8 1f fe ff ff          callq  401060 <printf@plt>
  401241:       eb 69                   jmp    4012ac <main._omp_fn.0+0xf1>
  # //////////// 沒有獲得 single construct 執行權的執行緒將執行下面的程式碼 ///////////
  # 下面的 5 條組合指令其實就是將 x, y 的資料拷貝到執行緒的私有資料 thread local storage
  401243:       8b 50 04                mov    0x4(%rax),%edx # 
  401246:       64 89 14 25 fc ff ff    mov    %edx,%fs:0xfffffffffffffffc
  40124d:       ff 
  40124e:       8b 00                   mov    (%rax),%eax
  401250:       64 89 04 25 f8 ff ff    mov    %eax,%fs:0xfffffffffffffff8
  401257:       ff 
  # ////////////////////////////////////////////////////////////////////////
  401258:       eb b6                   jmp    401210 <main._omp_fn.0+0x55>
  # //////////// 獲得 single construct 執行權的執行緒將執行下面的程式碼 //////////////
  # 下面的程式碼就是 x = 200
  40125a:       64 c7 04 25 f8 ff ff    movl   $0xc8,%fs:0xfffffffffffffff8
  401261:       ff c8 00 00 00 
  # 下面的程式碼就是 y = -200
  401266:       64 c7 04 25 fc ff ff    movl   $0xffffff38,%fs:0xfffffffffffffffc
  40126d:       ff 38 ff ff ff 
  # 下面的程式碼就是將 y 的值儲存到 eax 暫存器
  401272:       64 8b 04 25 fc ff ff    mov    %fs:0xfffffffffffffffc,%eax
  401279:       ff 
  # 將 eax 暫存器的值儲存到棧上
  40127a:       89 45 ec                mov    %eax,-0x14(%rbp)
  # 將 x 的值儲存到 eax 暫存器
  40127d:       64 8b 04 25 f8 ff ff    mov    %fs:0xfffffffffffffff8,%eax
  401284:       ff 
  # 將 eax 暫存器的值儲存到棧上
  401285:       89 45 e8                mov    %eax,-0x18(%rbp)
  # 上面的幾行程式碼就完成了執行緒私有資料的拷貝 下面的程式碼就是將棧上儲存 x, y 的記憶體地址通過引數傳遞給函數 GOMP_single_copy_end 這樣就可以儲存在 thr->ts.work_share->copyprivate 上啦
  401288:       48 8d 45 e8             lea    -0x18(%rbp),%rax
  40128c:       48 89 c7                mov    %rax,%rdi
  40128f:       e8 ac fd ff ff          callq  401040 <GOMP_single_copy_end@plt>
  # ////////////////////////////////////////////////////////////////////////
  401294:       e9 77 ff ff ff          jmpq   401210 <main._omp_fn.0+0x55>
  401299:       48 8b 45 d8             mov    -0x28(%rbp),%rax
  40129d:       8b 00                   mov    (%rax),%eax
  40129f:       64 89 04 25 f8 ff ff    mov    %eax,%fs:0xfffffffffffffff8
  4012a6:       ff 
  4012a7:       e9 2b ff ff ff          jmpq   4011d7 <main._omp_fn.0+0x1c>
  4012ac:       48 83 c4 20             add    $0x20,%rsp
  4012b0:       5b                      pop    %rbx
  4012b1:       41 5c                   pop    %r12
  4012b3:       5d                      pop    %rbp
  4012b4:       c3                      retq   
  4012b5:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4012bc:       00 00 00 
  4012bf:       90                      nop

總結

在本篇文章當中主要給大家深入分析了 barrier construct 的實現原理，以及 single construct 的兩種使用方式並且深入分析了 copy private 的實現原理，具體的執行緒私有資料是如果通過 OpenMP 庫函數進行傳遞的，整個過程還是有些複雜的，需要仔細的對整個流程進行思考才能夠理解。以上就是本篇文章的所有內容希望大家有所收穫！

更多精彩內容合集可存取專案：https://github.com/Chang-LeHung/CSCore

關注公眾號：一無是處的研究僧，瞭解更多計算機（Java、Python、計算機系統基礎、演演算法與資料結構）知識。