<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.5.2">Jekyll</generator><link href="https://www.dgl.ai/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.dgl.ai/" rel="alternate" type="text/html" /><updated>2024-08-13T10:18:24+00:00</updated><id>https://www.dgl.ai/feed.xml</id><title type="html">Deep Graph Library</title><subtitle>Easy Deep Learning on Graphs</subtitle><entry><title type="html">GNN training acceleration with BFloat16 data type on CPU</title><link href="https://www.dgl.ai/blog/2024/08/10/bfloat16.html" rel="alternate" type="text/html" title="GNN training acceleration with BFloat16 data type on CPU" /><published>2024-08-10T00:00:00+00:00</published><updated>2024-08-10T00:00:00+00:00</updated><id>https://www.dgl.ai/blog/2024/08/10/bfloat16</id><content type="html" xml:base="https://www.dgl.ai/blog/2024/08/10/bfloat16.html"><p>Graph neural networks (GNNs) have achieved state-of-the-art performance on
various industrial tasks. However, most GNN operations are memory-bound and
require a significant amount of RAM. To tackle this problem, we apply a
well-known memory-efficiency technique, reducing tensor size with a smaller
data type, to GNN training on Intel® Xeon® Scalable processors using bfloat16.
The approach achieves substantial gains on various GNN models covering a wide
range of datasets, speeding up training by up to 5×.</p>
<h2 id="bfloat16-data-type">Bfloat16 data type</h2>
<p>Bfloat16 is a 16-bit floating-point data type. It differs from the default
32-bit float data type only in mantissa length: 7 bits instead of 23.</p>
<p><img src="/assets/images/posts/2024-08-10-bfloat16/bfloat16.png" alt="bfloat16" /></p>
<p>Bfloat16 was developed by the Google Brain team and is now widely used in
DNNs and other AI applications. Many devices natively support bfloat16, from
GPUs and AI accelerators to CPUs, and even compilers such as GCC and LLVM are
enabling this data type in the latest <a href="https://en.cppreference.com/w/cpp/types/floating-point">C/C++ standards</a>.
According to the Google Brain team, the exponent matters more than the mantissa
for training and ML operations, so reducing the mantissa bits preserves model
accuracy while delivering the same performance as other half-precision data
types. Another advantage of bfloat16 is the simplicity of conversion between
bfloat16 and float.</p>
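<p>To illustrate that simplicity, here is a minimal sketch of our own (not a
DGL or PyTorch API): a float32 becomes a bfloat16 by keeping its top 16 bits.
The sketch truncates, while real hardware typically rounds to nearest even:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code>import struct

def float_to_bfloat16_bits(x):
    # Reinterpret the float32 bit pattern and keep the upper 16 bits
    # (sign, 8 exponent bits, 7 mantissa bits).
    (bits,) = struct.unpack("I", struct.pack("f", x))
    return bits >> 16

def bfloat16_bits_to_float(b):
    # Restore a float32 by padding the mantissa with zeros.
    (x,) = struct.unpack("f", struct.pack("I", b << 16))
    return x

print(bfloat16_bits_to_float(float_to_bfloat16_bits(3.1415926)))  # 3.140625
</code></pre>
</div>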
<h2 id="bfloat16-cpu-acceleration">Bfloat16 CPU acceleration</h2>
<p>Starting with the 3rd Gen Intel® Xeon® Scalable processors (codenamed Cooper
Lake), x86 CPUs natively support bfloat16. Support was first enabled via the
Intel® Advanced Vector Extensions-512 (Intel® AVX-512) AVX512_BF16 instruction
set, which, like other AVX-512 extensions, provides the basic operations: dot
products and conversion functions.
In the latest 4th Gen Intel® Xeon® Scalable processors (codenamed Sapphire Rapids),
the Intel® <a href="https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html">AMX</a>
instruction set was introduced to further improve 16-bit and 8-bit matrix
performance. AMX adds “tile” instructions that operate on special 2D “tile”
registers. Currently, it offers only a tile matrix multiply unit (TMUL), which
performs matrix multiplication for the bfloat16 and int8 data types.
Future Intel Xeon generations, starting with Granite Rapids, will support fp16
in addition to bfloat16.</p>
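<p>Before opting into bfloat16, it can be useful to check what the CPU
advertises. A minimal, Linux-only sketch of our own that reads the
/proc/cpuinfo flags:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code># Check for native bfloat16 instruction sets on Linux.
flags = open("/proc/cpuinfo").read()
print("AVX512_BF16 supported:", "avx512_bf16" in flags)
print("AMX bfloat16 supported:", "amx_bf16" in flags)
</code></pre>
</div>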
<h2 id="bfloat16-in-dgl">Bfloat16 in DGL</h2>
<p>Bfloat16 support was recently added to the DGL library (starting from <a href="https://github.com/dmlc/dgl/releases/tag/1.0.0">DGL version 1.0.0</a>
for Nvidia GPU and <a href="https://github.com/dmlc/dgl/releases/tag/1.1.0">DGL version 1.1.0</a>
for CPU), so it can be used for model training and inference on both CPU and
GPU devices. The following DGL APIs convert the graph, the features, and the
model to the bfloat16 data type:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Convert graph, model, and graph features to bfloat16</span>
<span class="n">g</span> <span class="o">=</span> <span class="n">dgl</span><span class="o">.</span><span class="n">to_bfloat16</span><span class="p">(</span><span class="n">g</span><span class="p">)</span>
<span class="n">feat</span> <span class="o">=</span> <span class="n">feat</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">bfloat16</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">bfloat16</span><span class="p">)</span>
</code></pre>
</div>
<p>The following example trains <a href="https://snap.stanford.edu/graphsage/">GraphSAGE</a>
with bfloat16 using the provided API:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="kn">as</span> <span class="nn">nn</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="kn">as</span> <span class="nn">F</span>
<span class="kn">import</span> <span class="nn">dgl</span>
<span class="kn">from</span> <span class="nn">dgl.data</span> <span class="kn">import</span> <span class="n">CiteseerGraphDataset</span>
<span class="kn">from</span> <span class="nn">dgl.nn</span> <span class="kn">import</span> <span class="n">SAGEConv</span>
<span class="kn">from</span> <span class="nn">dgl.transforms</span> <span class="kn">import</span> <span class="n">AddSelfLoop</span>
<span class="k">class</span> <span class="nc">SAGE</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">in_size</span><span class="p">,</span> <span class="n">hid_size</span><span class="p">,</span> <span class="n">out_size</span><span class="p">):</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="n">__init__</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">layers</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">ModuleList</span><span class="p">()</span>
<span class="c"># two-layer SAGE</span>
<span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">SAGEConv</span><span class="p">(</span><span class="n">in_size</span><span class="p">,</span> <span class="n">hid_size</span><span class="p">,</span> <span class="s">"gcn"</span><span class="p">))</span>
<span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">SAGEConv</span><span class="p">(</span><span class="n">hid_size</span><span class="p">,</span> <span class="n">out_size</span><span class="p">,</span> <span class="s">"gcn"</span><span class="p">))</span>
<span class="bp">self</span><span class="o">.</span><span class="n">dropout</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">graph</span><span class="p">,</span> <span class="n">features</span><span class="p">):</span>
<span class="n">h</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">features</span><span class="p">)</span>
<span class="k">for</span> <span class="n">l</span><span class="p">,</span> <span class="n">layer</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">):</span>
<span class="n">h</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">graph</span><span class="p">,</span> <span class="n">h</span><span class="p">)</span>
<span class="k">if</span> <span class="n">l</span> <span class="o">!=</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
<span class="n">h</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="n">h</span><span class="p">)</span>
<span class="n">h</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">h</span><span class="p">)</span>
<span class="k">return</span> <span class="n">h</span>
<span class="c"># Data loading</span>
<span class="n">transform</span> <span class="o">=</span> <span class="n">AddSelfLoop</span><span class="p">()</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">CiteseerGraphDataset</span><span class="p">(</span><span class="n">transform</span><span class="o">=</span><span class="n">transform</span><span class="p">)</span>
<span class="n">g</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">g</span> <span class="o">=</span> <span class="n">g</span><span class="o">.</span><span class="nb">int</span><span class="p">()</span>
<span class="n">train_mask</span> <span class="o">=</span> <span class="n">g</span><span class="o">.</span><span class="n">ndata</span><span class="p">[</span><span class="s">'train_mask'</span><span class="p">]</span>
<span class="n">feat</span> <span class="o">=</span> <span class="n">g</span><span class="o">.</span><span class="n">ndata</span><span class="p">[</span><span class="s">'feat'</span><span class="p">]</span>
<span class="n">label</span> <span class="o">=</span> <span class="n">g</span><span class="o">.</span><span class="n">ndata</span><span class="p">[</span><span class="s">'label'</span><span class="p">]</span>
<span class="n">in_size</span> <span class="o">=</span> <span class="n">feat</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">hid_size</span> <span class="o">=</span> <span class="mi">16</span>
<span class="n">out_size</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">num_classes</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">SAGE</span><span class="p">(</span><span class="n">in_size</span><span class="p">,</span> <span class="n">hid_size</span><span class="p">,</span> <span class="n">out_size</span><span class="p">)</span>
<span class="c"># Convert model and graph to bfloat16</span>
<span class="n">g</span> <span class="o">=</span> <span class="n">dgl</span><span class="o">.</span><span class="n">to_bfloat16</span><span class="p">(</span><span class="n">g</span><span class="p">)</span>
<span class="n">feat</span> <span class="o">=</span> <span class="n">feat</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">bfloat16</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">bfloat16</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">train</span><span class="p">()</span>
<span class="c"># Create optimizer</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">optim</span><span class="o">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">1e-2</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">5e-4</span><span class="p">)</span>
<span class="n">loss_fcn</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">CrossEntropyLoss</span><span class="p">()</span>
<span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">feat</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">loss_fcn</span><span class="p">(</span><span class="n">logits</span><span class="p">[</span><span class="n">train_mask</span><span class="p">],</span> <span class="n">label</span><span class="p">[</span><span class="n">train_mask</span><span class="p">])</span>
<span class="n">loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
<span class="n">optimizer</span><span class="o">.</span><span class="n">step</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Epoch {} | Loss {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">epoch</span><span class="p">,</span> <span class="n">loss</span><span class="o">.</span><span class="n">item</span><span class="p">()))</span>
</code></pre>
</div>
<h2 id="experimental-results">Experimental results</h2>
<p>We tested the most popular examples: <a href="https://arxiv.org/abs/1312.6203">GCN</a>
and <a href="https://arxiv.org/abs/1706.02216">GraphSAGE</a>. For full-graph
training, the basic datasets were chosen, while the mini-batch experiments used
datasets from <a href="https://ogb.stanford.edu/docs/nodeprop/">OGB</a>, which
are significantly larger. For instance, ogbn-products has around 2.5 million
nodes and 61 million edges, whereas ogbn-papers100M has 111 million nodes and
1.6 billion edges. Table 1 shows that test accuracy is the same for float and
bfloat16, or differs only insignificantly.</p>
<table style="text-align: center;">
<tr>
<th>Model</th>
<th>Dataset</th>
<th>Test accuracy (float)</th>
<th>Test accuracy (bfloat16)</th>
</tr>
<tr>
<td>gcn</td>
<td>citeseer</td>
<td>71%</td>
<td>71%</td>
</tr>
<tr>
<td>gcn</td>
<td>cora</td>
<td>81%</td>
<td>81%</td>
</tr>
<tr>
<td>gcn</td>
<td>pubmed</td>
<td>79%</td>
<td>79%</td>
</tr>
<tr>
<td>graphsage</td>
<td>citeseer</td>
<td>71%</td>
<td>71%</td>
</tr>
<tr>
<td>graphsage</td>
<td>cora</td>
<td>81%</td>
<td>81%</td>
</tr>
<tr>
<td>graphsage</td>
<td>pubmed</td>
<td>78%</td>
<td>78%</td>
</tr>
<tr>
<td>gcn minibatch</td>
<td>ogbn-papers100M</td>
<td>57%</td>
<td>57%</td>
</tr>
<tr>
<td>gcn minibatch</td>
<td>ogbn-products</td>
<td>78%</td>
<td>78%</td>
</tr>
<tr>
<td>graphsage minibatch</td>
<td>ogbn-papers100M</td>
<td>62%</td>
<td>61%</td>
</tr>
<tr>
<td>graphsage minibatch</td>
<td>ogbn-products</td>
<td>76%</td>
<td>74%</td>
</tr>
</table>
<p>The plots below show results on <a href="https://aws.amazon.com/ec2/instance-types/r6i/">AWS r6i</a>
instances powered by 3rd Generation Intel Xeon Scalable processors (codenamed
Ice Lake), which lack native bfloat16 instructions, and on <a href="https://aws.amazon.com/ec2/instance-types/r7iz/">AWS r7iz</a>
instances based on 4th Generation Intel Xeon Scalable processors (Sapphire
Rapids), which natively support both AVX512_BF16 and AMX. In both experiments,
the number of threads is limited to 16, which we found to be the best thread
count for a single training run on Intel® Xeon®.</p>
<p><img src="/assets/images/posts/2024-08-10-bfloat16/plot_1.png" alt="plot_1" width="800x" class="aligncenter" />
<img src="/assets/images/posts/2024-08-10-bfloat16/plot_2.png" alt="plot_2" width="800x" class="aligncenter" /></p>
<p>GNN training became more efficient on both types of Intel® Xeon® instances
with bfloat16. Notably, for the basic datasets on the AWS r6i, performance
improved by up to 32%. Similarly, for the basic datasets on the AMX-accelerated
r7iz machine, bfloat16 improved training performance by up to 92%.
On the <a href="https://ogb.stanford.edu/docs/nodeprop/">OGB</a> datasets,
which are notably larger, performance improved by up to 2.89 times on the r6i
and up to 5.04 times on the r7iz. The plots also show that every training step
improved in performance; the most significant impact was on the forward pass,
which was up to 12.7 times faster with bfloat16.</p>
<p><img src="/assets/images/posts/2024-08-10-bfloat16/plot_3.png" alt="plot_3" width="800x" class="aligncenter" />
<img src="/assets/images/posts/2024-08-10-bfloat16/plot_4.png" alt="plot_4" width="800x" class="aligncenter" />
<img src="/assets/images/posts/2024-08-10-bfloat16/plot_5.png" alt="plot_5" width="800x" class="aligncenter" />
<img src="/assets/images/posts/2024-08-10-bfloat16/plot_6.png" alt="plot_6" width="800x" class="aligncenter" /></p>
<h2 id="conclusion">Conclusion</h2>
<p>Using the bfloat16 data type is strongly recommended to boost GNN training
performance on Sapphire Rapids, as well as on earlier generations of Intel Xeon
Scalable processors. Even on Ice Lake, which lacks native bfloat16
instructions, bfloat16 enhances the efficiency of memory-bound operations and
reduces training costs.</p>
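<p>A hedged alternative to converting every tensor, assuming a recent PyTorch
and reusing the variables from the GraphSAGE example above, is to keep float32
parameters and let CPU autocast run eligible operations in bfloat16. Note that
autocast coverage of DGL kernels may vary by version:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code>import torch

optimizer.zero_grad()
# Run the forward pass under CPU autocast; parameters stay in float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = model(g, feat)
    loss = loss_fcn(logits[train_mask], label[train_mask])
loss.backward()
optimizer.step()
</code></pre>
</div>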
<p>Nevertheless, certain methods within DL frameworks may not fully support or
optimally use CPU bfloat16 instructions. In such situations, we advise evaluating
both float and bfloat16 performance over a small number of epochs to determine
the optimal choice.</p></content><author><name>Ilia Taraban</name></author><category term="blog" /><category term="blog" /><summary type="html">Graph neural networks (GNNs) have achieved state-of-the-art performance on various industrial tasks. However, most GNN operations are memory-bound and require a significant amount of RAM. To tackle this problem, we apply a well-known memory-efficiency technique, reducing tensor size with a smaller data type, to GNN training on Intel® Xeon® Scalable processors using bfloat16. The approach achieves substantial gains on various GNN models covering a wide range of datasets, speeding up training by up to 5×.</summary></entry><entry><title type="html">DGL 2.1: GPU Acceleration for Your GNN Data Pipeline</title><link href="https://www.dgl.ai/release/2024/03/06/release.html" rel="alternate" type="text/html" title="DGL 2.1: GPU Acceleration for Your GNN Data Pipeline" /><published>2024-03-06T00:00:00+00:00</published><updated>2024-03-06T00:00:00+00:00</updated><id>https://www.dgl.ai/release/2024/03/06/release</id><content type="html" xml:base="https://www.dgl.ai/release/2024/03/06/release.html"><p>We are happy to announce the release of DGL 2.1. In this release, we are making
GNN data loading lightning fast. We introduce GPU acceleration for the whole GNN
data loading pipeline in GraphBolt, including the graph sampling and feature
fetching stages.</p>
<h2 id="flexible-data-pipeline--customizable-stages-all-accelerated-on-your-gpu">Flexible data pipeline &amp; customizable stages, all accelerated on your GPU</h2>
<p>Starting with this release, the data moving stage can be placed earlier in
the data pipeline to enable GPU acceleration. With this change, the following
permutations of the core stages are now possible:</p>
<p><img src="/assets/images/posts/2024-03-06-release/workstream.png" alt="diagram" width="800x" class="aligncenter" /></p>
<p>To execute all of the data loading stages on the GPU, the graph and the features
need to be GPU accessible. As the GPU memory may be limited, GraphBolt offers an
in-place pinning operation to enable GPU access to the graph and features
resident in main memory.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Pin the graph and features in-place.</span>
<span class="n">graph</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">graph</span><span class="o">.</span><span class="n">pin_memory_</span><span class="p">()</span>
<span class="n">features</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">feature</span><span class="o">.</span><span class="n">pin_memory_</span><span class="p">()</span>
</code></pre>
</div>
<p>However, if the GPU has sufficient memory, the graph and/or the features can
be moved to the GPU as follows:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Move the graph and features to the GPU.</span>
<span class="n">graph</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">graph</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="s">"cuda:0"</span><span class="p">)</span>
<span class="n">features</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">feature</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="s">"cuda:0"</span><span class="p">)</span>
</code></pre>
</div>
<p>The GPU may have a large memory that is still not large enough to fit all
the features. In that case, part of the features can be cached using
<a href="https://docs.dgl.ai/generated/dgl.graphbolt.GPUCachedFeature.html#dgl.graphbolt.GPUCachedFeature">gb.GPUCachedFeature</a>;
see the <a href="https://github.com/dmlc/dgl/blob/3ced3411e55bca803ed5ec5e1de6f62e1f21478f/examples/multigpu/graphbolt/node_classification.py#L288-L292">GraphBolt multiGPU example</a>.</p>
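<p>As a hypothetical sketch that mirrors the linked example (the private
<code class="highlighter-rouge">_features</code> access and the cache-size
argument follow that example and may change across DGL versions):</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code># Wrap the "feat" node feature with a GPU cache so frequently accessed
# rows stay on the device while the rest falls back to pinned memory.
key = ("node", None, "feat")
features._features[key] = gb.GPUCachedFeature(features._features[key], 5000000)
</code></pre>
</div>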
<p>Placing the copy operation earlier in the pipeline enables GPU execution for
the rest of the operations. All of the GraphBolt components compose as you
would expect.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Seed edge sampler.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">gb</span><span class="o">.</span><span class="n">ItemSampler</span><span class="p">(</span><span class="n">train_edge_set</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c"># Copy here to execute the remaining operations on the GPU.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">dp</span><span class="o">.</span><span class="n">copy_to</span><span class="p">(</span><span class="n">device</span><span class="o">=</span><span class="s">"cuda:0"</span><span class="p">)</span>
<span class="c"># Negative sampling.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">dp</span><span class="o">.</span><span class="n">sample_uniform_negative</span><span class="p">(</span><span class="n">graph</span><span class="p">,</span> <span class="n">negative_ratio</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="c"># Neighbor sampling.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">dp</span><span class="o">.</span><span class="n">sample_neighbor</span><span class="p">(</span><span class="n">graph</span><span class="p">,</span> <span class="n">fanouts</span><span class="o">=</span><span class="p">[</span><span class="mi">15</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">])</span>
<span class="c"># Fetch features.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">dp</span><span class="o">.</span><span class="n">fetch_feature</span><span class="p">(</span><span class="n">features</span><span class="p">,</span> <span class="n">node_feature_keys</span><span class="o">=</span><span class="p">[</span><span class="s">"feat"</span><span class="p">])</span>
</code></pre>
</div>
<p>The descriptive nature of the PyTorch datapipe lets us take a defined data
pipeline and modify it to support GPU-specific optimizations with no change to
the user experience. Two such examples are the <code class="highlighter-rouge">overlap_feature_fetch</code>
and <code class="highlighter-rouge">overlap_graph_fetch</code> arguments of <a href="https://docs.dgl.ai/en/latest/generated/dgl.graphbolt.DataLoader.html">gb.DataLoader</a>,
where the feature fetching and graph access operations are overlapped with the
rest of the operations using a separate CUDA stream via pipeline parallelism.</p>
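<p>A short sketch of how these flags are passed, assuming the datapipe
<code class="highlighter-rouge">dp</code> built above (argument defaults may
differ by DGL version):</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code># Overlap feature fetching and graph access with the rest of the pipeline.
dataloader = gb.DataLoader(dp, overlap_feature_fetch=True, overlap_graph_fetch=True)
for minibatch in dataloader:
    pass  # forward/backward pass goes here
</code></pre>
</div>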
<h2 id="gpu-acceleration-speedups">GPU acceleration speedups</h2>
<p>The dgl.graphbolt doesn’t just give you flexibility; it also provides top
performance under the hood. As of the 2.1 release, almost all dgl.graphbolt
operations are GPU accelerated, except for sampling with replacement.
Additionally, the feature fetch operation now runs in parallel with everything
else via pipeline parallelism. This has the potential to cut runtimes by up to
<strong>2x</strong> depending on the scenario. Moreover, utilizing <a href="https://docs.dgl.ai/generated/dgl.graphbolt.GPUCachedFeature.html#dgl.graphbolt.GPUCachedFeature">gb.GPUCachedFeature</a>
can cut feature transfer times even further; our multi-GPU benchmarks show up
to <strong>1.6x</strong> speedup.</p>
<p>To evaluate the performance of GraphBolt, we have tested 4 different scenarios:</p>
<ul>
<li>Single-GPU Node Classification</li>
<li>Single-GPU Link Prediction</li>
<li>Single-GPU Heterogeneous Node Classification</li>
<li>Multi-GPU Node Classification</li>
</ul>
<p>In these scenarios, we compare 5 different configurations. The first two are
existing baselines, and the last three are new configurations enabled by the
DGL 2.1 release:</p>
<ul>
<li>GraphBolt CPU backend, denoted as “dgl.graphbolt (cpu)”.</li>
<li>The legacy DGL dataloader with UVA, denoted as “Legacy DGL (pinned)”, with
the dataset pinned in system memory.</li>
<li>GraphBolt GPU backend, denoted as “dgl.graphbolt (pinned)”, with the
dataset pinned in system memory.</li>
<li>GraphBolt GPU backend, denoted as “dgl.graphbolt (pinned, 5M)”, with the
dataset pinned in system memory and <a href="https://docs.dgl.ai/generated/dgl.graphbolt.GPUCachedFeature.html#dgl.graphbolt.GPUCachedFeature">gb.GPUCachedFeature</a>
caching 5M node features.</li>
<li>GraphBolt GPU backend, denoted as “dgl.graphbolt (cuda)”, with the dataset
moved to GPU memory.</li>
</ul>
<p>All the experiments were run on an NVIDIA DGX-A100 system with 8 GPUs.</p>
<h3 id="single-gpu-node-classification">Single-GPU Node Classification</h3>
<p>We evaluate the performance of GraphBolt and the legacy DGL dataloader when
the dataset is stored in pinned memory (UVA) against the CPU GraphBolt
baseline. We use a 3-layer GraphSAGE model with batch size 1024 and fanout 10
in each layer, and evaluate performance on the ogbn-products and
ogbn-papers100M datasets using the baselines listed above.</p>
<p><img src="/assets/images/posts/2024-03-06-release/single-gpu-node-classification.png" alt="diagram" width="800x" class="aligncenter" /></p>
<p>As one can see, GraphBolt’s new GPU backend can get up to <strong>4.2x</strong> speedup
compared to the GraphBolt CPU baseline while the legacy DGL dataloader can get
at most <strong>2.5x</strong>.</p>
<h3 id="single-gpu-link-prediction">Single-GPU Link Prediction</h3>
<p>Here, we shift our focus to the link prediction scenario on the
ogbl-citation2 dataset, with a setting similar to the previous section. Two
different modes are evaluated: including or excluding reverse edges.</p>
<p><img src="/assets/images/posts/2024-03-06-release/single-gpu-link-prediction.png" alt="diagram" width="800x" class="aligncenter" /></p>
<p>We observe that GraphBolt’s new GPU backend gets up to <strong>5x</strong>
speedup compared to its CPU baselines. The legacy DGL dataloader is slow here
due to missing GPU counterparts of some operations required for link prediction
data loading.</p>
<h3 id="single-gpu-heterogeneous-node-classification">Single-GPU Heterogeneous Node Classification</h3>
<p>You can accelerate heterogeneous sampling on your GPU as well. The
<a href="https://github.com/dmlc/dgl/blob/master/examples/sampling/graphbolt/rgcn/hetero_rgcn.py">R-GCN example</a>
runtime on the ogbn-mag dataset was 43.5s with the “dgl.graphbolt (cpu)”
baseline and it went down to 25.2s with the new “dgl.graphbolt (pinned)” for a
<strong>1.73x</strong> speedup. You can expect the speedup numbers to increase as we optimize
the different use cases for the GPU.</p>
<h3 id="multi-gpu-node-classification">Multi-GPU Node Classification</h3>
<p>Here, we evaluate the node classification use case in the multi-GPU setting
with a GraphSAGE model on the ogbn-papers100M dataset (111M nodes and 3.2B
edges). The setting is similar to the single-GPU scenario, except that each GPU
uses a batch size of 1024, with the global batch size increasing linearly with
the number of GPUs.</p>
<p><img src="/assets/images/posts/2024-03-06-release/multi-gpu-node-classification.png" alt="diagram" width="800x" class="aligncenter" /></p>
<p>The GPU cache feature in GraphBolt can take advantage of the 40GB memory on
each of our A100 GPUs. We achieve a <strong>1.75x</strong> improvement on a
single GPU and a <strong>2.31x</strong> improvement on 8 GPUs compared to the
previous state-of-the-art baseline, Legacy DGL (pinned). Moreover, compared to
the GraphBolt CPU baseline, we achieve over <strong>10x</strong> improvement.</p>
<h2 id="reduced-graph-storage-space-requirements">Reduced graph storage space requirements</h2>
<p>Many large-scale graphs and existing GNN datasets have fewer than 2 billion
nodes but more than 2 billion edges. One such example is the ogbn-papers100M
graph with its 111 million nodes and 3.2 billion edges. dgl.graphbolt uses the
CSC (Compressed Sparse Column) format to store your graph in a memory-efficient
way. With our latest additions, memory usage now scales by 4 bytes (int32)
w.r.t. the number of edges and 8 bytes (int64) w.r.t. the number of nodes,
meaning close to <strong>2x</strong> space savings for graph storage from using
mixed data types. The provided preprocessing functionality automatically casts
the tensors in your dataset, such as the edge type information in the
heterogeneous case, into the smallest suitable data types for optimal space use
and performance. With these optimizations, you get <strong>3x</strong> space
savings for the heterogeneous ogb-lsc-mag240m graph compared to our previous
release.</p>
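<p>A back-of-the-envelope calculation with the numbers above illustrates the
claim (a sketch that ignores per-tensor overheads):</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code># CSC sizing for ogbn-papers100M: int32 per edge index, int64 per indptr entry.
num_nodes, num_edges = 111_000_000, 3_200_000_000
mixed = 4 * num_edges + 8 * (num_nodes + 1)      # ~13.7 GB
all_int64 = 8 * num_edges + 8 * (num_nodes + 1)  # ~26.5 GB
print(all_int64 / mixed)  # ~1.93, i.e. close to 2x savings
</code></pre>
</div>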
<h2 id="whats-more">What’s more</h2>
<p>Furthermore, dgl.graphbolt is compatible with PyTorch Geometric as well. In
the figure below, the notation in parentheses indicates where the graph and the
features are placed; “(cpu-cuda)” means that the graph is placed on the CPU
while the features are moved to the GPU. We compare our <a href="https://github.com/dmlc/dgl/blob/master/examples/sampling/graphbolt/pyg/node_classification_advanced.py">advanced PyG example</a>
against the <a href="https://github.com/pyg-team/pytorch_geometric/blob/master/examples/ogbn_products_sage.py">official PyG example</a>,
both using the PyG GraphSAGE model. We run the node classification task on the
ogbn-products dataset with a [15, 10, 5] fanout.</p>
<p><img src="/assets/images/posts/2024-03-06-release/pyg-graphsage.png" alt="diagram" width="800x" class="aligncenter" /></p>
<p>While providing an extremely optimized neighbor sampler implementation, we
also offer a new drop-in replacement, the Layer Neighbor Sampler, presented at
NeurIPS 2023. As the figure shows, we provide up to <strong>5.5x</strong>
speedup over PyG by combining GPU acceleration, pipeline parallelism, and
state-of-the-art algorithms. For more information on the new features in DGL
2.1, please refer to our <a href="https://github.com/dmlc/dgl/releases/tag/v2.1.0">release notes</a>.</p>
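<p>Switching samplers is a one-line change in the pipelines shown earlier (a
sketch assuming the same <code class="highlighter-rouge">dp</code> and
<code class="highlighter-rouge">graph</code>):</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code># Drop-in replacement for sample_neighbor.
dp = dp.sample_layer_neighbor(graph, fanouts=[15, 10, 5])
</code></pre>
</div>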
<h2 id="get-started-with-dgl-21">Get started with DGL 2.1</h2>
<p>You can easily install DGL 2.1 with dgl.graphbolt on any platform using <a href="https://www.dgl.ai/pages/start.html">pip or conda</a>.
Dive into our updated <a href="https://docs.dgl.ai/en/latest/stochastic_training/index.html">Stochastic Training of GNNs with GraphBolt tutorial</a>
and experiment with our <a href="https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/stochastic_training/node_classification.ipynb">node classification</a>
and <a href="https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/stochastic_training/link_prediction.ipynb">link prediction</a>
examples in Google Colab. No need to set up a local environment - just point and
click! We also updated the existing <a href="https://github.com/dmlc/dgl/tree/master/examples/sampling/graphbolt">7 comprehensive single-GPU examples</a>
and <a href="https://github.com/dmlc/dgl/tree/master/examples/multigpu/graphbolt">1 multi-GPU example</a>
with GPU Acceleration options. DGL 2.1 will be featured in the NVIDIA DGL
container 24.03 release which will be released before the end of March 2024.</p>
<p>We welcome your feedback and are available via <a href="https://github.com/dmlc/dgl/issues">Github issues</a> and <a href="https://discuss.dgl.ai/">Discuss posts</a>.
Join our <a href="http://slack.dgl.ai/">Slack channel</a> to stay updated and to connect with the community.</p></content><author><name>Muhammed Fatih Balin</name></author><category term="release" /><category term="release" /><summary type="html">We are happy to announce the release of DGL 2.1. In this release, we are making GNN data loading lightning fast. We introduce GPU acceleration for the whole GNN data loading pipeline in GraphBolt, including the graph sampling and feature fetching stages.</summary></entry><entry><title type="html">DGL 2.0: Streamlining Your GNN Data Pipeline from Bottleneck to Boost</title><link href="https://www.dgl.ai/release/2024/01/26/release.html" rel="alternate" type="text/html" title="DGL 2.0: Streamlining Your GNN Data Pipeline from Bottleneck to Boost" /><published>2024-01-26T00:00:00+00:00</published><updated>2024-01-26T00:00:00+00:00</updated><id>https://www.dgl.ai/release/2024/01/26/release</id><content type="html" xml:base="https://www.dgl.ai/release/2024/01/26/release.html"><p>We’re thrilled to announce the release of DGL 2.0, a major milestone in our
mission to empower developers with cutting-edge tools for Graph Neural Networks
(GNNs). Traditionally, data loading has been a significant bottleneck in GNN
training. Complex graph structures and the need for efficient sampling often
lead to slow data loading times and resource constraints. This can drastically
hinder the training speed and scalability of your GNN models. DGL 2.0 breaks
free from these limitations with the introduction of dgl.graphbolt, a
revolutionary data loading framework that supercharges your GNN training by
streamlining the data pipeline.</p>
<p><img src="/assets/images/posts/2024-01-26-release/diagram.png" alt="diagram" width="800x" class="aligncenter" /></p>
<p><center>High-Level Architecture of GraphBolt Data Pipeline</center></p>
<h2 id="flexible-data-pipeline--customizable-stages">Flexible data pipeline &amp; customizable stages</h2>
<p>One size doesn’t fit all, especially when it comes to the variety of graph
data and GNN tasks. For instance, link prediction requires negative sampling
while node classification does not, some features are too large to be stored in
memory, and occasionally we might combine multiple sampling operations to form
subgraphs. To offer adaptable operators while maintaining high performance,
dgl.graphbolt integrates seamlessly with the PyTorch datapipe, relying on the
unified “MiniBatch” data structure to connect processing stages. The core
stages are defined as:</p>
<ul>
<li><strong>Item Sampling</strong>: randomly selects a subset (nodes, edges, graphs) from the
entire training set as an initial mini-batch for downstream computation.</li>
<li><strong>Negative Sampling (for Link Prediction)</strong>: generates non-existing edges as
negative examples.</li>
<li><strong>Subgraph Sampling</strong>: generates subgraphs based on the input nodes/edges.</li>
<li><strong>Feature Fetching</strong>: fetches related node/edge features from the dataset for
the given input.</li>
<li><strong>Data Moving (for training on GPU)</strong>: moves the data to specified device for
training.</li>
</ul>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Seed edge sampler.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">gb</span><span class="o">.</span><span class="n">ItemSampler</span><span class="p">(</span><span class="n">train_edge_set</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c"># Negative sampling.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">dp</span><span class="o">.</span><span class="n">sample_uniform_negative</span><span class="p">(</span><span class="n">graph</span><span class="p">,</span> <span class="n">negative_ratio</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="c"># Neighbor sampling.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">dp</span><span class="o">.</span><span class="n">sample_neighbor</span><span class="p">(</span><span class="n">graph</span><span class="p">,</span> <span class="n">fanouts</span><span class="o">=</span><span class="p">[</span><span class="mi">15</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">])</span>
<span class="c"># Fetch features.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">dp</span><span class="o">.</span><span class="n">fetch_feature</span><span class="p">(</span><span class="n">features</span><span class="p">,</span> <span class="n">node_feature_keys</span><span class="o">=</span><span class="p">[</span><span class="s">"feat"</span><span class="p">])</span>
<span class="c"># Copy to GPU for training.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">dp</span><span class="o">.</span><span class="n">copy_to</span><span class="p">(</span><span class="n">device</span><span class="o">=</span><span class="s">"cuda:0"</span><span class="p">)</span>
</code></pre>
</div>
<p>The dgl.graphbolt allows you to plug in custom processing steps to build the
perfect data pipeline for your needs, for example:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Seed edge sampler.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">gb</span><span class="o">.</span><span class="n">ItemSampler</span><span class="p">(</span><span class="n">train_edge_set</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c"># Negative sampling.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">dp</span><span class="o">.</span><span class="n">sample_uniform_negative</span><span class="p">(</span><span class="n">graph</span><span class="p">,</span> <span class="n">negative_ratio</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="c"># Neighbor sampling.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">dp</span><span class="o">.</span><span class="n">sample_neighbor</span><span class="p">(</span><span class="n">graph</span><span class="p">,</span> <span class="n">fanouts</span><span class="o">=</span><span class="p">[</span><span class="mi">15</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">])</span>
<span class="c"># Exclude seed edges.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">dp</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">gb</span><span class="o">.</span><span class="n">exclude_seed_edges</span><span class="p">)</span>
<span class="c"># Fetch features.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">dp</span><span class="o">.</span><span class="n">fetch_feature</span><span class="p">(</span><span class="n">features</span><span class="p">,</span> <span class="n">node_feature_keys</span><span class="o">=</span><span class="p">[</span><span class="s">"feat"</span><span class="p">])</span>
<span class="c"># Copy to GPU for training.</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">dp</span><span class="o">.</span><span class="n">copy_to</span><span class="p">(</span><span class="n">device</span><span class="o">=</span><span class="s">"cuda:0"</span><span class="p">)</span>
</code></pre>
</div>
<p>The dgl.graphbolt empowers you to customize the stages in your data
pipelines. Implement custom stages using pre-defined APIs, such as loading
features from external storage or adding customized caching mechanisms (e.g.
<a href="https://github.com/dmlc/dgl/blob/0cb309a1b406d896311b5cfc2b5b1a1915f57c3b/python/dgl/graphbolt/impl/gpu_cached_feature.py#L11">GPUCachedFeature</a>),
and integrate them seamlessly without any modifications to your core training
code.</p>
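<p>As a hypothetical illustration (the function name and the
<code class="highlighter-rouge">"feat"</code> key are placeholders, and the
stage must come after <code class="highlighter-rouge">fetch_feature</code>),
any callable over the MiniBatch can be plugged in via
<code class="highlighter-rouge">transform</code>:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code># Row-normalize the fetched node features inside the pipeline.
def row_normalize(minibatch):
    feat = minibatch.node_features["feat"]
    minibatch.node_features["feat"] = feat / (feat.norm(dim=1, keepdim=True) + 1e-6)
    return minibatch

dp = dp.transform(row_normalize)
</code></pre>
</div>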
<h2 id="speed-enhancement--memory-efficiency">Speed enhancement &amp; memory efficiency</h2>
<p>The dgl.graphbolt doesn’t just give you flexibility; it also provides top
performance under the hood. It features a compact graph data structure for
efficient sampling, blazing-fast multi-threaded neighbor sampling and edge
exclusion operators, and a built-in option to store large feature tensors
outside your CPU’s main memory. Additionally, the dgl.graphbolt takes care of
scheduling across all hardware, minimizing wait times and maximizing efficiency.</p>
<p>The dgl.graphbolt brings impressive speed gains to your GNN training,
showcasing over 30% faster node classification in our benchmark and a
remarkable ~390% acceleration for link prediction in our benchmark that
involves edge exclusion.</p>
<table style="text-align: center;">
<tr>
<th>Epoch Time(s)</th>
<th>GraphSAGE</th>
<th>R-GCN</th>
</tr>
<tr>
<td>DGL Dataloader</td>
<td>22.5</td>
<td>73.6</td>
</tr>
<tr>
<td>dgl.graphbolt</td>
<td>17.2</td>
<td>64.6</td>
</tr>
<tr>
<td><strong>Speedup</strong></td>
<td><strong>1.31x</strong></td>
<td><strong>1.14x</strong></td>
</tr>
</table>
<p><center>Node classification speedup (NVIDIA T4 GPU). GraphSAGE is tested on OGBN-Products. R-GCN is tested on OGBN-MAG</center></p>
<table style="text-align: center;">
<tr>
<th>Epoch Time(s)</th>
<th>include seeds</th>
<th>exclude seeds</th>
</tr>
<tr>
<td>DGL Dataloader</td>
<td>37.75</td>
<td>135.32</td>
</tr>
<tr>
<td>dgl.graphbolt</td>
<td>15.51</td>
<td>27.62</td>
</tr>
<tr>
<td><strong>Speedup</strong></td>
<td><strong>2.43x</strong></td>
<td><strong>4.90x</strong></td>
</tr>
</table>
<p><center>Link prediction speedup (NVIDIA T4 GPU) on OGBN-Citation2</center></p>
<p>For memory-constrained training on enormous graphs like OGBN-MAG240m, the
dgl.graphbolt also proves its worth. While both utilize mmap-based
optimization, the dgl.graphbolt boasts a substantial speedup over the DGL
dataloader. The dgl.graphbolt’s well-defined component API streamlines the
process for contributors to refine out-of-core solutions for future
optimization, ensuring even the most massive graphs can be tackled with ease.</p>
<table style="text-align: center;">
<tr>
<th>Iteration time with different RAM size (s)</th>
<th>128GB RAM</th>
<th>256GB RAM</th>
<th>384GB RAM</th>
</tr>
<tr>
<td>Naïve DGL dataloader</td>
<td>OOM</td>
<td>OOM</td>
<td>OOM</td>
</tr>
<tr>
<td>Optimized DGL dataloader</td>
<td>65.42</td>
<td>3.86</td>
<td>0.30</td>
</tr>
<tr>
<td>dgl.graphbolt</td>
<td>60.99</td>
<td>3.21</td>
<td>0.23</td>
</tr>
</table>
<p><center>Node classification on OGBN-MAG240m under different RAM sizes. Optimized DGL dataloader baseline uses mmap to load features.</center></p>
<h2 id="whats-more">What’s more</h2>
<p>Furthermore, DGL 2.0 includes various new additions such as a
hetero-relational GCN example and several datasets. Improvements have been made
to the system, examples, and documentation, including an update to tcmalloc in
the CPU Docker image, support for sparse matrix slicing operators, and
enhancements to various examples. A set of <a href="https://docs.dgl.ai/api/python/nn-pytorch.html#utility-modules-for-graph-transformer">utilities</a> for building graph transformer models is released
along with this version, including NN modules such as positional encoders and
layers as building blocks, plus <a href="https://github.com/dmlc/dgl/tree/master/examples/core/Graphormer">examples</a> and <a href="https://docs.dgl.ai/en/latest/graphtransformer/index.html">tutorials</a> demonstrating their usage.
Additionally, numerous bug fixes have been implemented, resolving issues such
as the cusparseCreateCsr format for CUDA 12 and the lazy device copy problem
related to DGL node/edge features. For more information on the new additions
and changes in DGL 2.0, please refer to our <a href="https://github.com/dmlc/dgl/releases/tag/v2.0.0">release note</a>.</p>
<h2 id="get-started-with-dgl-20">Get started with DGL 2.0</h2>
<p>You can easily install DGL 2.0 with dgl.graphbolt on any platform using <a href="https://www.dgl.ai/pages/start.html">pip or conda</a>.
To jump right in, dive into our brand-new <a href="https://docs.dgl.ai/en/latest/stochastic_training/index.html">Stochastic Training of GNNs with GraphBolt tutorial</a>
and experiment with our <a href="https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/stochastic_training/node_classification.ipynb">node classification</a>
and <a href="https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/stochastic_training/link_prediction.ipynb">link prediction</a>
examples in Google Colab. No need to set up a local environment - just point and
click! This first release of DGL 2.0 with dgl.graphbolt packs a punch with
<a href="https://github.com/dmlc/dgl/tree/master/examples/sampling/graphbolt">7 comprehensive single-GPU examples</a>
and <a href="https://github.com/dmlc/dgl/tree/master/examples/multigpu/graphbolt">1 multi-GPU example</a>, covering a wide range of tasks.</p>
<p>We welcome your feedback and are available via <a href="https://github.com/dmlc/dgl/issues">Github issues</a> and <a href="https://discuss.dgl.ai/">Discuss posts</a>.
Join our <a href="http://slack.dgl.ai/">Slack channel</a> to stay updated and to connect with the community.</p></content><author><name>DGLTeam</name></author><category term="release" /><category term="release" /><summary type="html">We’re thrilled to announce the release of DGL 2.0, a major milestone in our mission to empower developers with cutting-edge tools for Graph Neural Networks (GNNs). Traditionally, data loading has been a significant bottleneck in GNN training. Complex graph structures and the need for efficient sampling often lead to slow data loading times and resource constraints. This can drastically hinder the training speed and scalability of your GNN models. DGL 2.0 breaks free from these limitations with the introduction of dgl.graphbolt, a revolutionary data loading framework that supercharges your GNN training by streamlining the data pipeline.</summary></entry><entry><title type="html">DGL 1.0: Empowering Graph Machine Learning for Everyone</title><link href="https://www.dgl.ai/release/2023/02/20/release.html" rel="alternate" type="text/html" title="DGL 1.0: Empowering Graph Machine Learning for Everyone" /><published>2023-02-20T00:00:00+00:00</published><updated>2023-02-20T00:00:00+00:00</updated><id>https://www.dgl.ai/release/2023/02/20/release</id><content type="html" xml:base="https://www.dgl.ai/release/2023/02/20/release.html"><p>We are thrilled to announce the arrival of DGL 1.0, a cutting-edge machine
learning framework for deep learning on graphs. Over the past three years,
there has been growing interest from both academia and industry in this
technology. Our framework has received requests from various scenarios, from
academic research on state-of-the-art models to industrial demands for scaling
Graph Neural Network (GNN) solutions to large, real-world problems. With DGL
1.0, we aim to provide a comprehensive and user-friendly solution for all users
to take advantage of graph machine learning.</p>
<p><img src="/assets/images/posts/2023-02-20-release/request.png" alt="request" width="800x" class="aligncenter" /></p>
<p><center>Different levels of user requests and what DGL 1.0 provides to fulfill them</center></p>
<p>DGL 1.0 adopts a layered and modular design to fulfill various user requests. The key features of DGL 1.0 include:</p>
<ul>
<li><a href="https://github.com/dmlc/dgl/tree/master/examples/pytorch">100+ examples</a> of
state-of-the-art GNN models, <a href="https://github.com/dmlc/dgl/tree/master/examples/pytorch/ogb">15+ top-ranked baselines</a> on
Open Graph Benchmark (OGB), available for learning and integration</li>
<li><a href="https://docs.dgl.ai/api/python/nn-pytorch.html">150+ GNN utilities</a>
including GNN layers, datasets, graph data transform modules, graph samplers,
etc. for building new model architectures or GNN-based solutions</li>
<li>Flexible and efficient message passing and sparse matrix abstraction for
developing new GNN building blocks</li>
<li>Multi-GPU and distributed training capability to scale to graphs of billions
of nodes and edges</li>
</ul>
<p>The new additions and updates in DGL 1.0 are depicted in the accompanying
figure. One of the highlights of this release is the introduction of
<strong>DGL-Sparse</strong>, a new specialized package for graph ML models defined in sparse
matrix notations. DGL-Sparse streamlines the programming process not just for
well-established GNNs such as Graph Convolutional Networks, but also for the
latest models, including diffusion-based GNNs, hypergraph neural networks, and
Graph Transformers. In the following article, we will provide an overview of
two popular paradigms for expressing GNNs, i.e., the message passing view and
the matrix view, which motivated the creation of DGL-Sparse. We will then show
you how to get started with this new and exciting feature.</p>
<p><img src="/assets/images/posts/2023-02-20-release/arch.png" alt="arch" width="800x" class="aligncenter" /></p>
<p><center>DGL 1.0 stack</center></p>
<h2 id="the-message-passing-view-vs-the-matrix-view">The Message Passing View v.s. The Matrix View</h2>
<p><em>“It’s the theory that the language you speak determines how you think and affects how you see everything.” — Louise Banks, from the film Arrival</em></p>
<p>Representing a Graph Neural Network can take two distinct forms. The first,
known as the message passing view, approaches GNN models from a <em>fine-grained,
local</em> perspective, detailing how messages are exchanged along edges and how
node states are updated accordingly. Alternatively, due to the algebraic
equivalence of a graph to a sparse adjacency matrix, many researchers opt to
express their GNN models from a coarse-grained, global perspective, emphasizing
the operations involving the sparse adjacency matrix and dense feature tensors.</p>
<p><img src="/assets/images/posts/2023-02-20-release/view.png" alt="view" width="800x" class="aligncenter" /></p>
<p>These local and global perspectives are sometimes interchangeable, but more
often, provide complementary insights into the fundamentals and limitations of
GNNs. For instance, the message passing view highlights the connection between
GNNs and the Weisfeiler Lehman (WL) graph isomorphism test, which also relies
on aggregating information from neighbors (as described in <a href="https://arxiv.org/abs/1810.00826">Xu et al., 2018</a>).
Meanwhile, the matrix view provides valuable understanding of the algebraic
properties of GNNs, leading to intriguing findings such as the <em>oversmoothing</em>
phenomenon (as discussed in <a href="https://arxiv.org/abs/1801.07606">Li et al., 2018</a>). In conclusion, both the message
passing view and matrix view are indispensable tools in studying and describing
GNNs, and this is precisely what motivates the key feature we will be
showcasing in DGL 1.0.</p>
<h2 id="dgl-sparse-sparse-matrix-abstraction-for-graph-ml">DGL Sparse: Sparse Matrix Abstraction for Graph ML</h2>
<p>In DGL 1.0, we are happy to announce the release of DGL Sparse, a new
sub-package (<code class="highlighter-rouge">dgl.sparse</code>) that complements the existing message passing interface
in DGL to support the entire spectrum of GNN models. DGL
Sparse provides sparse matrix classes and operations specialized for Graph ML,
making it easier to program GNNs described in the matrix view. In the following
section, we will demonstrate a few examples of GNNs, showcasing their
mathematical formulation and corresponding code implementation in DGL Sparse.</p>
<p>The <strong>Graph Convolutional Network</strong> (<a href="https://arxiv.org/abs/1609.02907">Kipf et al., 2017</a>) is one of the pioneering works
in GNN modeling. GCN can be expressed in both message passing view and matrix
view. The code below compares the two different perspectives and
implementations in DGL.</p>
<p><img src="/assets/images/posts/2023-02-20-release/gcn.png" alt="gcn" width="800x" class="aligncenter" /></p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">dgl.function</span> <span class="kn">as</span> <span class="nn">fn</span> <span class="c"># DGL message passing functions</span>
<span class="k">class</span> <span class="nc">GCNLayer</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span>
<span class="n">g</span><span class="o">.</span><span class="n">ndata</span><span class="p">[</span><span class="s">'X'</span><span class="p">]</span> <span class="o">=</span> <span class="n">X</span>
<span class="n">g</span><span class="o">.</span><span class="n">ndata</span><span class="p">[</span><span class="s">'deg'</span><span class="p">]</span> <span class="o">=</span> <span class="n">g</span><span class="o">.</span><span class="n">in_degrees</span><span class="p">()</span><span class="o">.</span><span class="nb">float</span><span class="p">()</span>
<span class="n">g</span><span class="o">.</span><span class="n">update_all</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">message</span><span class="p">,</span> <span class="n">fn</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="s">'m'</span><span class="p">,</span> <span class="s">'X_neigh'</span><span class="p">))</span>
<span class="n">X_neigh</span> <span class="o">=</span> <span class="n">g</span><span class="o">.</span><span class="n">ndata</span><span class="p">[</span><span class="s">'X_neigh'</span><span class="p">]</span>
<span class="k">return</span> <span class="n">F</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">W</span><span class="p">(</span><span class="n">X_neigh</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">message</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">edges</span><span class="p">):</span>
<span class="n">c_ij</span> <span class="o">=</span> <span class="p">(</span><span class="n">edges</span><span class="o">.</span><span class="n">src</span><span class="p">[</span><span class="s">'deg'</span><span class="p">]</span> <span class="o">*</span> <span class="n">edges</span><span class="o">.</span><span class="n">dst</span><span class="p">[</span><span class="s">'deg'</span><span class="p">])</span> <span class="o">**</span> <span class="o">-</span><span class="mf">0.5</span>
<span class="k">return</span> <span class="p">{</span><span class="s">'m'</span> <span class="p">:</span> <span class="n">edges</span><span class="o">.</span><span class="n">src</span><span class="p">[</span><span class="s">'X'</span><span class="p">]</span> <span class="o">*</span> <span class="n">c_ij</span><span class="p">}</span>
</code></pre>
</div>
<p><center>GCN in DGL's message passing API</center></p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">dgl.sparse</span> <span class="kn">as</span> <span class="nn">dglsp</span> <span class="c"># DGL 1.0 sparse matrix package</span>
<span class="k">class</span> <span class="nc">GCNLayer</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">A</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span>
<span class="n">D_invsqrt</span> <span class="o">=</span> <span class="n">dglsp</span><span class="o">.</span><span class="n">diag</span><span class="p">(</span><span class="n">A</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span> <span class="o">**</span> <span class="o">-</span><span class="mf">0.5</span>
<span class="n">A_norm</span> <span class="o">=</span> <span class="n">D_invsqrt</span> <span class="err">@</span> <span class="n">A</span> <span class="err">@</span> <span class="n">D_invsqrt</span>
<span class="k">return</span> <span class="n">F</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">W</span><span class="p">(</span><span class="n">A_norm</span> <span class="err">@</span> <span class="n">X</span><span class="p">))</span>
</code></pre>
</div>
<p><center>GCN in DGL Sparse</center></p>
<p><strong>Graph Diffusion-based GNNs.</strong> Graph diffusion is a process of propagating or
smoothing node features/signals along edges. Many classical graph algorithms
such as PageRank belong to this category. A line of research has shown that
combining graph diffusion with neural networks is an effective and efficient
way to enhance model predictions. The equation below describes the core
computation of one representative model — <em>Approximated Personalized Propagation
of Neural Prediction</em> (<a href="https://arxiv.org/abs/1810.05997">Gasteiger et al., 2018</a>), which can be implemented in DGL
Sparse straightforwardly.</p>
<p><img src="/assets/images/posts/2023-02-20-release/appnp.png" alt="appnp" width="300x" class="aligncenter" /></p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">class</span> <span class="nc">APPNP</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">A</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span>
<span class="n">Z_0</span> <span class="o">=</span> <span class="n">Z</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">f_theta</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">num_hops</span><span class="p">):</span>
<span class="n">A_drop</span> <span class="o">=</span> <span class="n">dglsp</span><span class="o">.</span><span class="n">val_like</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">A_dropout</span><span class="p">(</span><span class="n">A</span><span class="o">.</span><span class="n">val</span><span class="p">))</span>
<span class="n">Z</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">alpha</span><span class="p">)</span> <span class="o">*</span> <span class="n">A_drop</span> <span class="err">@</span> <span class="n">Z</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">alpha</span> <span class="o">*</span> <span class="n">Z_0</span>
<span class="k">return</span> <span class="n">Z</span>
</code></pre>
</div>
<p><strong>Hypergraph Neural Networks</strong>. A hypergraph is a generalization of a graph in
which an edge (called a hyperedge) can join any number of nodes. Hypergraphs
are particularly useful in scenarios that require capturing high-order
relations, such as co-purchase behaviors on e-commerce platforms or
co-authorship in citation networks. A hypergraph is typically
characterized by its sparse incidence matrix, and thus Hypergraph Neural
Networks (HGNN) are commonly defined in sparse matrix notations. The equation
and code implementation of Hypergraph Convolution, proposed by <a href="https://arxiv.org/abs/1809.09401">Feng et al.,
2018</a>, are presented below.</p>
<p><img src="/assets/images/posts/2023-02-20-release/hypergraph.png" alt="hypergraph" width="800x" class="aligncenter" /></p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">class</span> <span class="nc">HypergraphConv</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">H</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span>
<span class="n">d_V</span> <span class="o">=</span> <span class="n">H</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="c"># node degree</span>
<span class="n">d_E</span> <span class="o">=</span> <span class="n">H</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="c"># edge degree</span>
<span class="n">n_edges</span> <span class="o">=</span> <span class="n">d_E</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">D_V_invsqrt</span> <span class="o">=</span> <span class="n">dglsp</span><span class="o">.</span><span class="n">diag</span><span class="p">(</span><span class="n">d_V</span><span class="o">**-</span><span class="mf">0.5</span><span class="p">)</span> <span class="c"># D_V ** (-1/2)</span>
<span class="n">D_E_inv</span> <span class="o">=</span> <span class="n">dglsp</span><span class="o">.</span><span class="n">diag</span><span class="p">(</span><span class="n">d_E</span><span class="o">**-</span><span class="mi">1</span><span class="p">)</span> <span class="c"># D_E ** (-1)</span>
<span class="n">W</span> <span class="o">=</span> <span class="n">dglsp</span><span class="o">.</span><span class="n">identity</span><span class="p">((</span><span class="n">n_edges</span><span class="p">,</span> <span class="n">n_edges</span><span class="p">))</span>
<span class="n">L</span> <span class="o">=</span> <span class="n">D_V_invsqrt</span> <span class="err">@</span> <span class="n">H</span> <span class="err">@</span> <span class="n">W</span> <span class="err">@</span> <span class="n">D_E_inv</span> <span class="err">@</span> <span class="n">H</span><span class="o">.</span><span class="n">T</span> <span class="err">@</span> <span class="n">D_V_invsqrt</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">Theta</span><span class="p">(</span><span class="n">L</span> <span class="err">@</span> <span class="n">X</span><span class="p">)</span>
</code></pre>
</div>
<p><strong>Graph Transformers</strong>. The Transformer has proven to be an effective learning
architecture in natural language processing and computer vision. Researchers
have begun to extend the use of Transformers to graph learning as well. One of
the pioneering works (<a href="https://arxiv.org/abs/2012.09699">Dwivedi et al., 2020</a>) proposed constraining the all-pair
multi-head attention to the connected node pairs in a graph. With DGL Sparse,
implementing this new formulation is now a straightforward process, taking only
about 10 lines of code.</p>
<p><img src="/assets/images/posts/2023-02-20-release/gt.png" alt="gt" width="800x" class="aligncenter" /></p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">class</span> <span class="nc">GraphMHA</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">A</span><span class="p">,</span> <span class="n">h</span><span class="p">):</span>
<span class="n">N</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">h</span><span class="p">)</span>
<span class="n">q</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">q_proj</span><span class="p">(</span><span class="n">h</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">head_dim</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">num_heads</span><span class="p">)</span>
<span class="n">q</span> <span class="o">*=</span> <span class="bp">self</span><span class="o">.</span><span class="n">scaling</span>
<span class="n">k</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">k_proj</span><span class="p">(</span><span class="n">h</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">head_dim</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">num_heads</span><span class="p">)</span>
<span class="n">v</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">v_proj</span><span class="p">(</span><span class="n">h</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">head_dim</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">num_heads</span><span class="p">)</span>
<span class="n">attn</span> <span class="o">=</span> <span class="n">dglsp</span><span class="o">.</span><span class="n">bsddmm</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">q</span><span class="p">,</span> <span class="n">k</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span> <span class="c"># [N, N, nh]</span>
<span class="n">attn</span> <span class="o">=</span> <span class="n">attn</span><span class="o">.</span><span class="n">softmax</span><span class="p">()</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">dglsp</span><span class="o">.</span><span class="n">bspmm</span><span class="p">(</span><span class="n">attn</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">out_proj</span><span class="p">(</span><span class="n">out</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span>
</code></pre>
</div>
<h3 id="key-features-of-dgl-sparse">Key Features of DGL Sparse</h3>
<p>To handle diverse use cases in an efficient manner, DGL Sparse is designed with
two key features that set it apart from other sparse matrix libraries such as
<code class="highlighter-rouge">scipy.sparse</code> or <code class="highlighter-rouge">torch.sparse</code>:</p>
<ul>
<li><strong>Automatic Sparse Format Selection</strong>. DGL Sparse eliminates the complexity of
choosing the right data structure for storing a sparse matrix (also known as
the <em>sparse format</em>). Users can create a sparse matrix with a single call to
<code class="highlighter-rouge">dgl.sparse.spmatrix</code>, and DGL’s sparse matrix will internally select
the optimal format based on the intended operation.</li>
<li><strong>Scalar or Vector Non-zero Elements</strong>. GNN models often associate edges with
multi-channel weight vectors, such as multi-head attention vectors, as
demonstrated in the Graph Transformer example. To accommodate this, DGL Sparse
allows non-zero elements to have vector shapes and extends common sparse
operations, such as sparse-dense-matrix multiplication (SpMM), to operate on
this new form (as seen in the <code class="highlighter-rouge">bspmm</code> operation in the Graph Transformer
example).</li>
</ul>
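<p>To make these two features concrete, below is a minimal sketch (the edge list, shapes, and values are made up for illustration) that creates sparse matrices with scalar and vector non-zero elements and applies an SpMM:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code>import torch
import dgl.sparse as dglsp

# A 3x3 sparse matrix with non-zeros at (0, 1), (1, 2) and (2, 0);
# the storage format is selected internally based on the intended operation.
indices = torch.tensor([[0, 1, 2],
                        [1, 2, 0]])
A = dglsp.spmatrix(indices, shape=(3, 3))

X = torch.randn(3, 4)
Y = A @ X  # sparse-dense matrix multiplication (SpMM)

# Vector non-zero elements, e.g. a 2-head attention weight per edge.
val = torch.randn(indices.shape[1], 2)
A2 = dglsp.spmatrix(indices, val, shape=(3, 3))
</code></pre>
</div>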
<p>By utilizing these design features, DGL Sparse reduces code length by <strong>2.7 times</strong>
on average when compared to previous implementations of matrix-view models with
the message passing interface. The simplified code also results in <strong>43% less
overhead</strong> in the framework. Additionally, DGL Sparse is PyTorch compatible,
making it easy to integrate with the various tools and packages available
within the PyTorch ecosystem.</p>
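<p>Being PyTorch compatible also means that gradients flow through sparse operations just as they do for dense ones. A small self-contained sketch (again with made-up data):</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code>import torch
import dgl.sparse as dglsp

indices = torch.tensor([[0, 1, 2],
                        [1, 2, 0]])
A = dglsp.spmatrix(indices, shape=(3, 3))
X = torch.randn(3, 4, requires_grad=True)

(A @ X).sum().backward()  # autograd flows through the SpMM
print(X.grad.shape)       # torch.Size([3, 4])
</code></pre>
</div>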
<h2 id="get-started-with-dgl-10">Get started with DGL 1.0</h2>
<p>The framework is readily available on all platforms and can be easily installed
using <a href="https://www.dgl.ai/pages/start.html">pip or conda</a>. To get started with DGL Sparse, check out the new
<a href="https://docs.dgl.ai/en/latest/notebooks/sparse/quickstart.html">Quickstart tutorial</a> and play with it in <a href="https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/sparse/quickstart.ipynb">Google Colab</a> without having to set up a
local environment. In addition to the examples you’ve seen above, the first
release of DGL Sparse includes <a href="https://docs.dgl.ai/en/latest/notebooks/sparse/index.html">5 tutorials</a> and <a href="https://github.com/dmlc/dgl/tree/master/examples/sparse">11 end-to-end examples</a> to help
you learn and understand the different uses of this new package.</p>
<p>We welcome your feedback and are available via <a href="https://github.com/dmlc/dgl/issues">Github issues</a> and <a href="https://discuss.dgl.ai/">Discuss posts</a>.
Join our <a href="http://slack.dgl.ai/">Slack channel</a> to stay updated and to connect with the community.</p>
<p>For more information on the new additions and changes in DGL 1.0, please refer
to our <a href="https://github.com/dmlc/dgl/releases/tag/1.0.0">release note</a>.</p>
<p align="right"><em>(Banner image generated by Midjourney.)</em></p></content><author><name>DGLTeam</name></author><category term="release" /><category term="release" /><summary type="html">We are thrilled to announce the arrival of DGL 1.0, a cutting-edge machine learning framework for deep learning on graphs. Over the past three years, there has been growing interest from both academia and industry in this technology. Our framework has received requests from various scenarios, from academic research on state-of-the-art models to industrial demands for scaling Graph Neural Network (GNN) solutions to large, real-world problems. With DGL 1.0, we aim to provide a comprehensive and user-friendly solution for all users to take advantage of graph machine learning.</summary></entry><entry><title type="html">Improving Graph Neural Networks via Network-in-network Architecture</title><link href="https://www.dgl.ai/blog/2022/11/28/ngnn.html" rel="alternate" type="text/html" title="Improving Graph Neural Networks via Network-in-network Architecture" /><published>2022-11-28T00:00:00+00:00</published><updated>2022-11-28T00:00:00+00:00</updated><id>https://www.dgl.ai/blog/2022/11/28/ngnn</id><content type="html" xml:base="https://www.dgl.ai/blog/2022/11/28/ngnn.html"><p>As Graph Neural Networks (GNNs) has become increasingly popular, there is a
wide interest of designing deeper GNN architecture. However, deep GNNs suffer
from the <em>oversmoothing</em> issue where the learnt node representations quickly
become indistinguishable with more layers. This blog features a simple yet
effective technique to build a deep GNN without the concern of oversmoothing.
The new architecture, <strong>Network in Graph Neural Networks (NGNN)</strong> inspired by the
network-in-network architecture for computer vision, has shown superior
performance on multiple Open Graph Benchmark (OGB) leaderboards.</p>
<h2 id="introducing-ngnn">Introducing NGNN</h2>
<p>At a high level, a message passing graph neural network (MPGNN) layer can be written as a non-linear function:</p>
<script type="math/tex; mode=display">h^{(l+1)}=\sigma\left(f_w\left(\mathcal{G}, h^l\right)\right)</script>
<p>with <script type="math/tex">h^{(0)}=X</script> being the input node features, <script type="math/tex">\mathcal{G}</script> being the input
graph, <script type="math/tex">h^L</script> being the node embeddings in the last layer used by downstream
tasks, <script type="math/tex">L</script> being the number of GNN layers. Additionally, the function
<script type="math/tex">f_w\left(\mathcal{G}, h^l\right)</script> is determined by learnable parameters <script type="math/tex">w</script>
and <script type="math/tex">\sigma(\cdot)</script> is a non-linear activation function.</p>
<p>Instead of adding many more GNN layers, NGNN deepens a GNN model by inserting
nonlinear feedforward neural network layer(s) within each GNN layer.</p>
<p><img src="/assets/images/posts/2022-11-28-ngnn/NGNN.png" alt="ngnn" width="800x" class="aligncenter" /></p>
<p>In essence, NGNN is just a nonlinear transformation of the original embeddings
of the nodes in the <script type="math/tex">l</script>-th layer. Despite its simplicity, the NGNN technique is
quite powerful (we will come to that in a moment). Additionally, it does not
incur a large memory overhead and can work with various training methods such as
neighbor sampling or subgraph sampling.</p>
<p>The intuition behind it is straightforward. As the number of GNN layers and the
number of training iterations increase, the representations of nodes within
the same connected component tend to converge to the same value. NGNN applies
a simple MLP after certain GNN layers to counteract this so-called oversmoothing
issue.</p>
<h2 id="implementing-ngnn-in-deep-graph-library-dgl">Implementing NGNN in Deep Graph Library (DGL)</h2>
<p>To gain better insight into this technique, let us use DGL to implement a
simple NGNN, using the GCN layer as the backbone.</p>
<p>With DGL’s built-in GCN layer <code class="highlighter-rouge">dgl.nn.GraphConv</code>, we can easily implement a
minimal <code class="highlighter-rouge">NGNN_GCNConv</code> layer, which simply applies a <script type="math/tex">\mathrm{ReLU}</script> activation and a
linear transformation after a GCN layer.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dgl.nn</span> <span class="kn">import</span> <span class="n">GraphConv</span>
<span class="k">class</span> <span class="nc">NGNN_GCNConv</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_channels</span><span class="p">,</span> <span class="n">hidden_channels</span><span class="p">,</span> <span class="n">output_channels</span><span class="p">):</span>
<span class="nb">super</span><span class="p">(</span><span class="n">NGNN_GCNConv</span><span class="p">,</span> <span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="n">__init__</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">conv</span> <span class="o">=</span> <span class="n">GraphConv</span><span class="p">(</span><span class="n">input_channels</span><span class="p">,</span> <span class="n">hidden_channels</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">fc</span> <span class="o">=</span> <span class="n">Linear</span><span class="p">(</span><span class="n">hidden_channels</span><span class="p">,</span> <span class="n">output_channels</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">edge_weight</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">conv</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">edge_weight</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">fc</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span>
</code></pre>
</div>
<p>Afterwards, you can simply stack the <code class="highlighter-rouge">NGNN_GCNConv</code> layer with the plain
<code class="highlighter-rouge">dgl.nn.GraphConv</code> layer to form a multi-layer <code class="highlighter-rouge">NGNN_GCN</code> network.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">class</span> <span class="nc">NGNN_GCN</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_channels</span><span class="p">,</span> <span class="n">hidden_channels</span><span class="p">,</span> <span class="n">output_channels</span><span class="p">):</span>
<span class="nb">super</span><span class="p">(</span><span class="n">Model</span><span class="p">,</span> <span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="n">__init__</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">conv1</span> <span class="o">=</span> <span class="n">NGNN_GCNConv</span><span class="p">(</span><span class="n">input_channels</span><span class="p">,</span> <span class="n">hidden_channels</span><span class="p">,</span> <span class="n">hidden_channels</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">conv2</span> <span class="o">=</span> <span class="n">GraphConv</span><span class="p">(</span><span class="n">hidden_channels</span><span class="p">,</span> <span class="n">output_channels</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">input_channels</span><span class="p">):</span>
<span class="n">h</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">conv1</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">input_channels</span><span class="p">)</span>
<span class="n">h</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="n">h</span><span class="p">)</span>
<span class="n">h</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">conv2</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">h</span><span class="p">)</span>
<span class="k">return</span> <span class="n">h</span>
</code></pre>
</div>
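<p>As a quick sanity check, the classes above run end-to-end on a toy graph (a hypothetical usage sketch; the graph and feature sizes below are made up):</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code>import dgl
import torch

# A toy 4-node cycle graph with random 16-dimensional node features.
g = dgl.graph(([0, 1, 2, 3], [1, 2, 3, 0]))
x = torch.randn(4, 16)

model = NGNN_GCN(input_channels=16, hidden_channels=32, output_channels=7)
logits = model(g, x)  # shape: (4, 7)
</code></pre>
</div>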
<p>You can replace <code class="highlighter-rouge">dgl.nn.GraphConv</code> with any other graph convolution layers in
the NGNN architecture. DGL provides implementation of many popular
convolutional layers and utility modules. You can easily invoke them with one
line of code and build your own NGNN modules.</p>
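<p>For instance, here is a hypothetical sketch of the same pattern with <code class="highlighter-rouge">dgl.nn.SAGEConv</code> as the backbone (the <code class="highlighter-rouge">'mean'</code> aggregator is an arbitrary choice):</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code>import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import SAGEConv

class NGNN_SAGEConv(nn.Module):
    def __init__(self, input_channels, hidden_channels, output_channels):
        super(NGNN_SAGEConv, self).__init__()
        # Swap in GraphSAGE as the backbone; the NGNN MLP stays the same.
        self.conv = SAGEConv(input_channels, hidden_channels, 'mean')
        self.fc = nn.Linear(hidden_channels, output_channels)

    def forward(self, g, x):
        x = self.conv(g, x)
        x = F.relu(x)
        x = self.fc(x)
        return x
</code></pre>
</div>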
<h2 id="model-performance">Model Performance</h2>
<p>NGNN can be used for many downstream tasks, such as node
classification/regression, edge classification/regression, link prediction, and
graph classification. In general, NGNN achieves better results than its
backbone GNN on these tasks. For instance, <strong>NGNN+SEAL achieves top-1
performance on the
<a href="https://ogb.stanford.edu/docs/leader_linkprop/#ogbl-ppa">ogbl-ppa</a> leaderboard
with an improvement in Hits@100 of <script type="math/tex">10.91\%</script> over the vanilla SEAL</strong>. The table
below shows the performance improvement of NGNN over various vanilla GNN
backbones.</p>
<table style="text-align: center;">
<tr>
<th>Dataset</th>
<th>Metric</th>
<th>Model</th>
<th>Variant</th>
<th>Performance</th>
</tr>
<tr>
<td>ogbn-proteins</td>
<td>ROC-AUC(%)</td>
<td>GraphSAGE+Cluster Sampling</td>
<td>Vanilla</td>
<td>67.45 ± 1.21</td>
</tr>
<tr>