Changelog¶
Note
Code is maintained at https://github.com/fengggli/gpu-computing-materials/.
Ask Feng for access.
Current¶
Date: 2019-11-25
Added¶
- FlexFlow paper (https://fengggli.github.io/ResearchDocs/topics/hybridparal/index.html#beyond-data-and-model-parallelism-for-deep-neural-networks) from Alex Aiken’s group (whose previous work includes Sequoia and Legion); it takes cluster architecture topology into account for GPU clusters.
- Channel and Filter parallelism (https://fengggli.github.io/ResearchDocs/journal/Fall19/Week13.html#channel-and-filter-parallelism)
Working on¶
- Reorganize the code’s threading model so that different layers can use different parallelism strategies.
- Performance evaluation of different model/data parallelism setups.
TODO List¶
- Use fine-grained locks to reduce contention.
- Theoretical model.
Previous¶
0.4.14¶
Date: 2019-11-04
Added¶
- Reviews of works on hybrid parallelism (https://fengggli.github.io/ResearchDocs/topics/hybridparal/index.html#hybrid-parallelism)
- One weird trick: data parallelism for convolution layers, model parallelism for dense layers, with a transformation in between (because conv and dense layers have different computation/communication requirements).
- How to decide the process layout for a given batch size and network architecture.
- Amazon NeoCPU (https://fengggli.github.io/ResearchDocs/journal/Fall19/Week9.html#neocpu)
- End-to-end optimization for CPU-based inference.
- PipeDream is part of the Microsoft Fiddle project (https://www.microsoft.com/en-us/research/project/fiddle/); Fiddle targets several problems:
- How to train efficiently on a single GPU
- How to train with multiple GPUs
- How to train on multi-tenant clusters
- Different types of optimizations (coarse-grained, fine-grained, layer-wise, end-to-end) are discussed at https://fengggli.github.io/ResearchDocs/journal/Fall19/Week10.html#coarse-grain-fine-grain-and-layer-wise
0.4.13¶
Date: 2019-10-22
Added¶
- Explained why AWNN is slower than Intel-Caffe on Stampede2 SKX nodes and on Gibson (also SKX, with AVX-512)
- Performance analysis results using Intel VTune: see https://github.com/fengggli/gpu-computing-materials/issues/57
- AWNN still has worse single-threaded performance; most of the elapsed time is spent in im2col and col2im, since they are not currently vectorized (see the sketch after this list).
- Intel-Caffe uses MKL-DNN JIT AVX code generation to accelerate operations like convolution and pooling.
- An SC18 paper describes some of the optimizations used in MKL-DNN, e.g. vectorization, cache/register blocking, loop reordering, kernel streaming, software prefetching, and layer fusion (https://dl.acm.org/citation.cfm?id=3291744).
- Followed several suggestions from the Intel performance guide and improved single-thread forward/backward time from 540 ms to 380 ms (https://github.com/fengggli/gpu-computing-materials/issues/57#issuecomment-540705655).
- We could adopt the optimizations implemented in MKL-DNN (e.g. vectorization of im2col/col2im), but such optimizations are not urgent.
- Some literature on pipeline parallelism (https://fengggli.github.io/ResearchDocs/topics/pipeline/pipeline.html#pipeline); it is a form of model parallelism.
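For context, a minimal unvectorized im2col sketch in the Caffe style is shown below (the signature, NCHW layout, and square-kernel assumption are illustrative, not the exact AWNN routine). The scattered, strided loads in the inner loop are what keeps the scalar version slow:

    /* Unroll one C x H x W image into a (C*K*K) x (OH*OW) column buffer so the
     * convolution becomes a single GEMM.  Illustrative sketch only. */
    void im2col_cpu(const float *data_im, int channels, int height, int width,
                    int kernel, int stride, int pad, float *data_col) {
      int out_h = (height + 2 * pad - kernel) / stride + 1;
      int out_w = (width + 2 * pad - kernel) / stride + 1;
      for (int c = 0; c < channels * kernel * kernel; ++c) {
        int kw = c % kernel;
        int kh = (c / kernel) % kernel;
        int im_c = c / (kernel * kernel);
        for (int oh = 0; oh < out_h; ++oh) {
          for (int ow = 0; ow < out_w; ++ow) {
            int ih = oh * stride - pad + kh;
            int iw = ow * stride - pad + kw;
            /* strided gather with a bounds check: hard to auto-vectorize */
            data_col[(c * out_h + oh) * out_w + ow] =
                (ih >= 0 && ih < height && iw >= 0 && iw < width)
                    ? data_im[(im_c * height + ih) * width + iw]
                    : 0.0f;
          }
        }
      }
    }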
0.4.12¶
Date: 2019-10-07
Added¶
- Performance comparison with Intel-Caffe on SKX and KNL nodes, and corresponding analysis.
- Intel-Caffe is 3.9x faster than AWNN on Stampede2 SKX (https://github.com/fengggli/gpu-computing-materials/issues/54#issuecomment-537741399), which is not consistent with the Sievert results.
- Now I am able to build Caffe using preloaded dependencies on Stampede2. Need to profile to understand the inconsistent performance on Stampede2.
- Also need to run the same set of experiments on Gibson.
0.4.11¶
:Date 2019-09-26
Added¶
- Added worker thread support; details are at https://github.com/fengggli/gpu-computing-materials/issues/54
- Reorganized the code structure so that:
- each type of layer is now associated with a “layer_setup” function, which infers the output tensor size and working-memory requirement from the layer below it (see the sketch after this list);
- all working memory and intermediate-layer output memory is preallocated during the “set_up” phase, instead of being allocated/freed during forward/backward;
- implementations of layers like fc/relu/pool are improved to reduce extra memory copies.
- 1.77x speedup using float32 (on Sievert).
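A minimal sketch of the setup-then-preallocate pattern described above (struct and function names here are illustrative, not the actual AWNN API):

    #include <stdlib.h>

    /* Illustrative "infer shapes once, preallocate once" pattern.  The real
     * AWNN structs and setup functions have different names and fields. */
    typedef struct {
      int n, c, h, w;        /* output shape inferred from the layer below */
      size_t workspace_size; /* scratch bytes needed by forward/backward */
    } layer_shape_t;

    /* setup: given the input shape, compute output shape + workspace need */
    layer_shape_t conv_layer_setup(layer_shape_t in, int out_channels,
                                   int kernel, int stride, int pad) {
      layer_shape_t out = in;
      out.c = out_channels;
      out.h = (in.h + 2 * pad - kernel) / stride + 1;
      out.w = (in.w + 2 * pad - kernel) / stride + 1;
      /* e.g. one im2col buffer: (C*K*K) x (OH*OW) floats */
      out.workspace_size =
          (size_t)in.c * kernel * kernel * out.h * out.w * sizeof(float);
      return out;
    }

    int main(void) {
      layer_shape_t in = {1, 3, 32, 32, 0};
      layer_shape_t out = conv_layer_setup(in, 16, 3, 1, 1);
      /* all buffers are allocated once in the setup phase, then reused by
       * every forward/backward call instead of malloc/free per iteration */
      float *workspace = malloc(out.workspace_size);
      float *output =
          malloc((size_t)out.n * out.c * out.h * out.w * sizeof(float));
      /* ... training iterations reuse workspace/output here ... */
      free(workspace);
      free(output);
      return 0;
    }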
0.4.10¶
:Date 2019-08-23
Added¶
- Model: extended ResNet with 3 stages:
- previous simple model: http://ethereon.github.io/netscope/#/gist/64b013d6fee840473edc1a9a444e22ca
- new 14-layer model: http://ethereon.github.io/netscope/#/gist/b14a68b31b3973c68b38dfc2f73d2d10
0.4.9¶
:Date 2019-06-27
Added¶
- Added downsampling at the beginning of stages 3, 4, and 5, ignoring the boundaries; more details at https://github.com/fengggli/gpu-computing-materials/issues/51.
- Residual blocks with downsampling support, and their tests.
- Added resnet14, made of 3 stages, each containing 2 residual blocks (see the sketch below).
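As a rough sanity check, the count below shows how the 14 weighted layers add up under the assumption of one stem convolution, two 3x3 convolutions per residual block, and a final fully-connected classifier (counting convention assumed for illustration; see the issue and netscope links above for the actual model):

    #include <stdio.h>

    /* Rough count of weighted layers for the 14-layer model: one stem conv,
     * three stages of two residual blocks (two 3x3 convs each), and a final
     * fully-connected classifier.  Counting convention assumed. */
    int main(void) {
      const int stages = 3, blocks_per_stage = 2, convs_per_block = 2;
      int weighted_layers =
          1 /* stem conv */ + stages * blocks_per_stage * convs_per_block +
          1 /* final fc */;
      printf("weighted layers: %d\n", weighted_layers); /* prints 14 */
      return 0;
    }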
0.4.8¶
:Date 2019-05-12
- Added NNPACK support; ResNet can use the NNPACK backend for the convolution operations (https://github.com/fengggli/gpu-computing-materials/pull/41).
- The initial convolution implementation was slow due to explicit transposes and memory copies (https://github.com/fengggli/gpu-computing-materials/pull/41#issuecomment-486513801); we did performance analysis and improvements for the convolution layer.
- Added per-image convolution like in Caffe (https://github.com/fengggli/gpu-computing-materials/pull/49); see the sketch after this list.
- There is also a comparison of AWNN vs. Caffe with (1) NNPACK or (2) per-image im2col + OpenBLAS GEMM at different batch sizes (https://github.com/fengggli/gpu-computing-materials/pull/49#issuecomment-490657411): our implementation is slightly faster than Caffe when using OpenBLAS GEMM; the NNPACK patch in Caffe doesn’t provide a backward implementation, though I could add one.
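A sketch of the per-image convolution forward pass (one im2col per image followed by a single OpenBLAS GEMM, reusing a preallocated column buffer). It assumes the im2col_cpu routine sketched under 0.4.13; names and the NCHW layout are illustrative rather than the exact AWNN code:

    #include <cblas.h>
    #include <stddef.h>

    /* Assumed to be the im2col routine sketched under 0.4.13. */
    void im2col_cpu(const float *data_im, int channels, int height, int width,
                    int kernel, int stride, int pad, float *data_col);

    /* Per-image convolution forward (Caffe-style):
     *   x: N x C x H x W   w: F x C x K x K   y: N x F x OH x OW
     *   col_buf: preallocated (C*K*K) x (OH*OW) scratch reused per image. */
    void conv_forward_per_img(const float *x, const float *w, float *y,
                              float *col_buf, int N, int C, int H, int W,
                              int F, int K, int stride, int pad) {
      int OH = (H + 2 * pad - K) / stride + 1;
      int OW = (W + 2 * pad - K) / stride + 1;
      int kdim = C * K * K, ncols = OH * OW;
      for (int n = 0; n < N; ++n) {
        im2col_cpu(x + (size_t)n * C * H * W, C, H, W, K, stride, pad, col_buf);
        /* y[n] (F x OH*OW) = w (F x C*K*K) * col_buf (C*K*K x OH*OW) */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    F, ncols, kdim, 1.0f, w, kdim, col_buf, ncols,
                    0.0f, y + (size_t)n * F * OH * OW, ncols);
      }
    }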
0.4.7¶
:Date 2019-04-22
- Simplified ResNet (https://github.com/fengggli/gpu-computing-materials/pull/38)
- Fixed memory leaks, plus some obvious optimizations.
- Initializer (Kaiming initialization); see the sketch below.
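A minimal sketch of Kaiming (He) initialization for ReLU networks, drawing weights from a zero-mean Gaussian with std = sqrt(2 / fan_in). The Box-Muller sampler based on rand() is only for illustration; the actual initializer in the repo may use a different RNG:

    #include <math.h>
    #include <stdlib.h>

    /* One standard-normal sample via Box-Muller (illustrative RNG only). */
    static float randn(void) {
      float u1 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 2.0f); /* (0,1) */
      float u2 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 2.0f);
      return sqrtf(-2.0f * logf(u1)) * cosf(6.2831853f * u2);
    }

    /* Kaiming/He init: std = sqrt(2 / fan_in).
     * For a conv filter, fan_in = in_channels * kernel_h * kernel_w;
     * for a fc layer, fan_in = input dimension. */
    void kaiming_init(float *w, size_t count, int fan_in) {
      float std = sqrtf(2.0f / (float)fan_in);
      for (size_t i = 0; i < count; ++i) w[i] = std * randn();
    }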
0.4.6¶
:Date 2019-04-15
Added¶
- Residual block and a simple ResNet. See https://github.com/fengggli/gpu-computing-materials/pull/37.
0.4.5¶
:Date 2019-04-10
Added¶
- Utils for debugging (tensor mean/std, etc.)
- Fixed several bugs.
- Utils to report statistics during training (loss, train/val accuracy).
- Results of the MLP are in https://github.com/fengggli/gpu-computing-materials/pull/27/
0.4.4¶
:Date 2019-04-08
Added¶
- CIFAR data loader:
- Use data/cifar10/get_cifar10.sh to download data.
- Preprocessing: normalized, with the channel mean subtracted.
- train/validation split
- Solver (main training loop):
- Feeds batches from the loader, runs forward/backward, and applies gradient updates (test/test_net_mlp_cifar); see the sketch after this list.
- Weight init
- Kaiming init and weight-scale based init.
- Extracted this part to utils/ since we use distributions from the STL.
- Doc
- Added the network memory allocation figure.
- CUDA
- Naive CUDA pooling layer; set USE_CUDA=on to enable.
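A sketch of the solver’s main loop described above (every function and type name here is a placeholder for illustration; the real loop lives in test/test_net_mlp_cifar):

    /* Placeholder types/functions, not the actual AWNN API. */
    typedef struct net net_t;       /* network with preallocated buffers */
    typedef struct loader loader_t; /* CIFAR loader with train/val split */

    extern int loader_next_batch(loader_t *l, float **x, int **labels, int bs);
    extern float net_forward_backward(net_t *net, const float *x,
                                      const int *labels); /* returns the loss */
    extern void sgd_update(net_t *net, float lr);
    extern float net_eval_accuracy(net_t *net, loader_t *val);

    void solver_train(net_t *net, loader_t *train, loader_t *val, int epochs,
                      int batch_size, float lr) {
      for (int e = 0; e < epochs; ++e) {
        float *x;
        int *labels;
        while (loader_next_batch(train, &x, &labels, batch_size)) {
          float loss = net_forward_backward(net, x, labels); /* fwd + bwd */
          sgd_update(net, lr);                               /* grad update */
          (void)loss; /* report loss / train accuracy periodically */
        }
        float val_acc = net_eval_accuracy(net, val); /* validation accuracy */
        (void)val_acc;
      }
    }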
0.4.3¶
:Date 2019-04-01
See (https://github.com/fengggli/gpu-computing-materials/pull/19)
Added¶
- a fc_relu sandwich layer
- weight initialization (currently only linspace is used)
- macro: tensor_for_each_entry in tensor.h
- net-mlp:
- inference-only forward - mlp_forward
- loss function to update the gradients: mlp_loss
- forward compared with numpy version
- backward checked against numerical gradients
- regularizer is added
Changed¶
- Changed the layer cache: each layer now has an lcache_t, which can be accessed as a stack using lcache_push and lcache_pop (see the sketch below). See docs/source/memory.rst for more details.
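A simplified sketch of the stack-like cache usage (the struct layout here is a guess for illustration; see docs/source/memory.rst for the real lcache_t):

    /* Simplified per-layer cache used as a LIFO stack: forward pushes the
     * tensors it will need again, backward pops them in reverse order.
     * Layout is illustrative, not the repo's actual lcache_t. */
    #define MAX_CACHED 8

    typedef struct {
      void *entries[MAX_CACHED]; /* cached tensors (e.g. inputs, masks) */
      int count;
    } lcache_t;

    static void lcache_push(lcache_t *c, void *t) { c->entries[c->count++] = t; }
    static void *lcache_pop(lcache_t *c) { return c->entries[--c->count]; }

    /* usage pattern:
     *   forward:  lcache_push(cache, x); lcache_push(cache, w);
     *   backward: w = lcache_pop(cache); x = lcache_pop(cache);   (LIFO) */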
Others¶
- clang-format using Google style
0.4.2¶
:Date 2019-03-30
Added¶
- Layers:
- fully-connected
- global avg pool.
- relu
- softmax
- Data structure
- param_t uses a Linux-kernel-style linked list, which can also be used to construct other basic data structures like stacks/queues (see the sketch after this list).
- Currently it’s used to manage all learnable params of the fc layers.
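A minimal sketch of the Linux-kernel-style intrusive list idea (the actual list macros and param_t fields in the repo may differ):

    #include <stddef.h>

    /* Intrusive list: the link node is embedded in the payload struct and
     * container_of recovers the payload from the node.  Sketch only. */
    struct list_head {
      struct list_head *prev, *next;
    };

    #define container_of(ptr, type, member) \
      ((type *)((char *)(ptr) - offsetof(type, member)))

    typedef struct {
      const char *name;      /* e.g. a hypothetical "fc1.weight" */
      float *data;
      struct list_head list; /* embedded link, chained off a net-wide head */
    } param_t;

    static void list_init(struct list_head *head) {
      head->prev = head->next = head;
    }

    static void list_add_tail(struct list_head *node, struct list_head *head) {
      node->prev = head->prev;
      node->next = head;
      head->prev->next = node;
      head->prev = node;
    }

    /* iterate all params:
     *   for (struct list_head *p = head.next; p != &head; p = p->next)
     *     param_t *param = container_of(p, param_t, list);              */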
< 0.4.1¶
See dl-docs for the changelog prior to 0.4.1.