Changelog¶
Note
Code is maintained at https://github.com/fengggli/gpu-computing-materials/.
Ask Feng for access.
Current¶
Date: 2019-11-25
Added¶
- FlexFlow paper (https://fengggli.github.io/ResearchDocs/topics/hybridparal/index.html#beyond-data-and-model-parallelism-for-deep-neural-networks) from Alex Aiken’s group (whose previous work includes Sequoia and Legion); it takes cluster architecture topology into account for GPU clusters.
- Channel and Filter parallelism (https://fengggli.github.io/ResearchDocs/journal/Fall19/Week13.html#channel-and-filter-parallelism)
Working on¶
- Reorganize the code’s threading model so that different layers can use different parallelism strategies.
- Performance evaluation of different model/data parallelism setups.
TODO List¶
- Use fine-grained locks to reduce contention.
- Theoretical model.
Previous¶
0.4.14¶
Date: 2019-11-04
Added¶
- Reviews of works on hybrid parallelism (https://fengggli.github.io/ResearchDocs/topics/hybridparal/index.html#hybrid-parallelism)
- One weird trick: data parallelism for convolution layers, model parallelism for dense layers, with a transformation in between (because conv and dense layers have different computation/communication requirements).
- How to decide the process layout for a given batch size and network architecture.
- Amazon NeoCPU (https://fengggli.github.io/ResearchDocs/journal/Fall19/Week9.html#neocpu)
- End-to-end optimization for CPU-based inference.
- PipeDream is part of the Microsoft Fiddle project (https://www.microsoft.com/en-us/research/project/fiddle/); Fiddle targets several problems:
- How to train efficiently on a single GPU
- How to train with multiple GPUs
- How to train on multi-tenant clusters
- Different types of optimizations (coarse-grained, fine-grained, layer-wise, end-to-end) are discussed at https://fengggli.github.io/ResearchDocs/journal/Fall19/Week10.html#coarse-grain-fine-grain-and-layer-wise
0.4.13¶
Date: 2019-10-22
Added¶
- Explained why AWNN is slower than Intel-Caffe on Stampede2 SKX nodes and on Gibson (also SKX, with AVX-512)
- Performance analysis results using Intel VTune: see https://github.com/fengggli/gpu-computing-materials/issues/57
- AWNN still has worse single-threaded performance; most of the elapsed time is spent in im2col and col2im, since they are not currently vectorized (see the sketch after this list).
- Intel-Caffe uses MKL-DNN JIT AVX code generation to accelerate operations like convolution and pooling.
- An SC18 paper describes some of the optimizations used in MKL-DNN, e.g. vectorization, cache/register blocking, loop reordering, kernel streaming, software prefetching, and layer fusion (https://dl.acm.org/citation.cfm?id=3291744).
- Followed several suggestions from the Intel performance guide and improved single-thread forward/backward time from 540 ms to 380 ms (https://github.com/fengggli/gpu-computing-materials/issues/57#issuecomment-540705655).
- We could adopt the optimizations implemented in MKL-DNN (e.g. vectorization of im2col/col2im), but such optimizations are not urgent.
- Some literature on pipeline parallelism (https://fengggli.github.io/ResearchDocs/topics/pipeline/pipeline.html#pipeline); it is a form of model parallelism.
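For context, a minimal unvectorized im2col sketch in the Caffe style is shown below (the signature, NCHW layout, and square-kernel assumption are illustrative, not the exact AWNN routine). The scattered, strided loads in the inner loop are what keeps the scalar version slow:

    /* Unroll one C x H x W image into a (C*K*K) x (OH*OW) column buffer so the
     * convolution becomes a single GEMM.  Illustrative sketch only. */
    void im2col_cpu(const float *data_im, int channels, int height, int width,
                    int kernel, int stride, int pad, float *data_col) {
      int out_h = (height + 2 * pad - kernel) / stride + 1;
      int out_w = (width + 2 * pad - kernel) / stride + 1;
      for (int c = 0; c < channels * kernel * kernel; ++c) {
        int kw = c % kernel;
        int kh = (c / kernel) % kernel;
        int im_c = c / (kernel * kernel);
        for (int oh = 0; oh < out_h; ++oh) {
          for (int ow = 0; ow < out_w; ++ow) {
            int ih = oh * stride - pad + kh;
            int iw = ow * stride - pad + kw;
            /* strided gather with a bounds check: hard to auto-vectorize */
            data_col[(c * out_h + oh) * out_w + ow] =
                (ih >= 0 && ih < height && iw >= 0 && iw < width)
                    ? data_im[(im_c * height + ih) * width + iw]
                    : 0.0f;
          }
        }
      }
    }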
0.4.12¶
Date: 2019-10-07
Added¶
- Performance comparison with Intel-Caffe on SKX and KNL nodes, and corresponding analysis.
- Intel-Caffe is 3.9x faster than AWNN on Stampede2 SKX (https://github.com/fengggli/gpu-computing-materials/issues/54#issuecomment-537741399), which is not consistent with the Sievert results.
- Now I am able to build Caffe using preloaded dependencies on Stampede2. Need to profile to understand the inconsistent performance on Stampede2.
- Also need to run the same set of experiments on Gibson.
0.4.11¶
:Date 2019-09-26
Added¶
- Added worker thread support; details are at https://github.com/fengggli/gpu-computing-materials/issues/54
- Reorganized the code structure so that:
- each type of layer is now associated with a “layer_setup” function, which infers the output tensor size and working-memory requirement from the layer below it (see the sketch after this list);
- all working memory and intermediate-layer output memory is preallocated during the “set_up” phase, instead of being allocated/freed during forward/backward;
- implementations of layers like fc/relu/pool are improved to reduce extra memory copies.
- 1.77x speedup using float32 (on Sievert).
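A minimal sketch of the setup-then-preallocate pattern described above (struct and function names here are illustrative, not the actual AWNN API):

    #include <stdlib.h>

    /* Illustrative "infer shapes once, preallocate once" pattern.  The real
     * AWNN structs and setup functions have different names and fields. */
    typedef struct {
      int n, c, h, w;        /* output shape inferred from the layer below */
      size_t workspace_size; /* scratch bytes needed by forward/backward */
    } layer_shape_t;

    /* setup: given the input shape, compute output shape + workspace need */
    layer_shape_t conv_layer_setup(layer_shape_t in, int out_channels,
                                   int kernel, int stride, int pad) {
      layer_shape_t out = in;
      out.c = out_channels;
      out.h = (in.h + 2 * pad - kernel) / stride + 1;
      out.w = (in.w + 2 * pad - kernel) / stride + 1;
      /* e.g. one im2col buffer: (C*K*K) x (OH*OW) floats */
      out.workspace_size =
          (size_t)in.c * kernel * kernel * out.h * out.w * sizeof(float);
      return out;
    }

    int main(void) {
      layer_shape_t in = {1, 3, 32, 32, 0};
      layer_shape_t out = conv_layer_setup(in, 16, 3, 1, 1);
      /* all buffers are allocated once in the setup phase, then reused by
       * every forward/backward call instead of malloc/free per iteration */
      float *workspace = malloc(out.workspace_size);
      float *output =
          malloc((size_t)out.n * out.c * out.h * out.w * sizeof(float));
      /* ... training iterations reuse workspace/output here ... */
      free(workspace);
      free(output);
      return 0;
    }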
0.4.10¶
:Date 2019-08-23
Added¶
- Model: extended ResNet with 3 stages:
- previous simple model: http://ethereon.github.io/netscope/#/gist/64b013d6fee840473edc1a9a444e22ca
- new 14-layer model: http://ethereon.github.io/netscope/#/gist/b14a68b31b3973c68b38dfc2f73d2d10
0.4.9¶
:Date 2019-06-27
Added¶
- Added downsampling at the beginning of stages 3, 4, and 5, ignoring the boundaries; more details at https://github.com/fengggli/gpu-computing-materials/issues/51.
- Residual blocks with downsampling support, and their tests.
- Added resnet14, made of 3 stages, each containing 2 residual blocks (see the sketch below).
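As a rough sanity check, the count below shows how the 14 weighted layers add up under the assumption of one stem convolution, two 3x3 convolutions per residual block, and a final fully-connected classifier (counting convention assumed for illustration; see the issue and netscope links above for the actual model):

    #include <stdio.h>

    /* Rough count of weighted layers for the 14-layer model: one stem conv,
     * three stages of two residual blocks (two 3x3 convs each), and a final
     * fully-connected classifier.  Counting convention assumed. */
    int main(void) {
      const int stages = 3, blocks_per_stage = 2, convs_per_block = 2;
      int weighted_layers =
          1 /* stem conv */ + stages * blocks_per_stage * convs_per_block +
          1 /* final fc */;
      printf("weighted layers: %d\n", weighted_layers); /* prints 14 */
      return 0;
    }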
0.4.8¶
:Date 2019-05-12
- Added NNPACK support; ResNet can use the NNPACK backend for the convolution operations (https://github.com/fengggli/gpu-computing-materials/pull/41).
- The initial convolution implementation was slow due to explicit transposes and memory copies (https://github.com/fengggli/gpu-computing-materials/pull/41#issuecomment-486513801); we did performance analysis and improvements for the convolution layer.
- Added per-image convolution like in Caffe (https://github.com/fengggli/gpu-computing-materials/pull/49); see the sketch after this list.
- There is also a comparison of AWNN vs. Caffe with (1) NNPACK or (2) per-image im2col + OpenBLAS GEMM at different batch sizes (https://github.com/fengggli/gpu-computing-materials/pull/49#issuecomment-490657411): our implementation is slightly faster than Caffe when using OpenBLAS GEMM; the NNPACK patch in Caffe doesn’t provide a backward implementation, though I could add one.
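A sketch of the per-image convolution forward pass (one im2col per image followed by a single OpenBLAS GEMM, reusing a preallocated column buffer). It assumes the im2col_cpu routine sketched under 0.4.13; names and the NCHW layout are illustrative rather than the exact AWNN code:

    #include <cblas.h>
    #include <stddef.h>

    /* Assumed to be the im2col routine sketched under 0.4.13. */
    void im2col_cpu(const float *data_im, int channels, int height, int width,
                    int kernel, int stride, int pad, float *data_col);

    /* Per-image convolution forward (Caffe-style):
     *   x: N x C x H x W   w: F x C x K x K   y: N x F x OH x OW
     *   col_buf: preallocated (C*K*K) x (OH*OW) scratch reused per image. */
    void conv_forward_per_img(const float *x, const float *w, float *y,
                              float *col_buf, int N, int C, int H, int W,
                              int F, int K, int stride, int pad) {
      int OH = (H + 2 * pad - K) / stride + 1;
      int OW = (W + 2 * pad - K) / stride + 1;
      int kdim = C * K * K, ncols = OH * OW;
      for (int n = 0; n < N; ++n) {
        im2col_cpu(x + (size_t)n * C * H * W, C, H, W, K, stride, pad, col_buf);
        /* y[n] (F x OH*OW) = w (F x C*K*K) * col_buf (C*K*K x OH*OW) */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    F, ncols, kdim, 1.0f, w, kdim, col_buf, ncols,
                    0.0f, y + (size_t)n * F * OH * OW, ncols);
      }
    }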
0.4.7¶
:Date 2019-04-22
- Simplified ResNet (https://github.com/fengggli/gpu-computing-materials/pull/38)
- Fixed memory leaks, plus some obvious optimizations.
- Initializer (Kaiming initialization); see the sketch below.
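A minimal sketch of Kaiming (He) initialization for ReLU networks, drawing weights from a zero-mean Gaussian with std = sqrt(2 / fan_in). The Box-Muller sampler based on rand() is only for illustration; the actual initializer in the repo may use a different RNG:

    #include <math.h>
    #include <stdlib.h>

    /* One standard-normal sample via Box-Muller (illustrative RNG only). */
    static float randn(void) {
      float u1 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 2.0f); /* (0,1) */
      float u2 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 2.0f);
      return sqrtf(-2.0f * logf(u1)) * cosf(6.2831853f * u2);
    }

    /* Kaiming/He init: std = sqrt(2 / fan_in).
     * For a conv filter, fan_in = in_channels * kernel_h * kernel_w;
     * for a fc layer, fan_in = input dimension. */
    void kaiming_init(float *w, size_t count, int fan_in) {
      float std = sqrtf(2.0f / (float)fan_in);
      for (size_t i = 0; i < count; ++i) w[i] = std * randn();
    }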
0.4.6¶
:Date 2019-04-15
Added¶
- Residual block and a simple ResNet. See https://github.com/fengggli/gpu-computing-materials/pull/37.
0.4.5¶
:Date 2019-04-10
Added¶
- Utils for debugging (tensor mean/std, etc.)
- Fixed several bugs.
- Utils to report statistics during training (loss, train/val accuracy).
- Results of the MLP are in https://github.com/fengggli/gpu-computing-materials/pull/27/
0.4.4¶
:Date 2019-04-08
Added¶
- CIFAR data loader:
- Use data/cifar10/get_cifar10.sh to download data.
- Preprocessing: normalized, with the channel mean subtracted.
- train/validation split
- Solver (main training loop):
- Feeds batches from the loader, runs forward/backward, and applies gradient updates (test/test_net_mlp_cifar); see the sketch after this list.
- Weight init
- Kaiming init and weight-scale based init.
- Extracted this part to utils/ since we use distributions from the STL.
- Doc
- Added the network memory allocation figure.
- CUDA
- Naive CUDA pooling layer; set USE_CUDA=on to enable.
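A sketch of the solver’s main loop described above (every function and type name here is a placeholder for illustration; the real loop lives in test/test_net_mlp_cifar):

    /* Placeholder types/functions, not the actual AWNN API. */
    typedef struct net net_t;       /* network with preallocated buffers */
    typedef struct loader loader_t; /* CIFAR loader with train/val split */

    extern int loader_next_batch(loader_t *l, float **x, int **labels, int bs);
    extern float net_forward_backward(net_t *net, const float *x,
                                      const int *labels); /* returns the loss */
    extern void sgd_update(net_t *net, float lr);
    extern float net_eval_accuracy(net_t *net, loader_t *val);

    void solver_train(net_t *net, loader_t *train, loader_t *val, int epochs,
                      int batch_size, float lr) {
      for (int e = 0; e < epochs; ++e) {
        float *x;
        int *labels;
        while (loader_next_batch(train, &x, &labels, batch_size)) {
          float loss = net_forward_backward(net, x, labels); /* fwd + bwd */
          sgd_update(net, lr);                               /* grad update */
          (void)loss; /* report loss / train accuracy periodically */
        }
        float val_acc = net_eval_accuracy(net, val); /* validation accuracy */
        (void)val_acc;
      }
    }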
0.4.3¶
:Date 2019-04-01
See (https://github.com/fengggli/gpu-computing-materials/pull/19)
Added¶
- a fc_relu sandwich layer
- weight initialization (currently only linspace is used)
- macro: tensor_for_each_entry in tensor.h
- net-mlp:
- inference-only forward - mlp_forward
- loss function to update the gradients: mlp_loss
- forward compared with numpy version
- backward checked against numerical gradients
- regularizer is added
Changed¶
- Changed the layer cache: each layer now has an lcache_t, which can be accessed as a stack using lcache_push and lcache_pop (see the sketch below). See docs/source/memory.rst for more details.
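A simplified sketch of the stack-like cache usage (the struct layout here is a guess for illustration; see docs/source/memory.rst for the real lcache_t):

    /* Simplified per-layer cache used as a LIFO stack: forward pushes the
     * tensors it will need again, backward pops them in reverse order.
     * Layout is illustrative, not the repo's actual lcache_t. */
    #define MAX_CACHED 8

    typedef struct {
      void *entries[MAX_CACHED]; /* cached tensors (e.g. inputs, masks) */
      int count;
    } lcache_t;

    static void lcache_push(lcache_t *c, void *t) { c->entries[c->count++] = t; }
    static void *lcache_pop(lcache_t *c) { return c->entries[--c->count]; }

    /* usage pattern:
     *   forward:  lcache_push(cache, x); lcache_push(cache, w);
     *   backward: w = lcache_pop(cache); x = lcache_pop(cache);   (LIFO) */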
Others¶
- clang-format using Google style
0.4.2¶
:Date 2019-03-30
Added¶
- Layers:
- fully-connected
- global avg pool.
- relu
- softmax
- Data structure
- param_t uses a Linux-kernel-style linked list, which can also be used to construct other basic data structures like stacks/queues (see the sketch after this list).
- Currently it’s used to manage all learnable params of the fc layers.
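A minimal sketch of the Linux-kernel-style intrusive list idea (the actual list macros and param_t fields in the repo may differ):

    #include <stddef.h>

    /* Intrusive list: the link node is embedded in the payload struct and
     * container_of recovers the payload from the node.  Sketch only. */
    struct list_head {
      struct list_head *prev, *next;
    };

    #define container_of(ptr, type, member) \
      ((type *)((char *)(ptr) - offsetof(type, member)))

    typedef struct {
      const char *name;      /* e.g. a hypothetical "fc1.weight" */
      float *data;
      struct list_head list; /* embedded link, chained off a net-wide head */
    } param_t;

    static void list_init(struct list_head *head) {
      head->prev = head->next = head;
    }

    static void list_add_tail(struct list_head *node, struct list_head *head) {
      node->prev = head->prev;
      node->next = head;
      head->prev->next = node;
      head->prev = node;
    }

    /* iterate all params:
     *   for (struct list_head *p = head.next; p != &head; p = p->next)
     *     param_t *param = container_of(p, param_t, list);              */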
< 0.4.1¶
See dl-docs for the changelog prior to 0.4.1.