.. _changelog:

=========
Changelog
=========

.. note::
   Code is maintained in https://github.com/fengggli/gpu-computing-materials/
   Ask Feng for access.

Current
=======

:Date: 2019-11-25

Added
-----

1. FlexFlow paper (https://fengggli.github.io/ResearchDocs/topics/hybridparal/index.html#beyond-data-and-model-parallelism-for-deep-neural-networks) from Alex Aiken's group (whose previous work includes Sequoia and Legion); architecture topology is taken into account for GPU clusters.
2. Channel and filter parallelism (https://fengggli.github.io/ResearchDocs/journal/Fall19/Week13.html#channel-and-filter-parallelism).

Working on
----------

1. Reorganize the code's threading model so that different layers can use different parallelism strategies.
2. Performance evaluation of different model/data parallelism setups.

TODO List
---------

* Use fine-grained locks to reduce contention.
* Theoretical model.

========
Previous
========

0.4.14
======

:Date: 2019-11-04

Added
-----

* Reviews of work on hybrid parallelism (https://fengggli.github.io/ResearchDocs/topics/hybridparal/index.html#hybrid-parallelism):

  - One weird trick: data parallelism for convolution layers, model parallelism for dense layers, with a transformation in between (because conv and dense layers have different computation/communication requirements).
  - How to decide the process layout for a given batch size and network architecture.

* Amazon NeoCPU (https://fengggli.github.io/ResearchDocs/journal/Fall19/Week9.html#neocpu):

  - End-to-end optimization for CPU-based inference.

* PipeDream is part of the Microsoft Fiddle project (https://www.microsoft.com/en-us/research/project/fiddle/); Fiddle targets several problems:

  - How to train efficiently on a single GPU.
  - How to train with multiple GPUs.
  - How to train on multi-tenant clusters.

* Different types of optimizations (coarse-grained, fine-grained, layer-wise, end-to-end) are discussed here: https://fengggli.github.io/ResearchDocs/journal/Fall19/Week10.html#coarse-grain-fine-grain-and-layer-wise

0.4.13
======

:Date: 2019-10-22

Added
-----

1. Explained why AWNN is slower than Intel-Caffe on Stampede2 SKX nodes and on Gibson (also SKX, with AVX-512):

   - Performance analysis results using Intel VTune: see https://github.com/fengggli/gpu-computing-materials/issues/57
   - AWNN still has worse single-threaded performance; most of the elapsed time is spent in im2col and col2im, since they are not currently vectorized (a sketch of this lowering follows this entry).
   - Intel-Caffe uses MKL-DNN JIT AVX code generation to accelerate operations like convolution/pooling.
   - An SC18 paper describes some of the optimizations used in MKL-DNN (e.g. vectorization, cache/register blocking, loop reordering, kernel streaming, software prefetching, layer fusion): https://dl.acm.org/citation.cfm?id=3291744

2. Followed several suggestions from the Intel performance guide and improved single-thread forward/backward time from 540 ms to 380 ms (https://github.com/fengggli/gpu-computing-materials/issues/57#issuecomment-540705655).
3. We can adopt the optimizations implemented in MKL-DNN (e.g. vectorization of im2col/col2im), but such optimizations are not urgent.
4. Some literature on pipeline parallelism (https://fengggli.github.io/ResearchDocs/topics/pipeline/pipeline.html#pipeline); it is a form of model parallelism.
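To make the bottleneck above concrete, im2col lowers a convolution input into a matrix so the convolution itself becomes a single GEMM. It is essentially the scalar loop nest below (a minimal sketch assuming NCHW layout and a square kernel; the function name and signature are illustrative and do not match the AWNN source). The branchy inner loops are the part the compiler fails to auto-vectorize, which is exactly where MKL-DNN's JIT code generation helps.

.. code-block:: c

   /* Minimal illustrative im2col for one image in NCHW layout.
    * Sketch of the scalar loop nest that is hard to auto-vectorize;
    * it is NOT the AWNN implementation. */
   void im2col_sketch(const float *img, int channels, int height, int width,
                      int kernel, int stride, int pad, float *col) {
     int out_h = (height + 2 * pad - kernel) / stride + 1;
     int out_w = (width + 2 * pad - kernel) / stride + 1;
     int idx = 0;
     for (int c = 0; c < channels; ++c)
       for (int kh = 0; kh < kernel; ++kh)
         for (int kw = 0; kw < kernel; ++kw)
           for (int oh = 0; oh < out_h; ++oh)
             for (int ow = 0; ow < out_w; ++ow) {
               int ih = oh * stride - pad + kh; /* input row, may fall in padding */
               int iw = ow * stride - pad + kw; /* input col, may fall in padding */
               col[idx++] = (ih >= 0 && ih < height && iw >= 0 && iw < width)
                                ? img[(c * height + ih) * width + iw]
                                : 0.0f;
             }
   }

The resulting ``col`` buffer has shape (channels*kernel*kernel) x (out_h*out_w); the convolution then reduces to one GEMM between the filter matrix and ``col``, so the GEMM itself is fast (OpenBLAS/MKL) while this lowering dominates the profile.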
0.4.12
======

:Date: 2019-10-07

Added
-----

* Performance comparison with Intel-Caffe on SKX and KNL nodes, with corresponding analysis:

  - Intel-Caffe is 3.9x faster than AWNN on Stampede2 SKX (https://github.com/fengggli/gpu-computing-materials/issues/54#issuecomment-537741399), which is not consistent with the Sievert results.
  - Caffe can now be built with the preloaded dependencies on Stampede2; profiling is needed to understand the inconsistent performance there.
  - The same set of experiments also needs to be run on Gibson.

0.4.11
======

:Date: 2019-09-26

Added
-----

1. Added worker-thread support; details: https://github.com/fengggli/gpu-computing-materials/issues/54
2. Reorganized the code structure, so that:

   * each type of layer is now associated with a "layer_setup" function, which infers the output tensor size and working memory from the layer below it;
   * all working memory and intermediate-layer output memory are preallocated during the set-up phase, instead of being allocated/freed during forward/backward;
   * implementations of layers like fc/relu/pool are improved to reduce extra memory copies;
   * 1.77x speedup, using float32 (on Sievert).

0.4.10
======

:Date: 2019-08-23

Added
-----

1. Model: extended resnet with 3 stages:

   * previous simple model: http://ethereon.github.io/netscope/#/gist/64b013d6fee840473edc1a9a444e22ca
   * new 14-layer model: http://ethereon.github.io/netscope/#/gist/b14a68b31b3973c68b38dfc2f73d2d10

0.4.9
=====

:Date: 2019-06-27

Added
-----

1. Added downsampling at the beginning of stages 3, 4, and 5, ignoring the boundaries; more details: https://github.com/fengggli/gpu-computing-materials/issues/51
2. Residual blocks with downsampling support, and their tests.
3. Added resnet14, made of 3 stages, each containing 2 residual blocks.

0.4.8
=====

:Date: 2019-05-12

* Added NNPACK support; resnet can use the NNPACK backend for its convolution operations (https://github.com/fengggli/gpu-computing-materials/pull/41).
* The initial convolution implementation was slow due to explicit transposes and memory copies (https://github.com/fengggli/gpu-computing-materials/pull/41#issuecomment-486513801); we did performance analysis and improvements for the convolution layer.
* Added per-image convolution as in Caffe (https://github.com/fengggli/gpu-computing-materials/pull/49).
* There is also a comparison of AWNN vs. Caffe using (1) NNPACK or (2) per-image im2col + OpenBLAS GEMM at different batch sizes (https://github.com/fengggli/gpu-computing-materials/pull/49#issuecomment-490657411): our implementation is slightly faster than Caffe when using OpenBLAS GEMM; the NNPACK patch in Caffe doesn't provide a backward implementation, though I can add one.

0.4.7
=====

:Date: 2019-04-22

* Simplified resnet (https://github.com/fengggli/gpu-computing-materials/pull/38).
* Fixed memory leaks and applied some obvious optimizations.
* Initializer (Kaiming initialization).

0.4.6
=====

:Date: 2019-04-15

Added
-----

* Residual block and simple resnet; see https://github.com/fengggli/gpu-computing-materials/pull/37.

0.4.5
=====

:Date: 2019-04-10

Added
-----

* Utilities for debugging (tensor mean/std, etc.).
* Fixed several bugs.
* Utilities to report statistics during training (loss, train/val accuracy).
* Results of the MLP are in https://github.com/fengggli/gpu-computing-materials/pull/27/

0.4.4
=====

:Date: 2019-04-08

Added
-----

1. CIFAR data loader:

   * use data/cifar10/get_cifar10.sh to download the data;
   * preprocessing: normalized, with the channel mean subtracted;
   * train/validation split.

2. Solver (main training loop):

   * feeds batches from the loader, runs forward/backward, and applies gradient updates (test/test_net_mlp_cifar).

3. Weight init:

   * Kaiming init and weight-scale based init (a minimal sketch follows this entry);
   * extracted this part to utils/ since we use the distributions from the STL.

4. Doc:

   * added the network memory-allocation figure.

5. CUDA:

   * naive CUDA pooling layer; set USE_CUDA=on to enable.
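For reference, the Kaiming initialization used above draws weights from a zero-mean normal with variance 2/fan_in. A minimal sketch (illustrative only; the AWNN code keeps this in utils/ and uses STL distributions, whereas this sketch uses a Box-Muller transform to stay self-contained, and the names here are hypothetical):

.. code-block:: c

   /* Illustrative Kaiming (He) initialization: W ~ N(0, 2 / fan_in).
    * Sketch only -- names do not match the AWNN utils. */
   #include <math.h>
   #include <stdlib.h>

   /* Standard-normal sample via the Box-Muller transform. */
   static float randn(void) {
     float u1 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 2.0f);
     float u2 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 2.0f);
     return sqrtf(-2.0f * logf(u1)) * cosf(6.2831853f * u2);
   }

   void kaiming_init_sketch(float *weights, int fan_in, int fan_out) {
     float stddev = sqrtf(2.0f / (float)fan_in); /* variance = 2 / fan_in */
     for (int i = 0; i < fan_in * fan_out; ++i) {
       weights[i] = stddev * randn();
     }
   }

The 2/fan_in variance compensates for ReLU zeroing half of the activations, which keeps the activation scale roughly constant across layers.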
0.4.3
=====

:Date: 2019-04-01

See https://github.com/fengggli/gpu-computing-materials/pull/19

Added
-----

* An fc_relu sandwich layer.
* Weight initialization (currently only linspace is used).
* Macro: tensor_for_each_entry in tensor.h.
* net-mlp:

  - inference-only forward: mlp_forward;
  - loss function to update the gradients: mlp_loss;
  - forward compared with the numpy version;
  - backward checked against numerical results;
  - regularizer is added.

Changed
-------

* Changed the layer cache: each layer now has an lcache_t, which can be accessed as a stack using lcache_push and lcache_pop. See docs/source/memory.rst for more details.

Others
------

* clang-format using the Google style.

0.4.2
=====

:Date: 2019-03-30

Added
-----

1. Layers:

   * fully-connected
   * global average pool
   * relu
   * softmax

2. Data structure:

   * param_t uses a linux-kernel-style linked list, which can also be used to construct other basic data structures such as stacks/queues (an illustrative sketch of this pattern is included at the end of this changelog).
   * Currently it is used to manage all learnable params of the fc layers.

< 0.4.1
=======

See dl-docs for the changelog prior to 0.4.1.
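As referenced in the 0.4.2 entry above, the linux-kernel-style list embeds the list node inside the payload struct, so one set of list primitives can back parameter lists, stacks, or queues. A minimal sketch (illustrative only; the struct and helper names below are hypothetical and do not necessarily match the AWNN definitions):

.. code-block:: c

   #include <stddef.h>
   #include <string.h>

   /* Doubly-linked circular list node, embedded ("intrusive") in the payload. */
   struct list_head {
     struct list_head *prev, *next;
   };

   /* A list head sentinel starts out pointing at itself. */
   static void init_list_head(struct list_head *h) { h->prev = h->next = h; }

   /* Insert entry at the tail, i.e. just before the head sentinel. */
   static void list_add_tail(struct list_head *entry, struct list_head *head) {
     entry->prev = head->prev;
     entry->next = head;
     head->prev->next = entry;
     head->prev = entry;
   }

   /* Recover the payload struct from a pointer to its embedded node. */
   #define list_entry(ptr, type, member) \
     ((type *)((char *)(ptr) - offsetof(type, member)))

   #define list_for_each(pos, head) \
     for ((pos) = (head)->next; (pos) != (head); (pos) = (pos)->next)

   /* Hypothetical learnable-parameter record carrying its own list linkage. */
   typedef struct {
     char name[32];
     float *data;
     float *grad;
     size_t nr_elem;
     struct list_head list; /* links this param into the net's param list */
   } param_sketch_t;

   /* Example: attach a param to a net's param list. */
   static void net_attach_param_sketch(struct list_head *all_params,
                                       param_sketch_t *p) {
     list_add_tail(&p->list, all_params);
   }

   /* Example: walk every learnable param attached to a net. */
   static void zero_all_grads(struct list_head *all_params) {
     struct list_head *pos;
     list_for_each(pos, all_params) {
       param_sketch_t *p = list_entry(pos, param_sketch_t, list);
       memset(p->grad, 0, p->nr_elem * sizeof(float));
     }
   }

Because the node is embedded in the payload rather than allocated separately, insertion and removal are O(1) with no per-element allocation, which is why the same machinery can double as a stack or queue.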