NPRG054 - High Performance Software Development

David Bednárek

Official webpage

Slides

(under eternal construction/translation)

Timeline of seminars and assignments

SeminarAssignment #4 levensteinAssignment #5 matrixAssignment #3 bsearch
1.3.2018
15.3.2018assignment
29.3.2018assignment
12.4.2018deadlineassignment
26.4.2018evaluationdeadline
10.5.2018evaluationdeadline
24.5.2018evaluation

Homework assignments


DÚ 4 - Levenstein

Single-threaded version of an assignment from NPRG042 Programming in Parallel Environment.

Levenstein distance of words A[0..M-1], B[0..N-1] is defined as:

	D(0,0) = 0;
	D(0,j) = j;
	D(i,0) = i;
	D(i+1,j+1) = min( D(i,j+1) + 1, D(i,j) + (A[i]==B[j] ? 0 : 1), D(i+1,j) + 1);
	D = D(M,N);

The word elements A[i], B[j] as well as the distances D[i,j] are 32-bit numbers (use int or unsigned in the implementation).

template< typename policy>
class levenstein {
public:
	typedef int data_element;

	template< typename I1, typename I2>
	levenstein(I1 i1b, I1 i1e, I2 i2b, I2 i2e);

	data_element compute();
};

struct policy_sse {};
struct policy_avx {};
struct policy_avx512 {};

The class constructor accepts two pairs of iterators; each pair determines one input word. The constructor shall copy the data to internal buffers of the class. The constructor shall also allocate all buffers used by the compute function. Time consumed by the constructor is not measured (but it is expected to be linear wrt. the sizes).

The compute function computes and returns the levenstein distance D(M,N). It uses the copy of the inputs and the buffers allocated by the constructor. It shall not cache the results; every invocation of compute shall start the computation from scratch.

The test wrapper displays the wall time divided by the M*N factor, i.e. the time per one matrix element, in nanoseconds. For better accuracy on smaller inputs, the test wrapper reruns the compute function several times. The range of input sizes and repetitions are chosen to generate total run time about 10 seconds for the reference solution. If your code (or your hardware) is slower, you may want to adjust the target_complexity variable in main.

Handling different vector instruction sets

The template argument policy is used to specify the desired vector platform. You shall define the following three policy classes: policy_sse, policy_avx, and policy_avx512. You may fill the policy classes with whatever contents or you may leave them empty and use explicit specialization of class levenstein. You shall make sure that the following instantiations work: levenstein<policy_sse>, levenstein<policy_avx>, and levenstein<policy_avx512>. However, you are not required to actually use the respective vector instructions; for instance, levenstein<policy_avx512> may be identical to levenstein<policy_sse>.

The test wrapper exp4main.cpp uses policy_sse by default, other policies may be selected using command-line argument "avx" or "avx512".

Compiler optionsCommand-line argument
MSVCGCC(none)avxavx512
Platform capabilityPolicy used
SSE 4.2(none)-msse4.2policy_sse(N.A.)(N.A.)
AVX2/arch:AVX2 -D USE_AVX-mavx2 -D USE_AVXpolicy_ssepolicy_avx(N.A.)
AVX512(N.A.)-mavx512f -D USE_AVX512policy_ssepolicy_avxpolicy_avx512

The program must be compiled with the switches corresponding to particular platform as defined by the table above. The parts of source code which use AVX512 intrinsics must be enclosed inside "#ifdef USE_AVX512" directives to avoid compilation errors on unsupporting platforms, similarly for USE_AVX. Don't try to run programs built in avx-enabled modes on non-avx platforms (even without the runtime argument) since avx-enabled compilers may emit avx-instructions from normal C++ code.


DÚ 5 - "Násobení" matic

Referenční algoritmus

uint16_t a[L][N], b[N][M], c[L][M];
	
for( i = 0; i < L; ++ i)
  for( j = 0; j < M; ++ j)
  {
    c[i][j] = 0xFFFF;
    for( k = 0; k < N; ++ k)
      c[i][j] = std::min( c[i][j], a[i][k] + b[k][j]);
  }

Matice

Datová struktura pro matici a operace násobení.

Předepsané rozhraní matice

Umožňuje připojení k testovacímu mechanismu.

template< typename policy>
class matrix {
public:
	matrix(size_t m, size_t n);
	size_t vsize() const;
	size_t hsize() const;
	bool get(size_t i, size_t j) const;
	void set(size_t i, size_t j, bool e);
	void assign_mul(const matrix & a, const matrix & b);
private: // ...
};

DÚ 3 - Binární vyhledávání

Vstupem je setříděná posloupnost idata o velikosti isize a nesetříděná posloupnost odata o velikosti osize. Úkolem je roztřídit odata do (isize+1) skupin, jejichž hraniční hodnoty jsou dány posloupností idata. V k-té skupině jsou prvky větší nebo rovné idata[k-1] a menší než idata[k].

	typedef std::int32_t data_element;
	
	template< typename policy>
	class bsearch_inner {
	public:
		bsearch_inner( const data_element * idata, std::size_t isize);
	};
	template< typename policy>
	class bsearch_outer { 
	public:
		bsearch_outer( const bsearch_inner< policy> & inner, std::size_t osize);

		void bucketize( const data_element * odata);
		
		typedef std::pair< const data_element *, std::size_t> bucket_rv;

		bucket_rv bucket( std::size_t k) const;
	};

DÚ 1 - Polymorfní datová struktura



Assignment 6 - Bit counting

Create a class named bitarray<P> which represents an array of bits and supports the following operations:

Array size is specified as an argument to constructor; allocation is done during construction. The set/reset operations shall have constant complexity, the others shall be linear wrt. array size. The counting operation is expected to be slower than the others (i.e. the count shall not be precomputed during other operations).

The template argument P is used to specify the vector platform available. You shall define the following three policy classes: policy_sse, policy_avx, and policy_avx512. You shall implement bitarray<policy_sse>, bitarray<policy_avx>, and bitarray<policy_avx512>.

The test wrapper du6main.cpp uses policy_sse by default, other policies may be selected using command-line argument "avx" or "avx512".

If you use AVX512 instruction, be aware that not every compiler supports them. Hide the parts which use AVX512 into "#ifdef USE_AVX512" directives to avoid error messages in non-supporting compilers and define the macro if your compiler supports it. du6main.cpp will not require policy_avx512 if this macro is not defined.

If you do not want to implement specific AVX512 implementation, define policy_avx512 equal to policy_avx.

The du6bitcount.hpp (as distributed) contains implementation based on int64_t blocks and 8-bit lookup for bitcounting (all policy classes are identical).

The empty.err file contains stderr output of the test wrapper. Your implementation shall produce the same contents.

The test wrapper outputs measured time (in nanoseconds per bit). Note that the count operation is measured together with an and operation (to detect tricks like precomputing the count). The reference implementation shows the following times (for 1M of bits):