DUKKAR2  Huge Pascal triangle
Given P a prime number, and N an integer, Dukkar found a really fast way to compute how many numbers are divisible by P on the N^{th} row of the Pascal triangle.
Now the task will be much harder : it's for all the N first rows.
Moreover N will be a giant number, given in base P for convenience.
Input
The first line of input contains an integer
T, the number of test cases. Follow 2×T lines.
For each test case, on the first line your are given two integers P and k.
On the second line you are given k integers : the digits of N in base P.
N = a_{0}×P^{0} + ... + a_{i}×P^{i} + ... + a_{k1}×P^{k1}
Output
For each test case, you have to print the number of numbers in all the first N rows of the Pascal triangle that are divisible by P. As the answer could not fit in a 64bit container, give your answer modulo 1000000007.
Example
Input: 3 5 2 0 1 5 2 1 1 7 3 2 0 2
Output: 0 4 2689
Explanations
For the first case, N = 0×5^{0} + 1×5^{1} = 5. No numbers are divisible by 5 in the first 5 rows.
For the second case, N = 1×5^{0} + 1×5^{1} = 6. Only 4 numbers are divisible by 5 in the first 6 rows.
1 1 1 1 2 1 1 3 3 1 1 4 6 4 1 1 5 10 10 5 1
For the third case, N = 2×7^{0} + 0×7^{1} + 2×7^{2} = 100.
Constraints
0 < T < 300 0 < P < 10^9, a prime number 0 < k < 1000 0 <= a_{i} < P, a_{k1}>0
For your information, my 300Bpython3 code get AC in 3.03s with 11MB of memory print.
My Ccode get AC in 0.08s with 1.6MB of memory print.
Have fun ;)
Edit(25/I/2015) With Cube cluster my Ctime is 0.01s and my PY3.4time is 0.26s.
Edit(11/II/2017) With compiler changes my Ctime is 0.00s and my PY3.4time is 0.12s.
hide comments
luc4sdreyer:
20150925 21:08:07
I got it! I thought the input was most significant digit to least significant, where it's actually the other way around. The funny thing is the sample input gives the same output in both ordering styles! 

luc4sdreyer:
20150924 23:06:19
My solution is very fast and correct for all small inputs (verified by my Dukkar1 solution), and still I get W/A even with Java BigInteger to prevent overflow. Are there any weird edge cases or things to keep in mind for large inputs? I made many C++ and Java submissions, all incorrect. My C++ code's signature is a lot like the author's: 0.09s with 3.4M memory.

Added by:  Francky 
Date:  20140317 
Time limit:  0.300s 
Source limit:  50000B 
Memory limit:  1536MB 
Cluster:  Cube (Intel G860) 
Languages:  All except: ASM64 
Resource:  Own Problem 