General usefulness of a screening calculation modification?

ldamewood · Post by **ldamewood** » Thu Feb 17, 2011 8:13 pm

Hello,

I may have a fairly unique situation: I have access to a set of 16 workstations (quad and dual core mixed, all running sci-linux and sharing a massive NFS), but our department firewall policies make it so running MPI won't work. For doing big screening calculations, I am forced to manually split up the workload across the 16 computers by doing subsets of q-point calculations and then using mrgscr afterwards. A problem that I've come across is that even if I split my q-points across multiple computers (16 points on computer A, 16 on B, 8 on C, etc), since these computers are in-use by the department for doing other workstation tasks (and I therefore need to "nice" my instances of abinit) some of the subsets of q-points finish very quickly and others take much longer to finish. Waiting for the slow computations to finish while other, faster, computers are waiting idle is a huge waste of time and resources.

My first solution was to stop slow-progressing calculations and resubmit on computers that seemed to be quicker. This works fine except that restarting the calculation takes a good amount of time (loading the KSS file at the beginning, in particular, takes a decent amount time) and then I need to use mrgscr to extract previous q-point slices of the screening matrix and it can be quite a hassle to keep track of the files. Oh! I also have the problem that some users at the workstations will push the power button if X crashes and they can't login! So, solution #1 works but it still wastes some time.

Instead, since these computers all shared a networked hard drive, my second solution was to create a mechanism for queuing q-points on this drive so each instance of abinit can pop q-points off the shared queue until all calculations are finished. It writes each q-point slice of the SCR file to a uniquely named file and after all calculations are finished, I am able to concatenate the files together into my final SCR file. The queue seems to be quite robust since I implemented a mutex algorithm to keep the different computers from accessing the queue at the same time. At the moment, this modification is rough around the edges (untested in Windows/Mac and it includes some quick hacks) but it works for my particular situation.

My big question is, would this solution be something that has enough general usefulness to be incorporated into abinit? Is this something that could be implemented for groups running large GW calculations on grid systems where many different computing centers are participating in the calculation (i.e. Teragrid)? Also, is there an alternate way of solving this problem that is already implemented and that I missed? If this seems very useful, then I can spend time polishing it up, testing it and pushing it into the main code; otherwise, I can just make a patch available of the modification as-is for people who need it.

Thanks for any comments.

mverstra · Post by **mverstra** » Wed Mar 02, 2011 10:37 pm

Hi,

interesting, but a little bit of a specific setup. You use scripts which launch abinit by pbs? I don't understand to what extent the mutex is integrated inside abinit, and to what extent you would be adding a scripting tool to abinit (perfectly valid addition - we can see how many people are interested).

cheers

Matthieu

ABINIT Discussion Forums

General usefulness of a screening calculation modification?

General usefulness of a screening calculation modification?

Re: General usefulness of a screening calculation modificati