The -2LL test is a test of fit of the model to the estimation data. It has nothing to do with holdout data.
If you want to test model validity using holdout choices, then you probably need to hold out many more than 1-2 choice sets (I usually recommend at least 8).
But to answer your question directly: when using holdouts, you should probably test the hit rates (assuming you ran HB analysis and have respondent-level utilities). In that case you look at your prediction of how each respondent would answer the holdout questions and compare it to the choices each respondent actually made. For each respondent you will then have correctly predicted 0, 1, or 2 of the holdout questions (or 0 or 1 if you have only one holdout). You can then test whether this number is higher for your New or Old model using a dependent (paired) t-test.
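
If it helps, here is a minimal Python sketch of that comparison. Everything in it is an assumption for illustration: the data are made up (5 respondents, 2 holdout tasks, each value being the index of the chosen concept), and the names `hits_per_respondent` and `paired_t` are hypothetical helpers, not part of any conjoint software. It only computes the paired t statistic and degrees of freedom; for a p-value you could pass the same two hit-count lists to something like `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, stdev


def hits_per_respondent(predicted, actual):
    """Count correctly predicted holdout choices for each respondent."""
    return [
        sum(p == a for p, a in zip(pred_row, act_row))
        for pred_row, act_row in zip(predicted, actual)
    ]


def paired_t(x, y):
    """Return the paired (dependent) t statistic and degrees of freedom for x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Made-up example data: one row per respondent, one column per holdout task.
actual = [[1, 2], [0, 1], [2, 2], [1, 0], [0, 0]]    # observed holdout choices
pred_new = [[1, 2], [0, 1], [2, 0], [1, 0], [0, 0]]  # predictions, New model
pred_old = [[1, 0], [0, 1], [0, 0], [1, 1], [2, 0]]  # predictions, Old model

hits_new = hits_per_respondent(pred_new, actual)
hits_old = hits_per_respondent(pred_old, actual)

# Overall hit rate = total hits / total holdout choices.
n_choices = sum(len(row) for row in actual)
print("hit rate, New:", sum(hits_new) / n_choices)
print("hit rate, Old:", sum(hits_old) / n_choices)

t_stat, df = paired_t(hits_new, hits_old)
print("paired t:", t_stat, "df:", df)
```

With only one or two holdouts per respondent the hit counts take very few distinct values, so the t-test is only a rough approximation, which is another reason to include more holdout tasks.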