From vision detected initial and user given final state, generate tasks