RemNote Community

Mathematical optimization - Algorithms and Solvers

Understand exact and iterative optimization algorithms, heuristic and metaheuristic methods, and key criteria for selecting appropriate solvers.

Summary

Computational Optimization Techniques

Introduction

Optimization is the process of finding the best solution to a problem—whether that means minimizing cost, maximizing profit, or achieving the best fit for data. A computational optimization technique is a systematic algorithm for solving such problems. The right technique depends on the problem's mathematical structure: Is it linear or nonlinear? Does it have constraints? Is an exact solution required, or is a good approximation acceptable? This guide covers the major classes of optimization algorithms and how to choose among them.

Exact Algorithms

Exact algorithms find the optimal solution in a finite number of steps. They work best for well-structured problems with special mathematical properties.

The Simplex Algorithm

The Simplex algorithm is the foundational method for solving linear programming (LP) problems—optimization problems where both the objective function and the constraints are linear. The key insight of Simplex is geometric: if an LP problem has an optimal solution, at least one occurs at a corner point (vertex) of the feasible region. The algorithm starts at one vertex and systematically moves to adjacent vertices, improving the objective value at each step, until no improvement is possible.

Why is this important? Linear programming appears everywhere: scheduling, resource allocation, blending problems, and production planning. The Simplex algorithm reliably solves these problems exactly in practical time.

Extensions for Specialized Problems

The Simplex algorithm has been extended to handle more complex problem types:

Quadratic programming extends Simplex to problems where the objective function is quadratic (contains squared terms) while the constraints remain linear. This is useful in portfolio optimization and least-squares fitting with constraints.

Linear-fractional programming handles problems where the objective is a ratio of linear functions, common in productivity and efficiency analysis.
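Problems of the quadratic-programming shape can also be handed to a general nonlinear routine. As a minimal sketch (the objective and constraint below are invented for illustration), SciPy's SLSQP solves a tiny QP; a dedicated QP solver would be faster, but the problem structure is the same:

```python
from scipy.optimize import minimize

# A small quadratic program: minimize x^2 + y^2 subject to x + y = 1.
# SLSQP is a general-purpose routine standing in for a dedicated QP solver.
objective = lambda v: v[0] ** 2 + v[1] ** 2
constraint = {"type": "eq", "fun": lambda v: v[0] + v[1] - 1.0}

res = minimize(objective, x0=[0.0, 0.0], method="SLSQP", constraints=[constraint])
print(res.x)  # the minimizer, (0.5, 0.5) by symmetry
```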
Network-flow Simplex is a specialized variant for transportation and network problems. By exploiting the network structure, this approach runs much faster than standard Simplex and is practical for very large problems.

Combinatorial Algorithms

For discrete optimization problems (where variables must be integers or chosen from a finite set), combinatorial algorithms are essential. These include:

Shortest-path algorithms (Dijkstra's, Bellman-Ford) for finding minimum-cost routes

Matching algorithms for pairing elements optimally, such as assigning workers to tasks

Minimum spanning tree algorithms for connecting networks with minimum total cost

These algorithms exploit the discrete structure of the problem and guarantee finding the true optimum.

Iterative Methods

Iterative algorithms approximate the optimal solution by starting from an initial point and repeatedly improving it. They are essential for nonlinear and large-scale problems where exact algorithms are impractical.

Gradient-Based Methods: The Foundation

Gradient descent (also called steepest descent) is the simplest iterative method. The core idea is intuitive: to minimize a function, move in the direction opposite to the gradient (the direction of steepest increase). At each iteration:

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$$

where $\nabla f(x_k)$ is the gradient and $\alpha_k$ is the step size.

Why it works: The negative gradient points downhill locally, so taking a step downhill makes progress toward the minimum.

Key limitation: Gradient descent converges slowly, especially when the problem is ill-conditioned. The algorithm can be sensitive to step-size selection and may get stuck in local minima for nonconvex problems.

Example: Minimizing $f(x, y) = x^2 + 4y^2$ from an initial point, gradient descent takes many small steps toward the origin, moving faster along the steeper $y$ direction.

Coordinate Descent

Coordinate descent optimizes one variable at a time while holding the others fixed.
This approach is particularly useful when single-variable subproblems are easy to solve. At each iteration, the algorithm selects a coordinate $i$ and updates:

$$x_i^{(k+1)} = \arg\min_{x_i} f(x_1^{(k)}, \ldots, x_{i-1}^{(k)}, x_i, x_{i+1}^{(k)}, \ldots, x_n^{(k)})$$

Advantage: Coordinate descent can be faster than gradient descent on certain problems and naturally accommodates parallel computing when updates can be done simultaneously.

Conjugate Gradient Methods

The conjugate gradient method is elegant for large, unconstrained optimization problems. Unlike gradient descent, it uses only gradient evaluations (no second derivatives) but achieves much faster convergence by keeping track of previous search directions.

Key property: In conjugate gradient, successive search directions are conjugate (orthogonal with respect to the Hessian matrix), ensuring that progress made along one direction is not undone by later steps. For quadratic problems, it converges in at most $n$ iterations (where $n$ is the problem dimension).

When to use it: Large-scale least-squares problems, solving systems of linear equations, and unconstrained nonlinear optimization where you have only gradient information.

Handling Non-Smooth Functions: Subgradient Methods

Most optimization methods require smooth, differentiable functions. Real problems sometimes involve non-differentiable convex functions—functions with sharp corners or kinks where the gradient doesn't exist. Subgradient methods generalize gradient descent to non-smooth functions by using a subgradient: a generalized gradient that exists even when the function isn't differentiable. The update rule looks similar to gradient descent:

$$x_{k+1} = x_k - \alpha_k g_k$$

where $g_k$ is any subgradient of $f$ at $x_k$.

Trade-off: Subgradient methods guarantee finding good solutions for convex problems, but convergence is slower than gradient methods on smooth functions.
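The descent iterations above can be sketched in a few lines of Python. This minimal example reuses $f(x, y) = x^2 + 4y^2$ from the gradient-descent discussion; the fixed step size is an illustrative choice, not from the text:

```python
# Plain gradient descent on f(x, y) = x^2 + 4y^2, whose gradient is (2x, 8y).
# The fixed step size alpha = 0.1 is a tuning choice for this example.

def gradient_descent(x, y, alpha=0.1, iterations=100):
    for _ in range(iterations):
        gx, gy = 2.0 * x, 8.0 * y              # gradient at (x, y)
        x, y = x - alpha * gx, y - alpha * gy  # step opposite the gradient
    return x, y

x, y = gradient_descent(2.0, 1.0)
print(x, y)  # both approach 0, the unique minimizer
```

In practice the step size would be chosen by a line search rather than fixed, which is one reason mature implementations are preferable to hand-rolled loops.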
Quasi-Newton Methods: Approximating the Hessian

Computing the full Hessian matrix (second derivatives) is expensive for high-dimensional problems. Quasi-Newton methods approximate the Hessian using only gradient information, achieving superlinear convergence without explicit second-derivative calculations. The most popular quasi-Newton method is BFGS, which maintains a running approximation of the Hessian that is updated at each iteration based on gradient changes. This approximation gradually becomes more accurate.

Advantage: Much faster than gradient descent and nearly as fast as methods using the true Hessian, but without the computational cost of computing second derivatives.

Conditional Gradient Method (Frank-Wolfe)

The Frank-Wolfe algorithm (conditional gradient method) is specialized for problems with linear constraints. Instead of moving arbitrarily, it solves a linear subproblem to generate a feasible search direction, then takes a step along that direction. This approach is elegant for structured problems (like matrix completion or sparse optimization) where the linear subproblem is efficient to solve, even if the full problem is difficult.

<extrainfo> Additional Gradient-Based Methods: The ellipsoid method provides polynomial-time guarantees for certain convex problems and has deep connections to complexity theory in combinatorial optimization, though it is rarely used in practice. Simultaneous perturbation stochastic approximation (SPSA) estimates gradients using random perturbations of the input, useful when you can only evaluate functions (not compute gradients) or in stochastic optimization settings where noise is inevitable. </extrainfo>

Methods Using Hessian Information

These methods use second-derivative information (the Hessian matrix) to achieve faster convergence.

Newton's Method

Newton's method is one of the most powerful optimization algorithms.
It uses the Hessian matrix to approximate the function as a quadratic and solves that quadratic problem exactly:

$$x_{k+1} = x_k - H_k^{-1} \nabla f(x_k)$$

where $H_k$ is the Hessian matrix at $x_k$.

Convergence: Newton's method exhibits quadratic convergence—the number of correct digits roughly doubles at each iteration near the solution. This is dramatically faster than gradient descent.

Trade-off: Computing and inverting the Hessian is expensive for high-dimensional problems. Newton's method also requires the function to be twice differentiable and works best near the optimum (it can diverge if started far away).

When to use it: Small to medium problems where Hessian computation is feasible and you need high accuracy quickly.

Sequential Quadratic Programming (SQP)

Sequential quadratic programming extends Newton's method to constrained optimization problems (minimizing $f(x)$ subject to constraints like $g(x) \leq 0$ or $h(x) = 0$). The core idea: at each iteration, solve a quadratic programming subproblem that locally approximates the original constrained problem. The solution of this subproblem gives the next iterate.

Advantage: Handles both equality and inequality constraints naturally, and often converges rapidly even from poor starting points.

Limitation: Best suited to problems of moderate size; it doesn't scale to many thousands of variables.

Interior-Point Methods

Interior-point methods are the modern standard for large-scale constrained optimization. Instead of moving along the boundary of the feasible region (like Simplex), they move through the interior, using a "barrier function" to prevent constraint violations. The key update involves solving linear systems with Hessian information, and the algorithm "pushes" toward the boundary as it converges, eventually finding the optimal boundary point.
Advantages: polynomial-time complexity guarantees (for certain problem classes), high parallelizability, and scaling to tens of thousands of variables.

Variants: Some interior-point methods require the full Hessian (second derivatives), while others use only gradient information, making them practical when Hessians are unavailable.

Derivative-Free and Heuristic Methods

When derivatives are unavailable, expensive to compute, or when guaranteed convergence to a global optimum is not required, other approaches become practical.

Finite-Difference Approximations

When analytical derivatives are unavailable, finite-difference approximations estimate gradients numerically by evaluating the function at nearby points:

$$\frac{\partial f}{\partial x_i} \approx \frac{f(x + \delta e_i) - f(x)}{\delta}$$

where $\delta$ is a small step and $e_i$ is the unit vector in direction $i$.

Trade-off: This requires $n$ additional function evaluations per gradient estimate (where $n$ is the dimension), making it expensive in high dimensions, but it works whenever you can evaluate the function.

Heuristic and Metaheuristic Algorithms

For highly nonconvex, discrete, or complex problems where traditional methods struggle, heuristic algorithms trade theoretical guarantees for practical effectiveness. These algorithms often find good (but not necessarily optimal) solutions in reasonable time.

Genetic/Evolutionary Algorithms maintain a population of candidate solutions and evolve them over generations using:

Selection: favoring better solutions to reproduce

Crossover: combining features from two parent solutions

Mutation: randomly modifying solutions to explore new regions

These algorithms excel at global optimization and work for discrete, combinatorial, and nonsmooth problems.

Particle Swarm Optimization simulates social behavior: a swarm of particles moves through the search space, each influenced by its own best position found so far and by the global best position found by any particle.
This balance between personal and collective learning often finds good solutions efficiently.

Differential Evolution is similar to genetic algorithms but uses vector differences between population members to generate new candidates. It is particularly effective for continuous optimization and problems with many local minima.

Simulated Annealing is inspired by the physical cooling process in metallurgy. The algorithm probabilistically accepts worse solutions early on (allowing "jumps" that escape local minima) and gradually reduces this acceptance probability, "annealing" toward a good solution.

The Nelder-Mead Simplex Heuristic (not related to the Simplex algorithm for LP) optimizes without gradients by maintaining a simplex (a polytope with $n+1$ vertices in $n$ dimensions) and repeatedly replacing the worst vertex with a better one through geometric operations: reflection, expansion, and contraction.

Tabu Search explicitly uses memory to avoid cycling. It maintains a list of recently visited solutions (the "tabu list") and forbids returning to them, encouraging exploration of new regions.

Trade-offs: Heuristic methods rarely guarantee convergence to the true optimum, but they often find good solutions quickly for problems where exact algorithms fail.

Optimization Solvers

What is an Optimization Solver?

An optimization solver is a software package or tool that automatically finds the minimum or maximum of a mathematical function subject to constraints. Rather than implementing an algorithm yourself, solvers provide robust, tested implementations tuned for performance.

Types of Solvers

Solvers are typically classified by the problem types they handle:

Linear programming (LP) solvers handle problems with linear objectives and linear constraints. These are highly optimized and can solve problems with millions of variables (e.g., CPLEX, Gurobi, GLOP).
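As a small sketch of calling such a solver, SciPy's `linprog` (which wraps the open-source HiGHS LP solver, among other methods) handles a tiny allocation problem; the coefficients below are invented for illustration:

```python
from scipy.optimize import linprog

# Maximize 3x + 2y subject to x + y <= 4 and x + 3y <= 6, with x, y >= 0.
# linprog minimizes, so the objective coefficients are negated.
c = [-3.0, -2.0]
A_ub = [[1.0, 1.0], [1.0, 3.0]]
b_ub = [4.0, 6.0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print(res.x, -res.fun)  # optimal vertex (4, 0) with objective value 12
```

Note that the answer lands on a vertex of the feasible region, exactly as the Simplex geometry predicts.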
Nonlinear programming (NLP) solvers handle general smooth nonlinear problems using methods like SQP or interior-point methods (e.g., IPOPT, SNOPT).

Mixed-integer programming (MIP) solvers handle problems with both continuous and integer variables, combining branch-and-bound with cutting planes (e.g., Gurobi, CPLEX).

Global optimization solvers are designed to find global optima of nonconvex problems, using techniques like branch-and-bound, interval arithmetic, or convex relaxations (e.g., BARON, SCIP).

Choosing the Right Solver

The best solver for your problem depends on several factors:

Problem size: Gradient-based methods scale to thousands or millions of variables; exact algorithms for combinatorial problems scale to hundreds at best.

Problem type: LP needs an LP solver, integer problems need MIP, smooth nonlinear needs NLP, and non-smooth problems need special handling.

Accuracy required: If a solution within 1% of optimal is acceptable, heuristics are fast; if you need the true optimum, exact algorithms are necessary.

Available resources: Large-scale interior-point methods require more memory and parallelization infrastructure than simple gradient descent.

Common Solver Features

Modern solvers provide several quality-of-life features:

Presolve preprocessing automatically removes redundant constraints, fixes variables at their bounds, and simplifies the problem before solving—often speeding up the solution by 10-100×.

Parallel computing distributes the solution process across multiple processors or GPUs, critical for large-scale problems.

Automatic differentiation computes exact derivatives directly from the problem definition, eliminating error-prone manual derivative coding and enabling gradient-based methods even when derivatives weren't explicitly provided.
<extrainfo> These features have made modern solvers dramatically more powerful and user-friendly than solvers from a decade ago, with automatic differentiation and parallel algorithms being particular game-changers for practitioners. </extrainfo>

Summary

Choosing an optimization technique is a balance between problem structure, scale, required accuracy, and computational resources. Start with exact methods if applicable (Simplex for LP, branch-and-bound for MIP). For large-scale smooth problems, use iterative methods—gradient-based methods if only first-order information is available, Newton-type methods if second derivatives are available. For non-smooth, highly nonconvex, or discrete problems where gradients aren't available, heuristic methods offer practical solutions. Always consider using a mature solver rather than implementing algorithms from scratch.
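As one concrete instance of that last piece of advice, a mature quasi-Newton implementation is a single function call. This sketch applies SciPy's BFGS to the standard Rosenbrock test function (a common benchmark, not a problem from this guide), supplying only the gradient and never second derivatives:

```python
from scipy.optimize import minimize, rosen, rosen_der

# BFGS builds its Hessian approximation from successive gradient evaluations;
# rosen_der supplies the analytic gradient of the Rosenbrock function.
res = minimize(rosen, x0=[0.0, 0.0], method="BFGS", jac=rosen_der)
print(res.x)  # converges to the known minimizer [1, 1]
```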
Flashcards
Which specific problem types are addressed by extensions of the Simplex algorithm?
Quadratic programming; linear‑fractional programming
For what size and type of problems is sequential quadratic programming typically used?
Constrained problems of moderate size
What scale of constrained problems are interior‑point methods designed to handle?
Large constrained problems
How does coordinate descent update variables during the optimization process?
It updates a single variable at a time while keeping others fixed
What is the primary advantage of conjugate gradient methods for solving large unconstrained problems?
They require only gradient evaluations (no second derivatives)
In which direction does the gradient descent algorithm move relative to the gradient?
Opposite to the gradient direction
What is the primary drawback of the gradient descent (steepest descent) method?
It converges slowly
What specific class of functions are subgradient methods designed to handle?
Non‑differentiable convex functions
What type of time-complexity guarantees does the ellipsoid method provide for certain convex problems?
Polynomial‑time guarantees
What is another common name for the conditional gradient method?
Frank–Wolfe algorithm
What type of convergence do quasi‑Newton methods achieve without computing second derivatives?
Superlinear convergence
How does the SPSA method estimate gradients for stochastic optimization?
By using random perturbations
When are finite‑difference approximations typically used to estimate gradients?
When analytic derivatives are unavailable
Which two mechanisms does differential evolution use to evolve a population of candidate solutions?
Mutation; crossover
Which three core processes are used by evolutionary algorithms to improve solutions over generations?
Selection; crossover; mutation
What three operations does the Nelder–Mead heuristic perform on a simplex of points to optimize without gradients?
Reflection; expansion; contraction
By which two positions is the movement of particles in particle swarm optimization influenced?
Personal best position; global best position
Why does simulated annealing probabilistically accept worse solutions?
To escape local minima
What mechanism does tabu search use to avoid cycling back to previously visited solutions?
It maintains a memory of recent moves
What is the general definition of an optimization solver?
A software tool that finds the minimum or maximum of a mathematical function subject to constraints
What are the four main classifications of optimization solvers?
Linear programming solvers; nonlinear programming solvers; mixed‑integer programming solvers; global optimization solvers
Which four criteria should be considered when selecting an optimization solver?
Problem size; problem type; required solution accuracy; available computational resources

Key Concepts
Optimization Algorithms
Simplex algorithm
Newton's method
Interior-point method
Gradient descent
Quasi‑Newton method
Heuristic and Metaheuristic Methods
Simulated annealing
Particle swarm optimization
Tabu search
Specialized Optimization Techniques
Mixed‑integer programming
Automatic differentiation