John D Kececioglu

John D Kececioglu

Professor, Computer Science
Associate Professor, Applied Mathematics - GIDP
Associate Professor, Genetics - GIDP
Associate Professor, BIO5 Institute
Primary Department
Department Affiliations
Contact
(520) 621-4526

Work Summary

John Kececioglu's research is in applied algorithms, with an emphasis on bioinformatics and computational biology, including: multiple sequence alignment, inverse parametric alignment, sequence assembly, and genome rearrangement. Software developed by his group includes Opal, a tool for multiple sequence alignment, Facet, a tool for alignment accuracy estimation, InverseOpt, a library for inverse parametric optimization, Ipa, a tool for inverse sequence alignment, and AlignAlign, a tool for optimally aligning alignments.

Research Interest

John Kececioglu is an Associate Professor in the Department of Computer Science, and the BIO5 Institute. His research is in applied algorithms, especially for areas of bioinformatics and computational biology such as: multiple alignment, inverse alignment, sequence assembly, and genome rearrangement. Software developed by his group includes Opal, a tool for multiple sequence alignment; Facet, a tool for alignment accuracy estimation; Ipa, a tool for inverse sequence alignment; AlignAlign, a tool for optimally aligning alignments, and Ninja, a tool for evolutionary tree construction. John is a recipient of a National Science Foundation CAREER Award, serves as an Associate Editor for IEEE/ACM Transactions on Computational Biology and Bioinformatics, and is on the Editorial Board of Algorithms for Molecular Biology. He was Conference Chair for RECOMB 2009, and Program Committee Co-Chair in the area of Sequence Analysis for ISMB 2011 and BCB 2012. John served as Associate Head of the Department of Computer Science during 2012.

Publications

DeBlasio, D., & Kececioglu, J. D. (2014). Learning parameter sets for alignment advising. Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB 2014), 230-239.

While the multiple sequence alignment output by an alignerstrongly depends on the parameter values used for the alignmentscoring function(such as the choice of gap penalties and substitution scores),most users rely on the single default parameter settingprovided by the aligner.A different parameter setting, however,might yield a much higher-quality alignment for the specific set ofinput sequences.The problem of picking a good choice of parameter valuesfor specific input sequences is called \textit{parameter advising}.A parameter advisor has two ingredients:(i) a set of parameter choices to select from,and (ii) an estimator that provides an estimate of the accuracyof the alignment computed by the aligner using a parameter choice.The parameter advisor picks the parameter choice from the setwhose resulting alignment has highest estimated accuracy.We consider for the first timethe problem of learning the optimal set of parameterchoices for a parameter advisor that uses a given accuracy estimator.The optimal set is one that maximizes the expected true accuracy of theresulting parameter advisor, averaged over a collection of training data.While we prove that learning an optimal set for an advisor is NP-complete,we show there is a natural approximation algorithm for this problem,and prove a tight bound on its approximation ratio.Experiments with an implementation of this approximation algorithmon biological benchmarks,using various accuracy estimators from the literature,show it finds sets for advisors that are surprisingly close to optimal.Furthermore, the resulting parameter advisors are significantlymore accurate in practice than simply aligningwith a single default parameter choice.

Wheeler, T. J., & Kececioglu, J. D. (2007). Multiple alignment by aligning alignments. Bioinformatics, 23(13), i559-i568.

PMID: 17646343;Abstract:

Motivation: Multiple sequence alignment is a fundamental task in bioinformatics. Current tools typically form an initial alignment by merging subalignments, and then polish this alignment by repeated splitting and merging of subalignments to obtain an improved final alignment. In general this form-and-polish strategy consists of several stages, and a profusion of methods have been tried at every stage. We carefully investigate: (1) how to utilize a new algorithm for aligning alignments that optimally solves the common subproblem of merging subalignments, and (2) what is the best choice of method for each stage to obtain the highest quality alignment. Results: We study six stages in the form-and-polish strategy for multiple alignment: parameter choice, distance estimation, merge-tree construction, sequence-pair weighting, alignment merging, and polishing. For each stage, we consider novel approaches as well as standard ones. Interestingly, the greatest gains in alignment quality come from (i) estimating distances by a new approach using normalized alignment costs, and (ii) polishing by a new approach using 3-cuts. Experiments with a parameter-value oracle suggest large gains in quality may be possible through an input-dependent choice of alignment parameters, and we present a promising approach for building such an oracle. Combining the best approaches to each stage yields a new tool we call Opal that on benchmark alignments matches the quality of the top tools, without employing alignment consistency or hydrophobic gap penalties. © 2007 The Author(s).

Kim, E., & Kececioglu, J. (2008). Learning scoring schemes for sequence alignment from partial examples. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5(4), 546-556.

PMID: 18989042;Abstract:

When aligning biological sequences, the choice of scoring scheme is critical. Even small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to learn parameter values that are appropriate for biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct reference alignments, this is the problem of finding parameter values that make the scores of the reference alignments be as close as possible to those of optimal alignments of their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the reference alignment is not specified, and to an improved formulation based on minimizing the average error between the scores of the reference alignments and the scores of optimal alignments. Experiments on benchmark biological alignments show we can learn scoring schemes that generalize across protein families, and that boost the accuracy of multiple sequence alignment by as much as 25 percent. © 2008 IEEE.

Kececioglu, J., & Starrett, D. (2004). Aligning alignments exactly. Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB, 8, 85-96.

Abstract:

A basic computational problem that arises in both the construction and local-search phases of the best heuristics for multiple sequence alignment is that of aligning the columns of two multiple alignments. When the scoring function is the sum-of-pairs objective and induced pairwise alignments are evaluated using linear gap-costs, we call this problem Aligning Alignments. While seemingly a straightforward extension of two-sequence alignment, we prove it is actually NP-complete. As explained in the paper, this provides the first demonstration that minimizing linear gap-costs, in the context of multiple sequence alignment, is inherently hard. We also develop an exact algorithm for Aligning Alignments that is remarkably efficient in practice, both in time and space. Even though the problem is NP-complete, computational experiments on both biological and simulated data show we can compute optimal alignments for all benchmark instances in two standard datasets, and solve very-large random instances with highly-gapped sequences.

Christof, T., Jünger, M., Kececioglu, J., Mutzel, P., & Reinelt, G. (1997). A branch-and-cut approach to physical mapping of chromosomes by unique end-probes. Journal of Computational Biology, 4(4), 433-447.

PMID: 9385538;Abstract:

A fundamental problem in computational biology is the construction of physical maps of chromosomes from hybridization experiments between unique probes and clones of chromosome fragments in the presence of error. Alizadeh, Karp, Weisser and Zweig (Algorithmica 13:1/2, 52-76, 1995) first considered a maximum-likelihood model of the problem that is equivalent to finding an ordering of the probes that minimizes a weighted sum of errors and developed several effective heuristics. We show that by exploiting information about the end-probes of clones, this model can be formulated as a Weighted Betweenness Problem. This affords the significant advantage of allowing the well-developed tools of integer linear-programming and branch-and-cut algorithms to be brought to bear on physical mapping, enabling us for the first time to solve small mapping instances to optimality even in the presence of high error. We also show that by combining the optimal solution of many small overlapping Betweenness Problems, one can effectively screen errors from larger instances and solve the edited instance to optimality as a Hamming-Distance Traveling Salesman Problem. This suggests a new approach, a Betweenness-Traveling Salesman hybrid, for constructing physical maps.