ALGORITHM DESIGN FOR DATA SCIENCE, NUMERICAL & OPTIMIZATION
— Mathesia (@Mathesia_) October 4, 2018
Design, simulation and analysis of fixed-point systems
Fixed-Point Designer provides data types and tools for designing fixed-point and single-precision algorithms for optimal performance on embedded hardware. It analyzes the design and suggests data types and attributes such as word length and scaling. You can fine-tune data attributes such as rounding mode and overflow action, and mix single-precision and fixed-point data. Bit-true simulations let you observe the effect of limited range and precision without implementing the design on hardware.
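To make word length, scaling, rounding mode, and overflow handling concrete, here is a minimal Python sketch of casting a real value to a fixed-point type. It is not the Fixed-Point Designer API: the function `quantize` and all of its parameter names are hypothetical and chosen only for illustration.

```python
import math

def quantize(x, word_length=16, frac_length=12, signed=True,
             rounding="nearest", overflow="saturate"):
    """Return the value x takes in a fixed-point type (hypothetical helper).

    word_length: total number of bits; frac_length: bits after the binary point.
    """
    scale = 1 << frac_length            # 2**frac_length steps per unit
    scaled = x * scale

    # Rounding mode: how the scaled value is mapped to a stored integer.
    if rounding == "nearest":
        stored = math.floor(scaled + 0.5)   # ties round toward +infinity
    elif rounding == "floor":
        stored = math.floor(scaled)
    elif rounding == "zero":
        stored = math.trunc(scaled)
    else:
        raise ValueError("unknown rounding mode: " + rounding)

    # Range of stored integers the word can hold.
    if signed:
        lo, hi = -(1 << (word_length - 1)), (1 << (word_length - 1)) - 1
    else:
        lo, hi = 0, (1 << word_length) - 1

    # Overflow handling: clip to the range, or wrap around modulo 2**word_length.
    if overflow == "saturate":
        stored = max(lo, min(hi, stored))
    elif overflow == "wrap":
        stored = (stored - lo) % (1 << word_length) + lo
    else:
        raise ValueError("unknown overflow mode: " + overflow)

    return stored / scale

# An 8-bit signed type with 4 fractional bits covers [-8, 7.9375] in steps of 0.0625.
print(quantize(0.1, 8, 4))                      # 0.125 (precision loss)
print(quantize(10.0, 8, 4))                     # 7.9375 (saturated)
print(quantize(10.0, 8, 4, overflow="wrap"))    # -6.0 (wrapped)
```

The three calls show the two failure modes a bit-true simulation is meant to expose: loss of precision for small values and overflow for large ones.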
Fixed-Point Designer lets you convert double-precision algorithms to single-precision or fixed-point algorithms. You can create and optimize data types that meet your numerical accuracy requirements and target hardware constraints. Mathematical analysis or instrumented simulation can be used to determine the range requirements of the design. Fixed-Point Designer provides apps and tools that guide you through the data conversion process and let you compare fixed-point results with floating-point baselines.
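The conversion workflow described above can be imitated by hand in a few lines. The sketch below is an assumption-laden illustration, not the tool's actual procedure: it runs a small first-order filter in floating point, records the largest magnitude seen, picks a fraction length for a 16-bit signed word from that range, and then reruns the filter with every intermediate value cast to fixed point. The helper names (`frac_length_for_range`, `to_fixed`, `leaky_sum`) and the filter itself are hypothetical.

```python
import math

def frac_length_for_range(max_abs, word_length=16):
    """Largest fraction length whose signed range still covers max_abs."""
    # Integer bits needed for the magnitude; the sign bit is counted separately.
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1) if max_abs > 0 else 0
    return word_length - 1 - int_bits

def to_fixed(x, frac_length, word_length=16):
    """Round to nearest and saturate to a signed fixed-point type."""
    scale = 1 << frac_length
    lo, hi = -(1 << (word_length - 1)), (1 << (word_length - 1)) - 1
    stored = max(lo, min(hi, math.floor(x * scale + 0.5)))
    return stored / scale

def leaky_sum(samples, cast=lambda v: v):
    """y[n] = 0.9 * y[n-1] + x[n], with cast applied after every operation."""
    a = cast(0.9)
    y, out = 0.0, []
    for x in samples:
        y = cast(cast(a * y) + cast(x))
        out.append(y)
    return out

samples = [math.sin(0.2 * n) for n in range(200)]

# Step 1: floating-point baseline, logging the range the signal actually uses.
baseline = leaky_sum(samples)
max_abs = max(abs(v) for v in baseline)

# Step 2: choose scaling from the observed range.
fl = frac_length_for_range(max_abs, 16)

# Step 3: rerun with fixed-point casts and compare against the baseline.
fixed = leaky_sum(samples, cast=lambda v: to_fixed(v, fl, 16))
max_err = max(abs(b - f) for b, f in zip(baseline, fixed))

print("observed max magnitude:", max_abs)
print("chosen fraction length:", fl)
print("max abs error vs. floating point:", max_err)
```

A range taken from one simulation only covers the inputs that were simulated, which is why the text pairs it with mathematical analysis as an alternative.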
As a data scientist who has worked in industry for several years, I am often contacted on LinkedIn and Quora by students or people looking to change careers who want help with career choices or guidance related to machine learning. Some questions revolve around the choice of educational pathways and programs, but many focus on which algorithms or models are common today in the field of data science.
With so many algorithms to choose from, it's hard to know where to start. Courses may include algorithms that are not typical of those used in industry today, and they may leave out methods that are not currently popular but are particularly useful. Software-based programs can omit important statistical concepts, and math-based programs can skip some of the key topics in algorithm design.
I have compiled a short guide for aspiring data professionals, focusing on statistical models and machine learning models (supervised and unsupervised learning); these topics are covered in textbooks, graduate-level statistics courses, data science bootcamps, and other training resources. (Some of them are included in the reference section of the article.) Since machine learning is a branch of statistics, machine learning algorithms technically fall under statistical knowledge, as do data mining and more computer-science-based methods. However, since some algorithms overlap with the content of computer science courses, and because many people separate traditional statistical methods from newer methods, I will separate the two branches in the list.
Statistical methods include some of the more common methods covered in bootcamps and certificate programs, as well as some less common methods that are usually taught only in graduate statistics programs (but can have significant advantages in practice). All of the suggested tools are ones I use frequently:
1) Generalized linear models, which form the basis of most supervised machine learning methods (including logistic regression and Tweedie regression, which covers most of the count or continuous outcomes encountered in industry…)
2) Time series methods (ARIMA, SSA, machine-learning-based methods)
3) Structural equation modeling (to model and test mediated pathways)
4) Factor analysis (exploratory and confirmatory, for survey design and validation)
5) Power analysis/trial design (especially simulation-based trial design, to avoid overpowering an analysis)
6) Nonparametric testing (deriving tests from scratch, especially through simulation)/MCMC
7) K-means clustering
8) Bayesian methods (Naïve Bayes, Bayesian model averaging, Bayesian adaptive trials…)
9) Penalized regression models (elastic net, LASSO, LARS…) and, more generally, adding penalties to models (SVM, XGBoost…), which is useful for datasets in which predictors outnumber observations (common in genomics and social science research)
10) Spline-based models (MARS…) for flexible modeling of processes
11) Markov chains and stochastic processes (an alternative approach to time series modeling and predictive modeling)
12) Missing data imputation schemes and their assumptions (missForest, MICE…)
13) Survival analysis (very helpful in manufacturing modeling and consumption processes)
14) Hybrid modeling
15) Statistical inference and group testing (A/B testing and the more complex designs implemented in many trading activities)
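As one illustration of the simulation-based testing mentioned in the list (nonparametric tests derived from scratch, and A/B-style group comparisons), the following is a minimal sketch of a two-sample permutation test in Python. The function name `permutation_test` and the sample data are made up for this example; it is a sketch under those assumptions rather than a recommended implementation.

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Under the null hypothesis the group labels are exchangeable, so we
    reshuffle the pooled data and count how often a relabeled difference
    is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_perm + 1)

# Hypothetical measurements for two variants of a page (made-up numbers).
variant_a = [12.1, 11.8, 13.0, 12.6, 12.9, 11.5, 12.4, 13.2]
variant_b = [13.4, 13.9, 12.8, 14.1, 13.6, 14.4, 13.1, 13.8]

print("estimated p-value:", permutation_test(variant_a, variant_b))
```

No distributional assumption is needed beyond exchangeability of the labels, which is what makes this kind of test easy to derive for unusual statistics: replace the difference in means with any statistic of interest.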
Machine learning extends many of these frameworks, especially K-means clustering and generalized linear modeling. Some techniques that are useful across many industries (along with some more obscure algorithms that are surprisingly useful but rarely taught in bootcamps, certificate programs, or schools) include:
1) Regression/classification trees (an early extension of generalized linear models, offering high accuracy, good interpretability, and low computational cost)
2) Dimensionality reduction (PCA and manifold learning methods such as MDS and t-SNE)
3) Classic feedforward neural networks
4) Bagging ensembles (the basis of algorithms such as random forest and KNN regression ensembles)
7) Boosting ensembles (the basis of the gradient boosting and XGBoost algorithms)
8) Optimization algorithms for parameter tuning or design projects (genetic algorithms, quantum-inspired evolutionary algorithms, simulated annealing, particle swarm optimization)
9) Topological data analysis tools, which are especially well suited to unsupervised learning on small sample sizes (persistent homology, Morse-Smale clustering, Mapper…)
10) Deep learning architectures (deep architectures in general)
11) KNN local modeling methods (regression, classification)
12) Gradient-based optimization methods
13) Network metrics and algorithms (centrality measures, betweenness, diversity, entropy, Laplacians, epidemic spread, spectral clustering)
14) Convolution and pooling layers in deep architectures (particularly for computer vision and image classification models)
15) Hierarchical clustering (related to both clustering and topological data analysis tools)
16) Bayesian networks (pathway mining)
17) Complexity and dynamic systems (related to differential equations, but usually used to model systems without a known driver)
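Two of the items above, bagging ensembles and KNN local modeling, combine naturally, and a small sketch may help show how. The Python below averages k-nearest-neighbor regressors that are each fit on a bootstrap resample of the training data. It assumes one-dimensional inputs and uses hypothetical function names (`knn_predict`, `bagged_knn_predict`) and toy data; it is an illustration, not a production implementation.

```python
import random

def knn_predict(train, x, k=3):
    """Average the targets of the k training points nearest to x (1-D inputs).

    train is a list of (input, target) pairs.
    """
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def bagged_knn_predict(train, x, k=3, n_models=25, seed=0):
    """Bagging: average KNN predictions over bootstrap resamples of train."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        # Bootstrap sample: draw len(train) points with replacement.
        sample = [rng.choice(train) for _ in train]
        preds.append(knn_predict(sample, x, k))
    return sum(preds) / len(preds)

# Toy data: a noisy line y = 2x (noise drawn from a fixed seed for repeatability).
noise = random.Random(42)
train = [(x, 2 * x + noise.gauss(0, 1)) for x in range(20)]

print("single KNN at x=10:", knn_predict(train, 10))
print("bagged KNN at x=10:", bagged_knn_predict(train, 10))
```

Each resample leaves out some training points, so the individual models disagree slightly; averaging them reduces the variance of the prediction, which is the same idea random forests apply to trees.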
Depending on the industry chosen, additional algorithms related to natural language processing (NLP) or computer vision may be required. However, these are specialized areas of data science and machine learning, and those who enter these fields are usually experts in that particular field.
Some resources outside of academic programs to learn these methods include:
Bishop, C. M. (2016). Pattern Recognition and Machine Learning. New York: Springer.
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1, pp. 337-387). New York: Springer Series in Statistics.