Online Reading Notes, 2017-04-12


Posted by 土豪便当, a blogger on 靠谱客. These are reading notes collected during recent development work, shared here for reference.

Overview

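1. How to validate a predictive model (with a quantitative outcome variable) built using multiple regression.

Recommended methods for model validation:

If the model's predicted values fall far outside the range of the response variable, that immediately indicates poor estimation or poor model accuracy.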

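Errors in the model coefficients

If the predicted values look reasonable, examine the parameters. Any of the following indicates poor estimation or multicollinearity: signs opposite to what is expected, unusually large or small values, or inconsistent behavior when the model is fed new data.
Feed new data to the model to make predictions, then use the coefficient of determination (R squared) to assess the model's validity.
Use data splitting: build one dataset for estimating the model parameters and a separate one for validating the predictions.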
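The checks above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data are synthetic (generated from y = 2 + 3*x1 - 1.5*x2 plus noise), the 75/25 split is arbitrary, and `fit_ols`, `predict`, `r_squared` and `mse` are hypothetical helper names, not functions from the original article.

```python
import random

def fit_ols(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[piv] = A[piv], A[c]
        for k in range(p):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(beta, X):
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in X]

def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

# Synthetic data (an assumption for illustration only)
rng = random.Random(0)
X = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(200)]
y = [2 + 3 * a - 1.5 * b + rng.gauss(0, 1) for a, b in X]

# Data splitting: first 75% to estimate parameters, last 25% to validate
split = int(0.75 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

beta = fit_ols(X_train, y_train)
y_hat = predict(beta, X_test)
holdout_r2 = r_squared(y_test, y_hat)
holdout_mse = mse(y_test, y_hat)

# Range check: share of predictions outside the observed response range
lo, hi = min(y_train), max(y_train)
share_out_of_range = sum(1 for v in y_hat if v < lo or v > hi) / len(y_hat)

print("coefficients:", beta)
print("hold-out R^2:", holdout_r2, "hold-out MSE:", holdout_mse)
print("share of predictions out of range:", share_out_of_range)
```

Coefficients with unexpected signs or implausible magnitudes, a hold-out R squared far below the training value, or many out-of-range predictions would each be a warning sign in the sense described above.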

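Training sample and prediction sample

If the dataset contains only a small number of instances, use jackknife resampling and measure validity with R squared and mean squared error (MSE).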
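A minimal sketch of the leave-one-out idea behind the jackknife follows. Assumptions: the 12 data points are synthetic, a single predictor is used for brevity (the notes discuss multiple regression), and `fit_line` is a hypothetical helper. Strictly speaking the jackknife estimates the bias and variance of a statistic; scoring the held-out predictions, as done here with R squared and MSE, amounts to leave-one-out cross-validation.

```python
import random

def fit_line(x, y):
    """Closed-form least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    sxy = sum((v - mx) * (w - my) for v, w in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

# Small synthetic sample (an assumption for illustration only)
rng = random.Random(1)
x = [float(i) for i in range(12)]
y = [1.0 + 0.5 * v + rng.gauss(0, 0.3) for v in x]

loo_pred = []  # prediction for each point from a model fitted without it
slopes = []    # slope estimate from each leave-one-out fit
for i in range(len(x)):
    a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    slopes.append(b)
    loo_pred.append(a + b * x[i])

n = len(y)
mean_y = sum(y) / n
ss_res = sum((o - p) ** 2 for o, p in zip(y, loo_pred))
loo_mse = ss_res / n
loo_r2 = 1.0 - ss_res / sum((o - mean_y) ** 2 for o in y)

# Jackknife standard error of the slope
mean_slope = sum(slopes) / n
jack_se = ((n - 1) / n * sum((s - mean_slope) ** 2 for s in slopes)) ** 0.5

print("leave-one-out R^2:", loo_r2, "MSE:", loo_mse)
print("jackknife SE of slope:", jack_se)
```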

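In addition, there are specialized or general-purpose statistical tests for particular problems, for example the Ljung-Box (LB) test and the ARCH test.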
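As one illustration of such a test, the Ljung-Box statistic for a residual series can be computed directly as Q = n(n+2) * sum over k = 1..h of rho_k^2 / (n - k), where rho_k is the lag-k sample autocorrelation; under the null of no autocorrelation Q is approximately chi-square with h degrees of freedom. The sketch below is an assumption-laden example: both series are simulated, h = 10 is an arbitrary choice, and `ljung_box` and `chi2_sf_even` are hypothetical helper names. The p-value helper uses a closed form that is valid only for an even number of degrees of freedom.

```python
import math
import random

def autocorr(x, k):
    """Lag-k sample autocorrelation."""
    n = len(x)
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x)
    num = sum((x[t] - m) * (x[t - k] - m) for t in range(k, n))
    return num / denom

def ljung_box(x, h):
    """Ljung-Box Q statistic using lags 1..h."""
    n = len(x)
    return n * (n + 2) * sum(autocorr(x, k) ** 2 / (n - k) for k in range(1, h + 1))

def chi2_sf_even(q, df):
    """Upper-tail chi-square probability; closed form for even df only."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even df")
    half = q / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

rng = random.Random(7)
n, h = 300, 10

# Residuals with no structure (white noise)
white = [rng.gauss(0, 1) for _ in range(n)]

# Residuals with strong serial correlation: AR(1) with coefficient 0.8
ar1 = [0.0]
for _ in range(n - 1):
    ar1.append(0.8 * ar1[-1] + rng.gauss(0, 1))

q_white, q_ar1 = ljung_box(white, h), ljung_box(ar1, h)
p_white, p_ar1 = chi2_sf_even(q_white, h), chi2_sf_even(q_ar1, h)

print("white noise: Q =", q_white, "p =", p_white)
print("AR(1):       Q =", q_ar1, "p =", p_ar1)
```

A small p-value, as expected for the AR(1) series, suggests that the residuals are autocorrelated and that the model has left structure unexplained.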

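2. Statistical analysis models for Chinese and English words

Reason for the search: an email advertisement, "Data Science Interview Tips (Part 1)"

Search query: data science interview

(search trace)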

KDnuggets™ News 17:n11, Mar 22: 50 Companies Leading The AI Revolution; 17 More Must-Know Data Science Q&A, part 3

http://www.kdnuggets.com/2017/03/kanri-distance-calculator.html

Kanri combination of patented statistical and process methods provide a powerful ability to evaluate large data, tells users the exact distance from target, and variable contributions for participant. Free trial and 88% KDnuggets discount for the first 100 buyers.

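2.2 Search trace: Baidu Translate

combination of: a combination of ...
patented: protected by a patent
statistical: relating to statistics
evaluate: to assess; to compute the value of
large data: massive data
tells: recounts; says; indicates (third-person singular of "tell"); knows
exact: accurate; rigorous; precise; (as a verb) to demand; to force
contributions: donations; contributions (plural of "contribution"); donated items; a submitted article
participant: a person taking part; delegate; participating country; party concerned; (as an adjective) participating, involved
Free trial: trial use at no charge

Precision machinery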

3. http://www.kdnuggets.com/2017/03/statistical-modeling-primer.html

"Model" means different things to different people and different things at different times.

Note: what "model" means varies from person to person and from one time to another.

As I briefly explain in A Model's Many Faces, I often find it helpful to classify models as conceptual, operational or statistical. In this post we'll have a closer look at the last of these, statistical models. First, it's critical to understand that statistical models are simplified representations of reality and, to paraphrase the famous words of statistician George Box, they're all wrong but some of them are useful. So why do we use statistical models? We use them because we need to better understand something we don't understand very well or because we wish to predict something - sales, for instance.

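Note: as the author explains in A Model's Many Faces, it is helpful to classify models as conceptual, operational or statistical. (A statistical model is a simplified representation of reality.)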

There is also an important distinction between deterministic and stochastic models I should mention. Put very simply, with a deterministic model we can calculate the answer from one or more equations. A stochastic model, on the other hand, possesses some inherent randomness and we can only estimate the answer. Our estimates may be quite close, or they may be way off. In a field such as marketing research, we often don't know because we lack the data needed to make this assessment. Sometimes, though, we are able to compare model predictions with real data - predicted sales versus actual sales, for example.
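The deterministic/stochastic distinction can be made concrete with a toy example. Everything in this sketch is hypothetical: the revenue formula, the sales model (baseline 100, 2 extra units per unit of ad spend, Gaussian noise with standard deviation 10) and the function names are illustrative assumptions, not taken from the article.

```python
import random

def deterministic_revenue(price, units):
    """Deterministic: the answer follows exactly from the equation."""
    return price * units

def stochastic_sales(ad_spend, rng):
    """Stochastic: the same input yields a different outcome on each draw."""
    return 100 + 2.0 * ad_spend + rng.gauss(0, 10)

rng = random.Random(42)

# Same inputs, same answer, every time
det_results = {deterministic_revenue(9.5, 20) for _ in range(5)}

# Same input, varying outcomes; we can only estimate the expected value
draws = [stochastic_sales(50, rng) for _ in range(1000)]
estimate = sum(draws) / len(draws)

print("deterministic results:", det_results)
print("distinct stochastic outcomes:", len(set(draws)))
print("estimated expected sales:", estimate)
```

With many draws the estimate settles near the true expected value of 200, but any single outcome may be well off, which is the sense in which estimates "may be quite close, or they may be way off".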

Statistical models are stochastic and what we normally use in marketing research. To crib from Wikipedia: "A statistical model is a class of mathematical model, which embodies a set of assumptions concerning the generation of some sample data, and similar data from a larger population. A statistical model represents, often in considerably idealized form, the data-generating process." A word of caution is that What If? simulation tools based on statistical models are sometimes mistaken for deterministic models by naive users because of their user-friendly interfaces.

Another useful distinction is between dependence and interdependence methods. Regression, in which we have both a dependent variable and one or more independent (predictor) variables, is an example of the former. Note that we can have more than one dependent variable, as we often do in Structural Equation Modeling. Cluster analysis and factor analysis are examples of interdependence methods, which do not distinguish between dependent and independent variables. They are frequently used for brand mapping in marketing research in addition to segmentation.

Some models are purely predictive - they are only concerned with predicting something that hasn't happened yet. An example would be predicting future sales from past sales alone. Recommender systems are another type of predictive model now widely used in marketing. Amazon presumably doesn't care why you like novels featuring attorneys but knows that people who buy John Grisham's books also frequently buy Scott Turow's. (I plead guilty on both counts.)

A causal model, on the other hand, seeks explanations. This is particularly important in marketing research when simply predicting how a customer will behave is not enough and we need to know why some consumers behave as they do in order to formulate and implement marketing activities. There is an erroneous notion among some marketing researchers that quantitative research is for getting the numbers and qualitative research is for understanding the why underlying the numbers. (I address this rather alarming misconception in Combining Smart Design with Smart Analytics.) Note that a causal model can also be used for prediction and how well it predicts is often (but not always) a criterion for judging how good the model is, so this dichotomy is somewhat blurry.

There are other important categorizations as well, for instance between time-series or longitudinal modeling, in which our data span two or more points in time, and cross-sectional modeling, in which we only have data for one slice in time. Marketing mix modeling uses time-series data whereas most marketing research surveys are cross-sectional. Tracking studies are exceptions to this rule. Some multi-level models fall between these cracks by combining cross-sectional data with time-series or longitudinal data in one model. Though complex, models for spatial and spatiotemporal data are relevant to specialized corners of marketing research.

Frequentist versus Bayesian statistics...at times this resembles the academic equivalent of a religious war. The linked post is an interview with noted Bayesian statistician Andrew Gelman who, fortunately, is the peaceful sort as well as being an outstanding educator. Most of the time either approach will work for marketing research though, generally speaking, Bayesian methods are more complex and there are fewer people skilled at them. Another conflict zone for some is between statistics and machine learning, but the two terms are increasingly used synonymously. There are also nonparametric and semiparametric models and some disagreement among statisticians regarding when these are better suited than more familiar parametric statistics.

I haven't even mentioned mixture modeling! This is particularly useful when you suspect more than one process gave rise to your data, segmented driver analysis being one example.

Suffice it to say that statisticians now have an immense tool kit, and An Analytics Toolbox gives you a peek inside of it. Despite what some have claimed over the years, we're still nowhere near the point where Artificial Intelligence or some other form of automation can replace a competent statistician or marketing science person. The growing complexity of statistical science is actually making this goal more elusive.

How these tools are used by human experts matters a great deal and will for the foreseeable future - see What Makes a Good Analyst? for some thoughts on what to look for in an analyst. Technical competence, of course, is a must since it's very easy for someone untutored in statistics to point and click themselves and their clients into a heap of trouble. However, in my experience, it's even more critical to understand who will be using the results and, to the extent possible, how they will be used.

It all begins with the brief.

I hope you've found this interesting and helpful!

Bio: Kevin Gray is president of Cannon Gray, a marketing science and analytics consultancy.

Original. Reposted with permission.

Vocabulary note: Gray (格雷, the author's surname) is also the word for the color grey.