Test03

Jupyter Notebook in Practice

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

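pandas handles the data processing, matplotlib does the plotting, and seaborn makes the plots look nicer. The first line is not a Python statement; it is what IPython calls a line magic. A single % applies a magic to one line, while %% (a cell magic) applies it to the whole cell, not to the whole notebook. Here, %matplotlib inline tells the notebook to draw figures with matplotlib and display them inline, directly below the cell.
Next, load the dataset.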

df = pd.read_csv('fortune500.csv')

df.head()

df.tail()

Rename the columns so they are easier to refer to later.

df.columns = ['year', 'rank', 'company', 'revenue', 'profit']
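One thing to watch after this rename: rank is also the name of a DataFrame method, so attribute-style access (df.rank) returns the method rather than the column, and bracket access is needed instead. A minimal sketch on a one-row toy frame; the original header names and the row values below are invented for illustration and are not taken from fortune500.csv.

```python
import pandas as pd

# Hypothetical header names and a made-up row, for illustration only.
toy = pd.DataFrame(
    [[1955, 1, "Example Corp", 9823.5, 806.0]],
    columns=["Year", "Rank", "Company", "Revenue (in millions)", "Profit (in millions)"],
)

# Assigning a list replaces all column names; its length must match the column count.
toy.columns = ["year", "rank", "company", "revenue", "profit"]

print(toy.revenue.iloc[0])   # attribute access works for 'revenue'
print(callable(toy.rank))    # 'rank' resolves to the DataFrame.rank method
print(toy["rank"].iloc[0])   # bracket access reaches the column
```

The other four names do not collide with DataFrame attributes, which is why the attribute-style df.profit and df.year used later in the notebook work.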

Next, check that all the records were loaded.

len(df)
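To have something to compare len(df) against, the expected row count can be worked out by hand. This is only a sketch, and it assumes the file holds a full list of 500 companies for every year from 1955 through 2005 inclusive.

```python
first_year, last_year = 1955, 2005
companies_per_year = 500  # assumption: a complete list of 500 for every year

n_years = last_year - first_year + 1          # inclusive range: 51 years
expected_rows = n_years * companies_per_year  # 51 * 500
print(expected_rows)
```

In the notebook, a check such as len(df) == expected_rows would flag a truncated or partially loaded file.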

There are 25,500 records in total, covering 1955 through 2005. Next, check the type of each column.

df.dtypes

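The other columns look fine, but the profit column was expected to be float. Since it is not, it probably contains non-numeric values; check with a regular expression.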

non_numberic_profits = df.profit.str.contains('[^0-9.-]')
df.loc[non_numberic_profits].head()

Such records do exist: the profit column holds strings. Count how many of them there are.

len(df.profit[non_numberic_profits])
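The mechanism behind these two steps can be reproduced on a tiny in-memory CSV: one non-numeric entry is enough to keep pandas from parsing the column as numbers, and the pattern [^0-9.-] flags any value containing a character other than a digit, a dot, or a minus sign. The rows and the placeholder string 'N.A.' below are made up for this sketch.

```python
import io
import pandas as pd

# Made-up data; 'N.A.' stands in for whatever non-numeric marker a file might use.
csv_text = """year,profit
1955,806
1955,N.A.
1956,-12.5
1956,N.A.
"""
toy = pd.read_csv(io.StringIO(csv_text))

# The column was read as text, not as numbers.
print(pd.api.types.is_numeric_dtype(toy.profit))

# True wherever the value contains a character outside digits, '.', and '-'.
toy_mask = toy.profit.str.contains('[^0-9.-]')
n_bad = int(toy_mask.sum())  # summing a boolean mask counts the True entries
print(toy_mask.tolist(), n_bad)
```

Summing the mask gives the same count as taking len() of the filtered column, as done above.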

Overall, relatively few records have a non-numeric profit. To go a step further, plot a histogram of how they are distributed by year.

bin_sizes, _, _ = plt.hist(df.year[non_numberic_profits], bins=range(1955, 2006))
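The histogram gives counts; the same information can also be expressed as a share of each year's records. A rough sketch on invented data with four records per year (the real file is assumed to have 500 per year):

```python
import pandas as pd

# Invented records: 4 per year instead of the real 500.
toy = pd.DataFrame({
    "year":   [1955, 1955, 1955, 1955, 1956, 1956, 1956, 1956],
    "profit": ["1.0", "N.A.", "2.5", "3.0", "4.0", "5.0", "N.A.", "N.A."],
})
invalid = toy.profit.str.contains('[^0-9.-]')

# The mean of a boolean mask is the fraction of True values,
# so grouping the mask by year gives the invalid share per year.
share_by_year = invalid.groupby(toy.year).mean()
print(share_by_year)
```

On the real data, share_by_year.max() would give the worst single-year share directly.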

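The histogram shows fewer than 25 such records in any single year, which is under 5% of the 500 records for that year. That is acceptable, so drop these records.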

df = df.loc[~non_numberic_profits]
df.profit = df.profit.apply(pd.to_numeric)

Check the number of records again.

len(df)

df.dtypes

These operations have cleaned out the invalid records.
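As an aside, the same cleanup can be done without a regular expression. This is an alternative technique, not the approach used in this notebook: pd.to_numeric with errors='coerce' turns anything it cannot parse into NaN, and those rows can then be dropped. The values below are made up for the sketch.

```python
import pandas as pd

# Made-up rows; 'N.A.' is a placeholder for a non-numeric entry.
toy = pd.DataFrame({
    "year":   [1955, 1955, 1956],
    "profit": ["806", "N.A.", "-12.5"],
})

# Unparseable strings become NaN instead of raising an error.
converted = pd.to_numeric(toy["profit"], errors="coerce")

# Replace the column, then drop the rows where conversion failed.
cleaned = toy.assign(profit=converted).dropna(subset=["profit"])
print(cleaned)
print(cleaned.dtypes)
```

The trade-off is that coercion silently discards every unparseable value, whereas the regular-expression mask lets the offending rows be inspected before they are removed.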

Next, plot with matplotlib.
Plot mean profit and mean revenue grouped by year. First define the variables and a helper function.

group_by_year = df.loc[:, ['year', 'revenue', 'profit']].groupby('year')
avgs = group_by_year.mean()
x = avgs.index
y1 = avgs.profit

def plot(x, y, ax, title, y_label):
    ax.set_title(title)
    ax.set_ylabel(y_label)
    ax.plot(x, y)
    ax.margins(x=0, y=0)
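What the groupby and mean steps produce is easiest to see on a handful of rows. The figures below are invented; only the shape of the result matters: one row per year, indexed by year, with the column means as values.

```python
import pandas as pd

# Invented figures, two companies per year.
toy = pd.DataFrame({
    "year":    [1955, 1955, 1956, 1956],
    "revenue": [100.0, 300.0, 200.0, 600.0],
    "profit":  [10.0, 30.0, 20.0, 40.0],
})

# One row per year; each cell is the mean over that year's companies.
toy_avgs = toy.groupby("year")[["revenue", "profit"]].mean()
print(toy_avgs)

# The index holds the years, which is what gets used as the x axis.
print(list(toy_avgs.index))
```

This is why x = avgs.index and y1 = avgs.profit line up point for point when passed to ax.plot.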

Now draw the plot.

fig, ax = plt.subplots()
plot(x, y1, ax, 'Increase in mean Fortune 500 company profits from 1955 to 2005', 'Profit (millions)')

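The curve looks roughly exponential, but with sharp dips that plausibly correspond to the early-1990s recession and the dot-com bubble. Now look at the revenue curve.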

y2 = avgs.revenue
fig, ax = plt.subplots()
plot(x, y2, ax, 'Increase in mean Fortune 500 company revenues from 1955 to 2005', 'Revenue (millions)')

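The revenue curve shows no such sharp drop, possibly because of how the figures are treated in financial accounting. Next, add the standard deviation to the plots.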

def plot_with_std(x, y, stds, ax, title, y_label):
    ax.fill_between(x, y - stds, y + stds, alpha=0.2)
    plot(x, y, ax, title, y_label)

fig, (ax1, ax2) = plt.subplots(ncols=2)
title = 'Increase in mean and std Fortune 500 company %s from 1955 to 2005'
stds1 = group_by_year.std().profit.values
stds2 = group_by_year.std().revenue.values
plot_with_std(x, y1.values, stds1, ax1, title % 'profits', 'Profit (millions)')
plot_with_std(x, y2.values, stds2, ax2, title % 'revenues', 'Revenue (millions)')
fig.set_size_inches(14, 4)
fig.tight_layout()
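The shaded band drawn by fill_between spans the mean minus and plus one standard deviation for each year. A small numeric sketch of those band edges, on invented profit values; note that the pandas groupby std is the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Invented profit values, two companies per year.
toy = pd.DataFrame({
    "year":   [1955, 1955, 1956, 1956],
    "profit": [10.0, 20.0, 30.0, 50.0],
})

grouped = toy.groupby("year")["profit"]
mean = grouped.mean()
std = grouped.std()   # sample standard deviation (ddof=1)

# Lower and upper edges of the band that fill_between would shade.
lower = mean - std
upper = mean + std
print(pd.DataFrame({"mean": mean, "std": std, "lower": lower, "upper": upper}))
```

A wide band relative to the mean, as in the plots above, means that the companies within a single year differ from each other by about as much as the mean itself.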

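The spread in revenue and profit between companies is striking. So which fluctuates more, the top 10% of companies or the bottom 10%? There is plenty more in this data worth digging into.
For details, see https://github.com/Nefelibata6337/Test03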
