资讯 小学 初中 高中 语言 会计职称 学历提升 法考 计算机考试 医护考试 建工考试 教育百科
栏目分类:
子分类:
返回
空麓网用户登录
快速导航关闭
当前搜索
当前分类
子分类
实用工具
空麓网 > 计算机考试 > 软件开发 > 后端开发 > Python

2022年泰迪杯数据分析

Python 更新时间: 发布时间: 计算机考试归档 最新发布

2022年泰迪杯数据分析

另有题目文件,第四题第五题全部可实现文件均放在了我的资源里面

任务1.1(1)首先,使用pandas将数据导入,查看数据的详细信息,发现education、default、marital、job等字段存在明显缺失值,使用dropna()删除缺失值所在的行;其次查看数据是否存在重复值,整体不存在重复值,单独查看‘user_id’发现,存在56个重复值,drop_duplicates()删除重复值所在的行;最后 将处理好的数据保存到“result1_1.xlsx”中。

import pandas as pdimport numpy as npimport re
#导入数据short_data1=pd.read_csv('short-customer-data.csv')short_data1
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA220000156housemaidmarriedpostgraduatenononotelephonemaymon261nonexistentno
1BA220000257servicesmarriedhigh schoolNaNnonotelephonemaymon149nonexistentno
2BA220007737servicesmarriedhigh schoolnoyesnotelephonemaymon226nonexistentno
3BA220000440admin.marriedpostgraduatenononotelephonemaymon151nonexistentno
4BA220000556servicesmarriedhigh schoolnonoyestelephonemaymon307nonexistentno
.............................................
41171BA224117273retiredmarriedjunior collegenoyesnocellularnovfri334nonexistentyes
41172BA224117346blue-collarmarriedjunior collegenononocellularnovfri383nonexistentno
41173BA224117456retiredmarriedundergraduatenoyesnocellularnovfri189nonexistentno
41174BA224117544technicianmarriedjunior collegenononocellularnovfri442nonexistentyes
41175BA224117674retiredmarriedjunior collegenoyesnocellularnovfri239failureno

41176 rows × 14 columns

#查看数据详细信息short_data1.info()
RangeIndex: 41176 entries, 0 to 41175Data columns (total 14 columns): #   Column       Non-Null Count  Dtype ---  ------       --------------  -----  0   user_id      41176 non-null  object 1   age          41176 non-null  int64  2   job          40846 non-null  object 3   marital      41096 non-null  object 4   education    39446 non-null  object 5   default      32580 non-null  object 6   housing      40186 non-null  object 7   loan         40186 non-null  object 8   contact      41176 non-null  object 9   month        41176 non-null  object 10  day_of_week  41176 non-null  object 11  duration     41176 non-null  int64  12  poutcome     41176 non-null  object 13  y            41176 non-null  objectdtypes: int64(2), object(12)memory usage: 4.4+ MB
#查看是否存在重复值short_data1.duplicated().sum()
0
#查看'user_id'是否存在重复值short_data1['user_id'].duplicated().sum()
56
#删除重复值short_data1=short_data1.drop_duplicates(subset=['user_id'])short_data1
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA220000156housemaidmarriedpostgraduatenononotelephonemaymon261nonexistentno
1BA220000257servicesmarriedhigh schoolNaNnonotelephonemaymon149nonexistentno
2BA220007737servicesmarriedhigh schoolnoyesnotelephonemaymon226nonexistentno
3BA220000440admin.marriedpostgraduatenononotelephonemaymon151nonexistentno
4BA220000556servicesmarriedhigh schoolnonoyestelephonemaymon307nonexistentno
.............................................
41170BA224117129unemployedsinglepostgraduatenoyesnocellularnovfri112successno
41172BA224117346blue-collarmarriedjunior collegenononocellularnovfri383nonexistentno
41173BA224117456retiredmarriedundergraduatenoyesnocellularnovfri189nonexistentno
41174BA224117544technicianmarriedjunior collegenononocellularnovfri442nonexistentyes
41175BA224117674retiredmarriedjunior collegenoyesnocellularnovfri239failureno

41120 rows × 14 columns

#查看重复值是否删除short_data1['user_id'].duplicated().sum()
0
#查看缺失值short_data1.isnull().sum()
user_id           0age               0job             328marital          80education      1726default        8578housing         988loan            988contact           0month             0day_of_week       0duration          0poutcome          0y                 0dtype: int64
#删除缺失值所在的行short_data1=short_data1.dropna()short_data1
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA220000156housemaidmarriedpostgraduatenononotelephonemaymon261nonexistentno
2BA220007737servicesmarriedhigh schoolnoyesnotelephonemaymon226nonexistentno
3BA220000440admin.marriedpostgraduatenononotelephonemaymon151nonexistentno
4BA220000556servicesmarriedhigh schoolnonoyestelephonemaymon307nonexistentno
6BA220000759admin.marriedjunior collegenononotelephonemaymon139nonexistentno
.............................................
41170BA224117129unemployedsinglepostgraduatenoyesnocellularnovfri112successno
41172BA224117346blue-collarmarriedjunior collegenononocellularnovfri383nonexistentno
41173BA224117456retiredmarriedundergraduatenoyesnocellularnovfri189nonexistentno
41174BA224117544technicianmarriedjunior collegenononocellularnovfri442nonexistentyes
41175BA224117674retiredmarriedjunior collegenoyesnocellularnovfri239failureno

30444 rows × 14 columns

#查看缺失值是否删除short_data1.isnull().sum()
user_id        0age            0job            0marital        0education      0default        0housing        0loan           0contact        0month          0day_of_week    0duration       0poutcome       0y              0dtype: int64
#将处理好的数据导出short_data1.to_excel('result1_1.xlsx',index = False)short_data1
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA220000156housemaidmarriedpostgraduatenononotelephonemaymon261nonexistentno
2BA220007737servicesmarriedhigh schoolnoyesnotelephonemaymon226nonexistentno
3BA220000440admin.marriedpostgraduatenononotelephonemaymon151nonexistentno
4BA220000556servicesmarriedhigh schoolnonoyestelephonemaymon307nonexistentno
6BA220000759admin.marriedjunior collegenononotelephonemaymon139nonexistentno
.............................................
41170BA224117129unemployedsinglepostgraduatenoyesnocellularnovfri112successno
41172BA224117346blue-collarmarriedjunior collegenononocellularnovfri383nonexistentno
41173BA224117456retiredmarriedundergraduatenoyesnocellularnovfri189nonexistentno
41174BA224117544technicianmarriedjunior collegenononocellularnovfri442nonexistentyes
41175BA224117674retiredmarriedjunior collegenoyesnocellularnovfri239failureno

30444 rows × 14 columns

任务1.1.2

import pandas as pdimport numpy as npimport re
long_data1=pd.read_csv('long-customer-train.csv')long_data1.info()
RangeIndex: 9300 entries, 0 to 9299Data columns (total 11 columns): #   Column           Non-Null Count  Dtype  ---  ------           --------------  -----   0   CustomerId       9300 non-null   int64   1   CreditScore      9300 non-null   int64   2   Gender           9300 non-null   int64   3   Age              9300 non-null   object  4   Tenure           9300 non-null   int64   5   Balance          9300 non-null   float64 6   NumOfProducts    9300 non-null   int64   7   HasCrCard        9300 non-null   int64   8   IsActiveMember   9300 non-null   int64   9   EstimatedSalary  9300 non-null   float64 10  Exited           9300 non-null   int64  dtypes: float64(2), int64(8), object(1)memory usage: 799.3+ KB
long_data1[long_data1['Age']=='-']
CustomerIdCreditScoreGenderAgeTenureBalanceNumOfProductsHasCrCardIsActiveMemberEstimatedSalaryExited
464155704855580-4161766.8710092378.540
566155734526630-7115930.8711019862.780
651155760007650-6138033.5511167972.450
696155770645920-2104702.65210107948.720
796155800685260-50.00211105618.140
....................................
8176157841615830-8102945.0110052861.890
8270157870356021-80.00211152843.530
8591157958817760-8106365.29111148527.560
8776158010625571-40.00201105433.530
8794158014176570-482500.28111115260.720

78 rows × 11 columns

长期数据中的客户年龄“Age”列存在数值为-1、0 和“-”的异常值,删除存在该情况的行数据。
这里我们先将异常值赋值为空值

long_data2=long_data1.replace(to_replace=['-1','0','-','1'],value=[np.nan,np.nan,np.nan,np.nan])long_data2.info()
RangeIndex: 9300 entries, 0 to 9299Data columns (total 11 columns): #   Column           Non-Null Count  Dtype  ---  ------           --------------  -----   0   CustomerId       9300 non-null   int64   1   CreditScore      9300 non-null   int64   2   Gender           9300 non-null   int64   3   Age              9180 non-null   object  4   Tenure           9300 non-null   int64   5   Balance          9300 non-null   float64 6   NumOfProducts    9300 non-null   int64   7   HasCrCard        9300 non-null   int64   8   IsActiveMember   9300 non-null   int64   9   EstimatedSalary  9300 non-null   float64 10  Exited           9300 non-null   int64  dtypes: float64(2), int64(8), object(1)memory usage: 799.3+ KB

再用删除空值的方法将其所在的行删除

long_data3=long_data2.dropna()long_data3.info()
Int64Index: 9180 entries, 0 to 9299Data columns (total 11 columns): #   Column           Non-Null Count  Dtype  ---  ------           --------------  -----   0   CustomerId       9180 non-null   int64   1   CreditScore      9180 non-null   int64   2   Gender           9180 non-null   int64   3   Age              9180 non-null   object  4   Tenure           9180 non-null   int64   5   Balance          9180 non-null   float64 6   NumOfProducts    9180 non-null   int64   7   HasCrCard        9180 non-null   int64   8   IsActiveMember   9180 non-null   int64   9   EstimatedSalary  9180 non-null   float64 10  Exited           9180 non-null   int64  dtypes: float64(2), int64(8), object(1)memory usage: 860.6+ KB
# long_data3['age_new']=long_data3.Age.str.extract(r"(d+)")# #long_data3['new_Age']=long_data3.Age.str.replace('岁','')# #long_data3[long_data3.age_new.str.contains(" ")]# long_data3['age_new']=long_data3['age_new'].astype(int)# long_data4=long_data3.drop(columns=['Age'])# long_data4.rename(columns={'age_new':'Age'},inplace=True)# long_data4.info()

“Age”列存在空格和“岁”等异常字符,删除这些异常
字符但须保留年龄数值,将处理后的数值存于“Age”列。
这里通过正则表达式匹配异常字符中的正确年龄,并将年龄保存在了Age列,数据类型用int

long_data3['Age']=long_data3.Age.str.extract(r"(d+)")long_data3['Age']=long_data3['Age'].astype(int)
C:Users31214AppDataLocalTempipykernel_108603539377572.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.Try using .loc[row_indexer,col_indexer] = value insteadSee the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy  long_data3['Age']=long_data3.Age.str.extract(r"(d+)")C:Users31214AppDataLocalTempipykernel_108603539377572.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.Try using .loc[row_indexer,col_indexer] = value insteadSee the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy  long_data3['Age']=long_data3['Age'].astype(int)
long_data3.to_excel('result1_2.xlsx',index = False)long_data3.info()
Int64Index: 9180 entries, 0 to 9299Data columns (total 11 columns): #   Column           Non-Null Count  Dtype  ---  ------           --------------  -----   0   CustomerId       9180 non-null   int64   1   CreditScore      9180 non-null   int64   2   Gender           9180 non-null   int64   3   Age              9180 non-null   int32   4   Tenure           9180 non-null   int64   5   Balance          9180 non-null   float64 6   NumOfProducts    9180 non-null   int64   7   HasCrCard        9180 non-null   int64   8   IsActiveMember   9180 non-null   int64   9   EstimatedSalary  9180 non-null   float64 10  Exited           9180 non-null   int64  dtypes: float64(2), int32(1), int64(8)memory usage: 824.8 KB
long_data3
CustomerIdCreditScoreGenderAgeTenureBalanceNumOfProductsHasCrCardIsActiveMemberEstimatedSalaryExited
0155532517131520185891.5411146369.571
11555325661914180.0031179866.731
215553283603142891611.12100144675.301
31555330858916110.0011061108.561
41555338768713920.00300188150.601
....................................
9295158156287111378113899.9210080215.200
9296158156454810378152303.66211175082.200
9297158156565411399100116.67111199808.101
9298158156607581341154139.4511160728.890
9299158156906141403113348.5011177789.010

9180 rows × 11 columns

任务1.2

import pandas as pd
short_data=pd.read_excel('result1_1.xlsx')short_data
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA220000156housemaidmarriedpostgraduatenononotelephonemaymon261nonexistentno
1BA220007737servicesmarriedhigh schoolnoyesnotelephonemaymon226nonexistentno
2BA220000440admin.marriedpostgraduatenononotelephonemaymon151nonexistentno
3BA220000556servicesmarriedhigh schoolnonoyestelephonemaymon307nonexistentno
4BA220000759admin.marriedjunior collegenononotelephonemaymon139nonexistentno
.............................................
30439BA224117129unemployedsinglepostgraduatenoyesnocellularnovfri112successno
30440BA224117346blue-collarmarriedjunior collegenononocellularnovfri383nonexistentno
30441BA224117456retiredmarriedundergraduatenoyesnocellularnovfri189nonexistentno
30442BA224117544technicianmarriedjunior collegenononocellularnovfri442nonexistentyes
30443BA224117674retiredmarriedjunior collegenoyesnocellularnovfri239failureno

30444 rows × 14 columns

对短期数据中的字符型数据进行特征编码,如将信用违约情况
{‘否’,‘是’}编码为{0,1}

在这种情况下,我们必须将数据进行编码,即是说,将文字型数据转换为数值型。
这里我们采用preprocessing.LabelEncoder将汉字进行编码

短期客户数据中,需要进行特征编码的字段都包含{job,marital,education,default,housing,loan,contact,poutcome,y}

from sklearn.preprocessing import LabelEncodershort_data6=short_data.copy()short_data6
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA220000156housemaidmarriedpostgraduatenononotelephonemaymon261nonexistentno
1BA220007737servicesmarriedhigh schoolnoyesnotelephonemaymon226nonexistentno
2BA220000440admin.marriedpostgraduatenononotelephonemaymon151nonexistentno
3BA220000556servicesmarriedhigh schoolnonoyestelephonemaymon307nonexistentno
4BA220000759admin.marriedjunior collegenononotelephonemaymon139nonexistentno
.............................................
30439BA224117129unemployedsinglepostgraduatenoyesnocellularnovfri112successno
30440BA224117346blue-collarmarriedjunior collegenononocellularnovfri383nonexistentno
30441BA224117456retiredmarriedundergraduatenoyesnocellularnovfri189nonexistentno
30442BA224117544technicianmarriedjunior collegenononocellularnovfri442nonexistentyes
30443BA224117674retiredmarriedjunior collegenoyesnocellularnovfri239failureno

30444 rows × 14 columns

t1=short_data6.loc[:,'job']    #要输入的是标签,不是特征矩阵,所以允许一维数据t2=short_data6.loc[:,'marital']t3=short_data6.loc[:,'education']t4=short_data6.loc[:,'default']t5=short_data6.loc[:,'housing']t6=short_data6.loc[:,'loan']t7=short_data6.loc[:,'contact']t8=short_data6.loc[:,'poutcome']t9=short_data6.loc[:,'y']
le1 = LabelEncoder()     #实例化le1 = le1.fit(t1) # 导入数据label1 = le1.transform(t1) # transform接口调取结果short_data6.loc[:,"job"] = label1short_data6['job'].unique()
array([ 3,  7,  0,  9,  1, 10,  5,  2,  4,  8,  6])
le2 = LabelEncoder()     #实例化le2 = le2.fit(t2) # 导入数据label2 = le2.transform(t2) # transform接口调取结果short_data6.loc[:,"marital"] = label2short_data6['marital'].unique()
array([1, 2, 0])
le3 = LabelEncoder()     #实例化le3 = le3.fit(t3) # 导入数据label3 = le3.transform(t3) # transform接口调取结果short_data6.loc[:,"education"] = label3short_data6['education'].unique()
array([3, 0, 2, 4, 1])
le4 = LabelEncoder()     #实例化le4 = le4.fit(t4) # 导入数据label4 = le4.transform(t4) # transform接口调取结果short_data6.loc[:,"default"] = label4short_data6['default'].unique()
array([0, 1])
le5 = LabelEncoder()     #实例化le5 = le5.fit(t5) # 导入数据label5 = le5.transform(t5) # transform接口调取结果short_data6.loc[:,"housing"] = label5short_data6['housing'].unique()
array([0, 1])
le6 = LabelEncoder()     #实例化le6 = le6.fit(t6) # 导入数据label6 = le6.transform(t6) # transform接口调取结果short_data6.loc[:,"loan"] = label6short_data6['loan'].unique()
array([0, 1])
le7 = LabelEncoder()     #实例化le7 = le7.fit(t7) # 导入数据label7 = le7.transform(t7) # transform接口调取结果short_data6.loc[:,"contact"] = label7short_data6['contact'].unique()
array([1, 0])
le8 = LabelEncoder()     #实例化le8 = le8.fit(t8) # 导入数据label8 = le8.transform(t8) # transform接口调取结果short_data6.loc[:,"poutcome"] = label8short_data6['poutcome'].unique()
array([1, 0, 2])
le9 = LabelEncoder()     #实例化le9 = le9.fit(t9) # 导入数据label9 = le9.transform(t9) # transform接口调取结果short_data6.loc[:,"y"] = label9short_data6['y'].unique()
array([0, 1])
short_data7=short_data6.copy()short_data7
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA2200001563130001maymon26110
1BA2200077377100101maymon22610
2BA2200004400130001maymon15110
3BA2200005567100011maymon30710
4BA2200007590120001maymon13910
.............................................
30439BA22411712910230100novfri11220
30440BA2241173461120000novfri38310
30441BA2241174565140100novfri18910
30442BA2241175449120000novfri44211
30443BA2241176745120100novfri23900

30444 rows × 14 columns

#short_data7.loc[short_data7['month']=='mar']['month'].unique()short_data6['month'].unique()
array(['may', 'jun', 'jul', 'aug', 'oct', 'nov', 'dec', 'mar', 'apr',       'sep'], dtype=object)

对上次拜访客户的月份(month)和最近拜访客户的月份(day_of_week)将不适用LabelEncoder库进行编码,因此这里将手动赋值,例如:一月:1,星期一:1

short_data8=short_data7.replace(to_replace=['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec'],value=['1','2','3','4','5','6','7','8','9','10','11','12'])short_data8
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA22000015631300015mon26110
1BA22000773771001015mon22610
2BA22000044001300015mon15110
3BA22000055671000115mon30710
4BA22000075901200015mon13910
.............................................
30439BA2241171291023010011fri11220
30440BA224117346112000011fri38310
30441BA224117456514010011fri18910
30442BA224117544912000011fri44211
30443BA224117674512010011fri23900

30444 rows × 14 columns

short_data8['day_of_week'].unique()
array(['mon', 'tue', 'wed', 'thu', 'fri'], dtype=object)
short_data9=short_data8.replace(to_replace=['mon','tue','wed','thu','fri'],value=[1,2,3,4,5])short_data9
user_idagejobmaritaleducationdefaulthousingloancontactmonthday_of_weekdurationpoutcomey
0BA22000015631300015126110
1BA22000773771001015122610
2BA22000044001300015115110
3BA22000055671000115130710
4BA22000075901200015113910
.............................................
30439BA2241171291023010011511220
30440BA224117346112000011538310
30441BA224117456514010011518910
30442BA224117544912000011544211
30443BA224117674512010011523900

30444 rows × 14 columns

short_data9.to_excel("result1_3.xlsx",encoding = 'openpyxl',index = False)
转载请注明:文章转载自 http://www.konglu.com/
本文地址:http://www.konglu.com/it/1098105.html
免责声明:

我们致力于保护作者版权,注重分享,被刊用文章【2022年泰迪杯数据分析】因无法核实真实出处,未能及时与作者取得联系,或有版权异议的,请联系管理员,我们会立即处理,本文部分文字与图片资源来自于网络,转载此文是出于传递更多信息之目的,若有来源标注错误或侵犯了您的合法权益,请立即通知我们,情况属实,我们会第一时间予以删除,并同时向您表示歉意,谢谢!

我们一直用心在做
关于我们 文章归档 网站地图 联系我们

版权所有 (c)2021-2023 成都空麓科技有限公司

ICP备案号:蜀ICP备2023000828号-2