What should you do with data collected by a Python crawler?

First, get familiar with the following MySQL features.

The SET variable statement, and the length(), char_length(), replace(), and max() functions.

1.1 SET variable: set @variable_name = value

    set @address = '中國-山東-聊城-莘縣';
    select @address

1.2 The difference between length() and char_length()

    select length('a')
    , char_length('a')
    , length('中')
    , char_length('中')
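length() counts bytes while char_length() counts characters, so the two only differ for multi-byte text. A minimal Python sketch of the same distinction, assuming the data is stored as UTF-8 (the helper names are illustrative, not part of MySQL):

```python
def byte_length(text: str) -> int:
    """Mirror MySQL length(): number of bytes, assuming UTF-8 storage."""
    return len(text.encode("utf-8"))


def char_length(text: str) -> int:
    """Mirror MySQL char_length(): number of characters."""
    return len(text)


for sample in ("a", "中"):
    # 'a' is 1 byte and 1 character; '中' is 3 bytes and 1 character in UTF-8.
    print(sample, byte_length(sample), char_length(sample))
```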

1.3 Combining replace() and length()

    set @address = '中國-山東-聊城-莘縣';
    select @address
    , replace(@address, '-', '') as address_1
    , length(@address) as len_add1
    , length(replace(@address, '-', '')) as len_add2
    , length(@address) - length(replace(@address, '-', '')) as _count

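When a field being cleaned in ETL has an obvious delimiter, how do you decide how many split fields to add to the new table?

Count the largest number of '-' delimiters that appears in any com_industry value. That maximum plus 1 is the number of fields the column can be split into. For this table the maximum is 3, so the column can be split into four industry fields, that is, four industry levels.

    select max(length(com_industry) - length(replace(com_industry, '-', ''))) as _max_count
    from etl1_socom_data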
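The counting trick can also be checked outside the database. The following is a small Python sketch, assuming a single-character delimiter; the function names and sample values are made up for illustration and are not taken from the real table:

```python
def delimiter_count(value: str, delimiter: str = "-") -> int:
    """Same idea as length(x) - length(replace(x, '-', '')) in SQL.

    Assumes a single-character delimiter.
    """
    return len(value) - len(value.replace(delimiter, ""))


def fields_needed(values, delimiter: str = "-") -> int:
    """Largest delimiter count plus one = number of split fields to create."""
    return max(delimiter_count(v, delimiter) for v in values) + 1


# Hypothetical sample values, not rows from etl1_socom_data.
samples = ["A-B", "A-B-C-D", "A"]
print(fields_needed(samples))
```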

1.4 SET variable with the substring_index() string-extraction function

    set @address = '中國-山東-聊城-莘縣';
    select
    substring_index(@address, '-', 1) as china,
    substring_index(substring_index(@address, '-', 2), '-', -1) as province,
    substring_index(substring_index(@address, '-', 3), '-', -1) as city,
    substring_index(@address, '-', -1) as district
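To see why the nested calls pick out the middle parts, here is a rough Python equivalent of substring_index(). It is a sketch for reasoning about the SQL, assuming a non-empty delimiter; the Python function is hypothetical and only mimics the MySQL behavior:

```python
def substring_index(text: str, delimiter: str, count: int) -> str:
    """Rough equivalent of MySQL substring_index() for a non-empty delimiter."""
    if count == 0:
        return ""
    parts = text.split(delimiter)
    if count > 0:
        # Everything to the left of the count-th delimiter.
        return delimiter.join(parts[:count])
    # Negative count: everything to the right of the count-th delimiter from the end.
    return delimiter.join(parts[count:])


address = "中國-山東-聊城-莘縣"
country = substring_index(address, "-", 1)
# Keep the first two parts, then take the last of those: the second part.
province = substring_index(substring_index(address, "-", 2), "-", -1)
city = substring_index(substring_index(address, "-", 3), "-", -1)
district = substring_index(address, "-", -1)
print(country, province, city, district)
```

The inner call trims the string to its first N parts, and the outer call with -1 keeps only the last of them, which is how the N-th part is isolated.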

1.5 The conditional expression CASE WHEN

Syntax: case when condition then value when condition then value else value end as field_name

    select case when 89 > 101 then 'greater than' else 'less than' end as b

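2. Kettle transformation: etl1 cleaning

First, the table-creation steps are shown in the video.

The video does not mention which index algorithm to use for the field indexes. BTREE is recommended to improve query efficiency.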

2.1 Kettle file name: trans_etl1_socom_data

2.2 Steps used: Table input >> Table output

2.3 Data flow: s_socom_data >> etl1_socom_data

Screenshot of Kettle transformation 1

2.4 Table input: the SQL script performs an initial cleaning of the com_district and com_industry fields.

    select a.*,
    case when com_district like '%industry' or com_district like '%weaving' or com_district like '%education' then null else com_district end as com_district1
    , case when com_district like '%industry' or com_district like '%weaving' or com_district like '%education' then concat(com_district, '-', com_industry) else com_industry end as com_industry_total
    , replace(com_addr, '地址:', '') as com_addr1
    , replace(com_phone, '電話:', '') as com_phone1
    , replace(com_fax, '傳真:', '') as com_fax1
    , replace(com_mobile, '移動電話:', '') as com_mobile1
    , replace(com_url, 'URL:', '') as com_url1
    , replace(com_email, '電子郵件:', '') as com_email1
    , replace(com_contactor, 'contact:', '') as com_contactor1
    , replace(com_employees_nums, '公司數量:', '') as com_employees_nums1
    , replace(com_reg_capital, '註冊資本:萬元', '') as com_reg_capital1
    , replace(com_type, '經濟類型:', '') as com_type1
    , replace(com_product, '公司產品:', '') as com_product1
    , replace(com_desc, '公司簡介:', '') as com_desc1
    from s_socom_data a
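Most of the script simply strips a label prefix from each scraped field. The same step can be sketched in Python for quick experiments before touching the table. The label mapping, function name, and sample record below are illustrative assumptions, not the real data:

```python
# Hypothetical subset of the field-to-label mapping used in the SQL script.
LABELS = {
    "com_addr": "地址:",
    "com_phone": "電話:",
}


def strip_labels(record: dict, labels: dict) -> dict:
    """Remove the label text from each mapped field, like replace(col, label, '')."""
    cleaned = {}
    for field, value in record.items():
        if field in labels and value is not None:
            cleaned[field] = value.replace(labels[field], "")
        else:
            # Unmapped fields and missing values pass through unchanged.
            cleaned[field] = value
    return cleaned


# Made-up record for demonstration only.
record = {"com_name": "Example Co.", "com_addr": "地址:Example Road 1", "com_phone": "電話:000-0000"}
print(strip_labels(record, LABELS))
```

As with replace() in SQL, the label must match the scraped text exactly (including the colon character) or nothing is removed.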

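2.5 Table output

Notes on the Table output settings

Notes:

① When the crawler loads data incrementally, do not check the Truncate table option.

② Database connection: in Table output, select the database that contains the target table.

③ Field mapping: make sure the number of fields in the data stream matches the number of fields in the physical table.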

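3. Kettle transformation: etl2 cleaning

First, create the table with four additional fields; the steps are demonstrated in the video.

The video does not mention which index algorithm to use for the field indexes. BTREE is recommended to improve query efficiency.

The field splitting is applied mainly to the new com_industry field produced by etl1.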

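3.1 Kettle file name: trans_etl2_socom_data

3.2 Steps used: Table input >> Table output

3.3 Data flow: etl1_socom_data >> etl2_socom_data

Notes:

① When the crawler loads data incrementally, do not check the Truncate table option.

② Database connection: in Table output, select the database that contains the target table.

③ Field mapping: make sure the number of fields in the data stream matches the number of fields in the physical table.

Screenshot of Kettle transformation 2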

3.4 The SQL script splits com_industry and completes the cleaning of all fields. For lack of time, the registered-capital field is not broken down in detail; adjust it as needed.

    select a.*,
    case
    # when the industry value is '' (length 0), set it to null
    when length(com_industry) = 0 then null
    # otherwise take the part before the first delimiter
    else substring_index(com_industry, '-', 1) end as com_industry1,
    case
    when length(com_industry) - length(replace(com_industry, '-', '')) = 0 then null
    # for values such as '交通運輸、倉儲和郵政服務-' (trailing delimiter, empty second part), industry2 is also set to null
    when length(com_industry) - length(replace(com_industry, '-', '')) = 1 and length(substring_index(com_industry, '-', -1)) = 0 then null
    when length(com_industry) - length(replace(com_industry, '-', '')) = 1 then substring_index(com_industry, '-', -1)
    else substring_index(substring_index(com_industry, '-', 2), '-', -1) end as com_industry2,
    case
    when length(com_industry) - length(replace(com_industry, '-', '')) <= 1 then null
    when length(com_industry) - length(replace(com_industry, '-', '')) = 2 then substring_index(com_industry, '-', -1)
    else substring_index(substring_index(com_industry, '-', 3), '-', -1) end as com_industry3,
    case
    when length(com_industry) - length(replace(com_industry, '-', '')) <= 2 then null
    else substring_index(com_industry, '-', -1) end as com_industry4
    from etl1_socom_data a
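The splitting rule is easier to follow outside of nested CASE expressions. Below is a Python sketch of the same idea, not a line-by-line port of the script: it assumes at most four levels and, as a simplification, treats every empty segment as missing. The function name and sample values are hypothetical:

```python
from typing import List, Optional


def split_industry(com_industry: Optional[str], levels: int = 4) -> List[Optional[str]]:
    """Split on '-' into a fixed number of levels, padding with None."""
    if not com_industry:
        # Null or empty input: every level is missing.
        return [None] * levels
    parts = com_industry.split("-")[:levels]
    # An empty segment (for example after a trailing '-') counts as missing.
    cleaned = [part if part else None for part in parts]
    return cleaned + [None] * (levels - len(cleaned))


# Made-up values for demonstration only.
print(split_industry("A-B-C-D"))
print(split_industry("A-"))
print(split_industry("A-B"))
```

Comparing a handful of rows against such a helper is a quick way to spot off-by-one mistakes in the CASE branches.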

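4. Quality check of the cleaning results

4.1 Check that the crawler's source data matches the website data.

If you handle both the crawling and the data processing yourself, this step can be skipped. If the data comes from an upstream colleague who runs the crawler, verify this first; otherwise the cleaning is wasted effort. The crawler should normally store the requested URL so that data quality can be checked later during processing.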

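4.2 Compare the row counts of the crawler source table and the ETL-cleaned tables.

Note: because the SQL scripts do no aggregation or filtering, the row counts of the three tables should be equal.

4.2.1 The SQL query is shown below. My tables are all in the same database; if yours are not, prefix each table name with the name of its database.

This approach is not recommended when the data volume is large.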

    select count(1) from s_socom_data union all
    select count(1) from etl1_socom_data union all
    select count(1) from etl2_socom_data
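The comparison can also be automated. The sketch below uses Python's built-in sqlite3 module with an in-memory database standing in for the real MySQL server, so the tables and rows are placeholders; against MySQL the same queries would run through whichever driver you already use:

```python
import sqlite3

TABLES = ["s_socom_data", "etl1_socom_data", "etl2_socom_data"]


def row_counts(conn, tables):
    """Return {table: row count}. Table names come from a fixed list, not user input."""
    return {t: conn.execute(f"select count(1) from {t}").fetchone()[0] for t in tables}


# In-memory stand-in tables with three placeholder rows each.
conn = sqlite3.connect(":memory:")
for table in TABLES:
    conn.execute(f"create table {table} (id integer)")
    conn.executemany(f"insert into {table} values (?)", [(1,), (2,), (3,)])

counts = row_counts(conn, TABLES)
print(counts)
# All three tables should report the same number of rows.
print("consistent" if len(set(counts.values())) == 1 else "mismatch")
```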

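4.2.2 Compare the total Table output row counts after the Kettle transformations finish

Total rows written by the Kettle Table output step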

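4.3 Check the ETL cleaning quality

Once the first two checks pass, whoever is responsible for the ETL cleaning starts a self-check by writing scripts against the fields that were cleaned. For the socom site the cleaning mainly concerns district and industry; the other fields only had redundant label text replaced, so a script-based check is used.

Find a page_url to check and open the corresponding page on the website.

A where clause like the following makes it easy to check how the fields were cleaned: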

    select *
    from etl2_socom_data
    where com_district is null and length(com_industry) - length(replace(com_industry, '-', '')) = 3
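To try the spot check without a MySQL server, the same filter can be run against a throwaway SQLite table. This is only a sketch: the rows and URLs are invented, and the table is reduced to the three columns the check needs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table etl2_socom_data (page_url text, com_district text, com_industry text)"
)
# Invented rows: only the first one has no district and three '-' delimiters.
conn.executemany(
    "insert into etl2_socom_data values (?, ?, ?)",
    [
        ("http://example.com/1", None, "A-B-C-D"),
        ("http://example.com/2", "X", "A-B"),
        ("http://example.com/3", None, "A-B"),
    ],
)

rows = conn.execute(
    "select page_url, com_industry from etl2_socom_data "
    "where com_district is null "
    "and length(com_industry) - length(replace(com_industry, '-', '')) = 3"
).fetchall()
print(rows)
```

Each returned page_url can then be opened in a browser and compared field by field with the cleaned row.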

Compare the data on that page with the final cleaned data in the etl2_socom_data table.

Website page data

etl2_socom_data table data

The cleaning work is complete.