时间:2021-05-22
前言
数据分析时候,需要将数据进行加载和存储,本文主要介绍和excel的交互。
read_excel()
加载函数为read_excel(),其具体参数如下。
read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0, index_col=None,names=None, parse_cols=None, parse_dates=False,date_parser=None,na_values=None,thousands=None, convert_float=True, has_index_names=None, converters=None,dtype=None, true_values=None, false_values=None, engine=None, squeeze=False, **kwds)常用参数解析:
数据源:
sheet1:ID NUM-1 NUM-2 NUM-336901 142 168 66136902 78 521 60236903 144 600 52136904 95 457 46836905 69 596 695sheet2:ID NUM-1 NUM-2 NUM-336906 190 527 69136907 101 403 470(1)函数原型
basestation ="F://pythonBook_PyPDAM/data/test.xls"data = pd.read_excel(basestation)print data输出:是一个dataframe
ID NUM-1 NUM-2 NUM-30 36901 142 168 6611 36902 78 521 6022 36903 144 600 5213 36904 95 457 4684 36905 69 596 695(2) sheetname参数:返回多表使用sheetname=[0,1],若sheetname=None是返回全表 注意:int/string 返回的是dataframe,而none和list返回的是dict of dataframe
data_1 = pd.read_excel(basestation,sheetname=[0,1])print data_1print type(data_1)输出:dict of dataframe
OrderedDict([(0, ID NUM-1 NUM-2 NUM-30 36901 142 168 6611 36902 78 521 6022 36903 144 600 5213 36904 95 457 4684 36905 69 596 695), (1, ID NUM-1 NUM-2 NUM-30 36906 190 527 6911 36907 101 403 470)])(3)header参数:指定列名行,默认0,即取第一行,数据为列名行以下的数据 若数据不含列名,则设定 header = None ,注意这里还有列名的一行。
data = pd.read_excel(basestation,header=None)print data输出: 0 1 2 30 ID NUM-1 NUM-2 NUM-31 36901 142 168 6612 36902 78 521 6023 36903 144 600 5214 36904 95 457 4685 36905 69 596 695data = pd.read_excel(basestation,header=[3])print data输出: 36903 144 600 521 0 36904 95 457 4681 36905 69 596 695(4) skiprows 参数:省略指定行数的数据
data = pd.read_excel(basestation,skiprows = [1])print data输出: ID NUM-1 NUM-2 NUM-30 36902 78 521 6021 36903 144 600 5212 36904 95 457 4683 36905 69 596 695(5)skip_footer参数:省略从尾部数的int行的数据
data = pd.read_excel(basestation, skip_footer=3)print data输出: ID NUM-1 NUM-2 NUM-30 36901 142 168 6611 36902 78 521 602(6)index_col参数:指定列为索引列,也可以使用u”strings”
data = pd.read_excel(basestation, index_col="NUM-3")print data输出: ID NUM-1 NUM-2NUM-3 661 36901 142 168602 36902 78 521521 36903 144 600468 36904 95 457695 36905 69 596(7)names参数: 指定列的名字。
data = pd.read_excel(basestation,names=["a","b","c","e"])print data a b c e0 36901 142 168 6611 36902 78 521 6022 36903 144 600 5213 36904 95 457 4684 36905 69 596 695具体参数如下:
>>> print help(pandas.read_excel)Help on function read_excel in module pandas.io.excel:read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0, index_col=None, names=None, parse_cols=None, parse_dates=False, date_parser=None, na_values=None, thousands=None, convert_float=True, has_index_names=None, converters=None, dtype=None, true_values=None, false_values=None, engine=None, squeeze=False, **kwds) Read an Excel table into a pandas DataFrame Parameters ---------- io : string, path object (pathlib.Path or py._path.local.LocalPath), file-like object, pandas ExcelFile, or xlrd workbook. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/workbook.xlsx sheetname : string, int, mixed list of strings/ints, or None, default 0 Strings are used for sheet names, Integers are used in zero-indexed sheet positions. Lists of strings/integers are used to request multiple sheets. Specify None to get all sheets. str|int -> DataFrame is returned. list|None -> Dict of DataFrames is returned, with keys representing sheets. Available Cases * Defaults to 0 -> 1st sheet as a DataFrame * 1 -> 2nd sheet as a DataFrame * "Sheet1" -> 1st sheet as a DataFrame * [0,1,"Sheet5"] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames * None -> All sheets as a dictionary of DataFrames header : int, list of ints, default 0 Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a ``MultiIndex`` skiprows : list-like Rows to skip at the beginning (0-indexed) skip_footer : int, default 0 Rows at the end to skip (0-indexed) index_col : int, list of ints, default None Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a ``MultiIndex``. If a subset of data is selected with ``parse_cols``, index_col is based on the subset. names : array-like, default None List of column names to use. If file contains no header row, then you should explicitly pass header=None converters : dict, default None Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content. dtype : Type name or dict of column -> type, default None Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} Use `object` to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion. .. versionadded:: 0.20.0 true_values : list, default None Values to consider as True .. versionadded:: 0.19.0 false_values : list, default None Values to consider as False .. versionadded:: 0.19.0 parse_cols : int or list, default None * If None then parse all columns, * If int then indicates last column to be parsed * If list of ints then indicates list of column numbers to be parsed * If string then indicates comma separated list of Excel column letters and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of both sides. squeeze : boolean, default False If the parsed data only contains one column then return a Series na_values : scalar, str, list-like, or dict, default None Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'. thousands : str, default None Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format. keep_default_na : bool, default True If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they're appended to. verbose : boolean, default False Indicate number of NA values placed in non-numeric columns engine: string, default None If io is not a buffer or path, this must be set to identify io. Acceptable values are None or xlrd convert_float : boolean, default True convert integral floats to int (i.e., 1.0 --> 1). If False, all numeric data will be read in as floats: Excel stores all numbers as floats internally has_index_names : boolean, default None DEPRECATED: for version 0.17+ index names will be automatically inferred based on index_col. To read Excel output from 0.16.2 and prior that had saved index names, use True. Returnsto_excel()
存储函数为pd.DataFrame.to_excel(),注意,必须是DataFrame写入excel, 即Write DataFrame to an excel sheet。其具体参数如下:
to_excel(self, excel_writer, sheet_name='Sheet1', na_rep='', float_format=None,columns=None, header=True, index=True, index_label=None,startrow=0, startcol=0, engine=None, merge_cells=True, encoding=None,inf_rep='inf', verbose=True, freeze_panes=None)常用参数解析
数据源:
ID NUM-1 NUM-2 NUM-30 36901 142 168 6611 36902 78 521 6022 36903 144 600 5213 36904 95 457 4684 36905 69 596 6955 36906 165 453 加载数据:basestation ="F://python/data/test.xls"basestation_end ="F://python/data/test_end.xls"data = pd.read_excel(basestation)(1)参数excel_writer,输出路径。
data.to_excel(basestation_end)输出: ID NUM-1 NUM-2 NUM-30 36901 142 168 6611 36902 78 521 6022 36903 144 600 5213 36904 95 457 4684 36905 69 596 6955 36906 165 453(2)sheet_name,将数据存储在excel的那个sheet页面。
data.to_excel(basestation_end,sheet_name="sheet2")(3)na_rep,缺失值填充
data.to_excel(basestation_end,na_rep="NULL")输出: ID NUM-1 NUM-2 NUM-30 36901 142 168 6611 36902 78 521 6022 36903 144 600 5213 36904 95 457 4684 36905 69 596 6955 36906 165 453 NULL(4) colums参数: sequence, optional,Columns to write 选择输出的的列。
data.to_excel(basestation_end,columns=["ID"])输出 ID0 369011 369022 369033 369044 369055 36906(5)header 参数: boolean or list of string,默认为True,可以用list命名列的名字。header = False 则不输出题头。
data.to_excel(basestation_end,header=["a","b","c","d"])输出: a b c d0 36901 142 168 6611 36902 78 521 6022 36903 144 600 5213 36904 95 457 4684 36905 69 596 6955 36906 165 453 data.to_excel(basestation_end,header=False,columns=["ID"])header = False 则不输出题头输出:0 369011 369022 369033 369044 369055 36906(6)index : boolean, default True Write row names (index)
默认为True,显示index,当index=False 则不显示行索引(名字)。
index_label : string or sequence, default None
设置索引列的列名。
data.to_excel(basestation_end,index=False)输出:ID NUM-1 NUM-2 NUM-336901 142 168 66136902 78 521 60236903 144 600 52136904 95 457 46836905 69 596 69536906 165 453 data.to_excel(basestation_end,index_label=["f"])输出:f ID NUM-1 NUM-2 NUM-30 36901 142 168 6611 36902 78 521 6022 36903 144 600 5213 36904 95 457 4684 36905 69 596 6955 36906 165 453以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持。
声明:本页内容来源网络,仅供用户参考;我单位不保证亦不表示资料全面及准确无误,也不保证亦不表示这些资料为最新信息,如因任何原因,本网内容或者用户因倚赖本网内容造成任何损失或损害,我单位将不会负任何法律责任。如涉及版权问题,请提交至online#300.cn邮箱联系删除。
写在前面QQ群里偶然看到群友问这个问题,pandas读取大文件时怎么才能获取进度?我第一反应是:除非pandas的read_excel等函数提供了回调函
知乎上有同学求助说,当他试图打开一个20M左右的excel文件时,无论是使用pandas的read_excel,还是直接使用xlrd或者openpyxl模块,速
问题描述使用pandas库的read_excel()方法读取外部excel文件报错,截图如下好像是缺少了什么方法的样子问题分析分析个啥,水平有限,直接面向sta
今天展示一个利用pandas将json数据导入excel例子,主要利用的是pandas里的read_json函数将json数据转化为dataframe。先拿出我
Python中pandas.read_excel详细介绍#coding:utf-8importpandasaspdimportnumpyasnpfilefull