
Getting the sum of values from a CSV using pandas


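I want to sum all the values from the third column onward in each row and write the result to a new CSV file alongside the first and second columns. I think pandas would be the more efficient way to do this.

The values that can be added together are between 0 and 2.

If a value or character other than 0.5, 1, or 2 is present, it is ignored in the sum.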

Sample of the CSV file:

https://pastebin.com/WwDWqU3U

encounterId|chartTime|11885|67187|6711|6711|6710|1356|1357|1358|1359|1360|1361|1362|1366|140|140

325|2014-01-01 00:00:00|0
325|2014-01-01 01:00:00|0|0|0
325|2014-01-01 02:00:00|0
325|2014-01-01 03:00:00|0|0|0
325|2014-01-01 04:00:00|0
325|2014-01-01 05:00:00|1
325|2014-01-01 06:00:00|0|0|0
325|2014-01-01 07:00:00|1|0|0.5|1
325|2014-01-01 08:00:00|0
325|2014-01-01 09:00:00|1|0|0
325|2014-01-01 10:00:00|0
325|2014-01-01 11:00:00|1|0|0
325|2014-01-01 12:00:00|0
325|2014-01-01 13:00:00|0|0|0.5|1
325|2014-01-01 14:00:00|0
325|2014-01-01 15:00:00|0

What I'm looking for:

323|2013-06-03 00:00:00|0
323|2013-06-03 01:00:00|1
323|2013-06-03 02:00:00|1.5
323|2013-06-03 03:00:00|1.5
323|2013-06-03 04:00:00|0
323|2013-06-03 05:00:00|0.5
323|2013-06-03 06:00:00|0
323|2013-06-03 07:00:00|3.5
323|2013-06-03 08:00:00|0.5

I tried it without pandas and it gave me some strange results.

4 Answers

  • 1

    As suggested in the previous answer here, you can use sum and set the parameter axis=1.
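
    To illustrate the axis=1 suggestion, here is a minimal sketch. The small DataFrame is a made-up stand-in for the value columns of the CSV, not the asker's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the value columns of the CSV;
# NaN marks cells that are missing because a row had fewer fields.
df = pd.DataFrame({
    "A": [0, 1, 1],
    "B": [0, np.nan, 0],
    "C": [np.nan, np.nan, 0.5],
})

# axis=1 adds across the columns of each row; NaN cells are skipped by default.
row_totals = df.sum(axis=1)
print(row_totals.tolist())  # [0.0, 1.0, 1.5]
```

    Note that this alone does not drop out-of-range values; that filtering still has to happen before the sum.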

  • 1

    Use this:

    import pandas as pd
    from io import StringIO
    csvfile = StringIO("""323|2013-06-03 00:00:00|0|0|0
    323|2013-06-03 01:00:00|1|
    323|2013-06-03 02:00:00|1|0|0.5|86
    323|2013-06-03 03:00:00|1|0|0.5|0
    323|2013-06-03 04:00:00|0
    323|2013-06-03 05:00:00|0|0|0.5|0
    323|2013-06-03 06:00:00|0
    323|2013-06-03 07:00:00|1|0|0.5|2
    323|2013-06-03 08:00:00|0|0.5""")
    
    df = pd.read_csv(csvfile, sep='|', names=['ID','date','A','B','C','D'])
    
    df_out = df.set_index(['ID','date'])
    
    df_out.where((df_out>0) & (df_out<=2), 0)\
          .sum(1)\
          .reset_index()\
          .to_csv('outfile.csv', index=False, header=False)
    
    # Show the file that was just written
    print(open('outfile.csv').read())
    

    Output:

    323,2013-06-03 00:00:00,0.0
    323,2013-06-03 01:00:00,1.0
    323,2013-06-03 02:00:00,1.5
    323,2013-06-03 03:00:00,1.5
    323,2013-06-03 04:00:00,0.0
    323,2013-06-03 05:00:00,0.5
    323,2013-06-03 06:00:00,0.0
    323,2013-06-03 07:00:00,3.5
    323,2013-06-03 08:00:00,0.5
    
  • 1

    Note that when reading a CSV with a variable number of columns, pd.read_csv() raises an error unless you supply the column names up front. This should do it:

    import pandas as pd
    import numpy as np
    
    df = pd.read_csv('sample.txt', names=['Index','Date','Val1','Val2','Val3','Val4'], sep='|')
    
    df[df[['Val1','Val2','Val3','Val4']]>2] = np.nan
    
    df['Final'] = df.iloc[:,2:].sum(axis=1)
    
    df = df[['Index','Date','Final']]
    

    This gives:

       Index                 Date  Final
    0    323  2013-06-03 00:00:00    0.0
    1    323  2013-06-03 01:00:00    1.0
    2    323  2013-06-03 02:00:00    1.5
    3    323  2013-06-03 03:00:00    1.5
    4    323  2013-06-03 04:00:00    0.0
    5    323  2013-06-03 05:00:00    0.5
    6    323  2013-06-03 06:00:00    0.0
    7    323  2013-06-03 07:00:00    3.5
    8    323  2013-06-03 08:00:00    0.5
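
    The point about variable column counts can be checked with a small sketch like the one below. The two-row input is an assumption made up for illustration, and header=None is passed so that the first row is treated as data.

```python
from io import StringIO
import pandas as pd

# Hypothetical ragged input: the first row has 3 fields, the second has 5.
text = "325|2014-01-01 00:00:00|0\n325|2014-01-01 01:00:00|0|0|0\n"

# Without column names the parser sizes the table from the first row
# and fails when it meets the wider second row.
try:
    pd.read_csv(StringIO(text), sep="|", header=None)
    raised = False
except pd.errors.ParserError:
    raised = True

# With enough names supplied up front, short rows are padded with NaN instead.
names = ["Index", "Date", "Val1", "Val2", "Val3"]
df = pd.read_csv(StringIO(text), sep="|", names=names)
print(raised, df.shape)
```

    The names list has to be at least as long as the widest row in the file.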
    

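    Here is a more concise approach (it is very similar to @Scott Boston's answer, but avoids creating a separate DataFrame). Setting the first two columns of the CSV as the DataFrame's index lets you conditionally filter the rest of the DataFrame, which contains only float values: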

    df = pd.read_csv('sample.txt', names=['Index','Date','Val1','Val2','Val3','Val4'], sep='|').set_index(['Index','Date'])
    
    df['Final'] = df[(df>0) & (df<=2)].sum(axis=1)
    
    df.reset_index()[['Index','Date','Final']].to_csv('output.csv', index=False, header=False)
    

    This gives:

    323,2013-06-03 00:00:00,0.0
    323,2013-06-03 01:00:00,1.0
    323,2013-06-03 02:00:00,1.5
    323,2013-06-03 03:00:00,1.5
    323,2013-06-03 04:00:00,0.0
    323,2013-06-03 05:00:00,0.5
    323,2013-06-03 06:00:00,0.0
    323,2013-06-03 07:00:00,3.5
    323,2013-06-03 08:00:00,0.5
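
    The range filters used in the answers keep any value in (0, 2], so a stray 1.5 would also be summed. If only the exact values 0.5, 1, and 2 should count, as the question states, one possible variant is sketched below. The sample rows (including the non-numeric "x") and the names raw, values, allowed, and total are assumptions for illustration.

```python
from io import StringIO
import pandas as pd

# Hypothetical sample: 86 and "x" are not allowed values and should be ignored.
raw = StringIO(
    "323|2013-06-03 02:00:00|1|0|0.5|86\n"
    "323|2013-06-03 03:00:00|1|x|0.5\n"
    "323|2013-06-03 07:00:00|1|0|0.5|2\n"
)
cols = ["ID", "date", "A", "B", "C", "D"]

# Read everything as text so that stray characters do not break parsing.
df = pd.read_csv(raw, sep="|", names=cols, dtype=str).set_index(["ID", "date"])

# Convert each column to numbers; anything unparseable becomes NaN.
values = df.apply(pd.to_numeric, errors="coerce")

# Keep only the exact allowed values; everything else is masked to NaN,
# and sum() skips NaN by default.
allowed = [0.5, 1, 2]
total = values.where(values.isin(allowed)).sum(axis=1)
print(total.reset_index(name="total"))
```

    Zeros are masked out here as well, which does not change the totals.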
    
  • 0

    How about this?

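    # Row-wise sum of every column after the first two (NaN cells are skipped).
    # Assumes df was loaded with pd.read_csv(..., sep='|', names=[...]) as in the other answers;
    # note that this does not filter out values outside the allowed range.
    df['Total'] = df.iloc[:, 2:].sum(axis=1)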
    
