R: how to remove inconsistent values from a sequence

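As described in a previous question, I collected data on plant development, or phenology (coded with the categorical variable 'Code'), every five days along a transect divided into 78 consecutive segments. Each species was surveyed in every segment of the transect.

Another problem I did not anticipate when collecting the data is that observers may sometimes miss an observation in the field, which affects the code they choose, or they may simply make a typo. Specifically, the codes they used are: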

b1 = single flower
b2 = sparse flowers (two or three)
b3 = flowers common (more than three)
b4 = flowering ended

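The expected (simplified) sequence of observations over time looks like 'b1', 'b2', 'b3', 'b2', 'b1', 'b4'. Note that several sample dates can share the same observation, so the data may look like 'b1', 'b1', 'b2', 'b3', 'b3', 'b2', 'b2', 'b2', 'b1', 'b1', 'b4'.

Unfortunately, I have found many examples where the sequence looks like this: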
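To make the expected pattern concrete, here is a minimal sketch of a check for it. The helper name `is_expected_sequence` is hypothetical (it is not part of the original question), and it assumes that codes are strings such as "b1" to "b4", that the numeric part rises to a peak and then falls, and that "b4" may only appear as the final observation.

```r
# Hypothetical helper: TRUE if the codes rise to a peak and then fall,
# optionally ending with "b4" (flowering ended).
is_expected_sequence <- function(codes) {
  x <- as.numeric(substring(codes, 2))   # "b2" -> 2
  n <- length(x)
  if (n > 0 && x[n] == 4) x <- x[-n]     # drop a trailing b4
  if (any(x == 4)) return(FALSE)         # assumption: b4 is only valid at the end
  steps <- sign(diff(x))                 # +1 rising, -1 falling, 0 repeated
  steps <- steps[steps != 0]             # repeated observations are allowed
  # invalid if the direction ever switches from falling back to rising
  !any(diff(steps) > 0)
}

is_expected_sequence(c("b1", "b1", "b2", "b3", "b3", "b2", "b2", "b2", "b1", "b1", "b4"))
is_expected_sequence(c("b1", "b2", "b3", "b2", "b3", "b2", "b1", "b4"))
```

The first call describes a sequence that follows the pattern; the second contains a 'b2' between two 'b3' observations, so the check fails.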

Date    Segment Species Code
01-Jun-17   1   A   b1
06-Jun-17   1   A   b1
10-Jun-17   1   A   b2
14-Jun-17   1   A   b2
19-Jun-17   1   A   b3
23-Jun-17   1   A   b3
28-Jun-17   1   A   b2 # out of sequence - assume it should be b3
02-Aug-17   1   A   b3
07-Aug-17   1   A   b2 # out of sequence - assume it should be b3
12-Aug-17   1   A   b3
17-Aug-17   1   A   b2
22-Aug-17   1   A   b1 # out of sequence - assume it should be b2
27-Aug-17   1   A   b2 
02-Sep-17   1   A   b1
07-Sep-17   1   A   b4

It should look like:

Date    Segment Species Code
01-Jun-17   1   A   b1
06-Jun-17   1   A   b1
10-Jun-17   1   A   b2
14-Jun-17   1   A   b2
19-Jun-17   1   A   b3
23-Jun-17   1   A   b3
28-Jun-17   1   A   b3
02-Aug-17   1   A   b3
07-Aug-17   1   A   b3
12-Aug-17   1   A   b3
17-Aug-17   1   A   b2
22-Aug-17   1   A   b2
27-Aug-17   1   A   b2
02-Sep-17   1   A   b1
07-Sep-17   1   A   b4

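Assuming we cannot know whether the observer missed a flowering plant or there was a typo in the dataset, a more conservative approach is to discard the first out-of-sequence value. So, how do I delete the first value that is out of sequence each time a sequence error occurs? In that case the dataset would look like: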

Date    Segment Species Code
01-Jun-17   1   A   b1
06-Jun-17   1   A   b1
10-Jun-17   1   A   b2
14-Jun-17   1   A   b2
19-Jun-17   1   A   b3
23-Jun-17   1   A   b3
02-Aug-17   1   A   b3
12-Aug-17   1   A   b3
17-Aug-17   1   A   b2
27-Aug-17   1   A   b2
02-Sep-17   1   A   b1
07-Sep-17   1   A   b4

Here is code that creates the example data:

Test.Data <- structure(list(Date = structure(c(17318, 17323, 17327,
17331, 17336, 17340, 17345, 17380, 17385, 17390, 17395, 17400, 17405, 
17411, 17416, 17318, 17323, 17327, 17331, 17336, 17340, 17345, 
17380, 17385, 17390, 17395, 17400, 17405, 17411, 17416, 17318, 
17323, 17327, 17331, 17336, 17340, 17345, 17380, 17385, 17390, 
17395, 17400, 17405, 17411, 17416), class = "Date"), Segment = c(1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2), Species = c("A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A"), Code = c("b1", 
"b1", "b2", "b2", "b3", "b3", "b2", "b3", "b2", "b3", "b2", "b1", 
"b2", "b1", "b4", "b1", "b1", "b2", "b2", "b3", "b3", "b2", "b3", 
"b2", "b3", "b2", "b1", "b2", "b1", "b4", "b1", "b1", "b2", "b2", 
"b3", "b3", "b2", "b3", "b2", "b3", "b2", "b1", "b2", "b1", "b4"
)), .Names = c("Date", "Segment", "Species", "Code"), row.names = c(NA, 
-45L), class = "data.frame")

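Of course, this assumes that the first observation of a flowering event for a particular species (i.e. 'b1', 'b2', 'b3', 'b4') is correct!

Note: this question reflects my wish to recode my dataset to overcome the shortcomings of the original study's coding system (see question). Had I thought about how the data would be used before the season, I would have used a coding system like this: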

b1a = single flower
b2a = sparse flowers (two or three)
b3 = flowers common (more than three)
b2b = sparse flowers (two or three)
b1b = single flower
b4 = flowering ended

Either way, I still need to overcome the coding problem in the historical dataset!
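As an illustration of the proposed coding system, the sketch below maps an already cleaned sequence onto the suffixed codes. The function name `recode_with_phase` is hypothetical, and the rule it applies is an assumption: 'b1' and 'b2' get the suffix "a" up to the last observation of the peak value and "b" afterwards, while 'b3' and 'b4' are left unchanged.

```r
# Hypothetical helper: add "a"/"b" phase suffixes to b1 and b2.
# Assumes the input already rises to a peak and then falls.
recode_with_phase <- function(codes) {
  x <- as.numeric(substring(codes, 2))      # "b2" -> 2
  flowering <- x < 4                        # everything except "flowering ended"
  if (!any(flowering)) return(codes)        # nothing to recode
  peak <- max(x[flowering])
  last_peak <- max(which(flowering & x == peak))
  idx <- seq_along(x)
  # b3 and b4 keep their code; b1/b2 are split into before ("a") and after ("b")
  suffix <- ifelse(x >= 3, "", ifelse(idx <= last_peak, "a", "b"))
  paste0("b", x, suffix)
}

recode_with_phase(c("b1", "b2", "b3", "b2", "b1", "b4"))
```

For the simplified sequence this gives 'b1a', 'b2a', 'b3', 'b2b', 'b1b', 'b4'. Applying it to uncleaned data would not be meaningful, since the position of the peak decides the suffix.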

1 Answer

    Here is one possibility that relies on cummax.

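    # assumption: 'd' holds a single Segment/Species group,
    # here the first 15 rows of Test.Data (Segment 1, Species A)
    d <- Test.Data[1:15, ]

    # extract numbers from 'Code', except the last, which I assume is always 4
    x <- as.numeric(substring(d$Code[-length(d$Code)], 2))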
    
    # find index of first max
    ix <- which.max(x == max(x))
    
    # find cumulative max on
    # (1) x from index 1 to ix
    # (2) x from end to index ix + 1
    # reverse (2)
    # concatenate (1), (2) and a 4
    d$Code2 <- c(cummax(x[1:ix]), rev(cummax(x[length(x):(ix + 1)])), 4)
    
    d[ , c("Code", "Code2")]
       Code Code2
    1    b1     1
    2    b1     1
    3    b2     2
    4    b2     2
    5    b3     3
    6    b3     3
    7    b2     3
    8    b3     3
    9    b2     3
    10   b3     3
    11   b2     2
    12   b1     2
    13   b2     2
    14   b1     1
    15   b4     4
    

    To do this by 'Segment' and 'Species', you can use, for example, data.table.

    library(data.table)
    setDT(Test.Data)
    Test.Data[ , Code2 := {
      x = as.numeric(substring(Code[-.N], 2))
      ix = which.max(x == max(x))
      .(paste0("b", c(cummax(x[1:ix]), rev(cummax(x[length(x):(ix + 1)])), 4)))
    },
    by = .(Segment, Species)]
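
The code above recodes the out-of-sequence values rather than deleting them. If the rows should be dropped instead, one possible sketch in base R is to keep only the rows where the observed code agrees with the smoothed one. The names `smooth_codes`, `obs` and `cleaned` are hypothetical; the helper wraps the same cummax idea and, like it, assumes each group ends with 'b4'. Note that this drops every row that disagrees with the smoothed code, which can be more than just the first value of a run of errors.

```r
# Hypothetical helper: the cummax approach for one Segment/Species group.
# Assumes the last observation of the group is "b4".
smooth_codes <- function(code) {
  n <- length(code)
  if (n < 2) return(code)                     # guard for empty or trivial groups
  x <- as.numeric(substring(code[-n], 2))     # numeric codes without the final b4
  ix <- which.max(x == max(x))                # index of the first peak
  rising <- cummax(x[seq_len(ix)])
  falling <- if (ix < length(x)) rev(cummax(x[length(x):(ix + 1)])) else numeric(0)
  paste0("b", c(rising, falling, 4))
}

# Example data: Segment 1, Species A from the question
obs <- data.frame(
  Date = as.Date(c(17318, 17323, 17327, 17331, 17336, 17340, 17345, 17380,
                   17385, 17390, 17395, 17400, 17405, 17411, 17416),
                 origin = "1970-01-01"),
  Segment = 1,
  Species = "A",
  Code = c("b1", "b1", "b2", "b2", "b3", "b3", "b2", "b3",
           "b2", "b3", "b2", "b1", "b2", "b1", "b4"),
  stringsAsFactors = FALSE
)

# Smoothed code per group, then keep only rows that agree with it
obs$Code2 <- ave(obs$Code, obs$Segment, obs$Species, FUN = smooth_codes)
cleaned <- obs[obs$Code == obs$Code2, c("Date", "Segment", "Species", "Code")]
cleaned
```

With the data.table result from above, the equivalent filter would be `Test.Data[Code == Code2]`.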
    
