HDU-4825 Xor Sum（OJ 优化实验）

2016年07月25日 |

acm

| acm

题目：
解析&吐槽：
总结
16 年 8 月 20 日更新

题目：

Xor Sum
Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 132768/132768 K (Java/Others)
Total Submission(s): 1330 Accepted Submission(s): 561

Problem Description
Zeus 和 Prometheus 做了一个游戏，Prometheus 给 Zeus 一个集合，集合中包含了 N 个正整数，随后 Prometheus 将向 Zeus 发起 M 次询问，每次询问中包含一个正整数 S，之后 Zeus 需要在集合当中找出一个正整数 K，使得 K 与 S 的异或结果最大。Prometheus 为了让 Zeus 看到人类的伟大，随即同意 Zeus 可以向人类求助。你能证明人类的智慧么？

Input
输入包含若干组测试数据，每组测试数据包含若干行。
输入的第一行是一个整数 T（T < 10），表示共有 T 组数据。
每组数据的第一行输入两个正整数 N，M（<1=N,M<=100000），接下来一行，包含 N 个正整数，代表 Zeus 的获得的集合，之后 M 行，每行一个正整数 S，代表 Prometheus 询问的正整数。所有正整数均不超过 2^32。

Output
对于每组数据，首先需要输出单独一行”Case #?:”，其中问号处应填入当前的数据组数，组数从 1 开始计算。
对于每个询问，输出一个正整数 K，使得 K 与 S 异或值最大。

Sample Input

2
3 2
3 4 5
1
5
4 1
4 6 5 6
3

Sample Output

Case #1:
4
3
Case #2:
4

解析&吐槽：

这次的主题是优化实验，就把代码放在下面了。

这道题用字典树就可以做，就是道水题，但是我在这道题上面试验了很多优化代码的方法，把运行时间从 951ms 降到了 280ms。

提交语言：c++ vs g++

这两个是用不同编译器编译的。顾名思义，g++指的是 GNU G++；而 c++则是 VC++，微软的编译器。大部分的时候，用哪个编译器都是一样的，但是会在运行时间和内存占用上有不少的区别，很多时候这个区别就是你是否能 AC 的关键。在我们学校的 acm 群中，不止一次有同学提到有些题用 c++交可以过，但是 g++就会爆内存。我在这道题中遇到的问题则是 c++超时，g++勉强能过。贴上第一次 ac 的代码，耗时：951ms，内存占用：56840KB。为节约空间起见我就不写 include 和 namespace 了。

struct ct
{
    ct *L[2];
    ct() {
        L[0] = L[1] = nullptr;
    }
    ct(const unsigned long &s,int i) {
        L[0] = L[1] = nullptr;
        if (i>=32)
            return;

        this->ins(s,i);
    }
    //~ct() {
    //	for (int i = 0;i < 2;++i)
    //		if (L[i])
    //			delete L[i];
    //}
    void ins(const unsigned long& s,int i) {
        if (i < 32) {
            ct*& nw = L[(s >> (31 - i)) & 1];
            if (nw)
                nw->ins(s,i+1);
            else
                nw = new ct(s,i+1);
        }
    }
};

unsigned long query(ct* c, const unsigned long& s, int i, const unsigned long& have) {
    if (i >= 32)
        return have;

    ct* nw = c->L[(s >> (31 - i)) & 1];
    if (nw)
        return query(nw, s, i + 1, (have << 1) + ((s >> (31 - i)) & 1));
    else
        return query(c->L[1 - ((s >> (31 - i)) & 1)],
            s, i + 1, (have << 1) + 1 - ((s >> (31 - i)) & 1));
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(NULL);

    int T;
    cin >> T;

    for (int k = 1;k <= T;++k)
    {
        cout << "Case #" << k << ":" << endl;
        int n, m;
        cin >> n >> m;

        ct t;

        while (n--)
        {
            unsigned long a;
            cin >> a;

            t.ins(a, 0);
        }

        while (m--)
        {
            unsigned long a;
            cin >> a;

            //bitset<32> b(~a);
            //string bstr = b.to_string();
            //cout << "debug: " << query(&t, string(32 - bstr.size(), '0') + bstr, "") << endl; //debug
            //b = bitset<32>(query(&t, string(32 - bstr.size(), '0') + bstr, 0, ""));

            cout << query(&t, ~a, 0, 0) << endl;
        }
    }
}

在这里我本来是用 bitset 把数字转换成 2 进制的字符串然后让函数处理的，结果频繁地创建字符串严重降低了代码的运行效率，最后还是改用位运算了。

堆分配 vs 栈分配

在前面的代码中，我通过 new 在堆上分配内存，而这种行为本来就很慢，因为堆中的内存不是连续的，如果程序想要分配一段内存，就不得不在堆上寻找可以容纳这个对象的空间。上面的代码中我注释掉了释放内存的代码，节省了释放时间，但是仍然差点超时。不过和后面相比内存倒是没超过多少，真是神奇。

然后我把代码改成了在栈上分配空间，一开始开一个数组来存储所有的对象，用 idx 表示数组的尾部，每次将新产生的对象插入到 idx 的位置，分配空间的时间复杂度降为了 O(1)。改动的内容比较少，只要把指针改成索引，new 改成++idx 即可。

此外还必须注意一些语法的问题，数组的声明必须放在 struct 的完整声明之后，否则编译器无法推断出数组的大小，会报错，因此我们只好把用到数组的 ins 函数的声明放进 struct 里面，定义放在数组声明后面。还有我们不知道最多需要多少节点，所以只好把数组尽量往大了开了。尽管这道题只需要二叉树，但是我们没法用二叉树组的方法（就是子节点索引等于父节点索引*2 加 1 或者 0 的那个），因为存储的是 32 位的无符号整数，二叉树至少需要 33 层，就需要 2^33 个节点，我们开不到这么大的数组。

此代码 c++仍然超时，但 g++耗时为 577ms，内存占用：32960KB，效率有了巨大的提高。

struct ct
{
    int L[2];
    ct() {
        L[0] = L[1] = 0;
    }
    ct(const unsigned long &s, int i) {
        L[0] = L[1] = 0;
        if (i >= 32)
            return;

        this->ins(s, i);
    }
    void ins(const unsigned long& s, int i);
};

ct heap[1000000 << 2];
int idx = 1;

inline void ct::ins(const unsigned long& s, int i) {
    if (i < 32) {
        int& nw = L[(s >> (31 - i)) & 1];
        if (nw) {
            heap[nw].ins(s, i + 1);
        }
        else {
            nw = idx;
            ++idx;
            heap[nw] = ct(s, i + 1);
        }
    }
}

unsigned long query(ct* c, const unsigned long& s, int i, const unsigned long& have) {
    if (i >= 32)
        return have;

    int nw = c->L[(s >> (31 - i)) & 1];
    if (nw)
        return query(&heap[nw], s, i + 1, (have << 1) + ((s >> (31 - i)) & 1));
    else
        return query(&heap[c->L[1 - ((s >> (31 - i)) & 1)]],
            s, i + 1, (have << 1) + 1 - ((s >> (31 - i)) & 1));
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(NULL);

    int T;
    cin >> T;

    for (int k = 1;k <= T;++k)
    {
        cout << "Case #" << k << ":" << endl;
        int n, m;
        cin >> n >> m;
        idx = 1;
        heap[0] = ct();

        while (n--)
        {
            unsigned long a;
            cin >> a;

            heap[0].ins(a, 0);
        }

        while (m--)
        {
            unsigned long a;
            cin >> a;

            cout << query(heap, ~a, 0, 0) << endl;
        }
    }
}

递归 vs 非递归

可以看到，前面的代码是使用递归来处理数据的，如果改成循环会怎么样呢？我们做一下实验。

struct ct
{
    int L[2];
    ct() {
        L[0] = L[1] = 0;
    }
};

ct heap[1000000 << 2];
int idx = 1;

inline void insert(const unsigned long& s) {
    ct *c = heap;
    for (int i = 0;i < 32;++i) {
        int& nw = c->L[(s >> (31 - i)) & 1];
        if (!nw) {
            nw = idx;
            ++idx;
            heap[nw] = ct();
        }
        c = &heap[nw];
    }
}

inline unsigned long query(const unsigned long& s) {
    unsigned have = 0;
    int i = 0;
    ct *c = heap;
    while (i < 32) {
        int nw = c->L[(s >> (31 - i)) & 1];
        if (nw) {
            c = &heap[nw];
            have = (have << 1) + ((s >> (31 - i)) & 1);
        }
        else {
            c = &heap[c->L[1 - ((s >> (31 - i)) & 1)]];
            have = (have << 1) + 1 - ((s >> (31 - i)) & 1);
        }
        ++i;
    }
    return have;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(NULL);

    int T;
    cin >> T;

    for (int k = 1;k <= T;++k)
    {
        cout << "Case #" << k << ":" << endl;
        int n, m;
        cin >> n >> m;
        idx = 1;
        heap[0] = ct();

        while (n--)
        {
            unsigned long a;
            cin >> a;

            insert(a);
        }

        while (m--)
        {
            unsigned long a;
            cin >> a;

            cout << query(~a) << endl;
        }
    }
}

代码使用 g++编译，耗时 405 MS，内存消耗 32936 KB。c++仍然超时。可见把递归改成循环可以提高一定的效率，但在本题效果不大（只减少了 172ms）

cin&cout vs scanf&printf

众所周知，cin 的效率比 scanf 低了不止一点，因为 cin 会与 c 语言的输入输出函数自动同步，每次 cin 读取都会刷新缓冲区来防止其与 scanf 混合使用时产生错误。但我们可以通过下列语句来取消同步：

1 2	ios::sync_with_stdio(false); cin.tie(NULL);

加上这个语句之后大部分只有 scanf 能过的题 cin 也能过了，但是仍然有少部分题会卡时间。这里，我们把输入改成 scanf，输出仍然使用 cout，但是不使用 endl 来刷新缓冲区，而是输出 \n 来换行。本来 oj 的输入和输出就是固定的，什么时候刷新输出缓冲区对输入没有影响，这么做之后程序会在结束的时候刷新缓冲区输出内容。

| 编译器 | 耗时 | 内存消耗 |
| g++ | 327 | 32932 |
| c++ | 561 | 33096 |

顺带一提，如果不加上那两行代码，使用 endl 来换行的话，g++耗时是 561ms。

在全部改成 scanf 和 printf 之后，耗时如下：

| 编译器 | 耗时 | 内存消耗 |
| g++ | 280 | 32708 |
| c++ | 374 | 33096 |

总结

貌似 c++对于递归的优化不大好，在这道题使用递归时 c++永远是超时的。其他时候 c++与 g++差不多，但效率总是低一些。

而堆分配内存会极大地拖慢运行速度，改成栈分配之后无论是分配效率还是删除效率都会提高很多，删除的时候只需要把索引重新移到 1 的位置即可。另外，结构体中存储的子节点位置不一定是索引，和堆内存时一样存储指针也是行的，索引只需要在分配一个新的对象时使用即可。

最终的代码：

  // oj.cpp : 定义控制台应用程序的入口点。
//

//#include <iostream>
#include <cstdio>
#include <limits>
#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <cstdlib>
#include <vector>
#include <sstream>
#include <cstring>
#include <cctype>
#include <cmath>
#include <functional>
#include <deque>
#include <stack>
#include <bitset>

using namespace std;

struct ct
{
    int L[2];
    ct() {
        L[0] = L[1] = 0;
    }
};

ct heap[1000000 << 2];
int idx = 1;

inline void insert(const unsigned long& s) {
    ct *c = heap;
    for (int i = 0;i < 32;++i) {
        int& nw = c->L[(s >> (31 - i)) & 1];
        if (!nw) {
            nw = idx;
            ++idx;
            heap[nw] = ct();
        }
        c = &heap[nw];
    }
}

inline unsigned long query(const unsigned long& s) {
    unsigned have = 0;
    int i = 0;
    ct *c = heap;
    while (i < 32) {
        int nw = c->L[(s >> (31 - i)) & 1];
        if (nw) {
            c = &heap[nw];
            have = (have << 1) + ((s >> (31 - i)) & 1);
        }
        else {
            c = &heap[c->L[1 - ((s >> (31 - i)) & 1)]];
            have = (have << 1) + 1 - ((s >> (31 - i)) & 1);
        }
        ++i;
    }
    return have;
}

int main() {
//	ios::sync_with_stdio(false);
//	cin.tie(NULL);

    int T;
    scanf("%d", &T);

    for (int k = 1;k <= T;++k)
    {
        //cout << "Case #" << k << ":" << endl;
        printf("Case #%d:\n", k);
        int n, m;
        scanf("%d%d", &n, &m);
        idx = 1;
        heap[0] = ct();

        while (n--)
        {
            unsigned long a;
            scanf("%lu", &a);

            insert(a);
        }

        while (m--)
        {
            unsigned long a;
            scanf("%lu", &a);

            //cout << query(~a) << endl;
            printf("%lu\n", query(~a));
        }
    }
}

16 年 8 月 20 日更新

最近发现了一个新的提升读取速度的方法：fread。

这是一个 c 语言函数，具体作用是把 n 个字符作为字符串一次性读入。根据这里的说法，它可以把读取时间效率再提高十倍。

在那篇文章中，fread 被用来读取文件，由于 oj 实际上是把文件重定向到标准 io 来实现输入的，所以我们必须把它再映射回去才能实现从标准 io 输入。然后我们再手动把字符串转换成数字即可，这里的转换操作最好不要使用 stringstream ，因为创建字符串流的过程会将字符串中的内容复制进一个 string 中，这个过程涉及到拷贝和堆的内存分配，耗时很多，还会多占用 2 倍的内存。

以下是我的实际处理过程：

1
2
3

const int MAXBUF = 1000000 << 2;
char buf[MAXBUF];
char* startpoint = buf; //字符串的读取起点

buf 是一个字符数组，用来存储读入的数据，程序开始的时候会把所有输入都读取到这个地方。 startpoint 表示读取的起点，下面的函数会使用它。

template<typename T>
inline void sread(T &res)
{
    res = 0;
    for (;*startpoint;++startpoint)
    {
        if (*startpoint == ' ' || *startpoint == '\n')
        {
            ++startpoint;
            return;
        }
        else
            res = res * 10 + *startpoint - '0';
    }
}

这个函数用来在 buf 数组中读取数据，类似于 scanf，每次调用会从 buf 中读取一个数据填充到 res 中，并且以空白符为分隔。但是这里每次只能读取一个数据，而且只能读取十进制数字。这个函数使用指针 startpoint 来表示读取的起点，每次从这里读取并且将 startpoint 设置到下一个数据的开头（目前还没法处理连续两个分隔符的情况）。

在 main 函数中的读取操作：

int len = fread(buf, 1, MAXBUF, stdin);
if (len >= MAXBUF)
    throw int();
buf[len] = '\0';

fread 函数的第一个参数是读取到的数据的存储位置，第二个是读取的单个数据的字节数，由于是 char，所以这里填 1；第三个是最多读取多少数据，这里填上数组的大小，总不能越界不是(´･ω･`)。第四个参数是重点：输入流。如果从文件读取的话，这里要填上一个 FILE 指针，但我们要从标准 io 读入，所以写 stdin——标准输入流。当 fread 读满 MAXBUF 个字符或者读取到 EOF 的时候（别忘了 oj 是从文件映射到标准输入的，所以最后一定会有 EOF），就会终止输入，并且返回读取的字符数。

后面的两行是为了确保 fread 读取了全部的输入。如果 buf 被读取满了的话，很可能没有完整读入所有的输入，这里会抛出一个异常，在 oj 的结果页面可以看到一个 RE（如果你用的是 PC^2，很可能会得到一个 CE，貌似有些情况下 pc^2 会把 RE 当成 CE），这总比因为读入不完整而得到其他的奇怪结果要强得多，问题是很容易定位的。

然后把最后一个位置赋成 '\0' ，表示字符串的结尾，这样对于以 EOF 结束输入的题比较有用。

最后使用 sread 读取就可以了。

经过测试，现在的耗时是 234ms，又缩短了 46ms。

最终代码（应该是吧）：

// oj.cpp : ??????????????
//

//#include <iostream>
#include <cstdio>
#include <limits>
#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <cstdlib>
#include <vector>
#include <sstream>
#include <cstring>
#include <cctype>
#include <cmath>
#include <functional>
#include <deque>
#include <stack>
#include <bitset>

using namespace std;

struct ct
{
    int L[2];
    ct() {
        L[0] = L[1] = 0;
    }
};

ct heap[1000000 << 2];
int idx = 1;

inline void insert(const unsigned long& s) {
    ct *c = heap;
    for (int i = 0;i < 32;++i) {
        int& nw = c->L[(s >> (31 - i)) & 1];
        if (!nw) {
            nw = idx;
            ++idx;
            heap[nw] = ct();
        }
        c = &heap[nw];
    }
}

inline unsigned long query(const unsigned long& s) {
    unsigned have = 0;
    int i = 0;
    ct *c = heap;
    while (i < 32) {
        int nw = c->L[(s >> (31 - i)) & 1];
        if (nw) {
            c = &heap[nw];
            have = (have << 1) + ((s >> (31 - i)) & 1);
        }
        else {
            c = &heap[c->L[1 - ((s >> (31 - i)) & 1)]];
            have = (have << 1) + 1 - ((s >> (31 - i)) & 1);
        }
        ++i;
    }
    return have;
}

const int MAXBUF = 1000000 << 2;
char buf[MAXBUF];
char* startpoint = buf; //????????

template<typename T>
inline void sread(T &res)
{
    res = 0;
    for (;*startpoint;++startpoint)
    {
        if (*startpoint == ' ' || *startpoint == '\n')
        {
            ++startpoint;
            return;
        }
        else
            res = res * 10 + *startpoint - '0';
    }
}

int main() {
    //  ios::sync_with_stdio(false);
    //  cin.tie(NULL);

    int len = fread(buf, 1, MAXBUF, stdin);
    //int len = fread(buf, 1, MAXBUF, fopen("oj.in", "r")); //debug
    if (len >= MAXBUF)
        throw int();
    buf[len] = '\0';

    int T;
    sread(T);

    for (int k = 1;k <= T;++k)
    {
        //cout << "Case #" << k << ":" << endl;
        printf("Case #%d:\n", k);
        int n, m;
        sread(n);
        sread(m);
        idx = 1;
        heap[0] = ct();

        while (n--)
        {
            unsigned long a;
            sread(a);

            insert(a);
        }

        while (m--)
        {
            unsigned long a;
            sread(a);

            //cout << query(~a) << endl;
            printf("%lu\n", query(~a));
        }
    }
}

有趣的是，目前在 VJ 中这道题的所有提交里面，耗时最少和耗时最多的人都是我_(:зゝ∠)_

本作品采用署名-相同方式共享 4.0 国际进行许可。欢迎转载、使用、重新发布，但务必保留文章署名 “不科学的科学君” (Liu233w) 与博客链接： https://liu233w.github.io ，基于本文修改后的作品务必以相同的许可发布。如有任何疑问，请与我联系。