A Record of a Failed DataHub Installation

Background

Reference: https://blog.csdn.net/ddxygq/article/details/123437072

DataHub is a metadata search and discovery tool open-sourced by LinkedIn's data team.

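Mention LinkedIn and Kafka inevitably comes to mind, since Kafka was also open-sourced by LinkedIn. Kafka directly shaped the development of the entire real-time computing field, and LinkedIn's data team has kept exploring data governance as well, steadily extending its infrastructure to meet the needs of a growing big data ecosystem. As data grows in volume and richness, it becomes increasingly challenging for data scientists and engineers to discover the available data assets, understand their provenance, and take appropriate action based on insights. To keep scaling productivity and data innovation alongside that growth, the team created DataHub, a general-purpose metadata search and discovery tool.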

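As a new-generation metadata management platform, DataHub looks set to replace Atlas, the long-established metadata management tool. Note that Alibaba Cloud also has a product named DataHub, which is a stream processing platform; the DataHub described in this article is unrelated to it.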

Common metadata management systems on the market include the following:

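I had earlier picked up a free Amazon EC2 server and tried installing DataHub on it by following the tutorial linked above. The system ships with Python 2.7 and Python 3.7 by default.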

Installing DataHub

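1. I ended up installing Python 3.8. I had not installed 3.8 at first, but after installing DataHub on the default Python 3.7, checking the version number raised the first error below.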

python3 -m datahub version
DataHub CLI version: 0.10.0.1
Python version: 3.7.16 (default, Dec 15 2022, 23:24:54) 
[GCC 7.3.1 20180712 (Red Hat 7.3.1-15)]
Exception ignored in: <generator object configure_logging at 0x7f8fcca5a050>
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.7/site-packages/datahub/utilities/logging_manager.py", line 187, in configure_logging
  File "/usr/lib64/python3.7/contextlib.py", line 486, in __exit__
AttributeError: 'NoneType' object has no attribute 'exc_info'

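2. I could not find a suitable fix for the error above, so I simply upgraded Python to 3.8, and that opened up a whole series of pitfalls. The second error, shown below, occurred while building Python 3.8 from source.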

    wget https://www.python.org/ftp/python/3.8.3/Python-3.8.3.tgz

    tar zxvf Python-3.8.3.tgz

    cd Python-3.8.3

    ./configure --prefix=/usr/lib/python3.8
    checking build system type... x86_64-pc-linux-gnu
    checking host system type... x86_64-pc-linux-gnu
    checking for python3.8... no
    checking for python3... python3
    checking for --enable-universalsdk... no
    checking for --with-universal-archs... no
    checking MACHDEP... "linux"
    checking for gcc... no
    checking for cc... no
    checking for cl.exe... no
    configure: error: in `/home/ec2-user/Python-3.8.3':
    configure: error: no acceptable C compiler found in $PATH
    See `config.log' for more details

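Installing gcc resolved the problem above. Because EC2 restricts permissions, prefix commands with sudo whenever possible when running as a non-root user.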

    sudo yum -y install gcc-c++

    sudo make && sudo make install

3. Then make install ran into the third problem, shown below, which was resolved by installing zlib-devel.

zipimport.ZipImportError: can't decompress data; zlib not available

    sudo yum install zlib-devel

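After the build finished and I repointed the python3 symlink with ln -s, pip3 operations started failing with the errors below. Installing openssl and openssl-devel fixed this, but Python 3.8 had to be rebuilt, re-running the earlier commands starting from ./configure.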

python3 -m pip install --upgrade pip wheel setuptools

WARNING: pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/pip/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/pip/

    sudo yum -y install openssl openssl-devel

    python3 -m pip install --upgrade acryl-datahub

4. Running the DataHub installation again then hit the problem below, which was resolved by installing libffi-devel.

  File "/usr/lib/python3.8/lib/python3.8/ctypes/__init__.py", line 7, in <module>
      from _ctypes import Union, Structure, Array
  ModuleNotFoundError: No module named '_ctypes'
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for avro

  File "/usr/lib/python3.8/lib/python3.8/ctypes/__init__.py", line 7, in <module>
      from _ctypes import Union, Structure, Array
  ModuleNotFoundError: No module named '_ctypes'
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for click-default-group

    sudo yum install libffi-devel

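5. Continuing the DataHub installation hit the problem below, which was resolved by installing bzip2-devel and rebuilding Python (without a rebuild the error persists).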

  File "/usr/lib/python3.8/lib/python3.8/bz2.py", line 19, in <module>
    from _bz2 import BZ2Compressor, BZ2Decompressor
ModuleNotFoundError: No module named '_bz2'

    sudo yum install bzip2-devel

    sudo make && sudo make install

Only after resolving all of the problems above was DataHub finally installed.
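Each of the four build problems above came down to a standard-library extension module that was silently skipped because its -devel package was missing at compile time. A minimal sketch of a check that could be run right after make install, before touching pip, is shown here. The function name and the module-to-package mapping are my own illustration, assembled from the errors in this post, not something DataHub or CPython provides.

```python
import importlib

# Extension modules that went missing in this walkthrough, mapped to the
# yum package whose absence caused the build to skip them (assumed mapping,
# taken from the fixes described in this post).
OPTIONAL_MODULES = {
    "zlib": "zlib-devel",
    "ssl": "openssl-devel",
    "_ctypes": "libffi-devel",
    "_bz2": "bzip2-devel",
}


def missing_build_modules(modules=OPTIONAL_MODULES):
    """Return {module: package} for every module that fails to import."""
    missing = {}
    for name, package in modules.items():
        try:
            importlib.import_module(name)
        except ImportError:
            missing[name] = package
    return missing


if __name__ == "__main__":
    missing = missing_build_modules()
    for name, package in missing.items():
        print(f"{name} is missing: install {package}, then rebuild Python")
    if not missing:
        print("all checked extension modules import correctly")
```

Run it with the freshly built interpreter (the one the python3 symlink points to). Anything it reports needs the package installed and then a full rebuild starting from ./configure.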

Starting DataHub

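6. If the Docker service is not running, you may hit the problem below. If starting the service fails with this error, try prefixing the start command with sudo.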

The name org.freedesktop.PolicyKit1 was not provided by any .service files See system logs and 'systemctl status docker.service' for details.

7. Docker then reported the following problem.

ERROR: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info": dial unix /var/run/docker.sock: connect: permission denied

This is handled by adding the user to the docker group.

Try adding your user to the docker group:

    Run usermod -aG docker "${USER}", then
    either log out and log back in, or run newgrp docker.
    After this you have to restart your docker daemon: sudo service docker restart.
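To confirm that the group change has actually taken effect for the current session, it can help to test access to the socket directly. The snippet below is only a sketch under the assumption that the daemon listens on the path shown in the error message; the function name is hypothetical.

```python
import os

# Socket path copied from the "permission denied" error above.
DOCKER_SOCKET = "/var/run/docker.sock"


def can_use_socket(path=DOCKER_SOCKET):
    """Return True if the current process may read and write the given path."""
    return os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)


if __name__ == "__main__":
    if can_use_socket():
        print("Docker socket is accessible without sudo")
    else:
        print("No access yet: check the docker group, then log out and back in")
```

Because group membership is read when a session starts, this keeps reporting no access until you log in again or run newgrp docker.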

By this point nearly every pitfall had been hit. Running the command below to start DataHub produced the following message. GAME OVER!

python3 -m datahub docker quickstart

Total Docker memory configured 0.96GB is below the minimum threshold 3.8GB. You can increase the memory allocated to Docker in the Docker settings.
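In hindsight this is the check I should have run first. The sketch below reads the host's total memory from /proc/meminfo and compares it with the 3.8GB threshold quoted in the message. It assumes a Linux host where Docker can use all of the machine's memory, so host memory is only a proxy for what the quickstart measures; the function names are my own.

```python
# Threshold quoted in the quickstart message above.
MIN_DOCKER_MEMORY_GB = 3.8


def total_memory_gb(meminfo_text):
    """Parse the MemTotal line of /proc/meminfo (given in kB) into GB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kilobytes = int(line.split()[1])
            return kilobytes / (1024 ** 2)
    raise ValueError("MemTotal not found")


def enough_memory(meminfo_text, minimum_gb=MIN_DOCKER_MEMORY_GB):
    """Return True if total memory meets the minimum threshold."""
    return total_memory_gb(meminfo_text) >= minimum_gb


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        text = handle.read()
    print(f"total memory: {total_memory_gb(text):.2f} GB")
    print("enough for quickstart" if enough_memory(text) else "below threshold")
```

On a free-tier instance with about 1GB of RAM this reports below threshold, which would have saved all of the compiling above.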

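Yes, this was an experiment that failed from start to finish. Still, apart from the hardware falling short at the very end, I believe it covers most of the main installation pitfalls.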

Adding China-based mirrors for Anaconda and pip to speed up installation

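When installing Python packages with Anaconda or pip, we often run into errors such as Timeout, or PackagesNotFoundError: The following packages are not available from current channels. These usually come from connection failures, for reasons best left unsaid, that keep the packages we need from installing.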

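We can configure Anaconda's channels, or temporarily specify an index when installing a package with pip. Adding mirrors hosted in China speeds up installation and widens the set of places searched for the packages we need.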

May the doge protect us.

1. Add the Tsinghua mirror

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/

conda config --set show_channel_urls yes

2. Add the Douban mirror

conda config --add channels https://pypi.douban.com/anaconda/cloud/conda-forge/
conda config --add channels https://pypi.douban.com/anaconda/cloud/msys2/
conda config --add channels https://pypi.douban.com/anaconda/cloud/bioconda/
conda config --add channels https://pypi.douban.com/anaconda/cloud/menpo/
conda config --add channels https://pypi.douban.com/anaconda/cloud/pytorch/

conda config --set show_channel_urls yes

3. Remove the mirrors

conda config --remove-key channels

4. Install with pip from a specific index

You can temporarily specify the index to install from.

pip install -i https://pypi.douban.com/simple tensorflow-gpu==1.14
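On a machine with several Python versions, as in the first half of this post, a bare pip may belong to a different interpreter than the one you intend. As a small sketch, the same command can be built around the interpreter that is currently running; the helper name is hypothetical, and -i is pip's short form of --index-url.

```python
import sys


def pip_install_command(package, index_url=None):
    """Build a pip install command bound to the running interpreter."""
    command = [sys.executable, "-m", "pip", "install"]
    if index_url:
        # -i overrides the default index for this one invocation only.
        command += ["-i", index_url]
    command.append(package)
    return command


if __name__ == "__main__":
    command = pip_install_command(
        "tensorflow-gpu==1.14", "https://pypi.douban.com/simple"
    )
    # Printed rather than executed; pass the list to subprocess.run to run it.
    print(" ".join(command))
```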