Date: 2025-07-25  Author: 游乐小编
As artificial intelligence research keeps advancing, a very large number of papers are made public every week. Classifying them has become a practical problem that researchers and research institutions face every day, and participants are asked to build a paper classification model for this task.
In this competition, participants use the information provided for each paper (paper id, title, and abstract) to predict its specific category.
Sample record (fields separated by \t):
paperid: 9821
title: Calculation of prompt diphoton production cross sections at Tevatron and LHC energies
abstract: A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy.
categories: hep-ph
The training and test data are provided as CSV files (a quick loading sketch follows the description):
Training set: 50,000 papers, each with four fields: paper id, title, abstract, and category.
Test set: 10,000 papers, each with paper id, title, and abstract, but without the category field.
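As a quick orientation, the two files can be loaded with pandas and their shapes and columns checked. This is only a sketch; it assumes the files are named train.csv and test.csv with tab-separated fields, as in the code later in this post.

import pandas as pd

# Sketch: verify the shapes and columns described above
train = pd.read_csv('train.csv', sep='\t')
test = pd.read_csv('test.csv', sep='\t')
print(train.shape, list(train.columns))  # expected: (50000, 4) ['paperid', 'title', 'abstract', 'categories']
print(test.shape, list(test.columns))    # expected: (10000, 3) ['paperid', 'title', 'abstract']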
The competition is scored with the accuracy metric; the maximum score is 1.
The computation follows https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html; reference evaluation code:
from sklearn.metrics import accuracy_score

y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)  # 0.5: two of the four predictions match the ground truth
# Install the paddle-ernie package; output is redirected to log.log
!pip install paddle-ernie > log.log
import numpy as np
import paddle as P
# Import the ERNIE model
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel

model = ErnieModel.from_pretrained('ernie-1.0')  # try to get the pretrained model from the server; make sure you have a network connection
model.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')

ids, _ = tokenizer.encode('hello world')
ids = P.to_tensor(np.expand_dims(ids, 0))  # insert extra `batch` dimension
pooled, encoded = model(ids)  # eager execution
print(pooled.numpy())  # pooled sentence representation, shape (1, hidden_size)
import sys
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score

import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModelForSequenceClassification
# Load the training data and use title + abstract as the model input text
train_df = pd.read_csv('train.csv', sep='\t')
train_df['title'] = train_df['title'] + ' ' + train_df['abstract']
train_df = train_df.sample(frac=1.0)  # shuffle the rows
train_df.head()
train_df.shape
train_df['categories'].nunique()  # number of distinct categories (39, matching num_labels below)
# Encode the string categories as integer codes; lbl_list maps each code back to its category name
train_df['categories'], lbl_list = pd.factorize(train_df['categories'])
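pd.factorize returns integer codes together with an index (lbl_list here) that maps each code back to the original category string; the submission step at the end relies on exactly this reverse lookup. A minimal sketch with made-up category strings, not competition data:

import pandas as pd

# Sketch with toy labels: the uniques index inverts the integer codes
codes, uniques = pd.factorize(pd.Series(['hep-ph', 'cs.CL', 'hep-ph']))
print(codes)           # [0 1 0]
print(uniques[codes])  # Index(['hep-ph', 'cs.CL', 'hep-ph'], dtype='object') -- original names recovered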
# Model hyperparameters
BATCH = 32
MAX_SEQLEN = 300
LR = 5e-5
EPOCH = 10

# Define the ERNIE classification model
ernie = ErnieModelForSequenceClassification.from_pretrained('ernie-2.0-en', num_labels=39)
optimizer = P.optimizer.Adam(LR, parameters=ernie.parameters())
tokenizer = ErnieTokenizer.from_pretrained('ernie-2.0-en')
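num_labels is hard-coded to 39 here; a one-line check (just a sketch) guards against a mismatch with the factorized label list built above:

# Sketch: the classifier head size must equal the number of distinct categories
assert len(lbl_list) == 39, f'expected 39 categories, got {len(lbl_list)}'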
train_df.iterrows()  # quick look at the row iterator that make_data uses below
# Convert a DataFrame into model inputs: encode the text, then truncate/pad to MAX_SEQLEN
def make_data(df):
    data = []
    for i, row in enumerate(df.iterrows()):
        text, label = row[1].title, row[1].categories
        text_id, _ = tokenizer.encode(text)  # ErnieTokenizer automatically adds the special tokens ERNIE needs, e.g. [CLS], [SEP]
        text_id = text_id[:MAX_SEQLEN]
        text_id = np.pad(text_id, [0, MAX_SEQLEN - len(text_id)], mode='constant')
        data.append((text_id, label))
    return data

train_data = make_data(train_df.iloc[:-5000])  # hold out the last 5,000 shuffled rows for validation
val_data = make_data(train_df.iloc[-5000:])
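A quick check (a sketch, run after make_data above) that every encoded sample has been truncated or padded to exactly MAX_SEQLEN token ids, with the integer label attached:

# Sketch: inspect one encoded training sample
sample_ids, sample_label = train_data[0]
print(sample_ids.shape)  # (300,) -- fixed length MAX_SEQLEN after truncation/padding
print(sample_label)      # integer category code from pd.factorize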
# Fetch the i-th batch from a list of (feature, label) pairs
def get_batch_data(data, i):
    d = data[i * BATCH: (i + 1) * BATCH]
    feature, label = zip(*d)
    feature = np.stack(feature)  # stack BATCH samples into a single numpy array
    label = np.stack(list(label))
    feature = P.to_tensor(feature)  # convert the numpy arrays to paddle tensors
    label = P.to_tensor(label)
    return feature, label
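Fetching the first batch with the helper above (again, only a sketch) shows the tensor shapes the model receives:

# Sketch: shapes of one training batch
feature, label = get_batch_data(train_data, 0)
print(feature.shape)  # [32, 300]: BATCH sequences of MAX_SEQLEN token ids
print(label.shape)    # [32]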
EPOCH = 1  # train a single epoch for this demo (overrides the EPOCH = 10 set above)
# Model training
for i in range(EPOCH):
    np.random.shuffle(train_data)  # shuffle the data every epoch for better training
    ernie.train()
    for j in range(len(train_data) // BATCH):
        feature, label = get_batch_data(train_data, j)
        loss, _ = ernie(feature, labels=label)
        loss.backward()
        optimizer.minimize(loss)
        ernie.clear_gradients()
        if j % 50 == 0:
            print('Train %d: loss %.5f' % (j, loss.numpy()))
        # Validation
        if j % 100 == 0:
            all_pred, all_label = [], []
            with P.no_grad():
                ernie.eval()
                for k in range(len(val_data) // BATCH):  # use k here so the outer loop counter j is not overwritten
                    feature, label = get_batch_data(val_data, k)
                    loss, logits = ernie(feature, labels=label)
                    all_pred.extend(logits.argmax(-1).numpy())
                    all_label.extend(label.numpy())
                ernie.train()
            acc = (np.array(all_label) == np.array(all_pred)).astype(np.float32).mean()
            print('Val acc %.5f' % acc)
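The notebook never saves the fine-tuned weights; if you want to keep them for later inference, a minimal sketch (assuming the Paddle 2.x state_dict API and a hypothetical file name ernie_cls.pdparams) is:

# Sketch: persist and restore the fine-tuned classifier (Paddle 2.x state_dict API assumed)
P.save(ernie.state_dict(), 'ernie_cls.pdparams')     # 'ernie_cls.pdparams' is a hypothetical path
ernie.set_state_dict(P.load('ernie_cls.pdparams'))   # reload before prediction if needed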
# Load the test set and build the same title + abstract input text;
# a placeholder label of 0 is added so make_data can keep its (text, label) format
test_df = pd.read_csv('test.csv', sep='\t')
test_df['title'] = test_df['title'] + ' ' + test_df['abstract']
test_df['categories'] = 0
test_data = make_data(test_df.iloc[:])
all_pred, all_label = [], []
# Model prediction; the extra +1 batch covers the final partial batch of the test set
with P.no_grad():
    ernie.eval()
    for j in range(len(test_data) // BATCH + 1):
        feature, label = get_batch_data(test_data, j)
        loss, logits = ernie(feature, labels=label)
        all_pred.extend(logits.argmax(-1).numpy())
        all_label.extend(label.numpy())
# Map the predicted integer codes back to category names and write the submission file
pd.DataFrame({
    'paperid': test_df['paperid'],
    'categories': lbl_list[all_pred]
}).to_csv('submit.csv', index=False)
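Before uploading, a quick sanity check of the submission file (a sketch; the expected row count comes from the data description above):

# Sketch: one row per test paper, and only categories seen in training
submit = pd.read_csv('submit.csv')
print(submit.shape)                    # expected: (10000, 2)
print(submit['categories'].nunique())  # should be at most 39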