Date: 2025-07-25  Author: 游乐小编
As artificial intelligence research keeps advancing, a very large number of papers are made public every week. Classifying them has become a practical problem that researchers and research institutions face every day, and participants are asked to build a paper classification model for this task.
In this competition, participants use the information provided for each paper (paper id, title, and abstract) to predict its specific category.
Sample record (fields separated by \t):
paperid: 9821
title: Calculation of prompt diphoton production cross sections at Tevatron and LHC energies
abstract: A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy.
categories: hep-ph
The training and test data are provided as CSV files (a quick loading sketch follows the description):
Training set: 50,000 papers, each with four fields: paper id, title, abstract, and category.
Test set: 10,000 papers, each with paper id, title, and abstract, but without the category field.
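As a quick orientation, the two files can be loaded with pandas and their shapes and columns checked. This is only a sketch; it assumes the files are named train.csv and test.csv with tab-separated fields, as in the code later in this post.

import pandas as pd

# Sketch: verify the shapes and columns described above
train = pd.read_csv('train.csv', sep='\t')
test = pd.read_csv('test.csv', sep='\t')
print(train.shape, list(train.columns))  # expected: (50000, 4) ['paperid', 'title', 'abstract', 'categories']
print(test.shape, list(test.columns))    # expected: (10000, 3) ['paperid', 'title', 'abstract']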
The competition is scored with the accuracy metric; the maximum score is 1.
The computation follows https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html; reference evaluation code:
from sklearn.metrics import accuracy_score

y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)  # 0.5: two of the four predictions match the ground truth
# Install the paddle-ernie package; output is redirected to log.log
!pip install paddle-ernie > log.log
import numpy as np
import paddle as P
# Import the ERNIE model
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel

model = ErnieModel.from_pretrained('ernie-1.0')  # try to get the pretrained model from the server; make sure you have a network connection
model.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')

ids, _ = tokenizer.encode('hello world')
ids = P.to_tensor(np.expand_dims(ids, 0))  # insert extra `batch` dimension
pooled, encoded = model(ids)  # eager execution
print(pooled.numpy())  # pooled sentence representation, shape (1, hidden_size)
import sys
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score

import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModelForSequenceClassification
# Load the training data and use title + abstract as the model input text
train_df = pd.read_csv('train.csv', sep='\t')
train_df['title'] = train_df['title'] + ' ' + train_df['abstract']
train_df = train_df.sample(frac=1.0)  # shuffle the rows
train_df.head()
train_df.shape
train_df['categories'].nunique()  # number of distinct categories (39, matching num_labels below)
# Encode the string categories as integer codes; lbl_list maps each code back to its category name
train_df['categories'], lbl_list = pd.factorize(train_df['categories'])
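pd.factorize returns integer codes together with an index (lbl_list here) that maps each code back to the original category string; the submission step at the end relies on exactly this reverse lookup. A minimal sketch with made-up category strings, not competition data:

import pandas as pd

# Sketch with toy labels: the uniques index inverts the integer codes
codes, uniques = pd.factorize(pd.Series(['hep-ph', 'cs.CL', 'hep-ph']))
print(codes)           # [0 1 0]
print(uniques[codes])  # Index(['hep-ph', 'cs.CL', 'hep-ph'], dtype='object') -- original names recovered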
# Model hyperparameters
BATCH = 32
MAX_SEQLEN = 300
LR = 5e-5
EPOCH = 10

# Define the ERNIE classification model
ernie = ErnieModelForSequenceClassification.from_pretrained('ernie-2.0-en', num_labels=39)
optimizer = P.optimizer.Adam(LR, parameters=ernie.parameters())
tokenizer = ErnieTokenizer.from_pretrained('ernie-2.0-en')
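num_labels is hard-coded to 39 here; a one-line check (just a sketch) guards against a mismatch with the factorized label list built above:

# Sketch: the classifier head size must equal the number of distinct categories
assert len(lbl_list) == 39, f'expected 39 categories, got {len(lbl_list)}'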
train_df.iterrows()  # quick look at the row iterator that make_data uses below
# Convert a DataFrame into model inputs: encode the text, then truncate/pad to MAX_SEQLEN
def make_data(df):
    data = []
    for i, row in enumerate(df.iterrows()):
        text, label = row[1].title, row[1].categories
        text_id, _ = tokenizer.encode(text)  # ErnieTokenizer automatically adds the special tokens ERNIE needs, e.g. [CLS], [SEP]
        text_id = text_id[:MAX_SEQLEN]
        text_id = np.pad(text_id, [0, MAX_SEQLEN - len(text_id)], mode='constant')
        data.append((text_id, label))
    return data

train_data = make_data(train_df.iloc[:-5000])  # hold out the last 5,000 shuffled rows for validation
val_data = make_data(train_df.iloc[-5000:])
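A quick check (a sketch, run after make_data above) that every encoded sample has been truncated or padded to exactly MAX_SEQLEN token ids, with the integer label attached:

# Sketch: inspect one encoded training sample
sample_ids, sample_label = train_data[0]
print(sample_ids.shape)  # (300,) -- fixed length MAX_SEQLEN after truncation/padding
print(sample_label)      # integer category code from pd.factorize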
# Fetch the i-th batch from a list of (feature, label) pairs
def get_batch_data(data, i):
    d = data[i * BATCH: (i + 1) * BATCH]
    feature, label = zip(*d)
    feature = np.stack(feature)  # stack BATCH samples into a single numpy array
    label = np.stack(list(label))
    feature = P.to_tensor(feature)  # convert the numpy arrays to paddle tensors
    label = P.to_tensor(label)
    return feature, label
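Fetching the first batch with the helper above (again, only a sketch) shows the tensor shapes the model receives:

# Sketch: shapes of one training batch
feature, label = get_batch_data(train_data, 0)
print(feature.shape)  # [32, 300]: BATCH sequences of MAX_SEQLEN token ids
print(label.shape)    # [32]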
EPOCH = 1  # train a single epoch for this demo (overrides the EPOCH = 10 set above)
# Model training
for i in range(EPOCH):
    np.random.shuffle(train_data)  # shuffle the data every epoch for better training
    ernie.train()
    for j in range(len(train_data) // BATCH):
        feature, label = get_batch_data(train_data, j)
        loss, _ = ernie(feature, labels=label)
        loss.backward()
        optimizer.minimize(loss)
        ernie.clear_gradients()
        if j % 50 == 0:
            print('Train %d: loss %.5f' % (j, loss.numpy()))
        # Validation
        if j % 100 == 0:
            all_pred, all_label = [], []
            with P.no_grad():
                ernie.eval()
                for k in range(len(val_data) // BATCH):  # use k here so the outer loop counter j is not overwritten
                    feature, label = get_batch_data(val_data, k)
                    loss, logits = ernie(feature, labels=label)
                    all_pred.extend(logits.argmax(-1).numpy())
                    all_label.extend(label.numpy())
                ernie.train()
            acc = (np.array(all_label) == np.array(all_pred)).astype(np.float32).mean()
            print('Val acc %.5f' % acc)
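The notebook never saves the fine-tuned weights; if you want to keep them for later inference, a minimal sketch (assuming the Paddle 2.x state_dict API and a hypothetical file name ernie_cls.pdparams) is:

# Sketch: persist and restore the fine-tuned classifier (Paddle 2.x state_dict API assumed)
P.save(ernie.state_dict(), 'ernie_cls.pdparams')     # 'ernie_cls.pdparams' is a hypothetical path
ernie.set_state_dict(P.load('ernie_cls.pdparams'))   # reload before prediction if needed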
# Load the test set and build the same title + abstract input text;
# a placeholder label of 0 is added so make_data can keep its (text, label) format
test_df = pd.read_csv('test.csv', sep='\t')
test_df['title'] = test_df['title'] + ' ' + test_df['abstract']
test_df['categories'] = 0
test_data = make_data(test_df.iloc[:])
all_pred, all_label = [], []
# Model prediction; the extra +1 batch covers the final partial batch of the test set
with P.no_grad():
    ernie.eval()
    for j in range(len(test_data) // BATCH + 1):
        feature, label = get_batch_data(test_data, j)
        loss, logits = ernie(feature, labels=label)
        all_pred.extend(logits.argmax(-1).numpy())
        all_label.extend(label.numpy())
# Map the predicted integer codes back to category names and write the submission file
pd.DataFrame({
    'paperid': test_df['paperid'],
    'categories': lbl_list[all_pred]
}).to_csv('submit.csv', index=False)
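Before uploading, a quick sanity check of the submission file (a sketch; the expected row count comes from the data description above):

# Sketch: one row per test paper, and only categories seen in training
submit = pd.read_csv('submit.csv')
print(submit.shape)                    # expected: (10000, 2)
print(submit['categories'].nunique())  # should be at most 39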